How to get the index of any character in a Unicode string - C#

I have a string variable that holds the Chinese translation of an English message.
String temp = "'％1'不能输入步骤'％2'";
But when I check whether the string contains %1 using the IndexOf function
if(temp.IndexOf("%1") != -1)
{
}
the condition is never true, even though the string appears to contain %1.
Is this caused by the Chinese characters, or by something else?
Please suggest how I can get the index of a character in this case.

That is because ％1 is not equal to %1: the string contains the full-width percent sign (U+FF05), not the ASCII percent sign (U+0025). As a workaround, you can take the symbols out of the string you already have:
var s = "'％1'不能输入步骤'％2'";
var firstFragment = s.Substring(1, 2); // this selects ％1
and then do
if (s.IndexOf(firstFragment) != -1)
{
}

Comments gave the answer. Use the same percent character, so instead of:
"%1"
use:
"％1"
Or, if you find that problematic (your source code is in a "poor" code page, or you fear the code is hard to read when it contains full-width characters that resemble ASCII characters), use:
"\uFF051"
or even:
"\uFF05" + "1"
(the concatenation is done by the C# compiler, so no extra concatenation happens at run time).
Another approach might be Unicode normalization:
temp = temp.Normalize(NormalizationForm.FormKC);
which seems to map the full-width percent sign to the usual ASCII percent sign. I am not sure whether that behavior is guaranteed, but see the Decomposition field on Unicode Character 'FULLWIDTH PERCENT SIGN' (U+FF05).
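To make the difference between the two percent signs visible, here is a minimal sketch. The class and method names (PercentSignDemo, IndexOfFullWidth, IndexOfAfterNormalization) are made up for illustration; only IndexOf and Normalize are framework calls. It searches once with the escaped full-width character and once after compatibility normalization.

```csharp
using System;
using System.Text;

public static class PercentSignDemo
{
    // The sample string from the question, with U+FF05 written as an escape
    // so the full-width percent sign cannot be confused with ASCII '%'.
    public const string Sample = "'\uFF051'不能输入步骤'\uFF052'";

    // Search for the full-width percent sign followed by the given digits.
    public static int IndexOfFullWidth(string text, string digits)
    {
        return text.IndexOf("\uFF05" + digits, StringComparison.Ordinal);
    }

    // Normalize to form KC first, then search for the ASCII placeholder.
    // The index refers to the normalized string, which in general may
    // differ in length from the original.
    public static int IndexOfAfterNormalization(string text, string placeholder)
    {
        string normalized = text.Normalize(NormalizationForm.FormKC);
        return normalized.IndexOf(placeholder, StringComparison.Ordinal);
    }
}
```

For the sample string, PercentSignDemo.IndexOfFullWidth(PercentSignDemo.Sample, "1") finds the placeholder, while a plain ordinal search for "%1" does not.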

Related

String comparison returns False for same strings [duplicate]

I am parsing emails using a regex in a c# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex in regexbuddy, the regex correctly matches the text). If I look at the email in gmail, I see
=E2=80=8B
at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. It seems to be the only such sequence showing up.
What is the easiest way to get rid of this exact sequence? I cannot do the obvious
MailItem.Body.Replace("=E2=80=8B", "")
because those characters don't show up in the c# string.
I also tried
byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);
But the zero-width spaces just show up as ?. I suppose I could go through the bytes array and remove the bytes comprising the zero width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).

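As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B, so the following might do the trick. Strings are immutable, so keep the returned value:
string cleaned = MailItem.Body.Replace("\u200B", "");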
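As a quick check of how =E2=80=8B relates to U+200B, here is a small sketch. The class name ZeroWidthSpace and its members are hypothetical; the MailItem object from the question is not needed to try it.

```csharp
using System;
using System.Text;

public static class ZeroWidthSpace
{
    // ZERO WIDTH SPACE as a single UTF-16 char
    public const char Zwsp = '\u200B';

    // The UTF-8 encoding of U+200B, i.e. the bytes behind =E2=80=8B
    public static byte[] Utf8Bytes()
    {
        return Encoding.UTF8.GetBytes(Zwsp.ToString());
    }

    // Returns a copy of the text with every zero-width space removed
    public static string Strip(string text)
    {
        return text.Replace(Zwsp.ToString(), string.Empty);
    }
}
```

BitConverter.ToString(ZeroWidthSpace.Utf8Bytes()) shows the three bytes, and ZeroWidthSpace.Strip can be applied to the mail body before running the regex.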

As all the Regex.Replace() methods operate on strings, that's not going to be useful here.
The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:
string a = MailItem.Body;
StringBuilder newText = new StringBuilder();
for (int i = 0; i < a.Length; i++)
{
    // Copy every character except the zero-width space (U+200B)
    if (a[i] != '\u200b')
    {
        newText.Append(a[i]);
    }
}
string cleaned = newText.ToString();

Use System.Web.HttpUtility.HtmlDecode(string);
Quite simple.

Other language word/string split/manipulation issue

Scenario: highlighting the searched text.
Example: if I have the word RK and 'r' is searched, I have to highlight the first occurrence of 'r', i.e. the R in RK. Behind the scenes the markup is
<b>R</b>K.
Similarly, I have to highlight ம in மொ, so I find the position of ம in மொ and apply the highlighting.
After the manipulation I get
<b>ம</b>ொ, which is displayed as ம ொ
The code that I used for string manipulation and highlighting:
formattedString = string.Empty;
searchStringLength = searchString.Length;
formattedString += inputString.Substring(0, find);
formattedString += "<b>" + inputString.Substring(find, searchStringLength) + "</b>";
formattedString += inputString.Substring(find + searchStringLength);
The example is just for a Tamil word. Any suggestions to make it work for all languages other than English?

I do not know Tamil. Looking at your question, the input string should be a three-letter string.
You are probably setting your find variable with something like
find = inputString.IndexOf("ம");
somewhere in your code.
The Tamil word மொ is not counted as a three-letter word. Visual Studio displays it as a single letter, while "மொ".Length returns 2, and ToCharArray() also returns an array of two characters (ம followed by the vowel sign ொ). That is why IndexOf always returns 0.
Your comment on the question:
since ம + ொ = மொ, the ம find was returning true always. Now after this solution, ம find will return false and hence I don't have to highlight. Only if மொ is entered to find, it matches exactly and I can highlight.
I do not think the problem is in Substring. The IndexOf call needs to be handled tactically.
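One way to handle IndexOf "tactically" is to accept a match only when it starts and ends on a text-element boundary, which System.Globalization.StringInfo can report. The following is a sketch under that assumption, not the asker's final code; TextElementSearch and IndexOfWholeTextElements are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSearch
{
    // Returns the index of the first match that both starts and ends on a
    // text-element boundary, or -1 if there is no such match.
    public static int IndexOfWholeTextElements(string input, string search)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(search))
        {
            return -1;
        }

        // Start index of every text element, plus the end of the string
        var boundaries = new HashSet<int>(StringInfo.ParseCombiningCharacters(input));
        boundaries.Add(input.Length);

        int pos = input.IndexOf(search, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            if (boundaries.Contains(pos) && boundaries.Contains(pos + search.Length))
            {
                return pos;
            }
            // Keep looking after a match that would split a text element
            pos = input.IndexOf(search, pos + 1, StringComparison.OrdinalIgnoreCase);
        }
        return -1;
    }
}
```

With this, searching 'r' in RK still returns 0, searching ம alone in மொ returns -1 (so nothing is highlighted), and searching மொ returns 0. The result can be used as the find value in the Substring code from the question.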

0x202A in filename: Why?

I recently needed to do a isnull in SQL on a varbinary image.
So far so (ab)normal.
I very quickly wrote a C# program to read in the file no_image.png from my desktop, and output the bytes as hex string.
That program started like this:
byte[] ba = System.IO.File.ReadAllBytes(@"‪D:\UserName\Desktop\no_image.png");
Console.WriteLine(ba.Length);
// From here, change ba to hex string
And as I had used ReadAllBytes countless times before, I figured no big deal.
To my surprise, I got a NotSupportedException on ReadAllBytes.
I found that this happens when I right-click the file, go to the "Security" tab, and copy-paste the object name (start selecting at the right and drag imprecisely to the left).
It happens only on Windows 8.1 (and perhaps 8), but not on Windows 7.
When I output the string in question:
public static string ToHexString(string input)
{
string strRetVal = null;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (char c in input)
{
sb.Append(((int)c).ToString("X2"));
}
strRetVal = sb.ToString();
sb.Length = 0;
sb = null;
return strRetVal;
} // End Function ToHexString
string str = ToHexString(@"‪D:\UserName\Desktop\cookie.png");
string strRight = " (" + ToHexString(@"D:\UserName\Desktop\cookie.png") + ")"; // Correct value, for comparison
string msg = str + Environment.NewLine + " " + strRight;
Console.WriteLine(msg);
I get this:
202A443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67
(443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67)
First thing: when I look up 20 2A in ASCII, it's [space] + *.
Since I see neither a space nor a star, I googled 20 2A, and the first thing I got was paragraph 202a of the German penal code
http://dejure.org/gesetze/StGB/202a.html
But I suppose that is rather an unfortunate coincidence and it is actually the unicode control character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
http://www.fileformat.info/info/unicode/char/202a/index.htm
Is that a bug, or is that a feature ?
My guess is, it's a buggy feature.

The issue is that the string does not begin with a letter D at all - it just looks like it does.
It appears that the string is hard-coded in your source file.
If that's the case, then you have pasted the string from the security dialog. Unbeknownst to you, the string you pasted begins with the LRE character (LEFT-TO-RIGHT EMBEDDING, U+202A). This is an invisible character which takes no space, but tells the renderer to treat the text that follows as left-to-right.
You just need to delete the character.
To do this, position the cursor AFTER the D in the string. Use the Backspace or Delete to Left key <x] to delete the D. Use the key again to delete the invisible LRE character. One more time to delete the ". Now retype the " and the D.
A similar problem could occur wherever the string came from, e.g. user input, the command line, a script file, etc.
Note: The security dialog shows the filename beginning with the LRE character to ensure that characters are displayed in left-to-right order, which is necessary to ensure that the hierarchy is correctly understood when using RTL characters. E.g. a filename c:\folder\path\to\file in Arabic might be c:\folder\مسار/إلى/ملف. The "gotcha" is that the Arabic parts read in the other direction, so the word "path" (مسار according to Google Translate) is the rightmost word, making it appear as if it were the last element of the path, when in fact it is the element immediately after "c:\folder\".
Because security object paths have an hierarchy which is in conflict with the RTL text layout rules, the security dialog always displays RTL text in LTR mode. That means that the Arabic words will be mangled (letters in wrong order) on the security tab. (Imagine it as if it said "elif ot htap"). So the meaning is just about discernable, but from the point of view of security, the security semantics are preserved.
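If the path can arrive from pasted text or user input, one defensive option is to strip the Unicode bidirectional formatting characters before using it. This is only a sketch: the class name BidiControls is hypothetical, and the character list is the set of directional formatting characters I am aware of (U+061C, U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069).

```csharp
using System.Text;

public static class BidiControls
{
    // True for the invisible directional formatting characters
    public static bool IsBidiControl(char c)
    {
        return c == '\u061C'                      // ARABIC LETTER MARK
            || c == '\u200E' || c == '\u200F'     // LRM, RLM
            || (c >= '\u202A' && c <= '\u202E')   // LRE, RLE, PDF, LRO, RLO
            || (c >= '\u2066' && c <= '\u2069');  // LRI, RLI, FSI, PDI
    }

    // Returns a copy of the input without directional formatting characters
    public static string Strip(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (!IsBidiControl(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, BidiControls.Strip("\u202AD:\\UserName\\Desktop\\no_image.png") yields a path that starts with a real D.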

Filenames that contain RLO/LRO overrides are commonly created by malware. Eg. “exe” read backwards spells “malware”. You probably have an infected host, or the origin of the .png is infected.

This question bothered me a lot, how would it be possible that a deterministic function would give 2 different results for identical input? After some testing, it turns out that the answer is simple.
If you look through it in your debugger, you will see that the 'D' char in your @"‪D:\UserName\Desktop\cookie.png" (first use of the hex function) is NOT the same as in @"D:\UserName\Desktop\cookie.png" (second use).
You must have entered some other 'D'-like character, probably through an unwanted keyboard shortcut or by messing with your Visual Studio character encoding.
It looks exactly the same, but in reality it's not even a single char (try watching the c variable in your ToHexString function).
If you change it to the normal 'D' in your first example, it will work fine.

validate excel worksheet name

I'm getting the error below when setting the worksheet name dynamically. Does anyone have a regex to validate the name before setting it?
The name that you type does not exceed 31 characters. The name does
not contain any of the following characters: : \ / ? * [ or ]
You did not leave the name blank.

You can use the method to check if the sheet name is valid
private bool IsSheetNameValid(string sheetName)
{
if (string.IsNullOrEmpty(sheetName))
{
return false;
}
if (sheetName.Length > 31)
{
return false;
}
char[] invalidChars = new char[] {':', '\\', '/', '?', '*', '[', ']'};
if (invalidChars.Any(sheetName.Contains))
{
return false;
}
return true;
}

To validate the worksheet name against those invalid characters using Regex, you can use something like this:
string wsName = @"worksheetName"; // verbatim string, so special characters are taken literally
// The character class matches any of : \ / ? * [ ]; the null/empty check runs first so Regex never sees null
bool nameIsValid = !string.IsNullOrEmpty(wsName) && wsName.Length <= 31 && !Regex.IsMatch(wsName, @"[:\\/?*\[\]]");
This also includes a check to see if the worksheet name is null or empty, or if it is longer than 31 characters. Those two checks aren't done via Regex for the sake of simplicity and to avoid over-engineering the problem.

Let's match the start of the string, then between 1 and 31 things that aren't on the forbidden list (: \ / ? * [ ]), then the end of the string. Requiring at least one means we refuse empty strings:
^[^:\/\\\?\*\[\]]{1,31}$
There's at least one nuance that this regex will miss: it will accept a sequence of spaces, tabs and newlines, which will be a problem if that is considered to be blank (as it probably is).
If you take the length check out of the regex, then you can get the blankness check by doing something like:
^[^:\/\\\?\*\[\]]*[^\s:\/\\\?\*\[\]][^:\/\\\?\*\[\]]*$
How does that work? If we defined our class above as WORKSHEET, that would be:
^[^WORKSHEET]*[^\sWORKSHEET][^WORKSHEET]*$
So we match zero or more non-forbidden characters, then a character that is neither forbidden nor whitespace, then zero or more non-forbidden characters. The key is that we demand at least one non-whitespace character in the middle section.
But we've lost the length check. It's hard to do both the length check and the regex in one expression. In order to count, we have to phrase things in terms of matching n times, and the things being matched have to be known to be of length 1. But in order to allow whitespace to be placed freely - as long as it's not all whitespace - we need to have a part of the match that is not necessarily of length 1.
Well, that's not quite true. At this point this starts to become a really bad idea, but nevertheless: onwards, into the breach! (for educational purposes only)
Instead of using * for the possibly-blank sections, we can specify how many characters we expect in each, and list every way for the three sections to add up to at most 31. The first section can hold anywhere from 0 to 30 characters, so there are 31 alternatives; in each one the last section takes up to whatever room is left:
^[^WORKSHEET]{0}[^\sWORKSHEET][^WORKSHEET]{0,30}$
|^[^WORKSHEET]{1}[^\sWORKSHEET][^WORKSHEET]{0,29}$
|^[^WORKSHEET]{2}[^\sWORKSHEET][^WORKSHEET]{0,28}$
....
|^[^WORKSHEET]{30}[^\sWORKSHEET][^WORKSHEET]{0}$
Obviously, if this were a good idea, you'd write a program that generates that expression rather than specifying it all by hand (and getting something wrong). But I don't think I need to tell you it's not a good idea. It is, however, the only answer I have to your question.
While admittedly not actually answering your question, I think @HatSoft has the right approach, encoding the conditions directly and clearly. After all, I'm now satisfied that an answer to your question as asked is not actually a helpful thing.
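For what it's worth, .NET regular expressions support lookahead, which offers another way to combine the length check and the blankness check in a single pattern. The sketch below swaps in that technique; it is not the approach of any answer here. The class name SheetNameRegex is hypothetical, and the pattern covers only the three rules quoted in the question.

```csharp
using System.Text.RegularExpressions;

public static class SheetNameRegex
{
    // \A(?!\s*\z)   : reject input that is empty or whitespace only
    // [^:\\/?*\[\]] : any character except : \ / ? * [ ]
    // {1,31}\z      : between 1 and 31 of them, up to the very end of the input
    private static readonly Regex Pattern =
        new Regex(@"\A(?!\s*\z)[^:\\/?*\[\]]{1,31}\z");

    public static bool IsValid(string name)
    {
        return name != null && Pattern.IsMatch(name);
    }
}
```

SheetNameRegex.IsValid("Sheet1") passes, while an empty name, a name of only spaces, a name containing a colon, or a name of 32 characters is rejected. The explicit checks are still easier to read and maintain.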

You might want to do a check for the name History as this is a reserved sheet name in Excel.

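Something like this?
public string validate(string name)
{
    // Strip characters that are invalid in file names (on Windows this covers : \ / ? *)
    foreach (char c in Path.GetInvalidFileNameChars())
        name = name.Replace(c.ToString(), "");
    // [ and ] are valid in file names but not in sheet names
    name = name.Replace("[", "").Replace("]", "");
    if (name.Length > 31)
        name = name.Substring(0, 31);
    return name;
}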

Find and replace ASCII character with a new line

I am trying to find every occurrence of an ASCII character in a string and replace it with a new line. Here is what I have so far:
public string parseText(string inTxt)
{
//String builder based on the string passed into the method
StringBuilder n = new StringBuilder(inTxt);
//Convert the ASCII character we're looking for to a string
string replaceMe = char.ConvertFromUtf32(187);
//Replace all occurences of string with a new line
n.Replace(replaceMe, Environment.NewLine);
//Convert our StringBuilder to a string and output it
return n.ToString();
}
This does not add in a new line and the string all remains on one line. I’m not sure what the problem is here. I have tried this as well, but same result:
n.Replace(replaceMe, "\n");
Any suggestions?

char.ConvertFromUtf32, whilst correct, is not the simplest way to read a character based on its ASCII numeric value. (ConvertFromUtf32 is mainly intended for Unicode code points that lie outside the BMP, which result in surrogate pairs. This is not something you'd encounter in English or most modern languages.) Rather, you should just cast it using (char).
char c = (char)187;
string replaceMe = c.ToString();
You may, of course, define a string with the required character as a literal in your code: "»".
Your Replace would then be simplified to:
n.Replace("»", "\n");
Finally, on a technical level, ASCII only covers characters whose value lies in the 0–127 range. Character 187 is not ASCII; however, it corresponds to » in ISO 8859-1, Windows-1252, and Unicode, which collectively are by far the most popular encodings in use today.
Edit: I just tested your original code, and found that it actually worked. Are you sure the result remains on one line? It might be an issue with the way the debugger renders strings in single-line view.
Note that the \r\n sequences shown there actually do represent newlines, despite being displayed as escape sequences. You can check this in the multi-line display (by clicking on the magnifying glass).
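A small sketch to check this outside the debugger; the class name GuillemetReplace and its method are hypothetical. It casts 187 to char and replaces every occurrence with Environment.NewLine.

```csharp
using System;
using System.Text;

public static class GuillemetReplace
{
    // Replaces every » (U+00BB, decimal 187) with the platform newline
    public static string ParseText(string inTxt)
    {
        char marker = (char)187;
        var builder = new StringBuilder(inTxt);
        // Replace changes the builder in place
        builder.Replace(marker.ToString(), Environment.NewLine);
        return builder.ToString();
    }
}
```

Splitting the result of GuillemetReplace.ParseText("first»second") on Environment.NewLine gives two parts, which confirms the newline is really there even if a single-line view shows it as an escape sequence.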

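StringBuilder.Replace modifies the builder in place and returns a reference to that same instance (not a new StringBuilder), so the original code should already work. The return value is mainly useful for chaining calls:
StringBuilder replaced = n.Replace(replaceMe, Environment.NewLine);
return replaced.ToString();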
