Other language word/string split/manipulation issue - c#

Scenario: highlighting the searched text.
Example: if I have the word RK and 'r' is searched, I have to highlight the first occurrence of 'r', i.e. the R in RK. In the background, the markup looks like
< b >R< /b >K.
Similarly, I have to highlight ம in மொ, so I find the position of ம in மொ and perform the highlighting operation.
After the manipulation, the text I get is
< b >ம< /b >ொ, and hence it is displayed as ம ொ
The code that I used for string manipulation and highlighting:
formattedString = string.Empty;
searchStringLength = searchString.Length;
formattedString += inputString.Substring(0, find);
formattedString += "<b>" + inputString.Substring(find, searchStringLength) + "</b>";
formattedString += inputString.Substring(find + searchStringLength);
The example is just for a Tamil word. Any suggestions to make it work for all languages other than English?

I do not know Tamil. Looking at your question, the input string looks like it should be a three-letter string.
You are probably setting your find variable with something like
find = inputString.IndexOf("ம");
somewhere in your code.
The Tamil word மொ is not counted as a three-letter word. Visual Studio handles it as a single letter, while "மொ".Length returns 2, and ToCharArray() also returns an array of two characters: the letter ம (U+0BAE) followed by the vowel sign ொ (U+0BCA). Because ம is the first of those two characters, IndexOf always returns 0.
Your comment on the question:
since ம + ொ = மொ, the ம find was returning true always. Now after this solution, ம find will return false and hence I don't have to highlight. Only if மொ is entered to find, it matches exactly and I can highlight.
I do not think the problem is in Substring. The IndexOf call needs to be handled carefully.
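One way to handle IndexOf "carefully" is to accept a match only when it starts and ends on a text-element boundary, so that a base letter is never split from its vowel sign. The following is a minimal sketch, not the asker's actual code: the class and method names (GraphemeHighlighter, HighlightFirst) are hypothetical, and it assumes that StringInfo.ParseCombiningCharacters groups ம and ொ into one text element, which is how .NET treats a base character followed by a spacing combining mark.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: names are illustrative, not from the original post.
public static class GraphemeHighlighter
{
    // Wraps the first match of 'search' in <b>...</b>, but only if the match
    // starts and ends on a text-element boundary of 'input'.
    public static string HighlightFirst(string input, string search)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(search))
            return input;

        // Start index (in UTF-16 code units) of every text element in the input.
        int[] starts = StringInfo.ParseCombiningCharacters(input);

        int from = 0;
        while (from <= input.Length - search.Length)
        {
            int find = input.IndexOf(search, from, StringComparison.OrdinalIgnoreCase);
            if (find < 0)
                break;

            int end = find + search.Length;
            bool startsOnBoundary = Array.IndexOf(starts, find) >= 0;
            bool endsOnBoundary = end == input.Length || Array.IndexOf(starts, end) >= 0;

            if (startsOnBoundary && endsOnBoundary)
            {
                return input.Substring(0, find)
                     + "<b>" + input.Substring(find, search.Length) + "</b>"
                     + input.Substring(end);
            }

            // The match cut through a text element; keep looking further right.
            from = find + 1;
        }
        return input; // no acceptable match, nothing highlighted
    }
}
```

With this sketch, HighlightFirst("RK", "r") wraps the R, searching மொ for ம leaves the text unchanged, and searching மொ for மொ wraps the whole syllable. This mirrors the behaviour described in the asker's comment.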

Related

Pig Latin Translator. C# Homework

Create a GUI application called IGPAY that lets a user enter a word. Then, when the user clicks a button, your program will generate and display the Pig Latin equivalent of that word. (To do this you remove the first letter of the word, then add that letter to the end of the word plus the letters "ay." For example, fish would become ishfay and ball would become allbay.) Make sure the GUI is attractive in appearance and that all labels, textboxes, buttons, and similar are clearly labeled. Hint: store the word in a string and consider using the Substring method. Also remember that the Length property of a string will tell you its length. See pages 79-80 of the text for examples.
Here is the code I came up with. I am new to this language and have a little knowledge of Python, but I don't understand why it throws an "Out of range exception" error. I am trying to make the code accept any word and display it in Pig Latin.
private void button1_Click(object sender, EventArgs e)
{
string word;
string first;
string rest;
string full;
word = textBox1.Text;
first = word.Substring(0);
rest = word.Substring(1, word.Length);
full = rest + first + "ay";
label2.Text = full;
}

Not sure why you got a downvote; good job declaring that this is homework and providing the work you have done. Aside from that, everybody here should appreciate that you want to learn.
There are some things I would consider if I were you. In addition to the answer from Phylyp, which should fix your exception, you should also handle the case where the user enters fewer than two characters, which can also cause an exception.
Below I show an example of two things you can do. I am not saying it is the best way, but it is an option.
Check the entered string to make sure it is at least two characters long and, if not, prompt the user.
Cut down on the number of string variables and declaration lines by using one string.Format call.
private void Button1_Click(object sender, EventArgs e)
{
const string suffix = "ay";
string enteredString = textBox1.Text;
//Check the length to make sure it is at least 2
if(enteredString.Length < 2)
{
MessageBox.Show("Please enter at least 2 characters");
return;
}
//We get here if 2 or more characters were entered.
//Let's go ahead and process our string
label2.Text = string.Format("{0}{1}{2}",
enteredString.Substring(1),
enteredString.Substring(0,1),
suffix);
}
EDIT - I have edited the answer above to do what you actually want. The first call, Substring(1), returns the string starting at position 1 (the second letter), which is everything after the first letter. The second call, Substring(0, 1), returns the string starting at position 0 with a length of 1 character, which is just the first letter. Couple that with checking that at least two characters are entered, and you shouldn't get any exceptions.
Hope this helps!

first = word.Substring(0, 1);
rest = word.Substring(1, word.Length - 1);
The second argument of Substring() is a length, not an end index. Starting at index 1, a length of word.Length runs one character past the end of the string; changing the second argument to word.Length - 1 avoids the error.
Also, since you need just the first character, the first call to Substring() should have a second argument of 1.
It is good that you've called this out as homework and have given your attempted code.
(Edited to incorporate Rob's very correct comment about .Substring(1))
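The Substring rules discussed in this thread can be sketched as a small function that is independent of the form. The class and method names (PigLatin, Translate) are hypothetical, and returning empty input unchanged is an assumption of this sketch rather than part of the assignment.

```csharp
using System;

// Hypothetical helper: names are illustrative, not from the assignment.
public static class PigLatin
{
    // Moves the first letter to the end of the word and appends "ay".
    public static string Translate(string word)
    {
        // Assumption: null or empty input is returned unchanged.
        if (string.IsNullOrEmpty(word))
            return word;

        string first = word.Substring(0, 1); // start at index 0, length 1
        string rest = word.Substring(1);     // from index 1 to the end of the string
        return rest + first + "ay";
    }
}
```

Translate("fish") gives ishfay and Translate("ball") gives allbay, matching the examples in the assignment. The button handler would then only need to pass textBox1.Text to the function and assign the result to label2.Text.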

How to get index of any character in unicode string

I have a string variable that holds the Chinese translation of an English message.
String temp = "'％1'不能输入步骤'％2'";
But when I want to know whether the string contains %1 by using the IndexOf function
if(temp.IndexOf("%1") != -1)
{
}
the condition is not true even though the string contains %1.
Is there an issue due to the Chinese characters, or something else?
Please suggest how I can get the index of any character in the above case.

That is because ％1 (with the full-width percent sign) is not equal to %1 (with the ASCII percent sign). As a workaround, you can take the symbols out of the string you have, like
var s = "'％1'不能输入步骤'％2'";
var firstFragment = s.Substring(1, 2); // this selects ％1
and then do
if(temp.IndexOf(firstFragment) != -1){
}

Comments gave the answer. Use the same percent character, so instead of:
"%1"
use:
"％1"
Or, if you find that problematic (your source code is in a "poor" code page, or you fear the code is hard to read when it contains full-width characters that resemble ASCII characters), use:
"\uFF051"
or even:
"\uFF05" + "1"
(the concatenation is done by the C# compiler, so no extra concatenation happens at run time).
Another approach might be Unicode normalization:
temp = temp.Normalize(NormalizationForm.FormKC);
which seems to map the "exotic" percent character to the usual ASCII percent character. I am not sure whether that behavior is guaranteed, but see the Decomposition field of Unicode Character 'FULLWIDTH PERCENT SIGN' (U+FF05).
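The normalization approach can be sketched as follows. The class and method names (FullWidthSearch, IndexOfNormalized) are hypothetical. The sketch assumes it is acceptable for the returned index to refer to the normalized text, which can differ in length from the original string in general.

```csharp
using System;
using System.Text;

// Hypothetical helper: names are illustrative, not from the original post.
public static class FullWidthSearch
{
    // Normalizes both strings with compatibility composition (NFKC) and then
    // performs an ordinal search. The index refers to the NORMALIZED text.
    public static int IndexOfNormalized(string text, string value)
    {
        string normalizedText = text.Normalize(NormalizationForm.FormKC);
        string normalizedValue = value.Normalize(NormalizationForm.FormKC);
        return normalizedText.IndexOf(normalizedValue, StringComparison.Ordinal);
    }
}
```

For the string in the question, a plain ordinal IndexOf("%1") finds nothing, while IndexOfNormalized(temp, "%1") finds the placeholder right after the opening quote. If the index has to point into the original string, searching for "\uFF05" + "1" directly is the safer choice.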

0x202A in filename: Why?

I recently needed to do an ISNULL in SQL on a varbinary image.
So far so (ab)normal.
I very quickly wrote a C# program to read in the file no_image.png from my desktop, and output the bytes as hex string.
That program started like this:
byte[] ba = System.IO.File.ReadAllBytes(@"‪D:\UserName\Desktop\no_image.png");
Console.WriteLine(ba.Length);
// From here, change ba to hex string
And as I had used ReadAllBytes countless times before, I figured it was no big deal.
To my surprise, I got a "NotSupported" exception on ReadAllBytes.
I found that the problem was that when I right click on the file, go to tab "Security", and copy-paste the object-name (start marking at the right and move inaccurately to the left), this happens.
And it happens only on Windows 8.1 (and perhaps 8), but not on Windows 7.
When I output the string in question:
public static string ToHexString(string input)
{
string strRetVal = null;
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (char c in input)
{
sb.Append(((int)c).ToString("X2"));
}
strRetVal = sb.ToString();
sb.Length = 0;
sb = null;
return strRetVal;
} // End Function ToHexString
string str = ToHexString(@"‪D:\UserName\Desktop\cookie.png");
string strRight = " (" + ToHexString(@"D:\UserName\Desktop\cookie.png") + ")"; // Correct value, for comparison
string msg = str + Environment.NewLine + " " + strRight;
Console.WriteLine(msg);
I get this:
202A443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67
(443A5C557365724E616D655C4465736B746F705C636F6F6B69652E706E67)
First thing: when I look up 20 2A in ASCII, it's [space] + *
Since I see neither a space nor a star, I googled 20 2A, and the first thing I got was paragraph 202a of the German penal code
http://dejure.org/gesetze/StGB/202a.html
But I suppose that is rather an unfortunate coincidence, and it is actually the Unicode control character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
http://www.fileformat.info/info/unicode/char/202a/index.htm
Is that a bug, or is that a feature?
My guess is, it's a buggy feature.

The issue is that the string does not begin with a letter D at all - it just looks like it does.
It appears that the string is hard-coded in your source file.
If that's the case, then you have pasted the string from the security dialog. Unbeknownst to you, the string you pasted begins with the LRE character (U+202A, LEFT-TO-RIGHT EMBEDDING). This is an invisible character which takes no space, but tells the renderer to treat the text that follows as an embedded left-to-right run.
You just need to delete the character.
To do this, position the cursor AFTER the D in the string. Use the Backspace or Delete to Left key <x] to delete the D. Use the key again to delete the invisible LRE character. One more time to delete the ". Now retype the " and the D.
A similar problem could occur wherever the string came from - e.g. from user input, command line, script file etc.
Note: The security dialog shows the filename beginning with the LRE character to ensure that characters are displayed in left-to-right order, which is necessary to ensure that the hierarchy is correctly understood when using RTL characters. E.g. a filename c:\folder\path\to\file in Arabic might be c:\folder\مسار/إلى/ملف. The "gotcha" is that the Arabic parts read in the other direction, so the word "path", which according to Google Translate is مسار, is the rightmost word, making it appear as if it were the last element of the path, when in fact it is the element immediately after "c:\folder\".
Because security object paths have a hierarchy which is in conflict with the RTL text layout rules, the security dialog always displays RTL text in LTR mode. That means that the Arabic words will be mangled (letters in the wrong order) on the security tab. (Imagine it as if it said "elif ot htap"). So the meaning is just about discernible, but from the point of view of security, the security semantics are preserved.
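Since such a string can also arrive from user input or a script, one defensive option is to strip invisible format characters before using it as a path. This is a sketch under stated assumptions: the class and method names (PathSanitizer, StripFormatChars) are hypothetical, and it removes every character in the Unicode "Format" (Cf) category, which covers U+202A but also characters such as the zero-width joiner that may be legitimate in some file names.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: names are illustrative, not from the original post.
public static class PathSanitizer
{
    // Removes all characters whose Unicode category is Format (Cf),
    // e.g. U+202A LEFT-TO-RIGHT EMBEDDING.
    public static string StripFormatChars(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.GetUnicodeCategory(c) != UnicodeCategory.Format)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Passing the pasted path through StripFormatChars before calling ReadAllBytes would drop the leading U+202A and leave the visible characters untouched. For a hard-coded literal, deleting the character in the source file, as described above, remains the simpler fix.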

Filenames that contain RLO/LRO overrides are commonly created by malware. E.g. “exe” read backwards spells “malware”. You probably have an infected host, or the origin of the .png is infected.

This question bothered me a lot: how could a deterministic function give two different results for identical input? After some testing, it turns out that the answer is simple.
If you look at it in your debugger, you will see that the 'D' char in your @"‪D:\UserName\Desktop\cookie.png" (first use of the hex function) is NOT the same char as in @"D:\UserName\Desktop\cookie.png" (second use).
You must have used some other 'D'-like character, probably through an unwanted keyboard shortcut or by messing with your Visual Studio character encoding.
It looks exactly the same, but in reality it's not even a single char (try watching the c variable in your ToHexString function).
If you change to the normal 'D' in your first example, it will work fine.

How do you find at what line a word is located in a textbox?

I'm currently working on a notepad that has a find option. When you type in a word, it finds and highlights it. I got it working, but I've hit a wall with the method I'm currently using. I split all the words in the textbox on ' ' and add up the lengths of the words until I find the search term, so I know exactly where the found word is and can highlight it.
The problem I have now is that, since I'm using Split(' ') to get each word in the textbox, whenever the user adds a new line the array returned by Split contains "wordOnFirstLine\r\nwordOnSecondLine" as a single element, so the two words are counted as one.
What's another way I can find a word in the textbox and see exactly where it's located so I can highlight it?

Try splitting the string as
string[] words = stringToSplit.Split(new char[] { ' ', '\n', '\r' });
It'll give you an empty string between each '\r' and '\n' pair (pass StringSplitOptions.RemoveEmptyEntries as a second argument to drop those), but that fix may be closest to what you're currently doing.

I believe you're looking for the GetLineFromCharIndex(int) method. Passing in the index of the first character in your word should return its line number.

You should not split all the words on ' '. You can split the text by line ('\n') and then use IndexOf on each line to find the word's occurrence.

Do not split anything. It is a waste of resources to create a potentially big array of strings.
A simple IndexOf (with the appropriate comparison) gives you the position of the first match in the whole textbox text.
So, supposing you are searching for "answer" as a word delimited by a space before and after in your notepad text, you write
int pos = 0;
string searchText = " answer ";
pos = myNotePad.Text.IndexOf(searchText, pos, StringComparison.CurrentCultureIgnoreCase);
Now, to select the string:
if(pos >= 0)
{
myNotePad.SelectionStart = pos + 1;
myNotePad.SelectionLength = searchText.Length - 2;
}
If you save the value of the variable pos, you can also easily implement the Find Next functionality.
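The no-split idea can also be sketched without the space-delimited search string, so that words next to a line break or at the very start of the text are found too. This is only a sketch: the class and method names (WordFinder, FindWord, LineOf) are hypothetical, it swaps IndexOf for a regular expression with \b word boundaries, and it assumes the search term starts and ends with a letter or digit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not from the original post.
public static class WordFinder
{
    // Returns the character index of the next whole-word, case-insensitive
    // match at or after 'startAt', or -1 if there is none.
    public static int FindWord(string text, string word, int startAt)
    {
        var pattern = new Regex(@"\b" + Regex.Escape(word) + @"\b", RegexOptions.IgnoreCase);
        Match match = pattern.Match(text, startAt);
        return match.Success ? match.Index : -1;
    }

    // Returns the zero-based line number of a character index by counting
    // the '\n' characters that precede it.
    public static int LineOf(string text, int index)
    {
        int line = 0;
        for (int i = 0; i < index; i++)
        {
            if (text[i] == '\n')
                line++;
        }
        return line;
    }
}
```

The returned index can be assigned to SelectionStart, with the word's length as SelectionLength. Calling FindWord again with the previous index plus one gives a simple Find Next, and LineOf answers the question in the title without needing the control.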

How to display word differences using c#?

I would like to show the differences between two blocks of text. Rather than comparing lines of text or individual characters, I would like to just compare words separated by specified characters ('\n', ' ', '\t' for example). My main reasoning for this is that the block of text that I'll be comparing generally doesn't have many line breaks in it and letter comparisons can be hard to follow.
I've come across the following O(ND) logic in C# for comparing lines and characters, but I'm sort of at a loss for how to modify it to compare words.
In addition, I would like to keep track of the separators between words and make sure they're included with the diff. So if space is replaced by a hard return, I would like that to come up as a diff.
I'm using Asp.net to display the entire block of text including the deleted original text and added new text (both will be highlighted to show that they were deleted/added). A solution that works with those technologies would be appreciated.
Any advice on how to accomplish this is appreciated.
Thanks!

Microsoft has released a diff project on CodePlex that allows you to do word, character, and line diffs. It is licensed under Microsoft Public License (Ms-PL).
https://github.com/mmanela/diffplex

Other than a few general optimizations, if you need to include the separators in the comparison, you are essentially doing a character-by-character comparison with breaks. Though you could use the O(ND) code you linked, you would have to make about as many changes to it as you would writing your own.
The main problem with difference comparison is finding the continuation (for example, if I delete a single word but leave the rest the same).
If you want to use their code, start with the example and do not write the deleted characters; if there are replaced characters in the same place, do not output this result. You then need to compute the longest continuous run of "changed" words, highlight this string, and output it.
Sorry, that's not much of an answer, but for this problem the answer is basically writing and tuning the function.

Well, String.Split with '\n', ' ' and '\t' as the split characters will return an array of the words in your block of text.
You could then compare the two arrays for differences. A simple 1:1 comparison would tell you if any word had been changed. Comparing:
hello world how are you
and:
hello there how are you
would tell you that world had changed to there.
What it wouldn't tell you is whether words had been inserted or removed, and you would still need to parse the text blocks character by character to see whether any of the separator characters had changed.
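To also catch inserted or removed words and changed separators, the tokens can be compared with a longest-common-subsequence table instead of position by position. The sketch below is a plain dynamic-programming LCS diff, not the O(ND) algorithm linked in the question. The class and method names (WordDiff, Tokenize, Diff) and the "=", "-", "+" prefixes are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: names and output format are illustrative only.
public static class WordDiff
{
    // Splits text into words AND separators. The capture group in the pattern
    // makes Regex.Split return each whitespace character as its own token.
    public static string[] Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (string token in Regex.Split(text, @"(\s)"))
        {
            if (token.Length > 0)
                tokens.Add(token);
        }
        return tokens.ToArray();
    }

    // Returns tokens prefixed with "=" (unchanged), "-" (deleted) or "+" (added).
    public static List<string> Diff(string oldText, string newText)
    {
        string[] a = Tokenize(oldText);
        string[] b = Tokenize(newText);

        // lcs[i, j] = length of the longest common subsequence of a[i..] and b[j..].
        int[,] lcs = new int[a.Length + 1, b.Length + 1];
        for (int i = a.Length - 1; i >= 0; i--)
        {
            for (int j = b.Length - 1; j >= 0; j--)
            {
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);
            }
        }

        // Walk the table from the start, emitting one entry per token.
        var result = new List<string>();
        int x = 0, y = 0;
        while (x < a.Length && y < b.Length)
        {
            if (a[x] == b[y]) { result.Add("=" + a[x]); x++; y++; }
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) { result.Add("-" + a[x]); x++; }
            else { result.Add("+" + b[y]); y++; }
        }
        while (x < a.Length) result.Add("-" + a[x++]);
        while (y < b.Length) result.Add("+" + b[y++]);
        return result;
    }
}
```

For the two sentences above, the result marks world as deleted and there as added, with every other token unchanged. Replacing a space with a line break shows up as a deleted space followed by an added '\n' token. The table needs memory proportional to the product of the two token counts, so this is only suitable for moderately sized blocks of text.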

// Requires: using System.Linq;
string string1 = "hello world how are you";
string string2 = "hello there how are you";
var first = string1.Split(' ');
var second = string2.Split(' ');
var primary = first.Length > second.Length ? first : second;
var secondary = primary == second ? first : second;
// Words present in primary but not in secondary (a set difference: order and duplicates are ignored)
var difference = primary.Except(secondary).ToArray();
