C#: Issues using a dictionary with languages other than English

OK, so I'm basically trying to load the contents of a .txt file that contains one word per line into a dictionary.
I had no problems doing so when the words in that file were in English, but after changing the file to a language with accents, I started having problems.
I had to change the encoding while creating the StreamReader, and also the culture in the ToLower method while adding the word to the dictionary.
Basically, I now have something similar to this:
if (!dict.ContainsKey(word.ToLower(culture)))
    dict.Add(word.ToLower(culture), true);
The problem is that words like "esta" and "está" are being considered the same. So, is there any way to set the ContainsKey method to a specific language, or do we need to implement something along the lines of a comparer? Either way, I'm kind of new to C#, so I would appreciate an example, please.
Another issue emerged with the new file: after about a hundred words it stops adding the rest of the file, leaving a word incomplete. I can't see any special characters in that word that would end the execution of the method. Any ideas about this problem?
Many thanks.
EDIT:
First problem solved using Jon Skeet's suggestion.
Regarding the second problem:
OK, I changed the file format to UTF-8 and removed the encoding in the StreamReader, since it now recognizes the accents just fine. Testing some things regarding the second issue now.
Second problem also solved; it was a bug on my part... the shame...
Thanks for the quick answers everyone, and especially Jon Skeet.

I assume you're trying to get case insensitivity for the dictionary. Instead of calling ToLower, use the constructor of Dictionary which takes an equality comparer - and use StringComparer.Create(culture, true) to construct a suitable comparer.
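For example, a minimal sketch (assuming culture is the CultureInfo you are already using):
// Case-insensitive, culture-aware keys: "Está" and "está" collide,
// but "esta" and "está" remain distinct.
var dict = new Dictionary<string, bool>(StringComparer.Create(culture, true));
if (!dict.ContainsKey(word))
    dict.Add(word, true);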
I don't know what your second problem is about - we'd need more detail to diagnose it, including the code you're using, ideally.
EDIT: UTF-7 is almost certainly not the correct encoding. Don't just guess at the encoding; find out what it's really meant to be. Where did this text file come from? What can you open it successfully in?
I suspect that at least some of your problems are due to using UTF-7.

The problem is with the encoding you are using when opening the file to read. It looks like you may be using ASCIIEncoding.
.NET handles strings internally as UTF-16, so this kind of issue would not happen internally.
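For instance, a minimal sketch (assuming the file really is UTF-8; substitute whatever encoding the file was actually saved in):
// requires System.IO and System.Text
// Decode explicitly instead of relying on the platform default.
using (var reader = new StreamReader("words.txt", Encoding.UTF8))
{
    string word;
    while ((word = reader.ReadLine()) != null)
    {
        // add the word to the dictionary here
    }
}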

How to map C# compiler error location (line, column) onto the SyntaxTree produced by Roslyn API?

So:
The C# compiler outputs (line, column) style locations.
The Roslyn API expects sequential text positions.
How do I map the former to the latter?
The C# code could be UTF-8 with or without a BOM, or even UTF-16. It could contain all kinds of characters in the form of comments or embedded strings.
Let us assume we know the encoding and have the respective Encoding object handy. I can convert the file bytes to char[]. The problem is that some chars may contribute zero to the final sequential position. I know that the BOM character does; I have no idea whether others may too.
Now, if we know for sure that the BOM is the only character that contributes 0 to the length, then I can skip it and count the characters, and my question becomes trivial. This is what I do today; I just assume that the BOM is the only "bad" player.
But maybe there is a better way? Maybe the Roslyn API contains some hidden gem that can accept a (line, column) pair and produce the sequential position? Or maybe some of the Microsoft.Build libraries?
EDIT 1
As per the accepted answer, the following gives the position:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int location = srcText.Lines[err.Line - 1].Start + err.Column - 1;
You have uncovered the reason that the SourceText type exists in the Roslyn APIs. Its entire purpose is to handle the encoding of strings and perform calculations of lines, columns, and spans.
Due to the way .NET handles Unicode, and depending on which code pages are installed in your OS, there could be cases where SourceText does not do what you need. It has generally proven "good enough" for our purposes, though.
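A small sketch of the round trip (err is assumed to expose the compiler's 1-based Line and Column, as in the snippet above):
// requires the Microsoft.CodeAnalysis.Text namespace
// (line, column) -> absolute position, using SourceText:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int position = srcText.Lines[err.Line - 1].Start + err.Column - 1;
// ...and back: absolute position -> 0-based LinePosition:
LinePosition linePos = srcText.Lines.GetLinePosition(position);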

Something like translit.net but in AutoHotkey

I want to write something like Translit.net, but in AutoHotkey. I have successfully done the part where I have only one letter:
:*:a::а
:*:b::б
:*:v::в
:*:g::г
:*:d::д
...
But now I have a problem with the translation of "shh" to "щ" and other two-to-one character translations. When I start typing shh I get схх back, but I want to get щ. What can I do?
My current idea: when I press a key, the script should write down the letter and add the untranslated letter to a 3-element array, then check whether the array elements form shh, ch, sh, or any other combination longer than one character. Then I could remove the last 2 or 3 typed letters and send the Russian letter I need. Maybe someone knows an easier way to do this. I want my script to work exactly like the page I posted. A solution in C or C# instead of AutoHotkey would help me too.
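Since a C or C# solution is acceptable, here is a minimal C# sketch of the longest-match idea described above (the mapping table is abbreviated, and the sh/ch targets are assumptions based on standard transliteration):
// requires System.Collections.Generic and System.Text
// Longest match wins: "shh" is tried before "sh", which is tried before "s".
static readonly Dictionary<string, string> Map = new Dictionary<string, string>
{
    { "shh", "щ" }, { "sh", "ш" }, { "ch", "ч" },
    { "a", "а" }, { "b", "б" }, { "v", "в" }, { "g", "г" }, { "d", "д" },
};

static string Transliterate(string input)
{
    var sb = new StringBuilder();
    int i = 0;
    while (i < input.Length)
    {
        string mapped = null;
        int len;
        // Try the longest candidate first (3 chars, then 2, then 1).
        for (len = 3; len >= 1; len--)
        {
            if (i + len <= input.Length &&
                Map.TryGetValue(input.Substring(i, len), out mapped))
                break;
        }
        if (mapped != null)
        {
            sb.Append(mapped);
            i += len;
        }
        else
        {
            sb.Append(input[i]);   // no mapping; copy the character as-is
            i++;
        }
    }
    return sb.ToString();
}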
I had the same problem while using the Unicode version of AutoHotkey, but only if the file is saved in UTF-8 without a BOM.
Saving the file as Unicode (UCS-2, must be little-endian) solves the problem.
It also works with UTF-8 with a BOM, so apparently AutoHotkey has trouble determining the encoding on its own.

Regex to remove xml declaration from a string

First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterating every character, doing IndexOf for the <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build the XML using XML documents or something similar), but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on <\?xml.*?>, then using Regex.Replace(input, string.Empty) to remove them.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, JUST the declaration stripping took:
Nearly a second as it was
0.01 of a second with the regex
An untimeably short time using a loop and IndexOf()
So I went for IndexOf(), as it's a very simple loop.
You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy: it will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option is non-greedy, so it removes everything between each '<?xml' and the next '?>', which is what you want as long as no '?>' appears inside a declaration.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt they're faster than using IndexOf.
// Strips a single leading declaration:
strXML = strXML.Remove(0, strXML.IndexOf("?>", 0) + 2);
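For reference, a minimal sketch of such a loop, stripping every declaration rather than just a leading one:
// requires System and System.Text
// Removes every "<?xml ... ?>" span from the input using IndexOf.
static string StripXmlDeclarations(string input)
{
    var sb = new StringBuilder(input.Length);
    int pos = 0;
    while (pos < input.Length)
    {
        int start = input.IndexOf("<?xml", pos, StringComparison.Ordinal);
        int end = start < 0 ? -1 : input.IndexOf("?>", start, StringComparison.Ordinal);
        if (start < 0 || end < 0)
        {
            sb.Append(input, pos, input.Length - pos);  // keep the remainder
            break;
        }
        sb.Append(input, pos, start - pos);  // keep text before the declaration
        pos = end + 2;                       // resume after "?>"
    }
    return sb.ToString();
}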

How to prevent illegal characters from appearing in my XML when retrieving it from SQL Server

Sometimes the string values of properties in my classes become odd. They contain illegal characters and are displayed like this (with boxes):
123[]45[]6789
I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via a web service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?
Is there any way I can check whether the string contains any of these unrecognized characters, via a regex or something?
The first thing to remember is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, and there are non-characters, but there are no generally "special characters" or "illegal characters".
What you have here is either:
Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.
The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
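In C#, a quick way to do that (suspectString stands for whichever value you are inspecting):
// Dump the code point of every character in the string.
foreach (char c in suspectString)
    Console.WriteLine("'{0}' = U+{1:X4}", c, (int)c);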
An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a bunch of bytes that make no sense in the context of the encoding it is trying to use (e.g. 0x80 followed by 0x20 makes no sense in UTF-8). One possible response is to use U+FFFD as a "something strange here" marker; others are throwing an error, silently ignoring the error, or trying to guess at the intent, though those last two bring security issues.
Once you've figured this out, you can begin to reason about why it's getting in there, if it isn't expected. Could it be an encoding issue (the charset it was written in is not the charset it was read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.
Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing, it's fine"; possibly it will be something simple, or something hard. Can't say yet.
Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.
Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.
Verify that the XML document itself is stored using the correct encoding to support the characters you need to store. Then verify that, when you are reading the file in, you use the same encoding as the document, i.e. if your XML document is stored as UTF-8 then you need to make sure you decode it as UTF-8 when you read it in.
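As a minimal sketch, writing and reading with one explicit, matching encoding might look like this (the file name and element are placeholders):
// requires System.Text and System.Xml
// Write with an explicit encoding...
var settings = new XmlWriterSettings { Encoding = Encoding.UTF8 };
using (XmlWriter writer = XmlWriter.Create("data.xml", settings))
{
    writer.WriteElementString("value", "dash test: \u2014");
}
// ...and read it back; XmlReader.Create honours the BOM and the
// encoding declared in the XML prolog.
using (XmlReader reader = XmlReader.Create("data.xml"))
{
    while (reader.Read()) { /* inspect nodes here */ }
}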
Take a deeper look at the characters themselves; what are the actual char values?
When a character shows up as a square, it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside your current character set.
Edit: nope.
In your example I'd venture a guess that you're seeing embedded newline characters.
Define the allowed characters and block everything else, e.g.:
// only lowercase letters and digits
if (Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
// allowed
}
But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attribute), resulting in properties or fields being serialized that you don't want to be serialized.
PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, XML) and be explicit about it. If you are not, the default encoding is used, which can differ from system to system.
Edit: possible solution
Based on new information (see the thread under the original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with typographic dashes (&#x2014;) when typed in some fancy editing environment. Since there seems to be some unclarity about how to get SQL Server to accept properly encoded strings, you can also solve this in your XML.
When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use numerical character entities where needed. When you deserialize, these will be parsed back into your strings without further ado. Something along these lines:
Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header
But be aware when using StringBuilder or StringWriter: they are fixed to UTF-16, and the XmlWriter will always write in that encoding (more info on that issue on my blog), which is not compatible with SQL Server.
Note: when using the ASCII encoding, any character higher than 0x7F will be encoded as a numeric entity. So é will look like &#xE9; and the dash may look like &#x2014;, but these mean exactly the same thing, and you should not worry about it. Every XML-capable tool will properly interpret this input.
Note 2: the place where you want to change the way the XML is written is the web service you speak of, which receives the XML and then stores it in the SQL Server database. The change must be applied before storing into SQL Server; earlier in the chain is useless.
public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));
    using (StringReader sr3 = new StringReader(xml))
    {
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            // Let characters that are illegal in XML 1.0 pass through
            // instead of throwing during deserialization.
            CheckCharacters = false // default value is true
        };
        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }
    return result;
}

How to parse a MIDlet JAD file in C#?

Besides doing it manually using a regular expression search, are there other, better ways to parse a JAD file?
I need to be able to search for and replace/insert a new MIDlet-Install-Notify property in a given JAD file, and also update the value of the MIDlet-Jar-URL property.
Using ANTLR or TinyPG is a bit overkill for my case.
TIA
Even a regex might be overkill, although it certainly will get the job done. It is a very simple text format to parse; string.StartsWith() and string.IndexOf() (to find the colon) would work well.
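A minimal sketch along those lines (treats the JAD as simple "Name: value" lines and ignores any line-continuation subtleties):
// requires System, System.Collections.Generic and System.IO
static void UpdateJad(string path, string jarUrl, string notifyUrl)
{
    var lines = new List<string>(File.ReadAllLines(path));
    bool notifyFound = false;

    for (int i = 0; i < lines.Count; i++)
    {
        int colon = lines[i].IndexOf(':');
        if (colon < 0)
            continue;                        // not a property line
        string name = lines[i].Substring(0, colon).Trim();

        if (name.Equals("MIDlet-Jar-URL", StringComparison.OrdinalIgnoreCase))
            lines[i] = "MIDlet-Jar-URL: " + jarUrl;      // update the JAR URL
        else if (name.Equals("MIDlet-Install-Notify", StringComparison.OrdinalIgnoreCase))
        {
            lines[i] = "MIDlet-Install-Notify: " + notifyUrl;
            notifyFound = true;
        }
    }

    if (!notifyFound)                        // insert the property if absent
        lines.Add("MIDlet-Install-Notify: " + notifyUrl);

    File.WriteAllLines(path, lines.ToArray());
}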
