Convert special characters in XML to UTF-8 in C#

I have a problem with my XML. When the tag values contain special characters, I need those characters to be converted to UTF-8. Is there a namespace in C# for handling this?

Is this what you are looking for?
http://devproj20.blogspot.com/2008/02/writing-xml-with-utf-8-encoding-using.html
Look at the first comment for an alternative
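For readers who cannot follow the link, here is a minimal sketch of one way to write XML as UTF-8 with the standard System.Xml classes. It is not necessarily the approach from the linked post; the class name Utf8XmlExample and the method WriteElement are made up for illustration. XmlWriter escapes markup characters such as & and < itself, and the Encoding setting controls how non-ASCII characters are written as bytes.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not taken from the linked post.
public static class Utf8XmlExample
{
    // Writes a single element and returns the document as UTF-8 bytes.
    public static byte[] WriteElement(string name, string value)
    {
        var settings = new XmlWriterSettings
        {
            // UTF-8 without a byte order mark; pass true if the consumer expects a BOM.
            Encoding = new UTF8Encoding(false)
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                // The writer escapes markup characters in the value.
                writer.WriteElementString(name, value);
                writer.WriteEndDocument();
            }
            // The writer is disposed (and flushed) before the bytes are read.
            return stream.ToArray();
        }
    }
}
```

Calling Utf8XmlExample.WriteElement("item", "Café & tea") should give bytes in which é is stored as a two-byte UTF-8 sequence and & is written as &amp;.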

Related

Force .NET UTF-8 encoder to output 3-byte encoded characters

I am working on a foreign file format that was apparently developed in Japan. Most of their strings are stored with UTF-8 encoding in the 3-byte format (i.e. the capital A is represented as 0xEF,0xBC,0xA1). While it is no problem to decode such strings in .NET, I could not find a way to force the framework to output in the same format, as it will default to the abbreviated form (makes sense, but really I do need the 3-byte form).
Is there any standard functionality that will take care of this? Me being lazy I do not want to implement it myself :)
That's not the letter 'A'. It's a different rune, the FULLWIDTH LATIN CAPITAL LETTER A (U+FF21). Notice the extra spacing in 'Ａ'.
This isn't a different UTF8 format, it's a different character. Whoever produced this kind of file either made a mistake, or intentionally used those glyphs for layout purposes.
If you want to produce a similar text, you'll have to find how those characters are used in the first place, eg. for some words, every word, specific sections? Then you'll have to modify your own text to match this, eg by replacing normal letters with the full-width equivalents.
You can convert such strings back to their ordinary equivalents with String.Normalize, using the KC or KD normalisation forms. For example, the following expression:
"'ＡＡ'".Normalize(System.Text.NormalizationForm.FormKC)
Returns:
'AA'
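Normalize only goes from full-width to ordinary characters. For the opposite direction, which is what the question needs, a rough sketch is below. It assumes that every printable ASCII character should be mapped, which may not match how the file format actually uses these characters; FullWidth and ToFullWidth are illustrative names. The mapping relies on the full-width forms U+FF01 to U+FF5E sitting at a fixed offset of 0xFEE0 from ASCII '!' to '~'.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
public static class FullWidth
{
    // Distance between ASCII '!' (U+0021) and FULLWIDTH EXCLAMATION MARK (U+FF01).
    private const int Offset = 0xFEE0;

    // Replaces printable ASCII characters with their full-width forms.
    // Spaces and all other characters are left unchanged.
    public static string ToFullWidth(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c >= '!' && c <= '~')
                sb.Append((char)(c + Offset));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Encoding.UTF8.GetBytes(FullWidth.ToFullWidth("A")) then yields the three bytes 0xEF, 0xBC, 0xA1 from the question, because the standard encoder is simply encoding a different character.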

C# Marshal.PtrToStringUni does not return unicode values

I'm using C# Marshal.PtrToStringUni(IntPtr) to read data from Sql Server
It returns Latin characters correctly, but for other Unicode characters I see garbage.
Am I missing some conversion?
ANSI encoding is not Unicode. Use PtrToStringUni method instead.

Special character encoding

I am working with JSON to communicate data between two systems. One of the properties in the JSON is rich text. Most of the time there are no problems, but once in a blue moon special characters like curly quotes, which are not UTF-8 characters, make it into the rich text.
I want to replace these special characters with their UTF-8 equivalents. How can I achieve this in C#?
Example of this string - “Cops bring lettuce & tomato, dispose of evidence,”. If I create a regular quote it's like this - "
Thanks
The quotes you posted are sometimes called "smart quotes" - “”. They are valid Unicode characters and encode fine as UTF-8, but they are not the straight quote that JSON (and most programming languages) uses to delimit strings.
They are the kind of quotes produced from pasting code into Word.
The fix is to replace both characters with quotes that are valid for JSON (that is ").
If these appear in the JSON values, you need to escape them with a \ - so instead of " you will use \".
Also, take a look at this question and its answers - make sure that the server returns the JSON response as UTF-8 and not some other encoding.
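The replacement itself can be sketched in a few lines. QuoteCleaner and StraightenDoubleQuotes are made-up names, and this only handles the two curly double quotes (U+201C and U+201D) from the example; escaping the resulting " inside a JSON value is still up to whatever builds the JSON.

```csharp
// Hypothetical helper for illustration only.
public static class QuoteCleaner
{
    // Replaces left and right curly double quotes with the straight ASCII quote.
    public static string StraightenDoubleQuotes(string text)
    {
        return text
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"');  // right double quotation mark
    }
}
```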

Remove specific HTML tags and non-ASCII characters

How can I remove <table>, <tr>, and <td> HTML tags plus non-ASCII characters from a string using C#?
I want to leave other tags in the string alone.
Check these questions:
Using C# regular expressions to remove HTML tags
How can you strip non-ASCII characters from a string? (in C#)
Simple Google search: http://en.csharp-online.net/Strip_all_HTML_tags
Depending on why you want to do this, I'd recommend against trying. There are many pitfalls, even with Regex.
Personally I'd recommend encoding the input, rather than trying to strip stuff out of it.
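If you decide to do it anyway, a rough regex-based sketch might look like the following. HtmlCleaner and StripTableTagsAndNonAscii are illustrative names, and the pitfalls mentioned above still apply: the pattern assumes reasonably well-formed tags and will be confused by things like a > inside an attribute value, or tag-like text inside comments and scripts.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HtmlCleaner
{
    // Removes opening and closing <table>, <tr> and <td> tags (keeping their
    // inner content and all other tags), then drops every non-ASCII character.
    public static string StripTableTagsAndNonAscii(string html)
    {
        string withoutTags = Regex.Replace(
            html,
            @"</?(table|tr|td)\b[^>]*>",
            string.Empty,
            RegexOptions.IgnoreCase);

        // Anything outside U+0000..U+007F is removed.
        return Regex.Replace(withoutTags, @"[^\u0000-\u007F]", string.Empty);
    }
}
```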

C# Regex - How to parse string for Swedish letters åäöÅÄÖ?

I'm trying to parse an HTML file for strings in this format:
MyUsername O22</td>
I want to retrieve "305157", "MyUsername" and the first letter of "O22" (which can be either T, K or O).
I'm using this regex: \w* \w\d\d and it works fine, as long as there aren't any åäöÅÄÖ's where the "\w" are.
What should I do?
You can use a character class which specifically includes those things:
[\wåäöÅÄÖ]*
Or you can use the Unicode character class for letters:
\p{L}
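or for a specific Unicode block (note that .NET spells block names with an Is prefix, and that åäöÅÄÖ are in the Latin-1 Supplement block rather than Basic Latin):
\p{IsLatin-1Supplement}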
You can use \p{L} to match any 'letter', which will support all letters in all languages, as suggested in this SO question.
Or, you can simply replace \w* with [^<]*, to match all characters that are not the opening of an HTML tag.
But as said by others, parsing HTML using regex is a first step towards insanity...
Firstly: DON'T USE REGULAR EXPRESSIONS TO PARSE HTML. USE AN HTML PARSER.
Secondly: if you really want to do this (and you don't) then instead of \w you could match any character apart from '<':
[^<]* \w\d\d
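With the same caveat about regex and HTML, the \p{L} suggestion could be applied roughly as follows. UserLineParser and TryParse are made-up names, and the pattern assumes that usernames consist of letters, digits and underscores, which the question does not confirm.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UserLineParser
{
    // Group 1: username made of Unicode letters, decimal digits or underscores.
    // Group 2: the status letter (T, K or O), followed by two digits.
    private static readonly Regex Pattern =
        new Regex(@"([\p{L}\p{Nd}_]+) ([TKO])\d\d");

    public static bool TryParse(string line, out string username, out char letter)
    {
        Match m = Pattern.Match(line);
        username = m.Success ? m.Groups[1].Value : string.Empty;
        letter = m.Success ? m.Groups[2].Value[0] : '\0';
        return m.Success;
    }
}
```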
