Encoding issue with Portuguese characters - c#

I am doing a web request for a Portuguese web page. After I get the result, I see the some of the characters are getting converted to hash format.
Example:
Source: PRAÇA DOS OMAGUÁS
Result I am getting : PRAÇA DOS OMAGUÁS
I tired set the encoding format to "IBM860" (reference: http://msdn.microsoft.com/en-us/library/system.text.encodinginfo.getencoding.aspx) for webrequest. But it is still not able to convert.
Any ideas?

string s = System.Web.HttpUtility.HtmlDecode("PRAÇA DOS OMAGUÁS");

> HttpUtility.HtmlDecode("PRAÇA DOS OMAGUÁS")
You can use HttpUtility.HtmlDecode
If you are using .NET 4.0+ you can also use WebUtility.HtmlDecode which does not require an extra assembly reference as it is available in the System.Net namespace
How can I decode HTML characters in C#?

Related

Unable to encode Url properly using HttpUtility.UrlEncode() method

I have created an application in which I need to encode/decode special characters from the url which is entered by user.
For example : if user enters http://en.wikipedia.org/wiki/Å then it's respective Url should be http://en.wikipedia.org/wiki/%C3%85.
I made console application with following code.
string value = "http://en.wikipedia.org/wiki/Å";
Console.WriteLine(System.Web.HttpUtility.UrlEncode(value));
It decodes the character Å successfully and also encodes :// characters. After running the code I am getting output like : http%3a%2f%2fen.wikipedia.org%2fwiki%2f%c3%85 but I want http://en.wikipedia.org/wiki/%C3%85
What should I do?
Uri.EscapeUriString(value) returns the value that you expect. But it might have other problems.
There are a few URL encoding functions in the .NET Framework which all behave differently and are useful in different situations:
Uri.EscapeUriString
Uri.EscapeDataString
WebUtility.UrlEncode (only in .NET 4.5)
HttpUtility.UrlEncode (in System.Web.dll, so intended for web applications, not desktop)
You could use regular expressions to select hostname and then urlencode only other part of string:
var inputString = "http://en.wikipedia.org/wiki/Å";
var encodedString;
var regex = new Regex("^(?<host>https?://.+?/)(?<path>.*)$");
var match = regex.Match(inputString);
if (match.Success)
encodedString = match.Groups["host"] + System.Web.HttpUtility.UrlEncode(match.Groups["path"].ToString());
Console.WriteLine(encodedString);

A socket message from Python to C# comes through garbled

I'm trying to set up a very basic ZeroMQ-based socket link between Python server and C# client using simplejson and Json.NET.
I try to send a dict from Python and read it into an object in C#. Python code:
message = {'MessageType':"None", 'ContentType':"None", 'Content':"OK"}
message_blob = simplejson.dumps(message).encode(encoding = "UTF-8")
alive_socket.send(message_blob)
The message is sent as normal UTF-8 string or, if I use UTF-16, as "'\xff\xfe{\x00"\x00..." etc.
Code in C# is where my problem is:
string reply = client.Receive(Encoding.UTF8);
The UTF-8 message is received as "≻潃瑮湥≴›..." etc.
I tried to use UTF-16 and the message comes through OK, but the first symbols are still the little-endian \xFF \xFE BOM so when I try to feed it to the deserializer,
PythonMessage replyMessage = JsonConvert.DeserializeObject<PythonMessage>(reply);
//PythonMessage is just a very simple class with properties,
//not relevant to the problem
I get an error (obviously occurring at the first symbol, \xFF):
Unexpected character encountered while parsing value: .
Something is obviously wrong in the way I'm using encoding. Can you please show me the right way to do this?
The byte-order-mark is obligatory in UTF-16. You can use UTF-16LE or UTF-16BE to assume a particular byte order and the BOM will not be generated. That is, use:
message_blob = simplejson.dumps(message).encode(encoding = "UTF-16le")

C# decoding "â„¢" to "TM"

on a web page there is following string
"Qualcomm Snapdragon™ S4"
when i get this string in my .net code the string convert to "Qualcomm Snapdragonâ„¢ S4"
the character "TM" change to â„¢
how can i decode "â„¢" back to "TM"
Update
follwoing is the code for downloaded string using webproxy
wc is webproxy
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8");
string html = Server.HtmlEncode(wc.DownloadString(url));
You should read the webpage in its proper encoding in the first place. In this case it seems you are reading with Encoding.Default (i.e. probably CP1252) and the page is really in UTF-8. This should be apparent either by reading the Content-Type header of the response or by looking for a <meta http-equiv="Content-Type" content='text/html; charset=utf-8'> in the content.
If you still need to do this after the fact, then use
var bytes = Encoding.Default.GetBytes(myString);
var correctString = Encoding.UTF8.GetString(bytes);
In any case you would need to know the exact encodings that were used on the page and for reading the malformed string in the first place. Furthermore I'd generally advise explicitly against using Encoding.Default because its value isn't fixed. It's just the legacy encoding on a Windows system for use in non-Unicode applications and also gets used as the default non-Unicode text file encoding. It should have no place whatsoever in handling external resources.

Read UNIX encoded file with C#

I have c# program we use to replace some Values with others, to be used after as parameters. Like 'NAME1' replaced with &1, 'NAME2' with &2, and so on.
The problem is that the data to modify is on a text file encoded on UNIX, and special characters like í, which even on memory, gets read as a square(Invalid char). Due specifications that are out of my control, the file can't be changed and have no other choice than read it like that.
I have tryed to read with most of the 130 Encodings c# offers me with:
EncodingInfo[] info = System.Text.Encoding.GetEncodings();
string text;
for (int a = 0; a < info.Length; ++a)
{
text = File.ReadAllText(fn, info[a].GetEncoding());
File.WriteAllText(fn + a, text, info[a].GetEncoding());
}
fn is the file path to read. Have checked all the made files(like 130), no one of them writes properly the í so im out of ideas and im unable to find anything on internet.
SOLUTION:
Looks like finally this code made the work to get the text properly, also, had to fix the same encoder for the Writing part:
System.Text.Encoding encoding = System.Text.Encoding.GetEncodings()[41].GetEncoding();
String text = File.ReadAllText(fn, encoding); // get file text
// DO ALL THE STUFF I HAD TO
File.WriteAllText(fn, text, encoding) System.Text.Encoding.GetEncodings()[115].GetEncoding(); //Latin 9 (ISO)
/* ALL THIS ENCODINGS WORKED APARENTLY FOR ME WITH ALL WEIRD CHARS I WAS ABLE TO WRITE :P
System.Text.Encoding.GetEncodings()[108].GetEncoding(); //Baltic (ISO)
System.Text.Encoding.GetEncodings()[107].GetEncoding(); //Latin 3 (ISO)
System.Text.Encoding.GetEncodings()[106].GetEncoding(); //Central European (ISO)
System.Text.Encoding.GetEncodings()[105].GetEncoding(); //Western European (ISO)
System.Text.Encoding.GetEncodings()[49].GetEncoding(); //Vietnamese (Windows)
System.Text.Encoding.GetEncodings()[45].GetEncoding(); //Turkish (Windows)
System.Text.Encoding.GetEncodings()[41].GetEncoding(); //Central European (Windows) <-- Used this one
*/
Thank you very much for your help
Noman(1)
you have to get the proper encoding format. try
use file -i. That will output MIME-type information for the file,
which will also include the character-set encoding. I found a
man-page for it, too :)
Or try enca
It can guess and even convert between encodings. Just look at
the man page.
If you have the proper encoding format, look for a way to apply it to your file reading.
Quotes: How to find encoding of a file in Unix via script(s)

Google Translate Api and Special Characters

I've recently started using the google translate API inside a c# project. I am trying to translate some text from english to french. I am having issues with some special characters though.
For example the word Company comes thru as Société instead of Société as it should. Is there some way in code I can convert these to the correct special characters? ie (é to é)
Thanks
If you need anymore info let me know.
I ran into this same exact issue. If you're using the WebClient class to download the json response from google, try setting the Encoding property to UTF8.
using(var webClient = new WebClient { Encoding = Encoding.UTF8 })
{
string json = webClient.DownloadString(someUri);
...
}
I have reproduced your problem, and it looks like you are using the UTF7 encoding. UTF8 is the way you need to go.
I use Google's API by creating a WebRequest to get an HTTP response from the server, then I read the response stream with a StreamReader. StreamReader defaults to UTF8, but to reproduce your problem, I passed Encoding.UTF7 into the StreamReader's constructor.

Categories

Resources