Reading a web page in a foreign language with StreamReader - c#

I'm trying to fetch a web page that is a mix of English and Korean. The browser can fetch and display the page just fine, but when I try to grab it programmatically I can't get the Korean characters to display properly.
I know that you can specify an Encoding in the StreamReader but I haven't found one that works yet.
This is the code that I'm using to read the response:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet));
response.CharacterSet returns UTF8. I've also tried all of the basic encoding options - ASCII, BigEndian, Default, Unicode, UTF32, UTF7, and manually adding Encoding.UTF8.
I've also tried going about it through the CultureInfo:
CultureInfo kr = CultureInfo.GetCultureInfo("ko");
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(kr.TextInfo.ANSICodePage));
using both "ko" and "ko-KR". I get varied results from all these different types, but none of them are correct.
I've also tried the code page directly:
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(949));
response.ContentEncoding returns an empty string. I'm running out of ideas.
Edit: Here is an example of what I'm expecting:
프로젝트:
and here is what I'm getting:
//ASCII == ??????
//BigEndian == ़汩湫â¨ç‰¥æ˜½âˆ¯æ©³â½¤ç°æ”
//Default == íâ€â€žÃ«Â¡Å“ì Â트:
//Unicode == íâ€â€žÃ«Â¡Å“ì Â트
//UTF32 == ���������ï
//UTF7 == 프로ì Â트
//UTF8 == 프로ì 트

FWIW: a stream reader is likely not going to work well.
Prefer using the HttpWebRequest class for browser-style requests, or you will soon regret it when you run into 302 responses or gzipped and/or chunked encoding.
I promoted this to an answer, as it might very well be the problem you're having already. I don't know what the response you are getting looks like, of course
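A minimal sketch of the gzip point, assuming (and it is only an assumption) that the server compresses the body: bytes that are still compressed will never decode to readable Korean, whatever Encoding is passed to the StreamReader. The class and method names here (GzipDemo, Gzip, ReadGzippedUtf8) are made up for illustration, and the sample uses an in-memory stream instead of a real response.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipDemo
{
    // Compresses text roughly the way a server sending "Content-Encoding: gzip" would.
    public static byte[] Gzip(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Decompresses first, then decodes the result as UTF-8.
    public static string ReadGzippedUtf8(Stream body)
    {
        using (var gzip = new GZipStream(body, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        byte[] compressed = Gzip("프로젝트:");
        Console.WriteLine(ReadGzippedUtf8(new MemoryStream(compressed)));
        // With a real HttpWebRequest, setting
        //   request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        // before calling GetResponse() asks the framework to do the decompression step.
    }
}
```

If the response turns out not to be compressed, this step is unnecessary and the cause lies elsewhere.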

Related

Get content of HttpWebResponse while debugging

I'm trying to run sequential requests to a web api url every 10 seconds to log changes in the data returned. The code snippet looks like this:
using (Stream objStream = response.GetResponseStream())
{
query result = (query)serializer.Deserialize(objStream);
Console.WriteLine(result.results.quote.Name + " " + result.results.quote.Ask);
objStream.Flush();
objStream.Close();
}
Every now and then an InvalidOperationException is thrown during deserialization, with a message saying that the XML document is badly formatted. In an effort to isolate the problem I'm trying to find the "raw" response content in debug mode using the Autos/Locals/Watch views, but I really can't find it.
I can find the response header and a lot of other information and as far as I can see this looks okay with one exception; the content-length which shows -1. I'm not sure if this is something that I should care about really but since I can't find the response "body" I can't help being suspicious about it.
So my real question here is: how can I find the "body" inside a HttpWebResponse or Stream object?
And the side question: is a content-length of -1 something to be bothered about?
If you read the entire contents from the stream and store it in a variable before deserializing it, you should be able to see the contents while debugging
For debugging, I would suggest copying the response into a string so that you can watch it.
using (Stream objStream = response.GetResponseStream())
using (StreamReader sr = new StreamReader(objStream))
{
    // Read the whole body into a string that can be inspected in the debugger.
    string responseText = sr.ReadToEnd();
    // A network response stream cannot be rewound with Seek, so deserialize from the string.
    // (Assumes serializer is an XmlSerializer, which accepts a TextReader.)
    query result = (query)serializer.Deserialize(new StringReader(responseText));
    Console.WriteLine(result.results.quote.Name + " " + result.results.quote.Ask);
    objStream.Flush(); // remove
    objStream.Close(); // remove
}
I would also recommend removing:
objStream.Flush();
objStream.Close();
The using statement calls Dispose() (from IDisposable) when the block exits, which closes the stream for you.
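As a rough sketch of the same idea, the raw text can also be printed whenever deserialization fails, so the malformed responses get logged without a debugger attached. The Quote class below is a hypothetical stand-in for the question's query type, and TryParse is an illustrative name, not part of any library.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for the question's "query" type.
public class Quote
{
    public string Name { get; set; }
    public string Ask { get; set; }
}

static class RawBodyDemo
{
    // Returns the parsed quote, or null after printing the raw text that failed to parse.
    public static Quote TryParse(string rawXml)
    {
        var serializer = new XmlSerializer(typeof(Quote));
        try
        {
            return (Quote)serializer.Deserialize(new StringReader(rawXml));
        }
        catch (InvalidOperationException ex)
        {
            // XmlSerializer reports malformed XML as InvalidOperationException.
            Console.WriteLine("Could not deserialize: " + ex.Message);
            Console.WriteLine(rawXml);
            return null;
        }
    }

    static void Main()
    {
        Quote ok = TryParse("<Quote><Name>ACME</Name><Ask>12.5</Ask></Quote>");
        Console.WriteLine(ok.Name + " " + ok.Ask);

        // A truncated body, as might arrive when a response is cut short.
        TryParse("<Quote><Name>ACME</Name>");
    }
}
```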

The correct way to create an HTMLDocument using HTMLAgilityPack?

Consider the code below:
string url="http://badoo.com";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
Now I have the htmlstring inside the result variable, let's try something:
// save normally
File.WriteAllText("1.html",result);
// save using HTMLAgilityPack
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(result);
hdoc.Save("2.html");
Can someone please tell me why 1.html and 2.html don't look the same, even though they have the same file size?
Link to the correct one (file.writealltext() ) : http://woman2.com/1.html
Link to the wrong one (saved with htmlagility pack) : http://woman2.com/2.html
Update:
I have also tried to save the file on local disk and then
hdoc.Load("path/to/local",true);
I have also tried:
hdoc.LoadHtml(result);
And tried:
hdoc.Save("2.html",Encoding.UTF8);
but none of these attempts seems to work for me. I've been struggling with this for 3 days now.
Andrew Morton is correct. The file '1.html' is formed in a way that makes agility pack angry/scared/confused. In all seriousness, I ran your code and diffed the resulting files and here are some of the differences:
Innocuous:
removes whitespace redundancy
adds closing tags where elements were previously self-closing
Possibly affect the site:
adds attribute values where none previously existed
changes language-specific characters
Likely affect the site:
"fixes" unmatched quotations (I would put my money here if I had to guess)
Again, as Andrew mentioned, fix that up first before banging your head against the wall any further.

HttpWebRequest and Unicode characters

I am using this code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
string result = null;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
StreamReader reader = new StreamReader(resp.GetResponseStream());
result = reader.ReadToEnd();
reader.Close();
}
In result I get text like 003cbr /003e003cbr /003e (I think this should be 2 line breaks instead). I tried the two- and three-parameter overloads of StreamReader, but the string was the same. (The request returns a JSON string.)
Why am I getting those characters, and how can I avoid them?
It's not really clear what that text is, but you're not specifying an encoding at the moment. What content encoding is the server using? StreamReader will default to UTF-8.
It sounds like actually you're getting some sort of oddly-encoded HTML, as U+003C is < and U+003E is >, giving <br /><br /> as the content. That's not JSON...
Two tests:
Use WebClient.DownloadString, which will detect the right encoding to use
See what gets shown using the same URL in a browser
EDIT: Okay, now that I've seen the text, it's actually got:
\u003cbr /\u003e
The \u part is important here - that's part of the JSON which states that the next four characters form the hex representation of a UTF-16 code unit.
Any JSON API used to parse that text should perform the unescaping for you.
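To illustrate, here is a small sketch using System.Text.Json, which assumes a modern .NET runtime (the original question predates it, so an older project would use whichever JSON library it already has). The property name "text" and the helper ReadText are invented for the example.

```csharp
using System;
using System.Text.Json;

static class JsonUnescapeDemo
{
    // Parses the JSON and returns the unescaped value of the (hypothetical) "text" property.
    public static string ReadText(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty("text").GetString();
        }
    }

    static void Main()
    {
        // Verbatim string: the \u sequences reach the parser exactly as written.
        string json = @"{""text"":""\u003cbr /\u003e\u003cbr /\u003e""}";
        Console.WriteLine(ReadText(json)); // prints <br /><br />
    }
}
```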

How to make Stream.Write() output in UTF-8 format

My issue is this:
I am generating and uploading a SQL file using ASP.NET, but after the file is saved to the FTP server, characters like ü are changed to &uuml;, ø to &oslash; and so on... How can I prevent this from happening? I don't want the file to be formatted with ASCII code, but with UTF-8.
The code that generates and uploads the file looks like this:
// request = the object the request is made from.
Stream requestStream = request.GetRequestStream();
var encoding = new UTF8Encoding();
//fileContent is the string to be saved in the file
byte[] buffer = encoding.GetBytes(fileContent);
requestStream.Write(buffer, 0, buffer.Length);
requestStream.Close();
As you can see I've tried to use the System.Text.UTF8Encoding, but it doesn't work.
Remember, with streams you can almost always wrap the streams as necessary. If you want to write UTF-8 encoded content you wrap the request stream in a StreamWriter with the correct encoding:
using (Stream requestStream = request.GetRequestStream())
using (StreamWriter writer = new StreamWriter(requestStream, Encoding.UTF8)) {
writer.Write(fileContent);
}
Since you say you're uploading to a web service be sure to set your content encoding as well. Since you haven't posted where the request object comes from, I'll assume it's a normal HttpWebRequest.
With a HttpWebRequest you would tell the server what the content encoding is by using the ContentType property.
request.ContentType = "text/plain;charset=utf-8";
As others have mentioned, though, the FTP transfer itself may be breaking it too. If you can, make sure it's transferred in binary mode, not ASCII mode.
Run it in the debugger and look at what ends up in 'buffer' after encoding.GetBytes() is called. This will tell you whether the receiving side is causing it.
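For reference when inspecting the buffer, this sketch prints the bytes that UTF-8 encoding produces for the characters in question. HexOf is an illustrative helper name. If the buffer holds these two-byte sequences, the encoding step is fine; if it already holds the bytes of an HTML entity such as &uuml;, the string was entity-encoded before it reached GetBytes.

```csharp
using System;
using System.Text;

static class Utf8BytesDemo
{
    // Returns the UTF-8 bytes of the text as dash-separated hex.
    public static string HexOf(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    static void Main()
    {
        Console.WriteLine(HexOf("\u00FC"));  // ü, prints C3-BC
        Console.WriteLine(HexOf("\u00F8"));  // ø, prints C3-B8
        Console.WriteLine(HexOf("&uuml;"));  // the entity text, prints 26-75-75-6D-6C-3B
    }
}
```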

screen scraping

I am screen scraping a website that is in Danish. I am unable to scrape certain characters, such as må.
Any idea how to solve this?
thanks
Try UTF-8 or Windows-1252 charset.
If you are using a Web browser control, you can set the page encoding to whatever language that can show that character. Then just extract the page source.
I just used System.Web.HttpContext.Current.Server.HtmlDecode()
and it works.
I use iso-8859-1 for decoding.
HTH
It's better to use the same encoding that the HttpWebResponse object reports.
The code below should work for most languages and characters, provided the server reports its charset correctly.
string html = null;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    // Decode the body with the charset named in the response's Content-Type header.
    Encoding encoding = Encoding.GetEncoding(response.CharacterSet);
    if (response.StatusCode == HttpStatusCode.OK)
    {
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            html = reader.ReadToEnd();
        }
    }
}
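One caveat with this approach: Encoding.GetEncoding throws when the charset is missing or not recognized. A possible guard, sketched here with a hypothetical helper named CharsetHelper.Resolve and an assumed fallback of UTF-8, could look like this:

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Returns the encoding named by charset, or UTF-8 when the name is empty or unknown.
    // The UTF-8 fallback is an assumption; another default may suit a given site better.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers wrap the charset value in quotes, so strip them first.
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    static void Main()
    {
        Console.WriteLine(Resolve("iso-8859-1").WebName);
        Console.WriteLine(Resolve("").WebName);
        Console.WriteLine(Resolve("no-such-charset").WebName);
    }
}
```

It would be called as CharsetHelper.Resolve(response.CharacterSet) in place of the direct Encoding.GetEncoding call.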
