HttpWebRequest and Unicode characters - c#

I am using this code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
string result = null;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
    StreamReader reader = new StreamReader(resp.GetResponseStream());
    result = reader.ReadToEnd();
    reader.Close();
}
In result I get text like 003cbr /003e003cbr /003e (I think this should be two line breaks instead). I tried the two- and three-parameter StreamReader constructors, but the string was the same. (The request returns a JSON string.)
Why am I getting those characters, and how can I avoid them?

It's not really clear what that text is, but you're not specifying an encoding at the moment. What content encoding is the server using? StreamReader will default to UTF-8.
It sounds like actually you're getting some sort of oddly-encoded HTML, as U+003C is < and U+003E is >, giving <br /><br /> as the content. That's not JSON...
Two tests:
1. Use WebClient.DownloadString, which tries to detect the right encoding from the response headers.
2. See what gets shown using the same URL in a browser.
EDIT: Okay, now that I've seen the text, it's actually got:
\u003cbr /\u003e
The \u part is important here: it is the JSON escape syntax, which states that the next four characters form the hex representation of a UTF-16 code unit.
Any JSON API used to parse that text should perform the unescaping for you.
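As a minimal sketch of that unescaping, the snippet below feeds an escaped JSON string literal to a parser and prints the decoded value. It assumes System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package); the class name JsonUnescapeDemo and the helper Decode are made up for illustration, and other JSON libraries should behave the same way.

```csharp
using System;
using System.Text.Json;

static class JsonUnescapeDemo
{
    // Hypothetical helper: parses a JSON string literal and returns its decoded value.
    public static string Decode(string jsonStringLiteral)
    {
        return JsonSerializer.Deserialize<string>(jsonStringLiteral);
    }

    static void Main()
    {
        // The raw text as it arrives on the wire: a JSON string containing \u escapes.
        string raw = "\"\\u003cbr /\\u003e\\u003cbr /\\u003e\"";

        Console.WriteLine(raw);          // still escaped
        Console.WriteLine(Decode(raw));  // <br /><br />
    }
}
```

Reading the stream with a different StreamReader encoding cannot change this, because the escapes are plain ASCII characters in the response body; only a JSON parser turns them back into < and >.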

Related

WebClient.DownloadString uses wrong encoding

I'm downloading XML files from SharePoint Online using WebClient.
However, when I use the WebClient.DownloadString(string url) method, some characters are not decoded correctly.
When I use WebClient.DownloadFile(string url, string file) and then read the file, all characters are correct.
The XML itself does not contain an encoding declaration.
string wrongXml = webClient.DownloadString(url);
// wrongXml contains Ä™ instead of ę
webClient.DownloadFile(url, @"C:\temp\file1.xml");
string correctXml = File.ReadAllText(@"C:\temp\file1.xml");
// contains ę, like it should.
Also, when I open the URL in Internet Explorer, it is shown correctly.
Why is that? Is it because of the default Windows encoding on my machine, or does WebClient handle responses differently in DownloadString than in DownloadFile?
The encoding WebClient is using now is probably not the one the service returns.
You can set the encoding you expect before you make the request:
webClient.Encoding = Encoding.UTF8;
string previouslyWrongXml = webClient.DownloadString(url);
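To illustrate why the decoder matters, here is a small sketch that decodes the same UTF-8 bytes twice. It assumes the wrong decoder is a single-byte code page; ISO-8859-1 is used as a stand-in because it is always available, whereas Windows-1252 (which would render the second byte as ™, matching the Ä™ in the question) may need an extra encoding provider on newer runtimes. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Hypothetical helper: decodes raw bytes with the given encoding,
    // which is roughly what DownloadString does with WebClient.Encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        // "ę" (U+0119) is two bytes in UTF-8: 0xC4 0x99.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u0119");

        string wrong = Decode(utf8Bytes, Encoding.GetEncoding("iso-8859-1"));
        string right = Decode(utf8Bytes, Encoding.UTF8);

        Console.WriteLine(wrong.Length);       // 2 (one character per byte)
        Console.WriteLine(right == "\u0119");  // True
    }
}
```

DownloadFile writes the bytes untouched, and File.ReadAllText decodes them as UTF-8 by default, which would explain why that route gives the correct text.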

How to open link address containing HEX without converting HEX to digit?

I have links containing percent-encoded (hex) values, and I'm opening them with the code below:
var webRequest = WebRequest.Create(url);
try
{
    WebResponse webResponse = webRequest.GetResponse();
    webResponse.Close();
}
catch (WebException)
{
    // error handling omitted
}
But when the WebRequest is created, the link is converted to a URI and loses its percent-encoded (hex) values.
For example:
original link: /generic=http%3A%2F%2Fnym1.ib.adnxs.com%2Fab%3Fenc%3DRmeXLCKt4T
actual link: /generic=http%3a/nym1.ib.adnxs.com/ab%3fenc%3drmexlckt4t
%2F was converted to /
Is it possible to force the original link to be opened as-is, or is there another way to open such links?
Thanks!

The correct way to create an HTMLDocument using HTMLAgilityPack?

Consider the code below:
string url="http://badoo.com";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
Now I have the HTML string in the result variable. Let's try something:
// save normally
File.WriteAllText("1.html",result);
// save using HTMLAgilityPack
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(result);
hdoc.Save("2.html");
Can someone please tell me why 1.html and 2.html don't look the same, although they have the same file size?
Link to the correct one (file.writealltext() ) : http://woman2.com/1.html
Link to the wrong one (saved with htmlagility pack) : http://woman2.com/2.html
Update:
I have also tried to save the file on local disk and then
hdoc.Load("path/to/local",true);
I have also tried:
hdoc.LoadHtml(result);
And tried:
hdoc.Save("2.html",Encoding.UTF8);
but none of these attempts works for me. I've been struggling with this for 3 days now.
Andrew Morton is correct. The file '1.html' is formed in a way that makes agility pack angry/scared/confused. In all seriousness, I ran your code and diffed the resulting files and here are some of the differences:
Innocuous:
- removes redundant whitespace
- adds closing tags where elements were previously self-closing
Possibly affects the site:
- adds attribute values where none previously existed
- changes language-specific characters
Likely affects the site:
- "fixes" unmatched quotation marks (I would put my money here if I had to guess)
Again, as Andrew mentioned, fix that up first before banging your head against the wall any further.

Reading a web page in a foreign language with StreamReader

I'm trying to fetch a web page that is a mix of English and Korean. The browser can fetch and display the page just fine, but when I try to grab it programmatically I can't get the Korean characters to display properly.
I know that you can specify an Encoding in the StreamReader but I haven't found one that works yet.
This is the code that I'm using to read the response:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet));
response.CharacterSet returns UTF8. I've also tried all of the basic encoding options - ASCII, BigEndian, Default, Unicode, UTF32, UTF7, and manually adding Encoding.UTF8.
I've also tried going about it through the CultureInfo:
CultureInfo kr = CultureInfo.GetCultureInfo("ko");
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(kr.TextInfo.ANSICodePage));
using both "ko" and "ko-KR". I get varied results from all these different types, but none of them are correct.
I've also tried the code page directly:
StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding(949));
response.ContentEncoding returns an empty string. I'm running out of ideas.
Edit: Here is an example of what I'm expecting:
프로젝트:
and here is what I'm getting:
//ASCII == ??????
//BigEndian == ़汩湫â¨ç‰¥æ˜½âˆ¯æ©³â½¤ç°æ”
//Default == íâ€â€žÃ«Â¡Å“ì Â트:
//Unicode == íâ€â€žÃ«Â¡Å“ì Â트
//UTF32 == ���������ï
//UTF7 == 프로ì Â트
//UTF8 == 프로ì 트
FWIW: a stream reader is likely not going to work well.
Prefer the HttpWebRequest class for browser-style requests; otherwise you will soon regret it when you run into 302 responses or gzipped and/or chunked encoding.
I promoted this to an answer, as it might very well be the problem you're already having. I don't know what the response you are getting looks like, of course.
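One way this could look is sketched below, under the assumption that the body may be compressed and that the charset reported by the server may be missing or unrecognised. AutomaticDecompression and CharacterSet are existing HttpWebRequest/HttpWebResponse members; PageFetcher, ResolveEncoding, Fetch and the example URL are names invented for illustration. Whether compression is actually the cause here cannot be told from the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Hypothetical helper: maps a charset name to an Encoding, falling back
    // to UTF-8 when the name is missing or not recognised.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset)) return Encoding.UTF8;
        try { return Encoding.GetEncoding(charset.Trim('"', ' ')); }
        catch (ArgumentException) { return Encoding.UTF8; }
    }

    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Inflate gzip/deflate bodies so the reader sees plain text, not compressed bytes.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(),
                                             ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(ResolveEncoding("utf-8").WebName); // utf-8
        Console.WriteLine(ResolveEncoding("").WebName);      // utf-8
        // string html = Fetch("http://example.com/");       // hypothetical URL
    }
}
```

If the text is still garbled after this, dumping the first few raw bytes of the response is a quick way to see which encoding the server really sent.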

Need help for parsing HTML in C#

For personal use, I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebRequest req = WebRequest.Create(Url);
WebResponse result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
string Line;
// ReadLine returns null at end of stream; testing sr.Read() first would swallow one character per line
while ((Line = sr.ReadLine()) != null)
{
    Line = Regex.Replace(Line, @"<(.|\n)*?>", " "); // strip tags
    Line = Line.Replace(" ", "");
    Line = Line.TrimEnd();
    Line = Line.TrimStart();
}
After that I really don't have a clue: should I go line by line or read the whole stream at once, and how do I retrieve only each team's name together with the number that follows it, which would be the score?
In the end I want to put both teams with their scores in a list or in XML, to use in a phone application.
If anyone has an idea, that would be great. Thanks!
Take a look at Html Agility Pack
You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).
You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.
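A rough sketch of the Regex.Match approach follows. The row layout is an assumption, not taken from the actual page: it supposes that after tag stripping a result line reads "home team, goals, dash, goals, away team". ScoreParser, TryParse and the sample line are hypothetical, and the pattern would need adjusting to the real markup.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScoreParser
{
    // Assumed row layout after tag stripping: "<home> <goals> - <goals> <away>".
    static readonly Regex Row = new Regex(
        @"^(?<home>.+?)\s+(?<homeGoals>\d+)\s*-\s*(?<awayGoals>\d+)\s+(?<away>.+)$");

    // Returns false when the line does not look like a result row.
    public static bool TryParse(string line, out string home, out int homeGoals,
                                out int awayGoals, out string away)
    {
        home = away = null;
        homeGoals = awayGoals = 0;

        Match m = Row.Match(line.Trim());
        if (!m.Success) return false;

        home = m.Groups["home"].Value;
        away = m.Groups["away"].Value;
        homeGoals = int.Parse(m.Groups["homeGoals"].Value);
        awayGoals = int.Parse(m.Groups["awayGoals"].Value);
        return true;
    }

    static void Main()
    {
        // Made-up sample line, not real data from the site.
        if (TryParse("Paris SG 2 - 1 Lyon", out var home, out var hg, out var ag, out var away))
            Console.WriteLine($"{home} {hg}-{ag} {away}"); // Paris SG 2-1 Lyon
    }
}
```

Named groups keep the extraction readable, and lines that do not match (headers, empty rows) are simply skipped by the caller.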
