I'm reading a Dutch webpage :
HttpWebRequest oReq = (HttpWebRequest)WebRequest.Create(website);
oReq.Method = "GET";
HttpWebResponse resp = (HttpWebResponse)oReq.GetResponse();
HtmlDocument doc;
doc.Load(resp.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"));
When I get the text of some random element within the page I get some weird characters not the Dutch ones I see in Chrome:
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
if(node != null)
{
MessageBox.Show(node.InnerText, "--- just scrapped some xpath ---");
}
Instead of café I get café
How do I solve this? I get the same text when writting it to a file, when I assign it to a richtextbox, etc ,etc, the same broken text.
Change the encoding to Unicode, e.g. utf-8
Related
I'm trying on form load, to make it count the number of lines in a pastebin raw & return the value to a textbox. Been racking my brains and still cant figure it out.
textBox1.Text = new WebClient().DownloadString("yourlink").
I'm expanding my comment to an answer.
As already mentioned, you need a HttpRequest or WebRequest to get the content of your string.
Maybe new WebClient().DownloadString(url);, but I prefer to use the WebRequest since it's also supported in .NET Core.
What you need to do is, extract the content of the RAW TextArea object from html. I know, people will probably hate me for that, but I used regex for that task. Alternatively you can use a html parser.
The Raw data is contained within a textarea with following attributes:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
So the regex pattern looks like this:
private static string rgxPatternPasteBinRawContent = #"<textarea id=""paste_code"" class=""paste_code"" name=""paste_code"" onkeydown=""return catchTab\(this,event\)"">(.*)<\/textarea>";
Since the html code is spread over multiple lines, our Regex has to be use with a single line option.
Regex rgx = new Regex(rgxPatternPasteBinRawContent, RegexOptions.Singleline);
Now find the match, that contains the RAW data:
string htmlContent = await GetHtmlContentFromPage("SomePasteBinURL");
//Possibly your new WebClient().DownloadString("SomePasteBinURL");
//await not necesseraly needed here!
Match match = rgx.Match(htmlContent);
string rawContent = "ERROR: No Raw content found!";
if (match.Groups.Count > 0)
{
rawContent = match.Groups[1].Value;
}
int numberOfLines = rawContent.Split('\n').Length + 1;
And you're done.
The WebRequest looks like this for me:
private static async Task<string> GetHtmlContentFromPage(string url)
{
WebRequest request = WebRequest.CreateHttp(url);
WebResponse response = await request.GetResponseAsync();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = null;
readStream = new StreamReader(receiveStream);
string data = readStream.ReadToEnd();
response.Dispose();
readStream.Dispose();
return data;
}
got a trouble with HtmlAgilityPack. I can't parse Cyrillic text, it's appears as some unknown symbols.
HtmlWeb webGet = new HtmlWeb();
webGet.OverrideEncoding = Encoding.UTF8;
HtmlAgilityPack.HtmlDocument doc = webGet.Load("http://vk.com/glitchhop");
HtmlNode myNode = doc.DocumentNode.SelectSingleNode("//div[#id='page_wall_posts']/*[2]//div[#class='wall_post_text']");
if (myNode != null)
return myNode.InnerText;
else return "Nothing found";
Also attach example of error and how that text should be looks like
This problem is not related to HTMLAgilityPack, it is caused by incorrect encoding you're using.
Webpage you're trying to parse is encoded using windows-1251 encoding.
So changing webGet.OverrideEncoding from Encoding.UTF8 to Encoding.GetEncoding(1251) should help you.
I'm getting geografic info from a webservice.
I'm trying to parse the return data for hours, but have been getting no where.
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
StreamReader reader = new StreamReader(response.GetResponseStream());
string result = reader.ReadToEnd();
XDocument document = XDocument.Parse(result, LoadOptions.None);
i got this
document <html>
<body>
<state>Apure</state>
<municipality>RÓMULO GALLEGOS</municipality>
<parish>URBANA ELORZA</parish>
<street>La Trinidad De Arauca</street>
</body>
</html> System.Xml.Linq.XDocument
I try
document.Elements("state")
document.Descendants("body")
document.GetElementsByTagName("state");
But nothing.
I'm sure there is a simple way of do something so basic.
I'm seriously considering convert that to a string and do the parsing myself.
Aditional consideration:
The fields include it in the result is variable.
Because some info doesnt have all fields.
Ok, I make a change.
I read a XElement instead of a XDocument;
XElement sitemap = XElement.Parse(result, LoadOptions.None);
foreach (var bodyElement in sitemap.Elements("body"))
{
foreach (var fieldElement in bodyElement.Elements())
{
Console.WriteLine(fieldElement.Name);
Console.WriteLine(fieldElement.Value);
}
}
Probably there is a way to skip the first foreach, but still looking for it.
#Jonesy line works but that mean I have to know the fields names. This way i just create the info for the values I got.
i have a problem. My goal is to save some Text from a (Japanese Shift-JS encoded)html into a utf8 encoded text file.
But i don't really know how to encode the text.. The HtmlNode object is encoded in Shift-JS. But after i used the ToString() Method, the content is corrupted.
My method so far looks like this:
public String getPage(String url)
{
String content = "";
HtmlDocument page = new HtmlWeb(){AutoDetectEncoding = true}.Load(url);
HtmlNode anchor = page.DocumentNode.SelectSingleNode("//div[contains(#class, 'article-def')]");
if (anchor != null)
{
content = anchor.InnerHtml.ToString();
}
return content;
}
I tried
Console.WriteLine(page.Encoding.EncodingName.ToString());
and got: Japanese Shift-JIS
But converting the html into a String produces the error. I thought there should be a way, but since documentation for html-agility-pack is sparse and i couldn't really find a solution via google, i'm here too get some hints.
Well, AutoDetectEncoding doesn't really work like you'd expect it to. From what i found from looking at the source code of the AgilityPack, the property is only used when loading a local file from disk, not from an url.
So there's three options. One would be to just set the Encoding
OverrideEncoding = Encoding.GetEncoding("shift-jis")
If you know the encoding will always be the same that's the easiest fix.
Or you could download the file locally and load it the same way you do now but instead of the url you'd pass the file path.
using (var client=new WebClient())
{
client.DownloadFile(url, "20130519-OYT1T00606.htm");
}
var htmlWeb = new HtmlWeb(){AutoDetectEncoding = true};
var file = new FileInfo("20130519-OYT1T00606.htm");
HtmlDocument page = htmlWeb.Load(file.FullName);
Or you can detect the encoding from your content like this:
byte[] pageBytes;
using (var client = new WebClient())
{
pageBytes = client.DownloadData(url);
}
HtmlDocument page = new HtmlDocument();
using (var ms = new MemoryStream(pageBytes))
{
page.Load(ms);
var metaContentType = page.DocumentNode.SelectSingleNode("//meta[#http-equiv='Content-Type']").GetAttributeValue("content", "");
var contentType = new System.Net.Mime.ContentType(metaContentType);
ms.Position = 0;
page.Load(ms, Encoding.GetEncoding(contentType.CharSet));
}
And finally, if the page you are querying returns the content-Type in the response you can look here for how to get the encoding.
Your code would of course need a few more null checks than mine does. ;)
I make a simple web scraper that scrapes lyrics for me then writes it to a database. everything works but for some reason it's replacing some characters with question marks and when I view this information on a simple php web page I'm seeing a lot of mistakes in the lyrics.
I?m = I'm
Let?s = Let's
haven?t = haven't
stuff like that.
I know the error is in c# and my code because I put a breakpoints before it writes to the database and I display it in a rich text box. How would I get it to display these characters correctly?
public static string getSourceCode(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
sr.Close();
resp.Close();
return sourceCode;
}
........
string url = txbURL2.Text;
string sourceCode = sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf(" <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric
The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF8, but somewhere along the line you're converting to ASCII.
Check out the excellent article called "What every developer should know about character encoding" for more details.
Update
You could try this, although the StreamReader should default to UTF-8 anyway:
var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);
Check the encoding by searching for charset in the html code.
Your code snipplet misses the actual load process, so it is impossible to tell where it goes wrong.
You can also try using the WebClient:
WebClient client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);