Need help for parsing HTML in C# - c#

For personal use, I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebResponse result = null;
WebRequest req = WebRequest.Create(Url);
result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
while (sr.Read() != -1)
{
Line = sr.ReadLine();
Line = Regex.Replace(Line, @"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.TrimEnd();
Line = Line.TrimStart();
After that I really don't have a clue. Should I process the page line by line or read the whole stream at once?
And how do I retrieve only each team's name together with the number that follows it, which is the score?
In the end I want to put both teams and their scores into a list or an XML file so I can use them in a phone application.
If anyone has an idea, it would be great. Thanks!

Take a look at Html Agility Pack
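A minimal sketch of how Html Agility Pack might be used for this. The question does not show the page's markup, so the table layout (one row per match with home team, score, and away team cells) is an assumption, and the names ResultsParser and ParseRows are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ResultsParser
{
    // Assumed layout: <tr><td>home team</td><td>score</td><td>away team</td></tr>
    public static List<string[]> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null) return rows;

        foreach (HtmlNode tr in trNodes)
        {
            HtmlNodeCollection cells = tr.SelectNodes("./td");
            if (cells == null || cells.Count < 3) continue; // skip header or spacer rows

            rows.Add(new[]
            {
                HtmlEntity.DeEntitize(cells[0].InnerText).Trim(), // home team
                HtmlEntity.DeEntitize(cells[1].InnerText).Trim(), // score
                HtmlEntity.DeEntitize(cells[2].InnerText).Trim()  // away team
            });
        }
        return rows;
    }
}
```

Calling ResultsParser.ParseRows with the downloaded page text gives one string array per match, which is easy to turn into a list or XML afterwards. The XPath expressions would need adjusting to the real page.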

You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
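As a rough sketch of the LINQ to XML route, assuming the page happens to be well-formed XML and declares no XML namespace (both are assumptions, and XhtmlQuery and CellTexts are illustrative names):

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XhtmlQuery
{
    // Returns the trimmed text of every <td> element in document order.
    public static string[] CellTexts(string xhtml)
    {
        // XDocument.Parse throws XmlException when the markup is not well-formed XML,
        // which is the limitation mentioned above.
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Descendants("td")
                  .Select(td => td.Value.Trim())
                  .ToArray();
    }
}
```

If the document declares the XHTML namespace, the element name has to be qualified with an XNamespace, otherwise Descendants("td") finds nothing.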

You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).

You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.
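Here is a small sketch of the regex approach. The row format in the pattern (team cell, a score cell such as "2 - 1", team cell) is an assumption about the page, and ScoreRegex and Extract are names invented for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScoreRegex
{
    // Assumed row shape: <td>Home</td><td>2 - 1</td><td>Away</td>
    private static readonly Regex Row = new Regex(
        @"<td[^>]*>\s*(?<home>[^<]+?)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<hg>\d+)\s*-\s*(?<ag>\d+)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<away>[^<]+?)\s*</td>",
        RegexOptions.IgnoreCase);

    // Returns one "Home 2-1 Away" string per matching row.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Row.Matches(html))
        {
            results.Add(m.Groups["home"].Value + " " +
                        m.Groups["hg"].Value + "-" + m.Groups["ag"].Value + " " +
                        m.Groups["away"].Value);
        }
        return results;
    }
}
```

This works on the whole page text at once, so there is no need to read the stream line by line. It is brittle, though: any change in the markup means the pattern has to be updated.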

Related

The correct way to create an HTMLDocument using HTMLAgilityPack?

Consider the code below:
string url="http://badoo.com";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
Now I have the htmlstring inside the result variable, let's try something:
// save normally
File.WriteAllText("1.html",result);
// save using HTMLAgilityPack
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(result);
hdoc.Save("2.html");
Can someone please tell me why 1.html and 2.html don't look the same, even though they have the same file size?
Link to the correct one (file.writealltext() ) : http://woman2.com/1.html
Link to the wrong one (saved with htmlagility pack) : http://woman2.com/2.html
Update:
I have also tried to save the file on local disk and then
hdoc.Load("path/to/local",true);
I have also tried:
hdoc.LoadHtml(result);
And tried:
hdoc.Save("2.html",Encoding.UTF8);
but none of these attempts seems to work for me. I've been struggling with this for 3 days now.
Andrew Morton is correct. The file '1.html' is formed in a way that makes agility pack angry/scared/confused. In all seriousness, I ran your code and diffed the resulting files and here are some of the differences:
Innocuous:
removes whitespace redundancy
adds closing tags where the tags were previously self-closing
Possibly affect the site:
adds attribute values where none previously existed
changes language-specific characters
Likely affect the site:
"fixes" unmatched quotations (I would put my money here if I had to guess)
Again, as Andrew mentioned, fix that up first before banging your head against the wall any further.
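One way to find the markup that needs fixing is to ask the Agility Pack what it complained about while parsing. This is only a sketch: MarkupChecker and DescribeParseErrors are illustrative names, and which problems get reported depends on the library version.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class MarkupChecker
{
    // Lists the problems HtmlAgilityPack recorded while parsing the given HTML.
    public static List<string> DescribeParseErrors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var messages = new List<string>();
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            messages.Add("Line " + error.Line + ", column " + error.LinePosition +
                         ": " + error.Reason);
        }
        return messages;
    }
}
```

Passing the downloaded string from the question (the result variable) should point at the lines where the library had to guess, which is where the saved copy is most likely to differ from the original.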

HtmlAgilityPack - Convert MHTML To HTML as String

I have an MHTML file and I am trying to convert it to HTML.
I have installed the HtmlAgilityPack and tried the following code:
var doc = new HtmlAgilityPack.MixedCodeDocument();
doc.Load("C:\\Users\\DickTracey\\Downloads\\Club Membership Report.mhtml");
var ms = new MemoryStream();
var sw = new StreamWriter(ms);
doc.Save(sw);
ms.Position = 0;
var sr = new StreamReader(ms);
return sr.ReadToEnd();
But it always returns null.
Can anyone explain the correct procedure to convert MHTML to HTML please?
MHTML to HTML Decoding in C#!
string mhtml = "This is your MHTML string"; // Make sure the string is in UTF-8 encoding
MHTMLParser parser = new MHTMLParser(mhtml);
string html = parser.getHTMLText(); // This is the converted HTML
git link : https://github.com/DavidBenko/MHTML-to-HTML-Decoding-in-C-Sharp.git
I had a quick look at an MHTML file with HxD. Although, as noted above, HtmlAgilityPack has little or no support for MHTML, the format itself looks simple enough. It appears to consist of the usual suspects (unencoded HTML, CSS, JS, graphics encoded in Base64, etc) concatenated in a way (with mime type headers) that could be worked out with a little effort. Having said that, the format is probably fully documented somewhere -- so dust off your browser, write some C# to parse it, and spoon-feed HtmlAgilityPack with the results.
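To give an idea of what "a little effort" could look like, here is a rough sketch that pulls the first text/html part out of an MHTML string. It is not a full MIME parser: it assumes a single-level multipart document, a UTF-8 HTML part, and only the base64 and quoted-printable transfer encodings. MhtmlSketch and ExtractHtml are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class MhtmlSketch
{
    // Returns the decoded body of the first text/html part, or null if none is found.
    public static string ExtractHtml(string mhtml)
    {
        string text = mhtml.Replace("\r\n", "\n"); // normalize line endings

        // The boundary is declared in the top-level Content-Type header.
        Match boundary = Regex.Match(text, "boundary=\"?([^\";\\n]+)\"?", RegexOptions.IgnoreCase);
        if (!boundary.Success) return null;
        string delimiter = "--" + boundary.Groups[1].Value;

        foreach (string part in text.Split(new[] { delimiter }, StringSplitOptions.RemoveEmptyEntries))
        {
            string trimmed = part.TrimStart('\n');
            int blank = trimmed.IndexOf("\n\n", StringComparison.Ordinal);
            if (blank < 0) continue; // no header/body separator in this chunk

            string headers = trimmed.Substring(0, blank);
            string body = trimmed.Substring(blank + 2);

            var options = RegexOptions.IgnoreCase | RegexOptions.Multiline;
            if (!Regex.IsMatch(headers, @"^Content-Type:\s*text/html", options)) continue;

            Match enc = Regex.Match(headers, @"^Content-Transfer-Encoding:\s*(\S+)", options);
            string encoding = enc.Success ? enc.Groups[1].Value.ToLowerInvariant() : "7bit";

            if (encoding == "base64")
                return Encoding.UTF8.GetString(Convert.FromBase64String(body.Replace("\n", "").Trim()));
            if (encoding == "quoted-printable")
                return DecodeQuotedPrintable(body);
            return body; // 7bit / 8bit: use as-is
        }
        return null;
    }

    private static string DecodeQuotedPrintable(string input)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '=' && i + 1 < input.Length && input[i + 1] == '\n')
            {
                i += 1; // soft line break: "=\n" joins two lines
            }
            else if (c == '=' && i + 2 < input.Length &&
                     Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16)); // "=C3" -> 0xC3
                i += 2;
            }
            else
            {
                bytes.Add((byte)c); // quoted-printable bodies are plain ASCII
            }
        }
        return Encoding.UTF8.GetString(bytes.ToArray()); // assumption: the part is UTF-8
    }
}
```

The returned string can then be handed to HtmlAgilityPack's LoadHtml. Images and stylesheets live in the other parts and would need the same treatment if they matter.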

Pulling data from a webpage, parsing it for specific pieces, and displaying it

I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer on this one.
I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, list a game they have and want to trade, and accept trades from others or request a trade.
We have the site functioning long ahead of schedule so we're trying to add more to the site. One thing I want to do myself is to link the games that are put in to Metacritic.
Here's what I need to do. I need to (using ASP.NET and C# in Visual Studio 2012) get the correct game page on Metacritic, pull its data, parse it for specific parts, and then display the data on our page.
Essentially when you choose a game you want to trade for we want a small div to display with the game's information and rating. I'm wanting to do it this way to learn more and get something out of this project I didn't have to start with.
I was wondering if anyone could tell me where to start. I don't know how to pull data from a page. I'm still trying to figure out if I need to try and write something to automatically search for the game's title and find the page that way or if I can find some way to go straight to the game's page. And once I've gotten the data, I don't know how to pull the specific information I need from it.
One of the things that doesn't make this easy is that I'm learning C++ along with C# and ASP.NET, so I keep getting my wires crossed. If someone could point me in the right direction it would be a big help. Thanks
This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements.
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
var web = new HtmlAgilityPack.HtmlWeb();
HtmlDocument doc = web.Load(url);
string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
}
An easy way to obtain the XPath for a given element is by using your web browser (I use Chrome) Developer Tools:
Open the Developer Tools (F12 or Ctrl + Shift + C on Windows or Command + Shift + C for Mac).
Select the element in the page that you want the XPath for.
Right click the element in the "Elements" tab.
Click on "Copy as XPath".
You can paste it exactly like that in C# (as shown in my code), but make sure to escape the quotes.
Make sure you add some error handling, because web scraping will start failing as soon as the site changes the HTML structure of the page.
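A small sketch of what that error handling could look like. SelectNodes and SelectSingleNode return null when the XPath matches nothing, so indexing with [0] as in the code above would throw after a page redesign. SafeScrape and TextOrDefault are names invented for this example.

```csharp
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SafeScrape
{
    // Returns the trimmed text of the first node matching the XPath,
    // or the fallback value when the page no longer contains such a node.
    public static string TextOrDefault(HtmlDocument doc, string xpath, string fallback)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null) return fallback;

        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

With this helper the metascore lookup becomes SafeScrape.TextOrDefault(doc, "...", "n/a") and the page can show a placeholder instead of crashing. The call to web.Load can also fail on network errors, so wrapping it in a try/catch is worth considering too.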
Edit
Per @knocte's suggestion, here is the link to the NuGet package for HtmlAgilityPack:
https://www.nuget.org/packages/HtmlAgilityPack/
I looked and Metacritic.com doesn't have an API.
You can use an HttpWebRequest to get the contents of a website as a string.
using System;
using System.Net;
using System.IO;
using System.Text; // needed for Encoding.UTF8
using System.Windows.Forms;
string result = null;
string url = "http://www.stackoverflow.com";
WebResponse response = null;
StreamReader reader = null;
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
response = request.GetResponse();
reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
result = reader.ReadToEnd();
}
catch (Exception ex)
{
// handle error
MessageBox.Show(ex.Message);
}
finally
{
if (reader != null)
reader.Close();
if (response != null)
response.Close();
}
Then you can parse the string for the data that you want by taking advantage of Metacritic's use of meta tags. Here's the information they have available in meta tags:
og:title
og:type
og:url
og:image
og:site_name
og:description
The format of each tag is: meta name="og:title" content="In a World..."
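As a sketch, the tags could be pulled out of the downloaded string with a regular expression. The pattern assumes the attribute order shown above (name first, then content) and double-quoted values; MetaTags and ReadOpenGraph are illustrative names.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MetaTags
{
    // Matches: <meta name="og:something" content="value"
    private static readonly Regex OgTag = new Regex(
        @"<meta\s+name=""(?<key>og:[^""]+)""\s+content=""(?<value>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns a dictionary such as { "og:title" -> "In a World..." }.
    public static Dictionary<string, string> ReadOpenGraph(string html)
    {
        var tags = new Dictionary<string, string>();
        foreach (Match m in OgTag.Matches(html))
        {
            // HtmlDecode turns entities such as &amp; back into plain characters.
            tags[m.Groups["key"].Value] = WebUtility.HtmlDecode(m.Groups["value"].Value);
        }
        return tags;
    }
}
```

Passing the result string from the request above gives the title, image URL, and description without walking the whole document.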
I recommend Dcsoup. There's a NuGet package for it and it uses CSS selectors, so it is familiar if you use jQuery. I've tried others but it is the best and easiest to use that I've found. There's not much documentation, but it's open source and a port of the Java jsoup library, which has good documentation. (Documentation for the .NET API here.) I absolutely love it.
var timeoutInMilliseconds = 5000;
var uri = new Uri("http://www.metacritic.com/game/pc/fallout-4");
var doc = Supremes.Dcsoup.Parse(uri, timeoutInMilliseconds);
// <span itemprop="ratingValue">86</span>
var ratingSpan = doc.Select("span[itemprop=ratingValue]");
int ratingValue = int.Parse(ratingSpan.Text);
// selectors match both critic and user scores
var scoreDiv = doc.Select("div.score_summary");
var scoreAnchor = scoreDiv.Select("a.metascore_anchor");
int criticRating = int.Parse(scoreAnchor[0].Text);
float userRating = float.Parse(scoreAnchor[1].Text);
I'd recommend WebsiteParser. It's based on HtmlAgilityPack (mentioned by Hanlet Escaño) but it makes web scraping easier with attributes and CSS selectors:
class PersonModel
{
[Selector("#BirdthDate")]
[Converter(typeof(DateTimeConverter))]
public DateTime BirdthDate { get; set; }
}
// ...
PersonModel person = WebContentParser.Parse<PersonModel>(html);
Nuget link

Parsing XML in a C# Application?

Right now, I am getting a Google search's XML. However, the XML doc is so big, I can't find anything anywhere. I am wondering how I can find the answer on Google. By that, I mean when you Google "Capital of Florida" the box at the top says Tallahassee. I want to access that information but I am unsure how.
var request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
var response = request.GetResponse();
var rstream = response.GetResponseStream();
var sr = new StreamReader(rstream);
var json = sr.ReadToEnd();
Console.WriteLine(json.ToString());
The last Console.WriteLine obviously just prints out a huge monster of an XML doc.
See this article, which uses LINQ to extract a piece of info from XML documents: https://coderwall.com/p/qghcqw
If you are requesting HTML, a good way to parse the data is to use HtmlAgilityPack:
http://htmlagilitypack.codeplex.com/

HttpWebRequest and Unicode characters

I am using this code:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
string result = null;
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
{
StreamReader reader = new StreamReader(resp.GetResponseStream());
result = reader.ReadToEnd();
reader.Close();
}
In result I get text like 003cbr /003e003cbr /003e (I think this should be two line breaks instead). I tried the 2- and 3-parameter overloads of the StreamReader constructor, but the string was the same. (The request returns a JSON string.)
Why am I getting those characters, and how can I avoid them?
It's not really clear what that text is, but you're not specifying an encoding at the moment. What content encoding is the server using? StreamReader will default to UTF-8.
It sounds like actually you're getting some sort of oddly-encoded HTML, as U+003C is < and U+003E is >, giving <br /><br /> as the content. That's not JSON...
Two tests:
Use WebClient.DownloadString, which will detect the right encoding to use
See what gets shown using the same URL in a browser
EDIT: Okay, now that I've seen the text, it's actually got:
\u003cbr /\u003e
The \u part is important here - that's part of the JSON escape syntax, which states that the next four characters form the hex representation of a UTF-16 code unit.
Any JSON API used to parse that text should perform the unescaping for you.
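For illustration, here is a minimal sketch using System.Text.Json (built into .NET Core 3.0 and later, available as a NuGet package for older frameworks). Json.NET or any other parser would do the same job. The property name "text" is hypothetical, since the question does not show the JSON structure, and JsonUnescape is an illustrative name.

```csharp
using System.Text.Json;

public static class JsonUnescape
{
    // Parses the JSON text and returns the named top-level string property,
    // with \uXXXX escapes already turned back into real characters.
    public static string ReadStringProperty(string json, string propertyName)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty(propertyName).GetString();
        }
    }
}
```

Given the JSON text {"text":"\u003cbr /\u003e\u003cbr /\u003e"}, ReadStringProperty(json, "text") returns the string <br /><br />.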
