Parsing a resume table with HtmlAgilityPack - C#

Can you tell me how to parse data from this link?
http://www.e1.ru/business/job/resume.detail.php?id=956004
I tried something like this:
var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/table[5]/tbody[1]/tr[1]/td[2]/table[4]/tbody[1]/tr[1]/td[1]/table[1]");
but it is not a good approach.

Abbath, I recommend using a third-party tool that can extract data from HTML, and then pulling the data you need out of its output. Examples include eGrabber, RChilli and many more.
If you are looking for your own solution, then build an index of the complete text, treat it as XML, study the DOM structure and pick out the values you need.
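One common reason an absolute path like the one in the question fails is that browsers often insert tbody elements into the DOM they display, while the HTML the server actually sends (which is what HtmlAgilityPack parses) may not contain them. Below is a minimal sketch of a more tolerant approach, assuming the resume data sits in ordinary table rows. The TableReader class, the ReadRows method and the sample markup in the usage note are hypothetical and are not taken from the page in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableReader
{
    // Returns the trimmed, entity-decoded text of every cell, row by row.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // "//tr" matches rows at any depth, with or without a tbody parent.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//tr");
        if (trNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trNodes)
        {
            // Look only at the row's direct td/th children.
            List<string> cells = tr.ChildNodes
                .Where(c => c.Name == "td" || c.Name == "th")
                .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
                .ToList();

            if (cells.Count > 0)
            {
                rows.Add(cells);
            }
        }
        return rows;
    }
}
```

Calling TableReader.ReadRows on a string such as "<table><tr><td>Name</td><td>Ivan &amp; Co</td></tr></table>" gives one row with the cells "Name" and "Ivan & Co". On a page with nested tables, "//tr" also matches the inner rows, so in practice you would narrow the XPath to the table you care about (for example by an id or class attribute, if the page has one).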

Related

Separate a certain piece of data from html with given start and endpoints

I am learning screen scraping using C#, and I was wondering how I can separate out certain pieces of the HTML I gather.
I am using the HtmlAgilityPack and ScrapySharp libraries for scraping, so with this code I can retrieve an HTML page:
WebPage PageResult = Browser.NavigateToPage(new Uri("localhost"));
Console.WriteLine(PageResult);
Of course I get back the whole source code with all the markup and mishmash, but what if I wanted to capture only the data between <h2></h2> tags and omit everything else?
My very simple-minded pseudo code would be:
If result reads h2
Trim all behind
start writing out after
If result reads /h2
stop writing
Trim anything that comes after
My main question is how to express the rule: when I read h2, trim everything before it and write out the data that follows, and when /h2 appears, stop and trim the rest of the result.
There are a few ways you can achieve this. One would be to read the page as XML and parse out the data you are looking for.
This can be done using:
XElement
XmlElement
XDocument
etc.
The second way would be to use a third-party library like HtmlAgilityPack, which also supports XPath:
var nodes = doc.DocumentNode.SelectNodes("//form//input");
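Applied to the h2 case from the question, the XPath approach might look like the sketch below. It assumes you already have the page source as a string; the HeadingExtractor class and GetH2Texts method are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HeadingExtractor
{
    // Returns the text content of every <h2> element, in document order.
    public static List<string> GetH2Texts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h2");
        if (nodes == null)
        {
            return new List<string>();
        }

        // InnerText drops nested tags; DeEntitize turns &amp; and friends back into characters.
        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToList();
    }
}
```

With this there is no need to track "start writing" and "stop writing" states by hand: the parser finds the element boundaries, and everything outside the h2 elements is simply never selected.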

XML: Searching elements for specific text using C#

I'm trying to get a list of PDF links from different websites. First I'm using the Web client class to download the page source. I then use sgmlReader to convert the HTML to XML. So for one particular site, I'll get a tag that looks like this:
<p>1985 to 1997 Board Action Summary</p>
I need to grab all the links that contain ".pdf". Obviously not all websites are laid out the same, so just searching for a <p> tag won't be dynamic enough. I'd rather not use LINQ, but I will if I have to. Thanks in advance.
LINQ makes this easy...
// Requires: using System.Linq; using System.Xml.Linq;
var hrefs = doc.Root.Descendants("a")
.Where(a => a.Attribute("href").Value.ToUpper().EndsWith(".PDF"))
.Select(a => a.Attribute("href"));
Away you go! (Note: I did this from memory, so you might have to fix it somewhat.)
This will break down for <a/> tags that don't have an href (anchors) but you can fix that surely...
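The missing-href case can be handled by casting the attribute to string, which yields null when the attribute is absent, and filtering on that. This is a minimal sketch, assuming the converted page has been loaded into an XDocument; the PdfLinks class and Find method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class PdfLinks
{
    // Returns the href of every <a> element whose href ends with ".pdf" (any letter case).
    public static List<string> Find(XDocument doc)
    {
        return doc.Descendants()
            // Compare on LocalName so the query also works if the elements carry an XML namespace.
            .Where(e => e.Name.LocalName == "a")
            // The explicit string conversion returns null when there is no href attribute.
            .Select(e => (string)e.Attribute("href"))
            .Where(href => href != null
                && href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

Because the query looks at every a element regardless of where it sits, it does not depend on the links being wrapped in a p tag or any other particular layout.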
I think you have 2 options here. If you need only the links, you can use Regular Expressions to find the matches for strings ending with .pdf. If you need to manipulate the XML structure or get other values from the XML, it would be better to use XmlDocument and use an XPath query to find out the nodes which have a link to a pdf file in it. Using LINQ to XML just reduces the number of lines of code you have to write.

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.
Use the HTML Agility Pack library.
It's a very good library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local, web)");
// Select every text node under <body>; returns an HtmlNodeCollection, or null if nothing matches
var result = doc.DocumentNode.SelectNodes("//body//text()");
if (result != null)
{
foreach (var node in result)
{
string achievedText = node.InnerText; // the text you want
}
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)
You can strip tags using a regular expression such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex(@"<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle escape sequences like &amp;.
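A minimal sketch that combines tag stripping with decoding of escape sequences, using System.Net.WebUtility.HtmlDecode from the framework. The TagStripper class and ToPlainText method are hypothetical names, and the caveats about JavaScript and angle brackets inside attributes still apply.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Removes anything that looks like a tag, then decodes HTML entities.
    public static string ToPlainText(string html)
    {
        // "<[^>]+>" matches from "<" up to the next ">".
        string withoutTags = Regex.Replace(html, @"<[^>]+>", string.Empty);

        // Decode after stripping, so an escaped "&lt;b&gt;" in the text is not mistaken for a tag.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, TagStripper.ToPlainText("<b>cake</b> &amp; tea") returns "cake & tea".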

Read specific text from page into string array in C#

I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!
You could use something like this -
var text = "12 hello 45 yes 890 bye 999";
var matches = System.Text.RegularExpressions.Regex.Matches(text, @"\d+").Cast<System.Text.RegularExpressions.Match>().Select(m => m.Value).ToList(); // needs using System.Linq;
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
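Adapted to the question, a sketch along the same lines might collect article titles into a string array. The pattern is an assumption: it supposes each title is the plain-text content of an h2 element, which will not be true of every page. The TitleScraper class and GetTitles method are hypothetical names.

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static class TitleScraper
{
    // Returns the contents of every <h2>...</h2> pair, in page order.
    public static string[] GetTitles(string html)
    {
        // Assumed pattern: titles live in <h2> elements without nested tags.
        // Singleline lets "." span line breaks; "(.*?)" is lazy so each match stops at the first </h2>.
        return Regex.Matches(html, @"<h2[^>]*>(.*?)</h2>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline)
            .Cast<Match>()
            // HtmlDecode restores special characters that the page wrote as entities.
            .Select(m => WebUtility.HtmlDecode(m.Groups[1].Value).Trim())
            .ToArray();
    }
}
```

Decoding the entities is what keeps special characters intact: a title written in the source as "Tom &amp; Jerry" comes back as "Tom & Jerry".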
If the page is well-formed XML, you could use LINQ to XML: load the page into an XDocument, use XPath or another way of traversing to the element(s) you want, and load what you need into the array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution that could break at any time, because even a subtle change can break the well-formedness of the XML. If that's the case, you're probably better off using regular expressions. Either way, though, the page could be changed under you and your code will suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.

Convert HTML to XML with WP7

Simple situation: I want to search through an HTML string and pull out a few pieces of information.
It gets annoying writing masses of .Substring and .IndexOf calls for each element I want to find and cut out of the HTML file.
AFAIK I'm unable to load DLLs such as HTML Tidy or HTML Agility Pack into my WP7 project, so is there a more efficient and reliable way to search through my HTML string than building substrings with IndexOf?
void client_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
string document = string.Empty;
using (var reader = new StreamReader(e.Result))
document = reader.ReadToEnd();
string temp = document.Substring(document.IndexOf("Games Played"), (document.IndexOf("League Games") - document.IndexOf("Games Played")));
temp = (temp.Substring(temp.IndexOf("<span>"), (temp.IndexOf("</span>") - temp.IndexOf("<span>")))).Remove(0, 6);
Int32.TryParse(temp, out leaugeGamesPlayed);
}
Thanks for your help
Gpx
You can use the HTML Agility Pack but you need the converted version of HTML Agility Pack for the Phone. It's only available from svn repository but it works great, I use it in my app.
http://htmlagilitypack.codeplex.com/SourceControl/changeset/view/77494#
You can find two projects under trunk named HAPPhone and HAPPhoneTest. You can use the download button to the right to get the code. It uses Linq instead of XPath to work.
You could use LINQ to parse the HTML and locate the elements that you're interested in. For example:
XDocument parsed = XDocument.Parse(document);
var spans = parsed.Descendants("span");
Beth Massi has a great blog post: Querying HTML with LINQ to XML
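A slightly fuller sketch of the LINQ to XML idea, assuming the downloaded markup is well-formed XML (XDocument.Parse throws an XmlException otherwise) and that the elements carry no XML namespace. The SpanReader class and FirstSpanNumber method are hypothetical names.

```csharp
using System.Xml.Linq;

static class SpanReader
{
    // Returns the first integer found inside a <span>, or null if there is none.
    public static int? FirstSpanNumber(string xhtml)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed XML.
        XDocument parsed = XDocument.Parse(xhtml);

        // Descendants("span") only matches span elements that are in no namespace.
        foreach (XElement span in parsed.Descendants("span"))
        {
            int value;
            if (int.TryParse(span.Value.Trim(), out value))
            {
                return value;
            }
        }
        return null;
    }
}
```

This replaces the chained Substring and IndexOf calls with a query over elements, but it only helps when the page really is valid XML; ordinary HTML frequently is not.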
Assuming you're doing this because you're getting the HTML from a web site/page/server.
Don't convert it on the device.
Create a wrapper/proxy site/server/page to do the conversion for you. While this has the downside of having to create the extra service, it has the following advantages:
Code on the server will be easier to update than code within a distributed app. (Experience with parsing HTML you don't directly control will show that you will need to change your parsing, as the original HTML is almost certain to throw something unexpected at you when it changes in the future.)
If you do the conversion once on the server, you can cache the result rather than having every instance of the app repeat the conversion.
By virtue of the above 2 points, the app will run faster!
If you have the HTML file at design/build time then convert it to something easier to work with and avoid unnecessary computation at run time.
As a workaround, you could consider loading the HTML into a WebBrowser control and then querying the DOM via injected JavaScript (which calls back to .NET).
