Getting a "summary" of a webpage

I have something of a hairy problem: I'd like to generate a couple of paragraphs of "description" for a given URL, normally the start of an article. The meta description field is one way to go, but it isn't always good or set properly.
It's fair to say it's a bit problematic to accomplish this from the screen-scraped HTML. I had a general idea that perhaps one could scan the HTML for the first "appropriate" segment, but it's hard to say what that is; perhaps something like the first paragraph containing a certain amount of text...
Anyone have any good ideas? :) It doesn't have to be foolproof.

So, you wanna become a new Google, heh? :-)
Many sites are "SEO friendly" these days. This enables you to go for the headings and then look for the paragraphs below them.
Also, look for lists. There is a lot of content in some sort of tab-like (tabs, accordions...) interfaces that is done using ordered or unordered lists.
If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
As a side note, I've used HtmlAgilityPack to parse and search through HTML with success. Well, at least it beats parsing with regex :-)

Perhaps look for the div element that contains the most p elements, and then grab the first p child. If no div, get the first p from the body element.
This will always have its problems.
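A minimal sketch of the "div with the most p elements" idea, assuming the page has already been converted to well-formed XML without namespaces (for example by a tolerant HTML parser). SummaryFinder and FirstParagraph are made-up names for illustration, and ties between divs are resolved in document order:

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SummaryFinder
{
    // Returns the text of the first <p> inside the <div> that has the most
    // direct <p> children. Falls back to the first <p> anywhere in <body>.
    // Returns null when the document has no <body> or no paragraphs.
    public static string FirstParagraph(string xhtml)
    {
        XElement body = XDocument.Parse(xhtml).Descendants("body").FirstOrDefault();
        if (body == null) return null;

        // OrderByDescending is a stable sort, so equal counts keep document order.
        XElement bestDiv = body.Descendants("div")
            .OrderByDescending(d => d.Elements("p").Count())
            .FirstOrDefault(d => d.Elements("p").Any());

        XElement p = bestDiv != null
            ? bestDiv.Elements("p").First()
            : body.Descendants("p").FirstOrDefault();

        return p == null ? null : p.Value.Trim();
    }
}
```

Counting only direct p children keeps a big wrapper div from winning just because it contains everything; whether that is the right choice depends on the pages you scrape.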

You can strip the HTML tags using this regular expression:
string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);
You will then get the content text you can use to generate your paragraphs.
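To turn the stripped text into a short description, something along these lines might do. It is only a sketch: TextSummary, Summarize and maxLength are hypothetical names, maxLength is assumed to be positive, and the contents of script and style elements are not filtered out:

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TextSummary
{
    // Strips tags, decodes entities, collapses whitespace and cuts the
    // result at the last space at or before maxLength characters.
    public static string Summarize(string html, int maxLength)
    {
        // Replace tags with a space so adjacent words do not run together.
        string text = Regex.Replace(html, @"<(.|\n)*?>", " ");
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        if (text.Length <= maxLength) return text;

        int cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0) cut = maxLength; // no space found: hard cut
        return text.Substring(0, cut) + "...";
    }
}
```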

Related

Why is this seemingly simple Xpath navigation not working?

I'm having what seems like a really simple problem. I'm trying to navigate to an element in HTML by Xpath, and can't seem to get it to function properly.
I want to grab a span from the html contents of a page. The page is fairly complex, so I've been using Firebug's "get element by xpath" and pasting the result into my code. I've noticed it's slightly different than the xpath you get from doing the same thing in Chrome, but they both seem to direct to the same place.
The html I'm trying to navigate is found here. The field I'm trying to access via xpath is the first "Results 1 - 10 of n".
Based on FireBug's 'inspect element' the xpath should be: /html/body/div/center/table/tbody/tr[6]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/span
However, when I try to use this XPath to identify the element in a C# code-behind, I get a number of errors saying that the path cannot be found.
Am I doing something wrong here? I've tried a number of permutations of the xpath and I don't understand why this wouldn't be cooperating within code.
Edit: I'm having this problem both in HTMLAgilityPack (but managed to hack out a bad solution using regexes instead) and a SELECT statement modeled after the answer found here
Edit 2: I'm trying to figure out this issue using Yahoo's free proxy as shown in the example here:
var query = 'SELECT * FROM html WHERE url="http://mattgemmell.com/2008/12/08/what-have-you-tried/" and xpath="//h1" and class="entry-title"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";
$.getJSON(url,function(data){
alert(data.query.results.h1.content);
})
I'm having the same problems with HTML agility pack but I'm more interested in getting this part to work. It works for the provided URL that the answerer gave me (seen above). However when I try to use even simple xpath expressions on the http://nl.newsbank.com url, I get errors that no object has been retrieved every time, no matter how basic the xpath.
Edit 3: I thought I'd elaborate a little more on the big picture of the larger problem I'm trying to solve of which this problem is a critical component in the hopes that maybe it provides a little more insight.
To learn basic ASP.NET development skills from scratch, I decided to make a simple web application based around the news archive search at http://nl.newsbank.com/. In its current iteration, it sends a POST request (although I've now learned you can use a GET request and just dump the body at the end of the URL) to send search criteria, as if the user had entered them in the search bar. It then searches the response (using regexes, not XPath, because I couldn't get that working) for the "Results 1-n of n" span, extracts n, and dumps it in a table. It's a cool little tool for looking up news occurrence rates over time.
I wanted to add functionality such that you could enter a date range (say May 2002 - June 2010) and run a frequency search for every month / week in that range. This is very easy to implement conceptually. HOWEVER the problem is, right now all this happens server side, and since there's no API, the HTTP response contains the entire page, and is therefore very bandwidth intensive. Sending dozens of queries at once would swallow absolutely unspeakable amounts of bandwidth and wouldn't be even a little scalable.
As a result I tried rewriting the application to work client-side. However because of the same-origin policy I'm not able to send a request to an external host from the client-side. HOWEVER there is a loophole that I can use a free Yahoo proxy that makes the request and converts it to JSON, and then I can use the JSON exception of the Same-Origin Policy to retrieve that data from the proxy.
Here's where I'm running into these xpath problems specific to http://nl.newsbank.com. I'm not able to retrieve html with any xpath, and I'm not sure why or how I can fix it. When debugging in VS2010, I'll receive the error Microsoft JScript runtime error: Unable to get value of the property 'content': object is null or undefined

As paul t. already mentioned in a comment, the TBODY elements are generated by the browser engine (WebKit in Chrome's case); they are not in the page source. The next problem is that the DIV between the BODY and CENTER does not exist on the page by default. It is added by a JS statement on line 119.
After stripping out the DIV and TBODY elements like
/html/body/center/table/tr[6]/td/table/tr/td[2]/table/tr/td/table/tr/td/table/tr/td/table/tr/td/span
I can successfully select a node with the HtmlAgilityPack.
Edit: don't use tools like Firebug to get an XPath value for a website. Don't even use it if you just want to look at the source of the page. The problem with Firebug is that it shows you the current DOM tree, which on almost every site has probably already been (heavily) modified by JS.

Your sample HTML page's elements haven't got many classes to select on, but if you're interested in the first <span> element that contains "Results: 1 - 10 of n", you can use an XPath expression that explicitly targets this textual content.
For example:
//table//span[starts-with(., "Results:")]
will select all <span> elements, contained in a <table>, and that contain text beginning with "Results:" (the //table is not strictly necessary in your case I think, but might as well restrict a little)
You want the first one of these <span>, so you can use this expression:
(//table//span[starts-with(., "Results:")])[1]
Note the brackets around the whole previous expression and then [1] to select the first of all the <span> matching the text
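For reference, here is roughly how that expression could be evaluated from C# with only the framework's own XML classes. This is a sketch that assumes the page has already been turned into well-formed XML; ResultsLocator and FirstResultsText are invented names:

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class ResultsLocator
{
    // Returns the text of the first <span> inside a <table> whose text
    // starts with "Results:", or null when there is no such span.
    public static string FirstResultsText(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        XElement span = doc.XPathSelectElement(
            "(//table//span[starts-with(., \"Results:\")])[1]");
        return span == null ? null : span.Value;
    }
}
```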

It may sound kind of simplistic, but the element you are looking for is the only doc element that is using the css class "basic-text-white". I would think this would be a lot easier to find and extract than a long xpath. Web-scraping is never a stable thing, but I would think this is probably as stable as the xpath. Trying to debug the xpath just about makes my eyes bleed.
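A rough sketch of that class-based lookup done directly on the raw HTML, so it does not depend on the markup being well-formed. It assumes the class attribute is written with double quotes and that the span text ends in "of n"; ResultCount and TotalResults are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class ResultCount
{
    // Finds the first <span class="basic-text-white"> and returns the
    // number that follows "of" in its text, or null when nothing matches.
    public static int? TotalResults(string html)
    {
        Match m = Regex.Match(
            html,
            @"<span[^>]*class=""basic-text-white""[^>]*>[^<]*?\bof\s+(\d[\d,]*)",
            RegexOptions.IgnoreCase);
        if (!m.Success) return null;

        // Drop thousands separators such as "1,234" before parsing.
        return int.Parse(m.Groups[1].Value.Replace(",", ""));
    }
}
```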

Detect Razor/C# code?

Is there a way to detect if an HTML page contains any Razor/C# code? Essentially I want users to be able to provide custom layouts, with tags that I will replace with RenderSection. Prior to making this replacement, I want to validate that none of the HTML contains anything like, for example, <a href="@(some C# code)".
All discussions about alternative ways to do this, should/could/would aside, just simply:
Is there a way to programmatically detect if a file contains C#/Razor code?

I don't know a lot about the Razor markup, but I am thinking that when you grab the layout string they are passing in, you will want to parse the text, grab everything that starts with an @ and toss those words into an array. Then, when you republish it to your website, use Razor code to access the data in the array...
Alternatively, and easier, would be to go through all the passed-in code and replace all the @ signs with a different symbol, say &, so that it won't get interpreted by the Razor processor:
layoutString = layoutString.Replace('@', '&');
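Since Razor renders a doubled @@ as a literal @, a conservative check could simply look for any @ that is not part of such an escape. This is a heuristic sketch, not a Razor parser: it will also flag harmless text such as e-mail addresses, and RazorGuard, MayContainRazor and EscapeAt are invented names:

```csharp
public static class RazorGuard
{
    // True when the layout contains an '@' that is not part of an "@@" escape.
    public static bool MayContainRazor(string layout)
    {
        return layout.Replace("@@", string.Empty).Contains("@");
    }

    // Doubles every '@' so that Razor emits it as a literal character.
    // Input that is already escaped ("@@") will then render as "@@".
    public static string EscapeAt(string layout)
    {
        return layout.Replace("@", "@@");
    }
}
```

Rejecting the layout when MayContainRazor returns true is the stricter option; EscapeAt keeps the user's text visible but guarantees none of it runs as code.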

In the browser? No, because unless the programmer made a mistake, there is no Razor/C# code in the rendered HTML, only HTML that was the result of that.
What you ask is like asking what type of oven was used to bake a pizza from the pizza. Bad news - you never will know.
If you provide sensible tags from those, you could parse them in JavaScript, but you have to output that metadata yourself as part of the generated HTML.

After reading your comment to TomTom; the answer is:
No. Razor does not come with any public syntax parser.

XML: Searching elements for specific text using C#

I'm trying to get a list of PDF links from different websites. First I'm using the WebClient class to download the page source. I then use SgmlReader to convert the HTML to XML. So for one particular site, I'll get a tag that looks like this:
<p>1985 to 1997 Board Action Summary</p>
I need to grab all the links that contain ".pdf". Obviously not all websites are laid out the same, so just searching for a <p> tag won't be dynamic enough. I'd rather not use LINQ, but I will if I have to. Thanks in advance.

LINQ makes this easy...
var hrefs = doc.Root.Descendants("a")
    .Where(a => a.Attribute("href") != null)  // skip <a> tags without an href (named anchors)
    .Select(a => a.Attribute("href").Value)
    .Where(href => href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));
Away you go! (Note: I did this from memory, so you might have to fix it somewhat.)
It needs System, System.Linq and System.Xml.Linq. Without the null check it would break down for <a/> tags that don't have an href (anchors).

I think you have 2 options here. If you need only the links, you can use Regular Expressions to find the matches for strings ending with .pdf. If you need to manipulate the XML structure or get other values from the XML, it would be better to use XmlDocument and use an XPath query to find out the nodes which have a link to a pdf file in it. Using LINQ to XML just reduces the number of lines of code you have to write.
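The XmlDocument/XPath route could look something like this. XPath 1.0 has no ends-with function, so this sketch selects every href attribute and does the ".pdf" test in C#. PdfLinks and Find are made-up names, and links with a query string after ".pdf" would be missed:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PdfLinks
{
    // Returns the href values of all <a> elements that end in ".pdf",
    // in document order, ignoring case.
    public static List<string> Find(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var links = new List<string>();
        // "//a/@href" only yields attributes that exist, so named
        // anchors without an href are skipped automatically.
        foreach (XmlNode href in doc.SelectNodes("//a/@href"))
        {
            if (href.Value.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                links.Add(href.Value);
        }
        return links;
    }
}
```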

Read specific text from page into string array in C#

I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?

You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!

You could use something like this -
// needs: using System.Linq; using System.Text.RegularExpressions;
var text = "12 hello 45 yes 890 bye 999";
var matches = Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
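Adapted to the "article titles" case, it might look like the sketch below. It assumes the titles sit in h2 elements, which is purely a guess about the page, and uses WebUtility.HtmlDecode so entities such as &amp;eacute; come back as the real characters. TitleExtractor and Titles are hypothetical names:

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the decoded inner text of every <h2> element, in page order.
    public static string[] Titles(string html)
    {
        return Regex.Matches(html, @"<h2\b[^>]*>(.*?)</h2>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => WebUtility.HtmlDecode(m.Groups[1].Value).Trim())
            .ToArray();
    }
}
```

If the titles contain nested tags (links, spans), those will still be in the captured text and need a second pass or a real HTML parser.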

If the page is well-formed XML, you could use LINQ to XML by loading the page into an XDocument and using XPath or another way of traversing to the element(s) you desire, then loading what you need into the array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution: any subtle change could break the well-formedness of the XML. If that's the case, you're probably better off using regular expressions. Either way, though, the page could be changed under you and your code suddenly won't work anymore.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose; it tells you the chance of your DOOM based on today's news lol. It takes an RSS feed on load from the BBC website and will then look for key words which either increment or decrease the percentage chance of DOOM.
Crazy little project, but maybe one day the classes will come in handy to use again for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want to be in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash

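If you want to remove the DIV tags WITH their content as well:
string start = "<div>";
string end = "</div>";
// Lazy .*? stops at the first closing tag; Singleline lets it span line breaks.
// Nested divs and divs with attributes are not handled.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + ".*?" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);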
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>

IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
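If only the div tags themselves have to go and their content should stay, a small sketch like this may be enough. It uses a regex rather than a plain string replace so that opening tags with attributes are caught as well; DivStripper and RemoveDivTags are invented names:

```csharp
using System.Text.RegularExpressions;

public static class DivStripper
{
    // Removes opening and closing <div> tags (with or without attributes)
    // and leaves everything between them untouched.
    public static string RemoveDivTags(string html)
    {
        return Regex.Replace(html, @"</?div\b[^>]*>", string.Empty,
            RegexOptions.IgnoreCase);
    }
}
```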

A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
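would match each paired HTML element together with its content (use RegexOptions.IgnoreCase, since the pattern is written in upper case).
Use this to remove them from your data.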
