How to get text off a webpage? - c#

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.

Use the HTML Agility Pack library.
It's a very good library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local, web)");
// SelectNodes returns an HtmlNodeCollection, or null when nothing matches
var result = doc.DocumentNode.SelectNodes("//body//text()");
if (result != null)
{
foreach (var node in result)
{
string text = node.InnerText; // the text you want
}
}

It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)

You can strip tags using regular expressions such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex(@"<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to decode escape sequences like &amp;.
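As a rough sketch of the regex approach together with the entity decoding mentioned in the footnote, something like the following could work. The class and method names are made up for illustration, and the same caveats apply: it does not understand scripts, comments, or angle brackets inside attribute values. Tags are removed before decoding so that escaped markup such as &lt;b&gt; survives as literal text instead of being stripped.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TagStripper
{
    // Lazily matches '<', one or more characters, then '>'.
    // Singleline lets a tag span several lines.
    private static readonly Regex Tag = new Regex(@"<.+?>", RegexOptions.Singleline);

    public static string StripTags(string html)
    {
        // Remove the tags first, then decode entities such as &amp; and &lt;.
        string withoutTags = Tag.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Calling TagStripper.StripTags("<b>cake</b> &amp; tea") should give "cake & tea".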

Related

Regex to get values between Double Quotes

I have a value i am pulling from a database
<iframe width="420" height="315" src="//www.youtube.com/embed/8GRDA1gG8R8" frameborder="0" allowfullscreen></iframe>
I am trying to get the src as a value using regex.
Regex.Match(details.Tables["MarketingDetails"].Rows[0]["MarketingVideo"].ToString(), "\\\"([^\\\"]*)\\\"").Groups[2].Value
that is how i am currently writing it
How would I write this to pull the correct value of src?
You could do it like this....
Match match = Regex.Match(@"<iframe width=""420"" height=""315"" src=""//www.youtube.com/embed/8GRDA1gG8R8"" frameborder=""0"" allowfullscreen></iframe>", @"src=(\""[^\""]*\"")");
Console.WriteLine (match.Groups[1].Value);
However, as others have already commented on your question... it's better practice to use an actual html parser.
Don't use regex to parse XML or HTML. It's not worth it. I'll let you read this post; it somewhat exaggerates the point, but the main thing to keep in mind is that you can get into a lot of trouble with regex and HTML.
So, instead you should use an actual html/xml parser! For starters, use XElement, a class built into the .net framework.
string input = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen=''></iframe>";
XElement html = XElement.Parse(input);
string src = html.Attribute("src").Value;
This will make src have the value //www.youtube.com/embed/8GRDA1gG8R8. You can then split that up to get whatever you need from it.
I should also note that your input is not valid xml. allowfullscreen does not have a value attached, which is why I added =''.
If you need to handle more complex input, such as yours, use an HTML parser (XElement is meant for XML). Use the Html Agility Pack like this (using the previous example):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
string src = doc.DocumentNode
.Element("iframe")
.Attributes["src"]
.Value;
This parser is more forgiving of invalid, incorrect, or merely irregular input. It will parse your original input just fine, even without the =''.

Better alternative to getting the 'inner html' of a div?

I have a string:
<div class="className1234"><p>Some html</p></div>
From this string, I would like to get <p>Some html</p>, i.e. I would like to remove the surrounding div tags based on the fact that its class contains 'className'.
What I've Tried
What I've tried works, but it's kludgey, and I know there'll be a better alternative like regex or something. What I currently do is chain a series of Substring(), IndexOf() and Replace() calls to strip out the divs.
EDIT: I've used the phrase 'innerhtml' because I'd like to think there's a library out there somewhere that would allow me to manipulate a string with regard to the tags within it.
PLEASE NOTE: There's no JQuery involved in this. It's all server-side C#.
(See tags)
I would suggest Html Agility Pack, it's designed to allow operations on html documents, kind of like the builtin support for XML in the framework.
It might be overkill, but it will get the work done, easily, and you won't have to care about bad html
How about:
XmlDocument doc = new XmlDocument();
doc.LoadXml(divStr);
// classAtr will be null if the root is not a div with a class with the value className1234
XmlNode classAtr = doc.SelectSingleNode("/div/@class[contains(., 'className1234')]");
string result = classAtr != null ? doc.DocumentElement.InnerXml : divStr;
Whenever you need to manipulate HTML, you should use a dedicated HTML parser/DOM library. One library I've found recommended here on StackOverflow for .Net is HTMLAgilityPack.
As others said, HtmlAgilityPack is the best for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selects. Anyway, this SelectNodes call will get the matching nodes:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile);
var myNodes = doc.DocumentNode.SelectNodes("/div/@class[. = 'className1234']");
foreach (HtmlNode node in myNodes)
{
// your code
}

Does .NET framework offer methods to parse an HTML string?

Knowing that I can't use HTMLAgilityPack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in such ways:
find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements
Are there methods available in .net to do so?
HtmlDocument
GetElementById
HtmlElement
You can create a dummy html document.
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();
Output:
2
file:///c:
about:myUrl
Editing elements:
HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
"src=\"c:\"",
"src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Output:
file:///d:
Assuming you're dealing with well formed HTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
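A minimal sketch of that idea, using only the framework's System.Xml.Linq classes and assuming the input really is well-formed XML (XDocument.Parse throws an XmlException otherwise). It finds an element by its id attribute and sets, or creates, an attribute on it. The class and method names are invented for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the framework.
public static class XhtmlEditor
{
    // Finds the first element whose id attribute equals the given id,
    // sets (or creates) the named attribute, and returns the updated markup.
    public static string SetAttributeById(string xhtml, string id, string name, string value)
    {
        XDocument doc = XDocument.Parse(xhtml);
        XElement target = doc.Descendants()
            .FirstOrDefault(e => (string)e.Attribute("id") == id);
        if (target == null)
        {
            return xhtml; // nothing matched, so leave the input unchanged
        }
        target.SetAttributeValue(name, value);
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For instance, passing the markup <html><body><img id="myImage" src="c:" /></body></html> with id "myImage", name "src" and value "d:" should return the same document with src="d:". Finding elements by tag works the same way with doc.Descendants("img"), as long as the document does not declare a default XML namespace.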
Aside from the HTML Agility Pack, and porting HtmlUnit over to C#, what sounds like solid solutions are:
Most obviously - use regex. (System.Text.RegularExpressions)
Using an XML Parser. (because HTML is a system of tags treat it like an XML document?)
Linq?
One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here
Also, here is a post about Linq vs Regex.
You can look at how HTML Agility Pack works, however, it is .Net. You can reflect the assembly and see that it is using the MFC and could be reproduced if you so wanted, but you'd be doing nothing more than moving the assembly, not making it any more .Net.

How I can extract text from HTML without using third-party libraries?

_request = (HttpWebRequest)WebRequest.Create(url);
_response = (HttpWebResponse) _request.GetResponse();
StreamReader streamReader = new StreamReader(_response.GetResponseStream());
string text = streamReader.ReadToEnd();
This gives me the text with HTML tags. How can I get the text without the HTML tags?
How do you extract text from dynamic HTML without using 3rd party libraries? Simple, you invent your own HTML parsing library using the string parsing functions present in the .NET framework.
Seriously, doing this by yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for different closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason why you need to write one yourself, just use HTML Agility Pack, and let that do the hard work for you.
Also, make sure you're not succumbing to Not Invented Here Syndrome.
Try this:
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;
Be happy :)
1) Do not use Regular Expressions. (see this great StackOverflow post: RegEx match open tags except XHTML self-contained tags)
2) Use HtmlAgilityPack. But I see you do not want 3rd Party libraries, so we are forced to....
3) Use XmlReader. You can pretty much use the example code straight from MSDN, and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case simply write your output to a StreamWriter.
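The XmlReader option in point 3 might look roughly like this. It is a sketch under the assumption that the page is well-formed XML (XHTML); real-world HTML with unclosed tags or undeclared entities such as &nbsp; will make the reader throw. For brevity it collects the text in a StringBuilder instead of writing to a StreamWriter, and the class and method names are invented.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the framework.
public static class XmlTextExtractor
{
    public static string ExtractText(string xhtml)
    {
        var output = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                // Keep text nodes only; elements, comments and
                // whitespace-only nodes are skipped.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    output.Append(reader.Value);
                }
            }
        }
        return output.ToString();
    }
}
```

Given <p>I want the <b>cake</b>, not the tags.</p>, this should return "I want the cake, not the tags."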
This question has been asked before. There are a few ways to do it, including using a Regular Expression or as pointed out by Adrian, the Agility Pack.
See this question: How can I strip HTML tags from a string in ASP.NET?

Convert HTML to XML with WP7

Simple situation: I want to search through an HTML string and pull out a few pieces of information.
It gets annoying writing masses of .Substring and .IndexOf lines for each element I want to find and cut out of the HTML file.
AFAIK I'm unable to load DLLs such as HTMLtidy or HTML Agility Pack into my WP7 project, so is there a more efficient and reliable way to search through my HTML string instead of building substrings with IndexOf?
void client_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
string document = string.Empty;
using (var reader = new StreamReader(e.Result))
document = reader.ReadToEnd();
string temp = document.Substring(document.IndexOf("Games Played"), (document.IndexOf("League Games") - document.IndexOf("Games Played")));
temp = (temp.Substring(temp.IndexOf("<span>"), (temp.IndexOf("</span>") - temp.IndexOf("<span>")))).Remove(0, 6);
Int32.TryParse(temp, out leaugeGamesPlayed);
}
Thanks for your help
Gpx
You can use the HTML Agility Pack but you need the converted version of HTML Agility Pack for the Phone. It's only available from svn repository but it works great, I use it in my app.
http://htmlagilitypack.codeplex.com/SourceControl/changeset/view/77494#
You can find two projects under trunk named HAPPhone and HAPPhoneTest. You can use the download button to the right to get the code. It uses Linq instead of XPath to work.
You could use LINQ to parse the HTML and locate the elements that you're interested in. For example:
XDocument parsed = XDocument.Parse(document);
var spans = parsed.Descendants("span");
Beth Massi has a great blog post: Querying HTML with LINQ to XML
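To make the LINQ to XML suggestion a little more concrete, here is a small sketch that returns the first span whose content parses as an integer, which is roughly what the Substring/IndexOf chain in the question is after. It assumes the downloaded page is well-formed XML with no default namespace (otherwise XDocument.Parse throws, or Descendants("span") matches nothing). The markup in the usage note and the names are hypothetical, since the real page is not shown.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the framework.
public static class SpanReader
{
    // Returns the first <span> value that parses as an integer, or null if none does.
    public static int? FirstSpanAsInt(string xhtml)
    {
        XDocument parsed = XDocument.Parse(xhtml);
        foreach (XElement span in parsed.Descendants("span"))
        {
            int value;
            if (int.TryParse(span.Value.Trim(), out value))
            {
                return value;
            }
        }
        return null;
    }
}
```

With made-up input such as <div>Games Played <span>42</span></div>, FirstSpanAsInt should return 42.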
Assuming you're doing this because you're getting the HTML from a web site/page/server.
Don't convert it on the device.
Create a wrapper/proxy site/server/page to do the conversion for you. While this has the downside of having to create the extra service, it has the following advantages:
Code on the server will be easier to update than code within a distributed app. (Experience with parsing HTML you don't directly control shows that you will need to change your parsing, because the original HTML is almost certain to throw something unexpected at you when it changes in the future.)
If you do it once on the server, you can cache the result rather than having every instance of the app do the conversion over again.
By virtue of the above 2 points, the app will run faster!
If you have the HTML file at design/build time then convert it to something easier to work with and avoid unnecessary computation at run time.
As a workaround, you could consider loading the HTML into a WebBrowser control and then query the DOM via injected javascript (which calls back to .NET)
