Convert HTML to XML with WP7 - c#

simple situation, want to search through a HTML string, get out a couple of information.
Gets annoying after writing mass lines of .Substing and. IndexOf for each element i want to find and cut out of the HTML file.
Afaik i´m unable to load such dll as HTMLtidy or HTML Agility Pack into my WP7 project so is there a more efficient and reliable way to search trough my HTML string instead of building Substings with IndexOf?
void client_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
string document = string.Empty;
using (var reader = new StreamReader(e.Result))
document = reader.ReadToEnd();
string temp = document.Substring(document.IndexOf("Games Played"), (document.IndexOf("League Games") - document.IndexOf("Games Played")));
temp = (temp.Substring(temp.IndexOf("<span>"), (temp.IndexOf("</span>") - temp.IndexOf("<span>")))).Remove(0, 6);
Int32.TryParse(temp, out leaugeGamesPlayed);
}
Thanks for your help
Gpx

You can use the HTML Agility Pack but you need the converted version of HTML Agility Pack for the Phone. It's only available from svn repository but it works great, I use it in my app.
http://htmlagilitypack.codeplex.com/SourceControl/changeset/view/77494#
You can find two projects under trunk named HAPPhone and HAPPhoneTest. You can use the download button to the right to get the code. It uses Linq instead of XPath to work.

You could use LINQ to parse the HTML and locate the elements that you're interested in. For example:
XDocument parsed = XDocument.Parse(document);
var spans = parsed.Descendants("span");
Beth Massi has a great blog post: Querying HTML with LINQ to XML

Assuming you're doing this because you're getting the HTML from a web site/page/server.
Don't convert it on the device.
Create a wrapper/proxy site/server/page to do the conversion for you. While this has the downside of having to create the extra service, it has the following advantages:
Code on the server will be easier to update than code within a distrbued app. (Experience with parsing HTML you don't directly control will show that you will need to make changes in your parsing as the original HTML is almost certain to throw something unexpected at you when changed in the future.)
If you can do it once on the server you can cache the result rather than having instance of the app have to do the conversion over.
By virtue of the above 2 points, the app will run faster!
If you have the HTML file at design/build time then convert it to something easier to work with and avoid unnecessary computation at run time.

As a workaround, you could consider loading the HTML into a WebBrowser control and then query the DOM via injected javascript (which calls back to .NET)

Related

Separate a certain piece of data from html with given start and endpoints

I am learning screen-scraping using C# and I was wondering
How can I separate certain pieces of gathered html,
I am using htmlAgilityPack and ScrapySharp library's for scraping thus with this code I can retrieve a html page:
WebPage PageResult = Browser.NavigateToPage(new Uri("localhost"));
Console.WriteLine(PageResult);
Of course I get back the whole source code with all the syntax and mishmash, but what If I wanted to only catch data between <h2></h2> tags and omit all else?
My very simple-minded pseudo code would be:
If result reads h2
Trim all behind
start writing out after
If result reads /h2
stop writing
Trim anything that comes after
The main question I'm having is how do I feed In the rule that when I read h2 trim everything from before, write the data after that and if /h2 appears, stop and trim the end of the result?
There are a few ways you can achieve this, one such would be to red the page as XML and parse the data you are looking for,
This can be with the use of,
XElement
XmlElement
XDocument
etc.
The second way, would be to use a third-party library like HtmlAgilityPack, this also supports XPath as well,
var nodes = doc.DocumentNode.SelectNodes("//form//input");

Is it safe to render Html that was uploaded as Markdown and converted serverside to Html?

I have a webform that allows users to upload text as Markdown.
The Markdown is converted to Html on the server(using Markdig) and also stored.
When displaying the converted Html that the user uploaded, should I #Html.Encode the content - the project is in c#, MVC 5/razor with request validation on.
Generally it depends on the markdown converter.
By default Markdig doesn't escape html. You can however use the DisableHtml function in the pipeline that escapes all remaining HTML encodable strings that were not processed by previous extensions. This should also give better performance than letting an anti-xss function run over the string again.
See example:
var pipeline = new MarkdownPipelineBuilder().DisableHtml().Build();
var result = Markdig.Markdown.ToHtml("<a href='javascript:evil()'>hello</a>", pipeline);
No, it isn't.
I just trivially tested the following:
hello
and markdig lets it through:
See online example.
Although I haven't looked into it too deeply, the Microsoft AntiXSS library might be useful here:
var safeHtml = Microsoft.Security.Application.Sanitizer
.GetSafeHtmlFragment("<a href='javascript:evil()'>hello</a>");
gives:
hello
but
var safeHtml = Microsoft.Security.Application.Sanitizer
.GetSafeHtmlFragment("<a href='http://stackoverflow.com'>hello</a>");
gives:
hello

Does .NET framework offer methods to parse an HTML string?

Knowing that I can't use HTMLAgilityPack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in such ways:
find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements
Are there methods available in .net to do so?
HtmlDocument
GetElementById
HtmlElement
You can create a dummy html document.
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();
Output:
2
file:///c:
about:myUrl
Editing elements:
HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
"src=\"c:\"",
"src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Output:
file:///d:
Assuming you're dealing with well formed HTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Aside from the HTML Agility Pack, and porting HtmlUnit over to C#, what sounds like solid solutions are:
Most obviously - use regex. (System.Text.RegularExpressions)
Using an XML Parser. (because HTML is a system of tags treat it like an XML document?)
Linq?
One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here
Also, here is a post about Linq vs Regex.
You can look at how HTML Agility Pack works, however, it is .Net. You can reflect the assembly and see that it is using the MFC and could be reproduced if you so wanted, but you'd be doing nothing more than moving the assembly, not making it any more .Net.

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.
Use the HTML Agility Pack library.
That's very fine library for parse HTML, for your requirement use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Yor Path(local,web)");
var result=doc.DocumentNode.SelectNodes("//body//text()");//return HtmlCollectionNode
foreach(var node in result)
{
string AchivedText=node.InnerText;//Your desire text
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)
You can strip tags using regular expressions such as this one2 (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex("\<.+?\>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library1. (If the webpage is XHTML, all the better - use the System.Xml classes.)
1 Like http://htmlagilitypack.codeplex.com/, for example.
2 This might have unintended side-effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to accept escape sequences like &.

Is there a jQuery-like CSS/HTML selector that can be used in C#?

I'm wondering if there's a jQuery-like css selector that can be used in C#.
Currently, I'm parsing some html strings using regex and thought it would be much nicer to have something like the css selector in jQuery to match my desired elements.
Update 10/18/2012
CsQuery is now in release 1.3. The latest release incorporates a C# port of the validator.nu HTML5 parser. As a result CsQuery will now produce a DOM that uses the HTML5 spec for invalid markup handling and is completely standards compliant.
Original Answer
Old question but new answer. I've recently released version 1.1 of CsQuery, a jQuery port for .NET 4 written in C# that I've been working on for about a year. Also on NuGet as "CsQuery"
The current release implements all CSS2 & CSS3 selectors, all jQuery extensions, and all jQuery DOM manipulation methods. It's got extensive test coverage including all the tests from jQuery and sizzle (the jQuery CSS selection engine). I've also included some performance tests for direct comparisons with Fizzler; for the most part CsQuery dramatically outperforms it. The exception is actually loading the HTML in the first place where Fizzler is faster; I assume this is because fizzler doesn't build an index. You get that time back after your first selection, though.
There's documentation on the github site, but at a basic level it works like this:
Create from a string of HTML
CQ dom = CQ.Create(htmlString);
Load synchronously from the web
CQ dom = CQ.CreateFromUrl("http://www.jquery.com");
Load asynchronously (non-blocking)
CQ.CreateFromUrlAsync("http://www.jquery.com", responseSuccess => {
Dom = response.Dom;
}, responseFail => {
..
});
Run selectors & do jQuery stuff
var childSpans = dom["div > span"];
childSpans.AddClass("myclass");
the CQ object is like thejQuery object. The property indexer used above is the default method (like $(...).
Output:
string html = dom.Render();
You should definitely see #jamietre's CsQuery. Check out his answer to this question!
Fizzler and Sharp-Query provide similar functionality, but the projects seem to be abandoned.
Not quite jQuery like, but this may help:
http://www.codeplex.com/htmlagilitypack
For XML you might use XPath...
I'm not entirely clear as to what you're trying to achieve, but if you have a HTML document that you're trying to extract data from, I'd recommend loading it with a parser, and then it becomes fairly trivial to query the object to pull desired elements.
The parser I linked above allows for use of XPath queries, which sounds like what you are looking for.
Let me know if I've misunderstood.

Categories

Resources