I am trying to keep a plain HTML template file in a folder and load it from a C# project. So far I have used small placeholder words that I replace with the help of a StringBuilder. This makes the HTML template dynamic, so I can change certain parts depending on what I need.
But I was wondering whether it is possible to find an HTML element by ID or something along those lines. Instead of replacing each place where the placeholder word occurs, I would manipulate the HTML element itself.
I tried the HTML Agility Pack, but I couldn't get it to work.
I have this file of simple HTML.
<h1 id="test"> </h1>
I then load it into the HTML Agility Pack, look up the element by its id, and try to append some markup to it.
private string DefineHTML(string html, string id)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
doc.GetElementbyId(id).AppendChild(doc.CreateElement("<p>test</p>"));
return doc.Text;
}
But it just outputs the same HTML it got into it instead of adding the next child to the element.
I need it to input the element into the heading element. Like so
<h1 id="test">
<p>test</p>
</h1>
So I was wondering if there was a way to do this. Since I feel like replacing each placeholder word seems like more trouble than it is worth.
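private string DefineHTML(string html, string id)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // Select the element by id with XPath; '*' matches any tag name (the sample element is an <h1>).
    var htmlNode = doc.DocumentNode.SelectSingleNode($"//*[@id='{id}']");
    // HtmlNode.CreateNode parses an HTML fragment; CreateElement expects a bare tag name such as "p".
    var child = HtmlNode.CreateNode("<p>test</p>");
    htmlNode.AppendChild(child);
    // doc.Text holds the original input, so serialize the modified tree instead.
    return doc.DocumentNode.InnerHtml;
}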
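For reference, here is a minimal self-contained sketch of the append-by-id approach, assuming the HtmlAgilityPack NuGet package is referenced. The names TemplateFiller and AppendHtml are made up for illustration, and the null check is my own addition for the case where no element has the given id. The fragment is assumed to have a single root element.

```csharp
using System;
using HtmlAgilityPack;

public static class TemplateFiller // hypothetical helper name
{
    // Appends an HTML fragment to the element with the given id and
    // returns the serialized, modified document.
    public static string AppendHtml(string html, string id, string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when no element carries that id.
        HtmlNode target = doc.GetElementbyId(id);
        if (target == null)
        {
            throw new ArgumentException($"No element with id '{id}' found.", nameof(id));
        }

        // CreateNode parses the fragment into a node; CreateElement would
        // treat the whole string as a tag name.
        target.AppendChild(HtmlNode.CreateNode(fragment));

        // Serialize the tree; doc.Text would still be the original input.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling TemplateFiller.AppendHtml with the sample heading, the id "test" and the fragment "<p>test</p>" should give back the heading with the paragraph inside it.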
Related
I don't have any experience with HTML, so excuse any incorrect terminology.
I am trying to parse an HTML document using the HTML Agility Pack, and I am looking for a very specific string.
I want to obtain all strings of the form:
<img src="..." etc=....">
So my select parameter is
HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
However, this also ends up returning strings such as
<img width="..." src="..." etc="..">
It seems to me (at least to the best of my knowledge): The img tag is searched for and src only needs to be found on the same level, not necessarily right next to the img tag.
After looking at the documentation I feel that I am trying to do something I am not allowed to with this function.
Can someone please suggest the correct way to do this. Thanks!
"The img tag is searched for and src only needs to be found on the same level, not necessarily right next to the img tag."
It seems that you want to find <img> elements where src is the first attribute. Note that an XML/HTML parser doesn't have to preserve attribute order, so in general you don't want to select elements based on a particular attribute order, i.e. where the src attribute comes first, etc.
Anyway, attribute order happened to be preserved by HAP in my oversimplified test, so using Attributes[0].Name* to check the name of the first attribute also worked:
var raw = @"<div>
<img src=""..."" etc=""...."">
<img width=""..."" src=""..."" etc="".."">
<img>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
var result = doc.DocumentNode
    .SelectNodes("//img[@src]")
    .Where(o => o.Attributes[0].Name == "src")
    .ToList();
foreach (var item in result)
{
    Console.WriteLine(item.OuterHtml);
}
output :
<img src="..." etc="....">
*) The XPath already filters for img elements that have a src attribute, so Attributes[0] always exists and the check would never throw, if you are concerned.
I am not familiar with XPATH, so I am assuming yours is correct (I usually use css selectors using ScrapySharp library in addition to HtmlAgilityPack).
The following console project snippet returns only the img node you want, i.e. the one with exactly 2 attributes, src and etc, no fewer and no more.
I manually load a sample html with 3 image nodes, like the following:
// requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
string html = @"
<img src='img1.jpg' />
<img src='img1.jpg' etc='etcValue' />
<img width='200px' src='img1.jpg' />
";
doc.LoadHtml(html);
var relevantImgNodes = doc.DocumentNode.SelectNodes("//img")
    .Where(n =>
        n.Attributes.Count == 2 &&
        // GetAttributeValue takes a default that is returned when the attribute is missing.
        !string.IsNullOrEmpty(n.GetAttributeValue("src", "")) &&
        !string.IsNullOrEmpty(n.GetAttributeValue("etc", "")));
Console.WriteLine(relevantImgNodes.Count()); // prints 1
I am looking for the best way to remove all of the text in between 2 div tags, including the tags themselves.
For example:
<body>
<div id="spacer"> This is a title </div>
</body>
becomes:
<body>
</body>
Edit: This needs to happen on the server side (C#)
You can use this library: http://htmlagilitypack.codeplex.com/ to manipulate the HTML on the server side. Below is an example for your case:
var doc = new HtmlDocument();
doc.LoadHtml("<body><div id=\"spacer\"> This is a title </div></body>");
doc.GetElementbyId("spacer").Remove();
var stream = new StringWriter();
doc.Save(stream);
var result = stream.ToString();
Edit:
You also can use xpath to select any nodes you want:
var nodes = doc.DocumentNode.SelectNodes("body/div");
nodes.ToList().ForEach(node => node.Remove());
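One thing to watch with the XPath variant: SelectNodes returns null rather than an empty collection when nothing matches, so nodes.ToList() would throw on a document without a matching div. Here is a small sketch with a guard, assuming the HtmlAgilityPack package is referenced; HtmlCleaner and RemoveAll are illustrative names only.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner // hypothetical helper name
{
    // Removes every node matched by the XPath expression (tags and content)
    // and returns the remaining markup.
    public static string RemoveAll(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null) // null means "no match", not an error
        {
            // ToList takes a snapshot of the matches before any node is removed.
            foreach (HtmlNode node in nodes.ToList())
            {
                node.Remove();
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

For the example in the question, the call would be HtmlCleaner.RemoveAll(html, "//div[@id='spacer']").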
Not sure what you are trying to achieve, but the best way to hide or remove content on the fly in your case would be jQuery/JavaScript, since you are not referring to a server-side control.
In case you are just parsing a string:
1) Parse it, find the first and last occurrence, and trim everything in between.
2) XML parsing would be another, and I guess better, way, because you can iterate through the XML and manipulate it more easily.
You can use a regular expression to strip HTML tags and text. You will find several examples on Google.
I have a string that contains html code of a document.
It can have multiple image tags inside.
What I want to do is pass each img tag's src attribute value, i.e. the URL, to a C# function and replace that value with whatever the function returns.
How can I do this?
Regex is not a good tool for parsing HTML files. HTML is neither strict nor regular in its format (for example, in non-strict HTML it is OK to have a tag without a closing tag).
Use HtmlAgilityPack.
You can do it with HtmlAgilityPack like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// Select only the img elements that have a src attribute;
// the [@src] filter guarantees that Attributes["src"] is not null below.
foreach (var item in doc.DocumentNode.SelectNodes("//img[@src]"))
{
    item.Attributes["src"].Value = yourFunction(item.Attributes["src"].Value);
}
doc.Save("yourFile"); // don't forget to save
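Since the question starts from a string rather than a stream, the same loop can be wrapped so that it takes and returns a string. This is only a sketch under the assumption that HtmlAgilityPack is referenced; SrcRewriter, RewriteImageSources and the rewrite delegate are hypothetical names standing in for your own function.

```csharp
using System;
using HtmlAgilityPack;

public static class SrcRewriter // hypothetical helper name
{
    // Replaces the src value of every img element with rewrite(oldValue)
    // and returns the modified HTML as a string.
    public static string RewriteImageSources(string html, Func<string, string> rewrite)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when there are no matching img elements.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            foreach (HtmlNode img in images)
            {
                string oldSrc = img.GetAttributeValue("src", "");
                img.SetAttributeValue("src", rewrite(oldSrc));
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call could look like SrcRewriter.RewriteImageSources(html, url => yourFunction(url)), where yourFunction is the function from the question.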
I'm trying to figure out how to grab DOM elements from a webpage. Here is the function I'm using:
private void processHTML(String htmlContent)
{
IHTMLDocument2 htmlDocument = (IHTMLDocument2)new mshtml.HTMLDocument();
htmlDocument.write(htmlContent);
IHTMLElementCollection allElements = htmlDocument.all;
webBrowser1.DocumentText = allElements.item("storytext").innerHTML;
textBox2.Text = allElements.item("chap_select").length.ToString();
}
If I set a breakpoint at either of the last two lines and then check the allElements collection, I'm able to find the SELECT element. It correctly shows the ID as being chap_select and the length property shows 13 for the particular document that is being passed. For some reason the length that is being put into the textBox2 field is 2, however.
Any suggestions on what I'm doing wrong here? I've spent several hours trying to figure this out, but have not been able to find any code samples of somebody trying to grab this property of a SELECT.
Instead of using IHTMLDocument2 and mshtml.HTMLDocument I suggest using the much easier to work with HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Something like (untested):
var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
textBox2.Text = doc.DocumentNode
    .SelectNodes("//select[@id='chap_select']/option").Count().ToString();
Here's a link:
http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/results/2010-2011/boxscore819588.html
I'm using HTML Agility Pack and I would like to extract, say, the 188 from the 'Odds' column. My editor gives /html/body/form/div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7] when asked for the path. I tried that path with various omissions of body or html, but none of them returns any results when passed to .DocumentNode.SelectNodes(). I also tried with the // at the beginning (which, I assume, is the root of the document tree). What gives?
EDIT:
Code:
WebClient client = new WebClient();
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("/some/xpath/expression"))
{
Console.WriteLine("[" + node.InnerText + "]");
}
When scraping sites, you can't safely rely on the exact XPath given by tools: in general they are too restrictive and in fact catch nothing most of the time. The best way is to look at the HTML and come up with something more resilient to changes.
Here is a piece of code that works with your example:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(yourHtmlFile); // placeholder: path to (or stream of) the saved page
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[text()='MIA']/ancestor::tr/td[7]"))
{
    Console.WriteLine(node.InnerText.Trim());
}
It outputs 188.
The way it works is:
select an A element with inner text set to "MIA"
find the ancestor TR element of this A element (the ancestor axis walks up through intermediate elements such as the TD)
get to the seventh TD of this TR element
and then we use InnerText property of that TD element
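To see those steps in isolation, here is a small sketch that runs the same XPath pattern against any HTML string. It assumes HtmlAgilityPack is referenced; the OddsFinder and SeventhCells names are invented for illustration, and the real page is larger and may be structured differently.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OddsFinder // hypothetical helper name
{
    // Returns the trimmed text of the seventh cell of each row that
    // encloses a link whose text equals linkText.
    public static List<string> SeventhCells(string html, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var results = new List<string>();

        // a[text()=...] : the link with the given text
        // ancestor::tr  : every row enclosing that link
        // td[7]         : the seventh cell of such a row
        string xpath = $"//a[text()='{linkText}']/ancestor::tr/td[7]";

        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(xpath);
        if (cells != null) // null means nothing matched
        {
            foreach (HtmlNode cell in cells)
            {
                results.Add(cell.InnerText.Trim());
            }
        }
        return results;
    }
}
```

Because ancestor::tr matches every enclosing row, a page with nested tables can return cells from the outer rows as well, so it may be worth checking how many results come back.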
Try this:
/html/body/form/div/div[2]/div/table/*/tr/td[2]/div/table/*/tr[3]/td[7]
The * catches the <tbody> element that browsers insert into the DOM representation of tables even if it is not written in the HTML.
Other than that, it's more robust to select by ID, CSS class name or some other unique property instead of by hierarchy and document structure:
//table[@class='data']//tr[3]/td[7]
By default HtmlAgilityPack treats the form tag differently (because form tags can overlap), so you need to remove the form tag from the XPath, for example: /html/body//div/div[2]/div/table/tr/td[2]/div/table/tr[3]/td[7]
Another way is to force HtmlAgilityPack to treat the form tag like any other:
HtmlNode.ElementsFlags.Remove("form");