I am writing a program that parses a bit of HTML. Specifically, I am looking for underlined elements within a list, and turning those underlined elements into hyperlinks.
Here's an example of the pre-converted HTML:
<ul>
<li>
<u>Mode selector </u>
</li>
<li>
<u>LAND ALT</u>
</li>
<li>
<u>FLT ALT</u>
</li>
</ul>
Here's what I'm wanting the result to look like:
<ul>
<li>
<a id="triv14522" onclick="TxtLinkAction(15627,15673)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>Mode selector</u>
</span>
</a>
</li>
<li>
<a id="triv14523" onclick="TxtLinkAction(15627,15674)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>LAND ALT</u>
</span>
</a>
</li>
<li>
<a id="triv14887" onclick="TxtLinkAction(15627,15679)">
<span style="color: rgb(102, 204, 255); font-size: 11pt;">
<u>FLT ALT</u>
</span>
</a>
</li>
</ul>
In my program, I've already built the anchor and span elements for each underlined element. Just for reference, here's how I've done this:
TrivId = trivId;
ActionItemId = actionItemId;
TextLayerId = textLayerId;
var trivIdText = "id=\"triv" + TrivId + "\"";
var onClickText = "onclick=\"TxtLinkAction(" + TextLayerId + "," + ActionItemId + ")\"";
var anchor = "<a " + trivIdText + " " + onClickText + ">";
var span = "<span style=\"color: rgb(102, 204, 255); font-size: 11pt;\">";
So, my main problem is I don't exactly know how to "wrap" each underlined element in the list with my anchor and span elements. If this were XML, I could add my XML element by using AddBeforeSelf. Can I do something similar with HTML?
NOTE: I notice that the C# tag has been removed, and Javascript tag added. I should clarify: This is a C# program that is parsing a PowerPoint document. One of the values that is being brought in is in HTML format. I am not using Javascript at all, since this isn't an actual webpage. I'm just grabbing this particular value from the PowerPoint slide, which happens to be in HTML format.
For further clarification, here's the C# method that I'm using. The resulting, modified HTML will be written out to an XML file. The resulting HTML will be stored in an XML tag, <RTF>, with the valid HTML as that tag's value.
public Hyperlink(int textLayerId, int runGroupId)
{
TrivId = LectoraTitle.GetId();
ActionItemId = LectoraTitle.GetId();
TextLayerId = textLayerId;
var trivIdText = "id=\"triv" + TrivId + "\"";
var onClickText = "onclick=\"TxtLinkAction(" + TextLayerId + "," + ActionItemId + ")\"";
var styleText = "style=\"" + Settings.Default.Style + "\"";
// build anchor/span and determine where to insert into text.text
var anchor = "<a " + trivIdText + " " + onClickText + " " + styleText + ">";
var span = "<span style=\"color: rgb(102, 204, 255); font-size: 11pt;\">";
ActionItem = new ActionItem { ActionType = ActionType.rungroup, TargetId = runGroupId };
}
Further explanation: I'm assuming that I can iterate over my HTML elements with a foreach loop, using something like the below code:
// note: this is pseudocode
var nodes = htmlSnippet;
foreach (var node in nodes)
{
// if node is underline element
// surround node with generated anchor
// and span elements.
}
I'm just not quite sure how to get my HTML snippet into an enumerable state so that I can iterate over it, and then wrap a particular element with my generated elements.
NEW EDIT:
So, after looking at HtmlAgilityPack, I've incorporated it into my program and am iterating over the Html like so (The variable text contains the HTML value (see first example above)):
htmlDocument.LoadHtml(text);
var nodes = htmlDocument.DocumentNode.SelectNodes("//u");
foreach (var node in nodes)
{
// insert code here to wrap the
// underline element with the generated
// anchor/span elements
}
So, now I'm able to parse the HTML and get only the underline elements. I now need to figure out how to surround these underline elements with my generated anchor/span elements. I was hoping I could do something like node.AddParent(anchor).
In order to iterate the HTML you may want to use HTML Agility Pack
http://htmlagilitypack.codeplex.com/
Examples here:
http://htmlagilitypack.codeplex.com/wikipage?title=Examples
A decent how-to here:
http://www.codeproject.com/Articles/659019/Scraping-HTML-DOM-elements-using-HtmlAgilityPack-H
You can install it using NuGet.
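Once the underline nodes are selected, one possible way to wrap them is to build the replacement markup as a string and swap it in with ReplaceChild. This is only a sketch: UnderlineWrapper, WrapUnderlines, and the buildOpenAnchor callback are hypothetical names, standing in for whatever produces the opening anchor tag (with a fresh id) for each link.

```csharp
using System;
using HtmlAgilityPack;

static class UnderlineWrapper
{
    // Wraps every <u> element in <a ...><span ...> ... </span></a>.
    // buildOpenAnchor (hypothetical) returns a new opening <a ...> tag per call;
    // openSpan is the opening <span ...> tag.
    public static string WrapUnderlines(string html, Func<string> buildOpenAnchor, string openSpan)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//u");
        if (nodes == null) return html;

        foreach (var node in nodes)
        {
            // Build <a><span><u>...</u></span></a> as one fragment.
            var wrapper = HtmlNode.CreateNode(
                buildOpenAnchor() + openSpan + node.OuterHtml + "</span></a>");

            // Swap the original <u> for the wrapped version.
            node.ParentNode.ReplaceChild(wrapper, node);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

In the Hyperlink class above, the callback could simply return the anchor string built in the constructor.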
Related
I am trying to correctly extract the innerText of a list of divs that I am getting from a website.
This is what I came up with, but it is still a bit buggy: it misses whitespace and the - symbol.
var first = mainmenuTitles[x].Descendants("div").FirstOrDefault(o => o.GetAttributeValue("class", "") == "left").Elements("a").ToList();
string final = "";
foreach (var countfirst in first)
{
final += countfirst.InnerText;
}
Console.WriteLine("Title: " + final);
This is what the HTML looks like:
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div> <div class="right fs11"> March 31 </div> </div> </div>
The text I am trying to get should look like this ->
Italy - Serie C:: group B
I am not an HTML guru, so forgive me if it is too simple and I am missing it.
You can write a query to look up all nodes with xpath //div/a and then concatenate the inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and returns.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
Side note: you can use different queries to make sure you get the right div by matching on the class name as well, e.g. .SelectNodes("//div[@class='row row-tall mt4']//a"). This will give you all the <a> tags under that div.
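Because the "- Serie C:: group B" part sits in the div itself rather than inside the <a>, another option is to take the InnerText of the whole div with class left and collapse the whitespace. A minimal sketch, assuming the markup shown in the question (TitleExtractor and GetTitle are hypothetical names):

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class TitleExtractor
{
    // Returns the text of the first <div class="left">, with runs of
    // whitespace collapsed to single spaces, or null if no such div exists.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var left = doc.DocumentNode.SelectSingleNode("//div[@class='left']");
        if (left == null) return null;

        // InnerText concatenates all descendant text nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        var text = HtmlEntity.DeEntitize(left.InnerText);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```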
I'm using Html Agility Pack to build a library with several features.
One of them is to:
find all the HTML parts contained between a "start comment tag" and an "end comment tag"
replace the HTML part that matches a search string
For example:
I need to find the HTML parts contained between a <!-- data-example-start start tag and a <!-- data-example-end end tag. Both are keywords (the comments start with those keywords).
The HTML part to replace is the one that contains the keyword "hello".
<body>
<p>Title
</p>
<!-- data-example-start-try_1 -->
<div>
</div>
<span id="hello"> Hi
</span>
<!-- data-example-end-try_1 -->
<!-- data-example-start-goodbye 2-->
<div>
<span id="bye"> Bye
</span>
</div>
<p>
</p>
<!-- data-example-end-goodbye 2-->
</body>
In this case I expect to replace the first HTML part, the one contained between <!-- data-example-start-try_1 --> and <!-- data-example-end-try_1 -->, because it contains the search word "hello" that I'm looking for.
How can I select, with Html Agility Pack, the HTML parts contained between two HTML comments?
Thanks in advance
Here is an online example that shows how to get the nodes between two comments:
https://dotnetfiddle.net/JlkMot
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var docNode = doc.DocumentNode.InnerHtml;
var descendants = doc.DocumentNode.Descendants().ToList();
var startNode = descendants.FindIndex(x => x.InnerHtml == "<!-- data-example-start-try_1 -->");
var endEnd = descendants.FindIndex(x => x.InnerHtml == "<!-- data-example-end-try_1 -->");
if (startNode != -1 && endEnd != -1)
{
var betweenNodes = descendants.GetRange(startNode + 1, endEnd - startNode - 1);
foreach (var node in betweenNodes)
{
// show 2 times "Hi", once for the span, once for the text
Console.WriteLine(node.InnerHtml);
}
}
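The example above matches one exact comment. To match any block by its keyword prefix and pick the one containing a search word, a sketch along the following lines might work. It assumes the start and end comments are siblings, and that HtmlCommentNode.Comment holds the full comment text including the <!-- delimiter. CommentBlocks, Find, and FindContaining are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class CommentBlocks
{
    const string StartPrefix = "<!-- data-example-start";
    const string EndPrefix = "<!-- data-example-end";

    // For each start comment, collects the sibling nodes up to the next end comment.
    public static List<List<HtmlNode>> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var starts = doc.DocumentNode.Descendants()
            .OfType<HtmlCommentNode>()
            .Where(c => c.Comment.StartsWith(StartPrefix, StringComparison.Ordinal))
            .ToList();

        var blocks = new List<List<HtmlNode>>();
        foreach (var start in starts)
        {
            var block = new List<HtmlNode>();
            for (var n = start.NextSibling; n != null; n = n.NextSibling)
            {
                // Stop at the first end comment after this start comment.
                if (n is HtmlCommentNode c &&
                    c.Comment.StartsWith(EndPrefix, StringComparison.Ordinal))
                    break;
                block.Add(n);
            }
            blocks.Add(block);
        }
        return blocks;
    }

    // Returns the first block whose markup contains the keyword, or null.
    public static List<HtmlNode> FindContaining(string html, string keyword)
    {
        return Find(html).FirstOrDefault(b => b.Any(n => n.OuterHtml.Contains(keyword)));
    }
}
```

The nodes in the returned block can then be removed or replaced through their ParentNode.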
I have some HTML that I'm parsing using C#.
The sample text is below, though it is repeated about 150 times with different records:
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
I'm trying to get the text into an array, which would look like this:
customerArray [0,0] = Title
customerArray [0,1] = Mr
customerArray [1,0] = First Name
customerArray [1,1] = Fake
customerArray [2,0] = Surname
customerArray [2,1] = Guy
I can get the text into the array, but I'm having trouble getting the text after the closing STRONG tag up until the BR tag, and then finding the next STRONG tag.
Any help would be appreciated.
You can use XPath following-sibling::text()[1] to get text node located directly after each strong. Here is a minimal but complete example :
var raw = @"<div>
<strong>Title</strong>: Mr<br>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//strong"))
{
var val = node.SelectSingleNode("following-sibling::text()[1]");
Console.WriteLine(node.InnerText + ", " + val.InnerText);
}
dotnetfiddle demo
output :
Title, : Mr
First name, : Fake
Surname, : Guy
You should be able to remove the ":" by doing simple string manipulation, if needed...
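To get from those label/value pairs to the two-dimensional array in the question, something like the following could work. It is a sketch only: CustomerParser and Parse are hypothetical names, and it assumes each value is the text node directly after its <strong> and starts with a colon.

```csharp
using HtmlAgilityPack;

static class CustomerParser
{
    // Returns an array where [i, 0] is the label and [i, 1] is the value.
    public static string[,] Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var labels = doc.DocumentNode.SelectNodes("//strong");
        if (labels == null) return new string[0, 2];

        var result = new string[labels.Count, 2];
        for (int i = 0; i < labels.Count; i++)
        {
            var value = labels[i].SelectSingleNode("following-sibling::text()[1]");
            result[i, 0] = labels[i].InnerText.Trim();
            // Strip the leading ":" and surrounding whitespace from the value.
            result[i, 1] = value == null
                ? ""
                : value.InnerText.Trim().TrimStart(':').Trim();
        }
        return result;
    }
}
```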
<strong> is a common tag, so here is something more specific to the sample format you provided:
var html = @"
<div>
<strong>First name</strong><em>italic</em>: Fake<br>
<strong>Bold</strong> <a href='#'>hyperlink</a><br>.
<strong>bold</strong>
<strong>bold</strong> <br>
text
</div>
<div>
<strong>Title</strong>: Mr<BR>
<strong>First name</strong>: Fake<br>
<strong>Surname</strong>: Guy<br>
</div>";
var document = new HtmlDocument();
document.LoadHtml(html);
// 1. <strong>
var strong = document.DocumentNode.SelectNodes("//strong");
if (strong != null)
{
foreach (var node in strong.Where(
// 2. followed by non-empty text node
x => x.NextSibling is HtmlTextNode
&& !string.IsNullOrEmpty(x.NextSibling.InnerText.Trim())
// 3. followed by <br>
&& x.NextSibling.NextSibling is HtmlNode
&& x.NextSibling.NextSibling.Name.ToLower() == "br"))
{
Console.WriteLine("{0} {1}", node.InnerText, node.NextSibling.InnerText);
}
}
I have an XML element like this:
string markup = @"<a href='#'>
<span>
outer content
<span>inner content</span>
</span>
</a>";
XElement element = XDocument.Parse(markup).Root;
I want to add brackets to the outer span so it becomes:
<a href='#'>
<span>
(outer content
<span>inner content</span>)
</span>
</a>
I tried modifying the Value property, but that strips away the inner element and replaces it with plain text:
element.Element("span").Value = "(" + element.Element("span").Value + ")";
You would need to replace the child nodes with the existing nodes plus your text on either side. Something approximately like this:
var span = element.Element("span");
span.ReplaceNodes(
new XText("("),
span.Nodes(),
new XText(")"));
It will get a little trickier if the whitespace must match what you've specified. You'd have to iterate through span.Nodes() to work out where to insert your XText nodes.
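A rough sketch of that whitespace-aware variant follows. It assumes the markup is parsed with LoadOptions.PreserveWhitespace, and BracketHelper and AddBrackets are hypothetical names. The brackets are placed directly around the first and last nodes that are not whitespace-only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class BracketHelper
{
    // A node counts as significant unless it is a whitespace-only text node.
    static bool IsSignificant(XNode n)
    {
        return !(n is XText t) || t.Value.Trim().Length > 0;
    }

    public static void AddBrackets(XElement target)
    {
        var first = target.Nodes().FirstOrDefault(n => IsSignificant(n));
        if (first == null) return; // nothing to bracket

        if (first is XText firstText)
        {
            // Insert "(" after the leading whitespace of the text node.
            int i = firstText.Value.Length - firstText.Value.TrimStart().Length;
            firstText.Value = firstText.Value.Insert(i, "(");
        }
        else
        {
            first.AddBeforeSelf(new XText("("));
        }

        var last = target.Nodes().Last(n => IsSignificant(n));
        if (last is XText lastText)
        {
            // Insert ")" before the trailing whitespace of the text node.
            int j = lastText.Value.TrimEnd().Length;
            lastText.Value = lastText.Value.Insert(j, ")");
        }
        else
        {
            last.AddAfterSelf(new XText(")"));
        }
    }
}
```

It could be called as BracketHelper.AddBrackets(element.Element("span")), with element.ToString(SaveOptions.DisableFormatting) used afterwards to write the result without re-indenting it.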
As an aside, there exists XElement.Parse, so your parsing could be written as:
var element = XElement.Parse(markup);
For the VB'ers that might come across this.
Dim markup As XElement
markup = <a href='#'>
<span>
outer content
<span>inner content</span>
</span>
</a>
Dim newmarkup As XElement = New XElement(markup)
newmarkup.<span>.DescendantNodes.Remove()
newmarkup.<span>.Value = "("
For Each el As XNode In markup.<span>.Nodes
newmarkup.<span>.Nodes.LastOrDefault.AddAfterSelf(el)
Next
newmarkup.<span>.Nodes.LastOrDefault.AddAfterSelf(")")
I'm using Html Agility Pack to take some data from a website, and I've run into a problem. I want to get some data from this div:
<div class="container middle">
<div class="details clearfix">
<dl>
<dt>Gara</dt>
<dd>Super League</dd>
<dt>Data</dt>
<dd><span class='timestamp' data-value='1467459300' data-format='d mmmm yyyy'>2 luglio 2016</span></dd>
<dt>Game week</dt>
<dd>15</dd>
<dt>calcio di inizio</dt>
<dd>
<span class='timestamp' data-value='1467459300' data-format='HH:MM'>13:35</span>
(<span class="game-minute">FP'</span>)
</dd>
</dl>
</div>
The problem is that there are two divs with the classes container middle and details clearfix, and I want the content of only the specific div pasted above. This div has a dl tag containing a dt/dd pair for each value.
This is my code:
var url = "http://it.soccerway.com/matches/2016/07/02/china-pr/csl/henan-jianye/beijing-guoan-football-club/2207361/";
var doc = new HtmlDocument();
doc.LoadHtml(new WebClient().DownloadString(url));
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode("//div[@class='container middle']");
This returns the wrong result, namely:
<div class="container middle">
<h3 class="thick scoretime score-orange">
0 - 0
</h3>
this is the complete source code.
Well, for this particular web page you could do the following:
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
Console.WriteLine(matchDetails[1].InnerHtml);
and then work with the HtmlNode via matchDetails[1]. To retrieve other data you can use similar XPath queries, like:
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectNodes(".//div[@class='container middle']");
var dl = matchDetails[1].SelectSingleNode(".//dl");
var dt = dl.SelectNodes(".//dt");
var dd = dl.SelectNodes(".//dd");
for (int i = 0; i < dt.Count; i++) {
var name = dt[i].InnerHtml;
var value = dd[i].InnerHtml;
Console.WriteLine(name + ": " + value);
}
Of course, you should add null checks: SelectSingleNode and SelectNodes return null when nothing matches, and matchDetails may contain fewer than two elements.
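A defensive version of the loop above might look like the following sketch. MatchDetailsReader and Read are hypothetical names, and it targets the div with class details clearfix directly instead of indexing into matchDetails, which is an assumption about the page layout.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class MatchDetailsReader
{
    // Returns the dt/dd pairs of the first matching <dl>, or an empty list.
    public static List<KeyValuePair<string, string>> Read(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dl = doc.DocumentNode.SelectSingleNode("//div[@class='details clearfix']//dl");
        if (dl == null) return pairs; // the div or its <dl> is missing

        var dt = dl.SelectNodes("./dt");
        var dd = dl.SelectNodes("./dd");
        if (dt == null || dd == null) return pairs; // no terms or no values

        // Guard against an unequal number of <dt> and <dd> elements.
        int count = Math.Min(dt.Count, dd.Count);
        for (int i = 0; i < count; i++)
        {
            pairs.Add(new KeyValuePair<string, string>(
                dt[i].InnerText.Trim(), dd[i].InnerText.Trim()));
        }
        return pairs;
    }
}
```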
Querying for the div with class details clearfix should return the target div element. There is one crucial detail you need to be aware of, though: a . before // is needed to make the XPath relative to the context element referenced by infoDiv. Otherwise the XPath will be evaluated against the root document context (as if it were called on doc.DocumentNode instead of on infoDiv):
var infoDiv = doc.DocumentNode.SelectSingleNode("//div[@class='block_match_info real-content clearfix ']");
var matchDetails = infoDiv.SelectSingleNode(".//div[@class='details clearfix']");