In the HTML page, I need to match every innerHTML value one by one.
I wrote a regex that matches all the tags but not the innerHTML (including the BR tag), but I cannot do the opposite...
([<][^br][^<]*[>])
You can see an example at this URL: https://regex101.com/r/h9tKHj/1
On this DOM:
<li class="product-faq-item">
<p class="product-faq-title">{{XXXXXXXXXXXX1}}</p>
<div class="product-faq-container">
<p class="product-faq-text">{{XXXXXXXXXXXX2}}<br>
{{XXXXXXXXXXXX3}}
</p>
</div>
</li>
<li class="product-faq-item">
<p class="product-faq-title">{{XXXXXXXXXXXX4}}</p>
<div class="product-faq-container">
<p class="product-faq-text">{{XXXXXXXXXXXX5}}</p>
</div>
</li>
My goal is to get this:
Match 1 : {{XXXXXXXXXXXX1}}
Match 2 : {{XXXXXXXXXXXX2}}
Match 3 : {{XXXXXXXXXXXX3}}
Match 4 : {{XXXXXXXXXXXX4}}
Match 5 : {{XXXXXXXXXXXX5}}
Thanks in advance for your help !
Have a nice day,
Anthony,
If you want to replace each {{key}} with a value, you could do it like this:
var input = @"<li class='product-faq-item'>
<p class='product-faq-title'>{{XXXXXXXXXXXX1}}</p>
<div class='product-faq-container'>
<p class='product-faq-text'>{{XXXXXXXXXXXX2}}<br>
{{XXXXXXXXXXXX3}}
</p>
</div>
</li>
<li class='product-faq-item'>
<p class='product-faq-title'>{{XXXXXXXXXXXX4}}</p>
<div class='product-faq-container'>
<p class='product-faq-text'>{{XXXXXXXXXXXX5}}</p>
</div>
</li>";
var regex = new Regex("{{.*?}}");
var dic = new Dictionary<string, object>();
dic["XXXXXXXXXXXX1"] = "X1Val";
dic["XXXXXXXXXXXX2"] = "X2Val";
dic["XXXXXXXXXXXX3"] = "X3Val";
dic["XXXXXXXXXXXX4"] = "X4Val";
dic["XXXXXXXXXXXX5"] = "X5Val";
var output = regex.Replace(input, match => $"{dic[match.Value.Replace("{", "").Replace("}", "")]}");
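If the goal is only to list the placeholders (Match 1 to Match 5 in the question) rather than to replace them, the same kind of pattern can be used with Regex.Matches. This is a minimal sketch, assuming every value of interest has the {{...}} form shown in the sample; the names PlaceholderExtractor and Extract are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderExtractor
{
    // The braces are escaped so they are read as literal characters.
    // ".*?" is lazy, so each match ends at the first closing "}}".
    private static readonly Regex Placeholder = new Regex(@"\{\{.*?\}\}");

    // Returns every {{...}} placeholder in the order it appears in the input.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in Placeholder.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened version of the markup from the question.
        string html = "<p class='product-faq-text'>{{XXXXXXXXXXXX2}}<br>\n{{XXXXXXXXXXXX3}}\n</p>";
        foreach (string value in Extract(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

This only works because the placeholders have a fixed, recognizable shape. For arbitrary innerHTML an HTML parser is usually the safer choice.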
Related
I'm using the Html Agility Pack to extract some data from the following website:
<div class="pull-right">
<ul class="list-inline">
<li class="social">
<a target="_blank" href="https://www.facebook.com/wsat.a?ref=ts&fref=ts" class="">
<i class="icon fa fa-facebook" aria-hidden="true"></i>
</a>
</li>
<li class="social">
<a target="_blank" href="https://twitter.com/wsat_News" class="">
<i class="icon fa fa-twitter" aria-hidden="true"></i>
</a>
</li>
<li>
<a href="/user" class="hide">
<i class=" icon fa fa-user" aria-hidden="true"></i>
</a>
</li>
<li>
<a onclick="ga('send', 'event', 'PDF', 'Download', '');" href="https://wsat.com/pdf/issue15170/index.html" target="_blank" class="">
PDF
<i class="icon fa fa-file-pdf-o" aria-hidden="true"></i>
</a>
</li>
I've managed to write this code to extract the first link in the HTML, which is https://www.facebook.com/wsat. However, what I want is to extract the link to the PDF, which is https://wsat.com/pdf/issue15170/index.html, but I've had no luck. How do I specify which link to extract?
var url = "https://wsat.com/";
var HttpClient = new HttpClient();
var html = await HttpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var links = htmlDocument.DocumentNode.Descendants("div").Where(node => node.GetAttributeValue("class", "").Equals("pull-right")).ToList();
var alink = links.First().Descendants("a").FirstOrDefault().ChildAttributes("href")?.FirstOrDefault().Value;
await Launcher.OpenAsync(alink);
Use an XPath expression as a selector:
var alink = htmlDocument.DocumentNode
.SelectSingleNode("//li/a[contains(@onclick, 'PDF')]")
.GetAttributeValue("href", "");
Explanation of xpath (as requested):
Match li tag at any depth in the document with an immediate child a tag, which has an attribute onclick that contains the string 'PDF'.
In your query, Descendants("a") selects all the links in the root div, and the FirstOrDefault() that follows returns just the first one. What you can do instead is map every link to its href and then use string operations over the collection to find the right one.
var alink = links.First().Descendants("a")
.Select(node => node.ChildAttributes("href").FirstOrDefault()?.Value)
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
foreach (var l in alink)
{
Console.WriteLine(l);
}
Console.WriteLine();
var wsatCom = alink.FirstOrDefault(s => s.StartsWith("https://wsat.com"));
Console.WriteLine(wsatCom);
In addition, the ?. operator is needed after FirstOrDefault(), not before it, if you want to handle links without an href. I believe that in that case ChildAttributes("href") returns an empty collection, FirstOrDefault() returns null, and you get a NullReferenceException.
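The placement of ?. can be illustrated without any HTML at all. The sketch below uses only System.Linq on a plain list of strings as a stand-in for the attribute collection; NullConditionalDemo and FirstUpperOrNull are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NullConditionalDemo
{
    // Returns the first value in upper case, or null when the sequence is empty.
    public static string FirstUpperOrNull(IEnumerable<string> values)
    {
        // FirstOrDefault() yields null for an empty sequence of strings.
        // Placing "?." after it short-circuits the member access instead of
        // throwing a NullReferenceException.
        return values.FirstOrDefault()?.ToUpperInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(FirstUpperOrNull(new List<string> { "href-value" }) ?? "(null)");
        Console.WriteLine(FirstUpperOrNull(new List<string>()) ?? "(null)");
    }
}
```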
Could a regex help you here? I think it would be a lot easier than using the Html Agility Pack to traverse the links, and it feels a lot less like a lucky shot.
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"https:\/\/wsat\.com\/[\w\-\.]+[^#?\s][^""]+";
string input = @"<div class=""pull-right"">
<ul class=""list-inline"">
<li class=""social"">
<a target=""_blank"" href=""https://www.facebook.com/wsat.a?ref=ts&fref=ts"" class="""">
<i class=""icon fa fa-facebook"" aria-hidden=""true""></i>
</a>
</li>
<li class=""social"">
<a target=""_blank"" href=""https://twitter.com/wsat_News"" class="""">
<i class=""icon fa fa-twitter"" aria-hidden=""true""></i>
</a>
</li>
<li>
<a href=""/user"" class=""hide"">
<i class="" icon fa fa-user"" aria-hidden=""true""></i>
</a>
</li>
<li>
<a onclick=""ga('send', 'event', 'PDF', 'Download', '');"" href=""https://wsat.com/pdf/issue15170/index.html"" target=""_blank"" class="""">
PDF
<i class=""icon fa fa-file-pdf-o"" aria-hidden=""true""></i>
</a>
</li>";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
For this kind of job I'd recommend using AngleSharp.
It allows you to use CSS selectors to select whatever element you need.
var doc = new HtmlParser().ParseDocument(myHtml);
// Each <a> is the only child of its <li>, so nth-child goes on the li, not on the a.
var pdfUrl = doc.QuerySelector("ul.list-inline li:nth-child(4) a").GetAttribute("href");
or
var links = doc.QuerySelectorAll("ul.list-inline a").Where(a=> a.GetAttribute("href").StartsWith("https://wsat.com/pdf/")).ToList();
A bonus is that you can always test your selector in any browser's developer console without having to write or compile any C#.
I'm trying to select the first a tag and get its href value. The problem is that I'm using SelectNodes.
Here is the HTML I want to select an href value from:
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
As you can see, I have to select the href value of the first a tag several times (once per li), and then I will use foreach.
The HTML I want to get the value from is:
<a class="img" href="link1"></a>
My code:
var documentx = new HtmlWeb().Load(post.ExtLink);
var div = documentx.DocumentNode.SelectNodes("//div[@id='content']/*//ul[@class='list']//li");
var test = div.Descendants("a")
.Select(a => a.GetAttributeValue("href", null))
.Where(s => !String.IsNullOrEmpty(s))
.ToList();
My code works fine, but it gets the values of all the a tags, and I am only looking for the href value of the first a tag.
Change
.Where(s=> !String.IsNullOrEmpty(s))
To
.FirstOrDefault(s=> !String.IsNullOrEmpty(s))
And remove the .ToList() at the end.
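Note that FirstOrDefault on the flattened sequence yields a single href for the whole list. If what is wanted is the first link of each li, one option is to query relative to each li inside a loop. This is a sketch only, assuming the HtmlAgilityPack NuGet package is referenced; the markup is shortened from the question and FirstLinkPerItem is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FirstLinkPerItem
{
    // Returns the href of the first <a> found inside each <li>.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null) return hrefs;

        foreach (HtmlNode li in items)
        {
            // The leading dot keeps the search inside the current <li>;
            // SelectSingleNode returns the first match in document order.
            HtmlNode first = li.SelectSingleNode(".//a[@href]");
            if (first != null)
            {
                hrefs.Add(first.GetAttributeValue("href", ""));
            }
        }
        return hrefs;
    }

    public static void Main()
    {
        string html =
            "<ul>" +
            "<li><a class=\"img\" href=\"link1\"></a><div class=\"m_text\"><a class=\"title\" href=\"link2\">A Story</a></div></li>" +
            "<li><a class=\"img\" href=\"link3\"></a><div class=\"m_text\"><a class=\"title\" href=\"link4\">A Story</a></div></li>" +
            "</ul>";

        foreach (string href in Extract(html))
        {
            Console.WriteLine(href);
        }
    }
}
```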
There is a website (Evite to be exact) that has a list of attendees for an event I created. Is there a way to get a list of the names of people contained in an unordered list? The actual info I'm trying to get here is the "Some Name" text from each list item. The html looks something like this:
<ul>
<li class="group-replies yes"
id="button_group_replies_yes">
<h4 class="guest-list-group ">Yes (75)</h4>
<div class="arrow"></div>
<div class="guest-list-panel">
<ul>
<li class="host " data-guestid="">
<a class="profile-link" href="/profile/public/00B6AAQZXGK5ZYADLKASDKLR5OASKE">
<div class="avatar small "
data-letters="AS"
data-disk="5"
data-key="00B6AAAWDGK5ZYAD3OEPAHCPASDWWQKE"
data-size="small"
href="javascript:void(0);"
>
<span class="avatar-badge"></span>
</div>
<div class="wrapper">
<span class="username">Some Name
<span class="badge">Host</span>
</span>
</div>
</a>
<div class="profile-hover">
<div class="divet"></div>
<div class="contents">
<div class="meta">
<p class="timestamp">
<span class="left">Replied 135 days ago</span>
</p>
<p class="guests">
<span class="adults">
1 guest
</span>
</p>
</div>
</div>
</div>
</li>
I've tried using HTML agility pack, but I wasn't able to efficiently get the list of names without first finding the list, then going through multiple sets of child nodes to finally find what I was looking for. Is there a better way to do this? Thanks.
The recommended way is to use Html Agility Pack.
But if you would like to try another way, what about using a regex?
string text = File.ReadAllText(@"test.html"); // Or any other way of getting your HTML string
string pattern = "<span class=\"username\">(?<after>[\\w ]+)";
MatchCollection matches = Regex.Matches(text, pattern);
for (int i = 0; i < matches.Count; i++)
{
Console.WriteLine("Username:" + matches[i].Groups["after"].ToString());
}
To resolve this issue I think we need to use an HTML parser. There are various HTML parsers available.
I used Html Agility Pack.
https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
I want to get the link, title and price from this HTML (this is one result out of ten):
<div class="listing-item">
<div class="block item-title">
<h3 id="title">
<span style="direction: ltr" class="title">
<a xtcltype="S" xtclib="listing_list_1_title_link" href="http://dubai.dubizzle.com/motors/used-cars/ford/explorer/2013/7/1/ford-explorer-2012-new-model-expat-leaving-2/?back=ZHViYWkuZHViaXp6bGUuY29tL21vdG9ycy91c2VkLWNhcnMv&pos=1">FORD EXPLORER - 2012 - NEW MODEL - EXPAT LEAV...</a>
</span>
</h3>
<div class="price">
AED 118,000
<br>
</div>
</div>
</div>
Here is my code:
var allCarResults = rootNode.SelectNodes("//div[normalize-space(@class)='listing-item']");
foreach (var carResult in allCarResults)
{
var dataNode = carResult.SelectSingleNode(".//div[@class='block item-title']");
var carNameNode = dataNode.SelectSingleNode(".//h3/a");
string carName = carNameNode.InnerText.Trim();
}
This gives me an object reference error when getting carName. What mistake am I making here?
dataNode.SelectSingleNode(".//h3/a"); tries to select an <a> node directly under an <h3> that is somewhere under dataNode.
However, in your case there is a <span> in between, so the query returns null. Use dataNode.SelectSingleNode(".//h3//a"); (note the // between h3 and a) to get an <a> node anywhere below an <h3>.
I would like to read the nodes in the collection, but when I iterate and call SelectSingleNode I keep getting the same object; only node.Id changes...
What I am trying to do is read the web response of a given site and pick out some information, such as values and links, from specific elements.
int offSet = 0;
string address = "http://www.testsite.de/ergebnisliste.html?offset=" + offSet;
HtmlWeb web = new HtmlWeb();
//web.OverrideEncoding = Encoding.UTF8;
HtmlDocument doc = web.Load(address);
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@itemtype='http://schema.org/Posting']");
foreach (HtmlNode node in collection) {
string id = HttpUtility.HtmlDecode(node.Id);
string cpname = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='name']").InnerText);
string cptitle = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='title']").InnerText);
string cpaddress = HttpUtility.HtmlDecode(node.SelectSingleNode("//span[@itemprop='addressLocality']").InnerText);
string date = HttpUtility.HtmlDecode(node.SelectSingleNode("//div[@itemprop='datePosted']").InnerText);
string link = "http://www.testsite.de" + HttpUtility.HtmlDecode(node.SelectSingleNode("//div[@class='h3 title']//a[@href]").GetAttributeValue("href", "default"));
}
For example, this is the HTML for one iteration:
<div id="66666" itemtype="http://schema.org/Posting">
<div>
<a>
<img />
</a>
</div>
<div>
<div class="h3 title">
<a href="/test.html" title="Test">
<span itemprop="title">Test</span>
</a>
</div>
<div>
<span itemprop="name">TestName</span>
</div>
</div>
<div>
<div>
<div>
<div>
<span itemprop="address">Test</span>
</div>
<span>
<a>
<span><!-- --></span>
<span></span>
</a>
</span>
</div>
</div>
<div itemprop="date">
<time datetime="2013-03-01">01.03.13</time>
</div>
</div>
Writing
node.SelectSingleNode("//span[@itemprop='name']").InnerText
is the same as writing
doc.DocumentNode.SelectSingleNode("//span[@itemprop='name']").InnerText
To do what you want, write it like this: node.SelectSingleNode(".//span[@itemprop='name']").InnerText.
The leading dot (period) makes the search start at the current node, which is node, instead of at the document root.
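The difference between the two expressions can be seen side by side in a small program. This is a sketch under stated assumptions: the HtmlAgilityPack NuGet package is referenced, the two-posting markup is invented for the example, and RelativeXPathDemo and NamesPerPosting are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Evaluates nameXPath once per posting node and collects the text found.
    public static List<string> NamesPerPosting(string html, string nameXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var names = new List<string>();
        HtmlNodeCollection postings =
            doc.DocumentNode.SelectNodes("//div[@itemtype='http://schema.org/Posting']");
        if (postings == null) return names; // SelectNodes returns null when nothing matches

        foreach (HtmlNode posting in postings)
        {
            HtmlNode span = posting.SelectSingleNode(nameXPath);
            if (span != null) names.Add(span.InnerText.Trim());
        }
        return names;
    }

    public static void Main()
    {
        // Invented sample with two postings (not the real site's markup).
        string html =
            "<div id='1' itemtype='http://schema.org/Posting'><span itemprop='name'>First</span></div>" +
            "<div id='2' itemtype='http://schema.org/Posting'><span itemprop='name'>Second</span></div>";

        // Absolute path: searched from the document root on every iteration,
        // so the first posting's name is returned each time.
        Console.WriteLine(string.Join(", ", NamesPerPosting(html, "//span[@itemprop='name']")));

        // Relative path: searched inside the current posting only.
        Console.WriteLine(string.Join(", ", NamesPerPosting(html, ".//span[@itemprop='name']")));
    }
}
```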