C#: Getting a list of names from a website?

There is a website (Evite to be exact) that has a list of attendees for an event I created. Is there a way to get a list of the names of people contained in an unordered list? The actual info I'm trying to get here is the "Some Name" text from each list item. The html looks something like this:
<ul>
<li class="group-replies yes"
id="button_group_replies_yes">
<h4 class="guest-list-group ">Yes (75)</h4>
<div class="arrow"></div>
<div class="guest-list-panel">
<ul>
<li class="host " data-guestid="">
<a class="profile-link" href="/profile/public/00B6AAQZXGK5ZYADLKASDKLR5OASKE">
<div class="avatar small "
data-letters="AS"
data-disk="5"
data-key="00B6AAAWDGK5ZYAD3OEPAHCPASDWWQKE"
data-size="small"
href="javascript:void(0);"
>
<span class="avatar-badge"></span>
</div>
<div class="wrapper">
<span class="username">Some Name
<span class="badge">Host</span>
</span>
</div>
</a>
<div class="profile-hover">
<div class="divet"></div>
<div class="contents">
<div class="meta">
<p class="timestamp">
<span class="left">Replied 135 days ago</span>
</p>
<p class="guests">
<span class="adults">
1 guest
</span>
</p>
</div>
</div>
</div>
</li>
I've tried using Html Agility Pack, but I wasn't able to get the list of names efficiently: I had to find the list first, then walk through multiple levels of child nodes to finally reach what I was looking for. Is there a better way to do this? Thanks.

The recommended approach is to use Html Agility Pack.
If you would rather do it another way, a regular expression can work for markup this simple:
using System;
using System.IO;
using System.Text.RegularExpressions;

string text = File.ReadAllText(@"test.html"); // Or obtain the HTML string any other way
string pattern = "<span class=\"username\">(?<after>[\\w ]+)"; // Captures letters, digits and spaces after the opening tag
MatchCollection matches = Regex.Matches(text, pattern);
for (int i = 0; i < matches.Count; i++)
{
    Console.WriteLine("Username:" + matches[i].Groups["after"].Value);
}

I think an HTML parser is the right tool for this. Several HTML parsers are available;
I used Html Agility Pack. For a comparison of parsers, see:
https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
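As a rough sketch of the parser approach recommended above: one XPath query can jump straight to every span with class "username", so there is no need to walk the nested lists by hand. The helper name GetGuestNames and the shortened sample markup are assumptions made for illustration, and the code assumes C# 9+ top-level statements with the HtmlAgilityPack NuGet package installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not from the original posts): returns the text that sits
// directly inside each <span class="username">, skipping nested spans such as the badge.
static List<string> GetGuestNames(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null (not an empty collection) when nothing matches.
    var spans = doc.DocumentNode.SelectNodes("//span[@class='username']");
    if (spans == null) return new List<string>();

    return spans
        .Select(span => string.Concat(
            span.ChildNodes
                .Where(n => n.NodeType == HtmlNodeType.Text) // direct text children only
                .Select(n => n.InnerText)))
        .Select(text => HtmlEntity.DeEntitize(text).Trim())
        .Where(name => name.Length > 0)
        .ToList();
}

// Shortened sample modelled on the markup in the question.
string sample =
    "<ul><li class=\"host\"><div class=\"wrapper\">" +
    "<span class=\"username\">Some Name\n<span class=\"badge\">Host</span>\n</span>" +
    "</div></li></ul>";

var names = GetGuestNames(sample);
foreach (string name in names)
    Console.WriteLine(name);
```

Taking only the direct text nodes keeps the "Host" badge out of the result. The exact-match test on the class attribute assumes the attribute value is exactly "username", as in the question's markup.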

Related

HtmlNode Get inner text from nested span

I am trying to get information from an HTML segment. It is all going well, but I am struggling to return the trade-in value. Below is the code I have tried so far.
htmlNode.Descendants("li").Where(x => x.HasClass("trade-in-price")).Select(x => x.Descendants("span").Where(z => z.HasClass("value")).Last().InnerText);
which returns the following:
"£36.00"
I don't really want to substring this value to get the cost, as I don't think that is the best approach, but I have tried every other way I can think of and can't seem to return just the cost value.
Here is the HTML I am trying to navigate to get the desired value:
<section
class="product-item"
itemscope="itemscope">
<div>
<div class="group">
<div>
<div class="product-image"><a
href="/trade-in-sell/call-of-duty-modern-warfare-ps4"
itemprop="url"
><span><img
width="160"
height="200"
alt="Call Of Duty: Modern Warfare"
title="Show more information on Call Of Duty: Modern Warfare"
itemprop="image"
/></span></a></div>
<div class="product-categories gray">
<ul>
<li>PlayStation</li>
</ul>
</div>
<div class="product-label top-seller"><strong>modernwarfare</strong></div>
<h2 class="product-title" itemprop="name">Call Of Duty: Modern Warfare</h2>
</div>
</div>
<div class="group">
<div>
<div class="product-price">
<ul>
<li class="buy-new-price">
<span class="label">Buy new</span> <span class="value"><span class="symbol l">£</span>49.99</span>
</li>
<li class="trade-in-price">
<a href="/trade-in-sell/call-of-duty-modern-warfare-ps4">
<span class="label">Trade in</span>
<span class="value">
<span class="symbol l">
£
</span>
36.00 // I want this value here
</span>
</a>
</li>
<li class="sell-price">
<a href="/trade-in-sell/call-of-duty-modern-warfare-ps4">
<span class="label">Get cash</span>
<span class="value">
<span class="symbol l">
£
</span>
32.00
</span>
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</section>
Does anyone know where I am going wrong in my LINQ query?
I think you can use the GetDirectInnerText() method instead of the InnerText property. For me it returns only the text of the node itself, without the text of its children.
htmlNode.Descendants("li").Where(x => x.HasClass("trade-in-price")).Select(x => x.Descendants("span").Where(z => z.HasClass("value")).Last().GetDirectInnerText());
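A small sketch of how that value might then be turned into a number. GetDirectInnerText() returns the surrounding whitespace as well, so the text is trimmed before parsing. The shortened markup and the variable names are assumptions for illustration; the code assumes C# 9+ top-level statements and a HtmlAgilityPack version that includes GetDirectInnerText().

```csharp
using System;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

// Shortened copy of the trade-in list item from the question.
string html =
    "<ul><li class=\"trade-in-price\"><a href=\"#\">" +
    "<span class=\"label\">Trade in</span>" +
    "<span class=\"value\">\n<span class=\"symbol l\">\n£\n</span>\n36.00\n</span>" +
    "</a></li></ul>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Text that belongs directly to the "value" span, without the nested symbol span.
var raw = doc.DocumentNode
    .Descendants("li")
    .Where(li => li.HasClass("trade-in-price"))
    .SelectMany(li => li.Descendants("span").Where(s => s.HasClass("value")))
    .Select(s => s.GetDirectInnerText())
    .FirstOrDefault();

// Parse with the invariant culture so "36.00" is read the same on every machine.
decimal? price = null;
if (decimal.TryParse(raw?.Trim(), NumberStyles.Number, CultureInfo.InvariantCulture, out decimal parsed))
    price = parsed;

Console.WriteLine(price.HasValue
    ? price.Value.ToString("0.00", CultureInfo.InvariantCulture)
    : "(no price found)");
```

FirstOrDefault() is used instead of Last() here so that a missing element yields null rather than an exception; that is a design choice of this sketch, not part of the original answer.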

HtmlAgilityPack - How to select first a tag href while using selectnodes

I'm trying to select the first <a> tag and get its href value, but the problem is that I'm using SelectNodes.
Here is the HTML I want to select the href value from:
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
<li>
<a class="img" href="link1"></a>
<div class="m_text">
<a class="title" href="link2" rel="27418">A Story</a>
<p><span class="stars star45"></span><span class="rate">4.35</span></p>
<p class="info" title="Action"></p>
<p class="nowrap latest">A Story</span> 29</p>
</div>
</li>
As you can see, I have to select the href value of the first <a> tag multiple times, and then I will use foreach.
The element whose value I want is:
<a class="img" href="link1"></a>
My code:
var documentx = new HtmlWeb().Load(post.ExtLink);
var div = documentx.DocumentNode.SelectNodes("//div[@id='content']/*//ul[@class='list']//li");
var test = div.Descendants("a")
.Select(a => a.GetAttributeValue("href", null))
.Where(s => !String.IsNullOrEmpty(s))
.ToList();
My code works, but it gets the href values of all the <a> tags, and I am only looking for the href value of the first <a> tag.
Change
.Where(s=> !String.IsNullOrEmpty(s))
To
.FirstOrDefault(s=> !String.IsNullOrEmpty(s))
And remove the .ToList() at the end.
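Note that FirstOrDefault on the flattened sequence yields a single href for the whole page. If the goal is the first link inside each <li> (an assumption about the question's intent), a sketch like the following selects one <a> per list item instead. The sample markup, including link3 and link4, is made up for illustration, and the code assumes C# 9+ top-level statements with the HtmlAgilityPack package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Made-up sample shaped like the question's markup.
string html =
    "<div id=\"content\"><div><ul class=\"list\">" +
    "<li><a class=\"img\" href=\"link1\"></a><div class=\"m_text\">" +
    "<a class=\"title\" href=\"link2\">A Story</a></div></li>" +
    "<li><a class=\"img\" href=\"link3\"></a><div class=\"m_text\">" +
    "<a class=\"title\" href=\"link4\">Another Story</a></div></li>" +
    "</ul></div></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var items = doc.DocumentNode.SelectNodes("//div[@id='content']//ul[@class='list']/li")
            ?? Enumerable.Empty<HtmlNode>();

// For each <li>, take the first descendant <a> that has an href attribute.
List<string> firstLinks = items
    .Select(li => li.SelectSingleNode(".//a[@href]"))
    .Where(a => a != null)
    .Select(a => a.GetAttributeValue("href", string.Empty))
    .ToList();

foreach (string link in firstLinks)
    Console.WriteLine(link);
```

The leading dot in ".//a[@href]" keeps the search relative to the current <li>; without it the XPath would search the whole document again.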

HtmlAgilityPack work with structure

I'm starting work on a crawler in C#, and I heard that HtmlAgilityPack is the best solution for this.
I can't find a valid usage example, so maybe someone here can help me with my issue.
In one class I use a method to get the part of the markup I want, for example the ul with class "testable ul":
public static string GetElement(string url, string element, string type, string name)
{
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
string rate = doc.DocumentNode.SelectSingleNode("//" + element + "[@" + type + "='" + name + "']").OuterHtml;
return rate;
}
So I call:
string content = SiteMethods.GetElement(startPage, "ul", "class", "testable ul");
Then I do some background work, but in the end I load that string into HtmlAgilityPack again:
HtmlDocument html = new HtmlDocument();
html.OptionOutputAsXml = true;
html.LoadHtml(content);
HtmlNode document = html.DocumentNode;
And here I have a problem.
The structure inside the content string looks like this:
<ul class="testable ul">
<li>
<a href="http://www.veryimportant.link">
<div class="img">
<img src="http://image.so.important/">
</div>
<div class="info">
<span class="name">
NAME
</span>
<span class="price">10</span>
<span class="price2">8</span>
<span class="grade">C</span>
</div>
<p class="tips">tips</p>
</a>
</li>
<li>
<a href="http://www.veryimportant.link/2">
<div class="img">
<img src="http://image.so.important/2">
</div>
<div class="info">
<span class="name">
NAME2
</span>
<span class="price">3</span>
<span class="price2">4</span>
<span class="grade">A</span>
</div>
<p class="tips">tips2</p>
</a>
</li>
</ul>
So the questions are:
How do I get every <li> into a different object, for further processing?
Is it possible, with one simple command, to get the links http://www.veryimportant.link and http://www.veryimportant.link/2, or for example the images http://image.so.important/ and http://image.so.important/2? How do I get them?
How do I get NAME and NAME2 into a list?
Is it possible to map the whole HTML structure to a list?
Please include some examples; with them the rest of the learning will be really easy.
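One possible way to approach these questions, sketched under assumptions: select every <li>, then use XPath expressions relative to each item to read the link, image, name and price into a tuple. The helper TextOf, the tuple field names and the shortened markup are all hypothetical, and the code assumes C# 9+ top-level statements with the HtmlAgilityPack package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Shortened copy of the markup from the question.
string content =
    "<ul class=\"testable ul\">" +
    "<li><a href=\"http://www.veryimportant.link\">" +
    "<div class=\"img\"><img src=\"http://image.so.important/\"></div>" +
    "<div class=\"info\"><span class=\"name\">\nNAME\n</span><span class=\"price\">10</span></div>" +
    "</a></li>" +
    "<li><a href=\"http://www.veryimportant.link/2\">" +
    "<div class=\"img\"><img src=\"http://image.so.important/2\"></div>" +
    "<div class=\"info\"><span class=\"name\">\nNAME2\n</span><span class=\"price\">3</span></div>" +
    "</a></li>" +
    "</ul>";

// Hypothetical helper: trimmed text of the first node matching xpath, or "" if none.
static string TextOf(HtmlNode scope, string xpath) =>
    HtmlEntity.DeEntitize(scope.SelectSingleNode(xpath)?.InnerText ?? string.Empty).Trim();

var doc = new HtmlDocument();
doc.LoadHtml(content);

// One tuple per <li>; every XPath starts with "." so it stays inside that item.
var items = (doc.DocumentNode.SelectNodes("//ul/li") ?? Enumerable.Empty<HtmlNode>())
    .Select(li => (
        Link: li.SelectSingleNode(".//a")?.GetAttributeValue("href", string.Empty) ?? string.Empty,
        Image: li.SelectSingleNode(".//img")?.GetAttributeValue("src", string.Empty) ?? string.Empty,
        Name: TextOf(li, ".//span[@class='name']"),
        Price: TextOf(li, ".//span[@class='price']")))
    .ToList();

foreach (var item in items)
    Console.WriteLine($"{item.Name}: {item.Link} ({item.Image}), price {item.Price}");
```

From the resulting list, items.Select(i => i.Name) gives the names alone and items.Select(i => i.Link) the links alone. A small class with the same fields would serve equally well if the data has to travel further through the program.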

C# HTML Parse of web page

I am attempting to parse an HTML web page to get information from it. Here is a sample of the source:
<div class="market_listing_row market_recent_listing_row listing_2107979855708535333" id="listing_2107979855708535333">
<div class="market_listing_item_img_container"> <img id="listing_2107979855708535333_image" src="asdgfasdfgasgasgdasgasdgsdasgsadg" style="border-color: #D2D2D2;" class="market_listing_item_img" alt="" /> </div>
<div class="market_listing_right_cell market_listing_action_buttons">
<div class="market_listing_buy_button">
<a href="javascript:BuyMarketListing('listing', '2107979855708535333', 570, '2', '508690045')" class="item_market_action_button item_market_action_button_green">
<span class="item_market_action_button_edge item_market_action_button_left"></span>
<span class="item_market_action_button_contents">
Buy Now </span>
<span class="item_market_action_button_edge item_market_action_button_right"></span>
<span class="item_market_action_button_preload"></span>
</a>
</div>
</div>
<div class="market_listing_right_cell market_listing_their_price">
<span>
<span class="market_listing_price market_listing_price_with_fee">
0,03 pуб. </span>
<span class="market_listing_price market_listing_price_without_fee">
0,01 pуб. </span>
<br/>
</span>
Basically I need to get the part that is enclosed in:
<span class="market_listing_price market_listing_price_with_fee">
0,03 pуб. </span>
I have attempted to use HtmlAgilityPack, but can't seem to figure it out.
You can use HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode
.SelectSingleNode("//span[@class='market_listing_price market_listing_price_with_fee']");
var text = WebUtility.HtmlDecode(node.InnerText);
I figured out that you cannot just pass a URL to doc.LoadHtml, because it expects the HTML markup itself. You have to download the page first, for example with an HttpWebRequest and its response.
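An exact comparison against the whole class attribute breaks as soon as the site adds or reorders a class. As a hedged alternative, the sketch below matches a single class token with standard XPath 1.0 functions. The shortened markup and variable names are assumptions for illustration, and the code assumes C# 9+ top-level statements with the HtmlAgilityPack package.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Shortened copy of the price cell from the question.
string html =
    "<span>" +
    "<span class=\"market_listing_price market_listing_price_with_fee\">\n0,03 pуб. </span>" +
    "<span class=\"market_listing_price market_listing_price_without_fee\">\n0,01 pуб. </span>" +
    "<br/></span>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Pad the class list with spaces and look for one whole token, so the match
// does not depend on which other classes are present or in what order.
var node = doc.DocumentNode.SelectSingleNode(
    "//span[contains(concat(' ', normalize-space(@class), ' '), ' market_listing_price_with_fee ')]");

// SelectSingleNode returns null when nothing matches.
string priceWithFee = node == null
    ? string.Empty
    : WebUtility.HtmlDecode(node.InnerText).Trim();

Console.WriteLine(priceWithFee.Length > 0 ? priceWithFee : "(price not found)");
```

The padding spaces matter: without them, a plain contains(@class, 'market_listing_price') would also match the without_fee span.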

html agility pack issue in C#

I want to get the link, title and price from this HTML (this is one result out of ten):
<div class="listing-item">
<div class="block item-title">
<h3 id="title">
<span style="direction: ltr" class="title">
<a xtcltype="S" xtclib="listing_list_1_title_link" href="http://dubai.dubizzle.com/motors/used-cars/ford/explorer/2013/7/1/ford-explorer-2012-new-model-expat-leaving-2/?back=ZHViYWkuZHViaXp6bGUuY29tL21vdG9ycy91c2VkLWNhcnMv&pos=1">FORD EXPLORER - 2012 - NEW MODEL - EXPAT LEAV...</a>
</span>
</h3>
<div class="price">
AED 118,000
<br>
</div>
</div>
</div>
Here is my code:
var allCarResults = rootNode.SelectNodes("//div[normalize-space(@class)='listing-item']");
foreach (var carResult in allCarResults)
{
    var dataNode = carResult.SelectSingleNode(".//div[@class='block item-title']");
    var carNameNode = dataNode.SelectSingleNode(".//h3/a");
    string carName = carNameNode.InnerText.Trim();
}
This gives me an object reference (null reference) error when getting carName. What mistake am I making here?
dataNode.SelectSingleNode(".//h3/a"); tries to select an <a> node directly under an <h3> that is somewhere under dataNode.
However, in your case there is a <span> in between. So use dataNode.SelectSingleNode(".//h3//a"); (note the // between h3 and a) to get an <a> node anywhere below an <h3>.
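Putting the fix together, a minimal sketch that reads the title, link and price from each result, with null checks so that a missing node yields an empty string instead of an exception. The shortened markup, the example.com URL and the tuple field names are made up for illustration, and the code assumes C# 9+ top-level statements with the HtmlAgilityPack package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Shortened sample shaped like the question's markup (the URL is a placeholder).
string html =
    "<div class=\"listing-item\"><div class=\"block item-title\">" +
    "<h3 id=\"title\"><span class=\"title\">" +
    "<a href=\"http://example.com/car/1\">FORD EXPLORER - 2012</a>" +
    "</span></h3>" +
    "<div class=\"price\">\nAED 118,000\n<br>\n</div>" +
    "</div></div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var results = (doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='listing-item']")
               ?? Enumerable.Empty<HtmlNode>())
    .Select(item =>
    {
        // ".//h3//a" finds the link at any depth below the <h3>, so the <span> is no obstacle.
        var link = item.SelectSingleNode(".//h3//a");
        var price = item.SelectSingleNode(".//div[@class='price']");
        return (
            Title: HtmlEntity.DeEntitize(link?.InnerText ?? string.Empty).Trim(),
            Url: link?.GetAttributeValue("href", string.Empty) ?? string.Empty,
            Price: HtmlEntity.DeEntitize(price?.InnerText ?? string.Empty).Trim());
    })
    .ToList();

foreach (var car in results)
    Console.WriteLine($"{car.Title} | {car.Price} | {car.Url}");
```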
