I'm trying to open an HTML file, loop through the divs that match certain criteria, and then loop through the p tags within those divs that match certain criteria.
CQ dom = CQ.CreateFromFile("page.html");
CQ document_divs = dom["div"];
document_divs.Each((i, document_div) =>
{
    string divid = document_div.Id;
    if (divid.Contains("page"))
    {
        CQ page_ptags = document_div["p"];
        page_ptags.Each((j, page_ptag) =>
        {
            lblOutput.Text = page_ptag.Id;
        });
    }
});
It is selecting the divs fine, but I'm not sure how to select the p tags within a div. I know there is something wrong with this line:
CQ page_ptags = document_div["p"];
But what should I change?
Try this:
CQ page_ptags = document_div.Cq().Find("p");
When you loop through a CQ object, each element is of type IDomObject, not CQ.
That is why you need to either wrap it in a CQ object or use the native DOM methods to work with it.
I have a list on a website that stores the part number and the order number.
The list consists of several div elements, and I would like to export every part number in it.
The list looks like this:
<div class="spareValue">
<span class="label">OrderNumber:</span>
<span class="PartNumber">180011</span>
</div>
<div class="spareValue">
<span class="label">SparePartNumber:</span>
<span class="PartNumber">01002523</span>
</div>
How can I export every OrderNumber and put them into a list in C# so that I can work with the values?
There are a lot of ways to do that:
var spans = driver.FindElements(By.CssSelector("div.spareValue span"));
var nbrspans = spans.Count;
var Listordernumber = new List<string>();
for (int i = 0; i < nbrspans; i += 2)
{
    if (spans[i].GetAttribute("textContent") != "OrderNumber:") continue;
    Listordernumber.Add(spans[i + 1].GetAttribute("textContent"));
}
So Listordernumber contains the result.
If you prefer LINQ, you could use this:
string path = "//div[@class='spareValue' and ./span[text()='OrderNumber:']]/span[@class = 'PartNumber']";
var Listordernumber = driver.FindElements(By.XPath(path)).Select(s => s.GetAttribute("textContent")).ToList();
OK, you want every part number that belongs to an order number. That leads me to this XPath: find the div that has an order number in it, then find the part number element inside.
Then find them all (or none).
Finally, put them all in a neat list, selecting the part number text:
string path = "//div[@class='spareValue' and ./span[text()='OrderNumber:']]/span[@class='PartNumber']";
var elements = driver.FindElements(By.XPath(path));
var listOrderNumber = elements.Select(e => e.Text).ToList();
var els = driver.FindElements(By.XPath("//*[text()='OrderNumber:']"));
foreach (var el in els)
{
    // Go up to the parent div, then down to the sibling span that holds the value.
    var part = el.FindElement(By.XPath("./../span[@class='PartNumber']"));
    Console.WriteLine("OrderNumber: " + part.Text);
}
First, find all elements whose text is "OrderNumber:"; this excludes every other label.
Then iterate through those elements, go to each one's parent node, and find the element inside the parent whose class name is "PartNumber".
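The XPath used in these answers can be tried without a browser. The following is only a sketch: it assumes the fragment from the question is wrapped in a hypothetical <root> element so that it parses as well-formed XML (real pages usually do not), and it uses System.Xml.XPath from the standard library instead of Selenium. The names PartNumberDemo, Sample and OrderNumbers are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class PartNumberDemo
{
    // Hypothetical sample: the fragment from the question wrapped in a <root> element.
    public const string Sample = @"<root>
  <div class='spareValue'>
    <span class='label'>OrderNumber:</span>
    <span class='PartNumber'>180011</span>
  </div>
  <div class='spareValue'>
    <span class='label'>SparePartNumber:</span>
    <span class='PartNumber'>01002523</span>
  </div>
</root>";

    // Selects the PartNumber span only inside divs whose label span reads "OrderNumber:".
    public static List<string> OrderNumbers(string xml)
    {
        const string path =
            "//div[@class='spareValue' and ./span[text()='OrderNumber:']]/span[@class='PartNumber']";
        return XDocument.Parse(xml)
            .XPathSelectElements(path)
            .Select(e => e.Value)
            .ToList();
    }

    public static void Main()
    {
        // Prints: 180011
        Console.WriteLine(string.Join(", ", OrderNumbers(Sample)));
    }
}
```

Note that attribute tests in XPath are written with @ (as in @class); the second div is skipped because its label is "SparePartNumber:".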
I have a list of a tags. I want to get the a tag whose href contains a given string.
I used the code below and everything works fine.
string mainLink = "";
List<HtmlNode> dlLink = new List<HtmlNode>();
dlLink = doc.DocumentNode.SelectNodes("//div[@class='links']//a").ToList();
foreach (var item in dlLink)
{
    if (item.Attributes["href"].Value.Contains("prefile"))
    {
        mainLink = item.Attributes["href"].Value;
    }
}
But I want to write it more simply:
var dlLink = doc.DocumentNode.SelectNodes("//div[@class='link']//a").ToList().Where(x => x.Attributes["href"].Value.Contains("prefile")).ToList().ToString();
But it does not work and I get nothing.
Your foreach sets the mainLink string, but your LINQ chain calls ToString() on a List result, which returns the type name rather than a link.
Converting your code, you will have something like this:
mainLink = doc.DocumentNode.SelectNodes("//div[@class='links']//a")
    .Where(item => item.Attributes["href"].Value.Contains("prefile"))
    .Select(item => item.Attributes["href"].Value)
    .Last();
I used Select to get only the href values and Last to take the final match, as your foreach did. You may want to validate this last step and use LastOrDefault, First, etc. instead.
You can also pass the condition directly to Last or First instead of using Where:
mainLink = doc.DocumentNode.SelectNodes("//div[@class='links']//a")
    .Last(item => item.Attributes["href"].Value.Contains("prefile"))
    .Attributes["href"].Value;
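The difference between Last and LastOrDefault matters when nothing matches. A minimal sketch, using a plain list of hypothetical href strings in place of the parsed anchors (LastMatchDemo and LastMatch are illustrative names, not part of HtmlAgilityPack):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LastMatchDemo
{
    // Returns the last href containing the marker, or null when none matches.
    public static string LastMatch(IEnumerable<string> hrefs, string marker)
    {
        return hrefs.LastOrDefault(h => h.Contains(marker));
    }

    public static void Main()
    {
        // Hypothetical href values standing in for the anchors found by SelectNodes.
        var hrefs = new List<string> { "/a/prefile/1", "/b/other", "/c/prefile/2" };

        Console.WriteLine(LastMatch(hrefs, "prefile"));          // /c/prefile/2
        Console.WriteLine(LastMatch(hrefs, "missing") == null);  // True

        // By contrast, hrefs.Last(h => h.Contains("missing")) throws
        // InvalidOperationException because no element satisfies the condition.
    }
}
```

LastOrDefault trades the exception for a null check, which is usually easier to handle when the link may be absent from the page.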
I am new to web scraping and am trying to get data from a website with HtmlAgilityPack using ASP.NET C#. The HTML structure I am trying to parse is:
<li class='subsubnav' id='new-women-clothing'>
<span class='cat-name'>CLOTHING</span>
<ul>
<li>Just In</li>
<li>Exclusives</li>
<li>Dresses & Gowns</li>
<li>Coats</li>
<li>Jackets</li>
<li>Shirts & Blouses</li>
<li>Tops</li>
<li>Knitwear</li>
<li>Sweatshirts</li>
<li>Skirts & Shorts</li>
<li>Trousers</li>
<li>Jumpsuits</li>
<li>Jeans</li>
<li>Swimwear</li>
<li>Lingerie</li>
<li>Nightwear</li>
<li>Sportswear</li>
<li>Ski Wear</li>
</ul>
</li>
I am getting the parent categories (CLOTHING in this case) perfectly, but I am unable to get the elements inside the ul.
Here is my C# code:
var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.harrods.com/men/t-shirts?icid=megamenu_MW_clothing_t_shirts"));
var root = html.DocumentNode;
var nodes = root.Descendants();
var totalNodes = nodes.Count();
var dt = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("cat-name"));
foreach (var x in dt)
{
    foreach (var element in x.Descendants("ul"))
    {
        child_data.Add(new cat_childs(element.InnerText));
    }
    data.Add(new Categories(x.InnerText, child_data));
}
test.DataSource = data;
test.DataBind();
So how can I get the link and text of anchor tags inside <ul>?
If you want to base the iteration on the span with class='cat-name', then the target ul is a following sibling of the span rather than a descendant. You can use SelectNodes() to get the following-sibling elements from the current span, like so:
foreach (var x in dt)
{
    foreach (var element in x.SelectNodes("following-sibling::ul/li/a"))
    {
        child_data.Add(new cat_childs(element.InnerText));
    }
    data.Add(new Categories(x.InnerText, child_data));
}
UPDATE:
It seems that the actual problem is that the child_data variable is declared outside the outer loop, which means you keep adding items to the same child_data instance. Try declaring it inside the outer loop, right after foreach (var x in dt){. Alternatively, you can write the entire code as a LINQ expression, something like this:
var data = (from x in dt
            let child_data = x.SelectNodes("following-sibling::ul/li/a")
                              .Select(o => new cat_childs(o.InnerText))
                              .ToList()
            select new Categories(x.InnerText, child_data)
           ).ToList();
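The effect of declaring the list outside the loop can be reproduced without HtmlAgilityPack. This is a sketch under assumptions: the dictionary of category names and child names is hypothetical sample data, and SharedListDemo, Shared and PerCategory are illustrative names only.

```csharp
using System;
using System.Collections.Generic;

public static class SharedListDemo
{
    // Bug: one list instance is reused, so every category ends up with all children.
    public static Dictionary<string, List<string>> Shared(Dictionary<string, string[]> source)
    {
        var result = new Dictionary<string, List<string>>();
        var children = new List<string>();               // declared outside the loop
        foreach (var pair in source)
        {
            children.AddRange(pair.Value);
            result[pair.Key] = children;                 // same instance every time
        }
        return result;
    }

    // Fix: a fresh list per category, created inside the loop.
    public static Dictionary<string, List<string>> PerCategory(Dictionary<string, string[]> source)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (var pair in source)
        {
            var children = new List<string>(pair.Value); // new instance per iteration
            result[pair.Key] = children;
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical categories and children.
        var source = new Dictionary<string, string[]>
        {
            ["CLOTHING"] = new[] { "Coats", "Jeans" },
            ["SHOES"] = new[] { "Boots" },
        };
        Console.WriteLine(Shared(source)["CLOTHING"].Count);      // 3
        Console.WriteLine(PerCategory(source)["CLOTHING"].Count); // 2
    }
}
```

With the shared list, CLOTHING reports three children because the SHOES entry was appended to the same instance.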
Use this XPath. It finds every <li> that contains a <span> with class='cat-name', then picks all the <a> elements inside the <li> items of its <ul>.
// If the span has no influence on what you want, you can simply use:
// HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//ul/li/a");
HtmlNodeCollection hNC = htmlDoc.DocumentNode.SelectNodes("//li/span[@class='cat-name']/parent::*/ul/li/a");
foreach (HtmlNode h in hNC)
{
    Console.Write(h.InnerText + " ");
    Console.WriteLine(h.GetAttributeValue("href", ""));
}
I have a web page on which I give the user the option of writing notes. Whenever the page detects that the user is, say, abc, it pulls up that user's notes from the MEMO table.
Here is my code in Page_Load():
using (EntityMemoDataContext em = new EntityMemoDataContext())
{
    int getEntity = Int16.Parse(Session["EntityIdSelected"].ToString());
    var showMemo = from r in em.EntityMemoVs_1s
                   where r.EntityID == getEntity
                   select r.Memo;
    tbShowNote.Text = String.Join(@"<br />", showMemo);
}
tbShowNote is showing me value like this:
test<br />test1<br />test1<br />test4<br />test4
And I want it like this:
Test
Test1
Test2 ...
tbShowNote is a TextBox!
You only asked for the first memo, so that's what you got back. If you want them enumerated with each one on its own line in HTML, you could do this:
using (EntityMemoDataContext em = new EntityMemoDataContext())
{
    int getEntity1 = Int16.Parse(Session["EntityIdSelected"].ToString());
    var showMemo = from r in em.EntityMemoVs_1s
                   where r.EntityID == getEntity1
                   select new
                   {
                       r.Memo
                   };
    tbShowNote.Text = String.Join(@"<br />", showMemo);
}
The key takeaway is that if r.Memo is of type string, then the LINQ query you executed gave you back an IQueryable<string>. It's on you to decide if you want to flatten that list later.
Edit: Equiso made a good observation in that you're actually returning an IQueryable of an anonymous type, not IQueryable<string> due to the new { ... } syntax. I'd say combine his answer with mine and run with it:
var showMemo = from r in em.EntityMemoVs_1s
               where r.EntityID == getEntity1
               select r.Memo;
tbShowNote.Text = String.Join(@"<br />", showMemo);
The problem is in the select part of your LINQ query: you are wrapping your results in an anonymous type, which is why you see { Memo = test } when you call ToString(). You probably want it like this:
var showMemo = from r in em.EntityMemoVs_1s
               where r.EntityID == getEntity1
               select r.Memo;
After that showMemo will contain just strings.
It looks like your showMemo is a collection and you are then just assigning the top value? If you are putting them in one string then you need to aggregate them together.
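Because tbShowNote is a TextBox, the markup in the separator is shown as literal text rather than rendered. One possible approach, sketched here under the assumption that the TextBox is multi-line (TextMode="MultiLine"), is to join the memos with a line break instead of an HTML tag. MemoJoinDemo and JoinForTextBox are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MemoJoinDemo
{
    // Joins memo strings with the platform line break so each one lands on its own line.
    public static string JoinForTextBox(IEnumerable<string> memos)
    {
        return string.Join(Environment.NewLine, memos);
    }

    public static void Main()
    {
        // Hypothetical memo values standing in for the query result.
        var memos = new List<string> { "test", "test1", "test4" };
        Console.WriteLine(JoinForTextBox(memos));
    }
}
```

In the page this would amount to something like tbShowNote.Text = String.Join(Environment.NewLine, showMemo); if the notes were shown in a Label or Literal instead, the <br /> separator would be the appropriate choice.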
Consider the following html code:
<div id='x'><div id='y'>Y content</div>X content</div>
I'd like to extract only the content of 'x'. However, its innerText property includes the content of 'y' as well. I tried iterating over its children and all properties but they only return the inner tags.
How can I access through the IHTMLElement interface only the actual data of 'x'?
Thanks
Use something like:
function getText(el) {
    var txt = el.innerHTML;
    // replace() returns a new string; the greedy match removes everything from the first "<" to the last ">"
    return txt.replace(/<(.)*>/g, "");
}
Since el.innerHTML returns
<div id='y'>Y content</div>X content
the function getText would return
X content
Maybe this'll help.
Use the childNodes collection to return child elements and text nodes.
You need to QueryInterface (QI) IHTMLDOMNode from IHTMLElement for that.
Here is the final code as suggested by Sheng (just a part of the sample, of course):
mshtml.IHTMLElementCollection c = ((mshtml.HTMLDocumentClass)(wbBrowser.Document)).getElementsByTagName("div");
foreach (IHTMLElement div in c)
{
    if (div.className == "lyricbox")
    {
        IHTMLDOMNode divNode = (IHTMLDOMNode)div;
        IHTMLDOMChildrenCollection children = (IHTMLDOMChildrenCollection)divNode.childNodes;
        foreach (IHTMLDOMNode child in children)
        {
            Console.WriteLine(child.nodeValue);
        }
    }
}
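The same idea, keeping only the direct text-node children and skipping child elements, can be sketched with System.Xml.Linq from the standard library. This is an analogy rather than mshtml code: it assumes the markup is well-formed XML, and OwnTextDemo and OwnText are illustrative names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OwnTextDemo
{
    // Concatenates only the text nodes that are direct children of the element.
    public static string OwnText(XElement element)
    {
        return string.Concat(element.Nodes().OfType<XText>().Select(t => t.Value));
    }

    public static void Main()
    {
        var x = XElement.Parse("<div id='x'><div id='y'>Y content</div>X content</div>");
        Console.WriteLine(OwnText(x)); // X content
        Console.WriteLine(x.Value);    // Y contentX content (includes descendants)
    }
}
```

The child element y is skipped because it is an element node, not a text node, so its text never reaches the result.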
Since innerText() doesn't work with IE, there is no real way, I guess.
Maybe try solving the issue server-side by generating the content the following way:
<div id='x'><div id='y'>Y content</div>X content</div>
<div id='x-plain'>_plain X content_</div>
"Plain X content" represents your c# generated content for the element.
Now you gain access to the element by referring to getObject('x-plain').innerHTML.