CsQuery/JQuery can't get element from google search - c#

I'm trying to get the results of a Google "define word" search. According to Chrome's Inspect Element, the text I want is inside an element that looks like <div class="lr_dct_ent vmod" data-hveid="28">. I'm using this code to try to do it:
var thecq = CQ.CreateFromUrl("https://www.google.be/search?q=define+word&oq=define+word");
var please = thecq.Select(".lr_dct_ent.vmod").Text();
var work = thecq[".lr_dct_ent.vmod"].Text();
Console.WriteLine(please);
Console.WriteLine(work);
Neither of these returns anything to the console, just empty lines. If I select "div" instead of ".lr_dct_ent.vmod" I get a lot of text, some of which is the text I want, which leads me to believe that ".lr_dct_ent.vmod" is not how I'm supposed to select the div class I wanted. But according to every piece of documentation I found, it IS how I'm supposed to do it. Is Google just a special case, or am I the one who's special here?
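For what it's worth, Google often serves different markup to non-browser user agents, so the class seen in Chrome's inspector may simply not be present in the HTML that CsQuery downloads. A minimal diagnostic sketch, reusing the code above, that saves the fetched page so this can be checked:
var thecq = CQ.CreateFromUrl("https://www.google.be/search?q=define+word&oq=define+word");
// Save exactly what the server returned; search this file for "lr_dct_ent".
System.IO.File.WriteAllText("fetched.html", thecq.Render());
// 0 here means the class is absent from the served HTML, not a selector problem.
Console.WriteLine(thecq[".lr_dct_ent.vmod"].Length);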

Related

Word - Replace text by hyperlinks

I am working on an MS Word add-in that reads the content of a document and replaces every occurrence of a specific word with a hyperlink.
So far, I came up with this working algorithm.
// Initializes the Find parameters
searchRange.Find.ClearFormatting();
searchRange.Find.Forward = true;
searchRange.Find.Text = "foo";
do
{
    searchRange.Find.Execute(Wrap: Word.WdFindWrap.wdFindStop);
    if (searchRange.Find.Found)
    {
        // Creates a Hyperlink at the found location in the current document
        this.WordDocument.Hyperlinks.Add(searchRange, externalLink, link, "bar");
    }
} while (searchRange.Find.Found);
This code works; however, it can be slow on bigger documents. Thus, instead of adding hyperlinks one by one, I wanted to simply use the Find.Replacement object with the WdReplace.ReplaceAll property.
However, I cannot manage to replace my search result with a Hyperlink.
Is there a way to replace a piece of text with a hyperlink using the Replace method?
In other words, I'd like to find a way to do this:
Find.Replacement.Text = new Hyperlink(...);
On another note, I've seen that by hitting Alt + F9 in Word, we can see hyperlinks as field codes.
The code looks like this:
{ HYPERLINK \l "link" \o "Caption" }
Another solution would be to be able to set the replacement text to that string and have Word interpret it and thus create the link.
Thanks for reading.
As far as I know, fields can only be inserted programmatically, or by using CTRL-F9. There are two possible reasons for this that I see:
They are not simple text. They have two ranges, the Code and the Result, only one of which is displayed at any time.
How else would a user insert text that looks like a code but is not supposed to be one, unless there was a special mechanism to create one?
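If the goal is to end up with real link fields without Hyperlinks.Add, one route is to insert the HYPERLINK field programmatically at each found range, which is the code equivalent of CTRL-F9. A rough sketch, reusing the question's searchRange and WordDocument; the address and caption are placeholders:
// Fields.Add builds the { HYPERLINK ... } field and evaluates it immediately.
this.WordDocument.Fields.Add(
    searchRange,
    Word.WdFieldType.wdFieldHyperlink,
    "\"http://example.com\" \\o \"Caption\"", // placeholder address and screen tip
    true);                                    // preserve formatting
Note this still runs once per match, so it may not be faster than the original loop; it only shows the programmatic path to a field.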

Html Agility Pack, search through site for a specified string of words

I'm using the Html Agility Pack for this task. Basically, I've got a URL, and my program should read through the content of the HTML page at it; if it finds a given line of text (e.g. "John had three apples"), it should change a label's text to "Found it".
I tried to do it with contains, but I guess it only checks for one word.
var nodeBFT = doc.DocumentNode.SelectNodes("//*[contains(text(), 'John had three apples')]");
if (nodeBFT != null && nodeBFT.Count != 0)
    myLabel.Text = "Found it";
EDIT: Rest of my code, now with ako's attempt:
if (CheckIfValidUrl(v)) // foreach var v in a list..., checks if the URL works
{
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(v);
    try
    {
        if (doc.DocumentNode.InnerHtml.ToString().Contains("string of words"))
        {
            mylabel.Text = v;
        }
        ...
One possible option is using . instead of text(). Passing text() to the contains() function the way you did is, as you suspected, only effective when the searched text is the first direct child of the current element:
doc.DocumentNode.SelectNodes("//*[contains(., 'John had three apples')]");
On the other hand, contains(., '...') evaluates the entire text content of the current element, concatenated. So, just a heads up, the above XPath will also consider the following element, for example, as a match:
<span>John had <br/>three <strong>apples</strong></span>
If you need the XPath to only consider cases where the entire keyword is contained in a single text node, and therefore to treat the above case as a no-match, you can try this:
doc.DocumentNode.SelectNodes("//*[text()[contains(., 'John had three apples')]]");
If none of the above works for you, please post a minimal HTML snippet that contains the keyword but returns no match, so we can examine further what is causing that behavior and how to fix it.
use this:
if (doc.DocumentNode.InnerHtml.ToString().Contains("John had three apples"))
    myLabel.Text = "Found it";

I want to retrieve the ElementID that contains a known string

C# Selenium Webdriver
So I need to ensure that none of my pages (around 200 pages) contains a particular known string. Is there any way I can scan a page for the existence of this string and, if it exists, return both the ElementID of that element and the entire string?
For example my source is like:
<a id="cancel_order_lnkCancel">Cancel Order</a>
I want to search for the word 'Cancel' on the whole page (<div id="sitewrapper">) and return both
cancel_order_lnkCancel;Cancel Order
Thanks
You can use XPath to find by text. e.g.:
var element = driver.FindElement(By.XPath(string.Format("//*[contains(text(), '{0}')]", value)));
value being the string you are searching for.
Then to get the element's markup and content:
var html = element.GetAttribute("outerHTML");
var text = element.Text;
or
var text = element.GetAttribute("innerHTML");
I haven't worked with the C# binding, but you can use FindElements to get a list of all elements containing the text. You can no doubt use @Jarga's XPath. The good thing with FindElements is that it won't throw an exception when nothing matches (at least this is what happens in Java), though you have to guard GetAttribute with a null check in case an element has no id. And if you iterate over the list, you can fetch all the texts using the getText method.
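A quick C# sketch of that suggestion (hedged: driver is the question's IWebDriver, and 'Cancel' stands in for the known string):
// FindElements returns an empty collection instead of throwing when nothing matches.
var matches = driver.FindElements(By.XPath("//*[contains(text(), 'Cancel')]"));
foreach (var element in matches)
{
    var id = element.GetAttribute("id") ?? "(no id)"; // GetAttribute can return null
    Console.WriteLine(id + ";" + element.Text);       // e.g. cancel_order_lnkCancel;Cancel Order
}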

Programmatically get amount of facebook likes for a specific page

I'm building a website in ASP.NET/C# and currently I want to get the number of Facebook likes for a specific page (think of a video/article). I need this value programmatically, because I want to sort on it later, but that's a different story.
I already know the link Facebook itself provides to get this amount, which is posted below.
http://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url=%27http://www.google.com%27
Here www.google.com is the page whose likes are being counted; it can of course be changed to whichever page one needs.
Does anybody know how I can access the XML file returned by the URL posted above? I've done some research, but I can't seem to find an answer that works for me.
EDIT: I found the answer. I had to navigate through the XML a bit and modify the actual URL used. Working code is posted below.
string result;
string urlToXMLfile, currentURL;
currentURL = Globals.NavigateURL(TabId, "", "CategoryId=" + catId, "MovieId=" + Request.QueryString["MovieId"]);
urlToXMLfile = "https://api.facebook.com/method/fql.query?query=select%20%20like_count%20from%20link_stat%20where%20url=%22";
urlToXMLfile += currentURL;
urlToXMLfile += "%22";
//XDocument xdoc = XDocument.Load(urlToXMLfile);
//string test = xdoc.Descendants(XName.Get("like_count")).First().Value;
XmlDocument doc = new XmlDocument();
doc.Load(urlToXMLfile);
result = doc.FirstChild.NextSibling.InnerText;
return result;
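As an aside, the commented-out XDocument attempt above most likely failed because the FQL response puts its elements in a default XML namespace, so XName.Get("like_count") without a namespace matches nothing. A sketch of a namespace-agnostic lookup (assumes using System.Linq and using System.Xml.Linq):
XDocument xdoc = XDocument.Load(urlToXMLfile);
// Match on LocalName so the response's default namespace doesn't matter.
string likeCount = xdoc.Descendants().First(e => e.Name.LocalName == "like_count").Value;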
I had the same issue once when I worked with Selenium. I found that for me it was OK just to get the text representation of that XML and keep it as a simple string, storing the HTML body in a variable. That allowed me later to extract the count I needed via a regex or another algorithm.
I added my own answer below the question. That line of code works and returns a simple String with the number of FB likes that the page got.
I found a Selenium solution for you, try this:
string pageSource = driver.PageSource; // C# binding; the Java equivalent is driver.getPageSource()
and after you get the data, you can do something like:
// Extract the text between the two like_count elements
pattern = "(?i)(<like_count.*?>)(.+?)(</like_count>)";

HTML Agility Pack Screen Scraping XPATH isn't returning data

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath I see in Chrome DevTools and in Firebug on Firefox, and what my C# program sees.
The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND
The code I'm currently using is pretty quick and dirty...
//This function retrieves data from Digikey
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
    List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
    List<string> m_unparsedProductInfo = new List<string>();
    //Base node for part info
    string m_baseNode = @"//html[1]/body[1]/div[2]";
    //Write part info to list
    m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
    //More lines of similar form will go here for more info
    //This retrieves the Digikey PN
    foreach (HtmlNode node in m_unparsedProductInfoNodes)
    {
        m_unparsedProductInfo.Add(node.InnerText);
    }
    return m_unparsedProductInfo;
}
Although the path I'm using appears to be "correct", I keep getting null when I look at the list m_unparsedProductInfoNodes.
Any idea what's going on here? I'll also add that if I do a SelectNodes on the base node, it only returns a div whose only significant child is "cs=####", which seems to vary with browser user agent. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit insisting that my expression doesn't evaluate to a node set, but leaving it out still leaves the problem that all data past div[2] is returned as null.
Try using this XPath expression:
/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]
Using Google Chrome Developer Tools and Firebug in Firefox, it seems the webpage has 'cs' and 'rf' tags before the first table. Something like:
<cs="0">
    <rf="141">
        <table>
        ...
        </table>
    </rf>
</cs>
Here is something that might be useful for working out what is happening when you want to parse a known HTML file and you're not getting the results you expect. In this case I just did:
string xpath = "";
//In this case I'll get all cells and see what cell has the text "296-12602-1-ND"
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
if (node.InnerText.Trim() == "296-12602-1-ND")
xpath = node.XPath; //Here it is
}
Or you could just debug your application after the document loads and go through each child node until you find the node you want to get the info from. If you set a breakpoint when the InnerText is found, you can walk up through the parents and then keep looking at other nodes. I usually do that by entering commands manually in a 'watch' window and navigating the treeview to see properties, attributes and children.
Just for an update:
I switched from C# to the somewhat friendlier Python (my experience with programming is asm, C, and Python; the whole OO thing was totally new) and managed to correct my XPath issues. The tag was indeed the problem, but luckily it's unique, so a little regular expression and a removed line later I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight I'd like to hear it.
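For anyone hitting the same wall who wants to stay in C#, here is a sketch of the equivalent fix; the regex and the rawHtml variable are illustrative, not from the original code:
// Requires: using System.Text.RegularExpressions; using HtmlAgilityPack;
// Strip the nonstandard <cs="..."> / <rf="..."> wrappers before parsing,
// since element names containing '=' break XPath matching.
string cleaned = Regex.Replace(rawHtml, @"</?(cs|rf)\b[^>]*>", "");
var cleanDoc = new HtmlDocument();
cleanDoc.LoadHtml(cleaned);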
