Selenium C# WebDriver FindElements(By.LinkText) RegEx?

Is it possible to find links on a webpage by searching their text using a pattern like A-ZNN:NN:NN:NN, where N is a single digit (0-9)?
I've used Regex in PHP to turn text into links, so I was wondering if it's possible to use this sort of filter in Selenium with C# to find links that will all look the same, following a certain format.
I tried:
driver.FindElements(By.LinkText("([A-Z][0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2})")).ToList();
But this didn't work. Any advice?

In a word, no, none of the FindElement() strategies support using regular expressions for finding elements. The simplest way to do this would be to use FindElements() to find all of the links on the page, and match their .Text property to your regular expression.
Note, though, that if clicking a link navigates to a new page in the same browser window (i.e., it does not open a new browser window), you'll need to capture the exact text of all of the links you'd like to click for later use. I mention this because if you try to hold onto the element references found by your initial FindElements() call, they will be stale after you click the first one. If this is your scenario, the code might look something like this:
// WARNING: Untested code written from memory.
// Not guaranteed to be exactly correct.
// Requires: using System.Collections.Generic;
//           using System.Collections.ObjectModel;
//           using System.Text.RegularExpressions;
List<string> matchingLinks = new List<string>();

// Assume "driver" is a valid IWebDriver.
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));

// You could probably use LINQ to simplify this, but here is
// the foreach solution.
foreach (IWebElement link in links)
{
    string text = link.Text;
    // Note: Regex.IsMatch takes the input string first, then the pattern.
    if (Regex.IsMatch(text, "your regex here"))
    {
        matchingLinks.Add(text);
    }
}

foreach (string linkText in matchingLinks)
{
    IWebElement element = driver.FindElement(By.LinkText(linkText));
    element.Click();

    // Do stuff on the page navigated to.
    driver.Navigate().Back();
}
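As the comment in the code suggests, the collection loop collapses nicely into LINQ. Here is a minimal, equally untested sketch, assuming "driver" is again a valid IWebDriver and using the A-ZNN:NN:NN:NN pattern from the question (it needs using System.Linq; and using System.Text.RegularExpressions;):
// Untested LINQ version of the collection loop above.
List<string> matchingLinks = driver.FindElements(By.TagName("a"))
    .Select(link => link.Text)
    .Where(text => Regex.IsMatch(text, @"[A-Z]\d{2}:\d{2}:\d{2}:\d{2}"))
    .ToList();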

Don't use regex to parse HTML.
Use the HTML Agility Pack.
You can follow these steps:
Step 1: Use an HTML parser to extract all the links from the particular webpage and store them in a list.
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Collect all links here, e.g. link.GetAttributeValue("href", "").
}
Step 2: Use this regex to match the links in the list:
.*?[A-Z]\d{2}:\d{2}:\d{2}:\d{2}.*?
Step 3: You get your desired links.
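Putting the three steps together, a minimal sketch might look like this (the URL is a placeholder, and the regex is the one from Step 2; SelectNodes returns null when nothing matches, hence the guard):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class LinkCollector
{
    static void Main()
    {
        HtmlWeb hw = new HtmlWeb();
        HtmlDocument doc = hw.Load("http://example.com"); // placeholder URL
        Regex pattern = new Regex(@"[A-Z]\d{2}:\d{2}:\d{2}:\d{2}");

        var desiredLinks = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode link in anchors)
            {
                // Step 2: keep only links whose text matches the pattern.
                if (pattern.IsMatch(link.InnerText))
                    desiredLinks.Add(link.GetAttributeValue("href", ""));
            }
        }

        // Step 3: desiredLinks now holds the matching hrefs.
        foreach (string href in desiredLinks)
            Console.WriteLine(href);
    }
}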

Related

Html Agility Pack, search through site for a specified string of words

I'm using the Html Agility Pack for this task. Basically, I've got a URL, and my program should read through the content of the HTML page at it, and if it finds a line of text (e.g., "John had three apples"), it should change a label's text to "Found it".
I tried to do it with contains, but I guess it only checks for one word.
var nodeBFT = doc.DocumentNode.SelectNodes("//*[contains(text(), 'John had three apples')]");
if (nodeBFT != null && nodeBFT.Count != 0)
myLabel.Text = "Found it";
EDIT: Rest of my code, now with ako's attempt:
if (CheckIfValidUrl(v)) // foreach var v in a list..., checks if the URL works
{
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(v);
    try
    {
        if (doc.DocumentNode.InnerHtml.ToString().Contains("string of words"))
        {
            mylabel.Text = v;
        }
        ...
One possible option is to use "." instead of "text()". Passing text() to the contains() function the way you did is, as you suspected, effective only when the searched text is the first direct child of the current element:
doc.DocumentNode.SelectNodes("//*[contains(., 'John had three apples')]");
On the other hand, contains(., '...') evaluates the entire text content of the current element, concatenated. So, just a heads up, the above XPath will also consider the following element, for example, as a match:
<span>John had <br/>three <strong>apples</strong></span>
If you need the XPath to only consider cases when the entire keyword contained in a single text node, and therefore considers the above case as a no-match, you can try this way :
doc.DocumentNode.SelectNodes("//*[text()[contains(., 'John had three apples')]]");
If none of the above works for you, please post a minimal HTML snippet that contains the keyword but returns no match, so we can examine further what might be causing that behavior and how to fix it.
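To make the distinction concrete, here is a small self-contained sketch (the HTML snippet is the <br/> example from above):
using System;
using HtmlAgilityPack;

class ContainsDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // The <br/> splits the sentence across several text nodes.
        doc.LoadHtml("<span>John had <br/>three <strong>apples</strong></span>");

        // Matches: contains(., ...) works on the concatenated text content.
        var loose = doc.DocumentNode.SelectNodes(
            "//*[contains(., 'John had three apples')]");
        Console.WriteLine(loose != null);  // True

        // No match: no single text node holds the whole phrase.
        var strict = doc.DocumentNode.SelectNodes(
            "//*[text()[contains(., 'John had three apples')]]");
        Console.WriteLine(strict != null); // False
    }
}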
Use this:
if (doc.DocumentNode.InnerHtml.Contains("John had three apples"))
    myLabel.Text = "Found it";

Scraping HTML from Financial Statements

First attempt at learning to work with HTML in Visual Studio and C#. I am using the HTML Agility Pack library to do the parsing.
From this page, I am attempting to pull information from various places and save it as correctly formatted strings.
Here is my current code (taken from: shriek):
HtmlNode tdNode = document.DocumentNode.DescendantNodes().FirstOrDefault(n => n.Name == "td"
    && n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
    HtmlNode trNode = tdNode.ParentNode;
    foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
    {
        Console.WriteLine(node.InnerText.Trim());
        // Output:
        // Net Income
        // 265.00
        // 298.00
        // 601.00
        // 672.00
        // 666.00
    }
}
It works correctly; however, I want to pull more information and I am unsure how to search through the HTML correctly. First, I would like to be able to select these numbers from the annual data as well, not only from the quarterly data (the View option at the top of the page).
I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." at the top of each column).
Also, for future projects: does Google provide an API for this?
If you take a close look at the original input HTML source, you will see its data is organized around 6 main sections that are DIV elements with one of the following 'id' attributes: "incinterimdiv", "incannualdiv", "balinterimdiv", "balannualdiv", "casinterimdiv", "casannualdiv". Obviously, these match the Income Statement, Balance Sheet, and Cash Flow sections for quarterly or annual data.
Now, when you're scraping a site with the Html Agility Pack, I suggest you use XPath, which is the easiest way to get to any node inside the HTML without any dependency on XML, as the Html Agility Pack supports plain XPath over HTML.
XPath has to be learned, for sure, but it is very elegant because it does so many things in just one line. I know this may look old-fashioned next to the new, cool C#-oriented XLinq syntax :), but XPath is much more concise. It also lets you concentrate the bindings between your code and the input HTML in plain old strings, and avoids recompiling the code when the input source evolves (for example, when an ID changes). This makes your scraping code more robust and future-proof. You could also put the XPath bindings in an XSL(T) file, to be able to transform the HTML into the data presented as XML.
Anyway, enough digression :) Here is sample code that gets the financial data for a specific line title, and another snippet that gets all data from all lines (from one of the 6 main sections):
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");

// How to get a specific line:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements containing the given text (trimmed)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']"))
{
    Console.WriteLine("Title:" + node.InnerHtml.Trim());

    // get all following sibling TD elements
    foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
    {
        Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText also works for negative values
    }
}

// How to get all lines:
// 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
// 2) get all TABLE elements under, with the 'id' attribute set to 'fs-table'
// 3) recursively get all TD elements with the 'class' attribute set to 'lft lm'
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']"))
{
    Console.WriteLine("Title:" + node.InnerHtml.Trim());
    foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td"))
    {
        Console.WriteLine(" data:" + sibling.InnerText.Trim());
    }
}
You have two options. One is to reverse-engineer the HTML page: figure out what JavaScript code runs when you click on Annual Data, see where it gets the data from, and request that data yourself.
The second solution, which is more robust, is to use a platform such as Selenium, that actually emulates the web browser and runs JavaScript for you.
As far as I could tell, there's no CSV interface to the financial statements. Perhaps Yahoo! has one.
If you need to navigate around to get to the right page, then you probably want to look into using WatiN. WatiN was designed as an automated testing tool for web pages and drives a selected web browser to get the page. It also allows you to identify input fields and enter text in textboxes or push buttons. It's a lot like HtmlAgilityPack, so you shouldn't find it too difficult to master.
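For a flavor of what WatiN looks like, here is roughly its canonical example (the Google field names are historical and may no longer match the live page):
using System;
using WatiN.Core;

class WatinDemo
{
    [STAThread] // WatiN requires a single-threaded apartment
    static void Main()
    {
        // Drives a real browser, so JavaScript on the page actually runs.
        using (var browser = new IE("http://www.google.com"))
        {
            browser.TextField(Find.ByName("q")).TypeText("HtmlAgilityPack");
            browser.Button(Find.ByName("btnG")).Click();
            Console.WriteLine(browser.Title);
        }
    }
}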
I would highly recommend against this screen-scraping approach altogether. The HTML that Google is spitting out is likely highly volatile, so even once you solidify your parsing approach to get all of the data you need, the HTML format could change completely in a day, a week or a month, and you would need to rewrite your parsing logic.
You should try to use something more static, like XBRL.
The SEC publishes XBRL filings for each publicly traded company here: http://xbrl.sec.gov/
You can use this toolkit to work with the data programmatically: http://code.google.com/p/xbrlware/
EDIT: The path of least resistance is actually using http://www.xignite.com/xFinancials.asmx, but this service costs money.

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. For example, if the page contains "<b>cake</b>", I want "cake", not the tags.
Use the HTML Agility Pack library.
That's a very fine library for parsing HTML; for your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("your path (local or web)");
// Selects every text node under <body>; returns an HtmlNodeCollection.
var result = doc.DocumentNode.SelectNodes("//body//text()");
foreach (var node in result)
{
    string achievedText = node.InnerText; // your desired text
}
It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find the HtmlElement.InnerText property especially useful :)
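A minimal sketch of that approach, assuming a WinForms project (the URL is illustrative):
using System;
using System.Windows.Forms;

class BrowserTextDemo
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // The control has parsed the page; InnerText is the text without tags.
            Console.WriteLine(browser.Document.Body.InnerText);
            Application.Exit();
        };
        browser.Navigate("http://example.com"); // illustrative URL
        Application.Run(); // message loop so the download can complete
    }
}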
You can strip tags using regular expressions such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex("<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side-effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle HTML entities such as &amp;.
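To illustrate both caveats from footnote [2], a slightly fuller sketch would strip the tags and then decode entities; System.Net.WebUtility is one way to do the decoding:
using System;
using System.Net;
using System.Text.RegularExpressions;

class TagStripper
{
    static void Main()
    {
        string html = "<b>cake &amp; coffee</b>";

        // Crude tag removal: fine for simple markup, unreliable for full HTML.
        string noTags = Regex.Replace(html, "<.+?>", string.Empty);

        // Decode entities such as &amp; back into plain characters.
        Console.WriteLine(WebUtility.HtmlDecode(noTags)); // cake & coffee
    }
}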

HTML Agility Pack Screen Scraping XPATH isn't returning data

I'm attempting to write a screen scraper for Digikey that will allow our company to keep accurate track of pricing, part availability and product replacements when a part is discontinued. There seems to be a discrepancy between the XPath that I'm seeing in Chrome DevTools and Firebug on Firefox, and what my C# program is seeing.
The page that I'm scraping currently is http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=296-12602-1-ND
The code I'm currently using is pretty quick and dirty...
// This function retrieves data from the Digikey page.
private static List<string> ExtractProductInfo(HtmlDocument doc)
{
    List<HtmlNode> m_unparsedProductInfoNodes = new List<HtmlNode>();
    List<string> m_unparsedProductInfo = new List<string>();

    // Base node for part info
    string m_baseNode = @"//html[1]/body[1]/div[2]";

    // Write part info to list; this retrieves the Digikey PN.
    m_unparsedProductInfoNodes.Add(doc.DocumentNode.SelectSingleNode(m_baseNode + @"/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]"));
    // More lines of similar form will go here for more info.

    foreach (HtmlNode node in m_unparsedProductInfoNodes)
    {
        m_unparsedProductInfo.Add(node.InnerText);
    }
    return m_unparsedProductInfo;
}
Although the path I'm using appears to be "correct", I keep getting NULL when I look at the list "m_unparsedProductInfoNodes".
Any idea what's going on here? I'll also add that if I do a SelectNodes on the base node, it only returns a div whose only significant child is "cs=####", which seems to vary with browser user agents. If I try to use this in any way (putting /cs=0 in the path for the unidentifiable browser), it pitches a fit, insisting that my expression doesn't evaluate to a node set; leaving it out still leaves the problem that all data past div[2] is returned as NULL.
Try using this XPath expression:
/html[1]/body[1]/div[2]/cs=0[1]/rf=141[1]/table[1]/tr[1]/td[1]/table[1]/tr[1]/td[1]
Using Google Chrome Developer Tools and Firebug in Firefox, it seems like the webpage has 'cs' and 'rf' tags before the first table. Something like:
<cs="0">
  <rf="141">
    <table>
    ...
    </table>
  </rf>
</cs>
Here is something that might be useful for figuring out what is happening when you want to parse a known HTML file and you're not getting the results you expect. In this case I just did:
string xpath = "";
// In this case I'll get all cells and see which cell has the text "296-12602-1-ND".
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
    if (node.InnerText.Trim() == "296-12602-1-ND")
        xpath = node.XPath; // Here it is
}
Or you could just debug your application after the document loads, and go through each child node until you find the node you want to get the info from. If you set a breakpoint when the InnerText is found, you can walk up through the parents and then keep looking for other nodes. I usually do that by manually entering commands in a 'watch' window and navigating the tree view to see properties, attributes and children.
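In code, that manual exploration might be captured by a small helper like this (PrintAncestry is a hypothetical name, not part of the Html Agility Pack):
using System;
using HtmlAgilityPack;

static class NodeExplorer
{
    // Walks up from a found node and prints each ancestor with its XPath,
    // mirroring the 'watch window' exploration described above.
    public static void PrintAncestry(HtmlNode node)
    {
        for (HtmlNode current = node; current != null; current = current.ParentNode)
        {
            Console.WriteLine("{0}  (XPath: {1})", current.Name, current.XPath);
        }
    }
}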
Just for an update:
I switched from C# to the somewhat friendlier Python (my programming experience is asm, C, and Python; the whole OO thing was totally new to me) and managed to correct my XPath issues. The tag was indeed the problem, but luckily it's unique, so a little regular expression work and a removed line later, I was in good shape. I'm not sure why a tag like that breaks the XPath, though. If anyone has some insight, I'd like to hear it.

regular expression to parse links from html code

I'm working on a method that accepts a string (HTML code) and returns an array that contains all the links contained within it.
I've seen a few options, such as the HTML Agility Pack, but it seems a little more complicated than this project calls for.
I'm also interested in using regular expressions because I don't have much experience with them in general, and I think this would be a good learning opportunity.
My code thus far is:
WebClient client = new WebClient();
string htmlCode = client.DownloadString(p);
Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
string[] test = exp.Split(htmlCode);
but I'm not getting the results I want because I'm still working on the regular expression.
Pseudocode for what I'm looking for is "
If you are looking for a foolproof solution, regular expressions are not your answer. They are fundamentally limited and cannot be used to reliably parse out links, or other tags for that matter, from an HTML file, due to the complexity of the HTML language.
Long-winded version: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
Instead you'll need to use an actual HTML DOM API to parse out links.
Regular expressions are not the best idea for HTML. See these previous questions:
When is it wise to use regular expressions with HTML?
Regexp that matches all the text content of a HTML input
Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.
Other users may tell you "No, Stop! Regular expressions should not mix with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it's not the full story.
The truth is that regular expressions work just fine for collecting commonly formatted links. However, a better approach would be to use a dedicated tool for this type of thing, such as the HtmlAgilityPack.
If you use regular expressions, you may match 99.9% of the links, but you may miss rare, unanticipated corner cases or malformed HTML data.
Here's a function I put together that uses the HtmlAgilityPack to meet your requirements:
private static IEnumerable<string> DocumentLinks(string sourceHtml)
{
    HtmlDocument sourceDocument = new HtmlDocument();
    sourceDocument.LoadHtml(sourceHtml);
    return sourceDocument.DocumentNode
        .SelectNodes("//a[@href!='#']")
        .Select(n => n.GetAttributeValue("href", ""));
}
This function creates a new HtmlAgilityPack.HtmlDocument, loads a string containing HTML into it, and then uses the XPath query "//a[@href!='#']" to select all of the links on the page that do not point to "#". Then I use the LINQ extension Select to convert the HtmlNodeCollection into a list of strings containing the value of each href attribute, that is, where the link points.
Here's an example use:
List<string> links =
    DocumentLinks((new WebClient())
        .DownloadString("http://google.com")).ToList();
Debugger.Break();
This should be a lot more effective than regular expressions.
You could look for anything that is sort-of-like a URL for the http/https schemes. This is not HTML-proof, but it will get you things that look like http URLs, which I suspect is what you need. You can add more schemes and domains.
The regex looks for things that look like URLs "in" href attributes (not strictly).
class Program {
    static void Main(string[] args) {
        const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
        var regex = new Regex(pattern);
        var urls = new string[] {
            "href='http://company.com'",
            "href=\"https://company.com\"",
            "href='http://company.org'",
            "href='http://company.org/'",
            "href='http://company.org/path'",
        };
        foreach (var url in urls) {
            Match match = regex.Match(url);
            if (match.Success) {
                Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
            }
        }
    }
}
output:
href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org
