XPath does not recognize regex parts - C#

I am new to this and am trying to understand how to use Selenium and XPath.
string exp = "//*[@id=\"g_1_bHwVovAN\"]/td[2]";
var dateTime = chromeDriver.FindElementsByXPath(exp);
With this code I can only get one element. How can I change the "bHwVovAN" part so that I can reach all elements of that kind on the website?
string exp = "//*[@id=\"g_1_[^[0-9+]]\"]/td[2]";
var dateTime = chromeDriver.FindElementsByXPath(exp);
I tried to use a regex, but it did not work: the regex parts were not recognized. I also looked at other posts and tried to use matches(), which did not work either. How can I solve this?
Sorry if this is not written clearly or correctly; I am also new to English.

XPath 2.0 has a matches() function that would solve your issue, but Selenium doesn't support that XPath version, since browsers only implement XPath 1.0.
You can try the expression below to match elements whose id attribute starts with "g_1_":
string exp = "//*[starts-with(@id,\"g_1_\")]/td[2]";
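To see what starts-with() selects without starting a browser, here is a minimal sketch that runs the same expression through .NET's built-in XPath 1.0 engine (System.Xml.XPath) instead of Selenium. The table markup and the second id value are made up for illustration; only the XPath expression comes from the answer above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical markup standing in for the real page.
var doc = XDocument.Parse(
    "<table>" +
    "<tr id='g_1_bHwVovAN'><td>a</td><td>10:00</td></tr>" +
    "<tr id='g_1_xYz12345'><td>b</td><td>12:30</td></tr>" +
    "<tr id='other'><td>c</td><td>15:00</td></tr>" +
    "</table>");

// Second <td> of every element whose id starts with "g_1_".
var cells = doc.XPathSelectElements("//*[starts-with(@id,'g_1_')]/td[2]")
               .Select(e => e.Value)
               .ToList();

Console.WriteLine(string.Join(", ", cells)); // prints "10:00, 12:30"
```

The row with id 'other' is skipped because its id does not begin with the prefix. With Selenium the same expression string would be passed to FindElementsByXPath as in the answer.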

Related

How to find the index of the shortest word in a string using C#?

Let's say I have a string and know what word I'm looking for. The word is specifically the shortest one. How do I find its index without going through every character? String.IndexOf() obviously won't work as the same sequence of characters can be encountered in a longer word. Is there any way around this? It would be nice to know how to do this without LINQ as well.
For example, the word is "dog" and I'm looking for it in "the .. doginess, of this dog.,. is dogy.". String.IndexOf() gives me 7 (the start of "doginess"), even though that is not where the standalone word is.
I'd use Regular Expressions, where \b means 'word boundary'. Find a match, and return the index of that match.
var r = new Regex(@"\bfoo\b"); // verbatim string, otherwise "\b" is a backspace character
var result = r.Match(searchString).Index;
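Applied to the "dog" example from the question, the word-boundary approach could look like the sketch below. IndexOfWholeWord is a hypothetical helper name, and Regex.Escape is added on the assumption that the search word might contain regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: index of the first whole-word occurrence, or -1 if there is none.
static int IndexOfWholeWord(string text, string word)
{
    // \b matches between a word character and a non-word character (or the string edge).
    Match match = Regex.Match(text, @"\b" + Regex.Escape(word) + @"\b");
    return match.Success ? match.Index : -1;
}

string sentence = "the .. doginess, of this dog.,. is dogy.";
Console.WriteLine(sentence.IndexOf("dog"));             // prints 7 (inside "doginess")
Console.WriteLine(IndexOfWholeWord(sentence, "dog"));   // prints 25 (the standalone word)
```

Unlike a pattern that requires a non-alphanumeric character on each side, \b also matches at the very start and end of the string, and the returned index points at the word itself rather than at the separator before it.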
So I solved it myself. I used Regex, quite similar to the one above. Mine looks like this:
string pattern = "[^a-zA-Z0-9]+" + myWord + "[^a-zA-Z0-9]";
int index = Regex.Match(someString, pattern).Index;
Thanks for the help!

C# Selenium Extract Data from span with partial ID

I am trying to write a proper XPath expression in C# Selenium to extract an order number from a web page. Here is what I've tried so far to grab the order number shown in the screenshot. All of these have errored out on me.
var result = driver.FindElement(By.XPath("//span[@id^='order-number-'")).Text;
var result = driver.FindElement(By.XPath("//div[@id='a-column a-span7']/h5")).Text;
var result = driver.FindElement(By.XPath("//div[@id='a-column a-span7']/span[@class='a-text-bold']")).Text;
Below is the inspection from Chrome. I am trying to grab the order number, but it will not always be the same so I cannot hard code the span id.
The driver.FindElement(By.XPath("//span[@id^='order-number-'")) call would definitely match nothing, since ^= is not a valid operator in XPath. Plus, you are not closing the square brackets.
Instead, if you want to have a shorter and more readable version, use a CSS selector:
driver.FindElement(By.CssSelector("span[id^=order-number]"))
Here ^= means "starts with".
If you want to stay with XPath, use the starts-with() function:
driver.FindElement(By.XPath("//span[starts-with(@id, 'order-number-')]"))
You can try this out:
var result = driver.FindElement(By.XPath("//span[contains(@id, 'order-number-')]")).Text;
It uses contains() on the span ID. Let me know if this helps.

Regular expression for filenames that doesn't exclude whitespaces

I have been using this regular expression to extract file names out of file path strings:
Regex r = new Regex(@"\w+[.]\w+$+");
This works, as long as there is no space in the file name. For example:
r.Match(@"c:\somestuff\myfile.doc").Value == "myfile.doc"
r.Match(@"c:\somestuff\my file.doc").Value == "file.doc"
I need my regular expression to give me "my file.doc", and not just "file.doc"
I tried messing around with the expression myself. In particular I tried adding \s+ after learning that that is for matching whitespaces. I didn't get the results I hoped for.
I did devise a solution just to get the job done: I started at the end of the string, went backwards until a backslash was reached. This gave me the file name in reverse order (i.e. cod.elifym) into an array of chars, then I used Array.Reverse() to turn it around. However I'd like to learn how to achieve this by simply modifying my original regular expression.
Does it have to be a regular expression? Use System.IO.Path.GetFileName() instead.
Regex r = new Regex(@"[\w ]+\.\w+$");
A working regex might simply look like:
[^\\]+$
Consider using:
System.IO.Path.GetFileName(path)
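As a quick check of the two regex suggestions above, here is a small sketch that runs both against the path from the question. The variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string path = @"c:\somestuff\my file.doc";

// Suggestion 1: allow spaces alongside word characters in the name part.
string byCharClass = Regex.Match(path, @"[\w ]+\.\w+$").Value;

// Suggestion 2: take everything after the last backslash.
string byLastSegment = Regex.Match(path, @"[^\\]+$").Value;

Console.WriteLine(byCharClass);   // prints "my file.doc"
Console.WriteLine(byLastSegment); // prints "my file.doc"
```

The second pattern is the more forgiving one, since it also handles names with hyphens, several dots, or no extension at all. Path.GetFileName stays the simplest choice, with the caveat that it only treats the backslash as a separator when running on Windows.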

I need help with a regular expression in order to extract a link from a string in C#

I need to extract a link from a string using regular expression in C#. I cannot use a substring method since both the letters in the string and the link may vary.
This is the link with surrounding letters:
-sv"><a href="http://sv.wikipedia.org/wiki/%C3%84pple" title="
The -sv"><a href=" part must be included in the regex or it won't be specific enough.
The regex may end at the quotation mark at the end of the link, or wherever is easiest.
I've had another suggestion as well; however, it does not include the sv part at the beginning, and the submitter couldn't make it compile:
@"]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?[^>]*?>";
Now I'm turning to you guys on Stack Overflow.
Thanks in advance!
Max
Check question:
Regex to Parse Hyperlinks and Descriptions
Parsing stuff out of html with regex is fraught with danger. Please see this classic answer which explains this with force and humour.
The problem with your question is that we don't know the context.
Are you sure the same substring won't appear twice?
Are you sure there won't be extra whitespace?
Are you sure the html will be valid? (i.e., they could forget to use "", or use '' instead)
Are you sure they won't put the title before the href?
There are lots of ways to get it wrong...
However, to answer your question, this regex pattern will work for the exact string you have pasted:
-sv"><a href="([^"]+)"
However, you won't be able to do a replace directly with that. Note the (), this is a regex capture. I'd recommend looking that up yourself, that way you won't be a newbie forever :)
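For reference, reading the capture could look like the following minimal sketch. It assumes the input is exactly the fragment pasted in the question; group 1 is whatever the parentheses matched.

```csharp
using System;
using System.Text.RegularExpressions;

// The fragment from the question, written as a regular C# string.
string html = "-sv\"><a href=\"http://sv.wikipedia.org/wiki/%C3%84pple\" title=\"";

// Same pattern as above: everything between the quotes after href= goes into group 1.
Match m = Regex.Match(html, "-sv\"><a href=\"([^\"]+)\"");
string url = m.Success ? m.Groups[1].Value : string.Empty;

Console.WriteLine(url); // prints "http://sv.wikipedia.org/wiki/%C3%84pple"
```

Groups[0] would be the whole match including the -sv"><a href=" prefix, which is why the link itself has to be read from Groups[1].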
Try using an HTML parser. Its source code is also very intuitive to learn from.
Download the library and add a reference to HtmlAgilityPack.dll. Get all your links with:
// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
List<string> listOfUrls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(@"c:\ht.html");
HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//li[@class='interwiki-sv']");
if (coll != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode li in coll)
    {
        if (li.ChildNodes.Count < 1) continue;
        HtmlNode node = li.ChildNodes.First();
        if (null == node) continue;
        HtmlAttribute att = node.Attributes["href"];
        if (null == att) continue;
        listOfUrls.Add(att.Value);
    }
}
// Now you have your listOfUrls to process.

regular expression to parse links from html code

I'm working on a method that accepts a string (HTML code) and returns an array of all the links contained within it.
I've seen a few options such as the HtmlAgilityPack, but it seems a little more complicated than this project calls for.
I'm also interested in using regular expressions because I don't have much experience with them in general, and I think this would be a good learning opportunity.
My code thus far is:
WebClient client = new WebClient();
string htmlCode = client.DownloadString(p);
Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
string[] test = exp.Split(htmlCode);
but I'm not getting the results I want because I'm still working on the regular expression.
Pseudocode for what I'm looking for is "
If you are looking for a foolproof solution, regular expressions are not your answer. They are fundamentally limited and cannot be used to reliably parse out links, or other tags for that matter, from an HTML file, due to the complexity of HTML.
Long Winded Version: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
Instead you'll need to use an actual HTML DOM API to parse out links.
Regular expressions are not the best idea for HTML.
See previous questions:
When is it wise to use regular expressions with HTML?
Regexp that matches all the text content of a HTML input
Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.
Other users may tell you "No, Stop! Regular expressions should not mix with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it's not the full story.
The truth is that regular expressions work just fine for collecting commonly formatted links. However, a better approach would be to use a dedicated tool for this type of thing, such as the HtmlAgilityPack.
If you use regular expressions, you may match 99.9% of the links, but you may miss on rare unanticipated corner cases or malformed html data.
Here's a function I put together that uses the HtmlAgilityPack to meet your requirements:
private static IEnumerable<string> DocumentLinks(string sourceHtml)
{
    HtmlDocument sourceDocument = new HtmlDocument();
    sourceDocument.LoadHtml(sourceHtml);
    // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
    return sourceDocument.DocumentNode
        .SelectNodes("//a[@href!='#']")
        ?.Select(n => n.GetAttributeValue("href", ""))
        ?? Enumerable.Empty<string>();
}
This function creates a new HtmlAgilityPack.HtmlDocument, loads a string containing HTML into it, and then uses the XPath query "//a[@href!='#']" to select all of the links on the page that do not point to "#". Then I use the LINQ extension Select to convert the HtmlNodeCollection into a sequence of strings containing the value of each href attribute, that is, where each link points.
Here's an example use:
List<string> links =
    DocumentLinks((new WebClient())
        .DownloadString("http://google.com")).ToList();
Debugger.Break();
This should be a lot more effective than regular expressions.
You could look for anything that looks like a URL with an http/https scheme. This is not HTML-proof, but it will get you things that look like HTTP URLs, which is what you need, I suspect. You can add more schemes and domains.
The regex looks for things that look like URLs "in" href attributes (not strictly).
using System;
using System.Text.RegularExpressions;

class Program {
    static void Main(string[] args) {
        // The named group "url" captures scheme and host; any path is matched but not captured.
        const string pattern = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
        var regex = new Regex(pattern);
        var urls = new string[] {
            "href='http://company.com'",
            "href=\"https://company.com\"",
            "href='http://company.org'",
            "href='http://company.org/'",
            "href='http://company.org/path'",
        };
        foreach (var url in urls) {
            Match match = regex.Match(url);
            if (match.Success) {
                Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
            }
        }
    }
}
output:
href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org
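To get an array of links out of a whole HTML string, as the question asks, Regex.Matches fits better than Regex.Split, which returns the text between matches. Below is a rough sketch on a made-up HTML snippet. Note one deliberate change from the pattern above: the path part is (/[^"']*)? instead of (/.*)?, an assumption made here so that a greedy .* cannot run across several links on the same line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input with three anchors on one line.
string html = "<a href='http://company.com'>A</a> " +
              "<a href=\"https://company.org/path\">B</a> " +
              "<a href='#'>C</a>";

// Adapted pattern: the optional path stops at the closing quote.
var regex = new Regex(@"href=[""'](?<url>https?://[^/""']*?\.(com|org|net|gov))(/[^""']*)?[""']");

string[] urls = regex.Matches(html)
                     .Cast<Match>()
                     .Select(m => m.Groups["url"].Value)
                     .ToArray();

Console.WriteLine(string.Join(", ", urls)); // prints "http://company.com, https://company.org"
```

As with the pattern above, the "url" group keeps only the scheme and host, so "/path" is dropped, and the '#' link is ignored because it has no http/https scheme.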
