How to extract a number from text between <br> and <b> - XPath - C#

I do not know how to do this, so I have no code of my own to post :/
<div class="style2 f_left">Wyprawa do <b>Tana</b><br>Czas trwania: <b>32</b> minut.<br>Szansa powodzenia: <b>75 %</b>.<br></div>
From this div I need to extract the number 32 (it is randomly generated).

XPath is an option, but since you don't post any requirement for it I suggest some other solutions.
You could use a regular expression to get the number:
<b>(\d+?)<\/b>
The answer will be in the first group.
Since you're working with HTML you could also use HtmlAgilityPack or similar solutions to step through it and get the value from there.
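As a rough sketch of the regex route in C#, the snippet below anchors the pattern on the "Czas trwania:" label instead of matching any <b> element, so that it does not depend on element order. The label-anchored pattern and the -1 "not found" marker are assumptions made for illustration, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample markup copied from the question.
string html = "<div class=\"style2 f_left\">Wyprawa do <b>Tana</b><br>Czas trwania: <b>32</b> minut.<br>Szansa powodzenia: <b>75 %</b>.<br></div>";

// Anchor on the label so the pattern cannot pick up another all-digit <b> element.
Match match = Regex.Match(html, @"Czas trwania:\s*<b>(\d+)</b>");

// Group 1 holds the digits; -1 is an arbitrary "not found" marker for this sketch.
int minutes = match.Success ? int.Parse(match.Groups[1].Value) : -1;
Console.WriteLine(minutes);
```

If the label text can vary, the looser pattern from the answer still works on this sample, because the first <b> element contains no digits.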

If you are using javascript you could do as below:
var num = parseInt($('#DivIdHere').text().match(/\d+/)[0], 10);

Just get all the text from the <div> element and take the substring between "Czas trwania:" and "minut"; there is no need for a complex XPath.
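The substring idea could look roughly like this in C#. The sketch assumes the div's text content has already been obtained (for example from an HTML parser) and is held in a hypothetical `text` variable; the nullable `duration` result is likewise just a choice made for this example.

```csharp
using System;

// Text content of the div with the tags already stripped (assumed input for this sketch).
string text = "Wyprawa do TanaCzas trwania: 32 minut.Szansa powodzenia: 75 %.";

const string startLabel = "Czas trwania:";
const string endLabel = "minut";

// Find the start label, then the first end label that follows it.
int start = text.IndexOf(startLabel, StringComparison.Ordinal);
int end = start < 0 ? -1 : text.IndexOf(endLabel, start + startLabel.Length, StringComparison.Ordinal);

int? duration = null;
if (start >= 0 && end >= 0)
{
    start += startLabel.Length;
    string between = text.Substring(start, end - start).Trim();
    if (int.TryParse(between, out int parsed)) duration = parsed;
}
Console.WriteLine(duration);
```

Checking both indexes before calling Substring avoids an exception when either label is missing from the page.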

Related

Regex against markup after XPath?

I have been searching for a solution to my problem for a while, and have been experimenting on regex101.com, but cannot find one.
The problem is that I have to select part of a string for several different inputs, so I wanted to use regular expressions to extract the wanted data from these strings.
The regular expression will come from a configuration for each string separately (since they differ).
The string below is obtained with an XPath: //body/div/table/tbody/tr/td/p[5], but I cannot dig any deeper with it to retrieve the right data. Or can I?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for { } loop around this regex to get the same output, but since this is the only regular expression that would need it (and it would mean changing how all the other data is handled), I was wondering whether it is possible to do this in a single regular expression.
Thank you for the support already so far.
Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
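To see the relative XPath at work outside the original page, here is a minimal sketch using System.Xml. It assumes the fragment is well-formed XML, so the <br> tags are written as <br/>, the fragment is wrapped in a <p> root, and the <a> attributes are trimmed; none of that is guaranteed for the asker's real document.

```csharp
using System;
using System.Xml;

// Assumption: the fragment is well-formed XML, so <br> is written as <br/> and wrapped in a <p> root.
string xml =
    "<p><strong>Kontaktdaten des Absenders:</strong><br/>" +
    "<strong>Name:</strong> Wanted data<br/>" +
    "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Same relative XPath as in the answer, evaluated from the <p> element.
var node = doc.DocumentElement?.SelectSingleNode("strong[.='Name:']/following-sibling::text()[1]");

// The text node keeps its surrounding whitespace, so trim it.
string name = node?.Value?.Trim() ?? "";
Console.WriteLine(name);
```

For real-world HTML that is not well-formed, an HTML parser that exposes XPath would be needed in place of XmlDocument.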
You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
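A small C# check of the tag-excluding pattern, as a sketch only. The input is the question's string flattened onto one line with single spaces between the pieces and with the <a> attributes shortened, which is an assumption about how the asker's string actually looks.

```csharp
using System;
using System.Text.RegularExpressions;

// The question's string flattened onto one line, with single spaces between the pieces.
string input =
    "<strong>Kontaktdaten des Absenders:</strong> <br> " +
    "<strong>Name:</strong> Wanted data <br> " +
    "<strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>";

// [^<>]* cannot cross a tag boundary, so the match stops before the next <br>.
Match m = Regex.Match(input, @"(?<=</strong> )([^<>]*)(?= <br>)");
string wanted = m.Success ? m.Groups[1].Value : "";
Console.WriteLine(wanted);
```

The original `(.*)` is greedy and happily runs across tags, which is why it returned everything up to the last <br>.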

C# Selenium Extract Data from span with partial ID

I am trying to create a proper XPath expression in C# Selenium to extract an order number from a web page. Here is what I've tried so far to grab the order number shown in the screenshot. All of these have errored out on me.
var result = driver.FindElement(By.XPath("//span[@id^='order-number-'")).Text;
var result = driver.FindElement(By.XPath("//div[@id='a-column a-span7']/h5")).Text;
var result = driver.FindElement(By.XPath("//div[@id='a-column a-span7']/span[@class='a-text-bold']")).Text;
Below is the inspection from Chrome. I am trying to grab the order number, but it will not always be the same so I cannot hard code the span id.
The driver.FindElement(By.XPath("//span[@id^='order-number-'")) call cannot match anything, since ^= is not a valid operator in XPath. Plus, you are not closing the square brackets.
Instead, if you want to have a shorter and more readable version, use a CSS selector:
driver.FindElement(By.CssSelector("span[id^=order-number]"))
Here ^= means "starts with".
If you want to stay with XPath, use the starts-with() function:
driver.FindElement(By.XPath("//span[starts-with(@id, 'order-number-')]"))
You can try this out:
var result = driver.FindElement(By.XPath("//span[contains(@id, 'order-number-')]")).Text;
It uses a "contains" on the span ID. Let me know if this helps.
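The two XPath predicates can be tried without a browser. The sketch below evaluates them with System.Xml against a hypothetical stand-in for the page; the markup, the id suffix, and the order number are invented, because the real page is only shown in a screenshot. With Selenium the same XPath strings would be passed to By.XPath.

```csharp
using System;
using System.Xml;

// Hypothetical stand-in for the page: the real markup and order number are not shown in the question.
string xml =
    "<div class='a-column a-span7'>" +
    "<h5>Order</h5>" +
    "<span id='order-number-12345' class='a-text-bold'>ORDER-0001</span>" +
    "</div>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Prefix match on the id attribute, the XPath 1.0 counterpart of the CSS [id^=...] selector.
var byPrefix = doc.SelectSingleNode("//span[starts-with(@id, 'order-number-')]");

// Substring match; it would also accept ids that merely contain the text somewhere.
var bySubstring = doc.SelectSingleNode("//span[contains(@id, 'order-number-')]");

string orderNumber = byPrefix?.InnerText ?? "";
Console.WriteLine(orderNumber);
```

starts-with() is the stricter of the two, so it is the closer equivalent of the CSS selector when the id always begins with the same prefix.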

Linq parse html string

I want to parse an html page and get a specific value from it. How can I do this using Linq or string parsing in C# ?
------------- MORE HTML ----------
<span class="date">
04.09.2012
</span>
<table cellspacing="0"><tr><th scope="row">1 EUR</th><td><span>**4,4907**</span></td><td><span class="rise">+0,0009</span></td><td><span class="rise">+0,02%</span></td></tr><tr><th scope="row">1 USD</th><td><span>3,5635</span></td><td><span class="fall">-0,0093</span></td><td><span class="fall">-0,26%</span></td></tr></table>
------------- MORE HTML ----------
I am interested in getting the value 4,4907 in bold!
Any idea how to achieve this?
Thanks!
If you only need that bit, use a regular expression. (But don't use a regular expression to parse more complex HTML.)
<td><span>4,4907</span></td>
would be matched most conveniently by the regular expression
<td><span>([0-9,]+)</span></td>
And see for example this quickly Googled page on how to use regexps with C#.
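A minimal sketch of that regular expression in C#, assuming the ** markers in the question are only the asker's emphasis and are absent from the real page. Treating the comma as a decimal separator is also an assumption, made explicit through a custom NumberFormatInfo so that the result does not depend on the machine's culture.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Row from the question, without the ** markers the asker added for emphasis.
string html = "<tr><th scope=\"row\">1 EUR</th><td><span>4,4907</span></td><td><span class=\"rise\">+0,0009</span></td></tr>";

// Only the class-less <span> fits; the "rise"/"fall" spans carry an attribute and do not match.
Match m = Regex.Match(html, @"<td><span>([0-9,]+)</span></td>");

// Assumption: the comma is a decimal separator, so parse with an explicit number format.
var format = new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = "." };
decimal rate = m.Success ? decimal.Parse(m.Groups[1].Value, NumberStyles.AllowDecimalPoint, format) : 0m;
Console.WriteLine(rate.ToString(CultureInfo.InvariantCulture));
```

Regex.Match returns only the first hit, which here is the EUR row; Regex.Matches would be needed to collect the USD rate as well.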
Be careful when trying to parse HTML.
I think the obvious way would be to load it into an XDocument (as XML) but as HTML is often ambiguous or contains syntax errors this is bound to fail.
People here on Stack Overflow have instead suggested using http://htmlagilitypack.codeplex.com/, which is said to do a great job of parsing HTML. You can then use XPath to query your document for the content you need.
You can try a regular expression in C# this way:
http://www.c-sharpcorner.com/UploadFile/prasad_1/RegExpPSD12062005021717AM/RegExpPSD.aspx
To find the string between "<span>**" and "**</span>".
Or you can use an HTML parser like "jericho" and navigate through HTML tags to reach your value.

Read specific text from page into string array in C#

I've tried this and searched for help but I cannot figure it out. I can get the source for a page but I don't need the whole thing, just one string that is repeated. Think of it like trying to grab only the titles of articles on a page and adding them in order to an array without losing any special characters. Can someone shed some light?
You can use a Regular Expression to extract the content you want from a string, such as your HTML string.
Or you can use a DOM parser such as Html Agility Pack.
Hope this helps!
You could use something like this -
// Requires: using System.Linq; using System.Text.RegularExpressions;
var text = "12 hello 45 yes 890 bye 999";
var matches = Regex.Matches(text, @"\d+").Cast<Match>().Select(m => m.Value).ToList();
The example pulls all numbers in the text variable into a list of strings. But you could change the Regular Expression to do something more suited to your needs.
If the page is well-formed XML, you could use LINQ to XML by loading the page into an XDocument and using XPath or another way of traversing to the element(s) you want, then loading what you need into the array (or just use the enumerable if all you want to do is enumerate). If the page is not under your control, though, this is a brittle solution: subtle changes could break the well-formedness of the XML at any time. If that's the case, you're probably better off using regular expressions. Either way, though, the page could be changed under you and your code would suddenly stop working.
The best thing you could do would be to get the provider of the page to expose what you need as a web service rather than trying to scrape their page.
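For the well-formed case described above, the LINQ to XML route might look like the following sketch. The page content, the "h2" element name, and the titles are all hypothetical; the question does not say which elements hold the titles.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical well-formed page; the element name "h2" and the titles are made up for illustration.
string page =
    "<html><body>" +
    "<h2>First &amp; foremost</h2><p>...</p>" +
    "<h2>Caf\u00e9 notes</h2><p>...</p>" +
    "</body></html>";

XDocument doc = XDocument.Parse(page);

// Descendants() walks the tree in document order, so the array keeps the page order.
string[] titles = doc.Descendants("h2").Select(e => e.Value).ToArray();

foreach (string title in titles) Console.WriteLine(title);
```

Because the parser decodes entities such as &amp; itself, special characters arrive in the array intact, which a plain regex over the raw source would not give you.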

How can I remove an outer <p>...</p> from a string

I want to query a string (HTML) from a database and display it on a webpage. The problem is that the data has a <p> around the text (ending with </p>).
I want to strip this outer tag in my view model or controller action that returns this data. What is the best way of doing this in C#?
Might be overkill for your needs, but if you want to parse the HTML you can use the HtmlAgilityPack - certainly a cleaner solution in general than most suggested here, although it might not be as performant:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p> around the text (ending with </p>");
string result = doc.DocumentNode.FirstChild.InnerHtml;
If you're absolutely sure the string will always have that tag, you can use String.Substring like myString.Substring(3, myString.Length-7) or so.
A more robust method would be to either manually code the appropriate tests or use a regular expression, or ultimately, use an HTML parser as suggested by BrokenGlass's answer.
UPDATE: Using regexes you could do:
String filteredString = Regex.Match(myString, "^<p>(.*)</p>").Groups[1].Value;
You could add \s* after the initial ^ to also skip leading whitespace. Also, you can check the Success property of the Match result to see if the string matched the <p>...</p> pattern at all.
If the data is always surrounded by <p> ... </p>:
string withoutParas = withParas.Substring(3, withParas.Length - 7);
Try using the string function Remove(), passing it the IndexOf() of <p> with length 3 and the LastIndexOf() of </p> with length 4.
If you are absolutely guaranteed that your string will always fit the pattern of <p>...</p>, then the other solutions using data.Substring(3, data.Length - 7) are sufficient. If, however, there's any chance that it could look at all different, then you really need to use an HTML parser. The consensus is that the HTML Agility Pack is the way to go.
s = s.Replace("<p>", String.Empty).Replace("</p>", String.Empty);
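A middle ground between the blind Substring call and a full HTML parser is to check for the wrapper before stripping it. The helper below is a sketch under that assumption; the name StripOuterParagraph is invented for this example, and it only recognises a bare <p> with no attributes.

```csharp
using System;

// Strips one outer <p>...</p> pair; input that does not fit the pattern is returned unchanged.
static string StripOuterParagraph(string html)
{
    const string open = "<p>";
    const string close = "</p>";
    string trimmed = html.Trim();

    bool wrapped = trimmed.Length >= open.Length + close.Length
        && trimmed.StartsWith(open, StringComparison.OrdinalIgnoreCase)
        && trimmed.EndsWith(close, StringComparison.OrdinalIgnoreCase);

    return wrapped
        ? trimmed.Substring(open.Length, trimmed.Length - open.Length - close.Length)
        : html;
}

Console.WriteLine(StripOuterParagraph("<p>Hello <b>world</b></p>"));
Console.WriteLine(StripOuterParagraph("no wrapper"));
```

Unlike the Replace approach, this leaves inner <p> tags alone. It would still mangle input made of two sibling paragraphs such as <p>a</p><p>b</p>, which is where an HTML parser becomes the safer choice.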
