Regex Grouping in C# - c#

I have multiple p tags in some HTML:
<p class=MsoNormal><b style='mso-bidi-font-weight:normal'><span
style='font-size:7.0pt'>PA<span style='mso-spacerun:yes'> </span>ARALIĞI</span></b><span
style='font-size:7.0pt'> [İng. <b style='mso-bidi-font-weight:normal'>PA
interval</b>]. (<i style='mso-bidi-font-style:normal'>Kardiyoloji</i>).
Atriyum’un P dalgasının başlangıcını ayıran mesafe. İntraatriyal ya da
sino-nodal iletim süresinin (35-45 milisaniye) ölçümünü verir. Uzaması ileti
bozukluğunun göstergesidir. <o:p></o:p></span></p>
<p class=MsoNormal><b style='mso-bidi-font-weight:normal'><span
style='font-size:7.0pt'>PA<span style='mso-spacerun:yes'> </span>ARALIĞI</span></b> <span
style='font-size:7.0pt'> [İng. <b style='mso-bidi-font-weight:normal'>PA
interval</b>]. (<i style='mso-bidi-font-style:normal'>Kardiyoloji</i>).
Atriyum’un P dalgasının başlangıcını ayıran mesafe. İntraatriyal ya da
sino-nodal iletim süresinin (35-45 milisaniye) ölçümünü verir. Uzaması ileti
bozukluğunun göstergesidir. <o:p></o:p></span></p>
How can I get them into a List, with each p element as a separate item? My code is:
Regex Rx = new Regex(@"<p(.*)>(.*)<\/p>", RegexOptions.Multiline);
MatchCollection mc = Rx.Matches(yazi);
Thanks

It is a really bad idea to parse HTML with regular expressions; the syntax of HTML is too complex for them.
Use an HTML parser instead; see "Looking for C# HTML parser".
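For completeness, the pattern in the question fails for two reasons: `.*` is greedy, so it runs on to the last `</p>`, and `RegexOptions.Multiline` only changes how `^` and `$` behave. It is `RegexOptions.Singleline` that lets `.` match line breaks. The following is a minimal sketch, assuming the markup is as regular as the sample (no nested or unclosed p tags); the names `ParagraphSplitter` and `Split` are made up for illustration, and a real parser remains the more robust choice.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphSplitter
{
    // Lazy quantifier so each match stops at the first closing </p>;
    // Singleline so '.' also matches the line breaks inside a paragraph.
    private static readonly Regex ParagraphRx = new Regex(
        @"<p\b[^>]*>(.*?)</p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the inner HTML of each <p> element, one list entry per element.
    public static List<string> Split(string html)
    {
        return ParagraphRx.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

With the question's variable, `ParagraphSplitter.Split(yazi)` would return one list entry per paragraph.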

Related

C# HTMLNode get correctly innerText of div

I am trying to correctly extract the InnerText of a list of divs that I am getting from a website.
This is what I came up with, but it is still a bit buggy: it loses the whitespace and the - symbol.
var first = mainmenuTitles[x].Descendants("div")
    .FirstOrDefault(o => o.GetAttributeValue("class", "") == "left")
    .Elements("a").ToList();
string final = "";
foreach (var countfirst in first)
{
    final += countfirst.InnerText;
}
Console.WriteLine("Title: " + final);
This is what the HTML looks like:
<div class="row row-tall mt4">
<div class="clear">
<div class="left">
<a href="/soccer/italy/">
<strong>Italy</strong>
</a>
-
Serie C:: group B
</div> <div class="right fs11"> March 31 </div> </div> </div>
The text I am trying to get should look like this ->
Italy - Serie C:: group B
I am not an HTML guru, so forgive me if this is too simple and I am just missing it.
You can write a query to look up all nodes with the XPath //div/a and then concatenate their inner text to get the text you are looking for. Make sure you trim the text to get rid of extra spaces and line breaks.
Console.WriteLine(string.Join(" - ", doc.DocumentNode.SelectNodes("//div/a").Select(x => x.InnerText.Trim())));
Output:
Italy - Serie C:: group B
Side note: you can use a more specific query to make sure you get the right div, for example by including the class name: .SelectNodes("//div[@class='row row-tall mt4']//a"). This will give you all the <a> tags under that div.
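In the markup from the question, the "- Serie C:: group B" part is a text node directly inside the div with class "left", not inside an <a>. One alternative is therefore to read the InnerText of that whole div and collapse the whitespace. A minimal sketch of such a helper follows; `TextCleaner` and `CollapseWhitespace` are hypothetical names, and the input is assumed to be the div's InnerText.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Collapses every run of whitespace (spaces, tabs, line breaks) into a
    // single space and trims both ends.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

It could be called as `TextCleaner.CollapseWhitespace(leftDiv.InnerText)`, where `leftDiv` stands for the node the question already finds with `FirstOrDefault`.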

find link with multiple keywords in c# with HTML Agility Pack

I am writing a program that parses a website.
I managed to find a link on the site, but I had to pass the exact InnerText to find it.
I'm looking for a way to do the same thing using only part of the inner text.
Example:
the inner text is "hi my name is"
and I want to be able to find it by passing only
"hi my"
foreach (var title in htmlNodes)
{
    if (keywords == title.SelectSingleNode("div/h1").InnerText)
    {
        if (color == title.SelectSingleNode("div/p").InnerText)
        {
            Console.WriteLine(title.SelectSingleNode("div/p/a").GetAttributeValue("href", "pas d'addresse"));
        }
    }
}
Here, keywords needs to match the inner text of div/h1 exactly. I want a partial match instead.
Here is the HTML:
<article>
<div class="inner-article">
<a style = "height:150px;" href="/shop/shirts/c712g63kx/p1us9bkh7">
<img width = "150" height="150" src="//assets.supremenewyork.com/146319/vi/qW2Nur88W30.jpg" alt="Qw2nur88w30">
</a>
<h1>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Tiger Stripe Rayon Shirt</a>
</h1>
<p>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Teal</a>
</p>
</div>
</article>
Thank you all for your answers!
I found out how to solve my problem. It was actually quite simple. Here is the code:
if (title.SelectSingleNode("div/h1").InnerText.Contains(keywords))
Now the problem is making the comparison case-insensitive.
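One common way to get a case-insensitive "contains" is `IndexOf` with `StringComparison.OrdinalIgnoreCase`, which works on all .NET versions (newer ones also offer a `Contains` overload that takes a `StringComparison`). A minimal sketch, where `TextMatch` and `ContainsIgnoreCase` are hypothetical names:

```csharp
using System;

public static class TextMatch
{
    // True when 'text' contains 'keywords', ignoring case.
    public static bool ContainsIgnoreCase(string text, string keywords)
    {
        return text.IndexOf(keywords, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In the loop above, the condition would become `TextMatch.ContainsIgnoreCase(title.SelectSingleNode("div/h1").InnerText, keywords)`.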

Searching in HTML file using C# where many similar tags exist

Imagine the following part of an HTML file:
<div class='span1 league'>
<div class='league-gold-1 leagues size-64'></div>
</div>
<div class='span4 stats'>
<div class='points'>
<span class="gold">491</span>
points
(<span class="gold">391</span> away for region #1)
</div>
<div class='games'>
Won <span class="text-success">37</span>,
lost <span class="text-error">51</span>,
ratio <span>42.05</span>%
</div>
<div class='race'>
Favorite Race:
<div class='race-terran races size-16'></div>
<span>Terran</span>
</div>
</div>
Say I need to get the number of won and lost games (37 and 51 in this case), and also the points (491 in this case). I've been trying with Html Agility Pack, but with no success so far. If you know a way to do this, please let me know!
Using HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(fname);
var won = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-success']").InnerText;
var lost = doc.DocumentNode.SelectSingleNode("//div[@class='games']/*[@class='text-error']").InnerText;
var points = doc.DocumentNode.SelectSingleNode("//div[@class='points']/*[@class='gold']").InnerText;
You can also use LINQ instead of XPath:
var won = doc.DocumentNode.Descendants("span")
    .First(s => s.Attributes.Any(a => a.Value == "text-success"))
    .InnerText;
As a workaround, you could try a regex:
Match m = Regex.Match(htmlstring, "<span class=\"text-success\">([0-9]+?)</span>.*?<span class=\"text-error\">([0-9]+?)</span>", RegexOptions.Singleline);
string won = m.Result("$1");
string loss = m.Result("$2");

Extract the contents of a string between two string delimiters using match in C#

So, say I'm parsing the following HTML string:
<html>
<head>
RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
</head>
<body>
<table class="table">
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
</table>
</body>
</html>
and I want to isolate the contents of <table> (everything inside the table element).
Now, I used regex to accomplish this:
string pageSource = /* method that extracts the HTML source and stores it in a string */;
string[] splitSource = Regex.Split(pageSource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
// the list of table members will be in memberList[0]
// method to extract links from the table
ExtractLinks(memberList[0]);
I've been looking at other ways to do this extraction, and I came across the Match object in C#.
I'm attempting to do something like this:
Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");
The purpose of the above was to hopefully extract a match value between the two delimiters, but, when I try to run it the match value is:
match.value = </table>
My question, then, is: is there a way to extract data from my string that is slightly easier, more readable, or shorter than my method using regex? For this simple example, regex is fine, but for more complex examples, I find myself with the coding equivalent of scribbles all over my screen.
I would really like to use match, because it seems like a very neat and tidy class, but I can't seem to get it working for my needs. Can anyone help me with this?
Thank you very much!
Use an HTML parser, like HTML Agility Pack:
var doc = new HtmlDocument();
using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
    doc.Load(stream);
}
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;
You can use XPath with the HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");
foreach (var ele in elements)
{
    MessageBox.Show(ele.OuterHtml);
}
You have to add parentheses to the regular expression in order to capture the content between the delimiters:
Match match = Regex.Match(pageSource, "<table class=\"members\">((.|\n)*?)</table>");
Anyways it seems that only Chuck Norris can parse HTML with regex correctly.
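If Match is still preferred for a simple case like this one, the captured text is in `match.Groups[1].Value`, while `match.Value` is the whole match including both delimiters. `RegexOptions.Singleline` lets `.` match line breaks, which replaces the `(.|\n)` construct. The following is a minimal sketch under those assumptions; it uses the class name "table" from the sample HTML, the names `TableExtractor` and `GetTableContents` are hypothetical, and nested tables would break it.

```csharp
using System.Text.RegularExpressions;

public static class TableExtractor
{
    // Returns the inner HTML of the first <table class="table"> element,
    // or null when no such table is found.
    public static string GetTableContents(string pageSource)
    {
        Match match = Regex.Match(
            pageSource,
            "<table class=\"table\">(.*?)</table>",
            RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value : null;
    }
}
```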

C# Regex: Getting URL and text from multiple "a href"-tags

I want to be able to scrape a webpage containing multiple "<a href"-tags and return a structured collection of them.
<div>
<p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
<a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
</p>
<a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
<iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>
So I want these values:
https://stackoverflow | Classic link
http://sloppy-html-5-href.com | I lovez HTML 5
/my-tribute-to-javascript.html | I also love JS
As you can see, only values in an "a href" should be caught, with both link and content within the tags. It should support all HTML 5-valid href. The href-attributes can be surrounded with any other attributes.
So I basically want a regex to fill in the following code:
public IEnumerable<Tuple<string, string>> GetLinks(string html) {
    string pattern = string.Empty; // TODO: Get solution from Stackoverflow
    var matches = Regex.Matches(html, pattern);
    foreach (Match match in matches) {
        yield return new Tuple<string, string>(
            match.Groups[0].Value, match.Groups[1].Value);
    }
}
I've always read that parsing HTML with regular expressions is evil. OK... that's surely true...
But, like evil, regexes are so much fun :)
So I'd give this one a try:
Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");
foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
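Note that the sample markup contains an unquoted attribute (`href=http://sloppy-html-5-href.com`), which a pattern that requires a quote after `href=` will skip. Below is a hedged sketch of a variant that accepts double-quoted, single-quoted, and unquoted values. It relies on .NET allowing the same group name in several alternatives; `LinkScraper` is a hypothetical class name, and this is not a complete HTML 5 attribute parser (for example, it would also accept `data-href`).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkScraper
{
    // href value: "double-quoted", 'single-quoted', or unquoted up to
    // the next whitespace or '>'.
    private static readonly Regex AnchorRx = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s>]+))[^>]*>(?<text>.*?)</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Yields (href, link text) pairs in document order.
    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorRx.Matches(html))
        {
            yield return Tuple.Create(
                match.Groups["href"].Value, match.Groups["text"].Value);
        }
    }
}
```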
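Isn't it easier to use Html Agility Pack and XPath than regex?
It would look something like this:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");
if (aNodeCollection != null)
{
    foreach (HtmlNode node in aNodeCollection)
    {
        string href = node.GetAttributeValue("href", "");
        string text = node.InnerText;
    }
}
This is only a rough, untested sketch.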
