I have a webpage. When I look at its "view-source", I find multiple instances of the following markup:
<td class="my_class" itemprop="main_item">statement 1</td>
<td class="my_class" itemprop="main_item">statement 2</td>
<td class="my_class" itemprop="main_item">statement 3</td>
I want to extract data like this:
statement 1
statement 2
statement 3
To accomplish this, I have written a method "GetContent" which takes a URL as a parameter and copies the entire page source into a C# string.
private string GetContent(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 100000; // milliseconds
    // Dispose the response and the reader once the content has been read.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
    {
        return respStream.ReadToEnd();
    }
}
Now I want to create a method "GetMyList" that extracts the list I want. I am looking for a regex that can serve this purpose. Any help is highly appreciated.
Using the Html Agility Pack, this would be really easy:
HtmlDocument doc= new HtmlDocument ();
doc.LoadHtml(html);
//var nodes = doc.DocumentNode.SelectNodes("//td//text()");
var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
var list = new List<string>();
foreach (var m in nodes)
{
list.Add(m.InnerText);
}
But if you want a regex, try this:
string regularExpressionPattern1 = @"<td.*?>(.*?)<\/td>";
Regex regex = new Regex(regularExpressionPattern1, RegexOptions.Singleline);
MatchCollection collection = regex.Matches(html.ToString());
var list = new List<string>();
foreach (Match m in collection)
{
list.Add( m.Groups[1].Value);
}
Hossein's answer is pretty much the solution (and I would recommend using a parser if you have the option), but a regular expression with non-capturing parentheses (?:...) around the tags and a capturing group around the content returns the extracted data, statement 1 or statement 2, as you need it:
IEnumerable<string> GetMyList(string str)
{
foreach(Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
yield return m.Groups[1].Value;
}
See Explanation at regex101 for a more detailed description.
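Both patterns above match every <td> on the page. If only the cells marked with itemprop="main_item" are wanted, the pattern can be narrowed. The following is a minimal sketch using only the base class library; the class name CellExtractor is hypothetical, and it assumes the attribute value is double-quoted as in the question. A parser remains the more robust choice for real-world markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class CellExtractor
{
    // Matches only <td> elements that carry itemprop="main_item"
    // and captures the cell content in group 1.
    private static readonly Regex MainItemCell = new Regex(
        @"<td\b[^>]*\bitemprop\s*=\s*""main_item""[^>]*>(.*?)</td>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static List<string> GetMyList(string html)
    {
        var list = new List<string>();
        foreach (Match m in MainItemCell.Matches(html))
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            list.Add(WebUtility.HtmlDecode(m.Groups[1].Value).Trim());
        }
        return list;
    }
}
```

Calling CellExtractor.GetMyList(GetContent(url)) would then return one string per matching cell.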
I have an Index view where I would like to show a list of news articles. The Text property is a string containing HTML that comes from an HTML editor. Since the HTML content can be very long, I would like to show only the first <p> element.
This is what I am doing:
public ActionResult Index()
{
var articles = db.Articles.ToList().Select(a => new{Title = a.Title,
Tags = a.Tags,
Id = a.Id,
Text = (System.Xml.Linq.XDocument.Parse(a.Text).Descendants("p").FirstOrDefault())
}).ToList();
return View(articles);
}
But the HTML string has no root node, so the LINQ query throws an exception. How can I handle this case?
Thanks in advance for any suggestions.
It might be a quick-and-dirty solution, but wouldn't wrapping your XML in a root node fix the problem?
System.Xml.Linq.XDocument.Parse(
String.Format("<myRootNode>{0}</myRootNode>" , a.Text)
)
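As a fuller illustration of the wrapping idea, here is a minimal sketch. The names ArticlePreview and FirstParagraph and the wrapper element name "root" are hypothetical. It assumes the editor output is well-formed XML; fragments with unclosed tags such as <br>, or with entities such as &nbsp;, make the XML parser throw, which this sketch turns into a null result.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class ArticlePreview
{
    // Returns the markup of the first <p> element, or null if there is none
    // or the fragment is not well-formed XML.
    public static string FirstParagraph(string htmlFragment)
    {
        try
        {
            // "root" is an arbitrary wrapper so the fragment has a single root node.
            XElement root = XElement.Parse("<root>" + htmlFragment + "</root>");
            XElement p = root.Descendants("p").FirstOrDefault();
            return p == null ? null : p.ToString(SaveOptions.DisableFormatting);
        }
        catch (XmlException)
        {
            return null;
        }
    }
}
```

In the Index action, Text = ArticlePreview.FirstParagraph(a.Text) would then replace the direct XDocument.Parse call.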
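You can do it with a regex:
static String GetTheFirstPElement(String rawHtml)
{
    // Singleline lets '.' span line breaks inside the paragraph;
    // \b keeps the pattern from matching tags such as <pre>.
    Regex myRegex = new Regex(@"(<p\b[^>]*>.*?</p>)", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Regex.Match returns only the first match, so no collection is needed.
    Match firstMatch = myRegex.Match(rawHtml);
    return firstMatch.Success ? firstMatch.Value : null;
}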
I have two pieces of code for counting the number of characters inside templates. The first one is:
string html = this.GetHTMLContent(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
sb.AppendLine(node.InnerText);
}
string final = sb.ToString();
int length = final.Length;
And the second one is:
var length = doc.DocumentNode.SelectNodes("//text()")
.Where(x => x.NodeType == HtmlNodeType.Text)
.Select(x => x.InnerText.Length)
.Sum();
When I run them, the two snippets return different results.
I finally identified the problem: inside the loop I used the AppendLine() method instead of Append(), so a line terminator was appended on every iteration, and those extra newline characters were counted as well.
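The difference can be reproduced without any HTML at all. The sketch below (the class and method names are hypothetical) builds the same text with AppendLine and with Append; the first result is longer by one Environment.NewLine per appended item.

```csharp
using System;
using System.Text;

public static class AppendDemo
{
    // Length of the text when every part is followed by a line terminator.
    public static int LengthWithAppendLine(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            sb.AppendLine(part);
        }
        return sb.Length;
    }

    // Length of the text when the parts are simply concatenated.
    public static int LengthWithAppend(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            sb.Append(part);
        }
        return sb.Length;
    }
}
```

For two parts, the AppendLine total exceeds the Append total by 2 * Environment.NewLine.Length, which accounts for the mismatch between the two snippets.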
I'm currently writing a very basic program that'll firstly go through the html code of a website to find all RSS Links, and thereafter put the RSS Links into an array and parse each content of the links into an existing XML file.
However, I'm still learning C# and I'm not that familiar with all the classes yet. I have done all this in PHP by writing my own class with file_get_contents(), and I have also used cURL to do the work. I managed to get it working in Java as well. Anyhow, I'm trying to accomplish the same result in C#, but I think I'm doing something wrong here.
TL;DR: What's the best way to write the regex to find all RSS links on a website?
So far, my code looks like this:
private List<string> getRSSLinks(string websiteUrl)
{
List<string> links = new List<string>();
MatchCollection collection = Regex.Matches(websiteUrl, @"(<link.*?>.*?</link>)", RegexOptions.Singleline);
foreach (Match singleMatch in collection)
{
string text = singleMatch.Groups[1].Value;
Match matchRSSLink = Regex.Match(text, @"type=\""(application/rss+xml)\""", RegexOptions.Singleline);
if (matchRSSLink.Success)
{
links.Add(text);
}
}
return links;
}
Don't use regex to parse HTML. Use an HTML parser instead. See this link for the explanation.
I prefer HtmlAgilityPack for parsing HTML:
using (var client = new WebClient())
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(client.DownloadString("http://www.xul.fr/en-xml-rss.html"));
var rssLinks = doc.DocumentNode.Descendants("link")
.Where(n => n.Attributes["type"] != null && n.Attributes["type"].Value == "application/rss+xml")
.Select(n => n.Attributes["href"].Value)
.ToArray();
}
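If a parser really is not an option, the regex attempt from the question needs several fixes: it has to run on the downloaded HTML rather than on the URL string, <link> is a void element so there is no </link> to match, and the + in application/rss+xml must be escaped. The following is only a sketch under those assumptions, with a hypothetical class name; it assumes quoted attribute values and will miss unusual markup that the parser-based code above handles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RssLinkFinder
{
    // <link> is a void element, so only the opening tag is matched.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // The '+' is escaped; unescaped it would mean "one or more 's'".
    private static readonly Regex RssType =
        new Regex(@"\btype\s*=\s*[""']application/rss\+xml[""']", RegexOptions.IgnoreCase);

    // Captures the quoted href value in group 1.
    private static readonly Regex Href =
        new Regex(@"\bhref\s*=\s*[""']([^""']*)[""']", RegexOptions.IgnoreCase);

    // Expects the page's HTML source, not its URL.
    public static List<string> GetRssLinks(string html)
    {
        var links = new List<string>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            if (!RssType.IsMatch(tag.Value))
            {
                continue;
            }
            Match href = Href.Match(tag.Value);
            if (href.Success)
            {
                links.Add(href.Groups[1].Value);
            }
        }
        return links;
    }
}
```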
I am attempting to use the HTML Agility Pack to look up specific keywords on Google, then check through the linked nodes until it finds my website's URL string, and then parse the InnerHtml of that node for my Google ranking.
I am relatively new to the Agility Pack (as in, I started really looking through it yesterday), so I was hoping I could get some help with it. When I run the search below, my XPath queries fail every time, even if I use something as simple as SelectNodes("//*[@id='rso']"). Am I doing something incorrectly?
private void GoogleScrape(string url)
{
string[] keys = keywordBox.Text.Split(',');
for (int i = 0; i < keys.Count(); i++)
{
var raw = "http://www.google.com/search?num=100&q=";
string search = raw + HttpUtility.UrlEncode(keys[i]);
var webGet = new HtmlWeb();
var document = webGet.Load(search);
loadtimeBox.Text = webGet.RequestDuration.ToString();
var ranking = document.DocumentNode.SelectNodes("//*[@id='rso']");
if (ranking != null)
{
googleBox.Text = "Something";
}
else
{
googleBox.Text = "Fail";
}
}
}
It's not the Agility Pack's fault; it is Google being tricky. If you inspect the _text property of HtmlDocument with a debugger, you'll find that the <ol> that has id='rso' when you inspect it in a browser does not have any attributes in the downloaded markup, for some reason.
I think in this case you can just search by "//ol", because there is only one <ol> tag in Google's result page at the moment...
UPDATE: I've done further checks. For example when I do this:
using (StreamReader sr =
new StreamReader(HttpWebRequest
.Create("http://www.google.com/search?num=100&q=test")
.GetResponse()
.GetResponseStream()))
{
string s = sr.ReadToEnd();
var m2 = Regex.Matches(s, "\\sid=('[^']+'|\"[^\"]+\")");
foreach (var x in m2)
Console.WriteLine(x);
}
The only ids that are returned are: "sflas", "hidden_modes" and "tbpr_12".
To conclude: I've used Html Agility Pack and it's coped pretty well even with malformed html (unclosed <p> and even <li> tags etc.).
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
From this code, I get a bunch of h2 and h3 elements. Into these, I'd like to insert an id attribute whose value is the header's content, lower-cased and with special characters removed. I also need this value as a separate string, as I need to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
string idValue = StripTextValue(match.Groups[4].Value);
returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?
There are many recommendations against using regex to parse HTML, so you could use e.g. Html Agility Pack for this:
var html = @"<h2>Some sort of header!</h2>";
HtmlDocument document= new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
foreach (HtmlNode header in headers)
{
var innerText = header.InnerText;
var idValue = StripTextValue(innerText);
if (header.Attributes["id"] != null)
{
header.Attributes["id"].Value = idValue;
}
else
{
header.Attributes.Add("id", idValue);
}
}
}
This code finds all the <h2> and <h3> elements in the document, gets the inner text of each, and sets (or adds) an id attribute on it.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
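StripTextValue is called in both the question and the answer but never shown. Purely as an assumption about what it might look like, here is a minimal sketch that lower-cases the text, drops special characters, and joins the words with hyphens. The class name HeaderIds is hypothetical, and the leading "#" is kept only because the expected output in the question includes it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Hypothetical implementation: "Some sort of header!" -> "#some-sort-of-header".
    public static string StripTextValue(string headerText)
    {
        string lower = headerText.Trim().ToLowerInvariant();
        // Drop everything except ASCII letters, digits, whitespace and hyphens.
        string cleaned = Regex.Replace(lower, @"[^a-z0-9\s-]", "");
        // Collapse whitespace runs into single hyphens and trim stray hyphens.
        string slug = Regex.Replace(cleaned, @"\s+", "-").Trim('-');
        return "#" + slug;
    }
}
```

To keep the pairs for later use, each header's idValue and innerText could be added to a Dictionary<string, string> inside the foreach loop, as the question attempts.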