Split html row into string array - c#

I have data in an html file, in a table:
<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>
How do I split a single row into an array or list?
string row = streamReader.ReadLine();
List<string> data = row.Split //... how do I do this bit?
string artist = data[1];

Short answer: never try to parse HTML from the wild with regular expressions. It will most likely come back to haunt you.
Longer answer: As long as you can absolutely, positively guarantee that the HTML that you are parsing fits the given structure, you can use string.Split() as Jenni suggested.
string html = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";
string[] values = html.Split(new string[] { "<tr>","</tr>","<td>","</td>" }, StringSplitOptions.RemoveEmptyEntries);
List<string> list = new List<string>(values);
Listing the tags independently keeps this slightly more readable, and the .RemoveEmptyEntries will keep you from getting an empty string in your list between adjacent closing and opening tags.
If this HTML is coming from the wild, or from a tool that may change - in other words, if this is more than a one-off transaction - I strongly encourage you to use something like the HTML Agility Pack instead. It's pretty easy to integrate, and there are lots of examples on the Intarwebs.
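For reference, a minimal Agility Pack sketch for the table in the question might look like the following; the file name and variable names here are placeholders, not anything from the original post:
// Minimal sketch, assuming the HTML above is saved as "songs.html".
// Requires: using System; using System.Collections.Generic; using System.Linq;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("songs.html");
foreach (var tr in doc.DocumentNode.SelectNodes("//table/tr"))
{
    // One list per row: ["001", "MC Hammer", "Can't Touch This"]
    List<string> data = tr.Elements("td").Select(td => td.InnerText).ToList();
    string artist = data[1];
    Console.WriteLine(artist);
}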

If your HTML is well-formed you could use LINQ to XML:
string input = @"<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>";
var xml = XElement.Parse(input);
// query each row
foreach (var row in xml.Elements("tr"))
{
    foreach (var item in row.Elements("td"))
    {
        Console.WriteLine(item.Value);
    }
    Console.WriteLine();
}
// if you really need a string array...
var query = xml.Elements("tr")
.Select(row => row.Elements("td")
.Select(item => item.Value)
.ToArray());
foreach (var item in query)
{
    // foreach over item content
    // or access via item[0...n]
}
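To tie this back to the original question, the arrays produced by that query can be indexed directly once materialized; a small sketch (index 1 assumes the artist is the second cell, as in the sample data):
// query is IEnumerable<string[]>: one string[] per <tr>, one entry per <td>
var rowArrays = query.ToList();
string artist = rowArrays[0][1]; // "MC Hammer"
Console.WriteLine(artist);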

You could try:
Regex.Split(row, "<tr><td>|</td><td>|</td></tr>")
But it depends on how regular the HTML is. Is it programmatically generated, or does a human write it? You should only use a regular expression if you're sure it will always be generated the same way; otherwise you should use a proper HTML parser.
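As a rough C# sketch of that split (using the row string from the question, and filtering out the empty entries Regex.Split leaves at the start and end of the string):
// Requires: using System.Linq; using System.Text.RegularExpressions;
string row = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";
List<string> data = Regex.Split(row, "<tr><td>|</td><td>|</td></tr>")
    .Where(s => s.Length > 0)
    .ToList();
string artist = data[1]; // "MC Hammer"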

When parsing HTML, I usually turn to the HTML Agility Pack.

Related

Distinct() values still letting in duplicates

This is another programming issue where I think everything looks fine but it does not work as intended.
What I'm trying to do is scrape all links from a webpage with HtmlAgilityPack and add them to a datagrid, but NOT add duplicates to the datagrid.
Code:
webBrowser.Navigate(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser.DocumentText);
if (debug)
{
    Helpers.SaveDebugToFile(@"Debug\[google.com]-" + DateTime.Now.ToString("hhmmssffffff") + "-debug.html", webBrowser.DocumentText);
}
List<string> values = new List<string>();
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute href = link.Attributes["href"];
    if (href.Value.Contains("google.") || href.Value.Contains("search?") || href.Value.StartsWith("/") || href.Value.Length < 5)
    {
        // Ignore.
    }
    else
    {
        // DO NOT ADD TO THE DATAGRID IF href.Value ALREADY EXISTS IN COLUMN 1 //
        values.Add(href.Value);
    }
}
foreach (var value in values.Distinct().ToList())
{
    DataGridViewLinks.Rows.Add(value, randomKeyword);
}
The code works, but it's still adding duplicates to the first column, even though I'm only adding Distinct() values (or at least that's what I intended to happen).
I can't see the reason for this issue; I have looked over the code a good few times and don't see anything obviously wrong.
EDIT:
As already mentioned in the comments above, most likely the values aren't exactly equal somewhere (different casing, some leading or trailing whitespace, ...).
It would be better to check for duplicates (with a defined casing, and with whitespace trimmed) already when inserting into the "values" list.
Instead of relying on Distinct() directly in the loop, inspect the contents of the list first to see exactly which values you are getting; that will tell you whether the problem is in this section of code or somewhere else. Possibly the list is being appended to while the loop is iterating.
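A small sketch of that approach: deduplicate as you insert, using a HashSet with a case-insensitive comparer (the Trim() and the choice of comparer are assumptions about where the mismatch comes from):
// Requires: using System; using System.Collections.Generic;
var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
// ... inside the foreach over the <a> nodes, instead of values.Add(href.Value):
string cleaned = href.Value.Trim();
if (seen.Add(cleaned)) // Add returns false if an equal value was already added
{
    values.Add(cleaned);
}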

c# regex replace but with different replacement value each time

I have a string like this:
<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>
This is a templating use case. Each query will be replaced by a different value (i.e. a SQL result). Is it possible to use the Regex Replace method to do this?
The solution I'm thinking of is to use Regex.Match in a first pass, collect all the matches, and then use string.Replace in a second pass to replace the matches one by one. Is there a better way to solve this?
var source =
#"<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>";
var result = Regex.Replace(
source,
"(?<=<query>).*?(?=</query>)",
match => Sql.Execute(match.Value));
The Sql.Execute is a placeholder function for whatever logic you invoke to execute your query. Upon completion, its results will substitute the original <query>…</query> contents.
If you want the query tags to be eliminated, then use a named capture group rather than lookarounds:
var result = Regex.Replace(
source,
"<query>(?<q>.*?)</query>",
match => Sql.Execute(match.Groups["q"].Value));
You could use the Html Agility Pack to first get the query tags and then replace the inner text with whatever you want:
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach(var node in queries)
{
    if(node.InnerText=="select * from table1")
    {
        node.InnerText="your result";
    }
}
You could also use a dictionary to save the pattern as key and the replacement as value:
var dict = new Dictionary<string, string>();
dict.Add("select * from table1","your result");
//...
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach(var node in queries)
{
    if(dict.Keys.Contains(node.InnerText))
    {
        node.InnerText=dict[node.InnerText];
    }
}
We know regex is not good for parsing HTML, but I don't think you need to parse HTML here; you simply need to get what's inside the <query>xxx</query> pattern.
So it doesn't matter what the rest of the document looks like, because you don't want to traverse it, validate it, or change it (according to your question).
So, in this particular case, I would use a regex rather than an HTML parser:
var pattern = "<query>.+<\/query>";
And then replace every match with string Replace method
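A rough sketch of that two-pass idea, assuming the source string from the question and a hypothetical results dictionary that maps each query text to its result:
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var results = new Dictionary<string, string>(); // query text -> result, filled elsewhere
string output = source;
foreach (Match m in Regex.Matches(source, "<query>(.+?)</query>", RegexOptions.Singleline))
{
    string query = m.Groups[1].Value;
    if (results.TryGetValue(query, out string replacement))
    {
        // m.Value includes the tags, so this swaps the whole <query>...</query> block for the result
        output = output.Replace(m.Value, replacement);
    }
}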

C# to get data from a website

I would like to get the data from this website and put them into a dictionary.
Basically these are prices and quantities for some financial instruments.
I have this source code for the page (here is just an extract of the whole text):
<tr>
<td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
<td><span>0</span></td>
<td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
<td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>
</tr>
Now I would like to extract the following information:
The value "4000" from the second line;
The value "-3.87%" from the fourth line;
The value "960.40" from the fifth line.
I have tried to use the following to extract the first information (the value 4000):
string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var firstData = from x in document.DocumentNode.Descendants()
where x.Name == "td" && x.Attributes.Contains("class")
select x.InnerText;
but firstData doesn't contain the info I want (the value 4000); instead it contains this:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String]
How can I get these data? I would also need to repeat this task several times cause in the page there is more than one line containing similar information. Is HTML Agility Pack useful in this context? Thanks.
This may be somewhat ugly; it was quickly thrown together and could probably be cleaned up greatly, but it returns all of the values that you are looking for from the Prices/Quotes table found on that page. Hope it helps.
var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var pricesAndQuotesDataTable =
    (from elem in
         document.DocumentNode.Descendants()
             .Where(
                 d =>
                     d.Attributes["class"] != null && d.Attributes["class"].Value == "toggleTitle" &&
                     d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
     select
         elem.Descendants()
             .FirstOrDefault(
                 d => d.Attributes["class"] != null && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();
if (pricesAndQuotesDataTable != null)
{
    var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
                   where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
                   select elem;
    var dataPoints = new List<object>();
    foreach (var row in dataRows)
    {
        var dataColumns = (from col in row.ChildNodes.Where(n => n.Name == "td")
                           select col).ToList();
        dataPoints.Add(
            new
            {
                StrikePrice = dataColumns[0].InnerText,
                DifferenceToPreviousDay = dataColumns[9].InnerText,
                LastPrice = dataColumns[10].InnerText
            });
    }
}
That's because your LINQ hasn't executed. If you check the Results View in the debugger and run the query, you'll get all the items, the first being that value you are looking for.
So, this will get you 4,000.00
var firstData = (from x in document.DocumentNode.Descendants()
where x.Name == "td" && x.Attributes.Contains("class")
select x.InnerText).First();
If you want them all, call ToList() instead of First().
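For example, the same query, just materialized into a list:
var allData = (from x in document.DocumentNode.Descendants()
               where x.Name == "td" && x.Attributes.Contains("class")
               select x.InnerText).ToList();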
If you're open to using CsQuery, then try this one.
static void Main()
{
    CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");
    string str = cq["#notation115602071 span"].Text();
}
You could use the Html Agility Pack. Unlike XmlDocument or XDocument, the Html Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).
Not all HTML pages can be assumed to be valid XML.
With the Html Agility Pack you can load your page and parse it with XPath or an object model similar to System.Xml.
Html Agility Pack
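As a rough sketch of the XPath route, using the id that appears on the <td> in the question's snippet (the XPath itself is an assumption based on that markup):
// Reusing the 'document' already loaded with HtmlWeb in the question.
var strikeCell = document.DocumentNode.SelectSingleNode("//td[@id='notation115602071']/span");
if (strikeCell != null)
{
    Console.WriteLine(strikeCell.InnerText); // e.g. "4,000.00"
}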
Optionally, you could use a PDF to Text converter and parse a text file with much better accuracy, since the website you linked offers a PDF export of that same data:
PDF Export Link
Convert PDF to Text
We did a similar project a few years back to spider all the major online betting websites and create a comparison tool that got the best prices for each type of event, e.g. displaying all the major bookmakers' betting odds for a particular football game in order of best return.
It turned out to be a complete nightmare: the rendered HTML output for the websites kept changing almost daily, and it quite often contained poorly formed HTML that could sometimes crash the spider daemon, so we had to constantly maintain the system to keep it working properly.
With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and is easier to integrate.

C# grab urls using htmlagility

Okay, so I have this list of URLs on this webpage, and I am wondering: how do I grab the URLs and add them to an ArrayList?
http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A
I only want the URLs which are in the list; look at it to see what I mean. I tried doing it myself, and for whatever reason it takes all of the other URLs except the ones I need.
http://pastebin.com/a7hJnXPP
Using Html Agility Pack
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}
If you want only the ones in the list, then the following code should work (this is assuming you have the page loaded into an HtmlDocument already)
List<string> hrefList = new List<string>(); //Make a list cause lists are cool.
foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    //Append animenewsnetwork.com to the beginning of the href value and add it
    // to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}
Breaking the XPath //a[contains(@href, 'id=')] down as follows:
//a: select all <a> nodes...
[contains(@href, 'id=')]: ...that have an href attribute whose value contains the text id=.
That should be enough to get you going.
As an aside, I would suggest not listing each link in its own messagebox considering there are around 500 links on that page. 500 links = 500 messageboxes :(

Get all RSS links on a website

I'm currently writing a very basic program that will first go through the HTML code of a website to find all RSS links, then put the RSS links into an array and parse the content of each link into an existing XML file.
However, I'm still learning C# and I'm not that familiar with all the classes yet. I have done all this in PHP by writing my own class using file_get_contents(), and I've also used cURL to do the work. I managed to get around it with Java as well. Anyhow, I'm trying to accomplish the same results using C#, but I think I'm doing something wrong here.
TL;DR: What's the best way to write a regex to find all RSS links on a website?
So far, my code looks like this:
private List<string> getRSSLinks(string websiteUrl)
{
    List<string> links = new List<string>();
    MatchCollection collection = Regex.Matches(websiteUrl, @"(<link.*?>.*?</link>)", RegexOptions.Singleline);
    foreach (Match singleMatch in collection)
    {
        string text = singleMatch.Groups[1].Value;
        Match matchRSSLink = Regex.Match(text, @"type=\""(application/rss+xml)\""", RegexOptions.Singleline);
        if (matchRSSLink.Success)
        {
            links.Add(text);
        }
    }
    return links;
}
Don't use Regex to parse HTML; use an HTML parser instead. See this link for the explanation.
I prefer HtmlAgilityPack for parsing HTML:
using (var client = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(client.DownloadString("http://www.xul.fr/en-xml-rss.html"));
    var rssLinks = doc.DocumentNode.Descendants("link")
        .Where(n => n.Attributes["type"] != null && n.Attributes["type"].Value == "application/rss+xml")
        .Select(n => n.Attributes["href"].Value)
        .ToArray();
}
