Reading html from online website C#

Reading html from online website C# - c#

I am reading websites in C# and get contents as string....there are some sites which do not have well formed html structure.
I tried HtmlAgilityPack and some others but they need well formed html which is not possible in my case.
Now i need a very simple way to read it by Div or span id/class.
Here is my html http://jsfiddle.net/bwJU7/
please give me a simple C# code which will read
div class="item "
and get title ,price ,photos and description in my html.

If you load content as a string and do not expect any regular structure from it then Regular Expressions are your friend.
Something like this might help you:
String content = "Your content goes here";
var regex = new Regex("<div(?:.*?)class=\"item\"[^>]*>(.*?)</div>");
foreach (Match div in regex.Matches(content))
{
Console.WriteLine(div.Groups[0].Value);
}

Related

Regex to get values between Double Quotes

I have a value i am pulling from a database
<iframe width="420" height="315" src="//www.youtube.com/embed/8GRDA1gG8R8" frameborder="0" allowfullscreen></iframe>
I am trying to get the src as a value using regex.
Regex.Match(details.Tables["MarketingDetails"].Rows[0]["MarketingVideo"].ToString(), "\\\"([^\\\"]*)\\\"").Groups[2].Value
that is how i am currently writing it
How would I write this to pull the correct value of src?

You could do it like this....
Match match = Regex.Match( #"<iframe width=""420"" height=""315"" src=""//www.youtube.com/embed/8GRDA1gG8R8"" frameborder=""0"" allowfullscreen></iframe>", #"src=(\""[^\""]*\"")");
Console.WriteLine (match.Groups[1].Value);
However, as others have already commented on your question... it's better practice to use an actual html parser.

Don't use regex to parse xml or html. It's not worth it. I'll let you read this post, and it sort of exagerates the point, but the main thing to keep in mind is you can get into a lot of trouble with regex and html.
So, instead you should use an actual html/xml parser! For starters, use XElement, a class built into the .net framework.
string input = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen=''></iframe>";
XElement html = XElement.Parse(input);
string src = html.Attribute("src").Value;
This will make src have the value //www.youtube.com/embed/8GRDA1gG8R8. You can then split that up to get whatever you need from it.
I should also note that your input is not valid xml. allowfullscreen does not have a value attached, which is why I added =''.
If you need to get more complex, such as your input, use an HTML parser (XElement is meant for xml). Use the Html Agility Pack like this (using the previous example):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
string src = doc.DocumentNode
.Element("iframe")
.Attributes["src"]
.Value;
This parser is more forgiving for invalid or incorrect (or just irregular) inputs. This will parse your original input just fine (so missing the ='').

RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this as I'm completely stuck. I'm needing this for a C# program so if there is any C# specific notation then that would be great. Thanks
TIA

People will tell you you should not parse HTML with REGEX. And I think it is a valid statement.
But sometimes with well formatted HTML and really easy cases like it seems is yours. You can use some regex to do the job.
For example you can use this regex and obtain group 1 for the URL and group 2 for the RecordName
<a class="link" href="([^"]+)">([^<]+)<
DEMO

I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).

You can use TagRegex and EndTagRegex classes to parse html string and find tag you want. You need to iterate through all characters in html string to find out desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.length)
{
var match = tagRegex.Match(html, position);
if (match.Success)
{
var tagName = match.Groups["tagname"].Value;
if (tagName == "a")
{ ... }
}
else if (endTagRegex.match(html, position).Success)
{
var tagName = match.Groups["tagname"].Value;
if (tagName == "a")
{ ... }
}
position++;
}

How to Remove all the HTML tags and display a plain text using C#

I want to remove all html tags from a string.i can achieve this using REGX.
but inside the string if it contains number inside the angular braces <100> it should not remove it .
var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);
Result: hello there
but needed output :
hello there 1234

Your example of HTML isn't valid HTML since it contains a non-HTML tag. I figure you intended for the angle-brackets to be encoded.
I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.
Here's an example:
var withHtml = "<p>hello <b>there<1234></b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.

Not sure you can do this in one regular expression, or that a regex is really the correct way as others have suggested. A simple improvement that gets you almost there is:
Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);
Gives "hello there<1234>" You then just need to replace all angled brackets.

Selenium C# Webdriver FindElements(By.LinkText) RegEx?

Is it possible to find links on a webpage by searching their text using a pattern like A-ZNN:NN:NN:NN, where N is a single digit (0-9).
I've used Regex in PHP to turn text into links, so I was wondering if it's possible to use this sort of filter in Selenium with C# to find links that will all look the same, following a certain format.
I tried:
driver.FindElements(By.LinkText("[A-Z][0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2}")).ToList();
But this didn't work. Any advice?

In a word, no, none of the FindElement() strategies support using regular expressions for finding elements. The simplest way to do this would be to use FindElements() to find all of the links on the page, and match their .Text property to your regular expression.
Note though that if clicking on the link navigates to a new page in the same browser window (i.e., does not open a new browser window when clicking on the link), you'll need to capture the exact text of all of the links you'd like to click on for later use. I mention this because if you try to hold onto the references to the elements found during your initial FindElements() call, they will be stale after you click on the first one. If this is your scenario, the code might look something like this:
// WARNING: Untested code written from memory.
// Not guaranteed to be exactly correct.
List<string> matchingLinks = new List<string>();
// Assume "driver" is a valid IWebDriver.
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
// You could probably use LINQ to simplify this, but here is
// the foreach solution
foreach(IWebElement link in links)
{
string text = link.Text;
if (Regex.IsMatch("your Regex here", text))
{
matchingLinks.Add(text);
}
}
foreach(string linkText in matchingLinks)
{
IWebElement element = driver.FindElement(By.LinkText(linkText));
element.Click();
// do stuff on the page navigated to
driver.Navigate().Back();
}

Dont use regex to parse Html.
Use htmlagilitypack
You can follow these steps:
Step1 Use HTML PARSER to extract all the links from the particular webpage and store it into a List.
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href]"))
{
//collect all links here
}
Step2 Use this regex to match all the links in the list
.*?[A-Z]\d{2}:\d{2}:\d{2}:\d{2}.*?
Step 3 You get your desired links.

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.

Use the HTML Agility Pack library.
That's very fine library for parse HTML, for your requirement use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Yor Path(local,web)");
var result=doc.DocumentNode.SelectNodes("//body//text()");//return HtmlCollectionNode
foreach(var node in result)
{
string AchivedText=node.InnerText;//Your desire text
}

It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)

You can strip tags using regular expressions such as this one2 (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex("\<.+?\>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library1. (If the webpage is XHTML, all the better - use the System.Xml classes.)
1 Like http://htmlagilitypack.codeplex.com/, for example.
2 This might have unintended side-effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to accept escape sequences like &.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Reading html from online website C# - c#

Related

Regex to get values between Double Quotes

RegEx to pull out specific URL format from HTML source

How to Remove all the HTML tags and display a plain text using C#

Selenium C# Webdriver FindElements(By.LinkText) RegEx?

How to get text off a webpage?

Categories

Resources