regular expression to parse links from html code

regular expression to parse links from html code - c#

I'm working on a method that accepts a string (html code) and returns an array that contains all the links contained with in.
I've seen a few options for things like html ability pack but It seems a little more complicated than this project calls for
I'm also interested in using regular expression because i don't have much experience with it in general and i think this would be a good learning opportunity.
My code thus far is
WebClient client = new WebClient();
string htmlCode = client.DownloadString(p);
Regex exp = new Regex(#"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
string[] test = exp.Split(htmlCode);
but I'm not getting the results I want because I'm still working on the regular expression
sudo code for what I'm looking for is "

If you are looking for a fool proof solution regular expressions are not your answers. They are fundamentally limited and cannot be used to reliably parse out links, or other tags for that matter, from an HTML file due to the complexity of the HTML language.
Long Winded Version: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
Instead you'll need to use an actual HTML DOM API to parse out links.

Regular Expressions are not the best idea for HTML.
see previous questions:
When
is it wise to use regular expressions
with HTML?
Regexp that matches all the text content of
a HTML input
Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.

Other users may tell you "No, Stop! Regular expressions should not mix with HTML! It's like mixing bleach and ammonia!". There is a lot of wisdom in that advice, but it's not the full story.
The truth is that regular expressions work just fine for collecting commonly formatted links. However, a better approach would be to use a dedicated tool for this type of thing, such as the HtmlAgilityPack.
If you use regular expressions, you may match 99.9% of the links, but you may miss on rare unanticipated corner cases or malformed html data.
Here's a function I put together that uses the HtmlAgilityPack to meet your requirements:
private static IEnumerable<string> DocumentLinks(string sourceHtml)
{
HtmlDocument sourceDocument = new HtmlDocument();
sourceDocument.LoadHtml(sourceHtml);
return (IEnumerable<string>)sourceDocument.DocumentNode
.SelectNodes("//a[#href!='#']")
.Select(n => n.GetAttributeValue("href",""));
}
This function creates a new HtmlAgilityPack.HtmlDocument, loads a string containing HTML into it, and then uses an xpath query "//a[#href!='#']" to select all of the links on the page that do not point to "#". Then I use the LINQ extension Select to convert the HtmlNodeCollection into a list of strings containing the value of the href attribute - where the link is pointing to.
Here's an example use:
List<string> links =
DocumentLinks((new WebClient())
.DownloadString("http://google.com")).ToList();
Debugger.Break();
This should be a lot more effective than regular expressions.

You could look for anything that is sort-of-like a url for http/https schema. This is not HTML proof, but it will get you things that looks like http URLs, which is what you need, I suspect. You can add more sachems, and domains.
The regex looks for things that look like URL "in" href attributes (not strictly).
class Program {
static void Main(string[] args) {
const string pattern = #"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
var regex = new Regex(pattern);
var urls = new string[] {
"href='http://company.com'",
"href=\"https://company.com\"",
"href='http://company.org'",
"href='http://company.org/'",
"href='http://company.org/path'",
};
foreach (var url in urls) {
Match match = regex.Match(url);
if (match.Success) {
Console.WriteLine("{0} -> {1}", url, match.Groups["url"].Value);
}
}
}
}
output:
href='http://company.com' -> http://company.com
href="https://company.com" -> https://company.com
href='http://company.org' -> http://company.org
href='http://company.org/' -> http://company.org
href='http://company.org/path' -> http://company.org

Related

RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this as I'm completely stuck. I'm needing this for a C# program so if there is any C# specific notation then that would be great. Thanks
TIA

People will tell you you should not parse HTML with REGEX. And I think it is a valid statement.
But sometimes with well formatted HTML and really easy cases like it seems is yours. You can use some regex to do the job.
For example you can use this regex and obtain group 1 for the URL and group 2 for the RecordName
<a class="link" href="([^"]+)">([^<]+)<
DEMO

I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).

You can use TagRegex and EndTagRegex classes to parse html string and find tag you want. You need to iterate through all characters in html string to find out desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.length)
{
var match = tagRegex.Match(html, position);
if (match.Success)
{
var tagName = match.Groups["tagname"].Value;
if (tagName == "a")
{ ... }
}
else if (endTagRegex.match(html, position).Success)
{
var tagName = match.Groups["tagname"].Value;
if (tagName == "a")
{ ... }
}
position++;
}

Selenium C# Webdriver FindElements(By.LinkText) RegEx?

Is it possible to find links on a webpage by searching their text using a pattern like A-ZNN:NN:NN:NN, where N is a single digit (0-9).
I've used Regex in PHP to turn text into links, so I was wondering if it's possible to use this sort of filter in Selenium with C# to find links that will all look the same, following a certain format.
I tried:
driver.FindElements(By.LinkText("[A-Z][0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2}")).ToList();
But this didn't work. Any advice?

In a word, no, none of the FindElement() strategies support using regular expressions for finding elements. The simplest way to do this would be to use FindElements() to find all of the links on the page, and match their .Text property to your regular expression.
Note though that if clicking on the link navigates to a new page in the same browser window (i.e., does not open a new browser window when clicking on the link), you'll need to capture the exact text of all of the links you'd like to click on for later use. I mention this because if you try to hold onto the references to the elements found during your initial FindElements() call, they will be stale after you click on the first one. If this is your scenario, the code might look something like this:
// WARNING: Untested code written from memory.
// Not guaranteed to be exactly correct.
List<string> matchingLinks = new List<string>();
// Assume "driver" is a valid IWebDriver.
ReadOnlyCollection<IWebElement> links = driver.FindElements(By.TagName("a"));
// You could probably use LINQ to simplify this, but here is
// the foreach solution
foreach(IWebElement link in links)
{
string text = link.Text;
if (Regex.IsMatch("your Regex here", text))
{
matchingLinks.Add(text);
}
}
foreach(string linkText in matchingLinks)
{
IWebElement element = driver.FindElement(By.LinkText(linkText));
element.Click();
// do stuff on the page navigated to
driver.Navigate().Back();
}

Dont use regex to parse Html.
Use htmlagilitypack
You can follow these steps:
Step1 Use HTML PARSER to extract all the links from the particular webpage and store it into a List.
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[#href]"))
{
//collect all links here
}
Step2 Use this regex to match all the links in the list
.*?[A-Z]\d{2}:\d{2}:\d{2}:\d{2}.*?
Step 3 You get your desired links.

Regular expression to replace quotation marks in HTML tags only

I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be weary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution...I just need a bit of help getting the right expression.
Edit #2: Jens helped to find a solution for me but anyone randomly coming to this page should think long and very hard about using this solution. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks and make sure you do to. If you're not sure if you know then it probably indicates that you don't know and shouldn't use this method. You've been warned.

This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
Note: While I found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are responded with a little too much anger, in my opinion. After all, this question here does not ask "What regex matches all valid HTML, and does not match anything else."

I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
foreach (var att in node.Attributes)
{
att.QuoteType = AttributeValueQuote.SingleQuote;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3

c# : parsing text from html

I have an string input-buffer that contains html.
That html contains a lot of text, including some stuff I want to parse.
What I'm actually looking for are the lines like this : "< strong>Filename< /strong>: yadayada.thisandthat.doc< /p>"
(Although position and amount of whitespace / semicolons is variable)
What's the best way to get all the filenames into a List< string> ?

Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.
Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc), and trawl through the html looking for instances of these extensions. When you find one, track back to the next whitespace character and that's your filename.
Hope this helps.

You have a couple of options. You can use regular expressions, it could be something like Filename: (.*?)< /p> , but it will need to be much more complex. You would need to look at more of the text file to write a proper one. This could work depending on the structure of all your text, if there is always a certain tag after a filename for example.
If it is valid HTML you can also use a HTML parser like HTML Agility Pack to go through the html and pull out text from certain tags, then use a regex to seperate out the path.

I'm not sure a regular expression is the best way to do this, traversing the HTML tree is probably more sensible, but the following regex should do it:
<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>
As you can see, I've been extremely tolerant of whitespace, as well as tolerant on the content of the filename. Also, multiple (or no) semicolons are permitted.
The C# to build a List (off the top of my head):
List<String> fileNames = new List<String>();
Regex regexObj = new Regex(#"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
fileNames.Add(matchResults.Groups[0].Value);
matchResults = matchResults.NextMatch();
}

Parse HTML links using C#

Is there a built in dll that will give me a list of links from a string. I want to send in a string with valid html and have it parse all the links. I seem to remember there being something built into either .net or an unmanaged library.
I found a couple open source projects that looked promising but I thought there was a built in module. If not I may have to use one of those. I just didn't want an external dependency at this point if it wasn't necessary.

I'm not aware of anything built in and from your question it's a little bit ambiguous what you're looking for exactly. Do you want the entire anchor tag, or just the URL from the href attribute?
If you have well-formed XHtml, you might be able to get away with using an XmlReader and an XPath query to find all the anchor tags (<a>) and then hit the href attribute for the address. Since that's unlikely, you're probably better off using RegEx to pull down what you want.
Using RegEx, you could do something like:
List<Uri> findUris(string message)
{
string anchorPattern = "<a[\\s]+[^>]*?href[\\s]?=[\\s\\\"\']+(?<href>.*?)[\\\"\\']+.*?>(?<fileName>[^<]+|.*?)?<\\/a>";
MatchCollection matches = Regex.Matches(message, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
if (matches.Count > 0)
{
List<Uri> uris = new List<Uri>();
foreach (Match m in matches)
{
string url = m.Groups["url"].Value;
Uri testUri = null;
if (Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out testUri))
{
uris.Add(testUri);
}
}
return uris;
}
return null;
}
Note that I'd want to check the href to make sure that the address actually makes sense as a valid Uri. You can eliminate that if you aren't actually going to be pursuing the link anywhere.

I don't think there is a built-in library, but the Html Agility Pack is popular for what you want to do.
The way to do this with the raw .NET framework and no external dependencies would be use a regular expression to find all the 'a' tags in the string. You would need to take care of a lot of edge cases perhaps. eg href = "http://url" vs href=http://url etc.

SubSonic.Sugar.Web.ScrapeLinks seems to do part of what you want, however it grabs the html from a url, rather than from a string. You can check out their implementation here.

Google gives me this module: http://www.majestic12.co.uk/projects/html_parser.php
Seems to be a HTML parser for .NET.

A simple regular expression -
#"<a.*?>"
passed in to Regex.Matches should do what you need. That regex may need a tiny bit of tweaking, but it's pretty close I think.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

regular expression to parse links from html code - c#

Regular Expressions are not the best idea for HTML. see previous questions: When is it wise to use regular expressions with HTML? Regexp that matches all the text content of a HTML input Rather, you want something that already knows how to parse the DOM; otherwise, you're re-inventing the wheel.

Related

RegEx to pull out specific URL format from HTML source

Selenium C# Webdriver FindElements(By.LinkText) RegEx?

Regular expression to replace quotation marks in HTML tags only

c# : parsing text from html

Parse HTML links using C#

Categories

Resources