RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this? I'm completely stuck. I need this for a C# program, so any C#-specific notation would be great. Thanks
TIA

People will tell you that you should not parse HTML with regex, and I think that is a valid statement.
But sometimes, with well-formatted HTML and a really simple case like yours seems to be, you can use a regex to do the job.
For example, you can use this regex and read group 1 for the URL and group 2 for the RecordName:
<a class="link" href="([^"]+)">([^<]+)<
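As a rough sketch of how that pattern might be used from C#, the snippet below runs it over a sample string and reads the two groups. The class name LinkExtractor and the sample HTML are made up for illustration. It assumes every link has exactly the attribute order and quoting shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Pattern from the answer above: group 1 = href value, group 2 = link text.
    private static readonly Regex LinkPattern =
        new Regex("<a class=\"link\" href=\"([^\"]+)\">([^<]+)<");

    // Returns (url, name) pairs for every matching link in the HTML string.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            results.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample input in the format described in the question.
        string html = "<p><a class=\"link\" href=\"pagedetail.html?record_id=123456\">RecordName</a></p>";
        foreach (var link in Extract(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

If the attributes can appear in a different order or with single quotes, this pattern will silently miss those links, which is one reason the other answers suggest a parser.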

I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).

You can use the TagRegex and EndTagRegex classes (in System.Web.RegularExpressions) to parse an HTML string and find the tag you want. You need to step through the HTML string position by position to find the desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.Length)
{
    // Try to match an opening tag at the current position
    var match = tagRegex.Match(html, position);
    if (match.Success)
    {
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a")
        { ... }
    }
    else
    {
        // Otherwise try to match a closing tag at the same position
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a")
            { ... }
        }
    }
    position++;
}

Related

Reading html from online website C#

I am reading websites in C# and getting their contents as strings. Some of the sites do not have a well-formed HTML structure.
I tried HtmlAgilityPack and some others, but they need well-formed HTML, which is not possible in my case.
Now I need a very simple way to read it by div or span id/class.
Here is my HTML: http://jsfiddle.net/bwJU7/
Please give me some simple C# code that will read
div class="item "
and get the title, price, photos and description from my HTML.

If you load the content as a string and do not expect any regular structure from it, then regular expressions are your friend.
Something like this might help you:
String content = "Your content goes here";
// \s* allows the trailing space in class="item " from the question
var regex = new Regex("<div(?:.*?)class=\"item\\s*\"[^>]*>(.*?)</div>");
foreach (Match div in regex.Matches(content))
{
    // Groups[0] is the whole matched div; Groups[1] is its inner content
    Console.WriteLine(div.Groups[0].Value);
}

Finding text between tags and replacing it along with the tags

I am using The following regex pattern to find text between [code] and [/code] tags:
(?<=\[code\]).*?(?=\[/code\])
It returns me anything which is enclosed between these 2 tags, e.g. this: [code]return Hi There;[/code] gives me return Hi There;.
I need help with regex to replace entire text along with the tags.

Use this:
var s = "My temp folder is: [code]Path.GetTempPath()[/code]";
var result = Regex.Replace(s, @"\[code](.*?)\[/code]",
    m =>
    {
        var codeString = m.Groups[1].Value;
        // then evaluate this string
        return EvaluateMyCode(codeString);
    });

I would use an HTML parser for this. I can see that what you are trying to do is simple; however, these things have a habit of getting much more complicated over time. The end result is much pain for the poor soul who has to maintain the code in the future.
Take a look at this question about HTML Parsers
What is the best way to parse html in C#?
[Edit]
Here is a much more relevant answer to the question asked.
@Milad Naseri's regex is correct; you just need to do something like:
string matchCodeTag = @"\[code\](.*?)\[/code\]";
string textToReplace = "[code]The Ape Men are coming[/code]";
string replaceWith = "Keep Calm";
string output = Regex.Replace(textToReplace, matchCodeTag, replaceWith);
Check out these websites for more examples:
http://www.dotnetperls.com/regex-replace
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Hope this helps

You need to use back referencing, i.e. replace \[code\](.*?)\[/code\] with something like <code>$1</code> which will give you what's been enclosed by the [code][/code] tags enclosed in -- for this example -- <code></code> tags.
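A minimal sketch of that back-reference replacement in C# might look like this. The class and method names are invented for the example. In a .NET replacement string, $1 stands for the text captured by the first group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeTagReplacer
{
    // Replaces each [code]...[/code] pair with <code>...</code>,
    // keeping the enclosed text via the $1 back-reference.
    public static string ToHtmlCode(string input)
    {
        return Regex.Replace(input, @"\[code\](.*?)\[/code\]", "<code>$1</code>");
    }

    public static void Main()
    {
        Console.WriteLine(ToHtmlCode("My temp folder is: [code]Path.GetTempPath()[/code]"));
        // prints: My temp folder is: <code>Path.GetTempPath()</code>
    }
}
```

Note that . does not match newlines by default, so a [code] block spanning several lines would probably need RegexOptions.Singleline.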

I need help with a regular expression in order to extract a link from a string in C#

I need to extract a link from a string using regular expression in C#. I cannot use a substring method since both the letters in the string and the link may vary.
This is the link with surrounding letters:
-sv"><a href="http://sv.wikipedia.org/wiki/%C3%84pple" title="
The -sv"><a href=" part must be included in the regex or it won't be specific enough.
The end of the regex may be at the quotation mark at the end of the link, or wherever is easiest.
I've had another suggestion as well; however, it does not include the sv part at the beginning, and the submitter couldn't make it compile:
@"<a[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?[^>]*?>";
Now I'm turning to you guys on stackoverflow.
Thanks in advance!
Max

Check question:
Regex to Parse Hyperlinks and Descriptions

Parsing stuff out of html with regex is fraught with danger. Please see this classic answer which explains this with force and humour.
The problem with your question is that we don't know the context.
Are you sure the same substring won't appear twice?
Are you sure there won't be extra whitespace?
Are you sure the html will be valid? (i.e., they could forget to use "", or use '' instead)
Are you sure they won't put the title before the href?
There are lots of ways to get it wrong...
However, to answer your question, this regex pattern will work for the exact string you have pasted:
-sv"><a href="([^"]+)"
However, you won't be able to do a replace directly with that. Note the (), this is a regex capture. I'd recommend looking that up yourself, that way you won't be a newbie forever :)
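For readers who want to see the capture in action, here is a small sketch that applies the pattern above and reads the captured group. The class name, method name and the sample string are assumptions made for the example, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SvLinkFinder
{
    // Returns the URL captured by the parentheses, or null if nothing matches.
    public static string ExtractSvLink(string html)
    {
        Match m = Regex.Match(html, "-sv\"><a href=\"([^\"]+)\"");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical fragment built around the snippet quoted in the question.
        string html = "<li class=\"interwiki-sv\"><a href=\"http://sv.wikipedia.org/wiki/%C3%84pple\" title=\"Svenska\">";
        Console.WriteLine(ExtractSvLink(html));
        // prints: http://sv.wikipedia.org/wiki/%C3%84pple
    }
}
```

Groups[0] would be the whole match including the -sv"><a href=" prefix, while Groups[1] is only the part inside the parentheses.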

Try using an HTML parser. Its source code is also very instructive to read.
Download the library and add a reference to HtmlAgilityPack.dll. Get all your links with:
// requires: using System.Linq; using HtmlAgilityPack;
List<string> listOfUrls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(@"c:\ht.html");
HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//li[@class='interwiki-sv']");
// SelectNodes returns null when nothing matches
if (coll != null)
{
    foreach (HtmlNode li in coll)
    {
        if (li.ChildNodes.Count < 1) continue;
        HtmlNode node = li.ChildNodes.First();
        if (null == node) continue;
        HtmlAttribute att = node.Attributes["href"];
        if (null == att) continue;
        listOfUrls.Add(att.Value);
    }
}
// Now you have your listOfUrls to process.

Regular expression to replace quotation marks in HTML tags only

I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be wary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution... I just need a bit of help getting the right expression.
Edit #2: Jens helped me find a solution, but anyone randomly coming to this page should think long and very hard about using it. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks; make sure you do too. If you're not sure whether you know, that probably indicates that you don't and shouldn't use this method. You've been warned.

This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
Note: While I have found the Stack Overflow community most friendly and helpful, these regex/HTML questions are answered with a little too much anger, in my opinion. After all, this question does not ask "What regex matches all valid HTML, and does not match anything else?"
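A possible way to apply the lookbehind/lookahead pattern above with Regex.Replace is sketched below. The class and method names are invented for the example. It relies on .NET allowing variable-length lookbehind, which many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagQuoteSwapper
{
    // Replaces every double quote that sits between a '<' and the next '>'
    // (with no other angle bracket in between) with a single quote.
    public static string SwapQuotesInTags(string html)
    {
        return Regex.Replace(html, @"(?<=\<[^<>]*)""(?=[^><]*\>)", "'");
    }

    public static void Main()
    {
        string input = "<div id=\"mydiv\">This is a \"div\" with quotation marks</div>";
        Console.WriteLine(SwapQuotesInTags(input));
        // prints: <div id='mydiv'>This is a "div" with quotation marks</div>
    }
}
```

An attribute value that itself contains a < or > character would defeat the lookarounds, so this only suits input you know well.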

I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
foreach (var att in node.Attributes)
{
att.QuoteType = AttributeValueQuote.SingleQuote;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3
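In C#, that match-and-replace pair could be wired up roughly as follows. The class and method names are made up for the example. The three groups capture the text before the opening quote, the id value, and the rest of the tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivIdQuoteFixer
{
    // Rewrites id="..." to id='...' inside <div ...> tags using groups $1..$3.
    public static string Fix(string html)
    {
        return Regex.Replace(html, "(<div.*?id=)\"(.*?)\"(.*?>)", "$1'$2'$3");
    }

    public static void Main()
    {
        string input = "<div id=\"mydiv\">This is a \"div\" with quotation marks</div>";
        Console.WriteLine(Fix(input));
        // prints: <div id='mydiv'>This is a "div" with quotation marks</div>
    }
}
```

Unlike the lookaround approach, this version only touches the id attribute of div tags, so other attributes keep their double quotes.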

Parse HTML links using C#

Is there a built-in DLL that will give me a list of links from a string? I want to send in a string with valid HTML and have it parse all the links. I seem to remember there being something built into either .NET or an unmanaged library.
I found a couple open source projects that looked promising but I thought there was a built in module. If not I may have to use one of those. I just didn't want an external dependency at this point if it wasn't necessary.

I'm not aware of anything built in and from your question it's a little bit ambiguous what you're looking for exactly. Do you want the entire anchor tag, or just the URL from the href attribute?
If you have well-formed XHtml, you might be able to get away with using an XmlReader and an XPath query to find all the anchor tags (<a>) and then hit the href attribute for the address. Since that's unlikely, you're probably better off using RegEx to pull down what you want.
Using RegEx, you could do something like:
List<Uri> findUris(string message)
{
    string anchorPattern = "<a[\\s]+[^>]*?href[\\s]?=[\\s\\\"\']+(?<href>.*?)[\\\"\\']+.*?>(?<fileName>[^<]+|.*?)?<\\/a>";
    MatchCollection matches = Regex.Matches(message, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
    if (matches.Count > 0)
    {
        List<Uri> uris = new List<Uri>();
        foreach (Match m in matches)
        {
            // The URL is captured by the named group "href" in the pattern
            string url = m.Groups["href"].Value;
            Uri testUri = null;
            if (Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out testUri))
            {
                uris.Add(testUri);
            }
        }
        return uris;
    }
    return null;
}
Note that I'd want to check the href to make sure that the address actually makes sense as a valid Uri. You can eliminate that if you aren't actually going to be pursuing the link anywhere.
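For the well-formed XHTML case mentioned at the start of this answer, a sketch using XmlDocument and an XPath query might look like the following. The class name and sample markup are assumptions. It uses local-name() so that an XHTML default namespace does not get in the way, and it will throw an XmlException on input that is not well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinkReader
{
    // Returns the href values of all <a> elements, in document order.
    public static List<string> GetHrefs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var hrefs = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//*[local-name()='a']/@href"))
        {
            hrefs.Add(attr.Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical well-formed sample document.
        string xhtml = "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>";
        foreach (string href in GetHrefs(xhtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

This uses only the base class library, so it adds no external dependency, but real-world HTML is rarely well-formed enough for it.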

I don't think there is a built-in library, but the Html Agility Pack is popular for what you want to do.
The way to do this with the raw .NET Framework and no external dependencies would be to use a regular expression to find all the 'a' tags in the string. You may need to take care of a lot of edge cases, e.g. href = "http://url" vs href=http://url.
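As an illustration of the edge cases just mentioned, here is one possible pattern that accepts double-quoted, single-quoted and unquoted href values, with optional whitespace around the equals sign. The pattern and the class name are my own sketch rather than anything from the framework, and there are certainly inputs it will get wrong.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefScanner
{
    // Group 1: double-quoted value, group 2: single-quoted value, group 3: unquoted value.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>""']+))",
        RegexOptions.IgnoreCase);

    public static List<string> FindHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Exactly one of the three alternatives participates in each match.
            if (m.Groups[1].Success) hrefs.Add(m.Groups[1].Value);
            else if (m.Groups[2].Success) hrefs.Add(m.Groups[2].Value);
            else hrefs.Add(m.Groups[3].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<a href = \"http://url\">x</a> <a href=http://url2>y</a> <A HREF='three.html'>z</A>";
        foreach (string href in FindHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Links inside comments or script blocks would still be picked up, which is the kind of case a real HTML parser handles for you.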

SubSonic.Sugar.Web.ScrapeLinks seems to do part of what you want, however it grabs the html from a url, rather than from a string. You can check out their implementation here.

Google gives me this module: http://www.majestic12.co.uk/projects/html_parser.php
Seems to be a HTML parser for .NET.

A simple regular expression -
@"<a.*?>"
passed in to Regex.Matches should do what you need. That regex may need a tiny bit of tweaking, but it's pretty close I think.
