Regex to match anchor with # in href for .NET - c#

I'm trying to match and replace anchor tags using a regex. What I have so far is this:
"(<a href=['\"]?([\\w_\\.]*)['\"]?)"
The problem with this approach is that it fails to capture hrefs that also have # in their value. I've tried
"(<a href=['\"]?([\\w_\\.#]*)['\"]?)"
and
"(<a href=['\"]?([\\w_\\.\\#]*)['\"]?)"
with no success.
What am I doing wrong?
Thank you

I don't think the problem is with # (it works fine for me) but with other URL characters that are missing from the character class, such as -, / and :.
How about a regex like this:
<a href=("[^"]+"|'[^']+'|[^ >]+)
Note: if the HTML is valid, prefer a DOM parser over a regex where possible.
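A minimal sketch of how that pattern could be used from C#. The names HrefFinder and FindHrefs and the sample HTML are made up for illustration; only the pattern comes from the answer above. Note that it only matches when href is the first attribute directly after "<a ".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // Same pattern as above: a double-quoted, single-quoted, or unquoted href value.
    private static readonly Regex HrefPattern =
        new Regex(@"<a href=(""[^""]+""|'[^']+'|[^ >]+)");

    // Returns every href value found in the input, with surrounding quotes removed.
    public static List<string> FindHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            result.Add(m.Groups[1].Value.Trim('"', '\''));
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical input covering the three quoting styles.
        var html = "<a href=\"page.html#top\">Top</a> <a href='a-b/c.html'>x</a> <a href=plain.html>y</a>";
        foreach (var href in FindHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because the value is matched as "anything up to the closing quote", characters such as #, -, / and : need no special handling.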

If you just want to replace the anchor part, use string operations. They are simpler and faster:
var parts = "http://someurl.com#hashpart".Split('#');
// yields "http://someurl.com" and "hashpart" as an array.
// you may want to check that the result has a length of two
// if it does:
var newUrl = string.Format("{0}#{1}", parts[0], "some replacement for hashpart");
If your URL contains multiple hashes, use IndexOf and Substring to split at the first hash instead:
var url = "http://someurl.com#hash#hashhash";
var hashPos = url.IndexOf('#'); // -1 if there is no hash, so check before calling Substring
var urlPart = url.Substring(0, hashPos); // "http://someurl.com"
var hashPart = url.Substring(hashPos + 1); // "hash#hashhash"
I wrote this without running it, so double-check the offsets.

<a href=(('|")[^\2]+?\2|[^>]+)

Related

Regex to Replace the end of the Url

I have a URL that follows a pattern like the one below:
http://i.ebayimg.com/00/s/MTUw12323gxNTAw/$(KGr123qF,!p0F123Q~~60_12.JPG?set_id=88123231232F
I need a regex to find and replace the end of the URL, _12.JPG, with _14.JPG. So basically I need to capture the _[numbers only].JPG pattern and replace it with my value.
var regex = new Regex(@"_\d+\.JPG");
var newUrl = regex.Replace(url, "_14.JPG");
_[0-9]+\.JPG\?
works for the sample URL. You didn't really mention whether you wanted the
?set_id=88123231232F gone or not.
You normally don't need to worry about periods elsewhere in the URL. They can occur, but requiring the .JPG extension right after the digits should rule out nearly all false matches.
// /_(\d?\d)\.jpg/ig
var regex = new Regex(@"_(\d?\d)\.[Jj][Pp][Gg]");
That will capture one or two digits between an underscore and .jpg, in any letter case.
I will double check this, but it should work for both one digit and two digits.
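For reference, here is a small sketch that combines the ideas above: it matches any number of digits, ignores letter case, and leaves the query string alone. ImageUrlFixer and ReplaceSuffix are hypothetical names, and the lookahead (?=\?|$) is my own addition so that the match has to end the path. The new suffix is passed as a replacement pattern, so it is assumed not to contain $.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageUrlFixer
{
    // Replaces "_<digits>.jpg" (any letter case) only where it ends the path,
    // i.e. right before a '?' or the end of the string.
    public static string ReplaceSuffix(string url, string newSuffix)
    {
        return Regex.Replace(url, @"_\d+\.jpg(?=\?|$)", newSuffix, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var url = "http://i.ebayimg.com/00/s/MTUw12323gxNTAw/$(KGr123qF,!p0F123Q~~60_12.JPG?set_id=88123231232F";
        // The query string after '?' is kept unchanged.
        Console.WriteLine(ReplaceSuffix(url, "_14.JPG"));
    }
}
```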

Find and replace - should I use Regex?

I need to create a simple markup fix, and I have already done everything else I need, such as bold and italic. But this one is a bit harder than what I've done so far and I have no idea how to do it. Basically my input is very simple:
[imgGroup="group1"]
image1.jpg
[/imgGroup]
As you can see, I pass a parameter, group1, and inside the tag I have image1.jpg. I need to convert this into a link that contains the image and carries the group in its rel attribute, like so:
<a href="image1.jpg" rel="group1" >
<img src="image1.jpg" />
</a>
I think that I will need to use Regex for this problem, however I only know how to find something in between 2 tags, not so much for this problem... I'm using ASP.NET MVC3 with C#.
You could use named groups in Regex to match, then you can just re-assemble them in the order you want:
var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])"); // Set up your regex with named groups
var result = regex.Replace("inputstring", "${second} ${first}"); // Replace each match with the given text, including whatever was matched by the named groups first and second
Be warned though, RegEx with things like Urls and HTML parsing can very, very quickly turn into a horror beyond your wildest dreams ;)
Good luck!
Named groups for RegEx in dot net
Here is my suggestion for you:
var regex = new Regex(@"\[imgGroup=""group1""\]\s*(?<Content>\S*)\s*\[/imgGroup\]");
var newValue = regex.Replace(oldValue, @"<a href=""${Content}"" rel=""group1""><img src=""${Content}"" /></a>");
That should do what you've expected.
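If the group name should not be hard-coded, it can be captured as well. The following is only a sketch under a few assumptions: the group name is always double-quoted, the image path contains no whitespace, and neither value needs HTML encoding. ImgGroupMarkup and ToHtml are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgGroupMarkup
{
    // "group" captures the quoted parameter, "src" the image path between the tags.
    private static readonly Regex Pattern = new Regex(
        @"\[imgGroup=""(?<group>[^""]+)""\]\s*(?<src>\S+?)\s*\[/imgGroup\]");

    // Rewrites every [imgGroup="..."]...[/imgGroup] block into a linked image.
    public static string ToHtml(string input)
    {
        return Pattern.Replace(
            input,
            @"<a href=""${src}"" rel=""${group}""><img src=""${src}"" /></a>");
    }

    public static void Main()
    {
        var input = "[imgGroup=\"group1\"]\nimage1.jpg\n[/imgGroup]";
        Console.WriteLine(ToHtml(input));
    }
}
```

The ${name} syntax in the replacement string refers to the named groups, so the same captured path can be used for both href and src.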

Regular expression for recognizing url

I want to create a Regex for url in order to get all links from input string.
The Regex should recognize the following formats of the url address:
http(s)://www.webpage.com
http(s)://webpage.com
www.webpage.com
and also the more complicated urls like:
- http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935
I have the following one
((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?
EDIT:
It should find each link and also let me insert a hyperlink at the matching position, like this:
private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);
foreach (Match match in (RE_URL.Matches(new_text)))
{
// Copy raw string from the last position up to the match
if (match.Index != last_pos)
{
var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
text_block.Inlines.Add(new Run(raw_text));
}
// Create a hyperlink for the match
var link = new Hyperlink(new Run(match.Value))
{
NavigateUri = new Uri(match.Value)
};
link.Click += OnUrlClick;
text_block.Inlines.Add(link);
// Update the last matched position
last_pos = match.Index + match.Length;
}
I don't know why your result in match is only http:// but I cleaned your regex a bit
((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)
(?:) are non-capturing groups, which means only one capturing group is left, and it contains the complete matched string.
(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with a scheme from the first list followed by an optional www., or with www. alone.
[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (you were creating a character range) and unescaped the . (escaping is not needed in a character class).
See this here on Regexr, a useful tool to test regexes.
But URL matching is not a simple task, please see this question here
I've just written up a blog post on recognising URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/ however I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression in case you need to extend or tweak it.
The regex you give doesn't work for www. addresses because it is expecting a URI scheme (the bit before the URL, like http://). The 'www.' part in your regular expression doesn't work because it would only match www.:// (which is meaningless)
Try something like this instead:
((((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.))[\w\d:#@%/;$()~_?\+-=\\\.&]*)
This will match something with a valid URI scheme, or something beginning with 'www.'
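To check a pattern like the ones suggested above against real input, a small harness along these lines may help. It is a sketch, not a complete URL matcher: UrlExtractor and Extract are made-up names, and the pattern is the "scheme or www." alternation followed by the character class with the comma added and the hyphen escaped. Because that class includes characters such as . and ), punctuation directly after a link will be swallowed into the match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Either "<scheme>://" (optionally followed by "www.") or a bare "www.",
    // then one or more characters allowed in the rest of the URL.
    private static readonly Regex UrlPattern = new Regex(
        @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+");

    // Returns all matched links in the order they appear in the text.
    public static List<string> Extract(string text)
    {
        var links = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            links.Add(m.Value);
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample text with one bare www. link and one link with a scheme.
        foreach (var link in Extract("see www.webpage.com and https://webpage.com/a?b=1 now"))
        {
            Console.WriteLine(link);
        }
    }
}
```

A bare www. match has no scheme, so it would need "http://" prepended before being passed to new Uri(...).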

Regex for a string

It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples above, I need to match the string no matter how many </div> tags occur. If any other string occurs between </div> and <br>, for example <div>abc</div></div></div>DEF</div></div><br> or <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are no tags in the abc part (or anything else that contains a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ at the beginning and end of my regex because we cannot be sure that your sample will always be on a single line.

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!
Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian
?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)
I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
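Put together, the corrected named capture could be used roughly like this. It is a sketch that assumes the span always looks exactly like the fragment in the question and that the count consists only of digits; PeopleCountParser, Extract and the group name count are illustrative. A verbatim string avoids the doubled backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PeopleCountParser
{
    // The named group "count" holds the digits between the span tags.
    private static readonly Regex CountPattern =
        new Regex(@"<span class=""peopleCount"">(?<count>\d+)</span>");

    // Returns the number as text, or null when the span is not found.
    public static string Extract(string responseHtml)
    {
        Match match = CountPattern.Match(responseHtml);
        return match.Success ? match.Groups["count"].Value : null;
    }

    public static void Main()
    {
        var responseHtml =
            "<span class=\"header\">Number of People:</span>\n" +
            "<span class=\"peopleCount\">1001</span>\n" +
            "<span class=\"footer\">As of June 2009.</span>";
        Console.WriteLine(Extract(responseHtml)); // prints 1001
    }
}
```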
