Find and replace - should I use Regex? - c#

I need to create a simple markup fix and I already did everything that I need, like bold and italic etc. But this is a bit harder than what I've done so far and I have no idea how to do this. Basically my input is very simple:
[imgGroup="group1"]
image1.jpg
[/imgGroup]
As you can see, I pass a parameter (group1) and inside I have image1.jpg. I need to convert this into a link that wraps the image and carries the group in its rel attribute, like so:
<a href="image1.jpg" rel="group1" >
<img src="image1.jpg" />
</a>
I think that I will need to use Regex for this problem, however I only know how to find something in between 2 tags, not so much for this problem... I'm using ASP.NET MVC3 with C#.

You could use named groups in Regex to match, then you can just re-assemble them in the order you want:
var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])"); // Set up your regex with named groups
var result = regex.Replace("inputstring", "${second} ${first}"); // Replace each match with the text captured by the named groups, in swapped order
Be warned though, Regex with things like URLs and HTML parsing can very, very quickly turn into a horror beyond your wildest dreams ;)
Good luck!
Named groups for RegEx in dot net
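To make the named-group replacement concrete, here is a minimal sketch. The class name NamedGroupSwap, the method name Swap, and the group names first and second are made up for illustration; only Regex.Replace and the ${name} substitution syntax come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSwap
{
    // Hypothetical helper: matches "digit + whitespace" followed by a lowercase
    // letter and emits the two captured pieces in the opposite order.
    public static string Swap(string input)
    {
        var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])");
        // ${name} in the replacement string refers to a named group.
        return regex.Replace(input, "${second} ${first}");
    }
}
```

Note that the whitespace is part of the first group, so NamedGroupSwap.Swap("1 a") ends with that captured space.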

Here is my suggestion for you:
var regex = new Regex(@"\[imgGroup=""(?<Group>[^""]+)""\]\s*(?<Content>\S+)\s*\[/imgGroup\]"); // captures the group name and the image path
var newValue = regex.Replace(oldValue, @"<a href=""${Content}"" rel=""${Group}""><img src=""${Content}"" /></a>");
That should do what you expect.
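A slightly fuller sketch of the same idea, assuming the markup always looks like the sample in the question (double-quoted group name, one image path, closing [/imgGroup] tag). The class name ImgGroupMarkup and method name ToHtml are hypothetical, and HTML-encoding the captured values is an added precaution, not something the question asked for.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ImgGroupMarkup
{
    // Captures the quoted group name and the image path between the tags.
    private static readonly Regex Pattern = new Regex(
        @"\[imgGroup=""(?<group>[^""\]]+)""\]\s*(?<src>\S+?)\s*\[/imgGroup\]");

    // Hypothetical helper: rewrites every [imgGroup] block into an anchor + image.
    public static string ToHtml(string markup)
    {
        return Pattern.Replace(markup, m =>
        {
            // Encode the captured text so it is safe inside attribute values.
            string group = WebUtility.HtmlEncode(m.Groups["group"].Value);
            string src = WebUtility.HtmlEncode(m.Groups["src"].Value);
            return "<a href=\"" + src + "\" rel=\"" + group + "\">"
                 + "<img src=\"" + src + "\" /></a>";
        });
    }
}
```

Using a MatchEvaluator (the lambda) instead of a replacement string keeps the encoding step in one place; for trusted input a plain replacement string with ${group} and ${src} would do.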

Related

Regex formula assistance

I'm trying to find a regex formula for these HTML nodes:
first: Need the inner html value
<span class="profile fn">Any Name Here</span>
second: need the title value
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
third: need the inner html value
<div class="msgbody">
Some message here. might contain any character.
</div>
I'm fairly new to regex, and was hoping someone could offer me some guidance with this. I'll be using it with C# if that makes a difference.
Edit:
The HTML I'd be pulling this out of would look like this:
<div class="message">
<div class="from"><span class="profile fn">Name</span></div>
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
<div class="msgbody">
Some message
</div>
</div>
A lot of people are quite dismissive of using regular expressions to deal with HTML; however, I believe that if your HTML is assuredly regular and well formatted then you can use Regex successfully.
If you can't assure that, then I urge you to check out the HTML Agility Pack. It's a library for parsing HTML in C# and works very well.
I'm not on my PC, but I'll edit my answer with a suggested regex for your examples, give you something to try at least.
For this one:
<span class="profile fn">Any Name Here</span>
Try
"<span.*?>(?<span>.*?)</span>"
Then you can access this via the Groups["span"] indexer of the Match your regex returns.
For the Abbr tag:
<abbr class="time published" title="2012-08-11T07:02:50+0000">...snip...</abbr>
It's similar
"<abbr.*?title=\"(?<title>.*?)\".*?>"
And lastly for the div:
<div class="msgbody">
Some message here. might contain any character.
</div>
Is:
"<div.*?>(?<div>.*?)</div>"
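For this one, you will need to set the Singleline regex option (RegexOptions.Singleline) so that . also matches the newlines inside the div. The Multiline option only changes how ^ and $ behave.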
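Here is a rough sketch that applies three such patterns to the sample HTML from the question. It is not the answer's exact code: the patterns are tightened to include the class attributes (an assumption that the markup is always written this way), and the class name MessageFields, the method Extract, and the group name value are invented for the example. RegexOptions.Singleline is what lets . cross line breaks in the message body.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MessageFields
{
    // Patterns assume the exact attribute layout shown in the question.
    public const string NamePattern = @"<span class=""profile fn"">(?<value>.*?)</span>";
    public const string TimePattern = @"<abbr[^>]*?\btitle=""(?<value>[^""]*)""";
    public const string BodyPattern = @"<div class=""msgbody"">(?<value>.*?)</div>";

    // Hypothetical helper: returns the trimmed "value" group, or null if nothing matched.
    public static string Extract(string html, string pattern)
    {
        // Singleline makes '.' match '\n', which the multi-line msgbody needs.
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["value"].Value.Trim() : null;
    }
}
```

Calling MessageFields.Extract(html, MessageFields.BodyPattern) without the Singleline option would fail on the sample, because the message text sits on its own line.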
The key point is the .*? quantifier.
Adding the question mark turns a greedy match into a lazy (non-greedy) one: the Regex engine consumes as few characters as possible and only extends the match until the rest of the pattern succeeds, rather than consuming as much as possible and then backtracking to the last possible match. This is incredibly important for matching in HTML, where you will have many chevrons closing tags.
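The difference between .* and .*? is easiest to see on input with two tags in a row. The sketch below is only an illustration; the class name GreedyVsLazy and the method FirstGroup are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: returns the text captured by the first group of the pattern.
    public static string FirstGroup(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

With the input "<span>one</span><span>two</span>", the greedy pattern <span>(.*)</span> captures everything up to the last closing tag ("one</span><span>two"), while the lazy pattern <span>(.*?)</span> stops at the first closing tag and captures just "one".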
The big problem you'll get though is, what happens if the inner text or an attribute has an '<' or an '"' character in it?
It's very hard to make Regex only match balanced <>'s and it can't easily not use ones that are in between quotes; this is why the Agility pack is often preferred.
Hope this helps anyway!
Edit:
How to use named capture groups
This syntax (?<name>..selector..) tells the Regex engine to capture whatever is matched between the brackets into a named value that can be read out of the actual match object.
So for this HTML
<span>TEST</span>
You'd use this code:
string HTML = "<span>TEST</span>";
Regex r = new Regex("<span>(?<s>.*?)</span>");
var match = r.Match(HTML);
string stuff = match.Groups["s"].Value;
//stuff should = "TEST"
If you think you'll have multiple matches, then iterate over the result of Regex.Matches instead:
foreach (Match m in r.Matches(HTML))
{
string stuff = m.Groups["s"].Value;
}
This should yield the answer you need.
If your html is always the same you can use this ugly pattern:
"profile fn"[^>]*>(?<name>[^<]+)(?:[^t]+|t(?!itle=))+title="(?<time>[^"]+)(?:[^m]+|m(?!sgbody"))+msgbody">\s*(?<msg>(?:[^<\s]+|(?>\s+)(?!<))+)
results are in m.Groups["name"], m.Groups["time"], m.Groups["msg"]

Regex to match anchor with # in href for .NET

I'm trying to match and replace anchor tags using a regex. What I have so far is this:
"(<a href=['\"]?([\\w_\\.]*)['\"]?)"
The problem with this approach is that it fails to capture hrefs that also have # in their value. I've tried
"(<a href=['\"]?([\\w_\\.#]*)['\"]?)"
and
"(<a href=['\"]?([\\w_\\.\\#]*)['\"]?)"
with no success.
What am I doing wrong?
Thank you
I don't think the problem is with # (works fine for me) but with missing other url characters, such as -, /, : etc.
How about a regex like this:
<a href=("[^"]+"|'[^']+'|[^ >]+)
Note: If possible, use other parsing DOM methods for valid html.
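As a quick sanity check of the pattern above, the following sketch collects every href it finds and strips the surrounding quotes. The names AnchorHrefs and Find are hypothetical, and the pattern still assumes href is the first attribute directly after "<a ", exactly as written in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorHrefs
{
    // Double-quoted, single-quoted, or unquoted href value.
    private static readonly Regex Href =
        new Regex(@"<a href=(""[^""]+""|'[^']+'|[^ >]+)");

    // Hypothetical helper: returns all href values without their quotes.
    public static List<string> Find(string html)
    {
        var result = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            result.Add(m.Groups[1].Value.Trim('"', '\''));
        }
        return result;
    }
}
```

Because the value is matched by "anything except the closing quote" rather than by a whitelist of characters, values containing #, -, / or : are picked up without special handling.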
If you just want to replace the anchor part, use string operations. They are simpler and faster.
var parts = "http://someurl.com#hashpart".Split('#');
// yields "http://someurl.com" and "hashpart" as array.
// you may want to check if the result has length of two
// if it does :
var newUrl = string.Format("{0}#{1}", parts[0], "some replacement for hashpart");
If your URL contains multiple hashes, try using string.Substring to split at the first hash.
var url = "http://someurl.com#hash#hashhash";
var hashPos = url.IndexOf('#');
var urlPart = url.Substring(0, hashPos); // everything before the first '#'
var hashPart = url.Substring(hashPos + 1); // everything after the first '#'
Should work, wrote it without verification, maybe you have to toss around some +/- 1 to get the right positions.
<a href=((['"]).+?\2|[^\s>]+)

Regex for a string

It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples provided above, I need to match the string no matter how many </div> tags occur. If any other text occurs between </div> and <br>, like this: <div>abc</div></div></div>DEF</div></div><br> OR <div>abc</div></div></div></div></div>DEF<br>, then the Regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are no tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +.
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
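The anchored variant can be checked against the samples from the question with a small sketch like this one. The class name DivBrMatcher is invented, and the pattern is the one from the notes above with the unnecessary \/ escapes dropped (a forward slash needs no escaping in .NET).

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivBrMatcher
{
    // <div>, some text without '<', any number of </div>, then <br> and nothing else.
    private static readonly Regex Pattern =
        new Regex(@"^<div>([^<]+)(?:</div>)*<br>$");

    // Hypothetical helper: true only when the whole string has the expected shape.
    public static bool IsMatch(string s)
    {
        return Pattern.IsMatch(s);
    }
}
```

Both valid samples from the question match, while the two variants with DEF between the closing tags and <br> are rejected, because after the run of </div> tags the pattern insists on <br> immediately.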
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => m.Success ? m.Groups["text"].Value : null; // null when the input does not match
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ at the beginning and the end of my regex because we cannot be sure that your sample will always be on a single line.

Need a regular expression to get rid of parenthesis in html image tag filename

So say I have some html with an image tag like this:
<p> (1) some image is below:
<img src="/somwhere/filename_(1).jpg">
</p>
I want a regex that will just get rid of the parenthesis in the filename so my html will look like this:
<p> (1) some image is below:
<img src="/somwhere/filename_1.jpg">
</p>
Does anyone know how to do this? My programming language is C#, if that makes a difference...
I will be eternally grateful and send some very nice karma your way. :)
I suspect your job would be much easier if you used the HTML Agility Pack for this instead of regexes. Judging from the answers, it will make parsing the HTML a lot easier for what you are trying to do.
Hope this helps,
Best regards,
Tom.
This (rather dense) regex should do it:
string s = Regex.Replace(input, @"(<img\s+[^>]*src=""[^""]*)\((\d+)\)([^""]*""[^>]*>)", "$1$2$3");
Nick's solution is fine if the file names always match that format, but this one matches any parenthesis, anywhere in the attribute:
s = Regex.Replace(input, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");
The lookbehind ensures that the match occurs inside the src attribute of an img tag. It assumes the attribute is enclosed in double-quotes (quotation marks); if you need to allow for single-quotes (apostrophes) or no quotes at all, the regex gets much more complicated. I'll post that if you need it.
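To show the lookbehind approach on the HTML from the question, here is a minimal sketch. The class name ImgSrcCleaner and the method Clean are hypothetical; the pattern is the one described above and relies on .NET's support for variable-length lookbehind, which many other regex engines lack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcCleaner
{
    // Matches '(' or ')' only when preceded by <img ... src="... with no closing quote yet.
    private static readonly Regex ParenInSrc = new Regex(
        @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]");

    // Hypothetical helper: removes parentheses inside double-quoted img src values only.
    public static string Clean(string html)
    {
        return ParenInSrc.Replace(html, "");
    }
}
```

The "(1)" in the paragraph text is left alone because no <img ... src=" precedes it; only the parentheses inside the file name are removed.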
In this simple case, you could just use string.Replace, for example:
string imgFilename = "/somewhere/image_(1).jpg";
imgFilename = imgFilename.Replace("(", "").Replace(")", "");
Or do you need a regex for replacing the complete tag inside a HTML string?
Regex.Replace(some_input, @"(?<=<\s*img\s*src\s*=\s*""[^""]*?)(?:\(|\))(?=[^""]*?""\s*\/?\s*?>)", "");
Finds ( or ) preceded by <img src =" and, optionally, text (with any whitespace combination, though I didn't include newline), and followed by optional text and "> or "/>, again with any whitespace combination, and replaces them with nothingness.

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!
Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian
?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)
I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
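Putting the corrected named capture to work, a small sketch might look like this. The class name PeopleCountParser, the method Parse, and the group name count are made up for illustration; the pattern assumes the span is written exactly as in the question and contains only ASCII digits.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PeopleCountParser
{
    // Hypothetical helper: returns the people count, or null when the span is missing.
    public static int? Parse(string responseHtml)
    {
        Match m = Regex.Match(
            responseHtml,
            @"<span class=""peopleCount"">(?<count>[0-9]+)</span>");

        int value;
        if (m.Success && int.TryParse(m.Groups["count"].Value, NumberStyles.None,
                                      CultureInfo.InvariantCulture, out value))
        {
            return value;
        }
        return null; // no match, or the number does not fit in an int
    }
}
```

Returning a nullable int keeps "span not found" distinct from a genuine count of zero.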
