Regex to parse and replace img src in C#/.NET?

Ahoy,
I have a problem, see; I have strings like:
<img width="594" height="392" src="/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />
They are not consistently formatted.
I need to parse strings like this, and return the following:
<img width="594" height="392" src="/exploding%20the%20VDI%20vDesktop-VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />
Changes:
Remove everything except the immediate directory in which the image file lies.
Instead of that directory being a subdirectory, prepend it onto the file name.
So if the file is currently in /blabla/bla/blaaaaah/pickles/pickle.png
then I want the IMG SRC attribute to say pickles-pickle.png
Now, I've been trying to do this with regex, but after 3 hours, I've discovered something about myself... I am awful at regex. I could be at this for weeks, and I'd never get anywhere.
Thus, I am asking this wonderful community for two things:
How would you do this? Is regex even the right answer? I need to be able to parse any SRC attributes inside IMG tags (whether or not they have height/width or other attributes).
What resources would you recommend for me to learn regex with .NET?
Now for the problem at hand, I suppose I could do a string.replace where I....
Find the IMG tag, and get indexes of the surrounding '<' and '>'
Find index of 'SRC=' and ' ' (space) between those two instances
Find last index of '/' between the src and space indexes
Find second to last index of '/' between src and space indexes
Replace... er no, remove... everything before the second to last instance of '/'...
...String.Replace remaining '/' with '-'.
....I.. I think that'd do it?
But DAMN that is ugly. A regex would be so much prettier, don't you think?
Any advice?
Note: I tagged this as 'homework', but it's not homework. I'm volunteering for work after-hours to save the company like 200k. This is literally the last piece of an incredibly convoluted (to me) puzzle. Of course, I don't see a penny of that 200k, but I look good doing it.

To get the tag, I suggest using HtmlAgilityPack. It's safer than running a regex over an entire HTML page.
Use something like this to get the image nodes:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var imgs = doc.DocumentNode.SelectNodes("//img");
Use something like this to get/set the attributes:
foreach (var img in imgs)
{
    string orig = img.Attributes["src"].Value;
    // do replacements on orig to produce a new string, newsrc
    img.SetAttributeValue("src", newsrc);
}
So, what kind of replacements should you do? I do agree that using a Regex is much more elegant. Things like these are what it's for after all!
Something like this should do the trick:
string s = @"/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG";
string n = Regex.Replace(s, @"(.*?)\/([^\/]*?)\/([^\/]*?)$", @"/$2-$3");
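For comparison, the index-based steps listed in the question can be sketched without any regex. This is only a minimal sketch: the helper name FlattenSrc is hypothetical, and it assumes the value passed in is already the bare src path (for example, the value read from the img node above). It keeps the leading slash, as the Regex.Replace call does.

```csharp
using System;

// Hypothetical helper (not from the original answer): folds the immediate
// parent directory into the file name, e.g. /a/b/dir/file.png -> /dir-file.png.
string FlattenSrc(string src)
{
    int lastSlash = src.LastIndexOf('/');
    if (lastSlash <= 0)
    {
        return src; // no parent directory to fold into the file name
    }

    // Second-to-last slash; -1 when the path has a single directory and no leading slash.
    int prevSlash = src.LastIndexOf('/', lastSlash - 1);
    string dir = src.Substring(prevSlash + 1, lastSlash - prevSlash - 1);
    string file = src.Substring(lastSlash + 1);
    return "/" + dir + "-" + file;
}

Console.WriteLine(FlattenSrc("/blabla/bla/blaaaaah/pickles/pickle.png"));
// prints /pickles-pickle.png
```

Whether this reads better than the regex is a matter of taste; it does make the edge cases (no directory, no leading slash) explicit.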
Some resources that you can use to learn C# Regexing:
dotnetperls Regex.Match
MSDN: Regex.Match method
MSDN Regex Cheat Sheet

(?<=src=)"[^" ]*\/(?=[^\/"]*\/)
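Try this. Replace with an empty string. Note that the match includes the opening quote after src=, so replace with " instead if you want to keep it.
http://regex101.com/r/dZ1vT6/50
I must warn you that it's a kind of hack. HTML should not be parsed with regex.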

Replace this
(?i)(?<=<img\s[\s\S]*?src=")(?:[^"]*\/)+(?=[^"]*\/)([^\/]*)\/([^"]+)
To:
/$1-$2
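In C#, that pattern and replacement could be applied roughly as follows. It is a sketch only: the helper name FlattenImgSrc is hypothetical, and the pattern assumes double-quoted src attributes. It relies on .NET's support for variable-length lookbehind, so it will not port to most other regex engines as-is.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern and replacement are taken from the answer above; only the wrapper is new.
string FlattenImgSrc(string html) =>
    Regex.Replace(
        html,
        @"(?i)(?<=<img\s[\s\S]*?src="")(?:[^""]*\/)+(?=[^""]*\/)([^\/]*)\/([^""]+)",
        "/$1-$2");

string sample = @"<img width=""594"" height=""392"" src=""/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG"" alt="""" />";
Console.WriteLine(FlattenImgSrc(sample));
// prints <img width="594" height="392" src="/exploding%20the%20VDI%20vDesktop-VDI3.PNG" alt="" />
```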

Related

Regex matching URL's by sub-folder

I am essentially trying to write an outbound URL matcher so I can rewrite a stream of HTML containing URLs to point to my CDN. I can't use the IIS URL Rewrite module as I am using compression. I currently have a regex that matches on a sub-folder for a specific set of file types, i.e.
Regex ASSET_PATH = new Regex(@"(?i)assets/([A-Za-z0-9\-_/.]+)\.(jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase );
This works great and allows me to manipulate anything in the string from that point onwards ( i.e. from "assets/" onwards to the right ). What I need to achieve is to manipulate the string to the left of the "assets/" sub-folder, without necessarily knowing the format? Here are some examples :
<img src="./assets/123/pig.jpg" />
<img src="http://mysite.blah/assets/123/pig.jpg" />
<img src="http://www.mysite.blah/assets/123/pig.jpg" />
<img src='assets/123/pig.jpg' />
in css / inline styles :
background-image : URL('assets/123/pig.jpg')
background-image : URL(http://www.mysite.blah/assets/123/pig.jpg)
Anyway, I think you get the picture. I essentially want to be able to look to the "left" of the word "assets" until I can find the logical start point of the URL and then manipulate it from there to point to my CDN.
I'm not sure this is possible in regex, so any suggestions using a combination of regex / C# / HTML Agility Pack are welcome.

Is this what you're after?
(?<BeforeAssets>.*?(?:\/|^))assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
You can try this out here: http://regexstorm.net/tester
Or here: https://regex101.com/r/b8XxcF/1
NB: In the above regex I escaped the forward slash characters. .Net doesn't require this, but doesn't complain; and doing so makes this compatible with other Regex engines; which means it can be tested on Regex101.
When testing with those tools you'll need to specify the MultiLine or SingleLine options to get the example where assets/ has nothing preceding it, since otherwise the ^ character won't match the start of that line. This option may not be required in your code; i.e. if you're only matching one string at a time, rather than a whole block of text.
Update
Apologies for misreading; you're parsing the full HTML page; not just the URIs returned from that page. To do this you could use something like:
["'\(](?<BeforeAssets>[^"'\(\)]*?)assets\/(?<AfterAssets>[A-Za-z0-9\-_\/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)
(thankfully the characters ", ', and ( rarely appear unescaped inside a URL: " is not allowed at all, while ' and ( are reserved sub-delimiters, so they should usually be OK to detect the start of a URL: https://www.rfc-editor.org/rfc/rfc3986#section-2.2.)
This isn't fool-proof; it's better to use an HTML parsing tool, then pull out the URIs from that; but if you are doing everything with regex, hopefully this will help.
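A rough sketch of how the second pattern might be used to rewrite URLs in C#. The CDN base URL, the extra Delim group, and the helper name PointToCdn are all assumptions made for illustration; none of them come from the question. The opening delimiter is captured so it can be written back unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical CDN base URL, used only for this example.
const string CdnBase = "https://cdn.example.com/";

// Same shape as the pattern above, with the opening delimiter captured as "Delim".
var assetUrl = new Regex(
    @"(?<Delim>[""'(])(?<BeforeAssets>[^""'()]*?)assets/(?<AfterAssets>[A-Za-z0-9\-_/.]+)\.(?<FileExtension>jpg|jpeg|bmp|tiff|png|gif|js|css|mov|mp4|ogg|avi|mp3)",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

// Drops whatever preceded "assets/" and substitutes the CDN base instead.
string PointToCdn(string html) =>
    assetUrl.Replace(html, m =>
        m.Groups["Delim"].Value + CdnBase + "assets/" +
        m.Groups["AfterAssets"].Value + "." + m.Groups["FileExtension"].Value);

Console.WriteLine(PointToCdn("<img src=\"http://www.mysite.blah/assets/123/pig.jpg\" />"));
// prints <img src="https://cdn.example.com/assets/123/pig.jpg" />
```

Using a MatchEvaluator rather than a replacement string keeps the rewrite logic in ordinary C#, which is easier to extend (for example, to skip URLs on other hosts).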

Find and replace - should I use Regex?

I need to create a simple markup fix and I have already done everything else that I need, like bold and italic. But this is a bit harder than what I've done so far and I have no idea how to do it. Basically my input is very simple:
[imgGroup="group1"]
image1.jpg
[/imgGroup]
As you can see I pass a param that is group1 and inside I have image1. I need to convert this into a link that has this image inside and group in rel tag like so:
<a href="image1.jpg" rel="group1" >
<img src="image1.jpg" />
</a>
I think that I will need to use Regex for this problem, however I only know how to find something in between 2 tags, not so much for this problem... I'm using ASP.NET MVC3 with C#.

You could use named groups in Regex to match, then you can just re-assemble them into the order you want:
var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])"); // Set up your regex with named groups (the names first and second are illustrative)
var result = regex.Replace("inputstring", "${second} ${first}"); // Replace input string with the given text, including anything matched in the named groups first and second
Be warned though, RegEx with things like Urls and HTML parsing can very, very quickly turn into a horror beyond your wildest dreams ;)
Good luck!
Named groups for RegEx in dot net

Here is my suggestion for you:
var regex = new Regex(@"\[imgGroup=""" + "group1" + @"""\]\s*(?<Content>\S*)\s*\[/imgGroup\]");
var newValue = regex.Replace(oldValue, @"<a href=""${Content}"" rel=""group1"" ><img src=""${Content}"" /> </a>");
That should do what you've expected.
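If the group name varies, it can be captured as well instead of being hard-coded. A minimal sketch, assuming the markup always has the shape shown in the question (quoted group name, one file name with no spaces); the Group capture and the helper name ExpandImgGroups are additions for illustration, and the values are not HTML-encoded here.

```csharp
using System;
using System.Text.RegularExpressions;

// Captures both the group name and the image file name.
var imgGroup = new Regex(
    @"\[imgGroup=""(?<Group>[^""]*)""\]\s*(?<Content>\S+)\s*\[/imgGroup\]");

// Hypothetical helper: rewrites every [imgGroup] block into a link wrapping the image.
string ExpandImgGroups(string markup) =>
    imgGroup.Replace(
        markup,
        @"<a href=""${Content}"" rel=""${Group}""><img src=""${Content}"" /></a>");

Console.WriteLine(ExpandImgGroups("[imgGroup=\"group1\"]\nimage1.jpg\n[/imgGroup]"));
// prints <a href="image1.jpg" rel="group1"><img src="image1.jpg" /></a>
```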

Regular expression to replace quotation marks in HTML tags only

I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be wary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution...I just need a bit of help getting the right expression.
Edit #2: Jens helped to find a solution for me but anyone randomly coming to this page should think long and very hard about using this solution. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks; make sure you do too. If you're not sure whether you know, then you probably don't and shouldn't use this method. You've been warned.

This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
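As a sketch, the replacement could look like this in C#. The helper name is hypothetical, and the approach assumes attribute values never contain < or >. The unnecessary backslashes before < and > are dropped, which does not change what the pattern matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Replaces every double quote that sits between a '<' and the next '>'.
string SingleQuoteAttributes(string html) =>
    Regex.Replace(html, @"(?<=<[^<>]*)""(?=[^<>]*>)", "'");

Console.WriteLine(SingleQuoteAttributes("<div id=\"mydiv\">This is a \"div\" with quotation marks</div>"));
// prints <div id='mydiv'>This is a "div" with quotation marks</div>
```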
Note: While I have found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are met with a little too much anger, in my opinion. After all, this question here does not ask "What regex matches all valid HTML, and does not match anything else."

I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
    foreach (var att in node.Attributes)
    {
        att.QuoteType = AttributeValueQuote.SingleQuote;
    }
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3

Need a regular expression to get rid of parenthesis in html image tag filename

So say I have some html with an image tag like this:
<p> (1) some image is below:
<img src="/somwhere/filename_(1).jpg">
</p>
I want a regex that will just get rid of the parenthesis in the filename so my html will look like this:
<p> (1) some image is below:
<img src="/somwhere/filename_1.jpg">
</p>
Does anyone know how to do this? My programming language is C#, if that makes a difference...
I will be eternally grateful and send some very nice karma your way. :)

Judging from the other answers, I suspect your job would be much easier if you used the HTML Agility Pack instead of regexes. It will make parsing the HTML a lot easier and help you achieve what you are trying to do.
Hope this helps,
Best regards,
Tom.

This (rather dense) regex should do it:
string s = Regex.Replace(input, @"(<img\s+[^>]*src=""[^""]*)\((\d+)\)([^""]*""[^>]*>)", "$1$2$3");

Nick's solution is fine if the file names always match that format, but this one matches any parenthesis, anywhere in the attribute:
s = Regex.Replace(s, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");
The lookbehind ensures that the match occurs inside the src attribute of an img tag. It assumes the attribute is enclosed in double-quotes (quotation marks); if you need to allow for single-quotes (apostrophes) or no quotes at all, the regex gets much more complicated. I'll post that if you need it.
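A small sketch of the lookbehind version applied to the HTML from the question. The helper name is hypothetical, and the double-quote assumption described above still applies. Parentheses outside the src attribute, such as the "(1)" in the paragraph text, are left alone.

```csharp
using System;
using System.Text.RegularExpressions;

// Removes '(' and ')' only when they appear inside a double-quoted src of an img tag.
string StripSrcParens(string html) =>
    Regex.Replace(html, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");

string page = "<p> (1) some image is below:\n<img src=\"/somwhere/filename_(1).jpg\">\n</p>";
Console.WriteLine(StripSrcParens(page));
// prints:
// <p> (1) some image is below:
// <img src="/somwhere/filename_1.jpg">
// </p>
```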

In this simple case, you could just use string.Replace, for example:
string imgFilename = "/somewhere/image_(1).jpg";
imgFilename = imgFilename.Replace("(", "").Replace(")", "");
Or do you need a regex for replacing the complete tag inside a HTML string?

Regex.Replace(some_input, @"(?<=<\s*img\s*src\s*=\s*""[^""]*?)(?:\(|\))(?=[^""]*?""\s*\/?\s*?>)", "");
Finds ( or ) preceded by <img src =" and, optionally, text (with any whitespace combination, though I didn't include newline), and followed by optional text and "> or "/>, again with any whitespace combination, and replaces them with nothingness.

c# : parsing text from html

I have an string input-buffer that contains html.
That html contains a lot of text, including some stuff I want to parse.
What I'm actually looking for are the lines like this : "< strong>Filename< /strong>: yadayada.thisandthat.doc< /p>"
(Although the position and amount of whitespace / colons is variable)
What's the best way to get all the filenames into a List< string> ?

Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.
Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc), and trawl through the html looking for instances of these extensions. When you find one, track back to the next whitespace character and that's your filename.
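The whitelist idea might be sketched like this. The extension list and the helper name are assumptions, and so is one deviation from the description: besides whitespace, the token is also cut at <, >, : and ", so that a file name directly followed by a tag is not swallowed together with it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical whitelist of extensions to look for.
string[] extensions = { "doc", "pdf", "xls" };

// Collects every token that ends in one of the whitelisted extensions.
List<string> FindFileNames(string html)
{
    string pattern = @"[^\s<>:""]+\.(?:" + string.Join("|", extensions) + @")\b";
    var names = new List<string>();
    foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
    {
        names.Add(m.Value);
    }
    return names;
}

foreach (string name in FindFileNames("<p><strong>Filename</strong>: yadayada.thisandthat.doc</p>"))
{
    Console.WriteLine(name); // prints yadayada.thisandthat.doc
}
```

File names containing spaces would be truncated by this approach, which is one reason it is only a heuristic.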
Hope this helps.

You have a couple of options. You can use regular expressions, it could be something like Filename: (.*?)< /p> , but it will need to be much more complex. You would need to look at more of the text file to write a proper one. This could work depending on the structure of all your text, if there is always a certain tag after a filename for example.
If it is valid HTML you can also use an HTML parser like HTML Agility Pack to go through the HTML and pull out text from certain tags, then use a regex to separate out the path.

I'm not sure a regular expression is the best way to do this, traversing the HTML tree is probably more sensible, but the following regex should do it:
<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>
As you can see, I've been extremely tolerant of whitespace, as well as tolerant on the content of the filename. Also, multiple (or no) colons are permitted.
The C# to build a List (off the top of my head):
List<String> fileNames = new List<String>();
Regex regexObj = new Regex(@"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // Groups[1] is the captured filename; Groups[0] would be the whole match, tags included.
    fileNames.Add(matchResults.Groups[1].Value.Trim());
    matchResults = matchResults.NextMatch();
}
