Regex formula assistance - c#

I'm trying to find a regex formula for these HTML nodes:
first: Need the inner html value
<span class="profile fn">Any Name Here</span>
second: need the title value
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
third: need the inner html value
<div class="msgbody">
Some message here. might contain any character.
</div>
I'm fairly new to regex, and was hoping someone could offer me some guidance with this. I'll be using it with C# if that makes a difference.
Edit:
The HTML I'd be pulling this out of would look like this:
<div class="message">
<div class="from"><span class="profile fn">Name</span></div>
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
<div class="msgbody">
Some message
</div>
</div>

A lot of people are quite dismissive of using regular expressions to deal with HTML; however, I believe that if your HTML is assuredly regular and well formatted then you can use Regex successfully.
If you can't guarantee that, then I urge you to check out the HTML Agility Pack; it's a library for parsing HTML in C# and works very well.
I'm not on my PC, but I'll edit my answer with a suggested regex for your examples, give you something to try at least.
For this one:
<span class="profile fn">Any Name Here</span>
Try
"<span.*?>(?<span>.*?)</span>"
Then you can access this via the Groups["span"] indexer on the Match returned by your regex.
For the Abbr tag:
<abbr class="time published" title="2012-08-11T07:02:50+0000">...snip...</abbr>
It's similar
"<abbr.*?title=\"(?<title>.*?)\".*?>"
And lastly for the div:
<div class="msgbody">
Some message here. might contain any character.
</div>
Is:
"<div.*?>(?<div>.*?)</div>"
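For this one, you will need to set the Singleline regex option (RegexOptions.Singleline) so that . also matches newlines; the Multiline option only changes how ^ and $ behave.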
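As a minimal sketch of the option that matters for the multi-line div: in .NET, . does not match \n unless RegexOptions.Singleline is set. The class and method names below are hypothetical, and the pattern is narrowed to the msgbody class from the question.

using System.Text.RegularExpressions;

public static class MsgBody
{
    // Hypothetical helper: returns the trimmed inner text of the first
    // <div class="msgbody">...</div>, or null when nothing matches.
    public static string Extract(string html)
    {
        Match m = Regex.Match(
            html,
            @"<div class=""msgbody"">(?<div>.*?)</div>",
            RegexOptions.Singleline); // lets . cross line breaks
        return m.Success ? m.Groups["div"].Value.Trim() : null;
    }
}

Without RegexOptions.Singleline the same pattern finds no match on the sample div, because the message text sits on its own line.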
The key point is the .*? operator.
Adding the question mark turns a greedy match into a lazy (non-greedy) match: the engine stops at the first place where the rest of the pattern can match, rather than consuming as much as possible and then backtracking; this is incredibly important for matching in HTML, where you will have many chevrons closing tags.
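To make the difference between .* and .*? concrete, here is a small sketch; the helper name is made up for illustration, and the two-span input is an assumed example rather than the asker's HTML.

using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Hypothetical helper: value of the first capture group, or null if no match.
    public static string FirstCapture(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}

With the input "<span>one</span><span>two</span>", the pattern "<span>(.*)</span>" captures "one</span><span>two" because .* runs to the last closing tag, while "<span>(.*?)</span>" captures just "one".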
The big problem you'll get though is, what happens if the inner text or an attribute has an '<' or an '"' character in it?
It's very hard to make Regex match only balanced <>'s, and it can't easily ignore ones that sit between quotes; this is why the Agility Pack is often preferred.
Hope this helps anyway!
Edit:
How to use named capture groups
The syntax (?<name>...) tells the Regex engine to capture whatever the pattern between the brackets matches into a named group, whose value can then be read from the match object.
So for this HTML
<span>TEST</span>
You'd use this code:
string HTML = "<span>TEST</span>";
Regex r = new Regex("<span>(?<s>.*?)</span>");
var match = r.Match(HTML);
string stuff = match.Groups["s"].Value;
//stuff should = "TEST"
If you think you'll have multiple captures then you'd use a variant of this overload:
foreach (Match m in r.Matches(HTML))
{
string stuff = m.Groups["s"].Value;
}
This should yield the answer you need.
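Putting the three pieces together, a rough sketch that pulls all three values out of the message block from the question's edit in one match might look like this. The class, method, and group names are assumptions for illustration, and it only holds up if the markup keeps exactly this shape and attribute order.

using System.Text.RegularExpressions;

public static class MessageFields
{
    // Singleline lets .*? skip over the line breaks between the elements.
    private static readonly Regex Pattern = new Regex(
        @"<span class=""profile fn"">(?<name>.*?)</span>.*?" +
        @"<abbr[^>]*\btitle=""(?<time>[^""]*)""[^>]*>.*?" +
        @"<div class=""msgbody"">(?<msg>.*?)</div>",
        RegexOptions.Singleline);

    // Hypothetical helper: returns { name, time, message } or null if no match.
    public static string[] Parse(string html)
    {
        Match m = Pattern.Match(html);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["name"].Value.Trim(),
            m.Groups["time"].Value,
            m.Groups["msg"].Value.Trim()
        };
    }
}

To process a page with many messages, the same pattern can be passed through Matches instead of Match, as in the loop above.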

If your html is always the same you can use this ugly pattern:
"profile fn"[^>]*>(?<name>[^<]+)(?:[^t]+|t(?!itle=))+title="(?<time>[^"]+)(?:[^m]+|m(?!sgbody"))+msgbody">\s*(?<msg>(?:[^<\s]+|(?>\s+)(?!<))+)
results are in m.Groups["name"], m.Groups["time"], m.Groups["msg"]

Related

Regex to extract pure text within specific HTML tag [duplicate]

In this case, I am supposed to only use a single regex match.
See the following HTML code:
<html>
<body>
<p>This is some <strong>strong</strong> text</p>
</body>
</html>
I want to make a regex that can return This is some strong text. In this case, the text inside the <p> tag.
Overall, it should:
Match only text between two HTML tags.
Exclude HTML tags within the two tags, but keep the text inside those tags.
So far I know:
<p>(.*)<\/p> Will match the region from <p> to </p>
<[^>]*> Will match any HTML tag
The hard part for me is how to combine the two (maybe there is an even better way of doing it).
How would you write such regex?
How real software engineers solve this problem: Use the right tool for the right job, i.e. don't use regexes to parse HTML
The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:
1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)
2) using a simple Regex.Replace to strip out all tag content
let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))
(This is some strong text)
(This is some reallystrong text)
If you are constrained to a single "Regex.Matches" call, and you're okay with ignoring the possibility of nested <p> tags (as luck would have it, in conformant HTML you can't nest ps but this solution wouldn't work for a containing element like <div>) you should be able to do it with a nongreedy matching of a text part and a tag part wrapped up inside a <p>...</p> pattern. (Note 1: this is F#, but it should be trivial to convert to C#) (Note 2: This relies on .NET-flavored regex-isms like stackable group names and multiple captures per group)
let rx = @"
<p>
(?<p_text>
(?:
(?<text>[^<>]+)
(?:<.*?>)+
)*?
(?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture
p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text
Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.
Following @Jimmy's answer, and going with the title of the post on how to "extract" the text, I thought I would include the C# code for the Regex.Replace.
This bit of code should work to extract the text:
string HTML = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";
Regex Reg = new Regex("<[^>]*>");
String parsedText = Reg.Replace(HTML, "").Trim();
MessageBox.Show(parsedText);
Obviously this does not match between the two tags exclusively (it would grab anything outside the paragraph tags as well), but I would suggest that the replace function is the best option in making only ONE match.
If you need to get only the content between the two tags, I think you would need to do that in two expressions, as @Jimmy suggested.
I would be very curious to see if anyone could get it all in one expression, but I'm guessing this is what they are looking for at your school.
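For reference, the two-expression approach could be sketched in C# roughly as follows: one regex isolates each <p>...</p>, and a second strips the tags inside it. The class and method names are hypothetical, and nested or malformed markup is not handled.

using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphText
{
    // Hypothetical helper: plain text of every <p>...</p> in the input.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        // Step 1: capture the inner HTML of each paragraph (lazy, may span lines).
        foreach (Match m in Regex.Matches(html, @"<p>(.*?)</p>", RegexOptions.Singleline))
        {
            // Step 2: remove any tags inside the paragraph, keeping their text.
            results.Add(Regex.Replace(m.Groups[1].Value, @"<[^>]*>", "").Trim());
        }
        return results;
    }
}

Unlike the single Replace over the whole document, this ignores text outside the paragraph tags.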

Remove attributes with whitelist

I need to remove attributes from a string with tags.
Here is the C# code:
strContent = Regex.Replace(strContent, @"<(\w+)[^>]*(?<=( ?/?))>", "<$1$2>",
RegexOptions.IgnoreCase);
For example, this code will replace
This is some <div id="div1" class="cls1">content</div>. This is some more <span
id="span1" class="cls1">content</span>. This is <input type="readonly" id="input1"
value="further content"></input>.
with
This is some <div>content</div>. This is some more <span>content</span>. This is
<input></input>.
But I need a "whitelist" when removing the attributes. In the above example, I want that "input" tag attributes must not be removed. So I want the output as:
This is some <div>content</div>. This is some more <span>content</span>. This is
<input type="readonly" id="input1" value="further content"></input>.
Appreciate your help on this.
For your example you could use:
(<(?!input)[^\s>]+)[^>]*(>)
Replace with $1$2.
I'm not sure how you plan to specify the whitelist though. If you can hardcode it then you can easily add more (?!whitelistTag) to the above, which could be done programmatically pretty easily from an array too.
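Building the lookaheads from an array might look something like the sketch below. The class and method names are made up for illustration, and one deliberate deviation from the pattern above is a \b after each tag name, so that whitelisting "input" does not also exempt a longer tag name that merely starts with "input".

using System.Linq;
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Hypothetical helper: removes attributes from every tag except the
    // whitelisted ones, which are left untouched.
    public static string Strip(string html, params string[] whitelist)
    {
        // One negative lookahead per whitelisted tag, e.g. (?!input\b)
        string lookaheads = string.Concat(
            whitelist.Select(tag => "(?!" + Regex.Escape(tag) + @"\b)"));
        string pattern = "(<" + lookaheads + @"[^\s>]+)[^>]*(>)";
        return Regex.Replace(html, pattern, "$1$2", RegexOptions.IgnoreCase);
    }
}

Calling AttributeStripper.Strip(strContent, "input") reproduces the hardcoded case. Note that, like the original pattern, this also drops the trailing slash of self-closing tags such as <br />.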
Working on RegExr
In response to the usual You should not parse HTML with regex, you can rephrase the problem as:
This is a "quoted string", cull each "quoted string to its" first word unless the "string starts with" the word "string, like these last two".
Would you claim that regex shouldn't be used to solve that problem? Because it's exactly the same problem. Of course an HTML parser can be used for the job, but it hardly invalidates the idea of using regex for the same thing.

Find and replace - should I use Regex?

I need to create a simple markup fix and I already did everything that I need, like bold and italic etc. But this is a bit harder than what I've done so far and I have no idea how to do this. Basically my input is very simple:
[imgGroup="group1"]
image1.jpg
[/imgGroup]
As you can see I pass a param that is group1 and inside I have image1. I need to convert this into a link that has this image inside and group in rel tag like so:
<a href="image1.jpg" rel="group1" >
<img src="image1.jpg" />
</a>
I think that I will need to use Regex for this problem, however I only know how to find something in between 2 tags, not so much for this problem... I'm using ASP.NET MVC3 with C#.
You could use named groups in Regex to match, then you can just re-assemble them in the order you want:
var regex = new Regex(@"(?<first>\d\s)(?<second>[a-z])"); // Set up your regex with named groups (the names are illustrative)
var result = regex.Replace("inputstring", "${second} ${first}"); // Replace each match with the captured groups, in the new order
Be warned though, RegEx with things like Urls and HTML parsing can very, very quickly turn into a horror beyond your wildest dreams ;)
Good luck!
Named groups for RegEx in dot net
Here is my suggestion for you:
var regex = new Regex(@"\[imgGroup=""" + "group1" + @"""\]\s*(?<Content>\S*)\s*\[/imgGroup\]");
var newValue = regex.Replace(oldValue, @"<a href=""$1"" rel=""group1"" ><img src=""$1"" /> </a>");
That should do what you've expected.
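If the group name should not be hardcoded, a variation along these lines captures it as well. This is only a sketch: the class, method, and group names are assumptions, and the captured values are inserted as-is without HTML-encoding.

using System.Text.RegularExpressions;

public static class ImgGroupMarkup
{
    // Matches [imgGroup="name"] file [/imgGroup], with optional whitespace
    // or line breaks around the file name.
    private static readonly Regex Tag = new Regex(
        @"\[imgGroup=""(?<group>[^""]*)""\]\s*(?<file>\S+?)\s*\[/imgGroup\]");

    // Hypothetical helper: rewrites every imgGroup block into a linked image.
    public static string ToHtml(string input)
    {
        return Tag.Replace(
            input,
            @"<a href=""${file}"" rel=""${group}""><img src=""${file}"" /></a>");
    }
}

The ${name} form in the replacement string refers to the named groups, which reads more clearly than $1 once there is more than one capture.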

Need a regular expression to get rid of parenthesis in html image tag filename

So say I have some html with an image tag like this:
<p> (1) some image is below:
<img src="/somwhere/filename_(1).jpg">
</p>
I want a regex that will just get rid of the parenthesis in the filename so my html will look like this:
<p> (1) some image is below:
<img src="/somwhere/filename_1.jpg">
</p>
Does anyone know how to do this? My programming language is C#, if that makes a difference...
I will be eternally grateful and send some very nice karma your way. :)
I suspect your job would be much easier if you used the HTML Agility Pack instead of regexes. Judging from the other answers, it will make parsing the HTML a lot easier for what you are trying to do.
Hope this helps,
Best regards,
Tom.
This (rather dense) regex should do it:
string s = Regex.Replace(input, #"(<img\s+[^>]*src=""[^""]*)\((\d+)\)([^""]*""[^>]*>)", "$1$2$3");
Nick's solution is fine if the file names always match that format, but this one matches any parenthesis, anywhere in the attribute:
s = Regex.Replace(input, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");
The lookbehind ensures that the match occurs inside the src attribute of an img tag. It assumes the attribute is enclosed in double-quotes (quotation marks); if you need to allow for single-quotes (apostrophes) or no quotes at all, the regex gets much more complicated. I'll post that if you need it.
In this simple case, you could just use string.Replace, for example:
string imgFilename = "/somewhere/image_(1).jpg";
imgFilename = imgFilename.Replace("(", "").Replace(")", "");
Or do you need a regex for replacing the complete tag inside a HTML string?
Regex.Replace(some_input, @"(?<=<\s*img\s*src\s*=\s*""[^""]*?)(?:\(|\))(?=[^""]*?""\s*\/?\s*?>)", "");
Finds ( or ) preceded by <img src =" and, optionally, text (with any whitespace combination, though I didn't include newline), and followed by optional text and "> or "/>, again with any whitespace combination, and replaces them with nothingness.
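An alternative that avoids the lookbehind is to match the whole src value and clean it in a MatchEvaluator. The sketch below uses made-up class, method, and group names, and it assumes the attribute is double-quoted.

using System.Text.RegularExpressions;

public static class ImgSrcCleaner
{
    // head: everything from <img up to and including the opening quote of src
    // url:  the attribute value; tail: the closing quote
    private static readonly Regex Src = new Regex(
        @"(?<head><img\b[^>]*?\bsrc\s*=\s*"")(?<url>[^""]*)(?<tail>"")",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: strips ( and ) from img src values only.
    public static string RemoveParens(string html)
    {
        return Src.Replace(html, m =>
            m.Groups["head"].Value
            + m.Groups["url"].Value.Replace("(", "").Replace(")", "")
            + m.Groups["tail"].Value);
    }
}

Parentheses elsewhere in the document, such as the "(1)" in the paragraph text, are left alone because only the captured src value is rewritten.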

What is the REGEX to match this pattern in a html document in C#?

I really can't work out how to best do this, I can do fairly simple regex expressions, but the more complex ones really stump me.
The following appears in specific HTML documents:
<span id="label">
<span>
Joe Bloggs
now using
</span>
<span>
'
Important Data
'
</span>
<span>
on
Important data 2
</span>
</span>
I need to extract the two 'important data' points and could spend hours working out the regex to do it.(I'm using the .net Regex Library in C# 3.5)
As often stated before, regular expressions are usually not the right tool for parsing HTML, XML, and friends; think about using HTML or XML parsing libraries. If you really want to or have to use regular expressions, the following will match the content of the tags in many cases, but might still fail in some cases.
(?<data>[^<]*)
This expression will match all links not starting with http://, which is the only obvious difference I can see between the links.
(?<data>[^<]*)
The below uses HtmlAgilityPack. It prints any text within a second-or-later link within the "label" id. Of course, it's relatively simple to modify the XPath to do something a little different.
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(@"<span id=""label"">
<span>
Joe Bloggs
now using
</span>
<span>
'
Important Data
'
</span>
<span>
on
Important data 2
</span>
</span>
"));
HtmlNode root = doc.DocumentNode;
HtmlNodeCollection anchors;
anchors = root.SelectNodes("//span[@id='label']/span[position()>=2]/a/text()");
IList<string> importantStrings;
if(anchors != null)
{
importantStrings = new List<string>(anchors.Count);
foreach(HtmlNode anchor in anchors)
importantStrings.Add(((HtmlTextNode)anchor).Text);
}
else
importantStrings = new List<string>(0);
foreach(string s in importantStrings)
Console.WriteLine(s);
Look up look-behind and look-ahead syntax for .NET and use that to look for the anchor tags in the HTML. This site may help you. As an alternative to regular expressions, you might consider using a System.Xml.XPath.XPathNavigator to address those nodes directly.
My Regex is a little rusty but something along the lines of the following may help (although it will probably need some fine-tuning):
(?<=\<a href="/variableLink[/]?"\>)(.*)+(?=</a>)
<a\shref.*?"/variableLink/?">(.*)</a>
First group contains the Name of the anchors. Tested with Expresso. Works on the sample text you've provided.
Update: works with Snippy too.
Regex regex = new Regex(@"<a\shref.*?""/variableLink/?"">(.*)</a>", RegexOptions.Multiline);
foreach (Match everyMatch in regex.Matches(sText))
{
Console.WriteLine("{0}", everyMatch.Groups[1]);
}
Outputs:
Important Data
Important data 2
