Remove attributes with whitelist

I need to remove attributes from a string with tags.
Here is the C# code:
strContent = Regex.Replace(strContent, @"<(\w+)[^>]*(?<=( ?/?))>", "<$1$2>",
    RegexOptions.IgnoreCase);
For example, this code will replace
This is some <div id="div1" class="cls1">content</div>. This is some more <span
id="span1" class="cls1">content</span>. This is <input type="readonly" id="input1"
value="further content"></input>.
with
This is some <div>content</div>. This is some more <span>content</span>. This is
<input></input>.
But I need a "whitelist" when removing the attributes. In the above example, I want the attributes of the "input" tag to be kept. So the output should be:
This is some <div>content</div>. This is some more <span>content</span>. This is
<input type="readonly" id="input1" value="further content"></input>.
Appreciate your help on this.

For your example you could use:
(<(?!input)[^\s>]+)[^>]*(>)
Replace with $1$2.
I'm not sure how you plan to specify the whitelist though. If you can hardcode it, you can easily add more (?!whitelistTag) lookaheads to the above, which could also be done programmatically from an array pretty easily.
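As a rough sketch of the "from an array" idea, the lookahead can be generated from a list of tag names. The helper name StripAttributes and the keepTags parameter are made up for illustration, and the \b after the alternation is my addition so that whitelisting input does not also spare a tag whose name merely starts with "input":

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: strips attributes from every tag except the whitelisted ones.
static string StripAttributes(string html, string[] keepTags)
{
    // One negative lookahead covering all whitelisted names (empty when nothing is whitelisted).
    string guard = keepTags.Length == 0
        ? ""
        : "(?!(?:" + string.Join("|", keepTags.Select(Regex.Escape)) + @")\b)";
    // $1 = "<" plus the tag name, $2 = the closing ">"; everything in between is dropped.
    string pattern = "(<" + guard + @"[^\s>]+)[^>]*(>)";
    return Regex.Replace(html, pattern, "$1$2", RegexOptions.IgnoreCase);
}

string sample = "This is some <div id=\"div1\" class=\"cls1\">content</div>. " +
                "This is <input type=\"readonly\" id=\"input1\" value=\"further content\"></input>.";
Console.WriteLine(StripAttributes(sample, new[] { "input" }));
```

With { "input" } as the whitelist, the input tag keeps its attributes and the div is reduced to a bare <div>. Like the pattern above, this also drops the trailing slash of a self-closing tag.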
In response to the usual "You should not parse HTML with regex", you can rephrase the problem as:
This is a "quoted string", cull each "quoted string to its" first word unless the "string starts with" the word "string, like these last two".
Would you claim that regex shouldn't be used to solve that problem? Because it's exactly the same problem. Of course an HTML parser can be used for the job, but it hardly invalidates the idea of using regex for the same thing.

Related

Regex against markup after XPath?

I have been searching for a solution to this problem for a while and have been experimenting on regex101.com, but cannot find one.
I have to select a substring from a number of different inputs, so I wanted to use regular expressions to pull the wanted data out of these strings.
The regular expression will come from a configuration for each string separately (since they differ).
The string below is obtained with the XPath //body/div/table/tbody/tr/td/p[5], but I cannot dig any deeper with it to retrieve the right data. Or can I?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for { } loop around this regex to get the same output, but since this is the only regular expression that would need it (and I would then have to change the handling for all the other data), I was wondering whether it is possible to do this within a single regular expression.
Thank you for the support already so far.

Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
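A minimal C# sketch of that, assuming the fragment can be loaded as well-formed XML. The sample below is a trimmed copy of the question's markup with the br tags self-closed, which is my adjustment; real-world HTML would need an HTML parser that exposes XPath. Only the relative part of the XPath is used because the sample has no surrounding table:

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
// Assumed input: the <p> element from the question, made well-formed for LoadXml.
doc.LoadXml(
    "<p><strong>Kontaktdaten des Absenders:</strong><br/>" +
    "<strong>Name:</strong> Wanted data<br/>" +
    "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>");

// First text node that follows the <strong> whose content is exactly "Name:".
XmlNode node = doc.SelectSingleNode("/p/strong[.='Name:']/following-sibling::text()[1]");
string name = node?.Value.Trim();
Console.WriteLine(name);
```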

You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
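To see the difference, here is a small C# check against a flattened copy of the sample (whitespace between the tags is assumed to be single spaces, as in the output quoted in the question, and the a tag is shortened). The greedy .* runs on to the last " <br>", while [^<>]* cannot cross a tag:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<strong>Kontaktdaten des Absenders:</strong> <br> " +
               "<strong>Name:</strong> Wanted data <br> " +
               "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>";

// Original attempt: .* is greedy and happily matches across tags.
string greedy = Regex.Match(input, @"(?<=</strong> )(.*)(?= <br>)").Value;
// Suggested fix: the match may not contain '<' or '>'.
string bounded = Regex.Match(input, @"(?<=</strong> )([^<>]*)(?= <br>)").Value;

Console.WriteLine(greedy);
Console.WriteLine(bounded);
```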

Regex to extract pure text within specific HTML tag [duplicate]

In this case, I am supposed to only use a single regex match.
See the following HTML code:
<html>
<body>
<p>This is some <strong>strong</strong> text</p>
</body>
</html>
I want to make a regex that can return This is some strong text. In this case, the text inside the <p> tag.
Overall, it should:
Match only text between two HTML tags.
Exclude HTML tags within the two tags, but keep the text inside those tags.
So far I know:
<p>(.*)<\/p> Will match the region from <p> to </p>
<[^>]*> Will match any HTML tag
The hard part for me is how to combine the two (maybe there is an even better way of doing it).
How would you write such regex?

How real software engineers solve this problem: Use the right tool for the right job, i.e. don't use regexes to parse HTML
The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:
1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)
2) using a simple Regex.Replace to strip out all tag content
open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"
// Step 1: capture the body of each <p>; step 2: strip every tag from that body.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))
(This is some strong text)
(This is some reallystrong text)
If you are constrained to a single "Regex.Matches" call, and you're okay with ignoring the possibility of nested <p> tags (as luck would have it, in conformant HTML you can't nest ps but this solution wouldn't work for a containing element like <div>) you should be able to do it with a nongreedy matching of a text part and a tag part wrapped up inside a <p>...</p> pattern. (Note 1: this is F#, but it should be trivial to convert to C#) (Note 2: This relies on .NET-flavored regex-isms like stackable group names and multiple captures per group)
let rx = @"
<p>
(?<p_text>
  (?:
    (?<text>[^<>]+)
    (?:<.*?>)+
  )*?
  (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture
p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text
Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.
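Since the snippets above are F#, here is roughly how the two-step version might look in C#. The helper name ParagraphTexts is invented for this sketch, RegexOptions.Singleline is added so that a paragraph may span several lines, and the same caveats about nested or malformed markup apply:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the tag-free text of every <p>...</p> in the input.
static List<string> ParagraphTexts(string html)
{
    var texts = new List<string>();
    // Step 1: lazily capture the inner HTML of each paragraph.
    foreach (Match m in Regex.Matches(html, @"<p>(.*?)</p>", RegexOptions.Singleline))
    {
        // Step 2: remove anything that looks like a tag from the captured body.
        texts.Add(Regex.Replace(m.Groups[1].Value, "<[^>]*>", ""));
    }
    return texts;
}

string page = "<html>\n<body>\n<p>This is some <strong>strong</strong> text</p>\n</body>\n</html>";
foreach (string text in ParagraphTexts(page))
    Console.WriteLine(text);
```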

Following @Jimmy's answer, and going with the title of the post on how to "extract" the text, I thought I would include the C# code for the Regex.Replace.
This bit of code should work to extract the text:
string HTML = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";
Regex Reg = new Regex("<[^>]*>");
String parsedText = Reg.Replace(HTML, "").Trim();
MessageBox.Show(parsedText);
Obviously this does not match between the two tags exclusively (it would grab anything outside the paragraph tags as well), but I would suggest that the replace function is the best option in making only ONE match.
If you need to get only the content between the two tags, I think you would need to do that in two expressions, as @Jimmy suggested.
I would be very curious to see if anyone could get it all in one expression, but I'm guessing this is what they are looking for at your school.

Using regex to capture everything except a certain (possibly repeated) pattern

I am trying to capture all of a string minus any occurrences of <span class="notranslate">*any text*</span>. (I do NOT need to parse HTML or anything; I just need to ignore those whole sections. The tags must match exactly to be removed, because I want to keep other tags.) In a given string there would be at least one such tag, with no upper limit (though more than a couple would be uncommon).
My ultimate goal is to match two texts, one where there are variable names and one where the variable names have been replaced with their values (can't replace the variables myself, I don't have access to that db). These variables will always be surrounded by the span tags I mentioned. I know my tags say "notranslate" - but this is pretranslation, so all of the other text will be exactly the same.
For example, if these are my two input texts:
Dear <span class="notranslate">$customer</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">$article431</span> and let me know if
that fixes your problem.
Dear <span class="notranslate">John Doe</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">http://url.for.help/article</span> and
let me know if that fixes your problem.
I want the regex to return:
Dear , I am sorry that you are having trouble logging in. Please follow the instructions at this URL and let me know if that fixes your problem.
OR
Dear <span class="notranslate"></span>, I am sorry that you are having trouble logging in. Please follow the instructions at this URL <span class="notranslate"></span> and let me know if that fixes your problem.
For both of them, so I can easily do String.Equals() and find out if they are equal. (I will need to compare the input w/ variables against multiple texts where the variables have been replaced, to find the match)
I was easily able to come up with a regex that tells me whether a string has any "notranslate" sections in it: (<span class="notranslate">(.+?)</span>), which is how I decide whether I need to strip out sections before comparison. However, I'm having a lot of trouble with the (I thought very similar) task above.
I am using Expresso and regexstorm.net to test, and have played with many variations of (?:(.+?)(?:<span class=\"notranslate\">(?:.+?)</span>)), using ideas from other SO questions, but with all of them I get problems that I don't understand. For example, that one seems to almost work in Expresso, but it can't grab the end text after the last set of span tags; when I make the span tags optional or try to add another (.+?) at the end, it won't grab anything at all. I have tried using lookaheads, but then I still end up grabbing the tags and their inner text later.

This captures all of the text, matching each span section into a separate group that is then ignored.
// Requires: using System.Linq; using System.Text.RegularExpressions;
string data = "Dear <span class=\"notranslate\">$customer</span>, I am sorry that you\r\n are havin" +
    "g trouble logging in. Please follow the instructions at this\r\n URL <span class=" +
    "\"notranslate\">$article431</span> and let me know if\r\n that fixes your problem.";
// Words captures a run of plain text; Ignore optionally swallows the span section that follows it.
string pattern = @"(?<Words>[^<]+)(?<Ignore><[^>]+>[^>]+>)?";
// Join only the Words captures back together.
string result = Regex.Matches(data, pattern)
    .OfType<Match>()
    .Select(mt => mt.Groups["Words"].Value)
    .Aggregate((sentence, words) => sentence + words);
The result is a string that still contains the original carriage returns and line feeds from your example:
Dear , I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL and let me know if
that fixes your problem.
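If the only goal is the String.Equals() comparison, a plainer alternative to the capture-and-join above is to delete the exact span sections with Regex.Replace and compare what is left. This is a different technique from the answer's, sketched under the assumption that the opening tag is always exactly <span class="notranslate"> and that the spans are not nested; StripNoTranslate is a made-up helper name, and the two sample strings are shortened versions of the question's texts:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every notranslate span together with its content.
static string StripNoTranslate(string text) =>
    Regex.Replace(text, "<span class=\"notranslate\">.*?</span>", "", RegexOptions.Singleline);

string template = "Dear <span class=\"notranslate\">$customer</span>, please see " +
                  "<span class=\"notranslate\">$article431</span> and reply.";
string filled = "Dear <span class=\"notranslate\">John Doe</span>, please see " +
                "<span class=\"notranslate\">http://url.for.help/article</span> and reply.";

// Both texts reduce to the same skeleton, so they compare equal.
bool same = string.Equals(StripNoTranslate(template), StripNoTranslate(filled));
Console.WriteLine(same);
```

Other tags are left alone because the pattern only matches the literal notranslate opening tag.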

Regex formula assistance

I'm trying to find a regex formula for these HTML nodes:
first: Need the inner html value
<span class="profile fn">Any Name Here</span>
second: need the title value
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
third: need the inner html value
<div class="msgbody">
Some message here. might contain any character.
</div>
I'm fairly new to regex, and was hoping someone could offer me some guidance with this. I'll be using it with C# if that makes a difference.
Edit:
The HTML I'd be pulling this out of would look like this:
<div class="message">
<div class="from"><span class="profile fn">Name</span></div>
<abbr class="time published" title="2012-08-11T07:02:50+0000">August 10, 2012 at 5:02 pm</abbr>
<div class="msgbody">
Some message
</div>
</div>

A lot of people are quite dismissive of using regular expressions to deal with HTML. However, I believe that if your HTML is assuredly regular and well formatted, then you can use Regex successfully.
If you can't be sure of that, then I urge you to check out the HTML Agility Pack; it's a library for parsing HTML in C# and works very well.
I'm not on my PC, but I'll edit my answer with a suggested regex for your examples, give you something to try at least.
For this one:
<span class="profile fn">Any Name Here</span>
Try
"<span.*?>(?<span>.*?)</span>"
Then you can access this via the Groups["span"] indexer of the Match in your regex result.
For the Abbr tag:
<abbr class="time published" title="2012-08-11T07:02:50+0000">...snip...</abbr>
It's similar
"<abbr.*?title=\"(?<title>.*?)\".*?>"
And lastly for the div:
<div class="msgbody">
Some message here. might contain any character.
</div>
Is:
"<div.*?>(?<div>.*?)</div>"
For this one, you will need to set the Singleline regex option so that . also matches the line breaks inside the div (Multiline only changes how ^ and $ behave).
The key point is the .*? operator.
Adding the question mark turns a greedy quantifier into a lazy (non-greedy) one: instead of consuming as much as possible and then backtracking, the engine takes as little as possible and extends the match one character at a time until the rest of the pattern matches. This is incredibly important for matching in HTML, where there will be many chevrons closing tags.
The big problem you'll get though is, what happens if the inner text or an attribute has a '<' or a '"' character in it?
It's very hard to make Regex match only balanced <>'s, and it can't easily skip the ones that sit between quotes; this is why the Agility Pack is often preferred.
Hope this helps anyway!
Edit:
How to use named capture groups
The syntax (?<name>..selector..) tells the Regex engine to store whatever is matched between the brackets under that name, so the value can be read back from the match object.
So for this HTML
<span>TEST</span>
You'd use this code:
string HTML = "<span>TEST</span>";
Regex r = new Regex("<span>(?<s>.*?)</span>");
var match = r.Match(HTML);
string stuff = match.Groups["s"].Value;
//stuff should = "TEST"
If you think you'll have multiple captures then you'd use a variant of this overload:
foreach (Match m in r.Matches(HTML))
{
string stuff = m.Groups["s"].Value;
}
This should yield the answer you need.
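Putting the three ideas together against the sample HTML from the question might look like the sketch below. The patterns are tightened to the specific class names (my choice, so that the outer message div is not picked up by a bare <div.*?>), and RegexOptions.Singleline is used on the msgbody pattern because . does not match line breaks without it:

```csharp
using System;
using System.Text.RegularExpressions;

string html = @"<div class=""message"">
    <div class=""from""><span class=""profile fn"">Name</span></div>
    <abbr class=""time published"" title=""2012-08-11T07:02:50+0000"">August 10, 2012 at 5:02 pm</abbr>
    <div class=""msgbody"">
        Some message
    </div>
</div>";

// Inner text of the profile span.
string name = Regex.Match(html, @"<span class=""profile fn"">(?<name>.*?)</span>")
    .Groups["name"].Value;
// Value of the title attribute on the abbr tag.
string time = Regex.Match(html, @"<abbr[^>]*\btitle=""(?<title>[^""]*)""")
    .Groups["title"].Value;
// Inner text of the msgbody div; Singleline lets . cross the line breaks.
string msg = Regex.Match(html, @"<div class=""msgbody"">(?<msg>.*?)</div>", RegexOptions.Singleline)
    .Groups["msg"].Value.Trim();

Console.WriteLine(name);
Console.WriteLine(time);
Console.WriteLine(msg);
```

This only holds up while the markup stays as regular as the sample; a nested div inside msgbody would already cut the message short.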

If your html is always the same you can use this ugly pattern:
"profile fn"[^>]*>(?<name>[^<]+)(?:[^t]+|t(?!itle=))+title="(?<time>[^"]+)(?:[^m]+|m(?!sgbody"))+msgbody">\s*(?<msg>(?:[^<\s]+|(?>\s+)(?!<))+)
results are in m.Groups["name"], m.Groups["time"], m.Groups["msg"]

Need a regular expression to get rid of parenthesis in html image tag filename

So say I have some html with an image tag like this:
<p> (1) some image is below:
<img src="/somwhere/filename_(1).jpg">
</p>
I want a regex that will just get rid of the parenthesis in the filename so my html will look like this:
<p> (1) some image is below:
<img src="/somwhere/filename_1.jpg">
</p>
Does anyone know how to do this? My programming language is C#, if that makes a difference...
I will be eternally grateful and send some very nice karma your way. :)

Judging from the answers here, I suspect your job would be much easier if you used the HTML Agility Pack instead of regexes; it would make parsing the HTML, and what you are trying to achieve with it, a lot easier.
Hope this helps,
Best regards,
Tom.

This (rather dense) regex should do it:
string s = Regex.Replace(input, @"(<img\s+[^>]*src=""[^""]*)\((\d+)\)([^""]*""[^>]*>)", "$1$2$3");

Nick's solution is fine if the file names always match that format, but this one matches any parenthesis, anywhere in the attribute:
s = Regex.Replace(input, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");
The lookbehind ensures that the match occurs inside the src attribute of an img tag. It assumes the attribute is enclosed in double-quotes (quotation marks); if you need to allow for single-quotes (apostrophes) or no quotes at all, the regex gets much more complicated. I'll post that if you need it.
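A quick check of the lookbehind version against the markup from the question, as a sketch (StripSrcParens is a made-up wrapper name). It relies on .NET allowing variable-length lookbehind, which many other regex flavors do not:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper: removes ( and ) only when they sit inside a double-quoted src of an img tag.
static string StripSrcParens(string html) =>
    Regex.Replace(html, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");

string before = "<p> (1) some image is below:\n<img src=\"/somwhere/filename_(1).jpg\">\n</p>";
string after = StripSrcParens(before);
Console.WriteLine(after);
```

The "(1)" in the paragraph text is left alone because no <img ... src=" precedes it.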

In this simple case, you could just use string.Replace, for example:
string imgFilename = "/somewhere/image_(1).jpg";
imgFilename = imgFilename.Replace("(", "").Replace(")", "");
Or do you need a regex for replacing the complete tag inside a HTML string?

Regex.Replace(some_input, @"(?<=<\s*img\s*src\s*=\s*""[^""]*?)(?:\(|\))(?=[^""]*?""\s*\/?\s*?>)", "");
Finds ( or ) preceded by <img src =" and, optionally, text (with any whitespace combination, though I didn't include newline), and followed by optional text and "> or "/>, again with any whitespace combination, and replaces them with nothingness.
