Using regex to capture everything except a certain (possibly repeated) pattern


I am trying to capture all of a string minus any occurrences of <span class="notranslate">*any text*</span>. (I do NOT need to parse HTML or anything, I just need to ignore those whole sections. The tags must match exactly to be removed, because I want to keep other tags.) A given string will contain at least one such tag, with no upper limit (though more than a couple would be uncommon).
My ultimate goal is to match two texts, one where there are variable names and one where the variable names have been replaced with their values (can't replace the variables myself, I don't have access to that db). These variables will always be surrounded by the span tags I mentioned. I know my tags say "notranslate" - but this is pretranslation, so all of the other text will be exactly the same.
For example, if these are my two input texts:
Dear <span class="notranslate">$customer</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">$article431</span> and let me know if
that fixes your problem.
Dear <span class="notranslate">John Doe</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">http://url.for.help/article</span> and
let me know if that fixes your problem.
I want the regex to return:
Dear , I am sorry that you are having trouble logging in. Please follow the instructions at this URL and let me know if that fixes your problem.
OR
Dear <span class="notranslate"></span>, I am sorry that you are having trouble logging in. Please follow the instructions at this URL <span class="notranslate"></span> and let me know if that fixes your problem.
For both of them, so I can easily do String.Equals() and find out if they are equal. (I will need to compare the input w/ variables against multiple texts where the variables have been replaced, to find the match)
I was easily able to come up with a regex that tells me whether a string has any "notranslate" sections in it: (<span class="notranslate">(.+?)</span>), which is how I decide whether I need to strip out sections before comparison. However, I'm having a lot of trouble with the (I thought very similar) task above.
I am using Expresso and regexstorm.net to test, and have played with many variations of (?:(.+?)(?:<span class=\"notranslate\">(?:.+?)</span>)), using ideas from other SO questions, but with all of them I get problems that I don't understand. For example, that one seems to almost work in Expresso, but it can't grab the end text after the last set of span tags; when I make the span tags optional or try to add another (.+?) at the end, it won't grab anything at all. I have tried using lookaheads, but then I still end up grabbing the tags+internal text later.

This captures the plain text in one named group and matches each tag section (tags plus the text between them) in a second group, which is then ignored.
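// Requires: using System.Linq; using System.Text.RegularExpressions;
string data = "Dear <span class=\"notranslate\">$customer</span>, I am sorry that you\r\n are havin" +
"g trouble logging in. Please follow the instructions at this\r\n URL <span class=" +
"\"notranslate\">$article431</span> and let me know if\r\n that fixes your problem.";
// Words captures plain text; Ignore optionally consumes a following tag pair and its content.
string pattern = @"(?<Words>[^<]+)(?<Ignore><[^>]+>[^>]+>)?";
// Concatenate only the Words captures, dropping every Ignore section.
string result = Regex.Matches(data, pattern)
.OfType<Match>()
.Select(mt => mt.Groups["Words"].Value)
.Aggregate((sentence, words) => sentence + words);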
The result is a string that still contains the original carriage returns and line feeds from your example:
Dear , I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL and let me know if
that fixes your problem.
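Since the end goal is a String.Equals() comparison, another option is to delete the exact notranslate sections with Regex.Replace and normalize whitespace before comparing. The following is only a minimal sketch of that idea, not the method used in the answer above; the helper name StripNoTranslate and the shortened sample texts are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes every exact <span class="notranslate">...</span> section,
// then collapses whitespace runs so that line wrapping does not affect the comparison.
static string StripNoTranslate(string text)
{
    string withoutSpans = Regex.Replace(
        text,
        @"<span class=""notranslate"">.*?</span>",
        string.Empty,
        RegexOptions.Singleline);
    return Regex.Replace(withoutSpans, @"\s+", " ").Trim();
}

// Shortened, made-up variants of the two texts from the question.
string template = "Dear <span class=\"notranslate\">$customer</span>, I am sorry.\r\n" +
                  "See URL <span class=\"notranslate\">$article431</span> for help.";
string filled = "Dear <span class=\"notranslate\">John Doe</span>, I am sorry.\r\n" +
                "See URL <span class=\"notranslate\">http://url.for.help/article</span> for help.";

string strippedTemplate = StripNoTranslate(template);
bool sameText = strippedTemplate == StripNoTranslate(filled);

Console.WriteLine(strippedTemplate);
Console.WriteLine(sameText);
```

The lazy .*? stops at the first closing </span>, so a span nested inside a notranslate section would not be handled; the sketch assumes the sections are flat, as in the question's examples.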

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.

You can capture text along the string (.+ style) and follow it with a lookahead for what has been captured up to that point, i.e. an immediate repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
q(It just wasn't able just wasn't able no matter how hard it tried.),
q(This has no repetitions.),
q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
if (/$re_rep/) {
# Other conditions/filtering on $1 (the capture) ?
say $1
}
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions would then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands, and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word, even "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, which requires knowing more about the data.
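Because the question asks for a .NET regex, here is a rough C# sketch of the same pattern. It assumes that replacing each match with an empty string is an acceptable way to drop the first of the two copies; the variable names are illustrative, and the caveats from the note above apply unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as the Perl example: at least two words, immediately followed by a copy of itself.
var repeated = new Regex(@"(\w+\W+\w+.+)(?=\1)");

string[] lines =
{
    "It just wasn't able just wasn't able no matter how hard it tried.",
    "This has no repetitions.",
    @"{FolderLoc = ""C:\testC:\test""}",
};

// The lookahead is not part of the match, so replacing the match with nothing
// deletes the first copy and leaves the second one in place.
string[] cleaned = Array.ConvertAll(lines, line => repeated.Replace(line, string.Empty));

foreach (string line in cleaned)
    Console.WriteLine(line);
```

In a tool that takes only a pattern and a replacement, the same pattern with an empty replacement should behave the same way, provided the tool uses the .NET regex engine as stated in the question.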

Regex against markup after XPath?

I have been searching for a solution to my problem for a while now, and have been playing around on regex101.com, but cannot find one.
The problem I am facing is that I have to select a string from different inputs, so I wanted to use regular expressions to get the wanted data from these strings.
The regular expression will come from a configuration, separately for each string (since they differ).
The string below is obtained with an XPath: //body/div/table/tbody/tr/td/p[5] but I cannot dig any lower with it to retrieve the right data, or can I?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for { } loop around this regex to gain the same output, but since this is the only regular expression I have to do this for (but means I have to change it for all the other data) I was wondering if it is possible to do this in a regular expression.
Thank you for the support already so far.

Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
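As a rough illustration of that XPath in C#, the sketch below uses XmlDocument from System.Xml. It assumes the fragment has been made well-formed XML (<br/> instead of <br>) and wrapped in a single <p> root, which real-world HTML often is not; in that case an HTML-aware parser would be needed in place of XmlDocument. The shortened sample markup is made up for the example.

```csharp
using System;
using System.Xml;

// Assumption: a well-formed version of the fragment from the question, with <p> as the root.
string fragment =
    "<p><strong>Kontaktdaten des Absenders:</strong><br/>" +
    "<strong>Name:</strong> Wanted data<br/>" +
    "<strong>Telefon:</strong><a href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>";

var doc = new XmlDocument();
doc.LoadXml(fragment);

// First text node that follows the <strong> element whose string value is exactly "Name:".
XmlNode nameNode = doc.SelectSingleNode("/p/strong[.='Name:']/following-sibling::text()[1]");
string name = nameNode?.Value?.Trim() ?? string.Empty;

Console.WriteLine(name);
```

The Trim() call removes the space that separates the label from the value in the markup.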

You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
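To see how this pattern behaves from C#, here is a small sketch. It assumes the markup arrives as a single line with spaces between the elements (as in the output quoted in the question) and uses a shortened, made-up version of the sample string.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: elements are separated by single spaces, not line breaks.
string markup =
    "<strong>Kontaktdaten des Absenders:</strong> <br> " +
    "<strong>Name:</strong> Wanted data <br> " +
    "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>";

// [^<>]* cannot cross a tag boundary, so the match stays inside one text run.
Match labelMatch = Regex.Match(markup, @"(?<=<\/strong> )([^<>]*)(?= <br>)");
string captured = labelMatch.Success ? labelMatch.Groups[1].Value : string.Empty;

Console.WriteLine(captured);
```

The pattern depends on the exact whitespace around the tags; if the source uses line breaks or no space before <br>, the lookarounds would need to be loosened accordingly.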

C# Regex Replace ignore specific string

Since this is my first question here on Stack Overflow, I hope my question is correctly asked.
Basically I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pros will see without executing it.
Now I hope that some regex pro could help me out here and provide me with a search pattern which replaces the normal text but ignores the timestamp.
I'm not really aware of the logic behind regex so far; nevertheless I understand the single expressions, so this would be a hook for me to understand regex more properly.
Thanks in advance.

If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
What your current pattern does is replace a run, of any length, of characters that are NOT one of < + . * > or a character contained in input, with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to @stema
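A small sketch of the lookbehind in action follows. It adds Regex.Escape around the search term, which is my own addition so that user input containing regex metacharacters is treated literally; the sample content and variable names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string fileContent =
    "cat without owner\n" +
    "<!-- Text added at 6/29/2011 9:20:38 AM -->\n" +
    "Some additional Text\n";

string searchTerm = "Text";   // user input; escaped below so metacharacters are literal
string replacement = "Words";

// The lookbehind rejects any occurrence that has "<!--" earlier on the same line
// (. does not match \n without RegexOptions.Singleline).
string commentSafePattern = @"(?<!<!--.*?)" + Regex.Escape(searchTerm);
string updated = Regex.Replace(fileContent, commentSafePattern, replacement);

Console.Write(updated);
```

Two limitations of this sketch: text that follows a closing --> on the same line is also skipped, and a $ in the replacement string is interpreted as a substitution token by Regex.Replace unless it is doubled.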

Wikilinks - turn the text [[a]] into an internal link

I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, what expression would do this? Is there a library out there somewhere that already does this in C#?

On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure that only legitimate wiki link texts are captured, without closing square brackets or newline characters.
([^\] ]\S*) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am not sure if there are C# libraries already implementing this kind of detection.
However, to make that kind of detection really foolproof with regexp, you should use the pushdown automaton present in the C# regexp engine, as illustrated here.

I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with <a href="\2">\1</a>\3
match \[\[(.+?)\]\](\S+) and replace with <a href="\1">\1</a>\2
Or something like that, anyway.

Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)", @"<a href=""$1"">$2</a>$3");
html = Regex.Replace(html, @"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)", @"<a href=""$1"">$1</a>$2");
The only change to the actual regex is I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.
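One thing to watch with the patterns above: the trailing group ([^\] ]\S*) needs at least one character after the closing ]], so a link at the very end of the text, or one followed by a space, is not matched. The sketch below is a simplified variant that drops that trailing group; it assumes the intended output is an HTML anchor with the first part as the href and the second part as the label, and it does no HTML encoding of either part.

```csharp
using System;
using System.Text.RegularExpressions;

string wikiText = "See [[page2.html|the second page]] and [[index.html]]";

// Piped links first: [[target|label]] becomes <a href="target">label</a>
string linked = Regex.Replace(
    wikiText,
    @"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]",
    @"<a href=""$1"">$2</a>");

// Then plain links: [[target]] becomes <a href="target">target</a>
linked = Regex.Replace(
    linked,
    @"\[\[([^\]\|\r\n]+?)\]\]",
    @"<a href=""$1"">$1</a>");

Console.WriteLine(linked);
```

Running the piped pattern before the plain one keeps the two forms from interfering, since the plain pattern cannot match text that contains a | character.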

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose; it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed on load from the BBC website and will then look for key words which either increment or decrease the percentage chance of DOOM.
Crazy little project, but maybe one day the classes will come in handy to use again for something more important.
I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want to be in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash

If you want to remove the DIV tags WITH content as well:
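string start = "<div>";
string end = "</div>";
// Lazy .*? with Singleline stops at the first closing tag, even across line breaks.
// (A negated character class built from the end tag would exclude its individual characters, not the tag itself.)
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);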
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>

IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.

A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
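would match HTML elements (opening tag, content, and the matching closing tag); use case-insensitive matching for lowercase tag names.
Use this to remove them from your data.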
