Replace all alphanumeric characters in a string except pattern

I'm trying to obfuscate a string, but need to preserve a couple patterns. Basically, all alphanumeric characters need to be replaced with a single character (say 'X'), but the following (example) patterns need to be preserved (note that each pattern has a single space at the beginning)
QQQ"
RRR"
I've looked through a few samples on negative lookaheads/lookbehinds, but still haven't had any luck with this (only testing QQQ).
var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
var regex = new Regex(@"((?!QQQ)(?<!\sQ{1,3}))[0-9a-zA-Z]");
var result = regex.Replace(test, "X");
The correct result should be:
"XXXX XXXX XXXXX XXXXX QQQ""XXXXX""XXXXX"XXXXXXXX
This works for an exact match, but will fail with something like ' QQR"', which returns
"XXXX XXXX XXXXX XXXXX XQR""XXXXX""XXXXX"XXXXXXXX

You can use this:
var regex = new Regex(@"((?> QQQ|[^A-Za-z0-9]+)*)[A-Za-z0-9]");
var result = regex.Replace(test, "$1X");
The idea is to first match everything that must be preserved and put it in a capturing group.
Since the target characters are always preceded by zero or more things that must be preserved, you only need to write this capturing group before [A-Za-z0-9] and put it back with $1 in the replacement.
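The same idea extends to several preserved patterns by listing them as alternatives inside the atomic group. Below is a minimal sketch, assuming the patterns are literal strings such as the ` QQQ"` and ` RRR"` examples from the question; the `Obfuscator` class and `Mask` method are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Obfuscator
{
    // Hypothetical helper: masks every ASCII letter or digit with 'X',
    // except inside the literal patterns passed in `preserved`.
    public static string Mask(string input, params string[] preserved)
    {
        // Escape each pattern so that it is matched literally.
        var alternatives = string.Join("|", preserved.Select(Regex.Escape));

        // Group 1 swallows any preserved patterns and non-alphanumerics
        // that precede the next character to be masked.
        var pattern = "((?>" + alternatives + "|[^A-Za-z0-9]+)*)[A-Za-z0-9]";

        // $1 puts the preserved text back; only the final character becomes X.
        return Regex.Replace(input, pattern, "$1X");
    }
}
```

Calling `Obfuscator.Mask(test, " QQQ\"", " RRR\"")` on the string from the question keeps ` QQQ"` intact, while a near miss such as ` QQR"` is masked like any other text.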

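Here's a non-regex solution. It works quite nicely, although it fails when a pattern occurs in the input more than once; handling that would need a better algorithm for fetching occurrences. You can compare it with a regex solution for large strings.
// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text;
public static string ReplaceWithPatterns(this string input, IEnumerable<string> patterns, char replacement)
{
    // First occurrence of each pattern; IndexOf returns -1 when it is absent.
    var patternsPositions = patterns
        .Select(p => new { Pattern = p, Index = input.IndexOf(p, StringComparison.Ordinal) })
        .Where(i => i.Index >= 0)
        .ToList();
    // Mask letters and digits only, so spaces and quotes keep their positions.
    var result = new StringBuilder(input.Length);
    foreach (var c in input)
        result.Append(char.IsLetterOrDigit(c) ? replacement : c);
    // Write each found pattern back over the masked text.
    foreach (var p in patternsPositions)
        for (var i = 0; i < p.Pattern.Length; i++)
            result[p.Index + i] = p.Pattern[i];
    return result.ToString();
}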
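One way to lift the repeated-occurrence limitation mentioned above is to mark every position covered by any pattern before masking. This is only a sketch under stated assumptions: patterns are literal strings, only ASCII letters and digits are masked, and `MaskExcept` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MaskingExtensions
{
    // Hypothetical extension: masks ASCII letters and digits, except at
    // positions covered by any occurrence of any pattern.
    public static string MaskExcept(this string input, IEnumerable<string> patterns, char replacement)
    {
        var keep = new bool[input.Length];
        foreach (var p in patterns)
        {
            if (p.Length == 0) continue; // an empty pattern protects nothing

            // Walk through every occurrence, including overlapping ones.
            for (var i = input.IndexOf(p, StringComparison.Ordinal);
                 i >= 0;
                 i = input.IndexOf(p, i + 1, StringComparison.Ordinal))
            {
                for (var k = 0; k < p.Length; k++) keep[i + k] = true;
            }
        }

        var sb = new StringBuilder(input.Length);
        for (var i = 0; i < input.Length; i++)
        {
            var c = input[i];
            var isAsciiAlnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            sb.Append(isAsciiAlnum && !keep[i] ? replacement : c);
        }
        return sb.ToString();
    }
}
```

The boolean array costs one extra pass over the input but avoids any bookkeeping about shifting indexes, because the output always has the same length as the input.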

Related

Replace Nth regex match occurrence in string

I know there are quite a few of these questions on SO, but I can't find one that explains, step by step, how the pattern was built to return the Nth match. All the answers I looked at just give the code to the OP with minimal explanation.
What I know is that you need to put {X} in the pattern, where X is the number of the occurrence you want to return.
So I am trying to match a string between two chars, and I seem to have that working.
The string to be tested looks something like this,
"=StringOne&=StringTwo&=StringThree&=StringFour&"
and this is the pattern,
"[^/=]+(?=&)"
Again, after reading as much as I could, this pattern will also return all matches,
[^/=]+(?=&){1}
Due to {1} being the default and therefore redundant in the above pattern.
But I can't do this,
[^/=]+(?=&){2}
As it will not return the 3rd match as I was expecting it to.
So could someone please shove me in the right direction and explain how to get the pattern needed to find the occurrence of the match that will be needed?

A pure regex way is possible, but is not really very efficient if your pattern is complex.
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx = 2; // Replace this occurrence
var result = Regex.Replace(s, $@"^(=(?:[^=&]+&=){{{idx-1}}})[^=&]+", "${1}REPLACED");
Console.WriteLine(result); // => =StringOne&=REPLACED&=StringThree&=StringFour&
Regex details
^ - start of string
(=(?:[^=&]+&=){1}) - Group 1 capturing:
    = - a = symbol
    (?:[^=&]+&=){1} - 1 occurrence (this number is generated dynamically) of
        [^=&]+ - 1 or more chars other than = and & (NOTE that in case the string may contain = and &, it is safer to replace it with .*? and pass RegexOptions.Singleline option to the regex compiler)
        &= - a &= substring.
[^=&]+ - 1 or more chars other than = and &
The ${1} in the replacement pattern inserts the contents of Group 1 back into the resulting string.
As an alternative, I can suggest introducing a counter, incrementing it upon each match, and only replacing the match when the counter is equal to the occurrence you specify.
Use
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx_to_replace = 2; // Replace this occurrence
var cnt = 0; // Counter
var result = Regex.Replace(s, "[^=]+(?=&)", m => { // Match evaluator
cnt++; return cnt == idx_to_replace ? "REPLACED" : m.Value; });
Console.WriteLine(result);
// => =StringOne&=REPLACED&=StringThree&=StringFour&
The cnt is incremented inside the match evaluator inside Regex.Replace and m is assigned the current Match object. When cnt is equal to idx_to_replace the replacement occurs, else, the whole match is pasted back (with m.Value).
Another approach is to iterate through the matches and, once the Nth match is found, replace it by splitting the string into the parts before and after the match, breaking out of the loop once the replacement is done:
var s = "=StringOne&=StringTwo&=StringThree&=StringFour&";
var idx_to_replace = 2; // Replace this occurrence
var cnt = 0; // Counter
var result = string.Empty; // Final result variable
var rx = "[^=]+(?=&)"; // Pattern
for (var m=Regex.Match(s, rx); m.Success; m = m.NextMatch())
{
cnt++;
if (cnt == idx_to_replace) {
result = $"{s.Substring(0, m.Index)}REPLACED{s.Substring(m.Index+m.Length)}";
break;
}
}
Console.WriteLine(result); // => =StringOne&=REPLACED&=StringThree&=StringFour&
This might be quicker since the engine does not have to find all matches.

Regex: Give priority to optional pattern

Let's say I have a string like this:
555 3553 666 555
And a regex like this
var pat = new Regex("3?553?");
When the string above is matched with pat.Match(mystring), the result returned will be "55".
I need the result returned to be "3553" if possible, and only if that is not possible do I want the result to be "55". As in: the 3? is optional and doesn't have to be there, but if it is there, it should always be matched first.
So this 555 3553 666 555 will return 3553
And this 222 5555 777 will return 55
Is this possible to achieve without using two separate regex definitions?
Thank you.

Regex engines always go through the string from left to right (assuming a left-to-right script). In your case, the first two characters match the regex, therefore it returns.
So, instead of stopping after the first match, you need to do all the matches and choose the longest one. However, there is a caveat: Regex matches can't overlap (every character can be matched only once). Therefore, in a string like
55553553
your regex would return 55, 553, and 553.
The solution is to use a lookahead assertion, combined with a capturing group:
var pat = new Regex("(?=(3?553?))");
and get all its matches
var candidates = new List<string>();
foreach (Match match in pat.Matches(subject))
{
    // The lookahead itself matches an empty string; the candidate text is in group 1
    candidates.Add(match.Groups[1].Value);
}
Then choose the longest match.
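Put together, the lookahead approach can be sketched in C# as follows. The `LongestMatch` class and `Find` method are hypothetical names, and the pattern is hard-coded to the `3?553?` example from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LongestMatch
{
    // Hypothetical helper: returns the longest text matched by 3?553?,
    // considering overlapping candidates, or "" when nothing matches.
    public static string Find(string subject)
    {
        // The lookahead consumes nothing, so the engine retries at every
        // position; the candidate text is captured in group 1.
        return Regex.Matches(subject, "(?=(3?553?))")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .OrderByDescending(v => v.Length) // stable sort: leftmost wins on ties
            .FirstOrDefault() ?? string.Empty;
    }
}
```

Because the matches are zero-width, candidates may overlap, so `3553` is found in `55553553` even though a plain `3?553?` scan would consume its leading `3` as part of an earlier `553`.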

I think you want to give some matches priority over others. If so, the code below may help:
var matches = Regex.Matches(txt, @"(?<G1>3553)|(?<G2>55)").OfType<Match>();
var res = matches
.GroupBy(x => x.Success)
.Select(x =>
new {
Success = x.Key,
G = !string.IsNullOrEmpty(x.Max(w => w.Groups["G1"].Value))
? x.Max(w => w.Groups["G1"].Value)
: x.Max(w => w.Groups["G2"].Value)
})
.SingleOrDefault();

Your regex matches 55 simply because that is the first match it can find. It has nothing to do with priorities.
I think what you want here is to get the longest match. You should use Matches to get all the matches and get the longest one by checking Length.
var matches = Regex.Matches("555 3553 666 555", "3?553?");
var longestMatch = matches.Cast<Match>().OrderByDescending(x => x.Value.Length).First().Value;

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "abbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?

Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]

I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();

There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the position after the last character of each run, that is, the point where the preceding character is not repeated by the following one.
2. It creates a capture group for every match, and the captured text is then handed back by the Split method alongside the pieces. The capture group is required by the negative lookahead, specifically the \1 backreference, which basically means "the value of the first capture group in the pattern", so it cannot be omitted.
3. Regex.Split, when the pattern contains one or more capture groups, includes the captured text in the returned array after every individual split.
Number 3 is why your string array looks weird: Split splits after the last a of the first run, and the text before that point becomes split[0]. It is followed by the captured text at split[1], and so on.
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
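The effect of capture groups on Regex.Split can be seen with a simpler delimiter. The sketch below uses a hypothetical `SplitDemo` class; the `Runs` method shows the Matches-based alternative for comparison.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Delimiter inside a capturing group: the captured text is returned too.
    public static string[] WithCapture(string s) => Regex.Split(s, "(-)");

    // Same delimiter without a capturing group: only the pieces are returned.
    public static string[] WithoutCapture(string s) => Regex.Split(s, "-");

    // Matching the runs directly avoids the problem altogether.
    public static string[] Runs(string s) =>
        Regex.Matches(s, @"(.)\1*").Cast<Match>().Select(m => m.Value).ToArray();
}
```

`SplitDemo.WithCapture("a-b-c")` yields five elements (a, -, b, -, c), while `SplitDemo.WithoutCapture("a-b-c")` yields only the three letters.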

To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();

Could do this easily with LINQ, but I don't think its runtime will be as good as regex.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee']
However, this won't work if you have a string like "aabbaa": you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"].
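A plain loop that compares each character with the start of the current run handles the "aabbaa" case, because it groups only consecutive characters. This is a sketch with hypothetical names (`RunSplitter`, `SplitRuns`), not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class RunSplitter
{
    // Hypothetical helper: splits a string into runs of identical
    // consecutive characters, e.g. "aabbaa" -> "aa", "bb", "aa".
    public static List<string> SplitRuns(string s)
    {
        var runs = new List<string>();
        var start = 0; // index where the current run begins

        for (var i = 1; i <= s.Length; i++)
        {
            // Close the run at the end of the string or when the character changes.
            if (i == s.Length || s[i] != s[start])
            {
                runs.Add(s.Substring(start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

An empty input returns an empty list, since the loop body never runs.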

Omit unnecessary parts in string array

In C#, I have a string that comes from a file in this format:
Type="Data"><Path.Style><Style
or maybe
Type="Program"><Rectangle.Style><Style
and so on. Now I want to extract only the Data or Program part of the Type element. For that, I used the following code:
string output;
var pair = inputKeyValue.Split('=');
if (pair[0] == "Type")
{
output = pair[1].Trim('"');
}
But it gives me this result:
output=Data><Path.Style><Style
What I want is:
output=Data
How to do that?

This code example takes an input string, splits by double quotes, and takes only the first 2 items, then joins them together to create your final string.
string input = "Type=\"Data\"><Path.Style><Style";
var parts = input
.Split('"')
.Take(2);
string output = string.Join("", parts); //note: .net 4 or higher
This will make output have the value:
Type=Data
If you only want output to be "Data", then do
var parts = input
.Split('"')
.Skip(1)
.Take(1);
or
var output = input
.Split('"')[1];

What you can do is use a very simple regular expression to parse out the bits that you want. In your case you want something that looks like this, and then grab the two groups that interest you:
(Type)="(\w+)"
This returns, in groups 1 and 2, the value Type and the word characters contained between the double quotes.

Instead of doing many splits, why don't you just use Regex:
output = Regex.Match(pair[1], "\"(\\w*)\"").Groups[1].Value;

Maybe I missed something, but what about this:
var str = "Type=\"Program\"><Rectangle.Style><Style";
var splitted = str.Split('"');
var type = splitted[1]; // i.e. Data or Program
But you will need some error handling as well.

How about a regex?
var regex = new Regex("(?<=^Type=\").*?(?=\")");
var output = regex.Match(input).Value;
Explanation of the regex
(?<=^Type=\") This is a prefix match (lookbehind). It's not included in the result, but the regex will only match if the string starts with Type="
.*? Non-greedy match. Match as few characters as possible until
(?=\") This is a suffix match (lookahead). It's not included in the result, but the regex will only match if the next character is "

Given your specified format:
Type="Program"><Rectangle.Style><Style
It seems logical to me to include the quote mark (") when splitting the strings... then you just have to detect the end quote mark and extract the contents. You can use LINQ to do this:
string code = "Type=\"Program\"><Rectangle.Style><Style";
string[] parts = code.Split(new string[] { "=\"" }, StringSplitOptions.None);
string[] wantedParts = parts.Where(p => p.Contains("\"")).
Select(p => p.Substring(0, p.IndexOf("\""))).ToArray();

Regex to remove all (non numeric OR period)

I need for text like "joe ($3,004.50)" to be filtered down to 3004.50 but am terrible at regex and can't find a suitable solution. So only numbers and periods should stay - everything else filtered. I use C# and VS.net 2008 framework 3.5

This should do it:
string s = "joe ($3,004.50)";
s = Regex.Replace(s, "[^0-9.]", "");

The regex is:
[^0-9.]
You can cache the regex:
Regex not_num_period = new Regex("[^0-9.]");
then use:
string result = not_num_period.Replace("joe ($3,004.50)", "");
However, you should keep in mind that some cultures have different conventions for writing monetary amounts, such as: 3.004,50.

You are dealing with a string, and a string is an IEnumerable<char>, so you can use LINQ:
var input = "joe ($3,004.50)";
var result = String.Join("", input.Where(c => Char.IsDigit(c) || c == '.'));
Console.WriteLine(result); // 3004.50

For the accepted answer, MatthewGunn raises a valid point in that all digits, commas, and periods in the entire string will be condensed together. This will avoid that:
string s = "joe.smith ($3,004.50)";
Regex r = new Regex(@"(?:^|[^\w.,])(\d[\d,.]+)(?=\W|$)");
Match m = r.Match(s);
string v = null;
if (m.Success) {
v = m.Groups[1].Value;
v = Regex.Replace(v, ",", "");
}

The approach of removing offending characters is potentially problematic. What if there's another . in the string somewhere? It won't be removed, though it should!
Removing non-digits or periods, the string joe.smith ($3,004.50) would transform into the unparseable .3004.50.
Imho, it is better to match a specific pattern, and extract it using a group. Something simple would be to find all contiguous commas, digits, and periods with regexp:
[\d,\.]+
Sample test run:
Pattern understood as:
[\d,\.]+
Enter string to check if matches pattern
> a2.3 fjdfadfj34 34j3424 2,300 adsfa
Group 0 match: "2.3"
Group 0 match: "34"
Group 0 match: "34"
Group 0 match: "3424"
Group 0 match: "2,300"
Then for each match, remove all commas and send that to the parser. To handle a case like 12.323.344, you could do another check to see that a matching substring has at most one period.
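The steps described above (match runs of digits, commas, and periods, strip the commas, reject anything with more than one period, then parse) could be sketched like this. The `AmountExtractor` class is a hypothetical name, and parsing with the invariant culture is an assumption that the period is the decimal separator.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class AmountExtractor
{
    // Hypothetical helper: pulls every parseable number out of free text.
    public static List<decimal> Extract(string text)
    {
        var amounts = new List<decimal>();
        foreach (Match m in Regex.Matches(text, @"[\d,.]+"))
        {
            // Commas are treated as thousands separators and dropped.
            var candidate = m.Value.Replace(",", "");

            // Something like 12.323.344 is ambiguous, so skip it.
            if (candidate.Count(c => c == '.') > 1) continue;

            // TryParse also rejects leftovers such as a lone "." from "joe.smith".
            if (decimal.TryParse(candidate, NumberStyles.AllowDecimalPoint,
                                 CultureInfo.InvariantCulture, out var value))
            {
                amounts.Add(value);
            }
        }
        return amounts;
    }
}
```

For `joe.smith ($3,004.50)` this returns a single value, 3004.50, because the stray period after `joe` fails to parse and is ignored.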
