C# regex: replacing pattern matches only until a specific word is found

I can usually work around problems and get everything working by myself, but this one is tricky, and the MSDN references and examples confuse more than they help.
I have been testing some code and got stuck combining a capture group (for the replacement) with a non-capturing group that stops the matching where I want.
A simplified version of what I want to change is:
stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
valor = "value:8080";
I know that if I use
pattern = @"value:(\d+)";
I can change every value number to 8080 with
Regex.Replace(stats, pattern, valor);
but I need it to stop replacing once it finds the string 'something'.
I managed to replace everything up to 'something' with 'valor' using
pattern = @"^(?:(?!something).)*";
Is there a way to replace only the 'value:(\d+)' matches with 'valor', combined with (?:(?!something).) to stop the matching, in the same pattern?
I have seen lots of examples, but none of them covered this, so I don't know whether both conditions can be merged.

You can make use of a look-behind solution that makes sure there is no something before the value:
(?<!\bsomething\b.*)value:\d+
Note that something is matched as a whole word due to \b word boundaries.
The result of the replace operation:
label:100,value:8080,label:110,value:8080,something,label:200,value:8888
Note that (?:(?!something).) is very inefficient and should be used only when no other means works. In .NET, there is a powerful variable-width look-behind, which is the right tool for this task.
Also note that if you are not using capture group backreferences, you do not need capturing groups in your pattern (I removed the parentheses around \d+).
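For reference, here is a minimal sketch of the look-behind approach applied to the sample from the question. The names stats and valor come from the question, the variable result is only illustrative, and the output shown in the comment assumes exactly this input.

```csharp
using System;
using System.Text.RegularExpressions;

string stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
string valor = "value:8080";

// Match value:<digits> only when the whole word "something" does not occur
// anywhere earlier in the string (.NET supports variable-width look-behind).
string pattern = @"(?<!\bsomething\b.*)value:\d+";

string result = Regex.Replace(stats, pattern, valor);
Console.WriteLine(result);
// label:100,value:8080,label:110,value:8080,something,label:200,value:8888
```

Everything after the first "something" is left untouched because the look-behind fails for every later position.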

Related

Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This worked pretty smoothly until I reached the corner case of companies whose names end in s. I end up with cases where it sometimes creates an s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not work. It seems that Word uses a different character for an apostrophe (') than the standard one produced by the key on my keyboard in Visual Studio. If I write the find and replace using my keyboard it does not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and I also want to know if there is a proper way of doing this. I tried "&apos;" but that does not work. I just want to know whether using the character copied from Word is the proper way of doing this, and whether there is a way to make both characters work, so I don't have an issue with docs that may have been created with a different program.
The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
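I ran into the very same issue before and used this character class as the regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to the following (note that there is no backslash before the opening bracket, otherwise the bracket would be matched literally instead of starting a character class):
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s\'");
I found these were the five most common types of single quotes and apostrophes used.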
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s['’]s");
docText = apostropheReplace.Replace(docText, "s\'");
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other apostrophe-like characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
    new Regex("s['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s\'");
But you can certainly add other characters as needed. A more complete list of such characters can be found here. To add one to the above Regex, change the "U+" of its code point to "\u" and put the result inside the brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from
Regex("s['\u2018\u2019\u201A\u201b]s")
to
Regex("s['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of which characters are the most "proper" for inclusion in your Regex based on your use cases.
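To see the character class in action, here is a small sketch that runs it against both a typed apostrophe and the one Word typically inserts. The sample sentences and variable names are made up for illustration, and the replacement always writes a straight apostrophe, which is an assumption about what you want in the output.

```csharp
using System;
using System.Text.RegularExpressions;

// Straight apostrophe plus the typographic variants listed above.
// There is no backslash before '[' so that it stays a character class.
var apostropheReplace = new Regex("s['\u2018\u2019\u201A\u201B\u2032]s");

string typed = "The Simmons's business is cars.";         // straight apostrophe
string fromWord = "The Simmons\u2019s business is cars."; // right single quotation mark

string fixedTyped = apostropheReplace.Replace(typed, "s'");
string fixedFromWord = apostropheReplace.Replace(fromWord, "s'");

Console.WriteLine(fixedTyped);    // The Simmons' business is cars.
Console.WriteLine(fixedFromWord); // The Simmons' business is cars.
```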

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can collect text along the string (.+ style) and follow it with a lookahead that checks for what has been captured up to that point, which would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easy to change if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with a one-word minimum, a legitimate "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, which requires knowing more about the data.
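Since the question asks for a .NET regex, here is a rough C# sketch of the same idea. It reuses the pattern from the Perl example unchanged and replaces each match with an empty string, which drops the first copy of the repeated text. The sample lines are the ones used above; the variable names are illustrative, and the same caveats about false positives apply.

```csharp
using System;
using System.Text.RegularExpressions;

// Capture at least two words, then require the capture to repeat immediately.
// Only the first copy is matched (the repeat sits in a lookahead), so
// replacing the match with "" leaves a single copy behind.
var repeated = new Regex(@"(\w+\W+\w+.+)(?=\1)");

string[] lines =
{
    "It just wasn't able just wasn't able no matter how hard it tried.",
    "This has no repetitions.",
    @"{FolderLoc = ""C:\testC:\test""}",
};

var cleaned = new string[lines.Length];
for (int i = 0; i < lines.Length; i++)
{
    cleaned[i] = repeated.Replace(lines[i], "");
    Console.WriteLine(cleaned[i]);
}
```

In a tool that only takes a pattern and a replacement, the equivalent would be the same pattern with an empty replacement.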

regex to replace last two digits of an assemblyversion

I'm working with TeamCity and a C# project, and I want to use the file content patcher to replace the last two digits in an assembly version (e.g. the two stars in [assembly: AssemblyVersion("1.0.*.*")]). I've found the docs on the file content patcher, and they suggest using
(^\s*\[\s*assembly\s*:\s*((System\s*\.)?\s*Reflection\s*\.)?\s*AssemblyVersion(Attribute)?\s*\(\s*@?\")(([0-9\*]+\.)+)[0-9\*]+(\"\s*\)\s*\]) if you just want to change the LAST digit, which got me partway there.
I figured that if I did (^\s*\[\s*assembly\s*:\s*((System\s*\.)?\s*Reflection\s*\.)?\s*AssemblyVersion(Attribute)?\s*\(\s*@?\")(([0-9\*]+(\.))+)[0-9\*]+(\"\s*\)\s*\]) it would capture the last period as its own group, letting me replace the two stars without a problem. However, it looks like the first star is still captured in the group with the 1.0 (so the group becomes 1.0.*.).
What I want is to restrict the first group to capturing the {major}.{minor}. and then have the last period be its own group so I could do something like: $1$5\%build.number%$7%build.vcs.number%$8 which would give me AssemblyVersion("1.0.{build#}.{vcs#}")]
Generally I can stumble through regex without many problems, but I've been working on this for the last few hours and I can't seem to get it right. Any information on reaching this result would be appreciated.
If you want to keep to the solution you found to replace while also validating, you may use
(^\s*\[\s*assembly\s*:\s*((System\s*\.)?\s*Reflection\s*\.)?\s*AssemblyVersion(Attribute)?\s*\(\s*@?\")([0-9\*]+\.[0-9\*]+)\.([0-9\*]+\.[0-9\*]+)(\"\s*\)\s*\])
and replace with $1$5.%build.number%.%build.vcs.number%$7.
I just unrolled the ([0-9\*]+(\.))+ into ([0-9\*]+\.[0-9\*]+)\.([0-9\*]+\.[0-9\*]+), 2 groups (([0-9\*]+\.[0-9\*]+)) separated with a literal dot (\.). I also had to remove the [0-9\*]+ that followed the ([0-9\*]+(\.))+ pattern.
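To check the group numbering outside TeamCity, the unrolled pattern can be tried with Regex.Replace in C#. This is only a sketch: 123 and 456 are placeholder values standing in for %build.number% and %build.vcs.number%, and the ${n} form of the group references is used so that the digits that follow are not read as part of the group number.

```csharp
using System;
using System.Text.RegularExpressions;

string line = @"[assembly: AssemblyVersion(""1.0.*.*"")]";

// Group 1 = everything up to the opening quote, group 5 = major.minor,
// group 6 = the last two components, group 7 = closing quote, paren and bracket.
string pattern =
    @"(^\s*\[\s*assembly\s*:\s*((System\s*\.)?\s*Reflection\s*\.)?\s*AssemblyVersion(Attribute)?\s*\(\s*@?\"")" +
    @"([0-9\*]+\.[0-9\*]+)\.([0-9\*]+\.[0-9\*]+)(\""\s*\)\s*\])";

// Placeholder build number (123) and VCS revision (456), assumed for illustration.
string patched = Regex.Replace(line, pattern, "${1}${5}.123.456${7}");
Console.WriteLine(patched);
// [assembly: AssemblyVersion("1.0.123.456")]
```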
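I would first extract the version string (1.0.*.* in your case) and then use Version.Parse. Note that Version.Parse accepts only numeric components, so any * has to be substituted before parsing.
This is a much smaller regex (and it can be shortened more):
string input = @"[assembly:AssemblyVersion(""1.2.3.4"")]";
var verStr = Regex.Match(input, @"\[.+?\(\""(.+?)\""\)\]").Groups[1].Value;
var version = Version.Parse(verStr);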

Regex to identify C# functions

I need to find all functions in my VS solution with a certain attribute and insert a line of code at the end and at the beginning of each one. For identifying the functions, I've got as far as
\[attribute\]\r?\n(.*)void(.*)\r?\n.*\{\r?\n([^\{\}]*)\}
But that only works on functions that don't contain any other blocks of code delimited by braces. If I set the last capturing group to [\s\S] (all characters), it simply selects all text from the start of the first function to the end of the last one. Is there a way to get around this and select just one whole function?
I am afraid balancing constructs by themselves are not enough, since the method body may contain an unbalanced number of braces (inside string literals or comments, for example). You can still try this regex, which handles most of the caveats:
\[attribute\](?<signature>[^{]*)(?<body>(?:\{[^}]*\}|//.*\r?\n|"[^"]*"|[\S\s])*?\{(?:\{[^}]*\}|//.*\r?\n|"[^"]*"|[\S\s])*?)\}
The regex will ignore all { and } in the string literals and //-like comments, and will consume {...} blocks. The only thing it does not support is /*...*/ multiline comments. Please let me know if you also need to account for them.
The bad news is that you can't do that by the Search-And-Replace feature because it doesn't support balancing groups. You can write a separate program in C# that does it for you.
The construct to get the matching closing brace is:
(?=\{)(?:(?<open>\{)|(?<-open>\})|[^\{\}])+?(?(open)(?!))
This matches a {...} block. But as @DmitryBychenko mentioned, it doesn't respect comments or strings.
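As a quick illustration of the balancing construct, here is a small C# sketch that pulls the first complete {...} block out of a string. The sample source text is invented, and, as noted above, braces inside comments or string literals would throw the count off.

```csharp
using System;
using System.Text.RegularExpressions;

string source = "void F() { if (x) { y(); } z(); } void G() { }";

// Each '{' pushes onto the "open" stack and each '}' pops it; the conditional
// at the end rejects the match while any '{' is still unclosed.
var block = new Regex(@"(?=\{)(?:(?<open>\{)|(?<-open>\})|[^\{\}])+?(?(open)(?!))");

Match first = block.Match(source);
Console.WriteLine(first.Value);
// { if (x) { y(); } z(); }
```

The match stops at the brace that closes the outermost block, so the nested if block stays inside it.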

how to replace the exact word by another in a list?

I have a list like :
george fg
michel fgu
yasser fguh
I would like to replace fg, fgu, and fguh with "fguhCool". I already tried something like this:
foreach (var ignore in NameToPoulate)
{
    tempo = ignore.Replace("fg", "fguhCool");
    NameToPoulate_t.Add(tempo);
}
But then "fgu" becomes "fguhCoolu" and "fguh" becomes "fguhCooluh". Is there a better way?
Thanks for your help.
I assume that this is a homework assignment and that you are being tested on the specific algorithm rather than on any code that does the job.
This is probably what your teacher has in mind:
Students will realize that the code should check for "fguh" first, then "fgu", then "fg". The order is important because replacing "fg" will, as you have noticed, destroy a "fguh".
Some students will implement this as a loop with if-else conditions, so that a "fg" inside an already replaced "fguhCool" is not replaced again.
But then you will find that the algorithm breaks down if "fg" and "fgu" are both within the same string. You cannot let the presence of "fgu" prevent you from checking for "fg" in a different part of the string.
The answer that your teacher is looking for is probably that you should first locate "fguh", "fgu" and "fg" (in that order) and replace them with an intermediary string that doesn't contain "fg". After you have done that, you can search for that intermediary string and replace it with "fguhCool".
You could use regular expressions:
Regex.Replace(ignore, @"\bfg\b", "fguhCool");
The \b matches a so-called word boundary, which means it matches the beginning or end of a word (roughly, but well enough for this purpose).
Use a regular expression:
Regex.Replace(ignore, "fg(uh?)?", "fguhCool");
An alternative would be replacing the long words with the short one first, then replacing the short one with the final value (I'm assuming all words, "fg", "fgu" and "fguh", map to the same value "fguhCool", right?)
tempo = ignore
.Replace("fguh", "fg")
.Replace("fgu", "fg")
.Replace("fg", "fguhCool");
Note: this assumes those words can appear anywhere in the string. If you only want to match whole words (i.e. cases where those words are not substrings of a bigger word), then see @Joey's answer (in that case simple substitutions won't do, and regexes really are the best option).
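Putting the two regex suggestions together, a whole-word version that covers all three variants could look like the sketch below. The combined pattern \bfg(?:uh?)?\b is my own merge of the answers above rather than something either of them posted, and the list contents are the sample data from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var names = new List<string> { "george fg", "michel fgu", "yasser fguh" };
var replaced = new List<string>();

// \b keeps the match to whole words; (?:uh?)? lets one pattern cover fg, fgu and fguh.
var word = new Regex(@"\bfg(?:uh?)?\b");

foreach (string name in names)
{
    replaced.Add(word.Replace(name, "fguhCool"));
}

Console.WriteLine(string.Join(", ", replaced));
// george fguhCool, michel fguhCool, yasser fguhCool
```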
