Conditional Regex replace to add dashes - c#

OK, so I need to design a regex to insert dashes. I'm tasked with building a web API function that returns a specifically formatted string based on input parameters. For some reason that hasn't been made clear to me, the source data isn't properly formatted, and I need to reformat it with dashes in the correct places.
Depending on the first two characters and the string length, there is an optional third dash. Fortunately, I'm not concerned with what those characters are. This system is a passthrough, so garbage in, garbage out. However, I do need to make sure the dashes are placed correctly based on length.
Structure Types (format, then the leading characters that use it)
XX-9999999999-XX        AB
XX-9999999999-99        CD, EF
XX-9999999999-XXX-99    GH
XX-9999999999-XX-99     IJ, KL
For Example:
AB123456789044 should be AB-01234567890-44 and
GH1234567890YYY99 becomes GH-01234567890-YYY-99.
Thus far I've gotten to this point.
^(\w\w)(\d{10})(\w{2,3})(\d\d)?$
Which leads to my Question(s)
1) I'm attempting to replace with $1-$2-$3-$4. However, whenever there is a fourth section of digits, as in the IJ case, it's hard to distinguish between that and AB in the replace.
I've gotten GH-01234567890-YY-99 and GH-01234567890-YY-.
How do I reference a conditional capture group in a replace string such that the dash relating to it only shows up if the grouping exists?

The problem is that you need conditional replacements, and .NET replacement strings don't support those. So you've got to build the replacement programmatically. Something like:
string resultString = null;
try {
    // The -? parts tolerate input that already contains some of the hyphens.
    Regex regexObj = new Regex(@"([A-Z]{2})-?(\d{10})-?(?:([A-Z]{2,3})|(\d{2}))-?(\d{2})?", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    resultString = regexObj.Replace(subjectString, new MatchEvaluator(ComputeReplacement));
} catch (ArgumentException ex) {
    // Error handling
}
public String ComputeReplacement(Match m) {
    // Vary the replacement text in C# as needed.
    // Placeholder only: the returned string is inserted literally ($1 is not expanded here),
    // so build it from m.Groups[n].Value instead.
    return "$1-$2-$3-$4-$5";
}
I haven't paid too much attention to the actual RegEx here, as it seems like you know what you're doing with it. I just included some conditional hyphens in case the data are quite dirty (partially formatted). Obviously you have to edit the "return" part of this, using conditionals in case any of the captures are blank. I haven't worked out that logic for you, as C# isn't my strength.
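One way that logic could be worked out, as a minimal sketch rather than the answerer's own code: join only the capture groups that actually matched, so a missing optional group never leaves a dangling dash. The pattern is the one from the question; DashFormatter and Format are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DashFormatter
{
    // Pattern from the question: 2 chars, 10 digits, 2-3 chars, optional 2 digits.
    private static readonly Regex Pattern =
        new Regex(@"^(\w\w)(\d{10})(\w{2,3})(\d\d)?$");

    // Hypothetical helper: returns the input with dashes between the matched parts.
    // Input that does not match the pattern is returned unchanged.
    public static string Format(string input)
    {
        return Pattern.Replace(input, m =>
            string.Join("-", m.Groups.Cast<Group>()
                .Skip(1)                  // group 0 is the whole match
                .Where(g => g.Success)    // drop the optional group when absent
                .Select(g => g.Value)));
    }
}
```

With this sketch, Format("AB123456789044") gives AB-1234567890-44 and Format("GH1234567890YYY99") gives GH-1234567890-YYY-99.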

Related

Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This has worked pretty smoothly until I reached my corner case of companies with names that end in s. I end up with cases where it sometimes creates s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not work. It seems that Word uses a different character for an apostrophe (') than the standard one that is created when I use the key on my keyboard in Visual Studio. If I write a find and replace using my keyboard it will not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if there is a proper way of doing this. I tried "'" but that does not work. I just want to know if using the copied character from Word is the proper way of doing this, and whether there is a way to make both characters work so I don't have an issue with docs that may be created with a different program.
The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
I ran into the very same issue before and I used this as the regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to:
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s'");
I found these were the five most common types of single quotes and apostrophes used.
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s['’]s");
docText = apostropheReplace.Replace(docText, "s'");
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other apostrophe-like characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
    new Regex("s['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s'");
But you can certainly add other characters as needed. A more complete list of characters can be found here - to add one to the above Regex, simply change the "U+" to "\u" and add it inside the square brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from
Regex("s['\u2018\u2019\u201A\u201b]s")
to
Regex("s['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of what character encodings are the most "proper" for inclusion in your Regex based on your use cases.
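As a small variation on the character-class idea, here is a sketch that captures whichever apostrophe was found and puts the same one back, so a document that uses Word's curly apostrophe keeps it. This is an assumption about what might be wanted, not something either answer proposes; PossessiveFixer and Fix are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class PossessiveFixer
{
    // Straight apostrophe plus the typographic variants listed in the answers.
    // The apostrophe is captured so the replacement can reuse it.
    private static readonly Regex SApostropheS =
        new Regex("s([\u2018\u2019\u201A\u201B\u2032'])s");

    // Turns "s's" into "s'" while keeping the apostrophe style of the document.
    public static string Fix(string docText)
    {
        return SApostropheS.Replace(docText, "s$1");
    }
}
```

Like the original one-liner, this changes every s's in the text, so it may be worth running it only on text produced by the [[COMPANY]] substitution.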

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can capture text along the string (.+ style), followed by a lookahead that checks for what has been captured up to that point, which would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands, and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with a one-word minimum, "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, which requires knowing more about the data.
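Since the question targets .NET regex, the same pattern could be tried from C# roughly as follows. This is only a sketch under the caveats in the note above: it uses the Perl pattern unchanged and removes the first copy of each immediate repetition by replacing the match with an empty string. DuplicatePasteFixer and RemoveFirstCopy are made-up names.

```csharp
using System.Text.RegularExpressions;

public static class DuplicatePasteFixer
{
    // Same pattern as the Perl example: at least "word, non-word, word, something",
    // immediately followed (lookahead) by a copy of what was captured.
    private static readonly Regex Repeated =
        new Regex(@"(\w+\W+\w+.+)(?=\1)");

    // Deletes the first copy; the lookahead leaves the second copy in place.
    public static string RemoveFirstCopy(string line)
    {
        return Repeated.Replace(line, "");
    }
}
```

In a find-and-replace tool such as FNR, the equivalent would be to search for the pattern and replace with nothing. Lines without such a repetition come back unchanged.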

Regular Expression to match a quoted string embedded in another quoted string

I have a data source that is comma-delimited, and quote-qualified. A CSV. However, the data source provider sometimes does some wonky things. I've compensated for all but one of them (we read in the file line-by-line, then write it back out after cleansing), and I'm looking to solve the last remaining problem when my regex-fu is pretty weak.
Matching a Quoted String inside of another Quoted String
So here is our example string...
"foobar", 356, "Lieu-dit "chez Métral", Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I am looking to match the substring "chez Métral", in order to replace it with the substring chez Métral. Ideally, in as few lines of code as possible. The final goal is to write the line back out (or return it as a method return value) with the replacement already done.
So our example string would end up as...
"foobar", 356, "Lieu-dit chez Métral, Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I know I could define a pattern such as (?<quotedstring>\"\w+[^,]+\") to match quoted strings, but my regex-fu is weak (database developer, almost never use C#), so I'm not sure how to match another quoted string within the named group quotedstring.
FYI: For those noticing the large integer that is formatted with commas but not quote-qualified, that's already handled. As is the random use of row-delimiters (sometimes CR, sometimes LF). As other problems...
Match with this regex:
(?<!,\s*|^)"([^",]*)"
and replace each match with $1.
In a C# verbatim string, where " is escaped as "", it becomes:
(?<!,\s*|^)""([^"",]*)""
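Put together in C#, the replacement could look like the sketch below. It relies on .NET allowing a variable-length lookbehind; CsvQuoteCleaner and Clean are hypothetical names, and the sketch has only been reasoned through against the example line from the question.

```csharp
using System.Text.RegularExpressions;

public static class CsvQuoteCleaner
{
    // A quote that is neither at the start of the line nor preceded by a comma
    // plus optional whitespace, then a run without quotes or commas, then a quote.
    private static readonly Regex EmbeddedQuotes =
        new Regex(@"(?<!,\s*|^)""([^"",]*)""");

    // Strips the quotes around a quoted string embedded in a quoted field.
    public static string Clean(string line)
    {
        return EmbeddedQuotes.Replace(line, "$1");
    }
}
```

Field-opening quotes are skipped by the lookbehind, and field-closing quotes fail to match because they are followed by a comma or the end of the line, so only the inner pair is removed.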

Quick & Dirty way to update "IDs" in a string formatted as XML (C#)

For a one-shot operation, I need to parse the contents of an XML string and change the numbers of the "ID" field. However, I cannot risk changing anything else in the string, e.g. whitespace, line feeds, etc. MUST remain as they are!
Since in my experience XmlReader tends to mess up whitespace and may even reformat your XML, I don't want to use it (but feel free to convince me otherwise). This also screams for RegEx but ... I'm not good at RegEx, particularly not with the .NET implementation.
Here's a short part of the string; the number of the ID field needs to be updated in some cases. There can be many such VAR entries in the string. So I need to convert each ID to Int32, compare & modify it, then put it back into the string.
<VAR NAME="sf_name" ID="1001210">
I am looking for the simplest (in terms of coding time) and safest way to do this.
The regex pattern you are looking for is:
ID="(\d+)"
Match group 1 would contain the number. Use a MatchEvaluator Delegate to replace matches with dynamically calculated replacements.
Regex r = new Regex("ID=\"(\\d+)\"");
string outputXml = r.Replace(inputXml, new MatchEvaluator(ReplaceFunction));
where ReplaceFunction is something like this:
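public string ReplaceFunction(Match m)
{
    // m.Groups[1].Value holds the digits captured by (\d+)
    int id = int.Parse(m.Groups[1].Value);
    int result = id; // compare and modify the ID here as needed
    // The return value replaces the whole match, so rebuild the attribute
    return "ID=\"" + result + "\"";
}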
If you need I can expand the Regex to match more specifically. Currently all ID values (that contain numbers only) are replaced. You can also build that bit of "extra intelligence" into the match evaluator function and make it return the match unchanged if you don't want to change it.
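The "extra intelligence" in the evaluator might look something like this sketch. The lookup table of old to new IDs is purely an assumption for illustration (the question does not say how new IDs are decided), and IdRewriter and Rewrite are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdRewriter
{
    private static readonly Regex IdAttribute = new Regex("ID=\"(\\d+)\"");

    // newIds is a hypothetical old-ID -> new-ID lookup supplied by the caller.
    public static string Rewrite(string inputXml, IDictionary<int, int> newIds)
    {
        return IdAttribute.Replace(inputXml, m =>
        {
            int oldId = int.Parse(m.Groups[1].Value);
            int newId;
            // Return the match unchanged when this ID needs no update.
            return newIds.TryGetValue(oldId, out newId)
                ? "ID=\"" + newId + "\""
                : m.Value;
        });
    }
}
```

Because only the matched ID="..." text is ever replaced, whitespace and line feeds elsewhere in the string are left exactly as they were.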
Take a look at the PreserveWhitespace property of the XmlDocument class.

Easiest way to format rtf/unicode/utf-8 in a RichTextBox?

I'm currently beating my head against a wall trying to figure this out. Long story short, I'd like to convert the text between two '\u0002' characters to bold formatting. This is for an IRC client that I'm working on, so I've been running into these quite a bit. I've tried regex and found that matching on the RTF as ((\'02) works to catch it, but I'm not sure how to match the last character and change it to \bclear or whatever the RTF formatting close is.
I can't exactly paste the text I'm trying to parse because the characters get filtered out of the post. But when looking at the char value its an int of 2.
Here's an attempt to paste the offending text:
[02:34] test test
You could use either
rtb.Rtf = Regex.Replace(rtb.Rtf, @"\\'02\s*(.*?)\s*\\'02", @"\b $1 \b0");
or
rtb.Rtf = Regex.Replace(rtb.Rtf, @"\\'02\s*(.*?)\s*\\'02", @"\'02 \b $1 \b0 \'02");
depending on whether you want to keep the \u0002s in there.
The \b and \b0 turn the bold on and off in RTF.
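To try the first replacement without a RichTextBox, it can be wrapped in a plain string helper, as in this sketch. IrcBold and ToRtfBold are hypothetical names, and the input is assumed to already be RTF in which the control character appears as \'02.

```csharp
using System.Text.RegularExpressions;

public static class IrcBold
{
    // Matches \'02 ... \'02 in RTF source and captures the text in between.
    private static readonly Regex BoldSpan =
        new Regex(@"\\'02\s*(.*?)\s*\\'02");

    // Replaces each pair of \'02 markers with RTF bold on (\b) and off (\b0).
    public static string ToRtfBold(string rtf)
    {
        return BoldSpan.Replace(rtf, @"\b $1 \b0");
    }
}
```

The result would then be assigned back with rtb.Rtf = IrcBold.ToRtfBold(rtb.Rtf). A lone, unpaired \'02 is left untouched by this pattern.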
I don't have a test case, but you could also probably use the Clipboard class's GetText method with the Unicode TextDataFormat. Basically, I think you could place the input in the clipboard and get it out in a different format (works for RTF and the like). Here's MS's demo code (not applicable directly, but demonstrates the API):
// Demonstrates SetText, ContainsText, and GetText.
public String SwapClipboardHtmlText(String replacementHtmlText)
{
    String returnHtmlText = null;
    if (Clipboard.ContainsText(TextDataFormat.Html))
    {
        returnHtmlText = Clipboard.GetText(TextDataFormat.Html);
        Clipboard.SetText(replacementHtmlText, TextDataFormat.Html);
    }
    return returnHtmlText;
}
Of course, if you do that, you probably want to save and restore what was in the clipboard, or else you may upset your users!
