C# Matching a recursive structure in a string - c#

I'm currently working on a program that loads up a text file, searches through it to find a specific structure, and then replaces a certain part of that structure with different text.
The structure I need to find and extract is "N"(N) where N is any character. For example. Lets say I had a text file like this:
Everyone knows the saying "Do not do more than you can do" (Jim Doe).
Well, I'm here to tell you that this saying is awesome. Here is
another, "The sky is blue and the sun is bright" (Sally Wantsmore).
I would want to be able to match the text "Do not do more than you can do" (Jim Doe) along with "The sky is blue and the sun is bright" (Sally Wantsmore).
I don't think there is really a way to do this with a regular expression from the best of my knowledge. I've been trying for the last few days. I can't get it to work, it's a recursive pattern by nature. My question is, how would I go about writing C# code to parse through and find these patterns. I would like to do something where I can find the position within the string and the length, that way I can then extract it from the string.
EDIT
I need to be able to capture all characters in the quote. This means that there could also be another set of quotes within the quote and even another set of parenthesis. This means that the structure could also contain a match within itself.
I'm now trying to use this expression because I need to be able to capture all characters within a quote: \"(.+)\" \(([\w ]+)\)
The listed answers below both work. However, I've discovered a limitation. There is a possible recursive structure to this. The problem I am currently having is when there is a "N"(N) inside of a "N"(N)". For example:
"Random quote" (random person) Here is a fun saying, "The sky is blue and
the sun is bright, some even say "really bright" (others)" (Sally
Wantsmore).
This presents many problems. There is only one match because it takes the very first ", and then finds the last " just after (others) and finds the set of parens for (Sally Wantsmore) and only finds that match. However, I desire for it to find all the matches, especially the beginning one and last one separably, and even the inner quote. Is this possible with Regular expressions? If not, how do I go about solving this with Recursive c# code.

The following regex should find the two things you're looking for:
\"([\w ]+)\" \(([\w ]+)\)
In C# you can use Regex.Match to retrieve the two items in brackets.

An example on how you could have it in C#:
var quotes = Regex.Matches(#"Everyone knows the saying ""Do not do more than you can do"" (Jim Doe). Well, I'm here to tell you that this saying is awesome. Here is another, ""The sky is blue and the sun is bright"" (Sally Wantsmore).",
"(?<Quotes>\"(?<Text>[\\w ]+)\\\" \\((?<Author>[\\w ]+)\\))", RegexOptions.Singleline);
foreach (Match quote in quotes)
{
var text = quote.Groups["Text"].Value;
var author = quote.Groups["Author"].Value;
Console.WriteLine($"Text: {text}, Author: {author}");
}

Related

Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This has worked pretty smoothly until I have reached my corner case of companies with names that end in s. I end up with issue s where sometimes it creates a s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not. It seems that Word is using an different character for and apostrophe(') than the standard one that is created when I use the key on my keyboard in Visual Studio. If I write a find and replace using my keyboard it will not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if the is a proper way of doing this. I tried "&apos;" but that does not work. I just want to know if using the copied character from Word is the proper way of doing this, and is there a way to do it so that both characters work so I don't have an issue with docs that may be created with a different program.
The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
I ran in the very same issue before and I used this as regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to:
Regex apostropheReplace = new Regex("s\\[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s\'")
I found these were the five most common type of single quotes and apostrophes used.
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s\\['’]s");
docText = apostropheReplace.Replace(docText, "s\'")
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as #Leonardo-Seccia points out, there are other character encodings that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
new Regex("s\\['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s\'")
But you can certainly add other character encodings as needed. A more complete list of character encodings can be found here - to add them to the above Regex, simply change the "U+" to "u" and add it to the list after another "\" character. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the RegEx string from
Regex("s\\['\u2018\u2019\u201A\u201b]s")
to
Regex("s\\['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of what character encodings are the most "proper" for inclusion in your Regex based on your use cases.

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
Can collect text along the string (.+ style) followed by a lookahead check for what's been captured up to that point, so what would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my #lines = (
q(It just wasn't able just wasn't able no matter how hard it tried.),
q(This has no repetitions.),
q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (#lines) {
if (/$re_rep/) {
# Other conditions/filtering on $1 (the capture) ?
say $1
}
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), what is changed easily if needed; the question is whether then some legitimate repetitions could get flagged.
The program above prints
just wasn't able
C:\test
Note on regex This quest, to find repeated text, is much too generic
as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word that that is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for what we need to know about data.

Regular Expression for Digits and Special Characters - C#

I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of string and I use that data in my program.
Sometimes the data I get includes multiple details in the single string. As the name of this Movie "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)" rather than the first one.
Above is the one example from many. In order to correct the issue I tried to use string.Distinct() method but it would remove all the duplicate characters in the string as in above example it would return "Dog Eats (2012)". Now it solved my initial problem by removing the 2nd (2012) but created a new one by changing the actual title.
I thought my problem could be solved with Regex but I have no idea as to how I can use it here. As far as I know if I use Regex it would tell me that there are duplicate items in the string according to the defined Regex code.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)".
Now the actual title is "Meme 2013" with year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has duplicate year, I cant think of any method to actually remove the duplicate substring.
The duplicate year always comes in the end of the string. So what should be the Regex that I would use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify the string contains duplicate maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is any better way to do it please let me know.
Thanks.
The right regex is "( \(\d{4}\))\1+".
string pattern = #"( \(\d{4}\))\1+";
new Regex(pattern).Replace(s, "$1");
Example here : https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones.
( \(\d{4}\)) does the capture, \1+ finds any non empty sequence of that captured block
Finally, replace the initial block and its copies by the initial block alone.
This regex will allow for any pattern of whitespace, even none, as in (2013)(2013)
`#"(\(\d{4}\))(?:\s*\1)+"`
I have a demo of it here

C# Comparing two strings, one with unique indenifiers

First of all I'd like to mention that I'm new to programming and this sight so I'm still an infant in this world, however, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain, 'd' is for numbers or decimal and 'h' for hex. There are also 's' and 'a' associated to ASCII.
What's supposed to happen is that the first string can have any different number which can be of different length and/or Hex when the data comes in but the rest of the message stays the same, E.G.
I have 1500 cats and their fur is #000000
the code will still match the two strings as True matches as it'll effectively ignore anything that is an int and hex. (this identifiers are User defined so they can be anywhere in any string).
The end game is that if it finds a relative match the code will change the colour of the text in the app among other things. it's basically to highlight errors in a log file.
I've searched High an low on Stackflow and looked into Regex and string comparisons. I'm currently going to make a start on the code, however, would like some input/help.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.
If I understand it correctly I think I would solve this by replacing the <d> etc. by a RegEx expression. Then use that RegEx to replace the values by an empty string. That way you can compare them without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.

C# Regex Replace ignore specific string

Since this is my first question here on stackoverflow I hope my question is correctly asked.
Basicly I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pro's will see without executing it.
Now I hope that some regex pro could help me out here and provide me a search pattern which replaces the normal text but ignores the timestamp.
I'm not realy aware of the logic from regex until now, nevertheless I understand the single expressions so this would be a hook for me to understand Regex more properly.
Thanks in advice.
If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easist way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, #"(?<!<!--.*?)" + input, replace);
What you're doing is attempting to replace a repetition of any length of a character that is NOT <+.*> or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to #stema

Categories

Resources