Stop regex from spanning across unrequired content - c#

I need to extract a series of meaningful values from a file. The basic pattern for the values I need to match looks like:
"indicator\..+?"\[true\]
Unfortunately, in places this is spanning across quite a bit of content to get a true match, and the lazy quantifier (?) is not being as lazy as I'd like.
How do I modify the above so that out of the following:
"indicator.value here"[false],"other content","more other
content","indicator don't match this one because the full stop is missing"[true],"indicator.this is the
value I want matched"[true]
only this value is returned: "indicator.this is the value I want matched"[true]
Currently, that whole string is being returned by my above regex.

Assuming commas are the delimiter - simply avoid matching on them:
@"""indicator\.[^,]+?""\[true\]"
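A minimal sketch of how that comma-excluding pattern might be used, assuming the values themselves never contain a comma. The class and method names (`IndicatorDemo`, `FirstMatch`) are hypothetical; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IndicatorDemo
{
    // [^,]+? can match any character except a comma (line breaks included),
    // so a match can never run across a field delimiter.
    private static readonly Regex Pattern =
        new Regex(@"""indicator\.[^,]+?""\[true\]");

    // Returns the first matching entry, or null when nothing matches.
    public static string FirstMatch(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Value : null;
    }
}
```

Because the negated class also accepts line breaks, this should still find the last entry of the sample even though its value wraps onto a second line.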

Try using "indicator\.(.*)?"\[true\] instead and see if that helps. I think the lazy only applies to the * operator. I vaguely remember having this issue years ago.

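You can use the discard technique: match the pattern you don't want in one alternative, and capture the pattern you do want in another. For example:
"indicator\..+?"\[false\]|"indicator\.(.+?)"\[true\]
The alternative before the | matches the [false] entries so they can be discarded; the alternative after it captures the wanted value in group 1.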
Match information
MATCH 1
1. [150-182] `this is the value I want matched`
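In C#, the discard technique comes down to checking whether group 1 took part in each match. The sketch below assumes the input sits on a single line (without RegexOptions.Singleline, `.` does not match a line break); `DiscardDemo` and `WantedValues` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DiscardDemo
{
    // Left alternative: consumes [false] entries without capturing anything.
    // Right alternative: captures the value of a [true] entry in group 1.
    private static readonly Regex Pattern = new Regex(
        @"""indicator\..+?""\[false\]|""indicator\.(.+?)""\[true\]");

    public static List<string> WantedValues(string input)
    {
        var values = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // Group 1 succeeds only when the right alternative matched.
            if (m.Groups[1].Success)
            {
                values.Add(m.Groups[1].Value);
            }
        }
        return values;
    }
}
```

Matches produced by the left alternative are simply skipped, which is what "discarding" means here.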

Related

How to allow only one / in regex

I have a regex that looks like this: "[a-æøåA-ÆØÅ0-9-/().\s]{1,100}$". I would like to allow ONE / in user input from a textbox, e.g. "3/4 inch fitting bla bla".
How can I do that in a safe way, and is it safe at all?
My code looks something like this:
XmlElement bemærkning = xmldoc.CreateElement("Bemærkning");
bemærkning.InnerText = txtBemærkningWT.Text;
//XmlElement usernamePCxml = xmldoc.CreateElement("UsernamepcXML");
//usernamePCxml.InnerText = usernamePC.ToString();
parentelement.AppendChild(type);
parentelement.AppendChild(art);
parentelement.AppendChild(l);
parentelement.AppendChild(bemærkning);
parentelement.AppendChild(varenummer);
parentelement.AppendChild(opretter);
parentelement.AppendChild(date);
//parentelement.AppendChild(usernamePCxml);
xmldoc.DocumentElement.AppendChild(parentelement);
xmldoc.Save(Server.MapPath(map));
Since you also check the total number of characters with {1,100}, there is no simple solution (that I can think of) using a single regex. The easiest way is probably to do a separate check, either for the occurrence of / or for the overall length. If you check the total length separately, you could use a regex like this:
"^[a-æøåA-ÆØÅ0-9-().\s]*\/?[a-æøåA-ÆØÅ0-9-().\s]*$"
Notice that I added a ^ at the beginning to anchor the match at the start of the input. I don't know if this is necessary in your case, but it probably is.
Regarding safety: you are using InnerText, which escapes any markup contained in the input; InnerXml does not. So you should be fine.
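A sketch of the two-step check described above: length first, then the "at most one slash" regex. `RemarkValidator` and `IsValid` are hypothetical names. One deliberate change from the pattern in the answer: the letters are written as a-z plus æøå, because a range such as a-æ runs from U+0061 to U+00E6 and so also admits characters like { | } ~. Treat that substitution as an assumption about the intended alphabet.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RemarkValidator
{
    // Allowed characters on both sides of an optional single '/'.
    // The hyphen is placed last in each class so it is taken literally.
    private static readonly Regex Pattern = new Regex(
        @"^[a-zæøåA-ZÆØÅ0-9().\s-]*/?[a-zæøåA-ZÆØÅ0-9().\s-]*$");

    public static bool IsValid(string text)
    {
        // Length is checked separately, replacing the {1,100} quantifier.
        if (string.IsNullOrEmpty(text) || text.Length > 100)
        {
            return false;
        }
        return Pattern.IsMatch(text);
    }
}
```

With this sketch, `RemarkValidator.IsValid("3/4 inch fitting bla bla")` returns true, while `"3/4/5"` is rejected because of the second slash.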

Regular Expression for Digits and Special Characters - C#

I use Html-Agility-Pack to extract information from some websites. In the process I get data in the form of string and I use that data in my program.
Sometimes the data I get includes duplicated details in a single string, such as the movie name "Dog Eats Dog (2012) (2012)". The name should have been "Dog Eats Dog (2012)".
Above is the one example from many. In order to correct the issue I tried to use string.Distinct() method but it would remove all the duplicate characters in the string as in above example it would return "Dog Eats (2012)". Now it solved my initial problem by removing the 2nd (2012) but created a new one by changing the actual title.
I thought my problem could be solved with Regex but I have no idea as to how I can use it here. As far as I know if I use Regex it would tell me that there are duplicate items in the string according to the defined Regex code.
But how do I remove it? There can be a string like "Meme 2013 (2013) (2013)".
Now the actual title is "Meme 2013" with year (2013) and the duplicate year (2013). Even if I get a bool value indicating that the string has a duplicate year, I can't think of any method to actually remove the duplicate substring.
The duplicate year always comes in the end of the string. So what should be the Regex that I would use to determine that the string actually has two years in it, like (2012) (2012)?
If I can correctly identify the string contains duplicate maybe I can use string.LastIndexOf() to try and remove the duplicate part. If there is any better way to do it please let me know.
Thanks.
The right regex is "( \(\d{4}\))\1+".
string pattern = @"( \(\d{4}\))\1+";
s = new Regex(pattern).Replace(s, "$1");
Example here : https://repl.it/Evcy/2
Explanation:
Capture one " (dddd)" block, and remove all following identical ones.
( \(\d{4}\)) does the capture, \1+ finds any non empty sequence of that captured block
Finally, replace the initial block and its copies by the initial block alone.
This regex will allow for any pattern of whitespace, even none, as in (2013)(2013)
@"(\(\d{4}\))(?:\s*\1)+"
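For illustration, the whitespace-tolerant pattern can be wrapped in a small helper. `TitleCleaner` and `Clean` are hypothetical names; the pattern is the one from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // Group 1 captures one "(dddd)" block; (?:\s*\1)+ matches one or more
    // repeats of that same block, with optional whitespace in between.
    private static readonly Regex RepeatedYear =
        new Regex(@"(\(\d{4}\))(?:\s*\1)+");

    // Strings are immutable, so the cleaned title is returned, not modified in place.
    public static string Clean(string title)
    {
        return RepeatedYear.Replace(title, "$1");
    }
}
```

`TitleCleaner.Clean("Meme 2013 (2013) (2013)")` returns "Meme 2013 (2013)": the bare 2013 in the title is left alone because it is not wrapped in parentheses.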

c# regex - changing pattern matches until find specific word

Usually I can work around problems and get everything working by myself, but this one is tricky, and the MSDN references and examples confuse more than they help.
I have been testing some code and am stuck on combining a capturing group (for the replacement) with a non-capturing group (to stop matching where I want).
A simplified version of what I want to change is:
stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
valor = "value:8080";
I know that if I use
pattern = @"value:(\d+)";
I can change every value number to 8080 with
Regex.Replace(stats, pattern, valor);
but I need it to stop replacing once it finds the string 'something'.
I managed to change everything to 'valor' up to the point where it finds 'something' using
pattern = @"^(?:(?!something).)*";
Is there a way to change only the 'value:(\d+)' numbers to 'valor', combined with ?:(?!something) so that matching stops, in the same pattern?
I've seen lots of examples but none covered this, so I don't know if it's possible to combine both conditions.
You can make use of a look-behind solution that makes sure there is no something before the value:
(?<!\bsomething\b.*)value:\d+
Note that something is matched as a whole word due to \b word boundaries.
The result of the replace operation:
label:100,value:8080,label:110,value:8080,something,label:200,value:8888
Note that (?:(?!something).) is very inefficient and should be used only when no other means work. In .NET, there is a powerful variable-width look-behind, which is the right tool for this task.
Also note that if you are not using capture group backreferences, you do not need those capturing groups in your pattern (I removed the parentheses from around \d+).
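A short sketch of the look-behind approach applied to the strings from the question. `StatsRewriter` and `ReplaceBeforeMarker` are hypothetical names; the pattern is the one given above and relies on .NET's support for variable-width look-behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatsRewriter
{
    // Matches value:<digits> only when the whole word "something"
    // does not appear anywhere earlier on the same line.
    private const string Pattern = @"(?<!\bsomething\b.*)value:\d+";

    public static string ReplaceBeforeMarker(string stats, string replacement)
    {
        return Regex.Replace(stats, Pattern, replacement);
    }
}
```

Calling `StatsRewriter.ReplaceBeforeMarker(stats, valor)` with the question's `stats` and `valor` rewrites the first two values and leaves the one after 'something' untouched.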

how to replace the exact word by another in a list?

I have a list like :
george fg
michel fgu
yasser fguh
I would like to replace fg, fgu, and fguh with "fguhCool". I already tried something like this:
foreach (var ignore in NameToPoulate)
{
tempo = ignore.Replace("fg", "fguhCool");
NameToPoulate_t.Add(tempo);
}
But then "fgu" becomes "fguhCoolu" and "fguh" becomes "fguhCooluh". Is there a better way?
Thanks for your help.
I assume that this is a homework assignment and that you are being tested on the specific algorithm rather than on any code that does the job.
This is probably what your teacher has in mind:
Students will realize that the code should check for "fguh" first, then "fgu" then "fg". The order is important because replacing "fg" will, as you have noticed, destroy a "fguh".
This will by some students be implemented as a loop with if-else conditions in them. So that you will not replace a "fg" that is within an already replaced "fguhCool".
But then you will find that the algorithm breaks down if "fg" and "fgu" are both within the same string. You cannot let the presence of "fgu" prevent you from checking for "fg" in a different part of the string.
The answer that your teacher is looking for is probably that you should first locate "fguh", "fgu" and "fg" (in that order) and replace them with an intermediary string that doesn't contain "fg". Then after you have done that, you can search for that intermediary string and replace it with "fguhCool".
You could use regular expressions:
tempo = Regex.Replace(ignore, @"\bfg\b", "fguhCool");
The \b matches a so-called word boundary, which means it matches the beginning or end of a word (roughly, but enough for this purpose).
Use a regular expression:
tempo = Regex.Replace(ignore, "fg(uh?)?", "fguhCool");
An alternative would be replacing the long words for the short ones first, then replacing the short for the end value (I'm assuming all words - "fg", "fgu" and "fguh" - would map to the same value "fguhCool", right?)
tempo = ignore
.Replace("fguh", "fg")
.Replace("fgu", "fg")
.Replace("fg", "fguhCool");
Note: That assumes those words can appear anywhere in the string. If you're worried about whole words (i.e. cases where those words are not substrings of a bigger word), then see @Joey's answer (in this case, simple substitutions won't do; regexes are really the best option).
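The two regex suggestions can be combined: word boundaries from one, the optional "u"/"uh" suffix from the other. The sketch below assumes the three codes should only be replaced when they stand as whole words; `NameFixer` and `Fix` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameFixer
{
    // \b...\b restricts the match to whole words; (?:uh?)? lets the same
    // pattern cover "fg", "fgu" and "fguh" in one pass.
    private static readonly Regex Code = new Regex(@"\bfg(?:uh?)?\b");

    public static List<string> Fix(IEnumerable<string> names)
    {
        var result = new List<string>();
        foreach (string name in names)
        {
            result.Add(Code.Replace(name, "fguhCool"));
        }
        return result;
    }
}
```

Since "fguhCool" has no word boundary after "fguh", running `Fix` on already-fixed names leaves them unchanged.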

Regex match a CSV file

I am trying to create a regex to match a CSV file of records in the form of:
optional value, , ,, again some value; this is already, next record;
Now there is an upper limit of commas (10) separating the attributes of each record, and an unlimited number of ; separating the records. Values might or might not be present. I am inexperienced with regex and my efforts have been futile so far. Please help. If necessary, I will include more details.
EDIT
I want to verify that the file is in the required form and get the number of records in it.
Do you really need to use regular expressions for this? Might be a little bit overkill. I'd just perform one String.Split() to get the records, then another String.Split() on each record to get the values. Also rather easy to get the number of elements etc. then.
If you really want to use Regexps, I'd use two steps again:
/(.*?);/ to get the datasets;
/(.*?)[,;]/ to get the values.
Could probably be done with one regexp as well but I'd consider this overkill (as you'd have to find the sub matches etc. identify their parent record, etc.).
Escaped characters would be another thing but rather similar to do: e.g. /(.*?[^\\]);/
try this
bool isvalid = csv.Split(';')
.Select(c => c.Split(',')
.Count())
.Distinct()
.Count() == 1;
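For the stated goal (validate the format and count the records), a Split-based sketch might look like the following. It assumes that "valid" means no record has more than 10 commas, and that blank text between or after semicolons does not count as a record; both are interpretations of the question, not confirmed requirements. `CsvCheck` and `TryCountRecords` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class CsvCheck
{
    private const int MaxCommas = 10; // upper limit stated in the question

    // Returns false if any record has too many commas; otherwise sets
    // count to the number of non-blank records.
    public static bool TryCountRecords(string csv, out int count)
    {
        count = 0;
        foreach (string record in csv.Split(';'))
        {
            if (record.Trim().Length == 0)
            {
                continue; // e.g. the empty piece after the final ';'
            }
            if (record.Count(c => c == ',') > MaxCommas)
            {
                count = 0;
                return false;
            }
            count++;
        }
        return true;
    }
}
```

For the sample line in the question this reports two records.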
Reminds me of the famous article from Coding Horror: Regular Expressions: Now You Have Two Problems.
FileHelpers saved my day when dealing with CSV or other text format.
