Why the result of the matche is different from the expression? - c#

The Leetcode have a question:"Given a List of words, return the words that can be typed using letters of alphabet on only one row's of American keyboard.". To solve this, I try to using regular expression in C# like this:
public string[] FindWords(string[] words)
{
return words.Where(x => Regex.Match(
x, #"[qwertyuiop]*|[asdfghjkl]*|[zxcvbnm]*",
RegexOptions.IgnoreCase).Value == x).ToArray();
}
But still cannot get right.For example, when the input like:
["a", "b", "p", "hello"]
I can only get "p" returned.
Where am I doing wrong?

Your regex pattern is a bit off for what you are trying to achieve. Let's look at it and analyze it.
First, we need to indicate that we're actually trying to match a word, which has a start and an end. It means that we need to prepend the regex with an ^ and add $ at the end to indicate string start and end.
Then we need to make sure that we actually have a word, which means there's at least one character. To enforce "one or more character" rule we will need to use + quantifier instead of *.
Lastly, the Regex pattern you're trying to use does not ensure that we are using characters from only one row. It does ensure that for each capturing group (sections between the OR operator) but we end up having as many capturing groups as there are scenarios that should invalidate the string. Which basically means that the following word will still validate:
today
The Regex will match three capturing groups: "to", "da" and "y". Instead, we need to explicitly set the grouping.
I've ended up with the following pattern:
^([qwertyuiop]+|[asdfghjkl]+|[zxcvbnm]+)$

Related

C# regex to match words containing known substrings and not equal to specific keywords

I need to verify if a string contains "error" or "exception" in it, excluding certain keywords: "exception1", "exception2", "includeException", "error1".
This regex seems to do the job:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w*\b
It correctly returns 2 matches when run against the following string:
Test string: "exception1 exception2 exception3 includeException error1 error2"
Matches: "exception3", "error2"
However, if I set the RegexOptions.IgnoreCase flag or add "(?i)" at the beginning of the Regex it also returns a match for "includeException".
What am I missing here?
Using a good Regex tester can help you figure out what's actually being matched. I used this one:
http://regexhero.net/tester/
In the results where it highlights the matches, there is a small button with an 'i' for information. So the reason that it's matching innerException when it's case insensitive is because you are matching the latter half of the word. The regex doesn't require white space separating the words.
Your regex would match with case invariant off if innerException were written as innerexception because your positive match (exception|error) is matching the last half. You can also see that when you start removing spaces. exception1exception2 doesn't match, but exception1exception2exception3 does.
While Regex is very compact, there are several ways to get it wrong. A straightforward approach might be a better solution in this case.
Changing your regex to remove the last wildcard * characters will make what you have work the way you want:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)(exception|error)\w\b
I see two main bottlenecks with your regex:
It has several unanchored lookaheads (when unanchored, they usually do not help unless used in a tempered greedy token and other complex patterns)
The \w* subpatterns are placed on both sides of lookaheads, thus, removing any impact from the lookaheads.
The problem with case-insensitivity is described in Berin's answer, you want to match the word exception and includeException contains that substring. So, a possible solution is to add a leading word boundary to (error|exception) pattern:
\b\w*(?!exception1)(?!exception2)(?!includeException)(?!error1)\b(exception|error)\w*\b
^^
However, if you need to match words containing error or exception, that ARE NOT EQUAL to specific keywords, use
\b(?!(?:exception1|exception2|includeException|error1)\b)\w*(exception|error)\w*\b
Here, the lookaheads are anchored to the leading word boundary, they are only checked once after each word boundary, not at each position inside a word. Certainly, you can contract it further: \b(?!(?:exception[12]|includeException|error1)\b)\w*(exception|error)\w*\b.
Now, if you need to match words containing error or exception, that DO NOT CONTAIN specific keywords, use
\b(?!\w*(?:exception1|exception2|includeException|error1))\w*(exception|error)\w*\b
All regex patterns used here are tested at regexhero.net
Regex is not very readable... how about a pure C# solution?
public static Boolean ContainsErrorOrExceptionExcept(this string input, string[] excludedKeywords)
{
if (input.Contains("error") || input.Contains("exception"))
{
foreach (string x in excludedKeywords)
{
if (input.Contains(x))
{
return false;
}
}
return true;
}
else
{
return false;
}
}

Regex to extract string between quotes

I'm trying to extract a string between two quotes, and I thought I had my regex working, but it's giving me two strings in my GroupCollection, and I can't get it to ignore the first one, which includes the first quote and ID=
The string that I want to parse is
Test ID="12345" hello
I want to return 12345 in a group, so that I can manipulate it in code later. I've tried the following regex: http://regexr.com/3bgtl, with this code:
nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
The problem is that the GroupCollection contains two entries:
ID="12345
12345
I just want it to return the second one.
Use positive lookbehind operator:
GroupCollection ids = Regex.Match(nodeValue, "(?<=ID=\")[^\"]*").Groups;
You also used a capturing group (the parenthesis), this is why you get 2 results.
There are a few ways to accomplish this. I like named capture groups for readability.
Regex with named capture group:
"(?<capture>.*?)"
And your code would be:
match.Groups["capture"].Value
Your code is totally OK and is the most efficient from all the solutions suggested here. Capturing groups allow the quickest and least resource-consuming way to match substrings inside larger texts.
All you need to do with your regex is just access the captured group 1 that is defined by the round brackets. Like this:
var nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
Console.WriteLine(ids[1].Value);
// or just on one line
// Console.WriteLine(Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups[1].Value);
See IDEONE demo
Please have a look at Grouping Constructs in Regular Expressions:
Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string. You can use grouping constructs to do the following:
Match a subexpression that is repeated in the input string.
Apply a quantifier to a subexpression that has multiple regular expression language elements. For more information about quantifiers, see [Quantifiers in Regular Expressions][3].
Include a subexpression in the string that is returned by the [Regex.Replace][4] and [Match.Result][5] methods.
Retrieve individual subexpressions from the [Match.Groups][6] property and process them separately from the matched text as a whole.
Note that if you do not need overlapping matches, capturing group mechanism is the best solution here.

RegEx : Find match based on 1st two chars

I am new to RegEx and thus have a question on RegEx. I am writing my code in C# and need to come up with a regex to find matching strings.
The possible combination of strings i get are,
XYZF44DT508755
ABZF44DT508755
PQZF44DT508755
So what i need to check is whether the string starts with XY or AB or PQ.
I came up with this one and it doesn't work.
^((XY|AB|PQ).){2}
Note: I don't want to use regular string StartsWith()
UPDATE:
Now if i want to try a new matching condition like this -
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
How to write the RegEx for that?
You can modify you expression to the following and use the IsMatch() method.
Regex.IsMatch(input, "^(?:XY|AB|PQ)")
The outer capturing group in conjuction with . (any single character) is trying to match a third character and then repeat the sequence twice because of the range quantifier {2} ...
According to your updated edit, you can simply place "ZF" after the grouping construct.
Regex.IsMatch(input, "^(?:XY|AB|PQ)ZF")
You want to test for just ^(XY|AB|PQ). Your RegEx means: Search for either XY, AB or PQ, then a random character, and repeat the whole sequence twice, for example "XYKPQL" would match your RegEx.
This is a screenshot of the matches on regex101:
^ forces the start of line,
(...) creates a matching group and
XY|AB|PQ matches either XY, AB or PQ.
If you want the next two characters to be ZF, just append ZF to the RegEx so it becomes ^(XY|AB|PQ)ZF.
Check out regex101, a great way to test your RegExes.
You were on the right track. ^(XY|AB|PQ) should match your string correctly.
The problem with ^((XY|AB|PQ).){2} is following the entire group with {2}. This means exactly 2 occurrences. That would be 2 occurrences of your first 2 characters, plus . (any single character), meaning this would match strings like XY_AB_. The _ could be anything.
It may have been your intention with the . to match a larger string. In this case you might try something along the lines of ^((XY|AB|PQ)\w*). The \w* will match 0 or more occurrences of "word characters", so this should match all of XYZF44DT508755 up to a space, line break, punctuation, etc., and not just the XY at the beginning.
There are some good tools out there for understanding regexes, one of my favorites is debuggex.
UPDATE
To answer your updated question:
If string starts with "XY" or "AB" or "PQ" and 3rd character is "Z" and 4th character is "F"
The regex would be (assuming you want to match the entire "word").
^((XY|AB|PQ)ZF\w*)
Debuggex Demo

Can Regular Expressions Achieve This?

I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = #"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
Use a token parsor to split into tokens. Use regex to find a string patterns
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and also alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitely state in the expression. So ((\w+)+\b) could theoretically match all words one-by-one. The outer group will correctly match the whole text, and also the inner group will match the words separately correctly, but you will only be able to reference the last match.
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, then it matches until a non-quote is found. Otherwise looks at word characters. Your results are in groups named "quot" and "words".
You'll have hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, #"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky - .net returns a collection of Matchs. Each Match has several Groups - the first Group has the whole string ('hello world'), but the rest have sub-matches (',hello world,'). Also, you get many empty unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
from g in match.Groups.Cast<Group>().Skip(1)
where g.Success
select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, #"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the brackets in the pattern to make sure Regex.Split returns those too)
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds 1 or more single quotes at the beginning and end of a string. It then finds 1 or more characters in the a-z set (if you don't set it to be case insensitive it will only find lower case characters). It groups these so that group 1 has the ', group 2 (or more) has the words which are split by anything that is not a character a - z and the last group has the single quote if it exists.

Extending [^,]+, Regular Expression in C#

Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurance of a ",". What i want to know is say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their statemachine implementation. Doesn't it work in the same way.
If a string starts with Value_name followed by one or more whitespaces. Go to Next State. In That State read a word until a "," comes. Then do it again! And each word will be grouped!
Am i wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modified (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ,, and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In you trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a strings, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(#"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
for (int i = 1; i < m.Groups.Count; i++) {
Group g = m.Groups[i];
if (g.Success) {
// matched text: g.Value
// match start: g.Index
// match length: g.Length
}
}
m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes, first grab all the fields seperated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoreticly grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);

Categories

Resources