Regex matching key="value" pattern - c#

I want to match the following pattern:
key="value" key="value" key="value" key="value" ...
where key and value are [a-z0-9]+, both should be grouped (2 groups, the " - chars can be matched or skipped)
input that should not be matched:
key="value"key="value" (no space between pairs)
So far I have this (not .NET syntax):
([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)
The problem is that it matches key4="value4" in the input:
key3="value3"key4="value4"

The spec isn't very clear, but you can try:
(?<!\S)([a-z0-9]+)=("[a-z0-9]+")(?!\S)
Or, as a C# string literal:
"(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)"
This uses negative lookarounds to ensure that the key-value pair is neither preceded nor followed by non-whitespace characters.
Here's an example snippet (as seen on ideone.com):
var input = "key1=\"value1\" key2=\"value2\"key3=\"value3\" key4=\"value4\"";
Console.WriteLine(input);
// key1="value1" key2="value2"key3="value3" key4="value4"
Regex r = new Regex("(?<!\\S)([a-z0-9]+)=(\"[a-z0-9]+\")(?!\\S)");
foreach (Match m in r.Matches(input)) {
    Console.WriteLine(m);
}
// key1="value1"
// key4="value4"
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
On validating the entire input
You can use Regex.IsMatch to see if the input string matches against what should be the correct input pattern. You can also use the same pattern to extract the keys/values, thanks to the fact that .NET regex lets you access individual captures.
string[] inputs = {
"k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\"",
"k1=\"v1\" k2=\"v2\"k3=\"v3\" k4=\"v4\"",
" k1=\"v1\" k2=\"v2\" k3=\"v3\" k4=\"v4\" ",
" ",
" what is this? "
};
Regex r = new Regex("^\\s*(?:([a-z0-9]+)=\"([a-z0-9]+)\"(?:\\s+|$))+$");
foreach (string input in inputs) {
    Console.Write(input);
    if (r.IsMatch(input)) {
        Console.WriteLine(": MATCH!");
        Match m = r.Match(input);
        CaptureCollection keys = m.Groups[1].Captures;
        CaptureCollection values = m.Groups[2].Captures;
        int N = keys.Count;
        for (int i = 0; i < N; i++) {
            Console.WriteLine(i + "[" + keys[i] + "]=>[" + values[i] + "]");
        }
    } else {
        Console.WriteLine(": NO MATCH!");
    }
}
The above prints (as seen on ideone.com):
k1="v1" k2="v2" k3="v3" k4="v4": MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
k1="v1" k2="v2"k3="v3" k4="v4": NO MATCH!
k1="v1" k2="v2" k3="v3" k4="v4" : MATCH!
0[k1]=>[v1]
1[k2]=>[v2]
2[k3]=>[v3]
3[k4]=>[v4]
: NO MATCH!
what is this? : NO MATCH!
References
Is there a regex flavor that allows me to count the number of repetitions matched by the * and + operators?
Explanation of the pattern
The pattern to validate the entire input is essentially:
maybe leading
spaces        ___ end of string anchor
  |          /
^\s*(entry)+$
|          \
beginning   \__ one or more entry
of string
anchor
Where each entry is:
key=value(\s+|$)
That is, a key/value pair followed by either spaces or the end of the string.
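The decomposition above can be mirrored in code by building the full pattern from the entry sub-pattern. This is only a minimal sketch: the names KeyValueValidator, Entry, Whole and IsValid are made up for illustration and do not come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeyValueValidator
{
    // One key="value" pair followed by whitespace or the end of the string.
    // (Hypothetical name; this is the "entry" part of the explanation.)
    const string Entry = "([a-z0-9]+)=\"([a-z0-9]+)\"(?:\\s+|$)";

    // Optional leading whitespace, then one or more entries up to the end.
    static readonly Regex Whole = new Regex("^\\s*(?:" + Entry + ")+$");

    public static bool IsValid(string input)
    {
        return Whole.IsMatch(input);
    }
}

static class Program
{
    static void Main()
    {
        // Pairs separated by a space are accepted.
        Console.WriteLine(KeyValueValidator.IsValid("k1=\"v1\" k2=\"v2\""));
        // Pairs glued together are rejected.
        Console.WriteLine(KeyValueValidator.IsValid("k1=\"v1\"k2=\"v2\""));
    }
}
```

Keeping the entry in its own constant makes it easier to change what a key or value may contain without touching the anchors.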

I think SilentGhost's proposal is about using String.Split().
Like this:
String keyValues = "...";
foreach (String keyValuePair in keyValues.Split(' '))
    Console.WriteLine(keyValuePair);
This is definitely faster and simpler.
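Splitting alone does not reject a token such as key3="value3"key4="value4", so one possible way to combine it with the question's requirement is to check each token against a small anchored pattern. The sketch below is an assumption about how that could look; SplitParser, Pair and Parse are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SplitParser
{
    // A whole token must be exactly key="value".
    static readonly Regex Pair = new Regex("^([a-z0-9]+)=\"([a-z0-9]+)\"$");

    // Returns the parsed pairs, or null when any token is not a well-formed pair.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        string[] tokens = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            Match m = Pair.Match(token);
            if (!m.Success)
            {
                return null;
            }
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        var pairs = SplitParser.Parse("key1=\"value1\" key2=\"value2\"");
        foreach (var pair in pairs)
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);
        }
        // Two pairs with no space between them form one malformed token.
        Console.WriteLine(SplitParser.Parse("key3=\"value3\"key4=\"value4\"") == null);
    }
}
```

Note that this sketch splits on spaces only and returns an empty list for an empty input; both choices may need adjusting for real data.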

Use a lookbehind like you used your lookahead:
(?<=\s|^)([a-z0-9]+)=(\"[a-z0-9]+\")(?=\s|$)

I second Jens' answer (but am still too puny to comment on others' answers).
Also, I've found this Regular Expressions Reference site to be quite awesome. There's a section on Lookaround about halfway down on the Advanced page, and some further notes about Lookbehind.

Related

Regex Matchcollection groups

I have been trying for two days to solve this problem. I have a MatchCollection; the pattern contains a group, and I want a list of the values captured by that group (there are two or more of them).
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "^<tr>$?<td>$?[D-M][i-r],[' '][0-3][1-9].[0-1][1-9].[0-9][0-9]$?</td>$?<td>$?([1-9][0-2]?)$?</td>$?";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
string s = groups[1].Value;
Datum2.Text = s;
}
But only the last match (2) appears in the TextBox "Datum2".
I know that I have to use e.g. a listbox, but the Groups[1].Value is a string...
Thanks for your help and time.
Dieter
The first thing to correct in the code is that Datum2.Text = s; overwrites the text in Datum2 on every iteration, so only the last match survives when there is more than one.
Now, about your regex,
^ forces the match to start at the beginning of the string, so there is really only 1 match. If you remove it, it'll match twice.
I can't seem to understand what was intended with $? all over the pattern (just take them out).
[' '] matches either a quote or a space; the second quote is redundant, since there is no need to repeat characters in a character class.
All dots in [0-3][1-9].[0-1][1-9].[0-9][0-9] need to be escaped. A dot matches any character otherwise.
[0-1][1-9] matches all months except "10". The second character should be [0-9] (or \d). The same applies to the day: [0-3][1-9] misses 10, 20 and 30.
Code:
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
string s = "";
foreach (Match match in mc2)
{
    GroupCollection groups = match.Groups;
    s = s + " " + groups[1].Value;
}
Datum2.Text = s;
Output:
1 2
You should know that regex is not the right tool for parsing HTML. It'll work for simple cases, but for real cases do consider using HTML Agility Pack.
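Since the question asked for a list of the group's values rather than one concatenated string, a minimal sketch of that variant follows. GroupCollector and CollectGroup are hypothetical names, and the pattern uses a literal space in place of the [' ] class from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupCollector
{
    // Collects the value of one capture group from every match into a list.
    public static List<string> CollectGroup(string input, string pattern, int groupIndex)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            values.Add(match.Groups[groupIndex].Value);
        }
        return values;
    }
}

static class Program
{
    static void Main()
    {
        string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr>"
                     + "<tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
        string pattern = "<tr><td>[D-M][i-r], [0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";

        List<string> numbers = GroupCollector.CollectGroup(input, pattern, 1);
        // The list can be bound to a ListBox or joined for a TextBox.
        Console.WriteLine(string.Join(", ", numbers));
    }
}
```

With the input from the question, the list holds the two values 1 and 2.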

Escape Special Character in Regex

Is there a way to escape the special characters in regex, such as []()* and others, from a string?
Basically, I'm asking the user to input a string, and I want to be able to search in the database using regex. Some of the issues I ran into are too many )'s, an [x-y] range in reverse order, etc.
So what I want to do is write a function that does a replace on the user input, for example replacing ( with \( and [ with \[.
Is there a built-in function for regex to do so? And if I have to write a function from scratch, is there a way to account for all characters easily instead of writing the replace statements one by one?
I'm writing my program in C# using Visual Studio 2010.
You can use .NET's built in Regex.Escape for this. Copied from Microsoft's example:
string pattern = Regex.Escape("[") + "(.*?)]";
string input = "The animal [what kind?] was visible [by whom?] from the window.";
MatchCollection matches = Regex.Matches(input, pattern);
int commentNumber = 0;
Console.WriteLine("{0} produces the following matches:", pattern);
foreach (Match match in matches)
Console.WriteLine(" {0}: {1}", ++commentNumber, match.Value);
// This example displays the following output:
// \[(.*?)] produces the following matches:
// 1: [what kind?]
// 2: [by whom?]
You can use Regex.Escape for the user's input:
string matches = "[]()*";
StringBuilder sMatches = new StringBuilder();
StringBuilder regexPattern = new StringBuilder();
for (int i = 0; i < matches.Length; i++)
    // Regex.Escape leaves ']' unescaped, which would close the character class early, so escape it by hand.
    sMatches.Append(Regex.Escape(matches[i].ToString()).Replace("]", "\\]"));
regexPattern.AppendFormat("[{0}]+", sMatches.ToString());
Regex regex = new Regex(regexPattern.ToString());
foreach (Match m in regex.Matches("ADBSDFS[]()*asdfad"))
    Console.WriteLine("Found: " + m.Value);
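For the scenario in the question, where the whole user input should be treated as literal text, a minimal sketch might look like the following. SafeSearch and BuildLiteralSearch are hypothetical names, the case-insensitive option is an assumption, and the search runs against an in-memory string rather than a database.

```csharp
using System;
using System.Text.RegularExpressions;

static class SafeSearch
{
    // Builds a regex that matches the user's text literally,
    // so characters such as ( ) [ * lose their special meaning.
    public static Regex BuildLiteralSearch(string userInput)
    {
        return new Regex(Regex.Escape(userInput), RegexOptions.IgnoreCase);
    }
}

static class Program
{
    static void Main()
    {
        Regex search = SafeSearch.BuildLiteralSearch("price (usd) [a-z]*");
        Console.WriteLine(search.IsMatch("List price (USD) [a-z]* per unit"));
        Console.WriteLine(search.IsMatch("price usd a"));

        // Unbalanced parentheses would normally throw when the Regex is built;
        // after escaping they are ordinary characters.
        Regex unbalanced = SafeSearch.BuildLiteralSearch("too many)))");
        Console.WriteLine(unbalanced.IsMatch("there are too many))) here"));
    }
}
```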

Regex split and replace

I need to replace a word that starts with %.
For example Welcome to home | %brand %productName
I am hoping to split on words beginning with %, which would give me { brand, productName }.
My regex is less than average so would appreciate help with this.
The following code might help you:
string[] splits = "Welcome to home | %brand %productName".Split(' ');
List<string> lstdata = new List<string>();
for (int i = 0; i < splits.Length; i++)
{
    if (splits[i].StartsWith("%"))
        lstdata.Add(splits[i].Replace("%", ""));
}
Nothing wrong with string.split approach, mind you, but here's a regex approach:
string input = @"Welcome to home | %brand %productName";
string pattern = @"%\S+";
var matches = Regex.Matches(input, pattern);
string result = string.Empty;
for (int i = 0; i < matches.Count; i++)
{
result += "match " + i + ",value:" + matches[i].Value + "\n";
}
Console.WriteLine(result);
Try this:
(?<=%)\w+
This looks for any combination of word characters immediately preceded by a percent symbol.
Now, if you're doing search and replace on these matches, you'll probably want to remove the % sign as well, so you'd need to remove the lookbehind group and just have this:
%\w+
But in doing so, your replacement code would need to trim off the % sign from each match to get the word by itself.
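Both patterns can be tried side by side: the lookbehind form to list the names, and the %\w+ form (with a capture group added) to do the replacement. The sketch below is only an illustration; TemplateFiller, Names, Fill and the sample values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TemplateFiller
{
    // Lists the placeholder names, i.e. the word after each % sign.
    public static List<string> Names(string template)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(template, @"(?<=%)\w+"))
        {
            names.Add(m.Value);
        }
        return names;
    }

    // Replaces each %name with its value; unknown names are left untouched.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Regex.Replace(template, @"%(\w+)", m =>
        {
            string value;
            return values.TryGetValue(m.Groups[1].Value, out value) ? value : m.Value;
        });
    }
}

static class Program
{
    static void Main()
    {
        string template = "Welcome to home | %brand %productName";
        var values = new Dictionary<string, string>
        {
            { "brand", "Acme" },
            { "productName", "Widget" }
        };
        Console.WriteLine(string.Join(", ", TemplateFiller.Names(template)));
        Console.WriteLine(TemplateFiller.Fill(template, values));
    }
}
```

Capturing the name in a group avoids trimming the % sign by hand inside the replacement code.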

c# Regex question

I have a problem dealing with the @ symbol in Regex. I am trying to remove @sometext from a text string, but I can't seem to find an example anywhere that uses @ as a literal. I have tried it myself, but it doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "@text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurrences of "@sometext" from the string test with
Regex.Replace(test, "@sometext", "")
or, for any word starting with "@", you can use
Regex.Replace(test, "@\\w+", "")
If you specifically need a separate word (i.e. nothing like @comp within tom@comp.com), you can precede the regex with a hand-written word boundary (\b does not work here because @ is not a word character), using $1 as the replacement so that the character captured before the @ is kept:
Regex.Replace(test, "(^|\\W)@\\w+", "$1")
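To see the difference between the plain pattern and the one with the hand-written boundary, here is a small sketch that runs both on a string containing an e-mail address. MentionRemover and its method names are hypothetical; the second call passes $1 as the replacement so that the captured leading character is kept.

```csharp
using System;
using System.Text.RegularExpressions;

static class MentionRemover
{
    // Removes every @word, including the one inside an e-mail address.
    public static string RemoveAll(string text)
    {
        return Regex.Replace(text, "@\\w+", "");
    }

    // Removes @word only at the start or after a non-word character;
    // $1 puts the captured leading character back.
    public static string RemoveStandalone(string text)
    {
        return Regex.Replace(text, "(^|\\W)@\\w+", "$1");
    }
}

static class Program
{
    static void Main()
    {
        string text = "@bob please mail tom@comp.com";
        Console.WriteLine("[" + MentionRemover.RemoveAll(text) + "]");
        Console.WriteLine("[" + MentionRemover.RemoveStandalone(text) + "]");
    }
}
```

The first call damages the address (tom.com), while the second leaves tom@comp.com intact and removes only the leading @bob.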
You can use:
^\s@([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove @something from this string: I want to remove @something from this string.
var regex = new Regex("@\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B@\w+
This will match any string starting with an @ character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
    @"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
    @"\B@\w+",
    @"redacted");
is the following string:
redacted redacted this2@3that redacted redacted@beta@gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
    @"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
    @"(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}",
    @"$1redacted",
    RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

Regex filter &quot; with &lt;&gt; tags included

I am having problems with some regex code. Can anyone help?
I have the following string of data:
abcd &quot; something code &quot; nothing &quot;f &lt;b&gt; cannot find this section &lt;/b&gt; &quot;
I want to find the sections between the &quot; quotes.
I can get it to work fine using the following regex:
foreach (Match match in Regex.Matches(sourceLine, @"&quot;((\\&quot;)|[^&quot;(\\&quot;)])+&quot;"))
However, if the section between the quotes contains &lt;&gt;, it does not find the section. I am not sure what to do to include the &lt;&gt; tags in the regex.
Thanks for your time.
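public List<string> Parse(string input)
{
    List<string> results = new List<string>();
    bool startSection = true;
    int startIndex = 0;
    // Group 1 is the character before the delimiter (or the start of the input),
    // which must not be a backslash; group 2 is the &quot; delimiter itself.
    foreach (Match m in Regex.Matches(input, @"(^|[^\\])(&quot;)"))
    {
        int delimiterIndex = m.Groups[2].Index;
        if (startSection)
        {
            startSection = false;
            // this delimiter opens a section; its content starts right after it
            startIndex = delimiterIndex + "&quot;".Length;
        }
        else
        {
            // this delimiter closes the section; the next match opens a new one
            startSection = true;
            results.Add(input.Substring(startIndex, delimiterIndex - startIndex));
        }
    }
    return results;
}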
A character class […] describes a set of allowed characters and a negated character class [^…] describes a set of disallowed characters. So [^&quot;(\\&quot;)] means any character except &, q, u, o, t, ;, (, \, and ). It does not mean "anything except &quot; or (\\&quot;)".
Try this instead:
&quot;(.*?)&quot;
The ungreedy quantifier *? matches as little as possible, as opposed to the greedy quantifier *, which matches as much as possible.
You can use HttpUtility.HtmlDecode to convert this text to normal characters.
Then using a regex to extract text between the double quotes would be simple.
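A minimal sketch of that two-step approach follows. It swaps in System.Net.WebUtility.HtmlDecode for HttpUtility.HtmlDecode so that no reference to System.Web is needed; QuotedSections and Extract are hypothetical names, and escaped quotes (\") inside a section are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class QuotedSections
{
    // Decodes HTML entities first, then extracts the text between double quotes.
    public static List<string> Extract(string encoded)
    {
        string decoded = WebUtility.HtmlDecode(encoded);
        var sections = new List<string>();
        // The lazy quantifier stops at the first closing quote.
        foreach (Match m in Regex.Matches(decoded, "\"(.*?)\""))
        {
            sections.Add(m.Groups[1].Value);
        }
        return sections;
    }
}

static class Program
{
    static void Main()
    {
        string data = "abcd &quot; something code &quot; nothing "
                    + "&quot;f &lt;b&gt; cannot find this section &lt;/b&gt; &quot;";
        foreach (string section in QuotedSections.Extract(data))
        {
            Console.WriteLine("[" + section + "]");
        }
    }
}
```

After decoding, the tags are ordinary < and > characters, so the second section is found like any other.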
