Regex to match a repeated set of possible characters - c#

I'm writing a short text token replacement system which takes the form:
$varName(opt1|opt2|opt3)
It's designed to easily swap out things based on arbitrary values like this:
$gender(he|she)
I figured the best way to find and process those tokens was a regex that matches the pattern, but I can't figure out how to recognise the options between the brackets, because they can repeat an arbitrary number of times and there are fewer pipe characters than options.
Any help?
(I'm using C# as the regex host)
EDIT:
I tried this, but it only seems to work with exactly two options:
\$[a-zA-Z]+\(([a-zA-Z]+\|)+[a-zA-Z]+\)

Something like this should work:
string text = "$gender(he|she|it|alien)";
string pattern = @"\$(\w+)\(([\w\|]*)\)";
Match match = Regex.Match(text, pattern);
string varName = match.Groups[1].Value;
string[] values = match.Groups[2].Value.Split('|');
Console.WriteLine(varName + ": ");
foreach (string value in values)
{
Console.WriteLine(" " + value);
}
This is what it prints out:
gender:
he
she
it
alien
varName has the name of the variable, and then values is an array of strings containing each option.
However, if you put in something like "$gender()" with no values, or "$gender(he|she|)" with an extra pipe on the end, you'll get empty strings in the result. If that might be a problem, try this:
string[] values = match.Groups[2].Value.Split('|').Where((s) => !string.IsNullOrEmpty(s)).ToArray();
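To go from matching a token to actually replacing it, one option is Regex.Replace with a match evaluator. This is only a minimal sketch: the question does not say how a value selects an option, so it assumes each variable maps to a zero-based option index. The TokenExpander class, the Expand method and the index-based lookup are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenExpander
{
    // $name(opt1|opt2|...): group 1 is the variable name, group 2 the raw option list.
    private static readonly Regex Token = new Regex(@"\$(\w+)\(([\w|]*)\)");

    // Assumption: 'choices' maps a variable name to the zero-based index of the option to use.
    public static string Expand(string text, IDictionary<string, int> choices)
    {
        return Token.Replace(text, m =>
        {
            string[] options = m.Groups[2].Value.Split('|');
            int index;
            if (choices.TryGetValue(m.Groups[1].Value, out index)
                && index >= 0 && index < options.Length)
            {
                return options[index];
            }
            // Unknown variable or index out of range: leave the token untouched.
            return m.Value;
        });
    }
}
```

For example, calling Expand on "$gender(he|she) said hi" with gender mapped to 1 should pick the second option, "she".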

I figured it out.
I was forgetting to account for numbers in the options.
\$[a-zA-Z]+\(([a-zA-Z0-9]+\|)+[a-zA-Z0-9]+\)

Related

Regex Replacing only whole matches

I am trying to replace a bunch of strings in files. The strings are stored in a datatable along with the new string value.
string contents = File.ReadAllText(file);
foreach (DataRow dr in FolderRenames.Rows)
{
contents = Regex.Replace(contents, dr["find"].ToString(), dr["replace"].ToString());
File.SetAttributes(file, FileAttributes.Normal);
File.WriteAllText(file, contents);
}
The strings look like this _-uUa, -_uU, _-Ha etc.
The problem I am having is that, for example, the string "_uU" will also overwrite part of "_-uUa", so the replacement ends up looking like "newvaluea".
Is there a way to tell regex to look at the next character after the found string and make sure it is not an alphanumeric character?
I hope it is clear what I am trying to do here.
Here is some sample data:
private function _-0iX(arg1:flash.events.Event):void
{
if (arg1.type == flash.events.Event.RESIZE)
{
if (this._-2GU)
{
this._-yu(this._-2GU);
}
}
return;
}
The next characters could be ;, (, ), dot, comma, space, :, etc.
First of all, you should use Regex.Escape.
You can use then
contents = Regex.Replace(
    contents,
    Regex.Escape(dr["find"].ToString()) + @"(?![a-zA-Z])",
    // Regex.Escape is for patterns; in a replacement string only '$' needs escaping
    dr["replace"].ToString().Replace("$", "$$"));
or even better
contents = Regex.Replace(
    contents,
    @"\b" + Regex.Escape(dr["find"].ToString()) + @"\b",
    // Regex.Escape is for patterns; in a replacement string only '$' needs escaping
    dr["replace"].ToString().Replace("$", "$$"));
I think this is what you're looking for:
contents = Regex.Replace(
contents,
string.Format(@"(?<!\w){0}(?!\w)", Regex.Escape(dr["find"].ToString())),
dr["replace"].ToString().Replace("$", "$$")
);
You can't use \b because your search strings don't always start and end with word characters. Instead, I used (?<!\w) and (?!\w) to make sure the matched substring is not immediately preceded or followed by a word character (i.e., a letter, a digit, or an underscore). I don't know the complete specs for your search strings, so this pattern might need some tweaking.
None of the sample patterns you provided contain regex metacharacters, but like the other responders, I used Regex.Escape() to render it safe anyway. In the replacement string the only character you have to watch out for is the dollar sign, and the way to escape that is with another dollar sign. Notice that I used String.Replace() for that instead of Regex.Replace().
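The lookaround approach can be tried out in isolation with a small helper. This is a sketch only: the WholeTokenReplacer class and ReplaceWhole method are hypothetical names, and the sample strings in the note below are made up to resemble the identifiers in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeTokenReplacer
{
    // Replaces 'find' only where it is not directly preceded or followed by a word character.
    public static string ReplaceWhole(string contents, string find, string replacement)
    {
        string pattern = @"(?<!\w)" + Regex.Escape(find) + @"(?!\w)";
        // In the replacement string, '$' is the only character that needs escaping.
        return Regex.Replace(contents, pattern, replacement.Replace("$", "$$"));
    }
}
```

With the input "this._-uU(this._-uUa);", replacing "_-uU" with "renamed" should change only the first identifier, because the second one is followed by the letter "a".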
There are two tricks that can help you here:
Order all the search strings by length and replace the longest ones first; that way you won't accidentally replace the shorter ones inside them.
Use a MatchEvaluator: instead of looping through all your rows, search for all replacement patterns in the string at once and look each match up in your dataset.
Option one is simple, option two would look like this:
contents = Regex.Replace(contents, "_-\\w+", ReplaceIdentifier);
public string ReplaceIdentifier(Match m)
{
    // Look up the matched identifier; Rows.Find requires a primary key on "find"
    DataRow row = FolderRenames.Rows.Find(m.Value);
    if (row != null) return row["replace"].ToString();
    else return m.Value; // no rename registered: keep the original text
}
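Option one, replacing the longest search strings first, could be sketched as follows. This is an illustration under assumptions, not the asker's code: the renames are held in a dictionary rather than a DataTable, and the LongestFirstReplacer class and ReplaceAll method are hypothetical names. It uses plain String.Replace, so no regex escaping is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LongestFirstReplacer
{
    // Applies every rename, longest search string first, so that a short key
    // cannot clobber the beginning of a longer key that contains it.
    public static string ReplaceAll(string contents, IDictionary<string, string> renames)
    {
        foreach (var pair in renames.OrderByDescending(p => p.Key.Length))
        {
            contents = contents.Replace(pair.Key, pair.Value);
        }
        return contents;
    }
}
```

One caveat: this only helps if no replacement value itself contains one of the remaining search strings, otherwise a later pass could rewrite it again.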

C# sort and put back Regex.matches

Is there any way of using RegEx.Matches to find, and write back matched values but in different (alphabetical) order?
For now I have something like:
var pattern = @"(KEY `[\w]+?` \(`.*`*\))";
var keys = Regex.Matches(line, pattern);
Console.WriteLine("\n\n");
foreach (Match match in keys)
{
Console.WriteLine(match.Index + " = " + match.Value.Replace("\n", "").Trim());
}
But what I really need is to take table.sql dump and sort existing INDEXES alphabetically, example code:
line = "...PRIMARY KEY (`communication_auto`),\n KEY `idx_current` (`current`),\n KEY `idx_communication` (`communication_id`,`current`),\n KEY `idx_volunteer` (`volunteer_id`,`current`),\n KEY `idx_template` (`template_id`,`current`)\n);"
Thanks
J
Update:
Thanks, m.buettner's solution gave me the basics I needed to move on. I'm not so good at RegEx, sadly, but I ended up with code that I believe can still be improved:
...
//sort INDEXES definitions alphabetically
if (line.Contains(" KEY `")) line = Regex.Replace(
line,
@"[ ]+(KEY `[\w]+` \([\w`,]+\),?\s*)+",
ReplaceCallbackLinq
);
static string ReplaceCallbackLinq(Match match)
{
var result = String.Join(",\n ",
from Capture item in match.Groups[1].Captures
orderby item.Value.Trim()
select item.Value.Trim().Replace("),", ")")
);
return " " + result + "\n";
}
Update:
There is also the case where an index field is longer than 255 chars: MySQL trims the index to 255 and writes it like this:
KEY `idx3` (`app_property_definition_id`,`value`(255),`audit_current`),
so, in order to match this case too I had to change some code:
in ReplaceCallbackLinq:
select item.Value.Trim().Replace("`),", "`)")
and regex definition to:
@"[ ]+(KEY `[\w]+` \([\w`(\(255\)),]+\),?\s*)+",
This cannot be done with regex alone. But you could use a callback function and make use of .NET's unique capability of capturing multiple things with the same capturing group. This way you avoid using Matches and writing everything back yourself. Instead you can use the built-in Replace function. My example below simply sorts the KEY phrases and puts them back as they were (so it does nothing but sort the phrases within the SQL statement). If you want a different output you can easily achieve that by capturing different parts of the pattern and adjusting the Join operation at the very end.
First we need a match evaluator to pass the callback:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
Then we write a regex that matches the whole set of indices at once, capturing the index-names in a capturing group. We put this in the overload of Replace that takes an evaluator:
output = Regex.Replace(
input,
@"(KEY `([\w]+)` \(`[^`]*`(?:,`[^`]*`)*\),?\s*)+",
evaluator
);
Now in most languages this would not be useful, because due to the repetition capturing group 1 would always contain only the first or last thing that was captured (same as capturing group 2). But luckily, you are using C#, and .NET's regex engine is just one powerful beast. So let's have a look at the callback function and how to use the multiple captures:
static string ReplaceCallback(Match match)
{
int captureCount = match.Groups[1].Captures.Count;
string[] indexNameArray = new string[captureCount];
string[] keyBlockArray = new string[captureCount];
for (int i = 0; i < captureCount; i++)
{
keyBlockArray[i] = match.Groups[1].Captures[i].Value;
indexNameArray[i] = match.Groups[2].Captures[i].Value;
}
Array.Sort(indexNameArray, keyBlockArray);
return String.Join("\n ", keyBlockArray);
}
match.Groups[i].Captures lets us access the multiple captures of a single group. Since these are Capture objects, which are not very convenient to sort directly, we build two string arrays from their values. Then we use Array.Sort, which sorts two arrays based on the values of one (which is considered the key). As the "key" we use the captured index name. As the "value" we use the full capture of one complete KEY ..., block. This sorts the full blocks by their names. Then we can simply join the blocks together, add back the whitespace separator that was used before, and return them.
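The same Captures plus Array.Sort technique can be seen on a much smaller input. The sketch below is illustrative only: the "name=value;" format, the CaptureDemo class and the SortPairs method are invented for the example and have nothing to do with the SQL dump.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Sorts "name=value;" blocks by name using the per-iteration captures of a repeated group.
    public static string SortPairs(string input)
    {
        // Group 1 captures each whole "name=value;" block, group 2 just the name.
        Match m = Regex.Match(input, @"^((\w+)=\w+;)+$");
        if (!m.Success) return input;

        int count = m.Groups[1].Captures.Count;
        string[] names = new string[count];
        string[] blocks = new string[count];
        for (int i = 0; i < count; i++)
        {
            blocks[i] = m.Groups[1].Captures[i].Value;
            names[i] = m.Groups[2].Captures[i].Value;
        }

        // Sorts 'names' and reorders 'blocks' in step with it.
        Array.Sort(names, blocks, StringComparer.Ordinal);
        return string.Concat(blocks);
    }
}
```

Given "b=2;a=1;c=3;", this should return the blocks ordered by name: "a=1;b=2;c=3;".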
Not sure if I fully understand the question, but does changing the foreach to:
foreach (Match match in keys.Cast<Match>().OrderBy(m => m.Value))
do what you want?

match first digits before # symbol

How to match all first digits before # in this line
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
I'm trying to get this number: 26909578
My attempt:
string text = @"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, @"(.+?)#", RegexOptions.Singleline);
but then it outputs all of the text.
Make it explicit that it has to start at the beginning of the string:
@"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
@"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
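The IndexOf() approach mentioned above might look like the following minimal sketch. The FirstField class and BeforeHash method are hypothetical names, and returning the whole line when there is no '#' is an assumption about the desired behaviour.

```csharp
using System;

public static class FirstField
{
    // Returns everything before the first '#', or the whole line if there is none.
    public static string BeforeHash(string line)
    {
        int pos = line.IndexOf('#');
        return pos < 0 ? line : line.Substring(0, pos);
    }
}
```

For a line starting with "26909578#Sbrntrl_7x06-lilla.avi#...", this should yield "26909578".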
If you don't want to use a regex, use a StringBuilder and just loop until you hit the #, like this:
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
// Stop at the first '#' or at the end of the string, whichever comes first
while (i < yourdata.Length && yourdata[i] != '#')
{
    sb.Append(yourdata[i]);
    i++;
}
// At this point the StringBuilder holds everything before the '#', so return it with ToString()
string answer = sb.ToString();
The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

Regex: C# extract text within double quotes

I want to extract only those words within double quotes. So, if the content is:
Would "you" like to have responses to your "questions" sent to you via email?
The answer must be
you
questions
Try this regex:
\"[^\"]*\"
or
\".*?\"
Explanation:
[^ character_group ]
Negation: Matches any single character that is not in character_group.
*?
Matches the previous element zero or more times, but as few times as possible.
And some sample code:
foreach(Match match in Regex.Matches(inputString, "\"([^\"]*)\""))
Console.WriteLine(match.ToString());
//or in LINQ
var result = from Match match in Regex.Matches(line, "\"([^\"]*)\"")
select match.ToString();
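Note that match.ToString() returns the whole match, quotes included. Since the pattern already has a capturing group around the quoted text, reading Groups[1] gives the text without the quotes. A minimal sketch, where the QuotedText class and Extract method are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedText
{
    // Returns the text found between each pair of double quotes, without the quotes.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(input, "\"([^\"]*)\""))
        {
            // Group 1 is the part inside the quotes; match.Value would include them.
            results.Add(match.Groups[1].Value);
        }
        return results;
    }
}
```

For the sentence in the question, this should produce the two entries "you" and "questions".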
Based on @Ria's answer:
static void Main(string[] args)
{
string str = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
var reg = new Regex("\".*?\"");
var matches = reg.Matches(str);
foreach (var item in matches)
{
Console.WriteLine(item.ToString());
}
}
The output is:
"you"
"questions"
You can use string.TrimStart('"') and string.TrimEnd('"') to remove the double quotes if you don't want them.
I like the regex solutions. You could also think of something like this
string str = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
var stringArray = str.Split('"');
Then take the odd elements from the array. If you use linq, you can do it like this:
var stringArray = str.Split('"').Where((item, index) => index % 2 != 0);
This also steals the Regex from @Ria, but lets you loop over the matches by index and remove the quotes:
string strText = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
MatchCollection mc = Regex.Matches(strText, "\"([^\"]*)\"");
for (int z=0; z < mc.Count; z++)
{
Response.Write(mc[z].ToString().Replace("\"", ""));
}
I combine Regex and Trim:
const string searchString = "This is a \"search text\" and \"another text\" and not \"this text";
var collection = Regex.Matches(searchString, "\\\"(.*?)\\\"");
foreach (var item in collection)
{
Console.WriteLine(item.ToString().Trim('"'));
}
Result:
search text
another text
Try this: (\"\w+\")+
I suggest you download Expresso:
http://www.ultrapico.com/Expresso.htm
I needed to do this in C# for parsing CSV and none of these worked for me so I came up with this:
\s*(?:(?:(['"])(?<value>(?:\\\1|[^\1])*?)\1)|(?<value>[^'",]+?))\s*(?:,|$)
This will parse out a field with or without quotes and will exclude the quotes from the value while keeping embedded quotes and commas. <value> contains the parsed field value. Without using named groups, either group 2 or 3 contains the value.
There are better and more efficient ways to do CSV parsing and this one will not be effective at identifying bad input. But if you can be sure of your input format and performance is not an issue, this might work for you.
Slight improvement on the answer by @Ria:
\"[^\" ][^\"]*\"
Will recognize a starting double quote only when not followed by a space to allow trailing inch specifiers.
Side effect: It will not recognize "" as a quoted value.

How to do this Regex in C#?

I've been trying to do this for quite some time but for some reason never got it right.
There will be texts like these:
12325 NHGKF
34523 KGJ
29302 MMKSEIE
49504EFDF
The rule is there will be EXACTLY 5 digit number (no more or less) after that a 1 SPACE (or no space at all) and some text after as shown above. I would like to have a MATCH using a regex pattern and extract THE NUMBER and SPACE and THE TEXT.
Is this possible? Thank you very much!
Since from your wording you seem to need to be able to get each component part of the input text on a successful match, then here's one that'll give you named groups number, space and text so you can get them easily if the regex matches:
(?<number>\d{5})(?<space>\s?)(?<text>\w+)
On the returned Match, if Success==true then you can do:
string number = match.Groups["number"].Value;
string text = match.Groups["text"].Value;
bool hadSpace = match.Groups["space"].Length > 0; // the group always exists; check whether it captured anything
The expression is relatively simple:
^([0-9]{5}) ?([A-Z]+)$
That is, 5 digits, an optional space, and one or more upper-case letters. The anchors at both ends ensure that the entire input is matched.
The parentheses around the digits pattern and the letters pattern designate capturing groups one and two. Access them to get the number and the word.
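Reading the two groups could look like the sketch below. The LineParser class and TryParse method are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineParser
{
    // Exactly five digits, an optional single space, then one or more upper-case letters.
    private static readonly Regex Line = new Regex(@"^([0-9]{5}) ?([A-Z]+)$");

    // Returns true and fills 'number' and 'text' when the input matches; otherwise both are null.
    public static bool TryParse(string input, out string number, out string text)
    {
        Match m = Line.Match(input);
        number = m.Success ? m.Groups[1].Value : null;
        text = m.Success ? m.Groups[2].Value : null;
        return m.Success;
    }
}
```

With this, "49504EFDF" should parse into "49504" and "EFDF", while an input with six leading digits such as "123456 ABC" should be rejected.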
string test = "12345 SOMETEXT";
string[] result = Regex.Split(test, @"(\d{5})\s*(\w+)");
You could use the Split method:
public class Program
{
static void Main()
{
var values = new[]
{
"12325 NHGKF",
"34523 KGJ",
"29302 MMKSEIE",
"49504EFDF"
};
foreach (var value in values)
{
var tokens = Regex.Split(value, @"(\d{5})\s*(\w+)");
Console.WriteLine("key: {0}, value: {1}", tokens[1], tokens[2]);
}
}
}
