C# sort and put back Regex.matches - c#

Is there any way of using RegEx.Matches to find, and write back matched values but in different (alphabetical) order?
For now I have something like:
var pattern = @"(KEY `[\w]+?` \(`.*`*\))";
var keys = Regex.Matches(line, pattern);
Console.WriteLine("\n\n");
foreach (Match match in keys)
{
Console.WriteLine(match.Index + " = " + match.Value.Replace("\n", "").Trim());
}
But what I really need is to take table.sql dump and sort existing INDEXES alphabetically, example code:
line = "...PRIMARY KEY (`communication_auto`),\n KEY `idx_current` (`current`),\n KEY `idx_communication` (`communication_id`,`current`),\n KEY `idx_volunteer` (`volunteer_id`,`current`),\n KEY `idx_template` (`template_id`,`current`)\n);"
Thanks
J
Update:
Thanks, m.buettner's solution gave me the basics I needed to move on. Sadly I'm not very good at RegEx, but I ended up with code that I believe can still be improved:
...
//sort INDEXES definitions alphabetically
if (line.Contains(" KEY `")) line = Regex.Replace(
line,
@"[ ]+(KEY `[\w]+` \([\w`,]+\),?\s*)+",
ReplaceCallbackLinq
);
static string ReplaceCallbackLinq(Match match)
{
var result = String.Join(",\n ",
from Capture item in match.Groups[1].Captures
orderby item.Value.Trim()
select item.Value.Trim().Replace("),", ")")
);
return " " + result + "\n";
}
Update:
There is also the case where an indexed field is longer than 255 chars: MySQL trims the index to 255 and writes it like this:
KEY `idx3` (`app_property_definition_id`,`value`(255),`audit_current`),
so, in order to match this case too I had to change some code:
in ReplaceCallbackLinq:
select item.Value.Trim().Replace("`),", "`)")
and regex definition to:
@"[ ]+(KEY `[\w]+` \([\w`(\(255\)),]+\),?\s*)+",

This cannot be done with regex alone. But you can use a callback function and make use of .NET's unique capability of capturing multiple things with the same capturing group. This way you avoid using Matches and writing everything back yourself. Instead you can use the built-in Replace function. My example below simply sorts the KEY phrases and puts them back as they were (so it does nothing but sort the phrases within the SQL statement). If you want a different output, you can easily achieve that by capturing different parts of the pattern and adjusting the Join operation at the very end.
First we need a match evaluator to pass the callback:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
Then we write a regex that matches the whole set of indices at once, capturing the index-names in a capturing group. We put this in the overload of Replace that takes an evaluator:
output = Regex.Replace(
input,
@"(KEY `([\w]+)` \(`[^`]*`(?:,`[^`]*`)*\),?\s*)+",
evaluator
);
Now in most languages this would not be useful, because due to the repetition capturing group 1 would always contain only the first or last thing that was captured (same as capturing group 2). But luckily, you are using C#, and .NET's regex engine is just one powerful beast. So let's have a look at the callback function and how to use the multiple captures:
static string ReplaceCallback(Match match)
{
int captureCount = match.Groups[1].Captures.Count;
string[] indexNameArray = new string[captureCount];
string[] keyBlockArray = new string[captureCount];
for (int i = 0; i < captureCount; i++)
{
keyBlockArray[i] = match.Groups[1].Captures[i].Value;
indexNameArray[i] = match.Groups[2].Captures[i].Value;
}
Array.Sort(indexNameArray, keyBlockArray);
return String.Join("\n ", keyBlockArray);
}
match.Groups[i].Captures lets us access the multiple captures of a single group. Since these are Capture objects, which are not really useful as they are, we build two string arrays from their values. Then we use Array.Sort, which sorts two arrays based on the values of one (which is considered the key). As the "key" we use the capture of the index name. As the "value" we use the full capture of one complete KEY ..., block. This sorts the full blocks by their names. Then we can simply join the blocks together, add in the whitespace separator that was used before, and return them.
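One caveat with joining the captured blocks as they are: each block still carries its own trailing comma and whitespace, so after sorting a comma can end up missing in the middle or dangling at the end. A minimal sketch of one way around that, assuming the KEY definitions are the last entries of the table body and are separated by ",\n " as in the question's sample (IndexSorter, SortKeys and SortRun are hypothetical names, not part of the answer above):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IndexSorter
{
    // Group 1 = one KEY definition without its trailing comma, group 2 = the index name.
    // Both groups are captured once per repetition of the outer non-capturing group.
    // Assumption: columns look like `name` or `name`(255), as in the question.
    private const string KeyRunPattern =
        @"(?:(KEY `(\w+)` \((?:`\w+`(?:\(\d+\))?,?)+\)),?\s*)+";

    public static string SortKeys(string sql)
    {
        return Regex.Replace(sql, KeyRunPattern, SortRun);
    }

    private static string SortRun(Match match)
    {
        var definitions = match.Groups[1].Captures.Cast<Capture>().Select(c => c.Value);
        var names = match.Groups[2].Captures.Cast<Capture>().Select(c => c.Value);

        // Pair each definition with its index name and order by the name.
        var sorted = definitions
            .Zip(names, (definition, name) => new { definition, name })
            .OrderBy(pair => pair.name, StringComparer.Ordinal)
            .Select(pair => pair.definition);

        // Re-insert the separators ourselves so commas stay consistent.
        // Assumption: the run of KEY lines is followed directly by the closing ");".
        return string.Join(",\n ", sorted) + "\n";
    }
}
```

Calling IndexSorter.SortKeys(line) on the sample from the question should leave the PRIMARY KEY line untouched, because that pattern requires a backtick right after "KEY ".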

Not sure if I fully understand the question, but does changing the foreach to:
foreach (Match match in keys.Cast<Match>().OrderBy(m => m.Value))
do what you want?

Related

Select dynamically a list of char in a string form a DLL

I'm new here and my English will not be the best you'll read today.
I just imported from a DLL a list of "key"
(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py),
we will name it this way it's a simple string.
I need to select the "key" that compose this string after each # but it has to be done dynamically and not like you choose in an ArrayList [0,1,2 ...].
The end result should look like 8yg54w and after u got this one it's a loop and u get the next one, which means 95jz#e. The first "#" is a separator for each key.
I wanna know how can I proceed to get each key after the first separator.
I'll try to answer your questions because I think that there will be some, this is probably poorly explained, I apologize in advance! Thanks
Your solution may be a simple function that returns an IEnumerable<string>. You can accomplish this by splitting the string and using the yield keyword to return an iterator. E.g.
// Define the splitting function
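public IEnumerable<string> GetKeys(string source) {
// Drop the single leading '#' separator so the first key does not keep it.
var trimmed = source.StartsWith("#") ? source.Substring(1) : source;
// The string[] overload of Split is available on every .NET version.
var splitted = trimmed.Split(new[] { "-#" }, StringSplitOptions.None);
foreach (var key in splitted)
yield return key;
}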
// Use it in your code
var myKeys = GetKeys("#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py");
foreach(var k in myKeys) {
// This will print your keys in the console one per line.
Console.WriteLine(k);
}
You can use this approach, but I suggest hiding the logic that gets the next key if you need it to be a unique global key generator, for example in a static class with only a GetNextKey() method built from the code above...
This should return an array of keys.
string.Split("-#");
When you just need the string:
string x = "(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py)";
Console.WriteLine(string.Join("-", x.Split("-#")));
You can use a Regular expression.
string input = "(#8yg54w-#95jz#e-##9ixop-#7ps-#ny#9qv-#+pzbk5-#bp669x-#bp6696-#bp6696-#bp6696-#bp6696-#bp6696-#fbhstu-#ehddtk-####9py)";
MatchCollection matches = Regex.Matches(input, @"(?<=\#)[A-Za-z1-9#]+(?=-)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
Output:
8yg54w
95jz#e
#9ixop
7ps
ny#9qv
bp669x
bp6696
bp6696
bp6696
bp6696
bp6696
fbhstu
ehddtk
Explanation of the regular expression (?<=\#)[A-Za-z1-9#]+(?=-)
General form (?<=prefix)find(?=suffix) finds the pattern find between a prefix and a suffix.
(?<=\#) prefix # (escaped with \).
[A-Za-z1-9#] character set to match (upper and lower case letters + digits + #).
+ quantifier: At least one character.
(?=-) suffix -.
I am not sure whether the ) is part of the string. To get the last key ###9py if the string contains ), use (?<=\#)[A-Za-z1-9#]+(?=-|\)) where \) is the closing parenthesis, escaped. If ) is not in there, use (?<=\#)[A-Za-z1-9#]+(?=-|$) where $ is the end of the string. | means OR, i.e., the suffix is either - OR ), or it is - OR $ (end of line).
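Both suffix variants can be folded into one lookahead. The following is only a sketch: KeyExtractor and Extract are hypothetical names, and the character set is widened to include 0 and + on the assumption that keys such as +pzbk5 should be returned as well.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyExtractor
{
    // Returns every key that follows a '#' separator and is terminated by
    // '-', ')' or the end of the string.
    public static List<string> Extract(string input)
    {
        return Regex.Matches(input, @"(?<=\#)[A-Za-z0-9#+]+(?=-|\)|$)")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

With the string from the question, KeyExtractor.Extract(input) should also pick up +pzbk5 and the final ###9py, which the narrower pattern above skips.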

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?
Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]
I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();
There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the last character that matches the one preceding it but does not match the one following it.
2. It creates capture groups for every match, which are then fed into the Split method as delimiters. The capture groups are required by the negative lookahead, specifically the \1 identifier, which basically means "the value of the first capture group in the statement", so it cannot be omitted.
3. Regex.Split, given a capture group or multiple capture groups to match on when identifying the splitting delimiters, will include the delimiters used for every individual Split operation.
Number 3 is why your string array looks weird: Split will split on the last a in the string, which becomes split[0]. This is followed by the delimiter at split[1], etc...
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();
Could do this easily with Linq, but I don't think its runtime will be as good as regex.
A whole lot easier to read though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee'].
However, this won't work if you have a string like "aabbaa": you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"].
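If the "aabbaa" case matters and you would rather avoid regex altogether, a plain loop that starts a new run whenever the character changes is one option. This is only a sketch; RunSplitter and SplitRuns are hypothetical names.

```csharp
using System.Collections.Generic;

public static class RunSplitter
{
    // Splits a string into runs of identical consecutive characters,
    // e.g. "aabbaa" becomes "aa", "bb", "aa".
    public static List<string> SplitRuns(string input)
    {
        var runs = new List<string>();
        if (string.IsNullOrEmpty(input))
            return runs;

        int start = 0; // index where the current run begins
        for (int i = 1; i <= input.Length; i++)
        {
            // Close the run at the end of the string or when the character changes.
            if (i == input.Length || input[i] != input[start])
            {
                runs.Add(input.Substring(start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

Unlike the GroupBy version, this keeps repeated runs of the same letter separate because it only ever compares neighbouring characters.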

Determining which pattern matched using Regex.Matches

I'm writing a translator, not as any serious project, just for fun and to become a bit more familiar with regular expressions. From the code below I think you can work out where I'm going with this (cheezburger anyone?).
I'm using a dictionary which uses a list of regular expressions as the keys and the dictionary value is a List<string> which contains a further list of replacement values. If I'm going to do it this way, in order to work out what the substitute is, I obviously need to know what the key is, how can I work out which pattern triggered the match?
var dictionary = new Dictionary<string, List<string>>
{
{"(?!e)ight", new List<string>(){"ite"}},
{"(?!ues)tion", new List<string>(){"shun"}},
{"(?:god|allah|buddah?|diety)", new List<string>(){"ceiling cat"}},
..
}
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
substitute = GetRandomReplacement(dictionary[ ????? ]);
input = input.Replace(metamatch.Value, substitute);
}
Is what I'm attempting possible, or is there a better way to achieve this insanity?
You can name each capture group in a regular expression and then query the value of each named group in your match. This should allow you to do what you want.
For example, using the regular expression below,
(?<Group1>(?!e))ight
you can then extract the group matches from your match result:
match.Groups["Group1"].Captures
You've got another problem. Check this out:
string s = @"My weight is slight.";
Regex r = new Regex(@"(?<!e)ight\b");
foreach (Match m in r.Matches(s))
{
s = s.Replace(m.Value, "ite");
}
Console.WriteLine(s);
output:
My weite is slite.
String.Replace is a global operation, so even though weight doesn't match the regex, it gets changed anyway when slight is found. You need to do the match, lookup, and replace at the same time; Regex.Replace(String, MatchEvaluator) will let you do that.
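Putting the two suggestions together, here is a rough sketch of match, lookup, and replace in a single pass. Each pattern is wrapped in a generated named group (p0, p1, ...), and the callback checks which group succeeded. The Translator class, the group naming scheme, and the two sample rules are assumptions made for illustration; picking a random replacement from a list is left out.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class Translator
{
    // Hypothetical rule table: pattern plus a single replacement.
    private static readonly (string Pattern, string Replacement)[] Rules =
    {
        (@"(?<!e)ight\b", "ite"),
        (@"tion\b", "shun"),
    };

    // Wrap every pattern in its own named group so the callback can tell them apart.
    // Named groups still capture when RegexOptions.ExplicitCapture is set.
    private static readonly Regex Combined = new Regex(
        string.Join("|", Rules.Select((rule, i) => "(?<p" + i + ">" + rule.Pattern + ")")),
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static string Translate(string input)
    {
        return Combined.Replace(input, match =>
        {
            for (int i = 0; i < Rules.Length; i++)
            {
                // Only the group of the alternative that actually matched succeeds.
                if (match.Groups["p" + i].Success)
                    return Rules[i].Replacement;
            }
            return match.Value; // no rule found: leave the text unchanged
        });
    }
}
```

Because the replacement happens inside the callback, only the matched occurrence is rewritten, so weight is left alone while slight is changed.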
Using named groups like Jeff says is the most robust way.
You can also access the groups by number, as they are expressed in your pattern.
(first)|(second)
can be accessed with
match.Groups[1] // match group 2 -> second
Of course, if you have more parentheses which you don't want to include, use the non-capture operator ?:
((?:f|F)irst)|((?:s|S)econd)
match.Groups[1].Value // also match group 2 -> second

How can I find a string after a specific string/character using regex

I am hopeless with regex (C#), so I would appreciate some help:
Basically I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan: Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment, is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to have the "words to look for" constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of spaces
\w* = 0 or more word characters (non-space, basically)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(fyi) The "@" before strings means that \ no longer means something special, so you can type @"c:\fun.txt" instead of "c:\\fun.txt"
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
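For the follow-up about a keyword-dependent number of words, one possibility is to build the pattern at run time and put the word count into a {n} quantifier. This is only a sketch under my reading of the requirement: KeywordReader and WordsAfter are hypothetical names, the colon after the keyword is treated as optional, and wordCount is assumed to be at least 1.

```csharp
using System.Text.RegularExpressions;

public static class KeywordReader
{
    // Returns the first wordCount words that follow "keyword" (optionally
    // followed by ':'), or null when the keyword or enough words are missing.
    public static string WordsAfter(string text, string keyword, int wordCount)
    {
        // One word, then (wordCount - 1) further words separated by whitespace.
        string pattern = Regex.Escape(keyword)
            + @":?\s*(\w+(?:\s+\w+){" + (wordCount - 1) + "})";

        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

A per-keyword word count could then live in a small dictionary, e.g. looking up 2 for KeywordB and passing it as wordCount.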
/KeywordB\: (\w+)/
This matches the word that comes after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.

regular expression for finding parts of a string within another

I have two strings: the first's value is "catdog" and the second's is "got".
I'm trying to find a regex that tells me if the letters for "got" are in "catdog". I'm particularly looking to avoid the case where there are duplicate letters. For example, I know "got" is a match, however "gott" is not a match because there are not two "t" in "catdog".
EDIT:
Based on Adam's response below this is the C# code I got to work in my solution. Thanks to all those that responded.
Note: I had to convert the char to int and subtract 97 to get the appropriate index for the array. In my case the letters are always lower case.
private bool CompareParts(string a, string b)
{
int[] count1 = new int[26];
int[] count2 = new int[26];
foreach (var item in a.ToCharArray())
count1[(int)item - 97]++;
foreach (var item in b.ToCharArray())
count2[(int)item - 97]++;
for (int i = 0; i < count1.Length; i++)
if(count2[i] > count1[i])
return false;
return true;
}
You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:
bool containsParts(string1, string2)
{
count1 = array of 26 0's
count2 = array of 26 0's
// Note: be sure to check for and ignore non-alphabetic characters,
// and do case conversion if you want to do it case-insensitively
for each character c in string1:
count1[c]++
for each character c in string2:
count2[c]++
for each character c in 'a'...'z':
if count1[c] < count2[c]:
return false
return true
}
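For C# specifically, the same counting idea can be sketched with a dictionary instead of a 26-element array, which sidesteps the index arithmetic and does not break on characters outside a to z. LetterCounter and ContainsParts are hypothetical names; the comparison is case-sensitive unless you normalize the inputs first.

```csharp
using System.Collections.Generic;

public static class LetterCounter
{
    // True when every character of "word" can be taken from "pool",
    // using each character of the pool at most once.
    public static bool ContainsParts(string pool, string word)
    {
        var available = new Dictionary<char, int>();
        foreach (char c in pool)
        {
            available.TryGetValue(c, out int count); // count is 0 when c is new
            available[c] = count + 1;
        }

        foreach (char c in word)
        {
            // Fail as soon as a character is missing or already used up.
            if (!available.TryGetValue(c, out int left) || left == 0)
                return false;
            available[c] = left - 1;
        }
        return true;
    }
}
```

With this sketch, LetterCounter.ContainsParts("catdog", "got") is true and LetterCounter.ContainsParts("catdog", "gott") is false, since the pool holds only one t.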
Previous suggestions have already been made that perhaps regex isn't the best way to do this and I agree, however, your accepted answer is a little verbose considering what you're trying to achieve and that is test to see if a set of letters is the subset of another set of letters.
Consider the following code which achieves this in a single line of code:
MatchString.ToList().ForEach(Item => Input.Remove(Item));
Which can be used as follows:
public bool IsSubSetOf(string InputString, string MatchString)
{
var InputChars = InputString.ToList();
MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
return InputChars.Count == 0;
}
You can then just call this method to verify if it's a subset or not.
What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.
You could take it one step further and put the method into a static class:
using System;
using System.Linq;
using System.Runtime.CompilerServices;
static class extensions
{
public static bool IsSubSetOf(this string InputString, string MatchString)
{
var InputChars = InputString.ToList();
MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
return InputChars.Count == 0;
}
}
which turns your method into an extension of the string object. This actually makes things a lot easier in the long run, because you can now make your calls like so:
Console.WriteLine("gott".IsSubSetOf("catdog"));
I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:
/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/
Now, with a little trickery you could do this with a few regexps (example in Perl, untested):
$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.
Sane people would use a loop, of course:
$foo = 'got';
foreach $l (split(//, 'catdog')) {
$foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.
There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.
You want a string that matches exactly those letters, exactly once. It depends on what you're writing the regex in, but it's going to be something like
^[^got]*(g|o|t)[^got]$
If you've got an operator for "exactly one match" that will help.
Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:
/^
(?=[^got]*g[^got]*$)
(?=[^got]*o[^got]*$)
[^got]*t[^got]*
$/x
This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.
@Adam Rosenfield's solution in Python:
from collections import defaultdict
def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c
def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""
    count_string = count(astring)
    count_word = count(word)
    return all(count_string[c] >= count_word[c] for c in word)
The best way to do it with regular expressions is, IMO:
A. Sort the characters in the large string (search space).
Thus: turn "catdog" into "acdgot".
B. Do the same with the string whose characters you are searching for: "gott" becomes, eh, "gott"...
Insert ".*" between each of these characters.
Use the latter as the regular expression to search in the former.
For example, some Perl code (if you don't mind):
$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if($main =~ /$regexp/) {
print "Found a match!\n";
} else {
print "Sorry, no match...\n";
}
This prints:
Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...
Drop one "t" and you get a match.
