I have a DataTable like this:
column1 column2
----------- ----------
1 abc d Alpha
2 ab Gamma
3 abc de Harry
4 xyz Peter
I want to check if a substring of a string exists in the datatable.
e.g. If the string I am looking for is "abc defg", record 3 should be returned (although record 1 is also a match, record 3 has more common characters in sequence).
I am not able to find any way to search as described above.
any help, guidance would be much appreciated.
This would be a two-step process.
Filter the table for rows that match. This can be done with the string.Contains method. In LINQ, this would look something like:
const string myText = "abc defg";
IEnumerable<Row> matches = MyTable.Where(row => myText.Contains(row.Column1));
Select the longest match. In LINQ, this might look something like this.
Row longestMatch = matches.OrderByDescending<Row, int>(row => row.Column1.Length).First();
Related
Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.
I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for( int i = 1; i < fields.Length - 1; ++i) {
string result = string.Join("|", fields.First(), fields[i], fields.Last());
Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
A single regular expression to return multiple matches on multiple calls?
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want.
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”.
So use a while loop structure.
h saves the pattern space into the hold space.
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result.
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space.
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest.
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command.
It says to go back to the beginning of the loop
if the most recent substitution succeeded.
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it.
And then change “A|C|D|E” to “A|D|E” and iterate.
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it.
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.)
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.
You can make use of the .NET Groups.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1 match ID: 1+ word chars and |
(\w+: \d+\|)+ Capture group 2 Repeat 1+ times matching 1+ word chars : 1+ digits |
(Expiry: \w+,\d+) Capture group 3 match Expiry: 1+ word chars , and 1+ digits
See a .NET regex demo | C# demo
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = #"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach(Capture c in m.Groups[2].Captures) {
Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
It should be possible with look behind and look ahead:
string foo = #"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First look at "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind "(?<= )" and search for the ID. Then look ahead "(?= )" and search for Expiry.
string pattern = #"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (var match in Regex.Matches(foo, pattern))
{
....
}
Fortunately c# is one of the few languages that can handle variable length look behinds.
I'm not an expert on regex and need some help to set up one.
I'm using Powershell and its [regex] type, which is a C# class, the final objective is to read a toml file (sample data at the bottom, or use this link to regex101), in which I need to:
match some values (values between "__")
ignore comments. (a comment starts with "#")
To match the values and put them in a capture group the following regex works:
match the template value (values between "__" ):
__(?<tokenName>[\w\.]+)__
I also want to ignore the commented lines, and I came up with this:
Ignore lines that start with a comment (even if "#" is preceded by spaces or tabs):
^(?!\s*\t*#).*
The problem starts when I put them together
^(?!\s*\t*#).*__(?<tokenName>[\w\.]+)__
this expression has the following problems:
up to one match per line, the last one (ie: in the line with "Prop5 = ..." I get one match instead of two)
Comments at the end of a line are not considered (ie: line with "Prop4 = ..." has two matches instead of one)
I've also tried to
add this at the end of the expression, it should stop the match on the first occurrence of the character
[^#]
add this at the beginning, which should check if the matched string has the given char before it and exclude it
(?<!^#)
This is a sample of my data
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
#I'm a comment
#Prop3 = "__NotUsed__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
I think the easier solution will be to match the given string, only if there is no "#" before it on the same line.
Is it possible?
EDIT:
The first expression proposed by #the-fourth-bird works perfectly, it just needs the multiline modifier to be specified.
The final (runnable) result looks like this in PowerShell.
[regex]$reg = "(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__"
$text = '
#templateFile
[Agent]
Prop1 = "__Data.Agent.Prop1__"
Prop2 = [__Data.Agent.Prop2__]
Prop5 = ["__Data.Agent.Prop5a__","__Data.Agent.Prop5b__"]
#a comment
#Prop3 = "__Data.Agent.Prop3__"
Prop4 = [__Data.Agent.Prop4__] #sample usage comment __Data.Agent.xxx__
'
$reg.Matches($text) | Format-Table
#This returns
Groups Success Name Captures Index Length Value
------ ------- ---- -------- ----- ------ -----
{0, tokenName} True 0 {0} 31 20 __Data.Agent.Prop1__
{0, tokenName} True 0 {0} 62 20 __Data.Agent.Prop2__
{0, tokenName} True 0 {0} 94 21 __Data.Agent.Prop5a__
{0, tokenName} True 0 {0} 118 21 __Data.Agent.Prop5b__
{0, tokenName} True 0 {0} 194 20 __Data.Agent.Prop4__
I think you could make use of infinite repetition to check if what precedes does not contain a # to also account for the comment in Prop4
(?<!^.*#.*)__(?<tokenName>[\w.]+)__
.Net regex demo
If Prop4 should have 2 matches, you might use:
(?<!^[ \t]*#.*)__(?<tokenName>[\w.]+)__
.NET regex demo
Both expressions needs the multiline modifier to work properly.
it can be specified inline by adding (?m) at the beginning. (or by specifying it in a constructor that supports it)
(?m)(?<!^.*#.*)__(?<tokenName>[\w.]+)__
I'm trying to modify the following method so it will show the column names of the non matched items in the in the output of the 2 CSV file I'm comparing:
public static void CompareCSVFiles_2(string file1, string file2)
{
string[] names1 = File.ReadAllLines(file1);
string[] names2 = File.ReadAllLines(file2);
IEnumerable<string> differenceQuery = names1.Except(names2);
foreach (string s in differenceQuery)
Console.WriteLine(s);
}
The format of the 2 files I'm trying to compare is plain CSV for example:
CSV_1 CSV_2
Column_1 Column_2 Column_3 Column_1 Column_2 Column_3
123 hhh bbb 123 hhh bbb
135 ddd lll 135 ddd zzz
The output I'm after needs to indicate not only that a diff was found between the 2 files but also indicate the column name.
For example:
'Diff found in Column_3, line 2'.
I do know that the 'Column' is just line [0] in the CSV but what am I'm missing here ?
Thanks.
The current implementation compares the two files row by row. If you want to find the differences in columns, you have to parse the rows first.
There is a good nuget package called CsvHelper that helps you with the parsing. Check the "Reading" > "Reading by Hand example" on their website to see how to read the file column by column.
I'd like a group of 3 values in the following regular expression and input string
With the help of the SO experts this is what I have:
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";
Basically,
The first value is going to be the product description. I put a 1 on the end just in case the description has a number in it.
The second value is the quantity.
The third value is price.
Does this look correct? Might I be missing other cases?
Result should be the following
Group 1 = "strawb bana 1"
Group 2 = "10"
Group 3 = "1.93"
It looks like you forgot to include digits in your first match.
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";
Should be:
string item = "strawb bana 1 10 1.93";
string pattern = #"(?<str>[\w\s\d]*)(?<qty>\s\d*\s)(?<num>\d*\.\d+)";
I'm a little fuzzy on what the difference between a "group" and a "capture" are when it comes to .NET's regular expression language. Consider the following C# code:
MatchCollection matches = Regex.Matches("{Q}", #"^\{([A-Z])\}$");
I expect this to result in a single capture for the letter 'Q', but if I print the properties of the returned MatchCollection, I see:
matches.Count: 1
matches[0].Value: {Q}
matches[0].Captures.Count: 1
matches[0].Captures[0].Value: {Q}
matches[0].Groups.Count: 2
matches[0].Groups[0].Value: {Q}
matches[0].Groups[0].Captures.Count: 1
matches[0].Groups[0].Captures[0].Value: {Q}
matches[0].Groups[1].Value: Q
matches[0].Groups[1].Captures.Count: 1
matches[0].Groups[1].Captures[0].Value: Q
What exactly is going on here? I understand that there's also a capture for the entire match, but how do the groups come in? And why doesn't matches[0].Captures include the capture for the letter 'Q'?
You won't be the first who's fuzzy about it. Here's what the famous Jeffrey Friedl has to say about it (pages 437+):
Depending on your view, it either adds
an interesting new dimension to the
match results, or adds confusion and
bloat.
And further on:
The main difference between a Group
object and a Capture object is that
each Group object contains a
collection of Captures representing
all the intermediary matches by the
group during the match, as well as the
final text matched by the group.
And a few pages later, this is his conclusion:
After getting past the .NET
documentation and actually
understanding what these objects add,
I've got mixed feelings about them. On
one hand, it's an interesting
innovation [..] on the other hand, it
seems to add an efficiency burden [..]
of a functionality that won't be used
in the majority of cases
In other words: they are very similar, but occasionally and as it happens, you'll find a use for them. Before you grow another grey beard, you may even get fond of the Captures...
Since neither the above, nor what's said in the other post really seems to answer your question, consider the following. Think of Captures as a kind of history tracker. When the regex makes his match, it goes through the string from left to right (ignoring backtracking for a moment) and when it encounters a matching capturing parentheses, it will store that in $x (x being any digit), let's say $1.
Normal regex engines, when the capturing parentheses are to be repeated, will throw away the current $1 and will replace it with the new value. Not .NET, which will keep this history and places it in Captures[0].
If we change your regex to look as follows:
MatchCollection matches = Regex.Matches("{Q}{R}{S}", #"(\{[A-Z]\})+");
you will notice that the first Group will have one Captures (the first group always being the whole match, i.e., equal to $0) and the second group will hold {S}, i.e. only the last matching group. However, and here's the catch, if you want to find the other two catches, they're in Captures, which contains all intermediary captures for {Q} {R} and {S}.
If you ever wondered how you could get from the multiple-capture, which only shows last match to the individual captures that are clearly there in the string, you must use Captures.
A final word on your final question: the total match always has one total Capture, don't mix that with the individual Groups. Captures are only interesting inside groups.
This can be explained with a simple example (and pictures).
Matching 3:10pm with the regular expression ((\d)+):((\d)+)(am|pm), and using Mono interactive csharp:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Zip(Enumerable.Range(0, int.MaxValue), (g, n) => "[" + n + "] " + g);
{ "[0] 3:10pm", "[1] 3", "[2] 3", "[3] 10", "[4] 0", "[5] pm" }
So where's the 1?
Since there are multiple digits that match on the fourth group, we only "get at" the last match if we reference the group (with an implicit ToString(), that is). In order to expose the intermediate matches, we need to go deeper and reference the Captures property on the group in question:
csharp> Regex.Match("3:10pm", #"((\d)+):((\d)+)(am|pm)").
> Groups.Cast<Group>().
> Skip(4).First().Captures.Cast<Capture>().
> Zip(Enumerable.Range(0, int.MaxValue), (c, n) => "["+n+"] " + c);
{ "[0] 1", "[1] 0" }
Courtesy of this article.
A Group is what we have associated with groups in regular expressions
"(a[zx](b?))"
Applied to "axb" returns an array of 3 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
group 2: b, the second group matched.
except that these are only 'captured' groups. Non capturing groups (using the '(?: ' syntax are not represented here.
"(a[zx](?:b?))"
Applied to "axb" returns an array of 2 groups:
group 0: axb, the entire match.
group 1: axb, the first group matched.
A Capture is also what we have associated with 'captured groups'. But when the group is applied with a quantifier multiple times, only the last match is kept as the group's match. The captures array stores all of these matches.
"(a[zx]\s+)+"
Applied to "ax az ax" returns an array of 2 captures of the second group.
group 1, capture 0 "ax "
group 1, capture 1 "az "
As for your last question -- I would have thought before looking into this that Captures would be an array of the captures ordered by the group they belong to. Rather it is just an alias to the groups[0].Captures. Pretty useless..
From the MSDN documentation:
The real utility of the Captures property occurs when a quantifier is applied to a capturing group so that the group captures multiple substrings in a single regular expression. In this case, the Group object contains information about the last captured substring, whereas the Captures property contains information about all the substrings captured by the group. In the following example, the regular expression \b(\w+\s*)+. matches an entire sentence that ends in a period. The group (\w+\s*)+ captures the individual words in the collection. Because the Group collection contains information only about the last captured substring, it captures the last word in the sentence, "sentence". However, each word captured by the group is available from the collection returned by the Captures property.
Imagine you have the following text input dogcatcatcat and a pattern like dog(cat(catcat))
In this case, you have 3 groups, the first one (major group) corresponds to the match.
Match == dogcatcatcat and Group0 == dogcatcatcat
Group1 == catcatcat
Group2 == catcat
So what it's all about?
Let's consider a little example written in C# (.NET) using Regex class.
int matchIndex = 0;
int groupIndex = 0;
int captureIndex = 0;
foreach (Match match in Regex.Matches(
"dogcatabcdefghidogcatkjlmnopqr", // input
#"(dog(cat(...)(...)(...)))") // pattern
)
{
Console.Out.WriteLine($"match{matchIndex++} = {match}");
foreach (Group #group in match.Groups)
{
Console.Out.WriteLine($"\tgroup{groupIndex++} = {#group}");
foreach (Capture capture in #group.Captures)
{
Console.Out.WriteLine($"\t\tcapture{captureIndex++} = {capture}");
}
captureIndex = 0;
}
groupIndex = 0;
Console.Out.WriteLine();
}
Output:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = abc
capture0 = abc
group4 = def
capture0 = def
group5 = ghi
capture0 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Let's analyze just the first match (match0).
As you can see there are three minor groups: group3, group4 and group5
group3 = kjl
capture0 = kjl
group4 = mno
capture0 = mno
group5 = pqr
capture0 = pqr
Those groups (3-5) were created because of the 'subpattern' (...)(...)(...) of the main pattern (dog(cat(...)(...)(...)))
Value of group3 corresponds to it's capture (capture0). (As in the case of group4 and group5). That's because there are no group repetition like (...){3}.
Ok, let's consider another example where there is a group repetition.
If we modify the regular expression pattern to be matched (for code shown above)
from (dog(cat(...)(...)(...))) to (dog(cat(...){3})),
you'll notice that there is the following group repetition: (...){3}.
Now the Output has changed:
match0 = dogcatabcdefghi
group0 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group1 = dogcatabcdefghi
capture0 = dogcatabcdefghi
group2 = catabcdefghi
capture0 = catabcdefghi
group3 = ghi
capture0 = abc
capture1 = def
capture2 = ghi
match1 = dogcatkjlmnopqr
group0 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group1 = dogcatkjlmnopqr
capture0 = dogcatkjlmnopqr
group2 = catkjlmnopqr
capture0 = catkjlmnopqr
group3 = pqr
capture0 = kjl
capture1 = mno
capture2 = pqr
Again, let's analyze just the first match (match0).
There are no more minor groups group4 and group5 because of (...){3} repetition ({n} wherein n>=2)
they've been merged into one single group group3.
In this case, the group3 value corresponds to it's capture2 (the last capture, in other words).
Thus if you need all the 3 inner captures (capture0, capture1, capture2) you'll have to cycle through the group's Captures collection.
Сonclusion is: pay attention to the way you design your pattern's groups.
You should think upfront what behavior causes group's specification, like (...)(...), (...){2} or (.{3}){2} etc.
Hopefully it will help shed some light on the differences between Captures, Groups and Matches as well.