Regex that matches differently formatted lines in C#

Format of file
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1" able to swim"; "she was able to program her computer";
a 00002098 0 0.75 unable#1 "unable to get to town without a car";
a 00002312 0 0 dorsal#2 abaxial#1 "the abaxial surface of a leaf is the underside or side facing away from the stem"
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles; drawing away from the midline of the body or from an adjacent part
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles;
In this file, I want to extract the ID, PosScore, NegScore and SynsetTerms fields. Extracting ID, PosScore and NegScore is easy, and I use the following code for those fields.
Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");
var results = expression.Matches(input);
foreach (Match match in results)
{
Console.WriteLine(match);
}
Console.ReadLine();
This gives the correct result, but the SynsetTerms field is a problem: some lines contain two or more terms, and I don't know how to pair each term with the PosScore and NegScore of its line.
For example, the fifth line has two terms, abducting#1 and abducent#1, and both share the same scores.
What regex would return each term with its scores for such a line, like this:
Word PosScore NegScore
abducting#1 0 0.23
abducent#1 0 0.23

The non-regex, string-splitting version might be easier:
var data =
lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
.Skip(1)
.Select(line => line.Split('\t'))
.SelectMany(parts => parts[4].Split().Select(word => new
{
ID = parts[1],
Word = word,
PosScore = decimal.Parse(parts[2]),
NegScore = decimal.Parse(parts[3])
}));

You can use this regex
^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$
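You can create a list like this (RegexOptions.Multiline is needed so that ^ and $ match at the start and end of every line of the input)
var lst=Regex.Matches(input,regex,RegexOptions.Multiline)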
.Cast<Match>()
.Select(x=>
new
{
pos=x.Groups["pos"].Value,
terms=Regex.Split(x.Groups["terms"].Value,@"\s+"),
gloss=x.Groups["gloss"].Value
}
);
and now you can iterate over it
foreach(var temp in lst)
{
Console.WriteLine(temp.pos);
//you can now iterate over terms
foreach(var t in temp.terms)
{
Console.WriteLine(t);
}
}
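A minimal sketch of how the named groups could be expanded into one row per term, so that each word carries the scores of its line. The names SynsetParser and Parse are made up for illustration. The terms group is narrowed to whitespace-separated word#number tokens, which is an assumption about the file format, and the scores are parsed with the invariant culture on the assumption that the file always uses a dot as the decimal separator.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SynsetParser
{
    // Assumed line shape: POS, ID, PosScore, NegScore, then one or more
    // whitespace-separated terms of the form word#number.
    private static readonly Regex LinePattern = new Regex(
        @"^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>\S+#\d+(?:\s+\S+#\d+)*)",
        RegexOptions.Multiline);

    // Returns one (Word, PosScore, NegScore) row per term on every matching line.
    public static List<(string Word, decimal PosScore, decimal NegScore)> Parse(string input)
    {
        var rows = new List<(string Word, decimal PosScore, decimal NegScore)>();
        foreach (Match m in LinePattern.Matches(input))
        {
            decimal pos = decimal.Parse(m.Groups["pscore"].Value, CultureInfo.InvariantCulture);
            decimal neg = decimal.Parse(m.Groups["nscore"].Value, CultureInfo.InvariantCulture);

            // Every term on the line shares the same pair of scores.
            foreach (string term in Regex.Split(m.Groups["terms"].Value, @"\s+"))
            {
                rows.Add((term, pos, neg));
            }
        }
        return rows;
    }
}
```

Calling SynsetParser.Parse on the fifth sample line should give two rows, abducting#1 and abducent#1, each with PosScore 0 and NegScore 0.23. The header line is skipped because its second field is not numeric.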

Related

Find in the List of words with letters in certain positions

I'm doing a crossword puzzle maker. The user selects cells for words, and the program compiles a crossword puzzle from the dictionary (all words which can be used in the crossword) - List<string>.
I need to find a word (words) in a dictionary which matches given mask (pattern).
For example, I need to find all words which match
#a###g
pattern, i.e. all words of length 6 in the dictionary with "a" at index 1 and "g" at index 5.
The number of letters and their positions are not known in advance.
How can I implement this?
You can convert word description (mask)
#a###g
into corresponding regular expression pattern:
^\p{L}a\p{L}{3}g$
Pattern explained:
^ - anchor, word beginning
\p{L} - arbitrary letter
a - letter 'a'
\p{L}{3} - exactly 3 arbitrary letters
g - letter 'g'
$ - anchor, word ending
and then get all words from dictionary which match this pattern:
Code:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
...
private static string[] Variants(string mask, IEnumerable<string> availableWords) {
// Each run of '#' becomes \p{L}{n}, i.e. exactly n arbitrary letters
Regex regex = new Regex("^" + Regex.Replace(mask, "#+", m => $@"\p{{L}}{{{m.Length}}}") + "$");
return availableWords
.Where(word => regex.IsMatch(word))
.OrderBy(word => word)
.ToArray();
}
Demo:
string[] allWords = new [] {
"quick",
"brown",
"fox",
"jump",
"rating",
"coding",
"lazy",
"paring",
"fang",
"dog",
};
string[] variants = Variants("#a###g", allWords);
Console.Write(string.Join(Environment.NewLine, variants));
Outcome:
paring
rating
I need to find a word in a list with "a" at index 1 and "g" at index 5, like the following
wordList.Where(word => word.Length == 6 && word[1] == 'a' && word[5] == 'g')
Checking the length first is critical to preventing a crash, unless your words are already arranged into separate lists by length.
If you mean that you literally will pass "#a###g" as the parameter that conveys the search term:
var term = "#a###g";
var search = term.Select((c,i) => (Chr:c,Idx:i)).Where(t => t.Chr != '#').ToArray();
var words = wordList.Where(word => word.Length == term.Length && search.All(t => word[t.Idx] == t.Chr));
How it works:
Take "#a###g" and project it to a sequence of the index of the char and the char itself, so ('#', 0),('a', 1),('#', 2),('#', 3),('#', 4),('g', 5)
Discard the '#', leaving only ('a', 1),('g', 5)
This means "'a' at position 1 and 'g' at 5"
Search the word list, demanding that the word length is the same as that of "#a###g", and also that All the search terms match: for each term we get the char out of the word at Idx and check that it matches the Chr in the search term
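The steps above can be wrapped into a small reusable helper. This is only a sketch: MaskSearch and Find are hypothetical names, and '#' is assumed to be the only wildcard character in the mask.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MaskSearch
{
    // Returns the words that have the mask's length and carry the mask's
    // fixed letters at the same positions; '#' stands for any character.
    public static List<string> Find(string mask, IEnumerable<string> words)
    {
        // Keep only the fixed letters together with their positions.
        var fixedLetters = mask
            .Select((ch, idx) => (Chr: ch, Idx: idx))
            .Where(t => t.Chr != '#')
            .ToArray();

        return words
            // The length check runs first, so the indexer below cannot go out of range.
            .Where(w => w.Length == mask.Length
                        && fixedLetters.All(t => w[t.Idx] == t.Chr))
            .ToList();
    }
}
```

For the word list from the demo, MaskSearch.Find("#a###g", allWords) should return rating and paring, in list order rather than sorted.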

Regex replacement in a custom tag

I have a string that may contain one or more of the following tags:
<CHOICE [some words] [other words]>
I need to replace (C#) all occurrences of this tag as follows:
Example: I like <CHOICE [cars and bikes] [apple and oranges]>
Result: I like cars and bikes
Example: I like <CHOICE [cars and bikes] [apple and oranges]>, I also like <CHOICE [pizza] [pasta]>
Result: I like cars and bikes, I also like pizza
Basically, replace the entire tag with only the string appearing in the first set of brackets.
Looks like capture groups is the way to go but I wasn't able to understand how to make them work.
Any help is appreciated!
EDIT: Regex is not a requirement, I thought it would be the best approach, but I see some comments telling me that it's not needed so any other suggestion will be just as fine. Thanks!
Just for fun. Here is a school-yard foreach state-machine, with a linear O(n) time complexity.
var line = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var result = new StringBuilder();
var state = 0;
foreach (char c in line)
{
if (state == 0 && c == '<') state = 1;
else if (state == 1 && c == '[') state = 2;
else if (state == 2 && c == ']') state = 3;
else if (state == 3 && c == '>') state = 0;
else if (state == 0 || state == 2) result.Append(c);
}
Console.WriteLine(result.ToString());
Output
I like cars and bikes
First get all the CHOICE tags, then replace each tag with the first string found between [ and ]
MatchCollection matches = Regex.Matches(InputStr, @"<CHOICE(.*?)>");
foreach(Match Item in matches)
{
MatchCollection matches1 = Regex.Matches(Item.ToString(), @"\[(.+?)]");
string FirstOccurence = matches1[0].Groups[1].ToString();
InputStr = InputStr.Replace(Item.ToString(), FirstOccurence);
}
string pattern = @"\< *CHOICE *((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)+ *>";
Regex regex = new Regex(pattern);
string source = "I like <CHOICE [cars and bikes] [apple and oranges]>";
var match = regex.Match(source);
if (match.Success)
{
for (int i = 0; i < match.Groups["choice"].Captures.Count; i++)
{
Debug.WriteLine(match.Groups["choice"].Captures[i]);
}
string replaced = regex.Replace(source, match.Groups["choice"].Captures[0].Value);
Debug.WriteLine(replaced);
}
The output is:
cars and bikes
apple and oranges
I like cars and bikes
\< *CHOICE *
matches "<" "zero or more spaces" "CHOICE" "zero or more spaces"
([a-zA-Z0-9 ]+)
matches words and spaces
?<choice>
gives the above group the name "choice"
\[(?<choice>[a-zA-Z0-9 ]+)\]
matches one choice in []
((\[(?<choice>[a-zA-Z0-9 ]+)\]) *)
matches choices separated by zero or more spaces
+
means you must have at least one choice
 *>
you can have zero or more spaces at the end before ">"
I assume this is the best way to do that.
string text = "This is some dummy text with the choice < CHOICE [ white black green cyan ] [yellow green]>." +
" The second choice <CHOICE [pink brown red] [blue cyan]>.";
string pattern = @"<\s*?CHOICE\s*\[\s*?(.+?)\s*?\].*?>";
var result = Regex.Replace(text, pattern, r => String.Join(" and ", r.Groups[1].Value.Split(' ', StringSplitOptions.RemoveEmptyEntries)));
Console.WriteLine(result);
Output
This is some dummy text with the choice white and black and green and cyan. The second choice pink and brown and red.
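For the requirement as stated in the question (keep the text of the first bracket pair unchanged and drop the rest of the tag), a single Regex.Replace with one capture group may be enough. The sketch below uses the hypothetical names ChoiceTag and KeepFirst, and assumes the bracketed text never contains ']' and the tag never contains '>' before its end.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChoiceTag
{
    // Group 1 captures the text of the first [...] pair; [^>]* then
    // consumes the remaining bracket pairs up to the closing '>'.
    private static readonly Regex Tag =
        new Regex(@"<\s*CHOICE\s*\[([^\]]*)\][^>]*>");

    // Replaces every CHOICE tag with the content of its first bracket pair.
    public static string KeepFirst(string text)
    {
        return Tag.Replace(text, "$1");
    }
}
```

Because the replacement string refers to the group of each individual match, several tags in one string are each replaced by their own first choice.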

Trying to match multiple words multiple times, any order using regex

I'm trying to check if a text contains two or more specific words. The words can be in any order and can show up in the text multiple times, but each must appear at least once.
If the text is a match, I will need to get the location of each word.
Let's say we have the text:
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, length 4
dollar : index 48, length 6
coke : index 84, length 4
What I have so far is this (which I think is slightly wrong, because the text should contain each word at least once, so there should be a + instead of the *):
(?:(\bcoke\b))*(?:(\bdollar\b))*
With that regex, RegexBuddy highlights all three words if I ask it to highlight group 1 and group 2.
But when I run it in C#, I don't get any results.
Can you point me in the right direction?
I don't think what you want is possible using regular expressions alone.
Here is a possible solution using regular expressions and linq:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(@"\b(?:"+string.Join("|", words)+@")\b", RegexOptions.IgnoreCase);
var text = @"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6
How about this, it is pret-tay bad but I think it has a shot at working and it is pure RegEx no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the (?:^|\W) and (?:$|\W) groups if you want it to capture cokeDollar or dollarCoKe etc.
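One way to combine the "every word at least once" check with the position lookup is to test the text with one lookahead per word and only then collect the matches. This is a sketch under assumptions: WordLocator and Locate are hypothetical names, matching is case-insensitive, and the words are escaped with Regex.Escape in case they contain regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordLocator
{
    // Returns every occurrence of the given words, or an empty list
    // when at least one of the words does not occur in the text at all.
    public static List<(string Word, int Index, int Length)> Locate(string text, params string[] words)
    {
        var result = new List<(string Word, int Index, int Length)>();
        if (words.Length == 0) return result;

        // One lookahead per word: all of them must succeed from the start of the text.
        string required = "^" + string.Concat(
            words.Select(w => @"(?=.*\b" + Regex.Escape(w) + @"\b)"));
        if (!Regex.IsMatch(text, required, RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            return result;
        }

        // Alternation of all words, used only to report positions.
        string any = @"\b(?:" + string.Join("|", words.Select(Regex.Escape)) + @")\b";
        foreach (Match m in Regex.Matches(text, any, RegexOptions.IgnoreCase))
        {
            result.Add((m.Value, m.Index, m.Length));
        }
        return result;
    }
}
```

The indexes are zero-based offsets into the string, so they match Match.Index as used in the answer above.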

Split a string containing various spaces

I have a txt file as follows and would like to split it into double arrays
node Strain Axis Strain F P/S Sum Cur Moment
0 0.00000 0.00 0.0000 0 0 0 0 0.00
1 0.00041 -83.19 0.0002 2328 352 0 0 -0.80
2 0.00045 -56.91 0.0002 2329 352 0 0 1.45
3 0.00050 -42.09 0.0002 2327 353 0 0 -0.30
My goal is to have one array per column, i.e.
node[] = {0,1,2,3}, Axis[] = {0.00,-83.19,-56.91,-42.09}, ....
I know how to read the txt file and convert strings to double arrays. The problem is that the values are not separated by tabs but by varying numbers of spaces. I searched for a way to do this but could not find one; the discussions I found only handle a constant number of spaces. If you know how to do it, or if there is an existing Q&A for this issue, please let me know. Thanks.
A different way, although I would suggest you stick with the other answers here using RemoveEmptyEntries, would be to use a regular expression, but in this case it is overkill:
string[] elements = Regex.Split(s, @"\s+");
StringSplitOptions.RemoveEmptyEntries should do the trick:
var items = source.Split(new [] { " " }, StringSplitOptions.RemoveEmptyEntries);
The return value does not include array elements that contain an empty string
var doubles = text.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
.Skip(1)
.Select(line => line.Split(new char[]{' '},StringSplitOptions.RemoveEmptyEntries)
.Select(x => double.Parse(x)).ToArray())
.ToArray();
Use the option StringSplitOptions.RemoveEmptyEntries to treat consecutive delimiters as one:
string[] parts = source.Split(' ',StringSplitOptions.RemoveEmptyEntries);
then parse from there:
double[] values = parts.Select(s => double.Parse(s)).ToArray();
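To get one array per column rather than one per row, the parsed rows can be transposed. The following is a rough sketch, not a definitive implementation: ColumnReader and ReadColumns are hypothetical names, the first line is assumed to be the header, every data row is assumed to have the same number of values, and numbers are parsed with the invariant culture on the assumption that the file uses a dot as decimal separator.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class ColumnReader
{
    // Splits the text into rows of doubles, then transposes rows into columns.
    public static double[][] ReadColumns(string text)
    {
        double[][] rows = text
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Skip(1) // skip the header line
            .Select(line => line
                .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(x => double.Parse(x, CultureInfo.InvariantCulture))
                .ToArray())
            .ToArray();

        if (rows.Length == 0) return new double[0][];

        // Column c collects the c-th value of every row.
        return Enumerable.Range(0, rows[0].Length)
            .Select(c => rows.Select(r => r[c]).ToArray())
            .ToArray();
    }
}
```

With the sample file, index 0 of the result would hold the node column and index 2 the Axis column.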

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
input,
@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
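A sketch of the multi-line lambda mentioned above, which removes whitespace before keeping the last four characters. Note that it swaps in a simpler pattern without the lookbehind, so it is a variant rather than the pattern from this answer: a run is a digit-containing word followed by further digit-containing words, separated by spaces or tabs, and the space before the run is left outside the match. AccountMask and Mask are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccountMask
{
    // A word containing a digit, optionally followed by more such words
    // separated by spaces or tabs (never by line breaks).
    private static readonly Regex Run =
        new Regex(@"\b\w*\d\w*(?:[ \t]+\w*\d\w*)*");

    // Replaces each run with "xxxx" plus its last four non-space characters.
    public static string Mask(string input)
    {
        return Run.Replace(input, m =>
        {
            // Drop the separators so "111 2233 33" keeps "3333", not "3 33".
            string compact = Regex.Replace(m.Value, @"[ \t]+", "");
            return "xxxx" + compact.Substring(Math.Max(0, compact.Length - 4));
        });
    }
}
```

Under these assumptions the six sample lines from the question should produce the expected results listed there, for example AccountMask.Mask("a123b456") giving xxxxb456.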
I don't think that regex is the best way to solve this problem, and that's why I am posting this answer. For situations this complex, building the corresponding regex is too difficult and, what is worse, its clarity and adaptability are much lower than those of a longer-code approach.
The code below these lines delivers the exact functionality you are after, it is clear enough and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
if (word.ToCharArray().Where(x => char.IsDigit(x)).Count() > 0)
{
previousNum = true;
tempOutput = tempOutput + word;
}
else
{
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
tempOutput = ""; // reset so the next run of digit-words starts empty
}
output = output + " " + word;
}
}
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
}
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
Here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
match.Value.Substring(Math.Max(0, match.Value.Length - 4)));
