Parsing into dictionary with regex as separator for splitting - c#

As I said in the title, I think the idea would be to split it by something like \d+?=.*?\d= but I'm not quite sure. Any idea how best to parse this string:
1=Some dummy sentence
2=Some other sentence 3=Third sentence which can be in the same line
4=Forth sentence
some text which shouldn't be captured and split
What I'm hoping to get from this is a Dictionary with the number as the key and the string as the value, for example:
1, "Some dummy sentence"
2, "Some other sentence"
3, "Third sentence which can be in the same line"
4, "Forth sentence"

Method to parse text into dictionary:
public static Dictionary<int, string> GetValuesToDictionary(string text)
{
var pattern = @"(\d+)=(.*?)((?=\d+=)|\n)";
//If spaces between the digits and the equals sign are possible, use (\d+)\s*=\s*(.*?)((?=\d+\s*=)|\n)
var regex = new Regex(pattern);
var pairs = new Dictionary<int, string>();
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var key = int.Parse(match.Groups[1].Value);
var value = match.Groups[2].Value;
if (!pairs.ContainsKey(key))
{
pairs.Add(key, value);
}
//pairs.Add(key, value);
}
return pairs;
}
In this case I check whether the key already exists and, if so, do not add it again; decide for yourself whether you need this check.
Digit groups that are not followed by an equals sign stay part of the value.

What about this: https://regex101.com/r/6ED8Om/2
\n?(\d+)=(.*?)(?= *\d|\n)
\n?(\d+)= matches optional new line character followed by digits and equal sign
(.*?) matches following text
(?= *\d|\n) is a lookahead that stops the value before any number of spaces followed by a digit, or before a newline character. The leading spaces in the lookahead keep value #2 from including the spaces between its end and #3.
EDIT: Use the code from the other answer with this regex to save your values to a dictionary. Group 1 matches the digits, group 2 matches the text.
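For illustration, here is a minimal sketch of what the EDIT describes: the regex from this answer plugged into a dictionary-building loop like the one in the other answer. The class and method names (KeyValueParser, Parse) are made up for this example, and keeping the first value for a repeated key is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueParser
{
    // Regex from the answer above: group 1 is the key, group 2 is the text.
    private static readonly Regex PairRegex =
        new Regex(@"\n?(\d+)=(.*?)(?= *\d|\n)");

    public static Dictionary<int, string> Parse(string text)
    {
        var pairs = new Dictionary<int, string>();
        foreach (Match match in PairRegex.Matches(text))
        {
            int key = int.Parse(match.Groups[1].Value);
            // Assumption: when a key repeats, the first value wins.
            if (!pairs.ContainsKey(key))
            {
                pairs.Add(key, match.Groups[2].Value);
            }
        }
        return pairs;
    }
}
```

Two things to keep in mind with this pattern: the lookahead stops at any digit, so a value that itself contains a digit is cut short, and a final pair with no newline after it is not matched.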

Related

Regex How to Match 2 fields

How would I capture both the filenames inside the quotes and the numbers following them as named captures (Regex / C#)?
Files("fileone.txt", 5969784, "file2.txt", 45345333)
For every occurrence in the string, I need to capture the filename (e.g. "fileone.txt") and the integer that follows it, so that a loop can process each pair.
I am trying to use https://regex101.com/r/MwMzBo/1 but am having issues matching without the '[' and ']'.
I need to be able to loop over each filename+size pair in turn.
Any help is appreciated!
UPDATE
string file = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333, \"file2.txt\", 45345333)";
var regex = new Regex(@"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");
var match = regex.Match(file);
var names = match.Groups["file"].Captures.Cast<Capture>();
var lengths = match.Groups["number"].Captures.Cast<Capture>();
var filelist = names.Zip(lengths, (f, n) => new { file = f.Value, length = long.Parse(n.Value) }).ToArray();
foreach (var item in filelist)
{
// Only returning 1 pair result, ignoring the rest
}
Reading match.Value confirms what is being matched: only the first pair is picked up.
while (match.Success)
{
MessageBox.Show(match.Value);
match = match.NextMatch();
}
Now we are getting all results properly. I read that Regex.Match only returns the first match, which explains a lot.
You can use
(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)
See the regex demo
Details:
(?:\G(?!\A)\s*,\s*|\w+\() - end of the previous successful match and a comma enclosed with zero or more whitespaces, or a word and an opening ( char
(?:""(?<file>.*?)""|'(?<file>.*?)') - ", Group "file" capturing any zero or more chars other than a newline char as few as possible and then a ", or a ', Group "file" capturing any zero or more chars other than a newline char as few as possible and then a '
\s*,\s* - a comma enclosed with zero or more whitespaces
(?<number>\d+) - Group "number": one or more digits.
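As a rough sketch of how the pattern above could be used to walk every filename/size pair, the snippet below iterates Regex.Matches and reads the two named groups from each match. FilePairParser and Parse are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FilePairParser
{
    // Same pattern as above; "" is an escaped quote inside a verbatim string.
    private static readonly Regex PairRegex = new Regex(
        @"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");

    // Hypothetical helper: one (file name, length) entry per regex match.
    public static List<KeyValuePair<string, long>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, long>>();
        foreach (Match match in PairRegex.Matches(input))
        {
            result.Add(new KeyValuePair<string, long>(
                match.Groups["file"].Value,
                long.Parse(match.Groups["number"].Value)));
        }
        return result;
    }
}
```

Each match covers exactly one pair, so Regex.Matches (or Match plus NextMatch) is needed to see all of them; a single Regex.Match call only yields the first pair.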
I like doing it in smaller pieces:
string input = "cov('Age', ['5','7','9'])";
string pattern1 = @"\((?'key'[^,]+),\s+\[(?'values'[^\]]+)";
Match match = Regex.Match(input, pattern1);
string key = match.Groups["key"].Value.Trim(new char[] {'\''});
string pattern2 = @"'(?'value'[^']+)'";
string values = match.Groups["values"].Value;
MatchCollection matches = Regex.Matches(values, pattern2);
int[] number = matches.Cast<Match>().Select(x => int.Parse(x.Value.Replace("'",string.Empty))).ToArray();

How to check if a string contains a word and ignore special characters?

I need to check whether a sentence contains any of the words from a string array, but the check should ignore special characters such as commas. The result should still contain the original sentence.
For example, I have the sentence "Tesla car price is $ 250,000."
My word array is wrdList = new string[]{ "250000", "Apple", "40.00"};
I have written the line of code below, but it returns no result because 250,000 and 250000 do not match.
List<string> res = row.ItemArray.Where(itmArr => wrdList.Any(wrd => itmArr.ToString().ToLower().Contains(wrd.ToString()))).OfType<string>().ToList();
And one important thing is, I need to get original sentence if it matches with string array.
For example, result should be "Tesla car price is $ 250,000."
not like "Tesla car price is $ 250000."
How about Replace(",", "")
itmArr.ToString().ToLower().Replace(",", "").Contains(wrd.ToString())
Side note: .ToLower() isn't required since digits have no case, and a string doesn't need .ToString(),
so the result could also be
itmArr.Replace(",", "").Contains(wrd)
https://dotnetfiddle.net/A2zN0d
Update
Since the , could be a different character depending on the culture, you can also use
System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat.NumberGroupSeparator
instead.
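A minimal sketch of the culture-aware variant, assuming the sentences are already plain strings. GroupSeparatorMatcher and FindMatches are hypothetical names, and the culture is passed in explicitly here rather than read from the current thread so the behaviour is easy to reproduce.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class GroupSeparatorMatcher
{
    // Returns the original sentences whose separator-free form contains a word.
    public static string[] FindMatches(string[] sentences, string[] words, CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        return sentences
            .Where(sentence =>
            {
                // string.Replace throws on an empty search string, so guard for it.
                string normalized = separator.Length == 0
                    ? sentence
                    : sentence.Replace(separator, "");
                return words.Any(word => normalized.Contains(word));
            })
            .ToArray();
    }
}
```

The separator is only stripped from a temporary copy used for the comparison, so the sentences that come back are the untouched originals.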
The first option to consider for most text matching problems is to use regular expressions. This will work for your problem. The core part of the solution is to construct an appropriate regular expression to match what you need to match.
You have a list of words, but I'll focus on just one word. Your requirements specify that you want to match on a "word". So to start with, you can use the "word boundary" pattern \b. To match the word "250000", the regular expression would be \b250000\b.
Your requirements also specify that the word can "contain" characters that are "special". For it to work correctly, you need to be clear what it means to "contain" and which characters are "special".
For the "contain" requirement, I'll assume you mean that the special character can be between any two characters in the word, but not the first or last character. So for the word "250000", any of the question marks in this string could be a special character: "2?5?0?0?0?0".
For the "special" requirement, there are options that depend on your requirements. If it's simply punctuation, you can use the character class \p{P}. If you need to specify a specific list of special characters, you can use a character group. For example, if your only special character is comma, the character group would be [,].
To put all that together, you would create a function to build the appropriate regular expression for each target word, then use that to check your sentence. Something like this:
public static void Main()
{
string sentence = "Tesla car price is $ 250,000.";
var targetWords = new string[]{ "250000", "350000", "400000"};
Console.WriteLine($"Contains target word? {ContainsTarget(sentence, targetWords)}");
}
private static bool ContainsTarget(string sentence, string[] targetWords)
{
return targetWords.Any(targetWord => ContainsTarget(sentence, targetWord));
}
private static bool ContainsTarget(string sentence, string targetWord)
{
string targetWordExpression = TargetWordExpression(targetWord);
var re = new Regex(targetWordExpression);
return re.IsMatch(sentence);
}
private static string TargetWordExpression(string targetWord)
{
var sb = new StringBuilder();
// If special characters means a specific list, use this:
string specialCharacterMatch = "[,]?";
// If special characters means any punctuation, then you can use this:
//string specialCharacterMatch = "\\p{P}?";
bool any = false;
foreach (char c in targetWord)
{
if (any)
{
sb.Append(specialCharacterMatch);
}
any = true;
sb.Append(Regex.Escape(c.ToString())); // escape regex metacharacters such as the . in "40.00"
}
return $"\\b{sb}\\b";
}
Working code: https://dotnetfiddle.net/5UJSur
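Since the question asks for the original sentence back, here is a small sketch of how the same per-word expression could be used as a filter. TargetWordFilter, BuildPattern and MatchingSentences are hypothetical names; the only special character allowed between characters is assumed to be a comma.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TargetWordFilter
{
    // Builds e.g. \b2[,]?5[,]?0[,]?0[,]?0[,]?0\b for "250000".
    public static string BuildPattern(string targetWord)
    {
        // Escape each character so that something like '.' is matched literally.
        var escapedChars = targetWord.Select(c => Regex.Escape(c.ToString()));
        return "\\b" + string.Join("[,]?", escapedChars) + "\\b";
    }

    // Returns the unchanged sentences that contain at least one target word.
    public static string[] MatchingSentences(string[] sentences, string[] targetWords)
    {
        var expressions = targetWords.Select(w => new Regex(BuildPattern(w))).ToArray();
        return sentences.Where(s => expressions.Any(re => re.IsMatch(s))).ToArray();
    }
}
```

Because the regex is only used to test each sentence, nothing is ever replaced and "Tesla car price is $ 250,000." comes back exactly as it went in.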
I hope the solution below helps.
It uses a regular expression to remove every character except letters, digits, spaces and hyphens.
It prints the original string if it contains any matching word from wrdList.
string s = "Tesla car price is $ 250,000.";
string[] wrdList = new string[3] { "250000", "Apple", "40.00" };
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
string str = rgx.Replace(s, "");
if (wrdList.Any(str.Contains))
{
Console.Write(s);
}
else
{
Console.Write("No Match Found!");
}
Uploaded to a fiddle for further exploration:
https://dotnetfiddle.net/zbwuDy
For a paragraph, you can additionally split it into an array of sentences and iterate through them. See the fiddle below.
https://dotnetfiddle.net/AvO6FJ

Return a number from String after a specific word

I want to get a Substring out of a String.
The Substring I want is a sequence of numerical characters.
Input
"abcdefKD-0815xyz42ghijk";
"dag4ah424KD-42ab333k";
"BeverlyHills90210KD-433Nokia3310";
Generally it could be any String, but they all have one thing in common:
There is a part that starts with KD-
and ends with a number
Everything after the number should be dropped.
In the examples above this number would be 0815, 42, 433 respectively. But it could be any number
Right now I have a Substring that contains all numerical characters after KD- but I would like to have only the 0815ish part of the string.
What I have so far:
String toMakeSub = "abcdef21KD-0815xyz429569468949489694694689ghijk";
toMakeSub = toMakeSub.Substring(toMakeSub.IndexOf("KD-") + "KD-".Length);
String result = Regex.Replace(toMakeSub, "[^0-9]", "");
The Result is 0815429569468949489694694689 but I want only the 0815 (it could be any length though so cutting after four digits is not possible).
It's as easy as the following pattern:
(?<=KD-)\d+
The way to read this:
(?<=subpattern) : Zero-width positive lookbehind assertion. Continues matching only if subpattern matches on the left.
\d : Matches any decimal digit.
+ : Matches previous element one or more times.
Example
var input = "abcdef21KD-0815xyz429569468949489694694689ghijk";
var regex = new Regex(@"(?<=KD-)\d+");
var match = regex.Match(input);
if (match.Success)
{
Console.WriteLine(match.Value);
}
input = "abcdef21KD-0815xyz429569468949489694694689ghijk, KD-234dsfsdfdsf";
// or to match multiple times
var matches = regex.Matches(input);
foreach (var matchValue in matches)
{
Console.WriteLine(matchValue);
}
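To tie this back to the three inputs from the question, here is a small sketch that wraps the same pattern in a helper. KdNumberExtractor and Extract are hypothetical names, and returning null when there is no KD- part is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KdNumberExtractor
{
    private static readonly Regex KdRegex = new Regex(@"(?<=KD-)\d+");

    // Returns the digits right after the first "KD-", or null if there are none.
    public static string Extract(string input)
    {
        Match match = KdRegex.Match(input);
        return match.Success ? match.Value : null;
    }
}
```

With the sample strings, Extract("abcdefKD-0815xyz42ghijk") gives "0815", Extract("dag4ah424KD-42ab333k") gives "42" and Extract("BeverlyHills90210KD-433Nokia3310") gives "433".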

Using a regex with 'or' operator and getting matched groups?

I have some string in a file in the format
rid="deqn1-2"
rid="deqn3"
rid="deqn4-5a"
rid="deqn5b-7"
rid="deqn7-8"
rid="deqn9a-10v"
rid="deqn11a-12c"
I want a regex to match each deqnX-Y where X and Y are either both integers or both an integer followed by a letter, and, if there is a match, to store X and Y in variables.
I tried using the regex (^(\d+)-(\d+)$|^(\d+[a-z])-(\d+[a-z]))$
, but how do I get the values of the matched groups in variables?
For a match between two integers the groups would be (I think)
Groups[2].Value
Groups[3].Value
and for match between two integer and alphabet combo will be
Groups[4].Value
Groups[5].Value
How do I determine which alternative actually matched and then capture the matching groups accordingly?
As branch reset (?|...) is not supported in C#, we can use named capturing groups with the same name, like
deqn(?:(?<match1>\d+)-(?<match2>\d+)|(?<match1>\d+\w+)-(?<match2>\d+\w+))\b
regextester demo
C# code
String sample = "deqn1-2";
Regex regex = new Regex("deqn(?:(?<match1>\\d+)-(?<match2>\\d+)|(?<match1>\\d+\\w+)-(?<match2>\\d+\\w+))\\b");
Match match = regex.Match(sample);
if (match.Success) {
Console.WriteLine(match.Groups["match1"].Value);
Console.WriteLine(match.Groups["match2"].Value);
}
dotnetfiddle demo
You could simply not care. One of the pairs will be empty anyway. So what if you just interpret the result as a combination of both? Just slap them together. First value of the first pair plus first value of the second pair, and second value of the first pair plus second value of the second pair. This always gives the right result.
Regex regex = new Regex("^deqn(?:(\\d+)-(\\d+)|(\\d+[a-z])-(\\d+[a-z]))$");
foreach (String str in listData)
{
Match match = regex.Match(str);
if (!match.Success)
continue;
String value1 = match.Groups[1].Value + match.Groups[3].Value;
String value2 = match.Groups[2].Value + match.Groups[4].Value;
// process your strings
// ...
}
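Here is a self-contained sketch of the "slap them together" idea, assuming the input is the attribute value only (for example deqn11a-12c, without the rid="..." wrapper). DeqnParser and TryParse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DeqnParser
{
    private static readonly Regex DeqnRegex =
        new Regex(@"^deqn(?:(\d+)-(\d+)|(\d+[a-z])-(\d+[a-z]))$");

    public static bool TryParse(string value, out string first, out string second)
    {
        Match match = DeqnRegex.Match(value);
        if (!match.Success)
        {
            first = string.Empty;
            second = string.Empty;
            return false;
        }
        // Only one alternative took part in the match, so the groups of the
        // other alternative are empty and concatenation picks the right pair.
        first = match.Groups[1].Value + match.Groups[3].Value;
        second = match.Groups[2].Value + match.Groups[4].Value;
        return true;
    }
}
```

For deqn1-2 this yields "1" and "2", for deqn11a-12c it yields "11a" and "12c", and mixed forms such as deqn4-5a are rejected because neither alternative fits.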

what is the regular expression to remove everything but digit?

I have this data in format
"NEW ITEM:1_BELT:3_JEANS:1_BELT:1_SUIT 3 PCS:1_SHOES:1"
the format is Item1:Item1Qty_Item2:Item2Qty.........ItemN:ItemNQty
I need to separate the items and their corresponding quantities and form arrays. I did the item part like this:
var allItemsAry = Regex.Replace(myString, "[\\:]+\\d", "").Split('_');
Now allItemsAry is correct like this [NEW ITEM, BELT, JEANS, BELT, SUIT 3 PCS, SHOES]
But I can't figure out how to get the quantities. Whatever expression I try, the 3 from SUIT 3 PCS comes along, as with this one:
var allQtyAry = Regex.Replace(dataForPackageConsume, "[^(\\:+\\d)]", "").Split(':')
This comes out as :1:3:1:13:1:1 (after the replace). So I can't split by : to make the array: as can be seen, the fourth item is 13 while it should be 1, because that 3 comes from SUIT 3 PCS. I also tried some other variations, but the 3 from SUIT 3 PCS always pops in. How do I get just the quantities (possibly attached to : so I can split on it and form the array)?
UPDATE: In case I didn't make it clear before, I want the numbers that are directly preceded by :, along with the colon.
So, what I want is :1:3:1:1:1:1.
Instead of removing everything except numerals, how about matching only numerals?
For instance:
Regex regex = new Regex(@":\d+");
string result = string.Empty;
foreach (Match match in regex.Matches(input))
result += match.Value;
[^\d:]+|:(?!\d)|(?<!:)\d+
[^\d:]+ will match all non-digit non-:s.
:(?!\d) will match all :s not followed by a digit (negative lookahead).
(?<!:)\d+ will match all digits not preceded by a : (negative lookbehind).
Source
NEW ITEM:1_BELT:3_JEANS:1_BELT:1_SUIT 3 PCS:1_SHOES:1
Regular Expression
[^\d:]+|:(?!\d)|(?<!:)\d+
Results
Match
NEW ITEM
_BELT
_JEANS
_BELT
_SUIT
3
PCS
_SHOES
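As a sketch of how the matches listed above would be used, the snippet below removes them with Regex.Replace so that only the colon/digit pairs remain. QuantityStripper and KeepColonDigits are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantityStripper
{
    // Removes everything the three alternatives match, keeping ":digit" pairs.
    public static string KeepColonDigits(string input)
    {
        return Regex.Replace(input, @"[^\d:]+|:(?!\d)|(?<!:)\d+", "");
    }
}
```

For the source string above this returns :1:3:1:1:1:1, which can then be split on : with StringSplitOptions.RemoveEmptyEntries. Note that a multi-digit quantity such as :12 would be cut to :1 with this pattern, because the second digit is not directly preceded by a colon.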
Do you want only the numbers, like :1:3:1:1:3:1:1?
string s = "NEW ITEM:1_BELT:3_JEANS:1_BELT:1_SUIT 3 PCS:1_SHOES:1";
var output = Regex.Replace(s, @"[^0-9]+", "");
StringBuilder sb = new StringBuilder();
foreach (var i in output)
{
sb.Append(":" + i);
}
Console.WriteLine(sb); // :1:3:1:1:3:1:1
Here is a DEMO.
OK, if every character right after a : is a digit, then you can do it like this:
string s = "NEW ITEM:1_BELT:3_JEANS:1_BELT:1_SUIT 3 PCS:1_SHOES:1";
var array = s.Split(new char[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
StringBuilder sb = new StringBuilder();
foreach (var item in array)
{
if (Char.IsDigit(item[0]))
{
sb.Append(":" + item[0]);
}
}
Console.WriteLine(sb); //:1:3:1:1:1:1
DEMO.
This will work with one replace:
var allQtyAry = Regex.Replace(dataForPackageConsume, @"[^_:]+:", "").Split('_')
Explanation:
[^_:] means match anything that's not a _ or a :
[^_:]+: means match any sequence of at least one character not matching either _ or :, but ending with a :
Since regular expressions are greedy by default (ie they grab as much as possible), matching will start at the beginning of the string or after each _:
NEW ITEM:1_BELT:3_JEANS:1_BELT:1_SUIT 3 PCS:1_SHOES:1
Removing the matched parts (each item name up to and including its colon) results in:
1_3_1_1_1_1
Splitting by _ results in:
[1, 3, 1, 1, 1, 1]
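The steps above can be put together in a short runnable sketch. QuantitySplitter and Quantities are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantitySplitter
{
    // Strips each "name:" prefix, then splits the remaining quantities on '_'.
    public static string[] Quantities(string data)
    {
        return Regex.Replace(data, @"[^_:]+:", "").Split('_');
    }
}
```

Called with the sample data, this returns the six strings 1, 3, 1, 1, 1, 1. Since whole item names are removed rather than individual digits, multi-digit quantities should survive intact as well.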
Try this regex: [^:\d+?].*?(?=:). It should do the trick.
string[] list = Regex.Replace(test, @"[^:\d+?].*?(?=:)", string.Empty).Split(new char[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
The regex replaces with an empty string everything that precedes a colon, via .*?(?=:), without consuming the colon itself. The leading [^:\d+?] keeps a match from starting at a colon or a digit, so each colon and the digit after it survive and you end up with :1:3:1:1:1:1 before the split.
