Split the String and get different output - c#

I have two different strings and need to split each of them to get the output shown below. I have tried different solutions, but none of them worked, and I am stuck.
Input
"12.2 - Chemicals, products and including,14.0 - Plastic products ,17.2 - Metal Products ,19.1 - and optical equipment (excluding and other electronic components; semiconductors; bare printed circuit boards; opti, watches)"
Output
"12.2, 14.0, 17.2, 19.1"
The other case. [Updated as per new codes]
Input
"MD 0102.2.3 - Test 123 ,MD 0102.2.5 - Test Hello ,MD 1101.2 - Dialysis and blood treatment, MDM 123 - Test, MDM 12.32.0 - Test 12"
Output
"MD 0102.2.3, MD 0102.2.5,MD 1101.2,MDM 123,MDM 12.32.0"
I don't understand which logic I need to extract these values.

You can solve both tasks with the help of regular expressions and LINQ. The only difference is in the patterns (fiddle):
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
...
string input = "12.2 - Chemicals, products and including,14.0 - Plastic products ,17.2 - Metal Products ,19.1 - and optical equipment ...";
string pattern = @"[0-9]{1,2}\.[0-9]";
string[] result = Regex
.Matches(input, pattern)
.Cast<Match>()
.Select(match => match.Value)
.ToArray();
Console.Write(string.Join(", ", result));
Note that only the pattern differs:
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
...
string input = "DM 0405 - trtststodfm, fhreuu ,RD 3756 - yeyerffydydyd";
string pattern = @"[A-Z]{2}\s[0-9]{4}";
string[] result = Regex
.Matches(input, pattern)
.Cast<Match>()
.Select(match => match.Value)
.ToArray();
Console.Write(string.Join(", ", result));
Patterns explained:
[A-Z]{2}\s[0-9]{4}
[A-Z]{2} - 2 capital English letters in A..Z range
\s - white space
[0-9]{4} - 4 digits in 0..9 range
[0-9]{1,2}\.[0-9]
[0-9]{1,2} - from 1 up to 2 digits in 0..9 range
\. - dot .
[0-9] - digit in 0..9 range
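The two fixed-width patterns above do not cover the updated codes from the question, such as MD 0102.2.3 or MDM 123. As a minimal sketch, assuming every ID contains no comma and is directly followed by " - ", a single lookahead-based pattern could handle both inputs. The names IdExtractor and ExtractIds are made up for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdExtractor
{
    // Assumption: an ID starts with a character that is neither a comma nor
    // whitespace, contains no comma, and is followed by whitespace, '-', whitespace.
    private static readonly Regex IdPattern = new Regex(@"[^,\s][^,]*?(?=\s+-\s)");

    // Hypothetical helper: returns every ID found in the input, in order.
    public static string[] ExtractIds(string input)
    {
        return IdPattern
            .Matches(input)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }

    public static void Main()
    {
        string input1 = "12.2 - Chemicals, products and including,14.0 - Plastic products ,17.2 - Metal Products ,19.1 - and optical equipment (opti, watches)";
        string input2 = "MD 0102.2.3 - Test 123 ,MD 0102.2.5 - Test Hello ,MD 1101.2 - Dialysis and blood treatment, MDM 123 - Test, MDM 12.32.0 - Test 12";

        Console.WriteLine(string.Join(", ", ExtractIds(input1)));
        Console.WriteLine(string.Join(", ", ExtractIds(input2)));
    }
}
```

The lookahead keeps the " - " separator out of the match, so no trimming is needed. If a description itself contains " - ", the text in front of it would be reported as an ID, so this only works while the stated assumption holds.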

If the input is always in the form {id}-{item},{id}-{item},
I would split the string on the ',' character. After that, you can quickly search through the resulting collection with LINQ and Regex.
But you need to know which formats the ID of an item can have.
If it always looks like
12.2, 14.0, 17.2, 19.1
and does not change, then a simple Regex like the one below suffices, and you can use it in your LINQ query.
new Regex(@"(\d*\.\d*)")
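The split-then-filter idea can be sketched as follows. This is only an illustration under the assumption that an ID is whatever precedes the first " - " in a comma-separated piece; pieces without that separator (the tail of a description that itself contained a comma) are skipped. SplitIdExtractor and ExtractIds are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitIdExtractor
{
    // Assumption: a piece that starts a new item looks like "{id} - {item}".
    private static readonly Regex IdAtStart = new Regex(@"^\s*(.+?)\s+-\s");

    // Hypothetical helper: split on ',' and keep the ID of each piece that has one.
    public static string[] ExtractIds(string input)
    {
        return input
            .Split(',')
            .Select(piece => IdAtStart.Match(piece))
            .Where(match => match.Success)           // drops pieces such as " watches)"
            .Select(match => match.Groups[1].Value)  // the text before " - "
            .ToArray();
    }

    public static void Main()
    {
        string input = "12.2 - Chemicals, products and including,14.0 - Plastic products ,17.2 - Metal Products ,19.1 - and optical equipment (opti, watches)";
        Console.WriteLine(string.Join(", ", ExtractIds(input)));
    }
}
```

Unlike a fixed digit pattern, this does not need to know the ID format in advance, at the cost of relying on the " - " separator.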

You could use this regex: ((\w{2} \d{4})|\d+\.\d+)(?=( - ))
Here's a fiddle demonstrating it: https://dotnetfiddle.net/0pTqFQ
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string input1 = "12.2 - Chemicals, products and including,14.0 - Plastic products ,17.2 - Metal Products ,19.1 - and optical equipment (excluding and other electronic components; semiconductors; bare printed circuit boards; opti, watches)";
string input2 = "DM 0405 - trtststodfm, fhreuu ,RD 3756 - yeyerffydydyd";
var regex1 = new Regex(@"((\w{2} \d{4})|\d+\.\d+)(?=( - ))");
var matches1 = regex1.Matches(input1);
var matches2 = regex1.Matches(input2);
PrintMatches(matches1);
PrintMatches(matches2);
}
private static void PrintMatches(MatchCollection matches)
{
foreach (var match in matches)
{
Console.WriteLine(match);
}
}
}

Related

Extracting integer ranges separated with hyphen

I try to filter some strings I streamed for some useful information in C#.
I got two possible string structures:
string examplestring1 = "from - to (mm) no. 1\r\n\r\nna 570 - 590\r\n60 18.12.20\r\nna 5390 - 5410\r\n60 18.12.20\r\nna 11380 - 11390 60 18.12.20\r\nPage 1/1";
string examplestring2 = "e ne 570 - 590 ne 5390 - 5410 ne 11380 - 11390 e";
I'd like to get an array or a List of strings in the format of "xxx - xxx". Like:
string[] example = new string[]{"570 - 590","5390 - 5410","11380 - 11390"};
I tried to use Regex:
List<string> numbers = new List<string>();
numbers.AddRange(Regex.Split(examplestring2, @"\D+"));
At least I get a list containing only the numbers. But that doesn't work for examplestring1, since there is a date in it.
I also tried playing around with the Regex pattern, but things like the following do not work.
Regex.Split(examplestring1, @"\D+" + " - " + @"\D+");
I'd be grateful for a solution or at least some hint how to solve that matter.
You can use
var results = Regex.Matches(text, @"\d+\s*-\s*\d+").Cast<Match>().Select(x => x.Value);
See the regex demo. If there must be a single regular space on both ends of the -, you can use the \d+ - \d+ regex.
If you want to match any kind of hyphen or dash, you can use [\p{Pd}\xAD] instead of -.
Note that \d in .NET matches any Unicode digit. To match only ASCII digits, use the RegexOptions.ECMAScript option: Regex.Matches(text, @"\d+\s*-\s*\d+", RegexOptions.ECMAScript).
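As a quick check, the pattern can be run against both example strings from the question. This is a small sketch, with RangeExtractor and ExtractRanges as made-up names, using the ECMAScript option so that \d only matches ASCII digits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RangeExtractor
{
    // Hypothetical helper: returns every "number - number" pair found in the text.
    public static List<string> ExtractRanges(string text)
    {
        return Regex
            .Matches(text, @"\d+\s*-\s*\d+", RegexOptions.ECMAScript)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToList();
    }

    public static void Main()
    {
        string examplestring1 = "from - to (mm) no. 1\r\n\r\nna 570 - 590\r\n60 18.12.20\r\nna 5390 - 5410\r\n60 18.12.20\r\nna 11380 - 11390 60 18.12.20\r\nPage 1/1";
        string examplestring2 = "e ne 570 - 590 ne 5390 - 5410 ne 11380 - 11390 e";

        // The dates (18.12.20) contain no hyphen, so they are not picked up.
        Console.WriteLine(string.Join("; ", ExtractRanges(examplestring1)));
        Console.WriteLine(string.Join("; ", ExtractRanges(examplestring2)));
    }
}
```

Both calls should yield the same three ranges, because the pattern requires a hyphen between two digit runs.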

Simplify regex code in C#: Add a space between a digit/decimal and unit

I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use it in the replacement, else add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non-capturing group, matched as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
.NET regex demo
string[] strings = new string[]
{
"10ANYUNIT",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
Regex.Replace(
s, pattern, m =>
m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
)
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Because in .NET \d can also match digits from other scripts, \s can also match a newline, and the pattern might start matching in the middle of a word, a slightly more precise pattern is:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
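A minimal sketch of how the stricter pattern could be wrapped in a helper (DosageFormatter and Normalize are hypothetical names). One detail worth noting: in inputs like 40 e-5 the trailing 5 is matched as a number of its own and gets a space appended, so this sketch trims trailing whitespace, which is an addition not present in the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DosageFormatter
{
    // Group 1: the number. Group 2 (optional): one of % : e E after optional blanks.
    private static readonly Regex NumberWithOptionalSymbol =
        new Regex(@"\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?");

    // Hypothetical helper: glue the symbol to the number, otherwise add a space.
    public static string Normalize(string dosage)
    {
        string replaced = NumberWithOptionalSymbol.Replace(
            dosage,
            m => m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " "));

        // Assumption: a space appended after a number at the very end is unwanted.
        return replaced.TrimEnd();
    }

    public static void Main()
    {
        string[] samples = { "10ANYUNIT", "10:something", "10 : something", "10 %", "40 e-5", "40 E-05" };
        foreach (string sample in samples)
        {
            Console.WriteLine("'" + sample + "' -> '" + Normalize(sample) + "'");
        }
    }
}
```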
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 or 1241.23
Group 3 - ((E|e|%|:)+)
Any of the special symbols E, e, % and :
Group 1 and Group 3 can be separated by any amount of whitespace.
If it's not working the way you're asking, please provide some samples to test.
For me it's too complex to be handled just by one regex. I suggest splitting into separate checks. See below code example - I used four different regexes, first is described in detail, the rest can be deduced based on first explanation.
using System.Text.RegularExpressions;
var testStrings = new string[]
{
"10mg",
"10:something",
"10 : something",
"10 %",
"40 e-5",
"40 E-05",
};
foreach (var testString in testStrings)
{
Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
// First look for exponential notation.
// Pattern is: match zero or more whitespaces \s*
// Then match one or more digits and store it in first capturing group (\d+)
// Then match one or more whitespaces again.
// Then match part with exponent ([eE][-+]?\d+) and store it in second capturing group.
// It will match lower or uppercase 'e' with optional (due to ? operator) dash/plus sign and one or more digits.
// Then match zero or more white spaces.
var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
if(expForMatch.Success)
{
return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
}
var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
if (matchWithColon.Success)
{
return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
}
var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
if (matchWithPercent.Success)
{
return $"{matchWithPercent.Groups[1].Value}%";
}
var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
if (matchWithUnit.Success)
{
return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
}
return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'

Separating numbers, punctuation and letters throughout the whole case

What I'm trying to achieve:
Split the string into separate parts of numbers, punctuation (except . and , which belong to the number part), and letters.
Example:
Case 1:
C_TANTALB
Result:
Alpha[]: C, TANTALB
Beta[]:
Zeta[]: _
Case 2:
BGA-100_T0.8
Result:
Alpha[]: BGA, T
Beta[]: 100, 0.8
Zeta[]: -, _
Case 3: C0201
Result:
Alpha[]: C
Beta[]: 0201
Zeta[]:
I've found this post, but it doesn't do the entire job for me: it fails on example 1, not even returning the alpha part, and it doesn't find the punctuation.
Any help would be appreciated.
If iterating over the string and testing each character with IsDigit and IsLetter is too complex,
you can use a Regex for this: (?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)
1/. Named Capture Group Alfas (?<Alfas>[a-zA-Z]+)
Match a single character present in the list below [a-zA-Z]+
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2/. Named Capture Group Digits (?<Digits>[\d,.]+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
3/. Named Capture Group Others (?<Others>[^a-zA-Z\d]+)
Match a single character not present in the list below [^a-zA-Z\d]
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
\d matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Then, to get the values of one group:
var matches = Regex.Matches(testInput, pattern).Cast<Match>();
var alfas = matches.Where(x => !string.IsNullOrEmpty(x.Groups["Alfas"].Value))
.Select(x=> x.Value)
.ToList();
LiveDemo
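To collect all three lists at once, the same idea can be extended as in the following sketch. It assumes the Digits group is [\d,.]+ and the Others group is [^a-zA-Z\d]+, so that dots and commas stay with the numbers as the question asks; TokenSplitter and GroupValues are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenSplitter
{
    // Alternation order matters: Digits is tried before Others, so '.' and ','
    // that directly follow letters or digits end up in the number part.
    private static readonly Regex TokenPattern =
        new Regex(@"(?<Alfas>[a-zA-Z]+)|(?<Digits>[\d,.]+)|(?<Others>[^a-zA-Z\d]+)");

    // Hypothetical helper: values of all matches in which the named group took part.
    public static List<string> GroupValues(string input, string groupName)
    {
        return TokenPattern
            .Matches(input)
            .Cast<Match>()
            .Where(match => match.Groups[groupName].Success)
            .Select(match => match.Value)
            .ToList();
    }

    public static void Main()
    {
        string input = "BGA-100_T0.8";
        Console.WriteLine("Alpha: " + string.Join(", ", GroupValues(input, "Alfas")));
        Console.WriteLine("Beta: " + string.Join(", ", GroupValues(input, "Digits")));
        Console.WriteLine("Zeta: " + string.Join(", ", GroupValues(input, "Others")));
    }
}
```

Checking Groups[name].Success instead of comparing the value with an empty string makes the intent a little more explicit, since each match comes from exactly one of the three alternatives.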
Probably the simplest way to do this is with 3 separate regular expressions; one for each class of characters.
[A-Za-z]+ for letter sequences
[\d.,]+ for numbers
[-_]+ for punctuation (incomplete for now; please feel free to extend the list)
Example:
using System;
using System.Linq;
using System.Text.RegularExpressions;
class MainClass
{
private static readonly Regex _regexAlpha = new Regex(@"[A-Za-z]+");
private static readonly Regex _regexBeta = new Regex(@"[\d.,]+");
private static readonly Regex _regexZeta = new Regex(@"[-_]+");
public static void Main (string[] args)
{
Console.Write("Input: ");
string input = Console.ReadLine();
var resultAlpha = _regexAlpha.Matches(input).Select(m => m.Value);
var resultBeta = _regexBeta.Matches(input).Select(m => m.Value);
var resultZeta = _regexZeta.Matches(input).Select(m => m.Value);
Console.WriteLine($"Alpha: {string.Join(", ", resultAlpha)}");
Console.WriteLine($"Beta: {string.Join(", ", resultBeta)}");
Console.WriteLine($"Zeta: {string.Join(", ", resultZeta)}");
}
}
Sample output:
Input: ABC_3.14m--nop
Alpha: ABC, m, nop
Beta: 3.14
Zeta: _, --
Live demo: https://repl.it/repls/LopsidedUsefulBucket

Display all possible matches for a regex pattern

I have the following RegEx pattern in order to determine some 3-digit exchanges of phone numbers:
(?:2(?:04|[23]6|[48]9|50)|3(?:06|43|65)|4(?:03|1[68]|3[178]|50)|5(?:06|1[49]|79|8[17])|6(?:0[04]|13|39|47)|7(?:0[59]|78|8[02])|8(?:[06]7|19|73)|90[25])
It looks pretty daunting, but it only yields around 40 or 50 numbers. Is there a way in C# to generate all numbers that match this pattern? Offhand, I know I can loop through the numbers 001 thru 999, and check each number against the pattern, but is there a cleaner, built-in way to just generate a list or array of matches?
ie - {"204","226","236",...}
No, there is no off the shelf tool to determine all matches given a regex pattern. Brute force is the only way to test the pattern.
Update
It is unclear why you are using (?: ), which is "match but don't capture". It is used to anchor a match. For example, take the text phone:303-867-5309, where we don't care about the phone: prefix but we want the number.
The pattern used would be
(?:phone\:)(\d{3}-\d{3}-\d{4})
which would match the whole line, but the only capture group returned (Group 1) would contain just the phone number 303-867-5309.
So (?: ), as mentioned, is used to anchor a capture at a specific point; the text it matches is not captured.
With that said, I have redone your pattern with comments and a test to 2000:
string pattern = @"
^ # Start at beginning of line so no mid number matches erroneously found
(
2(04|[23]6|[48]9|50) # 2 series only match 204, 226, 236, 249, 250, 289
| # Or it is not 2, then match:
3(06|43|65) # 3 series only match 306, 343, 365
)
$ # Further Anchor it to the end of the string to keep it to 3 numbers";
// RegexOptions.IgnorePatternWhitespace allows us to put the pattern over multiple lines and comment it. Does not
// affect regex parsing/processing.
var results = Enumerable.Range(0, 2000) // Test to 2000 so we don't get any non 3 digit matches.
.Select(num => num.ToString().PadLeft(3, '0'))
.Where (num => Regex.IsMatch(num, pattern, RegexOptions.IgnorePatternWhitespace))
.ToArray();
Console.WriteLine ("These results found {0}", string.Join(", ", results));
// These results found 204, 226, 236, 249, 250, 289, 306, 343, 365
I took the advice of @LucasTrzesniewski and just looped through the possible values. Since I know I'm dealing with 3-digit numbers, I just looped through the strings "000" through "999" and checked for matches like this:
private static void FindRegExMatches(string pattern)
{
for (var i = 0; i < 1000; i++)
{
var numberString = i.ToString().PadLeft(3, '0');
if (!Regex.IsMatch(numberString, pattern)) continue;
Console.WriteLine("Found a match: {0}", numberString);
}
}
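The brute-force loop can also return the matches as a list instead of printing them. The following is a sketch only: PatternEnumerator and ThreeDigitMatches are hypothetical names, and wrapping the pattern in ^(?:...)$ is an added assumption so that only complete 3-digit strings count as matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternEnumerator
{
    // Hypothetical helper: every string from "000" to "999" that the pattern matches in full.
    public static List<string> ThreeDigitMatches(string pattern)
    {
        // Anchor the caller's pattern so it has to cover the whole candidate string.
        var anchored = new Regex("^(?:" + pattern + ")$");

        return Enumerable.Range(0, 1000)
            .Select(number => number.ToString("D3"))   // zero-padded, e.g. 7 becomes "007"
            .Where(candidate => anchored.IsMatch(candidate))
            .ToList();
    }

    public static void Main()
    {
        string pattern = @"(?:2(?:04|[23]6|[48]9|50)|3(?:06|43|65)|4(?:03|1[68]|3[178]|50)|5(?:06|1[49]|79|8[17])|6(?:0[04]|13|39|47)|7(?:0[59]|78|8[02])|8(?:[06]7|19|73)|90[25])";
        List<string> matches = ThreeDigitMatches(pattern);
        Console.WriteLine(matches.Count + " matches: " + string.Join(", ", matches));
    }
}
```

For the pattern in the question this enumerates 38 exchanges, from 204 up to 905.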

Regex to not match a certain sequence

I have a text file as below:
1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..
So I want a regex that gets 4 matches in this case, one for each point. My regex doesn't work as I wish. Please advise:
private readonly Regex _reactionRegex = new Regex(@"(\d+)\.(\d+)\s*-\s*(.+)", RegexOptions.Compiled | RegexOptions.Singleline);
even this regex isn't very helpful:
(\d+)\.(\d+)\s*-\s*(.+)(?<!\d+\.\d+)
Alex, this regex will do it:
(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)
This is assuming that you want to capture the point, without the numbers, for instance: just Hello
If you want to also capture the digits, for instance 1.1 - Hello, you can use the same regex and display the entire match, not just Group 1. The online demo below will show you both.
How does it work?
The idea is to capture the text you want to Group 1 using (parentheses).
We match in multi-line mode m to allow the anchor ^ to work on each line.
We match in dotall mode s to allow the dot to eat up strings on multiple lines
We use a negative lookahead (?! to stop eating characters when what follows is the beginning of the line with your digit marker
Here is full working code and an online demo.
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program {
static void Main() {
string yourstring = @"1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..";
var resultList = new StringCollection();
try {
var yourRegex = new Regex(@"(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)");
Match matchResult = yourRegex.Match(yourstring);
while (matchResult.Success) {
resultList.Add(matchResult.Groups[1].Value);
Console.WriteLine("Whole Match: " + matchResult.Value);
Console.WriteLine("Group 1: " + matchResult.Groups[1].Value + "\n");
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
This may do for what you're looking for, though there is some ambiguity of the expected result.
(\d+)\.(\d+)\s*-\s*(.+?)(\n)(?>\d|$)
The ambiguity is for example what would you expect to match if data looked like:
1.1 - Hello
1.2 - world!
2.1 - Some
data here and it contains some
32 digits so i cannot use \D+
2.2 - Etc..
Not clear if 32 here starts a new record or not.
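One way to sidestep that ambiguity is to treat a line as the start of a new record only when it begins with the full marker, that is digits, a dot, digits and a hyphen. The sketch below splits the text at such positions with a zero-width lookahead; RecordSplitter and SplitRecords are hypothetical names, and the marker definition is an assumption about the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // Hypothetical helper: cut the text in front of every line that starts with "N.N -".
    public static List<string> SplitRecords(string text)
    {
        // (?m) lets ^ match at each line start; the lookahead consumes nothing,
        // so the marker stays at the beginning of its record.
        return Regex
            .Split(text, @"(?m)^(?=\d+\.\d+\s*-)")
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)   // drop the empty piece before the first record
            .ToList();
    }

    public static void Main()
    {
        string text = "1.1 - Hello\n1.2 - world!\n2.1 - Some\ndata here and it contains some\n32 digits so i cannot use \\D+\n2.2 - Etc..";
        foreach (string record in SplitRecords(text))
        {
            Console.WriteLine("[" + record + "]");
        }
    }
}
```

With this rule the line starting with 32 stays inside record 2.1, because it has no dot and hyphen after the number.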
