Split string at first alphabetic character - c#

I have the following string array, as shown in the image. While looping through the array, I need to separate the numeric value from the alphabetic value.
For example:
35.00MY to 35.00 and MY
2.10D8 to 2.10 and D8
80.00YRI to 80.00 and YRI
4.00G8 to 4.00 and G8
I tried the following code, but that didn't help:
foreach (string taxText in taxSplit) {
Regex re = new Regex(@"([a-zA-Z]+)(\d+)");
Match result = re.Match(taxText);
string alphaPart = result.Groups[1].ToString();
string numberPart = result.Groups[2].ToString(); }
Both returned empty

You can bastardize a Split and use a lookahead (?= ... ) and a lookbehind (?<= ... ):
string original = "35.00ab3500bc";
Regex reg = new Regex("(?<=[0-9])(?=[A-Za-z])");
string[] parts = reg.Split(original, 2);
Here, we have to instantiate a new Regex instance because this version of Split isn't available as a static method. The pattern we pass says to find a void where the left side of the void is a number (i.e. the lookbehind), and the right side of the void is a letter (i.e. the lookahead). We pass a 2 to say that we want at most two items in parts.
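As a minimal sketch of how that split behaves on the sample values from the question (the class name SplitDemo and the helper SplitAtFirstLetter are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Zero-width position with a digit on its left and a letter on its right.
    private static readonly Regex Boundary = new Regex("(?<=[0-9])(?=[A-Za-z])");

    // Splits at the first such position; at most two parts are returned.
    public static string[] SplitAtFirstLetter(string text) => Boundary.Split(text, 2);

    public static void Main()
    {
        foreach (string s in new[] { "35.00MY", "2.10D8", "80.00YRI", "4.00G8" })
        {
            string[] parts = SplitAtFirstLetter(s);
            // A string without a digit-letter boundary comes back as a single part.
            Console.WriteLine(string.Join(" | ", parts));
        }
    }
}
```

Because the count is 2, a value such as 2.10D8 is cut only once, giving 2.10 and D8. Check parts.Length before indexing if some inputs might contain no letters at all.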

var lst = new List<string>() { "35.00MY", "2.10D8", "80.00YRI", "4.00GB" };
var res = new List<string>();
lst.ForEach(v =>
{
res.Add(new string(v.TakeWhile(c => !Char.IsLetter(c)).ToArray()));
res.Add(v.TrimStart("01234567890.".ToCharArray()));
} );

Related

Need to extract text in a C# Regex

I have a collection of strings such as Some song [FEAT. John Doe] and I'm trying to extract the 'featured' part. It could be identified by one of several different words "FEAT|FEAT\\.|Featuring" and may or may not be enclosed by brackets. I'm using a Regex for this and here is what I've got so far:
[TestMethod]
public void ExtractFeaturedPerformers()
{
IEnumerable<string> titles = new string[]
{
"abc [FEAT one two] 123",
"def(FEAT. three'four) 456",
"ghi Featuring five",
"jkl"
};
// Must be able to use an arbitrary array of words
var arrayOfWords = new string[] { "FEAT", "FEAT.", "Featuring" };
string options = string.Join("|", arrayOfWords.Select(s => Regex.Escape(s)));
var result = new List<string>();
foreach(string title in titles)
{
var _ = Regex.Match(title, $@"(?<=({options})\s*)(.*?)(?=[\]\)]|$)");
if (_.Success)
result.Add(_.Value);
}
Assert.AreEqual(3, result.Count());
Assert.IsTrue(result.Contains("one two"));
Assert.IsTrue(result.Contains("three'four"));
Assert.IsTrue(result.Contains("five"));
}
I have it mostly working, but there are two limitations. My main problem is that the second result includes the leading '.':
. three'four
How can I remove this as part of the Regex so that I can accept an arbitrary options string rather than stripping it away later? Dealing with the . is my main concern but I would also appreciate suggestions for removing the leading and trailing whitespace from the result so that I don't have to call Trim() later.
You need
(?:FEAT\.?|Featuring)\s*([^])]*)
Details
(?:FEAT\.?|Featuring) - FEAT and an optional . or Featuring
\s* - zero or more whitespace
([^])]*) - Capturing group 1: zero or more chars other than ] and ).
You need to amend the C# code to get Group 1 values.
Here is the full C# demo:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
IEnumerable<string> titles = new string[]
{
"abc [FEAT one two] 123",
"def(FEAT. three'four) 456",
"ghi Featuring five",
"jkl"
};
var keys = new List<string> { "FEAT", "FEAT.", "Featuring" };
keys = keys.OrderByDescending(x => x.Length).ToList();
var pattern = $@"(?:{string.Join("|", keys.Select(z => Regex.Escape(z)))})\s*([^])]*)";
Console.WriteLine(pattern);
var result = new List<string>();
foreach(string title in titles)
{
var _ = Regex.Match(title, pattern);
if (_.Success)
result.Add(_.Groups[1].Value);
}
Console.WriteLine( result.Count()); // Assert.AreEqual(3, result.Count());
Console.WriteLine( result.Contains("one two") ); //Assert.IsTrue(result.Contains("one two"));
Console.WriteLine( result.Contains("three'four") ); //Assert.IsTrue(result.Contains("three'four"));
Console.WriteLine( result.Contains("five") ); // Assert.IsTrue(result.Contains("five"));
}
}
The output is
(?:Featuring|FEAT\.|FEAT)\s*([^])]*)
3
True
True
True
Note how the regex pattern is built:
var keys = new List<string> { "FEAT", "FEAT.", "Featuring" }; initializes the keys string list with the search phrases
keys = keys.OrderByDescending(x => x.Length).ToList(); - sorts the items in the list by length in the descending order
var pattern = $@"(?:{string.Join("|", keys.Select(z => Regex.Escape(z)))})\s*([^])]*)"; - creates the regex pattern by putting the escaped search phrases into a non-capturing group with | alternation operator in between, (?:Featuring|FEAT\.|FEAT).
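To see why the keys are sorted by length first, here is a small sketch that builds the same pattern from the keys in whatever order they are passed (AlternationOrderDemo and Extract are illustrative names, not part of the answer's code). The regex engine tries alternatives from left to right, so a shorter key listed first can win over a longer one that shares its prefix:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Builds the pattern from the keys in the order given and returns group 1,
    // or null when the title contains none of the keys.
    public static string Extract(string title, string[] keys)
    {
        string alternatives = string.Join("|", keys.Select(k => Regex.Escape(k)));
        string pattern = "(?:" + alternatives + @")\s*([^])]*)";
        Match m = Regex.Match(title, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string title = "def(FEAT. three'four) 456";
        string[] unsorted = { "FEAT", "FEAT.", "Featuring" };
        string[] sorted = unsorted.OrderByDescending(k => k.Length).ToArray();

        // "FEAT" is tried first and matches, so the dot ends up in the capture.
        Console.WriteLine(Extract(title, unsorted)); // prints: . three'four
        // "FEAT\." is tried before "FEAT", so the dot is consumed by the key.
        Console.WriteLine(Extract(title, sorted));   // prints: three'four
    }
}
```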

Find pattern to solve regex in one step

I am struggling to find a pattern that solves this problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want is to take up to four Text items. If there are more than four, take only the last character of each additional item.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this i will do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply append the second pattern to the first to form a single pattern. Is there something like an anchor that tells the engine to start matching from this point, as it would if this were the only pattern?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, match 0-3 repetitions of 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe regular expression isn't the way to do that. Mostly because of the readability.
You may consider using simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for (var i = 0; i < parts.Length; i++) {
// Keep the first four parts whole; for later parts, drop the 4-character "Text" prefix.
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
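One possible adjustment, sketched under two assumptions: that "the last sign" means the last character of each extra part, and that inputs with four or fewer parts should come back unchanged, with no trailing $. KeepFourDemo and KeepFour are illustrative names:

```csharp
using System;
using System.Linq;

public static class KeepFourDemo
{
    public static string KeepFour(string input)
    {
        string[] parts = input.Split('$');
        string head = string.Join("$", parts.Take(4));
        if (parts.Length <= 4)
        {
            return head; // nothing to abbreviate, so no trailing '$'
        }
        // From the fifth part on, keep only the last character of each non-empty part.
        var tail = parts.Skip(4).Where(p => p.Length > 0).Select(p => p[p.Length - 1]);
        return head + "$" + string.Concat(tail);
    }

    public static void Main()
    {
        Console.WriteLine(KeepFour("Text1"));
        Console.WriteLine(KeepFour("Text1$Text2$Text3"));
        Console.WriteLine(KeepFour("Text1$Text2$Text3$Text4$Text5$Text6"));
    }
}
```

Unlike Substring(4), taking the last character does not depend on every part starting with a four-character prefix.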
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - match Text literally, then match \d{1,3}, which is one to three digits; \$ matches $ literally
Rest is just repetition of it. Basically, first group captures first four pieces, second group captures the rest, if any.
We also use MatchEvaluator here which is delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, removing the unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
Console.WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second captures the digit of each remaining item. It then extracts the captured text and composes the result from it.

C# use split to separate two parts of string

I have the following text in an Excel spreadsheet cell:
"Calories (kcal) "
(minus quotes).
I can get the value of the cell into my code:
string nutrientLabel = dataRow[0].ToString().Trim();
I'm new to C# and need help separating the "Calories" and "(kcal)" into two different variables that I can upload into my online system. I need the result to be two strings:
nutrientLabel = Calories
nutrientUOM = kcal
I've googled the hell out of this and found out how to make it work to separate them and display into Console.WriteLine but I need the values actually out to 2 variables.
foreach (DataRow dataRow in nutrientsdataTable.Rows)
{
string nutrientLabel = dataRow[0].ToString().Trim();
}
char[] paraSeparator = new char[] { '(', ')' };
string[] result;
Console.WriteLine("=======================================");
Console.WriteLine("Para separated strings :\n");
result = nutrientLabel.Split(paraSeparator,
StringSplitOptions.RemoveEmptyEntries);
foreach (string str in result)
{
Console.WriteLine(str);
}
You can use a simple regex for this:
var reg = new Regex(@"(?<calories>\d+)\s\((?<kcal>\d+)\)");
Which essentially says:
Match at least one number and store it in the group 'calories'
Match a space and an opening parenthesis
Match at least one number and store it in the group 'kcal'
Match a closing parenthesis
Then we can extract the results using the named groups:
var sampleInput = "15 (35)";
var match = reg.Match(sampleInput);
var calories = match.Groups["calories"].Value;
var kcal = match.Groups["kcal"].Value;
Note that calories and kcal are still strings here; you'll need to parse them into an integer (or decimal).
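The pattern above targets numeric input such as 15 (35). For the cell text in the question, "Calories (kcal) ", the same named-group idea could be sketched as follows. The group names label and uom and the helper TrySplit are assumptions made for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelUnitDemo
{
    // label: the text before the parenthesis; uom: the text inside it.
    private static readonly Regex LabelWithUnit =
        new Regex(@"^\s*(?<label>[^()]+?)\s*\((?<uom>[^()]*)\)\s*$");

    // Returns false (and the trimmed cell as label) when there is no "(unit)" part.
    public static bool TrySplit(string cell, out string label, out string uom)
    {
        Match m = LabelWithUnit.Match(cell);
        label = m.Success ? m.Groups["label"].Value : cell.Trim();
        uom = m.Success ? m.Groups["uom"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        TrySplit("Calories (kcal) ", out string nutrientLabel, out string nutrientUOM);
        Console.WriteLine(nutrientLabel); // prints: Calories
        Console.WriteLine(nutrientUOM);   // prints: kcal
    }
}
```

Both values end up in ordinary string variables, so they can be passed on instead of only being written to the console.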
string [] s = dataRow[0].ToString().Split(' ');
nutrientLabel = s[0];
nutrientUOM = s[1].Replace(")","").Replace("(","");

3-digit grouping of all numbers in an alphanumeric string

I found it inefficient to iterate through the parts of a string split on the space character, extract the numeric parts, apply
UInt64.Parse(Regex.Match(numericPart, @"\d+").Value)
and then concatenate them back together to form a string with the numbers grouped.
Is there a better, more efficient way to apply 3-digit grouping to all numbers in a string that contains other characters?
I am pretty sure the most efficient way (CPU-wise, with just a single pass over the string) is the basic foreach loop, along these lines
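var sb = new StringBuilder();
int digitCount = 0; // digits seen so far in the current run
foreach (char c in inputString)
{
if (char.IsDigit(c))
{
// three digits already emitted in this group: insert a "."
if (digitCount > 0 && digitCount % 3 == 0) sb.Append('.');
digitCount++;
}
else
{
digitCount = 0; // not a digit: reset the counter
}
sb.Append(c);
}
return sb.ToString();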
This will produce 123.456.7
If you want 1.234.567 you'll need an additional buffer for digit-sequences
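A rough sketch of that buffered variant, assuming ASCII digits and a fixed separator (GroupDigits and Flush are illustrative names). Each run of digits is collected first, so the groups can be counted from the right when the run ends:

```csharp
using System;
using System.Text;

public static class DigitGroupingDemo
{
    public static string GroupDigits(string input, char separator = '.')
    {
        var output = new StringBuilder(input.Length + input.Length / 3);
        var run = new StringBuilder(); // buffer for the current digit run

        foreach (char c in input)
        {
            if (c >= '0' && c <= '9')
            {
                run.Append(c);
            }
            else
            {
                Flush(run, output, separator);
                output.Append(c);
            }
        }
        Flush(run, output, separator); // the input may end with digits
        return output.ToString();
    }

    private static void Flush(StringBuilder run, StringBuilder output, char separator)
    {
        for (int i = 0; i < run.Length; i++)
        {
            // Insert a separator whenever the remaining digits form whole groups of three.
            if (i > 0 && (run.Length - i) % 3 == 0)
            {
                output.Append(separator);
            }
            output.Append(run[i]);
        }
        run.Clear();
    }

    public static void Main()
    {
        Console.WriteLine(GroupDigits("1234567"));                // prints: 1.234.567
        Console.WriteLine(GroupDigits("hello 134443 in the 33")); // prints: hello 134.443 in the 33
    }
}
```

This is still a single pass over the input; the extra cost is the small buffer for each digit run.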
So you want to replace all longs in a string with the same long but with a number-group-separator of the current culture? .... Yes
string[] words = input.Split();
var newWords = words.Select(w =>
{
long l;
bool isLong = System.Int64.TryParse(w.Trim(), out l);
if(isLong)
return l.ToString("N0");
else
return w;
});
string result = string.Join(" ", newWords);
With the input from your comment:
string input = "hello 134443 in the 33 when 88763 then";
You get the expected result: "hello 134,443 in the 33 when 88,763 then", if your current culture uses comma as number-group-separator.
I will post my regex-based example. I believe regex does not have to be too slow, especially once it is compiled and is declared with static and readonly.
// Declare the regex
private static readonly Regex regex = new Regex(@"(\d)(?=(\d{3})+(?!\d))", RegexOptions.Compiled);
// Then, somewhere inside a method
var replacement = string.Format("$1{0}", System.Globalization.CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator); // Get the system digit grouping separator
var strn = "Hello 34234456 where 3334 is it?"; // Just a sample string
// Somewhere (?:inside a loop)?
var res = regex.Replace(strn, replacement);
Output (if , is a system digit grouping separator):
Hello 34,234,456 where 3,334 is it?

Regular Expression to split a string with comma and double quotes in c#

I have tried a regular expression to split a string with comma and space. Expression matches all the cases except only one. The code I have tried is:
List<string> strNewSplit = new List<string>();
Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
strNewSplit.Add(match.Value.TrimStart(','));
}
return strNewSplit;
CASE1: "MYSQL,ORACLE","C#,ASP.NET"
ExpectedOutput:
"MYSQL,ORACLE"
"C#,ASP.NET"
RESULT : PASS
CASE2: "MYSQL,ORACLE", "C#,ASP.NET"
ExpectedOutput:
"MYSQL,ORACLE"
"C#,ASP.NET"
Actual OutPut:
"MYSQL,ORACLE"
"C#
ASP.NET"
RESULT: FAIL.
If I put a space after the comma between the two double-quoted values, I don't get the expected output. Am I missing anything? Please suggest a better solution.
I normally write down the EBNF of the input I want to parse.
In your case I would say:
List = ListItem {Space* , Space* ListItem}*;
ListItem = """ Identifier """; // Identifier is everything without "
Space = [\t ]+;
This means a List consists of a ListItem that is followed by zero or more (*) ListItems, each separated by optional spaces, a comma, and optional spaces again.
That lead me to the following (you are searching for ListItems):
static void Main(string[] args)
{
matchRegex("\"MYSQL,ORACLE\",\"C#,ASP.NET\"").ForEach(Console.WriteLine);
matchRegex("\"MYSQL,ORACLE\", \"C#,ASP.NET\"").ForEach(Console.WriteLine);
}
static List<string> matchRegex(string input)
{
List<string> strNewSplit = new List<string>();
Regex csvSplit = new Regex(
"(\"(?:[^\"]*)\")"
, RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
strNewSplit.Add(match.Value.TrimStart(','));
}
return strNewSplit;
}
Which returns what you wanted. Hope I understood you correctly.
