Regex.Split on two chars - c#

Input string is: name = valu\=e;
I want to split it with Regex to: name and valu\=e;.
So the split expression should split on a = character that is not prefixed by \.
I want to keep the spaces after name and before valu\=e, etc. I couldn't show that here because SO trims ``.
EDIT: The input string can contain many name=value pairs. Example: name=value;name2=value2;.

You can use this pattern:
@"(?<!\\)="
(?<!..) is a negative lookbehind assertion and means "not preceded by".
Heinzi's question is interesting. If you decide that an even number of backslashes doesn't escape the equal sign, you must replace the pattern with:
@"(?<![^\\](?:\\{2})*\\)="
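As a rough illustration of the two patterns above, here is a minimal sketch that feeds them to Regex.Split. The second sample string (a\\=b\=c) is an assumption made up for the demonstration, and the snippet uses top-level statements (C# 9 or later) to stay short.

```csharp
using System;
using System.Text.RegularExpressions;

string input = @"name = valu\=e;";

// Split on '=' only when it is not directly preceded by a backslash.
string[] simple = Regex.Split(input, @"(?<!\\)=");
Console.WriteLine("[" + simple[0] + "] [" + simple[1] + "]");

// Stricter variant: a doubled backslash is treated as a literal backslash,
// so the '=' that follows it still splits, while \= stays escaped.
string tricky = @"a\\=b\=c";   // hypothetical sample input
string[] strict = Regex.Split(tricky, @"(?<![^\\](?:\\{2})*\\)=");
Console.WriteLine(string.Join(" | ", strict));
```

The spaces around name and the value are kept because nothing is trimmed after the split.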

Instead of using regular expressions, you might use 'regular' code. :)
string items = "name=value;name2 = valu= = e2";
// Split the list on items.
var itemlist = items.Split(';');
// Split each item after the first '='.
var nameValueArrayList = itemlist.Select(s => s.Split("=".ToCharArray(), 2));
// Convert the list of arrays to a dictionary
var nameValues = nameValueArrayList.ToDictionary(i => i[0], i => i[1]);
MessageBox.Show("<<<" + nameValues["name2 "] + ">>>");
Or in short:
string items = "name=value;name2 = valu= = e2";
var nameValues = items
.Split(';')
.Select(s => s.Split("=".ToCharArray(), 2))
.ToDictionary(i => i[0], i => i[1]);
MessageBox.Show("<<<" + nameValues["name2 "] + ">>>");
I personally think that code like this is easier to maintain or pull apart and modify when the specs change. And it gives you an actual dictionary from which you can pull values by their key.
Maybe it's possible to write this even a little shorter, but I'm still practicing with this. :)

(?<name>[^=]+)=(?<value>[^;]+;)
then use the named groups "name" and "value" to retrieve each part separately.
e.g:
var matches = System.Text.RegularExpressions.Regex.Matches("myInput", @"(?<name>[^=]+)=(?<value>[^;]+;)");
foreach(Match match in matches)
{
var name = match.Groups["name"];
var value = match.Groups["value"];
doSomething(name, value);
}
EDIT:
I don't know why you say it won't work; here is what I get in LINQPad using the input you gave me in the comments:
void Main()
{
var matches = System.Text.RegularExpressions.Regex.Matches(@"zenek=ben\\\;\\ek;juzek=jozek;benek2=true;krowa=-2147483648;du-pa=\\\\\\3/\\\=3\;3\;;", @"(?<name>[^=]+)=(?<value>[^;]+;)");
foreach(Match match in matches)
{
var name = match.Groups["name"].Value;
var value = match.Groups["value"].Value;
("Name: "+name).Dump();
("Value: "+value).Dump();
}
}
Results:
Name: zenek
Value: ben\\\;
Name: \\ek;juzek
Value: jozek;
Name: benek2
Value: true;
Name: krowa
Value: -2147483648;
Name: du-pa
Value: \\\\\\3/\\\=3\;
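The results above show the pattern stopping at escaped semicolons (the second name comes out as \\ek;juzek). One possible adjustment, sketched here under the assumption that a backslash escapes whichever character follows it, is to consume each backslash together with the next character. The pattern below is not from the original answer, and unlike that answer it leaves the terminating ; out of the value group.

```csharp
using System;
using System.Text.RegularExpressions;

string input = @"zenek=ben\\\;\\ek;juzek=jozek;benek2=true;krowa=-2147483648;du-pa=\\\\\\3/\\\=3\;3\;;";

// \\. consumes a backslash together with the character it escapes,
// so an escaped '=' or ';' can never terminate a name or a value.
var pairPattern = new Regex(@"(?<name>(?:\\.|[^=;\\])+)=(?<value>(?:\\.|[^;\\])*);");

MatchCollection pairs = pairPattern.Matches(input);
foreach (Match pair in pairs)
{
    string name = pair.Groups["name"].Value;
    string value = pair.Groups["value"].Value;
    Console.WriteLine("Name: " + name + "  Value: " + value);
}
```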

Related

Regular expression split string, extract string value before and numeric value between square brackets

I need to parse a string that looks like "Abc[123]". The numerical value between the brackets is needed, as well as the string value before the brackets.
Most of the examples I tested work fine, but some special cases are not parsed correctly.
This code seems to work fine for "normal" cases, but has some problems handling "special" cases:
var pattern = @"\[(.*[0-9])\]";
var query = "Abc[123]";
var numVal = Regex.Matches(query, pattern).Cast<Match>().Select(m => m.Groups[1].Value).FirstOrDefault();
var stringVal = Regex.Split(query, pattern)
.Select(x => x.Trim())
.FirstOrDefault();
How should the code be adjusted to also handle some special cases?
For instance, for the string "Abc[]" the parser should correctly return "Abc" as the string value and indicate an empty numeric value (which could eventually be defaulted to 0).
For the string "Abc[xy33]" the parser should return "Abc" as the string value and indicate an invalid numeric value.
For the string "Abc" the parser should return "Abc" as the string value and indicate a missing numeric value. Blanks before, after, or inside the brackets should be trimmed, as in "Abc [ 123 ] ".
Try this pattern: ^([^\[]+)\[([^\]]*)\]
Explanation of the pattern:
^ - match the beginning of the string
([^\[]+) - match one or more of any character except [ and store it inside the first capturing group
\[ - match [ literally
([^\]]*) - match zero or more of any character except ] and store inside second capturing group
\] - match ] literally
Here's tested code:
var pattern = @"^([^\[]+)\[([^\]]*)\]";
var queries = new string[]{ "Abc[123]", "Abc[xy33]", "Abc[]", "Abc[ 33 ]", "Abc" };
foreach (var query in queries)
{
string beforeBrackets;
string insideBrackets;
var match = Regex.Match(query, pattern);
if (match.Success)
{
beforeBrackets = match.Groups[1].Value;
insideBrackets = match.Groups[2].Value.Trim();
if (insideBrackets == "")
insideBrackets = "0";
else if (!int.TryParse(insideBrackets, out int i))
insideBrackets = "incorrect value!";
}
else
{
beforeBrackets = query;
insideBrackets = "no value";
}
Console.WriteLine($"Input string {query} : before brackets: {beforeBrackets}, inside brackets: {insideBrackets}");
}
Console.ReadKey();
Output:
We can try doing a regex replacement on the input, for a one-liner solution:
string input = "Abc[123]";
string letters = Regex.Replace(input, "\\[.*\\]", "");
string numbers = Regex.Replace(input, ".*\\[(\\d+)\\]", "$1");
Console.WriteLine(letters);
Console.WriteLine(numbers);
This prints:
Abc
123
There are probably language-level techniques for this that I'm not aware of. With a regular expression, though, we could capture each part in its own capturing group and check the groups one by one, maybe:
^([A-Za-z]+)\s*(\[?)\s*([A-Za-z]*)(\d*)\s*(\]?)\s*$
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.
You can achieve that easily without using regex
string temp = "Abc[123]";
string[] arr = temp.Split('[');
string name = arr[0];
string value = arr[1].TrimEnd(']');
Output: name = Abc and value = 123. Note that arr[1] does not exist when the input contains no [.
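For the special cases listed in the question (empty brackets, no brackets, surrounding blanks), a non-regex approach could look roughly like the sketch below. ParseBracketed is a hypothetical helper name, and returning null for a missing value is an assumption about how the caller wants to detect that case; the snippet uses top-level statements and tuples (C# 9 or later).

```csharp
using System;

// Hypothetical helper: returns the trimmed text before '[' and the trimmed text
// inside the brackets; Inner is null when no complete [...] pair is present.
static (string Name, string Inner) ParseBracketed(string input)
{
    int open = input.IndexOf('[');
    if (open < 0)
        return (input.Trim(), null);           // no brackets at all
    string name = input.Substring(0, open).Trim();
    int close = input.IndexOf(']', open + 1);
    if (close < 0)
        return (name, null);                   // unterminated bracket
    return (name, input.Substring(open + 1, close - open - 1).Trim());
}

foreach (var sample in new[] { "Abc[123]", "Abc[]", "Abc [ 123 ] ", "Abc" })
{
    var (name, inner) = ParseBracketed(sample);
    string shown = inner ?? "(missing)";
    Console.WriteLine(name + " -> '" + shown + "'");
}
```

The caller can then run int.TryParse on the inner text to tell a valid number from something like xy33.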

How do I get quoted fields from a delimited string as a list of unquoted values using LINQ?

Original text line is:
"125"|"Bio Methyl"|"99991"|"OPT12"|"CB"|"1"|"12"|"5"|"23"
Expected string list is free of double quotes and split by |:
125
Bio Methyl
99991
The text may contain empty quoted strings as in (former "OPT12" value now empty ""):
"125"|"Bio Methyl"|"99991"|""|"CB"|"1"|"12"|"5"|"23"
So I checked these two questions and answers, QA1 and QA2, to derive my solution.
var eList = uEList.ElementAt(i).Split(BarDelimiter);
var xList = eList.ElementAt(0).Where(char.IsDigit).ToList();
Of course it doesn't work the way I need it to be since xList is a list with elements like this: xList(0) = 1, xList(1) = 2, xList(2) = 5
I do not want to write another line to join them because this doesn't look like a suitable solution. There has to be something better with LINQ right?
How about this:
// Based on OPs comment: preserve empty non-quoted entries.
var splitOptions = StringSplitOptions.None;
// Change to the line below if empty entries should be removed.
//var splitOptions = StringSplitOptions.RemoveEmptyEntries;
var line = "\"125\"|\"Bio Methyl\"|\"99991\"|\"OPT12\"|\"CB\"|\"1\"|\"12\"|\"5\"|\"23\"";
var result = line
.Split(new[] { "|" }, splitOptions)
.Select(p => p.Trim('\"'))
.ToList();
Console.WriteLine(string.Join(", ", result));
The Split(...) statement splits the input into an array with parts like
{ \"99991\", \"OPT12\", ... };
The p.Trim('\"') statement removes the leading and trailing quote from each of the parts.
As an alternative to the trimming, if there's no " in your values, you could simply sanitize the input before splitting it. You can do so by replacing the " symbol by nothing (either "" or string.Empty).
Your Split code would then give the correct result afterwards:
string uEList = "\"125\"|\"Bio Methyl\"|\"99991\"|\"OPT12\"|\"CB\"|\"1\"|\"12\"|\"5\"|\"23\"";
var eList = uEList.Replace("\"", string.Empty).Split(BarDelimiter);

Omit unnecessary parts in string array

In C#, I have a string that comes from a file in this format:
Type="Data"><Path.Style><Style
or maybe
Type="Program"><Rectangle.Style><Style
and so on. Now I want to extract only the Data or Program part of the Type element. For that, I used the following code:
string output;
var pair = inputKeyValue.Split('=');
if (pair[0] == "Type")
{
output = pair[1].Trim('"');
}
But it gives me this result:
output=Data><Path.Style><Style
What I want is:
output=Data
How to do that?
This code example takes an input string, splits by double quotes, and takes only the first 2 items, then joins them together to create your final string.
string input = "Type=\"Data\"><Path.Style><Style";
var parts = input
.Split('"')
.Take(2);
string output = string.Join("", parts); //note: .net 4 or higher
This will make output have the value:
Type=Data
If you only want output to be "Data", then do
var parts = input
.Split('"')
.Skip(1)
.Take(1);
or
var output = input
.Split('"')[1];
What you can do is use a very simple regular expression to parse out the bits that you want. In your case you want something that looks like this, and then you grab the two groups that interest you:
(Type)="(\w+)"
This returns, in groups 1 and 2, the value Type and the word characters contained between the double quotes.
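A minimal sketch of reading those two groups with Regex.Match follows. The variable names are illustrative, and falling back to null when nothing matches is an assumption; the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Type=\"Data\"><Path.Style><Style";

// Group 1 captures the attribute name, group 2 the word characters between the quotes.
Match m = Regex.Match(input, "(Type)=\"(\\w+)\"");
string key = m.Success ? m.Groups[1].Value : null;
string output = m.Success ? m.Groups[2].Value : null;

Console.WriteLine(key + " -> " + output);
```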
Instead of doing many splits, why don't you just use Regex:
output = Regex.Match(pair[1], "\"(\\w*)\"").Groups[1].Value;
Maybe I missed something, but what about this:
var str = "Type=\"Program\"><Rectangle.Style><Style";
var splitted = str.Split('"');
var type = splitted[1]; // i.e. Data or Program
But you will need some error handling as well.
How about a regex?
var regex = new Regex("(?<=^Type=\").*?(?=\")");
var output = regex.Match(input).Value;
Explanation of the regex:
(?<=^Type=\") A prefix match (lookbehind). It's not included in the result, but the regex only matches if the string starts with Type="
.*? A non-greedy match. It matches as few characters as possible until the suffix matches
(?=\") A suffix match (lookahead). It's not included in the result, but the regex only matches if the next character is "
Given your specified format:
Type="Program"><Rectangle.Style><Style
It seems logical to me to include the quote mark (") when splitting the strings... then you just have to detect the end quote mark and extract the contents. You can use LINQ to do this:
string code = "Type=\"Program\"><Rectangle.Style><Style";
string[] parts = code.Split(new string[] { "=\"" }, StringSplitOptions.None);
string[] wantedParts = parts.Where(p => p.Contains("\"")).
Select(p => p.Substring(0, p.IndexOf("\""))).ToArray();

Replace all alphanumeric characters in a string except pattern

I'm trying to obfuscate a string, but need to preserve a couple patterns. Basically, all alphanumeric characters need to be replaced with a single character (say 'X'), but the following (example) patterns need to be preserved (note that each pattern has a single space at the beginning)
QQQ"
RRR"
I've looked through a few samples on negative lookaheads/lookbehinds, but still haven't had any luck with this (only testing QQQ).
var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
var regex = new Regex(@"((?!QQQ)(?<!\sQ{1,3}))[0-9a-zA-Z]");
var result = regex.Replace(test, "X");
The correct result should be:
"XXXX XXXX XXXXX XXXXX QQQ""XXXXX""XXXXX"XXXXXXXX
This works for an exact match, but will fail with something like ' QQR"', which returns
"XXXX XXXX XXXXX XXXXX XQR""XXXXX""XXXXX"XXXXXXXX
You can use this:
var regex = new Regex(@"((?> QQQ|[^A-Za-z0-9]+)*)[A-Za-z0-9]");
var result = regex.Replace(test, "$1X");
The idea is to match all that must be preserved first and to put it in a capturing group.
Since the target characters are always preceded by zero or more things that must be preserved, you only need to write this capturing group before [A-Za-z0-9]
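The same idea can be extended to several preserved patterns by building the alternation from a list. The sketch below assumes the patterns are the two from the question, each with a leading space and a trailing quote. Mask is a hypothetical helper name, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces every ASCII letter or digit with 'X',
// except inside the listed patterns (the list is assumed to be non-empty).
static string Mask(string input, string[] preserved)
{
    // Escape each pattern so it is matched literally, then join them as alternatives.
    string alternatives = string.Join("|", preserved.Select(p => Regex.Escape(p)));
    var regex = new Regex("((?>" + alternatives + "|[^A-Za-z0-9]+)*)[A-Za-z0-9]");
    // $1 puts back everything that was preserved before the replaced character.
    return regex.Replace(input, "$1X");
}

string test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
string masked = Mask(test, new[] { " QQQ\"", " RRR\"" });
Console.WriteLine(masked);
```

A near miss such as QQR is not in the list, so all of its characters are replaced.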
Here's a non-regex solution. It works quite nicely, although it fails when a pattern occurs more than once in the input sequence; it would need a better algorithm for finding occurrences. You can compare it with a regex solution for large strings.
public static string ReplaceWithPatterns(this string input, IEnumerable<string> patterns, char replacement)
{
    // Locate the first occurrence of each pattern (IndexOf returns -1 when absent).
    var patternsPositions = patterns.Select(p =>
        new { Pattern = p, Index = input.IndexOf(p) })
        .Where(i => i.Index >= 0)
        .ToList();
    // Mask only letters and digits; quotes, spaces, and punctuation stay as they are.
    var result = new string(input.Select(c => char.IsLetterOrDigit(c) ? replacement : c).ToArray());
    foreach (var p in patternsPositions)
        // Overwrite the masked span with the original pattern text.
        result = result.Remove(p.Index, p.Pattern.Length).Insert(p.Index, p.Pattern);
    return result;
}

string replace using a List<string>

I have a list of words I want to ignore, like this one:
public List<String> ignoreList = new List<String>()
{
"North",
"South",
"East",
"West"
};
For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.
I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.
The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.
How about this:
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));
or for .Net 3:
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());
Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.
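To make the difference concrete, here is a small sketch comparing the word-based filter with a plain substring replace. The sample address is an assumption chosen so that an ignored word also appears inside a longer word; the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var ignoreList = new List<string> { "North", "South", "East", "West" };
string text = "14 Northampton Way North";   // hypothetical sample address

// Word-based filter: only tokens equal to an ignored word are dropped.
string byWords = string.Join(" ", text.Split(' ').Where(w => !ignoreList.Contains(w)));

// Substring-based replace: also cuts "North" out of "Northampton".
string bySubstring = ignoreList.Aggregate(text, (s, w) => s.Replace(w, ""));

Console.WriteLine("[" + byWords + "]");
Console.WriteLine("[" + bySubstring + "]");
```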
Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);
Something like this should work:
string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}
What's wrong with a simple for loop?
string street = "14th Avenue North";
foreach (string word in ignoreList)
{
street = street.Replace(word, string.Empty);
}
If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:
string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");
Result:
14th Avenue
If there are special characters you will need to fix two things:
Use Regex.Escape on each element of ignore list.
The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.
Here's how to fix these two problems:
Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
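A short sketch of the lookaround version in use follows. The extra entry "St." is an assumption added to show an ignored word that ends in a symbol, which is the case where \b would not match. The snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// "St." is a hypothetical extra entry that contains a regex metacharacter.
var ignoreList = new List<string> { "North", "South", "East", "West", "St." };

// Each word is escaped, and must be delimited by a space or by the string start/end.
var regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
    string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));

string a = regex.Replace("14th Avenue North", "");
string b = regex.Replace("5 Northampton St. West", "");

Console.WriteLine("[" + a + "]");
Console.WriteLine("[" + b + "]");
```

The surrounding spaces are left in place, so the result may need a Trim or a whitespace cleanup before it is compared.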
If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:
address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));
If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.
LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.
List<string> ignoreList = new List<string>()
{
"North",
"South",
"East",
"West"
};
string s = "123 West 5th St"
.Split(' ') // Separate the words to an array
.ToList() // Convert the array to a List<string>
.Except(ignoreList) // Remove ignored keywords
.Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string
Why not just Keep It Simple?
public static string Trim(string text)
{
    var rv = text.Trim();
    foreach (var ignore in ignoreList) {
        // Only strip the word when the text ends with it.
        if (rv.EndsWith(ignore)) {
            rv = rv.Replace(ignore, string.Empty);
        }
    }
    return rv;
}
You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:
string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "
public static string Trim(string text)
{
var rv = text;
foreach (var ignore in ignoreList)
rv = rv.Replace(ignore, "");
return rv;
}
Updated For Gabe
public static string Trim(string text)
{
    var rv = "";
    var words = text.Split(' ');
    foreach (var word in words)
    {
        var present = false;
        foreach (var ignore in ignoreList)
            if (word == ignore)
                present = true;
        if (!present)
            rv += word + " "; // keep the separator so the words do not run together
    }
    return rv.TrimEnd();
}
If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.
Here's a start:
(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)
If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.
