Unexpected regex result with a single space

Can somebody please tell me why a space comes up with 2 matches for the below pattern?
((?<key>(?:((?!\d)\w+(?:\.(?!\d)\w+)*)\.)?((?!\d)\w+)):(?<value>([^ "]+)|("[^"]*?")+))*
Trying to match the following cases:
var body = "Key:Hello";
var body = "Key:\"Hello\"";
var body = "Key1:Hello Key2:\"Goodbye\"";
This may provide more context:
pattern = @"((?<key>" + StringExtensions.REGEX_IDENTIFIER_MIDSTRING + "):(?<value>([^ \"]+)|(\"[^\"]*?\")+))*";
My goal is to pull the keys and values out of a command-line-like string in the form [key]:[value], with optional repeats. Values can either be a single word with no spaces or be in quotes with spaces.
Probably right there in front of me but I'm not seeing it.

Probably because of the “.”: a period in a regex matches every character except line breaks.
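Another possibility worth checking, offered as a sketch rather than a confirmed diagnosis: the dots in the pattern are escaped (\.), so they only match a literal period, but the whole expression is wrapped in ( ... )*, which is also allowed to match zero characters. A pattern that can match nothing succeeds with an empty match wherever it cannot consume anything, which for a one-character input means once before the character and once after it. The simplified pattern (\w+:\w+)* below is an assumption standing in for the original, and EmptyMatchDemo and CountMatches are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyMatchDemo
{
    // Simplified stand-in for the original pattern: a group repeated with '*',
    // so the whole expression is allowed to match zero characters.
    public const string StarPattern = @"(\w+:\w+)*";

    // The same group without the '*': an empty match is no longer possible.
    public const string PlainPattern = @"(\w+:\w+)";

    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // A single space: one empty match before it and one after it.
        Console.WriteLine(CountMatches(" ", StarPattern));   // 2
        // Without the trailing '*', nothing matches.
        Console.WriteLine(CountMatches(" ", PlainPattern));  // 0
    }
}
```

If this is the cause, dropping the outer * and iterating over Regex.Matches (skipping any match whose Length is 0) should avoid the two empty results.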

I took a different approach:
public static Dictionary<string, string> GetCommandLineKeyValues(this string commandLine)
{
var keyValues = new Dictionary<string, string>();
var pattern = @"(?<command>(" + StringExtensions.REGEX_IDENTIFIER + " )?)(?<args>.*)";
var args = commandLine.RegexGet(pattern, "args");
Match match;
if (args.Length > 0)
{
string key;
string value;
pattern = @" ?(?<key>" + StringExtensions.REGEX_IDENTIFIER_MIDSTRING + ")*?:(?<value>([^ \"]+)|(\"[^\"]*?\")+)";
do
{
match = args.RegexGetMatch(pattern);
if (match == null)
{
break;
}
key = match.Groups["key"].Value;
value = match.Groups["value"].Value;
keyValues.Add(key, value);
args = match.Replace(args, string.Empty);
}
while (args.RegexIsMatch(pattern));
}
return keyValues;
}
I took what I call the "pac-man" approach to Regex: match, eat (hence the Match.Replace), and continue matching.
For convenience:
public const string REGEX_IDENTIFIER = @"^(?:((?!\d)\w+(?:\.(?!\d)\w+)*)\.)?((?!\d)\w+)$";

Related

How to perform multiple Regex replacements in sequence from a list of unique items cleanly in C#

I'm trying to find a cleaner way of performing multiple sequential replacements on a single string where each replacement has a unique pattern and string replacement.
For example, if I have 3 pairs of patterns-substitutions strings:
1. /(?<!\\)\\n/, "\n"
2. /(\\)(?=[\;\:\,])/, ""
3. /(\\{2})/, "\\"
I want to apply regex replacement 1 on the original string, then apply 2 on the output of 1, and so on and so forth.
The following console program example does exactly what I want, but it has a lot of repetition. I am looking for a cleaner way to do the same thing.
SanitizeString
static public string SanitizeString(string param)
{
string retval = param;
//first replacement
Regex SanitizePattern = new Regex(@"([\\\;\:\,])");
retval = SanitizePattern.Replace(retval, @"\$1");
//second replacement
SanitizePattern = new Regex(@"\r\n?|\n");
retval = SanitizePattern.Replace(retval, @"\n");
return retval;
}
ParseCommands
static public string ParseCommands(string param)
{
string retval = param;
//first replacement
Regex SanitizePattern = new Regex(@"(?<!\\)\\n");
retval = SanitizePattern.Replace(retval, System.Environment.NewLine);
//second replacement
SanitizePattern = new Regex(@"(\\)(?=[\;\:\,])");
retval = SanitizePattern.Replace(retval, "");
//third replacement
SanitizePattern = new Regex(@"(\\{2})");
retval = SanitizePattern.Replace(retval, @"\");
return retval;
}
Main
using System;
using System.IO;
using System.Text.RegularExpressions;
...
static void Main(string[] args)
{
//read text that contains user input
string sampleText = File.ReadAllText(@"c:\sample.txt");
//sanitize input with certain rules
sampleText = SanitizeString(sampleText);
File.WriteAllText(@"c:\sanitized.txt", sampleText);
//parses escaped characters back into the original text
sampleText = ParseCommands(sampleText);
File.WriteAllText(@"c:\parsed_back.txt", sampleText);
}
Don't mind the file operations. I just used that as a quick way to visualize the actual output. In my program I'm going to use something different.

Here's one way:
var replacements = new List<(Regex regex, string replacement)>()
{
(new Regex(@"(?<!\\)\\n"), System.Environment.NewLine),
(new Regex(@"(\\)(?=[\;\:\,])"), ""),
(new Regex(@"(\\{2})"), @"\"),
};
(Ideally, cache that in a static readonly field.)
Then:
string retval = param;
foreach (var (regex, replacement) in replacements)
{
retval = regex.Replace(retval, replacement);
}
Or you could go down the LINQ route:
string retval = replacements
.Aggregate(param, (str, x) => x.regex.Replace(str, x.replacement));
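Putting the two pieces together, a minimal sketch of the cached version might look like the following. It assumes C# 7 tuples are available; the class name CommandParser and the tuple element names Pattern and Replacement are hypothetical, while the three patterns are the ones from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommandParser
{
    // Built once and reused; the pairs are applied in the order listed.
    private static readonly (Regex Pattern, string Replacement)[] Replacements =
    {
        (new Regex(@"(?<!\\)\\n"), Environment.NewLine),
        (new Regex(@"(\\)(?=[\;\:\,])"), ""),
        (new Regex(@"(\\{2})"), @"\"),
    };

    public static string ParseCommands(string param)
    {
        // Feed the output of each replacement into the next one.
        return Replacements.Aggregate(
            param,
            (str, x) => x.Pattern.Replace(str, x.Replacement));
    }

    public static void Main()
    {
        Console.WriteLine(ParseCommands(@"a\;b\:c")); // a;b:c
    }
}
```

Adding a further rule then only means adding one more line to the array.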

Regex without escaping Characters - Problems

I found some solutions for my problem, which is quite simple:
I have a string, which is looking like this:
"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""
My goal is to break it down, so I have a Dictionary of values, like:
Key = "name", Value = "ctl..."
My approach was: Split it by "\r\n" and then by the equal or the colon sign.
This worked fine, but then some funny tester uploaded a file with all allowed characters, which made the string look like this:
"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨##ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"
Of course, the simple splitting doesn't work anymore, since it now splits the filename.
I corrected this by reading out "filename=" and escaping the signs I'm looking to split, and then creating a regex.
Now comes my problem: I found two Regex-samples, which could do the work for the equal sign, the semicolon and the colon. one is:
[^\\]=
The other one I found was:
(?<!\\\\)=
The problem is, the first one doesn't only split, but it splits the equal sign and one character before this sign, which means my key in the Dictionary is "nam" instead of "name"
The second one works fine on this matter, but it still splits the escaped equal sign in the filename.
Is my approach for this problem even working? Would there be a better solution for this? And why is the first Regex cutting a character?
Edit: To avoid confusion, my escaped String looks like this:
"Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨##ç %&()\=~^`'.jpg\""
So I want basically: Split by equal Sign EXCEPT the escaped ones. By the way: The string here shows only one \, but there are 2.
Edit 2: OK seems like I have a working solution, but it's so ugly:
Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
{
Dictionary<string, string> items;
string header;
string[] headerLines;
int start;
int end;
string input = _encoding.GetString(bytes, pos, bytes.Length - pos);
start = input.IndexOf("\r\n", 0);
if (start < 0) return null;
end = input.IndexOf("\r\n\r\n", start);
if (end < 0) return null;
WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content
header = input.Substring(start, end - start);
items = new Dictionary<string, string>();
headerLines = Regex.Split(header, "\r\n");
Regex regLineParts = new Regex(@"(?<!\\\\);");
Regex regColon = new Regex(@"(?<!\\\\):");
Regex regEqualSign = new Regex(@"(?<!\\\\)=");
foreach (string hl in headerLines)
{
string workString = hl;
//Escape the Semicolon in filename
if (hl.Contains("filename"))
{
String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
orig = orig.Substring(0, orig.IndexOf('"'));
string toReplace = orig;
toReplace = toReplace.Replace(toReplace, toReplace.Replace(";", @"\\;"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace(":", @"\\:"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace("=", @"\\="));
workString = hl.Replace(orig, toReplace);
}
string[] lineParts = regLineParts.Split(workString);
for (int i = 0; i < lineParts.Length; i++)
{
string[] p;
if (i == 0)
p = regColon.Split(lineParts[i]);
else
p = regEqualSign.Split(lineParts[i]);
if (p.Length == 2)
{
string orig = p[0];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[0] = orig;
orig = p[1];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[1] = orig;
items.Add(p[0].Trim(), p[1].Trim());
}
}
}
return items;
}
Needs some further testing.

I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.
A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg
key="value"
becomes
key=PLACEHOLDER8230498234098230498
The original string value is stored off in the literalStrings dictionary for later.
So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.
Then I substitute the placeholder values back in before returning the result.
public class HttpHeaderParser
{
public NameValueCollection Parse(string header)
{
var result = new NameValueCollection();
// 'register' any string values;
var stringLiteralRx = new Regex(@"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);
Dictionary<string, string> literalStrings = new Dictionary<string, string>();
var cleanedHeader = stringLiteralRx.Replace(header, m =>
{
var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
literalStrings.Add(replacement, stringLiteral);
return replacement;
});
// now it's safe to split on semicolons to get name-value pairs
var nameValuePairs = semiRx.Split(cleanedHeader);
foreach(var nameValuePair in nameValuePairs)
{
var nameAndValuePieces = equalsRx.Split(nameValuePair);
var name = nameAndValuePieces[0].Trim();
var value = nameAndValuePieces[1];
string replacementValue;
if (literalStrings.TryGetValue(value, out replacementValue))
{
value = replacementValue;
}
result.Add(name, value);
}
return result;
}
}
There's every chance there are some proper bugs in it.
Here's some unit tests you should incorporate, too;
[TestMethod]
public void TestMethod1()
{
var tests = new[] {
new { input=@"foo=bar; baz=quux", expected = @"foo|bar^baz|quux"},
new { input=@"foo=bar;baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""bar"";baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""b,a,r"";baz=""quux""", expected = @"foo|b,a,r^baz|quux"},
new { input=@"foo=""b;r"";baz=""quux""", expected = @"foo|b;r^baz|quux"},
new { input=@"foo=""b\""r"";baz=""quux""", expected = @"foo|b""r^baz|quux"},
new { input=@"foo=""b=r"";baz=""quux""", expected = @"foo|b=r^baz|quux"},
};
var parser = new HttpHeaderParser();
foreach(var test in tests)
{
var actual = parser.Parse(test.input);
var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
Assert.AreEqual(test.expected, actualAsString);
}
}

Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';
x=1
or quoted;
x="foo bar baz"
So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;
x="y=z"
It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like
FileUpload1.HasFile
FileUpload1.FileName
If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
// Verify that the user selected a file
if (file != null && file.ContentLength > 0)
{
// extract only the filename
var fileName = Path.GetFileName(file.FileName);
// store the file inside ~/App_Data/uploads folder
var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
file.SaveAs(path);
}
// redirect back to the index action to show the form once again
return RedirectToAction("Index");
}

This:
(?<!\\\\)=
matches = not preceded by \\.
It should be:
(?<!\\)=
(Make sure you use @ (verbatim) strings for the regex, to avoid confusion)
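As for why the first regex cuts a character: [^\\] is an ordinary character class, so it matches (and Regex.Split therefore removes) the character in front of the equals sign, whereas a lookbehind only checks that character without consuming it. A small sketch of the difference, assuming a single backslash as the escape character; SplitDemo and its method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    public static string[] SplitConsuming(string input)
    {
        // [^\\] matches the character before '=', so that character
        // becomes part of the separator and is dropped from the result.
        return Regex.Split(input, @"[^\\]=");
    }

    public static string[] SplitWithLookbehind(string input)
    {
        // (?<!\\) only asserts that no backslash precedes '='; it consumes nothing.
        return Regex.Split(input, @"(?<!\\)=");
    }

    public static void Main()
    {
        const string input = @"name=a\=b";
        Console.WriteLine(string.Join(" | ", SplitConsuming(input)));      // nam | a\=b
        Console.WriteLine(string.Join(" | ", SplitWithLookbehind(input))); // name | a\=b
    }
}
```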

How to remove " [ ] \ from string

I have a string
"[\"1,1\",\"2,2\"]"
and I want to turn this string into this
1,1,2,2
I am using Replace function for that like
obj.str.Replace("[","").Replace("]","").Replace("\\","");
But it does not return the expected result.
Please help.

You haven't removed the double quotes. Use the following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\\","").Replace("\"", "");

Here is an optimized approach in case the string or the list of exclude-characters is long:
public static class StringExtensions
{
public static String RemoveAll(this string input, params Char[] charactersToRemove)
{
if(string.IsNullOrEmpty(input) || (charactersToRemove==null || charactersToRemove.Length==0))
return input;
var exclude = new HashSet<Char>(charactersToRemove); // removes duplicates and has constant lookup time
var sb = new StringBuilder(input.Length);
foreach (Char c in input)
{
if (!exclude.Contains(c))
sb.Append(c);
}
return sb.ToString();
}
}
Use it in this way:
str = str.RemoveAll('"', '[', ']', '\\');
// or use a string as "remove-array":
string removeChars = "\"{[]\\";
str = str.RemoveAll(removeChars.ToCharArray());

You should do the following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\"","");
The string.Replace method does not replace string content in place. This means that if you have
string test = "12345"; and do
test.Replace("2", "1");
the test string will still be "12345". Replace doesn't change the string itself, but creates a new string with the replaced content. So you need to assign this new string to a new variable or to the same one:
changedTest = test.Replace("2", "1");
Now, changedTest will contain "11345".
Another note on your code is that you don't actually have \ character in your string. It's only displayed in order to escape quote character. If you want to know more about this, please read MSDN article on string literals.

How about:
var exclusions = new HashSet<char>(new[] { '"', '[', ']', '\\' });
return new string(obj.str.Where(c => !exclusions.Contains(c)).ToArray());
To do it all in one sweep.
As Tim Schmelter writes, if you wanted to do it often, especially with large exclusion sets over long strings, you could make an extension like this.
public static string Strip(
this string source,
params char[] exclusions)
{
if (!exclusions.Any())
{
return source;
}
var mask = new HashSet<char>(exclusions);
var result = new StringBuilder(source.Length);
foreach (var c in source.Where(c => !mask.Contains(c)))
{
result.Append(c);
}
return result.ToString();
}
so you could do,
var result = "[\"1,1\",\"2,2\"]".Strip('"', '[', ']', '\\');

Capture the numbers only with this regular expression [0-9]+ and then concatenate the matches:
var input = "[\"1,1\",\"2,2\"]";
var regex = new Regex("[0-9]+");
var matches = regex.Matches(input).Cast<Match>().Select(m => m.Value);
var result = string.Join(",", matches);

How can I parse a value that appears some place in a file using C#

I have strings and each contain a value of RowKey stored like this:
data-RowKey=029
This occurs only once in each file. Is there some way I can get the number out with a C# function or do I have to write some kind of select myself. I have a team mate who suggested linq but I'm not sure if this even works on strings and I don't know how I could use this.
Update:
Sorry I changed this from file to string.

Linq does not really help you here. Use a regular expression to extract the number:
data-RowKey=(\d+)
Update:
Regex r = new Regex(@"data-RowKey=(\d+)");
string abc = "data-RowKey=029"; // sample input from the question
Match match = r.Match(abc);
if (match.Success)
{
string rowKey = match.Groups[1].Value;
}
Code:
public string ExtractRowKey(string filePath)
{
Regex r = new Regex(@"data-RowKey=(\d+)");
using (StreamReader reader = new StreamReader(filePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
Match match = r.Match(line);
if (match.Success) return match.Groups[1].Value;
}
}
// No line contained a RowKey
return null;
}

Assuming that it only exists once in a file, I would even throw an exception otherwise:
String rowKey = null;
try
{
rowKey = File.ReadLines(path)
.Where(l => l.IndexOf("data-RowKey=") > -1)
.Select(l => l.Substring(12 + l.IndexOf("data-RowKey=")))
.Single();
}
catch (InvalidOperationException) {
// you might want to log this exception instead
throw;
}
Edit: The simple approach with a string is to take the first occurrence, which is always of length 3:
rowKey = text.Substring(12 + text.IndexOf("data-RowKey="), 3);

Assuming the following:
The file must contain data-RowKey= (exact match, including case)
The number length is 3
Here is the code snippet:
var fileNames = Directory.GetFiles("rootDirPath");
var tuples = new List<Tuple<String, int>>();
foreach(String fileName in fileNames)
{
String fileData =File.ReadAllText(fileName) ;
int index = fileData.IndexOf("data-RowKey=");
if(index >=0)
{
String numberStr = fileData.Substring(index+12,3);// ASSUMING data-RowKey is always found, and number length is always 3
int number = 0;
int.TryParse(numberStr, out number);
tuples.Add(Tuple.Create(fileName, number));
}
}

Regex g = new Regex(#"data-RowKey=(?<Value>\d+)");
using (StreamReader r = new StreamReader("myFile.txt"))
{
string line;
while ((line = r.ReadLine()) != null)
{
Match m = g.Match(line);
if (m.Success)
{
string v = m.Groups["Value"].Value;
// ...
}
}
}

Replace Multiple String Elements in C#

Is there a better way of doing this...
MyString.Trim().Replace("&", "and").Replace(",", "").Replace("  ", " ")
.Replace(" ", "-").Replace("'", "").Replace("/", "").ToLower();
I've extended the string class to keep it down to one job but is there a quicker way?
public static class StringExtension
{
public static string clean(this string s)
{
return s.Replace("&", "and").Replace(",", "").Replace("  ", " ")
.Replace(" ", "-").Replace("'", "").Replace(".", "")
.Replace("eacute;", "é").ToLower();
}
}
Just for fun (and to stop the arguments in the comments)
I've shoved a gist up benchmarking the various examples below.
https://gist.github.com/ChrisMcKee/5937656
The regex option scores terribly; the dictionary option comes up the fastest; the long winded version of the stringbuilder replace is slightly faster than the short hand.

Quicker - no. More effective - yes, if you use the StringBuilder class. With your implementation, each operation generates a copy of the string, which under some circumstances may impair performance. Strings are immutable objects, so each operation just returns a modified copy.
If you expect this method to be actively called on multiple Strings of significant length, it might be better to "migrate" its implementation onto the StringBuilder class. With it any modification is performed directly on that instance, so you spare unnecessary copy operations.
public static class StringExtention
{
public static string clean(this string s)
{
StringBuilder sb = new StringBuilder (s);
sb.Replace("&", "and");
sb.Replace(",", "");
sb.Replace("  ", " ");
sb.Replace(" ", "-");
sb.Replace("'", "");
sb.Replace(".", "");
sb.Replace("eacute;", "é");
return sb.ToString().ToLower();
}
}

If you are simply after a pretty solution and don't need to save a few nanoseconds, how about some LINQ sugar?
var input = "test1test2test3";
var replacements = new Dictionary<string, string> { { "1", "*" }, { "2", "_" }, { "3", "&" } };
var output = replacements.Aggregate(input, (current, replacement) => current.Replace(replacement.Key, replacement.Value));

This will be more efficient:
public static class StringExtension
{
public static string clean(this string s)
{
return new StringBuilder(s)
.Replace("&", "and")
.Replace(",", "")
.Replace("  ", " ")
.Replace(" ", "-")
.Replace("'", "")
.Replace(".", "")
.Replace("eacute;", "é")
.ToString()
.ToLower();
}
}

Maybe a little more readable?
public static class StringExtension {
private static Dictionary<string, string> _replacements = new Dictionary<string, string>();
static StringExtension() {
_replacements["&"] = "and";
_replacements[","] = "";
_replacements["  "] = " ";
// etc...
}
public static string clean(this string s) {
foreach (string to_replace in _replacements.Keys) {
s = s.Replace(to_replace, _replacements[to_replace]);
}
return s;
}
}
Also add New In Town's suggestion about StringBuilder...

There is one thing that may be optimized in the suggested solutions. Having many calls to Replace() makes the code do multiple passes over the same string. With very long strings, the solutions may be slow because of CPU cache capacity misses. Maybe one should consider replacing multiple strings in a single pass.
The essential content from that link:
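static string MultipleReplace(string text, Dictionary<string, string> replacements) {
// Build one alternation from the keys; Regex.Escape keeps any metacharacters in a key literal
return Regex.Replace(text,
"(" + String.Join("|", replacements.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")",
delegate(Match m) { return replacements[m.Value]; }
);
}
// somewhere else in code (needs System.Linq and System.Text.RegularExpressions)
var adict = new Dictionary<string, string>();
string temp = "Jonathan Smith is a developer";
adict.Add("Jonathan", "David");
adict.Add("Smith", "Seruyange");
string rep = MultipleReplace(temp, adict);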

Another option using LINQ is
[TestMethod]
public void Test()
{
var input = "it's worth a lot of money, if you can find a buyer.";
var expected = "its worth a lot of money if you can find a buyer";
var removeList = new string[] { ".", ",", "'" };
var result = input;
removeList.ToList().ForEach(o => result = result.Replace(o, string.Empty));
Assert.AreEqual(expected, result);
}

I'm doing something similar, but in my case I'm doing serialization/De-serialization so I need to be able to go both directions. I find using a string[][] works nearly identically to the dictionary, including initialization, but you can go the other direction too, returning the substitutes to their original values, something that the dictionary really isn't set up to do.
Edit: You can use Dictionary<Key,List<Values>> in order to obtain same result as string[][]
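A rough sketch of what the string[][] idea could look like, in case it helps. The class name TwoWayReplacer, the method names, and the sample pairs are all hypothetical; the only assumption taken from the description above is that each row holds an original and its substitute, so the same table can drive both directions.

```csharp
using System;

public static class TwoWayReplacer
{
    // Each row is { original, substitute }. The sample pairs are hypothetical.
    private static readonly string[][] Pairs =
    {
        new[] { "&", "&amp;" },
        new[] { "<", "&lt;" },
        new[] { ">", "&gt;" },
    };

    public static string Encode(string s)
    {
        foreach (var pair in Pairs)
        {
            s = s.Replace(pair[0], pair[1]);
        }
        return s;
    }

    public static string Decode(string s)
    {
        // Walk the table backwards so the last substitution applied is undone first.
        for (int i = Pairs.Length - 1; i >= 0; i--)
        {
            s = s.Replace(Pairs[i][1], Pairs[i][0]);
        }
        return s;
    }

    public static void Main()
    {
        string encoded = Encode("a < b & c");
        Console.WriteLine(encoded);          // a &lt; b &amp; c
        Console.WriteLine(Decode(encoded));  // a < b & c
    }
}
```

The order of the rows matters here: "&" is replaced first when encoding and restored last when decoding, so the ampersands introduced by the other substitutes are not escaped twice.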

Regular Expression with MatchEvaluator could also be used:
var pattern = new Regex(#"These|words|are|placed|in|parentheses");
var input = "The matching words in this text are being placed inside parentheses.";
var result = pattern.Replace(input , match=> $"({match.Value})");
Note:
A different expression (such as \b(\w*test\w*)\b) could obviously be used for word matching.
I was hoping it would be more efficient at finding the pattern in the expression and doing the replacements.
The advantage is the ability to process the matched elements while doing the replacements.

This is essentially Paolo Tedesco's answer, but I wanted to make it re-usable.
public class StringMultipleReplaceHelper
{
private readonly Dictionary<string, string> _replacements;
public StringMultipleReplaceHelper(Dictionary<string, string> replacements)
{
_replacements = replacements;
}
public string clean(string s)
{
foreach (string to_replace in _replacements.Keys)
{
s = s.Replace(to_replace, _replacements[to_replace]);
}
return s;
}
}
One thing to note is that I had to stop it being an extension, remove the static modifiers, and remove this from clean(this string s). I'm open to suggestions as to how to implement this better.

string input = "it's worth a lot of money, if you can find a buyer.";
for (dynamic i = 0, repl = new string[,] { { "'", "''" }, { "money", "$" }, { "find", "locate" } }; i < repl.Length / 2; i++) {
input = input.Replace(repl[i, 0], repl[i, 1]);
}
