Regex without escaping Characters - Problems

Regex without escaping Characters - Problems - c#

I found some solutions for my problem, which is quite simple:
I have a string, which is looking like this:
"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""
My goal is to break it down, so I have a Dictionary of values, like:
Key = "name", value ? "ctl..."
My approach was: Split it by "\r\n" and then by the equal or the colon sign.
This worked fine, but then some funny Tester uploaded a file with all allowed charactes, which made the String looking like this:
"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨##ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"
Of course, the simple splitting doesn't work anymore, since it splits now the filename.
I corrected this by reading out "filename=" and escaping the signs I'm looking to split, and then creating a regex.
Now comes my problem: I found two Regex-samples, which could do the work for the equal sign, the semicolon and the colon. one is:
[^\\]=
The other one I found was:
(?<!\\\\)=
The problem is, the first one doesn't only split, but it splits the equal sign and one character before this sign, which means my key in the Dictionary is "nam" instead of "name"
The second one works fine on this matter, but it still splits the escaped equal sign in the filename.
Is my approach for this problem even working? Would there be a better solution for this? And why is the first Regex cutting a character?
Edit: To avoid confusion, my escaped String looks like this:
"Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨##ç %&()\=~^`'.jpg\""
So I want basically: Split by equal Sign EXCEPT the escaped ones. By the way: The string here shows only one \, but there are 2.
Edit 2: OK seems like I have a working solution, but it's so ugly:
Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
{
Dictionary<string, string> items;
string header;
string[] headerLines;
int start;
int end;
string input = _encoding.GetString(bytes, pos, bytes.Length - pos);
start = input.IndexOf("\r\n", 0);
if (start < 0) return null;
end = input.IndexOf("\r\n\r\n", start);
if (end < 0) return null;
WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content
header = input.Substring(start, end - start);
items = new Dictionary<string, string>();
headerLines = Regex.Split(header, "\r\n");
Regex regLineParts = new Regex(#"(?<!\\\\);");
Regex regColon = new Regex(#"(?<!\\\\):");
Regex regEqualSign = new Regex(#"(?<!\\\\)=");
foreach (string hl in headerLines)
{
string workString = hl;
//Escape the Semicolon in filename
if (hl.Contains("filename"))
{
String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
orig = orig.Substring(0, orig.IndexOf('"'));
string toReplace = orig;
toReplace = toReplace.Replace(toReplace, toReplace.Replace(";", #"\\;"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace(":", #"\\:"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace("=", #"\\="));
workString = hl.Replace(orig, toReplace);
}
string[] lineParts = regLineParts.Split(workString);
for (int i = 0; i < lineParts.Length; i++)
{
string[] p;
if (i == 0)
p = regColon.Split(lineParts[i]);
else
p = regEqualSign.Split(lineParts[i]);
if (p.Length == 2)
{
string orig = p[0];
orig = orig.Replace(#"\\;", ";");
orig = orig.Replace(#"\\:", ":");
orig = orig.Replace(#"\\=", "=");
p[0] = orig;
orig = p[1];
orig = orig.Replace(#"\\;", ";");
orig = orig.Replace(#"\\:", ":");
orig = orig.Replace(#"\\=", "=");
p[1] = orig;
items.Add(p[0].Trim(), p[1].Trim());
}
}
}
return items;
}
Needs some further testing.

I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.
A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg
key="value"
becomes
key=PLACEHOLDER8230498234098230498
The original string value is stored off in the literalStrings dictionary for later.
So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.
Then I substitute the placeholder values back in before returning the result.
public class HttpHeaderParser
{
public NameValueCollection Parse(string header)
{
var result = new NameValueCollection();
// 'register' any string values;
var stringLiteralRx = new Regex(#"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);
Dictionary<string, string> literalStrings = new Dictionary<string, string>();
var cleanedHeader = stringLiteralRx.Replace(header, m =>
{
var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
literalStrings.Add(replacement, stringLiteral);
return replacement;
});
// now it's safe to split on semicolons to get name-value pairs
var nameValuePairs = semiRx.Split(cleanedHeader);
foreach(var nameValuePair in nameValuePairs)
{
var nameAndValuePieces = equalsRx.Split(nameValuePair);
var name = nameAndValuePieces[0].Trim();
var value = nameAndValuePieces[1];
string replacementValue;
if (literalStrings.TryGetValue(value, out replacementValue))
{
value = replacementValue;
}
result.Add(name, value);
}
return result;
}
}
There's every chance there are some proper bugs in it.
Here's some unit tests you should incorporate, too;
[TestMethod]
public void TestMethod1()
{
var tests = new[] {
new { input=#"foo=bar; baz=quux", expected = #"foo|bar^baz|quux"},
new { input=#"foo=bar;baz=""quux""", expected = #"foo|bar^baz|quux"},
new { input=#"foo=""bar"";baz=""quux""", expected = #"foo|bar^baz|quux"},
new { input=#"foo=""b,a,r"";baz=""quux""", expected = #"foo|b,a,r^baz|quux"},
new { input=#"foo=""b;r"";baz=""quux""", expected = #"foo|b;r^baz|quux"},
new { input=#"foo=""b\""r"";baz=""quux""", expected = #"foo|b""r^baz|quux"},
new { input=#"foo=""b=r"";baz=""quux""", expected = #"foo|b=r^baz|quux"},
};
var parser = new HttpHeaderParser();
foreach(var test in tests)
{
var actual = parser.Parse(test.input);
var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
Assert.AreEqual(test.expected, actualAsString);
}
}

Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';
x=1
or quoted;
x="foo bar baz"
So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;
x="y=z"
It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like
FileUpload1.HasFile
FileUpload1.FileName
If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
// Verify that the user selected a file
if (file != null && file.ContentLength > 0)
{
// extract only the fielname
var fileName = Path.GetFileName(file.FileName);
// store the file inside ~/App_Data/uploads folder
var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
file.SaveAs(path);
}
// redirect back to the index action to show the form once again
return RedirectToAction("Index");
}

This:
(?<!\\\\)=
matches = not preceded by \\.
It should be:
(?<!\\)=
(Make sure you use # (verbatim) strings for the regex, to avoid confusion)

Related

Unexpected regex result with a single space

Can somebody please tell me why a space comes up with 2 matches for the below pattern?
((?<key>(?:((?!\d)\w+(?:\.(?!\d)\w+)*)\.)?((?!\d)\w+)):(?<value>([^ "]+)|("[^"]*?")+))*
Trying to match the following cases:
var body = "Key:Hello";
var body = "Key:\"Hello\"";
var body = "Key1:Hello Key2:\"Goodbye\"";
This may provide more context:
pattern = #"((?<key>" + StringExtensions.REGEX_IDENTIFIER_MIDSTRING + "):(?<value>([^ \"]+)|(\"[^\"]*?\")+))*";
My goal is to pull the keys, values out of a command-line like string in the form of [key]:[value] with optional repeats. Values can either be a with no spaces or in quotes with spaces.
Probably right there in front of me but I'm not seeing it.

Probably because “.”, because a period in regex, marches every character except line breaks

I took a different approach:
public static Dictionary<string, string> GetCommandLineKeyValues(this string commandLine)
{
var keyValues = new Dictionary<string, string>();
var pattern = #"(?<command>(" + StringExtensions.REGEX_IDENTIFIER + " )?)(?<args>.*)";
var args = commandLine.RegexGet(pattern, "args");
Match match;
if (args.Length > 0)
{
string key;
string value;
pattern = #" ?(?<key>" + StringExtensions.REGEX_IDENTIFIER_MIDSTRING + ")*?:(?<value>([^ \"]+)|(\"[^\"]*?\")+)";
do
{
match = args.RegexGetMatch(pattern);
if (match == null)
{
break;
}
key = match.Groups["key"].Value;
value = match.Groups["value"].Value;
keyValues.Add(key, value);
args = match.Replace(args, string.Empty);
}
while (args.RegexIsMatch(pattern));
}
return keyValues;
}
I took what I call the "pac-man" approach to Regex.. match, eat (hence the Match.Replace), and continue matching.
For convenience:
public const string REGEX_IDENTIFIER = #"^(?:((?!\d)\w+(?:\.(?!\d)\w+)*)\.)?((?!\d)\w+)$";

Objects to be comma separated and with double quote

I have an object array:
object[] keys
I need to transform this array into a string which is comma separated and I did it by doing this:
var newKeys = string.Join(",", keys);
My problem here is I want this values to be double quoted.
ex:
"value1","value2","value3"

var new= "\"" + string.Join( "\",\"", keys) + "\"";
To include a double quote in a string, you escape it with a backslash character, thus "\"" is a string consisting of a single double quote character, and "\", \"" is a string containing a double quote, a comma, a space, and another double quote.

Please give a try to this.
var keys = new object[] { "test1", "hello", "world", null, "", "oops"};
var csv = string.Join(",", keys.Select(k => string.Format("\"{0}\"", k)));
Because you have an object[] array, string.Format can deal with null as well as other types than strings. This solutions also works in .NET 3.5.
When the object[] array is empty, then a empty string is returned.

If performance is the key, you can always use a StringBuilder to concatenate everything.
Here's a fiddle to see it in action, but the main part can be summarized as:
// these look like snails, but they are actually pretty fast
using #_____ = System.Collections.Generic.IEnumerable<object>;
using #______ = System.Func<object, object>;
using #_______ = System.Text.StringBuilder;
public static string GetCsv(object[] input)
{
// use a string builder to make things faster
var #__ = new StringBuilder();
// the rest should be self-explanatory
Func<#_____, #______, #_____>
#____ = (_6,
_2) => _6.Select(_2);
Func<#_____, object> #_3 = _6
=> _6.FirstOrDefault();
Func<#_____, #_____> #_4 = _8
=> _8.Skip(input.Length - 1);
Action<#_______, object> #_ = (_9,
_2) => _9.Append(_2);
Action<#_______>
#___ = _7 =>
{ if (_7.Length > 0) #_(
#__, ",");
}; var #snail =
#____(input, (#_0 =>
{ #___(#__); #_(#__, #"""");
#_(#__, #_0); #_(#__, #"""");
return #__; }));
var #linq = #_4(#snail);
var #void = #_3(#linq);
// get the result
return #__.ToString();
}

How to remove " [ ] \ from string

I have a string
"[\"1,1\",\"2,2\"]"
and I want to turn this string onto this
1,1,2,2
I am using Replace function for that like
obj.str.Replace("[","").Replace("]","").Replace("\\","");
But it does not return the expected result.
Please help.

You haven't removed the double quotes. Use the following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\\","").Replace("\"", "");

Here is an optimized approach in case the string or the list of exclude-characters is long:
public static class StringExtensions
{
public static String RemoveAll(this string input, params Char[] charactersToRemove)
{
if(string.IsNullOrEmpty(input) || (charactersToRemove==null || charactersToRemove.Length==0))
return input;
var exclude = new HashSet<Char>(charactersToRemove); // removes duplicates and has constant lookup time
var sb = new StringBuilder(input.Length);
foreach (Char c in input)
{
if (!exclude.Contains(c))
sb.Append(c);
}
return sb.ToString();
}
}
Use it in this way:
str = str.RemoveAll('"', '[', ']', '\\');
// or use a string as "remove-array":
string removeChars = "\"{[]\\";
str = str.RemoveAll(removeChars.ToCharArray());

You should do following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\"","");
string.Replace method does not replace string content in place. This means that if you have
string test = "12345" and do
test.Replace("2", "1");
test string will still be "12345". Replace doesn't change string itself, but creates new string with replaced content. So you need to assign this new string to a new or same variable
changedTest = test.Replace("2", "1");
Now, changedTest will containt "11345".
Another note on your code is that you don't actually have \ character in your string. It's only displayed in order to escape quote character. If you want to know more about this, please read MSDN article on string literals.

how about
var exclusions = new HashSet<char>(new[] { '"', '[', ']', '\\' });
return new string(obj.str.Where(c => !exclusions.Contains(c)).ToArray());
To do it all in one sweep.
As Tim Schmelter writes, if you wanted to do it often, especially with large exclusion sets over long strings, you could make an extension like this.
public static string Strip(
this string source,
params char[] exclusions)
{
if (!exclusions.Any())
{
return source;
}
var mask = new HashSet<char>(exclusions);
var result = new StringBuilder(source.Length);
foreach (var c in source.Where(c => !mask.Contains(c)))
{
result.Append(c);
}
return result.ToString();
}
so you could do,
var result = "[\"1,1\",\"2,2\"]".Strip('"', '[', ']', '\\');

Capture the numbers only with this regular expression [0-9]+ and then concatenate the matches:
var input = "[\"1,1\",\"2,2\"]";
var regex = new Regex("[0-9]+");
var matches = regex.Matches(input).Cast<Match>().Select(m => m.Value);
var result = string.Join(",", matches);

How to replace "http:" with "https:" in a string with C#?

I've stored all URLs in my application with "http://" - I now need to go through and replace all of them with "https:". Right now I have:
foreach (var link in links)
{
if (link.Contains("http:"))
{
/// do something, slice or replace or what?
}
}
I'm just not sure what the best way to update the string would be. How can this be done?

If you're dealing with uris, you probably want to use UriBuilder since doing a string replace on structured data like URIs is not a good idea.
var builder = new UriBuilder(link);
builder.Scheme = "https";
Uri modified = builder.Uri;
It's not clear what the type of links is, but you can create a new collection with the modified uris using linq:
IEnumerable<string> updated = links.Select(link => {
var builder = new UriBuilder(link);
builder.Scheme = "https";
return builder.ToString();
});

The problem is your strings are in a collection, and since strings are immutable you can't change them directly. Since you didn't specify the type of links (List? Array?) the right answer will change slightly. The easiest way is to create a new list:
links = links.Select(link => link.Replace("http://","https://")).ToList();
However if you want to minimize the number of changes and can access the string by index you can just loop through the collection:
for(int i = 0; i < links.Length; i++ )
{
links[i] = links[i].Replace("http://","https://");
}

based on your current code, link will not be replace to anything you want because it is read only (see here: Why can't I modify the loop variable in a foreach?). instead use for
for(int a = 0; a < links.Length; a++ )
{
links[a] = links[a].Replace("http:/","https:/")
}

http://myserver.xom/login.aspx?returnurl=http%3a%2f%2fmyserver.xom%2fmyaccount.aspx&q1=a%20b%20c&q2=c%2b%2b
What about the urls having also url in the querystring part? I think we should also replace them. And because of the url encoding-escaping this is the hard part of the job.
private void BlaBla()
{
// call the replacing function
Uri myNewUrl = ConvertHttpToHttps(myOriginalUrl);
}
private Uri ConvertHttpToHttps(Uri originalUri)
{
Uri result = null;
int httpsPort = 443;// if needed assign your own value or implement it as parametric
string resultQuery = string.Empty;
NameValueCollection urlParameters = HttpUtility.ParseQueryString(originalUri.Query);
if (urlParameters != null && urlParameters.Count > 0)
{
StringBuilder sb = new StringBuilder();
foreach (string key in urlParameters)
{
if (sb.Length > 0)
sb.Append("&");
string value = urlParameters[key].Replace("http://", "https://");
string valuEscaped = Uri.EscapeDataString(value);// this is important
sb.Append(string.Concat(key, "=", valuEscaped));
}
resultQuery = sb.ToString();
}
UriBuilder resultBuilder = new UriBuilder("https", originalUri.Host, httpsPort, originalUri.AbsolutePath);
resultBuilder.Query = resultQuery;
result = resultBuilder.Uri;
return result;
}

Use string.Replace and some LINQ:
var httpsLinks = links.Select(l=>l.Replace("http://", "https://");

Extract data from a big string

First of all, i'm using the function below to read data from a pdf file.
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
pdfReader.Close();
}
}
return text.ToString();
}
As you can see , all data is saved in a string. The string looks like this:
label1: data1;
label2: data2;
label3: data3;
.............
labeln: datan;
My question: How can i get the data from string based on labels ?
I've tried this , but i'm getting stuck:
if ( string.Contains("label1"))
{
extracted_data1 = string.Substring(string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}
if ( string.Contains("label2"))
{
extracted_data2 = string.Substring(string.IndexOf("label2") + string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}

Have a look at the String.Split() function, it tokenises a string based on an array of characters supplied.
e.g.
string[] lines = text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries);
now loop through that array and split each one again
foreach(string line in lines) {
string[] pair = line.Split(new[] {':'});
string key = pair[0].Trim();
string val = pair[1].Trim();
....
}
Obviously check for empty lines, and use .Trim() where needed...
[EDIT]
Or alternatively as a nice Linq statement...
var result = from line in text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries)
let tokens = line.Split(new[] {':'})
select tokens;
Dictionary<string, string> =
result.ToDictionary (key => key[0].Trim(), value => value[1].Trim());

It's pretty hard-coded, but you could use something like this (with a little bit of trimming to your needs):
string input = "label1: data1;" // Example of your input
string data = input.Split(':')[1].Replace(";","").Trim();

You can do this by using Dictionary<string,string>,
Dictionary<string, string> dicLabelData = new Dictionary<string, string>();
List<string> listStrSplit = new List<string>();
listStrSplit = strBig.Split(';').ToList<string>();//strBig is big string which you want to parse
foreach (string strSplit in listStrSplit)
{
if (strSplit.Split(':').ToList<string>().Count > 1)
{
List<string> listLable = new List<string>();
listLable = strSplit.Split(':').ToList<string>();
dicLabelData.Add(listLable[0],listLable[1]);//Key=Label,Value=Data
}
}
dicLabelData contains data of all label....

i think you can use regex to solve this problem. Just split the string on the break line and use a regex to get the right number.

You can use a regex to do it:
Regex rx = new Regex("label([0-9]+): ([^;]*);");
var matches = rx.Matches("label1: a string; label2: another string; label100: a third string;");
foreach (Match match in matches) {
var id = match.Groups[1].ToString();
var data = match.Groups[2].ToString();
var idAsNumber = int.Parse(id);
// Here you use an array or a dictionary to save id/data
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex without escaping Characters - Problems - c#

This: (?<!\\\\)= matches = not preceded by \\. It should be: (?<!\\)= (Make sure you use # (verbatim) strings for the regex, to avoid confusion)

Related

Unexpected regex result with a single space

Objects to be comma separated and with double quote

How to remove " [ ] \ from string

How to replace "http:" with "https:" in a string with C#?

Extract data from a big string

Categories

Resources