Extract data from a big string - c#

First of all, I'm using the function below to read data from a PDF file.
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close(); // close the reader once, after all pages have been read
}
return text.ToString();
}
As you can see, all the data is saved in a single string. The string looks like this:
label1: data1;
label2: data2;
label3: data3;
.............
labeln: datan;
My question: how can I get the data out of the string based on the labels?
I've tried this, but I'm stuck:
if ( string.Contains("label1"))
{
extracted_data1 = string.Substring(string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}
if ( string.Contains("label2"))
{
extracted_data2 = string.Substring(string.IndexOf("label2") + string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}

Have a look at the String.Split() method; it tokenises a string based on an array of separator characters.
e.g.
string[] lines = text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries);
Now loop through that array and split each entry again:
foreach(string line in lines) {
string[] pair = line.Split(new[] {':'});
string key = pair[0].Trim();
string val = pair[1].Trim();
....
}
Obviously check for empty lines, and use .Trim() where needed.
[EDIT]
Or alternatively, as a LINQ statement:
var result = from line in text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries)
let tokens = line.Split(new[] {':'})
select tokens;
Dictionary<string, string> dict =
result.ToDictionary (key => key[0].Trim(), value => value[1].Trim());
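The Split-based approach above leaves open what happens when a value itself contains a colon, or when an entry has no colon at all. The following is a minimal sketch of one way to handle both cases, not a definitive implementation. The names LabelParser and ParseLabels are made up for illustration, and it assumes labels are case-sensitive and that a repeated label should overwrite the earlier value.

```csharp
using System;
using System.Collections.Generic;

public static class LabelParser
{
    // Parses "label1: data1; label2: data2;" into a label -> data dictionary.
    public static Dictionary<string, string> ParseLabels(string text)
    {
        var result = new Dictionary<string, string>();
        foreach (string entry in text.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first ':' only, so a value such as "12:30" stays intact.
            string[] pair = entry.Split(new[] { ':' }, 2);
            if (pair.Length != 2) continue;   // skip entries without a separator
            string key = pair[0].Trim();
            if (key.Length == 0) continue;    // skip entries with an empty label
            result[key] = pair[1].Trim();     // a repeated label overwrites the earlier value
        }
        return result;
    }
}
```

After parsing, a value can be looked up directly by its label, for example with result.TryGetValue("label2", out value).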

It's pretty hard-coded, but you could use something like this (with a little bit of trimming to your needs):
string input = "label1: data1;"; // Example of your input
string data = input.Split(':')[1].Replace(";","").Trim();

You can do this by using a Dictionary<string,string>:
Dictionary<string, string> dicLabelData = new Dictionary<string, string>();
List<string> listStrSplit = new List<string>();
listStrSplit = strBig.Split(';').ToList<string>();//strBig is big string which you want to parse
foreach (string strSplit in listStrSplit)
{
if (strSplit.Split(':').ToList<string>().Count > 1)
{
List<string> listLable = new List<string>();
listLable = strSplit.Split(':').ToList<string>();
dicLabelData.Add(listLable[0],listLable[1]);//Key=Label,Value=Data
}
}
dicLabelData now contains the data for every label.

I think you can use a regex to solve this problem. Just split the string on line breaks and use a regex to get the right number.

You can use a regex to do it:
Regex rx = new Regex("label([0-9]+): ([^;]*);");
var matches = rx.Matches("label1: a string; label2: another string; label100: a third string;");
foreach (Match match in matches) {
var id = match.Groups[1].ToString();
var data = match.Groups[2].ToString();
var idAsNumber = int.Parse(id);
// Here you use an array or a dictionary to save id/data
}

Related

How to split and take multiple strings from a url in c#?

I have a string looking something like this:
/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag
I want a list of strings containing "Orgrimmar", "Stormwind" and "Undercity". How can I split it AFTER Query, and between & and +, so that I avoid getting a string like "Orgrimmar+l%C3%A4n=01&Stormwind"?
Let us assume that we don't know the names of the strings in advance. :)
Update: I still can't get it to work. I have added a list of counties that I can use for validation, but I still find this case hard. countyList is used to check that the counties/cities in the URL match a pre-existing collection.
var countyQuery = Request.Url.Query;
var counties = this._locationService.GetAllCounties();
List<string> countyList = new List<string>();
List<string> selectedCountiesList = new List<string>();
foreach (var i in counties)
{
countyList.Add(i.Name);
}
Regex r = new Regex(@"&(.+?)\+");
MatchCollection mc = r.Matches(countyQuery);
foreach (Match curMatch in mc)
{
if (countyList.Contains(curMatch.Groups[1].Value))
{
selectedCountiesList.Add(curMatch.Groups[1].Value);
}
}
return selectedCountiesList;
I changed the URL to be /?Gender=&Age=&Query=&county=13&county=08&county=01&Page=1
where 13, 08, 01 and so on are the IDs of the counties.
The final solution was:
var selectedCountyQuery = Request.QueryString
//CountySearch = "county"
[QueryStringParameters.CountySearch];
List countyList = new List();
List<string> selectedCounties = new List<string>();
if (!string.IsNullOrEmpty(selectedCountyQuery))
{
var selectedCountiesArray = selectedCountyQuery.Split(new[]{ ',' });
foreach (var selectedCounty in selectedCountiesArray)
{
selectedCounties.Add(selectedCounty);
}
}
return selectedCounties;
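For a URL with a repeated parameter such as county=13&county=08&county=01, the values can also be read without splitting a comma-joined string by hand. The following is a sketch only: it assumes System.Web is referenced, and the names CountyQuery and GetSelectedCounties are hypothetical. It relies on HttpUtility.ParseQueryString returning a NameValueCollection, whose GetValues method yields one array entry per occurrence of the key.

```csharp
using System;
using System.Collections.Generic;
using System.Web;

public static class CountyQuery
{
    // Returns every value of the repeated "county" parameter, in URL order.
    public static List<string> GetSelectedCounties(string query)
    {
        // ParseQueryString accepts the query part with or without a leading '?'.
        var parameters = HttpUtility.ParseQueryString(query);
        // GetValues returns null when the key is absent.
        string[] values = parameters.GetValues("county");
        return values == null ? new List<string>() : new List<string>(values);
    }
}
```

Inside an ASP.NET request, the same GetValues call should be available directly on Request.QueryString, since that is also a NameValueCollection.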

You can get every parameter and its value with the Split() method.
Example:
var URL = "controller/method?var1=&var2=&var3=dsgdf";
var ParameterPart = URL.Split('?')[1];
var ParametersArray = ParameterPart.Split('&');
//output : ["var1=","var2=","var3=dsgdf"];
foreach(var Parameter in ParametersArray)
{
var ParameterName= Parameter.Split('=')[0];
var ParameterValue= Parameter.Split('=')[1];
}

You can use a regex and extract the matches:
Regex r = new Regex(@"&(.+?)\+");
MatchCollection mc = r.Matches(s);
Then you can iterate over your desired strings (in this case WoW cities) like:
foreach(Match curMatch in mc)
{
Console.WriteLine(curMatch.Groups[1].Value);
}
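One caveat with the pattern above: the lazy `.+?` starts at the first '&' in the string, so on the sample URL the first capture would be "Age=&Query=&Orgrimmar" rather than "Orgrimmar". A tighter character class avoids that. The sketch below is an illustration, not the answerer's code. The names CityNames and Extract are hypothetical, and it assumes each name is a single token followed directly by '+'.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CityNames
{
    // A name starts right after '&', may not contain '&', '=' or '+',
    // and must be followed directly by '+'.
    private static readonly Regex NameBeforePlus = new Regex(@"&([^&=+]+)\+");

    public static List<string> Extract(string query)
    {
        var names = new List<string>();
        foreach (Match m in NameBeforePlus.Matches(query))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }
}
```

Parameters with empty values such as "&Age=" are skipped because '=' is excluded from the captured name. A multi-word name such as "New+York" would yield only its first word.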

string[] numbers = { "/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag" };
string sPattern = @"(?<=&Orgrimmar)+";
foreach (string s in numbers)
{
if (System.Text.RegularExpressions.Regex.IsMatch(s, sPattern))
{
System.Console.WriteLine(" - valid");
}
else
{
System.Console.WriteLine(" - invalid");
}
}
Output: valid
string[] numbers ={ "/Gender=&Age=&Query=Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag"};
Output: invalid
Further to check two parameters:
string[] numbers ={ "/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag"};
string sPattern = @"(?<=&Orgrimmar)+";
string sPattern2 = @"(?<=&Stormwind)+";
foreach (string s in numbers){
if (System.Text.RegularExpressions.Regex.IsMatch(s, sPattern) && System.Text.RegularExpressions.Regex.IsMatch(s, sPattern2))
...

Regex without escaping Characters - Problems

I found some solutions for my problem, which is quite simple:
I have a string which looks like this:
"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""
My goal is to break it down so that I have a Dictionary of values, like:
Key = "name", Value = "ctl..."
My approach was: split it by "\r\n" and then by the equals sign or the colon.
This worked fine, but then some funny tester uploaded a file with all allowed characters, which made the string look like this:
"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨##ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"
Of course, the simple splitting doesn't work anymore, since it now splits the filename.
I corrected this by reading out "filename=", escaping the characters I want to split on, and then creating a regex.
Now comes my problem: I found two regex samples which could do the work for the equals sign, the semicolon and the colon. One is:
[^\\]=
The other one I found was:
(?<!\\\\)=
The problem is that the first one doesn't just split on the equals sign: it also removes the character before it, which means my key in the Dictionary is "nam" instead of "name".
The second one works fine in that respect, but it still splits on the escaped equals sign in the filename.
Is my approach to this problem even workable? Would there be a better solution? And why does the first regex cut off a character?
Edit: To avoid confusion, my escaped String looks like this:
"Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨##ç %&()\=~^`'.jpg\""
So basically I want to split on the equals sign EXCEPT the escaped ones. By the way: the string here shows only one \, but there are two.
Edit 2: OK, it seems I have a working solution, but it's so ugly:
Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
{
Dictionary<string, string> items;
string header;
string[] headerLines;
int start;
int end;
string input = _encoding.GetString(bytes, pos, bytes.Length - pos);
start = input.IndexOf("\r\n", 0);
if (start < 0) return null;
end = input.IndexOf("\r\n\r\n", start);
if (end < 0) return null;
WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content
header = input.Substring(start, end - start);
items = new Dictionary<string, string>();
headerLines = Regex.Split(header, "\r\n");
Regex regLineParts = new Regex(@"(?<!\\\\);");
Regex regColon = new Regex(@"(?<!\\\\):");
Regex regEqualSign = new Regex(@"(?<!\\\\)=");
foreach (string hl in headerLines)
{
string workString = hl;
//Escape the Semicolon in filename
if (hl.Contains("filename"))
{
String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
orig = orig.Substring(0, orig.IndexOf('"'));
string toReplace = orig;
toReplace = toReplace.Replace(toReplace, toReplace.Replace(";", @"\\;"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace(":", @"\\:"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace("=", @"\\="));
workString = hl.Replace(orig, toReplace);
}
string[] lineParts = regLineParts.Split(workString);
for (int i = 0; i < lineParts.Length; i++)
{
string[] p;
if (i == 0)
p = regColon.Split(lineParts[i]);
else
p = regEqualSign.Split(lineParts[i]);
if (p.Length == 2)
{
string orig = p[0];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[0] = orig;
orig = p[1];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[1] = orig;
items.Add(p[0].Trim(), p[1].Trim());
}
}
}
return items;
}
Needs some further testing.

I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.
A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg
key="value"
becomes
key=PLACEHOLDER8230498234098230498
The original string value is stored off in the literalStrings dictionary for later.
So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.
Then I substitute the placeholder values back in before returning the result.
public class HttpHeaderParser
{
public NameValueCollection Parse(string header)
{
var result = new NameValueCollection();
// 'register' any string values;
var stringLiteralRx = new Regex(@"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);
Dictionary<string, string> literalStrings = new Dictionary<string, string>();
var cleanedHeader = stringLiteralRx.Replace(header, m =>
{
var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
literalStrings.Add(replacement, stringLiteral);
return replacement;
});
// now it's safe to split on semicolons to get name-value pairs
var nameValuePairs = semiRx.Split(cleanedHeader);
foreach(var nameValuePair in nameValuePairs)
{
var nameAndValuePieces = equalsRx.Split(nameValuePair);
var name = nameAndValuePieces[0].Trim();
var value = nameAndValuePieces[1];
string replacementValue;
if (literalStrings.TryGetValue(value, out replacementValue))
{
value = replacementValue;
}
result.Add(name, value);
}
return result;
}
}
There's every chance there are some proper bugs in it.
Here are some unit tests you should incorporate, too:
[TestMethod]
public void TestMethod1()
{
var tests = new[] {
new { input=@"foo=bar; baz=quux", expected = @"foo|bar^baz|quux"},
new { input=@"foo=bar;baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""bar"";baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""b,a,r"";baz=""quux""", expected = @"foo|b,a,r^baz|quux"},
new { input=@"foo=""b;r"";baz=""quux""", expected = @"foo|b;r^baz|quux"},
new { input=@"foo=""b\""r"";baz=""quux""", expected = @"foo|b""r^baz|quux"},
new { input=@"foo=""b=r"";baz=""quux""", expected = @"foo|b=r^baz|quux"},
};
var parser = new HttpHeaderParser();
foreach(var test in tests)
{
var actual = parser.Parse(test.input);
var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
Assert.AreEqual(test.expected, actualAsString);
}
}

Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';
x=1
or quoted;
x="foo bar baz"
So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;
x="y=z"
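A single regex with two alternatives, one for quoted values and one for raw values, is one way to get this behaviour without a full parser. The sketch below is only an illustration under stated assumptions. The names PairParser and Parse are hypothetical, keys are assumed to contain no '=', ';' or whitespace, and backslash escapes inside quoted values are kept as-is rather than unescaped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // key   = run of characters other than '=', ';' and whitespace
    // value = a double-quoted string (backslash escapes allowed), or raw text up to ';'
    private static readonly Regex PairRx = new Regex(
        @"(?<key>[^=;\s]+)=(?:""(?<quoted>(?:\\.|[^""\\])*)""|(?<raw>[^;]*))");

    public static Dictionary<string, string> Parse(string header)
    {
        var items = new Dictionary<string, string>();
        foreach (Match m in PairRx.Matches(header))
        {
            Group quoted = m.Groups["quoted"];
            // Quoted values are taken verbatim; raw values are trimmed.
            string value = quoted.Success ? quoted.Value : m.Groups["raw"].Value.Trim();
            items[m.Groups["key"].Value] = value;
        }
        return items;
    }
}
```

Because a quoted value is consumed as a whole, any '=' or ';' inside it never starts a new pair.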
It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like
FileUpload1.HasFile
FileUpload1.FileName
If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
// Verify that the user selected a file
if (file != null && file.ContentLength > 0)
{
// extract only the filename
var fileName = Path.GetFileName(file.FileName);
// store the file inside ~/App_Data/uploads folder
var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
file.SaveAs(path);
}
// redirect back to the index action to show the form once again
return RedirectToAction("Index");
}

This:
(?<!\\\\)=
matches an = that is not preceded by \\ (two backslashes).
It should be:
(?<!\\)=
(Make sure you use @ (verbatim) strings for the regex, to avoid confusion.)
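As for why the first pattern from the question cuts off a character: `[^\\]=` matches two characters (any non-backslash character plus the '='), and Regex.Split removes everything that was matched. A lookbehind is zero-width, so only the '=' is removed. The small sketch below shows the difference side by side; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Consumes one non-backslash character together with '='.
    public static string[] SplitConsuming(string s)
    {
        return Regex.Split(s, @"[^\\]=");
    }

    // Zero-width lookbehind: only the '=' itself is removed.
    public static string[] SplitLookbehind(string s)
    {
        return Regex.Split(s, @"(?<!\\)=");
    }
}
```

For the input name=a\=b (with a single backslash), SplitConsuming yields "nam" and "a\=b", while SplitLookbehind yields "name" and "a\=b".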

How can I parse a value that appears some place in a file using C#

I have strings and each contain a value of RowKey stored like this:
data-RowKey=029
This occurs only once in each file. Is there some way I can get the number out with a C# function or do I have to write some kind of select myself. I have a team mate who suggested linq but I'm not sure if this even works on strings and I don't know how I could use this.
Update:
Sorry I changed this from file to string.

LINQ does not really help you here. Use a regular expression to extract the number:
data-RowKey=(\d+)
Update:
Regex r = new Regex(@"data-RowKey=(\d+)");
string abc = //;
Match match = r.Match(abc);
if (match.Success)
{
string rowKey = match.Groups[1].Value;
}
Code:
public string ExtractRowKey(string filePath)
{
Regex r = new Regex(@"data-RowKey=(\d+)");
using (StreamReader reader = new StreamReader(filePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
Match match = r.Match(line);
if (match.Success) return match.Groups[1].Value;
}
}
return null; // no line contained a RowKey
}

Assuming that it exists only once in the file (I would even throw an exception otherwise):
String rowKey = null;
try
{
rowKey = File.ReadLines(path)
.Where(l => l.IndexOf("data-RowKey=") > -1)
.Select(l => l.Substring(12 + l.IndexOf("data-RowKey=")))
.Single();
}
catch (InvalidOperationException) {
// you might want to log this exception instead
throw;
}
Edit: the simple approach with a string takes the first occurrence, assuming the value is always of length 3:
rowKey = text.Substring(12 + text.IndexOf("data-RowKey="), 3);
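If the number is not guaranteed to be three digits long, the fixed length can be replaced by reading digits until the first non-digit. This also touches on the LINQ part of the question: a string is an IEnumerable<char>, so LINQ operators do work on it. The following is a sketch under that assumption; RowKeyReader is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class RowKeyReader
{
    // Returns the digits following "data-RowKey=", or null if the marker is missing.
    public static string ExtractRowKey(string text)
    {
        const string marker = "data-RowKey=";
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        if (index < 0) return null;
        // A string is an IEnumerable<char>, so LINQ operators work on it directly.
        char[] digits = text.Skip(index + marker.Length)
                            .TakeWhile(c => char.IsDigit(c))
                            .ToArray();
        return new string(digits);
    }
}
```

If the marker is present but no digit follows it, this version returns an empty string rather than null.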

Assuming the following:
The file must contain data-RowKey (exact match, including case)
The number length is 3
The following is the code snippet:
var fileNames = Directory.GetFiles("rootDirPath");
var tuples = new List<Tuple<String, int>>();
foreach(String fileName in fileNames)
{
String fileData =File.ReadAllText(fileName) ;
int index = fileData.IndexOf("data-RowKey=");
if(index >=0)
{
String numberStr = fileData.Substring(index+12,3);// ASSUMING data-RowKey is always found, and number length is always 3
int number = 0;
int.TryParse(numberStr, out number);
tuples.Add(Tuple.Create(fileName, number));
}
}

Regex g = new Regex(#"data-RowKey=(?<Value>\d+)");
using (StreamReader r = new StreamReader("myFile.txt"))
{
string line;
while ((line = r.ReadLine()) != null)
{
Match m = g.Match(line);
if (m.Success)
{
string v = m.Groups["Value"].Value;
// ...
}
}
}

splitting string in dictionary

How do I split the following string
string s = "username=bill&password=mypassword";
Dictionary<string,string> stringd = SplitTheString(s);
such that I could capture it as follows:
string username = stringd.First().Key;
string password = stringd.First().Values;
Please let me know. Thanks

You can populate the dictionary like so:
Dictionary<string, string> dictionary = new Dictionary<string, string>();
string s = "username=bill&password=mypassword";
foreach (string x in s.Split('&'))
{
string[] values = x.Split('=');
dictionary.Add(values[0], values[1]);
}
This would allow you to access them like so:
string username = dictionary["username"];
string password = dictionary["password"];
NOTE: keep in mind there is no validation in this function; it assumes your input string is correctly formatted.
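A variant that tolerates some malformed input might look like the sketch below. It is an illustration rather than a drop-in replacement. The class name QueryPairs is made up, a key without '=' is assumed to map to an empty string, and the last occurrence of a duplicate key wins. Note that Uri.UnescapeDataString decodes %XX sequences but does not turn '+' into a space.

```csharp
using System;
using System.Collections.Generic;

public static class QueryPairs
{
    public static Dictionary<string, string> Parse(string s)
    {
        var result = new Dictionary<string, string>();
        foreach (string part in s.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Limit the split to 2 pieces so an '=' inside the value is preserved.
            string[] kv = part.Split(new[] { '=' }, 2);
            string key = Uri.UnescapeDataString(kv[0]);
            string value = kv.Length == 2 ? Uri.UnescapeDataString(kv[1]) : "";
            result[key] = value; // last occurrence wins instead of throwing on duplicates
        }
        return result;
    }
}
```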

It looks like you are trying to parse a query string - this is already built in, you can use HttpUtility.ParseQueryString() for this:
string input = "username=bill&password=mypassword";
var col = HttpUtility.ParseQueryString(input);
string username = col["username"];
string password = col["password"];

I think something similar to this should work:
public Dictionary<string, string> SplitTheStrings(string s) {
var d = new Dictionary<string, string>();
var a = s.Split('&');
foreach(string x in a) {
var b = x.Split('=');
d.Add(b[0], b[1]);
}
return d;
}

var splitString = "username=bill&password=pass";
var splits = new char[2];
splits[0] = '=';
splits[1] = '&';
var items = splitString.Split(splits);
var list = new Dictionary<string, string> {{items[1], items[3]}};
var username = list.First().Key;
var password = list.First().Value;
This may also work.

If Keys will not repeat
var dict = s.Split('&').Select( i=>
{
var t = i.Split('=');
return new {Key=t[0], Value=t[1]};}
).ToDictionary(i=>i.Key, i=>i.Value);
If Keys can repeat
string s = "username=bill&password=mypassword";
var dict = s.Split('&').Select( i=>
{
var t = i.Split('=');
return new {Key=t[0], Value=t[1]};}
).ToLookup(i=>i.Key, i=>i.Value);

The other answers are better, easier to read, simpler, less prone to bugs, etc, but an alternate solution is to use a regular expression like this to extract all the keys and values:
MatchCollection mc = Regex.Matches("username=bill&password=mypassword&","(.*?)=(.*?)&");
Each match in the match collection will have two groups, a group for the key text and a group for the value text.
I am not too good at regular expressions so I don't know how to get it to match without adding the trailing '&' to the input string...
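One way to drop the trailing '&' requirement is to describe what a key and a value may contain instead of what terminates them. The sketch below assumes keys contain neither '&' nor '=' and values contain no '&'; the names RegexPairs and Parse are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexPairs
{
    public static Dictionary<string, string> Parse(string s)
    {
        var result = new Dictionary<string, string>();
        // Key: one or more characters other than '&' and '='.
        // Value: anything up to the next '&' or the end of the string.
        foreach (Match m in Regex.Matches(s, @"([^&=]+)=([^&]*)"))
        {
            result[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return result;
    }
}
```

Because the value group stops at '&' or at the end of the input, the last pair is matched without any trailing separator.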

Best way get first word and rest of the words in a string in C#

In C#
var parameters =
from line in parameterTextBox.Lines
select new {name = line.Split(' ').First(), value = line.Split(' ').Skip(1)};
Is there a way to do this without having to split twice?

You can store the split in a let clause:
var parameters =
from line in parameterTextBox.Lines
let split = line.Split(' ')
select new {name = split.First(), value = split.Skip(1)};

Sure.
var parameters = from line in parameterTextBox.Lines
let words = line.Split(' ')
select new { name = words.First(), value = words.Skip(1) };

string Str = "one all of the rest";
Match m = Regex.Match(Str, @"(\w*) (\w.*)");
string wordone = m.Groups[1].Value;
string wordtwo = m.Groups[2].Value;

You might try this:
private Dictionary<string, string> getParameters(string[] lines)
{
Dictionary<string, string> results = new Dictionary<string, string>();
foreach (string line in lines)
{
string pName = line.Substring(0, line.IndexOf(' '));
string pVal = line.Substring(line.IndexOf(' ') + 1);
results.Add(pName, pVal);
}
return results;
}
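Note that the Substring version above throws when a line contains no space, because IndexOf then returns -1. Split with a count of 2 handles that case and still splits only once. This is a minimal sketch; FirstWord and SplitFirst are names invented for the example, and the rest of the line is returned as one string rather than as a sequence of words.

```csharp
using System;
using System.Collections.Generic;

public static class FirstWord
{
    // Splits a line into its first word and the remainder.
    // The remainder is "" when the line contains no space.
    public static KeyValuePair<string, string> SplitFirst(string line)
    {
        string[] parts = line.Split(new[] { ' ' }, 2);
        string rest = parts.Length == 2 ? parts[1] : "";
        return new KeyValuePair<string, string>(parts[0], rest);
    }
}
```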
