How to parse this text in C#

abc = tamaz feeo maa roo key gaera porla
Xyz = gippaza eka jaguar ammaz te sanna.
I want to make a struct:
public struct word
{
public string Word;
public string Definition;
}
How can I parse these lines and build a List<word> in C#?
Thanks for the help, but the input is free text and a definition may take one line or several, so how should I handle newlines?

Read the input line by line and split by the equal sign.
class Entry
{
private string term;
private string definition;
public Entry(string term, string definition) // public, so code outside the class can create entries
{
this.term = term;
this.definition = definition;
}
}
// ...
string[] data = line.Split('=');
string word = data[0].Trim();
string definition = data[1].Trim();
Entry entry = new Entry(word, definition);

This can also be done using a very simple LINQ query:
var definitions =
from line in File.ReadAllLines(file)
let parts = line.Split('=')
select new word
{
Word = parts[0].Trim(),
Definition = parts[1].Trim()
};

With a regular expression you can proceed in two ways, depending on how you read the input.
Example 1
Assuming you have read the source and stored each line in an array or list:
string[] input = { "abc = tamaz feeo maa roo key gaera porla", "Xyz = gippaza eka jaguar ammaz te sanna." };
Regex mySplit = new Regex("(\\w+)\\s*=\\s*((\\w+).*)");
List<word> mylist = new List<word>();
foreach (string wordDef in input)
{
Match myMatch = mySplit.Match(wordDef);
word myWord;
myWord.Word = myMatch.Groups[1].Captures[0].Value;
myWord.Definition = myMatch.Groups[2].Captures[0].Value;
mylist.Add(myWord);
}
Example 2
Assuming you have read the whole source into a single string (with each line terminated by the line break character '\n'), you can use the same regexp "(\w+)\s*=\s*((\w+).*)", but call Matches instead of Match:
string inputs = "abc = tamaz feeo maa roo, key gaera porla\r\nXyz = gippaza eka jaguar; ammaz: te sanna.";
MatchCollection myMatches = mySplit.Matches(inputs);
foreach (Match singleMatch in myMatches)
{
word myWord;
myWord.Word = singleMatch.Groups[1].Captures[0].Value;
myWord.Definition = singleMatch.Groups[2].Captures[0].Value.Trim(); // '.' stops at '\n' but not at '\r', so trim the trailing '\r'
mylist.Add(myWord);
}
Lines that match or do not match the regexp "(\w+)\s*=\s*((\w+).*)":
"abc = tamaz feeo maa roo key gaera porla,qsdsdsqdqsd\n" --> Match!
"Xyz= gippaza eka jaguar ammaz te sanna. sdq=sqds \n" --> Match! you can insert description that includes spaces too.
"qsdqsd=\nsdsdsd\n" --> Match a multiline pair too!
"sdqsd=\n" --> DO NOT Match! (lacking descr)
"= sdq sqdqsd.\n" --> DO NOT Match! (lacking word)

// Split at an = sign. Take at most two parts (word and definition);
// ignore any = signs in the definition
string[] parts = line.Split(new[] { '=' }, 2);
word w = new word();
w.Word = parts[0].Trim();
// If the definition is missing then parts.Length == 1
if (parts.Length == 1)
w.Definition = string.Empty;
else
w.Definition = parts[1].Trim();
words.Add(w);
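The question also asks what to do when a definition runs over more than one line. One possible way to handle that is sketched below. It assumes that every new entry starts on a line containing '=' and that any other non-empty line continues the previous definition; WordParser and ParseEntries are hypothetical names, and the struct from the question is repeated so the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct word
{
    public string Word;
    public string Definition;
}

public static class WordParser
{
    // Assumption: a line containing '=' starts a new entry; any other
    // non-empty line is a continuation of the previous definition.
    public static List<word> ParseEntries(string text)
    {
        var result = new List<word>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Trim().Length == 0)
                    continue; // skip blank lines

                int eq = line.IndexOf('=');
                if (eq > 0)
                {
                    // Split at the first '=' only; later '=' signs stay in the definition.
                    word w;
                    w.Word = line.Substring(0, eq).Trim();
                    w.Definition = line.Substring(eq + 1).Trim();
                    result.Add(w);
                }
                else if (eq < 0 && result.Count > 0)
                {
                    // Structs are value types: copy, modify, write back.
                    word last = result[result.Count - 1];
                    last.Definition = (last.Definition + " " + line.Trim()).Trim();
                    result[result.Count - 1] = last;
                }
                // Lines that start with '=' and continuation lines that
                // appear before any entry are ignored.
            }
        }
        return result;
    }
}
```

A continuation line that itself contains '=' would be treated as a new entry under this assumption, so the rule needs adjusting if definitions can contain that character at line starts.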

Use Regular Expressions

Related

Regex without escaping Characters - Problems

I found some solutions for my problem, which is quite simple:
I have a string, which is looking like this:
"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""
My goal is to break it down, so I have a Dictionary of values, like:
Key = "name", value = "ctl..."
My approach was: Split it by "\r\n" and then by the equal or the colon sign.
This worked fine, but then a tester uploaded a file whose name contained every allowed character, which made the string look like this:
"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨##ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"
Of course, the simple splitting doesn't work anymore, since it now splits the filename as well.
I corrected this by reading out "filename=" and escaping the signs I'm looking to split, and then creating a regex.
Now comes my problem: I found two Regex-samples, which could do the work for the equal sign, the semicolon and the colon. one is:
[^\\]=
The other one I found was:
(?<!\\\\)=
The problem is, the first one doesn't only split, but it splits the equal sign and one character before this sign, which means my key in the Dictionary is "nam" instead of "name"
The second one works fine on this matter, but it still splits the escaped equal sign in the filename.
Is my approach for this problem even working? Would there be a better solution for this? And why is the first Regex cutting a character?
Edit: To avoid confusion, my escaped String looks like this:
"Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨##ç %&()\=~^`'.jpg\""
So I want basically: Split by equal Sign EXCEPT the escaped ones. By the way: The string here shows only one \, but there are 2.
Edit 2: OK seems like I have a working solution, but it's so ugly:
Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
{
Dictionary<string, string> items;
string header;
string[] headerLines;
int start;
int end;
string input = _encoding.GetString(bytes, pos, bytes.Length - pos);
start = input.IndexOf("\r\n", 0);
if (start < 0) return null;
end = input.IndexOf("\r\n\r\n", start);
if (end < 0) return null;
WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content
header = input.Substring(start, end - start);
items = new Dictionary<string, string>();
headerLines = Regex.Split(header, "\r\n");
Regex regLineParts = new Regex(@"(?<!\\\\);");
Regex regColon = new Regex(@"(?<!\\\\):");
Regex regEqualSign = new Regex(@"(?<!\\\\)=");
foreach (string hl in headerLines)
{
string workString = hl;
//Escape the Semicolon in filename
if (hl.Contains("filename"))
{
String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
orig = orig.Substring(0, orig.IndexOf('"'));
string toReplace = orig;
toReplace = toReplace.Replace(toReplace, toReplace.Replace(";", @"\\;"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace(":", @"\\:"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace("=", @"\\="));
workString = hl.Replace(orig, toReplace);
}
string[] lineParts = regLineParts.Split(workString);
for (int i = 0; i < lineParts.Length; i++)
{
string[] p;
if (i == 0)
p = regColon.Split(lineParts[i]);
else
p = regEqualSign.Split(lineParts[i]);
if (p.Length == 2)
{
string orig = p[0];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[0] = orig;
orig = p[1];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[1] = orig;
items.Add(p[0].Trim(), p[1].Trim());
}
}
}
return items;
}
Needs some further testing.
I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.
A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg
key="value"
becomes
key=PLACEHOLDER8230498234098230498
The original string value is stored off in the literalStrings dictionary for later.
So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.
Then I substitute the placeholder values back in before returning the result.
public class HttpHeaderParser
{
public NameValueCollection Parse(string header)
{
var result = new NameValueCollection();
// 'register' any string values;
var stringLiteralRx = new Regex(@"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);
Dictionary<string, string> literalStrings = new Dictionary<string, string>();
var cleanedHeader = stringLiteralRx.Replace(header, m =>
{
var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
literalStrings.Add(replacement, stringLiteral);
return replacement;
});
// now it's safe to split on semicolons to get name-value pairs
var nameValuePairs = semiRx.Split(cleanedHeader);
foreach(var nameValuePair in nameValuePairs)
{
var nameAndValuePieces = equalsRx.Split(nameValuePair);
var name = nameAndValuePieces[0].Trim();
var value = nameAndValuePieces[1];
string replacementValue;
if (literalStrings.TryGetValue(value, out replacementValue))
{
value = replacementValue;
}
result.Add(name, value);
}
return result;
}
}
There's every chance there are some proper bugs in it.
Here's some unit tests you should incorporate, too;
[TestMethod]
public void TestMethod1()
{
var tests = new[] {
new { input=@"foo=bar; baz=quux", expected = @"foo|bar^baz|quux"},
new { input=@"foo=bar;baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""bar"";baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""b,a,r"";baz=""quux""", expected = @"foo|b,a,r^baz|quux"},
new { input=@"foo=""b;r"";baz=""quux""", expected = @"foo|b;r^baz|quux"},
new { input=@"foo=""b\""r"";baz=""quux""", expected = @"foo|b""r^baz|quux"},
new { input=@"foo=""b=r"";baz=""quux""", expected = @"foo|b=r^baz|quux"},
};
var parser = new HttpHeaderParser();
foreach(var test in tests)
{
var actual = parser.Parse(test.input);
var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
Assert.AreEqual(test.expected, actualAsString);
}
}
Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';
x=1
or quoted;
x="foo bar baz"
So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;
x="y=z"
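A splitter that ignores separators inside quotes can be sketched with a small character scan instead of a regex. This is only an illustration under a stated assumption: double quotes always come in pairs and are never escaped inside a value. QuotedSplitter is a hypothetical helper, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedSplitter
{
    // Splits on the separator only while outside double quotes.
    // Assumption: quotes are balanced and never escaped inside a value.
    public static List<string> Split(string input, char separator)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in input)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes; // entering or leaving a quoted value
                current.Append(c);
            }
            else if (c == separator && !inQuotes)
            {
                parts.Add(current.ToString()); // separator outside quotes ends a part
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        parts.Add(current.ToString()); // the last part has no trailing separator
        return parts;
    }
}
```

Splitting a header line first with ';' and then each piece with '=' keeps a quoted value such as "y=z" or "c;d" in one piece; the surrounding quotes are left in place and can be trimmed afterwards.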
It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like
FileUpload1.HasFile
FileUpload1.FileName
If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
// Verify that the user selected a file
if (file != null && file.ContentLength > 0)
{
// extract only the filename
var fileName = Path.GetFileName(file.FileName);
// store the file inside ~/App_Data/uploads folder
var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
file.SaveAs(path);
}
// redirect back to the index action to show the form once again
return RedirectToAction("Index");
}
This:
(?<!\\\\)=
matches = not preceded by \\.
It should be:
(?<!\\)=
(Make sure you use @ (verbatim) strings for the regex, to avoid confusion.)
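The question of why the first pattern cuts a character can be checked directly. In the sketch below (SplitDemo is a hypothetical name, and a single backslash is assumed as the escape character), [^\\]= matches the '=' together with the character in front of it, so Regex.Split removes both. The lookbehind version only inspects the previous character and consumes nothing but the '='.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // [^\\]= matches two characters: any non-backslash plus '='.
    // Everything matched is removed by Split, so the key loses its last letter.
    public static string[] SplitConsuming(string s)
    {
        return Regex.Split(s, @"[^\\]=");
    }

    // (?<!\\)= matches only '=', and only when the previous
    // character is not a backslash.
    public static string[] SplitLookbehind(string s)
    {
        return Regex.Split(s, @"(?<!\\)=");
    }
}
```

With the input name=value, SplitConsuming gives "nam" and "value", while SplitLookbehind gives "name" and "value". With the input a\=b=c, SplitLookbehind splits only at the second '=' and gives a\=b and c.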

How to remove " [ ] \ from string

I have a string
"[\"1,1\",\"2,2\"]"
and I want to turn this string into this:
1,1,2,2
I am using Replace function for that like
obj.str.Replace("[","").Replace("]","").Replace("\\","");
But it does not return the expected result.
Please help.
You haven't removed the double quotes. Use the following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\\","").Replace("\"", "");
Here is an optimized approach in case the string or the list of exclude-characters is long:
public static class StringExtensions
{
public static String RemoveAll(this string input, params Char[] charactersToRemove)
{
if(string.IsNullOrEmpty(input) || (charactersToRemove==null || charactersToRemove.Length==0))
return input;
var exclude = new HashSet<Char>(charactersToRemove); // removes duplicates and has constant lookup time
var sb = new StringBuilder(input.Length);
foreach (Char c in input)
{
if (!exclude.Contains(c))
sb.Append(c);
}
return sb.ToString();
}
}
Use it in this way:
str = str.RemoveAll('"', '[', ']', '\\');
// or use a string as "remove-array":
string removeChars = "\"{[]\\";
str = str.RemoveAll(removeChars.ToCharArray());
You should do following:
obj.str = obj.str.Replace("[","").Replace("]","").Replace("\"","");
string.Replace does not change the string in place. If you have
string test = "12345";
and call
test.Replace("2", "1");
test will still be "12345". Replace doesn't change the string itself; it creates a new string with the replaced content. So you need to assign this new string to a new variable or back to the same one:
string changedTest = test.Replace("2", "1");
Now changedTest will contain "11345".
Another note on your code is that you don't actually have \ character in your string. It's only displayed in order to escape quote character. If you want to know more about this, please read MSDN article on string literals.
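Both points can be checked with a few lines. The sketch below uses a hypothetical helper, ReplaceDemo.Clean, and the literal from the question: the backslashes exist only in the source code, so the string has 13 characters and none of them is a backslash.

```csharp
using System;

public static class ReplaceDemo
{
    // Each Replace call returns a new string; the chain's final
    // result has to be returned (or assigned) to be of any use.
    public static string Clean(string s)
    {
        return s.Replace("[", "").Replace("]", "").Replace("\"", "");
    }
}
```

For example, with string raw = "[\"1,1\",\"2,2\"]"; the expression raw.Contains("\\") is false, raw.Length is 13, and ReplaceDemo.Clean(raw) returns 1,1,2,2 while raw itself is unchanged.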
How about:
var exclusions = new HashSet<char>(new[] { '"', '[', ']', '\\' });
return new string(obj.str.Where(c => !exclusions.Contains(c)).ToArray());
To do it all in one sweep.
As Tim Schmelter writes, if you wanted to do it often, especially with large exclusion sets over long strings, you could make an extension like this.
public static string Strip(
this string source,
params char[] exclusions)
{
if (!exclusions.Any())
{
return source;
}
var mask = new HashSet<char>(exclusions);
var result = new StringBuilder(source.Length);
foreach (var c in source.Where(c => !mask.Contains(c)))
{
result.Append(c);
}
return result.ToString();
}
so you could do,
var result = "[\"1,1\",\"2,2\"]".Strip('"', '[', ']', '\\');
Capture the numbers only with this regular expression [0-9]+ and then concatenate the matches:
var input = "[\"1,1\",\"2,2\"]";
var regex = new Regex("[0-9]+");
var matches = regex.Matches(input).Cast<Match>().Select(m => m.Value);
var result = string.Join(",", matches);

Searching the first few characters of every word within a string in C#

I am new to programming languages. I have a requirement where I have to return a record based on a search string.
For example, take the following three records and a search string of "Cal":
University of California
Pascal Institute
California University
I've tried String.Contains, but all three are returned. If I use String.StartsWith, I get only record #3. My requirement is to return #1 and #3 in the result.
Thank you for your help.
If you're using .NET 3.5 or higher, I'd recommend using the LINQ extension methods. Check out String.Split and Enumerable.Any. Something like:
string myString = "University of California";
bool included = myString.Split(' ').Any(w => w.StartsWith("Cal"));
Split divides myString at the space characters and returns an array of strings. Any works on the array, returning true if any of the strings starts with "Cal".
If you don't want to or can't use Any, then you'll have to manually loop through the words.
string myString = "University of California";
bool included = false;
foreach (string word in myString.Split(' '))
{
if (word.StartsWith("Cal"))
{
included = true;
break;
}
}
I like this for simplicity:
if(str.StartsWith("Cal") || str.Contains(" Cal")){
//do something
}
You can try:
foreach(var str in stringInQuestion.Split(' '))
{
if(str.StartsWith("Cal"))
{
//do something
}
}
You can use Regular expressions to find the matches. Here is an example
//array of strings to check
String[] strs = {"University of California", "Pascal Institute", "California University"};
//create the regular expression to look for
Regex regex = new Regex(@"\bCal\w*"); // \b anchors the match to the start of a word
//create a list to hold the matches
List<String> myMatches = new List<String>();
//loop through the strings
foreach (String s in strs)
{ //check for a match
if (regex.Match(s).Success)
{ //add to the list
myMatches.Add(s);
}
}
//loop through the list and present the matches one at a time in a message box
foreach (String matchItem in myMatches)
{
MessageBox.Show(matchItem + " was a match");
}
string univOfCal = "University of California";
string pascalInst = "Pascal Institute";
string calUniv = "California University";
string[] arrayofStrings = new string[]
{
univOfCal, pascalInst, calUniv
};
string wordToMatch = "Cal";
foreach (string i in arrayofStrings)
{
if (i.Contains(wordToMatch)){
Console.Write(i + "\n");
}
}
Console.ReadLine();
var strings = new List<string> { "University of California", "Pascal Institute", "California University" };
var matches = strings.Where(s => s.Split(' ').Any(x => x.StartsWith("Cal")));
foreach (var match in matches)
{
Console.WriteLine(match);
}
Output:
University of California
California University
This is actually a good use case for regular expressions.
string[] words =
{
"University of California",
"Pascal Institute",
"California University"
};
var expr = @"\bcal";
var opts = RegexOptions.IgnoreCase;
var matches = words.Where(x =>
Regex.IsMatch(x, expr, opts)).ToArray();
The "\b" matches any word boundary (punctuation, space, etc...).
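If the search term comes from user input, the word-boundary approach can be wrapped in a small helper. This is a sketch with hypothetical names (PrefixSearch, WordStartsWith). It assumes the term starts with a letter or digit, because \b only matches between a word character and a non-word character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PrefixSearch
{
    // Returns the records in which at least one word starts with the term.
    public static List<string> WordStartsWith(IEnumerable<string> records, string term)
    {
        // Regex.Escape keeps characters such as '.' or '+' in the term literal.
        string pattern = @"\b" + Regex.Escape(term);
        return records
            .Where(r => Regex.IsMatch(r, pattern, RegexOptions.IgnoreCase))
            .ToList();
    }
}
```

Called with the three records from the question and the term "Cal", it returns "University of California" and "California University"; "Pascal Institute" is skipped because the "cal" in "Pascal" does not start a word.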

How can I parse a value that appears some place in a file using C#

I have strings and each contain a value of RowKey stored like this:
data-RowKey=029
This occurs only once in each file. Is there some way I can get the number out with a C# function or do I have to write some kind of select myself. I have a team mate who suggested linq but I'm not sure if this even works on strings and I don't know how I could use this.
Update:
Sorry I changed this from file to string.
Linq does not really help you here. Use a regular expression to extract the number:
data-RowKey=(\d+)
Update:
Regex r = new Regex(@"data-RowKey=(\d+)");
string abc = "data-RowKey=029"; // the string to search (sample value from the question)
Match match = r.Match(abc);
if (match.Success)
{
string rowKey = match.Groups[1].Value;
}
Code:
public string ExtractRowKey(string filePath)
{
Regex r = new Regex(@"data-RowKey=(\d+)");
using (StreamReader reader = new StreamReader(filePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
Match match = r.Match(line);
if (match.Success) return match.Groups[1].Value;
}
}
return null; // no line contained a row key
}
Assuming that it only exists once in a file, i would even throw an exception otherwise:
String rowKey = null;
try
{
rowKey = File.ReadLines(path)
.Where(l => l.IndexOf("data-RowKey=") > -1)
.Select(l => l.Substring(12 + l.IndexOf("data-RowKey=")))
.Single();
}
catch (InvalidOperationException) {
// you might want to log this exception instead
throw;
}
Edit: The simple approach with a string: take the first occurrence, assuming the key is always three characters long:
rowKey = text.Substring(12 + text.IndexOf("data-RowKey="), 3);
Assuming the following:
The file must contain data-RowKey= (exact match, including case)
The number is always 3 digits long
the code could look like this:
var fileNames = Directory.GetFiles("rootDirPath");
var tuples = new List<Tuple<String, int>>();
foreach(String fileName in fileNames)
{
String fileData =File.ReadAllText(fileName) ;
int index = fileData.IndexOf("data-RowKey=");
if(index >=0)
{
String numberStr = fileData.Substring(index+12,3);// ASSUMING data-RowKey is always found, and number length is always 3
int number = 0;
int.TryParse(numberStr, out number);
tuples.Add(Tuple.Create(fileName, number));
}
}
Regex g = new Regex(@"data-RowKey=(?<Value>\d+)");
using (StreamReader r = new StreamReader("myFile.txt"))
{
string line;
while ((line = r.ReadLine()) != null)
{
Match m = g.Match(line);
if (m.Success)
{
string v = m.Groups["Value"].Value;
// ...
}
}
}
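For the string version of the question, the regex approach can be wrapped in a Try-style helper so the caller does not have to deal with Match objects. This is a sketch with hypothetical names (RowKeyExtractor, TryGetRowKey). It returns the key as text, on the assumption that leading zeros such as in 029 matter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RowKeyExtractor
{
    // The pattern is case-sensitive and expects the spelling used in the question.
    private static readonly Regex RowKeyRx = new Regex(@"data-RowKey=(\d+)");

    // Returns true and the digits of the first occurrence when one is found;
    // otherwise returns false and sets rowKey to null.
    public static bool TryGetRowKey(string text, out string rowKey)
    {
        Match m = RowKeyRx.Match(text ?? string.Empty);
        rowKey = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }
}
```

Unlike the Substring variants, this does not depend on the number being exactly three digits long.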

Extract data from a big string

First of all, I'm using the function below to read data from a PDF file.
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close(); // close once, after all pages have been read
}
return text.ToString();
}
As you can see, all the data is saved in one string. The string looks like this:
label1: data1;
label2: data2;
label3: data3;
.............
labeln: datan;
My question: how can I get the data from the string based on the labels?
I've tried this, but I'm stuck:
if ( string.Contains("label1"))
{
extracted_data1 = string.Substring(string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}
if ( string.Contains("label2"))
{
extracted_data2 = string.Substring(string.IndexOf("label2") + string.IndexOf(':') , string.IndexOf(';') - string.IndexOf(':') - 1);
}
Have a look at the String.Split() function, it tokenises a string based on an array of characters supplied.
e.g.
string[] lines = text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries);
Now loop through that array and split each entry again:
foreach(string line in lines) {
string[] pair = line.Split(new[] {':'});
string key = pair[0].Trim();
string val = pair[1].Trim();
....
}
Obviously check for empty lines, and use .Trim() where needed...
[EDIT]
Or alternatively as a nice Linq statement...
var result = from line in text.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries)
let tokens = line.Split(new[] {':'})
select tokens;
Dictionary<string, string> dict =
result.ToDictionary (key => key[0].Trim(), value => value[1].Trim());
It's pretty hard-coded, but you could use something like this (with a little bit of trimming to your needs):
string input = "label1: data1;"; // Example of your input
string data = input.Split(':')[1].Replace(";","").Trim();
You can do this by using Dictionary<string,string>,
Dictionary<string, string> dicLabelData = new Dictionary<string, string>();
List<string> listStrSplit = new List<string>();
listStrSplit = strBig.Split(';').ToList<string>();//strBig is big string which you want to parse
foreach (string strSplit in listStrSplit)
{
if (strSplit.Split(':').ToList<string>().Count > 1)
{
List<string> listLable = new List<string>();
listLable = strSplit.Split(':').ToList<string>();
dicLabelData.Add(listLable[0],listLable[1]);//Key=Label,Value=Data
}
}
dicLabelData now contains the data for every label.
I think you can use a regex to solve this problem. Just split the string at the line breaks and use a regex to get the right number.
You can use a regex to do it:
Regex rx = new Regex("label([0-9]+): ([^;]*);");
var matches = rx.Matches("label1: a string; label2: another string; label100: a third string;");
foreach (Match match in matches) {
var id = match.Groups[1].ToString();
var data = match.Groups[2].ToString();
var idAsNumber = int.Parse(id);
// Here you use an array or a dictionary to save id/data
}
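The Split-based answers above can be combined into one helper that tolerates a ':' inside the data by splitting each entry at the first colon only. This is a sketch under assumptions: LabelParser is a hypothetical name, entries are separated by ';', and the data itself never contains ';'.

```csharp
using System;
using System.Collections.Generic;

public static class LabelParser
{
    // Turns "label1: data1; label2: data2;" into a label -> data dictionary.
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        string[] entries = text.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string entry in entries)
        {
            // Split at the first ':' only, so values such as "10:30" stay intact.
            string[] pair = entry.Split(new[] { ':' }, 2);
            if (pair.Length != 2)
                continue; // no label separator: skip the fragment

            string key = pair[0].Trim();
            if (key.Length == 0)
                continue; // nothing before the colon

            result[key] = pair[1].Trim(); // a repeated label overwrites the earlier value
        }
        return result;
    }
}
```

Trim also removes the line breaks that remain at the start of each entry when the text has one label per line.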
