Best method of Textfile Parsing in C#?

Best method of Textfile Parsing in C#? - c#

I want to parse a config file sorta thing, like so:
[KEY:Value]
[SUBKEY:SubValue]
Now I started with a StreamReader, converting lines into character arrays, when I figured there's gotta be a better way. So I ask you, humble reader, to help me.
One restriction is that it has to work in a Linux/Mono environment (1.2.6 to be exact). I don't have the latest 2.0 release (of Mono), so try to restrict language features to C# 2.0 or C# 1.0.

I considered it, but I'm not going to use XML. I am going to be writing this stuff by hand, and hand editing XML makes my brain hurt. :')
Have you looked at YAML?
You get the benefits of XML without all the pain and suffering. It's used extensively in the ruby community for things like config files, pre-prepared database data, etc
here's an example
customer:
name: Orion
age: 26
addresses:
- type: Work
number: 12
street: Bob Street
- type: Home
number: 15
street: Secret Road
There appears to be a C# library here, which I haven't used personally, but yaml is pretty simple, so "how hard can it be?" :-)
I'd say it's preferable to inventing your own ad-hoc format (and dealing with parser bugs)

I was looking at almost this exact problem the other day: this article on string tokenizing is exactly what you need. You'll want to define your tokens as something like:
#"(?&ltlevel>\s) | " +
#"(?&ltterm>[^:\s]) | " +
#"(?&ltseparator>:)"
The article does a pretty good job of explaining it. From there you just start eating up tokens as you see fit.
Protip: For an LL(1) parser (read: easy), tokens cannot share a prefix. If you have abc as a token, you cannot have ace as a token
Note: The article's missing the | characters in its examples, just throw them in.

There is another YAML library for .NET which is under development. Right now it supports reading YAML streams and has been tested on Windows and Mono. Write support is currently being implemented.

Using a library is almost always preferably to rolling your own. Here's a quick list of "Oh I'll never need that/I didn't think about that" points which will end up coming to bite you later down the line:
Escaping characters. What if you want a : in the key or ] in the value?
Escaping the escape character.
Unicode
Mix of tabs and spaces (see the problems with Python's white space sensitive syntax)
Handling different return character formats
Handling syntax error reporting
Like others have suggested, YAML looks like your best bet.

You can also use a stack, and use a push/pop algorithm. This one matches open/closing tags.
public string check()
{
ArrayList tags = getTags();
int stackSize = tags.Count;
Stack stack = new Stack(stackSize);
foreach (string tag in tags)
{
if (!tag.Contains('/'))
{
stack.push(tag);
}
else
{
if (!stack.isEmpty())
{
string startTag = stack.pop();
startTag = startTag.Substring(1, startTag.Length - 1);
string endTag = tag.Substring(2, tag.Length - 2);
if (!startTag.Equals(endTag))
{
return "Fout: geen matchende eindtag";
}
}
else
{
return "Fout: geen matchende openeningstag";
}
}
}
if (!stack.isEmpty())
{
return "Fout: geen matchende eindtag";
}
return "Xml is valid";
}
You can probably adapt so you can read the contents of your file. Regular expressions are also a good idea.

It looks to me that you would be better off using an XML based config file as there are already .NET classes which can read and store the information for you relatively easily. Is there a reason that this is not possible?
#Bernard: It is true that hand editing XML is tedious, but the structure that you are presenting already looks very similar to XML.
Then yes, has a good method there.

#Gishu
Actually once I'd accommodated for escaped characters my regex ran slightly slower than my hand written top down recursive parser and that's without the nesting (linking sub-items to their parents) and error reporting the hand written parser had.
The regex was a slightly faster to write (though I do have a bit of experience with hand parsers) but that's without good error reporting. Once you add that it becomes slightly harder and longer to do.
I also find the hand written parser easier to understand the intention of. For instance, here is the a snippet of the code:
private static Node ParseNode(TextReader reader)
{
Node node = new Node();
int indentation = ParseWhitespace(reader);
Expect(reader, '[');
node.Key = ParseTerminatedString(reader, ':');
node.Value = ParseTerminatedString(reader, ']');
}

Regardless of the persisted format, using a Regex would be the fastest way of parsing.
In ruby it'd probably be a few lines of code.
\[KEY:(.*)\]
\[SUBKEY:(.*)\]
These two would get you the Value and SubValue in the first group. Check out MSDN on how to match a regex against a string.
This is something everyone should have in their kitty. Pre-Regex days would seem like the Ice Age.

Related

Is it possible to check for a verb without creating my own database?

I am playing around with a sentence string entry for a project I'm working on in C# and wanted to see if there was an alternative way to search for a verb using a built in function.
Currently, I am using a database table with a list of regular verbs and cycling through those to check if there is a match but wanted to see if there would be a better way to do this?
Consider the following input:
"Develop string matching software for verb"
Program will read the string and check each word,
if (word == isVerb)
{
m_verbs.Add(word);
}

Short answer :
There is a better way.
Long answer :
It's not that simple. The problem is that there is no inbuilt language functionality into the string class in C#. This is an implementation detail that rests on the developer's shoulders.
You have some grammatical (or perhaps lexical is a better word) issues to consider as Owen79 pointed out in his comment. Then there is the question of environment / resource restrictions.
You have a few options available to you :
Web based dictionary services. You can query those with the words of your sentence and get back the 'status' of each word. Then you will take only the statuses you want, like verbs for instance. Here is a link to DictService which also includes a C# code sample.
A text / xml / other file based solution. Similar approach, you simply look up the words in the file and act according to the presence or absence of the word in the file. You can cache (load into memory) the contents of the file to save on IO operations. Here are the links to lists of regular and irregular verbs.
Database solution is identical to the previous one with the exception of loading contents into memory. That part may be unnecessary but that depends on your implementation requirements.
Bottom line each solution will require some work but whatever option you go for the key aspects to consider are the platform and the resources available to you. If computational speed is a concern you will most likely need to do some tricks to cut down on lookup times etc.
Hope this helps

you could load the common verbs from disk in a text file. If you have lots of verbs and worry about memory you could bucket them into common and uncommon or alphabetically then load in the dictionaries if needed

If you don't want to use the databse option (although highly recommanded), then you need to put them in a data structure (e.g. array or list). You can then use powerful System.Linq extension methods.
For example:
string[] allVerbs = new[] { "eat", "drink" }; // etc
string s = "Develop string matching software for verb";
var words = s.Split(' ');
foreach (var word in words)
if (allVerbs.Contains(word.ToLower()))
m_verbs.Add(word);

How to read text file between ""

I need an "idea" on how to read text file data between quotes. For example:
line 1: "read a title"
line 2: "read a descr"
line 1: "read a title"
line 2: "read a descr"
I want to do a foreach type of thing, and I want to read all Line 1's, and Line 2's as a pair, but between the ".
In my program I am going to output (foreach of course):
readTerminatedNull(file1);
readTerminatedNull(file2);
I would read line by line, but some of the text could be:
line 1: "read a super long
title that goes off"
line 2: "read a descr"
So that's why I want to read between the ".
Sorry if that is too complicated, and it's a little hard to explain.
Edit:
Thanks for all the feed back guys, but I'm not sure you are getting what I am trying to do :p not your faults, I wrote this kinda wierd.
I will have a text file full of refrences, and text. like so.
text inside:
Refren: "myrefrence_1"
String: "This is a string of a refrence"
Refren: "myrefrence_2"
String: "hello world"
Refren: "myrefrence_3"
String: "I like cookies."
I want it to to read myrefrence_1 in the quotes of the first line, and then read the string in the next line between the ".
I will then stuff into my program that matches the refrence with the string.
But sometimes the text will be more than one line.
Refren: "this is text that goes and then
return keys on some parts."
and I still want it to read through the ".

(not tested, but you'll get the idea)
// Read all text from file
string sData = File.ReadAllText(#"c:/file.txt");
// Match strings between " "
Match match = Regex.Match(sData , "\"(\w|\d|\s|\\\")*\"",
RegexOptions.IgnoreCase);
// Read results and strip " out of them
foreach (var sResult in match) {
sResult = sResult.Remove(0,1).Remove(sResult.length-2, 1);
// Do whatever with sResult
}

You could learn some new tricks by looking into state machines. Basically: Read each character at a time and figure out what state you are in now. First, code this as a big while loop with a big switch statement inside. Then, go and read up on the state pattern for how to do this in an object oriented way. Then, ditch that and use delegates, because c# makes this stuff so easy to do.
Then, scrap it all, write some crappy Regular Expression with a multiline flag and slurp it the Perl way. Meditate on why this is the same as your original state machine solution.
Then, get really stuck in and learn about parser generators (lexx/yacc or some .NET variant) and write a simple BNF grammar for your problem. Take special note of how the trivial grammars used in the tutorials are all way more complicated than the one you need to write. Why is that so? Check out what Noam Chomsky had to say about that.
Eventually, you'll burn out. We all do. But you'll have so much fun digging into what makes programming the coolest activity on the planet. Burn-out is just the realization that that's a pipe dream ;)
When you're done, go outside. Meet people. Talk. Smile a lot. Be friendly. You're now a zen infused developer with a wicked grin. Yay for you! You rock!

What you're describing sounds like a single-column CSV file. The easiest way to access that is probably to use the Microsoft.VisualBasic.FileIO.TextFieldParser class, something like:
using (var csvParser = new TextFieldParser(new StringReader(content))
{
Delimiters = new[] {","},
HasFieldsEnclosedInQuotes = true
})
{
while (!csvParser.EndOfData)
{
var fields = csvParser.ReadFields();
Console.Print(fields[0]); //do something with the first (in your case only) field found.
}
}
Probably the easiest way to determine whether this approach makes sense, is to think about what happens if the string you're reading actually contains a double quote. Would it end up as "He said ""this is quoted"", but I wasn't listening" (doubling up the quotes), or is this situation impossible?
If the quotes would be doubled up in this way, then a standard CSV reader like this built-in framework one is probably your best bet.

To read all of the lines of the file you can use:
File.ReadAllLines(pathToFile);
to strip the text from "" you can use the substring method of string: http://msdn.microsoft.com/en-us/library/aka44szs.aspx
you can do it like that:
string strippedString = original.Substring(1, original.length -2);

Try this one
var text = File.ReadAllLines(pathToFile);
var lines = text.Split(':')
.Where((s,i) => i % 2 != 0)
.Select(s => s.trim('"'));

First of all you need to read in the file using:
File.ReadAllLines(filePath);
Then you could split all the lines using the string.Split function.
Splitting on the closing bracket would be your best bet.

As i have understood from you question is you want to read and write text file with some specific settings. is it ?
I would like to refer to to INI files which are the text files it self and provide the settings configurations as you wish to achieve. here are some links these could help you.
http://www.codeproject.com/Articles/1966/An-INI-file-handling-class-using-C
http://jachman.wordpress.com/2006/09/11/how-to-access-ini-files-in-c-net/

What's the best way to parse a string for "bad" words in C#?

I'm thinking of something like:
foreach (var word in paragraph.split(' ')) {
if (badWordArray.Contains(word) {
// do something about it
}
}
but I'm sure there's a better way.
Thanks in advance!
UPDATE
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.

While your way works, it may be a bit time consuming. There is a wonderful response here for a previous SO question. Though the question talks about PHP instead of C#, I think it can be easily ported.
Edit to add sample code:
public string FilterWords(string inputWords) {
Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
return wordFilter.Replace(inputWords, "<3");
}
That should work for you, more or less.
Edit to answer OP clarification:
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.
Much as the replacement portion above, you can see if something matches like so:
public bool HasBadWords(string inputWords) {
Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
return wordFilter.IsMatch(inputWords);
}
It will return true if the string you passed to it contains any words in the list.

At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).
One tip is to pre-process the user input before testing against your list, in that case that someone is trying to sneak something by you. So by way of preprocessing, we
uppercase everything in the input
remove most non-alphanumerics (that is, just splice out any spaces, or punctuation, etc.)
and then assuming someone is trying to pass off digits for letters, do the something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)
And then get some friends to try to break it. It's fun.

You could consider using the HashKey objects or Dictionary<T1, T2> instead of the array as using a Dictionary for example can make code more efficient, because the .Contains() method becomes .Keys.Contains() which is way more efficient. This is especially true if you have a large list of profanities (not sure how many there are! :)

How to replace xml special chars manually?

My application produces an xml file that is then xslt transformed into a nice html report. I have a problem with \n however. There are some xslt techniques to do it, but they are pretty awkward and time consuming.
So my solution was to do a string.replace \n to
< br />
and then to force the xmlWriter to write this with WriteRaw(text). The problem is that the text sometimes has some illegal chars like >.
I am unable to find any utility method in .net that just takes in a string and transforms it in a xml-friendly string. I looked with the reflector and the class that handles this logic is not public.
Any ideeas (beside writing my own code to do this)?

Never, ever use string manipulation to produce XML. It's not just that it makes poorly-socialized people laugh at you: it leads to code that has bugs in it that you don't know exist.
Think about it from a test-driven perspective. You've written a method that uses string manipulation to generate XML. Okay, now write a test case for it that demonstrates that it will never emit poorly-formed XML. In order for the test to prove this, you have to test every possible scenario outlined in the XML recommendation. Do you know the XML recommendation well enough to be able to assert that your test does this?
No, you don't. And you don't really want to, not unless you're writing a framework for XML generation. That's why you use the classes in System.Xml to generate XML. The people who wrote them did that work so that you don't have to.
Tomalak showed how to do what you're trying to do with XSLT. If you're using an XmlWriter to generate the XML, use this pattern:
string s = "replace\nnewlines\nwith\nbreaks";
string[] lines = s.Split('\n');
for (int i=0; i<lines.Length; i++)
{
xw.WriteString(lines[i]);
if (i<lines.Length - 1)
{
xw.WriteElementString("br", "", "");
}
}
This uses string manipulation where it's appropriate - when manipulating string data outside of XML - and doesn't where it's not - when producing XML text.

I think the solution to this question will help you:
xslt replace \n with <br/> only in one node?
You can incorporate the provided template into your transformation process, and you're done without getting your hands dirty.

Best way to replace tokens in a large text template

I have a large text template which needs tokenized sections replaced by other text. The tokens look something like this: ##USERNAME##. My first instinct is just to use String.Replace(), but is there a better, more efficient way or is Replace() already optimized for this?

System.Text.RegularExpressions.Regex.Replace() is what you seek - IF your tokens are odd enough that you need a regex to find them.
Some kind soul did some performance testing, and between Regex.Replace(), String.Replace(), and StringBuilder.Replace(), String.Replace() actually came out on top.

The only situation in which I've had to do this is sending a templated e-mail. In .NET this is provided out of the box by the MailDefinition class. So this is how you create a templated message:
MailDefinition md = new MailDefinition();
md.BodyFileName = pathToTemplate;
md.From = "test#somedomain.com";
ListDictionary replacements = new ListDictionary();
replacements.Add("<%To%>", someValue);
// continue adding replacements
MailMessage msg = md.CreateMailMessage("test#someotherdomain.com", replacements, this);
After this, msg.Body would be created by substituting the values in the template. I guess you can take a look at MailDefinition.CreateMailMessage() with Reflector :). Sorry for being a little off-topic, but if this is your scenario I think it's the easiest way.

Well, depending on how many variables you have in your template, how many templates you have, etc. this might be a work for a full template processor. The only one I've ever used for .NET is NVelocity, but I'm sure there must be scores of others out there, most of them linked to some web framework or another.

string.Replace is fine. I'd prefer using a Regex, but I'm *** for regular expressions.
The thing to keep in mind is how big these templates are. If its real big, and memory is an issue, you might want to create a custom tokenizer that acts on a stream. That way you only hold a small part of the file in memory while you manipulate it.
But, for the naiive implementation, string.Replace should be fine.

If you are doing multiple replaces on large strings then it might be better to use StringBuilder.Replace(), as the usual performance issues with strings will appear.

Regular expressions would be the quickest solution to code up but if you have many different tokens then it will get slower. If performance is not an issue then use this option.
A better approach would be to define token, like your "##" that you can scan for in the text. Then select what to replace from a hash table with the text that follows the token as the key.
If this is part of a build script then nAnt has a great feature for doing this called Filter Chains. The code for that is open source so you could look at how its done for a fast implementation.

Had to do something similar recently. What I did was:
make a method that takes a dictionary (key = token name, value = the text you need to insert)
Get all matches to your token format (##.+?## in your case I guess, not that good at regular expressions :P) using Regex.Matches(input, regular expression)
foreach over the results, using the dictionary to find the insert value for your token.
return result.
Done ;-)
If you want to test your regexes I can suggest the regulator.

FastReplacer implements token replacement in O(n*log(n) + m) time and uses 3x the memory of the original string.
FastReplacer is good for executing many Replace operations on a large string when performance is important.
The main idea is to avoid modifying existing text or allocating new memory every time a string is replaced.
We have designed FastReplacer to help us on a project where we had to generate a large text with a large number of append and replace operations. The first version of the application took 20 seconds to generate the text using StringBuilder. The second improved version that used the String class took 10 seconds. Then we implemented FastReplacer and the duration dropped to 0.1 seconds.

If your template is large and you have lots of tokens, you probably don't want walk it and replace the token in the template one by one as that would result in an O(N * M) operation where N is the size of the template and M is the number of tokens to replace.
The following method accepts a template and a dictionary of the keys value pairs you wish to replace. By initializing the StringBuilder to slightly larger than the size of the template, it should result in an O(N) operation (i.e. it shouldn't have to grow itself log N times).
Finally, you can move the building of the tokens into a Singleton as it only needs to be generated once.
static string SimpleTemplate(string template, Dictionary<string, string> replacements)
{
// parse the message into an array of tokens
Regex regex = new Regex("(##[^#]+##)");
string[] tokens = regex.Split(template);
// the new message from the tokens
var sb = new StringBuilder((int)((double)template.Length * 1.1));
foreach (string token in tokens)
sb.Append(replacements.ContainsKey(token) ? replacements[token] : token);
return sb.ToString();
}

This is an ideal use of Regular Expressions. Check out this helpful website, the .Net Regular Expressions class, and this very helpful book Mastering Regular Expressions.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.