Best option for extracting string data?

Best option for extracting string data? - c#

I have a Portable Bridge Notation formatted-file that I have to work with. I have already a few simple examples working by using indexing and substrings extracting what I need and I suppose for this PBN business it would suite me well since it isn't run too often. If I one the other hand would run that code like all the time (thinking of working with vCards) then I am worried about memory usage under high workload because of the high amout of temporary string variables created from all substrings and splits.
There are two other options available that I know of. Regex and StringReader / TextReader and I wanted some general opinion on what to use.
Intended usage is to extract to objects and serialize to json so that I can more easily work with or persist this information. Hell if it's fast enough I might even do it on the fly.
So hit me, what would you chose?

Personally, I would read the file line by line and store it in an internal representation and then query it with LINQ.
The advantage of storing to an internal representation is that you just read the file from top to bottom, so that just easy. And when you need to query, you have powerful linq queries, which make life a lot easier.

Related

Storing data into custom made files

I am making a combinations generator. For small amounts of elements it's not a problem for the computer that the data is getting stored in the RAM memory instead in a file. But when the number of elements gets bigger, my computer runs out of memory (the exception OutOfMemoryException occurs). The combinations are numbers stored in Lists, that are currently getting stored in an another List.
But this only the first step- the generator work right. I want the data to be stored in file, from where a different program will be able to extract the combinations it needs. Mostly, I need to store the data in a separate file, because the generator will have to be able to create more and more and bigger combinations in the future. The computer will have to read certain parts of the data, without coping all of it in the temporary memory, because this is impossible.
I don't want to turn the data into text and when needed to convert the text back into data. I think this will make things slower because of the conversions. I want the lists to be stored into a custom made file, from where the program can directly extract the data without any converting.

There is a ton of options available I will briefly describe a few.
Use a database. From your description this does not look like a good choices but it will be the most flexible to all clients relativity fast and efficient storage.
Use one of the .net serializers from your description binary serializer will be your best choice. The serlizers offer a lot of advantages relativity fast and baked into .net with built in support and very easy to use.
Use a custom binary format. This will be the fastest option especially if you combine it with a memory mapped file. However binary formats can be hard to use and easy to screw up.

If you really want to store your data in file, you can use BinaryFormatter class. It is probably the most efficient way of serializing data objects into binary stream.
But I wouldn't recommend you generating combinations in this way if only don't you need to store them at one time and load long time after that. It's better to use lazy-generation of combinations. One by one, completely generated without the need to "generate bigger combinations in future" (generate "the biggest" needed combinations one by one - you might want to change your generation algorithm a bit - there are plenty answers how to do that already)

There's a good write up on how to serialize a List<> to a file at http://www.switchonthecode.com/tutorials/csharp-tutorial-serialize-objects-to-a-file

You can use something as a persistent data structure instead, this will reduce the amount of memory needed by your app, without changint too much the current code.
Have a look at this question:
Looking for a simple standalone persistent dictionary implementation in C#
there is a lot of resources doing that, in particular this answer seems to point to some really interesting links:

Computing, storing, and retrieving values to and from an N-Dimensional matrix

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge.
Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints.
Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long.
In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table.
The questions now, are:
What format(s) might be acceptable
for storing the data?
What programs (MATLAB, C#, etc)
might be best suited to compute the
data?
C# must be used to import the data
on the device - is this possible
given your answer to #1?
Edit:
Is it possible to read from my lookup table file without reading the entire file into memory? Can you explain how that might be done in C#?

I'll comment on 1 and 3 as well. It may be preferable to use a fixed width output file rather than a CSV. This may take up more or less space than a CSV, depending on the output numbers. However, it tends to work well for lookup tables, as figuring out where to look in a fixed width data file can be done without reading the entire file. This is usually important for a lookup table.
Fixed width data, as with CSV, is trivial to read and write. Some math-oriented languages might offer poor string and binary manipulation functionality, but it should be really easy to convert the data to fixed width during the import step regardless.
Number 2 is harder to answer, particularly without knowing what kind of algorithm you are computing. Matlab and similar programs tend to be great about certain types of computations and often have a lot of stuff built in to make it easier. That said, a lot of the math stuff that is built into such languages is available for other languages in the form of libraries.

I'll comment on (1) and (3). All you need to do is dump the data in slices. Pick a traversal and dump data out in that order. Write it out as comma-delimited numbers.

What's so bad about building XML with string concatenation?

In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when to comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.

You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.

I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0","UTF-8","yes"),
new XElement("products", from p in collection
select new XElement("product",
new XAttribute("guid", p.ProductId),
new XAttribute("title", p.Title),
new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding couple lines of code. I can do practically everything with it without much of effort.

Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et. al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.

I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.

I think that the problem is that you aren't watching the xml file as a logical data storage thing, but as a simple textfile where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing xml should be something similar to saving datas into a database or something logically similar

If you need trivial XML then it's fine. Its just the maintainability of string concatenation breaks down when the xml becomes larger or more complex. You pay either at development or at maintenance time. The choice is yours always - but history suggests the maintenance is always more costly and thus anything that makes it easier is worthwhile generally.

You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...

Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In #Sander's example of Linq-to-XML for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.

As you said, it's just awkward to build XML correct using string concatenation, especially now you have XML linq that allows for simple construction of an XML graph and will get namespaces, etc correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.

The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.

I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and corret it. But until I get a working save/load system, I refuse to perform mission-critical work until I know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.

Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.

Sharing character buffer between C# strings objects

Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:
var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);
If there is no such function, why bother with making strings immutable?
Or, alternatively, if strings are already immutable for other reasons, why not provide this method?
The specific reason I'm looking into this is doing some file parsing. Simple recursive descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means if you give them a large file to parse, memory churn is unbelievable. Sure there are workarounds - basically roll your own SubString class, and then of course forget about being able to use String methods such as StartsWith or String libraries such as Regex, so you need to roll your own version of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify using such a monster tool. Even TinyPG is probably an overkill.
Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...

No, there's nothing like that.
.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.
Both solutions have "wins" in some situations, and losses in others.
If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.

As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?

The .NET framework supports string interning. This is a partial solution but does not offer the posibility to reuse parts of a string. I think reusing substring will cause some problems not that obviouse at a first look. If you have to do a lot of string manipulation using the StringBuilder is the way to go.

Nothing in C# provides you the out-of-the-box functionality you're looking for.
What want is a Rope data structure, an immutable data structure which supports O(1) concats and O(log n) substrings. I can't find any C# implementations of a rope, but here a Java one.
Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.

Well you could use "unsafe" to do the memory management yourself, which might allow you to do what you are looking for. Also the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.

You could easily write a trivial class to represent "cheap". It would just hold the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed - a string cast operator would be ideal as you could use
string text = myCheapObject;
and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one liners).
The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.

How best to use XPath with very large XML files in .NET?

I need to do some processing on fairly large XML files ( large here being potentially upwards of a gigabyte ) in C# including performing some complex xpath queries. The problem I have is that the standard way I would normally do this through the System.XML libraries likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.
I don't need to be updating the files at all just reading them and querying the data contained in them. Some of the XPath queries are quite involved and go across several levels of parent-child type relationship - I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.
One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.
Alternately I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on it's original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.
I've tried to explain my objective here so if I'm barking up totally the wrong tree in terms of general approach I'm sure you folks can set me right...

XPathReader is the answer. It isn't part of the C# runtime, but it is available for download from Microsoft. Here is an MSDN article.
If you construct an XPathReader with an XmlTextReader you get the efficiency of a streaming read with the convenience of XPath expressions.
I haven't used it on gigabyte sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM based solutions.
Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".
Download from Microsoft

Gigabyte XML files! I don't envy you this task.
Is there any way that the files could be sent in a better way? E.g. Are they being sent over the net to you - if they are then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea but it could be very time consuming indeed.
I wouldn't try and do it all in memory by reading the entire file - unless you have a 64bit OS and lots of memory. What if the file becomes 2, 3, 4GB?
One other approach could be to read in the XML file and use SAX to parse the file and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30MB files and it is very quick. I was originally going to use SAX but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250ms per query. I was using Java for my parsing but I suspect there would be little difference in .NET.
I did read that XML::Twig (A Perl CPAN module) was written explicitly to handle SAX based XPath parsing. Can you use a different language?
This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html

http://msdn.microsoft.com/en-us/library/bb387013.aspx has a relevant example leveraging XStreamingElement.

In order to perform XPath queries with the standard .NET classes the whole document tree needs to be loaded in memory which might not be a good idea if it can take up to a gigabyte. IMHO the XmlReader is a nice class for handling such tasks.

It seems that you already tried using XPathDocument and could not accomodate the parsed xml document in memory.
If this is the case, before starting to split the file (which is ultimately the right decision!) you may try using the Saxon XSLT/XQuery processor. It has a very efficient in-memory representation of a loaded XML document (the "tinytree" model). In addition Saxon SA (the shema-aware version, which isn't free) has some streaming extensions. Read more about this here.

How about just reading the whole thing into a database and then work with the temp database? That might be better because then your queries can be done more efficiently using TSQL.

I think the best solution is to make your own xml parser that can read small chunks not the whole file, or you can split the large file into small files and use dotnet classes with these files.
The problem is you can not parse some of data till the whole data is available so I recommend to use your own parser not dotnet classes

Have you been trying XPathDocument?
This class is optimized for handling XPath queries efficiently.
If you cannot handle your input documents efficiently using XPathDocument you might consider preprocessing and/or splitting up your input documents using an XmlReader.

You've outlined your choices already.
Either you need to abandon the XPath and use XmlTextReader or you need to break the document up into managable chunks on which you can use XPath.
If you choose the latter use XPathDocument its readonly restriction allows better used of memory.

Since in your case the data size can run in Gbs have you considered using ADO.NET with XML as a database. In addition to that the memory footprint would not be huge.
Another approach would be using Linq to XML with using elements like XElementStream. Hope this helps.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Best option for extracting string data? - c#

Related

Storing data into custom made files

Computing, storing, and retrieving values to and from an N-Dimensional matrix

What's so bad about building XML with string concatenation?

Sharing character buffer between C# strings objects

How best to use XPath with very large XML files in .NET?

Categories

Resources