Fast search for a set of elements

Fast search for a set of elements - c#

I am trying to search a bunch of files on my hard drive for a binary pattern. I have tried to find some way of doing it with what is built in to .net but I can't seem to find anything that would let me search for a set of data, instead of just one byte of data, unless I convert my binary data in to a string first and use String.IndexOf(string value).
I am halfway through writing my own Boyer-Moor stream searching algorithm, but I thought I should check here first in case I did miss a way to do this efficiently.
Here is my current method of doing the search just for text, it works well enough, I just don't know what to do for binary patterns
private string _string;
private byte[] _array;
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories)
, Search);
}
private void Search(string filePath)
{
if (numbers)
{
var fileBinary = File.ReadAllBytes(filePath);
if (fileBinary.MagicFunctionToDoContains(_array)) //Need help here
{
lbResults.BeginInvoke(new Action<string>(AddResult), filePath);
}
}
else
{
var fileText = File.ReadAllText(filePath, Encoding.ASCII);
if (fileText.IndexOf(_string, StringComparison.OrdinalIgnoreCase) >= 0)
{
lbResults.BeginInvoke(new Action<string>(AddResult), filePath);
}
}
}
The byte arrays will not be bigger than 8 bytes at the very largest, with the common case being 4 bytes, if that affects the recommendation.
Is there any thing built in to .net or pre-written example that I could use to do this?

Coding the Boyer-Moor algorithm should be straightforward. However, for such short patterns (4-8 bytes) I doubt you see much performance boost compared to to a byte by byte search.
What you can do to increase performance, is to use pointer arithmetic using the unsafe and fixed keywords, since the array indexer will bounds check your index variable each time you access your fileBinary array.

Do you want to search the files as they are on the disk, or do you want to build an index and later search using that index?
If the former is the case, I don't see a reason why Boyer–Moore couldn't be implemented on byte "characters".
If the later is the case, you'll need a specialized data structure such as suffix tree.
BTW, loading the content of the whole file might not be the best idea performance wise - what if you happen to run into a multi-GB video file? Since all you are doing is linearly traversing the file content, you can load it chunk-by-chunk.
For a really performant implementation, separate the search and the chunk-loading into concurrent threads (or better yet, TPL Tasks) with the queue (of chunks) in between. There may even be some benefits in reading multiple files in parallel to utilize the native command queuing implemented in most modern disk controllers (but only for mechanical disks, SSDs don't benefit from NCQ).

I don't know of anything in the .Net Framework that will do what you are trying to accomplish using byte[]. But I think a simple solution would be to convert each byte to char and then the char[] to a string; so you would convert the filedata to char[] and then string and also the data you are searching for and then use the string search algorithms build into .Net. It would save the time of having to roll your own pattern searching algorithm, plus if the data is not large overhead should be negligible.

Related

How to handle large amounts of data in C#?

I have data stored in several seperate text files that I parse and analyze afterwards.
The size of the data processed differs a lot. It ranges from a few hundred megabytes (or less) to 10+ gigabytes.
I started out with storing the parsed data in a List<DataItem> because I wanted to perform a BinarySearch() during the analysis. However, the program throws an OutOfMemory-Exception if too much data is parsed. The exact amount the parser can handle depends on the fragmentation of the memory. Sometimes it's just 1.5 gb of the files and some other time it's 3 gb.
Currently I'm using a List<List<DataItem>> with a limited number of entries because I thought it would change anything for the better. There weren't any significant improvements though.
Another way I tried was serializing the parser data and than deserializing it if needed. The result of that approach was even worse. The whole process took much longer.
I looked into memory mapped files but I don't really know if they could help me because I never used them before. Would they?
So how can I quickly access the data from all the files without the danger of throwing an OutOfMemoryException and find DataItems depending on their attributes?
EDIT: The parser roughly works like this:
void Parse() {
LoadFile();
for (int currentLine = 1; currentLine < MAX_NUMBER_OF_LINES; ++currentLine) {
string line = GetLineOfFile(currentLine);
string[] tokens = SplitLineIntoTokens(line);
DataItem data = PutTokensIntoDataItem(tokens);
try {
List<DataItem>.Add(data);
} catch (OutOfMemoryException ex) {}
}
}
void LoadFile(){
DirectoryInfo di = new DirectroyInfo(Path);
FileInfo[] fileList = di.GetFiles();
foreach(FileInfo fi in fileList)
{
//...
StreamReader file = new SreamReader(fi.FullName);
//...
while(!file.EndOfStram)
strHelp = file.ReadLine();
//...
}
}

There is no right answer for this I believe. The implementation depends on many factors that only you can rate pros and cons on.
If your primary purpose is to parse large files and large number of them, keeping these in memory irrespective of how much RAM is available should be a secondary option, for various reasons for e.g. like persistance at times when an unhandled exception occured.
Although when profiling under initial conditions you may be encouraged and inclined to load them to memory retain for manipulation and search, this will soon change as the number of files increase and in no time your application supporters will start ditching this.
I would do the below
Read and store each file content to a document database like Raven DB for e.g.
Perform parse routine on these documents and store the relevant relations in an rdbms db if that is the requirement
Search at will, fulltext or otherwise, on either the document db (raw) or relational (your parse output)
By doing this, you are taking advantage of research done by the creators of these systems in managing the memory efficiently with focus on performance
I realise that this may not be the answer for you, but for someone who may think this is better and suits perhaps yes.

If the code in your question is representative of the actual code, it looks like you're reading all of the data from all of the files into memory, and then parsing. That is, you have:
Parse()
LoadFile();
for each line
....
And your LoadFile loads all of the files into memory. Or so it seems. That's very wasteful because you maintain a list of all the un-parsed lines in addition to the objects created when you parse.
You could instead load only one line at a time, parse it, and then discard the unparsed line. For example:
void Parse()
{
foreach (var line in GetFileLines())
{
}
}
IEnumerable<string> GetFileLines()
{
foreach (var fileName in Directory.EnumerateFiles(Path))
{
foreach (var line in File.ReadLines(fileName)
{
yield return line;
}
}
}
That limits the amount of memory you use to hold the file names and, more importantly, the amount of memory occupied by un-parsed lines.
Also, if you have an upper limit to the number of lines that will be in the final data, you can pre-allocate your list so that adding to it doesn't cause a re-allocation. So if you know that your file will contain no more than 100 million lines, you can write:
void Parse()
{
var dataItems = new List<DataItem>(100000000);
foreach (var line in GetFileLines())
{
data = tokenize_and_build(line);
dataItems.Add(data);
}
}
This reduces fragmentation and out of memory errors because the list is pre-allocated to hold the maximum number of lines you expect. If the pre-allocation works, then you know you have enough memory to hold references to the data items you're constructing.
If you still run out of memory, then you'll have to look at the structure of your data items. Perhaps you're storing too much information in them, or there are ways to reduce the amount of memory used to store those items. But you'll need to give us more information about your data structure if you need help reducing its footprint.

You can use:
Data Parallelism (Task Parallel Library)
Write a Simple Parallel.ForEach
I think it will make it will reduce memory exception and make files handling faster.

C#: Reading Huge CSV File

I'm parsing a 40MB CSV file.
It works nicely right now, and it's rather easy to parse, the only problem I have is performance, which of course is rather slow.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and then stop looping, so if the entry is at the beginning of the file it finishes quickly, but if it's at the end it takes a while.
I could balance this by giving it a random start line, but the algorithm would still be O(n)... So I'm not sure if it's really worth it.
Is there a way I can improve my sequential parsing algorithm?

First: "Reading Huge CSV File" and "So I'm parsing a 40MB CSV file.". Ihave space delimited files here of 10+ GIGAbyte - what would you call those?
Also: the size of the file is irrelevant, you process them normally anyway line by line.
the only problem I have is performance, which of course is rather slow
Define. What do you think is slow? Parsing them is quite fast when done properly.
I'd like to know if there is a way I can improve this, as I only need to find by key I find and
then stop looping, so if the entry is at the beggining of the file it
finishes quickly, but if it's at the end it takes a while.
Do NOT use a CSV file? More than 60 years ago people invented databases for this.
Is there a way I can improve my secuential parsing algorithm?
YOu mean except pulling the parsing into a separate thread, and using an efficient code (which you may not have - noone knows).
Theoretically you could:
Read on one thread, with a decent buffer (less IO = faster)
Move field split into thread 2 (optional)
Use tasks to parse the fields (one per field per line) so you use all processors).
I am currently processing some (around 10.000) files (with sizes in double digit gigabte sadly) and... I go this way (Have to process them in a specific order) to use my computer fully.
That should give you a lot - and seriously, a 40mb file should load in 0.x seconds (0.5 - 0.6).
STILL that is very inefficient. Any reason you do not load the file into a database like all people do? CSV is good as some transport format, it sucks as a database.

Why don't you convert your csv to a normal database. Even sqlexpress will be fine.

Of course.
Say you order it alphabetically.
Then, start in the middle.
Each iteration, move to the middle of the top or bottom; whichever has the appropriate key.
This algorithm has O(log n).
This is called a "binary search," and is what "Mike Christianson" suggests in his comment.

Will suggest you to break one 40Mb File into smaller size few files.
And using Parallel.ForEach you could improve file processing performace

You can load the CSV into DataTable and use available operations that could be faster than looping through
Loading it to database and perform your operation on that is another option

This, I believe, is the fastest way to read a CSV file sequentially. There may be other ways to extract data from CSV, but if you limited to this approach, then this solution might work for you.
const int BUFFER_SIZE = 0x8000; //represents 32768 bytes
public unsafe void parseCSV(string filePath)
{
byte[] buffer = new byte[BUFFER_SIZE];
int workingSize = 0; //store how many bytes left in buffer
int bufferSize = 0; //how many bytes were read by the file stream
StringBuilder builder = new StringBuilder();
char cByte; //character representation of byte
using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
do
{
bufferSize = fs.Read(buffer, 0, BUFFER_SIZE);
workingSize = bufferSize;
fixed (byte* bufferPtr = buffer)
{
byte* workingBufferPtr = bufferptr;
while (workingSize-- > 0)
{
switch (cByte = (char)*workingBufferPtr++)
{
case '\n':
break;
case '\r':
case ',':
builder.ToString();
builder.Clear();
break;
default:
builder.Append(cByte);
break;
}
}
}
} while (bufferSize != 0);
}
}
Explanation:
Reading the file into a byte buffer. This will be done using the basic Filestream class, which gives access to the always fast Read()
Unsafe code. While I generally recommend not using unsafe code, when traversing any kind of buffer, using pointers can bring a speedup.
StringBuilder since we will be concatenating bytes into workable strings to test againt the key. StringBuilder is by far the fastest way to append bytes together and get a workable string out them.
Note that this method fairly complaint with RFC 4180, but if you deal with quotes, you can easily modify the code I posted to handle trimming.

Sorting Binary File Index Based

I have a binary file which can be seen as a concatenation of different sub-file:
INPUT FILE:
Hex Offset ID SortIndex
0000000 SubFile#1 3
0000AAA SubFile#2 1
0000BBB SubFile#3 2
...
FFFFFFF SubFile#N N
These are the information i have about each SubFile:
Starting Offset
Lenght in bytes
Final sequence Order
What's the fastest way to produce a Sorted Output File in your opinion ?
For instance OUTPUT FILE will contain the SubFile in the following order:
SubFile#2
SubFile#3
SubFile#1
...
I have thought about:
Split the Input File extracting each Subfile to disk, then
concatenate them in the correct order
Using FileSeek to move around the file and adding each SubFile to a BinaryWriter Stream.
Consider the following information also:
Input file can be really huge (200MB~1GB)
For those who knows, i am speaking about IBM AFP Files.
Both my solution are easy to implement, but looks really not performing in my opinion.
Thanks in advance

Also if file is big the number of IDs is not so huge.
You can just get all you IDs,sortindex,offset,length in RAM, then sort in RAM with a simple quicksort, when you finish, you rewrite the entire file in the order you have in your sorted array.
I expect this to be faster than other methods.
So... let's make some pseudocode.
public struct FileItem : IComparable<FileItem>
{
public String Id;
public int SortIndex;
public uint Offset;
public uint Length;
public int CompareTo(FileItem other) { return this.SortIndex.CompareTo(other.SortIndex); }
}
public static FileItem[] LoadAndSortFileItems(FILE inputFile)
{
FileItem[] result = // fill the array
Array.Sort(result);
}
public static void WriteFileItems(FileItem[] items, FILE inputfile, FILE outputFile)
{
foreach (FileItem item in items)
{
Copy from inputFile[item.Offset .. item.Length] to outputFile.
}
}
The number of read operations is linear, O(n), but seeking is required.
The only performance problem about seeking is cache miss by hard drive cache.
Modern hard drive have a big cache from 8 to 32 megabytes, seeking a big file in random order means cache miss, but i would not worry too much, because the amount of time spent in copying files, i guess, is greater than the amount of time required by seek.
If you are using a solid state disk instead seeking time is 0 :)
Writing the output file however is O(n) and sequential, and this is a very good thing since you will be totally cache friendly.
You can ensure better time if you preallocate the size of the file before starting to write it.
FileStream myFileStream = ...
myFileStream.SetLength(predictedTotalSizeOfFile);
Sorting FileItem structures in RAM is O(n log n) but also with 100000 items it will be fast and will use a little amount of memory.
The copy is the slowest part, use 256 kilobyte .. 2 megabyte for block copy, to ensure that copying big chunks of file A to file B will be fast, however you can adjust the amount of block copy memory doing some tests, always keeping in mind that every machine is different.
It is not useful to try a multithreaded approach, it will just slow down the copy.
It is obvious, but, if you copy from drive C: to drive D:, for example, it will be faster (of course, not partitions but two different serial ata drives).
Consider also that you need seek, or in reading or in writing, at some point, you will need to seek. Also if you split the original file in several smaller file, you will make the OS seek the smaller files, and this doesn't make sense, it will be messy and slower and probably also more difficult to code.
Consider also that if files are fragmented the OS will seek by itself, and that is out of your control.

The first solution I thought of was to read the input file sequentially and build a Subfile-object for every subfile. These objects will be put into b+tree as soon as they are created. The tree will order the subfiles by their SortIndex. A good b-tree implementation will have linked child nodes which enables you to iterate over the subfiles in the correct order and write them into the output file
another way could be to use random access files. you can load all SortIndexes and offsets. then sort them and write the output file in the sorted way. in this case all depends on how random access files work. in this case all depends on the random access file reader implementation. if it just reads the file until a specified position it would not be very performant.. honestly, I have no idea how they work... :(

Quickly load 350M numbers into a double[] array in C#

I am going to store 350M pre-calculated double numbers in a binary file, and load them into memory as my dll starts up. Is there any built in way to load it up in parallel, or should I split the data into multiple files myself and take care of multiple threads myself too?
Answering the comments: I will be running this dll on powerful enough boxes, most likely only on 64 bit ones. Because all the access to my numbers will be via properties anyway, I can store my numbers in several arrays.
[update]
Everyone, thanks for answering! I'm looking forward to a lot of benchmarking on different boxes.
Regarding the need: I want to speed up a very slow calculation, so I am going to pre-calculate a grid, load it into memory, and then interpolate.

Well I did a small test and I would definitely recommend using Memory Mapped Files.
I Created a File containing 350M double values (2.6 GB as many mentioned before) and then tested the time it takes to map the file to memory and then access any of the elements.
In all my tests in my laptop (Win7, .Net 4.0, Core2 Duo 2.0 GHz, 4GB RAM) it took less than a second to map the file and at that point accessing any of the elements took virtually 0ms (all time is in the validation of the index).
Then I decided to go through all 350M numbers and the whole process took about 3 minutes (paging included) so if in your case you have to iterate they may be another option.
Nevertheless I wrapped the access, just for example purposes there a lot conditions you should check before using this code, and it looks like this
public class Storage<T> : IDisposable, IEnumerable<T> where T : struct
{
MemoryMappedFile mappedFile;
MemoryMappedViewAccessor accesor;
long elementSize;
long numberOfElements;
public Storage(string filePath)
{
if (string.IsNullOrWhiteSpace(filePath))
{
throw new ArgumentNullException();
}
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
FileInfo info = new FileInfo(filePath);
mappedFile = MemoryMappedFile.CreateFromFile(filePath);
accesor = mappedFile.CreateViewAccessor(0, info.Length);
elementSize = Marshal.SizeOf(typeof(T));
numberOfElements = info.Length / elementSize;
}
public long Length
{
get
{
return numberOfElements;
}
}
public T this[long index]
{
get
{
if (index < 0 || index > numberOfElements)
{
throw new ArgumentOutOfRangeException();
}
T value = default(T);
accesor.Read<T>(index * elementSize, out value);
return value;
}
}
public void Dispose()
{
if (accesor != null)
{
accesor.Dispose();
accesor = null;
}
if (mappedFile != null)
{
mappedFile.Dispose();
mappedFile = null;
}
}
public IEnumerator<T> GetEnumerator()
{
T value;
for (int index = 0; index < numberOfElements; index++)
{
value = default(T);
accesor.Read<T>(index * elementSize, out value);
yield return value;
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
T value;
for (int index = 0; index < numberOfElements; index++)
{
value = default(T);
accesor.Read<T>(index * elementSize, out value);
yield return value;
}
}
public static T[] GetArray(string filePath)
{
T[] elements;
int elementSize;
long numberOfElements;
if (string.IsNullOrWhiteSpace(filePath))
{
throw new ArgumentNullException();
}
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
FileInfo info = new FileInfo(filePath);
using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(filePath))
{
using(MemoryMappedViewAccessor accesor = mappedFile.CreateViewAccessor(0, info.Length))
{
elementSize = Marshal.SizeOf(typeof(T));
numberOfElements = info.Length / elementSize;
elements = new T[numberOfElements];
if (numberOfElements > int.MaxValue)
{
//you will need to split the array
}
else
{
accesor.ReadArray<T>(0, elements, 0, (int)numberOfElements);
}
}
}
return elements;
}
}
Here is an example of how you can use the class
Stopwatch watch = Stopwatch.StartNew();
using (Storage<double> helper = new Storage<double>("Storage.bin"))
{
Console.WriteLine("Initialization Time: {0}", watch.ElapsedMilliseconds);
string item;
long index;
Console.Write("Item to show: ");
while (!string.IsNullOrWhiteSpace((item = Console.ReadLine())))
{
if (long.TryParse(item, out index) && index >= 0 && index < helper.Length)
{
watch.Reset();
watch.Start();
double value = helper[index];
Console.WriteLine("Access Time: {0}", watch.ElapsedMilliseconds);
Console.WriteLine("Item: {0}", value);
}
else
{
Console.Write("Invalid index");
}
Console.Write("Item to show: ");
}
}
UPDATE I added a static method to load all data in a file to an array. Obviously this approach takes more time initially (on my laptop takes between 1 and 2 min) but after that access performance is what you expect from .Net. This method should be useful if you have to access data frequently.
Usage is pretty simple
double[] helper = Storage<double>.GetArray("Storage.bin");
HTH

It sounds extremely unlikely that you'll actually be able to fit this into a contiguous array in memory, so presumably the way in which you parallelize the load depends on the actual data structure.
(Addendum: LukeH pointed out in comments that there is actually a hard 2GB limit on object size in the CLR. This is detailed in this other SO question.)
Assuming you're reading the whole thing from one disk, parallelizing the disk reads is probably a bad idea. If there's any processing you need to do to the numbers as or after you load them, you might want to consider running that in parallel at the same time you're reading from disk.

The first question you have presumably already answered is "does this have to be precalculated?". Is there some algorithm you can use that will make it possible to calculate the required values on demand to avoid this problem? Assuming not...
That is only 2.6GB of data - on a 64 bit processor you'll have no problem with a tiny amount of data like that. But if you're running on a 5 year old computer with a 10 year old OS then it's a non-starter, as that much data will immediately fill the available working set for a 32-bit application.
One approach that would be obvious in C++ would be to use a memory-mapped file. This makes the data appear to your application as if it is in RAM, but the OS actually pages bits of it in only as it is accessed, so very little real RAM is used. I'm not sure if you could do this directly from C#, but you could easily enough do it in C++/CLI and then access it from C#.
Alternatively, assuming the question "do you need all of it in RAM simultaneously" has been answered with "yes", then you can't go for any kind of virtualisation approach, so...
Loading in multiple threads won't help - you are going to be I/O bound, so you'll have n threads waiting for data (and asking the hard drive to seek between the chunks they are reading) rather than one thread waiitng for data (which is being read sequentially, with no seeks). So threads will just cause more seeking and thus may well make it slower. (The only case where splitting the data up might help is if you split it to different physical disks so different chunks of data can be read in parallel - don't do this in software; buy a RAID array)
The only place where multithreading may help is to make the load happen in the background while the rest of your application starts up, and allow the user to start using the portion of the data that is already loaded while the rest of the buffer fills, so the user (hopefully) doesn't have to wait much while the data is loading.
So, you're back to loading the data into one massive array in a single thread...
However, you may be able to speed this up considerably by compressing the data. There are a couple of general approaches woth considering:
If you know something about the data, you may be able to invent an encoding scheme that makes the data smaller (and therefore faster to load). e.g. if the values tend to be close to each other (e.g. imagine the data points that describe a sine wave - the values range from very small to very large, but each value is only ever a small increment from the last) you may be able to represent the 'deltas' in a float without losing the accuracy of the original double values, halving the data size. If there is any symmetry or repetition to the data you may be able to exploit it (e.g. imagine storing all the positions to describe a whole circle, versus storing one quadrant and using a bit of trivial and fast maths to reflect it 4 times - an easy way to quarter the amount of data I/O). Any reduction in data size would give a corresponding reduction in load time. In addition, many of these schemes would allow the data to remain "encoded" in RAM, so you'd use far less RAM but still be able to quickly fetch the data when it was needed.
Alternatively, you can very easily wrap your stream with a generic compression algorithm such as Deflate. This may not work, but usually the cost of decompressing the data on the CPU is less than the I/O time that you save by loading less source data, so the net result is that it loads significantly faster. And of course, save a load of disk space too.

In typical case, loading speed will be limited by speed of storage you're loading data from--i.e. hard drive.
If you want it to be faster, you'll need to use faster storage, f.e. multiple hard drives joined in a RAID scheme.
If your data can be reasonably compressed, do that. Try to find algorithm which will use exactly as much CPU power as you have---less than that and your external storage speed will be limiting factor; more than that and your CPU speed will be limiting factor. If your compression algorithm can use multiple cores, then multithreading can be useful.
If your data are somehow predictable, you might want to come up with custom compression scheme. F.e. if consecutive numbers are close to each other, you might want to store differences between numbers---this might help compression efficiency.
Do you really need double precision? Maybe floats will do the job? Maybe you don't need full range of doubles? For example if you need full 53 bits of mantissa precision, but need only to store numbers between -1.0 and 1.0, you can try to chop few bits per number by not storing exponents in full range.

Making this parallel would be a bad idea unless you're running on a SSD. The limiting factor is going to be the disk IO--and if you run two threads the head is going to be jumping back and forth between the two areas being read. This will slow it down a lot more than any possible speedup from parallelization.
Remember that drives are MECHANICAL devices and insanely slow compared to the processor. If you can do a million instructions in order to avoid a single head seek you will still come out ahead.
Also, once the file is on disk make sure to defrag the disk to ensure it's in one contiguous block.

That does not sound like a good idea to me. 350,000,000 * 8 bytes = 2,800,000,000 bytes. Even if you manage to avoid the OutOfMemoryException the process may be swapping in/out of the page file anyway. You might as well leave the data in the file and load smaller chucks as they are needed. The point is that just because you can allocate that much memory does not mean you should.

With a suitable disk configuration, splitting into multiple files across disks would make sense - and reading each file in a separate thread would then work nicely (if you've some stripyness - RAID whatever :) - then it could make sense to read from a single file with multiple threads).
I think you're on a hiding to nothing attempting this with a single physical disk, though.

Just saw this : .NET 4.0 has support for memory mapped files. That would be a very fast way to do it, and no support required for parallelization etc.

Performance - Python vs. C#/C++/C reading char-by-char

So I have these giant XML files (and by giant, I mean like 1.5GB+) and they don't have CRLFs. I'm trying to run a diff-like program to find the differences between these files.
Since I've yet to find a diff program that won't explode due to memory exhaustion, I've decided the best bet was to add CRLFs after closing tags.
I wrote a python script to read char-by-char and add new-lines after '>'. The problem is I'm running this on a single core PC circa-1995 or something ridiculous, and it's only processing about 20MB/hour when I have both converting at the same time.
Any idea if writing this in C#/C/C++ instead will yield any benefits? If not, does anyone know of a diff program that will go byte-by-byte? Thanks.
EDIT:
Here's the code for my processing function...
def read_and_format(inputfile, outputfile):
''' Open input and output files, then read char-by-char and add new lines after ">" '''
infile = codecs.open(inputfile,"r","utf-8")
outfile = codecs.open(outputfile,"w","utf-8")
char = infile.read(1)
while(1):
if char == "":
break
else:
outfile.write(char)
if(char == ">"):
outfile.write("\n")
char = infile.read(1)
infile.close()
outfile.close()
EDIT2:
Thanks for the awesome responses. Increaseing the read size created an unbelievable speed increase. Problem solved.

Reading and writing a single character at a time is almost always going to be slow, because disks are block-based devices, rather than character-based devices - it will read a lot more than just the one byte you're after, and the surplus parts need to be discarded.
Try reading and writing more at a time, say, 8192 bytes (8KB) and then finding and adding newlines in that string before writing it out - you should save a lot in performance because a lot less I/O is required.
As LBushkin points out, your I/O library may be doing buffering, but unless there is some form of documentation that shows this does indeed happen (for reading AND writing), it's a fairly easy thing to try before rewriting in a different language.

Why don't you just use sed?
cat giant.xml | sed 's/>/>\x0a\x0d/g' > giant-with-linebreaks.xml

Rather than reading byte by byte, which incurs a disk access for each byte read, try reading ~20 MB at a time and doing your search + replace on that :)
You can probably do this in Notepad....
Billy3

For the type of problem you describe, I suspect the algorithm you employ for comparing the data will have a much more significant effect than the I/O model or language. In fact, string allocation and search may be more expensive here than anything else.
Some general suggestions before you write this yourself:
Try running on a faster machine if you have one available. That will make a huge difference.
Look for an existing tool online for doing XML diffs ... don't write one yourself.
If are are going to write this in C# (or Java or C/C++), I would do the following:
Read a fairly large block into memory all at once (let's say between 200k and 1M)
Allocate an empty block that's twice that size (this assumes a worst case of every character is a '>')
Copy from the input block to the output block conditionally appending a CRLF after each '>' character.
Write the new block out to disk.
Repeat until all the data has been processed.
Additionally, you could also write such a program to run on multiple threads, so that while once thread is perform CRLF insertions in memory, a separate thread is read blocks in from disk. This type of parallelization is complicated ... so I would only do so if you really need maximum performance.
Here's a really simple C# program to get you started, if you need it. It accepts an input file path and an output path on the command line, and performs the substitution you are looking for ('>' ==> CRLF). This sample leaves much to be improved (parallel processing, streaming, some validation, etc)... but it should be a decent start.
using System;
using System.IO;
namespace ExpandBrackets
{
class Program
{
static void Main(string[] args)
{
if (args.Length == 2)
{
using( StreamReader input = new StreamReader( args[0] ) )
using( StreamWriter output = new StreamWriter( args[1] ) )
{
int readSize = 0;
int blockSize = 100000;
char[] inBuffer = new char[blockSize];
char[] outBuffer = new char[blockSize*3];
while( ( readSize = input.ReadBlock( inBuffer, 0, blockSize ) ) > 0 )
{
int writeSize = TransformBlock( inBuffer, outBuffer, readSize );
output.Write( outBuffer, 0, writeSize );
}
}
}
else
{
Console.WriteLine( "Usage: repchar {inputfile} {outputfile}" );
}
}
private static int TransformBlock( char[] inBuffer, char[] outBuffer, int size )
{
int j = 0;
for( int i = 0; i < size; i++ )
{
outBuffer[j++] = inBuffer[i];
if (inBuffer[i] == '>') // append CR LF
{
outBuffer[j++] = '\r';
outBuffer[j++] = '\n';
}
}
return j;
}
}
}

All of the languages mentioned typically, at some point, revert to the C runtime library for byte by byte file access. Writing this in C will probably be the fastest option.
However, I doubt it will provide a huge speed boost. Python is fairly speedy, if you're doing things correctly.
The main way to really get a big speed improvement would be to introduce threading. If you read the data in from the file in a large block in one thread, and had a separate thread that did your newline processing + diff processing, you could dramatically improve the speed of this algorithm. This would probably be easier to implement in C++, C#, or IronPython than in C or CPython directly, since they provide very easy, high-level synchronization tools for handling the threading issues (especially when using .NET).

you could try xmldiff - http://msdn.microsoft.com/en-us/library/aa302294.aspx
I haven't used it for such huge data but I think it would be reasonably optimized

I put this as a comment on another answer, but in case you miss it--you might want to look at The Shootout. It's a highly optimized set of code for various problems in many languages.
According to those results, Python tends to be about 50x slower than c (but it is faster than the other interpreted languages). In comparison Java is about 2x slower than c. If you went to one of the faster compiled languages, I don't see why you wouldn't see a similar increase.
By the way, the figures attained from the shootout are wonderfully un-assailable, you can't really challenge them, instead if you don't believe the numbers are fair because the code to solve a problem in your favorite language isn't optimized properly, then you can submit better code yourself. The act of many people doing this means most of the code on there is pretty damn optimized for every popular language. If you show them a more optimized compiler or interpreter, they may include the results from it as well.
Oh: except C#, that's only represented by MONO so if Microsoft's compiler is more optimized, it's not shown. All the tests seem to run on Linux machines. My guess is Microsoft's C# should run at about the same speed as Java, but the shootout lists mono as a bit slower (about 3x as slow as C)..

As others said, if you do it in C it will be pretty much unbeatable, because C buffers I/O, and getc() is inlined (in my memory).
Your real performance issue will be in the diff.
Maybe there's a pretty good one out there, but for those size files I doubt it. For fun, I'm a do-it-yourselfer. The strategy I would use is to have a rolling window in each file, several megabytes long. The search strategy for mismatches is diagonal search, which is if you are at lines i and j, compare in this sequence:
line(i+0) == line(j+0)
line(i+0) == line(j+1)
line(i+1) == line(j+0)
line(i+0) == line(j+2)
line(i+1) == line(j+1)
line(i+2) == line(j+0)
and so on. No doubt there's a better way, but if I'm going to code it myself and manage the rolling windows, that's what I'd try.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.