C# - Sort File.ReadLines IEnumerable without Memory Overhead?

Is this possible?
I have the following code to reduce the total amount of memory usage:
File.WriteAllLines(
    Path.Combine(Path.GetDirectoryName(file[0]), "(Sort A-Z) " + Path.GetFileName(file[0])),
    File.ReadLines(file[0]).OrderBy(s => s)
);
(file[0] is the input file path).
This avoids the foreach loops etc., reducing CPU usage as well as memory usage (barely).
It's also faster than using a foreach.
The issue, however, is that .OrderBy(s => s) causes the whole file to be loaded into memory. It's not as bad as loading the whole thing up front, but memory usage still rises quite a bit (I'm using an 80 MB file).
Is there some way to order the IEnumerable A->Z when saving to a file without using much memory?
I know this sounds vague and that I'm unsure of what I'm looking for, because I don't know myself.
Running with .OrderBy(s=>s) on a 2.7 million line file:
https://i.imgur.com/rUyDeFJ.gifv
Running WITHOUT .OrderBy(s=>s) on a 2.7 million line file:
https://i.imgur.com/Ejbnuty.gifv
(You can see it finish)

It is necessary for .OrderBy to load the entire contents into memory. It would be impossible for it to work any other way.
OrderBy receives an IEnumerable, so it receives items one at a time. However, consider the scenario where the very last row needs to be sorted before the very first row. This could only be achieved if the last row and first row were both in memory at the same time. Or consider the scenario where the entire input is already sorted in reverse order. Hopefully these examples show why it is necessary for OrderBy to load the entire contents into memory.
Algorithms exist to partition data sets into individual partitions on disk and then merge those partitions. However, they are beyond the scope of the LINQ OrderBy function (a rough sketch of such an approach appears at the end of this answer).
Internally OrderBy reads everything into a buffer array then performs a quicksort over it. If you're feeling brave, refer to the reference source:
https://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,2530
(It's scattered throughout this file, but lines 2534-2542 best illustrate this)
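If staying inside a small memory budget matters more than raw speed, the partition-and-merge idea mentioned above (an external merge sort) can be hand-rolled on top of File.ReadLines. A minimal sketch, not the LINQ implementation - the chunk size, temp-file handling and the ordinal comparer are assumptions to tune:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    // Splits the input into sorted chunks on disk, then merges the chunks
    // line by line. Only one chunk (plus one line per chunk during the merge)
    // is held in memory at any time.
    public static void SortFile(string inputPath, string outputPath, int linesPerChunk = 100000)
    {
        var chunkFiles = new List<string>();

        // Phase 1: read the input in chunks, sort each chunk, spill it to a temp file.
        foreach (var chunk in ReadChunks(inputPath, linesPerChunk))
        {
            chunk.Sort(StringComparer.Ordinal);
            string chunkPath = Path.GetTempFileName();
            File.WriteAllLines(chunkPath, chunk);
            chunkFiles.Add(chunkPath);
        }

        // Phase 2: k-way merge of the sorted chunks into the output file.
        var readers = chunkFiles.Select(p => new StreamReader(p)).ToList();
        try
        {
            using (var writer = new StreamWriter(outputPath))
            {
                var current = readers.Select(r => r.ReadLine()).ToList(); // null = chunk exhausted
                while (true)
                {
                    int min = -1;
                    for (int i = 0; i < current.Count; i++)
                    {
                        if (current[i] == null) continue;
                        if (min == -1 || StringComparer.Ordinal.Compare(current[i], current[min]) < 0)
                            min = i;
                    }
                    if (min == -1) break; // all chunks exhausted
                    writer.WriteLine(current[min]);
                    current[min] = readers[min].ReadLine();
                }
            }
        }
        finally
        {
            foreach (var r in readers) r.Dispose();
            foreach (var p in chunkFiles) File.Delete(p);
        }
    }

    private static IEnumerable<List<string>> ReadChunks(string path, int linesPerChunk)
    {
        var chunk = new List<string>(linesPerChunk);
        foreach (var line in File.ReadLines(path))
        {
            chunk.Add(line);
            if (chunk.Count == linesPerChunk)
            {
                yield return chunk;
                chunk = new List<string>(linesPerChunk);
            }
        }
        if (chunk.Count > 0)
            yield return chunk;
    }
}

Note the sketch orders with StringComparer.Ordinal; OrderBy(s => s) uses the culture-aware default comparer, so pass the same comparer to both the chunk sort and the merge if the output has to match.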

Related

Fastest / most efficient way to work with very large Lists (in C#)

This question is related to the discussion here but can also be treated as standalone.
Also, I think it would be nice to have the relevant results in one separate thread. I couldn't find anything on the web that dealt with the topic comprehensively.
Let's say we need to work with a very, very large List<double> of data (e.g. 2 billion entries).
Having such a large list loaded into memory either leads to a System.OutOfMemoryException, or, if one uses <gcAllowVeryLargeObjects enabled="true" />, eventually just eats up the entire memory.
First: How to minimize the size of the container:
Would an array of 2 billion doubles be less expensive?
Use decimals instead of doubles?
Is there a data type even less expensive than decimal in C#? (-100 to 100 in 0.00001 steps would do the job for me; see the sketch below.)
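Worth noting on the data-type question: a decimal is 16 bytes per value, twice a double, so it would make things worse rather than better. A range of -100 to 100 in 0.00001 steps is only about 20 million distinct values, which fits in an Int32 and halves the footprint relative to double. A rough sketch of the scaled-integer idea (the scale factor follows from the step size above; names are illustrative):

using System;

static class FixedPoint
{
    // 1 / 0.00001 - one unit of the stored int equals one 0.00001 step.
    const double Scale = 100000.0;

    public static int Encode(double value)
    {
        // -100..100 maps to -10,000,000..10,000,000, well inside the Int32 range.
        return checked((int)Math.Round(value * Scale));
    }

    public static double Decode(int stored)
    {
        return stored / Scale;
    }
}

With this, 2 billion entries stored as int[] take roughly 8 GB instead of roughly 16 GB as double[].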
Second: Saving the data to disk and reading it
What would be the fastest way to save the List to disk and read it again?
Here I think the size of the file could be a problem. A txt file containing 2 billion entries will be huge - opening it could turn out to take years. (Perhaps some type of stream between the program and the txt file would do the job?) - some sample code would be much appreciated :)
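A minimal sketch of the stream idea for saving, assuming raw binary rather than text (2 billion doubles are simply 16 GB of bytes written sequentially; the buffer size and path handling are illustrative):

using System.Collections.Generic;
using System.IO;

static class DoubleFile
{
    // Writes the values sequentially as raw 8-byte doubles; nothing but the
    // current value and the stream buffer is held in memory.
    public static void Save(string path, IEnumerable<double> values)
    {
        using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 1 << 16))
        using (var writer = new BinaryWriter(stream))
        {
            foreach (var value in values)
                writer.Write(value);
        }
    }
}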
Third: Iteration
If the List is in memory, I think using yield would make sense.
If the List is saved to disk, the speed of iteration is mostly limited by the speed at which we can read it from the disk.
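Reading back can use the same streaming approach with yield, so iterating over the on-disk data never materializes the whole list; a sketch matching the binary layout assumed above:

using System.Collections.Generic;
using System.IO;

static class DoubleFileReader
{
    // Streams the doubles back one at a time; the caller iterates with foreach.
    public static IEnumerable<double> Read(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 16))
        using (var reader = new BinaryReader(stream))
        {
            long count = stream.Length / sizeof(double);
            for (long i = 0; i < count; i++)
                yield return reader.ReadDouble();
        }
    }
}

Usage: foreach (var v in DoubleFileReader.Read("data.bin")) { /* consume v */ }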

Process very large XML file

I need to process an XML file with the following structure:
<FolderSizes>
<Version></Version>
<DateTime Un=""></DateTime>
<Summary>
<TotalSize Bytes=""></TotalSize>
<TotalAllocated Bytes=""></TotalAllocated>
<TotalAvgFileSize Bytes=""></TotalAvgFileSize>
<TotalFolders Un=""></TotalFolders>
<TotalFiles Un=""></TotalFiles>
</Summary>
<DiskSpaceInfo>
<Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
UsedBytes=""><![CDATA[ ]]></Drive>
</DiskSpaceInfo>
<Folder ScanState="">
<FullPath Name=""><![CDATA[ ]]></FullPath>
<Attribs Int=""></Attribs>
<Size Bytes=""></Size>
<Allocated Bytes=""></Allocated>
<AvgFileSz Bytes=""></AvgFileSz>
<Folders Un=""></Folders>
<Files Un=""></Files>
<Depth Un=""></Depth>
<Created Un=""></Created>
<Accessed Un=""></Accessed>
<LastMod Un=""></LastMod>
<CreatedCalc Un=""></CreatedCalc>
<AccessedCalc Un=""></AccessedCalc>
<LastModCalc Un=""></LastModCalc>
<Perc><![CDATA[ ]]></Perc>
<Owner><![CDATA[ ]]></Owner>
<!-- Special element; see paragraph below -->
<Folder></Folder>
</Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big at a whopping 11GB so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.
What I would like to do is to import the information into a SQL database because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway, and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB memory. The application ended up using 24GB memory, and taking very, very long, so I just cancelled it.
Attempt to process the file using an XmlReader object
This approach worked better in that it didn't use as much memory, but I still had a few problems:
It was taking really long because I was reading the file one line at a time.
Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... Sounds prone to problems and errors.
Did I mention it takes really long to read the file one line at a time - and that's still without actually processing the line, literally just reading it?
Import the information using SQL Server
I created a stored procedure using XQuery and ran it recursively within itself, processing the <Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.
Here are more things I think I should try:
Re-write my C# application to use unmanaged code
I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged might be a solution.
Rather go for C++ instead of C#
Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and then probably not very efficiently yet.
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file I run through it and check the size in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:
That's about 1500 lines per second; if the average line length is about 50 characters, that's 50 bytes per line, that's 75 kilobytes per second, for an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping each line. It's not actually processing the line or doing anything with it, so when that starts, the processing rate drops significantly.
This is the method that runs during the size calculation:
private int _totalLines = 0;
private bool _cancel = false; // set to true when the cancel button is clicked

private void CalculateFileSize()
{
    xmlStream = new StreamReader(_filePath);
    xmlReader = new XmlTextReader(xmlStream);

    while (xmlReader.Read())
    {
        if (_cancel)
            return;

        if (xmlReader.LineNumber > _totalLines)
            _totalLines = xmlReader.LineNumber;

        InterThreadHelper.ChangeText(
            lblLinesRemaining,
            string.Format("{0} lines", _totalLines));

        string elapsed = string.Format(
            "{0}:{1}:{2}:{3}",
            timer.Elapsed.Days.ToString().PadLeft(2, '0'),
            timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
            timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
            timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));
        InterThreadHelper.ChangeText(lblElapsed, elapsed);

        if (_cancel)
            return;
    }

    xmlStream.Dispose();
}
Still running, 27 minutes in :(
You can read an XML file as a logical stream of elements instead of trying to read it line by line and piece it back together yourself; see the code sample at the end of this article.
Also, your question has already been asked here.
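A hedged sketch of that element-stream approach: XmlReader keeps only the current node in memory, and each <Folder> subtree can be materialized on its own as a small XElement (element names come from the structure posted above; HandleFolder is a placeholder for whatever processing or SQL import is needed):

using System.Xml;
using System.Xml.Linq;

static class FolderSizesImporter
{
    public static void Import(string path)
    {
        using (var reader = XmlReader.Create(path))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Folder")
                {
                    // ReadFrom consumes the whole <Folder> element (including any
                    // nested <Folder> children) and leaves the reader positioned
                    // after it, so no extra Read() is needed here.
                    var folder = (XElement)XNode.ReadFrom(reader);
                    HandleFolder(folder);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void HandleFolder(XElement folder)
    {
        // Placeholder: pull out the needed values and, for example, batch them
        // into a SQL bulk insert.
        string fullPath = (string)folder.Element("FullPath");
        string sizeBytes = (string)folder.Element("Size").Attribute("Bytes");
    }
}

Note that this still buffers one top-level <Folder> subtree at a time, so a single extremely large folder tree would need to be split further (e.g. by handling each nesting level separately).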

data structure for indexing big file

I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to the file (get the nth line, get the nth word in the nth line). I've decided to use a List<List<long>> map, where the map[i][j] element is the position of the jth word of the ith line in the file.
I will build the index sequentially, i.e. read the whole file, populating the index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve a specific word position with map[i][j].
The only problem I see is that I can't predict the total count of lines/words, so I will bump into O(n) reallocations as the Lists grow; I have no idea how to avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: The file will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.
Increasing the size of a large list is a very expensive operation, so it's better to reserve the list's size at the beginning.
I'd suggest using two lists. The first contains the offsets of words within the file, and the second contains indexes into the first list (the index of the first word of the appropriate line).
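A rough sketch of that two-list index (names are illustrative): wordPositions stores the byte offset of every word in file order, and lineStarts[i] is the index into wordPositions where line i begins, so word j of line i lives at wordPositions[lineStarts[i] + j]:

using System.Collections.Generic;

class WordIndex
{
    // Byte offset in the file of every word, in the order they are read.
    public readonly List<long> WordPositions = new List<long>();

    // For each line, the index into WordPositions of that line's first word.
    public readonly List<int> LineStarts = new List<int>();

    public void BeginLine()            { LineStarts.Add(WordPositions.Count); }
    public void AddWord(long position) { WordPositions.Add(position); }

    // Byte offset of word j (0-based) on line i (0-based).
    public long Position(int line, int word)
    {
        return WordPositions[LineStarts[line] + word];
    }
}

This also removes the per-line List<long> object overhead of the List<List<long>> layout, since both lists are flat.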
You are very likely to exceed all available RAM. And when the system starts to page GC-managed memory in and out, the performance of the program will be completely killed. I'd suggest storing your data in a memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD: Memory-mapped files are effective when you need to work with huge amounts of data that don't fit in RAM. Basically, it's your only choice if your index becomes bigger than the available RAM.
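A minimal sketch of keeping such an index in a memory-mapped file (the file name and fixed entry count are assumptions; each entry is an 8-byte offset):

using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedIndex
{
    // The index lives on disk and is paged in by the OS on demand, so it can
    // grow past the size of physical RAM without exhausting the GC heap.
    public static void Demo(string indexPath, long entryCount)
    {
        long capacity = entryCount * sizeof(long);
        using (var mmf = MemoryMappedFile.CreateFromFile(indexPath, FileMode.OpenOrCreate, null, capacity))
        using (var accessor = mmf.CreateViewAccessor(0, capacity))
        {
            accessor.Write(0 * sizeof(long), 12345L);              // store the offset of entry 0
            long readBack = accessor.ReadInt64(0 * sizeof(long));  // read it back
        }
    }
}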

Removing redundant data from large file

I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save the file out as a new file. I had first thought of reading data into a HashSet and then saving the contents of the hashset out, however I get an "OutOfMemory" exception when attempting to do this (on the line that adds the string to the hashset).
There are around 32,000,000 lines in the file. It's not practical to re-read the entire file for each comparison.
Any ideas? My other thought was to output the entire contents into a SQLite database and select DISTINCT values, but I'm not sure that'd work either with that many values.
Thanks for any input!
The first thing you need to think about is: is high memory consumption a problem?
If your application will always run on a server with a lot of RAM available, or in any other case where you know you'll have enough memory, you can do a lot of things you can't do if your application will run in a low-memory environment, or in an unknown environment. If memory isn't the problem, then make sure your application is running as a 64-bit application (on a 64-bit OS, of course), otherwise you'll be limited to 2GB of memory (4GB if you use the LARGEADDRESSAWARE flag). I guess that in this case this is your problem, and all you've got to do is change it - and it'll work great (assuming you have enough memory).
If memory is a problem and you must not use too much of it, you can, as you suggested, add all the data to a database (I'm more familiar with databases like SQL Server, but I guess SQLite will do), make sure you have the right index on the column, and then select the distinct values.
Another option is to read the file as a stream, line by line; for each line, calculate a hash, save the line into the other file, and keep the hash in memory. If the hash already exists, move on to the next line (and, if you wish, increment a counter of the number of lines removed). In that case, you'll keep less data in memory (only a hash per non-duplicated line).
Best of luck.
Have you tried using an array to initialize the HashSet? I assume that HashSet's doubling algorithm is the reason for the OutOfMemoryException.
var uniqueLines = new HashSet<string>(File.ReadAllLines(@"C:\Temp\BigFile.log"));
Edit:
I am testing the result of the .Add() method to see if it
returns false to count the number of items that are redundant. I'd
like to keep this feature if possible.
Then you should try to initialize the HashSet with the correct (maximum) size, i.e. the file's line count:
int lineCount = File.ReadLines(path).Count();
List<string> fooList = new List<String>(lineCount);
var uniqueLines = new HashSet<string>(fooList);
fooList.Clear();
foreach (var line in File.ReadLines(path))
    uniqueLines.Add(line);
I took a similar approach to Tim, using a HashSet. I did add manual line counting and comparison.
I read the setup log from my Windows 8 install, which was 58MB in size at 312,248 lines, and ran it in LINQPad in 0.993 seconds.
var temp = new List<string>(10000);
var uniqueHash = new HashSet<int>();
int lineCount = 0;
int uniqueLineCount = 0;

using (var fs = new FileStream(@"C:\windows\panther\setupact.log", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs, true))
{
    while (!sr.EndOfStream)
    {
        lineCount++;
        var line = sr.ReadLine();
        var key = line.GetHashCode();
        if (!uniqueHash.Contains(key))
        {
            uniqueHash.Add(key);
            temp.Add(line);
            uniqueLineCount++;
            if (temp.Count > 10000)
            {
                File.AppendAllLines(@"c:\temp\output.txt", temp);
                temp.Clear();
            }
        }
    }
}

// flush whatever is still buffered after the loop ends
if (temp.Count > 0)
    File.AppendAllLines(@"c:\temp\output.txt", temp);

Console.WriteLine("Total Lines:" + lineCount.ToString());
Console.WriteLine("Lines Removed:" + (lineCount - uniqueLineCount).ToString());

Methodology for saving values over time

I have a task, which I know how to code (in C#),
but I know a simple implementation will not meet ALL my needs.
So, I am looking for tricks which might meet ALL my needs.
I am writing a simulation involving N number of entities interacting over time.
N will start at around 30 and move up into many thousands.
a. The number of entities will change during the course of the simulation.
b. I expect this will require each entity to have its own trace file.
Each entity has a minimum of 20 parameters, up to millions, which I want to track over time.
a. This will most likely require that we can't keep all values in memory at all times. Some subset should be fine.
b. The number of parameters per entity will initially be fixed, but I can think of some tests which would have the number of parameters slowly changing over time.
Simulation will last for millions of time steps and I need to keep every value for every parameter.
What I will be using these traces for:
a. Plotting a subset (configurable) of the parameters for a fixed amount of time from the current time step to the past.
i. Normally on the order of 300 time steps.
ii. These plots are in real time while the simulation is running.
b. I will be using these traces to re-play the simulation, so I need to quickly access all the parameters at a given time step so I can quickly move to different times in the simulation.
i. This requires the values be stored in a file(s) which can be inspected/loaded after restarting the software.
ii. Using a database is NOT an option.
c. I will be using the parameters for follow up analysis which I can’t define up front so a more flexible system is desirable.
My initial thought:
One class per entity which holds all the parameters.
Backed by a memory mapped file.
Only a fixed, but moving, amount of the file is mapped to main memory
A second memory mapped file which holds time indexes to main file for quicker access during re-playing of simulation. This may be very important because each entity file will represent a different time slice of the full simulation.
I would start with SQLite. SQLite is like a binary format library that you can query conveniently and quickly. It is not really like a database, in that you can really run it on any machine, with no installation whatsoever.
I strongly recommend against XML, given the requirement of millions of steps, potentially with millions parameters.
EDIT: Given the sheer amount of data involved, SQLite may well end up being too slow for you. Don't get me wrong, SQLite is very fast, but it won't beat seeks & reads, and it looks like your use case is such that basic binary IO is rather appropriate.
If you go with the binary IO method you should expect some moderately involved coding, and the absence of such niceties as your file staying in a consistent state if the application dies halfway through (unless you code this specifically that is).
KISS -- just write a logfile for each entity and at each time slice write out every parameter in a specified order (so you don't double the size of the logfile by adding parameter names). You can have a header in each logfile if you want to specify the parameter names of each column and the identify of the entity.
If there are many parameter values that will remain fixed or slowly changing during the course of the simulation, you can write these out to another file that encodes only changes to parameter values rather than every value at each time slice.
You should probably synchronize the logging so that each log entry is written out with the same time value. Rather than coordinate through a central file, just make the first value in each line of the file the time value.
Forget about a database - too slow and too much overhead for simulation-replay purposes. For replaying a simulation, you simply need sequential access to each time slice, which is most efficiently and most quickly implemented by reading in the lines of the files one by one.
For the same reason - speed and space efficiency - forget XML.
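A minimal sketch of that per-entity logfile idea (the file naming, tab separation and fixed parameter order are assumptions):

using System.Globalization;
using System.IO;

class EntityTraceWriter
{
    private readonly StreamWriter _writer;

    public EntityTraceWriter(string directory, int entityId)
    {
        // One append-only trace file per entity, e.g. "entity_00042.log".
        string path = Path.Combine(directory, string.Format("entity_{0:D5}.log", entityId));
        _writer = new StreamWriter(path, true);
    }

    // One line per time slice: the time value first, then every parameter in a
    // fixed, agreed order, so parameter names are never repeated on each line.
    public void WriteSlice(long timeStep, double[] parameters)
    {
        _writer.Write(timeStep);
        foreach (var p in parameters)
        {
            _writer.Write('\t');
            _writer.Write(p.ToString("R", CultureInfo.InvariantCulture));
        }
        _writer.WriteLine();
    }

    public void Close()
    {
        _writer.Dispose();
    }
}

Replaying is then just reading each entity's file forward line by line, and the fixed column order makes an optional header row sufficient to name the parameters.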
Just for the memory part...
1. You can save the data as an XElement (sorry for not knowing much about LINQ), but it holds XML logic.
2. Hold a record counter; after n records, save the XElement to an XML file (data1.xml, ..., dataN.xml).
It can be a perfect log for any parameter you have, with any logic you like:
<run>
  <step id="1">
    <param1 />
    <param2 />
    <param3 />
  </step>
  .
  .
  .
  <step id="N">
    <param1 />
    <param2 />
    <param3 />
  </step>
</run>
This way your memory is kept free and the data remains readily available.
You don't have to think too much about DB issues, and it's pretty amazing what LINQ can do for you... just open the correct XML log file...
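A rough sketch of that chunked-XElement log (the chunk size and file naming are illustrative):

using System.Xml.Linq;

class XmlChunkLogger
{
    private const int StepsPerFile = 10000;     // flush to disk every N steps
    private XElement _run = new XElement("run");
    private int _stepsInCurrentFile = 0;
    private int _fileIndex = 1;

    public void LogStep(int stepId, double param1, double param2, double param3)
    {
        _run.Add(new XElement("step",
            new XAttribute("id", stepId),
            new XElement("param1", param1),
            new XElement("param2", param2),
            new XElement("param3", param3)));

        if (++_stepsInCurrentFile >= StepsPerFile)
            Flush();
    }

    private void Flush()
    {
        _run.Save("data" + _fileIndex + ".xml");  // data1.xml, data2.xml, ...
        _fileIndex++;
        _stepsInCurrentFile = 0;
        _run = new XElement("run");               // start a fresh in-memory chunk
    }
}

Only one chunk's worth of steps is ever in memory, and each dataN.xml file can later be queried on its own with LINQ to XML.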
Here is what I am doing now:

int bw = 0;

private void timer1_Tick(object sender, EventArgs e)
{
    bw = Convert.ToInt32(lblBytesReceived.Text) - bw;
    SqlCommand comnd = new SqlCommand("insert into tablee (bandwidthh, timee) values (" + bw.ToString() + ", @timee)", conn);
    conn.Open();
    comnd.Parameters.Add("@timee", System.Data.SqlDbType.Time).Value = DateTime.Now.TimeOfDay;
    comnd.ExecuteNonQuery();
    conn.Close();
}
