I have a .txt file (more than a million rows, around 1 GB) and a list of strings. I am trying to remove every row from the file that appears in the list of strings and write the result to a new file, but it is taking a very long time.
using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}
How can I enhance the performance of my code?
You may get some speedup by using PLINQ to do the work in parallel. Switching from a list to a hash set will also greatly speed up the Contains() check, and HashSet<string> is thread safe for read-only operations.
private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    // e.g. _hshLineToRemove = new HashSet<string>(_lstLineToRemove);
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines
        .AsParallel()
        .AsOrdered()
        .Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}
If it does not matter that the output file be in the same order as the input file, you can remove the .AsOrdered() call and gain some additional speed.
Beyond this you are really just I/O bound; the only way to make it any faster is to get faster drives to run it on.
The code is particularly slow because the reader and writer never execute in parallel. Each has to wait for the other.
You can almost double the speed of file operations like this by having a reader thread and a writer thread. Put a BlockingCollection between them so you can communicate between the threads and limit how many rows you buffer in memory.
If the computation is really expensive (it isn't in your case), a third thread with another BlockingCollection doing the processing can help too.
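A minimal sketch of that two-thread split, with a bounded BlockingCollection<string> between the reader and the writer. The method name, parameter names, and the 10,000-line capacity are placeholders for illustration, not part of the original code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class FilterPipeline
{
    // Reader thread filters lines into a bounded buffer; writer thread drains it.
    // The 10,000-line capacity is an arbitrary cap on how many rows sit in memory.
    public static void FilterFile(string inputFile, string outputFile, HashSet<string> linesToRemove)
    {
        var buffer = new BlockingCollection<string>(boundedCapacity: 10000);

        var readerTask = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(inputFile))
            {
                if (!linesToRemove.Contains(line))
                    buffer.Add(line); // blocks if the writer falls behind
            }
            buffer.CompleteAdding(); // tell the writer there is no more input
        });

        var writerTask = Task.Run(() =>
        {
            using (var writer = new StreamWriter(outputFile))
            {
                foreach (var line in buffer.GetConsumingEnumerable())
                    writer.WriteLine(line);
            }
        });

        Task.WaitAll(readerTask, writerTask);
    }
}
```

The bounded capacity is what keeps memory use flat: the reader stalls instead of racing ahead of the writer.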
Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.
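As a sketch of what that might look like: a raw FileStream with its internal buffering effectively disabled (bufferSize: 1) and one large caller-side buffer. The 1 MB size and the newline-scanning comment are illustrative assumptions, not part of the original answer:

```csharp
using System.IO;

static class RawReader
{
    // Read a file through a single large binary buffer; returns total bytes read.
    public static long ScanFile(string path)
    {
        var buffer = new byte[1 << 20]; // 1 MB; make as big as is practical
        long total = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 1 /* minimal internal buffer */,
                                       FileOptions.SequentialScan))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // scan buffer[0..read) for newline bytes and process lines here
                total += read;
            }
        }
        return total;
    }
}
```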
Have you considered using AWK?
AWK is a very powerful tool for processing text files; you can find more information about how to filter lines that match certain criteria in Filter text with AWK.
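For example, assuming the lines to remove sit in a file of their own (the file names here are illustrative), the whole job is a one-liner:

```shell
# Print only the lines of input.txt that do not appear in remove.txt
awk 'NR==FNR { skip[$0]; next } !($0 in skip)' remove.txt input.txt > output.txt
```

The NR==FNR block runs only while reading the first file and loads it into an associative array; the second pattern then prints non-matching lines.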
Related
I have data stored in several separate text files that I parse and analyze afterwards.
The size of the data processed differs a lot. It ranges from a few hundred megabytes (or less) to 10+ gigabytes.
I started out with storing the parsed data in a List&lt;DataItem&gt; because I wanted to perform a BinarySearch() during the analysis. However, the program throws an OutOfMemoryException if too much data is parsed. The exact amount the parser can handle depends on the fragmentation of the memory. Sometimes it's just 1.5 GB of the files, and some other time it's 3 GB.
Currently I'm using a List&lt;List&lt;DataItem&gt;&gt; with a limited number of entries per list, because I thought it might change things for the better. There weren't any significant improvements though.
Another way I tried was serializing the parsed data and then deserializing it when needed. The result of that approach was even worse: the whole process took much longer.
I looked into memory mapped files but I don't really know if they could help me because I never used them before. Would they?
So how can I quickly access the data from all the files without the danger of throwing an OutOfMemoryException and find DataItems depending on their attributes?
EDIT: The parser roughly works like this:
void Parse()
{
    LoadFile();
    for (int currentLine = 1; currentLine < MAX_NUMBER_OF_LINES; ++currentLine)
    {
        string line = GetLineOfFile(currentLine);
        string[] tokens = SplitLineIntoTokens(line);
        DataItem data = PutTokensIntoDataItem(tokens);
        try
        {
            _dataItems.Add(data); // _dataItems is a List<DataItem> field
        }
        catch (OutOfMemoryException)
        {
        }
    }
}
void LoadFile()
{
    DirectoryInfo di = new DirectoryInfo(Path);
    FileInfo[] fileList = di.GetFiles();
    foreach (FileInfo fi in fileList)
    {
        //...
        StreamReader file = new StreamReader(fi.FullName);
        //...
        while (!file.EndOfStream)
            strHelp = file.ReadLine();
        //...
    }
}
There is no right answer for this, I believe. The implementation depends on many factors whose pros and cons only you can weigh.
If your primary purpose is to parse large files, and a large number of them, then keeping them in memory irrespective of how much RAM is available should be a secondary option, for various reasons: for example, persistence when an unhandled exception occurs.
Although profiling under initial conditions may encourage you to load everything into memory for manipulation and search, this will soon change as the number of files increases, and before long the people supporting your application will start ditching that approach.
I would do the following:
Read and store each file's content in a document database such as RavenDB, for example.
Perform the parse routine on these documents and store the relevant relations in an RDBMS, if that is a requirement.
Search at will, full-text or otherwise, on either the document DB (raw) or the relational one (your parse output).
By doing this, you are taking advantage of the research the creators of these systems have done into managing memory efficiently with a focus on performance.
I realise that this may not be the answer for you, but for someone who thinks this approach fits better it perhaps is.
If the code in your question is representative of the actual code, it looks like you're reading all of the data from all of the files into memory, and then parsing. That is, you have:
Parse()
LoadFile();
for each line
....
And your LoadFile loads all of the files into memory. Or so it seems. That's very wasteful because you maintain a list of all the un-parsed lines in addition to the objects created when you parse.
You could instead load only one line at a time, parse it, and then discard the unparsed line. For example:
void Parse()
{
    foreach (var line in GetFileLines())
    {
        // parse the line and store the resulting DataItem
    }
}

IEnumerable<string> GetFileLines()
{
    foreach (var fileName in Directory.EnumerateFiles(Path))
    {
        foreach (var line in File.ReadLines(fileName))
        {
            yield return line;
        }
    }
}
That limits the amount of memory you use to hold the file names and, more importantly, the amount of memory occupied by un-parsed lines.
Also, if you have an upper limit to the number of lines that will be in the final data, you can pre-allocate your list so that adding to it doesn't cause a re-allocation. So if you know that your file will contain no more than 100 million lines, you can write:
void Parse()
{
    var dataItems = new List<DataItem>(100000000);
    foreach (var line in GetFileLines())
    {
        var data = TokenizeAndBuildDataItem(line); // split the line and build a DataItem
        dataItems.Add(data);
    }
}
This reduces fragmentation and out of memory errors because the list is pre-allocated to hold the maximum number of lines you expect. If the pre-allocation works, then you know you have enough memory to hold references to the data items you're constructing.
If you still run out of memory, then you'll have to look at the structure of your data items. Perhaps you're storing too much information in them, or there are ways to reduce the amount of memory used to store those items. But you'll need to give us more information about your data structure if you need help reducing its footprint.
You can use:
Data Parallelism (Task Parallel Library)
Write a Simple Parallel.ForEach
I think it will reduce memory exceptions and make file handling faster.
I'm trying to locate a line which contains specific text inside a large text file (18 MB). Currently I'm using StreamReader to open the file and read it line by line, checking whether it contains the search string:
while ((line = reader.ReadLine()) != null)
{
    if (line.Contains("search string"))
    {
        //Do something with line
    }
}
But unfortunately, because the file I'm using has more than 1 million records, this method is slow. What is the quickest way to achieve this?
In general, disk IO of this nature is just going to be slow. There is likely little you can do to improve over your current version in terms of performance, at least not without dramatically changing the format in which you store your data, or your hardware.
However, you could shorten the code and simplify it in terms of maintenance and readability:
var lines = File.ReadLines(filename).Where(l => l.Contains("search string"));
foreach (var line in lines)
{
    // Do something here with line
}
Reading the entire file into memory causes the application to hang and is very slow; do you think there is any other alternative?
If the main goal here is to prevent application hangs, you can do this in the background instead of in a UI thread. If you make your method async, this can become:
while ((line = await reader.ReadLineAsync()) != null)
{
    if (line.Contains("search string"))
    {
        //Do something with line
    }
}
This will likely make the total operation take longer, but not block your UI thread while the file access is occurring.
Get a hard drive with a faster read speed (moving to a solid state drive if you aren't already would likely help a lot).
Store the data across several files each on different physical drives. Search through those drives in parallel.
Use a RAID0 hard drive configuration. (This is sort of a special case of the previous approach.)
Create an index of the lines in the file that you can use to search for specific words. (Creating the index will be a lot more expensive than a single search, and will require a lot of disk space, but it will allow subsequent searches at much faster speeds.)
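A sketch of what building such an index might look like: it maps each whitespace-separated word to the byte offsets of the lines containing it, so a later search can seek straight to the matching lines. The tokenisation and the single-byte-encoding assumption behind the offset arithmetic are illustrative simplifications:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineIndexer
{
    // Map each word to the byte offsets of the lines that contain it.
    // Assumes a single-byte encoding and "\n" line endings for the offset
    // arithmetic; a robust version would track the underlying stream position.
    public static Dictionary<string, List<long>> BuildIndex(string path)
    {
        var index = new Dictionary<string, List<long>>();
        long offset = 0;
        foreach (var line in File.ReadLines(path))
        {
            foreach (var word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!index.TryGetValue(word, out var offsets))
                    index[word] = offsets = new List<long>();
                offsets.Add(offset); // remember where this line starts
            }
            offset += line.Length + 1; // +1 for the '\n'
        }
        return index;
    }
}
```

A search then becomes a dictionary lookup followed by a Seek to each recorded offset, instead of a full scan.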
I have a text file that could potentially have up to 1 million lines in it, and I have code for reading the file one line at a time, but this is taking a lot of time... lots and lots of time. Is there a method in C# to optimize this process and improve the reading? This is the code I'm using.
using (var file = new StreamReader(filePath))
{
    while ((line = file.ReadLine()) != null)
    {
        //do something.
    }
}
Any suggestions on reading these lines in bulk or improving the process?
Thanks.
Thanks for all your comments. The issue had to do with the //do something part, where I was using the SmartXls library to write to Excel, which was causing the bottleneck. I have contacted the developers to address the issue. All the suggested solutions will work in other scenarios.
Well, this code would be simpler. If you're using .NET 4 or later you can use File.ReadLines:
foreach (var line in File.ReadLines(filePath))
{
    // Do something
}
Note that this is not the same as ReadAllLines, as ReadLines returns an IEnumerable<string> which reads lines lazily, instead of reading the whole file in one go.
The effect at execution time will be broadly the same as your original code (it won't improve performance) - this is just simpler to read.
Fundamentally, if you're reading a large file, that can take a long time - but reading just a million lines shouldn't take "lots and lots of time". My guess is that whatever you're doing with the lines takes a long time. You might want to parallelize that, potentially using a producer/consumer queue (e.g. via BlockingCollection) or TPL Dataflow, or just use Parallel LINQ, Parallel.ForEach etc.
You should use a profiler to work out where the time is being spent. If you're reading from a very slow file system, then it's possible that it really is the reading which is taking the time. We don't have enough information to guide you on that, but you should be able to narrow it down yourself.
Try reading the whole file into memory in one go and see if it's faster:
string filePath = "";
string fileData = "";
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
    byte[] data = new byte[fs.Length];
    fs.Seek(0, SeekOrigin.Begin);
    fs.Read(data, 0, (int)fs.Length);
    fileData = System.Text.Encoding.Unicode.GetString(data);
}
You can read more data at once using StreamReader's int ReadBlock(char[] buffer, int index, int count) rather than reading line by line. This avoids reading the entire file at once (File.ReadAllLines) but allows you to process larger chunks in RAM at a time.
To improve performance, consider performing whatever work you are currently doing in your loop on another thread to handle the load:
Parallel.ForEach(File.ReadLines(filePath), (line) =>
{
    // do your business
});
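The ReadBlock approach mentioned above can be sketched like this. The 64K chunk size is an arbitrary choice, and note that a chunk can end mid-line, so a real implementation has to carry the partial line over to the next iteration:

```csharp
using System.IO;

static class ChunkReader
{
    // Read a text file in large character chunks; returns total chars read.
    public static long ReadInChunks(string path)
    {
        var chunk = new char[64 * 1024];
        long total = 0;
        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.ReadBlock(chunk, 0, chunk.Length)) > 0)
            {
                // process chunk[0..read) here; a chunk may end mid-line
                total += read;
            }
        }
        return total;
    }
}
```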
If memory is not an issue, create a buffer of around 1 MB:
using (BufferedStream bs = new BufferedStream(File.OpenRead(path), 1024 * 1024))
{
    byte[] buffer = new byte[1024 * 1024];
    int read;
    while ((read = bs.Read(buffer, 0, buffer.Length)) > 0)
    {
        //play with buffer (only the first `read` bytes are valid)
    }
}
You can also use ReadAllLines(filePath) to load the file into an array of lines, like this:
string[] lines = System.IO.File.ReadAllLines(@"path");
I am developing an application that reads lines from enormous text files (~2.5 GB), manipulates each line to a specific format, and then writes each line to a text file. Once the output text file has been closed, the program "Bulk Inserts" (SQL Server) the data into my database. It works, it's just slow.
I am using StreamReader and StreamWriter.
I'm pretty much stuck with reading one line at a time due to how I have to manipulate the text; however, I think that if I made a collection of lines and wrote out the collection every 1000 lines or so, it would speed things up at least a bit. The problem is (and this could be purely from my ignorance) that I cannot write a string[] using StreamWriter. After exploring StackOverflow and the rest of the internet, I came across File.WriteAllLines, which allows me to write string[]s to file, but I don't think my computer's memory can handle 2.5 GB of data being stored at one time. Also, the file is created, populated, and closed, so I would have to make a ton of smaller files to break down the 2.5 GB text files only to insert them into the database. So I would prefer to stay away from that option.
One hack job that I can think of is making a StringBuilder and using the AppendLine method to add each line to make a gigantic string. Then I could convert that StringBuilder to a string and write it to file.
But enough of my conjecturing. The method I have already implemented works, but I am wondering if anyone can suggest a better way to write chunks of data to a file?
Two things will increase the speed of output using StreamWriter.
First, make sure that the output file is on a different physical disk than the input file. If the input and output are on the same drive, then very often reads have to wait for writes and writes have to wait for reads. The disk can do only one thing at a time. Obviously not every read or write waits, because the StreamReader reads into a buffer and parses lines out of it, and the StreamWriter writes to a buffer and then pushes that to disk when the buffer is full. With the input and output files on separate drives, your reads and writes overlap.
What do I mean they overlap? The operating system will typically read ahead for you, so it can be buffering your file while you're processing. And when you do a write, the OS typically buffers that and writes it to the disk lazily. So there is some limited amount of asynchronous processing going on.
Second thing is to increase your buffer size. The default buffer size for StreamReader and StreamWriter is 4 kilobytes. So every 4K read or written incurs an operating system call. And, quite likely, a disk operation.
If you increase the buffer size to 64K, then you make 16 times fewer OS calls and 16 times fewer disk operations (not strictly true, but close). Going to a 64K buffer can cut more than 25% off your I/O time, and it's dead simple to do:
const int BufferSize = 64 * 1024;
var reader = new StreamReader(inputFilename, Encoding.UTF8, true, BufferSize);
var writer = new StreamWriter(outputFilename, false, Encoding.UTF8, BufferSize);
Those two things will speed your I/O more than anything else you can do. Trying to build buffers in memory using StringBuilder is just unnecessary work that does a bad job of duplicating what you can achieve by increasing the buffer size, and done incorrectly can easily make your program slower.
I would caution against buffer sizes larger than 64 KB. On some systems, you get marginally better results with buffers up to 256 KB, but on others you get dramatically worse performance--to the tune of 50% slower! I've never seen a system perform better with buffers larger than 256 KB than they do with buffers of 64 KB. In my experience, 64 KB is the sweet spot.
One other thing you can do is use three threads: a reader, a processor, and a writer. They communicate with queues. This can reduce your total time from (input-time + process-time + output-time) to something very close to max(input-time, process-time, output-time). And with .NET, it's really easy to set up. See my blog posts: Simple multithreading, Part 1 and Simple multithreading, Part 2.
According to the docs, StreamWriter does not automatically flush after every write by default, so it is buffered.
You could also use some of the lazy methods on the File class like so:
File.WriteAllLines("output.txt",
File.ReadLines("filename.txt").Select(ProcessLine));
where ProcessLine is declared like so:
private string ProcessLine(string input)
{
    string result = /* do some calculation on input */;
    return result;
}
Since ReadLines is lazy and WriteAllLines has a lazy overload, it will stream the file rather than attempting to read the whole thing.
What about building strings to write?
Something like
int cnt = 0;
StringBuilder s = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null)
{
    cnt++;
    string x = ManipulateLine(line); // your per-line transformation
    s.AppendLine(x);
    if (cnt % 10000 == 0)
    {
        writer.Write(s);
        s = new StringBuilder();
    }
}
writer.Write(s); // write out the final partial batch
Edited because the comment below is right; I should have used StringBuilder.
I have a list of large text files to process. I wonder which is the fastest method, because reading line by line is slow.
I have something like that:
int cnt = this.listView1.Items.Count;
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    FileStream fs = new FileStream(this.listView1.Items[i].Text, FileMode.Open, FileAccess.Read);
    using (StreamReader reader = new StreamReader(fs))
    {
        while (reader.Peek() != -1)
        {
            //code part
        }
    }
}
I read that using blocks (like 100k lines each) via BackgroundWorkers with multiple threads would help, but I don't know how to implement it. Or if you have better ideas to improve the performance... your expert advice would be appreciated.
First you need to decide what your bottleneck is: I/O (reading the files) or CPU (processing them). If it's I/O, reading multiple files concurrently is not going to help you much; the most you can achieve is to have one thread read the files and another process them. The processing thread will be done before the next file is available.
I agree with @asawyer: if it's only 100 MB, you should read the file entirely into memory in one swoop. You might as well read 5 of them entirely into memory; it's really not a big deal.
EDIT: After realizing all the files are on a single hard-drive, and that processing takes longer than reading the file.
You should have one thread reading the files sequentially. Once a file is read, fire up another thread that handles the processing, and start reading the second file in the first thread. Once the second file is read, fire up another thread, and so on.
You should make sure you don't fire more processing threads than the number of cores you have, but for starters just use the thread pool for this, and optimize later.
You're missing a little bit of performance, because the time you spend reading the first file is not used for any processing. This should be negligible; reading 100 MB of data into memory shouldn't take more than a few seconds.
I assume that you are processing files line by line. You also said that loading the files is faster than processing them. There are a few ways you can do what you need. One, for example:
Create a thread that reads the files one by one, line by line. Sequentially, because doing this in parallel will only hammer your HDD and possibly give worse results. Use a Queue&lt;string&gt; for that (guarded by a lock, or use ConcurrentQueue&lt;string&gt;, since Queue&lt;string&gt; itself is not thread safe), calling Queue.Enqueue() to add the lines you've read.
Run another thread that processes the queue. Use Queue.Dequeue() to get (and remove) lines from the front of the queue. Process the line and write it to the output file. Eventually you can put the processed lines in another queue or list and write them all at once when you finish processing.
If order of lines in output file is not important you can create as many threads as you have CPU cores (or use ThreadPool class) to do the processing (that would speed up things significantly).
[Edit]
If the order of lines in the output file is important, you should limit line processing to one thread, or process them in parallel using separate threads and implement a mechanism to control the output order. For example, you may do that by numbering the lines you read from the input file (the easy way), or by having each thread process chunks of n lines and writing the output chunk by chunk in the same order you started the processing threads.
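If hand-rolling that ordering mechanism sounds like too much, PLINQ's AsOrdered gives the same effect with far less code. This sketch assumes the per-line work lives in a caller-supplied function; the class and parameter names are illustrative:

```csharp
using System;
using System.IO;
using System.Linq;

static class OrderedParallel
{
    // Process lines in parallel while keeping the output in input order.
    public static void Run(string inputPath, string outputPath, Func<string, string> processLine)
    {
        var processed = File.ReadLines(inputPath)
            .AsParallel()
            .AsOrdered()          // preserve input order in the output
            .Select(processLine);
        File.WriteAllLines(outputPath, processed);
    }
}
```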
Here is some simple threading code you can use (.NET 4):
//firstly get the file paths from the listview so you won't block the UI thread
List<string> filesPaths = new List<string>();
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    filesPaths.Add(listView1.Items[i].Text);
}

//this loop will read up to 50 files at the same time
Parallel.ForEach(filesPaths, new ParallelOptions() { MaxDegreeOfParallelism = 50 }, (filepath, i, j) =>
{
    //read file contents
    string data = File.ReadAllText(filepath);
    //do whatever you want with the contents
});
not tested though...