Let's say that you want to write an application that processes multiple text files, supplied as arguments at the command line (e.g., MyProcessor file1 file2 ...). This is a very common task for which Perl is often used, but what if you want to take advantage of .NET directly and use C#?
What is the simplest C# 4.0 application boilerplate code that allows you to do this? It should basically read each file line by line and do something with each line, either by calling a function to process it, or perhaps there is a better way to do this sort of "group" line processing (e.g., LINQ or some other method).
You could process files in parallel by reading each line and passing it to a processing function:
using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        // Each command-line argument is a file path; process the files in parallel.
        Parallel.ForEach(args, file =>
        {
            using (var stream = File.OpenRead(file))
            using (var reader = new StreamReader(stream))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    ProcessLine(line);
                }
            }
        });
    }

    static void ProcessLine(string line)
    {
        // TODO: process the line
    }
}
Now simply call: SomeApp.exe file1 file2 file3
Pros of this approach:
Files are processed in parallel => taking advantage of multiple CPU cores
Files are read line by line and only the current line is kept in memory, which reduces memory consumption and allows you to work with big files
Simple:
foreach (var f in args)
{
    var filecontent = File.ReadAllText(f);
    // Logic goes here
}
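If you want the LINQ-style "group" processing the question mentions, File.ReadLines (available since .NET 4.0) streams lines lazily, so all files can be treated as one sequence. A minimal sketch along those lines (ProcessLine is the same placeholder as in the answer above):
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Lazily enumerate every line of every file, in order,
        // without loading whole files into memory.
        foreach (var line in args.SelectMany(File.ReadLines))
        {
            ProcessLine(line);
        }
    }

    static void ProcessLine(string line)
    {
        // TODO: process the line
    }
}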
After much experimenting, changing this line in Darin Dimitrov's answer:
using (var stream = File.OpenRead(file))
to:
using (var stream = new FileStream(file, System.IO.FileMode.Open,
                                   System.IO.FileAccess.Read,
                                   System.IO.FileShare.ReadWrite,
                                   65536))
in order to change the read buffer size from the 4 KB default to 64 KB, can shave as much as 10% off the file read time when reading "line at a time" via a StreamReader, especially if the text file is large. Larger buffer sizes do not seem to improve performance further.
This improvement is present even when reading from a relatively fast SSD. The savings are even more substantial if an ordinary HD is used. Interestingly, you get this significant performance improvement even if the file is already cached by the (Windows 7 / 2008R2) OS, which is somewhat counterintuitive.
Let me preface this question by saying I'm absolutely not a pro C# programmer and have pretty much brute forced my way through most of my small programs so far.
I'm working on a small WinForms application to SSH into a few devices, tail -f a log file on each, and display the real-time output in TextBoxes while also saving to log files. Right now, it works, but hogs nearly 30% of my CPU during logging and I'm sure I'm doing something wrong.
After creating the SshClient and connecting, I run the tail command like so (these variables are part of a logger class which exists for each connection):
command = client.CreateCommand("tail -f /tmp/messages");
result = command.BeginExecute();
stream = command.OutputStream;
I then have a log reading/writing function:
public async Task logOutput(IAsyncResult result, Stream stream, TextBox textBox, string logPath)
{
    // Clear textbox ( thread-safe :) )
    textBox.Invoke((MethodInvoker)(() => textBox.Clear()));

    // Create reader for stream and writer for text file
    StreamReader reader = new StreamReader(stream, Encoding.UTF8, true, 1024, true);
    StreamWriter sw = File.AppendText(logPath);

    // Start reading from SSH stream
    while (!result.IsCompleted || !reader.EndOfStream)
    {
        string line = await reader.ReadLineAsync();
        if (line != null)
        {
            // append to textbox
            textBox.Invoke((Action)(() => textBox.AppendText(line + Environment.NewLine)));
            // append to file
            sw.WriteLine(line);
        }
    }
}
Which I call in the following way, per device connection:
Task.Run(() => logOutput(logger.result, logger.stream, textBox, fileName), logger.token);
Everything works fine, it's just the CPU usage that's the issue. I'm guessing I'm creating way more than one thread per logging process, but I don't know why or how to fix that.
Does anything stand out as a simple fix to the above code? Or even better - is there a way to set up a callback that only prints the new data when the result object gets new text?
All help is greatly appreciated!
EDIT 3/4/2021
I tried a simple test using CopyToAsync by changing the code inside logOutput() to the following:
public async Task logOutput(IAsyncResult result, Stream stream, string logPath)
{
    using (Stream fileStream = File.Open(logPath, FileMode.OpenOrCreate))
    {
        // While the result is running, copy everything from the command stream to a file
        while (!result.IsCompleted)
        {
            await stream.CopyToAsync(fileStream);
        }
    }
}
However this results in the text files never getting data written to them, and CPU usage is actually slightly worse.
2ND EDIT 3/4/2021
Doing some more debugging, it appears the high CPU usage occurs only when there's no new data coming in. As far as I can tell, this is because ReadLineAsync() keeps returning immediately regardless of whether there's actually new data from the SSH command that's running, so the loop spins as fast as possible and hogs all the CPU cycles it can. I'm not entirely sure why that is, though, and could really use some help here. I would have assumed that ReadLineAsync() would simply wait until a new line was available from the SSH command before continuing.
The solution ended up being much simpler than I would've thought.
There's a known bug in SSH.NET where the command's OutputStream will continually spit out null data when no actual new data has been received. This causes the while loop in my code to run as fast as possible, consuming a bunch of CPU in the process.
The solution is simply to add a short asynchronous delay in the loop. I included the delay only when the received data is null, so that reading isn't interrupted when actual valid data is coming through.
while (!result.IsCompleted && !token.IsCancellationRequested)
{
    string line = await reader.ReadLineAsync();

    // Skip null/empty reads (SSH.NET keeps returning them when no new data has arrived)
    if (string.IsNullOrEmpty(line))
    {
        await Task.Delay(10); // prevents high CPU usage
        continue;
    }

    // Append line to textbox
    textBox.Invoke((Action)(() => textBox.AppendText(line + Environment.NewLine)));
    // Append line to file
    writer.WriteLine(line);
}
On a Ryzen 5 3600, this brought my CPU usage from ~30-40% while the program was running to less than 1% even when data is flowing. Much better.
I'm using a vendor-provided C++ DLL, that I call with DLLImport, to parse and process files containing many object types.
I need to have a correlation between number of objects in the file and memory usage, in order to (hopefully) be able to prevent OutOfMemoryExceptions that happen sometimes.
Update
To be more clear on what I'm trying to measure and why: the out-of-memory exception is expected, because some very complex files take up to 7 GB of memory to load (as measured by perfmon): they are 3D maps of sometimes huge and intricate buildings, from the walls down to the individual screws and bolts, including the trees outside and the tables and chairs in each room.
And since the DLL can load multiple maps in parallel (it's on a web server and the process is shared), loading two 7 GB files understandably triggers an OutOfMemoryException on a machine with 8 GB of RAM.
However, 7 GB is pretty rare; most of the maps take up about 500 MB, and some take 1 to 2 GB.
What we really need is not to find a memory leak (yet...), but to be able to know, before loading a file, how much memory it will probably use. So when a user tries to load a file that we calculate will probably take about 2 GB of RAM while the machine has 1 GB free, we can do something about it; from spinning up a new VM in Azure to preventing the user from working, we don't know what yet, but we can't let the DLL bring the whole server down each time.
And in order to do that, I want to find out, for instance, that "the DLL uses 1 MB of memory for each 100 geometry objects".
So I have a bunch of files to test (about a hundred), and I want to load them up in order, measure the memory usage of the native DLL (before and after), unload the file, process the next. Then I get a nice CSV file with all the data.
I have tried System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64, but it only gives me the current process memory, and the DLL doesn't seem to live in the current process, since most measurements give me 0 bytes (difference between before and after file load).
I have also tried GC.GetTotalMemory(), but it's not much better: the files all seemingly take exactly 1080 bytes.
private static void MeasureFilesMemoryUsage(string[] files) {
    foreach (var file in files) {
        var beforeLoad = MeasureMemoryUsage();
        wrapper.LoadFile(file);
        var afterLoad = MeasureMemoryUsage();
        wrapper.Unload();
        // save beforeLoad and afterLoad
    }
}

private static long MeasureMemoryUsage() {
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    return System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
}
I know about tools like VMMAP or RedGate ANTS Memory Profiler (or simply performance counters), but these do not allow me to match the memory usage with a specific loaded file; I would have to load the files one by one, pause the program, take a measurement in the tool, and write down the results. Not something I want to do for 100 files.
How do I measure memory usage of a specific C++ DLL from .Net code?
After reading @HansPassant's comments, I have split my test into two programs: one that loads the files, and one that reads the memory measurements of the first.
Here they are, cleaned up to remove other measures (like number of items in my json files) and results saving.
The "measures" program:
public static void Main(string[] args) {
    foreach (var document in Directory.EnumerateDirectories(JsonFolder)) {
        MeasureMemory(document);
    }
}

private static void MeasureMemory(string document) {
    // run the loader process
    var proc = new Process {
        StartInfo = new ProcessStartInfo {
            FileName = "loader.exe",
            Arguments = document,
            WindowStyle = ProcessWindowStyle.Hidden,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        }
    };
    proc.Start();

    // get the process output
    var output = string.Empty;
    while (!proc.StandardOutput.EndOfStream) {
        output += proc.StandardOutput.ReadLine() + "\n";
    }
    proc.WaitForExit();

    // parse the process output
    var processMemoryBeforeLoad = long.Parse(Regex.Match(output, "BEFORE ([\\d]+)", RegexOptions.Multiline).Groups[1].Value);
    var processMemoryAfterLoad = long.Parse(Regex.Match(output, "AFTER ([\\d]+)", RegexOptions.Multiline).Groups[1].Value);

    // save the measures in a CSV file
}
And the "loader" program:
public static int Main(string[] args) {
    var document = args[0];
    var files = Directory.EnumerateFiles(document);

    Console.WriteLine("BEFORE {0}", MeasureMemoryUsage());
    wrapper.LoadFiles(files);
    Console.WriteLine("AFTER {0}", MeasureMemoryUsage());
    wrapper.Unload();
    return 0;
}

private static long MeasureMemoryUsage() {
    // make sure the GC has done its job
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    return System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
}
I have a Windows Forms application that currently does the following:
1) point to a directory and do step 2 for all the XML files in there (usually a max of 25 files, ranging from 10 MB up to 5 GB (uncommon, but possible))
2) XML read/write to alter some of the existing XML attributes (currently I use a single BackgroundWorker for that)
3) write the altered XML attributes directly to a NEW file in a different directory
The little app works fine, but it takes far too long to finish (about 20 minutes, depending on the net GB size).
What I casually tried was starting the main read/write method in a Parallel.ForEach(), but unsurprisingly it blocked itself out and exited.
My idea would be to parallelize the read/write process by starting it on all ~25 files at the same time. Is this wise? How can I do it (TPL?) without locking myself out?
PS: I have quite a powerful desktop PC, with a 1 TB Samsung Pro SSD, 16 GB of RAM, and an Intel Core i7.
You could use a thread pool for this, with a pool sized for around 20 files.
Since you have a Core i7, you should use TaskFactory.StartNew.
In that case, encapsulate the code for processing a single file in a class such as XMLProcessor, then use TaskFactory.StartNew to process the XML files on multiple threads (see the sketch below).
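A rough sketch of what this answer seems to suggest, assuming a hypothetical XMLProcessor.Process(inputFile, outputDirectory) method that wraps the existing single-file read/alter/write logic (the class shape and arguments are placeholders, not an existing API):
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public class XMLProcessor
{
    // Hypothetical wrapper: read one XML file, alter its attributes, write the result elsewhere.
    public void Process(string inputFile, string outputDirectory)
    {
        // ... the existing single-file read/write logic goes here ...
    }
}

class Program
{
    static void Main(string[] args)
    {
        string inputDirectory = args[0];
        string outputDirectory = args[1];

        // Start one task per file; the scheduler maps them onto thread-pool threads.
        Task[] tasks = Directory.EnumerateFiles(inputDirectory, "*.xml")
            .Select(file => Task.Factory.StartNew(
                () => new XMLProcessor().Process(file, outputDirectory)))
            .ToArray();

        Task.WaitAll(tasks);
    }
}
Each task gets its own XMLProcessor instance, so no state is shared between files and nothing needs to lock.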
This sounds like a job for data parallelism via PLINQ + asynchronous lambdas.
I recently needed to process data from a zip archive that itself contained 5,200 zip archives which then each contained one or more data files in XML or CSV format. In total, between 40-60 GB of data when decompressed and read into memory.
The algorithm browses through this data, makes decisions based on what it finds in conjunction with supplied predicates, and finally writes the selections to disk as 1.0-1.5 GB files. Using an async PLINQ pattern with 32 processors, the average run time for each output file was 4.23 minutes.
After implementing the straightforward solution with async PLINQ, I spent some time trying to improve the running time by digging down into the TPL and TPL Dataflow libraries. In the end, attempting to beat async PLINQ proved to be a fun but ultimately fruitless exercise for my needs. The performance margins from the more "optimized" solutions were not worth the added complexity.
Below is an example of the async PLINQ pattern. The initial collection is an array of file paths.
In the first step, each file path is asynchronously read into memory and parsed, the file name is cached as a root-level attribute, and the element is streamed to the next function.
In the last step, each XElement is asynchronously written to a new file.
I recommend that you play around with the lambda that reads the files. In my case, I found that reading via an async lambda gave me better throughput while decompressing files in memory.
However, for simple XML documents, you may be better off replacing the first async lambda with a method call to XElement.Load(string file) and letting PLINQ read as needed.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

namespace AsyncPlinqExample
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Limit parallelism here if needed
            int degreeOfParallelism = Environment.ProcessorCount;

            string resultDirectory = "[result directory path here]";
            string[] files = Directory.GetFiles("[directory with files here]");

            files.AsParallel()
                .WithDegreeOfParallelism(degreeOfParallelism)
                .Select(
                    async x =>
                    {
                        // Read and parse each file asynchronously, caching the file name on the root element.
                        using (StreamReader reader = new StreamReader(x))
                        {
                            XElement root = XElement.Parse(await reader.ReadToEndAsync());
                            root.SetAttributeValue("fileName", Path.GetFileName(x));
                            return root;
                        }
                    })
                .Select(x => x.Result)
                .Select(
                    x =>
                    {
                        // Perform other manipulations here
                        return x;
                    })
                .Select(
                    async x =>
                    {
                        string resultPath =
                            Path.Combine(
                                resultDirectory,
                                (string)x.Attribute("fileName"));

                        await Console.Out.WriteLineAsync($"{DateTime.Now}: Starting {(string)x.Attribute("fileName")}.");

                        using (StreamWriter writer = new StreamWriter(resultPath))
                        {
                            await writer.WriteAsync(x.ToString());
                        }

                        await Console.Out.WriteLineAsync($"{DateTime.Now}: Completed {(string)x.Attribute("fileName")}.");
                    })
                .ForAll(t => t.Wait()); // force the lazy PLINQ query to run and wait for the writes
        }
    }
}
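For comparison, here is a rough sketch of the simpler variant mentioned above, where the first async lambda is replaced by XElement.Load and PLINQ schedules the reads itself (directory paths and the manipulation step remain placeholders):
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public class SimplerVariant
{
    public static void Main(string[] args)
    {
        string resultDirectory = "[result directory path here]";
        string[] files = Directory.GetFiles("[directory with files here]");

        files.AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .Select(file =>
            {
                // Synchronous load; PLINQ worker threads do the reading.
                XElement root = XElement.Load(file);
                root.SetAttributeValue("fileName", Path.GetFileName(file));
                // Perform other manipulations here
                return root;
            })
            .ForAll(root =>
            {
                string resultPath = Path.Combine(resultDirectory, (string)root.Attribute("fileName"));
                root.Save(resultPath);
            });
    }
}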
I have a program that continuously writes its log to a text file.
I don't have the source code of it, so I can not modify it in any way and it is also protected with Themida.
I need to read the log file and execute some scripts depending on the content of the file.
I can not delete the file because the program that is continuously writing to it has locked the file.
So what would be the best way to read the file, and only read the new lines?
Saving the last line position? Or is there something in C# that would be useful for solving this?
Perhaps use the FileSystemWatcher along with opening the file with FileShare (as it is being used by another process). Hans Passant has provided a nice answer for this part here:
var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (var sr = new StreamReader(fs)) {
    // etc...
}
Have a look at this question and the accepted answer which may also help.
using (var fs = new FileStream("test.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite | FileShare.Delete))
using (var reader = new StreamReader(fs))
{
while (true)
{
var line = reader.ReadLine();
if (!String.IsNullOrWhiteSpace(line))
Console.WriteLine("Line read: " + line);
}
}
I tested the above code and it works if you are trying to read one line at a time. The only issue is that if the line is flushed to the file before it is finished being written then you will read the line in multiple parts. As long as the logging system is writing each line all at once it should be okay.
If not then you may want to read into a buffer instead of using ReadLine, so you can parse the buffer yourself by detecting each Environment.NewLine substring.
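A minimal sketch of that buffered approach, assuming the same shared-read FileStream as above; any trailing partial line is kept in a StringBuilder until the rest of it arrives:
using System;
using System.IO;
using System.Text;
using System.Threading;

class BufferedTail
{
    static void Main()
    {
        var buffer = new char[4096];
        var pending = new StringBuilder();

        using (var fs = new FileStream("test.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(fs))
        {
            while (true)
            {
                int read = reader.Read(buffer, 0, buffer.Length);
                if (read == 0) { Thread.Sleep(100); continue; }

                pending.Append(buffer, 0, read);
                string text = pending.ToString();
                int index;
                // Emit every complete line; keep the incomplete remainder for the next read.
                while ((index = text.IndexOf(Environment.NewLine, StringComparison.Ordinal)) >= 0)
                {
                    Console.WriteLine("Line read: " + text.Substring(0, index));
                    text = text.Substring(index + Environment.NewLine.Length);
                }
                pending.Clear();
                pending.Append(text);
            }
        }
    }
}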
You can just keep calling ReadToEnd() in a tight loop. Even after it reaches the end of the file it'll just return an empty string "". If some more data is written to the file it will pick it up on a subsequent call.
while (true)
{
    string moreData = streamReader.ReadToEnd();
    Thread.Sleep(100);
}
Bear in mind you might read partial lines this way. Also if you are dealing with very large files you will probably need another approach.
Use the FileSystemWatcher to detect changes, then get the new lines by seeking the file to the last read position.
http://msdn.microsoft.com/en-us/library/system.io.filestream.seek.aspx
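A rough sketch of that idea (the log path is a placeholder): the watcher only signals that the file changed, and the handler reopens the file with shared read access and seeks to the last position it read:
using System;
using System.IO;

class LogTail
{
    static readonly object sync = new object();
    static long lastPosition = 0;

    static void Main()
    {
        string path = @"C:\logs\app.log"; // placeholder path
        var watcher = new FileSystemWatcher(Path.GetDirectoryName(path), Path.GetFileName(path));
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.Size;
        watcher.Changed += (s, e) => ReadNewLines(path);
        watcher.EnableRaisingEvents = true;
        Console.ReadLine(); // keep the process alive
    }

    static void ReadNewLines(string path)
    {
        lock (sync) // Changed can fire several times in quick succession
        {
            // Reopen with shared access so the writing program is not disturbed.
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (var reader = new StreamReader(fs))
            {
                fs.Seek(lastPosition, SeekOrigin.Begin);
                string line;
                while ((line = reader.ReadLine()) != null)
                    Console.WriteLine("New line: " + line);
                lastPosition = fs.Position; // remember where we stopped
            }
        }
    }
}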
The log file is being "continuously" updated so you really shouldn't use FileSystemWatcher to raise an event each time the file changes. This would be triggering continuously, and you already know it will be very frequently changing.
I'd suggest using a timer event to periodically process the file. Read this SO answer for a good pattern to use System.Threading.Timer [1]. Keep a file stream open for reading, or reopen it each time and Seek to the end position of your last successful read. By "last successful read" I mean that you should encapsulate the reading and validating of a complete log line. Once you've successfully read and validated a log line, then you have a new position for the next Seek.
[1] Note that System.Threading.Timer will execute on a system-supplied thread that is kept in business by the ThreadPool. For short tasks this is more desirable than a dedicated thread.
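A compact sketch of that timer pattern, keeping one shared-read stream open and polling it every second (path and interval are placeholders; partial trailing lines are still possible, as noted in other answers):
using System;
using System.IO;
using System.Threading;

class TimerTail
{
    static void Main()
    {
        var fs = new FileStream(@"C:\logs\app.log", FileMode.Open,
                                FileAccess.Read, FileShare.ReadWrite); // placeholder path
        var reader = new StreamReader(fs);
        reader.ReadToEnd(); // skip the existing content; start tailing from the end

        // Fires every second on a thread-pool thread.
        var timer = new Timer(_ =>
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                Console.WriteLine("New line: " + line); // validate/process the complete line here
        }, null, 1000, 1000);

        Console.ReadLine(); // keep the process alive
        timer.Dispose();
        reader.Dispose();
    }
}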
Use this answer from another post: c# continuously read file.
This one is quite efficient, and it checks once per second if the file size has changed. So the file is usually not read-locked as a result.
The other answers are quite valid and simple. A couple of them will read-lock the file continuously, but that's probably not a problem for most.
We have up to 30 GB of gzipped log files per day. Each file holds 100,000 lines and is between 6 and 8 MB when compressed. The simplified code, in which the parsing logic has been stripped out, utilises the Parallel.ForEach loop.
The maximum number of lines processed peaks at a MaxDegreeOfParallelism of 8 on the two-NUMA-node, 32-logical-CPU box (Intel Xeon E7-2820 @ 2 GHz):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

namespace ParallelLineCount
{
    public class ScriptMain
    {
        static void Main(String[] args)
        {
            int maxMaxDOP = (args.Length > 0) ? Convert.ToInt16(args[0]) : 2;
            string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles";
            string filePattern = (args.Length > 2) ? args[2] : "*2012-10-30.*.gz";
            string fileNamePrefix = (args.Length > 3) ? args[3] : "LineCounts";

            Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
            Console.WriteLine("Processing file(s): {0}", filePattern);
            Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
            Console.WriteLine("");
            Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");

            for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
            {
                // Construct ConcurrentStacks for resulting strings and counters
                ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
                ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();

                DateTime FullStartTime = DateTime.Now;
                string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
                var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };

                // Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
                Parallel.ForEach(files, options, currentFile =>
                {
                    string filename = System.IO.Path.GetFileName(currentFile);
                    DateTime fileStartTime = DateTime.Now;

                    using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
                    {
                        Int64 lines = 0, someBookLines = 0, length = 0;
                        String line = "";

                        using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
                        {
                            while (!reader.EndOfStream)
                            {
                                line = reader.ReadLine();
                                lines++; // total lines
                                length += line.Length; // total line length
                                if (line.Contains("book")) someBookLines++; // some special lines that need to be parsed later
                            }

                            TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
                            TotalFiles.Push(1); // silly way to count processed files :)
                        }
                    }
                });

                TimeSpan runningTime = DateTime.Now - FullStartTime;

                // Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
                Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
                    maxDOP.ToString(),
                    TotalFiles.Sum().ToString(),
                    Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
                    TotalLength.Sum().ToString(),
                    TotalLines.Sum(),
                    TotalSomeBookLines.Sum().ToString(),
                    Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
                    Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
            }

            Console.WriteLine();
            Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
        }
    }
}
Here's a summary of the results, which showed a clear peak at MaxDegreeOfParallelism = 8.
The CPU load (aggregated across cores) mostly sat on a single NUMA node, even when the DOP was in the 20 to 30 range.
The only way I've found to make CPU load cross 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find a bottleneck?
It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer. Such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which has the thread making many calls to the I/O subsystem to fill its buffer. A buffer of 64K means that you will make those calls much less frequently.
I've found that a buffer size of between 32K and 256K gives the best performance, with 64K being the "sweet spot" when I did some detailed testing a while back. A buffer size larger than 256K actually begins to reduce performance.
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
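For example, a sketch against the counters in the question's code (requires using System.Threading):
// Shared totals instead of the ConcurrentStack<Int64> instances:
long totalLines = 0, totalSomeBookLines = 0, totalLength = 0;
int totalFiles = 0;

// Inside the Parallel.ForEach body, after counting one file locally:
Interlocked.Add(ref totalLines, lines);
Interlocked.Add(ref totalSomeBookLines, someBookLines);
Interlocked.Add(ref totalLength, length);
Interlocked.Increment(ref totalFiles);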
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make CPU load cross 95% mark was to split
the files across 4 different folders and execute the same command 4
times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files. As though the OS is using a mutual exclusion lock on the directory. And even if all the data is in the cache and there's no physical I/O required, processes still have to wait on this lock. It's also possible that the file system is writing to the disk. Remember, it has to update the Last Access Time for a file whenever it's opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.
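A minimal sketch of that producer/consumer shape, assuming the raw compressed files (6-8 MB each) fit comfortably in memory as byte arrays; the directory and pattern are placeholders:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        string[] files = Directory.GetFiles(@"C:\Temp\SomeFiles", "*.gz"); // placeholder pattern
        var queue = new BlockingCollection<byte[]>(boundedCapacity: 16);

        // Single producer: only this thread touches the directory/disk.
        var producer = Task.Run(() =>
        {
            foreach (var file in files)
                queue.Add(File.ReadAllBytes(file));
            queue.CompleteAdding();
        });

        long totalLines = 0;

        // N consumers: decompress and count entirely in memory.
        var consumers = Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
        {
            foreach (var compressed in queue.GetConsumingEnumerable())
            {
                using (var reader = new StreamReader(
                    new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress)))
                {
                    long lines = 0;
                    while (reader.ReadLine() != null) lines++;
                    Interlocked.Add(ref totalLines, lines);
                }
            }
        })).ToArray();

        Task.WaitAll(consumers);
        producer.Wait();
        Console.WriteLine("Total lines: " + totalLines);
    }
}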
The reason for this is usually that threads synchronize too much.
Looking for synchronization in your code I can see heavy syncing on the collections. Your threads are pushing the lines individually. This means that each line incurs at best an interlocked operation and at worst a kernel-mode lock wait. The interlocked operations will contend heavily because all threads race to get their current line into the collection. They all try to update the same memory locations. This causes cache line pinging.
Change this to push lines in bigger chunks. Push line-arrays of 100 lines or more. The more the better.
In other words, collect results in a thread-local collection first and only rarely merge into the global results.
You might even want to get rid of the manual data pushing altogether. This is what PLINQ is made for: Streaming data concurrently. PLINQ abstracts away all the concurrent collection manipulations in a well-performing way.
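One way to get that coarse-grained merging with the existing Parallel.ForEach is its localInit/localFinally overload. A sketch reusing the files and options variables from the question's code (the Counters class is just a per-thread accumulator invented for the example):
// Per-thread accumulator; merged into the shared totals only once per worker.
class Counters { public long Lines; public long Length; }

// ...
long totalLines = 0, totalLength = 0;

Parallel.ForEach(
    files,
    options,
    () => new Counters(),                       // localInit: one accumulator per thread
    (currentFile, loopState, local) =>
    {
        using (var reader = new StreamReader(
            new GZipStream(File.OpenRead(currentFile), CompressionMode.Decompress)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                local.Lines++;
                local.Length += line.Length;
            }
        }
        return local;
    },
    local =>                                    // localFinally: rare, coarse-grained merge
    {
        Interlocked.Add(ref totalLines, local.Lines);
        Interlocked.Add(ref totalLength, local.Length);
    });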
I don't think parallelizing the disk reads is helping you. In fact, this could be seriously impacting your performance by creating contention when reading from multiple areas of storage at the same time.
I would restructure the program to first do a single-threaded read of raw file data into a memory stream of byte[]. Then, do a Parallel.ForEach() on each stream or buffer to decompress and count the lines.
You take an initial IO read hit up front but let the OS/hardware optimize the hopefully mostly sequential reads, then decompress and parse in memory.
Keep in mind that operations like decompression, Encoding.UTF8.ToString(), String.Split(), etc. will use large amounts of memory, so clean up references to / dispose of old buffers as you no longer need them.
I'd be surprised if you can't cause the machine to generate some serious waste heat this way.
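A minimal sketch of that two-phase shape, reusing the files array and namespaces from the question's program and assuming the raw compressed bytes fit in memory:
// Phase 1: single-threaded, sequential read of the raw compressed bytes.
var rawFiles = files.Select(File.ReadAllBytes).ToList();

// Phase 2: CPU-bound decompression and line counting, in parallel, entirely in memory.
Parallel.ForEach(rawFiles, raw =>
{
    using (var reader = new StreamReader(new GZipStream(new MemoryStream(raw), CompressionMode.Decompress)))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // count / parse the line here
        }
    }
});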
Hope this helps.
The problem, I think, is that you are using blocking I/O, so your threads cannot fully take advantage of parallelism.
If I understand your algorithm right (sorry, I'm more of a C++ guy) this is what you are doing in each thread (pseudo-code):
while (there is data in the file)
    read data
    gunzip data
Instead, a better approach would be something like this:
N = 0
read data block N
while (there is data in the file)
    asyncRead data block N+1
    gunzip data block N
    N = N + 1
gunzip data block N
The asyncRead call does not block, so basically you have the decoding of block N happening concurrently with the reading of block N+1, so by the time you are done decoding block N you might have block N+1 ready (or close to ready if I/O is slower than decoding).
Then it's just a matter of finding the block size that gives you the best throughput.
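In C# that might look roughly like the following, using FileStream's asynchronous reads to overlap the read of block N+1 with the processing of block N; the Decompress step is a placeholder, since a real implementation would have to feed the blocks into a streaming gunzip/parser rather than treat each block independently:
using System;
using System.IO;
using System.Threading.Tasks;

class AsyncBlockReader
{
    const int BlockSize = 1 << 20; // 1 MB blocks; tune this for best throughput

    public static async Task ProcessFileAsync(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, useAsync: true))
        {
            var current = new byte[BlockSize];
            var next = new byte[BlockSize];

            int read = await fs.ReadAsync(current, 0, BlockSize);
            while (read > 0)
            {
                // Kick off the read of block N+1 before touching block N.
                Task<int> nextRead = fs.ReadAsync(next, 0, BlockSize);

                Decompress(current, read); // placeholder: feed block N to a streaming gunzip/parser

                read = await nextRead;
                var tmp = current; current = next; next = tmp; // swap buffers
            }
        }
    }

    static void Decompress(byte[] buffer, int count)
    {
        // placeholder for the real decompression and line counting
    }
}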
Good luck.