C# - Concatenating files until process memory is full then delete duplicates

I'm currently working on a C# form. Basically, I have a lot of log files and most of them have duplicate lines between them. This form is supposed to concatenate a lot of those files into one file and then delete all the duplicates, so that I end up with one log file without duplicates. I've already successfully made it work by taking 2 files, concatenating them, deleting all the duplicates, then repeating the process until I have no more files. Here is the function I made for this:
private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    for (int j = 0; j < path_list.Length; j++)
    {
        HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
        List<string> LogFile = new List<string>();
        while (em.MoveNext())
        {
            var secondLogFile = File.ReadAllLines(em.Current);
            LogFile = LogFile.Concat(secondLogFile).ToList();
            LogFile = LogFile.Distinct().ToList();
            backgroundWorker1.ReportProgress(1);
        }
        LogFile = LogFile.Distinct().ToList();
        string new_path = parent_path + "/new_data/probe." + j + ".log";
        File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
    }
}
path_list contains all the paths to the files I need to process.
path_list[0] contains all the probe.0.log files
path_list[1] contains all the probe.1.log files ...
Here is the idea I have for my problem, but I have no idea how to code it:
private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    for (int j = 0; j < path_list.Length; j++)
    {
        HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
        List<string> LogFile = new List<string>();
        while (em.MoveNext())
        {
            // how I see it
            if (currentMemoryUsage + newfile.Length > maximumProcessMemory) {
                LogFile = LogFile.Distinct().ToList();
            }
            //end
            var secondLogFile = File.ReadAllLines(em.Current);
            LogFile = LogFile.Concat(secondLogFile).ToList();
            LogFile = LogFile.Distinct().ToList();
            backgroundWorker1.ReportProgress(1);
        }
        LogFile = LogFile.Distinct().ToList();
        string new_path = parent_path + "/new_data/probe." + j + ".log";
        File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
    }
}
I think this method will be much quicker, and it will adjust to any computer's specs. Can anyone help me make this work? Or tell me if I'm wrong.
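A minimal sketch of that check, assuming a fixed memory cap and the variable names from the snippet above (maximumProcessMemory, currentMemoryUsage and the file-size lookup are hypothetical, and Process needs System.Diagnostics); the answers below take a different approach, so treat this only as an illustration of the idea:
// Hypothetical memory check before growing the in-memory list further.
long maximumProcessMemory = 2L * 1024 * 1024 * 1024;                 // e.g. cap at 2 GB
long currentMemoryUsage = Process.GetCurrentProcess().WorkingSet64;  // RAM currently used by this process
long newFileSize = new FileInfo(em.Current).Length;                  // size of the next file to load

if (currentMemoryUsage + newFileSize > maximumProcessMemory)
{
    // Shrink the in-memory list before loading more data.
    LogFile = LogFile.Distinct().ToList();
}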

You are creating far too many lists and arrays and Distincts.
Just combine everything in a HashSet, then write it out
private static void CombineNoDuplicates(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    var logFile = new HashSet<string>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            var lines = File.ReadLines(path);
            logFile.UnionWith(lines);
            backgroundWorker1.ReportProgress(1);
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        File.WriteAllLines(new_path, logFile);
    }
}
Ideally you should use async instead of BackgroundWorker which is deprecated. This also means you don't need to store a whole file in memory at once, except for the first one.
private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var logFile = new HashSet<string>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            using (var sr = new StreamReader(path))
            {
                string line;
                while ((line = await sr.ReadLineAsync()) != null)
                {
                    logFile.Add(line);
                }
            }
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        await File.WriteAllLinesAsync(new_path, logFile);
    }
}
If you want to risk a colliding hash-code, you could cut down your memory usage even further by just putting the strings' hashes in a HashSet, then you can fully stream all files.
Caveat: colliding hash-codes are a distinct possibility, especially with many strings. Analyze your data to see if you can risk this.
private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var hashes = new HashSet<int>(1000); // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        hashes.Clear();
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        using (var output = new StreamWriter(new_path))
        {
            foreach (var path in path_list[j])
            {
                using (var sr = new StreamReader(path))
                {
                    string line;
                    while ((line = await sr.ReadLineAsync()) != null)
                    {
                        if (hashes.Add(line.GetHashCode()))
                            await output.WriteLineAsync(line);
                    }
                }
            }
        }
    }
}
You can get even more performance by reading the raw bytes as Span<byte> and parsing the lines yourself; I will leave that as an exercise for the reader, as it's quite complex.
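For reference, one possible shape of that exercise, as a hypothetical sketch rather than the answer's implementation: it deduplicates a single file by hashing the raw UTF-8 bytes of each line (FNV-1a), assumes '\n' or '\r\n' line endings, and carries the same hash-collision caveat as above. CombineNoDuplicatesSpan and its parameters are made-up names.
// Requires: using System; using System.Collections.Generic; using System.IO;
private static void CombineNoDuplicatesSpan(string inputPath, string outputPath)
{
    var seen = new HashSet<int>();
    byte[] data = File.ReadAllBytes(inputPath);            // one buffer for the whole input file
    using (var output = File.Create(outputPath))
    {
        ReadOnlySpan<byte> remaining = data;
        while (!remaining.IsEmpty)
        {
            int newline = remaining.IndexOf((byte)'\n');
            ReadOnlySpan<byte> line = newline >= 0 ? remaining.Slice(0, newline) : remaining;
            remaining = newline >= 0 ? remaining.Slice(newline + 1) : ReadOnlySpan<byte>.Empty;

            if (line.Length > 0 && line[line.Length - 1] == (byte)'\r')
                line = line.Slice(0, line.Length - 1);     // trim the CR of a CRLF ending

            // Cheap FNV-1a hash over the raw bytes; collisions are possible, as noted above.
            int hash = unchecked((int)2166136261);
            foreach (byte b in line)
                hash = unchecked((hash ^ b) * 16777619);

            if (seen.Add(hash))
            {
                output.Write(line);                        // Stream.Write(ReadOnlySpan<byte>), .NET Core 2.1+
                output.WriteByte((byte)'\n');
            }
        }
    }
}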

Assuming your log files already contain lines that are sorted in chronological order¹, we can effectively treat them as intermediate files for a multi-file sort and perform merging/duplicate elimination in one go.
It would be a new class, something like this:
internal class LogFileMerger : IEnumerable<string>
{
    private readonly List<IEnumerator<string>> _files;

    public LogFileMerger(HashSet<string> fileNames)
    {
        _files = fileNames.Select(fn => File.ReadLines(fn).GetEnumerator()).ToList();
        // Prime each enumerator so Current is valid, and drop any empty files.
        _files.RemoveAll(e => !e.MoveNext());
    }

    public IEnumerator<string> GetEnumerator()
    {
        while (_files.Count > 0)
        {
            var candidates = _files.Select(e => e.Current);
            var nextLine = candidates.OrderBy(c => c).First();
            for (int i = _files.Count - 1; i >= 0; i--)
            {
                while (_files[i].Current == nextLine)
                {
                    if (!_files[i].MoveNext())
                    {
                        _files.RemoveAt(i);
                        break;
                    }
                }
            }
            yield return nextLine;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
You can create a LogFileMerger using the set of input log file names and pass it directly as the IEnumerable<string> to some method like File.WriteAllLines. Using File.ReadLines should mean that the amount of memory being used for each input file is just a small buffer on each file, and we never attempt to have all of the data from any of the files loaded at any time.
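For example, a minimal usage sketch reusing the names from the question (path_list and parent_path), merging the probe.0.log group into one deduplicated output:
// Hypothetical usage: merge one group of sorted log files into a single deduplicated file.
HashSet<string> inputFiles = path_list[0];
string outputPath = Path.Combine(parent_path, "new_data", "probe.0.log");
File.WriteAllLines(outputPath, new LogFileMerger(inputFiles));   // lines stream through the merger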
(You may want to adjust the OrderBy and comparison operations in the above if there are requirements around case insensitivity, but I don't see any evidence of that in the question.)
(Note also that this class cannot be enumerated multiple times in the current design. That could be adjusted by storing the paths instead of the open enumerators in the class field and making the list of open enumerators a local inside GetEnumerator)
¹ If this is not the case, it may be more sensible to sort each file first so that this assumption is met, and then proceed with this plan.

Related

Issues with list not getting garbage collected after use in mvc

Backstory: I'm generating CSV files as reports and am testing what happens if multiple big reports are generated. The code below should generate a roughly 4 MB CSV file, yet if I request this report twice my PC hits its 16 GB of RAM, and even after the files are produced the program still uses all my RAM; only restarting the program clears it up. The RAM is mostly taken up by instances of the SMS class.
My issue is that the tmp list is never cleared/cleaned up by the garbage collector, even after the controller call is finished, which results in large amounts of RAM being used.
I can see that the RAM usage only increases when generating the tmp list, not when creating the CSV file.
(Screenshots omitted: the console output, the Visual Studio memory diagnostic after clicking download once, and a snapshot of the memory.)
private static readonly Random random = new();
private string GenerateString(int length = 30)
{
StringBuilder str_build = new StringBuilder();
Random random = new Random();
char letter;
for (int i = 0; i < length; i++)
{
double flt = random.NextDouble();
int shift = Convert.ToInt32(Math.Floor(25 * flt));
letter = Convert.ToChar(shift + 65);
str_build.Append(letter);
}
return str_build.ToString();
}
[HttpGet("SMS")]
public async Task<ActionResult> GetSMSExport([FromQuery]string phoneNumber)
{
Console.WriteLine($"Generating items: {DateTime.Now.ToLongTimeString()}");
Console.WriteLine($"Finished generating items: {DateTime.Now.ToLongTimeString()}");
Console.WriteLine($"Generating CSV: {DateTime.Now.ToLongTimeString()}");
var tmp = new List<SMS>();
// tmp never gets cleared by the garbage collector, even if its not used after the call is finished
for (int i = 0; i < 1000000; i++)
{
tmp.Add(new SMS() {
GatewayName = GenerateString(),
Message = GenerateString(),
Status = GenerateString()
});
}
// 1048574 is the max; Excel says 1048576 is the max, but because of the header and separator lines it needs to be reduced by 2
ActionResult csv = await ExportDataAsCSV(tmp, $"SMS_Report.csv");
Console.WriteLine($"Finished generating CSV: {DateTime.Now.ToLongTimeString()}");
return csv;
}
private async Task<ActionResult> ExportDataAsCSV(IEnumerable<object> listToExport, string fileName)
{
Console.WriteLine("Creating file");
if (listToExport is null || !listToExport.Any())
throw new ArgumentNullException(nameof(listToExport));
System.IO.File.Delete("Reports/" + GenerateString() + fileName);
var file = System.IO.File.Create("Reports/" + GenerateString() + fileName, 4096, FileOptions.DeleteOnClose);
var streamWriter = new StreamWriter(file, Encoding.UTF8);
await streamWriter.WriteAsync("sep=;");
await streamWriter.WriteAsync(Environment.NewLine);
var headerNames = listToExport.First().GetType().GetProperties();
foreach (var header in headerNames)
{
var displayAttribute = header.GetCustomAttributes(typeof(System.ComponentModel.DataAnnotations.DisplayAttribute),true);
if (displayAttribute.Length != 0)
{
var attribute = displayAttribute.Single() as System.ComponentModel.DataAnnotations.DisplayAttribute;
await streamWriter.WriteAsync(sharedLocalizer[attribute.Name] + ";");
}
else
await streamWriter.WriteAsync(header.Name + ";");
}
await streamWriter.WriteAsync(Environment.NewLine);
var newListToExport = listToExport.ToArray();
for (int j = 0; j < newListToExport.Length; j++)
{
object item = newListToExport[j];
var itemProperties = item.GetType().GetProperties();
for (int i = 0; i < itemProperties.Length; i++)
{
await streamWriter.WriteAsync(itemProperties[i].GetValue(item)?.ToString() + ";");
}
await streamWriter.WriteAsync(Environment.NewLine);
}
Helpers.LogHelper.Log(Helpers.LogHelper.LogType.Info, GetType(), $"User {User.Identity.Name} downloaded {fileName}");
await file.FlushAsync();
file.Position = 0;
return File(file, "text/csv", fileName);
}
public class SMS
{
[Display(Name = "Sent_at_text")]
public DateTime? SentAtUtc { get; set; }
[Display(Name = "gateway_name")]
public string GatewayName { get; set; }
[Display(Name = "message_title")]
[JsonProperty("messageText")]
public string Message { get; set; }
[Display(Name = "status_title")]
[JsonProperty("statusText")]
public string Status { get; set; }
}
The answer can be found in this post:
MVC memory issue, memory not getting cleared after controller call is finished (Example project included)
Copied answer:
The GitHub version is a fixed version for this issue, so you can explore what I did in the changeset.
Notes:
After generating a large file, you might need to download smaller files before C# releases the memory
Adding forced garbage collection helped a lot (see the sketch after these notes)
Adding a few using statements also helped a lot
Smaller existing issues:
If your object can't fit in RAM and it starts filling the pagefile, the pagefile will not shrink after use (restarting your PC helps a little but won't clear the pagefile entirely)
I couldn't get it below 400 MB of RAM usage no matter what I tried, but whether I had 5 GB or 1 GB in RAM, it would still get reduced down to ~400 MB
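A minimal illustration of the forced-collection note above, assuming the large list is the tmp list from the question; the ForceReclaim helper name is hypothetical and this is not the code from the linked changeset:
// Hypothetical helper: drop references to a large temporary list and force a full collection.
private static void ForceReclaim<T>(List<T> largeList)
{
    largeList.Clear();               // drop references to the elements (e.g. the SMS objects)
    largeList.TrimExcess();          // release the list's oversized internal array
    GC.Collect();                    // force a blocking full collection (a workaround, not a habit)
    GC.WaitForPendingFinalizers();
    GC.Collect();
}
Calling something like ForceReclaim(tmp) after the CSV result has been produced mirrors the forced-collection note.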

Assigning instance variables obtained through reflection in generic method

I have data in tab-separated values (TSV) text files that I want to read and (eventually) store in database tables. With the TSV files, each line contains one record, but in one file the record can have 2 fields, in another file 4 fields, etc. I wrote working code to handle the 2-field records, but I thought this might be a good case for a generic method (or two) rather than writing new methods for each kind of record. However, I have not been able to code this because of 2 problems: I can't create a new object for holding the record data, and I don't know how to use reflection to generically fill the instance variables of my objects.
I looked at several other similar posts, including Datatable to object by using reflection and linq
Below is the code that works (this is in Windows, if that matters) and also the code that doesn't work.
public class TSVFile
{
public class TSVRec
{
public string item1;
public string item2;
}
private string fileName = "";
public TSVFile(string _fileName)
{
fileName = _fileName;
}
public TSVRec GetTSVRec(string Line)
{
TSVRec rec = new TSVRec();
try
{
string[] fields = Line.Split(new char[1] { '\t' });
rec.item1 = fields[0];
rec.item2 = fields[1];
}
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show("Bad import data on line: " +
Line + "\n" + ex.Message, "Error",
System.Windows.Forms.MessageBoxButtons.OK,
System.Windows.Forms.MessageBoxIcon.Error);
}
return rec;
}
public List<TSVRec> ImportTSVRec()
{
List<TSVRec> loadedData = new List<TSVRec>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
{
loadedData.Add(GetTSVRec(Line));
}
}
return loadedData;
}
// *** Attempted generic methods ***
public T GetRec<T>(string Line)
{
T rec = new T(); // compile error!
Type t = typeof(T);
FieldInfo[] instanceVars = t.GetFields();
string[] fields = Line.Split(new char[1] { '\t' });
for (int i = 0; i < instanceVars.Length - 1; i++)
{
rec. ??? = fields[i]; // how do I finish this line???
}
return rec;
}
public List<T> Import<T>(Type t)
{
List<T> loadedData = new List<T>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
{
loadedData.Add(GetRec<T>(Line));
}
}
return loadedData;
}
}
I saw the line
T rec = new T();
in the above-mentioned post, but it doesn't work for me...
I would appreciate any suggestions for how to make this work, if possible. I want to learn more about using reflection with generics, so I don't only want to understand how, but also why.
I wish @EdPlunkett had posted his suggestion as an answer, rather than a comment, so I could mark it as the answer...
To summarize: to do what I want to do, there is no need for "Assigning instance variables obtained through reflection in generic method". In fact, I can have a generic solution without using a generic method:
public class GenRec
{
public List<string> items = new List<string>();
}
public GenRec GetRec(string Line)
{
GenRec rec = new GenRec();
try
{
string[] fields = Line.Split(new char[1] { '\t' });
for (int i = 0; i < fields.Length; i++)
rec.items.Add(fields[i]);
}
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show("Bad import data on line: " + Line + "\n" + ex.Message, "Error",
System.Windows.Forms.MessageBoxButtons.OK,
System.Windows.Forms.MessageBoxIcon.Error);
}
return rec;
}
public List<GenRec> Import()
{
List<GenRec> loadedData = new List<GenRec>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
loadedData.Add(GetRec(Line));
}
return loadedData;
}
I just tested this, and it works like a charm!
Of course, this isn't helping me to learn how to write generic methods or use reflection, but I'll take it...
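For completeness, a minimal sketch of the generic approach the question originally asked about, assuming the record type is a class with a parameterless constructor and public string fields; this is illustrative only, not the accepted solution above, and it needs using System.Reflection:
public T GetRec<T>(string Line) where T : class, new()    // new() constraint lets "new T()" compile
{
    T rec = new T();
    FieldInfo[] instanceVars = typeof(T).GetFields();      // note: field order is not guaranteed by reflection
    string[] fields = Line.Split('\t');
    int count = Math.Min(instanceVars.Length, fields.Length);
    for (int i = 0; i < count; i++)
    {
        instanceVars[i].SetValue(rec, fields[i]);          // fills the field instead of "rec.??? = fields[i]"
    }
    return rec;
}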

JSON Array to Entity Framework Core VERY Slow?

I'm working on a utility to read through a JSON file I've been given and to transform it into SQL Server. My weapon of choice is a .NET Core Console App (I'm trying to do all of my new work with .NET Core unless there is a compelling reason not to). I have the whole thing "working" but there is clearly a problem somewhere because the performance is truly horrifying almost to the point of being unusable.
The JSON file is approximately 27MB and contains a main array of 214 elements and each of those contains a couple of fields along with an array of from 150-350 records (that array has several fields and potentially a small <5 record array or two). Total records are approximately 35,000.
In the code below I've changed some names and stripped out a few of the fields to keep it more readable but all of the logic and code that does actual work is unchanged.
Keep in mind, I've done a lot of testing with the placement and number of calls to SaveChanges(), thinking initially that the number of trips to the DB was the problem. Although the version below calls SaveChanges() once for each iteration of the 214-record loop, I've tried moving it outside of the entire looping structure and there is no discernible change in performance. In other words, with zero trips to the DB, this is still SLOW. How slow, you ask? How does > 24 hours to run hit you? I'm willing to try anything at this point and am even considering moving the whole process into SQL Server, but I would much rather work in C# than T-SQL.
static void Main(string[] args)
{
string statusMsg = String.Empty;
JArray sets = JArray.Parse(File.ReadAllText(@"C:\Users\Public\Downloads\ImportFile.json"));
try
{
using (var _db = new WidgetDb())
{
for (int s = 0; s < sets.Count; s++)
{
Console.WriteLine($"{s.ToString()}: {sets[s]["name"]}");
// First we create the Set
Set eSet = new Set()
{
SetCode = (string)sets[s]["code"],
SetName = (string)sets[s]["name"],
Type = (string)sets[s]["type"],
Block = (string)sets[s]["block"] ?? ""
};
_db.Entry(eSet).State = Microsoft.EntityFrameworkCore.EntityState.Added;
JArray widgets = sets[s]["widgets"].ToObject<JArray>();
for (int c = 0; c < widgets.Count; c++)
{
Widget eWidget = new Widget()
{
WidgetId = (string)widgets[c]["id"],
Layout = (string)widgets[c]["layout"] ?? "",
WidgetName = (string)widgets[c]["name"],
WidgetNames = "",
ReleaseDate = releaseDate,
SetCode = (string)sets[s]["code"]
};
// WidgetColors
if (widgets[c]["colors"] != null)
{
JArray widgetColors = widgets[c]["colors"].ToObject<JArray>();
for (int cc = 0; cc < widgetColors.Count; cc++)
{
WidgetColor eWidgetColor = new WidgetColor()
{
WidgetId = eWidget.WidgetId,
Color = (string)widgets[c]["colors"][cc]
};
_db.Entry(eWidgetColor).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetTypes
if (widgets[c]["types"] != null)
{
JArray widgetTypes = widgets[c]["types"].ToObject<JArray>();
for (int ct = 0; ct < widgetTypes.Count; ct++)
{
WidgetType eWidgetType = new WidgetType()
{
WidgetId = eWidget.WidgetId,
Type = (string)widgets[c]["types"][ct]
};
_db.Entry(eWidgetType).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetVariations
if (widgets[c]["variations"] != null)
{
JArray widgetVariations = widgets[c]["variations"].ToObject<JArray>();
for (int cv = 0; cv < widgetVariations.Count; cv++)
{
WidgetVariation eWidgetVariation = new WidgetVariation()
{
WidgetId = eWidget.WidgetId,
Variation = (string)widgets[c]["variations"][cv]
};
_db.Entry(eWidgetVariation).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
}
_db.SaveChanges();
}
}
statusMsg = "Import Complete";
}
catch (Exception ex)
{
statusMsg = ex.Message + " (" + ex.InnerException + ")";
}
Console.WriteLine(statusMsg);
Console.ReadKey();
}
I had an issue with that kind of code: lots of loops and tons of changing state.
Any change or manipulation you make in the _db context will generate a "trace" of it, and that makes your context slower each time. Read more here.
The fix for me was to create a new EF context (_db) at some key points. It saved me a few hours per run!
You could try to create a new instance of _db for each iteration of the loop over the main array of 214 elements.
If that makes no change, try adding some stopwatches to get a better idea of what/where is taking so long.
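A minimal sketch of that suggestion, keeping the question's names (WidgetDb, sets) and leaving the entity-building code as it is; disabling AutoDetectChanges is an extra option on top of this answer, not something it prescribed:
for (int s = 0; s < sets.Count; s++)
{
    // A fresh context per set keeps the change tracker small.
    using (var _db = new WidgetDb())
    {
        _db.ChangeTracker.AutoDetectChangesEnabled = false;   // skip change scans on every Add
        // ... build and add the Set, Widget, WidgetColor, WidgetType, WidgetVariation entities for sets[s] ...
        _db.SaveChanges();
    }
}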
If you're making thousands of updates then EF is not really the way to go. Something like SqlBulkCopy will do the trick.
You could try the bulkwriter library.
IEnumerable<string> ReadFile(string path)
{
using (var stream = File.OpenRead(path))
using (var reader = new StreamReader(stream))
{
while (reader.Peek() >= 0)
{
yield return reader.ReadLine();
}
}
}
var items =
from line in ReadFile(@"C:\products.csv")
let values = line.Split(',')
select new Product {Sku = values[0], Name = values[1]};
then
using (var bulkWriter = new BulkWriter<Product>(connectionString)) {
bulkWriter.WriteToDatabase(items);
}

Create list of arrays from text file in C#

I have a number of text files that all follow the same content format:
"Title section","Version of the app"
10
"<thing 1>","<thing 2>","<thing 3>","<thing 4>","<thing 5>","<thing 6>","<thing 7>","<thing 8>","<thing 9>","<thing 10>"
'Where:
' first line never changes, it always contains exactly these 2 items
' second line is a count of how many "line 3s" there are
' line 3 contains a command to execute and (up to) 9 parameters
' - there will always be 10 quote-delimited entries, even if some are blank
' - there can be N number of entries (in this example, there will be 10 commands to read)
I am reading each of these text files in, using StreamReader, and want to set each file up in its own class.
public class MyTextFile{
public string[] HeaderLine { get; set; }
public int ItemCount { get; set; }
List<MyCommandLine> Commands { get; set;}
}
public class MyCommandLine{
public string[] MyCommand { get; set; }
}
private void btnGetMyFiles_Click(object sender, EventArgs e)
{
    DirectoryInfo myFolder = new DirectoryInfo(@"C:\FileSpot");
    FileInfo[] myFiles = myFolder.GetFiles("*.ses");
    string str = "";
    string line = "";
    foreach (FileInfo file in myFiles)
    {
        str = str + ", " + file.Name;
        // Read the file and display it line by line.
        System.IO.StreamReader readingFile = new System.IO.StreamReader(file.FullName);
        MyTextFile myFileObject = new MyTextFile();
        while ((line = readingFile.ReadLine()) != null)
        {
            // create the new MyTextFile here
        }
        readingFile.Close();
    }
}
The objective is to determine what the actual command being called is, and if any of the remaining parameters point to a pre-existing file, determine whether that file exists. My problem is that I can't figure out how to read N number of "line 3" entries into their own objects and append these objects to the MyTextFile object. I'm 99% certain that I've led myself astray in reading each file line by line, but I don't know how to get out of it.
So, addressing the specific issue of getting N number of line 3 items into your class, you could do something like this (obviously you can make some changes so it is more specific to your application).
public class MyTextFile
{
public List<Array> Commands = new List<Array>();
public void EnumerateCommands()
{
for (int i = 0; i < Commands.Count; i++)
{
foreach (var c in Commands[i])
Console.Write(c + " ");
Console.WriteLine();
}
}
}
class Program
{
static void Main(string[] args)
{
string line = "";
int count = 0;
MyTextFile tf = new MyTextFile();
using (StreamReader sr = new StreamReader(#"path"))
{
while ((line = sr.ReadLine()) != null)
{
count += 1;
if (count >= 3)
{
object[] Arguments = line.Split(',');
tf.Commands.Add(Arguments);
}
}
}
tf.EnumerateCommands();
Console.ReadLine();
}
}
At least now you have a list of commands within your 'MyTextFile' class that you can enumerate through and do stuff with.
** I added the EnumerateCommands method so that you could actually see the list is storing the line items. The code should run in a Console application with the appropriate 'using' statements.
Hope this helps.
If all of the items are separated with the comma sign ',' you can just do something like:
int length = Convert.ToInt32 (reader.ReadLine ());
string line = reader.ReadLine ();
IEnumerable<string> things = line.Split(',').Select(thing => thing.Replace("\"", string.Empty)).Take(length);
Take indicates how many things to take from the line.

How to avoid c# File.ReadLines First() locking file

I do not want to read the whole file at any point; I know there are answers on that question. I want to read the first or last line.
I know that my code locks the file it's reading, for two reasons: 1) the application that writes to the file crashes intermittently when I run my little app with this code, but it never crashes when I am not running this code; 2) there are a few articles that will tell you that File.ReadLines locks the file.
There are some similar questions, but those answers seem to involve reading the whole file, which is slow for large files and therefore not what I want to do. My requirement to only read the last line most of the time is also different from what I have read about.
I need to know how to read the first line (header row) and the last line (latest row). I do not want to read all lines at any point in my code because this file can become huge and reading the entire file would become slow.
I know that
line = File.ReadLines(fullFilename).First().Replace("\"", "");
... is the same as ...
FileStream fs = new FileStream(@fullFilename, FileMode.Open, FileAccess.Read, FileShare.Read);
My question is: how can I repeatedly read the first and last lines of a file which may be being written to by another application, without locking it in any way? I have no control over the application that is writing to the file. It is a data log which can be appended to at any time. The reason I am listening in this way is that this log can be appended to for days on end. I want to see the latest data in this log in my own C# program without waiting for the log to finish being written to.
My code to call the reading / listening function ...
//Start Listening to the "data log"
private void btnDeconstructCSVFile_Click(object sender, EventArgs e)
{
MySandbox.CopyCSVDataFromLogFile copyCSVDataFromLogFile = new MySandbox.CopyCSVDataFromLogFile();
copyCSVDataFromLogFile.checkForLogData();
}
My class which does the listening. For now it simply adds the data to two generic lists ...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using MySandbox.Classes;
using System.IO;
namespace MySandbox
{
public class CopyCSVDataFromLogFile
{
static private List<LogRowData> listMSDataRows = new List<LogRowData>();
static String fullFilename = string.Empty;
static LogRowData previousLineLogRowList = new LogRowData();
static LogRowData logRowList = new LogRowData();
static LogRowData logHeaderRowList = new LogRowData();
static Boolean checking = false;
public void checkForLogData()
{
//Initialise
string[] logHeaderArray = new string[] { };
string[] badDataRowsArray = new string[] { };
//Get the latest full filename (file with new data)
//Assumption: only 1 file is written to at a time in this directory.
String directory = "C:\\TestDir\\";
string pattern = "*.csv";
var dirInfo = new DirectoryInfo(directory);
var file = (from f in dirInfo.GetFiles(pattern) orderby f.LastWriteTime descending select f).First();
fullFilename = directory + file.ToString(); //This is the full filepath and name of the latest file in the directory!
if (logHeaderArray.Length == 0)
{
//Populate the Header Row
logHeaderRowList = getRow(fullFilename, true);
}
LogRowData tempLogRowList = new LogRowData();
if (!checking)
{
//Read the latest data in an asynchronous loop
callDataProcess();
}
}
private async void callDataProcess()
{
checking = true; //Begin checking
await checkForNewDataAndSaveIfFound();
}
private static Task checkForNewDataAndSaveIfFound()
{
return Task.Run(() => //Call the async "Task"
{
while (checking) //Loop (asynchronously)
{
LogRowData tempLogRowList = new LogRowData();
if (logHeaderRowList.ValueList.Count == 0)
{
//Populate the Header row
logHeaderRowList = getRow(fullFilename, true);
}
else
{
//Populate Data row
tempLogRowList = getRow(fullFilename, false);
if ((!Enumerable.SequenceEqual(tempLogRowList.ValueList, previousLineLogRowList.ValueList)) &&
(!Enumerable.SequenceEqual(tempLogRowList.ValueList, logHeaderRowList.ValueList)))
{
logRowList = getRow(fullFilename, false);
listMSDataRows.Add(logRowList);
previousLineLogRowList = logRowList;
}
}
//System.Threading.Thread.Sleep(10); //Wait for next row.
}
});
}
private static LogRowData getRow(string fullFilename, bool isHeader)
{
string line;
string[] logDataArray = new string[] { };
LogRowData logRowListResult = new LogRowData();
try
{
if (isHeader)
{
//Assign first (header) row data.
//Works but seems to block writing to the file!!!!!!!!!!!!!!!!!!!!!!!!!!!
line = File.ReadLines(fullFilename).First().Replace("\"", "");
}
else
{
//Assign data as last row (default behaviour).
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
}
logDataArray = line.Split(',');
//Copy Array to Generics List and remove last value if it's empty.
for (int i = 0; i < logDataArray.Length; i++)
{
if (i < logDataArray.Length)
{
if (i < logDataArray.Length - 1)
{
//Value is not at the end, from observation, these always have a value (even if it's zero) and so we'll store the value.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else
{
//This is the last value
if (logDataArray[i].Replace("\"", "").Trim().Length > 0)
{
//In this case, the last value is not empty, store it as normal.
logRowListResult.ValueList.Add(logDataArray[i]);
}
else { /*The last value is empty, e.g. "123,456,"; the final comma denotes another field but this field is empty so we will ignore it now. */ }
}
}
}
}
catch (Exception ex)
{
if (ex.Message == "Sequence contains no elements")
{ /*Empty file, no problem. The code will safely loop and then will pick up the header when it appears.*/ }
else
{
//TODO: catch this error properly
Int32 problemID = 10; //Unknown ERROR.
}
}
return logRowListResult;
}
}
}
I found the answer in a combination of other questions: one answer explaining how to read from the end of a file, which I adapted so that it would read only one line from the end of the file, and another explaining how to read the entire file without locking it (I did not want to read the entire file, but the not-locking part was useful). So now you can read the last line of the file (if it contains end-of-line characters) without locking it. For other end-of-line delimiters, just replace my 10 and 13 with your end-of-line character bytes...
Add the method below to public class CopyCSVDataFromLogFile
private static string Reverse(string str)
{
char[] arr = new char[str.Length];
for (int i = 0; i < str.Length; i++)
arr[i] = str[str.Length - 1 - i];
return new string(arr);
}
and replace this line ...
line = File.ReadLines(fullFilename).Last().Replace("\"", "");
with this code block ...
Int32 endOfLineCharacterCount = 0;
Int32 previousCharByte = 0;
Int32 currentCharByte = 0;
//Read the file, from the end, for 1 line, allowing other programs to access it for read and write!
using (FileStream reader = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 0x1000, FileOptions.SequentialScan))
{
int i = 0;
StringBuilder lineBuffer = new StringBuilder();
int byteRead;
while ((-i < reader.Length) /*Belt and braces: if there were no end of line characters, reading beyond the file would give a catastrophic error here (to be avoided thus).*/
&& (endOfLineCharacterCount < 2)/*Exit Condition*/)
{
reader.Seek(--i, SeekOrigin.End);
byteRead = reader.ReadByte();
currentCharByte = byteRead;
//Exit condition: the first 2 characters we read (reading backwards, remember) were end-of-line bytes (LF then CR).
//So when we read the second end of line, we have read 1 whole line (the last line in the file)
//and we must exit now.
if (currentCharByte == 13 && previousCharByte == 10)
{
endOfLineCharacterCount++;
}
if (byteRead == 10 && lineBuffer.Length > 0)
{
line += Reverse(lineBuffer.ToString());
lineBuffer.Remove(0, lineBuffer.Length);
}
lineBuffer.Append((char)byteRead);
previousCharByte = byteRead;
}
reader.Close();
}
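The header row can be read the same way without locking, by opening the stream with FileShare.ReadWrite and reading a single line. The helper below is hypothetical (not part of the original answer) and could stand in for the File.ReadLines(...).First() call in getRow:
private static string ReadFirstLineShared(string fullFilename)
{
    //Open with FileShare.ReadWrite (as above) so the writer is not blocked, then read only the header row.
    using (FileStream fs = new FileStream(fullFilename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 0x1000, FileOptions.SequentialScan))
    using (StreamReader sr = new StreamReader(fs))
    {
        return sr.ReadLine(); //Only the first buffered chunk is read; the rest of the file is untouched.
    }
}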
