Unity crashes when reading a large CSV file - C#

I need to find a way to read information out of a very big CSV file in Unity. The file has approximately 15000 × 4000 entries, is almost 200 MB, and could grow even larger.
Just using ReadAllLines on the file does kind of work, but as soon as I try to do any operation on it, it crashes. Here is the code I am using; it just counts all non-zero values and already crashes. It's okay if the code needs some loading time, but it shouldn't crash. I assume it's because I keep everything in memory and therefore flood my RAM? Any ideas how to fix this so that it won't crash?
private void readCSV()
{
    string[] lines = File.ReadAllLines("Assets/Datasets/testCsv.csv");
    foreach (string line in lines)
    {
        List<string> values = line.Split(',').ToList();
        int i = 0;
        foreach (string val in values)
        {
            if (val != "0")
            {
                i++;
            }
        }
    }
}

As I already stated in your other question, you should rather go with a streamed solution in order to not load the entire file into memory at all.
Also, both file IO and string.Split are slow, especially for so many entries! Rather use a background thread / async Task for this!
The next possible future issue: in your case 15000 × 4000 entries means a total of 60,000,000 cells, which is still fine. However, the maximum value of int is 2,147,483,647, so if your file grows further the counter might overflow / behave unexpectedly => rather use e.g. uint or directly ulong to avoid that issue.
private async Task<ulong> CountNonZeroEntries()
{
    ulong count = 0;
    // Using a stream reader you can load the content into memory one line at a time
    using (var sr = new StreamReader("Assets/Datasets/testCsv.csv"))
    {
        while (true)
        {
            var line = await sr.ReadLineAsync();
            if (line == null) break;
            var values = line.Split(',');
            foreach (var v in values)
            {
            if (v != "0") count++;
            }
        }
    }
    return count;
}
And then of course you would need to await the result, e.g. using
// If you declare Start as async, Unity automatically calls it asynchronously
private async void Start()
{
    var count = await CountNonZeroEntries();
    Debug.Log($"{count} cells are != \"0\".");
}
The same can be done using Linq, which is a bit easier to write in my eyes:
using System.Linq;
...
private Task<ulong> CountNonZeroEntries()
{
    // SelectMany flattens all cells of all lines into a single sequence
    return Task.Run(() => (ulong)File.ReadLines("Assets/Datasets/testCsv.csv")
        .SelectMany(line => line.Split(','))
        .LongCount(v => v != "0"));
}
Also, File.ReadLines doesn't load the entire content at once but rather returns a lazy enumerable, so you can run Linq queries over the lines one by one.
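To make the difference concrete, here is a small sketch of mine (using the same hard-coded path as above) contrasting the two access patterns:
// Loads all lines into one big string[] up front (what ReadAllLines in the question does):
string[] everything = File.ReadAllLines("Assets/Datasets/testCsv.csv");

// Yields one line at a time; only the current line is held in memory,
// so the ~200 MB file is never fully materialized at once:
foreach (var line in File.ReadLines("Assets/Datasets/testCsv.csv"))
{
    // process `line` here, e.g. line.Split(',')
}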


I am using ExistsAsync to check whether the CloudBlockBlob exists but it is taking too much time

foreach (var lst in list)
{
    CloudBlobDirectory dira = container.GetDirectoryReference(folderName + "/" + prefix);
    bool isExists = dira.GetBlockBlobReference(filename + ".png").ExistsAsync().Result;
    if (isExists)
    {
        // create sas url using cloud block url
    }
}
I am using this code to check for each path whether the blob exists, but ExistsAsync() is taking too much time.
I have also tried GetBlobClient in the loop, but it was also slow.
Is there any other way to check whether a block blob exists?
I am using the latest version (12) of Azure.Storage.Blobs.
It's not entirely clear what OP wants to do. One naive way to try and make blob queries faster is to execute multiple of them in parallel.
Note that this example actually uses Azure.Storage.Blobs, while OP seems to be using the older WindowsAzure.Storage SDK despite stating otherwise.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

var blobClient = new BlobServiceClient("<connection string>");
var container = blobClient.GetBlobContainerClient("<container name>");

// However your list looks like, I'm assuming it's partial or full blob paths
var potentialBlobs = new List<string>() { "<some>/<blob>/<path>" };

// Amount to execute in parallel at a time; which value yields the best results depends on your system.
// Could be calculated based on Environment.ProcessorCount too
// (note that this can return 0 or negative values sometimes, so you have to check)
const int chunkSize = 4;

var existingBlobs = new List<string>();
foreach (var chunk in potentialBlobs.Chunk(chunkSize))
{
    // Create multiple tasks
    var tasks = chunk.Select(async path => // potentialBlobs list item
    {
        var exists = await container
            .GetBlockBlobClient(path) // adjust path however needed, I don't understand what OP wants here
            .ExistsAsync();
        return exists ? path : null;
    });

    // Wait for all tasks in the chunk to finish
    var results = await Task.WhenAll(tasks);

    // Append the paths that exist to the result list
    foreach (var cur in results)
    {
        if (cur is not null)
            existingBlobs.Add(cur);
    }
}
There does not seem to be an Exists method available in the BatchClient of Azure.Storage.Blobs.Batch. As far as I'm aware it's mostly for adding, updating or deleting blobs, not for checking their existence, so I think it is not viable for this scenario, but I might have missed something.
If you cannot get acceptable performance this way, you'll have to store the blob paths in something like Table Storage or Cosmos DB that is faster to query against. Depending on what you need, you might also be able to cache the result, or store it somewhere and continuously update it as new blobs get added.
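If the blobs you need to check share a common prefix, another rough sketch (my own idea, not part of the original answer, and assuming the listing under that prefix is not excessively large) is to list the container once inside an async method and then answer every existence check from an in-memory set:
// Sketch: enumerate every blob under the prefix once, then check existence locally.
// Assumes `container` is the BlobContainerClient from above and that blob names
// are built as folderName + "/" + prefix + "/" + filename + ".png" like in the question.
var knownBlobs = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
await foreach (var item in container.GetBlobsAsync(prefix: folderName + "/" + prefix))
{
    knownBlobs.Add(item.Name);
}

// Each check is now a set lookup instead of an HTTP round trip:
bool exists = knownBlobs.Contains(folderName + "/" + prefix + "/" + filename + ".png");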

log4net MemoryAppender Out of memory

I'm using log4net with a MemoryAppender.
When I try to read all the lines into a variable (here: a StringBuilder) I get an OutOfMemoryException when the number of lines is too high. I've tested it with one million lines:
public class RenderingMemoryAppender : MemoryAppender
{
    public IEnumerable<string> GetRenderedEvents(List<Level> levelList = null)
    {
        foreach (var loggingEvent in GetEvents())
        {
            yield return RenderLoggingEvent(loggingEvent);
        }
    }

    public byte[] GetEventsAsByteArray(List<Level> levelList = null)
    {
        var events = GetRenderedEvents(levelList);
        var s = new StringBuilder();
        foreach (var e in events)
        {
            s.Append(e);
        }
        // the exception is thrown here below when calling s.ToString()
        return Encoding.UTF8.GetBytes(s.ToString());
    }
}
When I simply add one million lines to a StringBuilder without the log4net component, everything works fine...
I've also tried to use this:
var stringBuilder = new StringBuilder();
var stringWriter = new StringWriter(stringBuilder);
foreach (var loggingEvent in GetEvents())
{
    stringBuilder.Clear();
    loggingEvent.WriteRenderedMessage(stringWriter);
    list.Add(stringBuilder.ToString());
}
but this also didn't work.
If you want to have that many lines in memory as a single string, the runtime has to allocate one contiguous piece of memory that can hold the whole string. If that allocation is not possible at that moment, the OutOfMemoryException is thrown. If you need the bytes from a MemoryStream, it is more efficient to call ToArray() on the MemoryStream, but that can also fail for the same reason ToString fails on the StringBuilder. Check whether you are running in 64-bit mode; if not, switching to it can give you more address space. I would advise rethinking your logging method. The way you are doing it now is unreliable and can even break your program.
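As a rough sketch of what such a rethink could look like (my own illustration, not the original poster's code): write each rendered event straight into a stream instead of concatenating everything into one giant string, so no single huge allocation is needed. This method would live inside the RenderingMemoryAppender shown above:
public void WriteRenderedEventsTo(Stream output)
{
    // leaveOpen: true so the caller keeps control of the stream's lifetime
    using (var writer = new StreamWriter(output, Encoding.UTF8, 4096, leaveOpen: true))
    {
        foreach (var loggingEvent in GetEvents())
        {
            // each event is rendered and written individually,
            // so memory usage stays bounded by the largest single event
            writer.Write(RenderLoggingEvent(loggingEvent));
        }
    }
}
The caller can then pass a FileStream (or a MemoryStream, if the total size is known to fit) and never needs the full log text in one contiguous buffer.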

SharpCompress unpacking locks files

I'm trying to make a process where I unpack a set of rar files using SharpCompress and then delete them.
I first have a foreach loop which loops over the different sets of files to be unpacked, like this:
foreach (var package in extractionPackages.Where(package => Extractor.ExtractToFolder(package, destinationFolder)))
{
    FileHelper.DeleteExtractionFiles(package);
}
The extraction process is taken straight from the SharpCompress test code and goes like this:
public static bool ExtractToFolder(ExtractionPackage extractionPackage, string extractionPath)
{
    var fullExtractionPath = Path.Combine(extractionPath, extractionPackage.FolderName);
    try
    {
        using (var reader = RarReader.Open(extractionPackage.ExtractionFiles.Select(p => File.OpenRead(p.FullPath))))
        {
            while (reader.MoveToNextEntry())
            {
                reader.WriteEntryToDirectory(fullExtractionPath, ExtractOptions.ExtractFullPath | ExtractOptions.Overwrite);
            }
        }
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}
As you can see in the first code block, I then call to delete the files, but there I receive an error because the files are locked by another process:
The process cannot access the file 'file.rar' because it is being used by another process.
If I try to put the deletion after the foreach loop, I am able to delete all but the last set of files if there is more than one set. If there is just one, the same issue occurs, since the last set of files never seems to be "unlocked".
How can I structure the code so that the files are unlocked?
By using an example from Nunrar and modifying it a bit, I finally seem to have solved the issue:
public static bool ExtractToFolder(ExtractionPackage extractionPackage, string extractionPath)
{
    var fullExtractionPath = Path.Combine(extractionPath, extractionPackage.FolderName);
    try
    {
        using (var reader = RarReader.Open(GetStreams(extractionPackage))) // extractionPackage.ExtractionFiles.Select(p => File.OpenRead(p.FullPath)), Options.None))
        {
            while (reader.MoveToNextEntry())
            {
                reader.WriteEntryToDirectory(fullExtractionPath, ExtractOptions.ExtractFullPath | ExtractOptions.Overwrite);
            }
        }
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

private static IEnumerable<Stream> GetStreams(ExtractionPackage package)
{
    foreach (var item in package.ExtractionFiles)
    {
        using (Stream input = File.OpenRead(item.FullPath))
        {
            yield return input;
        }
    }
}
By default (and probably a flaw), I leave the streams open, as I did not create them myself. However, the .NET Framework usually breaks this rule with things like StreamReader etc., so it's expected that they'd be closed.
I'm thinking I'm going to change this behavior. In the meantime, you ought to be able to fix this yourself by calling this:
RarReader.Open(extractionPackage.ExtractionFiles.Select(p => File.OpenRead(p.FullPath)), Options.None)
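For completeness, here is a sketch of my own (not from the original answer) that keeps explicit control over the stream lifetimes, assuming the same ExtractionPackage shape and the RarReader.Open(IEnumerable<Stream>, Options) overload shown above:
// Open the volume streams yourself, extract, then dispose every stream
// so the .rar files are no longer locked when DeleteExtractionFiles runs.
var streams = extractionPackage.ExtractionFiles
    .Select(p => (Stream)File.OpenRead(p.FullPath))
    .ToList();
try
{
    using (var reader = RarReader.Open(streams, Options.None))
    {
        while (reader.MoveToNextEntry())
        {
            reader.WriteEntryToDirectory(fullExtractionPath, ExtractOptions.ExtractFullPath | ExtractOptions.Overwrite);
        }
    }
}
finally
{
    foreach (var s in streams)
    {
        s.Dispose();
    }
}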
I had similar trouble. I used SharpCompress for adding files to an archive, and I needed to delete the files after archiving. I tried wrapping the streams with using() {}, but that solution didn't work in my case. When I added Application.DoEvents() to my code, it solved the problem.

Undetectable memory leak

I have a Stock class which loads lots of stock data history from a file (about 100 MB). I have a Pair class that takes two Stock objects and calculates some statistical relations between the two then writes the results to file.
In my main method I have a loop going through a list of pairs of stocks (about 500). It creates 2 stock objects and then a pair object out of the two. At this point the pair calculations are written to file and I'm done with the objects. I need to free the memory so I can go on with the next calculation.
In addition to setting the 3 objects to null, I have added the following two lines at the end of the loop:
GC.Collect(GC.MaxGeneration);
GC.WaitForPendingFinalizers();
Stepping over these two lines seems to free only about 50 MB of the 200-300 MB that is allocated on every loop iteration (watching it in Task Manager).
The program gets through about eight or ten pairs before it gives me a System.OutOfMemoryException. The memory usage steadily increases until it crashes at about 1.5 GB. (This is an 8 GB machine running Win7 Ultimate.)
I don't have much experience with garbage collection. Am I doing something wrong?
Here's my code, since you asked. (Note: the program has two modes: 1) add mode, in which new pairs are added to the system, and 2) regular mode, which updates the pair files in real time based on FileSystemWatcher events. The stock data is updated by an external app called QCollector.)
This is the segment in MainForm which runs in Add Mode:
foreach (string line in PairList)
{
    string[] tokens = line.Split(',');
    stockA = new Stock(QCollectorPath, tokens[0].ToUpper());
    stockB = new Stock(QCollectorPath, tokens[1].ToUpper());
    double ratio = double.Parse(tokens[2]);
    Pair p = new Pair(QCollectorPath, stockA, stockB, ratio);
    // at this point the pair is written to file (constructor handles this)

    // commenting out the following lines of code since they don't fix the problem
    // stockA = null;
    // stockB = null;
    // p = null;

    // refraining from forced collection since that's not the problem
    // GC.Collect(GC.MaxGeneration);
    // GC.WaitForPendingFinalizers();

    // so far the only way I can fix the problem is by setting the Pair class's
    // references to StockA and StockB to null
    p.Kill();
}
I am adding more code as per request: Stock and Pair are subclasses of TimeSeries, which holds the common functionality.
public abstract class TimeSeries
{
    protected List<string> data;

    // the following Create method must be implemented by subclasses (Stock, Pair, etc...)
    // as each class is created differently, although their data formatting is identical
    protected abstract List<string> Create();

    // . . .

    public void LoadFromFile()
    {
        data = new List<string>();
        List<StreamReader> srs = GetAllFiles();
        foreach (StreamReader sr in srs)
        {
            List<string> temp = TurnFileIntoListString(sr);
            data = new List<string>(temp.Concat(data));
            sr.Close();
        }
    }

    // uses directory naming scheme (according to data month/year) to find files of a symbol
    protected List<StreamReader> GetAllFiles()...

    public static List<string> TurnFileIntoListString(StreamReader sr)
    {
        List<string> list = new List<string>();
        string line;
        while ((line = sr.ReadLine()) != null)
            list.Add(line);
        return list;
    }

    // this is the only means to access a TimeSeries object's data
    // this is to prevent deadlocks caused by time-consuming methods such as Pair's Create
    public string[] GetListCopy()
    {
        lock (data)
        {
            string[] listCopy = new string[data.Count];
            data.CopyTo(listCopy);
            return listCopy;
        }
    }
}
public class Stock : TimeSeries
{
    public Stock(string dataFilePath, string symbol, FileSystemWatcher fsw = null)
    {
        DataFilePath = dataFilePath;
        Name = symbol.ToUpper();
        LoadFromFile();
        // to update stock data when the external app updates the files
        if (fsw != null) fsw.Changed += new FileSystemEventHandler(fsw_Changed);
    }

    protected override List<string> Create()
    {
        // stock files created by external application
    }

    // . . .
}
public class Pair : TimeSeries
{
    public Pair(string dataFilePath, Stock stockA, Stock stockB, double ratio)
    {
        // assign parameters to local members
        // ...
        if (FileExists())
            LoadFromFile();
        else
            Create();
    }

    protected override List<string> Create()
    {
        // since a stock can get updated by the FileSystemWatcher's event handler,
        // a copy is obtained from the stock object's data
        string[] listA = StockA.GetListCopy();
        string[] listB = StockB.GetListCopy();
        List<string> listP = new List<string>();
        int i, j;
        i = GetFirstValidBar(listA);
        j = GetFirstValidBar(listB);
        DateTime dtA, dtB;
        dtA = GetDateTime(listA[i]);
        dtB = GetDateTime(listB[j]);
        // this hidden segment adjusts i and j until they start at the same datetime,
        // since stocks can have different amounts of data
        while (i < listA.Length && j < listB.Length)
        {
            double priceA = GetPrice(listA[i]);
            double priceB = GetPrice(listB[j]);
            double priceP = priceA * ratio - priceB;
            listP.Add(String.Format("{0},{1:0.00},{2:0.00},{3:0.00}"
                , dtA
                , priceP
                , priceA
                , priceB
            ));
            if (i < j)
                i++;
            else if (j < i)
                j++;
            else
            {
                i++;
                j++;
            }
        }
        return listP;
    }

    public void Kill()
    {
        data = null;
        stockA = null;
        stockB = null;
    }
}
Your memory leak is here:
if (fsw != null) fsw.Changed += new FileSystemEventHandler(fsw_Changed);
The instance of the stock object will be kept in memory as long as the FileSystemWatcher is alive, since it is responding to an event of the FileSystemWatcher.
I think that you want to either implement that event somewhere else, or at some other point in your code add a:
if (fsw != null) fsw.Changed -= fsw_Changed;
Given the way the code is written, it might be that the Stock object is intended to be constructed without a FileSystemWatcher in cases where bulk processing is done.
In the original code that you posted, the constructors of the Stock classes were being called with a FileSystemWatcher. You have changed that now. I think you will find that now, with a null FileSystemWatcher, you can remove your Kill call and you will not have a leak, since you are no longer listening to fsw.Changed.
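To make the unsubscription concrete, here is a minimal sketch of my own (not the original code; WatchedStock is a hypothetical name, and the usual System and System.IO usings are assumed): keep the watcher reference and detach the handler when the object is retired, so the watcher no longer roots the instance:
public class WatchedStock : IDisposable
{
    private readonly FileSystemWatcher fsw;

    public WatchedStock(FileSystemWatcher fsw = null)
    {
        this.fsw = fsw;
        // ... load the stock data as before ...
        if (this.fsw != null) this.fsw.Changed += fsw_Changed;
    }

    private void fsw_Changed(object sender, FileSystemEventArgs e)
    {
        // reload / update the stock data here
    }

    public void Dispose()
    {
        // removing the subscription removes the watcher's reference to this instance,
        // so the object (and its ~100 MB of data) becomes eligible for garbage collection
        if (fsw != null) fsw.Changed -= fsw_Changed;
    }
}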

Adding an AsParallel() call causes my code to break on writing a file

I'm building a console application that has to process a bunch of documents.
To keep it simple, the process is:
for each year between X and Y, query the DB to get a list of document references to process
for each of these references, process a local file
The process method is, I think, independent and should be parallelizable as soon as the input args are different:
private static bool ProcessDocument(
    DocumentsDataset.DocumentsRow d,
    string langCode
)
{
    try
    {
        var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
        var htmFullPath = Path.Combine(@"x:\path", htmFileName);
        var missingHtmlFile = !File.Exists(htmFullPath);
        if (!missingHtmlFile)
        {
            var html = File.ReadAllText(htmFullPath);
            // ProcessHtml is quite long: it uses a regex search for a list of references,
            // which are other documents, then sends the result to a custom WS
            ProcessHtml(ref html);
            File.WriteAllText(htmFullPath, html);
        }
        return true;
    }
    catch (Exception exc)
    {
        Trace.TraceError("{0,8}Fail processing {1} : {2}", "[FATAL]", d.UniqueDocRef, exc.ToString());
        return false;
    }
}
In order to enumerate my documents, I have this method:
private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year => {
        return Document.FindAll((short)year).Documents;
    });
}
Document is a business class that wraps the retrieval of documents. The output of this method is a typed dataset (I'm returning the Documents table). The method takes a year, and I'm sure a document can't be returned by more than one year (the year is actually part of the key).
Note the use of AsParallel() here; I never had an issue with this one.
Now, my main method is:
var documents = EnumerateDocuments();
var result = documents.Select(d => {
    bool success = true;
    foreach (var langCode in new string[] { "-e", "-f" })
    {
        success &= ProcessDocument(d, langCode);
    }
    return new {
        d.UniqueDocRef,
        success
    };
});

using (var sw = File.CreateText("summary.csv"))
{
    sw.WriteLine("Level;UniqueDocRef");
    foreach (var item in result)
    {
        string level;
        if (!item.success) level = "[ERROR]";
        else level = "[OK]";
        sw.WriteLine(
            "{0};{1}",
            level,
            item.UniqueDocRef
        );
        //sw.WriteLine(item);
    }
}
This method works as expected in this form. However, if I replace
var documents = EnumerateDocuments();
with
var documents = EnumerateDocuments().AsParallel();
it stops working, and I don't understand why.
The error appears exactly here (in my process method):
File.WriteAllText(htmFullPath, html);
It tells me that the file is already opened by another program.
I don't understand what can cause my program not to work as expected. Since my documents variable is an IEnumerable returning unique values, why does my process method break?
Thanks for any advice.
[Edit] Code for retrieving documents:
/// <summary>
/// Get all documents in data store
/// </summary>
public static DocumentsDS FindAll(short? year)
{
    Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
    DbCommand cm = db.GetStoredProcCommand("Document_Select");
    if (year.HasValue) db.AddInParameter(cm, "Year", DbType.Int16, year.Value);
    string[] tableNames = { "Documents", "Years" };
    DocumentsDS ds = new DocumentsDS();
    db.LoadDataSet(cm, ds, tableNames);
    return ds;
}
[Edit2] Possible source of my issue, thanks to mquander. If I write:
var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
var testGr = test.GroupBy(d => d).Select(d => new { d.Key, Count = d.Count() }).Where(c => c.Count > 1);
var testLst = testGr.ToList();
Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result:
0
1758
Removing the AsParallel returns the same output.
Conclusion: my EnumerateDocuments has something wrong and returns each document twice.
I have to dig into that, I think.
This is probably my source enumeration that is at fault.
I suggest you have each task put the file data into a global queue and have a separate thread take writing requests from that queue and do the actual writing.
Anyway, the performance of writing in parallel to a single disk is much worse than writing sequentially, because the disk needs to seek to each new writing location, so you are just bouncing the disk around between seeks. It's better to do the writes sequentially.
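A rough sketch of that producer/consumer idea (my own illustration, assuming the write payload is just the target path plus the processed HTML):
// Writer queue sketch: the parallel workers enqueue (path, content) pairs,
// and a single consumer task performs all disk writes sequentially.
var writeQueue = new BlockingCollection<(string Path, string Content)>();

var writerTask = Task.Run(() =>
{
    foreach (var item in writeQueue.GetConsumingEnumerable())
    {
        File.WriteAllText(item.Path, item.Content);
    }
});

// Inside ProcessDocument, instead of calling File.WriteAllText(htmFullPath, html) directly:
// writeQueue.Add((htmFullPath, html));

// Once all documents have been processed:
writeQueue.CompleteAdding();
writerTask.Wait();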
Is Document.FindAll((short)year).Documents thread-safe? Because the difference between the first and the second version is that in the second (broken) version, this call runs multiple times concurrently. That could plausibly be the cause of the issue.
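One way to test that theory (a sketch of mine, not part of the answer): keep the database enumeration single-threaded and only parallelize the per-document processing:
// Fetch the documents for each year one at a time (no concurrent FindAll calls),
// then hand the fully materialized list to the parallel processing pipeline.
var documents = Enumerable.Range(1990, 2020 - 1990)
    .SelectMany(year => Document.FindAll((short)year).Documents)
    .ToList();

// From here on, documents.AsParallel() only parallelizes ProcessDocument,
// never the data access itself.
If the duplicates and the file locking disappear with this version, the concurrent FindAll calls were indeed the problem.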
It sounds like you're trying to write to the same file. Only one thread/program can write to a file at a given time, so you can't use Parallel for that part.
If you're reading from the same file, then you need to open it with read-only access so as not to put a write lock on it.
The simplest way to fix the issue is to place a lock around your File.WriteAllText call, assuming the writing is fast and it's worth parallelizing the rest of the code.
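For illustration, a minimal sketch of that locking approach (my own; the WriteDocument helper is hypothetical, and ProcessDocument would call it instead of File.WriteAllText directly):
private static readonly object WriteLock = new object();

private static void WriteDocument(string path, string html)
{
    // only one thread may write at a time; the regex/WS work in ProcessDocument
    // still runs fully in parallel outside this lock
    lock (WriteLock)
    {
        File.WriteAllText(path, html);
    }
}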
