Adding an AsParallel() call causes my code to break on writing a file - C#

I'm building a console application that has to process a bunch of documents.
To keep it simple, the process is:
for each year between X and Y, query the DB to get a list of document references to process
for each of these references, process a local file
The process method is, I think, independent and should be parallelizable as long as the input args are different:
private static bool ProcessDocument(
    DocumentsDataset.DocumentsRow d,
    string langCode
)
{
    try
    {
        var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
        var htmFullPath = Path.Combine(@"x:\path", htmFileName);
        var missingHtmlFile = !File.Exists(htmFullPath);
        if (!missingHtmlFile)
        {
            var html = File.ReadAllText(htmFullPath);
            // ProcessHtml is quite long: it uses a regex search for a list of references
            // which are other documents, then sends the result to a custom WS
            ProcessHtml(ref html);
            File.WriteAllText(htmFullPath, html);
        }
        return true;
    }
    catch (Exception exc)
    {
        Trace.TraceError("{0,8}Fail processing {1} : {2}", "[FATAL]", d.UniqueDocRef, exc.ToString());
        return false;
    }
}
In order to enumerate my documents, I have this method:
private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year => {
        return Document.FindAll((short)year).Documents;
    });
}
Document is a business class that wraps the retrieval of documents. The output of this method is a typed dataset (I'm returning the Documents table). The method takes a year, and I'm sure a document can't be returned for more than one year (year is actually part of the key).
Note the use of AsParallel() here; I never had an issue with this one.
Now, my main method is:
var documents = EnumerateDocuments();
var result = documents.Select(d => {
    bool success = true;
    foreach (var langCode in new string[] { "-e", "-f" })
    {
        success &= ProcessDocument(d, langCode);
    }
    return new {
        d.UniqueDocRef,
        success
    };
});
using (var sw = File.CreateText("summary.csv"))
{
    sw.WriteLine("Level;UniqueDocRef");
    foreach (var item in result)
    {
        string level;
        if (!item.success) level = "[ERROR]";
        else level = "[OK]";
        sw.WriteLine(
            "{0};{1}",
            level,
            item.UniqueDocRef
        );
        //sw.WriteLine(item);
    }
}
This code works as expected in this form. However, if I replace
var documents = EnumerateDocuments();
by
var documents = EnumerateDocuments().AsParallel();
it stops working, and I don't understand why.
The error appears exactly here (in my process method):
File.WriteAllText(htmFullPath, html);
It tells me that the file is already open in another program.
I don't understand what can cause my program not to work as expected. Since my documents variable is an IEnumerable returning unique values, why is my process method breaking?
Thanks for any advice.
[Edit] Code for retrieving documents:
/// <summary>
/// Get all documents in data store
/// </summary>
public static DocumentsDS FindAll(short? year)
{
    Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
    DbCommand cm = db.GetStoredProcCommand("Document_Select");
    if (year.HasValue) db.AddInParameter(cm, "Year", DbType.Int16, year.Value);
    string[] tableNames = { "Documents", "Years" };
    DocumentsDS ds = new DocumentsDS();
    db.LoadDataSet(cm, ds, tableNames);
    return ds;
}
[Edit2] Possible source of my issue, thanks to mquander. If I write:
var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
var testGr = test.GroupBy(d => d).Select(d => new { d.Key, Count = d.Count() }).Where(c => c.Count > 1);
var testLst = testGr.ToList();
Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result :
0
1758
Removing the AsParallel() gives the same output.
Conclusion: my EnumerateDocuments has something wrong and returns each document twice.
I have to dig into this, I think.
My source enumeration is probably the cause.
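If the duplicates are indeed the issue, one possible workaround while I dig (just a sketch, not a fix of the underlying query) would be to materialize the enumeration once and deduplicate on UniqueDocRef before the parallel processing:

// Sketch only: keep a single row per UniqueDocRef so the same file
// is never processed by two tasks at the same time.
var documents = EnumerateDocuments()
    .GroupBy(d => d.UniqueDocRef)
    .Select(g => g.First())
    .ToList();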

I suggest you have each task put the file data into a global queue and have a dedicated thread take writing requests from the queue and do the actual writing.
Anyway, the performance of writing in parallel to a single disk is much worse than writing sequentially, because the disk needs to spin to seek the next writing location, so you are just bouncing the disk around between seeks. It's better to do the writes sequentially.
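A minimal sketch of that idea, using a BlockingCollection as the global queue (the tuple shape and variable names are illustrative, not from the question's code):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// The global queue: each item is a path plus the already-processed HTML.
var queue = new BlockingCollection<(string Path, string Content)>();

// A single consumer does all the disk writes sequentially.
var writer = Task.Run(() =>
{
    foreach (var req in queue.GetConsumingEnumerable())
        File.WriteAllText(req.Path, req.Content);
});

// Producers (the parallel ProcessDocument calls) enqueue instead of writing:
// queue.Add((htmFullPath, html));

// Once all producers have finished:
queue.CompleteAdding();
writer.Wait();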

Is Document.FindAll((short)year).Documents thread-safe? The difference between the first and the second version is that in the second (broken) version, this call is running multiple times concurrently. That could plausibly be the cause of the issue.
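If that call is indeed not thread-safe, a minimal sketch of one way to test the hypothesis is to serialize just the retrieval while keeping the rest parallel (dbLock is an illustrative name, not from the original code):

private static readonly object dbLock = new object();

private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year =>
    {
        // Only the (possibly non-thread-safe) DB call is serialized here;
        // the per-document processing stays parallel.
        lock (dbLock)
        {
            return Document.FindAll((short)year).Documents;
        }
    });
}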

It sounds like you're trying to write to the same file. Only one thread/program can write to a file at a given time, so you can't use Parallel there.
If you're reading from the same file, then you need to open the file with read-only access so as not to put a write lock on it.
The simplest way to fix the issue is to place a lock around your File.WriteAllText, assuming the writing is fast and it's worth parallelizing the rest of the code.
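A minimal sketch of that lock, applied to the write in the question's ProcessDocument method (writeLock is an illustrative name):

private static readonly object writeLock = new object();

// Inside ProcessDocument, after ProcessHtml(ref html):
lock (writeLock)
{
    File.WriteAllText(htmFullPath, html);
}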


I am using ExistsAsync to check whether a CloudBlockBlob exists, but it is taking too much time

foreach (var lst in list)
{
    CloudBlobDirectory dira = container.GetDirectoryReference(folderName + "/" + prefix);
    bool isExists = dira.GetBlockBlobReference(filename + ".png").ExistsAsync().Result;
    if (isExists)
    {
        // create sas url using cloud block url
    }
}
I am using this code to check whether a blob exists for each path, but ExistsAsync() is taking too much time.
I have also tried GetBlobClient in a loop, but it was also slow.
Is there any other way to check whether a block blob exists?
I am using the latest version (12) of Azure.Storage.Blobs.
It's not entirely clear what the OP wants to do. One naive way to try to make blob queries faster is to execute several of them in parallel.
Note that this example actually uses Azure.Storage.Blobs, while the OP seems to be using Windows.Azure.Storage despite stating otherwise.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

var blobClient = new BlobServiceClient("<connection string>");
var container = blobClient.GetBlobContainerClient("<container name>");

// However your list looks, I'm assuming it's partial or full blob paths
var potentialBlobs = new List<string>() { "<some>/<blob>/<path>" };

// Amount to execute in parallel at a time; which value yields better results depends on your system.
// Could be calculated based on Environment.ProcessorCount too (note that this can return 0 or negative values sometimes, so you have to check)
const int chunkSize = 4;

var existingBlobs = new List<string>();
foreach (var chunk in potentialBlobs.Chunk(chunkSize))
{
    // Create multiple tasks
    var tasks = chunk.Select(async path => // potentialBlobs list item
    {
        var exists = await container
            .GetBlockBlobClient(path) // adjust path however needed, I don't understand what the OP wants here
            .ExistsAsync();
        return exists ? path : null;
    });
    // Wait for the tasks to finish
    var results = await Task.WhenAll(tasks);
    // Append to the result list
    foreach (var cur in results)
    {
        if (cur is not null)
            existingBlobs.Add(cur);
    }
}
There does not seem to be an Exists method available in the BatchClient of Azure.Storage.Blobs.Batch. As far as I'm aware it is mostly for adding, updating or deleting blobs, not for checking their existence, so I don't think it is viable for this scenario, but I might have missed something.
If you cannot get acceptable performance this way, you'll have to store the blob paths in something like Table Storage or Cosmos DB that is faster to query against. Depending on what you need, you might also be able to cache the result, or store the result somewhere and continuously update it as new blobs get added.
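As a minimal sketch of the caching idea, assuming the container (or the relevant prefix) is small enough to list once: fetch all blob names with GetBlobsAsync and answer existence checks from an in-memory set (existingNames and the prefix are illustrative):

using System.Collections.Generic;
using Azure.Storage.Blobs;

// Inside an async method:
var container = new BlobServiceClient("<connection string>")
    .GetBlobContainerClient("<container name>");

// List the container once (optionally restricted by prefix) and cache the names.
var existingNames = new HashSet<string>();
await foreach (var item in container.GetBlobsAsync(prefix: "<folder>/"))
{
    existingNames.Add(item.Name);
}

// Subsequent existence checks are now in-memory lookups instead of HTTP calls.
bool exists = existingNames.Contains("<folder>/<file>.png");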

Unity crashes when reading a large csv file

I need to find a way to read information out of a very big CSV file in Unity. The file is approx. 15000*4000 entries, almost 200 MB, and could grow even larger.
Just using ReadAllLines on the file does kind of work, but as soon as I try to do any operation on it, it crashes. Here is the code I am using, which just counts all non-zero values and already crashes it. It's okay if the code needs some loading time, but it shouldn't crash. I assume it's because I keep everything in memory and therefore flood my RAM? Any ideas how to fix this so it won't crash?
private void readCSV()
{
    string[] lines = File.ReadAllLines("Assets/Datasets/testCsv.csv");
    foreach (string line in lines)
    {
        List<string> values = line.Split(',').ToList();
        int i = 0;
        foreach (string val in values)
        {
            if (val != "0")
            {
                i++;
            }
        }
    }
}
As I already stated in your other question, you should rather go with a streamed solution in order not to load the entire thing into memory at all.
Also, both file IO and string.Split are slow, especially for so many entries! Rather use a background thread / async Task for this!
The next possible future issue: in your case 15000*4000 entries means a total of 60,000,000 cells, which is still fine. However, the maximum value of int is 2,147,483,647, so if your file grows further it might break or behave unexpectedly => rather use e.g. uint or directly ulong to avoid that issue.
private async Task<ulong> CountNonZeroEntries()
{
    ulong count = 0;
    // Using a stream reader you can load the content into memory one line at a time
    using (var sr = new StreamReader("Assets/Datasets/testCsv.csv"))
    {
        while (true)
        {
            var line = await sr.ReadLineAsync();
            if (line == null) break;
            var values = line.Split(',');
            foreach (var v in values)
            {
                if (v != "0") count++;
            }
        }
    }
    return count;
}
And then, of course, you would need to wait for the result, e.g. using
// If you declare Start as async, Unity automatically calls it asynchronously
private async void Start()
{
    var count = await CountNonZeroEntries();
    Debug.Log($"{count} cells are != \"0\".");
}
The same can be done using Linq, which is a bit easier to write in my eyes:
using System.Linq;
...
private Task<ulong> CountNonZeroEntries()
{
    // SelectMany flattens the split cells; Aggregate counts them into a ulong off the main thread
    return Task.Run(() => File.ReadLines("Assets/Datasets/testCsv.csv")
        .SelectMany(line => line.Split(','))
        .Aggregate(0UL, (count, v) => v != "0" ? count + 1UL : count));
}
Also, File.ReadLines doesn't load the entire content at once but rather returns a lazy enumerable, so you can run Linq queries over the lines one by one.

Problem speeding up report rendering with FastReports

I'm working on report rendering with FastReports, coming from a system that rendered with Crystal Reports. With Crystal, I found that preloading a report and then binding parameters on request sped things up dramatically, since most of the time for a small layout like an invoice is in the setup. I'm now trying to achieve the same with FastReports.
It's unclear how much time setup takes, however, so I'd also be interested in whether this is a worthwhile endeavour.
My issue is that I have used a JSON API call and a ConnectionStringExpression with a single parameter. In a nutshell, changing the parameter does not reload the data when I call Prepare.
Here's my code; with the second report load commented out, it renders the same report twice.
var report = new Report();
report.Load("C:\\dev\\ia\\products\\StratusCloud\\AppFiles\\Reports\\SalesQuoteItems.frx");

var urlTemplate = "http://localhost:9502/data/sales-quote/{CardCode#}/{DocEntry#}";
var reportParms = new Dictionary<string, dynamic>();
reportParms.Add("CardCode#", "C20000");
reportParms.Add("DocEntry#", 77);

var connectionstring = "Json=" + System.Text.RegularExpressions.Regex.Replace(urlTemplate, "{([^}]+)}", (m) => {
    if (reportParms.ContainsKey(m.Groups[1].Value))
    {
        return string.Format("{0}", reportParms[m.Groups[1].Value]);
    }
    return m.Value;
});

var dataapiparm = report.Dictionary.Parameters.FindByName("DataAPIUrl#");
if (dataapiparm != null)
{
    dataapiparm.Value = connectionstring;
}
foreach (FastReport.Data.Parameter P in report.Dictionary.Parameters)
{
    if (reportParms.ContainsKey(P.Name))
    {
        P.Value = reportParms[P.Name];
    }
}

report.Prepare();
var pdfExport = new PDFSimpleExport();
pdfExport.Export(report, "test1.pdf");

//report = new Report();
//report.Load("C:\\dev\\ia\\products\\StratusCloud\\AppFiles\\Reports\\SalesQuoteItems.frx");

reportParms["DocEntry#"] = 117;
connectionstring = "Json=" + System.Text.RegularExpressions.Regex.Replace(urlTemplate, "{([^}]+)}", (m) => {
    if (reportParms.ContainsKey(m.Groups[1].Value))
    {
        return string.Format("{0}", reportParms[m.Groups[1].Value]);
    }
    return m.Value;
});

dataapiparm = report.Dictionary.Parameters.FindByName("DataAPIUrl#");
if (dataapiparm != null)
{
    dataapiparm.Value = connectionstring;
}
foreach (FastReport.Data.Parameter P in report.Dictionary.Parameters)
{
    if (reportParms.ContainsKey(P.Name))
    {
        P.Value = reportParms[P.Name];
    }
}

report.Prepare();
pdfExport.Export(report, "test2.pdf");
Cheers,
Mark
FastReports definitely doesn't recalculate the ConnectionStringExpression on report.Prepare, so I went back to another method that I was looking at. It turns out that if the ConnectionString itself is rewritten, then report.Prepare does refetch the data.
A simple connection string without a schema takes a long time to process, so I remove everything beyond the semicolon and keep it, replace the URL portion of the connection string, and then stick the same schema information back on the end.
Copying the schema information into each report-generation connection string seems to shave around 10 seconds off report.Prepare!
At the moment it's the best that I can do, and I wonder if there is a more efficient way of rerunning the same report against new data (having the same schema).
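A minimal sketch of that rewrite (the helper name and the assumption that everything after the first semicolon is the schema portion are illustrative, not taken from FastReports documentation):

// Rebuilds "Json=<new url>;<schema part>" from a previously captured connection string.
static string RebuildConnectionString(string previousConnectionString, string newUrl)
{
    // Keep everything after the first ';' (assumed to be the cached schema information).
    var separatorIndex = previousConnectionString.IndexOf(';');
    var schemaPart = separatorIndex >= 0
        ? previousConnectionString.Substring(separatorIndex)
        : string.Empty;

    return "Json=" + newUrl + schemaPart;
}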

ReplaceOneAsync() immediately after InsertOneAsync() not always working, even when journaled

On a single-instance MongoDB server, even with the write concern on the client set to journaled, one in every couple of thousand documents isn't replaceable immediately after inserting.
I was under the impression that once journaled, documents are immediately available for querying.
The code below inserts a document, then updates the DateModified property of the document and tries to update the document based on the document's Id and the old value of that property.
public class MyDocument
{
    public BsonObjectId Id { get; set; }
    public DateTime DateModified { get; set; }
}

static void Main(string[] args)
{
    var r = Task.Run(MainAsync);
    Console.WriteLine("Inserting documents... Press any key to exit.");
    Console.ReadKey(intercept: true);
}

private static async Task MainAsync()
{
    var client = new MongoClient("mongodb://localhost:27017");
    var database = client.GetDatabase("updateInsertedDocuments");
    var concern = new WriteConcern(journal: true);
    var collection = database.GetCollection<MyDocument>("docs").WithWriteConcern(concern);
    int errorCount = 0;
    int totalCount = 0;
    do
    {
        totalCount++;
        // Create and insert the document
        var document = new MyDocument
        {
            DateModified = DateTime.Now,
        };
        await collection.InsertOneAsync(document);
        // Save and update the modified date
        var oldDateModified = document.DateModified;
        document.DateModified = DateTime.Now;
        // Try to update the document by Id and the earlier DateModified
        var result = await collection.ReplaceOneAsync(d => d.Id == document.Id && d.DateModified == oldDateModified, document);
        if (result.ModifiedCount == 0)
        {
            Console.WriteLine($"Error {++errorCount}/{totalCount}: doc {document.Id} did not have DateModified {oldDateModified.ToString("yyyy-MM-dd HH:mm:ss.ffffff")}");
            await DoesItExist(collection, document, oldDateModified);
        }
    }
    while (true);
}
The code inserts at a rate of around 250 documents per second. One in around 1,000-15,000 calls to ReplaceOneAsync(d => d.Id == document.Id && d.DateModified == oldDateModified, ...) fails, as it returns a ModifiedCount of 0. The failure rate depends on whether we run a Debug or Release build and whether a debugger is attached: more speed means more errors.
The code shown represents something that I can't easily change. Of course I'd rather perform a series of Update.Set() calls, but that's not really an option right now. The InsertOneAsync() followed by a ReplaceOneAsync() is abstracted by some kind of repository pattern that updates entities by reference. The non-async counterparts of the methods display the same behavior.
A simple Thread.Sleep(100) between inserting and replacing mitigates the problem.
When the query fails and we wait a while and then attempt to query the document again in the code below, it is found every time.
private static async Task DoesItExist(IMongoCollection<MyDocument> collection, MyDocument document, DateTime oldDateModified)
{
    Thread.Sleep(500);
    var fromDatabaseCursor = await collection.FindAsync(d => d.Id == document.Id && d.DateModified == oldDateModified);
    var fromDatabaseDoc = await fromDatabaseCursor.FirstOrDefaultAsync();
    if (fromDatabaseDoc != null)
    {
        Console.WriteLine("But it was found!");
    }
    else
    {
        Console.WriteLine("And wasn't found!");
    }
}
Versions on which this occurs:
MongoDB Community Server 3.4.0, 3.4.1, 3.4.3, 3.4.4 and 3.4.10, all on WiredTiger storage engine
Server runs on Windows, other OSes as well
C# Mongo Driver 2.3.0 and 2.4.4
Is this an issue in MongoDB, or are we doing (or assuming) something wrong?
Or, the actual end goal, how can I ensure an insert is immediately retrievable by an update?
ReplaceOneAsync returns a ModifiedCount of 0 if the new document is identical to the old one (because nothing changed).
It looks to me like, if your test executes fast enough, the two calls to DateTime.Now could return the same value, so it is possible that you are passing the exact same document to InsertOneAsync and ReplaceOneAsync.
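A minimal sketch of how to tell the two situations apart in the loop above: MatchedCount shows whether the filter found the document at all, while ModifiedCount stays 0 when the replacement is identical to what is stored.

var result = await collection.ReplaceOneAsync(
    d => d.Id == document.Id && d.DateModified == oldDateModified,
    document);

if (result.MatchedCount == 0)
{
    // The filter found nothing: the document really wasn't visible yet (or the values differ).
    Console.WriteLine("Not matched.");
}
else if (result.ModifiedCount == 0)
{
    // Found, but the replacement was identical, so nothing was written.
    Console.WriteLine("Matched but not modified: old and new documents are identical.");
}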

C# directory scan performance

I have a folder structure on a network drive that is
Booking Centre -> Facility -> Files
eg
EUR/12345678/File_archive1.txt
EUR/12345678/File_archive2.txt
EUR/12345678/File_latest.txt
EUR/5555/File_archive1.txt
EUR/5555/File_archive2.txt
EUR/5555/File_latest.txt
When a user selects a booking centre from the drop-down, I want the code to look in the above network path for that booking centre, look at all its subfolders, find the most recent file in each subfolder, and use that to populate a list of portfolios for a second drop-down. It is incredibly slow, though; my code is given below. Can anyone suggest a faster approach?
public IDictionary<string, Portfolio> ReadPortfolios()
{
    var portfolios = new Dictionary<string, Portfolio>();
    var di = new DirectoryInfo(PortfolioPath);
    var possibleFacilities = di.GetDirectories();
    foreach (var possibleFacility in possibleFacilities)
    {
        try
        {
            if (possibleFacility.GetFiles().Any())
            {
                var mostRecentFile = possibleFacility.GetFiles().OrderBy(file => file.LastWriteTimeUtc).Last();
                var portfolio = UnzipAndReadPortfolio(mostRecentFile);
                if (portfolio == null) continue;
                portfolios.Add(possibleFacility.Name, portfolio);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(@"Failed to read portfolio: " + ex.Message);
        }
    }
    return portfolios;
}
If you're interested in all subdirectories of PortfolioPath, try using the overload of GetDirectories and/or GetFiles that lets you pass the SearchOption.AllDirectories parameter: it avoids multiple round trips to the network.
You also have TWO calls to GetFiles() in your loop; you should store the result of the first call in a local variable instead, as in the sketch below.
You don't provide the code of UnzipAndReadPortfolio, which is maybe the slowest part (... or not?).
Remember: in your code you can often think "one method call = one network access". So try to flatten your loops, reduce file-system access, etc.
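A minimal sketch of that change, following the structure of the question's ReadPortfolios loop (error handling omitted): call GetFiles() once per facility and reuse the result.

foreach (var possibleFacility in di.GetDirectories())
{
    // One network round trip per facility instead of two.
    var files = possibleFacility.GetFiles();
    if (files.Length == 0) continue;

    var mostRecentFile = files.OrderBy(file => file.LastWriteTimeUtc).Last();
    var portfolio = UnzipAndReadPortfolio(mostRecentFile);
    if (portfolio == null) continue;
    portfolios.Add(possibleFacility.Name, portfolio);
}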
A probably small but real performance gain:
var mostRecentFile = possibleFacility.GetFiles()
.OrderBy(file => file.LastWriteTimeUtc)
.LastOrDefault();
if(mostRecentFile != null)
....
and comment out the first
// if(possibleFacility.GetFiles().Any())
The most obvious thing:
Every time you call possibleFacility.GetFiles() you get all files within the folder.
You should call it once, save the result in a variable, and then use that variable.
