I have a folder structure on a network drive that is
Booking Centre -> Facility -> Files
e.g.
EUR/12345678/File_archive1.txt
EUR/12345678/File_archive2.txt
EUR/12345678/File_latest.txt
EUR/5555/File_archive1.txt
EUR/5555/File_archive2.txt
EUR/5555/File_latest.txt
When a user selects a booking centre from the dropdown, I want the code to look in the above network path for that booking centre, look at all its subfolders, find the most recent file in each subfolder, and use those files to populate a list of portfolios for a second dropdown. My code, given below, is incredibly slow though. Can anyone suggest a faster approach?
public IDictionary<string, Portfolio> ReadPortfolios()
{
    var portfolios = new Dictionary<string, Portfolio>();
    var di = new DirectoryInfo(PortfolioPath);
    var possibleFacilities = di.GetDirectories();

    foreach (var possibleFacility in possibleFacilities)
    {
        try
        {
            if (possibleFacility.GetFiles().Any())
            {
                var mostRecentFile = possibleFacility.GetFiles().OrderBy(file => file.LastWriteTimeUtc).Last();
                var portfolio = UnzipAndReadPortfolio(mostRecentFile);
                if (portfolio == null) continue;

                portfolios.Add(possibleFacility.Name, portfolio);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Failed to read portfolio: " + ex.Message);
        }
    }

    return portfolios;
}
If you're interested in all subdirectories of PortfolioPath, use the overload of GetDirectories and/or GetFiles that accepts the SearchOption.AllDirectories parameter: it avoids multiple round trips to the network.
You also have TWO calls to GetFiles() in your loop; store the result of the first call in a local variable instead.
You don't provide the code of UnzipAndReadPortfolio, which is maybe the slowest part (... or not?).
Remember: in your code you can often think "one method call = one network access". So try to flatten your loops, reduce file-system access, etc.
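Putting those suggestions together, here is a minimal sketch of what that could look like, reusing PortfolioPath, UnzipAndReadPortfolio and the one-level Booking Centre/Facility layout from your question: fetch everything below PortfolioPath in one pass, then group and sort locally in memory.
public IDictionary<string, Portfolio> ReadPortfolios()
{
    var portfolios = new Dictionary<string, Portfolio>();
    var di = new DirectoryInfo(PortfolioPath);

    // One pass over the network share instead of one GetFiles() per facility;
    // grouping and sorting then happen locally in memory.
    var newestPerFacility = di
        .EnumerateFiles("*", SearchOption.AllDirectories)
        .GroupBy(file => file.Directory.Name)
        .Select(g => new { Facility = g.Key, File = g.OrderByDescending(f => f.LastWriteTimeUtc).First() });

    foreach (var entry in newestPerFacility)
    {
        try
        {
            var portfolio = UnzipAndReadPortfolio(entry.File);
            if (portfolio != null) portfolios.Add(entry.Facility, portfolio);
        }
        catch (Exception ex)
        {
            Console.WriteLine("Failed to read portfolio: " + ex.Message);
        }
    }

    return portfolios;
}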
A small but probably real performance gain:
var mostRecentFile = possibleFacility.GetFiles()
.OrderBy(file => file.LastWriteTimeUtc)
.LastOrDefault();
if(mostRecentFile != null)
....
and comment out the first
// if(possibleFacility.GetFiles().Any())
The most obvious thing:
Every time you call possibleFacility.GetFiles() you fetch all files within the folder again.
Call it once, save the result in a variable, and then use that variable.
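A minimal sketch of the revised loop body, reusing the names from the question:
// One network call per facility instead of two.
var files = possibleFacility.GetFiles();
var mostRecentFile = files.OrderBy(file => file.LastWriteTimeUtc).LastOrDefault();

if (mostRecentFile != null)
{
    var portfolio = UnzipAndReadPortfolio(mostRecentFile);
    if (portfolio != null) portfolios.Add(possibleFacility.Name, portfolio);
}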
foreach (var lst in list)
{
    CloudBlobDirectory dira = container.GetDirectoryReference(folderName + "/" + prefix);
    bool isExists = dira.GetBlockBlobReference(filename + ".png").ExistsAsync().Result;
    if (isExists)
    {
        // create sas url using cloud block url
    }
}
I am using this code to check whether a blob exists for each path, but ExistsAsync() is taking too much time.
I have also tried GetBlobClient in the loop, but it was also taking time.
Is there any other way to check whether a block blob exists?
I am using the latest version (12) of Azure.Storage.Blobs.
It's not entirely clear what the OP wants to do. One naive way to try to make blob queries faster is to execute several of them in parallel.
Note that this example actually uses Azure.Storage.Blobs, while the OP's loop seems to be using the older WindowsAzure.Storage types despite stating otherwise.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

var blobClient = new BlobServiceClient("<connection string>");
var container = blobClient.GetBlobContainerClient("<container name>");

// However your list looks like, I'm assuming it's partial or full blob paths
var potentialBlobs = new List<string>() { "<some>/<blob>/<path>" };

// Amount to execute in parallel at a time; which value yields the best results depends on your system.
// Could be calculated based on Environment.ProcessorCount too (note that this can return 0 or negative values sometimes, so you have to check).
const int chunkSize = 4;

var existingBlobs = new List<string>();
foreach (var chunk in potentialBlobs.Chunk(chunkSize))
{
    // Create multiple tasks
    var tasks = chunk.Select(async path => // potentialBlobs list item
    {
        var exists = await container
            .GetBlockBlobClient(path) // adjust path however needed, I don't understand what OP wants here
            .ExistsAsync();
        return exists.Value ? path : null;
    });

    // Wait for tasks to finish
    var results = await Task.WhenAll(tasks);

    // Append to result list
    foreach (var cur in results)
    {
        if (cur is not null)
            existingBlobs.Add(cur);
    }
}
There does not seem to be an Exists method available in the BatchClient of Azure.Storage.Blobs.Batch. As far as I'm aware it's mostly for adding, updating or deleting blobs, not for checking their existence, so I think it is not viable for this scenario, but I might have missed something.
If you cannot get acceptable performance this way, you'll have to store the blob paths in something like Table Storage or Cosmos DB that is faster to query against. Depending on what you need, you might also be able to cache the result, or store the result somewhere and continuously update it as new blobs get added.
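If the candidate paths share a common prefix, another option worth measuring is to list the blobs once and check existence locally. A minimal sketch, reusing the container and potentialBlobs variables from above (the prefix placeholder is yours to fill in):
// One listing call per prefix instead of one ExistsAsync() per blob.
var existing = new HashSet<string>();
await foreach (var blob in container.GetBlobsAsync(prefix: "<shared>/<prefix>"))
{
    existing.Add(blob.Name);
}

var existingBlobs = potentialBlobs.Where(path => existing.Contains(path)).ToList();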
I'm trying to run this method. It works fine, but every time, after some hundreds of internal iterations, I get an Out of Memory exception:
...
MNDBEntities db = new MNDBEntities();

var regs = new List<DOCUMENTS>();

var query = from reg in db.DOCUMENTS
            where reg.TAG_KEYS.Any(p => p.TAG_DATE_VALUES.FirstOrDefault().TAG_DATE_VALUE.HasValue
                && p.TAG_DATE_VALUES.FirstOrDefault().TAG_DATE_VALUE.Value.Year == 2012)
            select reg;

var pages = new List<string>();
foreach (var item in query)
{
    Document cert = new Document();

    var tags = item.TAG_KEYS;
    foreach (var tag in tags)
    {
        // Basic stuff...
    }

    var pagesS = item.PAGES;
    foreach (var page in pagesS)
    {
        var path = @"C:\Kumquat\" + (int)page.NUMBER + ".vpimg";
        File.WriteAllBytes(path, page.IMAGE);
        pages.Add(path);
        Console.WriteLine(path);
    }

    //cms.Save(cert, pages.ToArray()).Wait();

    foreach (var pageFile in pages)
        File.Delete(pageFile);

    pagesS = null;
    pages.Clear();
}
...
I'm pretty sure the problem is related to File.WriteAllBytes or File.Delete, because if I comment those lines out the method runs without exception. What I'm doing is basically getting some tags from a DB plus a document image; that image is then saved onto disk, then stored into a CMS, and then deleted from disk. I honestly can't figure out what I'm doing wrong with those File calls. Any ideas?
This is what PerfView shows:
This is what the Visual Studio 2012 profiler shows as the hot spot. The thing is, this is all generated code (within the entity model); am I doing something wrong, maybe with the properties of the model?
Try using PerfView (http://www.microsoft.com/en-us/download/details.aspx?id=28567) to profile your code, focusing on GC events and CLR managed allocation tick events.
page.IMAGE could be the problem. Most likely it allocates a byte array and never deletes it. It would be best to change the code to something like:
page.WriteTo(path);
The rest of the code shown looks fine. The only other possible problem is large object allocation, which could lead to fragmentation problems in the LOH.
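If page.WriteTo (or something like it) is not available on your model, a more drastic option is to bypass the entity for the image column and stream it from the database yourself. A minimal sketch, assuming direct ADO.NET access and hypothetical table/column names (PAGES, IMAGE, NUMBER) matching the model:
using System.Data;
using System.Data.SqlClient;
using System.IO;

static void WriteImageToDisk(string connectionString, int pageNumber, string path)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT IMAGE FROM PAGES WHERE NUMBER = @n", conn))
    {
        cmd.Parameters.AddWithValue("@n", pageNumber);
        conn.Open();

        // SequentialAccess streams the column instead of buffering the whole byte[].
        using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
        using (var file = File.Create(path))
        {
            if (!reader.Read()) return;

            var buffer = new byte[8192]; // small buffer, stays out of the large object heap
            long offset = 0;
            long read;
            while ((read = reader.GetBytes(0, offset, buffer, 0, buffer.Length)) > 0)
            {
                file.Write(buffer, 0, (int)read);
                offset += read;
            }
        }
    }
}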
I seem to be having a timing issue when renaming images and then redisplaying them. In my code I use System.IO.File.Move twice to rename some images in a directory. Later I try to retrieve a list of files in the directory, but when I do so I get some file names that existed after the first rename and some that existed after the second rename. How do I ensure I get only file names that exist after the second rename? I have contemplated putting in a Thread.Sleep(), but that feels like a hack. In case it helps, I'm using MVC3.
public ActionResult UpdateImages()
{
    foreach (var file in directory)
        System.IO.File.Move("oldname", "newname");

    foreach (var file in directory)
        System.IO.File.Move("oldname", "newname");

    return RedirectToAction("Images", "Manager", new { id = Id });
}
public ViewResult Images(int id)
{
    var di = new DirectoryInfo(Server.MapPath("something"));
    var files = di.GetFileSystemInfos("*-glr.jpg");
    var orderedFiles = files.OrderBy(f => f.Name);

    var images = new List<string>();
    images.AddRange(orderedFiles.Select(fileSystemInfo => fileSystemInfo.Name));

    ViewData["Images"] = images;
    return View();
}
Edit:
I wish I could remove my own question. It seems I have solved this, and the answer isn't even related to the information I provided in the question.
It turns out I was sending both a GET and a POST to the server. The POST kicked off the work, but the response from the POST got aborted since the GET was also fired. Since the GET finished quickly, it would catch the system in an in-between state.
The offending line of code was an anchor element that had both an href and a JavaScript click handler (through jQuery) attached to it.
I'm building a console application that has to process a bunch of documents.
To keep it simple, the process is:
for each year between X and Y, query the DB to get a list of document references to process
for each of these references, process a local file
The ProcessDocument method is, I think, independent and should be parallelizable as long as the input args are different:
private static bool ProcessDocument(
    DocumentsDataset.DocumentsRow d,
    string langCode
)
{
    try
    {
        var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
        var htmFullPath = Path.Combine(@"x:\path", htmFileName);

        var missingHtmlFile = !File.Exists(htmFullPath);
        if (!missingHtmlFile)
        {
            var html = File.ReadAllText(htmFullPath);

            // ProcessHtml is quite long: it uses a regex search for a list of references
            // which are other documents, then sends the result to a custom WS
            ProcessHtml(ref html);

            File.WriteAllText(htmFullPath, html);
        }

        return true;
    }
    catch (Exception exc)
    {
        Trace.TraceError("{0,8}Fail processing {1} : {2}", "[FATAL]", d.UniqueDocRef, exc.ToString());
        return false;
    }
}
In order to enumerate my documents, I have this method:
private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year => {
        return Document.FindAll((short)year).Documents;
    });
}
Document is a business class that wraps the retrieval of documents. The output of this method is a typed dataset (I'm returning the Documents table). The method takes a year, and I'm sure a document can't be returned by more than one year (the year is part of the key, actually).
Note the use of AsParallel() here; I never had an issue with this one.
Now, my main method is:
var documents = EnumerateDocuments();

var result = documents.Select(d => {
    bool success = true;
    foreach (var langCode in new string[] { "-e", "-f" })
    {
        success &= ProcessDocument(d, langCode);
    }
    return new {
        d.UniqueDocRef,
        success
    };
});

using (var sw = File.CreateText("summary.csv"))
{
    sw.WriteLine("Level;UniqueDocRef");
    foreach (var item in result)
    {
        string level;
        if (!item.success) level = "[ERROR]";
        else level = "[OK]";

        sw.WriteLine(
            "{0};{1}",
            level,
            item.UniqueDocRef
        );
        //sw.WriteLine(item);
    }
}
This method works as expected in this form. However, if I replace
var documents = EnumerateDocuments();
with
var documents = EnumerateDocuments().AsParallel();
it stops working, and I don't understand why.
The error appears exactly here (in my process method):
File.WriteAllText(htmFullPath, html);
It tells me that the file is already opened by another program.
I don't understand what can cause my program not to work as expected. As my documents variable is an IEnumerable returning unique values, why is my process method breaking?
Thanks for any advice.
[Edit] Code for retrieving documents:
/// <summary>
/// Get all documents in data store
/// </summary>
public static DocumentsDS FindAll(short? year)
{
    Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
    DbCommand cm = db.GetStoredProcCommand("Document_Select");
    if (year.HasValue) db.AddInParameter(cm, "Year", DbType.Int16, year.Value);

    string[] tableNames = { "Documents", "Years" };

    DocumentsDS ds = new DocumentsDS();
    db.LoadDataSet(cm, ds, tableNames);
    return ds;
}
[Edit 2] A possible source of my issue, thanks to mquander. If I write:
var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
var testGr = test.GroupBy(d => d).Select(d => new { d.Key, Count = d.Count() }).Where(c => c.Count > 1);
var testLst = testGr.ToList();
Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result:
0
1758
Removing the AsParallel returns the same output.
Conclusion: my EnumerateDocuments has something wrong and returns each document twice.
I'll have to dig into this; the source enumeration is probably the cause.
I suggest you have each task put the file data into a global queue and have a separate thread take writing requests from the queue and do the actual writing.
Anyway, the performance of writing in parallel to a single disk is much worse than writing sequentially, because the disk needs to spin to seek the next write location, so you are just bouncing the disk around between seeks. It's better to do the writes sequentially.
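A minimal sketch of that queue with BlockingCollection, reusing the path/HTML pair your ProcessDocument method writes (the names are otherwise made up):
using System.Collections.Concurrent;

// Shared queue between the processing tasks and the single writer thread.
var queue = new BlockingCollection<KeyValuePair<string, string>>();

// The writer is the only code touching the disk, so writes stay sequential.
var writer = Task.Factory.StartNew(() =>
{
    foreach (var item in queue.GetConsumingEnumerable())
        File.WriteAllText(item.Key, item.Value);
});

// In ProcessDocument, instead of File.WriteAllText(htmFullPath, html):
// queue.Add(new KeyValuePair<string, string>(htmFullPath, html));

// Once all documents have been processed:
queue.CompleteAdding();
writer.Wait();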
Is Document.FindAll((short)year).Documents thread-safe? The difference between the first and the second version is that in the second (broken) version, this call runs multiple times concurrently. That could plausibly be the cause of the issue.
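One way to test that theory, sketched against the EnumerateDocuments method from the question: serialize just the FindAll call and materialize its rows inside the lock, leaving the rest of the pipeline parallel. (This assumes the typed Documents table enumerates as DocumentsRow, as your original return statement implies.)
private static readonly object FindAllLock = new object();

private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year =>
    {
        // Only one thread at a time may call into FindAll; ToList() copies the
        // rows inside the lock so enumeration happens on a private snapshot.
        lock (FindAllLock)
        {
            return Document.FindAll((short)year).Documents.ToList();
        }
    });
}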
It sounds like you're trying to write to the same file from multiple threads. Only one thread/program can write to a file at a given time, so you can't use Parallel there.
If you're reading from the same file, you need to open it with read-only permissions so as not to put a write lock on it.
The simplest way to fix the issue is to place a lock around your File.WriteAllText, assuming the writing is fast and it's worth parallelizing the rest of the code.
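A minimal sketch of that lock, applied to the write in ProcessDocument:
private static readonly object WriteLock = new object();

// In ProcessDocument, replace the plain write with:
lock (WriteLock)
{
    File.WriteAllText(htmFullPath, html);
}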
I have images in the folder Images in my Windows Phone solution. How can I get a collection of the images in this folder? The Build Action of all the images is "Content".
It had been bugging me that it wasn't possible to do this, so I've done a bit of digging and have come up with a way of getting a list of all image files with the Build Action of "Resource". Yes, this isn't quite what was asked for, but hopefully it will still be useful.
If you really must use a Build Action of "Content", I'd use a T4 script to generate the list of files at build time. (This is what I do in one of my projects and it works fine.)
Assuming that the images are in a folder called "images" you can get them with the following:
var listOfImageResources = new StringBuilder();

var asm = Assembly.GetExecutingAssembly();
var mrn = asm.GetManifestResourceNames();
foreach (var resource in mrn)
{
    var rm = new ResourceManager(resource.Replace(".resources", ""), asm);
    try
    {
        var NOT_USED = rm.GetStream("app.xaml"); // without getting a stream, the next statement doesn't work - bug?
        var rs = rm.GetResourceSet(Thread.CurrentThread.CurrentUICulture, false, true);
        var enumerator = rs.GetEnumerator();
        while (enumerator.MoveNext())
        {
            if (enumerator.Key.ToString().StartsWith("images/"))
            {
                listOfImageResources.AppendLine(enumerator.Key.ToString());
            }
        }
    }
    catch (MissingManifestResourceException)
    {
        // Ignore any other embedded resources (they won't contain app.xaml)
    }
}
MessageBox.Show(listOfImageResources.ToString());
This just displays a list of the names, but hopefully it'll be easy to change this to do whatever you need to.
Any suggestions for improving this code will be greatly appreciated.