Parallel.ForEach throws exception when extracting a zip file


I am reading the contents of a zip file and trying to extract them.
var allZipEntries = ZipFile.Open(zipFileFullPath, ZipArchiveMode.Read).Entries;
If I extract them using a foreach loop, this works fine. The drawback is that it is equivalent to the zip.extract method, so I get no advantage when I intend to extract all the files.
foreach (var currentEntry in allZipEntries)
{
if (currentEntry.FullName.Equals(currentEntry.Name))
{
currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
}
else
{
var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
Directory.CreateDirectory(subDirectoryPath);
currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
}
}
To take advantage of the TPL, I tried using Parallel.ForEach, but that throws the following exception:
An exception of type 'System.IO.InvalidDataException' occurred in System.IO.Compression.dll but was not handled in user code
Additional information: A local file header is corrupt.
Parallel.ForEach(allZipEntries, currentEntry =>
{
if (currentEntry.FullName.Equals(currentEntry.Name))
{
currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
}
else
{
var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
Directory.CreateDirectory(subDirectoryPath);
currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
}
});
To avoid this I could use a lock, but that defeats the whole purpose.
Parallel.ForEach(allZipEntries, currentEntry =>
{
lock (thisLock)
{
if (currentEntry.FullName.Equals(currentEntry.Name))
{
currentEntry.ExtractToFile($"{tempPath}\\{currentEntry.Name}");
}
else
{
var subDirectoryPath = Path.Combine(tempPath, Path.GetDirectoryName(currentEntry.FullName));
Directory.CreateDirectory(subDirectoryPath);
currentEntry.ExtractToFile($"{subDirectoryPath}\\{currentEntry.Name}");
}
}
});
Is there any other or better way to extract the files?

ZipFile is explicitly documented as not guaranteed to be thread-safe for instance members. (This is no longer mentioned on the page; the statement comes from a snapshot taken in Nov 2016.)
What you're trying to do cannot be done with this library. There may be some other libraries out there which do support multiple threads per zip file, but I wouldn't expect it.
You can use multi-threading to unzip multiple zip files at the same time, but not multiple entries in the same zip file.

Writing or reading in parallel is not a good idea, as the hard drive controller will only run the requests one by one. With multiple threads you just add overhead and queue the requests up for no gain.
Try reading the file into memory first. This will avoid your exception; however, if you benchmark it, you may find it is actually slower due to the overhead of the extra threads.
If the file is very large and decompression takes a long time, running the decompression in parallel may improve speed, but the I/O reads and writes will not get faster. Most decompression libraries are already multi-threaded anyway, so you will only gain performance from doing this if this one is not.
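The suggestion to read the file into memory first could be sketched roughly as follows. This is a minimal illustration, not code from the answer: the class and method names (ParallelUnzipSketch, ExtractAll) and the workers parameter are hypothetical, and it assumes the archive fits in memory and comes from a trusted source (entry paths are not validated). Each worker opens its own ZipArchive over its own read-only MemoryStream, so no ZipArchive instance is ever shared between threads.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

static class ParallelUnzipSketch
{
    // Hypothetical helper: extracts every entry of the zip at zipPath into destDir.
    public static void ExtractAll(string zipPath, string destDir, int workers)
    {
        // Read the whole archive once; the byte array is only read afterwards.
        byte[] zipBytes = File.ReadAllBytes(zipPath);
        string root = Path.GetFullPath(destDir);

        Parallel.For(0, workers, worker =>
        {
            // Every worker gets a private stream and a private ZipArchive.
            using (var stream = new MemoryStream(zipBytes, writable: false))
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
            {
                // Worker k handles entries k, k + workers, k + 2 * workers, ...
                for (int i = worker; i < archive.Entries.Count; i += workers)
                {
                    ZipArchiveEntry entry = archive.Entries[i];
                    if (string.IsNullOrEmpty(entry.Name))
                    {
                        continue; // directory entry, nothing to extract
                    }
                    string target = Path.GetFullPath(Path.Combine(root, entry.FullName));
                    Directory.CreateDirectory(Path.GetDirectoryName(target));
                    entry.ExtractToFile(target, overwrite: true);
                }
            }
        });
    }
}
```

A call might look like ParallelUnzipSketch.ExtractAll(zipFileFullPath, tempPath, 4). As the answer warns, whether this beats the plain foreach loop depends on the archive and the disk, so it is worth benchmarking before adopting it.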
Edit: A dodgy way to make the library thread-safe is below. It runs slower than or on par with the sequential version, depending on the zip archive, which supports the point that this is not something that benefits from parallelism.
Array.ForEach(Directory.GetFiles(@"c:\temp\output\"), File.Delete);
Stopwatch timer = new Stopwatch();
timer.Start();
int numberOfThreads = 8;
var clonedZipEntries = new List<ReadOnlyCollection<ZipArchiveEntry>>();
for (int i = 0; i < numberOfThreads; i++)
{
clonedZipEntries.Add(ZipFile.Open(@"c:\temp\temp.zip", ZipArchiveMode.Read).Entries);
}
int totalZipEntries = clonedZipEntries[0].Count;
int numberOfEntriesPerThread = totalZipEntries / numberOfThreads;
Func<object,int> action = (object thread) =>
{
int threadNumber = (int)thread;
int startIndex = numberOfEntriesPerThread * threadNumber;
int endIndex = startIndex + numberOfEntriesPerThread;
if (endIndex > totalZipEntries) endIndex = totalZipEntries;
for (int i = startIndex; i < endIndex; i++)
{
Console.WriteLine($"Extracting {clonedZipEntries[threadNumber][i].Name} via thread {threadNumber}");
clonedZipEntries[threadNumber][i].ExtractToFile($@"C:\temp\output\{clonedZipEntries[threadNumber][i].Name}");
}
//Check for any remainders due to non evenly divisible size
if (threadNumber == numberOfThreads - 1 && endIndex < totalZipEntries)
{
for (int i = endIndex; i < totalZipEntries; i++)
{
Console.WriteLine($"Extracting {clonedZipEntries[threadNumber][i].Name} via thread {threadNumber}");
clonedZipEntries[threadNumber][i].ExtractToFile($@"C:\temp\output\{clonedZipEntries[threadNumber][i].Name}");
}
}
return 0;
};
//Construct the tasks
var tasks = new List<Task<int>>();
for (int threadNumber = 0; threadNumber < numberOfThreads; threadNumber++) tasks.Add(Task<int>.Factory.StartNew(action, threadNumber));
Task.WaitAll(tasks.ToArray());
timer.Stop();
var threaderTimer = timer.ElapsedMilliseconds;
Array.ForEach(Directory.GetFiles(@"c:\temp\output\"), File.Delete);
timer.Reset();
timer.Start();
var entries = ZipFile.Open(@"c:\temp\temp.zip", ZipArchiveMode.Read).Entries;
foreach (var entry in entries)
{
Console.WriteLine($"Extracting {entry.Name} via thread 1");
entry.ExtractToFile($@"C:\temp\output\{entry.Name}");
}
timer.Stop();
Console.WriteLine($"Threaded version took: {threaderTimer} ms");
Console.WriteLine($"Non-Threaded version took: {timer.ElapsedMilliseconds} ms");
Console.ReadLine();

Related

Parallel.Foreach loop gets different result than For loop?

I've written a simple for loop iterating through an array and a Parallel.ForEach loop doing the same thing. However, the results I get are different, so I want to ask what the heck is going on? :D
class Program
{
static void Main(string[] args)
{
long creating = 0;
long reading = 0;
long readingParallel = 0;
for (int j = 0; j < 10; j++)
{
Stopwatch timer1 = new Stopwatch();
Random rnd = new Random();
int[] array = new int[100000000];
timer1.Start();
for (int i = 0; i < 100000000; i++)
{
array[i] = rnd.Next(5);
}
timer1.Stop();
long result = 0;
Stopwatch timer2 = new Stopwatch();
timer2.Start();
for (int i = 0; i < 100000000; i++)
{
result += array[i];
}
timer2.Stop();
Stopwatch timer3 = new Stopwatch();
long result2 = 0;
timer3.Start();
Parallel.ForEach(array, (item) =>
{
result2 += item;
});
if (result != result2)
{
Console.WriteLine(result + " - " + result2);
}
timer3.Stop();
creating += timer1.ElapsedMilliseconds;
reading += timer2.ElapsedMilliseconds;
readingParallel += timer3.ElapsedMilliseconds;
}
// The outer loop runs 10 times, so divide by 10 to get the average per run
Console.WriteLine("Create : \t" + creating / 10);
Console.WriteLine("Read: \t\t" + reading / 10);
Console.WriteLine("ReadP: \t\t" + readingParallel / 10);
Console.ReadKey();
}
}
When the condition is hit, I get results like these:
result = 200009295;
result2 = 35163054;
Is there anything wrong?

The += operator is non-atomic and actually performs multiple operations:
1. load the value stored in result into a register
2. add array[i] to the loaded value (I'm simplifying here)
3. write the sum back to result
Since a lot of these add operations will be running in parallel it is not just possible, but likely that there will be races between some of these operations where one thread reads a result value and performs the addition, but before it has the chance to write it back, another thread grabs the old result value (which hasn't yet been updated) and also performs the addition. Then both threads write their respective values to result. Regardless of which one wins the race, you end up with a smaller number than expected.
This is why the Interlocked class exists.
Your code could very easily be fixed:
Parallel.ForEach(array, (item) =>
{
Interlocked.Add(ref result2, item);
});
Don't be surprised if Parallel.ForEach ends up slower than the fully synchronous version in this case, though. There are two reasons:
1. The amount of work inside the delegate you pass to Parallel.ForEach is very small.
2. Interlocked methods incur a slight but non-negligible overhead, which will be quite noticeable in this particular case.
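One way to reduce the per-item Interlocked cost, which is a different technique from the one shown in the answer, is the Parallel.ForEach overload that takes thread-local state: each worker accumulates a private subtotal and touches the shared total only once when it finishes. The sketch below is illustrative; the class and method names (ParallelSumSketch, Sum) are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ParallelSumSketch
{
    // Hypothetical helper: sums the array using per-worker subtotals.
    public static long Sum(int[] values)
    {
        long total = 0;
        Parallel.ForEach(
            values,
            // localInit: every worker starts with its own subtotal of 0.
            () => 0L,
            // body: no shared state is touched here, so no synchronization is needed.
            (item, loopState, subtotal) => subtotal + item,
            // localFinally: runs once per worker and merges its subtotal atomically.
            subtotal => Interlocked.Add(ref total, subtotal));
        return total;
    }
}
```

With this shape the number of Interlocked.Add calls is proportional to the number of workers rather than the number of elements. Whether it actually outruns the plain for loop for work this small is still something to measure rather than assume.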

The fastest approach to inserting big data collections to Cassandra in C#

I'm a little bit confused about the fastest way to insert large collections into a Cassandra database. I read that I shouldn't use batch inserts because they were created for atomicity. Even Cassandra shows a message telling me to use asynchronous writes for performance.
I've used code for the fastest insert without 'batch' keyword:
var cluster = Cluster.Builder()
.AddContactPoint("127.0.0.1")
.Build();
var session = cluster.Connect();
//Save off the prepared statement you're going to use
var statement = session.Prepare("INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");
var tasks = new List<Task>();
for (int i = 0; i < 1000; i++)
{
//please bind with whatever actually useful data you're importing
var bind = statement.Bind(i, "John", "Tester");
var resultSetFuture = session.ExecuteAsync (bind);
tasks.Add (resultSetFuture);
}
Task.WaitAll(tasks.ToArray());
cluster.Shutdown();
from: https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
But it's still much slower than the batch option I'm using. My current code looks like this:
IList<Movie> moviesList = Movie.CreateMoviesCollectionForCassandra(collectionEntriesNumber);
var preparedStatements = new List<PreparedStatement>();
foreach (var statement in preparedStatements)
{
statement.SetConsistencyLevel(ConsistencyLevel.One);
}
var statementBinding = new BatchStatement();
statementBinding.SetBatchType(BatchType.Unlogged);
for (int i = 0; i < collectionEntriesNumber; i++)
{
preparedStatements.Add(Session.Prepare("INSERT INTO Movies (id, title, description, year, genres, rating, originallanguage, productioncountry, votingsnumber, director) VALUES (?,?,?,?,?,?,?,?,?,?)"));
}
for (int i = 0; i < collectionEntriesNumber; i++)
{
statementBinding.Add(preparedStatements[i].Bind(moviesList[i].Id, moviesList[i].Title,
moviesList[i].Description, moviesList[i].Year, moviesList[i].Genres, moviesList[i].Rating,
moviesList[i].OriginalLanguage, moviesList[i].ProductionCountry, moviesList[i].VotingsNumber,
new Director(moviesList[0].Director.Id, moviesList[i].Director.Firstname,
moviesList[i].Director.Lastname, moviesList[i].Director.Age)));
}
watch.Start();
Session.ExecuteAsync(statementBinding);
watch.Stop();
It really works much, much faster, but I can only insert about 2500 prepared statements, no more, and I want to measure the time to insert about 100000 objects.
Is my code correct? Maybe I should just increase the insert threshold?
Please explain how to do it the right way.

Remember that you should prepare your statement once and reuse that same PreparedStatement to bind different parameters.
You can use small batches if you are targeting the same partition; if not, you should use individual requests.
When using individual requests, you can schedule executions in parallel and limit the number of outstanding requests using a semaphore.
Something like:
public async Task<long> Execute(
IStatement[] statements, int parallelism, int maxOutstandingRequests)
{
var semaphore = new SemaphoreSlim(maxOutstandingRequests);
var tasks = new Task<RowSet>[statements.Length];
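// Round up so the chunks always cover every statement; with plain integer
// division some trailing statements could be skipped, leaving null tasks
var chunkSize = (statements.Length + parallelism - 1) / parallelism;
if (chunkSize == 0)
{
chunkSize = 1;
}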
var statementLength = statements.Length;
var launchTasks = new Task[parallelism + 1];
var watch = new Stopwatch();
watch.Start();
for (var i = 0; i < parallelism + 1; i++)
{
var startIndex = i * chunkSize;
//start to launch in parallel
launchTasks[i] = Task.Run(async () =>
{
for (var j = 0; j < chunkSize; j++)
{
var index = startIndex + j;
if (index >= statementLength)
{
break;
}
await semaphore.WaitAsync();
try
{
var t = _session.ExecuteAsync(statements[index]);
tasks[index] = t;
await t;
}
finally
{
// always release, even if the request fails
semaphore.Release();
}
}
});
}
await Task.WhenAll(launchTasks);
await Task.WhenAll(tasks);
watch.Stop();
return watch.ElapsedMilliseconds;
}

Conflicting threads on a local variable

Why is it that, in the following code, n doesn't end up being 0? It's some random number with a magnitude less than 1000000 each time, sometimes even a negative number.
static void Main(string[] args)
{
int n = 0;
var up = new Thread(() =>
{
for (int i = 0; i < 1000000; i++)
{
n++;
}
});
up.Start();
for (int i = 0; i < 1000000; i++)
{
n--;
}
up.Join();
Console.WriteLine(n);
Console.ReadLine();
}
Doesn't up.Join() force both for loops to finish before WriteLine is called?
I understand that the local variable is actually part of a class behind the scenes (I think it's called a closure). Since the local variable n is therefore actually heap-allocated, could that be why n is not 0 each time?

The n++ and n-- operations are not guaranteed to be atomic. Each operation has three phases:
1. Read current value from memory
2. Modify value (increment/decrement)
3. Write value to memory
Since both of your threads are doing this repeatedly, and you have no control over the scheduling of the threads, you will have situations like this:
Thread1: Get n (value = 0)
Thread1: Increment (value = 1)
Thread2: Get n (value = 0)
Thread1: Write n (n == 1)
Thread2: Decrement (value = -1)
Thread1: Get n (value = 1)
Thread2: Write n (n == -1)
And so on.
This is why it is always important to lock access to shared data.
-- Code:
static void Main(string[] args)
{
int n = 0;
object lck = new object();
var up = new Thread(() =>
{
for (int i = 0; i < 1000000; i++)
{
lock (lck)
n++;
}
});
up.Start();
for (int i = 0; i < 1000000; i++)
{
lock (lck)
n--;
}
up.Join();
Console.WriteLine(n);
Console.ReadLine();
}
-- Edit: more on how lock works...
When you use the lock statement it attempts to acquire a lock on the object you supply it - the lck object in my code above. If that object is already locked, the lock statement will cause your code to wait for the lock to be released before continuing.
The C# lock statement is effectively the same as a Critical Section. It is similar to the following C++ code:
// declare and initialize the critical section (analog to 'object lck' in code above)
CRITICAL_SECTION lck;
InitializeCriticalSection(&lck);
// Lock critical section (same as 'lock (lck) { ...code... }')
EnterCriticalSection(&lck);
__try
{
// '...code...' goes here
n++;
}
__finally
{
LeaveCriticalSection(&lck);
}
The C# lock statement abstracts most of that away, meaning that it's much harder for us to enter a critical section (acquire a lock) and forget to leave it.
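For comparison with the C++ version, here is roughly what the lock statement stands for in C# itself: a Monitor.Enter / Monitor.Exit pair wrapped in try/finally. This is a sketch for illustration only; the names MonitorSketch, Add and Run are hypothetical, and the counter is a static field rather than a captured local to keep the example short.

```csharp
using System;
using System.Threading;

static class MonitorSketch
{
    private static readonly object Lck = new object();
    private static int n;

    // Behaves like: lock (Lck) { n += delta; }
    private static void Add(int delta)
    {
        bool lockTaken = false;
        try
        {
            // Blocks until no other thread holds the monitor on Lck.
            Monitor.Enter(Lck, ref lockTaken);
            n += delta;
        }
        finally
        {
            // Release only if the lock was actually acquired.
            if (lockTaken)
            {
                Monitor.Exit(Lck);
            }
        }
    }

    // Runs the increment and decrement loops concurrently and returns the final n.
    public static int Run(int iterations)
    {
        n = 0;
        var up = new Thread(() =>
        {
            for (int i = 0; i < iterations; i++) Add(1);
        });
        up.Start();
        for (int i = 0; i < iterations; i++) Add(-1);
        up.Join();
        return n;
    }
}
```

Because every read-modify-write of n happens while the monitor is held, the increments and decrements cannot interleave, so Run returns 0 regardless of how the threads are scheduled.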
The important thing, though, is that only your locking object is affected, and only with regard to other threads trying to acquire a lock on the same object. Nothing stops you from writing code that modifies the locking object itself, or that accesses any other object. YOU are responsible for making sure your code respects the locks and always acquires a lock when writing to a shared object.
Otherwise you're going to have a non-deterministic outcome like you've seen with this code, or what the spec-writers like to call 'undefined behavior'. Here Be Dragons (in the form of bugs you'll have endless trouble with).

Yes, up.Join() will ensure that both of the loops end before WriteLine is called.
However, what is happening is that both loops are being executed simultaneously, each one in its own thread.
The operating system switches between the two threads all the time, and each run of the program will see a different sequence of switches.
You should also be aware that n-- and n++ are not atomic operations; each is actually compiled to 3 sub-operations, e.g.:
1. Take value from memory
2. Increase it by one
3. Put value in memory
The last piece of the puzzle is that a thread context switch can occur inside the n++ or n--, between any of the above 3 operations.
That is why the final value is non-deterministic.

If you don't want to use locks, there are atomic versions of the increment and decrement operators in the Interlocked class.
Change your code to the following and you will always get 0 as the answer.
static void Main(string[] args)
{
int n = 0;
var up = new Thread(() =>
{
for (int i = 0; i < 1000000; i++)
{
Interlocked.Increment(ref n);
}
});
up.Start();
for (int i = 0; i < 1000000; i++)
{
Interlocked.Decrement(ref n);
}
up.Join();
Console.WriteLine(n);
Console.ReadLine();
}

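You need to join the thread earlier, before the second loop starts, so that the two loops run one after the other instead of concurrently: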
static void Main(string[] args)
{
int n = 0;
var up = new Thread(() =>
{
for (int i = 0; i < 1000000; i++)
{
n++;
}
});
up.Start();
up.Join();
for (int i = 0; i < 1000000; i++)
{
n--;
}
Console.WriteLine(n);
Console.ReadLine();
}

How to use multi threading in a For loop

I want to achieve the below requirement; please suggest some solution.
string[] filenames = Directory.GetFiles(@"C:\Temp"); //10 files
for (int i = 0; i < filenames.Length; i++)
{
ProcessFile(filenames[i]); //it takes time to execute
}
I want to implement multi-threading. For example, there are 10 files and I want to process 3 files at a time (configurable, say maxthreadcount). So 3 files will be processed in 3 threads from the for loop, and when any thread completes its execution, it should pick the next item from the for loop. I also want to ensure all the files are processed before it exits the for loop.
Please suggest the best approach.

Try
Parallel.For(0, filenames.Length, i => {
ProcessFile(filenames[i]);
});
See the MSDN documentation for Parallel.For.
It's only available since .NET 4. I hope that's acceptable.

This will do the job in .NET 2.0:
class Program
{
static int workingCounter = 0;
static int workingLimit = 10;
static int processedCounter = 0;
static void Main(string[] args)
{
string[] files = Directory.GetFiles("C:\\Temp");
int checkCount = files.Length;
foreach (string file in files)
{
//wait for free limit...
while (workingCounter >= workingLimit)
{
Thread.Sleep(100);
}
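// worker threads decrement this counter concurrently, so increment it atomically
Interlocked.Increment(ref workingCounter);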
ParameterizedThreadStart pts = new ParameterizedThreadStart(ProcessFile);
Thread th = new Thread(pts);
th.Start(file);
}
//wait for all threads to complete...
while (processedCounter< checkCount)
{
Thread.Sleep(100);
}
Console.WriteLine("Work completed!");
}
static void ProcessFile(object file)
{
try
{
Console.WriteLine(DateTime.Now.ToString() + " received: " + file + " thread count is: " + workingCounter.ToString());
//make some sleep for demo...
Thread.Sleep(2000);
}
catch (Exception ex)
{
//handle your exception...
string exMsg = ex.Message;
}
finally
{
Interlocked.Decrement(ref workingCounter);
Interlocked.Increment(ref processedCounter);
}
}
}

Take a look at the Producer/Consumer Queue example by Joe Albahari. It should provide a good starting point for what you're trying to accomplish.

You could use the ThreadPool.
Example:
ThreadPool.SetMaxThreads(3, 3);
for (int i = 0; i < filenames.Length; i++)
{
ThreadPool.QueueUserWorkItem(new WaitCallback(ProcessFile), filenames[i]);
}
static void ProcessFile(object fileNameObj)
{
var fileName = (string)fileNameObj;
// do your processing here.
}
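If you are using the ThreadPool elsewhere in your application, then this would not be a good solution, since the pool is shared across your app. Note also that SetMaxThreads returns false and changes nothing if you ask for fewer threads than the machine has processors.
You could also grab a different thread pool implementation, for example SmartThreadPool.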

Rather than starting a thread for each file name, put the file names into a queue and then start up three threads to process them. Or, since the main thread is now free, start up two threads and let the main thread work on it, too:
Queue<string> MyQueue;
void MyProc()
{
string[] filenames = Directory.GetFiles(...);
MyQueue = new Queue<string>(filenames);
// start two threads
Thread t1 = new Thread((ThreadStart)ProcessQueue);
Thread t2 = new Thread((ThreadStart)ProcessQueue);
t1.Start();
t2.Start();
// main thread processes the queue, too!
ProcessQueue();
// wait for threads to complete
t1.Join();
t2.Join();
}
private object queueLock = new object();
void ProcessQueue()
{
while (true)
{
string s;
lock (queueLock)
{
if (MyQueue.Count == 0)
{
// queue is empty
return;
}
s = MyQueue.Dequeue();
}
ProcessFile(s);
}
}
Another option is to use a semaphore to control how many threads are working:
Semaphore mySem = new Semaphore(3, 3);
void MyProc()
{
string[] filenames = Directory.GetFiles(...);
foreach (string s in filenames)
{
mySem.WaitOne();
ThreadPool.QueueUserWorkItem(ProcessFile, s);
}
// wait for all threads to finish
int count = 0;
while (count < 3)
{
mySem.WaitOne();
++count;
}
}
void ProcessFile(object state)
{
string fname = (string)state;
// do whatever
mySem.Release(); // release so another thread can start
}
The first will perform somewhat better because you don't have the overhead of starting and stopping a thread for each file name processed. The second is much shorter and cleaner, though, and takes full advantage of the thread pool. Likely you won't notice the performance difference.

You can set the maximum number of threads using ParallelOptions:
Parallel.For Method (Int32, Int32, ParallelOptions, Action)
ParallelOptions.MaxDegreeOfParallelism
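A minimal sketch of how those two pieces fit together, assuming a configurable limit such as the question's maxthreadcount. The names BoundedParallelSketch and ProcessAll are hypothetical, and processFile stands in for the question's ProcessFile method.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedParallelSketch
{
    // Hypothetical helper: runs processFile on every file name using at most
    // maxThreadCount concurrent workers, and returns how many were processed.
    public static int ProcessAll(string[] filenames, int maxThreadCount, Action<string> processFile)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreadCount };
        int processed = 0;

        // Parallel.For returns only after every iteration has finished,
        // so all files are processed before this method continues.
        Parallel.For(0, filenames.Length, options, i =>
        {
            processFile(filenames[i]);
            Interlocked.Increment(ref processed);
        });

        return processed;
    }
}
```

For the question's scenario the call would be something like BoundedParallelSketch.ProcessAll(filenames, 3, ProcessFile). As soon as one worker finishes a file it picks up the next index, which matches the behaviour the question asks for.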

var results = filenames.ToArray().AsParallel().Select(filename=>ProcessFile(filename)).ToArray();
bool ProcessFile(object fileNameObj)
{
var fileName = (string)fileNameObj;
// do your processing here.
return true;
}

Log to memory then write to file, memory stream vs file compared

This is in reference to my previous question, "Log to memory and then write to file", specifically the edit to that question, where I asked whether writing to memory would be faster than writing to a file. I performed a simple test and got shocking results, which I wanted to share with the community.
Here's the code:
private void Button1Click(object sender, EventArgs e)
{
var stopwatch = new Stopwatch();
stopwatch.Start();
File.AppendAllText(@"D:\File1.txt", string.Format("{0}Start! : {1}", Environment.NewLine, DateTime.Now.ToString(CultureInfo.InvariantCulture)));
for (int i = 0; i < 6; i++)
{
for (int j = 0; j < 1000000; j++)
{
File.AppendAllText(@"D:\File1.txt", string.Format("{0}{1}:{2}", Environment.NewLine, i.ToString(CultureInfo.InvariantCulture), j.ToString(CultureInfo.InvariantCulture)));
}
}
File.AppendAllText(@"D:\File1.txt", string.Format("{0}Done!{1}", Environment.NewLine, DateTime.Now.ToString(CultureInfo.InvariantCulture)));
stopwatch.Stop();
File.AppendAllText(@"D:\File1.txt",
string.Format("{0}{1}:{2}", Environment.NewLine, stopwatch.Elapsed.ToString(), stopwatch.ElapsedMilliseconds.ToString(CultureInfo.InvariantCulture)));
MessageBox.Show("Done!");
}
private void Button2Click(object sender, EventArgs e)
{
var stopwatch = new Stopwatch();
using (var mem = new MemoryStream())
{
using (var binaryWriter = new BinaryWriter(mem))
{
stopwatch.Start();
{
binaryWriter.Write("start! : " + DateTime.Now.ToString(CultureInfo.InvariantCulture));
for (int i = 0; i < 6; i++)
{
for (int j = 0; j < 1000000; j++)
{
binaryWriter.Write(i.ToString(CultureInfo.InvariantCulture) + ":" + j.ToString(CultureInfo.InvariantCulture));
}
}
stopwatch.Stop();
binaryWriter.Write("Done! " + DateTime.Now.ToString(CultureInfo.InvariantCulture));
binaryWriter.Write(stopwatch.Elapsed.ToString() + ":" + stopwatch.ElapsedMilliseconds.ToString(CultureInfo.InvariantCulture));
binaryWriter.Flush();
// dispose the FileStream so the data is flushed and the file handle is released
using (var file = new FileStream(@"D:\File2.txt", FileMode.Create))
{
mem.WriteTo(file);
}
}
}
}
MessageBox.Show("Done!");
}
The code should be easy to understand. These are the results:
Elapsed time in File1.txt = 00:50:24.5654918
Elapsed milliseconds in File1.txt = 3024565
Elapsed time in File2.txt = 00:00:04.7430152
Elapsed milliseconds in File2.txt = 4743
So, as you can see for yourself, the difference is about 50 minutes! Logging everything directly to a file with File.AppendAllText, without a memory stream or a dedicated logging tool, can be a real cause of bad performance: it took about 50 minutes, whereas using MemoryStream took only about 4.7 seconds. (I am still confused as to why the time shown in Windows Explorer doesn't correspond to the elapsed time the stopwatch wrote at the end of the file, but even going by the Windows Explorer time, it is still about 45 minutes faster!) This can be really useful to know, so I thought I'd share it.

That's because File.AppendAllText opens the file, writes, flushes the buffer and closes it. If you keep the log file open and use a stream to write to it (instead of the MemoryStream), you will get results that are very close to what you've seen with MemoryStream - it might even be indistinguishable.
Try it out.
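A sketch of what keeping the log file open could look like, using a StreamWriter that is opened once for the whole run. The names OpenStreamLogSketch and WriteLog, and the path and loop-count parameters, are hypothetical; the loop structure mirrors the question's test.

```csharp
using System;
using System.Globalization;
using System.IO;

static class OpenStreamLogSketch
{
    // Hypothetical helper: writes the same kind of log as the question's test,
    // but opens the file once instead of once per line.
    public static void WriteLog(string path, int outerCount, int innerCount)
    {
        // append: false creates or overwrites the file.
        using (var writer = new StreamWriter(path, append: false))
        {
            writer.WriteLine("Start! : " + DateTime.Now.ToString(CultureInfo.InvariantCulture));
            for (int i = 0; i < outerCount; i++)
            {
                for (int j = 0; j < innerCount; j++)
                {
                    // Goes into the writer's buffer; the file is not reopened.
                    writer.WriteLine(i.ToString(CultureInfo.InvariantCulture) + ":" +
                                     j.ToString(CultureInfo.InvariantCulture));
                }
            }
            writer.WriteLine("Done! " + DateTime.Now.ToString(CultureInfo.InvariantCulture));
        } // disposing the writer flushes the buffer and closes the file
    }
}
```

Calling OpenStreamLogSketch.WriteLog with the question's loop sizes (6 and 1000000) and timing it with a Stopwatch would give a fair comparison against both Button1Click and Button2Click.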
