I have a code block that processes StoreProducts and then adds or updates them in the database in a foreach loop, but this is slow. When I convert the code to a Parallel.ForEach block, the same products get both added and updated at the same time. I could not figure out how to safely parallelize the following functionality; any help would be appreciated.
var validProducts = storeProducts.Where(p => p.Price2 > 0
    && !string.IsNullOrEmpty(p.ProductAtt08Desc.Trim())
    && !string.IsNullOrEmpty(p.Barcode.Trim())
).ToList();
var processedProductCodes = new List<string>();
var po = new ParallelOptions()
{
    MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)), po,
    (product) =>
    {
        lock (_lockThis)
        {
            processedProductCodes.Add(product.ProductCode);
        }
        // Check if Product Exists in Db
        // if product is not in Db Add to Db
        // if product is in Db Update product in Db
    });
The thing here is that validProducts may contain more than one entry with the same ProductCode (they are variants), and I have to make sure that once one of them is being processed, the same ProductCode is not processed again.
So the Where condition inside the Parallel.ForEach, 'validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode))', is not working as I would expect it to in a normal foreach.
The bulk of my answer is less an answer to your question and more some guidance - if you were to provide some more technical details, I may be able to assist more precisely.
A Parallel.ForEach is probably not the best solution here -- especially when you have a shared list or a busy server.
You are locking to write but not to read from that shared list, so I'm surprised it's not throwing during the Where. Turn the List<string> into a ConcurrentDictionary<string, bool> (just to create a simple concurrent hash table); you'll get better write throughput and it won't throw during reads.
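A rough sketch of that idea, reusing validProducts and po from the question (System.Collections.Concurrent assumed, and the database work left as the comments describe):
var processedProductCodes = new ConcurrentDictionary<string, bool>();
Parallel.ForEach(validProducts, po,
    (product) =>
    {
        // TryAdd returns false if another thread already claimed this ProductCode.
        if (!processedProductCodes.TryAdd(product.ProductCode, true))
        {
            return;
        }
        // Check if Product Exists in Db
        // if product is not in Db Add to Db
        // if product is in Db Update product in Db
    });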
But you're going to have database contention issues (if using multiple connections) because your insert will likely still require locks. Even if you simply split the workload you would run into this. This DB locking could cause blocks/deadlocks so it may end up slower than the original. If using one connection, you generally cannot parallelize commands.
I would try wrapping the majority of the inserts in a transaction containing batches of, say, 1,000 inserts, or placing the entire workload into one bulk insert. Then the database will keep the data in memory and commit the entire thing to disk when finished (instead of one record at a time).
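For the single bulk insert, a minimal sketch with SqlBulkCopy (connectionString, the dbo.Products table name, and the ToDataTable helper that maps your products to a matching DataTable are all assumptions here):
using (var connection = new SqlConnection(connectionString))
using (var bulkCopy = new SqlBulkCopy(connection))
{
    connection.Open();
    bulkCopy.DestinationTableName = "dbo.Products"; // placeholder table name
    bulkCopy.BatchSize = 1000;                      // commit to the server in batches of 1000 rows
    bulkCopy.WriteToServer(ToDataTable(validProducts));
}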
Depending on your typical workload, you may want to try different storage solutions. Relational databases are generally slow at ingesting large volumes of records row by row; you will likely see much better performance with alternative solutions (such as key-value stores), or you could place the data into something like Redis and slowly persist it to the database in the background.
Parallel.ForEach buffers items internally for each thread. One option is to switch to a partitioner that does not use buffering:
var pat = Partitioner.Create(validProducts.Where(p => !processedProductCodes.Contains(p.ProductCode)),
    EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(pat, po, (product) => ...
That will get you closer, but you will still have a race condition where two of the same object can be processed, because you don't break out of the loop when you find a duplicate.
The better option is to switch processedProductCodes to a HashSet<string> and change your code to:
var processedProductCodes = new HashSet<string>();
var po = new ParallelOptions()
{
    MaxDegreeOfParallelism = 4
};
Parallel.ForEach(validProducts, po,
    (product) =>
    {
        // You can safely lock on processedProductCodes
        lock (processedProductCodes)
        {
            if (!processedProductCodes.Add(product.ProductCode))
            {
                // Add returns false if the code is already in the collection.
                return;
            }
        }
        // Check if Product Exists in Db
        // if product is not in Db Add to Db
        // if product is in Db Update product in Db
    });
HashSet<string> has much faster lookups, and the duplicate check is built into the Add method.
Related
I have a large DataTable that contains user details. I need to complete the users' details in this table from several tables in the DB. I run through each row in the table, make several calls to different tables in the database using ADO.NET objects and methods, process and reorganize the results, and add them to the main table. It works fine, but too slowly...
My idea was to split the large table into a few small tables, run the CompleteAddressDetails method on them in a few threads simultaneously, and in the end merge the small tables into one result table. I have implemented this idea using the Task object of the TPL. The code is below. It works fine, but without any improvement in execution time.
Several questions:
1. Why is there no improvement in execution time?
2. What do I have to do in order to improve it?
Thank you for any advice!
resultTable1 = data.Clone();
resultTable2 = data.Clone();
resultTable3 = data.Clone();
resultTable4 = data.Clone();
resultTable5 = data.Clone();
DataTable[] tables = new DataTable[] { resultTable1, resultTable2, resultTable3, resultTable4, resultTable5 };
for (int i = 0; i < data.Rows.Count; i += 5)
{
    for (int j = 0; j < 5; j++)
    {
        if (data.Rows.Count > i + j)
        {
            tables[j].Rows.Add(data.Rows[i + j].ItemArray);
        }
    }
}
Task[] taskArray = {Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable1)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable2)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable3)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable4)),
Task.Factory.StartNew(() =>CompleteAddressDetails(resultTable5))};
Task.WaitAll(taskArray);
When using multi-threaded parallelism without any performance benefit, there are basically two possibilities:
The code isn't CPU-bound, so throwing more CPUs on the task isn't going to help
The code uses too much synchronization to actually allow realistic parallel execution
In this case, 1 is likely the cause. Your code isn't doing enough CPU work to benefit from multi-threading. Most likely, you're simply waiting for the database to do the work.
It's hard to give any pointers without seeing what the CompleteAddressDetails method does - I assume it goes through all the rows one by one, and executes a couple of separate queries to fill in the details. Even if each individual query is fast enough, doing thousands of separate queries is going to hurt your performance no matter what you do - and especially so if those queries require locking some shared state in the DB.
First, think of a better way to fill in the details. Perhaps you can join some of those queries together, or maybe you can even load all of the rows at once. Second, try profiling the actual queries as they happen on the server. Find out if there's something you can do to improve their performance - say, by adding some indices, or by better using the existing ones.
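To illustrate the "load all of the rows at once" option, here is a rough sketch under assumptions (we can't see CompleteAddressDetails, so the AddressDetails table, the UserId key and the Street/City columns are placeholders): load the detail rows with one query, index them in memory, and complete the main table from that lookup instead of querying per row.
// One query for all details instead of several queries per user row.
var detailsByUserId = new Dictionary<int, DataRow>();
using (var adapter = new SqlDataAdapter("SELECT UserId, Street, City FROM AddressDetails", connectionString))
{
    var details = new DataTable();
    adapter.Fill(details);
    foreach (DataRow detail in details.Rows)
    {
        detailsByUserId[detail.Field<int>("UserId")] = detail;
    }
}
foreach (DataRow user in data.Rows)
{
    DataRow detail;
    if (detailsByUserId.TryGetValue(user.Field<int>("UserId"), out detail))
    {
        user["Street"] = detail["Street"];
        user["City"] = detail["City"];
    }
}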
There is no improvement because you can't code your way around how the SQL Server database handles your calls.
I would recommend using a User-Defined Table Type on SQL Server and a Stored Procedure that accepts this table type, then just sending the DataTable you have through to the Stored Procedure and doing your processing in there. You'd then be able to optimize from there going forward.
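A minimal sketch of the call side, assuming a hypothetical dbo.CompleteAddressDetails procedure and a dbo.UserTableType user-defined table type whose columns match the DataTable:
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("dbo.CompleteAddressDetails", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    // Pass the whole DataTable as a table-valued parameter.
    SqlParameter parameter = command.Parameters.AddWithValue("@Users", data);
    parameter.SqlDbType = SqlDbType.Structured;
    parameter.TypeName = "dbo.UserTableType"; // hypothetical table type name
    connection.Open();
    command.ExecuteNonQuery();
}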
I'm sure we're missing something very important here, so hopefully someone can point us in the right direction. Thank you in advance :)
The issue we currently experience: sometimes an asynchronous operation (read) does not return us the hash value from the DB that has been written by an async operation. For example, one time the operation can return us 600 keys, the next time the number of keys can be 598, the next one 596, and so on. We also experience the same issue with short sets (when we have up to 10 keys in a set and read 10 hash objects in a batch: sometimes we get 8 objects, sometimes 6, and once we got only 2).
We have the issue with async methods in about 30-40% of our operations; migrating to synchronous operations solved some of the cases, but we've lost performance.
Example of our create/read batch operations
protected void CreateBatch(Func<IBatch, List<Task>> action)
{
    IBatch batch = Database.CreateBatch();
    List<Task> tasks = action(batch);
    batch.Execute();
    Task.WaitAll(tasks.ToArray());
}

protected IEnumerable<T> GetBatch<T, TRedis>(
    IEnumerable<RedisKey> keys,
    Func<IBatch, RedisKey, Task<TRedis>> invokeBatchOperation,
    Func<TRedis, T> buildResultItem)
{
    IBatch batch = Database.CreateBatch();
    List<RedisKey> keyList = keys.ToList();
    List<Task> tasks = new List<Task>(keyList.Count);
    List<T> result = new List<T>(keyList.Count);
    foreach (RedisKey key in keyList)
    {
        Task task = invokeBatchOperation(batch, key).ContinueWith(
            t =>
            {
                T item = buildResultItem(t.Result);
                result.Add(item);
            });
        tasks.Add(task);
    }
    batch.Execute();
    Task.WaitAll(tasks.ToArray());
    return result;
}
We use write operations in the following way:
private void CreateIncrementBatch(IEnumerable<DynamicDTO> dynamicDtos)
{
    CreateBatch(
        batch =>
        {
            List<Task> tasks = new List<Task>();
            foreach (DynamicDTO dynamicDto in dynamicDtos)
            {
                string dynamicKey = KeysBuilders.Live.Dynamic.BuildDetailsKeyByIdAndVersion(
                    dynamicDto.Id,
                    dynamicDto.Version);
                HashEntry[] dynamicFields = _dtoMapper.MapDynamicToHashEntries(dynamicDto);
                Task task = batch.HashSetAsync(dynamicKey, dynamicFields, CommandFlags.HighPriority);
                tasks.Add(task);
            }
            return tasks;
        });
}
We read data as a batch using the following code sample:
IEnumerable<RedisKey> userKeys =
GetIdsByUserId(userId).Select(x => (RedisKey) KeysBuilders.Live.Dynamic.BuildDetailsKeyByUserId(x));
return GetBatch(userKeys, (batch, key) => batch.HashGetAllAsync(key), _dtoMapper.MapToDynamic);
We know that batch.Execute is not a synchronous / not truly asynchronous operation; at the same time, we need to check the status of each operation later.
We do plan to make many more read/write operations against the Redis server, but because of this issue we're not sure if we're on the right path.
Any advice/samples and pointers in the right direction are highly appreciated!
Some additional info:
We're using the StackExchange.Redis client (latest stable version: 1.0.481) in an ASP.NET MVC app / worker role (.NET 4.5) to connect to and work with Azure Redis Cache (C1, Standard).
At the moment we have about 100,000 keys in the database during a small test flow (mostly hashes - based on the recommendations provided on redis.io, each key stores up to 10 fields for different objects, with no big data or text fields stored in the hash) and sets (mostly mappings, the biggest of which can map up to 10,000 keys to the parent).
We have about 20 small writers to the cache (each writer instance writes its own subset of data and does not overlap with another; the number of keys to write per operation is up to 100 hashes).
We also have one "big man" worker who can make some calculations based on the current Redis state and store data back into the Redis server (the number of operations is up to 1,200 keys to read/write per first request, and then it works with 10,000+ keys to store and calculate).
While the big man works, nobody reads or writes to this exact keyspace; however, the small writers continue to write some keys constantly.
At the same time we have many small readers (up to 100,000) who can request their specific chunk of data (based on mappings and joins of 2 hash entities).
The number of hash entities to return to the readers is about 100-500 records.
Due to some restrictions in the domain model, we try to store/read keys as batch operations (the biggest (longest) batch can have up to 500-1,000 reads/writes of hash fields into the cache). We do not use transactions at the moment.
Maybe, instead of
List<T> result = new List<T>(keyList.Count);
you could use something like this?
ConcurrentBag<T> result = new ConcurrentBag<T>();
ConcurrentBag represents a thread-safe, unordered collection of objects.
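A minimal sketch of how that slots into the GetBatch loop from the question, with everything else unchanged (System.Collections.Concurrent assumed):
ConcurrentBag<T> result = new ConcurrentBag<T>();
foreach (RedisKey key in keyList)
{
    Task task = invokeBatchOperation(batch, key).ContinueWith(
        t =>
        {
            // ConcurrentBag.Add is safe when several continuations complete at once,
            // unlike List<T>.Add, which can lose items under concurrent writes.
            result.Add(buildResultItem(t.Result));
        });
    tasks.Add(task);
}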
I'm dumping a table out of MySQL into a DataTable object using MySqlDataAdapter. Database input and output is doing fine, but my application code seems to have a performance issue I was able to track down to a specific LINQ statement.
The goal is simple: search the contents of the DataTable for a column value matching a specific string, just like a traditional WHERE column = 'text' SQL clause.
Simplified code:
foreach (String someValue in someList) {
    String searchCode = OutOfScopeFunction(someValue);
    var results = emoteTable.AsEnumerable()
        .Where(myRow => myRow.Field<String>("code") == searchCode)
        .Take(1);
    if (results.Any()) {
        results.First()["columnname"] = 10;
    }
}
This simplified code is executed thousands of times, once for each entry in someList. When I run Visual Studio Performance Profiler I see that the "results.Any()" line is highlighted as consuming 93.5% of the execution time.
I've tried several different methods for optimizing this code, but none have improved performance while keeping the emoteTable DataTable as the primary source of the data. I can convert emoteTable to Dictionary<String, DataRow> outside of the foreach, but then I have to keep the DataTable and the Dictionary in sync, which while still a performance improvement, feels wrong.
Three questions:
Is this the proper way to search for a value in a DataTable (equivalent of a traditional SQL WHERE clause)? If not, how SHOULD it be done?
Addendum to 1, regardless of the proper way, what is the fastest (execution time)?
Why does the results.Any() line consume 90%+ resources? In this situation it makes more sense that the var results line should consume the resources, after all, it's the line doing the actual search, right?
Thank you for your time. If I find an answer I shall post it here as well.
Any() is taking 90% of the time because the query is only executed when you call Any(); before that, the query has not actually been run (deferred execution).
It would seem the problem is that you first fetch the entire table into memory and then search it. You should instruct your database to do the search.
Moreover, when you call results.First(), the whole results query is executed again.
With deferred execution in mind, you should write something like
var result = emoteTable.AsEnumerable()
    .Where(myRow => myRow.Field<String>("code") == searchCode)
    .FirstOrDefault();
if (result != null) {
    result["columnname"] = 10;
}
What you have implemented is pretty much a join:
var searchCodes = someList.Select(OutOfScopeFunction);
var emotes = emoteTable.AsEnumerable();
// Match each DataRow's "code" column against the search codes and keep the matching rows.
var results = Enumerable.Join(emotes, searchCodes, e => e.Field<String>("code"), sc => sc, (e, sc) => e);
foreach (var result in results)
{
    result["columnname"] = 10;
}
Join builds a lookup (a hash table) over one of the sequences internally, so it avoids re-scanning the table for every search code.
But the first thing I would do is to completely abandon the idea of combining DataTable and LINQ. They are two different technologies, and trying to reason about what they do internally when combined is hard.
Did you try doing raw UPDATE calls? How many items are you expecting to update?
What I've got:
I have a large list of addresses (IP addresses) - millions of them
What I'm trying to do:
Remove 500k addresses efficiently through Entity Framework
My Problem:
Right now, I'm splitting the addresses into lists of 10,000 and using RemoveRange(ListOfaddresses):
if (addresses.Count() > 10000)
{
    var addressChunkList = extension.BreakIntoChunks<Address>(addresses.ToList(), 10000);
    foreach (var chunk in addressChunkList)
    {
        db.Address.RemoveRange(chunk);
    }
}
but I'm getting an OutOfMemoryException which must mean that it's not freeing resources even though I'm splitting my addresses into separate lists.
What can I do to not get the OutOfMemoryException and still remove large quantities of addresses within reasonable time?
When I have needed to do something similar I have turned to the following plugin (I am not associated).
https://github.com/loresoft/EntityFramework.Extended
This allows you to do bulk deletes using Entity Framework without having to select and load the entities into memory first, which of course is more efficient.
Example from the website:
context.Users.Delete(u => u.FirstName == "firstname");
So? Where did you get the idea that EF is an ETL / bulk data manipulation tool?
It is not. Doing half a million deletes in one transaction will be dead slow (it deletes one by one), and EF is just not made for this, as you found out.
There is nothing you can do here. Start using EF within its design parameters, or choose an alternative approach for these bulk operations. There are cases where an ORM makes little sense.
A couple of suggestions.
Use a stored procedure or plain SQL (see the sketch at the end of this answer)
Move your DbContext to a narrower scope:
for (int i = 0; i < 500000; i += 1000)
{
    using (var db = new DbContext())
    {
        // Skip the chunks already handled, otherwise the same first 1000 addresses are taken every pass.
        var chunk = largeListOfAddress.Skip(i).Take(1000).Select(a => new Address { Id = a.Id });
        db.Address.RemoveRange(chunk);
        db.SaveChanges();
    }
}
See Rick Strahl's post on bulk inserts for more details
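For the first suggestion, a rough sketch of the plain-SQL route in EF6 (AddressContext stands in for your derived DbContext; the Addresses table and the WHERE clause are placeholders for whatever identifies the rows to delete):
using (var db = new AddressContext())
{
    // One set-based statement executed by the server; nothing is loaded or tracked in memory.
    db.Database.ExecuteSqlCommand("DELETE FROM Addresses WHERE IsStale = 1");
}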
Let's say I have a query with a very large result set (100,000+ rows) and I need to loop through the results and perform an update:
var ds = context.Where(/* query */).Select(e => new { /* fields */ });
foreach (var d in ds)
{
    // perform update
}
I'm fine with this process taking long time to execute but I have limited amount of memory on my server.
What happens in the foreach? Is the entire result fetched at once from the database?
Would it be better to use Skip and Take to do the update in portions?
The best way is to use Skip and Take, yes, and make sure that after each update batch you dispose the DataContext (by using "using").
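A minimal sketch of that pattern, assuming a LINQ to SQL style MyDataContext with a hypothetical NeedsUpdate filter and Id ordering key standing in for the original query:
const int batchSize = 1000;
for (int skip = 0; ; skip += batchSize)
{
    using (var context = new MyDataContext())
    {
        var batch = context.Entities
            .Where(e => e.NeedsUpdate)  // placeholder for the original query
            .OrderBy(e => e.Id)         // Skip/Take needs a stable ordering
            .Skip(skip)
            .Take(batchSize)
            .ToList();
        if (batch.Count == 0)
        {
            break;
        }
        foreach (var entity in batch)
        {
            // perform update
        }
        context.SubmitChanges(); // SaveChanges() if this is Entity Framework
    }
}
Note that if the update changes whether a row still matches the filter, keep skip at 0 and just process the first page repeatedly until it comes back empty.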
You could check out my question; it has a similar problem with a nice solution: Out of memory when creating a lot of objects C#
You basically abuse LINQ to SQL - it is not made for that.
All results are loaded into memory.
Your changes are written out once, after you are done.
This will be slow, and it will be - hm - using tons of memory. Given limited amounts of memory - not possible.
Do NOT load all data in at once. Try to run multiple queries with partial result sets (1000-2500 items each).
ORMs are not made for mass manipulation.
Could you not use a stored procedure to update everything in one go?
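For example, a sketch only (MyDataContext and dbo.UpdatePendingItems are placeholders; ExecuteCommand is the LINQ to SQL DataContext method for raw SQL):
using (var context = new MyDataContext())
{
    // A single set-based update executed by the database; no rows are materialized in the application.
    context.ExecuteCommand("EXEC dbo.UpdatePendingItems");
}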