I have a slight issue with the following scenario:
I'm given a list of ID values and need to run a SELECT query for each ID (the ID is a parameter), then combine all the result sets into one big one and return it to the caller.
Since the query might run for minutes per ID (that's another issue, but for the moment I take it as a given fact) and there can be thousands of IDs in the input, I tried to use tasks. With that approach I see a slow but steady increase in memory use.
As a test I also made a simple sequential solution; this has a normal memory usage graph but, as expected, is very slow. There's an increase while it's running, but then everything drops back to the normal level when it's finished.
Here's the skeleton of code:
public class RowItem
{
    public int ID { get; set; }
    public string Name { get; set; }
    // the rest of the properties
}
public List<RowItem> GetRowItems(List<int> customerIDs)
{
    var rowItems = new List<RowItem>();

    // this solution has the memory leak
    var tasks = new List<Task<List<RowItem>>>();
    foreach (var customerID in customerIDs)
    {
        var task = Task.Factory.StartNew(() => ProcessCustomerID(customerID));
        tasks.Add(task);
    }
    while (tasks.Any())
    {
        var index = Task.WaitAny(tasks.ToArray());
        var task = tasks[index];
        rowItems.AddRange(task.Result);
        tasks.RemoveAt(index);
    }

    // this works fine, but slow
    foreach (var customerID in customerIDs)
    {
        rowItems.AddRange(ProcessCustomerID(customerID));
    }

    return rowItems;
}
private List<RowItem> ProcessCustomerID(int customerID)
{
    var rowItems = new List<RowItem>();
    using (var conn = new OracleConnection("XXX"))
    {
        conn.Open();
        var sql = "SELECT * FROM ...";
        using (var command = new OracleCommand(sql, conn))
        using (var dataReader = command.ExecuteReader())
        using (var dataTable = new DataTable())
        {
            dataTable.Load(dataReader);
            rowItems = dataTable
                .Rows
                .OfType<DataRow>()
                .Select(row => new RowItem
                {
                    ID = Convert.ToInt32(row["ID"]),
                    Name = row["Name"].ToString(),
                    // the rest of the properties
                })
                .ToList();
        }
        conn.Close();
    }
    return rowItems;
}
What am I doing wrong when using tasks? According to this MSDN article, I don't need to bother disposing them manually, but there's barely anything else. I guess ProcessCustomerID is OK, as it's called in both variations.
Update:
To log the current memory usage I used Process.GetCurrentProcess().PrivateMemorySize64, but I noticed the problem in Task Manager >> Processes.
Using Entity Framework, your ProcessCustomerID method could look like this:
List<RowItem> rowItems;
using (var ctx = new OracleEntities())
{
    rowItems = ctx.Customer
        .Where(o => o.id == customerID)
        .Select(o => new RowItem
        {
            ID = o.id,
            Name = o.Name,
            // the rest of the properties
        })
        .ToList();
}
return rowItems;
Unless you are transferring large amounts of data such as images, video, or blobs, this should be near instantaneous even with around 1k rows in the result.
If it is unclear what is taking the time, and you use pre-10g Oracle, it is really hard to monitor this. However, if you use Entity Framework you can attach monitoring to it! http://www.hibernatingrhinos.com/products/efprof
At least a year ago, Oracle supported Entity Framework 5.
In the sequential version the queries are executed one by one; in the parallel version they literally all get started at the same time, consuming your resources and potentially creating deadlocks.
I don't think you have any evidence of a memory leak in the parallel execution.
Maybe garbage collection occurs at different times, and that's why you saw two different readings. You cannot expect it to release memory in real time; .NET garbage collection occurs only when required. Have a look at "Fundamentals of Garbage Collection".
Task Manager or Process.GetCurrentProcess().PrivateMemorySize64 may not be a very accurate way to find memory leaks. If you do use them, at least make sure you force a full garbage collection and wait for pending finalizers before reading the memory counters.
GC.Collect();
GC.WaitForPendingFinalizers();
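A minimal sketch of such a measurement, assuming you only want a rough before/after comparison (the helper name is made up for illustration):
private static long GetMemoryAfterFullCollect()
{
    // Force a full collection and wait for finalizers so the reading reflects
    // live objects rather than garbage that simply hasn't been collected yet.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    return Process.GetCurrentProcess().PrivateMemorySize64; // requires System.Diagnostics
}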
Related
In the following code in each iteration of the loop I do the following things:
Fetch 5000 entities from DB
According to the logic in FilterRelationsToDelete I decide which entities to delete
I add the ids of entities to delete to a collection
After I finish the loop I delete from DB the entities according to idsToDelete collection.
I saw in the Visual Studio Diagnostic Tools that the memory of the process rises at the beginning of each loop iteration and, after the iteration finishes, decreases by about half. My problem is that sometimes it rises to 800MB and drops to 400MB, sometimes it is steady at 200MB, and sometimes it goes over 1GB, drops to 500MB, and stays steady there.
I am not sure why my process memory is not steady at 200MB with small spikes when the data arrives from the DB. What might be the reasons for that? Maybe Entity Framework does not free all the memory it used? Maybe the GC I call here on purpose does not clean all the memory as I expected? Maybe I have a bug here that I am not aware of?
The memory used by the list of longs I accumulate in idsToDelete is almost zero, so that is not the problem.
Is there any way to write this code better?
private static void PlayWithMemory()
{
    int bucketSize = 5000;
    List<long> idsToDelete = new List<long>();
    for (int i = 0; i < 500; i++)
    {
        System.GC.Collect(); // added just for this example
        using (var context = new PayeeRelationsContext())
        {
            int toSkip = i * bucketSize;
            List<PayeeRelation> dbPayeeRelations = GetDBRelations(context, toSkip, bucketSize);
            var relationsToDelete = FilterRelationsToDelete(dbPayeeRelations);
            List<long> ids = relationsToDelete.Select(x => x.id).ToList();
            idsToDelete.AddRange(ids);
            Console.WriteLine($"i = {i}, toSkip = {toSkip}, payeeRelations.Count = {dbPayeeRelations.Count}");
        }
    }
}
private static List<PayeeRelation> GetDBRelations(PayeeRelationsContext context, int toSkip, int bucketSize)
{
    return context.PayeeRelations
        .OrderBy(x => x.id)
        .Include(x => x.PayeeRelation_PayeeVersion)
        .Skip(toSkip)
        .Take(bucketSize)
        .AsNoTracking()
        .ToList();
}
I don't see anything inherently wrong with your code to indicate a memory leak. I believe what you are observing is simply that the garbage collection does not fully "release" memory as soon as the references are deemed unused or out of scope.
If memory use/allocation is a concern then you should consider projecting down to the minimal viable data you need to validate in order to identify which IDs need to be deleted. For example, if you need the ID and Field1 from the PayeeRelations, then need Field2 and Field3 from the related PayeeVersion:
private class RelationValidationDetails
{
    public long PayeeRelationId { get; set; }
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public DateTime Field3 { get; set; }
}
....then in your query:
var validationData = context.PayeeRelations
    .OrderBy(x => x.id)
    .Select(x => new RelationValidationDetails
    {
        PayeeRelationId = x.id,
        Field1 = x.Field1,
        Field2 = x.PayeeRelation_PayeeVersion.Field2,
        Field3 = x.PayeeRelation_PayeeVersion.Field3
    })
    .Skip(toSkip)
    .Take(bucketSize)
    .ToList();
Then your validation just takes the above collection of validation details to determine which IDs need to be deleted (assuming it bases this decision on Fields 1-3). This ensures that your query returns back exactly the data needed to ultimately get the IDs to delete, minimizing memory growth.
There could be an argument that if a "Field4" is later required for the validation, you would have to update this object definition and revise the query, which is extra work when you could just use the entity. However, Field4 might not come from PayeeRelations or the PayeeVersion; it might come from a different related entity which currently isn't eager loaded. That would introduce the overhead of eager loading another table for every caller of that wrapped GetPayeeRelations call, whether they need that data or not. That, or you risk performance hits from lazy loading (removing the AsNoTracking()) or introduce conditional complexity to tell GetPayeeRelations which relationships need to be eager loaded. Trying to predict this possibility is really just an example of YAGNI.
I generally don't recommend hiding EF queries behind getter methods (such as generic repositories) simply because these tend to form a lowest common denominator while chasing DRY or SRP. The reality is that they end up being single points that are inefficient in many cases, because if any one consumer needs a relationship eager loaded, all consumers get it eager loaded. It's generally far better to let your consumers project down to exactly what they need rather than worry that similar (rather than identical) queries might appear in multiple places.
This question already has answers here:
multiple threads adding elements to one list. why are there always fewer items in the list than expected?
(2 answers)
Closed 2 years ago.
In my code I'm getting a list of menus from the database and mapping them to DTO objects.
Because of the nested children, I decided to use Parallel.ForEach for mapping the entities, but I bumped into a weird issue: when the ForEach is finished, some of the records are not mapped!
The number of missed records is different each time: sometimes one, sometimes more!
public List<TreeStructureDto> GetParentNodes()
{
    var data = new List<TreeStructureDto>();
    var result = MenuDLL.Instance.GetTopParentNodes();
    Parallel.ForEach(result, res =>
    {
        data.Add(new Mapper().Map(res));
    });
    return data;
}
But when I'm debugging I see that the number of items in my original data is 59, while after mapping the number of items in my final list is 58!
My mapper class is as follows:
public TreeStructureDto Map(Menu menu)
{
    return new TreeStructureDto()
    {
        id = menu.Id.ToString(),
        children = true,
        text = menu.Name,
        data = new MenuDto()
        {
            Id = menu.Id,
            Name = menu.Name,
            ParentId = menu.ParentId,
            Script = menu.Script,
            SiblingsOrder = menu.SiblingsOrder,
            systemGroups = menu.MenuSystemGroups.Select(x => Map(x)).ToList()
        }
    };
}
I appreciate your help in advance.
You are adding to a single list concurrently, which is not valid because List<T> is not thread-safe (most types are not thread-safe; this isn't a fault of List<T> - the fault is simply: never assume something is thread-safe unless you've checked).
If the bulk of the CPU work in that per-item callback is the new Mapper().Map(res) part, then you may be able to fix this with synchronization, i.e.
Parallel.ForEach(result, res =>
{
    var item = new Mapper().Map(res);
    lock (data)
    {
        data.Add(item);
    }
});
which prevents threads fighting while adding, but still allows the Map part to run concurrently and independently. Note that the order is going to be undefined, though; you might want some kind of data.Sort(...) after the Parallel.ForEach has finished.
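For instance, assuming SiblingsOrder is the property that defines the intended order (an assumption based on the property name, not something stated in the question), that sort could be:
data.Sort((a, b) => a.data.SiblingsOrder.CompareTo(b.data.SiblingsOrder));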
An alternative solution to locking inside a Parallel.ForEach would be to use PLINQ:
public List<TreeStructureDto> GetParentNodes()
{
var mapper = new Mapper();
return MenuDLL.Instance.GetTopParentNodes()
.AsParallel()
.Select(mapper.Map)
.ToList();
}
AsParallel uses multiple threads to perform the mappings, but no collection needs to be accessed via multiple threads concurrently.
As mentioned by Marc, this may or may not prove more efficient for your situation, so you should benchmark both approaches, as well as comparing to a single-threaded approach.
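If the original order returned by GetTopParentNodes matters, PLINQ can also preserve it at a small cost; a minor variation of the return statement in the sketch above:
return MenuDLL.Instance.GetTopParentNodes()
    .AsParallel()
    .AsOrdered()   // keep the source order in the resulting list
    .Select(mapper.Map)
    .ToList();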
Can I do threads instead of tasks to make this run faster?
I'm trying to get 114000 products into the database. As my code is right now I get about 100 products into the database a minute.
My tasks (producers) each scrape an XML file which contains product data, package it in the Product class, and then queue it for the consumer.
My consumer takes each product from the queue and puts it into the database one at a time. I use Entity Framework, so it's not safe for threading.
public static void GetAllProductsFromIndexes_AndPutInDB(List<IndexModel> indexes, ProductContext context)
{
    BlockingCollection<IndexModel> inputQueue = CreateInputQueue(indexes);
    BlockingCollection<Product> productsQueue = new BlockingCollection<Product>(5000);

    var consumer = Task.Run(() =>
    {
        foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
        {
            InsertProductInDB(readyProduct, context);
        }
    });

    var producers = Enumerable.Range(0, 100)
        .Select(_ => Task.Run(() =>
        {
            foreach (IndexModel index in inputQueue.GetConsumingEnumerable())
            {
                Product product = new Product();
                byte[] unconvertedByteArray;
                string xml;
                string url = @"https://data.Icecat.biz/export/freexml.int/en/";

                unconvertedByteArray = DownloadIcecatFile(index.IndexNumber.ToString() + ".xml", url);
                xml = Encoding.UTF8.GetString(unconvertedByteArray);

                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.LoadXml(xml);

                GetProductDetails(product, xmlDoc, index);

                XmlNodeList nodeList = xmlDoc.SelectNodes("ICECAT-interface/Product/ProductFeature");
                product.FeaturesLink = GetProductFeatures(product, nodeList);

                nodeList = xmlDoc.SelectNodes("ICECAT-interface/Product/ProductGallery/ProductPicture");
                product.Images = GetProductImages(nodeList);

                productsQueue.Add(product);
            }
        })).ToArray();

    Task.WaitAll(producers);
    productsQueue.CompleteAdding();
    consumer.Wait();
}
A couple of things you must do:
Detach each Product entity after you insert it, or they will all accumulate in the Change Tracker.
Don't call SaveChanges after every product. Batch up a hundred or so. Like this:
var consumer = Task.Run(() =>
{
    var batch = new List<Product>();
    foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
    {
        batch.Add(readyProduct);
        if (batch.Count >= 100)
        {
            context.Products.AddRange(batch);
            context.SaveChanges();
            foreach (var p in batch)
            {
                context.Entry(p).State = EntityState.Detached;
            }
            batch.Clear();
        }
    }
    context.Products.AddRange(batch);
    context.SaveChanges();
    foreach (var p in batch)
    {
        context.Entry(p).State = EntityState.Detached;
    }
});
If you're on EF Core and your provider supports it (like SQL Server), you'll even get statement batching. You should expect several hundred rows per second using basic best practices here. If you need more than that, you can switch to a bulk load API (like SqlBulkCopy for SQL Server).
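As a rough sketch of what the bulk-load route could look like (this assumes SQL Server; the method, table, and column names below are illustrative, not taken from the question):
using System.Data;
using System.Data.SqlClient;

static void BulkInsertProducts(IEnumerable<Product> products, string connectionString)
{
    // Stage the rows in a DataTable whose columns mirror the destination table.
    var table = new DataTable();
    table.Columns.Add("Name", typeof(string));          // assumed column
    table.Columns.Add("FeaturesLink", typeof(string));  // assumed column
    foreach (var p in products)
    {
        table.Rows.Add(p.Name, p.FeaturesLink);
    }

    using (var connection = new SqlConnection(connectionString))
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        connection.Open();
        bulkCopy.DestinationTableName = "dbo.Products";  // assumed table name
        bulkCopy.BatchSize = 5000;
        bulkCopy.WriteToServer(table);
    }
}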
First, read the speed rant to make sure this is even worth investigating.
Can I do threads instead of tasks to make this run faster?
Extremely unlikely. Multithreading has been used as a cheap way to implement multitasking for a while, but it is technically only useful if the task is CPU bound. You are doing a DB operation, which will be network bound, and more likely DB bound (databases apply additional bottlenecks as part of their reliability and concurrency-issue prevention).
I'm trying to get 114000 products into the db.
Then your best bet is not trying to do that in code. Every DBMS worth its memory footprint has bulk insert options. Doing that in C# code will just make it slower and less reliable.
At best you add the network load of sending the data to the DB to the whole operation. At worst, you make it even slower than that. It is one of the most common mistakes with DBs, thinking you can beat the DBMS's performance with code. It will not work.
I have the following code returning 22,000,000 records from the database pretty quick:
var records = from row in dataContext.LogicalMapTable
              select new
              {
                  FirmwareVersionID = row.FwId,
                  LogicalParameterDefinitionID = row.LpDefId,
                  FirmwareLogicalDefinitionMapID = row.FlDefMapID
              };
The code following the database call above takes over 60 seconds to run:
var cache = new Dictionary<string, int>();
foreach (var record in records)
{
    var tempHashCode = record.FirmwareVersionID + "." + record.LogicalParameterDefinitionID;
    cache.Add(tempHashCode, record.FirmwareLogicalDefinitionMapID);
}
return cache;
Is there a better way to do this to improve performance?
The second part of your code is not slow; it just triggers evaluation of the LINQ query. You can see this by consuming the query earlier, for example:
var records = (from row in dataContext.LogicalMapTable
               select new
               {
                   FirmwareVersionID = row.FwId,
                   LogicalParameterDefinitionID = row.LpDefId,
                   FirmwareLogicalDefinitionMapID = row.FlDefMapID
               }).ToList();
So it is your LINQ query that is slow, and here is how you can fix it.
You probably don't need 22M records cached in memory. Things you can try:
Pagination (Skip, Take); see the sketch after this list
Change queries to include specific ids or other columns. E.g. before select * ..., after select * ... where id in (1,2,3) ...
Do most of the analytic work at database, it's fast and doesn't take up your app memory
Prefer queries that bring small data batches fast. You can run several of these concurrently to update different bits of your UI
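For example, the pagination idea could look roughly like this (a sketch assuming the same dataContext as above; the page size is arbitrary and a stable ordering column is required for Skip/Take):
const int pageSize = 50000;
var cache = new Dictionary<string, int>();
for (int page = 0; ; page++)
{
    var batch = dataContext.LogicalMapTable
        .OrderBy(row => row.FlDefMapID)   // any stable ordering works
        .Skip(page * pageSize)
        .Take(pageSize)
        .Select(row => new { row.FwId, row.LpDefId, row.FlDefMapID })
        .ToList();

    if (batch.Count == 0) break;

    foreach (var record in batch)
    {
        cache.Add(record.FwId + "." + record.LpDefId, record.FlDefMapID);
    }
}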
As others have mentioned in comments, reading the entire list like that is very inefficient.
Based on the code you posted, I am assuming that after the list is loaded into your "cache", you look up the FirmwareLogicalDefinitionMapID using the key FirmwareVersionID + "." + LogicalParameterDefinitionID.
My suggestion to improve overall performance and memory usage is to implement an actual caching pattern, something like this:
// requires System.Runtime.Caching
public static class CacheHelper
{
    public static readonly object _SyncLock = new object();
    public static readonly MemoryCache _MemoryCache = MemoryCache.Default;

    public static int GetFirmwareLogicalDefinitionMapID(int firmwareVersionID, int logicalParameterDefinitionID)
    {
        int result = -1;

        // Build up the cache key
        string cacheKey = string.Format("{0}.{1}", firmwareVersionID, logicalParameterDefinitionID);

        // Check if the object is in the cache already
        if (_MemoryCache.Contains(cacheKey))
        {
            // It is, so read it and type cast it
            object cacheObject = _MemoryCache[cacheKey];
            if (cacheObject is int)
            {
                result = (int)cacheObject;
            }
        }
        else
        {
            // The object is not in the cache, acquire a sync lock for thread safety
            lock (_SyncLock)
            {
                // Double check that the object hasn't been put into the cache by another thread.
                if (!_MemoryCache.Contains(cacheKey))
                {
                    // Still not there, now query the database
                    // (dataContext is assumed to be accessible here)
                    result = (from i in dataContext.LogicalMapTable
                              where i.FwId == firmwareVersionID && i.LpDefId == logicalParameterDefinitionID
                              select i.FlDefMapID).FirstOrDefault();

                    // Add the result to the cache so that the next operation that asks for this object can read it from RAM
                    _MemoryCache.Add(new CacheItem(cacheKey, result), new CacheItemPolicy() { SlidingExpiration = new TimeSpan(0, 5, 0) });
                }
                else
                {
                    // We lost a concurrency race to read the object from source; it's in the cache now, so read it from there.
                    object cacheObject = _MemoryCache[cacheKey];
                    if (cacheObject is int)
                    {
                        result = (int)cacheObject;
                    }
                }
            }
        }

        // Return the result
        return result;
    }
}
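Usage then reduces to a single call per lookup, for example (the ID values here are made up):
int mapId = CacheHelper.GetFirmwareLogicalDefinitionMapID(12, 34);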
You should also read up on the .Net MemoryCache: http://www.codeproject.com/Articles/290935/Using-MemoryCache-in-Net-4-0
Hope this helps!
I have an SQL stored procedure that returns the TOP 1000 records from a table that functions like a queue; the table will hold roughly 30,000-40,000 records. The call to the SP takes ~4 seconds (there's an XML column), so finishing all the calls will take ~2 minutes.
I thought of using multi-threaded calls and inserting the records into a synchronized dictionary/list.
Has anyone done this before? Is there an efficient way to finish the calls as soon as possible?
Thanks...
Consider optimizing the query before resorting to threads.
In my experience, when beginners at multi-threading implement threads, it usually does not improve performance. Worse, it usually introduces subtle errors which can be difficult to debug.
Optimize the query first, and you may find that you don't need threads.
Even if you implemented them, eventually you'll have SQL Server doing too much work, and the threaded requests will simply have to wait.
The basic mistake is wanting to insert into the database from multiple threads, overloading the server with connections and locks, and eventually bringing it to its knees.
If you are READING the data, you will do much better if you find a query that will perform faster and fetch as much data as you can at once.
To me it seems like your problem is not solvable at this level; maybe if you elaborate on what you want to do, you'll get better advice.
EDIT:
I did use SQL as a queue once, and I just remembered: to dequeue, you'll have to use the result of the first query as input to the second, so threads are out of the question. Or you'll have to MARK your queued data as 'done' in the database, and your READ becomes an UPDATE, resulting in locking.
If you are reading and you want to react as soon as possible, you can use a DataReader, read ALL of the data, and chunk the processing into threads: read 100 records, fork a thread and pass them to it, then the next records, and so on. That way you'll be able to balance your resource usage.
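A minimal sketch of that reader-plus-chunks idea (the command text, the 100-record chunk size, and the MapRow/ProcessChunk helpers are placeholders, not from the answer):
var tasks = new List<Task>();
var chunk = new List<object>(100);

using (var connection = new SqlConnection("..."))
using (var command = new SqlCommand("EXEC GetTop1000FromQueue", connection))  // placeholder for your SP
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            chunk.Add(MapRow(reader));                      // MapRow: your own row-to-object mapping
            if (chunk.Count == 100)
            {
                var work = chunk;                           // hand the filled chunk to a worker
                chunk = new List<object>(100);
                tasks.Add(Task.Run(() => ProcessChunk(work)));
            }
        }
    }
}
if (chunk.Count > 0) tasks.Add(Task.Run(() => ProcessChunk(chunk)));
Task.WaitAll(tasks.ToArray());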
Try reading the data asynchronously using a DataReader; fetch the columns that uniquely identify the row in the database. Populate a queue holding the returned data values (custom objects) and run worker threads to perform the task against the queue.
You have to decide how many worker threads to implement for the task, as threads have their own overhead and, if not implemented correctly, can be a nightmare.
If you really have to, you can start BackgroundWorkers that individually make connections to the server and report back with their progress.
I did the same thing for an elaborate export/import application that moved roughly 50GB of data (4GB DeflateStream'ed), except that I only used the BackgroundWorker to do the work consecutively, not concurrently, without locking up the UI thread.
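A bare-bones sketch of that BackgroundWorker pattern (from System.ComponentModel; the loop body and progress payload are placeholders):
var worker = new BackgroundWorker { WorkerReportsProgress = true };

worker.DoWork += (sender, e) =>
{
    var w = (BackgroundWorker)sender;
    for (int i = 0; i < 100; i++)
    {
        // open a connection and move one slice of the data here
        w.ReportProgress(i + 1);
    }
};

worker.ProgressChanged += (sender, e) =>
{
    // raised on the creating thread when a synchronization context is present (e.g. the UI thread)
    Console.WriteLine("{0}% done", e.ProgressPercentage);
};

worker.RunWorkerAsync();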
It isn't clear if you're selecting the 1000 most recently added rows, or the 1000 rows with the highest value in a particular column, nor is it clear whether your rows are mutable -- i.e. a row might qualify for the top 1000 yesterday but then get updated so that it no longer qualifies today. But if the individual rows are not mutable, you could have a separate table for the TOP1000, and when the 1001st row is inserted into it, an after-insert trigger would move the 1001st row (however you determine that row) to a HISTORY table. That would make the selection virtually instantaneous: select * from TOP1000. You'd simply combine the two tables with a UNION when you need to query the TOP1000 and HISTORY as though they were one table. Or instead of trigger you could wrap the insert and 1001st-row delete in a transaction.
Different can of worms, though, if the rows mutate, and can move in and out of the top 1000.
public struct BillingData
{
    public int CustomerTrackID, CustomerID;
    public DateTime BillingDate;
}

public Queue<BillingData> customerQueue = new Queue<BillingData>();

volatile static int ThreadProcessCount = 0;
readonly static object threadprocesslock = new object();
readonly static object queuelock = new object();
readonly static object countlock = new object();
AsyncCallback asyncCallback;
// Pulling the Data Aync from the Database
private void StartProcess()
{
SqlCommand command = SQLHelper.GetCommand("GetRecordsByBillingTrackID");
command.Connection = SQLHelper.GetConnection("Con");
SQLHelper.DeriveParameters(command);
command.Parameters["@TrackID"].Value = trackid;
asyncCallback = new AsyncCallback(FetchData);
command.BeginExecuteXmlReader(asyncCallback, command);
}
public void FetchData(IAsyncResult c1)
{
SqlCommand comm1 = (SqlCommand)c1.AsyncState;
System.Xml.XmlReader xr = comm1.EndExecuteXmlReader(c1);
xr.Read();
string data = "";
while (!xr.EOF)
{
data = xr.ReadOuterXml();
XmlDocument dom = new XmlDocument();
dom.LoadXml("<data>" + data + "</data>");
BillingData billingData;
billingData.CustomerTrackID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerTrackID"].Value);
billingData.CustomerID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerID"].Value);
billingData.BillingDate = Convert.ToDateTime(dom.FirstChild.ChildNodes[0].Attributes["BillingDate"].Value);
lock (queuelock)
{
if (!customerQueue.Contains(billingData))
{
customerQueue.Enqueue(billingData);
}
}
AssignThreadProcessToTheCustomer();
}
xr.Close();
}
// Assign the Threads based on the data pulled
private void AssignThreadProcessToTheCustomer()
{
int TotalProcessThreads = 5;
int TotalCustomersPerThread = 5;
if (ThreadProcessCount < TotalProcessThreads)
{
int ThreadsNeeded = (customerQueue.Count % TotalCustomersPerThread == 0) ? (customerQueue.Count / TotalCustomersPerThread) : (customerQueue.Count / TotalCustomersPerThread + 1);
int count = 0;
if (ThreadsNeeded > ThreadProcessCount)
{
count = ThreadsNeeded - ThreadProcessCount;
if ((count + ThreadProcessCount) > TotalProcessThreads)
count = TotalProcessThreads - ThreadProcessCount;
}
for (int i = 0; i < count; i++)
{
ThreadProcess objThreadProcess = new ThreadProcess(this);
ThreadPool.QueueUserWorkItem(objThreadProcess.BillingEngineThreadPoolCallBack, count);
lock (threadprocesslock)
{
ThreadProcessCount++;
}
}
}
}
public void BillingEngineThreadPoolCallBack(object threadContext)
{
BillingData? billingData = null;
while (true)
{
lock (queuelock)
{
billingData = ProcessCustomerQueue();
}
if (billingData != null)
{
StartBilling(billingData.Value);
}
else
break;
More....
}