I have the following code returning 22,000,000 records from the database pretty quickly:
var records = from row in dataContext.LogicalMapTable
select
new
{
row.FwId,
row.LpDefId,
row.FlDefMapID
};
The code following the database call above takes over 60 seconds to run:
var cache = new Dictionary<string, int>();
foreach (var record in records)
{
var tempHashCode = record.FwId + "." + record.LpDefId;
cache.Add(tempHashCode, record.FlDefMapID);
}
return cache;
Is there a better way to do this to improve performance?
The second part of your code is not slow; it just triggers evaluation of the LINQ query. You can see this by consuming your query earlier, for example:
var records = (from row in dataContext.LogicalMapTable
select
new
{
row.FwId,
row.LpDefId,
row.FlDefMapID
}).ToList();
So it is your LINQ query that is slow, and here is how you can fix it.
You probably don't need 22M records cached in memory. Things you can try:
Pagination (Skip/Take); a rough sketch follows this list
Filter your queries down to specific IDs or other columns, e.g. instead of select * ... use select * ... where id in (1, 2, 3) ...
Do most of the analytic work in the database; it's fast and doesn't take up your app's memory
Prefer queries that return small batches of data quickly. You can run several of these concurrently to update different parts of your UI
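For the pagination option, a rough sketch of what batched reading might look like, reusing the names from the question (the page size is an arbitrary assumption, and Skip/Take needs a stable ordering):
var cache = new Dictionary<string, int>();
const int pageSize = 500000;   // arbitrary; tune for your environment
int skip = 0;
while (true)
{
    // Skip/Take requires a deterministic order, otherwise pages can overlap
    var page = dataContext.LogicalMapTable
        .OrderBy(row => row.FlDefMapID)
        .Select(row => new { row.FwId, row.LpDefId, row.FlDefMapID })
        .Skip(skip)
        .Take(pageSize)
        .ToList();
    if (page.Count == 0)
        break;
    foreach (var record in page)
        cache.Add(record.FwId + "." + record.LpDefId, record.FlDefMapID);
    skip += pageSize;
}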
As others have mentioned in comments, reading the entire list like that is very inefficient.
Based on the code you posted, I am assuming that after the list is loaded into your "cache", you look up the FirmwareLogicalDefinitionMapID using a key of FirmwareVersionID + "." + LogicalParameterDefinitionID (FwId + "." + LpDefId in the posted code).
My suggestion to improve overall performance and memory usage is to implement an actual caching pattern, something like this:
public static class CacheHelper
{
    public static readonly object _SyncLock = new object();
    public static readonly MemoryCache _MemoryCache = MemoryCache.Default;

    public static int GetFirmwareLogicalDefinitionMapID(int firmwareVersionID, int logicalParameterDefinitionID)
    {
        int result = -1;

        // Build up the cache key
        string cacheKey = string.Format("{0}.{1}", firmwareVersionID, logicalParameterDefinitionID);

        // Check if the object is in the cache already
        if (_MemoryCache.Contains(cacheKey))
        {
            // It is, so read it and type cast it
            object cacheObject = _MemoryCache[cacheKey];
            if (cacheObject is int)
            {
                result = (int)cacheObject;
            }
        }
        else
        {
            // The object is not in the cache, acquire a sync lock for thread safety
            lock (_SyncLock)
            {
                // Double check that the object hasn't been put into the cache by another thread.
                if (!_MemoryCache.Contains(cacheKey))
                {
                    // Still not there, now query the database
                    // (dataContext is assumed to be available here, e.g. a field or a freshly created context)
                    result = (from i in dataContext.LogicalMapTable
                              where i.FwId == firmwareVersionID && i.LpDefId == logicalParameterDefinitionID
                              select i.FlDefMapID).FirstOrDefault();

                    // Add the result to the cache so that the next request for this key can read it from RAM
                    _MemoryCache.Add(new CacheItem(cacheKey, result), new CacheItemPolicy() { SlidingExpiration = new TimeSpan(0, 5, 0) });
                }
                else
                {
                    // We lost a concurrency race to read the object from source; it's in the cache now, so read it from there.
                    object cacheObject = _MemoryCache[cacheKey];
                    if (cacheObject is int)
                    {
                        result = (int)cacheObject;
                    }
                }
            }
        }

        // Return the result
        return result;
    }
}
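Callers then just ask for the ID they need and let the cache fill itself on demand; the argument values below are made up:
// First call for a given pair hits the database; later calls within the
// five-minute sliding window are served straight from MemoryCache.
int mapId = CacheHelper.GetFirmwareLogicalDefinitionMapID(3, 17);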
You should also read up on the .Net MemoryCache: http://www.codeproject.com/Articles/290935/Using-MemoryCache-in-Net-4-0
Hope this helps!
I have a performance problem on certain computers with the following query:
System.Diagnostics.EventLog log = new System.Diagnostics.EventLog("Application");
var entries = log.Entries
.Cast<System.Diagnostics.EventLogEntry>()
.Where(x => x.EntryType == System.Diagnostics.EventLogEntryType.Error)
.OrderByDescending(x => x.TimeGenerated)
.Take(cutoff)
.Select(x => new
{
x.Index,
x.TimeGenerated,
x.EntryType,
x.Source,
x.InstanceId,
x.Message
}).ToList();
Apparently ToList() can be quite slow in certain queries, but with what should I replace it?
The log.Entries collection works like this: it knows the total number of events (log.Entries.Count), and when you access an individual element it issues a query to fetch that element.
That means that when you enumerate the whole Entries collection, it queries for each individual element, so there will be Count queries. The structure of your LINQ query (the OrderBy, for example) forces full enumeration of that collection. As you already know, that's very inefficient.
It is much more efficient to query only the log entries you need. For that you can use the EventLogQuery class. Suppose you have a simple class to hold event info details:
private class EventLogInfo {
public int Id { get; set; }
public string Source { get; set; }
public string Message { get; set; }
public DateTime? Timestamp { get; set; }
}
Then you can convert your inefficient LINQ query like this:
// query Application log, only entries with Level = 2 (that's error)
var query = new EventLogQuery("Application", PathType.LogName, "*[System/Level=2]");
// reverse default sort, by default it sorts oldest first
// but we need newest first (OrderByDescending(x => x.TimeGenerated))
query.ReverseDirection = true;
var events = new List<EventLogInfo>();
// analog of Take
int cutoff = 100;
using (var reader = new EventLogReader(query)) {
while (true) {
using (var next = reader.ReadEvent()) {
if (next == null)
// we are done, no more events
break;
events.Add(new EventLogInfo {
Id = next.Id,
Source = next.ProviderName,
Timestamp = next.TimeCreated,
Message = next.FormatDescription()
});
cutoff--;
if (cutoff == 0)
// we are done, took as much as we need
break;
}
}
}
It will be 10-100 times faster. However, this API is lower-level and returns instances of EventRecord (not EventLogEntry), so some information may have to be obtained in a different way than with EventLogEntry.
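For example, the EntryType information is exposed differently on EventRecord; here is a small sketch of reading it (property names are from System.Diagnostics.Eventing.Reader, verify against your target framework):
// EventRecord has no EntryType; it exposes a numeric Level plus a localized
// display name instead (Level 2 corresponds to Error in the classic logs).
private static string DescribeLevel(EventRecord record)
{
    return record.Level.HasValue
        ? string.Format("{0} ({1})", record.LevelDisplayName, record.Level.Value)
        : "Unknown";
}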
If you decide that you absolutely must use log.Entries and EventLogEntry, then at least enumerate Entries backwards. The newest events are at the end (it's sorted by timestamp ascending), and you need the top X errors by timestamp descending.
EventLog log = new System.Diagnostics.EventLog("Application");
int cutoff = 100;
var events = new List<EventLogEntry>();
for (int i = log.Entries.Count - 1; i >= 0; i--) {
// note that line below might throw ArgumentException
// if, for example, entries were deleted in the middle
// of our loop. That's rare condition, but robust code should handle it
var next = log.Entries[i];
if (next.EntryType == EventLogEntryType.Error) {
// add what you need here
events.Add(next);
// got as much as we need, break
if (events.Count == cutoff)
break;
}
}
That is less efficient, but should still be about 10 times faster than your current approach. It is faster because the Entries collection is not materialized in memory: individual elements are queried as you access them, and when enumerating backwards in your specific case there is a good chance of querying far fewer elements.
I've simplified things as much as possible. This is reading from a table that has around 3,000,000 rows. I want to create a Dictionary from some concatenated fields of the data.
Here's the code that, in my opinion, should never, ever throw an Out Of Memory Exception:
public int StupidFunction()
{
var context = GetContext();
int skip = 0;
int take = 100000;
var batch = context.VarsHG19.OrderBy(v => v.Id).Skip(skip).Take(take);
while (batch.Any())
{
batch.ToList();
skip += take;
batch = context.VarsHG19.OrderBy(v => v.Id).Skip(skip).Take(take);
}
return 1;
}
In my opinion, the batch object should simply be replaced each iteration and the previous memory allocated for the previous batch object should be garbage collected. I would expect that the loop in this function should take a nearly static amount of memory. At the very worst, it should be bounded by the memory needs of one row * 100,000. The Max size of a row from this table is 540 bytes. I removed Navigation Properties from the edmx.
You can turn off tracking using AsNoTracking. Why not use a foreach loop on a filtered IEnumerable from the DbSet, though? You can also help by returning only what you need via an anonymous type in Select(). – Igor
Thanks for the answer, Igor.
public int StupidFunction()
{
var context = GetContext();
int skip = 0;
int take = 100000;
var batch = context.VarsHG19.AsNoTracking().OrderBy(v => v.Id).Skip(skip).Take(take);
while (batch.Any())
{
batch.ToList();
skip += take;
batch = context.VarsHG19.AsNoTracking().OrderBy(v => v.Id).Skip(skip).Take(take);
}
return 1;
}
No Out of Memory Exception.
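For completeness, here is a rough sketch of the rest of Igor's suggestion (stream with foreach and project only what you need). The FieldA/FieldB columns, the key format, and the Id type are placeholders for the real schema:
public Dictionary<string, long> BuildDictionary()
{
    var context = GetContext();
    var result = new Dictionary<string, long>();
    // AsNoTracking stops the context from keeping a reference to every entity,
    // and the projection avoids materializing full rows.
    var rows = context.VarsHG19
        .AsNoTracking()
        .Select(v => new { v.Id, v.FieldA, v.FieldB });   // FieldA/FieldB are placeholders
    foreach (var row in rows)   // streams results; no giant list in memory
    {
        result.Add(row.FieldA + "." + row.FieldB, row.Id);
    }
    return result;
}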
You are not assigning the query's result to anything, so how is C# supposed to know what can be cleaned up to make room for new allocations?
batch is a query and does not itself contain anything; calling .ToList() is what executes the query and returns the records.
public int StupidFunction()
{
var context = GetContext();
int skip = 0;
int take = 100000;
var batch = context.VarsHG19.OrderBy(v => v.Id).Skip(skip).Take(take).ToList();
while (batch.Any())
{
skip += take;
batch = context.VarsHG19.OrderBy(v => v.Id).Skip(skip).Take(take).ToList();
}
return 1;
}
I have a slight issue with the following scenario:
I'm given a list of ID values; for each ID I need to run a SELECT query (where the ID is a parameter), then combine all the result sets into one big one and return it to the caller.
Since the query might run for minutes per ID (that's another issue, but at the moment I consider it a given fact) and there can be thousands of IDs in the input, I tried to use tasks. With that approach I see a slow but steady increase in memory use.
As a test, I made a simple sequential solution too, which has a normal memory usage graph, but as expected it is very slow. Memory increases while it's running, but then everything drops back to the normal level when it's finished.
Here's the skeleton of code:
public class RowItem
{
public int ID { get; set; }
public string Name { get; set; }
//the rest of the properties
}
public List<RowItem> GetRowItems(List<int> customerIDs)
{
    var rowItems = new List<RowItem>();

    // this solution has the memory leak
    var tasks = new List<Task<List<RowItem>>>();
    foreach (var customerID in customerIDs)
    {
        var task = Task.Factory.StartNew(() => ProcessCustomerID(customerID));
        tasks.Add(task);
    }
    while (tasks.Any())
    {
        var index = Task.WaitAny(tasks.ToArray());
        var task = tasks[index];
        rowItems.AddRange(task.Result);
        tasks.RemoveAt(index);
    }

    // this works fine, but slow
    foreach (var customerID in customerIDs)
    {
        rowItems.AddRange(ProcessCustomerID(customerID));
    }

    return rowItems;
}
private List<RowItem> ProcessCustomerID(int customerID)
{
var rowItems = new List<RowItem>();
using (var conn = new OracleConnection("XXX"))
{
conn.Open();
var sql = "SELECT * FROM ...";
using (var command = new OracleCommand(sql, conn))
{
using (var dataReader = command.ExecuteReader())
{
using (var dataTable = new DataTable())
{
dataTable.Load(dataReader);
rowItems = dataTable
.Rows
.OfType<DataRow>()
.Select(
row => new RowItem
{
ID = Convert.ToInt32(row["ID"]),
Name = row["Name"].ToString(),
//the rest of the properties
})
.ToList();
}
}
}
conn.Close();
}
return rowItems;
}
What am I doing wrong when using tasks? According to this MSDN article, I don't need to bother disposing them manually, but there's barely anything else. I guess ProcessCustomerID is OK, as it's called in both variations.
update
To log the current memory usage I used Process.GetCurrentProcess().PrivateMemorySize64, but I noticed the problem in Task Manager >> Processes
Using Entity Framework, your ProcessCustomerID method could look like this:
List<RowItem> rowItems;
using (var ctx = new OracleEntities())
{
    rowItems = ctx.Customer
        .Where(o => o.id == customerID)
        .Select(o => new RowItem
        {
            ID = o.id,
            Name = o.Name,
            //the rest of the properties
        })
        .ToList();
}
return rowItems;
Unless you are transferring large amounts of data such as images, video, or blobs, this should be near-instantaneous with 1k rows as the result.
If it is unclear what is taking the time, and you use pre-10g Oracle, it is really hard to monitor this. However, if you use Entity Framework you can attach monitoring to it: http://www.hibernatingrhinos.com/products/efprof
As of at least a year ago, Oracle supported Entity Framework 5.
When they run sequentially they are executed one by one; in parallel they are literally all started at the same time, consuming your resources and potentially creating deadlocks.
I don't think you have any evidence of a memory leak in the parallel execution.
Maybe garbage collection occurs at different times, and that's why you saw two different readings. You cannot expect it to release memory in real time; .NET garbage collection occurs only when required. Have a look at "Fundamentals of Garbage Collection".
Task Manager or Process.GetCurrentProcess().PrivateMemorySize64 may not be a very accurate way to find memory leaks. If you use them, at least make sure you force a full garbage collection and wait for pending finalizers before reading the memory counters.
GC.Collect();
GC.WaitForPendingFinalizers();
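Putting that together, a minimal sketch of a more stable way to sample managed memory (the helper name is just for illustration):
// Force a full collection and let finalizers run before sampling; otherwise
// the numbers mostly reflect whenever the GC last happened to run.
static long SampleManagedBytes()
{
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    return GC.GetTotalMemory(true);   // true forces another full collection
}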
I have a SQL stored procedure that returns the TOP 1000 records from a table that functions like a queue; the table will hold roughly 30,000-40,000 records. A call to the SP takes ~4 seconds (there's an XML column), so finishing all the calls takes ~2 minutes.
I thought about using multi-threaded calls and inserting the records into a synchronized dictionary/list.
Has anyone done this before? Is there an efficient way to finish the calls as soon as possible?
Thanks...
Consider optimizing the query before resorting to threads.
In my experience, when beginners at multi-threading implement threads, it usually does not improve performance. Worse, it usually introduces subtle errors which can be difficult to debug.
Optimize the query first, and you may find that you don't need threads.
Even if you implemented them, eventually you'll have SQL Server doing too much work, and the threaded requests will simply have to wait.
A basic mistake is wanting to insert into the database from multiple threads, overloading the server with connections and locks and eventually bringing it to its knees.
If you are READING the data, you will do much better if you find a query that performs faster and fetches as much data as you can at once.
To me, it seems like your problem is not solvable at this level; if you elaborate on what you want to do, you'll probably get better advice.
EDIT:
I did use SQL as a queue once, and I just remembered: to dequeue, you'll have to use the result of the first query as input to the second, so threads are out of the question. Or you'll have to MARK your queued data as 'done' in the database, and your READ becomes an UPDATE, which results in locking.
If you are reading, and you want to react as soon as possible, you can use a DataReader, read ALL of the data, and chunk your processing into threads: read 100 records, fork a thread and hand them to it, then the next records, and so on. That way you'll be able to balance your resource usage.
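A rough sketch of that read-then-chunk idea; the connection string, SQL text, chunk size, and ProcessChunk method are all placeholders:
// Requires System.Data.SqlClient, System.Collections.Generic and
// System.Threading.Tasks. Reads everything once, then fans the work out.
var rows = new List<object[]>();
using (var conn = new SqlConnection("..."))
using (var cmd = new SqlCommand("SELECT TOP 1000 * FROM QueueTable", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            var values = new object[reader.FieldCount];
            reader.GetValues(values);
            rows.Add(values);
        }
    }
}

const int chunkSize = 100;
var workers = new List<Task>();
for (int i = 0; i < rows.Count; i += chunkSize)
{
    var chunk = rows.GetRange(i, Math.Min(chunkSize, rows.Count - i));
    workers.Add(Task.Run(() => ProcessChunk(chunk)));   // ProcessChunk is hypothetical
}
Task.WaitAll(workers.ToArray());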
Try reading the data asynchronously using a DataReader and fetch the columns that uniquely identify the row in the database. Populate a queue with the returned data values (custom objects) and run worker threads to perform the task against that queue.
You have to decide how many worker threads to use; threads have their own overhead and, if not implemented correctly, can be a nightmare.
If you really have to, you can start BackgroundWorkers that individually make connections to the server and report back their progress.
I did the same thing for an elaborate export/import application that moved roughly 50 GB of data (4 GB DeflateStream'ed), except I only used the BackgroundWorker to do the work consecutively, not concurrently, without locking up the UI thread.
It isn't clear if you're selecting the 1000 most recently added rows, or the 1000 rows with the highest value in a particular column, nor is it clear whether your rows are mutable -- i.e. a row might qualify for the top 1000 yesterday but then get updated so that it no longer qualifies today. But if the individual rows are not mutable, you could have a separate table for the TOP1000, and when the 1001st row is inserted into it, an after-insert trigger would move the 1001st row (however you determine that row) to a HISTORY table. That would make the selection virtually instantaneous: select * from TOP1000. You'd simply combine the two tables with a UNION when you need to query the TOP1000 and HISTORY as though they were one table. Or instead of trigger you could wrap the insert and 1001st-row delete in a transaction.
Different can of worms, though, if the rows mutate, and can move in and out of the top 1000.
public struct BillingData
{
public int CustomerTrackID, CustomerID;
public DateTime BillingDate;
}
public Queue<BillingData> customerQueue = new Queue<BillingData>();
volatile static int ThreadProcessCount = 0;
readonly static object threadprocesslock = new object();
readonly static object queuelock = new object();
readonly static object countlock = new object();
AsyncCallback asyncCallback;
// Pulling the Data Aync from the Database
private void StartProcess()
{
SqlCommand command = SQLHelper.GetCommand("GetRecordsByBillingTrackID");
command.Connection = SQLHelper.GetConnection("Con");
SQLHelper.DeriveParameters(command);
command.Parameters["@TrackID"].Value = trackid;
asyncCallback = new AsyncCallback(FetchData);
command.BeginExecuteXmlReader(asyncCallback, command);
}
public void FetchData(IAsyncResult c1)
{
SqlCommand comm1 = (SqlCommand)c1.AsyncState;
System.Xml.XmlReader xr = comm1.EndExecuteXmlReader(c1);
xr.Read();
string data = "";
while (!xr.EOF)
{
data = xr.ReadOuterXml();
XmlDocument dom = new XmlDocument();
dom.LoadXml("<data>" + data + "</data>");
BillingData billingData;
billingData.CustomerTrackID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerTrackID"].Value);
billingData.CustomerID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerID"].Value);
billingData.BillingDate = Convert.ToDateTime(dom.FirstChild.ChildNodes[0].Attributes["BillingDate"].Value);
lock (queuelock)
{
if (!customerQueue.Contains(billingData))
{
customerQueue.Enqueue(billingData);
}
}
AssignThreadProcessToTheCustomer();
}
xr.Close();
}
// Assign the Threads based on the data pulled
private void AssignThreadProcessToTheCustomer()
{
int TotalProcessThreads = 5;
int TotalCustomersPerThread = 5;
if (ThreadProcessCount < TotalProcessThreads)
{
int ThreadsNeeded = (customerQueue.Count % TotalCustomersPerThread == 0) ? (customerQueue.Count / TotalCustomersPerThread) : (customerQueue.Count / TotalCustomersPerThread + 1);
int count = 0;
if (ThreadsNeeded > ThreadProcessCount)
{
count = ThreadsNeeded - ThreadProcessCount;
if ((count + ThreadProcessCount) > TotalProcessThreads)
count = TotalProcessThreads - ThreadProcessCount;
}
for (int i = 0; i < count; i++)
{
ThreadProcess objThreadProcess = new ThreadProcess(this);
ThreadPool.QueueUserWorkItem(objThreadProcess.BillingEngineThreadPoolCallBack, count);
lock (threadprocesslock)
{
ThreadProcessCount++;
}
}
}
}
// Worker callback, presumably on the ThreadProcess class queued above
public void BillingEngineThreadPoolCallBack(object threadContext)
{
BillingData? billingData = null;
while (true)
{
lock (queuelock)
{
billingData = ProcessCustomerQueue();
}
if (billingData != null)
{
StartBilling(billingData.Value);
}
else
break;
More....
}
I'm fetching data from all 3 tables at once to avoid network latency. Fetching the data is pretty fast, but when I loop through the results, a lot of time is spent.
Int32[] arr = { 1 };
var query = from a in arr
select new
{
Basket = from b in ent.Basket
where b.SUPERBASKETID == parentId
select new
{
Basket = b,
ObjectTypeId = 0,
firstObjectId = "-1",
},
BasketImage = from b in ent.Image
where b.BASKETID == parentId
select new
{
Image = b,
ObjectTypeId = 1,
CheckedOutBy = b.CHECKEDOUTBY,
firstObjectId = b.FIRSTOBJECTID,
ParentBasket = (from parentBasket in ent.Basket
where parentBasket.ID == b.BASKETID
select parentBasket).ToList()[0],
},
BasketFile = from b in ent.BasketFile
where b.BASKETID == parentId
select new
{
BasketFile = b,
ObjectTypeId = 2,
CheckedOutBy = b.CHECKEDOUTBY,
firstObjectId = b.FIRSTOBJECTID,
ParentBasket = (from parentBasket in ent.Basket
where parentBasket.ID == b.BASKETID
select parentBasket),
}
};
//Exception handling
var mixedElements = query.First();
ICollection<BasketItem> basketItems = new Collection<BasketItem>();
//Here 15 millis has been used
//only 6 elements were found
if (mixedElements.Basket.Count() > 0)
{
foreach (var mixedBasket in mixedElements.Basket){}
}
if (mixedElements.BasketFile.Count() > 0)
{
foreach (var mixedBasketFile in mixedElements.BasketFile){}
}
if (mixedElements.BasketImage.Count() > 0)
{
foreach (var mixedBasketImage in mixedElements.BasketImage){}
}
//the empty loops take 811 millis!!
Why are you bothering to check the counts before the foreach statements? If there are no results, the foreach will just finish immediately.
Your queries are actually all being deferred - they'll be executed as and when you ask for the data. Don't forget that your outermost query is a LINQ to Objects query: it's just returning the result of calling ent.Basket.Where(...).Select(...) etc... which doesn't actually execute the query.
Your plan to do all three queries in one go isn't actually working. However, by asking for the count separately, you may actually be executing each database query twice - once just getting the count and once for the results.
I strongly suggest that you get rid of the "optimizations" in this code which are making it much more complicated and slower than just writing the simplest code you can.
I don't know of any way of getting LINQ to SQL (or LINQ to EF) to execute multiple queries in a single call - but this approach certainly isn't going to do it.
One other minor hint which is irrelevant in this case, but can be useful in LINQ to Objects - if you want to find out whether there's any data in a collection, just use Any() instead of Count() > 0 - that way it can stop as soon as it's found anything.
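For illustration, one possible reading of "the simplest code you can", reusing parentId, ent, and the column names from the question:
// Three straightforward queries, each executed exactly once when ToList() runs.
var baskets = ent.Basket
    .Where(b => b.SUPERBASKETID == parentId)
    .ToList();

var images = ent.Image
    .Where(i => i.BASKETID == parentId)
    .ToList();

var files = ent.BasketFile
    .Where(f => f.BASKETID == parentId)
    .ToList();

foreach (var basket in baskets) { /* ... */ }
foreach (var image in images) { /* ... */ }
foreach (var file in files) { /* ... */ }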
You're using IEnumerable in the foreach loop. Implementations only have to prepare data when it's asked for. In this way, I'd suggest that the above code is accessing your data lazily -- that is, only when you enumerate the items (which actually happens when you call Count().)
Put a System.Diagnostics.Stopwatch around the call to Count() and see whether that's taking the bulk of the time you're seeing.
I can't comment further here because you don't specify the type of ent in your code sample.
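A minimal way to check that, reusing the mixedElements variable from the question:
var sw = System.Diagnostics.Stopwatch.StartNew();
int imageCount = mixedElements.BasketImage.Count();   // likely triggers the real database query
sw.Stop();
Console.WriteLine("Count() took {0} ms", sw.ElapsedMilliseconds);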