why am I getting a System.OutOfMemoryException - c#

I am using Linq to Sql.
Here is the code:
Dictionary<string, int> allResults;
using (var dc= new MyDataContext())
{
dc.CommandTimeout = 0;
allResults = dc.MyTable.ToDictionary(x => x.Text, x => x.Id);
}
It is run on a 64-bit machine and the project is compiled as AnyCPU. It throws a System.OutOfMemoryException.
This accesses a SQL Server database. The Id field maps to a SQL Server int column, and the Text field maps to an nvarchar(max) column. Running select COUNT(*) from TableName returns 1,173,623 records, and running select sum(len(Text)) from TableName returns 48,915,031. Since int is a 32-bit integer, the ids should take only about 4.69 MB of space and the strings less than 1 GB, so we are not even bumping against the 2 GB per-object limit.
I then change the code in this way:
Dictionary<string, int> allResults;
using (var dc = new MyDataContext())
{
Dictionary<string, int> threeHundred;
dc.CommandTimeout = 0;
var tot = dc.MyTable.Count();
allResults = new Dictionary<string, int>(tot);
int skip = 0;
int takeThis = 300000;
while (skip < tot)
{
threeHundred = dc.MyTable.Skip(skip).Take(takeThis).ToDictionary(x => x.Text, x => x.Id);
skip = skip + takeThis;
allResults = allResults.Concat(threeHundred).ToDictionary(x => x.Key, x => x.Value);
threeHundred = null;
GC.Collect();
}
}
I learned that garbage collection here does not help, and that the OutOfMemoryException is thrown on the first line inside the while loop once skip = 900,000.
What is wrong and how do I fix this?

Without getting into your calculations of how much it should take in memory (as there could be issues of encoding that could easily double the size of the data), I'll try to give a few pointers.
Starting with the cause of the issue - my guess is that the threeHundred dictionary is causing a lot of allocations.
When you add items to a dictionary like that, the dictionary has no way of knowing how many items it should pre-allocate, which causes massive re-allocation and copying of all the data into newly created internal arrays.
Please set an initial capacity (using the constructor) for the threeHundred dictionary before adding any items to it.
Please read this article I've published which goes in-depth into Dictionary internals - I'm sure it will shed some light on those symptoms.
http://www.codeproject.com/Articles/500644/Understanding-Generic-Dictionary-in-depth
In addition, when populating this large an amount of data, I suggest fully controlling the process.
My suggestion:
Pre-allocate slots in the Dictionary (using a COUNT query directly on the DB, and passing the result to the Dictionary ctor).
Work with a DataReader to populate those items without loading the whole query result into memory (see the sketch after this list).
Consider string.Intern only if you know for a fact (and it is VERY important to know this in advance) that there are many duplicated strings - and test to see how it behaves.
Memory-profile the code - you should see only ONE allocation for the Dictionary, plus one string per item from the query (int is a value type, therefore it is not allocated on the heap as a separate object but sits inside the Dictionary's entries).
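A minimal sketch of the first two points, assuming a plain connection string and the TableName/Id/Text names from the question (adjust them to your real schema). Note that, unlike ToDictionary, the indexer assignment below silently keeps the last Id for a duplicated Text value instead of throwing:
using System.Collections.Generic;
using System.Data.SqlClient;

public static class MyTableLoader
{
    public static Dictionary<string, int> LoadAll(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // 1. Size the dictionary up front so it never has to resize and re-hash.
            int total;
            using (var countCommand = new SqlCommand("SELECT COUNT(*) FROM TableName", connection))
            {
                total = (int)countCommand.ExecuteScalar();
            }
            var allResults = new Dictionary<string, int>(total);

            // 2. Stream the rows instead of materializing LINQ to SQL entities.
            using (var command = new SqlCommand("SELECT Text, Id FROM TableName", connection))
            {
                command.CommandTimeout = 0;
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        allResults[reader.GetString(0)] = reader.GetInt32(1);
                    }
                }
            }

            return allResults;
        }
    }
}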
Either way, you should check whether the process is actually running as 32-bit or 64-bit - .NET 4.5 projects default to 'Prefer 32-bit' for AnyCPU executables (check it in Task Manager or in the project properties).
Hope this helps,
Ofir.

Related

How to optimize a code using DataTable and Linq?

I have 2 DataTables. There are about 17000 (table1) and 100000 (table2) records.
I need to check whether the field "FooName" contains "ItemName". I also need to take "FooId" and then add "ItemId" and "FooId" to a ConcurrentDictionary.
I have this code.
DataTable table1;
DataTable table2;
var table1Select = table1.Select();
ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();
foreach (var item in table1.AsEnumerable())
{
var fooItem = from foo in table2.AsEnumerable()
where foo.Field<string>("FooName").Contains(item.Field<string>("ItemName"))
select foo.Field<double>("FooId");
if(fooItem != null && fooItem.FirstOrDefault() != 0)
{
compareDictionary.TryAdd(item.Field<double>("ItemId"), fooItem.FirstOrDefault());
}
}
It works slowly (it takes about 10 minutes to perform the task).
I want to make it faster. How can I optimize it?
I see three points you can attack:
ditch the strongly typed field accessors in favour of direct casts: my thought was that this forces unboxing you could avoid, doubles being value types. Update: as pointed out in the comments, you will not avoid the unboxing either way, but you could potentially save some method call overhead (which is, again, arguable). This point can probably be ignored.
cache the reference string so you only access it once per outer-loop iteration
(I think this is where the biggest gains are) since you seem to always take the first result, opt for FirstOrDefault() directly in LINQ - don't let it enumerate the whole table once a match is found
ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();
foreach (var item in table1.AsEnumerable())
{
var sample = (string)item["ItemName"]; // cache the value before looping through inner collection
var fooItem = table2.AsEnumerable()
.FirstOrDefault(foo => ((string)foo["FooName"]).Contains(sample)); // you seem to always take First item, so you could instruct LINQ to stop after a match is found
if (fooItem != null && (double)fooItem["FooId"] != 0)
{
compareDictionary.TryAdd((double)item["ItemId"], (double)fooItem["FooId"]);
}
}
It appears that applying .FirstOrDefault() to the query-syntax version reduces it to method-chain syntax anyway, so I'd opt for method chaining all the way and leave the aesthetics to you.
If you are willing to sacrifice memory for speed, converting the DataTable to a list of just the fields you need gives about a 6x speedup over repeatedly pulling the column data out of table2. (This is in addition to the speedup from using FirstOrDefault.)
var compareDictionary = new ConcurrentDictionary<double, double>();
var t2e = table2.AsEnumerable().Select(r => (FooName: r.Field<string>("FooName"), FooId: r.Field<double>("FooId"))).ToList();
foreach (var item in table1.AsEnumerable().Select(r => (ItemName: r.Field<string>("ItemName"), ItemId: r.Field<double>("ItemId")))) {
var firstFooId = t2e.FirstOrDefault(foo => foo.FooName.Contains(item.ItemName)).FooId;
if (firstFooId != 0.0) {
compareDictionary.TryAdd(item.ItemId, firstFooId);
}
}
I am using C# ValueTuples to avoid reference object overhead from anonymous classes.

C# Update entries of a dictionary in parallel?

No idea if this is possible, but rather than iterating over a dictionary and modifying entries based on some condition sequentially, I was wondering if it is possible to do this in parallel?
For example, rather than:
Dictionary<int, byte> dict = new Dictionary<int, byte>();
for (int i = 0; i < dict.Count; i++)
{
dict[i] = 255;
}
I'd like something like:
Dictionary<int, byte> dict = new Dictionary<int, byte>();
dict.Parallel(x=>x, <condition>, <function_to_apply>);
I realise that in order to build the indices for modifying the dict, we would need to iterate and build a list of ints... but I was wondering if there was some sneaky way to do this that would be both faster and more concise than the first example.
I could of course iterate through the dict and for each entry, spawn a new thread and run some code, return the value and build a new, updated dictionary, but that seems really overkill.
The reason I'm curious is that the <function_to_apply> might be expensive.
I could of course iterate through the dict and for each entry, spawn a new thread and run some code, return the value and build a new, updated dictionary, but that seems really overkill.
Assuming you don't need the dictionary while it's being rebuilt, it's not that much work:
var newDictionary = dictionary.AsParallel()
.Select(kvp =>
/* do whatever here as long as
it works with the local kvp variable
and not the original dict */
new
{
Key = kvp.Key,
NewValue = function_to_apply(kvp.Key, kvp.Value)
})
.ToDictionary(x => x.Key,
x => x.NewValue);
Then lock whatever sync object you need and swap the new and old dictionaries.
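A minimal sketch of that last step, assuming the live dictionary sits in a field and every reader takes the same lock (the field and type names here are illustrative, not from the question):
using System.Collections.Generic;

public class DictionaryHolder
{
    private readonly object _sync = new object();
    private Dictionary<int, byte> _dict = new Dictionary<int, byte>();

    // Called with the newDictionary built by the AsParallel query above.
    public void Swap(Dictionary<int, byte> newDictionary)
    {
        lock (_sync)
        {
            _dict = newDictionary; // readers see either the old or the new dictionary, never a half-built one
        }
    }

    public bool TryGet(int key, out byte value)
    {
        lock (_sync)
        {
            return _dict.TryGetValue(key, out value);
        }
    }
}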
First of all, I mostly agree with others recommending ConcurrentDictionary<> - it is designed to be thread-safe.
But if you are an adventurous coder ;) and performance is super-critical for you, you can sometimes do what (I suppose) you are trying to do, provided no new keys are added to and no keys are removed from the dictionary during your parallel manipulations:
int keysNumber = 1000000;
Dictionary<int, string> d = Enumerable.Range(1, keysNumber)
.ToDictionary(x => x, x => (string)null);
Parallel.For(1, keysNumber + 1, k => { d[k] = "Value" + k; /*Some complex logic might go here*/ });
To verify data consistency after these operations you can add:
Debug.Assert(d.Count == keysNumber);
for (int i = 1; i <= keysNumber; i++)
{
Debug.Assert(d[i] == "Value" + i);
}
Console.WriteLine("Successful");
WHY IT WORKS:
Basically, we created the dictionary in advance from a SINGLE main thread and then populated it in parallel. What allows us to do that is that the current Dictionary implementation (Microsoft does not guarantee this, but it most likely won't ever change) defines its structure solely by the keys, and values are just assigned to the corresponding slots. Since each key is assigned a new value from a single thread we have no race condition, and since navigating the hash table concurrently does not alter it, everything works fine.
But you should be really careful with such code and have very good reasons not to use ConcurrentDictionary.
PS: My main point is not the "hack" of using Dictionary concurrently, but to draw attention to the fact that not every data structure needs to be concurrent. I have seen ConcurrentDictionary<int, ConcurrentStack<...>> where each stack object in the dictionary could only ever be accessed from a single thread - that is overkill and doesn't make your performance any better. Just keep in mind what you are affecting and what can go wrong in multithreading scenarios.

EF - A proper way to search several items in database

I have about 100 items (allRights) in the database and about 10 ids to be searched (inputRightsIds). Which is better - to first get all rights and then search through them (Variant 1), or to make 10 separate checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach (var right in allRights)
{
    for (int i = 0; i < inputRightsIds.Length; i++)
    {
        if (inputRightsIds[i] == right.Id)
        {
            // Do something
        }
    }
}
Variant 2
for (int i = 0; i < inputRightsIds.Length; i++)
{
    if (db.Rights.Any(r => r.Id == inputRightsIds[i]))
    {
        // Do something
    }
}
Thanks in advance!
As others have already stated, you should do the following.
var matchingIds = from r in db.Rights
where inputRightIds.Contains(r.Id)
select r.Id;
foreach(var id in matchingIds)
{
// Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, each returning part of the information you want. The query above makes one SQL call to the DB and returns only the data you are interested in. This is the best approach, as it avoids the two bottlenecks of making multiple calls to the DB and of having too much data returned.
You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));
They should be very similar speeds since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r=> inputRightIds.Contains(r.Id));
...//Do stuff with matches
Don't forget to pull all your items into memory if you need to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could enumerate the collection and do something with each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item => {
//your code here
});

What's the easiest way to reorder two dictionary lists together in C#?

BACKGROUND TO THE PROBLEM
Say I had the following two lists (prioSums and contentVals) compiled from a SQL Server CE query like this:
var queryResults = db.Query(searchQueryString, searchTermsArray);
Dictionary<string, double> prioSums = new Dictionary<string, double>();
Dictionary<string, string> contentVals = new Dictionary<string, string>();
double prioTemp = 0.0;
foreach(var row in queryResults)
{
string location = row.location;
double priority = row.priority;
if (!prioSums.ContainsKey(location))
{
prioSums[location] = 0.0;
}
if (!contentVals.ContainsKey(location))
{
contentVals[location] = row.value;
prioTemp = priority;
}
if (prioTemp < priority)
{
contentVals[location] = row.value;
}
prioSums[location] += priority;
}
The query itself is pretty large, very dynamically compiled, and really beyond the scope of this question, so I'll just say that it returns rows that include a priority, text value, and location.
With the above code I am able to get one list (prioSums) which sums up all of the priorities for each location (not allowing repeats on the location [key] itself, even though repeats for the location are in the query results), and another list (contentVals) to hold the value of the location with the highest priority, once again, using the location as key.
All of this I have accomplished and it works very well. I can iterate over the two lists and display the information I want HOWEVER...
THE PROBLEM
...Now I need to reorder these lists together with the highest priority (or sums of priorities which are stored as the values in prioSums) first.
I have racked my brain trying to think about using an instantiated class with three properties, as others have advised, but I can't seem to wrap my head around how that would work, given my WebMatrix C# ASP.NET Web Pages environment. I know how to call a class from a .cs file from the current .cshtml file, no problem, but I have never done this by instantiating a class to make it an object before (sorry, still new to some of the more complex C# logic/methodology).
Can anyone suggest how to accomplish this, or perhaps show an easier (at least easier to understand) way of doing it? In short, all I really need is these two lists ordered together by the value in prioSums, from highest to lowest.
NOTE
Please forgive me if I have not provided quite enough information. If more should be provided don't hesitate to ask.
Also, for more information or background on this problem, you can look at my previous question on this here: Is there any way to loop through my sql results and store certain name/value pairs elsewhere in C#?
I don't know if it's the outcome you want, but you can give this a try:
var result = from p in prioSums
orderby p.Value descending
select new { location = p.Key, priority = p.Value, value = contentVals[p.Key] };
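A quick usage sketch of that result (the property names come from the anonymous type above; the Console output is just for illustration):
foreach (var entry in result)
{
    // Highest summed priority first, alongside the content value stored for that location.
    Console.WriteLine("{0}: priority sum = {1}, content = {2}", entry.location, entry.priority, entry.value);
}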

Would there be any performance difference between looping every row of dataset and same dataset list form

I need to loop every row of a dataset 100k times.
This dataset contains 1 Primary key and another string column. Dataset has 600k rows.
So at the moment I am looping like this:
for (int i = 0; i < dsProductNameInfo.Tables[0].Rows.Count; i++)
{
for (int k = 0; k < dsFull.Tables[0].Rows.Count; k++)
{
}
}
Now dsProductNameInfo has 100k rows and dsFull has 600k rows. Should I convert dsFull to a key/value-paired string list and loop over that, or would there not be any speed difference?
What solution would work fastest ?
Thank you.
C# 4.0 WPF application
In the exact scenario you mentioned, the performance would be about the same, except that converting to the list would take some time and make the list approach slower overall. You can easily find out by writing a unit test and timing it.
I would think it'd be best to do this:
// create a class for each type of object you're going to be dealing with
public class ProductNameInformation { ... }
public class Product { ... }
// load a list from a SqlDataReader (much faster than loading a DataSet)
List<Product> products = GetProductsUsingSqlDataReader(); // don't actually call it that :)
// The only thing I can think of where DataSets are better is indexing certain columns.
// So if you have indices, just emulate them with a hashtable:
Dictionary<string, Product> index1 = products.ToDictionary( ... );
Here are references to the SqlDataReader and ToDictionary concepts that you may or may not be familiar with.
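To make the "emulate an index with a hashtable" idea concrete, here is a rough sketch with invented Product properties and an invented key column - not the poster's actual schema:
using System.Collections.Generic;
using System.Linq;

public class Product
{
    public string ProductKey { get; set; } // hypothetical column you would otherwise index in the DataTable
    public string Name { get; set; }
}

public static class ProductIndexes
{
    // O(1) lookups by key, instead of scanning the list (or DataTable) every time.
    public static Dictionary<string, Product> ByKey(List<Product> products)
    {
        return products.ToDictionary(p => p.ProductKey);
    }
}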
The real question is, why isn't this kind of heavy processing done at the database layer? SQL servers are much more optimized for this type of work. Also, you may not have to actually do this, why don't you post the original problem and maybe we can help you optimize deeper?
HTH
There might be quite a few things that could be optimized that are not related to the looping itself. For example, reducing the number of iterations would yield a lot: at present the body of the inner loop is executed 100k * 600k times, so eliminating one iteration of the outer loop eliminates 600k iterations of the inner one (or you might be able to swap the inner and outer loops if it's easier to remove iterations from the inner loop).
One thing that you could do in any case is only index once for each table:
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullRows = dsFull.Tables[0].Rows;
var fullCount = fullRows.Count;
for (int i = 0; i < productInfoCount; i++)
{
for (int k = 0; k < fullCount; k++)
{
}
}
Inside the loops you'd access the rows with productNameInfoRows[i] and fullRows[k], which is faster than using the long-hand form. I'm guessing there is more to gain from optimizing the body than from the way you are looping over the collections - unless, of course, you have already profiled the code and found the actual looping to be the bottleneck.
EDIT After reading your comment to Marc about what you are trying to accomplish, here's a go at how you could do this. It's worth noting that the algorithm below is probabilistic: there is roughly a 1 in 2^32 chance of two different words being seen as equal when they are not. It is, however, a lot faster than comparing strings.
The code assumes that the first column is the one you are comparing.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
var full = new List<int[]>(fullCount);
for (int i = 0; i < productInfoCount; i++){
//we're going to compare hash codes and not strings
var prd = productNameInfoRows[i][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray();
for (int k = 0; k < fullCount; k++){
//cache the calculation for all subsequent iterations of the outer loop
if (i == 0) {
full.Add(fullRows[k][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray());
}
var fl = full[k];
var count = 0;
for(var j = 0;j<fl.Length;j++){
var f = fl[j];
//the values are sorted so we can exit early
for(var m = 0;m<prd.Length && prd[m] <= f;m++){
count += prd[m] == f ? 1 : 0;
}
}
//treat it as a match when roughly 60% or more of the tokens overlap (count is the number of shared tokens)
if(count > 0 && 2.0 * count / (fl.Length + prd.Length) >= 0.6){
//there's a match
}
}
}
EDIT Your comment motivated me to give it another try. The code below could have fewer iterations - "could" because it depends on the number of matches and the number of unique words. A lot of unique words with a lot of matches each (which would require a LOT of words per column) could potentially yield more iterations. However, under the assumption that each row has few words, this should yield substantially fewer iterations. Your code has N*M complexity; this has roughly N + M + (productInfoMatches * fullMatches, summed over each matching word). In other words, that last term would have to approach 100k * 600k for this to have more iterations than yours.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
//Create a list of the words from the product info
var lists = new Dictionary<int, Tuple<List<int>, List<int>>>(productInfoCount*3);
for(var i = 0;i<productInfoCount;i++){
foreach (var token in productNameInfoRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (!lists.ContainsKey(token)){
lists.Add(token, Tuple.Create(new List<int>(), new List<int>()));
}
lists[token].Item1.Add(i);
}
}
//Pair words from full with those from productinfo
for(var i = 0;i<fullCount;i++){
foreach (var token in fullRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (lists.ContainsKey(token)){
lists[token].Item2.Add(i);
}
}
}
//Count all matches for each pair of rows
var counts = new Dictionary<int, Dictionary<int, int>>();
foreach(var key in lists.Keys){
foreach(var p in lists[key].Item1){
if(!counts.ContainsKey(p)){
counts.Add(p,new Dictionary<int, int>());
}
foreach(var f in lists[key].Item2){
var dic = counts[p];
if(!dic.ContainsKey(f)){
dic.Add(f,0);
}
dic[f]++;
}
}
}
If performance is the critical factor, then I would suggest trying an array-of-struct; this has minimal indirection (DataSet/DataTable has quite a lot of indirection). You mention KeyValuePair, and that would work, although it might not necessarily be my first choice. Milimetric is right to say that there is an overhead if you create a DataSet first and then build an array/list from that - however, even then the time savings when looping may exceed the build time. If you can restructure the load to remove the DataSet completely, great.
I would also look carefully at the loops, to see if anything could reduce the actual work needed; for example, would building a dictionary/grouping allow faster lookups? Would sorting allow binary search? Can any operations be pre-aggregated and applied at a higher level (with fewer rows)?
What are you doing with the data inside the nested loop?
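As a rough illustration of the dictionary/grouping idea - a sketch that assumes the two datasets are joined on an exact key in their first column (the column choice and the exact-match assumption are mine, not from the question):
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class LookupSketch
{
    public static void Process(DataSet dsProductNameInfo, DataSet dsFull)
    {
        // Group the 600k rows once; each of the 100k outer rows then probes the lookup in O(1)
        // instead of scanning all 600k inner rows.
        var fullByKey = dsFull.Tables[0].AsEnumerable()
                              .ToLookup(r => r.Field<string>(0));

        foreach (var productRow in dsProductNameInfo.Tables[0].AsEnumerable())
        {
            foreach (var match in fullByKey[productRow.Field<string>(0)])
            {
                // work only with the inner rows that actually match this outer row
            }
        }
    }
}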
Is the source of your datasets a SQL database? If so, the best possible performance you could get would be to perform your calculation in SQL using an inner join and return the result to .net.
Another alternative would be to use the dataset's built in querying methods that act like SQL, but in-memory.
If neither of those options is appropriate, you would get a performance improvement by retrieving the 'full' dataset as a DataReader and looping over it as the outer loop. A DataSet loads all of the data from SQL into memory in one hit - with 600k rows, this will take up a lot of memory! A DataReader, on the other hand, keeps the connection to the DB open and streams rows as they are read; once you have read a row, the memory can be reused/reclaimed by the garbage collector.
In your comment reply to my earlier answer you said that both datasets are essentially lists of strings, each string effectively a delimited list of tags. I would first look to normalise the CSV strings in the database, i.e. split the CSVs, add them to a tag table and link from the products to the tags via a link table.
You can then quite easily create a SQL statement that does your matching according to the link records rather than by string (which will be more performant in its own right).
The issue you would then have is that if your sub-set product list needs to be passed into SQL from .NET, you would need to call the SP 100k times. Thankfully, SQL Server 2008 introduced table types (table-valued parameters). You could define a table type in your database with one column to hold your product ID, have your SP accept that as an input parameter, and then perform an inner join between your actual tables and your table parameter. I've used this in my own project with very large datasets and the performance gain was massive.
On the .net side you can create a DataTable matching the structure of the SQL table type and then pass that as a command parameter when calling your SP (once!).
This article shows you how to do both the SQL and .net sides. http://www.mssqltips.com/sqlservertip/2112/table-value-parameters-in-sql-server-2008-and-net-c/
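A minimal sketch of the .NET side of that approach, assuming a SQL table type named dbo.ProductIdList with a single ProductId int column and a stored procedure dbo.MatchProducts that accepts it as @ProductIds (all of these names are made up for illustration):
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class TableValuedParameterExample
{
    public static void CallMatchProducts(string connectionString, IEnumerable<int> productIds)
    {
        // Shape a DataTable to match the SQL table type dbo.ProductIdList.
        var idTable = new DataTable();
        idTable.Columns.Add("ProductId", typeof(int));
        foreach (var id in productIds)
        {
            idTable.Rows.Add(id);
        }

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.MatchProducts", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            var parameter = command.Parameters.AddWithValue("@ProductIds", idTable);
            parameter.SqlDbType = SqlDbType.Structured; // mark it as a table-valued parameter
            parameter.TypeName = "dbo.ProductIdList";   // must match the table type defined in SQL Server

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // consume the joined/matched rows here
                }
            }
        }
    }
}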
