I have two DataTables, with about 17,000 records (table1) and 100,000 records (table2).
I need to check whether the field "FooName" contains "ItemName". If it does, I need to take the "FooId" and add the "ItemId"/"FooId" pair to a ConcurrentDictionary.
I have this code.
DataTable table1;
DataTable table2;
var table1Select = table1.Select();

ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();

foreach (var item in table1Select)
{
    var fooItem = from foo in table2.AsEnumerable()
                  where foo.Field<string>("FooName").Contains(item.Field<string>("ItemName"))
                  select foo.Field<double>("FooId");

    if (fooItem != null && fooItem.FirstOrDefault() != 0)
    {
        compareDictionary.TryAdd(item.Field<double>("ItemId"), fooItem.FirstOrDefault());
    }
}
It works, but slowly (it takes about 10 minutes to perform the task).
I want to make it faster. How can I optimize it?
I see three points you can attack:
Ditch the strongly typed field accessors in favour of direct casts: Field<T> forces unboxing, which I initially thought you could avoid since doubles are value types. Update: as pointed out in the comments, you will not avoid the unboxing either way, but you could potentially save some method-call overhead (which is, again, arguable). This point can probably be ignored.
Cache the reference string so you only read it once per outer-loop iteration.
(I think this is where the biggest gain is.) Since you seem to always take the first result, opt for FirstOrDefault() straight in the LINQ call - don't let it enumerate the whole table; it can stop as soon as a match is found.
ConcurrentDictionary<double, double> compareDictionary = new ConcurrentDictionary<double, double>();

foreach (var item in table1Select)
{
    var sample = (string)item["ItemName"]; // cache the value before looping through the inner collection

    var fooItem = table2.AsEnumerable()
                        .FirstOrDefault(foo => ((string)foo["FooName"]).Contains(sample)); // you seem to always take the first item, so instruct LINQ to stop after a match is found

    if (fooItem != null && (double)fooItem["FooId"] != 0)
    {
        compareDictionary.TryAdd((double)item["ItemId"], (double)fooItem["FooId"]);
    }
}
It appears that applying the FirstOrDefault() condition to LINQ query syntax reduces it to method-chain syntax anyway, so I'd opt for method chaining all the way and leave the aesthetics up to you.
If you are willing to sacrifice memory for speed, extracting just the fields you need from the DataTable gives about a 6x speedup over repeatedly pulling the column data out of table2. (This is in addition to the speedup from using FirstOrDefault.)
var compareDictionary = new ConcurrentDictionary<double, double>();

var t2e = table2.AsEnumerable()
                .Select(r => (FooName: r.Field<string>("FooName"), FooId: r.Field<double>("FooId")))
                .ToList();

foreach (var item in table1.AsEnumerable()
                           .Select(r => (ItemName: r.Field<string>("ItemName"), ItemId: r.Field<double>("ItemId"))))
{
    var firstFooId = t2e.FirstOrDefault(foo => foo.FooName.Contains(item.ItemName)).FooId;
    if (firstFooId != 0.0)
    {
        compareDictionary.TryAdd(item.ItemId, firstFooId);
    }
}
I am using C# ValueTuples to avoid reference object overhead from anonymous classes.
Current Code:
For each element in MapEntryTable, check the properties IsDisplayedColumn and IsReturnColumn; if they are true, add the element's OutputColumnId to the corresponding list. The running time is O(n). Many elements have both properties false, so they are not added to either list in the loop.
foreach (var mapEntry in MapEntryTable)
{
if (mapEntry.IsDisplayedColumn)
Type1.DisplayColumnId.Add(mapEntry.OutputColumnId);
if (mapEntry.IsReturnColumn)
Type1.ReturnColumnId.Add(mapEntry.OutputColumnId);
}
Following is the Linq version of doing the same:
MapEntryTable.Where(x => x.IsDisplayedColumn == true).ToList().ForEach(mapEntry => Type1.DisplayColumnId.Add(mapEntry.OutputColumnId));
MapEntryTable.Where(x => x.IsReturnColumn == true).ToList().ForEach(mapEntry => Type1.ReturnColumnId.Add(mapEntry.OutputColumnId));
I am converting all such foreach code to LINQ as I am learning it, but my questions are:
Do I get any advantage from the LINQ conversion in this case, or is it a disadvantage?
Is there a better way to do the same thing using LINQ?
UPDATE:
Consider the case where, out of 1000 elements in the list, 80% have both properties false. Does Where then give me any benefit in quickly finding the elements that satisfy a given condition?
Type1 is a custom type with a set of List<int> members, DisplayColumnId and ReturnColumnId.
ForEach isn't a LINQ method. It's a method of List<T>. And not only is it not part of LINQ, it's very much against the values and patterns of LINQ. Eric Lippert explains this in a blog post written when he was a principal developer on the C# compiler team.
Your "LINQ" approach also:
Completely unnecessarily copies all of the items to be added into a list, which is both wasteful in time and memory and also conflicts with LINQ's goals of deferred execution when executing queries.
Isn't actually a query with the exception of the Where operator. You're acting on the items in the query, rather than performing a query. LINQ is a querying tool, not a tool for manipulating data sets.
You're iterating the source sequence twice. This may or may not be a problem, depending on what the source sequence actually is and what the costs of iterating it are.
A solution that uses LINQ only for what it is designed for would look like this:
foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsDisplayedColumn))
    list1.DisplayColumnId.Add(mapEntry.OutputColumnId);

foreach (var mapEntry in MapEntryTable.Where(entry => entry.IsReturnColumn))
    list2.ReturnColumnId.Add(mapEntry.OutputColumnId);
I would say stick with the original foreach loop, since you only iterate through the list once.
Also, your LINQ should look more like this:
list1.DisplayColumnId.AddRange(MapEntryTable.Where(x => x.IsDisplayedColumn).Select(mapEntry => mapEntry.OutputColumnId));
list2.ReturnColumnId.AddRange(MapEntryTable.Where(x => x.IsReturnColumn).Select(mapEntry => mapEntry.OutputColumnId));
The performance of foreach vs. LINQ's ForEach is almost exactly the same, within nanoseconds of each other, assuming you have the same logic inside the loop in both versions when testing.
However, a for loop outperforms both by a large margin. for (int i = 0; i < count; ++i) is much faster than either, because a for loop doesn't go through an IEnumerable implementation (and its overhead); it compiles down to simple index/jump instructions. It maintains a counter, and it's up to you to retrieve the item by its index inside the loop.
Using LINQ's ForEach does have a big disadvantage, though: you cannot break out of the loop. If you need to do that, you have to maintain a boolean like breakLoop = false, set it to true, and have each subsequent iteration exit early if breakLoop is true... which performs badly. Secondly, you cannot use continue; instead you use return.
I never use LINQ's ForEach loop.
If you are dealing with LINQ, e.g.
List<Thing> things = .....;
var oldThings = things.Where(p => p.DateTime.Year < DateTime.Now.Year);
that will internally foreach over the list and give you back only the items with a year earlier than the current year. Cool.
But if I am doing this:
List<Thing> things = new List<Thing>();
foreach(XElement node in Results) {
things.Add(new Thing(node));
}
I don't need to use a LINQ ForEach loop. Even if I did...
foreach (var node in thingNodes.Where(p => p.NodeType == "Thing")) {
    if (node.Ignore) {
        continue;
    }
    things.Add(node);
}
even though I could write that more cleanly, like
foreach (var node in thingNodes.Where(p => p.NodeType == "Thing" && !p.Ignore)) {
    things.Add(node);
}
There is no real reason I can think of to do this:
things.ForEach(thing => {
    //do something
    //can't break
    //can't continue
    return; //<- continue
});
And if I want the fastest loop possible,
for (int i = 0; i < things.Count; ++i) {
    var thing = things[i];
    //do something
}
Will be faster.
Your LINQ isn't quite right as you're converting the results of Where to a List and then pseudo-iterating over those results with ForEach to add to another list. Use ToList or AddRange for converting or adding sequences to lists.
Example, where overwriting list1 (if it were actually a List<T>):
list1 = MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId).ToList();
or to append:
list1.AddRange(MapEntryTable.Where(x => x.IsDisplayedColumn == true)
.Select(mapEntry => mapEntry.OutputColumnId));
In C#, to do what you want functionally in one call, you have to write your own partition method. If you are open to using F#, you can use List.Partition<'T>
https://msdn.microsoft.com/en-us/library/ee353782.aspx
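For reference, here is a minimal sketch of what such a partition helper could look like in C# (the extension-method name Partition and the returned tuple shape are illustrative only, not an existing BCL API; it uses C# 7 tuple syntax):
using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Splits a sequence into the items that match the predicate and the rest,
    // enumerating the source only once.
    public static (List<T> Matches, List<T> Rest) Partition<T>(
        this IEnumerable<T> source, Func<T, bool> predicate)
    {
        var matches = new List<T>();
        var rest = new List<T>();
        foreach (var item in source)
        {
            if (predicate(item)) matches.Add(item);
            else rest.Add(item);
        }
        return (matches, rest);
    }
}

// Hypothetical usage: var (displayed, others) = MapEntryTable.Partition(e => e.IsDisplayedColumn);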
So I have a couple of different lists that I'm trying to process and merge into 1 list.
Below is a snippet of code that I'd like to improve if there is a better way of doing it.
The reason why I'm asking is that some of these lists are rather large. I want to see if there is a more efficient way of doing this.
As you can see, I'm looping through a list, and the first thing I do is check whether the CompanyId exists in the list. If it does, I find the item in the list that I'm going to process.
pList is my processing list. I'm adding the values from my different lists into this list.
I'm wondering if there is a "better way" of accomplishing the Exist and Find.
bool tstFind = false;
foreach (parseAC item in pACList)
{
    tstFind = pList.Exists(x => (x.CompanyId == item.key.ToString()));
    if (tstFind == true)
    {
        pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
        //Processing done here. pItem gets updated here
        ...
    }
}
Just as a side note, I'm going to research using joins to see if that is faster, but I haven't gotten there yet. The above code is my first cut at solving this issue and it appears to work. However, since I have the time, I want to see if there is a better way still.
Any input is greatly appreciated.
Time Findings:
My current Find and Exists code takes about 84 minutes to loop through the 5.5M items in the pACList.
Using pList.FirstOrDefault(x => x.CompanyId == item.key.ToString()); takes 54 minutes to loop through the 5.5M items in the pACList.
You can retrieve the item with FirstOrDefault instead of searching for it twice (once to determine whether the item exists, and a second time to get the existing item):
var tstFind = pList.FirstOrDefault(x => x.CompanyId == item.key.ToString());
if (tstFind != null)
{
//Processing done here. pItem gets updated here
}
Yes, use a hashtable so that your algorithm is O(n) instead of O(n*m) which it is right now.
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.ContainsKey(item.key.ToString()))
    {
        pItem = pListByCompanyId[item.key.ToString()];
        //Processing done here. pItem gets updated here
        ...
    }
}
You can iterate through the filtered list using LINQ:
foreach (parseAC item in pACList.Where(i => pList.Any(x => (x.CompanyId == i.key.ToString()))))
{
    pItem = pList.Find(x => (x.CompanyId == item.key.ToString()));
    //Processing done here. pItem gets updated here
    ...
}
Using lists for this type of operation is O(M×N) (M is the count of pACList, N is the count of pList). Additionally, you are searching pList twice per item. To avoid that issue, use pList.FirstOrDefault as recommended by #lazyberezovsky.
However, if possible I would avoid using lists. A Dictionary indexed by the key you're searching on would greatly improve the lookup time.
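For example, a minimal sketch of the Dictionary approach (assuming CompanyId values are unique within pList; TryGetValue also avoids a separate ContainsKey lookup, and the out-variable syntax is C# 7):
var pListByCompanyId = pList.ToDictionary(x => x.CompanyId);
foreach (parseAC item in pACList)
{
    if (pListByCompanyId.TryGetValue(item.key.ToString(), out var pItem))
    {
        //Processing done here. pItem gets updated here
    }
}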
Doing a linear search on the list for each item in another list is not efficient for large data sets. What is preferable is to put the keys into a Table or Dictionary that can be searched much more efficiently, allowing you to join the two tables. You don't even need to code this yourself; what you want is a Join operation. You want to get all of the pairs of items from each sequence that map to the same key.
Either pull out the implementation of the method below, or change Foo and Bar to the appropriate types and use it as a method.
public static IEnumerable<Tuple<Bar, Foo>> Merge(IEnumerable<Bar> pACList, IEnumerable<Foo> pList)
{
    return pACList.Join(pList,
                        item => item.Key.ToString(),
                        item => item.CompanyID.ToString(),
                        (a, b) => Tuple.Create(a, b));
}
You can use the results of this call to merge the two items together, as they will have the same key.
Internally the method will create a lookup table that allows for efficient searching before actually doing the searching.
Convert pList to a HashSet, then query pHashSet.Contains(): complexity O(N) + O(n).
Sort pList on CompanyId and use Array.BinarySearch(): O(N log N) + O(n log N). A sketch follows below.
If the maximum company id is not prohibitively large, simply create an array of the items where the item with company id i sits at the i-th position. Nothing can be faster.
Here N is the size of pList and n is the size of pACList.
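A sketch of the sort-and-binary-search option, assuming CompanyId is a string (the same comparer must be used for the sort and the search):
// Sort once...
var sorted = pList.OrderBy(x => x.CompanyId, StringComparer.Ordinal).ToArray();
var keys = sorted.Select(x => x.CompanyId).ToArray();

// ...then binary-search per lookup: O(log N) each instead of a linear scan.
foreach (parseAC item in pACList)
{
    int idx = Array.BinarySearch(keys, item.key.ToString(), StringComparer.Ordinal);
    if (idx >= 0)
    {
        var pItem = sorted[idx];
        //Processing done here. pItem gets updated here
    }
}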
I am in a situation where looping through the result of a LINQ query is getting on my nerves. Here is my scenario:
I have a DataTable that comes from the database, from which I am taking data like this:
var results = from d in dtAllData.AsEnumerable()
              select new MyType
              {
                  ID = d.Field<Decimal>("ID"),
                  Name = d.Field<string>("Name")
              };
After doing the order by depending on the sort order as:
if (orderBy != "")
{
    string[] ord = orderBy.Split(' ');
    if (ord != null && ord.Length == 2 && ord[0] != "")
    {
        if (ord[1].ToLower() != "desc")
        {
            results = from sorted in results
                      orderby GetPropertyValue(sorted, ord[0])
                      select sorted;
        }
        else
        {
            results = from sorted in results
                      orderby GetPropertyValue(sorted, ord[0]) descending
                      select sorted;
        }
    }
}
The GetPropertyValue method is as:
private object GetPropertyValue(object obj, string property)
{
    System.Reflection.PropertyInfo propertyInfo = obj.GetType().GetProperty(property);
    return propertyInfo.GetValue(obj, null);
}
After this I am taking out 25 records for first page like:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
So far things are going well. Now I have to pass these results to a method that does some manipulation on the data and returns the desired data; in this method, looping over these 25 records takes quite a lot of time. My method definition is:
public MyTypeCollection GetMyTypes(IEnumerable<MyType> myData, String dateFormat, String offset)
I have tried foreach and it takes about 8-10 seconds on my machine; the time is spent at this line:
foreach(var _data in myData)
I tried a while loop and it does the same thing. I used it like this:
var enumerator = myData.GetEnumerator();
while (enumerator.MoveNext())
{
    var n = enumerator.Current;
    Console.WriteLine(n);
}
This piece of code is taking time at MoveNext
Then I went for a for loop:
int length = myData.Count();
for (int i = 0; i < 25; i++)
{
    var temp = myData.ElementAt(i);
}
This code is taking time at ElementAt
Can anyone please tell me what I am doing wrong? I am using .NET Framework 3.5 in VS 2008.
Thanks in advance
EDIT: I suspect the problem is in how you're ordering. You're using reflection to first fetch and then invoke a property for every record. Even though you only want the first 25 records, it has to call GetPropertyValue on all the records first, in order to order them.
It would be much better if you could do this without reflection at all... but if you do need to use reflection, at least call Type.GetProperty() once instead of for every record.
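For instance, a sketch of that second suggestion, looking the property up once per sort instead of once per record (it assumes ord[0] names a public property of MyType):
var propertyInfo = typeof(MyType).GetProperty(ord[0]);
if (propertyInfo != null)
{
    // GetProperty is called once; only GetValue runs per record.
    results = ord[1].ToLower() != "desc"
        ? results.OrderBy(r => propertyInfo.GetValue(r, null))
        : results.OrderByDescending(r => propertyInfo.GetValue(r, null));
}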
(In some ways this is more to do with helping you diagnose the problem more easily than a full answer as such...)
As Henk said, this is very odd:
results = from sorted in results
.Skip(0)
.Take(25)
select sorted;
You almost certainly really just want:
results = results.Take(25);
(Skip(0) is pointless.)
It may not actually help, but it will make the code simpler to debug.
The next problem is that we can't actually see all your code. You've written:
After doing the order by depending on the sort order
... but you haven't shown how you're performing the ordering.
You should show us a complete example going from DataTable to its use.
Changing how you iterate over the sequence will not help - it's going to do the same thing either way, really - although it's surprising that in your last attempt, Count() apparently works quickly. Stick to the foreach - but work out exactly what that's going to be doing. LINQ uses a lot of lazy evaluation, and if you've done something which makes that very heavy going, that could be the problem. It's hard to know without seeing the whole pipeline.
The problem is that your "results" IEnumerable isn't actually being evaluated until it is passed into your method and enumerated. That means that the whole operation, getting all the data from dtAllData, selecting out the new type (which is happening on the whole enumerable, not just the first 25), and then finally the take 25 operation, are all happening on the first enumeration of the IEnumerable (foreach, while, whatever).
That's why your method is taking so long. It's actually doing some of the work defined elsewhere inside the method. If you want that to happen before your method, you could do a "ToList()" prior to the method.
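A minimal sketch of that, using the names from the question (the local variable names are just illustrative):
// Force the DataTable read, the ordering and the Take(25) to run here, once,
// rather than inside GetMyTypes when the sequence is first enumerated.
var materialized = results.ToList();
var myTypes = GetMyTypes(materialized, dateFormat, offset);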
You might find it easier to adopt a hybrid approach;
In order:
1) Sort your datatable in-situ. It's probably best to do this at the database level, but, if you can't, then DataTable.DefaultView.Sort is pretty efficient:
dtAllData.DefaultView.Sort = ord[0] + " " + ord[1];
This assumes that ord[0] is the column name, and ord[1] is either ASC or DESC
2) Page through the DefaultView by index:
int pageStart = 0;
List<DataRowView> pageRows = new List<DataRowView>();
for (int i = pageStart; i < dtAllData.DefaultView.Count; i++)
{
    if (i >= pageStart + 25) { break; } // stop once we have a full page; the loop condition already handles the end of the rows
    pageRows.Add(dtAllData.DefaultView[i]);
}
...and create your objects from this much smaller list... (I've assumed the columns are called Id and Name, as well as the types)
List<MyType> myObjects = new List<MyType>();
foreach (DataRowView pageRow in pageRows)
{
    myObjects.Add(new MyType() { Id = Convert.ToInt32(pageRow["Id"]), Name = Convert.ToString(pageRow["Name"]) });
}
You can then proceed with the rest of what you were doing.
I have an IEnumerable of a POCO type containing around 80,000 rows
and a db table (L2E/EF4) containing a subset of rows where there was "an error/a difference" (about 5000 rows, but often repeated, giving about 150 distinct entries).
The following code gets the distinct VSACodes "in error" and then attempts to update the complete result set, updating JUST the rows that match... but it doesn't work!
var vsaCodes = (from g in db.GLDIFFLs
                select g.VSACode)
               .Distinct();

foreach (var code in vsaCodes)
{
    var hasDifference = results.Where(r => r.VSACode == code);
    foreach (var diff in hasDifference)
        diff.Difference = true;
}
var i = results.Count(r => r.Difference == true);
After this code, i = 0
I've also tried:
foreach (var code in vsaCodes)
{
results.Where(r => r.VSACode == code).Select(r => { r.Difference = true; return r; }).ToList();
}
How can I update the "results" to set only the matching Difference property?
Assuming results is just a query (you haven't shown it), it will be evaluated every time you iterate over it. If that query creates new objects each time, you won't see the updates. If it returns references to the same objects, you would.
If you change results to be a materialized query result - e.g. by adding ToList() to the end - then iterating over results won't issue a new query, and you'll see your changes.
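In other words, something along these lines (a sketch based on the code in the question):
var materializedResults = results.ToList();   // the query runs once, here

foreach (var code in vsaCodes)
{
    foreach (var diff in materializedResults.Where(r => r.VSACode == code))
        diff.Difference = true;
}

var i = materializedResults.Count(r => r.Difference == true);   // now reflects the updates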
I had the same kind of error some time ago. The problem is that LINQ queries are often deferred and are not executed at the point where it looks like you are calling them.
Quotation from "Pro LINQ Language Integrated Query in C# 2010":
"Notice that even though we called the query only once, the results of
the enumeration are different for each of the enumerations. This is
further evidence that the query is deferred. If it were not, the
results of both enumerations would be the same. This could be a
benefit or detriment. If you do not want this to happen, use one of
the conversion operators that do not return an IEnumerable so that
the query is not deferred, such as ToArray, ToList, ToDictionary, or
ToLookup, to create a different data structure with cached results
that will not change if the data source changes."
Here you have a good explanation with examples of it:
http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Regards
Parsing words pretty closely on #jonskeet's answer...
If your query is simply a filter and the underlying source objects are updated, the query will be reevaluated and may exclude these objects based on the filter condition in which case your query results will change on subsequent enumerations but the underlying objects will still have been updated.
The key is a lack of a projection to a new type as far as updating and persisting the changed objects.
ToList() is the usual solution to this issue, and it will solve the problem if there is a projection to a new type, but things get cloudy in the event your query filters but does not project. Updates to the query results still affect the original source objects, since everything references the same objects.
Again, parsing words but these edge cases can trip you up.
public class Widget
{
    public string Name { get; set; }
}

var widgets1 = new[]
{
    new Widget { Name = "Red", },
    new Widget { Name = "Green", },
    new Widget { Name = "Blue", },
    new Widget { Name = "Black", },
};

// adding ToList() will result in a 'static' query result, but
// updates to the objects will still affect the source objects
var query1 = widgets1
    .Where(i => i.Name.StartsWith("B"))
    //.ToList()
    ;

foreach (var widget in query1)
{
    widget.Name = "Yellow";
}

// produces no output unless you uncomment the ToList() above:
// query1 is reevaluated and filters out "Yellow", which does not start with "B"
foreach (var name in query1)
    Console.WriteLine(name.Name);

// produces Red, Green, Yellow, Yellow:
// the underlying widgets were updated
foreach (var name in widgets1)
    Console.WriteLine(name.Name);
I'm fetching data from all three tables at once to avoid network latency. Fetching the data is pretty fast, but a lot of time is spent looping through the results.
Int32[] arr = { 1 };
var query = from a in arr
            select new
            {
                Basket = from b in ent.Basket
                         where b.SUPERBASKETID == parentId
                         select new
                         {
                             Basket = b,
                             ObjectTypeId = 0,
                             firstObjectId = "-1",
                         },
                BasketImage = from b in ent.Image
                              where b.BASKETID == parentId
                              select new
                              {
                                  Image = b,
                                  ObjectTypeId = 1,
                                  CheckedOutBy = b.CHECKEDOUTBY,
                                  firstObjectId = b.FIRSTOBJECTID,
                                  ParentBasket = (from parentBasket in ent.Basket
                                                  where parentBasket.ID == b.BASKETID
                                                  select parentBasket).ToList()[0],
                              },
                BasketFile = from b in ent.BasketFile
                             where b.BASKETID == parentId
                             select new
                             {
                                 BasketFile = b,
                                 ObjectTypeId = 2,
                                 CheckedOutBy = b.CHECKEDOUTBY,
                                 firstObjectId = b.FIRSTOBJECTID,
                                 ParentBasket = (from parentBasket in ent.Basket
                                                 where parentBasket.ID == b.BASKETID
                                                 select parentBasket),
                             }
            };
//Exception handling
var mixedElements = query.First();
ICollection<BasketItem> basketItems = new Collection<BasketItem>();
//Here 15 millis has been used
//only 6 elements were found
if (mixedElements.Basket.Count() > 0)
{
    foreach (var mixedBasket in mixedElements.Basket) { }
}
if (mixedElements.BasketFile.Count() > 0)
{
    foreach (var mixedBasketFile in mixedElements.BasketFile) { }
}
if (mixedElements.BasketImage.Count() > 0)
{
    foreach (var mixedBasketImage in mixedElements.BasketImage) { }
}
//the empty loops take 811 millis!!
Why are you bothering to check the counts before the foreach statements? If there are no results, the foreach will just finish immediately.
Your queries are actually all being deferred - they'll be executed as and when you ask for the data. Don't forget that your outermost query is a LINQ to Objects query: it's just returning the result of calling ent.Basket.Where(...).Select(...) etc... which doesn't actually execute the query.
Your plan to do all three queries in one go isn't actually working. However, by asking for the count separately, you may actually be executing each database query twice - once just getting the count and once for the results.
I strongly suggest that you get rid of the "optimizations" in this code which are making it much more complicated and slower than just writing the simplest code you can.
I don't know of any way of getting LINQ to SQL (or LINQ to EF) to execute multiple queries in a single call - but this approach certainly isn't going to do it.
One other minor hint which is irrelevant in this case, but can be useful in LINQ to Objects - if you want to find out whether there's any data in a collection, just use Any() instead of Count() > 0 - that way it can stop as soon as it's found anything.
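Putting those points together, the simpler shape would be roughly this (a sketch; the loop bodies are placeholders for whatever processing you actually do):
var mixedElements = query.First();
ICollection<BasketItem> basketItems = new Collection<BasketItem>();

// No Count() checks: each query executes once, when its foreach starts enumerating it.
foreach (var mixedBasket in mixedElements.Basket) { /* build basket items */ }
foreach (var mixedBasketFile in mixedElements.BasketFile) { /* ... */ }
foreach (var mixedBasketImage in mixedElements.BasketImage) { /* ... */ }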
You're using IEnumerable in the foreach loop. Implementations only have to prepare data when it's asked for. In this way, I'd suggest that the above code is accessing your data lazily -- that is, only when you enumerate the items (which actually happens when you call Count().)
Put a System.Diagnostics.Stopwatch around the call to Count() and see whether that's taking the bulk of the time you're seeing.
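For example (a sketch using the names from your code):
var sw = System.Diagnostics.Stopwatch.StartNew();
var basketCount = mixedElements.Basket.Count();   // the first enumeration of the query happens here
sw.Stop();
Console.WriteLine("Count() took {0} ms", sw.ElapsedMilliseconds);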
I can't comment further here because you don't specify the type of ent in your code sample.