no improvements on the following PLINQ code - c#

I do not see any improvements in processing speed using the following code:
IEnumerable<Quote> sortedQuotes = (from x in unsortedQuotes.AsParallel()
                                   orderby (x.DateTimeTicks)
                                   select x);
over the sequential version:
IEnumerable<Quote> sortedQuotes = (from x in unsortedQuotes
                                   orderby (x.DateTimeTicks)
                                   select x);
Am I missing something here? I varied the number of items in the source collections from thousands to several tens of millions and no size showed the Parallel version coming out ahead.
Any tips appreciated. By the way, if anyone knows of a faster way to sort (given my item type, which contains a long DateTimeTicks by which the items are sorted in the collection), that would also be appreciated.
Edit: "sorting efficiently" -> As fast as possible.
Thanks

According to this page,
If you have a sort in your query, stop-and-go will be used instead because pipelining the output of a sort is wasteful. A sort exhibits extremely high latency [...], and so PLINQ prefers to devote all processing power to completing the sort as quickly as possible.
Your query contains only a sort; the select doesn't count. So the PLINQ engine will execute it sequentially.
You can only expect some improvement when the sorting is a part of a larger query.
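This is not part of the answer above, but for the asker's side question about sorting as fast as possible, one option worth measuring is an in-place List<T>.Sort with a comparison on the long key. The sketch below is a minimal illustration: the Quote class is a hypothetical stand-in (only DateTimeTicks comes from the question), and whether it actually beats orderby depends on the data and the runtime, so it should be benchmarked rather than assumed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's Quote type;
// only DateTimeTicks comes from the question.
public sealed class Quote
{
    public long DateTimeTicks { get; set; }
}

public static class QuoteSorting
{
    // Sorts the list in place by DateTimeTicks.
    // List<T>.Sort is an unstable sort, so quotes with equal ticks
    // may change their relative order (orderby keeps it).
    public static void SortByTicks(List<Quote> quotes)
    {
        quotes.Sort((a, b) => a.DateTimeTicks.CompareTo(b.DateTimeTicks));
    }
}
```

Unlike orderby, this modifies the original list instead of producing a new sequence, which only suits callers that no longer need the unsorted order.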

Related

Why is For Each loop slower than linq's outer join in this example?

The customers object has 30,000 records and the Orders object has 20,000 records. Left join using for each is 4 seconds slower than using linq's group join. I have two questions:
What is the reason for that?
How can I make it faster without using linq?
foreach (Customer c in Customers) {
    foreach (Order o in Orders) {
        if (c.ID == o.OwnerID) {
            c.OrderName = o.OrderName;
            break;
        }
    }
}
Processing a sorted array is generally faster. (The question about this may be one of the most upvoted on Stack Overflow.) That question is about hardware, but software benefits from it too.
Sort both arrays.
Now start the inner loop from the index where the previous outer iteration stopped (where the IDs matched), not from zero. You already have the early exit, so the total complexity would be
O(small) + O(small) instead of O(brute force)
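The "sort both, then resume the inner loop where it stopped" idea can be sketched as follows. This is a minimal illustration, not the answerer's code: the Customer and Order classes are hypothetical minimal types (only the member names come from the question), and the IDs are assumed to be int.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal types; only the member names come from the question.
public sealed class Customer
{
    public int ID { get; set; }
    public string OrderName { get; set; }
}

public sealed class Order
{
    public int OwnerID { get; set; }
    public string OrderName { get; set; }
}

public static class SortedJoin
{
    public static void AssignOrderNames(List<Customer> customers, List<Order> orders)
    {
        // OrderBy is a stable sort, so the first order of each owner stays first.
        Customer[] sortedCustomers = customers.OrderBy(c => c.ID).ToArray();
        Order[] sortedOrders = orders.OrderBy(o => o.OwnerID).ToArray();

        int j = 0; // position in sortedOrders; it never moves backwards
        foreach (Customer c in sortedCustomers)
        {
            // Skip orders that belong to smaller customer IDs.
            while (j < sortedOrders.Length && sortedOrders[j].OwnerID < c.ID)
            {
                j++;
            }

            // Take the first matching order, like the `break` in the nested loop.
            if (j < sortedOrders.Length && sortedOrders[j].OwnerID == c.ID)
            {
                c.OrderName = sortedOrders[j].OrderName;
            }
        }
    }
}
```

The two sorts cost O(n log n) and O(m log m), and the walk afterwards is O(n + m), compared with O(n * m) for the nested loops.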
If you have time, you can install ArrayFire (C++) and use a wrapper around it from C# for these brute-force calculations. Only this kind of cheating would beat linq's join for small (30k-100k) arrays.
The cheating stops paying off when the number of elements reaches the millions and the algorithm becomes of the utmost importance, unless you have 3-4 high-end GPUs in the case. Even then it would get stuck at around 30M elements and the algorithm would shine again, unless you have a cluster; but if you have a cluster, it would be a waste not to use an advanced algorithm.
Best of all, of course, is C#'s own implementation when the CPU is used. As Ivan Stoev's comment says, a good hash function is better than sorting.
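The hash-based approach mentioned in that comment can be sketched without LINQ by building a Dictionary once and probing it per customer. This is only an illustration under assumptions: the Customer and Order classes are hypothetical minimal types (member names from the question) and the IDs are assumed to be int.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal types; only the member names come from the question.
public sealed class Customer
{
    public int ID { get; set; }
    public string OrderName { get; set; }
}

public sealed class Order
{
    public int OwnerID { get; set; }
    public string OrderName { get; set; }
}

public static class HashJoin
{
    public static void AssignOrderNames(IEnumerable<Customer> customers, IEnumerable<Order> orders)
    {
        // Build the index once: owner ID -> first order of that owner.
        var firstOrderByOwner = new Dictionary<int, Order>();
        foreach (Order o in orders)
        {
            // Keep only the first order per owner, matching the `break`
            // in the original nested loop.
            if (!firstOrderByOwner.ContainsKey(o.OwnerID))
            {
                firstOrderByOwner.Add(o.OwnerID, o);
            }
        }

        // One hash lookup per customer instead of a scan over all orders.
        foreach (Customer c in customers)
        {
            Order match;
            if (firstOrderByOwner.TryGetValue(c.ID, out match))
            {
                c.OrderName = match.OrderName;
            }
        }
    }
}
```

With 30,000 customers and 20,000 orders this does roughly 50,000 dictionary operations instead of up to 600 million comparisons.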
I don't know why you are trying to avoid linq. Two nested foreach loops are not always good practice; however, you could try for loops instead of foreach, which can be faster for big lists of data.

Comparison between select then filtering and select distinct in linq, c#

Which is faster?
Getting a list of some variable (say string type) from a LINQ and then filtering duplicates in C#, or directly selecting distinct values in LINQ only?
Say we are having
N rows if we take duplicates
and R if we filter
( N >> R ) there are many duplicates.
Basically I am asking, in general which is faster and better programming
selecting whole N rows in LINQ, convert it into a list and then filtering it to R rows
or directly selecting the R rows from LINQ and converting it to a list.
Note :
In SQL, the time taken to get R rows is roughly 2 times what it takes to get N in my case! But a generic answer is welcome.
I assume that when you say Linq, you mean LinqToSQL.
The rule of thumb when connecting to a database is to fetch only what you need; so if you have a good querying strategy for Linq, filtering in LinqToSQL can save a lot of wasted work.
If the column you're filtering on happens to have a full-text index, you've hit the jackpot.
Look, your question is complex. Here is what I mean:
1) Better programming is, most of the time, to use a ready-made built-in function.
2) In my experience Distinct works faster, both in MsSql and in C#.
3) LINQ is lazy about filtering, especially if you have many items in your list. Distinct is optimized by Microsoft's developers.
Note: there is a similar question that may be useful.
Result: try to use more of the built-in functions your platform offers. There is plenty of information on the net, and you can avoid paragraphs of code just by calling a ready-made function.
Hope it helped.
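For the in-memory (LINQ to Objects) case, the two options from the question can be written side by side as below. This is a sketch with hypothetical method names; it says nothing about the SQL timings in the question, where a Distinct in LinqToSQL is translated into the query that the database runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctDemo
{
    // Option A: materialize all N values first, then remove duplicates.
    public static List<string> SelectThenFilter(IEnumerable<string> source)
    {
        List<string> all = source.ToList();        // holds all N values
        return new HashSet<string>(all).ToList();  // reduced to R values
    }

    // Option B: let Distinct drop duplicates while streaming,
    // so only the R distinct values are ever stored in the result.
    public static List<string> SelectDistinct(IEnumerable<string> source)
    {
        return source.Distinct().ToList();
    }
}
```

Both options do the same hashing work per element; option A additionally allocates a list of all N values, which matters when N >> R.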

LINQ c# efficiency

I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.
What is the order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName);
IEnumerable result= MyValues.Distinct();
I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quick sort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still just O(n log n).
Update: the LINQ Distinct() call should also be O(n).
So altogether, the Big-O for this thing is still O(K n log n), where K is some constant.
Is there a more efficient way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.
from comments
I actually use MyDistinct.Distinct()
If you want distinct _varName values and you cannot do it all in the select query in the DBMS (which would be the most efficient way), you should use Distinct before OrderBy. The order matters here.
Otherwise you would need to order all the millions of rows before you start to filter out the duplicates. If you use Distinct first, you only need to order what remains.
var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);
I have asked a related question recently which E.Lippert answered with a good explanation when order matters and when not:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but you can also see that it does not really matter since comparing doubles is trivial for a cpu:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
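The code of the demo is not included in the post. A comparable measurement could be sketched as follows; the class and method names are hypothetical, and the data set (one million doubles with only 1,000 distinct values) is an assumption, so the timings will not match the figures above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class OrderVsDistinctDemo
{
    // Sorts all values, then removes duplicates.
    public static double[] OrderThenDistinct(IEnumerable<double> values)
    {
        return values.OrderBy(d => d).Distinct().ToArray();
    }

    // Removes duplicates first, then sorts only the distinct values.
    public static double[] DistinctThenOrder(IEnumerable<double> values)
    {
        return values.Distinct().OrderBy(d => d).ToArray();
    }

    // Measures the wall-clock time of one call.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Run()
    {
        // Assumed test data: 1,000,000 values, at most 1,000 distinct.
        var random = new Random(42);
        double[] data = Enumerable.Range(0, 1000000)
                                  .Select(_ => (double)random.Next(1000))
                                  .ToArray();

        Console.WriteLine("Time for first orderby then distinct: "
            + Time(() => OrderThenDistinct(data)));
        Console.WriteLine("Time for first distinct then orderby: "
            + Time(() => DistinctThenOrder(data)));
    }
}
```

Calling OrderVsDistinctDemo.Run() prints the two timings; a single run includes JIT warm-up, so repeating the measurement gives more reliable numbers.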
Your query above (linq) is fine if you want all the million records and you have enough memory on a 64-bit OS.
In SQL terms, the query is equivalent to
Select <_varname> from MyDataTable order by <_varname>
and this is as good as it gets when run in a database IDE or on the command line.
To give you a short answer regarding performance:
Put in a where clause if you can (with columns that are indexed).
Ensure that the user can only choose columns (_varname) that are indexed. Imagine the DB trying to sort a million records on an unindexed column: that is evidently slow, but it is linq that ends up with the bad press.
Ensure (if possible) that MyDataTable is initialised with only the records that are of value (again based on a where clause).
Profile your underlying query.
If possible, create stored procedures (debatable). You can create an entity model which includes stored procedures as well.
It may be fast today, but as the tablespace grows, and if your data is not ordered (indexed), that's where things get slower (even if you have a good linq expression).
Hope this helps
that said, if your db is not properly indexed, meaning

Is it possible to accelerate (dynamic) LINQ queries using GPU?

I have been searching for some days for solid information on the possibility to accelerate LINQ queries using a GPU.
Technologies I have "investigated" so far:
Microsoft Accelerator
Cudafy
Brahma
In short, would it even be possible at all to do an in-memory filtering of objects on the GPU?
Let's say we have a list of some objects and we want to filter something like:
var result = myList.Where(x => x.SomeProperty == SomeValue);
Any pointers on this one?
Thanks in advance!
UPDATE
I'll try to be more specific about what I am trying to achieve :)
The goal is to use any technology which is able to filter a list of objects (ranging from ~50 000 to ~2 000 000) in the absolutely fastest way possible.
The operations I perform on the data once the filtering is done (sum, min, max etc.) use the built-in LINQ methods and are already fast enough for our application, so that's not a problem.
The bottleneck is "simply" the filtering of data.
UPDATE
Just wanted to add that I have tested about 15 databases, including MySQL (checking possible cluster approach / memcached solution), H2, HSQLDB, VelocityDB (currently investigating further), SQLite, MongoDB etc, and NONE is good enough when it comes to the speed of filtering data (of course, the NO-sql solutions do not offer this like the sql ones, but you get the idea) and/or the returning of the actual data.
Just to summarize what I/we need:
A database which is able to sort data in the format of 200 columns and about 250 000 rows in less than 100 ms.
I currently have a solution with parallelized LINQ which is able (on a specific machine) to spend only nanoseconds on each row when filtering AND processing the result!
So, we need like sub-nano-second-filtering on each row.
Why does it seem that only in-memory LINQ is able to provide this?
Why would this be impossible?
Some figures from the logfile:
Total tid för 1164 frågor: 2579
This is Swedish and translates:
Total time for 1164 queries: 2579
Where the queries in this case are queries like:
WHERE SomeProperty = SomeValue
And those queries are all being done in parallel on 225639 rows.
So, 225639 rows are being filtered in memory 1164 times in about 2.5 seconds.
That's about 9.5e-9 seconds (roughly 9.5 nanoseconds) per row, BUT that also includes the actual processing of the numbers! We do Count (not null), total count, Sum, Min, Max, Avg, Median. So, we have 7 operations on these filtered rows.
So, we could say it's actually 7 times faster than the databases we've tried, since we do NOT do any aggregation in those cases!
So, in conclusion, why are the databases so poor at filtering data compared to in-memory LINQ filtering? Has Microsoft really done such a good job that it is impossible to compete with it? :)
It makes sense that in-memory filtering should be faster, but I don't want just a sense that it is faster. I want to know what is faster and, if possible, why.
I will answer definitively about Brahma since it's my library, but it probably applies to other approaches as well. The GPU has no knowledge of objects. Its memory is also mostly completely separate from CPU memory.
If you do have a LARGE set of objects and want to operate on them, you can only pack the data you want to operate on into a buffer suitable for the GPU/API you're using and send it off to be processed.
Note that this will make two round trips over the CPU-GPU memory interface, so if you aren't doing enough work on the GPU to make it worthwhile, you'll be slower than if you simply used the CPU in the first place (like the sample above).
Hope this helps.
The GPU is really not intended for all general purpose computing purposes, especially with object oriented designs like this, and filtering an arbitrary collection of data like this would really not be an appropriate thing.
GPU computations are great for things where you are performing the same operation on a large dataset - which is why things like matrix operations and transforms can be very nice. There, the data copying can be outweighed by the incredibly fast computational capabilities on the GPU....
In this case, you'd have to copy all of the data into the GPU to make this work, and restructure it into some form the GPU will understand, which would likely be more expensive than just performing the filter in software in the first place.
Instead, I would recommend looking at using PLINQ for speeding up queries of this nature. Provided your filter is thread safe (which it'd have to be for any GPU related work...) this is likely a better option for general purpose query optimization, as it won't require the memory copying of your data. PLINQ would work by rewriting your query as:
var result = myList.AsParallel().Where(x => x.SomeProperty == SomeValue);
If the predicate is an expensive operation, or the collection is very large (and easily partitionable), this can make a significant improvement to the overall performance when compared to standard LINQ to Objects.
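A slightly fuller sketch of that PLINQ rewrite, combining the filter with aggregates like the ones the question mentions, might look like this. The Row class and the method names are hypothetical; SomeProperty comes from the question, while the Amount column is made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; SomeProperty comes from the question,
// Amount is an invented column to aggregate over.
public sealed class Row
{
    public int SomeProperty { get; set; }
    public double Amount { get; set; }
}

public static class ParallelFilter
{
    // Filters in parallel and sums the matching rows.
    public static double SumMatching(List<Row> rows, int someValue)
    {
        return rows.AsParallel()
                   .Where(x => x.SomeProperty == someValue)
                   .Sum(x => x.Amount);
    }

    // Filters in parallel and counts the matching rows.
    public static int CountMatching(List<Row> rows, int someValue)
    {
        return rows.AsParallel()
                   .Count(x => x.SomeProperty == someValue);
    }
}
```

Because the partial sums are combined in a nondeterministic order, floating-point results may differ in the last digits from a sequential Sum.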
GpuLinq
GpuLinq's main mission is to democratize GPGPU programming through LINQ. The main idea is that we represent the query as an Expression tree and after various transformations-optimizations we compile it into fast OpenCL kernel code. In addition we provide a very easy to work API without the need of messing with the details of the OpenCL API.
https://github.com/nessos/GpuLinq
select *
from table1 -- contains 100k rows
left join table2 -- contains 1M rows
on table1.id1=table2.id2 -- this would run for ~100G times
-- unless they are cached on sql side
where table1.id between 1 and 100000 -- but this optimizes things (depends)
could be turned into
select id1 from table1 -- 400k bytes if id1 is 32 bit
-- no need to order
stored in memory
select id2 from table2 -- 4Mbytes if id2 is 32 bit
-- no need to order
stored in memory, both arrays sent to gpu using a kernel(cuda,opencl) like below
int i=get_global_id(0); // to select an id2, we need a thread id
int selectedID2=id2[i];
int summary__=-1; // index of the matching id1, or -1 if there is none
for(int j=0;j<id1Length;j++)
{
    int selectedID1=id1[j];
    summary__=(selectedID2==selectedID1?j:summary__); // no branching
}
summary[i]=summary__; // accumulates target indexes of the
                      // "on table1.id1=table2.id2" part.
On the host side, you can make
select * from table1 --- query3
and
select * from table2 --- query4
then use the id list from gpu to select the data
// x is table1 ' s data
myList.AsParallel().ForEach(x=>query3.leftjoindata=query4[summary[index]]);
The gpu code shouldn't take longer than 50 ms on a gpu with constant memory, global broadcast ability and some thousands of cores.
If any trigonometric function is used for filtering, the performance would drop fast. Also, a left join has O(m*n) complexity in the row counts of the two tables, so millions versus millions would be much slower. GPU memory bandwidth is important here.
Edit:
A single operation of gpu.findIdToJoin(table1,table2,"id1","id2") on my hd7870(1280 cores) and R7-240(320 cores) with "products table(64k rows)" and a "categories table(64k rows)" (left join filter) took 48 milliseconds with unoptimized kernel.
Ado.Net 's "nosql" style linq-join took more than 2000 ms with only 44k products and 4k categories table.
Edit-2:
left join with a string search condition gets 50 to 200 x faster on gpu when tables grow to 1000s of rows each having at least hundreds of characters.
The simple answer for your use case is no.
1) There's no solution for that kind of workload even in raw LINQ to Objects, much less in something that would replace your database.
2) Even if you were fine with loading the whole data set at once (this takes time), it would still be much slower: GPUs have high throughput but high access latency. So if you're looking for "very" fast solutions, GPGPU is often not the answer, as just preparing and sending the workload and getting the results back will be slow, and in your case it would probably need to be done in chunks too.

List<T> FirstOrDefault() bad performance - is dictionary possible in this case?

I have set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million or more iterations) and have to look up the corresponding code every time, I cache them in a List<>. After finding the correct codes, I'm inserting (using SqlBulkCopy) a million rows.
I lookup the id with the following code (l_z is a List<T>)
var z_fk = (from z in l_z
            where z.CODE == lookupCode &&
                  z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to lookup the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
            where z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, that it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, then create it).
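The bookkeeping described above (add each Code to the list for its lookup code, creating the list if needed) could be wrapped up as in the sketch below. The Code class is a hypothetical shape whose member names follow the question; the id is assumed to be an int and the dates DateTime values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of the cached rows; member names follow the question.
public sealed class Code
{
    public int id { get; set; }
    public string CODE { get; set; }
    public DateTime VALIDFROM { get; set; }
    public DateTime VALIDUNTIL { get; set; }
}

public sealed class CodeCache
{
    private readonly Dictionary<string, List<Code>> _byCode =
        new Dictionary<string, List<Code>>();

    // Adds the code to the list for its lookup code, creating the list if needed.
    public void Add(Code code)
    {
        List<Code> bucket;
        if (!_byCode.TryGetValue(code.CODE, out bucket))
        {
            bucket = new List<Code>();
            _byCode.Add(code.CODE, bucket);
        }
        bucket.Add(code);
    }

    // Returns the id valid on lookupDate, or 0 (default(int)) when nothing matches.
    public int FindId(string lookupCode, DateTime lookupDate)
    {
        List<Code> bucket;
        if (!_byCode.TryGetValue(lookupCode, out bucket))
        {
            return 0;
        }

        // Only the entries for this code are scanned, not the whole cache.
        return bucket.Where(z => z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate)
                     .Select(z => z.id)
                     .SingleOrDefault();
    }
}
```

Using TryGetValue instead of the indexer avoids the KeyNotFoundException that dict[lookupCode] throws for an unknown code.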
A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z=>z.CODE);
//your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing date times or some primitive value (like a time_t). Think about how your data types affects performance. Choose the best ones.
Should you really be doing this in memory or should you be putting all these rows in to SQL and letting it be queried on there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. a Dictionary keyed by code whose values are lists of events. If you have many evenly distributed codes, each list will be about N/M in size (where N = total event count and M = number of codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: each list could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this of course has a trade-off in the time spent building the collection). This should provide very quick lookups.
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that were started between 12:00 and 12:01 might be in one bucket, etc. If you have a very small number of events and a lot of highly frequent (but not pathologically so) events this might give you very good lookup performance.
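The first of these ideas (partition by code, keep each partition sorted by start time, then binary search) can be sketched as below. It is a minimal illustration that assumes the validity periods of one code never overlap, which the SingleOrDefault in the question suggests; the Code class and the SortedCodeIndex name are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of the cached rows; member names follow the question.
public sealed class Code
{
    public int id { get; set; }
    public string CODE { get; set; }
    public DateTime VALIDFROM { get; set; }
    public DateTime VALIDUNTIL { get; set; }
}

public sealed class SortedCodeIndex
{
    private readonly Dictionary<string, Code[]> _byCode;

    public SortedCodeIndex(IEnumerable<Code> codes)
    {
        // One array per code, sorted by the start of the validity period.
        _byCode = codes.GroupBy(c => c.CODE)
                       .ToDictionary(g => g.Key,
                                     g => g.OrderBy(c => c.VALIDFROM).ToArray());
    }

    // Returns the id valid on lookupDate, or 0 when nothing matches.
    // Assumes the periods of one code do not overlap.
    public int FindId(string lookupCode, DateTime lookupDate)
    {
        Code[] periods;
        if (!_byCode.TryGetValue(lookupCode, out periods))
        {
            return 0;
        }

        // Binary search for the last period with VALIDFROM <= lookupDate.
        int lo = 0, hi = periods.Length - 1, candidate = -1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (periods[mid].VALIDFROM <= lookupDate)
            {
                candidate = mid;
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;
            }
        }

        // The candidate only matches if the date is still inside its period.
        if (candidate >= 0 && periods[candidate].VALIDUNTIL >= lookupDate)
        {
            return periods[candidate].id;
        }
        return 0;
    }
}
```

Each lookup then costs one hash probe plus O(log k) comparisons, where k is the number of periods for that code; the price is the one-time grouping and sorting in the constructor.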
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.
