I want to show similar products, so-called variants, for a product. Currently I am doing it as below:
public IList<Product> GetVariants(string productName)
{
    EFContext db = new EFContext(); // using Entity Framework
    return db.Products
        .Where(product => product.ProductName == productName)
        .ToList();
}
But this returns only exact matches, that is, the current product itself. I am thinking of using Levenshtein distance as a basis for finding similar products. But before that, I want to check what most developers do to get variants:
Is it good to use Levenshtein distance? Is it used in industry for this purpose?
Do I have to add another table in database showing the variants for the product while adding the product to database?
I used the Jaro-Winkler distance effectively to account for typos in one system I wrote a while back. IMO, it's much better than a simple edit distance calculation, as it accounts for string lengths fairly effectively. See this question on SO for open source implementations.
I ended up writing it in C# and importing it into SQL server as a SQL CLR function, but it was still relatively slow. It worked in my case mostly because such queries were executed infrequently (100-200 in a day).
If you expect a lot of traffic, you'd have to build an index to make these lookups faster. One strategy would be to periodically compute the distance between each pair of products and store it in an index table whenever the distance falls below a certain threshold. To reduce the amount of work, you can run this only once or twice a day, and you can limit it to records that are new or modified since the last run. You can then look up similar products and order by distance quickly.
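For comparison, the plain Levenshtein distance the question asks about can be sketched in a few lines. This is the classic dynamic-programming formulation; the class and method names are my own, and it is not tuned for large-scale use:

```csharp
using System;

// Minimal Levenshtein distance (dynamic programming, O(m*n) time).
// A sketch for comparing product names, not a production implementation.
static class EditDistance
{
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // all deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // all insertions
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1; // substitution cost
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,  // delete
                                            d[i, j - 1] + 1), // insert
                                   d[i - 1, j - 1] + cost);   // substitute
            }
        return d[a.Length, b.Length];
    }
}
```

Products whose name is within some small distance of the searched name could then be treated as variants, e.g. `Levenshtein("kitten", "sitting")` is 3.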
I have three tables:
Sites:
Id
Timezone (string)
SiteName
Vehicles:
Id
SiteId
Name
Positions:
Id
TimestampLocal (DateTimeOffset)
VehicleId
Data1
Data2
...
Data50
There are multiple positions for a single vehicle. The Positions table is very large (100+ million records).
I need to get the last position for each vehicle (by timestamp as they can send old data) and its timezone so I can do further data processing based on the timezone. Something like {PositionId, VehicleId, Timezone, Data1}
I have tried with:
var result =
from ot in entities.Positions
join v in entities.Vehicles on ot.VehicleId equals v.Id
join s in entities.Sites on v.SiteId equals s.Id
group ot by ot.VehicleId into grp
select grp.OrderByDescending(g=>g.TimestampLocal).FirstOrDefault();
then I process the data with:
foreach (var rr in result){... update Data1 field ... }
This gets the latest values, but it brings back all the fields in Positions (a lot of data) and no Timezone. Also, the foreach part is very CPU intensive (probably because it materializes the data); it sits at 100% CPU for a few seconds.
How can this be done in LINQ while staying lightweight for the DB and the data transfer?
Check out the following. It is along the same lines as what you have done, but the result includes the Sites' Timezone column, projected in the join itself:
var result =
    S.Join(V, s1 => s1.Id, v1 => v1.SiteId, (s1, v1) => new { v1.Id, s1.Timezone })
     .Join(P, v1 => v1.Id, p1 => p1.VehicleId, (v1, p1) => new { p1, v1.Timezone })
     .GroupBy(g => g.p1.VehicleId)
     .Select(x => x.OrderByDescending(y => y.p1.TimestampLocal).FirstOrDefault())
     .Select(y => new { y.p1, y.Timezone });
Now, some important points related to the questions you asked:
To reduce the number of columns fetched (you may not want all the Positions columns), the following needs to be done:
In this line - Join(P, v1 => v1.Id, p1 => p1.VehicleId, (v1, p1) => new {p1, v1.Timezone}) -
project the fields resulting from the join, something like:
new { p1.Id, p1.TimestampLocal, p1.VehicleId, p1.Data1, v1.Timezone }
which will fetch only the projected fields; note that the GroupBy would then change to GroupBy(g => g.VehicleId).
Another option is to change the GroupBy projection instead of the Join, as follows:
GroupBy(g => g.p1.VehicleId,
        g => new { g.p1.Id, g.p1.TimestampLocal, g.p1.VehicleId, g.p1.Data1, g.Timezone })
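Putting the joins and the projected GroupBy together, the query might look like the sketch below. It runs here against hypothetical in-memory stand-ins for the three tables (against a real DbContext, S, V and P would be DbSets, and EF would translate the same shape to SQL):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the question's tables, just to show the shape.
record Site(int Id, string Timezone);
record Vehicle(int Id, int SiteId);
record Position(int Id, DateTimeOffset TimestampLocal, int VehicleId, int Data1);

class Demo
{
    static void Main()
    {
        var S = new List<Site> { new(1, "UTC") };
        var V = new List<Vehicle> { new(10, 1) };
        var P = new List<Position>
        {
            new(100, DateTimeOffset.Parse("2020-01-01T10:00:00Z"), 10, 5),
            new(101, DateTimeOffset.Parse("2020-01-01T12:00:00Z"), 10, 7),
        };

        // Project only the needed columns before grouping, carrying the
        // Timezone through from the Sites join.
        var result = S
            .Join(V, s1 => s1.Id, v1 => v1.SiteId, (s1, v1) => new { v1.Id, s1.Timezone })
            .Join(P, v1 => v1.Id, p1 => p1.VehicleId, (v1, p1) => new { p1, v1.Timezone })
            .GroupBy(
                g => g.p1.VehicleId,
                g => new { g.p1.Id, g.p1.TimestampLocal, g.p1.VehicleId, g.p1.Data1, g.Timezone })
            .Select(grp => grp.OrderByDescending(x => x.TimestampLocal).First())
            .ToList();

        // Latest position per vehicle, with its timezone.
        Console.WriteLine($"{result[0].Id} {result[0].Timezone}");
    }
}
```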
Now, regarding the processing being CPU intensive and hitting 100% CPU, the following could be done to optimize:
If each foreach iteration makes a network Update call, the loop is bound to be network intensive and therefore slow. Preferably, make all the necessary changes in memory and then update the database in one go; that would still be intensive if you have millions of records.
Even doing that for a million records will never be a good idea, since it would be fairly network and CPU intensive and thus slow. Your options would be:
Chunk the work into smaller batches. It is rare to genuinely need to update a million records in one shot, so make multiple smaller updates, executed one after another (or triggered by the user); this is much less taxing on system resources.
Bring a smaller dataset into memory: do the filtering at the database level using a parameter, and pass only the data that needs modification into memory for the update.
Use projection, as shown in the LINQ above, to bring back only the required columns; that reduces the overall memory footprint and is bound to have an impact.
If the logic is such that the various updates are mutually exclusive, do them using the Parallel API with a thread-safe structure. That ensures efficient and quick utilization of all the cores, so it is faster; CPU will still spike to 100%, but for a fraction of the time of the non-parallel execution.
Beyond this, provide specific details and I can help further. These are basic suggestions; there's no golden rule for solving such optimization issues at all.
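As a sketch of that last point, here is what independent per-row work looks like with the Parallel API and a thread-safe collector (the row data and the doubling "update" are placeholders for the real Data1 logic):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class ParallelUpdateDemo
{
    static void Main()
    {
        var rows = Enumerable.Range(0, 100_000).ToArray();

        // Thread-safe collector so iterations never contend on shared state.
        var processed = new ConcurrentBag<int>();

        // Each iteration is independent (mutually exclusive), so the rows
        // can be spread across all cores without locking.
        Parallel.ForEach(rows, row =>
        {
            processed.Add(row * 2); // stand-in for the per-row Data1 update
        });

        Console.WriteLine(processed.Count);
    }
}
```

The in-memory results can then be flushed to the database in batches, rather than one network call per row.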
I am developing an application that tracks changes to an object's properties. Each time an object's properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object's properties change)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows
    .GroupBy(r => r.UserFriendlyId)
    .Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r => r.Revision) })
    .ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows
    .Where(r => latestIdAndRev.Contains(new { UserFriendlyId = r.UserFriendlyId, Revision = r.Revision }))
    .ToList();
Even though my table only has ~3K rows, the performance of the latestRevs statement is horrible (it takes several minutes to finish, if it doesn't time out first).
Any idea on what I might do differently to get better performance retrieving the latest revision for a collection of userfriendlyids?
To increase the performance of your query, you should try to make the entire query run on the database. You have divided the query into two parts: the first query pulls all the revisions to the client side into latestIdAndRev, and the second query, .Where(r => latestIdAndRev.Contains( ... )), then translates into a SQL statement along the lines of WHERE ... IN followed by a list of all the IDs you are looking for.
You can combine the queries into a single query where you group by UserFriendlyId and then for each group select the row with the highest revision simply ordering the rows by Revision (descending) and picking the first row:
latestRevs = context.Rows.GroupBy(
r => r.UserFriendlyId,
(key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate pretty efficient SQL, even though I have not been able to verify it myself. To further increase performance, you should look at indexing the UserFriendlyId and Revision columns, but your results may vary. In general, adding an index increases the time it takes to insert a row but may decrease the time it takes to find one.
(General advice: watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)), because all the IDs have to be included in the query. This is not a fault of the O/R mapper.)
There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open Profiler, start a profile on the database in question, and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually executed.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning window functions. A few years back, we had a query that took up to 30 seconds reduced to milliseconds by heading in that direction. NOTE: I am not stating this is your solution, just that it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).
I have been searching for some days for solid information on the possibility to accelerate LINQ queries using a GPU.
Technologies I have "investigated" so far:
Microsoft Accelerator
Cudafy
Brahma
In short, would it even be possible at all to do an in-memory filtering of objects on the GPU?
Let's say we have a list of objects and we want to run a filter like:
var result = myList.Where(x => x.SomeProperty == SomeValue);
Any pointers on this one?
Thanks in advance!
UPDATE
I'll try to be more specific about what I am trying to achieve :)
The goal is to use any technology able to filter a list of objects (ranging from ~50,000 to ~2,000,000) in the absolutely fastest way possible.
The operations I perform on the data once the filtering is done (sum, min, max, etc.) use the built-in LINQ methods and are already fast enough for our application, so that's not a problem.
The bottleneck is "simply" the filtering of the data.
UPDATE
Just wanted to add that I have tested about 15 databases, including MySQL (checking a possible cluster approach / memcached solution), H2, HSQLDB, VelocityDB (currently investigating further), SQLite, MongoDB, etc., and NONE is good enough when it comes to the speed of filtering data (of course, the NoSQL solutions do not offer this the way the SQL ones do, but you get the idea) and/or returning the actual data.
Just to summarize what I/we need:
A database which is able to sort data in the format of 200 columns and about 250 000 rows in less than 100 ms.
I currently have a solution with parallelized LINQ which is able (on a specific machine) to spend only nanoseconds on each row when filtering AND processing the result!
So, we need something like sub-nanosecond filtering on each row.
Why does it seem that only in-memory LINQ is able to provide this?
Why would this be impossible?
Some figures from the logfile:
Total tid för 1164 frågor: 2579
This is Swedish and translates:
Total time for 1164 queries: 2579
Where the queries in this case are queries like:
WHERE SomeProperty = SomeValue
And those queries are all being run in parallel on 225,639 rows.
So, 225,639 rows are being filtered in memory 1,164 times in about 2.5 seconds.
That's about 9.5 nanoseconds per row, BUT that also includes the actual processing of the numbers! We do count (non-null), total count, sum, min, max, average, and median, so that is 7 operations on these filtered rows.
So, we could say it's actually 7 times faster than the databases we've tried, since we do NOT do any of that aggregation in those cases!
So, in conclusion: why are the databases so poor at filtering data compared to in-memory LINQ filtering? Has Microsoft really done such a good job that it is impossible to compete with it? :)
It makes sense that in-memory filtering should be faster, but I don't want a sense that it is faster. I want to know that it is faster and, if possible, why.
I will answer definitively about Brahma since it's my library, but this probably applies to other approaches as well. The GPU has no knowledge of objects. Its memory is also mostly completely separate from CPU memory.
If you do have a LARGE set of objects and want to operate on them, you can only pack the data you want to operate on into a buffer suitable for the GPU/API you're using and send it off to be processed.
Note that this will make two round trips over the CPU-GPU memory interface, so if you aren't doing enough work on the GPU to make it worthwhile, you'll be slower than if you simply used the CPU in the first place (like the sample above).
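The packing step might look like the sketch below. The Item type and the SomeProperty field are illustrative; the point is that a GPU kernel consumes a flat primitive buffer, not an object graph:

```csharp
using System;
using System.Linq;

class PackDemo
{
    // Hypothetical object type; only SomeProperty is needed by the kernel.
    record Item(int Id, float SomeProperty);

    static void Main()
    {
        var items = Enumerable.Range(0, 4)
            .Select(i => new Item(i, i * 1.5f))
            .ToArray();

        // Extract just the field the kernel needs into a contiguous
        // primitive array; this is the buffer that gets uploaded to the
        // GPU, and results are mapped back to objects by index afterwards.
        float[] buffer = items.Select(it => it.SomeProperty).ToArray();

        Console.WriteLine(buffer.Length);
    }
}
```

The cost of this extraction, plus the two transfers over the PCIe bus, is exactly the overhead the answer describes.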
Hope this helps.
The GPU is really not intended for all general purpose computing purposes, especially with object oriented designs like this, and filtering an arbitrary collection of data like this would really not be an appropriate thing.
GPU computations are great for things where you are performing the same operation on a large dataset, which is why things like matrix operations and transforms can be very nice. There, the data copying can be outweighed by the incredibly fast computational capabilities of the GPU.
In this case, you'd have to copy all of the data into the GPU to make this work, and restructure it into some form the GPU will understand, which would likely be more expensive than just performing the filter in software in the first place.
Instead, I would recommend looking at using PLINQ for speeding up queries of this nature. Provided your filter is thread safe (which it'd have to be for any GPU related work...) this is likely a better option for general purpose query optimization, as it won't require the memory copying of your data. PLINQ would work by rewriting your query as:
var result = myList.AsParallel().Where(x => x.SomeProperty == SomeValue);
If the predicate is an expensive operation, or the collection is very large (and easily partitionable), this can make a significant improvement to the overall performance when compared to standard LINQ to Objects.
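As a quick sanity check that the parallel form is a drop-in replacement, the following sketch (with made-up data) verifies that AsParallel produces the same matches as the sequential query:

```csharp
using System;
using System.Linq;

class PlinqDemo
{
    static void Main()
    {
        // Hypothetical dataset in the size range the question mentions.
        var myList = Enumerable.Range(0, 1_000_000)
            .Select(i => new { SomeProperty = i % 100 })
            .ToList();

        int someValue = 42;

        // PLINQ partitions the list across cores; aside from ordering,
        // the result set matches the sequential query.
        int parallel = myList.AsParallel()
                             .Where(x => x.SomeProperty == someValue)
                             .Count();

        int sequential = myList.Count(x => x.SomeProperty == someValue);

        Console.WriteLine(parallel == sequential);
    }
}
```

Whether it is actually faster depends on the predicate cost and the partitionability of the data, as noted above.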
GpuLinq
GpuLinq's main mission is to democratize GPGPU programming through LINQ. The main idea is that we represent the query as an Expression tree and after various transformations-optimizations we compile it into fast OpenCL kernel code. In addition we provide a very easy to work API without the need of messing with the details of the OpenCL API.
https://github.com/nessos/GpuLinq
select *
from table1 -- contains 100k rows
left join table2 -- contains 1M rows
on table1.id1=table2.id2 -- this would run for ~100G times
-- unless they are cached on sql side
where table1.id between 1 and 100000 -- but this optimizes things (depends)
could be turned into
select id1 from table1 -- 400k bytes if id1 is 32 bit
-- no need to order
stored in memory
select id2 from table2 -- 4Mbytes if id2 is 32 bit
-- no need to order
stored in memory, both arrays sent to gpu using a kernel(cuda,opencl) like below
int i = get_global_id(0);          // each thread handles one id2
int selectedID2 = id2[i];
int summary__ = -1;
for (int j = 0; j < id1Length; j++)
{
    int selectedID1 = id1[j];
    summary__ = (selectedID2 == selectedID1 ? j : summary__); // no branching
}
summary[i] = summary__; // stores the matching table1 index (or -1),
                        // i.e. the "on table1.id1=table2.id2" part
On the host side, you can make
select * from table1 --- query3
and
select * from table2 --- query4
then use the id list from gpu to select the data
// x is table1's data; note ParallelQuery exposes ForAll rather than ForEach
myList.AsParallel().ForAll(x => x.leftjoindata = query4[summary[index]]);
The gpu code shouldn't be slower than 50ms for a gpu with constant memory, global broadcast ability and some thousands of cores.
If any trigonometric function is used for filtering, the performance would drop fast. Also when left joined tables row count makes it O(m*n) complexity so millions versus millions would be much slower. GPU memory bandwidth is important here.
Edit:
A single operation of gpu.findIdToJoin(table1,table2,"id1","id2") on my hd7870(1280 cores) and R7-240(320 cores) with "products table(64k rows)" and a "categories table(64k rows)" (left join filter) took 48 milliseconds with unoptimized kernel.
ADO.NET's "nosql" style LINQ join took more than 2000 ms with only a 44k-row products table and a 4k-row categories table.
Edit-2:
left join with a string search condition gets 50 to 200 x faster on gpu when tables grow to 1000s of rows each having at least hundreds of characters.
The simple answer for your use case is no.
1) There's no solution for that kind of workload even in raw LINQ to Objects, much less in something that would replace your database.
2) Even if you were fine with loading the whole data set at once (this takes time), it would still be much slower: GPUs have high throughput, but their access latency is high. So if you're looking at "very" fast solutions, GPGPU is often not the answer, as just preparing/sending the workload and getting back the results is slow, and in your case it would probably need to be done in chunks too.
I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations), and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I'm inserting (using SqlBulkCopy) a million rows.
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to lookup the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, create it).
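A runnable sketch of this answer, with illustrative data (the Code record and the dates are made up; property names follow the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CodeLookupDemo
{
    // Illustrative stand-in for the question's cached code objects.
    record Code(int id, string CODE, DateTime VALIDFROM, DateTime VALIDUNTIL);

    static void Main()
    {
        var l_z = new List<Code>
        {
            new(1, "A", new DateTime(2020, 1, 1), new DateTime(2020, 6, 30)),
            new(2, "A", new DateTime(2020, 7, 1), new DateTime(2020, 12, 31)),
            new(3, "B", new DateTime(2020, 1, 1), new DateTime(2020, 12, 31)),
        };

        // Build the per-code dictionary once, up front.
        var dict = l_z.GroupBy(z => z.CODE)
                      .ToDictionary(g => g.Key, g => g.ToList());

        // The hot-loop lookup now only scans the handful of entries
        // sharing the same code, instead of the whole list.
        var lookupDate = new DateTime(2020, 8, 15);
        var z_fk = (from z in dict["A"]
                    where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
                    select z.id).SingleOrDefault();

        Console.WriteLine(z_fk);
    }
}
```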
A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z => z.CODE);
//your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could go further with a more complex, smarter data structure that stores the dates in sorted order and uses a binary search to find the id, but this may be sufficient. Further, you mention SqlBulkCopy: if you're dealing with a database, perhaps you can execute the query on the database side and simply create an appropriate index covering the CODE, VALIDUNTIL and VALIDFROM columns.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTimes or some primitive value (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory or should you be putting all these rows in to SQL and letting it be queried on there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. Dictionary<string, List<Event>>. If you have many evenly distributed codes, your lists will each be about N/M in size (where N = total event count and M = number of distinct codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: the List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this, of course, has a trade-off in the time spent building the collection). This should provide very quick lookups.
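That first idea (partition by code, keep each partition sorted, binary-search the dates) might be sketched as follows; the Event record and dates are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PartitionDemo
{
    record Event(int Id, string Code, DateTime Start);

    static void Main()
    {
        var events = new List<Event>
        {
            new(1, "A", new DateTime(2020, 1, 1)),
            new(2, "A", new DateTime(2020, 3, 1)),
            new(3, "B", new DateTime(2020, 2, 1)),
        };

        // Partition by code, keeping each partition's start times sorted
        // so a binary search can narrow the date range quickly.
        var byCode = events
            .GroupBy(e => e.Code)
            .ToDictionary(
                g => g.Key,
                g => g.OrderBy(e => e.Start).Select(e => e.Start).ToList());

        // List<T>.BinarySearch returns the index of a match, or the
        // bitwise complement of the insertion point when not found.
        var starts = byCode["A"];
        int idx = starts.BinarySearch(new DateTime(2020, 2, 1));
        if (idx < 0) idx = ~idx; // first event starting on/after the date

        Console.WriteLine(idx);
    }
}
```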
You could partition based on date: store all the data in a single list sorted by start date, use a binary search to find the start date, then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important; I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity, and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events started between 12:00 and 12:01 might be in one bucket, etc. If you have a small number of codes and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance.
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.
I am using NHibernate in an MVC 2.0 application. Essentially, I want to keep track of the number of times each product shows up in a search result. For example, when somebody searches for a widget, the product named WidgetA will show up on the first page of the search results. At this point I will increment a field in the database to reflect that it appeared in a search result.
While this is straightforward, I am concerned that the inserts themselves will greatly slow down the search. I would like to batch my statements together, but it seems that coupling my inserts with my select may be counterproductive. Has anyone tried to accomplish this in NHibernate and, if so, are there any standard patterns for this kind of operation?
Interesting question!
Here's a possible solution:
var searchResults = session.CreateCriteria<Product>()
//your query parameters here
.List<Product>();
session.CreateQuery(@"update Product set SearchCount = SearchCount + 1
                      where Id in (:productIds)")
    .SetParameterList("productIds", searchResults.Select(p => p.Id).ToList())
    .ExecuteUpdate();
Of course you can do the search with Criteria, HQL, SQL, Linq, etc.
The update query is a single round trip for all the objects, so the performance impact should be minimal.