First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
public class Operator
{
string identifier; //i.e "7/1/2017 MN01 Day"
string name1;
string name2;
string id1;
string id2;
}
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary(HashTable):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and doing a lookup into the dictionary is an O(1) operation. This means that your time complexity becomes linear rather than exponential.
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be being wasted, packaging up every one of those millions of lines, squirting them through the network wires to your machine, only to have to discarding most of them, say, because they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be put to a stored procedure which is TRIGGERed by appropriate insert, update, and delete events ...)
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering writing using multiple threads. This SO post has some helpful info, and you could of course Google for more.
Related
Just out of curiosity, how exactly does SELECT * FROM table WHERE column = "something" works?
Is the underlying principle same as that of a for/foreach loop with an if condition like:
for (iterator)
{
if(condition)
//print results
}
If am dealing with , say 100 records, will there be any considerable performance difference between the 2 approaches in getting the desired data I want ?
SQL is a 4th generation language, which makes it very different from programming languages. Instead of telling the computer how to do something (loop through rows, compare columns), you tell the computer what to do (get the rows matching a condition).
The DBMS may or may not use a loop. It could as well use hashes and buckets, pre-sort a data set, whatever. It is free to choose.
On the technical side, you can provide an index in the datebase, so the DBMS can look up the keys to quickly to access the rows (like quickly finding names in a telephone book). This gives the DBMS an option how to acces the data, but it is still free to use a completely different approach, e.g. read the whole table sequentially.
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Applying a contains query in a loop extremely bad idea. It kills the performance and database. You should change your approach and I strongly suggest you to create Full Text Search indexes and perform query over it. You can retrieve the matched record texts with your query strings.
select t.Id, t.SampleColumn from containstable(Student,SampleColumn,'word or sampleword') C
inner join table1 t ON C.[KEY] = t.Id
Perform just one query, put the desired words which are searched by using operators (or, and etc.) and retrieve the matched texts. Then you can calculate TF-IDF scores in memory.
Also, still retrieving the texts from SQL Server into in memory might takes long to stream but it is the best option instead of apply N contains query in the loop.
I have a program that creates a list of objects from a file, and also creates a list of the same type of object, but with fewer/and some different properties, from the database, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
list from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyways, I see there's a Zip method. I was wondering if I could use that to say "if there's a match between the two lists, keep the data in list 1 but update the address id property of each object in list 1 to the address Id from list 2".
Is that possible?
The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the zip method, since it has less work involved) and if it works within the parameters of your requirements (time or any other parameter, memory footprint?), you can stop there.
Otherwise you have to off load it to the database. Mind you, you would have to hold the 1 million records from files and 1 million records from DB in memory at the same time if you want to use the zip method.
The problem with pushing everything to the database is, inserting that many records is resource (time, space etc) consuming. Moreover if you want to do that maybe everyday, it is going to be more difficult, resource wise.
Your question didn't say if this was going to be a one time thing or a daily event in a production environment. Even that is going to make a difference in which approach to choose.
To repeat, you would have to try different approaches to see which will work best for you based on your requirements: is this a one time thing? How much resources does the process have? How much time does it have? and possibly many more.
It kindof sounds also like a job for .Aggregate() aka
var aggreg = list1.Aggregate(otherListPrefilled, (acc,elemFrom1) =>
{
// some code to create the joined data, usig elemFrom1 to find
// and modify the correct element in otherListPrefilled
return acc;
});
normally I would use an empty otherListPrefilled, not sure how it performs on 100k data items though.
If its a onetime thing, its probably faster to put your file to a csv, import it in your database as temporary table and join the data in sql.
I have asp.net web api application. In database I have a big list (between 100.000 and 200.000) of pairs like id:name and this list could be changed quite rarely. I need to implement filtering like this /pair/filter?fragment=bla. It should return first 25 pairs where any word in name starts with word fragment. I see two approachs here: 1st approach is to load data into cache (HttpRuntimeCache, redis or smth like this) to increase loading time and filter in linq. But I think there will be problems with time required for serialiazing/deserialiazing. Another approach: for instance I have a pair 22:some title here so I need to provide separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with primary key on both columns and separate index on FRAGMENT column to make queries faster. Any offers and remarks are welcome.
UPD: now I've refreshed my mind. I don't want to query database because requests happen quite often. So now I see the best solution is
load entire list in memory
build trie structure which keeps hashset of values in each node
in case of one text fragment - just return the hashset from trie node, in case of few fragments - find all hashsets and get their intersection
You could try a full-text index on your current DB (if its supported) and the CONTAINS keyword like so
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" in the entire string, and also match the string "Monkeys blabla"
I dont really understand your question but if you want to query any table you can do so since you already have the queryString. You can try this out.
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesnt help. Try to to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.
I wonder if anyone else has asked a similar question.
Basically, I have a huge tree I'm building up in RAM using LINQ objects, and then I dump it all in one go using DataContext.SubmitChanges().
It works, but I can't find how to give the user a sort of visual indication of how far has the query progressed so far. If I could ultimately implement a sort of progress bar, that would be great, even if there is a minimal loss in performance.
Note that I have quite a large amount of rows to put into the DB, over 750,000 rows.
I haven't timed it exactly, but it does take a long while to put them in.
Edit: I thought I'd better give some indication of what I'm doing.
Basically, I'm building a suffix tree from the Lord of the Rings. Thus, there are a lot of Nodes, and certain Nodes have positions associated to them (Nodes that happen to be at the end of a suffix). I am building the Linq objects along these lines.
suffixTreeDB.NodeObjs.InsertOnSubmit(new NodeObj()
{
NodeID = 0,
ParentID = 0,
Path = "$"
});
After the suffix tree has been fully generated in RAM (which only takes a few seconds), I then call suffixTreeDB.submitChanges();
What I'm wondering is if there is any faster way of doing this. Thanks!
Edit 2: I've did a stopwatch, and apparently it takes precisely 6 minutes for the DB to be written.
I suggest you divide the calls you are doing, as they are sent in separate calls to the db anyway. This will also reduce the size of the transaction (which linq does when calling submitchanges).
If you divide them in 10 blocks of 75.000, you can provide a rough estimate on a 1/10 scale.
Update 1: After re-reading your post and your new comments, I think you should take a look at SqlBulkCopy instead. If you need to improve the time of the operation, that's the way to go. Check this related question/answer: What's the fastest way to bulk insert a lot of data in SQL Server (C# client)
I was able to get percentage progress for ctx.submitchanges() by using ctx.Log and ActionTextWriter
ctx.Log = new ActionTextWriter(s => {
if (s.StartsWith("INSERT INTO"))
insertsCount++;
ReportProgress(insertsCount);
});
more details are available at my blog post
http://epandzo.wordpress.com/2011/01/02/linq-to-sql-ctx-submitchanges-progress/
and stackoverflow question
LINQ to SQL SubmitChangess() progress
This isn't ideal, but you could create another thread that periodically queries the table you're populating to count the number of records that have been inserted. I'm not sure how/if this will work if you are running in a transaction though, since there could be locking/etc.
What I really think I need is a form of Bulk-Insert, however it appears that Linq doesn't support it.