Just out of curiosity, how exactly does SELECT * FROM table WHERE column = "something" work?
Is the underlying principle the same as that of a for/foreach loop with an if condition, like:
for (iterator)
{
if(condition)
//print results
}
If I am dealing with, say, 100 records, will there be any considerable performance difference between the two approaches in getting the data I want?
SQL is a 4th generation language, which makes it very different from procedural programming languages. Instead of telling the computer how to do something (loop through rows, compare columns), you tell the computer what you want (the rows matching a condition).
The DBMS may or may not use a loop. It could as well use hashes and buckets, pre-sort a data set, whatever. It is free to choose.
On the technical side, you can provide an index in the database, so the DBMS can look up the keys to quickly access the rows (like quickly finding names in a telephone book). This gives the DBMS an option for how to access the data, but it is still free to use a completely different approach, e.g. reading the whole table sequentially.
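A small sketch of that freedom, using Python's built-in sqlite3 module as a stand-in DBMS. The people table and the idx_people_name index are made up for illustration, and the exact wording of the plan differs between SQLite versions. The query text never changes; only the access path the engine picks does:

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the steps SQLite says it would use to run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable description.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [(f"person{i}",) for i in range(100)],
)

sql = "SELECT * FROM people WHERE name = ?"

# Without an index the engine has to read the whole table.
plan_without_index = query_plan(conn, sql, ("person42",))

# With an index it may look the key up instead (it is still free not to).
conn.execute("CREATE INDEX idx_people_name ON people (name)")
plan_with_index = query_plan(conn, sql, ("person42",))

matches = conn.execute(sql, ("person42",)).fetchall()
print(plan_without_index)
print(plan_with_index)
```

For 100 rows either plan is effectively instant, so the difference only starts to matter as the table grows.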
Related
I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
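A minimal sketch of the steps above, in Python rather than C# and with an in-memory list standing in for the database. The tokenize helper skips stemming, and idf is taken as log(N / df), which is only one of several common variants; both are assumptions, not part of the original answer:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into words; a real pipeline would also stem."""
    return re.findall(r"[a-z0-9']+", text.lower())

def document_frequencies(documents):
    """For each word, count how many documents contain it at least once."""
    df = Counter()
    for text in documents:
        # set() ensures a word is counted once per document.
        df.update(set(tokenize(text)))
    return df

def tf_idf(term, text, df, n_docs):
    """Raw term count times log(N / df); 0.0 if the term is absent."""
    tf = tokenize(text).count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return tf * math.log(n_docs / df[term])

# Hypothetical documents standing in for the rows of the globalsets table.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

df = document_frequencies(documents)  # one pass over all documents
score = tf_idf("cat", documents[0], df, len(documents))
print(df["cat"], score)
```

The point is that every document is read once, instead of once per word as in the original loop of Contains queries.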
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Applying a contains query in a loop is an extremely bad idea. It kills both the performance and the database. You should change your approach, and I strongly suggest that you create Full Text Search indexes and query against them. You can retrieve the texts of the records that match your query strings.
SELECT t.Id, t.SampleColumn
FROM CONTAINSTABLE(Student, SampleColumn, 'word OR sampleword') AS C
INNER JOIN Student AS t ON C.[KEY] = t.Id
Perform just one query: combine the words you are searching for with operators (OR, AND, etc.) and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Streaming the texts from SQL Server into memory might still take a long time, but it is a much better option than running N contains queries in a loop.
First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
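public class Operator
{
    // Fields are public so that code outside the class can read them.
    public string identifier; // e.g. "7/1/2017 MN01 Day"
    public string name1;
    public string name2;
    public string id1;
    public string id2;
}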
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary (a hash table):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and each lookup into the dictionary is O(1) on average. With n operators and m rows, the total work drops from O(n × m) for the nested loop to O(n + m), so your time complexity becomes linear rather than quadratic.
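The same idea sketched in Python, where a dict plays the role of the C# Dictionary. The shortened Operator class, the sample rows, and the enrich helper are hypothetical; the assumption is that each query row carries the identifier in its first column:

```python
from dataclasses import dataclass

@dataclass
class Operator:
    identifier: str  # unique within the list, e.g. "7/1/2017 MN01 Day"
    name1: str
    name2: str

# Hypothetical stand-in for the ~12,000 operators read from files.
operators = [
    Operator(f"7/1/2017 MN{i:02d} Day", f"first{i}", f"last{i}")
    for i in range(1, 6)
]

# Build the lookup once: O(n).
by_identifier = {op.identifier: op for op in operators}

def enrich(row, lookup):
    """Append the operator columns to a row, or blanks if there is no match."""
    op = lookup.get(row[0])  # average O(1), no scan over the list
    extra = [op.name1, op.name2] if op else ["", ""]
    return list(row) + extra

# Hypothetical query rows; the real report has millions of them.
rows = [("7/1/2017 MN01 Day", 10), ("7/1/2017 MN99 Day", 20)]
enriched = [enrich(row, by_identifier) for row in rows]
print(enriched)
```

Using a lookup that returns a default for missing keys avoids an exception when a row has no matching operator, which the plain indexer shown above would throw.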
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be wasted packaging up every one of those millions of lines and squirting them through the network wires to your machine, only to discard most of them because, say, they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be put to a stored procedure which is TRIGGERed by appropriate insert, update, and delete events ...)
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering writing using multiple threads. This SO post has some helpful info, and you could of course Google for more.
I have a table that holds constant values. Is it better to keep this table in my database (SQL) or to have an Enum in my code and delete the table?
My table has only 2 columns and at most 20 rows. These rows are fixed and get filled once, the first time I run the application.
I would suggest creating an Enum for your case. Since the values are fixed (and I am assuming that the table is not going to change very often), you can use an Enum. Keeping the values in a database table costs an unnecessary hit to the database and requires a database connection, both of which you can skip if you use an Enum.
A lot may also depend on what operations you are going to perform on your values. For example, it's tedious to work out from Enum values in code which distinct values are actually in use, whereas with the table approach it would be a simple select distinct. So you may have to look at your needs and the operations you will perform on these values.
As far as the performance is concerned you can look at: Enum Fields VS Varchar VS Int + Joined table: What is Faster?
As you can see, ENUM and VARCHAR results are almost the same, but join
query performance is 30% lower. Also note the times themselves –
traversing about same amount of rows full table scan performs about 25
times better than accessing rows via index (for the case when data
fits in memory!)
So, if you have an application and you need to have some table field
with a small set of possible values, I’d still suggest you to use
ENUM, but now we can see that performance hit may not be as large as
you expect. Though again a lot depends on your data and queries.
That depends on your needs.
You may want to translate the Enum values (if you are showing them in a GUI) and order a set of records based on the translated values. For example: imagine you have an Employees table with a Position column. If the record set is big and you want to sort by the translated Position column, then you have to keep the enum values and their translations in the database.
Otherwise KISS and have it in code. You will spare time on asking database for values.
It depends on the nature of those constants.
If they are low-level system constants that should never change (like pi=3.1415), then it is better to keep them only in code, or in a config file. Likewise, if performance is critical and you use them very often (on almost every request), it is better to keep them in code.
If they are constants (maybe business constants) that can change in the future, it is OK to put them in a table. Then you have more flexibility to change them (for instance from an admin panel).
It really depends on what you actually need.
With Enum
It is faster to access
It is bound to that one application. (You can share it by referencing the assembly from other projects, but that is not as clean as using the DB.)
You can use in switch statement
With an Enum you usually care about the name rather than the underlying value, and the value is limited to integers.
With DB
It is slower, because you have to make a connection and run a query.
The data can be shared widely.
You can set the value to be anything (any type any value).
So, if you will use it in only one application, an Enum is good enough. But if several applications are going to use it, then the DB would be the better option.
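To make the trade-off concrete, here is a rough sketch in Python with a hypothetical Position enum (the member names and numbers are invented). It shows the in-code lookup, plus how the same definition could seed a lookup table if other applications later need the values:

```python
from enum import IntEnum

class Position(IntEnum):
    # Hypothetical fixed values; the real table has at most 20 rows.
    DEVELOPER = 1
    MANAGER = 2
    DIRECTOR = 3

def position_name(stored_value):
    """Translate an integer stored in a row back to its name, with no query."""
    return Position(stored_value).name

# Rows that could be inserted once into a two-column lookup table
# if the values ever need to be shared through the database.
seed_rows = [(p.value, p.name) for p in Position]

print(position_name(2))
print(seed_rows)
```

Keeping the enum as the single definition and generating the table rows from it is one way to avoid the two copies drifting apart.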
I'm rather new to Parse and Cloud Code, and I'm having trouble writing a certain query script.
I have a table of Salespeople, who have two integers : dailySold and dailyQuota.
The dailySold is reset to 0 each day, and the dailyQuota is defined by upper management.
Now, I'd like to make queries that pull out groups of users, say, all users whose dailySold is below their dailyQuota. In MySQL it would just look like this:
select * from salespeople where dailySold < dailyQuota
But in Parse / CloudCode I have been unable to find something like this. Currently, I'm loading all the entries, and going through them one by one, populating a large array clientside. This feels like the absolutely wrong way of doing it.
And the query.WhereNotEqualTo() function (and its siblings) seem to only be able to compare against static values.
Does anyone know how to put together a query to optimize this? I need it to go through thousands of records, and it's often only 10-20 results I'm interested in. If nothing else, I'll have to write a Cloud Code function that iterates for me server-side, but I still feel like there is some function I should be able to use to make a leaner query.
You can't compare two columns in a query. You can only compare a key with a provided object. If the dailyQuota is set by upper management, I'm assuming this is the same for all salespeople, or for groups of people. I'd suggest first making a query for the daily quota and then either use
whereKey:matchesKey:inQuery
or just fetch the dailyQuota first and then use that value in the second query.
I would like to know which performs faster and which is preferable: a condition in a T-SQL query like this:
SELECT CASE color WHEN 'red' THEN 1 WHEN 'blue' THEN 2 ELSE 3 END
or performing the same switch in c# code after getting the value from the db?
switch (color)
{
    case "red":
        return 1;
    case "blue":
        return 2;
    default:
        return 3;
}
To add more detail: in my specific case we have a SQL query that returns 5,800+ records in some cases (depending on date filters and so on). We then concatenate those results in C# (one text line per record) and generate a txt file.
We have one server that is both the SQL Server and the web server (ASP.NET), and it takes 10 or more minutes to generate the file. So we were thinking about doing all the conditions on the SQL side, and maybe concatenating the fields into one at the SQL level too, instead of using a C# loop with StringBuilder.
Right now the SQL takes 1 second to execute and all the rest of the time is spent in the concatenating loop. There are 5,873 records with 11 fields each.
I think you are prematurely optimizing. "Make it work, make it right, then make it fast."
I know this statement (and others like it) bring about a lot of debate, but I think you should be putting this logic in the layer that is most appropriate, as in, where it has the least duplication, most re-usability, easiest to maintain, etc. If you have a performance problem at that point, you can make actual measurements in your environment with your own loads.
As an example, rather than some naked switch like this (that must be maintained), perhaps this should be in a lookup table in the DB and brought back with a join, or maybe it's better exposed as a property of some class based upon an enum. These might be better patterns to follow.
All things being equal on processors, your performance is probably going to depend more on workload and bandwidth.
Will you be saving any bandwidth by replacing the string with an integer or simply adding an integer column? Will there be any filtering on rows which result in significantly less data going across the wire?
If you have 2 web servers and 1 sql server, the processor work will be divided by doing it on the web server. If you have thousands of rich clients and 1 sql server, the processor work will be completely distributed by doing it on the clients.
That's not really possible to say, as there are too many unknown factors.
It depends for example on how much data you return from the database, how you handle the data returned, and whether the database server or the application server is at capacity.
The switch in itself would be faster than the select, but that could easily be outweighed by the fact that returning a number instead of a string from the database could be faster to handle in the code.
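For comparison, a rough sketch of doing the mapping and the line building in application code, in Python rather than C#. The record layout, the ";" separator, and the helper names are assumptions made for illustration; a dict lookup stands in for the switch, and a single join stands in for the StringBuilder loop:

```python
# Same mapping as the CASE/switch in the question: red -> 1, blue -> 2, else 3.
COLOR_CODES = {"red": 1, "blue": 2}

def color_code(color):
    return COLOR_CODES.get(color, 3)

def format_record(fields):
    """Join the fields of one record into a single text line."""
    return ";".join(str(field) for field in fields)

def build_report(records):
    """Map the colour column to its code and emit one line per record."""
    lines = []
    for color, *rest in records:
        lines.append(format_record([color_code(color), *rest]))
    # Join once at the end instead of growing one big string per record.
    return "\n".join(lines)

# Hypothetical records; the real query returns about 5,800 rows of 11 fields.
records = [("red", "a", 1), ("blue", "b", 2), ("green", "c", 3)]
report = build_report(records)
print(report)
```

Work of this size is normally far too small to account for minutes of runtime, so timing the individual steps of the existing loop is a reasonable first move before relocating the logic to SQL.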