I have a large set of conditions (I don't know them ahead of time) that I need count against a relatively small SQL Server table (< 10,000 rows). Each condition is in the form of a SQL where clause. Currently, I build a complete sql statement in the form of "select count(*) from some_table where " + Where_Clause; and I let SQL Server return the count to me. I do this in a loop to get all the counts I need for all the various conditions.
I'm looking for ways to speed this up when I have dozens or hundreds of counts I need to run. I don't know the statements ahead of time. Some will select the whole table, some may not select any rows. I've tried the following solutions:
Issuing the queries in parallel - saw minimal improvement even when not doing any locking
Writing them as complex counts on case statements so I can run multiple where clauses in one statement
Passing many queries together as UNIONs and getting the results back as one result set
None of these options have had a fantastic improvement in runtime (sometimes it would run more slowly) and with the added complexity, I don't feel any of them are worth it.
My question is: I can very quickly load the entire table into a DataTable object just by running one "select *" against it. If I had the whole table in memory, is there a way to run counts against it without going back and forth to SQL Server? I'm hoping to cut out the overhead of the: network, I/O, locking, etc.
The most complex where clause would be something like:
a=1 or b in (2,3) or c<4 or d like '%5' or substring(e,2,1)='z'
So it's not trivial and supporting as much as possible of T-SQL would be ideal, but I don't think DataTable's .Select() method supports this OR is it very fast. So given a table of data in memory, can I could rows using T-SQL syntax very fast (or in parallel)?
You could look into using LINQ to SQL, or possibly building the queries as Stored Procedures that you can execute.
Things to try
a) Make sure you've got the right indexes - check the query plan for the most complex query, and make sure there are no table scans in there.
b) Do multiple queries in the same request using something like this.
var cmd = new SqlCommand(#"SELECT TOP 10 * FROM Foo; SELECT TOP 10 * FROM Bar", con))
var ad = new SqlDataAdapter(cmd);
var ds = new DataSet();
ad.Fill(ds);
var foo = ds.Tables[0];
var bar = ds.Tables[1];
Related
I have a .net core API and I am trying to search 4.4 million records using .Contains(). This is obviously extremely slow - 26 seconds. I am just querying one column which is the name of the record. How is this problem generally solved when dealing with millions of records?
I have never worked with millions of records before so apart from the obvious altering of the .Select and .Take, I haven't tried anything too drastic. I have spent many hours on this though.
The other filters included in the .Where are only used when a user chooses to use them on the front end - The real problem is just searching by CompanyName.
Note; I am using .ToArray() when returning the results.
I have indexes in the database but cannot add one for CompanyName as it is Nvarchar(MAX).
I have also looked at the execution plan and it doesn't really show anything out of the ordinary.
query = _context.Companies.Where(
c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper())
&& c.CompanyNumber.StartsWith(
string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter)
? paging.SearchCriteria.companyNumberFilter.ToUpper()
: ""
)
&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter
&& c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter
)
.Select(x => new Company() {
CompanyName = x.CompanyName,
IncorporationDate = x.IncorporationDate,
CompanyNumber = x.CompanyNumber
}
)
.Take(10);
I expect the query to take around 1 / 2 seconds as when I execute a like query in ssms it take about 1 / 2 seconds.
Here is the code being submitted to DB:
Microsoft.EntityFrameworkCore.Database.Command: Information: Executing DbCommand [Parameters=[#__p_4='?' (DbType = Int32), #__ToUpper_0='?' (Size = 4000), #__p_1='?' (Size = 4000), #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2='?' (DbType = DateTime2), #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3='?' (DbType = DateTime2), #__p_5='?' (DbType = Int32)], CommandType='Text', CommandTimeout='30']
SELECT [t].[CompanyName], [t].[IncorporationDate], [t].[CompanyNumber]
FROM (
SELECT TOP(#__p_4) [c].[CompanyName], [c].[IncorporationDate], [c].[CompanyNumber], [c].[ID]
FROM [Companies] AS [c]
WHERE (((((#__ToUpper_0 = N'') AND #__ToUpper_0 IS NOT NULL) OR (CHARINDEX(#__ToUpper_0, [c].[CompanyName]) > 0)) AND (((#__p_1 = N'') AND #__p_1 IS NOT NULL) OR ([c].[CompanyNumber] IS NOT NULL AND (#__p_1 IS NOT NULL AND (([c].[CompanyNumber] LIKE [c].[CompanyNumber] + N'%') AND (((LEFT([c].[CompanyNumber], LEN(#__p_1)) = #__p_1) AND (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NOT NULL AND #__p_1 IS NOT NULL)) OR (LEFT([c].[CompanyNumber], LEN(#__p_1)) IS NULL AND #__p_1 IS NULL))))))) AND ([c].[IncorporationDate] > #__paging_SearchCriteria_companyIncorperatedGreaterFilter_2)) AND ([c].[IncorporationDate] < #__paging_SearchCriteria_companyIncorperatedLessThanFilter_3)
) AS [t]
ORDER BY [t].[IncorporationDate] DESC
OFFSET #__p_5 ROWS FETCH NEXT #__p_4 ROWS ONLY
SOLVED! With the help of both answers!
In the end as suggested, I tried full-text searching which was lightening fast but compromised accuracy of search results. In order to filter those results more accurately, I used .Contains on the query after applying the full-text search.
Here is the code that works. Hopefully this helps others.
//query = _context.Companies
//.Where(c => c.CompanyName.StartsWith(paging.SearchCriteria.companyNameFilter.ToUpper())
//&& c.CompanyNumber.StartsWith(string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter) ? paging.SearchCriteria.companyNumberFilter.ToUpper() : "")
//&& c.IncorporationDate > paging.SearchCriteria.companyIncorperatedGreaterFilter && c.IncorporationDate < paging.SearchCriteria.companyIncorperatedLessThanFilter)
//.Select(x => new Company() { CompanyName = x.CompanyName, IncorporationDate = x.IncorporationDate, CompanyNumber = x.CompanyNumber }).Take(10);
query = _context.Companies.Where(c => EF.Functions.FreeText(c.CompanyName, paging.SearchCriteria.companyNameFilter.ToUpper()));
query = query.Where(x => x.CompanyName.Contains(paging.SearchCriteria.companyNameFilter.ToUpper()));
(I temporarily excluded the other filters for simplicity)
When you run the query in SSMS, it's probably cached for subsequent calls. The original query probably took similar time as the EF query. That said, there are disadvantages to parametrised queries - while you can better reuse execution plans in a parametrised query, this also means that the execution plan isn't necessarily the best for the actual query you're trying to run right now.
For example, if you specify a CompanyNumber (which is easy to find in an index due to the StartsWith), you can filter the data first by CompanyNumber, thus making the name search trivial (I assume CompanyNumber is unique, so either you get 0 records, or you get the one you get by CompanyNumber). This might not be possible for the parametrised query, if its execution plan was optimized for looking up by name.
But in the end, Contains is a performance killer. It needs to read every single byte of data in your table's CompanyName field; which usually means it has to read every single row, and process much of its data. Searching by a substring looks deceptively simple, but always carries heavy penalties - its complexity is linear with respect to data size.
One option is to find a way to avoid the Contains. Users often ask for features they don't actually need. StartsWith might work just as well for most of the cases. But that's a business decision, of course.
Another option would be finding a way to reduce the query as much as possible before you apply the Contains filter - if you only allow searching for company name with other filters that narrow the search down, you can save the DB server a lot of work. This may be tricky, and can sometimes collide with the execution plan collission issue - you might want to add some way to avoid having the same execution plan for two queries that are wildly different; an easy way in EF would be to build the query up dynamically, rather than trying for one expression:
var query = _context.Companies;
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNameFilter))
query = query.Where(c => c.CompanyName.Contains(paging.SearchCriteria.companyNameFilter));
if (!string.IsNullOrEmpty(paging.SearchCriteria.companyNumberFilter))
query = query.Where(c => c.CompanyNumber.StartsWith(paging.SearchCriteria.companyNumberFilter));
// etc. for the rest of the query
This means that you actually have multiple parametrised queries that can each have their own execution plan, more in line with what the query actually does. For some extreme cases, it might also be worthwhile to completely prevent execution plan caching (this is often useful in reports).
The final option is using full-text search. You can find plenty of tutorials on how to make this work. This works essentially by splitting the unformatted string data to individual words or phrases, and indexing those. This means that a search for "hello world" doesn't necessarily return all the records that have "hello world" in the name, and it might also return records that have something else than "hello world" in the name. Think Google Search rather than Contains. This can often be a great method for human-written text, but it can be very confusing for the user who doesn't understand why you'd return search results that are completely different from what he was searching for. It also often doesn't work well if you need to do partial searches (e.g. searching for "Computer" might return "Computer, Inc.", but searching for "Comp" might return nothing).
The first option is likely the fastest, and closest to what the users would expect. It has the weakness that it can't search in the middle, though. The second option is the most correct, and might make your query substantially faster, especially in the most common cases with good statistics. The third option is probably about as fast as the first one, but can be tricky to setup properly, and can be confusing for your users. It does also provide you with more powerful ways to query the text data (e.g. using wildcards).
Welcome to stack overflow. It looks like you are suffering from at least one of these three problems in your code and your architecture.
First: indexing
You've mentioned that this cannot be indexed but there is support in SQL Server for full text indexing at the very least.
.Contains
This method isn't exactly suitable for the size of operation you're performing. If possible, perhaps as a last resort, consider moving to a parameterized query. For now, however, it looks like you want to keep your business logic in the .net code rather than spreading it into SQL and that's a worthy plan.
c.IncorporationDate
Date comparison can be a little costly in SQL Server. Once you're dealing with so many millions of rows you might get a lot of performance benefit from correctly partitioned tables and indexes.
Consider whether or not these rows can change at all. Something named IncoporationDate sounds like it definitely should not be changed. I suspect you may want to leverage that after reading the rest of these.
I'm coding an application with Entity Framework in which I rely heavily on user defined functions.
I have a question about the best way (most optimized way) of how I limit and page my result sets. Basically I am wondering if these two options are the same or one is prefered performance wise.
Option 1.
//C#
var result1 = _DB.fn_GetData().OrderBy(x => Id).Skip(page *100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
Option 2.
//C#
var result2 = _DB.fn_GetData(page = 0, size = 100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
ORDER BY Id
OFFSET (size * page) ROWS FETCH NEXT size ROWS ONLY
To me these seem to be producing about the same result, but maybe I am missing some key aspect.
You'll have to be aware when your LINQ statement is AsEnumerable and when it is AsQueryable. As long as your statement is an IQueryable<...> the software will try to translate it into SQL and let your database do the query. Once it really has lost the IQueryable, and has become an implementation of an IEnumerable, the data has been brought to local memory, and all further LINQ statements will be performed by your process, not by the database.
If you use your debugger, you will see that the return value of your fn_getData returns an IEnumerable. This means that the result of fn_GetData is brought to local memory and your OrderBy etc is performed by your process.
Usually it is much more efficient to only move the records that you will use to local memory. Besides: do not fetch the complete records, but only the properties that you plan to use. So in this case I guess you'll have to create an extended version of fn_GetData that returns only the values you plan to use
I suggest second option because SQL Server can more faster then C# methods.
In your first option, you take all of the records in table and loop through. But second option, SQL Server do it for you and you get what you want.
You should apply the limiting and where clauses (it depends on table indexes) in the database as far as possible. For first example;
var result1 = _DB.fn_GetData().OrderBy(x => Id).Skip(page *100).Take(100).ToList();
// SQL in fn_GetData
SELECT * FROM [Data].[Table]
The whole table is retrieved from database into in-memory and it kills the performance and reliability. I strongly don't suggest it. You should consider to put some limitations to filter records on the database. So, the second option is better approach in this case.
I am using System.Data.SQLite and SQLiteDataReader in my C# project. I am facing performance issues when getting the results of a query with attached databases.
Here is an example of a query to search text into two databases :
ATTACH "db2.db" as db2;
SELECT MainRecord.RecordID,
((LENGTH(MainRecord.Value) - LENGTH(REPLACE(UPPER(MainRecord.Value), UPPER("FirstValueToSearch"), ""))) / 18) AS "FirstResultNumber",
((LENGTH(DB2Record.Value) - LENGTH(REPLACE(UPPER(DB2Record.Value), UPPER("SecondValueToSearch"), ""))) / 19) AS "SecondResultNumber"
FROM main.Record MainRecord
JOIN db2.Record DB2Record ON DB2Record.RecordID BETWEEN (MainRecord.PositionMin) AND (MainRecord.PositionMax)
WHERE FirstResultNumber > 0 AND SecondResultNumber > 0;
DETACH db2;
When executing this query with SQLiteStudio or SQLiteAdmin, this works fine, I am getting the results in a few seconds (the Record table can contain hundreds of thousands of records, the query returns 36000 records).
When executing this query in my C# project, the execution takes a few seconds too, but it takes hours to run through all the results.
Here is my code :
// Attach databases
SQLiteDataReader data = null;
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
command.CommandText = "SELECT...";
data = command.ExecuteReader();
}
if (data.HasRows)
{
while (data.Read())
{
// Do nothing, just iterate all results
}
}
data.Close();
// Detach databases
Calling the Read method of the SQLiteDataReader once can take more than 10 seconds ! I guess this is because the SQLiteDataReader is lazy loaded (and so it doesn't return the whole rowset before reading the results), am I right ?
EDIT 1 :
I don't know if this has something to do with lazy loading, like I said initially, but all I want is being able to get ALL the results as soon as the query is ended. Isn't it possible ? In my opinion, this is really strange that it takes hours to get results of a query executed in few seconds...
EDIT 2 :
I just added a COUNT(*) in my select query in order to see if I could get the total number of results at the first data.Read(), just to be sure that it was only the iteration of the results that was taking so long. And I was wrong : this new request executes in few seconds in SQLiteAdmin / SQLiteStudio, but takes hours to execute in my C# project. Any idea why the same query is so much longer to execute in my C# project?
EDIT 3 :
Thanks to EXPLAIN QUERY PLAN, I noticed that there was a slight difference in the execution plan for the same query between SQLiteAdmin / SQLiteStudio and my C# project. In the second case, it is using an AUTOMATIC PARTIAL COVERING INDEX on DB2Record instead of using the primary key index. Is there a way to ignore / disable the use of automatic partial covering indexes? I know it is used to speed up the queries, but in my case, it's rather the opposite that happens...
Thank you.
Besides finding matching records, it seems that you're also counting the number of times the strings matched. The result of this count is also used in the WHERE clause.
You want the number of matches, but the number of matches does not matter in the WHERE clause - you could try change the WHERE clause to:
WHERE MainRecord.Value LIKE '%FirstValueToSearch%' AND DB2Record.Value LIKE '%SecondValueToSearch%'
It might not result in any difference though - especially if there's no index on the Value columns - but worth a shot. Indexes on text columns require alot of space, so I wouldn't blindly recommend that.
If you haven't done so yet, place an index on the DB2's RecordID column.
You can use EXPLAIN QUERY PLAN SELECT ... to make SQLite spit out what it does to try to make your query perform, the output of that might help diagnose the problem.
Are you sure you use the same version of sqlite in System.Data.SQLite, SQLiteStudio and SQLiteAdmin ?
You can have huge differences.
One more typical reason why SQL query can take different amount of time when executed with ADO.NET and from native utility (like SQLiteAdmin) are command parameters used in CommandText (it is not clear from your code whether parameters are used or not). Depending on ADO.NET provider implementation the following identical CommandText values:
SELECT * FROM sometable WHERE somefield = ? // assume parameter is '2'
and
SELECT * FROM sometable WHERE somefield='2'
may lead to absolutely different execution plan and query performance.
Another suggestion: you may disable journal (specifying "Journal Mode=off;" in the connection string) and synchronous mode ("Synchronous=off;") as these options also may affect query performance in some cases.
I have a HUGE query which I need to optimize. Before my coding it was like
SELECT [...] WHERE foo = 'var' [...]
executed 2000 times for 2000 different values of foo. We all know how slow it is. I managed to join all that different queries in
SELECT [...] WHERE foo = 'var' OR foo = 'var2' OR [...]
Of course, there are 2000 chained comparisons. The result is a huge query, executed a few seconds faster than before but not enough. I suppose the StringBuilder I am using takes a while in building the query, so the time earned by saving 1999 queries is wasted in this:
StringBuilder query = new StringBuilder();
foreach (string var in vars)
query.Append("foo = '").Append(var).Append("' OR ");
query.Remove(query.Length - 4) // for removing the last " OR "
So I would like to know if I could use some workaround for optimize the building of that string, maybe joining different values in the comparison with some SQL trick like
SELECT [...] WHERE foo = ('var' OR 'var2' OR [...])
so I can save some Append operations. Of course, any different idea trying to avoid that huge query at all will be more than welcome.
#Armaggedon,
For any decent DBMS, the IN () operator should correspond to a number of x OR y corresponding comparisons. About your concern about StringBuild.Append, its implementation is very efficient and you shouldn't notice any delay regarding this amount of data, if you have a few MB to spare for its temporary internal buffer. That said, I don't think your performance problem is related to these issues.
For database tuning it's always a far shot to propose solutions without the "full picture", but I think your problem might be related to compiling such a huge dynamic SQL statement. -- parsing and optimizing SQL statements can consume lots of processor time and it should be avoided.
Maybe you could improve the response time by moving your domain into an auxiliary indexed table. Or by moving the various checks over the same char column to a text search using INSTR functions:
-- 1. using domain table
SELECT myColumn FROM myTable WHERE foo IN (SELECT myValue FROM myDomain);
-- 2. using INSTR function
SELECT myColumn FROM myTable WHERE INSTR('allValues', foo, 1, 1) > 0;
Why not use the IN-operator as of IN-operator on W3school? It lets you combine your values in a much shorter way. You can also store the values in a temporary table as mentioned in this post to bypass the limit of 1000 rows on Oracle
It's been a while since I danced the Oracle dance, but I seem to remember a concept of "Bind Variables" - typically used for bulk insertions... I'm wondering if you could express the list of values as an array, and use that with IN...
Have to say - this is just an idea - I don't have time to research it further for you...
I'd like to increase performance of very simple select and update queries of .NET & MSSQL 2k8.
My queries always select or update a single row. The DB tables have indexes on the columns I query on.
My test .NET code looks like this:
public static MyData GetMyData(int accountID, string symbol)
{
using (var cnn = new SqlConnection(connectionString))
{
cnn.Open();
var cmd = new SqlCommand("MyData_Get", cnn);
cmd.CommandType = CommandType.StoredProcedure;
cmd.Parameters.Add(CreateInputParam("#AccountID", SqlDbType.Int, accountID));
cmd.Parameters.Add(CreateInputParam("#Symbol", SqlDbType.VarChar, symbol));
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
var MyData = new MyData();
MyData.ID = (int)reader["ID"];
MyData.A = (int)reader["A"];
MyData.B = reader["B"].ToString();
MyData.C = (int)reader["C"];
MyData.D = Convert.ToDouble(reader["D"]);
MyData.E = Convert.ToDouble(reader["E"]);
MyData.F = Convert.ToDouble(reader["F"]);
return MyData;
}
}
}
and the according stored procedure looks like this:
PROCEDURE [dbo].[MyData_Get]
#AccountID int,
#Symbol varchar(25)
AS
BEGIN
SET NOCOUNT ON;
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] AS p WHERE p.AccountID = #AccountID AND p.Symbol = #Symbol
END
What I'm seeing if I run GetMyData in a loop, querying MyData objects, I'm not exceeding about ~310 transactions/sec. I was hoping to achieve well over a 1000 transactions/sec.
On the SQL Server side, not really sure what I can improve for such a simple query.
ANTS profiler shows me that on the .NET side, as expected, the bottleneck is cnn.Open and cnn.ExecuteReader, however I have no idea how I could significantly improve my .NET code?
I've seen benchmarks though where people seem to easily achieve 10s of thousands transactions/sec.
Any advice on how I can significantly improve the performance for this scenario would be greatly appreciated!
Thanks,
Tom
EDIT:
Per MrLink's recommendation, adding "TOP 1" to the SELECT query improved performance to about 585 transactions/sec from 310
EDIT 2:
Arash N suggested to have the select query "WITH(NOLOCK)" and that dramatically improved the performance! I'm now seeing around 2500 transactions/sec
EDIT 3:
Another slight optimization that I just did on the .NET side helped me to gain another 150 transactions/sec. Changing while(reader.Read()) to if(reader.Read()) surprisingly made quite a difference. On avg. I'm now seeing 2719 transactions/sec
Try using WITH(NOLOCK) in your SELECT statement to increase the performance. This would select the row without locking it.
SELECT p.ID, p.A, p.B, p.C, p.D, p.E, p.F FROM [MyData] WITH(NOLOCK) AS p WHERE p.AccountID = #AccountID AND p.Symbol = #Symbol
Some things to consider.
First, your not closing the server connection. (cnn.Close();) Eventually, it will get closed by the garbage collector. But until that happens, your creating a brand new connection to the database every time rather than collecting one from the connection pool.
Second, Do you have an index set in Sql Server covering the AccountID and Symbol columns?
Third, While accountId being and int is nice and fast. The Symbol column being varchar(25) is always going to be much slower. Can you change this to an int flag?
Make sure your database connections are actually pooling. If you are seeing a bottleneck in cnn.Open, there would seem to be a good chance they are not getting pooled.
I was hoping to achieve well over a 1000 transactions/sec [when running GetMyData in a loop]
What you are asking for is for GetMyData to run in less than 1ms - this is just pointless optimisation! At the bare minimum this method involves a round trip to the database server (possibly involving network access) - you wouldn't be able to make this method much faster if your query was SELECT 1.
If you have a genuine requirement to make more requests per second then the answer is either to use multiple threads or to buy a faster PC.
There is absolutely nothing wrong with your code - I'm not sure where you have seen people managing 10,000+ transactions per second, but I'm sure this must have involved multiple concurrent clients accessing the same database server rather than a single thread managing to execute queries in less than a 10th of a ms!
Is your method called frequently? Could you batch your requests so you can open your connection, create your parameters, get the result and reuse them for several queries before closing the whole thing up again?
If the data is not frequently invalidated (updated) I would implement a cache layer. This is one of the most effective ways (if used correctly) to gain performance.
You could use output parameters instead of a select since there's always a single row.
You could also create the SqlCommand ahead of time and re-use it. If you are executing a lot of queries in a single thread you can keep executing it on the same connection. If not, you could create a pool of them or do cmdTemplate.Clone() and set the Connection.
Try re-using the Command and Prepare'ing it first.
I can't say that it will definitely help, but it seems worth a try.
In no particular order...
Have you (or your DBAs) examined the execution plan your stored procedure is getting? Has SQL Server cached a bogus execution plan (either due to oddball parameters or old stats).
How often are statistics updated on the database?
Do you use temp tables in your stored procedure? If so, are they create upfront. If not, you'll be doing a lot of recompiles as creating a temp table invalidates the execution plan.
Are you using connection pooling? Opening/Closing a SQL server connection is an expensive operation.
Is your table clustered on accountID and Symbol?
Finally...
Is there a reason you're hitting this table by account and symbol rather than, say, just retrieving all the data for an entire account in one fell swoop?