Natural sort for SQL Server - c#

Similar Questions
Similar questions have been asked before, but they always involve data with specific qualities that allow a more targeted "split it up and just sort by this part" approach. That doesn't work when you don't know the structure of the data in the column - or even which column, frankly. In other words, those aren't a generic, "natural" sort order - something roughly equivalent to SELECT * FROM [parts] ORDER BY [part_category] DESC, [part_number] NATURAL DESC
My Situation
I have a DataView in C# with a Sort parameter for specifying the ORDER BY that would be used by ADO, and a requirement to sort by a column using a 'natural' sort algorithm. In theory I could do just about anything, from creating a separate column to sort by (based on the column I'd like sorted 'naturally') to skipping the SQL sort and sorting the result set in code afterwards. I'm looking for the best balance of flexibility, efficiency, preparation effort, and maintainability. I would benefit somewhat from being able to sort such data either after retrieval (in C#) or completely within a stored procedure.
In my mind, and according to customer statements so far, 'natural' sort order means treating upper and lower case letters equivalently, and considering the magnitude of a number rather than the ASCII values of its digits (that is, x90 comes before x100). Jeff Atwood had a pretty decent discussion of this, but it didn't address SQL sorting. That said, these are my thoughts:
Incorporating the magnitude awareness while also retaining the ability to sort alpha characters ASCII-betically may also come in handy
Non-alphanumeric characters would probably have to be sorted ASCII-betically regardless
Decimal-point awareness might be more effort than it's worth: most of the time, periods and commas in alphanumeric fields are merely punctuation/separators, and they only denote fractional portions when the field represents a float
My Question
What is a reasonably flexible, reasonably generic, reasonably efficient approach to implementing a natural sort algorithm for SQL? Weighing the pros and cons, which of the following is the best approach? Is there another option?
Is there a native SQL way to ORDER BY [field] NATURAL DESC or something?
Pure SQL function to create a 'sort equivalent' - could be used to create some sort of second, possibly indexed, 'sort value' column, or called from a stored procedure, or specified in an ORDER BY clause - but how do you write it efficiently? (Loops? Is there a set-based solution at all?)
CLR SQL function - the usability benefits of a pure SQL function, but written in a procedural language like C# (the algorithm should be no problem, but can it be made faster than a pure SQL [set-based?] implementation?). It could also be referenced and used from C# if efficient enough.
Avoid SQL Server - since parsing an arbitrary number of numbers amid all sorts of other characters is really best suited to looping or recursion, and T-SQL is not well suited to looping or recursion (though technically supported, all I see is 'don't use loops!' and 'CTEs are even worse!')
Some sort of comparator in SQL(?) - SQL doesn't seem to lend itself to that kind of sorting, and I don't see a way to specify a comparator to use - so I guess this won't work...
I have values at least as varied as the following:
100s455t
200s400
d399487
S0000005.2
d400400
d99222
cg9876
D550-9-1
CL2009-3-27
f2g099
f2g100
f2g1000
f2g999
cg 8837
99s1000f
These should be sorted as follows:
99s1000f
100s455t
200s400
cg9876
cg 8837
CL2009-3-27
D550-9-1
d99222
d399487
d400400
f2g099
f2g100
f2g999
f2g1000
S0000005.2

Create a sort column. That way you can keep all the usual mechanisms in place that you use today to sort; you can index that column, for example.
Split the string into parts. You need to pad number parts with zeroes to the maximum possible number length.
For example, CL2009-3 would become CL|000002009|-|000000003.
This way the usual case-insensitive SQL Server collation sort behavior will produce the right order.
Doing a natural sort dynamically prevents indexing, requires the entire data set to move into the app for each query, and is resource intensive.
Instead, simply update the sort column whenever you update the base column.
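A minimal C# sketch of that key generation, as it might run app-side or inside a CLR function; the 10-digit pad width is an assumption, and it skips the '|' separators shown in the example above:

using System.Text;
using System.Text.RegularExpressions;

static string NaturalSortKey(string value, int padWidth = 10)
{
    // Split into alternating runs of digits and non-digits, then left-pad
    // each numeric run with zeroes so that e.g. 'f2g99' sorts before 'f2g100'.
    var sb = new StringBuilder();
    foreach (Match m in Regex.Matches(value ?? "", @"\d+|\D+"))
    {
        if (char.IsDigit(m.Value[0]))
            sb.Append(m.Value.PadLeft(padWidth, '0'));
        else
            sb.Append(m.Value);
    }
    return sb.ToString();
}

// NaturalSortKey("CL2009-3") -> "CL0000002009-0000000003"

Stored alongside the base column and indexed, this key sorts correctly under the usual case-insensitive collation.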

OK, here is something that is almost what you are looking for. The only case it can't deal with is when there are some characters, then a space, and then numbers (cg 8837 vs cg9876). It would be good if in the future you could post the DDL and sample data so we can work with them.
with Something (SomeValue) as(
select '100s455t' union all
select '200s400' union all
select 'd399487' union all
select 'S0000005.2' union all
select 'd400400' union all
select 'd99222' union all
select 'cg9876' union all
select 'D550-9-1' union all
select 'CL2009-3-27' union all
select 'f2g099' union all
select 'f2g100' union all
select 'f2g1000' union all
select 'f2g999' union all
select 'cg 8837' union all
select '99s1000f'
)
select *
from Something
order by
cast(
case when patindex('%[A-Za-z]%', SomeValue) = 1 then '99999999999'
     when patindex('%[A-Za-z]%', SomeValue) = 0 then SomeValue
     else substring(SomeValue, 1, patindex('%[A-Za-z]%', SomeValue) - 1)
end as bigint),
SomeValue

I would recommend staying away from SQL Server for this. While technically you can implement everything using T-SQL or a CLR function, SQL Server remains a single, non-scalable unit of the infrastructure. Using its CPU resources for heavy computing will inevitably impact the performance of the system in general. And in the end, SQL Server would perform the sort using almost exactly the same algorithm you would use to sort the array on the application side, i.e. looking at each item in the array and comparing it to the others until it finds the appropriate position.
Of course, I am assuming that if you do try to implement this type of sort on the SQL Server side, you will copy the data into a temporary table before performing the sort, to avoid data locks etc.
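If the sort does move to the application side, a natural-order comparer could look like the minimal sketch below - case-insensitive, comparing digit runs by magnitude. This is an illustration written for this answer (null handling and digit runs longer than 18 characters are left out):

using System.Collections.Generic;

public class NaturalComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            if (char.IsDigit(x[i]) && char.IsDigit(y[j]))
            {
                // Consume each full digit run and compare by numeric magnitude.
                int si = i, sj = j;
                while (i < x.Length && char.IsDigit(x[i])) i++;
                while (j < y.Length && char.IsDigit(y[j])) j++;
                long nx = long.Parse(x.Substring(si, i - si));
                long ny = long.Parse(y.Substring(sj, j - sj));
                if (nx != ny) return nx.CompareTo(ny);
            }
            else
            {
                // Everything else compares ASCII-betically, ignoring case.
                int c = char.ToUpperInvariant(x[i]).CompareTo(char.ToUpperInvariant(y[j]));
                if (c != 0) return c;
                i++; j++;
            }
        }
        return (x.Length - i).CompareTo(y.Length - j);
    }
}

Usage: myList.Sort(new NaturalComparer()); or results.OrderBy(s => s, new NaturalComparer()) over the materialized result set.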

Related

Comparison between select then filtering and select distinct in linq, c#

Which is faster?
Getting a list of some value (say, of string type) from a LINQ query and then filtering duplicates in C#, or directly selecting distinct values in the LINQ query itself?
Say we have
N rows if we keep duplicates
and R rows if we filter,
where N >> R (there are many duplicates).
Basically I am asking: in general, which is faster and better programming -
selecting all N rows in LINQ, converting to a list, and then filtering down to R rows,
or directly selecting the R distinct rows in LINQ and converting that to a list?
Note:
In SQL, the time taken to get R rows is roughly 2 times what it takes to get N in my case! But a generic answer is welcome.
I assume that when you say LINQ, you mean LINQ to SQL.
The rule of thumb when connecting to a database is to fetch only what you need; so if you have a good querying strategy, filtering in LINQ to SQL can save a lot of wasted work.
If the column you're filtering on happens to have a full-text index, you hit the jackpot.
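For example, a sketch assuming a LINQ to SQL data context db with a Customers table (the names are hypothetical), contrasting the filter running on the server with filtering in memory:

// Distinct is composed into the SQL: SELECT DISTINCT [City] FROM [Customers]
var distinctOnServer = (from c in db.Customers
                        select c.City).Distinct().ToList();

// All N rows are pulled across the wire; duplicates are removed in memory.
var distinctInMemory = (from c in db.Customers
                        select c.City).ToList().Distinct().ToList();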
Look, your question is complex. What I mean:
1) Better programming is, most of the time, to use a ready-made built-in function.
2) Based on my experience, Distinct works faster, in MS SQL and in C#.
3) LINQ is somewhat lazy about filtering, especially if you have many items in your list. Distinct is optimized by Microsoft's developers.
Note: a similar question may be useful.
Result: try to use the built-in functions your platform provides; there is plenty of information on the net, and you can skip paragraphs of coding by calling a ready-made function.
Hope it helped.

Joining values in a WHERE comparison in Oracle

I have a HUGE query which I need to optimize. Before my changes it was like
SELECT [...] WHERE foo = 'var' [...]
executed 2000 times for 2000 different values of foo. We all know how slow that is. I managed to join all those different queries into
SELECT [...] WHERE foo = 'var' OR foo = 'var2' OR [...]
Of course, there are 2000 chained comparisons. The result is a huge query that executes a few seconds faster than before, but not fast enough. I suppose the StringBuilder I am using takes a while building the query, so the time saved by avoiding 1999 round trips is wasted here:
StringBuilder query = new StringBuilder();
foreach (string var in vars)
    query.Append("foo = '").Append(var).Append("' OR ");
query.Remove(query.Length - 4, 4); // remove the trailing " OR "
So I would like to know if there is some workaround to optimize the building of that string, maybe joining the different values in the comparison with some SQL trick like
SELECT [...] WHERE foo = ('var' OR 'var2' OR [...])
so I can save some Append operations. Of course, any different idea that avoids the huge query altogether would be more than welcome.
@Armaggedon,
For any decent DBMS, the IN () operator should correspond to a number of x OR y comparisons. As for your concern about StringBuilder.Append, its implementation is very efficient and you shouldn't notice any delay with this amount of data, provided you have a few MB to spare for its temporary internal buffer. That said, I don't think your performance problem is related to these issues.
For database tuning it's always a long shot to propose solutions without the full picture, but I think your problem might be related to compiling such a huge dynamic SQL statement: parsing and optimizing SQL statements can consume a lot of processor time and should be avoided.
Maybe you could improve the response time by moving your domain into an auxiliary indexed table, or by replacing the many checks over the same char column with a text search using the INSTR function:
-- 1. using domain table
SELECT myColumn FROM myTable WHERE foo IN (SELECT myValue FROM myDomain);
-- 2. using INSTR function
SELECT myColumn FROM myTable WHERE INSTR('allValues', foo, 1, 1) > 0;
Why not use the IN operator, as in the IN-operator reference on W3Schools? It lets you combine your values in a much shorter way. You can also store the values in a temporary table, as mentioned in this post, to bypass Oracle's limit of 1000 items in an IN list.
It's been a while since I danced the Oracle dance, but I seem to remember a concept of "bind variables", typically used for bulk insertions... I'm wondering if you could express the list of values as an array and use that with IN.
Have to say, this is just an idea - I don't have time to research it further for you...
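To make the bind-variable idea concrete, here is a hedged C# sketch using plain ADO.NET; the table and column names are hypothetical, the parameter prefix (':' for Oracle, '@' for SQL Server) depends on the provider, and Oracle's 1000-item IN limit would still require splitting 2000 values into chunks:

using System.Collections.Generic;
using System.Data;

static IDbCommand BuildInQuery(IDbConnection conn, IList<string> values)
{
    // One bind variable per value: WHERE foo IN (:p0, :p1, ...)
    IDbCommand cmd = conn.CreateCommand();
    var names = new List<string>();
    for (int i = 0; i < values.Count; i++)
    {
        string name = ":p" + i;
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = values[i];
        cmd.Parameters.Add(p);
        names.Add(name);
    }
    cmd.CommandText = "SELECT * FROM myTable WHERE foo IN ("
                      + string.Join(", ", names) + ")";
    return cmd;
}

Because the statement text stays the same across runs, the database can reuse the parsed plan instead of recompiling a 2000-literal query each time.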

LINQ c# efficiency

I need to write a query pulling distinct values from columns defined by a user, for any given data set. There could be millions of rows, so the statements must be as efficient as possible. Below is the code I have.
What is the big-O order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
               orderby r.Field<double>(_varName)
               select r.Field<double>(_varName);
IEnumerable<double> result = MyValues.Distinct();
I can't speak much to the AsEnumerable() call or the field conversions, but for the LINQ side of things, the orderby is a stable quicksort and should be O(n log n). If I had to guess, everything but the orderby should be O(n), so overall you're still just O(n log n).
Update: the LINQ Distinct() call should also be O(n).
So altogether, the big-O for this thing is still O(K·n log n), where K is some constant.
Is there a more efficient way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable, instead of sorting in memory afterwards.
from comments
I actually use MyDistinct.Distinct()
If you want distinct _varName values and you cannot do it all in the select query on the DBMS side (which would be the most efficient way), you should use Distinct before OrderBy. The order matters here:
you would need to order all million rows before you start to filter out the duplicates; if you use Distinct first, you only need to order what remains.
var values = from r in MyDataTable.AsEnumerable()
             select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
                                                  .OrderBy(d => d);
I asked a related question recently, which Eric Lippert answered with a good explanation of when the order matters and when it doesn't:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but you can also see that it doesn't really matter much in practice, since comparing doubles is trivial for a CPU:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
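For reference, a timing harness along these lines could look like the sketch below; the data set (a million doubles drawn from 1000 distinct values) is an assumption, not the original poster's data:

using System;
using System.Diagnostics;
using System.Linq;

class OrderDistinctDemo
{
    static void Main()
    {
        // A million doubles with heavy duplication.
        var rng = new Random(42);
        double[] data = Enumerable.Range(0, 1000000)
                                  .Select(i => (double)rng.Next(0, 1000))
                                  .ToArray();

        var sw = Stopwatch.StartNew();
        var orderedThenDistinct = data.OrderBy(d => d).Distinct().ToList();
        sw.Stop();
        Console.WriteLine("Time for first orderby then distinct: " + sw.Elapsed);

        sw.Restart();
        var distinctThenOrdered = data.Distinct().OrderBy(d => d).ToList();
        sw.Stop();
        Console.WriteLine("Time for first distinct then orderby: " + sw.Elapsed);
    }
}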
Your LINQ query above is fine if you want all million records and you have enough memory on a 64-bit OS.
If you look at the underlying command, the query would be translated to
Select <_varname> from MyDataTable order by <_varname>
and this is as good as it gets when run from a database IDE or command line.
To give you a short answer regarding performance:
Put in a WHERE clause if you can (on columns that are indexed).
Ensure that the user can only choose columns (_varName) that are indexed. Imagine the DB trying to sort a million records on an unindexed column; that is evidently slow, but it's LINQ that would get the bad press.
Ensure that (if possible) initialization of the MyDataTable is done with only the records of value (again, based on a WHERE clause).
Profile your underlying query.
If possible, create stored procs (debatable); you can create an entity model which includes stored procs as well.
It may be fast today, but as the table space grows, if your data is not properly indexed, that's where things get slower (even with a good LINQ expression).
Hope this helps.

Entity Framework and self-referencing table

I need a database that starts with a table called "User" that references itself, forming a very deep graph of related objects. It will need to be like the left side of the image below (disregard the right side).
I will also need to traverse this graph both upwards and downwards in order to calculate percentages, totals, etc. In other words, I'll need to traverse the entire graph in some cases.
Is this possible and/or how is it done? Can traversing be done right in the LINQ statement? Examples?
EDIT:
I'm basically trying to create a network marketing scenario and need to calculate each person's earnings.
Examples:
To be able to calculate the total sales for each user under a specific user (so each user would have some sort of revenue coming in).
Calculate the commission at a certain level of the tree (e.g. if the top person had 3 people below them, each selling a product for $1, and the commission was 50%, then there would be $1.50).
If I queried the image above (on the left) for "B" I should get "B,H,I,J,N,O"
Hopefully that helps :S
You can't traverse the whole tree using just LINQ in a way that would translate to a single SQL query (or a constant number of them). You can do it either with one query per level, or with one query that is limited to a specific number of levels (but such a query would get really big with many levels).
In T-SQL (I assume you're using MS SQL Server), you can do this using recursive common table expressions. It should be possible to put that into a stored procedure that you can call from LINQ to get the information you actually want.
To sum up, your options are:
Don't use LINQ, just SQL with a recursive CTE
Use a recursive CTE in a stored procedure called from LINQ
Use LINQ, creating one query for each level (see the sketch below)
Use an ugly LINQ query limited to just a few levels
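A minimal sketch of the one-query-per-level option; the User entity, its Id/ParentId properties, and the IQueryable source are assumptions for illustration, not the poster's schema:

using System.Collections.Generic;
using System.Linq;

public class User
{
    public int Id { get; set; }
    public int? ParentId { get; set; }   // null for the root
}

public static class TreeQueries
{
    // Breadth-first traversal: one round trip to the database per tree level.
    public static List<int> GetSubtreeIds(IQueryable<User> users, int rootId)
    {
        var result = new List<int> { rootId };  // include the root itself
        var level = new List<int> { rootId };
        while (level.Count > 0)
        {
            var current = level; // snapshot for the query below
            level = users.Where(u => u.ParentId.HasValue
                                  && current.Contains(u.ParentId.Value))
                         .Select(u => u.Id)
                         .ToList();              // Contains translates to IN (...)
            result.AddRange(level);
        }
        return result;
    }
}

Querying for "B" this way would collect B and its descendants level by level, and totals or commission percentages can then be aggregated over those ids.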
I know this is late, but if you look at directed graph algorithms, you can bypass the recursion issues. Check out these two articles:
http://www.sitepoint.com/hierarchical-data-database/
http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o

List<T> FirstOrDefault() bad performance - is dictionary possible in this case?

I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations), and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I'm inserting (using SqlBulkCopy) a million rows.
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
            where z.CODE == lookupCode &&
                  z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to look up the id based on the code.
But now, searching on a combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a list of items per lookup code - Dictionary<string, List<Code>> (assuming that the lookup code is a string and the objects are of type Code).
Then, when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
            where z.VALIDFROM <= lookupDate &&
                  z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, it gets added to the List<Code> in the dict corresponding to its lookupCode (and if that list doesn't exist yet, create it) - a sketch of that bookkeeping follows.
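A minimal sketch of building that dictionary from the cached list, assuming Code exposes the CODE property used in the question:

var dict = new Dictionary<string, List<Code>>();
foreach (Code code in l_z)
{
    List<Code> list;
    if (!dict.TryGetValue(code.CODE, out list))
    {
        list = new List<Code>();        // first code seen for this key
        dict[code.CODE] = list;
    }
    list.Add(code);
}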
A simple improvement would be to use...
//in initialization somewhere
ILookup<string, Code> l_z_lookup = l_z.ToLookup(z => z.CODE);
//your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could go further with a more complex, smarter data structure that stores the dates in sorted order and uses a binary search to find the id, but this may be sufficient. Also, you mention SqlBulkCopy: if you're dealing with a database anyway, perhaps you can execute the query on the database side and simply create an appropriate index covering the CODE, VALIDUNTIL and VALIDFROM columns.
I generally prefer a Lookup over a Dictionary of Lists, since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice, but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTime values or some primitive (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory, or should you be putting all these rows into SQL and letting it be queried there? It's really good at that.
But let's stick with what you asked about: in-memory searching.
When searching takes too long there is only one solution: search fewer things. You do this by partitioning your data in a way that lets you rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria: a code and a date range. Here are some ideas...
You could partition based on code - i.e. Dictionary<string, List<Code>>. If you have many evenly distributed codes, your lists will each be about N/M in size (where N = total item count and M = number of distinct codes). So a million items with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: each List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this of course has a trade-off in the time spent building the collection). This should provide very quick lookups - see the sketch after this list.
You could partition based on date: store all the data in a single list sorted by start date, use a binary search to find the start date, then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know; you need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that started between 12:00 and 12:01 might be in one bucket, etc. If you have a very small number of events and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance.
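As a sketch of the sorted-list-plus-binary-search idea from the first bullet (the Code fields come from the question; the non-overlap assumption is mine):

// Codes for one lookup code, sorted ascending by VALIDFROM.
static int? FindId(List<Code> sortedByValidFrom, DateTime lookupDate)
{
    // Binary search for the last entry with VALIDFROM <= lookupDate.
    int lo = 0, hi = sortedByValidFrom.Count - 1, pos = -1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (sortedByValidFrom[mid].VALIDFROM <= lookupDate)
        {
            pos = mid;
            lo = mid + 1;
        }
        else
        {
            hi = mid - 1;
        }
    }
    // Assuming validity periods for a given code don't overlap, only this
    // candidate can contain lookupDate.
    if (pos >= 0 && sortedByValidFrom[pos].VALIDUNTIL >= lookupDate)
        return sortedByValidFrom[pos].id;
    return null;
}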
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
This sounds to me like a situation where all of it could happen on the database in a single statement. Then you could use indexing to keep the query fast and avoid pushing data over the wire to and from your database.
