How can I query on the Azure Table Storage for duplicate values?
Suppose the table contains a column named 'LastName' and several rows share the same last name. How can I query for those duplicates without knowing in advance which specific string the LastName value holds?
Edit
An example would be:
PartitionKey  RowKey  LastName
1             1       Smith
1             2       Smith
1             3       Smith
1             4       MILLER
1             5       WILLIAMS
In this case, I'd like to get all records where Smith is the last name, because they are duplicates.
As a general rule of thumb: queries that do not include the PartitionKey or RowKey will not perform very well.
I assume your LastName column is neither the PartitionKey nor the RowKey. In that case you only have bad options. The way Table Storage works is that entities of a partition are stored close together, so the fastest queries are those that include the PartitionKey of the entities you are looking for. Since you cannot build indexes on any other columns, all queries that do not include the RowKey will be partition scans, i.e. they will not perform well at all, because every row of the partition must be examined.
In your case, since you are looking for all entities that contain duplicate values, your best bet will likely be to just query everything and look for the duplicates locally.
I don't think you can write a Table Storage query that would return these results directly. As far as I know, there is no such thing as select … where count(select duplicates) > 1 – and even if there were, such a query would be very slow. Unless we're talking about huge amounts of data, simply querying everything and filtering locally will likely perform better.
As I said, you only have bad options. That's because Table Storage wasn't designed for queries like this. Unlike SQL tables, Table Storage tables should be designed with their queries in mind, i.e. you should know how you're going to query the table before you design it.
Your second option would be to migrate to Azure SQL, where such queries are no problem at all. Azure SQL is very different from Table Storage though, so it's questionable whether it fits your requirements.
Edit: One way to optimize the query-everything approach is to return only the LastName of each entity (plus PartitionKey/RowKey or whatever else you need). That way the amount of data being transferred can be reduced by quite a bit. Here's an article about query projection that explains this technique in detail.
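For illustration, a minimal sketch of that approach with the Azure.Data.Tables SDK might look like this; the table name, connection string and single-partition filter are assumptions based on the example above, not part of the question:

using System;
using System.Linq;
using Azure.Data.Tables;

class DuplicateFinder
{
    static void Main()
    {
        // Assumed table name and the local storage emulator; adjust to your account.
        var table = new TableClient("UseDevelopmentStorage=true", "People");

        // Project only the columns we need, to keep the transferred payload small.
        var entities = table.Query<TableEntity>(
            filter: "PartitionKey eq '1'",
            select: new[] { "PartitionKey", "RowKey", "LastName" });

        // Group locally and keep only the last names that occur more than once.
        var duplicates = entities
            .GroupBy(e => e.GetString("LastName"))
            .Where(g => g.Count() > 1)
            .SelectMany(g => g);

        foreach (var e in duplicates)
            Console.WriteLine($"{e.PartitionKey} {e.RowKey} {e.GetString("LastName")}");
    }
}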
The query to fetch all records should be
PartitionKey eq 'Your PartitionKey' and LastName eq 'Smith'
unless I'm missing something.
You would also need to take the table continuation token into consideration as well. See this thread for more details: Copy all Rows to another Table in Azure Table Storage. As @enzi mentioned, there's no select * from table where ... functionality available in table storage.
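For completeness, a sketch of the continuation-token loop with the classic table SDK (Microsoft.Azure.Cosmos.Table); the partition filter is an assumption taken from the example above:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Table;

static class TableScan
{
    public static async Task<List<DynamicTableEntity>> QueryAllAsync(CloudTable table)
    {
        var query = new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "1"));

        var results = new List<DynamicTableEntity>();
        TableContinuationToken token = null;
        do
        {
            // Each segment returns at most 1,000 entities plus a continuation token.
            var segment = await table.ExecuteQuerySegmentedAsync(query, token);
            results.AddRange(segment.Results);
            token = segment.ContinuationToken;
        } while (token != null);

        return results;
    }
}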
I want to store a large number of strings in my SQLite database. I want them to always come back in the same order in which I added them when I read them. I know I could give them an autoincrementing primary key and sort by that, but since there can be up to 100,000 strings this is a performance concern. Besides, the order should NEVER change or be sorted in any different way.
short example:
sql insert "hghtzdz12g"
sql insert "jut65bdt"
sql insert "lkk7676nbgt"
sql select * should ALWAYS give this order: {"hghtzdz12g", "jut65bdt", "lkk7676nbgt"}
Any ideas how to achieve this?
Thanks
In a query like
SELECT * FROM MyTable ORDER BY MyColumn
the database does not need to sort the results if the column is indexed, because it can just scan through the index entries in order.
The rowid (or whatever you call the autoincrementing column) is effectively the table's built-in index: the rows are stored in rowid order, so ordering by it is even more efficient than using a separate index.
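As an illustration (not from the original answer), a small sketch with Microsoft.Data.Sqlite, the .NET SQLite provider; the table and column names are made up:

using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=strings.db");
conn.Open();

using (var create = conn.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS Strings (Value TEXT NOT NULL)";
    create.ExecuteNonQuery();
}

foreach (var s in new[] { "hghtzdz12g", "jut65bdt", "lkk7676nbgt" })
{
    using var insert = conn.CreateCommand();
    insert.CommandText = "INSERT INTO Strings (Value) VALUES ($v)";
    insert.Parameters.AddWithValue("$v", s);
    insert.ExecuteNonQuery();
}

using var select = conn.CreateCommand();
select.CommandText = "SELECT Value FROM Strings ORDER BY rowid"; // no separate index needed
using var reader = select.ExecuteReader();
while (reader.Read())
    System.Console.WriteLine(reader.GetString(0));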
If you are sure you will never need anything but exactly this array in exactly this order, you can cheat the database and put the whole thing into a single blob field.
But then you should ask yourself why you chose a database in the first place.
The correct database solution is indeed a table using a key that you can sort by.
If this performance is not enough, you can have a look here for performance hints.
If you need ultra-fast performance, maybe a database is not the best tool for the job. Databases are used primarily for their ACID guarantees; raw speed is a secondary objective, as it is with most software.
I need to update a bit field in a table and set this field to true for a specific list of Ids in that table.
The Ids are passed in from an external process.
I guess in pure SQL the most efficient way would be to create a temp table and populate it with the Ids, then join the main table with this and set the bit field accordingly.
I could create a SPROC to take the Ids, but there could be 200 - 300,000 rows involved that need this flag set, so it's probably not the most efficient way. Using the IN statement has limitations with regard to the amount of data that can be passed, and performance.
How can I achieve the above using the Entity Framework?
I guess it's possible to create a SPROC to create a temp table, but this would not exist from the model's perspective.
Is there a way to dynamically add entities at run time? [Or is this approach just going to cause headaches?]
I'm making the assumption here, though, that populating a temp table with 300,000 rows and doing a join would be quicker than calling a SPROC 300,000 times :)
[The Ids are Guids]
Is there another approach that I should consider.
For data volumes like 300k rows, I would forget EF. I would do this by having a table such as:
BatchId RowId
Where RowId is the PK of the row we want to update, and BatchId just refers to this "run" of 300k rows (to allow multiple at once etc).
I would generate a new BatchId (this could be anything unique; a Guid leaps to mind), and use SqlBulkCopy to insert the records into this table, i.e.
100034 17
100034 22
...
100034 134556
I would then use a single sproc to do the join and update (and delete the batch from the table).
SqlBulkCopy is the fastest way of getting this volume of data to the server; you won't drown in round-trips. EF is object-oriented: nice for lots of scenarios, but not this one.
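For what it's worth, a rough sketch of that flow might look like this; the BatchRows table and the ApplyBatchFlag sproc are assumed names, not something specified in the answer:

using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

static class BatchFlagger
{
    public static void FlagRows(string connectionString, IEnumerable<Guid> rowIds)
    {
        var batchId = Guid.NewGuid();

        // Stage the ids in a DataTable that mirrors BatchRows(BatchId, RowId).
        var staging = new DataTable();
        staging.Columns.Add("BatchId", typeof(Guid));
        staging.Columns.Add("RowId", typeof(Guid));
        foreach (var id in rowIds)
            staging.Rows.Add(batchId, id);

        using var conn = new SqlConnection(connectionString);
        conn.Open();

        // SqlBulkCopy pushes all rows to the server in very few round-trips.
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.BatchRows" })
            bulk.WriteToServer(staging);

        // Single sproc call: join BatchRows to the main table, set the flag, delete the batch.
        using var cmd = new SqlCommand("dbo.ApplyBatchFlag", conn)
        {
            CommandType = CommandType.StoredProcedure
        };
        cmd.Parameters.AddWithValue("@BatchId", batchId);
        cmd.ExecuteNonQuery();
    }
}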
I'm marking Marc's response as the answer, but I'd just like to give a little detail on how we implemented the requirement.
Marc's response helped greatly in the formulation of our solution.
We had to work within an aim/guideline of staying within the Entity Framework and not using SPROCs, and although our solution may not suit others, it has worked for us.
We created an Item table in the database with BatchId [uniqueidentifier] and ItemId [varchar] columns.
This table was added to the EF model so we did not use temporary tables.
On upload, this table is populated with the Ids (inserts are quick enough using EF, we find).
We then use context.ExecuteStoreCommand to run the SQL that joins the Item table to the main table and updates the bit field in the main table for the records that exist for the batch Id created specifically for that session.
We finally clear this table for that batchId.
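For illustration, the ExecuteStoreCommand step might look roughly like the following; context is the EF ObjectContext, and the MainTable, Id and Flag names are placeholders rather than the real schema:

// Placeholders: dbo.MainTable, Id and Flag stand in for the real table and columns.
var batchId = Guid.NewGuid();

// ... insert the uploaded Ids into dbo.Item via EF, each row tagged with batchId ...

// One set-based update joining the Item table to the main table.
context.ExecuteStoreCommand(
    @"UPDATE m SET m.Flag = 1
        FROM dbo.MainTable m
        JOIN dbo.Item i ON i.ItemId = m.Id
       WHERE i.BatchId = {0}", batchId);

// Clear the staging rows for this batch.
context.ExecuteStoreCommand("DELETE FROM dbo.Item WHERE BatchId = {0}", batchId);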
We have the performance while keeping within our no-SPROC goal [which not all of us agree with :) but it's a democracy].
Our exact requirements are a little more complex, but as far as getting good update performance out of the Entity Framework under our specific restrictions goes, it works fine.
Liam
I have a table having around 1 million records. Table structure is shown below. The UID column is a primary key and uniqueidentifier type.
Table_A (contains a million records)
UID Name
-----------------------------------------------------------
E8CDD244-B8E4-4807-B04D-FE6FDB71F995 DummyRecord
I also have a function called fn_Split('Guid_1,Guid_2,Guid_3,....,Guid_n') which accepts a list of comma-separated guids and gives back a table variable containing those guids.
From my application code I am passing a SQL query to get the new guids [keys that exist in the application code but not in the database table]:
var sb = new StringBuilder();
sb
.Append(" SELECT NewKey ")
.AppendFormat(" FROM fn_Split ('{0}') ", keyList)
.Append(" EXCEPT ")
.Append("SELECT UID from Table_A");
The first time this command is executed, it times out on quite a few occasions. I am trying to figure out what would be a better approach here to avoid such timeouts and/or improve the performance of this.
Thanks.
Firstly, add an index on table_a.uid if there isn't one, but I assume there is.
Some alternate queries to try:
select newkey
from fn_split(blah) f
left outer join table_a a
on f.newkey = a.uid
where a.uid is null
select newkey
from fn_split(blah)
where newkey not in (select uid
from table_a)
select newkey
from fn_split(blah) f
where not exists(select uid
from table_a a
where f.newkey = a.uid)
There is plenty of info around here as to why you should not use a GUID for your primary key, especially if it is unordered. That would be the first thing to fix. As far as your query goes, you might try what Paul or Tim suggested, but as far as I know EXCEPT and NOT IN will use the same execution plan, though the OUTER JOIN may be more efficient in some cases.
If you're using MS SQL 2008 then you can/should use table-valued parameters. Essentially you'd send in your guids in the form of a DataTable to your stored procedure.
Then inside your stored procedure you can use the parameters as a "table" and do a join or EXCEPT or what have you to get your results.
This method is faster than using a function to split because functions in MS SQL server are really slow.
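As a hedged sketch, the client side might look like this; the dbo.GuidList table type and the dbo.GetNewKeys procedure are things you would have to create yourself, not existing objects:

using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

static class NewKeyFinder
{
    public static DataTable GetNewKeys(string connectionString, IEnumerable<Guid> candidateKeys)
    {
        // Must match the server-side type, e.g.:
        // CREATE TYPE dbo.GuidList AS TABLE (Id uniqueidentifier PRIMARY KEY)
        var tvp = new DataTable();
        tvp.Columns.Add("Id", typeof(Guid));
        foreach (var key in candidateKeys)
            tvp.Rows.Add(key);

        using var conn = new SqlConnection(connectionString);
        using var cmd = new SqlCommand("dbo.GetNewKeys", conn)
        {
            CommandType = CommandType.StoredProcedure
        };
        var p = cmd.Parameters.AddWithValue("@Keys", tvp);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.GuidList";

        // The sproc could simply do: SELECT Id FROM @Keys EXCEPT SELECT UID FROM Table_A
        var result = new DataTable();
        conn.Open();
        using var reader = cmd.ExecuteReader();
        result.Load(reader);
        return result;
    }
}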
But I guess the time is being taken up by the massive disk I/O this query requires. Since you're searching on your UID column, and since the values are "random", no index is going to help here. The engine will have to resort to a table scan, which means you'll need some serious disk I/O performance to get the results in good time.
Using the GUID data type in an index is not recommended. However, it may not make a difference in your case. But let me ask you this:
The guids that you send in from your app, are they just a random list of guids, or is there some business relationship or entity relationship here? It's possible that your data model is not correct for what you are trying to do. So how do you determine which guids you have to search on?
However, for argument's sake, let's assume your guids are just a random selection; then there is no index that is really being used, since the database engine will have to do a table scan to pick out each of the required guids/records from the million records you have. In a situation like this, the only way to speed things up is at the physical database level, that is, how your data is physically stored on the hard drives etc.
For example:
Having faster drives will improve performance
If this kind of query is being fired over and over, then more memory on the box will help because the engine can cache the data in memory and won't need to do physical reads
If you partition your table then the engine can parallelize the seek operation and get you results faster.
If your table contains a lot of other fields that you don't always need, then splitting the table into two tables, where table1 contains the guid and the bare minimum set of fields and table2 contains the rest, will speed up the query quite a bit because the disk I/O demands are lower
Lots of other things to look at here
Also note that when you send in ad-hoc SQL statements that don't have parameters, the engine has to create a plan each time you execute them. In this case it's not a big deal, but keep in mind that each plan will be cached in memory, pushing out data that might otherwise have been cached.
Lastly, you can always increase the CommandTimeout property in this case to get past the timeout issues.
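For example, a small illustrative sketch (the connection string and key list are placeholders):

using Microsoft.Data.SqlClient;

var connectionString = "Server=.;Database=MyDb;Integrated Security=true";   // placeholder
var keyList = "guid1,guid2,guid3";                                          // placeholder

// A parameterized command gets a reusable cached plan, and CommandTimeout
// (in seconds) can be raised above the 30-second default.
using var conn = new SqlConnection(connectionString);
using var cmd = new SqlCommand(
    "SELECT NewKey FROM fn_Split(@keyList) EXCEPT SELECT UID FROM Table_A", conn)
{
    CommandTimeout = 120   // default is 30 seconds
};
cmd.Parameters.AddWithValue("@keyList", keyList);

conn.Open();
using var reader = cmd.ExecuteReader();
while (reader.Read())
    System.Console.WriteLine(reader.GetValue(0));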
How much time does it take now, and what kind of improvement are you looking or hoping to get?
If I understand your question correctly, in your client code you have a comma-delimited string of (string) GUIDs. These GUIDs are usable by the client only if they don't already exist in TableA. Could you invoke an SP which creates a temporary table on the server containing the potentially usable GUIDs, and then do this:
select guid from #myTempTable as temp
where not exists
(
select uid from TABLEA where uid = temp.guid
)
You could pass your string of GUIDs to the SP; it would populate the temp table using your function; and then return an ADO.NET DataTable to the client. This should be very easy to test before you even bother to write the SP.
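A possible client-side sketch; dbo.GetUsableGuids and its @GuidList parameter are invented names for illustration:

using System.Data;
using Microsoft.Data.SqlClient;

static class UsableGuids
{
    public static DataTable Fetch(string connectionString, string commaDelimitedGuids)
    {
        using var conn = new SqlConnection(connectionString);
        using var cmd = new SqlCommand("dbo.GetUsableGuids", conn)
        {
            CommandType = CommandType.StoredProcedure
        };
        cmd.Parameters.AddWithValue("@GuidList", commaDelimitedGuids);

        var table = new DataTable();
        using var adapter = new SqlDataAdapter(cmd);
        adapter.Fill(table);   // Fill opens and closes the connection as needed
        return table;
    }
}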
I am questioning what you do with this information.
If you insert the keys into this table afterwards, you could simply try to insert them right away; that's much faster and more solid in a multi-user environment than querying first and inserting later:
create procedure TryToInsert @GUID uniqueidentifier, @Name varchar(n) as
begin try
    insert into Table_A (UID, Name)
    values (@GUID, @Name);
    return 0;
end try
begin catch
    return 1;
end catch;
In all cases you can split the KeyList at the client to get faster results - and you could query the keys that are not valid:
select UID
from Table_A
where UID in ('new guid','new guid',...);
If the GUIDs are random, you should use newsequentialid() with your clustered primary key:
create table Table_A (
UID uniqueidentifier default newsequentialid() primary key,
Name varchar(n) not null
);
With this you can insert and query your newly inserted data in one step:
insert into Table_A (Name)
output inserted.*
values (@Name);
... just my two cents
In any case, are not GUIDs intrinsically engineered to be, for all intents and purposes, unique? (i.e. universally unique -- doesn't matter where generated). I wouldn't even bother to do the test beforehand; just insert your row with the GUID PK and if the insert should fail, discard the GUID. But it should not fail, unless these are not truly GUIDs.
http://en.wikipedia.org/wiki/GUID
http://msdn.microsoft.com/en-us/library/ms190215.aspx
It seems you are doing a lot of unnecessary work, but perhaps I don't grasp your application requirement.
Using the ADO.NET MySQL Connector, what is a good way to fetch lots of records (1000+) by primary key?
I have a table with just a few small columns, and a VARCHAR(128) primary key. Currently it has about 100k entries, but this will become more in the future.
In the beginning, I thought I would use the SQL IN statement:
SELECT * FROM `table` WHERE `id` IN ('key1', 'key2', [...], 'key1000')
But with this the query could become very long, and I would also have to manually escape quote characters in the keys, etc.
Now I use a MySQL MEMORY table (tempid INT, id VARCHAR(128)) to first upload all the keys with prepared INSERT statements. Then I make a join to select all the existing keys, after which I clean up the mess in the memory table.
Is there a better way to do this?
Note: OK, maybe it's not the best idea to have a string as the primary key, but the question would be the same if the VARCHAR column were a normal index.
Temporary table: So far it seems the solution is to put the data into a temporary table, and then JOIN, which is basically what I currently do (see above).
I've dealt with a similar situation in a payroll system where the user needed to generate reports based on a selection of employees (e.g. employees X, Y, Z... or employees that work in certain offices). I built a filter window with all the employees and all the attributes that could be considered as filter criteria, and had that window save the selected employee ids in a filter table in the database. I did this because:
Generating SELECT queries with a dynamically generated IN filter is just ugly and highly impractical.
I could join that table in all my queries that needed to use the filter window.
Might not be the best solution out there but served, and still serves me very well.
If your primary keys follow some pattern, you can select where key like 'abc%'.
If you want to get out 1000 at a time, in some kind of sequence, you may want to have another int column in your data table with a clustered index. This would do the same job as your current memory table - allow you to select by int range.
What is the nature of the primary key? Is it anything meaningful?
If you're concerned about performance, I definitely wouldn't recommend an IN clause. It's much better to do an INNER JOIN if you can.
You can either first insert all the values into a temporary table and join to that, or do a sub-select. Best is to actually profile the changes and figure out what works best for you.
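As a rough sketch, the temporary-table variant with Connector/NET (MySql.Data.MySqlClient) could look like this; only the matching ids are returned and the names are illustrative:

using System.Collections.Generic;
using MySql.Data.MySqlClient;

static class KeyFetcher
{
    // Returns the subset of the given keys that already exist in `table`.
    public static List<string> FetchExisting(string connectionString, IEnumerable<string> keys)
    {
        var found = new List<string>();
        using var conn = new MySqlConnection(connectionString);
        conn.Open();

        using (var create = new MySqlCommand(
            "CREATE TEMPORARY TABLE wanted (id VARCHAR(128) PRIMARY KEY)", conn))
            create.ExecuteNonQuery();

        // Parameterized inserts avoid manual quoting/escaping of the keys.
        using (var insert = new MySqlCommand("INSERT INTO wanted (id) VALUES (@id)", conn))
        {
            var p = insert.Parameters.Add("@id", MySqlDbType.VarChar);
            foreach (var key in keys)
            {
                p.Value = key;
                insert.ExecuteNonQuery();
            }
        }

        using var select = new MySqlCommand(
            "SELECT t.id FROM `table` t INNER JOIN wanted w ON w.id = t.id", conn);
        using var reader = select.ExecuteReader();
        while (reader.Read())
            found.Add(reader.GetString(0));

        return found;   // the temporary table is dropped when the session ends
    }
}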
Why can't you consider using a Table valued parameter to push the keys in the form of a DataTable and fetch the matching records back?
Or
Or you could simply write a private method that concatenates all the key codes from a provided collection into a single string and pass that string to the query.
I think it may solve your problem.
I have a table that has 5 columns: AcctId (int), Address1 (varchar), Address2 (varchar), Person1 (varchar), Person2 (varchar) . I'm generating random data to insert into this table via a C# console application. I've tried doing this random data insert via SQL-Server and decided it was not a good solution -- SQL is not good at random on an each-row basis. Generating the random data -- 975k rows of it -- takes a minimal amount of time. It's in a List of custom objects.
I need to take this random data and update many rows in the database with the new random data. I tried updating the rows one at a time, very slow because of the repeated searching of the List object in code. So I think the best approach is to put all the randomized data into a table in the database, then update all the other tables that use this data. I.e. UPDATE t SET t.Address1=d.Address1 FROM Table1 t INNER JOIN RandomizedData d ON d.AcctId = t.Acct_ID. The database is very un-normalized so this Acct data is sprinkled all over the place. I've got no control of the normalization.
So, having decided to insert all of the randomized data into a single table, I set out to create insert scripts:
USE TheDatabase
Insert tmp_RandomizedData
SELECT 1,'4392 EIGHTH AVE','','JENNIFER CARTER','BARBARA CARTER' UNION ALL
SELECT 2,'2168 MAIN ST','HNGR F','DANIEL HERNANDEZ','SUSAN MARTIN'
-- etc., another 98 times...
-- FYI, this is not real data!
I'm building this INSERT script in batches of 100. It's taking on average 175 ms to run each insert. Does this seem like a long time? It's going to take about 35 mins to run the whole insert.
The table doesn't have a primary key or any indexes. I was planning on adding those after all the data is inserted (thinking that that would be faster).
Is there a better way to do this?
The SQLBulkCopy class in .net can blast records in pretty quickly. I used this to transfer data from an i-Series database to SQL Tables very rapidly.
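As an illustrative sketch (not the poster's actual code), loading the generated list into a DataTable and pushing it with SqlBulkCopy might look like this; the RandomRecord class and destination table name are assumptions:

using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

class RandomRecord
{
    public int AcctId;
    public string Address1, Address2, Person1, Person2;
}

static class RandomLoader
{
    public static void BulkInsert(string connectionString, List<RandomRecord> records)
    {
        // Mirror the destination table's columns in a DataTable.
        var table = new DataTable();
        table.Columns.Add("AcctId", typeof(int));
        table.Columns.Add("Address1", typeof(string));
        table.Columns.Add("Address2", typeof(string));
        table.Columns.Add("Person1", typeof(string));
        table.Columns.Add("Person2", typeof(string));

        foreach (var r in records)
            table.Rows.Add(r.AcctId, r.Address1, r.Address2, r.Person1, r.Person2);

        using var bulk = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = "dbo.tmp_RandomizedData",
            BatchSize = 50000          // see the batch-size advice below
        };
        bulk.WriteToServer(table);
    }
}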
Use BCP. You can use this article as a guide. It's for VB6 but the gist is exactly the same. The trick is to use the BULK INSERT command.
... Reading more of your question, you might also want to look at Red Gate's SQL Data Generator; it generates tons of data really, really fast.
Use larger batches, 50,000 to 75,000 rows. On SQL 2000 on hardware from 2000, the sweet spot for inserts was 50,000 rows. This was on a live production database, with indexes, during the day on a very large table.
Small batch sizes are better for inserts into a highly active table and where there is a high deadlock risk. Is anyone else using this table while you're doing inserts?
Is this a one time import? Let it run over night.
Finally, INSERT statements executed via ADO.NET aren't really an optimal ETL solution. SSIS, DTS (or any other ETL solution, such as Talend) would be more appropriate for heavy-duty data moving. On the other hand, if all you have is a hammer...