I have an import process in .NET that takes a CSV file and writes to two user tables in the database. It's a fairly complex process: it takes several minutes to process about five hundred users at a time.
As part of that process, I need to generate a random string that is unique to each user, since it gives them access to some promotions. I can't use GUIDs because it has to be a simple string that the user can type into a splash screen.
What I need to know is: what is the best way to check that each newly generated key doesn't repeat any key already created for the thousands of pre-existing users?
I don't want to run an extra query for each inserted row to ask whether the string is already there.
Thanks, I hope I was clear enough.
How many users are there compared to the number of possible unique keys?
If there are many more possible keys than there are users then I'd just add a unique constraint on the key column, and generate a new key if you hit a constraint violation.
If you're likely to get a lot of collisions with the above technique then there are a few options open to you:
Pre-generate sets of unique keys, store them in a table somewhere and take one when needed.
Add some uniqueness to the keys: do the users have a unique id that could be incorporated?
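The "unique constraint plus retry" idea can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `PromoKeys`, `NewKey`, `InsertWithRetry`, the alphabet, the key length and the attempt limit are all assumptions made for the example, and the `tryInsert` delegate stands in for the real INSERT against a column that has a unique constraint.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class PromoKeys
{
    // Hypothetical alphabet: upper-case letters and digits, leaving out
    // characters that are easy to confuse when typed (0/O, 1/I/L).
    private const string Alphabet = "ABCDEFGHJKMNPQRSTUVWXYZ23456789";

    // Builds one random key of the requested length.
    public static string NewKey(int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
        return new string(chars);
    }

    // tryInsert represents the INSERT: it should return false when the
    // database rejects the key because of the unique constraint.
    public static string InsertWithRetry(Func<string, bool> tryInsert,
                                         int length = 8, int maxAttempts = 10)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string key = NewKey(length);
            if (tryInsert(key))
                return key;
        }
        throw new InvalidOperationException("Could not find a free key; the key space may be too small.");
    }
}
```

For a quick try-out, a `HashSet<string>` can play the role of the table, since `HashSet<string>.Add` returns false for a duplicate: `string key = PromoKeys.InsertWithRetry(taken.Add);`. Against SQL Server, the delegate would typically catch a `SqlException` whose `Number` is 2627 or 2601 (duplicate key) and return false in that case.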
You can store the strings as keys in a Dictionary. Adding a key that is already present throws an exception, so you can catch that error and generate a new string for the user.
Hope this helps.
One simple way could be to use part of an MD5 for the customer + the customerid encoded in hex.
The customerid part ensures uniqueness and the MD5 part ensures that you cannot guess another user's key.
Depending on how short a string you can handle, you can use just the first 6-10 chars of the MD5. If you need to shorten it further, re-encode it using something more compact than hex, like base-64. If you cannot handle mixed case, make your own selection: A-Z + 0-9 and maybe some special chars, so that the alphabet size is a power of 2 that is easy to map from hex.
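A rough sketch of that scheme, under some assumptions: the answer doesn't say what exactly gets hashed, so this example hashes the customer id together with a hypothetical application secret (without something the user can't know, the MD5 part could be recomputed by anyone). `CustomerKeys`, `Build`, `secret` and `hashChars` are names invented for the illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CustomerKeys
{
    // Returns the first hashChars hex characters of an MD5 digest,
    // followed by the customer id in hex. Because the hash part has a
    // fixed length, two different ids can never yield the same key.
    public static string Build(int customerId, string secret, int hashChars = 6)
    {
        byte[] digest;
        using (var md5 = MD5.Create())
        {
            // Assumption: the hash input is "secret:customerId".
            digest = md5.ComputeHash(Encoding.UTF8.GetBytes(secret + ":" + customerId));
        }

        string hex = BitConverter.ToString(digest).Replace("-", ""); // 32 upper-case hex chars
        return hex.Substring(0, hashChars) + customerId.ToString("X");
    }
}
```

For example, `CustomerKeys.Build(255, "my-secret")` gives a string of 6 hex characters followed by `FF`. The result is all upper-case hex, so it is safe for case-insensitive input.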
Check the code below: RandomMan.MyRandomString(64) generates a random string of 64 chars.
Now I want to check whether this random string is unique in the database, using an Entity Framework query like the one below. If the string is not unique in the database, the do loop continues until it finds a unique random string. My question is: am I doing this correctly, or is there a better way?
string randstr;
do {
randstr = RandomMan.MyRandomString(64);
} while (DataCtx.StorageFiles.Any(x => x.AwsUniqueFileName == randstr));
I cannot tell whether or not you are doing it correctly, but if you already have the row in the DB, I would suggest appending your identity field id to the generated string. That makes sure the result is unique in the DB, provided that MyRandomString only produces letters (or at least never ends in a digit).
Let's say your generated string is abc and the id of the row you are updating is 53; then your final unique string is going to be abc53.
The standard approach for this is to just generate a GUID:
Console.WriteLine(Guid.NewGuid());
It's designed to be unique and highly unlikely to generate two identical GUIDs even on many instances at the same time so you don't need to worry much about atomicity of this operation.
The possibility of a collision is so low that you can skip handling it altogether. Just to be sure, you can put a unique key on this column and treat a violation as an exception; there is definitely no need for a loop.
I have a requirement to generate a semi-random code in C#/ASP.NET that has to be unique in the SQL Server database.
These codes need to be generated in batches of up to 100 codes per run.
Given the requirements, I'm not sure how I can do this without generating a code and then checking the database to see if it exists, which seems like a horrible way of doing it.
Here are the requirements:
Maximum 10 characters long (alpha-numeric only)
Must not be case sensitive
User can specify an optional 3 character prefix for the code
Must not violate 2 column unique constraint in the database, i.e. must be a unique "code text" within the "category" (CONSTRAINT ucCodes UNIQUE (ColumnCodeText, ColumnCategoryId))
So, given the 10 character limit, GUIDs are not an option. Given the case insensitivity requirement, the mathematical probability of database collisions is fairly high, I think.
At the same time, there are enough possible combinations that a straight look-up table in the DB would be prohibitive, I believe.
Is there a reasonably performant way of generating codes with these requirements that doesn't involve saving them to the DB one code at a time and waiting for a unique key violation to see if it goes through?
You have two options here.
You generate a new ID and insert it. If it throws a duplicate unique key exception, try again until you succeed, or bail out if you run out of IDs. The performance will stink if most of the IDs are used up.
You pregenerate all the possible IDs and store them in a table. Whenever you need one, you remove the row at a random index and use that as the ID. The database takes care of the concurrency for you, so the ID is guaranteed to be unique. If the first three letters are given, you can simply add a where clause to restrict the rows to those matching that prefix.
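As a variation on the first option that avoids one round trip per code, a batch can be deduplicated in memory first and then checked against the database in a single lookup. The sketch below is only an illustration under assumptions: `CodeBatch`, `Generate` and the `findExisting` delegate are hypothetical names, and `findExisting` stands in for one query such as "which of these candidates already exist in this category?". Codes are always upper-cased, which is how this example deals with the case-insensitivity requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

public static class CodeBatch
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private const int CodeLength = 10;

    private static string RandomSuffix(int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
        return new string(chars);
    }

    // findExisting represents ONE database lookup per round: it receives the
    // candidate codes and returns those that are already stored.
    public static List<string> Generate(int count, string prefix,
        Func<ICollection<string>, ISet<string>> findExisting)
    {
        prefix = (prefix ?? "").ToUpperInvariant();
        if (prefix.Length > 3)
            throw new ArgumentException("Prefix may be at most 3 characters.", nameof(prefix));

        var accepted = new HashSet<string>();
        while (accepted.Count < count)
        {
            // Build enough distinct candidates to fill the remaining slots.
            var candidates = new HashSet<string>();
            while (candidates.Count < count - accepted.Count)
            {
                string candidate = prefix + RandomSuffix(CodeLength - prefix.Length);
                if (!accepted.Contains(candidate))
                    candidates.Add(candidate);
            }

            // Keep only the candidates the database does not know yet.
            ISet<string> existing = findExisting(candidates);
            accepted.UnionWith(candidates.Where(code => !existing.Contains(code)));
        }
        return accepted.ToList();
    }
}
```

The unique constraint should stay in place regardless: another process could insert one of the same codes between the lookup and the insert, so the final INSERT still needs to handle a violation.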
I am working on a project which reads information from CSV files posted twice a day and stores the info into a database. Each CSV file may contain rows from previous files. Unfortunately, to get a unique row in the CSV files, you have to assign 8 columns as the primary key. I feel that this is ridiculous to work with. So, I really want to reduce the number down to one. So far, the only idea I have is to create a hash of all of the primary key columns or just append them all into one string. Before I do this, I'd like to know if there might be a better way to reduce the 8 primary keys down to one.
PK columns are defined as:
// ....
table.Columns.Add("plantNumber",typeof(string)); //e.g. 341
table.Columns.Add("shipLocation",typeof(string)); //e.g. 11000047
table.Columns.Add("shipDate",typeof(DateTime)); //e.g. 2017/04/18 00:00
table.Columns.Add("releaseNumber",typeof(string)); //e.g. VH6516128
table.Columns.Add("releaseDate",typeof(DateTime)); //e.g. 2017/04/14
table.Columns.Add("orderNumber",typeof(string)); //e.g. 216967
table.Columns.Add("orderLine",typeof(string)); //e.g. 0011
table.Columns.Add("sequence",typeof(string)); //e.g. 044
// ....
table.PrimaryKey = new DataColumn[]
{
table.Columns["plantNumber"],
table.Columns["shipLocation"],
table.Columns["shipDate"],
table.Columns["releaseDate"],
table.Columns["releaseNumber"],
table.Columns["orderNumber"],
table.Columns["orderLine"],
table.Columns["sequence"],
};
Note: the reason many of the seemingly numeric fields are treated as strings instead of ints is that they are quoted in the CSV file and may begin with zeros, which I need to preserve. I am also not 100% certain that they won't ever contain letters.
UPDATE:
I don't consider an auto-incrementing number to be a good solution, because I still need to ensure that the combination of the 8 columns is unique, not only within the SQL DB but within the DataTable itself. The individual columns by themselves are not unique; only the combination of the columns is.
To me that is not a primary key. The primary key isn't 'the only thing unique' in your row. A unique index can do the same for you.
A primary key (in my opinion) should just be a single, often numerical, value that technically identifies the row as unique. Functionally something else can define a row as unique, as in your sample here, but I wouldn't make that the primary key for that reason alone.
There is nothing wrong with a compound index. That's how relational databases work. If you really have to, you could concatenate or hash the 8 values that make up the unique key into a single column, but that has the adverse effect of making your data static, unless you rebuild the hash/concat value whenever one of the columns changes.
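If you do go the hash route, a minimal sketch could look like this. `RowKey` and `Compute` are hypothetical names, SHA-256 is just one reasonable choice of hash, and the length-prefixed encoding is an assumption added so that different column values can't run together into the same input string.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class RowKey
{
    // Combines the key column values into one 64-character hex string.
    public static string Compute(params object[] keyParts)
    {
        var sb = new StringBuilder();
        foreach (object part in keyParts)
        {
            // Format dates culture-independently so the key is stable across machines.
            string text = part is DateTime dt
                ? dt.ToString("o", CultureInfo.InvariantCulture)
                : Convert.ToString(part, CultureInfo.InvariantCulture) ?? "";

            // Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            sb.Append(text.Length).Append(':').Append(text).Append('|');
        }

        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

It would be called with the 8 key values in a fixed order, e.g. `RowKey.Compute("341", "11000047", new DateTime(2017, 4, 18), ...)`. The column order has to be the same everywhere, and the hash must be recomputed whenever one of the underlying values changes.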
I have a question about Guid duplication. I am creating a Guid to use as the primary key of an entity that is saved to a database table.
Account account = new Account(Guid.NewGuid());
But I am confused. Can this cause duplicates in the database table, given that I am creating the primary key manually and inserting it into the database?
The database engine does not generate the Ids. After saving myriads of records, is there a possibility of duplicates?
Not really.
How much of "not really" depends on the GUID type, and on your understanding of probabilities.
With a "real" GUID, version 1, the value is guaranteed to be unique. It's formed by combining the MAC address of your network card (unique, unless you change it manually) and a timestamp.
A pseudo-random GUID, version 4, is not guaranteed to be unique, but it is extremely unlikely to get a collision anyway. You have 122 bits to work with, and 2^122 is a very big number. Like, really big. Using Guid.NewGuid() is fine, although it should be noted that it is not meant as a source of cryptographic randomness, so don't rely on it for secrets.
Of course, different implementations of GUIDv4 will have markedly different entropies. If you just use Random to generate the numbers, you're nowhere near the 122-bit maximum. So don't think you can just write your own code to generate GUIDs: most such attempts end up with nothing more unique than Random.Next(), which is by far not good enough for a primary key in a database.
Note that GUIDs are commonly used in scenarios like replication, which are completely built on two generated GUIDs being unique.
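To put a rough number on "extremely unlikely", here is the usual birthday-bound approximation for n random 122-bit values. The figure n = 10^9 (a billion rows) is picked purely as an illustrative assumption.

```latex
P(\text{collision}) \approx \frac{n^2}{2 \cdot 2^{122}}
  = \frac{(10^{9})^{2}}{2 \cdot 5.3 \times 10^{36}}
  \approx 9.4 \times 10^{-20}
```

This only holds if the 122 bits are actually random, which is the caveat about home-made generators above.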
the total number of unique such GUIDs is 2^122 (approximately
5.3×10^36). This number is so large that the probability of the same number being generated randomly twice is negligible
From Wiki
For your information, SQL Server can generate the Guid for you.
Set the data type of the ID column to uniqueidentifier, then in the properties bar go to RowGuid and change it to Yes.
P.S.
Make sure your ID is Primary Key.
I am migrating an old database (Oracle) and there are a few tables like CountryCode, DeptCode and RoleCodes whose primary key is a string (the code). I am thinking about adding a number column as the primary key because it would be faster for joins. These tables are not really big.
I am wondering whether the primary key for those tables should start from 1, or whether it could start from, say, 100 just to differentiate between the tables' PKs, although I don't think I would be showing them on reports.
For sequence-generated IDs, I would suggest starting at different values if it's easy to do (depends on your database etc). You shouldn't be using this to differentiate between them in code, but it can make testing more reasonable.
Before now, I've had a situation where I accidentally used a foreign key for one table as if it were the foreign key for another table. The tests passed because the IDs were coincidentally the same. After we discovered the problem, we changed the initial seed and found the tests were a lot clearer.
You shouldn't do it to differentiate between tables. That is just not practical.
Not all primary keys have to start at 1, as in the case of an order number.
The rationale you're using to switch to an integer primary key doesn't seem valid: the performance gain you'd see using an INT rather than the original codes (which I assume are strings) will be negligible. The PK is always indexed, and index lookups on strings or numerics are as good as instant. So unless you really need an INT, I'd be tempted to stick with the original data type and work with the original data. It simplifies data migration, which is something that should be considered whilst doing any work.
It is very common for example in ERP systems to define number ranges that
represent a certain group of items.
This can be both as position in a bigger number, e.g.
1234567890
| |
index 4 - 6 represents region code
index 7 - 8 represents dept code...
or, as I suspect in your case, parts at the same place, like
1000 - 1999 Region codes
2000 - 2999 DeptCode
3000 - 3999 RoleCode
Therefore: no, it does not necessarily have to start with 1.
Bigger ERP systems even have configuration sections for number ranges!
Now, from a database point of view:
Yes, your tables should always have a primary key!
Having one will tremendously improve performance on average cases.
(but in most database systems, if you do not provide one, one will be
set by the DBMS, which you do not see and cannot handle. Some DBMSs even
create indices, but that's another story)
I don't think the start value of the primary key matters.
What is important is that the FKs of the joined tables hold the same values as the PK of the main table.
A surrogate key can have any values, as long as they are unique. That's what makes it "surrogate" after all - values have no intrinsic meaning on their own, and shouldn't generally even be shown to the user. That being said, you could think about using different seeds, just for testing purposes, as Jon Skeet suggested.
That being said, do you really need to introduce a new (surrogate) key? The existing natural key could actually lead to fewer JOINs [1], and may be useful for clustering. While there are legitimate uses for surrogate keys, don't do it just because it is "fashionable". Always be aware of the tradeoffs you are making and pick the right balance for your concrete needs.
[1] It is automatically "propagated" down foreign keys, so you don't need to JOIN the child table to the parent just to get the natural key: the natural key is already in the child.
Doesn't matter what int the primary key starts from.
Assuming the codes aren't updated regularly, I don't believe that int will be any faster. It more heavily depends on it being a varchar or of a known size.
I personally always have a field named "Id" as the primary key of a table, defined as an int, or a bigint if necessary.
If the table matches up to an enumerated type then I make sure the Id matches the EnumeratedType id which can be any number - so no it doesn't need to start from 1.
If it doesn't match an enumerated type, then I will usually use an auto-incrementing key starting from 1 but this is not always needed.
Note - that if the number of rows is small, then the difference between indexing on a number and on a varchar will be negligible.
Yes, it doesn't matter what integer it starts from. Its main use is to identify each row uniquely and to define relationships with other tables.