We have reference values created from a Sequence in a database, which means that they are all integers. (It's not inconceivable - although massively unlikely - that they could change in the future to include letters, e.g. R12345.)
In our [C#] code, should these be typed as strings or integers?
Does the fact that it wouldn't make sense to perform any arithmetic on these values (e.g. adding them together) mean that they should be treated as strings? If not, and they should be typed as integers (/longs), what is the underlying principle/reason behind this?
I've searched for an answer to this, but not managed to find anything, either on Google or StackOverflow, so your input is very much appreciated.
There are a couple of other differences:
Leading Zeroes:
Do you need to allow for these? If so, a string is required, since an integer type will silently drop leading zeros.
Sorting:
Sort order will vary between the types:
Integer:
1
2
3
10
100
String:
1
10
100
2
3
So will you have a requirement to put the sequence in order (either way around)?
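To make the difference concrete, here is a minimal C# sketch (the sample values are made up) showing how the two orders diverge:

using System;
using System.Collections.Generic;

var asInts = new List<int> { 1, 10, 2, 100, 3 };
asInts.Sort();                           // numeric: 1, 2, 3, 10, 100

var asStrings = new List<string> { "1", "10", "2", "100", "3" };
asStrings.Sort(StringComparer.Ordinal);  // lexicographic: "1", "10", "100", "2", "3"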
The same arguments apply to the typing in the DB itself, as the requirements there are likely to be the same. Ideally, as Chris says, the two should be consistent.
Here are a few things to consider:
Are leading zeros important, i.e. is 010 different from 10? If so, use string.
Is the sort order important, i.e. should 200 be sorted before or after 30? Numeric order puts 30 first; lexicographic order puts 200 first.
Is the speed of sorting and/or equality checking important? If so, use int.
Are you at all limited in memory or disk space? If so, ints are 4 bytes, strings at minimum 1 byte per character.
Will int provide enough unique values? A string can support potentially unlimited unique values.
Is there any sort of link in the system that isn't guaranteed reliable (networking, user input, etc.)? If it's a text medium, int values are safer (all non-digit characters are erroneous); if it's binary, strings make for easier visual inspection (R13_55 is clearly an error if your ids are just alphanumeric, but is 12372?). See the sketch below.
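As a rough sketch of that last point, with int-typed ids a simple TryParse rejects corrupted text input for free (the incoming value here is hypothetical):

using System;

string raw = "12R72";  // hypothetical value received over an unreliable text link

// Any non-digit character fails the parse, so corruption is caught early.
if (!int.TryParse(raw, out int id))
    Console.WriteLine($"Rejected malformed id: {raw}");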
From the sounds of your description, these are values that currently happen to be represented by a series of digits; they are not actually numbers in themselves. This, incidentally, is just like my phone number: it is not a single number, it is a set of digits.
And, like my phone number, I would suggest storing it as a string. Leading zeros don't appear to be an issue here but considering you are treating them as strings, you may as well store them as such and give yourself the future flexibility.
They should be typed as integers and the reason is simply this: retain the same type definition wherever possible to avoid overhead or unexpected side-effects of type conversion.
There are good reasons not to use primitive types like int, string, or long all over your code. Among other problems, this allows for stupid errors like
using a key for one table in a query pertaining another table
doing arithmetic on a key and winding up with a nonsense result
confusing an index or other integral quantity with a key
and communicates very little information: Given int id, what table does this refer to, what kind of entity does it signify? You need to encode this in parameter/variable/field/method names and the compiler won't help you with that.
Since it's likely those values will always be integers, using an integral type should be more efficient and put less load on the GC. But to prevent the aforementioned errors, you could use an (immutable, of course) struct containing a single field. It doesn't need to support anything but a constructor and a getter for the id; that's enough to solve the above problems, except in the few pieces of code that need the actual value of the key (to build a query, for example).
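A minimal sketch of such a struct (the CustomerId name is illustrative, not from the question):

using System;

// Immutable single-field wrapper: the compiler now stops you from passing a
// CustomerId where a different key type (or a plain int index) is expected.
public readonly struct CustomerId : IEquatable<CustomerId>
{
    public int Value { get; }
    public CustomerId(int value) => Value = value;

    public bool Equals(CustomerId other) => Value == other.Value;
    public override bool Equals(object obj) => obj is CustomerId other && Equals(other);
    public override int GetHashCode() => Value;
    public override string ToString() => Value.ToString();
}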
That said, using a proper ORM also solves these problems, with less work on your side. They have their own share of downsides, but they're really not that bad.
If you don't need to perform some mathematical calculations on the sequences, you can easily choose strings.
But think about sorting: the resulting order will differ between integers and strings, e.g. 1, 2, 10 for integers but 1, 10, 2 for strings.
Related
I am using Google's protocol buffer library within my persistent storage system and want to persist currency values, but I am not sure that the floating point types provided by proto (float/double) are good enough. Are there any downsides to storing all of my currency values as strings (e.g. storing "0.10" instead of 0.1), then using the Convert.ToDecimal function when I retrieve my data and need to do arithmetic?
You are correct in anticipating that float/double data types are not suitable for "currency!"
Consider how SQL databases (and, uhh, COBOL programs ...) commonly store "currency" values: they use a decimal representation of some sort. For instance, a true COBOL program might use a "binary-coded decimal (BCD)" data type. A Microsoft Access database uses a "scaled integer": the dollars-and-cents value multiplied by 10,000, giving a fixed(!) "4 digits to the right of the decimal."
For the immediate purposes of this question, I would definitely store the values as strings, and then give very serious thought to the number of digits to be stored and just how to handle "rounding" to that number of digits. (For instance, there are algorithms such as “banker’s rounding.”)
“Storage size?” You don’t care about that. What you do care about is, that if a particular customer (or, auditor ...) actually adds-up all the numbers on a printed statement, the bottom-line on that piece of paper will agree ... at the very(!) least, within a single penny.
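A small sketch of that string round trip, assuming invariant-culture formatting and two stored digits (both are assumptions, not requirements from the question):

using System;
using System.Globalization;

string stored = "0.10";  // value persisted as a string in the proto message

// Parse with the invariant culture so "." is always the decimal separator.
decimal amount = Convert.ToDecimal(stored, CultureInfo.InvariantCulture);

// Banker's rounding (round-half-to-even) to two digits before persisting again.
decimal total = Math.Round(amount * 3, 2, MidpointRounding.ToEven);
string toStore = total.ToString("0.00", CultureInfo.InvariantCulture);  // "0.30"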
In C#, should you define a variable that only contains digits as a string if you're not going to do any math operations on it? Obviously, when setting the variable, you have to make sure you only accept numbers. The benefit of using int as opposed to string is that the compiler does the check for you. I'm just curious what other people think.
No, definitely not. It uses more memory and there is a reason number data types exist.
EDIT: Also, at some point you're possibly going to want to add some mathematical operations to your code. If you defined the variable as a string, you would have to make more modifications to the code.
You should be using the primitive data type necessary to store what you need. If you need to store a numeric value, you should (most often) use a numeric data type.
This question goes beyond any specific programming language. What you are really asking is how to express, within your code, your intent for handling this data.
For this case, your data is a sequence of digits.
What you need to ask is what is the purpose of this value you are storing.
What is the upper limit for this data?
Does formatting matter?
Do leading zeros have significance?
Answering these questions will assist you in making the best choice for your implementation.
Is there any advantage to using a String data type for this case?
One other point to consider is whether the variables will be used for sorting or comparing. For example, numbers stored as text do not sort the same as integers: "6" would be considered greater than "59".
I have a column defined as decimal(10,6). When I try to save the model with the value 10.12345, it is saved in the database as 10.123400; the last digit ("5") is truncated.
Why does the decimal default to a scale of only 4 digits in LINQ, and how can I avoid this for all columns in my models? The solution I found was to use DbType="Decimal(10,6)", but I have a lot of columns in this situation, and applying the change to every one of them doesn't seem like a good idea.
Is there a way to change this behavior without changing all the decimal columns?
Thanks
You need to use the proper DbType, decimal(10, 6) in this case.
The reason for this is simple - while .NET's decimal is actually a (decimal) floating point (the decimal point can move), MS SQL's isn't. It's a fixed "four left of decimal point, six right of decimal point". When LINQ passes the decimal to MS SQL, it has to use a specific SQL decimal - and the default simply happens to use four for the scale. You could always use a decimal big enough for whatever value you're trying to pass, but that's very impractical - for one, it will pretty much eliminate execution plan caching, because each different decimal(p, s) required will be its own separate query. If you're passing multiple decimals, this means you'll pretty much never get a cached plan; ouch.
In effect, the command doesn't send the value 10.12345 - it sends 10123450 (not entirely true, but just bear with me). Thus, when you're passing the parameter, you must know the scale - you need to send 10 as 10000000, for example. The same applies when you're not using LINQ - using SqlCommand manually has the same "issue", and you have to use a specific precision and scale.
If you're wary of modifying all those columns manually, just write a script to do it for you. But you do need to maintain the proper data types manually, there's no way around it.
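For the manual SqlCommand case, here is a sketch of setting the parameter's precision and scale explicitly (the connection string, table and column names are made up):

using System.Data;
using System.Data.SqlClient;

// Illustrative connection string and schema.
using var connection = new SqlConnection("Server=.;Database=Shop;Integrated Security=true");
connection.Open();

using var cmd = new SqlCommand("UPDATE Prices SET Amount = @amount WHERE Id = @id", connection);

// Explicit precision/scale so 10.12345 is not truncated at the default scale of 4.
cmd.Parameters.Add(new SqlParameter("@amount", SqlDbType.Decimal)
{
    Precision = 10,
    Scale = 6,
    Value = 10.12345m
});
cmd.Parameters.AddWithValue("@id", 1);
cmd.ExecuteNonQuery();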
From a list of integers in C#, I need to generate a list of unique values. I thought of MD5 or similar, but they generate too many bytes.
Integer size is 2 bytes.
I want to get a one way correspondence, for example
0 -> ARY812Q3
1 -> S6321Q66
2 -> 13TZ79K2
So, given the hash, the user cannot recover the integer or infer a sequence behind a list of hashes.
For now, I tried using MD5(my number) and then taking the first 8 characters. However, I found the first collision at 51389. What other alternatives could I use?
As I say, I only need one way. It is not necessary to be able to calculate the integer from the hash. The system uses a dictionary to find them.
UPDATE:
Replying to some suggestions about using GetHashCode(): GetHashCode returns the same integer. My purpose is to hide the integer from the end user. In this case, the integer is the primary key of a database. I do not want to give this information to users because they could deduce the number of records in the database or the increment of records by week.
Hashes are not unique, so maybe I need to use encryption like TripleDes or so, but I wanted to use something fast and simple. Also, TripleDes returns too many bytes too.
UPDATE 2:
I was talking about hashes, which was a mistake. In reality, I am trying to obfuscate the integer, and I tried doing so with a hash algorithm, which is not a good idea because hashes are not unique.
Update May 2017
Feel free to use (or modify) the library I developed, installable via Nuget with:
Install-Package Kent.Cryptography.Obfuscation
This converts a non-negative id such as 127 to an 8-character string, e.g. xVrAndNb, and back (with some available options to randomize the sequence each time it's generated).
Example Usage
var obfuscator = new Obfuscator();
string maskedID = obfuscator.Obfuscate(15);
Full documentation at: Github.
Old Answer
I came across this problem way back and couldn't find what I wanted on StackOverflow. So I made this obfuscation class and just shared it on github.
Obfuscation.cs - Github
You can use it by:
Obfuscation obfuscation = new Obfuscation();
string maskedValue = obfuscation.Obfuscate(5);
int? value = obfuscation.DeObfuscate(maskedValue);
Perhaps it can be of help to future visitors :)
Encrypt it with Skip32, which produces a 32 bit output. I found this C# implementation but can't vouch for its correctness. Skip32 is a relatively uncommon crypto choice and probably hasn't been analyzed much. Still it should be sufficient for your obfuscation purposes.
The strong choice would be format preserving encryption using AES in FFX mode. But that's pretty complicated and probably overkill for your application.
When encoded with Base32 (case insensitive, alphanumeric) a 32 bit value corresponds to 7 characters. When encoded in hex, it corresponds to 8 characters.
There is also the non-cryptographic alternative of generating a random value, storing it in the database, and handling collisions.
Xor the integer, maybe with a random key that is generated per user (stored in session). While it's not strictly a hash (as it is reversible), the advantages are that you don't need to store it anywhere, and the size will be the same.
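A minimal sketch of that XOR idea (the key value here is illustrative):

using System;

const uint key = 0x5A3C9F1B;  // in practice, generate per user and keep in session

uint Obfuscate(uint id) => id ^ key;  // applying XOR twice restores the original

uint masked = Obfuscate(12345);
Console.WriteLine($"{masked:X8} -> {Obfuscate(masked)}");  // prints the mask, then 12345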
For what you want, I'd recommend using GUIDs (or another kind of unique identifier where the probability of collision is either minimal or none) and storing them in the database row, then just never show the ID to the user.
IMHO, it's kind of bad practice to ever show the primary key in the database to the user (much less to let users do any kind of operations on them).
If they need to have raw access to the database for some reason, then just don't use ints as primary keys, and make them guids (but then your requirement loses importance since they can just access the number of records)
Edit
Based on your requirements, if you don't mind the algorithm being potentially computationally expensive, then you can just generate a random 8-character string every time a new row is added, and keep generating random strings until you find one that is not already in the database.
This is far from optimal and can be computationally expensive, but given that you use a 16-bit id and the maximum number of rows is 65536, I wouldn't worry too much about it (the probability of a random 8-character string colliding with one of 65536 existing values is minimal, so you'll probably succeed on the first or second try if your pseudo-random generator is good).
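A rough sketch of that generate-and-retry approach, with an in-memory HashSet standing in for the database lookup:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

var taken = new HashSet<string>();  // stands in for a lookup against existing rows
const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

string NewId()
{
    while (true)
    {
        var chars = new char[8];
        for (int i = 0; i < chars.Length; i++)
            chars[i] = alphabet[RandomNumberGenerator.GetInt32(alphabet.Length)];

        string candidate = new string(chars);
        if (taken.Add(candidate))  // retry on the (very unlikely) collision
            return candidate;
    }
}

Console.WriteLine(NewId());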
I often have to convert a retrieved value (usually a string) to an int. But in C# (.NET) you have to choose Int16, Int32 or Int64 - how do you know which one to choose when you don't know how big your retrieved number will be?
Everyone here who has mentioned that declaring an Int16 saves ram should get a downvote.
The answer to your question is to use the keyword "int" (or if you feel like it, use "Int32").
That gives you a range of roughly ±2.1 billion... Also, 32-bit processors will handle those ints better... also (and THE MOST IMPORTANT REASON) is that if you plan on using that int for almost any reason... it will likely need to be an "int" (Int32).
In the .Net framework, 99.999% of numeric fields (that are whole numbers) are "ints" (Int32).
Example: Array.Length, Process.Id, Window.Width, Button.Height, etc, etc, etc 1 million times.
EDIT: I realize that my grumpiness is going to get me down-voted... but this is the right answer.
Just wanted to add that... I remembered that in the days of .NET 1.1 the compiler was optimized so that 'int' operations are actually faster than byte or short operations.
I believe it still holds today, but I'm running some tests now.
EDIT: I have got a surprise discovery: the add, subtract and multiply operations for short(s) actually return int!
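You can see the widening directly; this snippet will not compile without the explicit cast:

short a = 1, b = 2;
var sum = a + b;            // sum is inferred as int, not short
// short c = a + b;         // error CS0266: cannot implicitly convert 'int' to 'short'
short c = (short)(a + b);   // explicit cast required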
Repeatedly trying TryParse() doesn't make sense, you have a field already declared. You can't change your mind unless you make that field of type Object. Not a good idea.
Whatever data the field represents has a physical meaning. It's an age, a size, a count, etc. Physical quantities have realistic restraints on their range. Pick the int type that can store that range. Don't try to fix an overflow, it would be a bug.
Contrary to the current most popular answer, shorter integers (like Int16 and SByte) often take up less space in memory than larger integers (like Int32 and Int64). You can easily verify this by instantiating large arrays of sbyte/short/int/long and using perfmon to measure managed heap sizes. It is true that many CLR flavors will widen these integers for CPU-specific optimizations when doing arithmetic on them and such, but when stored as part of an object, they take up only as much memory as is necessary.
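A quick way to check this without perfmon is to measure the managed heap around large allocations (numbers are approximate and GC-dependent):

using System;

long baseline = GC.GetTotalMemory(forceFullCollection: true);

var shorts = new short[10_000_000];  // ~20 MB of payload
long afterShorts = GC.GetTotalMemory(true);

var ints = new int[10_000_000];      // ~40 MB of payload
long afterInts = GC.GetTotalMemory(true);

Console.WriteLine($"short[]: ~{(afterShorts - baseline) / 1_000_000} MB");
Console.WriteLine($"int[]:   ~{(afterInts - afterShorts) / 1_000_000} MB");

GC.KeepAlive(shorts);  // keep the arrays reachable during measurement
GC.KeepAlive(ints);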
So, you definitely should take size into consideration especially if you'll be working with large list of integers (or with large list of objects containing integer fields). You should also consider things like CLS-compliance (which disallows any unsigned integers in public members).
For simple cases like converting a string to an integer, I agree an Int32 (C# int) usually makes the most sense and is likely what other programmers will expect.
If we're just talking about a couple numbers, choosing the largest won't make a noticeable difference in your overall ram usage and will just work. If you are talking about lots of numbers, you'll need to use TryParse() on them and figure out the smallest int type, to save ram.
All computers are finite. You need to define an upper limit based on what you think your users requirements will be.
If you really have no upper limit and want to allow 'unlimited' values, try adding the .NET Java runtime libraries to your project, which will allow you to use the java.math.BigInteger class - which does math on integers of nearly unlimited size.
Note: The .Net Java libraries come with full DevStudio, but I don't think they come with Express.