What is a size metric attached to a VARCHAR? - c#

I use VARCHAR throughout my app, and found something particularly confusing... Why do I need to define my SQL VARCHAR columns with a length, such as VARCHAR(50) or VARCHAR(1000)? Is the one and only purpose that this length constraint allows me to define my preferred maximum string length? Is there any performance difference or otherwise between VARCHAR(50) and VARCHAR(1000)?

That depends entirely on the internals of your DBMS. For example, if you index a varchar column, you will almost certainly get a key part sized to the maximum length.
That's because indexes have to be extremely efficient, and handling variable-length fields there would probably slow things down.
Even in the data area of the database, you may find that it simply allocates space for the largest size. I've seen proposals that just store a pointer in the row to an on-disk heap, but that means two disk reads per row, and I can't see that being a very good option for high performance.
The sizes of your columns will affect performance through things like how many records can be read in at one time, how many can fit in an n-ary tree index node, and so forth.

SQLite: Size limits are completely ignored.
PostgreSQL: VARCHAR(N) is essentially equivalent to TEXT CHECK (LENGTH(x) <= N). There is no performance advantage to declaring a maximum size.
MySQL: The declared size determines whether the string length prefix is stored as one byte or two bytes.
Oracle: Higher size limits have a performance disadvantage.
MS SQL Server: VARCHAR columns greater than 900 bytes cannot be indexed.
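The SQLite entry in the list above is easy to check locally. The following is a minimal sketch using Python's built-in sqlite3 module; the table name t, the column name and the helper stored_length are made up for illustration. It only demonstrates SQLite's behavior and says nothing about the other systems listed.

```python
import sqlite3


def stored_length(declared_size: int, value: str) -> int:
    """Insert `value` into a VARCHAR(declared_size) column and return the stored length."""
    conn = sqlite3.connect(":memory:")
    try:
        # The declared size is accepted syntactically but not enforced by SQLite.
        conn.execute(f"CREATE TABLE t (name VARCHAR({declared_size}))")
        conn.execute("INSERT INTO t (name) VALUES (?)", (value,))
        (length,) = conn.execute("SELECT LENGTH(name) FROM t").fetchone()
        return length
    finally:
        conn.close()


if __name__ == "__main__":
    # A 100-character string goes into a VARCHAR(5) column without truncation or error.
    print(stored_length(5, "x" * 100))
```

On a DBMS that enforces the limit, the same insert would be rejected or truncated, depending on the system and its settings.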

Related

EF Core Guids generating with same last 6 bytes [duplicate]

Thanks to the wonderful article The Cost of GUIDs as Primary Keys, we have the COMB GUID. Based on current implementations, there are 2 approaches:
use the last 6 bytes for the timestamp: GUIDs as fast primary keys under multiple databases
use the last 8 bytes for the timestamp, based on Windows ticks: GUID COMB strategy in EF4.1 (CodeFirst)
With a 6-byte timestamp, more bytes are left for random data, which reduces the chance of GUID collisions. However, more GUIDs would share the same timestamp, and those are not sequential at all. For that reason, the 8-byte timestamp would seem preferable.
So it seems a hard choice. The article mentioned above, GUIDs as fast primary keys under multiple databases, says:
Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential. This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated. However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window. This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well
I wonder whether someone who knows database internals could shed some light on the observation above. Is it because the database server keeps the data in memory and only writes to disk once a certain threshold is reached? In that case, the reordering of inserted rows whose non-sequential GUIDs share a timestamp would generally happen in memory, with minimal performance penalty.
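For reference, the 6-byte variant discussed above can be sketched as follows. This is an illustration only, not the code from either linked article: the function name comb_guid, the big-endian byte order and the choice not to set UUID version bits are all assumptions.

```python
import os
import time
import uuid
from typing import Optional


def comb_guid(timestamp_ms: Optional[int] = None) -> uuid.UUID:
    """Build a 16-byte value: 10 random bytes followed by a 6-byte millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = os.urandom(10)
    # Keep only the low 48 bits so the timestamp always fits in 6 bytes.
    time_part = (timestamp_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


if __name__ == "__main__":
    first = comb_guid(1_000)
    second = comb_guid(1_000)  # same millisecond: identical last 6 bytes
    later = comb_guid(1_001)   # next millisecond: larger last 6 bytes
    print(first.bytes[10:] == second.bytes[10:])
    print(first.bytes[10:] < later.bytes[10:])
```

Two values created within the same millisecond share their trailing 6 bytes and differ only in the random part, which is exactly the "same timestamp, not sequential" situation the quoted footnote describes.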
Update:
Based on our testing, the COMB GUID did not reduce table fragmentation compared with a random GUID, despite what is claimed around the internet. It seems the only way right now is to have SQL Server generate the sequential GUID.
The article you referenced is from 2002 and is very old. Just use newsequentialid (available in SQL Server 2005 and up). This guarantees that each new id you generate is greater than the previous one, solving the index fragmentation/page split issue.
Another aspect I'd like to mention, though, that the writer of that article glossed over, is that using 16 bytes when you only need 4 is not a good idea. Let's say you have a table with 500,000 rows averaging 150 bytes not including the clustered column, and the table has 3 nonclustered indexes (which repeat the clustered column in each row), each in turn with rows averaging 4 bytes, 25 bytes, and 50 bytes not counting the clustered column.
The storage requirements at perfect 100% fill factor are then (all numbers in megabytes except where %):
Item  Clust  50     25     4      Total
----  -----  -----  -----  -----  ------
GUID  79.1   31.5   19.6   9.5    139.7
int   73.4   25.7   13.8   3.8    116.7
%imp  7.2%   18.4%  29.6%  60.0%  16.5%
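The arithmetic behind the table can be sketched as below. It assumes each figure is simply rows × (row bytes + clustered key bytes), expressed in MiB, with a 16-byte GUID and a 4-byte int. The function and variable names are made up, and the results match the table only to within the rounding used in the original figures.

```python
ROWS = 500_000
BYTES_PER_MB = 1024 * 1024
KEY_BYTES = {"GUID": 16, "int": 4}
# Average row width excluding the clustered key, per index.
ROW_BYTES = {"Clust": 150, "50": 50, "25": 25, "4": 4}


def size_mb(row_bytes: int, key_bytes: int) -> float:
    """Size of one index at 100% fill factor, ignoring page and row overhead."""
    return ROWS * (row_bytes + key_bytes) / BYTES_PER_MB


def storage_table() -> dict:
    """Return {key type: {index name: size in MB}}."""
    return {
        key: {name: size_mb(width, key_bytes) for name, width in ROW_BYTES.items()}
        for key, key_bytes in KEY_BYTES.items()
    }


def improvement_pct(table: dict, index_name: str) -> float:
    """Percentage saved in one index by switching the clustered key from GUID to int."""
    guid = table["GUID"][index_name]
    integer = table["int"][index_name]
    return (guid - integer) / guid * 100


if __name__ == "__main__":
    table = storage_table()
    for name in ROW_BYTES:
        print(name, round(table["GUID"][name], 1), round(table["int"][name], 1),
              round(improvement_pct(table, name), 1))
```

The narrower the index row, the larger the share taken by the repeated clustered key, which is why the 4-byte index shows the biggest relative saving.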
In the nonclustered index having just one int column of 4 bytes (a common scenario), switching the clustered index to an int makes it 60% smaller! This translates directly into a 60% performance improvement for any scans on the table--and that's conservative, because with smaller rows, page splits will occur less often and the fragmentation will stay better longer.
Even in the clustered index itself, there's still a 7.2% performance improvement, which is not nothing, at all.
What if you used GUIDs throughout your entire database, which had tables with a similar profile to this one, where switching to int would yield a 16.5% reduction in size, and the database itself was 1.397 terabytes in size? Your whole database would be 230 GB larger (refer to the Total column: 139.7 - 116.7 = 23.0 MB, scaled up by 10,000). That translates into real money in the real world for high-availability storage. It moves your disk purchase schedule earlier in time, which is harmful to your company's bottom line.
Do not use larger data types than necessary, ever. It's like adding weight to your car for no reason: you will pay for it (if not in speed, then in fuel economy).
UPDATE
Now that I know you are creating the GUID in your client-side code, I can see more clearly the nature of your problem. If you are able to defer creating the GUID until row insertion time, here's one way to accomplish that.
First, set a default for your CustomerID column:
ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
DEFAULT (newsequentialid()) FOR CustomerID;
Now you don't have to specify what value to insert for CustomerID in any INSERT, and your query could look like this:
DECLARE @Name varchar(100) = 'Acme Spy Devices';
INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID -- a GUID
VALUES (@Name);
In this very simple example, you have inserted a new row into the Customer table and returned a rowset to the client containing the just-created value, all in one query.
Note that newsequentialid() can only be used in a DEFAULT constraint, so explicitly inserting VALUES (newsequentialid(), @Name) is not an option; rely on the column default instead.

having table for fixed data or Enum?

I have a table that holds constant values. Is it better to keep this table in my database (SQL), or to have an Enum in my code and delete the table?
My table has only 2 columns and at most 20 rows. These rows are fixed and get filled once, the first time I run the application.
I would suggest creating an Enum for your case. Since the values are fixed (and I am assuming that the table is not going to change very often), you can use an Enum. A database table requires a database connection and an extra round trip to the database, both of which can be skipped if you use an Enum.
A lot may also depend on how many operations you are going to perform on your values. For example, it's tedious to query your Enum values to get distinct values, whereas with the table approach it would be a simple select distinct. So you may have to look at your needs and the operations you will perform on these values.
As far as the performance is concerned you can look at: Enum Fields VS Varchar VS Int + Joined table: What is Faster?
As you can see, ENUM and VARCHAR results are almost the same, but join
query performance is 30% lower. Also note the times themselves –
traversing about same amount of rows full table scan performs about 25
times better than accessing rows via index (for the case when data
fits in memory!)
So, if you have an application and you need to have some table field
with a small set of possible values, I’d still suggest you to use
ENUM, but now we can see that performance hit may not be as large as
you expect. Though again a lot depends on your data and queries.
That depends on your needs.
You may want to translate the Enum values (if you are showing them in a GUI) and order a set of records based on the translated values. For example: imagine you have an Employees table and a Position column. If the record set is big, and you want to sort or order by the translated position column, then you have to keep the enum values and their translations in the database.
Otherwise KISS and have it in code. You will save the time spent asking the database for values.
It depends on the nature of those constants.
If they are low-level system constants that should never change (like pi=3.1415), then it is better to keep them only in code, in some config file. Also, if performance is critical and you use them very often (on almost every request), it is better to keep them in code.
If they are constants (maybe business constants) that can change in the future, it is OK to put them in a table. You then have more flexibility to change them (for instance from an admin panel).
It really depends on what you actually need.
With Enum
It is faster to access
Bound to that particular application (although you can share it by adding a reference, that does not look as good as using the DB).
You can use it in a switch statement.
An Enum usually does not care about its underlying value, which is limited to integral types.
With DB
It is slower, because you have to make connection and query.
The data can be shared widely.
You can set the value to be anything (any type any value).
So, if you will use it only on certain application, Enum is good enough. But if several applications are going to use it, then DB would be better option.

Should primary key always start from 1?

I am migrating an old database (Oracle), and there are a few tables like CountryCode, DeptCode and RoleCodes whose primary key is a string (the code). I am thinking about adding a Number column as a primary key because it would be faster for joins. These tables are not really big.
I am wondering whether the primary key for those tables should start from 1, or whether it can start from 100 just to differentiate between the tables' PKs, although I don't think I would be showing them on reports.
For sequence-generated IDs, I would suggest starting at different values if it's easy to do (depends on your database etc). You shouldn't be using this to differentiate between them in code, but it can make testing more reasonable.
Before now, I've had a situation where I accidentally used a foreign key for one table as if it were the foreign key for another table. The tests passed because the IDs were coincidentally the same. After we discovered the problem, we changed the initial seed and found the tests were a lot clearer.
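The effect described above can be sketched with two in-memory "tables". Everything here is hypothetical (the table contents, the seeds and the deliberately buggy lookup); it only illustrates why identical seeds can hide a mixed-up key in tests.

```python
from itertools import count


def build_tables(customer_seed: int, order_seed: int):
    """Create two tiny tables whose generated ids start at the given seeds."""
    customer_ids = count(customer_seed)
    order_ids = count(order_seed)
    customers = {next(customer_ids): name for name in ("Ann", "Bob")}
    orders = {next(order_ids): item for item in ("Widget", "Gadget")}
    return customers, orders


def buggy_customer_lookup(customers: dict, order_id: int):
    """Deliberate bug: an order id is used as if it were a customer id."""
    return customers.get(order_id)


if __name__ == "__main__":
    # Same seeds: the wrong key happens to match a row, so a test would pass.
    customers, orders = build_tables(1, 1)
    print(buggy_customer_lookup(customers, next(iter(orders))))

    # Different seeds: the wrong key finds nothing, so the bug shows up at once.
    customers, orders = build_tables(1, 1000)
    print(buggy_customer_lookup(customers, next(iter(orders))))
```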
You shouldn't do it to differentiate between tables. That is just not practical.
Not all primary keys have to start at 1, as in the case of an order number.
The rationale you're using to switch to an integer primary key doesn't seem valid: the performance gain you'd see using an INT rather than the original codes (which I assume are strings) will be negligible. The PK is always indexed, and index lookups on strings or numerics are as good as instant. So unless you really need an INT, I'd be tempted to stick with the original data type and work with the original data. That also simplifies data migration (which is something that should be considered whilst doing any work).
It is very common for example in ERP systems to define number ranges that
represent a certain group of items.
This can be either a position within a bigger number, e.g.
1234567890
   |  |
   index 4 - 6 represents region code
      index 7 - 8 represents dept code...
or, as I suspect in your case, parts at the same place, like
1000 - 1999 Region codes
2000 - 2999 DeptCode
3000 - 3999 RoleCode
Therefore: No, it does not necessarily start with 1.
Bigger ERP Systems have even configuration sections for number ranges!
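A number-range scheme like the one listed above can be sketched as a small lookup. The range boundaries come from the example; the dictionary and function names are made up for illustration.

```python
from typing import Optional

# Half-open Python ranges: range(1000, 2000) covers 1000 through 1999.
NUMBER_RANGES = {
    "Region codes": range(1000, 2000),
    "DeptCode": range(2000, 3000),
    "RoleCode": range(3000, 4000),
}


def group_for_id(key: int) -> Optional[str]:
    """Return the group whose number range contains `key`, or None if unassigned."""
    for group, id_range in NUMBER_RANGES.items():
        if key in id_range:
            return group
    return None


if __name__ == "__main__":
    for key in (1000, 2500, 3999, 4000):
        print(key, group_for_id(key))
```

As noted in the other answers, this is a convenience for humans and tests; application code should not rely on the range to decide which table a key belongs to.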
Now, from a database point of view:
Yes, your tables should always have a primary key!
Having one will tremendously improve performance in the average case.
(In most database systems, if you do not provide one, the DBMS will
set one internally, which you cannot see or control. Some DBMSs even
create indices, but that's another story.)
I think the start number or start value of the primary key does not matter.
What is important is that the FKs of the joined tables hold the same values as the PK of the main table.
A surrogate key can have any values, as long as they are unique. That's what makes it "surrogate" after all - values have no intrinsic meaning on their own, and shouldn't generally even be shown to the user. That being said, you could think about using different seeds, just for testing purposes, as Jon Skeet suggested.
That being said, do you really need to introduce a new (surrogate) key? The existing natural key could actually lead to fewer JOINs [1], and may be useful for clustering. While there are legitimate uses for surrogate keys, don't do it just because it is "fashionable". Always be aware of the tradeoffs you are making and pick the right balance for your concrete needs.
[1] It is automatically "propagated" down foreign keys, so you don't need to JOIN the child table to the parent just to get the natural key: the natural key is already in the child.
Doesn't matter what int the primary key starts from.
Assuming the codes aren't updated regularly, I don't believe that an int will be any faster. It depends more heavily on whether the column is a varchar or of a known fixed size.
I personally always have a field named "Id" as the primary key of a table, defined as an int, or a bigint if necessary.
If the table matches up to an enumerated type, then I make sure the Id matches the enumerated type's value, which can be any number, so no, it doesn't need to start from 1.
If it doesn't match an enumerated type, then I will usually use an auto-incrementing key starting from 1, but this is not always needed.
Note that if the number of rows is small, the difference between indexing on a number and on a varchar will be negligible.
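Keeping the table Id aligned with an enumerated type, as described above, can be sketched like this. The Role enum, its values and the table layout are hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    """Hypothetical enumerated type; the values need not start from 1."""
    ADMIN = 10
    EDITOR = 20
    VIEWER = 30


def seed_role_table(conn: sqlite3.Connection) -> None:
    """Create the lookup table and insert one row per enum member, reusing its value as Id."""
    conn.execute("CREATE TABLE Role (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO Role (Id, Name) VALUES (?, ?)",
        [(int(role), role.name) for role in Role],
    )


def role_name(conn: sqlite3.Connection, role: Role) -> Optional[str]:
    """Look up the stored name for an enum member by its numeric value."""
    row = conn.execute("SELECT Name FROM Role WHERE Id = ?", (int(role),)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    seed_role_table(connection)
    print(role_name(connection, Role.EDITOR))
    connection.close()
```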
Yes, it doesn't matter what integer it starts from. Its main use is to identify each row uniquely and to define relationships with other tables.

Correct usage of primary key with timestamped data

I have a table which stores timestamped recordings from a collection of sensors. These readings are taken 14400 times per day (every 6 seconds).
There are 4 sensors, and they share their main data table.
At the moment the schema is as follows:
id (int-PK)
time (DateTime)
sensor (int)
reading (int)
This works perfectly well, and I have the primary key set to autoincrement.
It seems silly to have this primary key at all however, since I never refer to it - Would I be better off using a combination of time and sensor to act as a composite key?
If I did use a composite key, I assume my bytes per row would be decreased too? This is relevant since the table is over 10m rows, so any saving is worth it.
It seems win-win, but I wanted to see what the repercussions of this approach would be.
Composite indexes, and especially composite primary keys, should be avoided. The index is wider and this is bad for performance (and memory usage). In my personal opinion, it's also bad design to have a composite primary key, since there is no more unique singular way of referencing your row.
My advice would be to stick to the design you have now.
You are currently using a surrogate key and are considering a move to a natural key.
Surrogate keys have several advantages over natural keys:
Immutability
Requirement changes
Performance
Compatibility
Uniformity
(From wikipedia)
You can look for other posts about surrogate vs. natural keys on Stack Overflow.
But every design is different. As the database analyst, you should evaluate which decision is best for your project.
Stick with the design. I've never had anything but problems putting a datetime in a PK. When your inserts start failing because of duplicates, you'll wish you hadn't done it.
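The duplicate problem mentioned above can be reproduced in a few lines. This is a sketch using SQLite with a composite primary key on (time, sensor); the table name sensor_reading and the sample rows are made up, and other database systems report the violation with their own error types.

```python
import sqlite3


def insert_readings(readings):
    """Insert rows into a table keyed on (time, sensor); return the rows that were rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor_reading ("
        " time TEXT NOT NULL,"
        " sensor INTEGER NOT NULL,"
        " reading INTEGER NOT NULL,"
        " PRIMARY KEY (time, sensor))"
    )
    rejected = []
    for row in readings:
        try:
            conn.execute(
                "INSERT INTO sensor_reading (time, sensor, reading) VALUES (?, ?, ?)", row
            )
        except sqlite3.IntegrityError:
            # Same timestamp and sensor as an existing row: the composite key is violated.
            rejected.append(row)
    conn.close()
    return rejected


if __name__ == "__main__":
    sample = [
        ("2024-01-01 00:00:00", 1, 5),
        ("2024-01-01 00:00:00", 2, 7),  # same time, different sensor: accepted
        ("2024-01-01 00:00:00", 1, 6),  # same time and sensor: rejected
    ]
    print(insert_readings(sample))
```

With the surrogate int key, all three rows would be stored, and detecting the repeated (time, sensor) pair would be left to the application or to a separate unique index.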
If you want to save space, go with a tinyint for the sensor column (you have only 4 different values). Possibly use something smaller for reading as well. I doubt the sensor can record the roughly 4 billion different values that an int can store, so most likely you can use a smallint or tinyint for it.
bigint    8 bytes  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
int       4 bytes  -2,147,483,648 to 2,147,483,647
smallint  2 bytes  -32,768 to 32,767
tinyint   1 byte   0 to 255
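The ranges in the list above follow directly from the byte sizes. A short sketch of the arithmetic, with made-up function names; the signed types use two's complement, while tinyint is unsigned as in the list.

```python
def signed_range(num_bytes: int):
    """Range of a two's-complement signed integer stored in `num_bytes` bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def unsigned_range(num_bytes: int):
    """Range of an unsigned integer stored in `num_bytes` bytes."""
    return 0, 2 ** (8 * num_bytes) - 1


if __name__ == "__main__":
    print("bigint  ", signed_range(8))
    print("int     ", signed_range(4))
    print("smallint", signed_range(2))
    print("tinyint ", unsigned_range(1))
```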
Using a combined primary key (or unique index) on 10M rows could easily eat up any storage space gained by removing the int PK (and more). Also, referencing a row from this table would become a lot more difficult.
I always keep an int (or bigint if required) PK on any table. The storage space is normally relatively small compared to the rest of the data, and always having an easy way of linking/referencing rows in place makes life a lot easier with respect to enhancements and changes to your data model.

best way to represent this lookup table in c#

I need to represent a lookup table in C#, here is the basic structure:
Name    Range  Multiplier
Active  10-20  0.5
What do you guys suggest?
I will need to lookup on range and retrieve the multiplier.
I will also need to lookup using the name.
Update
It will have maybe 10-15 rows in total.
Range is an integer data type.
What you actually have is two lookup tables: one by Name and one by Range. There are several ways you can represent these in memory depending on how big the table will get.
The most likely fit for the "by-name" lookup is a dictionary:
var MultiplierByName = new Dictionary<string, double>() { {"Active",.5}, {"Other", 1.0} };
The range is trickier. For that you will probably want to store either just the minimum or the maximum item, depending on how your range works. You may also need to write a function to reduce any given integer to its corresponding stored key value (hint: use integer division or the mod operator).
From there you can choose another dictionary (Dictionary<int, double>), or if it works out right you could make your reduce function return a sequential int and use a List<double> so that your 'key' just becomes an index.
But like I said: to know for sure what's best we really need to know the scope and nature of the data in the lookup, and the scenario you'll use to access it.
Create a class to represent each row. It would have Name, RangeLow, RangeHigh and Multiplier properties. Create a list of such rows (read from a file or entered in the code), and then use LINQ to query it:
from r in LookupTable
where r.RangeLow <= x && r.RangeHigh >= x
select r.Multiplier;
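The row-class approach above, combined with a by-name dictionary, can be sketched as follows. It is written in Python rather than C# purely as an illustration, and the second row's range is invented so that the table has more than one entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LookupRow:
    name: str
    range_low: int
    range_high: int
    multiplier: float


LOOKUP_TABLE = [
    LookupRow("Active", 10, 20, 0.5),
    LookupRow("Other", 21, 30, 1.0),  # hypothetical row, range made up
]

# Secondary index for lookups by name.
MULTIPLIER_BY_NAME = {row.name: row.multiplier for row in LOOKUP_TABLE}


def multiplier_for_value(x: int) -> Optional[float]:
    """Return the multiplier of the first row whose inclusive range contains x."""
    for row in LOOKUP_TABLE:
        if row.range_low <= x <= row.range_high:
            return row.multiplier
    return None


if __name__ == "__main__":
    print(multiplier_for_value(15))
    print(MULTIPLIER_BY_NAME["Active"])
    print(multiplier_for_value(99))
```

With only 10-15 rows, a linear scan like this is unlikely to matter for performance; a sorted structure would only be worth considering for much larger tables.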
Sometimes simplicity is best. How many entries are we looking at, and are the ranges integer ranges as you seem to imply in your example? While there are several approaches I can think of, the first one that comes to mind is to maintain two different lookup dictionaries, one for the name and one for the value (range) and then just store redundant info in the range dictionary. Of course, if your range is keyed by doubles, or your range goes into the tens of thousands I'd look for something different, but simplicity rules in my book.
I would implement this using a DataTable, assuming there was no pressing reason to use another data type. DataTable.Select would work fine for running a lookup on Name or Range. You do lose some performance using a DataTable for this, but with 10-15 records, would it matter that much?
