Scenario:
I have an application (C#) that expects a SQL database and login, which are set by a user. Once connected, it checks for the existence of several tables and creates them if not found.
I'd like to expand on this by having the program be capable of adding columns to those tables if I release a new version of the program which relies upon the new columns.
Question:
What is the best way to programmatically check the structure of an existing SQL table and create or update it to match an expected structure?
I am planning to iterate through the list of required columns and alter the existing table whenever it does not contain the new column. I can't help but wonder if there's an approach that is different or better.
Criteria:
Here are some of my expectations and self-imposed rules:
Newer versions of the program might no longer use certain columns, but they would be retained for data logging purposes. In other words, no columns will be removed.
Existing data in the table must be preserved, so the table cannot simply be dropped and recreated.
In all cases, newly added columns would allow null data, so the population of old records is taken care of by having default null values.
Example:
Here is a sample table (because visual examples help!):
id  datetime         sensor_name  sensor_status  x1    x2    x3    x4
1   20100513T151907  na019        OK             0.01  0.21  1.41  1.22
2   20100513T152907  na019        OK             0.02  0.23  1.45  1.52
Then, in a new version, I may want to add the column x5. The "x-columns" are all data-storage columns that accept null.
Edit:
I updated the sample table above. It is more of a log and not a parent table. So the sensors will repeatedly show up in this logging table with the values logged. A separate parent table contains the geographic and other logistical information about the sensor, making the table I wish to modify a child table.
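The approach described in the question (walk the list of required columns and add whichever ones are missing) can be sketched as follows. This is only an illustration in Python against SQLite, not the C#/SQL Server setup of the question: the table name `sensor_log`, the `EXPECTED_COLUMNS` dictionary and both helper functions are assumptions made up for the example. On SQL Server the existing columns would typically be read from `INFORMATION_SCHEMA.COLUMNS` instead of `PRAGMA table_info`.

```python
import sqlite3

# Hypothetical expected schema: column name -> SQL type.
# Table and column names must come from trusted code, never from user input,
# because they are interpolated into the SQL text below.
EXPECTED_COLUMNS = {
    "id": "INTEGER",
    "datetime": "TEXT",
    "sensor_name": "TEXT",
    "sensor_status": "TEXT",
    "x1": "REAL", "x2": "REAL", "x3": "REAL", "x4": "REAL",
    "x5": "REAL",  # the column introduced by the "new version"
}

def existing_columns(conn, table):
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def add_missing_columns(conn, table, expected):
    """Add every expected column the table lacks; never drops anything."""
    present = existing_columns(conn, table)
    added = []
    for name, sql_type in expected.items():
        if name not in present:
            # New columns are nullable, so old rows simply get NULL.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added

# Simulate a database created by the old version of the program.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sensor_log (
           id INTEGER PRIMARY KEY, datetime TEXT, sensor_name TEXT,
           sensor_status TEXT, x1 REAL, x2 REAL, x3 REAL, x4 REAL)"""
)
conn.execute(
    "INSERT INTO sensor_log VALUES (1, '20100513T151907', 'na019', 'OK', 0.01, 0.21, 1.41, 1.22)"
)
added = add_missing_columns(conn, "sensor_log", EXPECTED_COLUMNS)
```

Running the helper a second time adds nothing, which is what makes it safe to call on every start-up.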
This is a very troublesome feature to implement. I would advise against it and instead consider scripting the changes with a third-party tool such as Red Gate's SQL Compare: http://www.red-gate.com/products/SQL_Compare/index.htm
If you're in doubt, consider downloading the trial version of the software and performing a structure diff script on two databases with some non-trivial differences. You'll see from the result that the considerations for such operations are far from simple.
The other way around this type of issue is to redesign your database using the EAV model: http://en.wikipedia.org/wiki/Entity-attribute-value_model (New attributes are added as rows rather than columns, so the structure never changes. It has its own issues, but it is very flexible.)
(To use a diff tool you would need a copy of every version of your database and would have to create diff scripts that are shipped and executed with new releases and upgrades. That is a huge mess of its own to maintain. EAV is the way to go for something like this. It wrongly gets a lot of flak for not performing as well as a traditional database structure, but I have used it a number of times with great success. In fact, I have a HIPAA-compliant EAV database (SQL Server 2000) that has been in production for over six years, with several of the EAV tables containing tens of millions of rows, and it is still going strong with no significant slowdown. Of course, we don't do heavy reporting against that database. For reports we have an export that flattens the data into a relational structure.)
The common solution I see is to store version information somewhere in your database. Maybe have a really small table:
CREATE TABLE DB_PROPERTIES ([key] varchar(100), [value] varchar(100));
then you could add a row:
key | value
version | 12
Then you could just create a SQL update script (or set of scripts) that updates the database from version 12 to version 13.
DECLARE @v varchar(100);
SELECT @v = [value] FROM DB_PROPERTIES WHERE [key] = 'version';
IF @v = '12'
    PRINT 'run the 12 -> 13 upgrade script here';
ELSE IF @v = '11'
    PRINT 'run the 11 -> 13 upgrade script here';
-- ...and so on for each supported starting version
Depending on which upgrade paths you want to support, you could add more cases. You could also move this upgrade logic into C#, or use whatever design works for you. Either way, storing the database version in the database makes it much easier to work out what is already there than querying every database structure individually.
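As a rough sketch of how the version row can drive upgrades from application code, the example below chains one step per version instead of writing a separate script for every starting version. It uses Python with SQLite purely for illustration; the `MIGRATIONS` dictionary, the `sensor_log` table and the `upgrade` helper are hypothetical names, and only `DB_PROPERTIES` with its `key`/`value` columns comes from the answer above.

```python
import sqlite3

# Hypothetical upgrade steps: MIGRATIONS[n] upgrades a version-n database to n + 1.
MIGRATIONS = {
    11: ["ALTER TABLE sensor_log ADD COLUMN x4 REAL"],
    12: ["ALTER TABLE sensor_log ADD COLUMN x5 REAL"],
}
TARGET_VERSION = 13

def get_version(conn):
    row = conn.execute(
        "SELECT value FROM DB_PROPERTIES WHERE key = 'version'"
    ).fetchone()
    return int(row[0])

def upgrade(conn):
    """Apply each pending step in order and record the new version after it."""
    version = get_version(conn)
    while version < TARGET_VERSION:
        for statement in MIGRATIONS[version]:
            conn.execute(statement)
        version += 1
        conn.execute(
            "UPDATE DB_PROPERTIES SET value = ? WHERE key = 'version'",
            (str(version),),
        )
        conn.commit()
    return version

# Simulate a database that was created by version 11 of the program.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DB_PROPERTIES (key varchar(100), value varchar(100))")
conn.execute("INSERT INTO DB_PROPERTIES VALUES ('version', '11')")
conn.execute("CREATE TABLE sensor_log (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL, x3 REAL)")
final_version = upgrade(conn)
```

Chaining 11 to 12 to 13 means each release only has to ship the one new step, at the cost of running more statements on very old databases.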
If you have to build something in such a way as to rely on the application making table changes, your design is flawed. You should have a related table for the sensor values (x1, x2, etc.), then you can just add another record rather than having to create a new column.
Suggested child table structure
READINGS
ID int
Reading_type varchar (10)
Reading_Value int
Then data in the table would read:
ID Reading_type Reading_value
1 x1 2
1 x2 3
1 x3 1
2 x1 7
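To illustrate why this layout avoids schema changes, here is a small sketch in Python with SQLite (chosen only so the example is runnable; it is not the C#/SQL Server setup of the question). The `READINGS` table and its sample rows are taken from the answer; the `readings_for` helper and the extra `x5` row are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE READINGS (ID INTEGER, Reading_type VARCHAR(10), Reading_Value INTEGER)"
)
conn.executemany(
    "INSERT INTO READINGS VALUES (?, ?, ?)",
    [(1, "x1", 2), (1, "x2", 3), (1, "x3", 1), (2, "x1", 7)],
)

# A new kind of reading is just another row; no ALTER TABLE is needed.
conn.execute("INSERT INTO READINGS VALUES (?, ?, ?)", (2, "x5", 9))

def readings_for(conn, log_id):
    """Return the readings of one log entry as {reading_type: value}."""
    rows = conn.execute(
        "SELECT Reading_type, Reading_Value FROM READINGS WHERE ID = ? ORDER BY Reading_type",
        (log_id,),
    )
    return dict(rows.fetchall())

first = readings_for(conn, 1)
second = readings_for(conn, 2)
```

The trade-off is that reading one logical record now means collecting several rows, which the helper does on the client side here.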
Try Microsoft.SqlServer.Management.Smo
These are a set of C# classes that provide an API to SQL Server database objects.
The Microsoft.SqlServer.Management.Smo.Table has a Columns Collection that will allow you to query and manipulate the columns.
Have fun.
Related
I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service, (or even do the calculation on SQL Server) something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
    .SelectMany(i => i.InvoiceItems)
    .Select(ii => ii.Amount)
    .Sum();
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that should calculate what all the customers owe (sometimes go even higher on the hierarchy).
For this scenario I was thinking of adding an Amount field on my Invoice table and possibly an AmountOwed on my Customer table which will be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the T-SQL to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again: write it, test it, verify it. At this point you have all the code you need to implement your trigger.
Since this process involves changes to the item table, you will need to write triggers to handle all three types of DML statements: insert, update, and delete. Write a separate trigger for each to simplify your learning and debugging. Triggers have access to special tables; go learn about them. And go learn about the false assumption that a trigger works with a single row. It doesn't. Triggers must be written to work correctly whether 0 (yes, zero), 1, or many rows are affected.
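The two statements described in that first step might look roughly like this. The sketch runs against SQLite from Python so it can be executed as-is; the table and column names follow the abbreviated schema in the question, the `Amount` column on `Invoice` is the proposed addition, and the sample rows are invented. The same set-based `UPDATE` pattern is what a trigger body would later reuse, restricted to the affected invoices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Invoice (Id INTEGER PRIMARY KEY, CustomerId INTEGER, Amount NUMERIC DEFAULT 0);
CREATE TABLE InvoiceItem (Id INTEGER PRIMARY KEY, Amount NUMERIC,
                          InvoiceId INTEGER REFERENCES Invoice(Id));
INSERT INTO Invoice (Id, CustomerId) VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO InvoiceItem (Amount, InvoiceId) VALUES (5, 1), (7, 1), (3, 2);
""")

# Step 1: the query that computes the total per invoice.
totals = conn.execute(
    "SELECT InvoiceId, SUM(Amount) FROM InvoiceItem GROUP BY InvoiceId ORDER BY InvoiceId"
).fetchall()

# Step 2: a set-based update that stores that total on every invoice.
# COALESCE covers invoices that have no items at all.
conn.execute("""
UPDATE Invoice
SET Amount = (SELECT COALESCE(SUM(ii.Amount), 0)
              FROM InvoiceItem ii
              WHERE ii.InvoiceId = Invoice.Id)
""")
conn.commit()
amounts = conn.execute("SELECT Id, Amount FROM Invoice ORDER BY Id").fetchall()
```

Because the update recomputes the sum from the item rows rather than adjusting a running total, it gives the same result whether zero, one, or many items changed.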
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
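For comparison, calculating the value on demand is a single aggregate query. The sketch below (Python with SQLite, for runnability only) uses the tables from the question; treating an invoice as unpaid when `PaidOn` is NULL is an assumption, as are the sample data and the `amount_owed` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Invoice (Id INTEGER PRIMARY KEY, CreatedOn TEXT, PaidOn TEXT, CustomerId INTEGER);
CREATE TABLE InvoiceItem (Id INTEGER PRIMARY KEY, Amount NUMERIC, InvoiceId INTEGER);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO Invoice VALUES
    (1, '2018-01-05', NULL, 1),
    (2, '2018-01-09', '2018-02-01', 1),
    (3, '2018-03-01', NULL, 2);
INSERT INTO InvoiceItem (Amount, InvoiceId) VALUES (100, 1), (50, 1), (999, 2), (20, 3);
""")

def amount_owed(conn):
    """Total of all items on unpaid invoices, per customer (assumes PaidOn IS NULL = unpaid)."""
    query = """
        SELECT c.Name, COALESCE(SUM(ii.Amount), 0) AS Owed
        FROM Customer c
        LEFT JOIN Invoice i ON i.CustomerId = c.Id AND i.PaidOn IS NULL
        LEFT JOIN InvoiceItem ii ON ii.InvoiceId = i.Id
        GROUP BY c.Id, c.Name
        ORDER BY c.Name
    """
    return conn.execute(query).fetchall()

owed = amount_owed(conn)
```

Nothing has to be kept in sync here, which is the sense in which this is the safest option; whether it is fast enough depends on data volume and indexing.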
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in many ways: date, location, type, customer, etc. At this level you are approaching data warehouse functionality, which is not appropriate for an OLTP system.
First of all, never use triggers for business logic. Triggers are tricky and easy to forget about, which makes such an application hard to maintain.
In most cases you can easily populate your reporting data via Entity Framework or a SQL query. But if that requires lots of joins, consider using staging tables, because reporting calls for denormalized data. To populate the staging tables you can use SQL jobs or another scheduling mechanism (Azure Scheduler, maybe). This way you won't need to work with lots of joins, and your reports will populate faster.
This past week I was tasked with moving a PHP based database to a new SQL database. There are a handful of requirements, but one of those was using ASP.Net MVC to connect to the SQL database...and I have never used ASP.Net or MVC.
I have successfully moved the database to SQL and have the foundation of the ASP site set up (after spending many hours poring over tutorials). The issue I am having now is that one of the pages is meant to display a handful of fields (User_Name, Work_Date, Work_Description, Work_Location, etc.), but the only way of grabbing all of those fields is by combining two of the tables. Furthermore, I am required to let the user search the combined table for any rows that fall within a user-supplied date range.
I have tried having a basic table set up that displays the correct fields and have implemented a search bar...but that only allows me to search by a single date, not a range. I have also tried to use GridView with its Query Builder feature to grab the data fields I needed (which worked really well), but I can't figure out how to attach textboxes/buttons to the newly made GridView. Using a single table with GridView works perfectly and using textboxes/buttons is very intuitive. I just can't seem to make the same connection with a joined view.
So I suppose my question is this: what is the best way for me to combine these two tables while also still having the ability to perform searches on the displayed data? If I could build this database from scratch I would have just made a table with the relevant data attached to it, but because this is derived from a previously made database it has 12+ years of information that I need to dump into it.
Any help would be greatly appreciated. I am kind of dead in the water here. My inexperience with these systems is getting the better of me. I could post the code that I have, but I am mainly interested in my options and then I can do the research on my own.
Thanks!
It's difficult to offer definitive answers to your questions due to the need for guesswork.
But here are some hints.
You can say WHERE datestamp >= '2017-01-01' AND datestamp < '2018-01-01' to filter all the rows in calendar year 2017. Many variations on this sort of date range filter are available.
Your first table probably has some kind of ID number on each row. Let's call it first.first_id. Your second table probably has its own id, let's call it second.second_id. And, it probably has another id that identifies a row in your first table, let's call it second.first_id. That second.first_id is called a foreign key in the second table to the first table. There can be any number of rows in your second table corresponding to your first table via this foreign key.
If this is the case you can do something like this:
SELECT first.datestamp, first.val1, first.val2, second.val1, second.val2
FROM first
JOIN second ON first.first_id = second.first_id
WHERE first.datestamp >= '2018-06-01' AND first.datestamp < '2018-07-01'
AND (first.val1 = 'some search term' OR second.val1 = 'some search term')
ORDER BY first.datestamp
This makes a virtual table by joining together your two physical tables (FROM...JOIN...).
Then it filters the rows you want from that virtual table (WHERE ...).
Then it puts them in the order you want (ORDER BY...).
Finally, it chooses the columns from the virtual table you want in your result set (SELECT ...).
SQL database servers (MySQL, SQL Server, PostgreSQL, Oracle and the rest) are very smart about doing this sort of thing efficiently.
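Putting the join and the date-range filter together with user-supplied bounds might look like the sketch below. It is written in Python against SQLite only so that it runs as-is; the table names `users` and `work`, the `user_id` key and the sample rows are guesses, while the displayed field names come from the question. The point that carries over to ASP.NET is that the two dates are passed as query parameters rather than concatenated into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, User_Name TEXT);
CREATE TABLE work (
    work_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(user_id),
    Work_Date TEXT, Work_Description TEXT, Work_Location TEXT
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO work VALUES
    (1, 1, '2018-05-30', 'Inspection', 'Site A'),
    (2, 1, '2018-06-15', 'Repair', 'Site B'),
    (3, 2, '2018-06-30', 'Survey', 'Site A'),
    (4, 2, '2018-07-01', 'Repair', 'Site C');
""")

def search_work(conn, start_date, end_date):
    """Joined rows with start_date <= Work_Date < end_date (half-open range)."""
    query = """
        SELECT u.User_Name, w.Work_Date, w.Work_Description, w.Work_Location
        FROM users u
        JOIN work w ON w.user_id = u.user_id
        WHERE w.Work_Date >= ? AND w.Work_Date < ?
        ORDER BY w.Work_Date
    """
    return conn.execute(query, (start_date, end_date)).fetchall()

june = search_work(conn, "2018-06-01", "2018-07-01")
```

The half-open range (`>=` start, `<` end) mirrors the hint above and avoids having to work out the last moment of the final day.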
In Azure we have four shards, and I want to remove two of them because we no longer need them. Their data should be merged into the other two shards.
I use a list shard map with GUIDs as keys to identify the shard (in our application the key is the UserId).
In the tutorials I only found samples that merge shards of the range type.
Is there a faster way to merge this type of shard, or do I have to write my own tool for this?
If the merge is performed automatically what will for example happen in the following case:
The GUID to identify the Shard is the UserId, now this data is moved from Shard A to Shard B. There is another Table called Comments which has the UserId as ForeignKey. The PrimaryKey in this Table is a classic numeric auto increment value. What will happen to those values if they are moved from Shard A to Shard B? Will they be inserted and a new ID is assigned to them or will this not work at all?
There is also some local file storage involved that uses IDs in the path, so I think I will have to write my own tool anyway.
For that I took a look at the ShardMapManager but did not fully understand how it works. In the ShardMappingsGlobal table there is a column called MappingId, but this is not the Guid/UserId that is stored in the shard database. How do I get the actual Guid that is used to identify the shard, in my case the UserId?
I also did not find methods to move data between shards.
What I would do now is transfer the data between the shards myself with a tool and then use the ListShardMap.UpdateMapping method to set a new shard for the value.
At the end of the operation I would use ListShardMap.DeleteShard. Or is there a better way to do this?
EDIT:
I wrote my own tool to merge the shards, but I now get a strange exception. Here is some code:
Guid userKey = Guid.Parse(userId);
ListShardMap<Guid> map = GetUserShardMap<Guid>();
try
{
    PointMapping<Guid> currentMapping = map.GetMappingForKey(userKey);
    PointMapping<Guid> mappingOffline = map.UpdateMapping(currentMapping, new PointMappingUpdate()
    {
        Status = MappingStatus.Offline
    });
}
The UpdateMapping causes the following exception:
Store Error: Error 515, Level 16, State 2, Procedure __ShardManagement.spBulkOperationShardMappingsLocal, Line 98, Message: Cannot insert the value NULL into column 'LockOwnerId', table __ShardManagement.ShardMappingsLocal
I do not understand why there is an insert at all. I checked for the MappingId in the local and global shard mapping tables, and the mapping is there, so in my opinion no insert should be required. I also took a look at the code of the stored procedure mentioned in the error, spBulkOperationShardMappingsLocal, here: https://github.com/Azure/elastic-db-tools/blob/master/Src/ElasticScale.Client/ShardManagement/Scripts/UpgradeShardMapManagerLocalFrom1.1To1.2.sql
In the insert statement the LockOwnerId is not passed as a parameter, so it can only fail.
Currently I am working with a test setup, because of course I do not want to experiment on the production system. Maybe I made a mistake there, but to me everything looks good. I would be very grateful for any hint regarding this error.
In the tutorials i only found samples to merge Shards with the Range type. Is there a way to merge these type of shards in a faster way or do i have to write my own tool for this?
Yes, the Split-Merge tool can move data in both range and list shard maps. For a list shard map you can issue a shardlet move request for each key. Unfortunately the Split-Merge tool is complicated to set up; the last time, it took me around an hour to configure. I know this is not great, so I'll leave it to you to decide whether writing your own custom version would take more or less time.
There is another Table called Comments which has the UserId as ForeignKey. The PrimaryKey in this Table is a classic numeric auto increment value. What will happen to those values if they are moved from Shard A to Shard B? Will they be inserted and a new ID is assigned to them or will this not work at all?
The values of autoincrement columns are not copied over; they are regenerated at the destination. So new IDs will be assigned to these rows.
For that I took a look at the ShardMapManager but did not fully understand how it works. In the ShardMappingsGlobal Table is a Column called MappingId. But this is not the Guid/UserId which is stored in the Shard Database. How do i get the actual Guid which is used to identify the shard, in my case the UserId?
I would strongly suggest not trying to edit the ShardMapManager tables on your own; it is very easy to mess them up. Editing the ShardMapManager tables is precisely what the Elastic Database Tools library is designed to do.
You can update the metadata for a mapping by using the ListShardMap.UpdatePointMapping method. Just to be clear, this only updates the ShardMapManager tables' knowledge of where the data for the key should be. Actually moving the data must be done by a higher layer.
This is a high-level summary of what the Split-Merge service does:
1. Lock the mapping to prevent concurrent updates from another shard map management operation.
2. Mark the mapping offline with ListShardMap.UpdatePointMapping. This prevents data-dependent routing with OpenConnectionForKey from accessing data with that key. It also kills all current sessions on the shard to force them to reconnect, which ensures that there are no active connections operating on data with the now-offline key.
3. Move the underlying data, using the shard map's SchemaInfo to determine which tables need to be moved.
4. Update the mapping and mark it online with ListShardMap.UpdatePointMapping.
5. Unlock the mapping.
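The sequence above can be modelled in a few lines to make the ordering concrete. The sketch below is a toy simulation in Python: it does not use or imitate the Elastic Database Tools API. The `mapping` dictionary, the `move_key` function and the two in-memory SQLite "shards" are all hypothetical stand-ins, and session killing and SchemaInfo are left out. It also shows the earlier point that auto-increment values are regenerated at the destination.

```python
import sqlite3

def make_shard():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE Comments (Id INTEGER PRIMARY KEY AUTOINCREMENT, UserId TEXT, Body TEXT)"
    )
    return db

shards = {"A": make_shard(), "B": make_shard()}
shards["A"].executemany(
    "INSERT INTO Comments (UserId, Body) VALUES (?, ?)",
    [("user-1", "first"), ("user-1", "second")],
)
shards["B"].execute("INSERT INTO Comments (UserId, Body) VALUES ('user-2', 'hello')")

# Hypothetical stand-in for the shard map: key -> location, status and lock flag.
mapping = {"user-1": {"shard": "A", "status": "online", "locked": False}}

def move_key(key, target):
    entry = mapping[key]
    if entry["locked"]:
        raise RuntimeError("mapping is locked by another operation")
    entry["locked"] = True                       # 1. lock the mapping
    try:
        entry["status"] = "offline"              # 2. take the key offline
        source_db, target_db = shards[entry["shard"]], shards[target]
        rows = source_db.execute(                # 3. move the underlying data
            "SELECT UserId, Body FROM Comments WHERE UserId = ? ORDER BY Id", (key,)
        ).fetchall()
        # Id is not copied, so the destination assigns fresh values.
        target_db.executemany("INSERT INTO Comments (UserId, Body) VALUES (?, ?)", rows)
        source_db.execute("DELETE FROM Comments WHERE UserId = ?", (key,))
        target_db.commit()
        source_db.commit()
        entry["shard"] = target                  # 4. update the mapping, back online
        entry["status"] = "online"
    finally:
        entry["locked"] = False                  # 5. unlock the mapping

move_key("user-1", "B")
moved = shards["B"].execute("SELECT Id, UserId FROM Comments ORDER BY Id").fetchall()
```

Anything that stores the old comment IDs elsewhere (such as file paths) would have to be remapped as part of the move, since the IDs change.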
I have a problem with application performance. I have many tables, each with millions of records. I run select statements over them using joins, where clauses and order by on different criteria (specified by the user at runtime). I want my records paged, but no matter what I do with my SQL statements I cannot match the performance of serving pages directly from memory. The problem appears when I have to filter records by criteria specified dynamically at runtime. I have tried the ROW_NUMBER() function combined with a "where RowNo between" clause, CTEs, temp tables, and so on. Those SQL solutions perform well only if I don't include filtering. Keep in mind that I want my solution to be as generic as possible (imagine that my app has several lists that each present millions of paged records, and those records are built with very complex SQL statements).
All my tables have a primary key of type INT.
So I came up with an idea: why not create a "server" only for select statements? The server first loads all records from all tables and stores them in HashSet<T> collections, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented so that two records are "equal" only if their Ids are equal (don't scream; you will see later why I am not using all the record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution that loads only differential changes. So I invented a changelog table for each table that I want to cache. Into this changelog I only insert rows that mark dirty rows (updates or deletes) and that record newly inserted ids; this whole mechanism is implemented using triggers. Whenever an in-memory select comes in, I first check whether I must sync something (by querying the changelog). If something must be applied, I load the changelog, apply those changes in memory, and finally clear the changelog (or maybe remember the highest changelog id that I have applied ...).
To be able to apply the changelog in O(N), where N is the changelog size, I am using this algorithm:
For each log entry:
- identify my in-memory Dictionary<int, T>, where the key is the primary key;
- if it's a delete log, call dictionary.Remove(id) (O(1));
- if it's an update log, also call dictionary.Remove(id) (O(1)) and move this id into a "to be inserted" collection;
- if it's an insert log, move this id into the "to be inserted" collection.
Finally, refresh the cache by selecting all data from the corresponding table where Id in ("to be inserted").
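The changelog pass described above can be sketched as follows, in Python rather than the asker's C#. The function name `apply_changelog`, the `(operation, id)` log format and the `load_rows` callback are assumptions for the example. One detail goes beyond the description: a delete also drops the id from the "to be inserted" set, so an insert followed by a delete in the same batch does not trigger a pointless reload.

```python
def apply_changelog(cache, changelog, load_rows):
    """Apply (operation, row_id) entries to cache in one pass over the log.

    cache     -- dict mapping primary key -> record
    changelog -- iterable of ("insert" | "update" | "delete", row_id), in log order
    load_rows -- callable taking a set of ids and returning {id: record}
                 for the ids that currently exist in the table
    """
    to_load = set()
    for op, row_id in changelog:
        if op == "delete":
            cache.pop(row_id, None)      # O(1) removal
            to_load.discard(row_id)      # added detail: nothing left to reload
        elif op == "update":
            cache.pop(row_id, None)      # stale copy goes away
            to_load.add(row_id)          # fresh copy is fetched below
        elif op == "insert":
            to_load.add(row_id)
        else:
            raise ValueError(f"unknown changelog operation: {op!r}")
    if to_load:
        # One round trip: SELECT ... WHERE Id IN (to_load)
        cache.update(load_rows(to_load))
    return cache

# Example with a plain dict standing in for the database table.
table = {1: "a", 2: "b2", 4: "d"}        # current state of the table
cache = {1: "a", 2: "b", 3: "c"}         # stale in-memory copy
changelog = [("update", 2), ("delete", 3), ("insert", 4), ("insert", 5), ("delete", 5)]
result = apply_changelog(cache, changelog, lambda ids: {i: table[i] for i in ids if i in table})
```

After the pass the cache matches the table again, and the cost is proportional to the log length plus one batched reload.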
For filtering, I compile expression trees into Func<T, List<FilterCriterias>, bool> functors. With this mechanism my filtering is much faster than SQL.
I know that SQL Server 2012 has caching support and that the upcoming version will support even more, but my client has SQL Server 2005, so I can't benefit from any of this.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job. I think it is nearly impossible to outsmart it.
Unless your data has some kind of implicit structure that might help to speed things up and that the optimizer cannot be aware of, such "I do my own speedy trick" approaches normally won't help.
Performance problems should always be solved first where they occur:
the tables structures and relations
indexes and statistics
quality of SQL statements
Even many million rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested list with recursive reads, XML...) I'd go the Data-Warehouse-Path and write some denormalized tables for quick selects. Of course you will have to deal with the fact, that you are reading "old" data. If your data does not change much, you could trigger all changes to a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your poorly performing queries together with the relevant structure details and ask for a review. There are dedicated groups on Stack Exchange, such as "Code Review". If it's not too big, you might try it here as well...
I'm creating a data-entry application where users are allowed to create the entry schema.
My first version of this just created a single table per entry schema with each entry spanning a single or multiple columns (for complex types) with the appropriate data type. This allowed for "fast" querying (on small datasets as I didn't index all columns) and simple synchronization where the data-entry was distributed on several databases.
I'm not quite happy with this solution though; the only positive thing is the simplicity...
I can only store a fixed number of columns. I need to create indexes on all columns. I need to recreate the table on schema changes.
Some of my key design criteria are:
Very fast querying (using a simple domain-specific query language)
Writes don't have to be fast
Many concurrent users
Schemas will change often
Schemas might contain many thousands of columns
The data entries might be distributed and need synchronization.
Preferably MySQL and SQLite; databases like DB2 and Oracle are out of the question.
Using .Net/Mono
I've been thinking of a couple of possible designs, but none of them seems like a good choice.
Solution 1: Union like table containing a Type column and one nullable column per type.
This avoids joins, but will definitely use a lot of space.
Solution 2: Key/value store. All values are stored as strings and converted when needed.
This also uses a lot of space, and of course I hate having to convert everything to strings.
Solution 3: Use an xml database or store values as xml.
Without any experience I would think this is quite slow (at least for the relational model unless there is some very good xpath support).
I also would like to avoid an xml database as other parts of the application fits better as a relational model, and being able to join the data is helpful.
I cannot help thinking that someone has solved (some of) this already, but I'm unable to find anything. I'm not quite sure what to search for either...
I know market research does something like this for questionnaires, but there are few open source implementations, and the ones I've found don't quite fit the bill.
PSPP has much of the logic I'm thinking of: primitive column types, many columns, many rows, fast querying and merging. Too bad it doesn't work against a database. And of course, I don't need 99% of the functionality it provides, but I do need a lot of things it does not include.
I'm not sure this is the right place to ask such a design-related question, but I hope someone here has some tips, knows of any existing work, or can point me to a better place to ask.
Thanks in advance!
Have you already considered the most trivial solution: one table for each of your data types, with the schema of your dataset stored in the database as well? The simplest version:
DATASET table (a virtual "table")
  ID - primary key
  Name - name of the dataset/table
COLUMNSCHEMA table (specifies the columns of one "dataset")
  DATASETID - int (reference to the DATASET table)
  COLID - smallint (unique number of the column)
  Name - varchar
  DataType - ("varchar", "int", whatever)
Row table
  DATASETID
  ID - unique id of the "row"
ColumnData table (one for each data type)
  ROWID - int (reference to the Row table)
  COLID - smallint
  DATA - (varchar/int/whatever)
To query a dataset (a virtual table), you must then dynamically construct a SQL statement using the schema information in COLUMNSCHEMA table.
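A minimal sketch of that dynamic construction follows, in Python with SQLite so that it runs as-is. Several names are assumptions rather than part of the answer: the row table is called `DATAROW`, the per-type tables `COLUMNDATA_INT` and `COLUMNDATA_VARCHAR`, and the reference column `ROW_ID` (SQLite treats `ROWID` specially, so the sketch avoids that name). The `build_query` helper and the sample dataset are likewise invented.

```python
import sqlite3

# Hypothetical mapping from COLUMNSCHEMA.DataType to the per-type data table.
DATA_TABLES = {"int": "COLUMNDATA_INT", "varchar": "COLUMNDATA_VARCHAR"}

def build_query(conn, dataset_id):
    """Build one SELECT that presents a dataset as an ordinary wide table."""
    columns = conn.execute(
        "SELECT COLID, Name, DataType FROM COLUMNSCHEMA WHERE DATASETID = ? ORDER BY COLID",
        (dataset_id,),
    ).fetchall()
    select_parts, joins = ["r.ID"], []
    for col_id, name, data_type in columns:
        alias = f"c{int(col_id)}"
        table = DATA_TABLES[data_type]       # whitelist lookup; unknown types raise KeyError
        quoted = name.replace('"', '""')     # escape quotes in user-chosen column names
        select_parts.append(f'{alias}.DATA AS "{quoted}"')
        joins.append(
            f"LEFT JOIN {table} {alias} "
            f"ON {alias}.ROW_ID = r.ID AND {alias}.COLID = {int(col_id)}"
        )
    return (
        f"SELECT {', '.join(select_parts)} FROM DATAROW r "
        f"{' '.join(joins)} WHERE r.DATASETID = ? ORDER BY r.ID"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DATASET (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE COLUMNSCHEMA (DATASETID INTEGER, COLID INTEGER, Name TEXT, DataType TEXT);
CREATE TABLE DATAROW (DATASETID INTEGER, ID INTEGER PRIMARY KEY);
CREATE TABLE COLUMNDATA_INT (ROW_ID INTEGER, COLID INTEGER, DATA INTEGER);
CREATE TABLE COLUMNDATA_VARCHAR (ROW_ID INTEGER, COLID INTEGER, DATA TEXT);
INSERT INTO DATASET VALUES (1, 'people');
INSERT INTO COLUMNSCHEMA VALUES (1, 1, 'age', 'int'), (1, 2, 'city', 'varchar');
INSERT INTO DATAROW VALUES (1, 1), (1, 2);
INSERT INTO COLUMNDATA_INT VALUES (1, 1, 30);
INSERT INTO COLUMNDATA_VARCHAR VALUES (1, 2, 'Oslo'), (2, 2, 'Rome');
""")
rows = conn.execute(build_query(conn, 1), (1,)).fetchall()
```

Each virtual column costs one join, so a dataset with thousands of columns would need the query restricted to the columns actually requested.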