I'm creating a data-entry application where users are allowed to create the entry schema.
My first version of this just created a single table per entry schema, with each entry spanning one or more columns (for complex types) of the appropriate data type. This allowed for "fast" querying (on small datasets, as I didn't index all columns) and simple synchronization when the data entry was distributed across several databases.
I'm not quite happy with this solution though; the only positive thing is the simplicity...
I can only store a fixed number of columns. I need to create indexes on all columns. I need to recreate the table on schema changes.
Some of my key design criteria are:
Very fast querying (Using a simple domain specific query language)
Writes don't have to be fast
Many concurrent users
Schemas will change often
Schemas might contain many thousands of columns
The data entries might be distributed and need synchronization.
Preferably MySQL and SQLite - databases like DB2 and Oracle are out of the question.
Using .Net/Mono
I've been thinking of a couple of possible designs, but none of them seems like a good choice.
Solution 1: A union-like table containing a Type column and one nullable column per type.
This avoids joins, but will definitely use a lot of space.
Solution 2: Key/value store. All values are stored as string and converted when needed.
This also uses a lot of space, and of course, I hate having to convert everything to strings.
Solution 3: Use an xml database or store values as xml.
Without any experience with them, I would guess this is quite slow (at least when mapped onto the relational model, unless there is very good XPath support).
I would also like to avoid an XML database, as other parts of the application fit better with a relational model, and being able to join the data is helpful.
I can't help thinking that someone has solved (some of) this already, but I'm unable to find anything. Not quite sure what to search for either...
I know market research does something like this for their questionnaires, but there are few open source implementations, and the ones I've found don't quite fit the bill.
PSPP has much of the logic I'm thinking of: primitive column types, many columns, many rows, fast querying and merging. Too bad it doesn't work against a database... And of course, I don't need 99% of the functionality it provides, but need a lot of stuff it doesn't include.
I'm not sure this is the right place to ask such a design-related question, but I hope someone here has some tips, knows of existing work, or can point me to a better place to ask.
Thanks in advance!
Have you already considered the most trivial solution: having one table for each of your data types and storing the schema of your dataset in the database as well? The simplest version:
DATASET Table (Virtual "table")
ID - primary key
Name - Name for the dataset/table
COLUMNSCHEMA Table (specifies the columns for one "dataset")
DATASETID - int (reference to Dataset-table)
COLID - smallint (unique # of the column)
Name - varchar
DataType - ("varchar", "int", whatever)
Row Table
DATASETID
ID - Unique id for the "row"
ColumnData Table (one for each datatype)
ROWID - int (reference to Row-table)
COLID - smallint
DATA - (varchar/int/whatever)
To query a dataset (a virtual table), you then dynamically construct a SQL statement using the schema information in the COLUMNSCHEMA table.
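For example, the generated statement for a dataset with a varchar column "Name" (COLID 1) and an int column "Age" (COLID 2) might look roughly like this (the per-type table names ColumnDataVarchar and ColumnDataInt, and the id values, are assumptions):

SELECT r.ID,
       cv.DATA AS Name,
       ci.DATA AS Age
FROM Row r  -- you may need to quote the Row table name on some engines
LEFT JOIN ColumnDataVarchar cv ON cv.ROWID = r.ID AND cv.COLID = 1
LEFT JOIN ColumnDataInt ci ON ci.ROWID = r.ID AND ci.COLID = 2
WHERE r.DATASETID = 1;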
At the risk of over-explaining my question, I'm going to err on the side of too much information.
I am creating a bulk upload process that inserts data into two tables. The two tables look roughly as follows. Parts is a self-referencing table that allows N levels of reference.
Parts (self-referencing table)
--------
PartId (PK Int Non-Auto-Incrementing)
DescriptionId (Fk)
ParentPartId
HierarchyNode (HierarchyId)
SourcePartId (VARCHAR(500) a unique Part Id from the source)
(other columns)
Description
--------
DescriptionId (PK Int Non-Auto-Incrementing)
Language (PK either 'EN' or 'JA')
DescriptionText (varchar(max))
(I should note too that there are other tables that will reference our PartID that I'm leaving out of this for now.)
In Description, the combination of DescriptionText and Language will be unique, but a given DescriptionId will always have at least two rows (one per language).
Now, for the bulk upload process, I created two staging tables that look a lot like Parts and Description but don't have any PK's, Indexes, etc. They are Parts_Staging and Description_Staging.
In Parts_Staging there is an extra column that contains a Hierarchy Node String, which is the HierarchyNode in this kind of format: /1/2/3/ etc. Then when data is copied from the _Staging table to the actual table, I use a CAST(Source.Column AS hierarchyid).
Because of the complexity of the IDs shared across the two tables, the self-referencing IDs and the hierarchyid in Parts, and the number of rows to be inserted (possibly in the 100,000s), I decided to compile 100% of the data in a C# model first, including the PK IDs. So the process looks like this in C#:
Query the two tables for MAX ID
Using those max IDs, compile a complete model of all the data for both tables (including the hierarchyid /1/2/3/)
Do a bulk insert into both _Staging Tables
Trigger a SP that copies non-duplicate data from the two _Staging tables into the actual tables. (This is where the CAST(Source.Column AS hierarchyid) happens).
We are importing lots of parts books, and a single part may be replicated across multiple books. We need to remove the duplicates. In step 4, duplicates are weeded out by checking the SourcePartId in the Parts table and the description text in the DescriptionText column of the Description table.
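A rough sketch of what the Parts side of that copy might look like (the staging column name HierarchyNodeString is an assumption; Description is handled the same way):

INSERT INTO Parts (PartId, DescriptionId, ParentPartId, HierarchyNode, SourcePartId)
SELECT s.PartId,
       s.DescriptionId,
       s.ParentPartId,
       CAST(s.HierarchyNodeString AS hierarchyid),
       s.SourcePartId
FROM Parts_Staging s
-- skip parts whose SourcePartId already exists in the target table
WHERE NOT EXISTS (SELECT 1 FROM Parts p WHERE p.SourcePartId = s.SourcePartId);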
That entire process works beautifully! And best of all, it's really fast. But, if you are reading this carefully (and I thank you if you are), then you have already noticed one glaring, obvious problem.
If multiple processes are happening at the same time (and that absolutely WILL happen!) then there is a very real risk of getting the ID's mixed up and the data becoming really corrupted. Process1 could do the GET MAX ID query and before it manages to finish, Process2 could also do a GET MAX ID query, and because Process1 hasn't actually written to the tables yet, it would get the same ID's.
My original thought was to use a SEQUENCE object. And at first, that plan seemed to be brilliant. But it fell apart in testing because it's entirely possible that the same data will be processed more than once and eventually ignored when the copy happens from the _Staging tables to the final tables. And in that case, the SEQUENCE numbers will already be claimed and used, resulting in giant gaps in the ID's. Not that this is a fatal flaw, but it's an issue we would rather avoid.
So... that was a LOT of background info to ask this actual question. What I'm thinking of doing is this:
Lock both of the tables in question
Steps 1-4 as outlined above
Unlock both of the tables.
The lock would need to block reads as well (which I think means an exclusive lock?) so that if another process attempts to do the GET MAX ID query, it will have to wait.
My question is: 1) Is this the best approach? And 2) How does one place an Exclusive lock on a table?
Thanks!
I'm not sure about the best approach, but in terms of placing an 'exclusive' lock on a table, simply using WITH (TABLOCKX) in your query will put one on the table.
If you wish to learn more about it:
https://msdn.microsoft.com/en-GB/library/ms187373.aspx
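For what it's worth, a minimal sketch of how that could look wrapped around the existing steps (the variable names and the single-transaction scope are assumptions, not something the answer spells out):

BEGIN TRANSACTION;

DECLARE @maxPartId INT, @maxDescriptionId INT;

-- TABLOCKX takes an exclusive lock on each table; HOLDLOCK keeps it until COMMIT,
-- so another process running the same MAX query has to wait.
SELECT @maxPartId = MAX(PartId) FROM Parts WITH (TABLOCKX, HOLDLOCK);
SELECT @maxDescriptionId = MAX(DescriptionId) FROM Description WITH (TABLOCKX, HOLDLOCK);

-- ... bulk insert into the _Staging tables and copy into the real tables (steps 2-4) ...

COMMIT TRANSACTION;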
I have to create a database structure. I have a question about foreign keys and good practice:
I have a table which must have a field that can be two different string values, either "A" or "B".
It cannot be anything else (therefore, I cannot just use a plain string field).
What is the best way to design this table:
1) create an int field which is a foreign key to another table with just two records, one for the string "A" and one for the string "B"
2) create an int field and then, in my application, create an enumeration such as this:
public enum StringAllowedValues
{
A = 1,
B
}
3) ???
In advance, thanks for your time.
Edit: 13 minutes later and I get all this awesome feedback. Thank you all for the ideas and insight.
Many database engines support enumerations as a data type. And there are, indeed, cases where an enumeration is the right design solution.
However...
There are two requirements which may decide that a foreign key to a separate table is better.
The first is: it may be necessary to increase the number of valid options in that column. In most cases, you want to do this without a software deployment; enumerations are "baked in", so in this case, a table into which you can write new data is much more efficient.
The second is: the application needs to reason about the values in this column, in ways that may go beyond "A" or "B". For instance, "A" may be greater/older/more expensive than "B", or there is some other attribute to A that you want to present to the end user, or A is short-hand for something.
In this case, it is much better to explicitly model this as columns in a table, instead of baking this knowledge into your queries.
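A minimal sketch of that approach (every name here is made up for illustration):

CREATE TABLE OptionType (
    OptionTypeId INT PRIMARY KEY,
    Code         CHAR(1)      NOT NULL UNIQUE,  -- 'A' or 'B'
    Description  VARCHAR(100) NOT NULL,         -- extra attributes live in the table,
    SortOrder    INT          NOT NULL          -- not baked into queries or code
);

INSERT INTO OptionType (OptionTypeId, Code, Description, SortOrder)
VALUES (1, 'A', 'Meaning of A', 1),
       (2, 'B', 'Meaning of B', 2);

CREATE TABLE MainRecord (
    MainRecordId INT PRIMARY KEY,
    OptionTypeId INT NOT NULL REFERENCES OptionType (OptionTypeId)
    -- ... the rest of your columns ...
);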
In 30 years of working with databases, I personally have never found a case where an enumeration was the right decision....
Create a secondary table with the meanings of these integer codes. There's nothing that compels you to JOIN that in, but if you need to, that data is there. Within your C# code you can still use an enum to look things up, but try to keep that in sync with what's in the database, or vice versa. One of those should be authoritative.
In practice you'll often find that short strings are easier to work with than rigid enums. In the 1990s, when computers were slow and disk space was scarce, you had to do things like this to get reasonable performance. Now it's not really an issue, even on tables with hundreds of millions of rows.
I have four types of data in a SQL Server database table: forum topics, article topics, chat topics and QnA topics. They all have the same columns: ID, Title, Content, User, Type, etc. The only difference is the Type column, which is used to detect whether the current content is a forum topic (type = 0), an article topic (type = 1), and so on.
My colleagues said it would be better to store them in separate tables, namely ForumTopics, Articles, Chats and QnAs. But in my view it's not a good idea, because the C# methods that work on this content will be different, and I would either have to write multiple functions with the same logic for each operation for each table, or add a conditional check in one function for whether it's a forum topic (type = 0), an article topic (type = 1), or something else.
Please tell me which is the better approach.
One table is the better approach because it will give you flexibility in the future. You will be able to do things like the following:
Select everything for a particular user
Search something in all titles
Besides, multiple tables are harder to maintain, and you are right: with multiple tables there will be more complexity and repetition in your C# code as well.
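For example, both of those queries stay trivial with a single table (the table name Topics is an assumption; @userId and @term are placeholders):

-- Everything for a particular user, regardless of content type
SELECT ID, Title, Type FROM Topics WHERE [User] = @userId;

-- Search something in all titles at once
SELECT ID, Title, Type FROM Topics WHERE Title LIKE '%' + @term + '%';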
Using one table is the better way, because data stored in separate tables is harder to maintain and you have to write complex queries.
If you use multiple tables you have to use joins or subqueries to retrieve the data, which hurts performance.
So go with a single table.
I'm going to be creating competitions on the current site I'm working on. No two competitions will be the same, and each may have a varying number of input fields that a user must fill in to enter the competition, e.g.
Competition 1 might just require a firstname
Competition 2 might require a firstname, lastname and email address.
I will also be building a tool to observe these entries so that I can look at each individual entry.
My question is what is the best way to store an arbitrary number of fields? I was thinking of two options, one being to write each entry to a CSV file containing all the entries of the competition, the other being to have a db table with a varchar field in the database that just stores an entire entry as text. Both of these methods seem messy, is there any common practice for this sort of task?
I could in theory create a db table with a column for every possible field, but it won't work when the competition has specific requirements such as "Tell us in 100 words why..." or "Enter your 5 favourite things that.."
ANSWERED:
I have decided to use the method described below where there are multiple generic columns that can be utilized for different purposes per competition.
Initially I was going to use EAV, and I still think it might be slightly more appropriate for this specific scenario. But it is generally recommended against because of its poor scalability and complicated querying, and I wouldn't want to get into the habit of using it. Both answers worked absolutely fine in my tests.
I think you are right to be cautious about EAV as it will make your code a bit more complex, and it will be a bit more difficult to do ad-hoc queries against the table.
I've seen many enterprise apps simply adopt something like the following schema -
t_Comp_Data
-----------
CompId
Name
Surname
Email
Field1
Field2
Field3
...
Fieldn
In this instance, the generic fields (Field1 etc) mean different things for the different competitions. For ease of querying, you might create a different view for each competition, with the proper field names aliased in.
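For instance, a view for the second example competition might look like this (the competition id and the meanings of Field1/Field2 are made up to match the question's examples):

CREATE VIEW v_Competition2_Entries AS
SELECT CompId,
       Name    AS FirstName,
       Surname AS LastName,
       Email,
       Field1  AS WhyYouShouldWin,  -- "Tell us in 100 words why..."
       Field2  AS FavouriteThings   -- "Enter your 5 favourite things that..."
FROM t_Comp_Data
WHERE CompId = 2;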
I'm usually hesitant to use it, but this looks like a good situation for the Entity-attribute-value model if you use a database.
Basically, you have a CompetitionEntry (entity) table with the standard fields which make up every entry (Competition_id, maybe dates, etc.), and then a CompetitionEntryAttribute table with CompetitionEntry_id, Attribute and Value. You probably also want another table with template attributes for each competition for creating new entries.
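A rough sketch of those two tables (the column types and the date column are assumptions):

CREATE TABLE CompetitionEntry (
    CompetitionEntry_id INT PRIMARY KEY,
    Competition_id      INT NOT NULL,
    EntryDate           DATETIME NOT NULL
);

CREATE TABLE CompetitionEntryAttribute (
    CompetitionEntry_id INT           NOT NULL REFERENCES CompetitionEntry (CompetitionEntry_id),
    Attribute           VARCHAR(100)  NOT NULL,
    Value               NVARCHAR(MAX) NULL,
    PRIMARY KEY (CompetitionEntry_id, Attribute)
);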
Unfortunately you will only be able to store one datatype, which will likely have to be a large nvarchar.
Another disadvantage is the difficulty to query against EAV databases.
Another option is to create one table per competition (possibly in code as part of the competition creation), but depending on the number of competitions this may be impractical.
I want to allow the user to add columns to a table in the UI.
The UI: Column Name: ______ Column Type: Number/String/Date
My Question is how to build the SQL tables and C# objects so the implementation will be efficient and scalable.
My thought is to build two SQL tables:
TBL 1 - ColumnsDefinition:
ColId, ColName, ColType[Text]
TBL 2 - ColumnsValues:
RowId, ColId, Value [Text]
I want the solution to be efficient in DB space,
and I want to allow the user to sort the dynamic columns.
I work on .NET 3.5 / SQL Server 2008.
Thanks.
I believe that is essentially how the WebParts.SqlPersonalizationProvider works, which doesn't necessarily mean it's the best, but does mean that after some smart people thought about it for a while, that's what they came up with.
Sorting on a given field will be a bit tricky, particularly if the field's text needs non-text sorting (e.g., if you want "2" to come before "10").
I'd suggest that from C# you do one query on ColumnsDefinition and, based on that, choose one of several different queries for selecting/sorting the data.
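For example, after reading the ColType of the chosen sort column from ColumnsDefinition, the generated query might be one of these (a sketch; the DECIMAL precision is an assumption):

-- Sort column is a Number: cast before ordering so '2' sorts before '10'
SELECT RowId, Value
FROM ColumnsValues
WHERE ColId = @sortColId
ORDER BY CAST(Value AS DECIMAL(18, 4));

-- Sort column is a String (or a Date stored in a sortable text format): order on the text
SELECT RowId, Value
FROM ColumnsValues
WHERE ColId = @sortColId
ORDER BY Value;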
Add a DefaultValue to your ColumnsDefinition. Only add a row to ColumnsValues if the value is not the default value. This will speed things up a lot.
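When reading the data back, missing rows then fall back to the default, roughly like this (this assumes the new DefaultValue column and some way to enumerate row ids, e.g. a distinct list or a dedicated Rows table):

SELECT r.RowId,
       d.ColId,
       COALESCE(v.Value, d.DefaultValue) AS Value
FROM (SELECT DISTINCT RowId FROM ColumnsValues) r
CROSS JOIN ColumnsDefinition d
LEFT JOIN ColumnsValues v ON v.RowId = r.RowId AND v.ColId = d.ColId;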
The thing I hate about these kinds of systems is that it is very difficult to transfer changes between dev/stage/production, because you have to keep both the structure and the content of the tables in sync.