In our project we need the ability to record who deleted an entity and when.
So after some investigation I've found the following solutions:
Add IsDeleted and DeletedBy columns to every table and set them before deletion (using the delete event of NH). But this solution has a drawback: we have many SQL views which should work only with non-deleted data, so we would have to write a view over each table that acts as a filter (WHERE IsDeleted = 0); a sketch of such a view is shown below.
Serialize each entity to XML before deletion and store it in a single separate table with the following structure: Id | XML | DeletedBy
From your point of view, which of these solutions is preferred, or are there other solutions I didn't mention above?
P.S. The deleted rows should be excluded from queries (both NHibernate and SQL).
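For illustration, a minimal sketch of the per-table filtering view from option 1 (Orders and its columns are just a hypothetical example):

-- One view per table, so reports and ad-hoc SQL only ever see live rows
CREATE VIEW ActiveOrders AS
SELECT OrderId, CustomerId, Total, DeletedBy
FROM Orders
WHERE IsDeleted = 0;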
I see three options:
Hard delete. The rows do not exist.
Soft delete. As you describe. Yep, you'll have to tack on IsSoftDeleted checks EVERYWHERE. EVERYWHERE. EVERYWHERE. It's a total pain.
Archive table. Create a table that is an exact replica of the existing table...and do the move (to the archive table) and the delete (from the original table) in a transaction.
I've worked with #2 and #3. I prefer #3 because you avoid the EVERYWHERE additional clauses; a sketch of the transactional move is below.
With #2, you may also have to figure out constraints that allow for one non-soft-deleted row (based on the unique constraint) while still allowing duplicates of soft-deleted rows that would otherwise violate that unique constraint. Yep, good times.
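A minimal T-SQL sketch of option #3, assuming a hypothetical Orders table with an OrdersArchive replica that adds DeletedBy/DeletedOn columns:

DECLARE @OrderId INT = 42, @DeletedBy NVARCHAR(100) = N'jsmith';

BEGIN TRANSACTION;

-- Copy the row into the archive, stamping who deleted it and when
INSERT INTO OrdersArchive (OrderId, CustomerId, Total, DeletedBy, DeletedOn)
SELECT OrderId, CustomerId, Total, @DeletedBy, SYSUTCDATETIME()
FROM Orders
WHERE OrderId = @OrderId;

-- Then remove it from the live table
DELETE FROM Orders
WHERE OrderId = @OrderId;

COMMIT TRANSACTION;

Because the insert and the delete share one transaction, the row is always in exactly one of the two tables.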
I have a folder filled with about 200 CSV files, each containing about 6,000 rows of mutual fund data. I have to copy that comma-separated data into the database via Entity Framework.
The two major objects are Mutual_Fund_Scheme_Details and Mutual_Fund_NAV_Details.
Mutual_Fund_Scheme_Details - this contains columns like Scheme_Name, Scheme_Code, Id, Last_Updated_On.
Mutual_Fund_NAV_Details - this contains Scheme_Id (foreign key), NAV, NAV_Date.
Each line in the CSV contains all of the above columns, so before inserting I have to:
Split each line.
Extract the scheme-related data first and check whether the scheme exists and get its id. If it does not exist, insert the scheme details and get the id.
Using the id obtained from step 2, check whether an entry for the NAV exists for the same date. If not, insert it; otherwise skip it.
If an entry is inserted in step 3, the Last_Updated_On date might need to be updated for the scheme with the NAV date (depending on whether it is newer than the existing value).
All the exists checks are done using the Any LINQ extension method and all the new entries are added to the DbContext, but the SaveChanges method is called only at the end of processing each file. I used to call it after each insert, but that takes even longer than it does now.
Now, since this involves at least two exists checks, at most two inserts, and one update, the insertion of each file is taking too long: close to 5-7 minutes per file. I am looking for suggestions to improve this. Any help would be useful.
Specifically, I am looking to:
Reduce the time it takes to process each file
Decrease the number of individual exists checks (if I can possibly club them in some way)
Decrease individual inserts/updates (if I can possibly club them in some way)
It's going to be hard to optimize this with EF. Here is a suggestion:
Once you have processed the whole file (~6,000 rows), do the exists check with .Where(x => listOfIdsFromFile.Contains(x.Id)). This should work for 6,000 ids and it will allow you to separate inserts from updates.
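For what it's worth, EF translates a Contains call over an in-memory list into a single IN query, so thousands of per-row round trips collapse into one set-based lookup. Roughly (the literal codes are hypothetical):

-- One round trip instead of ~6000 individual exists checks
SELECT Id, Scheme_Code
FROM Mutual_Fund_Scheme_Details
WHERE Scheme_Code IN ('SC001', 'SC002', 'SC003' /* ...all codes from the file... */);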
At the risk of over-explaining my question, I'm going to err on the side of too much information.
I am creating a bulk upload process that inserts data into two tables. The two tables look roughly as follows. Parts is a self-referencing table that allows N levels of reference.
Parts (self-referencing table)
--------
PartId (PK Int Non-Auto-Incrementing)
DescriptionId (Fk)
ParentPartId
HierarchyNode (HierarchyId)
SourcePartId (VARCHAR(500) a unique Part Id from the source)
(other columns)
Description
--------
DescriptionId (PK Int Non-Auto-Incrementing)
Language (PK either 'EN' or 'JA')
DescriptionText (varchar(max))
(I should note too that there are other tables that will reference our PartID that I'm leaving out of this for now.)
In Description, the combination of DescriptionText and Language will be unique, but the actual DescriptionId will always have at least two instances.
Now, for the bulk upload process, I created two staging tables that look a lot like Parts and Description but don't have any PK's, Indexes, etc. They are Parts_Staging and Description_Staging.
In Parts_Staging there is an extra column that contains a Hierarchy Node String, which is the HierarchyNode in this kind of format: /1/2/3/ etc. Then when data is copied from the _Staging table to the actual table, I use a CAST(Source.Column AS hierarchyid).
Because of the complexity of the IDs shared across the two tables, the self-referencing IDs and the hierarchyid in Parts, and the number of rows to be inserted (possibly in the 100,000s), I decided to compile 100% of the data in a C# model first, including the PK IDs. So the process looks like this in C#:
Query the two tables for the MAX ID.
Using those max IDs, compile a complete model of all the data for both tables (including the hierarchyid /1/2/3/).
Do a bulk insert into both _Staging tables.
Trigger an SP that copies non-duplicate data from the two _Staging tables into the actual tables. (This is where the CAST(Source.Column AS hierarchyid) happens.)
We are importing lots of parts books, and a single part may be replicated across multiple books, so we need to remove the duplicates. In step 4, duplicates are weeded out by checking the SourcePartId in the Parts table and the DescriptionText in the Description table.
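As a rough sketch, the Parts half of that step-4 copy might look like this (column lists abbreviated; HierarchyNodeString is my hypothetical name for the extra staging column):

-- Copy only parts whose SourcePartId is not already present,
-- converting the staged node string into a real hierarchyid
INSERT INTO Parts (PartId, DescriptionId, ParentPartId, HierarchyNode, SourcePartId)
SELECT s.PartId, s.DescriptionId, s.ParentPartId,
       CAST(s.HierarchyNodeString AS hierarchyid),
       s.SourcePartId
FROM Parts_Staging AS s
WHERE NOT EXISTS (
    SELECT 1 FROM Parts AS p
    WHERE p.SourcePartId = s.SourcePartId
);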
That entire process works beautifully! And best of all, it's really fast. But if you are reading this carefully (and I thank you if you are), then you have already noticed one glaring, obvious problem.
If multiple processes are happening at the same time (and that absolutely WILL happen!), then there is a very real risk of the IDs getting mixed up and the data becoming badly corrupted. Process1 could do the GET MAX ID query and, before it manages to finish, Process2 could also do a GET MAX ID query; because Process1 hasn't actually written to the tables yet, Process2 would get the same IDs.
My original thought was to use a SEQUENCE object. And at first, that plan seemed to be brilliant. But it fell apart in testing because it's entirely possible that the same data will be processed more than once and eventually ignored when the copy happens from the _Staging tables to the final tables. And in that case, the SEQUENCE numbers will already be claimed and used, resulting in giant gaps in the ID's. Not that this is a fatal flaw, but it's an issue we would rather avoid.
So... that was a LOT of background info to ask this actual question. What I'm thinking of doing is this:
Lock both of the tables in question
Steps 1-4 as outlined above
Unlock both of the tables.
The lock would need to be a READ lock (which I think is an Exclusive lock?) so that if another process attempts to do the GET MAX ID query, it will have to wait.
My question is: 1) Is this the best approach? And 2) How does one place an Exclusive lock on a table?
Thanks!
I'm not sure about the best approach overall, but in terms of placing an 'exclusive' lock on a table, simply using WITH (TABLOCKX) in your query will put one on the table; see the sketch below.
If you wish to learn about it:
https://msdn.microsoft.com/en-GB/library/ms187373.aspx
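A minimal sketch against your Parts table (the variable name is mine, and I've added HOLDLOCK so the lock is held until the transaction ends rather than just for the statement):

DECLARE @maxPartId INT;

BEGIN TRANSACTION;

-- TABLOCKX takes an exclusive lock on the whole table; HOLDLOCK keeps it
-- until the transaction ends, so a concurrent GET MAX ID query has to wait
SELECT @maxPartId = MAX(PartId)
FROM Parts WITH (TABLOCKX, HOLDLOCK);

-- ...steps 2-4 here: build the model, bulk insert to staging, copy to the real tables...

COMMIT TRANSACTION;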
I have an issue regarding merge replication. I have a table SETTINGS in which I store the settings of my software.
The schema of the table is ID (PK), Description, Value.
Suppose I have 15 rows in this table on my server.
Now I have applied a filter on this table saying only the first 10 rows should replicate.
With this setting, when I sync for the first time, I receive the 10 rows on my client (which holds the subscription).
Then I add the remaining 5 on my client.
Now when I sync again it gives me a conflict saying:
A row insert at 'ClientServer.ClientDatabaseName' could not be
propagated to 'MyServer.ServerDatabaseName'. This failure can be
caused by a constraint violation. Violation of PRIMARY KEY constraint
'PK_SETTINGS'. Cannot insert duplicate key in object 'dbo.SETTINGS'.
The duplicate key value is (11).
What I don't understand is why it is trying to replicate a row which is outside the subset filter applied on that table. Please help, guys.
Is this scenario not possible with merge replication?
https://msdn.microsoft.com/en-us/library/ms151775.aspx suggests that this is possible, but I'm confused.
Filters created for a merge article are evaluated only at the publisher. Changes made at the subscriber will always be propagated back to the publisher, even if they are outside the filter criteria. However, if the changes from one subscriber do not meet the filtering criteria, they will sit on the publisher but not be replicated to the other subscribers.
Is this a production scenario, or are you playing around with replication? If you do static filtering, which is what you have above, it is typically done on read-only type of tables. For example, a salesperson in the field may only need prices for products in their region. They are not expected to update this table. If you do dynamic filtering, for example, filtering based on HOSTNAME(), then you would only get data specific for that user. For example, a salesperson in the field would receive only their customer information. Thus, any updates to that information, unless it's shared across multiple salespersons, would propagate back up, and not flow to anyone else.
In your case, I would not recommend updating tables on the subscriber that have static filters, so I suggest re-evaluating your filtering design to ensure you have the right filtering model for your scenario.
I'm deleting data (through a C# app) from a database which is about 1.8 GB in size.
The same operation on smaller databases (~600 MB) runs without problems, but on the big one I'm getting:
Lock wait timeout exceeded; try restarting transaction.
Will increasing innodb_lock_wait_timeout fix the problem, or is there another way?
I don't think that optimizing the queries is a solution, because there is no way to make them simpler.
I'm deleting parts of the data based on conditions and relations, not all the data.
You can split the delete statement into smaller parts that won't time out.
For example, delete the rows with ids 1 to 10,000, execute and commit, then do the same for ids 10,000 to 20,000, and so on.
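A minimal sketch of that in SQL (the table name, the extra condition, and the id ranges are just placeholders for your own):

-- Delete in bounded id ranges so each statement stays short
-- and finishes before the lock wait timeout
DELETE FROM order_history
WHERE created_at < '2015-01-01'
  AND id BETWEEN 1 AND 10000;
COMMIT;

DELETE FROM order_history
WHERE created_at < '2015-01-01'
  AND id BETWEEN 10001 AND 20000;
COMMIT;

-- ...repeat for the remaining ranges, or drive the loop from the C# app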
You mentioned that you were '...deleting parts of the data based on some conditions and relations, not all the data'. I would check that there are appropriate indexes on all the keys you are using to filter the data to delete.
If you were to show us your schema and where clause we could suggest ones that may help.
You should also consider splitting your delete into multiple batches of smaller numbers of rows.
Another alternative would be to do a SELECT INTO with only the data you want to keep, into another table, then drop the original and rename the new table.
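Since the error above comes from MySQL/InnoDB, that pattern would look roughly like this there (table names and the WHERE clause are hypothetical; note that CREATE TABLE ... SELECT does not copy indexes or foreign keys, so recreate those before the swap):

-- Keep only the rows you want, then swap the tables
CREATE TABLE order_history_keep AS
SELECT *
FROM order_history
WHERE created_at >= '2015-01-01';

RENAME TABLE order_history TO order_history_old,
             order_history_keep TO order_history;

DROP TABLE order_history_old;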
Right-click the table --> Script Table As --> Create To --> New Query, and save the query.
Right-click the table --> Delete.
Refresh your database and IntelliSense so they forget the table, then run the saved script, which will recreate the table. That's how you end up with an empty table.
Or you can simply increase the setting for innodb_lock_wait_timeout (or table_lock_wait_timeout, not sure which) if you don't want to delete all the info in the table.
If you're deleting all the rows in the table, use
TRUNCATE TABLE tablename
The DELETE command logs each removed row in the transaction log as it works, whereas TRUNCATE just deallocates the data with minimal logging.
The Story
I'm going to write some code to manage the deleted items in my application, but I'm going to soft delete them so I can bring them back when I need to. I have a hierarchy to respect in my application's logic when it comes to hiding or deleting items.
I logically place my items in containers: country, city, district, and brand.
Each item should belong to a country, a city, a district, and a brand.
Now, if I delete a country it should delete the cities, districts, brands, and items that belong to that country, and if I delete a city it should likewise delete everything under it (districts, brands, etc.).
A Note
When I delete a country and delete the associated brands, I should take care that a brand might have items in more than one country.
The Question
Do you suggest that I:
Flag the items (whether country, city, item, etc.) as deleted? This will require a lot of code to check, every time an item is loaded from the database, whether it's deleted or not, plus some extra fields to mark whether the city it belongs to is deleted, whether the country it belongs to is deleted, and so on.
Move the deleted stuff to dedicated tables (DeletedCountries, DeletedCities, etc.),
saving the IDs of the items each was associated with so I can insert them back later into their original tables? This, of course, spares my application all the code needed to check the deleted items and make sure the whole hierarchy is deleted.
Or maybe you have a better approach/advice/idea for achieving such a thing!
For argument's sake, one advantage of solution #2 (moving deleted items to their own tables) is that if you have lots and lots of records, you would not have to worry about indexing records with respect to their "deleted" state.
With that said, if I were going to "move" data from table to table (via a delete followed by an insert) I would make sure to do it in one transaction.
I'm using a technique right now where we store a DeleteDate on every user-maintained table in our database. The DeleteDate field is a smalldatetime with a default value of 6/1/2079.
Coupled with an index on the DeleteDate field, we are able to use a standard view or user-defined function to return only the 'current' records (that is, those with a delete date in the future). All queries route through this index when looking for current data, and deletes become a trivial update query.
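A minimal sketch of that pattern (the Customers table and its columns are hypothetical; only DeleteDate comes from the description above):

-- The view exposes only 'current' rows, i.e. those whose DeleteDate is still in the future
CREATE VIEW CurrentCustomers AS
SELECT CustomerId, Name, DeleteDate
FROM Customers
WHERE DeleteDate > GETDATE();
GO

-- A 'delete' is then just a trivial update that stamps the deletion time
UPDATE Customers
SET DeleteDate = GETDATE()
WHERE CustomerId = 42;  -- example id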
There are some additional logic checks that need to be done for related tables. But that is part of the price of having to never worry about a user 'accidentally' deleting valuable data.
In the future, when these tables become excessively large and hold a lot of deleted records, we can partition them on DeleteDate. This will move all 'deleted' records away from the 'live' records.
Flagging an item as deleted really complicates information retrieval, and you also have to handle cascading removal yourself.
I would choose the "mailbox" approach, i.e. moving deleted records to a different table. I did a project that used soft delete, and I ended up putting all delete calls into stored procedures and handling the copy and removal inside the stored procedure.
You should manage your hierarchy by tagging all subitems as deleted. That way, if e.g. a product belongs to a brand, you only need to check whether the brand is deleted. You should also put this logic on the data-retrieval side, to avoid unnecessarily fetching deleted information.
SELECT *
FROM products p
JOIN category c ON p.catId = c.Id
WHERE NOT c.Deleted
And above all, the information about deleted categories should be indexed.
ALTER TABLE category ADD PRIMARY KEY (Id);
CREATE INDEX ix_category_deleted ON category (Deleted);
or
CREATE INDEX ix_category_id_deleted ON category (Id, Deleted);
I think flagging the item is the best approach, and I also use the mailbox approach for soft deletes.
Yeah, that requires a lot to take into account and manage, but I haven't found any other way. I just add one extra column, Status, with a bit datatype, to each and every table.
Thanks
How complex a delete technique are you asking for?
With just one date field and no audit log, you can have an instant deleted flag. If the date field is null, then the row has not been deleted. You can then use that date field in an index (if the index allows nulls).
If you want something more complex, you could use extra tables. Will you allow an item to be deleted, undeleted, re-deleted, and keep a record of each of those actions? If so, keep a separate table for action logging and keep only the one record with a boolean field (actually, a join on that table might be faster; it depends on the data).
If you often reconstitute the items, flagging is the preferable approach, but you end up having to alter your data access to avoid showing the flagged items, which can be rather painful if you have already written a lot of code accessing your data; so moving may be better if you have a lot of "legacy" code accessing the data. If reconstitution is rare, and you are also interested in a history log, moving to another database table works well.
One easy way to achieve either is a trigger that intercepts the delete and performs the flag update or the move instead. If you actually do need to hard delete items, however, the flag option becomes a royal PITA compared to moving them. The reason a trigger is easier in many cases is that it captures every delete, not just those initiated by your code.
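As a sketch of that trigger idea in T-SQL (Items and ItemsArchive are hypothetical tables), an INSTEAD OF DELETE trigger turns every delete, however it was issued, into a move to the archive table:

CREATE TRIGGER trg_Items_SoftDelete
ON Items
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Copy the rows being deleted into the archive table
    INSERT INTO ItemsArchive (ItemId, Name, DeletedOn)
    SELECT ItemId, Name, SYSUTCDATETIME()
    FROM deleted;

    -- Then remove them from the live table
    DELETE FROM Items
    WHERE ItemId IN (SELECT ItemId FROM deleted);
END;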