Design Strategy: Query and Update data across 2 different databases

We have a requirement to query data across two different databases (one in SQL Server, the other in Oracle).
Here are the scenarios which need to be implemented:
Query: Get the data from one database and match it against values in the other
Update: Get the data from one database and update the objects in the other
Technology that we are using: ASP.net, C#
The options that we have thought about:
Staging area in one database
Linked Server (we can't use this approach, as it is not allowed by organization-wide policy)
Create web services
Create 2 different DAL and perform list operations with the data from 2 sources in DAL
What is the best design strategy for this kind of scenario, and what are the pros and cons of each approach?

Would it be possible to use an SSIS package to do the data transformation between the two servers, invoked either from the ASP.NET/C# project or as a scheduled job run on demand?

Will the results from one of the databases be small enough to efficiently pass around?
If so, I would suggest treating the databases as two independent datasources.
If the datasets are large, you may have to consider some form of ETL into a staging area in one of the databases. That can be a problem if the queries must return up-to-date data from both databases, because you would then need real-time ETL.

There is an article here about performing distributed transactions between Microsoft SQL server and Oracle:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1054237.html
I don't know how well this works, however if it does work, this will probably be the best solution for you:
It will almost certainly be the fastest method of querying across multiple database servers.
It should also allow for true transactional support even when writing to both databases.

The best strategy for this would be a Linked Server, as it is designed for querying and writing to heterogeneous databases as you describe. But given the policy constraint you mentioned, this is not an option.
Therefore, to get the result you want with the best performance, here is what I suggest:
Decide which database holds only the lookup data (the smaller dataset), and run a query against it to pull that data out
Insert the lookup data using bulk copy into a temp/dummy table in the main database (the one that holds most of the data you want to retrieve and return to the caller)
Use a stored procedure or query to join the temp table with the other tables in your main database to retrieve the desired dataset
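The staging step above can be sketched roughly as follows. This is only an illustration under assumptions: the column names, the `LookupStaging` class and the `#LookupStaging` temp table are hypothetical, and the actual bulk copy call is shown as a comment because it needs the SQL Server client library and an open connection.

```csharp
using System.Collections.Generic;
using System.Data;

public static class LookupStaging
{
    // Builds an in-memory DataTable from lookup rows pulled out of the
    // smaller database. Column names here are hypothetical.
    public static DataTable BuildStagingTable(IEnumerable<KeyValuePair<int, string>> lookupRows)
    {
        var table = new DataTable("LookupStaging");
        table.Columns.Add("LookupId", typeof(int));
        table.Columns.Add("LookupValue", typeof(string));
        foreach (var row in lookupRows)
        {
            table.Rows.Add(row.Key, row.Value);
        }
        return table;
    }

    // With the SQL Server client library referenced, the table could then be
    // pushed to the main database in one operation, for example:
    //   using (var bulk = new SqlBulkCopy(openSqlConnection))
    //   {
    //       bulk.DestinationTableName = "#LookupStaging"; // assumed temp table,
    //                                                     // created on the same connection
    //       bulk.WriteToServer(table);
    //   }
}
```

The join against the main tables then runs inside SQL Server, so only the final result set travels back to the application.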
The decision whether to write this as a web service isn't going to change the data retrieval process. But consider reducing data transfer overhead by keeping the process as close to your DB server as possible, either on the same machine or on a LAN/high-speed link.
Data update will be quite straightforward: the standard two-step operation of pulling data out of one database and updating the other.

It's hard to tell what the best solution is. But we have a scenario that's nearly the same.
RealTime:
For realtime data updating, we use web services, since in our case the two databases belong to distinct projects. Each project offers a web service which can be used for data retrieval and data update. The advantage is that a project does not need to care about database structure changes in the other, as long as the web service interface does not change.
Static Data:
Static data (e.g. employees) is mirrored for faster access. For that huge amount of data we use flat files for the nightly update.
In the case of static data I think it's important to explicitly define data owners. For every piece of data it should be clear which database holds the original and which database only has shadow copies for faster access.
So static data is read-only in the shadow database, or only updateable through designated web services.

The problem with using multiple data sources in your .NET code is that you run the risk of having your CRUD ops fail ACID tests and having data inconsistencies.
I would be most inclined to pursue @Will A's comment to your question...
Set up replication to a remote server, then link the two remote servers.

Have multiple DALs and handle it in the application. Thousands of rows is not a big number; you need to worry only once you get into the hundreds of thousands or millions, at which point your application will hang.
Use LINQ to perform data operations on the datasets that are returned rather than looping through them.
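As a minimal sketch of what that could look like, assuming each DAL has already returned a list of plain objects: the `SqlServerOrder` and `OracleCustomer` classes and their properties are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shapes; the real columns come from your two DALs.
public sealed class SqlServerOrder
{
    public int CustomerId { get; set; }
    public decimal Amount { get; set; }
}

public sealed class OracleCustomer
{
    public int CustomerId { get; set; }
    public string Name { get; set; } = "";
}

public static class CrossSourceMatcher
{
    // Joins rows already fetched from each database on CustomerId.
    // Enumerable.Join builds a hash lookup over the inner sequence,
    // so this avoids a nested loop over both lists.
    public static List<(string Name, decimal Amount)> MatchOrdersToCustomers(
        IEnumerable<SqlServerOrder> orders,
        IEnumerable<OracleCustomer> customers)
    {
        return orders
            .Join(customers,
                  o => o.CustomerId,
                  c => c.CustomerId,
                  (o, c) => (c.Name, o.Amount))
            .ToList();
    }
}
```

Orders without a matching customer are dropped, as in a SQL inner join.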

Related

Tools or data base tips to show big data information?

I'm having a problem with an application built in .NET Core (C#) and SQL Server 2017, with AngularJS 1.x on the frontend.
The problem is the following: we have very big tables, with millions of records. Even a simple SELECT COUNT on one of these tables takes too long. We execute the query directly from the code without going through any ORM library, but even without an ORM the queries take too long.
I was wondering whether there is a better way to query these giant tables (external tools, another type of database, etc.), since in many cases we need to show reports and statistics graphs.

One possible strategy is to use table partitions, with a partition function that matches your business needs. With this you can split a table's data across several files, reducing the number of rows to scan.
See this link for detailed info.

OLTP databases like SQL Server are not designed for handling OLAP (aggregate) queries in real time on large datasets. Typical workarounds are:
limit the number of aggregated rows with extra WHERE conditions, and add indexes on those columns. This is usually possible with historic data like orders, event logs etc.: show reports only for the last month or year.
use materialized views (indexed views in SQL Server) for reports that don't need much detail
configure a read-only replica of SQL Server, possibly add columnstore indexes, and use it for OLAP queries.
replicate your SQL Server data to a specialized (possibly distributed) analytical database that can handle OLAP queries in real time (like Amazon Redshift, Vertica, MongoDb, ElasticSearch, Yandex ClickHouse etc)
If reports can be configured by end users, ensure that your ROLAP-like engine produces efficient SQL GROUP BY queries.

Should I do many sql queries or one large query and do the processing on the server?

The situation is as follows: I have a large-ish dataset with a couple thousand entries that I populate from an Excel file. For each entry I have to match it to another field on a certain table in the database (this table contains only a couple hundred entries).
What's the best way to go about doing it? I can make a query for each entry in the dataset but this seems fairly wasteful; on the other hand I can just select the fields I need from all the entries in the table, put them on a Dictionary or some other data structure and match them on IIS, thus making effectively only one query but doing all the processing on the webserver.
Dataset : ~1000 to ~3000 entries
Table in the DB: ~300 entries
Using ASP.NET on IIS, but the database is an MS Access file.
Is either of these better than the other? Is there a third, better way I haven't thought of?

Databases are designed to do many things that are useful for data processing. A lot of benefits for transactional processing are contained in the acronym ACID -- atomicity, consistency, isolation, durability. In other words, databases behave the way you would expect when you store something in them. The data is there, relationships are enforced, it will be there tomorrow.
The features that you want are on the querying side. Databases in general (although perhaps not MS Access in particular) offer a relatively standard interface to powerful processing. Database engines know how to optimize queries. Database engines know how to manage memory. Database engines know how to manage hierarchical memory, with disk, RAM, and cache. Databases know how to take advantage of indexes, row partitions, and other optimizations (you can get this functionality by using a free version of a more advanced database, such as SQL Server, Oracle, Postgres, or even MySQL).
You are talking about thousands of rows of data. Databases can easily work with millions of rows. You are talking about two tables. Databases can easily manage many more tables, and queries that use a dozen of them.
So, no, you should not load your data into in-memory structures on the application side. You should do the processing in the database and bring back the results you want. Then, you can format the results on the application side, to take advantage of what applications do best: interface to the user.

SaaS application needs to export/backup data to individual customer sites

We have a cloud based SaaS application and many of our customers (school systems) require that a backup of their data be stored on-site for them.
All of our application data is stored in a single MS SQL database. At the very top of the "hierarchy" we have an "Organization". This organization represents a single customer in our system. Each organization has many child tables/objects/data. Each having FK relationships that ultimately end at "Organization".
We need a way to extract a SINGLE customer's data from the database and bundle it in some way so that it can be downloaded to the customer's site, preferably as a SQL Express, SQLite or Access database.
For example: Organization -> Skill Area -> Program -> Target -> Target Data are all tables in the system. Each one linking back to the parent by a FK. I need to get all the target data, targets, programs and skill areas per organization and export that data.
Does anyone have any suggestions about how to do this within SQL Server, a C# service, or a third-party tool?
I need this solution to be easy to replicate for each customer who wants this feature "turned on".
Ideas?

I'm a big fan of using messaging to propagate data at the moment, so here's a message based solution that will allow external customers to keep a local, in sync copy of the data which you provide on the web.
The basic architecture would be an online, password secured and user specific list of changes which have occurred in the system.
At the server side this list would be appended to any time there was a change to an entity which is relevant to the specific customer.
On the client, an application would run that checks the list for changes it hasn't yet received and then applies them to its local database (in the order they occurred).
There are a bunch of different ways of doing the list-based component of the system, but my gut feeling is that you would be best off using something like RSS.
Below is a practical scenario of how this could work:
A new skill area is created for organisation "my org"
The skill area is added to the central database and associated with the "my org" record
A SkillAreaExists event is also added at the same time to the "my org" RSS with JSON or XML data specifying the properties of the new skill area
A new program is added to the skill area that was just created
The program is added to the central database and associated with the skill area
A ProgramExists event is also added at the same time to the "my org" RSS with JSON or XML data specifying the properties of the new program
A SkillAreaHasProgram event is also added at the same time to the "my org" RSS with JSON or XML data specifying an identifier for the skill area and program
The client agent checks the RSS feed and sees the new messages and processes them in order
When the SkillAreaExists event is processed a new Skill area is added to the local DB
When the ProgramExists event is processed a new Program is added to the local DB
When the SkillAreaHasProgram event is processed the program is linked to the skill area
This approach has a whole bunch of benefits over traditional point in time replication.
It's online: a consumer can get realtime updates if required.
Consistency is maintained by order: if you stop receiving events at any point in the event stream, you have a local DB which accurately reflects the central DB as of some point in time.
It's diff based: you only need to receive changes.
It's auditable: you can see what actually happened, not just the current state.
It's easily recoverable: if there's a data consistency issue you can rebuild the entire DB by replaying the event stream.
It allows for multiple consumers, lots of individual copies of the clients info can exist and function autonomously.
We have had a great deal of success with these techniques for replicating data between sites especially when they are only sometimes online.
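To make the client side of this more concrete, here is a small sketch of an agent applying events in order. It assumes each event carries a sequence number and that the payload has already been parsed into key/value pairs; `ChangeEvent`, `LocalStore` and the field names ("id", "programId", "skillAreaId") are hypothetical, and the in-memory collections stand in for the customer's local database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event shape: a sequence number, an event type such as
// "SkillAreaExists", and already-parsed payload fields.
public sealed class ChangeEvent
{
    public long Sequence { get; set; }
    public string Type { get; set; } = "";
    public Dictionary<string, string> Data { get; set; } = new Dictionary<string, string>();
}

// Stand-in for the customer's local database.
public sealed class LocalStore
{
    public long LastAppliedSequence { get; private set; }
    public HashSet<string> SkillAreas { get; } = new HashSet<string>();
    public HashSet<string> Programs { get; } = new HashSet<string>();
    public Dictionary<string, string> ProgramToSkillArea { get; } = new Dictionary<string, string>();

    // Applies events not seen yet, in the order they occurred.
    // Returns the number of events applied.
    public int ApplyNew(IEnumerable<ChangeEvent> feed)
    {
        long since = LastAppliedSequence;
        int applied = 0;
        foreach (var e in feed.Where(x => x.Sequence > since).OrderBy(x => x.Sequence))
        {
            switch (e.Type)
            {
                case "SkillAreaExists":
                    SkillAreas.Add(e.Data["id"]);
                    break;
                case "ProgramExists":
                    Programs.Add(e.Data["id"]);
                    break;
                case "SkillAreaHasProgram":
                    ProgramToSkillArea[e.Data["programId"]] = e.Data["skillAreaId"];
                    break;
                default:
                    throw new InvalidOperationException("Unknown event type: " + e.Type);
            }
            LastAppliedSequence = e.Sequence;
            applied++;
        }
        return applied;
    }
}
```

Because the store remembers the last sequence it applied, polling the same feed twice does nothing the second time, which is what lets a sometimes-offline client catch up safely.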

While there are some very interesting enterprise solutions that have been suggested, I think my approach would be to develop a plain old scheduled backup solution that simply exports the data for each organisation with a stored procedure or just a number of select statements.
Admittedly you'll have to keep this up to date as your database schema changes, but if this is a production application I can't imagine that happens very drastically.
There are any number of technologies available to do this, be it SSIS, a custom windows service, or even something as rudimentary as a scheduled task that kicks off a stored procedure from the command line.
The format you choose to export to is entirely up to you and should probably be driven by how the backup is intended to be used. I might consider writing data to a number of CSV files and zipping the result such that it could be imported into other platforms should the need arise.
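A rough sketch of the CSV-plus-zip idea, using only the .NET base class library. The `CsvBackup` class and the input shape (table name mapped to rows of strings, header first) are assumptions for illustration; a real export would stream rows from a data reader rather than hold them in memory.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

public static class CsvBackup
{
    // Quotes a field when it contains a comma, quote, or line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // tables: table name -> rows, where the first row is the header.
    // Returns the bytes of a zip archive with one CSV file per table.
    public static byte[] BuildZip(IDictionary<string, List<string[]>> tables)
    {
        using (var buffer = new MemoryStream())
        {
            using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
            {
                foreach (var table in tables)
                {
                    var entry = zip.CreateEntry(table.Key + ".csv");
                    using (var writer = new StreamWriter(entry.Open(), Encoding.UTF8))
                    {
                        foreach (var row in table.Value)
                        {
                            writer.Write(string.Join(",", row.Select(Escape)));
                            writer.Write("\r\n");
                        }
                    }
                }
            }
            // The archive must be disposed before reading the buffer,
            // because the zip directory is written on dispose.
            return buffer.ToArray();
        }
    }
}
```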
Other options might be to copy data across to a scratch database and then simply create a SQL backup of that database.
However you choose to go about it, I would encourage you to ensure that the process is well documented and has as much automated installation and setup as possible. Systems with loosely coupled dependencies such as common file locations or scheduled tasks are prone to getting tweaked and changed over time. Without those tweaks and changes being recorded you can create a system that works but can't be replicated. Soon no one wants to touch it and no one remembers exactly how it works. When it eventually needs changing, or worse, it breaks, you have to start reverse engineering before you can fix it.
In a cloud based environment this is especially important because you want to be able to deploy as quickly as possible. If there is a lot of configuration that needs to be done you're likely to make mistakes or just be inconsistent. By creating a nuke-and-repave deployment you have a single point that you can change installation and configuration, safe in the knowledge that the change will be consistent across any deployment.

From what I understand, you have one large database for all the clients, you use relations that lead back to the Organization table to know which data belongs to which client, and you want to back up the data per client (organization).
To back up the data you can use one of the following methods:
As the comments from @Phil and @Kris suggest, you can use SSIS for automated backup: check this link for backing up the structure, and this link for how to export a query result to a file using SSIS, but target an Access or SQL Server database instead of a file.
Build an application/service in C# that selects the data and exports it by hand. This takes time, but there are no limits on customization.

Have you looked at StreamInsight?
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/complex-event-processing.aspx

When I've had to deal with backups of relational data in the past (in MySQL, which isn't very different in capability from the MSSQL that you're running), I created a backup "package" file, which is essentially a zip file with a different file extension so that Windows won't let users open it.
If you really want to get fancy, encrypt the file after zipping it and change the extension. I presume you're using ASP for your SaaS and since I'm a PHP-geek, I can't help too much with the code side of things, but the way I've handled this before was for a script that would package an entire Joomla site and Database for migration to a new server.
// Open the MySQL connection and select the database ($cfg holds the site settings).
// mysqli is used here because the old mysql_* functions were removed in PHP 7.
$dbc = mysqli_connect($cfg->host, $cfg->user, $cfg->password, $cfg->db);
output("Getting database tables\n");
// Get all the tables in the database
$tables = array();
$result = mysqli_query($dbc, 'SHOW TABLES');
while ($row = mysqli_fetch_row($result)) {
    $tables[] = $row[0];
}
output('Found ' . count($tables) . " tables to be migrated.\nExporting tables:\n");
$return = "";
// Cycle through the tables and collect their CREATE statements and data
foreach ($tables as $table) {
    $result = mysqli_query($dbc, 'SELECT * FROM `' . $table . '`');
    $return .= 'DROP TABLE IF EXISTS `' . $table . "`;\n";
    $row2 = mysqli_fetch_row(mysqli_query($dbc, 'SHOW CREATE TABLE `' . $table . '`'));
    $return .= $row2[1] . ";\n";
    while ($row = mysqli_fetch_row($result)) {
        $values = array();
        foreach ($row as $value) {
            // Keep real NULLs; quote and escape everything else, including "0" and ""
            // (the escape function also turns line breaks into \n sequences).
            $values[] = ($value === null)
                ? 'NULL'
                : "'" . mysqli_real_escape_string($dbc, $value) . "'";
        }
        $return .= 'INSERT INTO `' . $table . '` VALUES(' . implode(',', $values) . ");\n";
    }
}
That's the relevant portion of the code in PHP: it loops over the database structure and stores the recreation script in $return, which can then be output to a file.
In your case, you don't want to recreate the databases, but rather the data itself. You've compounded the issue slightly since you have a SaaS that is prone to possible data structure changes which you'll need to be able to account for. My suggestion would be this then:
Use a similar system to the above to dump the relevant data from the individual tables. I'm simply pulling all the data, but you could pull only the parts that pertain to the individual user by using JOIN statements and whatnot. Dump the contents of each table's insert/replace statements into a file named after the table. Create a file called manifest.xml or something of that sort and populate it with the current version of your SaaS application, name/information, unique ID, etc of the client exporting the data.
Package all those files into a ZIP file, change the extension to whatever you want, encrypt it if you desire, etc. Let them download that backup file and you're set.
In your import script, you will need to read the version number of the exported data and compare it to some algorithm that can handle remapping the data based on revisions you make later on. This way if you need to re-import one of their backups later, you can correctly handle transitioning the data from when they pulled the backup to the current structure of the data in that table now.
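One way to picture the version handling on import is a chain of upgrade steps, one per schema version, applied from the manifest's version up to the current one. The sketch below is only an illustration: the `BackupUpgrader` class, the version numbers and the column changes are all invented, and rows are modelled as simple column/value dictionaries.

```csharp
using System;
using System.Collections.Generic;

public static class BackupUpgrader
{
    public const int CurrentVersion = 3;

    // One upgrade step per schema version: key N transforms a row written by
    // version N into the shape expected by version N + 1. The steps shown are
    // hypothetical examples of column changes.
    private static readonly Dictionary<int, Action<Dictionary<string, string>>> Steps =
        new Dictionary<int, Action<Dictionary<string, string>>>
        {
            // v1 -> v2: column "Title" was renamed to "Name"
            { 1, row => { row["Name"] = row["Title"]; row.Remove("Title"); } },
            // v2 -> v3: new column "IsActive" with a default value
            { 2, row => { row["IsActive"] = "1"; } },
        };

    // Applies every step between the manifest's version and the current one.
    public static void Upgrade(Dictionary<string, string> row, int manifestVersion)
    {
        if (manifestVersion < 1 || manifestVersion > CurrentVersion)
            throw new ArgumentOutOfRangeException(nameof(manifestVersion));
        for (int v = manifestVersion; v < CurrentVersion; v++)
        {
            Steps[v](row);
        }
    }
}
```

Each schema change then only requires adding one new step, and old backups of any age pass through the whole chain.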
Hopefully that helps ;)

Because you keep all the data in just one database, it will always be difficult to export or back up data on a per-customer basis.
Even if you implement such a scenario now, you will end up with two different places you need to maintain, change and test every time you change the database schema (fixing bugs, adding new features, optimization, etc).
I would recommend you to partition the data, say, by using a database per organization. Then you change your application just once (mainly around building a connection string for the specified organization), and then you can safely export/backup each database separately in a way you want it.
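The connection string part could be as small as the sketch below. The `App_<key>` database naming scheme, the `TenantConnections` class and the use of integrated security are assumptions made for the example, not something the original setup prescribes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TenantConnections
{
    // Builds a SQL Server connection string for one organization's database.
    // The "App_<key>" naming scheme is a hypothetical convention.
    public static string ForOrganization(string server, string organizationKey)
    {
        // Restrict the key to letters, digits and underscores so that it
        // cannot smuggle extra connection-string keywords.
        if (!Regex.IsMatch(organizationKey, @"\A[A-Za-z0-9_]+\z"))
            throw new ArgumentException("Invalid organization key.", nameof(organizationKey));

        return "Server=" + server + ";Database=App_" + organizationKey + ";Integrated Security=true;";
    }
}
```

The rest of the data access code can stay unchanged, since it only sees a different connection string per request.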
It also gives you a lot of extra benefits "for free", such as scalability and the ability to dedicate resources on a per-organization basis (should that be needed in the future).
Say, you have a set of small and low priority (from a business point of view) organizations, and a big and high priority one. So you will be able to keep a set of small low priority databases on one server, but dedicate another one for that specific important big one.
Or if your current DB server is overloaded (perhaps you have A LOT of data and A LOT of requests to the database), you can simply get another cheap server and move half of the load without any changes in your system...
You still need to write something in order to split the existing big database into several small ones, but you do it just once, and after it is done this "migration tool" can be thrown away so you don't need to support it anymore.

Have you tried SyncFramework?
Have a look at this article!
It explains how to sync filtered data between databases using Sync Framework.
You can sync to the customer's database or sync to your own empty db and then export it as a file.

Have you thought about using an ORM (Object Relational Mapper)?
I know, and use, LLBLGen Pro (so I can only talk about the features of this specific ORM).
Anyway, with LLBLGen you can reverse-engineer the DB and create a hierarchy of classes that map the tables and relations of your DB.
Now, if all the data of a customer is reachable via relations, I can tell my ORM framework to load a single customer (one row of a specific table) and then load all the related data in the related tables.
If the data is not too complex, it should be possible.
If you have hundreds of self-referencing tables or strange relations, it may not be doable; it depends on your data.
If all the data of a single customer is, say, 10'000 rows in 100 tables, it will probably work.
If all the data is 100'000 rows in 1000 tables it "may" work if you have some time and a lot of memory.
If all the data is 10'000'000 rows you probably can't load it all at once, and you'll need a more efficient way.
Anyway, if you can load all the data at once, then you'll have a nice "in memory" graph with all the data of a single customer, and then you can serialize this data, or project it onto a dataset (obtaining a set of datatables/relations) and then serialize the dataset.
Using an ORM to load and export all the data of a single customer as explained is probably not the most efficient way of doing things, but when doable it's a simple and cheap way.
Naturally, with or without an ORM, you can find hundreds of different ways to export this data :-)

For your design, you should have sharded your database by customer.
However, as you have already developed the database design, I suggest you create a temp database and create the new tables in it following the FK relations.
For this, you need to sort the tables based on the FK relationship and create them in the temp database.
Then, select the table data from the source database and insert them in the temp database.
You can also use this technique to shard your database and revamp your database design.
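Sorting the tables by FK relationship is a topological sort. Here is a minimal, layer-by-layer sketch (a variant of Kahn's algorithm), assuming you have already read the foreign keys into a dictionary that maps each table to the tables it references. The `TableOrdering` class is illustrative, and every referenced table is assumed to appear as a key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableOrdering
{
    // parents: table name -> tables it references through foreign keys.
    // Every referenced table is assumed to be present as a key.
    // Returns the tables ordered so that every parent precedes its children.
    // Throws if the foreign keys form a cycle.
    public static List<string> SortByForeignKeys(IDictionary<string, string[]> parents)
    {
        var remaining = parents.ToDictionary(
            p => p.Key,
            p => new HashSet<string>(p.Value.Where(x => x != p.Key))); // ignore self-references
        var ordered = new List<string>();

        while (remaining.Count > 0)
        {
            // Tables whose parents have all been placed already.
            var ready = remaining.Where(p => p.Value.Count == 0)
                                 .Select(p => p.Key)
                                 .OrderBy(name => name, StringComparer.Ordinal)
                                 .ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("Circular foreign key dependency.");

            foreach (var table in ready)
            {
                ordered.Add(table);
                remaining.Remove(table);
            }
            foreach (var deps in remaining.Values)
            {
                deps.ExceptWith(ready);
            }
        }
        return ordered;
    }
}
```

Creating and filling the tables in the returned order means no insert ever violates a foreign key.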
Aravind

What's the purpose of Datasets?

I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Also, which way is better? Updating the data in the dataset and then transferring it to the database all at once, or updating the database directly?

I want to understand the purpose of datasets when we can directly communicate with the database using simple SQL statements.
Why do you have food in your fridge, when you can just go directly to the grocery store every time you want to eat something? Because going to the grocery store every time you want a snack is extremely inconvenient.
The purpose of DataSets is to avoid directly communicating with the database using simple SQL statements. The purpose of a DataSet is to act as a cheap local copy of the data you care about so that you do not have to keep on making expensive high-latency calls to the database. They let you drive to the data store once, pick up everything you're going to need for the next week, and stuff it in the fridge in the kitchen so that it's there when you need it.
Also, which way is better? Updating the data in the dataset and then transferring it to the database all at once, or updating the database directly?
You order a dozen different products from a web site. Which way is better: delivering the items one at a time as soon as they become available from their manufacturers, or waiting until they are all available and shipping them all at once? The first way, you get each item as soon as possible; the second way has lower delivery costs. Which way is better? How the heck should we know? That's up to you to decide!
The data update strategy that is better is the one that does the thing in a way that better meets your customer's wants and needs. You haven't told us what your customer's metric for "better" is, so the question cannot be answered. What does your customer want -- the latest stuff as soon as it is available, or a low delivery fee?

DataSets support a disconnected architecture. You can add local data, delete from it, and then commit everything to the database using a SqlDataAdapter. You can even load an XML file directly into a DataSet. It really depends on what your requirements are. You can even set up in-memory relations between tables in a DataSet.
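A small sketch of that disconnected model, using only `System.Data`. The table and column names are made up, and the rows are added by hand where a real application would fill them from the database through a data adapter.

```csharp
using System.Data;

public static class DataSetDemo
{
    // Builds a small in-memory DataSet with a parent/child relation.
    // Table and column names are hypothetical.
    public static DataSet Build()
    {
        var ds = new DataSet("School");

        var areas = ds.Tables.Add("SkillArea");
        var areaId = areas.Columns.Add("Id", typeof(int));
        areas.Columns.Add("Name", typeof(string));

        var programs = ds.Tables.Add("Program");
        programs.Columns.Add("Id", typeof(int));
        var programAreaId = programs.Columns.Add("SkillAreaId", typeof(int));

        // In-memory relation; by default this also creates the matching
        // unique and foreign key constraints.
        ds.Relations.Add("Area_Programs", areaId, programAreaId);

        areas.Rows.Add(1, "Reading");
        programs.Rows.Add(10, 1);
        programs.Rows.Add(11, 1);

        // Mark the loaded rows as unchanged, as if they had just been
        // filled from the database.
        ds.AcceptChanges();
        return ds;
    }
}
```

After local edits, each row's `RowState` records what happened to it and `GetChanges()` returns only the modified rows, which is the information a data adapter uses when it writes the changes back in one batch.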
And by the way, embedding raw SQL queries in your application is a poor way to design it. If the queries are built by concatenating user input, your application will be prone to SQL injection. Secondly, stored procedures get a compiled execution plan that SQL Server caches and reuses, and it can recompile that plan as the data grows, so you may see a performance improvement. At the very least, use stored procedures and validate input in them. As long as they take parameters and do not build dynamic SQL internally, they are resistant to SQL injection.
Stored Procedures and Dataset are the way to go.
See this diagram:
Edit: If you are on .NET Framework 3.5 or 4.0 you can use a number of ORMs like Entity Framework, NHibernate or SubSonic. ORMs represent your business model more realistically. You can always use stored procedures with an ORM if some feature is not supported by it.
For example, if you are writing a recursive CTE (Common Table Expression), stored procedures are very helpful. You will run into a lot of problems if you try to use Entity Framework for that.

This page explains in detail in which cases you should use a Dataset and in which cases you use direct access to the databases

My usual practice is that if I need to perform a bunch of analytical processes on a large set of data, I fill a dataset (or a datatable, depending on the structure). That way it is a disconnected model from the database.
But for DML queries I prefer the quick hits directly to the database (preferably through stored procs). I have found this is the most efficient, and with well tuned queries it is not bad at all on the db.

Store data in file system rather than SQL or Oracle database

I am working on an Employee Management system and have two tables (for example) in the database, as given below.
EmployeeMaster (DB table structure)
EmployeeID (PK) | EmployeeName | City
MonthMaster (DB table structure)
Month | Year | EmployeeID (FK) | PrenentDays | BasicSalary
Now my question is: I want to store the data in the file system rather than in SQL Server or Oracle.
I want file system storage that supports Insert, Edit and Delete operations while keeping the relations between objects too.
I am a C# developer. Does anybody have thoughts or ideas on this (storing data in the file system while keeping the relations between objects)?
Thanks in advance.
Any ideas on it?

If you want to perform EDIT/DELETE operations, then don't do this. You will only end up recreating a database from scratch, so you might as well just use one. If you don't want to use SQL Server or Oracle, then use MySQL, PostgreSQL, or any of the various in-memory (persist-to-disk) databases out there. If you need to maintain human-readable or plain-text data files, then still use an in-memory database, and save as .csv when persisting to disk.
Using external files can work well if you are doing batch processing, and focusing on APPEND operations only—I do this regularly, and achieve throughput that is simply impossible with a relational database.
You can also use the filesystem effectively if you use one file per record, and your operations are restricted to MAP/INSERT/DELETE/REPLACE; and, never attempt UPDATE. But again, if I need to do updates or correlations, or any of a number of other interesting queries I use a database.
Finally, you can use the filesystem if your operations are DUMP/RESTORE from in memory datastructures to a single file. In which case you should use whatever the standard XML persistence library is for your platform and just perform a RESTORE when you start the application, and a DUMP on exit or periodic save. Pretty much as soon as you move to multiple files you should be looking at a database and ORM again.
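For the DUMP/RESTORE case, a minimal C# sketch with `XmlSerializer` might look like this. The classes mirror the EmployeeMaster and MonthMaster tables from the question, with the monthly rows nested inside each employee to keep the relation; the class names, the `PresentDays` spelling and the `EmployeeStore` helper are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Shapes follow the EmployeeMaster / MonthMaster tables from the question.
public class Employee
{
    public int EmployeeID { get; set; }
    public string EmployeeName { get; set; } = "";
    public string City { get; set; } = "";
    // The FK relation is kept by nesting the monthly rows in the employee.
    public List<MonthRecord> Months { get; set; } = new List<MonthRecord>();
}

public class MonthRecord
{
    public int Month { get; set; }
    public int Year { get; set; }
    public int PresentDays { get; set; }
    public decimal BasicSalary { get; set; }
}

public static class EmployeeStore
{
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(List<Employee>));

    // DUMP: write the whole in-memory list to a single file.
    public static void Dump(List<Employee> employees, string path)
    {
        using (var stream = File.Create(path))
        {
            Serializer.Serialize(stream, employees);
        }
    }

    // RESTORE: read the whole list back, e.g. at application start.
    public static List<Employee> Restore(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            return (List<Employee>)Serializer.Deserialize(stream);
        }
    }
}
```

Inserts, edits and deletes then happen on the in-memory list, and the file is rewritten as a whole on save. That is fine for small data and a single process, which is the limit the answer describes.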
Ultimately, unless you are dealing with a very small amount of data; or you are dealing with a large amount of data (at least 100's of millions of records), stick with a database.

You could use SQLite which is basically a "lightweight" DBMS hosted as a DLL in your process.
That would work well for you, if data format doesn't have to be human-readable and if concurrent data access (by several processes at once) is not required.

Before diving into creating a database system, you will definitely need to know how relational databases really work internally. There are several ways to organize how you serialize your data: serializing the database object as a whole, or the MySQL approach of storing the database and its tables in separate files. Although the MySQL way does reveal which tables belong to the database, the server does not have to load the whole database at once, only the table(s) that a statement queries, which makes the table cache smaller and faster. I tend to agree.
If you are not going to include some kind of SQL language and will work purely through code, then you shouldn't have too many issues, provided you set up your classes right with a good cache (I mean one that doesn't duplicate objects). If you are aiming for T-SQL support, then you will need to create a parser, and speaking from experience with T-SQL parsing, you have a lot of coding to do, including token flags, checks and bounds.
Then you need to decide whether you want to incorporate views, functions and triggers.
If you are not ready for all that, then stick to a server database or an embedded database for your needs.
