Transferring data from .csv to a DB: which is the best way? - c#

At the moment I'm working on a quite tricky transfer from a .csv file to a DB. I have to develop a package/solution/xxxyyy that handles a flow of data from this .csv file to my SQL Server DB (the .csv is updated with new data every day).
The approach that my boss "suggested" I should use is SSIS (normally I would have written some kind of "parser" to easily convey the data from the .csv). The fact is that I have quite a bit of transformation to do.
i.e.
An employee has these fields:
name;surname;id;roles
The field "roles" is formatted like this:
role1,role2,role3
This relationship in my db is mapped in 3 different tables:
tblEmployee
PK_Emp | name | surname
tblRoles
PK_Role | roleName
tblEmployeeRole
PK_Emp | PK_Role
So, from the .csv I have to extract the roles of a single employee and insert them in tblRoles (checking that there are no duplicates). Then I have to manage the relationship in tblEmployeeRole.
Considering that this is just an example of one of the different transformations I have to manage, I was wondering if SSIS is the best tool to achieve my goal (loads of script components). When I explained my perplexities to my boss he came up with this "idea":
Use SSIS to transfer the data, as they are, into a temporary table, then handle the different transformations through stored procedures.
From the very little I know about stored procedures, I'm not sure that I should follow this idea.
Now, considering that my actual superior isn't the most enlightened project manager (he usually messes up our work with bizarre ideas), and considering that I'm not much of an expert in either SSIS or stored procedures, I've decided to write here and see if anyone can explain whether one of the previous approaches is the right one, or if I should consider some other (better) solution.
Sorry for my poor English, ty for any help =)

I would insert the data from the CSV file as-is.
Then do any parsing on the database end. If this is something that has to be done often, I would then take any scripts you have made to do this and create procedures/functions from them. This question is a bit grand-scheme, so this is only a general solution. If you need help parsing the roles into the lookup tables, that would be more specific and of better use.
In general when I work with massive flat-file data sets that need to be parsed into a SQL structure:
Import the data as-is
Find the commonalities among the lookup codes
Create the base look up tables (in your case it would be tblRoles)
Create a script to insert into both tblEmployee and tblEmployeeRole (see the sketch at the end of this answer)
Once my test scenarios work then I worry about combining each component step into one monolithic SSIS or stored procedure.
I suggest something similar here. Break this import task into small pieces and worry about the grand design later. SSIS, procs, compiled code...any of these might work for you. You just need to know what you need it to do.
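As a rough sketch of what the "parse it on the database end" step can look like for the roles example: assuming the csv was loaded as-is into a staging table I'll call stgEmployee, that the csv id column maps to PK_Emp, and SQL Server 2016+ for STRING_SPLIT (on older versions a user-defined split function or an SSIS script component would do the splitting), something like this run from C# would populate tblRoles and tblEmployeeRole in one set-based pass:

using System.Data.SqlClient;

// Minimal sketch: the staging table name, the id-to-PK_Emp mapping and the
// connection string are my assumptions, not part of the question.
const string transformSql = @"
    INSERT INTO tblRoles (roleName)
    SELECT DISTINCT LTRIM(s.value)
    FROM stgEmployee e
    CROSS APPLY STRING_SPLIT(e.roles, ',') s
    WHERE LTRIM(s.value) NOT IN (SELECT roleName FROM tblRoles);

    INSERT INTO tblEmployeeRole (PK_Emp, PK_Role)
    SELECT e.id, r.PK_Role
    FROM stgEmployee e
    CROSS APPLY STRING_SPLIT(e.roles, ',') s
    JOIN tblRoles r ON r.roleName = LTRIM(s.value)
    WHERE NOT EXISTS (SELECT 1 FROM tblEmployeeRole er
                      WHERE er.PK_Emp = e.id AND er.PK_Role = r.PK_Role);";

using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true")) // placeholder
using (var cmd = new SqlCommand(transformSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();   // one set-based pass instead of row-by-row parsing
}

Once something like this works in isolation, it is easy to move the SQL into a stored procedure and call it from an Execute SQL task at the end of the SSIS package.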

Depending upon your transformations they can all be done within SSIS. If you don't need to store the raw .csv data, I would stay away from stored procedures and temporary tables as you are bypassing a large portion of SSIS's strengths.
As an example, you can do look-ups on your incoming data to determine proper relationships and insert those results into multiple tables (your 3 in the example).

Looks like the task is very suitable for the bcp utility or the BULK INSERT command.
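If the raw load really is just "get the file into a staging table", a BULK INSERT fired from C# is about as small as it gets. A sketch only: the path, terminators and staging table name are assumptions, and the file has to be readable from the SQL Server machine, not the client.

using System.Data.SqlClient;

// Hedged sketch: file path, field terminator and staging table are assumptions.
var bulkSql = @"
    BULK INSERT dbo.stgEmployee
    FROM 'C:\import\employees.csv'
    WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', FIRSTROW = 2);";

using (var conn = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true")) // placeholder
using (var cmd = new SqlCommand(bulkSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}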

Related

When to use Temporary SQL Tables vs DataTables

I don't know whether it is better to use temporary tables in SQL Server or a DataTable in C# for a report. Here is the scope of the report: it will be copied into a workbook with about 10 worksheets, each worksheet containing about 1000 rows and about 30 columns, so it's a lot of data. There is some guidance out there, but I could not find anything specific regarding how much data is too much for a DataTable. According to https://msdn.microsoft.com/en-us/library/system.data.datatable.aspx, the limit is about 16 million rows, but my data set seems unwieldy considering the number of columns I have. Plus, I will either have to make multiple SQL queries to collect the data for my report or try to write a stored procedure in SQL to collect that data. How do I resolve this quandary?
My rule of thumb is that if it can be processed on the database server, it probably should be. Keep in mind, no matter how efficient your C# code is, SQL Server will most likely do it faster and more efficiently; after all, it was designed for data manipulation.
There is no shame in using #temp tables. They maintain stats and can be indexed and/or manipulated. One recent example: a developer created an admittedly elegant query using a CTE; its performance was 12-14 seconds vs. 1 second for mine using #temp tables.
Now, one carefully structured stored procedure could produce and return the 10 data-sets for your worksheets. If you are using a product like SpreadSheetLight (there are many options available), it becomes a small matter of passing the results and creating the tabs (no cell level looping... unless you want or need to).
I would also like to add, you can dramatically reduce the number of touch points and better enforce the business logic by making SQL Server do the heavy lifting. For example, a client introduced a 6W risk rating, which was essentially a 6.5. HUNDREDS of legacy reports had to be updated, while I only had to add the 6W into my mapping table.
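Consuming that one procedure from C# is then a short loop, since DataTable.Load walks the result sets one at a time. A sketch, with the procedure name, parameters and connection string as placeholders:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Placeholders: rpt_WorkbookData and its parameters stand in for your procedure.
string connectionString = "Server=.;Database=Reports;Integrated Security=true";
var startDate = new DateTime(2016, 1, 1);
var endDate = new DateTime(2016, 12, 31);

var tables = new List<DataTable>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.rpt_WorkbookData", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@StartDate", startDate);
    cmd.Parameters.AddWithValue("@EndDate", endDate);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (!reader.IsClosed)
        {
            var dt = new DataTable();
            dt.Load(reader);      // loads one result set and advances to the next
            tables.Add(dt);       // one DataTable per worksheet tab
        }
    }
}

Each DataTable can then be handed to whatever spreadsheet library you use for one tab.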
There's a lot of missing context here - how is this report going to be accessed and run? Is this going to run as a scripted event every day?
Have you considered SSRS?
In my opinion it's best to abstract away your business logic by creating Views or Stored Procedures in the database. Stored Procedures would probably be the way to go but it really depends on your specific environment. Then you can point whatever tools you want to use at the database object. This has several advantages:
if you end up having different versions or different formats of the report, and your logic ever changes, you can update the logic in one place rather than many.
your code is simpler and cleaner, typically:
select v.col1, v.col2, v.col3
from MY_VIEW v
where v.date between @startdate and @enddate
I assume your 10 spreadsheets are going to be something like
Summary Page | Department 1 | Department 2 | ...
So you could make a generalized View or SP, create a master spreadsheet linked to the db object that pulls all the relevant data from SQL, and use Pivot Tables or filters or whatever else you want, and use that to generate your copies that get sent out.
But before going to all that trouble, I would make sure that SSRS is not an option, because if you can use that, it has a lot of baked in functionality that would make your life easier (export to Excel, automatic date parameters, scheduled execution, email subscriptions, etc).

Best way to have 2 connections to sql server (one read one write)

I have a very large number of rows (10 million) which I need to select out of a SQL Server table. I will go through each record and parse it (they are XML), and then write each one back to the database via a stored procedure.
The question I have is, what's the most efficient way to do this?
The way I am doing it currently is to open 2 SqlConnections (one for reading, one for writing). The read one uses a SqlDataReader; it basically does a select * from the table and I loop through the results. After I parse each record I do an ExecuteNonQuery (using parameters) on the second connection.
Are there any suggestions to make this more efficient, or is this just the way to do it?
Thanks
It seems that you are writing rows one-by-one. That is the slowest possible model. Write bigger batches.
There is no need for two connections when you use MARS. Unfortunately, MARS forces a 14 byte row versioning tag in each written row. Might be totally acceptable, or not.
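One low-effort way to get bigger batches without reshaping the whole job is to keep the parameterized write command but commit in chunks rather than per row. A sketch, where dbo.SaveParsedRecord, @Payload and ParseRecords() stand in for your existing procedure and read/parse loop:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

string writeConnectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder
const int batchSize = 1000;
int pending = 0;

using (var write = new SqlConnection(writeConnectionString))
{
    write.Open();
    var tx = write.BeginTransaction();
    var cmd = new SqlCommand("dbo.SaveParsedRecord", write, tx) { CommandType = CommandType.StoredProcedure };
    var p = cmd.Parameters.Add("@Payload", SqlDbType.NVarChar, -1);

    foreach (string parsed in ParseRecords())
    {
        p.Value = parsed;
        cmd.ExecuteNonQuery();
        if (++pending % batchSize == 0)
        {
            tx.Commit();                    // flush a batch
            tx = write.BeginTransaction();  // start the next one
            cmd.Transaction = tx;
        }
    }
    tx.Commit();                            // final partial batch
}

// Stand-in for the existing SqlDataReader + XML parsing loop.
IEnumerable<string> ParseRecords() { yield return "<parsed />"; }

Tune the batch size by testing; the point is simply to stop paying one transaction and one round-trip per row.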
I had a very similar situation and here is what I did:
I made two copies of the same database.
One is optimized for reading and the other is optimized for writing.
In config, I kept two connection strings, ConnectionRead and ConnectionWrite.
Now in the data layer, when I have a read statement (select...), I switch to the ConnectionRead connection string, and when writing I use the other one.
Since I have to keep both databases in sync, I am using SQL replication for this job.
I understand the implementation depends on many aspects, but this approach may help you.
I agree with Tim Schmelter's post - I did something very similar. I actually used a SQLCLR procedure which read the data from an XML column in a SQL table into an in-memory DataTable using .NET (System.Data), then used the System.Xml namespace to deserialize the XML, populated another in-memory table (in the shape of the destination table), and used SqlBulkCopy to populate that destination SQL table with the parsed attributes I needed.
SQL Server is engineered for set-based operations. If ever I'm shredding/iterating (row-by-row) I tend to use SQLCLR, as .NET is generally better at iterative/data-manipulative processing. An exception to my rule is when working with a little metadata for data-driven processes or cleanup routines, where I may use a cursor.
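The same shape works outside SQLCLR too: shred the XML into a DataTable that matches the destination and let SqlBulkCopy do the writing. A sketch, where ReadSourceXml(), the column names and the destination table are placeholders, not the actual schema:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

string writeConnectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder

var shaped = new DataTable();
shaped.Columns.Add("Id", typeof(int));
shaped.Columns.Add("Name", typeof(string));

void Flush()
{
    if (shaped.Rows.Count == 0) return;
    using (var bulk = new SqlBulkCopy(writeConnectionString))
    {
        bulk.DestinationTableName = "dbo.ParsedDestination";
        bulk.WriteToServer(shaped);              // set-based write instead of per-row inserts
    }
    shaped.Clear();
}

foreach (string xml in ReadSourceXml())          // your existing 10M-row read loop
{
    var doc = XDocument.Parse(xml);
    shaped.Rows.Add((int)doc.Root.Element("Id"),
                    (string)doc.Root.Element("Name"));
    if (shaped.Rows.Count == 50000) Flush();     // flush in chunks
}
Flush();

// Stand-in for the SqlDataReader that streams the XML column.
IEnumerable<string> ReadSourceXml() { yield return "<row><Id>1</Id><Name>example</Name></row>"; }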

Guidance on data source(s) for my project

I'm a novice programmer. I am full of theoretical knowledge, but I'm behind on the practice. OK. I am trying to make a program for adding categories and descriptions to files. The language is C# and it should run on Windows 7...
1.The categories can contain sub-categories.
I don't want to call them "tags", because these are different. A category can be, for example, "favorites". But it can also be "favorites->music->2013". You can create sub-categories; I will use a TreeView on a WinForm for all the operations a user can do with them.
QUESTION: Should I use XML file for the categories?
2.Every file CAN have a description and one or many categories. However:
Even if the file is deleted, I want to keep its description, so that it can be available for later usage.
Folders themselves will be omitted: folders cannot have categories or a description, but the files they contain can.
I made a very simple SQL Server database containing one table: http://img832.imageshack.us/img832/3931/finalprojectdb.png
QUESTION: Is this a good idea? Maybe it would be better for the categories column to be of type XML?
Any advice on what the best approach in this situation should be is welcome. Thanks in advance!
SQL is not great for getting nested data all at once. You can store things in XML, which gives you a lot of flexibility, but you also have to write a parser or deserializer for it. Nowadays people also often just write a small class and use something like Newtonsoft (Json.NET) to serialize and deserialize it automatically.
If you want a DB solution, you can use something like SQLite embedded in your application if you don't want to install a database separately.
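To make the Newtonsoft suggestion concrete, here is a minimal sketch of a self-nesting Category class round-tripped through JSON; the class shape is an assumption, not a prescribed design:

using System.Collections.Generic;
using Newtonsoft.Json;

// favorites -> music -> 2013
var favorites = new Category
{
    Name = "favorites",
    Children = { new Category { Name = "music",
                 Children = { new Category { Name = "2013" } } } }
};

string json = JsonConvert.SerializeObject(favorites, Formatting.Indented);
Category roundTripped = JsonConvert.DeserializeObject<Category>(json);

public class Category
{
    public string Name { get; set; }
    public List<Category> Children { get; set; } = new List<Category>();
}

The same tree maps naturally onto the TreeView nodes you plan to show on the WinForm.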
XML is a great design for an app that needs to communicate cross-platform (say C# to Java), cross-internet, or cross-network. But as a way to store data as a subset in a table, not really.
A normalized database is a terrific tool. It can be indexed (XML cannot), which allows for rapid querying of data. If you de-normalize your data by embedding XML in a column, querying it will be slow and updating/maintaining it will be a pain.
I personally prefer foreign key tables.

Why is creating tables at run-time (code-behind) bad?

People suggest that creating database tables dynamically (i.e., at run-time) should be avoided, saying that it is bad practice and will be hard to maintain.
I don't see the reason why, and I don't see the difference between creating a table and any other SQL query/statement such as SELECT or INSERT. I wrote apps that create, delete and modify databases and tables at run time, and so far I have not seen any performance issues.
Can anyone explain the cons of creating databases and tables at run-time?
Tables are much more complex entities than rows, and managing table creation is much more complex than an insert, which has to abide by an existing model: the table. True, a CREATE TABLE statement is a standard SQL operation, but depending on creating tables dynamically smacks of a bad design decision.
Now, if you just create one or two and that's it, or create an entire database dynamically, or from a script once, that might be OK. But if you depend on having to create more and more tables to handle your data, you will also need to join more and more and query more and more. One very serious issue I encountered with an app that made use of dynamic table creation is that a single SQL Server query can only involve 255 tables. It's a built-in constraint. (And that's SQL Server, not CE.) It only took a few weeks in production for this limit to be reached, resulting in a nonfunctioning application.
And if you get into editing the tables, e.g. adding/dropping columns, then your maintenance headache gets even worse. There's also the matter of binding your db data to your app's logic. Another issue is upgrading production databases. This would really be a challenge if a db had been growing with objects dynamically and you suddenly needed to update the model.
When you need to store data in such a dynamic manner the standard practice is to make use of EAV models. You have fixed tables and your data is added dynamically as rows so your schema does not have to change. There are drawbacks of course but it's generally thought of as better practice.
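A hedged sketch of what the EAV approach looks like from C#: instead of issuing CREATE TABLE at run time, a new "field" becomes a row in a fixed attribute/value table. The table and column names here are illustrative only, not a prescribed schema:

using System.Data.SqlClient;

string connectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder

const string sql = @"
    INSERT INTO EntityAttributeValue (EntityId, AttributeName, AttributeValue)
    VALUES (@EntityId, @AttributeName, @AttributeValue);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@EntityId", 42);
    cmd.Parameters.AddWithValue("@AttributeName", "FavouriteColour");
    cmd.Parameters.AddWithValue("@AttributeValue", "teal");
    conn.Open();
    cmd.ExecuteNonQuery();   // the schema stays fixed no matter how many attributes appear
}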
KMC,
Remember the following points:
What if you want to add or remove a column? You may need to change the code and compile it again.
What if the database location changes?
Developers who are not very good with databases can make changes; if you create the schema on the back end, DBAs can take care of it.
If you get any performance issues, it may be tough to debug.
You will need to be a little clearer about what you mean by "creating tables".
One reason to not allow the application to control table creation and deletion is that this is a task that should be handled only by an administrator. You don't want normal users to have the ability to delete whole tables.
Temporary tables are a different story, and you may need to create temporary tables as part of your queries, but your basic database structure should be managed only by someone with the rights to do so.
Sometimes, creating tables dynamically is not the best option security-wise (Google SQL injection), and it would be better to use stored procedures and have your insert or update operations occur at the database level by executing the stored procedures from code.
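For example, instead of building SQL (or DDL) strings in code, the write path can go through a parameterized stored procedure call. A small sketch, with the procedure and parameter names as placeholders:

using System.Data;
using System.Data.SqlClient;

string connectionString = "Server=.;Database=MyDb;Integrated Security=true"; // placeholder
string userSuppliedCity = "Copenhagen";   // imagine this came from user input

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.UpdateEmployeeCity", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@EmployeeID", 7);
    cmd.Parameters.AddWithValue("@City", userSuppliedCity);  // never concatenated into SQL text
    conn.Open();
    cmd.ExecuteNonQuery();
}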

Store data in file system rather than SQL or Oracle database

As I am working on an Employee Management system, I have two tables (for example) in the database, as given below.
EmployeeMaster (DB table structure)
EmployeeID (PK) | EmployeeName | City
MonthMaster (DB table structure)
Month | Year | EmployeeID (FK) | PrenentDays | BasicSalary
Now my question is: I want to store the data in the file system rather than in a SQL Server or Oracle database.
I want my data in file system storage, with insert, edit and delete operations, while keeping the relationships between objects.
I am a C# developer. Does anybody have thoughts or ideas on this (storing data in the file system while keeping the relations between records)?
Thanks in advance.
Any ideas on it?
If you want to perform EDIT/DELETE operations, then don't do this. You will only end up recreating a database from scratch, so you might as well just use one. If you don't want to use SQL Server or Oracle, then use MySQL, PostgreSQL, or any of the various in-memory (persist-to-disk) databases out there. If you need to maintain human-readable or plain-text data files, then still use an in-memory database and save as .csv when persisting to disk.
Using external files can work well if you are doing batch processing, and focusing on APPEND operations only—I do this regularly, and achieve throughput that is simply impossible with a relational database.
You can also use the filesystem effectively if you use one file per record, and your operations are restricted to MAP/INSERT/DELETE/REPLACE; and, never attempt UPDATE. But again, if I need to do updates or correlations, or any of a number of other interesting queries I use a database.
Finally, you can use the filesystem if your operations are DUMP/RESTORE from in memory datastructures to a single file. In which case you should use whatever the standard XML persistence library is for your platform and just perform a RESTORE when you start the application, and a DUMP on exit or periodic save. Pretty much as soon as you move to multiple files you should be looking at a database and ORM again.
Ultimately, unless you are dealing with a very small amount of data; or you are dealing with a large amount of data (at least 100's of millions of records), stick with a database.
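If you do go the DUMP/RESTORE route, .NET's XmlSerializer is the standard persistence piece. A minimal sketch, with the Employee shape and file name assumed from the tables in the question:

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Assumed shape based on EmployeeMaster; adapt to your real model.
public class Employee
{
    public int EmployeeID { get; set; }
    public string EmployeeName { get; set; }
    public string City { get; set; }
}

public static class EmployeeStore
{
    private const string FileName = "employees.xml";
    private static readonly XmlSerializer Serializer = new XmlSerializer(typeof(List<Employee>));

    // DUMP on exit (or on a periodic save)
    public static void Dump(List<Employee> employees)
    {
        using (var stream = File.Create(FileName))
            Serializer.Serialize(stream, employees);
    }

    // RESTORE on application start
    public static List<Employee> Restore()
    {
        if (!File.Exists(FileName)) return new List<Employee>();
        using (var stream = File.OpenRead(FileName))
            return (List<Employee>)Serializer.Deserialize(stream);
    }
}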
You could use SQLite, which is basically a "lightweight" DBMS hosted as a DLL in your process.
That would work well for you if the data format doesn't have to be human-readable and if concurrent data access (by several processes at once) is not required.
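A minimal sketch of that option using the Microsoft.Data.Sqlite provider (System.Data.SQLite looks much the same); the table follows the EmployeeMaster structure from the question:

using Microsoft.Data.Sqlite;

// One file on disk, normal SQL, no separate server process.
using (var conn = new SqliteConnection("Data Source=employees.db"))
{
    conn.Open();

    var create = conn.CreateCommand();
    create.CommandText = @"
        CREATE TABLE IF NOT EXISTS EmployeeMaster (
            EmployeeID   INTEGER PRIMARY KEY,
            EmployeeName TEXT,
            City         TEXT);";
    create.ExecuteNonQuery();

    var insert = conn.CreateCommand();
    insert.CommandText = "INSERT INTO EmployeeMaster (EmployeeName, City) VALUES ($name, $city)";
    insert.Parameters.AddWithValue("$name", "Alice");
    insert.Parameters.AddWithValue("$city", "Pune");
    insert.ExecuteNonQuery();
}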
Before diving into creating a database system, you will definitely need to know the inner workings of how relational databases really function. There are several ways to organize how you serialize your data: either typical serialization of the database object as a whole, or the MySQL approach of serializing the database file and tables into separate files for retrieval. The MySQL way reveals which tables belong to the database object, and the server does not have to load the whole database at once, only the table(s) the SQL statements query, which keeps the table cache smaller and faster. I tend to agree with that approach.
If you are not going to include some type of T-SQL language but will query simply through code, then you shouldn't have too many issues if you set up your classes right with a good cache (i.e., you don't duplicate objects). If you are aiming for T-SQL support, then you will need to create a parser; speaking from experience with T-SQL parsing, you have a lot of coding to do, along with creating token flags and bounds checks.
Then you need to decide whether you want to incorporate views, functions and triggers.
If you are not ready for all that, then stick to a server database or an embedded database for your needs.
