Speed difference between LINQ to XML and Excel with an OleDbConnection? - c#

One of my current requirements is to take in an Excel spreadsheet that the user updates about once a week and to be able to query that document for certain fields.
As of right now, I run through the Excel (2007) data and push it into an XML file (just once, when they upload the file; after that I just use the XML). The XML holds only the data needed for querying via LINQ to XML (not all of the columns in the spreadsheet), so the XML file is smaller than the Excel file.
Now my question is: is there any performance difference between querying an XML file with LINQ and an Excel file with an OleDbConnection? Am I just adding another unnecessary step?
I suppose the follow-up question would be: is it worth it, for ease of use, to keep pushing the data to XML?
The file has about 1000 rows.

For something that is done only once per week I don't see the need to perform any optimizations. Instead you should focus on what is maintainable and understandable both for you and whoever will maintain the solution in the future.
Use whatever solution you find most natural :-)

As I understand it, the performance side of things stacks up like this for accessing Excel data.
Fastest to Slowest
1. Custom 3rd party vendor software using C++ directly on the Excel file type.
2. OleDbConnection method, using a schema file if necessary for data types; treats Excel as a flat-file database.
3. LINQ to XML method, a superior method for reading/writing data, but limited to the Excel 2007 file formats only.
4. Straight XML data manipulation using the OOXML SDK and, optionally, 3rd party XML libraries. Again, limited to the Excel 2007 file formats only.
5. Using an Object[,] array to read a region of cells (via the .Value2 property), and passing an Object[,] array back to a region of cells (again via .Value2) to write data (see the interop sketch after this list).
6. Updating and reading from cells individually using the .Cells(x,y) and .Offset(x,y) property accessors.
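As a rough, hedged sketch of the difference between items 5 and 6 (it assumes a reference to the Microsoft.Office.Interop.Excel assembly and Excel being installed; the file path and range are placeholders):

// Bulk read via .Value2 versus cell-by-cell access; paths and ranges are invented.
using Excel = Microsoft.Office.Interop.Excel;

class Value2Sketch
{
    static void Main()
    {
        var app = new Excel.Application();
        Excel.Workbook book = app.Workbooks.Open(@"C:\data\input.xlsx");
        var sheet = (Excel.Worksheet)book.Worksheets[1];

        // Item 5: one COM call pulls the whole region back as an object[,] (1-based indices).
        Excel.Range region = sheet.Range["A1", "F1000"];
        object[,] values = (object[,])region.Value2;

        // Item 6: one COM round trip per cell - far slower for the same data.
        for (int row = 1; row <= 1000; row++)
        {
            object single = ((Excel.Range)sheet.Cells[row, 1]).Value2;
        }

        book.Close(false);
        app.Quit();
    }
}

The bulk Value2 read is faster mainly because it crosses the COM boundary once instead of once per cell.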

You can't use a SqlConnection to access an Excel spreadsheet. More than likely, you are using an OleDbConnection or an OdbcConnection.
That being said, I would guess that using the OleDbConnection to access the Excel sheet directly would be faster, as you are processing the data natively. The only way to know for the data you are using, though, is to test it yourself, either with the Stopwatch class in the System.Diagnostics namespace or with a profiling tool.
If you have a great deal of data to process, you might also want to consider putting it in SQL Server and then querying that (depending on the ratio of queries to the time it takes to save the data, of course).
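For what it's worth, a minimal timing sketch along those lines might look like this (the connection string, sheet name, file paths and XML element names are all assumptions, not taken from the question):

// Crude timing of both approaches with Stopwatch; not a proper benchmark.
using System;
using System.Data.OleDb;
using System.Diagnostics;
using System.Linq;
using System.Xml.Linq;

class TimingSketch
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        using (var conn = new OleDbConnection(
            @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\input.xlsx;" +
            "Extended Properties=\"Excel 12.0 Xml;HDR=YES\""))
        using (var cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                int rows = 0;
                while (reader.Read()) rows++;
            }
        }
        Console.WriteLine("OleDb read: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        XDocument doc = XDocument.Load(@"C:\data\extract.xml");
        int matches = doc.Descendants("Row")
                         .Count(r => (string)r.Element("Status") == "Open");
        Console.WriteLine("LINQ to XML query: {0} ms ({1} matches)", sw.ElapsedMilliseconds, matches);
    }
}

At roughly 1000 rows, both numbers will likely be small enough that the difference does not matter.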

I think it's important to discuss what type of querying you are doing with the file. I have to believe it will be a great deal easier to query using LINQ than with the OleDbConnection, although I am speaking more from experience than anything else.

Related

Are SSIS packages the best solution for importing and exporting large amounts of data?

My requirement is that in nopCommerce 1.9 I have to insert multiple discounts from an Excel sheet that contains a lot of data, so before starting this task I need to be clear about which solution is best.
Which is the fastest way in C# to upload an Excel sheet with more than 100,000 rows?
I read this question and its answers and found that SSIS is an option.
Is SSIS really the best choice for importing and exporting large files?
And what other benefits will I get if I use SSIS packages?
For ~100,000 rows, performance should not be a significant problem with this type of data.
SSIS can do this, but it is not the only option. I think there are three reasonable approaches to doing this:
SSIS:
This can read Excel files. If your spreadsheet is well behaved (i.e. can be trusted to be laid out correctly) then SSIS can load the contents. It has some error logging features, but in practice it can only usefully dump a log file or write errors out to a log table. Erroneous rows can be directed to a holding table.
Pros:
Load process is fairly easy to develop.
SSIS package can be changed independently of the application if the spreadsheet format has to change.
Can read directly from the spreadsheet file.
Cons:
Dependency on having the SSIS runtime installed on the system.
SSIS is really intended to be a server-side installation; error handling tends to consist of writing messages to logs. You would need to find a way to make error logs available to the user to troubleshoot errors.
BCP or BULK INSERT:
You can export the spreadsheet to a CSV and use BCP or a BULK INSERT statement to load the file; a minimal sketch of running BULK INSERT from code follows the list below. However, this requires the file to be exported to a CSV and copied to a drive on the database server, or to a share the server can access.
Pros:
Fast
bcp can be assumed to be present on the server.
Cons:
Requires manual steps to export to CSV
The file must be placed on a volume that can be mounted on the server
Limited error handling facilities.
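As referenced above, a hedged sketch of kicking off a BULK INSERT from C# (the connection string, staging table and file path are placeholders; the CSV must already be on a drive the SQL Server instance can reach):

// Runs a BULK INSERT statement against a staging table; all names are invented.
using System.Data.SqlClient;

class BulkInsertSketch
{
    static void Run()
    {
        const string sql =
            @"BULK INSERT dbo.ImportStaging
              FROM 'D:\imports\export.csv'
              WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)";

        using (var conn = new SqlConnection(
            "Data Source=.;Initial Catalog=ImportDb;Integrated Security=True"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}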
SqlBulkCopy API:
If you're already using .NET you can read from the spreadsheet using OLE automation or ODBC and load the data using the SQL Server bulk load API; a sketch of this appears after the list below. This requires you to write a C# routine to do the import. If the spreadsheet is loaded manually then it can be loaded from the user's PC.
Pros:
Does not require SSIS to be installed on the computer.
The file can be located on the user's PC.
Load process can be interactive, presenting errors to the user and allowing them to correct the errors with multiple retries.
Cons:
Most effort to develop.
Really only practical as a feature of an application.
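A hedged sketch of the SqlBulkCopy approach just described (the connection strings, sheet name and target table are assumptions):

// Reads the worksheet through OleDb into a DataTable, then bulk loads it.
using System.Data;
using System.Data.OleDb;
using System.Data.SqlClient;

class ExcelToSqlSketch
{
    static void Import()
    {
        const string excelConn =
            @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\uploads\discounts.xlsx;" +
            "Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";

        var table = new DataTable();
        using (var conn = new OleDbConnection(excelConn))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
        {
            adapter.Fill(table);   // pulls the worksheet into memory
        }

        using (var sql = new SqlConnection(
            "Data Source=.;Initial Catalog=Shop;Integrated Security=True"))
        {
            sql.Open();
            using (var bulk = new SqlBulkCopy(sql))
            {
                bulk.DestinationTableName = "dbo.Discounts";
                bulk.BatchSize = 5000;
                bulk.WriteToServer(table);
            }
        }
    }
}

Because this runs in your own application, you can validate rows before the bulk load and show any errors back to the user interactively.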
SSIS is an ETL tool. You can do transformations, error handling (as mentioned by Kumar) and look-ups in SSIS; you can redirect invalid rows, add derived columns and a lot more. You can even add configuration files to change some of the properties/parameters...
There are several other options for loading the data into SQL Server:
1. SSIS - you need to design the workflow (you need BIDS or VS to design and test the package).
2. As "demas" mentioned, you can export the data to a flat file and use BCP/BULK INSERT.
3. You can use the OPENROWSET function in SQL (ad-hoc distributed queries must be enabled to use this functionality). Then you can just query the Excel file from SQL; this could be the easiest way to read the data:
SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0','Excel 8.0;Database=C:\test.xls', 'SELECT * FROM [Sheet1$]')
Try searching Google for OPENROWSET + Excel to get more examples. In this scenario you can also query text files, Access, and so on.
There are more ways to do it, but it really depends on what you want to achieve. 100K rows is really not much in this case.
SSIS is a good solution, but if performance is the most important thing for you, I'd try converting the Excel file to a plain text file and using the BULK INSERT functionality.
Error logging - invalid rows can easily be logged into a separate table for further verification.
If you need to perform complex transformations and required-value checking, and the file is extremely large (100,000 rows in an Excel file is tiny), then SSIS may be the best solution. It is a very powerful, complex tool.
However, and it is a big however, SSIS is difficult to learn and work in effectively, and it is hard to debug. I perform imports and exports as my full-time job and it took me well over a year to get comfortable with SSIS. (Of course, what I do with the data is very complicated and not at all straightforward, so take this with a grain of salt.) Now if you aren't doing any complex transformations, it isn't that bad to set up, but it is nowhere near as simple as DTS was, for instance, mostly because it has so much more available functionality. If you are doing simple imports with no transformations, I believe that other methods might be more effective.
As far as performance goes, SSIS can shine (it is built to move millions or more records into data warehouses, where speed is critical) or be a real dog, depending on the way it is set up. It takes a good level of skill to get to the point where you can performance-tune one of these packages.

C# .NET - Datatype to map CSV file data

In one of my ASP.NET applications in C#, suppose I need to read a CSV file (and do some stuff with it, of course), and in some other function I need to read another CSV file and do some other stuff with that data. PS: We are using OleDb to read the CSV files.
My question is: would it be better to have a common function like readCSV(fileName) to read a CSV file, or should we write all the OleDb commands (i.e. OleDbConnection, open, close, etc.) separately in every function?
The problem with option one is that we need to loop through the data twice (i.e. 10K times to read from the CSV and 10K times to validate). (By the way, if your suggestion is option one, what would be the best data type for readCSV to return?)
The problem with option two is that we need to write all the OleDb commands (i.e. OleDbConnection, open, close, etc.) in every function we implement to do different tasks with the CSV data.
I'd put all your DB code into a service layer and call that to parse your CSV files. That way, if your source ever changes, you only have a small chunk of code to edit.
You could either create objects for each of your CSV files or use dynamic objects. Your service layer would then return either IEnumerable or IQueryable.
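As a purely hypothetical sketch of such a service layer (the folder path, file name handling and column names are made up; the OleDb text driver treats the folder as the "database" and the file name as the table):

// Reads a CSV through OleDb once and streams typed records back to the caller.
using System;
using System.Collections.Generic;
using System.Data.OleDb;

public class CsvRecord
{
    public string Name { get; set; }
    public decimal Amount { get; set; }
}

public class CsvService
{
    public IEnumerable<CsvRecord> ReadCsv(string folder, string fileName)
    {
        string connStr =
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + folder + ";" +
            "Extended Properties=\"Text;HDR=YES;FMT=Delimited\"";

        using (var conn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand("SELECT * FROM [" + fileName + "]", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    yield return new CsvRecord
                    {
                        Name = reader["Name"].ToString(),
                        Amount = Convert.ToDecimal(reader["Amount"])
                    };
                }
            }
        }
    }
}

Each caller then works against IEnumerable<CsvRecord> and never touches OleDb directly, so the read-twice concern becomes a question of whether you materialise the sequence (e.g. with ToList) before validating.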
Have you looked at Generic Parser?
http://www.codeproject.com/KB/database/GenericParser.aspx
It seems to be used fairly widely, reads CSV files fast and is likely flexible enough for your needs.

Import csv file into SQL Server using ASP.NET and C#

What's the best way to import a small CSV file into SQL Server using an ASP.NET form with C#? I know there are many ways to do this, but I'm wondering which classes would be best to read the file and how to insert the data into the database. Do I read the file into a DataTable and then use the SqlBulkCopy class, or just insert the data using ADO.NET? Not sure which way is best. I'm after the simplest solution and am not concerned about scalability or performance, as the CSV files are tiny.
Using ASP.NET 4.0, C# 4.0 and SQL Server 2008 R2.
The DataTable and SqlBulkCopy classes will do just fine, and that is the way I would prefer to do it: if these tiny CSV files someday become larger, your program will be ready for it, whereas ADO.NET may add some overhead by treating a single row at a time.
EDIT #1
What's the best way to get from a CSV file to a DataTable?
The CSV file format is nothing more than a text file. As such, you might want to read it using the File.ReadAllLines Method (String), which will return a string[]. Then you may add to your DataTable using the DataRow class or your preferred way.
Consider adding your columns when defining your DataTable so that it knows its structure when adding rows.
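A minimal sketch of that approach (the column names, delimiter and target table are assumptions, and quoted fields are not handled):

// Reads every line, splits on commas, fills a DataTable and bulk copies it.
using System.Data;
using System.Data.SqlClient;
using System.IO;

class CsvUploadSketch
{
    static void Import(string path, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Email", typeof(string));

        foreach (string line in File.ReadAllLines(path))
        {
            string[] parts = line.Split(',');
            table.Rows.Add(parts[0], parts[1]);
        }

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.Contacts";
                bulk.WriteToServer(table);
            }
        }
    }
}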

Which One is Best OLEDB Or Excel Object Or Database

I need to work with an Excel 2007 file to read its data. Which is the best way to do that:
Using OLEDB Provider
Excel Interop Object
Dump the Excel data to a database and use a procedure
Kindly guide me in choosing.
Here are my opinions:
1. Using OLEDB Provider
will only suit your needs if you have simple, uniformly structured tables. It won't help you much, for example, if you have to extract any cell formatting information. The Jet engine's buggy "row type guessing" algorithm may make this approach almost unusable, but if the data type can be uniquely identified from the first few rows of each table, this approach may be enough. Pro: it is fast, and it works even on machines where MS Excel is not installed.
2. Excel Interop Object
may be very slow, especially compared to option 1, and you need MS Excel to be installed. But you get complete access to Excel's object model; you can extract almost any information (for example: formatting information, colors, frames, etc.) that is stored in your Excel file, and your sheets can be structured as complexly as you want.
3. Dump the Excel data to a database and use a procedure
depends on what kind of database dump you have in mind, and whether you have a database system at hand. If you are thinking of MS Access, this will internally use the Jet engine again, with the same pros and cons as approach 1 above.
Other options:
4. Write an Excel VBA macro to read the data you need and write it to a text file, then read the text file from a C# program. Pro: much faster than approach 2, with the same flexibility in accessing meta information. Con: you have to split your program into a VBA part and a C# part, and you need MS Excel on your machine.
5. Use a third-party library/component for this task. There are plenty of libraries for the job, both free and commercial. Just ask Google, or search here on SO. Many of those libraries don't require MS Excel on the machine, and they are typically the best option if you are going to extract the data as part of a server process.
Options 1 and 2 are almost always an exercise in pain, no matter how you ask the question.
If you can use SSIS to move the data into a database, and if that suits your needs because of other requirements, that's also a good option.
But the preferred option is usually to use Office Open XML for Excel 2007 and later. That has none of the COM headaches you get with option 2, and none of the row-type-guessing issues you have with option 1.
With a more carefully crafted question, you can get a far better answer, though.
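To make the Office Open XML suggestion concrete, here is a rough sketch (it assumes the DocumentFormat.OpenXml SDK is referenced; the file path is a placeholder, and error handling plus the case of a missing shared string table are left out):

// Walks the first worksheet part and prints each cell, resolving shared strings.
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

class OpenXmlReadSketch
{
    static void Main()
    {
        using (var doc = SpreadsheetDocument.Open(@"C:\data\input.xlsx", false))
        {
            WorkbookPart wb = doc.WorkbookPart;
            WorksheetPart ws = wb.WorksheetParts.First();
            SharedStringTable sst = wb.SharedStringTablePart.SharedStringTable;

            foreach (Row row in ws.Worksheet.Descendants<Row>())
            {
                foreach (Cell cell in row.Elements<Cell>())
                {
                    string text = cell.CellValue == null ? "" : cell.CellValue.Text;
                    if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                        text = sst.ElementAt(int.Parse(text)).InnerText;
                    Console.Write(text + "\t");
                }
                Console.WriteLine();
            }
        }
    }
}

Note that WorksheetParts.First() is not guaranteed to be the first sheet as shown in Excel; a real implementation would resolve the sheet by name through the workbook's Sheets element.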

C# Importing Large Volume of Data from CSV to Database

What's the most efficient method to load large volumes of data from CSV (3 million+ rows) into a database?
The data needs to be formatted (e.g. the name column needs to be split into first name and last name, etc.).
I need to do this as efficiently as possible, i.e. there are time constraints.
I am siding with the option of reading, transforming and loading the data row by row using a C# application. Is this ideal? If not, what are my options? Should I use multithreading?
You will be I/O bound, so multithreading will not necessarily make it run any faster.
Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.
If you're not keen on writing it yourself, you could try the FileHelpers libraries. You might also want to have a look at Sébastien Lorion's work. His CSV reader is written specifically to deal with performance issues.
You could use the csvreader to quickly read the CSV.
Assuming you're using SQL Server, you can use csvreader's CachedCsvReader to read the data into a DataTable, which you can then use with SqlBulkCopy to load into SQL Server.
I would agree with your solution. Reading the file one line at a time should avoid the overhead of reading the whole file into memory at once, which should make the application run quickly and efficiently, primarily taking time to read from the file (which is relatively quick) and parse the lines. The one note of caution I have for you is to watch out if you have embedded newlines in your CSV. I don't know if the specific CSV format you're using might actually output newlines between quotes in the data, but that could confuse this algorithm, of course.
Also, I would suggest batching the insert statements (including many insert statements in one string) before sending them to the database, provided this doesn't present problems in retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any generated key values). Keep in mind that SQL Server (if that's what you're using) can only handle 2100 parameters per batch, so limit your batch size to account for that. I would also recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
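A hedged sketch of that batching idea, using the multi-row VALUES form (SQL Server 2008 and later); the table, columns and parameter names are invented, and the caller is expected to chunk rows so the parameter count stays under the 2100 limit:

// Builds one parameterized INSERT covering many rows (2 parameters per row).
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

class BatchedInsertSketch
{
    static void InsertBatch(SqlConnection conn, IList<string[]> rows)
    {
        var sql = new StringBuilder("INSERT INTO dbo.People (FirstName, LastName) VALUES ");
        using (var cmd = new SqlCommand())
        {
            for (int i = 0; i < rows.Count; i++)
            {
                if (i > 0) sql.Append(", ");
                sql.AppendFormat("(@f{0}, @l{0})", i);
                cmd.Parameters.AddWithValue("@f" + i, rows[i][0]);
                cmd.Parameters.AddWithValue("@l" + i, rows[i][1]);
            }
            cmd.Connection = conn;
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();   // caller keeps rows.Count * 2 under 2100
        }
    }
}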
You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.
If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.
BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.
Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (i.e. split the first name and last name, etc.) as the DataTable is being populated.
Once the CSV data has been loaded into memory, use SqlBulkCopy to send the data to the database.
See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
If you really want to do it in C#, create & populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).
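Putting the last two answers together, a hedged sketch (the column layout, delimiter, batch size and table name are assumptions, and quoted fields and header rows are not handled):

// Streams the CSV a line at a time, splits "name" into first/last, and bulk
// loads the result in chunks to keep memory bounded.
using System.Data;
using System.Data.SqlClient;
using System.IO;

class TransformAndLoadSketch
{
    static void Load(string csvPath, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("FirstName", typeof(string));
        table.Columns.Add("LastName", typeof(string));

        using (var reader = new StreamReader(csvPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split(',');
                string[] name = fields[0].Split(' ');       // "John Smith" -> first/last
                table.Rows.Add(name[0], name.Length > 1 ? name[1] : "");

                if (table.Rows.Count == 50000)              // flush a batch
                {
                    BulkWrite(table, connectionString);
                    table.Clear();
                }
            }
        }
        if (table.Rows.Count > 0) BulkWrite(table, connectionString);
    }

    static void BulkWrite(DataTable table, string connectionString)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.People";
            bulk.WriteToServer(table);
        }
    }
}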
