How to deal with very large databases - C#

I've got a DataTable which can have hundreds of thousands of records placed in it. This is a huge memory overhead, so I have added a feature whereby users can only visualise the top 200 records in my application, and can export the rest of the results to a CSV file.
However, the method I am using to export works by converting the contents of a DataTable to a CSV file. Since I can have over 100K records, I think it would be too much of a memory hog to place all the records into the DataTable and then map them to the CSV file. What approach would be recommended? This is my CSV mapping code:
StringBuilder builder = new StringBuilder();

// Header row built from the DataTable's column names.
IEnumerable<string> columnNames = dtResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
builder.AppendLine(string.Join(",", columnNames));

// One CSV line per row; DisplayCommas is a helper (not shown) for fields containing commas.
foreach (DataRow row in dtResults.Rows)
{
    IEnumerable<string> fields = row.ItemArray.Select(field => DisplayCommas(field.ToString()));
    builder.AppendLine(string.Join(",", fields));
}

File.WriteAllText(filename, builder.ToString());

As others have suggested, the heavyweight DataTable should be avoided here given your requirements. Streaming data from the source using the provider's data reader will give you the best performance while maintaining a slim memory profile.
I did some quick searches but couldn't come up with any CSV library implementations (there are a ton, so this was far from an exhaustive search) that tout easy plug-and-play with a DataReader. However, it would be fairly trivial to use a CSV library (I have used FileHelpers and kbcsv with success before) to handle the file writing: load up the data reader from your query, tell the CSV writer the column names before you start looping, and then just let the writer handle streaming the results to disk.
You might see some memory increase during this, since the file write stream will probably have a decent buffer, but it will consume far less memory than the DataTable-centric approach. The only drawback I can really see to using a DataReader with a large result set in this fashion is the strain that a long-running query can put on the ADO.NET data provider, but that's completely provider-specific (though a common problem) and something you can look into if you hit issues on the database side once you go down this route.
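For illustration, a minimal sketch of that pattern, assuming a SQL Server source; the connection string, query, filename and the Escape helper here are all placeholders, not anything from the original post:

using System;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

// Sketch: stream rows from a SqlDataReader straight to a CSV file, never
// materialising the full result set in a DataTable.
static void ExportToCsv(string connectionString, string query, string filename)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(query, connection))
    using (var writer = new StreamWriter(filename))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            // Header row from the reader's schema.
            var columnNames = Enumerable.Range(0, reader.FieldCount).Select(reader.GetName);
            writer.WriteLine(string.Join(",", columnNames));

            // One row at a time; only the current row is ever held in memory.
            while (reader.Read())
            {
                var fields = Enumerable.Range(0, reader.FieldCount)
                                       .Select(i => Escape(Convert.ToString(reader[i])));
                writer.WriteLine(string.Join(",", fields));
            }
        }
    }
}

// Quotes fields that contain commas, quotes or newlines.
static string Escape(string field)
{
    if (string.IsNullOrEmpty(field)) return string.Empty;
    if (field.IndexOfAny(new[] { ',', '"', '\n' }) >= 0)
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    return field;
}

The StreamWriter's internal buffer is the only buffering involved, which is where the small memory increase mentioned above comes from.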
Hope that helps.

I'd go old school and use sqlcmd. Something like:
sqlcmd -q "select field1,field2,field3 from mytable" -oc:\output.csv -h-1 -s","
http://msdn.microsoft.com/en-us/library/ms162773.aspx

Ditch the DataTable and use a DataReader. Sequentially read the records you need and build the CSV file as you go.

What is the fastest way to export a DataTable in C# to MS Excel?

As the title says, I have massive DataTables in C# that I need to export to Excel as fast as possible whilst keeping memory consumption reasonable.
I've tried using the following:
EPPlus (currently the fastest)
OpenXML (slower than EPPlus - not sure this makes sense as EPPlus probably uses OpenXML?)
SpreadsheetLight (slow)
ClosedXML (OOM exception with large tables)
Assuming massive data sets (up to 1,000,000 rows, 50 columns) and an infinite development time, what is THE fastest way I can export?
EDIT: I need more than just a basic table starting in A1. I need the table to start in a cell of my choosing, I need to be able to format the cells with data in, and have multiple tabs all of which contain their own data set.
Thanks.
You did not specify any requirements on how the data should look in the Excel file. I'm guessing you don't need any complicated logic, just the correct data in the correct columns. In that case, you can put your data in a CSV (comma-separated values) file. Excel can read this file just fine.
Example of CSV file:
Column 1,Column 2,Column 3
value1,value1,value1
value2,value2,value2
...
As requested, here is a code sample for creating the CSV file.
var csvFile = new StringBuilder();
csvFile.AppendLine("Column 1,Column 2,Column 3");
foreach (var row in data)
{
    csvFile.AppendLine($"{row.Column1Value},{row.Column2Value},{row.Column3Value}");
}
File.WriteAllText(filePath, csvFile.ToString());
You can use external libraries for working with CSV files, but this is the most basic approach I can think of at the moment.
Excel's .xlsx format is just zipped XML. If you strip away all the helper libraries, and think you can do a better coding job than the people behind EPPlus or OpenXML, then you can just use an XML stream writer and write the properly tagged Excel XML to a file.
You can make use of all kinds of standard file buffering and caching to make the writing as fast as possible but none of that will be specific to an Excel file - just standard buffered writes.
Assuming ... an infinite development time, what is THE fastest way I can export?
Hand-roll your own XLSX export. It's basically compressed XML. So stream your XML to a ZipArchive and it will be more-or-less as fast as it can go. If you stream it rather than buffer it then memory usage should be fixed for any size export.
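As a rough illustration of that streaming idea (not a complete exporter), the sketch below writes the worksheet part of an .xlsx into a ZipArchive straight from a data reader. The remaining package parts that Excel requires ([Content_Types].xml, _rels/.rels, xl/workbook.xml and xl/_rels/workbook.xml.rels) are deliberately omitted, and the method name is just an assumption:

using System;
using System.Data;
using System.IO;
using System.IO.Compression;
using System.Xml;

// Sketch: stream rows as SpreadsheetML into the xl/worksheets/sheet1.xml entry
// of a ZipArchive. The other parts of a valid .xlsx package are omitted here.
static void WriteSheetPart(ZipArchive archive, IDataReader reader)
{
    const string ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
    ZipArchiveEntry entry = archive.CreateEntry("xl/worksheets/sheet1.xml", CompressionLevel.Fastest);

    using (Stream stream = entry.Open())
    using (XmlWriter xml = XmlWriter.Create(stream))
    {
        xml.WriteStartElement("worksheet", ns);
        xml.WriteStartElement("sheetData", ns);

        // One <row> element per record, written as it is read; nothing is buffered.
        while (reader.Read())
        {
            xml.WriteStartElement("row", ns);
            for (int i = 0; i < reader.FieldCount; i++)
            {
                xml.WriteStartElement("c", ns);             // cell
                xml.WriteAttributeString("t", "inlineStr"); // inline string cell type
                xml.WriteStartElement("is", ns);
                xml.WriteElementString("t", ns, Convert.ToString(reader[i]));
                xml.WriteEndElement();                      // </is>
                xml.WriteEndElement();                      // </c>
            }
            xml.WriteEndElement();                          // </row>
        }

        xml.WriteEndElement();                              // </sheetData>
        xml.WriteEndElement();                              // </worksheet>
    }
}

The entry would be written inside something like using (var archive = new ZipArchive(File.Create("export.xlsx"), ZipArchiveMode.Create)) { ... } once the static package parts have been added, so memory stays flat no matter how many rows are streamed.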

Intersecting 2 big datasets

I have a giant (100Gb) csv file with several columns and a smaller (4Gb) csv, also with several columns. The first column in both datasets contains the same category values. I want to create a third csv with the records of the big file which happen to have a matching first column in the small csv. In database terms it would be a simple join on the first column.
I am trying to find the best approach in terms of efficiency. As the smaller dataset fits in memory, I was thinking of loading it into a set structure, then reading the big file line by line, querying the in-memory set, and writing a record to the output file on a positive match.
Just to frame the question in SO terms, is there an optimal way to achieve this?
EDIT: This is a one time operation.
Note: the language is not relevant, open to suggestions on column, row oriented databases, python, etc...
Something like
import csv

def main():
    with open('smallfile.csv', 'rb') as inf:
        in_csv = csv.reader(inf)
        # First pass: collect every category that appears in the small file.
        categories = set(row[0] for row in in_csv)

    with open('bigfile.csv', 'rb') as inf, open('newfile.csv', 'wb') as outf:
        in_csv = csv.reader(inf)
        out_csv = csv.writer(outf)
        # Stream the big file and keep only rows whose first column matches.
        out_csv.writerows(row for row in in_csv if row[0] in categories)

if __name__ == "__main__":
    main()
I presume you meant 100 gigabytes, not 100 gigabits; most modern hard drives top out around 100 MB/s, so expect it to take around 16 minutes just to read the data off the disk.
If you are only doing this once, your approach should be sufficient. The only improvement I would make is to read the big file in chunks instead of line by line. That way you don't have to hit the file system as much. You'd want to make the chunks as big as possible while still fitting in memory.
If you will need to do this more than once, consider pushing the data into a database. You could insert all the data from the big file and then "update" that data using the second, smaller file to get a complete database with one large table containing all the data. If you use a NoSQL database like Cassandra this should be fairly efficient, since Cassandra is pretty good at handling writes.

Out of memory exception when pulling huge data from DB

We are pulling a huge amount of data from a SQL Server DB. It has around 25000 rows with 2500 columns. The requirement is to read the data and export it to a spreadsheet, so pagination is not an option. When there are fewer records it is able to pull the data, but when it grows to the size I mentioned above it throws an out of memory exception.
public DataSet Exportexcel(string Username)
{
    Database db = DatabaseFactory.CreateDatabase(Config);
    DbCommand dbCommand = db.GetStoredProcCommand("Sp_ExportADExcel");
    db.AddInParameter(dbCommand, "@Username", DbType.String, Username);
    return db.ExecuteDataSet(dbCommand);
}
Please help me in resolving this issue.
The requirement is to read the data and export it to a spreadsheet, so pagination is not an option.
Why not read the data in steps? Instead of getting all the records at once, get a limited number of records each time and write them to Excel. Continue until you have processed all of the records.
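A rough sketch of what that could look like against SQL Server 2012 or later, using OFFSET/FETCH paging and a plain SELECT rather than the stored procedure from the question; the table name, ordering column, batch size and delegate are all made up for illustration:

using System;
using System.Data.SqlClient;

// Sketch: pull the result set in fixed-size batches instead of one huge query.
static void ExportInBatches(string connectionString, Action<SqlDataReader> writeBatch)
{
    const int batchSize = 5000;
    int offset = 0;

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        while (true)
        {
            using (var command = new SqlCommand(
                @"SELECT * FROM dbo.SomeTable
                  ORDER BY Id
                  OFFSET @Offset ROWS FETCH NEXT @BatchSize ROWS ONLY", connection))
            {
                command.Parameters.AddWithValue("@Offset", offset);
                command.Parameters.AddWithValue("@BatchSize", batchSize);

                using (var reader = command.ExecuteReader())
                {
                    if (!reader.HasRows)
                        return;              // no more batches left
                    writeBatch(reader);      // append this batch to the spreadsheet
                }
            }
            offset += batchSize;
        }
    }
}

Each batch goes to the spreadsheet writer as soon as it is read, so only one batch is ever held in memory.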
Your problem is purely down to the fact that you are trying to extract so much data in one go.
You may get around the problem by installing more memory in the machine doing the query, but this is just a bodge.
You would do best to retrieve such amounts of data in steps.
You could quite easily read the data back row by row and export/append that in CSV format to a file and this could all be done in a stored procedure.
You don't say what database you are using, but handling such large amounts of data is what database engines are designed to cope with.
Other than that, when handling large quantities of data objects in C# code it's best to look into using generics, as this doesn't impose object instantiation in the same way that classes do and so reduces the memory footprint.
You can use batch processing logic to fetch records in batches, say 5000 records per execution, and store the results in a temporary dataset; once all processing is done, dump the data from the temporary dataset to Excel.
You can use the C# SqlBulkCopy class for this purpose.
If it is enough to have the data available to Excel as a CSV file, you can use bulk copy (bcp):
bcp "select col1, col2, col3 from database.schema.SomeTable" queryout "c:\MyData.txt" -c -t"," -r"\n" -S ServerName -T
This is orders of magnitude faster and has a very small memory footprint.

IEnumerable<T> to Excel (2007) w/ Formatting

I'm looking for a good way to export an IEnumerable to Excel 2007 (.xlsb).
The T is a known type, so reflection is not completely necessary for performance reasons.
I'm using .xlsb (excel binary format) because the amount of data will be large for Excel.
The IEnumerable in question has approximately 2 million records. The IEnumerable is retrieved from an Access database (.mdb), goes through some processing, and then LINQ queries are written to generate a report structure for T. These records do not need to be sent to Excel as one set (nor could they be); they will be subdivided by a condition, with the largest subset being roughly 1 million records.
I want to be able to convert the data to an Excel Pivot Table for easy viewing.
My initial idea was to convert the IEnumerable to a 2D array (object[,]) and then push it into an Excel range using COM interop.
public static object[,] To2DArray<T>(this IEnumerable<T> objectList)
{
    Type t = typeof(T);
    PropertyInfo[] fields = t.GetProperties();
    object[,] my2DObject = new object[objectList.Count(), fields.Count()];

    int row = 0;
    foreach (var o in objectList)
    {
        int col = 0;
        foreach (var f in fields)
        {
            my2DObject[row, col] = f.GetValue(o, null) ?? string.Empty;
            col++;
        }
        row++;
    }

    return my2DObject;
}
I then took that object[,] and did what I called a "transaction split", which just split the object[,] into smaller chunks: I'd create a List<object[,]>, then go through each chunk and send it into an Excel range using something similar to:
Excel.Range range = worksheet.get_Range(cell,cell);
range.Value2 = chunks[0]; // chunks being the List<object[,]> described above
I'd obviously loop the above but just for simplicity it would look like the above.
This does work, but it takes an enormous amount of time to process, over 30 minutes.
I've also dabbled in outputting the IEnumerable to CSV, but it is not very efficient either, since it first requires the .csv file to be created and then opened via COM interop to do the Excel pivot table formatting.
My question: Is there a better (preferred) way to do this?
Should I force execution (toList()) before iteration?
Should I use a different mechanism to output/display the data?
I'm open to any options to get a disconnected IEnumerable out to file in an efficient manner.
-I wouldn't be opposed to using something like SQL Express.
The main question will be where the bottleneck is. I'd have a look at the code in a profiler to see what part of the execution is taking a long time. It can also be worthwhile looking at your resource usage by running the process and seeing whether there is a shortage of CPU or Memory, or whether it's disk-locked.
If you're getting sensible performance doing 2000 records at a time, then I suspect memory resources may be an issue - with the code you posted you're converting an IEnumerable (which can avoid loading a complete dataset into memory) into an entirely in-memory structure with potentially a million records - depending on the size and number of fields involved, this could easily become an issue.
If the problem looks like the time to create the Excel file itself (which it doesn't immediately sound like it is in this case), then COM interop calls can add up, and some of the 3rd party Excel libraries aim to be much faster at writing Excel files, particularly with large numbers of records, so rather than necessarily use Excel Binary format and COM, I'd suggest looking at an Open Source library like EPPlus (http://epplus.codeplex.com/) and seeing what the performance difference is like.
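As a rough sketch of the EPPlus route (the sheet name, output path and generic collection are placeholders, and LoadFromCollection is used simply because it avoids cell-by-cell interop calls):

using System.Collections.Generic;
using System.IO;
using OfficeOpenXml;   // EPPlus

// Minimal EPPlus sketch: dump an in-memory collection to a worksheet in one call.
// Note: EPPlus 5+ also requires ExcelPackage.LicenseContext to be set before use.
static void ExportWithEpplus<T>(IEnumerable<T> records, string path)
{
    using (var package = new ExcelPackage())
    {
        ExcelWorksheet sheet = package.Workbook.Worksheets.Add("Export");

        // Writes a header row from T's property names, then one row per item.
        sheet.Cells["A1"].LoadFromCollection(records, true);

        package.SaveAs(new FileInfo(path));
    }
}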

Paging through a datatable in codebehind

I need to handle very large datatables (2 million rows+) that come from databases (SQL, Oracle, Access, MySQL, SharePoint etc.) outside of my control. Currently I loop through every row and column building a string object, but I run out of memory at about 100k rows.
The only solution I can take is to break the datatable into smaller pieces, persisting each block before starting on the next block of rows.
Since I cannot add ROW_NUMBER() or anything similar, I have to handle the populated datatable.
How can I easily (keep performance in mind) break the populated datatable into smaller datatables like paging?
PS there is no visual component to this functionality.
Are you using string concatenation, like string += string?
Change that to StringBuilder and you should not have problems, at least not for 20k rows.
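To make that concrete, here is a small sketch of the difference; the method name is invented and the DataTable stands in for whatever you are looping over:

using System.Data;
using System.Text;

// Repeated concatenation copies the whole string on every pass:
//   result += string.Join(",", row.ItemArray) + Environment.NewLine;
// A StringBuilder appends into a growing buffer instead:
static string BuildCsv(DataTable table)
{
    var sb = new StringBuilder();
    foreach (DataRow row in table.Rows)
    {
        sb.AppendLine(string.Join(",", row.ItemArray));
    }
    return sb.ToString();
}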
If you are talking about filling a DataTable object (which loads the results of your calls into memory before processing), you will likely be better off using a DataReader for each of the mentioned providers, so that you can process each row as it is read from the database instead of storing the whole DataTable in memory...
A great answer to another question lists the pro/cons of datareaders/datatables
If you're already using datareaders- ignore this. But your memory problem might be from also storing the retrieved results...
