I have a high-scale distributed system that downloads a lot of large .csv files and indexes the data every day.
Let's say our file (file.csv) is:
col1 col2 col3
user11 val12 val13
user21 val22 val23
Then we read this file row by row and store the byte offsets of where the row for user11 or user21 is located in the file, e.g.:
Index table -
user11 -> 1120-2130 (bytes offset)
user21 -> 2130-3545 (bytes offset)
When someone says "delete the data for user11", we refer to this table, download and open the file, and delete the bytes at that offset. Please note, this byte offset covers the entire row.
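For context, here is a minimal sketch of how such a byte-offset index could be built for the CSV case. It assumes UTF-8 content, '\n' line endings, and the user id in the first column; the delimiter handling and reading the whole file into memory are simplifications:

using System.Collections.Generic;
using System.IO;
using System.Text;

// Build a user-id -> (start, end) byte-offset index over the file.
static Dictionary<string, (long Start, long End)> BuildIndex(string path)
{
    var index = new Dictionary<string, (long Start, long End)>();
    byte[] bytes = File.ReadAllBytes(path);      // a stream would be used for very large files
    long lineStart = 0;

    for (long pos = 0; pos < bytes.Length; pos++)
    {
        if (bytes[pos] != (byte)'\n') continue;

        string line = Encoding.UTF8.GetString(bytes, (int)lineStart, (int)(pos - lineStart));
        string key = line.Split(' ', ',')[0].Trim();

        if (lineStart > 0)                       // skip the header row
            index[key] = (lineStart, pos + 1);   // end offset is exclusive

        lineStart = pos + 1;
    }
    // note: a final row without a trailing '\n' would need one extra check
    return index;
}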
How can I design the system to process parquet files?
Parquet files are organized column-wise. To get an entire row of, say, 10 columns, will I have to make 10 calls, then assemble the entire row, calculate the byte ranges, and store them in the table?
Then, while deleting, will I have to assemble the row again before deleting the bytes?
The other option is to store the byte offset of each column instead and process the file column-wise, but that would blow up the index table.
How can parquet files be processed efficiently in a row-wise manner?
The current system is a background job in C#.
You can use Cinchoo ETL, an open-source library, to convert a CSV file to a Parquet file easily.
string csv = @"Id,Name
1,Tom
2,Carl
3,Mark";

using (var r = ChoCSVReader.LoadText(csv)
    .WithFirstLineHeader()
    )
{
    // stream the CSV records straight into a Parquet file
    using (var w = new ChoParquetWriter("*** PARQUET FILE PATH ***"))
        w.Write(r);
}
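To read the generated Parquet file back in a row-wise fashion, the same library exposes a Parquet reader that streams records one at a time. A minimal sketch, assuming ChoParquetReader follows the same enumeration pattern as the other Cinchoo readers and using the same placeholder path:

using (var r = new ChoParquetReader("*** PARQUET FILE PATH ***"))
{
    foreach (dynamic rec in r)
    {
        // each record exposes the row's columns, e.g. rec.Id and rec.Name
        Console.WriteLine($"{rec.Id}: {rec.Name}");
    }
}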
For more information, please check the https://www.codeproject.com/Articles/5270332/Cinchoo-ETL-Parquet-Reader article.
Sample fiddle: https://dotnetfiddle.net/Ra8yf4
Disclaimer: I'm the author of this library.
Related
As the title says, I have massive DataTables in C# that I need to export to Excel as fast as possible while keeping memory consumption reasonable.
I've tried using the following:
EPPlus (currently the fastest)
OpenXML (slower than EPPlus - not sure this makes sense as EPPlus probably uses OpenXML?)
SpreadsheetLight (slow)
ClosedXML (OOM exception with large tables)
Assuming massive data sets (up to 1,000,000 rows, 50 columns) and an infinite development time, what is THE fastest way I can export?
EDIT: I need more than just a basic table starting in A1. I need the table to start in a cell of my choosing, to be able to format the cells that contain data, and to have multiple tabs, each containing its own data set.
Thanks.
You did not specify any requirements for how the data should look in the Excel file. I assume you don't need any complicated logic, just getting the correct data into the correct columns. In that case, you can put your data in a CSV (comma-separated values) file; Excel can read this file just fine.
Example of CSV file:
Column 1,Column 2,Column 3
value1,value1,value1
value2,value2,value2
...
As requested, here is a code sample for creating the CSV file.
var csvFile = new StringBuilder();
csvFile.AppendLine("Column 1,Column 2,Column 3");

foreach (var row in data)
{
    // note: values containing commas or quotes would need escaping
    csvFile.AppendLine($"{row.Column1Value},{row.Column2Value},{row.Column3Value}");
}

File.WriteAllText(filePath, csvFile.ToString());
You can use external libraries for handling CSV files, but this is the most basic way I can think of at the moment.
The .xlsx format is just zipped XML. If you strip away all the helper libraries and think you can do a better job than the people behind EPPlus or the OpenXML SDK, you can use an XmlWriter and write the properly tagged spreadsheet XML to a file yourself.
You can make use of all kinds of standard file buffering and caching to make the writing as fast as possible, but none of that is specific to Excel files - they are just standard buffered writes.
Assuming ... an infinite development time, what is THE fastest way I can export?
Hand-roll your own XLSX export. It's basically compressed XML, so stream your XML into a ZipArchive and it will be more or less as fast as it can go. If you stream it rather than buffering it, memory usage should stay fixed for any size of export.
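A rough sketch of that streaming approach, using XmlWriter over a ZipArchive entry. Only the worksheet part is shown; a complete .xlsx also needs [Content_Types].xml, _rels/.rels, xl/workbook.xml and xl/_rels/workbook.xml.rels, and the numeric cell values here are arbitrary:

using System.IO;
using System.IO.Compression;
using System.Xml;

// Stream numeric cells straight into the worksheet part of an .xlsx archive,
// so memory usage stays flat regardless of how many rows are written.
static void WriteSheet(string path, int rows, int cols)
{
    using var archive = new ZipArchive(File.Create(path), ZipArchiveMode.Create);
    var entry = archive.CreateEntry("xl/worksheets/sheet1.xml", CompressionLevel.Fastest);

    using var stream = entry.Open();
    using var xml = XmlWriter.Create(stream);

    xml.WriteStartElement("worksheet",
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main");
    xml.WriteStartElement("sheetData");

    for (int r = 1; r <= rows; r++)
    {
        xml.WriteStartElement("row");
        for (int c = 1; c <= cols; c++)
        {
            xml.WriteStartElement("c");                              // cell
            xml.WriteElementString("v", (r * cols + c).ToString());  // numeric value
            xml.WriteEndElement();
        }
        xml.WriteEndElement();                                       // row
    }

    xml.WriteEndElement();   // sheetData
    xml.WriteEndElement();   // worksheet
}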
I have more than 2 million rows of data that I want to dump into an Excel file, but according to the specification, an Excel worksheet can contain only 1,048,576 rows.
Consider that I have 40 million rows in the database and I want to dump this data into an Excel file.
I ran a test and got the same result: 1,048,576 rows were written successfully, and after that I got the error:
Exception from HRESULT: 0x800A03EC Error
Code:
for (int i = 1; i <= 1200000; i++)
{
    oSheet.Cells[i, 1] = i;
}
I thought of a CSV file, but I can't use it because colors and styles can't be applied to a CSV file (as per this answer), and my Excel file is going to contain many colors and styles.
Is there any third-party tool through which I can dump more than 2 million rows into an Excel file? I don't mind whether it is paid or free.
As you said, the current Excel specification has a maximum of 1,048,576 rows per sheet, but the number of sheets is only limited by available memory.
Maybe separating the content across multiple sheets would be a solution for this, as sketched below.
Or, if you want to do some analysis on the data, you could aggregate the information before loading it into the Excel file.
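For illustration, a rough sketch of the multi-sheet route using the same interop API as the question, writing each chunk to its own worksheet through a 2D array rather than cell by cell. The chunk layout, data, and file path are assumptions:

using System;
using Excel = Microsoft.Office.Interop.Excel;

const int MaxRows = 1_048_576;            // per-worksheet row limit
int totalRows = 2_000_000;                // assumed size of the data to dump

var app = new Excel.Application();
var workbook = app.Workbooks.Add();

for (int start = 0; start < totalRows; start += MaxRows)
{
    int count = Math.Min(MaxRows, totalRows - start);
    var sheet = (Excel.Worksheet)workbook.Worksheets.Add();

    // assigning a 2D array to a Range is far faster than setting cells one by one;
    // very large blocks may still need to be written in smaller slices
    var values = new object[count, 1];
    for (int i = 0; i < count; i++)
        values[i, 0] = start + i + 1;

    Excel.Range range = sheet.Range[sheet.Cells[1, 1], sheet.Cells[count, 1]];
    range.Value2 = values;
}

workbook.SaveAs(@"C:\temp\dump.xlsx");    // assumed path
app.Quit();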
I followed this very promising link to make my program read Excel files, but the problem I get is System.OutOfMemoryException. As far as I can gather, it happens because of this chunk of code
object[,] valueArray = (object[,])excelRange.get_Value(
    XlRangeValueDataType.xlRangeValueDefault);
which loads the whole range of data into one variable. I do not understand why the developers of the library decided to do it this way instead of providing an iterator that would parse a sheet line by line. So, I need a working solution that makes it possible to read large (>700K rows) Excel files.
I am using the following function in one of my C# applications:
string[,] ReadCells(Excel._Worksheet WS,
                    int row1, int col1, int row2, int col2)
{
    Excel.Range R = WS.get_Range(GetAddress(row1, col1),
                                 GetAddress(row2, col2));
    ....
}
The reason to read a Range in one go rather than cell-by-cell is performance.
For every cell access, a lot of internal data transfer is going on. If the Range is too large to fit into memory, you can process it in smaller chunks.
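A rough sketch of that chunked approach, reading a block of rows per interop call. The block size, column range, and ProcessBlock helper are assumptions; WS is the open _Worksheet from the snippet above (with Excel aliased to Microsoft.Office.Interop.Excel):

const int BlockSize = 10_000;
int totalRows = 700_000;                 // assumed sheet size
int firstCol = 1, lastCol = 20;          // assumed column range

for (int startRow = 1; startRow <= totalRows; startRow += BlockSize)
{
    int endRow = Math.Min(startRow + BlockSize - 1, totalRows);

    Excel.Range block = WS.get_Range(WS.Cells[startRow, firstCol],
                                     WS.Cells[endRow, lastCol]);

    // one interop call per block instead of one per cell or one huge array
    object[,] values = (object[,])block.Value2;

    // hand the block off (hypothetical helper) so only one block
    // is held in memory at a time
    ProcessBlock(values);
}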
I have 2 CSV files with 7 columns each.
CSV file 1 stores current or old data.
CSV file 2 stores the new data to be updated into CSV file 1.
I'd like to programmatically compare each row entry per column of the two CSV files, and if a change is detected, generate a SQL script that can be run to automatically update this data in CSV file 1.
E.g. if CSV file 1 has the string value "three" stored under column "number" with ID value 1, and CSV file 2 has the string value "zwei" stored under the same column with the same ID value, then CSV file 1's value of "three" should be changed to "zwei", but this has to be done via a programmatically generated SQL script.
Please assist...
I would load both files into SQL temp tables, process them row by row, and do the updates in SQL. Then overwrite CSV file 1 completely.
This is fast and easy.
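Whichever route you take, the per-row comparison itself boils down to something like the following sketch, which reads both files, compares values by ID, and emits UPDATE statements. The column layout, the MyTable name, and the file paths are assumptions, and the values are not escaped or parameterized:

using System.IO;
using System.Linq;
using System.Text;

// Assumed layout: first column is the ID, the remaining six hold the values.
var headers = File.ReadLines("file1.csv").First().Split(',');

var oldRows = File.ReadLines("file1.csv").Skip(1)
    .Select(l => l.Split(','))
    .ToDictionary(f => f[0]);

var sql = new StringBuilder();
foreach (var fields in File.ReadLines("file2.csv").Skip(1).Select(l => l.Split(',')))
{
    if (!oldRows.TryGetValue(fields[0], out var oldFields)) continue;

    for (int i = 1; i < fields.Length; i++)
    {
        if (fields[i] == oldFields[i]) continue;    // no change in this column

        // NOTE: real code must escape or parameterize values
        sql.AppendLine(
            $"UPDATE MyTable SET [{headers[i]}] = '{fields[i]}' WHERE [{headers[0]}] = '{fields[0]}';");
    }
}

File.WriteAllText("update.sql", sql.ToString());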
I have a table that has a blob column representing a file.
I'd like to run a LinqToSql query that returns a name and description of the file, along with the file size... but in the interests of not killing performance, I obviously don't want to download the whole blob!
var q = from f in MyFiles
        select new { f.Name, f.Description, f.Blob.Length };
appears to pull the entire blob from the DB, then calculate its length in local memory.
How can I do this so that I only get the blob size, without downloading the entire blob?
I think the best choice in your case is to store the blob size in a separate column when saving the file to the database.
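As a minimal illustration of that idea (the FileSize column, the MyFile entity, the db DataContext, and the computed-column variant are assumptions, not part of the original answer):

// Option 1: populate the size column yourself when inserting the file.
var file = new MyFile
{
    Name = "report.pdf",              // placeholder values
    Description = "Monthly report",
    Blob = bytes,
    FileSize = bytes.Length           // store the size alongside the blob
};
db.MyFiles.InsertOnSubmit(file);
db.SubmitChanges();

// Option 2: let the database maintain it, e.g. in T-SQL:
//   ALTER TABLE MyFiles ADD FileSize AS DATALENGTH(Blob) PERSISTED;

// Either way, the query no longer touches the blob itself:
var q = from f in db.MyFiles
        select new { f.Name, f.Description, f.FileSize };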