Intersecting 2 big datasets - c#

I have a giant (100 GB) CSV file with several columns and a smaller (4 GB) CSV, also with several columns. The first column in both files draws values from the same set of categories. I want to create a third CSV containing the records of the big file whose first column matches a first-column value in the small CSV. In database terms it would be a simple join on the first column.
I am trying to find the best approach in terms of efficiency. Since the smaller dataset fits in memory, I was thinking of loading it into some kind of set structure, then reading the big file line by line, querying the in-memory set, and writing a record to the output file on a match.
Just to frame the question in SO terms, is there an optimal way to achieve this?
EDIT: This is a one time operation.
Note: the language is not relevant; I'm open to suggestions on column- or row-oriented databases, Python, etc.

Something like:

import csv

def main():
    # Load the first column of the small file into a set for O(1) membership tests.
    with open('smallfile.csv', newline='') as inf:
        in_csv = csv.reader(inf)
        categories = set(row[0] for row in in_csv)

    # Stream the big file and keep only rows whose first column is in the set.
    with open('bigfile.csv', newline='') as inf, open('newfile.csv', 'w', newline='') as outf:
        in_csv = csv.reader(inf)
        out_csv = csv.writer(outf)
        out_csv.writerows(row for row in in_csv if row[0] in categories)

if __name__ == "__main__":
    main()
I presume you meant 100 gigabytes, not 100 gigabits. Most modern hard drives top out around 100 MB/s, so 100 GB / 100 MB/s ≈ 1,000 seconds: expect roughly 16-17 minutes just to read the data off the disk.

If you are only doing this once, your approach should be sufficient. The only improvement I would make is to read the big file in chunks instead of line by line, so you don't hit the file system as often. You'd want to make the chunks as big as possible while still fitting in memory.
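For reference, here is a minimal C# sketch of that one-pass approach (the file names are placeholders, and it assumes the first column never contains embedded commas or quotes; for messier CSV you would want a real CSV parser). The large buffer sizes passed to the reader and writer stand in for the "chunking" idea, since the framework then performs big sequential reads and writes for you.

using System.Collections.Generic;
using System.IO;
using System.Text;

class JoinBigCsv
{
    static void Main()
    {
        // Load the first column of the small file into a hash set for O(1) lookups.
        var keys = new HashSet<string>();
        foreach (var line in File.ReadLines("smallfile.csv"))
        {
            int comma = line.IndexOf(',');
            keys.Add(comma >= 0 ? line.Substring(0, comma) : line);
        }

        // Stream the big file with large buffers and keep only the matching rows.
        using (var reader = new StreamReader("bigfile.csv", Encoding.UTF8, true, 1 << 20))
        using (var writer = new StreamWriter("newfile.csv", false, Encoding.UTF8, 1 << 20))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int comma = line.IndexOf(',');
                string key = comma >= 0 ? line.Substring(0, comma) : line;
                if (keys.Contains(key))
                    writer.WriteLine(line);
            }
        }
    }
}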
If you will need to do this more than once, consider pushing the data into a database. You could insert all the data from the big file and then "update" that data using the second, smaller file to get a complete database with one large table holding everything. If you use a NoSQL database like Cassandra, this should be fairly efficient, since Cassandra is pretty good at handling writes.

Related

How can I deal with slow performance on Contains query in Entity Framework / MS-SQL?

I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means, for each word, counting the documents that contain it.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
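A rough C# sketch of steps 1-3 above, assuming the document texts have already been loaded into memory as strings; Stem is a trivial placeholder you would replace with a real stemmer:

using System;
using System.Collections.Generic;
using System.Linq;

class DocumentFrequencyBuilder
{
    // Placeholder: swap in a real stemmer (e.g. a Porter stemmer implementation).
    static string Stem(string word) => word.ToLowerInvariant();

    static readonly char[] Separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    public static Dictionary<string, int> Build(IEnumerable<string> documents)
    {
        var documentFrequency = new Dictionary<string, int>();
        foreach (var text in documents)
        {
            // Distinct stems per document, so each document counts a term at most once.
            var stems = text.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
                            .Select(Stem)
                            .Distinct();
            foreach (var stem in stems)
            {
                documentFrequency.TryGetValue(stem, out int count);
                documentFrequency[stem] = count + 1;
            }
        }
        return documentFrequency;
    }
}

Writing documentFrequency back to the database afterwards is then a single batched insert or update.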
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Applying a Contains query in a loop is an extremely bad idea; it kills performance and hammers the database. You should change your approach, and I strongly suggest you create Full Text Search indexes and run your query against them. You can retrieve the matched record texts for your query strings.
select t.Id, t.SampleColumn
from containstable(Student, SampleColumn, 'word or sampleword') C
inner join Student t ON C.[KEY] = t.Id
Perform just one query: combine the desired search words using the full-text operators (OR, AND, etc.) and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Streaming the texts from SQL Server into memory may still take a while, but it is a much better option than issuing N Contains queries in a loop.
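As a hedged sketch (not code from the answer above), issuing that single full-text query from C# could look roughly like this; the connection string and the Student/SampleColumn schema are assumptions carried over from the example query:

using System.Collections.Generic;
using System.Data.SqlClient;

class FullTextFetch
{
    public static List<string> FetchMatches(string connectionString, string searchCondition)
    {
        var texts = new List<string>();
        const string sql = @"
            select t.SampleColumn
            from containstable(Student, SampleColumn, @condition) c
            inner join Student t on c.[KEY] = t.Id";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@condition", searchCondition); // e.g. "word or sampleword"
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    texts.Add(reader.GetString(0));
            }
        }
        return texts;
    }
}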

Searching over a large List of objects quickly

First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
public class Operator
{
    string identifier; //i.e "7/1/2017 MN01 Day"
    string name1;
    string name2;
    string id1;
    string id2;
}
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary (a hash table):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and a dictionary lookup is an O(1) operation. This takes the overall matching work from O(n·m) (scanning the full operator list for each of the millions of rows) down to O(n).
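As an illustrative sketch only (QueryRow, SomeColumn and the property-based Operator below are assumptions about your actual types, not code from the question), the row-writing loop then becomes:

using System.Collections.Generic;
using System.IO;
using System.Linq;

static class OperatorLookupSketch
{
    // Hypothetical shape of one record from the large query.
    public sealed class QueryRow
    {
        public string OperatorIdentifier { get; set; }
        public string SomeColumn { get; set; }
    }

    // Hypothetical Operator with its fields exposed as properties.
    public sealed class Operator
    {
        public string Identifier { get; set; }
        public string Name1 { get; set; }
        public string Name2 { get; set; }
    }

    public static void WriteReport(IEnumerable<Operator> operators,
                                   IEnumerable<QueryRow> rows,
                                   TextWriter csvWriter)
    {
        // Build the lookup once: O(n) over ~12,000 operators; identifiers are unique.
        var operatorsById = operators.ToDictionary(o => o.Identifier);

        // Each of the millions of rows now costs one O(1) hash lookup instead of a list scan.
        foreach (var row in rows)
        {
            if (operatorsById.TryGetValue(row.OperatorIdentifier, out var op))
            {
                csvWriter.WriteLine($"{row.SomeColumn},{op.Name1},{op.Name2}");
            }
        }
    }
}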
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be being wasted packaging up every one of those millions of lines and squirting them through the network wires to your machine, only to have to discard most of them because they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space to save the time. (Populating this table could be handled by a stored procedure TRIGGERed by the appropriate insert, update, and delete events ...)
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering using multiple threads, so the work of formatting rows can overlap with writing them. This SO post has some helpful info, and you can of course Google for more.
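One common shape for this (my assumption about the intent, not a quote from the linked post) is a producer/consumer pipeline: worker code formats rows into lines while a single thread owns the output file, so formatting and disk writes overlap. A sketch:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class PipelinedCsvWriter
{
    public static void Write(string path, IEnumerable<string[]> rows)
    {
        // Bounded queue so the producer can't run far ahead of the disk.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            // Producer: format rows into CSV lines (the Operator lookup would go here).
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var row in rows)
                        queue.Add(string.Join(",", row));
                }
                finally
                {
                    queue.CompleteAdding();
                }
            });

            // Consumer: a single thread owns the StreamWriter and just writes lines.
            using (var writer = new StreamWriter(path))
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    writer.WriteLine(line);
            }

            producer.Wait();
        }
    }
}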

What is the fastest way to export a DataTable in C# to MS Excel?

As the title says I have massive DataTables in C# that I need to export to Excel as fast as possible whilst having a reasonable memory consumption.
I've tried using the following:
EPPlus (currently the fastest)
OpenXML (slower than EPPlus - not sure this makes sense as EPPlus probably uses OpenXML?)
SpreadsheetLight (slow)
ClosedXML (OOM exception with large tables)
Assuming massive data sets (up to 1,000,000 rows, 50 columns) and an infinite development time, what is THE fastest way I can export?
EDIT: I need more than just a basic table starting in A1. I need the table to start in a cell of my choosing, I need to be able to format the cells with data in, and have multiple tabs all of which contain their own data set.
Thanks.
You did not specify any requirements on how the data should look in the Excel file. I assume you don't need any complicated logic, just the correct data in the correct columns. In that case, you can put your data in a CSV (comma-separated values) file; Excel reads that just fine.
Example of CSV file:
Column 1,Column 2,Column 3
value1,value1,value1
value2,value2,value2
...
As requested, here is a code sample for creating the CSV file.
var csvFile = new StringBuilder();
csvFile.AppendLine("Column 1,Column 2,Column 3");
foreach (var row in data)
{
    csvFile.AppendLine($"{row.Column1Value},{row.Column2Value},{row.Column3Value}");
}
File.WriteAllText(filePath, csvFile.ToString());
You can use some external libraries for working with CSV files, but this is the most basic way I can think of at the moment.
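One caveat with hand-rolled CSV: values containing commas, quotes, or line breaks must be quoted, or Excel will misread the row. A small helper (my own addition, not part of the answer above) could look like:

static string EscapeCsvField(string value)
{
    if (value == null) return string.Empty;
    // Quote fields that contain the delimiter, a quote, or a line break; double any embedded quotes.
    if (value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0)
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    return value;
}

// Hypothetical usage inside the loop above (requires System.Linq):
// csvFile.AppendLine(string.Join(",", new[] { row.Column1Value, row.Column2Value, row.Column3Value }.Select(EscapeCsvField)));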
An Excel .xlsx file is just a zipped XML format. If you strip away all the helper libraries, and you think you can do a better job than the people behind EPPlus or OpenXML, then you can use an XML stream writer and write the properly tagged Excel XML to a file yourself.
You can make use of all kinds of standard file buffering and caching to make the writing as fast as possible but none of that will be specific to an Excel file - just standard buffered writes.
Assuming ... an infinite development time, what is THE fastest way I can export?
Hand-roll your own XLSX export. It's basically compressed XML. So stream your XML to a ZipArchive and it will be more-or-less as fast as it can go. If you stream it rather than buffer it then memory usage should be fixed for any size export.
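As a rough sketch of that streaming idea: the code below writes only the worksheet part (xl/worksheets/sheet1.xml) using inline strings, so nothing is buffered in memory. A valid .xlsx also needs [Content_Types].xml, the _rels files and xl/workbook.xml, which are omitted here; the class and method names are mine.

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Xml;

static class StreamingXlsxSketch
{
    public static void WriteSheet(string path, IEnumerable<string[]> rows)
    {
        using (var zipStream = File.Create(path))
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
        {
            // Each cell is written as an inline string, so no shared-strings table has to be buffered.
            var entry = archive.CreateEntry("xl/worksheets/sheet1.xml", CompressionLevel.Fastest);
            using (var writer = XmlWriter.Create(entry.Open()))
            {
                const string ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
                writer.WriteStartElement("worksheet", ns);
                writer.WriteStartElement("sheetData", ns);
                foreach (var row in rows)
                {
                    writer.WriteStartElement("row", ns);
                    foreach (var cell in row)
                    {
                        writer.WriteStartElement("c", ns);
                        writer.WriteAttributeString("t", "inlineStr");
                        writer.WriteStartElement("is", ns);
                        writer.WriteElementString("t", ns, cell);
                        writer.WriteEndElement(); // is
                        writer.WriteEndElement(); // c
                    }
                    writer.WriteEndElement(); // row
                }
                writer.WriteEndElement(); // sheetData
                writer.WriteEndElement(); // worksheet
            }
        }
    }
}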

C# Excel Reading optimization

My app will build an item list and grab the necessary data (ex: prices, customer item codes) from an excel file.
This reference excel file has 650 lines and 7 columns.
The app will read 10-12 line items in a single run.
Would it be wiser to read line item by line item?
Or should I first read all line item in the excel file into a list/array and make the search from there?
Thank you
It's good to start by designing the classes that best represent the data regardless of where it comes from. Pretend that there is no Excel, SQL, etc.
If your data is always going to be relatively small (650 rows) then I would just read the whole thing into whatever data structure you create (your own classes.) Then you can query those for whatever data you want, like
var itemsIWant = allMyData.Where(item => item.Value == "something");
The reason is that it enables you to separate the query (selecting individual items) from the storage (whatever file or source the data comes from.) If you replace Excel with something else you won't have to rewrite other code. If you read it line by line then the code that selects items based on criteria is mingled with your Excel-reading code.
Keeping things separate enables you to more easily test parts of your code in isolation. You can confirm that one component correctly reads what's in Excel and converts it to your data. You can confirm that another component correctly executes a query to return the data you want (and it doesn't care where that data came from.)
With regard to optimization - you're going to be opening the file from disk and no matter what you'll have to read every row. That's where all the overhead is. Whether you read the whole thing at once and then query or check each row one at a time won't be a significant factor.
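A small sketch of that separation; all of the type and member names below are illustrative, not an existing API:

using System.Collections.Generic;
using System.Linq;

// Plain data class that knows nothing about Excel.
public class ReferenceItem
{
    public string ItemCode { get; set; }
    public decimal Price { get; set; }
}

// The only component that knows the data lives in an Excel file.
public interface IReferenceItemSource
{
    IReadOnlyList<ReferenceItem> LoadAll();
}

// Queries run against the in-memory objects, regardless of where they came from.
public class ReferenceCatalog
{
    private readonly IReadOnlyList<ReferenceItem> _items;

    public ReferenceCatalog(IReferenceItemSource source) => _items = source.LoadAll();

    public ReferenceItem FindByCode(string itemCode) =>
        _items.FirstOrDefault(i => i.ItemCode == itemCode);
}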

Getting large data: sqlconnection and compression

I need to get around 50k-100k records from a table. Two of the fields hold very long strings. Field1 is up to 2048 characters and field2 is up to 255.
Getting just these two fields, 50k rows takes around 120 seconds. Is there a way to use compression or some how optimize the retrieval of this data? I'm using a data adapter to fill a data table.
Note: It's just a select statement, no where clause.
Simple answer: DON'T pull 50,000 to 100,000 rows. Period. Mass transfers always take time, and compression would put a lot of stress on the CPU. I have yet to come across a case where pulling that much data - outside of pure data transfers - is a worthwhile proposition; most of the time it is a sign of a bad architecture.
