C# Parallelizing CSV Parsing

Please look at the code below.
void func()
{
    for (;;)
    {
        var item = new Item();
    }
}
Item is a class in whose constructor I read several csv files, as follows
List<string> data = new List<string>();

Item()
{
    // read from CSV into List<string> data
}
As is visible, the CSV files are distinct and are read into unique variables. I would like to parallelize this. All my data is on a network drive, so I understand that the limiting factor in this case is disk access. Can someone suggest what I can do to parallelize this?

As stated before, Parallel.ForEach is the easiest way to run something in parallel, but if I recall correctly Parallel.ForEach is a .NET 4 method, so if you are using an earlier version you will have to find another approach, such as managing your own threads and locks.
If you are looking to read in data from a CSV, ADO.NET has a built-in way to read CSV files based on a schema file; in my experience it's one of the fastest ways to read them in.
Quick link I found from Google: http://www.daniweb.com/web-development/aspnet/threads/38676
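As a rough illustration of the ADO.NET route, here is a minimal sketch using the OLE DB text driver; the folder path, file name and the schema.ini file are placeholders you would adapt to your own data:

using System.Data;
using System.Data.OleDb;

// Sketch only: reads a CSV through the Jet text driver. A schema.ini file in the
// same folder controls column names and types (and is where you fix the
// scientific-notation issue mentioned below for long numeric strings).
DataTable ReadCsv(string folder, string fileName)
{
    string connectionString =
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + folder +
        ";Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

    using (var connection = new OleDbConnection(connectionString))
    using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + fileName + "]", connection))
    {
        var table = new DataTable();
        adapter.Fill(table);   // Fill opens and closes the connection itself
        return table;
    }
}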
I've also had great success with this: http://www.codeproject.com/KB/database/CsvReader.aspx. It's a little slower than the ADO.NET version, but it's easier to use and you don't need a schema file.
Just a warning: if you use the ADO.NET approach and you have large numeric strings such as credit card numbers, and you are getting values that look like scientific notation, your schema file needs to be adjusted. I've had a lot of coders complain about this.
Happy coding.

If all your files are distinct and read into their own unique variables, take a look at the Parallel.ForEach statement: http://msdn.microsoft.com/en-us/library/dd460720.aspx
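As a rough sketch of that idea (the file list and an Item constructor that takes a path are assumptions for illustration, since the constructor shown above takes no arguments), the loop from the question could be parallelized like this. Keep in mind that with files on a network drive the speed-up may be limited by I/O rather than CPU:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var csvFiles = Directory.GetFiles(@"\\server\share\data", "*.csv");   // placeholder path
var items = new ConcurrentBag<Item>();

Parallel.ForEach(csvFiles, file =>
{
    // Each iteration runs on a thread-pool thread; Item(file) is assumed to read one CSV.
    items.Add(new Item(file));
});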

Related

What is the fastest way to export a DataTable in C# to MS Excel?

As the title says I have massive DataTables in C# that I need to export to Excel as fast as possible whilst having a reasonable memory consumption.
I've tried using the following:
EPPlus (currently the fastest)
OpenXML (slower than EPPlus - not sure this makes sense as EPPlus probably uses OpenXML?)
SpreadsheetLight (slow)
ClosedXML (OOM exception with large tables)
Assuming massive data sets (up to 1,000,000 rows, 50 columns) and an infinite development time, what is THE fastest way I can export?
EDIT: I need more than just a basic table starting in A1. I need the table to start in a cell of my choosing, I need to be able to format the cells with data in, and have multiple tabs all of which contain their own data set.
Thanks.
You did not specify any requirements for how the data should look in the Excel file. I guess you don't need any complicated logic, just the correct data in the correct columns. In that case, you can put your data in a CSV (comma-separated values) file; Excel can read that file just fine.
Example of a CSV file:
Column 1,Column 2,Column 3
value1,value1,value1
value2,value2,value2
...
As requested, here is a code sample for creating a CSV file.
var csvFile = new StringBuilder();
csvFile.AppendLine("Column 1,Column 2,Column 3");
foreach (var row in data)
{
    csvFile.AppendLine($"{row.Column1Value},{row.Column2Value},{row.Column3Value}");
}
File.WriteAllText(filePath, csvFile.ToString());
You can use some external libraries for parsing CSV files, but this is the most basic way I can think of at the moment.
Excel's file format is just XML. If you strip away all the helper libraries, and you think you can do a better job at coding than the people behind EPPlus or OpenXML, then you can just use an XML stream writer and write the properly tagged Excel XML to a file.
You can make use of all kinds of standard file buffering and caching to make the writing as fast as possible but none of that will be specific to an Excel file - just standard buffered writes.
Assuming ... an infinite development time, what is THE fastest way I can export?
Hand-roll your own XLSX export. It's basically compressed XML. So stream your XML to a ZipArchive and it will be more-or-less as fast as it can go. If you stream it rather than buffer it then memory usage should be fixed for any size export.
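For illustration, a rough sketch of that streaming approach: write one worksheet's XML straight into a ZipArchive entry so nothing large is buffered in memory. A complete .xlsx also needs [Content_Types].xml, xl/workbook.xml and the relationship parts, which are omitted here, and 'table' stands in for the DataTable being exported:

using System;
using System.Data;
using System.IO.Compression;
using System.Xml;

string ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

using (ZipArchive zip = ZipFile.Open("export.xlsx", ZipArchiveMode.Create))
{
    ZipArchiveEntry entry = zip.CreateEntry("xl/worksheets/sheet1.xml", CompressionLevel.Fastest);
    using (XmlWriter writer = XmlWriter.Create(entry.Open()))
    {
        writer.WriteStartElement("worksheet", ns);
        writer.WriteStartElement("sheetData", ns);
        foreach (DataRow row in table.Rows)
        {
            writer.WriteStartElement("row", ns);
            foreach (object value in row.ItemArray)
            {
                // Inline-string cells keep the example simple; numbers and shared
                // strings would need their own cell types.
                writer.WriteStartElement("c", ns);
                writer.WriteAttributeString("t", "inlineStr");
                writer.WriteStartElement("is", ns);
                writer.WriteElementString("t", ns, Convert.ToString(value));
                writer.WriteEndElement();   // </is>
                writer.WriteEndElement();   // </c>
            }
            writer.WriteEndElement();       // </row>
        }
        writer.WriteEndElement();           // </sheetData>
        writer.WriteEndElement();           // </worksheet>
    }
}

Because both the XmlWriter and the zip entry are streams, memory use stays roughly constant no matter how many rows you write.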

Convert SQL result set to CSV file

I'm working in C#, making use of Entity Framework 6. I have a service that calls a stored procedure (using the DbContext) and places the results in an IList. I then have a controller that makes use of this service. Originally I was using the results combined with EPPlus to save this as an Excel/XLSX file, and this worked perfectly/as intended.
However, now I need to save it as a CSV file. I have found several links, such as this and this, which convert Excel to CSV (however, I can now skip this step, as I can just convert the result set to CSV, with no need for the Excel file), and I also found this link.
From what I understand, it is fairly easy to export/convert a dataset/result set to CSV using a StringBuilder. However, I was wondering: given that I have EPPlus and the ability to save as Excel, is there not a cleaner way of doing it? Or is it best to take the data, use a StringBuilder to comma-delimit the values, and use that for the CSV?
I know similar topics (like this one) have been posted before - but I felt my question was unique enough for a new post.
Using EPPlus is not a cleaner way of doing this. You would only need much more code to accomplish exactly the same result. Creating a CSV file is nothing more than writing a text file with commas in it. So why not just do that?
StringBuilder sb = new StringBuilder();
foreach (DataRow dr in yourDataTable.Rows)
{
    List<string> fields = new List<string>();
    foreach (object field in dr.ItemArray)
    {
        fields.Add(Convert.ToString(field));
    }
    sb.AppendLine(String.Join(",", fields));
}
//and save.. sb.ToString() as a .csv file
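One caveat worth adding: if any field can contain a comma, quote or line break, escape it before joining, for example with a small helper along these lines (a minimal sketch following RFC 4180-style quoting):

static string EscapeCsvField(object field)
{
    string s = Convert.ToString(field);
    // Quote the field and double any embedded quotes when it contains
    // a delimiter, quote or line break.
    if (s.Contains(",") || s.Contains("\"") || s.Contains("\n") || s.Contains("\r"))
    {
        s = "\"" + s.Replace("\"", "\"\"") + "\"";
    }
    return s;
}

and then call fields.Add(EscapeCsvField(field)) in the loop above.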

Copy Row from DataTable to another with different column schemas

I am working on optimizing some code I have been assigned from a previous employee's code base. Beyond the fact that the code is pretty well "spaghettified" I did run into an issue where I'm not sure how to optimize properly.
The below snippet is not an exact replication, but should detail the question fairly well.
He is taking one DataTable from an Excel spreadsheet and placing rows into a consistently formatted DataTable which later updates the database. This seems logical to me; however, the way he is copying data seems convoluted, and it is a royal pain to modify, maintain or add new formats.
Here is what I'm seeing:
private void VendorFormatOne()
{
    //dtSubmit is declared with its column schema elsewhere
    for (int i = 0; i < dtFromExcelFile.Rows.Count; i++)
    {
        dtSubmit.Rows.Add(i);
        dtSubmit.Rows[i]["reference_no"] = dtFromExcelFile.Rows[i]["VENDOR REF"];
        dtSubmit.Rows[i]["customer_name"] = dtFromExcelFile.Rows[i]["END USER ID"];
        //etc etc etc
    }
}
To me this is complete overkill for mapping columns to a different schema, but I can't think of a way to do it more gracefully. In the actual solution, there are about 20 of these methods, all using different formats from dtFromExcelFile, and the column list is much longer. The column schema of dtSubmit remains the same across the board.
I am looking for a way to avoid having to manually map these columns every time the company needs to load a new file from a vendor. Is there a way to do this more efficiently? I'm sure I'm overlooking something here, but did not find any relevant answers on SO or elsewhere.
This might be overkill, but you could define an XML file that describes which Excel column maps to which database field, then input that along with each new Excel file. You'd want to whip up a class or two for parsing and consuming that file, and perhaps another class for validating the Excel file against the XML file.
Depending on the size of your organization, this may give you the added bonus of being able to offload that tedious mapping to someone less skilled. However, it is quite a bit of setup work and if this happens only sparingly, you might not get a significant return on investment for creating so much infrastructure.
Alternatively, if you're using MS SQL Server, this is basically what SSIS is built for, though in my experience, most programmers find SSIS quite tedious.
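To make the idea concrete, here is a rough sketch of what a mapping-driven copy could look like once the XML file has been read into a dictionary; the XML layout and the helper names are hypothetical, purely for illustration:

using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Xml.Linq;

// Hypothetical mapping file:
// <mappings>
//   <map from="VENDOR REF"  to="reference_no" />
//   <map from="END USER ID" to="customer_name" />
// </mappings>
static Dictionary<string, string> LoadMappings(string path)
{
    return XDocument.Load(path)
        .Descendants("map")
        .ToDictionary(m => (string)m.Attribute("from"),
                      m => (string)m.Attribute("to"));
}

static void CopyRows(DataTable source, DataTable target, Dictionary<string, string> mappings)
{
    foreach (DataRow sourceRow in source.Rows)
    {
        DataRow targetRow = target.NewRow();
        foreach (var map in mappings)
        {
            // map.Key is the Excel column name, map.Value the dtSubmit column name.
            targetRow[map.Value] = sourceRow[map.Key];
        }
        target.Rows.Add(targetRow);
    }
}

With that in place, adding a new vendor format means writing a new mapping file rather than a new VendorFormatXxx method.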
I had originally intended this just as a comment but ran out of space. It's in reply to Micah's answer and your first comment therein.
The biggest problem here is the amount of XML mapping would equal that of the manual mapping in code
Consider building a small tool that, given an Excel file with two columns, produces the XML mapping file. Now you can offload the mapping work to the vendor, or an intern, or indeed anyone who has a copy of the requirement doc for a particular vendor project.
Since the file is then loaded at runtime in your import app or whatever, you can change the mappings without having to redeploy the app.
Having used exactly this kind of system many, many times in the past, I can tell you this: you will be very glad you took the time to do it - especially the first time you get a call right after deployment along the lines of "oops, we need to add a new column to the data we've given you, and we realised that we've misspelled the 19th column, by the way."
About the only thing that can perhaps go wrong is data type conversions, but you can build that into the mapping file (type from/to) and generalise your import routine to perform the conversions for you.
Just my 2c.
A while ago I ran into a similar problem where I had over 400 columns from 30-odd tables to be mapped to about 60 in the actual table in the database. I had the same dilemma over whether to go with a schema or write something custom.
There was so much duplication that I ended up writing a simple helper class with a couple of overridden methods that basically took in a column name from the import table and spat out the database column name. Also, for column names, I built a separate class of the format
public static class ColumnName
{
    public const string FirstName = "FirstName";
    public const string LastName = "LastName";
    ...
}
Same thing goes for TableNames as well.
This made it much simpler to maintain table names and column names. Also, this handled duplicate columns across different tables really well avoiding duplicate code.

Dictionary list or file access methods or any other method in C#

I have a text file which follows exactly the same type:
**Unique-Key_1**
Value1
Value2
**Unique-Key_2**
Value1
**Unique_Key_3**
Value1
(Please note that the keys and values are not fixed. They might grow over time, but one thing is confirmed: the file will always follow this structure.)
My program needs to search for a key and then retrieve all values under it.
I have some possible solutions for this:
1) Should I use a dictionary type and, when my app loads, read all keys and values and populate it?
2) Can I use file access/search methods at run time to find a key and then retrieve its values?
3) Which is the optimal method, or is there any other way to achieve the same thing?
Things to consider:
Does the application have time to load in and parse the file before data is searched? If so, consider parsing the file into a Dictionary. If not, parse the file as needed.
Will the file be very large? If so, parsing it into a Dictionary may take up too much memory. Consider an LRU Cache like Object cache for C#.
Are the keys in the file sorted? If so, a binary search on the file may be possible to speed up the file parses.
Does the data change frequently? If so, parsing the file would guarantee the data is up to date at the cost of slower data access.
Another alternative is to load the values into database tables or a key/value store. This allows the data to be updated piecemeal or completely, with reasonable access speed if needed, at the expense of maintaining and running the database.
Okay, if the file isn't that large, I would recommend the Dictionary approach because it's going to make accessing it easier and more efficient at runtime. However, if the file is too large to hold in memory, you can use the algorithm provided in this answer to search it.
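For what it's worth, a minimal sketch of the Dictionary approach, assuming the keys are exactly the lines wrapped in ** as in the sample above:

using System.Collections.Generic;
using System.IO;

static Dictionary<string, List<string>> LoadKeyValues(string path)
{
    var result = new Dictionary<string, List<string>>();
    List<string> current = null;

    foreach (string line in File.ReadLines(path))
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        if (line.StartsWith("**") && line.EndsWith("**"))
        {
            // New key: strip the surrounding ** markers.
            string key = line.Trim('*');
            current = new List<string>();
            result[key] = current;
        }
        else if (current != null)
        {
            current.Add(line);
        }
    }
    return result;
}

Looking up a key is then just result["Unique-Key_1"], or TryGetValue if the key might be missing.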

Most efficient method to parse multi-joined records within a Search Result in NetSuite

I have to extract data from a saved search and drop each column into a CSV file. This search routinely returns over 300 lines, and I need to parse each record into a separate CSV file (so 300+ CSV files need to be created).
With all the previous searches where I have done this, the number of columns required was small (fewer than 10) and the joins were minimal to none, so efficiency wasn't a large concern.
I now have a project that has 42 fields in the saved search. The search is built off of a sales order and includes joins to the customer record and item records.
The search makes extensive use of custom fields as well as formulas.
What is the most efficient way for me to step through all of this?
I am thinking that the easiest method (and maybe the quickest) is to wrap it in a
foreach (TransactionSearchRow row in searchResult.searchRowList)
{
    using (var sw = System.IO.File.CreateText(path + filename))
    {
        ....
    }
}
block, but I want to try and avoid
if (customFieldRef is SelectCustomFieldRef)
{
    SelectCustomFieldRef selectCustomFieldRef = (SelectCustomFieldRef)customFieldRef;
    if (selectCustomFieldRef.internalId.Equals("custom_field_name"))
    {
        ....
    }
}
as I expect this code to become excessively long with this process. So any ideas are appreciated.
Using the NetSuite WSDL-generated API, there is no alternative to the nested type/name tests when reading custom fields. It just sucks and you have to live with it.
You could drop down to manual SOAP and parse the XML response yourself. That sounds like torture to me but with a few helper functions you could make the process of reading custom fields much more logical.
The other alternative would be to ditch SuiteTalk entirely and do your search in a SuiteScript RESTlet. The JavaScript API has much simpler, more direct access to custom fields than the SOAP API. You could do whatever amount of pre-processing you wanted on the server side before returning data (which could be JSON, XML, plain text, or even the final CSV) to the calling application.
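If you do stay with the WSDL-generated API, one way to keep the repetition down is to centralise the type/name tests in a helper along these lines. This is a sketch only: the exact CustomFieldRef subtypes and their properties come from your generated proxies, so treat any names beyond those shown in the question as assumptions:

// Flattens the is/cast/internalId checks into a single lookup.
static string GetCustomFieldValue(CustomFieldRef[] customFields, string internalId)
{
    if (customFields == null) return null;

    foreach (CustomFieldRef field in customFields)
    {
        if (!internalId.Equals(field.internalId)) continue;

        if (field is StringCustomFieldRef)
            return ((StringCustomFieldRef)field).value;
        if (field is SelectCustomFieldRef)
        {
            var selected = ((SelectCustomFieldRef)field).value;
            return selected != null ? selected.name : null;
        }
        if (field is BooleanCustomFieldRef)
            return ((BooleanCustomFieldRef)field).value.ToString();

        return null;   // add further subtypes as you meet them
    }
    return null;
}

Reading any custom field in the export loop then becomes a single call instead of a nested cast block.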
