I have produced a solution in a C# web app that runs a report which returns a DataSet containing 9 DataTables.
The tables are linked with many-to-many relationships, and the purpose of the report is to produce a snapshot of the progress on records contained in the tables at any one point in time.
This dictates that all 9 tables are queried at once, as querying the tables individually at different times would mean that the results don't match up between tables.
I was returning the 9 CSV files in a zip file, which worked fine in development. However, in the organisation I am working for, a lot of users' machines are very limited and we cannot rely on them having WinZip or other archiving software, so I have been told to find an alternative solution.
So far, the problems I have come up against are:
All 9 datatables must be produced from the same query, which rules out giving users the option to run the report against tables individually as the results may not match up if they are run at different times of the day.
I can only return one file at a time to the browser, which stops me returning the collection of csv files unzipped.
I did think about returning the data as separate tabs in an Excel workbook; however, some of the results have leading zeros, which will be trimmed off when they are loaded into the worksheet.
Is there an obvious solution I'm missing?
Using DotNetZip, you can zip and unzip files like so:
using Ionic.Zip;

namespace ConsoleApplication23
{
    class Program
    {
        static void Main(string[] args)
        {
            // Zip your files like so
            using (ZipFile x = new ZipFile())
            {
                x.AddFile(@"C:\myFile");
                x.AddFile(@"C:\mySecondFile");
                x.Save(@"c:\myZipFile.zip");
            }

            // Unzip like so
            using (ZipFile y = ZipFile.Read(@"c:\myZipFile.zip"))
            {
                foreach (ZipEntry e in y)
                {
                    e.Extract(@"c:\test", ExtractExistingFileAction.OverwriteSilently);
                }
            }
        }
    }
}
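Since the report files are generated on the fly, you don't have to write them to disk at all: DotNetZip can save straight to a stream, so you can send one zip to the browser per request. A minimal sketch for an ASP.NET page or handler, assuming the CSV content for each table is already available as a string (the csvTables dictionary and entry names are placeholders):

using System.Collections.Generic;
using System.Web;
using Ionic.Zip;

public static void WriteReportZip(HttpResponse response, IDictionary<string, string> csvTables)
{
    // csvTables maps an entry name (e.g. "Table1.csv") to its CSV content.
    response.Clear();
    response.ContentType = "application/zip";
    response.AddHeader("Content-Disposition", "attachment; filename=report.zip");

    using (var zip = new ZipFile())
    {
        foreach (var table in csvTables)
        {
            // AddEntry stores the string content under the given entry name.
            zip.AddEntry(table.Key, table.Value);
        }
        zip.Save(response.OutputStream);
    }

    response.End();
}

Note that Windows Explorer has been able to open .zip files natively since XP, so users may not need WinZip at all to extract the report.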
I have a tool I made for work. Every week there are 5-20 files for a certain process that fail, and I have to find their job ids and rerun them.
I made a tool in C# that takes the names of the failed files from an Excel spreadsheet (we'll call it the Failed File Spreadsheet, or FFS if you're feeling cynical), cross-references them with a different Excel spreadsheet that has the job ids, and displays the result in the terminal. It reads the FFS with fairly simple OleDbDataAdapter code:
public static DataTable GetDataFromExcel(string filename, string sheetName)
{
    // Requires System.Data and System.Data.OleDb.
    using (var oledb = new OleDbConnection(CONN_STR.Replace("<FILENAME>", filename).Replace("<HDR>", "no")))
    {
        var result = new DataSet();
        new OleDbDataAdapter($"SELECT * FROM [{sheetName}]", oledb).Fill(result);
        return result.Tables[0];
    }
}
The tool works fine, mostly. It cross-references with another Excel sheet, I get my job ids, and I can carry on with my task.
However, there's one slight issue: often when the tool reads from the FFS, it returns blank lines. For example, if last week I had 7 files, and this week I erased those and pasted in 5 files, my tool will show the job ids for those 5 files just fine, but also show two blanks, as if it's still reading the two extra rows from the previous week. If, however, I make a new blank spreadsheet in Excel, plug in my failed files and overwrite the saved file, I don't have this issue at all, which makes me think this is an Excel issue and not a C# coding issue.
Is there a reason why, if I delete the contents of a cell, the OleDbDataAdapter would still be reading those cells? Are there whitespace characters or other hidden characters still present after deleting the contents? I could fix it in code and just say "don't write it out if the value is whitespace or null", but I want to know why blank cells are being read at all.
This is just a minor bug; it's not stopping me from doing my work, and the tool is nothing more than a personal aid for a weekly task. But I'd still like to know why cells that had content, but then had that content deleted, are still being read.
Excel is a little bit quirky like that. If you are manually editing your "Failed File Spreadsheet" (FFS) and, as you say, you are pasting 5 rows over the existing 7 rows, then you may still read in those extra rows after the data you expect if there is any formatting left on the cells. To avoid this, select the whole range of cells on the sheet in Excel, right-click, and select "Clear Contents".
To be fair, as you alluded to, I think it would be simpler just to fix it in code and skip rows in the DataTable that are empty. There is also a SO post here which shows how to remove empty rows from a DataTable.
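If you go the code route, a minimal sketch of the skip-empty-rows approach, working on the DataTable returned by GetDataFromExcel (the emptiness test here treats null, DBNull and whitespace-only cells as blank):

using System;
using System.Data;
using System.Linq;

public static DataTable RemoveEmptyRows(DataTable table)
{
    // A row counts as empty if every cell is null, DBNull or whitespace.
    var emptyRows = table.Rows.Cast<DataRow>()
        .Where(row => row.ItemArray.All(value =>
            value == null ||
            value == DBNull.Value ||
            string.IsNullOrWhiteSpace(value.ToString())))
        .ToList();

    foreach (DataRow row in emptyRows)
        table.Rows.Remove(row);

    return table;
}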
My app will build an item list and grab the necessary data (e.g. prices, customer item codes) from an Excel file.
This reference Excel file has 650 lines and 7 columns.
The app will read 10-12 item rows in a single run.
Would it be wiser to read line item by line item?
Or should I first read all the line items in the Excel file into a list/array and search from there?
Thank you
It's good to start by designing the classes that best represent the data regardless of where it comes from. Pretend that there is no Excel, SQL, etc.
If your data is always going to be relatively small (650 rows) then I would just read the whole thing into whatever data structure you create (your own classes). Then you can query those for whatever data you want, like
var itemsIWant = allMyData.Where(item => item.Value == "something");
The reason is that it enables you to separate the query (selecting individual items) from the storage (whatever file or source the data comes from.) If you replace Excel with something else you won't have to rewrite other code. If you read it line by line then the code that selects items based on criteria is mingled with your Excel-reading code.
Keeping things separate enables you to more easily test parts of your code in isolation. You can confirm that one component correctly reads what's in Excel and converts it to your data. You can confirm that another component correctly executes a query to return the data you want (and it doesn't care where that data came from.)
With regard to optimization - you're going to be opening the file from disk and no matter what you'll have to read every row. That's where all the overhead is. Whether you read the whole thing at once and then query or check each row one at a time won't be a significant factor.
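As a concrete (hypothetical) sketch of that separation, a plain data class plus a small source interface keeps the Excel-reading code behind one boundary; the property and interface names here are just examples:

using System.Collections.Generic;
using System.Linq;

// Plain data class - knows nothing about Excel.
public class Item
{
    public string Code { get; set; }
    public string CustomerItemCode { get; set; }
    public decimal Price { get; set; }
}

// The rest of the app depends on this, not on Excel or OleDb directly.
public interface IItemSource
{
    IList<Item> LoadAll();
}

// Usage: load the 650 rows once, then query them in memory.
// var allMyData = itemSource.LoadAll();
// var itemsIWant = allMyData.Where(item => item.Price > 100m);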
I'm looking for a good way to export an IEnumerable to Excel 2007 (.xlsb).
The T is a known type, so reflection is not completely necessary for performance reasons.
I'm using .xlsb (excel binary format) because the amount of data will be large for Excel.
The IEnumerable in question has approximately 2 million records. It is retrieved from an Access database (.mdb), goes through some processing, and finally LINQ queries are written to generate a report structure for T. These records do not need to be sent to Excel as one set (nor could they be); the data will be sub-divided by a condition, with the largest subset being roughly 1 million records.
I want to be able to convert the data to an Excel Pivot Table for easy viewing.
My initial idea was to convert the IEnumerable to a 2Darray [,] then push into an Excel range using COM interop.
public static object[,] To2DArray<T>(this IEnumerable<T> objectList)
{
    Type t = typeof(T);
    PropertyInfo[] fields = t.GetProperties();

    // Note: Count() enumerates objectList once here, and the foreach below enumerates it again.
    object[,] my2DObject = new object[objectList.Count(), fields.Length];

    int row = 0;
    foreach (var o in objectList)
    {
        int col = 0;
        foreach (var f in fields)
        {
            my2DObject[row, col] = f.GetValue(o, null) ?? string.Empty;
            col++;
        }
        row++;
    }

    return my2DObject;
}
I then took that object[,] and did a "transaction split", as I called it, which just splits the object[,] into smaller chunks: I create a List<object[,]> and then go through each chunk and send it into an Excel range using something similar to:
Excel.Range range = worksheet.get_Range(cell, cell);
range.Value2 = chunks[0]; // chunks is the List<object[,]>
I'd obviously loop the above, but for simplicity it would look like that.
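Spelled out, that loop might look roughly like this (a sketch only: it assumes worksheet is an Excel.Worksheet from the interop assembly and chunks is the List<object[,]> produced by the split):

int startRow = 1;
foreach (object[,] chunk in chunks)
{
    int rows = chunk.GetLength(0);
    int cols = chunk.GetLength(1);

    // Size the target range to match the chunk and assign the whole block in one interop call.
    Excel.Range topLeft = (Excel.Range)worksheet.Cells[startRow, 1];
    Excel.Range bottomRight = (Excel.Range)worksheet.Cells[startRow + rows - 1, cols];
    Excel.Range range = worksheet.get_Range(topLeft, bottomRight);
    range.Value2 = chunk;

    startRow += rows;
}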
This works, but it takes an enormous amount of time to process: over 30 minutes.
I've dabbled in outputting the IEnumerable to CSV, but that is not very efficient either, since it first requires the .csv file to be created and then opened via COM interop to do the Excel pivot table formatting.
My question: Is there a better (preferred) way to do this?
Should I force execution (ToList()) before iterating?
Should I use a different mechanism to output/display the data?
I'm open to any options to get a disconnected IEnumerable out to file in an efficient manner.
I wouldn't be opposed to using something like SQL Express.
The main question is where the bottleneck is. I'd have a look at the code in a profiler to see which part of the execution is taking a long time. It can also be worthwhile looking at your resource usage by running the process and seeing whether there is a shortage of CPU or memory, or whether it's disk-bound.
If you're getting sensible performance doing 2000 records at a time, then I suspect memory resources may be an issue - with the code you posted you're converting an IEnumerable (which can avoid loading a complete dataset into memory) into an entirely in-memory structure with potentially a million records - depending on the size and number of fields involved, this could easily become an issue.
If the problem looks like the time to create the Excel file itself (which it doesn't immediately sound like it is in this case), then COM interop calls can add up, and some of the 3rd party Excel libraries aim to be much faster at writing Excel files, particularly with large numbers of records, so rather than necessarily use Excel Binary format and COM, I'd suggest looking at an Open Source library like EPPlus (http://epplus.codeplex.com/) and seeing what the performance difference is like.
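For comparison, a minimal EPPlus sketch that dumps an IEnumerable<T> straight into a worksheet (EPPlus writes .xlsx rather than .xlsb; the method and file names here are placeholders):

using System.Collections.Generic;
using System.IO;
using OfficeOpenXml;

public static void ExportToWorksheet<T>(IEnumerable<T> items, string path)
{
    using (var package = new ExcelPackage())
    {
        ExcelWorksheet sheet = package.Workbook.Worksheets.Add("Data");

        // LoadFromCollection writes one row per item and one column per public property,
        // with a header row when the second argument is true.
        sheet.Cells["A1"].LoadFromCollection(items, true);

        package.SaveAs(new FileInfo(path));
    }
}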
Please look at the code below.
void func()
{
    for (;;)
    {
        var item = new Item();
    }
}
Item is a class in whose constructor I read several csv files, as follows
List<string> data = new List<string>();

Item()
{
    // read from csv into List<string> data
}
As is visible, the csv files are distinct and are read into unique variables. I would like to be able to parallelize this. All my data is on a network drive. I understand that the limitation in this case is the disk access. Can someone suggest what I can do to parallelize this?
As stated before, Parallel.ForEach is the easiest way to run something in parallel, but if I recall correctly Parallel.ForEach is a .NET 4 method, so if you are using an earlier version you will have to find another approach that uses locks.
If you are looking to read in data from a CSV, ADO.NET has a built-in way to read CSV files based on a schema file; it's one of the fastest ways in my experience to read CSV files.
A quick link I found via Google:
http://www.daniweb.com/web-development/aspnet/threads/38676
I've also had great success with this: http://www.codeproject.com/KB/database/CsvReader.aspx . It's a little slower than the ADO.NET version, but it's easier to use and you don't need a schema file.
Just a warning: if you use the ADO.NET approach and you have large numeric string values like credit card numbers, and you are getting things that look like scientific notation, your schema file needs to be adjusted. I've had a lot of coders complain about this.
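For reference, the ADO.NET route mentioned above usually goes through the Jet/ACE text driver with a schema.ini file sitting next to the CSV; a rough sketch (provider and connection-string details vary with what's installed):

using System.Data;
using System.Data.OleDb;

public static DataTable ReadCsv(string folder, string fileName)
{
    // A schema.ini in the same folder describes the columns, and is where you mark
    // long numeric strings (e.g. card numbers) as Text to avoid scientific notation.
    string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + folder +
                     ";Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

    using (var connection = new OleDbConnection(connStr))
    using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + fileName + "]", connection))
    {
        var table = new DataTable();
        adapter.Fill(table);
        return table;
    }
}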
Happy coding.
If all your files are distinct and read into unique variables, take a look at the Parallel.ForEach statement: http://msdn.microsoft.com/en-us/library/dd460720.aspx
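A minimal sketch of that, under the assumption that you have the list of file paths up front and that Item gains a constructor taking the path of the csv it should read (both are assumptions on my part):

using System.Collections.Concurrent;
using System.Threading.Tasks;

static ConcurrentBag<Item> LoadItems(string[] filePaths)
{
    var items = new ConcurrentBag<Item>();

    // Each file is independent, so the reads can run side by side;
    // the network drive's throughput is still the practical limit.
    Parallel.ForEach(filePaths, path =>
    {
        items.Add(new Item(path)); // hypothetical Item(string path) constructor
    });

    return items;
}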
To start I would like to clarify that I'm not extremely well versed in C#. That said, a project I'm doing in C# using .NET 3.5 has me building a class to read from and export files that contain multiple fixed-width formats based on the record type.
There are currently 5 types of records indicated by the first character position in each line of the file that indicate a specific line format. The problem I have is that the types are distinct from each other.
Record type 1 has 5 columns, signifies beginning of the file
Record type 3 has 10 columns, signifies beginning of a batch
Record type 5 has 69 columns, signifies a transaction
Record type 7 has 12 columns, signifies end of the batch, summarizes
(these 3 repeat throughout the file to contain each batch)
Record type 9 has 8 columns, signifies end of the file, summarizes
Is there a good library out there for these kinds of fixed width files? I've seen a few good ones that want to load the entire file in as one spec but that won't do.
Roughly 250 of these files are read at the end of every month and combined filesize on average is about 300 megs. Efficiency is very important to me in this project.
Based on my knowledge of the data I've built a class hierarchy of what I "think" an object should look like...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Extract_Processing
{
    class Extract
    {
        private string mFilePath;
        private string mFileName;
        private FileHeader mFileHeader;
        private FileTrailer mFileTrailer;
        private List<Batch> mBatches;   // A file can have many batches

        public Extract(string filePath)
        { /* Using file path some static method from another class would be called to parse in the file somehow */ }

        public string ToString()
        { /* Iterates all objects down the hierarchy to return the file in string format */ }

        public void ToFile()
        { /* Calls some method in the file parse static class to export the file back to storage somewhere */ }
    }

    class FileHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Batch
    {
        private string mBatchNumber;    // Should this be pulled out of the batch header to make LINQ querying simpler for this data set?
        private BatchHeader mBatchHeader;
        private BatchTrailer mBatchTrailer;
        private List<Transaction> mTransactions;    // A batch can have multiple transactions

        public string ToString()
        { /* Iterates through batches to return what the entire batch would look like in string format */ }
    }

    class BatchHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Transaction
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class BatchTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class FileTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }
}
I've left out many constructors and other methods, but I think the idea should be pretty solid. I'm looking for ideas and critique of the approach I'm considering, as again I'm not very knowledgeable about C# and execution time is the highest priority.
The biggest question, besides some critique, is: how should I bring in this file? I've brought in many files in other languages, such as VBA using FSO methods, Microsoft Access ImportSpec to read in the file (5 times, one for each spec... wow, that was inefficient!), and a 'Cursor' object in Visual FoxPro (which was FAAAAAAAST but, again, had to be done five times), but I'm looking for hidden gems in C# if such things exist.
Thanks for reading my novel; let me know if you're having issues understanding it. I'm taking the weekend to go over this design to see if I buy it and want to take the effort to implement it this way.
FileHelpers is nice. It has a couple of drawbacks in that it doesn't seem to be under active development anymore, and it makes you use public variables for your fields instead of letting you use properties. But otherwise good.
What are you doing with these files? Are you loading them into SQL Server? If so, and you're looking for FAST and SIMPLE, I'd recommend a design like this:
Make staging tables in your database that correspond to each of the 5 record types. Consider adding a LineNumber column and a FileName column too just so you can trace problems back to the file itself.
Read the file line by line and parse it out into your business objects, or directly into ADO.NET DataTable objects that correspond to your tables.
If you used business objects, apply your data transformations or business rules and then put the data into DataTable objects that correspond to your tables.
Once each DataTable reaches an appropriate BatchSize (say 1000 records), use the SqlBulkCopy object to pump the data into your staging tables. After each SqlBulkCopy operation, clear out the DataTable and continue processing.
If you didn't want to use business objects, do any final data manipulation in SQL Server.
You could probably accomplish the whole thing in under 500 lines of C#.
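As a rough sketch of the bulk-load step described above (the staging table name and connection string are placeholders, and the DataTable is assumed to match the staging table's columns):

using System.Data;
using System.Data.SqlClient;

static void FlushBatch(DataTable batch, string connectionString)
{
    // Push the accumulated rows into the staging table in one bulk operation,
    // then clear the DataTable so it can be reused for the next 1000 records.
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "staging.TransactionRecords"; // placeholder name
        bulkCopy.WriteToServer(batch);
    }
    batch.Clear();
}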
Biggest question besides some critique is, how should I bring in this file?
I do not know of any good library for file IO, but the reading is pretty straightforward.
Instantiate a StreamReader with a 64 kB buffer to limit disk IO operations (my estimate is 1,500 transactions on average per file at the end of the month).
Now you can stream over the file:
1) Use Read at the beginning of each line to determine the type of the record.
2) Use the ReadLine method with String.Split to get the column values.
3) Create the object from the column values.
or
You could just buffer the data from a Stream manually and use IndexOf + Substring for more performance (if done right).
Also if the lines weren't columns but primitive datatypes in binary format, you could use the BinaryReader class for a very easy and performant way to read the objects.
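A minimal sketch of the ReadLine approach, with the first character dispatching on record type (the parsing calls inside the switch are placeholders for the real fixed-width logic):

using System.IO;
using System.Text;

static void ReadExtractFile(string path)
{
    // 64 KB buffer as suggested above; the file is assumed to be plain fixed-width text.
    using (var reader = new StreamReader(path, Encoding.ASCII, false, 64 * 1024))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;

            // The first character identifies the record type (1, 3, 5, 7, 9).
            switch (line[0])
            {
                case '1': /* build a FileHeader from fixed-width slices of line */ break;
                case '3': /* build a BatchHeader */ break;
                case '5': /* build a Transaction */ break;
                case '7': /* build a BatchTrailer */ break;
                case '9': /* build a FileTrailer */ break;
            }
        }
    }
}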
One critique I have is that you are not correctly implementing ToString.
public string ToString()
Should be:
public override string ToString()
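Applied to, say, the Batch class from the question, it would look something like this (the body is just a sketch of the iteration described in the comments):

public override string ToString()
{
    var sb = new StringBuilder();
    sb.AppendLine(mBatchHeader.ToString());
    foreach (Transaction t in mTransactions)
        sb.AppendLine(t.ToString());
    sb.Append(mBatchTrailer.ToString());
    return sb.ToString();
}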