Large file CSV transpose takes forever in C#

I'm trying to transpose a large data file that may have many rows and columns, for subsequent analysis in Excel. Currently rows might contain either 2 or 125,000 points, but I'm trying to be generic. (I need to transpose because Excel can't handle that many columns, but is fine if the large sets span many rows.)
Initially, I implemented this in Python, using the built-in zip function. I process the source file to separate long rows from short ones, then transpose the long rows with zip:
tempdata = zip(*csv.reader(open(tempdatafile,'r')))
csv.writer(open(outfile, 'a', newline='')).writerows(tempdata)
os.remove(tempdatafile)
This works great and takes a few seconds for a 15MB csv file, but since the program that generated the data in the first place is in C#, I thought it would be best to do it all in one program.
My initial approach in C# is a little different, since from what I've read, the zip function might not work quite the same. Here's my approach:
public partial class Form1 : Form
{
    StreamReader source;
    int Rows = 0;
    int Columns = 0;
    string filePath = "input.csv";
    string outpath = "output.csv";
    List<string[]> test_csv = new List<string[]>();

    public Form1()
    {
        InitializeComponent();
    }

    private void button_Load_Click(object sender, EventArgs e)
    {
        source = new StreamReader(filePath);
        while (!source.EndOfStream)
        {
            string[] Line = source.ReadLine().Split(',');
            test_csv.Add(Line);
            if (test_csv[Rows].Length > Columns) Columns = test_csv[Rows].Length;
            Rows++;
        }
    }

    private void button_Write_Click(object sender, EventArgs e)
    {
        StreamWriter outfile = new StreamWriter(outpath);
        for (int i = 0; i < Columns; i++)
        {
            string line = "";
            for (int j = 0; j < Rows; j++)
            {
                try
                {
                    if (j != 0) line += ",";
                    line += test_csv[j][i];
                }
                catch { }
            }
            outfile.WriteLine(line);
        }
        outfile.Close();
        MessageBox.Show("Outfile written");
    }
}
I used a List because the rows might be of variable length, and I have the load function record the total number of rows and columns so I know how big the outfile has to be.
I used a try/catch when writing to deal with the variable-length rows. If the indices are out of range for a row, this catches the exception and just skips that cell (the next iteration writes a comma before an exception can occur).
Loading takes very little time, but actually saving the outfile is an insanely long process. After 2 hours, I was only 1/3 of the way through the file. When I stopped the program and looked at the outfile, everything had been written correctly, though.
What might be causing this program to take so long? Is it all the exception handling? I could keep a second List that stores the length of each row so I can avoid the exceptions. Would that fix the issue?

Try using StringBuilder. Concatenation (+) of long strings is very inefficient.
Create a List<string> of lines and then make a single call to System.IO.File.WriteAllLines(filename, lines). This will reduce disk IO.
If you don't care about the order of the points, try changing your outer for loop to System.Threading.Tasks.Parallel.For. This will run multiple threads, and since they run in parallel the output order won't be preserved.
Regarding your exception handling: since this is a condition you can check ahead of time, you should not use a try/catch to handle it. Change it to this:
if (j < test_csv.Count && i < test_csv[j].Length)
{
    line += test_csv[j][i];
}
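Putting the first two suggestions and the bounds check together, a minimal sketch of the write step might look like this (it reuses the test_csv, Rows, and Columns fields from the question; an illustration, not tested against the original data):
// Minimal sketch: needs using System.Text and System.IO.
private void WriteTransposed(string outPath)
{
    var lines = new string[Columns];
    for (int i = 0; i < Columns; i++)
    {
        var sb = new StringBuilder();
        for (int j = 0; j < Rows; j++)
        {
            if (j != 0) sb.Append(',');
            if (i < test_csv[j].Length) sb.Append(test_csv[j][i]); // bounds check instead of try/catch
        }
        lines[i] = sb.ToString();
    }
    File.WriteAllLines(outPath, lines); // one write call instead of one per row
}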

Related

Performance using Span<T> to parse a text file

I am trying to take advantage of Span<T>, using .NETCore 2.2 to improve the performance of parsing text from a text file. The text file contains multiple consecutive rows of data which will each be split into fields that are then each mapped to a data class.
Initially, the parsing routine uses a traditional approach of using StreamReader to read each row, and then using Substring to copy the individual fields from that row.
From what I have read (on MSDN, among other sources), using Span<T> with Slice should perform more efficiently because fewer allocations are made; instead, a reference into the underlying byte[] array is passed around and acted upon.
After some experimentation I have compared 3 approaches to parsing the file and used BenchmarkDotNet to compare the results. What I found was that, when parsing a single row from the text file using Span, both mean execution time and allocated memory are indeed significantly lower. So far so good. However, when parsing more than one row from the file, the performance gain quickly disappears to the point that it is almost insignificant, even at as few as 50 rows.
I am sure I must be missing something. Something seems to be outweighing the performance gain of Span.
The best performing approach WithSpan_StringFirst looks like this:
private static byte[] _filecontent;
private const int ROWSIZE = 252;
private readonly Encoding _encoding = Encoding.ASCII;

public void WithSpan_StringFirst()
{
    var buffer1 = new Span<byte>(_filecontent).Slice(0, RowCount * ROWSIZE);
    var buffer = _encoding.GetString(buffer1).AsSpan();
    int cursor = 0;
    for (int i = 0; i < RowCount; i++)
    {
        var row = buffer.Slice(cursor, ROWSIZE);
        cursor += ROWSIZE;
        Foo.ReadWithSpan(row);
    }
}

[Params(1, 50)]
public int RowCount { get; set; }
Implementation of Foo.ReadWithSpan:
public static Foo ReadWithSpan(ReadOnlySpan<char> buffer) => new Foo
{
    Field1 = buffer.Read(0, 2),
    Field2 = buffer.Read(3, 4),
    Field3 = buffer.Read(5, 6),
    // ...
    Field30 = buffer.Read(246, 249)
};

public static string Read(this ReadOnlySpan<char> input, int startIndex, int endIndex)
{
    return new string(input.Slice(startIndex, endIndex - startIndex));
}
Any feedback would be appreciated. I have posted a full working sample on github.
For small files (< 10,000 lines) with a simple line structure to parse, almost any .NET Core method will perform about the same.
For large, multi-gigabyte files and millions of lines of data, optimizations matter more.
If file processing time runs to hours, or even tens of minutes, keeping all the C# code in the same class can noticeably speed up processing, because the compiler can do better code optimizations. Inlining the methods called from the main processing loop can also help, as sketched below.
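One way to encourage inlining (the answer may also mean manually folding the helper into the loop) is the AggressiveInlining hint, shown here applied to the question's Read helper; this is only a request to the JIT, not a guarantee:
// Assumes using System.Runtime.CompilerServices.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static string Read(this ReadOnlySpan<char> input, int startIndex, int endIndex)
{
    return new string(input.Slice(startIndex, endIndex - startIndex));
}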
It's the same answer as in the 1960s: changing the processing algorithm and how it chunks input and output data buys an order of magnitude more than small code optimizations.
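As an illustration of chunking the input, here is a minimal sketch that reads the file in large blocks and slices fixed-length rows out of each block (the file name, chunk size, and per-row handling are placeholders; rows are assumed to be exactly ROWSIZE bytes, as in the benchmark above):
const int ROWSIZE = 252;
const int RowsPerChunk = 4096;

using (var stream = new FileStream("data.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    byte[] chunk = new byte[ROWSIZE * RowsPerChunk];
    int bytesRead;
    while ((bytesRead = stream.Read(chunk, 0, chunk.Length)) > 0)
    {
        // Slice complete rows out of the block without allocating a string per row.
        // (A full implementation would carry any partial row over to the next block.)
        for (int offset = 0; offset + ROWSIZE <= bytesRead; offset += ROWSIZE)
        {
            ReadOnlySpan<byte> row = new ReadOnlySpan<byte>(chunk, offset, ROWSIZE);
            // parse fields from 'row' here
        }
    }
}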

'System.OutOfMemoryException' on copying csv file to Data table

I have developed code in C# that copies data from a csv file to a DataTable. The csv file contains 5 million rows, and I read the rows line by line to prevent memory issues. I wonder why I still get an OutOfMemoryException. I added breakpoints to make sure the right strings are copied to my variables, and they work correctly. Any ideas?
int first_row_flag = 0; //first row is column name and we dont need to import them
string temp;
foreach (var row in File.ReadLines(path3))
{
    if (!string.IsNullOrEmpty(row))
    {
        int i = 0;
        if (first_row_flag != 0)
        {
            dt.Rows.Add();
            foreach (string cell in row.Split(','))
            {
                if (i < 9)
                {
                    temp = cell.Replace("\n", "");
                    temp = temp.Replace("\r", "");
                    dt.Rows[dt.Rows.Count - 1][i] = temp;
                    i++;
                }
            }
        }
        else
        {
            first_row_flag++; //get rid of first row
        }
    }
}
The number of columns in each row is 9. That's why I use i, to make sure I don't read unexpected data into a 10th column.
Here is the stack trace:
5 million rows could simply be too much data to handle in memory at once (it will depend on the number of columns and the values). Check the file size and compare it with the available memory for a rough idea. The point is, with this much data you will usually end up with an out-of-memory exception with other techniques as well.
You should reconsider the use of DataTable: if you are holding the records only so that you can later insert them into a database, process your data in small batches instead.
If you decide to handle the data in batches, you could even think about not using DataTable at all and instead using List<T>, as in the sketch below.
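A minimal sketch of that batching idea, reusing path3 from the question; InsertBatch is a hypothetical placeholder for whatever each batch is used for (for example, a bulk insert into the database):
// Requires using System.Collections.Generic, System.IO and System.Linq.
const int BatchSize = 10000;
var batch = new List<string[]>(BatchSize);

foreach (var row in File.ReadLines(path3).Skip(1)) // Skip(1) drops the header row
{
    if (string.IsNullOrEmpty(row)) continue;
    batch.Add(row.Split(','));
    if (batch.Count == BatchSize)
    {
        InsertBatch(batch); // hypothetical: flush the batch (e.g. bulk insert into the DB)
        batch.Clear();      // release the rows so memory use stays bounded
    }
}
if (batch.Count > 0) InsertBatch(batch); // flush the final partial batch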
Also, look at other techniques for reading CSV files: Reading CSV files using C#

C# - Reading Text Files (System.IO)

I would like to consecutively read from a text file that is generated by my program. The problem is that after parsing the file for the first time, my program reads the last line of the file before it can begin re-parsing, which causes it to accumulate unwanted data.
(The original post includes three screenshots: creating a tournament and showing the points, the contents of the text file, and Team A having gained 3 more points.)
StreamReader rd = new StreamReader("Torneios.txt");
torneios = 0;
while (!rd.EndOfStream)
{
    string line = rd.ReadLine();
    if (line == "Tournament")
    {
        torneios++;
    }
    else
    {
        string[] arr = line.Split('-');
        equipaAA = arr[0];
        equipaBB = arr[1];
        res = Convert.ToChar(arr[2]);
    }
}
rd.Close();
That is what I'm using at the moment.
To avoid mistakes like these, I highly recommend using File.ReadAllText or File.ReadAllLines, unless you are working with large files (in which case they are not good choices). Here is an example:
string result = File.ReadAllText("textfilename.txt");
Regarding your particular code, an example using File.ReadAllLines which achieves this is:
string[] lines = File.ReadAllLines("textfilename.txt");
for (int i = 0; i < lines.Length; i++)
{
    string line = lines[i];
    //Do whatever you want here
}
Just to make it clear, this is not a good idea if the files you intend to read from are large (for example, large binary files).
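Applied to the code in the question, a rough sketch might look like this (the file name and field layout come from the post; resetting torneios before each pass is an assumption about what was meant by accumulating unwanted data):
string[] lines = File.ReadAllLines("Torneios.txt");
torneios = 0; // reset the counter so a re-parse does not accumulate old results
foreach (string line in lines)
{
    if (line == "Tournament")
    {
        torneios++;
    }
    else
    {
        string[] arr = line.Split('-');
        equipaAA = arr[0];
        equipaBB = arr[1];
        res = Convert.ToChar(arr[2]);
    }
}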

C# Best way to parse flat file with dynamic number of fields per row

I have a flat file that is pipe delimited and looks something like this as example
ColA|ColB|3*|Note1|Note2|Note3|2**|A1|A2|A3|B1|B2|B3
The first two columns are set and will always be there.
* denotes a count of how many repeating fields follow that count, so here Note1, Note2, Note3.
** denotes a count of how many times a block of fields is repeated, and there are always 3 fields in a block.
This is per row, so each row may have a different number of fields.
Hope that makes sense so far.
I'm trying to find the best way to parse this file, any suggestions would be great.
The goal at the end is to map all these fields into a few different files - a data transformation. I'm actually doing all this within SSIS, but I figured the default components won't be good enough, so I need to write my own code.
UPDATE: I'm essentially trying to read this as a source file, do some lookups and string manipulation on some of the fields in between, and spit out several different files, like in any normal file-to-file transformation SSIS package.
Using the above example, I may want to create a new file that ends up looking like this
"ColA","HardcodedString","Note1CRLFNote2CRLF","ColB"
And then another file
Row1: "ColA","A1","A2","A3"
Row2: "ColA","B1","B2","B3"
So I guess I'm after some ideas on how to parse this, as well as how to store the data (in Stacks, Lists, or something else) to work with and spit out later.
One possibility would be to use a stack. First you split the line by the pipes.
var stack = new Stack<string>(line.Split('|').Reverse()); // Reverse (System.Linq) so Pop() returns the fields left to right
Then you pop the first two from the stack to get them out of the way.
stack.Pop();
stack.Pop();
Then you parse the next element, 3*. For that you pop the next 3 items from the stack. With 2** you pop the next 2 x 3 = 6 items from the stack, and so on. You can stop as soon as the stack is empty.
while (stack.Count > 0)
{
    // Parse elements like 3*
}
Hope this is clear enough. I find this article very useful when it comes to String.Split().
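A rough sketch of that loop (the field meanings are taken from the question: a marker like 3* is followed by that many single fields, and a marker like 2** is followed by that many 3-field blocks; the variable names are made up for illustration):
// Assumes 'line' holds one row; requires using System.Collections.Generic and System.Linq.
var stack = new Stack<string>(line.Split('|').Reverse()); // Pop() yields fields left to right

string colA = stack.Pop();
string colB = stack.Pop();

var notes = new List<string>();    // fields introduced by a marker like 3*
var blocks = new List<string[]>(); // 3-field blocks introduced by a marker like 2**

while (stack.Count > 0)
{
    string marker = stack.Pop();
    if (marker.EndsWith("**"))      // e.g. "2**": number of 3-field blocks
    {
        int blockCount = int.Parse(marker.TrimEnd('*'));
        for (int i = 0; i < blockCount; i++)
        {
            blocks.Add(new[] { stack.Pop(), stack.Pop(), stack.Pop() });
        }
    }
    else if (marker.EndsWith("*"))  // e.g. "3*": number of single repeating fields
    {
        int noteCount = int.Parse(marker.TrimEnd('*'));
        for (int i = 0; i < noteCount; i++)
        {
            notes.Add(stack.Pop());
        }
    }
}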
Something similar to below should work (this is untested)
ColA|ColB|3*|Note1|Note2|Note3|2**|A1|A2|A3|B1|B2|B3
string[] columns = line.Split('|');
List<string> repeatingColumnNames = new List<string>();
List<List<string>> repeatingFieldValues = new List<List<string>>();
if (columns.Length > 2)
{
    // columns[2] holds the repeating-field count, e.g. "3*"
    int repeatingFieldCount = int.Parse(columns[2].TrimEnd('*'));
    int repeatingFieldStartIndex = 3;
    for (int i = 0; i < repeatingFieldCount; i++)
    {
        repeatingColumnNames.Add(columns[repeatingFieldStartIndex + i]);
    }

    // the next marker, e.g. "2**", holds how many 3-field blocks follow
    int fieldSetCountIndex = repeatingFieldStartIndex + repeatingFieldCount;
    int fieldSetCount = int.Parse(columns[fieldSetCountIndex].TrimEnd('*'));
    int fieldSetStartIndex = fieldSetCountIndex + 1;
    const int fieldsPerSet = 3;
    for (int i = 0; i < fieldSetCount; i++)
    {
        string[] fieldSet = new string[fieldsPerSet];
        for (int j = 0; j < fieldsPerSet; j++)
        {
            fieldSet[j] = columns[fieldSetStartIndex + (i * fieldsPerSet) + j];
        }
        repeatingFieldValues.Add(new List<string>(fieldSet));
    }
}
System.IO.File.ReadAllLines("File.txt").Select(line => line.Split(new[] {'|'}))

C# DataGrid CSV Import moving any size array into defined column grid?

I have a function that receives a large amount of CSV-imported, split data as an array. I currently have it set up as in the code below; however, I'm having some issues getting the data to go into separate columns rather than into a single one. What I would like to achieve is a non-redundant way of supplying a function with a string array of any size and defining how many columns across the data must be read before it is added as a row to the DataGrid.
private string csvtogrid(string input, int columns)
{
    input = input.Replace("\r", ",").Substring(2).TrimEnd(',').Trim().Replace("\n", ",").Replace(",,,", ",").Replace(",,", ",");
    string[] repack = input.Split(',');
    string[] cell = new string[columns];
    int rcell = 0;
    for (int counter = 1; counter < repack.Length; counter++)
    {
        if (rcell < columns)
        {
            cell[rcell] = repack[counter];
            rcell++;
        }
        //MessageBox.Show(cell[0] + cell[1] + cell[2]);
        procgrid.Rows.Add(cell[0], cell[1], cell[2]);
        rcell = 0;
    }
    return null;
}
The most reliable and easy method I have ever found for reading CSV files is the TextFieldParser class. It's in the Microsoft.VisualBasic.FileIO namespace, so it is probably not referenced by default if you use C#, but it should be in the GAC for you to reference.
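A minimal sketch of how that might look (the file name is a placeholder, and the procgrid call mirrors the question's grid; TextFieldParser handles quoted fields and embedded commas for you):
// Add a reference to Microsoft.VisualBasic and: using Microsoft.VisualBasic.FileIO;
using (var parser = new TextFieldParser("input.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        // add 'fields' as a grid row here, e.g. procgrid.Rows.Add(fields);
    }
}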
