I have developed code in C# that copies data from a CSV file into a DataTable. The CSV file contains 5 million rows, and I read the rows line by line to prevent memory issues. I wonder why I still get an OutOfMemoryException. I added breakpoints to make sure the right strings are copied to my variables, and they are working correctly. Any ideas?
int first_row_flag = 0; // first row contains the column names and we don't need to import them
string temp;
foreach (var row in File.ReadLines(path3))
{
    if (!string.IsNullOrEmpty(row))
    {
        int i = 0;
        if (first_row_flag != 0)
        {
            dt.Rows.Add();
            foreach (string cell in row.Split(','))
            {
                if (i < 9)
                {
                    temp = cell.Replace("\n", "");
                    temp = temp.Replace("\r", "");
                    dt.Rows[dt.Rows.Count - 1][i] = temp;
                    i++;
                }
            }
        }
        else
        {
            first_row_flag++; // skip the header row
        }
    }
}
The number of columns in each row is 9. That's why I use i, to make sure I won't read unexpected data into a 10th column.
5 million rows could be too much data to handle (it will depend on the number of columns and values). Check the file size and compare it with the available memory for a rough idea. The point is, with this much data you will end up with an out-of-memory exception with most other techniques too.
You should reconsider the use of DataTable. If you are holding the records so that you can later insert them into a database, process your data in small batches.
If you decide to handle the data in batches, you could even think about not using a DataTable at all and use a List<T> instead.
Also, look at other techniques for reading CSV files: Reading CSV files using C#
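For the batching approach above, a minimal sketch might look like this (InsertBatch is a hypothetical method, not part of the original code, that writes one batch to the database, e.g. via SqlBulkCopy or a parameterized INSERT):

// Minimal sketch of batched processing. InsertBatch is hypothetical; the
// point is that each batch is written out and cleared, so memory use stays
// flat instead of growing with the whole file.
const int batchSize = 10000;
var batch = new List<string[]>(batchSize);

foreach (var row in File.ReadLines(path3).Skip(1)) // Skip(1) drops the header row (needs System.Linq)
{
    if (string.IsNullOrEmpty(row)) continue;

    batch.Add(row.Split(','));

    if (batch.Count == batchSize)
    {
        InsertBatch(batch);
        batch.Clear();
    }
}

if (batch.Count > 0)
{
    InsertBatch(batch); // flush the final partial batch
}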
Using C# I'm reading data from text files into a 2D list for further processing. Each file has 256 lines of 256 space-delimited doubles; each line is read into a list of doubles, and each list is added to a list of lines. All files have 256 x 256 = 65,536 data points.
I've got code that reads the files and works well for some files, but for others it takes a really long time. Since all files are formatted in the same way and contain the same number of data points, I don't understand the difference in read time. Does anyone have any ideas?
How can I speed up the read time of file 2?
Here is the code I'm using:
private Data ReadData(string name, string file)
{
    List<List<Double>> data_points = new List<List<Double>>();
    String input = File.ReadAllText(file);
    foreach (string row in input.Split('\n'))
    {
        List<Double> line_list = new List<double>();
        foreach (string col in row.Trim().Split(' '))
        {
            if (row != "")
            {
                line_list.Add(double.Parse(col.Trim()));
            }
        }
        if (line_list.Count > 1)
        {
            data_points.Add(line_list);
        }
    }
    Data temp_data = new Data(name, data_points);
    return temp_data;
}
Example text files are here:
https://www.dropbox.com/s/diindi2qjlgoxep/FOV2_t1.txt?dl=0 => reads fast
https://www.dropbox.com/s/4xrgdz0nq24ypz8/FOV2_t2.txt?dl=0 => reads slow
In answer to some of the comments:
@AntDC - What constitutes a valid double? I tried replacing double.Parse with Convert.ToDouble with no improvement.
@Henk Holterman - the difference in read time is very noticeable: <1 s for the first file and approx. 50 s for the second file. It appears to be repeatable.
@Slai - I moved both files to other locations and it had no impact on read time. Both files were exported from the same program within seconds of one another.
Performance-wise, you can optimize your code by reading line by line instead of reading the whole file and splitting it afterwards:
List<Double> line_list = new List<double>();
foreach (string line in File.ReadLines("c:\\file.txt"))
{
    string[] rows = line.Trim().Split(' ');
    foreach (string el in rows)
    {
        line_list.Add(double.Parse(el.Trim()));
    }
}
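If you want to keep the per-line structure from the original ReadData method, the same line-by-line idea might look roughly like this (a sketch reusing the question's variable names):

// Sketch: line-by-line reading that keeps one list per line, as in the
// original ReadData method. Empty tokens (e.g. from double spaces or
// trailing whitespace) are skipped before parsing.
List<List<double>> data_points = new List<List<double>>();
foreach (string line in File.ReadLines(file))
{
    List<double> line_list = new List<double>();
    foreach (string col in line.Trim().Split(' '))
    {
        if (col != "")
        {
            line_list.Add(double.Parse(col.Trim()));
        }
    }
    if (line_list.Count > 1)
    {
        data_points.Add(line_list);
    }
}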
I'm generating an .xlsx file using the EPPlus library.
I create several worksheets, each with multiple rows. Some rows have a cell reference in column J which I am inserting using the following:
for (int i = 2; i < rowCount; i++) // start on row 2 (row 1 is the header)
{
    var formula = GetCellRefFormula(i);
    worksheet.Cells[$"J{i}"].Formula = formula;
}
// save worksheet/workbook
private string GetCellRefFormula(int i)
{
    return $"\"Row \"&ROW(D{i})";
}
When I open the workbook I get the following errors:
Removed Records: Formula from /xl/worksheets/sheet1.xml part
Removed Records: Formula from /xl/worksheets/sheet7.xml part
The errors are certainly caused by the string returned from GetCellRefFormula(); if I don't set these formulas, or if GetCellRefFormula simply returns an empty string, I get no errors.
I have also tried setting the formula to have an equals sign in front, with the same result.
private string GetCellRefFormula(int i)
{
    return $"=\"Row \"&ROW(D{i})";
}
Should I be setting the formula field like this?
Is there a way to see specifically which formulas are incorrect in the Excel repair log?
As far as I can see it only gives the errors I've copied above.
I'm trying to transpose a large data file that may have many rows and columns, for subsequent analysis in Excel. Currently rows might contain either 2 or 125,000 points, but I'm trying to be generic. (I need to transpose because Excel can't handle that many columns, but is fine if the large sets span many rows.)
Initially, I implemented this in Python, using the built-in zip function. I process the source file to separate long rows from short ones, then transpose the long rows with zip:
tempdata = zip(*csv.reader(open(tempdatafile,'r')))
csv.writer(open(outfile, 'a', newline='')).writerows(tempdata)
os.remove(tempdatafile)
This works great and takes a few seconds for a 15MB csv file, but since the program that generated the data in the first place is in C#, I thought it would be best to do it all in one program.
My initial approach in C# is a little different, since from what I've read, the zip function might not work quite the same. Here's my approach:
public partial class Form1 : Form
{
    StreamReader source;
    int Rows = 0;
    int Columns = 0;
    string filePath = "input.csv";
    string outpath = "output.csv";
    List<string[]> test_csv = new List<string[]>();

    public Form1()
    {
        InitializeComponent();
    }

    private void button_Load_Click(object sender, EventArgs e)
    {
        source = new StreamReader(filePath);
        while (!source.EndOfStream)
        {
            string[] Line = source.ReadLine().Split(',');
            test_csv.Add(Line);
            if (test_csv[Rows].Length > Columns) Columns = test_csv[Rows].Length;
            Rows++;
        }
    }

    private void button_Write_Click(object sender, EventArgs e)
    {
        StreamWriter outfile = new StreamWriter(outpath);
        for (int i = 0; i < Columns; i++)
        {
            string line = "";
            for (int j = 0; j < Rows; j++)
            {
                try
                {
                    if (j != 0) line += ",";
                    line += test_csv[j][i];
                }
                catch { }
            }
            outfile.WriteLine(line);
        }
        outfile.Close();
        MessageBox.Show("Outfile written");
    }
}
I used the List because the rows might be of variable length, and I have the load function set to give me total number of columns and rows so I can know how big the outfile has to be.
I used a try/catch when writing to deal with variable-length rows. If an index is out of range for a row, this catches the exception and just skips it (the next iteration writes a comma before an exception can occur).
Loading takes very little time, but actually saving the outfile is an insanely long process. After 2 hours, I was only 1/3 of the way through the file. When I stopped the program and looked at the outfile, everything is done correctly, though.
What might be causing this program to take so long? Is it all the exception handling? I could implement a second List that stores row lengths for each row so I can avoid exceptions. Would that fix this issue?
Try using StringBuilder. Concatenation (+) of long strings is very inefficient.
Create a List<string> of lines and then make a single call to System.IO.File.WriteAllLines(filename, lines). This will reduce disk I/O.
If you don't care about the order of the points, try changing your outer for loop to System.Threading.Tasks.Parallel.For. This will run multiple threads, but since they run in parallel the output order won't be preserved.
Regarding your exception handling: since this is an error that you can determine ahead of time, you should not use a try/catch to take care of it. Change it to this:
if (j < test_csv.Count && i < test_csv[j].Length)
{
    line += test_csv[j][i];
}
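Putting the first two suggestions together, the write loop might look roughly like this (a sketch, assuming the same test_csv, Rows, Columns and outpath members as in the question):

// Sketch: build each output line with a StringBuilder and write everything
// in one call. Assumes the test_csv, Rows, Columns and outpath members
// from the question.
var lines = new List<string>(Columns);
var sb = new System.Text.StringBuilder();

for (int i = 0; i < Columns; i++)
{
    sb.Clear();
    for (int j = 0; j < Rows; j++)
    {
        if (j != 0) sb.Append(',');
        if (i < test_csv[j].Length) // bounds check instead of try/catch
        {
            sb.Append(test_csv[j][i]);
        }
    }
    lines.Add(sb.ToString());
}

System.IO.File.WriteAllLines(outpath, lines); // single write call, less disk I/O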
I am developing a small time-management app (so that I can learn C#/WPF). I need to know the best way to return calculations to various textblocks on one of my forms.
I have a table called "tblActivity" and I need to calculate how many times certain values occur. In the old days of VBA, I would have simply used DSum or DCount, but I'm not sure of the most efficient/correct/fastest way to return this sort of data (the fields are indexed, by the way).
If you want to query the table as a whole, you would do something like this:
int rowCount = tblActivity.Rows.Count;
If you want the count where a column meets certain criteria, run a Select statement:
DataRow[] selectedRows = tblActivity.Select("Index = 12 AND Index2 = 'something'");
What you can still do, if you need to display the data as well as the count:
int count = 0;
foreach (DataRow row in tblActivity.Rows)
{
    string valueFromTable = row["Column"].ToString();
    // display the data if you must,
    count += 1;
}
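For DCount/DSum-style questions specifically, the Length of the array returned by Select gives a count directly, and DataTable.Compute can evaluate an aggregate with a filter (a sketch; the "Duration" column is hypothetical and used only for illustration):

// DCount-style: number of rows matching a filter.
int matching = tblActivity.Select("Index = 12 AND Index2 = 'something'").Length;

// DSum-style: aggregate over a column with a filter. "Duration" is a
// hypothetical numeric column, used here only for illustration.
object total = tblActivity.Compute("SUM(Duration)", "Index = 12");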
I have an ADO DataSet that I'm loading from its XML file via ReadXml. The data and the schema are in separate files.
Right now, it takes close to 13 seconds to load this DataSet. I can cut this to 700 milliseconds if I don't read the DataSet's schema and just let ReadXml infer the schema, but then the resulting DataSet doesn't contain any constraints.
I've tried doing this:
Console.WriteLine("Reading dataset with external schema.");
ds.ReadXmlSchema(xsdPath);
Console.WriteLine("Reading the schema took {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
    dt.BeginLoadData();
}
ds.ReadXml(xmlPath);
Console.WriteLine("ReadXml completed after {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
    dt.EndLoadData();
}
Console.WriteLine("Process complete at {0} milliseconds.", sw.ElapsedMilliseconds);
When I do this, reading the schema takes 27ms, and reading the DataSet takes 12000+ milliseconds. And that's the time reported before I call EndLoadData on all the DataTables.
This is not an enormous amount of data - it's about 1.5 MB, there are no nested relations, and all of the tables contain two or three columns of 6-30 characters. The only thing I can figure is different when I read the schema up front is that the schema includes all of the unique constraints. But BeginLoadData is supposed to turn constraints off (as well as change notification, etc.), so that shouldn't apply here. (And yes, I've tried just setting EnforceConstraints to false.)
I've read many reports of people improving the load time of DataSets by reading the schema first instead of having the object infer the schema. In my case, inferring the schema makes for a process that's about 20 times faster than having the schema provided explicitly.
This is making me a little crazy. This DataSet's schema is generated from meta-information, and I'm tempted to write a method that creates it programmatically and just deserializes it with an XmlReader. But I'd much prefer not to.
What am I missing? What else can I do to improve the speed here?
I will try to give you a performance comparison between storing data in plain text files and in XML files.
The first function creates two files: one with 1,000,000 records in plain text and one with the same 1,000,000 records in XML. First, notice the difference in file size: ~64 MB (plain text) vs ~102 MB (XML file).
void create_files()
{
    // create text file with data
    StreamWriter sr = new StreamWriter("plain_text.txt");
    for (int i = 0; i < 1000000; i++)
    {
        sr.WriteLine(i.ToString() + "<SEP>" + "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + i.ToString());
    }
    sr.Flush();
    sr.Close();

    // create xml file with data
    DataSet ds = new DataSet("DS1");
    DataTable dt = new DataTable("T1");
    DataColumn c1 = new DataColumn("c1", typeof(int));
    DataColumn c2 = new DataColumn("c2", typeof(string));
    dt.Columns.Add(c1);
    dt.Columns.Add(c2);
    ds.Tables.Add(dt);
    DataRow dr;
    for (int j = 0; j < 1000000; j++)
    {
        dr = dt.NewRow();
        dr[0] = j;
        dr[1] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbb" + j.ToString();
        dt.Rows.Add(dr);
    }
    ds.WriteXml("xml_text.xml");
}
The second function reads these two files: first it reads the plain text into a dictionary (just to simulate real-world usage), and after that it reads the XML file. Both steps are measured in milliseconds (and the results are written to the console):
Start read Text file into memory
Text file loaded into memory in 7628 milliseconds
Start read XML file into memory
XML file loaded into memory in 21018 milliseconds
void read_files()
{
    // timers
    Stopwatch stw = new Stopwatch();
    long milliseconds;

    // read text file into a dictionary
    Debug.WriteLine("Start read Text file into memory");
    stw.Start();
    milliseconds = 0;
    StreamReader sr = new StreamReader("plain_text.txt");
    Dictionary<int, string> dict = new Dictionary<int, string>(1000000);
    string line;
    string[] sep = new string[] { "<SEP>" };
    string[] arValues;
    while (sr.EndOfStream != true)
    {
        line = sr.ReadLine();
        arValues = line.Split(sep, StringSplitOptions.None);
        dict.Add(Convert.ToInt32(arValues[0]), arValues[1]);
    }
    stw.Stop();
    milliseconds = stw.ElapsedMilliseconds;
    Debug.WriteLine("Text file loaded into memory in " + milliseconds.ToString() + " milliseconds");

    // create xml structure
    DataSet ds = new DataSet("DS1");
    DataTable dt = new DataTable("T1");
    DataColumn c1 = new DataColumn("c1", typeof(int));
    DataColumn c2 = new DataColumn("c2", typeof(string));
    dt.Columns.Add(c1);
    dt.Columns.Add(c2);
    ds.Tables.Add(dt);

    // read xml file
    Debug.WriteLine("Start read XML file into memory");
    stw.Restart();
    milliseconds = 0;
    ds.ReadXml("xml_text.xml");
    stw.Stop();
    milliseconds = stw.ElapsedMilliseconds;
    Debug.WriteLine("XML file loaded into memory in " + milliseconds.ToString() + " milliseconds");
}
Conclusion: the XML file is almost double the size of the text file, and it loads about three times slower.
XML handling is more convenient (because of the abstraction level) than plain text, but it consumes more CPU and disk.
So, if your files are small and the performance is acceptable, XML DataSets are more than OK. But if you need performance, I don't know of any method by which an XML DataSet is faster than plain text files. And it basically comes back to the very first observation: the XML file is bigger because it carries more tags.
It's not an answer, exactly (though it's better than nothing, which is what I've gotten so far), but after a long time struggling with this problem I discovered that it's completely absent when my program's not running inside Visual Studio.
Something I didn't mention before, which makes this even more mystifying, is that when I loaded a different (but comparably large) XML document into the DataSet, the program performed just fine. I'm now wondering if one of my DataSets has some kind of metainformation attached to it that Visual Studio is checking at runtime while the other one doesn't. I dunno.
Another dimension to try is to read the DataSet without the schema and then Merge it into a typed DataSet that has the constraints enabled. That way it has all of the data on hand as it builds the indexes used to enforce constraints - maybe that would be more efficient?
From MSDN:
The Merge method is typically called at the end of a series of procedures that involve validating changes, reconciling errors, updating the data source with the changes, and finally refreshing the existing DataSet.
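In code, that idea might look roughly like this (a sketch, reusing the xsdPath and xmlPath variables from the question):

// Sketch: load the XML without a schema (fast, inferred), then merge it into
// a DataSet that carries the schema and constraints, so the constraint
// indexes are built with all of the data already on hand.
var raw = new DataSet();
raw.ReadXml(xmlPath); // fast path: schema is inferred

var constrained = new DataSet();
constrained.ReadXmlSchema(xsdPath); // schema with the unique constraints
constrained.Merge(raw); // constraints are validated during the merge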