This question already has answers here:
Load very big CSV-Files into s SQL-Server database
(3 answers)
Closed 9 years ago.
I have a big CSV file with 10 million entries, and I need to export it to SQL using C#. I'm a newbie and I really don't know how to write this.
I have something like this so far:
private static void ExportToDB()
{
    SqlConnection con = new SqlConnection(@"Data Source=SHAWHP\SQLEXPRESS;Initial Catalog=FOO;Persist Security Info=True;User ID=sa");
    string filepath = @"E:\Temp.csv";
    StreamReader sr = new StreamReader(filepath);
    string line = sr.ReadLine();
    string[] value = line.Split(',');
    DataTable dt = new DataTable();
    DataRow row;
    foreach (string dc in value)
    {
        dt.Columns.Add(new DataColumn(dc));
    }

    while (!sr.EndOfStream)
    {
        value = sr.ReadLine().Split(',');
        if (value.Length == dt.Columns.Count)
        {
            row = dt.NewRow();
            row.ItemArray = value;
            dt.Rows.Add(row);
        }
    }

    SqlBulkCopy bc = new SqlBulkCopy(con.ConnectionString, SqlBulkCopyOptions.TableLock);
    bc.DestinationTableName = "tblparam_test";
    bc.BatchSize = dt.Rows.Count;
    con.Open();
    bc.WriteToServer(dt);
    bc.Close();
    con.Close();
}
And it gives me an error, saying this:
An unhandled exception of type 'System.OutOfMemoryException' occurred in mscorlib.dll
How can I fix it? Or is there another way?
You can't use that approach, because string.Split creates lots of arrays that multiply the amount of memory. Suppose you have 10 columns. After the split you will have an array of length 10 plus 10 strings = 11 objects. Each of them carries 8 or 16 bytes of extra memory (object header, sync block, etc.), so the overhead is roughly 88 bytes per line. 10 million lines will consume at least 880 MB just in overhead; add the size of your file to that and you reach a value of about 1 GB. And that is not all: DataRow is quite a heavy structure, so you should add 10 million data rows, and a DataTable holding 10 million elements will itself take more than 40 MB.
So the expected required size is more than 1 GB.
For an x86 (32-bit) process, .NET can't easily use more than 1 GB of memory. Theoretically it has 2 GB, but that is only theoretical, because everything consumes memory: assemblies, native DLLs, other objects, the UI, and so on.
The solution is to use an x64 process or to read and write in chunks, like below:
private static void ExportToDB()
{
    string filepath = @"E:\Temp.csv";
    using (StreamReader sr = new StreamReader(filepath))
    {
        string line = sr.ReadLine();
        string[] value = line.Split(',');
        DataTable dt = new DataTable();
        DataRow row;
        foreach (string dc in value)
        {
            dt.Columns.Add(new DataColumn(dc));
        }

        int i = 1000; // chunk size
        while (!sr.EndOfStream)
        {
            i--;
            value = sr.ReadLine().Split(',');
            if (value.Length == dt.Columns.Count)
            {
                row = dt.NewRow();
                row.ItemArray = value;
                dt.Rows.Add(row);
            }
            if (i > 0)
                continue;
            WriteChunk(dt);
            i = 1000;
        }
        WriteChunk(dt); // flush the last partial chunk
    }
}

private static void WriteChunk(DataTable dt)
{
    string connectionString = @"Data Source=SHAWHP\SQLEXPRESS;Initial Catalog=FOO;Persist Security Info=True;User ID=sa";
    using (SqlBulkCopy bc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
    {
        bc.DestinationTableName = "tblparam_test";
        bc.BatchSize = dt.Rows.Count;
        // SqlBulkCopy opens and closes its own connection when built from a connection string
        bc.WriteToServer(dt);
    }
    dt.Rows.Clear(); // free the rows that were just written so memory stays bounded
}
If you can get the file to the server, I would use BULK INSERT server-side.
BULK Insert CSV
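For reference, a minimal sketch of driving that server-side load from C#. This assumes the CSV path is visible to the SQL Server service account and that dbo.tblparam_test already matches the file's columns; adjust the path, terminators and FIRSTROW to your file.
using System.Data.SqlClient;

static void BulkInsertServerSide(string connectionString)
{
    const string sql = @"BULK INSERT dbo.tblparam_test
                         FROM 'E:\Temp.csv'
                         WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)";

    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, con))
    {
        cmd.CommandTimeout = 0; // a 10-million-row load can easily exceed the default 30 seconds
        con.Open();
        cmd.ExecuteNonQuery(); // the server reads the file directly, so the client uses almost no memory
    }
}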
Regards.
Taken from MSDN:
In relation to .ReadLine()
If the current method throws an OutOfMemoryException, the reader's position in the underlying Stream object is advanced by the number of characters the method was able to read, but the characters already read into the internal ReadLine buffer are discarded. If you manipulate the position of the underlying stream after reading data into the buffer, the position of the underlying stream might not match the position of the internal buffer. To reset the internal buffer, call the DiscardBufferedData method; however, this method slows performance and should be called only when absolutely necessary.
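Purely to illustrate that note, a small sketch (hypothetical file path) of why DiscardBufferedData is needed once you reposition the underlying stream yourself:
// requires using System.IO;
using (var fs = new FileStream(@"E:\Temp.csv", FileMode.Open, FileAccess.Read))
using (var sr = new StreamReader(fs))
{
    string header = sr.ReadLine();   // ReadLine buffers more than one line internally
    fs.Seek(0, SeekOrigin.Begin);    // repositioning the underlying stream by hand...
    sr.DiscardBufferedData();        // ...so the reader's buffered characters must be discarded
    string again = sr.ReadLine();    // now reads the header a second time, as expected
}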
Related
I am using the code below to export data from a CSV file to a DataTable.
As the values are mixed text, i.e. both numbers and alphabetic values, some of the columns are not getting exported to the DataTable.
I have done some research here and found that we need to set ImportMixedType = Text and TypeGuessRows = 0 in the registry, which did not solve the problem.
The code below works for some files, even with mixed text.
Could someone tell me what is wrong with the code below? Am I missing something here?
if (isFirstRowHeader)
{
header = "Yes";
}
using (OleDbConnection connection = new OleDbConnection(#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + pathOnly +
";Extended Properties=\"text;HDR=" + header + ";FMT=Delimited\";"))
{
using (OleDbCommand command = new OleDbCommand(sql, connection))
{
using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
{
adapter.Fill(table);
}
connection.Close();
}
}
For a comma-delimited file, this worked for me:
public DataTable CSVtoDataTable(string inputpath)
{
DataTable csvdt = new DataTable();
string Fulltext;
if (File.Exists(inputpath))
{
using (StreamReader sr = new StreamReader(inputpath))
{
while (!sr.EndOfStream)
{
Fulltext = sr.ReadToEnd().ToString();//read full content
string[] rows = Fulltext.Split('\n');//split file content to get the rows
for (int i = 0; i < rows.Count() - 1; i++)
{
var regex = new Regex("\\\"(.*?)\\\"");
var output = regex.Replace(rows[i], m => m.Value.Replace(",", "\\c"));//replace commas inside quotes
string[] rowValues = output.Split(',');//split rows with comma',' to get the column values
{
if (i == 0)
{
for (int j = 0; j < rowValues.Count(); j++)
{
csvdt.Columns.Add(rowValues[j].Replace("\\c",","));//headers
}
}
else
{
try
{
DataRow dr = csvdt.NewRow();
for (int k = 0; k < rowValues.Count(); k++)
{
if (k >= dr.Table.Columns.Count)// more columns may exist
{
    csvdt.Columns.Add("clmn" + k);
    dr = csvdt.NewRow();
}
dr[k] = rowValues[k].Replace("\\c", ",");
}
csvdt.Rows.Add(dr);//add other rows
}
catch
{
Console.WriteLine("error");
}
}
}
}
}
}
}
return csvdt;
}
The main thing that would probably help is to stop using the OleDb objects for reading a delimited file. I suggest using the TextFieldParser, which is what I have successfully used for over 2 years now for a client.
http://www.dotnetperls.com/textfieldparser
There may be other issues, but without seeing your .CSV file, I can't tell you where your problem may lie.
The TextFieldParser is specifically designed to parse comma-delimited files; the OleDb objects are not. So start there, and then we can determine what the problem may be, if it persists.
If you look at the example on the link I provided, they merely write lines to the console. You can alter that code portion to add rows to a DataTable object, as I do for sorting purposes.
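As a rough sketch of that alteration (a hypothetical ReadCsvIntoTable helper; TextFieldParser lives in the Microsoft.VisualBasic assembly and handles commas inside quoted fields for you):
// requires using System.Data; and a reference to Microsoft.VisualBasic
using Microsoft.VisualBasic.FileIO;

static DataTable ReadCsvIntoTable(string path)
{
    var table = new DataTable();
    using (var parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true; // quoted fields with embedded commas are parsed correctly

        bool isHeader = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (isHeader)
            {
                foreach (string name in fields)
                    table.Columns.Add(name);   // first line supplies the column names
                isHeader = false;
            }
            else if (fields.Length == table.Columns.Count)
            {
                table.Rows.Add(fields);        // remaining lines become rows
            }
        }
    }
    return table;
}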
Following is the code for it:
protected void Upload(object sender, EventArgs e)
{
if (FileUpload1.HasFile)
{
//Upload and save the file
string csvPath = Server.MapPath("~/App_Data/") + Path.GetFileName(FileUpload1.PostedFile.FileName);
FileUpload1.SaveAs(csvPath);
DataTable dt = new DataTable();
dt.Columns.AddRange(new DataColumn[7]
{
new DataColumn("pataintno", typeof(int)),
new DataColumn("Firstname", typeof(string)),
new DataColumn("Lastname",typeof(string)),
new DataColumn("Age", typeof(int)),
new DataColumn("Address", typeof(string)),
new DataColumn("Email", typeof(string)),
new DataColumn("Phno", typeof(int)),});
string csvData = File.ReadAllText(csvPath);
foreach (string row in csvData.Split('\n'))
{
if (!string.IsNullOrEmpty(row))
{
dt.Rows.Add();
int i = 0;
foreach (string cell in row.Split(','))
{
dt.Rows[dt.Rows.Count - 1][i] = cell;
i++;
}
}
}
string consString = ConfigurationManager.ConnectionStrings["cnstr"].ConnectionString;
using (SqlConnection con = new SqlConnection(consString))
{
using (SqlBulkCopy sqlBulkCopy = new SqlBulkCopy(con))
{
//Set the database table name
sqlBulkCopy.DestinationTableName = "Pataint";
con.Open();
sqlBulkCopy.WriteToServer(dt);
con.Close();
Array.ForEach(Directory.GetFiles((Server.MapPath("~/App_Data/"))), File.Delete);
}
}
}
else
{
Label1.Text = "PlZ TRY AGAIN";
}
}
You have a DataTable with 3 fields of type integer; the error says that one or more of the values extracted from your file is not a valid integer.
So you need to check for bad input (as always in these cases):
// Read all lines and get back an array of the lines
string[] lines = File.ReadAllLines(csvPath);
// Loop over the lines and try to add them to the table
foreach (string row in lines)
{
// Discard if the line is just null, empty or all whitespaces
if (!string.IsNullOrWhiteSpace(row))
{
string[] rowParts = row.Split(',');
// We expect the line to be split into 7 parts.
// If this is not the case then log the error and continue
if(rowParts.Length != 7)
{
// Log here the info on the incorrect line with some logging tool
continue;
}
// Check if the 3 values expected to be integers are really integers
int pataintno;
int age;
int phno;
if(!Int32.TryParse(rowParts[0], out pataintno))
{
// It is not an integer, so log the error
// on this line and continue
continue;
}
if(!Int32.TryParse(rowParts[3], out age))
{
// It is not an integer, so log the error
// on this line and continue
continue;
}
if(!Int32.TryParse(rowParts[6], out phno))
{
// It is not an integer, so log the error
// on this line and continue
continue;
}
// OK, all is good now, try to create a new row, fill it and add to the
// Rows collection of the DataTable
DataRow dr = dt.NewRow();
dr[0] = pataintno;
dr[1] = rowParts[1].ToString();
dr[2] = rowParts[2].ToString();
dr[3] = age;
dr[4] = rowParts[4].ToString();
dr[5] = rowParts[5].ToString();
dr[6] = phno;
dt.Rows.Add(dr);
}
}
The check on your input is done using Int32.TryParse, which returns false if the string cannot be converted to an integer. In that case you should write some kind of error log, so that when the loop is completed you can discover which lines are incorrect and fix them.
Notice also that I have changed your code in a few places: File.ReadAllLines gives you the input already split at each new line (regardless of whether the newline is just \n or \r\n), and the code that adds a new row to your DataTable should follow the pattern: create a new row, fill it with values, then add the new row to the existing collection.
I checked the code and it seems fine. I suggest you check the CSV file and make sure there are no headers for any of the columns.
I had this problem today while parsing a CSV into a SQL table. My parser had been working fine for a year, but all of a sudden it threw an int conversion error today. SqlBulkCopy's error is not that informative, and reviewing the CSV file didn't show anything wrong with the data; all my numeric columns in the CSV had valid numeric values.
So, to find the error, I wrote the custom method below. The error immediately popped up on the very first record. The actual problem was that the vendor had changed the CSV format of the numeric values and now rendered decimal values in place of integers. For example, in place of the value 1, the CSV file had 1.0. When I opened the CSV file it showed only 1, but in Notepad it showed 1.0. My SQL table had all integer columns, and SqlBulkCopy can't handle this transformation. I spent around 3 hours figuring out this error.
Solution inspired from - https://sqlbulkcopy-tutorial.net/type-a-cannot-be-converted-to-type-b
private void TestData(CsvDataReader dataReader)
{
int a = 0;
while(dataReader.Read())
{
try
{
a = int.Parse(dataReader[<<Column name>>].ToString());
}
catch (Exception ex) { } // set a breakpoint or log here to see which row and value fail to parse
}
}
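If you need to accept such values on the client side instead, one option is a more lenient parse before handing the table to SqlBulkCopy. A minimal sketch, assuming invariant-culture decimals in the file (ParseLenientInt is a hypothetical helper name):
using System.Globalization;

static int ParseLenientInt(string raw)
{
    // accepts "1" as well as "1.0"; truncates any fractional part
    decimal value = decimal.Parse(raw, NumberStyles.Number, CultureInfo.InvariantCulture);
    return (int)value;
}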
This question already has answers here:
Dealing with commas in a CSV file
(29 answers)
Parsing a CSV with comma in data [duplicate]
(3 answers)
Closed 9 years ago.
So here is my issue.
I am reading a CSV into a DataGridView. The first row of the CSV is the column headers, and the rest of the file becomes the rows. I am reading the file and using it as the DataSource of the DataGridView. However, some lines in the CSV have something similar to
name,test,1,2,3,test,2,3,2,1,test,name
The bolded part above is a single value that contains a comma, so when I use .Split() it is treated as an extra cell in the row; the split then yields 11 values while the table only has 10 columns, and I get the following error:
Input array is longer than the number of columns in this table
How can I get around this?
Below is my C# Code.
OpenFileDialog openFile = new OpenFileDialog();
openFile.InitialDirectory = "c:\\";
openFile.Filter = "txt files (*.txt)|*.txt| CSV Files (*.csv) | *.csv| All Files (*.*) | *.*";
openFile.FilterIndex = 2;
openFile.RestoreDirectory = true;
try {
if (openFile.ShowDialog() == DialogResult.OK)
{
string file = openFile.FileName;
StreamReader sr = new StreamReader(file);
/* gets all the lines of the csv */
string[] str = File.ReadAllLines(file);
/* creates a data table*/
DataTable dt = new DataTable();
/* gets the column headers from the first line*/
string[] temp = str[0].Split(',');
/* creates columns of the gridview base on what is stored in the temp array */
foreach (string t in temp)
{
dt.Columns.Add(t, typeof(string));
}
/* retrieves the rows after the first row and adds it to the datatable */
for (int i = 1; i < str.Length; i++)
{
string[] t = str[i].Split(',');
dt.Rows.Add(t);
}
/* assigns the data grid view a data source based on what is stored in dt */
dataGridView1.DataSource = dt;
}
}
catch (Exception ex)
{
MessageBox.Show("Error: The CSV you selected could not be loaded" + ex.Message);
}
This sounds more like a data issue than a programming issue. If your CSV file is supposed to have 10 columns yet some have 11, how are you to know which one is the extra column?
One thing you could do is check for the number of columns before adding the row.
for (int i = 1; i < str.Length; i++)
{
string[] t = str[i].Split(',');
if(t.Length == temp.Length)
dt.Rows.Add(t);
}
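If the extra value comes from a quoted field that contains a comma, a plain Split(',') will always be off by one. Below is a minimal sketch of a quote-aware splitter (hypothetical SplitCsvLine helper, assuming simple quoting with no escaped quotes inside fields) that you could call in place of str[i].Split(','):
using System.Collections.Generic;
using System.Text;

static string[] SplitCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool inQuotes = false;

    foreach (char c in line)
    {
        if (c == '"')
        {
            inQuotes = !inQuotes;            // toggle quoted state, drop the quote itself
        }
        else if (c == ',' && !inQuotes)
        {
            fields.Add(current.ToString());  // a comma is a field boundary only outside quotes
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }
    fields.Add(current.ToString());
    return fields.ToArray();
}
You would then use string[] t = SplitCsvLine(str[i]); inside the loop, and the "test,name" style values stay in one cell.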
I am using Microsoft.Office.Interop.Excel to read a spreadsheet that is open in memory.
gXlWs = (Microsoft.Office.Interop.Excel.Worksheet)gXlApp.ActiveWorkbook.ActiveSheet;
int NumCols = 7;
string[] Fields = new string[NumCols];
string input = null;
int NumRow = 2;
while (Convert.ToString(((Microsoft.Office.Interop.Excel.Range)gXlWs.Cells[NumRow, 1]).Value2) != null)
{
for (int c = 1; c <= NumCols; c++)
{
Fields[c-1] = Convert.ToString(((Microsoft.Office.Interop.Excel.Range)gXlWs.Cells[NumRow, c]).Value2);
}
NumRow++;
//Do my other processing
}
I have 180,000 rows and this turns out to be very slow. I am not sure the "Convert" is efficient. Is there any way I could do this faster?
Moon
Hi, I found a much faster way.
It is better to read the entire data in one go using get_Range. This loads the data into memory and I can loop through it like a normal array.
Microsoft.Office.Interop.Excel.Range range = gXlWs.get_Range("A1", "F188000");
object[,] values = (object[,])range.Value2;
int NumRow=1;
while (NumRow < values.GetLength(0))
{
for (int c = 1; c <= NumCols; c++)
{
Fields[c - 1] = Convert.ToString(values[NumRow, c]);
}
NumRow++;
}
There are several options - all involve some additional library:
OpenXML 2.0 (free library from MS) can be used to read/modify the content of an .xlsx, so you can do with it what you want (a minimal reading sketch follows below)
some (commercial) 3rd-party libraries come with grid controls allowing you to do much more with excel files in your application (be it Winforms/WPF/ASP.NET...) like SpreadsheetGear, Aspose.Cells etc.
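For the OpenXML option, a rough reading sketch (assumes the DocumentFormat.OpenXml package; the usual stumbling block is that text cells only store an index into the shared string table):
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

static void DumpFirstSheet(string path)
{
    using (SpreadsheetDocument doc = SpreadsheetDocument.Open(path, false))
    {
        WorkbookPart wbPart = doc.WorkbookPart;
        WorksheetPart wsPart = wbPart.WorksheetParts.First(); // first part, not necessarily the first visible sheet
        SharedStringTable sst = wbPart.SharedStringTablePart.SharedStringTable; // null if the workbook has no text cells

        foreach (Row row in wsPart.Worksheet.Descendants<Row>())
        {
            foreach (Cell cell in row.Elements<Cell>())
            {
                string value = cell.CellValue == null ? "" : cell.CellValue.Text;
                if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                    value = sst.ElementAt(int.Parse(value)).InnerText; // resolve the shared-string index
                // use value ...
            }
        }
    }
}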
I am not sure the "Convert" is efficient. Is there anyway I could do
this faster?
What makes you believe this? I promise you that Convert.ToString() is the most efficient method call in the code you posted. Your problem is that you're looping through 180,000 records in an Excel document...
You could split the work up; since you know the number of rows, this is trivial to do.
Why are you converting Value2 to a string, exactly?
I found a really fast way to read Excel for my specific case. I need the data as a two-dimensional array of strings. With a really big Excel file, the old way took about one hour; this way, I get my values in 20 seconds.
I am using this NuGet package: https://reposhub.com/dotnet/office/ExcelDataReader-ExcelDataReader.html
And here is my code:
DataSet result = null;
//https://reposhub.com/dotnet/office/ExcelDataReader-ExcelDataReader.html
using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
{
// Auto-detect format, supports:
// - Binary Excel files (2.0-2003 format; *.xls)
// - OpenXml Excel files (2007 format; *.xlsx)
using (var reader = ExcelReaderFactory.CreateReader(stream))
{
result = reader.AsDataSet();
}
}
foreach (DataTable table in result.Tables)
{
if (//my conditions)
{
continue;
}
var rows = table.AsEnumerable().ToArray();
var dataTable = new string[table.Rows.Count][];//[table.Rows[0].ItemArray.Length];
Parallel.For(0, rows.Length, new ParallelOptions { MaxDegreeOfParallelism = 8 },
i =>
{
var row = rows[i];
dataTable[i] = row.ItemArray.Select(x => x.ToString()).ToArray();
});
importedList.Add(dataTable);
}
I guess it's not the Convert that is the source of the slowness...
Actually, retrieving cell values one by one is very slow.
I think this cast is not necessary:
(Microsoft.Office.Interop.Excel.Range)gXlWs
It should work without that.
And you can ask directly:
gXlWs.Cells[NumRow, 1].Value != null
Try to move the entire range or, at least, the entire row to an object Matrix and work with it instead of the range itself.
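For instance, a hedged sketch of the row-at-a-time variant of that idea, assuming the seven fields sit in columns A..G of the sheet:
// one COM round-trip per row instead of seven Cells[...] calls
// (on older compilers use gXlWs.get_Range("A" + NumRow, "G" + NumRow) instead of the indexer)
Microsoft.Office.Interop.Excel.Range rowRange = gXlWs.Range["A" + NumRow, "G" + NumRow];
object[,] rowValues = (object[,])rowRange.Value2;   // 1-based array: [1, 1..7]
for (int c = 1; c <= NumCols; c++)
{
    Fields[c - 1] = Convert.ToString(rowValues[1, c]);
}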
Use the OleDb method; that is the fastest, as follows:
string con =
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\temp\test.xls;" +
    @"Extended Properties='Excel 8.0;HDR=Yes;'";
using(OleDbConnection connection = new OleDbConnection(con))
{
connection.Open();
OleDbCommand command = new OleDbCommand("select * from [Sheet1$]", connection);
using(OleDbDataReader dr = command.ExecuteReader())
{
while(dr.Read())
{
var row1Col0 = dr[0];
Console.WriteLine(row1Col0);
}
}
}
This has been killing me - I have a massive file that I need to read in as a DataTable.
After a lot of messing about I am using this:
using (OleDbConnection connection = new OleDbConnection(connString))
{
using (OleDbCommand command = new OleDbCommand(sql, connection))
{
using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
{
dataTable = new DataTable();
dataTable.Locale = CultureInfo.CurrentCulture;
adapter.Fill(dataTable);
}
}
}
which works if the text file is comma separated but does not work if it is tab delimited. Can anyone please help?
My connection string looks like:
string connString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + pathOnly + @";Extended Properties='text;HDR=YES'";
I've tried to set the FMT property with no luck.
Here is a good library; use it:
http://www.codeproject.com/KB/database/CsvReader.aspx
Here is the code which uses the library.
TextReader tr = new StreamReader(HttpContext.Current.Server.MapPath(Filename));
string data = tr.ReadToEnd();
tr.Close();
For comma delimited:
CachedCsvReader cr = new CachedCsvReader(new StringReader(data), true);
For tab delimited:
CachedCsvReader cr = new CachedCsvReader(new StringReader(data), true, '\t');
And here you can load it into DataTable by having this code
DataTable dt = new DataTable();
dt.Load(cr);
Hope you find it helpful. Thanks
Manually: you can use the String.Split() method to split your entire file. Here is an example of what I use in my code. In this example, I read the data line by line and split it. I then place the data directly into columns.
//Open and read the file
System.IO.FileStream fs = new System.IO.FileStream("myfilename", System.IO.FileMode.Open);
System.IO.StreamReader sr = new System.IO.StreamReader(fs);
string line = "";
line = sr.ReadLine();
string[] colVal;
try
{
//clear of previous data
//myDataTable.Clear();
//for each reccord insert it into a row
while (!sr.EndOfStream)
{
line = sr.ReadLine();
colVal = line.Split('\t');
DataRow dataRow = myDataTable.NewRow();
//associate values with the columns
dataRow["col1"] = colVal[0];
dataRow["col2"] = colVal[1];
dataRow["col3"] = colVal[2];
//add the row to the table
myDataTable.Rows.Add(dataRow);
}
//close the stream
sr.Close();
//binds the dataset tothe grid view.
BindingSource bs = new BindingSource();
bs.DataSource = myDataSet;
bs.DataMember = myDataTable.TableName;
myGridView.DataSource = bs;
}
catch (Exception ex)
{
    // the catch block was missing from the snippet; handle or log the error as appropriate
    Console.WriteLine(ex.Message);
}
You could modify it to do some loops for the columns if you have many and they are numbered. Also, I recommend checking the integrity first, by checking that the number of columns read is correct.
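For example, a hedged sketch of that variation, reusing the variables from the snippet above: the columns are addressed by index in a loop, and lines whose field count doesn't match the table are skipped:
while (!sr.EndOfStream)
{
    line = sr.ReadLine();
    colVal = line.Split('\t');

    // integrity check: skip (or log) lines that don't match the column count
    if (colVal.Length != myDataTable.Columns.Count)
        continue;

    DataRow dataRow = myDataTable.NewRow();
    for (int i = 0; i < colVal.Length; i++)
    {
        dataRow[i] = colVal[i];   // columns addressed by position instead of by name
    }
    myDataTable.Rows.Add(dataRow);
}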
This should work: (from http://www.hotblue.com/article0000.aspx?a=0006)
just replace the comma part with:
if ((postdata || !quoted) && (c == ',' || c == '\t'))
to make it tab delimited.
using System.Collections;
using System.Data;
using System.IO;
using System.Text.RegularExpressions;
public DataTable ParseCSV(string inputString) {
DataTable dt=new DataTable();
// declare the Regular Expression that will match versus the input string
Regex re=new Regex("((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))");
ArrayList colArray=new ArrayList();
ArrayList rowArray=new ArrayList();
int colCount=0;
int maxColCount=0;
string rowbreak="";
string field="";
MatchCollection mc=re.Matches(inputString);
foreach(Match m in mc) {
// retrieve the field and replace two double-quotes with a single double-quote
field=m.Result("${field}").Replace("\"\"","\"");
rowbreak=m.Result("${rowbreak}");
if (field.Length > 0) {
colArray.Add(field);
colCount++;
}
if (rowbreak.Length > 0) {
// add the column array to the row Array List
rowArray.Add(colArray.ToArray());
// create a new Array List to hold the field values
colArray=new ArrayList();
if (colCount > maxColCount)
maxColCount=colCount;
colCount=0;
}
}
if (rowbreak.Length == 0) {
// this is executed when the last line doesn't
// end with a line break
rowArray.Add(colArray.ToArray());
if (colCount > maxColCount)
maxColCount=colCount;
}
// create the columns for the table
for(int i=0; i < maxColCount; i++)
dt.Columns.Add(String.Format("col{0:000}",i));
// convert the row Array List into an Array object for easier access
Array ra=rowArray.ToArray();
for(int i=0; i < ra.Length; i++) {
// create a new DataRow
DataRow dr=dt.NewRow();
// convert the column Array List into an Array object for easier access
Array ca=(Array)(ra.GetValue(i));
// add each field into the new DataRow
for(int j=0; j < ca.Length; j++)
dr[j]=ca.GetValue(j);
// add the new DataRow to the DataTable
dt.Rows.Add(dr);
}
// in case no data was parsed, create a single column
if (dt.Columns.Count == 0)
dt.Columns.Add("NoData");
return dt;
}
Now that we have a parser for converting a string into a DataTable, all we need now is a function that will read the content from a CSV file and pass it to our ParseCSV function:
public DataTable ParseCSVFile(string path) {
string inputString="";
// check that the file exists before opening it
if (File.Exists(path)) {
StreamReader sr = new StreamReader(path);
inputString = sr.ReadToEnd();
sr.Close();
}
return ParseCSV(inputString);
}
And now you can easily fill a DataGrid with data coming off the CSV file:
protected System.Web.UI.WebControls.DataGrid DataGrid1;
private void Page_Load(object sender, System.EventArgs e) {
// call the parser
DataTable dt=ParseCSVFile(Server.MapPath("./demo.csv"));
// bind the resulting DataTable to a DataGrid Web Control
DataGrid1.DataSource=dt;
DataGrid1.DataBind();
}