DataTable out of memory exception - C#

We have lots of old Excel files that contain quite a bit of data, and I am trying to get this data into SQL Server.
I have a C# application that I have used before to upload data from Excel to SQL; the code is shown below.
The Excel sheet has dates going across the sheet in row 4, with the first date in cell D4. IDs (strings) go down column A from cell A5 to A11005. The values are of type double.
I am getting a System.OutOfMemoryException. I am surprised, though, as the exception is thrown on the 10,785th row and 333rd column. Is it really out of memory? I thought this wouldn't be a huge amount of data for a DataTable, to be honest.
11,000 IDs times 785 dates is 8,635,000 doubles. Is Visual Studio out of memory? I have a 64-bit PC with 32 GB of RAM.
DataTable dt = new DataTable();
dt.Columns.Add("DateTM", typeof(DateTime));
dt.Columns.Add("Id", typeof(string));
dt.Columns.Add("Vtm", typeof(double));

OpenExcelWorkbook(path + fileName, true);
XlWorksheet = (Excel.Worksheet)XlWorkbook.Worksheets["Sheet1"];
Rng = XlWorksheet.UsedRange;
object[,] valueArray = (object[,])Rng.get_Value(Excel.XlRangeValueDataType.xlRangeValueDefault);
XlWorkbook.Close(false);

// dates start in cell D4
DateTime[] dates = new DateTime[valueArray.GetLength(1) - 3];
for (int t = 4; t <= valueArray.GetLength(1); t++)
    dates[t - 4] = Convert.ToDateTime(valueArray[4, t]);

// values start from row 5
for (int n = 5; n <= valueArray.GetLength(0); n++)
{
    string id = valueArray[n, 1].ToString().Trim();
    // dates start from column D
    for (int m = 4; m <= valueArray.GetLength(1); m++)
    {
        double vt = Convert.ToDouble(valueArray[n, m]);
        if (vt == -2146826246) // for any #N/A values
            vt = -999;
        dt.Rows.Add(dates[m - 4], id, vt);
    }
}

using (SqlBulkCopy sqlBulk = new SqlBulkCopy(UtilityLibrary.Database.Connections.Myconnection))
{
    sqlBulk.BulkCopyTimeout = 0;
    sqlBulk.DestinationTableName = "tblMyTbl";
    sqlBulk.WriteToServer(dt);
}
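One likely culprit, assuming nothing else is eating the heap: the process is running as 32-bit (the AnyCPU "Prefer 32-bit" build option is often on by default), which caps it at roughly 2 GB of address space regardless of the 32 GB of installed RAM, and each of the 8.6 million DataRows costs far more than the 8 bytes of the double it stores. Besides targeting x64, peak memory can be capped by flushing the DataTable to SqlBulkCopy in batches; the sketch below reuses dt, dates, and valueArray from the code above:

const int batchSize = 100000; // flush every 100k rows so the DataTable stays small
using (SqlBulkCopy sqlBulk = new SqlBulkCopy(UtilityLibrary.Database.Connections.Myconnection))
{
    sqlBulk.BulkCopyTimeout = 0;
    sqlBulk.DestinationTableName = "tblMyTbl";
    for (int n = 5; n <= valueArray.GetLength(0); n++)
    {
        string id = valueArray[n, 1].ToString().Trim();
        for (int m = 4; m <= valueArray.GetLength(1); m++)
        {
            double vt = Convert.ToDouble(valueArray[n, m]);
            if (vt == -2146826246) // for any #N/A values
                vt = -999;
            dt.Rows.Add(dates[m - 4], id, vt);
            if (dt.Rows.Count >= batchSize)
            {
                sqlBulk.WriteToServer(dt); // push this batch to SQL Server
                dt.Clear();                // release the rows before building the next batch
            }
        }
    }
    if (dt.Rows.Count > 0)
        sqlBulk.WriteToServer(dt); // remaining partial batch
}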

Related

"The source contains no DataRows" error when a for loop has one iteration

I am making a program in Visual Studio where you can read in an Excel file in a specific format; the program converts the data from the Excel file into a different format and stores it in a database table.
Below is the part of my code where something strange happens:
//copy schema into new datatable
DataTable _longDataTable = _library.Clone();
foreach (DataRow drlibrary in _library.Rows)
{
    //count number of variables in a row
    string check = drlibrary["Check"].ToString();
    int varCount = check.Length - check.Replace("{", "").Length;
    int count_and = 0;
    if (check.Contains("and") || check.Contains("or"))
    {
        count_and = Regex.Matches(check, "and").Count;
        varCount = varCount - count_and;
    }
    //loop through number of counted variables in order to add rows to long datatable (one row per variable)
    for (int i = 1; i <= varCount; i++)
    {
        var newRow = _longDataTable.NewRow();
        newRow.ItemArray = drlibrary.ItemArray;
        string j = i.ToString();
        //fill variablename with variable number
        if (i < 10)
        {
            newRow["VariableName"] = "Variable0" + j;
        }
        else
        {
            newRow["VariableName"] = "Variable" + j;
        }
    }
}
When varCount equals 1, I get the following error message when running the program after importing an Excel file:
The source contains no DataRows.
I don't know why the loop fails with just one iteration. Can anyone help me?
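A likely cause, assuming the error surfaces later when the rows are consumed (for example by CopyToDataTable or a bulk insert): NewRow() only creates a detached row, and the loop never attaches it, so _longDataTable stays empty. A minimal sketch of the fix:

for (int i = 1; i <= varCount; i++)
{
    var newRow = _longDataTable.NewRow();
    newRow.ItemArray = drlibrary.ItemArray;
    // "D2" pads to two digits, equivalent to the original if/else: Variable01 ... Variable10
    newRow["VariableName"] = "Variable" + i.ToString("D2");
    _longDataTable.Rows.Add(newRow); // attach the row; without this the table stays empty
}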

Fastest way to loop through a SQL database column against an Excel column - C#

I have a SQL table with two columns, OldValue and NewValue, and the same two columns in an Excel spreadsheet. I want to find the quickest way to iterate through both, checking whether the OldValue column in the database matches the OldValue column in the spreadsheet.
Currently I iterate the entire SQL column (333,228 records) looking for a match against the Excel column, which has 153,000 rows. This is performance heavy and runs for hours without finishing. How can I speed this up? 153,000 × 333,228 is roughly 51 billion comparisons, which is computationally intensive.
I read https://codereview.stackexchange.com/questions/47368/looping-through-an-excel-document-in-c but couldn't find what I was looking for. The code works and has already found 500 matches, but it's slow considering I need to get through 333,228 records in the database.
List<sim_info> exel_sims = new List<sim_info>();
Microsoft.Office.Interop.Excel.Application Excel_app = new Microsoft.Office.Interop.Excel.Application();
Microsoft.Office.Interop.Excel.Workbooks work_books = Excel_app.Workbooks;
string excel_file_path = Application.StartupPath + "\\TestSample";
Microsoft.Office.Interop.Excel.Workbook work_book = work_books.Open(excel_file_path);
work_book.SaveAs(excel_file_path + ".csv", Microsoft.Office.Interop.Excel.XlFileFormat.xlCSVWindows);
Microsoft.Office.Interop.Excel.Sheets work_sheets = work_book.Worksheets;
Microsoft.Office.Interop.Excel.Worksheet work_sheet = (Microsoft.Office.Interop.Excel.Worksheet)work_sheets.get_Item(1);

// Declared here so the snippet compiles; both were implied by the original post.
int zero_count = 0;
sim_info temp_sim_info = new sim_info();

for (int j = 2; j < work_sheet.Rows.Count; j++)
{
    try
    {
        temp_sim_info.msisdn = cell_to_str(work_sheet.Cells[j, 1]).Trim();
        temp_sim_info.mtn_new_number = cell_to_str(work_sheet.Cells[j, 8]).Trim();
        temp_sim_info.status = cell_to_str(work_sheet.Cells[j, 9]).Trim();
        // A valid cellphone number contains 11 digits (+27XXXXXXXXX) or 14 digits
        // for the new msisdn. This condition checks for invalid cellphone numbers.
        if (temp_sim_info.msisdn.Length < 5 || temp_sim_info.mtn_new_number.Length > 15)
        {
            if (zero_count++ > 10)
                break;
        }
        else
        {
            zero_count = 0;
            exel_sims.Add(temp_sim_info);
            if (exel_sims.Count % 10 == 0)
            {
                txtExcelLoading.Text = exel_sims.Count.ToString();
            }
        }
    }
    catch
    {
        if (zero_count++ > 10)
            break;
    }
}

txtExcelLoading.Text = exel_sims.Count.ToString();
work_sheet.Columns.AutoFit();

for (int i = 0; i < TestTableInstance.Rows.Count; i++)
{
    string db_oldNumbers = "";
    string db_CellNumber = "";
    if (!TestTableInstance.Rows[i].IsNull("OldNumber"))
        db_oldNumbers = TestTableInstance[i].OldNumber;
    else
        db_oldNumbers = TestTableInstance[i].CellNumber;
    if (!TestTableInstance.Rows[i].IsNull("CellNumber"))
        db_CellNumber = temp_sim_info.mtn_new_number;
    for (int k = 0; k < exel_sims.Count; k++)
    {
        sim_info sim_Result = exel_sims.Find(x => TestTableInstance[i].CellNumber == x.msisdn);
        if (TestTableInstance[i].CellNumber == exel_sims[k].msisdn && sim_Result != null)
        {
            //If match found then do logic here
        }
    }
}

MessageBox.Show("DONE");
TestTableInstance is the database table loaded in memory as part of a DataSet. For each database record, the inner loop scans the entire Excel list until it finds a match on the OldValue column.
My code works; it's tried and tested with an Excel sheet of 800 rows and a DB table of 1,000 records, and completes in under 5 minutes. But for hundreds of thousands of records it hangs for hours.
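Even staying in C#, the nested loop can be removed outright: index one side in a Dictionary keyed by the number, which turns the O(n × m) scan into a single O(n + m) pass. A rough sketch, assuming the sim_info list and typed TestTableInstance from the code above:

// Build a lookup of the Excel rows once: O(m).
var byMsisdn = new Dictionary<string, sim_info>();
foreach (sim_info s in exel_sims)
    byMsisdn[s.msisdn] = s; // last duplicate wins; store a List<sim_info> if duplicates matter

// Single pass over the database rows: O(n).
for (int i = 0; i < TestTableInstance.Rows.Count; i++)
{
    string cellNumber = TestTableInstance[i].CellNumber;
    if (byMsisdn.TryGetValue(cellNumber, out sim_info match))
    {
        //If match found then do logic here
    }
}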
Exactly! Why are you using C# for this? Load the Excel file into a temp table in your DB and do a comparison between your actual SQL table (which allegedly has all the data you have in the Excel file) and the temp table (or a view). This kind of comparison should complete in a couple of seconds.
select *
from dbtest02.dbo.article d2
left join dbtest01.dbo.article d1 on d2.id = d1.id
The left join shows all rows from the left table "dbtest02.dbo.article", even if there are no matches in the "dbtest01.dbo.article":
OR
select * from dbtest02.dbo.article
except
select * from dbtest01.dbo.article
See the link below for some other ideas of how to do this.
https://www.mssqltips.com/sqlservertip/2779/ways-to-compare-and-find-differences-for-sql-server-tables-and-data/
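A sketch of that approach from the C# side, when the load itself must happen in the application: bulk-copy the spreadsheet rows into a staging table and let the server do the set-based comparison. The names #ExcelStaging, tblNumbers, excelTable, and connectionString are placeholders, not from the original post (requires using System.Data.SqlClient):

using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    // Staging table for the spreadsheet data; a # temp table lives only for this connection.
    using (SqlCommand create = new SqlCommand(
        "CREATE TABLE #ExcelStaging (OldValue nvarchar(20), NewValue nvarchar(20))", conn))
    {
        create.ExecuteNonQuery();
    }
    // excelTable is a DataTable already filled from the spreadsheet.
    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "#ExcelStaging";
        bulk.WriteToServer(excelTable);
    }
    // One set-based join instead of billions of C# comparisons.
    using (SqlCommand match = new SqlCommand(
        "SELECT t.OldValue, t.NewValue " +
        "FROM tblNumbers t JOIN #ExcelStaging e ON e.OldValue = t.OldValue", conn))
    using (SqlDataReader reader = match.ExecuteReader())
    {
        while (reader.Read())
        {
            //If match found then do logic here
        }
    }
}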

Optimize performance of data processing method

I am using the following code to take some data (in an XML-like, not well-formed format) from a .txt file and write it to an .xlsx using EPPlus after doing some processing. StreamElements is basically a modified XmlReader. My question is about performance: I have made a couple of changes but don't see what else I can do. I'm going to use this for large datasets, so I'm trying to make it as efficient and fast as possible. Any help will be appreciated!
I tried using p.SaveAs() to do the Excel writing but didn't really see a performance difference. Are there better, faster ways to do the writing? Any suggestions are welcome.
using (ExcelPackage p = new ExcelPackage())
{
    ExcelWorksheet ws = p.Workbook.Worksheets[1];
    ws.Name = "data1";
    int rowIndex = 1;
    int colIndex = 1;
    foreach (var element in StreamElements(pa, "XML"))
    {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));
        string[] data = string.Join(",", values).Split(',');
        data[2] = toDateTime(data[2]);
        for (int i = 0; i < data.Count(); i++)
        {
            if (rowIndex < 1000000)
            {
                var cell1 = ws.Cells[rowIndex, colIndex];
                cell1.Value = data[i];
                colIndex++;
            }
        }
        rowIndex++;
    }
    ws.Cells[ws.Dimension.Address].AutoFitColumns();
    Byte[] bin = p.GetAsByteArray();
    using (FileStream fs = File.OpenWrite("C:\\test.xlsx"))
    {
        fs.Write(bin, 0, bin.Length);
    }
}
Currently, for it to do the processing and then write one million lines into an Excel worksheet, it takes about 30-35 minutes.
I've run into this issue before: Excel has a huge overhead when you modify worksheet cells individually, one by one.
The solution is to build an object array and populate the worksheet with a single range write (EPPlus lets you assign a 2D object array to a multi-cell range's Value, which is what the code below does).
using (ExcelPackage p = new ExcelPackage())
{
    ExcelWorksheet ws = p.Workbook.Worksheets[1];
    ws.Name = "data1";

    //Starting cell
    int startRow = 1;
    int startCol = 1;

    //Needed for 2D object array later on
    int maxColCount = 0;
    int maxRowCount = 0;

    //Queue data
    Queue<string[]> dataQueue = new Queue<string[]>();

    //Tried not to touch this part
    foreach (var element in StreamElements(pa, "XML"))
    {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));
        //Removed unnecessary split and join, use ToArray instead
        string[] eData = values.ToArray();
        eData[2] = toDateTime(eData[2]);
        //Push the data to the queue and increment counters (if needed)
        dataQueue.Enqueue(eData);
        if (eData.Length > maxColCount)
            maxColCount = eData.Length;
        maxRowCount++;
    }

    //We now have the dimensions needed for our object array
    object[,] excelArr = new object[maxRowCount, maxColCount];

    //Dequeue data from the queue and populate the object matrix
    int i = 0;
    while (dataQueue.Count > 0)
    {
        string[] eData = dataQueue.Dequeue();
        for (int j = 0; j < eData.Length; j++)
        {
            excelArr[i, j] = eData[j];
        }
        i++;
    }

    //Write the whole array to the worksheet in one range assignment
    ws.Cells[startRow, startCol,
             startRow + maxRowCount - 1, startCol + maxColCount - 1].Value = excelArr;

    //Tried not to touch this stuff
    ws.Cells[ws.Dimension.Address].AutoFitColumns();
    Byte[] bin = p.GetAsByteArray();
    using (FileStream fs = File.OpenWrite("C:\\test.xlsx"))
    {
        fs.Write(bin, 0, bin.Length);
    }
}
I didn't try compiling this code, so double-check the indexing and watch for small syntax errors.
A few extra pointers to consider for performance:
Try to parallelize the population of the object array, since it is primarily index based (perhaps keep a Dictionary<int, string[]> with an index tracker); see the sketch after this list. You would likely have to trade space for time.
See if you can hardcode the column and row counts, or determine them quickly. In my fix I count the maximum rows and columns on the fly; I wouldn't recommend that as a permanent solution.
AutoFitColumns is very costly, especially when you're dealing with over a million rows.
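A minimal sketch of that parallelization idea, assuming the parsed rows have been collected into a List<string[]> named rows (a Queue can't be indexed, so the queue above would first need to become a list). Each iteration writes only its own row of the array, so no locking is needed:

// using System.Threading.Tasks;
object[,] excelArr = new object[maxRowCount, maxColCount];
Parallel.For(0, rows.Count, i =>
{
    // Row i of excelArr is touched by exactly one iteration.
    string[] eData = rows[i];
    for (int j = 0; j < eData.Length; j++)
        excelArr[i, j] = eData[j];
});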

Pick random records from a DataTable

I'm trying to create an application which imports an Excel file, reads the data from it, and returns n records randomly as winners, according to how many winners the user wants from that list. I read the data from the Excel file and assign it to a DataTable called dt. Here is a small overview.
That's the first 30 records in the Excel file, which will be imported into dt. Now if the user keys in 10 (the total number of winners), I need to pick 10 winners RANDOMLY from this dt. But as you can see, some entries are duplicated; for example, in column D the entry named "H" has 6 rows. If the application chooses one of them, the other "H" rows have to be removed, but only after one has been chosen: removing the duplicates before the draw would lower those entries' chance of winning the better prizes.
Could you try something like,
dt2 = dt.Clone();
dt.AsEnumerable().Select(x => x["IC_NUMBER"].ToString()).Distinct().ToList().ForEach(x =>
{
    DataRow[] dr = dt.Select("IC_NUMBER = '" + x + "'");
    dt2.ImportRow(dr[0]);
    dr.ToList().ForEach(y => dt.Rows.Remove(y));
    dt.AcceptChanges();
});
EDIT:
int totalWinners = 10;
Random rnd = new Random();
dt2 = dt.Clone();
for (int i = 1; i <= totalWinners; i++)
{
    //Pick a random DataRow (Random.Next's upper bound is exclusive, so use
    //dt.Rows.Count, not dt.Rows.Count - 1, or the last row can never win)
    DataRow selectedWinner = dt.Rows[rnd.Next(0, dt.Rows.Count)];
    //Insert it in the second table
    dt2.ImportRow(selectedWinner);
    //Retrieve the other DataRows that have the same 'IC NUMBER'
    var rows = dt.AsEnumerable().Where(x => x["IC NUMBER"].ToString() ==
        selectedWinner["IC NUMBER"].ToString());
    //Delete all rows with the selected IC NUMBER from the first table
    rows.ToList().ForEach(y => dt.Rows.Remove(y));
    dt.AcceptChanges();
}
Hope this helps...
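For comparison, a LINQ sketch of the same draw, assuming the dt/dt2 tables above and "IC_NUMBER" as the de-duplication key (the answer uses both "IC_NUMBER" and "IC NUMBER"; use whichever matches your schema). Shuffling first and grouping afterwards preserves each duplicate row's individual chance of winning:

int totalWinners = 10;
Random rnd = new Random();
DataTable dt2 = dt.Clone();
var winners = dt.AsEnumerable()
    .OrderBy(_ => rnd.Next())                    // shuffle all rows
    .GroupBy(r => r.Field<string>("IC_NUMBER"))  // groups surface in shuffled order,
                                                 // so entrants with more rows tend to come first
    .Select(g => g.First())                      // one random row per entrant
    .Take(totalWinners);
foreach (DataRow winner in winners)
    dt2.ImportRow(winner);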

How do I read an Excel file in C# without missing any columns?

I've been using an OleDb connection to read Excel files successfully for quite a while now, but I've run across a problem: someone is trying to upload an Excel spreadsheet with nothing in the first column, and when I try to read the file, that column isn't recognized.
I'm currently using the following OleDb connection string:
Provider=Microsoft.Jet.OLEDB.4.0;
Data Source=c:\test.xls;
Extended Properties="Excel 8.0;IMEX=1;"
So, if there are 13 columns in the excel file, the OleDbDataReader I get back only has 12 columns/fields.
Any insight would be appreciated.
SpreadsheetGear for .NET gives you an API for working with xls and xlsx workbooks from .NET. It is easier to use and faster than OleDB or the Excel COM object model. You can see the live samples or try it for yourself with the free trial.
Disclaimer: I own SpreadsheetGear LLC
EDIT:
StingyJack commented "Faster than OleDb? Better back that claim up".
This is a reasonable request. I see claims all the time which I know for a fact to be false, so I cannot blame anyone for being skeptical.
Below is the code to create a 50,000 row by 10 column workbook with SpreadsheetGear, save it to disk, and then sum the numbers using OleDb and SpreadsheetGear. SpreadsheetGear reads the 500K cells in 0.31 seconds compared to 0.63 seconds with OleDB - just over twice as fast. SpreadsheetGear actually creates and reads the workbook in less time than it takes to read the workbook with OleDB.
The code is below. You can try it yourself with the SpreadsheetGear free trial.
using System;
using System.Data;
using System.Data.OleDb;
using SpreadsheetGear;
using SpreadsheetGear.Advanced.Cells;
using System.Diagnostics;

namespace SpreadsheetGearAndOleDBBenchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Warm up (get the code JITed).
            BM(10, 10);
            // Do it for real.
            BM(50000, 10);
        }

        static void BM(int rows, int cols)
        {
            // Compare the performance of OleDB to SpreadsheetGear for reading
            // workbooks. We sum numbers just to have something to do.
            //
            // Run on Windows Vista 32 bit, Visual Studio 2008, Release Build,
            // Run Without Debugger:
            //     Create time: 0.25 seconds
            //     OleDb Time: 0.63 seconds
            //     SpreadsheetGear Time: 0.31 seconds
            //
            // SpreadsheetGear is more than twice as fast at reading. Furthermore,
            // SpreadsheetGear can create the file and read it faster than OleDB
            // can just read it.
            string filename = @"C:\tmp\SpreadsheetGearOleDbBenchmark.xls";
            Console.WriteLine("\nCreating {0} rows x {1} columns", rows, cols);
            Stopwatch timer = Stopwatch.StartNew();
            double createSum = CreateWorkbook(filename, rows, cols);
            double createTime = timer.Elapsed.TotalSeconds;
            Console.WriteLine("Create sum of {0} took {1} seconds.", createSum, createTime);
            timer = Stopwatch.StartNew();
            double oleDbSum = ReadWithOleDB(filename);
            double oleDbTime = timer.Elapsed.TotalSeconds;
            Console.WriteLine("OleDb sum of {0} took {1} seconds.", oleDbSum, oleDbTime);
            timer = Stopwatch.StartNew();
            double spreadsheetGearSum = ReadWithSpreadsheetGear(filename);
            double spreadsheetGearTime = timer.Elapsed.TotalSeconds;
            Console.WriteLine("SpreadsheetGear sum of {0} took {1} seconds.", spreadsheetGearSum, spreadsheetGearTime);
        }

        static double CreateWorkbook(string filename, int rows, int cols)
        {
            IWorkbook workbook = Factory.GetWorkbook();
            IWorksheet worksheet = workbook.Worksheets[0];
            IValues values = (IValues)worksheet;
            double sum = 0.0;
            Random rand = new Random();
            // Put labels in the first row.
            foreach (IRange cell in worksheet.Cells[0, 0, 0, cols - 1])
                cell.Value = "Cell-" + cell.Address;
            // Using IRange and foreach would be less code,
            // but we'll do it the fast way.
            for (int row = 1; row <= rows; row++)
            {
                for (int col = 0; col < cols; col++)
                {
                    double number = rand.NextDouble();
                    sum += number;
                    values.SetNumber(row, col, number);
                }
            }
            workbook.SaveAs(filename, FileFormat.Excel8);
            return sum;
        }

        static double ReadWithSpreadsheetGear(string filename)
        {
            IWorkbook workbook = Factory.GetWorkbook(filename);
            IWorksheet worksheet = workbook.Worksheets[0];
            IValues values = (IValues)worksheet;
            IRange usedRange = worksheet.UsedRange;
            int rowCount = usedRange.RowCount;
            int colCount = usedRange.ColumnCount;
            double sum = 0.0;
            // We could use foreach (IRange cell in usedRange) for cleaner
            // code, but this is faster.
            for (int row = 1; row <= rowCount; row++)
            {
                for (int col = 0; col < colCount; col++)
                {
                    IValue value = values[row, col];
                    if (value != null && value.Type == SpreadsheetGear.Advanced.Cells.ValueType.Number)
                        sum += value.Number;
                }
            }
            return sum;
        }

        static double ReadWithOleDB(string filename)
        {
            string connectionString =
                "Provider=Microsoft.Jet.OLEDB.4.0;" +
                "Data Source=" + filename + ";" +
                "Extended Properties=Excel 8.0;";
            OleDbConnection connection = new OleDbConnection(connectionString);
            connection.Open();
            OleDbCommand selectCommand = new OleDbCommand("SELECT * FROM [Sheet1$]", connection);
            OleDbDataAdapter dataAdapter = new OleDbDataAdapter();
            dataAdapter.SelectCommand = selectCommand;
            DataSet dataSet = new DataSet();
            dataAdapter.Fill(dataSet);
            connection.Close();
            double sum = 0.0;
            // We'll make some assumptions for brevity of the code.
            DataTable dataTable = dataSet.Tables[0];
            int cols = dataTable.Columns.Count;
            foreach (DataRow row in dataTable.Rows)
            {
                for (int i = 0; i < cols; i++)
                {
                    object val = row[i];
                    if (val is double)
                        sum += (double)val;
                }
            }
            return sum;
        }
    }
}
We always use Excel Interop to open the spreadsheet and parse it directly (e.g. similar to how you would scan through cells in VBA), or we create locked-down templates that force certain columns to be filled in before the user can save the data.
You could also look at ExcelMapper, a tool that reads Excel files as strongly typed objects. It hides all the details of reading an Excel file from your code and takes care of cases where your Excel file is missing a column or data is missing from a column; you read only the data you are interested in. You can get the code/executable for ExcelMapper from http://code.google.com/p/excelmapper/.
If you could require the format of the Excel sheet to have column headers, then you would always have the 13 columns; you would just need to skip the header row when processing.
This would also handle situations where the user puts the columns in an order you are not expecting: detect the column indexes in the header row and read accordingly, as sketched below.
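A rough sketch of that header-mapping idea, assuming an OleDb connection opened with HDR=Yes and a sheet whose header row contains a "CustomerId" column (both the column name and the path are placeholders; requires using System.Data.OleDb):

string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;" +
                 "Data Source=c:\\test.xls;" +
                 "Extended Properties=\"Excel 8.0;HDR=Yes;IMEX=1\"";
using (OleDbConnection conn = new OleDbConnection(connStr))
using (OleDbCommand cmd = new OleDbCommand("SELECT * FROM [Sheet1$]", conn))
{
    conn.Open();
    using (OleDbDataReader reader = cmd.ExecuteReader())
    {
        // Map columns by header name instead of relying on their position.
        int customerIdOrdinal = reader.GetOrdinal("CustomerId");
        while (reader.Read())
        {
            string customerId = reader.IsDBNull(customerIdOrdinal)
                ? ""
                : reader.GetValue(customerIdOrdinal).ToString();
            // ... read the remaining columns the same way ...
        }
    }
}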
I see that others are recommending Excel Interop, but jeez, that's a slow option compared to the OleDb route. Plus it requires Excel or OWC to be installed on the server (licensing).
You might try using Excel and COM. That way you'll be getting your info straight from the horse's mouth, as it were.
From D. Anand over on the MSDN forums:
Create a reference in your project to Excel Objects Library. The excel object library can be added in the COM tab of adding reference dialog.
Here's some info on the Excel object model in C#
http://msdn.microsoft.com/en-us/library/aa168292(office.11).aspx
I recommend you try Visual Studio Tools for Office and Excel Interop! It's very easy to use.
