Fastest Way to Loop through SQL Database column against Excel Column - C#

I have a SQL table with two columns: OldValue and NewValue. I have the same two columns in an Excel spreadsheet. I want to find the quickest way to iterate through both the database and the spreadsheet, checking whether the OldValue column in the database matches the OldValue column in the spreadsheet.
My logic iterates over the entire SQL column (333228 records) looking for a match against the Excel column, which has 153 000 rows. This iteration is performance heavy and takes hours without finishing; it ends up hanging. How can I do this quickly? 153 000 x 333228 is roughly 51 billion iterations, which is computationally intensive.
I read https://codereview.stackexchange.com/questions/47368/looping-through-an-excel-document-in-c but couldn't find what I was looking for. The code works and has already found 500 matches, but it's slow considering I need to get through 333228 records in the database.
List<sim_info> exel_sims = new List<sim_info>();
Microsoft.Office.Interop.Excel.Application Excel_app = new Microsoft.Office.Interop.Excel.Application();
Microsoft.Office.Interop.Excel.Workbooks work_books = Excel_app.Workbooks;
string excel_file_path = Application.StartupPath + "\\TestSample";
Microsoft.Office.Interop.Excel.Workbook work_book = work_books.Open(excel_file_path);
work_book.SaveAs(excel_file_path + ".csv", Microsoft.Office.Interop.Excel.XlFileFormat.xlCSVWindows);
Microsoft.Office.Interop.Excel.Sheets work_sheets = work_book.Worksheets;
Microsoft.Office.Interop.Excel.Worksheet work_sheet = (Microsoft.Office.Interop.Excel.Worksheet)work_sheets.get_Item(1);
for (int j = 2; j < work_sheet.Rows.Count; j++)
{
try
{
temp_sim_info.msisdn = cell_to_str(work_sheet.Cells[j, 1]).Trim();
temp_sim_info.mtn_new_number = cell_to_str(work_sheet.Cells[j, 8]).Trim();
temp_sim_info.status = cell_to_str(work_sheet.Cells[j, 9]).Trim();
if (temp_sim_info.msisdn.Length < 5 || temp_sim_info.mtn_new_number.Length > 15) //Valid cellphone number length contains 11 digits +27XXXXXXXXX / 14 digits for the new msisdn. This condition checks for invalid cellphone numbers
{
if (zero_count++ > 10)
break;
}
else
{
zero_count = 0;
exel_sims.Add(temp_sim_info);
if (exel_sims.Count % 10 == 0)
{
txtExcelLoading.Text = exel_sims.Count.ToString();
}
}
}
catch
{
if (zero_count++ > 10)
break;
}
// }
txtExcelLoading.Text = exel_sims.Count.ToString();
work_sheet.Columns.AutoFit();
for (int i = 0; i < TestTableInstance.Rows.Count; i++)
{
string db_oldNumbers = "";
string db_CellNumber = "";
if (!TestTableInstance.Rows[i].IsNull("OldNumber"))
db_oldNumbers = TestTableInstance[i].OldNumber;
else
db_oldNumbers = TestTableInstance[i].CellNumber;
if (!TestTableInstance.Rows[i].IsNull("CellNumber"))
db_CellNumber = temp_sim_info.mtn_new_number;
for (int k = 0; k < exel_sims.Count; k++)
{
sim_info sim_Result = exel_sims.Find(x => TestTableInstance[i].CellNumber == x.msisdn);
if (TestTableInstance[i].CellNumber == exel_sims[k].msisdn && sim_Result != null)
{
//If match found then do logic here
}
}
}
}
MessageBox.Show("DONE");
TestTableInstance is a DataSet of the database loaded in memory. The second (inner) loop iterates over the entire DB column for each record until it finds a match in the first row of the OldValue column in the spreadsheet.
My code works. It's tried and tested with an Excel sheet of 800 rows and a DB table of 1000 records, where it completes in under 5 minutes. But with hundreds of thousands of records it hangs for hours.

Why use C# for this at all? Load the Excel file into a temp table in your DB and compare your actual SQL table (which allegedly has all the data you have in the Excel file) against the temp table (or view). This kind of comparison should complete in a couple of seconds.
select *
from dbtest02.dbo.article d2
left join dbtest01.dbo.article d1 on d2.id=d1.id
The left join shows all rows from the left table "dbtest02.dbo.article", even if there are no matches in "dbtest01.dbo.article".
Or, to return only the rows of dbtest02 that have no identical row in dbtest01:
select * from dbtest02.dbo.article
except
select * from dbtest01.dbo.article
See the link below for some other ideas of how to do this.
https://www.mssqltips.com/sqlservertip/2779/ways-to-compare-and-find-differences-for-sql-server-tables-and-data/
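If the comparison has to stay in C#, the nested loops can be replaced by a hash lookup: load the Excel values into a dictionary once, then make a single pass over the database rows. The sketch below is only an illustration under assumptions, not the asker's code; `SimInfo`, `SimMatcher.FindMatches`, and the sample numbers are hypothetical stand-ins for `sim_info`, the matching logic, and the real data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's sim_info class.
class SimInfo
{
    public string Msisdn = "";
    public string NewNumber = "";
}

static class SimMatcher
{
    // Returns (database number, matching Excel row) pairs.
    // Building the dictionary is O(m) and each lookup is O(1) on average,
    // so the comparison is O(n + m) instead of O(n * m).
    public static List<KeyValuePair<string, SimInfo>> FindMatches(
        IEnumerable<string> dbNumbers, IEnumerable<SimInfo> excelSims)
    {
        var byMsisdn = new Dictionary<string, SimInfo>(StringComparer.Ordinal);
        foreach (var sim in excelSims)
            byMsisdn[sim.Msisdn] = sim; // last row wins on duplicate numbers

        var matches = new List<KeyValuePair<string, SimInfo>>();
        foreach (var number in dbNumbers)
        {
            if (number != null && byMsisdn.TryGetValue(number, out var sim))
                matches.Add(new KeyValuePair<string, SimInfo>(number, sim));
        }
        return matches;
    }
}

class Program
{
    static void Main()
    {
        // Sample data, invented for the example.
        var excel = new List<SimInfo>
        {
            new SimInfo { Msisdn = "27820000001", NewNumber = "27600000001" },
            new SimInfo { Msisdn = "27820000002", NewNumber = "27600000002" },
        };
        var db = new[] { "27820000001", "27820000003" };

        foreach (var match in SimMatcher.FindMatches(db, excel))
            Console.WriteLine(match.Key + " -> " + match.Value.NewNumber);
    }
}
```

With 333228 database rows and 153 000 Excel rows this does a few hundred thousand dictionary operations rather than billions of comparisons, at the cost of holding the Excel rows in memory.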

Related

Skip certain Rows and Columns while Parsing XLS

I'm using the following code to parse the XLS file using ExcelDataReader. I would like to exclude the first three rows, first two columns followed by any columns that are after 9.
//create the reader
var reader = ExcelReaderFactory.CreateReader(stream);
var result = reader.AsDataSet();
//remove the first 3 rows
DataRowCollection dt = result.Tables[0].Rows;
dt.RemoveAt(0);
dt.RemoveAt(1);
dt.RemoveAt(2);
//exclude the column 1 and2 and any columns after 9
for (int columnNumber = 2; columnNumber < 8; columnNumber++)
{
foreach (DataRow dr in dt)
{
Debug.Log(dr[columnNumber].ToString());
msg += dr[columnNumber].ToString();
}
}
Unfortunately, it does not skip the rows and columns as expected. How do I skip specific columns and rows using ExcelDataReader?
You are doing the following:
dt.RemoveAt(0);
dt.RemoveAt(1);
dt.RemoveAt(2);
When the first line executes, the rows are reindexed, with row 1 becoming 0, row 2 becoming 1, and so on.
When the second line executes, it removes the row that was originally at position 2. The rows are reindexed again.
When the third line executes, it again removes the wrong row (the one originally at position 4).
As a result, when this process completes, it has removed the rows originally at positions 0, 2, and 4.
Change the code to remove the correct rows, or skip three rows with LINQ or a for loop.
Sample using a for loop (not tested):
//create the reader
var reader = ExcelReaderFactory.CreateReader(stream);
var result = reader.AsDataSet();
DataRowCollection dt = result.Tables[0].Rows;
//ignore the first 3 rows
for(int dataRowCount = 3; dataRowCount < dt.Count; dataRowCount++)
{
//exclude the column 1 and 2 and any columns after 9
for (int columnNumber = 2; columnNumber < 8; columnNumber++)
{
Debug.Log(dt[dataRowCount][columnNumber].ToString());
msg += dt[dataRowCount][columnNumber].ToString();
}
}
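The reindexing described above can be reproduced without an Excel file. This is a small sketch on an in-memory DataTable; the `Label` column and the `RemoveAtDemo` helper names are invented for the example.

```csharp
using System;
using System.Data;
using System.Linq;

static class RemoveAtDemo
{
    // Builds a table whose rows are labelled row0..row5.
    public static DataTable BuildTable()
    {
        var table = new DataTable();
        table.Columns.Add("Label", typeof(string));
        for (int i = 0; i < 6; i++)
            table.Rows.Add("row" + i);
        return table;
    }

    public static string Labels(DataTable table) =>
        string.Join(",", table.Rows.Cast<DataRow>().Select(r => (string)r["Label"]));

    // Mirrors the question: the indexes shift after every removal.
    public static string RemoveShifting()
    {
        var table = BuildTable();
        table.Rows.RemoveAt(0);
        table.Rows.RemoveAt(1);
        table.Rows.RemoveAt(2);
        return Labels(table);
    }

    // Removing index 0 three times drops the original first three rows.
    public static string RemoveFirstThree()
    {
        var table = BuildTable();
        for (int i = 0; i < 3; i++)
            table.Rows.RemoveAt(0);
        return Labels(table);
    }

    static void Main()
    {
        Console.WriteLine(RemoveShifting());   // row1,row3,row5
        Console.WriteLine(RemoveFirstThree()); // row3,row4,row5
    }
}
```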

The source contains no DataRows. error when one iteration in for loop

I am making a program in Visual Studio that reads an Excel file in a specific format, converts the data into a different format, and stores it in a database table.
Below is the part of my code where something strange happens:
//copy schema into new datatable
DataTable _longDataTable = _library.Clone();
foreach (DataRow drlibrary in _library.Rows)
{
//count number of variables in a row
string check = drlibrary["Check"].ToString();
int varCount = check.Length - check.Replace("{", "").Length;
int count_and = 0;
if (check.Contains("and") || check.Contains("or"))
{
count_and = Regex.Matches(check, "and").Count;
varCount = varCount - count_and;
}
//loop through number of counted variables in order to add rows to long datatable (one row per variable)
for (int i = 1; i <= varCount; i++)
{
var newRow = _longDataTable.NewRow();
newRow.ItemArray = drlibrary.ItemArray;
string j = i.ToString();
//fill variablename with variable number
if (i < 10)
{
newRow["VariableName"] = "Variable0" + j;
}
else
{
newRow["VariableName"] = "Variable" + j;
}
}
}
When varCount equals 1, I get the following error message when running the program after inserting an Excel file:
The source contains no DataRows.
I don't know why I can't run the for loop with just one iteration. Can anyone help?
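One thing visible in the snippet, offered as an observation rather than a confirmed diagnosis: DataTable.NewRow() creates a row with the table's schema but does not add it to the table, and the loop never calls _longDataTable.Rows.Add(newRow). If code that is not shown later calls something like CopyToDataTable() on rows taken from _longDataTable (an assumption), an empty source would produce this message. The sketch below only demonstrates the NewRow behaviour; `NewRowDemo` and `BuildLongTable` are hypothetical names.

```csharp
using System;
using System.Data;

static class NewRowDemo
{
    // Clones a one-column schema and creates three rows,
    // optionally attaching them to the cloned table.
    public static DataTable BuildLongTable(bool addRows)
    {
        var library = new DataTable();
        library.Columns.Add("VariableName", typeof(string));

        DataTable longTable = library.Clone(); // schema only, no rows
        for (int i = 1; i <= 3; i++)
        {
            DataRow newRow = longTable.NewRow(); // detached until added
            newRow["VariableName"] = "Variable" + i.ToString("00");
            if (addRows)
                longTable.Rows.Add(newRow);
        }
        return longTable;
    }

    static void Main()
    {
        Console.WriteLine(BuildLongTable(false).Rows.Count); // 0
        Console.WriteLine(BuildLongTable(true).Rows.Count);  // 3
    }
}
```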

'System.AccessViolationException' occurred

UPDATED: added full block of code where error occurs
UPDATE 2: I found a weird anomaly. The code has been breaking continuously on that line when the tabName variable equals "service line prior year". This morning, for grins, I changed the tab name to "test", so the tabName variable equals "test", and it worked more often than not. I am really at a loss.
I have researched a ton and can't find anything that addresses what is happening in my code. It happens randomly though. Sometimes it doesn't happen, then other times it happens in the same spot, but all on this part of the code (on the line templateSheet = templateBook.Sheets[tabName];):
public void ExportToExcel(DataSet dataSet, string filePath, int i, int h, Excel.Application excelApp)
{
//create the excel definitions again.
//Excel.Application excelApp = new Excel.Application();
//excelApp.Visible = true;
FileInfo excelFileInfo = new FileInfo(filePath);
Boolean fileOpenTest = IsFileOpen(excelFileInfo);
Excel.Workbook templateBook;
Excel.Worksheet templateSheet;
//check to see if the template is already open, if its not then open it,
//if it is then bind it to work with it
if (!fileOpenTest)
{ templateBook = excelApp.Workbooks.Open(filePath); }
else
{ templateBook = (Excel.Workbook)System.Runtime.InteropServices.Marshal.BindToMoniker(filePath); }
//this grabs the name of the tab to dump the data into from the "Query Dumps" Tab
string tabName = lstQueryDumpSheet.Items[i].ToString();
templateSheet = templateBook.Sheets[tabName];
excelApp.Calculation = Excel.XlCalculation.xlCalculationManual;
templateSheet = templateBook.Sheets[tabName];
// Copy DataTable
foreach (System.Data.DataTable dt in dataSet.Tables)
{
// Copy the DataTable to an object array
object[,] rawData = new object[dt.Rows.Count + 1, dt.Columns.Count];
// Copy the values to the object array
for (int col = 0; col < dt.Columns.Count; col++)
{
for (int row = 0; row < dt.Rows.Count; row++)
{ rawData[row, col] = dt.Rows[row].ItemArray[col]; }
}
// Calculate the final column letter
string finalColLetter = string.Empty;
string colCharset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int colCharsetLen = colCharset.Length;
if (dt.Columns.Count > colCharsetLen)
{ finalColLetter = colCharset.Substring((dt.Columns.Count - 1) / colCharsetLen - 1, 1); }
finalColLetter += colCharset.Substring((dt.Columns.Count - 1) % colCharsetLen, 1);
//this grabs the cell address from the "Query Dump" sheet, splits it on the '=' and
//pulls out only the cell address (i.e., "address=a3" becomes "a3")
string dumpCellString = lstQueryDumpText.Items[i].ToString();
string dumpCell = dumpCellString.Split('=').Last();
//refers to the range in which we are dumping the DataSet. The upper left hand cell is
//defined by the 'dumpCell' variable and the bottom right cell is defined by the
//final column letter and the count of rows.
string firstRef = "";
string baseRow = "";
if (char.IsLetter(dumpCell, 1))
{
char[] createCellRef = dumpCell.ToCharArray();
firstRef = createCellRef[0].ToString() + createCellRef[1].ToString();
for (int z = 2; z < createCellRef.Count(); z++)
{
baseRow = baseRow + createCellRef[z].ToString();
}
}
else
{
char[] createCellRef = dumpCell.ToCharArray();
firstRef = createCellRef[0].ToString();
for (int z = 1; z < createCellRef.Count(); z++)
{
baseRow = baseRow + createCellRef[z].ToString();
}
}
int baseRowInt = Convert.ToInt32(baseRow);
int startingCol = ColumnLetterToColumnIndex(firstRef);
int endingCol = ColumnLetterToColumnIndex(finalColLetter);
int finalCol = startingCol + endingCol;
string endCol = ColumnIndexToColumnLetter(finalCol - 1);
int endRow = (baseRowInt + (dt.Rows.Count - 1));
string cellCheck = endCol + endRow;
string excelRange;
if (dumpCell.ToUpper() == cellCheck.ToUpper())
{
excelRange = string.Format(dumpCell + ":" + dumpCell);
}
else
{
excelRange = string.Format(dumpCell + ":{0}{1}", endCol, endRow);
}
//this dumps the cells into the range on Excel as defined above
templateSheet.get_Range(excelRange, Type.Missing).Value2 = rawData;
//checks to see if all the SQL queries have been run from the "Query Dump" tab, if not, continue
//the loop, if it is the last one, then save the workbook and move on.
if (i == lstSqlAddress.Items.Count - 1)
{
excelApp.Calculation = Excel.XlCalculation.xlCalculationAutomatic;
/*Run through the value save sheet array then grab the address from the corresponding list
place in the address array. If the address reads "whole sheet" then save the whole page,
else set the addresses range and value save that.*/
//for (int y = 0; y < lstSaveSheet.Items.Count; y++)
//{
// MessageBox.Show("Save Sheet: " + lstSaveSheet.Items[y] + "\n" + "Save Address: " + lstSaveRange.Items[y]);
//}
//run the macro to hide the unused columns
excelApp.Run("ReportMakerExecute");
//save excel file as hospital name and move onto the next
SaveTemplateAs(templateBook, h);
//close the open Excel App before looping back
//Marshal.ReleaseComObject(templateSheet);
//Marshal.ReleaseComObject(templateBook);
//templateSheet = null;
//templateBook = null;
//GC.Collect();
//GC.WaitForPendingFinalizers();
}
//Close excel Applications
//excelApp.Quit();
//Marshal.ReleaseComObject(templateSheet);
//Marshal.FinalReleaseComObject(excelApp);
//excelApp = null;
//templateSheet = null;
// GC.Collect();
//GC.WaitForPendingFinalizers();
}
}
The try/catch block is of no use either. This is the error:
"An unhandled exception of type 'System.AccessViolationException' occurred in SQUiRE (Sql QUery REtriever) v1.exe. Additional information: Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
System.AccessViolationException normally happens when native (non-.NET) code tries to access unallocated memory. .NET then surfaces it in the managed world as this exception.
Your code itself does not have any unsafe block, so the access violation must be happening inside Excel.
Given that it sometimes happens and sometimes does not, I would say it can be caused by parallel Excel usage (I think the Excel COM interface is not thread-safe).
I would recommend putting all your code inside a lock block, to prevent Excel from being used in parallel. Something like this:
public void ExportToExcel(DataSet dataSet, string filePath, int i, int h, Excel.Application excelApp)
{
lock(this.GetType()) // You can lock on a different object instance to use as the mutex
{
// Your original code here
}
}
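A variation on the lock above, as a sketch: a dedicated private lock object is the more common choice than locking on a Type, because no unrelated code can take the same lock. `ExcelGate`, `RunExclusive`, and the counters are hypothetical names; `work` stands in for the body of ExportToExcel, and the lock only helps if every Excel call goes through it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ExcelGate
{
    // Dedicated lock object; private so outside code cannot lock on it.
    private static readonly object ExcelLock = new object();

    private static int _inside;  // callers currently inside the section
    public static int MaxInside; // highest concurrency ever observed

    // Runs 'work' while holding the lock, so only one caller at a time
    // can be inside the protected section.
    public static void RunExclusive(Action work)
    {
        lock (ExcelLock)
        {
            _inside++;
            MaxInside = Math.Max(MaxInside, _inside);
            try { work(); }
            finally { _inside--; }
        }
    }

    static void Main()
    {
        // Eight parallel callers; the sleep simulates a slow Excel call.
        Parallel.For(0, 8, i => RunExclusive(() => Thread.Sleep(10)));
        Console.WriteLine(MaxInside); // 1
    }
}
```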
Long story short, after three more days of testing: the cause was an Excel file that was still opening while the code tried to fill it with SQL results. The buffer was filling up and causing an exception. It happened at the same point in every run because the load time of the Excel file determined whether the call worked or failed.
So after the load I added a delaying do...while loop that checks whether the file is accessible yet, and that stopped the failures. fileOpenTest was taken from here
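do
{
// Block for two seconds; a Task.Delay that is neither awaited nor waited on returns immediately
Task.Delay(2000).Wait();
// Re-run the check so the loop condition can actually change
fileOpenTest = IsFileOpen(excelFileInfo);
}
while(!fileOpenTest);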
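The wait can also be wrapped in a small helper that re-evaluates the check on every pass and gives up after a timeout, so a file that never opens cannot hang the program. This is a sketch; `Polling.WaitUntil` and its parameters are hypothetical names, and in the code above the condition would be something like `() => IsFileOpen(excelFileInfo)`.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Polling
{
    // Re-evaluates 'condition' every 'interval' until it returns true
    // or 'timeout' has elapsed. Returns whether the condition was met.
    public static bool WaitUntil(Func<bool> condition, TimeSpan interval, TimeSpan timeout)
    {
        var clock = Stopwatch.StartNew();
        while (!condition())
        {
            if (clock.Elapsed >= timeout)
                return false;
            Thread.Sleep(interval);
        }
        return true;
    }
}

class Program
{
    static void Main()
    {
        // Stand-in condition that becomes true on the third check.
        int calls = 0;
        bool ready = Polling.WaitUntil(
            () => ++calls >= 3,
            TimeSpan.FromMilliseconds(10),
            TimeSpan.FromSeconds(2));
        Console.WriteLine(ready + " after " + calls + " checks");
    }
}
```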

unable to read a particular cell from excel using reader

I am importing an Excel sheet into a SQL Server database. The sheet has three columns:
id (number only) | data | passport
Before importing it I want to check certain things:
the passport should begin with a letter, and the rest of the characters must be digits
id must be numeric only
I am able to check the passport, but I am not able to check id even though I am using the same code I used for the passport.
using (DbDataReader dr = command.ExecuteReader())
{
// SQL Server Connection String
string sqlConnectionString = "Data Source=DITSEC3;Initial Catalog=test;Integrated Security=True";
con.Open();
DataTable dt7 = new DataTable();
dt7.Load(dr);
DataRow[] ExcelRows = new DataRow[dt7.Rows.Count];
DataColumn[] ExcelColumn = new DataColumn[dt7.Columns.Count];
//=================================================
for (int i1 = 0; i1 < dt7.Rows.Count; i1++)
{
if (dt7.Rows[i1]["passport"] == null)
{
dt7.Rows[i1]["passport"] = 0;
}
if (dt7.Rows[i1]["id"] == null)
{
dt7.Rows[i1]["id"] = 0;
}
string a = Convert.ToString(dt7.Rows[i1]["passport"]);
string b = dt7.Rows[i1]["id"].ToString();
if (!string.IsNullOrEmpty(b))
{
int idlen = b.Length;
for (int j = 0; j < idlen; j++)
{
if (Char.IsDigit(b[j]))
{
//action
}
if(!Char.IsDigit(b[j]))
{
flag = flag + 1;
int errline = i1 + 2;
Label12.Text = "Error at line: " + errline.ToString();
//Label12.Visible = true;
}
}
if (!String.IsNullOrEmpty(a))
{
int len = a.Length;
for (int j = 1; j < len; j++)
{
if (Char.IsLetter(a[0]) && Char.IsDigit(a[j]) && !Char.IsSymbol(a[j]))
{
//action
}
else
{
flag = flag + 1;
int errline = i1 + 2;
Label12.Text = "Error at line: " + errline.ToString();
//Label12.Visible = true;
}
}
}
}
For some strange reason, when I use a breakpoint I can see the values of id as long as id is numeric in Excel. The moment the flow reaches a cell whose id is 25h547, the value of b turns into "". Any reason for this? I can post the entire code if required.
What seems to be happening is that when the data is imported into the holding DataTable, the driver infers one type per column: if the first record in a column is alphanumeric, it assumes all records in the column are alphanumeric; if the first one is numeric, it assumes all records are numeric, so alphanumeric records elsewhere in the column come back blank. I solved the problem myself by modifying the connection string: "Excel 8.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text"
"IMEX=1;" tells the driver to always read "intermixed" (numbers, dates, strings, etc.) data columns as text.
Specify the IMEX mode in the connection string to handle mixed values.
See: Mixed values in Excel rows
Missing values. The Excel driver reads a certain number of rows (by
default, 8 rows) in the specified source to guess at the data type of
each column. When a column appears to contain mixed data types,
especially numeric data mixed with text data, the driver decides in
favor of the majority data type, and returns null values for cells
that contain data of the other type. (In a tie, the numeric type
wins.) Most cell formatting options in the Excel worksheet do not seem
to affect this data type determination. You can modify this behavior
of the Excel driver by specifying Import Mode. To specify Import Mode,
add IMEX=1 to the value of Extended Properties in the connection
string of the Excel connection manager in the Properties window
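Putting the pieces together, here is a sketch of an IMEX=1 connection string plus the two checks from the question, written against plain strings so every cell is validated as text. `ImportChecks` and its method names are hypothetical; the Jet provider and the `hasHeader` parameter are assumptions about the setup, since the question does not show its connection string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ImportChecks
{
    // Builds an OLE DB connection string for an .xls file with IMEX=1,
    // so mixed-type columns are read as text. Provider choice is an assumption.
    public static string BuildConnectionString(string xlsPath, bool hasHeader)
    {
        string hdr = hasHeader ? "YES" : "NO";
        return "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + xlsPath +
               ";Extended Properties=\"Excel 8.0;HDR=" + hdr + ";IMEX=1\"";
    }

    // id must consist of ASCII digits only.
    public static bool IsValidId(string id) =>
        !string.IsNullOrEmpty(id) && id.All(c => c >= '0' && c <= '9');

    // passport must be one letter followed by one or more digits.
    public static bool IsValidPassport(string passport) =>
        passport != null && Regex.IsMatch(passport, @"^[A-Za-z][0-9]+\z");

    static void Main()
    {
        Console.WriteLine(BuildConnectionString(@"C:\data\sample.xls", true));
        Console.WriteLine(IsValidId("25h547"));         // False
        Console.WriteLine(IsValidId("25547"));          // True
        Console.WriteLine(IsValidPassport("A1234567")); // True
    }
}
```

Once the driver returns "25h547" as text instead of a blank, a check like IsValidId can flag the row rather than silently skipping it.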

How to specify format for individual cells with Excel.Range.set_Value()

When I write a whole table into an excel worksheet, I know to work with a whole Range at once instead of writing to individual cells. However, is there a way to specify format as I'm populating the array I'm going to export to Excel?
Here's what I do now:
object MissingValue = System.Reflection.Missing.Value;
Excel.Application excel = new Excel.Application();
int rows = 5;
int cols = 5;
int someVal;
Excel.Worksheet sheet = (Excel.Worksheet)excel.Workbooks.Add(MissingValue).Sheets[1];
Excel.Range range = sheet.Range["A1", sheet.Cells[rows, cols]];
object[,] rangeData = new object[rows,cols];
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
someVal = r + c;
rangeData[r,c] = someVal.ToString();
}
}
range.set_Value(MissingValue, rangeData);
Now suppose that I want some of those numbers to be formatted as percentages. I know I can go back on a cell-by-cell basis and change the formatting, but that seems to defeat the whole purpose of using a single Range.set_Value() call. Can I make my rangeData[,] structure include formatting information, so that when I call set_Value(), the cells are formatted in the way I want them?
To clarify, I know I can set the format for the entire Excel.Range object. What I want is to have a different format specified for each cell, specified in the inner loop.
So here's the best "solution" I've found so far. It isn't the nirvana I was looking for, but it's much, much faster than setting the format for each cell individually.
// 0-based indexes
static string RcToA1(int row, int col)
{
string letters = "";
// Column letters are bijective base 26: A..Z, AA..AZ, BA.. (there is no zero digit)
for (int n = col + 1; n > 0; n = (n - 1) / 26)
{
letters = (char)('A' + (n - 1) % 26) + letters;
}
return letters + (row + 1).ToString();
}
static Random rand = new Random(DateTime.Now.Millisecond);
static string RandomExcelFormat()
{
switch ((int)Math.Round(rand.NextDouble(),0))
{
case 0: return "0.00%";
default: return "0.00";
}
}
struct ExcelFormatSpecifier
{
public object NumberFormat;
public string RangeAddress;
}
static void DoWork()
{
List<ExcelFormatSpecifier> NumberFormatList = new List<ExcelFormatSpecifier>(0);
object[,] rangeData = new object[rows,cols];
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
someVal = r + c;
rangeData[r,c] = someVal.ToString();
NumberFormatList.Add(new ExcelFormatSpecifier
{
NumberFormat = RandomExcelFormat(),
RangeAddress = RcToA1(r, c)
});
}
}
range.set_Value(MissingValue, rangeData);
int max_format = 50;
foreach (string formatSpecifier in NumberFormatList.Select(p => p.NumberFormat).Distinct())
{
List<string> addresses = NumberFormatList.Where(p => formatSpecifier.Equals(p.NumberFormat)).Select(p => p.RangeAddress).ToList();
while (addresses.Count > 0)
{
string addressSpecifier = string.Join(",", addresses.Take(max_format).ToArray());
range.get_Range(addressSpecifier, MissingValue).NumberFormat = formatSpecifier;
addresses = addresses.Skip(max_format).ToList();
}
}
}
Basically what is happening is that I keep a list of the format information for each cell in NumberFormatList (each element also holds the A1-style address of the range it applies to). The original idea was that for each distinct format in the worksheet, I should be able to construct an Excel.Range of just those cells and apply the format to that range in a single call. This would reduce the number of accesses to NumberFormat from (potentially) thousands down to just a few (however many different formats you have).
I ran into an issue, however, because you apparently can't construct a range from an arbitrarily long list of cells. After some testing, I found that the limit is somewhere between 50 and 100 cells that can be used to define an arbitrary range (as in range.get_Range("A1,B1,C1,A2,AA5,.....")). So once I've gotten the list of all cells to apply a format to, I have one final while() loop that applies the format to 50 of those cells at a time.
This isn't ideal, but it still reduces the number of accesses to NumberFormat by a factor of up to 50, which is significant. Constructing my spreadsheet without any format info (only using range.set_Value()) takes about 3 seconds. When I apply the formats 50 cells at a time, that is lengthened to about 10 seconds. When I apply the format info individually to each cell, the spreadsheet takes over 2 minutes to finish being constructed!
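The grouping and batching step can be separated from the Excel calls and checked on its own. In this sketch, `FormatBatcher.Batch` is a hypothetical helper: it takes (address, format) pairs and returns (format, comma-separated addresses) pairs, each covering at most `batchSize` cells, ready to be passed to something like range.get_Range(...).NumberFormat.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FormatBatcher
{
    // Groups cell addresses by number format, then joins each group into
    // comma-separated address strings of at most 'batchSize' cells.
    public static List<KeyValuePair<string, string>> Batch(
        IEnumerable<KeyValuePair<string, string>> cells, int batchSize)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (var group in cells.GroupBy(cell => cell.Value))
        {
            List<string> addresses = group.Select(cell => cell.Key).ToList();
            for (int start = 0; start < addresses.Count; start += batchSize)
            {
                var chunk = addresses.Skip(start).Take(batchSize);
                result.Add(new KeyValuePair<string, string>(
                    group.Key, string.Join(",", chunk)));
            }
        }
        return result;
    }

    static void Main()
    {
        // Five cells with alternating formats, batched two at a time.
        var cells = new List<KeyValuePair<string, string>>();
        for (int i = 1; i <= 5; i++)
            cells.Add(new KeyValuePair<string, string>("A" + i, i % 2 == 1 ? "0.00%" : "0.00"));

        foreach (var batch in Batch(cells, 2))
            Console.WriteLine(batch.Key + " => " + batch.Value);
    }
}
```

Each returned pair corresponds to one NumberFormat assignment, so the number of COM calls is the number of batches rather than the number of cells.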
You can apply formatting to the range and then populate it with values. You cannot specify formatting in your object[,] array.
You can apply the formatting to each individual cell within the inner loop via:
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
Excel.Range r2 = (Excel.Range)sheet.Cells[r + 1, c + 1]; // Excel cell indexes are 1-based
r2.xxxx = "";
}
}
Once you have r2, you can change the cell format any way you want.
