I am using the following code to take some data (in an XML-like format, not well formed) from a .txt file and write it to an .xlsx file using EPPlus after some processing. StreamElements is basically a modified XmlReader. My question is about performance: I have made a couple of changes but don't see what else I can do. I'm going to use this for large datasets, so I want to make it as efficient and fast as possible. Any help will be appreciated!
I tried using p.SaveAs() to do the Excel writing but did not really see a performance difference. Are there better, faster ways to do the writing? Any suggestions are welcome.
using (ExcelPackage p = new ExcelPackage())
{
ExcelWorksheet ws = p.Workbook.Worksheets[1];
ws.Name = "data1";
int rowIndex = 1; int colIndex = 1;
foreach (var element in StreamElements(pa, "XML"))
{
var values = element.DescendantNodes().OfType<XText>()
.Select(v => Regex.Replace(v.Value, "\\s+", " "));
string[] data = string.Join(",", values).Split(',');
data[2] = toDateTime(data[2]);
for (int i = 0; i < data.Count(); i++)
{
if (rowIndex < 1000000)
{
var cell1 = ws.Cells[rowIndex, colIndex];
cell1.Value = data[i];
colIndex++;
}
}
rowIndex++;
}
}
ws.Cells[ws.Dimension.Address].AutoFitColumns();
Byte[] bin = p.GetAsByteArray();
using (FileStream fs = File.OpenWrite("C:\\test.xlsx"))
{
fs.Write(bin, 0, bin.Length);
}
}
}
Currently, for it to do the processing and then write 1 Million lines into an Excel worksheet, it takes about ~30-35 Minutes.
I've run into this issue before: Excel has a huge overhead when you modify worksheet cells individually, one by one.
The solution is to build a 2D object array and populate the worksheet with it in a single range write.
using(ExcelPackage p = new ExcelPackage()) {
ExcelWorksheet ws = p.Workbook.Worksheets[1];
ws.Name = "data1";
//Starting cell
int startRow = 1;
int startCol = 1;
//Needed for 2D object array later on
int maxColCount = 0;
int maxRowCount = 0;
//Queue data
Queue<string[]> dataQueue = new Queue<string[]>();
//Tried not to touch this part
foreach(var element in StreamElements(pa, "XML")) {
var values = element.DescendantNodes().OfType<XText>()
.Select(v => Regex.Replace(v.Value, "\\s+", " "));
//Removed unnecessary split and join, use ToArray instead
string[] eData = values.ToArray();
eData[2] = toDateTime(eData[2]);
//Push the data to queue and increment counters (if needed)
dataQueue.Enqueue(eData);
if(eData.Length > maxColCount)
maxColCount = eData.Length;
maxRowCount++;
}
//We now have the dimensions needed for our object array
object[,] excelArr = new object[maxRowCount, maxColCount];
//Dequeue data from Queue and populate object matrix
int i = 0;
while(dataQueue.Count > 0){
string[] eData = dataQueue.Dequeue();
for(int j = 0; j < eData.Length; j++){
excelArr[i, j] = eData[j];
}
i++;
}
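//Write data to range
//NOTE: Excel.Range and Value2 come from Microsoft.Office.Interop.Excel, not EPPlus,
//so "worksheet" must be an interop Excel.Worksheet here (not the EPPlus "ws" above)
Excel.Range c1 = (Excel.Range)worksheet.Cells[startRow, startCol];
Excel.Range c2 = (Excel.Range)worksheet.Cells[startRow + maxRowCount - 1, startCol + maxColCount - 1];
Excel.Range range = worksheet.Range[c1, c2];
range.Value2 = excelArr;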
//Tried not to touch this stuff
ws.Cells[ws.Dimension.Address].AutoFitColumns();
Byte[] bin = p.GetAsByteArray();
using(FileStream fs = File.OpenWrite("C:\\test.xlsx")) {
fs.Write(bin, 0, bin.Length);
}
}
I didn't try compiling this code, so double-check the indexing and watch for small syntax errors.
A few extra pointers to consider for performance:
Try to parallelize the population of the object array, since it is primarily index based (for example, keep the rows in a Dictionary<int, string[]> keyed by row index so that each row can be looked up and copied independently). You would likely be trading space for time.
See if you can hardcode the column and row counts, or determine them cheaply up front. In my fix I count the maximum rows and columns on the fly; I wouldn't recommend that as a permanent solution.
AutoFitColumns is very costly, especially when you're dealing with over a million rows.
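The first pointer can be sketched roughly as follows. This is only an illustration, assuming the parsed rows have already been collected into a list; MatrixBuilder and ToMatrix are hypothetical names, not part of EPPlus or the interop library. Copying string references is cheap, so it is worth measuring whether the parallel loop actually beats a plain one on your data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class MatrixBuilder
{
    // Copies jagged rows into a rectangular object[,]; short rows leave trailing cells null.
    public static object[,] ToMatrix(IReadOnlyList<string[]> rows)
    {
        int rowCount = rows.Count;
        int colCount = rowCount == 0 ? 0 : rows.Max(r => r.Length);
        var matrix = new object[rowCount, colCount];

        // Each iteration writes only to its own row of the matrix, so no locking is needed.
        Parallel.For(0, rowCount, i =>
        {
            string[] row = rows[i];
            for (int j = 0; j < row.Length; j++)
            {
                matrix[i, j] = row[j];
            }
        });
        return matrix;
    }
}
```

The resulting array can then be written to the worksheet in one range assignment, as in the code above.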
Currently I have a table with 6 rows and 14 columns.
I'm trying to pass that table to my excel document and I have no problem doing that.
My problem is that I can't format it the way I want.
Ideally I want to have 3 rows, a blank row, and then 3 rows again, but I can't get that to work.
This is the function I'm currently using to write the SQL table. Basically it writes all the rows to Excel consecutively.
Instead, I want a blank row between row 3 and row 4.
If someone could help, I'd be very thankful.
private int Export_putDataGeneric(Excel.Worksheet sh, DataTable ds, String D_ReferenceDate, int starting_row = 5, int[] column_mapping = null, bool isNumber = true)
{
int curr_row = 0;
if (column_mapping == null)
{
column_mapping = new int[ds.Columns.Count];
int start_char = 2;
for (int c = 0; c < ds.Columns.Count; c++)
{
column_mapping[c] = start_char;
start_char++;
}
}
var data = new Object[ds.Rows.Count, column_mapping[ds.Columns.Count - 1] - column_mapping[0] + 1];
foreach (DataRow row in ds.Rows)
{
for (int c = 0; c < ds.Columns.Count; c++)
{
data[curr_row, column_mapping[c] - column_mapping[0]] = row[c];
}
curr_row++;
}
int end_row = starting_row + ds.Rows.Count - 1;
Excel.Range beginWrite = sh.Cells[starting_row, column_mapping[0]] as Excel.Range;
Excel.Range endWrite = sh.Cells[end_row, column_mapping[ds.Columns.Count - 1]] as Excel.Range;
Excel.Range sheetData = sh.Range[beginWrite, endWrite];
sheetData.Value2 = data;
if (isNumber) sheetData.NumberFormat = "#,##0.00";
Marshal.ReleaseComObject(beginWrite);
Marshal.ReleaseComObject(endWrite);
Marshal.ReleaseComObject(sheetData);
beginWrite = null;
endWrite = null;
sheetData = null;
return end_row;
}
You can try using Range.Offset.
Check out the Microsoft Documentation
This question on SO might also help.
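As an alternative to Range.Offset (a different technique from the one suggested above), the blank row can be built into the array before it is written, so the single Value2 assignment stays intact. The sketch below is hedged: RowSpacer and InsertBlankRows are hypothetical names, and it assumes a null entry is acceptable as an empty cell.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class RowSpacer
{
    // Returns a copy of 'data' with one empty (null) row inserted after every
    // 'groupSize' rows, except after the final group.
    public static object[,] InsertBlankRows(object[,] data, int groupSize)
    {
        if (groupSize <= 0) throw new ArgumentOutOfRangeException(nameof(groupSize));

        int rows = data.GetLength(0);
        int cols = data.GetLength(1);
        int gaps = rows == 0 ? 0 : (rows - 1) / groupSize;
        var result = new object[rows + gaps, cols];

        for (int r = 0; r < rows; r++)
        {
            int target = r + r / groupSize; // shift down by one row per completed group
            for (int c = 0; c < cols; c++)
            {
                result[target, c] = data[r, c];
            }
        }
        return result;
    }
}
```

With 6 rows and a group size of 3 this yields 7 rows, the fourth of which is empty. If it were used inside Export_putDataGeneric, end_row would need to be computed from the new array's row count rather than from ds.Rows.Count.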
I am making a program in Visual Studio where you can read in an excel file in a specific format and where my program converts the data from the excel file in a different format and stores it in a database table.
Below you can find a part of my code where something strange happens
//copy schema into new datatable
DataTable _longDataTable = _library.Clone();
foreach (DataRow drlibrary in _library.Rows)
{
//count number of variables in a row
string check = drlibrary["Check"].ToString();
int varCount = check.Length - check.Replace("{", "").Length;
int count_and = 0;
if (check.Contains("and") || check.Contains("or"))
{
count_and = Regex.Matches(check, "and").Count;
varCount = varCount - count_and;
}
//loop through number of counted variables in order to add rows to long datatable (one row per variable)
for (int i = 1; i <= varCount; i++)
{
var newRow = _longDataTable.NewRow();
newRow.ItemArray = drlibrary.ItemArray;
string j = i.ToString();
//fill variablename with variable number
if (i < 10)
{
newRow["VariableName"] = "Variable0" + j;
}
else
{
newRow["VariableName"] = "Variable" + j;
}
}
}
When varCount equals 1, I get the following error message when running the program after inserting an excel file
The source contains no DataRows.
I don't know why I can't run the for loop with just one iteration. Anyone who can help me?
I have one large data table with several million records. I need to export it into multiple CSV files of a specific size. For example, if I choose a file size of 5 MB and export, the DataTable should be written to, say, 4 CSV files of 5 MB each plus a last file whose size varies with the remaining records. I went through many solutions here and also had a look at the CsvHelper library, but they all deal with splitting a large file into multiple CSVs, not with splitting an in-memory data table into multiple CSV files based on a specified file size. I want to do this in C#. Any help in this direction would be great.
Thanks
Jay
Thanks @H.G.Sandhagen and @jdweng for the inputs. I have written the following code, which does the work needed. I know it is not perfect: it could be made more efficient if we pre-determined the length of each data table item array, as pointed out by Nick.McDermaid. For now, I will go with this code to unblock myself and will post the final optimized version when I have it coded.
public void WriteToCsv(DataTable table, string path, int size)
{
int fileNumber = 0;
StreamWriter sw = new StreamWriter(string.Format(path, fileNumber), false);
//headers
for (int i = 0; i < table.Columns.Count; i++)
{
sw.Write(table.Columns[i]);
if (i < table.Columns.Count - 1)
{
sw.Write(",");
}
}
sw.Write(sw.NewLine);
foreach (DataRow row in table.AsEnumerable())
{
sw.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString())));
if (sw.BaseStream.Length > size) // Time to create new file!
{
sw.Close();
sw.Dispose();
fileNumber ++;
sw = new StreamWriter(string.Format(path, fileNumber), false);
}
}
sw.Close();
}
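The "pre-determine the length" idea mentioned above could look roughly like this. It is a sketch under stated assumptions, not the poster's final version: CsvChunker and Split are hypothetical names, the rows are assumed to be already joined into CSV lines, and the output is assumed to be UTF-8 with Environment.NewLine line endings. It also repeats the header in every chunk, which the code above does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class CsvChunker
{
    // Groups CSV lines into chunks whose UTF-8 size (header + rows + newlines)
    // stays within maxBytes. A single row larger than the budget still gets
    // a chunk of its own.
    public static List<List<string>> Split(string header, IEnumerable<string> rows, long maxBytes)
    {
        int newlineBytes = Encoding.UTF8.GetByteCount(Environment.NewLine);
        long headerBytes = Encoding.UTF8.GetByteCount(header) + newlineBytes;

        var chunks = new List<List<string>>();
        var current = new List<string> { header };
        long currentBytes = headerBytes;

        foreach (string row in rows)
        {
            long rowBytes = Encoding.UTF8.GetByteCount(row) + newlineBytes;
            bool hasRows = current.Count > 1;
            if (hasRows && currentBytes + rowBytes > maxBytes)
            {
                // Budget exceeded: close this chunk and start a new one with the header.
                chunks.Add(current);
                current = new List<string> { header };
                currentBytes = headerBytes;
            }
            current.Add(row);
            currentBytes += rowBytes;
        }

        if (current.Count > 1) chunks.Add(current);
        return chunks;
    }
}
```

Each chunk could then be written with File.WriteAllLines(string.Format(path, n), chunk). Because sizes are computed before writing, no stream length has to be polled while the file is open.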
I had a similar problem and this is how I solved it with CsvHelper.
Answer could be easily adapted to use DataTable as source.
public void SplitCsvTest()
{
var inventoryRecords = new List<InventoryCsvItem>();
for (int i = 0; i < 100000; i++)
{
inventoryRecords.Add(new InventoryCsvItem { ListPrice = i + 1, Quantity = i + 1 });
}
const decimal MAX_BYTES = 5 * 1024 * 1024; // 5 MB
List<byte[]> parts = new List<byte[]>();
using (var memoryStream = new MemoryStream())
{
using (var streamWriter = new StreamWriter(memoryStream))
using (var csvWriter = new CsvWriter(streamWriter))
{
csvWriter.WriteHeader<InventoryCsvItem>();
csvWriter.NextRecord();
csvWriter.Flush();
streamWriter.Flush();
var headerSize = memoryStream.Length;
foreach (var record in inventoryRecords)
{
csvWriter.WriteRecord(record);
csvWriter.NextRecord();
csvWriter.Flush();
streamWriter.Flush();
if (memoryStream.Length > (MAX_BYTES - headerSize))
{
parts.Add(memoryStream.ToArray());
memoryStream.SetLength(0);
memoryStream.Position = 0;
csvWriter.WriteHeader<InventoryCsvItem>();
csvWriter.NextRecord();
}
}
if (memoryStream.Length > headerSize)
{
parts.Add(memoryStream.ToArray());
}
}
}
for(int i = 0; i < parts.Count; i++)
{
var part = parts[i];
File.WriteAllBytes($"C:/Temp/Part {i + 1} of {parts.Count}.csv", part);
}
}
UPDATED: added full block of code where error occurs
UPDATE 2: I found a weird anomaly. The code has been breaking consistently on that line when the tabName variable equals "service line prior year". This morning, for grins, I changed the tab name to "test", so the tabName variable equals "test", and it worked more often than not. I am really at a loss.
I have researched a ton and can't find anything that addresses what is happening in my code. It happens randomly though. Sometimes it doesn't happen, then other times it happens in the same spot, but all on this part of the code (on the line templateSheet = templateBook.Sheets[tabName];):
public void ExportToExcel(DataSet dataSet, string filePath, int i, int h, Excel.Application excelApp)
{
//create the excel definitions again.
//Excel.Application excelApp = new Excel.Application();
//excelApp.Visible = true;
FileInfo excelFileInfo = new FileInfo(filePath);
Boolean fileOpenTest = IsFileOpen(excelFileInfo);
Excel.Workbook templateBook;
Excel.Worksheet templateSheet;
//check to see if the template is already open, if its not then open it,
//if it is then bind it to work with it
if (!fileOpenTest)
{ templateBook = excelApp.Workbooks.Open(filePath); }
else
{ templateBook = (Excel.Workbook)System.Runtime.InteropServices.Marshal.BindToMoniker(filePath); }
//this grabs the name of the tab to dump the data into from the "Query Dumps" Tab
string tabName = lstQueryDumpSheet.Items[i].ToString();
templateSheet = templateBook.Sheets[tabName];
excelApp.Calculation = Excel.XlCalculation.xlCalculationManual;
templateSheet = templateBook.Sheets[tabName];
// Copy DataTable
foreach (System.Data.DataTable dt in dataSet.Tables)
{
// Copy the DataTable to an object array
object[,] rawData = new object[dt.Rows.Count + 1, dt.Columns.Count];
// Copy the values to the object array
for (int col = 0; col < dt.Columns.Count; col++)
{
for (int row = 0; row < dt.Rows.Count; row++)
{ rawData[row, col] = dt.Rows[row].ItemArray[col]; }
}
// Calculate the final column letter
string finalColLetter = string.Empty;
string colCharset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int colCharsetLen = colCharset.Length;
if (dt.Columns.Count > colCharsetLen)
{ finalColLetter = colCharset.Substring((dt.Columns.Count - 1) / colCharsetLen - 1, 1); }
finalColLetter += colCharset.Substring((dt.Columns.Count - 1) % colCharsetLen, 1);
//this grabs the cell address from the "Query Dump" sheet, splits it on the '=' and
//pulls out only the cell address (i.e., "address=a3" becomes "a3")
string dumpCellString = lstQueryDumpText.Items[i].ToString();
string dumpCell = dumpCellString.Split('=').Last();
//refers to the range in which we are dumping the DataSet. The upper left-hand cell is
//defined by the 'dumpCell' variable and the bottom right cell is defined by the
//final column letter and the count of rows.
string firstRef = "";
string baseRow = "";
if (char.IsLetter(dumpCell, 1))
{
char[] createCellRef = dumpCell.ToCharArray();
firstRef = createCellRef[0].ToString() + createCellRef[1].ToString();
for (int z = 2; z < createCellRef.Count(); z++)
{
baseRow = baseRow + createCellRef[z].ToString();
}
}
else
{
char[] createCellRef = dumpCell.ToCharArray();
firstRef = createCellRef[0].ToString();
for (int z = 1; z < createCellRef.Count(); z++)
{
baseRow = baseRow + createCellRef[z].ToString();
}
}
int baseRowInt = Convert.ToInt32(baseRow);
int startingCol = ColumnLetterToColumnIndex(firstRef);
int endingCol = ColumnLetterToColumnIndex(finalColLetter);
int finalCol = startingCol + endingCol;
string endCol = ColumnIndexToColumnLetter(finalCol - 1);
int endRow = (baseRowInt + (dt.Rows.Count - 1));
string cellCheck = endCol + endRow;
string excelRange;
if (dumpCell.ToUpper() == cellCheck.ToUpper())
{
excelRange = string.Format(dumpCell + ":" + dumpCell);
}
else
{
excelRange = string.Format(dumpCell + ":{0}{1}", endCol, endRow);
}
//this dumps the cells into the range on Excel as defined above
templateSheet.get_Range(excelRange, Type.Missing).Value2 = rawData;
//checks to see if all the SQL queries have been run from the "Query Dump" tab, if not, continue
//the loop, if it is the last one, then save the workbook and move on.
if (i == lstSqlAddress.Items.Count - 1)
{
excelApp.Calculation = Excel.XlCalculation.xlCalculationAutomatic;
/*Run through the value save sheet array then grab the address from the corresponding list
place in the address array. If the address reads "whole sheet" then save the whole page,
else set the addresses range and value save that.*/
//for (int y = 0; y < lstSaveSheet.Items.Count; y++)
//{
// MessageBox.Show("Save Sheet: " + lstSaveSheet.Items[y] + "\n" + "Save Address: " + lstSaveRange.Items[y]);
//}
//run the macro to hide the unused columns
excelApp.Run("ReportMakerExecute");
//save excel file as hospital name and move onto the next
SaveTemplateAs(templateBook, h);
//close the open Excel App before looping back
//Marshal.ReleaseComObject(templateSheet);
//Marshal.ReleaseComObject(templateBook);
//templateSheet = null;
//templateBook = null;
//GC.Collect();
//GC.WaitForPendingFinalizers();
}
//Close excel Applications
//excelApp.Quit();
//Marshal.ReleaseComObject(templateSheet);
//Marshal.FinalReleaseComObject(excelApp);
//excelApp = null;
//templateSheet = null;
// GC.Collect();
//GC.WaitForPendingFinalizers();
}
}
The try/catch block is of no use either. This is the error:
"An unhandled exception of type 'System.AccessViolationException' occurred in SQUiRE (Sql QUery REtriever) v1.exe. Additional information: Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
System.AccessViolationException normally happens when native (non-.NET) code tries to access unallocated memory. .NET then surfaces it in the managed world as this exception.
Your code itself does not have any unsafe block, so the access violation must be happening inside Excel.
Given that it sometimes happens and sometimes does not, I would say it can be caused by parallel Excel usage (I think the Excel COM is not thread-safe).
I would recommend putting all your code inside a lock block, to prevent Excel from being used in parallel. Something like this:
public void ExportToExcel(DataSet dataSet, string filePath, int i, int h, Excel.Application excelApp)
{
lock(this.GetType()) // You can change this to another instance to be used as a mutex
{
// Your original code here
}
}
Long story short, after three more days of testing: the cause was the Excel file that the program was trying to open and fill with SQL results. The buffer was filling up and causing an exception. It happened at the same point in every run because the load time of the Excel file determined whether the run worked or failed.
So after the load I added a delaying do...while loop that checks whether the file is accessible yet, and that stopped the failures. fileOpenTest was taken from here
do
{
// Task.Delay on its own returns immediately; block on it so the loop really pauses
Task.Delay(2000).Wait();
// re-run the check on every pass, otherwise the flag never changes
fileOpenTest = IsFileOpen(excelFileInfo);
}
while(!fileOpenTest);
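A wait loop like this can hang forever if the file never becomes available, so one option is to wrap the polling in a helper with a timeout. The following is a minimal sketch, not the poster's code: Polling and WaitUntil are hypothetical names, and the condition passed in would be whatever check (such as the IsFileOpen call) applies in your program.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class Polling
{
    // Polls 'condition' until it returns true or 'timeout' elapses.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool WaitUntil(Func<bool> condition, TimeSpan pollInterval, TimeSpan timeout)
    {
        var clock = Stopwatch.StartNew();
        while (!condition())
        {
            if (clock.Elapsed >= timeout) return false;
            Thread.Sleep(pollInterval);
        }
        return true;
    }
}
```

A call might look like Polling.WaitUntil(() => IsFileOpen(excelFileInfo), TimeSpan.FromSeconds(2), TimeSpan.FromMinutes(1)), with the caller deciding what to do when it returns false.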
When I write a whole table into an excel worksheet, I know to work with a whole Range at once instead of writing to individual cells. However, is there a way to specify format as I'm populating the array I'm going to export to Excel?
Here's what I do now:
object MissingValue = System.Reflection.Missing.Value;
Excel.Application excel = new Excel.Application();
int rows = 5;
int cols = 5;
int someVal;
Excel.Worksheet sheet = (Excel.Worksheet)excel.Workbooks.Add(MissingValue).Sheets[1];
Excel.Range range = sheet.get_Range("A1", sheet.Cells[rows, cols]);
object[,] rangeData = new object[rows,cols];
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
someVal = r + c;
rangeData[r,c] = someVal.ToString();
}
}
range.set_Value(MissingValue, rangeData);
Now suppose that I want some of those numbers to be formatted as percentages. I know I can go back on a cell-by-cell basis and change the formatting, but that seems to defeat the whole purpose of using a single Range.set_Value() call. Can I make my rangeData[,] structure include formatting information, so that when I call set_Value(), the cells are formatted in the way I want them?
To clarify, I know I can set the format for the entire Excel.Range object. What I want is to have a different format specified for each cell, specified in the inner loop.
So here's the best "solution" I've found so far. It isn't the nirvana I was looking for, but it's much, much faster than setting the format for each cell individually.
// 0-based indexes
static string RcToA1(int row, int col)
{
string letters = "";
// Work with a 1-based column number; Excel letters are a bijective base-26 system
int n = col + 1;
while (n > 0)
{
letters = (char)('A' + (n - 1) % 26) + letters;
n = (n - 1) / 26;
}
return letters + (row + 1).ToString();
}
static Random rand = new Random(DateTime.Now.Millisecond);
static string RandomExcelFormat()
{
switch ((int)Math.Round(rand.NextDouble(),0))
{
case 0: return "0.00%";
default: return "0.00";
}
}
struct ExcelFormatSpecifier
{
public object NumberFormat;
public string RangeAddress;
}
static void DoWork()
{
List<ExcelFormatSpecifier> NumberFormatList = new List<ExcelFormatSpecifier>(0);
object[,] rangeData = new object[rows,cols];
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
someVal = r + c;
rangeData[r,c] = someVal.ToString();
NumberFormatList.Add(new ExcelFormatSpecifier
{
NumberFormat = RandomExcelFormat(),
RangeAddress = RcToA1(r, c)
});
}
}
range.set_Value(MissingValue, rangeData);
int max_format = 50;
foreach (string formatSpecifier in NumberFormatList.Select(p => p.NumberFormat).Distinct())
{
List<string> addresses = NumberFormatList.Where(p => formatSpecifier.Equals(p.NumberFormat)).Select(p => p.RangeAddress).ToList();
while (addresses.Count > 0)
{
string addressSpecifier = string.Join(",", addresses.Take(max_format).ToArray());
range.get_Range(addressSpecifier, MissingValue).NumberFormat = formatSpecifier;
addresses = addresses.Skip(max_format).ToList();
}
}
}
Basically what is happening is that I keep a list of the format information for each cell in NumberFormatList (each element also holds the A1-style address of the range it applies to). The original idea was that for each distinct format in the worksheet, I should be able to construct an Excel.Range of just those cells and apply the format to that range in a single call. This would reduce the number of accesses to NumberFormat from (potentially) thousands down to just a few (however many different formats you have).
I ran into an issue, however, because you apparently can't construct a range from an arbitrarily long list of cells. After some testing, I found that the limit is somewhere between 50 and 100 cells that can be used to define an arbitrary range (as in range.get_Range("A1,B1,C1,A2,AA5,.....")). So once I've gotten the list of all cells to apply a format to, I have one final while() loop that applies the format to 50 of those cells at a time.
This isn't ideal, but it still reduces the number of accesses to NumberFormat by a factor of up to 50, which is significant. Constructing my spreadsheet without any format info (only using range.set_Value()) takes about 3 seconds. When I apply the formats 50 cells at a time, that is lengthened to about 10 seconds. When I apply the format info individually to each cell, the spreadsheet takes over 2 minutes to finish being constructed!
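One way to stretch that 50-address limit further, which is not part of the approach above, is to merge adjacent same-format cells in a row into a single block such as A1:C1, so that one address covers many cells. A minimal sketch, assuming the per-cell formats are held in a string[,]; FormatRuns, ColumnLetters and RowRuns are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class FormatRuns
{
    // Converts a 0-based column index to Excel letters (0 -> A, 25 -> Z, 26 -> AA).
    public static string ColumnLetters(int col)
    {
        string letters = "";
        for (int n = col + 1; n > 0; n = (n - 1) / 26)
        {
            letters = (char)('A' + (n - 1) % 26) + letters;
        }
        return letters;
    }

    // Scans each row of per-cell formats and emits one (address, format) pair
    // per horizontal run of equal formats.
    public static List<KeyValuePair<string, string>> RowRuns(string[,] formats)
    {
        var runs = new List<KeyValuePair<string, string>>();
        int rows = formats.GetLength(0), cols = formats.GetLength(1);
        for (int r = 0; r < rows; r++)
        {
            int start = 0;
            for (int c = 1; c <= cols; c++)
            {
                // A run ends at the end of the row or where the format changes.
                if (c == cols || formats[r, c] != formats[r, start])
                {
                    string first = ColumnLetters(start) + (r + 1);
                    string last = ColumnLetters(c - 1) + (r + 1);
                    string address = start == c - 1 ? first : first + ":" + last;
                    runs.Add(new KeyValuePair<string, string>(address, formats[r, start]));
                    start = c;
                }
            }
        }
        return runs;
    }
}
```

The resulting addresses can still be grouped by format and joined 50 at a time as described above. How much this helps depends on how often neighbouring cells share a format.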
You can apply formatting to the range and then populate it with values; you cannot specify formatting in your object[,] array.
You apply the formatting to each individual cell within the inner loop via
for(int r = 0; r < rows; r++)
{
for(int c = 0; c < cols; c++)
{
// Excel cell indexes are 1-based, hence the + 1
Excel.Range r2 = (Excel.Range)sheet.Cells[r + 1, c + 1];
r2.xxxx = ""; // xxxx is a placeholder for the property to set, e.g. NumberFormat
}
}
Once you have r2, you can change the cell format any way you want.