I am using SpreadsheetLight to write log files from a WinForms project. I want to write log entries to three worksheets in the same file, and I would really like to avoid using Interop.
I start with a template file made in Excel which has the three worksheets pre-populated with row titles. Since each worksheet has the same basic properties (which can vary independently), I encapsulate each sheet in a class, the basics of which look like this:
/// <summary>
/// Encapsulate the info we need to know about each worksheet in order to populate it properly
/// </summary>
public class LogSheet
{
    public SLDocument data;
    public SLWorksheetStatistics stats;
    public int RowCount;
    public int ColumnCount;
    public int currentColumn; // indicates which column you want to be writing to
    public List<string> rowNames = new List<string>();    // used to make sure you're writing new data to the right row
    public List<string> columnNames = new List<string>(); // used by GetLatestRun() to check if data already exists for a given serial number

    public LogSheet(string sheet)
    {
        // _path is a field holding the template file path, defined elsewhere in the class
        this.data = new SLDocument(_path, sheet);
        this.stats = this.data.GetWorksheetStatistics();
        this.RowCount = this.stats.EndRowIndex;
        this.ColumnCount = this.stats.EndColumnIndex;
        currentColumn = GetLatestRun();
        for (int i = 1; i < RowCount + 1; i++)
        {
            this.rowNames.Add(this.data.GetCellValueAsString(i, 1));
        }
        for (int i = 1; i < ColumnCount + 1; i++)
        {
            this.columnNames.Add(this.data.GetCellValueAsString(1, i));
        }
    }
}
There are also some methods not shown in the LogSheet class that handle writing data to the right places.
This all seems to work fine, and when debugging, I can see that each of the three worksheets instantiated with new LogSheet(<sheetName>) contain the data they are supposed to after I've written things to them.
The problem is that when I want to save the data, I can call this.data.Save(), but it only saves one worksheet, and the other two are left in limbo because the Save() method is terminal and closes the Excel file. Calling Save() on either of the other two sheets ends in an exception, "Object reference not set to an instance of an object", because, of course, Save() killed my spreadsheet, and those sheets no longer have anything to reference. The resulting file only has data from the first save.
My best guess for how to get around this is to not instantiate a new SLDocument for each sheet and instead use SLDocument.SelectWorksheet() each time I want to write to a specific worksheet, but I still want to keep things encapsulated in the LogSheet class because everything else in there is still relevant.
Any other suggestions?
The recommended and efficient way is to store all the logs to be written in memory first (with a List<> or something). Then, when writing, you select the first worksheet and write everything from the first List<>, select the second worksheet and write everything from the second List<>, then select the third worksheet and write everything from the third List<>.
If memory is an issue, then select the first worksheet and write a log chunk into a cell value, select the second worksheet and write a log chunk into a cell value (it lands in the second worksheet because that is the sheet currently selected), then select the third worksheet and write a log chunk. Then iterate over every log chunk like that.
The latter method takes less memory at any one time but costs more CPU cycles, because you keep going back and forth between the worksheets. The going back and forth is equivalent to loading up one worksheet, unloading it, then loading another worksheet, and so on.
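Building on that and on the asker's own guess, here is a minimal sketch (not the original code; the Write helper and the sheet names are illustrative) of sharing a single SLDocument across all three LogSheet instances, selecting the right worksheet before each write, and saving once at the end:

using System.Collections.Generic;
using SpreadsheetLight;

public class LogSheet
{
    private readonly SLDocument doc; // shared document; LogSheet no longer owns it
    private readonly string sheetName;

    public int RowCount;
    public int ColumnCount;

    public LogSheet(SLDocument sharedDoc, string sheet)
    {
        doc = sharedDoc;
        sheetName = sheet;
        doc.SelectWorksheet(sheetName); // make this sheet active before reading stats
        SLWorksheetStatistics stats = doc.GetWorksheetStatistics();
        RowCount = stats.EndRowIndex;
        ColumnCount = stats.EndColumnIndex;
    }

    // Illustrative write helper: re-select the sheet so the value lands in the right place.
    public void Write(int row, int column, string value)
    {
        doc.SelectWorksheet(sheetName);
        doc.SetCellValue(row, column, value);
    }
}

// Usage: one document, three LogSheets, one save at the very end.
// var doc = new SLDocument(_path);
// var sheets = new List<LogSheet> { new LogSheet(doc, "Run1"), new LogSheet(doc, "Run2"), new LogSheet(doc, "Run3") };
// sheets[0].Write(2, 3, "some log entry");
// doc.SaveAs(_path); // one Save()/SaveAs(), instead of one per sheet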
The saga of trying to chop flat files up into usable bits continues!
You may see from my other questions that I am trying to wrangle some flat file data into various bits using a C# transformer in SSIS. The current challenge is trying to turn a selection of rows with one column into one row with many columns.
A friend has very kindly tipped me off to use a List and then somehow loop through it in PostExecute().
The main problem is that I do not know how to loop through it and create a row to add to the output buffer programmatically; there might be a variable number of fields listed in the flat file, with no consistency. For now, I have allowed for 100 outputs and called these pos1, pos2, etc.
What I would really like to do is count everything in my list, and loop through that many times, incrementing the numbers accordingly - i.e. fieldlist[0] goes to OutputBuffer.pos1, fieldlist[1] goes to OutputBuffer.pos2, and if there is nothing after this then nothing is put in pos3 to pos100.
The secondary problem is that I can't even get my list written to an output table by using OutputBuffer directly in PostExecute, never mind working out a loop.
The file has all sorts in it, but the list of fields is handily contained between START-OF-FIELDS and END-OF-FIELDS, so I have used the same logic as before to only process the rows in the middle of those.
bool passedSOF;
bool passedEOF;
List<string> fieldlist = new List<string>();

public override void PostExecute()
{
    base.PostExecute();
    OutputBuffer.AddRow();
    OutputBuffer.field1 = fieldlist[0];
    OutputBuffer.field2 = fieldlist[1];
}

public override void Input_ProcessInputRow(InputBuffer Row)
{
    if (Row.RawData.Contains("END-OF-FIELDS"))
    {
        passedEOF = true;
        OutputBuffer.SetEndOfRowset();
    }
    if (passedSOF && !passedEOF)
    {
        fieldlist.Add(Row.RawData);
    }
    if (Row.RawData.Contains("START-OF-FIELDS"))
    {
        passedSOF = true;
    }
}
I have nothing underlined in red, but when I try to run this I get an error message about PostExecute() and "object reference not set to an instance of an object", which I thought meant something contained a null where it shouldn't; but in my test file I have more than two fields between the START and END markers.
So first of all, what am I doing wrong in the example above, and secondly, how do I do this in a proper loop? There are only 100 possible outputs right now, but this could increase over time.
"Post execute" It's named that for a reason.
The execution of your data flow has ended and this method is for cleanup or anything that needs to happen after execution - like modification of SSIS variables. The buffers have gone away, there's no way to do interact with the contents of the buffers at this point.
As for the rest of your problem statement... it needs focus
So once again I have misunderstood a basic concept - PostExecute cannot be used to write out in the way I was trying. As people have pointed out, there is no way to do anything with the buffer contents here.
I cannot take credit for this answer, as again someone smarter than me came to the rescue, but I have got permission from them to post the code in case it is useful to anyone. I hope I have explained this OK, as I only just understand it myself and am very much learning as I go along.
First of all, make sure you have the following using directives in your script:
using System.Reflection;
using System.Linq;
using System.Collections.Generic;
These are going to be used to get the properties of the output buffer and to allow me to output the first item in the list to pos1, the second to pos2, etc.
As usual I have two boolean variables to determine whether I have passed the rows which mark where the data I want starts and ends, and I have my List.
bool passedSOF;
bool passedEOF;
List<string> fieldList = new List<string>();
Here is where it differs: the row containing END-OF-FIELDS indicates I am done processing my rows, so when I hit that point I should write out my collected List to my output buffer. The aim is to take all of the multiple rows containing field names and turn them into a single row with multiple columns, with the field names populated across those columns in the order the rows appeared.
if (Row.RawData.Contains("END-OF-FIELDS"))
{
    passedEOF = true;
    //IF WE HAVE GOT TO THIS POINT, WE HAVE ALL THE DATA IN OUR LIST NOW
    OutputBuffer.AddRow();
    var fields = typeof(OutputBuffer).GetProperties();
    //SET UP AND INITIALISE A VARIABLE TO HOLD THE ROW NUMBER COUNT
    int rowNumber = 0;
    foreach (var fieldName in fieldList)
    {
        //ADD ONE TO THE CURRENT VALUE OF rowNumber
        rowNumber++;
        //MATCH THE ROW NUMBER TO THE OUTPUT FIELD NAME
        PropertyInfo field = fields.FirstOrDefault(x => x.Name == string.Format("pos{0}", rowNumber));
        if (field != null)
        {
            field.SetValue(OutputBuffer, fieldName);
        }
    }
    OutputBuffer.SetEndOfRowset();
}
if (passedSOF && !passedEOF)
{
    this.fieldList.Add(Row.RawData);
}
if (Row.RawData.Contains("START-OF-FIELDS"))
{
    passedSOF = true;
}
So instead of having something like this:
START-OF-FIELDS
FRUIT
DAIRY
STARCHES
END-OF-FIELDS
I have the output:
pos1 | pos2 | pos3
FRUIT | DAIRY | STARCHES
So I can build a position key table to show which field will appear in which order in the current monthly file, and now I am looking forward to getting myself into more trouble splitting the actual data rows out into another table :)
I am trying to read null/blank values from Excel.
I have looked into hundreds of solutions, and either I am implementing them wrong or they just do not work; the result is Microsoft.CSharp.RuntimeBinder.RuntimeBinderException: 'Cannot perform runtime binding on a null reference'.
This is one of the last pieces of code I tried (since I was trying to put NA in all the null cells):
for (int i = 2; i <= rowCount; i++)
{
    string natext = xlRange.Value2[i, colCount]; // index by the loop variable, not rowCount
    if (natext == null)
    {
        natext = "NA";
    }
}
Any ideas or examples that could help?
If I click the details, it shows:
Microsoft.CSharp.RuntimeBinder.RuntimeBinderException
HResult=0x80131500
Message=Cannot perform runtime binding on a null reference
Source=
StackTrace:
First, the Excel object model is really weird. Value2 returns an object, and that object can be of all sorts of different types. If xlRange is a cell, then it returns the value of that cell, which could be a string or a double or something else. If xlRange is multiple cells then that object is an array of values. And then each of those values is an object. For each value you don't know if it's a string or a double or something else.
That's not fun to deal with. It's actually really, really bad. C# is a strongly-typed language, which means that you know what type everything is and you don't have to guess. Excel Interop takes that away from you and says, "Here's an object. It could be anything or lots of things that could each be anything. Figure it out. Good luck."
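To make that concrete, here is a rough sketch of the type-sniffing Value2 forces on you (the variable names are illustrative; note that the array Interop returns for a multi-cell range is 1-based):

object raw = xlRange.Value2;

if (raw is object[,] grid)
{
    // Multi-cell range: each element could be null, a string, a double, a bool, or an error value.
    for (int r = grid.GetLowerBound(0); r <= grid.GetUpperBound(0); r++)
    {
        for (int c = grid.GetLowerBound(1); c <= grid.GetUpperBound(1); c++)
        {
            object cell = grid[r, c];
            string text = cell?.ToString() ?? "NA"; // substitute NA for blanks, as the question wants
        }
    }
}
else
{
    // Single cell: raw is the value itself, or null if the cell is empty.
    string text = raw?.ToString() ?? "NA";
}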
Instead of getting the Value2 property of the range and then looping through the array, it's much easier to deal with the cells in the range instead.
Given that excelRange is a Range of cells:
for (var row = 1; row <= excelRange.Rows.Count; row++)
{
    for (var column = 1; column <= excelRange.Columns.Count; column++)
    {
        var cellText = excelRange.Cells[row, column].Text.ToString();
    }
}
This does two things. First, you're looking at one cell at a time. Second, you're using the Text property. The Text property should always be a string so you could just do this and it would almost certainly work:
string cellText = excelRange.Cells[row, column].Text;
It's just that the object model returns dynamic, so even though it is a string, the possibility is left open that maybe it won't be.
My strong recommendation - and I think most developers would agree - is to abandon Excel Interop and run from it, and use a library like EPPlus instead. There are tons of examples.
Excel Interop works by actually starting an instance of Excel and giving you access to the clunky VBA object model. It's evil. Chances are that if you open your task manager right now you'll see several extra instances of Excel open that you didn't expect to see. Fixing that is a whole separate frustrating problem.
For some years Excel files have just been collections of XML documents, and EPPlus helps you work with them as documents, providing all sorts of helper methods so that you can interact with sheets, ranges, cells, and so forth. Try it. Trust me, you'll never look back.
Here's an example after adding the EPPlus Nuget package:
var pathToYourExcelWorkbook = @"c:\somepath\document.xlsx";

using (var workbookPackage = new ExcelPackage(new FileInfo(pathToYourExcelWorkbook)))
{
    var workbook = workbookPackage.Workbook;
    var sheet = workbook.Worksheets[1]; // 1-based, or use the name.
    for (var row = 1; row <= 10; row++)
    {
        for (var column = 1; column <= 10; column++)
        {
            var cellText = sheet.Cells[row, column].Text;
        }
    }
}
It's awesome. No starting or closing an application - you're just reading from a file. No weird COM objects. And the objects are all strongly-typed. The Text property returns a string.
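To close the loop on the original goal of writing NA into blank cells, here is a minimal sketch with EPPlus (the path is the same illustrative one as above; starting at row 2 to skip headers is an assumption carried over from the question's loop):

using System.IO;
using OfficeOpenXml;

var file = new FileInfo(@"c:\somepath\document.xlsx");

using (var package = new ExcelPackage(file))
{
    var sheet = package.Workbook.Worksheets[1];
    var used = sheet.Dimension; // the used range; null if the sheet is empty

    if (used != null)
    {
        for (var row = 2; row <= used.End.Row; row++) // row 2 onward, as in the question
        {
            for (var column = used.Start.Column; column <= used.End.Column; column++)
            {
                if (sheet.Cells[row, column].Value == null) // a truly blank cell
                {
                    sheet.Cells[row, column].Value = "NA";
                }
            }
        }
        package.Save();
    }
}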
I am working with a client to import a rather large Excel file (over 37K rows) into a custom system, utilizing the excellent LinqToExcel library to do so. While reading all of the data in, I noticed it was breaking on records about 80% in and dug a little further. The reason it fails is that the majority of records (with associated dates ranging from 2011 to 2015) are normal, e.g. 1/3/2015; however, starting in 2016, the structure changes to look like this: '1/4/2016 (note the "tick" at the beginning of the date), and LinqToExcel starts returning DBNull for that column.
Any ideas on why it would do that and ways around it? Note that this isn't a casting issue: I can use the Immediate Window to see all the values of the LinqToExcel.Row instance, and where that column index should be, it's empty.
Edit
Here is the code I am using to read in the file:
var excel = new LinqToExcel.ExcelQueryFactory(Path.Combine(this.FilePath, this.CurrentFilename));

foreach (var row in excel.Worksheet(file.WorksheetName))
{
    data.Add(this.FillEntity(row));
}
The problem I'm referring to is inside the row variable, which is a LinqToExcel.Row instance and contains the raw data from Excel. The values inside row all line up, with the exception of the column for the date which is empty.
Edit 2
I downloaded the LinqToExcel code from GitHub and connected it to my project, and it looks like the issue is even deeper than this library. It uses an IDataReader to read in all of the values, and the cells in question that aren't being read are empty at that level. Here is the block of code from the LinqToExcel.ExcelQueryExecutor class that is failing:
private IEnumerable<object> GetRowResults(IDataReader data, IEnumerable<string> columns)
{
    var results = new List<object>();
    var columnIndexMapping = new Dictionary<string, int>();
    for (var i = 0; i < columns.Count(); i++)
        columnIndexMapping[columns.ElementAt(i)] = i;

    while (data.Read())
    {
        IList<Cell> cells = new List<Cell>();
        for (var i = 0; i < columns.Count(); i++)
        {
            var value = data[i];

            //I added this in, since the worksheet has over 37K rows and
            //I needed to snag right before it hit the values I was looking for
            //to see what the IDataReader was exposing. The row inside the
            //IDataReader relevant to the column I'm referencing is null,
            //even though the data definitely exists in the Excel file
            if (value.GetType() == typeof(DateTime) && value.Cast<DateTime>() == new DateTime(2015, 12, 31))
            {
            }

            value = TrimStringValue(value);
            cells.Add(new Cell(value));
        }
        results.CallMethod("Add", new Row(cells, columnIndexMapping));
    }
    return results.AsEnumerable();
}
Since their class uses an OleDbDataReader to retrieve the results, I think that is what can't find the value of the cell in question. I don't even know where to go from there.
Found it! Once I traced down that it was the OleDbDataReader failing and not the LinqToExcel library itself, it sent me down a different path. Apparently, when an Excel file is read by an OleDbDataReader (as virtually all such utilities do under the covers), the first few records (by default the first 8 rows, per the provider's TypeGuessRows setting) are scanned to determine the type of content in each column. In my scenario, over 20K records had "normal" dates, so it assumed everything was a date. Once it got to the "bad" records, the ' in front of the date meant the value couldn't be parsed into a date, so it came back null.
To circumvent this, I load the file and tell it to ignore column headers. Since the header for this column is a string and most of the values are dates, the mismatched types make the provider treat everything as a string, and the values I need are loaded properly. From there, I can parse accordingly and get it to work.
Source: What is IMEX in the OLEDB connection string?
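For reference, a rough sketch of what that workaround can look like with LinqToExcel's no-header API (the worksheet name and the column index are illustrative): reading every value as text and parsing afterwards sidesteps the type guessing entirely.

var excel = new LinqToExcel.ExcelQueryFactory(Path.Combine(this.FilePath, this.CurrentFilename));

// WorksheetNoHeader treats row 1 as data, so the string header mixed in with
// the date values pushes the column toward text instead of date.
foreach (var row in excel.WorksheetNoHeader(file.WorksheetName))
{
    var rawDate = row[3].ToString(); // illustrative column index for the date column
    if (DateTime.TryParse(rawDate.TrimStart('\''), out var parsedDate))
    {
        // use parsedDate; the TrimStart handles any leading tick that survives
    }
}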
I followed this very promising link to make my program read Excel files, but the problem I get is a System.OutOfMemoryException. As far as I can gather, it happens because of this chunk of code
object[,] valueArray = (object[,])excelRange.get_Value(
    XlRangeValueDataType.xlRangeValueDefault);
which loads the whole range of data into one variable. I do not understand why the developers of the library decided to do it this way instead of providing an iterator that would parse the sheet line by line. So, I need a working solution that can read large (>700K rows) Excel files.
I am using the following function in one of my C# applications:
string[,] ReadCells(Excel._Worksheet WS,
    int row1, int col1, int row2, int col2)
{
    Excel.Range R = WS.get_Range(GetAddress(row1, col1),
        GetAddress(row2, col2));
    ....
}
The reason to read a Range in one go rather than cell-by-cell is performance.
For every cell access, a lot of internal data transfer is going on. If the Range is too large to fit into memory, you can process it in smaller chunks.
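A minimal sketch of that chunking idea (the chunk size, helper name, and parameters are illustrative, not from the answer's own code):

using System;
using Excel = Microsoft.Office.Interop.Excel;

// Read a large sheet in fixed-size row chunks instead of one huge get_Value call.
static void ReadInChunks(Excel._Worksheet ws, int totalRows, int totalCols, int chunkRows = 10000)
{
    for (int startRow = 1; startRow <= totalRows; startRow += chunkRows)
    {
        int endRow = Math.Min(startRow + chunkRows - 1, totalRows);

        Excel.Range chunk = ws.get_Range(ws.Cells[startRow, 1], ws.Cells[endRow, totalCols]);
        object[,] values = (object[,])chunk.Value2; // 1-based array covering just this chunk

        for (int r = 1; r <= values.GetLength(0); r++)
        {
            for (int c = 1; c <= values.GetLength(1); c++)
            {
                object cell = values[r, c]; // null for empty cells
                // process the cell here, then let the chunk go out of scope
            }
        }
    }
}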
I have written a C# program that does a lot of iterative calculations and then returns a huge list of data. Because the data changes each time I run the program, I draw it in an Excel spreadsheet with predefined functions and graphs that are useful for interpreting the data. All my charts in the spreadsheet depend on a single column of data, from which the other columns and axes are calculated using formulas. However, the total amount of data is not constant.
For instance, sometimes I get 22 elements of data in the list, and sometimes the number flows into the hundreds. To have a stable bound, I cap the charts to graph only the first 50 rows of data, and in my program I fill the remaining rows with the value "#N/A". However, when I open the spreadsheet, the rows with superfluous data are graphed as 0s. I want the charts to graph only the rows with valid data.
Here is what my code looks like; it is relatively simple, so I am not going to modify it. I want to know what changes I can make in the spreadsheet.
FileInfo newFile = new FileInfo("Report.xlsx");
ExcelPackage pack = new ExcelPackage(newFile);
ExcelWorksheet ws = pack.Workbook.Worksheets[1];

int cellCount = 2;
for (int i = 0; i < 49; i++)
{
    String cell = "B" + cellCount;
    if (i < data.Count)
        ws.Cells[cell].Value = data.ElementAt(i);
    else
        ws.Cells[cell].Value = "#N/A";
    cellCount++;
}

Console.Out.WriteLine("saving");
pack.Save();
System.Diagnostics.Process.Start("Report.xlsx");
To access the Excel documents, I use EPPlus. Here is what my charts look like:
As the graph shows, the last 5-6 rows contain NULL values; however, they are graphed as well, with values of 0. The blue line represents the data in the third column, and the red line represents the last column (which is never going to be null because it depends on a fixed row).
How do I force Excel to ignore the last few NULL rows?
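For completeness, one possible tweak at the code level, sketched under the assumption that charts skip genuine #N/A errors while plotting the literal text "#N/A" as 0: have the loop above write a real NA instead of the string.

// A variation on the if/else in the loop above: write a real #N/A, not the text "#N/A".
if (i < data.Count)
{
    ws.Cells[cell].Value = data.ElementAt(i);
}
else
{
    ws.Cells[cell].Formula = "NA()"; // evaluates to #N/A when the workbook opens

    // Alternative (EPPlus 4+): write the error value directly, no formula needed.
    // ws.Cells[cell].Value = OfficeOpenXml.ExcelErrorValue.Create(OfficeOpenXml.eErrorType.NA);
}

On the spreadsheet side, leaving the overflow cells truly empty and setting each chart's "Hidden and Empty Cells" option to show empty cells as gaps achieves a similar effect without touching the data.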