I am writing a program that pulls data from a CSV file (which, due to its structure, is easier to work with through Excel). There are columns that hold a date and a time. The date column processes correctly, but the time column (F) is being interpreted as a double. For example, in the following loop the value is 0.00 on the first iteration, 0.25 on the second, 0.26041666666666669 on the third, 0.27083333333333331 on the fourth, and so on.
for (i = startRow; i <= endRow; i++)
{
PeriodSales saleRow = new PeriodSales();
DateTime saleDate = Convert.ToDateTime((sheet.Cells[i, 5] as Excel.Range).Value);
var timeString = (sheet.Cells[i, 6] as Excel.Range).Value;
DateTime timeOfSale;
timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, 0, 0, 0);
// the lines below were commented out for testing purposes
// (so I could see the value of timeString in the loop)
/* if (timeString != "0")
{
String[] timeArray = timeString.Split(':');
timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, Convert.ToInt32(timeArray[0]), Convert.ToInt32(timeArray[1]), 0);
}
else
{
timeOfSale = new DateTime(saleDate.Year, saleDate.Month, saleDate.Day, 0, 0, 0);
} */
Attached is a screenshot of my spreadsheet/CSV
The underlying CSV (in Notepad++)
Thanks for any guidance.
This discussion might answer your question:
Capturing Time Values from an Excel Cell
The first answer there may be what you're looking for, though you may need to write some additional C# to get the conversions you need.
You want to use
timeOfSale = DateTime.FromOADate(timeString).TimeOfDay;
which converts the value from Excel's OLE Automation date (OADate) serial format.
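Note that FromOADate expects a double and that TimeOfDay gives back a TimeSpan rather than a DateTime, so inside the loop from the question the conversion might look something like this sketch (it assumes column F really holds a numeric Excel time serial, i.e. a fraction of a day):
double timeSerial = Convert.ToDouble(timeString); // e.g. 0.25 == 06:00
TimeSpan timeOfDay = DateTime.FromOADate(timeSerial).TimeOfDay; // drop the serial's date part
DateTime timeOfSale = saleDate.Date.Add(timeOfDay); // date from column E + time from column F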
Related
I have an application that reads Excel files and extracts all the data. The xlsx or xls files are not standard and vary in column size, column data type, and row size. My application currently is able to read all files of this nature, but when I use Value2 all the dates come back as OLE Automation values (32442, 45322, ...). After researching, it's clear that Value2 returns the underlying data as a double counting days from Dec 30, 1899.
I'm currently calling Value2 only once and moving everything into an object array:
object[,] data = xlRange.Value2;
Then, depending on the row count, the column count, and whether there is a header, I read everything into a new temp table that gets bulk-copied into the file's matching table in the database.
Before the file is read, I perform a setup process that records things like how many rows to skip at the top, the column names, whether the file has a header, and the header column names.
Now, is there a way for me to automatically detect which columns are dates or currency without having to explicitly tell the application which columns to handle differently? I was able to work around this by labeling the date columns as DATE and using the FromOADate function to load them as date strings, but I'm not really happy with that solution.
for (int r = startRow; r <= rowCount - rowsAfter; r++) //populates the raw table with all file rows
{
//if (i % 100 == 0)
//MessageBox.Show("Message" + DateTime.Now.ToLongTimeString());
//Application.DoEvents();
object[] rowData = new object[colCount];
for (int c = 1; c <= colCount; c++)
{
bool isDateColumn = false;
string tempGridFieldName = fieldListcount_excel2[c - 1]; //checks the fieldListNames and looks for the keyword DATE
if (tempGridFieldName.Contains("DATE"))
isDateColumn = true; // supposed to be true
int maxFieldSize = Int32.Parse(fieldListlength_excel2[c - 1]); //changed this to version 2
if (data[r, c] != null)
{
int cellSize = data[r, c].ToString().Length; //gets max cell size from grid
if (cellSize <= maxFieldSize)
{
if (isDateColumn) //column is a date field so we take the Excel serial date and convert it to Datetime
{
object o = data[r, c];
double d = Convert.ToDouble(o);
DateTime dt = DateTime.FromOADate(d);
rowData[c - 1] = dt.ToShortDateString();
}
else
{
rowData[c - 1] = data[r, c];
}
}
else
{
MessageBox.Show($"Cell [{r},{c}] content size larger than field size");
cellOversizeFlag = true;
eof = false;
break;
}
}
else
{
rowData[c - 1] = data[r, c];
}
}
rawExcelFileDT.Rows.Add(rowData);
}
I also tried to check the columns using the GetType method, but the column names always came back as strings and the dates came back as doubles. The problem is that I cannot assume that all doubles will be dates, so how can I approach this problem?
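For illustration only (this is not from the original post, just a sketch of one possible heuristic): before pulling Value2, you could inspect the NumberFormat of a sample cell in each column and treat the column as a date when the format string contains date placeholders. Currency and plain-number formats won't contain them, so this avoids relying on column names, though it will misjudge columns whose formatting doesn't match their content.
// Sketch: guess date columns from the NumberFormat of the first data row.
// Assumes xlRange, startRow and colCount are the same objects used elsewhere in the code.
bool[] isDateCol = new bool[colCount];
for (int c = 1; c <= colCount; c++)
{
string fmt = Convert.ToString(((Excel.Range)xlRange.Cells[startRow, c]).NumberFormat) ?? "";
// date/time formats typically contain y, d or h:mm placeholders ("m/d/yyyy", "h:mm", ...)
isDateCol[c - 1] = fmt.IndexOfAny(new[] { 'y', 'd' }) >= 0 || fmt.Contains("h:mm");
}
// later, while copying values:
// if (isDateCol[c - 1] && data[r, c] is double serial)
//     rowData[c - 1] = DateTime.FromOADate(serial).ToShortDateString();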
I am reading and loading in files into Excel using C# VSTO and the filenames are something like this:
C:\myfiles\1000AM.csv
C:\myfiles\1100AM.csv
C:\myfiles\1200PM.csv
C:\myfiles\100PM.csv
C:\myfiles\200PM.csv
And then I am putting these in a list and need to sort them by "time".
How can I convert the strings in the format above into a time object that I can use to sort on?
You need to extract the time parts somehow and then compare them to each other.
You could, for example, do this using a Comparison<string>. Here is an example that uses the Span<T> type to do this without allocating any additional garbage:
List<string> list = new List<string>() { ... };
list.Sort((a, b) =>
{
//compare AM/PM
int compareAmAndPm = a.AsSpan().Slice(a.Length - 6, 2)
.CompareTo(b.AsSpan().Slice(b.Length - 6, 2), StringComparison.Ordinal);
if (compareAmAndPm != 0)
return compareAmAndPm;
//compare the times as integers
int index = a.LastIndexOf('\\');
var firstTime = int.Parse(a.AsSpan().Slice(index + 1, a.Length - index - 7));
index = b.LastIndexOf('\\');
var secondTime = int.Parse(b.AsSpan().Slice(index + 1, b.Length - index - 7));
return firstTime.CompareTo(secondTime);
});
It should give you a result like this:
C:\myfiles\1000AM.csv
C:\myfiles\1100AM.csv
C:\myfiles\100PM.csv
C:\myfiles\200PM.csv
C:\myfiles\1200PM.csv
In practice we have found that a time or a date on its own does not work in 99% of cases. We need both, plus the time zone, to have any hope of processing them meaningfully.
That is why we mostly work with things like DateTime nowadays. Ideally those file names would contain the full DateTime, in UTC and invariant culture. If you have the option to change how they are created, use it.
However, if you consistently have only one part, that is not an issue: DateTime simply uses default values for the missing parts, and as those will be consistent, the sort still works. The only issue will be finding a culture setting that accepts that AM/PM format.
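For instance (just a sketch, assuming the file names always end in a time followed by AM or PM and .csv): pad the time portion to a fixed width and parse it with an invariant-culture format, then sort on the resulting time of day.
// requires System.Globalization, System.IO and System.Linq
var files = new List<string>
{
@"C:\myfiles\1000AM.csv", @"C:\myfiles\100PM.csv", @"C:\myfiles\1200PM.csv",
@"C:\myfiles\200PM.csv", @"C:\myfiles\1100AM.csv",
};
var sorted = files
.OrderBy(f =>
{
// "1000AM" / "100PM" -> pad to six characters so "hhmmtt" always matches
string name = Path.GetFileNameWithoutExtension(f).PadLeft(6, '0');
return DateTime.ParseExact(name, "hhmmtt", CultureInfo.InvariantCulture).TimeOfDay;
})
.ToList();
// result: 1000AM, 1100AM, 1200PM (noon), 100PM, 200PM
Note that this orders 1200PM (noon) before 100PM, which is what sorting by actual time of day implies.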
I have a huge dataset that I want to write into the Excel and need to perform conditional formatting of rows based on a business logic. So, for the data insertion part, I am using a data array to populate the Excel and it works pretty fast. However, I see a severe performance degradation when it comes to formatting the rows. It almost takes more than double the time just to do the formatting.
As of now, I am applying formatting to individual rows, looping through a series of rows. However, I am wondering if I can select multiple rows at a time and apply bulk formatting to them:
Here is what I have right now:
foreach (int row in rowsToBeFormatted)
{
Excel.Range range = (Excel.Range)xlsWorksheet.Range[xlsWorksheet.Cells[row + introFormat, 1], xlsWorksheet.Cells[row + introFormat, 27]];
range.Font.Size = 11;
range.Interior.ColorIndex = 15;
range.Font.Bold = true;
}
And here is a demo of how I am trying to select multiple rows to the range and apply the formatting:
string excelrange = "A3:AA3,A83:AA83,A88:AA88,A94:AA94,A102:AA102,A106:AA106,A110:AA110,...." (string with more than 3000 characters)
xlsWorksheet.get_Range(excelrange).Interior.Color = Color.SteelBlue;
However, I get the following error when I execute the code:
Exception from HRESULT: 0x800A03EC
and there is nothing in the inner exception. Any ideas how I can achieve the desired result?
As per the comments under the question, there's a hard-coded limit of 255 characters for a range string, though I wasn't able to find any documentation about it. Another commenter suggested using a semicolon as the separator, but the documentation clearly states that a comma should be used as the union operator in a range string:
The name of the range in A1-style notation in the language of the application. It can include the range operator (a colon), the intersection operator (a space), or the union operator (a comma). It can also include dollar signs, but they are ignored. You can use a local defined name in any part of the range. If you use a name, the name is assumed to be in the language of the application.
So where do we go from here? Formatting each range individually is indeed inefficient. The Application interface provides a Union method, but calling it in a loop is as inefficient as formatting individually. So the natural choice is to use the range-string limit to the maximum, thus minimizing the number of calls through the COM interface.
You can split the full range to format into chunks, each not exceeding the 255-character limit. I would implement it with an iterator:
static IEnumerable<string> GetChunks(IEnumerable<string> ranges)
{
const int MaxChunkLength = 255;
var sb = new StringBuilder(MaxChunkLength);
foreach (var range in ranges)
{
if (sb.Length > 0)
{
if (sb.Length + range.Length + 1 > MaxChunkLength)
{
yield return sb.ToString();
sb.Clear();
}
else
{
sb.Append(",");
}
}
sb.Append(range);
}
if (sb.Length > 0)
{
yield return sb.ToString();
}
}
var rowsToFormat = new[] { 3, 83, 88, 94, 102, 106, 110/*, ...*/ };
var rowRanges = rowsToFormat.Select(row => "A" + row + ":" + "AA" + row);
foreach (var chunk in GetChunks(rowRanges))
{
var range = xlsWorksheet.Range[chunk];
// do formatting stuff here
}
The above is 10-15 times faster than individual formatting:
foreach (var rangeStr in rowRanges)
{
var range = xlsWorksheet.Range[rangeStr];
// do formatting stuff here
}
I can also see further room for optimization, such as grouping contiguous rows, but if you are formatting discrete rows with subtotals, it won't help.
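For completeness, the contiguous-row grouping mentioned above could look something like the sketch below (not part of the original answer): collapse runs of consecutive row numbers into single A{start}:AA{end} ranges and feed those to GetChunks, so each run costs only one range part.
// Sketch: turns { 3, 88, 89, 90, 94 } into "A3:AA3", "A88:AA90", "A94:AA94".
// Assumes the same A..AA column span as above; requires System.Linq.
static IEnumerable<string> GetContiguousRanges(IEnumerable<int> rows)
{
int? start = null, prev = null;
foreach (var row in rows.OrderBy(r => r))
{
if (start == null)
{
start = prev = row;
}
else if (row == prev + 1)
{
prev = row; // extend the current run
}
else
{
yield return $"A{start}:AA{prev}";
start = prev = row; // begin a new run
}
}
if (start != null)
yield return $"A{start}:AA{prev}";
}
// usage: foreach (var chunk in GetChunks(GetContiguousRanges(rowsToFormat))) { ... }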
I have a number of documents with predictable placement of certain text which I'm trying to extract. For the most part it works very well, but I'm having difficulties with a certain fraction of documents that have slightly thicker text.
Thin text:
Thick text:
I know it's hard to tell the difference at this resolution, but if you look at MO DAY YEAR TIME (2400) portion, you can tell that the second one is thicker.
The thin text gives me exactly what is expected:
09/28/2015
0820
However, the thick version gives me every character tripled, with whitespace between the duplicated characters:
1 1 11 1 1/ / /1 1 19 9 9/ / /2 2 20 0 01 1 15 5 5
1 1 17 7 70 0 02 2 2
I'm using the following code to extract text from documents:
public static Document GetDocumentInfo(string fileName)
{
// Using 11 in x 8.5 in dimensions at 72 dpi.
var boundingBoxes = new[]
{
new RectangleJ(446, 727, 85, 14),
new RectangleJ(396, 702, 43, 14),
new RectangleJ(306, 680, 58, 7),
new RectangleJ(378, 680, 58, 7),
new RectangleJ(446, 680, 45, 7),
new RectangleJ(130, 727, 29, 10),
new RectangleJ(130, 702, 29, 10)
};
var data = GetPdfData(fileName, 1, boundingBoxes);
// I would populate the new document with extracted data
// here, but it's not important for the example.
var doc = new Document();
return doc;
}
public static string[] GetPdfData(string fileName, int pageNum, RectangleJ[] boundingBoxes)
{
// Omitted safety checks, as they're not important for the example.
var data = new string[boundingBoxes.Length];
using (var reader = new PdfReader(fileName))
{
if (reader.NumberOfPages < 1)
{
return null;
}
RenderFilter filter;
ITextExtractionStrategy strategy;
for (var i = 0; i < boundingBoxes.Length; ++i)
{
filter = new RegionTextRenderFilter(boundingBoxes[i]);
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
}
return data;
}
}
Obviously, if nothing else works, I can get rid of duplicate characters after reading them in, as there is a very apparent pattern, but I'd rather find a proper way than a hack. I tried looking around for the past few hours, but couldn't find anyone encountering a similar issue.
EDIT:
I finally came across this SO question:
Text Extraction Duplicate Bold Text
...and in the comments it's indicated that some lower-quality PDF producers duplicate text to simulate boldness, so that's one of the things that might be happening here. However, there is a mention of omitting duplicate text at the same location, which I don't know how to achieve, since this portion of my code...
data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
...reads in the duplicated text completely in any of the specified locations.
EDIT:
I have now come across documents that duplicate content up to four times to simulate thickness. That's a very strange way of doing things, but I'm sure the designers of that method had their reasons.
EDIT:
I produced a solution (see my answer). It processes the data after it has already been extracted and removes any repetitions. Ideally this would have been done during the extraction process, but that can get pretty complicated, and this seemed like a very clean and easy way of accomplishing the same thing.
As #mkl suggested, one way of tackling this issue is to override LocationTextExtractionStrategy; however, things get pretty complicated, since that requires comparing the locations of each character found within specific boundaries. I tried doing some research in order to accomplish that, but due to poor documentation it was getting a bit out of hand.
So instead I created a post-processing method, loosely based on what #TheMuffinMan suggested, to clean up any repetitions. I decided not to deal with pixels, but rather with character-count anomalies at known static locations. In my case, I know that the second extracted data piece can never be longer than three characters, so it's a good comparison point for me. If you know the document layout, you can use anything on it that you know will always be of fixed length.
After I extract the data with the method listed in my original post, I check whether the second data piece is longer than three characters. If it is, I divide its length by three, since that's the most characters it can legitimately have; because the repeated text is always an exact multiple of the original length, this gives me the repetition count:
var data = GetPdfData(fileName, 1, boundingBoxes);
if (data[1].Length > 3)
{
var count = data[1].Length / 3;
for (var i = 0; i < data.Length; ++i)
{
data[i] = RemoveRepetitions(data[i], count);
}
}
As you can see, I then loop over the data and pass each piece into the RemoveRepetitions() method:
public static string RemoveRepetitions(string original, int count)
{
if (original.Length % count != 0)
{
return null;
}
var temp = new char[original.Length / count];
for (int i = 0; i < original.Length; i += count)
{
temp[i / count] = original[i];
}
return new string(temp);
}
This method takes the string and the number of expected repetitions, which we calculated earlier. One thing to note is that I don't have to worry about the whitespace inserted during the duplication process (as shown in the original post), because count represents the total number of characters that appear where only one should have been.
I'm building an interface layer for a Matlab component which is used to analyse data maintained by a separate .NET application which I am also building. I'm trying to serialise a .NET datatable as a numeric array to be passed to the MATLAB component (as part of a more generalised serialisation routine).
So far, I've been reasonably successful with passing tables of numeric data but I've hit a snag when trying to add a column of datatype DateTime. What I've been doing up to now is stuffing the values from the DataTable into a double array, because MATLAB only really cares about doubles, and then doing a straight cast to a MWNumericArray, which is essentially a matrix.
Here's the current code;
else if (sourceType == typeof(DataTable))
{
DataTable dtSource = source as DataTable;
var rowIdentifiers = new string[dtSource.Rows.Count];
// I know this looks silly but we need the index of each item
// in the string array as the actual value in the array as well
for (int i = 0; i < dtSource.Rows.Count; i++)
{
rowIdentifiers[i] = i.ToString();
}
// convenience vars
int rowCount = dtSource.Rows.Count;
int colCount = dtSource.Columns.Count;
double[,] values = new double[rowCount, colCount];
// For each row
for (int rownum = 0; rownum < rowCount; rownum++)
{
// for each column
for (int colnum = 0; colnum < colCount; colnum++)
{
// ASSUMPTION. value is a double
values[rownum, colnum] = Conversion.ConvertToDouble(dtSource.Rows[rownum][colnum]);
}
}
return (MWNumericArray)values;
}
Conversion.ConvertToDouble is my own routine, which caters for nulls and DBNull and returns double.NaN, again because Matlab treats all nulls as NaNs.
So here's the thing; Does anyone know of a MATLAB datatype that would allow me to pass in a contiguous array with multiple datatypes? The only workaround I can conceive of involves using a MWStructArray of MWStructArrays, but that seems hacky and I'm not sure how well it would work in the MATLAB code, so I'd like to try to find a more elegant solution if I can. I've had a look at using an MWCellArray, but it gives me a compile error when I try to instantiate it.
I'd like to be able to do something like;
object[,] values = new object[rowCount, colCount];
// fill loosely-typed object array
return (MWCellArray)values;
But as I said, I get a compile error with this, and also when passing an object array to the constructor.
Apologies if I have missed anything silly. I've done some Googling, but information on Matlab to .NET interfaces seems a little light, so that is why I posted it here.
Thanks in advance.
[EDIT]
Thanks to everyone for the suggestions.
Turns out that the quickest and most efficient way for our specific implementation was to convert the DateTime to an int in the SQL code.
However, of the other approaches, I would recommend the MWCellArray approach. It involves the least fuss, and it turns out I was just doing it wrong: you can't treat it like any other MWArray type. Since it is designed to hold multiple datatypes, you need to iterate over it, sticking in MWNumerics or whatever takes your fancy as you go. One thing to be aware of is that MWArrays are 1-based, not 0-based. That one keeps catching me out.
I'll go into a more detailed discussion later today when I have the time, but right now I don't. Thanks everyone once more for your help.
As #Matt suggested in the comments, if you want to store different datatypes (numeric, strings, structs, etc...), you should use the equivalent of cell-arrays exposed by this managed API, namely the MWCellArray class.
To illustrate, I implemented a simple .NET assembly. It exposes a MATLAB function that receives a cell array (records from a database table) and simply prints them. This function is called from our C# application, which generates a sample DataTable and converts it into an MWCellArray (filling the table entries cell by cell).
The trick is to map the objects contained in the DataTable to the types supported by the MWArray-derived classes. Here are the ones I used (check the documentation for a complete list):
.NET native type          MWArray classes
------------------------  ------------------------------------------
double, float, int, ...   MWNumericArray
string                    MWCharArray
DateTime                  MWNumericArray (using the Ticks property)
A note about the date/time data: in .NET, the System.DateTime expresses date and time as:
the number of 100-nanosecond intervals that have elapsed since January
1, 0001 at 00:00:00.000
while in MATLAB, this is what the DATENUM function has to say:
A serial date number represents the whole and fractional number of
days from a specific date and time, where datenum('Jan-1-0000
00:00:00') returns the number 1
For this reason, I wrote two helper functions in the C# application to convert the DateTime "ticks" to match the MATLAB definition of serial date numbers.
First, consider this simple MATLAB function. It expects to receive a numRows-by-numCols cell array containing the table data. In my example, the columns are: Name (string), Price (double), Date (DateTime).
function [] = my_cell_function(C)
names = C(:,1);
price = cell2mat(C(:,2));
dt = datevec( cell2mat(C(:,3)) );
disp(names)
disp(price)
disp(dt)
end
Using deploytool from MATLAB Builder NE, we build the above as a .NET assembly. Next, we create a C# console application, then add a reference to the MWArray.dll assembly, in addition to the above generated one. This is the program I am using:
using System;
using System.Data;
using MathWorks.MATLAB.NET.Utility; // MWArray.dll
using MathWorks.MATLAB.NET.Arrays; // MWArray.dll
using CellExample; // CellExample.dll assembly created
namespace CellExampleTest
{
class Program
{
static void Main(string[] args)
{
// get data table
DataTable table = getData();
// create the MWCellArray
int numRows = table.Rows.Count;
int numCols = table.Columns.Count;
MWCellArray cell = new MWCellArray(numRows, numCols); // one-based indices
// fill it cell-by-cell
for (int r = 0; r < numRows; r++)
{
for (int c = 0; c < numCols; c++)
{
// fill based on type
Type t = table.Columns[c].DataType;
if (t == typeof(DateTime))
{
//cell[r+1,c+1] = new MWNumericArray( convertToMATLABDateNum((DateTime)table.Rows[r][c]) );
cell[r + 1, c + 1] = convertToMATLABDateNum((DateTime)table.Rows[r][c]);
}
else if (t == typeof(string))
{
//cell[r+1,c+1] = new MWCharArray( (string)table.Rows[r][c] );
cell[r + 1, c + 1] = (string)table.Rows[r][c];
}
else
{
//cell[r+1,c+1] = new MWNumericArray( (double)table.Rows[r][c] );
cell[r + 1, c + 1] = (double)table.Rows[r][c];
}
}
}
// call MATLAB function
CellClass obj = new CellClass();
obj.my_cell_function(cell);
// Wait for user to exit application
Console.ReadKey();
}
// DateTime <-> datenum helper functions
static double convertToMATLABDateNum(DateTime dt)
{
return (double)dt.AddYears(1).AddDays(1).Ticks / (10000000L * 3600L * 24L);
}
static DateTime convertFromMATLABDateNum(double datenum)
{
DateTime dt = new DateTime((long)(datenum * (10000000L * 3600L * 24L)));
return dt.AddYears(-1).AddDays(-1);
}
// return DataTable data
static DataTable getData()
{
DataTable table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Price", typeof(double));
table.Columns.Add("Date", typeof(DateTime));
table.Rows.Add("Amro", 25, DateTime.Now);
table.Rows.Add("Bob", 10, DateTime.Now.AddDays(1));
table.Rows.Add("Alice", 50, DateTime.Now.AddDays(2));
return table;
}
}
}
The output of this C# program as returned by the compiled MATLAB function:
'Amro'
'Bob'
'Alice'
25
10
50
2011 9 26 20 13 8.3906
2011 9 27 20 13 8.3906
2011 9 28 20 13 8.3906
One option is to call .NET code directly from MATLAB and have MATLAB query the database directly, using your .NET interface instead of going through the serialization process you describe. I have done this repeatedly in our environment with great success. In such an endeavor, NET.addAssembly is your biggest friend.
Details are here.
http://www.mathworks.com/help/matlab/ref/net.addassembly.html
A second option would be to go with MATLAB cell arrays. You can set them up so that the columns are different data types, each column forming a cell. That is a trick MATLAB itself uses in the textscan function. I'd recommend reading the documentation for that function here:
http://www.mathworks.com/help/techdoc/ref/textscan.html
A third option is to use textscan entirely: write a text file out from your .NET code and let textscan handle the parsing of it. textscan is a very powerful mechanism for getting this kind of data into MATLAB. You can point it at a file, or at a bunch of strings.
I have tried the functions written by #Amro, but the result for certain dates is not correct.
What I tried was:
Create a date in C#
Use the function supplied by #Amro to convert it to a MATLAB date number
Use that number in Matlab to check its correctness
It seems to have problems with dates of 1 Jan 00:00:00 for some years, e.g. 2014 and 2015. For example:
DateTime dt = new DateTime(2014, 1, 1, 0, 0, 0);
double dtmat = convertToMATLABDateNum(dt);
I got dtmat = 735599.0 from this.
I used it in MATLAB as follows:
datestr(datenum(735599.0))
I got this in return:
ans = 31-Dec-2013
When I tried 1 Jan 2012 it was OK. Any suggestions as to why this happens?
I had the same issue as #Johan.
The problem is with leap years, which the original conversion does not handle correctly.
To fix it, I changed the DateTime conversion code to the following:
private static long MatlabDateConversionFactor = (10000000L * 3600L * 24L); // ticks per day
private static long MatlabDayOffset = 367; // MATLAB serial date of the .NET tick epoch (1 Jan 0001)
public static double convertToMATLABDateNum(DateTime dt) {
var converted = ((double)dt.Ticks / (double)MatlabDateConversionFactor);
return converted + MatlabDayOffset;
}
public static DateTime convertFromMATLABDateNum(double datenum) {
var ticks = (long)((datenum - MatlabDayOffset) * MatlabDateConversionFactor);
return new DateTime(ticks, DateTimeKind.Utc);
}
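A quick sanity check with the date from the earlier comment: the corrected function now matches MATLAB's own serial date number for 1 Jan 2014.
DateTime dt = new DateTime(2014, 1, 1, 0, 0, 0);
double dtmat = convertToMATLABDateNum(dt); // 735600.0 (the original code gave 735599.0)
// In MATLAB: datestr(735600) returns '01-Jan-2014'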