Help and advice on using OLEDB for large Excel files

I am new to OLEDB and I have a project that requires me to grab data from an Excel file using a console application. The Excel file has around 500 columns and 55 rows. How can I get data from the columns past 255?

To read columns 256 onwards you just need to modify the SELECT statement. By default both the Microsoft.ACE.OLEDB.12.0 and the Microsoft.Jet.OLEDB.4.0 drivers read only columns 1-255 (A to IU), but you can ask for the remaining columns by specifying a column range in the SELECT statement.
To read the next 255 columns, taking "Sheet1" as your sheet name, you would specify...
Select * From [Sheet1$IV:SP]
This will work even if there aren't another 255 columns. It will simply return the next chunk of columns if there are 1...255 additional columns.
Incidentally, the Microsoft.ACE.OLEDB.12.0 driver will read both .xls and any variant of .xlsx, .xlsm etc without changing the extended properties from "Excel 12.0". There is no need to if...then...else the connection string depending on the file type.
The OLEDB driver is pretty good for the most part but it really does rely on well formed sheets. Mixed data types aren't handled terribly well and it does weird things if the first columns/rows are empty but aside from that it's fine. I use it a lot.
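A minimal sketch of how the range trick could be generalised to a sheet of any width, such as the 500-column sheet in the question. The class and method names (ExcelRanges, ColumnName, ChunkedSelects) are made up for illustration; only the [Sheet1$IV:SP] range syntax comes from the answer above. It builds the SELECT statements only and does not touch the driver.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: names are illustrative, not part of any library.
public static class ExcelRanges
{
    // Converts a 1-based column index to its Excel letter name (1 -> A, 256 -> IV).
    public static string ColumnName(int index)
    {
        if (index < 1) throw new ArgumentOutOfRangeException(nameof(index));
        string name = "";
        while (index > 0)
        {
            int remainder = (index - 1) % 26;
            name = (char)('A' + remainder) + name;
            index = (index - 1) / 26;
        }
        return name;
    }

    // Yields one SELECT per block of at most 255 columns, in left-to-right order.
    public static IEnumerable<string> ChunkedSelects(string sheetName, int totalColumns, int chunk = 255)
    {
        for (int first = 1; first <= totalColumns; first += chunk)
        {
            int last = Math.Min(first + chunk - 1, totalColumns);
            yield return $"SELECT * FROM [{sheetName}${ColumnName(first)}:{ColumnName(last)}]";
        }
    }
}
```

For a 500-column sheet this produces two statements, one for A:IU and one for IV:SF. Each statement would then be run as its own command, and the resulting tables stitched together by row position.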

Related

reading excel with oledb not displaying correct values

This is old question I posted:
Reading one & Update some other Excel with c#
As suggested, I created a schema.ini file. My Excel files have many columns (many of them not fixed) with mixed data; even a single cell can contain numbers along with text. I observed that NOT ALL values are shown when I read the Excel file using OLEDB and populate a DataTable.
I can't assume that ALL columns are listed in the .ini file. The columns in my Excel file go up to "DX". I observed that only the first row that has a number+text value is shown; similar text that appears further down isn't shown. It shows as blank.
Here is connection string:
string strConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + FilePath+ "';Extended Properties=\"Excel 12.0;HDR=YES;IMEX=1;TypeGuessRows=0;ImportMixedTypes=Text\"";
Is there any solution so it reads all types of data?

This comes up a lot, and it's very understandable because the documentation is somewhat lacking.
Microsoft.ACE.OLEDB.12.0 does not handle columns of mixed data types very well. The driver always reads the first n values in each column and assigns a data type depending on what it finds in those n cells. n is determined by a registry key. Its location depends on whether you have a 64-bit or a 32-bit installation; on 64-bit Windows the key for the 32-bit driver is in...
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
Sadly it's not always convenient to go around modifying registry keys and it would have been much better to leave this setting on the connection string but it is what it is. The default value for this is 8 rows.
If the driver finds mixed data types then and only then does the setting of IMEX come into play. If IMEX=1 is included then a column of mixed data types is returned as text. If it is not specified then any values which do not correspond to the assigned data type are returned as null.
This is where HDR=No is useful. If you have a header row, specify HDR=No and read the header as if it were data. As long as your headers are all text, this helps ensure that each column is seen as mixed and therefore returned as text. You can then discard the header row before processing the data. It won't help if the majority of the first n cells in the column are numeric or date/time values.
As an aside the driver will read all types of Excel files including .xls, .xlsm and .xlsx - there is no need to change the extended properties away from Excel 12.0 to do so. This is a considerable advantage.
The older Microsoft.Jet.OLEDB.4.0 was good in that you could specify TypeGuessRows and ImportMixedTypes in the connection string but Microsoft.ACE.OLEDB.12.0 completely ignores them so you can remove them from your connection string as their presence is misleading. The older driver can only read .xls files.
Both drivers will only read 255 columns without amending the SELECT statement. To read more than 255 columns you specify a range. E.g.
Select * From [Sheet1$IV:SP]
will read columns 256-510. If your sheet ends on DX it is well within the 255 column limit.
Hidden columns are always returned.
There are a couple of nasties with this driver. Firstly, leading empty rows or columns are ignored completely. This can really mess things up if you are expecting data in specific rows or columns. Secondly, Excel incorrectly treats 29/Feb/1900 as a valid date but OLEDB does not. You can put 29/Feb/1900 into an Excel spreadsheet just fine, but OLEDB will return it as 28/Feb/1900. I can't see anything else it could do really.
The driver is a very handy and cheap way of reading well formatted Excel spreadsheets as long as you are aware of the limitations and can code around them.
Good luck.
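As a rough sketch of the HDR=No approach described above: assuming the sheet was loaded with Extended Properties "Excel 12.0;HDR=No;IMEX=1" into a DataTable whose columns are named F1, F2, and so on, the header row arrives as the first data row. The helper below (HeaderPromotion and PromoteFirstRow are names invented here) turns that row into column names and then drops it.

```csharp
using System;
using System.Data;

// Hypothetical helper: assumes the table was filled with HDR=No, so row 0 holds the headers.
public static class HeaderPromotion
{
    // Uses the first data row as column names, then removes that row.
    public static void PromoteFirstRow(DataTable table)
    {
        if (table.Rows.Count == 0) return;

        DataRow header = table.Rows[0];
        for (int i = 0; i < table.Columns.Count; i++)
        {
            string name = (Convert.ToString(header[i]) ?? "").Trim();
            // Keep the default name (F1, F2, ...) for blank or duplicate headers.
            if (name.Length > 0 && !table.Columns.Contains(name))
                table.Columns[i].ColumnName = name;
        }

        table.Rows.RemoveAt(0);
        table.AcceptChanges();
    }
}
```

Every value still comes back as text with this approach, so numeric or date columns need to be parsed afterwards.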

ODBC driver cannot read rows added in Excel

I create an xls/xlsx file from C# using ODBC (with Provider=Microsoft.ACE.OLEDB.12.0). The resulting table has 4 rows (for example). I open the file with Excel, add a 5th row and save the file. When I try to read it from C# over ODBC with SELECT * FROM [table], I get only the original 4 rows without the 5th. It seems ODBC stores the number of rows somewhere in the XLS file and later reads only those rows, without the new data entered from Excel or LibreOffice. Is this a known problem, and can I solve it? If I create a new spreadsheet in Excel, all its rows are read from C#.
EDIT: I found some useful information. When the XLS file is first created from C#/ODBC, there are 2 tables (sheets). If the table name is TABLE, DataTable sheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null) will contain sheets.Rows[0] == "TABLE" and sheets.Rows[1] == "TABLE$". Excel will show only one sheet, "TABLE". After editing, the changes (the 5th row) exist only in the "TABLE$" sheet.

Are you adding the 5th row by code? If so, could you please share the code you are using to do it? Your code might have one of the following issues:
The save/commit is not done properly.
The connection is not refreshed before reading the file.

I think I found the problem. It seems that the internal spreadsheet names created by Excel have a "$" sign at the end, whereas the sheet name generated by ODBC is the exact string given in CREATE TABLE. Excel (and LibreOffice), on the other hand, show only one sheet for both the TABLE and TABLE$ sheets. If I edit the table in Excel, after saving the changes are only in TABLE$; the other sheet, TABLE, is unchanged. When I do SELECT * FROM [TABLE], the result comes from the original ODBC-generated table without the Excel changes. Now I enumerate the available sheets inside the XLS file, and if the first sheet name does not end with "$" and there is more than one sheet, I add "$" to the first sheet name and open the correct table. I suppose the ODBC connection string may include an option to work with "$"-ending tables...
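The workaround above could be sketched roughly as follows. It assumes the table names have already been collected from the TABLE_NAME column of the GetOleDbSchemaTable result; the class and method names are invented for this example, and it matches by name rather than by position in the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: prefers the worksheet-backed name ("TABLE$") over the bare name ("TABLE").
public static class SheetNames
{
    public static string ResolveWorksheetName(IEnumerable<string> schemaTableNames, string requested)
    {
        string wanted = requested.TrimEnd('$') + "$";
        foreach (string name in schemaTableNames)
        {
            if (string.Equals(name, wanted, StringComparison.OrdinalIgnoreCase))
                return name;
        }
        // No "$" variant found: fall back to the name exactly as requested.
        return requested;
    }
}
```

The returned name would then go into the query, for example "SELECT * FROM [" + name + "]".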

Is there a better way to indicate "null" values in Excel?

I have an Excel 2007 workbook that contains tables of data that I'm importing into DataTable objects using ADO.NET.
Through some experimentation, I've managed to find two different ways to indicate that a cell should be treated as "null" by ADO.NET:
The cell is completely blank.
The cell contains #N/A.
Unfortunately, both of these are problematic:
Most of my columns of data in Excel are generated via formulas, but it's not possible in Excel to generate a formula that results in a completely blank cell. And only a completely blank cell will be considered null (an empty string will not work).
Any formula that evaluates to #N/A (either due to an actual lookup error or because the NA() function was used) will be considered null. This seemed like the ideal solution until I discovered that the Excel workbook must be open for this to work. As soon as you close the workbook, OLEDB suddenly starts seeing all those #N/As as strings. This causes exceptions like the following to be thrown when filling the DataTable:
Input string was not in a correct format. Couldn't store <#N/A> in Value Column. Expected type is Int32.
Question: How can I indicate a null value via an Excel formula without having to have the workbook open when I fill the DataTable? Or what can be done to make #N/A values be considered null even when the workbook is closed?
In case it's important, my connection string is built using the following method:
var builder = new OleDbConnectionStringBuilder
{
    Provider = "Microsoft.ACE.OLEDB.12.0",
    DataSource = _workbookPath
};
builder.Add("Extended Properties", "Excel 12.0 Xml;HDR=Yes;IMEX=0");
return builder.ConnectionString;
(_workbookPath is the full path to the workbook).
I've tried both IMEX=0 and IMEX=1 but it makes no difference.

You're hitting the brick wall that many very frustrated users of Excel run into. Excel is widespread as a company tool and seems quite robust, but because each cell/column/row has a variant data type, it is a nightmare to handle with other tools such as MySQL, SQL Server, R, RapidMiner, SPSS and the list goes on. Excel 2007/2010 does not seem to be very well supported, even more so when taking 32/64-bit versions into account, which is scandalous in this day and age.
The main problem is that when ACE/Jet access each field in Excel they use a registry setting 'TypeGuessRows' to determine how many rows to use to assess the datatype. The default for "Rows to Scan" is 8 rows. The registry setting 'TypeGuessRows' can specify an integer value from one (1) to sixteen (16) rows, or you can specify zero (0) to scan all existing rows. If you can't change the registry setting (such as in 90% of office environments) it makes life difficult as the rows to guess are limited to the first 8.
For example, without the registry change
If the first occurrence of #N/A is within the first 8 rows then IMEX = 1 will return the error as a string "#N/A". If IMEX = 0 then an #N/A will return 'Null'.
If the first occurrence of #N/A is beyond the first 8 rows then both IMEX = 0 & IMEX = 1 both return 'Null' (assuming required data type is numeric).
With the registry change (TypeGuessRows = 0) then all should be fine.
Perhaps there are 4 options:
Change the registry setting TypeGuessRows = 0
List all possible type variations in the first 8 rows as 'dummy data' (eg memo fields/nchar(max)/ errors #N/A etc)
Correct ALL data type anomalies in Excel
Don't use Excel - Seriously worth considering!
Edit:
Just to put the boot in :) another 2 things that really annoy me are; if the first field on a sheet is blank over the first 8 rows and you can't edit the registry setting then the whole sheet is returned as blank (Many fun conversations telling managers they're fools for merging cells!). Also, if in Excel 2007/2010 you have a department return a sheet with >255 columns/fields then you have huge problems if you need non-contiguous import (eg key in col 1 and data in cols 255+)
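If none of the four options is practical, one further possibility (not from the answer above, just a sketch of a code-side workaround) is to let the column come back as text or object and decide in code what counts as null. The helper below is hypothetical; it treats DBNull, blank cells and any Excel error text beginning with "#" (such as #N/A) as null.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: not part of ADO.NET. Maps Excel error text and blanks to null.
public static class CellParsing
{
    // Returns the integer value of a cell, or null when the cell is empty,
    // DBNull, an Excel error string such as "#N/A", or not a whole number.
    public static int? ToNullableInt(object cell)
    {
        if (cell == null || cell == DBNull.Value) return null;

        string text = (Convert.ToString(cell, CultureInfo.InvariantCulture) ?? "").Trim();
        if (text.Length == 0 || text.StartsWith("#", StringComparison.Ordinal)) return null;

        return int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out int value)
            ? value
            : (int?)null;
    }
}
```

This moves the type decision out of the driver, at the cost of declaring the DataTable column as a loose type and converting each value yourself.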

Specification of Extended Properties in OleDb connection string?

At the moment I'm searching for properties for a connection string, which can be used to connect to an Excel file in readonly mode. Searching Google gets me a lot of examples of connection strings, but I can't seem to find a specification of all possibilities in the 'Extended Properties' section of the OleDb connection string.
At the moment I've this:
Provider = Microsoft.Jet.OLEDB.4.0; Data Source = D:\Data\Customers.xls; Extended Properties = 'Excel 8.0; Mode=Read; ReadOnly=true; HDR=Yes';
However... I've composed this by examples. So questions:
1. What is a decent source for OleDb Connection String documentation/reference?
2. Is the above connection string indeed connecting to the Excel file in readonly mode?
Thanks!

I use a UDL file for that.
Do the following:
Create an empty file, test.udl
Open it
You will see the Data Link Properties dialog
On the first tab, change the provider to Microsoft.Jet.OLEDB.4.0
On the second tab, select your Excel file
On the third tab, set the permission to Read
On the last tab, set Extended Properties = 'Excel 8.0; HDR=Yes'
Then save, open the file in a text editor and you will see the connection string
You can also check the MSDN article ADO Provider Properties and Settings

COM-based data access to Excel specifications are likely buried in nearly inaccessible Microsoft archive documentation. (Usually served as one enormous PDF)
I added the spec to another answer here: https://stackoverflow.com/a/68912543/6237912
Copied to this answer for completeness:
The connection string has several parts:
Provider: The main OLEDB provider used to open the Excel sheet. This will be Microsoft.Jet.OLEDB.4.0 for the Excel 97 onwards file format and Microsoft.ACE.OLEDB.12.0 for the Excel 2007 or higher file format (the one with the xlsx extension).
Data Source: The full path of the Excel workbook. You need to give a DOS path that corresponds to an Excel file, so it will look like: Data Source=C:\testApp.xls.
Extended Properties (Optional): Extended properties can be applied to Excel workbooks and may change how the workbook behaves when accessed from your program. The most common ones are the following:
HDR: Indicates whether the first row holds the field names of the Excel table. Default is YES. If you don't have field names in the header of your worksheet, you can specify HDR=NO, in which case the columns it finds are named F1, F2 etc.
ReadOnly: You can also open the Excel workbook in read-only mode by specifying ReadOnly=true. By default the ReadOnly attribute is false, so you can modify data within your workbook.
FirstRowHasNames: The same as HDR. It is always set to 1 (which means true); you can set it to false if you don't have a header row. If HDR is YES, the provider disregards this property. You can change the default behaviour of your environment by changing the registry value [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\FirstRowHasNames] to 00 (which is false).
MaxScanRows: Excel does not provide a detailed schema definition of the tables it finds, so the provider needs to scan rows before deciding the data types of the fields. MaxScanRows specifies the number of cells to be scanned before deciding the data type of a column. By default the value is 8. You can specify any value from 1 to 16 for 1 to 16 rows, or 0 so that it scans all existing rows before deciding the data type. You can change the default behaviour of this property by changing the value of [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows], which is 8 by default. Currently MaxScanRows is ignored, so you need to depend only on the TypeGuessRows registry value. Hopefully Microsoft fixes this issue in later versions.
IMEX: (A caution) As mentioned above, the provider has to scan a number of rows to select the most appropriate data type for a column, so a serious problem may occur if you have mixed data in one column. Say you have both integers and text in a single column: the provider chooses the data type based on the majority of the data. It then returns the values of the majority data type and returns NULL for the minority data type. If the two types are equally mixed in the column, the provider chooses numeric over text.
For example, if in your eight (8) scanned rows the column contains five (5) numeric values and three (3) text values, the provider returns five (5) numbers and three (3) null values.
To work around this problem, set "IMEX=1" in the Extended Properties section of the connection string. This enforces the ImportMixedTypes=Text registry setting. You can change the enforced type by changing [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\ImportMixedTypes] to numeric as well.
Thus a simple connection string with all of these properties will look like:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\testexcel.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE"
or:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\testexcel.xlsx;Extended Properties="Excel 12.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE"
The extended properties must be placed in quotes (") because they contain multiple values separated by semicolons.
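A small sketch of composing such a string in C#, mainly to show where the inner quotes go. The ExcelConnection class and its parameters are invented for this example. It always uses the ACE provider with "Excel 12.0", following the answer elsewhere on this page which says that provider reads both .xls and .xlsx, and it assumes the path itself contains no quotes or semicolons.

```csharp
using System;

// Hypothetical helper: builds the connection string text only, it does not open anything.
public static class ExcelConnection
{
    public static string Build(string path, bool hasHeader = true, bool importMixedAsText = true)
    {
        string hdr = hasHeader ? "YES" : "NO";
        string imex = importMixedAsText ? "1" : "0";

        // The extended properties are one value made of several ';'-separated
        // settings, so the whole group is wrapped in double quotes.
        string extended = "Excel 12.0;HDR=" + hdr + ";IMEX=" + imex;

        return "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
               ";Extended Properties=\"" + extended + "\"";
    }
}
```

Calling ExcelConnection.Build(@"C:\testexcel.xlsx") gives a string of the same shape as the second example above, minus the MAXSCANROWS and READONLY settings.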

C# Excel import data from CSV into Excel

How do I import data in Excel from a CSV file using C#? Actually, what I want to achieve is similar to what we do in Excel, you go to the Data tab and then select From Text option and then use the Text to columns option and select CSV and it does the magic, and all that stuff. I want to automate it.
If you could head me in the right direction, I'll really appreciate that.
EDIT: I guess I didn't explained well. What I want to do is something like
Excel.Application excelApp;
Excel.Workbook excelWorkbook;
// open excel
excelApp = new Excel.Application();
// something like
excelWorkbook.ImportFromTextFile(); // is what I need
I want to import that data into Excel, not my own application. As far as I know, I don't think I would have to parse the CSV myself and then insert them in Excel. Excel does that for us. I simply need to know how to automate that process.

I think you're overcomplicating things. Excel automatically splits data into columns on the comma delimiters if the file is a CSV file, so all you should need to do is make sure the file has a .csv extension.
I just tried quickly opening such a file in Excel and it works fine. So what you really need is just to call Workbooks.Open() on a file with a .csv extension.

You could open Excel, start recording a macro, do what you want, then see what the macro recorded. That should tell you what objects to use and how to use them.

I believe there are two parts. One is the split operation for the CSV, which the other responder has already picked up on; I don't think it is essential, but I'll include it anyway. The big one is writing to the Excel file, which I was able to get working, but only under specific circumstances, and it was a pain to accomplish.
CSV is pretty simple: you can do a string.Split on a comma separator if you want. However, this method is horribly broken. I'll admit I've used it myself, mainly because I also have control over the source data and know that no quotes or escape characters will ever appear. I've included a link to an article on proper CSV parsing; however, I have never tested the source or fully audited the code myself. I have used other code by the same author with success. http://www.boyet.com/articles/csvparser.html
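To make the problem with a plain split concrete, here is a minimal quote-aware splitter for a single CSV line. It is a sketch written for this page, not the code from the linked article, and the CsvLine name is made up. It handles double-quoted fields and doubled quotes inside them, but assumes a record never spans several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits one CSV record, honouring quoted fields.
public static class CsvLine
{
    public static List<string> Split(string line, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is a literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false; // closing quote
                else current.Append(c);
            }
            else if (c == '"') inQuotes = true; // opening quote
            else if (c == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else current.Append(c);
        }

        fields.Add(current.ToString()); // last field, possibly empty
        return fields;
    }
}
```

A plain string.Split(',') would cut a field such as "b,c" in two, whereas this version keeps it as one value.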
The second part is a lot more complex, and was a huge pain for me. The approach I took was to use the Jet driver to treat the Excel file like a database, and then run SQL queries against it. There are a few limitations, which may mean this doesn't fit your goal. I was looking to use prebuilt Excel file templates to display data along with some preset functions and graphs. To accomplish this I have several tabs of report data, and one tab which is raw_data. My program writes to the raw_data tab, and all the other tabs' calculations point to cells in this table. I'll go into some of the reasoning for this behavior after the code:
First off, the imports (not all may be required, this is pulled from a larger class file and I didn't properly comment what was for what):
using System.IO;
using System.Diagnostics;
using System.Data.Common;
using System.Globalization;
Next we need to define the connection string. My class already has a FileInfo reference at this point to the file I want to use, so that's what I pass on. You can search Google for what all the parameters are for, but basically it uses the Jet driver (which should be available on ANY Windows install) to open an Excel file as if you were referring to a database.
string connectString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={filename};Extended Properties=""Excel 8.0;HDR=YES;IMEX=0""";
connectString = connectString.Replace("{filename}", fi.FullName);
Now let's open up the connection to the DB, and be ready to run commands on the DB:
DbProviderFactory factory = DbProviderFactories.GetFactory("System.Data.OleDb");
using (DbConnection connection = factory.CreateConnection())
{
    connection.ConnectionString = connectString;
    using (DbCommand command = connection.CreateCommand())
    {
        connection.Open();
Next we need the actual logic for DB insertion. So basically throw queries into a loop, or whatever your logic is, and insert the data row by row.
        // Every quote inside the C# string literal must be escaped; one statement inserts one row.
        string query = "INSERT INTO [raw_aaa$] (correlationid, ipaddr, somenum) VALUES (\"abcdef\", \"1.1.1.1\", 10)";
        command.CommandText = query;
        command.ExecuteNonQuery();
    }
}
Now here's the really annoying part: the Excel driver tries to detect your column type before the insert, so even if you pass a proper integer value, if Excel thinks the column type is text it will insert all your numbers as text, and it's very hard to get them treated as numbers. As such, Excel must already have the column typed as a number. To accomplish this, in my template file I fill in the first 10 rows with dummy data, so that when the Jet driver loads the file it can detect the proper types and use them. Then all my formulas that point at my CSV table operate properly, since the values are of the right type. This may work for you if your goals are similar to mine and you use templates that already point to this data (just start at row 10 instead of row 2).
Because of this, my raw_aaa tab in excel might look something like this:
correlationid ipaddr somenum
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
Note that row 1 holds the column names that I referenced in my SQL queries. I think you can do without this, but that will require a little more research. Because this data is already in the Excel file, the somenum column will be detected as a number, and any data inserted will be properly treated as such.
Another thing that makes this annoying: the Jet driver is 32-bit only, so in my case, where I had an explicitly 64-bit program, I was unable to execute this directly. So I resorted to the nasty hack of writing to a file and then launching a program that would insert the data from that file into my Excel template.
All in all, I think the solution is pretty nasty, but so far I haven't found a better way to do this, unfortunately. Good luck!

You can take a look at TakeIo.Spreadsheet .NET library. It accepts files from Excel 97-2003, Excel 2007 and newer, and CSV format (semicolon or comma separators).
Example:
var inputFile = new FileInfo("Book1.csv"); // could be .xls or .xlsx too
var sheet = Spreadsheet.Read(inputFile);
foreach (var row in sheet)
{
foreach (var cell in row)
{
// do something
}
}
You can remove beginning and trailing empty rows, and also beginning and trailing columns from the imported data using the Normalize() function:
sheet.Normalize();
Sometimes you can find that your imported data contains empty rows between data, so you can use another helper for this case:
sheet.RemoveEmptyRows();
There is a Serialize() function to convert any input to CSV too:
var outfile = new StreamWriter("AllData.csv");
sheet.Serialize(outfile);
If you like to use comma instead of the default semicolon separator in your CSV file, do:
sheet.Serialize(outfile, ',');
And yes, there is also a ToString() function too...
This package is available at NuGet too, just take a look at TakeIo.Spreadsheet.

You can use ADO.NET
http://vbadud.blogspot.com/2008/09/opening-comma-separate-file-csv-through.html

Well, importing from CSV shouldn't be a big deal. I think the most basic method would be to do it using string operations. You could build a pretty fine parser using simple Split() command, and getting the stuff in arrays.
