This is an old question I posted:
Reading one & Update some other Excel with c#
As suggested, I created a schema.ini file. My Excel files have a large number of columns (many of them not fixed) containing mixed data; a single cell can contain numbers along with text. I observed that NOT ALL values are shown when I read the Excel file using OLEDB and populate a DataTable.
I can't put ALL the columns into the .ini file. The columns in my Excel file go up to "DX". I observed that only the first row that has a number+text value is shown; similar values that appear further down are not shown. They come back as blank.
Here is connection string:
string strConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + FilePath+ "';Extended Properties=\"Excel 12.0;HDR=YES;IMEX=1;TypeGuessRows=0;ImportMixedTypes=Text\"";
Is there any solution so it reads all types of data?
This comes up a lot, and it's very understandable because the documentation is somewhat lacking.
Microsoft.ACE.OLEDB.12.0 does not handle columns of mixed data types very well. The driver always reads the first n values in each column and assigns a data type depending on what it finds in those first n cells. n is determined by the setting of a registry key. Its location depends on whether the driver is the 64-bit or the 32-bit one; for the 32-bit driver on 64-bit Windows (the Wow6432Node branch) the key is in...
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
Sadly it's not always convenient to go around modifying registry keys and it would have been much better to leave this setting on the connection string but it is what it is. The default value for this is 8 rows.
If the driver finds mixed data types then and only then does the setting of IMEX come into play. If IMEX=1 is included then a column of mixed data types is returned as text. If it is not specified then any values which do not correspond to the assigned data type are returned as null.
This is where HDR=No is useful. If you have a header then specify HDR=No and read it. This will help ensure that the column is returned as text as long as your headers are all text as well of course. You can then discard the header before processing the data. It won't help if you have a majority numeric/date time data types in the first n cells of the column.
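The "read the header as data, then discard it" step can be sketched as follows. This is only a minimal illustration, assuming the sheet has already been loaded into a DataTable with HDR=No (so the columns arrive with the driver's default names); the class and method names are made up for the example.

```csharp
using System;
using System.Data;

public static class HeaderPromotion
{
    // Renames each column after the text in the first row, then drops that row.
    // Assumes the table was filled with HDR=No, so row 0 holds the header text.
    public static void PromoteFirstRowToHeader(DataTable table)
    {
        if (table.Rows.Count == 0) return;

        DataRow header = table.Rows[0];
        for (int i = 0; i < table.Columns.Count; i++)
        {
            string name = (Convert.ToString(header[i]) ?? string.Empty).Trim();

            // Keep the existing column name when the header cell is blank
            // or the name is already taken by another column.
            if (name.Length > 0 && !table.Columns.Contains(name))
            {
                table.Columns[i].ColumnName = name;
            }
        }

        // The header row is not data, so remove it before processing.
        table.Rows.RemoveAt(0);
        table.AcceptChanges();
    }
}
```

Call `HeaderPromotion.PromoteFirstRowToHeader(table)` once, straight after filling the table and before any per-row processing.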
As an aside the driver will read all types of Excel files including .xls, .xlsm and .xlsx - there is no need to change the extended properties away from Excel 12.0 to do so. This is a considerable advantage.
The older Microsoft.Jet.OLEDB.4.0 was good in that you could specify TypeGuessRows and ImportMixedTypes in the connection string but Microsoft.ACE.OLEDB.12.0 completely ignores them so you can remove them from your connection string as their presence is misleading. The older driver can only read .xls files.
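For illustration, here is a trimmed-down version of the connection string from the question with the ignored settings removed. It is a sketch rather than a definitive recipe: the helper name is invented, and it simply keeps the quoting style the question already used.

```csharp
public static class ExcelConnection
{
    // Builds an ACE connection string containing only HDR and IMEX.
    // TypeGuessRows and ImportMixedTypes are left out because, as described
    // above, the ACE driver takes them from the registry instead.
    public static string BuildAceConnectionString(string filePath, bool hasHeader)
    {
        string hdr = hasHeader ? "YES" : "NO";
        return "Provider=Microsoft.ACE.OLEDB.12.0;"
             + "Data Source='" + filePath + "';"
             + "Extended Properties=\"Excel 12.0;HDR=" + hdr + ";IMEX=1\"";
    }
}
```

Passing `hasHeader: false` gives the HDR=NO variant, so the header row comes back as the first data row.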
Both drivers will only read 255 columns without amending the SELECT statement. To read more than 255 columns you specify a range. E.g.
Select * From [Sheet1$IV:SP]
will read columns 256-510. If your sheet ends on DX it is well within the 255 column limit.
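If you do need more than 255 columns, the range letters can be worked out in code rather than by hand. The following is a small sketch; the class and method names are made up, and it assumes the 255-column chunk size described above.

```csharp
using System;

public static class ExcelRange
{
    // Converts a 1-based column number to its Excel letters (1 -> A, 256 -> IV).
    public static string ColumnName(int column)
    {
        if (column < 1) throw new ArgumentOutOfRangeException(nameof(column));

        string name = string.Empty;
        while (column > 0)
        {
            int remainder = (column - 1) % 26;
            name = (char)('A' + remainder) + name;
            column = (column - 1) / 26;
        }
        return name;
    }

    // Builds the SELECT for the n-th block of columns (chunkIndex 0 = first block).
    public static string SelectForChunk(string sheetName, int chunkIndex, int chunkSize = 255)
    {
        int first = chunkIndex * chunkSize + 1;
        int last = first + chunkSize - 1;
        return "Select * From [" + sheetName + "$"
             + ColumnName(first) + ":" + ColumnName(last) + "]";
    }
}
```

`ExcelRange.SelectForChunk("Sheet1", 1)` produces the same `[Sheet1$IV:SP]` statement shown above.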
Hidden columns are always returned.
There are a couple of nasties with this driver. Firstly, leading empty rows or columns are ignored completely. This can really mess things up if you are expecting data in specific rows or columns. Secondly, Excel incorrectly treats 29/Feb/1900 as a valid date but OLEDB does not. You can stick 29/Feb/1900 into an Excel spreadsheet just fine but OLEDB will return it as 28/Feb/1900. I can't see anything else it could do really.
The driver is a very handy and cheap way of reading well formatted Excel spreadsheets as long as you are aware of the limitations and can code around them.
Good luck.
Related
So I am new to OleDb and I have a project that requires me to grab data from an Excel file using a Console Application. The Excel file has around 500 columns and 55 rows. How could I possibly get data from columns past 255?
In order to read columns 256 onwards you just need to modify the Select statement. By default both the Microsoft.ACE.OLEDB.12.0 and the Microsoft.Jet.OLEDB.4.0 drivers will read columns 1-255 (A->IU), but you can ask them to read the remaining columns by specifying them in the Select statement.
To read the next 255 columns and taking "Sheet1" as your sheet name you would specify...
Select * From [Sheet1$IV:SP]
This will work even if there aren't another 255 columns. It will simply return the next chunk of columns if there are 1...255 additional columns.
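Once each block of columns has been read into its own DataTable, the blocks need to be stitched back together. The sketch below is one way to do that; the names are invented for the example, and it assumes both blocks contain the same rows in the same order (which may not hold if the driver skips leading empty rows in one block but not the other).

```csharp
using System;
using System.Data;

public static class ChunkMerger
{
    // Returns a copy of 'left' with the columns of 'right' appended, row by row.
    public static DataTable AppendColumns(DataTable left, DataTable right)
    {
        if (left.Rows.Count != right.Rows.Count)
            throw new ArgumentException("Both chunks must have the same number of rows.");

        DataTable merged = left.Copy();
        int offset = merged.Columns.Count;

        foreach (DataColumn column in right.Columns)
        {
            // With HDR=No every chunk starts again at F1, so names can clash;
            // add a numeric suffix until the name is unique.
            string name = column.ColumnName;
            int suffix = 2;
            while (merged.Columns.Contains(name))
            {
                name = column.ColumnName + "_" + suffix++;
            }
            merged.Columns.Add(name, column.DataType);
        }

        // Copy the cell values across by position.
        for (int r = 0; r < right.Rows.Count; r++)
        {
            for (int c = 0; c < right.Columns.Count; c++)
            {
                merged.Rows[r][offset + c] = right.Rows[r][c];
            }
        }
        return merged;
    }
}
```

Copying by position keeps the example short; if the sheet has a reliable key column, joining on that key would be a safer choice.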
Incidentally, the Microsoft.ACE.OLEDB.12.0 driver will read both .xls and any variant of .xlsx, .xlsm etc without changing the extended properties from "Excel 12.0". There is no need to if...then...else the connection string depending on the file type.
The OLEDB driver is pretty good for the most part but it really does rely on well formed sheets. Mixed data types aren't handled terribly well and it does weird things if the first columns/rows are empty but aside from that it's fine. I use it a lot.
I've been requested to import an Excel spreadsheet, which is fine, but I'm getting a problem with importing a certain cell that contains both numeric and alphanumeric characters.
Excel example:
        Col A    Col B             Col C
Row 1   0123     8 Fake Address    CF11 1XX
Row 2   XX123    8 Fake Address    CF11 1XX
As per the example above, when the dataset is being loaded it's treating Row 2, col (A) as a numeric field, resulting in an empty column in the array.
My connection for the OleDb is
var dbImportConn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + dataSource
+ @";Extended Properties=""Excel 8.0;HDR=No;IMEX=1"";");
In this connection I have set IMEX=1, which should parse all contents as strings into the dataset. Also, if I change Row 1 Col (A) to have 'XX123' the entire Col (A) successfully parses as string! Unfortunately this is not going to help my scenario, as the Excel file is passed from an external client who has also advised that they do not have the means to send the file with a header row, which would solve my issue.
My one thought at this point is when I receive the file to edit the file (programmatically) to insert a header but again as the client may change how many columns are contained this would not be a safe option for me.
So basically I need to find a solution for dealing with the current format on the spreadsheet and to pass through all cells into the array. Has anyone come across this issue before ? Or know how to solve this ?
I await your thoughts
Thanks
Scott
ps If this is not clear just shout
Hi. There is a registry setting called TypeGuessRows that you can change to tell Excel to scan the whole column before deciding its type. Currently, it seems, it reads a set number of rows (x) in a column and then decides the type of the column. For example, if your first x rows are integers and row x+1 is a string, the import will fail because it has already decided that this is an integer column. You can change the registry setting so that the whole column is read before the type is decided.
please see this also
http://jingyangli.wordpress.com/2009/02/13/imex1-revisit-and-typeguessrows-setting-change-to-0-watch-for-performance/
This isn't a direct answer, but I would like to recommend you use the Excel Data Reader, which is open source under the LGPL licence and is a lightweight and fast library written in C# for reading Microsoft Excel files ('97-2007).
I have an Excel 2007 workbook that contains tables of data that I'm importing into DataTable objects using ADO.NET.
Through some experimentation, I've managed to find two different ways to indicate that a cell should be treated as "null" by ADO.NET:
The cell is completely blank.
The cell contains #N/A.
Unfortunately, both of these are problematic:
Most of my columns of data in Excel are generated via formulas, but it's not possible in Excel to generate a formula that results in a completely blank cell. And only a completely blank cell will be considered null (an empty string will not work).
Any formula that evaluates to #N/A (either due to an actual lookup error or because the NA() function was used) will be considered null. This seemed like the ideal solution until I discovered that the Excel workbook must be open for this to work. As soon as you close the workbook, OLEDB suddenly starts seeing all those #N/As as strings. This causes exceptions like the following to be thrown when filling the DataTable:
Input string was not in a correct format. Couldn't store <#N/A> in Value Column. Expected type is Int32.
Question: How can I indicate a null value via an Excel formula without having to have the workbook open when I fill the DataTable? Or what can be done to make #N/A values be considered null even when the workbook is closed?
In case it's important, my connection string is built using the following method:
var builder = new OleDbConnectionStringBuilder
{
Provider = "Microsoft.ACE.OLEDB.12.0",
DataSource = _workbookPath
};
builder.Add("Extended Properties", "Excel 12.0 Xml;HDR=Yes;IMEX=0");
return builder.ConnectionString;
(_workbookPath is the full path to the workbook).
I've tried both IMEX=0 and IMEX=1 but it makes no difference.
You're hitting the brick wall that many very frustrated users of Excel are experiencing. Excel as a company tool is widespread and seems quite robust, but because each cell/column/row has a variant data type it is a nightmare to handle with other tools such as MySQL, SQL Server, R, RapidMiner, SPSS and the list goes on. It seems that Excel 2007/2010 is not very well supported, even more so when taking 32/64 bit versions into account, which is scandalous in this day and age.
The main problem is that when ACE/Jet access each field in Excel they use a registry setting 'TypeGuessRows' to determine how many rows to use to assess the datatype. The default for "Rows to Scan" is 8 rows. The registry setting 'TypeGuessRows' can specify an integer value from one (1) to sixteen (16) rows, or you can specify zero (0) to scan all existing rows. If you can't change the registry setting (such as in 90% of office environments) it makes life difficult as the rows to guess are limited to the first 8.
For example, without the registry change
If the first occurrence of #N/A is within the first 8 rows then IMEX = 1 will return the error as a string "#N/A". If IMEX = 0 then an #N/A will return 'Null'.
If the first occurrence of #N/A is beyond the first 8 rows then both IMEX = 0 & IMEX = 1 both return 'Null' (assuming required data type is numeric).
With the registry change (TypeGuessRows = 0) then all should be fine.
Perhaps there are 4 options:
Change the registry setting TypeGuessRows = 0
List all possible type variations in the first 8 rows as 'dummy data' (eg memo fields/nchar(max)/ errors #N/A etc)
Correct ALL data type anomalies in Excel
Don't use Excel - Seriously worth considering!
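If none of those options is available, another approach is to let the column come through as text and convert it yourself, treating Excel error strings such as "#N/A" as null. This is only a sketch under that assumption; the helper name is made up, and it deliberately returns null for anything that is not a whole number, which may or may not be what you want.

```csharp
using System;
using System.Globalization;

public static class CellParser
{
    // Converts a cell value read via OLEDB into a nullable int.
    // DBNull, blanks, Excel error strings (#N/A, #VALUE!, ...) and any other
    // text that is not a whole number all come back as null.
    public static int? ToNullableInt(object cell)
    {
        if (cell == null || cell == DBNull.Value) return null;

        string text = (Convert.ToString(cell, CultureInfo.InvariantCulture) ?? string.Empty).Trim();

        // Excel error values all start with '#'.
        if (text.Length == 0 || text.StartsWith("#", StringComparison.Ordinal)) return null;

        return int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out int value)
            ? value
            : (int?)null;
    }
}
```

With this in place the DataTable column can stay as a string or object column, and the conversion happens when each row is processed, so the "Expected type is Int32" exception from the question is avoided.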
Edit:
Just to put the boot in :) another 2 things that really annoy me are; if the first field on a sheet is blank over the first 8 rows and you can't edit the registry setting then the whole sheet is returned as blank (Many fun conversations telling managers they're fools for merging cells!). Also, if in Excel 2007/2010 you have a department return a sheet with >255 columns/fields then you have huge problems if you need non-contiguous import (eg key in col 1 and data in cols 255+)
First, I want to say that I'm out on deep water here, since I'm just doing some changes to code that is written by someone else in the company, using OleDbDataAdapter to "talk" to Excel and I'm not familiar with that. There is one bug there I just can't follow.
I'm trying to use a OleDbDataAdapter to read in a excel file with around 450 lines.
In the code it's done like this:
connection = new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source='" + path + "';" + "Extended Properties=\"Excel 8.0;HDR=Yes;IMEX=1;\"");
connection.Open();
OleDbDataAdapter objAdapter = new OleDbDataAdapter(objCommand.CommandText, connection);
objAdapter.Fill(objDataSet, "Excel");
foreach (DataColumn dataColumn in objTable.Columns) {
if (dataColumn.Ordinal > objDataSet.Tables[0].Columns.Count - 1) {
objDataSet.Tables[0].Columns.Add();
}
objDataSet.Tables[0].Columns[dataColumn.Ordinal].ColumnName = dataColumn.ColumnName;
objImport.Columns.Add(dataColumn.ColumnName);
}
foreach (DataRow dataRow in objDataSet.Tables[0].Rows) {
...
}
Everything seems to be working fine except for one thing. The second column is filled with mostly four digit numbers like 6739, 3920 and so on, but five rows have alphanumeric values like 8201NO and 8205NO. Those five cells are reported as having blank contents instead of their alphanumeric content. I have checked in Excel, and all the cells in this column are marked as Text.
This is an xls file by the way, and not xlsx.
Does anyone have any clue as to why these cells are shown as blank in the DataRow, but the numeric ones are shown fine? There are other columns with alphanumeric content that are shown just fine.
What's happening is that excel is trying to assign a data type to the spreadsheet column based on the first several values in that column. I suspect that if you look at the properties in that column it will say it is a numerical column.
The problem comes when you start trying to query that spreadsheet using jet. When it thinks it's dealing with a numerical column and it finds a varchar value it quietly returns nothing. Not even a cryptic error message to go off of.
As a possible workaround, can you move one of the alphanumeric values to the first row of data and then try parsing? I suspect you will start getting values for the alphanumeric rows then...
Take a look at this article. It goes into more detail on this issue. it also talks about a possible work around which is:
However, as per JET documentation, we can override the registry setting thru the Connection String, if we set IMEX=1 (as part of Extended Properties), the JET will set the all column type as UNICODE VARCHAR or ADVARWCHAR irrespective of 'ImportMixedTypes' key value.
IMEX=1 means "Read mixed data as text."
There are some gotchas, however. Jet will only use the first several rows to determine whether the data is mixed, and if it so happens that these rows are all numeric, you'll get this behaviour.
See connectionstrings.com for details:
Check out the [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel] located registry REG_DWORD "TypeGuessRows". That's the key to not letting Excel use only the first 8 rows to guess the columns data type. Set this value to 0 to scan all rows. This might hurt performance. Please also note that adding the IMEX=1 option might cause the IMEX feature to set in after just 8 rows. Use IMEX=0 instead to be sure to force the registry TypeGuessRows=0 (scan all rows) to work.
I would advise against using the OleDb data provider stuff to access Excel if you can help it. I've had nothing but problems, for exactly the reasons that others have pointed out. The performance tends to be atrocious as well when you are dealing with large spreadsheets.
You might try this open source solution:
http://exceldatareader.codeplex.com/
At the moment I'm searching for properties for a connection string, which can be used to connect to an Excel file in readonly mode. Searching Google gets me a lot of examples of connection strings, but I can't seem to find a specification of all possibilities in the 'Extended Properties' section of the OleDb connection string.
At the moment I've this:
Provider = Microsoft.Jet.OLEDB.4.0; Data Source = D:\Data\Customers.xls; Extended Properties = 'Excel 8.0; Mode=Read; ReadOnly=true; HDR=Yes';
However... I've composed this by examples. So questions:
1. What is a decent source for OleDb Connection String documentation/reference?
2. Is the above connection string indeed connecting to the Excel file in readonly mode?
Thanks!
I am using a UDL file for that.
Do the following:
Create an empty file test.udl
Open it
You will see the Data Link Properties dialog
On the first tab change the provider to Microsoft.Jet.OLEDB.4.0
On the second tab select your Excel file
On the third tab set the permission to Read
On the last tab set Extended Properties = 'Excel 8.0; HDR=Yes'
Then save, open the file in a text editor and you will see the connection string
You can also check the MSDN article ADO Provider Properties and Settings
COM-based data access to Excel specifications are likely buried in nearly inaccessible Microsoft archive documentation. (Usually served as one enormous PDF)
I added the spec to another answer here: https://stackoverflow.com/a/68912543/6237912
Copied to this answer for completeness:
The connection string has several parts:
Provider: The main OLE DB provider used to open the Excel sheet. This will be Microsoft.Jet.OLEDB.4.0 for the Excel 97 onwards file format and Microsoft.ACE.OLEDB.12.0 for the Excel 2007 or higher file format (the one with the xlsx extension).
Data Source: The entire path of the Excel workbook. You need to specify a DOS path that corresponds to an Excel file, so it will look like: Data Source=C:\testApp.xls.
Extended Properties (Optional): Extended properties can be applied to Excel workbooks and may change the overall behaviour of the Excel workbook from your program. The most common ones are the following:
HDR: Represents the header of the fields in the Excel table. Default is YES. If you don't have field names in the header of your worksheet, you can specify HDR=NO, which will name the columns of the tables that it finds f1, f2 etc.
ReadOnly: You can also open the Excel workbook in read-only mode by specifying ReadOnly=true. By default the ReadOnly attribute is false, so you can modify data within your workbook.
FirstRowHasNames: The same as HDR. It is set to 1 (which means true) by default; you can specify it as false if you don't have a header row. If HDR is YES, the provider disregards this property. You can change the default behaviour of your environment by changing the registry value [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\FirstRowHasNames] to 00 (which is false).
MaxScanRows: Excel does not provide a detailed schema definition of the tables it finds. It needs to scan the rows before deciding the data types of the fields. MaxScanRows specifies the number of cells to be scanned before deciding the data type of the column. By default the value is 8. You can specify any value from 1 to 16 for 1 to 16 rows. You can also set the value to 0 so that it scans all existing rows before deciding the data type. You can change the default behaviour of this property by changing the value of [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows], which is 8 by default. Currently MaxScanRows is ignored, so you can only rely on the TypeGuessRows registry value. Hope Microsoft fixes this issue in later versions.
IMEX: (A Caution) As mentioned above, Excel has to scan a number of rows to select the most appropriate data type for the column, so a serious problem may occur if you have mixed data in one column. Say you have both integer and text data in a single column: in that case Excel will choose the data type based on the majority of the data. It returns the values of the majority data type and returns NULL for the minority data type. If the two types are equally mixed in the column, the provider chooses numeric over text.
For example, in your eight (8) scanned rows, if the column contains five (5) numeric values and three (3) text values, the provider returns five (5) numbers and three (3) null values.
To work around this problem, set "IMEX=1" in the Extended Properties section of the connection string. This enforces the ImportMixedTypes=Text registry setting. You can change the enforced type by changing [HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\ImportMixedTypes] to numeric as well.
Thus if you look into the simple connectionstring with all of them, it will look like:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\testexcel.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE"
or:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\testexcel.xlsx;Extended Properties="Excel 12.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE"
We need to place the extended properties in quotes (") because they contain multiple values.