I want to allow my application to import data from XLS files. I already do this with CSV and XML files, but would like to broaden the options for users. I am having trouble with loading the file. We load the files (XLS, CSV, XML) into a data set and work on it from there. The loading code for XLS is below:
FileInfo fi = new FileInfo(filename);
//create and open a connection with the supplied string
OleDbConnection objOleDBConn;
objOleDBConn = new OleDbConnection(string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties='Excel 8.0;HDR=Yes;IMEX=1'", fi.FullName));
objOleDBConn.Open();
DataTable dt = objOleDBConn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
if (dt == null || dt.Rows.Count == 0)
{
return;
}
string sheet = dt.Rows[0]["TABLE_NAME"].ToString();
//then read the data as usual.
OleDbDataAdapter objOleDBDa;
objOleDBDa = new OleDbDataAdapter(string.Format("select * from [{0}]",sheet), objOleDBConn);
objOleDBDa.Fill(data);
objOleDBConn.Close();
So my data gets loaded OK, but it appears to set the data types of various columns, and this is a problem for one of my columns. It's a bit field and we have chosen to accept False, True, Yes, No, Y, and N. There is code that transfers this into a boolean later on. This works fine in a CSV file (for which the connection string is different) but in an XLS, if the first 10 rows are say FALSE or TRUE, and then say the 11th says YES, then I just get a null entry. I'm guessing that it reads the first few entries and determines the data type based on that?
Question: Is there a way to turn off the mechanism that identifies a column's data type based on the first few entries?
This question is very similar to "Excel cell-values are truncated by OLEDB-provider" and "Excel reading in ASP.NET: Data not being read if column has different data formats". A couple of workable solutions are discussed in those questions.
There is a registry setting that tells the Jet provider how many rows to read when inferring the data type of a column. It defaults to 8, I believe. It is:
HKLM\Software\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
(change the version as applicable). In your case, it has inferred boolean and therefore ignores the string value "yes".
Try this OleDBAdapter Excel QA I posted via stack overflow.
I populated a worksheet column w/ all TRUE or FALSE and then threw in several "yes" or "no" values at random and it worked fine...
Run in Debug mode, then click on the DataSet Visualizer after it's populated to see results.
Or, add this to the end of the code for the output
// DataSet:
Object row11Col3 = ds.Tables["xlsImport"].Rows[11][3];
string rowElevenColumn3 = row11Col3.ToString();
The trick is to include the header line as a row from which the data type is inferred, so that all columns are read as strings. You can then parse each value into the correct data type in code, if you need to, without losing values. For this, use HDR=No:
objOleDBConn = new OleDbConnection(string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties='Excel 8.0;HDR=No;IMEX=1'", fi.FullName));
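Once every column arrives as text, the conversion the question mentions (False, True, Yes, No, Y, N into a boolean) can be done in code. Here is a minimal sketch, assuming a hypothetical helper called FlagParser.ParseFlag that is not part of the original code. It returns null for blank or unrecognised cells rather than guessing:

```csharp
using System;

public static class FlagParser
{
    // Hypothetical helper: maps the accepted spellings to a nullable bool.
    public static bool? ParseFlag(object cell)
    {
        // null and DBNull both mean "no value in the sheet".
        if (cell == null || cell == DBNull.Value)
        {
            return null;
        }

        string text = Convert.ToString(cell).Trim();
        switch (text.ToUpperInvariant())
        {
            case "TRUE":
            case "YES":
            case "Y":
                return true;
            case "FALSE":
            case "NO":
            case "N":
                return false;
            default:
                // Unrecognised text: leave the decision to the caller.
                return null;
        }
    }
}
```

Calling FlagParser.ParseFlag(row[3]) on each row keeps the parsing rules in one place, so adding another accepted spelling later only touches the switch.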
Related
Our Excel 2013 xlsx file has tab "DEPTS" and this tab has a column called "1F/3F". Each cell in this column can have one of these values: "5", "Ati_3", "4", "Btu_4", etc.
Before today, I would move the contents of this tab to a dataset with this straightforward snippet. The dataset viewer would display all rows and all columns:
string connectionString = string.Format(ExcelConnstring, FileName);
string deptsSql = string.Format("SELECT * FROM [{0}$]", "DEPTS");
DataSet deptsDataset = new DataSet();
using (OleDbConnection con = new OleDbConnection(connectionString))
{
con.Open();
OleDbDataAdapter adapter = new OleDbDataAdapter(deptsSql, con);
adapter.Fill(deptsDataset);
con.Close();
}
return deptsDataset;
Today, I tried to upload today's file, which is in exactly the same format. When I look at the contents of the dataset, I notice that the cells in column "1F/3F" that are not numeric are empty. It's reading all 40 rows, but those particular cells whose values could be "Ati_3" or "Btu_4" (i.e. not numeric) are being read as empty. The numeric values are being read correctly.
How can I compare an older file with this file? The file seems to be correct, and I have no idea how to check if something was added to that particular column that would cause the error.
Thanks.
The ADO.NET driver samples the data in each column (the first 8 rows by default) to determine its datatype, which is terrible but it is what it is. If a column has numeric values in those sampled rows, the driver treats it as a numeric column and will read any non-numeric values as null.
Cell formats in the excel document are not honored by the driver. If you want to ensure that the data is read as text, and you have control over the process that generates the excel document, you can force the column to be treated as text by inserting 10 or so dummy values (e.g. 'Ignore') and throw away those rows after you have read the contents.
By ensuring that the first 10 rows of a column contain text, the driver will correctly read the numeric and non-numeric values for that column (they will all be treated as text).
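Throwing the padding rows away after the load can be sketched as follows. This assumes the padding uses the literal marker 'Ignore' from the example above and that the column has been read as text. DummyRowFilter.RemoveDummyRows and its parameters are hypothetical names:

```csharp
using System;
using System.Data;

public static class DummyRowFilter
{
    // Removes every row whose value in columnName equals the marker text.
    // Returns the number of rows removed.
    public static int RemoveDummyRows(DataTable table, string columnName, string marker)
    {
        int removed = 0;

        // Walk backwards so removing a row does not shift the rows still to be checked.
        for (int i = table.Rows.Count - 1; i >= 0; i--)
        {
            object value = table.Rows[i][columnName];
            if (value != DBNull.Value &&
                string.Equals(Convert.ToString(value), marker, StringComparison.OrdinalIgnoreCase))
            {
                table.Rows.RemoveAt(i);
                removed++;
            }
        }

        return removed;
    }
}
```

Checking the returned count against the number of padding rows you expect is a cheap way to notice a file that was generated without them.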
If you cannot control the creation of the file you are going to read you could switch to another technology to read the Excel document. Some alternatives include:
EPPlus (http://epplus.codeplex.com/) - safe to use on a server
Office Automation - not safe to use on a server
Initially I had an issue with the data type "guesses" when dealing with the jet driver (through oledb). If a sheet had mixed types, it would bring in null/empty values.
-Edit-
There is an IMEX setting in the connection string as well as in the registry that will tell jet/ace to use text for columns with multiple data types. This way if the first 6 rows have an integer value and the 7th cell has a text value, there won't be a type cast failure. There is also a setting in the registry (and connection string) that will allow you to say how many rows jet should use for sampling.
-end edit-
I changed the connection string, and the registry settings on the server. So now the program is reading fine. It will read values as text, and not use {n} rows for sampling. I thought it was working fine.
Now I have a data source that lists files in order to be read. If I have multiple files in there, it will have the same type casting issues... or at least the same symptoms. If I upload the files one at a time without using the queue then it works fine. It's when I have multiple files in a row that it seems to have the type casting issue.
I'm not really sure what is causing this to happen when reading multiple files in a row, but not when reading one at a time. The connection opens, reads all the data, and then closes... so I don't think it has to do with that.
I am just looking for any ideas ? It was hard enough to find the original problem. Working with Jet seems to be asking for a butt ache.
Added relevant code as per request
public static readonly String CONNECTION_STRING = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"Excel 12.0 Xml;HDR=YES; ReadOnly=True;IMEX=1;\"";
private System.Data.DataTable Query(String worksheetName, String selectList = "*")
{
DataTable table = new DataTable();
_connection.Open();
var query = String.Format(Constants.DATA_QUERY, selectList, worksheetName);
new OleDbDataAdapter(query, _connection).Fill(table);
_connection.Close();
return table;
}
I'd recommend using a native library if possible, something like Excel Data Reader or EPPlus instead of OLEDB
I found the solution here
https://www.codeproject.com/Tips/702769/How-to-Get-Data-from-Multiple-Workbooks-using-One
Provider setup:
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\path\fileName1.xls;Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1"";"
The SQL Statement must be set like this:
Select * From [Hoja1$]
UNION ALL
Select * From [Hoja1$] IN 'C:\path\fileName2.xls' 'Excel 8.0;HDR=Yes;IMEX=1'
If you want to make an inner join
Select * from [Hoja1$] as a
INNER JOIN (select * from [Hoja1$] IN 'C:\path\fileName2.xls' 'Excel 8.0;HDR=Yes;IMEX=1') as b
ON a.FOLIO=b.FOLIO
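For more than two workbooks, the UNION ALL statement above can be generated rather than typed by hand. This is only a sketch: WorkbookUnion.BuildUnionAllQuery is a hypothetical helper, it assumes every workbook uses the same sheet name, and it refuses paths containing a single quote instead of trying to escape them, since I am not sure how the provider treats escaped quotes inside the IN clause:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WorkbookUnion
{
    // Builds "SELECT ... UNION ALL SELECT ... IN 'file' 'props'" for each extra workbook.
    // The first SELECT reads from the workbook named in the connection string.
    public static string BuildUnionAllQuery(
        string sheetName,
        IEnumerable<string> otherFiles,
        string extendedProperties)
    {
        var sql = new StringBuilder();
        sql.AppendFormat("SELECT * FROM [{0}$]", sheetName);

        foreach (string file in otherFiles)
        {
            // Assumption: reject rather than escape, to avoid building a broken statement.
            if (file.Contains("'"))
            {
                throw new ArgumentException("Path must not contain a single quote: " + file);
            }

            sql.AppendFormat(
                " UNION ALL SELECT * FROM [{0}$] IN '{1}' '{2}'",
                sheetName,
                file,
                extendedProperties);
        }

        return sql.ToString();
    }
}
```

The resulting string is passed to the OleDbDataAdapter in the same way as a single-workbook query.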
I've been asked to import an Excel spreadsheet, which is fine, but I'm getting a problem importing a certain column that contains both numeric and alphanumeric values.
Excel example:
        Col A    Col B             Col C
Row 1   0123     8 Fake Address    CF11 1XX
Row 2   XX123    8 Fake Address    CF11 1XX
As per the example above when the dataset is being loaded its treating Row 2, col (A) as a numeric field resulting in an empty column in the array.
My connection for the OleDb is
var dbImportConn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + dataSource
+ @";Extended Properties=""Excel 8.0;HDR=No;IMEX=1"";");
In this connection I have set IMEX=1, which should parse all contents as strings into the dataset. Also, if I change Row 1 Col (A) to 'XX123', the entire Col (A) successfully parses as string! Unfortunately this is not going to help my scenario, as the Excel file is passed from an external client who have also advised that they do not have the means to send the file with a header row, which would solve my issue.
My one thought at this point is when I receive the file to edit the file (programmatically) to insert a header but again as the client may change how many columns are contained this would not be a safe option for me.
So basically I need to find a solution for dealing with the current format on the spreadsheet and to pass through all cells into the array. Has anyone come across this issue before ? Or know how to solve this ?
I await your thoughts
Thanks
Scott
ps If this is not clear just shout
Hi. There is a registry setting called TypeGuessRows that controls how many rows are scanned before a column's type is decided. Currently, it seems, this is set to read the first x rows of a column and decide the type from those, e.g. if your first x rows are integers and row x+1 is a string, the import will fail because the column has already been classed as an integer column. You can change the registry setting so that the whole column is read before the type is decided.
please see this also
http://jingyangli.wordpress.com/2009/02/13/imex1-revisit-and-typeguessrows-setting-change-to-0-watch-for-performance/
This isn't a direct answer, but I would like to recommend the Excel Data Reader, which is open source under the LGPL licence and is a lightweight and fast library written in C# for reading Microsoft Excel files ('97-2007).
I have got an Excel file in this form :
Column 1    Column 2    Column 3
data1       data2
data1       data2
data1       data2
data1       data2
data1       data2       data3
That is, the whole Column 3 is empty except for the last row.
I am accessing the Excel file via OleDbDataAdapter, returning a DataTable: here's the code.
query = "SELECT * FROM [" + query + "]";
objDT = new DataTable();
objCmdSQL = this.GetCommand();
objCmdSQL.CommandText = query;
objSQLDad = new OleDbDataAdapter(objCmdSQL);
objSQLDad.Fill(objDT);
return objDT;
The point is, in this scenario my code returns a DataTable with just Column 1 and Column 2.
My guess is that JET engine tries to infer column type by the type of the very first cell in every column; being the first value null, the whole column is ignored.
I tried to fill in zeros and this code is actually returning all three columns; this is obviously the least preferable solution because I have to process large numbers of small files.
Inverting the selection range (from, i.e. "A1:C5" to "C5:A1" ) doesn't work either.
I'm looking for something more elegant.
I have already found a couple of posts discussing type mismatch (varchar cells in int columns and vice versa) but actually haven't found anything related to this one.
Thanks for reading!
edit
Weird behavior again. I have to work on mostly Excel 2003 .xls files, but since this question has been answered I thought I could test my code against Excel 2007 .xlsx files.
The connection string is the following:
string strConn = @"Provider=Microsoft.ACE.OLEDB.12.0; Data Source=" + _fileName.Trim() + @";Extended Properties=""Excel 12.0;HDR=No;IMEX=1;""";
I get the "External table is not in the expected format" exception which I reckon is the standard exception when there is a version mismatch between ACE/JET and the file being opened.
The string
Provider=Microsoft.ACE.OLEDB.12.0
means that I am using the most recent version of OLEDB, I took a quick peek around and this version is used everywhere there is need of connecting to .xlsx files.
I have tried with just a vanilla provider ( just Excel 12.0, without IMEX nor HDR ) but I get the same exception.
I am on .NET 2.0.50727 SP2, maybe time to upgrade?
I recreated your situation and the following returned the 3 columns correctly. That is, the first two columns were fully populated with data and the third contained null until the last row, which had data.
string connString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\MyExcel.xls;Extended Properties=""Excel 8.0;HDR=No;IMEX=1"";";
DataTable dt = new DataTable();
OleDbConnection conn = new OleDbConnection(connString);
OleDbDataAdapter adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn);
adapter.Fill(dt);
Note I used the Access Database Engine (ACE) provider, which succeeded the old Joint Engine Technology (JET) provider, and my results may represent a behavior difference between the two. Of course, if you aren't already using it I suggest using the ACE provider, as I believe Microsoft would too. Also, note the connection's Extended Properties:
"HDR=Yes;" indicates that the first row contains column names, not data. "HDR=No;" indicates the opposite.
"IMEX=1;" tells the driver to always read "intermixed" (numbers, dates, strings etc.) data columns as text. Note that this option might affect Excel sheet write access negatively.
Let me know if this helps.
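The extended properties described above can be assembled in one place so that HDR and IMEX are not retyped for every file. This is a sketch under assumptions: ExcelConnectionStrings.Build is a hypothetical helper, it always uses the ACE provider as suggested above, and it uses "Excel 12.0 Xml" for .xlsx files, which is the value commonly paired with ACE for that format:

```csharp
using System;
using System.IO;

public static class ExcelConnectionStrings
{
    // Hypothetical helper: builds an ACE connection string with IMEX=1.
    public static string Build(string path, bool firstRowIsHeader)
    {
        // Pick the Excel format label from the file extension.
        bool isXlsx = string.Equals(
            Path.GetExtension(path), ".xlsx", StringComparison.OrdinalIgnoreCase);
        string format = isXlsx ? "Excel 12.0 Xml" : "Excel 8.0";
        string hdr = firstRowIsHeader ? "Yes" : "No";

        return string.Format(
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1};HDR={2};IMEX=1\";",
            path,
            format,
            hdr);
    }
}
```

For the file used above, ExcelConnectionStrings.Build(@"C:\MyExcel.xls", false) gives the same connection string as the one in my test.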
I need to access an excel spreadsheet and insert the data from the spreadsheet into a SQL Database. However the Primary Keys are mixed, most are numeric and some are alpha-numeric.
The problem I have is that when the numeric and alpha-numeric Keys are in the same spreadsheet the alpha-numeric cells return blank values, whereas all the other cells return their data without problems.
I am using the OleDb method to access the Excel file. After retrieving the data with a Command string I put the data into a DataAdapter and then I fill a DataSet. I iterate through all the rows (dr) in the first DataTable in the DataSet.
I reference the columns by using, dr["..."].ToString()
If I debug the project in Visual Studio 2008 and I view the "extended properties", by holding my mouse over the "dr" I can view the values of the DataRow, but the Primary Key that should be alpha-numeric is {}. The other values are enclosed in quotes, but the blank value has braces.
Is this a C# problem or an Excel problem?
Has anyone ever encountered this problem before, or maybe found a workaround/fix?
Thanks in advance.
Solution:
Connection String:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=FilePath;Extended Properties="Excel 8.0;HDR=Yes;IMEX=1";
HDR=Yes; indicates that the first row contains column names, not data. HDR=No; indicates the opposite.
IMEX=1; tells the driver to always read "intermixed" (numbers, dates, strings etc.) data columns as text. Note that this option might affect Excel sheet write access negatively.
SQL syntax SELECT * FROM [sheet1$]. I.e. excel worksheet name followed by a $ and wrapped in [ ] brackets.
Important:
Check out the REG_DWORD "TypeGuessRows" located under [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel]. That's the key to not letting Excel use only the first 8 rows to guess the column's data type. Set this value to 0 to scan all rows. This might hurt performance.
If the Excel workbook is protected by a password, you cannot open it for data access, even by supplying the correct password with your connection string. If you try, you receive the following error message: "Could not decrypt file."
The Excel data source picks a column type for the entire column. If one of the cells doesn't match that type exactly, it leaves blanks like that. We had issues where our typist entered a " 8" (a space before the number, so Excel converted it to a string for that cell) in a numeric column. It would make sense to me that it would try the .Net Parse methods as they are more robust, but I guess that's not how the Excel driver works.
Our fix, since we were using database import services, was to log all the rows that 'failed' this way. Then, we went back to the XLS document and re-typed those cells, to ensure the underlying type was correct. (We found just deleting the space didn't fix it: we had to Clear the whole cell first, then re-type the '8'.) It feels hacky and isn't elegant, but that was the best method we found. If the Excel driver can't read it in correctly by itself, there's nothing you can do to get that data out of there once you're in .Net.
Just another case where Office hides the important details from users in the name of simplicity, and therefore making it more difficult when you have to be exact for power uses.
The {} means this is some sort of empty object and not a string. When you hover over the object you should be able to see its type. Likewise, when you use quickwatch to view dr["..."] you should see the object type. What type is the object you receive?
The ItemArray is an Object Array. So I assume that the "column" in the DataRow, that I am trying to reference, is of type object.
For VISTA compatibility you can use EXCEL 12.0 driver in connection string. This should resolve your issue. It did mine.
Solution:
You put HDR=No so that the first row is not considered the column header.
Connection String: Provider=Microsoft.Jet.OLEDB.4.0;Data Source=FilePath;Extended Properties="Excel 8.0;HDR=No;IMEX=1";
You ignore the first row and you access the data by any means you want (DataTable, DataReader etc.). You access the columns by numeric index instead of by column name.
It worked for me. This way you don't have to modify the registry!
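If you would still rather address columns by name after reading with HDR=No, the header row can be promoted in code once the table is loaded. This is a sketch, not part of the solution above: HeaderPromotion.PromoteFirstRowToHeaders is a hypothetical helper, and it assumes the first row of the DataTable holds the header text:

```csharp
using System;
using System.Data;

public static class HeaderPromotion
{
    // Renames the columns using the text in the first row, then removes that row.
    public static void PromoteFirstRowToHeaders(DataTable table)
    {
        if (table.Rows.Count == 0)
        {
            return;
        }

        DataRow header = table.Rows[0];
        for (int i = 0; i < table.Columns.Count; i++)
        {
            // Convert.ToString turns DBNull into an empty string.
            string name = Convert.ToString(header[i]).Trim();

            // Keep the existing column name when the header cell is blank
            // or would clash with a name that is already in use.
            if (name.Length > 0 && !table.Columns.Contains(name))
            {
                table.Columns[i].ColumnName = name;
            }
        }

        table.Rows.RemoveAt(0);
    }
}
```

Because the header row was part of the data while the provider was choosing types, the remaining rows stay as text and can be parsed afterwards.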
I answered a similar question here. Here I've copied and pasted the same answer for your convenience:
I had this same problem, but was able to work around it without resorting to the Excel COM interface or 3rd party software. It involves a little processing overhead, but appears to be working for me.
First, read in the data to get the column names.
Then create a new DataSet with each of these columns, setting each of their DataTypes to string.
Read the data in again into this new dataset. Voila - the scientific notation is now gone and everything is read in as a string.
Here's some code that illustrates this, and as an added bonus, it's even StyleCopped!
public void ImportSpreadsheet(string path)
{
    string extendedProperties = "Excel 12.0;HDR=YES;IMEX=1";
    string connectionString = string.Format(
        CultureInfo.CurrentCulture,
        "Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"{1}\"",
        path,
        extendedProperties);
    using (OleDbConnection connection = new OleDbConnection(connectionString))
    {
        using (OleDbCommand command = connection.CreateCommand())
        {
            command.CommandText = "SELECT * FROM [Worksheet1$]";
            connection.Open();
            using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
            using (DataSet columnDataSet = new DataSet())
            using (DataSet dataSet = new DataSet())
            {
                columnDataSet.Locale = CultureInfo.CurrentCulture;
                adapter.Fill(columnDataSet);
                if (columnDataSet.Tables.Count == 1)
                {
                    var worksheet = columnDataSet.Tables[0];
                    // Now that we have a valid worksheet read in, with column names, we can create a
                    // new DataSet with a table that has preset columns that are all of type string.
                    // This fixes a problem where the OLEDB provider is trying to guess the data types
                    // of the cells and strange data appears, such as scientific notation on some cells.
                    dataSet.Tables.Add("WorksheetData");
                    DataTable tempTable = dataSet.Tables[0];
                    foreach (DataColumn column in worksheet.Columns)
                    {
                        tempTable.Columns.Add(column.ColumnName, typeof(string));
                    }
                    adapter.Fill(dataSet, "WorksheetData");
                    if (dataSet.Tables.Count == 1)
                    {
                        worksheet = dataSet.Tables[0];
                        foreach (var row in worksheet.Rows)
                        {
                            // TODO: Consume some data.
                        }
                    }
                }
            }
        }
    }
}
Order the records in the xls file by ASCII code in descending order so that alpha-numeric fields appear at the top, just below the header row. This ensures that the first rows of data read will define the data type as "varchar" or "nvarchar".
Hi all, this code reads the alphanumeric values as well:
using System.Data.OleDb;
string ConnectionString = @"Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + filepath + ";" + "Extended Properties="+(char)34+"Excel 8.0;IMEX=1;"+(char)34;
string CommandText = "select * from [Sheet1$]";
OleDbConnection myConnection = new OleDbConnection(ConnectionString);
myConnection.Open();
OleDbDataAdapter myAdapter = new OleDbDataAdapter(CommandText, myConnection);
ds = null;
ds = new DataSet();
myAdapter.Fill(ds);
This isn't completely right! Apparently, Jet/ACE ALWAYS assumes a string type if the first 8 rows are blank, regardless of IMEX=1. Even when I made the rows read to 0 in the registry, I still had the same problem. This was the only sure fire way to get it to work:
try
{
Console.Write(wsReader.GetDouble(j).ToString());
}
catch //Lame unfixable bug
{
Console.Write(wsReader.GetString(j));
}
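An alternative to catching the exception is to ask the reader for the untyped value and convert that to text, which avoids the cost of a thrown exception on every mismatched cell. This is a sketch, assuming wsReader implements IDataRecord (as OleDbDataReader does). CellReader.ReadAsText is a hypothetical helper, and it does not recover cells that the provider has already returned as null:

```csharp
using System;
using System.Data;
using System.Globalization;

public static class CellReader
{
    // Returns the cell as text whatever type the provider chose for it.
    public static string ReadAsText(IDataRecord record, int ordinal)
    {
        // Null cells become an empty string instead of throwing.
        if (record.IsDBNull(ordinal))
        {
            return string.Empty;
        }

        // GetValue returns the boxed value (double, string, DateTime, ...).
        return Convert.ToString(record.GetValue(ordinal), CultureInfo.InvariantCulture);
    }
}
```

Inside the loop this would replace the try/catch with Console.Write(CellReader.ReadAsText(wsReader, j)).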