CSV column is being returned as null

I am parsing a CSV via the Microsoft.Jet.OLEDB.4.0 provider, which has been working fine for most of our tasks, but recently I've noticed an issue.
I have a CSV with a column called Rating. It is generally an integer, but occasionally it will be "1-2" or a date, e.g. "1/1/2010". The DataTable I am importing it into has had its columns explicitly set to string, but when a non-integer field is read it comes back as null instead.
Any ideas how I can get round this?

Use a schema.ini file (in the folder that contains your .csv) and specify the columns' data types correctly.
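As a rough illustration, here is a small sketch that generates such a schema.ini from C#. The class name, the file name ratings.csv and the column names are made up for the example; the Format, ColNameHeader and ColN=Name Type keys are the usual text-driver entries, but check them against the driver documentation for your setup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class SchemaIniWriter
{
    // Builds the schema.ini section for one CSV file.
    // Each column is declared as "ColN=Name Type", e.g. "Col2=Rating Text".
    public static string BuildSection(string csvFileName,
                                      IEnumerable<(string Name, string Type)> columns)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"[{csvFileName}]");
        sb.AppendLine("Format=CSVDelimited");
        sb.AppendLine("ColNameHeader=True");
        int index = 1;
        foreach (var (name, type) in columns)
        {
            sb.AppendLine($"Col{index}={name} {type}");
            index++;
        }
        return sb.ToString();
    }

    // Writes schema.ini into the folder that contains the CSV.
    // Note: this overwrites any existing schema.ini in that folder.
    public static void WriteNextTo(string csvPath,
                                   IEnumerable<(string Name, string Type)> columns)
    {
        string folder = Path.GetDirectoryName(Path.GetFullPath(csvPath)) ?? ".";
        string section = BuildSection(Path.GetFileName(csvPath), columns);
        File.WriteAllText(Path.Combine(folder, "schema.ini"), section);
    }
}
```

Calling SchemaIniWriter.WriteNextTo(path, new[] { ("Id", "Long"), ("Rating", "Text") }) before opening the connection should make the Rating column come back as text, assuming the column order matches the file.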

Likely what is happening is that the first few fields in the column are being sniffed to determine the data type, and then when later fields in that column have a different type, they're dropped.
I believe you can turn off this behavior by adding IMEX=1 to the Extended Properties in your connection string. This puts the reader into import mode, which reads mixed-type columns as text. You can then go through the data in another pass and set the types yourself.
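A minimal sketch of what that connection string could look like for a CSV, assuming the Jet text driver conventions (Data Source is the folder, and the file name is used as the table name). The class and method names are invented for the example, and whether the text driver honours IMEX=1 is the belief stated above rather than something verified here.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
public static class CsvConnection
{
    // Builds a Jet text-driver connection string for the folder holding the CSV.
    public static string Build(string csvPath, bool hasHeader = true)
    {
        string folder = Path.GetDirectoryName(Path.GetFullPath(csvPath)) ?? ".";
        string hdr = hasHeader ? "Yes" : "No";
        // Extended Properties holds several values, so it is wrapped in double quotes.
        return "Provider=Microsoft.Jet.OLEDB.4.0;" +
               $"Data Source={folder};" +
               $"Extended Properties=\"text;HDR={hdr};FMT=Delimited;IMEX=1\"";
    }

    // With the text driver the CSV file name plays the role of the table name.
    public static string SelectAll(string csvPath) =>
        $"SELECT * FROM [{Path.GetFileName(csvPath)}]";
}
```

The two strings would then be passed to an OleDbConnection and an OleDbDataAdapter in the usual way.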

Related

reading excel with oledb not displaying correct values

This is an old question I posted:
Reading one & Update some other Excel with c#
As suggested, I created a schema.ini file. My Excel files have many columns (many of them are not fixed) with mixed data. A single cell can even contain numbers along with text. I observed that NOT ALL values are shown when I read the Excel file using OLEDB and populate a DataTable.
I can't list ALL the columns in the .ini file, as the columns in my Excel sheet go up to "DX". I observed that only the first row with a number+text value is shown; similar text further down isn't shown and comes back blank.
Here is connection string:
string strConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + FilePath+ "';Extended Properties=\"Excel 12.0;HDR=YES;IMEX=1;TypeGuessRows=0;ImportMixedTypes=Text\"";
Is there any solution so it reads all types of data?

This comes up a lot, and it's very understandable because the documentation is somewhat lacking.
Microsoft.ACE.OLEDB.12.0 does not handle columns of mixed data types very well. The driver always reads the first n values in each column and assigns a data type based on what it finds in those n cells. n is determined by the setting of a registry key. The key's location depends on whether you have a 64 bit implementation or a 32 bit one, but the 64 bit key is in...
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows
Sadly it's not always convenient to go around modifying registry keys and it would have been much better to leave this setting on the connection string but it is what it is. The default value for this is 8 rows.
If the driver finds mixed data types then and only then does the setting of IMEX come into play. If IMEX=1 is included then a column of mixed data types is returned as text. If it is not specified then any values which do not correspond to the assigned data type are returned as null.
This is where HDR=No is useful. If you have a header row, specify HDR=No so that it is read as data. As long as your headers are all text, this helps ensure that the column is returned as text. You can then discard the header row before processing the data. It won't help if the majority of the first n cells in the column are numeric or date/time values.
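For the "discard the header" step, a small sketch that works on an already-filled DataTable. The helper name is made up; it assumes the table was read with HDR=No, so the real header text sits in the first data row and the columns carry driver-generated names.

```csharp
using System;
using System.Data;

// Hypothetical helper, not part of any library.
public static class HeaderPromotion
{
    // Uses the first data row as column names, then removes that row.
    public static void PromoteFirstRow(DataTable table)
    {
        if (table.Rows.Count == 0) return;

        DataRow header = table.Rows[0];
        for (int i = 0; i < table.Columns.Count; i++)
        {
            var name = Convert.ToString(header[i])?.Trim();
            // Skip blank or duplicate names so the rename cannot throw.
            if (!string.IsNullOrEmpty(name) && !table.Columns.Contains(name))
            {
                table.Columns[i].ColumnName = name;
            }
        }

        table.Rows.RemoveAt(0);
        table.AcceptChanges();
    }
}
```

Columns whose header cell is empty or repeats an earlier name keep their generated name, which is a design choice for the sketch rather than anything the driver prescribes.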
As an aside the driver will read all types of Excel files including .xls, .xlsm and .xlsx - there is no need to change the extended properties away from Excel 12.0 to do so. This is a considerable advantage.
The older Microsoft.Jet.OLEDB.4.0 was good in that you could specify TypeGuessRows and ImportMixedTypes in the connection string but Microsoft.ACE.OLEDB.12.0 completely ignores them so you can remove them from your connection string as their presence is misleading. The older driver can only read .xls files.
Both drivers will only read 255 columns without amending the SELECT statement. To read more than 255 columns you specify a range. E.g.
Select * From [Sheet1$IV:SP]
will read columns 256-510. If your sheet ends on DX it is well within the 255 column limit.
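If you need to generate those ranges rather than type them, here is a sketch that turns 1-based column numbers into Excel letters and builds the SELECT for each block of 255 columns. The class and method names are invented for the example.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class ExcelRange
{
    // Converts a 1-based column number to Excel letters: 1 -> A, 26 -> Z, 27 -> AA.
    public static string ColumnLetters(int column)
    {
        if (column < 1) throw new ArgumentOutOfRangeException(nameof(column));
        string letters = "";
        while (column > 0)
        {
            int remainder = (column - 1) % 26;
            letters = (char)('A' + remainder) + letters;
            column = (column - 1) / 26;
        }
        return letters;
    }

    // Builds the SELECT for the n-th block of 255 columns (blockIndex 0 = first block).
    public static string SelectBlock(string sheetName, int blockIndex)
    {
        int first = blockIndex * 255 + 1;
        int last = first + 254;
        return $"SELECT * FROM [{sheetName}${ColumnLetters(first)}:{ColumnLetters(last)}]";
    }
}
```

ExcelRange.SelectBlock("Sheet1", 1) produces the statement shown above, and ExcelRange.ColumnLetters(128) gives DX, which is why a sheet ending on DX fits in a single read.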
Hidden columns are always returned.
There are a couple of nasties with this driver. Firstly, leading empty rows or columns are ignored completely. This can really mess things up if you are expecting data in specific rows/columns. Secondly, Excel incorrectly treats 29/Feb/1900 as a valid date but OLEDB does not. You can stick 29/Feb/1900 into an Excel spreadsheet just fine, but OLEDB will return it as 28/Feb/1900. I can't see anything else it could do, really.
The driver is a very handy and cheap way of reading well formatted Excel spreadsheets as long as you are aware of the limitations and can code around them.
Good luck.

Store values in separate, C# type-specific columns or all in one column?

I'm building a C# project configuration system that will store configuration values in a SQL Server db.
I was originally going to set the table up as such:
KeyId int
FieldName varchar
DataType varchar
StringValue varchar
IntValue int
DecimalValue decimal
...
Values would be stored and retrieved with the value in the DataType column determining which Value column to use, but I really don't like that design. So I thought I'd go this route:
KeyId int
FieldName varchar
DataType varchar
Value varbinary
Here the value in DataType would still determine the type of Value brought back, but it would all be in one column and I wouldn't have to write a ton of overloads to accommodate the different types like I would have with the previous solution. I would just pull the Value in as a byte array and use DataType to perform whatever conversion(s) necessary to get my Value.
Is the varbinary approach going to cause any performance issues or is it just bad practice to drop all these different types of data into a varbinary? I've been searching around for about an hour and I can't get to a definitive answer.
Also, if there is a more preferred method anyone can think of to reach the same conclusion, I'm all ears (or eyes).

You could serialize your settings as JSON and just store that as a string. Then you have all the settings within one row and your clients can deserialize as needed. This is also a safe way to add additional settings at any time without any modifications to your database.
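A minimal sketch of that idea using System.Text.Json. The class name and the setting keys are made up, and it assumes the whole settings blob lives in one string column.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of any library.
public static class SettingsJson
{
    // Turns the whole settings bag into one JSON string for a single column.
    public static string Serialize(Dictionary<string, object> settings) =>
        JsonSerializer.Serialize(settings);

    // Reads one setting back as the type the caller asks for.
    // Throws KeyNotFoundException if the key is absent.
    public static T Get<T>(string json, string key)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement element = doc.RootElement.GetProperty(key);
        return JsonSerializer.Deserialize<T>(element.GetRawText());
    }
}
```

The caller decides the type at read time, e.g. SettingsJson.Get<int>(json, "MaxRetries"), so no DataType column is needed. The trade-off is that individual settings can no longer be queried or constrained by the database without JSON functions.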

We are using the second solution and it works well. Remember that disk access is orders of magnitude slower than, for example, a casting operation (milliseconds vs. nanoseconds, see ref), so do not look for a bottleneck here.
One solution would be to implement a polymorphic association (1, 2), but I don't think there is a need for that, or that you should do it. The second solution is close to a non-SQL db: you can dump anything in as a value, even the entire HTML markup for a page. It should be the caller's responsibility to know what to do with the data.
Also, see threads on how to store settings in DB: 1, 2 and 3 for critique.

Converting logic of DateTime.FromBinary method in TSQL query

I have a table that contains a column with the VARBINARY(MAX) data type. That column holds values of different types from my C# DB layer class: int, string and datetime. Now I need to split that one column into three by type, so values of type int go to a new column ObjectIntValue, and so on for each new column.
But I have a problem moving data to the datetime column, because the old column stores each datetime value as a long obtained from the C# DateTime.ToBinary method when the data was saved.
I have to do this in T-SQL and can't use .NET to convert the values into the new column. Do you have any ideas?
Thanks for any advice!

Use CLR in T-SQL.
Basically, you use CREATE ASSEMBLY to register the DLL with your function(s) in it,
then create a user-defined function to call it, and then you can use it.
There are several rules depending on what you want to do, but since you basically only want DateTime.FromBinary(), it shouldn't be too hard to figure out.
Never done it myself, but these guys seem to know what they are talking about
CLR in TSQL tutorial
This is a one-off conversion, right? Your response to #schglurps is a bit of a concern.
If I understand you, there would have to be a break in your update script: the one you have would work up to the point where you implement this change, then you'd have a one-off procedure for this manoeuvre, and after that you would be updating from a new version.
If you want to validate it, just check for the existence or non-existence of the new columns.
Other option would be to write a wee application that filled in the new columns from the old one and invoke it. Ugh...
If this isn't one off and you want to keep and maintain the old column, then you have problems.
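If the CLR route is not available, the decoding itself is small. Below is a sketch of the arithmetic in C#, written so that each step uses only operations T-SQL also has on bigint (bitwise AND, integer division, modulo). It relies on the documented layout of a 2-bit Kind field plus a 62-bit Ticks field; that the Kind field sits in the top two bits is my understanding of the .NET implementation and should be checked against your data. It assumes the values were saved from DateTimes of kind Utc or Unspecified. Values saved from a Local DateTime carry a time-zone adjustment and are rejected here rather than guessed at. The class and method names are made up.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class BinaryDateDecoder
{
    // Low 62 bits hold the tick count (100 ns units since 0001-01-01).
    private const long TicksMask = 0x3FFFFFFFFFFFFFFF;

    public static DateTime DecodeUtcOrUnspecified(long binary)
    {
        // Top two bits: 0 = Unspecified, 1 = Utc, 2 or 3 = Local (assumption, see lead-in).
        long kindBits = (binary >> 62) & 0x3;
        if (kindBits >= 2)
        {
            throw new NotSupportedException("Local values need time-zone handling.");
        }

        long ticks = binary & TicksMask;

        // Split into whole days and the remainder, which is the step a
        // T-SQL port would feed into date arithmetic on a 0001-01-01 anchor.
        long days = ticks / TimeSpan.TicksPerDay;
        long ticksOfDay = ticks % TimeSpan.TicksPerDay;

        DateTimeKind kind = kindBits == 1 ? DateTimeKind.Utc : DateTimeKind.Unspecified;
        var anchor = new DateTime(1, 1, 1, 0, 0, 0, kind);
        return anchor.AddDays(days).AddTicks(ticksOfDay);
    }
}
```

Before converting the real table, it would be worth running this against a handful of stored values next to DateTime.FromBinary to confirm they agree, and checking whether any stored values are negative, which would indicate Local kind.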

Getting Excel column type as actual underlying value and not formatted value with C#

Simple set up - I have an Excel file which has a column of doubles:
0.94
0.9523
0.9293
The Excel file has this column formatted to be a rounded percentage:
94%
95%
93%
In C#, where I set up an OleDbConnection to query this Excel file, all my values are returned as:
94%
95%
93%
but I need the actual and unrounded values.
My connection string includes the extended properties:
...Extended Properties="Excel 12.0;IMEX=1;HDR=No;TypeGuessRows=0;ImportMixedTypes=Text"
but this doesn't seem to do the trick. So my question is, short of changing the Excel document manually to the proper type how can I get this to return the data the way I need it?
I've heard and read about changing the registry, but this isn't the best option as this will be deployed on multiple machines. Is this the only way of doing what I need?
Thanks in advance!

You almost certainly do not want ImportMixedTypes=Text.
Try it with TypeGuessRows=1;ImportMixedTypes=Majority Types.

Specification of Extended Properties in OleDb connection string?

At the moment I'm searching for properties for a connection string, which can be used to connect to an Excel file in readonly mode. Searching Google gets me a lot of examples of connection strings, but I can't seem to find a specification of all possibilities in the 'Extended Properties' section of the OleDb connection string.
At the moment I've this:
Provider = Microsoft.Jet.OLEDB.4.0; Data Source = D:\Data\Customers.xls; Extended Properties = 'Excel 8.0; Mode=Read; ReadOnly=true; HDR=Yes';
However... I've composed this by examples. So questions:
1. What is a decent source for OleDb Connection String documentation/reference?
2. Is the above connection string indeed connecting to the Excel file in readonly mode?
Thanks!

I use a UDL file for that.
Do the following:
Create an empty file, test.udl
Open it
You will see the Data Link Properties dialog
On the first tab, change the provider to Microsoft.Jet.OLEDB.4.0
On the second tab, select your Excel file
On the third tab, set the permission, e.g. Read
On the last tab, set Extended Properties = 'Excel 8.0; HDR=Yes'
Then save, open the file in a text editor, and you will see the connection string
You can also check the MSDN article ADO Provider Properties and Settings

COM-based data access to Excel specifications are likely buried in nearly inaccessible Microsoft archive documentation. (Usually served as one enormous PDF)
I added the spec to another answer here: https://stackoverflow.com/a/68912543/6237912
Copied to this answer for completeness:
The connectionstring has some parts:
Provider: It is the main oledb provider that is used to open the Excel
sheet. This will be Microsoft.Jet.OLEDB.4.0 for Excel 97 onwards Excel
file format and Microsoft.ACE.OLEDB.12.0 for Excel 2007 or higher
Excel file format (One with xlsx extension)
Data Source: It is the entire path of the Excel workbook. You need to specify a DOS path that corresponds to an Excel file. Thus, it will
look like: Data Source=C:\testApp.xls.
Extended Properties (Optional): Extended properties can be applied to Excel workbooks which may change the overall activity of the Excel
workbook from your program. The most common ones are the following:
HDR: It represents Header of the fields in the Excel table. Default is YES. If you don't have fieldnames in the header of your
worksheet, you can specify HDR=NO which will take the columns of the
tables that it finds as f1,f2 etc.
ReadOnly: You can also open Excel workbook in readonly mode by specifying ReadOnly=true; By default, Readonly attribute is false, so
you can modify data within your workbook.
FirstRowHasNames: It is the same as HDR. It is always set to 1 (which means true); you can specify it as false if you don't have a
header row. If HDR is YES, the provider disregards this property. You can
change the default behaviour of your environment by changing the
Registry Value
[HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\FirstRowHasNames] to 00
(which is false)
MaxScanRows: Excel does not provide a detailed schema definition of the tables it finds. It needs to scan the rows before
deciding the data types of the fields. MaxScanRows specifies the
number of cells to be scanned before deciding the data type of the
column. By default, the value of this is 8. You can specify any value
from 1 - 16 for 1 to 16 rows. You can also set the value to 0 so that
it searches all existing rows before deciding the data type. You can
change the default behaviour of this property by changing the value of
[HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows] which is
8 by default. Currently, MaxScanRows is ignored, so you need only
depend on the TypeGuessRows Registry value. Hope Microsoft fixes this
issue in its later versions.
IMEX: (A Caution) As mentioned above, Excel will have to scan a number of rows to select the most appropriate data type of the
column, so a serious problem may occur if you have mixed data in one
column. Say you have both integer and text data in a single column;
in that case, Excel will choose its data type based on the majority of the
data. Thus it returns the data for the majority data type that is
selected, and returns NULL for the minority data type. If the two
types are equally mixed in the column, the provider chooses numeric
over text.
For example, in your eight (8) scanned rows, if the column contains five (5) numeric values and three (3) text values, the
provider returns five (5) numbers and three (3) null values.
To work around this problem for data, set "IMEX=1" in the Extended Properties section of the connection string. This enforces
the ImportMixedTypes=Text registry setting. You can change the
enforcement of type by changing
[HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\ImportMixedTypes] to
numeric as well.
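To make the rules above easier to reason about, here is a toy model of them in C#. It is not the driver's code and makes no claim to match every edge case. It only encodes what is stated in the text: sample the first TypeGuessRows cells (0 meaning all), pick the majority type with ties going to numeric, return text for a mixed sample when IMEX=1, and return null for cells that do not fit the chosen type. All names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Toy model only; names and behaviour are illustrative assumptions.
public static class TypeGuessModel
{
    private static bool IsNumeric(string cell) =>
        double.TryParse(cell, NumberStyles.Float, CultureInfo.InvariantCulture, out _);

    // Returns each cell as a double, a string, or null.
    public static List<object> ReadColumn(IReadOnlyList<string> cells, bool imex,
                                          int typeGuessRows = 8)
    {
        // 0 means "scan every row" in the description above.
        var sample = (typeGuessRows == 0 ? cells : cells.Take(typeGuessRows)).ToList();
        int numeric = sample.Count(IsNumeric);
        int text = sample.Count - numeric;

        bool mixed = numeric > 0 && text > 0;
        // Mixed sample with IMEX=1 gives text; otherwise the majority wins
        // and a tie goes to numeric.
        bool columnIsText = mixed ? (imex || text > numeric) : numeric == 0;

        var result = new List<object>();
        foreach (string cell in cells)
        {
            if (columnIsText)
                result.Add(cell);
            else if (IsNumeric(cell))
                result.Add(double.Parse(cell, NumberStyles.Float, CultureInfo.InvariantCulture));
            else
                result.Add(null); // value does not fit the guessed type
        }
        return result;
    }
}
```

With five numeric and three text cells and imex set to false, the model returns five numbers and three nulls, matching the example above. With imex set to true, all eight come back as strings. A column whose first eight cells are all numeric still loses later text values even with imex set to true, because the sample was never mixed.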
Thus if you look into the simple connectionstring with all of them, it will look like:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\\testexcel.xls;
Extended Properties=\"Excel 8.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE\""
or:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\\testexcel.xlsx;
Extended Properties=\"Excel 12.0;HDR=YES;IMEX=1;MAXSCANROWS=15;READONLY=FALSE\""
We need to place extended properties into Quotes(") as there are multiple number of values.
