C#: Reading data from an xls document

I am currently working on a project for traversing an excel document and inserting data into a database using C#.
The relevant data for this project is:
The excel sheet has 14 rows at the top that I do not care about. (sometimes 15, see Russia/Siberia below)
The data is grouped by name into 2 columns (date and value), such as:
Sheet 1
USA              China             Russia
Date    Value    Date     Value    Siberia
1/1/09  4.3654   1/1/09   2.7456   Date    Value
1/2/09  3.5545   1/3/09   9.3214   2/5/09  0.2454
1/3/09  3.2322   1/21/09  5.2234   2/6/09  0.5557
The name I need to acquire is whichever is listed directly above "Date".
I only care about data from dates we do not have in the database. Before each column set is parsed, I will acquire the max date for any given name from the database, and skip anything at or before it.
There is no guarantee that the columns will be in a constant order or have constant spacing.
I do not want data for all names, rather only those in a list I put together before the file is acquired.
My current plan is this:
For each column, if the date field is at row 16, save the name as the value in row 15 above it, check the database for the last date for that name, only insert data where the date is greater than the acquired date.
If the date field is at row 17, do the same thing, but start the for loop through each row at 18.
If the name is not in the list, skip the column. If it is, make sure to grab the column next to it for the necessary values.
My problem is:
I am currently trying to use the ExcelDataReader from Codeplex (http://www.codeplex.com/ExcelDataReader). It only handles CSV-like sheets, which the sheet in this project is not.
I do not know of any alternative Excel readers.
To the best of my knowledge, a straight FileStream traversal of this file can only go row-by-row, rather than column-by-column.
To anyone still reading, thank you for your time. Any recommendations on how to proceed? Please ensure that solutions can traverse each column, not each row.
Also, please don't worry about the database stuff, or the list of names that precedes the traversal.
Addendum: What I'd really like to end up with is some type of table that I can just traverse with a nested loop, making column-centric traversal much, much easier. Because there is so much garbage near the top of the sheet (14+ rows), most simple solutions are not feasible.
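Whichever reader ends up loading the file, the plan described above can be sketched against a plain in-memory grid. The sketch below assumes the worksheet has already been copied into a string[,] (how that happens depends on the library and is not shown). ColumnScanner, Reading and FindSeries are hypothetical names made up for illustration, and the "M/d/yy" date format is an assumption based on the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ColumnScanner
{
    // One (date, value) pair read from a name's two columns.
    public sealed class Reading
    {
        public DateTime Date;
        public double Value;
    }

    // Walks the grid column by column. For each column it looks for a "Date"
    // header in one of the candidate rows (0-based indexes), takes the name
    // from the cell directly above it, and reads the date/value pairs below.
    // Names not in wantedNames are skipped, as are dates at or before the
    // last date already stored for that name.
    public static Dictionary<string, List<Reading>> FindSeries(
        string[,] grid, int[] headerRows,
        ISet<string> wantedNames, IDictionary<string, DateTime> lastSeen)
    {
        var result = new Dictionary<string, List<Reading>>();
        int rows = grid.GetLength(0), cols = grid.GetLength(1);

        for (int c = 0; c < cols - 1; c++)          // value column is c + 1
        {
            foreach (int h in headerRows)
            {
                if (h < 1 || h >= rows) continue;
                string header = grid[h, c]?.Trim();
                if (!string.Equals(header, "Date", StringComparison.OrdinalIgnoreCase)) continue;

                string name = grid[h - 1, c]?.Trim();
                if (string.IsNullOrEmpty(name) || !wantedNames.Contains(name)) break;

                DateTime cutoff;
                if (!lastSeen.TryGetValue(name, out cutoff)) cutoff = DateTime.MinValue;

                var readings = new List<Reading>();
                for (int r = h + 1; r < rows; r++)
                {
                    DateTime date;
                    double value;
                    // Rows that do not hold a parseable date and number are ignored.
                    if (!DateTime.TryParseExact(grid[r, c], "M/d/yy", CultureInfo.InvariantCulture,
                                                DateTimeStyles.None, out date)) continue;
                    if (!double.TryParse(grid[r, c + 1], NumberStyles.Float,
                                         CultureInfo.InvariantCulture, out value)) continue;
                    if (date > cutoff) readings.Add(new Reading { Date = date, Value = value });
                }
                result[name] = readings;
                break;                               // header found, move to next column
            }
        }
        return result;
    }
}
```

For the real sheet, headerRows would be { 15, 16 } (the 0-based indexes of rows 16 and 17), and lastSeen would hold the max date per name fetched from the database.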

If you want to read from Excel in C#, I've used this library with great success; it gives you the flexibility to parse columns and rows however you like:
http://sourceforge.net/projects/koogra/ (read-only)
Other open source libraries I haven't used but that could be good:
http://nexcel.sourceforge.net/ (read-only)
http://npoi.codeplex.com/ (can read and write)
http://developer.novell.com/wiki/index.php/Poi.Net (this project is dead)
Alternatively, you can use one of the many good Java libraries, and convert it into a C# assembly using IKVM:
http://jxls.sourceforge.net/
http://www.andykhan.com/jexcelapi/
http://poi.apache.org/ (this one's the grand-daddy of java XLS libraries)
I've covered how to do the IKVM Java -> C# conversion here (it's really not as horrible an option as you think):
http://splinter.com.au/blog/?p=207

Not a straight answer to your question but an alternative idea:
Your data looks like a pivot-style table. I'd recommend "unpivoting" it into a simple table.
Example:
        Russia  USA
Q1      123     323
Q2      456     321
Q3      567     843
Becomes:
Quarter  Country  Value
Q1       Russia   123
Q1       USA      323
Q2       Russia   456
....
If that is the case (I'm not sure I understood your question correctly), then processing the data with an OleDB driver or any CSV-style tooling should become much less painful.
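As a rough illustration of the unpivot step, here is a minimal sketch. It assumes the wide table is already in memory as a jagged string array whose first row holds the column names and whose first column holds the row keys. Unpivot and ToLong are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class Unpivot
{
    // Turns a wide table (first row = column names, first column = row keys)
    // into one (rowKey, columnName, value) tuple per data cell.
    public static List<Tuple<string, string, string>> ToLong(string[][] wide)
    {
        var rows = new List<Tuple<string, string, string>>();
        if (wide.Length == 0) return rows;

        string[] header = wide[0];
        for (int r = 1; r < wide.Length; r++)
        {
            // Column 0 is the row key, so data cells start at column 1.
            for (int c = 1; c < wide[r].Length && c < header.Length; c++)
            {
                rows.Add(Tuple.Create(wide[r][0], header[c], wide[r][c]));
            }
        }
        return rows;
    }
}
```

Each tuple corresponds to one row of the Quarter / Country / Value table above, so the result can be inserted into a database table as-is.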

You can access Excel directly using ADO.NET via the ODBC driver. See http://www.davidhayden.com/blog/dave/archive/2006/05/26/2973.aspx or Google for more info on how to do that. You may wish to try HDR=No in your connection string, since your first row isn't really proper headers by the looks of it.
I haven't done this for a while, but I remember that it is a bit "temperamental" and takes some playing around with to get the column names right, but it should work. Try SELECT * FROM [Sheet1$] and see what you get.

I highly recommend saving this Excel document in CSV format before doing anything else with it. You can do so using this code
After you have a CSV, you can either parse it using that library, or write your own parser for it.

As I did before, I prefer to use OLEDB connection in order to connect to an Excel document.
By the way, you can take a look at the following article for more information:
http://www.codeproject.com/KB/office/excel_using_oledb.aspx

SpreadsheetGear for .NET can load workbooks and access any cells on any sheet in any order. You can get the formatted text of the cell (such as "1/1/09") or the underlying value ("1/1/09" is stored as the double 39814.0 in Excel or SpreadsheetGear).
You can see some live ASP.NET samples here and download the free trial here if you want to try it yourself.
Disclaimer: I own SpreadsheetGear LLC
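If a reader hands back the underlying double rather than the formatted text, the .NET Framework can convert it without any third-party code. This is only a sketch: ExcelDates and FromSerial are hypothetical names, and it assumes the workbook uses Excel's default 1900 date system, where serial numbers from 1 March 1900 onward match OLE Automation dates.

```csharp
using System;

public static class ExcelDates
{
    // Converts an Excel serial date (1900 date system assumed) to a DateTime.
    // DateTime.FromOADate uses the same day numbering for any date from
    // 1 March 1900 onward; the fractional part is the time of day.
    public static DateTime FromSerial(double serial)
    {
        return DateTime.FromOADate(serial);
    }
}
```

With the value mentioned above, ExcelDates.FromSerial(39814.0) gives 1 January 2009.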

Related

How to generate and understand a list of field names in a UniData table

I'm new to both UniData and Uniobjects so if I ask something that obvious I apologize.
I'm trying to write a tool that will let me export contacts from our ERP (Manage2000) that runs on UniData (v. 6.1) and can then import them into AD/Exchange.
The primary issue I'm having is that I don't know which fields (columns?) in the table (file?) are for what. I know that there is a dictionary that has this information in it, but I'm not sure how to get what I want out of it.
I found that there is a command LIST.METADATA in the current UniData documentation from Rocket but it seems that either the version of UniData that we are using is so old that it doesn't have this command in it or it was removed from the VOC file for some unknown reason.
Does anyone know how or have any tips to pull out the structure of a table so that I can know which fields are for what data?
Thanks in advance!

At TCL:
LIST DICT contact.master
Please note that the database file name (EX: contact.master) is case sensitive. I don't have a UniData instance at the moment to provide an example output. However, it should be similar to Universe's output:
Field......... Type & Field........ Conversion.. Column......... Output Depth &
Name.......... Field. Definition... Code........ Heading........ Format Assoc..
               Number
AMOUNT.WEBB    A    1               MR22         Amt WEBB        10R    M
PANDAS.COST    A    3               MD2Z         Pandass Cost    10R    M
CREDIT.EXP.DT  A    6               D4/          Cred Exp Date   10R    M
For the example above, you can generally tell the "data type" of the field by looking at the conversion code. "D4/" is the conversion code for a date. "MD2Z" is a numeric conversion code, which we can guess is for monetary amounts. I'm glossing over the power of conversion codes, so please make sure to reference Rocket's documentation for these codes to truly understand what these fields would output. If you don't have the documentation in front of you, you can also reference this site:
http://www.koretech.com/kr_help/KU2/30/en/KMK_Prog_Conversions.htm
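To make the conversion codes a little more concrete for someone exporting raw values into another system, here is a small sketch of what the two kinds of code imply about the stored data. It assumes the usual U2 conventions: dates are stored as a day count where day 0 is 31 December 1967, and an MD2 field is stored as an integer with two implied decimal places. U2Conversions and its methods are hypothetical helper names, not part of UniObjects.

```csharp
using System;

public static class U2Conversions
{
    // Day 0 of the internal date format (assumed: 31 December 1967).
    private static readonly DateTime DayZero = new DateTime(1967, 12, 31);

    // Turns the internal day count of a "D"-converted field into a DateTime.
    public static DateTime FromInternalDate(int internalDays)
    {
        return DayZero.AddDays(internalDays);
    }

    // Turns the stored integer of an "MD2"-converted field into a decimal
    // by shifting the decimal point two places to the left.
    public static decimal FromMd2(long stored)
    {
        return stored / 100m;
    }
}
```

For example, an internal date of 10000 corresponds to 18 May 1995, and a stored MD2 value of 12345 represents 123.45. Check the dictionary and Rocket's documentation before relying on this for a particular field.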
If you wanted to use UniObjects and C# to retrieve the field names in a file, you could use the following code:
UniCommand fieldSelectCommand = activeSession.CreateUniCommand();
fieldSelectCommand.Command = "SELECT DICT contact.master";
fieldSelectCommand.Execute();
UniSelectList resultList = activeSession.CreateUniSelectList(0);
String[] allFieldNames = resultList.ReadListAsStringArray();
Having answered your question, I would also like to make a recommendation that you check out Rocket's U2 Toolkit for .NET if you're mostly going to be selecting data from the database instead of reading and manipulating individual records:
http://www.rocketsoftware.com/products/rocket-u2-toolkit-net
Not only does it present an ADO.NET way of accessing the database, it also has a better performance version of the UniObjects library under the U2.Data.Client.UO namespace.

The Dictionary, in my opinion, is a recommendation of how the schema should behave. However, there are cases when it's not 100% accurate. You could run "LIST CONTACT.MASTER TOXML TO MYFILE.XML", which would create an XML file that you could parse.
See https://u2devzone.rocketsoftware.com/accelerate/articles/u2-xml/u2-xml#section-0 for more information.

Inserting datafiles into a SQL Server database (no separators)

I've tried researching this question, but most of the answers are for .csv files, which does not help me a lot.
I have a couple of large .dat files containing quite a lot of data (each file around 700MB), and I am trying to develop software in C# where I will be able to search for a specific string and locate the line where it is (duplicates will occur, so a listview / listbox might be a good idea).
Every line follows the exact same data format and the starting index/length of each datatype is well documented.
Example:
Line 1: ZATIXIZ20SWEDENSTACKOVERFLOWCHROME
Documented like this:
Username: 0-6
Age: 7-8
Country: 9-14
Website: 15-27
Browser: 28-33
My guess is that the best approach would be to do some kind of BULK INSERT on the data files into a database and then index it for faster searching later on. I am not quite sure how to do this though, nor what the best approach would be. (It also needs to search through all of the files so maybe it could be a good idea to insert them all into the same table?)
So far I have only tried to read one of the files into memory and then do a simple Regex which of course was not a good idea. Unfortunately I am a bit inexperienced with SQL queries which is why I have not tried a lot yet.
Thanks in advance!

Insert all data of the same type into a table with indexed columns.
If the properties vary between each file, use multiple tables.
If you want to be able to trace the match back to the original file, use a table with columns:
Key - Internal key from a sequence
FileName - So you know where it came from
Line - The line number
Username
Age
Country
Website
Browser
Where FileName, Line is a unique key.
Here is a link to an article on full-text searching in MSSQL, as we don't know which RDBMS you are using: http://msdn.microsoft.com/en-us/library/ms142571.aspx#queries
From your example, the line 'ZATIXIZ20SWEDENSTACKOVERFLOWCHROME' becomes:
| Key | FileName    | Line | Username  | Age | CountryKey | Website         | BrowserKey |
| 1   | 'Data1.dat' | 1    | 'ZATIXIZ' | 20  | 46         | 'STACKOVERFLOW' | 4          |
In this example, you'd need two more tables: Countries and Browsers. These are optional, as you could just include the information directly in the main table.
I must stress, though, that it really depends on how you wish to query this data. The above structure gives you the opportunity to search for 'all Swedish users between 20 and 25' by performing the following query:
select * from TABLENAME where Age < 25 and Age >= 20 and CountryKey = 46
As for how you import a fixed-width file, it depends greatly on your RDBMS. If you're using Oracle, you can use SQL*Loader. Remember that it does not necessarily have to be a single-stage process. You can load the data into the tables and then look up the keys internally after the initial import.
For MSSQL, here is another answer from the stack: Bulk insert fixed width fields
You can also preprocess it in .NET. Again, it depends on your scenario. If you are piping these into your system at a rate of one 900MB file every 10 minutes, you're looking at some serious optimization of the bulk load process (and some serious hardware). But if you only need to load this file once a month, the .NET approach is absolutely fine, even though it may take a few hours.
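For the .NET preprocessing route, slicing one fixed-width line could look roughly like the sketch below. The offsets come from the layout documented in the question (inclusive ranges 0-6, 7-8, 9-14, 15-27, 28-33); UserRecord and Parse are hypothetical names. In practice you would read the file line by line (for example with File.ReadLines) and hand the parsed rows to whatever bulk-load mechanism your database offers.

```csharp
using System;
using System.Globalization;

public sealed class UserRecord
{
    public string Username;
    public int Age;
    public string Country;
    public string Website;
    public string Browser;

    // Slices one fixed-width line. The documented ranges are inclusive,
    // so "0-6" becomes Substring(0, 7), "7-8" becomes Substring(7, 2), etc.
    public static UserRecord Parse(string line)
    {
        if (line == null || line.Length < 34)
            throw new FormatException("Expected a line of at least 34 characters.");

        return new UserRecord
        {
            Username = line.Substring(0, 7).Trim(),
            Age = int.Parse(line.Substring(7, 2), CultureInfo.InvariantCulture),
            Country = line.Substring(9, 6).Trim(),
            Website = line.Substring(15, 13).Trim(),
            Browser = line.Substring(28, 6).Trim()
        };
    }
}
```

The Trim() calls assume that shorter values are padded with spaces, which the question does not state.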

data extraction from .rpt file to copy in database in PostgreSQL 9.0

I have a report file(.rpt) having text as shown below and this .rpt file get updated every day.
Datum/Uhrzeit,Sta.,Bez.,Unit,TBId,Batch,OrderNr,Mat1,Total1,Mat2,Total2,Mat3,Total3,Mat4,Total4,Mat5,Total5,Mat6,Total6,Summe
41521.755934(04.09.13 18:08:32),TB01,TB01,005,300,9663, ,2,27313.63,0,0.00,0,0.00,3,1776.19,0,0.00,0,0.00,29089.82
41521.797601(04.09.13 19:08:32),TB01,TB01,005,300,9682, ,2,27365.98,0,0.00,0,0.00,3,1780.86,0,0.00,0,0.00,29146.85
41521.839269(04.09.13 20:08:32),TB01,TB01,005,300,9701, ,2,27418.34,0,0.00,0,0.00,3,1785.53,0,0.00,0,0.00,29203.88
41521.880937(04.09.13 21:08:33),TB01,TB01,005,300,9721, ,2,27473.31,0,0.00,0,0.00,3,1790.40,0,0.00,0,0.00,29263.71
41521.922606(04.09.13 22:08:33),TB01,TB01,005,300,9741, ,2,27528.53,0,0.00,0,0.00,3,1795.30,0,0.00,0,0.00,29323.83
41521.964274(04.09.13 23:08:33),TB01,TB01,005,300,9760, ,2,27580.88,0,0.00,0,0.00,3,1799.97,0,0.00,0,0.00,29380.84
41522.005942(05.09.13 00:08:33),TB01,TB01,005,300,9780, ,2,27636.00,0,0.00,0,0.00,3,1804.86,0,0.00,0,0.00,29440.86
Need to extract first and last reading values of every row and need to put that reading in database table.
first reading -- Datum/Uhrzeit
last reading -- Summe
I have used COPY command also but it doesn't take the first value. I want to know which data type to use to extract this value (it is not in normal date formats)???
Also is it not possible just to take these two readings only out of this file and not the whole 20 readings?? Is there any such method available??
I am using PostgreSQL 9.0
Any help would be great.

Assuming that "reading" = "column":
You will need to COPY to a TEMPORARY table where the first column is of type text, not date, since that is an invalid date format.
Then you can do an INSERT INTO real_table (col1, col2, ...) SELECT some_func(thedate), col2, col3... FROM temptable to transform the temp table contents into correct date data using appropriate SQL and insert it into the real target table.
There are lots of existing examples of this on Stack Overflow, though not for your particular date format. I'm guessing that the date in the parens (...) is the date you want, and that the numbers before are a representation of that date as days since epoch + time since start of day. It'll be easier to just parse the date part, which you can do with:
SELECT to_timestamp(substring('41521.880937(04.09.13 21:08:33)' from '\(.*\)' ), '(DD.MM.YY HH24:MI:SS)');
so that's your some_func for the above.
As for taking only the two desired columns, I already explained that to you before so I'm not going to repeat myself. Short version: Use an ETL tool, re-export the CSV with just those columns, or use a filter program to limit the input.
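A filter program of the kind mentioned above does not need to be elaborate. The following is a sketch only, written in C# as one possible choice. It assumes every data line starts with a serial number followed by the timestamp in parentheses, exactly as in the sample, and RptFilter and Reduce are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class RptFilter
{
    // Reduces one line of the report to "timestamp,Summe".
    // Returns null for the header line and for anything that does not
    // start with a "(dd.MM.yy HH:mm:ss)" timestamp in the first column.
    public static string Reduce(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return null;

        string[] fields = line.Split(',');
        if (fields.Length < 2) return null;

        string first = fields[0];
        int open = first.IndexOf('(');
        int close = first.IndexOf(')');
        if (open < 0 || close <= open) return null;

        string stamp = first.Substring(open + 1, close - open - 1);
        DateTime when;
        if (!DateTime.TryParseExact(stamp, "dd.MM.yy HH:mm:ss", CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out when))
            return null;

        // ISO-style output that COPY can load into a timestamp column.
        return when.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture)
               + "," + fields[fields.Length - 1].Trim();
    }
}
```

Running every line of the .rpt file through Reduce and writing the non-null results to a new file leaves a two-column CSV that COPY can load directly.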

Creating an ETL system (Data import and transformation)

I have been tasked to write a module for importing data into a client's system.
I thought to break the process into 4 parts:
1. Connect to the data source (SQL, Excel, Access, CSV, ActiveDirectory, Sharepoint and Oracle) - DONE
2. Get the available tables/data groups from the source - DONE
i. Get the available fields from the selected table/data group - DONE
ii. Get all data from the selected fields - DONE
3. Transform data to the user's requirements
4. Write the transformed data to the MSSQL target
I am trying to plan how to handle complex data transformations like:
Get column A from Table tblA, inner joined to column FA from table tblB, and concatenate these two with a semicolon in between.
OR
Get column C from table tblC on source where column tblC.D is not in table tblG column G on target database.
My worry is not the visual, but the representation in code of this operation.
I am NOT asking for sample code, but rather for some creative ideas.
The data transformation will not be with free text, but drag and drop objects that represent actions.
I am a bit lost, and need some fresh input.

maybe you can grab some ideas from this open source project: Rhino ETL.

See my answer: Manipulate values in a datatable?

C# Excel import data from CSV into Excel

How do I import data in Excel from a CSV file using C#? Actually, what I want to achieve is similar to what we do in Excel, you go to the Data tab and then select From Text option and then use the Text to columns option and select CSV and it does the magic, and all that stuff. I want to automate it.
If you could head me in the right direction, I'll really appreciate that.
EDIT: I guess I didn't explain this well. What I want to do is something like
Excel.Application excelApp;
Excel.Workbook excelWorkbook;
// open excel
excelApp = new Excel.Application();
// something like
excelWorkbook.ImportFromTextFile(); // is what I need
I want to import that data into Excel, not my own application. As far as I know, I don't think I would have to parse the CSV myself and then insert them in Excel. Excel does that for us. I simply need to know how to automate that process.

I think you're overcomplicating things. Excel automatically splits data into columns by comma delimiters if it's a CSV file. So all you should need to do is ensure your extension is CSV.
I just tried quickly opening a file in Excel and it works fine. So what you really need is just to call Workbooks.Open() with a file that has a CSV extension.

You could open Excel, start recording a macro, do what you want, then see what the macro recorded. That should tell you what objects to use and how to use them.

I believe there are two parts. One is the split operation for the CSV, which the other responder has already picked up on; I don't think it is essential, but I'll include it anyway. The big one is writing to the Excel file, which I was able to get working, but only under specific circumstances, and it was a pain to accomplish.
CSV is pretty simple: you can do a string.Split on a comma separator if you want. However, this method is horribly broken, although I'll admit I've used it myself, mainly because I also had control over the source data and knew that no quotes or escape characters would ever appear. I've included a link to an article on proper CSV parsing; however, I have never tested the source or fully audited the code myself. I have used other code by the same author with success. http://www.boyet.com/articles/csvparser.html
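To show why a plain split breaks and what the minimum fix looks like, here is a small quote-aware splitter. It is a sketch under stated assumptions, not the parser from the linked article: it handles double-quoted fields and doubled quotes ("") inside them, but not fields that span several lines. CsvLine and Split are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits a single CSV record into fields.
    // A separator inside double quotes is kept as part of the field,
    // and "" inside a quoted field is read as one literal quote.
    public static List<string> Split(string line, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char ch = line[i];
            if (inQuotes)
            {
                if (ch != '"')
                {
                    current.Append(ch);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');   // escaped quote
                    i++;
                }
                else
                {
                    inQuotes = false;      // closing quote
                }
            }
            else if (ch == '"')
            {
                inQuotes = true;           // opening quote
            }
            else if (ch == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }
        fields.Add(current.ToString());    // last field (may be empty)
        return fields;
    }
}
```

A line such as a,"b,c",d yields three fields here, where string.Split(',') would yield four.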
The second part is a lot more complex, and was a huge pain for me. The approach I took was to use the Jet driver to treat the Excel file like a database, and then run SQL queries against it. There are a few limitations, which may mean this does not fit your goal. I was looking to use prebuilt Excel file templates to display data along with some preset functions and graphs. To accomplish this I have several tabs of report data, and one tab which is raw_data. My program writes to the raw_data tab, and the calculations on all the other tabs point to cells in this table. I'll go into some of the reasoning for this behavior after the code:
First off, the imports (not all may be required, this is pulled from a larger class file and I didn't properly comment what was for what):
using System.IO;
using System.Diagnostics;
using System.Data.Common;
using System.Globalization;
Next we need to define the connection string. My class already has a FileInfo reference (fi) at this point to the file I want to use, so that's what I pass on. You can search Google for what all the parameters mean, but basically this uses the Jet driver (should be available on ANY Windows install) to open an Excel file as if you were referring to a database.
string connectString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={filename};Extended Properties=""Excel 8.0;HDR=YES;IMEX=0""";
connectString = connectString.Replace("{filename}", fi.FullName);
Now let's open up the connection to the DB, and be ready to run commands on the DB:
DbProviderFactory factory = DbProviderFactories.GetFactory("System.Data.OleDb");
using (DbConnection connection = factory.CreateConnection())
{
connection.ConnectionString = connectString;
using (DbCommand command = connection.CreateCommand())
{
connection.Open();
Next we need the actual logic for DB insertion. So basically throw queries into a loop, or whatever your logic is, and insert the data row by row.
string query = "INSERT INTO [raw_aaa$] (correlationid, ipaddr, somenum) VALUES (\"abcdef\", \"1.1.1.1\", 10)";
command.CommandText = query;
command.ExecuteNonQuery();
Now here's the really annoying part: the Excel driver tries to detect your column type before the insert, so even if you pass a proper integer value, if Excel thinks the column type is text, it will insert all your numbers as text, and it's very hard to get them treated as numbers. As such, the column must already be typed as a number in Excel. To accomplish this, I fill in the first 10 rows of my template file with dummy data, so that when the file is loaded through the Jet driver, it can detect the proper types and use them. Then all my formulas that point at my CSV table operate properly, since the values are of the right type. This may work for you if your goals are similar to mine and you use templates that already point to this data (just start at row 10 instead of row 2).
Because of this, my raw_aaa tab in excel might look something like this:
correlationid ipaddr somenum
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
abcdef 1.1.1.1 5
Note row 1 is the column names that I referenced in my sql queries. I think you can do without this, but that will require a little more research. By already having this data in the excel file, the somenum column will be detected as a number, and any data inserted will be properly treated as such.
Another note that makes this annoying: the Jet driver is 32-bit only, so in my case, where I had an explicitly 64-bit program, I was unable to execute this directly. So I used the nasty hack of writing to a file and then launching a program that would insert the data from the file into my Excel template.
All in all, I think the solution is pretty nasty, but thus far I unfortunately haven't found a better way to do this. Good luck!

You can take a look at TakeIo.Spreadsheet .NET library. It accepts files from Excel 97-2003, Excel 2007 and newer, and CSV format (semicolon or comma separators).
Example:
var inputFile = new FileInfo("Book1.csv"); // could be .xls or .xlsx too
var sheet = Spreadsheet.Read(inputFile);
foreach (var row in sheet)
{
foreach (var cell in row)
{
// do something
}
}
You can remove beginning and trailing empty rows, and also beginning and trailing columns from the imported data using the Normalize() function:
sheet.Normalize();
Sometimes you can find that your imported data contains empty rows between data, so you can use another helper for this case:
sheet.RemoveEmptyRows();
There is a Serialize() function to convert any input to CSV too:
var outfile = new StreamWriter("AllData.csv");
sheet.Serialize(outfile);
If you like to use comma instead of the default semicolon separator in your CSV file, do:
sheet.Serialize(outfile, ',');
And yes, there is also a ToString() function too...
This package is available at NuGet too, just take a look at TakeIo.Spreadsheet.

You can use ADO.NET
http://vbadud.blogspot.com/2008/09/opening-comma-separate-file-csv-through.html

Well, importing from CSV shouldn't be a big deal. I think the most basic method would be to do it using string operations. You could build a pretty fine parser using simple Split() command, and getting the stuff in arrays.
