Parsing a CSV file problems - C#

Having a problem with parsing a CSV file. I connect to the file using the following:
string connString = "Provider=Microsoft.Jet.OLEDB.4.0;"
+ "Data Source=\"" + dir + "\\\";"
+ "Extended Properties=\"text;HDR=No;FMT=Delimited\"";
//create the database query
string query = "SELECT * FROM [" + file + "]";
//create a DataTable to hold the query results
DataTable dTable = new DataTable();
//create an OleDbDataAdapter to execute the query
OleDbDataAdapter dAdapter = new OleDbDataAdapter(query, connString);
//Get the CSV file to change position.
//fill the DataTable
dAdapter.Fill(dTable);
return dTable;
For some reason, the first row reads as a "Header" OK (i.e. HDR=Yes allows the values to be displayed). The problem is that with HDR=No, nothing after the first 'cell' is displayed in that row. However, I need HDR=No as I'll be writing the CSV later.
As a quick aside, the rest of that row only has a value in every other column, and each of those values contains a period. Any help?
Cheers.
EDIT: Here are a fake few lines similar to the CSV:
//Problem row->>
File:,GSK1.D,,GSK2.D,,GSK3.D,
//The following rows, however, are fine:
/ 69,120.3,16.37%,128.9,7.16%,188.92,13.97%
D / 71,48.57,75.50%,32.15,26.65%,58.35,71.43%
T / 89,35.87,45.84%,50.01,28.87%,15.38,43.30%
EDIT: When I put any value into the "blank spaces" above they are parsed, but no matter what I put into the problematic cells (e.g. GSK1.D) they won't parse - unless it is a number! Is there any chance it is automatically converting this cell to a "float" cell? And how can I stop it doing this?

At CodeProject there is a parsing library: http://www.codeproject.com/KB/database/CsvReader.aspx
with an interesting article on how this stuff works. According to the author, it is faster than the OleDb provider.
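For what it's worth, here is a minimal sketch of reading a file with it (assuming the LumenWorks CsvReader API from that article; "data.csv" is a placeholder, and hasHeaders is false to match the HDR=No case above):
using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), false))
{
    int fieldCount = csv.FieldCount;
    while (csv.ReadNextRecord())
    {
        // Fields come back as strings - no type inference, unlike the Jet text driver
        for (int i = 0; i < fieldCount; i++)
            Console.Write(csv[i] + "\t");
        Console.WriteLine();
    }
}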

I have finished this; just to let anyone know who may have this problem in the future: it turns out the reason nothing was being read in was that ADO tries to determine a column type, and if other values in that column are not of said type, it removes them completely.
To counter this, you need to create a schema.ini file, like so:
using (StreamWriter writer = new StreamWriter(File.Create(dir + "\\schema.ini")))
{
    writer.WriteLine("[" + fileToBeRead + "]");
    writer.WriteLine("ColNameHeader = False");
    writer.WriteLine("Format = CSVDelimited");
    writer.WriteLine("CharacterSet=ANSI");
    // Declare every column as text so ADO doesn't try to infer a type
    int iColCount = dTable.Columns.Count + 1;
    for (int i = 1; i < iColCount; i++)
    {
        writer.WriteLine("Col" + i + "=Col" + i + "Name Char Width 20");
    }
    //writer.WriteLine("Col1=Col1Name Char Width 20");
    //writer.WriteLine("Col2=Col2Name Char Width 20");
    //etc.
}
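For a file with three columns, the generated schema.ini ends up looking like this (with fileToBeRead standing in for the real file name):
[fileToBeRead.csv]
ColNameHeader = False
Format = CSVDelimited
CharacterSet=ANSI
Col1=Col1Name Char Width 20
Col2=Col2Name Char Width 20
Col3=Col3Name Char Width 20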
Thanks for everyone's suggestions!

I've seldom done well with database type access to text files - the possibilities for "issues" with the file tend to exceed theoretical time savings.
Personally, I've more often than not hand-crafted the code to do this - a lot, going back over 20+ years, so generic solutions have been thin on the ground. That said, if I had to process a .csv file now, the first thing I'd reach for would be FileHelpers or similar.
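As a rough sketch of the FileHelpers approach (assuming its attribute-based API; the CsvRow class and its fields are invented for illustration):
using FileHelpers;

// Hypothetical record layout - adjust the fields to match your file
[DelimitedRecord(",")]
public class CsvRow
{
    public string Label;
    public string Value;
}

// Reading the file then becomes:
var engine = new FileHelperEngine<CsvRow>();
CsvRow[] rows = engine.ReadFile("input.csv");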

Related

SQL Insert not considering blank values for the insert in my C# code

I have a nice piece of C# code which allows me to import data into a table with fewer columns than in the SQL table (as the file format is consistently bad).
My problem comes when I have a blank entry in a column. The VALUES statement does not pick up an empty column from the CSV, and so I receive the error
You have more insert columns than values
Here is the query printed to a message box...
As you can see there is nothing for Crew members 4 to 11, below is the file...
Please see my code:
SqlConnection ADO_DB_Connection =
(SqlConnection)Dts.Connections["ADO_DB_Connection"].AcquireConnection(Dts.Transaction);
// Inserting data of file into table
int counter = 0;
string line;
string ColumnList = "";
// MessageBox.Show(fileName);
System.IO.StreamReader SourceFile =
new System.IO.StreamReader(fileName);
while ((line = SourceFile.ReadLine()) != null)
{
if (counter == 0)
{
ColumnList = "[" + line.Replace(FileDelimiter, "],[") + "]";
}
else
{
string query = "Insert into " + TableName + " (" + ColumnList + ") ";
query += "VALUES('" + line.Replace(FileDelimiter, "','") + "')";
// MessageBox.Show(query.ToString());
SqlCommand myCommand1 = new SqlCommand(query, ADO_DB_Connection);
myCommand1.ExecuteNonQuery();
}
counter++;
}
If you could advise how to include those fields in the insert that would be great.
Here is the same file but opened with a text editor and not given in picture format...
Date,Flight_Number,Origin,Destination,STD_Local,STA_Local,STD_UTC,STA_UTC,BLOC,AC_Reg,AC_Type,AdultsPAX,ChildrenPAX,InfantsPAX,TotalPAX,AOC,Crew 1,Crew 2,Crew 3,Crew 4,Crew 5,Crew 6,Crew 7,Crew 8,Crew 9,Crew 10,Crew 11
05/11/2022,241,BOG,SCL,15:34,22:47,20:34,02:47,06:13,N726AV,"AIRBUS A-319 ",0,0,0,36,AV,100612,161910,323227
Not touching the potential for SQL injection as I'm free-handing this code. If this is a system generated file (mainframe extract, dump from Dynamics or a LoB app) the probability of SQL injection is awfully low.
// Char required: string.Replace(char, string) doesn't exist, so grab the first char
// (the Count(...) calls below also need "using System.Linq;")
char FileDelimiterChar = FileDelimiter[0];
int columnCount = 0;
while ((line = SourceFile.ReadLine()) != null)
{
if (counter == 0)
{
ColumnList = "[" + line.Replace(FileDelimiterChar, "],[") + "]";
// How many columns in line 1. Assumes no embedded commas
// Add 1 as we will have one fewer delimiter than columns
columnCount = line.Count(x => x == FileDelimiterChar) + 1;
}
else
{
string query = "Insert into " + TableName + " (" + ColumnList + ") ";
// HACK: this fails if there are embedded delimiters
int foundDelimiters = line.Count(x => x == FileDelimiterChar) + 1;
// at this point, we know how many delimiters we have
// and how many we should have.
string csv = line.Replace(FileDelimiter, "','");
// Pad out the current line with empty strings aka ','
// (foundDelimiters is really the found column count, so pad until we hit columnCount)
// Probably a classier linq way of doing this or string.Concat approach
for (int index = foundDelimiters; index < columnCount; index++)
{
csv += "','";
}
query += "VALUES('" + csv + "')";
// MessageBox.Show(query.ToString());
SqlCommand myCommand1 = new SqlCommand(query, ADO_DB_Connection);
myCommand1.ExecuteNonQuery();
}
counter++;
}
Something like that should get you a solid shove in the right direction. The concept is that you need to inspect the first line and see how many columns you should have. Then, for each line of data, check how many columns you actually have and stub in empty strings as needed.
If you change this up to use SqlCommand objects and parameters, the approximate logic is still the same. You'll add all the expected parameters by figuring out columns in the first line and then for each line you will add your values and if you have a short row, you just send the empty string (or dbnull or whatever your system expects).
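A rough, untested sketch of that parameterized variant, reusing the variables from the loop above (the @p1-style parameter names are invented, and it assumes every column can be sent as a string):
string insertSql = "INSERT INTO " + TableName + " (" + ColumnList + ") VALUES (" +
    string.Join(",", Enumerable.Range(1, columnCount).Select(n => "@p" + n)) + ")";
using (SqlCommand cmd = new SqlCommand(insertSql, ADO_DB_Connection))
{
    string[] fields = line.Split(FileDelimiterChar);
    for (int n = 1; n <= columnCount; n++)
    {
        // Short rows get empty strings; swap in DBNull.Value if the table allows nulls
        object value = n <= fields.Length ? (object)fields[n - 1] : string.Empty;
        cmd.Parameters.AddWithValue("@p" + n, value);
    }
    cmd.ExecuteNonQuery();
}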
The big takeaway IMO is that CSV parsing libraries exist for a reason, and there are so many cases not addressed in the above pseudocode that you'll likely want to trash the current approach in favor of a standard parsing library and then, while you're at it, address the potential security flaws.
I see your updated comment that you'll take the formatting concerns back to the source party. If they can't address them, I would envision your SSIS package being
Script Task -> Data Flow task.
Script Task is going to wrangle the unruly data into a strict CSV dialect that a Data Flow task can handle. Preprocessing the data into a new file instead of trying to modify the existing in place.
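Inside the Script Task, that preprocessing could be as small as this sketch (inputPath and outputPath are placeholders; it assumes no embedded commas and simply pads short rows out to the header's column count):
// Needs "using System.Linq;"
var lines = System.IO.File.ReadAllLines(inputPath);
int expected = lines[0].Split(',').Length;
var cleaned = lines.Select(l =>
    l + new string(',', Math.Max(0, expected - l.Split(',').Length)));
System.IO.File.WriteAllLines(outputPath, cleaned);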
The Data Flow then becomes a chip shot of Flat File Source -> OLE DB Destination
Here's how you can process this file... I would still ask for Json or XML though.
You need two outputs set up. Flight Info (the 1st 16 columns) and Flight Crew (a business key [flight number and date maybe] and CrewID).
Seems to me the problem is how the crew is handled in the CSV.
So the basic steps are: read the file, use a regex to split it, write the first 16 columns to output 1 and the rest (with the key) to the flight crew output, and skip the header row on your read.
var lines = System.IO.File.ReadAllLines("filepath");
for (int i = 1; i < lines.Length; i++)
{
var r = new System.Text.RegularExpressions.Regex("(?:^|,)(?=[^\"]|(\")?)\"?((?(1)(?:[^\"]|\"\")*|[^,\"]*))\"?(?=,|$)"); //Some code I stole to split quoted CSVs
var m = r.Matches(lines[i]); //Gives you all matches in a MatchCollection
//first 16 columns are always correct
OutputBuffer0.AddRow();
OutputBuffer0.Date = m[0].Groups[2].Value;
OutputBuffer0.FlightNumber = m[1].Groups[2].Value;
[And so on until m[15]]
for (int j = 16; j < m.Count; j++)
{
OutputBuffer1.AddRow(); //This is a new output that you need to set up
OutputBuffer1.FlightNumber = m[1].Groups[2].Value;
[Keep adding to make a business key here]
OutputBuffer1.CrewID = m[j].Groups[2].Value;
}
}
Be careful as I just typed all this out to give you a general plan without any testing. For example m[0] might actually be m[0].Value and all of the data types will be strings that will need to be converted.
To check out how regex processes your rows, please visit https://regex101.com/r/y8Ayag/1 for explanation. You can even paste in your row data.
UPDATE:
I just tested this and it works now. I needed to escape the quotes in the regex, specify that you want the value of group 2, and fully qualify System.IO.File.ReadAllLines.
The solution that I implemented in the end avoided the script task completely, which also means no SQL injection possibilities.
I did a flat file import: everything into one column, then a split_string and a pivot in SQL, then into a staging table for tidy-up before heading off into the main table.
Flat File Import to single column table -> SQL transform -> Load
This also allowed me to iterate through the files better using a foreach loop container.
ELT on this occasion.
Thanks for all the help and guidance.

How to enter data with a leading zero into Excel using the C# Excel application

I want to enter data into Excel. I am having a problem entering data which has leading zeroes.
For example:
I want to enter 024 [zero two four] into Excel; this data is in my DataTable.
But in the generated Excel file it is shown as 24.
You can set the cell format to TEXT before you do that, since you are, after all, trying to store the text "024". In the same vein, if all your data is meant to be 3 digits, you can use a specific number format, such as "000".
cellReference.NumberFormat = "@"; // "@" is Excel's TEXT format
cellReference.Value = "024";
or
cellReference.NumberFormat = "000";
cellReference.Value = "024";
Adding an apostrophe before the value will solve this problem.
This directs Excel to treat the cell as text rather than a number.
'024
will output
024
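In C# interop that would look something like this (reusing the cellReference name from the snippet above):
// Excel hides the leading apostrophe and stores the rest as text
cellReference.Value = "'024";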
If you are using a DataTable, you need to take another DataTable, iterate through every cell of the original, and prepend the cell's text with a space. We cannot add the rows to the same table we are iterating, because it will throw a "Collection was Modified" exception, so we have to use a new DataTable.
Consider the following code.
//To get the result with leading zero's in excel this code is written.
DataTable dtUpdated=new DataTable();
//This gives similar schema to the new datatable
dtUpdated = dtReports.Clone();
foreach (DataRow row in dtReports.Rows)
{
for (int i = 0; i < dtReports.Columns.Count; i++)
{
string oldVal = row[i].ToString();
string newVal = " "+oldVal;
row[i] = newVal;
}
dtUpdated.ImportRow(row);
}
We can bind this updated table to a data grid so it can be used for the Excel conversion.
I have found a way for this.
Instead of doing anything in the SQL query or to the value, we can emit an Excel formula that produces whatever we want to achieve.
For example:
Here I want to have 024 in Excel. What I did was edit the query and change the value so that it now looks like a formula; when it is entered into Excel, it will be treated as an Excel formula.
string strf = "=CONCATENATE(\"024\")"; // quote the 024, otherwise C# treats it as the number 24

CSV is actually .... Semicolon Separated Values ... (Excel export on AZERTY)

I'm a bit confused here.
When I use Excel 2003 to export a sheet to CSV, it actually uses semicolons ...
Col1;Col2;Col3
shfdh;dfhdsfhd;fdhsdfh
dgsgsd;hdfhd;hdsfhdfsh
Now when I read the CSV using the Microsoft drivers, it expects commas and sees the list as one big column???
I suspect Excel is exporting with semicolons because I have an AZERTY keyboard. However, doesn't the CSV reader then also have to take into account the different delimiter?
How can I know the appropriate delimiter, and/or read the CSV properly?
public static DataSet ReadCsv(string fileName)
{
DataSet ds = new DataSet();
string pathName = System.IO.Path.GetDirectoryName(fileName);
string file = System.IO.Path.GetFileName(fileName);
OleDbConnection excelConnection = new OleDbConnection
(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + pathName + ";Extended Properties=Text;");
try
{
OleDbCommand excelCommand = new OleDbCommand(@"SELECT * FROM " + file, excelConnection);
OleDbDataAdapter excelAdapter = new OleDbDataAdapter(excelCommand);
excelConnection.Open();
excelAdapter.Fill(ds);
}
catch
{
throw; // rethrow without resetting the stack trace
}
finally
{
if(excelConnection.State != ConnectionState.Closed )
excelConnection.Close();
}
return ds;
}
One way would be to just use a decent CSV library; one that lets you specify the delimiter:
using (var csvReader = new CsvReader("yourinputfile.csv"))
{
csvReader.ValueSeparator = ';';
csvReader.ReadHeaderRecord();
while (csvReader.HasMoreRecords)
{
var record = csvReader.ReadDataRecord();
var col1 = record["Col1"];
var col2 = record["Col2"];
}
}
Check what delimiter is specified on your computer. Control Panel > Regional and Language Options > Regional Options tab - click Customize button. There's an option there called "List separator". I suspect this is set to semi-colon.
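If you'd rather detect it in code, .NET exposes that same regional setting (a small sketch):
using System;
using System.Globalization;

// Excel's CSV export uses the regional list separator; read it and feed it to your parser
string listSeparator = CultureInfo.CurrentCulture.TextInfo.ListSeparator;
Console.WriteLine("This machine's list separator: " + listSeparator);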
Solution for German Windows 10:
Remember to change the decimal separator to . and maybe the thousands separator to a thin space as well.
Can't believe this is true...Comma-separated values are separated by semicolon?
As mentioned by dendarii, the CSV separator that Excel uses is determined by your regional settings, specifically the 'list separator' character.
(And Excel does this erroneously in my opinion, as it is called a comma separated file.)
HOWEVER, if that still does not solve your issue, there is another possible complication:
Check your 'digit grouping' character and ensure that is NOT a comma.
Excel appears to revert to semicolons when exporting decimal numbers if digit grouping is also set to a comma.
Setting the digit grouping to a full stop / period (.) solved this for me.

Updating Cells in a DataTable

I'm writing a small app to do a little processing on some cells in a CSV file I have. I've figured out how to read and write CSV files with a library I found online, but I'm having trouble: the library parses CSV files into a DataTable, but, when I try to change a cell of the table, it isn't saving the change in the table!
Below is the code in question. I've separated the process into multiple variables and renamed some of the things to make it easier to debug for this question.
Code
Inside the loop:
string debug1 = readIn.Rows[i].ItemArray[numColumnToCopyTo].ToString();
string debug2 = readIn.Rows[i].ItemArray[numColumnToCopyTo].ToString().Trim();
string debug3 = readIn.Rows[i].ItemArray[numColumnToCopyFrom].ToString().Trim();
string towrite = debug2 + ", " + debug3;
readIn.Rows[i].ItemArray[numColumnToCopyTo] = (object)towrite;
After the loop:
readIn.AcceptChanges();
When I debug my code, I see that towrite is being formed correctly and everything's OK, except that the row isn't updated: why isn't it working? I have a feeling that I'm making a simple mistake here: the last time I worked with DataTables (quite a long time ago), I had similar problems.
If you're wondering why I'm adding another comma in towrite, it's because I'm combining a street address field with a zip code field - I hope that's not messing anything up.
My code is kind of messy, as I'm only trying to edit one file to make a small fix, so sorry.
The easiest way to edit individual column values is to use the DataRow.Item indexer property:
readIn.Rows[i][numColumnToCopyTo] = (object)towrite;
This isn't well-documented, but DataRow.ItemArray's get accessor returns a copy of the underlying data. Here's the implementation, courtesy of Reflector:
public object[] get_ItemArray() {
int defaultRecord = this.GetDefaultRecord();
object[] objArray = new object[this._columns.Count];
for (int i = 0; i < objArray.Length; i++) {
DataColumn column = this._columns[i];
objArray[i] = column[defaultRecord];
}
return objArray;
}
There's an awkward alternative method for editing column values: get a row's ItemArray, modify those values, then modify the row to use the updated array:
object[] values = readIn.Rows[i].ItemArray;
values[numColumnToCopyTo] = (object)towrite;
readIn.Rows[i].ItemArray = values;
Use the SetField<T> method:
string debug1 = readIn.Rows[i].ItemArray[numColumnToCopyTo].ToString();
string debug2 = readIn.Rows[i].ItemArray[numColumnToCopyTo].ToString().Trim();
string debug3 = readIn.Rows[i].ItemArray[numColumnToCopyFrom].ToString().Trim();
string towrite = debug2 + ", " + debug3;
readIn.Rows[i].SetField<string>(numColumnToCopyTo,towrite);
readIn.AcceptChanges();

Writing long text to an Excel workbook using interop throws an error?

I am writing long text (1K to 2K characters, plain XML data) into a cell in an Excel workbook.
The statement below throws the COM error Exception from HRESULT: 0x800A03EC:
range.set_Value(Type.Missing, data);
If I copy and paste the same XML manually into Excel it works fine, but the same does not work programmatically.
If I strip the text down to something like 100/300 chars it works fine.
There is a limit (somewhere between 800 and 900 chars, if I remember correctly) that is nearly impossible to get around like this.
Try using an OLE connection and inserting the data with an SQL command. That might work better for you. You can then use interop to do any formatting if necessary.
The following KB article explains that the maximum is 911 characters. I checked this against my code and it does work for strings up to 911 chars.
http://support.microsoft.com/kb/818808
The workaround mentioned in this article is to make sure no cell holds more than 911 characters. That's lame!
Good OLE and Excel article: http://support.microsoft.com/kb/316934
The following code updates a private variable holding the number of successful rows and returns a string which is the path to the Excel file.
Remember to use Path from System.IO!
string tempXlsFilePathName;
string result = string.Empty;
string sheetName;
string queryString;
int successCounter;
// set sheetName and queryString
sheetName = "sheetName";
queryString = "CREATE TABLE " + sheetName + "([columnTitle] char(255))";
// Write .xls
successCounter = 0;
tempXlsFilePathName = (_tempXlsFilePath + @"\literalFilename.xls");
using (OleDbConnection connection = new OleDbConnection(GetConnectionString(tempXlsFilePathName)))
{
OleDbCommand command = new OleDbCommand(queryString, connection);
connection.Open();
command.ExecuteNonQuery();
yourCollection.ForEach(dataItem=>
{
string SQL = "INSERT INTO [" + sheetName + "$] VALUES ('" + dataItem.ToString() + "')";
OleDbCommand updateCommand = new OleDbCommand(SQL, connection);
updateCommand.ExecuteNonQuery();
successCounter++;
}
);
// update result with successfully written username filepath
result = tempXlsFilePathName;
}
_successfulRowsCount = successCounter;
return result;
N.B. This was edited in a hurry, so may contain some mistakes.
To work around this limitation, write/update only one cell at a time and dispose of the Excel COM object immediately, then recreate the object to write/update the next cell.
I can confirm this approach works in VS2010 (VB.NET project) with the Microsoft Excel 10.0 Object Library (Microsoft Office XP).
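A rough sketch of that write-one-cell-then-release pattern (shown in C# for consistency with the rest of this page; untested, and all names are illustrative):
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

static void WriteCell(string path, int row, int col, string text)
{
    Excel.Application app = new Excel.Application();
    Excel.Workbook wb = app.Workbooks.Open(path);
    Excel.Worksheet ws = (Excel.Worksheet)wb.Worksheets[1];
    ((Excel.Range)ws.Cells[row, col]).Value2 = text;
    wb.Save();
    wb.Close();
    app.Quit();
    // Release in reverse order so no RCW keeps Excel alive between writes
    Marshal.ReleaseComObject(ws);
    Marshal.ReleaseComObject(wb);
    Marshal.ReleaseComObject(app);
}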
This limitation is supposed to have been removed in Excel 2007/2010. Using VBA the following works
Sub longstr()
Dim str1 As String
Dim str2 As String
Dim j As Long
For j = 1 To 2000
str1 = str1 & "a"
Next j
Range("a1:a5").Value2 = str1
str2 = Range("a5").Value2
MsgBox Len(str2)
End Sub
I'll start by saying I haven't tried this myself, but my research says that you can use QueryTables to overcome the 911 character limitation.
This is the primary post I found which talks about using a record set as the data source for a QueryTable and adding it to a spreadsheet: http://www.excelforum.com/showthread.php?t=556493&p=1695670&viewfull=1#post1695670.
Here is some sample C# code of using QueryTables: import txt files using excel interop in C# (QueryTables.Add).
