Excel adds extra quotes on CSV export - c#

I've recently created an application which adds items to a Database by CSV. After adding items I realized that lots of my values had extra quotes (") that weren't needed and this was messing up my ordering.
The problem is that when exporting to a CSV from Excel, Excel adds extra quotes to all of my values that already have a quote in them. I've shown the difference below:
Original Item: Drill Electric Reversible 1/2" 6.3A
Exported Item: "Drill Electric Reversible 1/2"" 6.3A"
Note: the CSV export is adding three (3) extra quotes ("). Two on the ends, and one after the original intended quote.
Is there a setting I can change, or a formatting property I can set on the Excel File/Column? Or do I have to live with it and remove these quotes in my back-end code before adding them to the Database?

This is entirely normal. The outer quotes are added because the value is a string that contains a special character. The inner quote is doubled to escape it, the same kind of thing you'd see in a SQL query, for example. Use the TextFieldParser class to have tried-and-true framework code take care of the parsing for you automatically.
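As a minimal sketch of that suggestion, the helper below feeds a single CSV record to TextFieldParser and returns the unquoted fields. The class and method names (ExcelCsvLine, ParseLine) are made up for illustration, and the project is assumed to reference the Microsoft.VisualBasic assembly, which is where TextFieldParser lives even for C# code.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO; // TextFieldParser is defined here, also for C# projects

// Hypothetical helper, not part of the original question.
public static class ExcelCsvLine
{
    // Parses one CSV record and returns its fields with the CSV quoting removed.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Treat "..." as one field and "" inside it as a literal quote.
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

Passing the quoted form of the question's original item, "Drill Electric Reversible 1/2"" 6.3A", should give back the text with its single inch mark, so nothing needs to be stripped by hand before the database insert.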

That's standard.
The values within a CSV file should have quotes around them (otherwise commas and linebreaks inside a field may be misinterpreted).
The way to escape a quote within a field is to double it, just as you are seeing.
I suggest you read about the basic rules of CSV:
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows terminated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. If a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
(emphasis mine)
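The quoting rules above can be sketched from the writer's side as well. This is only an illustration of the rules as quoted, not how Excel is implemented; CsvEscaper and EscapeField are hypothetical names.

```csharp
using System;

// Hypothetical helper that applies the quoting rules described above.
public static class CsvEscaper
{
    public static string EscapeField(string value)
    {
        if (value == null) return "";

        // Only fields containing a comma, a double quote, or a line break need quoting.
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return value;

        // Double every embedded quote, then wrap the whole field in quotes.
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Under these rules a value with one embedded quote gains exactly three quote characters, which matches what the question observed.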

You could try exporting from Excel as TAB delimited files. I find it easier to parse.

Replace every Right Double Quotation Mark character with a Left Double Quotation Mark character. They look similar, but Excel will not recognise it as a quote, so it leaves the text unchanged.

This solution will only help if your end output is HTML. This is the JavaScript version, so you'll need to redo it in C# or whichever language you're working in:
base = base.replace(/""/g, '&quot;');
base = base.replace(/'/g, '&#39;');
Apply this before you parse the CSV.

Another approach would be to use the Unicode Character "DOUBLE PRIME"
http://www.fileformat.info/info/unicode/char/2033/index.htm
in your Excel data. To export from Excel into a UTF-8 or UTF-16 .csv you'll have to provide a schema.ini with an appropriate CharacterSet property. Obviously, the tool you use to import the .csv into your database has to be Unicode aware too.
Depending on the DBMS a more direct way of data transfer (SELECT/INSERT ... INTO ... IN ) can be used, thereby eliminating the .csv entirely.

Related

Using Regex.Replace function in c# program to replace a string(data) with "data"?

I want to make use of Regex.Replace function to replace data in the format
05-11
to
"05-11"
so that excel can read it as a string.
Excel is converting the data to 05-Nov even though that particular column is defined as char.
In my application code, I have the below piece of code to replace any data that starts with a dash (-) with double quotes, "data"
var newString = Regex.Replace(data, @"^(-.*)$", "=\"$0\"");
How can I make use of this function to replace any data which are like
'05-11', '15-2019'
with
"05-11", "15-2019"
so that Excel reads them as strings rather than as dates.
Unfortunately Excel does not accept " as an indicator of a text column; the only way to be sure is to precede the value with a single quote.
var newstring = Regex.Replace(data, @"\b""?(\d\d-(?:\d\d)?\d\d)""?\b", "\"=\"\"$0\"\"\"");
This finds possibly double-quoted date strings that constitute the whole column value, and outputs it quoted with an equals sign and double quotes.
Unfortunately this special formatting is lost if you save as CSV from inside Excel and try to reload.
See this question for details.

Can we create format file from bcp command line as a different field terminator for each field?

We are using a non-XML format file to import data into a SQL Server database through the bcp utility. We use a comma (,) as the field separator in bcp.
We now have a problem when a user provides a comma (,) inside a string field such as a name. The field is terminated at the comma and the rest of the words go into the next field. We cannot change the separator at this point, so we are looking for other solutions.
One option we found is for users to enclose all string column values in double quotes. We can scan the first ten rows of the file and check whether a line contains a double quote ("). If a double quote is found, we have to update the format file so that double quotes are treated as text qualifiers and everything within double quotes is treated as a column value.
So my question is: can we do this programmatically? If we find double quotes, we can make the field separator ",\"", and then for the next field the terminator will be "\","; if there are no double quotes, we can keep our terminator as ",".
Or, put simply: how can we specify a different field terminator for different fields? After some googling I found that the bcp command line only has -t, which sets the field terminator and keeps it the same for all fields.

Find a pattern and replace an element of it

I have the following problem:
I am trying to split the rows of a CSV file but the thing is that sometimes I read the following line:
string input = "a,b,c,d,\"V=12.503,I=0.194\",e,f";
I use the following code
string[] SplittedLine = input.Split(',');
The result is that I get an extra column because the data \"V=12.503,I=0.194\" has a comma inside it. When I open the CSV file with Excel, however, Excel doesn't add an extra column, because it doesn't split that value in two. How can I properly split this CSV file considering this situation?
You are encountering commas in the "cells" of your CSV, which by convention (but not by any standard) are escaped by wrapping the cell data with double quotes. You also need to be aware that the quote-escaped string can contain quote literals.
Let's say you had a name column and someone's name was
Jonathan "Jake" Smith, Jr.
That would be encoded as
"Jonathan ""Jake"" Smith, Jr."
You can certainly improve your code to handle those cases. However, that problem has been solved before. If you don't want to reinvent the wheel, there are a number of solid open source libraries that handle the headache of parsing CSV files. The one I use is
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
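For illustration only, here is roughly what a quote-aware splitter has to do for a single line. It is a sketch under simplifying assumptions (one record per string, comma as the only delimiter, no trimming), and QuoteAwareSplitter is a hypothetical name; a maintained library remains the safer choice for real files.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical single-line splitter; not a full CSV parser.
public static class QuoteAwareSplitter
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote is a literal quote
                    i++;                 // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);   // commas inside quotes are plain data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // delimiter ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing comma
        return fields;
    }
}
```

With the line from the question this yields seven fields, the fifth being V=12.503,I=0.194, and the encoded name example comes back as a single field containing the inner quotes.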

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.
I retrieve the cell contents into a string thus:
string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();
Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):
s = Regex.Replace(s, "\u0094", "\u0022");
But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:
bool a = s.Contains("\u0094");
but it returns false. However:
bool b = s.Contains("”");
returns true.
My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.
Yes, strings in .NET are UTF-16.
You're doing it right; perhaps your hex-math is incorrect.
The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:
((int)"”"[0]).ToString("X") returns "201D"
"”" == "\u201D" returns true
"\u0094" == "" (right hand side is the empty string) returns false
Many UTF-16 characters show up as an empty string in the Text Visualizer; they may be undisplayable characters or one half of a surrogate pair (some characters have to be written in the eight-digit form "\UXXXXXXXX", while others fit in the four-digit form "\uXXXX"). My knowledge of this domain is very limited.
References - Jon Skeet's articles on:
Strings
Unicode
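A likely source of the confusion is that 0x94 is the Windows-1252 byte for the right curly quote, while the .NET string holds the Unicode code point U+201D shown above. A minimal sketch of the replacement, using Unicode code points throughout, might look like this. The class name and the choice of ASCII substitutes are assumptions; extend the list to whatever characters the spreadsheets actually contain.

```csharp
using System;
using System.Text;

// Hypothetical helper; the character list is illustrative, not exhaustive.
public static class Latin1Sanitizer
{
    public static string ReplaceTypographicCharacters(string s)
    {
        return new StringBuilder(s)
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"')   // right double quotation mark
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\'')  // right single quotation mark
            .Replace('\u2013', '-')   // en dash
            .Replace('\u2014', '-')   // em dash
            .ToString();
    }
}
```

Plain character replacement is enough here, so no regular expression is needed.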
You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accommodate those characters.
That way you don't have to convert the whole database, and you are future-proof, because the columns will be Unicode.

Splitting on a Unique Character

I want to build a comma-separated list so that I can split on the comma later to get an array of the values. However, the values may have commas in them. In fact, they may have any normal keyboard character in them (they are supplied by a user). What is a good strategy for determining a character you are sure will not collide with the values?
In case this matters in a language-dependent way, I am building the "some character" separated list in C# and sending it to a browser to be split in JavaScript.
If JavaScript is consuming the list, why not send it in the form of a JavaScript array? It already has an established and reliable method for representing a list and escaping characters.
["Value 1", "Value 2", "Escaped \"Quotes\"", "Escaped \\ Backslash"]
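On the C# side, one way to produce such an array is to let a JSON serializer do the escaping. The sketch below assumes System.Text.Json is available (it ships with .NET Core 3.0 and later; on older frameworks it is a NuGet package), and ListTransport is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper that turns the values into a JSON array literal.
public static class ListTransport
{
    public static string ToJsonArray(IEnumerable<string> values)
    {
        // The serializer escapes quotes, backslashes and control characters,
        // so no "safe" delimiter has to be chosen by hand.
        return JsonSerializer.Serialize(values);
    }
}
```

The browser can turn the resulting text back into an array with JSON.parse. The serializer may write some characters as \uXXXX escapes rather than backslash-quote pairs; both forms are valid JSON and decode to the same strings.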
You could split it by a null character, and terminate your list with a double null character.
I always use | but if you still think that it can contain it, you can use combinations like #|#. For example:
"string one#|#string two#|#...#|#last string"
Eric S. Raymond wrote a book chapter on this that you might find useful. It is directed toward Unix users but should still apply.
As for your question, if you will have commas within cells, then you will need some form of escaping. Using \, is a standard way, but you will also have to escape slashes, which are also common.
Alternatively, use another character such as the pipe (|), tab, or something else of your choice. If users need to work with the data using a spreadsheet program, you can usually add filter rules to split cells on the delimiter of your choice. If this is a concern, it's probably best to choose a delimiter that users can easily type, which excludes the nul char, among others.
You could also use quoting:
"value1", "value2", "etc"
In which case, you will only need to escape quotes (and slashes). This should also be accepted by spreadsheets given the correct filter options.
There are several ways to do this. The first is to select a separator character that would not normally be input from the keyboard; NULL or TAB are normally good. The second is to use a character sequence as a separator. Excel CSV files are a good example, where the cell values are delimited by quotes, with commas separating the cells.
The answer is dependent on whether you want to reinvent the wheel or not.
If there is potential for any splitting character to appear in your strings, then I would suggest that you write a script element to your output with a JavaScript array definition in it. For example:
<script>
var myVars=new Array();
myVars[0]="abc|#123$";
myVars[1]="123*456";
myVars[2]="blah|blah";
</script>
Your JavaScript can then reference that array.
Doing this also avoids the need to create a comma-separated string from your C# string array.
The only gotcha I can think of is strings that contain quotes; in that case you would have to escape them in C# when writing them out to the myVars output.
There is an RFC which documents the CSV format (RFC 4180). Follow the standards and you will avoid reinventing the wheel and creating a mess for the next guy who comes along to maintain your code. The nice thing is that there are libraries available to import/export CSV for just about any platform you can imagine.
That said, if you are serialising data to send to a browser, JSON is really the way to go. It too is documented in an RFC, and you can get libraries for just about any platform, such as JSON.NET.
