I am using C#. I have a comma-delimited CSV file containing strings in different languages.
My app should only open the CSV if it's Unicode.
Is there an easy way to determine this in code?
When you say "Unicode" I assume you mean UTF-8. Unicode is not an encoding and a file can't be "Unicode".
You could use a library. For example, ude is a C# library that attempts to determine what encoding a file uses. It uses the algorithm described here. It is not 100% foolproof.
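A minimal sketch of using Ude, based on its README (the API may have changed since):

```csharp
using System;
using System.IO;
using Ude;

class Detector
{
    static void Main(string[] args)
    {
        using (FileStream fs = File.OpenRead(args[0]))
        {
            var cdet = new CharsetDetector();
            cdet.Feed(fs);     // feed the raw bytes to the detector
            cdet.DataEnd();    // signal end of data so it finalizes its guess

            if (cdet.Charset != null)
                Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
            else
                Console.WriteLine("Detection failed.");
        }
    }
}
```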
The CSV specification does not define a way to carry metadata describing the encoding. The specification itself relies on ASCII only for the separators; the data tokens between the separators can be anything.
You will have to read through the data itself and infer the encoding from it.
If you are in control of both the output and the input, you could extend the format for your own needs by adding your own metadata, but then it would no longer be standard CSV, if that matters in your case.
So no, there isn't an "easy" way to determine the encoding.
I have a process in SSIS that outputs SQL table data in CSV format. However, I want the output in CSV (MS-DOS). Is there a way I can convert the normal CSV file to CSV (MS-DOS)? (For example, C# code that would convert the extension/type.) I tried the option available in Visual Studio for SSIS and couldn't find a solution. Your help is appreciated.
By default, the output format is CSV (comma delimited). I want that converted to CSV (MS-DOS).
If this article is accurate, https://excelribbon.tips.net/T009508_Comma-Delimited_and_MS-DOS_CSV_Variations.html, then getting CSV (MS-DOS) output will be fairly straightforward.
The difference only matters if you have certain special characters in text fields; for example, an accented (foreign-language) character. If you export as Windows CSV, those fields are encoded using the Windows-1252 code page. DOS encoding usually uses code page 437, which maps the characters used in old pre-Windows PCs.
Then you need to define two Flat File Connection Managers. The first will use code page 1252 (ANSI - Latin I) and point to C:\ssisdata\input\File.csv. The second will use 437 (OEM - United States) and point to C:\ssisdata\input\DOSFile.csv (this way you create a new file instead of clobbering the existing one).
Your Data Flow then becomes a Flat File Source to Flat File Destination.
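If you would rather do the conversion in C# (as the question hints), a minimal re-encoding sketch under the same code-page assumptions as above might look like this; note that on .NET Core/.NET 5+ the legacy code pages require the System.Text.Encoding.CodePages package:

```csharp
using System.IO;
using System.Text;

class CsvToDos
{
    static void Main()
    {
        // Register legacy code pages (needed on .NET Core/.NET 5+ only).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var windows1252 = Encoding.GetEncoding(1252); // ANSI - Latin I
        var cp437 = Encoding.GetEncoding(437);        // OEM - United States

        // Decode the bytes as Windows-1252, then write them back as CP437.
        // Characters with no CP437 mapping fall back to '?' by default.
        string text = File.ReadAllText(@"C:\ssisdata\input\File.csv", windows1252);
        File.WriteAllText(@"C:\ssisdata\input\DOSFile.csv", text, cp437);
    }
}
```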
I have an API which receives CSV files as IFormFile. I have to check whether the sent file is a proper CSV file or not, so I am doing the checks below:
Checking the file extension.
Checking the file content type.
Issue: any app that uses the API can change the file extension along with the content type. So how do I validate that a file is proper CSV? I haven't found any helpful article so far.
E.g. a PDF file can be renamed to .csv and given a CSV content type, but a PDF file is not valid CSV.
NB: magic numbers work for .xlsx, .docx, .pdf, etc., but for CSV there is no magic number; I tried the same approach and failed. Is there any other way to check it?
The closest you can get would be a robust TryParse method.
But instead of re-inventing the wheel, first try a few libraries; they might do the job:
https://github.com/TinyCsvParser/TinyCsvParser
https://github.com/nreco/csv
Note that CSV parsing can be a difficult task even though it's a simple format.
Even if you can't use a library, there are plenty of ideas you can grab from them.
If I were to detect CSV content, I'd do the following:
1. ensure that a line contains readable characters
2. optionally detect the file encoding, which might help
3. ensure that a line isn't incredibly long; otherwise it's likely to be binary, see #1
4. detect that the first line has repeating separators
5. try to parse the lines
More or less this:
1. find the first index of CRLF or LF
2. read up to that point
3. find the separators in it
4. try to parse the rest of the file, checking each line against the column count
5. if that fails, it's probably not CSV
Those are pretty much all the heuristics you can try, unless I'm mistaken; putting them together gives roughly the sketch below.
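A rough sketch (the class name and line-length threshold are my own assumptions; a naive Split won't handle quoted fields that contain the separator, which is one reason the libraries above are preferable):

```csharp
using System;
using System.IO;
using System.Linq;

static class CsvSniffer
{
    // Heuristic, not a validator: false when the content clearly isn't CSV,
    // true when it plausibly is.
    public static bool LooksLikeCsv(Stream stream, char separator = ',',
                                    int maxLineLength = 10_000)
    {
        using var reader = new StreamReader(stream); // also detects BOM-declared encodings

        // Read the first line and check it for readable text.
        string? header = reader.ReadLine();
        if (header is null || header.Length > maxLineLength) return false;
        if (header.Any(c => char.IsControl(c) && c != '\t')) return false; // likely binary

        // The first line should contain at least one separator.
        int columns = header.Split(separator).Length;
        if (columns < 2) return false;

        // Remaining lines should parse to the same column count.
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue; // tolerate blank lines
            if (line.Length > maxLineLength) return false;
            if (line.Split(separator).Length != columns) return false;
        }
        return true;
    }
}
```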
The SPSS DLL I have does not support encoding. It reads French and German OK, but any double-byte language string (e.g. Chinese, Japanese, or Russian) reads back as "???".
At the binary level, the buffer actually contains 0x3F for each "?". This may be the result of a failed Unicode conversion.
When we write into a .sav file, the interface only allows us to set the interface encoding (not a file-level encoding or code page). I suspect the DLL does not support double-byte languages.
I have created an SPSSDocument to write the data but cannot find any encoding option while creating the file.
Can anyone help me with this issue?
I have a file named "Connecticut is now 2 °C.txt" whose name contains a Unicode character, but whose contents are just normal characters. Previously, the code checked whether the file name contained Unicode and, if so, wrote the file header with the Unicode details. This implementation leads to a conflict in the output file. Can anyone suggest how to find out whether the file stream itself contains Unicode?
Thanks in advance,
Lokesh.
By far the simplest strategy is to decide on an encoding for a particular file, e.g. UTF-8, and use it exclusively, both when you write it and when you read it. Trying to detect what encoding is in use is decidedly error-prone, so it's best not to have to do this detection at all.
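For example, a minimal round trip under that agreement, assuming UTF-8 is the encoding you picked:

```csharp
using System.IO;
using System.Text;

// Both sides agree on UTF-8; write with a BOM so readers can confirm it.
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
File.WriteAllText("data.csv", "name,city\n花子,東京\n", utf8WithBom);

// Read it back with the same, agreed-upon encoding.
string text = File.ReadAllText("data.csv", Encoding.UTF8);
```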
UPDATE
In the comments below you clarify that you wish to write to a file that is created by somebody else with an unknown encoding.
In full generality this is impossible to do with 100% reliability.
If you are lucky, the file will come with a byte order mark (BOM), in which case you can read the BOM and infer the encoding from it. However, there's no requirement for a text file to contain a BOM, and they frequently don't.
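A hand-rolled BOM check might look like this (the helper name is mine; StreamReader can also do this for you via its detectEncodingFromByteOrderMarks constructor argument):

```csharp
using System.IO;
using System.Text;

static class BomSniffer
{
    // Returns the encoding implied by the BOM, or null when no BOM is present.
    public static Encoding? DetectFromBom(string path)
    {
        var bom = new byte[4];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(bom, 0, 4);

        // UTF-32 LE must be tested before UTF-16 LE: its BOM starts with the same bytes.
        if (read >= 4 && bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
            return Encoding.UTF32;                    // UTF-32 little-endian
        if (read >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
            return Encoding.UTF8;
        if (read >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
            return Encoding.Unicode;                  // UTF-16 little-endian
        if (read >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
            return Encoding.BigEndianUnicode;         // UTF-16 big-endian
        return null;                                  // no BOM: encoding unknown
    }
}
```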
However, I would urge you to agree an interchange format with whoever is creating these files. Pick a single encoding and always use it.
I think this link would be helpful for you. Pay attention to the IsTextUnicode function.
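A minimal P/Invoke sketch of calling it from C# (note that IsTextUnicode is a statistical check for UTF-16 text and is known to guess wrong on short buffers):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class UnicodeProbe
{
    // Win32 API from advapi32.dll; passing IntPtr.Zero for lpiResult
    // makes it run all of its statistical tests.
    [DllImport("advapi32.dll")]
    private static extern bool IsTextUnicode(byte[] lpv, int iSize, IntPtr lpiResult);

    public static bool LooksLikeUtf16(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return IsTextUnicode(bytes, bytes.Length, IntPtr.Zero);
    }
}
```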
I'm reading a CSV file with Fast CSV Reader (on CodeProject). When I print the contents of the fields, the console shows the character '?' in some words. How can I fix it?
The short version is that you have to know the encoding of any text file up front, before you read it. You can fall back on byte order marks and other heuristics if you genuinely won't know it, but you should always allow the value to be overridden (in the same way that Excel does when you import a CSV).
It's also worth double-checking the values in the debugger, as it may be the output that is wrong rather than the reading. Bear in mind that all .NET strings are Unicode internally, and conversion to '?' sounds like the conversion from Unicode to the console's code page is failing.
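As a sketch (assuming the file is UTF-8 and using the LumenWorks CsvReader class from that CodeProject article), explicitly setting both the reader's encoding and the console's output encoding usually fixes this:

```csharp
using System;
using System.IO;
using System.Text;
using LumenWorks.Framework.IO.Csv; // Fast CSV Reader from CodeProject

class Program
{
    static void Main()
    {
        // Let the console render characters outside its default code page.
        Console.OutputEncoding = Encoding.UTF8;

        // Open the file with an explicit encoding instead of the default.
        using var reader = new StreamReader("data.csv", Encoding.UTF8);
        using var csv = new CsvReader(reader, true); // true = file has headers
        while (csv.ReadNextRecord())
            Console.WriteLine(csv[0]);
    }
}
```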