I need to process CSV files that are kept as Base64 strings. I never know in what format they were created (usually it'll be ANSI or UTF-8). I have been struggling to achieve anything useful; I still receive messed-up characters when I test my code on a CSV file that was saved in ANSI. The code to read is just a two-liner:
byte[] dataToDecode = Convert.FromBase64String(base64Content);
string csvContentInUTF8 = Encoding.UTF8.GetString(dataToDecode);
I do not have access to the code that saves files.
Sample line that's in the input CSV:
;;;superÆ/æ Ø/ø and even Å/å Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
What I get after decoding (the second line of code):
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Following this question, I tried switching to a Scandinavian encoding for the read, so:
string csvContentInUTF8x = Encoding.GetEncoding("iso-8859-1").GetString(dataToDecode);
The output is:
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
It looks exactly the same with Encoding.Default.
If what you posted is accurate, the text was corrupted before it was written to the CSV file.
Now... Encoding.GetEncoding("iso-8859-1") is an identity encoding that doesn't do any remapping: its 256 characters are mapped 1:1 onto the first 256 code points (0-255) of Unicode.
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com
You see the � repeated six times? Normally each one should be different, because you expect six different characters (Æ/æ, Ø/ø, Å/å). But here they are always the same, and that is because in UTF-8 the � is the Unicode REPLACEMENT CHARACTER (U+FFFD), which a decoder emits when a byte sequence can't be decoded. So the error is already present in your dataToDecode.
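You can verify this yourself. Here is a minimal sketch (assuming the base64Content and dataToDecode variables from your question) that scans the decoded bytes for 0xEF 0xBF 0xBD, the UTF-8 encoding of U+FFFD; if that sequence is present, the text was already broken before it was Base64-encoded, and no decoding choice on your side can restore the original Æ/æ, Ø/ø, Å/å:

byte[] dataToDecode = Convert.FromBase64String(base64Content);
bool alreadyCorrupted = false;
for (int i = 0; i + 2 < dataToDecode.Length; i++)
{
    // 0xEF 0xBF 0xBD is the UTF-8 byte sequence of the replacement character U+FFFD
    if (dataToDecode[i] == 0xEF && dataToDecode[i + 1] == 0xBF && dataToDecode[i + 2] == 0xBD)
    {
        alreadyCorrupted = true;
        break;
    }
}
// alreadyCorrupted == true means the corruption happened in the system that wrote the CSV,
// not in your two-liner.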
Related
I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How do I do this? Do I need to detect the input file's encoding (which seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file, no matter which of the two encodings it uses, and write it correctly without knowing the encoding of the input file beforehand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved, so I'm looking to do the same thing, but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind it's so hard to do something in C# that the command prompt can do easily
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If "?" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of
the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.
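If you want to rule out silent best-fit or replacement mapping while you investigate, one option (a sketch, assuming Windows-1252 is the "ANSI" code page in play) is to request fallbacks that throw instead:

// On .NET Core / .NET 5+, legacy code pages need this provider from the
// System.Text.Encoding.CodePages package; on .NET Framework it is not required.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Ask for an encoding that throws instead of best-fitting or substituting characters,
// so any mismatch surfaces as an exception rather than a silently changed character.
Encoding strictAnsi = Encoding.GetEncoding(
    "windows-1252",
    new EncoderExceptionFallback(),   // throws when a char can't be encoded
    new DecoderExceptionFallback());  // throws when bytes can't be decoded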
UPDATE:
The offending character was "»", whose UTF-8 encoding is the two bytes 0xC2 0xBB. This is a "right angle quote", a guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise Roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
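And if you'd rather not hard-code the encoding at all, the usual trick for "either UTF-8 or ANSI" input is to attempt a strict UTF-8 decode and fall back to a legacy code page when it fails. A sketch (Windows-1252 stands in for the ANSI code page, and ReadAnsiOrUtf8 is a hypothetical helper name):

static string ReadAnsiOrUtf8(string path)
{
    byte[] bytes = File.ReadAllBytes(path);

    // Strict UTF-8: throw instead of emitting U+FFFD for invalid sequences.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        return strictUtf8.GetString(bytes);                 // file really is UTF-8 (or plain ASCII)
    }
    catch (DecoderFallbackException)
    {
        // Encoding.GetEncoding(1252) needs the code-pages provider on .NET Core (see above).
        return Encoding.GetEncoding(1252).GetString(bytes); // treat it as ANSI / Windows-1252
    }
}

// Usage, mirroring the one-liner above:
// File.WriteAllText(outputfile, ReadAnsiOrUtf8(inputfilepath + @"\ST60_0.csv"));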
I'm using the following code:
File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetBytes("THIS IS A TEST"))
Which should in theory write a UTF-8 file, but I just get an ANSI file. I also tried this, just to be especially verbose:
File.WriteAllBytes("c:\\test.xml", ASCIIEncoding.Convert(ASCIIEncoding.ASCII, UTF8Encoding.UTF8, Encoding.UTF8.GetBytes("THIS IS A TEST")))
Still the same issue though.
I am testing the output files by loading them in TextPad, which reads the format correctly (I tested with a sample file, as I know these things can be a bit weird sometimes).
WriteAllBytes isn't ignoring the encoding - rather: you already did the encoding, when you called GetBytes. The entire point of WriteAllBytes is that it writes bytes. Bytes don't have an encoding; rather: encoding is the process of converting from text (string here) to bytes (byte[] here).
UTF-8 is identical to ASCII for all ASCII characters - i.e. 0-127. All of "THIS IS A TEST" is pure ASCII, so the UTF-8 and ASCII for that are identical.
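If the goal is for an editor such as TextPad to label the file as UTF-8, one option (a sketch, not the only answer) is to write a byte order mark; with pure-ASCII content there is otherwise nothing in the bytes for the editor to detect:

// UTF8Encoding with encoderShouldEmitUTF8Identifier: true makes WriteAllText
// prepend the UTF-8 BOM (EF BB BF), which most editors use to identify UTF-8.
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
File.WriteAllText("c:\\test.xml", "THIS IS A TEST", utf8WithBom);

Whether you actually want a BOM at the start of an XML file depends on what will consume it.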
Explanation:
I've come across an edge case while writing my web app. I accept UTF-8 files to be uploaded, and I've got a check in place to confirm they are UTF-8 encoded (or at least the best check possible; apparently there is no silver bullet, and I'm aware there are many other questions on Stack Overflow about that specific issue).
As a test, I took an ANSI-encoded file and converted it to UTF-8 by both (in separate tests) converting it to UTF-8 in Notepad++, and also by just decoding it as UTF-8 (even though it is ANSI) on the fly in C# using Encoding.UTF8.GetBytes(inputStream).
Where The Problem Arises:
Later on, I place the raw data of the file as one of the elements in an XML file. This is where the problem arises. It appears that a character has persisted from the ANSI file which (I assume) is not valid in UTF-8. When I try to load the XML using the following command...
XDocument xmlSample = XDocument.Load(outputPath);
I get this exception...
{"Invalid character in the given encoding. Line 10, position 14."}
Below is the character copied and pasted from each tool:
From Notepad++: ¡
From the Visual Studio string viewer: �
Question:
How can I remove invalid characters from UTF-8 encoded file, or at least discover them in a sane way so I can reject the file?
First, as to your example: the word “Temperature” suggests that the offending character is in fact the degree sign (°, Unicode 176), so that the full text reads “Temperature(°C)”. In this case the character would be encoded as the single byte \260 in ANSI and as the two bytes \302\260 in UTF-8. A \260 byte (preceded by the left parenthesis in this case) is not valid UTF-8.
Second - if you are still interested after more than a year - could you clarify how you use Encoding.UTF8.GetBytes() to “decode a file as UTF-8”? GetBytes() reads characters, not bytes, and characters in C# do not have an encoding; the encoding was applied when the file was read and converted into characters. What UTF8.GetBytes() does is encode (not decode) the characters into a UTF-8 byte sequence.
In order to check an incoming byte sequence, you might use Encoding.UTF8.GetChars() to decode your byte sequence into characters. Depending on the constructor you use, you can either get a “cleaned-up” character string (with data loss if problems occurred) or receive a DecoderFallbackException on offending byte sequences, so you can reject the input.
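A rough sketch of both options, assuming the upload is available as a byte array named uploadedBytes (a placeholder name):

// Option 1 - reject: a UTF8Encoding that throws on invalid byte sequences.
var rejecting = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try
{
    string text = rejecting.GetString(uploadedBytes);
}
catch (DecoderFallbackException)
{
    // Not valid UTF-8 (e.g. a lone \260 byte from an ANSI degree sign) - reject the file.
}

// Option 2 - clean up: drop invalid sequences instead of throwing (with data loss).
var cleaning = Encoding.GetEncoding("utf-8",
    EncoderFallback.ReplacementFallback,
    new DecoderReplacementFallback(string.Empty));
string cleaned = cleaning.GetString(uploadedBytes);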
I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a StreamReader which takes the encoding to be UTF-8. All working well. But there are some characters, used to replace the punctuation marks, that just aren't working. I can see them as hex sequences in Notepad++ (encoding set to UTF-8), but when I read them, they all get reduced to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16; this is how they are stored in memory. Because of this, you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF-8 is the default character encoding for reading and writing files, if I'm not mistaken). The ?'s just mean that the console you output the string to does not support those characters, or that the bytes are not in a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.
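For completeness, a sketch of the read-replace-write flow described above, with a placeholder output file name and only the one mapping from the question. Note that if the punctuation bytes in bgt.txt are not actually valid UTF-8, the StreamReader will already have turned them into U+FFFD before any Replace runs, which would explain the identical black-diamond characters:

var replacements = new Dictionary<string, string>
{
    ["e]"] = "ἓ",
    // ... the rest of the substitution table
};

using (var srnorm = new StreamReader(@"C:\Users\John\Desktop\bgt.txt", Encoding.UTF8))
using (var writer = new StreamWriter(@"C:\Users\John\Desktop\bgt-unicode.txt", false, Encoding.UTF8))
{
    string norm;
    while ((norm = srnorm.ReadLine()) != null)
    {
        foreach (var pair in replacements)
            norm = norm.Replace(pair.Key, pair.Value);   // replacement happens on UTF-16 strings in memory
        writer.WriteLine(norm);
    }
}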
I've seen questions where the two characters are the same, but nothing that relates to this specific question, so here goes.
I'm running a C# console app that reads an input file with variable-length records. Each record has variable-length fields. I've got everything working in terms of parsing out each individual field within each record, not a problem. Except that today I came across the Ã± character in the input file. Now I know this translates to ñ, so I'm OK with it. However, because the input file sees ñ as 2 characters, the record length changes in the C# app, because the app is interpreting those 2 characters as a single ñ. This causes my record length to change from 154 characters to 153, and then, during the parsing, messes up the individual fields.
I'm ok with the ñ character getting stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I easily (without checking every single character) detect that the Ã± exists and trigger a change in the parsing logic? Should I simply do an IndexOf on the character and code it that way? I would think that would add a bit of overhead if I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall, but I've not encountered this before. Most of the posts I have found are more about handling the ñ character in text, as opposed to text being converted (properly) from Ã± to ñ.
Ideas?
The StreamReader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader(@"c:\Testing\test.txt", System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.
You must always read a text file in the encoding it was written. Of course, sometimes you don't know which encoding that was...
Think of the input file as a stream of bytes. Most are 1 byte = 1 ASCII character, but there are 2 bytes (probably) that can be interpreted differently depending on the encoding:
UTF-8 - 1 character, ñ
(some other encoding) - 2 characters, Ã±
Since you say "the input file sees ñ as 2 characters", this would probably be the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
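To see why the count drops from 154 to 153, compare the same two bytes decoded both ways (a sketch; Windows-1252 stands in for whichever ANSI code page the producer actually used):

byte[] pair = { 0xC3, 0xB1 };                                  // the two bytes in the record
string asUtf8 = Encoding.UTF8.GetString(pair);                 // "ñ"  -> Length == 1
string asAnsi = Encoding.GetEncoding(1252).GetString(pair);    // "Ã±" -> Length == 2
// Reading the file as UTF-8 collapses the pair into one character (153 total);
// reading it with the ANSI code page keeps two characters (the original 154).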
You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be Ã±, so why do you care how many bytes the original input was - you care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .