ASCII-encoded values greater than 127 (0x7F) for TIFF files - C#

I'm trying to encode a TIFF file manually. Whenever byte values greater than 0x7F are encoded they should be written to file as follows:
However, whenever I try to write the value of a character greater than 0x7F to file with ASCII encoding it's written as "?" (0x3F).
Does anyone know what the encoding for the byte values in the shown image is?
For reference I'm using C#, writing single chars at a time to a .tif file using the StreamWriter class (StreamWriter::Write(wchar_t)).

Problem solved: using the BinaryWriter class instead of StreamWriter allows for encoding of values above 127 as shown in the problem statement.
Going to leave this up as I wasn't able to find this solution through Google using the keywords I mentioned.
I said I needed ASCII encoding despite needing to encode values above 127 (which ASCII doesn't support) because all values of 127 and below were still given their ASCII representation. I didn't know what encoding was needed, just that it was a superset of ASCII.
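For anyone landing here from a search, a minimal sketch of the difference (the file name and byte values are just placeholders):

using System.IO;

class Program
{
    static void Main()
    {
        // When the stream is wrapped in a StreamWriter with an ASCII text encoding,
        // every char is re-encoded, so anything above 0x7F that ASCII can't
        // represent comes out as '?' (0x3F).
        // BinaryWriter.Write(byte) writes the raw byte value untouched.
        using (var writer = new BinaryWriter(File.Open("out.tif", FileMode.Create)))
        {
            writer.Write((byte)0x49); // plain ASCII range: identical either way
            writer.Write((byte)0xAB); // > 0x7F: preserved exactly as 0xAB
        }
    }
}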

Related

Change encoding to UTF8 for data kept in base64string

I need to process CSV files that are kept as base64 strings. I never know what format they were created in (usually it'll be ANSI or UTF-8). I have been struggling to get anything useful out of this; I still receive messed-up characters when I test my code on a CSV file that was saved in ANSI. The code to read it is just a two-liner:
byte[] dataToDecode = Convert.FromBase64String(base64Content);
string csvContentInUTF8 = Encoding.UTF8.GetString(dataToDecode);
I do not have access to the code that saves files.
Sample line that's in the input CSV:
;;;superÆ/æ Ø/ø and even Å/å Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
What I get after decoding (second line of code):
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Following this question, I tried changing the code to read with a Scandinavian encoding instead:
string csvContentInUTF8x = Encoding.GetEncoding("iso-8859-1").GetString(dataToDecode);
The output is:
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
It looks exactly the same for Encoding.Default.
If what you wrote is correct, the text was corrupted before it was written to the CSV file.
Now... Encoding.GetEncoding("iso-8859-1") is an identity encoding that doesn't do any remapping: its 256 characters are mapped 1:1 to the first 256 (0-255) characters of Unicode.
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com
You see the � repeated six times? Normally each occurrence should be different, because you want six different characters (Æ/æ, Ø/ø, Å/å). But here they are always the same, and that is because � is the Unicode REPLACEMENT CHARACTER (U+FFFD), which is substituted when a byte sequence can't be decoded. So the error is already present in your dataToDecode.
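If you want to verify that programmatically, here's a small sketch (not from the original answer) that scans the decoded bytes for the UTF-8 encoding of U+FFFD; the class and method names are made up for illustration:

static class CsvDiagnostics
{
    // U+FFFD (REPLACEMENT CHARACTER) is the byte sequence EF BF BD in UTF-8.
    // If that sequence already exists in the decoded base64 payload, the text
    // was corrupted before it was base64-encoded, and no choice of Encoding
    // on the reading side can bring the original Æ/Ø/Å characters back.
    public static bool ContainsBakedInReplacementChar(byte[] dataToDecode)
    {
        for (int i = 0; i + 2 < dataToDecode.Length; i++)
        {
            if (dataToDecode[i] == 0xEF &&
                dataToDecode[i + 1] == 0xBF &&
                dataToDecode[i + 2] == 0xBD)
            {
                return true;
            }
        }
        return false;
    }
}

If that check comes back false, the bytes are probably just in a legacy codepage, and decoding with something like Encoding.GetEncoding(1252) is worth a try (on .NET Core you may first need to register CodePagesEncodingProvider); if it comes back true, the export that produced the CSV has to be fixed.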

C# WriteAllBytes ignores character encoding

I'm using the following code:
File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetBytes("THIS IS A TEST"));
Which should in theory write a UTF-8 file, but I just get an ANSI file. I also tried this, just to be especially verbose:
File.WriteAllBytes("c:\\test.xml", ASCIIEncoding.Convert(ASCIIEncoding.ASCII, UTF8Encoding.UTF8, Encoding.UTF8.GetBytes("THIS IS A TEST")));
Still the same issue though.
I am testing the output files by loading them in TextPad, which reads the format correctly (I tested with a sample file, as I know these things can be a bit weird sometimes).
WriteAllBytes isn't ignoring the encoding - rather: you already did the encoding, when you called GetBytes. The entire point of WriteAllBytes is that it writes bytes. Bytes don't have an encoding; rather: encoding is the process of converting from text (string here) to bytes (byte[] here).
UTF-8 is identical to ASCII for all ASCII characters - i.e. 0-127. All of "THIS IS A TEST" is pure ASCII, so the UTF-8 and ASCII for that are identical.
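If the goal is simply for editors such as TextPad to flag the file as UTF-8, writing a byte order mark is usually what those editors key on. A minimal sketch, assuming the paths from the question:

using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main()
    {
        // new UTF8Encoding(true) emits the 3-byte BOM (EF BB BF), which is what
        // most editors detect when the content itself is pure ASCII.
        File.WriteAllText(@"c:\test.xml", "THIS IS A TEST", new UTF8Encoding(true));

        // The WriteAllBytes equivalent: prepend the preamble to the encoded bytes yourself.
        var utf8WithBom = new UTF8Encoding(true);
        byte[] bytes = utf8WithBom.GetPreamble()
                                  .Concat(utf8WithBom.GetBytes("THIS IS A TEST"))
                                  .ToArray();
        File.WriteAllBytes(@"c:\test2.xml", bytes);
    }
}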

How do I remove invalid characters from UTF-8 encoded file?

Explanation:
I've come across an edge case when writing my web app. I accept UTF-8 files to be uploaded, and I've got a check in place to confirm it is UTF-8 encoded (or at least the best check possible, apparently there is no silver bullet, I'm aware there are many other questions on Stack Overflow for that specific issue).
As a test, I took an ANSI encoded file and converted it to UTF-8 by both (in separate tests) converting it to UTF-8 in Notepad++, and also by just decoding it as UTF-8 (even though it is ANSI) on the fly in C# using Encoding.UTF8.GetBytes(inputStream).
Where The Problem Arises:
Later on, I place the raw data of the file as one of the elements in an XML file. This is where the problem arises. It appears that a character has persisted from the ANSI file which (I assume) is not valid in UTF-8. When I try to load the XML using the following command...
XDocument xmlSample = XDocument.Load(outputPath);
I get this exception...
{"Invalid character in the given encoding. Line 10, position 14."}
Which looks like this in Visual Studio...
And like this in Notepad++...
Below is the character copy and pasted.
From NPP: ¡ From Visual Studio String Viewer: �
Question:
How can I remove invalid characters from UTF-8 encoded file, or at least discover them in a sane way so I can reject the file?
First, as to your example, the word “Temperature” suggests that the offending character is in fact the “degree” sign (°, Unicode 176), so that the full text reads “Temperature(°C)”. In this case the character would be coded as a \260 byte in ANSI and as the two bytes \302\260 in UTF-8. \260 (preceded by the left parenthesis in this case) is not valid UTF-8.
Second – if you are still interested after more than a year – could you clarify how you use Encoding.UTF8.GetBytes() to “decode a file as UTF-8”? GetBytes() reads characters, not bytes, and characters in C# do not have an encoding; the encoding has been applied when reading the file and converting it into characters. What UTF8.GetBytes() does is encode (not decode) the characters into a UTF-8 byte sequence.
In order to check an incoming byte sequence you might use Encoding.UTF8.GetChars() to decode your byte sequence into characters. Depending on the constructor you use, you can get a “cleaned-up” character string (with data loss if problems occurred) or receive a DecoderFallbackException on offending byte sequences, so you can reject the input.
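A rough sketch of that approach, with illustrative class and method names (the strict variant rejects the file, the lossy variant cleans it up with data loss):

using System.Text;

static class UploadValidation
{
    // Strict decode: throws DecoderFallbackException on any invalid UTF-8
    // sequence, so the caller can reject the file outright.
    public static string DecodeStrictUtf8(byte[] uploadedBytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        return strict.GetString(uploadedBytes);
    }

    // Lossy decode: invalid sequences become U+FFFD, which you can then strip
    // (accepting the data loss) or treat as a signal to reject the file.
    public static string DecodeLossyUtf8(byte[] uploadedBytes)
    {
        string text = Encoding.UTF8.GetString(uploadedBytes);
        return text.Replace("\uFFFD", string.Empty);
    }
}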

How do I read hex sequences like xD0 into a C# string?

I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a streamreader which takes the encoding to be UTF-8. All working well. But there are some characters used to replace the punctuation marks that just aren't working. I can see them as hex sequences in notepad++ (encoding set to UTF-8) but when I read them, they all get reduced down to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16; this is how they are stored in memory. Because of this you should be able to read the string into memory and replace the characters without any issues. You can then write those characters back to a file (UTF-8 is the default character encoding for reading and writing to a file, if I'm not mistaken). The ?'s just mean that the console you output the string to does not support those characters, or that the bytes are not in a valid encoding.
Here is a good article by Jon Skeet about C#/.NET strings.
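A rough sketch of that read/replace/write flow, assuming the file path from the question and a made-up output path:

using System.Collections.Generic;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Placeholder mapping in the spirit of the question's "e]" -> "ἓ" replaces.
        var replacements = new Dictionary<string, string>
        {
            { "e]", "ἓ" },
        };

        // Read and write with an explicit encoding; in memory the string is UTF-16,
        // so the replacement step itself is encoding-agnostic.
        string text = File.ReadAllText(@"C:\Users\John\Desktop\bgt.txt", Encoding.UTF8);
        foreach (var pair in replacements)
            text = text.Replace(pair.Key, pair.Value);

        File.WriteAllText(@"C:\Users\John\Desktop\bgt_unicode.txt", text, Encoding.UTF8);
    }
}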

Force C# to use ASCII

I'm working on an application in C# and need to read and write from a particular datafile format. The only issue at the moment is that the format uses strictly single byte characters, and C# keeps trying to throw in Unicode when I use a writer and a char array (which doubles filesize, among other serious issues). I've been working on modifying the code to use byte arrays instead, but that causes a few complaints when feeding them into a tree view and datagrid controls, and it involves conversions and whatnot.
I've spent a little time googling, and there doesn't seem to be a simple typedef I can use to force the char type to use byte for my program, at least not without causing extra complications.
Is there a simple way to force a C# .NET program to use ASCII-only and not touch Unicode?
Later, I got this almost working. Using the ASCIIEncoding on the BinaryReader/Writers ended up fixing most of the problems (a few issues with an extra character being prepended to strings occurred, but I fixed that up). I'm having one last issue, which is very small but could be big: in the file, a particular character (which prints as the euro sign) gets converted to a ? when I load/save the files. That's not much of an issue in text, but if it occurred in a record length it could change the size by kilobytes (not good, obviously). I think it's caused by the encoding, but if it came from the file, why won't it go back?
The precise problem/results are such:
Original file: 0x80 (euro)
Encodings:
- ASCII: 0x3F (?)
- UTF8: 0xC280 (A-hat euro)
Neither of those results will work, since anywhere in the file, it can change (if an 80 changed to 3F in a record length int, it could be a difference of 65*(256^3)). Not good. I tried using a UTF-8 encoding, figuring that would fix the issue pretty well, but it's now adding that second character, which is even worse.
C# (.NET) will always use Unicode for strings. This is by design.
When you read or write to your file, you can, however, use a StreamReader/StreamWriter set to force ASCII Encoding, like so:
StreamReader reader = new StreamReader(fileStream, new ASCIIEncoding());
Then just read using StreamReader.
Writing is the same, just use StreamWriter.
Internally, strings in .NET are always Unicode, but that really shouldn't be of much interest to you. If you have a particular format that you need to adhere to, then the route you went down (reading it as bytes) was correct. You simply need to use the System.Text.Encoding.ASCII class to do your conversions from string -> byte[] and byte[] -> string.
If you have a file format that mixes text in single-byte characters with binary values such as lengths, control characters, a good encoding to use is code page 28591 aka Latin1 aka ISO-8859-1.
You can get this encoding by using whichever of the following is the most readable:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
This encoding has the useful characteristic that byte values up to 255 are converted unchanged to the Unicode character with the same value (e.g. the byte 0x80 becomes the character U+0080).
In your scenario, this may be more useful than the ASCII encoding (which converts values in the range 0x80 to 0xFF to '?') or any of the other usual encodings, which will also convert some of the characters in this range.
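A small sketch of that round-trip property (the byte values are only examples):

using System;
using System.Text;

class Program
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);   // ISO-8859-1 / Latin1

        byte[] original = { 0x49, 0x80, 0xFF };          // includes the troublesome 0x80
        string asText = latin1.GetString(original);      // 0x80 becomes U+0080, value unchanged
        byte[] roundTripped = latin1.GetBytes(asText);

        Console.WriteLine(BitConverter.ToString(roundTripped)); // 49-80-FF: nothing lost
    }
}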
If you want this in .NET, you could use F# to make a library supporting this. F# supports ASCII strings, with a byte array as the underlying type, see Literals (F#) (MSDN):
let asciiString = "This is a string"B
