Have a look at this very simple piece of code:
private static void WriteTestLine(string file, Encoding encoding)
{
    // "Hello, world!" in Russian
    const string testLine = "Здравствуй, мир!";
    using (var streamWriter = new StreamWriter(path: file, append: false, encoding: encoding))
    {
        streamWriter.WriteLine(testLine);
    }
    using (var streamReader = new StreamReader(path: file, detectEncodingFromByteOrderMarks: true))
    {
        Console.WriteLine(streamReader.ReadLine());
    }
}
WriteTestLine("utf8", new UTF8Encoding(true));
WriteTestLine("utf7", new UTF7Encoding(true));
WriteTestLine("utf7woOptionals", new UTF7Encoding(false));
This code produces the following output:
Здравствуй, мир!
+BBcENARABDAEMgRBBEIEMgRDBDk-, +BDwEOARA-!
+BBcENARABDAEMgRBBEIEMgRDBDk-, +BDwEOARAACE-
As you can see, StreamReader was unable to detect UTF-7 encoding correctly. Indeed, if you open both UTF-7 files with a hex editor, in both cases the first four bytes will look like this:
2B 42 42 63
But Wikipedia states that UTF-7 should have a BOM that looks like one of the following:
2B 2F 76 38
2B 2F 76 39
2B 2F 76 2B
2B 2F 76 2F
2B 2F 76 38 2D
So I guess that .NET's StreamWriter is unable to work correctly with UTF-7. That seems odd. Am I missing something? How can I force StreamWriter to emit a BOM in UTF-7 files?
P.S. By the way:
Encoding.UTF7.GetPreamble();
produces an empty array
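One workaround sketch (my own idea, not an official API): since StreamWriter writes whatever the encoding's GetPreamble() returns at the start of a fresh stream, a subclass of UTF7Encoding can supply the missing preamble itself:

```csharp
using System.Text;

// A sketch of a workaround: subclass UTF7Encoding and supply the preamble
// that the built-in class leaves empty. StreamWriter writes the encoding's
// GetPreamble() bytes at the start of a new stream.
class UTF7EncodingWithBom : UTF7Encoding
{
    // "+/v8-": U+FEFF encoded in UTF-7 (bytes 2B 2F 76 38 2D)
    public override byte[] GetPreamble() =>
        new byte[] { 0x2B, 0x2F, 0x76, 0x38, 0x2D };
}
```

Note, though, that StreamReader's BOM auto-detection only recognizes the UTF-8, UTF-16 and UTF-32 marks, so reading the file back would still require passing a UTF7Encoding explicitly.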
I am reading a file to acquire a database of music files; it looks like this:
00 6F 74 72 6B 00 00 02 57 74 74 79 70 00 00 00 .otrk...Wttyp...
06 00 6D 00 70 00 33 70 66 69 6C 00 00 00 98 00 ..m.p.3pfil...~.
44 00 69............. D.i.....
Etc.; there could be hundreds to thousands of records in this file, all split at "otrk" into strings, since each "otrk" marks the start of a new track.
The problem lies in the layout above: every track starts with otrk and then has field identifiers, each followed by its length and value. For example, above:
ttyp = type, and the 06 following it is the length of the value, which is .m.p.3, i.e. 00 6D 00 70 00 33
The next field identifier is pfil = filename, and this is where the issue lies, specifically with the length byte: its value is 98, but when read into a string it becomes unrecognizable, defaulting to a diamond with a question mark (the replacement character) with a value of 239, which is wrong. How can I avoid this and read the correct value so I can display it properly?
My code to read the file:
db_file = File.ReadAllText(filePath, Encoding.UTF8);
and the code to split and sort through the file:
string[] entries = db_file.Split(new string[] { "otrk" }, StringSplitOptions.None);
public List<Songs> Songs { get; } = new List<Songs>();
foreach (string entry in entries)
{
    Songs.Add(Song.Create(entry));
}
Song.Create looks like:
public static Song Create(string dbString)
{
    Song toRet = new Song();
    for (int farthestReached = 0; farthestReached < dbString.Length;)
    {
        int startOfString = -1;
        int iLength = -1;
        byte[] b = Encoding.UTF8.GetBytes("0");
        // Gets the start index
        foreach (var l in labels)
        {
            startOfString = dbString.IndexOf(l, farthestReached);
            if (startOfString >= 0)
            {
                // get identifier index plus its length
                iLength = startOfString + 3;
                var valueIndex = iLength + 5;
                // get length of value
                string temp = dbString.Substring(iLength + 4, 1);
                b = Encoding.UTF8.GetBytes(temp);
                int xLen = b[0];
                // populate the label
                string fieldLabel = dbString.Substring(startOfString, l.Length);
                // populate the value
                string fieldValue = dbString.Substring(valueIndex, xLen);
                // set new
                farthestReached = xLen + valueIndex;
                switch (fieldLabel[0])
                {
                    case 'p':
                    case 't':
                        string stringValue = "";
                        foreach (char c in fieldValue)
                        {
                            if (c == 0)
                                continue;
                            stringValue += c;
                        }
                        assignStringField(toRet, fieldLabel, stringValue);
                        break;
                }
                break;
            }
        }
        // If a field was not found, there are no more fields
        if (startOfString == -1)
            break;
    }
    return toRet;
}
The file is not a UTF-8 file. The hex dump shown in the question makes that clear: it is not a UTF-8 file, nor a proper text file in any other text encoding. It rather looks like some binary (serialized) format, with data fields of different types.
You cannot reliably read a binary file naively as if it were a text file, especially since certain UTF-8 characters are represented by two or more bytes. A preceding binary byte that does not belong to a text field can coincidentally equal the start byte of a multi-byte UTF-8 sequence; the decoder will then try to decode a sequence that does not align with the text field and misinterpret or drop the field's first character(s).
Not only that, but certain byte values and byte sequences are not valid UTF-8 encodings of any character, and you would "lose" such bytes when trying to read them as UTF-8 text.
Also, since several bytes can form a single UTF-8 character, you cannot rely on every individual byte being turned into a character with the same ordinal value (even if the byte value happens to be a valid ASCII value): such a byte could be decoded merely as part of a multi-byte sequence into one character whose ordinal value is entirely different from the values of the bytes in that sequence.
That said, as far as I can tell, the text field in your little data snippet above does not look like UTF-8 at all. 00 6D 00 70 00 33 (*m*p*3) and 00 44 00 69 (*D*i) are definitely not UTF-8 -- note the zero bytes.
Thus, first consult the file format specification to figure out the actual text encoding used for the text fields in this file format. Don't guess. Don't assume. Don't believe. Look up, verify and confirm.
Secondly, since the file is not a proper text file (as already mentioned), you cannot read it like a text file with File.ReadAllText. Instead, read the raw byte data, for example with File.ReadAllBytes.
Find the otrk marker in the byte data of the file not as text, but by the 4 byte values this marker is made of.
Then, parse the byte data following the otrk marker according to the file format specification, and only decode the bytes that are actual text data into strings, using the correct text encoding as denoted by the file format specification.
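A minimal sketch of that marker scan over raw bytes (the file path is hypothetical, and the field parsing that follows each marker still depends on the real format specification; as a hint, the zero bytes in the dump look like UTF-16 big-endian text, so Encoding.BigEndianUnicode may be the right decoder for the text fields, but verify that against the spec):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class MarkerScan
{
    // Find every offset at which the marker bytes occur in the raw data.
    static List<int> FindMarkers(byte[] data, byte[] marker)
    {
        var offsets = new List<int>();
        for (int i = 0; i <= data.Length - marker.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < marker.Length && match; j++)
                match = data[i + j] == marker[j];
            if (match)
                offsets.Add(i);
        }
        return offsets;
    }

    static void Main()
    {
        // hypothetical path -- use your actual database file
        byte[] data = File.ReadAllBytes("database.db");
        byte[] otrk = { 0x6F, 0x74, 0x72, 0x6B }; // "otrk" as raw bytes
        foreach (int offset in FindMarkers(data, otrk))
            Console.WriteLine($"record starts at offset {offset}");
    }
}
```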
I am running into an issue building a program to process files provided. These files are XML files formatted with UTF-8. Oddly, some files end with 0x0A 0x00, and cause our XML parser to throw an error. I am looking to build a function to remove these bytes on the end of a file if they exist, without "hard coding" 0x0A 0x00. Ideally this function could be used in the future for any similar behavior with an array of any size.
Here is the exception:
System.Xml.XmlException:
hexadecimal value 0x00, is an invalid character. Line 250, position 1.
This occurs in some files, but not all. The root cause of this behavior is yet to be discovered.
I'm sorry, I do not have a code sample, as I have not been able to get anything remotely close to work :) I will edit this post if I get something somewhat working.
Any insight is appreciated!
Something like this should do the trick. Keep in mind, though, that it has no error handling built in; this is just the bare-bones functionality:
static void TrimFile(string filePath, byte[] badBytes)
{
    using (FileStream file = new FileStream(filePath, FileMode.Open, FileAccess.ReadWrite))
    {
        // a file shorter than the pattern cannot end with it
        if (file.Length < badBytes.Length)
            return;

        byte[] bytes = new byte[badBytes.Length];
        file.Seek(-badBytes.Length, SeekOrigin.End);
        file.Read(bytes, 0, badBytes.Length);

        if (Enumerable.SequenceEqual(bytes, badBytes))
        {
            file.SetLength(file.Length - badBytes.Length);
        }
    }
}
You can call it like this:
TrimFile(filePath, new byte[] { 0x0A, 0x00 });
Here's a test file I created with 0xCA 0xFE 0xFF 0xFF at the end (some bunk data)
62 75 6E 6B 20 66 69 6C 65 CA FE FF FF
bunk fileÊþÿÿ
After running TrimFile(filePath, new byte[] { 0xCA, 0xFE, 0xFF, 0xFF });
62 75 6E 6B 20 66 69 6C 65
bunk file
Hope this comes in handy!
I've modified an example to send & receive from serial, and that works fine.
The device I'm connecting to has three commands I need to work with.
My experience is with C.
MAP - returns a list of field_names, (decimal) values & (hex) addresses
I can keep track of which values are returned as decimal or hex.
Each line is terminated with CR
:: Example:
MEMBERS:10 - number of (decimal) member names
NAME_LENGTH:15 - (decimal) length of each name string
NAME_BASE:0A34 - 10 c-strings of (15) characters each starting at address (0x0A34) (may have junk following each null terminator)
etc.
GET hexaddr hexbytecount - returns a list of 2-char hex values starting from (hexaddr).
The returned bytes are a mix of bytes/ints/longs, and null terminated c-strings terminated with CR
:: Example::
get 0a34 10 -- will return
0A34< 54 65 73 74 20 4D 65 20 4F 75 74 00 40 D3 23 0B
This happens to be 'Test Me Out'(00) followed by junk
etc.
PUT hexaddr hexbytevalue {{value...} {value...}} sends multiple hex byte values separated by spaces starting at hex address, terminated by CR/LF
These bytes are a mix of bytes/ints/longs, and null terminated c-strings :: Example:
put 0a34 50 75 73 68 - (ascii Push)
Will replace the first 4-chars at 0x0A34 to become 'Push Me Out'
SAVED OK
See my previous answer about serial handling, which might be useful: Serial Port Polling and Data handling
to convert your response to actual text :-
var s = "0A34 < 54 65 73 74 20 4D 65 20 4F 75 74 00 40 D3 23 0B";
var hex = s.Substring(s.IndexOf("<") + 1).Trim().Split(new char[] {' '});
var numbers = hex.Select(h => Convert.ToInt32(h, 16)).ToList();
var toText = String.Join("",numbers.TakeWhile(n => n!=0)
.Select(n => Char.ConvertFromUtf32(n)).ToArray());
Console.WriteLine(toText);
which :-
skips through the string till after the < character, then splits the rest into hex strings
then, converts each hex string into ints ( base 16 )
then, takes each number till it finds a 0 and converts each number to text (using UTF32 encoding)
then, we join all the converted strings together to recreate the original text
alternatively, more condensed
var hex = s.Substring(s.IndexOf("<") + 1).Trim().Split(new char[] {' '});
var bytes = hex.Select(h => (byte) Convert.ToInt32(h, 16)).TakeWhile(n => n != 0);
var toText = Encoding.ASCII.GetString(bytes.ToArray());
for converting to hex from a number :-
Console.WriteLine(123.ToString("X"));  // 7B
Console.WriteLine(123.ToString("X4")); // 007B
Console.WriteLine(123.ToString("X8")); // 0000007B
Console.WriteLine(123.ToString("x4")); // 007b
also you will find playing with hex data is well documented at https://msdn.microsoft.com/en-us/library/bb311038.aspx
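Going the other way, the hex payload for a PUT command can be built from ASCII text in the same spirit (a sketch; the address 0a34 is taken from the example in the question):

```csharp
using System;
using System.Linq;
using System.Text;

class PutCommand
{
    static void Main()
    {
        // turn "Push" into space-separated hex bytes for the PUT command
        string text = "Push";
        string payload = string.Join(" ",
            Encoding.ASCII.GetBytes(text).Select(b => b.ToString("X2")));
        string command = $"put 0a34 {payload}";
        Console.WriteLine(command); // put 0a34 50 75 73 68
    }
}
```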
I am trying to build a GlyphRun object, and I can't seem to understand how GlyphIndices property is encoded.
For example, the XAML below renders "hello world". It is obvious that the following rules hold:
43 -> h
72 -> e
79 -> l
But what is this coding scheme? I tried ASCII and it doesn't seem to fit.
Do you have any idea?
<GlyphRunDrawing ForegroundBrush="Black">
<GlyphRunDrawing.GlyphRun>
<GlyphRun
CaretStops="{x:Null}"
ClusterMap="{x:Null}"
IsSideways="False"
GlyphOffsets="{x:Null}"
GlyphIndices="43 72 79 79 82 3 58 82 85 79 71"
BaselineOrigin="0,12.29"
FontRenderingEmSize="13.333333333333334"
DeviceFontName="{x:Null}"
AdvanceWidths="9.62666666666667 7.41333333333333 2.96 2.96 7.41333333333333 3.70666666666667 12.5866666666667 7.41333333333333 4.44 2.96 7.41333333333333"
BidiLevel="0">
<GlyphRun.GlyphTypeface>
<GlyphTypeface FontUri="C:\WINDOWS\Fonts\TIMES.TTF" />
</GlyphRun.GlyphTypeface>
</GlyphRun>
</GlyphRunDrawing.GlyphRun>
</GlyphRunDrawing>
Use this code to convert a Unicode code point to a glyph index:
GlyphTypeface glyphTypeface = new GlyphTypeface(new Uri(@"C:\WINDOWS\Fonts\TIMES.TTF"));
const char character = 'и';
ushort glyphIndex;
glyphTypeface.CharacterToGlyphMap.TryGetValue(character, out glyphIndex);
MessageBox.Show("Index = " + glyphIndex);
The sample above returns 610 for the Cyrillic и character.
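Extending that, the whole GlyphIndices sequence can be produced the same way. A sketch (WPF only, using the same font path as the question; the exact indices depend on the installed font version):

```csharp
using System;
using System.Linq;
using System.Windows.Media; // WPF: GlyphTypeface

class BuildGlyphIndices
{
    static void Main()
    {
        var typeface = new GlyphTypeface(new Uri(@"C:\WINDOWS\Fonts\TIMES.TTF"));

        // map each character to its glyph index via the font's character map
        ushort[] indices = "hello world"
            .Select(c => typeface.CharacterToGlyphMap[c])
            .ToArray();
        Console.WriteLine(string.Join(" ", indices));

        // AdvanceWidths are em-relative; multiply by the em size to get
        // the values used in the XAML's AdvanceWidths attribute
        double emSize = 13.333333333333334;
        double[] widths = indices
            .Select(i => typeface.AdvanceWidths[i] * emSize)
            .ToArray();
        Console.WriteLine(string.Join(" ", widths));
    }
}
```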
I am trying to replicate the encryption logic found in a Java library in a C# application.
The Java contains two methods which I have managed to replicate in C#. I get the same results in each program for any set of data.
createKey(byte data1[], MessageDigest md);
createIV(byte data2[], MessageDigest md);
The logic to generate the key and IV in Java is as follows:
public Cipher getCipher(byte[] password) {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    byte[] keyData = createKey(password, md);
    SecretKey secretKey =
        SecretKeyFactory.getInstance("DESede")
            .generateSecret(new DESedeKeySpec(keyData));
    IvParameterSpec ivSpec = createIV(secretKey.getEncoded(), md);
    Cipher cipher = Cipher.getInstance("DESede/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, secretKey, ivSpec);
    return cipher;
}
Let's say I have the following:
Java Key HEX: 9c 3a 79 df ba 49 86 0 ed 58 1 d8 9b a7 94 0 bb 3e 8f 80 4d 67 0 0
When I build the secretKey and then call secretKey.getEncoded() I get:
Java Encoded Key: : 9d 3b 79 df ba 49 86 1 ec 58 1 d9 9b a7 94 1 ba 3e 8f 80 4c 67 1 1
Because I don't know what the SecretKey is doing internally I don't know how to replicate this in C#.
My current C# code looks like this:
public static ICryptoTransform createCryptoTransform(String password)
{
    ICryptoTransform ct = null;
    byte[] keyData = createKey(password);
    byte[] ivData = createInitialisationVector(keyData);
    printByteArray("keyData", keyData);
    printByteArray("ivData", ivData);

    TripleDESCryptoServiceProvider tdcsp = new TripleDESCryptoServiceProvider();
    tdcsp.Key = keyData;  // This seems to be ignored by CreateEncryptor method below
    tdcsp.KeySize = 192;
    tdcsp.IV = ivData;    // This seems to be ignored by CreateEncryptor method below
    tdcsp.Mode = CipherMode.CBC;
    tdcsp.Padding = PaddingMode.PKCS7; // PKCS5 and PKCS7 provide the same padding scheme

    ct = tdcsp.CreateEncryptor(keyData, ivData);
    return ct;
}
As you can see, I'm using the ivData[] created from the unencoded key.
Everything works (I get the same encrypted result) if I pass the same IV data in when creating the encryptor, but unfortunately I cannot modify how the Java side generates its IVSpec.
What is SecretKey doing internally and how do I replicate this in C#?
DES (and DESede) derives 56 bits of key material from each 64-bit key; the least significant bit of every key byte is a parity check bit. Java's DESede key classes adjust those parity bits to odd parity when the key is encoded, which is exactly the difference you see between your input key and the output of secretKey.getEncoded(). The source code shows you how Java handles this; you can apply the same adjustment in C#.
See also the section at the beginning of FIPS 46-3.
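A sketch of that odd-parity adjustment in C# (it reproduces the transformation in the question's sample, e.g. 9c becomes 9d and 00 becomes 01):

```csharp
using System;
using System.Linq;

class DesParity
{
    // Set the least significant bit of each key byte so that every byte
    // has odd parity, as Java's DESede key classes do when encoding a key.
    static byte[] AdjustParity(byte[] key)
    {
        var adjusted = (byte[])key.Clone();
        for (int i = 0; i < adjusted.Length; i++)
        {
            // count the set bits among the top 7 bits of the byte
            int ones = 0;
            for (int bit = 1; bit < 8; bit++)
                if (((adjusted[i] >> bit) & 1) == 1) ones++;
            // choose the low bit so the total number of set bits is odd
            adjusted[i] = (byte)((adjusted[i] & 0xFE) | (ones % 2 == 0 ? 1 : 0));
        }
        return adjusted;
    }

    static void Main()
    {
        byte[] key = { 0x9c, 0x3a, 0x79, 0xdf, 0xba, 0x49, 0x86, 0x00 };
        Console.WriteLine(string.Join(" ",
            AdjustParity(key).Select(b => b.ToString("x2"))));
        // prints: 9d 3b 79 df ba 49 86 01 -- matching secretKey.getEncoded()
    }
}
```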