Encoding, decoding and re-encoding not producing original results

Encoding, decoding and re-encoding not producing original results - c#

I suspect that my decoding is not working properly. That is why I am testing it by encoding, decoding and the re-encoding to see if I am getting the same result. That is however not the case.
I encoded a byte[] named model.PDF to a base64 string.
Now, for decoding, I converted model.PDF to a decoded base64 string. However the output looks faulty or corrupted upon debugging and I suspect this is where something is going wrong.
To encode again, the decoded data is turned into byte[] again and then into an encoded base64 string. However base64EncodedData does not match plainTextEncodedData. Please help me create a flawless encode to decode to re-encode flow.
// ENCODING - Byte array -> base64 encoded string
string base64EncodedData = Convert.ToBase64String(model.PDF);
// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(model.PDF);
// ENCODING AGAIN
byte[] plainTextBytes = Encoding.UTF8.GetBytes(base64DecodedData);
var plainTextEncodedData = Convert.ToBase64String(plainTextBytes);
To elaborate, the re-encoding matches the initial encoding perfectly if executed like this.
var PDF = System.Text.Encoding.UTF8.GetBytes("redgreenblue");
string base64EncodedData = Convert.ToBase64String(PDF);
// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(PDF);
// ...
But, my model.PDF is fetched from the database as shown below, in which case the re-encoding does not match.
while (reader.Read()) {
model.PDF = reader["PDF"] == DBNull.Value ? null : (byte[])reader["PDF"];
}
On an online base64 decoder (https://www.base64decode.org/), decoding an example value of base64EncodedData shows the ideal and correct value.
%PDF-1.5
%
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30')
/ModDate(D:20190724114817+05'30') /Producer(Microsoft Excel 2013) /Creator(Microsoft Excel 2013) >>
endobj
4 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 5 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 171>>
stream
...
However, in my program, the value of base64DecodedData shows up in its entirety as:
%PDF-1.5
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30')
/ModDate(D:20190724114817+05'30') /Producer(��
The 2 look similar in ways but my program seems to be producing a corrupt version of what the actual base64 decoded string should be.

A PDF is an ASCII file that can contain binary data (including strings in other encodings).
So you cannot read it as plain text.
If a PDF file contains binary data, as most do [...] the header line
shall be immediately followed by a comment line containing at least
four binary characters—that is, characters whose codes are 128 or
greater.
Taken from this answer, which has some more infos
You see exactly these four characters in your own output.

Related

Why do I get a different value after turning an integer into ASCII and then back to an integer?

Why, when I turn INT value to bytes and to ASCII and back, I get another value?
Example:
var asciiStr = new string(Encoding.ASCII.GetChars(BitConverter.GetBytes(2000)));
var intVal = BitConverter.ToInt32(Encoding.ASCII.GetBytes(asciiStr), 0);
Console.WriteLine(intVal);
// Result: 1855

ASCII is only 7-bit - code points above 127 are unsupported. Unsupported characters are converted to ? per the docs on Encoding.ASCII:
The ASCIIEncoding object that is returned by this property might not have the appropriate behavior for your app. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
So 2000 decimal = D0 07 00 00 hexadecimal (little endian) = [unsupported character] [BEL character] [NUL character] [NUL character] = ? [BEL character] [NUL character] [NUL character] = 3F 07 00 00 hexadecimal (little endian) = 1855 decimal.

TL;DR: Everything's fine. But you're a victim of character replacement.
We start with 2000. Let's acknowledge, first, that this number can be represented in hexadecimal as 0x000007d0.
BitConverter.GetBytes
BitConverter.GetBytes(2000) is an array of 4 bytes, Because 2000 is a 32-bit integer literal. So the 32-bit integer representation, in little endian (least significant byte first), is given by the following byte sequence { 0xd0, 0x07, 0x00, 0x00 }. In decimal, those same bytes are { 208, 7, 0, 0 }
Encoding.ASCII.GetChars
Uh oh! Problem. Here's where things likely took an unexpected turn for you.
You're asking the system to interpret those bytes as ASCII-encoded data. The problem is that ASCII uses codes from 0-127. The byte with value 208 (0xd0) doesn't correspond to any character encodable by ASCII. So what actually happens?
When decoding ASCII, if it encounters a byte that is out of the range 0-127 then it decodes that byte to a replacement character and moves to the next byte. This replacement character is a question mark ?. So the 4 chars you get back from Encoding.ASCII.GetChars are ?, BEL (bell), NUL (null) and NUL (null).
BEL is the ASCII name of the character with code 7, which traditionally elicits a beep when presented on a capable terminal. NUL (code 0) is a null character traditionally used for representing the end of a string.
new string
Now you create a string from that array of chars. In C# a string is perfectly capable of representing a NUL character within the body of a string, so your string will have two NUL chars in it. They can be represented in C# string literals with "\0", in case you want to try that yourself. A C# string literal that represents the string you have would be "?\a\0\0" Did you know that the BEL character can be represented with the escape sequence \a? Many people don't.
Encoding.ASCII.GetBytes
Now you begin the reverse journey. Your string is comprised entirely of characters in the ASCII range. The encoding of a question mark is code 63 (0x3F). And the BEL is 7, and the NUL is 0. so the bytes are { 0x3f, 0x07, 0x00, 0x00 }. Surprised? Well, you're encoding a question mark now where before you provided a 208 (0xd0) byte that was not representable with ASCII encoding.
BitConverter.ToInt32
Converting these four bytes back to a 32-bit integer gives the integer 0x0000073f, which, in decimal, is 1855.

String encoding (ASCII, UTF8, SHIFT_JIS, etc.) is designed to pigeonhole human language into a binary (byte) form. It isn't designed to store arbitrary binary data, such as the binary form of an integer.
While your binary data will be interpreted as a string, some of the information will be lost, meaning that storing binary data in this way will fail in the general case. You can see the point where this fails using the following code:
for (int i = 0; i < 255; ++i)
{
var byteData = new byte[] { (byte)i };
var stringData = System.Text.Encoding.ASCII.GetString(byteData);
var encodedAsBytes = System.Text.Encoding.ASCII.GetBytes(stringData);
Console.WriteLine("{0} vs {1}", i, (int)encodedAsBytes[0]);
}
Try it online
As you can see it starts off well because all of the character codes correspond to ASCII characters, but once we get up in the numbers (i.e. 128 and beyond), we start to require a more than 7 bits to store the binary value. At this point it ceases to be decoded correctly, and we start seeing 63 come back instead of the input value.
Ultimately you will have this problem encoding binary data using any string encoding. You need to choose an encoding method specifically meant for storing binary data as a string.
Two popular methods are:
Hexadecimal
Base64 using ToBase64String and FromBase64String
Hexadecimal example (using the hex methods here):
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to hex
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = ByteArrayToString(bytesValue);
Console.WriteLine("As hex: {0}", stringValue); // outputs D0070000
// Convert form hex to bytes and then to int
byte[] decodedBytesValue = StringToByteArray(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
Try it online
Base64 example:
int initialValue = 2000;
Console.WriteLine(initialValue);
// Convert from int to bytes and then to base64
byte[] bytesValue = BitConverter.GetBytes(initialValue);
string stringValue = Convert.ToBase64String(bytesValue);
Console.WriteLine("As base64: {0}", stringValue); // outputs 0AcAAA==
// Convert form base64 to bytes and then to int
byte[] decodedBytesValue = Convert.FromBase64String(stringValue);
int intValue = BitConverter.ToInt32(decodedBytesValue, 0);
Console.WriteLine(intValue);
Try it online
P.S. If you simply wanted to convert your integer to a string (e.g. "2000") then you can simply use .ToString():
int initialValue = 2000;
string stringValue = initialValue.ToString();

C# Encoding.Default.GetBytes() method and Integer representation

I saw this code example:
using (FileStream fStream = File.Open(#"C:\myMessage.dat", FileMode.Create))
{
string msg = "Helloo";
byte[] msgAsByteArray = Encoding.Default.GetBytes(msg);
foreach (var a in msgAsByteArray)
{
Console.WriteLine($"a: {a}");
}
// Write byte[] to file.
fStream.Write(msgAsByteArray, 0, msgAsByteArray.Length);
// Reset internal position of stream.
fStream.Position = 0;
// Read the types from file and display to console.
Console.Write("Your message as an array of bytes: ");
byte[] bytesFromFile = new byte[msgAsByteArray.Length];
for (int i = 0; i < msgAsByteArray.Length; i++)
{
bytesFromFile[i] = (byte)fStream.ReadByte();
Console.Write(bytesFromFile[i]);
}
// Display decoded messages.
Console.Write("\nDecoded Message: ");
Console.WriteLine(Encoding.Default.GetString(bytesFromFile));
And the result of Console.WriteLine($"a: {a}") is this:
a: 72
a: 101
a: 108
a: 108
a: 111
a: 111
1.
I thought byte[] is composed of many each unit of byte.
But each byte is represented in integer number.
That numbers must be corresponding ASCII characters.
In C#, byte array means data represented in ASCII?
2.
Is the file myMessage.dat composed of binary data composed of only 0 and 1?
But when I open myMessage.dat with the text editor, it's showing Helloo text string. What's the reason for this?

A byte is a 8bit integer with values from 0 to 255. The output to console outputs the normal number, by providing a format string (https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-numeric-format-strings) you can output as hex. You can use this answer to get the binary representation.
You explicitly converted the "Halloo" to bytes with Encoding.Default.GetBytes() - that is kindof like converting it to its ascii value but heeding the default encoding on your system.
Your texteditor interpretes the data of the file and displays it as it can. If you put a byte[] myBytes = new [] {0,7,12,3,9,30} into a file and open that with your textedit you will get nonreadable texts as "normal text" starts around 32 , before are f.e. tabs, bells, line feeds and other special non printable characters. See f.e. NonPrintableAscii

Convert pdf file received in string variable to byte array in C#

I am trying to develop an application in C# which takes data from Service1(3rd party), processes it and then sends data to Service2(again 3rd party).
The data I am trying to receive, process and send is a pdf file
From Service1, I am receiving pdf file in a string variable.
e.g.
response.Content = "%PDF-1.4 \n1 0 obj\n<<\n/Pages 2 0 R\n/Type /Catalog\n>>\nendobj\n2 0 obj\n<<\n/Type /Pages\n/Kids [ 3 0 R 17 0 R ]\n/Count 2\n>>\nendobj\n3 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources <<\n/XObject << /Im0 8 0 R >>\n/ProcSet 6 0 R
>>\n/MediaBox [0 0 ..."
Now, Service2 requires PDF data to be in byte[] form
How do I convert response.Content i.e. string to byte[]?
FYI -
I have tried below method, but it didn't work. I am able to open file but it shows all junk values.
byte[] bitmapData = Encoding.UTF8.GetBytes(response.Content);

I just found that it was not Service1 which was sending data in string format, it's just I was using IRestResponse.Content, a string variable from RestSharp instead of using IRestResponse.RawBytes, a byte[] variable.

c# Convert.To/FromBase64String confusion

Assuming I have this Method.
private static void Example(string data)
{
Console.WriteLine("Initial : {0}", data);
data = data.PadRight(data.Length + 1, '0');
Console.WriteLine("Step 1 : {0}", data);
data = data.PadRight(data.Length + 4 - data.Length % 4, '=');
Console.WriteLine("Step 2 : {0}", data);
byte[] byteArray = Convert.FromBase64String(data);
string newData = Convert.ToBase64String(byteArray);
Console.WriteLine("Step 3 : {0}", newData);
}
I expect the output given the input string "1" to be as follows
Initial : 1
Step 1 : 10
Step 2 : 10==
Step 3 : 10==
Instead the output is this.
Initial : 1
Step 1 : 10
Step 2 : 10==
Step 3 : 1w==
And I have no idea why. I would expect the output to be the same as the input but it isn't.
I have tried replacing
data = data.PadRight(data.Length + 1, '0');
with
data = data + "0";
It appears with longer input strings too, for example strings with a length of 5 or 9. It works fine if I add "=" but then I exceed my padding limit with Convert.FromBase64String()
So my question is really what is going on and how can I get my expected output,?
What am I doing wrong?
Edit: For those confused as to why I'm using bas64 it is related to this PHP decrypting data with RSA Private Key

Basically, there's no byte array which would be encoded to 10==.
If a base64 string ends with ==, that means that the final 4 characters only represent a single byte. So only the first character and the first 2 bits of the second character are relevant. Looking at the Wikipedia table, 10 means values of:
'1' = 53 '0' = 52
110101 110100
So that's encoding a byte of 1101 0111, and then the final four bits (0100) are ignored. When you re-encode the data, it's using 0s for the final four bits instead, giving:
'1' = 53 'w' = 48
110101 110000
Fundamentally, it's not clear what you're trying to do - but if your input is part of a base64-encoded value, that's pretty odd. The code is behaving the way I'd expect it to - it's just not useful code...

Stream, string and null character

I have a stream which contains several \0 inside it. I have to replace textual parts of this stream, but when I do
StreamReader reader = new StreamReader(stream);
string text = reader.ReadToEnd();
text only contains the beginning of the stream (because of the \0 character). So
text = text.Replace(search, replace);
StreamWriter writer = new StreamWriter(stream);
writer.Write(text);
will not do the expected job since I don't parse the "full" stream. Any idea on how to get access to the full data and replace some textual parts ?
EDIT : An example of what I see on notepad
stream
H‰—[oã6…ÿÛe)Rêq%ÙrlËñE±“-úàÝE[,’íKÿþŽDjxÉ6ŒÅ"XkÏáGqF að÷óð!SN>¿¿‰È†/$ËÙpñ<^HVÀHuñ'¹¿à»U?`äŸ?
¾fØø(Ç,ükøéàâ+ùõ7øø2ÜTJ«¶Ïäd×SÿgªŸF_ß8ÜU#<Q¨|œp6åâ-ªÕ]³®7Ûn¹ÚÝï½œ‰,¨¹^ãI©…Ë<UIÐI‡Û©* Ç¼,,ý¬5O->qä›Ü
endstream 
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Verdana
/Ascent 765
/Descent -207
/CapHeight 1489
/Flags 32
/ItalicAngle 0
/StemV 86
/StemH 0
/FontBBox [ -560 -303 1523 1051 ]
/FontFile2 31 0 R
>>
endobj
9 0 obj
And I want to replace /FontName /Verdana by /FontName /Arial on the fly, for example.

Ah, now we're getting to it...
This file a pdf
Then it's not a text file. That's a binary file, and should be treated as a binary file. Using StreamReader on it will lose data. You'll need to use a different API to access the data in it - one which understands the PDF format. Have a look at iTextSharp or PDFTron.

I can't duplicate your results. The code below creates a string with a \0 in it, writes to file, and then reads it back. The resulting string has the \0 in it:
string s = "hello\x0world";
File.WriteAllText("foo.txt", s);
string t;
using (var f = new StreamReader("foo.txt"))
{
t = f.ReadToEnd();
}
Console.WriteLine(t == s); // prints "True"
I get the same results if I do var t = File.ReadAllText("foo.txt");

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.