Stream, string and null character - c#

I have a stream which contains several \0 inside it. I have to replace textual parts of this stream, but when I do
StreamReader reader = new StreamReader(stream);
string text = reader.ReadToEnd();
text only contains the beginning of the stream (because of the \0 character). So
text = text.Replace(search, replace);
StreamWriter writer = new StreamWriter(stream);
writer.Write(text);
will not do the expected job since I don't parse the "full" stream. Any idea on how to get access to the full data and replace some textual parts ?
EDIT : An example of what I see on notepad
stream
H‰­—[oã6…ÿÛe)Rêq%ÙrlËñE±“-úàÝE[,’íKÿþŽDjxÉ6ŒÅ"XkÏáGqF að÷óð!SN>¿¿‰È†/$ËÙpñ<^HVÀHuñ'¹¿à»U?`äŸ?
¾fØø(Ç,ükøéàâ+ùõ7øø2ÜTJ«¶Ïäd×SÿgªŸF_ß8ÜU#<Q¨|œp6åâ-ªÕ]³®7Ûn¹ÚÝ|‰,¨¹^ãI©…Ë<UIÐI‡Û©* Ǽ,,ý¬5O->qä›Ü
endstream 
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Verdana
/Ascent 765
/Descent -207
/CapHeight 1489
/Flags 32
/ItalicAngle 0
/StemV 86
/StemH 0
/FontBBox [ -560 -303 1523 1051 ]
/FontFile2 31 0 R
>>
endobj
9 0 obj
And I want to replace /FontName /Verdana by /FontName /Arial on the fly, for example.

Ah, now we're getting to it...
This file a pdf
Then it's not a text file. That's a binary file, and should be treated as a binary file. Using StreamReader on it will lose data. You'll need to use a different API to access the data in it - one which understands the PDF format. Have a look at iTextSharp or PDFTron.

I can't duplicate your results. The code below creates a string with a \0 in it, writes to file, and then reads it back. The resulting string has the \0 in it:
string s = "hello\x0world";
File.WriteAllText("foo.txt", s);
string t;
using (var f = new StreamReader("foo.txt"))
{
t = f.ReadToEnd();
}
Console.WriteLine(t == s); // prints "True"
I get the same results if I do var t = File.ReadAllText("foo.txt");

Related

Encoding, decoding and re-encoding not producing original results

I suspect that my decoding is not working properly. That is why I am testing it by encoding, decoding and the re-encoding to see if I am getting the same result. That is however not the case.
I encoded a byte[] named model.PDF to a base64 string.
Now, for decoding, I converted model.PDF to a decoded base64 string. However the output looks faulty or corrupted upon debugging and I suspect this is where something is going wrong.
To encode again, the decoded data is turned into byte[] again and then into an encoded base64 string. However base64EncodedData does not match plainTextEncodedData. Please help me create a flawless encode to decode to re-encode flow.
// ENCODING - Byte array -> base64 encoded string
string base64EncodedData = Convert.ToBase64String(model.PDF);
// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(model.PDF);
// ENCODING AGAIN
byte[] plainTextBytes = Encoding.UTF8.GetBytes(base64DecodedData);
var plainTextEncodedData = Convert.ToBase64String(plainTextBytes);
To elaborate, the re-encoding matches the initial encoding perfectly if executed like this.
var PDF = System.Text.Encoding.UTF8.GetBytes("redgreenblue");
string base64EncodedData = Convert.ToBase64String(PDF);
// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(PDF);
// ...
But, my model.PDF is fetched from the database as shown below, in which case the re-encoding does not match.
while (reader.Read()) {
model.PDF = reader["PDF"] == DBNull.Value ? null : (byte[])reader["PDF"];
}
On an online base64 decoder (https://www.base64decode.org/), decoding an example value of base64EncodedData shows the ideal and correct value.
%PDF-1.5
%
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30')
/ModDate(D:20190724114817+05'30') /Producer(Microsoft Excel 2013) /Creator(Microsoft Excel 2013) >>
endobj
4 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 5 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 171>>
stream
...
However, in my program, the value of base64DecodedData shows up in its entirety as:
%PDF-1.5
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30')
/ModDate(D:20190724114817+05'30') /Producer(��
The 2 look similar in ways but my program seems to be producing a corrupt version of what the actual base64 decoded string should be.
A PDF is an ASCII file that can contain binary data (including strings in other encodings).
So you cannot read it as plain text.
If a PDF file contains binary data, as most do [...] the header line
shall be immediately followed by a comment line containing at least
four binary characters—that is, characters whose codes are 128 or
greater.
Taken from this answer, which has some more infos
You see exactly these four characters in your own output.

C# Encoding.Default.GetBytes() method and Integer representation

I saw this code example:
using (FileStream fStream = File.Open(#"C:\myMessage.dat", FileMode.Create))
{
string msg = "Helloo";
byte[] msgAsByteArray = Encoding.Default.GetBytes(msg);
foreach (var a in msgAsByteArray)
{
Console.WriteLine($"a: {a}");
}
// Write byte[] to file.
fStream.Write(msgAsByteArray, 0, msgAsByteArray.Length);
// Reset internal position of stream.
fStream.Position = 0;
// Read the types from file and display to console.
Console.Write("Your message as an array of bytes: ");
byte[] bytesFromFile = new byte[msgAsByteArray.Length];
for (int i = 0; i < msgAsByteArray.Length; i++)
{
bytesFromFile[i] = (byte)fStream.ReadByte();
Console.Write(bytesFromFile[i]);
}
// Display decoded messages.
Console.Write("\nDecoded Message: ");
Console.WriteLine(Encoding.Default.GetString(bytesFromFile));
And the result of Console.WriteLine($"a: {a}") is this:
a: 72
a: 101
a: 108
a: 108
a: 111
a: 111
1.
I thought byte[] is composed of many each unit of byte.
But each byte is represented in integer number.
That numbers must be corresponding ASCII characters.
In C#, byte array means data represented in ASCII?
2.
Is the file myMessage.dat composed of binary data composed of only 0 and 1?
But when I open myMessage.dat with the text editor, it's showing Helloo text string. What's the reason for this?
A byte is a 8bit integer with values from 0 to 255. The output to console outputs the normal number, by providing a format string (https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-numeric-format-strings) you can output as hex. You can use this answer to get the binary representation.
You explicitly converted the "Halloo" to bytes with Encoding.Default.GetBytes() - that is kindof like converting it to its ascii value but heeding the default encoding on your system.
Your texteditor interpretes the data of the file and displays it as it can. If you put a byte[] myBytes = new [] {0,7,12,3,9,30} into a file and open that with your textedit you will get nonreadable texts as "normal text" starts around 32 , before are f.e. tabs, bells, line feeds and other special non printable characters. See f.e. NonPrintableAscii

Weird behavior when converting byte from text box to byte array to characters?

I have a textbox that I use to convert things like:
74 00 65 00 73 00 74 00
Back into a string, the above says "test" but for some reason when I click the convert button it will display only the first letter "t" 74 00 and other byte arrays work just as expected, the entire text is converted.
Here is the 2 codes I have tried which produce the same behavior of not properly converting the entire byte array back to word:
byte[] bArray = ByteStrToByteArray(iSequence.Text);
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
which uses the function:
private byte[] ByteStrToByteArray(string byteString)
{
byteString = byteString.Replace(" ", string.Empty);
byte[] buffer = new byte[byteString.Length / 2];
for (int i = 0; i < byteString.Length; i += 2)
buffer[i / 2] = (byte)Convert.ToByte(byteString.Substring(i, 2), 16);
return buffer;
}
another way I was using is:
string str = iSequence.Text.Replace(" ", "");
byte[] bArray = Enumerable.Range(0, str.Length)
.Where(x => x % 2 == 0)
.Select(x => Convert.ToByte(str.Substring(x, 2), 16))
.ToArray();
ASCIIEncoding enc = new ASCIIEncoding();
string word = enc.GetString(bArray);
iResult.Text = word + Environment.NewLine;
Tried checking for the lengths to see if it was iterating thru and it was ...
Don't really know how to debug why this is happenning to the above byte array but all the other byte arrays seemed to be working just fine only this one is outputing only the first letter of it.
Have I done something wrong that could produce this behavior some how ?
What could I try in order to find out what is wrong ?
If you have the byte sequence
var bytes = new byte[] { 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00 };
and you decode it to a string using ASCII encoding (Encoding.ASCII), then you get
var result = Encoding.ASCII.GetString(bytes);
// result == "\x74\x00\x65\x00\x73\x00\x74\x00" == "t\0e\0s\0t\0"
Notice the Null \0 characters? When you display such a string in a textbox, only the part of the string until the first Null character is displayed.
Since you say the result should read "test", the input is actually not encoded in ASCII but in UTF-16LE (Encoding.Unicode).
var result = Encoding.Unicode.GetString(bytes);
// result == "\u0074\u0065\u0073\u0074" == "test"
your converting a unicode string to ascii , your not specifying the codepage on your machine to convert from.
System.Text.Encoding.GetEncoding("codepage").GetString()
if my memory serves me correct. Also to note, any control in .NET is unicode ... Soooooo.... what your trying to stick in the text box (if the conversion isent correct) could be an end of line character .. or eof, or any kind of control character. all depends on your codepage.
I tried debugging the first program using breakpoints in VS2010. I found out that the line
string word = enc.GetString(bArray);
output word as "t\0e\0s\0t".
The last line
iResult.Text = word + Environment.NewLine;
gives iResult.Text as simply "t".
So I was thinking since \0 is not a valid escape sequence, the compiler ignored everything after it. Could be wrong though but try removing all occurrences of 00 in the input string.
I'm not really into C#. I'm only suggesting this because it looks like C++.
It works for me:
string outputText = "t\0e\0s\0t";
outputText = outputText.Replace("\0", " ");

Create CSV file from c#: extra character in excel

I create a CSV file from a C# application but the characters  are displayed in Excel and OpenOfficeCalc in the first cell but not in Notepad and Notepad++.
Here is my code:
StreamWriter streamWriter = new StreamWriter(new FileStream(filePath, FileMode.Create), Encoding.UTF8);
List<MyData> myData = GetMyData;
foreach(MyData md in myData)
{
streamWriter.WriteLine(md.Date + "," + md.Data1 + "," + md.Data2 + "," + md.Data3 + "," + md.Data4);
}
streamWriter.Flush();
streamWriter.Close();
MyData is
public struct MyData
{
public float Data1;
public float Data2;
public float Data3;
public float Data4;
public DateTime Date;
}
Here is the result in Notepad and Notepad++:
01/12/2010 00:04:00,0.08,78787.4,9.1,5
01/12/2010 00:09:00,0.07,78787.42,9.1,5
01/12/2010 00:14:00,0.06,78787.44,9.1,5
01/12/2010 00:19:00,1.45,78787.58,9.1,5
01/12/2010 00:24:00,2.13,78788.15,9.1,5
01/12/2010 00:29:00,1.72,78788.53,9,5
01/12/2010 00:34:00,0.89,78788.73,9,5
And in Excel and Calc:
01/12/2010 00:04:00 0.08 78787.4 9.1 5
01/12/10 00:09 0.07 78787.42 9.1 5
01/12/10 00:14 0.06 78787.44 9.1 5
01/12/10 00:19 1.45 78787.58 9.1 5
01/12/10 00:24 2.13 78788.15 9.1 5
01/12/10 00:29 1.72 78788.53 9 5
01/12/10 00:34 0.89 78788.73 9 5
Those 3 characters appears only once at the start of the file and then everything is as it should be.
My question is:
Where does  come from and how to remove it?
I have tried to write my output in a StringBuilder and to debug to see its content and it does not have those character.
This is the BOM for UTF8 encoding
Look at this question: Write text files without Byte Order Mark (BOM)?
You may want to specify the encoding type based on your input.
StreamWriter.Encoding Property
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.encoding.aspx
I think that is BOM (unicode header). Specify ANSI encoding in constructor of StreamWriter. It is simple: Just set it to Encoding.Default.
update:
If you must use UTF-8 and need to get rid of those 3 bytes at the beginning, use this: new UTF8Encoding(false). This way the writer will stay in UTF8, but won't write BOM.
Try to specify Encoding.ASCII instead of using Encoding.UTF8 when you create the StreamWriter

Read .dat file in c#

i want to read a .dat file that have following conditions:-
First name offset 21
First name format ASCIIz 15 chars + \0
Middle initials offset 37
ID offset -8
ID format/length Unsigned int (4 bytes)
so help me for sorting this issue in c#.
Thanks in advance.
Gurpreet
.dat file
( ÿ / rE ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ XÙþÞ¦d e e Mr. Sam Ascott Sam 9209 Sandpiper Lane 21204 410 5558987 410 5556700 275 MM229399098 (¬ Þ e ܤ•Þ„ œÔ£ÝáØáØ ’Þ[Þ €–˜ ä–˜ [Þ ¶ Norman Eaton Friend of Dr. Shultz Removal of #1,16,17 & 32 öÜÝ)Ý Ä d 01 21 21 21 e 101 22099 XÙþÞ¦d e . Mrs. Patty Baxter Patty 3838 Tommytrue Court 21234 410 2929290 410 3929209 FM218798127 HAY FEVER Þ . „¤¢Þè   _ÐÍÝBÒBÒ ’ÞÝ €–˜ ä–˜ ÍÝ f Joanne Abbey
Here is a tutorial how to use BinaryReader for this purpose:
http://dotnetperls.com/binaryreader
You can use Jet OleDB to query .dat files:
var query = "select * from file.dat";
var connection = new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\\file.dat;Extended Properties=\"text;HDR=NO;FMT=FixedLength\"");
See this link:
Code Project. Read Text File (txt, csv, log, tab, fixed length)
And check these:
Reading a sequential-access file
DAT files in C#
Read a file in C#
BinaryReader Class
Jon Skeet. Reading binary data in C#
As said, have a look at the BinaryReader
//Example...
BinaryReader reader = new BinaryReader(stream);
string name = Encoding.ASCII.GetString(reader.ReadBytes(8));
int number = reader.ReadInt32();
my .dat file content - " €U§µ­PÕ „ÕG¬u "
click here to see the content of my .dat file in Notepad and hexa editor
string fileName = #"W:\yourfilename.dat";
//Read the binary file as byte array
byte[] bHex = File.ReadAllBytes(fileName);
//Create string builder for extracting the HEX values
StringBuilder st = new StringBuilder();
//initialize the int for 0
int i = 0;
// check it worked
//Reverse the HEX array for readability
foreach (char c in bHex.Reverse())
{
i++;
// 12 to 21 byte in the reverse order for interseted value in ticks"
if (i > 12 && i < 21)
st.Append(Convert.ToInt32(c).ToString("X2"));
}
// Convert HEX to Deciamal
long Output = Convert.ToInt64(st.ToString(), 16);
//Convert ticks to date time
DateTime dt = new DateTime(Output);
//Write the output date to console
Console.Write(dt);
Final de-crypt binary content to data-time.
final output of the program

Categories

Resources