Character Textfile to binary Textfile - c#

I have a text file that contains a set of records, and I am trying to convert it and save it as 1s and 0s. Every time I use
Byte [] arr=Encoding.UTF8.GetBytes(recordss) ;
and write the result with a writer, I still get the same record file with no difference.
So my question is: is there a way to convert a string to binary and write it to a file in binary format? I am using C#, by the way.
Here is my code so far:
public static void serialData()
{
    FileStream recFile = new FileStream("Records.txt", FileMode.Open, FileAccess.ReadWrite); //file to be used for records
    StreamReader recordRead = new StreamReader(recFile);
    String recordss = recordRead.ReadToEnd(); //Reads Record file
    recordRead.Close();
    recFile.Close();
    Byte [] arr=Encoding.UTF8.GetBytes(recordss) ;
    FileStream file = new FileStream("Temp.txt", FileMode.Create, FileAccess.Write);
    StreamWriter binfile = new StreamWriter(file);
    for(int i =0; i < arr.Count();i++)
        binfile.WriteLine(arr[i]);
    binfile.Close();
    file.Close();
}

There's a built-in function to convert from integer-type values to strings with binary representation. Try replacing the line
binfile.WriteLine(arr[i]);
by this line
binfile.WriteLine(
Convert.ToString(arr[i], 2)
);
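Convert.ToString() converts the input to its representation in the given base. Here the base is 2, for binary; the other supported bases are 8 (octal), 10 (decimal), and 16 (hexadecimal). Note that the result is not zero-padded: the byte 5 becomes "101", so call PadLeft(8, '0') on the result if you want a fixed eight digits per byte.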
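A minimal sketch of the padded variant, in case fixed-width output is wanted. The class and method names (BinaryText, ByteToBits) are made up for illustration and are not part of the original answer:

```csharp
using System;

static class BinaryText
{
    // Renders one byte as exactly eight '0'/'1' characters.
    // Convert.ToString(b, 2) drops leading zeros, so pad on the left.
    public static string ByteToBits(byte b)
    {
        return Convert.ToString(b, 2).PadLeft(8, '0');
    }
}
```

In the question's loop this would be used as binfile.WriteLine(BinaryText.ByteToBits(arr[i]));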

Your result is in 'byte' format. Always. By definition, it is data. The way you 'see' it depends on the software you use to open it.
What you probably want is a file that, when opened in a text editor, shows the underlying binary data of your original source as text. For that, you have to write the characters '0' and '1' to the file, so the final file will be much bigger than the original (eight characters per byte).
Change this code:
for(int i =0; i < arr.Count();i++)
binfile.WriteLine(arr[i]);
Into this:
foreach (byte b in arr)
{
    binfile.Write((b >> 7 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 6 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 5 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 4 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 3 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 2 & 1) == 0 ? '0' : '1');
    binfile.Write((b >> 1 & 1) == 0 ? '0' : '1');
    binfile.Write((b & 1) == 0 ? '0' : '1');
}
But it is kind of ugly. You are better off using a hexadecimal file viewer.
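The eight near-identical lines can be folded into an inner loop over the bit positions. The sketch below assumes you want the whole result as one string; BitDump, ToBits and FromBits are hypothetical names, and FromBits is added only to show that the text form can be turned back into the original bytes:

```csharp
using System;
using System.Text;

static class BitDump
{
    // Appends the eight bits of each byte, most significant bit first.
    public static string ToBits(byte[] data)
    {
        var sb = new StringBuilder(data.Length * 8);
        foreach (byte b in data)
        {
            for (int bit = 7; bit >= 0; bit--)
            {
                sb.Append(((b >> bit) & 1) == 0 ? '0' : '1');
            }
        }
        return sb.ToString();
    }

    // Reverses ToBits: every group of eight characters becomes one byte.
    public static byte[] FromBits(string bits)
    {
        if (bits.Length % 8 != 0)
            throw new ArgumentException("Length must be a multiple of 8.", nameof(bits));

        var result = new byte[bits.Length / 8];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = Convert.ToByte(bits.Substring(i * 8, 8), 2);
        }
        return result;
    }
}
```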

Can the encoding difference for printable characters between utf-8 and Latin-1 be resolved?

I read that there should be no difference between Latin-1 and UTF-8 for printable characters. I thought that a latin-1 'Ä' would map twice into utf-8.
Once to the Multi byte Version and once directly.
Why does it seem like this is not the case?
It certainly seems like the standard could include anything that looks like a continuation byte but is not a continuation as the meaning within latin-1 without losing anything.
Am I just missing a flag or something that would allow me to convert the data like described, or am I missing the bigger picture?
Here is a C# example; on my system the Latin-1 test fails:
static void Main(string[] args)
{
    DecodeTest("ascii7", " ~", new byte[] { 0x20, 0x7E });
    DecodeTest("Latin-1", "Ä", new byte[] { 0xC4 });
    DecodeTest("UTF-8", "Ä", new byte[] { 0xc3, 0x84 });
}
private static void DecodeTest(string testname, string expected, byte[] encoded)
{
    var utf8 = Encoding.UTF8;
    string ascii7_actual = utf8.GetString(encoded, 0, encoded.Length);
    //Console_Write(encoded);
    AssertEqual(testname, expected, ascii7_actual);
}
private static void AssertEqual(string testname, string expected, string actual)
{
    Console.WriteLine("Test: " + testname);
    if (actual != expected)
    {
        Console.WriteLine("\tFail");
        Console.WriteLine("\tExpected: '" + expected + "' but was '" + actual + "'");
    }
    else
    {
        Console.WriteLine("\tPass");
    }
}
private static void Console_Write(byte[] ascii7_encoded)
{
    bool more = false;
    foreach (byte b in ascii7_encoded)
    {
        if (more)
        {
            Console.Write(", ");
        }
        Console.Write("0x{0:X}", b);
        more = true;
    }
}
"I read that there should be no difference between Latin-1 and UTF-8 for printable characters."
You read wrong. There is no difference between Latin-1 (and many other encodings, including the rest of the ISO 8859 family) and UTF-8 for characters in the US-ASCII range (U+0000 to U+007F). They are different for all other characters.
"I thought that a latin-1 'Ä' would map twice into utf-8. Once to the Multi byte Version and once directly."
For this to be possible, UTF-8 would have to be stateful or otherwise use information from earlier in the stream to know whether to interpret an octet as a direct mapping or as part of a multibyte encoding. One of the great advantages of UTF-8 is that it is not stateful.
"Why does it seem like this is not the case?"
Because it's just plain wrong.
"It certainly seems like the standard could include anything that looks like a continuation byte but is not a continuation as the meaning within latin-1 without losing anything."
It couldn't do so without losing the quality of not being stateful, which would mean corruption would destroy the entire text following the error rather than just one character.
"Am I just missing a flag or something that would allow me to convert the data like described, or am I missing the bigger picture?"
No, you just have a completely incorrect idea about how UTF-8 and/or Latin-1 works.
A flag would remove UTF-8's simplicity in being non-stateful and self-synchronising (you can always tell immediately if you are at a single-octet character, the start of a character or part-way into a character) as mentioned above. It would also remove UTF-8's simplicity in being algorithmic. All UTF-8 encodings map as follows.
To map from code-point to encoding:
Consider the bits of the code point, xxxx…; e.g. for U+0027 they are 100111, and for U+1F308 they are 11111001100001000.
Find the smallest of the following they will fit into:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So U+0027 becomes 00100111, i.e. 0x27, and U+1F308 becomes 11110000 10011111 10001100 10001000, i.e. 0xF0 0x9F 0x8C 0x88.
To go from octets to code-points you undo this.
To map to Latin 1, you just put the code point into a single octet (which obviously only works for characters in the range U+0000 to U+00FF).
As you can see, there's no way that a character outside the range U+0000 to U+007F can have matching encodings in UTF-8 and Latin-1. ("Latin 1" is also commonly used as a name for CP-1252, a Microsoft encoding that adds further printable characters, but still only a tiny fraction of those covered by UTF-8.)
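The mapping described above can be written out by hand. This is only an illustrative sketch of the bit-packing (Utf8Sketch and Encode are made-up names); real code should use Encoding.UTF8 rather than a hand-rolled encoder:

```csharp
using System;

static class Utf8Sketch
{
    // Packs a Unicode code point into the smallest of the four UTF-8 layouts.
    public static byte[] Encode(int codePoint)
    {
        bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0 || codePoint > 0x10FFFF || isSurrogate)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        if (codePoint <= 0x7F)          // 0xxxxxxx
            return new[] { (byte)codePoint };

        if (codePoint <= 0x7FF)         // 110xxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xC0 | (codePoint >> 6)),
                (byte)(0x80 | (codePoint & 0x3F))
            };

        if (codePoint <= 0xFFFF)        // 1110xxxx 10xxxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xE0 | (codePoint >> 12)),
                (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
                (byte)(0x80 | (codePoint & 0x3F))
            };

        return new[]                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        {
            (byte)(0xF0 | (codePoint >> 18)),
            (byte)(0x80 | ((codePoint >> 12) & 0x3F)),
            (byte)(0x80 | ((codePoint >> 6) & 0x3F)),
            (byte)(0x80 | (codePoint & 0x3F))
        };
    }
}
```

For 'Ä' (U+00C4) this yields the two octets 0xC3 0x84, whereas Latin-1 stores the same character as the single octet 0xC4, which is exactly the mismatch in the question's test.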
There is a way that a character could theoretically have more than one UTF-8 encoding, but it is explicitly banned. Consider that instead of putting the bits of U+0027 into the single unit 00100111, we could also zero-pad and put them into 11000000 10100111, encoding it as 0xC0 0xA7. The same decoding algorithm would bring us back to U+0027 (try it and see). However, as well as introducing needless complexity by having such synonym encodings, this also introduces security issues, and indeed there have been real-world security holes caused by code that accepted over-long UTF-8.
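A quick way to see the ban in practice is to ask .NET's decoder to throw on malformed input instead of substituting U+FFFD. The helper name below (OverlongCheck.IsValidUtf8) is made up for this sketch; UTF8Encoding and its throwOnInvalidBytes option are the standard API:

```csharp
using System;
using System.Text;

static class OverlongCheck
{
    // True when the bytes are well-formed UTF-8 according to .NET's strict decoder.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Raised for over-long forms, stray continuation bytes, truncated sequences, etc.
            return false;
        }
    }
}
```

With this, the over-long pair 0xC0 0xA7 is rejected even though the naive algorithm would decode it to U+0027, and the lone Latin-1 octet 0xC4 is rejected as a truncated two-byte sequence.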
Maybe you need a scan function to decide which decoder is required?
Try this:
/// <summary>
/// Counts valid multi-byte UTF-8 sequences.
/// </summary>
/// <returns>
/// -1 = invalid UTF-8 bytes (may be Latin-1)
///  0 = 7-bit ASCII only
///  n = number of multi-byte UTF-8 characters
/// </returns>
public static int Utf8CodedCharCounter(byte[] value)
{
    int utf8Count = 0;
    for (int i = 0; i < value.Length; i++)
    {
        byte c = value[i];
        if ((c & 0x80) == 0) continue; // valid 7-bit ASCII -> skip
        if ((c & 0xc0) == 0x80) return -1; // continuation byte without a lead byte
        // 2-byte UTF-8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // missing continuation byte
        if ((c & 0xe0) == 0xc0) { utf8Count++; continue; }
        // 3-byte UTF-8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // missing continuation byte
        if ((c & 0xf0) == 0xe0) { utf8Count++; continue; }
        // 4-byte UTF-8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // missing continuation byte
        if ((c & 0xf8) == 0xf0) { utf8Count++; continue; }
        return -1; // invalid UTF-8 length
    }
    return utf8Count;
}
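And update your code:
private static void DecodeTest(string testname, string expected, byte[] encoded)
{
    // Encoding.Default is the system ANSI code page on .NET Framework (often Windows-1252).
    // On .NET Core / .NET 5+ it is UTF-8, so use Encoding.GetEncoding("ISO-8859-1") there.
    var decoder = Utf8CodedCharCounter(encoded) >= 0 ? Encoding.UTF8 : Encoding.Default;
    string ascii7_actual = decoder.GetString(encoded, 0, encoded.Length);
    //Console_Write(encoded);
    AssertEqual(testname, expected, ascii7_actual);
}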
Result:
Test: ascii7
    Pass
Test: Latin-1
    Pass
Test: UTF-8
    Pass
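An alternative to hand-scanning the bytes, sketched here under the assumption that the only two candidate encodings are UTF-8 and ISO-8859-1: let the strict UTF-8 decoder try first and fall back to Latin-1 when it throws. FallbackDecoder is a hypothetical name, not something from the answer above:

```csharp
using System;
using System.Text;

static class FallbackDecoder
{
    // Tries strict UTF-8 first; if the bytes are not well-formed UTF-8,
    // decodes them as ISO-8859-1 (Latin-1), which accepts every byte value.
    public static string Decode(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        }
    }
}
```

This is a heuristic, like the scan function: a genuine Latin-1 text that happens to contain the octets 0xC3 0x84 is also valid UTF-8 and will be decoded as "Ä" instead of the two Latin-1 characters. If the encoding is known from metadata, use that instead of guessing.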

How to convert latin character into HTML Entity (Decimal) in c#?

I want to convert Latin characters into HTML entity codes in C#.
For example,
Th‚rŠse Ramdally should convert into
Th&#8218;r&#352;se Ramdally
Thanks
vela
A possible solution is to encode every character that's outside the printable ASCII range (i.e. character >= 128 or character < 32):
// requires: using System.Linq;
String source = @"Th‚rŠse Ramdally";
String result = String.Concat(source
    .Select(c => (c < 128 && c > 31)
        ? c.ToString()
        : String.Format("&#{0};", (int) c)));
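One limitation of working char by char is that characters outside the Basic Multilingual Plane are stored as two surrogate chars and would come out as two separate entities. A sketch of a variant that works on code points instead, with made-up names (EntitySketch, EncodeNonAscii); it assumes the input contains no unpaired surrogates, since char.ConvertToUtf32 throws on those:

```csharp
using System;
using System.Text;

static class EntitySketch
{
    // Replaces every code point outside printable ASCII (32..127 exclusive of 128+)
    // with a decimal HTML entity; surrogate pairs become a single entity.
    public static string EncodeNonAscii(string source)
    {
        var sb = new StringBuilder(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(source, i);
            int codePoint = char.ConvertToUtf32(source, i);
            if (isPair) i++; // skip the low surrogate, already consumed

            if (codePoint > 31 && codePoint < 128)
                sb.Append((char)codePoint);
            else
                sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}
```

Like the LINQ version, this leaves characters such as & and < untouched; run the text through an HTML encoder as well if those need escaping.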

Stream, string and null character

I have a stream which contains several \0 inside it. I have to replace textual parts of this stream, but when I do
StreamReader reader = new StreamReader(stream);
string text = reader.ReadToEnd();
text only contains the beginning of the stream (because of the \0 character). So
text = text.Replace(search, replace);
StreamWriter writer = new StreamWriter(stream);
writer.Write(text);
will not do the expected job since I don't parse the "full" stream. Any idea on how to get access to the full data and replace some textual parts ?
EDIT : An example of what I see on notepad
stream
H‰­—[oã6…ÿÛe)Rêq%ÙrlËñE±“-úàÝE[,’íKÿþŽDjxÉ6ŒÅ"XkÏáGqF að÷óð!SN>¿¿‰È†/$ËÙpñ<^HVÀHuñ'¹¿à»U?`äŸ?
¾fØø(Ç,ükøéàâ+ùõ7øø2ÜTJ«¶Ïäd×SÿgªŸF_ß8ÜU#<Q¨|œp6åâ-ªÕ]³®7Ûn¹ÚÝ|‰,¨¹^ãI©…Ë<UIÐI‡Û©* Ǽ,,ý¬5O->qä›Ü
endstream 
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Verdana
/Ascent 765
/Descent -207
/CapHeight 1489
/Flags 32
/ItalicAngle 0
/StemV 86
/StemH 0
/FontBBox [ -560 -303 1523 1051 ]
/FontFile2 31 0 R
>>
endobj
9 0 obj
And I want to replace /FontName /Verdana by /FontName /Arial on the fly, for example.
Ah, now we're getting to it...
"This file is a PDF"
Then it's not a text file. It's a binary file and should be treated as one. Using StreamReader on it will lose data. You'll need a different API to access the data in it, one that understands the PDF format. Have a look at iTextSharp or PDFTron.
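To illustrate the data loss, the sketch below pushes arbitrary bytes through StreamReader and back. The names (BinaryRoundTrip, ThroughText) are made up, and the sample bytes in the note are just an example, not taken from the PDF in the question:

```csharp
using System;
using System.IO;
using System.Text;

static class BinaryRoundTrip
{
    // Decodes the bytes as UTF-8 text via StreamReader, then encodes the text back to bytes.
    // For real text this returns the input; for arbitrary binary data it generally does not.
    public static byte[] ThroughText(byte[] original)
    {
        using (var reader = new StreamReader(new MemoryStream(original), Encoding.UTF8))
        {
            string text = reader.ReadToEnd();
            return Encoding.UTF8.GetBytes(text);
        }
    }
}
```

Feeding it { 0x48, 0x89, 0x00, 0xFF } does not give the same four bytes back, because 0x89 and 0xFF are not valid UTF-8 on their own and are replaced with U+FFFD during decoding. The 0x00 byte itself survives; it is the non-text bytes around it that get damaged.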
I can't duplicate your results. The code below creates a string with a \0 in it, writes to file, and then reads it back. The resulting string has the \0 in it:
string s = "hello\x0world";
File.WriteAllText("foo.txt", s);
string t;
using (var f = new StreamReader("foo.txt"))
{
    t = f.ReadToEnd();
}
Console.WriteLine(t == s); // prints "True"
I get the same results if I do var t = File.ReadAllText("foo.txt");

Replace token's text in ANTLR

I'm trying to replace the text of some tokens in my input program with specifically formatted text. I'm using C# as the output language.
Example of input:
time#1m2s
My lexer grammar for that input:
fragment
DIGIT : '0'..'9'
;
CTE_DURATION
: ('T'|'t'|'TIME'|'time') '#' '-'? (DIGIT ('d'|'h'|'m'|'s'|'ms') '_'?)+
;
Output token text I'd like to get from input example:
0.0:1:2.0
That means: 0 days, 0 hours, 1 minute, 2 seconds and 0 milliseconds.
Any advice? Thank you in advance.
Here's a way to do that (it's in Java, but shouldn't be hard to port to C#):
grammar Test;

parse
  : CTE_DURATION EOF
  ;

CTE_DURATION
  : ('T' 'IME'? | 't' 'ime'?) '#' minus='-'?
    (d=DIGITS 'd')? (h=DIGITS 'h')? (m=DIGITS 'm')? (s=DIGITS 's')? (ms=DIGITS 'ms')?
    {
      int days = $d == null ? 0 : Integer.valueOf($d.text);
      int hours = $h == null ? 0 : Integer.valueOf($h.text);
      int minutes = $m == null ? 0 : Integer.valueOf($m.text);
      int seconds = $s == null ? 0 : Integer.valueOf($s.text);
      int mseconds = $ms == null ? 0 : Integer.valueOf($ms.text);
      setText(($minus == null ? "" : "-") + days + "." + hours + ":" + minutes + ":" + seconds + "." + mseconds);
    }
  ;

fragment DIGITS : '0'..'9'+;
Parsing the input time#1m2s yields a single CTE_DURATION token whose text has been rewritten to 0.0:1:2.0.
Note that the grammar now accepts time# as well (causing it to produce 0.0:0:0.0), but you can easily produce an exception from the lexer rule in case such input is invalid.
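For the C# side, the formatting logic inside the lexer action can be prototyped without ANTLR at all. The sketch below is a plain regular-expression stand-in, not ANTLR's C# target API; DurationToken, Format and Part are hypothetical names, and the pattern mirrors the grammar above (the (?!s) lookahead keeps "5ms" from being read as 5 minutes):

```csharp
using System;
using System.Text.RegularExpressions;

static class DurationToken
{
    private static readonly Regex Pattern = new Regex(
        @"^(?:TIME|time|T|t)#(?<minus>-)?" +
        @"(?:(?<d>[0-9]+)d)?(?:(?<h>[0-9]+)h)?(?:(?<m>[0-9]+)m(?!s))?" +
        @"(?:(?<s>[0-9]+)s)?(?:(?<ms>[0-9]+)ms)?$");

    // Rewrites e.g. "time#1m2s" into the "days.hours:minutes:seconds.milliseconds" form.
    public static string Format(string tokenText)
    {
        Match match = Pattern.Match(tokenText);
        if (!match.Success)
            throw new FormatException("Not a duration literal: " + tokenText);

        return (match.Groups["minus"].Success ? "-" : "")
            + Part(match, "d") + "." + Part(match, "h") + ":"
            + Part(match, "m") + ":" + Part(match, "s") + "." + Part(match, "ms");
    }

    // Missing components count as zero, as in the grammar's action.
    private static int Part(Match match, string name)
    {
        Group group = match.Groups[name];
        return group.Success ? int.Parse(group.Value) : 0;
    }
}
```

Once the output looks right, the body of Format can be moved into the lexer rule's action for whichever ANTLR C# target is in use.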

Read .dat file in c#

I want to read a .dat file with the following layout:
First name offset 21
First name format ASCIIz 15 chars + \0
Middle initials offset 37
ID offset -8
ID format/length Unsigned int (4 bytes)
Please help me solve this in C#.
Thanks in advance.
Gurpreet
.dat file
( ÿ / rE ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ XÙþÞ¦d e e Mr. Sam Ascott Sam 9209 Sandpiper Lane 21204 410 5558987 410 5556700 275 MM229399098 (¬ Þ e ܤ•Þ„ œÔ£ÝáØáØ ’Þ[Þ €–˜ ä–˜ [Þ ¶ Norman Eaton Friend of Dr. Shultz Removal of #1,16,17 & 32 öÜÝ)Ý Ä d 01 21 21 21 e 101 22099 XÙþÞ¦d e . Mrs. 
Patty Baxter Patty 3838 Tommytrue Court 21234 410 2929290 410 3929209 FM218798127 HAY FEVER Þ . „¤¢Þè   _ÐÍÝBÒBÒ ’ÞÝ €–˜ ä–˜ ÍÝ f Joanne Abbey
Here is a tutorial how to use BinaryReader for this purpose:
http://dotnetperls.com/binaryreader
You can use Jet OleDB to query .dat files:
var query = "select * from file.dat";
var connection = new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\\file.dat;Extended Properties=\"text;HDR=NO;FMT=FixedLength\"");
See this link:
Code Project. Read Text File (txt, csv, log, tab, fixed length)
And check these:
Reading a sequential-access file
DAT files in C#
Read a file in C#
BinaryReader Class
Jon Skeet. Reading binary data in C#
As said, have a look at the BinaryReader
//Example...
BinaryReader reader = new BinaryReader(stream);
string name = Encoding.ASCII.GetString(reader.ReadBytes(8));
int number = reader.ReadInt32();
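Applied to the layout in the question, a sketch might look like the following. The helper names are made up, the offsets are passed in by the caller, and the "ID offset -8" entry is left uninterpreted because the question does not say what the negative offset is relative to:

```csharp
using System;
using System.IO;
using System.Text;

static class DatRecordSketch
{
    // Reads a fixed-width, NUL-terminated ASCII field,
    // e.g. fieldLength = 16 for "15 chars + \0".
    public static string ReadAsciiz(BinaryReader reader, long offset, int fieldLength)
    {
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);
        byte[] raw = reader.ReadBytes(fieldLength);
        int end = Array.IndexOf(raw, (byte)0);   // stop at the first NUL, if any
        if (end < 0) end = raw.Length;
        return Encoding.ASCII.GetString(raw, 0, end);
    }

    // Reads a 4-byte unsigned integer; BinaryReader reads it as little-endian.
    public static uint ReadUInt32At(BinaryReader reader, long offset)
    {
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);
        return reader.ReadUInt32();
    }
}
```

For the first name this would be called as DatRecordSketch.ReadAsciiz(reader, recordStart + 21, 16), where recordStart is the position of the record in the file, something the question does not specify.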
my .dat file content - " €U§µ­PÕ „ÕG¬u "
string fileName = @"W:\yourfilename.dat";
// Read the binary file as a byte array
byte[] bHex = File.ReadAllBytes(fileName);
// String builder for collecting the hex digits
StringBuilder st = new StringBuilder();
// Position counter, starting from the end of the file
int i = 0;
// Walk the bytes in reverse order (requires using System.Linq)
foreach (byte b in bHex.Reverse())
{
    i++;
    // Bytes 13 to 20, counted from the end, hold the value of interest in ticks
    if (i > 12 && i < 21)
        st.Append(b.ToString("X2"));
}
// Convert hex to decimal
long Output = Convert.ToInt64(st.ToString(), 16);
// Convert ticks to a date-time
DateTime dt = new DateTime(Output);
// Write the resulting date to the console
Console.Write(dt);
This decodes the binary content into a date-time value, which the program prints to the console.
