Is it possible to read a file's hex values into C# and output the corresponding ASCII? I can open the file in a hex editor and see the ASCII next to the hex, but rather than manually copying out the parts I need, I imagine there is a way to have a C# program do it for me?
I did find "Converting HEX data in a file to ascii", but that didn't really help.
It sounds like you just need:
string text = File.ReadAllText("file.txt");
There's no such thing as "hex values" in a file - they're just bytes which are shown as hex in various editors geared towards editing non-text files.
The above line of code will load a text file, decoding it as UTF-8 - which is compatible with ASCII, so if your file is truly ASCII, it should be fine. If you need to specify a different encoding, you can do it with an overload, e.g.
// Load an ISO-8859-1 file
string text = File.ReadAllText("file.txt", Encoding.GetEncoding(28591));
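If the file is not plain text and you want the same view a hex editor gives you (hex on the left, ASCII on the right), a minimal sketch could look like the following. The class name HexDump, the helper FormatLine and the path "file.bin" are placeholders of mine, not anything from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class HexDump
{
    // Formats `count` bytes starting at `offset` as "offset  hex bytes |ascii|",
    // similar to one row in a hex editor.
    public static string FormatLine(byte[] data, int offset, int count)
    {
        var hex = new StringBuilder();
        var ascii = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            byte b = data[offset + i];
            hex.Append(b.ToString("X2")).Append(' ');
            // Printable ASCII is 0x20..0x7E; anything else is shown as '.'
            ascii.Append(b >= 0x20 && b <= 0x7E ? (char)b : '.');
        }
        // Pad the hex column to 16 bytes wide so the ASCII column lines up.
        return $"{offset:X8}  {hex.ToString().PadRight(16 * 3)} |{ascii}|";
    }

    public static void Main(string[] args)
    {
        // "file.bin" is a placeholder path.
        byte[] data = File.ReadAllBytes(args.Length > 0 ? args[0] : "file.bin");
        for (int offset = 0; offset < data.Length; offset += 16)
        {
            int count = Math.Min(16, data.Length - offset);
            Console.WriteLine(FormatLine(data, offset, count));
        }
    }
}
```

Reading with File.ReadAllBytes avoids any decoding step, so bytes that are not valid text are shown as they are rather than being replaced.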
Related
I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How do I do this? Do I need to detect the input file's encoding (that seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters ("�").
I am looking for a way to read the file no matter which of the two encodings it uses and write it correctly, without knowing the input file's encoding beforehand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved. I'm looking to do the same thing in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads ANSI file and writes output as UTF-8 but
there is some gibberish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind it's so hard to do something in C# that command
prompt can do easily
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If the "�" you posted (hex 0xEFBFBD) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of
the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The Replacement character � (often displayed as a black rhombus with a
white question mark) is a symbol found in the Unicode standard at code
point U+FFFD in the Specials table. It is used to indicate problems
when a system is unable to render a stream of data to a correct
symbol.[4] It is usually seen when the data is invalid and does not
match any character
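To see where those three bytes come from, here is a small sketch (my own illustration, not taken from the linked answers). It decodes a lone 0xBB byte, which is "»" in ISO-8859-1 but not a valid UTF-8 sequence on its own, and then re-encodes the result. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class ReplacementDemo
{
    // Decodes with the default (non-throwing) UTF-8 decoder,
    // the same behaviour File.ReadAllText uses by default.
    public static string DecodeAsUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    public static void Main()
    {
        // 'a', 0xBB, 'b': the middle byte is "»" in ISO-8859-1 / Windows-1252,
        // but a lone 0xBB is not a valid UTF-8 sequence.
        byte[] ansiBytes = { 0x61, 0xBB, 0x62 };

        string decoded = DecodeAsUtf8(ansiBytes);
        // The invalid byte has been replaced by U+FFFD.
        Console.WriteLine(((int)decoded[1]).ToString("X4"));

        // Writing the string back out as UTF-8 produces EF BF BD for that character.
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        Console.WriteLine(BitConverter.ToString(reencoded));
    }
}
```

Once the replacement has happened, the original byte is gone, which is why the output file cannot be repaired after the fact.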
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar
character.
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
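If the input really can be either UTF-8 or ANSI, one possible approach (a sketch under my own assumptions, not part of the answer above) is to try a strict UTF-8 decode first and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. TextLoader and DecodeUtf8OrLatin1 are hypothetical names. I am also assuming that ISO-8859-1 is close enough to your "ANSI" code page; Windows-1252 differs in the 0x80-0x9F range and, on .NET Core and later, needs the code pages encoding provider registered before Encoding.GetEncoding(1252) works.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextLoader
{
    // Tries strict UTF-8 first; if the bytes are not valid UTF-8,
    // falls back to ISO-8859-1, which accepts every byte value.
    public static string DecodeUtf8OrLatin1(byte[] bytes)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        // Skip a UTF-8 byte order mark if one is present.
        int start = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF ? 3 : 0;

        try
        {
            return strictUtf8.GetString(bytes, start, bytes.Length - start);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: TextLoader <input> <output>");
            return;
        }
        string text = DecodeUtf8OrLatin1(File.ReadAllBytes(args[0]));
        // File.WriteAllText writes UTF-8 when no encoding is given.
        File.WriteAllText(args[1], text);
    }
}
```

This is a heuristic: an ANSI file whose bytes happen to form valid UTF-8 would be read as UTF-8, although that is unlikely for ordinary text containing accented characters.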
I need to process CSV files that are stored as Base64 strings. I never know which encoding they were created with (usually it will be ANSI or UTF-8). I have been struggling to achieve anything useful: I still get messed-up characters when I test my code on a CSV file that was saved as ANSI. The code to read it is just a two-liner:
byte[] dataToDecode = Convert.FromBase64String(base64Content);
string csvContentInUTF8 = Encoding.UTF8.GetString(dataToDecode);
I do not have access to the code that saves files.
Sample line that's in the input CSV:
;;;superÆ/æ Ø/ø and even Å/å Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
What I get after decoding (second line of code):
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Following this question, I tried decoding with an encoding that covers the Scandinavian characters instead:
string csvContentInUTF8x = Encoding.GetEncoding("iso-8859-1").GetString(dataToDecode);
The output is:
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The output looks exactly the same with Encoding.Default.
If what you wrote is correct, the text was already corrupted before it was written to the CSV file.
Now... Encoding.GetEncoding("iso-8859-1") is an identity encoding that doesn't do any remapping. Its 256 characters map 1:1 to the first 256 code points (0-255) of Unicode.
;;;super�/� �/� oraz �/� Topic;;John;Doe;;;;john#doe.com
You see the � repeated six times? Normally each one should be different, because you want six different characters (Æ/æ, Ø/ø, Å/å). But here they are all the same. That is because � is the Unicode REPLACEMENT CHARACTER (U+FFFD), which is substituted when a byte sequence can't be decoded. So the error is already present in your dataToDecode.
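One way to check this without involving any decoder is to look at the raw bytes. The sketch below is only an illustration: CorruptionCheck and CountEncodedReplacementChars are hypothetical names, and the sample bytes are made up. It counts occurrences of EF BF BD, the UTF-8 encoding of U+FFFD, in the decoded Base64 payload.

```csharp
using System;

public static class CorruptionCheck
{
    // Counts occurrences of EF BF BD (U+FFFD encoded as UTF-8) in raw bytes.
    public static int CountEncodedReplacementChars(byte[] data)
    {
        int count = 0;
        for (int i = 0; i + 2 < data.Length; i++)
        {
            if (data[i] == 0xEF && data[i + 1] == 0xBF && data[i + 2] == 0xBD)
            {
                count++;
                i += 2; // skip the rest of this three-byte sequence
            }
        }
        return count;
    }

    public static void Main()
    {
        // Made-up sample: "super", an already-corrupted character, then "/".
        string base64Content = Convert.ToBase64String(
            new byte[] { 0x73, 0x75, 0x70, 0x65, 0x72, 0xEF, 0xBF, 0xBD, 0x2F });

        byte[] dataToDecode = Convert.FromBase64String(base64Content);
        Console.WriteLine(BitConverter.ToString(dataToDecode));
        Console.WriteLine(CountEncodedReplacementChars(dataToDecode));
    }
}
```

If the count is non-zero for your real data, the characters were lost before the content was Base64-encoded, and no choice of encoding on the reading side can bring them back.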
Background:
My assemblies' strings are obfuscated using only ASCII characters. My logger outputs its log files in a binary format. The log files contain a lot of data in addition to the obfuscated strings. I can de-obfuscate the strings. I need to keep the log file in its original format, but with the de-obfuscated strings.
Task:
I need to read the binary log file and write to a new log file until I hit a run of ASCII characters. I then convert those characters to a string, de-obfuscate the string, send the new string to the binary writer, and then carry on reading/writing from the end of the ASCII run until I hit the next run of ASCII characters, and then do it all again.
Is this possible?
I would appreciate some code in C# please if it is.
If my task methodology is flawed, is there an alternative?
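The scan-and-rewrite loop described above could be sketched roughly as follows. Everything here is an assumption on my part: the names AsciiRunRewriter and Rewrite, the minimum run length used to tell strings from incidental printable bytes, and the upper-casing lambda standing in for the real de-obfuscation routine. Note also that if the binary format stores string lengths or offsets, changing a string's length this way would break it.

```csharp
using System;
using System.IO;
using System.Text;

public static class AsciiRunRewriter
{
    static bool IsPrintableAscii(byte b) => b >= 0x20 && b <= 0x7E;

    // Copies `input` to a new array, passing every run of printable ASCII
    // that is at least `minRunLength` bytes long through `transform`.
    public static byte[] Rewrite(byte[] input, int minRunLength, Func<string, string> transform)
    {
        var output = new MemoryStream();
        int i = 0;
        while (i < input.Length)
        {
            // Find the end of the printable run starting at i (may be empty).
            int runStart = i;
            while (i < input.Length && IsPrintableAscii(input[i])) i++;
            int runLength = i - runStart;

            if (runLength >= minRunLength)
            {
                string text = Encoding.ASCII.GetString(input, runStart, runLength);
                byte[] replaced = Encoding.ASCII.GetBytes(transform(text));
                output.Write(replaced, 0, replaced.Length);
            }
            else
            {
                // Too short to treat as a string: copy it unchanged.
                output.Write(input, runStart, runLength);
            }

            // Copy the non-printable byte that ended the run, if there is one.
            if (i < input.Length)
            {
                output.WriteByte(input[i]);
                i++;
            }
        }
        return output.ToArray();
    }

    public static void Main()
    {
        byte[] log = { 0x01, 0x61, 0x62, 0x63, 0x64, 0x00, 0x78, 0xFF };
        // Upper-casing is only a stand-in for the real de-obfuscation.
        byte[] result = Rewrite(log, 4, s => s.ToUpperInvariant());
        Console.WriteLine(BitConverter.ToString(result));
    }
}
```

For large log files the same loop could run over a FileStream with a buffer instead of loading everything into memory.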
I am converting a series of strings that are designed to display correctly using a special font into a unicode version that can be used anywhere. It's just a glorified set of string replaces:
"e]" -> "ἓ"
etc.
I'm reading the text using a StreamReader which takes the encoding to be UTF-8. All working well. But there are some characters used to replace the punctuation marks that just aren't working. I can see them as hex sequences in Notepad++ (encoding set to UTF-8), but when I read them, they all get reduced to the same character (the 'cannot display' question mark in the black diamond).
StreamReader srnorm = new StreamReader("C:\\Users\\John\\Desktop\\bgt.txt", Encoding.UTF8);
string norm = srnorm.ReadLine();
Should I be reading it as a binary file and working from there or is my encoding very wrong?
When I read that, I get the following:
o]j ouvci. mh. �avpo�la,bh| pollaplasi,ona evn tw/| kairw/| tou,tw| kai. evn tw/| aivw/ni tw/| evrcome,nw| zwh.n aivw,nion�
C# strings use UTF-16; this is how they are stored in memory. Because of this, you should be able to read the text into memory and replace the characters without any issues. You can then write those characters back to a file (UTF-8 is the default character encoding for reading and writing files, if I'm not mistaken). The ?s just mean that either the console you output the string to does not support those characters or the bytes are not valid in the encoding you read them with.
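To tell those two cases apart, it can help to print the UTF-16 code units of the string instead of relying on how the console draws them. The following is a minimal sketch with made-up names (FontConverter, ToUnicode, CodeUnits) and a made-up sample line; only the "e]" to "ἓ" replacement comes from the question.

```csharp
using System;
using System.Linq;
using System.Text;

public static class FontConverter
{
    // One replacement from the question; a real converter would chain many of these.
    public static string ToUnicode(string text) => text.Replace("e]", "ἓ");

    // Lists each UTF-16 code unit of the string as U+XXXX.
    public static string CodeUnits(string text) =>
        string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));

    public static void Main()
    {
        // Ask the console to use UTF-8 so Greek characters have a chance of displaying.
        Console.OutputEncoding = Encoding.UTF8;

        // Made-up sample: ends with a replacement character, as if read from bad bytes.
        string norm = "e]n \uFFFD";
        Console.WriteLine(CodeUnits(norm));
        Console.WriteLine(ToUnicode(norm));
    }
}
```

If the listing shows U+FFFD, the bytes in the file were not valid for the encoding passed to the reader, and the file probably uses a different encoding. If it shows other values, the string is fine and only the display is at fault.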
Here is a good article by Jon Skeet about C#/.NET strings.
We will be given different text files, and each file might be in English, Arabic, German or French. We have to read each file and display its text in the UI in that file's language.
I am planning to use the statement below to achieve this. Do I need to do anything in addition? As far as I know, the ASCII character set only goes up to 255, but what about displaying characters from other languages such as Chinese, Hindi or German? Do we need to take special care of these characters?
var reader = new StreamReader(filepath, System.Text.Encoding.Unicode);
Use UTF-8 or UTF-16 for the encoding, matching how the files were actually saved, and the text should be displayed as is:
var reader = new StreamReader(filepath, System.Text.Encoding.UTF8);
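If some files arrive as UTF-8 and others as UTF-16, a sketch like the one below lets StreamReader pick between them based on the byte order mark, with UTF-8 as the fallback when no mark is present. MultiLanguageReader and ReadText are hypothetical names, and the approach assumes the UTF-16 files were saved with a byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

public static class MultiLanguageReader
{
    // Reads a whole file, switching to UTF-16 or UTF-8 when a byte order mark
    // says so, and otherwise decoding as UTF-8.
    public static string ReadText(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: MultiLanguageReader <file>");
            return;
        }
        // Strings are UTF-16 in memory, so Arabic, German or Chinese text needs
        // no special handling once it has been decoded correctly.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(ReadText(args[0]));
    }
}
```

Whether the text then displays correctly in the UI depends on the font supporting those scripts, which is outside what the reader controls.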