Change readable portions of file with unknown characters - c#

I'm trying to read a text file that has readable and unreadable characters. It opens easily in any text editor. Most of the text characters are unknown characters and the part I want to change is readable.
The file looks like this
readable1 gibberish readable2 gibberish.
I want to change readable2
If I use the following techniques they seem to return only readable1. They do not give the same output as dropping it on a text reader.
void readFile(){
var sr = new StreamReader(path);
contents = sr.ReadToEnd();
//or
contents = File.ReadAllText(path);
}
I tried a few encodings ASCII, Unicode, UTF8, UTF32 but nothing seems to match the same output as dragging onto a text editor.
byte[] bytes = System.IO.File.ReadAllBytes(path);
string str = System.Text.Encoding.ASCII.GetString(bytes);
Is there any way to get it to return all the characters and just modify the readable characters?
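One way to keep every byte intact while editing only the readable part is to decode the file as ISO-8859-1 (Latin-1), which maps each byte value 0..255 to exactly one character, so re-encoding gives back the original bytes. This is a minimal sketch, not a definitive answer: the class and method names are hypothetical, and it assumes the readable text and its replacement contain only Latin-1 characters.

```csharp
using System;
using System.Text;

public static class BinarySafeEdit
{
    // ISO-8859-1 maps every byte 0..255 to the char with the same code,
    // so GetString followed by GetBytes returns the original bytes.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Replaces a readable token and leaves all other bytes untouched.
    // Assumption: oldText and newText contain only Latin-1 characters.
    public static byte[] ReplaceReadable(byte[] original, string oldText, string newText)
    {
        string contents = Latin1.GetString(original);
        string edited = contents.Replace(oldText, newText);
        return Latin1.GetBytes(edited);
    }
}
```

In practice the input would come from File.ReadAllBytes(path) and the result would go to File.WriteAllBytes(path, result). If the file contains NUL bytes, some viewers stop displaying a string at the first one, which may be why only readable1 seems to come back.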

Related

how to write the character `é` in an rtf file?

I need to write a string from C# into an RTF file, but I'm having weird problems.
To write the text I simply use
string fileName = System.IO.Path.GetTempPath() + Guid.NewGuid().ToString() + ".rtf";
System.IO.File.WriteAllText(fileName, body);
body is a string variable, that is filled from a varchar column from a database.
The problem is the character é, which WordPad displays incorrectly when it opens the file.
If I open the file in Notepad, I see this:
(één schade gevonden -> ander dossier)
So for some obscure reason WordPad shows the character é all messed up.
I tried writing the file as UTF-8 or other Unicode encodings, but then WordPad refused to treat the file as RTF and just showed the plain text with all the tags.
I also looked at this page, which tells me to write a tag like \uXXX?, where XXX should be the number of a Unicode UTF-16 code unit.
But I cannot find what number to use, or any good example of how to do this.
Actually, I am not even sure it is Unicode related; to my mind é is not a character that needs Unicode, but I could be wrong, of course.
Anyway, does anyone know how to solve this problem?
I just need a way to make WordPad not mess up the character é on display and in print.
The problem was that I did not encode the RTF file properly.
Using this link provided by Filburt, I managed to encode the RTF file correctly like this:
var iso = Encoding.GetEncoding("ISO-8859-1");
string fileName = System.IO.Path.GetTempPath() + Guid.NewGuid().ToString() + ".rtf";
System.IO.File.WriteAllText(fileName, body, iso);
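For the \uXXX? approach mentioned in the question, the number is the decimal value of the UTF-16 code unit, written as a signed 16-bit integer, and the trailing ? is the fallback character for readers that do not understand \u. For é (U+00E9) that gives \u233?. The helper below is a hypothetical sketch, not part of the original answer; it assumes the string already contains valid RTF markup, so it leaves backslashes and braces alone and escapes only characters above 127.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtfText
{
    // Rewrites every non-ASCII UTF-16 code unit as an RTF \uN? control word.
    // Existing RTF markup (\, {, }) is deliberately left untouched.
    public static string EscapeNonAscii(string rtf)
    {
        var sb = new StringBuilder(rtf.Length);
        foreach (char c in rtf)
        {
            if (c < 128)
            {
                sb.Append(c);
            }
            else
            {
                // RTF expects a signed 16-bit decimal, so code units above 32767 become negative.
                short value = unchecked((short)c);
                sb.Append("\\u").Append(value.ToString(CultureInfo.InvariantCulture)).Append('?');
            }
        }
        return sb.ToString();
    }
}
```

Because the escaped output is pure ASCII, it no longer matters which single-byte encoding the file is written with.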

Thai character issues in unicode string

I have a string with a few Thai characters. The string uses Unicode characters, but I don't see the Thai characters in the IDE, or even when I write the string to a text file. To see the Thai characters properly, I have to write the following code:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
After applying the above logic, I can see the string correctly with its Thai characters. Here is the output:
// notice the thai character เดี่ยว in the string
M_M-150 150CC. เดี่ยว (2 For 18 Save 2)
I am not sure why I need to apply the above logic to see the Thai characters, even though the string is Unicode.
What exactly is Encoding.Default doing in this case?
From MSDN, here is what the Encoding.Default property does:
"Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding."
The string is encoded to bytes with Encoding.Default, but those bytes are then decoded using UTF-8.
So the step that matters is not Encoding.Default but Encoding.UTF8: it takes the bytes and converts them to a string correctly.
The same applies when you print the string to the console, with and without the UTF-8 configuration.
You can configure your console to support UTF-8 by adding this line:
Console.OutputEncoding = Encoding.UTF8;
Even with your code, compare the result written to a file with the result of converting the string to bytes with Encoding.UTF8:
var text = "M_M-150 150CC. เดี่ยว (2 For 18 Save 2)";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
The result then keeps the Thai characters.
If you take a look at Supported Scripts, you'll see that UTF-8 supports all Unicode characters, including Thai.
Note that Encoding.Default, on a machine whose default code page lacks those characters, will not be able to read Chinese or Japanese, for example. Take this example:
var text = "漢字";
var ascii = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
If you try to write this to a text file, it will not be converted successfully.
So you have to read and write it using UTF-8:
var text = "漢字";
var ascii = Encoding.UTF8.GetBytes(text);
text = Encoding.UTF8.GetString(ascii);
and the characters are preserved.
So, as I said, the whole process depends on UTF-8, not on the default encoding.
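The effect of the GetBytes/GetString pair from the question can be reproduced in isolation. This is a sketch under assumptions: ISO-8859-1 stands in for whatever single-byte code page Encoding.Default happens to be on a given machine, and the class and method names are hypothetical. It shows how UTF-8 bytes read with a single-byte encoding become garbled text, and how re-encoding with that same encoding recovers the original bytes so that UTF-8 can decode them.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Stand-in for a single-byte default code page (assumption; Encoding.Default differs per machine).
    private static readonly Encoding SingleByte = Encoding.GetEncoding("ISO-8859-1");

    // Simulates reading UTF-8 bytes with the wrong, single-byte encoding.
    public static string Misread(string text)
    {
        return SingleByte.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Undoes the misreading: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(SingleByte.GetBytes(garbled));
    }
}
```

For example, MojibakeDemo.Misread("é") yields "Ã©", and MojibakeDemo.Repair("Ã©") yields "é" again. The repair only works when the single-byte encoding can represent every garbled character; a string that already holds the correct characters should not be passed through it.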

Non-Unicode to unicode conversion of a txt file

Given a txt file with non-Unicode text, I am able to detect its charset as 1251. Now I would like to convert it to Unicode.
byte[] bytes1251 = Encoding.GetEncoding(1251).GetBytes(File.ReadAllText("sampleNU.txt"));
String str = Encoding.UTF8.GetString(bytes1251);
This doesn't work.
Is this the way to go about it for non-unicode to unicode conversion?
After trying the suggested approach on the RTF file, I get a dialog when I try to open the output RTF file. Selecting Unicode in it doesn't make the file readable or give the expected text. Please let me know what to do.
// load as charset 1251
string text = File.ReadAllText("sampleNU.txt", Encoding.GetEncoding(1251));
// save as Unicode
File.WriteAllText("sampleU.txt", text, Encoding.Unicode);

streamwriter does not save unicode files correctly

I'm opening a text file and removing the first line to prepare it for importing in a database using bulk insert. Here is my code:
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.UTF8))
{
using (var sw = new StreamWriter(tempFile,true, System.Text.Encoding.UTF8))
{
string line;
while ((line = sr.ReadLine()) != null)
{
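// StartsWith avoids the exception Substring(0, 8) throws on lines shorter than 8 characters
if (!line.StartsWith("Nr. Crt.", StringComparison.Ordinal))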
sw.WriteLine(line);
}
}
}
System.IO.File.Delete("F:\\Upload\\File.txt");
System.IO.File.Move(tempFile, "F:\\Upload\\File.txt");
After this, if I open the resulting file, Unicode characters are replaced with other characters. For example, strings containing a non-breaking space (Unicode U+00A0), such as Value followed by U+00A0, are transformed into Value�.
How can I avoid this?
Edit:
Notepad++ is set to 'Encode in UTF-8'
Regarding "are transformed in Value�":
The byte values for those 3 odd characters are 0xEF 0xBF 0xBD, which is the UTF-8 encoding of code point U+FFFD, the replacement character �. It is substituted when UTF-encoded text is read and the text contains an invalid byte sequence.
This points squarely at an issue with File.txt: it was probably not encoded in UTF-8. If you have no idea what encoding was used for that file, then the first guess is to pass Encoding.Default to the StreamReader constructor.
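The mechanism can be shown without any file. In the sketch below (hypothetical names, built-in encodings only), a non-breaking space stored as the single byte 0xA0, as a Windows-1252 or Latin-1 file would store it, is decoded as UTF-8. That byte is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD, and writing the string back out as UTF-8 produces the three bytes EF BF BD.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes the input as UTF-8 and encodes it again, as a StreamReader/StreamWriter
    // pair configured for UTF-8 would do. Invalid sequences become U+FFFD.
    public static byte[] RoundTripAsUtf8(byte[] input)
    {
        string decoded = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(decoded);
    }
}
```

Passing the bytes 56 A0 (the letter V followed by a single-byte non-breaking space) returns 56 EF BF BD, which is the damage described in the question.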
It looks to me like it is writing fine, but the tool you are reading with is not expecting UTF-8. In many cases, you need to explicitly tell the tool what encoding to expect. However, a common approach is to prepend a BOM ("byte order mark"). This is simple - just use new UTF8Encoding(true) as the encoding and it will happen automatically. In tools that don't expect a BOM this will display as a few mangled chars at the start - but most modern tools will know what it means, and will switch to UTF-8 automatically. The point is: the BOM for UTF-8, UTF-16 LE and UTF-16 BE etc are all slightly different, but recognisable. A more complete list is on wikipedia.
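As a rough illustration of how recognisable the byte order marks are, the sketch below returns the preamble that new UTF8Encoding(true) emits and checks a buffer for the three most common BOMs. The class and method names are hypothetical, and UTF-32 is deliberately not handled (its little-endian BOM starts with the same two bytes as UTF-16 LE).

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // The preamble that writers emit when given new UTF8Encoding(true): EF BB BF.
    public static byte[] Utf8Preamble()
    {
        return new UTF8Encoding(true).GetPreamble();
    }

    // Returns the encoding name suggested by a leading BOM, or "unknown" if none is found.
    public static string DetectBom(byte[] data)
    {
        if (HasPrefix(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (HasPrefix(data, 0xFF, 0xFE)) return "UTF-16 LE";
        if (HasPrefix(data, 0xFE, 0xFF)) return "UTF-16 BE";
        return "unknown";
    }

    private static bool HasPrefix(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A file without a BOM yields "unknown", in which case the encoding has to come from somewhere else, such as the tool's settings.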

Convert ANSI (Windows 1252) to UTF8 in C#

I've asked this before in a roundabout manner here on Stack Overflow, and want to get it right this time. How do I convert ANSI (code page 1252) to UTF-8 while preserving the special characters? (I am aware that UTF-8 supports a larger character set than ANSI, but it is okay if I can preserve all the characters that ANSI supports and substitute the rest with a ? or something.)
Why I Want To Convert ANSI → UTF-8
I am basically writing a program that splits vCard files (VCF) into individual files, each containing a single contact. I've noticed that Nokia and Sony Ericsson phones save the backup VCF file in UTF-8 (without BOM), but Android saves it in ANSI (1252). And God knows in what formats the other phones save them in!
So my questions are
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
tl;dr
Need to know how to convert the character encoding from (ANSI / UTF8) to (UTF8 / ANSI) while preserving all special characters.
You shouldn't convert from one encoding to the other. You have to read each file using the encoding that it was created with, or you will lose information.
Once you read the file using the correct encoding you have the content as a Unicode string, from there you can save it using any encoding you like.
If you need to detect the encoding, you can read the file as bytes and then look for character codes that are specific for either encoding. If the file contains no special characters, either encoding will work as the characters 32..127 are the same for both encodings.
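The byte-level check described above might look like the following sketch. It is one possible heuristic, not a reliable detector: the names are hypothetical, and it only distinguishes "pure ASCII", "valid UTF-8" and "something else, possibly Windows-1252".

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // True if every byte is below 0x80, in which case either encoding gives the same text.
    public static bool IsPureAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }

    // True if the bytes form valid UTF-8. A strict decoder throws on invalid sequences
    // instead of silently substituting U+FFFD.
    public static bool IsValidUtf8(byte[] data)
    {
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The letter é illustrates the difference: its UTF-8 form C3 A9 passes IsValidUtf8, while its Windows-1252 form, the single byte E9, does not. Text in some other single-byte encoding can still pass by coincidence, so the result is a hint rather than proof.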
This is what I use in C# (I've been using it to convert from Windows-1252 to UTF8)
public static String readFileAsUtf8(string fileName)
{
Encoding encoding = Encoding.Default;
String original = String.Empty;
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
original = sr.ReadToEnd();
encoding = sr.CurrentEncoding;
sr.Close();
}
if (encoding == Encoding.UTF8)
return original;
byte[] encBytes = encoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
return Encoding.UTF8.GetString(utf8Bytes);
}
VCF is encoded in utf-8 as demanded by the spec in chapter 3.4. You need to take this seriously, the format would be utterly useless if that wasn't cast in stone. If you are seeing some Android app mangling accented characters then work from the assumption that this is a bug in that app. Or more likely, that it got bad info from somewhere else. Your attempt to correct the encoding would then cause more problems because your version of the card will never match the original.
You convert from 1252 to utf-8 with Encoding.GetEncoding(1252).GetString(), passing in a byte[]. Do not ever try to write code that reads a string and whacks it into a byte[] so you can use the conversion method, that just makes the encoding problems a lot worse. In other words, you'd need to read the file with FileStream, not StreamReader. But again, avoid fixing other people's problems.
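A byte-oriented version of that advice could look like the sketch below. The names are hypothetical, and one part is an assumption about the runtime: on .NET Core and .NET 5+, code page 1252 is only available after registering CodePagesEncodingProvider, whereas on .NET Framework it is built in and the registration line is unnecessary.

```csharp
using System;
using System.Text;

public static class Cp1252Converter
{
    // Decodes Windows-1252 bytes and re-encodes the text as UTF-8 (without a BOM).
    public static byte[] ToUtf8(byte[] cp1252Bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where 1252 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string text = Encoding.GetEncoding(1252).GetString(cp1252Bytes);
        return Encoding.UTF8.GetBytes(text);
    }
}
```

The input would typically come from File.ReadAllBytes and the output would go to File.WriteAllBytes, so the data never passes through a StreamReader that guesses the encoding.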
I do it this way:
private static void ConvertAnsiToUTF8(string inputFilePath, string outputFilePath)
{
string fileContent = File.ReadAllText(inputFilePath, Encoding.Default);
File.WriteAllText(outputFilePath, fileContent, Encoding.UTF8);
}
I found this question while working to process a large collection of ancient text files into well formatted PDFs. None of the files have a BOM, and the oldest of the files contain Codepage 1252 code points that cause incorrect decoding to UTF8. This happens only some of the time, UTF8 works the majority of the time. Also, the latest of the text data DOES contain UTF8 code points, so it's a mixed bag.
So, I also set out "to detect which encoding the input file has" and after reading How to detect the character encoding of a text file? and How to determine the encoding of text? arrived at the conclusion that this would be difficult at best.
BUT, I found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets in the comments, read it, and found this gem:
UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet.
The entire article is short and well worth the read.
So, I solved my problem with the following code. Since only a small amount of my text data contains difficult character code points, I don't mind the performance overhead of the exception handling, especially since this only had to run once. Perhaps there are more clever ways of avoiding the try/catch but I did not bother with devising one.
public static string ReadAllTextFromFile(string file)
{
const int WindowsCodepage1252 = 1252;
string text;
try
{
var utf8Encoding = Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, utf8Encoding);
}
catch (DecoderFallbackException dfe)//then text is not entirely valid UTF8, contains Codepage 1252 characters that can't be correctly decoded to UTF8
{
var codepage1252Encoding = Encoding.GetEncoding(WindowsCodepage1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
text = File.ReadAllText(file, codepage1252Encoding);
}
return text;
}
It's also worth noting that the StreamReader class has constructors that take a specific Encoding object, and as I have shown you can adjust the EncoderFallback/DecoderFallback behavior to suit your needs. So if you need a StreamReader or StreamWriter for finer grained work, this approach can still be used.
I use this to convert file encoding to UTF-8
public static void ConvertFileEncoding(String sourcePath, String destPath)
{
// If the destination's parent doesn't exist, create it.
String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
if (!Directory.Exists(parent))
{
Directory.CreateDirectory(parent);
}
// Convert the file.
String tempName = null;
try
{
tempName = Path.GetTempFileName();
// Without an explicit encoding, StreamReader assumes UTF-8 unless a BOM says otherwise.
using (StreamReader sr = new StreamReader(sourcePath))
{
using (StreamWriter sw = new StreamWriter(tempName, false, Encoding.UTF8))
{
int charsRead;
char[] buffer = new char[128 * 1024];
while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
sw.Write(buffer, 0, charsRead);
}
}
}
File.Delete(destPath);
File.Move(tempName, destPath);
}
finally
{
// tempName is still null if GetTempFileName itself failed
if (tempName != null)
File.Delete(tempName);
}
}
Isn't there an industry standard for vCard files' character encoding?
Which is easier for solving my problem? Converting ANSI to UTF8 (and/or the other way round) or trying to detect which encoding the input file has and notifying the user about it?
How I solved this:
I had a vCard file (*.vcf) with 200 contacts in one file, in Russian.
I opened it with the vCardOrganizer 2.1 program and used Split to divide it into 200 files. The contacts came out as messy symbols; the only thing I could read was the numbers.
Steps (be patient, sometimes they take time):
Open the vCard file (mine was 3 MB) with Notepad.
Then choose File > Save As from the menu. In the window that opens, enter the file name (don't forget the .vcf extension), choose the encoding (ANSI or UTF-8), and click Save.
I converted filename.vcf (UTF-8) to filename.vcf (ANSI): nothing was lost and the Russian text is perfectly readable. If you have questions, write to yoshidakatana#gmail.com.
Good luck!
