Non-Unicode to Unicode conversion of a txt file - C#

Given a txt file with non-Unicode text, I am able to detect its charset as 1251. Now I would like to convert it to Unicode.
byte[] bytes1251 = Encoding.GetEncoding(1251).GetBytes(File.ReadAllText("sampleNU.txt"));
String str = Encoding.UTF8.GetString(bytes1251);
This doesn't work.
Is this the way to go about it for non-unicode to unicode conversion?
After trying the suggested approach on the RTF file, I get the dialog below when I try to open the output RTF file. Selecting Unicode doesn't make it readable or give the expected text. Please let me know what to do.

// load as charset 1251
string text = File.ReadAllText("sampleNU.txt", Encoding.GetEncoding(1251));
// save as Unicode
File.WriteAllText("sampleU.txt", text, Encoding.Unicode);
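The attempt in the question fails because File.ReadAllText without an encoding argument decodes the file as UTF-8, so the 1251 bytes are already misread before GetBytes is called. A minimal sketch of the same load-then-save idea at the byte level follows. The class and method names are hypothetical, and it assumes .NET Core 3.0 or later / .NET 5+, where legacy code pages have to be registered first.

```csharp
using System.IO;
using System.Text;

public static class Cp1251Converter
{
    // Hypothetical helper: decode Windows-1251 bytes into a .NET string.
    public static string Decode1251(byte[] raw)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy code
        // pages are only available after this registration. On .NET Framework
        // code page 1251 is built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1251).GetString(raw);
    }

    // Hypothetical helper: re-save a 1251 file as UTF-16 LE ("Unicode" in .NET).
    public static void ConvertFile(string sourcePath, string targetPath)
    {
        string text = Decode1251(File.ReadAllBytes(sourcePath));
        File.WriteAllText(targetPath, text, Encoding.Unicode);
    }
}
```

Usage would be something like Cp1251Converter.ConvertFile("sampleNU.txt", "sampleU.txt").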

Related

Display and save '\0' in .Net

.Net 4.5 Framework.
I have a string:
string input = "abcdqw\0asdv\0aaa";
Is there any way to display the string in a RichTextBox as
abcdqwasdvaaa
and, when I save it to a .txt file and then open it in Notepad++, have it show as
abcdqw[nul]asdv[nul]aaa
???
When I display it the normal way with
richTextBox.Text = input;
the output is just
abcdqw
You can make a RichTextBox load a file that has ASCII NULs in it:
yourRichTextBox.LoadFile(@"C:\path\to\file\with\nulls.txt", RichTextBoxStreamType.PlainText);
But you can't do it by setting its Text property. I presume this is because that is ultimately handled via Windows message calls (WM_SETTEXT), which cut the text off at the first ASCII NUL encountered.
I haven't tried a null character in a RichTextBox, however I guess you are getting a truncated string on display. If that is the case, the solution should be as easy as
var input = "abcdqw\0asdv\0aaa";
var displayResult = input.Replace("\0","");
\0 is a "null character".
It seems the RichTextBox is truncating the string at \0, so strip the character before assigning the text, like this:
string input = "abcdqw\0asdv\0aaa";
var cleaned = input.Replace("\0", string.Empty);
richTextBox.Text = cleaned;
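To cover both halves of the question, one option is to keep the original string for saving and only strip the NULs for display. A minimal sketch, with hypothetical class and method names:

```csharp
using System.IO;

public static class NulDemo
{
    // Text for display: NUL characters removed so the control does not truncate.
    public static string ForDisplay(string input)
    {
        return input.Replace("\0", string.Empty);
    }

    // Saves the original string unchanged; the NUL characters are written as-is.
    public static void SaveOriginal(string path, string input)
    {
        File.WriteAllText(path, input);
    }
}
```

With this, richTextBox.Text = NulDemo.ForDisplay(input) shows the cleaned text, while NulDemo.SaveOriginal("out.txt", input) keeps the NUL bytes in the file.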

Reading text that is embedded in a PDF?

I have a PDF that has a string that is in the catalog portion of the PDF file. I need to read that string.
With iTextSharp 5 I was able to read the catalog and pull out the string.
I am now limited to another library (Syncfusion) and in that library the catalog is marked as private and I do not have access to it.
I am able to "open" the PDF in Notepad++ and I can see the string as plain text. I need to programmatically open that file and retrieve that string. Using ReadAllBytes I can read the file, but then I am at a loss as to how to search it for a specific string.
Any suggestions or examples that I can explore would be appreciated.
If you know the encoding of the text, you could always convert the raw bytes to a string and then use a Regex to find what you need.
Here's an example of that:
var bytes = File.ReadAllBytes("example.pdf");
string pdfStr = Encoding.UTF8.GetString(bytes); //for UTF8
Regex pdfReg = new Regex(...); //the regex for finding your string
string pdfSubstring = pdfReg.Match(pdfStr).Value; //the string you needed
C# Regex Reference
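Since a PDF mixes text with binary data, decoding it as UTF-8 can replace invalid byte sequences. A variation on the approach above is to decode with ISO-8859-1, which maps each byte to exactly one character. This is only a sketch: the class, the method, and the /MyKey entry in the usage note are hypothetical, and it assumes the string is stored uncompressed (as it seems to be, given that it is visible in Notepad++).

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class RawPdfSearch
{
    // Returns the first capture group of the pattern, or null if nothing matches.
    public static string FindFirst(byte[] fileBytes, string pattern)
    {
        // ISO-8859-1 maps every byte 0-255 to exactly one char, so binary
        // stream data in the PDF cannot cause decoding errors or shift text.
        string raw = Encoding.GetEncoding("ISO-8859-1").GetString(fileBytes);
        Match m = Regex.Match(raw, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For a catalog entry that looks like /MyKey (some text), a call could be RawPdfSearch.FindFirst(File.ReadAllBytes("example.pdf"), @"/MyKey \(([^)]*)\)").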

How to write the character `é` in an rtf file?

I need to write a string from C# into an rtf file, but I am having weird problems.
To write the text I simply use
string fileName = System.IO.Path.GetTempPath() + Guid.NewGuid().ToString() + ".rtf";
System.IO.File.WriteAllText(fileName, body);
body is a string variable, that is filled from a varchar column from a database.
The problem is with the character é, which WordPad displays incorrectly when opening the file, like this
If I open the file in notepad, I see this
(één schade gevonden -> ander dossier)
So for some dark reason WordPad decided to show the character é all messed up like this.
I tried writing the file as UTF8 or other Unicode encodings, but then WordPad refused to see the file as rtf and just showed the plain text with all the tags.
I also looked at this page, which tells me to write a tag like \uXXX?, where XXX should be a number giving a Unicode UTF-16 code unit.
But I cannot find what number to use, or any good example of how to do this.
Actually I am not even sure if it's Unicode related. In my mind é is not even a character that needs Unicode, but I could be wrong of course.
Anyway, does anyone know how to solve this problem?
I just need a way to make WordPad not mess up the character é on display and on print.
The problem was that I did not encode the RTF file properly.
Using this link provided by Filburt, I managed to encode the RTF file correctly like this:
var iso = Encoding.GetEncoding("ISO-8859-1");
string fileName = System.IO.Path.GetTempPath() + Guid.NewGuid().ToString() + ".rtf";
System.IO.File.WriteAllText(fileName, body, iso);
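For characters that ISO-8859-1 cannot represent, the \uXXX? tag mentioned in the question is an alternative. In RTF the number is the UTF-16 code unit written as a signed 16-bit decimal, followed by a fallback character, so é (U+00E9) becomes \u233?. A rough sketch with hypothetical names, assuming body already contains valid RTF markup so that backslashes and braces must stay untouched:

```csharp
using System.Globalization;
using System.Text;

public static class RtfEscaper
{
    // Replaces every non-ASCII UTF-16 code unit with an RTF \uN? escape.
    // Assumption: the input is already RTF, so '\', '{' and '}' are kept as-is.
    public static string EscapeNonAscii(string rtf)
    {
        var sb = new StringBuilder(rtf.Length);
        foreach (char c in rtf)
        {
            if (c < 128)
            {
                sb.Append(c);
            }
            else
            {
                // RTF expects a signed 16-bit decimal; '?' is the fallback
                // character for readers that do not understand \u.
                short codeUnit = unchecked((short)c);
                sb.Append("\\u");
                sb.Append(codeUnit.ToString(CultureInfo.InvariantCulture));
                sb.Append('?');
            }
        }
        return sb.ToString();
    }
}
```

The result is pure ASCII, so File.WriteAllText(fileName, RtfEscaper.EscapeNonAscii(body)) should not depend on the encoding chosen for the file.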

Change readable portions of file with unknown characters

I'm trying to read a text file that has readable and unreadable characters. It opens easily in any text editor. Most of the text characters are unknown characters and the part I want to change is readable.
The file looks like this
readable1 gibberish readable2 gibberish.
I want to change readable2
If I use the following techniques they seem to return only readable1. They do not give the same output as dropping it on a text reader.
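string ReadFile(string path)
{
    // Option 1: StreamReader (decodes as UTF-8 unless a BOM says otherwise)
    using (var sr = new StreamReader(path))
    {
        return sr.ReadToEnd();
    }
    // Option 2: return File.ReadAllText(path);
}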
I tried a few encodings ASCII, Unicode, UTF8, UTF32 but nothing seems to match the same output as dragging onto a text editor.
byte[] bytes = System.IO.File.ReadAllBytes(path);
string str = System.Text.Encoding.ASCII.GetString(bytes);
Is there any way to get it to return all the characters and just modify the readable characters?

Open file, read as hex and convert it to ASCII?

Is it possible to read a file's hex values into C# and output the corresponding ASCII? I can view the file in a hex editor, where I can see the appropriate ASCII next to the hex, but rather than manually copying out the parts I need, I imagine there is a way of having the machine do it for me in a C# program?
I did find "Converting HEX data in a file to ascii", but that didn't really help.
It sounds like you just need:
string text = File.ReadAllText("file.txt");
There's no such thing as "hex values" in a file - they're just bytes which are shown as hex in various editors geared towards editing non-text files.
The above line of code will load a text file, decoding it as UTF-8 - which is compatible with ASCII, so if your file is truly ASCII, it should be fine. If you need to specify a different encoding, you can do it with an overload, e.g.
// Load an ISO-8859-1 file
string text = File.ReadAllText("file.txt", Encoding.GetEncoding(28591));
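To illustrate that the hex and the ASCII columns in a hex editor are just two views of the same bytes, here is a small sketch that produces both from a byte array. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class HexView
{
    // Hex representation of the bytes, e.g. "48-69" for the bytes of "Hi".
    public static string ToHex(byte[] data)
    {
        return BitConverter.ToString(data);
    }

    // ASCII view in the style of a hex editor: non-printable bytes become '.'.
    public static string ToPrintableAscii(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            sb.Append(b >= 0x20 && b < 0x7F ? (char)b : '.');
        }
        return sb.ToString();
    }
}
```

Both methods can be fed the result of File.ReadAllBytes("file.txt") when the file is not plain text and ReadAllText would not be appropriate.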
