Reading text that is embedded in a PDF? - c#

I have a PDF that has a string that is in the catalog portion of the PDF file. I need to read that string.
With iTextSharp 5 I was able to read the catalog and pull out the string.
I am now limited to another library (Syncfusion) and in that library the catalog is marked as private and I do not have access to it.
I am able to "open" the PDF in Notepad++ and I can see the string as plain text. I need to programmatically open that file and retrieve that string. Using ReadAllBytes I can read the file, but then I am at a loss as to how to search it for a specific string.
Any suggestions or examples that I can explore would be appreciated.

If you know the encoding of the text, you could always convert the raw bytes to a string and then use a Regex to find what you need.
Here's an example of that:
var bytes = File.ReadAllBytes("example.pdf");
string pdfStr = Encoding.UTF8.GetString(bytes); // assumes the text is UTF-8
Regex pdfReg = new Regex(...); // the regex for finding your string
string pdfSubstring = pdfReg.Match(pdfStr).Value; // the string you needed
C# Regex Reference
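One caveat with decoding as UTF-8: most of a PDF is binary, and byte sequences that are not valid UTF-8 get replaced during decoding. The following is a minimal sketch of an alternative, assuming the string is stored uncompressed (it is visible in Notepad++, which suggests so). It decodes with ISO-8859-1, which maps each byte to exactly one character, and returns the first capture group of a pattern. The class and method names are illustrative, not part of any library.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfRawSearch
{
    // ISO-8859-1 maps every byte value 0-255 to exactly one char, so no
    // bytes are dropped or merged when the binary file is viewed as a string.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the first capture group of the first match, or null if
    // the pattern does not occur in the data.
    public static string FindFirst(byte[] pdfBytes, string pattern)
    {
        string raw = Latin1.GetString(pdfBytes);
        Match m = Regex.Match(raw, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

A call might look like PdfRawSearch.FindFirst(File.ReadAllBytes("example.pdf"), @"/MyKey\s*\(([^)]*)\)"), where /MyKey is a hypothetical entry name standing in for whatever key precedes your string. This will not find text inside compressed streams or strings containing escaped parentheses.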

Related

Non-Unicode to unicode conversion of a txt file

Given a txt file with non-unicode text, I am able to detect its charset as 1251. Now, I would like to convert into unicode.
byte[] bytes1251 = Encoding.GetEncoding(1251).GetBytes(File.ReadAllText("sampleNU.txt"));
String str = Encoding.UTF8.GetString(bytes1251);
This doesn't work.
Is this the way to go about it for non-unicode to unicode conversion?
After trying the suggested approach on the RTF file, I get the dialog below when I try to open the output RTF file. Please let me know what to do, because selecting Unicode doesn't make it readable or give the expected text.
// load as charset 1251
string text = File.ReadAllText("sampleNU.txt", Encoding.GetEncoding(1251));
// save as Unicode
File.WriteAllText("sampleU.txt", text, Encoding.Unicode);
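The code in the question fails because File.ReadAllText without an encoding argument decodes the file as UTF-8, so the 1251 bytes are already damaged before GetBytes runs. Below is a hedged sketch of the load-then-save approach, assuming .NET Core 3.0 or later (or .NET 5+), where code page 1251 is only available after registering CodePagesEncodingProvider. The class and method names are illustrative.

```csharp
using System.IO;
using System.Text;

public static class Cp1251Converter
{
    // Decodes Windows-1251 bytes into a .NET string.
    public static string Decode(byte[] cp1251Bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code
        // pages must be registered before GetEncoding(1251) succeeds.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1251).GetString(cp1251Bytes);
    }

    // Reads a 1251-encoded file and writes it back as UTF-16 little-endian.
    public static void ConvertFile(string sourcePath, string targetPath)
    {
        string text = Decode(File.ReadAllBytes(sourcePath));
        File.WriteAllText(targetPath, text, Encoding.Unicode);
    }
}
```

This treats the file as plain text. An RTF file carries its own character set declarations in its markup, so re-encoding the raw file this way may not be enough for RTF.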

Save a string with HTML format to doc (with unicode)

I have tried many ways, but I don't get the result that I expect. Please help, thanks.
I have a byte array, and I read it into a string. The result is:
string mystring = "<p>Today is a <b>beautiful</b> day</p>";
Now I want to write it to a DOC file with the HTML formatting applied.
My problem is that I can't get a file in Unicode format.
This is what I want in the DOC file:
Today is a beautiful day
Can anyone help me find a way to save my string to a DOC file with Unicode encoding?
If you want to create a very simple Word document from C# code, I would use the .docx format, and follow the MSDN blog entry here.
It includes full sample code to download.
From the page linked to above:
This code will generate a DOCX that loads in Word 2007 or any other
valid Open XML consumer.

C#: Load *.txt to RichTextBox and convert into UTF8

I want to open text files and load them into a RichTextBox. This has been going fine so far, but now I'm struggling with an encoding issue.
So I used the GetType() method from this StackOverflow page:
How to find out the Encoding of a File? C#
- and it returns "System.Text.UnicodeEncoding".
My questions now are:
How do I convert Unicode (I guess that's what they are, although I haven't double checked) into UTF8 (and possibly backwards)?
Can I switch the RichTextBox to display Unicode correctly? The following shows awkward results: rtb.LoadFile(aFile, RichTextBoxStreamType.PlainText);
How can I define which encoding a SaveFileDialog should use?
Instead of having the RichTextBox load the file from the disk, load it yourself, while specifying the correct encoding. (By the way, Encoding.Unicode is just a synonym for "UTF-16 little-endian".)
string myText = File.ReadAllText(myFilePath, Encoding.Unicode);
This will take care of the conversion for you. The string you get is encoded "correctly" (i.e. in the format used internally by .NET), so you can just assign it to the Text property of your RichTextBox.
About your third question: The SaveFileDialog is just a tool that lets the user choose a file name. What you do with the file name (like: save some text into it, or encode some string and then save it) has nothing to do with the SaveFileDialog.
The SaveFileDialog just allows you to choose the path where the file will be saved. It doesn't save it for you.
Use the Encoding class to convert from one encoding to another.
And read this article for some examples of how to convert the text and write it to a file.
You can also use:
richTextBox.LoadFile(filePath, RichTextBoxStreamType.UnicodePlainText);
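For the first question (converting between "Unicode" and UTF-8), the Encoding class mentioned above has a static Convert method that works on byte arrays. A small sketch, with illustrative class and method names:

```csharp
using System.Text;

public static class Utf16Utf8
{
    // Re-encodes UTF-16 little-endian bytes as UTF-8 bytes.
    public static byte[] Utf16ToUtf8(byte[] utf16Bytes) =>
        Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Re-encodes UTF-8 bytes as UTF-16 little-endian bytes.
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes) =>
        Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
}
```

In many cases you never need this directly: reading with File.ReadAllText(path, Encoding.Unicode) and writing with File.WriteAllText(path, text, Encoding.UTF8) performs the same conversion through a string.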

C# Working with files/bytes

I have some questions about editing files with c#.
I have managed to read a file into a byte[]. How can I get the ASCII code of each byte and show it in the text area of my form?
Also, how can I change the bytes and then write them back into a file?
For example:
I have a file and I know the first three bytes are letters. How can I change say, the second letter, to "A", then save the file?
Thanks!
If the file is ASCII, then each byte IS the ASCII code. To print the value of the byte to, say, a label, is as simple as this.
If you have read your file into byte[] file;
label1.Text = file[1].ToString();
To change the second letter to A:
file[1] = (byte)'A';
Or
file[1] = (byte)(int)'A';
I'm not sure, I don't have C# on my Mac to test.
But seriously, if it is a text file, you are better off reading it in as text, not as a byte[]. And you would probably want to manipulate it using a StringBuilder.
Firstly, to read it in as a string:
// Read the file as one string.
System.IO.StreamReader myFile =
new System.IO.StreamReader("c:\\test.txt");
string myString = myFile.ReadToEnd();
myFile.Close();
And this will work if the file is unicode as well.
Then, you can get the Unicode values (which for most latin characters is the same as the ASCII value) like so: int value = (int)myString[5]; or so.
You can then write back to a file like so:
System.IO.File.WriteAllText("c:\\test.txt", myString);
If you are going to do heavy modifications on the text, you should use a StringBuilder, otherwise, normal string operations would be fine.
I can only assume that you want to practice reading and writing files byte by byte. Look into the BitConverter class; there is a lot of help out there for it. First read the file into a byte[]. Once you have your byte[], you can get a readable hex representation like this:
string s = BitConverter.ToString(byteArray); // e.g. "41-42-43"
You can then make your adjustments to the string value. Note that BitConverter.GetBytes has no overload that takes a string, so to write back to the file, parse the hex pairs into bytes again:
byte[] newByteArray = Array.ConvertAll(s.Split('-'), h => Convert.ToByte(h, 16));
Then you could write your bytes back to your file.
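For the concrete case in the question (changing the second letter to "A" and saving), a byte-level round trip might look like the sketch below. It assumes the file is small enough to load into memory and that the replacement is an ASCII character. The class and method names are illustrative.

```csharp
using System;
using System.IO;

public static class ByteEditor
{
    // Returns a copy of the data with the byte at the given index
    // replaced by the code of an ASCII character.
    public static byte[] WithAsciiAt(byte[] data, int index, char c)
    {
        if (c > 127) throw new ArgumentException("Not an ASCII character.", nameof(c));
        byte[] copy = (byte[])data.Clone();
        copy[index] = (byte)c;
        return copy;
    }

    // Reads the whole file, patches one byte, and writes the file back.
    public static void ReplaceInFile(string path, int index, char c)
    {
        File.WriteAllBytes(path, WithAsciiAt(File.ReadAllBytes(path), index, c));
    }
}
```

Calling ByteEditor.ReplaceInFile(path, 1, 'A') would change the second byte, since indexes start at 0.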

ByteArray In C# Is Unable To Show All Contents In TextBox

I'm parsing a PDF file. I converted the data into a byte array, but it doesn't show the full file.
I don't want to use any library or other software.
FileStream fs = new FileStream(fname, FileMode.Open);
BinaryReader br = new BinaryReader(fs);
int pos = 0;
int length = (int)br.BaseStream.Length;
byte [] file = br.ReadBytes(length);
String text = System.Text.ASCIIEncoding.ASCII.GetString(file);
displayFile.Text = text;
It would really help if you'd give more detail - including some code, preferably a short but complete program that demonstrates the problem.
My guess is that when you're doing the conversion you end up with some text containing a null character ('\0') - which Windows Forms controls treat as a string terminator.
For example, if you use:
label.Text = "hello\0there";
you'll only see "hello".
Now you may have this problem due to converting from a byte array to text using the wrong encoding - but we can't really help much more with the little information you've provided.
Based on your code example, I would say that the problem is that you are assuming that the PDF file contains plain ASCII text, which is not the case. PDF is a complicated format, and there are libraries that allow you to parse it.
From a quick Google search: iTextSharp can read the PDF format.
You cannot convert a PDF to text by just interpreting it as ASCII. You may be lucky enough that some of the text actually is ASCII, but you can also expect some of the non-text contents to be indistinguishable from ASCII.
Instead use one of the solutions for parsing PDF. Here is one way using PDFBox and IKVM: Naspinski.net: Parsing/Reading a PDF file with C# and Asp.Net to text
Even the pure ASCII set contains lots of non-printable, non-displayable control characters.
Like Jon said, a \0 (NUL) in a string cuts off everything after it when the string is shown in a Windows Forms control. I had a painful experience with this behavior years back. Control characters like 'bell' and 'backspace' will give you funny output. But do not expect to hear a bell ringing :P.
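If the goal is only to look at the raw bytes in a text box, one workaround is to replace control characters with a visible placeholder before assigning the text. This is a sketch under that assumption and does not make the PDF content readable as text. The class and method names are illustrative.

```csharp
using System.Text;

public static class DisplayText
{
    // Replaces control characters (including NUL) with a placeholder,
    // keeping carriage return, line feed, and tab intact.
    public static string ReplaceControlChars(string text, char placeholder = '.')
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            bool keep = ch == '\r' || ch == '\n' || ch == '\t' || !char.IsControl(ch);
            sb.Append(keep ? ch : placeholder);
        }
        return sb.ToString();
    }
}
```

With the code from the question, that would be displayFile.Text = DisplayText.ReplaceControlChars(text);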
