An update: I've gotten my code to read a single PDF file and parse the information into a text file. Great. Now I want to figure out how to do the following two things.
1. Get the program to read more than one PDF file. If I could get it to read an entire folder, that would be best. I'm not sure how to change the code to do that, but I know it can't be that different.
2. Change the activation method. If I could get the code to run whenever a new file is dropped into a folder, that would be absolutely amazing. It has to be possible to have some kind of event listener that fires whenever a file is dropped into a folder and then parses the information.
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\Users\kttricic\OneDrive - Burns & McDonnell\Desktop\test file\POs\test");
file.WriteLine(text);
file.Close();
return text.ToString();
}
}
static void Main(string[] args)
{
Console.WriteLine(ExtractTextFromPdf(@"C:\Users\kttricic\OneDrive - Burns & McDonnell\Desktop\test file\POs\PO 4505234816 Siemens Industry, Inc. 6.15.21.pdf"));
}
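Both goals can be sketched with the standard System.IO types: Directory.EnumerateFiles to walk a folder and FileSystemWatcher to react to new files. The following is only a minimal sketch, not a definitive implementation. The class name PdfFolderProcessor and its methods are hypothetical, and the extract delegate is assumed to be something like the ExtractTextFromPdf method above, reduced to returning the text without writing a file itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper class; none of these names come from the original post.
public static class PdfFolderProcessor
{
    // Runs `extract` on one PDF and writes the result next to it as <name>.txt.
    public static string ProcessFile(string pdfPath, Func<string, string> extract)
    {
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(txtPath, extract(pdfPath));
        return txtPath;
    }

    // Processes every *.pdf file directly inside `folder` (no subfolders).
    // Returns the paths of the text files that were written.
    public static List<string> ProcessFolder(string folder, Func<string, string> extract)
    {
        var written = new List<string>();
        foreach (string pdfPath in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            written.Add(ProcessFile(pdfPath, extract));
        }
        return written;
    }

    // Starts watching `folder` and processes each PDF as it is created.
    // The caller must keep the returned watcher alive and dispose it when done.
    public static FileSystemWatcher Watch(string folder, Func<string, string> extract)
    {
        var watcher = new FileSystemWatcher(folder, "*.pdf");
        watcher.Created += (sender, e) =>
        {
            try
            {
                ProcessFile(e.FullPath, extract);
            }
            catch (Exception ex)
            {
                // Keep the watcher running even if one file fails.
                Console.Error.WriteLine("Could not process " + e.FullPath + ": " + ex.Message);
            }
        };
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

From Main, this would be called as PdfFolderProcessor.ProcessFolder(folderPath, ExtractTextFromPdf) for a one-off run, or PdfFolderProcessor.Watch(folderPath, ExtractTextFromPdf) followed by something like Console.ReadLine() to keep the process alive. One caveat: the Created event can fire while the file is still being copied, so a real version would probably need a short retry when the PDF cannot be opened yet.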
Related
I have a .NET Core 3 app that reads and splits a PDF containing paychecks from some companies I work for.
The app ran well until the latest builds. Now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words and no special characters, plus a few tables and a single logo. I can't attach it for privacy reasons.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paychecks = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Is there anything I can do?
As far as I can see, the only good free option for reading PDF text is iText7.
I am trying to get the text from a PDF stored in local storage in a Windows Phone 8.1 application, but I always get a FileNotFoundException.
To explain the whole story: I get a PDF from an online source and store it in a folder named after the user's username (the username is an email address, but I also tried it without the @ sign). Then I want to get some text from the PDF file. I use iTextSharp and follow the examples, but cannot succeed. When I send the PDF to the Launcher, it opens successfully in another app such as Acrobat Reader.
My function is shown below. I first pass in a PDF object, which has an attribute called Path and is stored in a folder specific to the user's username.
Then I get the PDF as a StorageFile item. When I create the PdfReader by calling the constructor, I get a FileNotFoundException. Does anybody know, or can anybody guess, what the problem might be? Is iTextSharp compatible with Windows Phone 8.1?
internal async Task<bool> OpenPdfFromDownloadedCollections(PDF pdfToOpen, string username)
{
try
{
StorageFolder folder = ApplicationData.Current.LocalFolder;
var pdfFolder = await folder.GetFolderAsync(username + "PDFs");
var pdf = await pdfFolder.GetFileAsync(Object.Path);
StringBuilder text = new StringBuilder();
using (PdfReader reader = new PdfReader(pdf.Path))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string[] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
}
}
return true;
}
catch (Exception)
{
return false;
}
}
var pdf = await pdfFolder.GetFileAsync(Object.Path);
In this line of code you should pass only the file name, but you are passing the whole path as the parameter. pdfFolder already represents the folder path.
I have a problem reading and displaying the content of some PDFs in a RichTextBox.
I use the following code:
string fileName = @"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
str = str + s;
rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed in the RichTextBox and some cannot. For those that cannot be read, I get an empty RichTextBox with only a few blank lines, as if I had pressed Enter a couple of times.
Does anybody know what could be wrong?
You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking in the wrong place.
I am using a web service that returns some data, and I write that data to a text file. My problem is that the file path is hard-coded in the C# code, whereas I want to open a dialog box that asks the user where to save the file. My code is posted below; please help me modify it. Everything I found on the internet takes a different approach and requires a lot of changes, and I do not want to change my code extensively. I can already write the content to the test file, but how can I ask the user to choose the location on their computer?
StreamWriter file = new StreamWriter("D:\\test.txt");
HttpWebRequest webreq = (HttpWebRequest)WebRequest.Create(yahooURL);
// Get the response from the Internet resource.
HttpWebResponse webresp = (HttpWebResponse)webreq.GetResponse();
// Read the body of the response from the server.
StreamReader strm =
new StreamReader(webresp.GetResponseStream(), Encoding.ASCII);
string content = "";
for (int i = 0; i < symbols.Length; i++)
{
// Loop through each line from the stream,
// building the return XML Document string
if (symbols[i].Trim() == "")
continue;
content = strm.ReadLine().Replace("\"", "");
string[] contents = content.ToString().Split(',');
foreach (string dataToWrite in contents)
{
file.WriteLine(dataToWrite);
}
}
file.Close();
Try this
using (WebClient Client = new WebClient ())
{
Client.DownloadFile("http://www.abc.com/file/song/a.mpeg", "a.mpeg");
}
I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.
Here is a sample non-English PDF for testing. The result is something like this:
َٛنا Ùٔب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØÙ” Ù‚Ù›Ù…Ø
یٔبٕس © Karl Seguin foppersian.codeplex.com
www.codebetter.com 1 1 Ùٔب٘طث َٛنا یؿیٛ٘
همانرب لوصا یسیون مرن دیلوت رتهب رازÙا
What is the solution?
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
pdfReader.Close();
}
}
return text.ToString();
}
In .Net, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16 but that doesn't matter. Never, ever, ever decompose the string into bytes and try to reinterpret it as a different encoding and slap it back as a string because that doesn't make sense and will almost always fail.
Your problem is this line:
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
I'm going to pull it apart into a couple of lines to illustrate:
byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ÛŒ (assuming Encoding.Default is Windows-1252)
The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
Side-note, it is totally possible that whatever creates a string does it incorrectly, that's not too uncommon actually. But you need to fix that problem before it becomes a string, at the byte level.
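To see the effect without a PDF, here is a small self-contained sketch of the same round trip. It is an illustration under assumptions, not the exact behaviour on every machine: the class MojibakeDemo is hypothetical, and ISO-8859-1 stands in for Encoding.Default because it is available on every .NET runtime without extra packages. With Windows-1252 the damaged characters differ slightly, but the principle is the same.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of iTextSharp or the asker's code.
public static class MojibakeDemo
{
    // Mirrors the problematic line: take the UTF-8 bytes of a string,
    // pretend they were written in a single-byte code page, convert
    // them "back" to UTF-8, and decode the result.
    public static string RoundTrip(string input)
    {
        // Assumption: ISO-8859-1 plays the role of Encoding.Default here.
        Encoding legacy = Encoding.GetEncoding("iso-8859-1");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] converted = Encoding.Convert(legacy, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }
}
```

Calling MojibakeDemo.RoundTrip("abc") returns "abc" unchanged, because ASCII bytes mean the same thing in both encodings. Calling it with "\u06CC" (the Persian letter from the example above) returns a two-character string, because each of the two UTF-8 bytes is misread as a separate character.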
EDIT
The code should be exactly the same as yours above, except that the one re-encoding line should be removed. Also, make sure that whatever you're using to display the text supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.
public string ReadPdfFile(string fileName) {
StringBuilder text = new StringBuilder();
if (File.Exists(fileName)) {
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
EDIT 2
The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.
Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.
PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings
When re-ordering strings as described above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.