PdfTextExtractor.GetTextFromPage suddenly giving empty string - c#

We've been using the iTextSharp libraries for a couple of years now within an SSIS process to read some values out of a set of PDF exam documents. Everything had been running nicely until this week, when we suddenly started getting an empty string back from the PdfTextExtractor.GetTextFromPage method. I'll include the code here:
// Read the data from the blob column where the PDF exists
byte[] byteBuffer = Row.FileData.GetBlobData(0, (int)Row.FileData.Length);
using (var pdfReader = new PdfReader(byteBuffer))
{
    // Here is the important stuff
    var extractStrategy = new LocationTextExtractionStrategy();

    // This call will extract the page with the proper data on it depending on the exam type
    // 1-page exams = NBOME - need to read first page for exam result data
    // 2-page exams = NBME - need to read second page for exam result data
    // The next two statements utilize this construct.
    var vendor = pdfReader.NumberOfPages == 1 ? "NBOME" : "NBME";

    // *** THIS NEXT LINE GIVES THE EMPTY STRING
    var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);

    var stringList = newText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
    var fileParser = FileParseFactory.GetFileParse(stringList, vendor);

    // Populate our output variables
    Row.ParsedExamName = fileParser.GetExamName(stringList);
    Row.DateParsed = DateTime.Now;
    Row.ParsedId = fileParser.GetStudentId(stringList);
    Row.ParsedTestDate = fileParser.GetTestDate(stringList);
    Row.ParsedTestDateString = fileParser.GetTestDateAsString(stringList);
    Row.ParsedName = fileParser.GetStudentName(stringList);
    Row.ParsedTotalScore = fileParser.GetTestScore(stringList);
    Row.ParsedVendor = vendor;
}
This is not happening for all PDFs, by the way. We are reading in exam files from two vendors: the NBME exams still read just fine, but the NBOME exams do not, even though they were reading fine prior to this week.
This leads me to think it is an internal format change of the PDF file itself.
Another data point: the pdfReader itself has data - I can get a byte[] of the file contents - but the call to extract text simply returns an empty string.
I'm sorry I'm not able to show any exam data or files - that information is sensitive.
Has anybody seen something like this? If so, any possible solutions?

Well - we have found our answer. The user had originally been going to the NBOME web site and downloading the PDF exam result files to import into my parsing system. As I said, this worked for quite some time. Recently (this week), however, the user stopped downloading the files and instead used a "print to PDF" feature to print the PDFs to new PDF files. When she did that, the problem occurred.
Bottom line: printing the PDF to a new PDF appears to change something inside the file under the covers, so reading it with iTextSharp doesn't fail outright but returns an empty string. She should have just continued downloading the files directly.
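For anyone hitting the same symptom, a minimal defensive check in the script component might look like the sketch below. The assumption (ours, not something iTextSharp reports) is that an empty result usually means the page has no extractable text layer, as with a re-printed or rasterized PDF.

// Hypothetical guard around the extraction call shown in the question
var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);
if (string.IsNullOrWhiteSpace(newText))
{
    // Assumption: no extractable text layer - flag the file for manual review
    // instead of letting the downstream parser choke on an empty string.
    throw new InvalidDataException("No text extracted - was this PDF re-printed to PDF?");
}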
Thanks to those who offered some comments!

Related

NReco HTML-to-PDF Generator GeneratePdfFromFiles method throws exception

I have a fully working system for creating single-page PDFs from HTML, as below.
After initializing the converter:
var nRecoHTMLToPDFConverter = new HtmlToPdfConverter();
nRecoHTMLToPDFConverter = PDFGenerator.PDFSettings(nRecoHTMLToPDFConverter);
string PDFContents;
PDFContents is an HTML string which is being populated.
The following command works perfectly and gives me the byte[] which I can return:
createDTO.PDFContent = nRecoHTMLToPDFConverter.GeneratePdf(PDFContents);
The problem arises when I want to test and develop the multi-page functionality of the NReco library and convert an arbitrary number of HTML pages to PDF pages.
var stringArray = new string[]
{
    PDFContents, PDFContents,
};
var stream = new MemoryStream();
nRecoHTMLToPDFConverter.GeneratePdfFromFiles(stringArray, null, stream);
var mybyteArray = stream.ToArray();
PDFContents is exactly the same as above. On paper, this should give me the byte array for two identical PDF pages; however, on the call to the GeneratePdfFromFiles method, I get the following exception:
WkHtmlToPdfException: Exit with code 1 due to network error: HostNotFoundError (exit code: 1)
Please help me resolve this if you have experience with this library and its complexities. I have a feeling that I'm not familiar with the proper use of a Stream object in this scenario. I've tested the working single-page line and the malfunctioning multi-page lines in the same method, so their context is identical.
Many thanks
The GeneratePdfFromFiles method you used expects an array of file names (or URLs): https://www.nrecosite.com/doc/NReco.PdfGenerator/?topic=html/M_NReco_PdfGenerator_HtmlToPdfConverter_GeneratePdfFromFiles_1.htm
If you operate with HTML content as .NET strings, you can simply save the strings to temp files, generate the PDF, and delete the temp files afterwards.
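A minimal sketch of that approach, reusing the names from the question (and assuming the usual System, System.IO and System.Linq namespaces):

// Write each HTML string to a temp file, hand the file names to GeneratePdfFromFiles,
// then clean the temp files up afterwards.
var htmlPages = new[] { PDFContents, PDFContents };
var tempFiles = htmlPages
    .Select(html =>
    {
        var path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".html");
        File.WriteAllText(path, html);
        return path;
    })
    .ToArray();

try
{
    using (var stream = new MemoryStream())
    {
        nRecoHTMLToPDFConverter.GeneratePdfFromFiles(tempFiles, null, stream);
        createDTO.PDFContent = stream.ToArray();
    }
}
finally
{
    foreach (var file in tempFiles)
        File.Delete(file);
}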

PDFsharp saved document always prompts the "Save changes" message after filling form fields

Recently I made a program where I take a PDF file and, using PDFsharp, fill in the form fields with the required values. The code works fine and writes the values correctly, but the problem comes after you open the PDF and try to close it: you get the standard "Do you want to save changes before closing" message even though you just opened and closed the document. The code I use looks like this:
string templateDocPath = @"Original.pdf";
using (PdfDocument myTemplate = PdfReader.Open(templateDocPath, PdfDocumentOpenMode.Modify))
{
    PdfAcroForm form = myTemplate.AcroForm;
    if (form.Elements.ContainsKey("/NeedAppearances"))
    {
        form.Elements["/NeedAppearances"] = new PdfBoolean(true);
    }
    else
    {
        form.Elements.Add("/NeedAppearances", new PdfBoolean(true));
    }

    PdfTextField testField = (PdfTextField)(form.Fields["Name"]);
    testField.Value = new PdfString("NameTest");
    testField.ReadOnly = true;

    myTemplate.Save(@"Output.pdf");
    myTemplate.Close();
}
While trying to solve the problem, I found out that the message only appears after you add the "/NeedAppearances" element to the AcroForm. You need this element or the values you write to the document will not show.
Googling some more, I found a forum thread (https://forum.pdfsharp.net/viewtopic.php?f=2&t=3741) where someone asked the same question but didn't get a clear answer. The last comment mentioned that "/NeedAppearances" tells the viewer to regenerate the field values when the document is opened, so new values are generated and therefore have to be saved.
I would like to know if that's true, and whether there is a way to remove the message?
I came across this article yesterday while I was trying to find an answer for the exact problem described in the title. I was never able to find anything from any of the Google searches I ran. However, I was able to figure out what the problem was.
The issue is that when you save the PdfDocument object, it defaults the PDF to version 0. You can verify this by opening the generated PDF in Notepad++ (or a similar text editor) and looking at the first line. When you open the PDF, Acrobat/Reader has to reformat it in order to display it, since it is an outdated PDF version; that is what changes the document.
The solution is to set the Version of the PdfDocument before you save (an example of this can be seen here: https://forum.pdfsharp.net/viewtopic.php?f=2&t=617). As of PDFsharp version 1.50, the highest supported version is 17 (see the PDFsharp wiki for which PDF versions are supported).
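A minimal sketch of that fix against the code in the question, assuming PDFsharp 1.50 (where 17, i.e. PDF 1.7, is the highest supported value):

using (PdfDocument myTemplate = PdfReader.Open(templateDocPath, PdfDocumentOpenMode.Modify))
{
    // ... set /NeedAppearances and fill the fields as in the question ...
    myTemplate.Version = 17;   // set a current PDF version before saving
    myTemplate.Save(@"Output.pdf");
}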

Data import via Management API successful, but data for custom dimensions does not show

I am trying to import data for a custom dimension in Google Analytics through the .NET client library. In Google Analytics, when I view the uploads for a data set under Admin > Data Import > Manage Uploads, it says my uploads are successful, but the data for the custom dimension doesn't show up in my reports. Right now, I am just using my custom dimension to set the category for an article.
Here is how I am uploading through the .NET client library:
string accountId = "***";
string webPropertyId = "***";
string customDataSourceId = "***";
string contentType = "application/octet-stream";
IUploadProgress progress;

using (var dataStream = CreateArticleCsvStream(articles))
{
    var fs = File.Create("test.csv");
    dataStream.CopyTo(fs);
    fs.Close();

    progress = service.Management.Uploads.UploadData(accountId, webPropertyId, customDataSourceId, dataStream, contentType).Upload();
}

if (progress.Status == UploadStatus.Failed)
{
    throw progress.Exception;
}
Here is the output for test.csv
ga:pagePath,ga:dimension1
/path/to/page/,"MyCategory"
When I download the file from the data set, I get the same file as test.csv; it just has a random filename assigned to it.
I found this other question similar to mine, but there was no solution posted. Any help would be appreciated.
I have also waited over 24 hours, but still nothing.
It took a few days of trial and error but I finally found the solution.
First thing to check is that your Website's URL is correct under Admin > View Settings. We had ours set up like my.domain.com/path/to/site when it should have just been my.domain.com. (We are using SharePoint, which is why path/to/site was appended to the site URL)
Second thing to check is that your key/pagePath entries are all correct. In our case, we had an extra forward slash at the end of the URL. For some reason, Google Analytics displays the trailing forward slash in reports, but does not actually store it for the pagePath.
Another error may be capitalization. It seems like GA applies filters after the data has been processed. If you add the lowercase/uppercase filter, notice that it only affects how the URLs display in your reports. Behind the scenes, it seems that GA still stores the URL with whatever capitalization the hit initially came in with. For example if the URL on your site is my.domain.com/path/to/PAGE.aspx and you apply the lowercase filter, the pagePath will display in your reports as /path/to/page.aspx. But, if you use the lowercase value in your csv import, the data will not join. You must use the pagePath that appears on your site (/path/to/PAGE.aspx in this case).
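To illustrate those join rules, here is a hypothetical helper for building the CSV rows; BuildCsvRow and its normalization are my own sketch, not part of the Management API:

// Use the pagePath exactly as the hits were recorded: keep the site's capitalization
// and drop a trailing slash that GA does not store for the pagePath.
static string BuildCsvRow(string pagePath, string category)
{
    var normalizedPath = pagePath.Length > 1 ? pagePath.TrimEnd('/') : pagePath;
    return string.Format("{0},\"{1}\"", normalizedPath, category);
}

// e.g. BuildCsvRow("/path/to/PAGE.aspx", "MyCategory") => /path/to/PAGE.aspx,"MyCategory"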
It would be nice if Google gave some log files when it tries to process and join the uploaded data with the existing data, rather than just saying the upload was successful even though the processing/joining stage may fail.

SQL Server, C# and iTextSharp: what's the best way to join PDFs?

I have a SQL Server database with many, many rows. Each row has a column that contains a stored PDF.
The database is a gig in size, so we can expect roughly half of that to be the PDFs.
Now I have a requirement to join all those PDFs ... into one PDF. Don't ask why.
Can you suggest the best way forward and which component is best suited for this job? There are many answers available:
How can I join two PDF's using iTextSharp?
Merge memorystreams to one itext document
How to merge multiple pdf files (generated in run time)?
as to how to join two (or more) PDFs. But what I'm asking about is performance. We are literally dealing with around 50,000 PDFs that need to be merged into one almighty PDF.
[Edit: Solution] This brought the time to merge 1,000 PDFs down from 4m30s to 21s:
public void MergePDFs(string targetPDF, string sourceDir)
{
    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
    {
        var files = Directory.GetFiles(sourceDir);
        Document pdfDoc = new Document(PageSize.A4);
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();
        Console.WriteLine("Merging files count: " + files.Length);

        int i = 1;
        var watch = System.Diagnostics.Stopwatch.StartNew();
        foreach (string file in files)
        {
            Console.WriteLine(i + ". Adding: " + file);
            pdf.AddDocument(new PdfReader(file));
            i++;
        }

        if (pdfDoc != null)
            pdfDoc.Close();

        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        MessageBox.Show(elapsedMs.ToString());
    }
}
I just did a C#/WinForms project with PDFsharp, merging images into PDFs, and it worked phenomenally with a traditional folder structure. I imagine it would work similarly with database-stored PDFs, so long as you can pull each one into a memory stream first and then merge them.
Some suggestions:
1) I recommend doing it in a multi-threaded environment so you can work on multiple PDFs at a time.
2) Open only what you need and close it as soon as the operation is complete. Say you have three documents that need to be merged into one: create a blank PDF, open the first into a memory stream, open the blank, append the first to the blank, close the first, save the blank, close the blank, then repeat for the second and third. This way you control how much memory you are using at any one point in time. Doing it this way I was able to append millions of images while keeping memory usage under control (see the sketch after this list).
3) Ensure you use using statements when working with disposable objects. This helps with cleanup and avoids having to call the garbage collector manually, which is generally frowned upon.
4) Separate your business (work) layer from your UI as best you can, so you can cancel the operation at any point or view the current status as it progresses.
5) Log everything that is done so that you can go back and correct one-offs for the PDFs that didn't make it through the first pass.
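A minimal sketch of points 2) and 3) using iTextSharp's PdfCopy (the same API as the edit above); MergePdfBlobs is my own sketch, and pdfBlobs stands in for the PDF bytes pulled out of the database:

using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public void MergePdfBlobs(string targetPdf, IEnumerable<byte[]> pdfBlobs)
{
    using (var stream = new FileStream(targetPdf, FileMode.Create))
    {
        var pdfDoc = new Document();
        var copy = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();

        foreach (var blob in pdfBlobs)
        {
            // Open one source document at a time and close it as soon as it has been
            // copied, so only one source PDF is held in memory at any point.
            var reader = new PdfReader(blob);
            copy.AddDocument(reader);
            reader.Close();
        }

        pdfDoc.Close();
    }
}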

Get all lines after the last occurrence of a line with 'keyword' in C#

I am working on a C# project.
I am trying to send a log file via email whenever the application crashes.
However, the log file is fairly large.
So I thought I should include only a specific portion of the log file.
For that, I am trying to read all the lines after the last occurrence of a line containing a specified keyword (in my case, "Application Started").
Since the application gets restarted many times (due to crashing), 'Application Started' is printed many times in the file. So I only want the last line containing 'Application Started' and the lines after it, until the end of the file.
I need help figuring out how to do this.
I have only basic code as of now:
string line;
System.IO.StreamReader file = new System.IO.StreamReader("c:\\mylogfile.txt");
while ((line = file.ReadLine()) != null)
{
    if (line.Contains("keyword"))
    {
    }
}
Read the file, line-by-line, until you find your keyword. Once you find your keyword, start pushing every line after that into a List<string>. If you find another line with your keyword, just Clear your list and start refilling it from that point.
Something like:
List<string> buffer = new List<string>();
using (var sin = new StreamReader("pathtomylogfile"))
{
    string line;
    bool read = false;   // only start buffering once the keyword has been seen
    while ((line = sin.ReadLine()) != null)
    {
        if (line.Contains("keyword"))
        {
            buffer.Clear();
            read = true;
        }
        if (read)
        {
            buffer.Add(line);
        }
    }
    // now buffer has the last entry;
    // you could use string.Join to put it back together in a single string
    var lastEntry = string.Join("\n", buffer);
}
If the number of lines in each entry is very large, it might be more efficient to scan the file first to find the last entry and then loop again to extract it. If the whole log file isn't that large, it might be more efficient to just ReadToEnd and then use LastIndexOf to find the start of the last entry.
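A compact sketch of that two-pass idea (my own, assuming the marker line is "Application Started" and the System.IO/System.Linq namespaces):

// Pass 1: record the line number of the last marker.
int lastMarker = -1, n = 0;
foreach (var l in File.ReadLines(@"c:\mylogfile.txt"))
{
    if (l.Contains("Application Started")) lastMarker = n;
    n++;
}

// Pass 2: collect everything from the last marker to the end of the file.
var tail = lastMarker >= 0
    ? File.ReadLines(@"c:\mylogfile.txt").Skip(lastMarker).ToList()
    : new List<string>();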
Read everything from the file and then select the portion you want.
string lines = System.IO.File.ReadAllText("c:\\mylogfile.txt");
int start_index = lines.LastIndexOf("Application Started");
// Guard against the marker not being found (LastIndexOf returns -1 in that case)
string needed_portion = start_index >= 0 ? lines.Substring(start_index) : lines;
SendEmail(needed_portion);
I advise you to use a proper logging library, like log4net or NLog.
You can configure it to write to multiple files - one containing the complete log, another containing errors/exceptions only. You can also set a maximum size for the log files, and even have it send you a mail when an exception occurs.
Of course this does not solve your current problem; for that, there are solutions above.
But I would try simpler methods first, like opening the file in Notepad++ - it can handle bigger files (last time I formatted a 30 MB XML document with it, it took about 20 minutes, but it did it! With plain text files the performance should be much better). Also, if you open the file for reading only (not for editing), you may get much better performance (on Windows).
