Read a PDF document in Blazor WebAssembly with iText7 - c#

I am having a bit of a struggle reading a PDF using iText7 in Blazor WebAssembly.
The InputFile component creates an IBrowserFile:
<div>
<InputFile OnChange="@OnFileSelection"></InputFile>
<div class="row">
<textarea>@outputText</textarea>
</div>
</div>
I can then read the file with Stream - and iText7 will supposedly read that - but it won't give a page count or anything else that I have tried. It also doesn't seem to pass over the reader, and doesn't even seem to get to the pageCount.
int pageCount = 0;
IBrowserFile pdfFile = e.File;
Stream stream = pdfFile.OpenReadStream();
PdfDocument pdfDoc = new PdfDocument(new PdfReader(stream));
pageCount = pdfDoc.GetNumberOfPages();
stream.Close();
outputText = $"{pageCount}";
StateHasChanged();
I have also tried reading the Stream into a MemoryStream first, same outcome. I have followed the information here:
https://learn.microsoft.com/en-us/aspnet/core/blazor/file-uploads?view=aspnetcore-6.0&pivots=webassembly
Same outcome.
Is there a way to handle the PDF file in such a way as the functionality of iText7 remains intact, so you can get page counts, extracted text etc?
The file I am testing on is below the 500kb limit, it is 66kb. I don't need to display the PDF - I just need to know what the contents of it are ideally on a page by page basis, but for now, simply being able to read a page or get a page count would be a big step forward.

If you look at your developer console, you'll find that it's emitting the error:
Synchronous reads are not supported.
You'll notice you aren't using await anywhere, and, unfortunately, neither is iText7. Blazor strongly enforces use of asynchronous semantics and if it's violated, you'll see an error like this.
Fortunately, you can still make this work. You said:
I have also tried reading the Stream into a MemoryStream first, same outcome.
You should show what you tried, but my hunch is it looked something like this:
var copy = new MemoryStream();
stream.CopyTo(copy);
copy.Position = 0;
PdfDocument pdfDoc = new PdfDocument(new PdfReader(copy));
This will lead to the exact same error on the line .CopyTo. And for the same reasons. If you instead make the copy process properly use async semantics, it will work:
var copy = new MemoryStream();
await stream.CopyToAsync(copy);
copy.Position = 0;
PdfDocument pdfDoc = new PdfDocument(new PdfReader(copy));
Notice await stream.CopyToAsync(copy);. You'll need to make your surrounding method async in order for the await to work, but the return type ought to be a Task already. (And if it isn't, you can make it so)
Using this, I was able to see the page count display in your text area.
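Putting the pieces together, the whole handler might look like the sketch below. It reuses the names OnFileSelection and outputText from the question; the class name PdfReaderPage, the 512000-byte cap (mirroring the InputFile default) and the use of iText7's PdfTextExtractor for the per-page text are assumptions added for illustration, not something tested against the asker's project.

```csharp
// Code-behind sketch for the component in the question (class name is hypothetical).
using System.IO;
using System.Text;
using System.Threading.Tasks;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using Microsoft.AspNetCore.Components;
using Microsoft.AspNetCore.Components.Forms;

public partial class PdfReaderPage : ComponentBase
{
    private string outputText = "";

    private async Task OnFileSelection(InputFileChangeEventArgs e)
    {
        IBrowserFile pdfFile = e.File;

        // The browser stream only supports asynchronous reads, so buffer it first.
        using var copy = new MemoryStream();
        await using (Stream stream = pdfFile.OpenReadStream(maxAllowedSize: 512000))
        {
            await stream.CopyToAsync(copy);
        }
        copy.Position = 0;

        // From here on iText7 reads synchronously from memory, which is allowed.
        var pdfDoc = new PdfDocument(new PdfReader(copy));
        try
        {
            int pageCount = pdfDoc.GetNumberOfPages();
            var text = new StringBuilder();
            text.AppendLine($"Pages: {pageCount}");
            for (int i = 1; i <= pageCount; i++)
            {
                text.AppendLine($"--- Page {i} ---");
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i)));
            }
            outputText = text.ToString();
        }
        finally
        {
            pdfDoc.Close();
        }
    }
}
```

Because the handler returns a Task, Blazor re-renders the component when it completes, so the explicit StateHasChanged() call from the question should not be needed here.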

Related

NReco HTML-to-PDF Generator GeneratePdfFromFiles method throws exception

I have a fully working system for creating single page PDFs from HTML as below;
After initializing the converter
var nRecoHTMLToPDFConverter = new HtmlToPdfConverter();
nRecoHTMLToPDFConverter = PDFGenerator.PDFSettings(nRecoHTMLToPDFConverter);
string PDFContents;
PDFContents is an HTML string which is being populated.
The following command works perfectly and gives me the byte[] which I can return;
createDTO.PDFContent = nRecoHTMLToPDFConverter.GeneratePdf(PDFContents);
The problem arises when I want to test and develop the multi page functionality of the NReco library and change an arbitrary number of HTML pages to PDF pages.
var stringArray = new string[]
{
PDFContents, PDFContents,
};
var stream = new MemoryStream();
nRecoHTMLToPDFConverter.GeneratePdfFromFiles(stringArray, null, stream);
var mybyteArray = stream.ToArray();
The PDFContents are exactly the same as above. On paper, this should give me the byte array for 2 identical PDF pages; however, on the call to the GeneratePdfFromFiles method, I get the following exception:
WkHtmlToPdfException: Exit with code 1 due to network error: HostNotFoundError (exit code: 1)
Please help me resolve this if you have experience with this library and its complexities. I have a feeling that I'm not familiar with the proper use of a Stream object in this scenario. I've tested the working single page line and the malfunctioning multi page lines on the same method call so their context would be identical.
Many thanks
The GeneratePdfFromFiles method you used expects an array of file names (or URLs), not HTML content: https://www.nrecosite.com/doc/NReco.PdfGenerator/?topic=html/M_NReco_PdfGenerator_HtmlToPdfConverter_GeneratePdfFromFiles_1.htm
Passing HTML strings makes wkhtmltopdf try to resolve them as locations, hence the HostNotFoundError. If you work with HTML content as .NET strings, you can simply save each string to a temp file, generate the PDF from those files, and delete them afterwards.
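A minimal sketch of that temp-file approach is shown below. The helper name FromHtmlStrings and the use of a GUID-based .html file name in the system temp folder are assumptions for illustration; the GeneratePdfFromFiles call uses the same three arguments as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using NReco.PdfGenerator;

public static class MultiPagePdf
{
    // Hypothetical helper: writes each HTML string to a temp .html file,
    // passes the file paths to GeneratePdfFromFiles, then deletes the files.
    public static byte[] FromHtmlStrings(HtmlToPdfConverter converter, IEnumerable<string> htmlPages)
    {
        var tempFiles = new List<string>();
        try
        {
            foreach (string html in htmlPages)
            {
                string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
                File.WriteAllText(path, html);
                tempFiles.Add(path);
            }

            using var output = new MemoryStream();
            // Same call shape as in the question: (input files, cover HTML, output stream).
            converter.GeneratePdfFromFiles(tempFiles.ToArray(), null, output);
            return output.ToArray();
        }
        finally
        {
            // Clean up even if the conversion throws.
            foreach (string path in tempFiles)
            {
                File.Delete(path);
            }
        }
    }
}
```

With the variables from the question, this could be called as MultiPagePdf.FromHtmlStrings(nRecoHTMLToPDFConverter, new[] { PDFContents, PDFContents }).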

iText7 Merge of Multiple PDF MemoryStreams Not Working

I am trying to generate a single PDF file from multiple memory streams. I am having a lot of trouble determining the proper way to merge 2 PDF MemoryStreams into one PDF MemoryStream that contains all the pages from both source PDF MemoryStreams. It seems simple and I think the code below is set up properly, but the resulting PDF memory stream does not contain the merged documents.
I have found multiple ways documented on the Internet as the "proper" way to do the merge. The actual sample code for iText 7 seems unusually complex (it repeatedly mixes multiple concepts into one sample instead of reducing the concept to the simplest possible code) and fails to demonstrate simple concepts. For instance, the PDFMerge documentation has no sample code at all (nor does anything else I looked at in the class documentation). The examples they have online always mix merging from files (not MemoryStreams or byte[]) with other concepts like adding page numbers or a table of contents. So they never show just one concept, and they never start with anything other than files. My PDFs are coming out of a database and we just need to merge them into one PDF memory stream and save it back out. My concern is that maybe I am not creating the MemoryStream properly when I initialize the PDFWriter. As none of their samples ever initialize it with anything but an actual file, I was unable to confirm this was done properly. I also fully qualified all objects in the code because I want to leave the old iTextSharp code in place while I am upgrading to the new iText 7. This was done to make sure an iTextSharp object of the same name wasn't inadvertently being used.
Also, in the interest of making the source as easy as possible to read, I removed some of the declarations and initialization of objects being used. Everything was traced through and all values are fully loaded with proper values as you trace through the code. I am assuming the problem is that I didn't prepare the PDF objects properly or that I have to do something special with the PDFWriter on the destination PDF document (ms) before creating the PDFMerge object.
List<byte[]> streams = new List<byte[]>();
somelist.ForEach(item=>
{
using (var workStream = new MemoryStream())
using (var pdfWriter = new PdfWriter(workStream))
{
pdfWriter.SetCloseStream(false);
HtmlConverter.ConvertToPdf(strContent, pdfWriter);
streams.Add(workStream.ToArray());
pdfWriter.Close();
}
});
MemoryStream ms = new MemoryStream();
PdfWriter writer = new PdfWriter(ms);
PdfDocument document = new PdfDocument(writer);
PdfMerger merger = new PdfMerger(document);
streams.ForEach(stream =>
{
Stream msDoc = new MemoryStream(stream);
PdfDocument doc = new PdfDocument(new PdfReader(msDoc));
merger.Merge(doc, 1, doc.GetNumberOfPages());
doc.Close();
});
ByteContent = ms.ToArray();
document.Close();
Merging is a really straightforward process:
var SourceDocument1 = new PdfDocument(new PdfReader(SRC));
var SourceDocument2 = new PdfDocument(new PdfReader(SRC1));
byte[] result;
using (var memoryStream = new MemoryStream())
{
var pdfWriter = new PdfWriter(memoryStream);
var pdfDocument = new PdfDocument(pdfWriter);
PdfMerger merge = new PdfMerger(pdfDocument);
merge.Merge(SourceDocument1, 1, SourceDocument1.GetNumberOfPages())
.Merge(SourceDocument2, 1, SourceDocument2.GetNumberOfPages());
merge.Close();
result = memoryStream.ToArray();
}
File.WriteAllBytes(@"C:\temp\file.pdf", result);
This will merge SRC with SRC1.
There are a lot of examples on Github, such as this one (there is also a whole folder with merge examples).
I am writing the destination document in the end, just to make sure it's being created correctly, but you can do whatever you want to with the MemoryStream, of course.
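Since the question starts from byte arrays rather than files, the same idea can be sketched for in-memory sources as below. The helper name MergeAll is hypothetical. One difference from the question's code is worth noting: here the target document is closed before ToArray() is called, because iText only finishes writing the PDF on Close(); reading the stream earlier is a likely reason for an incomplete result.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Utils;

public static class PdfMergeSketch
{
    // Hypothetical helper: merges in-memory PDFs into a single in-memory PDF.
    public static byte[] MergeAll(IEnumerable<byte[]> sources)
    {
        using var output = new MemoryStream();
        var target = new PdfDocument(new PdfWriter(output));
        var merger = new PdfMerger(target);

        foreach (byte[] source in sources)
        {
            var doc = new PdfDocument(new PdfReader(new MemoryStream(source)));
            merger.Merge(doc, 1, doc.GetNumberOfPages());
            doc.Close();
        }

        // Closing the target finishes the PDF (cross-reference table, trailer).
        target.Close();

        // MemoryStream.ToArray() still works after the stream has been closed.
        return output.ToArray();
    }
}
```

With the list from the question, this would be called as ByteContent = PdfMergeSketch.MergeAll(streams).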

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespaces and a complete para is broken into multiple lines etc.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (e.g. add a tag to) text in bold, italic or underlined form as well? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, though not exactly the output you want.
If you need exact, pre-interpreted output, you should check paid OCR services like the OmniPage Capture SDK and the ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
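As an illustration of what such a clean-up pass could look like, here is a small sketch in plain C# with no PDF library involved. The class and method names are hypothetical, and the rule it applies (a blank line ends a paragraph, any other line break is a soft wrap) is only one possible heuristic; it does not handle hyphenated words, multi-column pages, or documents where paragraphs are not separated by blank lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtractedTextCleaner
{
    // Heuristic: a blank line ends a paragraph; every other line break inside
    // a paragraph is treated as a soft wrap and replaced by a single space.
    public static string Reflow(string raw)
    {
        var paragraphs = new List<string>();
        var current = new List<string>();

        foreach (string line in raw.Replace("\r\n", "\n").Split('\n'))
        {
            // Collapse runs of spaces and tabs, then trim the ends.
            string trimmed = Regex.Replace(line, @"[ \t]+", " ").Trim();
            if (trimmed.Length == 0)
            {
                if (current.Count > 0)
                {
                    paragraphs.Add(string.Join(" ", current));
                    current.Clear();
                }
            }
            else
            {
                current.Add(trimmed);
            }
        }

        // Flush the last paragraph if the text did not end with a blank line.
        if (current.Count > 0)
        {
            paragraphs.Add(string.Join(" ", current));
        }

        return string.Join(Environment.NewLine + Environment.NewLine, paragraphs);
    }
}
```

In the question's code this would be applied just before writing the file, for example File.WriteAllText(fileName, ExtractedTextCleaner.Reflow(buffer.ToString())).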

Convert HTML with CSS to PDF using iTextSharp

I am working on an ASP.NET website in C#. I want to convert an HTML DIV that contains various HTML elements (divs, labels, tables and images) with CSS styles (background color, cssClass, etc.) to PDF, including its whole content, using the iTextSharp DLL. The issue I am facing is that the CSS is not getting applied. Can anyone help by providing an example or code snippet?
Install 2 NuGet packages iTextSharp and itextsharp.xmlworker and use the following code:
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
byte[] pdf; // result will be here
var cssText = File.ReadAllText(MapPath("~/css/test.css"));
var html = File.ReadAllText(MapPath("~/css/test.html"));
using (var memoryStream = new MemoryStream())
{
var document = new Document(PageSize.A4, 50, 50, 60, 60);
var writer = PdfWriter.GetInstance(document, memoryStream);
document.Open();
using (var cssMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(cssText)))
{
using (var htmlMemoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(html)))
{
XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, htmlMemoryStream, cssMemoryStream);
}
}
document.Close();
pdf = memoryStream.ToArray();
}
Check out Pechkin, a C# wrapper for wkhtmltopdf.
Specifically at this point in time (considering a pending pull request) I'd check out this fork that addresses a couple of bugs (particularly helpful in IIS based on my experience).
If you don't go with the fork / get other stability issues you may want to look at having some kind of "render queue" (e.g. in a database) and have a background process (e.g. Windows service) periodically run over the queue and render then store the binary content somewhere (either in database as well, or on file system). This depends entirely on your use-case though.
Alternatively, see the similar solution that @DaveDev linked to in a comment.

How to modify a pdf file's trimbox with itextsharp

I have a ready made PDF, and I would need to modify the trimbox, bleedbox with SetBoxSize and use the setPDFXConformance. Is there a way to do this?
I've tried with stamper.Writer, but it doesn't care about what I set there
2011.02.01.
We've tested it with Acrobat Pro, and it said that the trimbox was not defined. It seems that the stamper's writer's methods/properties don't affect the resulting pdf. Here are the source and result files: http://stemaweb.hu/pdfs.zip
my code:
PdfReader reader = new PdfReader(@"c:\source.pdf");
PdfStamper stamper = new PdfStamper(reader, new FileStream(@"c:\result.pdf", FileMode.Create));
stamper.Writer.SetPageSize(PageSize.A4);
stamper.Writer.PDFXConformance = PdfWriter.PDFX32002;
stamper.Writer.SetBoxSize("trim", new iTextSharp.text.Rectangle(20, 20, 100, 100));
PdfContentByte cb = stamper.GetOverContent(1);
/*drawing*/
stamper.Close();
Because the boxes are not visible, I tried to modify the pagesize with the writer but that didn't do anything either.
SetPDFXConformance won't turn a "normal" PDF into a PDF/X pdf. SetPDFXConformance is really just for document generation, causing iText to throw an exception if you do something blatantly off spec.
"it doesn't care about what I set there". Trim and bleed boxes are not something you can see visually in Reader. How are you testing for them?
Could you post some code, and a link to your output PDF?
Ah. You're using stamper.Writer. In this case, that doesn't work out so well. All the page-level, Well Supported Actions via PdfStamper will take a page number or a page's PdfDictionary as an argument. SetBoxSize just takes a string and a rectangle, so that's your clue.
Going "under the hood" as you are is actually defaulting back to PdfWriter.setBoxSize... which is only for creating PDFs, not modifying an existing page.
So: You need to use the low-level PDF objects to make the changes you want. No Problemo:
for (int i = 1; i <= myReader.getNumberOfPages(); ++i) {
PdfDictionary pageDict = myReader.getPageN(i); // yes, the READER, not stamper.Writer
PdfRectangle newBox = new PdfRectangle( 20, 20, 100, 100 );
pageDict.put(PdfName.TRIMBOX, newBox);
newBox = new PdfRectangle( PageSize.A4 );
pageDict.put(PdfName.MEDIABOX, newBox );
}
/* drawing */
stamper.close();
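The snippet above uses Java-style method names. A rough C# (iTextSharp 5) equivalent might look like the sketch below; the class and method names are hypothetical, the paths are parameters rather than the hard-coded ones from the question, and the box coordinates are simply the values used in the question.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class TrimBoxSketch
{
    // Hypothetical helper: sets the trim box and media box on every page.
    public static void SetTrimBox(string sourcePath, string resultPath)
    {
        var reader = new PdfReader(sourcePath);

        // Edit each page dictionary on the reader, not via stamper.Writer.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            PdfDictionary pageDict = reader.GetPageN(i);
            pageDict.Put(PdfName.TRIMBOX, new PdfRectangle(20, 20, 100, 100));
            pageDict.Put(PdfName.MEDIABOX, new PdfRectangle(PageSize.A4));
        }

        using (var output = new FileStream(resultPath, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            // Any drawing via stamper.GetOverContent(...) would go here.
            stamper.Close(); // writes the modified pages to resultPath
        }
        reader.Close();
    }
}
```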
As to the PDFX32002 conformance, I think you're going to have to go code diving to figure out exactly what is needed. Writer.PDFXConformance is another aspect of Writer that only works when generating a PDF, not modifying an existing one.
The good news is that PdfXConformanceImp is a public class. The bad news is that it's only used internally by PdfWriter and PdfContentByte... hey. You are getting some changes in behavior with your present code (just not enough). Specifically, if you try something that isn't allowed within that PdfContentByte, you'll get a PdfXConformanceException with a message describing the restriction you've violated. Trying to add an optional content group (layer) would throw, for example.
Ah. That's not so bad. MAYBE. Try this:
PdfXConformanceImp pdfx = new PdfXConformanceImp();
pdfx.setConformance(PdfWriter.PDFX32002);
pdfx.completeInfoDictionary(stamper.Writer.getInfo());
pdfx.completeExtraCatalog(stamper.Writer.getExtraCatalog());
stamper.close();
If you drop stamper.Writer.PDFXConformance = PdfWriter.PDFX32002;, you won't get exceptions when you do something Forbidden in your contentByte. Other than that, I don't think it'll matter.
Hmm.. That's not the whole solution. The OutputIntents from the extraCatalog are merged into the main catalog as well. Perhaps this will work:
//replace the completeExtraCatalog call above with this
pdfx.completeExtraCatalog(myReader.getCatalog());
I wish you luck.
