We're having an issue producing a PDF that contains an image with iText.
We're able to produce the text of the document using iText, but the image is not being pulled through.
We are following the text in their Ebook at https://kb.itextpdf.com/home/it7kb/ebooks/itext-7-converting-html-to-pdf-with-pdfhtml/chapter-1-hello-html-to-pdf
The code in question is below:
void createPdf(string baseUri, string html, string dest)
{
ConverterProperties properties = new ConverterProperties();
properties.SetBaseUri(baseUri);
HtmlConverter.ConvertToPdf(html, new FileStream(dest, FileMode.Create), properties);
}
As far as we can see, the issue is with the string baseUri.
We have assumed that this is the directory where the image is held in our C# project in Visual Studio, and so far we have tried the following strings, to no avail:
/Images/
/Images/NewLogo.png
http://localhost:64070/Images/NewLogo.png
None of these have produced the image in the PDF and any help or suggestions would be greatly appreciated.
We have found that if we set BaseUri to the location of an image at a URL, we are able to produce an image in the PDF.
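For anyone comparing the three values tried above, here is a minimal sketch of how a relative image reference combines with a base URI under ordinary URI resolution. It only uses System.Uri; it assumes pdfHTML resolves relative src values in a comparable way, which is not confirmed here. The helper name Resolve and the src value "Images/NewLogo.png" are hypothetical, since the HTML itself is not shown in the question.

```csharp
using System;

static class BaseUriDemo
{
    // Combines a base URI with a relative reference, the way a browser
    // would resolve <img src="..."> against the page address.
    // Hypothetical helper for illustration, not part of iText.
    public static string Resolve(string baseUri, string relative)
    {
        return new Uri(new Uri(baseUri), relative).AbsoluteUri;
    }
}

// Base is the site root: the image path is appended as expected.
// BaseUriDemo.Resolve("http://localhost:64070/", "Images/NewLogo.png")
//   -> "http://localhost:64070/Images/NewLogo.png"
//
// Base is the image file itself: the last segment is dropped first,
// so the folder name ends up doubled.
// BaseUriDemo.Resolve("http://localhost:64070/Images/NewLogo.png", "Images/NewLogo.png")
//   -> "http://localhost:64070/Images/Images/NewLogo.png"
```

If resolution does work this way, the base URI has to be chosen so that it and the src attribute in the HTML together form the real address of the image.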
Related
I'm trying to use IronOCR to create text-searchable versions of scanned PDF documents. The outputted file is displayed properly (and is text-selectable) in pretty much every viewer, except for Chrome's built-in PDF viewer.
Here's my code for converting the files:
byte[] origPdfBytes = Properties.Resources.Non_text_searchable;
using (MemoryStream pdfStream = new MemoryStream(origPdfBytes))
{
var ocr = new IronTesseract();
using (OcrInput input = new OcrInput())
{
input.AddPdf(pdfStream);
OcrResult ocrResult = ocr.Read(input);
ocrResult.SaveAsSearchablePdf("C:\\temp\\OCRTest\\output.pdf");
}
}
Here is a sample file that I've converted using IronOCR: https://drive.google.com/file/d/1_uhmZKJN_TFStApfeeieI8LezAPjp1mj/view
If you download and view this file in pretty much any viewer other than Chrome, the text is properly selectable. However, in Chrome, the cursor does appear to be selecting text, but it does not display properly.
I've used Chrome's built-in PDF viewer for years, and I've never run into an issue like this. I'm not sure if this is an issue with IronOCR's output formatting, or if it's just a problem with Chrome. Any ideas?
I'm looking for a solution, either paid or free...
I have superscript text stored in SQL in RTF format, and I need to print the superscript on a PDF document along with other text. So for example the PDF doc might read "Equation 1:" and then print the superscript text extracted from SQL.
I have been searching for an easy way to do this and have so far come up empty.
The current PDF docs are made with PDFSharp, but I'm happy to change that for a workable solution.
I thought of converting the rtf to an image with PdfConverter and then placing the image on the pdf doc but that doesn't seem to work.
I tried to do this with the following code, but it throws the error "Parameter is not valid".
PdfConverter pdfConverter = new PdfConverter();
byte[] rtfstring = pdfConverter.GetPdfBytesFromRtfString(spec);
ImageConverter conv = new ImageConverter();
Image i = (Image)conv.ConvertFrom(rtfstring);
How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, complete paragraphs are broken into multiple lines, and so on.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (for example, add a tag to) text in bold, italic or underlined form as well? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, but not the exact output you want.
If you want exact, pre-interpreted output, then you should check paid OCR SDKs like OmniPage Capture SDK and ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
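As an illustration of the kind of clean-up meant here, below is a minimal C# sketch that collapses runs of spaces and rejoins lines that belong to the same paragraph. It is a heuristic, not anything from Spire.PDF: it assumes the extractor leaves a blank line between paragraphs, which may not hold for every document, and the class name ExtractedTextCleaner is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractedTextCleaner
{
    // Heuristic clean-up for text extracted from a PDF.
    // Assumption: blank lines mark paragraph boundaries.
    public static string Clean(string raw)
    {
        // Normalize line endings to '\n'.
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop spaces that sit directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Split into paragraphs on one or more blank lines.
        string[] paragraphs = Regex.Split(text.Trim(), @"\n{2,}");

        // Within a paragraph, turn the remaining line breaks into spaces.
        for (int i = 0; i < paragraphs.Length; i++)
        {
            paragraphs[i] = paragraphs[i].Replace('\n', ' ');
        }

        string separator = Environment.NewLine + Environment.NewLine;
        return string.Join(separator, paragraphs);
    }
}
```

In the question's code this could be applied as File.WriteAllText(fileName, ExtractedTextCleaner.Clean(buffer.ToString())). It will not help with multi-column pages, where the extraction order itself is the problem.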
I'm new to stack overflow, C# and onenote interop com api. I'm trying to display a pdf file in onenote using C# and the onenote com/interop api (I'd rather not use the REST API).
I am able to display a link to a pdf file using the tag <InsertedFile pathSource="[myfilepath]" preferredName="[myPreferredName]"> in conjunction with the UpdatePageContent function in the interop API, but this doesn't display the PDF.
I have been able to get my program to display an image in onenote using the following code to create the image tag
private XElement createImageTag(Image image)
{
string OneNoteNamespace = "http://schemas.microsoft.com/office/onenote/2013/onenote";
var img = new XElement(XName.Get("Image", OneNoteNamespace));
var data = new XElement(XName.Get("Data", OneNoteNamespace));
data.Value = this.toBase64(image);
img.Add(data);
return img;
}
private string toBase64(Image image)
{
using (var memoryStream = new MemoryStream())
{
image.Save(memoryStream, ImageFormat.Png);
var binary = memoryStream.ToArray();
return Convert.ToBase64String(binary);
}
}
I tried altering this for a pdf instead of an image by converting a pdf to a byte array, then converting it to base64 and assigning the result as data.Value in the createImageTag function, but it did not result in a displayed pdf either (presumably because onenote was expecting an image and not a pdf). I'd like to avoid using third party libraries or extensions to convert a pdf to an image if possible, and haven't found any other ways to convert a pdf to an image.
I used ONOMSpy to look for any other onenote/xml tags that might help me display a pdf in onenote, but didn't see others besides the Image and InsertedFile tags that looked like they were close to doing what I wanted.
So if you could help me either:
1) find an easy way to convert a pdf to an image using C# or
2) show me how to tell onenote to display the PDF
I'd really appreciate it. Thanks!
I'd like to accomplish the following:
Given the path name of an html file, and the desired pathname of a pdf file, convert the HTML file to PDF using iTextSharp. I've seen plenty of code samples which do close to this but not exactly what I need. I believe my solution will need to use the iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList() function, but I'm having trouble getting this to work with an actual HTML file and outputting an actual PDF file.
public void GeneratePDF(string htmlFileName, string outputPDFFileName)
{...}
is the function I'd really like to get working properly.
Thanks in advance
Edit: Here's an example of what I've tried:
iTextSharp.text.Document doc = new Document();
PdfWriter.GetInstance(doc, new FileStream(Path.GetFullPath("fromHTML.pdf"), FileMode.Create));
doc.Open();
try
{
List<IElement> list = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(new StringReader(File.ReadAllText(this.textBox1.Text)), null);
foreach (IElement elm in list)
{
doc.Add(elm);
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
doc.Close();
Note that textBox1.Text contains the full path name of the html file I'm trying to convert to pdf and I want this to get output to "fromHTML.pdf"
Thanks!
I had the same requirement and was diverted to this page by Google but could not find a concrete answer.
But after some head hitting and trials, I have been able to successfully convert the HTML code to PDF using iTextSharp library 5.1.1.
The code that I have shared here also takes care of the img tags in HTML with relative paths. The iTextSharp library throws an error if your img tags do not have an absolute src.
You can find the code here:
http://am22tech.com/s/22/Blogs/post/2011/09/28/HTML-To-PDF-using-iTextSharp.aspx
Let me know if you need more information. The code is in c#.
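In case the link goes stale, here is a minimal sketch of the general idea of giving img tags an absolute src before the HTML is handed to the converter. This is not the code from the linked post; the class name ImgSrcRewriter, the regex approach, and the example URL are assumptions for illustration, and a real HTML parser would be more robust than a regex.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImgSrcRewriter
{
    // Rewrites every quoted <img src="..."> value to an absolute URL,
    // resolved against baseUrl. Values that are already absolute are
    // left pointing at the same address.
    public static string MakeAbsolute(string html, string baseUrl)
    {
        var baseUri = new Uri(baseUrl);

        // Group 1: everything up to and including the opening quote.
        // Group 2: the src value. Group 3: the closing quote.
        const string pattern =
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'])([^\"']+)([\"'])";

        return Regex.Replace(
            html,
            pattern,
            m => m.Groups[1].Value
                 + new Uri(baseUri, m.Groups[2].Value).AbsoluteUri
                 + m.Groups[3].Value,
            RegexOptions.IgnoreCase);
    }
}
```

Usage would be something like string fixedHtml = ImgSrcRewriter.MakeAbsolute(html, "http://example.com/site/"); before parsing, where the base URL is whatever address the relative paths were written against.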