SelectPDF ConvertUrl Non-English Characters Error - c#

In my demo project I'm using the SelectPdf tool to convert HTML pages to PDF documents. These HTML pages are stored locally, so I'm using the ConvertUrl function for conversion. Here is the code:
`
string htmlurl = AppDomain.CurrentDomain.BaseDirectory + "HTML" + "\\OrderName_" + DateTime.Now.ToString("yyyy'-'MM'-'dd'_'HH'-'mm_") + MockOrderNo + ".html";
HtmlToPdf converter = new HtmlToPdf();
PdfDocument doc = converter.ConvertUrl(htmlurl);
`
Then I save the PDF document using doc.Save(). Here is the resulting PDF document
As you can see, there is a problem displaying Turkish characters like "İ, ı, ş, ğ...". How can I resolve this using SelectPdf? If solving this with SelectPdf is not possible, which other PDF conversion tools do not have this kind of problem?
Also, for my requirements I don't use the ConvertHtmlString function. I need to store HTML pages in a folder, convert these HTML pages to PDF, and store the PDF documents in another folder.
Thanks for your help

I just changed the encoding of the HTML file to windows-1252. This solved the problem.
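The answer does not say how the encoding was changed. One way to do it from code is sketched below, assuming the file's current encoding is known. HtmlReencoder and Reencode are hypothetical names, not part of SelectPdf. The sketch only uses System.IO and System.Text.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (not part of SelectPdf): rewrites a text file in another encoding.
public static class HtmlReencoder
{
    // Reads the file as sourceEncoding and writes it back in place as targetEncoding.
    public static void Reencode(string path, Encoding sourceEncoding, Encoding targetEncoding)
    {
        string html = File.ReadAllText(path, sourceEncoding);
        File.WriteAllText(path, html, targetEncoding);
    }
}
```

For example, HtmlReencoder.Reencode(htmlurl, Encoding.UTF8, Encoding.GetEncoding(1252)) would re-save a UTF-8 file as windows-1252. On .NET Core and later, code-page encodings such as 1252 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance). Whichever encoding is used, a charset declared in the HTML's meta tag presumably has to match the bytes on disk.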

Related

Attaching PDF without saving document using SyncFusion C#.net

Can someone help me or give me a link on how to attach a PDF to an email using the Syncfusion HTML to PDF converter without saving the document?
Here is my code:
//Convert URL to PDF
PdfDocument doc = htmlConverter.Convert(render_to_html(Panel_preview_attachment), baseUrl);
doc.Save(path + "Attachments\\" + file_name); //I need to remove this
string empPdf = path + "Attachments\\" + file_name;
LinkedResource linkedresource = new LinkedResource(empPdf, "application/pdf");
linkedresource.ContentId = "empPdf";
linkedresource.ContentType.Name = file_name;
htmlView.LinkedResources.Add(linkedresource);
You can convert HTML to PDF and send the PDF document by email without saving it to the file system: save the PDF document to a memory stream and attach that stream to the mail. Please refer to the links below for more information:
HTML to PDF: https://help.syncfusion.com/file-formats/pdf/convert-html-to-pdf/webkit
Email output PDF: https://www.syncfusion.com/kb/6064/how-to-create-pdf-dynamically-and-email-as-attachment
Note : I work for Syncfusion.
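The mail side of this approach can be sketched with System.Net.Mail alone. This is a minimal sketch rather than the code from the linked articles: PdfMail and CreatePdfAttachment are hypothetical names, and it assumes the converter can write the PDF to a stream (for example via a doc.Save(stream) overload) so that the bytes are available in memory.

```csharp
using System.IO;
using System.Net.Mail;

// Hypothetical helper: builds a mail attachment from PDF bytes held in memory.
public static class PdfMail
{
    public static Attachment CreatePdfAttachment(byte[] pdfBytes, string fileName)
    {
        // Assumption: pdfBytes came from saving the PDF document to a MemoryStream
        // instead of a file, e.g. memoryStream.ToArray() after the save call.
        var stream = new MemoryStream(pdfBytes);

        // The attachment owns the stream; disposing the MailMessage disposes both.
        return new Attachment(stream, fileName, "application/pdf");
    }
}
```

The result can be added with message.Attachments.Add(...) on a MailMessage, so the doc.Save(path ...) call and the file on disk are no longer needed.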

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, and complete paragraphs are broken into multiple lines.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (for example, with a tag) text in bold, italic or underlined form as well? Things also get more problematic for pages that have multiple columns of text.
Using iText (the example below is Java, iText 7)
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference: https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, though not exactly the output you want.
If you need exact, pre-interpreted output, you should check paid OCR SDKs such as OmniPage Capture SDK and ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
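The cleanup step mentioned above could start from something like the following. It is a heuristic sketch, not a general solution: it assumes paragraphs in the extracted text are separated by blank lines, and ExtractedTextCleaner and Clean are hypothetical names. It joins lines inside a paragraph and collapses runs of whitespace. It does nothing for multi-column pages or bold/italic detection.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: tidies text returned by a PDF text extractor.
public static class ExtractedTextCleaner
{
    public static string Clean(string raw)
    {
        // Normalize Windows and old Mac line endings to '\n'.
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Assumption: a blank line (possibly containing spaces) ends a paragraph.
        string[] paragraphs = Regex.Split(text, @"\n\s*\n");

        // Inside each paragraph, turn any run of whitespace (including single
        // line breaks) into one space, then drop paragraphs that end up empty.
        var cleaned = paragraphs
            .Select(p => Regex.Replace(p, @"\s+", " ").Trim())
            .Where(p => p.Length > 0);

        return string.Join("\n\n", cleaned);
    }
}
```

In the Spire.PDF code from the question, this would be applied as File.WriteAllText(fileName, ExtractedTextCleaner.Clean(buffer.ToString())).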

C# Winnovative HTML to PDF

I am searching for a solution to convert HTML to PDF with external CSS support. I downloaded the trial version of the Winnovative Toolkit Total v11.14, and tried out the demo application for the method public byte[] GetPdfBytesFromHtmlString (string htmlString, string urlBase). The PDF files are generated, but the CSS is not applied.
Note: I tried the same input HTML string and base URL in the demo site. It's working fine, so I don't know why it's not working in my system. The demo application is shared in v11.14 ZIP files.
Input provided for this method:
htmlString = HTML source of the url 'http://www.winnovative-software.com/'
urlBase = "http://www.winnovative-software.com/"
Are you using a proxy to access the Internet? In that case you should set the HtmlToPdfConverter.ProxyOptions object properties in your code.

Convert HTML file to PDF file using ITextSharp

I'd like to accomplish the following:
Given the path name of an html file, and the desired pathname of a pdf file, convert the HTML file to PDF using ITextSharp. I've seen plenty of code samples which do close to this but not exactly what I need. I believe my solution will need to use the iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList() function but I'm having trouble getting this to work with an actual HTML file and outputting an actual PDF file.
public void GeneratePDF(string htmlFileName, string outputPDFFileName)
{...}
is the function I'd really like to get working properly.
Thanks in advance
Edit: Here's an example of what I've tried:
iTextSharp.text.Document doc = new Document();
PdfWriter.GetInstance(doc, new FileStream(Path.GetFullPath("fromHTML.pdf"), FileMode.Create));
doc.Open();
try
{
List<IElement> list = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(new StringReader(File.ReadAllText(this.textBox1.Text)), null);
foreach (IElement elm in list)
{
doc.Add(elm);
}
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
doc.Close();
Note that textBox1.Text contains the full path name of the html file I'm trying to convert to pdf and I want this to get output to "fromHTML.pdf"
Thanks!
I had the same requirement and was directed to this page by Google, but could not find a concrete answer.
After some trial and error, I was able to successfully convert HTML to PDF using the iTextSharp library, version 5.1.1.
The code that I have shared here also handles img tags in HTML with relative paths. The iTextSharp library throws an error if your img tags do not have an absolute src.
You can find the code here:
http://am22tech.com/s/22/Blogs/post/2011/09/28/HTML-To-PDF-using-iTextSharp.aspx
Let me know if you need more information. The code is in C#.

how to convert pdf file to text file using c#.net

Currently I am using the following code, together with some DLL files from PDFBox:
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText (doc);
richTextBox1.Text = text;
Using this code I am able to get the text, but not in the correct format. Please give me some ideas.
Extracting the text from a pdf file is anything but trivial.
To quote from the iTextSharp tutorial:
"The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler (http://poppler.freedesktop.org/), which is used by the PDF viewers of GNOME and KDE. It ships with a command-line utility called pdftotext, but I have no experience with it. It may be your best free option.
There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
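Since the question is about C#, Poppler's pdftotext would typically be driven as an external process. The sketch below assumes the pdftotext executable is installed and on the PATH. PdfToTextRunner, BuildStartInfo and Run are hypothetical names. The -layout option asks pdftotext to preserve the physical layout of the page as far as it can.

```csharp
using System.Diagnostics;

// Hypothetical wrapper around the pdftotext command-line tool from Poppler.
public static class PdfToTextRunner
{
    // Builds the command line: pdftotext -layout "<pdf>" "<txt>"
    public static ProcessStartInfo BuildStartInfo(string pdfPath, string txtPath)
    {
        return new ProcessStartInfo
        {
            FileName = "pdftotext", // assumption: resolvable via PATH
            Arguments = $"-layout \"{pdfPath}\" \"{txtPath}\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs the tool, waits for it to finish and returns its exit code.
    public static int Run(string pdfPath, string txtPath)
    {
        using (Process process = Process.Start(BuildStartInfo(pdfPath, txtPath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call such as PdfToTextRunner.Run(@"c:\aa.pdf", @"c:\aa.txt") writes the extracted text to the second path. An exit code of 0 indicates success.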
