Get text from PDF with broken encoding using iText7

I'm trying to extract text from PDF using the following method:
public static string GetRectangleText(string pdfPath, int pageId, float[] rectangleDimensions)
{
    using (PdfDocument pdfDoc = new PdfDocument(new PdfReader(pdfPath)))
    {
        var page = pdfDoc.GetPage(pageId);
        iText.Kernel.Geom.Rectangle rect = new iText.Kernel.Geom.Rectangle(rectangleDimensions[0], rectangleDimensions[1], rectangleDimensions[2], rectangleDimensions[3]);
        var filter = new IEventFilter[1];
        filter[0] = new TextRegionEventFilter(rect);
        var filteredTextEventListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), filter);
        var result = PdfTextExtractor.GetTextFromPage(page, filteredTextEventListener);
        return result;
    }
}
It works fine for most documents, but several PDFs that seem to have broken encoding return strings like ǪȃǷǻȁǭǵǶǬǳȇǹǺǸǶǰǺǭǳȄǹǺǪǨ ,668(')25&216758&7,21 where the text should in fact be ВЫПУЩЕНО ДЛЯ СТРОИТЕЛЬСТВА / ISSUED FOR CONSTRUCTION.
Would some specific kind of LocationTextExtractionStrategy help?

Related

iText7 for .NET barcode

I'd like to create a PDF with a barcode using the iText7 library.
There are a lot of examples using older versions of iText, or Java, but I can't find a solution for iText7.
(In general, the new library has no implementation of the createImageWithBarcode method.)
My solution could look like this:
string outputPdfFile = @"c:\DEV\pdfFromScratchWithBarCode.pdf";
using (iText.Kernel.Pdf.PdfWriter writer = new iText.Kernel.Pdf.PdfWriter(outputPdfFile))
{
    using (iText.Kernel.Pdf.PdfDocument pdf = new iText.Kernel.Pdf.PdfDocument(writer))
    {
        iText.Layout.Document doc = new iText.Layout.Document(pdf);
        doc.Add(new iText.Layout.Element.Paragraph("Title"));
        iText.Barcodes.BarcodeInter25 bar = new iText.Barcodes.BarcodeInter25(pdf);
        bar.SetCode("00600123456");
        //HOW TO ADD barcode TO PDF ??
        // ...
    }
}
There is a similar answer, but for an older version:
iText for .NET barcode

Thanks for any advice.
I found the solution (create the PDF, add a barcode {type: Code 25, Interleaved 2 of 5} and set a valid position):
using (iText.Kernel.Pdf.PdfWriter writer = new iText.Kernel.Pdf.PdfWriter(outputPdfFile))
{
    using (iText.Kernel.Pdf.PdfDocument pdf = new iText.Kernel.Pdf.PdfDocument(writer))
    {
        iText.Layout.Document doc = new iText.Layout.Document(pdf);
        doc.Add(new iText.Layout.Element.Paragraph("Title"));
        //barcode
        iText.Barcodes.BarcodeInter25 bar = new iText.Barcodes.BarcodeInter25(pdf);
        bar.SetCode("0600123456");
        iText.Kernel.Pdf.Canvas.PdfCanvas canvas = new iText.Kernel.Pdf.Canvas.PdfCanvas(pdf.GetFirstPage());
        //bar.PlaceBarcode(canvas, iText.Kernel.Colors.ColorConstants.BLUE, iText.Kernel.Colors.ColorConstants.GREEN);
        iText.Kernel.Pdf.Xobject.PdfFormXObject barcodeFormXObject = bar.CreateFormXObject(iText.Kernel.Colors.ColorConstants.BLACK, iText.Kernel.Colors.ColorConstants.BLACK, pdf);
        float scale = 1;
        float x = 450;
        float y = 700;
        canvas.AddXObject(barcodeFormXObject, scale, 0, 0, scale, x, y);
    }
}

You can create an image from a PdfFormXObject like this:
var barcodeImg = new Image(bar.CreateFormXObject(pdf));
Here is your code with the changes that do the trick:
// The unqualified type names below need these namespaces:
// using iText.Barcodes; using iText.Layout; using iText.Layout.Element;
string outputPdfFile = @"c:\DEV\pdfFromScratchWithBarCode.pdf";
using (var writer = new iText.Kernel.Pdf.PdfWriter(outputPdfFile))
{
    using (var pdf = new iText.Kernel.Pdf.PdfDocument(writer))
    {
        var doc = new Document(pdf);
        doc.Add(new Paragraph("Title"));
        var bar = new BarcodeInter25(pdf);
        bar.SetCode("000600123456");
        //Here's how to add the barcode to the PDF with iText7
        var barcodeImg = new Image(bar.CreateFormXObject(pdf));
        doc.Add(barcodeImg);
    }
}

Pechkin converting webpage to pdf c# gives empty pdf.

I am exploring Pechkin to convert webpage to PDF. I have used article: http://ourcodeworld.com/articles/read/366/how-to-generate-a-pdf-from-html-using-wkhtmltopdf-with-c-in-winforms
Ref: How to use wkhtmltopdf.exe in ASP.net
When I convert from an HTML string, it works:
byte[] pdfContent = new SimplePechkin(new GlobalConfig()).Convert("<html><body><h1>Hello world!</h1></body></html>");
However, when I follow the "Generate PDF from a Website" section, I get an empty PDF.
configuration.SetCreateExternalLinks(false)
    .SetFallbackEncoding(Encoding.ASCII)
    .SetLoadImages(true)
    .SetPageUri("http://ourcodeworld.com");
Has anyone encountered the same issue? I appreciate any help or suggestions.

Try using TuesPechkin:
https://github.com/tuespetre/TuesPechkin
// Needs: using System.Drawing.Printing; (PaperKind) and using TuesPechkin;
var document = new HtmlToPdfDocument
{
    GlobalSettings =
    {
        ProduceOutline = true,
        DocumentTitle = "Pretty Websites",
        PaperSize = PaperKind.A4, // Implicit conversion to PechkinPaperSize
        Margins =
        {
            All = 1.375,
            Unit = Unit.Centimeters
        }
    },
    Objects = {
        new ObjectSettings { HtmlText = "<h1>Pretty Websites</h1><p>This might take a bit to convert!</p>" },
        new ObjectSettings { PageUrl = "www.google.com" },
        new ObjectSettings { PageUrl = "www.microsoft.com" },
        new ObjectSettings { PageUrl = "www.github.com" }
    }
};
var tempFolderDeployment = new TempFolderDeployment();
var win32EmbeddedDeployment = new Win32EmbeddedDeployment(tempFolderDeployment);
var remotingToolset = new RemotingToolset<PdfToolset>(win32EmbeddedDeployment);
var converter = new ThreadSafeConverter(remotingToolset);
byte[] pdfBuf = converter.Convert(document);
// Very important: otherwise CPU usage will keep growing!
remotingToolset.Unload();
Edit
If anyone else uses this, please read my post here; it is very important:
https://stackoverflow.com/a/62428122/4836581
If you get errors, this question helped me:
TuesPechkin unable to load DLL 'wkhtmltox.dll'
Found it thanks to:
https://stackoverflow.com/a/26993484/4836581

HTMLWorker itextSharp image src

I am trying to use HTMLWorker as follows:
public static string toWorks(string s)
{
string fontpath = System.Web.HttpContext.Current.Server.MapPath("~/Content/");
BaseFont bf = BaseFont.CreateFont(fontpath + "ARIALUNI.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
var f = new Font(bf, 10, Font.NORMAL);
// var p = new Paragraph { Alignment = Element.ALIGN_LEFT, Font = f };
var styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.SPAN, HtmlTags.FONTSIZE, "10");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
using (var sr = new StringReader(s))
{
List<IElement> list = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, styles);
// var elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, styles);
foreach (var e in list)
{
list.Add(e);
}
return list.ToString();
}
return null;
}
It converts:
src="/Content/UserFiles/635380078478327671/Images/test.png"
to:
C:\Content\UserFiles\635380078478327671\Images\test.png
Any suggestions?

Please compare the following two examples:
HtmlMovies1
HtmlMovies2
If you use the first example to render an HTML file with images, you probably won't succeed. The second example introduces an ImageProvider implementation.
In the getImage() method of the ImageProvider interface, you get information about the path to an image. It is up to you to interpret this path. For instance: if the path is /Content/UserFiles/635380078478327671/Images/test.png, you can create an Image object by loading the bytes from that path, possibly after applying some minor changes to the path.
If you don't create an ImageProvider class, iText will make a single guess to find the path. In your case, that guess is wrong.
You can find the C# equivalent of the examples here: http://tinyurl.com/itextsharpIIA2C09
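As a rough illustration of the ImageProvider idea in C#, here is a minimal sketch for iTextSharp 5.x. The class names RootedImageProvider and HtmlParsing and the rootFolder parameter are hypothetical, and the exact parameter types of GetImage and of the providers dictionary have differed between iTextSharp releases, so check the IImageProvider definition in your version before relying on it.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.html.simpleparser;

// Hypothetical provider: maps site-relative src values onto a physical root folder.
public class RootedImageProvider : IImageProvider
{
    // Assumed to be the physical site root, e.g. the result of Server.MapPath("~/").
    private readonly string rootFolder;

    public RootedImageProvider(string rootFolder)
    {
        this.rootFolder = rootFolder;
    }

    // Signature as assumed for iTextSharp 5.x; older releases used other collection types.
    public Image GetImage(string src, IDictionary<string, string> attrs,
                          ChainedProperties chain, IDocListener doc)
    {
        // "/Content/UserFiles/.../test.png" becomes "<root>\Content\UserFiles\...\test.png".
        string relative = src.TrimStart('/').Replace('/', Path.DirectorySeparatorChar);
        string physical = Path.Combine(rootFolder, relative);
        return Image.GetInstance(File.ReadAllBytes(physical));
    }
}

public static class HtmlParsing
{
    public static List<IElement> Parse(string html, StyleSheet styles, string rootFolder)
    {
        // HTMLWorker looks the provider up under the IMG_PROVIDER key.
        var providers = new Dictionary<string, object>
        {
            { HTMLWorker.IMG_PROVIDER, new RootedImageProvider(rootFolder) }
        };
        using (var reader = new StringReader(html))
        {
            return HTMLWorker.ParseToList(reader, styles, providers);
        }
    }
}
```

With a provider like this in place, the src value is interpreted by your own code rather than by iText's single guess, so the image is no longer looked up under C:\Content.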

Open XML SDK: How to get a valid Word document for WordprocessingDocument.Open

For unit testing purposes, I would like to generate some sample data to be stored as a stream in the dataToImport variable in the following statement:
WordprocessingDocument.Open(dataToImport, false);
Does anyone know how to create a decent set of sample data?

You could potentially use something like the following:
// Open the document as editable (true) so the appended paragraph can be saved.
using (WordprocessingDocument wpd = WordprocessingDocument.Open(filename, true))
{
    wpd.MainDocumentPart.Document.Body.Append(GenerateParagraph(...text...));
}
private Paragraph GenerateParagraph(string input)
{
    Paragraph paragraph1 = new Paragraph();
    Run run1 = new Run();
    // Start the text on a new page.
    Break break1 = new Break() { Type = BreakValues.Page };
    Text txt = new Text() { Space = SpaceProcessingModeValues.Preserve };
    txt.Text = input;
    run1.Append(break1);
    run1.Append(txt);
    paragraph1.Append(run1);
    return paragraph1;
}
The value of ...text... itself could come from any file, for example read through a FileStream.
Hope it helps!
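If the goal is a stream for unit tests without any file on disk, one possible approach is to build a tiny document in memory. The sketch below assumes the Open XML SDK (DocumentFormat.OpenXml package); the class and method names SampleDocuments and CreateSampleDocument are made up for illustration.

```csharp
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SampleDocuments
{
    // Builds a one-paragraph .docx entirely in memory and returns it as a readable stream.
    public static MemoryStream CreateSampleDocument(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var doc = WordprocessingDocument.Create(buffer, WordprocessingDocumentType.Document))
            {
                MainDocumentPart mainPart = doc.AddMainDocumentPart();
                mainPart.Document = new Document(
                    new Body(new Paragraph(new Run(new Text(text)))));
                mainPart.Document.Save();
            }
            // The package is written out when it is disposed; copy the bytes into a
            // fresh stream positioned at 0 so the caller can read it from the start.
            return new MemoryStream(buffer.ToArray());
        }
    }
}

public static class SampleUsage
{
    public static void Run()
    {
        // dataToImport plays the role of the variable from the question.
        using (MemoryStream dataToImport = SampleDocuments.CreateSampleDocument("Hello, test!"))
        using (WordprocessingDocument wpd = WordprocessingDocument.Open(dataToImport, false))
        {
            string bodyText = wpd.MainDocumentPart.Document.Body.InnerText;
        }
    }
}
```

Returning a copy of the bytes keeps the test stream independent of the package that wrote it, so the code under test can open it read-only.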

ITextSharp: How to get an image embedded resource

I'm parsing HTML that contains some images.
These images are stored as embedded resources, not in the file system.
As far as I know, I need to set a custom image provider in HtmlPipelineContext, and this provider needs to return either the image path or the iTextSharp image.
The question is: does anybody know which method of AbstractImageProvider I need to implement, and how?
This is my code:
var list = new List<string> { text };
byte[] renderedBuffer;
using (var outputMemoryStream = new MemoryStream())
{
    using (var pdfDocument = new Document(PageSize.A4, 30, 30, 30, 30))
    {
        var pdfWriter = PdfWriter.GetInstance(pdfDocument, outputMemoryStream);
        pdfWriter.CloseStream = false;
        pdfDocument.Open();
        HtmlPipelineContext htmlContext = new HtmlPipelineContext(new CssAppliersImpl());
        htmlContext.SetImageProvider(new MyImageProvider());
        htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
        ICSSResolver cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
        CssResolverPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(pdfDocument, pdfWriter)));
        XMLWorker worker = new XMLWorker(pipeline, true);
        XMLParser p = new XMLParser(worker);
        foreach (var htmlText in list)
        {
            using (var htmlViewReader = new StringReader(htmlText))
            {
                p.Parse(htmlViewReader);
            }
        }
    }
    renderedBuffer = new byte[outputMemoryStream.Position];
    outputMemoryStream.Position = 0;
    outputMemoryStream.Read(renderedBuffer, 0, renderedBuffer.Length);
}
Thanks in advance.

Using a custom image provider for this doesn't seem to be supported. The only thing it really supports is changing root paths.
However, here's one solution to the problem:
Create a new HTML tag, <resimg src="{resource name}"/>, and write a custom tag processor for it.
Here's the implementation:
/// <summary>
/// Our custom HTML Tag to add an IElement.
/// </summary>
public class ResourceImageHtmlTagProcessor : AbstractTagProcessor
{
    public override IList<IElement> End(IWorkerContext ctx, Tag tag, IList<IElement> currentContent)
    {
        var src = tag.Attributes["src"];
        var bitmap = (Bitmap)Resources.ResourceManager.GetObject(src);
        if (bitmap == null)
            throw new RuntimeWorkerException("No resource with the name: " + src);
        var converter = new ImageConverter();
        var image = Image.GetInstance((byte[])converter.ConvertTo(bitmap, typeof(byte[])));
        HtmlPipelineContext htmlPipelineContext = this.GetHtmlPipelineContext(ctx);
        return new List<IElement>(1)
        {
            this.GetCssAppliers().Apply(
                new Chunk((Image)this.GetCssAppliers().Apply(image, tag, htmlPipelineContext), 0f, 0f, true),
                tag,
                htmlPipelineContext)
        };
    }
}
To configure your new processor, replace the line where you set the TagFactory with the following:
var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
tagProcessorFactory.AddProcessor(new ResourceImageHtmlTagProcessor(), new[] { "resimg" });
htmlContext.SetTagFactory(tagProcessorFactory);
