This is my iText 5 (iTextSharp) code, which does what I previously needed with HTML snippets:
//GetFieldPositions returns an array of field positions if you are using 5.0 or greater
rectangle = pdfStamper.AcroFields.GetFieldPositions(field.Key)[0].position;
//tell itextSharp to overlay this content
PdfContentByte contentByte = pdfStamper.GetOverContent(1);
var elements = XMLWorkerHelper.ParseToElementList(pdfPlaceHolderData[key].ToString(), null);
ColumnText ct = new ColumnText(contentByte);
ct.SetSimpleColumn(rectangle.Left, rectangle.Bottom, rectangle.Right, rectangle.Top);
ct.Add(elements);
ct.Go(false);
pdfFormFields.SetField(field.Key, string.Empty);
I am struggling to see how to convert this to work in iText7 .NET.
XMLWorkerHelper.ParseToElementList returns 'ElementList', which inherits 'List<IElement>'. 'IElement' is structured as follows;
The iText7 Html2Pdf call to HtmlConverter.ConvertToElements(html) returns an 'IList<IElement>'. However, 'IElement' is now structured as follows;
I was hoping that I could have just used this result, but obviously my call to 'ct.Add(elements);' in the above code chokes because of the different IElement structure.
I know I am trying to cut corners here (I have no choice at the moment); is there a relatively easy way to convert the iText7 IElement to the iTextSharp IElement that will retain my nicely parsed HTML with images?
How can I replace the content of an AcroForm field with parsed HTML instead? This would preferably be in iTextSharp 5, but I suppose it would be even better using the latest version?
I currently have a solution working happily with iTextSharp 5 that allows PDF templates to be populated dynamically. I have hit a problem using XMLWorkerHelper.ParseToElementList, as it does not seem to support parsing inline images.
I have found that iText7 for .NET has an extension called html2pdf with a method called HtmlConverter.ConvertToElements that parses HTML with inline images perfectly. However, the result is not compatible with my iTextSharp 5 implementation, and I am struggling to convert it.
I want to have a function that returns one object, with this object containing two paragraphs with different alignments. This is easy to do manually by making them separate paragraphs and adding them to the PDF one at a time, but I would like my function to return them as a single object to be added to a PDF. Is this possible? As an example of what I want:
someTextHere
someMoreTextHere
But as one object which I can then add to a pdf.
I have created a small standalone iText 7 example that creates the following output:
The PDF file shown in the screenshot was created like this:
public void createPdf(String dest) throws IOException {
    PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
    Document document = new Document(pdf);
    Div div = new Div()
        .add(new Paragraph("Left").setTextAlignment(TextAlignment.LEFT))
        .add(new Paragraph("Right").setTextAlignment(TextAlignment.RIGHT))
        .setBackgroundColor(ColorConstants.GRAY)
        .setWidth(200);
    document.add(div);
    document.close();
}
As you can see, I created a Div element (similar to a <div> tag in HTML) to which I added two Paragraph objects with a different text alignment. That seems to be exactly what you need.
I am not a C# developer, hence I provide the code in Java. However, if you're proficient in C#, you shouldn't have any problem porting it from Java to C# (it's just a matter of changing lowercases into uppercases, such as changing add() into Add()).
Note that this is iText 7 code; if you're still using iText 5, you should consider upgrading to the latest iText version, since iText 5 went into maintenance mode a while ago. Maintenance mode means that development on that version has stopped; it's no longer supported for users who aren't customers.
The iTextSharp library (version 5.5.5) does not extract text from my file.
I can copy and paste text from the PDF into Notepad.
I uploaded the file to this link.
The source code is very simple and it works for other PDF files, but for this problematic file all I get is some characters without any meaning.
var text = string.Empty;
// File.OpenRead is a static method, so it is called without 'new'
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}
Any help is highly appreciated.
The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.
Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.
Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
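To illustrate why Identity-H alone says nothing about Unicode, here is a minimal Java sketch of the mapping described above. The class IdentityH and its method are hypothetical names made up for this illustration and are not part of iText; the sketch only assumes that the string bytes are read as big-endian 2-byte codes.

```java
import java.util.Arrays;

// Illustrative only: IdentityH is a hypothetical helper, not an iText class.
final class IdentityH {
    // Reads the string as big-endian 2-byte codes and uses each code
    // unchanged as the CID. No Unicode value is involved at any point.
    static int[] toCids(byte[] pdfString) {
        if (pdfString.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] cids = new int[pdfString.length / 2];
        for (int i = 0; i < cids.length; i++) {
            int high = pdfString[2 * i] & 0xFF;
            int low = pdfString[2 * i + 1] & 0xFF;
            cids[i] = (high << 8) | low;
        }
        return cids;
    }

    public static void main(String[] args) {
        byte[] sample = {0x03, (byte) 0xE9, 0x00, 0x41};
        System.out.println(Arrays.toString(toCids(sample))); // prints [1001, 65]
    }
}
```

What character CID 1001 stands for depends entirely on the font's ROS values, which is why an additional CID-to-Unicode resource is needed.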
To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite huge, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.
I only tested this using the Java version of iText as I am more proficient with it.
iText 5.x/Java
The Maven coordinates for the 5.x version of this jar artifact:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itext-asian</artifactId>
<version>5.2.0</version>
</dependency>
(As nothing has changed in these resources in the course of the recent years, there have been no 5.x releases since 5.2.0.)
After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.
iTextSharp 5.x/.Net
There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)
Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files... I don't know, though, which of them work with the current iTextSharp 5.5.12.
As the OP found out, one additionally has to register the DLLs for iTextSharp (in contrast to iText / Java):
Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}
I have an addition to the answer given by @mkl. Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}
I have a custom table with name, first name, place of birth and place of residence in a PDF file, which I want to parse in C#. One of the simplest ways of doing it would be:
using (PdfLoadedDocument document = new PdfLoadedDocument("foobar"))
{
for (var i = 0; i < document.Pages.Count; i++)
{
Console.WriteLine($"============ PAGE NO. {i+1} ============");
Console.WriteLine(document.Pages[i].ExtractText());
}
}
But the problem is the output:
============ PAGE NO. 38 ============
John L.SmithSan Francisco5400 Baden
There's no way I can separate this with a regex, so I need a way to parse each column of each row in order to get all the customer values separately. How can I parse a table in a PDF file with Syncfusion?
You will need a method that returns the coordinates of each character found in the PDF. Then you have some math to do (basically computing the distance between characters) to work out whether a character is part of a word and where the word itself is located along the x-axis. It requires quite a lot of work, and I didn't find such a method in the Syncfusion documentation.
I wrote a class which does what you want, but it is for Java projects:
PDFLayoutTextStripper (upon PDFBox)
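The distance computation described in this answer can be sketched as follows. This is not the PDFLayoutTextStripper code and not a Syncfusion API; the Glyph type, the split method and the gap threshold are all assumptions made up for illustration, and real coordinates would have to come from whatever PDF library exposes per-character positions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
final class ColumnSplitter {
    // One character on a text line: the character, its left edge and its width.
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Splits one line of glyphs (already sorted by x) into cells wherever the
    // horizontal gap between two consecutive glyphs is larger than maxGap.
    static List<String> split(List<Glyph> line, float maxGap) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null && g.x - (previous.x + previous.width) > maxGap) {
                cells.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previous = g;
        }
        if (current.length() > 0) {
            cells.add(current.toString());
        }
        return cells;
    }
}
```

With a suitable maxGap, a line such as "John L.SmithSan Francisco5400 Baden" would come back as separate cells, provided the columns really are further apart than the characters and spaces within a cell. Choosing that threshold is the hard part in practice.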
The Syncfusion control extracts text from a PDF document based on the structure of the content in the document. So, with the current implementation of the Syncfusion control, we cannot recognize the rows and columns of a table in the PDF document.
Also, it is not possible to extract the text in the same order in which the PDF document is displayed using the Syncfusion control, since the content in the PDF document follows a fixed layout.
But we can populate the table of the PDF document in Excel using Tabula (an open-source library). I have modified Tabula Java (open source) to achieve layout-based text extraction from the PDF document, based on your requirement.
Please find the sample for this implementation at the link below:
http://www.syncfusion.com/downloads/support/directtrac/171585/ze/TextExtractionSample649531336
Please complete the following steps to run the sample:
1. Install the Java Runtime Environment (JRE) from the link below.
   http://www.oracle.com/technetwork/java/javase/downloads/
2. Restart your machine.
3. Execute the above sample.
Try this and check whether it meets your requirement.
I am using HiQPdf Free to generate PDFs from a URL. I noticed in their documentation that you can grab a specific element instead of the whole page. It would go something like this:
HtmlToPdf htmlToPdfConverter = new HtmlToPdf();
htmlToPdfConverter.ConvertedHtmlElementSelector = "#logo";
htmlToPdfConverter.ConvertUrlToFile("https://your-website.com/", "/path/to/pdf.pdf");
However, when I set htmlToPdfConverter.ConvertedHtmlElementSelector in my code, I get this error:
Cannot access internal property 'ConvertedHtmlElementSelector' here
Could this be because it's a paid only feature? That seems like the only obvious reason, however, I haven't been able to find any source on that.
Converting only a region of the HTML page to PDF is a feature of the full version and it is not available in the free version. There is an example for this feature with C# and VB.NET code samples at http://www.hiqpdf.com/demo/ConvertHtmlRegionToPdf.aspx
I realise this is a very specific question, but I have limited knowledge of both SVG and Telerik, so here goes.
I am trying to convert a RadHtmlChart to an image in C#. First, to get the SVG of the chart, I use a built-in Telerik function.
var chartRendering = $find("<%=BarChart.ClientID %>").getSVGString();
Once I have that string on the server, I attempt to convert it to a MemoryStream representing the image. To do this I am using https://github.com/vvvv/SVG
XmlDocument xml = new XmlDocument();
xml.LoadXml(svgText);
SvgDocument svg = SvgDocument.Open(xml);
// Convert SVG document containing image to Stream
MemoryStream imageStream = new MemoryStream();
svg.Draw().Save(imageStream, ImageFormat.Png);
The last line is where the code breaks and gives the error:
ColorBlend object that was set is not valid.
Position's last element must be equal to 1.0.
ColorBlend objects must be constructed with the same number of positions and color values.
Positions must be between 0.0 and 1.0, 1.0 indicating the last element in the array.
The strange thing is that this only happens on some charts and not others. I have noticed on one particular chart that if the chart has 10 or more x values then it is fine, but if there are fewer than 10 values then it breaks. I have also tried doing the same thing using temporary files to store the SVG and the image, with the same result.
I have run out of ideas, so does anybody have a suggestion? Perhaps another way to get from SVG to an image in C#? I've had a look at Inkscape, but I can't use it as it has to be installed as an .exe on the server.
EDIT:
I found a possible solution using JavaScript. It won't work for me, as it uses the canvas element, which is HTML5, and I need this to work in IE8, but I'm posting it in case anybody else stumbles across this.
Convert SVG to image (JPEG, PNG, etc.) in the browser.
SOLUTION:
I've decided to implement the JavaScript method shown above for all browsers that support HTML5 and the Inkscape method for IE8.
I just stumbled over the same problem. First I tried to update to the most recent code from the trunk of that project but that did not fix the issue.
The problem lies inside the file SVGProject/Painting/SvgGradientServer.cs, in the function protected ColorBlend GetColorBlend(ISvgRenderer renderer, float opacity, bool radial).
The function produces a ColorBlend object where the last element of the Positions array is sometimes not 1.0 but, for example, 1.00000012 or 0.99999994. This causes an exception in GetBrush in the same file (SVGProject/Painting/SvgGradientServer.cs), in the following line:
InterpolationColors = CalculateColorBlend(renderer, opacity, points[0],
    effectiveStart, points[1], effectiveEnd),
The problem is probably a rounding error with float values. I have not analysed it further, but just added a check that fixes the last element of the Positions array. This worked for me.
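The answer does not show the check itself. As a rough idea of what such a check could look like, here is a minimal sketch written in Java rather than C#; the class BlendPositions, the method snapLast and the tolerance parameter are hypothetical names for illustration, not code from the SVG library. It assumes the last position only needs to be snapped to exactly 1.0 when it is already within a small tolerance of it, so that genuinely wrong values are not hidden.

```java
// Hypothetical Java analogue of the workaround; not the SVG library's code.
final class BlendPositions {
    // Returns a copy of positions in which the last element is set to exactly
    // 1.0f if it is within tolerance of 1.0f. Other elements are left as-is.
    static float[] snapLast(float[] positions, float tolerance) {
        float[] fixed = positions.clone();
        if (fixed.length > 0) {
            int last = fixed.length - 1;
            if (Math.abs(fixed[last] - 1.0f) <= tolerance) {
                fixed[last] = 1.0f;
            }
        }
        return fixed;
    }
}
```

For example, snapLast(new float[] {0f, 0.5f, 1.00000012f}, 1e-5f) yields a last element of exactly 1.0f, while an array ending in 0.8f is returned unchanged. In the C# project, the equivalent check would run on the Positions array before the ColorBlend is handed to the brush.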