I have an original PDF file created by Apache FOP 1.0, with only the basic metadata that FOP adds (producer, creation date). After I edited the file, additional metadata appeared that I don't want (modification date and others). Is it possible to create a new file from the edited one so that no traces of the editing remain?
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c016 91.163616, 2018/10/29-16:58:49 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<pdf:PDFVersion>1.4</pdf:PDFVersion>
<pdf:Producer>Apache FOP Version 1.0</pdf:Producer>
<xmp:CreateDate>2019-08-20T11:09:15+02:00</xmp:CreateDate>
<xmp:CreatorTool>Apache FOP Version 1.0</xmp:CreatorTool>
<xmp:MetadataDate>2019-08-20T11:09:15+02:00</xmp:MetadataDate>
<dc:date>
<rdf:Seq>
<rdf:li>2019-08-20T11:09:15+02:00</rdf:li>
</rdf:Seq>
</dc:date>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
Besides the metadata, the original version can be restored by removing the appended lines in Notepad++, which I also want to prevent. I tried to replace the streams but failed: the edits and the new metadata are appended to the file and do not overwrite the old content (an incremental update?). I would like the edited file to look like one produced directly by Apache FOP 1.0. I tried to create XSL-FO with Word2FO.xsl, but I could not produce a file with the same appearance as the original.
I can use a Windows desktop application, a console tool, a website, or a C# library. Everything I've found so far either works badly or leaves traces of its own (a watermark, a changed producer, an additional incremental update, etc.).
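A quick way to check whether a tool really produced a single-revision file is to count the %%EOF markers: each incremental update appends another cross-reference section and trailer ending in %%EOF, so a file written in one pass normally contains exactly one. The sketch below is a hypothetical helper (the class and method names are made up for illustration, not taken from any PDF library). Treat the result as a hint only: linearized PDFs also carry two markers, and the byte sequence could in principle appear inside an uncompressed stream.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of any PDF library.
public static class PdfEofCounter
{
    // Counts non-overlapping occurrences of the ASCII marker "%%EOF" in raw PDF bytes.
    public static int CountEofMarkers(byte[] data)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%%EOF");
        int count = 0;
        for (int i = 0; i + marker.Length <= data.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < marker.Length; j++)
            {
                if (data[i + j] != marker[j]) { match = false; break; }
            }
            if (match)
            {
                count++;
                i += marker.Length - 1; // continue after this marker
            }
        }
        return count;
    }

    // Convenience overload that reads the whole file into memory first.
    public static int CountEofMarkers(string path)
    {
        return CountEofMarkers(File.ReadAllBytes(path));
    }
}
```

Calling PdfEofCounter.CountEofMarkers on the original and on the processed file gives a rough before/after comparison; a count above one in the output suggests the tool appended a revision instead of rewriting the file.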
Our DynamicPDF Merger product allows you to merge the PDF without changing its contents. Also there are options to suppress the XMP metadata or set the producer as needed.
Below is the C# code to merge a PDF, remove the XMP metadata (using the MergeOptions class), and set the producer.
MergeOptions options = new MergeOptions();
options.XmpMetadata = false;
MergeDocument document = new MergeDocument(@"Source PDF file path", options);
document.Producer = "My producer";
document.Draw(@"path to save the output PDF");
You can refer to the documentation on the MergeOptions class here: http://docs.dynamicpdf.com/NET_Help_Library_19_08/DynamicPDF~ceTe.DynamicPDF.Merger.MergeOptions_members.html
A fully functional evaluation edition of the DynamicPDF Merger product is available on Nuget (Package ID: ceTe.DynamicPDF.CoreSuite.NET):
https://www.nuget.org/packages/ceTe.DynamicPDF.CoreSuite.NET/
More information and the option to download it from our site can be found here:
https://www.dynamicpdf.com/Merge-PDF-.NET.aspx
I have a requirement to insert a unique ID into image files without modifying the image content, i.e. it’s just the metadata that I want to modify. I’m starting with JPEG files because there is an appropriate EXIF property available: ImageUniqueID.
I’m using C# with .NET Core 3.1 for this exercise with ImageSharp.
I can change the EXIF data with ImageSharp using the following code (shown simplified, without checks for an existing record, etc.):
using (var image = Image.Load(filename))
{
    var imageUniqueIdentifier = Guid.NewGuid().ToString().ToLower().Replace("-", "");
    image.Metadata.ExifProfile.SetValue(ExifTag.ImageUniqueID, imageUniqueIdentifier);
    var jpegEncoder = new JpegEncoder() { Quality = 100 };
    image.Save(filename, jpegEncoder);
}
I did play with the Quality setting in the JpegEncoder, but was still getting either unacceptable quality degradation or file size increases.
Is there a way of just reading the metadata, altering it, and then writing it back without affecting the image at all?
I also looked at MetadataExtractor.NET, but it doesn’t have a write facility. I would happily look at other .NET Core methods or libraries.
After some research I've found ExifLibrary, which allows you to modify only the image metadata.
Documentation (examples included)
Example of how to add a unique image ID to a JPEG file:
var file = ImageFile.FromFile("path_to_jpg_file");
var imageUniqueIdentifier = Guid.NewGuid().ToString().ToLower().Replace("-", "");
file.Properties.Set(ExifLibrary.ExifTag.ImageUniqueID, imageUniqueIdentifier);
file.Save("path_to_jpg_file");
Nuget package: ExifLibNet.
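As a side note on the identifier itself, Guid.ToString("N") should yield the same 32 lowercase hexadecimal digits as the ToString().ToLower().Replace("-", "") chain, in a single call. A minimal sketch (the class name is made up for illustration):

```csharp
using System;

// Hypothetical helper for illustration only.
public static class UniqueImageId
{
    // Returns a new GUID as 32 lowercase hex digits without hyphens.
    public static string Create()
    {
        return Guid.NewGuid().ToString("N");
    }
}
```

The result of UniqueImageId.Create() can be passed wherever imageUniqueIdentifier is used above.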
Here is some code that just needs .NET with PresentationCore and WindowsBase. The underlying technology that WPF uses is WIC (Windows Imaging Component). WIC has full support for image metadata.
EXIF's ImageUniqueID is exposed as a Windows property named System.Image.ImageID.
Some other properties such as System.Photo.CameraModel can be seen directly in Windows Explorer detailed views if you add the corresponding column "Camera Model", but not System.Image.ImageID, AFAIK.
// needs PresentationCore & WindowsBase references
var frame = BitmapDecoder.Create(new Uri("test1.jpg", UriKind.RelativeOrAbsolute), BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.None).Frames[0];
// create encoder, add frame, we need to copy since we want to update metadata
var encoder = BitmapEncoder.Create(frame.Decoder.CodecInfo.ContainerFormat);
var copy = BitmapFrame.Create(frame);
// get frame metadata
var metaData = (BitmapMetadata)copy.Metadata;
// show existing System.Image.ImageID (if any)
Console.WriteLine("ImageUniqueID: " + metaData.GetQuery("System.Image.ImageID"));
// for some reason, we can't use "System.Image.ImageID" to set the meta data
// so use the "Metadata Query Language"
metaData.SetQuery("/app1/ifd/exif/{ushort=42016}", "My Super ID");
// write file back
encoder.Frames.Add(copy);
// File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes
using (var stream = File.Create("test1copy.jpg"))
{
    encoder.Save(stream);
}
I'm currently working on a project where I need to create a "dashboard" which can be exported as pdf. I wanted to use Rotativa but as our application uses .NET framework 4.0 it's not possible. So I found the NReco PdfGenerator.
This is the code I use to create the PDF result:
var ViewAsString = RenderViewAsString("~/Views/QMetrics/StandardDashboard.cshtml", viewModel);
var htmlToPdf = new NReco.PdfGenerator.HtmlToPdfConverter();
htmlToPdf.PageWidth = 1600;
htmlToPdf.PageHeight = 900;
var pdfBytes = htmlToPdf.GeneratePdf(ViewAsString);
FileResult FileResult = new FileContentResult(pdfBytes, "application/pdf");
FileResult.FileDownloadName = "Dashboard-" + viewModel.ProjectName + "-" +
DateTime.Now.ToString() + "-.pdf";
return FileResult;
It successfully creates the PDF page with all the content that comes from the backend (Project information, and so on) but the page looks very ugly. On the original page I have 2 columns and on the PDF page it puts everything in one column. I tried a few different page sizes and I also changed the layout to be non-responsive but nothing has changed.
My first guess was that the referenced CSS and JS files are not included when the PDF gets created, so I copied everything that comes from external files (Bootstrap, Chart.js) and pasted it directly into the .cshtml file. But nothing changed at all. My chart is not rendering/loading and the missing CSS is still not applied.
On the NReco PDFGenerator website they say that it supports complex CSS code and also javascript code so I don't really understand why this is not working.
Does anyone here have experience with NReco, or can someone recommend something else that works with .NET 4.0?
NReco PdfGenerator internally uses the wkhtmltopdf tool, so you can check that tool and its options.
Regarding the 2 columns: if you don't use a flex/grid layout, everything should work fine. You may need to disable wkhtmltopdf's smart shrinking logic (enabled by default) and define the web page 'window' size explicitly (with the "--viewport-size 1600" option).
Regarding CSS and charts: you need to check that the CSS files can be accessed by wkhtmltopdf. The simplest way to do that is to run wkhtmltopdf.exe from the command line and check the console log output (or handle PdfGenerator's "LogReceived" event in C#). For Chart.js, ensure that the chart container div has an explicit width (not in %) and that there are no JS errors (you can get them in the console by specifying the "--debug-javascript" option). If your JS code uses the 'bind' method, you have to include a polyfill, as the WebKit engine version used in wkhtmltopdf doesn't support 'bind'.
I have the image "image.png" included in my WPF C# project as an embedded resource. The full name of the resource is "myapplication.image.png".
I am using this image in a document generated via MigraDoc. The generated document contains all the content I planned, except that a gray square with the text "image not found" appears in place of "image.png".
To use "image.png" in my document via MigraDoc, I added the file to my project as an embedded resource and then followed this sample to include the image in the document.
My resulting code looks like the following:
byte[] imageStream = LoadImage("myapplication.image.png");
string imageFilename = MigraDocFilenameFromByteArray(imageStream);
Image image = para.AddImage(imageFilename);
Where "LoadImage" and "MigraDocFilenameFromByteArray" methods are coded as in the sample.
What am I missing?
Would someone provide a pointer, please?
If using NuGet, please note that you have to check 'Include prerelease' in order for MigraDoc v1.50.x to show up in the list of packages. Note that this is the 'Version', not the 'Runtime Version' number (right-click your MigraDoc reference and check properties). The most recent stable release is only v1.32.x.
As suggested by @User241.007, the issue was that I was using 1.32 and not 1.50 or later. Everything is working now that I have removed 1.32 and installed 1.50 via the package manager.
I want to convert PDF files to text files, but some PDF files do not work with the PDFBox DLL because they were created with an Acrobat version newer than Acrobat 5.x.
What should I do?
output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());
PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
output.Write(stripper.getText(doc));
Your first attempt should be to try a current version of PDFBox. Your version 0.7.3 dates back to 2006! PDFBox has meanwhile become an Apache project, located at http://pdfbox.apache.org/, and the current version (as of May 2013) is 1.8.1. I'm quite sure that PDFBox nowadays supports PDF object streams and cross-reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
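To see which files are likely affected, it can help to read the version declared in the "%PDF-x.y" header at the start of each file. The sketch below is a hypothetical helper (names made up for illustration; it is not part of PDFBox or iText). It is only a rough indicator: the header version can be overridden by a /Version entry in the document catalog, and a 1.5+ header means object streams or cross-reference streams may be present, not that they are.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of PDFBox or iText.
public static class PdfHeaderVersion
{
    // Returns the version from a "%PDF-x.y" header found in the first
    // 1024 bytes, or null if no such header is present.
    public static string Read(byte[] data)
    {
        int length = Math.Min(data.Length, 1024);
        string head = Encoding.ASCII.GetString(data, 0, length);
        int start = head.IndexOf("%PDF-", StringComparison.Ordinal);
        if (start < 0) return null;

        int pos = start + 5; // first character after "%PDF-"
        int end = pos;
        while (end < head.Length && (char.IsDigit(head[end]) || head[end] == '.'))
        {
            end++;
        }
        return end > pos ? head.Substring(pos, end - pos) : null;
    }

    // Reads only the beginning of the file and parses the header from it.
    public static string ReadFile(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[1024];
            int read = stream.Read(buffer, 0, buffer.Length);
            var head = new byte[read];
            Array.Copy(buffer, head, read);
            return Read(head);
        }
    }
}
```

Files for which PdfHeaderVersion.ReadFile returns "1.5" or higher are the ones most likely to trip up a parser from the PDF 1.4 era.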
If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x if the AGPL (or alternatively buying a license) is no problem for you.
Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition. The samples from that chapter can be found online: Java and .Net.
For a first test the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:
PdfReader reader = new PdfReader(PDF_FILE);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...
In C#, I would like to read file details from a specific file.
I've found an interesting thread: Read/Write 'Extended' file properties (C#)
It uses a call to the GetDetailsOf() method on the folder shell object included in shell32.dll.
It works fine, but I have an issue: the header string depends on the operating system language ('Name' for the filename property on an English Windows, 'Nom' on a French Windows).
So it's not easy to retrieve specific values by property name, since the name changes with the language.
Is there a way to handle this easily?
Some properties are available through the FileInfo object. For example, if you want the creation time of the file you can do:
FileInfo myFileInfo = new FileInfo(@"C:\path\to\file");
DateTime ftime = myFileInfo.CreationTime;
Is the FileInfo class not enough for your needs?
FileInfo info = new FileInfo("fileName");
var name = info.Name;
var creationTime = info.CreationTime;
// etc ...
If not, tell us more about which properties you'd like to read from your file.
Update to my answer:
I don't know of a library that can read the properties of any type of document.
But here are a few approaches for the formats you mentioned:
PDF :
Extracting Additional Metadata from a PDF using iTextSharp
Read/Modify PDF Metadata using iTextSharp
So, iText ® is a library that allows you to create and manipulate PDF documents (from their website)
Office: (the first link, from MS, states that it applies to Word as well as Excel documents)
How to: Read from and Write to Document Properties
Listing properties of a word document in C#