Get Text Before image in a PDF using PdfPig

I want to extract the text that appears just before each image in a PDF.
Example PDF:
[image: pdfexample]
As in the image above:
For "nice picture of a cat 1" I want to get 'What is image about :' and 'Gps Thing'.
For "nice picture of a cat 2" I want to get 'What is image about 2 :' and 'Gps Thing 2', excluding 'Lot of useless text' (for example, by finding the last word before image 2, here 'GPS').
The code I use for now:
public void pdf(string pdfselectionned)
{
    // Load the selected PDF with IronPdf and get all of its text
    var pdfdoc = IronPdf.PdfDocument.FromFile(pdfselectionned);
    string pdftxt = pdfdoc.ExtractAllText();
    // Extract all images with PdfPig into the folder `dir` (not shown) and name them 1.png, 2.png, ...
    // PdfDocument below is PdfPig's UglyToad.PdfPig.PdfDocument
    using (PdfDocument pdfDocument = PdfDocument.Open(pdfselectionned))
    {
        int imageCount = 1;
        foreach (Page page in pdfDocument.GetPages())
        {
            List<XObjectImage> images = page.GetImages().Cast<XObjectImage>().ToList();
            foreach (XObjectImage image in images)
            {
                // RawBytes is the image data as stored in the PDF, which is not necessarily PNG
                byte[] imageRawBytes = image.RawBytes.ToArray();
                File.WriteAllBytes($"{dir}\\{imageCount}.png", imageRawBytes);
                imageCount++;
            }
        }
    }
}
Thanks a lot if someone finds a way to do that :)
(Other topics discuss something similar to what I want, but none of them use PdfPig. If I can avoid adding another library, that would be great ^^')
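
One possible approach, sketched under assumptions because the example screenshot is not available: PdfPig reports positions for both words and images, so the text above an image can be picked geometrically. As far as I know, page.GetWords() yields words with Text and a BoundingBox, and page.GetImages() yields images with Bounds, all in PDF coordinates where y grows upward. The sketch below uses a hypothetical WordBox stand-in and a hypothetical LinesAbove helper (neither is part of PdfPig) so the selection logic can run without a PDF. It assumes the wanted text sits on the lines directly above the image.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one word on a page. With PdfPig the values would come
// from Word.Text, Word.BoundingBox.Left and Word.BoundingBox.Bottom.
public sealed class WordBox
{
    public WordBox(string text, double left, double bottom)
    {
        Text = text;
        Left = left;
        Bottom = bottom;
    }

    public string Text { get; }
    public double Left { get; }
    public double Bottom { get; }
}

public static class CaptionFinder
{
    // Returns up to maxLines lines of text located directly above an image whose
    // top edge is at imageTop (PDF coordinates, y grows upward), in reading order.
    public static List<string> LinesAbove(IEnumerable<WordBox> words, double imageTop, int maxLines)
    {
        return words
            .Where(w => w.Bottom >= imageTop)      // keep only words above the image
            .GroupBy(w => Math.Round(w.Bottom))    // crude assumption: same rounded baseline = same line
            .OrderBy(g => g.Key)                   // line nearest to the image first
            .Take(maxLines)
            .Reverse()                             // back to top-to-bottom order
            .Select(g => string.Join(" ", g.OrderBy(w => w.Left).Select(w => w.Text)))
            .ToList();
    }
}
```

With PdfPig you would build one WordBox per word of the page and pass image.Bounds.Top as imageTop. Calling LinesAbove with maxLines set to 2 would then keep the two caption lines and drop anything further up, such as 'Lot of useless text'. Rounding the baseline is a rough way to group words into lines and may need a tolerance for real documents.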

Related

Why are the images in this PDF file corrupt in some viewers?

We are using PDFsharp to gather sets of images from a folder and put them into PDF files, one image per page. For this certain set of images, the resulting PDF document appears corrupted when opening in certain viewers... Chrome is broken, Adobe Reader is broken, Edge gives up entirely, but Firefox actually renders it correctly.
Here is the relevant code:
public static List<string> ImageExtensions()
{
    return new List<string>() { ".tif", ".tiff", ".png", ".jpg", ".jpeg", ".gif" };
}
public void Convert()
{
    PdfDocument doc = new PdfDocument();
    foreach(string fPath in FilePaths)
    {
        string ext = Path.GetExtension(fPath);
        if (ImageExtensions().Contains(ext))
            AddImageToPDF(fPath, ref doc);
    }
    try
    {
        doc.Save(OutputFilePath);
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
    doc.Close();
    doc.Dispose();
}
private void AddImageToPDF(string imagePath, ref PdfDocument doc)
{
    Image MyImage = Image.FromFile(imagePath);
    AddImageToPDF(MyImage, ref doc);
}
private void AddImageToPDF(Image image, ref PdfDocument doc)
{
    try
    {
        int numPages = doc.Pages.Count;
        using (Image MyImage = image)
        {
            for (int _pageIndex = 0; _pageIndex < MyImage.GetFrameCount(FrameDimension.Page); _pageIndex++)
            {
                MyImage.SelectActiveFrame(FrameDimension.Page, _pageIndex);
                XImage img = XImage.FromGdiPlusImage(MyImage);
                img.Interpolate = true;
                var page = new PdfPage() { Orientation = img.PixelWidth > img.PixelHeight ? PageOrientation.Landscape : PageOrientation.Portrait };
                doc.Pages.Add(page);
                using (var xg = XGraphics.FromPdfPage(doc.Pages[_pageIndex + numPages]))
                {
                    xg.DrawImage(img, 0, 0);
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
File set A (broken in most viewers). File set B (working in all viewers). Both sets are JPEGs that were converted from PNGs using the exact same mechanism (Irfanview).
Why do all of the viewers except Firefox render A incorrectly, and what can I do to fix that?

For images with just a single frame you should use XImage.FromFile instead of XImage.FromGdiPlusImage to work around a bug somewhere in recent versions of the Windows framework.
XImage.FromFile allows PDFsharp to access the original file while XImage.FromGdiPlusImage must rely on the information provided by the framework - and for certain JPEG images this information is not correct.
Make sure you are using the latest version of PDFsharp. Good questions indicate which version they refer to.
You should get correct PDF files if you run your code under Windows XP (but that ain't an option for production use, of course).

I am an idiot... updating to the latest version of PDFSharp fixed the issue. I had assumed we were already using the latest. >=/

Converting contenteditable content to PDF

I would like to know the most efficient way to convert contenteditable content (something that the user types in) to a PDF. Here is an illustration of what I mean:
[images 1, 2 and 3]
I would also like to know how to carry over CSS styling, since jsPDF doesn't support this (to my knowledge).

jsPDF doesn't support most of the features you need. I suggest creating an application to do that.
My background is C#. So:
Program.cs
using HtmlToPdf.Models;
namespace HtmlToPdf.Console
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var model = new HtmlToPdfModel();
            model.HTML = "<h3>Hello world!</h3>";
            model.CSS = "h3{color:#f00;}";
            HtmlToPdf.Convert(model);
        }
    }
}
HtmlToPdfModel.cs
namespace HtmlToPdf.Models
{
    public class HtmlToPdfModel
    {
        public string HTML { get; set; }
        public string CSS { get; set; }
        public string OutputPath { get; set; }
        public string FontName { get; set; }
        public string FontPath { get; set; }
    }
}
HtmlToPdf.cs
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using HtmlToPdf.Models;
using System;
using System.IO;
using System.Text;
namespace HtmlToPdf.Console
{
    public class HtmlToPdf
    {
        public static void Convert(HtmlToPdfModel model)
        {
            try
            {
                if (model == null) return;
                Byte[] bytes;
                //Boilerplate iTextSharp setup here
                //Create a stream that we can write to, in this case a MemoryStream
                using (var stream = new MemoryStream())
                {
                    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
                    using (var doc = new Document())
                    {
                        //Create a writer that's bound to our PDF abstraction and our stream
                        using (var writer = PdfWriter.GetInstance(doc, stream))
                        {
                            //Open the document for writing
                            doc.Open();
                            //In order to read CSS as a string we need to switch to a different constructor
                            //that takes Streams instead of TextReaders.
                            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
                            using (var cssStream = new MemoryStream(Encoding.UTF8.GetBytes(model.CSS)))
                            {
                                using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(model.HTML)))
                                {
                                    var fontProvider = new XMLWorkerFontProvider();
                                    if (!string.IsNullOrEmpty(model.FontPath) && !string.IsNullOrEmpty(model.FontName))
                                    {
                                        fontProvider.Register(model.FontPath, model.FontName);
                                        //Parse the HTML with css font-family
                                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlStream, cssStream, Encoding.UTF8, fontProvider);
                                    }
                                    else
                                    {
                                        //Parse the HTML without css font-family
                                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlStream, cssStream);
                                    }
                                }
                            }
                            doc.Close();
                        }
                    }
                    //After all of the PDF "stuff" above is done and closed but **before** we
                    //close the MemoryStream, grab all of the active bytes from the stream
                    bytes = stream.ToArray();
                }
                //Now we just need to do something with those bytes.
                //Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
                //You could also write the bytes to a database in a varbinary() column (but please don't) or you
                //could pass them to another function for further PDF processing.
                // use this line on Windows version
                //File.WriteAllBytes(model.OutputPath, bytes);
                // use these lines on Mac version
                string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "data");
                path = Path.Combine(path, "test.pdf");
                File.WriteAllBytes(path, bytes);
            }
            catch (Exception)
            {
                // rethrow without resetting the stack trace
                throw;
            }
        }
    }
}
I tested this application on Windows. If you're using a Mac, you can replace the line:
File.WriteAllBytes(model.OutputPath, bytes);
in the file HtmlToPdf.cs with
string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "data");
path = Path.Combine(path, "test.pdf");
File.WriteAllBytes(path, bytes);
Both variants are marked with comments in the code.
About the font problem: if you want to use a specific font (e.g. Roboto), you must provide the font file and a path that your application can access.
NuGet packages: iTextSharp and itextsharp.xmlworker
You can convert this console application into a web application: every time you want to create a PDF file, just make an (AJAX) request to the server that calls the method HtmlToPdf.Convert.

How to convert a JDF file to a PDF (Removing text from a multi-encoded document)

I am trying to convert a JDF file to a PDF file using C#.
After looking at the JDF format... I can see that the file is simply XML placed at the top of a PDF document.
I've tried using StreamWriter / StreamReader in C#, but because the PDF part also contains binary data and variable newlines (\r\n and \n), the resulting file cannot be opened: some of the PDF's binary data is destroyed. Here is some of the code I've tried without success.
using (StreamReader reader = new StreamReader(_jdf.FullName, Encoding.Default))
{
    using (StreamWriter writer = new StreamWriter(_pdf.FullName, false, Encoding.Default))
    {
        writer.NewLine = "\n"; //Tried without this and with \r\n
        bool IsStartOfPDF = false;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            if (line.IndexOf("%PDF-") != -1)
            {
                IsStartOfPDF = true;
            }
            if (!IsStartOfPDF)
            {
                continue;
            }
            writer.WriteLine(line);
        }
    }
}

I am self-answering this question, as it may be a somewhat common problem and the solution could be informative to others.
As the document contains both binary data and text, we cannot simply use a StreamWriter to write the binary part to another file. Even if you only read a file with a StreamReader and write all of its contents to another file with a StreamWriter, you will notice differences between the two files.
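
To illustrate that point, here is a minimal, self-contained sketch (LineCopyDemo and CopyAsTextLines are names made up for this example). It copies a byte array the way the question's code does, line by line through a StreamReader and StreamWriter. ISO-8859-1 is used only because it maps every byte to exactly one char and back, so any difference in the output comes from line handling alone: ReadLine treats \r\n, \n and a lone \r as line endings and drops them, and WriteLine writes back a single NewLine.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineCopyDemo
{
    // Copies bytes through text-based, line-oriented I/O and returns the result.
    public static byte[] CopyAsTextLines(byte[] input)
    {
        // ISO-8859-1 round-trips every single byte, so the encoding itself changes nothing.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        using (var source = new MemoryStream(input))
        using (var target = new MemoryStream())
        {
            using (var reader = new StreamReader(source, latin1))
            using (var writer = new StreamWriter(target, latin1))
            {
                writer.NewLine = "\n";
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // The original line terminator is lost here and replaced by NewLine.
                    writer.WriteLine(line);
                }
            }
            // ToArray still works after the writer has closed the stream.
            return target.ToArray();
        }
    }
}
```

Feeding it bytes that contain a \r\n pair and a lone \r (both can occur inside binary PDF streams) returns a different byte sequence, which is enough to break a PDF.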
You can use a BinaryReader and a BinaryWriter to search a multi-part document and write each byte into another file exactly as you found it.
//Using a BinaryReader/BinaryWriter as the file is multi-type (text and binary)
using (var reader = new BinaryReader(File.Open(_file.FullName, FileMode.Open)))
{
    using (var writer = new BinaryWriter(File.Open(tempFileName.FullName, FileMode.CreateNew)))
    {
        //We are searching for the start of the PDF
        bool searchingForstartOfPDF = true;
        var startOfPDF = "%PDF-".ToCharArray();
        //While we haven't reached the end of the stream
        while (reader.BaseStream.Position != reader.BaseStream.Length)
        {
            //If we are still searching for the start of the PDF
            if (searchingForstartOfPDF)
            {
                //Read the current char
                var str = reader.ReadChar();
                //If it matches the start of the PDF signature
                if (str.Equals(startOfPDF[0]))
                {
                    //Check the next few characters to see if they match,
                    //keeping an eye on our current position in the stream in case something goes wrong
                    var currBasePos = reader.BaseStream.Position;
                    for (var i = 1; i < startOfPDF.Length; i++)
                    {
                        //If we found a char that isn't in the PDF signature, then resume the while loop
                        //to start searching again from the next position
                        if (!reader.ReadChar().Equals(startOfPDF[i]))
                        {
                            reader.BaseStream.Position = currBasePos;
                            break;
                        }
                        //If we've reached the end of the PDF signature then we've found a match
                        if (i == startOfPDF.Length - 1)
                        {
                            //Success
                            //Set the position to the start of the PDF signature
                            searchingForstartOfPDF = false;
                            reader.BaseStream.Position -= startOfPDF.Length;
                            //We are no longer searching for the PDF signature, so
                            //the remaining bytes in the file will be written directly
                            //using the binary writer
                        }
                    }
                }
            }
            else
            {
                //We are writing the binary now
                writer.Write(reader.ReadByte());
            }
        }
    }
}
This code example uses the BinaryReader to read each char one by one. When it finds the string %PDF- (the PDF start signature), it moves the reader position back to the % and then writes the rest of the document using writer.Write(reader.ReadByte()).
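
A possible variation on the same idea, offered as a sketch and not as a drop-in replacement: when the file is small enough to load into memory, the signature can be searched directly in the byte array, which avoids decoding bytes as characters altogether. JdfPdfExtractor, FindPdfStart and TryExtractPdf are names invented for this example, and it assumes that the first occurrence of %PDF- marks the start of the PDF (that is, the XML header does not contain that string).

```csharp
using System;
using System.IO;
using System.Text;

public static class JdfPdfExtractor
{
    private static readonly byte[] PdfSignature = Encoding.ASCII.GetBytes("%PDF-");

    // Returns the index of the first "%PDF-" signature in data, or -1 if there is none.
    public static int FindPdfStart(byte[] data)
    {
        for (int i = 0; i <= data.Length - PdfSignature.Length; i++)
        {
            int j = 0;
            while (j < PdfSignature.Length && data[i + j] == PdfSignature[j])
            {
                j++;
            }
            if (j == PdfSignature.Length)
            {
                return i;
            }
        }
        return -1;
    }

    // Writes everything from the signature to the end of the file into pdfPath.
    // Returns false (and writes nothing) when no signature is found.
    public static bool TryExtractPdf(string jdfPath, string pdfPath)
    {
        byte[] data = File.ReadAllBytes(jdfPath);
        int start = FindPdfStart(data);
        if (start < 0)
        {
            return false;
        }
        using (FileStream output = File.Create(pdfPath))
        {
            output.Write(data, start, data.Length - start);
        }
        return true;
    }
}
```

The trade-off is memory: the whole file is read at once, whereas the reader/writer loop above works on a stream.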

Extract embedded package files from word document using open xml?

I am trying to extract the embedded files (Word, Excel, package) from a Word document. I am not able to extract the package files and save them using C# and Open XML.
The code below extracts only the Word and Excel files, not the package.
using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
{
    foreach (EmbeddedPackagePart pkgPart in document.MainDocumentPart.GetPartsOfType<EmbeddedPackagePart>())
    {
        // embeddingPartString and ReadWriteStream are not shown here
        if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
        {
            string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
            // get the stream from the part
            System.IO.Stream partStream = pkgPart.GetStream();
            string filePath = "d:\\test\\" + fileName1;
            // write the stream to the file
            System.IO.FileStream writeStream = new System.IO.FileStream(filePath, FileMode.Create, FileAccess.Write);
            ReadWriteStream(partStream, writeStream);
        }
    }
}

The issue you're having is that when you go through MainDocumentPart.Parts, what you get are things like ImagePart, ChartPart etc., and a ChartPart might have its own embedded part, which could be the Excel or Word file you are looking for.
In short, you need to extend your search for embedded parts to the parts contained in the main document.
If I just wanted to extract all embedded parts from one of the files in my own project, I would go about it like this:
using (var document = WordprocessingDocument.Open(@"C:\Test\myTestDocument.docx", false))
{
    //just grab all the parts; it might be worth being a bit more clever about it, depending on the size of the files and how many files you want to search through
    foreach(var part in document.MainDocumentPart.Parts)
    {
        //for each part, see if that part contains an EmbeddedPackagePart
        var testForEmbedding = part.OpenXmlPart.GetPartsOfType<EmbeddedPackagePart>();
        foreach(EmbeddedPackagePart embedding in testForEmbedding)
        {
            //You should probably insert some clever naming scheme here..
            string fileName = embedding.Uri.OriginalString.Split('/').Last();
            //stream the EmbeddedPackagePart to a file
            using(FileStream myFile = File.Create(@"C:\test\" + fileName))
            using (var stream = embedding.GetStream())
            {
                stream.Seek(0, SeekOrigin.Begin);
                stream.CopyTo(myFile);
                myFile.Close();
            }
        }
    }
}
I hope this helps!
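
As a cross-check that does not go through the SDK at all (a different technique from the answer above): a .docx file is a ZIP package, and Word usually stores embedded files under word/embeddings/, although that location is a convention and not a guarantee. The sketch below lists those entries with System.IO.Compression so you can see what is really inside the document. EmbeddingLister and ListEmbeddings are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class EmbeddingLister
{
    // Returns the file names of all ZIP entries stored under word/embeddings/.
    // Assumption: the document follows Word's usual layout for embedded files.
    public static List<string> ListEmbeddings(Stream docx)
    {
        var names = new List<string>();
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                bool inEmbeddings = entry.FullName.StartsWith("word/embeddings/", StringComparison.OrdinalIgnoreCase);
                // Entries with an empty Name are directory markers, not files.
                if (inEmbeddings && entry.Name.Length > 0)
                {
                    names.Add(entry.Name);
                }
            }
        }
        return names;
    }
}
```

You could call it with File.OpenRead on the .docx path. If the list contains entries such as oleObject1.bin, those are OLE objects, and as far as I know the SDK exposes them as EmbeddedObjectPart rather than EmbeddedPackagePart, which would explain why a search for EmbeddedPackagePart alone misses them.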

Removing images in header with OpenXML SDK

I have worked a bit with the OpenXML SDK and made a POC that replaces images in the header of a Word document. However, when I call DeletePart or DeleteParts with the images I want to remove, it doesn't work as expected.
When I open the Word document afterwards, where there was an image before there is now a frame with the text "This image cannot currently be displayed" and a red cross.
From a bit of googling, it appears that the references have not been completely removed, but I can't find any help on how to do that.
Below is an example of how I delete images. I only add some of them to the list, because I need to remove all but the ones with a specific URI.
//...
foreach(HeaderPart headerPart in document.MainDocumentPart.HeaderParts) {
    List<ImagePart> list = new List<ImagePart>();
    List<ImagePart> imgParts = new List<ImagePart> (headerPart.ImageParts);
    foreach(ImagePart headerImagePart in imgParts) {
        string newUri = headerImagePart.Uri.ToString();
        if(newUri != uri) {
            list.Add(headerImagePart);
        }
    }
    headerPart.DeleteParts(list);
}
//...

Images are made up of 2 parts in OpenXml; you have the actual image itself and you also have details of the Picture container that the image is displayed within in the document.
This makes sense if you think about an image being displayed more than once in the same document; details of the image can be stored once and the position(s) of the image can be stored as many times as needed.
The following code will find any Drawing objects that contain the ImagePart objects that you wish to delete. This is done by matching the Embed property of the Blip against the Id of the ImagePart.
using (WordprocessingDocument document = WordprocessingDocument.Open(filename, true))
{
    foreach (HeaderPart headerPart in document.MainDocumentPart.HeaderParts)
    {
        List<ImagePart> list = new List<ImagePart>();
        List<ImagePart> imgParts = new List<ImagePart>(headerPart.ImageParts);
        List<Drawing> drwdDeleteParts = new List<Drawing>();
        List<Drawing> drwParts = new List<Drawing>(headerPart.RootElement.Descendants<Drawing>());
        foreach (ImagePart headerImagePart in imgParts)
        {
            string newUri = headerImagePart.Uri.ToString();
            if (newUri != uri)
            {
                list.Add(headerImagePart);
                //you also need to find the Drawings the image was related to
                IEnumerable<Drawing> drawings = drwParts.Where(d => d.Descendants<Pic.Picture>().Any(p => p.BlipFill.Blip.Embed == headerPart.GetIdOfPart(headerImagePart)));
                foreach (var drawing in drawings)
                {
                    if (drawing != null && !drwdDeleteParts.Contains(drawing))
                        drwdDeleteParts.Add(drawing);
                }
            }
        }
        foreach (var d in drwdDeleteParts)
        {
            d.Remove();
        }
        headerPart.DeleteParts(list);
    }
}
As you pointed out in the comments, you'll need to add a using alias directive:
using Pic = DocumentFormat.OpenXml.Drawing.Pictures;
