I'm having difficulty understanding how to obtain the content from a PdfDocument. I've learned from previous questions that PdfDocument flushes the content to optimize working with large documents. If my function returns a new PdfDocument, how do I get the byte[] to pass into my other functions?
Even with PdfDocument.GetReader() - I can't seem to find what I'm looking for.
My use-case is as follows:
Get pdf content from an email attachment
Pass the pdf to a helper function, which extracts specific pages from the initial attachment
Pass the new PdfDocument into a function which calls Azure's Forms Recognizer API to read the fields into an object
To summarize: given a PdfDocument only, how can I get/create a byte[] from it?
Here is my code:
public async Task<BaseResponse> Handle(ReceiveEmailCommand command, CancellationToken cancellationToken) {
var ms = new MemoryStream(command.attachments.First().Content);
var extractedDocument = pdfService.PreparePdfDocument(ms);
var analyzedDocument = await formsRecognizerService.AnalyzeDocument(extractedDocument);
// Do stuff with the analyzed document...
var response = await FileWebService.AddAnalyzedDocumentToFileSystem(analyzedDocument);
}
The function AnalyzeDocument expects a Stream parameter. I want to pass something like
new Stream(extractedDocument.GetReader().Stream)
Helper function implementations are below:
public PdfDocument PreparePdfDocument(MemoryStream ms)
{
PdfDocument extractedDoc;
var pdfReader = new PdfReader(ms);
var pdf = new PdfDocument(pdfReader);
var doc = new Document(pdf);
var matches = GetNumberWithPages(pdf);
if (matches.Count > 0)
{
var pageRange = matches
.Where(x => x.Number == "125")
.Select(x => Convert.ToInt32(x.PageIndex))
.ToList();
extractedDoc = SplitPages(pdf, pageRange.First(), pageRange.Last());
}
else
{
// If we couldn't parse the PDF then just take first 4, 3 or 2 pages
try
{
extractedDoc = SplitPages(pdf, 1, 4);
}
catch (ITextException)
{
try
{
extractedDoc = SplitPages(pdf, 1, 3);
}
catch (ITextException)
{
try
{
extractedDoc = SplitPages(pdf, 1, 2);
}
catch (Exception)
{
throw;
}
}
}
}
return extractedDoc;
}
private static List<Match> GetNumberWithPages(PdfDocument doc)
{
var regex = new Regex(@"\s+([0-9]+)\s+(\([0-9]+\/[0-9]+\))\s+Page\s+([0-9])\s+of\s+([0-9]+)");
var matches = new List<Match>();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
var page = doc.GetPage(i);
var text = PdfTextExtractor.GetTextFromPage(page);
if (!string.IsNullOrEmpty(text))
{
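// Keep the regex result under its own name so it does not clash with the extracted match below
var regexMatch = regex.Match(text);
if (regexMatch.Success)
{
var match = EvaluateMatch(regexMatch, i, doc.GetNumberOfPages());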
if (match != null)
{
matches.Add(match);
}
}
}
}
return matches;
}
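// The parameter is the regex result; the return type is the custom Match model (Number, Version, PageIndex, TotalPages)
private static Match? EvaluateMatch(System.Text.RegularExpressions.Match regexMatch, int pageIndex, int totalPages)
{
if (regexMatch.Captures.Count == 1 && regexMatch.Groups.Count == 5)
{
var result = new Match
{
Number = regexMatch.Groups[1].Value,
Version = regexMatch.Groups[2].Value,
PageIndex = pageIndex.ToString(),
TotalPages = totalPages.ToString()
};
return result;
}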
else
{
return null;
}
}
public PdfDocument SplitPages(PdfDocument doc, int startIndex, int endIndex)
{
var outputDocument = CreatePdfDocument();
doc.CopyPagesTo(startIndex, endIndex, outputDocument);
return outputDocument;
}
public PdfDocument CreatePdfDocument()
{
var baos = new ByteArrayOutputStream();
var writer = new PdfWriter(baos);
var pdf = new PdfDocument(writer);
return pdf;
}
I'm having difficulty understanding how to obtain the content from a PdfDocument.
You don't!
When you create a PdfDocument to write to, you initialize it with a PdfWriter. That PdfWriter in turn has been initialized to write somewhere. If you want to access the final PDF, you have to close the PdfDocument and look at that somewhere. Also it is not easy to retrieve that somewhere from the PdfWriter as it is wrapped in a number of layers therein. Thus, you should keep a reference to that somewhere close by.
Thus, your ByteArrayOutputStream usually wouldn't be created hidden in some method CreatePdfDocument but instead in the base method and forwarded to other methods as parameter. Then you can eventually retrieve its data. If you need to create your ByteArrayOutputStream hidden like that, you can return a Pair of PdfDocument and ByteArrayOutputStream instead of the plain PdfDocument.
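To illustrate keeping the output stream close by, here is a minimal sketch, assuming iText 7 for .NET (namespace iText.Kernel.Pdf). The class PdfPageExtractor and the method ExtractPages are hypothetical names made up for this example, not part of iText or of the question's code. The stream is created in the calling method, handed to the PdfWriter, and read only after the target document has been closed.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class PdfPageExtractor
{
    // Hypothetical helper: copies pages firstPage..lastPage (1-based, inclusive)
    // of sourcePdf into a new PDF and returns that PDF as a byte array.
    public static byte[] ExtractPages(byte[] sourcePdf, int firstPage, int lastPage)
    {
        // This stream is the "somewhere" the PdfWriter writes to; we keep the reference.
        var output = new MemoryStream();

        using (var source = new PdfDocument(new PdfReader(new MemoryStream(sourcePdf))))
        using (var target = new PdfDocument(new PdfWriter(output)))
        {
            source.CopyPagesTo(firstPage, lastPage, target);
        } // target is disposed first, which finishes the PDF and closes output

        // MemoryStream.ToArray still works after the stream has been closed.
        return output.ToArray();
    }
}
```

A function that expects a Stream can then be given new MemoryStream(bytes), where bytes is the returned array, so no PdfDocument has to cross the method boundary at all.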
By the way, the idea behind this architecture is that iText tries to write as much PDF content as possible to that somewhere output as early as possible and free the memory. This allows it to create large documents without requiring a similarly large amount of memory.
when I return the stream I cannot access a closed stream
The ByteArrayOutputStream essentially is a MemoryStream; so you can in particular call ToArray to retrieve the finished PDF even if it's closed.
If you need the ByteArrayOutputStream as a regular stream, simply call PdfWriter.SetCloseStream(false) for your writer to prevent the close of the PdfDocument from also closing the stream.
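The ToArray behaviour can be checked without iText at all. The following sketch uses only System.IO; ClosedStreamDemo and WriteAndClose are illustrative names, and the explicit Dispose call stands in for the close that PdfDocument would otherwise trigger through its writer.

```csharp
using System;
using System.IO;

public static class ClosedStreamDemo
{
    // Writes the payload, closes the stream, and still reads the buffer back.
    public static byte[] WriteAndClose(byte[] payload)
    {
        var stream = new MemoryStream();
        stream.Write(payload, 0, payload.Length);
        stream.Dispose(); // stands in for the writer closing the stream

        // ToArray is documented to work on a closed MemoryStream;
        // reading Length or Position at this point would throw ObjectDisposedException.
        return stream.ToArray();
    }
}
```

So "cannot access a closed stream" only appears when the closed stream itself is read or handed on; copying the bytes out with ToArray and wrapping them in a fresh MemoryStream avoids it.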
I am trying to attach large files to a ToDoTask with the Graph API, following the example in the docs for attaching large files to a ToDoTask and using the recommended LargeFileUploadTask class for uploading large files.
I have done this successfully before when attaching large files to emails and sending them, so I used that as the basis for the following method.
public async Task CreateTaskBigAttachments( string idList, string title, List<string> categories,
BodyType contentType, string content, Importance importance, bool isRemindOn, DateTime? dueTime, cAttachment[] attachments = null)
{
try
{
var _newTask = new TodoTask
{
Title = title,
Categories = categories,
Body = new ItemBody()
{
ContentType = contentType,
Content = content,
},
IsReminderOn = isRemindOn,
Importance = importance
};
if (dueTime.HasValue)
{
var _timeZone = TimeZoneInfo.Local;
_newTask.DueDateTime = DateTimeTimeZone.FromDateTime(dueTime.Value, _timeZone.StandardName);
}
var _task = await _graphServiceClient.Me.Todo.Lists[idList].Tasks.Request().AddAsync(_newTask);
//Add attachments
if (attachments != null)
{
if (attachments.Length > 0)
{
foreach (var _attachment in attachments)
{
var _attachmentContentSize = _attachment.ContentBytes.Length;
var _attachmentInfo = new AttachmentInfo
{
AttachmentType = AttachmentType.File,
Name = _attachment.FileName,
Size = _attachmentContentSize,
ContentType = _attachment.ContentType
};
var _uploadSession = await _graphServiceClient.Me
.Todo.Lists[idList].Tasks[_task.Id]
.Attachments.CreateUploadSession(_attachmentInfo).Request().PostAsync();
using (var _stream = new MemoryStream(_attachment.ContentBytes))
{
_stream.Position = 0;
LargeFileUploadTask<TaskFileAttachment> _largeFileUploadTask = new LargeFileUploadTask<TaskFileAttachment>(_uploadSession, _stream, MaxChunkSize);
try
{
await _largeFileUploadTask.UploadAsync();
}
catch (ServiceException errorGraph)
{
if (errorGraph.StatusCode == HttpStatusCode.InternalServerError || errorGraph.StatusCode == HttpStatusCode.BadGateway
|| errorGraph.StatusCode == HttpStatusCode.ServiceUnavailable || errorGraph.StatusCode == HttpStatusCode.GatewayTimeout)
{
await Task.Delay(1000); // Wait without blocking the thread until the next attempt
//Try again
await _largeFileUploadTask.ResumeAsync();
}
else
throw; // Rethrow without resetting the stack trace
}
}
}
}
}
}
catch (ServiceException)
{
throw; // Rethrow and keep the original stack trace
}
catch (Exception)
{
throw;
}
}
Everything goes well up to the point of creating the task: the task is created for the user and shows up properly in the user's task list. The upload session is also created properly.
The problem comes when I try to upload the large file with the UploadAsync call.
The following error happens.
Code: InvalidAuthenticationToken Message: Access token is empty.
But according to the LargeFileUploadTask documentation, the client does not need to set auth headers:
param name="baseClient" To use for making upload requests. The client should not set Auth headers as upload urls do not need them.
Is LargeFileUploadTask not allowed to be used for uploading large files to a ToDoTask?
If not, what is the proper way to upload large files to a ToDoTask using the Graph API? Can someone provide an example?
If you want, you can raise an issue with the details here so that they can have a look: https://github.com/microsoftgraph/msgraph-sdk-dotnet-core/issues.
It seems like it's a bug and they are working on it.
As a temporary workaround, I wrote this code to deal with the large files.
var _task = await _graphServiceClient.Me.Todo.Lists[idList].Tasks.Request().AddAsync(_newTask);
//Add attachments
if (attachments != null)
{
if (attachments.Length > 0)
{
foreach (var _attachment in attachments)
{
var _attachmentContentSize = _attachment.ContentBytes.Length;
var _attachmentInfo = new AttachmentInfo
{
AttachmentType = AttachmentType.File,
Name = _attachment.FileName,
Size = _attachmentContentSize,
ContentType = _attachment.ContentType
};
var _uploadSession = await _graphServiceClient.Me
.Todo.Lists[idList].Tasks[_task.Id]
.Attachments.CreateUploadSession(_attachmentInfo).Request().PostAsync();
// Get the upload URL and the next expected range from the response
string _uploadUrl = _uploadSession.UploadUrl;
using (var _stream = new MemoryStream(_attachment.ContentBytes))
{
_stream.Position = 0;
// Create a byte array to hold the contents of each chunk
byte[] _chunk = new byte[MaxChunkSize];
//Bytes to read
int _bytesRead = 0;
//Times the stream has been read
var _ind = 0;
while ((_bytesRead = _stream.Read(_chunk, 0, _chunk.Length)) > 0)
{
// Calculate the range of the current chunk
string _currentChunkRange = $"bytes {_ind * MaxChunkSize}-{_ind * MaxChunkSize + _bytesRead - 1}/{_stream.Length}";
// Later we should calculate the next expected range in case we need it
// Create a ByteArrayContent object from the chunk
ByteArrayContent _byteArrayContent = new ByteArrayContent(_chunk, 0, _bytesRead);
// Set the header for the current chunk
_byteArrayContent.Headers.Add("Content-Range", _currentChunkRange);
_byteArrayContent.Headers.Add("Content-Type", _attachment.ContentType);
_byteArrayContent.Headers.Add("Content-Length", _bytesRead.ToString());
// Upload the chunk using the httpClient Request
var _client = new HttpClient();
var _requestMessage = new HttpRequestMessage()
{
RequestUri = new Uri(_uploadUrl + "/content"),
Method = HttpMethod.Put,
Headers =
{
{ "Authorization", bearerToken },
}
};
_requestMessage.Content = _byteArrayContent;
var _response = await _client.SendAsync(_requestMessage);
if (!_response.IsSuccessStatusCode)
throw new Exception("File attachment failed");
_ind++;
}
}
}
}
}
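The Content-Range arithmetic in the workaround above can be isolated and checked on its own. This is a minimal sketch using only the base class library; ChunkRanges and ContentRanges are hypothetical names, and long offsets are my choice so the multiplication cannot overflow for large files.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkRanges
{
    // Yields one Content-Range header value per chunk for an upload of totalLength bytes.
    public static IEnumerable<string> ContentRanges(long totalLength, int chunkSize)
    {
        if (totalLength <= 0) throw new ArgumentOutOfRangeException(nameof(totalLength));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        for (long start = 0; start < totalLength; start += chunkSize)
        {
            // The end offset is inclusive, and the last chunk may be shorter than chunkSize.
            long end = Math.Min(start + chunkSize, totalLength) - 1;
            yield return $"bytes {start}-{end}/{totalLength}";
        }
    }
}
```

For a 10-byte payload and a chunk size of 4 this yields "bytes 0-3/10", "bytes 4-7/10" and "bytes 8-9/10", the same shape the loop above builds from _ind and _bytesRead.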
How can I retrieve all metadata stored in a PDF with iText7?
using (var pdfReader = new iText.Kernel.Pdf.PdfReader("path-to-a-pdf-file"))
{
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var pdfDocumentInfo = pdfDocument.GetDocumentInfo();
// Getting basic metadata
var author = pdfDocumentInfo.GetAuthor();
var title = pdfDocumentInfo.GetTitle();
// Getting everything else
var someMetadata = pdfDocumentInfo.GetMoreInfo("need-a-key-here");
// How to get all metadata ?
}
I was using this with iTextSharp but I can't figure out how to do it with the new iText7.
using (var pdfReader = new iTextSharp.text.pdf.PdfReader("path-to-a-pdf-file"))
{
// Getting basic metadata
var author = pdfReader.Info.ContainsKey("Author") ? pdfReader.Info["Author"] : null;
var title = pdfReader.Info.ContainsKey("Title") ? pdfReader.Info["Title"] : null;
// Getting everything else
var metadata = pdfReader.Info;
metadata.Remove("Author");
metadata.Remove("Title");
// Print metadata
Console.WriteLine($"Author: {author}");
Console.WriteLine($"Title: {title}");
foreach (var line in metadata)
{
Console.WriteLine($"{line.Key}: {line.Value}");
}
}
I am using version 7.1.1 of iText7.
In iText 7 the PdfDocumentInfo class unfortunately does not expose a method to retrieve the keys in the underlying dictionary.
But you can simply retrieve the Info dictionary contents by accessing that dictionary directly from the trailer dictionary. E.g. for a PdfDocument pdfDocument:
PdfDictionary infoDictionary = pdfDocument.GetTrailer().GetAsDictionary(PdfName.Info);
foreach (PdfName key in infoDictionary.KeySet())
Console.WriteLine($"{key}: {infoDictionary.GetAsString(key)}");
There is a problem with strings encoded as "UnicodeBig", "UTF-8" or "PDF".
For example, if the PDF was created with Microsoft Word, the "/Creator" value comes out unreadable and needs to be converted.
iText7 has its own function for that conversion: ToUnicodeString().
It is a method of the PdfString class, though, so the PdfDictionary value (a PdfObject) first has to be cast to PdfString.
Complete solution as an async function that never throws and disposes its resources automatically:
public static async Task<(Dictionary<string, string> MetaInfo, string Error)> GetMetaInfoAsync(string path)
{
try
{
var metaInfo = await Task.Run(() =>
{
var metaInfoDict = new Dictionary<string, string>();
using (var pdfReader = new PdfReader(path))
using (var pdfDocument = new PdfDocument(pdfReader))
{
metaInfoDict["PDF.PageCount"] = $"{pdfDocument.GetNumberOfPages():D}";
metaInfoDict["PDF.Version"] = $"{pdfDocument.GetPdfVersion()}";
var pdfTrailer = pdfDocument.GetTrailer();
var pdfDictInfo = pdfTrailer.GetAsDictionary(PdfName.Info);
foreach (var pdfEntryPair in pdfDictInfo.EntrySet())
{
var key = "PDF." + pdfEntryPair.Key.ToString().Substring(1);
string value;
switch (pdfEntryPair.Value)
{
case PdfString pdfString:
value = pdfString.ToUnicodeString();
break;
default:
value = pdfEntryPair.Value.ToString();
break;
}
metaInfoDict[key] = value;
}
return metaInfoDict;
}
});
return (metaInfo, null);
}
catch (Exception ex)
{
if (Debugger.IsAttached) Debugger.Break();
return (null, ex.Message);
}
}
I have a batch of PDFs that I want to convert to Text. It's easy to get text with something like this from iTextSharp:
PdfTextExtractor.GetTextFromPage(reader, pageNumber);
It's easy to get Images using this answer (or similar answers in the thread).
What I can't figure out easily... is how to interleave image placeholders in the text.
Given a PDF, a page # and GetTextFromPage I expect the output to be:
line 1
line 2
line 3
When I'd like it to be (Where 1.1 means page 1, image 1... Page 1, image 2):
line 1
[1.1]
line 2
[1.2]
line 3
Is there a way to get an "image placeholder" for iTextSharp, PdfSharp or anything similar? I'd like a GetTextAndPlaceHoldersFromPage method (or similar).
PS: Hrm... it's not letting me tag iTextSHARP - not iText. C# not Java.
C# Pdf to Text with image placeholder
https://stackoverflow.com/a/28087521/
https://stackoverflow.com/a/33697745/
Although this doesn't have the exact layout mentioned in my question (Since that was a simplified version of what I really wanted anyways), it does have the starting parts as listed by the second note (translated from iText Java)... with extra information pulled from the third note (Some of the reflection used in Java didn't seem to work in C#, so that info came from #3).
Working from this, I'm able to get a List of Strings representing lines in the PDF (all pages, instead of just page 1)... with text added where images should be (Huzzah!). ByteArrayToFile extension method added for flavor (although I didn't include other parts/extensions, which may break copy/paste usage of this code).
I've also been able to greatly simplify other parts of my process and gut half of the garbage I had working before. Huzzah!!! Thanks @Mkl
internal class Program
{
public static void Main(string[] args)
{
var dir = Settings.TestDirectory;
var file = Settings.TestFile;
Log.Info($"File to Process: {file.FullName}");
using (var reader = new PdfReader(file.FullName))
{
var parser = new PdfReaderContentParser(reader);
var listener = new SimpleMixedExtractionStrategy(file, dir);
parser.ProcessContent(1, listener);
var x = listener.GetResultantText().Split('\n');
}
}
}
public class SimpleMixedExtractionStrategy : LocationTextExtractionStrategy
{
public static readonly ILog Log = LogManager.GetLogger(MethodBase.GetCurrentMethod().DeclaringType);
public DirectoryInfo OutputPath { get; }
public FileInfo OutputFile { get; }
private static readonly LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
private int _counter;
public SimpleMixedExtractionStrategy(FileInfo outputFile, DirectoryInfo outputPath)
{
OutputPath = outputPath;
OutputFile = outputFile;
}
public override void RenderImage(ImageRenderInfo renderInfo)
{
try
{
var image = renderInfo.GetImage();
if (image == null) return;
var number = _counter++;
var imageFile = new FileInfo($"{OutputFile.FullName}-{number}.{image.GetFileType()}");
imageFile.ByteArrayToFile(image.GetImageAsBytes());
var segment = UNIT_LINE.TransformBy(renderInfo.GetImageCTM());
var location = new TextChunk("[" + imageFile + "]", segment.GetStartPoint(), segment.GetEndPoint(), 0f);
var locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);
var LocationalResults = (List<TextChunk>)locationalResultField.GetValue(this);
LocationalResults.Add(location);
}
catch (Exception ex)
{
Log.Debug($"{ex.Message}");
Log.Verbose($"{ex.StackTrace}");
}
}
}
public static class ByteArrayExtensions
{
public static bool ByteArrayToFile(this FileInfo fileName, byte[] byteArray)
{
try
{
// Open the file for writing; the using block closes the stream even if Write throws
using (var fileStream = new FileStream(fileName.FullName, FileMode.Create, FileAccess.Write))
{
// Write the whole byte array to the file
fileStream.Write(byteArray, 0, byteArray.Length);
}
return true;
}
catch (Exception exception)
{
// Error
Log.Error($"Exception caught in process: {exception.Message}", exception);
}
// error occurred, return false
return false;
}
}
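The idea behind the strategy above is that image markers go into the same positioned list as the text chunks, so they sort into place. It can be shown without iText. The sketch below is only an illustration under assumptions: PlaceholderInterleaver and Item are hypothetical names, each line is reduced to a single Y coordinate, and images are numbered top to bottom, whereas the code above numbers them in the order they are rendered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlaceholderInterleaver
{
    // A piece of page content with its vertical position.
    // In default PDF user space Y grows upwards, so a larger Y is higher on the page.
    public sealed class Item
    {
        public Item(string text, float y) { Text = text; Y = y; }
        public string Text { get; }
        public float Y { get; }
    }

    public static List<string> Interleave(int page, IEnumerable<Item> textLines, IEnumerable<float> imageYs)
    {
        // Number the images from top to bottom: [page.1], [page.2], ...
        var placeholders = imageYs
            .OrderByDescending(y => y)
            .Select((y, index) => new Item($"[{page}.{index + 1}]", y));

        // Merge text and placeholders, then read the page from top to bottom.
        return textLines
            .Concat(placeholders)
            .OrderByDescending(item => item.Y)
            .Select(item => item.Text)
            .ToList();
    }
}
```

With text lines at Y = 700, 500 and 300 and images at Y = 600 and 400 on page 1, the result is line 1, [1.1], line 2, [1.2], line 3, which is the layout asked for in the question.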
I am trying to fill a word document with data from an XML. I am using openXML to fill the document, which works great and save it as .docx. The thing is I have to open Word and save the document as an .odt and then use the OpenOffice SDK to open the .docx and save it as a pdf. When I don't save the .docx as .odt, the formatting is off.
What I need to be able to do is be able to either convert the .docx to .odt or save it originally as .odt.
Here is what I have right now:
static void Main()
{
string documentText;
XmlDocument xmlDoc = new XmlDocument(); // Create an XML document object
xmlDoc.Load("C:\\Cache\\MMcache.xml"); // Load the XML document from the specified file
XmlNodeList PatientFirst = xmlDoc.GetElementsByTagName("PatientFirst");
XmlNodeList PatientSignatureImg = xmlDoc.GetElementsByTagName("PatientSignatureImg");
byte[] byteArray = File.ReadAllBytes("C:\\Cache\\TransportationRunReporttemplate.docx");
using (MemoryStream stream = new MemoryStream())
{
stream.Write(byteArray, 0, (int)byteArray.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
{
using (StreamReader reader = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
documentText = reader.ReadToEnd();
}
using (StreamWriter writer = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
writer.Write(documentText);
}
}
// Save the file with the new name
File.WriteAllBytes("C:\\Cache\\MYFINISHEDTEMPLATE.docx", stream.ToArray());
}
}
private static void AddPicture(Bitmap bitmap)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open("C:\\Cache\\MYFINISHEDTEMPLATE.docx", true))
{
//Bitmap image = new Bitmap("C:\\Cache\\scribus.jpg");
SdtElement controlBlock = doc.MainDocumentPart.Document.Body
.Descendants<SdtElement>()
.Where
(r =>
r.SdtProperties.GetFirstChild<Tag>().Val == "Signature"
).SingleOrDefault();
// Find the Blip element of the content control.
A.Blip blip = controlBlock.Descendants<A.Blip>().FirstOrDefault();
ImagePart imagePart = doc.MainDocumentPart
.AddImagePart(ImagePartType.Jpeg);
using (MemoryStream stream = new MemoryStream())
{
bitmap.Save(stream, ImageFormat.Jpeg);
stream.Position = 0;
imagePart.FeedData(stream);
}
blip.Embed = doc.MainDocumentPart.GetIdOfPart(imagePart);
/* DW.Inline inline = controlBlock
.Descendants<DW.Inline>().FirstOrDefault();
// 9525 = pixels to points
inline.Extent.Cy = image.Size.Height * 9525;
inline.Extent.Cx = image.Size.Width * 9525;
PIC.Picture pic = inline
.Descendants<PIC.Picture>().FirstOrDefault();
pic.ShapeProperties.Transform2D.Extents.Cy
= image.Size.Height * 9525;
pic.ShapeProperties.Transform2D.Extents.Cx
= image.Size.Width * 9525;*/
}
ConvertToPDF(@"C:\Cache\MYFINISHEDTEMPLATE2.docx", @"C:\Cache\OpenPdf.pdf");
}
public static Bitmap Base64StringToBitmap(string base64String)
{
Bitmap bmpReturn = null;
byte[] byteBuffer = Convert.FromBase64String(base64String);
MemoryStream memoryStream = new MemoryStream(byteBuffer);
memoryStream.Position = 0;
bmpReturn = (Bitmap)Bitmap.FromStream(memoryStream);
memoryStream.Close();
memoryStream = null;
byteBuffer = null;
return bmpReturn;
}
public static void ConvertToPDF(string inputFile, string outputFile)
{
if (ConvertExtensionToFilterType(System.IO.Path.GetExtension(inputFile)) == null)
throw new InvalidProgramException("Unknown file type for OpenOffice. File = " + inputFile);
StartOpenOffice();
//Get a ComponentContext
var xLocalContext =
Bootstrap.bootstrap();
//Get MultiServiceFactory
var xRemoteFactory =
(XMultiServiceFactory)
xLocalContext.getServiceManager();
//Get a CompontLoader
var aLoader =
(XComponentLoader)xRemoteFactory.createInstance("com.sun.star.frame.Desktop");
//Load the sourcefile
XComponent xComponent = null;
try
{
xComponent = InitDocument(aLoader,
PathConverter(inputFile), "_blank");
//Wait for loading
while (xComponent == null)
{
Thread.Sleep(1000);
}
// save/export the document
SaveDocument(xComponent, inputFile, PathConverter(outputFile));
}
finally
{
if (xComponent != null) xComponent.dispose();
}
}
private static void StartOpenOffice()
{
// GetProcessesByName expects the process name without the ".exe" extension
var ps = Process.GetProcessesByName("soffice");
// OpenOffice is already running, so there is nothing to start
if (ps.Length > 0)
return;
var p = new Process
{
StartInfo =
{
Arguments = "-headless -nofirststartwizard",
FileName = "soffice.exe",
CreateNoWindow = true
}
};
var result = p.Start();
if (result == false)
throw new InvalidProgramException("OpenOffice failed to start.");
}
private static XComponent InitDocument(XComponentLoader aLoader, string file, string target)
{
var openProps = new PropertyValue[1];
openProps[0] = new PropertyValue { Name = "Hidden", Value = new Any(true) };
XComponent xComponent = aLoader.loadComponentFromURL(
file, target, 0,
openProps);
return xComponent;
}
private static void SaveDocument(XComponent xComponent, string sourceFile, string destinationFile)
{
var propertyValues = new PropertyValue[2];
// Setting the flag for overwriting
propertyValues[1] = new PropertyValue { Name = "Overwrite", Value = new Any(true) };
//// Setting the filter name
propertyValues[0] = new PropertyValue
{
Name = "FilterName",
Value = new Any(ConvertExtensionToFilterType(System.IO.Path.GetExtension(sourceFile)))
};
((XStorable)xComponent).storeToURL(destinationFile, propertyValues);
}
private static string PathConverter(string file)
{
if (file == null || file.Length == 0)
throw new NullReferenceException("Null or empty path passed to OpenOffice");
return String.Format("file:///{0}", file.Replace(@"\", "/"));
}
public static string ConvertExtensionToFilterType(string extension)
{
switch (extension)
{
case ".odt":
case ".doc":
case ".docx":
case ".txt":
case ".rtf":
case ".html":
case ".htm":
case ".xml":
case ".wps":
case ".wpd":
return "writer_pdf_Export";
case ".xls":
case ".xlsb":
case ".ods":
return "calc_pdf_Export";
case ".ppt":
case ".pptx":
case ".odp":
return "impress_pdf_Export";
default: return null;
}
}
}
}
Just for information, I cannot use anything that relies on Interop because Word will not be installed on the machine. I am using OpenXML and OpenOffice.
This is what I would try (details below):
1) try Doc format instead of DocX
2) switch to Libre Office and try DocX again
3) use the odf-converter to get a better DocX -> ODT conversion.
More details...
There's something called the odf-converter (open source) that converts DocX->ODT and typically gives you a more accurate result than Open Office. Take a look at odf-converter-integrator by OONinja for pre-packaged versions.
Also, Libre Office has DocX support ahead of OpenOffice so you might get a better result simply by switching to Libre Office.
A further option is to start from Doc format rather than DocX. In the OpenOffice world that translates much better to ODT and then to PDF.
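If you do switch to Libre Office, its command line can convert a file without the UNO bootstrap code. A minimal sketch, assuming soffice is the Libre Office executable and that its --headless, --convert-to and --outdir options are available in your version (the older OpenOffice command line used in the question differs). HeadlessConverter and its methods are hypothetical names for this example.

```csharp
using System;
using System.Diagnostics;

public static class HeadlessConverter
{
    // Describes the call: soffice --headless --convert-to <format> --outdir <dir> <file>
    public static ProcessStartInfo BuildStartInfo(string sofficePath, string inputFile, string targetFormat, string outputDir)
    {
        return new ProcessStartInfo
        {
            FileName = sofficePath,
            Arguments = $"--headless --convert-to {targetFormat} --outdir \"{outputDir}\" \"{inputFile}\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs the conversion and reports whether soffice exited with code 0.
    public static bool Convert(string sofficePath, string inputFile, string targetFormat, string outputDir)
    {
        using (var process = Process.Start(BuildStartInfo(sofficePath, inputFile, targetFormat, outputDir)))
        {
            if (process == null) return false;
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

Calling Convert with "odt" as the target format covers the DocX -> ODT step, and "pdf" covers the final export. Checking that the output file actually exists is probably a more reliable success test than the exit code alone.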
Hope that helps.
You may try to use Docxpresso to generate your .odt directly from HTML + CSS code and avoid any conversion issue.
Docxpresso is free for non-commercial use.
I'm developing an application in which a Word document is converted to PDF. My problem is rather complicated, so please help me out.
My Word doc has a TOC, bookmarks, endnotes and hyperlinks. When I save this doc as PDF, only the bookmarks are converted. After long research I found that PDF documents do not support bookmark-to-bookmark hyperlinks; they need either page numbers or named destinations.
So I chose named destinations for this purpose, but I am stuck again, because a simple "Save As" cannot generate named destinations in the PDF. So I printed the Word doc to the Adobe PDF printer and got the named destinations as required, but that document has neither bookmarks nor hyperlinks. I therefore decided to generate two PDFs from the Word doc, the first via "Save As" and the second by printing:
test.pdf (by Save As) (contains bookmarks, hyperlinks)
test_p.pdf (by printing) (only contains named destinations)
Then I researched once again and found a way to extract all named destinations from test_p.pdf into XML with an iTextSharp function. Unfortunately, I haven't found any way to import this XML back into test.pdf, which is why I came here.
Please guide me on what to do next if this approach is OK; otherwise, suggest any other approach to accomplish this.
I wrote a class to replace URLs in my PDF files some time ago:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using iTextSharp.text.pdf;
namespace ReplaceLinks
{
public class ReplacePdfLinks
{
Dictionary<string, PdfObject> _namedDestinations;
PdfReader _reader;
public string InputPdf { set; get; }
public string OutputPdf { set; get; }
public Func<Uri, string> UriToNamedDestination { set; get; }
public void Start()
{
updatePdfLinks();
saveChanges();
}
private PdfArray getAnnotationsOfCurrentPage(int pageNumber)
{
var pageDictionary = _reader.GetPageN(pageNumber);
var annotations = pageDictionary.GetAsArray(PdfName.ANNOTS);
return annotations;
}
private static bool hasAction(PdfDictionary annotationDictionary)
{
return annotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK);
}
private static bool isUriAction(PdfDictionary annotationAction)
{
return annotationAction.Get(PdfName.S).Equals(PdfName.URI);
}
private void replaceUriWithLocalDestination(PdfDictionary annotationAction)
{
var uri = annotationAction.Get(PdfName.URI) as PdfString;
if (uri == null)
return;
if (string.IsNullOrWhiteSpace(uri.ToString()))
return;
var namedDestination = UriToNamedDestination(new Uri(uri.ToString()));
if (string.IsNullOrWhiteSpace(namedDestination))
return;
PdfObject entry;
if (!_namedDestinations.TryGetValue(namedDestination, out entry))
return;
annotationAction.Remove(PdfName.S);
annotationAction.Remove(PdfName.URI);
var newLocalDestination = new PdfArray();
annotationAction.Put(PdfName.S, PdfName.GOTO);
var xRef = ((PdfArray)entry).First(x => x is PdfIndirectReference);
newLocalDestination.Add(xRef);
newLocalDestination.Add(PdfName.FITH);
annotationAction.Put(PdfName.D, newLocalDestination);
}
private void saveChanges()
{
using (var fileStream = new FileStream(OutputPdf, FileMode.Create, FileAccess.Write, FileShare.None))
using (var stamper = new PdfStamper(_reader, fileStream))
{
stamper.Close();
}
}
private void updatePdfLinks()
{
_reader = new PdfReader(InputPdf);
_namedDestinations = _reader.GetNamedDestinationFromStrings();
var pageCount = _reader.NumberOfPages;
for (var i = 1; i <= pageCount; i++)
{
var annotations = getAnnotationsOfCurrentPage(i);
if (annotations == null || !annotations.Any())
continue;
foreach (var annotation in annotations.ArrayList)
{
var annotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(annotation);
if (!hasAction(annotationDictionary))
continue;
var annotationAction = annotationDictionary.Get(PdfName.A) as PdfDictionary;
if (annotationAction == null)
continue;
if (!isUriAction(annotationAction))
continue;
replaceUriWithLocalDestination(annotationAction);
}
}
}
}
}
To use it:
new ReplacePdfLinks
{
InputPdf = @"test.pdf",
OutputPdf = "mod.pdf",
UriToNamedDestination = uri =>
{
if (uri.Host.ToLowerInvariant().Contains("google.com"))
{
return "entry1";
}
return string.Empty;
}
}.Start();
This sample will modify all of the urls containing google.com to point to a specific named destination "entry1".
And this is the sample file to test the above class:
void WriteFile()
{
using (var doc = new Document(PageSize.LETTER))
{
using (var fs = new FileStream("test.pdf", FileMode.Create))
{
using (var writer = PdfWriter.GetInstance(doc, fs))
{
doc.Open();
var blueFont = FontFactory.GetFont("Arial", 12, Font.NORMAL, BaseColor.BLUE);
doc.Add(new Chunk("Go to URL", blueFont).SetAction(new PdfAction("http://www.google.com/", false)));
doc.NewPage();
doc.Add(new Chunk("Go to Test", blueFont).SetLocalGoto("entry1"));
doc.NewPage();
doc.Add(new Chunk("Test").SetLocalDestination("entry1"));
doc.Close();
}
}
}
}