OpenXml Dotx to Docx Word Found Unreadable Content

I am trying to write code that fills a template's content controls and then saves the result as a new file.
I found the very helpful entry "Word OpenXml Word Found Unreadable Content" and used the code from it.
I copied the code from that post, as shown below:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
byte[] buffer = File.ReadAllBytes(path);
var destStream = new MemoryStream(buffer.Length);
destStream.Write(buffer, 0, buffer.Length);
destStream.Seek(0, SeekOrigin.Begin);
return destStream;
}
public static void Generate()
{
using MemoryStream stream = ReadAllBytesToMemoryStream(@"c:\Templates\TemplateTest.dotx");
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
wpd.ChangeDocumentType(WordprocessingDocumentType.Document);
}
File.WriteAllBytes(@"c:\Templates\TemplateTestOutput.docx", stream.GetBuffer());
return;
}
It successfully creates the file, but whenever I open the new docx file, Word gives the "Word Found Unreadable Content" error. The template I made isn't complex; it just has 3 content controls with regular text for labels. I also tried copying a regular docx with just some lines of text and got the same error.
When I click OK on the error, Word shows the document just fine. I'm not sure what I'm doing wrong; I'm not even editing anything at this point.

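Figured out a solution.
Instead of using File.WriteAllBytes
File.WriteAllBytes(@"C:\Templates\TemplateTestOutput.docx", stream.GetBuffer());
I used the following code:
using (FileStream fileStream = new FileStream(@"C:\Templates\TemplateTestOutput.docx", System.IO.FileMode.CreateNew))
{
stream.WriteTo(fileStream);
}
The reason this works: GetBuffer() returns the stream's entire internal buffer, including any unused capacity beyond stream.Length, so the output file ended up with extra trailing bytes that are not part of the package. WriteTo() (like ToArray()) writes only the bytes that were actually written to the stream.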
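The difference between the two calls can be seen in isolation, without any Word document involved. The following is a minimal sketch; the class and method names (MemoryStreamCopyDemo, Measure, Save) are made up for illustration and are not part of the original post.

```csharp
using System;
using System.IO;

public static class MemoryStreamCopyDemo
{
    // Writes 100 bytes into an expandable MemoryStream and reports
    // the stream length, the GetBuffer() length and the ToArray() length.
    public static (long Length, int BufferLength, int ArrayLength) Measure()
    {
        using var stream = new MemoryStream();
        stream.Write(new byte[100], 0, 100);

        byte[] raw = stream.GetBuffer();  // whole internal buffer, may include unused capacity
        byte[] exact = stream.ToArray();  // only the first stream.Length bytes

        return (stream.Length, raw.Length, exact.Length);
    }

    // Persists exactly stream.Length bytes, regardless of the current Position.
    public static void Save(MemoryStream stream, string path)
    {
        using var file = new FileStream(path, FileMode.Create, FileAccess.Write);
        stream.WriteTo(file);
        // Equivalent alternative: File.WriteAllBytes(path, stream.ToArray());
    }
}
```

Calling MemoryStreamCopyDemo.Measure() should report a buffer length that is at least as large as the array length, and usually larger. In the question's code the buffer probably grew when the document was saved back into the stream, which would explain the extra bytes Word complained about.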

Related

Unable to merge 2 PDFs using MemoryStream

I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to perform the merge, and while debugging I can see that finalStream.Length equals the sum of the lengths of the two PDFs. But when I open the combined PDF, I see the content of only one of the two PDFs.
Additionally, when I try to close "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}

Merging PDFs in C#, or any other language, is not straightforward without using a third-party library.
I assume your reason for not using a library is that most free libraries and NuGet packages have limitations and/or cost money for commercial use.
I did some research and found an open-source library called PdfClown, available as a NuGet package and also for Java. It is free without limitations (donate if you like). The library has a lot of features, one of which is merging two or more documents into one.
Here is my example, which takes a folder with multiple PDF files, merges them, and saves the result to the same or another folder. It is also possible to use MemoryStream, but I did not find it necessary in this case.
The code is self-explanatory; the key point is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
var list = Directory.GetFiles(Path.GetFullPath(srcPath));
if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
return;
var files = list.Select(File.ReadAllBytes).ToList();
using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
{
var document = dest.Document;
var builder = new org.pdfclown.tools.PageManager(document);
foreach (var file in files.Skip(1))
{
using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
{ builder.Add(src.Document); }
}
dest.Save(destFile, SerializationModeEnum.Incremental);
}
}
To test it
var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: A part of this answer is taken from my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code on GitHub.

This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
CopyPages(one, outPdf);
CopyPages(two, outPdf);
outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
for (int i = 0; i < from.PageCount; i++)
{
to.AddPage(from.Pages[i]);
}
}

That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.
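To see what byte-level appending produces, here is a small sketch that concatenates two buffers the way the two CopyTo() calls in the question do, and then counts the structural markers in the result. The helper names (PdfConcatDemo, Concat, CountMarker) are hypothetical. Run against real PDF bytes, one would expect the result to contain two "%PDF-" headers and at least two "%%EOF" markers, i.e. two complete files glued together rather than one document.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfConcatDemo
{
    // Appends b to a byte-for-byte, like two CopyTo calls into one MemoryStream.
    public static byte[] Concat(byte[] a, byte[] b)
    {
        using var ms = new MemoryStream();
        ms.Write(a, 0, a.Length);
        ms.Write(b, 0, b.Length);
        return ms.ToArray();
    }

    // Counts non-overlapping occurrences of an ASCII marker in a byte array.
    public static int CountMarker(byte[] data, string marker)
    {
        byte[] needle = Encoding.ASCII.GetBytes(marker);
        int count = 0;
        for (int i = 0; i + needle.Length <= data.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < needle.Length; j++)
            {
                if (data[i + j] != needle[j]) { match = false; break; }
            }
            if (match)
            {
                count++;
                i += needle.Length - 1; // skip past this occurrence
            }
        }
        return count;
    }
}
```

A reader locates the cross-reference data starting from the end of the file, so with such a concatenation it will typically end up showing only one of the two documents, which matches what the question describes.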

PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to keep the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first page, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to rebuild the xref table. The xref table is a cross-reference table that describes the content of the document and, more importantly, where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in its xref table by that much, and then add its xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.

Extract embedded package files from word document using open xml?

I am trying to extract the files embedded in a Word document (Word, Excel, and package objects). I am not able to extract the package objects and save them using C# and Open XML.
The code below extracts the Word and Excel files, but not the packages.
using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
{
foreach (EmbeddedPackagePart pkgPart in document.MainDocumentPart.GetPartsOfType<EmbeddedPackagePart>())
{
if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
{
string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
// Get the stream from the part.
System.IO.Stream partStream = pkgPart.GetStream();
string filePath = "d:\\test\\" + fileName1;
// Write the stream to the file.
System.IO.FileStream writeStream = new System.IO.FileStream(filePath, FileMode.Create, FileAccess.Write);
ReadWriteStream(partStream, writeStream);
}
}
}
}

The issue you're having is that when you go to MainDocumentPart.Parts and start searching, what you get are things like ImagePart, ChartPart, etc., where the ChartPart might have its own embedded part, which could be the Excel or Word file you are looking for.
In short, you need to extend your search for embedded parts to the child parts of the main document.
If I just wanted to extract all embedded parts from one of the files in my own project, I would go about it like this.
using (var document = WordprocessingDocument.Open(@"C:\Test\myTestDocument.docx", false))
{
//just grab all the parts, might be relevant to be a bit more clever about it, depending on sizes of files and how many files you want to search through
foreach(var part in document.MainDocumentPart.Parts)
{
//for each part, see if that part contains an EmbeddedPackagePart
var testForEmbedding = part.OpenXmlPart.GetPartsOfType<EmbeddedPackagePart>();
foreach(EmbeddedPackagePart embedding in testForEmbedding)
{
//You should probably insert some clever naming scheme here..
string fileName = embedding.Uri.OriginalString.Split('/').Last();
//stream the EmbeddedPackagePart to a file
using(FileStream myFile = File.Create(@"C:\test\" + fileName))
using (var stream = embedding.GetStream())
{
stream.Seek(0, SeekOrigin.Begin);
stream.CopyTo(myFile);
myFile.Close();
}
}
}
}
I hope this helps!
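As a cross-check when it is unclear which part type holds an embedded file, the document can also be inspected as a plain ZIP archive, since a .docx is a ZIP package. The sketch below is not the answer's method; it uses only System.IO.Compression, and it assumes that Word stores embedded files under "word/embeddings/", which is the usual layout but is not guaranteed. The names EmbeddingExtractor and ExtractEmbeddings are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class EmbeddingExtractor
{
    // Assumption: embedded files live in this folder inside the package.
    private const string EmbeddingsFolder = "word/embeddings/";

    // Copies every entry below the embeddings folder to outputDir
    // and returns the paths of the files that were written.
    public static List<string> ExtractEmbeddings(Stream docx, string outputDir)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputDir);

        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        foreach (ZipArchiveEntry entry in zip.Entries)
        {
            if (!entry.FullName.StartsWith(EmbeddingsFolder, StringComparison.OrdinalIgnoreCase))
                continue;
            if (entry.Name.Length == 0)
                continue; // directory entry, nothing to copy

            // entry.Name is the file name only, so the target stays inside outputDir.
            string target = Path.Combine(outputDir, entry.Name);
            using Stream source = entry.Open();
            using FileStream dest = File.Create(target);
            source.CopyTo(dest);
            written.Add(target);
        }
        return written;
    }
}
```

Usage would be something like EmbeddingExtractor.ExtractEmbeddings(File.OpenRead(path), outputFolder). Entries that come out as .bin files are likely OLE containers rather than ready-to-open documents and would need further unpacking.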

OpenXml-SDK: How to apply FontFamily/Size to AltChunk of Type [TextPlain]

Can anybody show me how to apply a font family/size to an AltChunk of type
AlternativeFormatImportPartType.TextPlain
This is my code, but I can't figure out how to do this at all (even Google doesn't help).
MainDocumentPart main = doc.MainDocumentPart;
string altChunkId = "AltChunkId" + Guid.NewGuid().ToString().Replace("-", "");
var chunk = main.AddAlternativeFormatImportPart
(AlternativeFormatImportPartType.TextPlain, altChunkId);
using (var mStream = new MemoryStream())
{
using (var writer = new StreamWriter(mStream))
{
writer.Write(value);
writer.Flush();
mStream.Position = 0;
chunk.FeedData(mStream);
}
}
var altChunk = new AltChunk();
altChunk.Id = altChunkId;
OpenXmlElement afterThat = null;
foreach (var para in main.Document.Body.Descendants<Paragraph>())
{
if (para.InnerText.Equals("Notizen:"))
{
afterThat = para;
}
}
main.Document.Body.InsertAfter(altChunk, afterThat);
If I do it this way, I get "Courier New" with a size of "10,5".
UPDATE
This is the working Solution I came up with:
Convert Plaintext to RTF, change the Fontfamily/size and apply it to the WordProcessingDocument!
public static string PlainToRtf(string value)
{
using (var rtf = new System.Windows.Forms.RichTextBox())
{
rtf.Text = value;
rtf.SelectAll();
rtf.SelectionFont = new System.Drawing.Font("Calibri", 10);
return rtf.Rtf;
}
}
var chunk = main.AddAlternativeFormatImportPart
(AlternativeFormatImportPartType.Rtf, altChunkId);
using (var mStream = new MemoryStream())
{
using (var writer = new StreamWriter(mStream))
{
var rtf = PlainToRtf(value);
writer.Write(rtf);
writer.Flush();
mStream.Position = 0;
chunk.FeedData(mStream);
}
}
//proceed with creating AltChunk and inserting it to the Document...

How to apply FontFamily/Size to AltChunk of Type [TextPlain]
I am afraid this is NOT possible, in any case, not with OpenXml SDK.
Why?
An altChunk (Anchor for Imported External Content) is designed for importing content into a document. It is a 'temporary' object: just a reference to external content that is incorporated "as is" into the document. When the document is later opened and saved with Word, Word converts this external content into valid OpenXml content.
So, for a newly created document, you can't loop over the paragraphs to retrieve the imported content and apply a style to it.
If you import RTF content, for example, the style must be applied to the RTF before importing it.
Plain text (TextPlain, i.e. a .txt file) carries no style information, so there is nothing to convert. (You can change the font in Notepad, but that setting applies to every document you open; it is an application-level property, not part of the file.)
And I can confirm that Word by default creates a style with "Courier New 10,5" to display the content of the file. I just tested.
What can I do?
Apply a style after the document has been opened and saved with Word. Note that you will have to retrieve the paragraph(s), or you could try to retrieve the style Word created in the document and change the font there. This link could help to achieve this:
How to: Apply a style to a paragraph in a word processing document (Open XML SDK).
Or maybe there is a registry key that changes Word's default behavior on your computer. Even if there is, it doesn't solve the problem for a newly created document that is opened for the first time on the client's machine.
Note from the OP:
I think a possible Solution to the Problem could be, converting the PlainText to RTF apply StyleInformation and then append it to WordProcessingDocument as AltChunk.
I totally agree. Just note that "apply StyleInformation" here means applying it at the RTF level.
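Applying the style at the RTF level does not strictly require a RichTextBox. As a rough sketch, the RTF wrapper can also be written by hand; the class RtfBuilder and its parameters are hypothetical, and the generated markup is deliberately minimal (one font, one size, paragraphs only).

```csharp
using System;
using System.Text;

public static class RtfBuilder
{
    // Wraps plain text in a minimal RTF document with a single font and size.
    // RTF expresses font size in half-points, so 10 pt becomes \fs20.
    public static string PlainToRtf(string text, string fontName = "Calibri", int pointSize = 10)
    {
        var sb = new StringBuilder();
        sb.Append(@"{\rtf1\ansi\deff0{\fonttbl{\f0 ").Append(fontName).Append(";}}");
        sb.Append(@"\f0\fs").Append(pointSize * 2).Append(' ');

        foreach (char c in text.Replace("\r\n", "\n"))
        {
            switch (c)
            {
                case '\\':
                case '{':
                case '}':
                    sb.Append('\\').Append(c);   // escape RTF control characters
                    break;
                case '\n':
                    sb.Append(@"\par ");         // line break becomes a new paragraph
                    break;
                default:
                    if (c < 128)
                        sb.Append(c);
                    else
                        // Non-ASCII: \uN? with N as a signed 16-bit decimal value.
                        sb.Append(@"\u").Append((int)(short)c).Append('?');
                    break;
            }
        }

        sb.Append('}');
        return sb.ToString();
    }
}
```

The returned string contains only ASCII characters, so it can be written to the MemoryStream with a StreamWriter and fed to an AlternativeFormatImportPartType.Rtf part in the same way as in the question's update.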

iTextSharp exception: PDF header signature not found

I'm using iTextSharp to read the contents of PDF documents:
PdfReader reader = new PdfReader(pdfPath);
using (StringWriter output = new StringWriter())
{
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
reader.Close();
pdfText = output.ToString();
}
99% of the time it works just fine. However, there is this one PDF file that will sometimes throw this exception:
PDF header signature not found. StackTrace: at
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf() at
iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword) at
Reader.PDF.DownloadPdf(String url) in
What's annoying is that I can't always reproduce the error. Sometimes it works, sometimes it doesn't. Has anyone encountered this problem?

After some research, I've found that this problem is caused either by a file being corrupted during PDF generation, or by an object in the document that doesn't conform to the PDF standard as implemented in iTextSharp. It also seems to happen only when you read a PDF file from disk.
I have not found a complete solution to the problem, only a workaround: try to open the PDF document with the iTextSharp PdfReader object first and see whether an exception is thrown, before reading the file in normal operation.
So running something similar to this:
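private bool IsValidPdf(string filepath)
{
PdfReader reader = null;
try
{
// The constructor parses the file and throws if it is not a readable PDF.
reader = new PdfReader(filepath);
return true;
}
catch
{
return false;
}
finally
{
// Release the file whether or not parsing succeeded.
reader?.Close();
}
}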
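A cheaper pre-check that does not depend on iTextSharp is to look for the "%PDF-" signature near the start of the data, which is presumably what the "PDF header signature not found" message refers to. This is only a sketch: the class PdfHeaderCheck and the 1024-byte search window are assumptions, and passing the check does not prove the rest of the file is valid.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfHeaderCheck
{
    // Returns true if the "%PDF-" signature appears within the first 1024 bytes.
    public static bool LooksLikePdf(Stream stream)
    {
        // Rewind first so a stream left at its end is still checked from the start.
        if (stream.CanSeek)
            stream.Position = 0;

        byte[] head = new byte[1024];
        int total = 0;
        int read;
        while (total < head.Length &&
               (read = stream.Read(head, total, head.Length - total)) > 0)
        {
            total += read;
        }

        string text = Encoding.ASCII.GetString(head, 0, total);
        return text.Contains("%PDF-");
    }
}
```

Running this on the downloaded bytes before handing them to PdfReader would catch cases where the server returned something other than a PDF, such as an error page or a truncated response.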

I found it was because I was calling new PdfReader(pdf) with the PDF stream position at the end of the file. By setting the position to zero it resolved the issue.
Before:
// Throws: InvalidPdfException: PDF header signature not found.
var pdfReader = new PdfReader(pdf);
After:
// Works correctly.
pdf.Position = 0;
var pdfReader = new PdfReader(pdf);
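The effect of the stream position can be reproduced with a plain MemoryStream. In this sketch (StreamPositionDemo and ReadCounts are made-up names), a read attempted right after writing returns no bytes, while the same read after rewinding returns the content:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamPositionDemo
{
    // Returns how many bytes a read yields before and after rewinding the stream.
    public static (int BeforeRewind, int AfterRewind) ReadCounts()
    {
        using var pdf = new MemoryStream();
        byte[] content = Encoding.ASCII.GetBytes("%PDF-1.7\n");
        pdf.Write(content, 0, content.Length);   // Position is now at the end

        byte[] buffer = new byte[16];
        int before = pdf.Read(buffer, 0, buffer.Length); // nothing left to read

        pdf.Position = 0;                                // rewind to the start
        int after = pdf.Read(buffer, 0, buffer.Length);  // reads the written bytes

        return (before, after);
    }
}
```

Any consumer that starts reading at the current position, as PdfReader appears to do here, sees an empty input in the first case and therefore no header.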

In my case, it was because I was passing a .json file, and iTextSharp obviously only accepts PDF files.

It is also possible that the file is open in another method or program, as it was in my case. Verify that nothing else is working with your file; you can use Resource Monitor to check which processes have the file open.

OpenXML SDK Inject VBA into excel workbook

I can successfully inject a piece of VBA code into a generated excel workbook, but what I am trying to do is use the Workbook_Open() event so the VBA code executes when the file opens. I am adding the sub to the "ThisWorkbook" object in my xlsm template file. I then use the openxml productivity tool to reflect the code and get the encoded VBA data.
When the file is generated and I view the VBA, I see "ThisWorkbook" and "ThisWorkbook1" objects. My VBA is in "ThisWorkbook" object but the code never executes on open. If I move my VBA code to "ThisWorkbook1" and re-open the file, it works fine. Why is an extra "ThisWorkbook" created? Is it not possible to inject an excel spreadsheet with a Workbook_Open() sub? Here is a snippet of the C# code I am using:
private string partData = "..."; //base 64 encoded data from reflection code
//open workbook, myWorkbook
VbaProjectPart newPart = myWorkbook.WorkbookPart.AddNewPart<VbaProjectPart>("rId1");
System.IO.Stream data = GetBinaryDataStream(partData);
newPart.FeedData(data);
data.Close();
//save and close workbook
Anyone have ideas?

Based on my research there isn't a way to insert the project part data in a format that you can manipulate in C#. In the OpenXML format, the VBA project is still stored in a binary format. However, copying the VbaProjectPart from one Excel document into another should work. As a result, you'd have to determine what you wanted the project part to say in advance.
If you are OK with this, then you can add the following code to a template Excel file in the 'ThisWorkbook' Microsoft Excel Object, along with the appropriate Macro code:
Private Sub Workbook_Open()
Run "Module1.SomeMacroName()"
End Sub
To copy the VbaProjectPart object from one file to the other, you would use code like this:
public static void InsertVbaPart()
{
using(SpreadsheetDocument ssDoc = SpreadsheetDocument.Open("file1.xlsm", false))
{
WorkbookPart wbPart = ssDoc.WorkbookPart;
MemoryStream ms = new MemoryStream();
CopyStream(ssDoc.WorkbookPart.VbaProjectPart.GetStream(), ms);
using(SpreadsheetDocument ssDoc2 = SpreadsheetDocument.Open("file2.xlsm", true))
{
Stream stream = ssDoc2.WorkbookPart.VbaProjectPart.GetStream();
ms.WriteTo(stream);
}
}
}
public static void CopyStream(Stream input, Stream output)
{
byte[] buffer = new byte[short.MaxValue + 1];
while (true)
{
int read = input.Read(buffer, 0, buffer.Length);
if (read <= 0)
return;
output.Write(buffer, 0, read);
}
}
Hope that helps.

I found that the other answers still resulted in the duplicate "Worksheet" object. I used a similar solution to what @ZlotaMoneta said, but with a different syntax:
List<VbaProjectPart> newParts = new List<VbaProjectPart>();
using (var originalDocument = SpreadsheetDocument.Open("file1.xlsm", false))
{
newParts = originalDocument.WorkbookPart.GetPartsOfType<VbaProjectPart>().ToList();
using (var document = SpreadsheetDocument.Open("file2.xlsm", true))
{
document.WorkbookPart.DeleteParts(document.WorkbookPart.GetPartsOfType<VbaProjectPart>());
foreach (var part in newParts)
{
VbaProjectPart vbaProjectPart = document.WorkbookPart.AddNewPart<VbaProjectPart>();
using (Stream data = part.GetStream())
{
vbaProjectPart.FeedData(data);
}
}
//Note this prevents the duplicate worksheet issue
document.WorkbookPart.Workbook.WorkbookProperties.CodeName = "ThisWorkbook";
}
}

You need to specify the "codeName" attribute in "xl/workbook.xml".
After feeding the VbaProjectPart with the macro, add this code:
var workbookPr = spreadsheetDocument.WorkbookPart.Workbook.Descendants<WorkbookProperties>().FirstOrDefault();
workbookPr.CodeName = "ThisWorkBook";
After opening the file everything should work now.
So, to add macro you need to:
Change document type to macro enabled
Add VbaProjectPart and feed it with earlier created macro
Add workbookPr codeName attr in xl/workbook.xml with value "ThisWorkBook"
Save as with .xlsm ext.
