Extract embedded package files from word document using open xml?

Extract embedded package files from word document using open xml? - c#

I am trying extract the word document, It has embedded files(word,excel,package). I am not able to extract package and save it Using C# Open XML.
The below code just extracts word and excel but not package.
using (WordprocessingDocument document = WordprocessingDocument.Open(fileName, false))
{
foreach (EmbeddedPackagePart pkgPart in document.MainDocumentPart.GetPartsOfType<EmbeddedPackagePart>())
{
if (pkgpart.uri.tostring().startswith(embeddingpartstring))
{
string filename1 = pkgpart.uri.tostring().remove(0, embeddingpartstring.length);
// get the stream from the part
system.io.stream partstream = pkgpart.getstream();
string filepath = "d:\\test\\" + filename1;
// write the steam to the file.
system.io.filestream writestream = new system.io.filestream(filepath, filemode.create, fileaccess.write);
readwritestream(pkgpart.getstream(), writestream);
}
}
}

The issue you're having is, that when you go to MainDocument.Parts and start searching, what you'll get is things like "Imagepart", "ChartPart" etc. where the ChartPart might have it's own embedded part, which could be the Excel or Word file you are looking for.
In short, you need to extend your search for embedded parts, to the actual parts in the mainDocument.
If I just wanted to extract all embedded parts in one of the files from my own project, I would go about it like this.
using (var document = WordprocessingDocument.Open(#"C:\Test\myTestDocument.docx", false))
{
//just grab all the parts, might be relevant to be a bit more clever about it, depending on sizes of files and how many files you want to search through
foreach(var part in document.MainDocumentPart.Parts)
{
//foreach part see if that part containts an EmbeddedPackagePart
var testForEmbedding = part.OpenXmlPart.GetPartsOfType<EmbeddedPackagePart>();
foreach(EmbeddedPackagePart embedding in testForEmbedding)
{
//You should probably insert some clever naming scheme here..
string fileName = embedding.Uri.OriginalString.Split('/').Last();
//stream the EmbeddedPackagePart to a file
using(FileStream myFile = File.Create(#"C:\test\" + fileName))
using (var stream = embedding.GetStream())
{
stream.Seek(0, SeekOrigin.Begin);
stream.CopyTo(myFile);
myFile.Close();
}
}
}
}
I hope this helps!

Related

sharpziplib FastZip extract rar error cannot find central directory c#

i want to extract RAR file using FastZip, here is my code :
FastZip fastZip = new FastZip();
fastZip.CreateEmptyDirectories = true;
if (password != "")
{
fastZip.Password = password;
}
string fileFilter = null;
fastZip.ExtractZip(CompressedFilePathValue, OutputFolderPathValue, fileFilter);
but i always get error:
cannot find central directory
the RAR file is ok,i open it with WinRAR without error, so how to extract RAR file using sharpziplib with FastZip or without FastZip?
Note: I do not want to use SharpCompress because i dose not support password.
Any way to extract RAR file using sharpziplib?
Thanks for help

here is how to extract RAR file ,without error cannot find central directory:
using (Stream fs = File.OpenRead(CompressedFilePathValue))
using (var zf = new ZipFile(fs))
{
if (!String.IsNullOrEmpty(password))
{
// AES encrypted entries are handled automatically
zf.Password = password;
}
foreach (ZipEntry zipEntry in zf)
{
if (!zipEntry.IsFile)
{
// Ignore directories
continue;
}
String entryFileName = zipEntry.Name;
// to remove the folder from the entry:
//entryFileName = Path.GetFileName(entryFileName);
// Optionally match entrynames against a selection list here
// to skip as desired.
// The unpacked length is available in the zipEntry.Size property.
// Manipulate the output filename here as desired.
var fullZipToPath = Path.Combine(OutputFolderPathValue, entryFileName);
var directoryName = Path.GetDirectoryName(fullZipToPath);
if (directoryName.Length > 0)
{
Directory.CreateDirectory(directoryName);
}
// 4K is optimum
var buffer = new byte[4096];
// Unzip file in buffered chunks. This is just as fast as unpacking
// to a buffer the full size of the file, but does not waste memory.
// The "using" will close the stream even if an exception occurs.
using (var zipStream = zf.GetInputStream(zipEntry))
using (Stream fsOutput = File.Create(fullZipToPath))
{
StreamUtils.Copy(zipStream, fsOutput, buffer);
}
}
}
To be honest this work only with rar file created with sharpziplib , it does not open rar created with winrar

Unable to merge 2 PDFs using MemoryStream

I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to preform the merge, and at the time of debug, I can see that the finalStream.length is equal to the sum of the previous two PDFs. But when I try to open the PDF, I see the content of just 1 of the two PDFs.
The same can be seen below:
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}

Merging pdf in C# or any other language is not straight forward with out using 3rd party library.
I assume your requirement for not using library is that most Free libraries, nuget packages has limitation or/and cost money for commercial use.
I have made research and found you an Open Source library called PdfClown with nuget package, it is also available for Java. It is Free with out limitation (donate if you like). The library has a lot of features. One such you can merge 2 or more documents to one document.
I supply my example that take a folder with multiple pdf files, merged it and save it to same or another folder. It is also possible to use MemoryStream, but I do not find it necessary in this case.
The code is self explaining, the key point here is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
var list = Directory.GetFiles(Path.GetFullPath(srcPath));
if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
return;
var files = list.Select(File.ReadAllBytes).ToList();
using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
{
var document = dest.Document;
var builder = new org.pdfclown.tools.PageManager(document);
foreach (var file in files.Skip(1))
{
using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
{ builder.Add(src.Document); }
}
dest.Save(destFile, SerializationModeEnum.Incremental);
}
}
To test it
var srcPath = #"C:\temp\pdf\input";
var destFile = #"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Input examples
PDF doc A and PDF doc B
Output example
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: A part of this answer is taken from my my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code to github.

This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
CopyPages(one, outPdf);
CopyPages(two, outPdf);
outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
for (int i = 0; i < from.PageCount; i++)
{
to.AddPage(from.Pages[i]);
}
}

That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or developing your own solution.

PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first page, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to re-build the xref table. The xref table is a cross reference table that describes the content of the document and more importantly where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in it's xref table by that much, and then add it's xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.

Append text to existing Gzip file

I'm working on a log program, which would dump data into gzip archive.
The first entry would look like this:
using (var fs = File.OpenWrite(logFile))
{
using (var gs = new GZipStream(fs, CompressionMode.Compress))
{
using (var sw = new StreamWriter(gs))
{
sw.WriteLine(logEntry);
}
}
}
Now I want add other lines to that file without having to re-read all file content and than to re-write it in a way that the result can be read with a single GZipStream.
What is the best way to do that?

You can use gzlog.h and gzlog.c from the zlib distribution in the examples directory. They do exactly what you're looking for.

Creating an Epub file with a Zip library

HI All,
I am trying to zip up an Epub file i have made using c#
Things I have tried
Dot Net Zip http://dotnetzip.codeplex.com/
- DotNetZip works but epubcheck fails the resulting file (**see edit below)
ZipStorer zipstorer.codeplex.com
- creates an epub file that passes validation but the file won't open in Adobe Digital Editions
7 zip
- I have not tried this using c# but when i zip the file using there interface it tells me that the mimetype file name has a length of 9 and it should be 8
In all cases the mimetype file is the first file added to the archive and is not compressed
The Epub validator that I'am using is epubcheck http://code.google.com/p/epubcheck/
if anyone has succesfully zipped an epub file with one of these libraries please let me know how or if anyone has zipped an epub file successfully with any other open source zipping api that would also work.
EDIT
DotNetZip works, see accepted answer below.

If you need to control the order of the entries in the ZIP file, you can use DotNetZip and the ZipOutputStream.
You said you tried DotNetZip and it (the epub validator) gave you an error complaining about the mime type thing. This is probably because you used the ZipFile type within DotNetZip. If you use ZipOutputStream, you can control the ordering of the zip entries, which is apparently important for epub (I don't know the format, just surmising).
EDIT
I just checked, and the epub page on Wikipedia describes how you need to format the .epub file. It says that the mimetype file must contain specific text, must be uncompressed and unencrypted, and must appear as the first file in the ZIP archive.
Using ZipOutputStream, you would do this by setting CompressionLevel = None on that particular ZipEntry - that value is not the default.
Here's some sample code:
private void Zipup()
{
string _outputFileName = "Fargle.epub";
using (FileStream fs = File.Open(_outputFileName, FileMode.Create, FileAccess.ReadWrite ))
{
using (var output= new ZipOutputStream(fs))
{
var e = output.PutNextEntry("mimetype");
e.CompressionLevel = CompressionLevel.None;
byte[] buffer= System.Text.Encoding.ASCII.GetBytes("application/epub+zip");
output.Write(buffer,0,buffer.Length);
output.PutNextEntry("META-INF/container.xml");
WriteExistingFile(output, "META-INF/container.xml");
output.PutNextEntry("OPS/"); // another directory
output.PutNextEntry("OPS/whatever.xhtml");
WriteExistingFile(output, "OPS/whatever.xhtml");
// ...
}
}
}
private void WriteExistingFile(Stream output, string filename)
{
using (FileStream fs = File.Open(fileName, FileMode.Read))
{
int n = -1;
byte[] buffer = new byte[2048];
while ((n = fs.Read(buffer,0,buffer.Length)) > 0)
{
output.Write(buffer,0,n);
}
}
}
See the documentation for ZipOutputStream here.

Why not make life easier?
private void IonicZip()
{
string sourcePath = "C:\\pulications\\";
string fileName = "filename.epub";
// Creating ZIP file and writing mimetype
using (ZipOutputStream zs = new ZipOutputStream(sourcePath + fileName))
{
var o = zs.PutNextEntry("mimetype");
o.CompressionLevel = CompressionLevel.None;
byte[] mimetype = System.Text.Encoding.ASCII.GetBytes("application/epub+zip");
zs.Write(mimetype, 0, mimetype.Length);
}
// Adding META-INF and OEPBS folders including files
using (ZipFile zip = new ZipFile(sourcePath + fileName))
{
zip.AddDirectory(sourcePath + "META-INF", "META-INF");
zip.AddDirectory(sourcePath + "OEBPS", "OEBPS");
zip.Save();
}
}

For anyone like me who's searching for other ways to do this, I would like to add that the ZipStorer class from Jaime Olivares is a great alternative. You can copy the code right into your project, and it's very easy to choose between 'deflate' and 'store'.
https://github.com/jaime-olivares/zipstorer
Here's my code for creating an EPUB:
Dictionary<string, string> FilesToZip = new Dictionary<string, string>()
{
{ ConfigPath + #"mimetype", #"mimetype"},
{ ConfigPath + #"container.xml", #"META-INF/container.xml" },
{ OutputFolder + Name.Output_OPF_Name, #"OEBPS/" + Name.Output_OPF_Name},
{ OutputFolder + Name.Output_XHTML_Name, #"OEBPS/" + Name.Output_XHTML_Name},
{ ConfigPath + #"style.css", #"OEBPS/style.css"},
{ OutputFolder + Name.Output_NCX_Name, #"OEBPS/" + Name.Output_NCX_Name}
};
using (ZipStorer EPUB = ZipStorer.Create(OutputFolder + "book.epub", ""))
{
bool First = true;
foreach (KeyValuePair<string, string> File in FilesToZip)
{
if (First) { EPUB.AddFile(ZipStorer.Compression.Store, File.Key, File.Value, ""); First = false; }
else EPUB.AddFile(ZipStorer.Compression.Deflate, File.Key, File.Value, "");
}
}
This code creates a perfectly valid EPUB file. However, if you don't need to worry about validation, it seems most eReaders will accept an EPUB with a 'deflate' mimetype. So my previous code using .NET's ZipArchive produced EPUBs that worked in Adobe Digital Editions and a PocketBook.

SharpZipLib : Compressing a single file to a single compressed file

I am currently working with SharpZipLib under .NET 2.0 and via this I need to compress a single file to a single compressed archive. In order to do this I am currently using the following:
string tempFilePath = #"C:\Users\Username\AppData\Local\Temp\tmp9AE0.tmp.xml";
string archiveFilePath = #"C:\Archive\Archive_[UTC TIMESTAMP].zip";
FileInfo inFileInfo = new FileInfo(tempFilePath);
ICSharpCode.SharpZipLib.Zip.FastZip fZip = new ICSharpCode.SharpZipLib.Zip.FastZip();
fZip.CreateZip(archiveFilePath, inFileInfo.Directory.FullName, false, inFileInfo.Name);
This works exactly (ish) as it should, however while testing I have encountered a minor gotcha. Lets say that my temp directory (i.e. the directory that contains the uncompressed input file) contains the following files:
tmp9AE0.tmp.xml //The input file I want to compress
xxx_tmp9AE0.tmp.xml // Some other file
yyy_tmp9AE0.tmp.xml // Some other file
wibble.dat // Some other file
When I run the compression all the .xml files are included in the compressed archive. The reason for this is because of the final fileFilter parameter passed to the CreateZip method. Under the hood SharpZipLib is performing a pattern match and this also picks up the files prefixed with xxx_ and yyy_. I assume it would also pick up anything postfixed as well.
So the question is, how can I compress a single file with SharpZipLib? Then again maybe the question is how can I format that fileFilter so that the match can only ever pick up the file I want to compress and nothing else.
As an aside, is there any reason as to why System.IO.Compression not include a ZipStream class? (It only supports GZipStream)
EDIT : Solution (Derived from accepted answer from Hans Passant)
This is the compression method I implemented:
private static void CompressFile(string inputPath, string outputPath)
{
FileInfo outFileInfo = new FileInfo(outputPath);
FileInfo inFileInfo = new FileInfo(inputPath);
// Create the output directory if it does not exist
if (!Directory.Exists(outFileInfo.Directory.FullName))
{
Directory.CreateDirectory(outFileInfo.Directory.FullName);
}
// Compress
using (FileStream fsOut = File.Create(outputPath))
{
using (ICSharpCode.SharpZipLib.Zip.ZipOutputStream zipStream = new ICSharpCode.SharpZipLib.Zip.ZipOutputStream(fsOut))
{
zipStream.SetLevel(3);
ICSharpCode.SharpZipLib.Zip.ZipEntry newEntry = new ICSharpCode.SharpZipLib.Zip.ZipEntry(inFileInfo.Name);
newEntry.DateTime = DateTime.UtcNow;
zipStream.PutNextEntry(newEntry);
byte[] buffer = new byte[4096];
using (FileStream streamReader = File.OpenRead(inputPath))
{
ICSharpCode.SharpZipLib.Core.StreamUtils.Copy(streamReader, zipStream, buffer);
}
zipStream.CloseEntry();
zipStream.IsStreamOwner = true;
zipStream.Close();
}
}
}

This is an XY problem, just don't use FastZip. Follow the first example on this web page to avoid accidents.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Extract embedded package files from word document using open xml? - c#

Related

sharpziplib FastZip extract rar error cannot find central directory c#

Unable to merge 2 PDFs using MemoryStream

Append text to existing Gzip file

Creating an Epub file with a Zip library

SharpZipLib : Compressing a single file to a single compressed file

Categories

Resources