When decoding an image within a PDF as FlateDecode via iTextSharp the image is distorted and I can't seem to figure out why.
The recognized bpp is Format1bppIndexed. If I modify the PixelFormat to Format4bppIndexed the image is recognizable to some degree (shrunk, coloring is off but readable) and is duplicated 4 times in a horizontal manner. If I adjust the pixel format to Format8bppIndexed it is also recognizable to some degree and is duplicated 8 times in a horizontal manner.
The image below is after a Format1bppIndexed pixel format approach. Unfortunately I am unable to show the others due to security constraints.
The code below is essentially the only solution I have come across; it is repeated in various forms on both SO and the web.
int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
string filter = ((PdfArray)tg.Get(PdfName.FILTER))[0].ToString();
string width = tg.Get(PdfName.WIDTH).ToString();
string height = tg.Get(PdfName.HEIGHT).ToString();
string bpp = tg.Get(PdfName.BITSPERCOMPONENT).ToString();
if (filter == "/FlateDecode")
{
bytes = PdfReader.FlateDecode(bytes, true);
System.Drawing.Imaging.PixelFormat pixelFormat;
switch (int.Parse(bpp))
{
case 1:
pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
break;
case 8:
pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
break;
case 24:
pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
break;
default:
throw new Exception("Unknown pixel format " + bpp);
}
var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new System.Drawing.Rectangle(0, 0, Int32.Parse(width),
Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
bmp.UnlockBits(bmd);
bmp.Save(@"C:\temp\my_flate_picture-" + DateTime.Now.Ticks.ToString() + ".png", ImageFormat.Png);
}
What do I need to do so that my image extraction works as desired when dealing with FlateDecode?
NOTE: I do not want to use another library to extract the images. I am looking for a solution leveraging ONLY iTextSharp and the .NET FW. If a solution exists via Java (iText) and is easily portable to .NET FW bits that would suffice as well.
UPDATE: The ImageMask property is set to true, which would imply that there is no color space and is therefore implicitly black and white. With the bpp coming in at 1, the PixelFormat should be Format1bppIndexed which as mentioned earlier, produces the embedded image seen above.
UPDATE: To get the image size I extracted it out using Acrobat X Pro and the image size for this particular example was listed as 2403x3005. When extracting via iTextSharp the size was listed as 2544x3300. I modified the image size within the debugger to mirror 2403x3005 however upon calling Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length); I get an exception raised.
Attempted to read or write protected memory. This is often an
indication that other memory is corrupt.
My assumption is that this is due to the modification of the size and thus no longer corresponding to the byte data that is being used.
UPDATE: Per Jimmy's recommendation, I checked whether calling PdfReader.GetStreamBytes returns a byte[] length equal to width*height/8, since GetStreamBytes should be calling FlateDecode. Manually calling FlateDecode and calling PdfReader.GetStreamBytes both produced a byte[] length of 1049401, while width*height/8 is 2544*3300/8, or 1049400, so there is a difference of 1. I am not sure whether this off-by-one is the root cause, or how to resolve it if it is.
UPDATE: In trying the approach mentioned by kuujinbo, I am met with an IndexOutOfRangeException when I call renderInfo.GetImage(); within the RenderImage listener. The fact that width*height/8, as stated earlier, is off by 1 compared to the byte[] length returned by FlateDecode makes me think these are one and the same problem; however, a solution still eludes me.
at System.util.zlib.Adler32.adler32(Int64 adler, Byte[] buf, Int32 index, Int32 len)
at System.util.zlib.ZStream.read_buf(Byte[] buf, Int32 start, Int32 size)
at System.util.zlib.Deflate.fill_window()
at System.util.zlib.Deflate.deflate_slow(Int32 flush)
at System.util.zlib.Deflate.deflate(ZStream strm, Int32 flush)
at System.util.zlib.ZStream.deflate(Int32 flush)
at System.util.zlib.ZDeflaterOutputStream.Write(Byte[] b, Int32 off, Int32 len)
at iTextSharp.text.pdf.codec.PngWriter.WriteData(Byte[] data, Int32 stride)
at iTextSharp.text.pdf.parser.PdfImageObject.DecodeImageBytes()
at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples)
at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
at iTextSharp.text.pdf.parser.ImageRenderInfo.PrepareImageObject()
at iTextSharp.text.pdf.parser.ImageRenderInfo.GetImage()
at cyos.infrastructure.Core.MyImageRenderListener.RenderImage(ImageRenderInfo renderInfo)
UPDATE: Trying the various methods listed here, both my original solution and the solution posed by kuujinbo, with a different page in the PDF produces imagery; however, the issues always surface when the filter type is /FlateDecode, and no image is produced for that given instance.
Try copying your data row by row; maybe it will solve the problem.
int w = imgObj.GetAsNumber(PdfName.WIDTH).IntValue;
int h = imgObj.GetAsNumber(PdfName.HEIGHT).IntValue;
int bpp = imgObj.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;
var pixelFormat = PixelFormat.Format1bppIndexed;
byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObj);
byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, imgObj.GetAsDict(PdfName.DECODEPARMS));
// byte[] streamBytes = PdfReader.GetStreamBytes((PRStream)imgObj); // same result as above 3 lines of code.
using (Bitmap bmp = new Bitmap(w, h, pixelFormat))
{
var bmpData = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.WriteOnly, pixelFormat);
int length = (int)Math.Ceiling(w * bpp / 8.0);
for (int i = 0; i < h; i++)
{
int offset = i * length;
int scanOffset = i * bmpData.Stride;
// IntPtr.Add (.NET 4+) avoids ToInt32(), which overflows in a 64-bit process
Marshal.Copy(streamBytes, offset, IntPtr.Add(bmpData.Scan0, scanOffset), length);
}
bmp.UnlockBits(bmpData);
bmp.Save(fileName);
}
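To illustrate why copying row by row can matter: a GDI+ bitmap pads each scan line to a multiple of 4 bytes, while the decoded PDF image data pads each row only to a whole byte. The following is a minimal sketch using plain byte arrays, not part of the answer above; the class and method names (RowCopy, PackedRowBytes, BitmapStride, PadRows) are made up for illustration. For the 2544-pixel-wide 1bpp image in the question, it gives 318 bytes per source row but a stride of 320, so a single bulk Marshal.Copy would shift every row after the first.

```csharp
using System;

public static class RowCopy
{
    // Bytes per row in the decoded PDF stream: each row is padded to a whole byte.
    public static int PackedRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Bytes per row in a GDI+ bitmap: each row is padded to a multiple of 4 bytes.
    public static int BitmapStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Re-lays tightly packed rows into a buffer that uses the bitmap stride.
    public static byte[] PadRows(byte[] packed, int width, int height, int bitsPerPixel)
    {
        int rowBytes = PackedRowBytes(width, bitsPerPixel);
        int stride = BitmapStride(width, bitsPerPixel);
        var padded = new byte[stride * height];
        for (int y = 0; y < height; y++)
        {
            // Copy one source row to the start of the matching padded row.
            Buffer.BlockCopy(packed, y * rowBytes, padded, y * stride, rowBytes);
        }
        return padded;
    }
}
```

If the two values are equal for a given image, the bulk copy and the row-by-row copy behave the same, so this only explains the distortion when the width is not a multiple of 32 pixels at 1bpp.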
If you're able to use the latest version (5.1.3), the API to extract FlateDecode and other image types has been simplified using the iTextSharp.text.pdf.parser namespace. Basically you use a PdfReaderContentParser to help you parse the PDF document, then you implement the IRenderListener interface specifically (in this case) to deal with images. Here's a working example HTTP handler:
<%@ WebHandler Language="C#" Class="bmpExtract" %>
using System;
using System.Collections.Generic;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
public class bmpExtract : IHttpHandler {
public void ProcessRequest (HttpContext context) {
HttpServerUtility Server = context.Server;
HttpResponse Response = context.Response;
PdfReader reader = new PdfReader(Server.MapPath("./bmp.pdf"));
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener();
for (int i = 1; i <= reader.NumberOfPages; i++) {
parser.ProcessContent(i, listener);
}
for (int i = 0; i < listener.Images.Count; ++i) {
string path = Server.MapPath("./" + listener.ImageNames[i]);
using (FileStream fs = new FileStream(
path, FileMode.Create, FileAccess.Write
))
{
fs.Write(listener.Images[i], 0, listener.Images[i].Length);
}
}
}
public bool IsReusable { get { return false; } }
public class MyImageRenderListener : IRenderListener {
public void RenderText(TextRenderInfo renderInfo) { }
public void BeginTextBlock() { }
public void EndTextBlock() { }
public List<byte[]> Images = new List<byte[]>();
public List<string> ImageNames = new List<string>();
public void RenderImage(ImageRenderInfo renderInfo) {
PdfImageObject image = null;
try {
image = renderInfo.GetImage();
if (image == null) return;
ImageNames.Add(string.Format(
"Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
));
using (MemoryStream ms = new MemoryStream(image.GetImageAsBytes())) {
Images.Add(ms.ToArray());
}
}
catch (IOException ie) {
/*
* pass-through; image type not supported by iText[Sharp]; e.g. jbig2
*/
}
}
}
}
The iText[Sharp] development team is still working on the implementation, so I can't say for sure if it will work in your case. But it does work on this simple example PDF. (used above and with a couple of other PDFs I tried with bitmap images)
EDIT: I've been experimenting with the new API too and made a mistake in the original code example above. Should have initialized the PdfImageObject to null outside the try..catch block. Correction made above.
Also, when I use the above code on an unsupported image type, (e.g. jbig2) I get a different Exception - "The color depth XX is not supported", where "XX" is a number. And iTextSharp does support FlateDecode in all the examples I've tried. (but that's not helping you in this case, I know)
Is the PDF produced by third-party software? (non-Adobe) From what I've read in the book, some third-party vendors produce PDFs that aren't completely up to spec, and iText[Sharp] can't deal with some of these PDFs, while Adobe products can. IIRC I've seen cases specific to some PDFs generated by Crystal Reports on the iText mailing list that caused problems, here's one thread.
Is there any way you can generate a test PDF with the software you're using with some non-sensitive FlateDecode image(s)? Then maybe someone here could help a little better.
Related
My game takes a screenshot each game loop and stores it in memory. The user can then press "print screen" to trigger "SaveScreenshots" (see code below), which stores each screenshot as a PNG and also compiles them into an AVI using SharpAvi. The saving of images works fine, and a ~2sec AVI is produced, but it doesn't show any video when played. It's just the placeholder VLC Player icon. I think this is very close to working, but I can't determine what's wrong. Please see my code below. If anyone has any ideas, I'd be very appreciative!
private Bitmap GrabScreenshot()
{
try
{
Bitmap bmp = new Bitmap(this.ClientSize.Width, this.ClientSize.Height);
System.Drawing.Imaging.BitmapData data =
bmp.LockBits(this.ClientRectangle, System.Drawing.Imaging.ImageLockMode.WriteOnly,
System.Drawing.Imaging.PixelFormat.Format24bppRgb);
GL.ReadPixels(0, 0, this.ClientSize.Width, this.ClientSize.Height, PixelFormat.Bgr, PixelType.UnsignedByte,
data.Scan0);
bmp.UnlockBits(data);
bmp.RotateFlip(RotateFlipType.RotateNoneFlipY);
return bmp;
} catch(Exception ex)
{
// occasionally getting GDI generic exception when rotating the image... skip that one.
return null;
}
}
private void SaveScreenshots()
{
var directory = "c:\\helioscreenshots\\";
var rootFileName = string.Format("{0}_", DateTime.UtcNow.Ticks);
var writer = new AviWriter(directory + rootFileName + ".avi")
{
FramesPerSecond = 30,
// Emitting AVI v1 index in addition to OpenDML index (AVI v2)
// improves compatibility with some software, including
// standard Windows programs like Media Player and File Explorer
EmitIndex1 = true
};
// returns IAviVideoStream
var aviStream = writer.AddVideoStream();
// match the video frame size to the window's client area
aviStream.Width = this.ClientSize.Width;
aviStream.Height = this.ClientSize.Height;
// class SharpAvi.KnownFourCCs.Codecs contains FOURCCs for several well-known codecs
// Uncompressed is the default value, just set it for clarity
aviStream.Codec = KnownFourCCs.Codecs.Uncompressed;
// Uncompressed format requires to also specify bits per pixel
aviStream.BitsPerPixel = BitsPerPixel.Bpp32;
var index = 0;
while (this.Screenshots.Count > 0)
{
Bitmap screenshot = this.Screenshots.Dequeue();
var screenshotBytes = ImageToBytes(screenshot);
// write data to a frame
aviStream.WriteFrame(true, // is key frame? (many codecs use concept of key frames, for others - all frames are keys)
screenshotBytes, // array with frame data
0, // starting index in the array
screenshotBytes.Length); // length of the data
// save it!
// NOTE: compared jpeg, gif, and png. PNG had smallest file size.
index++;
screenshot.Save(directory + rootFileName + index + ".png", System.Drawing.Imaging.ImageFormat.Png);
}
// save the AVI!
writer.Close();
}
public static byte[] ImageToBytes(Image img)
{
using (var stream = new MemoryStream())
{
img.Save(stream, System.Drawing.Imaging.ImageFormat.Png);
return stream.ToArray();
}
}
From what I see, you're providing the byte-array in png-encoding, yet the stream is configured as KnownFourCCs.Codecs.Uncompressed.
Furthermore, from the manual:
AVI expects uncompressed data in format of standard Windows DIB, that is bottom-up bitmap of the specified bit-depth. For each frame, put its data in byte array and call IAviVideoStream.WriteFrame()
Next, all encoders expect input image data in specific format. It's BGR32 top-down - 32 bits per pixel, blue byte first, alpha byte not used, top line goes first. This is the format you can often get from existing images. [...] So, you simply pass an uncompressed top-down BGR32
I would retrieve the byte-array directly from the Bitmap using LockBits and Marshal.Copy as described in the manual.
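A minimal sketch of that suggestion follows, assuming System.Drawing is available. The class and method names (FrameBytes, ToBgr32) are invented for illustration. It returns top-down 32-bit pixel data with the blue byte first, which matches the BGR32 description quoted above; whether the uncompressed stream additionally needs the rows flipped to bottom-up is something to check against the SharpAvi version in use.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

public static class FrameBytes
{
    // Returns the raw pixels of bmp as top-down 32bpp data (B, G, R, unused).
    public static byte[] ToBgr32(Bitmap bmp)
    {
        var rect = new Rectangle(0, 0, bmp.Width, bmp.Height);
        // LockBits converts to the requested format if the bitmap uses another one.
        BitmapData data = bmp.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppRgb);
        try
        {
            int rowBytes = bmp.Width * 4;
            var buffer = new byte[rowBytes * bmp.Height];
            for (int y = 0; y < bmp.Height; y++)
            {
                // Copy row by row so the code stays correct even if Stride != rowBytes.
                Marshal.Copy(IntPtr.Add(data.Scan0, y * data.Stride), buffer, y * rowBytes, rowBytes);
            }
            return buffer;
        }
        finally
        {
            bmp.UnlockBits(data);
        }
    }
}
```

In SaveScreenshots, the result of such a helper would be passed to aviStream.WriteFrame in place of the PNG bytes produced by ImageToBytes.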
I am uploading JPEG images as fast as I can to a web service (it is the requirement I have been given).
I am using an async call to the web service, and I am calling it from within a timer.
I am trying to optimise as much as possible and tend to use an old laptop for testing. On a normal/reasonable build PC all is OK. On the laptop I get high RAM usage.
I know I will get a higher RAM usage using that old laptop but I want to know the lowest spec PC the app will work on.
As you can see in the code below I am converting the jpeg image into a byte array and then I upload the byte array.
If I can reduce/compress/zip the byte array, then I am hoping this will be one way of improving memory usage.
I know JPEGs are already compressed, but if I compare the current byte array with the previous byte array and upload only the difference between the two, I could perhaps compress it even more, on the basis that some of the byte values will be zero.
If I used a video encoder (which would do the trick), I would not be as close to real time as I would like.
Is there an optimum way of comparing 2 byte arrays and outputting to a 3rd byte array? I have looked around but could not find an answer that I liked.
This is my code on the client:
bool _uploaded = true;
private void tmrLiveFeed_Tick(object sender, EventArgs e)
{
try
{
if (_uploaded)
{
_uploaded = false;
_live.StreamerAsync(Shared.Alias, imageToByteArray((Bitmap)_frame.Clone()), Guid.NewGuid().ToString()); //web service being called here
}
}
catch (Exception _ex)
{
//do some thing but probably time out error here
}
}
//web service has finished the client invoke
void _live_StreamerCompleted(object sender, AsyncCompletedEventArgs e)
{
_uploaded = true; //we are now saying we start to upload the next byte array
}
private wsLive.Live _live = new wsLive.Live(); //web service
private byte[] imageToByteArray(Image imageIn)
{
MemoryStream ms = new MemoryStream();
imageIn.Save(ms,System.Drawing.Imaging.ImageFormat.Jpeg); //convert image to best image compression
imageIn.Dispose();
return ms.ToArray();
}
thanks...
As C.Evenhuis said, JPEG files are compressed, and changing even a few pixels results in a completely different file. So comparing the resulting JPEG files is useless.
BUT you can compare your Image objects - quick search results in finding this:
unsafe Bitmap PixelDiff(Bitmap a, Bitmap b)
{
Bitmap output = new Bitmap(a.Width, a.Height, PixelFormat.Format32bppArgb);
Rectangle rect = new Rectangle(Point.Empty, a.Size);
using (var aData = a.LockBitsDisposable(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb))
using (var bData = b.LockBitsDisposable(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb))
using (var outputData = output.LockBitsDisposable(rect, ImageLockMode.ReadWrite, PixelFormat.Format32bppArgb))
{
byte* aPtr = (byte*)aData.Scan0;
byte* bPtr = (byte*)bData.Scan0;
byte* outputPtr = (byte*)outputData.Scan0;
int len = aData.Stride * aData.Height;
for (int i = 0; i < len; i++)
{
// For alpha use the average of both images (otherwise pixels with the same alpha won't be visible)
if ((i + 1) % 4 == 0)
*outputPtr = (byte)((*aPtr + *bPtr) / 2);
else
*outputPtr = (byte)~(*aPtr ^ *bPtr);
outputPtr++;
aPtr++;
bPtr++;
}
}
return output;
}
If your goal is to find out whether two byte arrays contain exactly the same data, you can create an MD5 hash and compare these as others have suggested. However in your question you mention you want to upload the difference which means the result of the comparison must be more than a simple yes/no.
As JPEGs are already compressed, the smallest change to the image could lead to a large difference in the binary data. I don't think any two JPEGs contain binary data similar enough to easily compare.
For BMP files you may find that changing a single pixel affects only one or a few bytes, and more importantly, the data for the pixel at a certain offset in the image is located at the same position in both binary files (given that both images are of equal size and color depth). So for BMPs the difference in binary data directly relates to the difference in the images.
In short, I don't think obtaining the binary difference between JPEG files will improve the size of the data to be sent.
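For uncompressed pixel buffers of equal size (the BMP case described above), the "compare two byte arrays and output a third" step can be sketched as a byte-wise XOR. This is only an illustration under that assumption; the class and method names (ByteDiff, Xor) are made up here. Unchanged bytes become zero, which a general-purpose compressor can then shrink well, and the receiver can rebuild the current frame by XOR-ing the diff with the previous frame.

```csharp
using System;

public static class ByteDiff
{
    // Returns previous XOR current; bytes that did not change come out as 0.
    public static byte[] Xor(byte[] previous, byte[] current)
    {
        if (previous == null) throw new ArgumentNullException("previous");
        if (current == null) throw new ArgumentNullException("current");
        if (previous.Length != current.Length)
            throw new ArgumentException("Buffers must have the same length.");

        var diff = new byte[current.Length];
        for (int i = 0; i < diff.Length; i++)
        {
            diff[i] = (byte)(previous[i] ^ current[i]);
        }
        return diff;
    }
}
```

Because XOR is its own inverse, calling Xor(previous, diff) on the server yields the current buffer again. None of this helps for JPEG bytes, for the reasons given above.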
Is there a way to get the "Path" to a memorystream?
For example, if I want to use CMD and point to a file path like "C:...", but the file is in a MemoryStream instead, is it possible to point to it there?
I have tried searching for this, but I can't find any clear information.
EDIT:
If it helps, the thing I want to access is an image file, a print screen like this:
using (Bitmap b = new Bitmap(Screen.PrimaryScreen.Bounds.Width, Screen.PrimaryScreen.Bounds.Height))
{
using (Graphics g = Graphics.FromImage(b))
{
g.CopyFromScreen(0, 0, 0, 0, Screen.PrimaryScreen.Bounds.Size, CopyPixelOperation.SourceCopy);
}
using (MemoryStream ms = new MemoryStream())
{
b.Save(ms, ImageFormat.Bmp);
StreamReader read = new StreamReader(ms);
ms.Position = 0;
var cwebp = new Process
{
StartInfo =
{
WindowStyle = ProcessWindowStyle.Normal,
FileName = "cwebp.exe",
Arguments = string.Format(
"-q 100 -lossless -m 6 -alpha_q 100 \"{0}\" -o \"{1}\"", ms, @"C:\test.webp")
},
};
cwebp.Start();
}
}
and then some random testing to get it to work...
The thing I want to pass it to is cwebp, a WebP encoder.
That is why I must use CMD: I can't work with it at the C# level, or else I wouldn't have this problem.
Yeah that is usually protected. If you know where it is, you might be able to grab it with an unsafe pointer. It might be easier to write it to a text file that cmd could read, or push it to Console to read.
If using .NET 4.0 or greater you can use a MemoryMappedFile. I haven't toyed with this class since 4.0 beta. However, my understanding is that it's useful for writing memory to disk in cases where you are dealing with large amounts of data or want some level of application memory sharing.
Usage per MSDN:
static void Main(string[] args)
{
long offset = 0x10000000; // 256 megabytes
long length = 0x20000000; // 512 megabytes
// Create the memory-mapped file.
using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\ExtremelyLargeImage.data", FileMode.Open,"ImgA"))
{
// Create a random access view, from the 256th megabyte (the offset)
// to the 768th megabyte (the offset plus length).
using (var accessor = mmf.CreateViewAccessor(offset, length))
{
int colorSize = Marshal.SizeOf(typeof(MyColor));
MyColor color;
// Make changes to the view.
for (long i = 0; i < length; i += colorSize)
{
accessor.Read(i, out color);
color.Brighten(10);
accessor.Write(i, ref color);
}
}
}
}
If cwebp.exe is expecting a filename, there is nothing you can put on the command line that satisfies your criteria. Anything enough like a file that the external program can open it won't be able to get its data from your program's memory. There are a few possibilities, but they probably all require changes to cwebp.exe:
You can write to the new process's standard in
You can create a named pipe from which the process can read your data
You can create a named shared memory object from which the other process can read
You haven't said why you're avoiding writing to a file, so it's hard to say which is best.
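As a rough sketch of the first possibility (writing to the new process's standard input), the .NET side could look like the code below. The class and method names (ProcessPipe, RunWithStdin) are invented for illustration, and it only helps if the external program actually reads its input from standard input; whether cwebp.exe can do that is not established here.

```csharp
using System;
using System.Diagnostics;

public static class ProcessPipe
{
    // Starts exePath with the given arguments, writes data to its standard
    // input, waits for it to finish, and returns its exit code.
    public static int RunWithStdin(string exePath, string arguments, byte[] data)
    {
        var psi = new ProcessStartInfo(exePath, arguments)
        {
            UseShellExecute = false,       // required for stream redirection
            RedirectStandardInput = true,
            CreateNoWindow = true
        };
        using (Process child = Process.Start(psi))
        {
            // Write the raw bytes, bypassing the text encoding of the StreamWriter.
            child.StandardInput.BaseStream.Write(data, 0, data.Length);
            child.StandardInput.BaseStream.Flush();
            // Closing the writer closes the pipe, which signals end-of-input.
            child.StandardInput.Close();
            child.WaitForExit();
            return child.ExitCode;
        }
    }
}
```

With the MemoryStream from the question, the data argument would be ms.ToArray().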
I am currently trying to recompress a PDF that has already been created. I am trying to find a way to recompress the images in the document to reduce the file size.
I have been trying to do this with the DataLogics PDE and iTextSharp libraries, but I cannot find a way to recompress the streams.
I have thought about looping over the XObjects, getting the images, and then either dropping the DPI down to 96 or using the libjpeg C# implementation to change the quality of the image, but getting the result back into the PDF stream always seems to end in memory corruption or some other issue.
Any samples will be appreciated.
Thanks
iText and iTextSharp have some methods for replacing indirect objects. Specifically there's PdfReader.KillIndirect() which does what it says and PdfWriter.AddDirectImageSimple(iTextSharp.text.Image, PRIndirectReference) which you can then use to replace what you killed off.
In pseudo C# code you'd do:
var oldImage = PdfReader.GetPdfObject();
var newImage = YourImageCompressionFunction(oldImage);
PdfReader.KillIndirect(oldImage);
yourPdfWriter.AddDirectImageSimple(newImage, (PRIndirectReference)oldImage);
Converting the raw bytes to a .Net image can be tricky, I'll leave that up to you or you can search here. Mark has a good description here. Also, technically PDFs don't have a concept of DPI, that's for printers mostly. See the answer here for more on that.
Using the method above, your compression algorithm can actually do two things: physically shrink the image and apply JPEG compression. When you physically shrink the image and add it back, it will occupy the same amount of space on the page as the original image but with fewer pixels to work with. This will get you what you consider to be DPI reduction. The JPEG compression speaks for itself.
Below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It takes an existing JPEG on your desktop called "LargeImage.jpg" and creates a new PDF from it. Then it opens the PDF, extracts the image, physically shrinks it to 90% of the original size, applies 85% JPEG compression and writes it back to the PDF. See the comments in the code for more of an explanation. The code needs lots more null/error checking. Also look for NOTE comments where you'll need to expand to handle other situations.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Drawing2D;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1 {
public partial class Form1 : Form {
public Form1() {
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e) {
//Our working folder
string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
//Large image to add to sample PDF
string largeImage = Path.Combine(workingFolder, "LargeImage.jpg");
//Name of large PDF to create
string largePDF = Path.Combine(workingFolder, "Large.pdf");
//Name of compressed PDF to create
string smallPDF = Path.Combine(workingFolder, "Small.pdf");
//Create a sample PDF containing our large image, for demo purposes only, nothing special here
using (FileStream fs = new FileStream(largePDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (Document doc = new Document()) {
using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
iTextSharp.text.Image importImage = iTextSharp.text.Image.GetInstance(largeImage);
doc.SetPageSize(new iTextSharp.text.Rectangle(0, 0, importImage.Width, importImage.Height));
doc.SetMargins(0, 0, 0, 0);
doc.NewPage();
doc.Add(importImage);
doc.Close();
}
}
}
//Now we're going to open the above PDF and compress things
//Bind a reader to our large PDF
PdfReader reader = new PdfReader(largePDF);
//Create our output PDF
using (FileStream fs = new FileStream(smallPDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
//Bind a stamper to the file and our reader
using (PdfStamper stamper = new PdfStamper(reader, fs)) {
//NOTE: This code only deals with page 1, you'd want to loop more for your code
//Get page 1
PdfDictionary page = reader.GetPageN(1);
//Get the xobject structure
PdfDictionary resources = (PdfDictionary)PdfReader.GetPdfObject(page.Get(PdfName.RESOURCES));
PdfDictionary xobject = (PdfDictionary)PdfReader.GetPdfObject(resources.Get(PdfName.XOBJECT));
if (xobject != null) {
PdfObject obj;
//Loop through each key
foreach (PdfName name in xobject.Keys) {
obj = xobject.Get(name);
if (obj.IsIndirect()) {
//Get the current key as a PDF object
PdfDictionary imgObject = (PdfDictionary)PdfReader.GetPdfObject(obj);
//See if its an image
if (imgObject.Get(PdfName.SUBTYPE).Equals(PdfName.IMAGE)) {
//NOTE: There's a bunch of different types of filters, I'm only handling the simplest one here which is basically raw JPG, you'll have to research others
if (imgObject.Get(PdfName.FILTER).Equals(PdfName.DCTDECODE)) {
//Get the raw bytes of the current image
byte[] oldBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObject);
//Will hold bytes of the compressed image later
byte[] newBytes;
//Wrap a stream around our original image
using (MemoryStream sourceMS = new MemoryStream(oldBytes)) {
//Convert the bytes into a .Net image
using (System.Drawing.Image oldImage = Bitmap.FromStream(sourceMS)) {
//Shrink the image to 90% of the original
using (System.Drawing.Image newImage = ShrinkImage(oldImage, 0.9f)) {
//Convert the image to bytes using JPG at 85%
newBytes = ConvertImageToBytes(newImage, 85);
}
}
}
//Create a new iTextSharp image from our bytes
iTextSharp.text.Image compressedImage = iTextSharp.text.Image.GetInstance(newBytes);
//Kill off the old image
PdfReader.KillIndirect(obj);
//Add our image in its place
stamper.Writer.AddDirectImageSimple(compressedImage, (PRIndirectReference)obj);
}
}
}
}
}
}
}
this.Close();
}
//Standard image save code from MSDN, returns a byte array
private static byte[] ConvertImageToBytes(System.Drawing.Image image, long compressionLevel) {
if (compressionLevel < 0) {
compressionLevel = 0;
} else if (compressionLevel > 100) {
compressionLevel = 100;
}
ImageCodecInfo jgpEncoder = GetEncoder(ImageFormat.Jpeg);
System.Drawing.Imaging.Encoder myEncoder = System.Drawing.Imaging.Encoder.Quality;
EncoderParameters myEncoderParameters = new EncoderParameters(1);
EncoderParameter myEncoderParameter = new EncoderParameter(myEncoder, compressionLevel);
myEncoderParameters.Param[0] = myEncoderParameter;
using (MemoryStream ms = new MemoryStream()) {
image.Save(ms, jgpEncoder, myEncoderParameters);
return ms.ToArray();
}
}
//standard code from MSDN
private static ImageCodecInfo GetEncoder(ImageFormat format) {
ImageCodecInfo[] codecs = ImageCodecInfo.GetImageEncoders();
foreach (ImageCodecInfo codec in codecs) {
if (codec.FormatID == format.Guid) {
return codec;
}
}
return null;
}
//Standard high quality thumbnail generation from http://weblogs.asp.net/gunnarpeipman/archive/2009/04/02/resizing-images-without-loss-of-quality.aspx
private static System.Drawing.Image ShrinkImage(System.Drawing.Image sourceImage, float scaleFactor) {
int newWidth = Convert.ToInt32(sourceImage.Width * scaleFactor);
int newHeight = Convert.ToInt32(sourceImage.Height * scaleFactor);
var thumbnailBitmap = new Bitmap(newWidth, newHeight);
using (Graphics g = Graphics.FromImage(thumbnailBitmap)) {
g.CompositingQuality = CompositingQuality.HighQuality;
g.SmoothingMode = SmoothingMode.HighQuality;
g.InterpolationMode = InterpolationMode.HighQualityBicubic;
System.Drawing.Rectangle imageRectangle = new System.Drawing.Rectangle(0, 0, newWidth, newHeight);
g.DrawImage(sourceImage, imageRectangle);
}
return thumbnailBitmap;
}
}
}
I don't know about iTextSharp, but you have to rewrite a PDF file if anything is changed, as it contains an xref table (index) with the exact file position of each object. This means if even one byte is added or removed, the PDF becomes corrupted.
Your best bet for recompressing the images is JBIG2 if they are B&W, or JPEG2000 otherwise, for which Jasper library will happily encode JPEG2000 codestreams for placement into PDF files at whatever quality you so desire.
If it were me I'd do it all from code without the PDF libraries. Just find all images (anything between stream and endstream after an occurrence of JPXDecode (JPEG2000), JBIG2Decode (JBIG2) or DCTDecode (JPEG)), pull that out, re-encode it with Jasper, then stick it back in again and update the xref table.
To update the xref table, find the positions of each object (starting 00001 0 obj) and just update the new positions in the xref table. It's not too much work, less than it sounds. You might be able to get all the offsets with a single regular expression (I'm not a C# programmer, but in PHP it would be that simple.)
Then finally update the value of the startxref tag in the trailer with the offset of the beginning of the xref table (where it says xref in the file).
Otherwise you'll end up decoding the entire PDF and rewriting it all, which will be slow, and you might lose something along the way.
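A rough C# sketch of the offset-finding step described above is given below. It assumes a classic, uncompressed xref table (not the cross-reference streams or object streams of PDF 1.5+), and the class and method names (XrefSketch, FindObjectOffsets, FormatEntry) are made up for illustration. A plain regular expression can also match byte sequences inside binary stream data, so treat the result as a starting point rather than a robust parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class XrefSketch
{
    // Maps object number -> byte offset of its "N G obj" header.
    // If an object number occurs more than once, the last occurrence wins.
    public static SortedDictionary<int, long> FindObjectOffsets(byte[] pdf)
    {
        // Latin-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        var offsets = new SortedDictionary<int, long>();
        foreach (Match m in Regex.Matches(text, @"(?<![0-9])([0-9]+) ([0-9]+) obj\b"))
        {
            offsets[int.Parse(m.Groups[1].Value)] = m.Groups[1].Index;
        }
        return offsets;
    }

    // Formats one in-use xref entry: 10-digit offset, 5-digit generation, 'n'.
    // In the file each entry is followed by a two-character end-of-line.
    public static string FormatEntry(long offset, int generation)
    {
        return offset.ToString("D10") + " " + generation.ToString("D5") + " n";
    }
}
```

After rewriting the entries, the number following startxref has to be set to the byte offset of the xref keyword, as described above.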
There is an example of how to find and replace images in an existing PDF by the creator of iText. It's actually a small excerpt from his book. Since it's in Java, here's a simple C# port:
public void ReduceResolution(PdfReader reader, long quality) {
int n = reader.XrefSize;
for (int i = 0; i < n; i++) {
PdfObject obj = reader.GetPdfObject(i);
if (obj == null || !obj.IsStream()) {continue;}
PdfDictionary dict = (PdfDictionary)PdfReader.GetPdfObject(obj);
PdfName subType = (PdfName)PdfReader.GetPdfObject(
dict.Get(PdfName.SUBTYPE)
);
if (!PdfName.IMAGE.Equals(subType)) {continue;}
PRStream stream = (PRStream )obj;
try {
PdfImageObject image = new PdfImageObject(stream);
PdfName filter = (PdfName) image.Get(PdfName.FILTER);
if (
PdfName.JBIG2DECODE.Equals(filter)
|| PdfName.JPXDECODE.Equals(filter)
|| PdfName.CCITTFAXDECODE.Equals(filter)
|| PdfName.FLATEDECODE.Equals(filter)
) continue;
System.Drawing.Image img = image.GetDrawingImage();
if (img == null) continue;
var ll = image.GetImageBytesType();
int width = img.Width;
int height = img.Height;
using (System.Drawing.Bitmap dotnetImg =
new System.Drawing.Bitmap(img))
{
// set codec to jpeg type => jpeg index codec is "1"
System.Drawing.Imaging.ImageCodecInfo codec =
System.Drawing.Imaging.ImageCodecInfo.GetImageEncoders()[1];
// set parameters for image quality
System.Drawing.Imaging.EncoderParameters eParams =
new System.Drawing.Imaging.EncoderParameters(1);
eParams.Param[0] =
new System.Drawing.Imaging.EncoderParameter(
System.Drawing.Imaging.Encoder.Quality, quality
);
using (MemoryStream msImg = new MemoryStream()) {
dotnetImg.Save(msImg, codec, eParams);
msImg.Position = 0;
// store the JPEG bytes as-is; they must not be Flate-compressed again
stream.SetData(
msImg.ToArray(), false, PRStream.BEST_COMPRESSION
);
stream.Put(PdfName.TYPE, PdfName.XOBJECT);
stream.Put(PdfName.SUBTYPE, PdfName.IMAGE);
stream.Put(PdfName.FILTER, PdfName.DCTDECODE);
stream.Put(PdfName.WIDTH, new PdfNumber(width));
stream.Put(PdfName.HEIGHT, new PdfNumber(height));
stream.Put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
stream.Put(PdfName.COLORSPACE, PdfName.DEVICERGB);
}
}
}
catch {
// throw;
// iText[Sharp] can't handle all image types...
}
finally {
// may or may not help
reader.RemoveUnusedObjects();
}
}
}
You'll notice it's only handling JPEG. The logic is reversed (it skips a list of filters instead of explicitly handling only DCTDECODE/JPEG), so you can remove some of the skipped image types from that list and experiment with the PdfImageObject in the code above. In particular, most of the FLATEDECODE images (.bmp, .png, and .gif) are represented as PNG (confirmed in the DecodeImageBytes method of the PdfImageObject source code). As far as I know, .NET's built-in PNG encoder gives you no control over compression. There are some references to support this here and here. You can try a stand-alone PNG optimization executable, but you also have to figure out how to set PdfName.BITSPERCOMPONENT and PdfName.COLORSPACE in the PRStream.
For completeness' sake, since your question specifically asks about PDF compression, here's how you compress a PDF with iTextSharp:
PdfStamper stamper = new PdfStamper(
  reader, YOUR-STREAM, PdfWriter.VERSION_1_5
);
stamper.Writer.CompressionLevel = 9;
int total = reader.NumberOfPages + 1;
for (int i = 1; i < total; i++) {
  reader.SetPageContent(i, reader.GetPageContent(i));
}
stamper.SetFullCompression();
stamper.Close();
You might also try and run the PDF through PdfSmartCopy to get the file size down. It removes redundant resources, but like the call to RemoveUnusedObjects() in the finally block, it may or may not help. That will depend on how the PDF was created.
IIRC iText[Sharp] doesn't deal well with JBIG2DECODE, so @Alasdair's suggestion looks good, if you want to take the time to learn the Jasper library and use the brute-force approach.
Good luck.
EDIT - 2012-08-17, comment by @Craig:
To save the PDF after compressing the jpegs using the ReduceResolution() method above:
a. Instantiate a PdfReader object:
PdfReader reader = new PdfReader(pdf);
b. Pass the PdfReader to the ReduceResolution() method above.
c. Pass the altered PdfReader to a PdfStamper. Here's one way using a MemoryStream:
// Save the altered PDF; you can then pass the byte array to a database, etc.
using (MemoryStream ms = new MemoryStream()) {
  using (PdfStamper stamper = new PdfStamper(reader, ms)) {
    // nothing to do here: closing the stamper writes the PDF to ms
  }
  return ms.ToArray();
}
Or you can use any other Stream if you don't need to keep the PDF in memory. E.g. use a FileStream and save directly to disk.
I've written a library to do just that. It will also OCR the PDFs using Tesseract or Cuneiform and create searchable, compressed PDF files. It uses several open source projects (iTextSharp, jbig2 encoder, Aforge, muPDF#) to complete the task. You can check it out here: http://hocrtopdf.codeplex.com/
I am not sure if you are considering other libraries, but you can easily recompress existing images using the Docotic.Pdf library (Disclaimer: I work for the company).
Here is some sample code:
static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();
        doc.Save(outputName);
    }
}
There are also RecompressWithFlate, RecompressWithGroup3Fax, RecompressWithJpeg and Uncompress methods.
The library will convert color images to bilevel ones if needed. You can specify deflate compression level, JPEG quality etc.
I would also ask you to think twice before using the approach suggested by @Alasdair. If you are going to deal with PDF files that weren't created by you, then the task is far more complex than it might seem.
To start with, there are a great many images compressed by codecs other than JPXDecode, JBIG2Decode or DCTDecode. And a PDF can also contain inline images.
PDF files saved using newer versions of the standard (1.5 or newer) can contain cross-reference streams. This means that reading and updating such files is more complex than just finding and updating some numbers at the end of the file.
So, please, use a PDF library.
A simple way to compress PDF is using gsdll32.dll (Ghostscript) and Cyotek.GhostScript.dll (wrapper):
public static void CompressPDF(string sInFile, string sOutFile, int iResolution)
{
    string[] arg = new string[]
    {
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE",
        "-dSAFER",
        "-dBATCH",
        "-dCompatibilityLevel=1.5",
        "-dDownsampleColorImages=true",
        "-dDownsampleGrayImages=true",
        "-dDownsampleMonoImages=true",
        "-sPAPERSIZE=a4",
        "-dPDFFitPage",
        "-dDOINTERPOLATE",
        "-dColorImageDownsampleThreshold=1.0",
        "-dGrayImageDownsampleThreshold=1.0",
        "-dMonoImageDownsampleThreshold=1.0",
        "-dColorImageResolution=" + iResolution.ToString(),
        "-dGrayImageResolution=" + iResolution.ToString(),
        "-dMonoImageResolution=" + iResolution.ToString(),
        "-sOutputFile=" + sOutFile,
        sInFile
    };
    using(GhostScriptAPI api = new GhostScriptAPI())
    {
        api.Execute(arg);
    }
}
I'm getting some images from a webpage at a specified URL, and I want to get their heights and widths. I'm using something like this:
Stream str = null;
HttpWebRequest wReq = (HttpWebRequest)WebRequest.Create(ImageUrl);
HttpWebResponse wRes = (HttpWebResponse)(wReq).GetResponse();
str = wRes.GetResponseStream();
var imageOrig = System.Drawing.Image.FromStream(str);
int height = imageOrig.Height;
int width = imageOrig.Width;
My main concern with this is that the image file may actually be very large.
Is there anything I can do, e.g. specify that images should only be fetched if they are less than 1 MB?
Or is there a better alternative approach to getting the dimensions of an image from a webpage?
Thanks
Some (all?) image formats include the width and height in the header of the file. You could request just enough bytes to read the header and then parse them yourself. You can add a Range header to your web request so that only the start of the image file is fetched. The end offset is inclusive, so the call below requests bytes 0 through 50 (50 is just an example, you'd probably need less):
wReq.AddRange(0, 50);
I suppose this will only work if you know that the formats you are working with include this data.
Edit: Looks like I misunderstood the AddRange method before. It's fixed now. I also went ahead and tested it out by getting the width and height of a png using this documentation.
static void Main(string[] args)
{
    string imageUrl = "http://upload.wikimedia.org/wikipedia/commons/4/47/PNG_transparency_demonstration_1.png";
    byte[] pngSignature = new byte[] { 137, 80, 78, 71, 13, 10, 26, 10 };
    HttpWebRequest wReq = (HttpWebRequest)WebRequest.Create(imageUrl);
    wReq.AddRange(0, 30);
    byte[] buffer = new byte[30];
    int width = 0;
    int height = 0;
    using (WebResponse wRes = wReq.GetResponse())
    using (Stream stream = wRes.GetResponseStream())
    {
        // Read may return fewer bytes than requested, so loop until the buffer is full
        int read = 0, n;
        while (read < buffer.Length
            && (n = stream.Read(buffer, read, buffer.Length - read)) > 0)
        {
            read += n;
        }
    }
    // Check for Png
    // 8 byte - Signature
    // 4 byte - Chunk length
    // 4 byte - Chunk type - IHDR (Image Header)
    // 4 byte - Width
    // 4 byte - Height
    // Other stuff we don't care about
    if (buffer.Take(8).SequenceEqual(pngSignature))
    {
        // skip signature and chunk length; ihdr starts at the chunk type
        var ihdr = buffer.Skip(12);
        // PNG stores integers big-endian; Reverse() assumes a little-endian host
        width = BitConverter.ToInt32(ihdr.Skip(4).Take(4).Reverse().ToArray(), 0);
        height = BitConverter.ToInt32(ihdr.Skip(8).Take(4).Reverse().ToArray(), 0);
    }
    // Check for Jpg
    //else if (buffer.Take(?).SequenceEqual(jpgSignature))
    //{
    //    // Do Jpg stuff
    //}
    // Check for Gif
    //else if (etc...
    Console.WriteLine("Width: " + width);
    Console.WriteLine("Height: " + height);
    Console.ReadKey();
}
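The JPEG branch left as a stub in the code above could be filled in roughly as follows. This is only a sketch: the class and method names (JpegHeader, TryGetSize) are made up for illustration, and it assumes the bytes you downloaded reach as far as the Start of Frame segment. In files with large EXIF or thumbnail segments that can be many kilobytes in, so a 30-byte range will usually not be enough for JPEG.

```csharp
using System;

public static class JpegHeader
{
    // Hypothetical helper: walks the JPEG segment list until it finds a
    // Start of Frame (SOF) segment, which holds the image height and width.
    // Returns false if the data is not a JPEG or the SOF segment lies
    // beyond the bytes supplied.
    public static bool TryGetSize(byte[] data, out int width, out int height)
    {
        width = 0;
        height = 0;

        // Every JPEG starts with the SOI marker FF D8.
        if (data == null || data.Length < 4 || data[0] != 0xFF || data[1] != 0xD8)
            return false;

        int pos = 2;
        while (pos + 4 <= data.Length)
        {
            if (data[pos] != 0xFF) return false;      // not at a marker: give up
            byte marker = data[pos + 1];
            if (marker == 0xFF) { pos++; continue; }  // fill byte before a marker

            // Big-endian segment length; it includes the two length bytes.
            int length = (data[pos + 2] << 8) | data[pos + 3];

            // C0..CF are SOF markers, except C4 (DHT), C8 (JPG) and CC (DAC).
            bool isSof = marker >= 0xC0 && marker <= 0xCF
                         && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isSof)
            {
                // Layout after the length: precision (1), height (2), width (2).
                if (pos + 9 > data.Length) return false;
                height = (data[pos + 5] << 8) | data[pos + 6];
                width = (data[pos + 7] << 8) | data[pos + 8];
                return true;
            }

            pos += 2 + length;                        // jump to the next segment
        }
        return false;
    }
}
```

If TryGetSize returns false for data that does start with FF D8, one option is to repeat the request with a larger range and try again.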
You only need to download the header from a graphics file in order to find the picture size, see...
BMP: (only 26 bytes needed)
http://www.fileformat.info/format/bmp/corion.htm
JPG: (scan for "Start of Frame" marker)
http://wiki.tcl.tk/757
GIF: (10 bytes needed, i.e. first two words of Logical Screen Descriptor)
http://www.matthewflickinger.com/lab/whatsinagif/bits_and_bytes.asp
Also, notice how you can read the first couple of bytes to find out what the file type really is (don't rely on the extension of the filename; for example, a bmp could be named ".gif" by accident). Once you know the file type, you look at the spec to know what offset to read.
P.S. get yourself a hex editor, such as "Hex Editor XVI32", to see the file structure.
You can download XVI32 here: http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
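As a rough illustration of the offsets mentioned above, here is a sketch that first identifies the file type from its leading bytes and then reads the size from the header. The names (ImageKind, ImageHeader, Detect, TryGetSize) are invented for this example. It assumes BMP files use the common BITMAPINFOHEADER layout (32-bit width at offset 18 and height at offset 22); the older OS/2 core header stores 16-bit values instead and is not handled here.

```csharp
using System;

public enum ImageKind { Unknown, Bmp, Gif, Png }

public static class ImageHeader
{
    private static readonly byte[] PngSignature = { 137, 80, 78, 71, 13, 10, 26, 10 };

    // Identify the format from its magic bytes, not from the file extension.
    public static ImageKind Detect(byte[] h)
    {
        if (h == null) return ImageKind.Unknown;
        if (h.Length >= 2 && h[0] == 'B' && h[1] == 'M') return ImageKind.Bmp;
        if (h.Length >= 6 && h[0] == 'G' && h[1] == 'I' && h[2] == 'F' && h[3] == '8'
            && (h[4] == '7' || h[4] == '9') && h[5] == 'a') return ImageKind.Gif;
        if (h.Length >= 8)
        {
            bool match = true;
            for (int i = 0; i < 8; i++)
                if (h[i] != PngSignature[i]) match = false;
            if (match) return ImageKind.Png;
        }
        return ImageKind.Unknown;
    }

    // Returns false when the format is unknown or the header is too short.
    public static bool TryGetSize(byte[] h, out int width, out int height)
    {
        width = 0;
        height = 0;
        switch (Detect(h))
        {
            case ImageKind.Bmp:   // assumed BITMAPINFOHEADER: 26 bytes needed
                if (h.Length < 26) return false;
                width = ReadInt32LE(h, 18);
                height = Math.Abs(ReadInt32LE(h, 22)); // negative = top-down rows
                return true;
            case ImageKind.Gif:   // Logical Screen Descriptor: 10 bytes needed
                if (h.Length < 10) return false;
                width = h[6] | (h[7] << 8);
                height = h[8] | (h[9] << 8);
                return true;
            case ImageKind.Png:   // IHDR chunk: 24 bytes needed
                if (h.Length < 24) return false;
                width = ReadInt32BE(h, 16);
                height = ReadInt32BE(h, 20);
                return true;
            default:
                return false;
        }
    }

    private static int ReadInt32LE(byte[] b, int o)
    {
        return b[o] | (b[o + 1] << 8) | (b[o + 2] << 16) | (b[o + 3] << 24);
    }

    private static int ReadInt32BE(byte[] b, int o)
    {
        return (b[o] << 24) | (b[o + 1] << 16) | (b[o + 2] << 8) | b[o + 3];
    }
}
```

Reading the bytes explicitly, rather than through BitConverter, keeps the result independent of the host's byte order.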
You may want to write a web service that, given the image name, returns its size in the response.
Based on the resulting value, you can choose whether to download the image.
If you want something more lightweight than a web service, go for an HTTP handler.
You could try using the FileInfo class like so to find the file size.
long fileSizeInKb = new FileInfo(fileName).Length / 1000;
if (fileSizeInKb < 1000)
{
    // get image
}
The Length property returns the number of bytes in the current file.
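Note that FileInfo only works for files on the local file system. For an image that still lives on a remote server, a comparable check is to send a HEAD request and inspect the Content-Length header before downloading anything. The sketch below assumes the server answers HEAD requests and reports a length; the names RemoteImageCheck, IsWithinLimit and IsSmallEnough are made up for the example.

```csharp
using System;
using System.Net;

public static class RemoteImageCheck
{
    // Pure decision helper. HttpWebResponse.ContentLength is -1 when the
    // server did not send a Content-Length header, so treat that as "unknown"
    // and reject it.
    public static bool IsWithinLimit(long contentLength, long maxBytes)
    {
        return contentLength >= 0 && contentLength <= maxBytes;
    }

    // Asks the server for the headers only (no body) and checks the size.
    public static bool IsSmallEnough(string imageUrl, long maxBytes)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(imageUrl);
        request.Method = "HEAD";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return IsWithinLimit(response.ContentLength, maxBytes);
        }
    }
}
```

For the 1 MB limit from the question, the call would look like RemoteImageCheck.IsSmallEnough(ImageUrl, 1000000). Rejecting responses with no reported length is a conservative choice; you could instead fall back to a ranged request in that case.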