Determine number of pages in a PDF file [closed] - c#

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need to determine the number of pages in a specified PDF file using C# code (.NET 2.0). The PDF file will be read from the file system, not from a URL. Does anyone have any idea how this could be done? Note: Adobe Acrobat Reader is installed on the PC where this check will be carried out.

You'll need a PDF API for C#. iTextSharp is one possible API, though better ones might exist.
iTextSharp Example
You must add iTextSharp.dll as a reference. Download iTextSharp from SourceForge.net. This is a complete working program using a console application.
using System;
using iTextSharp.text.pdf;

namespace GetPages_PDF
{
    class Program
    {
        static void Main(string[] args)
        {
            // Right side of the assignment is the location of YOUR PDF file
            string ppath = "C:\\aworking\\Hawkins.pdf";
            PdfReader pdfReader = new PdfReader(ppath);
            try
            {
                int numberOfPages = pdfReader.NumberOfPages;
                Console.WriteLine(numberOfPages);
            }
            finally
            {
                // Release the resources held by the reader
                pdfReader.Close();
            }
            Console.ReadLine();
        }
    }
}

This should do the trick:
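// Requires: using System.IO; using System.Text.RegularExpressions;
public int getNumberOfPdfPages(string fileName)
{
    using (StreamReader sr = new StreamReader(File.OpenRead(fileName)))
    {
        // [^s] skips "/Type /Pages", which marks the page tree rather than a page
        Regex regex = new Regex(@"/Type\s*/Page[^s]");
        MatchCollection matches = regex.Matches(sr.ReadToEnd());
        return matches.Count;
    }
}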
From Rachael's answer and this one too.
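The regex approach above can be sketched as a small helper that is easier to test on in-memory text. This is only a sketch under stated assumptions: the class and method names (`PdfPageCounter`, `CountPageObjects`, `CountPagesInFile`) are made up for illustration, and the pattern uses a negative lookahead `(?!s)` instead of `[^s]` so that a marker at the very end of the text still counts. The file is read as ISO-8859-1 so that every byte maps to exactly one character.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfPageCounter
{
    // Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
    // (?!s) is a lookahead, so a marker at the end of the text is still matched.
    private static readonly Regex PageObject = new Regex(@"/Type\s*/Page(?!s)");

    // Counts page-object markers in PDF text that is already in memory.
    public static int CountPageObjects(string pdfText)
    {
        return PageObject.Matches(pdfText).Count;
    }

    // Hypothetical convenience wrapper: loads the whole file, then counts.
    public static int CountPagesInFile(string path)
    {
        string text = File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
        return CountPageObjects(text);
    }
}
```

Keep in mind that this is a heuristic rather than a parser. PDF files that store their page objects inside compressed object streams (possible since PDF 1.5) will not contain these markers as plain text, so the count can come out too low, and the whole file is still read into memory.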

I found a way at http://www.dotnetspider.com/resources/21866-Count-pages-PDF-file.aspx
This does not require the purchase of a PDF library.

One Line:
int pdfPageCount = System.IO.File.ReadAllText("example.pdf").Split(new string[] { "/Type /Page" }, StringSplitOptions.None).Count()-2;
Recommended: iTextSharp

I have used pdflib for this.
p = new pdflib();
/* Open the input PDF */
indoc = p.open_pdi_document("myTestFile.pdf", "");
pageCount = (int) p.pcos_get_number(indoc, "length:pages");

Docotic.Pdf library may be used to accomplish the task.
Here is sample code:
PdfDocument document = new PdfDocument();
document.Open("file.pdf");
int pageCount = document.PageCount;
The library will parse as little as possible so performance should be ok.
Disclaimer: I work for Bit Miracle.

I have good success using CeTe Dynamic PDF products. They're not free, but are well documented. They did the job for me.
http://www.dynamicpdf.com/

I've used the code above that solves the problem using regex and it works, but it's quite slow, because it reads the entire file to determine the number of pages.
I used it in a web app where pages would sometimes list 20 or 30 PDFs at a time, and in that circumstance the load time for the page went from a couple of seconds to almost a minute due to the page counting method.
I don't know if the third-party libraries are much better. I would hope that they are, and I've used pdflib in other scenarios with success.

Related

How do I convert an html web page into image using C# [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
How can I convert a dynamic link, which is an HTML web page, into an image format? Remember that the link is dynamic and contains the HTML content as a string. I have tried many approaches, such as reading the HTML content and converting it to base64 first and then vice versa.
var htmlToImageConv = new HtmlToImageConverter();
byte[] jpegBytes = htmlToImageConv.GenerateImage(html, ImageFormat.Jpeg);
System.Drawing.Image image;
using (System.IO.MemoryStream ms = new System.IO.MemoryStream(strOg))
{
    image = System.Drawing.Image.FromStream(ms);
    string path = Server.MapPath("~/images/");
}
I have tried this code in C# for converting an HTML web page to an image.
You can use a headless browser to render the HTML and then take a snapshot.
Have a look at PuppeteerSharp: https://github.com/kblok/puppeteer-sharp
You could use Selenium to render the page and save a screenshot as a png image.
Add the following packages to your project:
Selenium.WebDriver
Selenium.Chrome.WebDriver
Use the following code to save a screenshot:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Disposing the driver closes the browser and the chromedriver process
            using (var driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("http://google.com");
                Screenshot ss = ((ITakesScreenshot)driver).GetScreenshot();
                ss.SaveAsFile("screenshot.png");
            }
        }
    }
}
What you need is a conversion from a string containing HTML to an image, which is already discussed in the answers to this question.

Which library to use to extract text from images?

I am writing a program that when given an image of a low level math problem (e.g. 98*13) should be able to output the answer. The numbers would be black, and the background white. Not a captcha, just an image of a math problem.
The math problems would only have two numbers and one operator, and that operator would only be +, -, *, or /.
Obviously, I know how to do the calculating ;) I'm just not sure how to go about getting the text from the image.
A free library would be ideal, although if I have to write the code myself I could probably manage.
To extract words from an image, I use the most accurate open source OCR engine: Tesseract. It is available here or directly as a NuGet package.
And this is my function in C#, which extracts words from the image passed in sourceFilePath. Set EngineMode to TesseractAndCube; it detects more words than the other options.
var path = "YourSolutionDirectoryPath";
using (var engine = new TesseractEngine(path + Path.DirectorySeparatorChar + "tessdata", "fra", EngineMode.TesseractAndCube))
{
using (var img = Pix.LoadFromFile(sourceFilePath))
{
using (var page = engine.Process(img))
{
var text = page.GetText();
// text variable contains a string with all words found
}
}
}
I hope that helps.
Try this post about using the C++ Google Tesseract OCR library in C#:
OCR with the Tesseract interface
You need OCR. There is the free Tesseract library from Google, but it's C code. You could use it in a C++/CLI project and access it via .NET.
This article gives some information on recognizing numbers (for Sudoku, but your problem is similar)
http://sudokugrab.blogspot.com/2009/07/how-does-it-all-work.html
You can use Microsoft Office Document Imaging (Interop.MODI.dll) in Visual Studio to extract text from pictures.
Document modiDocument = new Document();
modiDocument.Create(filePath);
modiDocument.OCR(MiLANGUAGES.miLANG_ENGLISH);
MODI.Image modiImage = (modiDocument.Images[0] as MODI.Image);
string extractedText = modiImage.Layout.Text;
modiDocument.Close();
return extractedText;
IronOCR is free for development and testing. The default English language pack should do a good job of reading this, but you may also want to consider using a custom Tesseract language pack written specifically for equations.
See https://ironsoftware.com/csharp/ocr/languages/#custom-language-example
using IronOcr;
var Ocr = new IronTesseract();
Ocr.UseCustomTesseractLanguageFile("languages/equ.traineddata");
using (var Input = new OcrInput(@"images\equation.png"))
{
var Result = Ocr.Read(Input);
Console.WriteLine(Result.Text);
}
Disclaimer: I work for Iron Software.
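Whichever OCR approach above is used, its output is just a string such as "98*13", and the question limits the input to two numbers and one of +, -, * or /. A minimal sketch of the remaining step, turning that string into an answer, could look like the following. The class and method names (`OcrMath`, `Evaluate`) are made up for illustration, and the sketch assumes the OCR text contains only digits, one operator and optional whitespace.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class OcrMath
{
    // Two non-negative integers separated by one of + - * /, with optional whitespace.
    private static readonly Regex Problem =
        new Regex(@"^\s*(\d+)\s*([-+*/])\s*(\d+)\s*$");

    // Parses text such as "98*13" and returns the result of the operation.
    public static double Evaluate(string ocrText)
    {
        Match m = Problem.Match(ocrText);
        if (!m.Success)
            throw new FormatException("Not a two-operand problem: " + ocrText);

        double left = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        double right = double.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture);

        switch (m.Groups[2].Value)
        {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            default:  return left / right; // "/" is the only operator left
        }
    }
}
```

Rejecting anything that does not match the pattern is deliberate: OCR output is noisy, and an engine may return "x" or "×" for a multiplication sign, so a clear failure is more useful than a wrong answer. Mapping such look-alike characters to "*" before calling `Evaluate` is left out of this sketch.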

ASP.NET/ MVC/ C#/ jQuery to create a CMS front end and PDF Generator [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have a few general ideas on how I want to do this.
What I am trying to do is: create a front end CMS system, which is very simple, where a report will be generated from i.e. a template, using jQuery (drag, drop etc), included in the report will be placeholders where data will be imported into e.g. name, address etc. This data can be changed by different users who have access to the data.
I was thinking I would need to convert this HTML into xsl-fo format and then generate it into a PDF as xsl-fo will give me a major advantage on custom display of data on PDF, i.e. the data will appear how I want it to. This will also enable me to do a lookup in the xsl-fo using xslt (or something?) to import the latest updated database values. The tool to actually convert from xsl-fo into PDF that looks like it fits my bill is: fo.net. Ultimately I would need to use some code already out there but where I can avoid it, I would want to.
Keep in mind:
I need ultimate control over everything (eventually)
Free / open source alternatives that are flexible (with source code)
Questions:
Is jQuery the best thing to use for the CMS? As I will be having custom controls which will contain database data or placeholders for data to be imported into
Is XSL-FO the best intermediary language to port this template into for rendering/ converting into a PDF?
How do I convert html into xsl-fo? Does c#/.net have an API I can look at?
Have I overcomplicated things? Any simpler ways to do this?
Note
The HTML + CSS on the page may be very complicated/ flexible so I may need to use jQuery to add the CSS inline to the elements, hence why I am thinking of using XSL-FO as I may be able to generate tags that can read this data and place it on the PDF in a certain way, please keep this in mind when answering my question (if you choose to!) :)
I have found PDFsharp and MigraDoc to be great for pdf generation.
I have created a pdf utility...
using System;
using System.IO;
using System.Web;
using System.Web.Mvc;
using PdfSharp.Pdf;
// ActionResult that streams a PDF document to the browser
namespace Web.Utilities
{
public class PdfResult : ActionResult
{
public String Filename { get; set; }
protected MemoryStream pdfStream = new MemoryStream();
public PdfResult(PdfDocument doc)
{
Filename = String.Format("{0}.pdf", doc.Info.Title);
doc.Save(pdfStream, false);
}
public PdfResult(String pdfpath)
{
/* Optional: implement this if you need to serve a PDF saved on the file system */
throw new NotImplementedException("PdfResult is just an example and does not serve files from the filesystem.");
}
public override void ExecuteResult(ControllerContext context)
{
context.HttpContext.Response.Clear();
context.HttpContext.Response.ContentType = "application/pdf";
context.HttpContext.Response.AddHeader("Content-Disposition", "attachment; filename=" + Filename); // specify filename
context.HttpContext.Response.AddHeader("content-length", pdfStream.Length.ToString());
context.HttpContext.Response.BinaryWrite(pdfStream.ToArray());
context.HttpContext.Response.Flush();
pdfStream.Close();
context.HttpContext.Response.End();
}
}
}
And then you can render a view of pdf in the controller...
public ActionResult Download()
{
Document document = new Document();
document.Info.Title = "Hello";
Section section = document.AddSection();
section.AddParagraph("Hello").AddFormattedText("World", TextFormat.Bold);
PdfDocumentRenderer renderer = new PdfDocumentRenderer();
renderer.Document = document;
renderer.RenderDocument();
return new PdfResult(renderer.PdfDocument);
}
I have found this to be a really neat and easy to control method of putting pdf into mvc.
To answer my own question, I have decided to use Fo.NET, a C# port of Apache FOP. I will generate my XML file on the fly, transform it into an XSL-FO document, and then send that to Fo.NET to create the PDF.
I have managed to do this quite successfully. This will enable me to throw out Fo.NET in the future and switch to other software, or even write my own if needed. Hopefully over the next few months I will have a firmer answer on how flexible my choice actually was. :)
I will handle the front end with jQuery and jQuery UI.
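The XML-to-XSL-FO step described in this answer can be done with the XSLT processor built into .NET (`XslCompiledTransform`). The following is only a sketch under stated assumptions: the `<report><name>` input shape, the stylesheet and the names `FoBuilder` and `ToXslFo` are invented for illustration and are not the author's actual templates. Rendering the resulting XSL-FO to PDF with Fo.NET is not shown.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class FoBuilder
{
    // Hypothetical stylesheet: turns <report><name>..</name></report>
    // into a single-page XSL-FO document with one text block.
    public const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:fo='http://www.w3.org/1999/XSL/Format'>
  <xsl:template match='/report'>
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name='page'>
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference='page'>
        <fo:flow flow-name='xsl-region-body'>
          <fo:block><xsl:value-of select='name'/></fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to the data XML and returns the XSL-FO as a string.
    public static string ToXslFo(string dataXml)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader xsl = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsl);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(dataXml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Keeping the data XML and the stylesheet separate is what makes the renderer replaceable: any XSL-FO processor can consume the output of `ToXslFo`, so swapping out the PDF engine later does not touch the templates.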

C# Read Excel using late binding

Hi
I have not used late binding before, but it would seem to be the solution, if only I could find a concise example!
Or maybe it's not the solution, but I'm sure you guys will know!
I need to fill a dropdown combobox list from a column in Excel, reading down to the first blank cell. The solution needs to work with Excel 2003 and above: some PCs have never had 2003 installed, only Office 2010; others have been upgraded from 2003; and some are still on 2003!
I need a solution that works on all of the above.
So I'm looking into late binding. Is this the correct way to go? Would LINQ help?
It's a classic Windows Forms app using .NET 4.
I thought I would write a method that takes the file name and path and returns a list, which I would then assign to the combobox.
But being new, I'm not getting past go!
Any help/examples PLEASE
It sounds like you're looking at using COM interop/automation with the Excel application installed on client machines.
If your sole requirement is to extract data from an Excel file, you'll be better off using a library that can simply read data out of the file itself, rather than launching the Excel process in the background.
It is faster, cleaner, and more testable.
I've used NPOI for .xls files (and there are certainly others), and there are LOTS of options for .xlsx files. This SO question is about creating a file, but any of the suggested libraries can of course read files as well.
The only time I'd use COM automation is to interact with a running instance of Excel.
Edit (in response to comments)
Here is a sample of getting the values of column B as strings:
using System;
using System.IO;
using NPOI.HSSF.UserModel;
using NPOI.SS.UserModel;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var stream = new FileStream(@"c:\my_workbook.xls", FileMode.Open);
var workbook = new HSSFWorkbook(stream);
stream.Close();
var sheet = workbook.GetSheet("My Sheet Name");
var row_enumerator = sheet.GetRowEnumerator();
while (row_enumerator.MoveNext())
{
var row = (Row)row_enumerator.Current;
var cell = row.GetCell(1); // in Excel, indexes are 1-based; in NPOI the indexes are 0-based
Console.WriteLine(cell.StringCellValue);
}
Console.ReadKey();
}
}
}
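The sample above prints every row, whereas the question asks for the values down to the first blank cell, returned as a list for the combobox. That stopping rule is independent of how the cells are read, so it can be sketched as a small helper. The names `ColumnReader` and `TakeUntilBlank` are made up for illustration, and the sketch assumes the caller passes the column values in row order, with null for a missing cell.

```csharp
using System.Collections.Generic;

public static class ColumnReader
{
    // Returns the values in order, stopping at the first null,
    // empty, or whitespace-only entry (the "first blank cell").
    public static List<string> TakeUntilBlank(IEnumerable<string> cellValues)
    {
        var items = new List<string>();
        foreach (string value in cellValues)
        {
            if (string.IsNullOrWhiteSpace(value))
                break;
            items.Add(value.Trim());
        }
        return items;
    }
}
```

In the Windows Forms app, the result could be assigned with something like `comboBox1.DataSource = ColumnReader.TakeUntilBlank(values);`, where `comboBox1` and `values` are placeholder names for the form's control and the strings collected in the row loop.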

Generate a pdf thumbnail (open source/free) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
Looking at other posts on this, I could not find an adequate solution for my needs. I am just trying to get the first page of a PDF document as a thumbnail. This is to be run as a server application, so I would not want to write a PDF document out to file and then call a third application that reads the PDF to generate the image on disk.
doc = new PDFdocument("some.pdf");
page = doc.page(1);
Image image = page.image;
Thanks.
Matthew Ephraim released an open source wrapper for Ghostscript that sounds like it does what you want and is in C#.
Link to Source Code: https://github.com/mephraim/ghostscriptsharp
Link to Blog Posting: http://www.mattephraim.com/blog/2009/01/06/a-simple-c-wrapper-for-ghostscript/
You can make a simple call to the GeneratePageThumb method to generate a thumbnail, or use GeneratePageThumbs with a start and end page number to generate thumbnails for multiple separate pages, with each page written to a separate output file. The default file format is JPEG, but you can change it, along with many other options such as page size, by using the alternate GenerateOutput method.
I think that the Windows API Code Pack for Microsoft .NET Framework might be the easiest way to do this. It can generate the same thumbnail that Windows Explorer shows (which is the first page), and you can choose among several sizes up to 1024x1024, so it should be enough. It is quite simple: create the object with ShellObject.FromParsingName(filepath) and read its Thumbnail property.
The problem might be what your server is. This works on Windows 7, Windows Vista and, I guess, Windows Server 2008. Also, Windows Explorer must be able to show thumbnails on that machine. The easiest way to ensure that is to install Adobe Reader. If all of this is not a problem, I think that this is the most elegant way.
UPDATE: Adobe Reader has dropped support for thumbnails in the recent versions so its legacy versions must be used.
UPDATE2: According to comment from Roberto, you can still use latest version of Adobe Reader if you turn on thumbnails option in Edit - Preferences - General.
Download PDFLibNet and use the following code
public void ConvertPDFtoJPG(string filename, String dirOut)
{
PDFLibNet.PDFWrapper _pdfDoc = new PDFLibNet.PDFWrapper();
_pdfDoc.LoadPDF(filename);
for (int i = 0; i < _pdfDoc.PageCount; i++)
{
Image img = RenderPage(_pdfDoc, i);
img.Save(Path.Combine(dirOut, string.Format("{0}{1}.jpg", i,DateTime.Now.ToString("mmss"))));
}
_pdfDoc.Dispose();
return;
}
public Image RenderPage(PDFLibNet.PDFWrapper doc, int page)
{
doc.CurrentPage = page + 1;
doc.CurrentX = 0;
doc.CurrentY = 0;
doc.RenderPage(IntPtr.Zero);
// create an image to draw the page into
var buffer = new Bitmap(doc.PageWidth, doc.PageHeight);
doc.ClientBounds = new Rectangle(0, 0, doc.PageWidth, doc.PageHeight);
using (var g = Graphics.FromImage(buffer))
{
var hdc = g.GetHdc();
try
{
doc.DrawPageHDC(hdc);
}
finally
{
g.ReleaseHdc();
}
}
return buffer;
}
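RenderPage above produces a bitmap at the full page size, while the question asks for a thumbnail. One way to get from one to the other is to compute a target size that keeps the aspect ratio and then resize the rendered image. The helper below is a sketch of that calculation only; the names `ThumbnailSize` and `Fit` are made up for illustration and are not part of PDFLibNet.

```csharp
using System;

public static class ThumbnailSize
{
    // Scales (width, height) down so the longer side equals maxSide,
    // keeping the aspect ratio. Sizes already within the limit are unchanged.
    public static void Fit(int width, int height, int maxSide,
                           out int thumbWidth, out int thumbHeight)
    {
        if (width <= 0 || height <= 0 || maxSide <= 0)
            throw new ArgumentException("width, height and maxSide must be positive");

        double scale = Math.Min(1.0, (double)maxSide / Math.Max(width, height));

        // Never round a side down to zero pixels.
        thumbWidth = Math.Max(1, (int)Math.Round(width * scale));
        thumbHeight = Math.Max(1, (int)Math.Round(height * scale));
    }
}
```

With the computed size, the rendered page could then be shrunk with the `Bitmap(Image, int, int)` constructor from System.Drawing, for example `new Bitmap(img, thumbWidth, thumbHeight)`, before saving it.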
I used to do this kind of stuff with imagemagick (Convert) long ago.
There is a .Net Wrapper for that, maybe it's worth checking out :
http://imagemagick.codeplex.com/releases/view/30302
http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx
This works very well. The only dependencies are GhostScript's gsdll32.dll (you need to download GhostScript separately to get this, but there is no need to have GhostScript installed in your production environment), and PDFSharp.dll which is included in the project.
