'PdfTextExtractor' does not contain a definition for 'GetTextFromPage' (compiler error CS0117)
This is my code, which I copied just to see how iText7 works:
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace PdfParser
{
    public static class PdfTextExtractor
    {
        public static void ExtractTextFromPDF(string filePath)
        {
            PdfReader pdfReader = new PdfReader(filePath);
            PdfDocument pdfDoc = new PdfDocument(pdfReader);
            for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                // the line below causes compiler error CS0117
                string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
            }
            pdfDoc.Close();
            pdfReader.Close();
        }
    }
}
I tried iTextSharp first, but I read that iText7 is the newer version.
I am working on a "Console Application"; maybe this is the problem? Should I use another framework?
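The problem is that your own class is also called PdfTextExtractor, so inside it the name resolves to your class rather than to iText's. Rename your static class and the issue will be solved. Note that iText's PdfTextExtractor lives in the iText.Kernel.Pdf.Canvas.Parser namespace, so you also need a using directive for that namespace (your code only imports iText.Kernel.Pdf.Canvas.Parser.Listener).
For future issues, you can jump to the definition (via F12 or similar, depending on your IDE and shortcuts) and check where it takes you.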
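As a minimal sketch of that fix, the question's code could look like the following once the class is renamed. The name PdfTextReader is only an illustrative choice, and returning the collected text as a string (instead of void) is an assumption made here so the result is usable; the iText calls are the same ones used in the question.

```csharp
using System;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;          // iText's PdfTextExtractor lives here
using iText.Kernel.Pdf.Canvas.Parser.Listener; // extraction strategies

namespace PdfParser
{
    // Renamed (hypothetical name) so it no longer hides iText's PdfTextExtractor.
    public static class PdfTextReader
    {
        public static string ExtractTextFromPdf(string filePath)
        {
            var text = new StringBuilder();
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
            try
            {
                for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
                {
                    // A fresh strategy per page, as in the question
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
                }
            }
            finally
            {
                // Close the document even if extraction fails part-way
                pdfDoc.Close();
            }
            return text.ToString();
        }
    }
}
```

With the rename in place, PdfTextExtractor inside the method can only refer to the iText class, so the CS0117 error should no longer occur.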
So, an update: I've gotten my code to read a single PDF file and parse the information into a text file. Great. Now I want to figure out how to do the following two things.
1. Get the program to read more than one PDF file. If I could get it to read an entire folder, that would be best. I'm not sure how to change the code to do that, but I know it can't be that different.
2. Change the activation method. If I could get the code to run whenever a new file is dropped into a folder, that would be absolutely amazing. It has to be possible to have an event listener that fires whenever a file is dropped into a folder and parses the information.
public static string ExtractTextFromPdf(string path)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\Users\kttricic\OneDrive - Burns & McDonnell\Desktop\test file\POs\test");
        file.WriteLine(text);
        file.Close();
        return text.ToString();
    }
}

static void Main(string[] args)
{
    Console.WriteLine(ExtractTextFromPdf(@"C:\Users\kttricic\OneDrive - Burns & McDonnell\Desktop\test file\POs\PO 4505234816 Siemens Industry, Inc. 6.15.21.pdf"));
}
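One possible way to approach both goals is sketched below, using only the .NET base library: Directory.GetFiles to process every PDF already in a folder, and FileSystemWatcher to react to new ones. The class PdfFolderProcessor and its method names are hypothetical, and the text extractor is passed in as a delegate so that a method like ExtractTextFromPdf above could be plugged in; the placeholder extractor in Main is only a stand-in.

```csharp
using System;
using System.IO;

public static class PdfFolderProcessor
{
    // Runs 'extract' on one PDF and writes the result next to it as a .txt file.
    public static void ProcessFile(string pdfPath, Func<string, string> extract)
    {
        string textPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(textPath, extract(pdfPath));
    }

    // Processes every *.pdf currently in 'folder' and returns how many were handled.
    public static int ProcessFolder(string folder, Func<string, string> extract)
    {
        string[] pdfPaths = Directory.GetFiles(folder, "*.pdf");
        foreach (string pdfPath in pdfPaths)
        {
            ProcessFile(pdfPath, extract);
        }
        return pdfPaths.Length;
    }

    // Returns a watcher that processes each PDF created in 'folder' from now on.
    public static FileSystemWatcher Watch(string folder, Func<string, string> extract)
    {
        var watcher = new FileSystemWatcher(folder, "*.pdf");
        watcher.Created += (sender, e) =>
        {
            try
            {
                ProcessFile(e.FullPath, extract);
            }
            catch (IOException ex)
            {
                // The file may still be locked by whatever is copying it
                Console.Error.WriteLine("Could not process " + e.FullPath + ": " + ex.Message);
            }
        };
        watcher.EnableRaisingEvents = true;
        return watcher;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the folder to process as the first argument.");
            return;
        }
        string folder = args[0];
        // Placeholder extractor (assumption): swap in ExtractTextFromPdf from the question.
        Func<string, string> extract = path => "text of " + Path.GetFileName(path);
        Console.WriteLine(ProcessFolder(folder, extract) + " file(s) processed");
        using (Watch(folder, extract))
        {
            Console.WriteLine("Watching " + folder + "; press Enter to quit.");
            Console.ReadLine();
        }
    }
}
```

Keep in mind that the Created event can fire before a large file has been completely written, so a real version would probably need a short retry or delay before opening the file.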
I would like to know the most efficient way to convert contenteditable content (something that the user types in) to a PDF. Here is an illustration of what I mean:
(Three illustrations, numbered 1 to 3, were attached here.)
I would also like to know how to convert CSS styling, since jsPDF doesn't support this (to my knowledge).
jsPDF doesn't support most of the features you need. I suggest creating an application to do the conversion.
My background is C#. So:
Program.cs
using HtmlToPdf.Models;

namespace HtmlToPdf.Console
{
    public class Program
    {
        public static void Main(string[] args)
        {
            var model = new HtmlToPdfModel();
            model.HTML = "<h3>Hello world!</h3>";
            model.CSS = "h3{color:#f00;}";
            HtmlToPdf.Convert(model);
        }
    }
}
HtmlToPdfModel.cs
namespace HtmlToPdf.Models
{
    public class HtmlToPdfModel
    {
        public string HTML { get; set; }
        public string CSS { get; set; }
        public string OutputPath { get; set; }
        public string FontName { get; set; }
        public string FontPath { get; set; }
    }
}
HtmlToPdf.cs
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using HtmlToPdf.Models;
using System;
using System.IO;
using System.Text;

namespace HtmlToPdf.Console
{
    public class HtmlToPdf
    {
        public static void Convert(HtmlToPdfModel model)
        {
            try
            {
                if (model == null) return;
                Byte[] bytes;
                //Boilerplate iTextSharp setup here
                //Create a stream that we can write to, in this case a MemoryStream
                using (var stream = new MemoryStream())
                {
                    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
                    using (var doc = new Document())
                    {
                        //Create a writer that's bound to our PDF abstraction and our stream
                        using (var writer = PdfWriter.GetInstance(doc, stream))
                        {
                            //Open the document for writing
                            doc.Open();
                            //In order to read CSS as a string we need to switch to a different constructor
                            //that takes Streams instead of TextReaders.
                            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
                            using (var cssStream = new MemoryStream(Encoding.UTF8.GetBytes(model.CSS)))
                            {
                                using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(model.HTML)))
                                {
                                    var fontProvider = new XMLWorkerFontProvider();
                                    if (!string.IsNullOrEmpty(model.FontPath) && !string.IsNullOrEmpty(model.FontName))
                                    {
                                        fontProvider.Register(model.FontPath, model.FontName);
                                        //Parse the HTML with css font-family
                                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlStream, cssStream, Encoding.UTF8, fontProvider);
                                    }
                                    else
                                    {
                                        //Parse the HTML without css font-family
                                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlStream, cssStream);
                                    }
                                }
                            }
                            doc.Close();
                        }
                    }
                    //After all of the PDF "stuff" above is done and closed but **before** we
                    //close the MemoryStream, grab all of the active bytes from the stream
                    bytes = stream.ToArray();
                }
                //Now we just need to do something with those bytes.
                //Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
                //You could also write the bytes to a database in a varbinary() column (but please don't) or you
                //could pass them to another function for further PDF processing.
                // use this line on Windows version
                //File.WriteAllBytes(model.OutputPath, bytes);
                // use these lines on Mac version
                string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "data");
                path = Path.Combine(path, "test.pdf");
                File.WriteAllBytes(path, bytes);
            }
            catch (Exception)
            {
                //Rethrow without resetting the stack trace (add logging here if needed)
                throw;
            }
        }
    }
}
I wrote and tested this application on Windows. If you're using a Mac, replace the line:
File.WriteAllBytes(model.OutputPath, bytes);
in HtmlToPdf.cs with:
string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "data");
path = Path.Combine(path, "test.pdf");
File.WriteAllBytes(path, bytes);
I've marked both variants with comments in the code.
About the font problem: if you want to use a specific font (e.g. Roboto), you must provide the font file and a path your application can read it from.
NuGet packages: iTextSharp and itextsharp.xmlworker
You can convert this console application to a web application; every time you want to create a PDF file, just make an (AJAX) request to the server and call the method HtmlToPdf.Convert.
Does anyone know if there is a way to check for a watermark on a PDF document using iTextSharp?
I want to do this before adding a new one. In my case, I have to add a new watermark if it wasn't already added by someone, but I don't know how to check this using iTextSharp's PdfReader class.
Something like this:
var reader = new PdfReader(bytes);
var stamper = new PdfStamper(reader, ms);
var dc = stamper.GetOverContent(pageNumber);
// CheckIfTextOrImageExists is the kind of method I am looking for; it does not exist
bool alreadyStamped = dc.CheckIfTextOrImageExists();
After some investigation, and thanks to @ChrisHaas's comment, I was able to implement that check. If the text is present on a particular page, I can find it using SimpleTextExtractionStrategy, even if it's in the watermark collection.
PdfReader pdfReader = new PdfReader(bytes);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    if (currentPageText.Contains(searchText))
    {
        // the text is already on this page; decide here whether to add a new watermark
        Console.WriteLine("text was found on page " + page);
    }
}
pdfReader.Close();
Hopefully this approach helps someone with a similar issue.
I'm using iTextSharp to read the contents of PDF documents:
PdfReader reader = new PdfReader(pdfPath);
using (StringWriter output = new StringWriter())
{
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close();
    pdfText = output.ToString();
}
99% of the time it works just fine. However, there is this one PDF file that will sometimes throw this exception:
PDF header signature not found. StackTrace:
   at iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader()
   at iTextSharp.text.pdf.PdfReader.ReadPdf()
   at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
   at Reader.PDF.DownloadPdf(String url) in
What's annoying is that I can't always reproduce the error. Sometimes it works, sometimes it doesn't. Has anyone encountered this problem?
After some research, I've found that this problem relates either to a file being corrupted during PDF generation, or to an object in the document that doesn't conform to the PDF standard as implemented in iTextSharp. It also seems to happen only when you read a PDF file from disk.
I have not found a complete solution to the problem, only a workaround. What I've done is open the PDF document with the iTextSharp PdfReader object first and see whether an exception is thrown, before reading the file in normal operation.
So running something similar to this:
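private bool IsValidPdf(string filepath)
{
    PdfReader reader = null;
    try
    {
        // The constructor reads the file and throws if it cannot be parsed
        reader = new PdfReader(filepath);
        return true;
    }
    catch
    {
        return false;
    }
    finally
    {
        // Release the file whether or not parsing succeeded
        if (reader != null) reader.Close();
    }
}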
I found it was because I was calling new PdfReader(pdf) with the PDF stream position at the end of the stream. Setting the position to zero resolved the issue.
Before:
// Throws: InvalidPdfException: PDF header signature not found.
var pdfReader = new PdfReader(pdf);
After:
// Works correctly.
pdf.Position = 0;
var pdfReader = new PdfReader(pdf);
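To see why the position matters, here is a small self-contained sketch that uses only a MemoryStream and no iTextSharp at all. The class StreamRewindDemo, its ReadHeader helper, and the fake PDF bytes are all made up for illustration; the point is only that a stream that has just been written to has nothing left to read until it is rewound.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Reads up to 'count' bytes from the stream's current position as ASCII text.
    public static string ReadHeader(Stream stream, int count)
    {
        var buffer = new byte[count];
        int read = stream.Read(buffer, 0, count);
        return Encoding.ASCII.GetString(buffer, 0, read);
    }

    public static void Main()
    {
        // Stand-in bytes for a real PDF, which starts with the "%PDF-" signature
        byte[] fakePdf = Encoding.ASCII.GetBytes("%PDF-1.4 sample bytes");

        var pdf = new MemoryStream();
        pdf.Write(fakePdf, 0, fakePdf.Length); // the position is now at the end

        // Nothing left to read, so a parser would not find the header here
        Console.WriteLine("Before rewind: '" + ReadHeader(pdf, 5) + "'");

        pdf.Position = 0; // rewind to the start
        Console.WriteLine("After rewind:  '" + ReadHeader(pdf, 5) + "'");
    }
}
```

The first read returns an empty string and the second returns the "%PDF-" signature, which is consistent with a "header signature not found" error when the reader is handed a stream that is already at its end.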
In my case, it was because I was passing in a .json file, and iTextSharp obviously only accepts PDF files.
It is also possible that the file is open in another method or program, as it was in my case. Verify that nothing else is working with your file; you can use Resource Monitor to check which processes have it open.
I am using code from another question and I am getting these errors:
Error 1: The non-generic type 'iTextSharp.text.List' cannot be used with type arguments
Error 2: The name 'HTMLWorker' does not exist in the current context
Error 3: The type or namespace name 'HTMLWorker' could not be found (are you missing a using directive or an assembly reference?)
My code so far is as follows:
protected void Button2_Click(object sender, EventArgs e)
{
    //Extract data from Page (pd).
    Label16.Text = Editor1.Content; // Attribute
    // Prepare the HttpContext
    HttpContext.Current.Response.Clear();
    HttpContext.Current.Response.ContentType = "application/pdf";
    // Create PDF document
    Document pdfDocument = new Document(PageSize.A4, 80, 50, 30, 65);
    //PdfWriter pw = PdfWriter.GetInstance(pdfDocument, HttpContext.Current.Response.OutputStream);
    PdfWriter.GetInstance(pdfDocument, HttpContext.Current.Response.OutputStream);
    pdfDocument.Open();
    //WebClient wc = new WebClient();
    string htmlText = Editor1.Content;
    List<IElement> htmlarraylist = HTMLWorker.ParseToList(new StringReader(htmlText), null);
    for (int k = 0; k < htmlarraylist.Count; k++)
    {
        pdfDocument.Add((IElement)htmlarraylist[k]);
    }
    //pdfDocument.Add(new Paragraph(IElement));
    pdfDocument.Close();
    HttpContext.Current.Response.End();
}
Please help me resolve these errors. What I am trying to do is take the contents (not the HTML markup) of the HTML editor and display them in a PDF file. Please tell me whether this approach is correct.
1. Qualify your List with its full namespace:
System.Collections.Generic.List<IElement> htmlarraylist
2. It looks like you didn't import the namespace of HTMLWorker.
EDIT: I searched for it, and the namespace could be any of these three. I suspect it is the last one, but I am not sure.
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
There's a name conflict in this code: you import the iTextSharp.text namespace, which has its own non-generic List class, while trying to use the standard System.Collections.Generic.List<T> class.
Either remove using iTextSharp.text and refer to its classes with an explicit namespace, or use an explicit namespace for List<T>:
System.Collections.Generic.List<IElement> htmlarraylist = HTMLWorker.ParseToList(new StringReader(htmlText), null);
The third solution is to use aliases.
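A using alias gives the generic list its own unambiguous name. The sketch below is self-contained and does not reference iTextSharp: the Vendor.Text namespace and its List class are hypothetical stand-ins for iTextSharp.text.List, and StringList is an arbitrary alias name chosen for the example.

```csharp
using System;
using Vendor.Text;                                          // brings the non-generic List into scope
using StringList = System.Collections.Generic.List<string>; // alias for the generic list

namespace Vendor.Text
{
    // Hypothetical stand-in for a library type named List (like iTextSharp.text.List).
    public class List
    {
        public string Describe()
        {
            return "vendor list";
        }
    }
}

namespace AliasDemo
{
    public static class Program
    {
        // Builds a generic list through the alias, with no clash with Vendor.Text.List.
        public static StringList BuildItems()
        {
            var items = new StringList();
            items.Add("first");
            items.Add("second");
            return items;
        }

        public static void Main()
        {
            List vendorList = new List();    // resolves to Vendor.Text.List
            StringList items = BuildItems(); // resolves to System.Collections.Generic.List<string>
            Console.WriteLine(vendorList.Describe());
            Console.WriteLine(items.Count);
        }
    }
}
```

In the question's code the equivalent alias would target System.Collections.Generic.List<IElement>. It is also worth checking whether the file simply lacks using System.Collections.Generic; since C# takes the number of type arguments into account when resolving a type name, adding that directive alongside using iTextSharp.text should let List<IElement> resolve as well.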
And for the second error, you need to import the HTMLWorker namespace. Put
using iTextSharp.text.html.simpleparser;
at the top.