Problem converting a large PDF file to DOCX - C#

I have a PDF file of almost 900 MB that I want to convert to a Word document (.docx).
I used SautinSoft.PdfFocus with this code:
string pdfFile = @"d:\Coffee Table Book NPPNP (1).pdf";
string wordFile = @"d:\sample.docx";
// Convert PDF file to DOCX file
SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
f.OpenPdf(pdfFile);
if (f.PageCount > 0)
{
// You may choose output format between Docx and Rtf.
f.WordOptions.Format = SautinSoft.PdfFocus.CWordOptions.eWordDocument.Docx;
int result = f.ToWord(wordFile);
MessageBox.Show(result.ToString());
// Show the resulting Word document.
if (result == 0)
{
System.Diagnostics.Process.Start(wordFile);
}
}
After running this code, the application becomes laggy.
Also, how can I find out how many pages were converted?

String inputPath = @"d:\Coffee Table Book NPPNP (1).pdf";
String outputPath = @"d:\sample.docx";
PDFDocument doc = new PDFDocument(inputPath);
doc.ConvertToDocument(DocumentType.DOCX, outputPath);
Consider using the code above instead; it reduces the complexity.
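On the lag: a conversion this large most likely blocks the UI thread while it runs (the MessageBox call suggests a Windows Forms app). A minimal sketch of moving the work to a background thread with Task.Run follows. ConvertPdfToDocx is a hypothetical stand-in for the real converter call (for example, f.ToWord(wordFile) in the question); it is not part of any library, and the paths are placeholders that the stub never opens.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real converter call (e.g. f.ToWord(wordFile)).
// It only simulates slow work and returns 0, which the question's code treats as success.
static int ConvertPdfToDocx(string pdfPath, string docxPath)
{
    Thread.Sleep(200); // simulate a long-running conversion
    return 0;
}

// Runs the blocking conversion on a thread-pool thread so that the caller
// (for example an async button click handler) stays responsive.
static Task<int> ConvertInBackgroundAsync(string pdfPath, string docxPath)
{
    return Task.Run(() => ConvertPdfToDocx(pdfPath, docxPath));
}

// Placeholder paths; the stub above never touches the file system.
int result = await ConvertInBackgroundAsync(@"d:\sample.pdf", @"d:\sample.docx");
Console.WriteLine(result == 0 ? "Conversion finished." : "Conversion failed: " + result);
```

In a real form, the await would sit inside an async event handler, so the message box and Process.Start call run only after the conversion has completed.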

Related

Why does my PDF file size increase after splitting and merging back? (Using PDFSharp c#)

I am splitting a PDF document into multiple documents containing one page each. After splitting, I perform some operations and then merge the documents back into a single PDF. I am using PDFsharp in C# to do this. The problem is that when I split the document and then merge the pages back, the file size increases from 1.96 MB to 12.2 MB. After thorough testing, I have established that the problem lies not in the operations I perform after splitting but in the splitting and merging of the PDF documents themselves. These are the functions I have written:
public static List<Stream> SplitPdf(Stream PdfDoc)
{
System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
List<Stream> outputStreamList = new List<Stream>();
PdfSharp.Pdf.PdfDocument inputDocument = PdfReader.Open(PdfDoc, PdfDocumentOpenMode.Import);
for (int idx = 0; idx < inputDocument.PageCount; idx++)
{
PdfSharp.Pdf.PdfDocument outputDocument = new PdfSharp.Pdf.PdfDocument();
outputDocument.Version = inputDocument.Version;
outputDocument.Info.Title =
String.Format("Page {0} of {1}", idx + 1, inputDocument.Info.Title);
outputDocument.Info.Creator = inputDocument.Info.Creator;
outputDocument.AddPage(inputDocument.Pages[idx]);
MemoryStream stream = new MemoryStream();
outputDocument.Save(stream);
outputStreamList.Add(stream);
}
return outputStreamList;
}
public static Stream MergePdfs(List<Stream> PdfFiles)
{
System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
PdfSharp.Pdf.PdfDocument outputPDFDocument = new PdfSharp.Pdf.PdfDocument();
foreach (Stream pdfFile in PdfFiles)
{
PdfSharp.Pdf.PdfDocument inputPDFDocument = PdfReader.Open(pdfFile, PdfDocumentOpenMode.Import);
outputPDFDocument.Version = inputPDFDocument.Version;
foreach (PdfSharp.Pdf.PdfPage page in inputPDFDocument.Pages)
{
outputPDFDocument.AddPage(page);
}
}
Stream compiledPdfStream = new MemoryStream();
outputPDFDocument.Save(compiledPdfStream);
return compiledPdfStream;
}
My questions are:
1. Why am I getting this behaviour?
2. Is there a solution where I can split and merge and still get a file of the same size? (It can use any open-source C# library.)
Replying to question 1:
When splitting the files, every file will contain all resources required by the pages it contains.
When merging with PDFsharp again, resources will not be merged and the final document may contain duplicated resources (fonts, images), thus leading to larger files.
This is by design.
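The effect can be sketched with simple arithmetic. The numbers below are hypothetical, chosen only for illustration (they are not measurements of the file in the question): a 10-page document whose pages share one 1000 KB set of fonts and images, plus 50 KB of page-specific content each.

```csharp
using System;

// All sizes are hypothetical, in kilobytes, for illustration only.
const int pageCount = 10;
const int sharedResourcesKb = 1000; // fonts and images used by every page
const int perPageContentKb = 50;    // content unique to each page

// Original file: the shared resources are stored once.
int originalKb = sharedResourcesKb + pageCount * perPageContentKb;

// After splitting, every single-page file carries its own copy of the resources.
// Merging without de-duplication keeps all of those copies.
int mergedKb = pageCount * (sharedResourcesKb + perPageContentKb);

Console.WriteLine($"original: {originalKb} KB, merged: {mergedKb} KB");
```

With these assumed numbers the merged file is seven times larger than the original, even though no visible content was added.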

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

I have a .NET Core 3 app that reads and splits a PDF containing paychecks for some companies I work for.
The app ran well until the latest builds; now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words and no special characters, plus a few tables and a single logo. I am not able to attach it for privacy reasons.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paychecks = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Is there anything I can do?
As far as I can see, iText7 is the only good way to read PDF text for free.

c# docx load xml from string xml

I have created a program that reads a file as an array of bytes. The program consumes Word files using the DocX library from Xceed. What I want to do is recreate the parsed .docx file from the array of bytes.
To bytes:
var doc = DocX.Load("afile.docx");
...
return Encoding.Unicode.GetBytes(doc.Xml.Document.ToString());
Parse:
var doc = DocX.Create("anotherFile.docx");
var document = Encoding.Unicode.GetString({--returned bytes--}); // document is a string with the XML
How to save the document like the original?
I'm getting only blank file without any content.
using (var doc = DocX.Load("afile.docx"))
{
//here modify
doc.SaveAs("anotherFile.docx");
}
See the BinaryWriter documentation:
bWriter.Write(byteArray);
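A .docx file is a ZIP package in which the main document XML is only one part, so a file rebuilt from that XML alone is missing the rest of the package (presumably why the result comes out blank). If the goal is simply to carry the document around as bytes, one option is to round-trip the raw file bytes instead. A minimal sketch using only System.IO follows; the file names are placeholders, and the stand-in bytes written at the start exist only to make the sketch self-contained.

```csharp
using System;
using System.IO;
using System.Linq;

string sourcePath = Path.Combine(Path.GetTempPath(), "afile.docx");       // placeholder name
string targetPath = Path.Combine(Path.GetTempPath(), "anotherFile.docx"); // placeholder name

// Stand-in content so the sketch runs on its own; in practice sourcePath
// would already be a real .docx file.
File.WriteAllBytes(sourcePath, new byte[] { 0x50, 0x4B, 0x03, 0x04, 1, 2, 3 });

// To bytes: read the whole package, not just its XML.
byte[] bytes = File.ReadAllBytes(sourcePath);

// From bytes: write them back unchanged.
File.WriteAllBytes(targetPath, bytes);

// Verify that the copy is byte-for-byte identical to the source.
bool identical = bytes.SequenceEqual(File.ReadAllBytes(targetPath));
Console.WriteLine(identical ? "Round trip preserved every byte." : "Files differ.");
```

The recreated file can then be opened with DocX.Load as in the answer above.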

Converting a PDF file to Excel file using SautinSoft reference shows no error but no Excel output files either

I have used the code from this link to convert PDF to Excel file. No error is observed in Visual Studio but no output file in Excel format was found either. Hoping for feedback. Please note that I'm new in C#.
static void Main(string[] args)
{
string pathToPdf = @"C:\cSharp\PDFToExcelConversion\IT.pdf";
string pathToExcel = @"C:\cSharp\PDFToExcelConversion\excelconverted.xls";
// Convert PDF file to Excel file
SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
// 'true' = Convert all data to spreadsheet (tabular and even textual).
// 'false' = Skip textual data and convert only tabular (tables) data.
f.ExcelOptions.ConvertNonTabularDataToSpreadsheet = true;
// 'true' = Preserve original page layout.
// 'false' = Place tables before text.
f.ExcelOptions.PreservePageLayout = true;
f.OpenPdf(pathToPdf);
if (f.PageCount > 0)
{
int result = f.ToExcel(pathToExcel);
//Open a produced Excel workbook
if (result == 0)
{
System.Diagnostics.Process.Start(pathToExcel);
}
}
}
It took some time before the Excel file appeared in the output folder. The PDF was relatively large, which I think is why the code did not produce the Excel file quickly. Also, I used a trial version of the library, so only 3 pages of a PDF can be converted at a time. I hope this code helps someone.

Value of a string for file's location is nil but a stored value says it isn't

I'm trying to convert secured PDFs to XPS and back to PDF using FreeSpire and then combine them using iTextSharp. Below is my code snippet for converting various files.
char[] delimiter = { '\\' };
string WorkDir = @"C:\Users\*******\Desktop\PDF\Test";
Directory.SetCurrentDirectory(WorkDir);
string[] SubWorkDir = Directory.GetDirectories(WorkDir);
//convert items to PDF
foreach (string subdir in SubWorkDir)
{
string[] samplelist = Directory.GetFiles(subdir);
for (int f = 0; f < samplelist.Length - 1; f++)
{
if (samplelist[f].EndsWith(".doc") || samplelist[f].EndsWith(".DOC"))
{
Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
doc.LoadFromFile(samplelist[f], FileFormat.DOC);
doc.SaveToFile((Path.ChangeExtension(samplelist[f],".pdf")), FileFormat.PDF);
doc.Close();
}
. //other extension cases
.
.
else if (samplelist[f].EndsWith(".pdf") || samplelist[f].EndsWith(".PDF"))
{
PdfReader reader = new PdfReader(samplelist[f]);
bool PDFCheck = reader.IsOpenedWithFullPermissions;
reader.Close();
if (PDFCheck)
{
Console.WriteLine("{0}\\Full Permissions", samplelist[f]);
reader.Close();
}
else
{
Console.WriteLine("{0}\\Secured", samplelist[f]);
Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
string path = samplelist[f];
doc.LoadFromFile(samplelist[f]);
doc.SaveToFile((Path.ChangeExtension(samplelist[f], ".xps")), FileFormat.XPS);
doc.Close();
Spire.Pdf.PdfDocument doc2 = new Spire.Pdf.PdfDocument();
doc2.LoadFromFile((Path.ChangeExtension(samplelist[f], ".xps")), FileFormat.XPS);
doc2.SaveToFile(samplelist[f], FileFormat.PDF);
doc2.Close();
}
The issue is that I get a "Value cannot be null" error at doc.LoadFromFile(samplelist[f]);. I added string path = samplelist[f]; to check whether samplelist[f] was empty, but it was not. I tried replacing the samplelist[f] argument with the path variable, but that does not work either. I tested the PDF conversion on a smaller scale and it worked (see below):
string PDFDoc = @"C:\Users\****\Desktop\Test\Test\Test.PDF";
string XPSDoc = @"C:\Users\****\Desktop\Test\Test\Test.xps";
//Convert PDF file to XPS file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(PDFDoc);
doc.SaveToFile(XPSDoc, FileFormat.XPS);
doc.Close();
//Convert XPS file to PDF
PdfDocument doc2 = new PdfDocument();
doc2.LoadFromFile(XPSDoc, FileFormat.XPS);
doc2.SaveToFile(PDFDoc, FileFormat.PDF);
doc2.Close();
I would like to understand why I am getting this error and how to fix it.
There are two possible solutions to the problem you are facing.
The first is to load the file into a Document object rather than a PdfDocument, and then call SaveToFile, something like this:
Document document = new Document();
//Load a Document in document Object
document.SaveToFile("Sample.pdf", FileFormat.PDF);
The second is to load the file through a stream, something like this:
PdfDocument doc = new PdfDocument();
//Load PDF file from stream.
FileStream from_stream = File.OpenRead(samplelist[f]);
//Make sure samplelist[f] is the complete path of the file, including the extension.
doc.LoadFromStream(from_stream);
//Save the PDF document.
doc.SaveToFile(samplelist[f] + ".pdf", FileFormat.PDF);
The second approach is the easier one, but I would recommend the first, because a Document object should convert more faithfully than a stream: it carries the sections, paragraphs, page setup, text, and fonts needed for better or exact formatting.
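Separately from the loading error, note that the question's loop condition f < samplelist.Length - 1 skips the last file in each folder, and the paired EndsWith checks miss mixed-case extensions such as .Pdf. A minimal sketch of a case-insensitive check using only System.IO follows; HasExtension is a hypothetical helper name, and the file names are made up for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper: true when the path has the given extension, ignoring case.
static bool HasExtension(string path, string extension)
{
    return string.Equals(Path.GetExtension(path), extension, StringComparison.OrdinalIgnoreCase);
}

// Made-up file names standing in for the result of Directory.GetFiles(subdir).
string[] samplelist = { "a.doc", "b.PDF", "c.Pdf", "d.txt" };
int pdfCount = 0;

// Note the bound: f < samplelist.Length visits every entry, including the last one.
for (int f = 0; f < samplelist.Length; f++)
{
    if (HasExtension(samplelist[f], ".pdf"))
    {
        pdfCount++;
        Console.WriteLine("PDF: " + samplelist[f]);
    }
}
```

Comparing the value returned by Path.GetExtension also avoids false matches on names that merely end with the same letters.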
