How to convert multiple HTML pages to a single Doc in C#

I am converting a single HTML page to Doc using Spire.Doc. I need to convert multiple HTML pages from a single folder to a single Doc. How can this be done? Can anyone give me some idea, or is there any library available to achieve this?
Here is my code to convert a single HTML file to Doc:
Spire.Doc.Document document = new Spire.Doc.Document();
document.LoadFromFile(@"D:\DocFilesConvert\htmlfile.html", Spire.Doc.FileFormat.Html, XHTMLValidationType.None);
document.SaveToFile(@"D:\DocFilesConvert\docfiless.docx", Spire.Doc.FileFormat.Docx);

There seems to be no direct way to achieve this. One workaround I found is to convert each HTML document to a separate Word file, and then merge these Word files into one file.
//get HTML file paths
string[] htmlfilePaths = new string[]{
@"F:\Documents\Html\1.html",
@"F:\Documents\Html\2.html",
@"F:\Documents\Html\3.html"
};
//create Document array
Document[] docs = new Document[htmlfilePaths.Length];
for (int i = 0; i < htmlfilePaths.Length; i++)
{
//load each HTML to a separate Word file
docs[i] = new Document(htmlfilePaths[i], FileFormat.Html);
//combine these Word files in one file
if (i>=1)
{
foreach (Section sec in docs[i].Sections)
{
docs[0].Sections.Add(sec.Clone());
}
}
}
//save to a Word document
docs[0].SaveToFile("output.docx", FileFormat.Docx2013);
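Since the question asks about all pages in a single folder, the hard-coded path array above could be built from the folder instead. This is only a sketch using standard .NET file APIs; the helper name GetHtmlFilePaths is made up for illustration, and sorting by file name is an assumption about the order in which the pages should be merged.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Spire.Doc): returns the full paths of all
// .html files directly inside folderPath, sorted by file name, ignoring case.
static string[] GetHtmlFilePaths(string folderPath)
{
    return Directory.EnumerateFiles(folderPath, "*.html")
        .OrderBy(path => Path.GetFileName(path), StringComparer.OrdinalIgnoreCase)
        .ToArray();
}
```

The returned array can stand in for htmlfilePaths in the loop above. Note that the sort is a plain string sort, so 10.html comes before 2.html.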

Related

How to solve the error of Word opening in background when trying to read text from Word documents?

I'm trying to read the text from Word documents into a list and then search for a word in that text. The problem, however, is that the Word processes keep running in the Windows background after the documents are opened, even though I close the document after reading the text.
Parallel.ForEach(files, file =>
{
switch (System.IO.Path.GetExtension(file))
{
case ".docx":
List<string> Word_list = GetTextFromWord(file);
SearchForWordContent(Word_list, file);
break;
}
});
static List<string> GetTextFromWord(string direct)
{
if (string.IsNullOrEmpty(direct))
{
throw new ArgumentNullException("direct");
}
if (!File.Exists(direct))
{
throw new FileNotFoundException("direct");
}
List<string> word_List = new List<string>();
try
{
Microsoft.Office.Interop.Word.Application app =
new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);
int count = doc.Words.Count;
for (int i = 1; i <= count; i++)
{
word_List.Add(doc.Words[i].Text);
}
((_Application)app).Quit();
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("Error: " + e.Message.ToString());
}
return word_List;
}
When you use Word Interop you're actually starting the Word application and talking to it using COM. Every call, even reading a property, is an expensive cross-process call.
You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You could read those files as XML directly, you can use the Open XML SDK to read a docx file or use a library like NPOI which simplifies working with Open XML.
The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:
using (var document = WordprocessingDocument.Open(fileName, false))
{
// Words.Text is a string, so it has to be parsed rather than cast
var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}
You'll find the Open XML documentation, including the structure of Word documents, at MSDN.
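For the "read those files as XML directly" option mentioned above, here is a minimal sketch that uses only System.IO.Compression and LINQ to XML, without Word or the Open XML SDK. It assumes the main part is stored at word/document.xml (the usual location, though the package relationships are what formally define it), and the method name GetParagraphTextsFromDocx is made up for illustration. It returns one string per paragraph rather than per word.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns the text of each w:p paragraph in a .docx stream.
// Assumes the main document part lives at word/document.xml.
static List<string> GetParagraphTextsFromDocx(Stream docxStream)
{
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    using var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true);
    var entry = archive.GetEntry("word/document.xml");
    if (entry == null)
    {
        throw new InvalidDataException("word/document.xml not found in package");
    }
    using Stream partStream = entry.Open();
    XDocument documentXml = XDocument.Load(partStream);
    // Concatenate the w:t runs inside each paragraph.
    return documentXml.Descendants(w + "p")
        .Select(p => string.Concat(p.Descendants(w + "t").Select(t => t.Value)))
        .ToList();
}
```

Because no Word process is started, nothing is left running in the background. Paragraphs nested inside other paragraphs (for example in text boxes) would be reported twice by this simple query.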
Avoiding Owner Files
Word or Excel files that start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing and contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, e.g. in a shared folder.
To avoid these one only needs to check whether the filename starts with ~.
If fileName is only the file name and extension, fileName.StartsWith("~") is enough.
If fileName is an absolute path, use Path.GetFileName(fileName).StartsWith("~").
Things get trickier when trying to filter such files in a folder. The patterns used in Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters. The files will have to be filtered after the call to EnumerateFiles, e.g.:
var dir=new DirectoryInfo(folderPath);
foreach(var file in dir.EnumerateFiles("*.docx"))
{
if (!file.Name.StartsWith("~"))
{
...
}
}
or, using LINQ :
var dir=new DirectoryInfo(folderPath);
var files=dir.EnumerateFiles("*.docx")
.Where(file=>!file.Name.StartsWith("~"));
foreach(var file in files)
{
...
}
Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be used to skip over files that cannot be accessed:
var dir=new DirectoryInfo(folderPath);
var options=new EnumerationOptions
{
IgnoreInaccessible =true
};
var files=dir.EnumerateFiles("*.docx",options)
.Where(file=>!file.Name.StartsWith("~"));

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

I have a .NET Core 3 app that reads and splits a PDF containing paychecks of some companies I work for.
This app ran pretty well until the latest builds; now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words, no special chars, a few tables and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paycheck = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Is there anything I can do?
As far as I can see, the only and best way to read PDF text for free is iText7...

iTextSharp 7 PdfTextExtractor.GetTextFromPage returns non readable data

I'm trying to evaluate various libraries to read PDF files. One of them is iText 7, the 7.0.4 .NET version to be precise. Most files work just fine, but there is at least one file I tested where iText simply returns gibberish.
This is my code:
private void PdfToText(FileInfo pdfFileInfo, FileInfo textFileInfo)
{
var textFile = new StreamWriter(textFileInfo.FullName);
var pdfDocument = new PdfDocument(new PdfReader(pdfFileInfo.FullName));
var strategy = new LocationTextExtractionStrategy();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, strategy);
textFile.Write(text);
}
pdfDocument.Close();
textFile.Close();
}
The resulting file starts with this in hex (and goes on like this as well):
Other libraries can extract the text from this file just fine. Selecting the text in Foxit Reader and then copy-pasting it into Notepad++ also gives me readable text.
I'm sorry, but I can't provide the PDF in question, since it contains confidential data.
Any idea how to fix this?

How to access OpenXML content by page number?

Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = @"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
int pageCount = 0;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
for (int i = 1; i <= pageCount; i++)
{
//Read the content by page number
}
}
}
MSDN Reference
Update 1:
It looks like page breaks are set as below:
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
<w:r>
<w:br w:type="page" />
</w:r>
</w:p>
So now I need to split the XML on the above check and take InnerText for each part, which will give me page-wise text.
Now the question becomes: how can I split the XML on the above check?
Update 2:
Page break elements are present only where there are explicit page breaks. If text flows from one page onto the next, no page break XML element is set, so it comes back to the same challenge of how to identify the page separations.
You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to line break and pagination algorithms which are implementation dependent; it is not intrinsic to the OOXML data. There is nothing to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:
By definition, w:lastRenderedPageBreak position is stale when content has been changed since last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances including
when table spans two pages
when next page starts with an empty paragraph
for multi-column layouts with text boxes starting a new column
for large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = @"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
int pageCount = 0;
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
int i = 1;
StringBuilder pageContentBuilder = new StringBuilder();
foreach (var element in body.ChildElements)
{
if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
{
pageContentBuilder.Append(element.InnerText);
}
else
{
pageviseContent.Add(i, pageContentBuilder.ToString());
i++;
pageContentBuilder = new StringBuilder();
}
if (body.LastChild == element && pageContentBuilder.Length > 0)
{
pageviseContent.Add(i, pageContentBuilder.ToString());
}
}
}
}
Downside: This won't work in all scenarios. It works only when you have an explicit page break; if text extends from page 1 to page 2, there is no identifier to tell you that you are on page two.
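A variation on the approach above, sketched with LINQ to XML instead of the Open XML SDK. Walking the w:t and w:br elements in document order keeps the text of the paragraph that holds the break, which the string check on InnerXml skips. It still sees only hard page breaks, so the same downside applies. The method name SplitTextAtHardPageBreaks is made up for illustration, and the input is assumed to be the parsed word/document.xml part.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: splits the document text at every <w:br w:type="page"/>.
// Soft (rendered) page breaks are not detected.
static List<string> SplitTextAtHardPageBreaks(XDocument documentXml)
{
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    XName textName = w + "t";
    XName breakName = w + "br";
    XName typeName = w + "type";

    var pages = new List<string>();
    var current = new StringBuilder();
    // Descendants() yields elements in document order.
    foreach (XElement element in documentXml.Descendants())
    {
        if (element.Name == textName)
        {
            current.Append(element.Value);
        }
        else if (element.Name == breakName && element.Attribute(typeName)?.Value == "page")
        {
            pages.Add(current.ToString());
            current.Clear();
        }
    }
    // Whatever follows the last break is the final chunk (possibly empty).
    pages.Add(current.ToString());
    return pages;
}
```

The list index plus one corresponds to the key used in pageviseContent above.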
Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, docx does not contain reliable page number information. The XML files carry no page numbers until Microsoft Word opens the document and renders it dynamically. Reading the Open XML documentation, such as https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change this.
You can unzip some docx files and search for "page" or "pg" to see this for yourself. I did this on different kinds of docx files in my situation, and they all told me the same thing. Glad if this helps.
List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> PageParagraphs = Allparagraphs.Where (x=>x.Descendants<LastRenderedPageBreak>().Count() ==1) .Select(x => x).Distinct().ToList();
Rename the docx to zip.
Open the docProps\app.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal</Template>
<TotalTime>0</TotalTime>
<Pages>1</Pages>
<Words>141</Words>
<Characters>809</Characters>
<Application>Microsoft Office Word</Application>
<DocSecurity>0</DocSecurity>
<Lines>6</Lines>
<Paragraphs>1</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant>
<vt:lpstr>Название</vt:lpstr>
</vt:variant>
<vt:variant>
<vt:i4>1</vt:i4>
</vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="1" baseType="lpstr">
<vt:lpstr/>
</vt:vector>
</TitlesOfParts>
<Company/>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>949</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>14.0000</AppVersion>
</Properties>
The Open XML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are created only by the WinWord application. If the Word document has changed, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is not up to date. If the Word document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
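Because the part may be missing, a defensive read helps. The sketch below reads the same <Pages> value straight from docProps/app.xml with standard .NET APIs and returns null when the part or the element is absent. The method name TryReadStoredPageCount is made up for illustration, and the returned value is only what was stored at the last save, with the staleness caveat described above.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper: returns the stored <Pages> value from docProps/app.xml,
// or null when the part or the element is missing or not a number.
static int? TryReadStoredPageCount(Stream docxStream)
{
    XNamespace ep = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
    using var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true);
    var entry = archive.GetEntry("docProps/app.xml");
    if (entry == null)
    {
        return null; // e.g. a document generated without extended properties
    }
    using Stream partStream = entry.Open();
    var pagesElement = XDocument.Load(partStream).Root?.Element(ep + "Pages");
    if (pagesElement != null && int.TryParse(pagesElement.Value, out int pageCount))
    {
        return pageCount;
    }
    return null;
}
```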

WebSupergoo ABCPDF Automatic pdf generation - Adding pages on the fly?

I have a question regarding building dynamic PDF documents with ABCPDF.dll.
I understand the basics and have a solid solution working. I have a new requirement where I need to dynamically add pages to a PDF doc.
Specifically, my PDF doc is a two pager. The second page needs to be a separate PDF file where one or more pages will be added by the user.
I've looked at the docs and code samples and see an AddPage() method. It doesn't seem like this would work for my need.
Here is a code sample:
void Page_Load( object sender, System.EventArgs e )
{
int theID = 0;
string theText = "This PDF file is generated by WebSupergoo ABCpdf.NET on the fly";
Doc theDoc = new Doc();
theDoc.Width = 4;
theDoc.FontSize = 32;
theDoc.Rect.Inset( 20, 20 );
theDoc.FrameRect();
theID = theDoc.AddHtml( theText );
while ( theDoc.GetInfo( theID, "Truncated" ) == "1" )
{
theDoc.Page = theDoc.AddPage();
theDoc.FrameRect();
theID = theDoc.AddHtml( "", theID );
}
theDoc.Save( Server.MapPath( "textflow.pdf" ) );
theDoc.Clear();
Response.Write( "PDF file written<br>" );
Response.Write( "View PDF File" );
}
Can someone suggest a method for adding pages to a PDF document using ABCpdf? The above sample may be using AddPage, but I need to specify another PDF file to dynamically add on the fly. The PDF file name can change.
Thank you.
If I'm understanding your question, you want to add a PDF to the end of a different PDF. If that is what you need, it looks like the Append method is what you need.
I believe that abcpdf allows you to merge a PDF document to the end of another. See here
