Split PDF by chapters from Table Of Contents

I'm using GemBox.Pdf and I need to extract the individual chapters of a PDF file as separate PDF files.
The first page (possibly the second page as well) contains a TOC (Table Of Contents), and I need to split the rest of the PDF pages based on it:
Also, the resulting PDF documents should be named after the chapters they contain.
I can split the PDF based on a fixed number of pages per document (I figured that out using this example):
using (var source = PdfDocument.Load("Chapters.pdf"))
{
    int pagesPerSplit = 3;
    int count = source.Pages.Count;
    // Start at index 1 to skip the TOC page.
    for (int index = 1; index < count; index += pagesPerSplit)
    {
        using (var destination = new PdfDocument())
        {
            // Stop at the last page so the final split may be shorter than pagesPerSplit.
            for (int splitIndex = 0; splitIndex < pagesPerSplit && index + splitIndex < count; splitIndex++)
                destination.Pages.AddClone(source.Pages[index + splitIndex]);
            destination.Save("Chapter " + index + ".pdf");
        }
    }
}
But I can't figure out how to read and process that TOC and split the chapters based on its items.

You should iterate through the document's bookmarks (outlines) and split it based on the bookmark destination pages.
For instance, try this:
// Requires: using System.Collections.Generic; using System.Linq;
using (var source = PdfDocument.Load("Chapters.pdf"))
{
    PdfOutlineCollection outlines = source.Outlines;
    PdfPages pages = source.Pages;

    // Map each page to its zero-based index in the document.
    Dictionary<PdfPage, int> pageIndexes = pages
        .Select((page, index) => new { page, index })
        .ToDictionary(item => item.page, item => item.index);

    for (int index = 0, count = outlines.Count; index < count; ++index)
    {
        PdfOutline outline = outlines[index];
        PdfOutline nextOutline = index + 1 < count ? outlines[index + 1] : null;

        // A chapter runs from its bookmark's page up to the next bookmark's page,
        // or to the end of the document for the last bookmark.
        int pageStartIndex = pageIndexes[outline.Destination.Page];
        int pageEndIndex = nextOutline != null ?
            pageIndexes[nextOutline.Destination.Page] :
            pages.Count;

        using (var destination = new PdfDocument())
        {
            while (pageStartIndex < pageEndIndex)
            {
                destination.Pages.AddClone(pages[pageStartIndex]);
                ++pageStartIndex;
            }

            destination.Save($"{outline.Title}.pdf");
        }
    }
}
Note, from the screenshot it seems that your chapter bookmarks include an ordinal number (in Roman numerals). If needed, you can easily remove it with something like this:
destination.Save($"{outline.Title.Substring(outline.Title.IndexOf(' ') + 1)}.pdf");
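Bookmark titles can also contain characters that are not allowed in file names (a slash, for example), which would make Save fail. The following is a minimal sketch of one way to guard against that, using only the .NET base library. ChapterFileNames and FromTitle are hypothetical names chosen for illustration, not part of GemBox.Pdf, and replacing invalid characters with an underscore is just one possible policy.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ChapterFileNames
{
    // Hypothetical helper: turns a bookmark title into a file name that is safe to save.
    public static string FromTitle(string title)
    {
        // Drop the leading ordinal ("IV. ", "2 ") when the title contains a space.
        int space = title.IndexOf(' ');
        string name = space >= 0 ? title.Substring(space + 1) : title;

        // Replace characters the current platform does not allow in file names.
        char[] invalid = Path.GetInvalidFileNameChars();
        string safe = new string(name.Select(c => invalid.Contains(c) ? '_' : c).ToArray());

        return safe.Trim() + ".pdf";
    }
}
```

With this in place, the save call could become destination.Save(ChapterFileNames.FromTitle(outline.Title)).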

Related

How to get and set value for /BSIColumnData of an annotation PDF itext c#

How can I get and set the value of /BSIColumnData for an annotation (markup) in a PDF using iText in C#, as in the attached file?
I'm using the iText 7 code below, but it fails at BSIColumnData:
public void BSIcontents ()
{
string pdfPath = @"C:\test PDF.pdf";
iText.Kernel.Pdf.PdfReader pdfReader = new iText.Kernel.Pdf.PdfReader(pdfPath);
iText.Kernel.Pdf.PdfDocument pdfDoc = new iText.Kernel.Pdf.PdfDocument(pdfReader);
int numberOfPages = pdfDoc.GetNumberOfPages();
int z = 0;
for (int i = 1; i <= numberOfPages; i++)
{
iText.Kernel.Pdf.PdfDictionary page = pdfDoc.GetPage(i).GetPdfObject();
iText.Kernel.Pdf.PdfArray annotArray = page.GetAsArray(iText.Kernel.Pdf.PdfName.Annots);
if (annotArray == null)
{
z++;
continue;
}
int size = annotArray.Size();
for (int x = 0; x < size; x++)
{
iText.Kernel.Pdf.PdfDictionary curAnnot = annotArray.GetAsDictionary(x);
if (curAnnot != null)
{
if (curAnnot.GetAsString(iText.Kernel.Pdf.PdfName.BSIColumnData) != null)
{
MessageBox.Show("BSIColumnData: " + curAnnot.GetAsString(iText.Kernel.Pdf.PdfName.BSIColumnData).ToString());
}
}
}
}
pdfReader.Close();
}
In Bluebeam Revu, you can see as below:
In Itext-rups 5.5.9, you can see as below:

I see two errors:
You try to use the BSIColumnData name like this:
iText.Kernel.Pdf.PdfName.BSIColumnData
This assumes that there is already a static PdfName member for your custom name. But of course there isn't, there only are predefined members for standard names used in itext itself. If you want to work with other names, you have to create a PdfName instance yourself and use that instance, e.g. like this
var BSIColumnData = new iText.Kernel.Pdf.PdfName("BSIColumnData");
You try to retrieve the value of that name as string
curAnnot.GetAsString(iText.Kernel.Pdf.PdfName.BSIColumnData)
but it is clear from your RUPS screenshot that the value of that name is an array of strings. Thus, even after correcting as described in the first item GetAsString(BSIColumnData) will return null. Instead do
var BSIColumnData = new iText.Kernel.Pdf.PdfName("BSIColumnData");
var array = curAnnot.GetAsArray(BSIColumnData);
After checking if (array != null) you now can access the strings at their respective indices using array.GetAsString(index).

c# HtmlAgilityPack for on nodes array

I'm using HtmlAgilityPack, and I have an array of nodes:
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//tbody[@class='table']").ToArray();
Now I want to run a for loop over each nodes[i]. I've tried this:
for (int i = 0; i < 1; i++)
{
    if (t == null)
        t = new Model.Track();
    HtmlNode[] itemText = nodes[i].SelectNodes("//td[@class='artist']").ToArray();
    for (int x = 0; x < itemText.Length; x++)
    {
        // doing something
    }
}
The problem is that the itemText array isn't scoped to nodes[i]; instead it contains every td[@class='artist'] in the whole HTML document.
Any help?

Using //td[@class='artist'] will fetch all columns with the artist class from your document.DocumentNode.
Using .//td[@class='artist'] (notice the dot at the beginning) will fetch all columns with the artist class from the currently selected node, which in your case is nodes[i].
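The difference between the two expressions is standard XPath behaviour and can be reproduced without HtmlAgilityPack. The sketch below uses System.Xml's XmlDocument as a stand-in (an assumption made so the example has no external dependency); XPathScopeDemo and CountFromFirstBody are hypothetical names, and the markup is invented sample data.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Counts the nodes matched by the given XPath when it is evaluated
    // with the first <tbody> as the context node.
    public static int CountFromFirstBody(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<table>" +
            "<tbody class='table'><tr><td class='artist'>A</td></tr></tbody>" +
            "<tbody class='table'><tr><td class='artist'>B</td></tr>" +
            "<tr><td class='artist'>C</td></tr></tbody>" +
            "</table>");

        XmlNode firstBody = doc.SelectSingleNode("//tbody[@class='table']");
        return firstBody.SelectNodes(xpath).Count;
    }
}
```

Here CountFromFirstBody("//td[@class='artist']") matches all three cells in the document, while CountFromFirstBody(".//td[@class='artist']") matches only the single cell inside the first tbody.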

StreamWriter C# formatting output

Problem Statement
In order to run gene annotation software, I need to prepare two types of files, vcard files and coverage tables, and there has to be a one-to-one match between vcard files and coverage tables. Since I'm running 2k samples, it's hard to identify which files lack a match. Both kinds of file carry a unique identifier number, so if the two folders contain files with the same number, I treat them as the "same" file.
I made a program that compares two folders and reports the entries unique to each folder. To do so, I build two lists, each containing the file names unique to one directory.
I want to format the report file (tab delimited .txt file) such that it looks something like below:
Unique in fdr1    Unique in fdr2
file x            file a
file y            file b
file z            file c
I find this difficult to do because I have to iterate twice (since I have two lists), but there is no way of going back to the previous line in StreamWriter as far as I know. Basically, once I iterate through the first list and fill the first column, how can I fill the second column with the second list?
Can someone help me out with this?
Thanks
If design of the code has to change (i.e. one list instead of two), please let me know
As requested by some user, this is how I was going to do (not working version)
// Write report
using (StreamWriter sw = new StreamWriter(dest_txt.Text + @"\" + "Report.txt"))
{
// Write headers
sw.WriteLine("Unique Entries in Folder1" + "\t" + "Unique Entries in Folder2");
// Write unique entries in fdr1
foreach(string file in fdr1FileList)
{
sw.WriteLine(file + "\t");
}
// Write unique entries in fdr2
foreach (string file in fdr2FileList)
{
sw.WriteLine(file + "\t");
}
sw.Dispose();
}
As requested for my approach for finding unique entries, here's my code snippet
Dictionary<int, bool> fdr1Dict = new Dictionary<int, bool>();
Dictionary<int, bool> fdr2Dict = new Dictionary<int, bool>();
List<string> fdr1FileList = new List<string>();
List<string> fdr2FileList = new List<string>();
string fdr1Path = folder1_txt.Text;
string fdr2Path = folder2_txt.Text;
// File names in the specified directory; path not included
string[] fdr1FileNames = Directory.GetFiles(fdr1Path).Select(Path.GetFileName).ToArray();
string[] fdr2FileNames = Directory.GetFiles(fdr2Path).Select(Path.GetFileName).ToArray();
// Iterate through the first directory, and add GL number to dictionary
for(int i = 0; i < fdr1FileNames.Length; i++)
{
// Grabs only the number from the file name
string number = Regex.Match(fdr1FileNames[i], @"\d+").ToString();
int glNumber;
// Make sure it is a number
if(Int32.TryParse(number, out glNumber))
{
fdr1Dict[glNumber] = true;
}
// If number not present, raise exception
else
{
throw new Exception(String.Format("GL Number not found in: {0}", fdr1FileNames[i]));
}
}
// Iterate through the second directory, and add GL number to dictionary
for (int i = 0; i < fdr2FileNames.Length; i++)
{
// Grabs only the number from the file name
string number = Regex.Match(fdr2FileNames[i], @"\d+").ToString();
int glNumber;
// Make sure it is a number
if (Int32.TryParse(number, out glNumber))
{
fdr2Dict[glNumber] = true;
}
// If number not present, raise exception
else
{
throw new Exception(String.Format("GL Number not found in: {0}", fdr2FileNames[i]));
}
}
// Iterate through the first directory, and find files that are unique to it
for (int i = 0; i < fdr1FileNames.Length; i++)
{
int glNumber = Int32.Parse(Regex.Match(fdr1FileNames[i], @"\d+").Value);
// If same file is not present in the second folder add to the list
// (ContainsKey avoids a KeyNotFoundException when the number is missing)
if(!fdr2Dict.ContainsKey(glNumber))
{
fdr1FileList.Add(fdr1FileNames[i]);
}
}
// Iterate through the second directory, and find files that are unique to it
for (int i = 0; i < fdr2FileNames.Length; i++)
{
int glNumber = Int32.Parse(Regex.Match(fdr2FileNames[i], @"\d+").Value);
// If same file is not present in the first folder add to the list
// (ContainsKey avoids a KeyNotFoundException when the number is missing)
if (!fdr1Dict.ContainsKey(glNumber))
{
fdr2FileList.Add(fdr2FileNames[i]);
}

I am quite confident that this will work, as I've tested it:
static void Main(string[] args)
{
var firstDir = #"Path1";
var secondDir = #"Path2";
var firstDirFiles = System.IO.Directory.GetFiles(firstDir);
var secondDirFiles = System.IO.Directory.GetFiles(secondDir);
print2Dirs(firstDirFiles, secondDirFiles);
}
private static void print2Dirs(string[] firstDirFile, string[] secondDirFiles)
{
var maxIndex = Math.Max(firstDirFile.Length, secondDirFiles.Length);
using (StreamWriter streamWriter = new StreamWriter("result.txt"))
{
streamWriter.WriteLine(string.Format("{0,-150}{1,-150}", "Unique in fdr1", "Unique in fdr2"));
for (int i = 0; i < maxIndex; i++)
{
streamWriter.WriteLine(string.Format("{0,-150}{1,-150}",
firstDirFile.Length > i ? firstDirFile[i] : string.Empty,
secondDirFiles.Length > i ? secondDirFiles[i] : string.Empty));
}
}
}
It's quite simple code, but if you need help understanding it just let me know :)

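I would construct one line at a time. Something like this:
int row = 0;
// Placeholders: substitute your own lists of unique file names here.
string[] fdr1FileList = new string[0];
string[] fdr2FileList = new string[0];
using (StreamWriter sw = new StreamWriter("Report.txt"))
{
    while (row < fdr1FileList.Length || row < fdr2FileList.Length)
    {
        string rowText = "";
        // Left column, or an empty cell once the first list runs out.
        rowText += (row >= fdr1FileList.Length ? "\t" : fdr1FileList[row] + "\t");
        // Right column, or an empty cell once the second list runs out.
        rowText += (row >= fdr2FileList.Length ? "\t" : fdr2FileList[row]);
        // Write the finished row; the original snippet built it but never wrote it.
        sw.WriteLine(rowText);
        row++;
    }
}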

Try something like this:
static void Main(string[] args)
{
Dictionary<int, string> fdr1Dict = FilesToDictionary(Directory.GetFiles("path1"));
Dictionary<int, string> fdr2Dict = FilesToDictionary(Directory.GetFiles("path2"));
var unique_f1 = fdr1Dict.Where(f1 => !fdr2Dict.ContainsKey(f1.Key)).ToArray();
var unique_f2 = fdr2Dict.Where(f2 => !fdr1Dict.ContainsKey(f2.Key)).ToArray();
int f1_size = unique_f1.Length;
int f2_size = unique_f2.Length;
int list_length = 0;
if (f1_size > f2_size)
{
list_length = f1_size;
Array.Resize(ref unique_f2, list_length);
}
else
{
list_length = f2_size;
Array.Resize(ref unique_f1, list_length);
}
using (StreamWriter writer = new StreamWriter("output.txt"))
{
writer.WriteLine(string.Format("{0,-30}{1,-30}", "Unique in fdr1", "Unique in fdr2"));
for (int i = 0; i < list_length; i++)
{
writer.WriteLine(string.Format("{0,-30}{1,-30}", unique_f1[i].Value, unique_f2[i].Value));
}
}
}
static Dictionary<int, string> FilesToDictionary(string[] filenames)
{
Dictionary<int, string> dict = new Dictionary<int, string>();
for (int i = 0; i < filenames.Length; i++)
{
int glNumber;
string filename = Path.GetFileName(filenames[i]);
string number = Regex.Match(filename, @"\d+").ToString();
if (int.TryParse(number, out glNumber))
dict.Add(glNumber, filename);
}
return dict;
}
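The answers above pad the columns to a fixed width, whereas the question asks for a tab-delimited file. As a minimal sketch of the tab-delimited variant: the idea is the same row-by-row loop, driven by the longer of the two lists. TwoColumnReport and Write are hypothetical names; the method takes a TextWriter so it works with a StreamWriter as well as an in-memory writer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TwoColumnReport
{
    // Writes a header and then one tab-delimited row per index,
    // leaving a cell empty once the shorter list runs out.
    public static void Write(TextWriter writer, IList<string> left, IList<string> right)
    {
        writer.WriteLine("Unique in fdr1\tUnique in fdr2");

        int rows = Math.Max(left.Count, right.Count);
        for (int i = 0; i < rows; i++)
        {
            string leftCell = i < left.Count ? left[i] : string.Empty;
            string rightCell = i < right.Count ? right[i] : string.Empty;
            writer.WriteLine(leftCell + "\t" + rightCell);
        }
    }
}
```

It could be called as using (var sw = new StreamWriter("Report.txt")) TwoColumnReport.Write(sw, fdr1FileList, fdr2FileList); since both List<string> and string[] implement IList<string>.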

can not correctly insert text into a bookmark from another bookmark

I'm writing a Windows Forms application which must exchange the content of Word bookmarks between two documents.
There are two similar documents (wordDocument and wordPattern) with a similar number of bookmarks. I'm trying this:
for (int i = 1; i <= wordDocument.Bookmarks.Count; i++)
{
object j = i;
wordDocument.Bookmarks.get_Item(ref j).Range.Text = wordPattern.Bookmarks.get_Item(ref j).Range.Text.ToString();
//MessageBox.Show(wordDocument.Bookmarks[i].Range.Text);
//MessageBox.Show(wordPattern.Bookmarks[i].Range.Text);
}
But it does the task incorrectly: it processes the bookmarks in the wrong order and deletes them. Please show me the right way to exchange the text inside the bookmarks.

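// Declaration missing from the original snippet; requires using System.Collections.Generic;
List<Word.Range> listOfRanges = new List<Word.Range>();
int count1 = 0;
int count2 = 0;
// First collect the range of every bookmark in wordDocument.
foreach (Word.Bookmark bookmark1 in wordDocument.Bookmarks)
{
    Word.Range bmRange = bookmark1.Range;
    //bmRange.Text = "note" + count1;
    listOfRanges.Add(bmRange);
    count1++;
}
// Then copy the collected text into the bookmarks of wordPattern, in the same order.
foreach (Word.Bookmark bookmark2 in wordPattern.Bookmarks)
{
    Word.Range mbRange = bookmark2.Range;
    mbRange.Text = listOfRanges[count2].Text;
    count2++;
}
I solved it this way.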

How to Merge items within a List<> collection C#

I have an implementation where I need to loop through a collection of documents and merge them based on a certain condition.
The merge condition is very simple: if the present document's doctype is the same as the later document's doctype, copy all the pages from the later document, append them to the present document's pages, and remove the later document from the collection.
Note: Both response.documents and response.documents[].pages are List<> collections.
I was trying this, but I get the following exception once I remove a document:
Collection was modified; enumeration operation may not execute.
Here is the code:
int docindex = 0;
foreach( var document in response.documents)
{
string presentDoctype = string.Empty;
string laterDoctype = string.Empty;
presentDoctype = response.documents[docindex].doctype;
laterDoctype = response.documents[docindex + 1].doctype;
if (laterDoctype == presentDoctype)
{
response.documents[docindex].pages.AddRange(response.documents[docindex + 1].pages);
response.documents.RemoveAt(docindex + 1);
}
docindex = docindex + 1;
}
Example:
response.documents[0].doctype = "BankStatement" //page count = 1
response.documents[1].doctype = "BankStatement" //page count = 2
response.documents[2].doctype = "BankStatement" //page count = 2
response.documents[3].doctype = "BankStatement" //page count = 1
response.documents[4].doctype = "BankStatement" //page count = 4
Expected result:
response.documents[0].doctype = "BankStatement" //page count = 10
Please suggest a solution. I appreciate your help.

I would recommend you to look at LINQ GroupBy and Distinct to process your response.documents
Example (as I cannot use your class, I give example using my own defined class):
Suppose you have DummyClass
public class DummyClass {
public int DummyInt;
public string DummyString;
public double DummyDouble;
public DummyClass() {
}
public DummyClass(int dummyInt, string dummyString, double dummyDouble) {
DummyInt = dummyInt;
DummyString = dummyString;
DummyDouble = dummyDouble;
}
}
Then doing GroupBy as shown,
DummyClass dc1 = new DummyClass(1, "This dummy", 2.0);
DummyClass dc2 = new DummyClass(2, "That dummy", 2.0);
DummyClass dc3 = new DummyClass(1, "These dummies", 2.0);
DummyClass dc4 = new DummyClass(2, "Those dummies", 2.0);
DummyClass dc5 = new DummyClass(3, "The dummies", 2.0);
List<DummyClass> dummyList = new List<DummyClass>() { dc1, dc2, dc3, dc4, dc5 };
var groupedDummy = dummyList.GroupBy(x => x.DummyInt).ToList();
Will create three groups, marked by DummyInt
Then to process the group you could do
for (int i = 0; i < groupedDummy.Count; ++i) {
    foreach (DummyClass dummy in groupedDummy[i]) { // iterates over the members of the group at index i
        // do something with this group member
        // groupedDummy[0] will consist of "This dummy" and "These dummies", [1] of "That dummy" and "Those dummies", and [2] of "The dummies"
        // Try it out!
    }
}
In your case, you should create group based on doctype.
Once you create groups based on your doctype, everything else would be pretty "natural" for you to continue.
Another LINQ method which you might be interested in would be Distinct. But I think for this case, GroupBy would be the primary method you would like to use.

Use a plain "for loop" instead of "foreach".
foreach enumerates the collection, and a List<> cannot be modified while it is being enumerated.
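As a minimal sketch of what an index-based version could look like: Doc is a hypothetical stand-in for the asker's document type (only doctype and pages are taken from the question), and DocMerger.MergeAdjacent is an invented name. It is written as a while loop so that the index advances only when nothing was removed, which lets a run of several matching documents collapse into the first one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's document type.
public class Doc
{
    public string doctype;
    public List<string> pages = new List<string>();
}

public static class DocMerger
{
    // Merges each document into the preceding one when both share a doctype.
    public static void MergeAdjacent(List<Doc> documents)
    {
        int i = 0;
        while (i + 1 < documents.Count)
        {
            if (documents[i].doctype == documents[i + 1].doctype)
            {
                documents[i].pages.AddRange(documents[i + 1].pages);
                documents.RemoveAt(i + 1);
                // Do not advance: the new neighbour at i + 1 may match as well.
            }
            else
            {
                i++;
            }
        }
    }
}
```

Because the loop bound is i + 1 < documents.Count, it also avoids reading past the end of the list, which the docindex + 1 access in the question would eventually do.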

Here is an example using GroupBy; hope this helps.
//mock a collection
ICollection<string> collection1 = new List<string>();
for (int i = 0; i < 10; i++)
{
collection1.Add("BankStatement");
}
for (int i = 0; i < 5; i++)
{
collection1.Add("BankStatement2");
}
for (int i = 0; i < 4; i++)
{
collection1.Add("BankStatement3");
}
//merge and get count
var result = collection1.GroupBy(c => c).Select(c => new { name = c.First(), count = c.Count().ToString() }).ToList();
foreach (var item in result)
{
Console.WriteLine(item.name + ": " + item.count);
}

Just use AddRange()
response.documents[0].pages.AddRange(response.documents[1].pages);
This appends all pages of documents[1] to the pages of documents[0].
