C# OpenXml Get DOCX WordStyle Property Simplified Code

Just curious whether there is a simpler way to check if a given body element has the Word style "Heading3" applied than this sample C# code, which I wrote while learning the OpenXML library. To be clear: given a body element, how can I determine which Word style is applied to it? I eventually have to write a program that processes numerous .DOCX files, and I need to process each of them from top to bottom.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;
namespace docxparsing
{
class Program
{
static void Main()
{
string file_to_parse = @"C:\temp\sample.docx";
WordprocessingDocument doc = WordprocessingDocument.Open(file_to_parse,false);
Body body = doc.MainDocumentPart.Document.Body;
string fooStr;
foreach( var foo in body )
{
fooStr = foo.InnerXml;
/*
these 2 comments represent 2 different xml snippets from 'fooStr'. the only way i figure out how to get the word style is by reading
this xml and doing checks for values. i don't know of any other approach in using the body element to check for the applied word style
<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:pStyle w:val="Heading2" />
<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:pStyle w:val="Heading3" />
*/
bool hasHeading3 = fooStr.Contains("pStyle w:val=\"Heading3\"");
if ( hasHeading3 )
{
Console.WriteLine("heading3 found");
}
}
doc.Close();
}
}
}
// -------------------------------------------------------------------------------
EDIT
Here is updated code of one way to do this. Still not overall happy with it but it works. Function to look at is getWordStyleValue(string x)
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
namespace docxparsing
{
class Program
{
// ************************************************
// grab the word style value
// ************************************************
static string getWordStyleValue(string x)
{
int p = 0;
p = x.IndexOf("w:pStyle w:val=");
if ( p == -1 )
{
return "";
}
p = p + 15;
StringBuilder sb = new StringBuilder();
while (true)
{
p++;
char c = x[p];
if (c != '"')
{
sb.Append(c);
}
else
{
break;
}
}
string s = sb.ToString();
return s;
}
// ************************************************
// Main
// ************************************************
static void Main(string[] args)
{
string theFile = @"C:\temp\sample.docx";
WordprocessingDocument doc = WordprocessingDocument.Open(theFile,false);
string body_table = "DocumentFormat.OpenXml.Wordprocessing.Table";
string body_paragraph = "DocumentFormat.OpenXml.Wordprocessing.Paragraph";
Body body = doc.MainDocumentPart.Document.Body;
StreamWriter sw1 = new StreamWriter("paragraphs.log");
foreach (var b in body)
{
string body_type = b.ToString();
if (body_type == body_paragraph)
{
string str = getWordStyleValue(b.InnerXml);
if (str == "" || str == "HeadingNon-TOC" || str == "TOC1" || str == "TOC2" || str == "TableofFigures" || str == "AcronymList" )
{
continue;
}
sw1.WriteLine(str + "," + b.InnerText);
}
if ( body_type == body_table )
{
// sw1.WriteLine("Table:\n{0}",b.InnerText);
}
}
doc.Close();
sw1.Close();
}
}
}

Yes. You could do something like this:
bool ContainsHeading3 = body.Descendants<ParagraphStyleId>().Any(psId => psId.Val == "Heading3");
This looks at all the ParagraphStyleId elements (w:pStyle in the XML) and checks whether any of them has a Val of Heading3.
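For the top-to-bottom pass the question describes, the style can also be read per paragraph through the SDK's typed properties instead of scanning InnerXml. This is a minimal sketch, assuming the DocumentFormat.OpenXml package is referenced. StyleReader, GetStyleId and StyledParagraphs are hypothetical names, and a paragraph without an explicit w:pStyle yields null.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Wordprocessing;

public static class StyleReader
{
    // Returns the paragraph's style id (the w:val of w:pStyle), or null when none is set.
    public static string GetStyleId(Paragraph paragraph)
    {
        return paragraph.ParagraphProperties?.ParagraphStyleId?.Val?.Value;
    }

    // Walks the body's direct children in document order and pairs each
    // paragraph's style id (key) with its text (value).
    public static IEnumerable<KeyValuePair<string, string>> StyledParagraphs(Body body)
    {
        foreach (var element in body.ChildElements)
        {
            if (element is Paragraph paragraph)
            {
                yield return new KeyValuePair<string, string>(
                    GetStyleId(paragraph), paragraph.InnerText);
            }
        }
    }
}
```

Called as StyleReader.StyledParagraphs(doc.MainDocumentPart.Document.Body), this yields the (style id, text) pairs in document order. A check such as `element is Table` could be used the same way in place of comparing ToString() values.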

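Just pasting this edit from the original post so that it has better visibility.
Here is one solution I came up with. Yes, it is a little "cody" (if that is a word), but I am working on using LINQ (my fav) to get a more elegant solution. The code is the same as in the EDIT section of the question above; the function to look at is getWordStyleValue(string x).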

Related

How do I parse and find a string in a C# script?

I've been trying to parse and search for a specific word in a big string, but I can't seem to be able to figure it out. I have created a script that connects a Twitch Channel's chat into unity.
An example of a message would be:
"@badge-info=subscriber/4;badges=moderator/1,subscriber/3,bits/1;bits=1;color=;display-name=TwitchUser1234;emotes=;flags=;id=da6ec4c6-af61-4346-abc-123456789;mod=1;room-id=12345678;subscriber=1;tmi-sent-ts=160987654321;turbo=0;user-id=123456789;user-type=mod :TwitchUser1234@TwitchUser1234.tmi.twitch.tv PRIVMSG #thechannelyouarewatching :PogChamp1 Another Test Bit"
I tried parsing the message and searching for the string 'bits' by doing:
private void GameInputs(string ChatInputs)
{
string Search;
Search = ChatInputs.Split(";", "=");
if(string "bits" in Search)
{
print("I made it here.");
}
}
I'm at a complete loss and have no idea how to do this. Any help is appreciated.
If my full code is needed it is:
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System;
using System.ComponentModel;
using System.Net.Sockets;
using System.IO;
public class TwitchChat : MonoBehaviour
{
private TcpClient twitchClient;
private StreamReader reader;
private StreamWriter writer;
public string username, password, channelName; // http://twitchapps.com/tmi
// Start is called before the first frame update
void Start()
{
Connect();
}
void Update()
{
if(!twitchClient.Connected)
{
Connect();
}
ReadChat();
}
private void Connect()
{
twitchClient = new TcpClient("irc.chat.twitch.tv", 6667);
reader = new StreamReader(twitchClient.GetStream());
writer = new StreamWriter(twitchClient.GetStream());
writer.WriteLine("PASS " + password);
writer.WriteLine("NICK " + username);
writer.WriteLine("USER " + username + " 8 * :" + username);
writer.WriteLine("JOIN #" + channelName);
writer.WriteLine("CAP REQ :twitch.tv/tags");
writer.Flush();
}
private void ReadChat()
{
if (twitchClient.Available > 0)
{
var message = reader.ReadLine();
print(message);
GameInputs(message);
}
}
private void GameInputs(string ChatInputs)
{
string Search;
Search = ChatInputs.Split(";", "=");
if(string "bits" in Search)
{
print("I made it here.");
}
}
}
If you want to pull the value of "bits=xx" out, this would do it:
var b = value.Split(';').FirstOrDefault(s => s.StartsWith("bits="))?[5..];
b will be null if "bits=" is not present
If you're going to parse a lot of values out of this string consider turning it into a dictionary:
var c = new []{'='};
var d = value.Split(';').ToDictionary(s => s.Split(c,2)[0], s => s.Split(c,2)[1]);
It's slightly inefficient to split twice. If that bothers you, you can take substrings instead:
value.Split(';').ToDictionary(s => s[..s.IndexOf('=')], s => s[s.IndexOf('=')+1..]);
This gives a dictionary of strings, so you can do something like:
if(d.ContainsKey("bits")){
var bits = int.Parse(d["bits"]);
...
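The dictionary approach can be sketched as a complete helper. It assumes the raw line starts with '@' and that the tag section ends at the first space, as in Twitch's tag format. TwitchTags, Parse and GetBits are hypothetical names. Looking up the "bits" key also avoids a false positive that a plain substring search would hit on the "bits/1" badge in the sample message.

```csharp
using System;
using System.Collections.Generic;

public static class TwitchTags
{
    // Parses the leading "@key=value;key=value" tag section of a raw IRC line.
    // Returns an empty dictionary when the line carries no tags.
    public static Dictionary<string, string> Parse(string rawLine)
    {
        var tags = new Dictionary<string, string>();
        if (string.IsNullOrEmpty(rawLine) || rawLine[0] != '@')
        {
            return tags;
        }

        // The tag section ends at the first space; the rest is the IRC message itself.
        int end = rawLine.IndexOf(' ');
        string section = end < 0 ? rawLine.Substring(1) : rawLine.Substring(1, end - 1);

        foreach (string pair in section.Split(';'))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip malformed entries that have no key
            tags[pair.Substring(0, eq)] = pair.Substring(eq + 1);
        }
        return tags;
    }

    // Returns the number of bits cheered, or 0 when the "bits" tag is absent or not numeric.
    public static int GetBits(string rawLine)
    {
        return Parse(rawLine).TryGetValue("bits", out string value)
               && int.TryParse(value, out int bits)
            ? bits
            : 0;
    }
}
```

Inside GameInputs, a check such as `if (TwitchTags.GetBits(ChatInputs) > 0)` would then replace the string search.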
String has a method Contains(string) that does the job:
if (ChatInputs.Contains("bits"))
{
print("I made it here.");
}
You can try below.
private void GameInputs(string ChatInputs)
{
string[] Search = ChatInputs.Split(new char[] { ';', '=' });
foreach(string s in Search)
{
if(s == "bits")
{
print("I made it here.");
}
}
}
Below is the working code.
using System;
namespace ConsoleApp3
{
class Program
{
static void Main(string[] args)
{
string str = "one;two;test;three;test+test";
string[] strs = str.Split(new char[] { ';', '+' });
foreach(string s in strs)
{
if(s == "test")
{
Console.WriteLine(s);
}
}
Console.ReadLine();
}
}
}

XMLReader Invalid XML Character Exception

I am parsing a big XML file (~500 MB) that contains an invalid XML character, 0x07, so the XmlReader throws an invalid XML character exception. To handle this, we currently read the Stream through a StreamReader, clean it with Regex.Replace, write the result to memory with a StreamWriter, and then stream the clean version back to the XmlReader. I would like to avoid this and skip the offending tag directly in the XmlReader. Is there any way to achieve that? Below is the code snippet where I try to do this, but it throws the exception at this line:
var node = (XElement)XNode.ReadFrom(xr);
protected override IEnumerable<XElement> StreamReader(Stream stream, string elementName)
{
var arrTag = elementName.Split('|').ToList();
using (var xr = XmlReader.Create(stream, new XmlReaderSettings { CheckCharacters = false }))
{
while (xr.Read())
{
if (xr.NodeType == XmlNodeType.Element && arrTag.Contains(xr.Name))
{
var node = (XElement)XNode.ReadFrom(xr);
node.ReplaceWith(node.Elements().Where(e => e.Name != "DaylightSaveInfo"));
yield return node;
}
}
xr.Close();
}
}
XML sample; the invalid attribute is DaylightSaveInfo:
<?xml version="1.0" encoding="ISO-8859-1"?>
<LATree>
<LA className="BTT00NE" fdn="NE=9739">
<attr name="fdn">NE=9739</attr>
<attr name="IP">10.157.144.100</attr>
<attr name="realLatitude">0D0&apos;0"S</attr>
<attr name="realLongitude">0D0&apos;0"W</attr>
<attr name="DaylightSaveInfo">NO</attr>
</LA>
</LATree>
I just saw that Jon Skeet wrote something about this, so I cannot take credit really, but since his stature on SO is way above mine, I could perhaps gain a point or two for writing it. :)
First I wrote a class that derives from the TextReader class.
(Some reference material in the links.)
https://www.w3.org/TR/xml/#NT-Char
https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/io/textreader.cs
class FilterInvalidXmlReader : System.IO.TextReader
{
private System.IO.StreamReader _streamReader;
public System.IO.Stream BaseStream => _streamReader.BaseStream;
public FilterInvalidXmlReader(System.IO.Stream stream) => _streamReader = new System.IO.StreamReader(stream);
public override void Close() => _streamReader.Close();
protected override void Dispose(bool disposing) => _streamReader.Dispose();
public override int Peek()
{
var peek = _streamReader.Peek();
while (IsInvalid(peek, true))
{
_streamReader.Read();
peek = _streamReader.Peek();
}
return peek;
}
public override int Read()
{
var read = _streamReader.Read();
while (IsInvalid(read, true))
{
read = _streamReader.Read();
}
return read;
}
public static bool IsInvalid(int c, bool invalidateCompatibilityCharacters)
{
if (c == -1)
{
return false;
}
if (invalidateCompatibilityCharacters && ((c >= 0x7F && c <= 0x84) || (c >= 0x86 && c <= 0x9F) || (c >= 0xFDD0 && c <= 0xFDEF)))
{
return true;
}
if (c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD))
{
return false;
}
return true;
}
}
Then I created a console application and in the main I put:
using (var memoryStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes("<Test><GoodAttribute>a\u0009b</GoodAttribute><BadAttribute>c\u0007d</BadAttribute></Test>")))
{
using (var xmlFilteredTextReader = new FilterInvalidXmlReader(memoryStream))
{
using (var xr = System.Xml.XmlReader.Create(xmlFilteredTextReader))
{
while (xr.Read())
{
if (xr.NodeType == System.Xml.XmlNodeType.Element)
{
var xe = System.Xml.Linq.XElement.ReadFrom(xr);
System.Console.WriteLine(xe.ToString());
}
}
}
}
}
Hopefully this could help, or at least provide some starter point.
The following XML LINQ code runs without errors. In the XML file I used the following: "NO".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
namespace ConsoleApplication108
{
class Program
{
const string FILENAME = @"c:\temp\test.xml";
static void Main(string[] args)
{
XmlReaderSettings settings = new XmlReaderSettings();
settings.CheckCharacters = false;
XmlReader reader = XmlReader.Create(FILENAME, settings);
XDocument doc = XDocument.Load(reader);
Dictionary<string, string> dict = doc.Descendants("attr")
.GroupBy(x => (string)x.Attribute("name"), y => (string)y)
.ToDictionary(x => x.Key, y => y.FirstOrDefault());
}
}
}

c# Compare arduino serial with string

I am trying to get serial data from my Arduino and compare it with a string in C#.
The Arduino code waits for a switch and then sends the corresponding string
along with it,
e.g. switch 5 makes it print "5on" when on and "5off" when off.
I know that Arduino's Serial.println(""); prints with a newline, so in C# I made the string a multi-line verbatim string with @ and added a new line, but even with this the comparison fails.
I can get the serial data into C#, but I can't compare it.
Arduino Code:
if(digitalRead(14) == 1){
pin13 = 1;
}
else if(digitalRead(14) == 0){
pin13 = 0;
}
if(digitalRead(5) == 1){
Serial.println("5on");
}
else if(digitalRead(5) == 0){
Serial.println("5off");
}
if(digitalRead(2) == 1){
Serial.println("2on");
}
else if(digitalRead(2) == 0){
Serial.println("2off");
}
if(digitalRead(12) == 1){
Serial.println("12on");
}
else if(digitalRead(12) == 0){
Serial.println("12off");
}
if(digitalRead(4) == 1){
Serial.println("4on");
}
else if(digitalRead(4) == 0){
Serial.println("4off");
}
C# Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO.Ports;
using System.Threading;
namespace test
{
class Program
{
static void Main(string[] args)
{
//SerialPort port = new SerialPort("COM3", 115200);
SerialPort port = new SerialPort("COM4", 9600);
port.Open();
while (!port.IsOpen)
{
Console.WriteLine(".");
}
if (port.IsOpen)
{
Console.WriteLine("CONNECTED");
}
int switch5onallow=0;
while (true)
{
Thread.Sleep(100);
string a = port.ReadExisting();
Console.WriteLine(a);
string switch5on = @"5on
";
string switch5off = @"5off
";
string switch2on = @"2on
";
string switch2off = @"2off
";
string switch12on = @"12on
";
string switch12off = @"12off
";
string switch4on = @"4on
";
string switch4off = @"4off
";
if (a == switch5on)
{
//System.Diagnostics.Process.Start(@".\AHK\5on.ahk");
Console.WriteLine("switch5on");
switch5onallow = 0;
}
else if (a == switch5off)
{
//System.Diagnostics.Process.Start(@".\AHK\5off.ahk");
Console.WriteLine("switch5off");
switch5onallow = 1;
}
else if (a == switch2on)
{
System.Diagnostics.Process.Start(@".\AHK\2on.ahk");
}
else if (a == switch2off)
{
System.Diagnostics.Process.Start(@".\AHK\2off.ahk");
}
else if (a == switch12on)
{
System.Diagnostics.Process.Start(@".\AHK\12on.ahk");
}
else if (a == switch12off)
{
System.Diagnostics.Process.Start(@".\AHK\12off.ahk");
}
else if (a == switch4on)
{
System.Diagnostics.Process.Start(@".\AHK\4on.ahk");
}
else if (a == switch4off)
{
System.Diagnostics.Process.Start(@".\AHK\4off.ahk");
}
}
}
}
}
Two options:
The first option is to Trim the input string when you read it in:
const string SwitchOn5 = "5on";
string a = port.ReadExisting().Trim();
if (a.Equals(SwitchOn5)) // will return true
The second option is, instead of comparing the whole string, to do a StartsWith check (the first option is probably better):
const string SwitchOn5 = "5on";
if (a.StartsWith(SwitchOn5)) // will return true
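Both options compare whatever a single ReadExisting() call happened to return, and that call hands back whatever is in the receive buffer at that moment, which may be half a line or several lines at once. A minimal sketch of one way to handle that, assuming each message ends with the newline that Serial.println sends: collect the chunks and compare only complete lines. SerialLineBuffer is a hypothetical helper, not part of System.IO.Ports.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public class SerialLineBuffer
{
    private readonly StringBuilder _pending = new StringBuilder();

    // Appends a chunk (for example the result of port.ReadExisting()) and returns
    // every complete line it finished, trimmed of "\r" and surrounding whitespace.
    public List<string> Append(string chunk)
    {
        var lines = new List<string>();
        _pending.Append(chunk);

        string text = _pending.ToString();
        int start = 0;
        int newline;
        while ((newline = text.IndexOf('\n', start)) >= 0)
        {
            string line = text.Substring(start, newline - start).Trim();
            if (line.Length > 0)
            {
                lines.Add(line);
            }
            start = newline + 1;
        }

        // Keep the unfinished tail for the next chunk.
        _pending.Clear();
        _pending.Append(text, start, text.Length - start);
        return lines;
    }
}
```

In the polling loop, `foreach (string line in buffer.Append(port.ReadExisting()))` would then feed each complete line into the comparisons, so `line == "5on"` works without the multi-line string constants.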

Regex get group block with specific start and end each group

If we had some string like :
----------DBVer=1
/*some sql script*/
----------DBVer=1
----------DBVer=2
/*some sql script*/
----------DBVer=2
----------DBVer=n
/*some sql script*/
----------DBVer=n
Can we extract the scripts between the first DBVer=1 and the second DBVer=1, and so on, with regex?
I think we need some kind of placeholder in the regex that tells the engine: if you see DBVer=digitA, pick up the string until DBVer=digitA appears again; if you see DBVer=digitB, pick up the string until DBVer=digitB appears again; and so on.
Can we implement this with regex, and if so, how?
Yes, using backreferences and lookarounds, you can capture the scripts:
var pattern = @"(?<=(?<m>-{10}DBVer=\d+)\r?\n).*(?=\r?\n\k<m>)";
var scripts = Regex.Matches(input, pattern, RegexOptions.Singleline)
.Cast<Match>()
.Select(m => m.Value);
Here, we capture the m (marker) group with (?<m>-{10}DBVer=\d+) and reuse the m value later in the regex with \k<m> to match against the end marker.
In order for .* to match newline chars, it is necessary to turn on Singleline mode. This, in turn, means we have to be specific about our newlines. In Singleline mode, these can be accounted for in a non-platform specific way with \r?\n.
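A variant of the same backreference idea that consumes the marker lines instead of using lookarounds, sketched under two assumptions: each marker sits on its own line, and at least one script line lies between a pair of markers. It also returns the version number with each script. DbScriptSplitter and the group names ver and body are hypothetical names, and this is not the pattern from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DbScriptSplitter
{
    // ^ and $ work per line (Multiline); .*? may span lines (Singleline).
    // \k<ver> requires the closing marker to carry the same number as the opening one,
    // and \r?$ stops "DBVer=1" from also closing on "DBVer=10".
    private static readonly Regex Block = new Regex(
        @"^-{10}DBVer=(?<ver>\d+)\r?\n(?<body>.*?)\r?\n-{10}DBVer=\k<ver>\r?$",
        RegexOptions.Multiline | RegexOptions.Singleline);

    // Returns (version, script) pairs in the order they appear in the input.
    public static List<KeyValuePair<int, string>> Split(string input)
    {
        var result = new List<KeyValuePair<int, string>>();
        foreach (Match match in Block.Matches(input))
        {
            result.Add(new KeyValuePair<int, string>(
                int.Parse(match.Groups["ver"].Value),
                match.Groups["body"].Value));
        }
        return result;
    }
}
```

The lazy `.*?` stops at the first closing marker with the matching number, so a later block that reuses the same digits as a prefix is not swallowed.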
Try the code below. It is not regex, but it works very well.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
namespace ConsoleApplication6
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
Script.ReadScripts(FILENAME);
}
}
public class Script
{
enum State
{
Get_Script,
Read_Script
}
public static List<Script> scripts = new List<Script>();
public int version { get; set; }
public string script { get; set; }
public static void ReadScripts(string filename)
{
string inputLine = "";
string pattern = "DBVer=(?'version'\\d+)";
State state = State.Get_Script;
StreamReader reader = new StreamReader(filename);
Script newScript = null;
while ((inputLine = reader.ReadLine()) != null)
{
inputLine = inputLine.Trim();
if (inputLine.Length > 0)
{
switch (state)
{
case State.Get_Script :
if(inputLine.StartsWith("-----"))
{
newScript = new Script();
scripts.Add(newScript);
string version =
Regex.Match(inputLine, pattern).Groups["version"].Value;
newScript.version = int.Parse(version);
newScript.script = "";
state = State.Read_Script;
}
break;
case State.Read_Script :
if (inputLine.StartsWith("-----"))
{
state = State.Get_Script;
}
else
{
if (newScript.script.Length == 0)
{
newScript.script = inputLine;
}
else
{
newScript.script += "\n" + inputLine;
}
}
break;
}
}
}
}
}
}

How to create an array and fill from tree node variable

I'm trying to transfer data from a treenode (at least I think that's what it is) which contains much more data than I need. It would be very difficult for me to manipulate the data within the treenode. I would much rather have an array which provides me with only the necessary data for data manipulation.
I would like the array to have the following variables:
1. BookmarkNumber (integer)
2. Date (string)
3. DocumentType (string)
4. BookmarkPageNumberString (string)
5. BookmarkPageNumberInteger (integer)
I would like to fill the above-defined array with the data from the variable book_mark (as can be seen in my code).
I've been wrestling with this for two days. Any help would be much appreciated. I'm fairly sure the question isn't phrased well, so please ask questions and I will explain further if needed.
Thanks so much
BTW, what I'm trying to do is create a Windows Forms program that splits a PDF file with multiple bookmarks into a discrete PDF file for each bookmark/chapter, saving each file in the correct folder with the correct naming convention. The folder and the name depend on the PDF name and on the title of the bookmark/chapter being parsed.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.IO;
using itextsharp.pdfa;
using iTextSharp.awt;
using iTextSharp.testutils;
using iTextSharp.text;
using iTextSharp.xmp;
using iTextSharp.xtra;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void ChooseImageFileWrapper_Click(object sender, EventArgs e)
{
OpenFileDialog openFileDialog1 = new OpenFileDialog();
openFileDialog1.InitialDirectory = GlobalVariables.InitialDirectory;
openFileDialog1.Filter = "Pdf Files|*.pdf";
openFileDialog1.RestoreDirectory = true;
openFileDialog1.Title = "Image File Wrapper Chooser";
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
try
{
GlobalVariables.ImageFileWrapperPath = openFileDialog1.FileName;
}
catch (Exception ex)
{
MessageBox.Show("Error: Could not read file from disk. Original error: " + ex.Message);
}
}
ImageFileWrapperPath.Text = GlobalVariables.ImageFileWrapperPath;
}
private void ImageFileWrapperPath_TextChanged(object sender, EventArgs e)
{
}
private void button2_Click(object sender, EventArgs e)
{
iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(GlobalVariables.ImageFileWrapperPath);
IList<Dictionary<string, object>> book_mark = iTextSharp.text.pdf.SimpleBookmark.GetBookmark(pdfReader);
List<ImageFileWrapperBookmarks> IFWBookmarks = new List<ImageFileWrapperBookmarks>();
foreach (Dictionary<string, object> bk in book_mark) // bk is a single instance of book_mark
{
ImageFileWrapperBookmarks.BookmarkNumber = ImageFileWrapperBookmarks.BookmarkNumber + 1;
foreach (KeyValuePair<string, object> kvr in bk) // kvr is the key/value in bk
{
if (kvr.Key == "Kids" || kvr.Key == "kids")
{
//create recursive program for children
}
else if (kvr.Key == "Title" || kvr.Key == "title")
{
}
else if (kvr.Key == "Page" || kvr.Key == "page")
{
}
}
}
MessageBox.Show(GlobalVariables.ImageFileWrapperPath);
}
}
}
Here's one way to parse a PDF and create a data structure similar to what you describe. First the data structure:
public class BookMark
{
static int _number;
public BookMark() { Number = ++_number; }
public int Number { get; private set; }
public string Title { get; set; }
public string PageNumberString { get; set; }
public int PageNumberInteger { get; set; }
public static void ResetNumber() { _number = 0; }
// bookmarks title may have illegal filename character(s)
public string GetFileName()
{
var fileTitle = Regex.Replace(
Regex.Replace(Title, @"\s+", "-"),
@"[^-\w]", ""
);
return string.Format("{0:D4}-{1}.pdf", Number, fileTitle);
}
}
A method to create a list of BookMark (above):
List<BookMark> ParseBookMarks(IList<Dictionary<string, object>> bookmarks)
{
int page;
var result = new List<BookMark>();
foreach (var bookmark in bookmarks)
{
// add top-level bookmarks
var stringPage = bookmark["Page"].ToString();
if (Int32.TryParse(stringPage.Split()[0], out page))
{
result.Add(new BookMark() {
Title = bookmark["Title"].ToString(),
PageNumberString = stringPage,
PageNumberInteger = page
});
}
// recurse
if (bookmark.ContainsKey("Kids"))
{
var kids = bookmark["Kids"] as IList<Dictionary<string, object>>;
if (kids != null && kids.Count > 0)
{
result.AddRange(ParseBookMarks(kids));
}
}
}
return result;
}
Call method above like this to dump the results to a text file:
void DumpResults(string path)
{
using (var reader = new PdfReader(path))
{
// need this call to parse page numbers
reader.ConsolidateNamedDestinations();
var bookmarks = ParseBookMarks(SimpleBookmark.GetBookmark(reader));
var sb = new StringBuilder();
foreach (var bookmark in bookmarks)
{
sb.AppendLine(string.Format(
"{0, -4}{1, -100}{2, -25}{3}",
bookmark.Number, bookmark.Title,
bookmark.PageNumberString, bookmark.PageNumberInteger
));
}
File.WriteAllText(outputTextFile, sb.ToString());
}
}
The bigger problem is how to extract each bookmark into a separate file. If every bookmark starts a new page, it's easy:
1. Iterate over the return value of ParseBookMarks().
2. Select a page range that begins with the current BookMark.PageNumberInteger and ends with the next BookMark.PageNumberInteger - 1.
3. Use that page range to create separate files.
Something like this:
void ProcessPdf(string path)
{
using (var reader = new PdfReader(path))
{
// need this call to parse page numbers
reader.ConsolidateNamedDestinations();
var bookmarks = ParseBookMarks(SimpleBookmark.GetBookmark(reader));
for (int i = 0; i < bookmarks.Count; ++i)
{
int page = bookmarks[i].PageNumberInteger;
int nextPage = i + 1 < bookmarks.Count
// if not top of page will be missing content
? bookmarks[i + 1].PageNumberInteger - 1
/* alternative is to potentially add redundant content:
? bookmarks[i + 1].PageNumberInteger
*/
: reader.NumberOfPages;
string range = string.Format("{0}-{1}", page, nextPage);
// DEMO!
if (i < 10)
{
var outputPath = Path.Combine(OUTPUT_DIR, bookmarks[i].GetFileName());
using (var readerCopy = new PdfReader(reader))
{
var number = bookmarks[i].Number;
readerCopy.SelectPages(range);
using (FileStream stream = new FileStream(outputPath, FileMode.Create))
{
using (var document = new Document())
{
using (var copy = new PdfCopy(document, stream))
{
document.Open();
int n = readerCopy.NumberOfPages;
for (int j = 0; j < n; )
{
copy.AddPage(copy.GetImportedPage(readerCopy, ++j));
}
}
}
}
}
}
}
}
}
The problem is that it's highly unlikely that every bookmark sits at the top of a page in the PDF. To see what I mean, experiment with commenting / uncommenting the bookmarks[i + 1].PageNumberInteger lines.
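The page-range arithmetic can be looked at on its own, without iTextSharp. A minimal sketch, assuming the bookmark start pages are 1-based and already in document order. PageRanges and Build are hypothetical names, and clamping a range to a single page when two bookmarks share a page is an added assumption, not something ProcessPdf above does.

```csharp
using System;
using System.Collections.Generic;

public static class PageRanges
{
    // startPages: 1-based first page of each bookmark, in ascending document order.
    // Returns one "first-last" range string per bookmark, the same format that
    // ProcessPdf passes to SelectPages.
    public static List<string> Build(IList<int> startPages, int totalPages)
    {
        var ranges = new List<string>();
        for (int i = 0; i < startPages.Count; i++)
        {
            int first = startPages[i];
            int last = i + 1 < startPages.Count
                ? startPages[i + 1] - 1   // stop just before the next bookmark's page
                : totalPages;             // the last bookmark runs to the end
            if (last < first)
            {
                last = first;             // assumption: bookmarks sharing a page keep that one page
            }
            ranges.Add(string.Format("{0}-{1}", first, last));
        }
        return ranges;
    }
}
```

With start pages 1, 4, 4 and 9 in a 12-page file this gives 1-3, 4-4, 4-8 and 9-12, which shows the trade-off described above: the second bookmark keeps only the page it shares with the third, and whatever precedes a mid-page bookmark ends up in the neighbouring file.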
