I'm trying to use OpenXml to replace the text "Veteran" in A.docx with the content of B.docx. If B.docx contains text or paragraphs, it works fine and I get a modified A.docx file.
However, if B.docx contains a table, then the code doesn't work.
static void Main(string[] args)
{
SearchAndReplace(@"C:\A.docx", @"C:\B.docx");
}
public static void SearchAndReplace(string docTo, string docFrom)
{
List<WordprocessingDocument> docList = new List<WordprocessingDocument>();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(docTo, true))
using (WordprocessingDocument wordDoc1 = WordprocessingDocument.Open(docFrom, true))
{
var parts = wordDoc1.MainDocumentPart.Document.Descendants().FirstOrDefault();
docList.Add(wordDoc);
docList.Add(wordDoc1);
if (parts != null)
{
foreach (var node in parts.ChildElements)
{
if (node is Table)
{
ParseTable(docList, (Table)node, textBuilder);
}
}
}
}
}
public static void ParseText(List<WordprocessingDocument> wpd, Paragraph node, StringBuilder textBuilder)
{
Body body = wpd[0].MainDocumentPart.Document.Body;
Body body1 = wpd[1].MainDocumentPart.Document.Body;
string content = body1.InnerXml;
var paras = body.Elements<Paragraph>();
foreach (var para in paras)
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("Veteran"))
{
run.InnerXml.Replace(run.InnerXml, content);
break;
}
}
}
}
}
public static void ParseTable(List<WordprocessingDocument> wpd, Table node, StringBuilder textBuilder)
{
foreach (var row in node.Descendants<TableRow>())
{
textBuilder.Append("| ");
foreach (var cell in row.Descendants<TableCell>())
{
foreach (var para in cell.Descendants<Paragraph>())
{
ParseText(wpd, para, textBuilder);
}
textBuilder.Append(" | ");
}
textBuilder.AppendLine("");
}
}
}
}
How can I make this work? Is there a better way to replace content with the content of another docx file?
Not having enough detail for a specific answer, here's how you solve such problems in general:
Ensure you understand the Open XML specification and valid Open XML markup on an appropriate level of detail.
If you don't understand what w:document, w:body, w:p, w:r, w:t, w:tbl, etc. are and how they relate to each other, you have no chance.
You must look at actual Open XML markup, e.g., using the Open XML Productivity Tool or the Open XML Package Editor for Modern Visual Studios to get to an appropriate level of understanding and develop Open XML-based solutions.
Understand that most Open XML-related code transforms some source markup into some target markup. Therefore, you must:
understand the source and target markup first and then
define the transformation required to create the target from the source.
Depending on what you need to do, the Open XML Productivity Tool can help create the transforming code. If you have a source and target document, you can use the Productivity Tool to compare those documents. This shows the difference in the markup, so you see what markup is created, deleted, or changed. It even shows you the Open XML SDK-based code required to effect the change.
In my own use cases, I typically prefer to write recursive, pure functional transformations. While you need to wrap your head around the concept, this is an extremely powerful approach.
In your case, you should:
take a few representative, manually-created samples of source (A.docx with "Veteran" still to be replaced) and target (A.docx with "Veteran" replaced as desired) documents;
look at the Open XML markup of the source and target documents; and
write code that creates the target markup.
Once you have created code that at least tries to create valid target Open XML markup, you could come back with further questions in case you identify further issues.
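As one possible starting point for that last step, here is a minimal sketch. It assumes the placeholder "Veteran" sits in a paragraph of its own in A.docx and that B.docx contains only plain paragraphs and tables (no images, numbering, or custom styles, whose parts would have to be copied separately). The class and method names are made up for illustration; this is not a general-purpose merge.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class PlaceholderReplacer
{
    // Replaces the first paragraph whose text contains the placeholder
    // with deep clones of the block-level content of the source document.
    public static void ReplaceParagraphWithBody(string docTo, string docFrom, string placeholder)
    {
        using (WordprocessingDocument target = WordprocessingDocument.Open(docTo, true))
        using (WordprocessingDocument source = WordprocessingDocument.Open(docFrom, false))
        {
            Body targetBody = target.MainDocumentPart.Document.Body;
            Body sourceBody = source.MainDocumentPart.Document.Body;

            Paragraph hit = targetBody.Descendants<Paragraph>()
                .FirstOrDefault(p => p.InnerText.Contains(placeholder));
            if (hit == null)
            {
                return; // nothing to replace
            }

            foreach (OpenXmlElement child in sourceBody.ChildElements)
            {
                // The source's section properties must not be copied
                // into the middle of the target document.
                if (child is SectionProperties)
                {
                    continue;
                }

                // Clone the element: it cannot belong to two trees at once.
                hit.InsertBeforeSelf(child.CloneNode(true));
            }

            hit.Remove();
            target.MainDocumentPart.Document.Save();
        }
    }
}
```

Working on whole elements (w:p, w:tbl) rather than on InnerXml strings is what lets a table from B.docx land in a position where a table is valid, instead of inside a w:r.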
I need to check all tags on all shapes on all slides. I can select each shape, however I can't see how to get the shape's tags.
For the given DocumentFormat.OpenXml.Presentation.Shape, how can I get the "val" of the tag with name="MOUNTAIN"
In my shape, the tag rId is in this structure: p:sp > p:nvSpPr > p:nvPr > p:custDataLst > p:tags
I'm guessing my code needs to do these steps:
• Get the rId of the p:custDataLst p:tags
• Look up the "Target" file name in the slideX.xml.rels file, based on the rId
• Look in the root/tags folder for the "Target" file
• Get the p:tagLst p:tags and look for the p:tag with name="MOUNTAIN"
<p:tagLst>
<p:tag name="MOUNTAIN" val="Denali"/>
</p:tagLst>
Here is how my code iterates through shapes on each slide:
for (int x = 0; x < doc.PresentationPart.SlideParts.Count(); x++)
{
SlidePart slide = doc.PresentationPart.SlideParts.ElementAt(x);
ShapeTree tree = slide.Slide.CommonSlideData.ShapeTree;
IEnumerable<DocumentFormat.OpenXml.Presentation.Shape> slShapes = slide.Slide.Descendants<DocumentFormat.OpenXml.Presentation.Shape>();
foreach (DocumentFormat.OpenXml.Presentation.Shape shape in slShapes)
{
//get the specified tag, if it exists
}
}
I see an example of how to add tags: How to add custom tags to powerpoint slides using OpenXml in c#
But I can't figure out how to read the existing tags.
So, how do I get the shape's tags with c#?
I was hoping to do something like this:
IEnumerable<UserDefinedTagsPart> userDefinedTagsParts = shape.NonVisualShapeProperties.ApplicationNonVisualDrawingProperties.CustomerDataList.CustomerDataTags<UserDefinedTagsPart>();
foreach (UserDefinedTagsPart userDefinedTagsPart in userDefinedTagsParts)
{}
but Visual Studio says "ApplicationNonVisualDrawingProperties does not contain a definition for CustomerDataList".
From the OpenXML Productivity Tool, here is the element tree:
You and I seem to be working on similar problems. I'm struggling with learning the file format. The following code is working for me, though I'm sure it can be optimized.
public void ReadTags(Shape shape, SlidePart slidePart)
{
NonVisualShapeProperties nvsp = shape.NonVisualShapeProperties;
ApplicationNonVisualDrawingProperties nvdp = nvsp.ApplicationNonVisualDrawingProperties;
IEnumerable<CustomerDataTags> data_tags = nvdp.Descendants<CustomerDataTags>();
foreach (var data_tag in data_tags)
{
UserDefinedTagsPart shape_tags = slidePart.GetPartById(data_tag.Id) as UserDefinedTagsPart;
if (shape_tags != null)
{
foreach (Tag tag in shape_tags.TagList)
{
Debug.Print($"\t{nvsp.NonVisualDrawingProperties.Name} tag {tag.Name} = '{tag.Val}'");
}
}
}
}
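To fetch a single value such as the "MOUNTAIN" tag from the question, the same lookup can be narrowed to one tag name. A sketch along those lines follows. The helper name GetTagValue is hypothetical, and it assumes the tags part is related to the slide part, as in the code above.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Presentation;

public static class ShapeTagReader
{
    // Returns the val of the named tag on the shape, or null when the
    // shape has no such tag.
    public static string GetTagValue(Shape shape, SlidePart slidePart, string tagName)
    {
        var nvPr = shape.NonVisualShapeProperties?.ApplicationNonVisualDrawingProperties;
        if (nvPr == null)
        {
            return null;
        }

        // Each p:tags element under p:custDataLst carries an r:id that
        // points at a tags part related to the slide part.
        foreach (CustomerDataTags dataTags in nvPr.Descendants<CustomerDataTags>())
        {
            if (dataTags.Id == null)
            {
                continue;
            }

            var tagsPart = slidePart.GetPartById(dataTags.Id.Value) as UserDefinedTagsPart;
            Tag match = tagsPart?.TagList?.Elements<Tag>()
                .FirstOrDefault(t => t.Name != null && t.Name.Value == tagName);
            if (match != null)
            {
                return match.Val?.Value;
            }
        }

        return null;
    }
}
```

Inside the loop over shapes from the question, a call would look like `string mountain = ShapeTagReader.GetTagValue(shape, slide, "MOUNTAIN");`.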
I've spent a lot of time with OpenXML .docx and .xlsx files ... but not so much with .pptx.
Nevertheless, here are a couple of suggestions that might help:
If you haven't already done so, please download the OpenXML SDK Productivity Tool to analyze your file's contents. It's currently available on GitHub:
https://github.com/dotnet/Open-XML-SDK/releases/tag/v2.5
You might simply be able to "grep" for items you're looking for.
EXAMPLE (Word, not PowerPoint... but the same principle should apply):
using (doc = WordprocessingDocument.Open(stream, true))
{
// Init OpenXML members
mainPart = doc.MainDocumentPart;
body = mainPart.Document.Body;
...
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains(target))
...
Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives content of full document.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = @"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
int pageCount = 0;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
for (int i = 1; i <= pageCount; i++)
{
//Read the content by page number
}
}
}
MSDN Reference
Update 1:
It looks like page breaks are represented as below:
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
<w:r>
<w:br w:type="page" />
</w:r>
</w:p>
So now I need to split the XML at these elements and take the InnerText of each part, which will give me the text page by page.
The question now becomes: how can I split the XML on that check?
Update 2:
Explicit page breaks exist only where the author inserted one. If text simply flows from one page onto the next, no page break element is written, so I am back to the same challenge: how to identify the page separations.
You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated according to line break and pagination algorithms which are implementation dependent; it is not intrinsic to the OOXML data. There is nothing to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:
By definition, the w:lastRenderedPageBreak position is stale when content has been changed since the document was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances, including:
when a table spans two pages
when the next page starts with an empty paragraph
for multi-column layouts with text boxes starting a new column
for large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
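To see what can and cannot be counted from the markup alone, a short sketch along these lines tallies explicit page breaks and w:lastRenderedPageBreak markers. The class and method names are illustrative. Given the caveats above, the second number is only a possibly stale hint, not a page count.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class PageBreakCounter
{
    public static void Report(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // Hard page breaks: <w:br w:type="page"/> elements stored in the markup.
            int hardBreaks = body.Descendants<Break>()
                .Count(b => b.Type != null && b.Type.Value == BreakValues.Page);

            // Markers left behind by the last application that paginated the file.
            int renderedBreaks = body.Descendants<LastRenderedPageBreak>().Count();

            Console.WriteLine($"Hard page breaks: {hardBreaks}");
            Console.WriteLine($"lastRenderedPageBreak markers: {renderedBreaks}");
        }
    }
}
```

Neither number accounts for pages that start because of section breaks or paragraph-level "page break before" settings, which is one more reason the markup alone cannot give a page count.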
This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
string filepath = @"C:\...\test.docx";
// Open a WordprocessingDocument based on a filepath.
Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
int pageCount = 0;
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
{
pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
}
int i = 1;
StringBuilder pageContentBuilder = new StringBuilder();
foreach (var element in body.ChildElements)
{
if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
{
pageContentBuilder.Append(element.InnerText);
}
else
{
pageviseContent.Add(i, pageContentBuilder.ToString());
i++;
pageContentBuilder = new StringBuilder();
}
if (body.LastChild == element && pageContentBuilder.Length > 0)
{
pageviseContent.Add(i, pageContentBuilder.ToString());
}
}
}
}
Downside: This won't work in all scenarios. It works only when there is an explicit page break; if text simply flows from page 1 onto page 2, there is no marker to tell you that you are on page two.
Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page number information. The XML files carry no page numbers; those exist only once Microsoft Word opens the document and renders it dynamically. Reading the Open XML SDK documentation, for example https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change this.
You can unzip some docx files and search for "page" or "pg" to see this for yourself. I did this with different kinds of docx files for my own use case, and they all showed the same thing. Glad if this helps.
List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> PageParagraphs = Allparagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).Select(x => x).Distinct().ToList();
Rename the docx file to zip.
Open the docProps\app.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
<Template>Normal</Template>
<TotalTime>0</TotalTime>
<Pages>1</Pages>
<Words>141</Words>
<Characters>809</Characters>
<Application>Microsoft Office Word</Application>
<DocSecurity>0</DocSecurity>
<Lines>6</Lines>
<Paragraphs>1</Paragraphs>
<ScaleCrop>false</ScaleCrop>
<HeadingPairs>
<vt:vector size="2" baseType="variant">
<vt:variant>
<vt:lpstr>Название</vt:lpstr>
</vt:variant>
<vt:variant>
<vt:i4>1</vt:i4>
</vt:variant>
</vt:vector>
</HeadingPairs>
<TitlesOfParts>
<vt:vector size="1" baseType="lpstr">
<vt:lpstr/>
</vt:vector>
</TitlesOfParts>
<Company/>
<LinksUpToDate>false</LinksUpToDate>
<CharactersWithSpaces>949</CharactersWithSpaces>
<SharedDoc>false</SharedDoc>
<HyperlinksChanged>false</HyperlinksChanged>
<AppVersion>14.0000</AppVersion>
</Properties>
The Open XML SDK reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are written only by the Word (winword) application. If the document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
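Because both the part and the property can be missing, a defensive read is safer than dereferencing the whole chain directly. A minimal sketch (the helper name TryGetStoredPageCount is made up here):

```csharp
using DocumentFormat.OpenXml.Packaging;

public static class DocProps
{
    // Tries to read the page count stored in docProps/app.xml.
    // Returns false when the part, the property, or a parsable number is missing.
    public static bool TryGetStoredPageCount(string path, out int pages)
    {
        pages = 0;
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            // Any link in this chain may be null for generated documents.
            string text = doc.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            return int.TryParse(text, out pages);
        }
    }
}
```

Even when this returns true, the value is only whatever the last saving application wrote, so it should be treated as a hint rather than as the current page count.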
I have a List<Word.Document> that I would like to join into a single Word.Document. Below is all that I have so far.
Any ideas?
public static Word.Document JoinDocuments(List<Word.Document> DocstoJoin)
{
Word.Document JoinedDoc = new Word.Document();
foreach (Word.Document doc in DocstoJoin)
{
foreach (Word.Section sec in doc.Sections)
{
**????**
}
}
return JoinedDoc;
}
The Selection class provides the following methods that can be used to get the job done:
Copy - copies the specified selection to the Clipboard.
Paste - inserts the contents of the Clipboard at the specified selection.
Also you may consider using the Open XML SDK if you deal with open XML documents only.
I have several thousand (ASP.net - messy html) html generated invoices that I'm trying to parse and save into a database.
Basically like:
foreach(var htmlDoc in HtmlFolder)
{
foreach(var inputBox in htmlDoc)
{
//Make Collection of ID and Values Insert to DB
}
}
From all the other questions I've read the best tool for this type of problem is the HtmlAgilityPack, however for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack ?
Thanks in advance
A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc); //load, parse the file
var fields = doc["input"]; //get input fields with CSS
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value())); //get values
To get the CHM to work, you probably need to view the file's properties in Windows Explorer and unblock it (click the "Unblock" button or tick the "Unblock" checkbox, depending on the Windows version).
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
//import the HtmlAgilityPack
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------
// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
.Where(node => !string.IsNullOrWhiteSpace(node.Id)
&& node.Attributes["value"] != null
&& !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
var nodes = doc.DocumentNode
.SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------
// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null)
{
foreach (var node in nodes)
{
var id = node.Id;
var value = node.Attributes["value"].Value;
}
}
The easiest way to add the HtmlAgility Pack is using NuGet:
PM> Install-Package HtmlAgilityPack
Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (that's built on top of HtmlAgility pack by the way!) : https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag
// find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
Console.WriteLine(inputField["type"]);
Console.WriteLine(inputField["value"]);
if(inputField.HasAttribute("id"))
Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'
You can download the HtmlAgilityPack documentation CHM file from here.
If the CHM file's contents are not visible, uncheck the "Always ask before opening this file" checkbox in the dialog that appears when you open the file.
Note: This dialog appears for unsigned files
Source: HtmlAgilityPack Documentation
I am trying the following code. It takes a fileName (docx file with many sections) and I try to iterate through each section getting the section name. The problem is that I end up with unreadable docx files. It does not error, but I think I am doing something wrong with getting the elements in the section.
public void Split(string fileName) {
using (WordprocessingDocument myDoc =
WordprocessingDocument.Open(fileName, true)) {
string curCliCode = "";
MainDocumentPart mdp = myDoc.MainDocumentPart;
foreach (var element in mdp.Document.Body.ChildElements) {
if (element.Descendants().OfType<SectionProperties>().Count() == 1) {
//get the name of the section from the footer
var footer = (FooterPart) mdp.GetPartById(
element.Descendants().OfType<SectionProperties>().First().OfType
<FooterReference>().First().
Id.Value);
foreach (Paragraph p in footer.Footer.ChildElements.OfType<Paragraph>()) {
if (p.InnerText != "") {
curCliCode = p.InnerText;
}
}
if (curCliCode != "") {
var forFile = new List<OpenXmlElement>();
var els = element.ElementsBefore();
if (els != null) {
foreach (var e in els) {
if (e != null) {
forFile.Add(e);
}
}
for (int i = 0; i < els.Count(); i++) {
els.ElementAt(i).Remove();
}
}
Create(curCliCode, forFile);
}
}
}
}
}
private void Create(string cliCode,IEnumerable<OpenXmlElement> docParts) {
var parts = from e in docParts select e.Clone();
const string template = @"\Test\toSplit\blank.docx";
string destination = string.Format(@"\Test\{0}.docx", cliCode);
File.Copy(template, destination,true);
/* Create the package and main document part */
using (WordprocessingDocument myDoc =
WordprocessingDocument.Open(destination, true)) {
MainDocumentPart mainPart = myDoc.MainDocumentPart;
/* Create the contents */
foreach(var part in parts) {
mainPart.Document.Body.Append((OpenXmlElement)part);
}
/* Save the results and close */
mainPart.Document.Save();
myDoc.Close();
}
}
Does anyone know what the problem could be (or how to properly copy a section from one document to another)?
I've done some work in this area, and what I have found invaluable is diffing a known good file with a prospective file; the error is usually fairly obvious.
What I would do is take a file that you know works, and copy all of the sections into the template. Theoretically, the two files should be identical. Run a diff on the document.xml inside each docx file, and you'll see the difference.
BTW, I'm assuming that you know that the docx is actually a zip; change the extension to "zip", and you'll be able to get at the actual xml files which compose the format.
As for diff tools, I use Beyond Compare from Scooter Software.
An approach along the lines of what you are doing will work only for simple documents (i.e., those not containing images, hyperlinks, comments, etc.). To handle these more complex documents, take a look at http://blogs.msdn.com/b/ericwhite/archive/2009/02/05/move-insert-delete-paragraphs-in-word-processing-documents-using-the-open-xml-sdk.aspx and the resulting DocumentBuilder API (part of the PowerTools for Open XML project on CodePlex).
In order to split a docx into sections using DocumentBuilder, you'll still need to first find the index of the paragraphs containing sectPr elements.
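A rough sketch of that first step, assuming the sections are delimited in the usual way (a w:sectPr inside the last paragraph of each section, plus a final w:sectPr directly under w:body). The class and method names are illustrative only.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SectionIndexFinder
{
    // Returns the indices of the body's child elements that end a section.
    public static List<int> FindSectionEnds(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            List<OpenXmlElement> children =
                doc.MainDocumentPart.Document.Body.ChildElements.ToList();

            var ends = new List<int>();
            for (int i = 0; i < children.Count; i++)
            {
                // A section ends either at a paragraph that contains a sectPr
                // or at the body-level sectPr that closes the document.
                bool endsSection = children[i] is SectionProperties
                    || children[i].Descendants<SectionProperties>().Any();
                if (endsSection)
                {
                    ends.Add(i);
                }
            }

            return ends;
        }
    }
}
```

Each range between consecutive indices then corresponds to the block-level content of one section. Collecting indices up front on a read-only document also avoids removing elements from the body while it is still being enumerated.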