Parsing an Open XML doc via styled blocks

I'm working with docx docs, and I need to parse a document into sections on the basis of headings styled with the "heading 1" style. So if I had a doc like this (markup is pseudocode):
<doc>
<title style>Doc Title</title style>
<heading1>First Section</heading1>
...
<heading2>Second Section</heading2>
...
<heading3>Third Section</heading3>
...
</doc>
I'd want to break this into a doc with four sections, the first being the content that precedes the first section. I figure that this is probably pretty simple once you're familiar with Open XML, but I am not.
TIA.

Wow...not even any views on this question all day. Well, I figured it out and thought I'd share the wealth. I can't share the code directly, but it's just three nested loops: one over the paragraphs, then over each paragraph's properties, then over the style elements. The XPath for each of those is:
.//w:p
./w:pPr
./w:pStyle
Once you find a paragraph whose w:pStyle names the style you want (the style name is in its w:val attribute), you pop back up to the paragraph to get its first run (w:r), which will contain the styled text. From there on, it's just Comp Sci 101 stuff. I think the real breakthrough was to not even try to mess with the Open XML SDK (aside from the System.IO.Packaging stuff), and go straight to XML manipulation.
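Since the original code couldn't be shared, here is a rough sketch of the same idea in Python rather than C#. It is not the poster's implementation. It assumes you have already pulled the text of word/document.xml out of the package, and that the heading style id is "Heading1"; the function name split_into_sections is made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the XPaths above)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def split_into_sections(document_xml, heading_style="Heading1"):
    """Return a list of (title, paragraphs) tuples.

    The first tuple has title None and holds whatever precedes the
    first heading. heading_style is an assumed style id.
    """
    root = ET.fromstring(document_xml)
    sections = [(None, [])]
    for p in root.findall(".//w:p", NS):
        # ./w:pPr/w:pStyle carries the style id in its w:val attribute
        style = p.find("./w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Join all runs so headings split across runs still come out whole
        text = "".join(t.text or "" for t in p.findall(".//w:t", NS))
        if style_id == heading_style:
            sections.append((text, []))
        else:
            sections[-1][1].append(text)
    return sections
```

This joins every run in the paragraph instead of taking only the first one, because Word sometimes splits a heading across several runs.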

Related

Delete node from xml documents

I'm new in the programming world.
I'm just looking for help with some kind of code, that would delete node from bunch of xml documents.
It is possible to make something which would delete node in bunch of xml documents at once?

There are many different technologies you could use, which is a bit daunting if you are new to programming. Since this is quite a simple task, it's probably not worth investing a lot of time learning new tools, but it all depends on what you're already comfortable with.
Many people would use XSLT for any job that involves modifying XML documents. You could write an XSLT 3.0 stylesheet transform.xsl like this:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="deleted-node"/>
</xsl:transform>
where "deleted-node" is the name of the nodes you want to delete (or a more complex pattern if you need it). And then you could apply this to all XML files in a directory in, putting the result in directory out, using the Saxon XSLT processor from the command line like this:
Transform -s:in -o:out -xsl:transform.xsl
The way this works is that xsl:mode defines the default processing to be applied to nodes if there isn't a more specific rule; shallow-copy means that you copy the tags and then move on to process the content. There's only one more specific rule, which matches the elements you want to delete; the rule is empty indicating that when you hit one of these elements, you output nothing.
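If XSLT isn't something you're comfortable with, the same "copy everything except these elements" idea can be sketched with Python's standard library. This is only an illustration, not a replacement for the stylesheet: the helper names remove_elements and process_directory are made up, and the in/out directory layout simply mirrors the Saxon command above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def remove_elements(root, tag):
    """Remove every element named `tag` below `root`, keeping the text that follows it."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != tag:
                previous = child
                continue
            # ElementTree stores the text after an element in its .tail;
            # hand it to the previous sibling (or the parent) so it survives.
            tail = child.tail or ""
            if previous is not None:
                previous.tail = (previous.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
            parent.remove(child)

def process_directory(in_dir, out_dir, tag):
    """Apply remove_elements to every *.xml file in in_dir, writing results to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(in_dir).glob("*.xml")):
        tree = ET.parse(str(path))
        remove_elements(tree.getroot(), tag)
        tree.write(str(out / path.name), encoding="utf-8", xml_declaration=True)
```

Something like process_directory("in", "out", "deleted-node") would then handle a whole folder. For namespaced elements, ElementTree expects the tag in "{namespace-uri}localname" form.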

Parse XML with regards to one namespace only

I need to parse XML files with regards to only one namespace.
By "with regards to only one namespace" I mean that if I have document like this:
<xc:document xmlns:xc="asdasd">
<asdf>
<xc:abcd />
</asdf>
</xc:document>
I would like <asdf>, </asdf> to be treated as text.
The structure of this document should look like this:
document
|
|- text (<asdf>)
|- abcd
|- text (</asdf>)
What is the simplest method to achieve this?

Transform the document with XSLT first so that the nodes you want treated as text actually are text.

Pretty much any XML parser is going to lose distinctions like whether single or double quotes were used, or CDATA sections were used, or whitespace inside tags (not between tags).
So:
<boy socks="black"
></boy>
might come back as <boy socks='black'/>
If you want to treat the input as not XML, you'll have to fall back on non-XML tools, or rethink your situation entirely, as this is a very unusual thing to want to do.
It's fairly easy in a text-processing language such as Perl, if you are careful. For example,
perl -p -e 's#<(/?[^:]+[\s>])#\&lt;$1#g'
will go a long way, by changing the < signs you want to treat as text into &lt; instead. This approach actually works best if you read the whole file in Perl rather than (as in this example) a line at a time, so that you can match close tags spread over multiple lines,
</boy
> like this.
But, best to parse XML with an XML parser, not regular expressions, so if the sort of changes I mentioned above are OK, this is really easy to do in XSLT.
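For what it's worth, the escape-then-parse idea can be sketched in Python as well. This is a rough illustration under the same caveats as the Perl one-liner: it is regex-based, so comments, CDATA sections, and attribute values containing ">" would trip it up. The pattern and the function name escape_unprefixed_tags are my own, not from the answer above.

```python
import re
import xml.etree.ElementTree as ET

# A tag whose name contains no colon, i.e. one without a namespace prefix.
# The lookahead makes sure the name really ended (so "xc" in "<xc:abcd" fails).
UNPREFIXED_TAG = re.compile(r"<(/?[A-Za-z_][\w.-]*(?=[\s/>])[^<>]*>)")

def escape_unprefixed_tags(source):
    """Turn the '<' of every unprefixed tag into '&lt;' so a parser sees it as text."""
    return UNPREFIXED_TAG.sub(r"&lt;\1", source)

source = """<xc:document xmlns:xc="asdasd">
<asdf>
<xc:abcd />
</asdf>
</xc:document>"""

root = ET.fromstring(escape_unprefixed_tags(source))
# root.text now holds the text "<asdf>", root[0] is the xc:abcd element,
# and root[0].tail holds the text "</asdf>" (each with surrounding newlines).
```

After the escape step any ordinary XML parser gives the document / text / abcd / text structure asked for in the question.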

How to prepare a Word 2007 document so that C# can pull data out of it semantically?

I have a friend who is writing a 400-page book in Microsoft Word 2007.
Throughout the book he has 200 stories each which consist of numerous paragraphs.
When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:
Title, varchar(200)
Description, text
Content, text
We do not want to have to copy and paste each story into the database but want to have a program automatically pull the marked up data from the Word file into the appropriate fields in the database.
What does he have to do in Microsoft Word to denote each group of paragraphs as "story content" and each title as a "story title" etc. A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files so I assume this is possible and I assume that stylesheets are what we need, but how do I need to prepare the Word document precisely so that as he adds stories they are properly marked up?
I assume that the new COM Interop features of C# 4.0 are what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?
Does anyone have experience doing a project like this (reading Microsoft Word as a semantic data file) that they could share?

What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.

Okay, this can be resolved in numerous ways.
First of all, I would suggest that you save the file as *.txt, to have some plain text to parse.
Then your friend will have to be really consistent while writing, because the text parser you create will depend on that consistency.
Make some rules like:
Title on first line, then 2 linebreaks;
All the paragraphs separated with 1 linebreak;
Then 3 linebreaks after the last paragraph;
After that, load the file, and parse it using the rules above.
{enjoy}
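A minimal sketch of a parser for rules like those, in Python. It assumes one "linebreak" means one newline character, so the title is followed by two newlines, paragraphs are separated by one, and stories by three. The function name parse_stories and the dictionary keys are made up for illustration.

```python
def parse_stories(text):
    """Split plain text into stories using the linebreak rules above."""
    text = text.replace("\r\n", "\n").strip("\n")
    stories = []
    # Three linebreaks close a story
    for chunk in text.split("\n\n\n"):
        if not chunk.strip():
            continue
        # Two linebreaks separate the title from the body
        title, _, body = chunk.partition("\n\n")
        # One linebreak separates paragraphs
        paragraphs = [line for line in body.split("\n") if line.strip()]
        stories.append({"title": title.strip(), "paragraphs": paragraphs})
    return stories
```

Each dictionary could then be mapped onto the Title and Content columns of the table. One stray blank line in the manuscript will shift the boundaries, which is why the consistency matters so much.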

Following is the XML for a docx document which contains a heading containing the word "Title" and two paragraphs containing the word "Content". Study a sample file of the book while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped docx file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
  <w:body>
    <w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Title</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339">
      <w:r>
        <w:t>Content</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

Use Bookmarks for Start and Stop of Each Story
I strongly suggest this technique.
Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.
Then just go through the document collecting the content between the bookmarks.
Fairly easy, and a technique I have been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet this may be an advantage, because the bookmark name could then be migrated to a "name" field in the database.
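If you go at the docx XML directly instead of through Word automation, bookmarks show up in word/document.xml as w:bookmarkStart (with w:id and w:name attributes) and w:bookmarkEnd (with a matching w:id). A rough Python sketch of "collect the content between the bookmarks" under that assumption; the function name text_by_bookmark is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def text_by_bookmark(document_xml):
    """Return {bookmark name: text found between its start and end markers}."""
    root = ET.fromstring(document_xml)
    open_bookmarks = {}   # bookmark id -> bookmark name
    collected = {}        # bookmark name -> list of text pieces
    for el in root.iter():            # document order
        if el.tag == f"{{{W}}}bookmarkStart":
            name = el.get(f"{{{W}}}name")
            open_bookmarks[el.get(f"{{{W}}}id")] = name
            collected[name] = []
        elif el.tag == f"{{{W}}}bookmarkEnd":
            open_bookmarks.pop(el.get(f"{{{W}}}id"), None)
        elif el.tag == f"{{{W}}}p":
            # A new paragraph inside an open bookmark becomes a line break
            for name in open_bookmarks.values():
                collected[name].append("\n")
        elif el.tag == f"{{{W}}}t":
            for name in open_bookmarks.values():
                collected[name].append(el.text or "")
    return {name: "".join(parts).strip() for name, parts in collected.items()}
```

The bookmark name becomes the dictionary key, which fits the idea of reusing it as a "name" field.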
Using Styles to Mark Story Content
Another technique is to define a specific style or styles that make up the story. You then extract the text that uses those styles. This is a little harder and can be error prone if the author is not disciplined.
Using Text Boxes That Contain Story Content
Lastly, if these "stories" can be placed into a "text box", you can simply extract the content of the text boxes. The problem with this approach is the limitations of the text box, and the document layout changes, which the author may not want to apply.
Notes
There are other ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments/questions you have.
MSDN Search for "vsto word bookmark" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%20bookmark&refinement=-112&ac=3
MSDN Search for "vsto word 2007" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%202007&refinement=-112&ac=3

Getting a "summary" of a webpage

I have something of a hairy problem: I'd like to generate a couple of paragraphs of "description" for a given URL, normally the start of an article. The meta description field is one way to go, but it isn't always good or set properly.
It's fair to say it's a bit problematic to accomplish this from the screenscraped HTML. I had a general idea that perhaps one could scan the HTML for the first "appropriate" segment but it's hard to say what that is, perhaps something like the first paragraph containing a certain amount of text...
Anyone have any good ideas? :) It doesn't have to be foolproof

So, you wanna become a new Google, heh? :-)
Many sites are "SEO friendly" these days. This enables you to go for the headings and then look for paragraphs below.
Also, look for lists. There is a lot of content in some sort of tab-like (tabs, accordions...) interfaces that is done using ordered or unordered lists.
If that fails, maybe look for a div with class "content" or "main" or a combination and start from there.
If you use different approaches, make sure you keep statistics of what worked and what didn't (maybe even save a full page), so you can review and tweak your parsing and searching methods.
As a side note, I've used htmlagilitypack to parse and search through HTML with success. Well, at least it beats parsing with regex :-)

Perhaps look for the div element that contains the most p elements, and then grab the first p child. If no div, get the first p from the body element.
This will always have its problems.
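A rough sketch of that heuristic using only Python's standard html.parser. It counts each p against its nearest enclosing div (or against the body if there is none), then returns the first p of the busiest div. The class and function names are made up, and real-world HTML will find ways to break it.

```python
from html.parser import HTMLParser

class FirstParagraphFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.div_stack = []   # ids of the currently open <div> elements
        self.next_id = 0
        self.p_counts = {}    # container id -> number of <p> elements seen
        self.first_p = {}     # container id -> text of its first <p>
        self.current = None   # container whose first <p> is being read
        self.buf = []

    def close_p(self):
        if self.current is not None:
            self.first_p[self.current] = " ".join("".join(self.buf).split())
        self.current, self.buf = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.close_p()
            self.div_stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self.close_p()    # </p> is optional in HTML
            container = self.div_stack[-1] if self.div_stack else "body"
            self.p_counts[container] = self.p_counts.get(container, 0) + 1
            if self.p_counts[container] == 1:
                self.current = container

    def handle_endtag(self, tag):
        if tag in ("p", "div"):
            self.close_p()
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.buf.append(data)

def first_paragraph(html):
    """First <p> of the <div> with the most <p> elements, else the first <p> outside any div."""
    finder = FirstParagraphFinder()
    finder.feed(html)
    finder.close()
    finder.close_p()
    divs = {k: v for k, v in finder.p_counts.items() if k != "body"}
    if divs:
        best = max(divs, key=divs.get)
    elif "body" in finder.p_counts:
        best = "body"
    else:
        return ""
    return finder.first_p.get(best, "")
```

Adding a minimum length check on the returned paragraph would cover the "certain amount of text" idea from the question.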

You can strip the HTML tags using this regular expression
string stripped = Regex.Replace(textBox1.Text, @"<(.|\n)*?>", string.Empty);
You will then get the content text, which you can use to generate your paragraphs.

Regex or XML Parser C#

I have some word templates(dot/dotx) files that contain xml tags along with plain text.
At run time, I need to replace the xml tags with their respective mail merge fields.
So I need to parse the document for these xml tags and replace them with merge fields.
I was using Regex to find and replace these xml tags, but it was suggested that I use an XML parser instead (see "Regex for string enclosed in <*>, C#").
The sample document looks like:
Solicitor Letter
<Tfirm/>
<Tbuilding/>
<TstreetNumber/> <TstreetName/>
For the attention of: <TContact1/> <TEmail/>
Dear <TContact1/>
RE: <Pbuilding/> <PstreetNumber/> <PstreetName/> <Pvillage/> <PTown/>
We were pleased to hear that contracts have now been exchanged in the sale of the
above property on behalf of our mutual client/s. We now have pleasure in enclosing a
copy of our invoice for your kind attention upon completion.
....
One more note, the angle brackets are typed manually by end user in the template.
I tried using XMLReader, but got error as my documents have no root tags on their own.
Please guide if I should stick to Regex or is there any way to use XML Parser.
Thank you!

Unless you can get it structured as an XML document, the tools in the .NET Libraries to read XML are going to be entirely useless.
What you have is not XML. Having a tag or two that would qualify as XML does not an XML document make. The problem is that it simply does not follow any of the rules of XML.
Moral of the story is that you will have to come up with your own method to parse this. If you like to drink the RegEx kool-aid, that'll be the best solution for ya. Of course, there are plenty of ways to skin this cat.

It looks like you aren't actually using XML, just using a token that looks similar to XML as a placeholder for replacement.
If that's the case, you should be using Regex.
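To make the placeholder idea concrete, here is a small sketch of the regex approach, shown in Python rather than C#. It assumes the template text has already been extracted as one plain string and that every placeholder looks like <Name/>. The pattern, the function name replace_placeholders, and the sample values are all made up for illustration.

```python
import re

# A placeholder is an XML-looking empty tag such as <TContact1/>
PLACEHOLDER = re.compile(r"<([A-Za-z]\w*)\s*/>")

def replace_placeholders(text, values):
    """Replace each <Name/> token with values[Name]; unknown tokens are left as-is."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, text)

letter = "For the attention of: <TContact1/> <TEmail/>\nDear <TContact1/>"
filled = replace_placeholders(letter, {"TContact1": "Ms Smith"})
# <TEmail/> has no value here, so it stays in the text untouched.
```

The same pattern carries over to Regex.Replace with a MatchEvaluator in C#. One thing to watch for: inside a real template, Word may split text the user typed across several runs, so the match has to be done on the joined text rather than run by run.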

I would suggest neither. Microsoft has a free library in C# specifically for modifying open xml format documents without an installation of Microsoft Office.
OpenXML SDK

Doesn't seem like XML processing to me. It's not an XML doc. It looks like straight string replacement, and for that, you're better off with a regular expression.

An XML parser doesn't help you locate XML; it only helps you understand a given piece of XML. You will need some other mechanism, perhaps a Regex, to find the XML.

It seems that the authors of most replies didn't read the question carefully.
inutan is asking for something that will parse Word documents. If a Word document is saved in docx format, it is actually a zipped package of XML files that can be read by XmlReader or XPathReader; however, I would not recommend doing it that way.
Normally, mail merge with Word doesn't require any programming or XML parsing, see http://helpdesk.ua.edu/training/word/merg07.html
However, if you still want to have XML-like fields in your Word templates and replace them with values, I would suggest using Word automation objects.
Below is an example of VBA code; for similar code in other languages please refer to the MS Office development site http://msdn.microsoft.com/en-us/library/bb726434.aspx . For example, if you use .NET, you should use the Office interop assemblies, and best of all is to install MS Visual Studio Tools for Office http://msdn.microsoft.com/en-us/library/5s12ew2x.aspx
With Selection.Find
.Text = "<TContact1/>"
.Replacement.Text = "TContact1"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
