How to parse text from MS Word document to string

How to parse text from MS Word document to string - c#

I am trying to find a way to parse a word document's text to a string in my project.I have more than 600 word(.doc) files that I need to get the text content(with the new lines and tabs if possible) and assign it to a string for each one.
I've been reading stuff about the Open XML SDK but it looks quite complicated for something that looks so simple.

Open XML SDK is only for 2007 and newer formats and it is not trivial to use.
If performance is not an issue you could use Word Automation and have Word do this for you.
It will look something like this:
var app = new Application();
var doc = app.Documents.Open(documentLocation);
string rangeText = doc.Range().Text;
doc.Save();
doc.Close();
Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);
Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are move complex (footnotes, text boxes, tables...).
Another option is have word save the document as a text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx

You could give a look at NPOI:
This project is the .NET version of POI Java project at
http://poi.apache.org/. POI is an open source project which can help
you read/write xls, doc, ppt files. It has a wide application.
Take a look at this previous SO thread for more information.

Related

Copy content from a Word document to another with the style

I want to copy the content of a section in a Word document to a new document.
I do this to copy :
var docPath = #"C:\temp\myDoc.docx";
var doc = word.Documents.Open(FileName: docPath, ReadOnly: true);
var emptyDoc = word.Documents.Add();
doc.Sections.First.Range.Copy();
emptyDoc.Sections.First.Range.Paste();
This works well to copy content, but the style is not the same. How can I copy the complete section and have it rendered exactly the same way in the new document ?
If there is a better solution involving the OpenXML SDK instead of VSTO, I can take it.

You will find it much easier to automate Word if you do things manually first. That way you can get a better understanding of the various options available etc. You can also record a macro which will often, though not always, provide the answer.
In this instance you need to automate selecting 'Keep Source Formatting' from the context toolbar that appears after pasting. The code you need for that is:
emptyDoc.Sections.First.Range.PasteAndFormat wdFormatOriginalFormatting

Save a string with HTML format to doc (with unicode)

I try many way but I don't get the result that I expect. Please help, thanks.
I have a byte array, and I read it to a string, the result is :
string mystring = "<p>Today is a <b>beautiful</b> day</p>"
now I want to return it to a DOC file with HTML format.
I have the problem that I cant get a file with unicode format.
This is what I want in doc file :
Today is a beautiful day
Can anyone help me find the way I can save my string to doc file with unicode encode?

If you want to create a very simple Word document from C# code, I would use the .docx format, and follow the MSDN blog entry here.
It includes full sample code to download.
From the page linked to above:
This code will generate a DOCX that loads in Word 2007 or any other
valid Open XML consumer.

How to replace text in a PDF with C#?

I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
is it possible to have C# code that replace these templates with a text of my choice?
No fields, no other complex stuff.
Is there any Open source library helping me achieve that?

This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in stream, but you must take into account that text may be split into many parentheses (convert stream to string to see what the format is).
Then, with code similar to this sample traverse through source pdf page by page and modify current page by searching for PdfContent items inside PdfReference items and replacing text in content's stream.

The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
If you will be able to replace text, then you should be aware that there will be no automatic reflow of the text following the replaced text. Given that you are fine with that, then there are very few solutions that allows you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component

For simple text replace use iTextSharp library.
The code that replace one string with another is below.
Note that this will replace only simple text and may not work in all cases.
//using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}

As stated in similar thread this is not really possible an easy way. The easier way it seems to be getting a DocX file and using DocX library which allow easy word swapping and then converting your DocX to PDF (using PDF Creator printer or so).
Or use pdf sharp/migradoc to create new documents.

Updating in PDF is hard and dirty. So may be adding a content on top of existing will work for you as well, as it worked for me. If so, here's my primitive, but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText

How to prepare a Word 2007 document so that C# can pull data out of it semantically?

I have a friend who is writing a 400-page book in Microsoft Word 2007.
Throughout the book he has 200 stories each which consist of numerous paragraphs.
When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:
Title, varchar(200)
Description, text
Content, text
We do not want to have to copy and paste each story into the database but want to have a program automatically pull the marked up data from the Word file into the appropriate fields in the database.
What does he have to do in Microsoft Word to denote each group of paragraphs as "story content" and each title as a "story title" etc. A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files so I assume this is possible and I assume that stylesheets are what we need, but how do I need to prepare the Word document precisely so that as he adds stories they are properly marked up?
I assume that the new COM Interop features of C# 4.0 is what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?
Does anyone have experience doing a project like this (reading Microsoft Word as a semnatic data file) that they could share?

What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.

Okay, this can be resolved in numerous ways.
First of all, I would suggest that you save the file to a *.txt, to have some plain text to parse.
Then, your friend will have to be really consistent during the writing, because what you will create, (text parser) will need consistency.
Make some rules like :
Title on first line, then 2 linebreaks;
All the paragraphs separated with 1 linebreak;
Then 3 linebreaks after the last paragraph;
After that, load the file, and parse it using the rules above.
{enjoy}

Following is the xml for a docx document, which contains a heading containing the word "Title" and two paragraphs containing the word "Content". Study a sample file of the novel while your friend is writing it, use a uniform format for all heading and paragraph elelments and you will be able to parse it pretty easily.The content is in the word/document.xml of the zipped docx file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339"><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>

Use Bookmarks for Start and Stop of Each Story
I strongly suggest this technique.
Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.
Then just go through the document collecting the content between the bookmarks.
Fairly easy and a technique I been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet, this may be an advantage because the bookmark name could be the migrated to a "name" field in the database.
Using Styles to Mark Story Content
Another technique is to define specific style or styles that make up the story. You then extract the styles. This is a little harder and can be error prone if the author is not disciplined.
Using Text Boxes That Contain Story Content
Lastly, if these "stories" can be placed into a "text box", you can simply extract the text-boxes content. The problem with this approach is the limitations of the text-box and document layout changes which the author may not what to apply.
Notes
There are others ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments/questions you have.
MSDN Search for "vsto word bookmark" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%20bookmark&refinement=-112&ac=3
MSDN Search for "vsto word 2007" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%202007&refinement=-112&ac=3

Regex or XML Parser C#

I have some word templates(dot/dotx) files that contain xml tags along with plain text.
At run time, I need to replace the xml tags with their respective mail merge fields.
So, need to parse the document for these xml tags and replace them with merge fields.
I was using Regex to find and replace these xml tags. But I was suggested to use XML parser to parse for XML tags ([Regex for string enclosed in <*>, C#).
The sample document looks like:
Solicitor Letter
<Tfirm/>
<Tbuilding/>
<TstreetNumber/> <TstreetName/>
For the attention of: <TContact1/> <TEmail/>
Dear <TContact1/>
RE: <Pbuilding/> <PstreetNumber/> <PstreetName/> <Pvillage/> <PTown/>
We were pleased to hear that contracts have now been exchanged in the sale of the
above property on behalf of our mutual client/s. We now have pleasure in enclosing a
copy of our invoice for your kind attention upon completion.
....
One more note, the angle brackets are typed manually by end user in the template.
I tried using XMLReader, but got error as my documents have no root tags on their own.
Please guide if I should stick to Regex or is there any way to use XML Parser.
Thank you!

Unless you can get it structured as an XML document, the tools in the .NET Libraries to read XML are going to be entirely useless.
What you have is not XML. Having a tag or two that would qualify as XML does not an XML document make. The problem is that it simply does not follow any of the rules of XML.
Moral of the story is that you will have to come up with your own method to parse this. If you like to drink the RegEx kool-aid, that'll be the best solution for ya. Of course, there are plenty of ways to skin this cat.

It looks like you aren't actually using XML, just using a token that looks similar to XML as a placeholder for replacement.
If that's the case, you should be using Regex.

I would suggest neither. Microsoft has a free library in C# specifically for modifying open xml format documents without an installation of Microsoft Office.
OpenXML SDK

Doesn't seem like XML processing to me. It's not an XML doc. It's looks like straight string-replacement, and for that, you're better off with a Regular Expression.

An XML parser doesn't help you locate XML; it only helps you understand a given piece of XML. You will need some other mechanism, perhaps a Regex, to find the XML.

Seems that authors of most replies didnt read the question carefully.
inutan is asking for something that will parse Word documents. If a Word document is saved in docx format, it will be actually XML file that can be read by XML Reader or XPathReader, however I will not recomend to do it
Normally, mail merge with Word doesnt require any programming and XML parsing, see http://helpdesk.ua.edu/training/word/merg07.html
However if you still want to have XML-like fields in your Word templates and replace them with values, I would suggest using Word automation objects.
Below is an example of VBA code, for a similar code on other languages please refer MS Office development site http://msdn.microsoft.com/en-us/library/bb726434.aspx . For example if you use .NET - you should use Office interops and best of all is to install MS Visual Studio Tools for Office development http://msdn.microsoft.com/en-us/library/5s12ew2x.aspx
With Selection.Find
.Text = "<TContact1/>"
.Replacement.Text = "TContact1"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to parse text from MS Word document to string - c#

You could give a look at NPOI: This project is the .NET version of POI Java project at http://poi.apache.org/. POI is an open source project which can help you read/write xls, doc, ppt files. It has a wide application. Take a look at this previous SO thread for more information.

Related

Copy content from a Word document to another with the style

Save a string with HTML format to doc (with unicode)

How to replace text in a PDF with C#?

How to prepare a Word 2007 document so that C# can pull data out of it semantically?

Regex or XML Parser C#

Categories

Resources