I am trying to write a VSTO add-in for Word that adds some software requirements-management capabilities to it.
Assume a requirement follows this convention:
[REQ_<nr>] Specification item title
⌈ Specification item description. ⌋ (REQ_<nr1>, REQ_<nr2>)
Is there any way to declare this block and name it so that you can later find all requirements using OpenXML and C#?
Thanks
Word has two basic ways you can identify something in a document: Bookmarks and ContentControls. Both will work for OpenXML and for the Interop (C#).
Each bookmark must have a different name, so you'd use a naming convention with an incrementing counter (Req1, Req2 for example).
ContentControls are specifically designed to allow the same value for the Title property on multiple controls. The interop method Document.SelectContentControlsByTitle returns a collection of all content controls that have the given title. OpenXML can do the same, of course. The interop offers no comparable lookup for bookmarks.
Another difference between the two is that content controls can be protected, in case that's important for you.
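As a rough illustration of the OpenXML side, here is a minimal sketch that finds content controls by title using plain LINQ to XML over word/document.xml rather than the Open XML SDK. The class name RequirementFinder and the title "Requirement" are made up for this example; it assumes each requirement block is wrapped in a content control whose Title is set to that value.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RequirementFinder
{
    // WordprocessingML main namespace, as used in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the plain text of every content control (w:sdt) whose
    // title (w:sdtPr/w:alias) equals the given title.
    public static List<string> FindByTitle(XDocument documentXml, string title)
    {
        return documentXml
            .Descendants(W + "sdt")
            .Where(sdt => (string)sdt.Element(W + "sdtPr")
                                     ?.Element(W + "alias")
                                     ?.Attribute(W + "val") == title)
            .Select(sdt => string.Concat(
                sdt.Element(W + "sdtContent")?.Descendants(W + "t").Select(t => t.Value)
                ?? Enumerable.Empty<string>()))
            .ToList();
    }
}
```

To use it, load word/document.xml from the .docx package into an XDocument (for example through System.IO.Compression.ZipFile) and call RequirementFinder.FindByTitle(xml, "Requirement"). In the add-in itself, Document.SelectContentControlsByTitle plays the same role.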
My problem is this: I open a word document (using WordprocessingDocument) and take specific paragraphs from it. Then, I need to find these paragraphs (by text) in a Word activeDocument using .NET VSTO.
I can easily find the text using range.Text.Contains on the whole document, but since the text can appear more than once, I can't know which appearance it is.
Do paragraphs or ranges have any unique identifiers in the document?
Alternatively I can iterate all paragraphs and take the right one by the running index, but that is extremely costly in performance. So I need to get the information from range somehow.
Any ideas? Thank you
Is there a simple process someone can recommend for generating an RTF document from a pre-built "template" and populating its fields?
I would prefer to avoid MS Word automation solutions, as I cannot guarantee which MS Office version is installed, etc.
The resulting file needs to be editable, so I can't go with PDF.
Is it as simple as using something like NVelocity, or do I need to do something fancier?
thanks
You can always use keywords. For example:
This is a [text] that needs to be [replaced]
and then use a StringBuilder to replace the keywords.
StringBuilder sb = new StringBuilder(yourTemplate);
sb.Replace("[text]", "car");
sb.Replace("[replaced]", "washed");
So the final text will be:
This is a car that needs to be washed.
This is just one way of doing it.
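When there are many keywords, the same idea can be driven by a dictionary. The following is only a sketch: the class name TemplateFiller is hypothetical, and it assumes placeholders are single words wrapped in square brackets, as in the example above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateFiller
{
    // Replaces every [key] in the template with its value from the dictionary.
    // Placeholders that have no matching key are left untouched.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Regex.Replace(template, @"\[(\w+)\]", match =>
        {
            string key = match.Groups[1].Value;
            return values.TryGetValue(key, out string value) ? value : match.Value;
        });
    }
}
```

Calling TemplateFiller.Fill with the template above and the pairs text/car and replaced/washed gives the same sentence as the StringBuilder version. For RTF in particular, keep in mind that backslashes and curly braces are control characters, so replacement values containing them would need escaping, and an editor may split a placeholder across formatting runs, in which case a plain text match will not find it.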
I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
Is it possible to have C# code that replaces these placeholders with text of my choice?
No fields, no other complex stuff.
Is there any open-source library that would help me achieve that?
This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in the content stream, but you must take into account that a piece of text may be split across several parenthesized strings (convert the stream to a string to see what the format is).
Then, with code similar to this sample, traverse the source PDF page by page and modify the current page by searching for PdfContent items inside PdfReference items and replacing the text in each content's stream.
The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
Even if you are able to replace text, you should be aware that there will be no automatic reflow of the text following the replaced text. If you are fine with that, there are still very few solutions that allow you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component
For simple text replacement, use the iTextSharp library.
The code that replaces one string with another is below.
Note that this will replace only simple text and may not work in all cases.
// Requires: using System.IO; using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Treat the page's content stream as text and do a plain string replace.
            // This only matches text that is stored contiguously and unencoded in the stream.
            byte[] contentBytes = reader.GetPageContent(i);
            string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
            contentString = contentString.Replace(origText, replaceText);
            reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        }
        // Closing the stamper writes the modified document to ResultFile.
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
As stated in a similar thread, there is no really easy way to do this. The easier route seems to be to start from a DocX file, use the DocX library (which allows easy word swapping), and then convert the DocX to PDF (using the PDF Creator printer or similar).
Or use PDFsharp/MigraDoc to create new documents.
Updating a PDF is hard and dirty. So maybe adding content on top of the existing content will work for you as well, as it worked for me. If so, here's my primitive but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText
I have a friend who is writing a 400-page book in Microsoft Word 2007.
Throughout the book he has 200 stories each which consist of numerous paragraphs.
When he is finished writing the book, he wants to copy the text of each story that is embedded in his Word document into a database table such as:
Title, varchar(200)
Description, text
Content, text
We do not want to have to copy and paste each story into the database but want to have a program automatically pull the marked up data from the Word file into the appropriate fields in the database.
What does he have to do in Microsoft Word to denote each group of paragraphs as "story content" and each title as a "story title" etc. A prerequisite is that this markup cannot be visible in the document. I know that Word 2007 files are basically zipped XML files so I assume this is possible and I assume that stylesheets are what we need, but how do I need to prepare the Word document precisely so that as he adds stories they are properly marked up?
I assume that the new COM Interop features of C# 4.0 are what I need to analyze the Word file and retrieve only the title, description, and content from the embedded stories, but how do I do this technically? Does anyone have examples?
Does anyone have experience doing a project like this (reading a Microsoft Word file as a semantic data file) that they could share?
What I would do is use styles. Have one style for each type of content, and write a macro that traverses your document paragraph-by-paragraph and spits out the corresponding text file.
Okay, this can be resolved in numerous ways.
First of all, I would suggest that you save the file as a *.txt, to have some plain text to parse.
Then your friend will have to be really consistent while writing, because the text parser you will create depends on that consistency.
Make some rules like:
Title on first line, then 2 linebreaks;
All the paragraphs separated with 1 linebreak;
Then 3 linebreaks after the last paragraph;
After that, load the file, and parse it using the rules above.
{enjoy}
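The linebreak rules above could be parsed roughly as follows. This is a sketch under one reading of those rules: a title, a blank line, the paragraphs one per line, then two blank lines before the next story. The Story and StoryParser names are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public sealed class Story
{
    public string Title { get; set; }
    public List<string> Paragraphs { get; set; }
}

public static class StoryParser
{
    public static List<Story> Parse(string text)
    {
        // Normalize Windows line endings so the rules can be expressed with '\n' only.
        text = text.Replace("\r\n", "\n");
        var stories = new List<Story>();

        // Three linebreaks end a story.
        foreach (string block in text.Split(new[] { "\n\n\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Two linebreaks separate the title from the body.
            string[] parts = block.Split(new[] { "\n\n" }, 2, StringSplitOptions.None);
            if (parts.Length < 2) continue; // no body: skip malformed block

            stories.Add(new Story
            {
                Title = parts[0].Trim(),
                // One linebreak separates paragraphs.
                Paragraphs = new List<string>(
                    parts[1].Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
            });
        }
        return stories;
    }
}
```

Each Story can then be written to the database table. Any deviation from the rules in the manuscript (an extra blank line inside a story, for instance) will shift the boundaries, which is why the consistency matters.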
Following is the XML for a docx document that contains a heading with the word "Title" and two paragraphs with the word "Content". Study a sample file of the book while your friend is writing it, use a uniform format for all heading and paragraph elements, and you will be able to parse it pretty easily. The content is in word/document.xml inside the zipped docx file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="005C78DC" w:rsidRDefault="00350339" w:rsidP="00350339"><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:p w:rsidR="00350339" w:rsidRPr="00350339" w:rsidRDefault="00350339" w:rsidP="00350339"><w:r><w:t>Content</w:t></w:r></w:p><w:sectPr w:rsidR="00350339" w:rsidRPr="00350339" w:rsidSect="005C78DC"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
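A minimal sketch of parsing that XML with LINQ to XML, assuming every story title uses the Heading1 style and all paragraphs up to the next heading belong to the story. The class name DocxStoryReader is made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DocxStoryReader
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns (title, content) pairs: each Heading1 paragraph starts a new
    // story, and the paragraphs that follow it are joined into its content.
    public static List<KeyValuePair<string, string>> ReadStories(XDocument documentXml)
    {
        var stories = new List<KeyValuePair<string, string>>();
        string title = null;
        var content = new List<string>();

        foreach (XElement p in documentXml.Descendants(W + "p"))
        {
            string style = (string)p.Element(W + "pPr")?.Element(W + "pStyle")?.Attribute(W + "val");
            // A paragraph's text may be spread over several runs (w:r/w:t).
            string text = string.Concat(p.Descendants(W + "t").Select(t => t.Value));

            if (style == "Heading1")
            {
                if (title != null)
                    stories.Add(new KeyValuePair<string, string>(title, string.Join("\n", content)));
                title = text;
                content.Clear();
            }
            else if (title != null)
            {
                content.Add(text);
            }
        }

        if (title != null)
            stories.Add(new KeyValuePair<string, string>(title, string.Join("\n", content)));
        return stories;
    }
}
```

To feed it a real file, open the .docx with System.IO.Compression.ZipFile.OpenRead, get the word/document.xml entry, and pass its stream to XDocument.Load. For the sample above this yields one story titled "Title" whose content is the two "Content" paragraphs.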
Use Bookmarks for Start and Stop of Each Story
I strongly suggest this technique.
Mark the start and end of each "story" with Word's Bookmark feature. To see "bookmarks", go to Word Options, Advanced, Show document content, and check Show bookmarks.
Then just go through the document collecting the content between the bookmarks.
This is fairly easy, and it is a technique I have been using since Word 6.x. The only issue is having to come up with 200 bookmark names. Yet this may be an advantage, because the bookmark name could then be migrated to a "name" field in the database.
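For a .docx file, collecting the content between bookmarks might look like the sketch below, which reads word/document.xml with LINQ to XML instead of going through the interop. The class name BookmarkReader is hypothetical. It relies on Word writing a w:bookmarkStart element (with a name and an id) and a w:bookmarkEnd element with the same id around the bookmarked range.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class BookmarkReader
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns bookmark name -> text found between its start and end markers.
    public static Dictionary<string, string> ReadBookmarks(XDocument documentXml)
    {
        var open = new Dictionary<string, string>();            // bookmark id -> name
        var collected = new Dictionary<string, StringBuilder>(); // name -> text so far

        // Descendants() walks the elements in document order.
        foreach (XElement e in documentXml.Descendants())
        {
            if (e.Name == W + "bookmarkStart")
            {
                string name = (string)e.Attribute(W + "name");
                open[(string)e.Attribute(W + "id")] = name;
                collected[name] = new StringBuilder();
            }
            else if (e.Name == W + "bookmarkEnd")
            {
                open.Remove((string)e.Attribute(W + "id"));
            }
            else if (e.Name == W + "p")
            {
                // A new paragraph inside an open bookmark becomes a line break.
                foreach (string name in open.Values)
                    if (collected[name].Length > 0) collected[name].Append('\n');
            }
            else if (e.Name == W + "t")
            {
                foreach (string name in open.Values)
                    collected[name].Append(e.Value);
            }
        }
        return collected.ToDictionary(kv => kv.Key, kv => kv.Value.ToString());
    }
}
```

Word may also add hidden bookmarks of its own (such as _GoBack), so it is probably worth filtering the result by the naming convention used for the stories.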
Using Styles to Mark Story Content
Another technique is to define a specific style or set of styles that make up a story. You then extract the text that carries those styles. This is a little harder and can be error-prone if the author is not disciplined.
Using Text Boxes That Contain Story Content
Lastly, if these "stories" can be placed into text boxes, you can simply extract each text box's content. The problem with this approach is the limitations of text boxes and the document layout changes, which the author may not want to apply.
Notes
There are other ways, but the bookmark approach is the easiest to use and implement. I will try to respond to any comments/questions you have.
MSDN Search for "vsto word bookmark" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%20bookmark&refinement=-112&ac=3
MSDN Search for "vsto word 2007" at http://social.msdn.microsoft.com/Search/en-US?query=vsto%20word%202007&refinement=-112&ac=3
So let's say I have a program with just a text box and an okay button. The user types in whatever word he wants, and when he clicks ok, it opens a specific file called Test.doc and CTRL+F for the word "test" and replaces it with whatever the user entered into the text box. How can I open said file and replace instances of the word test with the user's defined word?
Ignoring the format of the document, you could literally use the following for any type of file:
var contents = System.IO.File.ReadAllText(@"C:\myDoc.doc");
contents = contents.Replace("Test", "Tested");
System.IO.File.WriteAllText(@"C:\myDoc.doc", contents);
The best way would be to use the ms office interop library though.
Andrew
A number of things:
I'd recommend using a FileDialog to get the file's location. This lets you select the file to edit, but also gives you functionality to only show the file types that you want to handle in this program.
If you're handling .doc's, I'd suggest you look into VSTO and opening word docs. Here's a guide I found after a quick search. I'd suggest using it as a place to start, but you'll need to look around for more specifics.
Lastly, the string.Replace("", ""); method is probably very helpful in the CTRL-F functionality. You should be able to extract a string of the text from whatever document you're analyzing and use that method.