Get different style sections in Microsoft Publisher via Interop - c#

I have a little C# app that is extracting text from a Microsoft Publisher file via the COM Interop API.
This works fine, but I'm struggling when a section contains multiple styles. Potentially every character in a word could have a different font, format, etc.
Do I really have to compare character by character? Or is there something that returns the differently styled sections, similar to how I can get the individual Paragraphs?
foreach (Microsoft.Office.Interop.Publisher.Shape shp in pg.Shapes)
{
    if (shp.HasTextFrame == MsoTriState.msoTrue)
    {
        text.Append(shp.TextFrame.TextRange.Text);
        for (int i = 0; i < shp.TextFrame.TextRange.WordsCount; i++)
        {
            TextRange range = shp.TextFrame.TextRange.Words(i + 1, 1);
            string test = range.Text;
        }
    }
}
Or is there in general a better way to extract the text from a Publisher file? But I have to be able to actually write it back with the same formatting. It's for a translation.

You could consider using the clipboard to copy text sections as RTF, which you can later paste back as RTF, as in the Word example below. I am not familiar with Publisher's object model.
wordDocument.Content.Paragraphs[1].Range.Copy(); // Word collections are 1-based
string rtf = System.Windows.Forms.Clipboard.GetText(TextDataFormat.Rtf);
System.Windows.Forms.Clipboard.SetText(rtf, TextDataFormat.Rtf); // later, before pasting back
Other than that, I have not found a collection of applied styles when using interop with any of the office products.

We tried an approach where we just compared as many font styles as possible for every character. Not pretty, but it works in most cases...
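The character-by-character comparison can be reduced to a small grouping step once the per-character formatting has been read. The following is only a sketch: `StyleRuns` and `Group` are hypothetical names, and it assumes the caller has already captured one style value per character (for example a tuple of font name, size and bold flag read through the Publisher object model), which is not shown here.

```csharp
using System;
using System.Collections.Generic;

public static class StyleRuns
{
    // Splits text into maximal runs of consecutive characters that share the same style.
    // styles[i] describes text[i]; TStyle needs value equality (a tuple works).
    public static List<(string Text, TStyle Style)> Group<TStyle>(
        string text, IReadOnlyList<TStyle> styles)
    {
        if (text.Length != styles.Count)
            throw new ArgumentException("Need exactly one style per character.");

        var runs = new List<(string Text, TStyle Style)>();
        var comparer = EqualityComparer<TStyle>.Default;
        int runStart = 0;
        for (int i = 1; i <= text.Length; i++)
        {
            // Close the current run at the end of the text or when the style changes.
            if (i == text.Length || !comparer.Equals(styles[i], styles[runStart]))
            {
                runs.Add((text.Substring(runStart, i - runStart), styles[runStart]));
                runStart = i;
            }
        }
        return runs;
    }
}
```

Each run can then be sent to translation as one unit and written back with the style stored next to it.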

Related

microsoft office interop word: how to know if a paragraph is a caption

I'm writing a VSTO add-ins in C# that parses an Office Word document.
I have to check if each figure of the document has a caption label.
I managed to know when a paragraph contains a Figure:
var activeDoc = Globals.ThisAddIn.Application.ActiveDocument;
for (int i = 0; i < activeDoc.Paragraphs.Count; i++)
{
    Paragraph par = activeDoc.Paragraphs[i + 1];
    if (par.Range.InlineShapes.Count == 1)
    {
        // the paragraph has an image
    }
}
but I don't see any ways to know if the paragraph is a caption or simple text.
I tried to use CaptionLabels but it returns the types of the captions [Figure, Table, Equation] and not all the captions in my documents.
I ran a quick test, and the paragraph that is a caption has a built-in style applied that's called 'Caption' (par.Style.NameLocal). If that name is always 'Caption' (or you pass it in as a parameter), you can distinguish caption paragraphs from non-caption paragraphs.
As a tip: write a little test code and set a breakpoint to examine the objects and find what makes them unique. In this case the paragraph/range style is 'Caption'. This is the most efficient way, in my opinion.

Why does the Notepad++ [NULL] character not paste?

I am new to this site, and I don't know if I am providing enough info - I'll do my best =)
If you use Notepad++, then you will know what I am talking about -- When a user loads a .exe into Notepad++, the NUL / \x0 character is replaced by NULL, which has a black background, and white text. I tried pasting it into Visual Studio, hoping to obtain the same output, but it just pasted some spaces...
Does anyone know if this is a certain key-combination, or something? I would like to put the NULL character in replacement of \x0, just like Notepad++ =)
Notepad++ is a rich text editor, unlike the regular Notepad. Like most modern text editors, it can render custom graphics. Whenever Notepad++ encounters the ASCII code of a null character while reading a file, it does not display nothing; instead it adds the string "NULL" to the UI with a black text background and white text, which is what you are seeing. You can show any custom style in your own rich text editor too.
NOTE: This is by no means an efficient solution. I traverse each line that was read twice, just to take advantage of existing methods. This can be done manually in a single pass. It is only meant as a hint about how you can do it. Also, I wrote the code carefully but haven't run it because I don't have the tools at the moment. I apologise for any mistakes; let me know and I'll update it.
Step 1: Read the text file line by line (a line ends at '\n') and replace all instances of the null character in that line with the string "NUL" using String.Replace(). Finally, append the modified text to your RichTextBox.
Step 2: Traverse the line again using String.IndexOf() to find the start index of each "NUL" word. Using these indexes, select the text in the RichTextBox and then style the selection using RichTextBox.SelectionColor and RichTextBox.SelectionBackColor.
richTextBoxCursor basically just represents the start index of each line in RichTextBox
using (StreamReader sr = new StreamReader(@"c:\test.txt", Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        // Start index of this line inside the RichTextBox
        int richTextBoxCursor = richTextBox.TextLength;
        string line = sr.ReadLine();
        line = line.Replace("\0", "NUL");
        richTextBox.AppendText(line + "\n"); // ReadLine strips the line break
        int i = 0;
        while (true)
        {
            i = line.IndexOf("NUL", i);
            if (i == -1) break;
            // Select(start, length): i is the start index of each "NUL" found in the line,
            // and 3 is the length because the word "NUL" has three characters
            richTextBox.Select(richTextBoxCursor + i, 3);
            richTextBox.SelectionColor = Color.White;
            richTextBox.SelectionBackColor = Color.Black;
            i += 3; // continue searching after this marker
        }
    }
}
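The single-pass variant mentioned in the note could look roughly like the sketch below. `NulMarker` and `Replace` are hypothetical names, not part of any library. The method builds the replaced line and records where each "NUL" marker starts, so the highlighting step does not need a second search with IndexOf.

```csharp
using System.Collections.Generic;
using System.Text;

public static class NulMarker
{
    // Replaces each '\0' in line with "NUL" in one pass and records where each
    // marker starts in the output, so the caller can highlight those ranges.
    public static (string Text, List<int> MarkerStarts) Replace(string line)
    {
        var sb = new StringBuilder(line.Length);
        var starts = new List<int>();
        foreach (char c in line)
        {
            if (c == '\0')
            {
                starts.Add(sb.Length); // index of the 'N' in the output
                sb.Append("NUL");
            }
            else
            {
                sb.Append(c);
            }
        }
        return (sb.ToString(), starts);
    }
}
```

A side effect of this design is that a literal "NUL" that already occurs in the file is not highlighted by mistake, because only positions that held a real null character are recorded. Each recorded start would be passed to richTextBox.Select(richTextBoxCursor + start, 3) as in the loop above.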
Notepad++ may use custom or special fonts to show these particular characters. This behavior may not be appropriate for all text editors, so they don't show them.
If you want to write a text editor that visualizes these characters, you probably need to implement this behavior programmatically. Looking at the Notepad++ source can be helpful if you want.
Text editor
As far as I know in order to make Visual Studio display non printable characters you need to install an extension from the marketplace at https://marketplace.visualstudio.com.
One such extension, which I have neither tried nor recommend - I just did a quick search and this is the first result - is
Invisible Character Visualizer.
Having said that, copy-pasting binaries is a risky business.
You may try Edit > Advanced > View White Space first.
Binary editor
To really see what's going on you could use the VS' binary editor: File->Open->(Open with... option)->Binary Editor -> OK
To answer your question.
It's a symbolic representation of the byte value 00H.
You're copying and pasting the values. Notepad++ is showing you symbols that replace the representation of those values (because you configured it to do so in that IDE).

Prepare text to paste to MS Word with proper alignment

I am working on a C# application that calculates some values. They are placed into a DataGridView. When I select values from a column of the DataGridView and I paste them to a column of a table from a Word document, I want the values to be center aligned. Even if I set all cells of the DataGridView to be center aligned, when I copy and paste the text into the Word table, they show up left aligned.
The code used to copy the table into clipboard is
Clipboard.SetDataObject(this.dataGridView1.GetClipboardContent());
How can I prepare the text so that when I paste it, it will show centered? If I copy and paste text from another column of Word, it maintains the original alignment. This indicates there are some special characters surrounding the cell values. I don't know how to view those characters (maybe I can add them to each value).
I'm going to cheat and start with an incomplete answer.
This RTF text seems like it should work for a single value:
{\rtf1\ansi\qc
This is some centered text.}
That's a "hard newline" after \qc or I think just a \n. If I put that RTF into a file then open in WordPad it shows up centered.
I've been playing with System.Windows.Forms.Clipboard.
Clipboard.SetData(DataFormats.Rtf, "{\\rtf1\\qc\n{\\b foo}}");
If I run the above even in a console application, I can next ctrl-V paste into MS Word and the bold works, but unfortunately the centering doesn't.
In any case, I then looked at pasting into a MS Word table and clearly it's not just a matter of text with newlines, some delimiter or other is required to show cell boundaries. So not only does the RTF I have not work, there's likely at least one more step / wrapper beyond the RTF to get a "column" not just a block of text.
Feel free to not vote, I just thought perhaps something here might be helpful to avoid both of us doing the same thing twice.
EDIT: DataFormats.Html may also work and seems it could even be the format normally used by your grid control. (though it also supports CSV)
However there's an extra clipboard header for HTML I haven't figured out yet described here: How to set HTML to clipboard in C#?
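The extra header is the one defined by the Windows HTML Clipboard Format: a few text lines holding byte offsets into the UTF-8 payload. The sketch below shows one way to build it. `HtmlClipboardFormat` and `Build` are hypothetical names, and whether Word then honours the alignment of the pasted HTML is an assumption that has not been tested here.

```csharp
using System.Text;

public static class HtmlClipboardFormat
{
    private const string Header =
        "Version:0.9\r\nStartHTML:{0:D10}\r\nEndHTML:{1:D10}\r\n" +
        "StartFragment:{2:D10}\r\nEndFragment:{3:D10}\r\n";
    private const string Prefix = "<html><body><!--StartFragment-->";
    private const string Suffix = "<!--EndFragment--></body></html>";

    // Wraps an HTML fragment in the clipboard header. All offsets are byte
    // offsets from the start of the data, counted in UTF-8.
    public static string Build(string fragment)
    {
        // Fixed-width numbers keep the header length constant, so it can be measured first.
        int startHtml = Encoding.UTF8.GetByteCount(string.Format(Header, 0, 0, 0, 0));
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(Prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(Suffix);
        return string.Format(Header, startHtml, endHtml, startFragment, endFragment)
               + Prefix + fragment + Suffix;
    }
}
```

The result would be placed on the clipboard under DataFormats.Html, for example with a fragment such as a table whose cells carry align="center".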
I will post my solution here so it can be used (and improved) by other people. We need to place a populated DataGridView object and a button on a form. The button calls the function CopyColumn(). I had some problems formatting the code properly (some of the long strings are separated as text by Stack Overflow; maybe somebody can help with including them in the code).
void CopyColumn()
{
    if (this.dataGridView1.GetCellCount(DataGridViewElementStates.Selected) > 0)
    {
        try
        {
            Clipboard.SetDataObject(this.dataGridView1.GetClipboardContent());
            string sText = Clipboard.GetText();
            string sColumn = FormatColumn(sText);
            // Put the Uncertainty column back on the clipboard as a centered RTF table
            Clipboard.SetData(DataFormats.Rtf, sColumn);
        }
        catch (System.Runtime.InteropServices.ExternalException)
        {
            MessageBox.Show("The Clipboard could not be accessed. Please try again.");
        }
    }
}

// Wraps every value in a one-cell RTF table row whose paragraph is centered (\qc)
string FormatColumn(string sValues)
{
    int nlines = NumLines(sValues);
    string[] values = Values(sValues);
    string sStart = @"{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang3081{\fonttbl{\f0\froman\fprq2\fcharset0 Times New Roman;}}
{\*\generator Riched20 10.0.14393}\viewkind4\uc1";
    string sEnd = "}";
    string sRowStart = @"\trowd\trgaph85\trleft5\trbrdrl\brdrs\brdrw10 \trbrdrt\brdrs\brdrw10 \trbrdrr\brdrs\brdrw10 \trbrdrb\brdrs\brdrw10 \trpaddl85\trpaddr85\trpaddfl3\trpaddfr3
\clvertalc\clbrdrl\brdrw10\brdrs\clbrdrt\brdrw10\brdrs\clbrdrr\brdrw10\brdrs\clbrdrb\brdrw10\brdrs \cellx1706
\pard\intbl\widctlpar\qc ";
    string sRowEnd = @"\cell\row";
    string sFormattedColumn = sStart;
    string sRow = string.Empty;
    for (int i = 0; i < nlines; i++)
    {
        sRow = sRowStart + values[i] + sRowEnd;
        sFormattedColumn += sRow;
    }
    sFormattedColumn += sEnd;
    return sFormattedColumn;
}

int NumLines(string sValue)
{
    string[] values = sValue.Split('\r');
    return values.Length;
}

// Splits the clipboard text into one value per row and strips the line feeds
string[] Values(string sValue)
{
    string[] values = sValue.Split('\r');
    for (int i = 0; i < values.Length; i++)
    {
        values[i] = values[i].Replace("\n", "");
    }
    return values;
}
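One thing the code above does not handle is cell values that contain characters with a special meaning in RTF. As a hedged addition, the sketch below escapes backslashes and braces and writes non-ASCII characters (a "±" in an uncertainty value, for instance) as \uN escapes. `RtfText` and `Escape` are hypothetical helper names. The single '?' fallback character relies on the \uc1 setting that is already part of sStart.

```csharp
using System.Text;

public static class RtfText
{
    // Escapes the RTF control characters \ { } and encodes non-ASCII characters
    // as \uN? (N is a signed 16-bit number, '?' is the one-character fallback).
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);
            else if (c > 127)
                sb.Append("\\u").Append(unchecked((short)c)).Append('?');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

In FormatColumn this would be applied as RtfText.Escape(values[i]) when building each row.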
I fear this is not related to your VBA, but a bug in MS Word (tested for 2010 and 2016).
It's enough to do the following (not sure it's the same bug you're suffering from):
1. Open a new Word doc.
2. Insert a table (let's say 4 columns, 4 rows).
3. Mark all cells and format them as centered.
4. Type some text into a cell, select it and copy it.
Now there are two scenarios:
A: Paste the copied text into a single free cell -> The pasted text is centered, as expected.
B: Mark several adjacent free cells and paste the text -> The pasted text is left-aligned instead of centered!!
If I use LibreOffice (which I recommend to you, too), all this works fine (no surprise).
No, I did not pay for MS Office, but I have to use it at work :/

Word - Replace text by hyperlinks

I am working on an MS Word add-in that reads the content of a document and replaces every occurrence of a specific word with a hyperlink.
So far, I came up with this working algorithm.
// Initializes the Find parameters
searchRange.Find.ClearFormatting();
searchRange.Find.Forward = true;
searchRange.Find.Text = "foo";
do
{
    searchRange.Find.Execute(Wrap: Word.WdFindWrap.wdFindStop);
    if (searchRange.Find.Found)
    {
        // Creates a Hyperlink at the found location in the current document
        this.WordDocument.Hyperlinks.Add(searchRange, externalLink, link, "bar");
    }
    searchRange.Find.Execute(Wrap: Word.WdFindWrap.wdFindStop);
} while (searchRange.Find.Found);
This code works; however, it can be slow on bigger documents. So instead of adding hyperlinks one by one, I wanted to simply use the Find.Replacement object with the WdReplace.wdReplaceAll option.
However, I cannot manage to replace my search result with a hyperlink.
Is there a way to replace a piece of text with a hyperlink using the Replace method?
In other words, I'd like to find a way to do this :
Find.Replacement.Text = new Hyperlink(...);
On the other hand, I've seen that by hitting Alt + F9 in Word, we can see hyperlinks as field codes.
The code looks like this :
{ HYPERLINK \l "link" \o "Caption" }
Another solution would be to be able to set the text replacement as that string and make Word interpret it and thus, create the link.
Thanks for reading.
As far as I know, fields can only be inserted programmatically or by using CTRL-F9. I see two possible reasons for this:
1. They are not simple text. They have two ranges, the Code and the Result, only one of which is displayed at any time.
2. Without a special mechanism to create a field, how else would a user insert text that looks like a field code but is not supposed to be one?

How to replace text in a PDF with C#?

I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
Is it possible to have C# code that replaces these placeholders with text of my choice?
No fields, no other complex stuff.
Is there any Open source library helping me achieve that?
This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in stream, but you must take into account that text may be split into many parentheses (convert stream to string to see what the format is).
Then, with code similar to this sample traverse through source pdf page by page and modify current page by searching for PdfContent items inside PdfReference items and replacing text in content's stream.
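To illustrate the "split into many parentheses" point: inside a content stream, a placeholder such as {FIRSTNAME} may appear as several literal strings in a TJ array, for example [({FIRST) -15 (NAME})] TJ, so a plain string search on the stream finds nothing. The sketch below is not PdfSharp code. `PdfTextFragments` and `JoinLiterals` are hypothetical names, and the method only joins the literal strings of one TJ operand so the placeholder can be detected.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class PdfTextFragments
{
    // Matches one PDF literal string "( ... )", allowing backslash escapes inside.
    private static readonly Regex Literal = new Regex(@"\((?:\\.|[^\\()])*\)");

    // Concatenates the literal strings of a TJ array such as "[(Hel) -20 (lo)] TJ".
    // Simplification: escapes are left undecoded, and nested parentheses,
    // hex strings and font encodings are ignored.
    public static string JoinLiterals(string tjOperand)
    {
        var sb = new StringBuilder();
        foreach (Match m in Literal.Matches(tjOperand))
            sb.Append(m.Value, 1, m.Value.Length - 2); // strip the surrounding parentheses
        return sb.ToString();
    }
}
```

For the operand above, the raw text does not contain "{FIRSTNAME}", while the joined literals do. A real replacement would then have to rewrite the whole TJ array, which this sketch does not attempt.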
The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
If you are able to replace text, you should be aware that the text following the replaced text will not reflow automatically. Even if you are fine with that, there are very few solutions that allow you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component
For simple text replace use iTextSharp library.
The code that replace one string with another is below.
Note that this will replace only simple text and may not work in all cases.
//using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            byte[] contentBytes = reader.GetPageContent(i);
            string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
            contentString = contentString.Replace(origText, replaceText);
            reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        }
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
As stated in a similar thread, this is not really possible in an easy way. The easier way seems to be to start from a DocX file, use the DocX library, which allows easy word swapping, and then convert the DocX to PDF (using the PDF Creator printer or similar).
Or use pdf sharp/migradoc to create new documents.
Updating a PDF is hard and dirty. So maybe adding content on top of the existing content will work for you as well, as it worked for me. If so, here's my primitive but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText
