Word found unreadable content in docm - c#

I am writing a program in C# using Open XML that transfers data from excel to word.
Currently, I have this:
internal override void UpdateSectionSheets(int sectionNum, List<List<string>> tableContents)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, true))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach(Table table in tables)
{
int row = 1;
if (table.Descendants<TableRow>().FirstOrDefault().Descendants<TableCell>().FirstOrDefault().InnerText == sectionNum.ToString())
{
foreach(var item in tableContents[0])
{
// splits the tableContents[0][row - 1] into individual strings at each instance of "\n\n"
String str = tableContents[0][row - 1];
String[] separator = {"\n\n"};
Int32 count = 6; // max 6 sub strings (usually only two but allowed for extra)
String[] subStrs = str.Split(separator, count, StringSplitOptions.RemoveEmptyEntries);
// transfer comment
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).RemoveAllChildren<Paragraph>(); // removes the existing contents in the cell
foreach (string s in subStrs)
{
// for every substring, create a new paragraph and append the substring to that new paragraph. Makes it so that each sentence is on its own line
Text text = new Text(s);
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
}
// transfer verdict
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).RemoveAllChildren<Paragraph>();
Paragraph p = new Paragraph(new ParagraphProperties(new Justification() { Val = JustificationValues.Center }));
p.Append(new Run(new Text(tableContents[1][row - 1])));
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).AppendChild(p);
row++;
}
}
}
doc.Save();
}
}
I believe the line causing the issue is: table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
If I put new Text(tableContents[0][row - 1]) in place of (text) in the above line, the program will run and word doc will open with no errors, but the output is not in the format I need.
The program runs without throwing any errors, but when I try to open the word doc it gives a "word found unreadable content in xxx.docm" error. If I say I trust the source and want word to recover the document, I can open the doc and see that the code is working how I want. However, I don't want to have to do that every time. Does anyone know what is causing the error and how I can fix it?

Related

C# Replacing Text In String Changes Paragraph Formatting in Word - Interop Assemblies

I have a code where I am iterating through every paragraph present in a word document with the use of the Primary Interop Assemblies. What I am essentially doing is extract all the text from each paragraph into a string. Then I searching that string for specific key words/phrases. If it is present it is swapped with something else. Then the paragraph is inserted back into the document.
This works perfect however on some documents what is happening is a new line is being added in between the paragraphs. Upon further investigation it turns out that the paragraph formatting is being altered, that is the line spacing after is increasing from zero to 12 and other things change as well, these include left indents is being removed from paragraphs etc.
I would like to know if there is any way to perform the above task without having the paragraph properties change when inserting the text back. My code is included below in order to show how I am iterating through the document.
Before getting to the main code I do have a word application and document open using the following namespace:
using Word = Microsoft.Office.Interop.Word
and then the following code
Word.Application app = new Word.Application();
Word.Document doc = app.Documents.Open(filePath, ReadOnly: false);
After opening the document I have done the following:
try
{
int totalParagraphs = document.Paragraphs.Count;
string final;
for (int i = 1; i <= totalParagraphs; i++)
{
string temp = document.Paragraphs[i].Range.Text;
if (temp.Length > 1)
{
Regex regex = new Regex(findText);
final = regex.Replace(temp, replaceText);
if (final != temp)
{
document.Paragraphs[i].Range.Text = final;
}
}
}
} catch (Exception) { }
Some things to note is that I have a if statement with "temp.Length > 1". I noticed that is there is nothing but a blank line, it is still counted as a paragraph and the text present inside that paragraph is of length one. When working with blank lines this actually adds in an extra line again when inserting it back in even if no replacements were done. So in order to combat this I simply used this to make sure the paragraph has at least one letter in it and is not just a blank line. This way no additional blank lines are added in between paragraphs.
I have found the answer to my own question. I have included the solution down below in case anyone else is having the same problem or would like it for reference.
What you have to do is get the paragraph format properties of the extracted text before any changes were made. Then once the paragraph is inserted back in, set the same properties we previously extracted to the inserted paragraph to counter any changes that may have been made. The full code is included below:
try
{
int totalParagraphs = document.Paragraphs.Count;
string final;
for (int i = 1; i <= totalParagraphs; i++)
{
string temp = document.Paragraphs[i].Range.Text;
float x1 = document.Paragraphs[i].Format.LeftIndent;
float x2 = document.Paragraphs[i].Format.RightIndent;
float x3 = document.Paragraphs[i].Format.SpaceBefore;
float x4 = document.Paragraphs[i].Format.SpaceAfter;
if (temp.Length > 1)
{
Regex regex = new Regex(findText);
final = regex.Replace(temp, replaceText);
if (final != temp)
{
document.Paragraphs[i].Range.Text = final;
document.Paragraphs[i].Format.LeftIndent = x1;
document.Paragraphs[i].Format.RightIndent = x2;
document.Paragraphs[i].Format.SpaceBefore = x3;
document.Paragraphs[i].Format.SpaceAfter = x4;
}
}
}
} catch (Exception) { }

Insert List<string> into word document

I'm currently trying to find a way to read in, and insert data into a word document. So far this is what I have gotten:
class Program
{
static void Main(string[] args)
{
var FileName = #"C:\temp\test.DOC";
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(#"C:\temp\test.DOC");
foreach (Paragraph objParagraph in doc.Paragraphs)
{
data.Add(objParagraph.Range.Text.Trim());
}
//data.Insert
data.Insert(16, "Test 1");
data.Insert(16, "\tTest 2\tName\tAmount");
data.Insert(16, "Test 3");
data.Insert(16, "Test 4");
data.Insert(16, "Test 5");
data.Insert(16, "Test 6");
data.Insert(16, "Test 7");
data.Insert(16, "Test 8");
data.Insert(16, "Test 9");
data.Insert(16, "Test 10");
var x = doc.Paragraphs.Add();
x.Range.Text.Insert(0,"\tTest 2\tName\tAmount");
doc.SaveAs2(#"C:\temp\test3.DOC");
((_Document)doc).Close();
((_Application)app).Quit();
}
}
Now, this successfully populates the List data - but I'm trying to append each new test element at the [16]th index, and save it into the word document. Is there a simple way to accomplish this, or am I just over-thinking this issue?
I realize the string list is separate from the Document object which represents the word document.
I have a few other places in the document where I am using bookmarks to add data, but I don't think it is possible to use bookmarks for placing the data in this instance - or If I don't have to use bookmarks I'd like to stray away from that.
EDIT: I am trying to insert X amount of elements at the [16]th position within the data[].
EDIT 2:
Essentially I am sourcing the data dynamically, and I'm not sure how many records/rows I'll need to add to the document, so it could be as follows:
[15]
[16]\tName\tID\tAMOUNT
[17]\tName\tID\tAMOUNT
[18]\tName\tID\tAMOUNT
Since the headers will already be there (NAME,ID,AMOUNT), and each time I run the program I'm not sure how many elements I'll be inserting into the document - so as long as each element is placed under one another, and on the 16th line in the document template I have setup that should accomplish what I am trying to do.
Image 1 - Image into string array
Image 2 - Image after adding content into the string - this is what the resulting document. (this is to be saved)
I'm attempting to put each element ie: Test1 Test2 Test3 in their each own column each (see above)
Again I am totally confused as to why you want to read the word file into a string list array. This simply adds the text you show after line 15 into the word document. You do not specify WHERE Test 1, Test 2, Test3... are coming from.
Edit: Added a try-catch just in case the document does not have at least 16 paragraphs.
static void Main(string[] args)
{
List<string> data = new List<string>();
Application app = new Application();
Document doc = app.Documents.Open(#"C:\temp\test.DOC");
string testRows = "Test 1\n\tTest 2\tName\tAmount\nTest 3\nTest 4\nTest 5\nTest 6\nTest 7\nTest 8\nTest 9\nTest 10\n";
try
{
var x = doc.Paragraphs[16];
x = doc.Paragraphs.Add(x.Range);
x.Range.Text = testRows;
doc.SaveAs2(#"C:\temp\test3.DOC");
}
catch (System.Runtime.InteropServices.COMException e)
{
Console.WriteLine("COMException: " + e.StackTrace.ToString());
Console.ReadKey();
}
((_Document)doc).Close();
((_Application)app).Quit();
}
So what I figured out (for my purposes) is that is is easiest to insert a list of strings into makeshift columns separated by tabs by inserting at specific paragraphs.
Since I am using bookmarks to place text as well - I found it useful to work from a copy of a document instead of worrying about removing/creating bookmarks each time.
When populating the list that you are going to be placing at a specific paragraph mark it is useful to append tab characters as well as newline charters on the fly. Later on this will make it easier to loop through the list and place them nicely on the document.
Depending on the way you are going to go about placing columns some logic will have to be determined to space everything correctly. I did this by creating maximum lengths for columns and trimming, and accommodating for smaller/larger lengths by adding specific amounts of tab characters.
So, my columns I am populating would look like:
myList.Add("\t12345678912345\tJohn Doe\t\t\t\t123456\r\n");
myList.Add("\987654321654987\tJohn Smith\t\t\t\98765\r\n");
These lines would be inserted at paragraph 17 and placed neatly under headers.
Lastly, I decided to use bookmarks to place single lines of text like the date,title, and signature values since those values don't need to be correctly spaced or anything.
At the end I delete the copy of the word document I'm working on, and delete the pdf (since in my case I'm sending it via email)
Thank you for the help #JohnG - I hope this answer might help others who come across it. I removed the try-catch since I'm working from the template as well.
File.Copy(sCurrentPath + "\\" + "testTemplate.DOC", sCurrentPath + "\\" + "test.DOC");
Application app = new Application();
Document doc = app.Documents.Add(sCurrentPath + "\\" + "test.DOC");
foreach (string sValue in myList)
{
var List = doc.Paragraphs[17];
myList = doc.Paragraphs.Add(myList.Range);
myList.Range.Text = sValue;
}
if (doc.Bookmarks.Exists("Date"))
{
object oBookMark = "Date";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = DateTime.Now.ToString("MM/dd/yyyy");
}
if (doc.Bookmarks.Exists("Signature"))
{
object oBookMark = "Signature";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "My Name";
}
if (doc.Bookmarks.Exists("Title"))
{
object oBookMark = "Title";
doc.Bookmarks.get_Item(ref oBookMark).Range.Text = "Title Here";
}
doc.ExportAsFixedFormat(sCurrentPath + "\\" + "test.pdf", WdExportFormat.wdExportFormatPDF);
File.Delete(sCurrentPath + "\\" + "testCopy.DOC");
File.Delete(sCurrentPath + "\\" + "test.pdf");
((_Document)doc).Close();
((_Application)app).Quit();

How to insert single value in single cell to CSV file?

I have this problem. I would like to create a csv file by using C#. So I try to development this code:
public static void creaExcel(Oggetto obj)
{
string filePath = #"C:\Temp\test.csv";
string delimiter = ",";
string[][] output = new string[][]{
new string[]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "},
new string[]{"Col1 Row 2", "Col2 Row 2", "Col3 Row 2"}
};
int length = output.GetLength(0);
StringBuilder sb = new StringBuilder();
for (int index = 0; index < length; index++)
sb.AppendLine(string.Join(delimiter, output[index]));
File.WriteAllText(filePath, sb.ToString());
// open xls file
}
This code found but I would like to insert single value in a single cell, so with this code I insert all value ([]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "}, ) in a single row, in a single cell, instead I would like to insert every value a single cell.
Can we help me?
Best reguards
The code is working fine because the result is:
TobRod Porosity,Batch code,Nu.,PAD,G.Po,L.PoD
Col1 Row 2,Col2 Row 2,Col3 Row 2
Can you confirm this?
Here is how it is displayed on my PC:
If you see all the values in a single cell on your machine, this means that there is a problem identifying the correct separator. In order to fix this, add this line: sep=, at the beginning of your CSV content, so the resulting content would be:
sep=,
TobRod Porosity,Batch code,Nu.,PAD,G.Po,L.PoD
Col1 Row 2,Col2 Row 2,Col3 Row 2
This way you can force certain devices (I know for sure that iPhones have an issue with this) to use the correct separator.
I would also suggest you to use " as a string qualifier. Example:
sep=,
"TobRod Porosity","Batch code","Nu.","PAD","G.Po","L.PoD"
"Col1 Row 2","Col2 Row 2","Col3 Row 2"
the created file is a - more or less - correct csv (comma separated values) file.
however if you open that file with excel and it puts all values in one cell, it doesn't know that you want to separate it with the comma. you can however teach it to. with excel 2013 you mark the cell and go to the tab DATA and the "text to Columns" button.
edit: however, i have the feeling that you would like to use CSV to create excel documents. thats not what CSV is made for. if you want to create real excel sheets have a look here: Create Excel (.XLS and .XLSX) file from C#
the thing is that you are selection the full array when you make the insertion
Here you accesing to the global array and getting or the first array or the second
output[index]
If you want to insert each value of the chosen array , you just have to loop again the selected array
output[index][anotherIndex]
For example
output[0][0]
Will return "TobRod Porosity" as selected value
I have fixed my error, so I have write this method
public static void creaExcel(Oggetto obj)
{
try
{
string filePath = #"TOBROD_POROSITY_" + Utility.getData() + ".csv";
string delimiter = ";";
string[][] output = new string[][]{
new string[]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "}
};
int length = output.GetLength(0);
StringBuilder sb = new StringBuilder();
for (int index = 0; index < length; index++)
sb.Append(string.Join(delimiter, output[index]));
sb.AppendLine("");
//una volta, settato l'header del file bisogna inserire i valori
if (obj != null && obj.listaMisure != null)
{
for (int i = 0; i < obj.listaMisure.Count(); i++)
{
ValoriMisure v = obj.listaMisure[i];
sb.AppendLine(obj.tobaccoPorosity
+ delimiter + obj.batchCode
+ delimiter + v.nu
+ delimiter + v.pad
+ delimiter + v.gPo
+ delimiter + v.lPod);
}
}
File.WriteAllText(filePath, sb.ToString());
//muovi il file nel percorso di destinazione
File.Move(filePath, pathFolderDestination+"\\"+filePath);
}
catch (Exception e)
{
log.Error(e);
}
}
We should see this code:
string delimiter = ";";
because if you insert this delimiter ";" you can write a value in different cell on CSV file.

Remove a specific column from a delimited file

I've been working with some big delimited text (~1GB) files these days. It looks like somewhat below
COlumn1 #COlumn2#COlumn3#COlumn4
COlumn1#COlumn2#COlumn3 #COlumn4
where # is the delimiter.
In case a column is invalid I might have to remove it from the whole text file. The output file when Column 3 is invalid should look like this.
COlumn1 #COlumn2#COlumn4
COlumn1#COlumn2#COlumn4
string line = "COlumn1# COlumn2 #COlumn3# COlumn4";
int junk =3;
int columncount = line.Split(new char[] { '#' }, StringSplitOptions.None).Count();
//remove the [junk-1]th '#' and the value till [junk]th '#'
//"COlumn1# COlumn2 # COlumn4"
I's not able to find a c# version of this in SO. Is there a way I can do that? Please help.
EDIT:
The solution which I found myself is like below which does the job. Is there a way I could modify this to a better way so that it narrows down the performance impact it might have in case of large text files?
int junk = 3;
string line = "COlumn1#COlumn2#COlumn3#COlumn4";
int counter = 0;
int colcount = line.Split(new char[] { '#' }, StringSplitOptions.None).Length - 1;
string[] linearray = line.Split(new char[] { '#' }, StringSplitOptions.None);
List<string> linelist = linearray.ToList();
linelist.RemoveAt(junk - 1);
string finalline = string.Empty;
foreach (string s in linelist)
{
counter++;
finalline += s;
if (counter < colcount)
finalline += "#";
}
Console.WriteLine(finalline);
EDITED
This method can be very memory expensive, as your can read in this post, the suggestion should be:
If you need to run complex queries against the data in the file, the right thing to do is to load the data to database and let DBMS to take care of data retrieval and memory management.
To avoid memory consumption you should use a StreamReader to read file line by line
This could be a start for your task, missing your invalid match logic
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string fileName = "temp.txt";
var results = FindInvalidColumns(fileName);
using (var reader = File.OpenText(fileName))
{
while (!reader.EndOfStream)
{
var builder = new StringBuilder();
var line = reader.ReadLine();
if (line == null) continue;
var split = line.Split(new[] { "#" }, 0);
for (var i = 0; i < split.Length; i++)
if (!results.Contains(i))
builder.Append(split[i]);
using (var fs = new FileStream("new.txt", FileMode.Append, FileAccess.Write))
using (var sw = new StreamWriter(fs))
{
sw.WriteLine(builder.ToString());
}
}
}
}
private static List<int> FindInvalidColumns(string fileName)
{
var invalidColumnIndexes = new List<int>();
using (var reader = File.OpenText(fileName))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (line == null) continue;
var split = line.Split(new[] { "#" }, 0);
for (var i = 0; i < split.Length; i++)
{
if (IsInvalid(split[i]) && !invalidColumnIndexes.Contains(i))
invalidColumnIndexes.Add(i);
}
}
}
return invalidColumnIndexes;
}
private static bool IsInvalid(string s)
{
return false;
}
}
}
First, what you will do is re-write the line to a text file using a 0-length string for COlumn3. Therefore the line after being written correctly would look like this:
COlumun1#COlumn2##COlumn4
As you can see, there are two delimiters between COlumn2 and COlumn4. This is a cell with no data in it. (By "cell" I mean one column of a certain, single row.) Later, when some other process reads this using the Split function, it will still create a new value for Column 3, but in the array generated by Split, the 3rd position would be an empty string:
String[] columns = stream_reader.ReadLine().Split('#');
int lengthOfThirdItem = columns[2].Length; // for proof
// lengthOfThirdItem = 0
This reduces invalid values to null and persists them back in the text file.
For more on String.Split see C# StreamReader save to Array with separator.
It is not possible to write to lines internal to a text file while it is also open for read. This article discusses it some (simultaneous read-write a file in C#), but it looks like that question-asker just wants to be able to write lines to the end. You want to be able to write lines at any point in the interior. I think this is not possible without buffering the data in some way.
The simplest way to buffer the data is rename the file to a temp file first (using File.CoMovepy() // http://msdn.microsoft.com/en-us/library/system.io.file.move(v=vs.110).aspx). Then use the temp file as the data source. Just open the temp file that to read in the data which may have corrupt entries, and write the data afresh to the original file name using the approach I describe above to represent empty columns. After this is complete, then you should delete the temp file.
Important
Deleting the temp file may leave you vulnerable to power and data transients (or software 'transients'). (I.e., a power drop that interrupts part of the process could leave the data in an unusable state.) So you may also want to leave the temp file on the drive as an emergency backup in case of some problem.

Rich Text to Plain Text via C#?

I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
dt.Rows.Add(text);
}
}
EDIT: Here is what the text ("1. Introduction") looks like in the Word document:
This is what it looks like before being put into my datatable:
And this is what it looks like when put into the datatable:
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
using (RichTextBox rtb = new RichTextBox())
{
rtb.Rtf = rtf;
return rtb.Text;
}
}
When I run the program, it bombs with the following error:
The variable rtf, at this point, looks like this:
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var charsToTrim = new[] { '\r', '\a', ' ' };
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
text = text.TrimEnd(charsToTrim);
dt.Rows.Add(text);
}
}
I don't know exactly what formatting you're trying to remove, but you could try something like:
text = text.Where(c => !Char.IsControl(c)).ToString();
That should strip the non-printing characters out.
Al alternative can be that You need to add a rich textbox in your form (you can keep it hidden if you don't want to show it) and when you have read all your data just assign it to the richtextbox. Like
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;
Why dont you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, #"[^\w\.#-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx
Totally different approach would be to look at the Open Office XML SDK.
This example should get you started.

Categories

Resources