I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
dt.Rows.Add(text);
}
}
EDIT: Here is what the text ("1. Introduction") looks like in the Word document:
This is what it looks like before being put into my datatable:
And this is what it looks like when put into the datatable:
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
using (RichTextBox rtb = new RichTextBox())
{
rtb.Rtf = rtf;
return rtb.Text;
}
}
When I run the program, it bombs with the following error:
The variable rtf, at this point, looks like this:
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var charsToTrim = new[] { '\r', '\a', ' ' };
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
text = text.TrimEnd(charsToTrim);
dt.Rows.Add(text);
}
}
I don't know exactly what formatting you're trying to remove, but you could try something like:
text = new string(text.Where(c => !Char.IsControl(c)).ToArray());
That should strip the non-printing characters out.
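A minimal sketch of the control-character filter as a reusable helper, assuming plain C# with LINQ. The names `TextCleaner` and `StripControlChars` are made up for illustration. Note that the filtered sequence has to be turned back into a string with `new string(...)`; calling `ToString()` on the sequence itself would return its type name rather than the characters.

```csharp
using System;
using System.Linq;

public static class TextCleaner
{
    // Removes every character that char.IsControl reports as a control
    // character (for example '\r', '\n', '\t' and '\a').
    public static string StripControlChars(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }
        return new string(input.Where(c => !char.IsControl(c)).ToArray());
    }
}
```

With the cell text from the question, `TextCleaner.StripControlChars("1. Introduction\r\a")` would leave just `1. Introduction`. Unlike `TrimEnd`, this also removes control characters in the middle of the string, which may or may not be what you want for multi-paragraph cells.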
An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox. For example:
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;
Why don't you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, @"[^\w\.@-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx
A totally different approach would be to look at the Open XML SDK.
This example should get you started.
I am writing a program in C# using Open XML that transfers data from excel to word.
Currently, I have this:
internal override void UpdateSectionSheets(int sectionNum, List<List<string>> tableContents)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(MainForm.WordFileDialog.FileName, true))
{
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
foreach(Table table in tables)
{
int row = 1;
if (table.Descendants<TableRow>().FirstOrDefault().Descendants<TableCell>().FirstOrDefault().InnerText == sectionNum.ToString())
{
foreach(var item in tableContents[0])
{
// splits the tableContents[0][row - 1] into individual strings at each instance of "\n\n"
String str = tableContents[0][row - 1];
String[] separator = {"\n\n"};
Int32 count = 6; // max 6 sub strings (usually only two but allowed for extra)
String[] subStrs = str.Split(separator, count, StringSplitOptions.RemoveEmptyEntries);
// transfer comment
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).RemoveAllChildren<Paragraph>(); // removes the existing contents in the cell
foreach (string s in subStrs)
{
// for every substring, create a new paragraph and append the substring to that new paragraph. Makes it so that each sentence is on its own line
Text text = new Text(s);
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
}
// transfer verdict
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).RemoveAllChildren<Paragraph>();
Paragraph p = new Paragraph(new ParagraphProperties(new Justification() { Val = JustificationValues.Center }));
p.Append(new Run(new Text(tableContents[1][row - 1])));
table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(3).AppendChild(p);
row++;
}
}
}
doc.Save();
}
}
I believe the line causing the issue is: table.Descendants<TableRow>().ElementAt(row).Descendants<TableCell>().ElementAt(2).AppendChild(new Paragraph(new Run(text)));
If I put new Text(tableContents[0][row - 1]) in place of (text) in the above line, the program runs and the Word document opens with no errors, but the output is not in the format I need.
The program runs without throwing any errors, but when I try to open the Word document it gives a "Word found unreadable content in xxx.docm" error. If I say I trust the source and want Word to recover the document, I can open it and see that the code is working how I want. However, I don't want to have to do that every time. Does anyone know what is causing the error and how I can fix it?
I have a CSV file in which one field contains a comma. For example, under the office location column I have the value xyz, building. When I checked the value in the debugger, it only showed "\"xyz". I have tried escaping the comma and the backslash by using Replace(",","") and Replace("\"","") but it failed. I am also getting an extra \ in the result, as marked by the red circle.
I have attached an image taken while debugging that shows the structure of the CSV row. The problem is in the red circle area.
I have also tried the following function:
public static string RemoveColumnDelimitersInsideValues(string input)
{
const char valueDelimiter = '"';
const char columnDelimiter = ',';
StringBuilder output = new StringBuilder();
bool isInsideValue = false;
for (var i = 0; i < input.Length; i++)
{
var currentChar = input[i];
if (currentChar == valueDelimiter)
{
isInsideValue = !isInsideValue;
output.Append(currentChar);
continue;
}
if (currentChar != columnDelimiter || !isInsideValue)
{
output.Append(currentChar);
}
}
return output.ToString();
}
Kindly help in resolving the issues. Thanks
The \ character you see is not in the actual string, it's just an escaping character added in the debugger view.
Click on the magnifier to get the actual value of the string.
Hope it helps.
Try using TextFieldParser. In CSV, a column value that contains a comma is enclosed in quotes, so setting HasFieldsEnclosedInQuotes to true will make the parser read it as a single column.
using Microsoft.VisualBasic.FileIO;
using (TextFieldParser reader = new TextFieldParser(csvpath))
{
reader.Delimiters = new string[] { "," };
reader.HasFieldsEnclosedInQuotes = true;
string[] col = reader.ReadFields();
}
String.Replace doesn't modify the existing string; it returns a new one. Because of that, you still have the same old row string outside the IsNullOrEmpty check.
Also, you say you are trying to escape commas and quotes, but your code actually removes them.
If you want to remove commas and quotes, your code may look like
if (!string.IsNullOrEmpty(row))
{
row = row.Replace(",", "").Replace("\"", "");
}
If you want to escape quotes and commas, your code may look like
if (row != null && row.Contains(","))
{
row = "\"" + row.Replace("\"", "\"\"") + "\"";
}
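As a slightly fuller sketch of the escaping variant: a field usually needs quoting not only when it contains a comma, but also when it contains a double quote or a line break. The helper below assumes that convention; `CsvEscaper` and `EscapeField` are hypothetical names, not part of any library.

```csharp
using System;

public static class CsvEscaper
{
    // Quotes a single field when it contains a comma, a double quote,
    // or a line break, doubling any embedded double quotes.
    public static string EscapeField(string field)
    {
        if (field == null)
        {
            return string.Empty;
        }
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```

This is meant to be applied to each field before joining the fields with commas, not to a whole row, since quoting a whole row would turn it into a single column.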
There are 3 issues with your code that are worth pointing out.
1. Parsing a CSV can be tricky
Would your code handle a multiline string correctly? Would your code handle a " inside one of the columns (that is, an escaped ")?
I recommend using a CSV reading library (for example, a NuGet package).
2. There is no backslash
Here is a file.
1,"The string in the first row has a comma, and an f, in it"
2,The string in the 2nd row does not have a comma in it
Here is what Visual Studio shows (I'm using VS Code here).
Here is what Console.WriteLine prints.
1,"The string in the first row has a comma, and an f, in it"
2,The string in the 2nd row does not have a comma in it
3. Replacing commas
Even if you deal with the quotes, wouldn't replacing commas get rid of the field delimiter?
I need to be able to rename the column in a spreadsheet from 'idn_prod' to 'idn_prod1', but there are two columns with this name.
I have tried implementing code from similar posts, but I've only been able to update both columns. Below you'll find the code I have that just renames both columns.
//locate and edit column in csv
string file1 = @"C:\Users\username\Documents\AppDevProjects\import.csv";
string[] lines = System.IO.File.ReadAllLines(file1);
System.IO.StreamWriter sw = new System.IO.StreamWriter(file1);
foreach(string s in lines)
{
sw.WriteLine(s.Replace("idn_prod", "idn_prod1"));
}
I expect only the 2nd column to be renamed, but the actual output is that both are renamed.
Here are the first couple rows of the CSV:
I'm assuming that you only need to update the column header, the actual rows need not be updated.
var file1 = @"test.csv";
var lines = System.IO.File.ReadAllLines(file1);
var columnHeaders = lines[0];
var textToReplace = "idn_prod";
var newText = "idn_prod1";
var indexToReplace = columnHeaders
.LastIndexOf("idn_prod");//LastIndex ensures that you pick the second idn_prod
columnHeaders = columnHeaders
.Remove(indexToReplace,textToReplace.Length)
.Insert(indexToReplace, newText);//I'm removing the second idn_prod and replacing it with the updated value.
using (System.IO.StreamWriter sw = new System.IO.StreamWriter(file1))
{
sw.WriteLine(columnHeaders);
foreach (var str in lines.Skip(1))
{
sw.WriteLine(str);
}
sw.Flush();
}
Replace the foreach (string s in lines) loop with a for loop over the line count, and rename only the 2nd column.
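One way this could look, as a rough sketch: split only the header line, count matching column names, and rename the one you want. This assumes a comma-delimited header with no quoted fields; `HeaderRenamer` and `RenameNthColumn` are names invented for the example.

```csharp
using System;

public static class HeaderRenamer
{
    // Renames the n-th column (1-based occurrence) whose name equals oldName.
    // Assumes a simple header: comma-delimited, no quoted fields.
    public static string RenameNthColumn(string headerLine, string oldName,
                                         string newName, int occurrence)
    {
        string[] parts = headerLine.Split(',');
        int seen = 0;
        for (int i = 0; i < parts.Length; i++)
        {
            // Count exact matches and stop at the requested occurrence.
            if (parts[i] == oldName && ++seen == occurrence)
            {
                parts[i] = newName;
                break;
            }
        }
        return string.Join(",", parts);
    }
}
```

You would call it on `lines[0]` only, for example `HeaderRenamer.RenameNthColumn(lines[0], "idn_prod", "idn_prod1", 2)`, and write the remaining lines back unchanged.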
I believe the only way to handle this properly is to crack the header line (first string that has column names) into individual parts, separated by commas or tabs or whatever, and run through the columns one at a time yourself.
Your loop would consider the first line from the file, use the Split function on the delimiter, and look for the column you're interested in:
bool headerSeen = false;
foreach (string s in lines)
{
if (!headerSeen)
{
// special: this is the header
string[] parts = s.Split('\t');
for (int i = 0; i < parts.Length; i++)
{
if (parts[i] == "idn_prod")
{
// only fix the *first* one seen
parts[i] = "idn_prod1";
break;
}
}
sw.WriteLine( string.Join("\t", parts));
headerSeen = true;
}
else
{
sw.WriteLine( s );
}
}
The only reason this is even remotely possible is that it's the header and not the individual lines; headers tend to be more predictable in format, and you worry less about quoting and fields that contain the delimiter, etc.
Trying this on the individual data lines will rarely work reliably: if your delimiter is a comma, what happens if an individual field contains a comma? Then you have to worry about quoting, and this enters all kinds of fun.
For doing any real CSV work in C#, it's really worth looking into a package that specializes in this, and I've been thrilled with CsvHelper from Josh Close. Highly recommended.
I have this problem: I would like to create a CSV file using C#. So I tried to develop this code:
public static void creaExcel(Oggetto obj)
{
string filePath = @"C:\Temp\test.csv";
string delimiter = ",";
string[][] output = new string[][]{
new string[]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "},
new string[]{"Col1 Row 2", "Col2 Row 2", "Col3 Row 2"}
};
int length = output.GetLength(0);
StringBuilder sb = new StringBuilder();
for (int index = 0; index < length; index++)
sb.AppendLine(string.Join(delimiter, output[index]));
File.WriteAllText(filePath, sb.ToString());
// open xls file
}
This code works, but I would like to put each value in its own cell. With this code, all the values ([]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "}) end up in a single row, in a single cell; instead, I would like each value to go into a separate cell.
Can you help me?
Best regards
The code is working fine because the result is:
TobRod Porosity,Batch code,Nu.,PAD,G.Po,L.PoD
Col1 Row 2,Col2 Row 2,Col3 Row 2
Can you confirm this?
Here is how it is displayed on my PC:
If you see all the values in a single cell on your machine, this means that there is a problem identifying the correct separator. In order to fix this, add this line: sep=, at the beginning of your CSV content, so the resulting content would be:
sep=,
TobRod Porosity,Batch code,Nu.,PAD,G.Po,L.PoD
Col1 Row 2,Col2 Row 2,Col3 Row 2
This way you can force certain devices (I know for sure that iPhones have an issue with this) to use the correct separator.
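A small sketch of how the `sep=` hint line could be written from C#. As far as I know the hint is a convention understood by Excel rather than part of the CSV format itself, so other readers may treat it as an ordinary data row. `CsvBuilder` and `Build` are hypothetical names, and the sketch does no quoting of values.

```csharp
using System;
using System.Text;

public static class CsvBuilder
{
    // Builds CSV text that starts with a "sep=" hint line, followed by
    // one line per row with the values joined by the same delimiter.
    public static string Build(string[][] rows, char delimiter)
    {
        var sb = new StringBuilder();
        sb.Append("sep=").Append(delimiter).Append("\r\n");
        foreach (string[] row in rows)
        {
            sb.Append(string.Join(delimiter.ToString(), row)).Append("\r\n");
        }
        return sb.ToString();
    }
}
```

The result can be passed straight to `File.WriteAllText(filePath, ...)` in place of the `StringBuilder` loop from the question.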
I would also suggest using " as a string qualifier. Example:
sep=,
"TobRod Porosity","Batch code","Nu.","PAD","G.Po","L.PoD"
"Col1 Row 2","Col2 Row 2","Col3 Row 2"
The created file is a (more or less) correct CSV (comma-separated values) file.
However, if you open that file with Excel and it puts all the values in one cell, Excel doesn't know that you want to split on the comma. You can teach it to, though: in Excel 2013, select the cell, go to the DATA tab, and click the "Text to Columns" button.
Edit: I have the feeling that you would like to use CSV to create Excel documents. That's not what CSV is made for. If you want to create real Excel sheets, have a look here: Create Excel (.XLS and .XLSX) file from C#
The problem is that you are selecting the whole array when you make the insertion.
Here you are accessing the outer array and getting either the first inner array or the second:
output[index]
If you want to insert each value of the chosen array, you just have to loop over the selected array as well:
output[index][anotherIndex]
For example,
output[0][0]
will return "TobRod Porosity" as the selected value.
I have fixed my error by writing this method:
public static void creaExcel(Oggetto obj)
{
try
{
string filePath = @"TOBROD_POROSITY_" + Utility.getData() + ".csv";
string delimiter = ";";
string[][] output = new string[][]{
new string[]{"TobRod Porosity", "Batch code", "Nu.","PAD","G.Po","L.PoD "}
};
int length = output.GetLength(0);
StringBuilder sb = new StringBuilder();
for (int index = 0; index < length; index++)
sb.Append(string.Join(delimiter, output[index]));
sb.AppendLine("");
// once the file header has been set, insert the values
if (obj != null && obj.listaMisure != null)
{
for (int i = 0; i < obj.listaMisure.Count(); i++)
{
ValoriMisure v = obj.listaMisure[i];
sb.AppendLine(obj.tobaccoPorosity
+ delimiter + obj.batchCode
+ delimiter + v.nu
+ delimiter + v.pad
+ delimiter + v.gPo
+ delimiter + v.lPod);
}
}
File.WriteAllText(filePath, sb.ToString());
// move the file to the destination path
File.Move(filePath, pathFolderDestination+"\\"+filePath);
}
catch (Exception e)
{
log.Error(e);
}
}
The important line is this one:
string delimiter = ";";
because if you use ";" as the delimiter, each value ends up in a separate cell of the CSV file.
I have a text file that contains multiple rows of text with no delimiters.
Here is the sample text from my text file.
000100000080020201000000005309970000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003F
00010001008002020100010000530997000014820000148200010000012C00001482000014820000148200010000012C000014820000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000000000000000000000000000000000000003F
Then I must divide every line according to this format.
"XXXXXXXX-XXXX-XXXXXXXXXXXXXX-XX-XXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXX-XX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX".Split('-');
The output looks like this.
00010000-0080-02020100000000-53-0997-00000000-00000000-0000-00000000-00000000-00000000-00000000-0000-00000000-00000000-00000000-00000000-0000-00000000-00000000-00000001-00000000-0000-00000000-00000000-00000000-00000000-0000-00000000-00000000-000000-00-00000000-00000000-0000-00000000-0000003F
00010001-0080-02020100010000-53-0997-00001482-00001482-0001-0000012C-00001482-00001482-00001482-0001-0000012C-00001482-00000000-00000000-0000-00000000-00000000-00000000-00000000-0000-00000000-00000000-00000000-00000000-0000-00000000-00000000-010100-00-00000000-00000000-0000-00000000-0000003F
I imported it into a multiline textbox.
Here is my code:
private void btn_input_Click(object sender, EventArgs e)
{
string content;
content = File.ReadAllText(txt_path.Text);
string[] patern = "XXXXXXXX-XXXX-XXXXXXXXXXXXXX-XX-XXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX-XXXXXX-XX-XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXX-XXXXXXXX".Split('-');
string mystring = content;
string regex = string.Empty;
string match = string.Empty;
for (int i = 0; i < patern.Length; i++)
{
regex += @"(\w{" + patern[i].Length + "})";
match += "$" + (i + 1).ToString() + "-";
}
match = match.Substring(0, match.Length - 1);
txt_textfile.Text = Regex.Replace(mystring, regex, match);
}
Then I want to save to my database the text that I split on '-', but I don't know how to do it. Is there something I can do to save it, either while importing it or after importing it? Please help. Thank you.
Your question is not very clear.
All you have done there is format your input string.
I'm not sure how you want to insert the values into the database.
I'm just guessing that you want to split the formatted text on '-' and insert the resulting values one by one.
In that case, what you have to do is split the text in the txt_textfile text box to get a string array,
then loop over the array and insert the values into the database.
If this is not the answer you were looking for, please comment here with exactly what you want to do.
Thanks :)
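For what it's worth, here is a rough sketch of getting the individual values as an array. Instead of re-splitting the formatted text on '-', it cuts each raw line directly using the field widths taken from the pattern string, which is a slightly different technique from the one described above. `FixedWidthSplitter` and `Split` are made-up names, and the database part is left out because the question doesn't say which database or API is used.

```csharp
using System;
using System.Linq;

public static class FixedWidthSplitter
{
    // Cuts one record into fields whose widths come from a pattern such as
    // "XXXXXXXX-XXXX-XX": each group of X's between dashes is one field.
    public static string[] Split(string line, string pattern)
    {
        int[] widths = pattern.Split('-').Select(p => p.Length).ToArray();
        if (line.Length != widths.Sum())
        {
            throw new ArgumentException("Line length does not match the pattern.");
        }
        var fields = new string[widths.Length];
        int offset = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            fields[i] = line.Substring(offset, widths[i]);
            offset += widths[i];
        }
        return fields;
    }
}
```

You could call this once per line from `File.ReadAllLines(txt_path.Text)` and then bind each array element to one parameter of a parameterized INSERT statement, one row per line.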