Cannot manipulate string text obtained by Interop Word

I fetched the headings using this code:
foreach (Paragraph paragraph in this.Application.ActiveDocument.Paragraphs)
{
Style style = paragraph.get_Style() as Style;
string styleName = style.NameLocal;
string text = paragraph.Range.Text;
if( styleName == "Heading 1" )
{
myList.Add(text);
}
}
Since the list holds strings and the fetched text is also a string, I should be able to perform any string operation on it, but Join, Concat, and other operations do not work as expected. Basically, it looks like a string but doesn't behave like one.

Finally solved it with Replace("\r", "").
When Paragraph.Range.Text is used to read text, it includes a trailing \r (the paragraph mark) at the end of the text. Simply remove it with paragraph.Range.Text.Replace("\r", "") when storing the text in a string.
Thank you, MethodMan, for guiding me to the solution.
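
A minimal sketch of that fix as a reusable helper, assuming the input strings come from Paragraph.Range.Text or Cell.Range.Text. The class and method names (WordText, StripTrailingMarks) are hypothetical and not part of the Interop API. Unlike Replace("\r", ""), it trims only the trailing marks, so any carriage returns inside the text are left alone.

```csharp
using System;

public static class WordText
{
    // Word ends paragraph text with '\r' and table-cell text with "\r\a".
    // Trim only those trailing marks and leave the rest of the text untouched.
    public static string StripTrailingMarks(string rangeText)
    {
        if (rangeText == null)
        {
            return string.Empty;
        }
        return rangeText.TrimEnd('\r', '\a');
    }
}
```

In the loop above, this would be called as myList.Add(WordText.StripTrailingMarks(paragraph.Range.Text)).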

Related

Passing a string gives a different outcome to passing a string variable

I tried finding an answer for this but couldn't.
I have this function that is supposed to create a formatted paragraph.
When I pass it an html string like "<b>Test</b>" I get the bold text in the pdf as expected.
However when I pass a string variable with the same value I don't get a formatted text but instead I just get the original string in the pdf.
private Paragraph CreateSimpleHtmlParagraph(string text)
{
//Our return object
Paragraph p = new Paragraph();
//ParseToList requires a TextReader (here a StringReader) instead of just text
using (StringReader sr = new StringReader(text))
{
//Parse and get a collection of elements
List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
foreach (IElement e in elements)
{
//Add those elements to the paragraph
p.Add(e);
}
}
//Return the paragraph
return p;
}

Thanks so much, guys. I checked the variable at runtime and it was HTML-encoded (e.g. &lt; instead of <). I had to call HttpUtility.HtmlDecode on the variable, and that worked out perfectly.
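
A small sketch of the decoding step, for illustration only. It swaps in WebUtility.HtmlDecode from System.Net instead of the HttpUtility.HtmlDecode call mentioned above, so that it compiles without a reference to System.Web; for this input both are expected to give the same result. The class and method names (HtmlInput, DecodeMarkup) are hypothetical.

```csharp
using System.Net;

public static class HtmlInput
{
    // Turns HTML-encoded markup such as "&lt;b&gt;Test&lt;/b&gt;" back into
    // real tags so that an HTML parser sees elements instead of literal text.
    public static string DecodeMarkup(string encoded)
    {
        return WebUtility.HtmlDecode(encoded ?? string.Empty);
    }
}
```

The decoded string would then be passed on, e.g. CreateSimpleHtmlParagraph(HtmlInput.DecodeMarkup(value)).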

Rich Text to Plain Text via C#?

I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
dt.Rows.Add(text);
}
}
EDIT: Here is what the text ("1. Introduction") looks like in the Word document:
This is what it looks like before being put into my datatable:
And this is what it looks like when put into the datatable:
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
using (RichTextBox rtb = new RichTextBox())
{
rtb.Rtf = rtf;
return rtb.Text;
}
}
When I run the program, it bombs with the following error:
The variable rtf, at this point, looks like this:
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
for (int row = 1; row <= tb.Rows.Count; row++)
{
var charsToTrim = new[] { '\r', '\a', ' ' };
var cell = tb.Cell(row, 1);
var listNumber = cell.Range.ListFormat.ListString;
var text = listNumber + " " + cell.Range.Text;
text = text.TrimEnd(charsToTrim);
dt.Rows.Add(text);
}
}

I don't know exactly what formatting you're trying to remove, but you could try something like:
text = new string(text.Where(c => !char.IsControl(c)).ToArray());
That should strip the non-printing characters out.

An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox. Like this:
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;

Why don't you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, @"[^\w\.@-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx

A totally different approach would be to look at the Open XML SDK.
This example should get you started.

HtmlAgilityPack parse text blocks

I am making a small web analysis tool and need to somehow extract all the text blocks on a given url that contain more than X amount of words.
The method I currently use is this:
public string getAllText(string _html)
{
string _allText = "";
try
{
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(_html);
var root = document.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
_allText = sb.ToString();
}
catch (Exception)
{
}
_allText = System.Web.HttpUtility.HtmlDecode(_allText);
return _allText;
}
The problem here is that I get all text returned, even if it's menu text, footer text with 3 words, etc.
I want to analyse the actual content on a page, so my idea is to parse only the text that could be content (i.e. text blocks with more than X words).
Any ideas how this could be achieved?

Well, a first approach could be a simple word-count analysis on each node.InnerText value using the string.Split function:
string[] words;
words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);
and append only text where words.Length is larger than 3.
Also see this question's answer for some more tricks for gathering raw text.
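
The word-count idea can be sketched as a small filter, assuming the candidate text blocks have already been collected (for example from node.InnerText). The names TextBlockFilter, CountWords, KeepLongBlocks and the minWords parameter are hypothetical, and the threshold is something to tune per site.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextBlockFilter
{
    // Counts whitespace-separated words; a null separator array means
    // "split on any whitespace".
    public static int CountWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Keeps only the blocks that contain at least minWords words.
    public static List<string> KeepLongBlocks(IEnumerable<string> blocks, int minWords)
    {
        return blocks
            .Where(b => !string.IsNullOrWhiteSpace(b) && CountWords(b) >= minWords)
            .Select(b => b.Trim())
            .ToList();
    }
}
```

Inside getAllText, the same check could guard the sb.AppendLine call so that short menu and footer fragments are skipped.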

Read a file and replace text after a certain word

I have a few files, for example:
FileBegin Finance Open 87547.25 Close 548484.54 EndDay 4 End
Another file example:
FileBegin Finance Open 344.34 Close -3434.34 EndDay 5 End
I need to read the text in the file and replace only the numeric value after the word Open leaving the rest of the text before and after the word Open intact. I have been using this code:
string fileToRead = "c:\\file.txt";
public void EditValue(string oldValue, string newValue, Control Item)
{
if (Item is TextBox)
{
string text = File.ReadAllText(fileToRead);
text = text.Replace(oldValue, newValue);
File.WriteAllText(activeSaveFile, text);
}
}
What would be the best way of going about replacing just the numeric value after the word open?

Using Regular Expressions:
Regex rgx = new Regex(@"Open [^\s]+");
// Keep the "Open " keyword; only the value after it is replaced.
string result = rgx.Replace(text, "Open " + newValue);
File.WriteAllText(activeSaveFile, result);
Using this approach, you can store the regex object outside the method so you avoid recompiling it each time. I'm guessing it won't have a significant performance impact compared to the file I/O in your case, but it is a good practice in other situations.
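
A minimal sketch of the stored-regex variant, assuming the value is the single non-whitespace token after "Open ". It uses a lookbehind so that only the value is matched and the keyword stays in place. The names OpenValueEditor and ReplaceOpenValue are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class OpenValueEditor
{
    // Built once and reused. The lookbehind matches the token that follows
    // "Open " without consuming the keyword itself.
    private static readonly Regex OpenValue = new Regex(@"(?<=\bOpen )\S+");

    public static string ReplaceOpenValue(string text, string newValue)
    {
        // A MatchEvaluator inserts newValue literally, so a '$' in it is not
        // interpreted as a substitution pattern. Only the first match is replaced.
        return OpenValue.Replace(text, _ => newValue, 1);
    }
}
```

With the text read from the file, the result of OpenValueEditor.ReplaceOpenValue(text, newValue) can be written back with File.WriteAllText as before.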

Split the row on spaces, e.g. text.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries), then take _splittedRow[3], replace it, and merge the row back together.

If I understand you, the line:
FileBegin Finance Open 344.34 Close -3434.34 EndDay 5 End
is the entire file? And you have been typing in "344.34" for the old value and "something" for the new value? And you'd like to just type the new value only?
You could say:
string fileToRead = "c:\\file.txt";
public void EditValue(string oldValue, string newValue, Control Item)
{
if (Item is TextBox)
{
string text = File.ReadAllText(fileToRead);
string[] words = text.Split(new char[] {' '}); // assuming space-delimited
words[3] = newValue; // replace the target value (the one after "Open")
text = "";
foreach (string w in words)
{
text += w + " "; // build our new string
}
File.WriteAllText(activeSaveFile, text.Trim()); // and write it back out
}
}
That's a lot of ifs, but I think this is what you mean. Also there are a lot of different ways to replace that one part of the string, I just thought this would give you the flexibility to do other things with a convenient array of words.

Simple text to HTML conversion

I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Environment.NewLine + Environment.NewLine, "</p><p>");
html.Replace(Environment.NewLine, "<br/>");
html.Append("</p>");
The above code doesn't work right, as in generating correct html, if there are more than 2 line breaks in a row. Having html like <br/></p><p> is not good; the <br/> can be removed.

I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
public static string TextToHtml(string text)
{
text = HttpUtility.HtmlEncode(text);
text = text.Replace("\r\n", "\r");
text = text.Replace("\n", "\r");
text = text.Replace("\r", "<br>\r\n");
text = text.Replace("  ", " &nbsp;");
return text;
}
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
return text;
}

Your other option is to take the text box contents and, instead of trying to handle line and paragraph breaks, just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>

Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.

How about throwing it in a <pre> tag. Isn't that what it's there for anyway?

I know this is an old post, but I've recently been in a similar problem using C# with MVC4, so thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach (var line in description.Split('\n'))
{
Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix @ before any variables or identifiers, because of the Razor syntax in the ASP.NET MVC framework. I've shown this with Console.Write, but you should be able to figure out how to implement this in your specific project based on this :)

Combining all previous plus considering titles and subtitles within the text comes up with this:
public static string ToHtml(this string text)
{
var sb = new StringBuilder();
var sr = new StringReader(text);
var str = sr.ReadLine();
while (str != null)
{
str = str.TrimEnd();
str = str.Replace("  ", " "); // collapse double spaces; strings are immutable, so assign the result
if (str.Length > 80)
{
sb.AppendLine($"<p>{str}</p>");
}
else if (str.Length > 0)
{
sb.AppendLine($"{str}<br/>");
}
str = sr.ReadLine();
}
return sb.ToString();
}
The snippet could be enhanced by defining rules for short strings.

I realize I'm 13 years late with this answer :)
but maybe someone else needs it.
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static readonly Regex _breakRegex = new("(\r?\n)+");
private static readonly Regex _paragraphBreakRegex = new("(?:\r?\n){2,}");
public static string ConvertTextToHtml(string description) {
// Two or more consecutive line breaks separate paragraphs
string[] descriptionParagraphs = _paragraphBreakRegex.Split(description.Trim());
if (descriptionParagraphs.Length > 0)
{
description = string.Empty;
foreach (string paragraph in descriptionParagraphs)
{
description += $"<p>{paragraph}</p>";
}
}
// Any remaining single line break becomes a <br/>
return _breakRegex.Replace(description, "<br/>");
}
