Passing a string gives a different outcome to passing a string variable - c#

I tried finding an answer for this but couldn't find one.
I have this function that is supposed to create a formatted paragraph.
When I pass it an html string like "<b>Test</b>" I get the bold text in the pdf as expected.
However when I pass a string variable with the same value I don't get a formatted text but instead I just get the original string in the pdf.
private Paragraph CreateSimpleHtmlParagraph(string text)
{
    //Our return object
    Paragraph p = new Paragraph();
    //ParseToList requires a TextReader (here a StringReader) instead of just text
    using (StringReader sr = new StringReader(text))
    {
        //Parse and get a collection of elements
        List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
        foreach (IElement e in elements)
        {
            //Add those elements to the paragraph
            p.Add(e);
        }
    }
    //Return the paragraph
    return p;
}

Thanks so much guys. I checked the variable at runtime and it was HTML-encoded (e.g. &lt; instead of <). I had to call HttpUtility.HtmlDecode on the variable, and that worked out perfectly.
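A minimal sketch of why the literal and the variable behaved differently, assuming the variable held entity-encoded markup as described above. It uses System.Net.WebUtility.HtmlDecode rather than HttpUtility.HtmlDecode so that it runs without a System.Web reference; the class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class HtmlDecodeDemo
{
    // Returns the markup with HTML entities such as &lt; and &gt; decoded.
    public static string DecodeMarkup(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }

    public static void Main()
    {
        string fromVariable = "&lt;b&gt;Test&lt;/b&gt;"; // what the variable held at runtime
        string literal = "<b>Test</b>";                  // what the hard-coded call passed

        Console.WriteLine(fromVariable == literal);               // False
        Console.WriteLine(DecodeMarkup(fromVariable) == literal); // True
    }
}
```

Decoding once, just before the text is handed to the HTML parser, keeps the rest of the code working with the same string the literal call would have passed.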

Related

Cannot manipulate string text obtained by Interop Word

I fetched the headings using this code
foreach (Paragraph paragraph in this.Application.ActiveDocument.Paragraphs)
{
    Style style = paragraph.get_Style() as Style;
    string styleName = style.NameLocal;
    string text = paragraph.Range.Text;
    if (styleName == "Heading 1")
    {
        myList.Add(text);
    }
}
The list holds strings and the fetched text is also a string, so I should be able to perform any string operation on it, but Join, Concat and other operations don't behave as expected. Basically it looks like a string but doesn't function like one.
Finally solved it with Replace("\r", "").
When paragraph.Range.Text is used to read text, it adds \r (a carriage return) at the end of the text. Simply remove it with paragraph.Range.Text.Replace("\r", "") when storing it in a string.
Thank you MethodMan for guiding me to the solution.
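A small sketch of the effect described above, without Word Interop: the sample strings are hypothetical stand-ins for what paragraph.Range.Text returns, each ending in the \r paragraph mark.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphMarkDemo
{
    // Removes the carriage return that marks the end of a paragraph.
    public static string StripParagraphMark(string rangeText)
    {
        return rangeText.Replace("\r", "");
    }

    public static void Main()
    {
        // Hypothetical heading texts as they might come back from Range.Text.
        var raw = new List<string> { "Introduction\r", "Summary\r" };
        var cleaned = raw.ConvertAll(s => StripParagraphMark(s));

        Console.WriteLine(raw[0].Length);              // 13 (12 letters plus \r)
        Console.WriteLine(cleaned[0].Length);          // 12
        Console.WriteLine(string.Join(", ", cleaned)); // Introduction, Summary
    }
}
```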

Rich Text to Plain Text via C#?

I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).
Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?
The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
    for (int row = 1; row <= tb.Rows.Count; row++)
    {
        var cell = tb.Cell(row, 1);
        var listNumber = cell.Range.ListFormat.ListString;
        var text = listNumber + " " + cell.Range.Text;
        dt.Rows.Add(text);
    }
}
EDIT: Here is what the text ("1. Introduction") looks like in the Word document:
This is what it looks like before being put into my datatable:
And this is what it looks like when put into the datatable:
So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).
EDIT: Here is the code I'm trying to use. I created a new method to convert the string:
private string ConvertToText(string rtf)
{
    using (RichTextBox rtb = new RichTextBox())
    {
        rtb.Rtf = rtf;
        return rtb.Text;
    }
}
When I run the program, it bombs with the following error:
The variable rtf, at this point, looks like this:
RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.
// Loop through each table in the document,
// grab only text from cells in the first column
// in each table.
foreach (Table tb in docs.Tables)
{
    for (int row = 1; row <= tb.Rows.Count; row++)
    {
        var charsToTrim = new[] { '\r', '\a', ' ' };
        var cell = tb.Cell(row, 1);
        var listNumber = cell.Range.ListFormat.ListString;
        var text = listNumber + " " + cell.Range.Text;
        text = text.TrimEnd(charsToTrim);
        dt.Rows.Add(text);
    }
}
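I don't know exactly what formatting you're trying to remove, but you could try something like:
text = new string(text.Where(c => !Char.IsControl(c)).ToArray());
That should strip the non-printing characters out. (Calling ToString() directly on the result of Where(...) would return the iterator's type name rather than the filtered text.)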
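As a self-contained sketch of that filtering idea, the helper below (a hypothetical name) drops every character for which char.IsControl returns true. The sample input is an assumption modelled on the "1. Introduction" cell text with the trailing \r and \a mentioned in the question.

```csharp
using System;
using System.Linq;

public static class ControlCharDemo
{
    // Keeps only the characters that are not control characters.
    public static string StripControlChars(string text)
    {
        return new string(text.Where(c => !char.IsControl(c)).ToArray());
    }

    public static void Main()
    {
        string cellText = "1. Introduction\r\a"; // assumed shape of a Word cell's text
        Console.WriteLine(StripControlChars(cellText).Length); // 15
    }
}
```

Unlike TrimEnd, this also removes control characters in the middle of the text, so line breaks inside a cell disappear as well; whether that is wanted depends on the data.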
An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox, like this:
//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;
Why don't you give this a try:
using System;
using System.Text.RegularExpressions;
public class Example
{
    static string CleanInput(string strIn)
    {
        // Replace invalid characters with empty strings.
        try {
            return Regex.Replace(strIn, @"[^\w\.@-]", "",
                                 RegexOptions.None, TimeSpan.FromSeconds(1.5));
        }
        // If we timeout when replacing invalid characters,
        // we should return Empty.
        catch (RegexMatchTimeoutException) {
            return String.Empty;
        }
    }
}
Here's a link for it as well.
http://msdn.microsoft.com/en-us/library/844skk0h.aspx
A totally different approach would be to look at the Open XML SDK.
This example should get you started.

HtmlAgilityPack parse text blocks

I am making a small web analysis tool and need to somehow extract all the text blocks on a given url that contain more than X amount of words.
The method I currently use is this:
public string getAllText(string _html)
{
    string _allText = "";
    try
    {
        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(_html);
        var root = document.DocumentNode;
        var sb = new StringBuilder();
        foreach (var node in root.DescendantNodesAndSelf())
        {
            if (!node.HasChildNodes)
            {
                string text = node.InnerText;
                if (!string.IsNullOrEmpty(text))
                    sb.AppendLine(text.Trim());
            }
        }
        _allText = sb.ToString();
    }
    catch (Exception)
    {
    }
    _allText = System.Web.HttpUtility.HtmlDecode(_allText);
    return _allText;
}
The problem here is that I get all text returned, even if it's menu text, footer text with 3 words, etc.
I want to analyse the actual content on a page, so my idea is to only parse the text that could be content (i.e. text blocks with more than X words).
Any ideas how this could be achieved?
Well, a first approach can be a simple word-count analysis on each node.InnerText value using the string.Split function:
string[] words;
words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);
and append only the text where words.Length is larger than 3 (or whatever threshold X you choose).
Also see this question's answer for some more tricks on gathering raw text.
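The word-count filter could be sketched roughly as follows. It works on plain strings so that it runs without HtmlAgilityPack; in the method above, each string would be a node's trimmed InnerText. The class and method names and the sample blocks are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TextBlockFilter
{
    // Counts whitespace-separated words; a null separator splits on any whitespace.
    public static int CountWords(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Keeps only the blocks that contain more than minWords words.
    public static List<string> KeepLongBlocks(IEnumerable<string> blocks, int minWords)
    {
        return blocks.Where(b => CountWords(b) > minWords).ToList();
    }

    public static void Main()
    {
        var blocks = new[]
        {
            "Home  About   Contact",                            // menu-like: 3 words
            "This paragraph clearly has more than three words." // content-like: 8 words
        };

        foreach (string block in KeepLongBlocks(blocks, 3))
            Console.WriteLine(block); // prints only the second block
    }
}
```

A pure word count is a crude signal: long menus or footers can still pass, so the threshold usually needs tuning per site.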

Loading an XML file when the tags are written in Greek doesn't work, why?

When I load XML files with English tags everything works fine but when I try to load an XML file with tags written in the Greek Language nothing works, why is this happening?
Do I have to change the encoding somewhere in the code?
This is the code I use:
XmlDocument xdoc = new XmlDocument();
xdoc.Load(filename);
XmlNode root = xdoc.DocumentElement;
if (root.HasChildNodes)
{
    for (int i = 0; i < root.ChildNodes.Count; i++)
    {
        richTextBox1.AppendText(root.ChildNodes[i].InnerXml + "\n");
    }
}
I downloaded your file and deserialized/displayed it successfully.
public class ΦΑΡΜΑΚΑ
{
    public string A;
    public string ΦΑΡΜ_ΑΓΩΓΗ;
    public string ΧΟΡΗΓΗΣΗ;
    public string ΛΗΞΗΣ;
    public string ΑMKA;
}
XmlSerializer xml = new XmlSerializer(typeof(ΦΑΡΜΑΚΑ[]), new XmlRootAttribute("dataroot"));
ΦΑΡΜΑΚΑ[] array = (ΦΑΡΜΑΚΑ[])xml.Deserialize(File.Open(@"D:\Downloads\bio3.xml", FileMode.Open));
richTextBox1.Text = String.Join(Environment.NewLine, array.Select(x => x.ΦΑΡΜ_ΑΓΩΓΗ));
Make sure your rich text box has its Multiline property set to true. The default is true, but you may have changed it. Also, instead of \n use Environment.NewLine.
Also, .InnerText will get you the value without the tags; InnerXml gives you the markup as well.
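To illustrate the InnerText/InnerXml difference with Greek element names, here is a small sketch that loads an inline XML string instead of the original file. The element names are borrowed from the class above; the document shape and the content "value" are placeholders, not the real data.

```csharp
using System;
using System.Xml;

public static class GreekXmlDemo
{
    // Loads an XML string and returns the first child of the root element.
    public static XmlNode FirstRecord(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.ChildNodes[0];
    }

    public static void Main()
    {
        // Placeholder document; only the tag names come from the answer above.
        string xml = "<dataroot><ΦΑΡΜΑΚΑ><ΧΟΡΗΓΗΣΗ>value</ΧΟΡΗΓΗΣΗ></ΦΑΡΜΑΚΑ></dataroot>";
        XmlNode record = FirstRecord(xml);

        Console.WriteLine(record.InnerXml);  // <ΧΟΡΗΓΗΣΗ>value</ΧΟΡΗΓΗΣΗ>
        Console.WriteLine(record.InnerText); // value
    }
}
```

Greek letters are valid in XML element names, so a loading failure is more likely to come from the file's encoding not matching its XML declaration than from the tags themselves.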

How to read dicom tag value using openDicom.net in C#?

I'm reading dicom tags using openDicom.net like this:
string tag = "";
string description = "";
string val_rep = "";
foreach (DataElement elementy in sq)
{
    tag = elementy.Tag.ToString();
    description = elementy.VR.Tag.GetDictionaryEntry().Description;
    val_rep = elementy.VR.ToString();
}
How can I read dicom tag values?
The Value.ToString() method isn't implemented. Implement your own override in Value.cs and you will get a value for "Value".
For example (only strings and numeric values):
public override string ToString()
{
    return valueList[0].ToString();
}
I'm assuming that sq is a Sequence...
I've not worked with openDicom, but I'm pretty sure what you're doing there isn't going to yield the results you want.
You have a single tag, description, and val_rep variable, but you're filling them using a foreach, meaning the last DataElement in the Sequence will be the only values you retrieve. You would achieve the same effect by using:
string tag = sq[sq.Count - 1].Tag.ToString();
string description = sq[sq.Count - 1].VR.Tag.GetDictionaryEntry().Description;
string val_rep = sq[sq.Count - 1].VR.ToString();
Thus retrieving the last set of values from the Sequence. I believe you'll find that if you step through the foreach as it executes, it will be loading all the different DataElements contained in your DICOM file.
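The overwrite-versus-collect point can be shown without openDicom at all. In this sketch the items are placeholder strings standing in for the values read from each DataElement; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LoopOverwriteDemo
{
    // Mirrors the original loop: one variable, reassigned on every iteration.
    public static string LastOnly(IEnumerable<string> items)
    {
        string current = "";
        foreach (string item in items)
            current = item;
        return current;
    }

    // Collects every item instead of overwriting the previous one.
    public static List<string> CollectAll(IEnumerable<string> items)
    {
        var all = new List<string>();
        foreach (string item in items)
            all.Add(item);
        return all;
    }

    public static void Main()
    {
        var tags = new[] { "tag-1", "tag-2", "tag-3" }; // placeholder values

        Console.WriteLine(LastOnly(tags));         // tag-3
        Console.WriteLine(CollectAll(tags).Count); // 3
    }
}
```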
Feel free to leave a comment or add more information to your original post if I'm way off base here.
The tag value is retrieved as an array of generic object from the 'Value' member in 'DataElement' using the 'ToArray()' overriding method in the 'Value' class.
cniDicomTagList = new List<CNIDicomTag>();
foreach (DataElement element in sq)
{
    string tag = element.Tag.ToString();
    string description = element.VR.Tag.GetDictionaryEntry().Description;
    object[] valueArr = element.Value.ToArray();
    cniDicomTagList.Add(new CNIDicomTag(tag, description, valueArr));
}
