RichTextBox.Find doesn't work with Solr Highlights - c#

My application needs to be able to indicate where in the original document the highlights from Solr actually come from. For the time being, my project deals only with .txt files.
I'm using the highlights returned by Solr as string inputs to RichTextBox.Find. Once I have the starting index of the hit, I highlight the string using RichTextBox.Select and set the back color, text color and other properties.
PROBLEM : RichTextBox.Find is never returning a valid output (always -1), which means it's not finding my highlight text in the document.
I've tried removing the <em> and </em> tags, along with the \n sequences that appear in the highlight string but not in the actual text document, and it still doesn't work. Searching for the same string in MS Word or Notepad on the original file doesn't work either, even though the string appears identical to the text fragment in the file. Is there any other information on what changes I need to make to the string to make it searchable?
EDIT 1 :
I've tracked down the problem. Apparently, in certain cases, the highlight that Solr returns contains non-printable or junk characters that are not in the original document. I need a way to strip these reliably, based on clear criteria. My text contains a lot of valid special characters, so I cannot afford to have those removed by mistake!
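One possible approach to the cleaning step described in the edit is to filter by Unicode category rather than by a list of allowed characters, so that punctuation and symbols survive. The following is only a sketch under assumptions: the class name HighlightCleaner and the choice of categories to drop (control, format, private-use, unassigned, plus U+FFFD) are illustrative, not something confirmed by Solr's behavior.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Solr or WinForms.
public static class HighlightCleaner
{
    public static string Clean(string highlight)
    {
        // Remove the <em> and </em> markers Solr wraps around each hit.
        string withoutTags = Regex.Replace(highlight, "</?em>", string.Empty);

        var sb = new StringBuilder(withoutTags.Length);
        foreach (char c in withoutTags)
        {
            // Keep tabs; they are control characters but may be real content.
            if (c == '\t') { sb.Append(c); continue; }

            // Drop the replacement character left behind by a bad decode.
            if (c == '\uFFFD') continue;

            // Assumption: these categories are "junk" for search purposes.
            // Letters, digits, punctuation, symbols and spaces are untouched.
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            if (cat == UnicodeCategory.Control ||
                cat == UnicodeCategory.Format ||
                cat == UnicodeCategory.PrivateUse ||
                cat == UnicodeCategory.OtherNotAssigned)
            {
                continue;
            }

            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

The cleaned string could then be passed to RichTextBox.Find in place of the raw highlight. Note that this sketch also drops \r and \n (both are control characters), which matches the case where line breaks appear only in the highlight; if the document itself contains them, they would need different handling.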

Related

How can I handle a Crystal Report's Record Selection Formula appearing as a concatenated string when loaded in C#?

I have several Crystal Reports with Record Selection Formulas, and I'm loading them into a C# program with the help of some CrystalDecisions dll's. I have no trouble loading them or viewing their information.
However, when the reports are loaded, the multi-lined Record Selection Formulas are concatenated into a one-line string field so that it can be stored in the ReportDocument object. For regular formula fields, this is fine, as it appears to automatically include "\n" to break up the string into the multiple lines so it can make sense.
The problem seems to specifically be the Record Selection Formulas. As a simple example, here's a sample record selection formula I'd try to work with, and the result I get within C#:
Within Crystal Reports:
if (condition) then true
else false
Within C#:
if (condition) then trueelse false
If I try to use this formula, it gives the error:
A number, currency amount, boolean, date, time, date-time, or string is expected here.
Because obviously "trueelse" isn't valid syntax.
One thing I've tried is changing the formula in Crystal so that it is all on one line, with spaces instead of newlines. That does seem to work, but I have many reports, so I'd like to avoid this if there's another way to fix it.
The other thing I've tried is to ensure each line has extra whitespace at the end to help break up the lines when it gets loaded, but these seem to just get trimmed.

Word vsto get text of document with hidden characters

I'm developing a text analysis vsto add-in for Word.
Therefore I get the text of the active document like this:
Globals.ThisAddIn.Application.ActiveDocument.Content.Text
After that I analyze it. The analysis returns a list of positions that Word should comment (like character 3 - 6 and character 10 - 13).
The problem is that the comment on characters 3 to 6 seems to add a hidden character to the document, because every comment Word places after the first one lands one character too early.
Is there a way to fix that, or to get the text including the hidden characters?
I found TextRetrievalMode, but I cannot get it to work for this.
Basically, the answer is "No, you can't do it the way you propose."
Yes, Word does add "hidden characters" to the text flow that cannot be picked up using the object model. Trying to work with character index values is not going to work reliably. The reliable method is Word's built-in Find/Replace with wildcards. If RegEx is absolutely necessary, then some kind of Find/Replace within a character-index range (say, starting 5 characters before and ending 5 characters after the indices computed using RegEx) might be a way to double-check the result and pick up the correct Range.
Possibly, depending on what kind of analysis this is, it might be better to work with the closed file, leveraging the Office Open XML. That will not have the problem of "hidden characters" that Word uses for structural information. On the other hand, there's a lot of formatting information that breaks up text runs that needs to be contended with...

C# not finding space in string copied from Excel

Help! I have a list of records in Excel that I'm copying/pasting into an ASP.NET web page. From there, the C# code parses the records.
This code below works for one of the names, but not another. If, however, I copy/replace the empty space in Excel with a typed space or if I actually backspace and type the name into the webpage with the keyboard, it does work.
It's as if Excel has some odd ghost character in the file I was given for the space on this record. I've pasted in Notepad++ and showed all characters, and I don't see anything special here that's different among the records.
This one works and detects the spaces: Carolyn Bentivegna
This one does not: Allan D. Blake
if (fullName.IndexOf(" ") > -1)
Try the tab character:
if (fullName.IndexOf("\t") > -1)
Cells copied from Excel are separated by tab characters, and rows are separated by a carriage return and newline.
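When it is unclear which character is standing in for the space, it can help to check for any Unicode white space and to print the code points of the string. This is a minimal sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical diagnostic helper.
public static class WhitespaceProbe
{
    // True if the string contains any Unicode white-space character,
    // including tab (U+0009) and non-breaking space (U+00A0).
    public static bool HasAnyWhitespace(string s) => s.Any(char.IsWhiteSpace);

    // Lists each UTF-16 code unit as U+XXXX so look-alike spaces can be told apart.
    public static string DumpCodePoints(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
}
```

Calling WhitespaceProbe.DumpCodePoints(fullName) on the failing record should show whether the gap is U+0020, U+0009, U+00A0 or something else, and HasAnyWhitespace can replace the IndexOf(" ") test if any kind of white space should count.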

Why am I getting "�" characters?

I've written a quick-and-dirty utility to parse a text file, but in some cases it's writing out a "�" character. My utility reads from a .txt file which contains "records" in this format:
Biography
Title:George F. Kennan: An American Life
Author:John Lewis Gaddis
Kindle: B0054TVO1G
Hardcover: B007R93I1U
Paperback: 0143122150
Image link: <img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...and writes out lines from that to a CSV file such as:
Biography,"George F. Kennan: An American Life","John Lewis Gaddis",B0054TVO1G,B007R93I1U,0143122150,<img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...but in several cases, as mentioned, that weird character is appending itself to an author's name. In most cases where this is happening, it's what appears to be a space character in the .txt file. I'm trimming the author's name prior to writing it out to the CSV file, so it's obviously not being seen as a space, though.
When I save the text file with these characters, I get the message about non-unicode characters, etc.
What could be the cause of that? And better yet, how can I delete them with a search and replace operation? In Notepad, they are not found, so I have to delete them one-by-one.
Prior to being in the .txt file, this data was in an Open Office/.odt file, if that means anything to anyone.
BTW, I have no idea how that "stackoverflow" got into the href above; it's not in the original text I pasted in...
UPDATE
I am curious how that character got in my files. I sure didn't put it there (deliberately), any more than I added the "stackoverflow" to the URL above. Could it be that a call to Environment.Newline would add that?
Here was my process:
1) Copy and paste info from the interwebs into an Open Office/.odt file
2) Copy and paste that into a text (Notepad) file
3) Open that text file programmatically and loop through it, writing to a new "csv"/.txt file.
UPDATE 2
Silly me - all I had to do was save the file (which wouldn't save those weird characters), then open it again. IOW, when I opened it today (at home, after work) those were gone.
UPDATE 3
I wrote too soon - it replaced the weird character with a question mark (a "normal" one, not a stylized one).
They are almost certainly non-breaking spaces, U+00A0 (although there are other fixed-width space characters which are also possible.) These won't be trimmed as spaces, but will be rendered as spaces if the encoding of the file matches the encoding of the output device.
My guess is that your text file is in CP-1252 (i.e., Windows default one-byte coding) but your output is being rendered as though it were UTF-8.
Normally you would type these characters as AltGr+Space. You might try that with Notepad, but no guarantees.
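The encoding mismatch guessed at above can be reproduced in a few lines. This sketch decodes the same bytes two ways and then normalizes the result. It uses ISO-8859-1 as a stand-in for CP-1252, on the assumption that this is acceptable here because both map byte 0xA0 to U+00A0; the class and method names are illustrative.

```csharp
using System;
using System.Text;

// Hypothetical demo of where the "�" can come from.
public static class NbspDemo
{
    // Byte 0xA0 on its own is not valid UTF-8, so it decodes to U+FFFD.
    public static string DecodeUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    // In ISO-8859-1 (and in CP-1252) byte 0xA0 is a non-breaking space, U+00A0.
    public static string DecodeLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);

    // Turn non-breaking spaces into ordinary spaces, then trim.
    public static string NormalizeSpaces(string s) =>
        s.Replace('\u00A0', ' ').Trim();
}
```

If the input file really is in a one-byte encoding, reading it with that encoding (for example by passing an Encoding to File.ReadAllLines) and then calling something like NormalizeSpaces on each field should keep the "�" out of the CSV.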

Outputting Programmatically to MSword; sensing end of line

I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word's letters to stay on the same line rather than wrapping, which is the default behavior. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know it has been wrapped, and can remove the last character or reformat accordingly
I don't know how this behaves today, but I wrote something against the MSWord API once and ran into a somewhat odd fact: you can't actually find that out. In MSWord, text in a document is always in paragraphs.
If you add text to your document, it doesn't just go onto a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't check this again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that a non-breaking space is what you need.
Here's more details.
