Word VSTO: get text of document with hidden characters - C#

I'm developing a text-analysis VSTO add-in for Word.
To do that, I get the text of the active document like this:
Globals.ThisAddIn.Application.ActiveDocument.Content.Text
After that I analyze it. The analysis returns a list of positions that Word should comment on (like characters 3-6 and characters 10-13).
The problem is that the comment from 3 to 6 seems to add a (hidden) character to the document, because every comment Word inserts after the first one is placed one character too early.
Is there a way to fix that, or to get the text including the hidden characters?
I found TextRetrievalMode, but I cannot get it to work.

Basically, the answer is "No, you can't do it the way you propose."
Yes, Word does add "hidden characters" to the text flow that cannot be picked up using the object model. Trying to work with character index values is not going to work reliably. The reliable method is Word's built-in Find/Replace with wildcards. If RegEx is absolutely necessary, then some kind of Find/Replace within a character-index range (say, starting 5 characters before and ending 5 characters after the indices computed using RegEx) might be a way to double-check the result and pick up the correct Range.
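For illustration, a minimal sketch of the wildcard Find approach (assuming `doc` is the Document; the search pattern is a placeholder, not from the question):

using Word = Microsoft.Office.Interop.Word;

// Sketch: let Find locate the text instead of trusting character indices.
// After each successful Execute, `range` is redefined to the hit, so any
// hidden structural characters never shift the position.
Word.Range range = doc.Content;
range.Find.ClearFormatting();
range.Find.Text = "<cat>";                    // hypothetical wildcard pattern
range.Find.MatchWildcards = true;
range.Find.Wrap = Word.WdFindWrap.wdFindStop;
while (range.Find.Execute())
{
    doc.Comments.Add(range, "Flagged by the analysis");
}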
Possibly, depending on what kind of analysis this is, it might be better to work with the closed file, leveraging Office Open XML. That will not have the problem of "hidden characters" that Word uses for structural information. On the other hand, there's a lot of formatting information that breaks up text runs, which needs to be contended with...
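For instance, a rough sketch with the Open XML SDK (the file path is a placeholder):

using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Sketch: read the plain text of a closed .docx; it carries none of Word's
// hidden structural characters, but formatting can split words across runs,
// so take each paragraph's InnerText rather than individual runs.
using (var docx = WordprocessingDocument.Open(@"C:\docs\sample.docx", false))
{
    foreach (var para in docx.MainDocumentPart.Document.Body.Elements<Paragraph>())
    {
        Console.WriteLine(para.InnerText); // InnerText joins the runs for us
    }
}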

Related

Synchronously read file from wwwroot in Blazor WebAssembly

So I've been able to read files asynchronously without issue.
I'm making a simple word game (hangman), and I'm reading the words from some text files in wwwroot.
On the new-game screen, I have the user pick the length of the word to guess and the set of words they want to use (e.g. regular words, Halloween-themed words, Christmas-themed words, etc.).
The problem: when the user enters a number for the word length, I validate that a word of that length exists; if there isn't one, I set the number to the closest available word length (e.g. if they entered 999 while using the regular word list, it would be set to 18, because the longest word in that list is "telecommunications"). Basically, I need to read the file right when the user changes the word list (at least with how I'm currently doing things) so that I can correctly validate if/when the user changes the word-length field.
I've tried doing something like:
public static string ReadFile(string localUrl) {
    return Http.GetStringAsync(localUrl).Result;
}
Unfortunately, Blazor doesn't like this and throws a runtime exception when I try it.
I'm currently binding my input fields to properties (like <input @bind="MyProperty">) because it seemed like an easy way to validate them, but properties can't be asynchronous. I had the idea of using @onchange on my word-list combobox to call an asynchronous method that updates the word list, but this feels a lot messier, and I feel like I'm probably going about this the wrong way.
Should I stick with my solution of using @onchange to call an asynchronous method to update my list of words, or should I be doing things differently? Maybe load all my word lists at the beginning (it'd only be a few KB)? Or is there a better way to synchronously load files?
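A minimal sketch of that @onchange approach (the option values, file layout, and names are assumptions): the list is loaded asynchronously when the combobox changes, and the length validation can then stay synchronous against the loaded words.

@inject HttpClient Http

<select @onchange="OnWordListChanged">
    <option value="regular">Regular</option>
    <option value="halloween">Halloween</option>
</select>

@code {
    private List<string> words = new List<string>();

    // Async event handlers are fine in Blazor; the component re-renders
    // automatically once the returned Task completes.
    private async Task OnWordListChanged(ChangeEventArgs e)
    {
        var text = await Http.GetStringAsync($"words/{e.Value}.txt"); // hypothetical layout
        words = text.Split('\n', StringSplitOptions.RemoveEmptyEntries).ToList();
    }
}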

.NET Regular Expression (Perl-like) for detecting text that was pasted twice in a row

I've got a ton of JSON files that, due to a UI bug in the program that made them, often have text that was accidentally pasted twice in a row (no space separating the copies).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, a batch text-processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is an unusual problem and I wasn't able to find anything resembling it on search engines to even start to base a solution on.
Any help would be appreciated.
One can collect text along the string (.+ style), followed by a lookahead check for what has been captured up to that point, i.e. a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped up even just on double leTTers, so it needs at least a little more. For example, our pattern can require the repeated text to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';

my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture)?
        say $1;
    }
}
This matches at least two words: a word (\w+), then non-word characters (\W+), then another word (\w+), then anything (.+). That'll still catch some legitimate data, but it's a start that can be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions would then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: this quest to find repeated text is much too generic as it stands, and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word, legitimate doublings like "that that" get flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, based on what we know about the data.
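Since the question asks for .NET, here is a rough C# equivalent of the Perl sketch above (same pattern, same caveats):

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string[] lines =
        {
            "It just wasn't able just wasn't able no matter how hard it tried.",
            "This has no repetitions.",
            @"{FolderLoc = ""C:\testC:\test""}",
        };

        // At least two words, then anything, immediately repeated.
        var reRep = new Regex(@"(\w+\W+\w+.+)(?=\1)");

        foreach (var line in lines)
        {
            Match m = reRep.Match(line);
            if (m.Success)
                Console.WriteLine(m.Groups[1].Value); // the repeated chunk
        }
    }
}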

Outputting programmatically to MSWord; sensing end of line

I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need the words to stay on the same line, without the wrapping that is the default behavior. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range:
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know the text has wrapped, and you can remove the last character or reformat accordingly.
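A sketch of that loop (assuming `doc`, `range`, and the string `word` are already set up; interop behavior can be finicky, so treat this as a starting point rather than a finished implementation):

int lines = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
foreach (char c in word)
{
    range.InsertAfter(c.ToString());
    if (range.ComputeStatistics(Word.WdStatistic.wdStatisticLines) > lines)
    {
        // The last character caused a wrap: back it out and handle the
        // break explicitly (new line, resize, and so on).
        doc.Range(range.End - 1, range.End).Delete();
        break;
    }
}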
I don't know how this behaves today, but I once wrote something against the MSWord API and ran into a somewhat odd fact: you can't actually find that out. In MSWord, text in a document always lives in paragraphs.
If you put text into a document, it doesn't simply end up on a page; the page will at least contain a paragraph holding the text you wrote into it.
Unfortunately I can't verify this again, because I don't have an MS Word license these days.
Give it a try and look at the problem again from this angle.
Hope this helps; if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume a non-breaking space is what you need.
Here are more details.
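For example, a sketch (the letters and `range` are placeholders): join the isolated letters with U+00A0 rather than a regular space, so Word keeps the group on one line.

// Non-breaking spaces (U+00A0) between the isolated letters prevent Word
// from wrapping inside the group.
string[] letters = { "\u0627", "\u0628", "\u062A" }; // hypothetical isolated Arabic letters
string oneLine = string.Join("\u00A0", letters);
range.InsertAfter(oneLine);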

Counting occurrences of 1-word, 2-word, and 3-word phrases in a webpage in C#/.NET

I'm going to write a program that takes a URL and counts the occurrences of EVERY single 1-word, 2-word, and 3-word phrase in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) strip HTML tags
2) make everything lowercase
3) split the text on spaces and put the words into an array
4) iterate over the words; for each position i, put the 1-, 2-, and 3-word phrases starting at word[i] into a hashtable. Every time you hit an existing key (a "collision"), increase the count for that word or phrase. (A sketch of steps 3) and 4) follows below.)
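Here's that sketch of steps 3) and 4), assuming the HTML has already been stripped to plain text:

using System;
using System.Collections.Generic;

static class PhraseCounter
{
    // Count every 1-, 2-, and 3-word phrase in the text.
    public static Dictionary<string, int> CountPhrases(string text, int maxWords = 3)
    {
        var counts = new Dictionary<string, int>();
        var words = text.ToLowerInvariant()
                        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            string phrase = null;
            for (int n = 0; n < maxWords && i + n < words.Length; n++)
            {
                phrase = phrase == null ? words[i + n] : phrase + " " + words[i + n];
                counts.TryGetValue(phrase, out int seen); // 0 if the phrase is new
                counts[phrase] = seen + 1;                // a "collision" bumps the count
            }
        }
        return counts;
    }
}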
My questions are:
1) Can anyone provide any more efficient solutions in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a DOM parser and parse out all the inner text.
Depending on your case, you might be oversimplifying the problem, and/or you may end up putting a lot of effort into implementing functionality that already exists in some libraries. So this will not be a direct answer, but a suggestion on what path to take in tackling this problem.
The process you want to implement is called information retrieval. It is very broad and complex, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of consecutive letters or words).
Let me show you some additional problems you should think about ahead of time:
is the capitalization of letters in a word important?
is the period the only sign you want to use to mark the end of a sentence?
do you want to exclude stop words? Stop words are words you don't want to include in a phrase, like 'a', 'the', 'I', 'my', and so on.
do you want to stem words? Stemming converts words from their original form to their root form, e.g. plural to singular: basketballs -> basketball
And for extracting pure text from HTML:
extract only the text shown on the page?
extract hints as well? (like those shown when hovering the mouse over a picture)
any other non-visible text (meta tags and so on)?
There are libraries that perform searching and that extract information from raw material. "Raw material" means you have to process the document (HTML, DOC, PDF, image, ...) and turn it into text in order for a search engine to index it (extract phrases, for instance). Once a document is indexed, it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers, and filters.
I am not sure, but I believe there are libraries for extracting text from HTML as well.
Basically, your approach may work in some simpler scenarios where a not-so-small error level is acceptable. I recently gained an interest in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on Stack Overflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Lucene is the Java version; Lucene.NET is its port to .NET) than on Lucene.NET. So if you don't find an answer for Lucene.NET immediately, search the Lucene discussions.
To answer your question #2, with a WinForms WebBrowser control:
HtmlDocument doc = webBrowser1.Document;
string text = doc.Body.InnerText;
If you want to make it more efficient, use a suffix trie (you may have to write your own):
http://en.wikipedia.org/wiki/Suffix_trie
A suffix trie basically makes searching for a substring depend on the length of the search string instead of the length of the text. It's the sort of thing search engines use.
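A toy sketch of the idea (naive O(n²) construction, for illustration only):

using System.Collections.Generic;

// Insert every suffix of the text into a trie; a lookup then costs O(m)
// for a pattern of length m, independent of the text length.
class SuffixTrie
{
    private readonly Dictionary<char, SuffixTrie> children = new Dictionary<char, SuffixTrie>();

    public static SuffixTrie Build(string text)
    {
        var root = new SuffixTrie();
        for (int i = 0; i < text.Length; i++)
        {
            var node = root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.children.TryGetValue(text[j], out var next))
                    node.children[text[j]] = next = new SuffixTrie();
                node = next;
            }
        }
        return root;
    }

    public bool Contains(string pattern)
    {
        var node = this;
        foreach (char c in pattern)
            if (!node.children.TryGetValue(c, out node))
                return false;
        return true;
    }
}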

RichTextBox.Find doesn't work with Solr Highlights

My application needs to be able to indicate where in the original document the highlights returned by Solr actually come from. For the time being, my project deals only with .txt files.
I'm using the highlights returned by Solr as string inputs to the RichTextBox.Find function. Once I have the starting point of the hit, I highlight the string using the RichTextBox.Select function and set the back color, font color, and other properties.
PROBLEM: RichTextBox.Find never returns a valid output (always -1), which means it's not finding my highlight text in the document.
I've tried removing the <em> and </em> tags, along with the \n characters that are in the highlight string but won't be in the actual text document, but it still doesn't work. Searching for the same string in MS Word or Notepad on the original file doesn't work either, even though the string appears identical to the text fragment in the file. Is there any other information I can get on the changes I need to make to the string to make it searchable?
EDIT 1:
I've tracked down the problem. Apparently, in certain cases the highlight that Solr returns itself contains some non-printable or junk characters not found in the original document. I need a way to reliably clean these based on some criteria. My text contains a lot of valid special characters, so I cannot afford to have those removed by mistake!
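One hedged approach: strip only Unicode control and format characters (the usual invisible junk), so legitimate printable special characters survive.

using System.Globalization;
using System.Linq;

static class HighlightCleaner
{
    // Remove control and format characters (zero-width spaces, BOMs, bidi
    // marks, ...) while leaving printable special characters intact.
    public static string Clean(string s)
    {
        return new string(s.Where(c =>
        {
            var cat = CharUnicodeInfo.GetUnicodeCategory(c);
            return cat != UnicodeCategory.Control && cat != UnicodeCategory.Format;
        }).ToArray());
    }
}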
