Outputting Programmatically to MSword; sensing end of line - c#

I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word to stay on a single line, but by default Word wraps the text wherever it likes. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.

Add each character to your range and then check the number of lines in the range:
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know the text has wrapped, and you can remove the last character or reformat accordingly.
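The loop described above can be sketched independently of Word. This is a minimal Python sketch, not Interop code: `count_lines` is a hypothetical callback standing in for a call such as `range.ComputeStatistics(Word.WdStatistic.wdStatisticLines)`, and `fake_count_lines` is a made-up stand-in that pretends every line holds exactly 10 characters.

```python
def chars_that_fit(text, count_lines):
    """Return the longest prefix of text that does not add a new line.

    count_lines is a hypothetical callback: given the text written so far,
    it reports how many lines that text occupies.
    """
    baseline = count_lines("")
    for end in range(1, len(text) + 1):
        if count_lines(text[:end]) > baseline:
            # The character at index end - 1 caused the wrap, so drop it.
            return text[:end - 1]
    return text


def fake_count_lines(text, width=10):
    """Stand-in for Word: assumes a fixed number of characters per line."""
    return max(1, -(-len(text) // width))  # ceiling division, at least 1 line


print(chars_that_fit("abcdefghijklmno", fake_count_lines))
```

In a real document the callback would have to write to the range and query Word each time, so checking after every character may be slow for long text.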

I don't know how this behaves today, but I once wrote something against the MSWord API and ran into a somewhat odd fact: you can't actually find that out directly. In MSWord, text in a document is always in paragraphs.
If you input text into your document, it doesn't just sit on a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't check this again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski

I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that a non-breaking space is what you need.
Here's more details.

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
You can capture text along the string (.+ style) and follow it with a lookahead that checks for what has been captured up to that point, which would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/;  # at least two words, and then some
for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with a single word, a phrase like "that that" gets flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
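As one possible specialization, here is a minimal Python sketch (not the Perl program above) that only touches a quoted value consisting of the same text twice in a row, as in the question's example. The restriction to double-quoted values and the minimum length of three characters are assumptions about the data, and `collapse_doubled_values` is a name made up for illustration.

```python
import re

# Assumption: the accidental duplicate fills an entire double-quoted value.
# [^"]{3,}? keeps the match inside one value and skips very short repeats.
DOUBLED_VALUE = re.compile(r'"([^"]{3,}?)\1"')


def collapse_doubled_values(line):
    """Replace "XX" with "X" when a quoted value is the same text twice."""
    return DOUBLED_VALUE.sub(lambda m: '"' + m.group(1) + '"', line)


print(collapse_doubled_values(r'{FolderLoc = "C:\testC:\test"}'))
```

A value that legitimately consists of a doubled string (for example "abcabc") would still be collapsed, so the results are worth reviewing before overwriting any files.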

Word vsto get text of document with hidden characters

I'm developing a text analysis vsto add-in for Word.
Therefore I get the text of the active document like this:
Globals.ThisAddIn.Application.ActiveDocument.Content.Text
After that I analyze it. The analysis returns a list of positions that Word should comment (like character 3 - 6 and character 10 - 13).
The problem is that the comment from 3 to 6 seems to add a (hidden) character to the document, because all comments that Word adds after the first one are placed one character too early.
Is there a way to fix that, or a way to get the text including the hidden characters?
I found TextRetrievalMode but I cannot get it working.
Basically, the answer is "No, you can't do it the way you propose."
Yes, Word does add "hidden characters" to the text flow that cannot be picked up using the object model. Trying to work with character index values is not going to work reliably. The reliable method is Word's built-in Find/Replace with wildcards. If RegEx is absolutely necessary, then some kind of Find/Replace within a character-index range (say, starting 5 characters before and ending 5 characters after the indices computed using RegEx) might be a way to double-check the result and pick up the correct Range.
Possibly, depending on what kind of analysis this is, it might be better to work with the closed file, leveraging the Office Open XML. That will not have the problem of "hidden characters" that Word uses for structural information. On the other hand, there's a lot of formatting information that breaks up text runs that needs to be contended with...

Something like translit.net but on autohotkey

I want to write something like Translit.net, but in AutoHotkey. I have successfully finished the part where a single letter maps to a single letter:
:*:a::а
:*:b::б
:*:v::в
:*:g::г
:*:d::д
...
But now I have a problem with the translation of "shh" to "щ" and other multi-character translations. When I type shh I get схх back, but I want to get щ. What could I do?
My current idea: when I press a key, the script writes the translated letter and also adds the untranslated letter to a 3-element array, then checks whether the array elements form shh, ch, sh or any other combination longer than one character. Then I could remove the last 2 or 3 typed letters and send the Russian letter I need. Maybe someone knows an easier way to do that. I want my script to work exactly like the page I mentioned. A solution in C or C# instead of AutoHotkey would help me too.
I have the same problem, while using the unicode version of Autohotkey, but only if the file is saved in UTF-8 without BOM format.
Saving the file as UNICODE (UCS-2, must be Little Endian) solves the problem.
It also works with UTF-8 with BOM, so apparently AutoHotkey has trouble determining the file encoding on its own.
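For the multi-letter combinations asked about in the question, one common approach is to try the longest key first at each position. The following is a minimal sketch in Python rather than AutoHotkey; it converts a finished string instead of reacting to keystrokes. Only the single letters and "shh" → "щ" come from the question; the "sh" → "ш" and "ch" → "ч" mappings are assumptions added for illustration.

```python
# Partial table: "shh" and the single letters come from the question,
# "sh" and "ch" are assumed mappings.
TABLE = {
    "shh": "щ", "sh": "ш", "ch": "ч",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д",
}
MAX_KEY = max(len(key) for key in TABLE)


def transliterate(text):
    """Convert text by trying the longest matching key at each position."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += len(chunk)
                break
        else:
            # No key matches here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("shh"))
```

In a keystroke-driven script the same idea requires deleting already-sent letters when a longer combination completes, which is the buffering the question describes.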

C# Regex Replace ignore specific string

Since this is my first question here on stackoverflow, I hope it is asked correctly.
Basically I have a normal .txt file which contains any text like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function which allows the user to append some text and set a little timestamp.
So my problem is: With another function, you can search and replace text in the textfile, but as you can guess if someone wants to replace the word "Text" it will be replaced in the xml-stylish comment(timestamp) as well.
My result until now is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as some regex pros will see without executing it.
Now I hope that some regex pro could help me out here and provide a search pattern which replaces the normal text but ignores the timestamp.
I don't really understand the logic of regex yet, but I do understand the individual expressions, so this would be a hook for me to understand regex properly.
Thanks in advance.
If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here) as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
What your current pattern does is attempt to replace a repetition of any length of a character that is NOT <, +, ., *, > or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility thanks to @stema
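For comparison, here is a minimal sketch of a different technique, written in Python: since Python's `re` module does not support variable-length lookbehind, it matches either a whole comment or the search term and only substitutes when the term was matched. The function name `replace_outside_comments` is made up for illustration, and the search term is escaped on the assumption that it should be treated as literal text.

```python
import re


def replace_outside_comments(content, term, replacement):
    """Replace term everywhere except inside <!-- ... --> comments."""
    # Group 1 captures a whole comment; the alternative matches the term.
    pattern = re.compile(r"(<!--.*?-->)|" + re.escape(term), re.DOTALL)

    def substitute(match):
        if match.group(1) is not None:
            return match.group(1)  # a comment: put it back unchanged
        return replacement         # the term outside any comment

    return pattern.sub(substitute, content)


sample = "add Text\n<!-- Text added at 6/29/2011 9:20:38 AM -->\nother Text"
print(replace_outside_comments(sample, "Text", "Word"))
```

Because comments are consumed as a unit, a term that appears after the comment on the same line is still replaced.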

Inserting a character inside the body of a paragraph

If someone enters a very long title/sentence, the text will stretch across the web page.
Is there a way to break the text so it continues on to the next line?
Using overflow hidden will hide the text.
I think I should be using the wbr tag.
Should I use the insert(); method for this?
i.e.
string myText = "111111111111111111111111111111111111111111111111111111111111111111111111";
myText = myText.Insert(80, "<wbr/>");
Not sure how cross browser the wbr tag is also!
Strictly speaking, you should use the zero width space (&#8203;, U+200B) for this rather than <wbr>. However, Internet Explorer 6 and earlier are known not to support this (they show an ugly box). So <wbr> is probably the safest choice. Except... Internet Explorer 8 in standards mode is known not to support <wbr>, so you've got yourself a wonderful conundrum here.
You can read more at quirksmode.org.
Do note HBoss' comment in that it's hard to predict where to break, unless you're using a fixed width font like Courier. You should probably heed his advice and break more often than just every 80 characters. (And don't get me started on combining characters.)
As far as ASP.NET is concerned, you can indeed use the Insert method for this, but beware when you need to insert more than one: you'll need to do some bookkeeping (and a StringBuilder would also be advised).
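One way to sidestep that bookkeeping is to insert all break opportunities in a single pass. This is a minimal sketch in Python rather than C#, assuming the input is plain text without markup; the function name `add_break_opportunities` and the default of 10 characters are made up for illustration.

```python
import re


def add_break_opportunities(text, every=10, marker="<wbr>"):
    """Insert marker after every `every` characters of an unbroken run."""
    # Only match when another non-space character follows, so no marker
    # is added at the very end of a word.
    pattern = re.compile(r"(\S{%d})(?=\S)" % every)
    return pattern.sub(lambda m: m.group(1) + marker, text)


print(add_break_opportunities("1" * 25))
```

Words shorter than the limit are left untouched, and the marker can be swapped for the zero width space if that suits the target browsers better.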
You could use a regex to find words surrounded by whitespace/special chars, and surround it with a div/span that has different overflow properties.
If you do use <wbr>, be sure to surround the word with <nobr>.
I'm not sure you can solve this by splitting the content with breaks since you aren't guaranteed that the break will fit uniformly across browsers. You have variations of font-size, widths, etc.
Normally when I see content that extends too far, it simply overlaps the rest of the page, or the designer sets the overflow so that the content can be scrolled. There could potentially be some CSS tricks you could use, but I'm not aware of any.
As an alternative approach, instead of simply inserting a line break every x number of characters, you might just insert a space after certain characters, for example, punctuation. This will make sure that the content wraps at some point or another.
