I am writing a Word plugin that reads all the text in a document and saves it to a text file.
The generated text file will be used by another application of mine, so I need to mark the end of every page's text with a '\f' character. My current logic merely saves the file through Word as a plain text file, using
object format = WdSaveFormat.WdFormatText;
...
Application.ActiveDocument.SaveAs( ..., ref format, ... );
The best method I found to insert a break was using ActiveDocument.Selection.InsertBreak().
Is there some way to determine the positions of page breaks in the original Word document so that I know where to insert the '\f' character?
This is one of the harder ways to do it:
Use ComputeStatistics() to get the number of lines in the document.
Use GoTo to jump to the last line in the document and insert a hard EOF.
ex:
Selection.GoTo(wdGoToLine,wdGoToAbsolute,4)
The only thing that I can think of right now is to save it as HTML, which will give you a tag for every paragraph. Then you can take the beginning of each paragraph's text and use it to find the starting position of each paragraph in the original document.
Also, you can do a Selection.Find and search for "^p" which is a paragraph mark.
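Once the page boundaries are known, writing the '\f' markers is plain string work. Here is a minimal sketch of that last step only. The name InsertFormFeeds and the pageStartOffsets parameter are hypothetical: the offsets are assumed to be character positions in the exported text where pages 2, 3, ... begin, and how you get them out of Word (and whether they line up exactly with the plain-text export) is not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageBreakMarker
{
    // pageStartOffsets: hypothetical character offsets into text where each
    // page after the first begins. Obtaining them from Word is not shown.
    public static string InsertFormFeeds(string text, IEnumerable<int> pageStartOffsets)
    {
        var offsets = new List<int>(pageStartOffsets);
        offsets.Sort();

        var result = new StringBuilder(text.Length + offsets.Count);
        int previous = 0;
        foreach (int offset in offsets)
        {
            if (offset < previous || offset > text.Length)
                throw new ArgumentOutOfRangeException(nameof(pageStartOffsets));

            // Copy the text of the page that ends here, then mark its end.
            result.Append(text, previous, offset - previous);
            result.Append('\f');
            previous = offset;
        }

        // Copy the last page, which gets no trailing marker.
        result.Append(text, previous, text.Length - previous);
        return result.ToString();
    }
}
```

For example, InsertFormFeeds("page1page2page3", new[] { 5, 10 }) puts one form feed before "page2" and one before "page3".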
I am new to this site, and I don't know if I am providing enough info - I'll do my best =)
If you use Notepad++, then you will know what I am talking about -- When a user loads a .exe into Notepad++, the NUL / \x0 character is replaced by NULL, which has a black background, and white text. I tried pasting it into Visual Studio, hoping to obtain the same output, but it just pasted some spaces...
Does anyone know if this is a certain key-combination, or something? I would like to put the NULL character in replacement of \x0, just like Notepad++ =)
Notepad++ is a rich text editor, unlike the regular Notepad. It can display custom graphics, as is common in modern text editors. While reading a file, whenever Notepad++ encounters the ASCII code of a null character, instead of displaying nothing it adds the string "NULL" to the UI, with the text background colour set to black and the text colour set to white, which is what you are seeing. You can show any custom style in your own rich text editor too.
NOTE: This is by no means an efficient solution. I'm traversing each read string twice just to take advantage of existing methods; this can be done manually in a single pass. It is only meant as a hint about how you can do it. Also, I wrote the code carefully but haven't run it, because I don't have the tools at the moment. I apologise for any mistakes; let me know and I'll update it.
Step 1: Read the text file line by line (a line ends at '\n') and replace all instances of the null character in that line with the string "NUL" using String.Replace(). Then append the modified text to your RichTextBox.
Step 2: Traverse the read line again using String.IndexOf() to find the start index of each "NUL" word. Using these indexes, select the text in the RichTextBox and style the selection using RichTextBox.SelectionColor and RichTextBox.SelectionBackColor.
richTextBoxCursor basically just represents the start index of each line in RichTextBox
StreamReader sr = new StreamReader(@"c:\test.txt", Encoding.UTF8);
int richTextBoxCursor = 0;
while (!sr.EndOfStream) {
    // Start index, inside the RichTextBox, of the line we are about to append
    richTextBoxCursor = richTextBox.TextLength;
    string line = sr.ReadLine();
    line = line.Replace(Convert.ToChar(0x0).ToString(), "NUL");
    // ReadLine() strips the line terminator, so add it back
    richTextBox.AppendText(line + "\n");
    int i = 0;
    while (true) {
        i = line.IndexOf("NUL", i);
        if (i == -1) break;
        // Select(start, length) selects 'length' characters beginning at 'start'
        // i is the start index of each found "NUL" word in our read line
        // 3 is the length because the "NUL" word has three characters
        richTextBox.Select(richTextBoxCursor + i, 3);
        richTextBox.SelectionColor = Color.White;
        richTextBox.SelectionBackColor = Color.Black;
        i += 3; // Continue searching after this "NUL"
    }
}
sr.Close();
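The note above mentions that the same work can be done in a single pass. A rough sketch of that idea follows, kept free of any UI types so it can be tried on its own. The class and method names (NulMarker, ReplaceNuls) are made up for illustration. A side effect of this approach is that a literal "NUL" already present in the file is not highlighted, because only positions that came from a real null character are recorded.

```csharp
using System.Collections.Generic;
using System.Text;

public static class NulMarker
{
    public const string Placeholder = "NUL";

    // Replaces every '\0' in line with Placeholder and records, in the same
    // pass, the index at which each placeholder starts in the returned string.
    public static string ReplaceNuls(string line, List<int> placeholderStarts)
    {
        var output = new StringBuilder(line.Length);
        foreach (char c in line)
        {
            if (c == '\0')
            {
                placeholderStarts.Add(output.Length);
                output.Append(Placeholder);
            }
            else
            {
                output.Append(c);
            }
        }
        return output.ToString();
    }
}
```

Each recorded start can then be passed to richTextBox.Select(richTextBoxCursor + start, 3) in place of the IndexOf loop.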
Notepad++ may use custom or special fonts to show these particular characters. This behavior may not be appropriate for all text editors, so they don't show them.
If you want to write a text editor that visualizes these characters, you will probably need to implement this behavior programmatically. Looking at the Notepad++ source may be helpful.
Text editor
As far as I know in order to make Visual Studio display non printable characters you need to install an extension from the marketplace at https://marketplace.visualstudio.com.
One such extension, which I have neither tried nor recommend (I just did a quick search and this is the first result), is
Invisible Character Visualizer.
Having said that, copy-pasting binaries is a risky business.
You may try Edit > Advanced > View White Space first.
Binary editor
To really see what's going on you could use Visual Studio's binary editor: File -> Open -> (Open with... option) -> Binary Editor -> OK
To answer your question.
It's a symbolic representation of 00H double byte.
You're copying and pasting the values. Notepad++ is showing you symbols that replace the representation of those values (because you configured it to do so in that IDE).
In my website, admin uploads a .docx file. I convert the file into xml using OpenXmlPowerTools Api.
The issue is the document has some bullets in it.
• This is my bullet 1 in the document.
• This is my bullet 2 in the document.
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(wDoc, settings);
var htmlString = html.ToString();
File.WriteAllText(destFileName.FullName, htmlString, Encoding.UTF8);
Now when I open the XML file, it renders the bullets as below:
I need to read each node of the XML, save it in the database, and reconstruct the HTML from the nodes.
Please don't ask me why so, as I am not the boss of the system.
How do I get the bullets render correctly in xml so that I can save the right
html in the database?
I have fixed the same issue for my own requirement, and it has been working without problems so far.
In a case like this you will always have to use a workaround: copy the character and compare it against your input strings; if it is found, replace it with the equivalent HTML-encoded character. In your case that is the bullet character, "&bull;" or "&#8226;".
The code should look like
listItem == "Compare with your copied character like one in your pic" ? "&bull;" : listItem
you can find more equivalent characters at this link:
http://www.zytrax.com/tech/web/entities.html
I don't think XML can read bullets, so I'd advise you to handle it programmatically. Debug and see what character the square actually represents, then use an if statement to find it and replace it with a code of your own choosing. When you read the data back, convert that code to a bullet wherever it is found.
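A minimal sketch of the replace step both answers describe, to be applied to the final string (for example the htmlString from the question). The class name BulletEncoder is hypothetical. '\u2022' is the standard Unicode bullet; '\uF0B7' is an assumption about what the "square" might be, since Word bullets set in the Symbol font often come through as that private-use code point. Check the real value in a debugger before relying on it.

```csharp
using System.Text;

public static class BulletEncoder
{
    // Replaces bullet characters with the numeric HTML entity for a bullet.
    // '\uF0B7' is an assumed value for the unrenderable character; verify it.
    public static string EncodeBullets(string text)
    {
        var output = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c == '\u2022' || c == '\uF0B7')
                output.Append("&#8226;");
            else
                output.Append(c);
        }
        return output.ToString();
    }
}
```

Doing this on the serialized string rather than on the XElement tree matters: text placed inside an XElement has its ampersand escaped again when the element is written out.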
I have an Excel file which contains some data. When I save that file as CSV, some weird ? marks appear before and after the text. Can anyone please tell me how I can resolve this issue?
?XXXXXX-XXX?
Here is the link to download the Excel file: http://www.filedropper.com/book1_5
In this file, in the column C you've got following data:
"0000468750-IN"
"0000468750-IN"
"0000843576AB"
"0000843576AB"
It is not really visible now, but at the start and end of every number there is an additional invisible whitespace character. You can see it for yourself: edit that cell and move through the text with the arrow keys, and the cursor will make a little pause when moving over that invisible character. If I replace it with an underscore, it looks like this:
"_0000468750-IN_"
"_0000468750-IN_"
"_0000843576AB_"
"_0000843576AB_"
If my text editor doesn't cheat on me, that character has code 0x00, and it's called null-character.
When converting to CSV, Excel didn't know what to do with that character. CSV is a textfile and must follow some encoding rules. For example, if you saved it as CSV/ANSI, then it's not possible to store some Unicode characters like ąęćżń. Similarly, it's usually not possible to store a 0x00 character in a textfile at all, because this character is special in most encodings. With this character inside, such textfile could be detected as "binary file" by readers and rejected.
Excel simply replaced that odd character with a "?" character to make the data safe for the CSV format. Excel didn't just erase the 0x00 character, so that you would know there was something odd in the original data.
It's very strange to see it in textual data. If this XLSX was generated by a computer program, it might indicate that this program has some bugs/errors. I highly doubt this file to be manually created. It's really hard to write "0x00" character by hand. One option I can think of when you could get this manually is by using a crappy barcode reader, and scanning the codes right into the Excel sheet. The barcode scanning software sometimes leaks the control characters into the textdata stream. If that's the case, change the reader or write a filter that will cut those chars out.
By the way, you should be able to just find and replace all those strange characters. Edit one of the cells (F2 key), go to the end of the text (END key), select the LAST character of the text (Shift + LeftArrow ONCE), copy that character (Control + C), then open the Find & Replace window (Control + H), paste that character into "Find" and press "Replace All".
On my Excel this resulted in finding/replacing 8 such characters, so it works.
Note that after the END key you must press ShiftLeft exactly ONCE. The cursor will not move and nothing will happen, no selection will show up. That's because the character is invisible. But it is there, and it will be selected and copied.
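If the data passes through your own code before it reaches Excel or the CSV, the filter suggested above could look roughly like this. It is a sketch with a made-up name (ControlCharFilter), and the choice to keep tab, carriage return and line feed is an assumption about what is still wanted in the data.

```csharp
using System.Text;

public static class ControlCharFilter
{
    // Removes control characters such as 0x00 from a value.
    // Tab, CR and LF are kept on the assumption that they are intentional.
    public static string StripControlChars(string value)
    {
        var output = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (!char.IsControl(c) || c == '\t' || c == '\r' || c == '\n')
                output.Append(c);
        }
        return output.ToString();
    }
}
```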
I'm trying to use the MSWord Interop Library to write a C# application that outputs specially formatted text (isolated Arabic letters) to a file. The problem I'm running into is determining how many characters remain before the text wraps onto a new line. I need each word to stay on one line, without wrapping, which is the default behavior for ordinary words. I'm finding this difficult because when the Arabic letters of a word are isolated with spaces, they are treated as individual characters and therefore behave differently than connected words.
Any help is appreciated. Thanks.
Add each character to your range and then check the number of lines in the range
LineCount = range.ComputeStatistics(Word.WdStatistic.wdStatisticLines);
When the line count changes, you know it has been wrapped, and can remove the last character or reformat accordingly
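The add-and-check loop can be sketched without tying it to Word by passing the line count in as a callback. Everything named here (LineFitter, CountCharsThatFit, countLines) is hypothetical. In a real add-in the callback would presumably set the range text and return range.ComputeStatistics(Word.WdStatistic.wdStatisticLines), which is an assumption about how you wire it up.

```csharp
using System;

public static class LineFitter
{
    // countLines is a hypothetical callback: given candidate text, it reports
    // how many lines that text occupies once placed in the document.
    // Returns how many leading characters of text fit on a single line.
    public static int CountCharsThatFit(string text, Func<string, int> countLines)
    {
        int fitted = 0;
        while (fitted < text.Length && countLines(text.Substring(0, fitted + 1)) <= 1)
        {
            fitted++;
        }
        return fitted;
    }
}
```

Calling the callback once per character is slow against a live document, so for long text a binary search over the prefix length would cut the number of calls.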
I don't know how this behaves today, but I once wrote something against the MSWord API and ran into a somewhat weird fact: you can't actually find that out. In MSWord, text in a document is always in paragraphs.
If you add text to your document, it does not simply land on a page; the page will contain at least one paragraph holding the text you wrote.
Unfortunately I can't check this again, because I don't have a license for MS Word these days.
Give it a try and look at the problem again in this way.
Hope this helps, and if not, please provide the code that generates the input and the exact version of MSWord.
Greetings,
Kjellski
I'm not sure what "Arabic letters of the word isolated with spaces" means exactly, but I assume that non breaking space is what you need.
Here's more details.
I have a program that generates a plain text file. The structure (layout) is always the same. Example:
Text File:
LinkLabel
"Hello, this text will appear in a LinkLabel once it has been
added to the form. This text may not always cover more than one line. But will always be surrounded by quotation marks."
240, 780
So, to explain what is going on in that file:
Control
Text
Location
When a button on the Form is clicked and the user opens one of these files from the OpenFileDialog dialog, I need to be able to read each line. Starting from the top, I want to check which control it is; then, starting on the second line, I need to get all the text inside the quotation marks (regardless of whether it is one line of text or more); and on the next line (after the closing quotation mark), I need to extract the location (240, 780). I have thought of a few ways of going about this, but when I go to write them down and put them into practice, they don't make much sense and I end up finding ways that they won't work.
Has anybody ever done this before? Would anybody be able to provide any help, suggestions or advice on how I'd go about doing this?
I have looked up CSV files but that seems too complicated for something that seems so simple.
Thanks
jase
You could use a regular expression to get the lines from the text:
MatchCollection lines = Regex.Matches(File.ReadAllText(fileName), @"(.+?)\r\n""([^""]+)""\r\n(\d+), (\d+)\r\n");
foreach (Match match in lines) {
string control = match.Groups[1].Value;
string text = match.Groups[2].Value;
int x = Int32.Parse(match.Groups[3].Value);
int y = Int32.Parse(match.Groups[4].Value);
Console.WriteLine("{0}, \"{1}\", {2}, {3}", control, text, x, y);
}
I'll try and write down the algorithm, the way I solve these problems (in comments):
// while not at end of file
// read control
// read line of text
// while last char in line is not "
// read line of text
// read location
Try and write code that does what each comment says and you should be able to figure it out.
HTH.
You are trying to implement a parser and the best strategy for that is to divide the problem into smaller pieces. And you need a TextReader class that enables you to read lines.
You should separate your ReadControl method into three methods: ReadControlType, ReadText, ReadLocation. Each method is responsible for reading only the item it should read and leave the TextReader in a position where the next method can pick up. Something like this.
public Control ReadControl(TextReader reader)
{
string controlType = ReadControlType(reader);
string text = ReadText(reader);
Point location = ReadLocation(reader);
... return the control ...
}
Of course, ReadText is the most interesting one, since it spans multiple lines. In fact it's a loop that calls TextReader.ReadLine until the line ends with a quotation mark:
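private string ReadText(TextReader reader)
{
    string line = reader.ReadLine();
    if (line == null) throw new EndOfStreamException("Expected a quoted text line.");
    string text = line.Substring(1); // Strip first quotation mark.
    while (!text.EndsWith("\"")) {
        line = reader.ReadLine();
        if (line == null) throw new EndOfStreamException("Missing closing quotation mark.");
        text += Environment.NewLine + line; // Keep the line break between lines.
    }
    return text.Substring(0, text.Length - 1); // Strip last quotation mark.
}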
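ReadLocation is the simplest of the three methods. One possible shape for it is sketched below, assuming the location line always looks like "240, 780". It returns a plain (X, Y) tuple rather than a Point only to keep the sketch free of a System.Drawing dependency; the wrapping class name is made up.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class LocationReader
{
    // Reads one line such as "240, 780" and returns it as an (X, Y) pair.
    public static (int X, int Y) ReadLocation(TextReader reader)
    {
        string line = reader.ReadLine();
        if (line == null)
            throw new EndOfStreamException("Expected a location line.");

        string[] parts = line.Split(',');
        if (parts.Length != 2)
            throw new FormatException("Expected 'x, y' but got: " + line);

        int x = int.Parse(parts[0].Trim(), CultureInfo.InvariantCulture);
        int y = int.Parse(parts[1].Trim(), CultureInfo.InvariantCulture);
        return (x, y);
    }
}
```

Like ReadText, it consumes exactly one item and leaves the reader positioned at the start of whatever follows.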
This kind of stuff gets irritating: it's conceptually simple, but you can end up with gnarly code. You've got a comparatively simple case, one record per file. It gets much harder if you have lots of records and you want to deal nicely with badly formed records (consider writing a parser for a language such as C#).
For large scale problems one might use a grammar driven parser such as this: link text
Much of your complexity comes from the lack of regularity in the file. The first field is terminated by a newline, the second is delimited by quotes, the third is separated by a comma ...
My first recommendation would be to adjust the format of the file so that it's really easy to parse. You write the file, so you're in control. For example, just don't have new lines in the text, and put each item on its own line. Then you can just read four lines, job done.