How to extract text from a Word document?

I am trying to extract specific text from a Word document based on coordinates. I have searched many sites for this requirement, but with no luck. How can I map coordinates to text in a Word document?

The accepted answer in this question has a solution for finding the text of a certain line number in a Word doc.
Obviously you'll need a bit of extra code to search the strLine variable for a specific substring or whatever, but the hard work is done there I think.

Depending on the format of the Word file, there are two different models to deal with. The older .doc files are a binary format, usually read through the Word object model with its paragraphs, tables, and the like. The .docx files are ZIP packages with an XML-based structure, which is a totally different model.
If you need to support both formats, you've got your work cut out for you.
Here's a link to the documentation:
Word Object Model
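
For the .docx side, here is a minimal sketch of what "XML-based structure" means in practice. It assumes the main body lives in the word/document.xml part and reads it with the .NET base class library only; the DocxText class and ReadParagraphs method are names made up for this illustration. It returns paragraph text, not line numbers or coordinates, and text boxes nested inside a paragraph would be reported twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any Word API.
public static class DocxText
{
    // Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text run).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per w:p element, joining the w:t runs inside it.
    public static string[] ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml is missing; not a .docx package?");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToArray();
            }
        }
    }
}
```

Calling DocxText.ReadParagraphs(File.OpenRead(path)) gives an array you can search for a substring. A binary .doc file will fail here because it is not a ZIP package.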

Related

Best way to write a word document in C# (Bi-Directional)

I've been struggling for the last few days, trying to write a Word document.
I tried DocX (by Novacode), which was not a big success, then moved to the Microsoft.Office.Interop.Word library, which was better but still not a huge success.
The problem is that I'm trying to write a Right-To-Left document, which is of course mixed with different punctuation. The moment I add punctuation, the entire line is reversed.
Many of the lines come from a database and are written into the document as they are, so I cannot manipulate them. Titles and similar text I can manipulate and reverse until the lines come out the way I want, after some struggling.
I've seen some answers saying I should use a specific char which 'tells' the reading algorithm it is about to face a Right-To-Left line, but here most of the data comes from the database.
Has anyone faced this kind of problem and can give some advice?
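
For reference, the "specific char" approach mentioned above can be applied to database lines too, since each line can be wrapped as it is read. This is a minimal sketch, assuming the Unicode characters RIGHT-TO-LEFT EMBEDDING (U+202B) and POP DIRECTIONAL FORMATTING (U+202C) are what those answers meant; the BidiMarks class and WrapRtl method are hypothetical names, and whether Word then lays the line out as wanted has not been checked here.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BidiMarks
{
    public const char Rle = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    public const char Pdf = '\u202C'; // POP DIRECTIONAL FORMATTING

    // Wraps a line so a bidi-aware renderer treats the whole line,
    // including its punctuation, as right-to-left text.
    public static string WrapRtl(string line)
    {
        if (string.IsNullOrEmpty(line))
            return line;
        return Rle + line + Pdf;
    }
}
```

The stored text itself is not reversed or otherwise changed; only two invisible control characters are added around it.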

To whoever finds this relevant: when none of the above helped, I found this to be the best answer for Right-To-Left documents:
oDoc.Paragraphs.ReadingOrder = Word.WdReadingOrder.wdReadingOrderRtl;

Did you try with the Open XML SDK and using the BiDi class?
http://msdn.microsoft.com/en-us/library/dd452407(v=office.12).aspx

How to convert a word document to a text file in c# without using microsoft.office.interop?

I have plenty of Word documents in different versions which have to be converted to text files.
I want to read the content of each Word document and remove all formatting (just keep the words in text files). I have done this using Microsoft.Office.Interop (which always instantiates Word on the client), which is not recommended. So I am trying to create a C# project that converts Word to text automatically. Can anyone suggest a third-party tool, either open source or reasonably priced, that efficiently converts all versions of Word documents to text files in C#?
With Regards,
Shanthini

I hope this link points you the right way:
How to extract text from Word files using C#?

Finally I found a solution which works perfectly for me at the moment. I haven't tested it with 10000 documents. Here you go: http://sourceforge.net/projects/word-reader/?source=dlp
Comments and suggestions about this solution are welcome.
Thank you,
Shanthini

How to generate an RTF document server side in c#

I have tried using the System.Windows.Documents.FlowDocument server side, but ran into a problem with images.
What I need to produce is a document with headings, section breaks, page breaks, images (with text wrapping around from the left or the right), tables and ideally some kind of table of contents.
I use c# and asp.net.
Is there a library that will do most of this?
RTF has been chosen because the document needs to be openable in older versions of word, be editable, and we can't run word on the server.
Thank-you

I used MigraDoc in the past, it is a free library. You can create PDFs or RTFs. Just Google it.

I have started using .NET RTF Writer.
It produces clean RTF, but doesn't do everything I need.
There is pretty good documentation for RTF here.
I am working some things out for myself. For example, I needed to be able to wrap text around an image. Whilst the RTF writer above enables you to add images to documents, it does so by putting the image in its own paragraph. What I need is a shape element.
In the rtf it ends up looking something like this (some of the numbers define the size and position of the image in twips):
{\shp{\*\shpinst\shpleft3801\shptop1\shpright8300\shpbottom4500\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpfblwtxt0\shpz0
{\sp
{\sn pib}
{\sv
{\pict\pngblip\pichgoal4499\picwgoal4499
-- image binary data goes here --
}}}
{\sp
{\sn fLine}
{\sv 0}}}}
I sometimes just save something in Word and try to understand what it did (but Word seems to add a lot of noise).
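
The shape group above can be generated from code. This is a rough sketch that reproduces the same control words and assumes the image data is written as hex text inside \pict; the RtfShape class and its method names are invented for this example, and the picture goal size is simply derived from the shape rectangle (as the numbers in the sample suggest).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper; it only assembles the RTF text shown in the answer.
public static class RtfShape
{
    private const int TwipsPerInch = 1440; // 1 inch = 1440 twips

    public static int InchesToTwips(double inches)
    {
        return (int)Math.Round(inches * TwipsPerInch);
    }

    private static string N(int value)
    {
        return value.ToString(CultureInfo.InvariantCulture);
    }

    // left/top/right/bottom are the shape rectangle in twips.
    public static string BuildPngShape(byte[] png, int left, int top, int right, int bottom)
    {
        var sb = new StringBuilder();
        sb.Append(@"{\shp{\*\shpinst")
          .Append(@"\shpleft").Append(N(left))
          .Append(@"\shptop").Append(N(top))
          .Append(@"\shpright").Append(N(right))
          .Append(@"\shpbottom").Append(N(bottom))
          .Append(@"\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore")
          .Append(@"\shpwr2\shpwrk0\shpfblwtxt0\shpz0").Append('\n');

        // "pib" property: the picture itself, sized to fill the shape.
        sb.Append(@"{\sp{\sn pib}{\sv {\pict\pngblip")
          .Append(@"\pichgoal").Append(N(bottom - top))
          .Append(@"\picwgoal").Append(N(right - left)).Append('\n')
          .Append(BitConverter.ToString(png).Replace("-", "")) // bytes as hex digits
          .Append('\n').Append("}}}").Append('\n');

        // "fLine" = 0 switches off the outline around the shape.
        sb.Append(@"{\sp{\sn fLine}{\sv 0}}}}");
        return sb.ToString();
    }
}
```

For example, RtfShape.BuildPngShape(File.ReadAllBytes("logo.png"), 3801, 1, 8300, 4500) yields a group with the same position values as the sample; the file name is a placeholder.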

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.

If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.

Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.

Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Databinding in existing XPS document

I have an existing XPS file that I would like to use as a template and possibly bind data to it. I have tried several methods, but cannot seem to get it to work.
Does anyone have any experience altering an existing XPS file to add data at runtime and then print or save?
Any help is appreciated.

XPS documents use the same ZIP-based packaging (Open Packaging Conventions) as Open XML documents, although the markup inside is XPS rather than Office markup. There is an SDK for working with Open XML docs. Here is a How-to article by Beth Massi: "Accessing Open XML Document Parts with the Open XML SDK".
Since you are working with the internal doc structure you might also check out "Open XML Package Editor", which lets you explore the doc with Visual Studio. Here is another How-to by Beth Massi: "Handy Visual Studio Add-In to View Office 2007 Files".
+tom

It's a bit of a challenge to do this with XPS, but it is possible.
You can do this with our NiXPS SDK.
I've posted an example on my blog a while ago:
XPS variable data example
Regards,
Nick

Bindings are evaluated during the process of writing to an XPS document. So you can't set up a {Binding} in a FixedDocument, write that FD to an XpsDocument, and expect to get that original FD back again when you next open that saved doc.
Also, the standard XpsWriter converts everything into Glyphs on canvases, so you can't, say, put a TextBox in the original and expect to be able to find it after it's been saved to a document.
I've never used the NiXPS libraries, so if Nick says it can be done you might want to check it out.
One last possibility: you can create placeholders in a form that you will be able to find later. They'd have to be text (something like [[{{FORMFIELDHERELOL}}]]) with some kind of delimiter scheme to differentiate the text from everything else. You could then go spelunking in the XML looking for text that fits the delimiter pattern and switch out those glyphs for your binding text. Of course, the issue with THAT is that if you aren't putting X chars in place of X chars you might find you have to do some repositioning. As it's all glyphs on canvas this might be slightly harder than, say, threading a needle with a shoelace.
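
To make the placeholder idea concrete, here is a minimal sketch that works on the XML of a single FixedPage part as a string. It assumes each placeholder sits whole inside one Glyphs element's UnicodeString attribute (a writer may split text across several elements), and it does no repositioning. XpsPlaceholders and Fill are hypothetical names; getting the page XML out of the package and back in is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class XpsPlaceholders
{
    // Matches tokens such as [[{{NAME}}]] and captures NAME.
    private static readonly Regex Token = new Regex(@"\[\[\{\{(\w+)\}\}\]\]");

    // Returns the page XML with known placeholders replaced; unknown ones are left alone.
    // Note: the returned string does not include an XML declaration.
    public static string Fill(string fixedPageXml, IDictionary<string, string> values)
    {
        XDocument page = XDocument.Parse(fixedPageXml);
        foreach (XElement glyphs in page.Descendants().Where(e => e.Name.LocalName == "Glyphs"))
        {
            XAttribute text = glyphs.Attribute("UnicodeString");
            if (text == null || !Token.IsMatch(text.Value))
                continue;

            string replaced = Token.Replace(text.Value, m =>
            {
                string v;
                return values.TryGetValue(m.Groups[1].Value, out v) ? v : m.Value;
            });

            // XPS requires a UnicodeString that starts with '{' to carry a "{}" escape prefix.
            if (replaced.StartsWith("{") && !replaced.StartsWith("{}"))
                replaced = "{}" + replaced;
            text.Value = replaced;

            // Indices describes the old glyph run, so drop it (assumption: the
            // viewer then falls back to the font's default mapping and advances).
            XAttribute indices = glyphs.Attribute("Indices");
            if (indices != null)
                indices.Remove();
        }
        return page.ToString(SaveOptions.DisableFormatting);
    }
}
```

Text that is longer or shorter than the placeholder will still start at the original OriginX, so neighbouring glyph runs may overlap or leave a gap, which is the repositioning problem described above.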
