I have tried using the System.Windows.Documents.FlowDocument server side, but ran into a problem with images.
What I need to produce is a document with headings, section breaks, page breaks, images (with text wrapping around from the left or the right), tables and ideally some kind of table of contents.
I use C# and ASP.NET.
Is there a library that will do most of this?
RTF has been chosen because the document needs to be editable and openable in older versions of Word, and we can't run Word on the server.
Thank you.
I have used MigraDoc in the past; it is a free library that can create PDF or RTF files. Just Google it.
I have started using .NET RTF Writer.
It produces clean RTF, but doesn't do everything I need.
There is pretty good documentation for rtf here.
I am working some things out for myself. For example, I needed to be able to wrap text around an image. Whilst the RTF writer above lets you add images to documents, it does so by putting the image in its own paragraph. What I need is a shape element.
In the RTF it ends up looking something like this (some of the numbers define the size and position of the image in twips):
{\shp{\*\shpinst\shpleft3801\shptop1\shpright8300\shpbottom4500\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpfblwtxt0\shpz0
{\sp
{\sn pib}
{\sv
{\pict\pngblip\pichgoal4499\picwgoal4499
-- image binary data goes here --
}}}
{\sp
{\sn fLine}
{\sv 0}}}}
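A rough sketch of generating that shape group programmatically, written in Python for brevity (it ports directly to C#). The function names are made up for illustration, the wrap and anchor flags are simply copied from the snippet above, and it assumes the picture data is written as hex text, which is the default encoding for \pict data in RTF:

```python
TWIPS_PER_INCH = 1440  # one twip is 1/1440 of an inch


def inches_to_twips(inches: float) -> int:
    """Convert a measurement in inches to whole twips."""
    return round(inches * TWIPS_PER_INCH)


def floating_image_rtf(png_bytes: bytes, left_in: float, top_in: float,
                       width_in: float, height_in: float) -> str:
    """Build a {\\shp ...} group holding a PNG that text can wrap around.

    Hypothetical helper: the control words are taken from the hand-written
    example above, not from any library.
    """
    left = inches_to_twips(left_in)
    top = inches_to_twips(top_in)
    width = inches_to_twips(width_in)
    height = inches_to_twips(height_in)
    parts = [
        r"{\shp{\*\shpinst",
        # Bounding box of the shape, in twips.
        rf"\shpleft{left}\shptop{top}\shpright{left + width}\shpbottom{top + height}",
        r"\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore",
        r"\shpwr2\shpwrk0\shpfblwtxt0\shpz0",
        # The "pib" property carries the picture itself.
        r"{\sp{\sn pib}{\sv{\pict\pngblip",
        rf"\picwgoal{width}\pichgoal{height} ",
        png_bytes.hex(),  # hex-encoded image bytes
        r"}}}",
        # fLine = 0 turns off the outline around the shape.
        r"{\sp{\sn fLine}{\sv 0}}}}",
    ]
    return "".join(parts)
```

The returned string can be written into the document body at the paragraph the image should be anchored to.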
I sometimes just save something in Word and try to understand what it did (but Word seems to add a lot of noise).
Background
I am working on a WPF Windows application and I want to add an embedded PDF viewer with only basic functionality: PDF viewing, text search and page navigation.
I tried embedding Internet Explorer with Adobe PDF Reader installed, but this method does not meet our requirements: Adobe PDF Reader has too many external links, which cannot be allowed because of the application's security requirements.
Therefore, I am trying to use moonpdf library. This library works fine with our requirements but the only problem is there is no text search functionality in this library. (I think it shows PDF as images)
I then downloaded the moonpdf source code and realized that moonpdf wraps libmupdf.dll for C#.
I can modify the moonpdf source code and mupdf source code for our requirement if needed.
My Question
Is there any text search functionality in mupdf? If so, how can I use it?
In the basic mupdf library, there are several functions for searching for text. They come in a few variants, but each searches a single page for a text string and returns the areas of all hits for that text. You need to iterate over the pages yourself (in order to do a forward or reverse search).
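fz_quad hits[1000];
/* Returns the number of hits on this page; the exact parameter list differs between MuPDF versions. */
int count = fz_search_page(ctx, page, needle, hits, nelem(hits));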
That said, I do not know how or even if "moonpdf" has wrapped these functions.
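The page iteration described above can be sketched as follows, in Python for brevity. Here `search_page` is a hypothetical stand-in for whatever wrapper ends up calling the per-page search function; it is assumed to take a page number and the needle and return a list of hit areas for that page:

```python
from typing import Callable, List, Optional, Tuple


def find_next_hit(page_count: int, start_page: int, needle: str,
                  search_page: Callable[[int, str], List],
                  forward: bool = True) -> Optional[Tuple[int, List]]:
    """Walk the pages from start_page until one of them contains the needle.

    Returns (page_number, hits) for the first page with hits,
    or None when the end (or start) of the document is reached.
    """
    step = 1 if forward else -1
    page = start_page
    while 0 <= page < page_count:
        hits = search_page(page, needle)  # per-page search, as in the C call above
        if hits:
            return page, hits
        page += step
    return None
```

A "find next" button would call this with the page after the current hit, and "find previous" would pass `forward=False`.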
You can certainly extract the text from a document, the MuPDF library will do that. I believe it's up to you to apply your own search criteria after that. I'm afraid I'm not expert enough to answer the 'how to' part of it though. I imagine one of the mutool examples would be helpful here though. I'll see if I can get one of the developers to answer.
I was wondering if it is possible to find the coordinates of a specific Run (text, no drawing or other elements that have offset parameter) on a page in a Word document using OpenXML SDK. I know that OpenXML is basically .. well XML, and simple runs have no relative, numerical position embedded in them.
I was reading through the OpenXML SDK API and found no clues, but maybe I have missed something. By coordinates I mean any tuple that can be mapped to pixels if I generated an image of the page (imagine taking a screenshot of the page).
I suspect, if this is possible, it is not trivial.
Appreciate your help!
The Open XML SDK does not include this functionality. This would require a layout engine, which is not part of the SDK.
Word is not a page layout program, it's a word processor. Therefore:
No, it's not possible because...
Word lays out a page dynamically when the document is opened. Exactly how it's laid out and where things appear on-screen (or on the printed page) depends on how Word calculates font size as well as line, character and paragraph spacing (in all directions) for the currently selected printer driver. So it can vary and thus cannot be saved in the Open XML file.
Currently, I can extract all the text chunks with their location data from a PDF. The problem is that the PDF contains images with text annotations which I do not want included in the extraction.
However, for whatever reason whenever I search the PDF for images, it only finds 1 of the images and usually throws the exception: The colour space is not supported. It's as if it doesn't recognise them as images?
I am not wishing to extract the images, just locate where they start and end in relation to the PDF so I can exempt the text that is on top of the images.
For example, the numbers on a graph are unwanted and need to be removed from the extracted text.
I'm just not sure how to:
A) Locate all the images and store the coordinates of where it starts and ends
B) Ignore the text that is on top of the images in the PDF document
(I am using iTextSharp to try and achieve this, but so far I am not having much luck)
I'm not exactly sure how iTextSharp works but the PostScript language reference or the PDF Reference manuals may be a good place to start figuring out what you need to know.
I just cracked open a PDF file in a text editor to check out the format because I haven't seen it in a while and then realized what the problem might be.
PDFs support "Images", and "Stream Objects" which can contain image data. Stream objects actually declare enough information that you can know where they begin and end and write something to manually ignore them.
A Stream Object Header looks like this:
<</Intent/RelativeColorimetric/Subtype/Image/Length 19678/Filter/DCTDecode/Name/X/Metadata 4314 0 R/BitsPerComponent 8/ColorSpace 5247 0 R/Width 290/Height 372/Type/XObject>>stream
It's entirely possible that your particular PDF has only one "Image" and then the rest of it is "Streams".
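As a rough way to check this without a PDF library, here is a sketch in Python that scans the raw file bytes for stream dictionaries marked /Subtype /Image. The names are made up for illustration, and it only sees dictionaries stored in plain text; objects packed into compressed object streams will be missed. Note that /Width and /Height are the pixel dimensions of the image data, not its position on the page:

```python
import re

# A dictionary immediately followed by the "stream" keyword, without
# crossing an "endobj" (so dictionaries of earlier objects are skipped).
STREAM_DICT = re.compile(rb"<<((?:(?!endobj).)*?)>>\s*stream", re.DOTALL)
IS_IMAGE = re.compile(rb"/Subtype\s*/Image\b")


def _int_entry(header: bytes, key: bytes):
    """Return the integer value of a /Key entry, or None if it is absent."""
    match = re.search(rb"/" + key + rb"\s+(\d+)", header)
    return int(match.group(1)) if match else None


def find_image_streams(pdf_bytes: bytes):
    """List byte offset and pixel size of each image stream dictionary found."""
    images = []
    for match in STREAM_DICT.finditer(pdf_bytes):
        header = match.group(1)
        if IS_IMAGE.search(header):
            images.append({
                "offset": match.start(),
                "width": _int_entry(header, b"Width"),
                "height": _int_entry(header, b"Height"),
            })
    return images
```

Running it over `open(path, "rb").read()` gives a quick count of how many image streams the file really contains.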
I suggest cracking it open to take a look. It would also help if you included some sample code for the library you're using.
By opening a PDF in a text editor I also found the string /Type /Page, which seems to mark each new page, so there's a chance you could count those to determine which page you're currently on.
The header at the top of the document I'm reviewing is %PDF-1.2 and the latest version is 1.7, so there may be some disparity here because of that.
Any chance you can share the PDF file you're working with?
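For part B of the question, once the image rectangles are known (however they are obtained), dropping the overlapping text is a plain rectangle-intersection test. A minimal sketch in Python, assuming each text chunk comes with a bounding box in the same page coordinate system as the images; `Rect` and the function names are hypothetical, not part of iTextSharp:

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share any area."""
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def text_outside_images(chunks: List[Tuple[str, Rect]],
                        image_rects: List[Rect]) -> List[str]:
    """Keep only the chunks whose bounding box touches no image rectangle."""
    return [text for text, rect in chunks
            if not any(overlaps(rect, image) for image in image_rects)]
```

Whether partial overlap should count is a design choice; this version discards a chunk as soon as any part of it lies over an image.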
I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.
If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.
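If you want something closer to "Inspect Element", one crude check is to look at which operators a page's content stream uses. The sketch below, in Python, assumes you already have the decoded (uncompressed) content stream bytes from some library. It counts the text-showing operators Tj and TJ against Do, which paints an XObject. A page with no Tj/TJ at all is probably a scan, but remember that Do can also invoke a Form XObject that itself contains text, and that operator-like bytes inside string literals can give false hits:

```python
import re
from collections import Counter

# An operator must follow whitespace or the end of an operand
# (")", "]" or ">") and be followed by whitespace or end of data.
OPERATOR = re.compile(rb"(?<![^\s\])>])(Tj|TJ|Do)(?=\s|$)")


def count_operators(content_stream: bytes) -> Counter:
    """Count text-showing (Tj, TJ) and XObject-painting (Do) operators."""
    return Counter(match.group(1).decode("ascii")
                   for match in OPERATOR.finditer(content_stream))
```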
Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.
Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx
I have an existing XPS file that I would like to use as a template and possibly bind data to it. I have tried several methods, but cannot seem to get it to work.
Does anyone have any experience altering an existing XPS file to add data at runtime and then print or save?
Any help is appreciated.
XPS documents use the Open Packaging Conventions, the same ZIP-based package format that Open XML documents use. There is an SDK for working with such packages. Here is a How-to article by Beth Massi: "Accessing Open XML Document Parts with the Open XML SDK".
Since you are working with the internal doc structure you might also check out "Open XML Package Editor", which lets you explore the doc with Visual Studio. Here is another How-to by Beth Massi: "Handy Visual Studio Add-In to View Office 2007 Files".
+tom
It's a bit of a challenge to do this with XPS, but it is possible.
You can do this with our NiXPS SDK.
I posted an example on my blog a while ago:
XPS variable data example
Regards,
Nick
Bindings are evaluated during the process of writing to an XPS document. So you can't set up a {Binding} in a FixedDocument, Write that FD to an XpsDocument, and expect to get that original FD back again when you next open that saved doc.
Also, the standard XpsWriter converts everything into Glyphs on canvases, so you can't, say, put a TextBox in the original and expect to be able to find it after it's been saved to a document.
I've never used the NiXPS libraries, so if Nick says it can be done you might want to check it out.
One last possibility: you can create placeholders in a form that you will be able to find later. They'd have to be text (something like [[{{FORMFIELDHERELOL}}]]) with some kind of delimiter scheme to differentiate the text from everything else. You could then go spelunking in the XML looking for text that fits the delimiter pattern and switch out those glyphs for your binding text. Of course, the issue with THAT is that if you aren't putting X chars in place of X chars you might find you have to do some repositioning. As it's all glyphs on canvas this might be slightly harder than, say, threading a needle with a shoelace.
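A minimal sketch of the placeholder swap, in Python for brevity. It assumes the page markup has already been read out of the package as a string and that each placeholder survives intact inside a single text attribute; the function and field names are hypothetical. It does nothing about glyph positioning or any per-glyph index data stored alongside the text, so the repositioning caveat above still applies:

```python
import re
from xml.sax.saxutils import escape

# Matches placeholders of the form [[{{NAME}}]].
PLACEHOLDER = re.compile(r"\[\[\{\{(\w+)\}\}\]\]")


def fill_placeholders(page_xml: str, values: dict) -> str:
    """Replace each [[{{NAME}}]] with its value, escaped for an XML attribute.

    Placeholders with no matching value are left untouched so they
    are easy to spot in the output.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)
        return escape(str(values[name]), {'"': "&quot;"})

    return PLACEHOLDER.sub(substitute, page_xml)
```

For example, `fill_placeholders(xml, {"NAME": "Tom"})` would turn `Dear [[{{NAME}}]],` into `Dear Tom,` wherever it appears in the markup.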