Prepare doc and docx Files for Lucene indexing

Prepare doc and docx Files for Lucene indexing - c#

I wanted to ask if there is a quick way of getting content of a document into a single document field. All the examples i have seen have relatively short strings. I cannot save an entire journal article into a string and indexthat is there a quick way of telling lucene to index all the words in a file? I am using Lucene.net 3.03 for this application.

There is not an easy way to pass just the file, you have to provide the entire content to lucene to made the indexing for the search. Here is a answer from the Q/A about indexing PDF, but is the same from every type of document, just open it and index to lucene.

You can just pass a System.IO.TextReader to a Field. If the file is plain text, or something like it, you should just be able to open the Reader on it, and pass it directly into the Field, like:
System.IO.TextReader reader = new StreamReader("path/to/my/file.txt");
Field field = new Field("fieldName", reader);
document.add(field);

Related

Copy content from a Word document to another with the style

I want to copy the content of a section in a Word document to a new document.
I do this to copy :
var docPath = #"C:\temp\myDoc.docx";
var doc = word.Documents.Open(FileName: docPath, ReadOnly: true);
var emptyDoc = word.Documents.Add();
doc.Sections.First.Range.Copy();
emptyDoc.Sections.First.Range.Paste();
This works well to copy content, but the style is not the same. How can I copy the complete section and have it rendered exactly the same way in the new document ?
If there is a better solution involving the OpenXML SDK instead of VSTO, I can take it.

You will find it much easier to automate Word if you do things manually first. That way you can get a better understanding of the various options available etc. You can also record a macro which will often, though not always, provide the answer.
In this instance you need to automate selecting 'Keep Source Formatting' from the context toolbar that appears after pasting. The code you need for that is:
emptyDoc.Sections.First.Range.PasteAndFormat wdFormatOriginalFormatting

generating rtf from template in c# winforms app

Is there a simple process someone can recommend for generating an rtf document from a pre-built "template" and populate fields.
I would prefer to avoid ms word automation type solutions as i cannot guarantee ms office versions etc.
Resulting file needs to be editable so I cant go pdf
is it as simple as using something like nvelocity, or do i need to do something fancier?
thanks

You can always use keywords : Example:
This is a [text] that needs to be [replaced]
and the use a stringbuilder to replace the keywords.
StringBuilder sb = new StringBuilder(yourTemplate);
sb.Replace("[text]", "car");
sb.Replace("[replaced]", "washed");
So if you get me the final text will be :
This is a car that needs to be wahsed.
This is just one way of doing it.

How to work with an Xml file without loading the whole document in memory?

How to add a new node, update an existing node and remove an existing node of an xml document without loading the whole document in memory?
I'm having an xml document and treating it as the memory of my application so would need to be able to do hundreds of reads and writes quickly without loading the whole document.
its structure is like this:
<spiderMemory>
<profileSite profileId="" siteId="">
<links>
<link>
<originalUrl></originalUrl>
<isCrawled></isCrawled>
<isBroken></isBroken>
<isHtmlPage></isHtmlPage>
<firstAppearedLevel></firstAppearedLevel>
</link>
</links>
</profileSite>
</spiderMemory>
How would that be possible with XDocument?
Thanks

If you want to do hundreds of reads and writes quickly... you might be using the wrong technology. Have you tried using a plain old RDBMS?
If you still need the XML representation, then you can create an export methods to produce it from the database.
XML isn't really a good substitute for this kind of problem. Just saying.
Also... what is wrong with having the whole thing in memory? How big can it possibly get? Say 1GB? Suck it up. Say 1TB? Oops. But then XML is wrong, wrong, wrong anyway in that case ;) way too verbose!

You can use XmlReader, something like this :
FileStream stream = new FileStream("test.xml", FileMode.Open);
XmlReader reader = new XmlTextReader(stream);
while(reader.Read())
{
Console.WriteLine(reader.Value);
}
here is an more elaborate example http://msdn.microsoft.com/en-us/library/cc189056%28v=vs.95%29.aspx

As Daren Thomas said, the proper solution is to use RDBMS instead of XML for your needs. I have a partial solution using XML and Java. Stax parser does not parse the whole document in memory and is a lot faster than DOM (still XML parsing will always be slow). A 'pull parser' (eg Stax) allows u to control what gets parsed. A less cleaner way is to throw an exception in SAX parser when you get the element(s) needed.
To modify, the simplest (but slow) way is to use XPath. Another (untested) option is to treat XML file as text and then 'Search and replace' stuff. Here you can use all kinds of text search optimization.

How to replace text in a PDF with C#?

I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
is it possible to have C# code that replace these templates with a text of my choice?
No fields, no other complex stuff.
Is there any Open source library helping me achieve that?

This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in stream, but you must take into account that text may be split into many parentheses (convert stream to string to see what the format is).
Then, with code similar to this sample traverse through source pdf page by page and modify current page by searching for PdfContent items inside PdfReference items and replacing text in content's stream.

The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
If you will be able to replace text, then you should be aware that there will be no automatic reflow of the text following the replaced text. Given that you are fine with that, then there are very few solutions that allows you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component

For simple text replace use iTextSharp library.
The code that replace one string with another is below.
Note that this will replace only simple text and may not work in all cases.
//using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}

As stated in similar thread this is not really possible an easy way. The easier way it seems to be getting a DocX file and using DocX library which allow easy word swapping and then converting your DocX to PDF (using PDF Creator printer or so).
Or use pdf sharp/migradoc to create new documents.

Updating in PDF is hard and dirty. So may be adding a content on top of existing will work for you as well, as it worked for me. If so, here's my primitive, but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText

Regex or XML Parser C#

I have some word templates(dot/dotx) files that contain xml tags along with plain text.
At run time, I need to replace the xml tags with their respective mail merge fields.
So, need to parse the document for these xml tags and replace them with merge fields.
I was using Regex to find and replace these xml tags. But I was suggested to use XML parser to parse for XML tags ([Regex for string enclosed in <*>, C#).
The sample document looks like:
Solicitor Letter
<Tfirm/>
<Tbuilding/>
<TstreetNumber/> <TstreetName/>
For the attention of: <TContact1/> <TEmail/>
Dear <TContact1/>
RE: <Pbuilding/> <PstreetNumber/> <PstreetName/> <Pvillage/> <PTown/>
We were pleased to hear that contracts have now been exchanged in the sale of the
above property on behalf of our mutual client/s. We now have pleasure in enclosing a
copy of our invoice for your kind attention upon completion.
....
One more note, the angle brackets are typed manually by end user in the template.
I tried using XMLReader, but got error as my documents have no root tags on their own.
Please guide if I should stick to Regex or is there any way to use XML Parser.
Thank you!

Unless you can get it structured as an XML document, the tools in the .NET Libraries to read XML are going to be entirely useless.
What you have is not XML. Having a tag or two that would qualify as XML does not an XML document make. The problem is that it simply does not follow any of the rules of XML.
Moral of the story is that you will have to come up with your own method to parse this. If you like to drink the RegEx kool-aid, that'll be the best solution for ya. Of course, there are plenty of ways to skin this cat.

It looks like you aren't actually using XML, just using a token that looks similar to XML as a placeholder for replacement.
If that's the case, you should be using Regex.

I would suggest neither. Microsoft has a free library in C# specifically for modifying open xml format documents without an installation of Microsoft Office.
OpenXML SDK

Doesn't seem like XML processing to me. It's not an XML doc. It's looks like straight string-replacement, and for that, you're better off with a Regular Expression.

An XML parser doesn't help you locate XML; it only helps you understand a given piece of XML. You will need some other mechanism, perhaps a Regex, to find the XML.

Seems that authors of most replies didnt read the question carefully.
inutan is asking for something that will parse Word documents. If a Word document is saved in docx format, it will be actually XML file that can be read by XML Reader or XPathReader, however I will not recomend to do it
Normally, mail merge with Word doesnt require any programming and XML parsing, see http://helpdesk.ua.edu/training/word/merg07.html
However if you still want to have XML-like fields in your Word templates and replace them with values, I would suggest using Word automation objects.
Below is an example of VBA code, for a similar code on other languages please refer MS Office development site http://msdn.microsoft.com/en-us/library/bb726434.aspx . For example if you use .NET - you should use Office interops and best of all is to install MS Visual Studio Tools for Office development http://msdn.microsoft.com/en-us/library/5s12ew2x.aspx
With Selection.Find
.Text = "<TContact1/>"
.Replacement.Text = "TContact1"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Prepare doc and docx Files for Lucene indexing - c#

There is not an easy way to pass just the file, you have to provide the entire content to lucene to made the indexing for the search. Here is a answer from the Q/A about indexing PDF, but is the same from every type of document, just open it and index to lucene.

Related

Copy content from a Word document to another with the style

generating rtf from template in c# winforms app

How to work with an Xml file without loading the whole document in memory?

How to replace text in a PDF with C#?

Regex or XML Parser C#

Categories

Resources