Find a string in 8000 MS Word files

I have stored many Word files in my SQL Server database using FILESTREAM, and now I want to search all of them and return those which contain a given string.
The first solution I found is to open each file, read its content, and search for the given string (using FILESTREAM).
The second solution is to not use FILESTREAM and instead store the text content of each Word file in the database, but that would need a lot of disk space.
Can anyone help me with this?
UPDATE 1: I am creating a document management system in WPF. It will work on a LAN and consists of two applications. The first one will be installed on the server, and users will add or delete files with it. The second one will be installed on the clients, and users will use it to search the content of the files.
UPDATE 2: While you were answering my question, I found a new feature of SQL Server 2012 named FileTable. Can this help me? I think I can use it together with a third-party solution to do this. Do you agree?

Finally I used the new SQL Server 2012 feature named FileTable, and because it supported only .doc files, I installed Microsoft IFilter 2.0 to support .docx files. I also created a full-text index on my FileTable, and it works great.

I suggest going with the first solution: you can allocate memory to read one file at a time, release it when you are done, and then move on to the next file. This way you can return the strings you need without requiring a lot of disk space.
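
As a rough illustration of the one-file-at-a-time approach, here is a minimal sketch that checks a single .docx file, passed in as a byte array, for a string. It relies on the fact that a .docx file is a ZIP package whose main text lives in word/document.xml. The class and method names (DocxSearch, ContainsText) are made up for this example, tag stripping is a crude form of text extraction (text from adjacent paragraphs is joined without a space), and legacy binary .doc files are not handled at all.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DocxSearch
{
    // Returns true if the main body of a .docx file contains `needle`
    // (case-insensitive). Only word/document.xml is inspected.
    public static bool ContainsText(byte[] docxBytes, string needle)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                return false; // not a .docx package
            }

            using (var reader = new StreamReader(entry.Open()))
            {
                string xml = reader.ReadToEnd();
                // Crude extraction: drop every XML tag, then decode entities.
                string text = WebUtility.HtmlDecode(Regex.Replace(xml, "<[^>]+>", ""));
                return text.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;
            }
        }
    }
}
```

Calling DocxSearch.ContainsText(bytes, "invoice") once per stored file keeps only one document in memory at a time. For thousands of files a full-text index will usually be faster than scanning on every query.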

Related

C#: the best way to edit strings in an MS Word file from a database?

I will be building a desktop application that interacts with a database, so I will need to build an API to contact the database remotely and retrieve data from it.
I was given a Word file in which I will be updating values; the parts in black are the values I'm getting from the database. I will sometimes have to print the file.
However, I'm not sure what the best way to do this is. Do I need to modify the Word file and reset it to the default values each time? Should I use reports instead, or something else?

I think there is no single best practice here.
You may use DocX from NuGet.
You may also access the document directly using the Microsoft.Office.Interop.Word namespace
(afaik).

Reading an Excel file in Visual Studio (C#), matching it against a database, and writing the matched results into a Word document

I am working on a word-matching project. After MATLAB processing I got an Excel file:
1 2
2 5
3 10
4 3
5 7
. .
. .
The first column is the word number, and the second column is the number of the reference word in the database.
Let the database include:
1 AMNBBHHH
2 I
3 PRIYANGA
5 AM
. .
. .
10 SAHAN
. .
. .
so the result written to the Word document should be as follows:
I AM SAHAN PRIYANGA . . .
So I would like suggestions on:
creating a database that includes the words and their reference numbers
reading the Excel file in Visual Studio and matching it against the database
writing the matched words back into a Word document
Notes:
I am using Visual Studio 2013 and Microsoft Office 2010, and I am going to use
a Windows Forms application with a button (to load
the file into the application).
There are around 200 unique words in the database.

Your question is a little incomplete, but I will try to answer it with my best guess:
Choice of Database
You haven't mentioned your choice of database. Do you have a database
server hosted in a central location or it is part of your windows
form application itself, something like Sqlite?
You haven't mentioned the number of unique words you expect to
store as key-value pairs. If there are going to be millions of
words, then you would need some kind of partitioning if it is an RDBMS
table, or you may consider something like a Redis or Memcached based data
store for fast retrieval.
Reading Excel File
Since you have a well-formatted Excel file, I would recommend reading it using the (ADO.NET) Microsoft OLE DB or ODBC data provider. This will help you stay away from Excel library wrappers in your Windows Forms application and will avoid some memory leaks.
In case you don't want to use the ADO.NET data providers, there are open-source libraries for reading Excel files. I would prefer that.
The second part of this problem is to read the corresponding strings and store them in some C# object. The solution may vary with your choice of database. If you prefer a Redis or Memcached based data store, then I would probably read each row from Excel, take the index of the word, fetch the word stored in the data store under that index, and keep forming the sentence. In case it's an RDBMS, we may have to look at a more complex solution in order to store and access the data appropriately.
Writing into Word Doc
Prefer the Open XML format for writing back to the Word document; this will help you stay away from the Office libraries in your Windows Forms application.
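
To make the lookup step concrete, here is a minimal sketch that assumes both the Excel rows and the word table have already been loaded into memory (reading the Excel file and writing the Word document are left out). With only a few hundred words, an in-memory dictionary is enough. The names SentenceBuilder and Build are invented for this example, and skipping reference numbers that have no word is an assumption, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class SentenceBuilder
{
    // positionToReference: Excel rows (word number -> reference number).
    // wordsByReference: database rows (reference number -> word).
    public static string Build(
        IDictionary<int, int> positionToReference,
        IDictionary<int, string> wordsByReference)
    {
        var parts = new List<string>();

        // Walk the rows in word-number order so the sentence keeps its order.
        foreach (KeyValuePair<int, int> row in positionToReference.OrderBy(r => r.Key))
        {
            string word;
            if (wordsByReference.TryGetValue(row.Value, out word))
            {
                parts.Add(word);
            }
            // Assumption: a reference number with no matching word is skipped.
        }

        return string.Join(" ", parts);
    }
}
```

With the sample data from the question (rows 1→2, 2→5, 3→10, 4→3 and words 2 = I, 5 = AM, 10 = SAHAN, 3 = PRIYANGA), Build returns "I AM SAHAN PRIYANGA".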

What is the standard way for dealing with PowerPoint (.PPTX) files on the server?

I've been tasked with a feature that can generate PowerPoint files on the server using C#. I'd basically start with a template, and programmatically replace some text with live data from the database. I've been doing some research into this area for the past day and here's what I've found:
PowerPoint has this sort of thing built in, meaning it can connect to external data sources and pull in data. Most examples of this, I've found, either use PowerPoint automation done on the server (I've been advised against this) or assume a SQL Server backend. Our company uses Oracle for our RDBMS needs. Oracle has a solution for this called Oracle BI, but it requires a whole new web server setup to run various Java EE components and whatnot. I didn't look at the price, but knowing Oracle it's not cheap. It also requires new software to be installed on the end user's machine, which we really want to avoid.
Generating PowerPoint files on the fly is possible. The company that is basically the go-to guys for this problem (every help forum points to them, and they get all the rave reviews) is Aspose. They have .NET components for dealing with just about any Office format you can think of. The problem is, they are astronomically expensive. Just the PowerPoint component (a site license for up to 10 developers) would cost $3,995.
The third possibility is building a solution in-house. After all, a PPTX file is just XML, right? Well, looking closer, a PPTX file appears to be a ZIP archive. It contains many folders, each containing many XML files. Modifying a PPTX file would, correct me if I'm wrong, entail unzipping the file to a temporary directory, reading the XML files and modifying their contents, then packaging everything up again and writing the file out to the response stream. Perhaps there are libraries that can work with ZIP streams on the fly without extracting everything.
My Question: Are there easier ways to work with a PPTX file using .NET that don't require working with compressed XML files or buying very expensive software? Basically, we need to modify a PowerPoint file, change some text, and allow the user to download that generated file from a web server.

The Open XML SDK is Microsoft's .NET library that lets you manipulate Office documents. It lets you open a PPTX file and provides an object model that wraps the XML contents.
Here's the link to the OpenXML SDK and the MSDN documentation.
I've used OpenXML to let an ASP.NET page dynamically generate Word documents from a database.

Don't use Office Interop on a web server. It's an all-around bad idea.
If you are only replacing text placeholders in files that will not otherwise change, the home-grown solution that finds the placeholders in the XML files inside the ZIP archive should be doable. .NET has had zip support for some time, and it is greatly improved if you are able to use .NET 4.5, so you shouldn't need to extract the archive to a temporary location at all.
PowerPoint should also support connecting directly to Oracle in the same way it supports connecting to SQL Server (just play around with the connection options), without needing the special Oracle BI stuff. However, I'd still prefer the home-grown solution, as the direct connection will only work while the PowerPoint file is able to reach your database, which is typically only possible in your local LAN environment or with an active VPN.
If you want anything fancier than a simple text replacement, perhaps look for an Aspose competitor.
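
The home-grown placeholder replacement could look roughly like the sketch below, which uses the ZipArchive class added in .NET 4.5 and works entirely in memory. The names PptxTemplate and ReplaceText and the "{{name}}" placeholder style are assumptions made for this example. It only looks at ppt/slides/slide*.xml, and it will miss a placeholder that PowerPoint has split across several text runs, so treat it as a starting point rather than a finished implementation.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security;
using System.Text;

// Hypothetical helper, not part of any library.
public static class PptxTemplate
{
    // Returns a copy of the PPTX package in which `placeholder` is replaced
    // by `value` in every slide part. Nothing is extracted to disk.
    public static byte[] ReplaceText(byte[] templateBytes, string placeholder, string value)
    {
        // The inserted value must be valid XML text, so escape &, <, > and quotes.
        string escaped = SecurityElement.Escape(value);

        using (var stream = new MemoryStream())
        {
            stream.Write(templateBytes, 0, templateBytes.Length);

            // leaveOpen: true keeps the MemoryStream usable after the archive is disposed.
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Update, true))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                {
                    bool isSlide =
                        entry.FullName.StartsWith("ppt/slides/slide", StringComparison.Ordinal) &&
                        entry.FullName.EndsWith(".xml", StringComparison.Ordinal);
                    if (!isSlide) continue;

                    using (Stream s = entry.Open())
                    {
                        string xml = new StreamReader(s, Encoding.UTF8).ReadToEnd();
                        if (!xml.Contains(placeholder)) continue;

                        // Overwrite the part with the patched XML (UTF-8, no BOM).
                        byte[] data = new UTF8Encoding(false).GetBytes(xml.Replace(placeholder, escaped));
                        s.Position = 0;
                        s.SetLength(0);
                        s.Write(data, 0, data.Length);
                    }
                }
            }

            // Disposing the archive has written the updated package into the stream.
            return stream.ToArray();
        }
    }
}
```

The returned byte array can be written straight to the HTTP response. Chaining calls, one per placeholder, is simple but re-packs the archive each time; a version that takes a dictionary of replacements would avoid that.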

Opening byte[] as a file without actually saving it as a file first

What is the best way to open a Word file that was stored as a byte[] in a database?
I have to store some documents in an Access database - Word files, 2003 and up - on an application that is strictly run off of a CD. Unfortunately they have to be in the database and can't be stored loose in folders. I'm storing them as an OLE object, and I can read and write them just fine as a byte[].
However, I don't know the best way of getting these documents back open in Word. Right now I'm using a FileStream to recreate the file somewhere and then firing off System.Diagnostics.Process.Start(filename) to get it to open. This is going to be used on government computers, which can have some funky security rules sometimes, so I don't know if this is the best way.
Is it possible to open a file previously stored as a byte[] without using any intermediary file saved to the hard drive? I know they'll at least have Word 2003, so I'm open to using the Word interop.
Thanks for any input!

I doubt you're going to be able to feed Word a file in memory without saving it to at least a RAMDisk or something wild like that.
Why not use the system temp folder or the Path.GetTempFileName() method to write the byte array to a file just before opening it in Word, and then clean up the temp files when you're done?
string fullPathToATempFile = System.IO.Path.GetTempFileName();
// or
string tempDirPath = System.IO.Path.GetTempPath();
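
Putting the temp-folder idea together, a minimal sketch might look like this. Path.GetTempFileName() typically produces a file with a .tmp extension, which Word is not associated with, so this version builds its own unique name inside Path.GetTempPath() and keeps the document's real extension. The names TempDocument and Write are made up for the example.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
public static class TempDocument
{
    // Writes the bytes to a uniquely named file in the user's temp folder
    // and returns the full path. `extension` includes the dot, e.g. ".doc".
    public static string Write(byte[] content, string extension)
    {
        string fileName = Guid.NewGuid().ToString("N") + extension;
        string path = Path.Combine(Path.GetTempPath(), fileName);
        File.WriteAllBytes(path, content);
        return path;
    }
}
```

The returned path can then be passed to System.Diagnostics.Process.Start(path) as in the question, and removed with File.Delete(path) once Word has closed the document.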

Best way to store multiple revisions of a text file to a single file

I'm working on a C# application that needs to store all the successive revisions of a given report file to a single project file: each time the (plain text) report file changes, the contents of the new version shall be appended to the project file, along with some metadata. Other requirements:
each version of the report file is 100 kB to 1 MB. Theoretically, the maximum number of revisions is unlimited, but it should be less than 1000 in practice.
to keep things simple, I'd like to avoid computing differences between the revisions of the report - just store the whole report to the project file every time it has changed.
the project file should be compressed - it doesn't need to be a text file
it should be easy to retrieve a given version of the report from the application
How can I implement this in an efficient way? Should I create a custom binary file, consider using a database, other ideas?
Many thanks, Guy.

What's wrong with the simple workflow?
Un-gzip file
Append header and new report
Gzip project file
Gzip is a standard format, so it's easily accessible. Subsequent reports probably won't change that much, so you'll have a great compression ratio. To find a given report, just open the file and scan the headers. (If scanning is too slow, also mirror the metadata in an SQLite database, and make sure to include offsets into the project file so you can seek to the right place quickly.)
If your requirements are flexible (e.g. that "shall append" part) and you just want something to keep track of past versions of the file, a revision control system will do all of what you need quite easily.

No need to implement that. I would suggest using source control. Personally I use Subversion with the TortoiseSVN client. There is also a plug-in that integrates Subversion with Visual Studio, VisualSVN. Have a look at them.

If using SVN is not an option, you can just store each revision in an individual file (with a filename that represents the date, for example). You can use separate files for the metadata as well. Then all of these files are zipped into one file (look at http://DotNetZip.codeplex.com/ for example).
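
Here is a minimal sketch of the one-entry-per-revision idea. It swaps in the ZipFile and ZipArchive classes that ship with .NET 4.5 and later instead of DotNetZip, purely to keep the example dependency-free. The names RevisionArchive, AddRevision and ReadRevision are invented for this example, and how entries are named (a timestamp, a counter) is left to the caller.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of any library.
public static class RevisionArchive
{
    // Adds one revision as a new compressed entry; creates the project file if needed.
    public static void AddRevision(string projectPath, string entryName, string reportText)
    {
        using (ZipArchive archive = ZipFile.Open(projectPath, ZipArchiveMode.Update))
        {
            ZipArchiveEntry entry = archive.CreateEntry(entryName, CompressionLevel.Optimal);
            using (var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false)))
            {
                writer.Write(reportText);
            }
        }
    }

    // Returns the text of one revision, or null if no entry has that name.
    public static string ReadRevision(string projectPath, string entryName)
    {
        using (ZipArchive archive = ZipFile.OpenRead(projectPath))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null) return null;

            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

For example, AddRevision("project.zip", "2024-01-31T10-15-00.txt", text) stores a revision, and ReadRevision with the same name retrieves it. Two caveats: each entry is compressed on its own, so similarity between revisions is not exploited, and as far as I know update mode rewrites the archive when it is closed, which is worth measuring as the project file grows.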

I don't think there is much point building this yourself when there are already tens, if not hundreds, of systems that are basically designed to do exactly this - source control systems.
I'd recommend choosing some source control solution that has bindings to C# and store your document in there. Then you can easily check out any revision of the document. You will also be able to diff, branch, etc. if necessary.
To give just one example to get you started you can use Subversion with C# bindings.

You could use alternate data streams to store the old revisions of your file. There is no built-in support in the .NET framework, but there exist some helper classes and articles like here and here.
I have never used this myself, so I can't really tell if this is a good option. But it seems, it would make an elegant solution, since you could store each file version in a separate data stream and only the current version in the "main file". In any case, it will probably only work on NTFS drives.

I think the already mentioned SVN (or another source control system) is a very good idea, because source control seems to have exactly the features you require. But if that's not an option, you could use a file-based database like SQL Server Compact Edition or SQLite.
