I'm developing a C# application that searches for a particular word in a list of PDF files.
The result should include:
1) The PDF files where the word was found;
2) The pages of those PDFs where the word was found;
3) An excerpt of the text where the word was found, with that word highlighted.
In my research I have found the solutions described below:
- Insert the PDF file into SQL Server as varbinary and use SQL Server's full-text search;
- Use SQL Server's FileTables feature and SQL Server's full-text search;
- Upload the PDF file to a physical folder and use the iTextSharp library to accomplish the tasks.
Could someone with experience in this advise me on how to do it?
Thanks in advance!
You'll have to find some way to either read the PDF in real time or store its text in the database before the search request. SQL full-text search can't read PDF content directly out of the box, because the textual content of a PDF is often compressed or encoded as binary data.
For your requirement, if you go with the database approach, you'd have to build a table that has one row for each distinct PDF page you want to reference.
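To make the per-page idea concrete, here is a minimal sketch, assuming the text of each page has already been extracted (for example with a PDF library) and loaded from that table. The names `PageText`, `PageHit` and `PageSearch` are hypothetical, and the `<mark>` tags are just one possible way to highlight the word.

```csharp
using System;
using System.Collections.Generic;

// One row of the hypothetical per-page table: file, page number, extracted text.
public sealed class PageText
{
    public string FileName { get; }
    public int Page { get; }
    public string Text { get; }

    public PageText(string fileName, int page, string text)
    {
        FileName = fileName;
        Page = page;
        Text = text;
    }
}

// One search result: where the word was found plus a highlighted excerpt.
public sealed class PageHit
{
    public string FileName { get; }
    public int Page { get; }
    public string Excerpt { get; }

    public PageHit(string fileName, int page, string excerpt)
    {
        FileName = fileName;
        Page = page;
        Excerpt = excerpt;
    }
}

public static class PageSearch
{
    // Returns one hit per page containing the word (first occurrence only),
    // with up to `context` characters of text on each side of the match.
    public static List<PageHit> Search(IEnumerable<PageText> pages, string word, int context = 40)
    {
        var hits = new List<PageHit>();
        if (string.IsNullOrEmpty(word)) return hits;

        foreach (var p in pages)
        {
            // Case-insensitive substring match, not a whole-word match.
            int idx = p.Text.IndexOf(word, StringComparison.OrdinalIgnoreCase);
            if (idx < 0) continue;

            int start = Math.Max(0, idx - context);
            int afterWord = idx + word.Length;
            int end = Math.Min(p.Text.Length, afterWord + context);

            string excerpt = p.Text.Substring(start, idx - start)
                + "<mark>" + p.Text.Substring(idx, word.Length) + "</mark>"
                + p.Text.Substring(afterWord, end - afterWord);

            hits.Add(new PageHit(p.FileName, p.Page, excerpt));
        }
        return hits;
    }
}
```

The file name and page number of each hit cover points 1 and 2 of the question, and the excerpt covers point 3. If the excerpt is shown in a web page, the surrounding text should be HTML-encoded first.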
Related
I have a SQL Server 2008 database from an application which stores office file templates.
The files in the database are stored in hex format (0x504B030414000600...).
With a file signature table (https://www.garykessler.net/library/file_sigs.html), I can find out which file format it is: Microsoft Office Open XML Format Documents (OOXML, like DOCX, PPTX, XLSX ...).
How can I export/convert these hex strings into the original files?
Maybe with C# ...
With the application itself, I can only export one file at a time. It would take days to do this with all the files (about 1000).
Thank you
Export the column with the files from SQL Server to a single file (it may be very large, but that shouldn't matter). For example, you can right-click the results and export them to a CSV file.
Then write a simple C# console application that:
- uses a CSV parser to loop over the data;
- inside the loop, uses simple if statements to determine the document's file format;
- in every iteration, converts the hex string to bytes and saves them as a file somewhere on your drive.
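The conversion step could look roughly like the sketch below. It assumes each CSV field holds a string such as `0x504B0304...`; the class and method names (`HexExport`, `HexToBytes`, `GuessExtension`, `Save`) are made up for illustration. Because DOCX, XLSX and PPTX are all ZIP archives with the same `PK` signature, the sketch looks at the main part inside the archive to tell them apart.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class HexExport
{
    // Converts a string such as "0x504B0304..." into the bytes it encodes.
    public static byte[] HexToBytes(string hex)
    {
        if (hex.StartsWith("0x", StringComparison.OrdinalIgnoreCase))
            hex = hex.Substring(2);
        if (hex.Length % 2 != 0)
            throw new FormatException("Hex string must have an even number of digits.");

        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return bytes;
    }

    // Guesses a file extension from the content. Anything that does not start
    // with the ZIP signature PK\x03\x04 is reported as ".bin".
    public static string GuessExtension(byte[] data)
    {
        bool isZip = data.Length >= 4
            && data[0] == 0x50 && data[1] == 0x4B
            && data[2] == 0x03 && data[3] == 0x04;
        if (!isZip) return ".bin";

        using (var stream = new MemoryStream(data))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            var names = zip.Entries.Select(e => e.FullName).ToList();
            if (names.Contains("word/document.xml")) return ".docx";
            if (names.Contains("xl/workbook.xml")) return ".xlsx";
            if (names.Contains("ppt/presentation.xml")) return ".pptx";
            return ".zip";
        }
    }

    // Decodes one hex value and writes it to folder\baseName plus the guessed extension.
    public static string Save(string hex, string folder, string baseName)
    {
        byte[] data = HexToBytes(hex);
        string path = Path.Combine(folder, baseName + GuessExtension(data));
        File.WriteAllBytes(path, data);
        return path;
    }
}
```

Calling `HexExport.Save(hexValue, outputFolder, rowId)` once per CSV row would write all files in one run. Macro-enabled and template variants (for example .docm or .dotx) contain the same main parts, so telling those apart would need a look at `[Content_Types].xml`, which this sketch does not do.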
I am looking for any help. I am currently uploading files, each either a PDF or a DOCX, to a file server and storing the address in a SQL database. I am now trying to let the user search the whole database and also the text of the documents that have been uploaded. I am struggling to find any solution using ASP.NET Core.
Does anyone have any ideas?
Depending on how many documents you have, this is probably going to be slow as hell.
You will want to look into building a full-text index with something like Lucene, or importing the contents of the files into a SQL full-text index.
There isn't a ready-made way to do this, as it gets complicated depending on your specific requirements.
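To show why an index helps, here is a toy inverted index in plain C#. It is only a sketch of the idea behind Lucene or a SQL full-text index, not a replacement for either, and `TinyIndex` is a hypothetical name. It assumes the text has already been extracted from each PDF or DOCX.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class TinyIndex
{
    // Maps each word (case-insensitively) to the set of documents containing it.
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Called once per document at upload time, with the already extracted text.
    public void Add(string documentPath, string text)
    {
        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            HashSet<string> docs;
            if (!postings.TryGetValue(m.Value, out docs))
            {
                docs = new HashSet<string>();
                postings[m.Value] = docs;
            }
            docs.Add(documentPath);
        }
    }

    // A search is a single dictionary lookup instead of a scan over every file.
    public IEnumerable<string> Search(string word)
    {
        HashSet<string> docs;
        if (postings.TryGetValue(word, out docs)) return docs;
        return Enumerable.Empty<string>();
    }
}
```

The cost of reading the documents is paid once when they are added, which is the reason a search-time scan over all files tends to be slow by comparison.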
I have some HTML stored in a SQL database that contains links to images (img src tags).
I need to find a way of replacing each link with the actual file embedded in base64 format and copying the result to a different table. I can't think how best to achieve this.
The reason is that I need to migrate some HTML knowledge documents from one system to another, and the other system requires the images to be embedded.
I am using MS SQL Server and Visual Studio (C#).
Thanks
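One possible approach, sketched under assumptions: read the HTML from the source table, rewrite every img src into a base64 data URI, and write the result to the target table. The sketch below covers only the rewriting step. `ImageEmbedder` and the `loadImage` callback are hypothetical; the callback stands for however the image bytes are fetched (from disk, over HTTP, and so on). A regular expression is used for brevity, and an HTML parser would be more robust on messy markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageEmbedder
{
    // Captures: 1 = everything up to the quote, 2 = the quote character, 3 = the src value.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\bsrc\s*=\s*)([""'])(.*?)\2",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces each img src with a data URI built from the bytes returned by loadImage.
    // Links that are already embedded, or that loadImage cannot resolve (null), are left as-is.
    public static string EmbedImages(string html, Func<string, byte[]> loadImage)
    {
        return ImgSrc.Replace(html, m =>
        {
            string src = m.Groups[3].Value;
            if (src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                return m.Value;

            byte[] bytes = loadImage(src);
            if (bytes == null) return m.Value;

            string quote = m.Groups[2].Value;
            string dataUri = "data:" + MimeFromName(src) + ";base64," + Convert.ToBase64String(bytes);
            return m.Groups[1].Value + quote + dataUri + quote;
        });
    }

    // Guesses the MIME type from the file extension; falls back to JPEG.
    private static string MimeFromName(string src)
    {
        string path = src.Split('?', '#')[0];
        int dot = path.LastIndexOf('.');
        string ext = dot >= 0 ? path.Substring(dot).ToLowerInvariant() : "";
        switch (ext)
        {
            case ".png": return "image/png";
            case ".gif": return "image/gif";
            case ".svg": return "image/svg+xml";
            default: return "image/jpeg";
        }
    }
}
```

For images stored on disk, the callback could be something like `src => File.ReadAllBytes(Path.Combine(imageRoot, src))`, where `imageRoot` is an assumed base folder.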
Hey, I'm working on a project with a web form that includes some edit fields. I don't want to enter the data into those fields manually; instead, I want to extract data from a Word document and fill the fields with it. The catch is: which Word document do I fill the fields from?
Suppose we have a bunch of lectures uploaded on some page. What should I do to retrieve the data from a particular document?
Is it necessary to open the Word file first?
Or should I download the file first?
If I go with option 1, do I have to use some library that opens the file within the browser, retrieves the data, and then shows a pop-up message such as "the data has been retrieved, now you can close the file", so that I can then fill the form with that data?
Or should I go with option 2, where the user hits the download button and the file is stored on the local machine? How can I keep track of which Word file was downloaded or stored on the local machine? And is it necessary to open that file again to retrieve the data?
This is how I think I could implement that module, so I need your suggestions. Is this the right way to achieve the goal, or should I follow another path? Which libraries are required for this task, and is there any tutorial for a similar problem?
Thanks in advance
I would suggest considering a third option: since the Word document files exist on the server, the cleanest way to pre-populate the form is to extract the data from the document while it is on the server and fill in the form's fields in the code-behind before sending the page down to the user. Trying to extract data on the client side from a recently downloaded file, via an application other than the browser, seems ripe for kludginess. Articles such as http://support.microsoft.com/kb/257757 should help get you started in the right direction.
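To extract data from an MS Word document with a free .NET Word component and fill it into the web form, first extract the text:
// Document comes from the third-party Word component, which is not named here;
// these calls look like the Spire.Doc API, but that is an assumption.
Document doc = new Document();
doc.LoadFromFile("YourDocOrDocx.docx");
string content = doc.GetText(); // plain text of the whole document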
I stored many Word files in my SQL database using FileStream, and now I want to search all of them and return those which contain a given string.
The first solution I found is to open each file, read its content, and search it for the given string (using FileStream).
The second solution is not to use FileStream and to store the content of the Word files in the database instead, but that would need a lot of disk space!
Can anyone help me with this?
*UPDATE1: I am creating a document management system in WPF. It will work on a LAN and consists of two applications. The first one will be installed on the server, and users will add or delete files with it. The second one will be installed on the clients, and users will use it to search the content of the files.
*UPDATE2: While you were answering my question, I found a new SQL Server 2012 feature named FileTable. Can it help me? I think I could use it together with a third-party solution. Do you agree?
In the end I used the new SQL Server 2012 feature named FileTable, and because it supports only .doc files, I installed Microsoft IFilter 2.0 to support .docx files. I also created a full-text index on my FileTable, and it works great.
I suggest going with the first solution: you can allocate memory to read one file at a time and, once that file is done, release the memory and allocate it for the next file. This way you can return the string you need without requiring a lot of disk space.