Search an uploaded document asp.net core - c#

I'm looking for help. I currently upload files (either PDF or DOCX) to a file server and store each file's path in a SQL database. I now want to let users search the whole database as well as the text of the uploaded documents. I am struggling to find a solution for ASP.NET Core.
Does anyone have any ideas?

Depending on how many documents you have, searching the files directly is probably going to be very slow.
You will want to look into building a full-text index with something like Lucene, or importing the contents of the files into a SQL full-text index.
There isn't a pre-made way to do this, because the details depend on your specific requirements.
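As an illustration of the "import the contents" route, here is a minimal sketch of pulling plain text out of a .docx so it can be stored in a text column covered by a full-text index. It relies only on the fact that a .docx is a zip package whose main body lives in word/document.xml. The class and method names (DocxTextExtractor, ExtractText) are made up for this example, and PDFs would need a separate text-extraction library.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class DocxTextExtractor
{
    // WordprocessingML main namespace used inside word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    // Headers, footers and footnotes live in other parts and are ignored here;
    // text boxes nested inside a paragraph may show up twice.
    public static string ExtractText(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml is missing; not a .docx file.");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);
                // Concatenate the text runs (w:t) of each paragraph (w:p).
                var paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

One way to use this would be to call it at upload time and save the result next to the file path, so that a search only touches the database and never has to open the files.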

Related

Search Particular Word in PDF by using C# and SQL Server

I'm developing a C# application that searches for a particular word in a list of PDF files.
The result should return:
1) The PDF files where the word was found;
2) The page of each PDF where the word was found;
3) An excerpt of the text around the word, with the word highlighted.
I have found in my research some solutions described below:
- Insert the PDF file into SQL Server as varbinary and use SQL Server's full-text search;
- Use SQL Server's FileTables feature and SQL Server's full-text search;
- Upload the PDF file to a physical folder and use the iTextSharp library to accomplish the tasks.
Could someone who has experience with this tell me how I can do it?
Thanks in advance!
You'll have to find some way to either read the PDF in real time, or store its text in the database before the search request. You can't run SQL full-text search directly on the raw PDF content, because the text inside a PDF is sometimes encoded as binary data.
If you go with the database approach, your requirements mean you'd have to build a table that has a row for each distinct PDF page you want to reference.
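To make the one-row-per-page idea concrete, here is a rough sketch that takes rows of (file name, page number, page text) and returns the file, the page, and a highlighted excerpt for each page containing the word. It assumes the page text has already been extracted and stored. The names PageHit and PageSearch, the tuple layout, and the `<mark>` highlight markup are all illustrative choices, not something from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type: one hit per matching page.
public sealed class PageHit
{
    public string File { get; set; }
    public int Page { get; set; }
    public string Snippet { get; set; }
}

public static class PageSearch
{
    // pages: (file name, 1-based page number, extracted page text), e.g. rows read from the table.
    // context: number of characters kept on each side of the match.
    public static List<PageHit> Find(IEnumerable<Tuple<string, int, string>> pages,
                                     string word, int context = 30)
    {
        if (string.IsNullOrEmpty(word))
            throw new ArgumentException("Search word must not be empty.", "word");

        var hits = new List<PageHit>();
        foreach (var page in pages)
        {
            string text = page.Item3;
            // Only the first occurrence on each page is reported.
            int index = text.IndexOf(word, StringComparison.OrdinalIgnoreCase);
            if (index < 0) continue;

            int start = Math.Max(0, index - context);
            int end = Math.Min(text.Length, index + word.Length + context);
            string snippet =
                text.Substring(start, index - start)
                + "<mark>" + text.Substring(index, word.Length) + "</mark>"
                + text.Substring(index + word.Length, end - (index + word.Length));

            hits.Add(new PageHit { File = page.Item1, Page = page.Item2, Snippet = snippet });
        }
        return hits;
    }
}
```

With a real full-text index the matching itself would happen in SQL, and only the excerpt-building step would stay in C#. If the excerpt is rendered as HTML, the page text should be HTML-encoded before the highlight tags are added.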

Does Google Accept Only Sitemap With .txt Extention?

I have finished working on my ASP.NET 4.0 website. Since I plan to publish it within the next few days, I am looking for resources that can help my site rank better on popular search engines. My site displays both static and dynamic content. For the dynamic content I will be generating a dynamic sitemap each week. My problem is that I read on the Google Webmaster site that Google accepts sitemaps only with a .txt extension (https://support.google.com/webmasters/answer/183668?hl=en). The original instructions are quoted as:
* For best results, use the following guidelines for creating text file sitemaps:
You must fully specify all URLs in your sitemap as Google attempts to crawl them exactly as you list them.
Your text file must use UTF-8 encoding.
Your text file should contain nothing but the list of URLs.
You can name the text file anything you wish, provided it has a .txt extension (for instance, sitemap.txt).
As I mentioned, I will be using C# code to dynamically generate an XML sitemap for my site, but I am not sure I will be able to write XML (using C#) into .txt files. I have very little knowledge of writing XML with C# (for example, with the XmlWriter class). I have found this website, which uses a sitemap file with an .xml extension (http://www.mikesdotnetting.com/sitemap). Can anybody tell me what I need to do to complete this final step of my project? I would also like to know whether I should resubmit my sitemap to Google every time a link is modified. Google says a submitted sitemap should contain no more than 50,000 URLs and be smaller than 50 MB.
Submitting every dynamic URL isn't necessarily going to improve your ranking. A lot of your ranking will depend on your pagerank, not the number of URLs you have. You need good content, and people who link to your site that think your content is good.
You read the document wrong. The very first line says:
In addition to the standard XML format, Google also accepts the following file types as sitemaps:
The .txt file extension is only required when you have a text file that lists only the URLs. Notice how even their submission example has a .xml extension, since it's using the XML sitemap format.
Instead of generating a new sitemap file weekly, it would be much simpler to write a handler that generates it on request and caches the data for a period of time, so you're not regenerating it on every request for the sitemap file.
If Google knows where the sitemap is, it will check it periodically anyway, so resubmitting it might not get you anything. You also don't want to submit too often, as Google and other search engines may think you are trying to spam them. That's why the XML sitemap definition has an element for change frequency, so crawlers know how often to re-spider the page.
FYI, don't expect to see a good ranking right away. It takes a while, months even, depending on the popularity of your site and the quality of the content. The volume of links won't help, and Google will find them anyway when it spiders. It is possible to do too much.
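Since the question mentions being unsure how to write XML from C#, here is a small sketch of building a sitemap in the standard XML format with LINQ to XML. The SitemapBuilder class, its Build method, and the "weekly" default are assumptions made for this example. The handler and caching part is left out, because how to do that depends on the hosting setup.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of ASP.NET.
public static class SitemapBuilder
{
    // Namespace defined by the sitemaps.org protocol.
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // pages: (absolute URL, last modification date) for every page to list.
    public static XDocument Build(IEnumerable<Tuple<string, DateTime>> pages,
                                  string changeFreq = "weekly")
    {
        return new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement(Ns + "urlset",
                pages.Select(p => new XElement(Ns + "url",
                    new XElement(Ns + "loc", p.Item1),
                    new XElement(Ns + "lastmod",
                        p.Item2.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)),
                    new XElement(Ns + "changefreq", changeFreq)))));
    }
}
```

A handler could call Build with URLs read from the database, write the result to the response with the document's Save method and a content type of text/xml, and keep the document in a cache for a few hours.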

What is the standard way for dealing with PowerPoint (.PPTX) files on the server?

I've been tasked with a feature that can generate PowerPoint files on the server using C#. I'd basically start with a template, and programmatically replace some text with live data from the database. I've been doing some research into this area for the past day and here's what I've found:
PowerPoint has this sort of thing built in, meaning it can connect to external data sources and pull in data. Most examples of this that I've found either use PowerPoint automation on the server (I've been advised against this) or assume a SQL Server backend. Our company uses Oracle for our RDBMS needs. Oracle has a solution for this called Oracle BI, but it requires a whole new web server setup to run various Java EE components and whatnot. I didn't look at the price, but knowing Oracle it's not cheap. It also requires new software to be installed on the end user's machine, which we really want to avoid.
Generating PowerPoint files on the fly is possible. The company that is basically the go-to guys for this problem (every help forum points to them, and they get all the rave reviews) is Aspose. They have .NET components for dealing with just about any Office format you can think of. The problem is, they are astronomically expensive. Just the PowerPoint component (a site license for up to 10 developers) would cost $3,995.
The third possibility is building a solution in-house. After all, a PPTX file is just XML, right? Well, looking closer, a PPTX appears to be a zip archive. It contains many folders, each containing many XML files. Modifying a PPTX file would, correct me if I'm wrong, entail unzipping the file to a temporary directory, reading the XML file and modifying its contents, then packaging everything up again and writing the file out to the response stream. Perhaps there are libraries that can work with zip archives on the fly without extracting everything.
My Question: Are there easier ways to work with a PPTX file using .NET that don't require working with compressed XML files or buying very expensive software? Basically, we need to modify a PowerPoint file, change some text, and allow the user to download that generated file from a web server.
The Open XML SDK is Microsoft's .NET library that lets you manipulate Office documents. It lets you open a PPTX file and provides an object model that wraps the XML contents.
Here's the link to the OpenXML SDK and the MSDN documentation.
I've used OpenXML to let an ASP.NET page dynamically generate Word documents from a database.
Don't use Office Interop on a web server. It's an all-around bad idea.
If you are only replacing text placeholders in template files that will not change, a home-grown solution that finds the placeholders in the XML files inside the zip archive should be doable. .NET has had zip support for some time, and it is greatly improved if you are able to use .NET 4.5, so you shouldn't need to extract the archive to a temporary location at all.
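A minimal sketch of that home-grown replacement, assuming .NET 4.5 or later and the ZipArchive class from System.IO.Compression (a PPTX package is a standard zip container). The PptxPlaceholders class, the Replace method, and the `{{NAME}}` placeholder style in the usage note are invented for illustration. It only touches the slide parts under ppt/slides/, and it assumes each placeholder sits inside a single text run. PowerPoint sometimes splits typed text across several runs, in which case a plain string replace will not find it.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security;
using System.Text;

// Hypothetical helper, not part of any library.
public static class PptxPlaceholders
{
    // Rewrites slide XML in place inside the package; nothing is extracted to disk.
    // The stream must be readable, writable and seekable (e.g. a MemoryStream copy of the template).
    public static void Replace(Stream pptx, string placeholder, string value)
    {
        using (var archive = new ZipArchive(pptx, ZipArchiveMode.Update, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                bool isSlide =
                    entry.FullName.StartsWith("ppt/slides/slide", StringComparison.Ordinal) &&
                    entry.FullName.EndsWith(".xml", StringComparison.Ordinal);
                if (!isSlide) continue;

                using (Stream part = entry.Open())
                {
                    string xml;
                    using (var reader = new StreamReader(part, Encoding.UTF8, true, 4096, leaveOpen: true))
                        xml = reader.ReadToEnd();

                    // Escape the value so characters like & and < keep the XML well-formed.
                    string updated = xml.Replace(placeholder, SecurityElement.Escape(value));
                    if (updated == xml) continue;

                    // Truncate the part and write the modified XML back (UTF-8, no BOM).
                    part.SetLength(0);
                    using (var writer = new StreamWriter(part, new UTF8Encoding(false), 4096, leaveOpen: true))
                        writer.Write(updated);
                }
            }
        }
    }
}
```

On the web server, one option is to copy the template into a MemoryStream, call Replace once per placeholder (for example `{{NAME}}`), and then write the stream to the response.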
PowerPoint should also support connecting directly to Oracle in the same way it supports connecting to Sql Server (just play around with the connection options), without needing the special Oracle BI stuff. However, I'd still prefer the home-grown solution, as this will only work while the powerpoint file is able to reach your database directly, which is typically only possible in your local LAN environment or with an active VPN.
If you want anything fancier than a simple text replacement, perhaps look for an Aspose competitor.

Writing to an Excel file from SQL Server Database

Currently, I need to be able to retrieve values from a SQL Server DB, populate an Excel file according to a certain template, and then allow the user to download the file. I also need this template to be customizable, in the sense that the user can add and remove fields.
I understand that there are a couple of approaches I can take: using .xlt, and using C# directly. With C#, the user will need to interact with a UI, which will then populate a ExcelTemplate table in the SQL Server. This ExcelTemplate table will then be used when the user wishes to download a new Excel file.
I know all this stuff may sound kinda abstract, so please do tell me if there are some places I need to elaborate/clarify. Thanks a bunch, man.
EDIT: Sorry, I kinda missed this part out, but I'd prefer to allow the user to customize these Excel templates via a Silverlight UI.
You can create Data Sources in Excel and pull the data from MS SQL Server.
You can use MS Reporting Services, which can deliver reports in MS Excel format. In this case users can use Report Builder to customize the reports.
For pulling down data from SQL Server and dumping it into Excel, you can use OfficeWriter. It has Reporting Services integration and supports generating .xls and .xlsx documents. There's also a template component that basically does what you're trying to do. The templates are actually Excel documents, so the users can edit them directly in Excel. Not Silverlight, but not bad. You can try an eval for free.
DISCLAIMER: I'm one of the engineers who built the latest version.
At the end of the day I think I'm gonna spend some time building a customized dashboard. It won't be generic, but rather focused on the existing database.
I know this answer is kinda vague and all, but I'd like to say thanks for all the help :) it'd be great if there are dynamic solutions for this in the future! I think...

Uploading document to Google Docs: how to replace existing and how to upload and convert large doc?

I am trying to make a tool for backup/restore of documents from a Google account.
Backup is easy and I have no problems with it. But I have two unsolved questions for restore:
1) Is it possible to upload a new version of an existing document? When I upload a document, it appears as a separate copy.
I found this was already discussed in "Upload and replace file in given folder on Google Docs using .net api", but the suggestion there seems to be to remove the old version before uploading the new one, which changes the document's Id. Is this correct?
2) Google Docs has a limit on the size of documents that can be converted into its internal format (http://docs.google.com/support/bin/answer.py?hl=en&answer=37603). So it is possible to create a large document, save it to a local computer, and then have Google Docs refuse to convert it because the document's size is over the limit. In that case it is possible to upload the document without conversion, but it becomes un-editable via the web site. Is there a workaround for this situation?
"Unable to upload large files to Google Docs" advises breaking the document into small pieces before uploading and linking them together afterwards. But maybe there are other ideas?
1. Is it possible to upload a new version of an existing document? When I upload a document, it appears as a separate copy.
Yes, this is possible. We call it "upload & replace" as you've noticed. No need to remove the existing version first. The following link describes how to do this in the protocol:
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#UpdatingMetadataAndContent
From the .NET client library, what you need to do is attach an input stream to the Update() request. The method header for what you need is here:
http://code.google.com/p/google-gdata/source/browse/trunk/clients/cs/src/core/service.cs#554
Create a stream containing your new file content, and just pass that in. That should be it!
2. Google Docs have limit for size... Is there some workaround for this situation?
Unfortunately there is not a way currently to circumvent the size limitations of converted documents. They must be uploaded as unconverted files, and thus, are not editable in the Google Docs user interface.
