Searching through hundreds of HTML files - C#

I am not sure how to start solving this problem so any suggestions will be of help.
My client has a number of static HTML pages running into hundreds of files. These undergo updates every now and then and are overwritten on the website. We list these pages on the website via a simple left-hand explorer that mimics the folder structure in which the files are given to us.
We now want to offer the ability to search these files and display matching results. A brute-force search through such a large number of files would be very time-consuming. Matching related words (for example plurals, misspellings, etc.) is also desirable, and showing results in order of popularity would be a useful feature. I am not sure how to get started on this. Should we pre-process the HTML files after every update, for instance? Are there any recommended indexing libraries available for .NET? What little programming has been done on the website has been done in C#.
Thanks
MS

Lucene.net may be of interest.
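If you go that route, a minimal sketch of indexing and searching with the Lucene.Net 3.x API could look something like this; the field names, index path and the crude tag-stripping regex are only placeholders:

using System;
using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

class HtmlIndexer
{
    // Re-run this after every content update so the index stays in sync with the files.
    public static void BuildIndex(string htmlRoot, string indexPath)
    {
        var dir = Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo(indexPath));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var file in Directory.EnumerateFiles(htmlRoot, "*.html", SearchOption.AllDirectories))
            {
                // crude tag stripping; a real HTML parser would be more robust
                string text = Regex.Replace(File.ReadAllText(file), "<[^>]+>", " ");

                var doc = new Document();
                doc.Add(new Field("path", file, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
    }

    public static void Search(string indexPath, string queryText)
    {
        var dir = Lucene.Net.Store.FSDirectory.Open(new DirectoryInfo(indexPath));
        using (var searcher = new IndexSearcher(dir, readOnly: true))
        {
            var parser = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30));
            var hits = searcher.Search(parser.Parse(queryText), 10); // results come back ranked by relevance
            foreach (var scoreDoc in hits.ScoreDocs)
                Console.WriteLine(searcher.Doc(scoreDoc.Doc).Get("path"));
        }
    }
}

Rebuilding the index after each content update covers the pre-processing question, and Lucene's stemming analyzers and fuzzy queries would go some way towards matching plurals and misspellings.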

I'd first write a simple program to transfer all those files' contents to a database. Then you could implement your search properly without having to read all the files every time.
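A sketch of that transfer step, assuming a hypothetical Pages(Path, Content) table and your own connection string. Once the text is in a table you can lean on the database's full-text indexing instead of reading the files on every search:

using System.Data.SqlClient;
using System.IO;
using System.Text.RegularExpressions;

class PageLoader
{
    public static void LoadPages(string htmlRoot, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            foreach (var file in Directory.EnumerateFiles(htmlRoot, "*.html", SearchOption.AllDirectories))
            {
                // strip markup so only the visible text is stored and searched
                string text = Regex.Replace(File.ReadAllText(file), "<[^>]+>", " ");

                using (var cmd = new SqlCommand(
                    "INSERT INTO Pages (Path, Content) VALUES (@path, @content)", conn))
                {
                    cmd.Parameters.AddWithValue("@path", file);
                    cmd.Parameters.AddWithValue("@content", text);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}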

I am not sure if it's within your budget, but Google can do it for you, as user1161318 pointed out.
Try Google Site Search - http://www.google.co.uk/enterprise/search/products_gss.html

Related

Connecting To A Website To Look Up A Word (Compiling Mass Data/Webcrawler)

I am currently developing a Word-Completion application in C# and, after getting the UI up and running, keyboard hooks set, and other things of that nature, I came to the realization that I need a WordList. The only issue is, I can't seem to find one with the appropriate information. I also don't want to spend an entire week formatting and gathering a WordList by hand.
The information I want is something like "TheWord, The definition, verb/etc."
So, it hit me. Why not download a basic word list with nothing but words (I already did this; there are about 109,523 words), write a program that iterates through every word, connects to the internet, retrieves the data (definition, etc.) from some arbitrary site, and creates XML data from that information? It could be 100% automated, and I would only have to wait for maybe an hour, depending on my internet connection speed.
This, however, brought me to a few questions.
How should I connect to a site to look up these words? << This is my actual question.
How would I read this information from the website?
Would I piss off my ISP or the website for that matter?
Is this a really bad idea? Lol.
How do you guys think I should go about this?
EDIT
Someone noticed that Dictionary.com uses the word as a suffix in the URL. This will make it easy to iterate through the word file. I also see that the webpage is written in XHTML (or maybe just HTML). Here is the source for the word "Cat": http://pastebin.com/hjZj6AC1
For what you marked as your actual question - you just need to download the data from the website and find what you need.
A great tool for this is CsQuery, which allows you to use jQuery selectors.
You could do something like this:
// load and parse the page into a queryable DOM (the URL here is just an example)
var dom = CQ.CreateFromUrl("http://www.jquery.com");
// grab the text of whatever element(s) match a jQuery-style CSS selector
string definition = dom.Select(".definitionDiv").Text();
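Building on your edit (the word being a suffix in the URL), the whole loop could look roughly like this; the URL pattern and the .definitionDiv selector are illustrative guesses rather than Dictionary.com's actual markup, and the Sleep is there so you don't hammer the site or annoy your ISP:

using System.IO;
using System.Threading;
using System.Xml.Linq;
using CsQuery;

class WordListBuilder
{
    public static void Build(string wordListPath, string outputXmlPath)
    {
        var root = new XElement("words");

        foreach (var word in File.ReadLines(wordListPath))
        {
            // hypothetical URL pattern - the word is appended as a suffix
            var dom = CQ.CreateFromUrl("http://dictionary.reference.com/browse/" + word);

            // ".definitionDiv" is a placeholder selector; inspect the real page to find the right one
            string definition = dom.Select(".definitionDiv").Text().Trim();

            root.Add(new XElement("word",
                new XAttribute("text", word),
                new XElement("definition", definition)));

            Thread.Sleep(1000); // be polite: roughly one request per second
        }

        new XDocument(root).Save(outputXmlPath);
    }
}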

Create list of AJAX type links to be googled by appliance

We have a large collection of web content that we want to make searchable by Google Appliance, but we have a fairly complex list of what we want and don't want included. Because a lot of the content is AJAX-like, just having Google do the searching isn't a solution. Instead, we have a classic ASP page that loops through all of our directories and files using the Scripting.FileSystemObject, excludes/includes files/folders, and generates a large list of hyperlinks in a page that Google can then query. This process is painfully slow (20 minutes or more), but now we are able to move it to a .NET server.
I'm doing a little bit of exploring, wondering what solutions people may have found useful for this kind of thing. We're looking at Microsoft.Web.Administration and anything else that will make this more efficient, including writing the resulting list to an HTML file.
Does anyone with experience with this have any suggestions as to how to approach this?
Thank you in advance.
According to the documentation (https://developers.google.com/search-appliance/documentation/614/admin_crawl/Preparing#robotscs), the Appliance Server obeys robots.txt. You should be able to add one to the root of the site and configure it to disallow indexing of particular folders or extensions.
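On the .NET side, the directory walk itself can stay fairly simple with plain System.IO; here is a rough sketch in which the include/exclude rules, paths and output file are only placeholders:

using System;
using System.IO;
using System.Linq;

class LinkListGenerator
{
    static readonly string[] ExcludedFolders = { "\\_private\\", "\\temp\\" };   // placeholder rules
    static readonly string[] IncludedExtensions = { ".htm", ".html", ".aspx" };  // placeholder rules

    public static void Generate(string contentRoot, string siteRootUrl, string outputFile)
    {
        // walk the whole tree once, applying the include/exclude rules as we go
        var links = Directory.EnumerateFiles(contentRoot, "*.*", SearchOption.AllDirectories)
            .Where(f => IncludedExtensions.Contains(Path.GetExtension(f), StringComparer.OrdinalIgnoreCase))
            .Where(f => !ExcludedFolders.Any(x => f.IndexOf(x, StringComparison.OrdinalIgnoreCase) >= 0))
            .Select(f => siteRootUrl + f.Substring(contentRoot.Length).Replace('\\', '/'));

        // write one big page of hyperlinks for the appliance to crawl
        using (var writer = new StreamWriter(outputFile))
        {
            writer.WriteLine("<html><body>");
            foreach (var url in links)
                writer.WriteLine("<a href=\"{0}\">{0}</a><br/>", url);
            writer.WriteLine("</body></html>");
        }
    }
}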

Creating winforms help file

I'm looking to create an indexable help file for a WinForms app, but how do you get started?
The Microsoft MSDN documentation is rubbish; it says "create a new project" but doesn't specify which project type to create.
How do I go about creating a help file for my applications?
Are you looking for this:
Integrating "Help" into WinForms Application?
Maybe this doesn't count as a real answer:
I would vote against those help files. 5-6 years back we had real context-sensitive help files on a per-dialog basis in our applications, and they were a lot of effort to maintain.
Therefore, we changed this to shipping "simple" PDF files that appear on F1. We never got any complaints from users.
Recently we started migrating this to real HTML websites with lots of individual pages, a search function, "prev" and "next" navigation, and a printer-friendly format. This enables us to update the manual much quicker and makes it more "linkable" compared to PDF.
Personally, I have never really warmed to those help files. For example, I still do not understand why some files need to be trusted before I can open and view them.
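If anyone wants to copy the PDF-on-F1 approach, here is a small WinForms sketch; the manual file name and location are placeholders:

using System.Diagnostics;
using System.IO;
using System.Windows.Forms;

public partial class MainForm : Form
{
    public MainForm()
    {
        InitializeComponent();
        // HelpRequested fires when the user presses F1 anywhere on the form
        HelpRequested += (sender, e) =>
        {
            string manual = Path.Combine(Application.StartupPath, "manual.pdf"); // placeholder file name
            if (File.Exists(manual))
                Process.Start(manual); // opens in the user's default PDF viewer
            e.Handled = true;
        };
    }
}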

Using C# to retrieve data from a Google search

Here's what I want the program to do:
Read a text file (the text file contains random search criteria like "sunflower seeds", "chrome water faucets", etc) to retrieve a search phrase.
Submit the search phrase to Google and retrieve the first four URLs.
Retrieve the Google Page Rank of each of the returned URLs.
Being a neophyte C# programmer, I can handle #1 easily. Unfortunately, I've never dealt with using the Google APIs before. I do have a Google API key and I'm aware that there is a search limit using the API. At most, I'll probably use this on a dozen search phrases (or "keywords") per day. I can do this manually, but I know there has to be a way to do this with a C# program. I've read that this can be done using AJAX, but I don't know AJAX and I'd rather this just be an executable program on my PC rather than a web-based app. A push in the right direction from someone would be a big help. Also, I really don't want this to be a "screen-scraper", either. Isn't there a way that I can get the info (URLs and Page Rank) from Google without having to scrape a returned HTML search page?
I don't want anyone to write the code for me, just need to know if it's possible and a push towards finding the information on how to accomplish it.
Thanks in advance everyone!
I don't want anyone to write the code for me, just need to know if it's possible and a push towards finding the information on how to accomplish it.
Look into the WebClient class
http://msdn.microsoft.com/en-us/library/system.net.webclient(VS.80).aspx
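At its simplest that is just a DownloadString call; note that this returns raw HTML, so you would still need to parse it or, better, call a proper search API:

using System.Net;

class Downloader
{
    public static string Fetch(string url)
    {
        using (var client = new WebClient())
        {
            // DownloadString returns the raw response body as a string
            return client.DownloadString(url);
        }
    }
}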
Try this:
string googleSearch = "http://www.google.com/#hl=en&q=" + query;
where query is the URL-encoded string of your search.

Compare the textual content of websites

I'm experimenting a bit with textual comparison/basic plagiarism detection, and want to try this on a website-to-website basis. However, I'm a bit stuck in finding a proper way to process the text.
How would you process and compare the content of two websites for plagiarism?
I'm thinking something like this pseudo-code:
// extract text
foreach website in websites
crawl website - store structure so pages are only scanned once
extract text blocks from all pages - store this in a list
// compare
foreach text in website1.textlist
compare with all text in website2.textlist
I realize that this solution could very quickly accumulate a lot of data, so it might only be possible to make it work with very small websites.
I haven't decided on the actual text comparison algorithm yet, but right now I'm more interested in getting the actual process algorithm working first.
I'm thinking it would be a good idea to extract all text as individual text pieces (from paragraphs, tables, headers and so on), as text can move around on pages.
I'm implementing this in C# (maybe ASP.NET).
I'm very interested in any input or advice you might have, so please shoot! :)
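In C#, the extract/compare skeleton from the pseudo-code might look roughly like this; the crawler is omitted, the regex-based block extraction is only a stand-in for a real HTML parser, and the comparison function is a stub:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

class SiteComparer
{
    // Pull the text of individual blocks (paragraphs, headers, list items, table cells)
    // out of one page. A real implementation would use an HTML parser rather than regex.
    public static List<string> ExtractTextBlocks(string html)
    {
        return Regex.Matches(html, @"<(p|h\d|li|td)[^>]*>(.*?)</\1>",
                             RegexOptions.Singleline | RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => Regex.Replace(m.Groups[2].Value, "<[^>]+>", " ").Trim())
                    .Where(t => t.Length > 0)
                    .ToList();
    }

    public static void Compare(IEnumerable<string> site1PageUrls, IEnumerable<string> site2PageUrls)
    {
        using (var client = new WebClient())
        {
            var blocks1 = site1PageUrls.SelectMany(u => ExtractTextBlocks(client.DownloadString(u))).ToList();
            var blocks2 = site2PageUrls.SelectMany(u => ExtractTextBlocks(client.DownloadString(u))).ToList();

            foreach (var a in blocks1)
                foreach (var b in blocks2)
                    if (IsSuspicious(a, b))
                        Console.WriteLine("Possible match:\n  {0}\n  {1}", a, b);
        }
    }

    // Placeholder comparison (exact match only) - swap in a real text-similarity
    // algorithm here, e.g. the fragment/shingle counting sketched further down.
    static bool IsSuspicious(string a, string b) { return a == b; }
}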
My approach to this problem would be to google for specific, fairly unique blocks of text whose copyright you are trying to protect.
Having said that, if you want to build your own solution, here are some comments:
Respect robots.txt. If they have marked the site as do-not-crawl, chances are they are not trying to profit from your content anyway.
You will need to refresh the site structure you have stored from time-to-time as websites change.
You will need to properly separate text from HTML tags and JavaScript.
You will essentially need to do a full text search in the entire text of the page (with tags/Script removed) for the text you wish to protect. There are good, published algorithms for this.
You're probably going to be more interested in fragment detection. For example, lots of pages will have the word "home" on them and you don't care. But it's fairly unlikely that very many pages will have exactly the same words across the entire page. So you probably want to compare and report on pages that have exact matches of length 4, 5, 6, 7, 8, etc. words, with counts for each length. Assign a score, weight them, and if you exceed your "magic number", report the suspected xeroxers.
For C#, you can use the WebBrowser control to get a page and fairly easily get its text. Sorry, no code sample handy to copy/paste, but MSDN usually has pretty good samples.
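To make the fragment-detection idea a bit more concrete, here is a rough sketch that counts shared word n-grams ("shingles") of length 4-8 between two pieces of text and weights longer matches more heavily; the weighting and any threshold are made up and would need tuning:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class FragmentDetector
{
    // Split into lowercase words, then build the set of all n-word fragments.
    static HashSet<string> Shingles(string text, int n)
    {
        var words = Regex.Matches(text.ToLowerInvariant(), @"[a-z0-9']+")
                         .Cast<Match>().Select(m => m.Value).ToArray();
        var set = new HashSet<string>();
        for (int i = 0; i + n <= words.Length; i++)
            set.Add(string.Join(" ", words, i, n));
        return set;
    }

    // Score two texts by counting shared fragments of length 4..8,
    // giving longer shared fragments a higher weight.
    public static double Score(string text1, string text2)
    {
        double score = 0;
        for (int n = 4; n <= 8; n++)
        {
            var shared = Shingles(text1, n);
            shared.IntersectWith(Shingles(text2, n));
            score += shared.Count * n; // arbitrary weighting: longer matches count more
        }
        return score;
    }
}

// Usage: if (FragmentDetector.Score(pageTextA, pageTextB) > magicNumber) flag the pair for review.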
