How to crawl an external web search - C#

I have a website, http://www.op.nysed.gov/opsearches.htm for example, where the user selects a Profession, enters a Licensee Name, and clicks the Search button, which takes them to a new page displaying the results.
For example, searching the Architect profession returns a list of matching licensees, and clicking the license number next to a name brings up that licensee's detail record. (The original post illustrated each of these steps with screenshots.)
I looked at Scrapy, Arachnode, and other web crawlers for this purpose, but I wasn't convinced they are the right technology for it.
I was told that we have to crawl those search results from the page. Is that something that can be done?
Can a crawler perform the search the same way a user does?

Web crawling programs will get you a local copy of the target site's structure; I'm not really sure that is what you want.
If you want to extract that data and store it in a way you can query it later, then you must create your own app.
As a starting point, the idea is this:
Navigate manually through the site and analyze the POSTs sent between pages (for example, what is sent to the server when "Architect" is selected and the button is pressed, or where a link on a license points). Find the real queries, which variables are sent, and their format, then analyze the page's HTML structure to find patterns that can be used with a regular expression engine.
That part will be hard: you must analyze outgoing and incoming HTTP requests (the Live HTTP Headers add-on for Firefox can help you a lot) so you can simulate them in your program, and construct reliable regular expression patterns (to test regular expressions, The Regex Coach comes in very handy).
Once you know how to navigate the page structure and have patterns to strip the data, the rest is relatively easy: create a client using WebClient, navigate through the structure, strip the necessary data, and store it in a database.
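For illustration, here is a minimal sketch of that flow. The form field names and the regex pattern are placeholders; the real ones come from the HTTP analysis described above:

using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class SearchScraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Simulate the search form POST. Field names are hypothetical;
            // capture the real ones with Live HTTP Headers.
            var form = new NameValueCollection
            {
                { "profession", "Architect" },
                { "licenseeName", "Smith" }
            };
            byte[] raw = client.UploadValues(
                "http://www.op.nysed.gov/opsearches.htm", "POST", form);
            string html = Encoding.UTF8.GetString(raw);

            // Hypothetical pattern; build the real one from the result page's HTML.
            foreach (Match m in Regex.Matches(html,
                @"<td>(?<name>[^<]+)</td>\s*<td>(?<license>\d+)</td>"))
            {
                Console.WriteLine("{0}: {1}",
                    m.Groups["license"].Value, m.Groups["name"].Value);
            }
        }
    }
}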
As you can see this is a very broad answer, but then, your question is really broad too.

Related

Connecting To A Website To Look Up A Word (Compiling Mass Data/Webcrawler)

I am currently developing a word-completion application in C# and, after getting the UI up and running, keyboard hooks set, and other things of that nature, I came to the realization that I need a WordList. The only issue is, I can't seem to find one with the appropriate information. I also don't want to spend an entire week formatting and gathering a WordList by hand.
The information I want is something like "TheWord, The definition, verb/etc."
So, it hit me: why not download a basic word list with nothing but words (already did this; there are about 109,523 words), write a program that iterates through every word, connects to the internet, retrieves the data (definition etc.) from some arbitrary site, and creates XML data from said information? It could be 100% automated, and I would only have to wait for maybe an hour, depending on my internet connection speed.
This however, brought me to a few questions.
How should I connect to a site to look up these words? << This is my actual question.
How would I read this information from the website?
Would I piss off my ISP or the website for that matter?
Is this a really bad idea? Lol.
How do you guys think I should go about this?
EDIT
Someone noticed that Dictionary.com uses the word as a suffix in the URL. This will make it easy to iterate through the word file. I also see that the webpage is stored in XHTML (or maybe just HTML). Here is the source for the word "Cat": http://pastebin.com/hjZj6AC1
For what you marked as your actual question - you just need to download the data from the website and find what you need.
A great tool for this is CsQuery, which allows you to use jQuery selectors.
You could do something like this:
var dom = CQ.CreateFromUrl("http://www.jquery.com");      // any page URL works here
string definition = dom.Select(".definitionDiv").Text();  // selector depends on the target site's markup
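Scaling that up to the automated loop described in the question might look like the following sketch. It assumes the word-as-URL-suffix scheme noted in the EDIT above, and ".def-content" is a placeholder selector you would replace after inspecting the real markup:

using System;
using System.IO;
using System.Threading;
using System.Xml.Linq;
using CsQuery;

class WordListBuilder
{
    static void Main()
    {
        var root = new XElement("words");
        foreach (string word in File.ReadLines("wordlist.txt"))
        {
            // Word appended as a URL suffix, as noted in the question's EDIT.
            var dom = CQ.CreateFromUrl(
                "http://dictionary.reference.com/browse/" + word);

            // Placeholder selector; find the real one in the page source.
            string definition = dom.Select(".def-content").First().Text().Trim();

            root.Add(new XElement("word",
                new XAttribute("text", word),
                new XElement("definition", definition)));

            Thread.Sleep(1000); // throttle requests so you don't hammer the site
        }
        root.Save("words.xml");
    }
}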

Start Orchard CMS Workflow whenever a user clicks on a link leading to an external domain

I need to create a "speed bump" that issues a warning whenever a user clicks on a link that would direct them to a different website (not on the domain). Is there any way to create a custom Orchard workflow activity that activates whenever a link on the website is clicked? I'm having trouble getting C# to fire an event when a link (or anchor tag) on the page is clicked. I can't just add an onServerClick event to every anchor tag, or add an event handler to anchor tags with specific IDs, because I need it to fire on all anchor tags, many of which are dynamically assigned an id when created.
Another option I was toying with is to create a custom workflow task that searches any content item for links and then adds a speed bump to any link determined to lead to an external URL. Is it possible to use C# to search the contents of any content item, upon creation/publish, for anchor tags and then alter the tag somehow to include a speed bump?
As a side note, I also need to be able to whitelist URLs so a third party can't use the speed bump to direct the user to a malicious website.
I've been stumped on this for quite some time; any help would be greatly appreciated.
One way to do this is to add a bit of client-side script that intercepts the A tags' click events and handles them according to the logic you want to implement. The advantages are performance and ease of implementation. Very, very few people disable JavaScript, and those users who do can presumably read a domain name in the address bar, so there are no real downsides.
Another way, if you don't want to use JavaScript, is to write a server-side filter that parses the response being output, finds all A tags, and replaces their URLs on the fly with the URL of a special controller, passing the actual URL as a querystring parameter. The drawbacks of this method are that it will be a significant drag on server performance, and it will be hard to write.
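To make the idea concrete, here is a rough sketch of such a filter in plain ASP.NET (nothing Orchard-specific). "/SpeedBump" and "mysite.com" are placeholders, and a real filter would also need to handle chunked writes and non-HTML responses:

// Register per request, e.g. in Global.asax:
//   Response.Filter = new ExternalLinkFilter(Response.Filter);
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class ExternalLinkFilter : Stream
{
    private readonly Stream _inner;
    private readonly StringBuilder _html = new StringBuilder();

    public ExternalLinkFilter(Stream inner) { _inner = inner; }

    // Buffer everything for simplicity; production code must cope with
    // chunk boundaries instead.
    public override void Write(byte[] buffer, int offset, int count)
    {
        _html.Append(Encoding.UTF8.GetString(buffer, offset, count));
    }

    public override void Flush()
    {
        // Send external hrefs through a hypothetical /SpeedBump controller,
        // leaving whitelisted (here: mysite.com) links untouched.
        string rewritten = Regex.Replace(_html.ToString(),
            @"href=""(https?://(?!(\w+\.)?mysite\.com)[^""]+)""",
            m => @"href=""/SpeedBump?url="
                 + Uri.EscapeDataString(m.Groups[1].Value) + @"""");
        byte[] bytes = Encoding.UTF8.GetBytes(rewritten);
        _inner.Write(bytes, 0, bytes.Length);
        _inner.Flush();
        _html.Length = 0;
    }

    // Remaining Stream plumbing.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return 0; } }
    public override long Position { get; set; }
    public override int Read(byte[] b, int o, int c) { throw new NotSupportedException(); }
    public override long Seek(long o, SeekOrigin s) { throw new NotSupportedException(); }
    public override void SetLength(long v) { throw new NotSupportedException(); }
}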
But the best way to solve the issue, by far, for you and your users, is to convince your legal department that this is an extremely bad idea and that there is, in reality, no legal issue here (though I may be wrong about that: I'm not a lawyer, and this is not legal advice).

How can I copy HTML textbox values from one domain to another domain's textboxes?

I'm trying to help save time at work on a lot of tedious copy/paste tasks we have.
So, we have a proprietary CRM (with proper HTML IDs, etc., for accessing elements) and I'd like to copy those values from the CRM to textboxes on other web pages (outside of the CRM, so sites like Twitter, Facebook, Google, etc.).
I'm aware browsers limit this for security and I'm open to anything, it can be a C#/C++ application, Adobe AIR, etc. We only use Firefox at work so even an extension would work. (We do have GreaseMonkey installed so if that's usable too, sweet).
So, any ideas on how to copy values from one web page to another? Ideally, I'm looking to click a button and have it auto-populate fields. If that button has to launch the web pages that need to be copied over to, that's fine.
Example: Copy customers Username from our CRM, paste it in Facebook's Username field when creating a new account.
UPDATE: To answer a user below: the HTML elements on each domain have specific HTML IDs. The data won't need to be manipulated or cleaned up; it's just a simple copy from ourCRM.com to facebook.com / twitter.com.
Ruby Mechanize is a good bet for scraping the data. Then you can store it and post it however you please.
First, I'd suggest that you more clearly define exactly what it is you're looking to do. I read this as: you're trying to take some unstructured data from Point A and copy it to Point B. Do the names of these fields remain constant every time you do the operation? Do you need to simply pull any textbox elements from the page and copy them all over? Do you need to do some sort of filtering of this data before writing it over?
Once you've got a clear idea of the requirements, if you go the C# route, I'd use something like SimpleBrowser. Judging by the example on their Github page, you could give it the URL of the page you're looking to copy, then name each of the fields you're looking to obtain the value of, perhaps store these in an IDictionary, then open a new URL and copy those values back into the page (and submit the form).
Alternatively, if you don't know the names of the fields, perhaps there's a provided function in that or a similar project that will allow you to simply enumerate all the text fields on the page and retrieve the values for all of them. Then you'd simply apply some logic of your own to filter those options down to whatever is on the destination form.
So we thought of an easier way to do this (in case anyone else runs into this issue):
1) From our CRM, we added a "Sign up for Facebook" button
2) The button opens a new window with GET variables in the URL
3) Use a greasemonkey script to read those GET variables and fill in textbox values
4) SUCCESS!
Simple, took about 10 minutes to get working. Thanks for your suggestions.
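For anyone reproducing this, a sketch of the CRM side (steps 1-2). The control names, parameter names, and signup URL are all made up for illustration; the Greasemonkey script on the target domain then reads the parameters out of location.search:

using System;
using System.Web;
using System.Web.UI;

public partial class CustomerPage : Page
{
    // Handler for the hypothetical "Sign up for Facebook" button.
    protected void SignUpButton_Click(object sender, EventArgs e)
    {
        string url = "https://www.facebook.com/r.php"
            + "?crm_username=" + HttpUtility.UrlEncode(UsernameTextBox.Text)
            + "&crm_email=" + HttpUtility.UrlEncode(EmailTextBox.Text);

        // Open the signup page in a new window; the Greasemonkey script
        // running on that domain fills its textboxes from these GET variables.
        ClientScript.RegisterStartupScript(GetType(), "signup",
            "window.open(" + HttpUtility.JavaScriptStringEncode(url, true) + ");",
            true);
    }
}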

Hiding Querystring in ASP.NET 2.0

Our site consists of 3 main pages we call "Start.aspx", each with a content iframe inside it where the user does nearly all of the site interactions.
Recently, though, I've had to implement functionality that jumps between the Start.aspx pages of different products and automatically changes the content iframe to a specified page.
The actual functionality works just fine, but the issue we're having is that the full querystring is exposed. Because we load all pages in the content iframe, the page URL remains at "Product/Start.aspx" during regular site usage.
However, this new functionality is passing a querystring to Start.aspx (which has appropriate parsers to load the requested page in the content iframe), and we need that URL to remain as "Start.aspx".
So far, I've researched URL rewriting, which was throwing errors because the landing page for each product is "[Product]/Start.aspx". I've looked at a different URL rewriting solution, as well as ScottGu's blog post on routing.
The issue is that these solutions seem to be used for simplifying navigation, such as taking "Blogpost.aspx?Year=2013&Month=07&Day=15" and turning it into "Blogpost.aspx/2013/07/15", which really isn't what we're going for. We're not trying to simplify navigation via the URL; we're really just trying to completely hide our querystrings.
What we're going for is turning "[Product]/Start.aspx?frame=Company.aspx?id=1570" into "[Product]/Start.aspx" once the content iframe has what it needs from the initial querystring. We don't need to account for every single page; we just need that to be the overarching rule. 90% of the time it won't be an issue, as most of the work being done doesn't jump from product to product unless the user switches products (which is done specifically via Response.Redirect("[Product]/Start.aspx")).
Once the content iframe has loaded from the querystring parameters, we don't need them anymore for anything. The rest of the functionality runs through the iframe without any issue.
Am I overthinking this, or am I asking for something that's not really feasible?
As far as literally removing all of the querystring characters while still being able to pass the querystring values to another page, I do not think that is possible, unless you do it with a Session variable or something like that.
If you're simply worried about sensitive data being displayed in plain text in the query string, there is the option of "encrypting" the query string:
http://www.codeproject.com/Articles/33350/Encrypting-Query-Strings
The query string will still show but it will be "Product/Start.aspx?e0ayfefae0y0someencryptedmess108yfe0ayf0a". The page that receives the query string would decrypt it. So the functionality of the query string is there, but the values are not known to the end user.
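The linked article ships its own helper classes; the general idea looks like this sketch (demo key material only, and RijndaelManaged is used so it works on .NET 2.0):

using System;
using System.Security.Cryptography;
using System.Text;

public static class QueryStringProtector
{
    // Demo key/IV only; real keys belong in protected configuration.
    private static readonly byte[] Key =
        Encoding.UTF8.GetBytes("0123456789abcdef0123456789abcdef"); // 32 bytes
    private static readonly byte[] IV =
        Encoding.UTF8.GetBytes("0123456789abcdef");                 // 16 bytes

    public static string Encrypt(string plain)
    {
        using (var aes = new RijndaelManaged())
        using (var enc = aes.CreateEncryptor(Key, IV))
        {
            byte[] data = Encoding.UTF8.GetBytes(plain);
            return Convert.ToBase64String(enc.TransformFinalBlock(data, 0, data.Length));
        }
    }

    public static string Decrypt(string base64)
    {
        byte[] cipher = Convert.FromBase64String(base64);
        using (var aes = new RijndaelManaged())
        using (var dec = aes.CreateDecryptor(Key, IV))
        {
            return Encoding.UTF8.GetString(
                dec.TransformFinalBlock(cipher, 0, cipher.Length));
        }
    }
}

// Sender URL-encodes the encrypted blob into the querystring:
//   Response.Redirect("Start.aspx?e=" + HttpUtility.UrlEncode(
//       QueryStringProtector.Encrypt("frame=Company.aspx?id=1570")));
// Receiver recovers it (Request.QueryString already URL-decodes):
//   string original = QueryStringProtector.Decrypt(Request.QueryString["e"]);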
Since you've tagged this as an ASP.NET question, I'd say the way to go is to keep navigation data in your Session variables.
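A minimal sketch of that approach; "contentFrame" and the Session key are made-up names:

// The page that initiates the jump stashes the target and redirects cleanly:
//   Session["FramePage"] = "Company.aspx?id=1570";
//   Response.Redirect("~/Product/Start.aspx");

using System;
using System.Web.UI;

public partial class Start : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string framePage = Session["FramePage"] as string;
        if (!string.IsNullOrEmpty(framePage))
        {
            Session.Remove("FramePage");                // consume it once
            contentFrame.Attributes["src"] = framePage; // iframe with runat="server"
        }
    }
}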
Can you use a POST instead of a GET? That way, the data is in the form, rather than the Query String.
As a side note, hiding the parameters as a way of making the URL look nicer and bookmarkable is fine. If you're doing it for any kind of security reason, it's very shallow security: it's trivial for a user to see what's being passed both in the form and on the query string, and to change and repost those values. Security needs to be handled primarily on the server side.

Get referral item (link)

We have a Sitecore website, and on page X we need to know which item provided the link that brought you there.
Example:
You're on page A and click a link provided by item X that will lead you to page B.
On page B we need to be able to tell that item X referred you, and thus access the item and its properties.
It could go through the session, the Sitecore context, or whatever else; we don't even need the entire item itself, just the ID would do.
Anyone know how to accomplish this?
From the discussion in the comments, you have a web-architecture problem that isn't really Sitecore-specific.
You have a back end which consumes several data items to produce some HTML which is sent to the client. Each of those data items may produce links in the HTML. They may produce identical links. Only one of the items is considered the source of the HTML page.
You want to know which of those items produced the link. Your only option is to find a way of identifying the links produced. To do this you will have to add some form of tagging information to the URL produced (such as a querystring parameter) that can be interpreted when the request for that URL is processed. The items themselves don't exist on the client.
The problem would be exactly the same if your links were produced by a database query. If you wanted to know which record produced the link you'd have to add an identifier to the link.
You could probably devise a system that would allow you to identify the item most of the time (i.e. when the link clicked was unique to that page), but it would involve either caching lots of data in the session (the list of links produced and the items that produced them) or recreating the request for the referring URL. Either sounds like a lot of hassle for an imperfect solution that could feasibly slow your server down a fair amount.
James is correct... your original parameters are basically impossible to satisfy.
With some hacking and replacing of the standard Sitecore providers though, you could track these. But it would be far easier to use a querystring ID of some sort.
On our system, we have 3rd party advertising links... they have client javascript which actually submits the request to a local page and then gets redirected to the target URL. So when you hover over the link, the status bar shows you "http://whatever.com"... it appears the link is going to whatever.com, but you are actually going to http://ourserver/redirect.aspx first so we can track that link, and then getting a Response.Redirect().
You could do something similar by providing your own LinkManager and including the generating item ID in the tracking URL, then redirecting to the actual page/item the user wants.
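A hedged sketch of that querystring approach, using LinkManager directly rather than a replaced provider; the "ref" parameter name is arbitrary:

using System.Web;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Links;

public static class ReferrerTagging
{
    // Called while item X is rendering a link to some target page:
    public static string BuildTaggedUrl(Item target)
    {
        Item source = Sitecore.Context.Item; // the item generating the link
        return LinkManager.GetItemUrl(target) + "?ref=" + source.ID.ToShortID();
    }

    // Called on the landing page to recover item X:
    public static Item GetReferrer()
    {
        string refId = HttpContext.Current.Request.QueryString["ref"];
        if (string.IsNullOrEmpty(refId) || !ShortID.IsShortID(refId))
            return null;
        return Sitecore.Context.Database.GetItem(ShortID.Parse(refId).ToID());
    }
}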
However... this seems rather convoluted and error-prone, and I would not recommend it.
