Get a new link for a site - c#

Suppose I had a link like this: https://site/2019
This site updates irregularly and I want to check if there is a new entry as soon as there is one.
if https://site/2020 becomes available, then I want to parse the full link to a string.
if the site doesn't contain an element (Selenium), it should skip over this link and wait for https://site/2021 to become available.
I have tried a while loop in which I passed an old link (like https://site/2020) and repeatedly checked if https://site/2021 has become available. I have found this to be more difficult then I thought, and thus it failed.
I think it could be done with events, but I don't know how.
If you have any ideas, I would love to hear them.

Related

In Selenium, can I get the last URL visited without actually navigating there?

I'm aware that I can navigate backwards through my history using the IWebDriver.Navigate().Back() method, but what if I just need the URL of the last page visited? Is there a way to grab that from the WebDriver, without actually navigating there?
To be clear, this is a question about Selenium WebDriver, and has nothing to do with JavaScript.
As suggested by several, what I ended up doing is wrapping the Selenium WebDriver in my own class that monitors all navigation, and keeps its own history. It seems a redundancy, given that somewhere deep in the bowels of WebDriver another history already exists, but since the tester doesn't have access to it, I see no other way of achieving this goal.
Thanks to all who contributed their thoughts and suggestions!
You can just keep the previous page URL in a variable and update/pass it to the actual code that needs it. Even keep a collection of all visited URLs and just get the last item, all this will give you a history without any need for JS hacks. Storing and sharing state is a valid case, already implemented in some frameworks, like SpecFlow's ScenarioContext. And the previous page URL value will be available to all your steps/code for each test.

Connecting To A Website To Look Up A Word(Compiling Mass Data/Webcrawler)

I am currently developing a Word-Completion application in C# and after getting the UI up and running, keyboard hooks set, and other things of that nature, I came to the realization that I need a WordList. The only issue is, I cant seem to find one with the appropriate information. I also don't want to spend an entire week formatting and gathering a WordList by hand.
The information I want is something like "TheWord, The definition, verb/etc."
So, it hit me. Why not download a basic word list with nothing but words(Already did this; there are about 109,523 words), write a program that iterates through every word, connects to the internet, retrieves the data(definition etc) from some arbitrary site, and creates XML data from said information. It could be 100% automated, and I would only have to wait for maybe an hour depending on my internet connection speed.
This however, brought me to a few questions.
How should I connect to a site to look up these words? << This my actual question.
How would I read this information from the website?
Would I piss off my ISP or the website for that matter?
Is this a really bad idea? Lol.
How do you guys think I should go about this?
EDIT
Someone noticed that Dictionary.com uses the word as a suffix in the url. This will make it easy to iterate through the word file. I also see that the webpage is stored in XHTML(Or maybe just HTML). Here is the source for the Word "Cat". http://pastebin.com/hjZj6AC1
For what you marked as your actual question - you just need to download the data from the website and find what you need.
A great tool for this is CsQuery which allows you to use jquery selectors.
You could do something like this:
var dom = CQ.CreateFromUrl("http://www.jquery.com");
string definition = dom.Select(".definitionDiv").Text();

How can I copy HTML textbox values from one domain to another domain's textboxes?

I'm trying to help save time at work with for a lot of tedious copy/paste tasks we have.
So, we have a propitiatory CRM (with proper HTML ID's, etc for accessing elements) and I'd like to copy those vales from the CRM to textboxes on other web pages (outside of the CRM, so sites like Twitter, Facebook, Google, etc)
I'm aware browsers limit this for security and I'm open to anything, it can be a C#/C++ application, Adobe AIR, etc. We only use Firefox at work so even an extension would work. (We do have GreaseMonkey installed so if that's usable too, sweet).
So, any ideas on how to copy values from one web page to another? Ideally, I'm looking to click a button and have it auto-populate fields. If that button has to launch the web pages that need to be copied over to, that's fine.
Example: Copy customers Username from our CRM, paste it in Facebook's Username field when creating a new account.
UPDATE: To answer a user below, the HTML elements on each domain have specific HTML ID's. The data won't need to be manipulated or cleaned up, just a simple copy from ourCRM.com to facebook.com / twitter.com
Ruby Mechanize is a good bet for scraping the data. Then you can store it and post it however you please.
First, I'd suggest that you more clearly define exactly what it is you're looking to do. I read this as you're trying to take some unstructured data from Point A and copy it to Point B. Do the names of these fields remain constant every time you do the operation? Do you need to simply pull any textbox elements from the page and copy them all over? Do some sort of filtering of this data before writing it over?
Once you've got a clear idea of the requirements, if you go the C# route, I'd use something like SimpleBrowser. Judging by the example on their Github page, you could give it the URL of the page you're looking to copy, then name each of the fields you're looking to obtain the value of, perhaps store these in an IDictionary, then open a new URL and copy those values back into the page (and submit the form).
Alternatively, if you don't know the names of the fields, perhaps there's a provided function in that or a similar project that will allow you to simply enumerate all the text fields on the page and retrieve the values for all of them. Then you'd simply apply some logic of your own to filter those options down to whatever is on the destination form.
SO we thought of an easier way to do this (in case anyone else runs into this issue).
1) From our CRM, we added a "Sign up for Facebook" button
2) The button opens a new window with GET variables in the URL
3) Use a greasemonkey script to read those GET variables and fill in textbox values
4) SUCCESS!
Simple, took about 10 minutes to get working. Thanks for you suggestions.

Get referral item (link)

We have a sitecore website and we need to know the item from which the link that brought you to page X.
Example:
You're on page A and click a link provided by item X that will lead you to page B.
On page B we need to be able to get that item X referred you, and thus access the item and it's properties.
It could go through session, Sitecore context, I don't know what and we don't even need the entire item itself, just the ID would do.
Anyone know how to accomplish this?
From the discussion in the comments you have a web-architecture problem that isn't really Sitecore specific.
You have a back end which consumes several data items to produce some HTML which is sent to the client. Each of those data items may produce links in the HTML. They may produce identical links. Only one of the items is considered the source of the HTML page.
You wan't to know which of those items produced the link. Your only option is to find a way of identifying the links produced. To do this you will have to add some form of tagging information to the URL produced(such as a querystring) that can be interpretted when the request for the URL is processed. The items themselves don't exist in the client.
The problem would be exactly the same if your links were produced by a database query. If you wanted to know which record produced the link you'd have to add an identifier to the link.
You could probably devise a system that would allow you to identify item most of the time (i.e. when the link clicked on was unique to that page), but it would involve either caching lots of data in a session (list of links produced and the items that produced them) or recreating the request for the referring URL. Either sounds like a lot of hassle for a non-perfect solution that could feasibly slow your server down a fair amount.
James is correct... your original parameters are basically impossible to satisfy.
With some hacking and replacing of the standard Sitecore providers though, you could track these. But it would be far easier to use a querystring ID of some sort.
On our system, we have 3rd party advertising links... they have client javascript which actually submits the request to a local page and then gets redirected to the target URL. So when you hover over the link, the status bar shows you "http://whatever.com"... it appears the link is going to whatever.com, but you are actually going to http://ourserver/redirect.aspx first so we can track that link, and then getting a Response.Redirect().
You could do something similar by providing your own LinkManager and including the generating item ID in the tracking URL, then redirecting to the actual page/item the user wants.
However... this seems rather convoluted and error-prone, and I would not recommend it.

How can I scrape a page that contains data updated with JavaScript after page load?

Im trying to scrape a page. Everything is ok, but when values are updated, the sourse code of page is still the same for a minute. Even when i refresh a page with slow internet connection, first i see old data, and only after page gets fully loaded values are current.
I guess javascript updates them. BUt it still has to download them somehow.
How can i get current values?
I write my program in C#, but if you have some ideas/advices/examples language doesnt really matter.
Thank you.
You're right - javascript is probably updating the data after load.
I could think of three ways to handle this:
Use a webbrowser control - I guess your using the HttpWebRequest object to retrieve values from the site. This won't work if you need to let the javascript to run. You can use the webbrowser control, let the javascript run and retrieve values from the DOM. Only thing I don't like about this approach is it feels like a hack and probably too clunky for prod applications. You also need to know when to read the contents of the DOM (an update might be inprogress in the background). Google "C# WebBrowser Control Read DOM Programmatically" or you can read more about that here.
I personally prefer this over the previous but it doesn't work all the time. First you need to inspect the website from firebug or something and see which urls are called from the background. Say for example the site is updating stock quotes using javascript. Most likely, its using an asynchronous request to retrieving the updated information from a webservice. Using firebug, you can view this under NET>XHR. Now is the hard part. Well, take a look at the request and the values returned. The idea is, you can try to retrieve the values your self and parse the contents - which can be a lot easier than scraping a page. The problem is, you would need to do a bit of reverse engineering to get it right. You might also encounter problems with authentication and/or encryption.
Lastly and my most preferred solution is asking the owner [of the site your are scraping] directly.
I think the WebBrowser control approach is probably OK and doesn't depend on third party libraries. Here is what I intend to use and it solves the problem of waiting for the page to complete loading:
private string ReadPage(string Link)
{
using (var client = new WebClient())
{
this.wbrwPages.Navigate(Link);
while (this.wbrwPages.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
}
ReadPage = this.wbrwPages.DocumentText;
}
}
I will get information out of the HTML through some form of DOM or XPath treatment. I am curious if others will have comments about entering the 'while' loop and depending upon the 'complete' state to get me out of it. I may put a timer of some sort in there as well - just to be safe.

Categories

Resources