Forgive my ignorance on the subject
I am using
string p = "http://" + textBox2.Text;
string r = textBox3.Text;
System.Net.WebClient webclient = new System.Net.WebClient();
webclient.DownloadFile(p, r);
to download a webpage. Can you please help me enhance the code so that it downloads the entire website? I tried HTML screen scraping, but it only returns the href links from the index.html page. How do I proceed?
Thanks
Scraping a website is actually a lot of work, with a lot of corner cases.
Invoke wget instead. The manual explains how to use the "recursive retrieval" options.
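If you want to drive wget from your C# code, here is a rough sketch. It assumes wget is installed and on the PATH, and that textBox2/textBox3 hold the URL and the target directory as in your snippet; adjust the options to taste (the wget manual covers the full recursive-retrieval set).
using System.Diagnostics;

// Rough sketch: shell out to wget for recursive retrieval.
string url = "http://" + textBox2.Text;
string targetDir = textBox3.Text;

ProcessStartInfo psi = new ProcessStartInfo();
psi.FileName = "wget";
// -r: recursive, -p: page requisites (images/css), -k: convert links,
// -P: directory prefix to save into
psi.Arguments = "-r -p -k -P \"" + targetDir + "\" \"" + url + "\"";
psi.UseShellExecute = false;
psi.CreateNoWindow = true;

using (Process wget = Process.Start(psi))
{
    wget.WaitForExit();
}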
protected string GetWebString(string url)
{
    string appURL = url;
    HttpWebRequest wrWebRequest = WebRequest.Create(appURL) as HttpWebRequest;
    HttpWebResponse hwrWebResponse = (HttpWebResponse)wrWebRequest.GetResponse();
    StreamReader srResponseReader = new StreamReader(hwrWebResponse.GetResponseStream());
    string strResponseData = srResponseReader.ReadToEnd();
    srResponseReader.Close();
    return strResponseData;
}
This puts the webpage into a string from the supplied URL.
You can then use REGEX to parse through the string.
This little piece gets specific links out of Craigslist and adds them to an ArrayList. Modify it to your purpose.
protected ArrayList GetListings(int pages)
{
    ArrayList list = new ArrayList();
    string page = GetWebString("http://albany.craigslist.org/bik/");
    // Capture the relative href into the LINK group (the original snippet captured
    // only the title, but the loop below needs the link); adjust to the actual markup.
    MatchCollection listingMatches = Regex.Matches(page, @"<p>\s*<a href=""(?<LINK>[^""]+)"">(?<TITLE>[^<]+)</a>");
    foreach (Match m in listingMatches)
    {
        list.Add("http://albany.craigslist.org" + m.Groups["LINK"].Value);
    }
    return list;
}
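To tie this back to the original question, once you have the links you could save each page with WebClient.DownloadFile. A rough sketch (the file-naming scheme here is only an illustration):
// Rough sketch: download every page collected by GetListings above.
ArrayList links = GetListings(1);
System.Net.WebClient client = new System.Net.WebClient();
int i = 0;
foreach (string link in links)
{
    // The file name is illustrative only; use whatever scheme you need.
    client.DownloadFile(link, "page" + i + ".html");
    i++;
}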
I know there are tons of questions on this subject already. After reading all the threads, I decided to get the redirected URL from the confirmation HTML page and then use it as a direct download link.
As you know, the original URL format of the direct download link is like this.
https://drive.google.com/uc?export=download&id=XXXXX..
But if the size of the target file is big, then it is like this.
https://drive.google.com/uc?export=download&confirm=RRRR&id=XXXXX..
I can get RRRR from the first downloaded data, so I need to try twice in order to download the real file. The concept is simple enough, but I can't get it to work.
class Test
{
    class MyWebClient : WebClient
    {
        CookieContainer c = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri u)
        {
            var r = (HttpWebRequest)base.GetWebRequest(u);
            r.CookieContainer = c;
            return r;
        }
    }

    static string GetRealURL(string filename)
    {
        // Some Jobs to Parse....
        return directLink;
    }

    static void Main()
    {
        MyWebClient wc = new MyWebClient();
        string targetLink = "https://drive.google.com/uc?export=download&id=XXXXXXX";
        wc.DownloadFile(targetLink, "tempFile.tmp");
        targetLink = GetRealURL("tempFile.tmp");
        wc.DownloadFile(targetLink, "realFile.dat");
    }
}
What did I do wrong?
I can get the right download link from the first file, but I get another confirmation page file with another confirm code on the second try. I thought this was because of cookies, so I created my own WebClient class as you can see above.
Also, I originally used DownloadFileAsync() and changed it to DownloadFile() just in case, but I get the same result.
I'm still thinking it has something to do with cookie things.
What am I missing here?
I had this same problem but had solved it with HttpClient. I tried your approach with WebClient and was able to get it to work. You don't show your GetRealURL() source, but I'm willing to bet the issue lies in there. Here's how I did it:
You need to parse the HTML response to get the URL in the href attribute of the "download anyway" button. It will only have the relative URL (the /uc?export=download... part).
You need to replace the XML escape sequence &amp; with &.
Then you can build the URL using the domain https://drive.google.com
At which point you can download the file. Here's the source (used in a test WPF application):
class MyWebClient : WebClient
{
    CookieContainer c = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri u)
    {
        var r = (HttpWebRequest)base.GetWebRequest(u);
        r.CookieContainer = c;
        return r;
    }
}
private async void WebClientTestButtonGdrive_Click(object sender, RoutedEventArgs e)
{
    using (MyWebClient client = new MyWebClient())
    {
        // get the warning page
        string htmlPage = await client.DownloadStringTaskAsync("https://drive.google.com/uc?id=XXXXXXX&export=download");

        // use HtmlAgilityPack to get the url with the confirm parameter in it
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(htmlPage);
        HtmlNode node = document.DocumentNode;
        HtmlNode urlNode = node.SelectSingleNode(@"//a[contains(@href, 'XXXXXXX') and contains(@id, 'uc-download-link')]");
        string downloadUrl = urlNode.Attributes["href"].Value;
        downloadUrl = downloadUrl.Replace("&amp;", "&");
        downloadUrl = "https://drive.google.com" + downloadUrl;

        // download the file
        if (File.Exists("FileToDownload.zip"))
            File.Delete("FileToDownload.zip");
        await client.DownloadFileTaskAsync(downloadUrl, "FileToDownload.zip");
    }
}
I am trying to get the translated text using Google Translate's API.
public JsonResult getCultureMeaning(string word, string langcode)
{
    string url = "https://translate.google.com/#en/" + langcode + "/" + word;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);
    string m = "";
    foreach (HtmlNode node in doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes)
    {
        m += node.InnerHtml;
    }
    return Json(m, JsonRequestBehavior.AllowGet);
}
In the method above I pass parameters; say word is Welcome and langcode is hi.
So I would have the URL https://translate.google.com/#en/hi/welcome and the result is आपका स्वागत है
But when I select the result container with its child nodes, doc.DocumentNode.SelectSingleNode("//span[@id='result_box']").ChildNodes, it does not find the result container in the response, so I can't get this to work.
Edit:
The result container from the URL:
<span id="result_box" class="short_text" lang="hi"><span class="hps">आपका स्वागत है</span></span>
How should I approach this to get it working? For reference, I am using HtmlAgilityPack.
If you inspect the page requests, you might notice that the actual translation request is done via AJAX. A sample query for your translation is: https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi&hl=en&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&dt=sw&ie=UTF-8&oe=UTF-8&ssel=0&tsel=0&q=welcome
It returns JSON; you can inspect it and get what you're looking for (the data is pretty big, so I won't post it here).
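If you would rather call that endpoint directly than scrape the page, here is a rough sketch with WebClient. The response format is undocumented and may change, so the string handling below is only an assumption (the translated text tends to appear as the first quoted string in the nested arrays):
// Rough sketch: call the AJAX endpoint directly and read the raw JSON.
using (WebClient client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    string json = client.DownloadString(
        "https://translate.google.com/translate_a/single?client=t&sl=en&tl=hi" +
        "&dt=t&ie=UTF-8&oe=UTF-8&q=welcome");

    // Crude extraction of the first quoted string; a real JSON parser
    // (e.g. Json.NET) would be more robust.
    int start = json.IndexOf('"') + 1;
    int end = json.IndexOf('"', start);
    string translation = json.Substring(start, end - start);
}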
The Agility Pack only gets back the document elements; it cannot fetch content that is filled in after an AJAX request completes. Thanks to @Uriil for shedding light on this issue.
However, I was able to manage it the traditional way using WebClient.
Here is what I did:
public JsonResult getCultureMeaning(string word, string langcode)
{
    string languagePair = "en|" + langcode;
    string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", word, languagePair);
    WebClient webClient = new WebClient();
    webClient.Encoding = System.Text.Encoding.UTF8;
    string result = webClient.DownloadString(url);
    // Crude parsing: pull the translated text out of the returned markup.
    result = result.Substring(result.IndexOf("<span title=\"") + "<span title=\"".Length);
    result = result.Substring(result.IndexOf(">") + 1);
    result = result.Substring(0, result.IndexOf("</span>"));
    result = HttpUtility.HtmlDecode(result.Trim());
    return Json(result, JsonRequestBehavior.AllowGet);
}
It works for every culture pair except en|en; in that case it returns the whole HTML document along with the result.
I have this piece of code:
WebClient web = new WebClient();
System.IO.Stream stream = web.OpenRead("http://url/getAddress.html");
string text = "";
using (System.IO.StreamReader reader = new System.IO.StreamReader(stream))
{
    text = reader.ReadToEnd();
    reader.Close();
}
The result of this in HTML is an IP address, but when I try to save this result to a database, what gets saved is the whole HTML page from the web request.
What am I doing wrong?
Example:
text holds the result.
If I do a Response.Write(text); it returns: 111.222.33.3
If I try to save the variable text, which holds that value, the whole HTML content of the requested page is saved instead.
You realize that if you write to the Response object, that markup just gets rendered on the web page? That simply re-emits the HTML you received. You need to parse the HTML you get back to extract the actual data you're looking for, in the format you want.
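For example, if the page body is essentially just an IP address wrapped in markup, you could pull it out with a regular expression before saving it. A rough sketch, reusing the text variable from your snippet and assuming a plain IPv4 address appears somewhere in the response:
// Rough sketch: extract an IPv4 address from the downloaded HTML,
// then save only that value rather than the whole page.
// Requires: using System.Text.RegularExpressions;
Match ipMatch = Regex.Match(text, @"\b\d{1,3}(\.\d{1,3}){3}\b");
if (ipMatch.Success)
{
    string ipAddress = ipMatch.Value; // e.g. "111.222.33.3"
    // save ipAddress to the database instead of 'text'
}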
public static String GetIP()
{
    String ip = HttpContext.Current.Request.ServerVariables["HTTP_X_FORWARDED_FOR"];
    if (string.IsNullOrEmpty(ip))
    {
        ip = HttpContext.Current.Request.ServerVariables["REMOTE_ADDR"];
    }
    return ip;
}
I've found this solution; it's perfect for what I want.
I want to fetch the website title and images from a URL,
as facebook.com does. How do I get the images and website title from a third-party link?
Use the Html Agility Pack. This is sample code to get the title:
using System;
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = @"http://www.veranomovistar.com.pe/";
    System.Net.WebClient wc = new System.Net.WebClient();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(wc.OpenRead(url));
    var metaTags = doc.DocumentNode.SelectNodes("//title");
    if (metaTags != null)
    {
        string title = metaTags[0].InnerText;
    }
}
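For the images, a similar query over the img nodes of the same document should work. A rough sketch (src values may be relative, so you may still need to resolve them against the page URL):
// Rough sketch: collect image URLs from the document loaded above.
// Requires: using System.Collections.Generic;
var imageNodes = doc.DocumentNode.SelectNodes("//img[@src]");
List<string> imageUrls = new List<string>();
if (imageNodes != null)
{
    foreach (HtmlNode img in imageNodes)
    {
        // src may be relative; resolve against the page URL if needed.
        imageUrls.Add(img.GetAttributeValue("src", string.Empty));
    }
}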
If you have any doubts, post a comment.
At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexp's/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
Off the top of my head, I'd use an HttpWebRequest to go get the page and parse the title out myself, then use further HttpWebRequests in order to go get any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(<your URL here>);
    response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    // use the StreamReader object to get the page data and parse out the title as well as
    // getting locations of any images you need to get
}
catch
{
    // handle exceptions
}
finally
{
    if (response != null)
    {
        response.Close();
    }
}
Probably the dumb way to do it, but that's my $0.02.
You just have to write it using JavaScript in the source body.
For example, if you are using a master page, you can put the code on the master page so that it is reflected on all pages.
You can also use the image URL property in that script.
Given a URL, I'd like to be able to capture the title of the page the URL points to, as well
as other info, e.g. a snippet of text from the first paragraph on the page, and maybe even an image from the page.
Digg.com does this nicely when you submit a url.
How could something like this be done in .Net c#?
You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .Net's WebClient class.
However, it won't help you parse the HTML.
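A minimal sketch with HtmlWeb (the URL here is just a placeholder):
// Minimal sketch: load a page with the Html Agility Pack's HtmlWeb
// and read its <title> element.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");
HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
string title = titleNode != null ? titleNode.InnerText : null;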
You could try something like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            ASCIIEncoding asc = new ASCIIEncoding();
            WebRequest wrq = WebRequest.Create("http://localhost");
            WebResponse wrp = wrq.GetResponse();

            byte[] responseBuf = new byte[wrp.ContentLength];
            int status = wrp.GetResponseStream().Read(responseBuf, 0, responseBuf.Length);
            Console.WriteLine(asc.GetString(responseBuf));
        }
    }
}
Once you have the buffer, you can process it looking for paragraph or image HTML tags to extract portions of the returned data.
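For instance, a rough regex pass over the markup returned by the sample above could pull out the image sources. This assumes reasonably well-formed img tags; malformed HTML is better handled by a parser such as the Html Agility Pack:
// Rough sketch: find img src values in the downloaded markup with a regex.
// Requires: using System.Text.RegularExpressions;
string html = asc.GetString(responseBuf);
MatchCollection imgMatches = Regex.Matches(
    html,
    "<img[^>]+src=[\"'](?<SRC>[^\"']+)[\"']",
    RegexOptions.IgnoreCase);
foreach (Match m in imgMatches)
{
    Console.WriteLine(m.Groups["SRC"].Value);
}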
You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text but since each page is different, that may prove difficult. You could look for a meta description tag and take the value from that, however.
public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;

    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;

    // Use the user's credentials
    request.UseDefaultCredentials = true;

    // Obtain a response from the server; if there was an error, return nothing
    HttpWebResponse response = null;
    try { response = request.GetResponse() as HttpWebResponse; }
    catch (WebException) { return null; }

    // Regular expression for an HTML title
    string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";

    // If the correct HTML header exists for HTML text, continue
    if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
        if (response.Headers["Content-Type"].StartsWith("text/html"))
        {
            // Download the page
            WebClient web = new WebClient();
            web.UseDefaultCredentials = true;
            string page = web.DownloadString(url);

            // Extract the title
            Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
            return ex.Match(page).Value.Trim();
        }

    // Not a valid HTML page
    return null;
}
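Along the same lines, a meta description could be pulled with a slightly different pattern. This is only a sketch; attribute order varies on real pages, so it assumes the name attribute appears before content:
// Sketch: extract the meta description with a regex, assuming the
// name attribute appears before the content attribute.
public static string GetMetaDescription(string page)
{
    Regex ex = new Regex(
        "<meta[^>]+name=[\"']description[\"'][^>]+content=[\"'](?<DESC>[^\"']*)[\"']",
        RegexOptions.IgnoreCase);
    Match m = ex.Match(page);
    return m.Success ? m.Groups["DESC"].Value.Trim() : null;
}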
You could use Selenium RC (open source, www.seleniumhq.org) to parse data from the pages. It is a web test automation tool with a C# .NET library.
Selenium has a full API to read out specific items on an HTML page.
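A rough sketch with the legacy Selenium RC .NET client (this assumes a Selenium server is running locally on the default port 4444; the newer WebDriver API has since replaced this approach):
// Rough sketch using the legacy Selenium RC client (namespace Selenium).
// Assumes a Selenium server is listening on localhost:4444.
ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://www.example.com/");
selenium.Start();
selenium.Open("/");
string title = selenium.GetTitle();
// XPath locators work for specific items, e.g. the first paragraph.
string firstParagraph = selenium.GetText("xpath=//p[1]");
selenium.Stop();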