Downloading Large Google Drive files with WebClient in C#

I know there are tons of questions on this subject already. After reading all the threads, I decided to extract the redirected URL from the confirmation HTML page and then use it as a direct link to download.
As you know, the original format of the direct download link looks like this:
https://drive.google.com/uc?export=download&id=XXXXX..
But if the target file is large, it looks like this:
https://drive.google.com/uc?export=download&confirm=RRRR&id=XXXXX..
I can get RRRR from the first downloaded data, so I need two requests to download the real file. The concept is simple enough, but I can't get it to work.
class Test
{
    class MyWebClient : WebClient
    {
        CookieContainer c = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri u)
        {
            var r = (HttpWebRequest)base.GetWebRequest(u);
            r.CookieContainer = c;
            return r;
        }
    }

    static string GetRealURL(string filename)
    {
        // Some Jobs to Parse....
        return directLink;
    }

    static void Main()
    {
        MyWebClient wc = new MyWebClient();
        string targetLink = "https://drive.google.com/uc?export=download&id=XXXXXXX";
        wc.DownloadFile(targetLink, "tempFile.tmp");
        targetLink = GetRealURL("tempFile.tmp");
        wc.DownloadFile(targetLink, "realFile.dat");
    }
}
What did I do wrong?
I can get the right download link from the first file, but on the second try I get another confirmation page with a different confirm code. I thought this was because of cookies, so I created my own WebClient class, as you can see above.
I also originally used DownloadFileAsync(), and changed it to DownloadFile() just in case, but got the same result.
I still think it has something to do with cookies.
What am I missing here?

I had this same problem but had solved it with HttpClient. I tried your approach with WebClient and was able to get it to work. You don't show your GetRealURL() source, but I'm willing to bet the issue lies in there. Here's how I did it:
You need to parse the HTML response to get the URL in the href attribute of the "Download anyway" button. It will only be a relative URL (the /uc?export=download... part).
You need to replace the XML escape sequence &amp; with &.
Then you can build the URL using the domain https://drive.google.com.
At which point you can download the file. Here's the source (used in a test WPF application):
class MyWebClient : WebClient
{
    CookieContainer c = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri u)
    {
        var r = (HttpWebRequest)base.GetWebRequest(u);
        r.CookieContainer = c;
        return r;
    }
}

private async void WebClientTestButtonGdrive_Click(object sender, RoutedEventArgs e)
{
    using (MyWebClient client = new MyWebClient())
    {
        // get the warning page
        string htmlPage = await client.DownloadStringTaskAsync("https://drive.google.com/uc?id=XXXXXXX&export=download");

        // use HtmlAgilityPack to get the url with the confirm parameter in it
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(htmlPage);
        HtmlNode node = document.DocumentNode;
        HtmlNode urlNode = node.SelectSingleNode(@"//a[contains(@href, 'XXXXXXX') and contains(@id, 'uc-download-link')]//@href");
        string downloadUrl = urlNode.Attributes["href"].Value;
        downloadUrl = downloadUrl.Replace("&amp;", "&");
        downloadUrl = "https://drive.google.com" + downloadUrl;

        // download the file
        if (File.Exists("FileToDownload.zip"))
            File.Delete("FileToDownload.zip");
        await client.DownloadFileTaskAsync(downloadUrl, "FileToDownload.zip");
    }
}
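Since the answer mentions first solving this with HttpClient but never shows that version, here is a rough sketch of what it could look like, assuming the same confirmation-page structure; the file id and output name are placeholders:
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class GDriveHttpClientSketch
{
    static async Task Main()
    {
        // Share one CookieContainer across both requests, just like the WebClient version.
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        using (var client = new HttpClient(handler))
        {
            // First request: large files return the confirmation page instead of the bytes.
            string html = await client.GetStringAsync("https://drive.google.com/uc?id=XXXXXXX&export=download");

            // Parse the "Download anyway" link the same way as above.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode urlNode = doc.DocumentNode.SelectSingleNode(@"//a[contains(@id, 'uc-download-link')]");
            string downloadUrl = "https://drive.google.com" + urlNode.Attributes["href"].Value.Replace("&amp;", "&");

            // Second request: the cookies captured by the handler are sent automatically,
            // so this one returns the actual file.
            byte[] data = await client.GetByteArrayAsync(downloadUrl);
            File.WriteAllBytes("FileToDownload.zip", data);
        }
    }
}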

Related

Working cefsharp POST example

I'm having a difficult time getting an HTTP POST request/response using CefSharp / ChromiumWebBrowser. I'm unable to find a working example on Stack Overflow or in the documentation, so I'm looking to see if anyone has a full example. I'm stuck on whether it can be done with a Navigate function (as shown in one example), or needs to be done with a handler / scheme.
I'm trying a basic POST to a PHP script. If data1/data2 match the input, it returns JSON status:success, otherwise failure. I see in the devtools that the HTML body comes back with the JSON success, but this code returns nothing at all. I've tried two different ways to get the response data. I want to grab the JSON response for the C# code to review. Surely there should be an easy way to accomplish this? I want to send an HTTP request and then get the body (JSON) to parse. If this needs the scheme/handler approach, I cannot find a full example of using it.
namespace BrowserTest
{
    public partial class MainForm : Form
    {
        ChromiumWebBrowser browser = null;

        public MainForm()
        {
            browser = new ChromiumWebBrowser("http://localhost/test/"); // Initialize to this page
            pBrowserLogin.Controls.Add(browser);
        }

        private void btnTest_Click(object sender, EventArgs e)
        {
            byte[] request = Encoding.ASCII.GetBytes("data1=" + txtData1.Text + "&data2=" + txtData2.Text);
            PostTest.Navigate(browser, "http://localhost/test/posttest.php", request, "application/x-www-form-urlencoded");
        }
    }

    public static class PostTest
    {
        public static void Navigate(this IWebBrowser browser, string url, byte[] postDataBytes, string contentType)
        {
            IFrame frame = browser.GetMainFrame();
            IRequest request = frame.CreateRequest();
            request.Url = url;
            request.Method = "POST";
            request.InitializePostData();
            var element = request.PostData.CreatePostDataElement();
            element.Bytes = postDataBytes;
            request.PostData.AddElement(element);
            NameValueCollection headers = new NameValueCollection();
            headers.Add("Content-Type", contentType);
            request.Headers = headers;
            frame.LoadRequest(request);

            // Attempt 1: read the frame text (this continuation runs right away,
            // before the POST response has loaded)
            frame.GetTextAsync().ContinueWith(taskHtml =>
            {
                var html = taskHtml.Result;
                System.Windows.Forms.MessageBox.Show(html);
            });

            // Attempt 2: evaluate script for the full HTML (also runs before the load finishes)
            string script = string.Format("document.documentElement.outerHTML;");
            frame.EvaluateScriptAsync(script).ContinueWith(x =>
            {
                var response = x.Result;
                if (response.Success && response.Result != null)
                {
                    var fullhtml = response.Result;
                    System.Windows.Forms.MessageBox.Show(fullhtml.ToString());
                }
            });
        }
    }
}
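One hedged guess at the missing piece: both attempts above read the frame's content immediately after LoadRequest, before the POST response has finished loading. Subscribing to CefSharp's FrameLoadEnd event and reading the body there avoids that race; a minimal sketch (the JSON parsing itself is left out):
// Subscribe once, e.g. right after constructing the ChromiumWebBrowser.
browser.FrameLoadEnd += async (sender, args) =>
{
    // Only react to the main frame finishing its load.
    if (args.Frame.IsMain)
    {
        // By now the POST response has rendered, so GetTextAsync returns the body.
        string body = await args.Frame.GetTextAsync();
        // body should contain the JSON returned by posttest.php; parse it here.
    }
};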

retrieving php variable in c#

I have a PHP script which redirects the user to a file download. Upon viewing this page in a web browser I am automatically prompted for a location to save the file, with the correct filename and extension inside the SaveFileDialog.
I wish to download this file using an application written in C#. How can I retrieve the filename and extension of the file that is included in the response from the PHP script?
I think I have to read the PHP variable, but I have not found the correct method to read it.
The PHP variables in which I am storing the filename and extension are $file and $ext respectively.
I've read several questions here, but I'm confused. Some users mention WebClient, others HttpWebRequest.
Can you point me in the correct direction?
Take a look here, where the process of downloading and saving a file is described.
Here's how to get file name from the request response headers:
String header = client.ResponseHeaders["content-disposition"];
String filename = new ContentDisposition(header).FileName;
One more note: client here is a WebClient instance.
The full solution:
As it turned out, your server uses authentication. That's why, in order to download the file, we have to pass authentication first. So please include full details in your question. Here's the code:
private class CWebClient : WebClient
{
    public CWebClient()
        : this(new CookieContainer())
    { }

    public CWebClient(CookieContainer c)
    {
        this.CookieContainer = c;
    }

    public CookieContainer CookieContainer { get; set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest)
        {
            (request as HttpWebRequest).CookieContainer = this.CookieContainer;
        }
        return request;
    }
}

static void Main(string[] args)
{
    var client = new CWebClient();
    client.BaseAddress = @"http://forum.tractor-italia.net/";

    var loginData = new NameValueCollection();
    loginData.Add("username", "demodemo");
    loginData.Add("password", "demodemo");
    loginData.Add("login", "Login");
    loginData.Add("redirect", "download/myfile.php?id=1622");
    client.UploadValues("ucp.php?mode=login", null, loginData);

    string remoteUri = "http://forum.tractor-italia.net/download/myfile.php?id=1622";
    client.OpenRead(remoteUri);

    string fileName = String.Empty;
    string contentDisposition = client.ResponseHeaders["content-disposition"];
    if (!string.IsNullOrEmpty(contentDisposition))
    {
        string lookFor = @"=";
        int index = contentDisposition.IndexOf(lookFor, 0);
        if (index >= 0)
            fileName = contentDisposition.Substring(index + lookFor.Length + 7);
    }
    // header looks like: attachment; filename*=UTF-8''JohnDeere6800.zip
    client.DownloadFile(remoteUri, fileName);
}
On my PC that works.
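A side note on that magic + 7: it skips the UTF-8'' prefix of the RFC 5987 filename*= form by hand. A minimal, hypothetical helper (the name is made up; only BCL string APIs are used) that parses the same header less brittlely might look like this:
// Extracts the file name from a header such as:
//   attachment; filename*=UTF-8''JohnDeere6800.zip
static string ParseRfc5987FileName(string contentDisposition)
{
    const string marker = "filename*=";
    int start = contentDisposition.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
    if (start < 0) return string.Empty;
    string value = contentDisposition.Substring(start + marker.Length);
    // The RFC 5987 value format is: charset'language'percent-encoded-value
    string[] parts = value.Split(new[] { '\'' }, 3);
    string encoded = parts.Length == 3 ? parts[2] : value;
    return Uri.UnescapeDataString(encoded.Trim('"'));
}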

C# Console/Server access to web site

I am working on a C# project where I need to get data from a secured web site that does not have an API or web services. My plan is to log in, get to the page I need, and parse the HTML to get to the data bits I need to log to a database. Right now I'm testing with a console app, but eventually this will be converted to an Azure Service Bus application.
In order to get to anything, you have to log in at their login.cfm page, which means I need to fill the username and password inputs on the page and click the submit button, then navigate to the page I need to parse.
Since I don't have a 'browser' to parse for controls, I am trying to use various C# .NET classes to get to the page, set the username and password, and click submit, but nothing seems to work.
Any examples I can look at, or .NET classes I should be reviewing that were designed for this sort of project?
Thanks!
Use the WebClient class in System.Net.
For persistence of the session cookie you'll have to make a custom WebClient class.
#region webclient with cookies
public class WebClientX : WebClient
{
    public CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri location)
    {
        WebRequest req = base.GetWebRequest(location);
        if (req is HttpWebRequest)
            (req as HttpWebRequest).CookieContainer = cookies;
        return req;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse res = base.GetWebResponse(request);
        if (res is HttpWebResponse)
            cookies.Add((res as HttpWebResponse).Cookies);
        return res;
    }
}
#endregion
Use a browser add-on like Firebug or the development tools built into Chrome to capture the HTTP POST data being sent when you submit a form. Send those POSTs using the WebClientX class and parse the response HTML.
The fastest way to parse HTML when you already know the format is a simple Regex.Match. So you'd step through the actions in your browser with the development tools to record your POSTs, URLs, and HTML content, then perform the same tasks using WebClientX.
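As a toy illustration of that Regex.Match approach (the tag and id are made-up examples, not from the site in question):
using System.Text.RegularExpressions;

static class HtmlScrape
{
    // Pulls the inner text out of a known, fixed-format element such as
    // <span id="balance">123.45</span>. Only reliable when you know the exact markup.
    public static string ExtractSpan(string html, string id)
    {
        Match m = Regex.Match(
            html,
            $@"<span id=""{Regex.Escape(id)}"">([^<]*)</span>",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}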
OK, so here is the complete code to log in to one page, then read from a second page after the login.
class Program
{
    static void Main(string[] args)
    {
        string uriString = "http://www.remotesite.com/login.cfm";
        // Create a new WebClient instance.
        WebClientX myWebClient = new WebClientX();

        // Create a new NameValueCollection instance to hold the custom parameters to be posted to the URL.
        NameValueCollection myNameValueCollection = new NameValueCollection();
        // Add the necessary parameter/value pairs to the name/value container.
        myNameValueCollection.Add("userid", "myname");
        myNameValueCollection.Add("mypassword", "mypassword");

        Console.WriteLine("\nUploading to {0} ...", uriString);
        // The UploadValues(String, NameValueCollection) method implicitly sets HTTP POST as the request method.
        byte[] responseArray = myWebClient.UploadValues(uriString, myNameValueCollection);

        // Decode and display the response.
        Console.WriteLine("\nResponse received was :\n{0}", Encoding.ASCII.GetString(responseArray));
        Console.WriteLine("\n\n\n pausing...");
        Console.ReadKey();

        // Go to a 2nd page on the site to get additional data; the session cookie is reused.
        Stream myStream = myWebClient.OpenRead("https://www.remotesite.com/status_results.cfm?t=8&prog=d");
        Console.WriteLine("\nDisplaying Data :\n");
        StringBuilder sb = new StringBuilder();
        using (StreamReader reader = new StreamReader(myStream, System.Text.Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                sb.Append(line + "\r\n");
            }
        }
        using (StreamWriter outfile = new StreamWriter(@"Logfile1.txt"))
        {
            outfile.Write(sb.ToString());
        }
        Console.WriteLine(sb.ToString());
        Console.WriteLine("\n\n\n pausing...");
        Console.ReadKey();
    }
}

public class WebClientX : WebClient
{
    public CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri location)
    {
        WebRequest req = base.GetWebRequest(location);
        if (req is HttpWebRequest)
            (req as HttpWebRequest).CookieContainer = cookies;
        return req;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse res = base.GetWebResponse(request);
        if (res is HttpWebResponse)
            cookies.Add((res as HttpWebResponse).Cookies);
        return res;
    }
}

How to fetch webpage title and images from URL?

I want to fetch a website's title and images from a URL, the way facebook.com does when you paste a link. How do I get the images and website title from a third-party link?
Use Html Agility Pack. This is sample code to get the title:
using System;
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    string url = @"http://www.veranomovistar.com.pe/";
    System.Net.WebClient wc = new System.Net.WebClient();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(wc.OpenRead(url));
    var metaTags = doc.DocumentNode.SelectNodes("//title");
    if (metaTags != null)
    {
        string title = metaTags[0].InnerText;
    }
}
If you have any doubts, post a comment.
At a high level, you just need to send a standard HTTP request to the desired URL. This will get you the site's markup. You can then inspect the markup (either by parsing it into a DOM object and then querying the DOM, or by running some simple regexp's/pattern matching to find the things you are interested in) to extract things like the document's <title> element and any <img> elements on the page.
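To make that concrete, here is a small sketch of the DOM-parsing route using Html Agility Pack (already used in the answer above); the URL is a placeholder:
using System;
using System.Net;
using HtmlAgilityPack;

class TitleAndImages
{
    static void Main()
    {
        using (var wc = new WebClient())
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(wc.DownloadString("http://example.com/"));

            // The document's <title> element
            var titleNode = doc.DocumentNode.SelectSingleNode("//title");
            Console.WriteLine(titleNode != null ? titleNode.InnerText.Trim() : "(no title)");

            // The src attribute of every <img> on the page
            var imgs = doc.DocumentNode.SelectNodes("//img[@src]");
            if (imgs != null)
                foreach (var img in imgs)
                    Console.WriteLine(img.GetAttributeValue("src", ""));
        }
    }
}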
Off the top of my head, I'd use an HttpWebRequest to fetch the page and parse the title out myself, then use further HttpWebRequests to fetch any images referenced on the page. There's a darn good chance though that there's a better way to do this and somebody will come along and tell you what it is. If not, it'd look something like this:
HttpWebResponse response = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(<your URL here>);
    response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    // use the StreamReader object to get the page data and parse out the title as well as
    // getting locations of any images you need to get
}
catch
{
    // handle exceptions
}
finally
{
    if (response != null)
    {
        response.Close();
    }
}
Probably the dumb way to do it, but that's my $0.02.
You just have to write JavaScript in the source body.
For example, if you are using a master page, you just have to write the code on the master page, and it is reflected on all the pages.
You can also use the image Url property in that script.

get data from page that a url points to

Given a URL, I'd like to be able to capture the title of the page that URL points to, as well as other info: e.g. a snippet of text from the first paragraph of the page, maybe even an image from the page.
Digg.com does this nicely when you submit a URL.
How could something like this be done in .NET / C#?
You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .Net's WebClient class.
However, it won't help you parse the HTML.
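A minimal sketch of the HtmlWeb route mentioned above, which downloads and parses in one step (the URL is a placeholder):
using System;
using HtmlAgilityPack;

class HtmlWebExample
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/"); // fetch over HTTP and parse
        var title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine(title != null ? title.InnerText : "(no title)");
    }
}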
You could try something like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            ASCIIEncoding asc = new ASCIIEncoding();
            WebRequest wrq = WebRequest.Create("http://localhost");
            WebResponse wrp = wrq.GetResponse();
            byte[] responseBuf = new byte[wrp.ContentLength];
            int status = wrp.GetResponseStream().Read(responseBuf, 0, responseBuf.Length);
            Console.WriteLine(asc.GetString(responseBuf));
        }
    }
}
Once you have the buffer, you can process it looking for paragraph or image HTML tags to extract portions of the returned data.
You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text, but since each page is different, that may prove difficult. You could, however, look for a meta description tag and take the value from that.
public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;

    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;

    // Use the user's credentials
    request.UseDefaultCredentials = true;

    // Obtain a response from the server; if there was an error, return nothing
    HttpWebResponse response = null;
    try { response = request.GetResponse() as HttpWebResponse; }
    catch (WebException) { return null; }

    // Regular expression for an HTML title
    string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";

    // If the correct HTML header exists for HTML text, continue
    if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
        if (response.Headers["Content-Type"].StartsWith("text/html"))
        {
            // Download the page
            WebClient web = new WebClient();
            web.UseDefaultCredentials = true;
            string page = web.DownloadString(url);

            // Extract the title
            Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
            return ex.Match(page).Value.Trim();
        }

    // Not a valid HTML page
    return null;
}
You could use Selenium RC (open source, www.seleniumhq.org) to parse data etc. from the pages. It is a web test automation tool with a C# .NET library.
Selenium has a full API to read out specific items on an HTML page.
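Note that Selenium RC is the legacy API; a rough sketch with the current Selenium WebDriver C# bindings (a substitution, not what the answer showed) looks like this:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        // Requires the Selenium.WebDriver package and a chromedriver binary on PATH.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/"); // placeholder URL
            Console.WriteLine(driver.Title); // the page <title>
            foreach (IWebElement img in driver.FindElements(By.TagName("img")))
                Console.WriteLine(img.GetAttribute("src"));
        }
    }
}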
