I have a Windows desktop application that does web scraping on a website using the WebBrowser control.
I had to use WebBrowser because the website relies on JavaScript, so that was the only way to get the HTML content of the pages.
The program has to parse about 1500 pages, so I have implemented a task delay in order to avoid overloading the server (and possibly getting banned).
The problem is that after 50-100 parsed pages I get an out-of-memory error and the program closes.
This is the code:
private async void buttonProd_Click(object sender, EventArgs e)
{
    const string C_Prod_UrlTemplate = "http://www.mysite.it";
    var _searches = new List<Get_SiteSearchResult>();
    using (ProdDataContext db = new ProdDataContext())
    {
        _searches = db.Get_SiteSearch("PROD").ToList();
        foreach (var s in _searches)
        {
            WebBrowser wb1 = new WebBrowser();
            wb1.ScriptErrorsSuppressed = true;
            Uri uri = new Uri(String.Format(C_Prod_UrlTemplate, s.prod));
            wb1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser_DocumentCompleted);
            wb1.Url = uri;
            await Task.Delay(90 * 1000);
        }
    }
}
private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    using (ProdDataContext db = new ProdDataContext())
    {
        WebBrowser wb = (WebBrowser)sender;
        string s = wb.Document.Body.InnerHtml;
        string fName = wb.CodSite + "_" + wb.PostId + ".txt";
        File.WriteAllText(wb.FolderPath + @"LINKS\" + fName, s);
        db.Set_LinkDownloaded(wb.CodSite, wb.PostId);
    }
}
The error message is generated on this line in the webBrowser_DocumentCompleted method:
string s = wb.Document.Body.InnerHtml;
Thanks for your support.
Instead of using a control (which is a rather complex construct that requires more memory than a simple object), you can simply fetch the string (the HTML code only) associated with a URL like this:
using (WebClient wc = new WebClient())
{
    string s = wc.DownloadString(url);
    // do stuff with content
}
Of course, you should add some error handling (maybe even a retry mechanism) and insert delays to make sure you are not making too many requests per time interval.
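For illustration, a retry-plus-delay loop around that call could look roughly like this (just a sketch; DownloadPagesAsync is a hypothetical helper, and the 90-second delay mirrors the one in your code):

// Requires System.Net, System.Collections.Generic and System.Threading.Tasks.
private async Task DownloadPagesAsync(IEnumerable<string> urls)
{
    using (WebClient wc = new WebClient())
    {
        foreach (string url in urls)
        {
            for (int attempt = 1; attempt <= 3; attempt++)
            {
                try
                {
                    // Download only the HTML; no browser control is involved.
                    string html = await wc.DownloadStringTaskAsync(url);
                    // TODO: parse/save the HTML here.
                    break;
                }
                catch (WebException)
                {
                    if (attempt == 3) throw; // give up on this page after three tries
                }
            }
            // Throttle the requests so the server is not flooded.
            await Task.Delay(TimeSpan.FromSeconds(90));
        }
    }
}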
I have a C# form with a web browser control on it. I want to open a URL, e.g. (www.google.com), in a loop; each time the loop runs I want to first navigate to the URL, fill in a search string, click the search button, and wait until the search results have fully loaded.
How can I do this?
I wrote this code to save the URL that we get after the search results load, but only the search result for the last string seems to load and get saved in my list.
private void button1_Click(object sender, EventArgs e)
{
    var task = DoNavigationAsync();
    task.ContinueWith((t) =>
    {
        MessageBox.Show("Done!");
    }, TaskScheduler.FromCurrentSynchronizationContext());
}
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElement url = webBrowser1.Document.GetElementById("sb_form_q");
    if (url != null)
    {
        url.SetAttribute("value", search[searchindx - 1]);
        webBrowser1.Document.GetElementById("sb_form_go").InvokeMember("click");
    }
    if (webBrowser1.Url.ToString() != "http://www.bing.com/")
    {
        SavedUrl.Add(webBrowser1.Url.ToString());
    }
}
async Task DoNavigationAsync()
{
    TaskCompletionSource<bool> tcsNavigation = null;
    TaskCompletionSource<bool> tcsDocument = null;
    this.webBrowser1.Navigated += (s, e) =>
    {
        if (tcsNavigation.Task.IsCompleted)
            return;
        tcsNavigation.SetResult(true);
    };
    this.webBrowser1.DocumentCompleted += (s, e) =>
    {
        if (this.webBrowser1.ReadyState != WebBrowserReadyState.Complete)
            return;
        if (tcsDocument.Task.IsCompleted)
            return;
        tcsDocument.SetResult(true);
    };
    search = new string[3];
    search[0] = "C";
    search[1] = "C++";
    search[2] = "C#";
    searchindx = 0;
    foreach (string sval in search)
    {
        searchindx++;
        tcsNavigation = new TaskCompletionSource<bool>();
        tcsDocument = new TaskCompletionSource<bool>();
        webBrowser1.Navigate("www.bing.com");
        await tcsNavigation.Task;
        await tcsDocument.Task;
    }
}
Using the async HttpClient from .NET Framework 4.5, you can load a web page without using a GUI element such as WebBrowser.
A download would look like this:
using (HttpClient client = new HttpClient())
{
    string html = await client.GetStringAsync("https://google.com");
}
This would get you the HTML content of the Google search page.
But if you just want the resulting URL, you won't even need to perform a download, because Google (and most other search engines) exposes the search query directly in the URL. Note the following Google URL: https://www.google.com/search?q=google. You can see that the search string "google" appears as a parameter named "q". So if you build your code like this...
string[] search = new string[] { "C", "C++", "C#" };
foreach (string sval in search)
{
    // Encode the term so characters like '+' and '#' survive in the query string.
    string query = Uri.EscapeDataString(sval);
    // C# <= 5
    SavedUrl.Add(string.Format("https://google.com/search?q={0}", query));
    // C# 6
    SavedUrl.Add($"https://google.com/search?q={query}");
}
... you won't need any web access.
I'm using a block of code I got from a blog to upload images to Imgur using API v3.
It works fine, but I wanted to implement a progress bar so the user knows how much has been uploaded when the program deals with high-res images.
So far I haven't been able to do so.
I'm not an experienced coder, just doing this as a learning project.
The Code:
public object UploadImage(string image)
{
    WebClient w = new WebClient();
    w.UploadProgressChanged += (s, e) => { };
    w.UploadValuesCompleted += (s, e) => { };
    w.Headers.Add("Authorization", "Client-ID " + ClientId);
    System.Collections.Specialized.NameValueCollection Keys = new System.Collections.Specialized.NameValueCollection();
    try
    {
        Keys.Add("image", Convert.ToBase64String(File.ReadAllBytes(image)));
        byte[] responseArray = w.UploadValues("https://api.imgur.com/3/image", Keys);
        dynamic result = Encoding.ASCII.GetString(responseArray);
        System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex("link\":\"(.*?)\"");
        System.Text.RegularExpressions.Match match = reg.Match(result);
        string url = match.ToString().Replace("link\":\"", "").Replace("\"", "").Replace("\\/", "/");
        textBox1.Text = url;
        return url;
    }
    catch (Exception s)
    {
        MessageBox.Show("Something went wrong. " + s.Message);
        return "Failed!";
    }
}
At first I tried using the UploadProgressChanged and UploadValuesCompleted events, but they are not triggered; my theory is that they are only raised when UploadValuesAsync is called instead of UploadValues.
How do I implement a progress system?
What is the difference between async and normal transfer?
The difference between async and normal transfer is that the UploadValues method blocks the current thread until all data has been transferred. Because the thread is blocked during this time, you can't handle any events either. Therefore you have to use the asynchronous method UploadValuesAsync, which transfers the data in the background while your code keeps executing.
UploadProgressChanged likewise only fires for UploadValuesAsync. Your code should look something like this (not tested!):
public String UploadImage(string image)
{
    WebClient w = new WebClient();
    w.UploadProgressChanged += (s, e) =>
    {
        myProgressBar.Maximum = (int)e.TotalBytesToSend;
        myProgressBar.Value = (int)e.BytesSent;
    };
    w.UploadValuesCompleted += new UploadValuesCompletedEventHandler(UploadComplete);
    w.Headers.Add("Authorization", "Client-ID " + ClientId);
    System.Collections.Specialized.NameValueCollection Keys = new System.Collections.Specialized.NameValueCollection();
    try
    {
        Keys.Add("image", Convert.ToBase64String(File.ReadAllBytes(image)));
        // UploadValuesAsync only accepts a Uri, not a plain string address.
        w.UploadValuesAsync(new Uri("https://api.imgur.com/3/image"), Keys);
        return "Uploading..";
    }
    catch (Exception s)
    {
        MessageBox.Show("Something went wrong. " + s.Message);
        return "Failed!";
    }
}
public void UploadComplete(Object sender, UploadValuesCompletedEventArgs e)
{
    // The upload is done, so fill the bar completely (Maximum was set to TotalBytesToSend).
    myProgressBar.Value = myProgressBar.Maximum;
    byte[] responseArray = e.Result;
    dynamic result = Encoding.ASCII.GetString(responseArray);
    System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex("link\":\"(.*?)\"");
    System.Text.RegularExpressions.Match match = reg.Match(result);
    string url = match.ToString().Replace("link\":\"", "").Replace("\"", "").Replace("\\/", "/");
    textBox1.Text = url;
}
Edit
I moved the code that ran after the UploadValuesAsync call into the UploadValuesCompleted handler. You can find the server response in the Result property of the UploadValuesCompletedEventArgs instance that is passed to the event in the variable e.
Your UploadImage method now returns "Uploading.." as soon as the upload starts, and you have to do the rest of the work in the UploadValuesCompleted event.
I'm trying to create a web browser. Currently I'm trying to implement a feature where, if the user wants to download a file, an additional window is shown with a list of already downloaded files. If the file has already been downloaded, a message is shown (just an idea).
So far, I get a link to the file location in the main form and send it to the other form:
DownLoadFile dlf = new DownLoadFile();
...
WebBrowser wb = new WebBrowser();
wb.Navigating += new WebBrowserNavigatingEventHandler(wb_Navigating);
...
private void wb_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    ...
    if (e.Url.ToString().EndsWith(".mp3"))
    {
        dlf.DownloadPath = e.Url;
        dlf.Show();
    }
}
In the new form I try to use this link to download the file:
public Uri DownloadPath { get; set; }
...
private void DownLoadFile_Load(object sender, EventArgs e)
{
    string filePath = null;
    // get the file name from the URL
    string[] ArrayForName;
    ArrayForName = DownloadPath.ToString().Split('/');
    saveFileDialogFile.FileName =
        ArrayForName[ArrayForName.Length - 1].Replace("%", " ").Trim();
    if (saveFileDialogFile.ShowDialog() == DialogResult.OK)
    {
        WebClient client = new WebClient();
        // get the URL
        Uri url = new Uri(DownloadPath.ToString());
        // get the place where we want to save, with the default name
        filePath = saveFileDialogFile.FileName;
        // event for the result
        client.DownloadFileCompleted +=
            new System.ComponentModel.AsyncCompletedEventHandler(client_DownloadFileCompleted);
        // download
        client.DownloadFileAsync(url, filePath);
    }
}

void client_DownloadFileCompleted(object sender, System.ComponentModel.AsyncCompletedEventArgs e)
{
    MessageBox.Show("Completed");
}
My questions are:
Regarding if (e.Url.ToString().EndsWith(".mp3")): how can I change this so it detects not only when the user tries to download an mp3 file, but all types of files? Maybe there is a better way.
If I want to download a file using a direct link, I get the message "Currently you have not required permission for that". How can I change the permission level for my web browser?
If I finally do get a link to the file and start downloading it, the result is just the file name (file size 0 KB). Where am I going wrong?
My solution (maybe not the best one):
Create an event handler for the WebBrowser:
wb.Navigating += new WebBrowserNavigatingEventHandler(wb_Navigating);
and in this handler use the following:
if (GetWorkingWebBrowser().StatusText != null)
{
    try
    {
        WebRequest request = WebRequest.Create(GetWorkingWebBrowser().StatusText);
        request.Method = "HEAD";
        using (WebResponse response = request.GetResponse())
        {
            if (response.ContentLength > 0 &&
                !response.ContentType.ToString().ToLower().Contains("text/html"))
            {
                dlf.DownloadPath = e.Url; // pass the URL to my download form
                dlf.Show();               // show the form
            }
        }
    }
    catch (UriFormatException)
    {
    }
    catch (WebException)
    {
    }
}
GetWorkingWebBrowser() is a method that returns the WebBrowser that is currently active on the selected tab.
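For reference, GetWorkingWebBrowser() could look roughly like this (just a sketch, assuming a hypothetical tabControlBrowsers where every tab page hosts exactly one WebBrowser; adapt it to your actual tab layout):

// Requires System.Linq and System.Windows.Forms.
private WebBrowser GetWorkingWebBrowser()
{
    // tabControlBrowsers is a placeholder name for the TabControl that holds the browsers.
    TabPage activeTab = tabControlBrowsers.SelectedTab;
    return activeTab == null ? null : activeTab.Controls.OfType<WebBrowser>().FirstOrDefault();
}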
Let's say I have a GroupBox with several Labels. These Labels display various IP-related information; one piece of information is the external IP address of the machine.
string externalIP = "";
try
{
    WebRequest request = WebRequest.Create("http://checkip.dyndns.org/");
    request.Timeout = 3000;
    System.Threading.Tasks.Task<System.Net.WebResponse> response = request.GetResponseAsync();
    using (StreamReader stream = new StreamReader(response.Result.GetResponseStream()))
    {
        if (response.Result.ContentLength != -1)
        {
            externalIP = stream.ReadToEnd();
        }
    }
}
catch (Exception e)
{
    externalIP = "Error.";
}
if (externalIP == "")
{
    return "No service.";
}
else
{
    return externalIP = (new Regex(@"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")).Matches(externalIP)[0].ToString();
}
This method is called from the following code:
private void updateNetworkIP()
{
    string ip4e = "External IPv4: " + getExternalIPv4();
    lblIP4external.Text = ip4e;
    // Get some more info here.
}
How do I execute the code after getExternalIPv4() even when it hasn't finished yet? It works when I set a timeout like above, but sometimes the request just takes a little longer and still completes successfully. So I want to still be able to display the external IP, but continue executing the other methods that refresh the GroupBox.
The BackgroundWorker will deliver what you are after. Sample code:
BackgroundWorker bg = new BackgroundWorker();
bg.DoWork += new DoWorkEventHandler(getExternalIPv4Back);
bg.RunWorkerCompleted += new RunWorkerCompletedEventHandler(writeLabel);
bg.RunWorkerAsync();
//The code below this point will be executed while the BackgroundWorker does its work
You have to define getExternalIPv4Back as a DoWork event handler and put the code to be executed in the background inside it; likewise writeLabel as a RunWorkerCompleted event handler (required to edit the label without provoking multi-threading-related errors). That is:
private void getExternalIPv4Back(object sender, DoWorkEventArgs e)
{
    IP = "External IPv4: " + getExternalIPv4(); // IP -> globally defined variable
}

private void writeLabel(object sender, RunWorkerCompletedEventArgs e)
{
    lblIP4external.Text = IP;
}
I'm trying to get into some webpages and extract some information, using a WebBrowser control so that it remembers my login details. Things worked up to here, but for multiple URLs the WebBrowser document load is not working the way I want.
My intention was: go to a URL -> wait till it loads -> write the required data to a text file -> next URL and the same process.
I used a for loop to change the URL, but when I run it, all the URLs are passed one after another without waiting for the document to load and be written to the text file. Please help me.
private void button1_Click_1(object sender, EventArgs e)
{
    String text = File.ReadAllText("links.txt");
    var result = Regex.Split(text, "\r\n|\r|\n");
    foreach (string s in result)
    {
        listBox1.Items.Add(s);
    }
    for (int i = 0; i < listBox1.Items.Count; i++)
    {
        this.Text = Convert.ToString(i + 1) + "/" + Convert.ToString(listBox1.Items.Count);
        textBox1.Text += listBox1.Items[i];
        String url = textBox1.Text;
        webBrowser2.ScriptErrorsSuppressed = true;
        webBrowser2.DocumentCompleted += webBrowser2_DocumentCompleted;
        webBrowser2.Navigate(url);
    }
}
void webBrowser2_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    string sourceCode = webBrowser2.DocumentText;
    try
    {
        /* some regex expressions to filter the text */
        StreamWriter sw = new StreamWriter("inks_info.txt", true);
        sw.Write("url" + "~" + sourceCode + "\n");
        sw.Close();
        textBox1.Text = "";
    }
    catch
    {
        StreamWriter sw = new StreamWriter("inks_fail.txt", true);
        sw.Write(textBox1.Text + "\n");
        sw.Close();
        textBox1.Text = "";
    }
}
You have an event handler on the document load for each item, but you're not waiting for it to fire after the first navigation before you initiate the second navigation. Your for loop needs to be "more asynchronous". For example, placing items in a queue and requesting one at a time:
Queue<string> _items;

private void button1_Click_1(object sender, EventArgs e)
{
    String text = File.ReadAllText("links.txt");
    _items = new Queue<string>(Regex.Split(text, "\r\n|\r|\n"));
    webBrowser2.ScriptErrorsSuppressed = true;
    webBrowser2.DocumentCompleted += webBrowser2_DocumentCompleted;
    RequestItem();
}

private void RequestItem()
{
    if (_items.Any())
    {
        var url = _items.Dequeue(); // preprocess as required
        webBrowser2.Navigate(url);
    }
}

void webBrowser2_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Handle result
    RequestItem(); // Then request next item
}
Your code also looks like it's using UI elements (like the list box) as intermediate variables purely for logic rather than for display. You should separate the logic (using regular variables and data structures such as lists, and requesting the data) from the display (showing the results in list boxes, updating text boxes, etc.). It's not even clear that you need a WebBrowser - it looks like you're just downloading text and should use WebClient or HttpClient. The code can then also be much cleaner using async/await:
foreach (var url in urls)
{
    // DownloadStringTaskAsync returns a Task<string> that can be awaited.
    string text = await new WebClient().DownloadStringTaskAsync(url);
    // Handle text
}
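If you prefer HttpClient (available from .NET 4.5), the same idea could look roughly like this (just a sketch; DownloadAllAsync is a hypothetical helper, the output file name is taken from your handler, and a single client instance is reused for all requests):

// Requires System.Net.Http, System.IO, System.Collections.Generic and System.Threading.Tasks.
private static readonly HttpClient _client = new HttpClient();

private async Task DownloadAllAsync(IEnumerable<string> urls)
{
    foreach (var url in urls)
    {
        string text = await _client.GetStringAsync(url);
        // Handle text, e.g. append it to the output file like the DocumentCompleted handler does.
        File.AppendAllText("inks_info.txt", url + "~" + text + "\n");
    }
}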
Very simple answer. The WebBrowser control sucks for this stuff, but here is what you are looking for:
while (webBrowser.ReadyState != WebBrowserReadyState.Complete)
{
    // Keep pumping Windows messages until navigation has completed.
    Application.DoEvents();
}
That's it. It will not freeze your app or get you lost in code; it just waits until it's done navigating. You're most welcome.
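For example, dropped into the loop from the question it could look roughly like this (just a sketch, reusing webBrowser2 and listBox1 from the code above):

for (int i = 0; i < listBox1.Items.Count; i++)
{
    webBrowser2.Navigate(listBox1.Items[i].ToString());
    // Pump Windows messages until the page has finished loading before moving on.
    while (webBrowser2.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }
    // The document is loaded here; read webBrowser2.DocumentText as needed.
}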