For a couple of days I have been working on a WebBrowser-based web scraper. After a couple of prototypes working with Threads and DocumentCompleted events, I decided to see if I could make a simple, easy-to-understand web scraper.
The goal is to create a web scraper that doesn't involve actual Thread objects. I want it to work in sequential steps (i.e. go to a URL, perform an action, go to another URL, etc.).
This is what I have so far:
public static class Webscraper
{
private static WebBrowser _wb;
public static string URL;
//WebBrowser objects have to run in a Single Thread Apartment for some reason.
[STAThread]
public static void Init_Browser()
{
_wb = new WebBrowser();
}
public static void Navigate_And_Wait(string url)
{
//Navigate to a specific url.
_wb.Navigate(url);
//Wait till the url is loaded.
while (_wb.IsBusy) ;
//Loop until current url == target url. (In case a website loads urls in steps)
while (!_wb.Url.ToString().Contains(url))
{
//Wait till next url is loaded
while (_wb.IsBusy) ;
}
//Place URL
URL = _wb.Url.ToString();
}
}
I am a novice programmer, but I think this is pretty straightforward code.
That's why I detest the fact that for some reason the program throws a NullReferenceException at this piece of code:
_wb.Url.ToString().Contains(url)
I just called the _wb.Navigate() method, so the NullReference can't be in the _wb object itself. So the only thing I can imagine is that the _wb.Url object is null. But the while (_wb.IsBusy) loop should prevent that.
So what is going on and how can I fix it?
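Busy-waiting (while (_wb.IsBusy) ;) on the UI thread is not advisable. The WebBrowser control needs that thread to process messages in order to make progress, and Url can still be null when the loop exits, which is the likely source of the NullReferenceException. With the async/await features of .NET 4.5 you can get the sequential flow you want (go to a URL, perform an action, go to another URL, and so on):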
public static class SOExtensions
{
public static Task NavigateAsync(this WebBrowser wb, string url)
{
TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();
WebBrowserDocumentCompletedEventHandler completedEvent = null;
completedEvent = (sender, e) =>
{
wb.DocumentCompleted -= completedEvent;
tcs.SetResult(null);
};
wb.DocumentCompleted += completedEvent;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
return tcs.Task;
}
}
async void ProcessButtonClick()
{
await webBrowser1.NavigateAsync("http://www.stackoverflow.com");
MessageBox.Show(webBrowser1.DocumentTitle);
await webBrowser1.NavigateAsync("http://www.google.com");
MessageBox.Show(webBrowser1.DocumentTitle);
}
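The extension method above wraps a one-shot event in a Task. As a minimal sketch of the same pattern outside WinForms, the following uses a hypothetical FakeLoader class (my own stand-in, not part of any library) that raises a Completed event. The handler detaches itself and completes the task, just as completedEvent does above.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for WebBrowser: raises Completed once a "load" finishes.
public sealed class FakeLoader
{
    public event EventHandler<string> Completed;

    public void Load(string url)
    {
        // Simulate the work finishing later on a thread-pool thread.
        Task.Run(() => Completed?.Invoke(this, url));
    }
}

public static class FakeLoaderExtensions
{
    public static Task<string> LoadAsync(this FakeLoader loader, string url)
    {
        var tcs = new TaskCompletionSource<string>();
        EventHandler<string> handler = null;
        handler = (sender, loadedUrl) =>
        {
            loader.Completed -= handler;   // one-shot: detach before completing
            tcs.TrySetResult(loadedUrl);
        };
        loader.Completed += handler;       // subscribe before starting the load
        loader.Load(url);
        return tcs.Task;
    }
}
```

Subscribing before calling Load matters: if the event could fire before the handler is attached, the task would never complete.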
I am trying to load many pages using AngleSharp. The idea is that it loads a page and, if this page has a link to the next one, loads the next page, and so forth. The methods are shown below. But I am getting this inner exception:
Specified argument was out of the range of valid values.
Parameter name: index
I believe it is something related to threads and synchronization.
public static bool ContainsNextPage(IDocument document)
{
String href = document.QuerySelectorAll(".prevnext a")[0].GetAttribute("href");
if (href == String.Empty)
return false;
else
return true;
}
public static string GetNextPageUrl(IDocument document)
{
return document.QuerySelectorAll(".prevnext a")[0].GetAttribute("href");
}
public static async Task<IDocument> ParseUrlSynch(string Url)
{
var config = new Configuration().WithDefaultLoader();
IDocument document = await BrowsingContext.New(config).OpenAsync(Url);
return document;
}
public static async Task<ConcurrentBag<IDocument>> GetAllPagesDOMs(IDocument initialDocument)
{
ConcurrentBag< IDocument> AllPagesDOM = new ConcurrentBag< IDocument>();
IDocument nextPageDOM;
IDocument currentDocument = initialDocument;
if (initialDocument != null)
{
AllPagesDOM.Add(initialDocument);
}
while (ContainsNextPage(currentDocument))
{
String nextPageUrl = GetNextPageUrl(currentDocument);
nextPageDOM = ParseUrlSynch(nextPageUrl).Result;
if (nextPageDOM != null)
AllPagesDOM.Add(nextPageDOM);
currentDocument = nextPageDOM;
}
return AllPagesDOM;
}
static void Main(string[] args)
{
List<IDocument> allPageDOMs = new List<IDocument>();
IDocument initialDocument = ParseUrlSynch(InitialUrl).Result;
List<String> urls = new List<string>();
List<Subject> subjects = new List<Subject>();
IHtmlCollection<IElement> subjectAnchors = initialDocument.QuerySelectorAll(".course_title a");
String[] TitleAndCode;
String Title;
String Code;
String Description;
IDocument currentDocument = initialDocument;
ConcurrentBag<IDocument> documents =
GetAllPagesDOMs(initialDocument).Result; //Exception in here
...
}
Error message is caused by this code:
document.QuerySelectorAll(".prevnext a")[0]
One of your documents doesn't have any anchors inside prevnext. Maybe it's the first page, maybe the last; either way, you need to check the collection's length before indexing into it.
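A minimal sketch of that length check, assuming the same ".prevnext a" selector as in the question. The Paging class and the TryGetNextPageUrl name are mine, not part of AngleSharp. It also treats a missing href as "no next page", since GetAttribute returns null rather than an empty string when the attribute is absent.

```csharp
using AngleSharp.Dom;

public static class Paging
{
    // Returns the href of the first ".prevnext a" anchor, or null when the
    // page has no such anchor or the href is missing or empty.
    public static string TryGetNextPageUrl(IDocument document)
    {
        var anchors = document.QuerySelectorAll(".prevnext a");
        if (anchors.Length == 0)
            return null;

        string href = anchors[0].GetAttribute("href");
        return string.IsNullOrEmpty(href) ? null : href;
    }
}
```

With this, ContainsNextPage and GetNextPageUrl collapse into a single call whose null result ends the loop.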
Also, blocking on an async method is bad practice and should be avoided. You'll get a deadlock in any UI app. The only reason you don't get one now is that you're in a console app.
Your instincts are correct: if you are using this from an application with a non-default SynchronizationContext, such as WPF, WinForms, or ASP.NET, you will have a deadlock, because you are synchronously blocking on an async Task-returning function (this is bad and should be avoided). When the first await is reached inside the blocking call, it will try to post the continuation to the current SynchronizationContext, which is already blocked by the blocking call (using .ConfigureAwait(false) avoids this, but that is a hack in this case).
A quick fix would be to use async all the way through by replacing:
nextPageDOM = ParseUrlSynch(nextPageUrl).Result;
with:
nextPageDOM = await ParseUrlSynch(nextPageUrl);
After you get stung by this a few times, you'll learn to have alarm bells go off in your head every time you block an asynchronous method.
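To make "async all the way" concrete, here is a minimal sketch of the paging loop with no .Result anywhere. The Page type and the loadPageAsync delegate are assumptions of mine: the delegate stands in for ParseUrlSynch plus whatever extracts the next link. Because the pages are loaded one after another, a plain List is enough here and a ConcurrentBag is not needed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical minimal page model: the URL that was loaded and the next link, if any.
public sealed class Page
{
    public string Url { get; set; }
    public string NextUrl { get; set; }   // null or empty on the last page
}

public static class Crawler
{
    public static async Task<List<Page>> LoadAllPagesAsync(
        string firstUrl, Func<string, Task<Page>> loadPageAsync)
    {
        var pages = new List<Page>();
        string url = firstUrl;
        while (!string.IsNullOrEmpty(url))
        {
            Page page = await loadPageAsync(url);   // awaited, never blocked on
            pages.Add(page);
            url = page.NextUrl;
        }
        return pages;
    }
}
```

In a console app the call site can stay asynchronous too: from C# 7.1 onwards Main may be declared as static async Task Main(string[] args) and simply await LoadAllPagesAsync.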
I will try to describe my problem in words as simple as possible.
In my UWP app, I am loading the data asynchronously in my MainPage.xaml.cs:
public MainPage()
{
this.InitializeComponent();
LoadVideoLibrary();
}
private async void LoadVideoLibrary()
{
FoldersData = new List<FolderData>();
var folders = (await Windows.Storage.StorageLibrary.GetLibraryAsync
(Windows.Storage.KnownLibraryId.Videos)).Folders;
foreach (var folder in folders)
{
var files = (await folder.GetFilesAsync(Windows.Storage.Search.CommonFileQuery.OrderByDate)).ToList();
FoldersData.Add(new FolderData { files = files, foldername = folder.DisplayName, folderid = folder.FolderRelativeId });
}
}
So this is the code where I load a List of FolderData objects.
Then, in my other page, Library.xaml.cs, I use that data to load my GridView with bound data.
protected override void OnNavigatedTo(NavigationEventArgs e)
{
try
{
LoadLibraryMenuGrid();
}
catch { }
}
private async void LoadLibraryMenuGrid()
{
MenuGridItems = new ObservableCollection<MenuItemModel>();
var data = MainPage.FoldersData;
foreach (var folder in data)
{
var image = new BitmapImage();
if (folder.files.Count == 0)
{
image.UriSource = new Uri("ms-appx:///Assets/StoreLogo.png");
}
else
{
for (int i = 0; i < folder.files.Count; i++)
{
var thumb = (await folder.files[i].GetThumbnailAsync(Windows.Storage.FileProperties.ThumbnailMode.VideosView));
if (thumb != null) { await image.SetSourceAsync(thumb); break; }
}
}
MenuGridItems.Add(new MenuItemModel
{
numberofvideos = folder.files.Count.ToString(),
folder = folder.foldername,
folderid = folder.folderid,
image = image
});
}
GridHeader = "Library";
}
The problem is that when I launch my application, wait a few seconds, and then navigate to my library page, all the data loads properly.
But when I navigate to the library page immediately after launching the app, it throws an exception:
"collection was modified so it cannot be iterated"
Using a breakpoint, I found that if I give it a few seconds, the FolderData list has already been loaded asynchronously, but if I don't, the async method is only halfway through loading the data, which causes the exception. How can I handle this async situation? Thanks.
What you need is a way to wait for data to arrive. How you fit that in with the rest of the application (e.g. MVVM or not) is a different story, and not important right now. Don't overcomplicate things. For example, you only need an ObservableCollection if you expect the data to change while the user is looking at it.
Anyway, you need to wait. So how do you wait for that data to arrive?
Use a static class that can be reached from everywhere. In there put a method to get your data. Make sure it returns a task that you cache for future calls. For example:
internal class Data { /* whatever */ }
internal static class DataLoader
{
private static Task<Data> loaderTask;
public static Task<Data> LoadDataAsync(bool refresh = false)
{
if (refresh || loaderTask == null)
{
loaderTask = LoadDataCoreAsync();
}
return loaderTask;
}
private static async Task<Data> LoadDataCoreAsync()
{
// your actual logic goes here
}
}
With this, you can start the download as soon as you start the application.
await DataLoader.LoadDataAsync();
When you need the data in that other screen, just call that method again. It will not download the data again (unless you set refresh to true), but will simply wait for the work that you started earlier to finish, if it has not finished yet.
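A small sketch of that caching behaviour, under stated assumptions: the CachedLoader name, the string[] result, the LoadCount counter, and the Task.Delay that stands in for the real folder query are all made up for illustration. Two callers get the very same task, so the work runs once.

```csharp
using System.Threading;
using System.Threading.Tasks;

internal static class CachedLoader
{
    private static Task<string[]> loaderTask;
    public static int LoadCount;   // only here to show how often the work runs

    public static Task<string[]> LoadDataAsync(bool refresh = false)
    {
        // Not thread-safe by itself; fine when always called from the UI thread.
        if (refresh || loaderTask == null)
        {
            loaderTask = LoadDataCoreAsync();
        }
        return loaderTask;
    }

    private static async Task<string[]> LoadDataCoreAsync()
    {
        Interlocked.Increment(ref LoadCount);
        await Task.Delay(50);                      // stands in for the real I/O
        return new[] { "Videos", "Camera Roll" };  // placeholder data
    }
}
```

Applied to the question, MainPage would start LoadDataAsync() at launch and the library page would await LoadDataAsync() before iterating, so it only ever enumerates a completed result instead of a list that is still being filled.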
I understand that you don't have much experience yet. There are multiple issues with the way you are loading the data, and no simple fix for it.
What you need is a service that can give you an ObservableCollection of FolderData. I think MVVM might be out of scope here unless you are willing to spend a few hours on it, though MVVM would make things a lot easier in this case.
The main issue at hand is this:
You are using foreach to iterate the folders and the FolderData list. foreach cannot continue if the underlying collection changes.
First, you need to use a for loop instead of foreach. Second, add a state flag that denotes whether loading has finished. Finally, use an observable data source. In my early days I used to create static properties in App.xaml.cs and use them to share and observe data.
I have a delegate method that runs a heavy process in my app (I must use .NET Framework 3.5):
private delegate void delRunJob(string strBox, string strJob);
Execution:
private void run()
{
string strBox = "G4P";
string strJob = "Test";
delRunJob delegateRunJob = new delRunJob(runJobThread);
delegateRunJob.Invoke(strBox, strJob);
}
In some part of the method runJobThread I call an external program (SAP Remote Function Calls) to retrieve data. Executing that line can take 1-30 minutes.
private void runJobThread(string strBox, string strJob)
{
// CODE ...
sapLocFunction.Call(); // When this line is running I cannot cancel the process
// CODE ...
}
I want to allow the user to cancel the whole process.
How can I achieve this? I have tried several approaches, but I always end up at the same point: while this specific line is running, I cannot stop the process.
Instead of using the delegate mechanism, you should study the async/await mechanism. Once you understand it, you can move on to CancellationToken.
An example that does both can be found here:
http://blogs.msdn.com/b/dotnet/archive/2012/06/06/async-in-4-5-enabling-progress-and-cancellation-in-async-apis.aspx
Well, I found a complicated but effective way to solve my problem:
a.) I created a "Helper application" that shows a notification icon while the process is running (so that it doesn't interfere with the normal execution of the main app):
private void callHelper(bool blnClose = false)
{
if (blnClose)
fw.processKill("SDM Helper");
else
Process.Start(fw.appGetPath + "SDM Helper.exe");
}
b.) I created a Thread that calls only the heavy process line.
c.) While the Thread is alive, I check for an external file named "cancel" (the "Helper application" creates this file when the user clicks the option to cancel the process).
d.) If the file exists, I dispose of all objects and break out of the while loop.
e.) The method sapLocFunction.Call() will then raise an exception, but I expect errors at that point.
private void runJobThread(string strBox, string strJob)
{
// CODE ...
Thread thrSapCall = new Thread(() =>
{
try { sapLocFunction.Call(); }
catch { /* Do nothing */ }
});
thrSapCall.Start();
while (thrSapCall.IsAlive)
{
Thread.Sleep(1000);
try
{
if (fw.fileExists(fw.appGetPath + "\\cancel"))
{
sapLocFunction = null;
sapLocTable = null;
sapConn.Logoff();
sapConn = null;
canceled = true;
break;
}
}
finally { /* Do nothing */ }
}
thrSapCall = null;
// CODE ...
}
Works like a charm!
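For readers who want the same shape without a helper process and a marker file, here is a hedged sketch that swaps the file check for an in-process wait handle. It sticks to APIs available in .NET 3.5. CancellableCall and its parameters are hypothetical names: blockingCall stands in for sapLocFunction.Call() and onCancel for the cleanup shown above (Logoff, nulling references). Whether tearing down the connection really unblocks the call is as reported above, not something this sketch guarantees.

```csharp
using System;
using System.Threading;

public static class CancellableCall
{
    // Runs blockingCall on a background thread. Returns true if it finished,
    // false if cancelRequested was signalled first (onCancel is then invoked).
    public static bool Run(ThreadStart blockingCall, WaitHandle cancelRequested, ThreadStart onCancel)
    {
        Thread worker = new Thread(delegate ()
        {
            try { blockingCall(); }
            catch (Exception) { /* swallowed, as in the answer above: failures are expected after cleanup */ }
        });
        worker.IsBackground = true;   // do not keep the process alive for an abandoned call
        worker.Start();

        // Poll: wait up to 200 ms for the worker, then check the cancel signal.
        while (!worker.Join(200))
        {
            if (cancelRequested.WaitOne(0, false))
            {
                onCancel();
                return false;
            }
        }
        return true;
    }
}
```

A Cancel button would then simply call Set() on a ManualResetEvent that is passed in as cancelRequested. Note that the worker thread is abandoned rather than stopped, which is the same trade-off as in the file-based version.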
I think you would have to resort to the method described here. Read the post to see why this is a long way from ideal.
Perhaps this might work...
private void runJobThread(string strBox, string strJob, CancellationToken token)
{
Thread t = Thread.CurrentThread;
using (token.Register(t.Abort))
{
// CODE ...
sapLocFunction.Call(); // When this line is running I cannot cancel the process
// CODE ...
}
}
A bit of digging with dnSpy exposes a non-public Cancel method in NCo 3.0.
private readonly static Type RfcConnection = typeof(RfcSessionManager).Assembly.GetType("SAP.Middleware.Connector.RfcConnection");
private readonly static Func<RfcDestination, object> GetConnection = typeof(RfcSessionManager).GetMethod("GetConnection", BindingFlags.Static | BindingFlags.NonPublic).CreateDelegate(typeof(Func<RfcDestination, object>)) as Func<RfcDestination, object>;
private readonly static MethodInfo Cancel = RfcConnection.GetMethod("Cancel", BindingFlags.Instance | BindingFlags.NonPublic);
object connection = null;
var completed = true;
using (var task = Task.Run(() => { connection = GetConnection(destination); rfcFunction.Invoke(destination); }))
{
try
{
completed = task.Wait(TimeSpan.FromSeconds(invokeTimeout));
if (!completed)
Cancel.Invoke(connection, null);
task.Wait();
}
catch(AggregateException e)
{
if (e.InnerException is RfcCommunicationCanceledException && !completed)
throw new TimeoutException($"SAP FM {functionName} on {destination} did not respond in {invokeTimeout} seconds.");
throw;
}
}
I have a user control that displays information from the database. This user control has to update this information constantly (let's say every 5 seconds). A few instances of this user control are generated programmatically at run time in a single page. In the code-behind of this user control I added code that sends a query to the database to get the needed information (which means every single instance of the user control does this). But this seems to slow down the processing of queries, so I am making a static class that will do the querying, store the information in its variables, and let the instances of my user control access those variables. Now I need this static class to run its queries every 5 seconds to update its variables. I tried using a new thread to do this, but the variables don't seem to be updated, since I always get a NullReferenceException whenever I access them from a different class.
Here's my static class:
public static class SessionManager
{
public static volatile List<int> activeSessionsPCIDs;
public static volatile List<int> sessionsThatChangedStatus;
public static volatile List<SessionObject> allSessions;
public static void Initialize() {
Thread t = new Thread(SetProperties);
t.Start();
}
public static void SetProperties() {
SessionDataAccess sd = new SessionDataAccess();
while (true) {
allSessions = sd.GetAllSessions();
activeSessionsPCIDs = new List<int>();
sessionsThatChangedStatus = new List<int>();
foreach (SessionObject session in allSessions) {
if (session.status == 1) { //if session is active
activeSessionsPCIDs.Add(session.pcid);
}
if (session.status != session.prevStat) { //if current status doesn't match the previous status
sessionsThatChangedStatus.Add(session.pcid);
}
}
Thread.Sleep(5000);
}
}
}
And this is how I am trying to access the variables in my static class:
protected void Page_Load(object sender, EventArgs e)
{
SessionManager.Initialize();
loadSessions();
}
private void loadSessions()
{ // refresh the current_sessions table
List<int> pcIds = pcl.GetPCIds(); //get the ids of all computers
foreach (SessionObject s in SessionManager.allSessions)
{
SessionInfo sesInf = (SessionInfo)LoadControl("~/UserControls/SessionInfo.ascx");
sesInf.session = s;
pnlMonitoring.Controls.Add(sesInf);
}
}
Any help, please? Thanks
Multiple threads problem
You have one thread that gets created for each and every call to SessionManager.Initialize.
That happens more than once in the lifetime of the process.
IIS recycles your app at some point, for example after a period with no requests at all.
Until that happens, all your created threads continue to run.
After the first PageLoad you will have one thread which updates stuff every 5 seconds.
If you refresh the page again you'll have two threads, possibly with different offsets in time but each of which, doing the same thing at 5 second intervals.
You should atomically check whether your background thread has already been started. You need at least an extra static bool field and a static object field that you use as a monitor (via the lock keyword).
You should also stop relying on volatile and simply use lock to make sure that other threads observe updated values for your static List<..> fields.
It may be that the other threads don't observe a changed field, so for them the field is still null, and therefore you get the NullReferenceException.
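A minimal sketch of the "started flag plus lock" idea, with made-up names (SessionCache, EnsureStarted, ActiveIds) and a delegate standing in for the database query. One further point worth noting: in the question, Initialize returns before the first query has completed, so the very first page load can read allSessions while it is still null. The sketch therefore performs the first load synchronously inside the lock. Error handling is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class SessionCache
{
    private static readonly object Gate = new object();
    private static bool started;
    private static IReadOnlyList<int> activeIds = new List<int>();

    // Safe to call from every Page_Load: only the first call loads data
    // and starts the refresh thread.
    public static void EnsureStarted(Func<List<int>> loadActiveIds, TimeSpan interval)
    {
        lock (Gate)
        {
            if (started) return;
            started = true;
            activeIds = loadActiveIds().AsReadOnly();   // first load completes before anyone reads
        }

        var worker = new Thread(() =>
        {
            while (true)
            {
                Thread.Sleep(interval);
                IReadOnlyList<int> fresh = loadActiveIds().AsReadOnly();
                lock (Gate) { activeIds = fresh; }      // swap in a complete snapshot
            }
        });
        worker.IsBackground = true;
        worker.Start();
    }

    public static IReadOnlyList<int> ActiveIds
    {
        get { lock (Gate) { return activeIds; } }
    }
}
```

Readers always receive a complete, read-only snapshot, so a page enumerating ActiveIds is never affected by the refresh thread replacing it.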
About volatile
Using volatile is bad, at least in .NET. There is a 90% chance that you think you know what it is doing and it's not true and there's a 99% chance that you feel relief because you used volatile and you aren't checking for other multitasking hazards the way you should.
RX to the rescue
I strongly suggest you take a look at this wonderful thing called Reactive Extensions.
Believe me, a couple of days' research, combined with the fact that you're in a perfect position to use RX, will pay off, not just now but in the future as well.
You get to keep your static class, but instead of materialised values that get stored within that class you create pipes that carry information. The information flows when you want it to flow. You get to have subscribers to those pipes. The number of subscribers does not affect the overall performance of your app.
Your app will be more scalable, and more robust.
Good luck!
There are a few solutions for this:
The first one:
It is better to create the thread that calls your method in Global.asax, in Application_Start or Session_Start (depending on your case):
Use the code below:
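// SetProperties already loops forever and sleeps 5 seconds between refreshes,
// so it only needs to be started once. LongRunning hints that it should get its own thread.
var t = Task.Factory.StartNew(() => SessionManager.SetProperties(), TaskCreationOptions.LongRunning);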
The second solution is to use a job scheduler for ASP.NET (that's my preferred solution).
For more info, see the article "How to run Background Tasks in ASP.NET".
The third solution is to rewrite your static class as follows:
public static class SessionManager
{
public static volatile List<int> activeSessionsPCIDs;
public static volatile List<int> sessionsThatChangedStatus;
public static volatile List<SessionObject> allSessions;
static SessionManager()
{
Initialize();
}
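public static void Initialize() {
// SetProperties already loops forever and sleeps 5 seconds between refreshes,
// so start it once. The static constructor above takes care of that.
Task.Factory.StartNew(() => SetProperties(), TaskCreationOptions.LongRunning);
}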
public static void SetProperties() {
SessionDataAccess sd = new SessionDataAccess();
while (true) {
allSessions = sd.GetAllSessions();
activeSessionsPCIDs = new List<int>();
sessionsThatChangedStatus = new List<int>();
foreach (SessionObject session in allSessions) {
if (session.status == 1) { //if session is active
activeSessionsPCIDs.Add(session.pcid);
}
if (session.status != session.prevStat) { //if current status doesn't match the previous status
sessionsThatChangedStatus.Add(session.pcid);
}
}
Thread.Sleep(5000);
}
}
}
This is a solution that is a change in approach, but I kept the solution in Web Forms, to make it more directly applicable to your use case.
SignalR is a technology that enables real-time, two way communication between server and clients (browsers), which can replace your static session data class. Below, I have implemented a simple example to demonstrate the concept.
As a sample, create a new ASP.NET Web Forms application and add the SignalR package from nuget.
Install-Package Microsoft.AspNet.SignalR
You will need to add a new Owin Startup class and add these 2 lines:
using Microsoft.AspNet.SignalR;
... and within the method
app.MapSignalR();
Add some UI elements to Default.aspx:
<div class="jumbotron">
<H3 class="MyName">Loading...</H3>
<p class="stats">
</p>
</div>
Add the following JavaScript to Site.Master. This code references SignalR, implements client-side event handlers, and initiates contact with the SignalR hub from the browser:
<script src="Scripts/jquery.signalR-2.2.0.min.js"></script>
<script src="signalr/hubs"></script>
<script >
var hub = $.connection.sessiondata;
hub.client.someOneJoined = function (name) {
var current = $(".stats").text();
current = current + '\nuser ' + name + ' joined.';
$(".stats").text(current);
};
hub.client.myNameIs = function (name) {
$(".MyName").text("Your user id: " + name);
};
$.connection.hub.start().done(function () { });
</script>
Finally, add a SignalR Hub to the solution and use this code for the SessionDataHub implementation:
[HubName("sessiondata")]
public class SessionDataHub : Hub
{
private ObservableCollection<string> sessions = new ObservableCollection<string>();
public SessionDataHub()
{
sessions.CollectionChanged += sessions_CollectionChanged;
}
private void sessions_CollectionChanged(object sender, NotifyCollectionChangedEventArgs e)
{
if (e.Action == NotifyCollectionChangedAction.Add)
{
Clients.All.someOneJoined(e.NewItems.Cast<string>().First());
}
}
public override Task OnConnected()
{
return Task.Factory.StartNew(() =>
{
var youAre = Context.ConnectionId;
Clients.Caller.myNameIs(youAre);
sessions.Add(youAre);
});
}
public override Task OnDisconnected(bool stopCalled)
{
// TODO: implement this as well.
return base.OnDisconnected(stopCalled);
}
}
For more information about SignalR, go to http://asp.net/signalr
Link to source code: https://lsscloud.blob.core.windows.net/downloads/WebApplication1.zip
I have 2 asynchronous downloads in a Downloader class. Basically the code first makes a simple http based API request to get some data containing a url, and then uses this url to download an image - the last function call - Test(adImage) tries to pass the UIImage back to a function in the main ViewController class, so that it can update a UIImageView with the downloaded image. When I try to do this, I get an ArgumentNullException at the line
string result = System.Text.Encoding.UTF8.GetString (e.Result);
I think this is because I need to use the main UI thread to update the main VC, and can't do it from this object running on another asynchronous thread. If I take the Test function out, everything runs fine and the image is downloaded - just not used for anything.
How do I pass the image back to the mainVC and get it to update the image on the main UI thread?
(This is related to a question I asked before, but I think I was totally barking up the wrong tree then, so I felt it better to re-express the problem in a different way.)
public class Downloader : IImageUpdated {
UIImage adImage;
Manga5ViewController mainVC;
public void DownloadWebData(Uri apiUrl, Manga5ViewController callingVC)
{
mainVC = callingVC;
WebClient client = new WebClient();
client.DownloadDataCompleted += DownloadDataCompleted;
client.DownloadDataAsync(apiUrl);
}
public void DownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
{
string result = System.Text.Encoding.UTF8.GetString (e.Result);
string link = GetUri(result);
Console.WriteLine (link);
downloadImage(new Uri (link));
}
public void downloadImage (Uri imageUri) {
var tmp_img = ImageLoader.DefaultRequestImage (imageUri, this);
if (tmp_img != null)
{
adImage = tmp_img;
Console.WriteLine ("Image already cached, displaying");
Console.WriteLine ("Size: " + adImage.Size);
mainVC.Test (adImage);
}
else
{
adImage = UIImage.FromFile ("Images/downloading.jpg");
Console.WriteLine ("Image not cached. Using placeholder.");
}
}
public void UpdatedImage (System.Uri uri) {
adImage = ImageLoader.DefaultRequestImage(uri, this);
Console.WriteLine ("Size: " + adImage.Size);
mainVC.Test (adImage);
}
....
}
Damn, after working on this for hours, I finally figured it out a few minutes after posting this.
It was as simple as wrapping the UI code like so:
InvokeOnMainThread (delegate {
// UI Update code here...
});