I am trying to load many pages using AngleSharp. The idea is that it loads a page and, if that page has a link to the next one, loads the next page, and so forth. The methods are shown below. But I am getting this inner exception:
"Specified argument was out of the range of valid values.
Parameter name: index"
I believe it is something related to threads and synchronization.
public static bool ContainsNextPage(IDocument document)
{
String href = document.QuerySelectorAll(".prevnext a")[0].GetAttribute("href");
if (href == String.Empty)
return false;
else
return true;
}
public static string GetNextPageUrl(IDocument document)
{
return document.QuerySelectorAll(".prevnext a")[0].GetAttribute("href");
}
public static async Task<IDocument> ParseUrlSynch(string Url)
{
var config = new Configuration().WithDefaultLoader();
IDocument document = await BrowsingContext.New(config).OpenAsync(Url);
return document;
}
public static async Task<ConcurrentBag<IDocument>> GetAllPagesDOMs(IDocument initialDocument)
{
ConcurrentBag< IDocument> AllPagesDOM = new ConcurrentBag< IDocument>();
IDocument nextPageDOM;
IDocument currentDocument = initialDocument;
if (initialDocument != null)
{
AllPagesDOM.Add(initialDocument);
}
while (ContainsNextPage(currentDocument))
{
String nextPageUrl = GetNextPageUrl(currentDocument);
nextPageDOM = ParseUrlSynch(nextPageUrl).Result;
if (nextPageDOM != null)
AllPagesDOM.Add(nextPageDOM);
currentDocument = nextPageDOM;
}
return AllPagesDOM;
}
static void Main(string[] args)
{
List<IDocument> allPageDOMs = new List<IDocument>();
IDocument initialDocument = ParseUrlSynch(InitialUrl).Result;
List<String> urls = new List<string>();
List<Subject> subjects = new List<Subject>();
IHtmlCollection<IElement> subjectAnchors = initialDocument.QuerySelectorAll(".course_title a");
String[] TitleAndCode;
String Title;
String Code;
String Description;
IDocument currentDocument = initialDocument;
ConcurrentBag<IDocument> documents =
GetAllPagesDOMs(initialDocument).Result; //Exception in here
...
}
Error message is caused by this code:
document.QuerySelectorAll(".prevnext a")[0]
One of your documents doesn't have any anchors inside .prevnext. Maybe it's the first page, maybe the last; either way, you need to check the collection's length before indexing into it.
Also, blocking on an async method is bad practice and should be avoided. You'll get a deadlock in any UI app. The only reason you don't get one now is that you're in a console app.
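A minimal sketch of the length check described above. `PagingLinks` and `GetNextPageUrlOrNull` are hypothetical names, and the list of strings is only a stand-in for the href values of the anchors matched by ".prevnext a"; it is not AngleSharp's own API:

```csharp
using System.Collections.Generic;

public static class PagingLinks
{
    // hrefs stands in for the href attributes of the anchors matched by ".prevnext a".
    // Returns null instead of throwing when the page has no such anchor.
    public static string GetNextPageUrlOrNull(IReadOnlyList<string> hrefs)
    {
        if (hrefs == null || hrefs.Count == 0)
            return null; // indexing [0] on an empty list would throw here

        string href = hrefs[0];
        return string.IsNullOrEmpty(href) ? null : href;
    }
}
```

With a helper like this, the paging loop can stop as soon as the result is null, instead of relying on an exception at the last page.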
Your instincts are correct. If you are using this from an application with a non-default SynchronizationContext, such as WPF, WinForms, or ASP.NET, you will get a deadlock, because you are synchronously blocking on an async, Task-returning function (this is bad and should be avoided). When the first await is reached inside the blocking call, it will try to post the continuation to the current SynchronizationContext, which is already blocked by the blocking call (using .ConfigureAwait(false) avoids this, but it is a hack in this case).
A quick fix would be to use async all the way through by replacing:
nextPageDOM = ParseUrlSynch(nextPageUrl).Result;
with:
nextPageDOM = await ParseUrlSynch(nextPageUrl);
After you get stung by this a few times, you'll learn to have alarm bells go off in your head every time you block an asynchronous method.
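To illustrate "async all the way", here is a rough, self-contained sketch of the paging loop with no blocking calls. The `Page` type, the `Crawler` class, and the `loadAsync` delegate are assumptions made for the example; in the question's code the delegate would correspond to ParseUrlSynch and `NextUrl` to the href of the next-page link:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a loaded document.
public sealed class Page
{
    public string Url { get; set; }
    public string NextUrl { get; set; } // null or empty when there is no next page
}

public static class Crawler
{
    // Follows NextUrl links one page at a time, awaiting each load instead of calling .Result.
    public static async Task<List<Page>> GetAllPagesAsync(
        string firstUrl, Func<string, Task<Page>> loadAsync)
    {
        var pages = new List<Page>();
        string url = firstUrl;
        while (!string.IsNullOrEmpty(url))
        {
            Page page = await loadAsync(url);
            if (page == null)
                break;
            pages.Add(page);
            url = page.NextUrl;
        }
        return pages;
    }
}
```

Because the pages are loaded strictly one after another, a plain List is enough here; a ConcurrentBag would only matter if several loads added results in parallel.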
I will try to describe my problem in as simple words as possible.
In my UWP app, I am loading the data asynchronously in my MainPage.xaml.cs:
public MainPage()
{
this.InitializeComponent();
LoadVideoLibrary();
}
private async void LoadVideoLibrary()
{
FoldersData = new List<FolderData>();
var folders = (await Windows.Storage.StorageLibrary.GetLibraryAsync
(Windows.Storage.KnownLibraryId.Videos)).Folders;
foreach (var folder in folders)
{
var files = (await folder.GetFilesAsync(Windows.Storage.Search.CommonFileQuery.OrderByDate)).ToList();
FoldersData.Add(new FolderData { files = files, foldername = folder.DisplayName, folderid = folder.FolderRelativeId });
}
}
So this is the code where I am loading a List of FolderData objects.
In my other page, Library.xaml.cs, I am using that data to populate my GridView through data binding.
protected override void OnNavigatedTo(NavigationEventArgs e)
{
try
{
LoadLibraryMenuGrid();
}
catch { }
}
private async void LoadLibraryMenuGrid()
{
MenuGridItems = new ObservableCollection<MenuItemModel>();
var data = MainPage.FoldersData;
foreach (var folder in data)
{
var image = new BitmapImage();
if (folder.files.Count == 0)
{
image.UriSource = new Uri("ms-appx:///Assets/StoreLogo.png");
}
else
{
for (int i = 0; i < folder.files.Count; i++)
{
var thumb = (await folder.files[i].GetThumbnailAsync(Windows.Storage.FileProperties.ThumbnailMode.VideosView));
if (thumb != null) { await image.SetSourceAsync(thumb); break; }
}
}
MenuGridItems.Add(new MenuItemModel
{
numberofvideos = folder.files.Count.ToString(),
folder = folder.foldername,
folderid = folder.folderid,
image = image
});
}
GridHeader = "Library";
}
The problem I am facing is that when I launch my application, wait a few seconds, and then navigate to my library page, all the data loads properly.
But when I navigate to the library page immediately after launching the app, it throws an exception saying
"collection was modified so it cannot be iterated"
I used a breakpoint and found that if I give it a few seconds, the FoldersData list has already been loaded asynchronously, but if I don't, the async method is only halfway through loading the data, which causes the exception. How can I handle this async situation? Thanks.
What you need is a way to wait for data to arrive. How you fit that in with the rest of the application (e.g. MVVM or not) is a different story, and not important right now. Don't overcomplicate things. For example, you only need an ObservableCollection if you expect the data to change while the user is looking at it.
Anyway, you need to wait. So how do you wait for that data to arrive?
Use a static class that can be reached from everywhere. In there put a method to get your data. Make sure it returns a task that you cache for future calls. For example:
internal class Data { /* whatever */ }
internal static class DataLoader
{
private static Task<Data> loaderTask;
public static Task<Data> LoadDataAsync(bool refresh = false)
{
if (refresh || loaderTask == null)
{
loaderTask = LoadDataCoreAsync();
}
return loaderTask;
}
private static async Task<Data> LoadDataCoreAsync()
{
// your actual logic goes here
}
}
With this, you can start the download as soon as you start the application.
await DataLoader.LoadDataAsync();
When you need the data in that other screen, just call that method again. It will not download the data again (unless you set refresh to true), but will simply wait for the work that you started earlier to finish, if it has not finished yet.
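As a rough illustration of the cached-task idea, the sketch below fills in the loader with a fake load so the behavior can be observed. `FolderNameLoader`, `LoadCount`, the delay, and the returned folder names are all made up for the example; the real logic would call the storage APIs from the question:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FolderNameLoader
{
    private static Task<List<string>> _loaderTask;

    // Counts how many times the expensive load actually ran.
    public static int LoadCount { get; private set; }

    // Returns the cached task; starts a new load only on first use or when refresh is true.
    public static Task<List<string>> LoadAsync(bool refresh = false)
    {
        if (refresh || _loaderTask == null)
        {
            _loaderTask = LoadCoreAsync();
        }
        return _loaderTask;
    }

    private static async Task<List<string>> LoadCoreAsync()
    {
        LoadCount++;
        await Task.Delay(50); // stands in for the real storage or network calls
        return new List<string> { "Videos", "Camera Roll" };
    }
}
```

Calling `FolderNameLoader.LoadAsync()` at startup and again on the second page hands back the same Task object, so the second caller awaits the load that is already in flight and never sees a half-filled list. This sketch is not thread-safe; it assumes, as in a UI app, that both calls are made from the UI thread.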
I get that you don't have enough experience. There are multiple issues with the way you are loading the data, and no quick fix for it.
What you need is a service that can give you an ObservableCollection of FolderData. I think MVVM might be out of reach at this point unless you are willing to spend a few hours on it, though MVVM would make things a lot easier here.
The main issue at hand is this:
You are using foreach to iterate over the folders and over the FolderData list. foreach cannot continue if the underlying collection changes.
First, you need to start using a for loop instead of foreach. Second, add a state flag that indicates whether loading has finished. Finally, use an observable data source. In my early days I used to create static properties in App.xaml.cs and use them to share and observe data.
I have used the following approach for a long time (approx. 5 years):
Create one big class that initializes XXXEntities in its constructor, with one method for each DB action. Example:
public class DBRepository
{
private MyEntities _dbContext;
public DBRepository()
{
_dbContext = new MyEntities();
}
public NewsItem NewsItem(int ID)
{
var q = from i in _dbContext.News where i.ID == ID select new NewsItem() { ID = i.ID, FullText = i.FullText, Time = i.Time, Topic = i.Topic };
return q.FirstOrDefault();
}
public List<Screenshot> LastPublicScreenshots()
{
var q = from i in _dbContext.Screenshots where i.isPublic == true && i.ScreenshotStatus.Status == ScreenshotStatusKeys.LIVE orderby i.dateTimeServer descending select i;
return q.Take(5).ToList();
}
public void SetPublicScreenshot(string filename, bool val)
{
var screenshot = Get<Screenshot>(p => p.filename == filename);
if (screenshot != null)
{
screenshot.isPublic = val;
_dbContext.SaveChanges();
}
}
public void SomeMethod()
{
SomeEntity1 s1 = new SomeEntity1() { field1="fff", field2="aaa" };
_dbContext.SomeEntity1.Add(s1);
SomeEntity2 s2 = new SomeEntity2() { SE1 = s1 };
_dbContext.SomeEntity2.Add(s2);
_dbContext.SaveChanges();
}
}
Some external code then creates a DBRepository object and calls its methods.
It worked fine. But now async operations have come in. So, if I use code like
public async void AddStatSimplePageAsync(string IPAddress, string login, string txt)
{
DateTime dateAdded2MinsAgo = DateTime.Now.AddMinutes(-2);
if ((from i in _dbContext.StatSimplePages where i.page == txt && i.dateAdded > dateAdded2MinsAgo select i).Count() == 0)
{
StatSimplePage item = new StatSimplePage() { IPAddress = IPAddress, login = login, page = txt, dateAdded = DateTime.Now };
_dbContext.StatSimplePages.Add(item);
await _dbContext.SaveChangesAsync();
}
}
there can be a situation where the next code is executed before SaveChangesAsync has completed, and one more entity is added to _dbContext that should not be saved until some other actions are done. For example, some code:
DBRepository _rep = new DBRepository();
_rep.AddStatSimplePageAsync("A", "b", "c");
_rep.SomeMethod();
I worry that SaveChanges will be called after the line
_dbContext.SomeEntity1.Add(s1);
but before
_dbContext.SomeEntity2.Add(s2);
(i.e. these two actions should be one atomic operation)
Am I right? Is my approach wrong now? Which approach should be used?
PS. As I understand it, the sequence will be the following:
1. AddStatSimplePageAsync is called.
2. await _dbContext.SaveChangesAsync(); starts inside AddStatSimplePageAsync.
3. SomeMethod() starts; _dbContext.SaveChangesAsync() from AddStatSimplePageAsync is executing in another (child) thread.
4. _dbContext.SaveChangesAsync() completes in the child thread while the main thread is executing something in SomeMethod().
OK, this time I think I've got your problem.
First, it's odd that you have two separate calls to the SaveChanges method. Usually you should try to call it once, at the end of all your operations, and then dispose of the context.
That said, your concerns are valid, but some clarification is needed.
When you encounter async or await, do not think about threads but about tasks; they are two different concepts.
Have a read of this great article. There is an image in it that explains practically everything.
In a few words: if you do not await an async method, you run the risk that a subsequent operation will interfere with the execution of the first one. To solve it, simply await it (which requires the method to return Task rather than void).
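A small sketch of what "simply await it" could look like. `FakeRepository` and `OrderingDemo` are invented for the illustration; the `Task.Delay` stands in for `_dbContext.SaveChangesAsync()`, and the log only records the order in which things happen:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class FakeRepository
{
    public List<string> Log { get; } = new List<string>();

    // Returns Task (not void) so that callers can await completion.
    public async Task AddStatSimplePageAsync(string page)
    {
        Log.Add("stat:added");
        await Task.Delay(20); // stands in for SaveChangesAsync
        Log.Add("stat:saved");
    }

    public void SomeMethod()
    {
        Log.Add("some:added");
        Log.Add("some:saved");
    }
}

public static class OrderingDemo
{
    public static async Task<List<string>> RunAsync()
    {
        var rep = new FakeRepository();
        await rep.AddStatSimplePageAsync("c"); // the save finishes before the next line runs
        rep.SomeMethod();
        return rep.Log;
    }
}
```

With the await in place, "stat:saved" is always logged before "some:added". If the method returned void, or the returned task were ignored, SomeMethod could run while the first save is still pending.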
I have an Excel add-in written in C#, .NET 4.5. It sends many web service requests to a web server to get data, e.g. 30,000 requests. When the data for a request comes back, the add-in plots the data in Excel.
Originally I made all the requests asynchronously, but sometimes I got an OutOfMemoryException.
So I changed it to send the requests one by one, but that is too slow; it takes a long time to finish all the requests.
I wonder if there is a way to send 100 requests at a time asynchronously and, once the data for all 100 requests has come back and been plotted in Excel, send the next 100 requests.
Thanks
Edit
In my add-in there is a ribbon button, "Refresh"; when it is clicked, the refresh process starts.
On the main UI thread, when the ribbon button is clicked, it calls the web service BuildMetaData.
Once that returns, another web service call is sent from its callback, MetaDataCompleteCallback.
Once that returns, its callback, DataRequestJobFinished, calls Plot to plot the data in Excel. See below:
RefreshBtn_Click()
{
if (cells == null) return;
Range firstOccurence = null;
firstOccurence = cells.Find(functionPattern, null,
null, null,
XlSearchOrder.xlByRows,
XlSearchDirection.xlNext,
null, null, null);
DataRequest request = null;
_reportObj = null;
Range currentOccurence = null;
while (!Helper.RefreshCancelled)
{
if(firstOccurence == null ||IsRangeEqual(firstOccurence, currentOccurence)) break;
found = true;
currentOccurence = cells.FindNext(currentOccurence ?? firstOccurence);
try
{
var excelFormulaCell = new ExcelFormulaCell(currentOccurence);
if (excelFormulaCell.HasValidFormulaCell)
{
request = new DataRequest(_unityContainer, XLApp, excelFormulaCell);
request.IsRefreshClicked = true;
request.Workbook = Workbook;
request.Worksheets = Worksheets;
_reportObj = new ReportBuilder(_unityContainer, XLApp, request, index, false);
_reportObj.ParseParameters();
_reportObj.GenerateReport();
//this is necessary b/c error message is wrapped in valid object DataResponse
//if (!string.IsNullOrEmpty(_reportObj.ErrorMessage)) //Clear previous error message
{
ErrorMessage = _reportObj.ErrorMessage;
Errors.Add(ErrorMessage);
AddCommentToCell(_reportObj);
Errors.Remove(ErrorMessage);
}
}
}
catch (Exception ex)
{
ErrorMessage = ex.Message;
Errors.Add(ErrorMessage);
_reportObj.ErrorMessage = ErrorMessage;
AddCommentToCell(_reportObj);
Errors.Remove(ErrorMessage);
Helper.LogError(ex);
}
}
}
In the class that contains GenerateReport:
public void GenerateReport()
{
Request.ParseFunction();
Request.MetacompleteCallBack = MetaDataCompleteCallback;
Request.BuildMetaData();
}
public void MetaDataCompleteCallback(int id)
{
try
{
if (Request.IsRequestCancelled)
{
Request.FormulaCell.Dispose();
return;
}
ErrorMessage = Request.ErrorMessage;
if (string.IsNullOrEmpty(Request.ErrorMessage))
{
_queryJob = new DataQueryJob(UnityContainer, Request.BuildQueryString(), DataRequestJobFinished, Request);
}
else
{
ModifyCommentOnFormulaCellPublishRefreshEvent();
}
}
catch (Exception ex)
{
ErrorMessage = ex.Message;
ModifyCommentOnFormulaCellPublishRefreshEvent();
}
finally
{
Request.MetacompleteCallBack = null;
}
}
public void DataRequestJobFinished(DataRequestResponse response)
{
Dispatcher.Invoke(new Action<DataRequestResponse>(DataRequestJobFinishedUI), response);
}
public void DataRequestJobFinishedUI(DataRequestResponse response)
{
try
{
if (Request.IsRequestCancelled)
{
return;
}
if (response.status != Status.COMPLETE)
{
ErrorMessage = ManipulateStatusMsg(response);
}
else // COMPLETE
{
var tmpReq = Request as DataRequest;
if (tmpReq == null) return;
new VerticalTemplate(tmpReq, response).Plot();
}
}
catch (Exception e)
{
ErrorMessage = e.Message;
Helper.LogError(e);
}
finally
{
//if (token != null)
// this.UnityContainer.Resolve<IEventAggregator>().GetEvent<DataQueryJobComplete>().Unsubscribe(token);
ModifyCommentOnFormulaCellPublishRefreshEvent();
Request.FormulaCell.Dispose();
}
}
In the plot class:
public void Plot()
{
...
attributeRange.Value2 = headerArray;
DataRange.Value2 = ....
DataRange.NumberFormat = ...
}
OutOfMemoryException is not about sending too many requests simultaneously. It is about freeing your resources the right way. In my experience there are two main causes of this exception:
Working incorrectly with immutable structures or the System.String class
Not disposing your disposable resources, especially graphics objects and WCF requests.
In the case of reporting, in my opinion, you have the second kind of problem. DataRequest and DataRequestResponse are a good place to start the investigation.
If this doesn't help, try using the Task library with the async/await pattern; you can find good examples here:
// Signature specifies Task<TResult>
async Task<int> TaskOfTResult_MethodAsync()
{
int hours;
// . . .
// Return statement specifies an integer result.
return hours;
}
// Calls to TaskOfTResult_MethodAsync
Task<int> returnedTaskTResult = TaskOfTResult_MethodAsync();
int intResult = await returnedTaskTResult;
// or, in a single statement
int intResult = await TaskOfTResult_MethodAsync();
// Signature specifies Task
async Task Task_MethodAsync()
{
// . . .
// The method has no return statement.
}
// Calls to Task_MethodAsync
Task returnedTask = Task_MethodAsync();
await returnedTask;
// or, in a single statement
await Task_MethodAsync();
In your code I see a while loop in which you can collect your tasks into a Task[] of size 100 and call the WaitAll method on it, and the problem should be solved. Sorry, but your code is quite large, so I can't give you a more direct example.
I'm having a lot of trouble parsing your code to figure out what is being iterated for your requests, but the basic template for batching asynchronously is going to be something like this:
private const int BatchSize = 100;
public async Task<IEnumerable<Result>> GetDataInBatches(IEnumerable<RequestParameters> parameters)
{
    if (!parameters.Any())
        return Enumerable.Empty<Result>();
    // Start one batch of queries and wait for all of them to finish.
    var batchResults = await Task.WhenAll(parameters.Take(BatchSize).Select(doQuery));
    // Recurse on the remaining parameters and append their results.
    return batchResults.Concat(await GetDataInBatches(parameters.Skip(BatchSize)));
}
where doQuery is something with the signature
async Task<Result> doQuery(RequestParameters parameters)
{
    //.. however you do the query
}
I wouldn't use this for a million requests since it's recursive, but your case would generate a call stack only 300 deep, so you'll be fine.
Note that this also assumes that your data request stuff is done asynchronously and returns a Task. Most libraries have been updated to do this (look for methods with the Async suffix). If yours doesn't expose that API, you might want to create a separate question on how specifically to get your library to play nice with the TPL.
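For comparison, a non-recursive variant of the same batching idea might look like the sketch below. `Batching` and `RunInBatchesAsync` are hypothetical names, and `doQuery` is whatever Task-returning function performs a single request:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Batching
{
    // Starts at most batchSize queries at a time and waits for the whole
    // batch to complete before starting the next one.
    public static async Task<List<TResult>> RunInBatchesAsync<TInput, TResult>(
        IEnumerable<TInput> inputs, Func<TInput, Task<TResult>> doQuery, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException("batchSize");

        var results = new List<TResult>();
        var batch = new List<Task<TResult>>(batchSize);
        foreach (TInput input in inputs)
        {
            batch.Add(doQuery(input));
            if (batch.Count == batchSize)
            {
                results.AddRange(await Task.WhenAll(batch));
                batch.Clear();
            }
        }
        if (batch.Count > 0)
        {
            results.AddRange(await Task.WhenAll(batch)); // last, partial batch
        }
        return results;
    }
}
```

The loop keeps the results in input order and avoids both the recursion and the repeated Skip calls, at the cost of being slightly longer. Any plotting that has to happen between batches could be done right after each `Task.WhenAll`.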
I recently encountered this very strange problem.
Initially I have this block of code
public async Task<string> Fetch(string module, string input)
{
if (module != this._moduleName)
{
return null;
}
try
{
var db = new SQLiteAsyncConnection(_dbPath);
ResponsePage storedResponse = new ResponsePage();
Action<SQLiteConnection> trans = connect =>
{
storedResponse = connect.Get<ResponsePage>(input);
};
await db.RunInTransactionAsync(trans);
string storedResponseString = storedResponse.Response;
return storedResponseString;
}
catch (Exception e)
{
return null;
}
}
However, control is never handed back to my code after the transaction finishes running. I traced the program, and it seems that after the lock is released, the flow of the program stops. Then I switched to the GetAsync method of the SQLiteAsyncConnection class. It did basically the same thing, so I was still blocked at the await. Then I removed the async calls and used the synchronous API, like below:
public async Task<string> Fetch(string module, string input)
{
if (module != this._moduleName)
{
return null;
}
try
{
var db = new SQLiteConnection(_dbPath);
ResponsePage storedResponse = new ResponsePage();
lock (_dbLock)
{
storedResponse = db.Get<ResponsePage>(input);
}
string storedResponseString = storedResponse.Response;
return storedResponseString;
}
catch (Exception e)
{
return null;
}
}
Only then does the logic flow back to my code. However, I can't figure out why this is so.
Another question: for this kind of simple query, is there any gain in query time if I use the async API instead of the sync API? If not, I'll stick to the sync version.
You are most likely calling Result (or Wait) further up the call stack from Fetch. This will cause a deadlock, as I explain on my blog and in a recent MSDN article.
For your second question, there is some overhead from async, so for extremely fast asynchronous operations, the synchronous version will be faster. There is no way to tell whether this is the case in your code unless you do profiling.
For a couple of days I have been working on a WebBrowser-based web scraper. After a couple of prototypes working with Threads and DocumentCompleted events, I decided to try to make a simple, easy-to-understand web scraper.
The goal is to create a web scraper that doesn't involve actual Thread objects. I want it to work in sequential steps (i.e. go to url, perform action, go to other url, etc.).
This is what I got so far:
public static class Webscraper
{
private static WebBrowser _wb;
public static string URL;
//WebBrowser objects have to run in Single Thread Appartment for some reason.
[STAThread]
public static void Init_Browser()
{
_wb = new WebBrowser();
}
public static void Navigate_And_Wait(string url)
{
//Navigate to a specific url.
_wb.Navigate(url);
//Wait till the url is loaded.
while (_wb.IsBusy) ;
//Loop until current url == target url. (In case a website loads urls in steps)
while (!_wb.Url.ToString().Contains(url))
{
//Wait till next url is loaded
while (_wb.IsBusy) ;
}
//Place URL
URL = _wb.Url.ToString();
}
}
I am a novice programmer, but I think this is pretty straightforward code.
That's why I detest the fact that for some reason the program throws a NullReferenceException at this piece of code:
_wb.Url.ToString().Contains(url)
I just called the _wb.Navigate() method, so the null reference can't be the _wb object itself. So the only thing I can imagine is that the _wb.Url object is null. But the while (_wb.IsBusy) loop should prevent that.
So what is going on and how can I fix it?
Busy waiting (while (_wb.IsBusy) ;) on the UI thread isn't advisable. If you use the async/await features of .NET 4.5, you can get the effect you want (i.e. go to url, perform action, go to other url, etc.):
public static class SOExtensions
{
public static Task NavigateAsync(this WebBrowser wb, string url)
{
TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();
WebBrowserDocumentCompletedEventHandler completedEvent = null;
completedEvent = (sender, e) =>
{
wb.DocumentCompleted -= completedEvent;
tcs.SetResult(null);
};
wb.DocumentCompleted += completedEvent;
wb.ScriptErrorsSuppressed = true;
wb.Navigate(url);
return tcs.Task;
}
}
async void ProcessButtonClick()
{
await webBrowser1.NavigateAsync("http://www.stackoverflow.com");
MessageBox.Show(webBrowser1.DocumentTitle);
await webBrowser1.NavigateAsync("http://www.google.com");
MessageBox.Show(webBrowser1.DocumentTitle);
}