Can someone please put another set of eyes on this? I am trying to validate that a blob image exists prior to displaying it. This could take 1-4 seconds.
My JS looks like this:
var url = '/api/blob/ValidateBlobExists?id=' + blobImage;
$.getJSON(url,
    function (json) {
        console.log("success");
    })
    .done(function (data) { // the JSON payload must be taken as a parameter here
        console.log("second success");
        var exists = data;
        if (exists) {
            console.log("exists");
            $('#imgPhotograph').hide().attr('src', blobImage).fadeIn();
        } else {
            $('#imgPhotograph').attr('src', '../Images/NoPhotoFound.jpg');
        }
    });
The API looks like this... please don't judge... it's VB because it has to be.
Public Function ValidateBlobExists(id As String) As JsonResult(Of Boolean)
    Dim result = CDNHelper.BlobExists(id) 'this could take ~5 seconds
    Return Json(Of Boolean)(result)
End Function
The underlying method looks like this:
public static bool BlobExists(string filename)
{
    try
    {
        var sw = new Stopwatch();
        sw.Start();
        do
        {
            if (client.AssertBlobExists(filename).Result) // <-- this is a wrapper to query the azure blob
                return true;
            System.Threading.Thread.Sleep(500); // no reason to hammer the service
        } while (sw.ElapsedMilliseconds < 8000);
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        throw;
    }
    return false;
}
The thing is, in my console output I can't even get to the "success". It seems getJSON() is just not willing to wait for the 8 seconds to elapse before continuing. Any thoughts are appreciated.
The issue was that my API method was not declared Async. Even though I was calling .Result, it was getting messed up. The solution for me was to modify my code as below:
Public Async Function ValidateBlobExists(id As String) As Task(Of JsonResult(Of Boolean))
    Dim result = Await CDNHelper.BlobExists(id, "ioc")
    Return Json(Of Boolean)(result)
End Function
and
public static async Task<bool> BlobExists(string filename, string container)
{
    // await the async wrapper (a PostAsync call under the hood) instead of blocking on .Result
    return await client.AssertBlobExists(filename);
}
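For reference, the original 8-second polling loop can be made truly async in the same spirit (a sketch, reusing the AssertBlobExists wrapper from above; the method name here is mine, not from the original code):
public static async Task<bool> BlobExistsWithRetry(string filename)
{
    var sw = Stopwatch.StartNew();
    do
    {
        if (await client.AssertBlobExists(filename)) // await instead of blocking on .Result
            return true;
        await Task.Delay(500); // still no reason to hammer the service
    } while (sw.ElapsedMilliseconds < 8000);
    return false;
}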
My approach is to visit each link and, once all are visited, to receive a return value.
The problem is that when I start the code, I instantly get a response, which is clearly empty because not all the links have been visited yet.
private async void ibtn_start_visiting_Click(object sender, EventArgs e)
{
string js = "var ele = document.querySelectorAll('#profiles * .tile__link');document.getElementsByClassName('js-scrollable')[0].scrollBy(0,30);ele.forEach(function(value,index){setTimeout(function(){if(index < ele.length-1){ele[index].click();}else{document.querySelectorAll('.search-results__item').forEach(e => e.parentNode.removeChild(e));document.getElementsByClassName('js-close-spotlight')[0].click();return 'hallo';}},1000 * index)})";
await browser.EvaluateScriptAsync(js).ContinueWith(x =>
{
var response = x.Result;
if (response.Success)
{
this.Invoke((MethodInvoker)delegate
{
var res = (string)response.Result;
Console.WriteLine("Response: " + res);
});
}
else {
Console.WriteLine("NO");
}
});
}
This is the JavaScript:
var ele = document.querySelectorAll('#profiles * .tile__link');
document.getElementsByClassName('js-scrollable')[0].scrollBy(0,30);
ele.forEach(function(value,index){
setTimeout(function(){
if(index < ele.length-1){
ele[index].click();
}
else{
document.querySelectorAll('.search-results__item').forEach(e => e.parentNode.removeChild(e));
document.getElementsByClassName('js-close-spotlight')[0].click();
alert('hallo');
}
},1500 * index)
})
Oh, I see. Your JavaScript is using setTimeout, which is roughly equivalent to making the function you pass to it async as well. CefSharp doesn't know when those setTimeout tasks are completed, hence the early return. The pending JavaScript code does execute, eventually. To know when it has completed, you've got a couple of options:
Make your async JavaScript code synchronous by getting rid of setTimeout completely.
Set some global variable in your async JavaScript code and periodically check your webpage in C# to see if that variable is set.
Register some JS handler and call that when your async JavaScript is completed.
#3 is my favorite, so you might register that handler in C# like so:
public class CallbackObjectForJs
{
    public void showMessage(string msg)
    {
        // we did it!
    }
}
webView.RegisterJsObject("callbackObj", new CallbackObjectForJs());
And your JS might look something like:
var totalTasks = 0;
function beginTask() {
    totalTasks++;
}
function completeTask() {
    totalTasks--;
    if (totalTasks === 0) {
        callbackObj.showMessage("we finished!"); // this object was registered via C#
    }
}
var ele = document.querySelectorAll('#profiles * .tile__link');
document.getElementsByClassName('js-scrollable')[0].scrollBy(0,30);
ele.forEach(function(value,index){
beginTask(); // NEW
setTimeout(function(){
... // work
completeTask();
}, 1500 * index);
})
To make this cleaner, you may want to look into JavaScript's Promise.all().
I have been working on a web scraping project.
I am having two issues. One is presenting the number of URLs processed as a percentage, but the far larger issue is that I cannot figure out how to know when all the threads I am creating are totally finished.
NOTE: I am aware that a parallel foreach moves on once it is done, BUT this is within a recursive method.
My code below:
public async Task Scrape(string url)
{
var page = string.Empty;
try
{
page = await _service.Get(url);
if (page != string.Empty)
{
if (regex.IsMatch(page))
{
Parallel.For(0, regex.Matches(page).Count,
index =>
{
try
{
if (regex.Matches(page)[index].Groups[1].Value.StartsWith("/"))
{
var match = regex.Matches(page)[index].Groups[1].Value.ToLower();
if (!links.Contains(BaseUrl + match) && !Visitedlinks.Contains(BaseUrl + match))
{
Uri ValidUri = WebPageValidator.GetUrl(match);
if (ValidUri != null && HostUrls.Contains(ValidUri.Host))
links.Enqueue(match.Replace(".html", ""));
else
links.Enqueue(BaseUrl + match.Replace(".html", ""));
}
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details."); ;
}
});
WebPageInternalHandler.SavePage(page, url);
var context = CustomSynchronizationContext.GetSynchronizationContext();
Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 25 },
webpage =>
{
try
{
if (WebPageValidator.ValidUrl(webpage))
{
string linkToProcess = webpage;
if (links.TryDequeue(out linkToProcess) && !Visitedlinks.Contains(linkToProcess))
{
ShowPercentProgress();
Thread.Sleep(15);
Visitedlinks.Enqueue(linkToProcess);
Task d = Scrape(linkToProcess);
Console.Clear();
}
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details.");
}
});
Console.WriteLine("parallel finished");
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details.");
}
}
NOTE that Scrape gets called multiple times (recursively).
I call the method like this:
public Task ExecuteScrape()
{
var context = CustomSynchronizationContext.GetSynchronizationContext();
Scrape(BaseUrl).ContinueWith(x => {
Visitedlinks.Enqueue(BaseUrl);
}, context).Wait();
return Task.CompletedTask; // don't return null from a Task-returning method
}
which in turn gets called like so:
static void Main(string[] args)
{
RunScrapper();
Console.ReadLine();
}
public static void RunScrapper()
{
try
{
_scrapper.ExecuteScrape();
}
catch (Exception e)
{
Console.WriteLine(e);
throw;
}
}
How do I solve this?
(Is it ethical for me to answer a question about web page scraping?)
Don't call Scrape recursively. Place the list of URLs you want to scrape in a ConcurrentQueue and begin processing that queue. As the process of scraping a page returns more URLs, just add them to the same queue.
I wouldn't use just a string, either. I recommend creating a class like:
public class UrlToScrape //because naming things is hard
{
public string Url { get; set; }
public int Depth { get; set; }
}
Regardless of how you execute this, it's recursive, so you have to somehow keep track of how many levels deep you are. A website could deliberately generate URLs that send you into infinite recursion. (If they did that, then they don't want you scraping their site. Does anybody want people scraping their site?)
When your queue is empty that doesn't mean you're done. The queue could be empty, but the process of scraping the last url dequeued could still add more items back into that queue, so you need a way to account for that.
You could use a thread safe counter (int using Interlocked.Increment/Decrement) that you increment when you start processing a url and decrement when you finish. You're done when the queue is empty and the count of in-process urls is zero.
This is a very rough model to illustrate the concept, not what I'd call a refined solution. For example, you still need to account for exception handling, and I have no idea where the results go, etc.
public class UrlScraper
{
private readonly ConcurrentQueue<UrlToScrape> _queue = new ConcurrentQueue<UrlToScrape>();
private int _inProcessUrlCounter;
private readonly List<string> _processedUrls = new List<string>();
public UrlScraper(IEnumerable<string> urls)
{
foreach (var url in urls)
{
_queue.Enqueue(new UrlToScrape {Url = url, Depth = 1});
}
}
public void ScrapeUrls()
{
    while (_queue.TryDequeue(out var dequeuedUrl) || _inProcessUrlCounter > 0)
    {
        if (dequeuedUrl != null)
        {
            // Make sure you don't go more levels deep than you want to.
            if (dequeuedUrl.Depth > 5) continue;
            if (_processedUrls.Contains(dequeuedUrl.Url)) continue;
            _processedUrls.Add(dequeuedUrl.Url);
            Interlocked.Increment(ref _inProcessUrlCounter);
            var url = dequeuedUrl;
            Task.Run(() => ProcessUrl(url));
        }
        else
        {
            // The queue is empty but urls are still in process; don't spin hot.
            Thread.Sleep(100);
        }
    }
}
private void ProcessUrl(UrlToScrape url)
{
try
{
// As the process discovers more urls to scrape,
// pretend that this is one of those new urls.
var someNewUrl = "http://discovered";
_queue.Enqueue(new UrlToScrape { Url = someNewUrl, Depth = url.Depth + 1 });
}
catch (Exception ex)
{
// whatever you want to do with this
}
finally
{
Interlocked.Decrement(ref _inProcessUrlCounter);
}
}
}
If I were doing this for real, the ProcessUrl method would be its own class, and it would take HTML, not a URL. In this form it's difficult to unit test. If it were in a separate class, you could pass in HTML, verify that it outputs results somewhere, and that it calls a method to enqueue the new URLs it finds, as sketched below.
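A rough sketch of that testable shape (the interface name and callback are mine, not from the code above):
public interface IPageProcessor
{
    // Takes already-downloaded HTML rather than a URL, so unit tests need no network.
    // Newly discovered urls are reported through the callback instead of a shared queue.
    void Process(string html, Action<UrlToScrape> enqueueDiscoveredUrl);
}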
It's also not a bad idea to maintain the queue as a database table instead. Otherwise, if you're processing a bunch of URLs and you have to stop, you'd have to start all over again.
Can't you add all the tasks (each Task d) to some type of concurrent collection that you thread through all recursive calls (via a method argument) and then simply call Task.WhenAll(tasks).Wait()?
You'd need an intermediate method (it makes things cleaner) that launches the base Scrape call and passes in the empty task collection. When the base call returns, you have all the tasks in hand and you simply wait them out.
public async Task Scrape(string url)
{
    var tasks = new ConcurrentQueue<Task>();

    // Call your implementation, but change it so that
    // every launched task d is added to tasks.
    Scrape(url, tasks);

    // 1st option: Wait(). This will block the caller until all tasks finish.
    Task.WhenAll(tasks).Wait();

    // 2nd option: await. This won't block and will return to the caller;
    // once all tasks are finished, the method resumes at the WriteLine.
    await Task.WhenAll(tasks);

    Console.WriteLine("Finished!");
}
Simple rule: if you want to know when something finishes, the first step is to keep track of it. In your current implementation you are essentially firing and forgetting all launched tasks...
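Concretely, inside the Parallel.ForEach body of the question's Scrape, the change would look something like this (a sketch against the question's own fields, with the tasks collection threaded through as above):
public async Task Scrape(string url, ConcurrentQueue<Task> tasks)
{
    var page = await _service.Get(url);
    // ... same link discovery into 'links' as in the question ...
    Parallel.ForEach(links, webpage =>
    {
        if (links.TryDequeue(out var linkToProcess) && !Visitedlinks.Contains(linkToProcess))
        {
            Visitedlinks.Enqueue(linkToProcess);
            // Instead of discarding the task (Task d = Scrape(...)),
            // enqueue it so the caller can wait on Task.WhenAll(tasks):
            tasks.Enqueue(Scrape(linkToProcess, tasks));
        }
    });
}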
I have an Excel add-in written in C#, .NET 4.5. It sends many web service requests to a web server to get data, e.g. 30,000 requests. When the data for a request comes back, the add-in plots the data in Excel.
Originally I did all the requests asynchronously, but sometimes I would get an OutOfMemoryException.
So I changed it to send the requests one by one, but that is too slow; it takes a long time to finish all the requests.
I wonder if there is a way to do 100 requests at a time asynchronously and, once the data for all 100 requests has come back and been plotted in Excel, send the next 100 requests.
Thanks
Edit
On my add-in there is a ribbon button "Refresh"; when it is clicked, the refresh process starts.
On the main UI thread, when the ribbon button is clicked, it calls the web service BuildMetaData.
Once that returns, its callback MetaDataCompleteCallback sends another web service call.
Once that returns, its callback DataRequestJobFinished calls Plot to plot the data in Excel. See below:
RefreshBtn_Click()
{
if (cells == null) return;
Range firstOccurence = null;
firstOccurence = cells.Find(functionPattern, null,
null, null,
XlSearchOrder.xlByRows,
XlSearchDirection.xlNext,
null, null, null);
DataRequest request = null;
_reportObj = null;
Range currentOccurence = null;
while (!Helper.RefreshCancelled)
{
if (firstOccurence == null || IsRangeEqual(firstOccurence, currentOccurence)) break;
found = true;
currentOccurence = cells.FindNext(currentOccurence ?? firstOccurence);
try
{
var excelFormulaCell = new ExcelFormulaCell(currentOccurence);
if (excelFormulaCell.HasValidFormulaCell)
{
request = new DataRequest(_unityContainer, XLApp, excelFormulaCell);
request.IsRefreshClicked = true;
request.Workbook = Workbook;
request.Worksheets = Worksheets;
_reportObj = new ReportBuilder(_unityContainer, XLApp, request, index, false);
_reportObj.ParseParameters();
_reportObj.GenerateReport();
//this is necessary b/c error message is wrapped in valid object DataResponse
//if (!string.IsNullOrEmpty(_reportObj.ErrorMessage)) //Clear previous error message
{
ErrorMessage = _reportObj.ErrorMessage;
Errors.Add(ErrorMessage);
AddCommentToCell(_reportObj);
Errors.Remove(ErrorMessage);
}
}
}
catch (Exception ex)
{
ErrorMessage = ex.Message;
Errors.Add(ErrorMessage);
_reportObj.ErrorMessage = ErrorMessage;
AddCommentToCell(_reportObj);
Errors.Remove(ErrorMessage);
Helper.LogError(ex);
}
}
}
In the class that generates the report:
public void GenerateReport()
{
Request.ParseFunction();
Request.MetacompleteCallBack = MetaDataCompleteCallback;
Request.BuildMetaData();
}
public void MetaDataCompleteCallback(int id)
{
try
{
if (Request.IsRequestCancelled)
{
Request.FormulaCell.Dispose();
return;
}
ErrorMessage = Request.ErrorMessage;
if (string.IsNullOrEmpty(Request.ErrorMessage))
{
_queryJob = new DataQueryJob(UnityContainer, Request.BuildQueryString(), DataRequestJobFinished, Request);
}
else
{
ModifyCommentOnFormulaCellPublishRefreshEvent();
}
}
catch (Exception ex)
{
ErrorMessage = ex.Message;
ModifyCommentOnFormulaCellPublishRefreshEvent();
}
finally
{
Request.MetacompleteCallBack = null;
}
}
public void DataRequestJobFinished(DataRequestResponse response)
{
Dispatcher.Invoke(new Action<DataRequestResponse>(DataRequestJobFinishedUI), response);
}
public void DataRequestJobFinishedUI(DataRequestResponse response)
{
try
{
if (Request.IsRequestCancelled)
{
return;
}
if (response.status != Status.COMPLETE)
{
ErrorMessage = ManipulateStatusMsg(response);
}
else // COMPLETE
{
var tmpReq = Request as DataRequest;
if (tmpReq == null) return;
new VerticalTemplate(tmpReq, response).Plot();
}
}
catch (Exception e)
{
ErrorMessage = e.Message;
Helper.LogError(e);
}
finally
{
//if (token != null)
// this.UnityContainer.Resolve<IEventAggregator>().GetEvent<DataQueryJobComplete>().Unsubscribe(token);
ModifyCommentOnFormulaCellPublishRefreshEvent();
Request.FormulaCell.Dispose();
}
}
In the plot class:
public void Plot()
{
...
attributeRange.Value2 = headerArray;
DataRange.Value2 = ....
DataRange.NumberFormat = ...
}
OutOfMemoryException is not about too many requests being sent simultaneously; it is about freeing your resources the right way. In my practice there are two main problems when you are getting such an exception:
Working incorrectly with immutable structures or the System.String class
Not disposing your disposable resources, especially graphics objects and WCF requests
In the case of reporting, in my opinion, you have the second type of problem. DataRequest and DataRequestResponse are good places to start the investigation for such objects.
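For the disposal point, a minimal sketch of the pattern (SomeServiceClient is a hypothetical stand-in for whatever disposable type your requests create):
// Wrap anything that implements IDisposable in a using block so its memory
// and handles are freed deterministically, even when an exception is thrown.
using (var client = new SomeServiceClient())
{
    var response = client.GetData();
    // ... use response ...
} // client.Dispose() runs here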
If this doesn't help, try using the Task library with the async/await pattern; good examples look like this:
// Signature specifies Task<TResult>
async Task<int> TaskOfTResult_MethodAsync()
{
int hours;
// . . .
// Return statement specifies an integer result.
return hours;
}
// Calls to TaskOfTResult_MethodAsync
Task<int> returnedTaskTResult = TaskOfTResult_MethodAsync();
int intResult = await returnedTaskTResult;
// or, in a single statement
int intResult = await TaskOfTResult_MethodAsync();
// Signature specifies Task
async Task Task_MethodAsync()
{
// . . .
// The method has no return statement.
}
// Calls to Task_MethodAsync
Task returnedTask = Task_MethodAsync();
await returnedTask;
// or, in a single statement
await Task_MethodAsync();
In your code I see a while loop in which you could collect a Task[] of size 100 and then use the WaitAll method on it; that should solve the problem. Sorry, but your code is large enough that I can't give you a more direct example; the general shape is sketched below.
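That idea looks roughly like this (a sketch, not your exact code: urlsToRequest, SendRequestAsync, and PlotInExcel are hypothetical stand-ins for your request source, your web service call, and your plotting step):
private async Task RefreshInBatchesAsync(IEnumerable<string> urlsToRequest)
{
    const int batchSize = 100;
    var pending = new List<Task<DataRequestResponse>>(batchSize);
    foreach (var url in urlsToRequest)
    {
        pending.Add(SendRequestAsync(url));          // start a request without blocking
        if (pending.Count == batchSize)
        {
            var batch = await Task.WhenAll(pending); // wait for the whole batch of 100
            PlotInExcel(batch);                      // plot, then move on to the next batch
            pending.Clear();
        }
    }
    if (pending.Count > 0)
        PlotInExcel(await Task.WhenAll(pending));    // flush the final partial batch
}
Here await Task.WhenAll is used as the non-blocking sibling of Task.WaitAll, which keeps the UI thread free while a batch is in flight.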
I'm having a lot of trouble parsing your code to figure out what is being iterated for your requests, but the basic template for batching asynchronously is going to be something like this:
private const int batchSize = 100;

public async Task<IEnumerable<Result>> GetDataInBatches(IEnumerable<RequestParameters> parameters) {
    if (!parameters.Any())
        return Enumerable.Empty<Result>();
    var batchResults = await Task.WhenAll(parameters.Take(batchSize).Select(doQuery));
    return batchResults.Concat(await GetDataInBatches(parameters.Skip(batchSize)));
}
where doQuery is something with the signature
async Task<Result> doQuery(RequestParameters parameters) {
    // .. however you do the query
}
I wouldn't use this for a million requests since it's recursive, but your case would only generate a call stack about 300 deep, so you'll be fine.
Note that this also assumes that your data request code runs asynchronously and returns a Task. Most libraries have been updated to do this (look for methods with the Async suffix). If your library doesn't expose such an API, you might want to create a separate question about how to get it to play nicely with the TPL.
I am trying to parse data from a JSON file but I am getting variable output (sometimes right, other times nothing). I am pretty sure it is related to the time needed to parse the file, but I'm having trouble finding out where. Here it is:
public class HspitalVM
{
List<Hspital> hspitalList=null;
public List<KeyedList<string, Hspital>> GroupedHospitals
{
get
{
getJson();
var groupedHospital =
from hspital in hspitalList
group hspital by hspital.Type into hspitalByType
select new KeyedList<string, Hspital>(hspitalByType);
return new List<KeyedList<string, Hspital>>(groupedHospital);
}
}
public async void getJson()
{
StorageFolder localFolder = ApplicationData.Current.LocalFolder;
try
{
StorageFile textFile = await localFolder.GetFileAsync(m_HospFileName);
using (IRandomAccessStream textStream = await textFile.OpenReadAsync())
{
using (DataReader textReader = new DataReader(textStream))
{
uint textLength = (uint)textStream.Size;
await textReader.LoadAsync(textLength);
string jsonContents = textReader.ReadString(textLength);
hspitalList = JsonConvert.DeserializeObject<IList<Hspital>>(jsonContents) as List<Hspital>;
}
}
}
catch (Exception ex)
{
string err = "Exception: " + ex.Message;
MessageBox.Show(err);
}
}
}
You are not await-ing the result of the getJson() call, so, as expected, most of the time you'll get no information because the actual call to GetFileAsync has not completed yet.
Now, since the getJson method returns void, you can't really await it. A potential fix is to use Result to turn the asynchronous code into synchronous code for the getter:
public List<KeyedList<string, Hspital>> GroupedHospitals
{
get
{
hspitalList = getJson().Result;
...
}
}
...
public async Task<IList<Hspital>> getJson()
{
....
return JsonConvert.DeserializeObject<IList<Hspital>>(jsonContents) as List<Hspital>;
}
Note: It is generally a bad idea to have getters perform long-running operations, and calling an async method synchronously via Wait/Result can cause deadlocks in your code (see "await vs Task.Wait - Deadlock?").
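If the callers can be made async, a deadlock-free alternative (a sketch, assuming the Task-returning getJson shown above) is to expose an async method instead of a property getter and await it all the way up:
public async Task<List<KeyedList<string, Hspital>>> GetGroupedHospitalsAsync()
{
    var hspitalList = await getJson(); // no .Result, so no risk of deadlock
    var groupedHospital =
        from hspital in hspitalList
        group hspital by hspital.Type into hspitalByType
        select new KeyedList<string, Hspital>(hspitalByType);
    return new List<KeyedList<string, Hspital>>(groupedHospital);
}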
I recently encountered this very strange problem.
Initially I had this block of code:
public async Task<string> Fetch(string module, string input)
{
if (module != this._moduleName)
{
return null;
}
try
{
var db = new SQLiteAsyncConnection(_dbPath);
ResponsePage storedResponse = new ResponsePage();
Action<SQLiteConnection> trans = connect =>
{
storedResponse = connect.Get<ResponsePage>(input);
};
await db.RunInTransactionAsync(trans);
string storedResponseString = storedResponse.Response;
return storedResponseString;
}
catch (Exception e)
{
return null;
}
}
However, control is never handed back to my code after the transaction finishes running. I traced the program, and it seems that after the lock is released, the flow of the program stops. Then I switched to using the GetAsync method from the SQLiteAsyncConnection class. It basically did the same thing, so I was still blocked at the await. Then I removed the async calls and used the synchronous API like below:
public async Task<string> Fetch(string module, string input)
{
if (module != this._moduleName)
{
return null;
}
try
{
var db = new SQLiteConnection(_dbPath);
ResponsePage storedResponse = new ResponsePage();
lock (_dbLock)
{
storedResponse = db.Get<ResponsePage>(input);
}
string storedResponseString = storedResponse.Response;
return storedResponseString;
}
catch (Exception e)
{
return null;
}
}
Only then does control flow back to my code, and I can't figure out why.
Another question: for this kind of simple query, is there any gain in query time from using the async API instead of the sync API? If not, I'll stick with the sync version.
You are most likely calling Result (or Wait) further up the call stack from Fetch. This will cause a deadlock, as I explain on my blog and in a recent MSDN article.
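The classic shape of that deadlock, as a minimal sketch (the two calling methods here are hypothetical, not your code):
// On a UI thread, .Result blocks the only thread that the 'await' inside
// Fetch needs in order to resume, so neither side can make progress.
public string GetResponseBlocking()
{
    return Fetch("module", "input").Result; // deadlock
}

// The fix: stay async all the way up the call stack.
public async Task<string> GetResponseAsync()
{
    return await Fetch("module", "input");
}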
For your second question: there is some overhead from async, so for extremely fast asynchronous operations the synchronous version will be faster. There is no way to tell whether that is the case in your code unless you do profiling.