Asynchronously download and compile list of JsonDocument - c#

I'm fairly new (returning after a couple of decades) to C# and to the async/await model of programming. I'm looking for some guidance, since I received an understandable warning CS1998 that the asynchronous method lacks await operators and will run synchronously.
The code below I think is straightforward - the server API returns data in pages of 25 items. I'm using a continuation to add each page of 25 to a List of JsonDocuments. Calling code will handle the parsing as needed. I'm not sure how I could reasonably leverage anything further in this, but looking for any suggestions/guidance.
internal static async Task<List<JsonDocument>> Get_All_Data(HttpClient client, string endpoint)
{
Console.WriteLine("Downloading all data from {0}{1}", client.BaseAddress, endpoint);
var all_pages = new List<JsonDocument>();
// Get first page to determine total number of pages
HttpResponseMessage response = client.GetAsync(endpoint).Result;
Console.WriteLine("Initial download complete - parsing headers to determine total pages");
//int items_per_page;
if (int.TryParse(Get_Header_Value("X-Per-Page", response.Headers), out int items_per_page) == false)
// throw new Exception("Response missing X-Per-Page in header");
items_per_page = 25;
if (int.TryParse(Get_Header_Value("X-Total-Count", response.Headers), out int total_items) == false)
//throw new Exception("Response missing X-Total-Count in header");
total_items = 1;
// Division returns the number of complete pages; add 1 for a partial page if total_items is not an exact multiple of items_per_page
var total_pages = total_items / items_per_page;
if ((total_items % items_per_page) != 0) total_pages++;
Console.WriteLine("{0} pages to be downloaded", total_pages);
var http_tasks = new Task[total_pages];
for (int i = 1; i <= total_pages; i++)
{
Console.WriteLine("Downloading page {0}", i);
var paged_endpoint = endpoint + "?page=" + i;
response = client.GetAsync(paged_endpoint).Result;
http_tasks[i - 1] = response.Content.ReadAsStringAsync().ContinueWith((_content) => { all_pages.Add(JsonDocument.Parse(_content.Result)); });
//http_tasks[i].ContinueWith((_content) => { all_pages.Add(JsonDocument.Parse_List(_content.Result)); });
}
System.Threading.Tasks.Task.WaitAll(http_tasks); // wait for all of the downloads and parsing to complete
return all_pages;
}
Thanks for your help

My suggestion is to await all asynchronous operations, and use the Parallel.ForEachAsync method to parallelize the downloading of the JSON documents, while maintaining control of the degree of parallelism:
static async Task<JsonDocument[]> GetAllData(HttpClient client, string endpoint)
{
HttpResponseMessage response = await client.GetAsync(endpoint);
response.EnsureSuccessStatusCode();
if (!Int32.TryParse(GetHeaderValue(response, "X-Total-Count"),
out int totalItems) || totalItems < 0)
totalItems = 1;
if (!Int32.TryParse(GetHeaderValue(response, "X-Per-Page"),
out int itemsPerPage) || itemsPerPage < 1)
itemsPerPage = 25;
int totalPages = ((totalItems - 1) / itemsPerPage) + 1;
JsonDocument[] results = new JsonDocument[totalPages];
ParallelOptions options = new() { MaxDegreeOfParallelism = 5 };
await Parallel.ForEachAsync(Enumerable.Range(1, totalPages), options,
async (page, ct) =>
{
string pageEndpoint = endpoint + "?page=" + page;
HttpResponseMessage pageResponse = await client
.GetAsync(pageEndpoint, ct);
pageResponse.EnsureSuccessStatusCode();
string pageContent = await pageResponse.Content.ReadAsStringAsync(ct);
JsonDocument result = JsonDocument.Parse(pageContent);
results[page - 1] = result;
});
return results;
}
static string GetHeaderValue(HttpResponseMessage response, string name)
=> response.Headers.TryGetValues(name, out var values) ?
values.FirstOrDefault() : null;
The MaxDegreeOfParallelism is configured to the value 5 for demonstration purposes. You can find the optimal degree of parallelism by experimenting with your API. Setting the value too low might result in mediocre performance. Setting the value too high might overburden the target server, and potentially trigger an anti-DoS-attack mechanism.
If you are not familiar with Enumerable.Range, it is a LINQ method that returns a sequence of consecutive integers that starts from start and contains count elements.
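As a minimal sketch of the page arithmetic and of Enumerable.Range, assuming hypothetical totals of 101 items at 25 per page (these numbers are for illustration only and do not come from the API in the question):

```csharp
using System;
using System.Linq;

// Hypothetical totals, chosen only for illustration.
int totalItems = 101;
int itemsPerPage = 25;

// Integer ceiling division: 101 items at 25 per page need 5 pages.
int totalPages = ((totalItems - 1) / itemsPerPage) + 1;

// Enumerable.Range(start, count) yields count consecutive integers beginning at start.
int[] pages = Enumerable.Range(1, totalPages).ToArray();

Console.WriteLine(string.Join(", ", pages)); // prints "1, 2, 3, 4, 5"
```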
The GetAllData is an asynchronous method and it is supposed to be awaited. If you are calling it without await, and your application is a UI application like WinForms or WPF, you are at risk of experiencing a deadlock. Don't panic, it happens consistently, and you'll observe it during the testing. One way to prevent it is to append .ConfigureAwait(false) to all awaited operations inside the GetAllData method.
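A minimal sketch of the ConfigureAwait(false) pattern, assuming a hypothetical LoadAsync method that stands in for GetAllData and uses Task.Delay instead of a real HTTP call:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a library method such as GetAllData.
static async Task<int> LoadAsync()
{
    // ConfigureAwait(false): the continuation does not have to resume on the
    // captured context (for example, the UI thread of a WinForms or WPF app).
    await Task.Delay(50).ConfigureAwait(false);
    return 42;
}

// Preferred: await the call instead of blocking on .Result or .Wait().
int value = await LoadAsync();
Console.WriteLine(value); // prints "42"
```

Awaiting the call remains the first choice. ConfigureAwait(false) inside the method only makes it safer for callers that block anyway.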

Related

Inconsistent outcome when trying to fetch all items from a paginated url

I am having trouble fetching all forms currently hosted in HubSpot.
I tried a simple for loop that made one request at a time and fetched one form at a time. That worked, but it was very slow.
I then thought it might work better if I created a separate task for each request and had every task store its results in one common list.
The problem is that I expect the list to have 2000 items, but I never seem to get that. The number of items I get is inconsistent.
Why is that?
This is how I have set up my fetching scheme.
private static async Task<IEnumerable<HubspotModel>> GetForms(
string hubspotPath, int pageSize)
{
int totalResults;
int offset = 0;
List<HubspotModel> output = new();
List<Task> tasks = new();
using var client = new HttpClient();
{
System.Net.Http.Headers.HttpResponseHeaders requestHeader = client
.GetAsync($"https://api.hubapi.com{hubspotPath}?" +
$"hapikey={HubspotConfiguration.ApiKey}&limit={1}&offset={0}")
.Result.Headers;
totalResults = int.Parse(requestHeader.GetValues("x-total-count").First());
do
{
tasks.Add(Task.Run(() =>
{
int scopedOffset = offset;
IEnumerable<HubspotModel> forms = GetFormsFromHubspot(hubspotPath,
pageSize, offset, client);
output.AddRange(forms);
}).ContinueWith(requestReponse =>
{
if (requestReponse.IsFaulted)
{
Console.WriteLine("it failed");
}
}));
offset += pageSize;
}
while (totalResults > offset);
await Task.WhenAll(tasks);
}
return output;
}
private static IEnumerable<HubspotModel> GetFormsFromHubspot(string hubspotPath,
int pageSize, int offset, HttpClient client)
{
HttpResponseMessage request = client
.GetAsync($"https://api.hubapi.com{hubspotPath}?" +
$"hapikey={HubspotConfiguration.ApiKey}&limit={pageSize}&offset={offset}")
.Result;
request.EnsureSuccessStatusCode();
string content = request.Content.ReadAsStringAsync().Result;
IEnumerable<Dictionary<string, object>> jsonResponse = JsonSerializer
.Deserialize<IEnumerable<Dictionary<string, object>>>(content,
new JsonSerializerOptions() { });
var guid = Guid.Parse(jsonResponse.First()["guid"].ToString());
var forms = jsonResponse.Select(x => new HubspotModel()
{
id = Guid.Parse(x["guid"].ToString()),
FormName = x["name"].ToString(),
Form = x
});
return forms;
}
First of all, I'd suggest making GetFormsFromHubspot async as well and using await client.GetAsync(...) and await request.Content.ReadAsStringAsync() instead of client.GetAsync(...).Result and ReadAsStringAsync().Result respectively. Using .Result blocks the current thread, which throws away the advantages of async tasks.
But the main cause of the problem should be the following
GetFormsFromHubspot(hubspotPath, pageSize, offset, client);
Here you are calling the GetFormsFromHubspot with an offset parameter from an outer scope (and that value keeps changing), thus it will not use the value it had when you created that task but it uses the value it actually has, when that particular part of the code is really executed. So the value that is used as an offset is quite random. You already tried to create a
int scopedOffset = offset;
but you don't use it, and you also create it in the wrong place. Create scopedOffset outside of the task, but inside the loop's body, so that it is created when the task is created. Because it's inside the loop's body, a new variable is created for each task.
The following should do the trick (after you refactor GetFormsFromHubspot to be async):
do {
int scopedOffset = offset;
tasks.Add(Task.Run(async () => {
IEnumerable<HubspotModel> forms = await GetFormsFromHubspot(hubspotPath, pageSize, scopedOffset, client);
output.AddRange(forms);
})
.ContinueWith(...));
offset += pageSize;
} while (totalResults > offset);
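The capture behavior described above can be reproduced without any HTTP calls. This is a small sketch, assuming a page size of 25 and three iterations (both values are illustrative), that runs the delegates only after the loop has finished:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var shared = new List<Func<int>>();
var copied = new List<Func<int>>();

int offset = 0;
for (int i = 0; i < 3; i++)
{
    // Captures the variable itself, so every delegate sees later changes.
    shared.Add(() => offset);

    // A fresh variable per iteration freezes the value at this point.
    int scopedOffset = offset;
    copied.Add(() => scopedOffset);

    offset += 25;
}

Console.WriteLine(string.Join(", ", shared.Select(f => f()))); // prints "75, 75, 75"
Console.WriteLine(string.Join(", ", copied.Select(f => f()))); // prints "0, 25, 50"
```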
The main problem with your code is that the List<T> is not thread-safe. When multiple threads are Adding to a list concurrently without synchronization, its behavior becomes undefined (throws exceptions, becomes corrupted etc). There are many ways to solve this problem:
Synchronize the access to the list with the lock statement: lock (output) output.AddRange(forms);.
Use a concurrent collection instead of the List<T>, for example a ConcurrentQueue<T>.
Avoid collecting manually the output altogether. Instead of storing your tasks in a List<Task>, you can store them in a List<Task<HubspotModel[]>>, meaning that each task will be a generic Task<TResult>, with the TResult being an array of HubspotModel instances. Finally you will get all the output at once, when you await the Task.WhenAll.
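The first two options can be sketched as follows. This is a minimal illustration, assuming 20 hypothetical pages of 100 integers each in place of real HubspotModel pages:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var locked = new List<int>();
var queue = new ConcurrentQueue<int>();

// 20 hypothetical "pages" of 100 items each, produced in parallel.
Task[] tasks = Enumerable.Range(0, 20).Select(page => Task.Run(() =>
{
    int[] items = Enumerable.Range(page * 100, 100).ToArray();

    // Option 1: serialize access to the shared List<T>.
    lock (locked) locked.AddRange(items);

    // Option 2: a concurrent collection needs no explicit lock.
    foreach (int item in items) queue.Enqueue(item);
})).ToArray();

await Task.WhenAll(tasks);

Console.WriteLine(locked.Count); // prints "2000"
Console.WriteLine(queue.Count);  // prints "2000"
```

Neither option preserves page order, since the tasks finish in an unpredictable sequence.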
Below is an implementation of the third idea. Notice that I have avoided creating an HttpClient instance, because the recommendation is to instantiate this class only once and reuse it throughout the life of the application.
private static async Task<HubspotModel[]> GetFormsAsync(HttpClient client,
string hubspotPath, int pageSize)
{
string url = $"https://api.hubapi.com{hubspotPath}?hapikey=" +
$"{HubspotConfiguration.ApiKey}&limit={1}&offset={0}";
HttpResponseMessage response = await client.GetAsync(url)
.ConfigureAwait(false);
response.EnsureSuccessStatusCode();
int totalCount = Int32.Parse(response.Headers
.GetValues("x-total-count").First());
List<int> offsets = new();
for (int offset = 0; offset < totalCount; offset += pageSize)
offsets.Add(offset);
Task<HubspotModel[]>[] tasks = offsets.Select(offset => Task.Run(async () =>
{
HubspotModel[] forms = await GetFormsAsync(client,
hubspotPath, pageSize, offset).ConfigureAwait(false);
return forms;
})).ToArray();
HubspotModel[][] results = await Task.WhenAll(tasks).ConfigureAwait(false);
return results.SelectMany(x => x).ToArray();
}
private async static Task<HubspotModel[]> GetFormsAsync(HttpClient client,
string hubspotPath, int pageSize, int offset)
{
string url = $"https://api.hubapi.com{hubspotPath}?hapikey=" +
$"{HubspotConfiguration.ApiKey}&limit={pageSize}&offset={offset}";
HttpResponseMessage response = await client.GetAsync(url)
.ConfigureAwait(false);
response.EnsureSuccessStatusCode();
string content = await response.Content.ReadAsStringAsync()
.ConfigureAwait(false);
IEnumerable<Dictionary<string, object>> jsonResponse = JsonSerializer
.Deserialize<IEnumerable<Dictionary<string, object>>>(content,
new JsonSerializerOptions() { });
Guid guid = Guid.Parse(jsonResponse.First()["guid"].ToString());
HubspotModel[] forms = jsonResponse.Select(x => new HubspotModel()
{
Id = Guid.Parse(x["guid"].ToString()),
FormName = x["name"].ToString(),
Form = x
}).ToArray();
return forms;
}
One more improvement that you could consider is switching from Task.WhenAll to the new (.NET 6) API Parallel.ForEachAsync. The advantage is that you'll get control over the degree of parallelism, so you'll be able to reduce the parallelization in case the remote server can't keep up with the pressure. Unfortunately, the Parallel.ForEachAsync method does not return the results like Task.WhenAll does, so you'll be back to your original problem. You can find a solution about this here: ForEachAsync with Result.
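One possible way to collect results from Parallel.ForEachAsync without sharing a list is to preallocate an array and let each iteration write to its own slot. The sketch below assumes .NET 6 or later; FetchPageAsync is a hypothetical stand-in for the per-page HTTP call, and the degree of parallelism of 4 is arbitrary:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the per-page HTTP call.
static async Task<int[]> FetchPageAsync(int offset, int pageSize, CancellationToken ct)
{
    await Task.Delay(10, ct);
    return Enumerable.Range(offset, pageSize).ToArray();
}

int pageSize = 100;
int[] offsets = Enumerable.Range(0, 20).Select(i => i * pageSize).ToArray();
int[][] results = new int[offsets.Length][];

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
await Parallel.ForEachAsync(Enumerable.Range(0, offsets.Length), options, async (index, ct) =>
{
    // Each iteration writes to its own slot, so no lock is needed.
    results[index] = await FetchPageAsync(offsets[index], pageSize, ct);
});

// Flatten the pages; the array index keeps them in page order.
int[] all = results.SelectMany(x => x).ToArray();
Console.WriteLine(all.Length); // prints "2000"
```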

Calling HttpClient and getting identical results from paged requests - is it me or the service?

I am sending five HttpClient requests to the same URL, but with a varying page number parameter. They all fire async, and then I wait for them all to finish using Task.WaitAll(). My requests are using System.Net.Http.HttpClient.
This mostly works fine, and I get five distinct results representing each page of the data about 99% of the time.
But every so often, and I have not dug into deep analysis yet, I get the exact same response for each task. Each task does indeed instantiate its own HttpClient. When I was reusing one client instance, I got this problem. But since I started instantiating new clients for every call, the problem went away.
I am calling a 3rd party web service over which I have no control. So before nagging their team too much about this, I want to know if I may be doing something wrong here, or if there is some aspect of HttpClient or Task that I'm missing.
Here is the calling code:
for (int i = 1; i <= 5; i++)
{
page = load_made + i;
var t_page = page;
var t_url = url;
var task = new Task<List<T>>(() => DoPagedLoad<T>(t_page, per_page, t_url));
task.Run();
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
Here is the code in the DoPagedLoad, which returns a Task:
var client = new HttpClient();
var response = client.GetAsync(url).Result;
var results = response.Content.ReadAsStringAsync().Result;
I would appreciate any help from folks familiar with the possible quirks of Task and HttpClient
NOTE: Run is an extension method to help with async exceptions.
public static Task Run(this Task task)
{
task.Start();
task.ContinueWith(t =>
{
if(t.Exception != null)
Log.Error(t.Exception.Flatten().ToString());
});
return task;
}
It's hard to give a definitive answer because we don't have all the detail but here's a sample implementation of how you should fire off HTTP requests. Notice that all async operations are awaited - Result and Wait / WaitAll are not used. You should almost never need / use any of those - they block synchronously and can create problems.
Also notice that there are no global cookie containers, default headers, etc. defined for the HTTP client. If you need any of that stuff, just create individual HttpRequestMessage objects and add whatever headers you need to add. Don't use the global properties - it's a lot cleaner to just set per-request properties.
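As a rough sketch of per-request headers, assuming a hypothetical URL and a made-up X-Request-Id header (neither comes from the question), a request message can be built like this and then passed to a shared client's SendAsync:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical URL and header values, for illustration only.
using var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/items?page=2");
request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
request.Headers.Add("X-Request-Id", "page-2");

// The request would then be sent with the shared client, for example:
// HttpResponseMessage response = await _httpClient.SendAsync(request);

Console.WriteLine(request.Headers.GetValues("X-Request-Id").First()); // prints "page-2"
```

Because the headers live on the message, concurrent requests through one HttpClient cannot overwrite each other's settings.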
// Globally defined HTTP client.
private static readonly HttpClient _httpClient = new HttpClient();
// Other stuff here...
private async Task SomeFunctionToGetContent()
{
var requestTasks = new List<Task<HttpResponseMessage>>();
var responseTasks = new List<Task>();
for (var i = 0; i < 5; i++)
{
// Fake URI but still based on the counter (or other
// variable, similar to page in the question)
var uri = new Uri($"https://.../{i}.html");
requestTasks.Add(_httpClient.GetAsync(uri));
}
await (Task.WhenAll(requestTasks));
for (var i = 0; i < 5; i++)
{
var response = await (requestTasks[i]);
responseTasks.Add(HandleResponse(response));
}
await (Task.WhenAll(responseTasks));
}
private async Task HandleResponse(HttpResponseMessage response)
{
try
{
if (response.Content != null)
{
var content = await (response.Content.ReadAsStringAsync());
// do something with content here; check IsSuccessStatusCode to
// see if the request failed or succeeded
}
else
{
// Do something when no content
}
}
finally
{
response.Dispose();
}
}

Wait for finish all async methods inside an Async lambda expression

I have a method like below
internal static ProgressDialogResult Execute(Window owner, string label, Action operation, ProgressDialogSettings settings)
{
return ExecuteInternal(owner, label, (object)operation, settings);
}
Which is used to display a progress bar based on task completions.
I am calling the above method like below
int count = soilData.Count;
var result = ProgressDialog.Execute(this, "Loading data", async () => {
for (int i = 1; i <= count; i = i + 1000)
{
await soilDataMigration.MigrateSoilData(soilData.GetRange(i, i + 1000 >= count ? count - i : 1000));
}
}, ProgressDialogSettings.WithSubLabel);
if (result.OperationFailed)
MessageBox.Show("Soil data upload failed.");
else
MessageBox.Show("Soil data successfully executed.");
soilData.Clear();
But when I execute this, control leaves the lambda immediately and starts executing the if statement.
My expected behavior is that control leaves the lambda only when all the async operations inside the loop have completed.
I have checked this question in stackoverflow
Is there a way to run async lambda synchronously?
But unfortunately that package is incompatible with .net version 4.5.2
Can someone help me to resolve this issue ?
Realistically you have two options:
1) refactor all of the calls to be async..await compatible, e.g.
internal static async System.Threading.Task<ProgressDialogResult> Execute(Window owner, string label, Func<System.Threading.Task> operation, ProgressDialogSettings settings)
{
return await ExecuteInternal(owner, label, operation, settings);
}
internal static async System.Threading.Task<ProgressDialogResult> ExecuteInternal(Window owner, string label, Func<System.Threading.Task> operation, ProgressDialogSettings settings)
{
// do whatever
await operation();
return //whatever;
}
Then consume it like so:
int count = soilData.Count;
var result = await ProgressDialog.Execute(this, "Loading data", async () => {
for (int i = 1; i <= count; i = i + 1000)
{
await soilDataMigration.MigrateSoilData(soilData.GetRange(i, i + 1000 >= count ? count - i : 1000));
}
}, ProgressDialogSettings.WithSubLabel);
if (result.OperationFailed)
MessageBox.Show("Soil data upload failed.");
else
MessageBox.Show("Soil data successfully executed.");
soilData.Clear();
or
2) change the awaited call inside the lambda you're passing so that it blocks synchronously:
soilDataMigration.MigrateSoilData(soilData.GetRange(i, i + 1000 >= count ? count - i : 1000)).GetAwaiter().GetResult();
In general, option 1) is better as it clearly denotes intent and harmonizes the code you have. Option 2) is more a workaround.
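The reason option 1) needs Func<Task> is that an async lambda converted to Action gives the caller nothing to wait on, while Func<Task> hands back the Task. A minimal sketch, assuming a TaskCompletionSource named gate that stands in for the real migration work:

```csharp
using System;
using System.Threading.Tasks;

bool done = false;
var gate = new TaskCompletionSource<bool>();

// Same lambda shape as in the question; it finishes only after the gate opens.
Func<Task> operation = async () =>
{
    await gate.Task;
    done = true;
};

// Calling it returns as soon as the first incomplete await is reached.
Task pending = operation();
bool finishedEarly = done;       // false: the body has not completed yet

gate.SetResult(true);
await pending;                   // with a Task in hand, the caller can wait
bool finishedAfterAwait = done;  // true

Console.WriteLine($"{finishedEarly} {finishedAfterAwait}"); // prints "False True"
```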

C# Async Tasks wait vs await

Something has been bugging me for a while, something that happens in my code that I can't understand.
I've defined a workflow to extract information from the Facebook API. Basically, I have 3 different async tasks.
Execute Reports
private static async Task<List<DataTypes.ReportRun>> ExecuteMarketingReports(FacebookClient fb, List<DataTypes.Request.Batch> batch_requests)
{
//Do stuff - Make async POST requests to
//FB to execute reports on FB server side
}
Monitorize Reports
private static async Task<List<DataTypes.AdReportRun>> MonitorizeMarketingReports(FacebookClient fb, List<DataTypes.ReportRun> tListReportRun)
{
//Do stuff -- Check if the reports are ready or not, and return a list with status
}
GetReportData
private static async Task GetReportData(FacebookClient fb, List<DataTypes.AdReportRun> tListReportRun, DateTime since, DateTime until, string breakdown)
{
//Do stuff - Gets the report data once the reports are finished and creates the files
}
This is the main Task where all the other methods are called
private static async Task PullInsightsData(FacebookClient fb, List<DataTypes.Request.Batch> batchRequests, DateTime since, DateTime until, string breakdown)
{
var tResult = new List<DataTypes.AdReportRun>();
int retry = 0;
List<DataTypes.AdReportRun> tReportCompleteList = new List<DataTypes.AdReportRun>();
List<DataTypes.AdReportRun> tReportIncompleteList = new List<DataTypes.AdReportRun>();
var report_ids = await ExecuteMarketingReports(fb, batchRequests);
Thread.Sleep(20000); // Waits 20 seconds before get the info.
do
{
/*Start monitorizing the reports*/
var tReport_info = await MonitorizeMarketingReports(fb, report_ids);
/*Get the reports that are complete*/
tReportCompleteList = tReport_info.Where(x => x.async_percent_completion == 100).ToList();
if (tReportCompleteList.Count > 0)
await GetReportData(fb, tReportCompleteList, since, until, breakdown);
tReportIncompleteList = tReport_info.Where(x => x.async_percent_completion < 100).ToList();
report_ids = (from x in tReportIncompleteList
select new DataTypes.ReportRun { report_run_id = x.id }).ToList();
var sleepTime = TimeSpan.FromSeconds(Math.Pow(2, retry + 1));
Thread.Sleep(sleepTime);
retry++;
} while (report_ids.Count > 0 && retry < 8);
}
I call my main task in this for loop, and this is where the problem occurs.
for (int i = 0; i < ActiveAdAccounts.Count; i = i + 50)
{
var AdAccountsSubList = ActiveAdAccounts.Skip(i).Take(50).ToList();
var batchRequests = ....
await PullInsightsData(fb, batchRequests, (DateTime)since, (DateTime)until, breakdown.Replace(",", "_"));
//tTaskList.Add(PullInsightsData(fb, batchRequests, (DateTime)since, (DateTime)until, breakdown.Replace(",", "_")));
}
//Task.WaitAll(tTaskList);
I don't understand why the for loop doesn't continue when I use await; the console application closes suddenly. Shouldn't await "wait" until the task is finished and then proceed to the next line of code?
I've solved the problem by putting all the tasks into a list and waiting for all of them, but I would like an explanation for this.
[EDITED] Question was edited to create a minimal reproducible example.

Multiple Response.writeAsync Calls

I have been researching Asp.Net Security and I found some surprising code:
Strange Code?
context.Response.ContentType = "text/html";
await context.Response.WriteAsync("<html><body>");
await context.Response.WriteAsync("An remote error has occured: " + context.Request.Query["ErrorMessage"] + "<br>");
await context.Response.WriteAsync("Home");
await context.Response.WriteAsync("</body></html>");
What surprised me is the multiple calls to WriteAsync with short strings.
What I would have done
I would have used a template with String.Format or a StringBuilder to concatenate the strings and then write that in a single call:
var template = @"
<html><body>
An remote error has occured:{0}<br>
Home
</body></html>
";
var html = string.Format(template, context.Request.Query["ErrorMessage"]);
await context.Response.WriteAsync(html);
The differences I observe
My code is much easier to modify.
I've got some extra white-space.
My code uses a larger hard-coded string instead of a bunch of small hard-coded strings.
I use String.Format which may have a performance hit compared to concatenation.
If string concatenation should be avoided, this part should be broken up:
"An remote error has occured: " + context.Request.Query["ErrorMessage"] + "<br>"
Questions
For the purposes of discussion, let's assume that this is in the context of a web server with an average of ~10,000 simultaneous active users, so performance is important.
Why is this done like this?
How does it affect performance?
When should await Response.WriteAsync be called instead of Response.Write?
How often should Response.WriteAsync be called?
As often as possible with tiny amounts of data
Only when a large amount of text is ready
I created an Azure website (running on Basic 1 - 1 Small Instance) to benchmark this. Then I used the free service at https://loader.io to run each test at 100 users/second over 1 minute.
I ran each test 3 times in different orders. The times for each test run were within 200ms of each other.
Results:
The results were clear: StringBuilder won significantly. The cost of each async call far outweighs the cost of any form of string concatenation (even String.Format performed better than the multiple async calls).
1992ms - StringBuilder.Append
3071ms - StringBuilder.AppendFormat
4257ms - WriteAsync with String.Format
9265ms - WriteAsync
Here is the code for each test:
// Do not write this code - It is ugly and performs terribly
private async Task TestWriteAsync(HttpContext context)
{
var r = context.Response;
var id = "id";
var size = "12";
var text = "text";
await r.WriteAsync("<div style='display:none'>");
for (int i = 0; i < 10000; i++)
{
await r.WriteAsync("<li id='");
await r.WriteAsync(id);
await r.WriteAsync("' style='font-size:");
await r.WriteAsync(size);
await r.WriteAsync("'>");
await r.WriteAsync(text);
await r.WriteAsync("</li>");
}
await r.WriteAsync("</div>");
}
// This is much better, but still not great
private async Task TestWriteAsyncFormat(HttpContext context)
{
var r = context.Response;
var id = "id";
var size = "12";
var text = "text";
var template = "<li id='{0}' style='font-size:{1}'>{2}</li>";
await r.WriteAsync("<div style='display:none'>");
for (int i = 0; i < 10000; i++)
{
await r.WriteAsync(string.Format(template, id, size, text));
}
await r.WriteAsync("</div>");
}
// The Performance Winner, but ugly code
private async Task TestStringBuilder(HttpContext context)
{
var sb = new StringBuilder();
var id = "id";
var size = "12";
var text = "text";
sb.Append("<div style='display:none'>");
for (int i = 0; i < 10000; i++)
{
sb.Append("<li id='");
sb.Append(id);
sb.Append("' style='font-size:");
sb.Append(size);
sb.Append("'>");
sb.Append(text);
sb.Append("</li>");
}
sb.Append("</div>");
await context.Response.WriteAsync(sb.ToString());
}
// Decent performance and Clean Code
private async Task TestStringBuilderFormat(HttpContext context)
{
var sb = new StringBuilder();
var id = "id";
var size = "12";
var text = "text";
var template = "<li id='{0}' style='font-size:{1}'>{2}</li>";
sb.Append("<div style='display:none'>");
for (int i = 0; i < 10000; i++)
{
sb.AppendFormat(template, id, size, text);
}
sb.Append("</div>");
await context.Response.WriteAsync(sb.ToString());
}
So although the old "Response.Write" is faster than StringBuilder with synchronous requests, "await Response.WriteAsync" is much slower (because of the async overhead).
I found a link that might answer some of your questions about Response.Write:
http://www.dotnetperls.com/response-write
It seems that writing a lot of short strings is faster.
I hope Response.WriteAsync works the same way.
