Maximizing usage of Parallel.For or Parallel.ForEach loops - C#

I have a structure of nested Parallel.For and PLINQ statements in my small console app that performs a network-bound operation (HTTP requests). A list of users is filled from the DB, and then I do the following:
Parallel.For(0, users.Count(), index =>
{
    // here I try to perform HTTP requests for multiple users
});
Inside this for loop I run a PLINQ statement that fetches each user's info via HTTP requests, so in total I end up with two nested loops:
Parallel.For(0, users.Count(), index =>
{
    // Some stuff is done before the PLINQ statement is called...
    newFilteredList.AsParallel().WithDegreeOfParallelism(60).ForAll(qqmethod =>
    {
        var xdocic = new XmlDocument();
        xdocic.LoadXml(SendXMLRequestToEbay(null, null, qqmethod.ItemID, true, TotalDaysSinceLastUpdate.ToString(), null));
        int TotalPages = 0;
        if (xdocic.GetElementsByTagName("TotalNumberOfPages").Item(0) != null)
        {
            TotalPages = Convert.ToInt32(xdocic.GetElementsByTagName("TotalNumberOfPages").Item(0).InnerText);
        }
        if (TotalPages > 1)
        {
            for (int i = 1; i < TotalPages + 1; i++)
            {
                Products.Add(SendXMLRequestToEbay(null, null, qqmethod.ItemID, false, TotalDaysSinceLastUpdate.ToString(), i.ToString()));
            }
        }
        else
        {
            Products.Add(SendXMLRequestToEbay(null, null, qqmethod.ItemID, false, TotalDaysSinceLastUpdate.ToString(), "1"));
        }
    });
});
I tried running the outer loop as a regular sequential for loop instead, and it performed noticeably faster and better than this nested version.
What worries me most here is CPU utilization: when I run the console app like this, it always hovers around 0.5-3% of total CPU power...
So the way I'm trying to perform HTTP requests is: 15 users at a time * the number of HTTP requests for those 15 users.
What am I doing wrong here?
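For comparison, since this workload is network-bound rather than CPU-bound, the same fan-out can be expressed with async/await plus an explicit concurrency cap instead of nested parallel loops. A rough sketch only: SendXMLRequestToEbayAsync is a hypothetical async wrapper around the existing SendXMLRequestToEbay, and the cap of 60 mirrors the WithDegreeOfParallelism value above.
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch: throttled async fan-out for network-bound requests.
// SendXMLRequestToEbayAsync is a hypothetical async wrapper.
static readonly SemaphoreSlim Throttle = new SemaphoreSlim(60);

static async Task<string> FetchAsync(string itemId, string page, string days)
{
    await Throttle.WaitAsync(); // at most 60 requests in flight at once
    try
    {
        return await SendXMLRequestToEbayAsync(null, null, itemId, false, days, page);
    }
    finally
    {
        Throttle.Release();
    }
}

// Usage: one task per request, awaited together; no threads sit blocked on I/O.
// var pages = await Task.WhenAll(
//     newFilteredList.Select(q => FetchAsync(q.ItemID, "1", TotalDaysSinceLastUpdate.ToString())));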

Related

Insert multiple records into AWS Keyspaces using C#

Hello, I just started with Cassandra and I'm not very familiar with it yet. Can you please point out the error here?
I am trying to insert 16,000 records using the code below:
public async Task AddSprintsStories(List<SprintStories> sprintStories)
{
    var tasks = new List<Task>();
    try
    {
        if (sprintStories.Count > 0)
        {
            foreach (var item in sprintStories)
            {
                SprintStories sprintStoryData = new SprintStories();
                sprintStoryData.Id = item.Id;
                sprintStoryData.ProjectId = item.ProjectId;
                sprintStoryData.SprintId = item.SprintId;
                tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
            }
            await Task.WhenAll(tasks);
        }
    }
    catch (Exception e)
    {
        // note: exceptions are swallowed here, so write failures go unnoticed
    }
}
but I am facing this error: Server timeout during write query at consistency LOCALQUORUM (0 peer(s) acknowledged the write over 2 required)
Can anyone please help me out here?
How does the Cassandra cluster look while this runs? Is CPU or disk I/O maxed out? Without knowing that, my guess is that those 16,000 writes are happening faster than your cluster can process them, creating write back-pressure. Eventually it just can't process any more, so the writes start failing.
For a possible solution, try limiting the number of in-flight writes. Something like this should do it:
int maxActiveRequests = 20;
int activeRequests = 0;
foreach (var item in sprintStories)
{
    ...
    tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
    activeRequests++;
    if (activeRequests >= maxActiveRequests)
    {
        await Task.WhenAll(tasks);
        tasks.Clear(); // avoid re-awaiting already-completed tasks in later batches
        activeRequests = 0;
    }
}
await Task.WhenAll(tasks);
With this code, only 20 writes will be competing for Cassandra cluster resources at any given time. Do note that I'm just using 20 as an example; adjust that number to whatever meets your requirements for performance and stability.
Ryan Svihla wrote a great blog post on this topic: Cassandra: Batch Loading Without the BATCH - The Nuanced Edition.
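An alternative that avoids the batch-boundary stalls (each group of 20 above waits for its slowest insert before starting the next group) is a SemaphoreSlim gate that keeps exactly 20 inserts in flight at all times. A sketch, reusing the mapper call from the question; the limit of 20 is again just an example:
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch: steady throttle with at most 20 concurrent inserts.
private readonly SemaphoreSlim gate = new SemaphoreSlim(20);

public async Task AddSprintsStories(List<SprintStories> sprintStories)
{
    var tasks = new List<Task>();
    foreach (var item in sprintStories)
    {
        await gate.WaitAsync(); // blocks while 20 inserts are already in flight
        tasks.Add(InsertOneAsync(item));
    }
    await Task.WhenAll(tasks);
}

private async Task InsertOneAsync(SprintStories item)
{
    try
    {
        await mapper.InsertAsync<SprintStories>(item,
            new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum));
    }
    finally
    {
        gate.Release(); // free the slot as soon as this insert finishes
    }
}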

C# - Parallel.ForEach() for Service Call

I have a console application which does the 3 steps mentioned below:
Get pending notification records from the db
Call a service to send email (the service returns the email address as its response)
After getting the service response, update the db
For step 2, I am using Parallel.ForEach() and it is working far better than foreach().
I have gone through a lot of articles and threads on Stack Overflow, which has only added to my confusion on this topic.
I have a few questions:
I am running this on a server; does it affect performance, and should I limit the number of threads? (The email count can be from 0-500 or 1000.)
I ran into one issue where, in step 2, the service returned an email address as its response but it was not available while updating the db. (The email count here was 400.)
I suspect the issue could be caused by Parallel.ForEach, and that the item was never added to notifList.
If this is the case, can I add Thread.Sleep(1000) after the Parallel.ForEach() loop ends? Would that fix the issue?
In case of any exception, should I explicitly cancel the threads?
Appreciate your time and effort on helping me with this. Thank you!
public void notificationMethod()
{
    List<notify> notifList = new List<notify>();
    //step 1
    List<orders> orderList = GetNotifs();
    try
    {
        if (orderList.Count > 0)
        {
            Parallel.ForEach(orderList, (orderItem) =>
            {
                //step 2
                SendNotifs(orderItem);
                notifList.Add(new notify()
                {
                    //building list by adding email address along with other information
                });
            });
            if (notifList.Count > 0)
            {
                int index = 0;
                int rows = 10;
                int skipRows = index * rows;
                int updatedRows = 0;
                while (skipRows < notifList.Count)
                {
                    //pagination
                    List<notify> subitem = notifList.Skip(index * rows).Take(rows).ToList<notify>();
                    updatedRows += subitem.Count;
                    //step 3
                    UpdateDatabase(subitem);
                    index++;
                    skipRows = index * rows;
                }
            }
        }
    }
    catch (ApplicationException ex)
    {
    }
}
I also had a similar scenario and wondered whether Parallel.ForEach() would help improve performance. The video below from Microsoft gave me the idea to choose Parallel.ForEach() only for CPU-intensive workloads.
Your scenario falls into the I/O-intensive category and would be handled better by async/await.
https://channel9.msdn.com/Series/Three-Essential-Tips-for-Async/Tip-2-Distinguish-CPU-Bound-work-from-IO-bound-work
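On the missing-item question specifically: List<T> is not thread-safe, so calling notifList.Add from multiple Parallel.ForEach iterations at once can corrupt the list or silently drop items, which matches the symptom described; Thread.Sleep after the loop will not fix that. A thread-safe collection does. A minimal sketch, reusing notify, orderList, and SendNotifs from the question:
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Sketch: ConcurrentBag<T> supports concurrent Add calls, unlike List<T>.
var notifBag = new ConcurrentBag<notify>();
Parallel.ForEach(orderList, orderItem =>
{
    SendNotifs(orderItem);
    notifBag.Add(new notify()
    {
        // build the entry from the service response, as before
    });
});
// Materialize once, after all iterations have completed.
List<notify> notifList = notifBag.ToList();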

Performance issue in bulk importing data in Dynamics CRM

I am importing data into Dynamics CRM using a C# console application. I am using the following code:
public static void Main(string[] args)
{
    DBConnection dbcon = new DBConnection();
    int totalRecords = dbcon.GetDataCount();
    int rowCount = totalRecords / 10;
    List<Task> tasks = new List<Task>();
    for (int i = 1, j = 1; i <= totalRecords; i = i + rowCount, j = j + 1)
    {
        // Copy the loop variables: lambdas capture variables, not values,
        // so without this every task would see the final i and j.
        int start = i, batch = j;
        Task myTask = new Task(() => TestMethod(start, rowCount * batch));
        myTask.Start();
        tasks.Add(myTask);
    }
    // Task.WaitAll() with no arguments waits for nothing; pass the tasks in.
    Task.WaitAll(tasks.ToArray());
}
public static void TestMethod(int startSeqNo, int endSeqNo)
{
    IOrganizationService service = getServiceProxcy();
    DBConnection dbcon = new DBConnection();
    DataTable dt = dbcon.GetData(startSeqNo, endSeqNo);
    // Insert Commented
    BulkCreate(service, dt);
}
public static void BulkCreate(IOrganizationService service, DataTable dt)
{
    // Create an ExecuteMultipleRequest object.
    ExecuteMultipleRequest multipleRequest = new ExecuteMultipleRequest()
    {
        // Assign settings that define execution behavior: continue on error, return responses.
        Settings = new ExecuteMultipleSettings()
        {
            ContinueOnError = false,
            ReturnResponses = true
        },
        // Create an empty organization request collection.
        Requests = new OrganizationRequestCollection()
    };
    foreach (DataRow row in dt.Rows)
    {
        Entity entity = new Entity("new_dataimporttest");
        entity["new_name"] = row["name"].ToString();
        entity["new_telephone"] = row["telephone1"].ToString();
        if (multipleRequest.Requests.Count == 1000)
        {
            // Execute all the requests in the request collection using a single web method call.
            ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
            multipleRequest.Requests.Clear();
        }
        CreateRequest createRequest = new CreateRequest { Target = entity };
        multipleRequest.Requests.Add(createRequest);
    }
    // Execute all the requests in the request collection using a single web method call.
    if (multipleRequest.Requests.Count > 0)
    {
        ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
    }
}
I am using the Task Parallel Library. It works fine, but the issue is that the following line takes most of the time when executed:
ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
I want to improve the performance of this code, as I am importing a large amount of data, nearly 1 million records. Currently it takes 1h 50min. How do I improve the code to reduce execution time?
With ExecuteMultipleRequest, data throughput can only be enhanced in a limited way, because the Dynamics CRM server processes the requests inside it sequentially, not in parallel. Your main gain is therefore fewer roundtrips to the server.
Throughput can really be boosted by working with multiple threads. Every thread communicating with CRM must get its own IOrganizationService instance. By default a CRM server accepts up to 10 simultaneous connections from a client. (This is the WCF default.)
In batch processes I tend to use a BlockingCollection<T> with a producer/consumer pattern: one thread produces the requests to be sent to the CRM server, and multiple threads consume them by taking them off the collection and sending them to CRM; see the sketch below.
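A rough sketch of that pattern, reusing getServiceProxcy and the entity fields from the question; the bounded capacity is illustrative, and the consumer count of 10 matches the WCF default mentioned above:
using System.Collections.Concurrent;
using System.Data;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Messages;

// Sketch: one producer builds requests, ten consumers send them,
// each consumer on its own IOrganizationService instance.
var queue = new BlockingCollection<OrganizationRequest>(boundedCapacity: 5000);

var producer = Task.Factory.StartNew(() =>
{
    foreach (DataRow row in dt.Rows)
    {
        var entity = new Entity("new_dataimporttest");
        entity["new_name"] = row["name"].ToString();
        entity["new_telephone"] = row["telephone1"].ToString();
        queue.Add(new CreateRequest { Target = entity });
    }
    queue.CompleteAdding(); // tell consumers no more work is coming
});

var consumers = Enumerable.Range(0, 10).Select(_ => Task.Factory.StartNew(() =>
{
    IOrganizationService service = getServiceProxcy(); // one instance per thread
    foreach (var request in queue.GetConsumingEnumerable())
    {
        service.Execute(request); // could also batch into ExecuteMultipleRequest here
    }
})).ToArray();

Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());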
You could have two ExecuteMultipleRequest batches running in the same process (500 requests each); that way you could cut the time roughly in half.
Analyze where more time is being wasted: in the execution, or in building the requests in the foreach. Post that here so I can help you in a better way.
If a plain data import is all that's intended here, try using SqlBulkCopy to write the data directly to the server (Sample Code).
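A minimal SqlBulkCopy sketch; the connection string and staging table name are placeholders, and this only applies if writing to the underlying database directly is acceptable in your deployment:
using System.Data;
using System.Data.SqlClient;

// Sketch: bulk-copy a DataTable straight into a SQL Server table.
public static void BulkImport(DataTable dt, string connectionString)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.ImportStaging"; // hypothetical staging table
        bulk.BatchSize = 5000;                           // rows sent per round-trip
        bulk.WriteToServer(dt);
    }
}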

Define Next Start Point When Number of Items Unknown

I have a web service I need to query; it takes a value that supports pagination for its data. Due to the amount of data I need to fetch, and how that service is implemented, I intend to issue a series of concurrent HTTP requests to accumulate this data.
Say I have a number of threads and a page size: how could each thread pick a starting point that doesn't overlap with the other threads? It's been a long time since I took parallel programming and I'm floundering a bit. I know I could find my start point with something like start = N/numThreads * threadNum, but I don't know N. Right now I just spin up X threads and each loops until it gets no more data. The problem is they tend to overlap, and I end up with duplicate data. I need unique data and not to waste requests.
Right now I have code that looks something like this. This is one of many attempts, and I see why this is wrong, but it's better to show something. The goal is to collect pages of data from a web service in parallel:
int limit = pageSize;
data = new List<RequestStuff>();
List<Task> tasks = new List<Task>();
for (int i = 0; i < numThreads; i++)
{
    tasks.Add(Task.Factory.StartNew(() =>
    {
        try
        {
            List<RequestStuff> someData;
            bool hasData = true;
            do
            {
                int start;
                lock (myLock)
                {
                    start = data.Count; // racy: two threads can read the same start
                }
                someData = GetDataFromService(start, limit);
                lock (myLock)
                {
                    if (someData != null && someData.Count > 0)
                    {
                        data.AddRange(someData);
                    }
                    else
                    {
                        hasData = false;
                    }
                }
            } while (hasData);
        }
        catch (AggregateException ex)
        {
            //Exception things
        }
    }));
}
Task.WaitAll(tasks.ToArray());
Any inspiration for solving this without race conditions? I need to stick to .NET 4, if that matters.
I'm not sure there's a way to do this without wasting some requests unless you know the actual limit. The code below might help eliminate the duplicate data, as you will only query each index once:
private int _index = -1; // -1 so the first request starts at 0
private volatile bool _shouldContinue = true;

public IEnumerable<RequestStuff> GetAllData()
{
    var tasks = new List<Task<RequestStuff>>();
    while (_shouldContinue)
    {
        // StartNew actually runs the task; a bare "new Task(...)" never starts.
        tasks.Add(Task<RequestStuff>.Factory.StartNew(() => GetDataFromService(GetNextIndex())));
    }
    Task.WaitAll(tasks.ToArray());
    // Indexes past the end return null; filter them out.
    return tasks.Select(t => t.Result).Where(r => r != null).ToList();
}

private RequestStuff GetDataFromService(int id)
{
    // Fetch page "id" from the service here.
    // If nothing comes back, set _shouldContinue to false and return null;
    // otherwise return the populated RequestStuff.
    return null; // placeholder
}

private int GetNextIndex()
{
    return Interlocked.Increment(ref _index);
}
It could also be improved by adding cancellation tokens to cancel any indexes you know to be wasteful, i.e., if index 4 returns nothing you can cancel all still-active queries on indexes above 4.
Or, if you can make a reasonable guess at the max index, you could implement an algorithm to pinpoint the exact limit before retrieving any data. This would probably only be more efficient if your guess is fairly accurate, though.
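A sketch of that pinpointing idea, where PageHasData is a hypothetical cheap probe (for example, a request with a page size of 1 that only checks for existence):
// Sketch: find the last non-empty page index before fetching any payloads.
private int FindLastPage()
{
    // Exponential probe: double the index until we hit an empty page.
    int hi = 1;
    while (PageHasData(hi))
    {
        hi *= 2;
    }
    // Invariant: PageHasData(hi / 2) was true, PageHasData(hi) is false.
    int lo = hi / 2;
    while (lo + 1 < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (PageHasData(mid)) lo = mid;
        else hi = mid;
    }
    return lo; // the last index that still has data
}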
Are you attempting to force parallelism on the part of the remote service by issuing multiple concurrent requests? Paging is generally used to limit the amount of data returned to only what is needed, but if you need all of the data, then paging first and reconstructing it later seems like a poor design. Your code becomes needlessly complex and difficult to maintain, you'll likely just move the bottleneck from code you control to somewhere else, and you've introduced data integrity issues (what happens if these threads access different versions of the data you are querying?). By increasing the complexity and the number of calls, you also increase the likelihood of problems occurring (e.g. one of the connections gets dropped).
Can you state the problem you are attempting to solve so perhaps instead we can help architect a better solution?

Throttle parallel requests to remote API

I'm working on an ASP.NET MVC application that uses the Google Maps Geocoding API. A single batch may have up to 1000 queries to submit to the Geocoding API, so I'm trying to use a parallel processing approach to improve performance. The method responsible for starting a worker for each core is:
public void GeoCode(Queue<Job> qJobs, bool bolKeepTrying, bool bolSpellCheck, Action<Job, bool, bool> aWorker)
{
    // Get the number of processors, initialize the number of remaining
    // threads, and set the starting point for the iteration.
    int intCoreCount = Environment.ProcessorCount;
    int intRemainingWorkItems = intCoreCount;
    using (ManualResetEvent mreController = new ManualResetEvent(false))
    {
        // Create each of the work items.
        for (int i = 0; i < intCoreCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                Job jCurrent = null;
                while (qJobs.Count > 0)
                {
                    lock (qJobs)
                    {
                        jCurrent = qJobs.Count > 0 ? qJobs.Dequeue() : null;
                    }
                    if (jCurrent == null)
                    {
                        break; // another thread drained the queue between the check and the lock
                    }
                    aWorker(jCurrent, bolKeepTrying, bolSpellCheck);
                }
                if (Interlocked.Decrement(ref intRemainingWorkItems) == 0)
                {
                    mreController.Set();
                }
            });
        }
        // Wait for all threads to complete.
        mreController.WaitOne();
    }
}
This is based on a patterns document I found on Microsoft's parallel computing web site.
The problem is that the Google API has a limit of 10 QPS (enterprise customer), which I'm hitting, and then I get HTTP 403 errors. Is there a way I can benefit from parallel processing but limit the requests I'm making? I've tried using Thread.Sleep, but it doesn't solve the problem. Any help or suggestions would be very much appreciated.
It sounds like you're missing some sort of max-in-flight parameter. Rather than just looping while there are jobs in the queue, you need to throttle your submissions based on jobs finishing.
Your algorithm should be something like the following (a sketch follows the list):
Submit N jobs (where N is your max in flight).
Wait for a job to complete; if the queue is not empty, submit the next job.
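A sketch of that throttle using a SemaphoreSlim gate; Job, qJobs, aWorker and the bol flags come from the question, and the limit of 10 mirrors the quota mentioned. Note that 10 in flight is not exactly 10 QPS, so fast responses may still need a small per-request delay:
using System.Threading;

// Sketch: cap the number of geocoding requests in flight at once.
int maxInFlight = 10;
using (var gate = new SemaphoreSlim(maxInFlight))
using (var done = new CountdownEvent(qJobs.Count))
{
    while (qJobs.Count > 0)
    {
        Job job = qJobs.Dequeue();
        gate.Wait(); // blocks until one of the 10 slots frees up
        ThreadPool.QueueUserWorkItem(_ =>
        {
            try
            {
                aWorker(job, bolKeepTrying, bolSpellCheck);
            }
            finally
            {
                gate.Release(); // free the slot for the next job
                done.Signal();  // count this job as finished
            }
        });
    }
    done.Wait(); // wait until every queued job has completed
}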
