Get data from document DB using multithreading/parallel

Get data from document DB using multithreading/parallel - c#

I have the JSON documents in the document DB (~30k documents) where each document has a unique ID something like AA123, AA124. There is a tool we use to pull those documents from the document DB where it has a restriction of 500 documents per GET request call. So this has to go through 60 times GET requests to fetch the result which takes sometime. I am looking to get this optimized to run this in quick time(run threads parallely), so that I can get the data quickly. Below is the sample code on how I am pulling the data from the DB as of now.
private int maxItemsPerCall = 500;
public override async Task<IEnumerable<docClass>> Getdocuments()
{
string accessToken = "token";
SearchResponse<docClass> docs = await db.SearchDocuments<docClass>(initialload, accessToken); //Gets top 500
List<docClass> routeRules = new List<docClass>();
routeRules.AddRange(docs.Documents);
var remainingCalls = (docs.TotalDocuments / maxItemsPerCall);
while (remainingCalls > 0 && docs.TotalDocuments > maxItemsPerSearch)
{
docs = await db.SearchDocuments<docClass>(GetFollowUp(docs.Documents.LastOrDefault().Id.Id), requestOptions);
routeRules.AddRange(docs.Documents);
remainingCalls--;
}
return routeRules;
}
private static SearchRequest initialload = new SearchRequest()
{
Filter = new SearchFilterGroup(
new[]
{
new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule")
},
GroupOperator.And),
OrderBy = Field.Id,
Top = maxItemsPerCall,
Descending = false
};
private static SearchRequest GetFollowUp(string lastId)
{
SearchRequest followUpRequest = new SearchRequest()
{
Filter = new SearchFilterGroup(
new[] {
new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule"),
new SearchFilter(Field.Id, FilterOperation.GreaterThan, lastId)
},
GroupOperator.And),
OrderBy = Field.Id,
Top = maxItemsPerCall,
};
return followUpRequest;
}
Help needed: Since I am using the each GET request(500 documents based on IDs depending on the ID of the previous run), how can I use to run this parallely (atleast 5 parallel threads at a time) fetching 500 records per thread (i.e. 2500 parallely in total for 5 threads at a time). I am not familiar with threading, so it would be helpful if someone can suggest how to do this.

Related

Idempotency for BigQuery load jobs using Google.Cloud.BigQuery.V2

You are able to create a csv load job to load data from a csv file in Google Cloud Storage by using the BigQueryClient in Google.Cloud.BigQuery.V2 which has a CreateLoadJob method.
How can you guarantee idempotency with this API to ensure that say the network dropped before getting a response and you kicked off a retry you would not end up with the same data being loaded into BigQuery multiple times?
Example API usage
private void LoadCsv(string sourceUri, string tableId, string timePartitionField)
{
var tableReference = new TableReference()
{
DatasetId = _dataSetId,
ProjectId = _projectId,
TableId = tableId
};
var options = new CreateLoadJobOptions
{
WriteDisposition = WriteDisposition.WriteAppend,
CreateDisposition = CreateDisposition.CreateNever,
SkipLeadingRows = 1,
SourceFormat = FileFormat.Csv,
TimePartitioning = new TimePartitioning
{
Type = _partitionByDayType,
Field = timePartitionField
}
};
BigQueryJob loadJob = _bigQueryClient.CreateLoadJob(sourceUri: sourceUri,
destination: tableReference,
schema: null,
options: options);
loadJob.PollUntilCompletedAsync().Wait();
if (loadJob.Status.Errors == null || !loadJob.Status.Errors.Any())
{
//Log success
return;
}
//Log error
}

You can achieve idempotency by generating your own jobid based on e.g. file location you loaded and target table.
job_id = 'my_load_job_{}'.format(hashlib.md5(sourceUri+_projectId+_datasetId+tableId).hexdigest())
var options = new CreateLoadJobOptions
{
WriteDisposition = WriteDisposition.WriteAppend,
CreateDisposition = CreateDisposition.CreateNever,
SkipLeadingRows = 1,
JobId = job_id, #add this
SourceFormat = FileFormat.Csv,
TimePartitioning = new TimePartitioning
{
Type = _partitionByDayType,
Field = timePartitionField
}
};
In this case if you try reinsert the same job_id you got error.
You can also easily generate this job_id for check in case if pooling failed.

There are two places you could end up losing the response:
When creating the job to start with
When polling for completion
The first one is relatively tricky to recover from without a job ID; you could list all the jobs in the project and try to find one that looks like the one you'd otherwise create.
However, the C# client library generates a job ID so that it can retry, or you can specify your own job ID via CreateLoadJobOptions.
The second failure time is much simpler: keep the returned BigQueryJob so you can retry the polling if that fails. (You could store the job name so that you can recover even if your process dies while waiting for it to complete, for example.)

ReplaceOneAsync() immediately after InsertOneAsync() not always working, even when journaled

On a single-instance MongoDB server, even with the write concern on the client set to journaled, one in every couple of thousand documents isn't replacable immediately after inserting.
I was under the impression that once journaled, documents are immediately available for querying.
The code below inserts a document, then updates the DateModified property of the document and tries to update the document based on the document's Id and the old value of that property.
public class MyDocument
{
public BsonObjectId Id { get; set; }
public DateTime DateModified { get; set; }
}
static void Main(string[] args)
{
var r = Task.Run(MainAsync);
Console.WriteLine("Inserting documents... Press any key to exit.");
Console.ReadKey(intercept: true);
}
private static async Task MainAsync()
{
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("updateInsertedDocuments");
var concern = new WriteConcern(journal: true);
var collection = database.GetCollection<MyDocument>("docs").WithWriteConcern(concern);
int errorCount = 0;
int totalCount = 0;
do
{
totalCount++;
// Create and insert the document
var document = new MyDocument
{
DateModified = DateTime.Now,
};
await collection.InsertOneAsync(document);
// Save and update the modified date
var oldDateModified = document.DateModified;
document.DateModified = DateTime.Now;
// Try to update the document by Id and the earlier DateModified
var result = await collection.ReplaceOneAsync(d => d.Id == document.Id && d.DateModified == oldDateModified, document);
if (result.ModifiedCount == 0)
{
Console.WriteLine($"Error {++errorCount}/{totalCount}: doc {document.Id} did not have DateModified {oldDateModified.ToString("yyyy-MM-dd HH:mm:ss.ffffff")}");
await DoesItExist(collection, document, oldDateModified);
}
}
while (true);
}
The code inserts at a rate of around 250 documents per second. One in around 1,000-15,000 calls to ReplaceOneAsync(d => d.Id == document.Id && d.DateModified == oldDateModified, ...) fails, as it returns a ModifiedCount of 0. The failure rate depends on whether we run a Debug or Release build and with debugger attached or not: more speed means more errors.
The code shown represents something that I can't really easily change. Of course I'd rather perform a series of Update.Set() calls, but that's not really an option right now. The InsertOneAsync() followed by a ReplaceOneAsync() is abstracted by some kind of repository pattern that updates entities by reference. The non-async counterparts of the methods display the same behavior.
A simple Thread.Sleep(100) between inserting and replacing mitigates the problem.
When the query fails, and we wait a while and then attempt to query the document again in the code below, it'll be found every time.
private static async Task DoesItExist(IMongoCollection<MyDocument> collection, MyDocument document, DateTime oldDateModified)
{
Thread.Sleep(500);
var fromDatabaseCursor = await collection.FindAsync(d => d.Id == document.Id && d.DateModified == oldDateModified);
var fromDatabaseDoc = await fromDatabaseCursor.FirstOrDefaultAsync();
if (fromDatabaseDoc != null)
{
Console.WriteLine("But it was found!");
}
else
{
Console.WriteLine("And wasn't found!");
}
}
Versions on which this occurs:
MongoDB Community Server 3.4.0, 3.4.1, 3.4.3, 3.4.4 and 3.4.10, all on WiredTiger storage engine
Server runs on Windows, other OSes as well
C# Mongo Driver 2.3.0 and 2.4.4
Is this an issue in MongoDB, or are we doing (or assuming) something wrong?
Or, the actual end goal, how can I ensure an insert is immediately retrievable by an update?

ReplaceOneAsync returns 0 if the new document is identical to the old one (because nothing changed).
It looks to me like if your test executes fast enough the various calls to DateTime.Now could return the same value, so it is possible that you are passing the exact same document to InsertOneAsync and ReplaceOneAsync.

Insufficient system resources exist to complete the requested service when using GeneticSharp

Long story short, I'm using GeneticSharp for an iterative/conditional reinforcement learning algorithm. This means that I'm making a bunch of different GeneticAlgorithm instances, each using a shared SmartThreadPool. Only one GA is running at a time though.
After a few iterations of my algorithm, I run into this error, which happens when attempting to start the SmartThreadPool.
Is there any obvious reason this should be happening? I've tried using a different STPE and disposing of it each time, but that didn't seem to help either. Is there some manual cleanup I need to be doing in between each GA run? Should I be using one shared GA instance?
Edit: Quick code sample
static readonly SmartThreadPoolTaskExecutor Executor = new SmartThreadPoolTaskExecutor() { MinThreads = 2, MaxThreads = 8 };
public static void Main(string[] args)
{
var achromosome = new AChromosome();
var bchromosome = new BChromosome();
while(true)
{
achromosome = FindBestAChromosome(bchromosome);
bchromosome = FindBestBChromosome(achromosome);
// Log results;
}
}
public static AChromosome FindBestAChromosome(BChromosome chromosome)
{
AChromosome result;
var selection = new EliteSelection();
var crossover = new UniformCrossover();
var mutation = new UniformMutation(true);
using (var fitness = new AChromosomeFitness(chromosome))
{
var population = new Population(50, 70, chromosome);
var ga = new GeneticAlgorithm(population, fitness, selection, crossover, mutation);
ga.Termination = new GenerationNumberTermination(100);
ga.GenerationRan += LogGeneration;
ga.TaskExecutor = Executor;
ga.Start();
LogResults();
result = ga.BestChromosome as AChromosome;
ga.GenerationRan -= LogGeneration;
}
return result;
}
public static BChromosome FindBestBChromosome(AChromosome chromosome)
{
BChromosome result;
var selection = new EliteSelection();
var crossover = new UniformCrossover();
var mutation = new UniformMutation(true);
using (var fitness = new BChromosomeFitness(chromosome))
{
var population = new Population(50, 70, chromosome);
var ga = new GeneticAlgorithm(population, fitness, selection, crossover, mutation);
ga.Termination = new GenerationNumberTermination(100);
ga.GenerationRan += LogGeneration;
ga.TaskExecutor = Executor;
ga.Start();
LogResults();
result = ga.BestChromosome as BChromosome;
ga.GenerationRan -= LogGeneration;
}
return result;
}
AChromosome and BChromosome are each fairly simple, a couple doubles and ints and maybe a function pointer (to a static function).
Edit2: Full call stack with replaced bottom two entries
Unhandled Exception: System.IO.IOException: Insufficient system resources exist to complete the requested service.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.Threading.EventWaitHandle..ctor(Boolean initialState, eventResetMode mode, string name)
at Amib.Threading.SmartThreadPool..ctor()
at GeneticSharp.Infrastructure.Threading.SmartThreadPoolTaskExecutor.Start()
at GeneticSharp.Domain.GeneticAlgorithm.EvaluateFitness()
at GeneticSharp.Domain.GeneticAlgorithm.EndCurrentGeneration()
at GeneticSharp.Domain.GeneticAlgorithm.EvolveOneGeneration()
at GeneticSharp.Domain.GeneticAlgorithm.Resume()
at GeneticSharp.Domain.GeneticAlgorithm.Start()
at MyProject.Program.FindBestAChromosome(BChromosome chromosome)
at MyProject.Program.Main(String[] args)
Edit3: One last thing to note is that my fitness functions are pretty processing-intensive and one run can take almost 2g of ram (running on a machine with 16g, so no worries there). I've seen no problems with garbage collection though.
So far, this only happens after about 5 iterations (which takes multiple hours).

It turns out it was my antivirus preventing the threads from finalizing. Now I'm running it on a machine with a different antivirus and it's running just fine. If I come up with a better answer for how to handle this in the future, I'll update here.

Dynamics CRM SDK: Execute Multiple Requests for Bulk Update of around 5000 Records

I have written a function to update Default Price List for all the Active Products on the CRM 2013 Online.
//The method takes IOrganization service and total number of records to be created as input
private void UpdateMultipleProducts(IOrganizationService service, int batchSize, EntityCollection UpdateProductsCollection, Guid PriceListGuid)
{
//To execute the request we have to add the Microsoft.Xrm.Sdk of the latest SDK as reference
ExecuteMultipleRequest req = new ExecuteMultipleRequest();
req.Requests = new OrganizationRequestCollection();
req.Settings = new ExecuteMultipleSettings();
req.Settings.ContinueOnError = true;
req.Settings.ReturnResponses = true;
try
{
foreach (var entity in UpdateProductsCollection.Entities)
{
UpdateRequest updateRequest = new UpdateRequest { Target = entity };
entity.Attributes["pricelevelid"] = new EntityReference("pricelevel", PriceListGuid);
req.Requests.Add(updateRequest);
}
var res = service.Execute(req) as ExecuteMultipleResponse; //Execute the collection of requests
}
//If the BatchSize exceeds 1000 fault will be thrown.In the catch block divide the records into batchable records and create
catch (FaultException<OrganizationServiceFault> fault)
{
if (fault.Detail.ErrorDetails.Contains("MaxBatchSize"))
{
var allowedBatchSize = Convert.ToInt32(fault.Detail.ErrorDetails["MaxBatchSize"]);
int remainingCreates = batchSize;
while (remainingCreates > 0)
{
var recordsToCreate = Math.Min(remainingCreates, allowedBatchSize);
UpdateMultipleProducts(service, recordsToCreate, UpdateProductsCollection, PriceListGuid);
remainingCreates -= recordsToCreate;
}
}
}
}
Code Description : There are around 5000 active product records in the System. So I am updating Default Price List for all of them using above code.
But, I am missing here something so that, it has updated only 438 records. It loops through the While statement correctly, but it is not updating all of them here.
What should be the Batchsize when we run this function for the First Time?
Any one can help me here?
Thank you,
Mittal.

You pass remainingCreates as the batchSize parameter but your code never references batchSize so you are just going to reenter that while loop every time.
Also, I'm not sure how you are doing all your error handling but you need to update your catch block so that it doesn't just let FaultExceptions pass-through if they don't contain a MaxBatchSize value. Right now, if you take a FaultException regarding something other than batch size it will be ignored.
{
if (fault.Detail.ErrorDetails.Contains("MaxBatchSize"))
{
var allowedBatchSize = Convert.ToInt32(fault.Detail.ErrorDetails["MaxBatchSize"]);
int remainingCreates = batchSize;
while (remainingCreates > 0)
{
var recordsToCreate = Math.Min(remainingCreates, allowedBatchSize);
UpdateMultipleProducts(service, recordsToCreate, UpdateProductsCollection, PriceListGuid);
remainingCreates -= recordsToCreate;
}
}
else throw;
}

Instead of reactive handling, i prefer proactive handling of the MaxBatchSize, this is true when you already know what is MaxMatchSize is.
Following is sample code, here while adding OrgRequest to collection i keep count of batch and when it exceeds I call Execute and reset the collection to take fresh batch.
foreach (DataRow dr in statusTable.Rows)
{
Entity updEntity = new Entity("ABZ_NBA");
updEntity["ABZ_NBAid"] = query.ToList().Where(a => a.NotificationNumber == dr["QNMUM"].ToString()).FirstOrDefault().TroubleTicketId;
//updEntity["ABZ_makerfccall"] = false;
updEntity["ABZ_rfccall"] = null;
updEntity[cNBAttribute.Key] = dr["test"];
req.Requests.Add(new UpdateRequest() { Target = updEntity });
if (req.Requests.Count == 1000)
{
responseWithResults = (ExecuteMultipleResponse)_orgSvc.Execute(req);
req.Requests = new OrganizationRequestCollection();
}
}
if (req.Requests.Count > 0)
{
responseWithResults = (ExecuteMultipleResponse)_orgSvc.Execute(req);
}

System.DirectoryServices.Protocol search question

I'm trying to re write a search from System.DirectoryServices to System.DirectoryServices.Protocol
In S.DS I get all the requested attributes back, but in S.DS.P, I don't get the GUID, or the HomePhone...
The rest of it works for one user.
Any Ideas?
public static List<AllAdStudentsCV> GetUsersDistinguishedName( string domain, string distinguishedName )
{
try
{
NetworkCredential credentials = new NetworkCredential( ConfigurationManager.AppSettings[ "AD_User" ], ConfigurationManager.AppSettings[ "AD_Pass" ] );
LdapDirectoryIdentifier directoryIdentifier = new LdapDirectoryIdentifier( domain+":389" );
using ( LdapConnection connection = new LdapConnection( directoryIdentifier, credentials ) )
{
SearchRequest searchRequest = new SearchRequest( );
searchRequest.DistinguishedName = distinguishedName;
searchRequest.Filter = "(&(objectCategory=person)(objectClass=user)(sn=Afcan))";//"(&(objectClass=user))";
searchRequest.Scope = SearchScope.Subtree;
searchRequest.Attributes.Add("name");
searchRequest.Attributes.Add("sAMAccountName");
searchRequest.Attributes.Add("uid");
searchRequest.Attributes.Add("telexNumber"); // studId
searchRequest.Attributes.Add("HomePhone"); //ctrId
searchRequest.SizeLimit = Int32.MaxValue;
searchRequest.TimeLimit = new TimeSpan(0, 0, 45, 0);// 45 min - EWB
SearchResponse searchResponse = connection.SendRequest(searchRequest) as SearchResponse;
if (searchResponse == null) return null;
List<AllAdStudentsCV> users = new List<AllAdStudentsCV>();
foreach (SearchResultEntry entry in searchResponse.Entries)
{
AllAdStudentsCV user = new AllAdStudentsCV();
user.Active = "Y";
user.CenterName = "";
user.StudId = GetstringAttributeValue(entry.Attributes, "telexNumber");
user.CtrId = GetstringAttributeValue(entry.Attributes, "HomePhone");
user.Guid = GetstringAttributeValue(entry.Attributes, "uid");
user.Username = GetstringAttributeValue(entry.Attributes, "sAMAccountName");
users.Add(user);
}
return users;
}
}
catch (Exception ex)
{
throw;
}
}
Also, if I want to fetch EVERY user in AD, so I can synch data with my SQL DB, how do I do that, I Kept getting max size exceeded, errors. I set the size to maxInt32... is there an "ignore size" option?
Thanks,
Eric-

I think that the standard way is to use System.DirectoryServices, not System.DirectoryServices.Protocol. Why do you want to user the later ?
Concerning your second question about the error message "max sized exceeded", it may be because you try to fetch too many entries at once.
Active Directory limits the number of objects returned by query, in order to not overload the directory (the limit is something like 1000 objects). The standard way to fetch all the users is using paging searchs.
The algorithm is like this:
You construct the query that will fetch all the users
You specify a specific control (Paged Result Control) in this query indicating that this is
a paged search, with 500 users per page
You launch the query, fetch the first page and parse the first 500 entries in
that page
You ask AD for the next page, parse the next 500 entries
Repeat until there are no pages left

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Get data from document DB using multithreading/parallel - c#

Related

Idempotency for BigQuery load jobs using Google.Cloud.BigQuery.V2

ReplaceOneAsync() immediately after InsertOneAsync() not always working, even when journaled

Insufficient system resources exist to complete the requested service when using GeneticSharp

Dynamics CRM SDK: Execute Multiple Requests for Bulk Update of around 5000 Records

System.DirectoryServices.Protocol search question

Categories

Resources