Performance issue when bulk importing data into Dynamics CRM - C#

I am importing data into Dynamics CRM using a C# console application. I am using the following code:
public static void Main(string[] args)
{
    DBConnection dbcon = new DBConnection();
    int totalRecords = dbcon.GetDataCount();
    int rowCount = totalRecords / 10;
    var tasks = new List<Task>();
    for (int i = 1, j = 1; i <= totalRecords; i = i + rowCount, j = j + 1)
    {
        // Copy the loop variables so each task captures its own values.
        int start = i;
        int end = rowCount * j;
        Task myTask = new Task(() => TestMethod(start, end));
        myTask.Start();
        tasks.Add(myTask);
    }
    // Task.WaitAll needs the tasks passed in; with no arguments it waits for nothing.
    Task.WaitAll(tasks.ToArray());
}
public static void TestMethod(int startSeqNo, int endSeqNo)
{
    IOrganizationService service = getServiceProxcy();
    DBConnection dbcon = new DBConnection();
    DataTable dt = dbcon.GetData(startSeqNo, endSeqNo);
    // Insert Commented
    BulkCreate(service, dt);
}
public static void BulkCreate(IOrganizationService service, DataTable dt)
{
    // Create an ExecuteMultipleRequest object.
    ExecuteMultipleRequest multipleRequest = new ExecuteMultipleRequest()
    {
        // Assign settings that define execution behavior: continue on error, return responses.
        Settings = new ExecuteMultipleSettings()
        {
            ContinueOnError = false,
            ReturnResponses = true
        },
        // Create an empty organization request collection.
        Requests = new OrganizationRequestCollection()
    };
    foreach (DataRow row in dt.Rows)
    {
        Entity entity = new Entity("new_dataimporttest");
        entity["new_name"] = row["name"].ToString();
        entity["new_telephone"] = row["telephone1"].ToString();
        if (multipleRequest.Requests.Count == 1000)
        {
            // Execute all the requests in the request collection using a single web method call.
            ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
            multipleRequest.Requests.Clear();
        }
        CreateRequest createRequest = new CreateRequest { Target = entity };
        multipleRequest.Requests.Add(createRequest);
    }
    // Execute any remaining requests in the collection using a single web method call.
    if (multipleRequest.Requests.Count > 0)
    {
        ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
    }
}
I am using the Task Parallel Library. It works fine, but the issue is that the following line takes a long time to execute:
ExecuteMultipleResponse multipleResponse = (ExecuteMultipleResponse)service.Execute(multipleRequest);
I want to improve the performance of this code, as I am importing a large amount of data (nearly 1 million records). Currently it takes 1 hour 50 minutes. How do I improve the code to reduce the execution time?

With ExecuteMultipleRequest, data throughput can only be improved to a limited extent. This is because the Dynamics CRM server processes the requests inside it sequentially, not in parallel. Therefore your main gain is fewer roundtrips to the server.
Throughput can really be boosted by working with multiple threads. Every thread communicating with CRM must get its own IOrganizationService instance. By default a CRM server accepts up to 10 simultaneous connections from a client. (This is the WCF default.)
In batch processes I tend to use a BlockingCollection<T> with a producer/consumer pattern: one thread produces the requests to be sent to the CRM server and multiple threads consume them by taking them off the collection and sending them to CRM.
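A minimal sketch of that producer/consumer pattern, assuming the getServiceProxcy() factory from the question and a hypothetical BuildBatches() helper that yields one ExecuteMultipleRequest per batch of rows (each batch kept at or below the 1000-request limit):
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Messages;

static void ImportInParallel()
{
    // Bounded collection so the producer cannot run far ahead of the consumers.
    var batches = new BlockingCollection<ExecuteMultipleRequest>(boundedCapacity: 20);

    // Producer: reads the source data and builds batched requests.
    var producer = Task.Run(() =>
    {
        foreach (ExecuteMultipleRequest batch in BuildBatches()) // hypothetical helper
            batches.Add(batch);
        batches.CompleteAdding();
    });

    // Consumers: each one gets its own IOrganizationService instance.
    int consumerCount = 8; // stay below the default limit of 10 connections per client
    Task[] consumers = Enumerable.Range(0, consumerCount).Select(_ => Task.Run(() =>
    {
        IOrganizationService service = getServiceProxcy();
        foreach (ExecuteMultipleRequest batch in batches.GetConsumingEnumerable())
            service.Execute(batch);
    })).ToArray();

    Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());
}
The bounded capacity is a design choice: it keeps memory use flat, because the producer blocks once 20 batches are queued and waiting.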

You could have two ExecuteMultipleRequest batches running in the same process (500 requests each); that way you can cut the time roughly in half.
Analyze which part is wasting more time: the execution or the foreach that creates the requests. And write it here so I can help you in a better way.

If a plain data import is all that is intended here, try using SqlBulkCopy to write the data directly to the server (Sample Code).
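For reference, a minimal SqlBulkCopy sketch; the connection string and destination table are placeholders, and this writes to a SQL Server table directly rather than through the CRM web service:
using System.Data;
using System.Data.SqlClient;

static void BulkCopyToSql(DataTable dt)
{
    // Placeholder connection string and table name.
    using (var connection = new SqlConnection("<target connection string>"))
    {
        connection.Open();
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.TargetTable";
            bulkCopy.BatchSize = 5000;
            // Map source DataTable columns to destination columns.
            bulkCopy.ColumnMappings.Add("name", "new_name");
            bulkCopy.ColumnMappings.Add("telephone1", "new_telephone");
            bulkCopy.WriteToServer(dt);
        }
    }
}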

Related

BigQuery V2 (3.0.0) Memory Usage Always Goes Up Even With Paging

This is utilizing the .Net Google.Cloud.BigQuery.V2 (3.0.0) Package.
It seems that no matter what I do when querying against my data, every "page" query increases the memory usage. I would assume the client would dispose of any unused results/rows per execution of a query, but that doesn't seem to be the case.
Essentially, I set up the client per the documentation, execute the query and use the results returned:
var results = await bigQueryClient.ExecuteQueryAsync(query, parameters: null).ConfigureAwait(false);
Within the query I have a limit and offset to perform the pagination and iterate through that on my own.
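A rough sketch of that limit/offset approach (the dataset, table and page size here are illustrative, not taken from the original post):
// Illustrative limit/offset paging using the same ExecuteQueryAsync call as above.
int batchSize = 10000;
int offset = 0;
while (true)
{
    string query = $"SELECT * FROM `dataset_id.table_id` LIMIT {batchSize} OFFSET {offset}";
    var pageResults = await bigQueryClient.ExecuteQueryAsync(query, parameters: null).ConfigureAwait(false);
    int rowCount = 0;
    foreach (var row in pageResults)
    {
        rowCount++;
        // Perform whatever logic for each row.
    }
    if (rowCount < batchSize) break; // last page reached
    offset += batchSize;
}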
I have tried their version of paging directly through the TableReference and can pull the pages just like above but the same memory issue occurs:
// Get the table.
var bigQueryTable = await _bigQueryClient.GetTableAsync("dataset_id", "table_id", new GetTableOptions() { SelectedFields = "date,visitStartTime,visitId,fullVisitorId,channelGrouping" }).ConfigureAwait(false);
// Get the first page.
var bigQueryPage = bigQueryTable.ListRowsAsync(new ListRowsOptions() { PageSize = batchSize });
// Get the first rows.
var bigQueryRows = await bigQueryPage.ReadPageAsync(batchSize).ConfigureAwait(false);
while (bigQueryRows.NextPageToken != null)
{
    foreach (var bigQueryRow in bigQueryRows)
    {
        // Perform whatever logic for each row.
    }
    // Set the next page.
    bigQueryPage = bigQueryTable.ListRowsAsync(new ListRowsOptions { PageSize = batchSize, PageToken = bigQueryRows.NextPageToken });
    bigQueryRows = await bigQueryPage.ReadPageAsync(batchSize).ConfigureAwait(false);
}
Any advice would be appreciated for "Pulling Batches/Pages of Data via SQL against BigQuery".
UPDATE 2022-06-13 4:12PM -4:00
I have tried wrapping this stuff in a using statement per execution of a query to dispose of the client and clear up memory. It doesn't work, and I have no clue why, as it's even recommended by the Google documentation: https://cloud.google.com/dotnet/docs/reference/help/cleanup#rest-based-apis

Insert multiple records into AWS Keyspaces using C#

Hello, I just recently started with Cassandra and am not very familiar with it. Can you please help me understand the error here?
I am trying to insert 16,000 records using the code below:
public async Task AddSprintsStories(List<SprintStories> sprintStories)
{
    var tasks = new List<Task>();
    try
    {
        if (sprintStories.Count > 0)
        {
            foreach (var item in sprintStories)
            {
                SprintStories sprintStoryData = new SprintStories();
                sprintStoryData.Id = item.Id;
                sprintStoryData.ProjectId = item.ProjectId;
                sprintStoryData.SprintId = item.SprintId;
                tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
            }
            await Task.WhenAll(tasks);
        }
    }
    catch (Exception e)
    {
    }
}
but I am facing this error: Server timeout during write query at consistency LOCALQUORUM (0 peer(s) acknowledged the write over 2 required).
Can anyone please help me out here?
How does the Cassandra cluster look while this runs? Is CPU or disk I/O maxed out? Without knowing that, my guess is that those 16,000 writes are happening faster than your cluster can process them, creating write back pressure. Eventually it just can't process any more, so the writes start failing.
For a possible solution, try limiting the number of active threads. Something like this should do it.
int maxActiveThreads = 20;
int activeThreads = 0;
foreach (var item in sprintStories)
{
    ...
    tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
    activeThreads++;
    if (activeThreads >= maxActiveThreads)
    {
        await Task.WhenAll(tasks);
        activeThreads = 0;
    }
}
await Task.WhenAll(tasks);
With this code, only 20 writes will be competing for Cassandra cluster resources at any given time. Do note that I'm just using 20 as an example; adjust that number to something that meets your requirements for performance and stability.
Ryan Svihla wrote a great blog post on this topic: Cassandra: Batch Loading Without the BATCH - The Nuanced Edition
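An alternative sketch for the same throttling idea uses a SemaphoreSlim instead of the manual counter; the limit of 20 mirrors the example above, and the per-item field copying from the question is omitted for brevity:
// Cap the number of concurrent InsertAsync calls with a SemaphoreSlim.
var throttle = new SemaphoreSlim(20);
var tasks = new List<Task>();
foreach (var item in sprintStories)
{
    await throttle.WaitAsync(); // wait for a free slot before starting another write
    tasks.Add(Task.Run(async () =>
    {
        try
        {
            await mapper.InsertAsync<SprintStories>(item, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum));
        }
        finally
        {
            throttle.Release(); // free the slot even if the write fails
        }
    }));
}
await Task.WhenAll(tasks);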

How to get List of all Hangfire Jobs using JobStorage in C#?

I am using the Hangfire BackgroundJob to create a background job in C# using the code below.
var options = new BackgroundJobServerOptions
{
    ServerName = "Test Server",
    SchedulePollingInterval = TimeSpan.FromSeconds(30),
    Queues = new[] { "critical", "default", "low" },
    Activator = new AutofacJobActivator(container),
};
var jobStorage = new MongoStorage("mongodb://localhost:*****", "TestDB", new MongoStorageOptions()
{
    QueuePollInterval = TimeSpan.FromSeconds(30)
});
var _Server = new BackgroundJobServer(options, jobStorage);
This creates the job server object, and after that I create scheduled, continuation and recurring jobs as below.
var InitJob = BackgroundJob.Schedule<TestInitializationJob>(job => job.Execute(), TimeSpan.FromSeconds(5));
var secondJob = BackgroundJob.ContinueWith<Test_SecondJob>(InitJob, job => job.Execute());
BackgroundJob.ContinueWith<Third_Job>(secondJob, job => job.Execute());
RecurringJob.AddOrUpdate<RecurringJobInit>("test-recurring-job", job => job.Execute(), Cron.MinuteInterval(1));
After that, I want to delete or stop all jobs when my application is stopped or closed. So in the OnStop event of my application, I have written the code below.
var monitoringApi = JobStorage.Current.GetMonitoringApi();
var queues = monitoringApi.Queues(); // BUT this is not returning all queues and all jobs
foreach (QueueWithTopEnqueuedJobsDto queue in queues)
{
    var jobList = monitoringApi.EnqueuedJobs(queue.Name, 0, 100);
    foreach (var item in jobList)
    {
        BackgroundJob.Delete(item.Key);
    }
}
But the above code to get all the jobs and all queues is not working. It always returns only the "default" queue and does not return all jobs.
Does anyone have an idea how to get all the jobs using the Hangfire JobStorage and stop those jobs when the application is stopped?
Any help would be highly appreciated!
Thanks
Single Server Setup
To get all recurring jobs you can use the job storage (e.g. either via static instance or DI):
using (var connection = JobStorage.Current.GetConnection())
{
    var recurringJobs = connection.GetRecurringJobs();
    foreach (var recurringJob in recurringJobs)
    {
        if (NonRemovableJobs.ContainsKey(recurringJob.Id)) continue;
        logger.LogWarning($"Removing job with id [{recurringJob.Id}]");
        jobManager.RemoveIfExists(recurringJob.Id);
    }
}
If your application acts as a single Hangfire server, all job processing will stop as soon as the application is stopped. In that case the jobs wouldn't even need to be removed.
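For jobs that are not recurring (for example scheduled or currently processing ones), a possible sketch goes through the monitoring API instead; the page size of 1000 is arbitrary and would need paging for larger backlogs:
// Enumerate scheduled and processing jobs through the monitoring API
// and delete them by id.
var monitoringApi = JobStorage.Current.GetMonitoringApi();

foreach (var scheduled in monitoringApi.ScheduledJobs(0, 1000))
    BackgroundJob.Delete(scheduled.Key);

foreach (var processing in monitoringApi.ProcessingJobs(0, 1000))
    BackgroundJob.Delete(processing.Key);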
Multi Server Setup
In a multi-instance setup which uses the same Hangfire tables for multiple servers, you'll run into the problem that not all applications have all assemblies available. With the method above, Hangfire tries to deserialize every job it finds, which results in "Assembly Not Found" exceptions.
To prevent this I used the following workaround, which loads the column 'Key' from the table 'Hash'. It comes in the format 'recurring-jobs:{YourJobIdentifier}'. Then the job id is used to remove the job if necessary:
var queue = "MyInstanceQueue"; // probably using queues in a multi server setup
var recurringJobsRaw = await dbContext.HangfireHashes.FromSqlInterpolated($"SELECT [Key] FROM [Hangfire].[Hash] where Field='Queue' AND Value='{queue}'").ToListAsync();
var recJobIds = recurringJobsRaw.Select(s => s.Key.Split(":").Last());
foreach (var id in recJobIds)
{
    if (NonRemovableJobs.ContainsKey(id)) continue;
    logger.LogWarning($"Removing job with id [{id}]");
    jobManager.RemoveIfExists(id);
}
P.S.: To make it work with EF Core I used a Keyless entity for the Hangfire.Hash table.
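A sketch of that keyless entity and its mapping might look like this (the class name is assumed; the DbSet name matches the dbContext.HangfireHashes usage above):
using Microsoft.EntityFrameworkCore;

// Keyless entity mapped to the Hangfire.Hash table (only the Key column is needed here).
public class HangfireHash
{
    public string Key { get; set; }
}

// In the DbContext:
public DbSet<HangfireHash> HangfireHashes { get; set; }

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<HangfireHash>()
        .HasNoKey()
        .ToTable("Hash", "Hangfire");
}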

Is parallelism useful with a lock for DataTable write transactions

Will parallelism help with performance for a locked object, should it be run single-threaded, or is there another technique?
I noticed that when accessing a dataset and adding rows from multiple threads, exceptions were thrown. Therefore I created a "thread-safe" version that adds rows by locking the table prior to updating the row. This implementation works, but it appears slow with many transactions.
public partial class HaMmeRffl
{
    public partial class PlayerStatsDataTable
    {
        public void AddPlayerStatsRow(int PlayerID, int Year, int StatEnum, int Value, DateTime Timestamp)
        {
            lock (TeamMemberData.Dataset.PlayerStats)
            {
                HaMmeRffl.PlayerStatsRow testrow = TeamMemberData.Dataset.PlayerStats.FindByPlayerIDYearStatEnum(PlayerID, Year, StatEnum);
                if (testrow == null)
                {
                    HaMmeRffl.PlayerStatsRow newRow = TeamMemberData.Dataset.PlayerStats.NewPlayerStatsRow();
                    newRow.PlayerID = PlayerID;
                    newRow.Year = Year;
                    newRow.StatEnum = StatEnum;
                    newRow.Value = Value;
                    newRow.Timestamp = Timestamp;
                    TeamMemberData.Dataset.PlayerStats.AddPlayerStatsRow(newRow);
                }
                else
                {
                    testrow.Value = Value;
                    testrow.Timestamp = Timestamp;
                }
            }
        }
    }
}
Now I can call this safely from multiple threads, but does it actually buy me anything? Can I do this differently for better performance? For instance, is there any way to use the System.Collections.Concurrent namespace to optimize performance, or any other methods?
In addition, I update the underlying database after the entire dataset is updated, and that takes a very long time. Would that be considered an I/O operation and be worth parallelizing, by updating the database after each row (or some number of rows) is updated in the dataset?
UPDATE
I wrote some code to test concurrent vs. sequential processing, and it shows that concurrent processing takes about 30% longer, so I should use sequential processing here. I assume this is because the lock on the DataTable makes the overhead of the ConcurrentQueue more costly than the gains from parallel processing. Is this conclusion correct, and is there anything I can do to speed up the processing, or am I stuck because, for a DataTable, "You must synchronize any write operations"?
Here is my test code, which might not be scientifically correct. Here are the timers and the calls between them.
dbTimer.Restart();
Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRow = InsertToPlayerQ(addUpdatePlayers);
Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRow = InsertToPlayerStatQ(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRow, addPlayerStatRow);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds single threaded", dbTimer.Elapsed.TotalSeconds);
dbTimer.Restart();
ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows = InsertToPlayerQueue(addUpdatePlayers);
ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows = InsertToPlayerStatQueue(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRows, addPlayerStatRows);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds concurrently", dbTimer.Elapsed.TotalSeconds);
In both examples I add to the Queue and the ConcurrentQueue in an identical, single-threaded manner. The only difference is the insertion into the DataTable. The single-threaded approach inserts as follows:
private static void UpdatePlayerStatsInDB(Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
    try
    {
        HaMmeRffl.PlayersRow.PlayerValue row;
        while (addPlayerRows.Count > 0)
        {
            row = addPlayerRows.Dequeue();
            TeamMemberData.Dataset.Players.AddPlayersRow(
                row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
                row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
                row.Active, row.NewsUpdate);
        }
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.Players.RejectChanges();
    }
    try
    {
        HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
        while (addPlayerStatRows.Count > 0)
        {
            row = addPlayerStatRows.Dequeue();
            TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
                row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
        }
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.PlayerStats.RejectChanges();
    }
    TeamMemberData.Dataset.Players.AcceptChanges();
    TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The concurrent version adds as follows:
private static void UpdatePlayerStatsInDB(ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
    Action actionPlayer = () =>
    {
        HaMmeRffl.PlayersRow.PlayerValue row;
        while (addPlayerRows.TryDequeue(out row))
        {
            TeamMemberData.Dataset.Players.AddPlayersRow(
                row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
                row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
                row.Active, row.NewsUpdate);
        }
    };
    Action actionPlayerStat = () =>
    {
        HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
        while (addPlayerStatRows.TryDequeue(out row))
        {
            TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
                row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
        }
    };
    Action[] actions = new Action[Environment.ProcessorCount * 2];
    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        actions[i * 2] = actionPlayer;
        actions[i * 2 + 1] = actionPlayerStat;
    }
    try
    {
        // Start ProcessorCount concurrent consuming actions.
        Parallel.Invoke(actions);
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.Players.RejectChanges();
        TeamMemberData.Dataset.PlayerStats.RejectChanges();
    }
    TeamMemberData.Dataset.Players.AcceptChanges();
    TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The difference in time is 4.6 seconds for the single-threaded version and 6.1 seconds for the Parallel.Invoke version.
Locks and transactions are not good for parallelism and performance.
1) Try to avoid the lock: will different threads need to update the same row in the dataset?
2) Minimize the lock time.
For the DB operation you may try the batch update feature of ADO.NET: http://msdn.microsoft.com/en-us/library/ms810297.aspx
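A minimal sketch of that batch update feature with a SqlDataAdapter; the adapter factory here is hypothetical, and the batch size of 500 is only an example:
using System.Data;
using System.Data.SqlClient;

// The adapter is assumed to already have its InsertCommand/UpdateCommand
// configured for the PlayerStats table (hypothetical factory).
SqlDataAdapter adapter = CreatePlayerStatsAdapter();

// Required for batching: the commands must not try to refresh each row.
adapter.InsertCommand.UpdatedRowSource = UpdateRowSource.None;
adapter.UpdateCommand.UpdatedRowSource = UpdateRowSource.None;

// Send rows in groups of 500 statements per round trip (0 means a single unlimited batch).
adapter.UpdateBatchSize = 500;

adapter.Update(TeamMemberData.Dataset.PlayerStats);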
Multithreading can help up to an extent, because once the data crosses your app boundary you will start waiting for I/O. Here you can do asynchronous processing, because your app does not have control over various parameters (resource access, network speed, etc.); this will give a better user experience (if it is a UI app).
Now for your scenario, you may want to use some sort of producer/consumer queue: as soon as a row is available in the queue, a different thread starts processing it, but again this will only help up to an extent.

Throttle parallel requests to a remote API

I'm working on an ASP.NET MVC application that uses the Google Maps Geocoding API. In a single batch there may be up to 1000 queries to submit to the Geocoding API, so I'm trying to use a parallel processing approach to improve performance. The method responsible for starting a process for each core is:
public void GeoCode(Queue<Job> qJobs, bool bolKeepTrying, bool bolSpellCheck, Action<Job, bool, bool> aWorker)
{
    // Get the number of processors, initialize the number of remaining
    // threads, and set the starting point for the iteration.
    int intCoreCount = Environment.ProcessorCount;
    int intRemainingWorkItems = intCoreCount;
    using (ManualResetEvent mreController = new ManualResetEvent(false))
    {
        // Create each of the work items.
        for (int i = 0; i < intCoreCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                Job jCurrent = null;
                while (qJobs.Count > 0)
                {
                    lock (qJobs)
                    {
                        if (qJobs.Count > 0)
                        {
                            jCurrent = qJobs.Dequeue();
                        }
                        else
                        {
                            if (jCurrent != null)
                            {
                                jCurrent = null;
                            }
                        }
                    }
                    aWorker(jCurrent, bolKeepTrying, bolSpellCheck);
                }
                if (Interlocked.Decrement(ref intRemainingWorkItems) == 0)
                {
                    mreController.Set();
                }
            });
        }
        // Wait for all threads to complete.
        mreController.WaitOne();
    }
}
This is based on a patterns document I found on Microsoft's parallel computing web site.
The problem is that the Google API has a limit of 10 QPS (enterprise customer), which I'm hitting, and then I get HTTP 403 errors. Is there a way I can benefit from parallel processing but limit the requests I'm making? I've tried using Thread.Sleep, but it doesn't solve the problem. Any help or suggestions would be very much appreciated.
It sounds like you're missing some sort of max-in-flight parameter. Rather than just looping while there are jobs in the queue, you need to throttle your submissions based on jobs finishing.
It seems like your algorithm should be something like the following (see the sketch below):
Submit N jobs (where N is your max in flight).
Wait for a job to complete, and if the queue is not empty, submit the next job.
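A minimal sketch of that algorithm, reusing the Job queue and the aWorker delegate from the question; maxInFlight is an assumption and would need tuning against the 10 QPS quota:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Submit up to maxInFlight geocoding jobs, then only submit a new one
// as an in-flight job completes.
static void GeoCodeThrottled(Queue<Job> qJobs, bool bolKeepTrying, bool bolSpellCheck,
                             Action<Job, bool, bool> aWorker, int maxInFlight)
{
    var inFlight = new List<Task>();
    while (qJobs.Count > 0)
    {
        Job job = qJobs.Dequeue();
        inFlight.Add(Task.Run(() => aWorker(job, bolKeepTrying, bolSpellCheck)));

        if (inFlight.Count >= maxInFlight)
        {
            // Wait for any one request to finish before submitting the next.
            int finished = Task.WaitAny(inFlight.ToArray());
            inFlight.RemoveAt(finished);
        }
    }
    // Wait for the remaining requests to drain.
    Task.WaitAll(inFlight.ToArray());
}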
