Insert Multiple records in AWS keyspace using C# - c#

Hello, I have just started with Cassandra and am not very familiar with it. Can you please tell me what the error is here?
I am trying to insert 16,000 records using the code below:
public async Task AddSprintsStories(List<SprintStories> sprintStories)
{
var tasks = new List<Task>();
try
{
if (sprintStories.Count > 0)
{
foreach (var item in sprintStories)
{
SprintStories sprintStoryData = new SprintStories();
sprintStoryData.Id = item.Id;
sprintStoryData.ProjectId = item.ProjectId;
sprintStoryData.SprintId = item.SprintId;
tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
}
await Task.WhenAll(tasks);
}
}
catch (Exception e)
{
}
}
But I am getting this error: Server timeout during write query at consistency LOCALQUORUM (0 peer(s) acknowledged the write over 2 required)
Can anyone please help me out here?

How does the Cassandra cluster look during this load? Is CPU or disk I/O maxed out? Without knowing that, my guess is that those 16000 writes are arriving faster than your cluster can process them, creating write back pressure. Eventually it just can't process any more, so they start failing.
For a possible solution, try limiting the number of active writes. Something like this should do it.
int maxActiveThreads = 20;
int activeThreads = 0;
foreach (var item in sprintStories)
{
    ...
    tasks.Add(mapper.InsertAsync<SprintStories>(sprintStoryData, new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum)));
    activeThreads++;
    if (activeThreads >= maxActiveThreads)
    {
        // Wait for the current group of writes before starting more.
        await Task.WhenAll(tasks);
        tasks.Clear();
        activeThreads = 0;
    }
}
// Wait for the final, partial group.
await Task.WhenAll(tasks);
With this code, only 20 writes will be competing for Cassandra cluster resources at any given time. Do note that I'm just using 20 as an example. Adjust that number to something that meets your requirements for performance and stability.
Ryan Svihla wrote a great blog post on this topic: Cassandra: Batch Loading Without the BATCH - The Nuanced Edition
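A variation on the same idea, as a rough sketch: instead of waiting for a whole group of 20 to finish, a SemaphoreSlim can keep up to 20 writes in flight continuously, starting a new one as soon as any slot frees up. The ThrottledWriter class, the WriteAllAsync name and the writeAsync delegate are made up for illustration; in the question's code the delegate would presumably wrap the mapper.InsertAsync call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledWriter
{
    // Runs writeAsync for every item with at most maxInFlight calls active at once.
    // (Hypothetical helper, not part of the Cassandra driver.)
    public static async Task WriteAllAsync<T>(
        IEnumerable<T> items, Func<T, Task> writeAsync, int maxInFlight)
    {
        using var gate = new SemaphoreSlim(maxInFlight);
        var tasks = new List<Task>();

        foreach (var item in items)
        {
            await gate.WaitAsync();        // wait for a free slot before starting another write
            tasks.Add(RunOneAsync(item));
        }

        await Task.WhenAll(tasks);         // waits for the tail and surfaces any failure

        async Task RunOneAsync(T item)
        {
            try { await writeAsync(item); }
            finally { gate.Release(); }    // free the slot even if the write failed
        }
    }
}
```

Usage would look something like `await ThrottledWriter.WriteAllAsync(sprintStories, s => mapper.InsertAsync(s, options), 20);`, where `options` is the CqlQueryOptions from the question. As with the batch version, 20 is only a starting point to tune.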

Related

C# - Parallel.Foreach() for Service Call

I have a console application that does three steps:
1. Get pending notification records from the db
2. Call a service to send email (the service returns the email address as its response)
3. After getting the service response, update the db
For step 2, I am using Parallel.ForEach() and it is working far better than foreach().
I have gone through a lot of articles and threads on Stack Overflow, which has only caused more confusion on this topic.
I have a few questions:
1. I am running this on a server. Does that affect performance, and should I limit the number of threads? (The email count can be anywhere from 0 to 500 or 1000.)
2. I ran into one issue where, in step 2, the service returned an email address as the response but it was not available while updating the db. (The email count here was 400.)
I suspect the issue could be that Parallel.ForEach() did not add it to notifList.
If this is the case, would adding Thread.Sleep(1000) after the Parallel.ForEach() loop ends fix the issue?
3. In case of any exception, should I explicitly cancel the threads?
I appreciate your time and effort in helping me with this. Thank you!
public void notificationMethod()
{
List<notify> notifList = new List<notify>();
//step 1
List<orders> orderList = GetNotifs();
try
{
if (orderList.Count > 0)
{
Parallel.ForEach(orderList, (orderItem) =>
{
//step 2
SendNotifs(orderItem);
notifList.Add(new notify()
{
//building list by adding email address along with other information
});
});
if (notifList.Count > 0)
{
int index = 0;
int rows = 10;
int skipRows = index * rows;
int updatedRows = 0;
while (skipRows < notifList.Count)
{
//pagination
List<notify> subitem = notifList.Skip(index * rows).Take(rows).ToList<notify>();
updatedRows += subitem.Count;
//step 3
UpdateDatabase(subitem);
index++;
skipRows = index * rows;
}
}
}
}
catch (ApplicationException ex)
{
}
}
I had a similar scenario and also wondered whether Parallel.ForEach() would improve performance. After watching the video below from Microsoft, I decided to use Parallel.ForEach() only for CPU-intensive workloads.
Your scenario is an I/O-intensive workload and could be handled better by async/await.
https://channel9.msdn.com/Series/Three-Essential-Tips-for-Async/Tip-2-Distinguish-CPU-Bound-work-from-IO-bound-work
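To make the async/await suggestion concrete, here is a minimal sketch. SendNotifAsync is a hypothetical stand-in for the email service call (it just delays and returns a fake address), and orders are reduced to integer ids. The point is that each send returns its result and Task.WhenAll gathers them in input order, so nothing is added to a shared list from several threads. List<T> is not safe for concurrent Add calls, which is a likely reason for the missing entry in the question, and a Thread.Sleep after the loop would not repair that.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class NotificationSketch
{
    // Hypothetical stand-in for the service call: returns the address it sent to.
    public static async Task<string> SendNotifAsync(int orderId)
    {
        await Task.Delay(10);                      // simulates network latency
        return $"user{orderId}@example.com";
    }

    // Starts every send, awaits them together, and returns results in input order.
    public static async Task<List<string>> SendAllAsync(IEnumerable<int> orderIds)
    {
        IEnumerable<Task<string>> sends = orderIds.Select(SendNotifAsync);
        string[] emails = await Task.WhenAll(sends);
        return emails.ToList();
    }
}
```

If the service cannot take several hundred simultaneous calls, the number in flight would still need a cap; this sketch leaves that out.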

How to use partitions in order to parallel consume one topic in kafka with .NET Core C#?

We are using the .NET Kafka client to consume messages from one topic in our C# code.
However, it seems to be a wee bit too slow.
Wondering if we could parallelize the process a bit, so I checked this answer there: Kafka how to consume one topic parallel
But I don't really see how to implement this partition thing with the .NET Kafka client in my example below:
var consumerBuilder = new ConsumerBuilder<Ignore, string>(GetConfig())
.SetErrorHandler((_, e) => _logger.LogError("Kafka consumer error on Revenue response. {@KafkaConsumerError}", e));
using (var consumer = consumerBuilder.Build())
{
consumer.Subscribe(RevenueResponseTopicName);
try
{
while (!stoppingToken.IsCancellationRequested)
{
var consumeResult = consumer.Consume(stoppingToken);
RevenueTopicResponseModel revenueResponse;
try
{
revenueResponse = JsonConvert.DeserializeObject<RevenueTopicResponseModel>(consumeResult.Value);
}
catch
{
_logger.LogCritical("Impossible to deserialize the response. {@RevenueConsumeResult}", consumeResult);
continue;
}
_logger.LogInformation("Revenue response received from Kafka. {RevenueTopicResponse}",
consumeResult.Value);
await _revenueService.RevenueResultReceivedAsync(revenueResponse);
}
}
catch (OperationCanceledException)
{
_logger.LogInformation($"Operation canceled. Closing {nameof(RevenueResponseConsumer)}.");
consumer.Close();
}
catch (Exception e)
{
_logger.LogCritical(e, $"Unhandled exception during {nameof(RevenueResponseConsumer)}.");
}
}
You need to create the topic with multiple partitions, let's say 10.
In your code, create 10 consumers with the same consumer group. The brokers will distribute the topic's partitions among your consumers.
Basically, put your code inside a for loop and run each consumer on its own task, since the consume loop blocks:
var consumerTasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
    consumerTasks.Add(Task.Run(() =>
    {
        var consumerBuilder = new ConsumerBuilder<Ignore, string>(GetConfig())
            .SetErrorHandler((_, e) => _logger.LogError("Kafka consumer error on Revenue response. {@KafkaConsumerError}", e));
        using (var consumer = consumerBuilder.Build())
        {
            // your processing here
        }
    }));
}
await Task.WhenAll(consumerTasks);
To answer this question correctly, we need to know the reason behind the partitioning requirement.
If your topic doesn't have many messages to process, partitioning is not the right tool. If the issue is that processing a single message takes too much time and you want to parallelize the work, you could add consumed messages to a Channel and have as many consumers of that channel as needed in the background.
Basically, you should still use a single consumer per process, since a consumer uses threads in the background.
You may also find my thoughts on the Kafka consumer in C# in the article.
If you have any questions, please feel free to ask! I'll be glad to help.
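A rough sketch of the Channel idea, with the Kafka part left out: one producer loop (standing in for the single consumer's Consume loop) writes messages into a bounded System.Threading.Channels channel, and several background workers read from it. ChannelFanOut, ProcessAsync and handleAsync are names invented for this example; in the question, handleAsync would roughly correspond to deserializing and calling RevenueResultReceivedAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelFanOut
{
    // One producer feeds the channel; workerCount workers drain it concurrently.
    // Returns the number of messages handled.
    public static async Task<int> ProcessAsync(
        IEnumerable<string> source, int workerCount, Func<string, Task> handleAsync)
    {
        // Bounded so a slow handler applies back pressure to the producer.
        var channel = Channel.CreateBounded<string>(100);
        int processed = 0;

        Task[] workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(async () =>
        {
            await foreach (string message in channel.Reader.ReadAllAsync())
            {
                await handleAsync(message);
                Interlocked.Increment(ref processed);
            }
        })).ToArray();

        // In the real service this loop would be the Kafka Consume loop.
        foreach (string message in source)
            await channel.Writer.WriteAsync(message);
        channel.Writer.Complete();          // lets the workers finish once the channel is drained

        await Task.WhenAll(workers);
        return processed;
    }
}
```

Note that with more than one worker, messages are no longer handled strictly in the order they were consumed, which matters if offsets are committed manually.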
You can commit after a set of offsets instead of committing on every offset, which could give you some performance benefit.
if (result.Offset.Value % 5 == 0)
{
    consumer.Commit(result);
}
This assumes EnableAutoCommit = false.

performance issues executing list of stored procedures

I'm having some performance issues when starting my Windows service. In the first round my lstSps is long (about 130 stored procedures). Is there any way to speed this up (other than speeding up the stored procedures themselves)?
When the foreach is over and it moves on to the second round, it goes faster, because not that many return true from TimeToRun(). But my concern is the first round, when there are a lot more stored procedures to run.
I have thought about using an array and a for loop, since I read that it's faster, but I believe the problem is that the procedures take too long. Could I build this in a better way? Maybe use multiple threads (one for each execution) or something like that?
I would really appreciate some tips :)
EDIT: Just to clarify, it is the HasResult() method that executes the SPs and takes the time.
lock (lstSpsToSend)
{
lock (lstSps)
{
foreach (var sp in lstSps.Where(sp => sp .TimeToRun()).Where(sp => sp.HasResult()))
{
lstSpsToSend.Add(sp);
}
}
}
while (lstSpsToSend.Count > 0)
{
//Take the first watchdog in list and then remove it
Sp sp;
lock (lstSpsToSend)
{
sp = lstSpsToSend[0];
lstSpsToSend.RemoveAt(0);
}
try
{
//Send the results
}
catch (Exception e)
{
Thread.Sleep(30000);
}
}
What I would do is something like this:
int openThread = 0;
ConcurrentQueue<Sp> queue = new ConcurrentQueue<Sp>();
foreach (var sp in lstSps)
{
    // Count the thread before it starts so the wait loop below cannot see 0 too early.
    Interlocked.Increment(ref openThread);
    Thread worker = new Thread(() =>
    {
        try
        {
            if (sp.TimeToRun() && sp.HasResult())
            {
                queue.Enqueue(sp);
            }
        }
        finally
        {
            Interlocked.Decrement(ref openThread);
        }
    }) { Priority = ThreadPriority.AboveNormal, IsBackground = false };
    worker.Start();
}
// Wait for all threads to finish
while (Volatile.Read(ref openThread) > 0)
{
    Thread.Sleep(500);
}
// And here move sp from queue to lstSpsToSend
while (lstSpsToSend.Count > 0)
{
//Take the first watchdog in list and then remove it
Sp sp;
lock (lstSpsToSend)
{
sp = lstSpsToSend[0];
lstSpsToSend.RemoveAt(0);
}
try
{
//Send the results
}
catch (Exception e)
{
Thread.Sleep(30000);
}
}
The best approach depends heavily on what these stored procedures are actually doing. If they return the same kind of result set, or no result at all, it would definitely be beneficial to send them to SQL Server all at once instead of one at a time.
The reason for this is network latency. If your SQL Server sits in a data center that you access over a WAN, your latency could be anywhere from 200 ms up. So if you call 130 stored procedures sequentially, the "cost" would be 200 ms x 130. That's 26 seconds spent just going back and forth over the network connection, not actually executing the logic in your procs.
If you can combine all the procedures into a single call, you pay the 200 ms cost only once.
Executing them on multiple concurrent threads is also a reasonable approach, but as before it would depend on what your procedures are doing and returning to you.
Using an array instead of a list would not really give you any performance increase.
Hope this helps, good luck!
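The arithmetic above can be put into a small helper to compare the options. This is only a back-of-the-envelope sketch: LatencyEstimate and RoundTripCostMs are invented names, and the model counts round-trip latency only, ignoring the time the procedures take to execute on the server.

```csharp
using System;

public static class LatencyEstimate
{
    // Time spent purely on network round trips when `calls` requests are issued
    // with `concurrency` of them in flight at a time, each costing `latencyMs`.
    public static int RoundTripCostMs(int calls, int latencyMs, int concurrency)
    {
        int waves = (calls + concurrency - 1) / concurrency;   // ceiling division
        return waves * latencyMs;
    }
}
```

With the numbers from this answer, `RoundTripCostMs(130, 200, 1)` gives 26000 ms for sequential calls, `RoundTripCostMs(130, 200, 10)` gives 2600 ms with ten calls in flight at a time, and a single combined call, `RoundTripCostMs(1, 200, 1)`, gives 200 ms.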

Insert into RavenDB; Fastest way

I want to import 100 million entries from a text file (each row is one CSV-like entry) into a RavenDB database. What is the fastest way to do this?
Additional Notes:
I don't have any indexes yet (I will create them after inserting the data). RavenDB is running in service mode on the local machine with no security enhancements (yet, because I am still testing RavenDB). This test will run on 2 different machines: 1) 2 cores, 4 GB RAM; 2) 8 cores, 12 GB RAM.
I have inserted a portion of this data (2 million entries) into RavenDB, but it was not as fast as I would like. If I use OpenAsyncSession, call SaveChangesAsync every 1024 records, then open a new session with OpenAsyncSession without waiting for the Task returned by SaveChangesAsync, I get an "Index out of range" exception after 500,000 entries or so that I cannot track down. If I wait for the tasks to finish (keeping as many tasks as there are cores), the process succeeds, but not fast enough.
This code ran successfully:
using (var reader = new StreamReader(@"D:\*\DATA.TXT", Encoding.UTF8))
{
string line = null;
IAsyncDocumentSession session = null;
var tasks = new List<Task>();
var locCount = 0;
while ((line = reader.ReadLine()) != null)
{
if (string.IsNullOrWhiteSpace(line)) continue;
var loc = Parse(line);
if (session == null) session = documentStore.OpenAsyncSession();
session.Store(loc);
locCount++;
if (locCount % 1024 == 0 && session != null)
{
try
{
var t = session.SaveChangesAsync();
tasks.Add(t);
session = null;
}
catch (Exception x)
{
// ... something ...
}
}
if (tasks.Count >= NUMBER_OF_CORES)
{
Task.WaitAll(tasks.ToArray());
tasks.Clear();
}
}
if (session != null)
{
if (tasks.Count > 0)
{
Task.WaitAll(tasks.ToArray());
tasks.Clear();
}
session.SaveChangesAsync().Wait();
session = null;
}
}
Thanks
Kaveh,
There are a number of issues here.
1) RavenDB models very rarely map to CSV files. If you have a CSV file, you usually have tabular format, and that isn't a good format to port to RavenDB. You can probably get better results by getting good models.
2) Your code, without the if (tasks.Count >= NUMBER_OF_CORES) check, will generate as many tasks as possible (limited only by how fast lines can be read from the file, which is really fast).
This will tend to generate thousands of concurrent tasks and will exceed the number of requests RavenDB can handle at once.
3) Use the standard session, use a batch size of 1,024 - 2,048. And just let it run.
RavenDB is really good at optimizing, and I expect that you'll see thousands of inserts per second easily.
But, again, you are probably modeling things wrong.
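For point 3, the batching itself can be sketched independently of RavenDB. The helper below (Batching and InBatches are invented names) streams any sequence as lists of at most batchSize items, so the 100 million lines never have to be held in memory at once.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Splits a sequence into lists of at most batchSize items while streaming the input.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);   // start a fresh list for the next batch
            }
        }
        if (batch.Count > 0) yield return batch;  // final partial batch
    }
}
```

The import loop would then, for each batch, open a standard session, Store every parsed entry, and call SaveChanges once before moving to the next batch. That usage is my reading of the advice above, not code from the answer.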

throttle parallel request to remote api

I'm working on an ASP.NET MVC application that uses the Google Maps Geocoding API. In a single batch there may be up to 1000 queries to submit to the Geocoding API, so I'm trying to use a parallel processing approach to improve performance. The method responsible for starting a process for each core is:
public void GeoCode(Queue<Job> qJobs, bool bolKeepTrying, bool bolSpellCheck, Action<Job, bool, bool> aWorker)
{
// Get the number of processors, initialize the number of remaining
// threads, and set the starting point for the iteration.
int intCoreCount = Environment.ProcessorCount;
int intRemainingWorkItems = intCoreCount;
using(ManualResetEvent mreController = new ManualResetEvent(false))
{
// Create each of the work items.
for(int i = 0; i < intCoreCount; i++)
{
ThreadPool.QueueUserWorkItem(delegate
{
Job jCurrent = null;
while(qJobs.Count > 0)
{
lock(qJobs)
{
if(qJobs.Count > 0)
{
jCurrent = qJobs.Dequeue();
}
else
{
if(jCurrent != null)
{
jCurrent = null;
}
}
}
aWorker(jCurrent, bolKeepTrying, bolSpellCheck);
}
if(Interlocked.Decrement(ref intRemainingWorkItems) == 0)
{
mreController.Set();
}
});
}
// Wait for all threads to complete.
mreController.WaitOne();
}
}
This is based on a patterns document I found on Microsoft's parallel computing web site.
The problem is that the Google API has a limit of 10 QPS (enterprise customer), which I'm hitting, and then I get HTTP 403 errors. Is there a way I can benefit from parallel processing but limit the requests I'm making? I've tried using Thread.Sleep but it doesn't solve the problem. Any help or suggestions would be very much appreciated.
It sounds like you're missing some sort of max-in-flight parameter. Rather than just looping while there are jobs in the queue, you need to throttle your submissions based on jobs finishing.
Your algorithm should be something like the following:
1. Submit N jobs (where N is your max in flight).
2. Wait for a job to complete and, if the queue is not empty, submit the next job.
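The two steps above could look roughly like this with tasks. It is a sketch under assumptions: MaxInFlight, RunAsync and workAsync are illustrative names, and workAsync stands for an asynchronous version of the geocoding call, which the question does not have yet.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class MaxInFlight
{
    // Keeps at most maxInFlight jobs running; each time one finishes, the next is submitted.
    public static async Task RunAsync<T>(Queue<T> jobs, int maxInFlight, Func<T, Task> workAsync)
    {
        var running = new List<Task>();

        // Step 1: submit the first N jobs.
        while (running.Count < maxInFlight && jobs.Count > 0)
            running.Add(workAsync(jobs.Dequeue()));

        // Step 2: wait for any job to complete, then submit the next one if there is one.
        while (running.Count > 0)
        {
            Task finished = await Task.WhenAny(running);
            running.Remove(finished);
            await finished;                        // rethrows if that job failed
            if (jobs.Count > 0)
                running.Add(workAsync(jobs.Dequeue()));
        }
    }
}
```

Keep in mind that a cap on requests in flight is not the same thing as a cap on queries per second. If each request returns in well under a second, ten in flight can still exceed 10 QPS, so the value of N (or an added delay per job) would need tuning against the actual response times.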
