I'm a little confused about the fastest way to insert a large collection into a Cassandra database. I read that I shouldn't use batch inserts because they are meant for atomicity; Cassandra itself even warns me to use asynchronous writes for performance.
I've used this code for the fastest insert, without the 'batch' keyword:
var cluster = Cluster.Builder()
    .AddContactPoint("127.0.0.1")
    .Build();
var session = cluster.Connect();
//Save off the prepared statement you're going to use
var statement = session.Prepare("INSERT INTO tester.users (userID, firstName, lastName) VALUES (?,?,?)");
var tasks = new List<Task>();
for (int i = 0; i < 1000; i++)
{
    //please bind with whatever actually useful data you're importing
    var bind = statement.Bind(i, "John", "Tester");
    var resultSetFuture = session.ExecuteAsync(bind);
    tasks.Add(resultSetFuture);
}
Task.WaitAll(tasks.ToArray());
cluster.Shutdown();
from: https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
But it's still much slower than the batch option I'm using. My current code looks like this:
IList<Movie> moviesList = Movie.CreateMoviesCollectionForCassandra(collectionEntriesNumber);
var preparedStatements = new List<PreparedStatement>();
foreach (var statement in preparedStatements)
{
    statement.SetConsistencyLevel(ConsistencyLevel.One);
}
var statementBinding = new BatchStatement();
statementBinding.SetBatchType(BatchType.Unlogged);
for (int i = 0; i < collectionEntriesNumber; i++)
{
    preparedStatements.Add(Session.Prepare("INSERT INTO Movies (id, title, description, year, genres, rating, originallanguage, productioncountry, votingsnumber, director) VALUES (?,?,?,?,?,?,?,?,?,?)"));
}
for (int i = 0; i < collectionEntriesNumber; i++)
{
    statementBinding.Add(preparedStatements[i].Bind(moviesList[i].Id, moviesList[i].Title,
        moviesList[i].Description, moviesList[i].Year, moviesList[i].Genres, moviesList[i].Rating,
        moviesList[i].OriginalLanguage, moviesList[i].ProductionCountry, moviesList[i].VotingsNumber,
        new Director(moviesList[0].Director.Id, moviesList[i].Director.Firstname,
            moviesList[i].Director.Lastname, moviesList[i].Director.Age)));
}
watch.Start();
Session.ExecuteAsync(statementBinding);
watch.Stop();
It really works much, much faster, but I can only insert ~2500 prepared statements, no more, and I want to measure the insertion time for about 100,000 objects.
Is my code correct? Maybe I should just increase some insert threshold?
Please explain how to do it the right way.
Remember that you should prepare your statement once and reuse that same PreparedStatement, binding it with different parameters.
You can use small batches if you are targeting the same partition; if not, you should use individual requests.
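For illustration, here is a minimal sketch of that first case: a small unlogged batch where every row belongs to the same partition. The table, column names, and the rowsForOnePartition collection are assumptions for the example, not from your schema.
// Sketch: prepare once, then group a handful of rows for ONE partition key
// into a single unlogged batch. Column names are made up for illustration.
var prepared = session.Prepare(
    "INSERT INTO tester.user_events (userID, eventTime, payload) VALUES (?, ?, ?)");
var batch = new BatchStatement();
batch.SetBatchType(BatchType.Unlogged);
foreach (var row in rowsForOnePartition) // all rows share the same userID
{
    batch.Add(prepared.Bind(row.UserId, row.EventTime, row.Payload));
}
await session.ExecuteAsync(batch);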
When using individual requests, you can schedule executions in parallel and limit the amount of outstanding requests using a semaphore.
Something like:
public async Task<long> Execute(
    IStatement[] statements, int parallelism, int maxOutstandingRequests)
{
    var semaphore = new SemaphoreSlim(maxOutstandingRequests);
    var tasks = new Task<RowSet>[statements.Length];
    var chunkSize = statements.Length / parallelism;
    if (chunkSize == 0)
    {
        chunkSize = 1;
    }
    var statementLength = statements.Length;
    var launchTasks = new Task[parallelism + 1];
    var watch = new Stopwatch();
    watch.Start();
    for (var i = 0; i < parallelism + 1; i++)
    {
        var startIndex = i * chunkSize;
        //start to launch in parallel
        launchTasks[i] = Task.Run(async () =>
        {
            for (var j = 0; j < chunkSize; j++)
            {
                var index = startIndex + j;
                if (index >= statementLength)
                {
                    break;
                }
                await semaphore.WaitAsync();
                var t = _session.ExecuteAsync(statements[index]);
                tasks[index] = t;
                var rs = await t;
                semaphore.Release();
            }
        });
    }
    await Task.WhenAll(launchTasks);
    await Task.WhenAll(tasks);
    watch.Stop();
    return watch.ElapsedMilliseconds;
}
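A caller might look roughly like this (a sketch only; preparedStatement, the bound columns, and the parallelism/outstanding-request numbers are placeholders to tune, not values from the question):
// Sketch: bind one statement per row, then let Execute() fan them out.
var statements = moviesList
    .Select(m => (IStatement)preparedStatement.Bind(m.Id, m.Title /* , ... remaining columns */))
    .ToArray();
long elapsedMs = await Execute(statements, parallelism: 8, maxOutstandingRequests: 512);
Console.WriteLine($"Inserted {statements.Length} rows in {elapsedMs} ms");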
Related
I want to generate an enumerable of tasks; the tasks will complete at different times.
How can I make a generator in C# that:
yields tasks
every few iterations, resolves previously yielded tasks with results that are only now known
The reason I want to do this is because I am processing a long iterable of inputs, and every so often I accumulate enough data from these inputs to send a batch API request and finalise my outputs.
Pseudocode:
IEnumerable<Task<Output>> Process(IEnumerable<Input> inputs)
{
    var queuedInputs = Queue<Input>();
    var cumulativeLength = 0;
    foreach (var input in inputs)
    {
        yield return waiting task for this input
        queuedInputs.Enqueue(input);
        cumulativeLength += input.Length;
        if (cumulativeLength > 10)
        {
            cumulativeLength = 0
            GetFromAPI(queue).ContinueWith((apiTask) => {
                Queue<BatchResult> batchResults = apiTask.Result;
                while (queuedInputs.Count > 0)
                {
                    batchResult = batchResults.Dequeue();
                    historicalInput = queuedInputs.Dequeue();
                    var output = MakeOutput(historicalInput, batchResult);
                    resolve earlier input's task with this output
                }
            });
        }
    }
}
The shape of your solution is going to be driven by the shape of your problem. There are a couple of questions I have, because your problem domain seems odd:
Are all your inputs known at the outset? The (synchronous) IEnumerable<Input> implies they are.
Are you sure you want to wait for a batch of inputs before sending any query? What about the "remainder" if you're batching by 10 but have 55 inputs?
Assuming you do have synchronous inputs, and that you want to batch with remainders, you can just accumulate all your inputs immediately, batch them, and walk the batches, asynchronously providing outputs:
async IAsyncEnumerable<Output> Process(IEnumerable<Input> inputs)
{
    foreach (var batchedInput in inputs.Batch(10))
    {
        var batchResults = await GetFromAPI(batchedInput);
        for (int i = 0; i != batchedInput.Count; ++i)
            yield return MakeOutput(batchedInput[i], batchResults[i]);
    }
}
public static IEnumerable<IReadOnlyList<TSource>> Batch<TSource>(this IEnumerable<TSource> source, int size)
{
    List<TSource>? batch = null;
    foreach (var item in source)
    {
        batch ??= new List<TSource>(capacity: size);
        batch.Add(item);
        if (batch.Count == size)
        {
            yield return batch;
            batch = null;
        }
    }
    if (batch?.Count > 0)
        yield return batch;
}
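For completeness, the caller consumes this with await foreach (a sketch, assuming the Input/Output types and inputs collection from your question):
// Sketch: results arrive batch by batch as each API call completes.
await foreach (var output in Process(inputs))
{
    Console.WriteLine(output);
}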
Update:
If you want to start the API calls immediately, you can move those out of the loop:
async IAsyncEnumerable<Output> Process(IEnumerable<Input> inputs)
{
    var batchedInputs = inputs.Batch(10).ToList();
    var apiCallTasks = batchedInputs.Select(GetFromAPI).ToList();
    for (int i = 0; i != apiCallTasks.Count; ++i)
    {
        var batchResults = await apiCallTasks[i];
        var batchedInput = batchedInputs[i];
        for (int j = 0; j != batchedInput.Count; ++j)
            yield return MakeOutput(batchedInput[j], batchResults[j]);
    }
}
One approach is to use the TPL Dataflow library. This library offers a variety of components named "blocks" (TransformBlock, ActionBlock etc.), where each block processes its input data and then propagates the results to the next block. The blocks are linked together so that the completion of the previous block in the pipeline triggers the completion of the next block, and so on, until the final block, which is usually an ActionBlock<T> with no output. Here is an example:
var block1 = new TransformBlock<int, string>(item =>
{
    Thread.Sleep(1000); // Simulate synchronous work
    return item.ToString();
}, new()
{
    MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
    EnsureOrdered = false
});
var block2 = new BatchBlock<string>(batchSize: 10);
var block3 = new ActionBlock<string[]>(async batch =>
{
    await Task.Delay(1000); // Simulate asynchronous work
}); // The default MaxDegreeOfParallelism is 1
block1.LinkTo(block2, new() { PropagateCompletion = true });
block2.LinkTo(block3, new() { PropagateCompletion = true });
// Provide some input in the pipeline
block1.Post(1);
block1.Post(2);
block1.Post(3);
block1.Post(4);
block1.Post(5);
block1.Complete(); // Mark the first block as completed
await block3.Completion; // Await the completion of the last block
The TPL Dataflow library is powerful and flexible, but it has a weak point in the propagation of exceptions. There is no built-in way to instruct block1 to stop working if block3 fails. You can read more about this issue here. It might not be a serious issue if you don't expect your blocks to fail very often.
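If you do need it, one possible workaround (my own sketch, not a built-in feature or part of the linked discussion) is to watch the last block's Completion task and manually fault the first block:
// Sketch: if block3 fails, propagate the failure backwards by faulting block1,
// so it stops processing and the pipeline shuts down.
_ = block3.Completion.ContinueWith(t =>
{
    if (t.IsFaulted)
        ((IDataflowBlock)block1).Fault(t.Exception);
}, TaskScheduler.Default);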
Assuming MyGenerator() returns List<Task<T>>, and the number of tasks is relatively small (even in the hundreds is probably fine) then you can use Task.WhenAny(), which returns the first Task that completes. Then remove that Task from the list, process the result, and move on to the next:
var tasks = MyGenerator();
while (tasks.Count > 0) {
    var t = await Task.WhenAny(tasks);
    tasks.Remove(t);
    var result = await t; // this won't actually wait since the task is already done
    // Do something with result
}
There is a good discussion of this in an article by Stephen Toub, which explains in more detail, and gives alternatives if your task list is in the thousands: Processing tasks as they complete
There's also this article, but I think Stephen's is better written: Process asynchronous tasks as they complete (C#)
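The core idea from Stephen's article, reduced to a sketch (a simplified take on his Interleaved helper, so treat it as an illustration rather than the article's exact code), is to hand out TaskCompletionSource-backed tasks in completion order:
// Sketch: returns tasks that complete in the order the input tasks finish,
// so the caller can simply await them one by one.
static IEnumerable<Task<T>> InCompletionOrder<T>(IEnumerable<Task<T>> inputTasks)
{
    var inputs = inputTasks.ToList();
    var sources = inputs.Select(_ => new TaskCompletionSource<T>()).ToList();
    int nextSlot = -1;
    foreach (var task in inputs)
    {
        task.ContinueWith(completed =>
        {
            var source = sources[Interlocked.Increment(ref nextSlot)];
            if (completed.IsFaulted)
                source.TrySetException(completed.Exception.InnerExceptions);
            else if (completed.IsCanceled)
                source.TrySetCanceled();
            else
                source.TrySetResult(completed.Result);
        }, TaskContinuationOptions.ExecuteSynchronously);
    }
    return sources.Select(s => s.Task);
}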
Using TaskCompletionSource:
IEnumerable<Task<Output>> Process(IEnumerable<Input> inputs)
{
    var tcss = new List<TaskCompletionSource<Output>>();
    var queue = new Queue<(Input, TaskCompletionSource<Output>)>();
    var cumulativeLength = 0;
    foreach (var input in inputs)
    {
        var tcs = new TaskCompletionSource<Output>();
        queue.Enqueue((input, tcs));
        tcss.Add(tcs);
        cumulativeLength += input.Length;
        if (cumulativeLength > 10)
        {
            cumulativeLength = 0;
            var queueClone = new Queue<(Input, TaskCompletionSource<Output>)>(queue);
            queue.Clear();
            GetFromAPI(queueClone.Select(x => x.Item1)).ContinueWith((apiTask) => {
                Queue<BatchResult> batchResults = apiTask.Result;
                while (queueClone.Count > 0)
                {
                    var batchResult = batchResults.Dequeue();
                    var (queuedInput, queuedTcs) = queueClone.Dequeue();
                    var output = MakeOutput(queuedInput, batchResult);
                    queuedTcs.SetResult(output);
                }
            });
        }
    }
    GetFromAPI(queue.Select(x => x.Item1)).ContinueWith((apiTask) => {
        Queue<BatchResult> batchResults = apiTask.Result;
        while (queue.Count > 0)
        {
            var batchResult = batchResults.Dequeue();
            var (queuedInput, queuedTcs) = queue.Dequeue();
            var output = MakeOutput(queuedInput, batchResult);
            queuedTcs.SetResult(output);
        }
    });
    foreach (var tcs in tcss)
    {
        yield return tcs.Task;
    }
}
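The caller then just enumerates and awaits (a sketch, assuming the same inputs collection as above):
// Sketch: each yielded task completes once its batch has come back from the API.
foreach (var outputTask in Process(inputs))
{
    var output = await outputTask;
    // use output ...
}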
I've written the following producer/consumer code, which should generate a big file filled with random data:
class Program
{
static void Main(string[] args)
{
Random random = new Random();
String filename = @"d:\test_out";
long numlines = 1000000;
var buffer = new BlockingCollection<string[]>(10); //limit to not get OOM.
int arrSize = 100; //size of each string chunk in buffer;
String[] block = new string[arrSize];
Task producer = Task.Factory.StartNew(() =>
{
long blockNum = 0;
long lineStopped = 0;
for (long i = 0; i < numlines; i++)
{
if (blockNum == arrSize)
{
buffer.Add(block);
blockNum = 0;
lineStopped = i;
}
block[blockNum] = random.Next().ToString();
//null is sign to stop if last block is not fully filled
if (blockNum < arrSize - 1)
{
block[blockNum + 1] = null;
}
blockNum++;
};
if (lineStopped < numlines)
{
buffer.Add(block);
}
buffer.CompleteAdding();
}, TaskCreationOptions.LongRunning);
Task consumer = Task.Factory.StartNew(() =>
{
using (var outputFile = new StreamWriter(filename))
{
foreach (string[] chunk in buffer.GetConsumingEnumerable())
{
foreach (string value in chunk)
{
if (value == null) break;
outputFile.WriteLine(value);
}
}
}
}, TaskCreationOptions.LongRunning);
Task.WaitAll(producer, consumer);
}
}
And it does what it is intended to do. But for some unknown reason it produces only ~550,000 strings, not 1,000,000, and I cannot understand why this is happening.
Can someone point out my mistake? I really don't get what's wrong with this code.
The buffer
String[] block = new string[arrSize];
is declared outside the Lambda. That means it is captured and re-used.
That would normally go unnoticed (you would just write out the wrong random data) but because your if (blockNum < arrSize - 1) is placed inside the for loop you regularly write a null into the shared buffer.
Exercise, instead of:
block[blockNum] = random.Next().ToString();
use
block[blockNum] = i.ToString();
and predict and verify the results.
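For reference, a sketch of one way to fix the producer (my own suggestion, not part of the exercise above): allocate a fresh array per chunk instead of re-using the single captured buffer.
// Sketch: each chunk gets its own array, so the consumer never reads a buffer
// the producer is still overwriting.
Task producer = Task.Factory.StartNew(() =>
{
    var block = new string[arrSize];
    int blockNum = 0;
    for (long i = 0; i < numlines; i++)
    {
        block[blockNum++] = random.Next().ToString();
        if (blockNum == arrSize)
        {
            buffer.Add(block);
            block = new string[arrSize]; // hand the full chunk off, start a new one
            blockNum = 0;
        }
    }
    if (blockNum > 0)
    {
        block[blockNum] = null; // keep the original "null marks the end" convention
        buffer.Add(block);
    }
    buffer.CompleteAdding();
}, TaskCreationOptions.LongRunning);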
I have the following code, which I've used up until now to compare a list of file entries to itself by hash codes:
for (int i = 0; i < fileLists.SourceFileListBefore.Count; i++) // Compare SourceFileList-Files to themselves
{
for (int n = i + 1; n < fileLists.SourceFileListBefore.Count; n++) // Don´t need to do the same comparison twice!
{
if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
{
// do Something
}
}
}
where SourceFileListBefore is a List.
I want to change this code to be able to execute in parallel on multiple cores. I thought about doing this with PLINQ, but I'm completely new to LINQ.
I tried
var duplicate = from entry in fileLists.SourceFileListBefore.AsParallel()
where fileLists.SourceFileListBefore.Any(x => (x.hash == entry.hash) && (x.targetNode.IsFile) && (entry.targetNode.IsFile))
select entry;
but it won't work like this, because I have to execute code for each pair of hash-code-matching entries. So I would at least have to get a collection of results with x + entry from LINQ, not just one entry. Is that possible with PLINQ?
Why don't you look at optimising your code first?
looking at this statement:
if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
This means you can straight away build a single list of files where IsFile == true (making the loop smaller already).
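Something along these lines (a sketch, using the field names from the question):
// Sketch: filter once, up front, instead of re-checking IsFile inside the nested loops.
var filesOnly = fileLists.SourceFileListBefore
    .Where(entry => entry.targetNode.IsFile)
    .ToList();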
secondly,
if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
Why not build a hash lookup of the hashes first?
Then iterate over your filtered list, looking each hash up in the lookup you created; if a group contains more than one entry, there is a match (the current node's hash plus some other node's hash). That way you only do work on the hashes that actually match, rather than comparing every node against every other node.
I wrote a blog post about it, which you can read at CodePERF[dot]NET - .NET Nested Loops vs Hash Lookups.
PLINQ will only slightly improve a bad solution to your problem.
Added some comparisons:
Total File Count: 16900
TargetNode.IsFile == true: 11900
Files with Duplicate Hashes = 10000 (5000 unique hashes)
Files with triplicate Hashes = 900 (300 unique hashes)
Files with Unique hashes = 1000
And the actual setup method:
[SetUp]
public void TestStup()
{
_sw = new Stopwatch();
_files = new List<File>();
int duplicateHashes = 10000;
int triplicateHashesCount = 900;
int randomCount = 1000;
int nonFileCount = 5000;
for (int i = 0; i < duplicateHashes; i++)
{
var hash = i % (duplicateHashes / 2);
_files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = true}});
}
for (int i = 0; i < triplicateHashesCount; i++)
{
var hash = int.MaxValue - 100000 - i % (triplicateHashesCount / 3);
_files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = true}});
}
for (int i = 0; i < randomCount; i++)
{
var hash = int.MaxValue - i;
_files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
}
for (int i = 0; i < nonFileCount; i++)
{
var hash = i % (nonFileCount / 2);
_files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = false}});
}
_matched = 0;
}
Then your current method:
[Test]
public void FindDuplicates()
{
_sw.Start();
for (int i = 0; i < _files.Count; i++) // Compare SourceFileList-Files to themselves
{
for (int n = i + 1; n < _files.Count; n++) // Don´t need to do the same comparison twice!
{
if (_files[i].TargetNode.IsFile && _files[n].TargetNode.IsFile)
if (_files[i].Hash == _files[n].Hash)
{
// Do Work
_matched++;
}
}
}
_sw.Stop();
}
Takes around 7.1 seconds on my machine.
Using lookup to find hashes which appear multiple times takes 21ms.
[Test]
public void FindDuplicatesHash()
{
_sw.Start();
var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
{
// Do Work for each unique hash, which appears multiple times in _files.
// If you need to do work on each pair, you will need to create pairs from duplicateFiles
// this can be an exercise for you ;-)
_matched++;
}
_sw.Stop();
}
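If you really do need to run code per pair, the exercise hinted at in the comment could look roughly like this (my sketch, not part of the measured tests):
// Sketch: expand each group of files sharing a hash into its distinct pairs.
foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
{
    var group = duplicateFiles.ToList();
    for (int i = 0; i < group.Count; i++)
        for (int n = i + 1; n < group.Count; n++)
        {
            // do something with the pair (group[i], group[n])
        }
}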
In my test, using PLINQ for counting the lookups is actually slower (as there is a large cost in dividing the lists between threads and aggregating the results back).
[Test]
public void FindDuplicatesHashParallel()
{
_sw.Start();
var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
_matched = lookup.AsParallel().Where(g => g.Count() > 1).Sum(g => 1);
_sw.Stop();
}
This took 120ms, so almost 6 times as long with my current source list.
This is the first time I'm attempting multiple threads in a project, so bear with me. The idea is this: I have a bunch of documents I need converted to PDF. I am using iTextSharp to do the conversion for me. When run iteratively, the program runs fine but slowly.
I have a list of items that need to be converted. I take that list and split it into 2 lists.
for (int i = 0; i < essaylist.Count / 2; i++)
{
frontessay.Add(essaylist[i]);
try
{
backessay.Add(essaylist[essaylist.Count - i]);
}
catch(Exception e)
{
}
}
if (essaylist.Count > 1)
{
var essay1 = new Essay();
Thread t1 = new Thread(() => essay1.StartThread(frontessay));
Thread t2 = new Thread(() => essay1.StartThread(backessay));
t1.Start();
t2.Start();
t1.Join();
t2.Join();
}
else
{
var essay1 = new Essay();
essay1.GenerateEssays(essaylist[1]);
}
I then create 2 threads that run this code
public void StartThread(List<Essay> essaylist)
{
var essay = new Essay();
List<System.Threading.Tasks.Task> tasklist = new List<System.Threading.Tasks.Task>();
int threadcount = 7;
Boolean threadcomplete = false;
int counter = 0;
for (int i = 0; i < essaylist.Count; i++)
{
essay = essaylist[i];
var task1 = System.Threading.Tasks.Task.Factory.StartNew(() => essay.GenerateEssays(essay));
tasklist.Add(task1);
counter++;
if (tasklist.Count % threadcount == 0)
{
tasklist.ForEach(t => t.Wait());
//counter = 0;
tasklist = new List<System.Threading.Tasks.Task>();
threadcomplete = true;
}
Thread.Sleep(100);
}
tasklist.ForEach(t => t.Wait());
Thread.Sleep(100);
}
For the majority of the files, the code runs as it should. However, for example, I have 155 items that need to be converted. When the program finishes and I look at the results, I have 149 items instead of 155. It seems like the result is something like total = list - threadcount; in this case that's 7. Any ideas on how to correct this? Am I even doing threads/tasks correctly?
Also, the essay.GenerateEssays code is the actual iTextSharp code that converts the info from the DB to the actual PDF.
How about using TPL? It seems that all your code can be replaced with this:
Parallel.ForEach(essaylist, essay =>
{
YourAction(essay);
});
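If you need to cap how many conversions run at once (for example because iTextSharp or the database connection doesn't like unbounded concurrency), one variation is to pass ParallelOptions. This is a sketch on my part, assuming GenerateEssays is safe to call concurrently:
// Sketch: same Parallel.ForEach, but with an explicit upper bound on concurrency.
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
Parallel.ForEach(essaylist, options, essay =>
{
    YourAction(essay); // e.g. essay1.GenerateEssays(essay)
});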
I often get strange results when passing int variables to tasks, such as in this example:
List<List<object>> ListToProcess = new List<List<object>>();
// place some lists in list to process
foreach (var temp in Foo)
ListToProcess.Add(temp);
foreach (var tempArray in ListToProcess)
{
// initialize each list in ListToProcess
}
int numberOfChunks = ListToProcess.Count;
Task[] tasks = new Task[numberOfChunks];
for (int counter = 0; counter < numberOfChunks; counter++)
{
tasks[counter] = Task.Factory.StartNew(() =>
{
// counter is always = 5 why? <---------------------------
var t = ListToProcess[counter];
});
}
How can I solve this problem?
This is known as a closure. You are not using the value of the variable; you are using the variable itself. When the code executes, it uses the value at the time of execution, not the value when the Task was defined.
To fix this issue, I believe you would do something like this:
for (int counter = 0; counter < numberOfChunks; counter++)
{
int cur = counter;
tasks[counter] = Task.Factory.StartNew(() =>
{
// cur is a per-iteration copy, so each task sees the value it was created with
var t = ListToProcess[cur];
});
}
There is no guarantee as to when the 'counter' variable in the Action block of StartNew will be accessed. What is likely to happen is that all 5 values are looped through, and the tasks are created, then the tasks are scheduled to run.
When they do run, the following is executed:
var t = ListToProcess[counter];
But at this stage counter is already equal to 5.
Perhaps you should look at parallel collections?
ListToProcess.AsParallel().ForAll(list => dosomething(list));
There are many other options around this area.
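For example (a sketch, not from the original answer), Parallel.For side-steps the capture problem entirely, because the index is passed to the lambda as a parameter:
// Sketch: the loop index arrives as a parameter, so there is no shared variable to capture.
Parallel.For(0, numberOfChunks, counter =>
{
    var t = ListToProcess[counter];
    // process t ...
});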
for (int counter = 0; counter < numberOfChunks; counter++)
{
var referenceVariable = new{val=counter};
tasks[counter] = Task.Factory.StartNew(() =>
{
var t = ListToProcess[referenceVariable.val];
});
}
Since variables are captured, you can solve this by redeclaring a new variable in each loop.
for (int counter = 0; counter < numberOfChunks; counter++)
{
int localCounter = counter;
tasks[localCounter] = Task.Factory.StartNew(() =>
{
// localCounter is a per-iteration copy, so each task sees the value it was created with
var t = ListToProcess[localCounter];
});
}