Splitting array of objects and then process it in batches


What would be a good way to call the Execute method in batches of rulesObjs? Let's say the list has more than 10,000 objects and I want to call Execute with no more than 500 at a time.
public static List<object> ExecutePolicy()
{
Policy policy = new Policy();
List<object> rules = GetRules();
object[] rulesObjs = rules.ToArray();
// Call this method with array of object, but in batches.
policy.Execute(rulesObjs);
return rulesObjs.ToList();
}
private static List<object> GetRules()
{
// get the rules via some process
return new List<object>();
}
}
public sealed class Policy
{
public void Execute(params object[] rules)
{
// Process rules...
}
}
I do not have control over the Execute() method.

List<object> rules = GetRules();
int batchSize = 500;
int currentBatch = 0;
while (currentBatch * batchSize < rules.Count)
{
object[] nextBatch = rules.Skip(currentBatch * batchSize)
.Take(batchSize).ToArray();
//use batch
currentBatch++;
}

Well, if you have control over the Execute() method, the best way would be to pass it an index so that it knows where in the array to start.
public void Execute(int startIndex, /*optional*/ int endIndex, params object[] rules)
{
// Process rules...
}
Don't worry about passing too much data at once. Behind the scenes, your array is just a pointer, so you're only passing a reference anyways.
If you don't have control over the Execute() method, then you can make a new array for your section, using Array.Copy, and process that new array.
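The Array.Copy approach mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the BatchHelper class and its Batches method are hypothetical names introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original question or answers).
public static class BatchHelper
{
    // Yields consecutive arrays of at most batchSize elements copied from source.
    public static IEnumerable<object[]> Batches(object[] source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int offset = 0; offset < source.Length; offset += batchSize)
        {
            // The last batch may be shorter than batchSize.
            int length = Math.Min(batchSize, source.Length - offset);
            object[] batch = new object[length];
            Array.Copy(source, offset, batch, 0, length);
            yield return batch;
        }
    }
}
```

With this in place, the call site would be something like: foreach (object[] batch in BatchHelper.Batches(rulesObjs, 500)) policy.Execute(batch); Each batch is copied exactly once, so the total copying work stays proportional to the array length.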

With a reference to System.Linq you can use Skip and Take:
int chunkSize = 500;
// Use the actual array length rather than a hard-coded total.
for (int i = 0; i < rulesObjs.Length; i += chunkSize)
{
    var chunk = rulesObjs.Skip(i).Take(chunkSize).ToArray();
    policy.Execute(chunk);
}

Related

Simultaneous data insertion into a list with multithreading

I'm trying to optimize a small program. So here is the basic idea:
I have an array of unfiltered data, and I want to pass it to a function that calls another function twice to filter the data and insert it into a new list. The first call takes the items from index 0 to half of the array's length, and the second takes the items from the halfway point to the last item. This way, the filtered data should be inserted into the same list simultaneously. Once the insertion is complete, the filtered list can be passed to the rest of the program. Here's the code:
static void Main(string[] args)
{
    // the unfiltered list
    int[] oldArray = new int[6] {1,2,3,4,5,6};
    // filtered list
    List<int> newList = new List<int>();
    // Functions is my static class
    Functions.Insert(newList, oldArray);
    Continue_Program_With_Filtered_List(newList);
    // remaining functions...
}
And here is the Functions class:
public static class Functions
{
public static void Insert(List<int> newList, int[] oldArray)
{
new Thread(() =>
{
Inserter(newList, oldArray, true);
}).Start();
new Thread(() =>
{
Inserter(newList, oldArray, false);
}).Start();
// I need to wait the result here of both threads
// and make sure that every item from oldArray has been filtered
// before I proceed to the next function in Main()
}
public static void Inserter(List<int> newList, int[] oldArray, bool countUp)
{
bool filterIsValid = false;
int length = oldArray.Length;
int halflen = (int)Math.Floor((decimal)length / 2);
if (countUp)
{
// from 0 to half length
for (int i = 0; i < halflen; i++)
{
// filtering conditions here to set value of filterIsValid
if(filterIsValid)
newList.Add(oldArray[i]);
}
}
else
{
// from half length to full length
for (int i = halflen + 1; i < length; i++)
{
// filtering conditions here to set value of filterIsValid
if(filterIsValid)
newList.Add(oldArray[i]);
}
}
}
}
So the problem is that I must wait for Functions.Insert() to finish both threads and process every item before newList is passed to the next function in Main().
I have no idea how to use Tasks or async methods for something like this. This is just an outline of the program, by the way. Any help?

In your case, using PLINQ may also be an option.
static void Main(string[] args)
{
// the unfiltered list
int[] oldArray = new int[6] { 1, 2, 3, 4, 5, 6 };
// filtered list
List<int> newList = oldArray.AsParallel().Where(filter).ToList();
// remaining functions...
}
You can also use AsOrdered() to preserve the order.
To come back to your initial question, here's what you can do.
Note: This solution makes minimal changes to your original code and ignores other possible optimizations.
Additional note: Keep in mind that there can still be concurrency issues, depending on what else you do with the arguments passed to that function.
public static async Task Insert(List<int> newList, int[] oldArray)
{
ConcurrentBag<int> concurrentBag = new ConcurrentBag<int>();
var task1 = Task.Factory.StartNew(() =>
{
Inserter(concurrentBag, oldArray, true);
});
var task2 = Task.Factory.StartNew(() =>
{
Inserter(concurrentBag, oldArray, false);
});
await Task.WhenAll(task1, task2);
newList.AddRange(concurrentBag);
}
public static void Inserter(ConcurrentBag<int> newList, int[] oldArray, bool countUp)
{
//Same code
}
Edit: Your second for-loop is wrong. Change it to this or you will lose one item:
for (int i = halflen; i < length; i++)
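Putting the pieces of this answer together, a self-contained version might look like the sketch below. It assumes the filter can be expressed as a Func<int, bool>; the HalfSplitFilter class, FilterAsync, and FilterRange are hypothetical names, and Task.Run is used here in place of Task.Factory.StartNew.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical names; the filter delegate is an assumption for illustration.
public static class HalfSplitFilter
{
    // Filters data on two tasks, one per half, and returns the kept items.
    // The order of the returned items is not guaranteed.
    public static async Task<List<int>> FilterAsync(int[] data, Func<int, bool> keep)
    {
        var bag = new ConcurrentBag<int>();
        int half = data.Length / 2;

        // First task covers [0, half), second covers [half, Length), so no item is skipped.
        Task first = Task.Run(() => FilterRange(data, 0, half, keep, bag));
        Task second = Task.Run(() => FilterRange(data, half, data.Length, keep, bag));

        // Completes only after both halves have been processed.
        await Task.WhenAll(first, second);
        return new List<int>(bag);
    }

    private static void FilterRange(int[] data, int start, int end,
                                    Func<int, bool> keep, ConcurrentBag<int> bag)
    {
        for (int i = start; i < end; i++)
        {
            if (keep(data[i])) bag.Add(data[i]);
        }
    }
}
```

A caller that cannot be async itself could block with FilterAsync(...).GetAwaiter().GetResult(), at the cost of tying up the calling thread until both tasks finish.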

How to avoid reaching Thread pool's max in a recursive algorithm?

I have a recursive algorithm that creates two new threads on each call. My tests crash with the following error when the array's length is 100k. I believe this is exceeding the thread pool's limit and/or running out of memory.
The question now is: how can I redesign the algorithm so that it does not crash?
Test Authoring and Execution Framework: TE.ProcessHost.Managed[v5.3-1509] has stopped working
public class MyParamObj
{
public int[] MyArray;
public int MyIntOne;
public int MyIntTwo;
public static MyParamObj Create( int[] myArray, int myIntOne, int myIntTwo )
{
return new MyParamObj
{
MyArray = myArray,
MyIntOne = myIntOne,
MyIntTwo = myIntTwo
};
}
}
public class MyClass
{
public void Entry( int[] myArray)
{
int intOne = 0;
int intTwo = myArray.Length;
MyDelegate( MyParamObj.Create( myArray, intOne, intTwo) );
}
private void MyDelegate( object paramaObject )
{
var parameters = paramaObject as MyParamObj;
if (parameters == null) throw new ArgumentNullException(nameof(parameters));
// just sample values
int intOne = 0;
int intTwo = 0;
int intThree = 0;
int intFour = 0;
var threadOneParams = MyParamObj.Create( parameters.MyArray, intOne, intTwo );
var threadTwoParams = MyParamObj.Create( parameters.MyArray, intThree, intFour );
var threads = new Thread[2];
threads[0] = new Thread( MyDelegate );
threads[1] = new Thread( MyDelegate );
threads[0].Start(threadOneParams);
threads[1].Start(threadTwoParams);
threads[0].Join();
threads[1].Join();
//reorder elements within the array
}
}
Update 3
I think I am exceeding the thread pool's max limit. When I have an array that's 100k long and I recurse down to a size of 1, I'll have 100k+ threads, I think (see "Maximum number of threads in a .NET app?"). So I guess my next question is: how do I do this operation recursively without exceeding the thread limit?

It's hard to tell since you left out huge chunks of code, but it looks like with the line threads[0] = new Thread( MyDelegate ); you have a function that is creating new threads of itself. Your program will explode as it recursively creates threads until it runs out of memory.
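One common way to keep the recursion from creating a thread per call is to go parallel only for large ranges and fall back to sequential work below a cutoff. The sketch below illustrates that idea with a merge sort, since the question's actual reordering step is not shown; BoundedParallelSort, its method names, and the cutoff of 4096 are all assumptions. Parallel.Invoke schedules work on the thread pool rather than creating dedicated threads.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical example: merge sort stands in for the question's unknown reordering step.
public static class BoundedParallelSort
{
    // Ranges at or below this size are handled on the current thread (assumed value; tune by measuring).
    private const int SequentialThreshold = 4096;

    public static void Sort(int[] array)
    {
        if (array == null) throw new ArgumentNullException(nameof(array));
        Sort(array, 0, array.Length);
    }

    // Sorts the half-open range [low, high).
    private static void Sort(int[] array, int low, int high)
    {
        int length = high - low;
        if (length <= SequentialThreshold)
        {
            Array.Sort(array, low, length);
            return;
        }

        // Splitting at the midpoint keeps the recursion depth logarithmic.
        int mid = low + length / 2;
        Parallel.Invoke(
            () => Sort(array, low, mid),
            () => Sort(array, mid, high));
        Merge(array, low, mid, high);
    }

    // Merges the sorted ranges [low, mid) and [mid, high) in place using a copy of the left range.
    private static void Merge(int[] array, int low, int mid, int high)
    {
        int[] left = new int[mid - low];
        Array.Copy(array, low, left, 0, left.Length);

        int i = 0, j = mid, k = low;
        while (i < left.Length && j < high)
        {
            array[k++] = left[i] <= array[j] ? left[i++] : array[j++];
        }
        // Any remaining right-hand items are already in their final positions.
        while (i < left.Length) array[k++] = left[i++];
    }
}
```

With a 100k-element array and this cutoff, the recursion only goes about five levels deep before switching to sequential sorting, so the number of concurrent work items stays small.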

Using threads to parse multiple Html pages faster

Here's what I'm trying to do:
Get one html page from url which contains multiple links inside
Visit each link
Extract some data from visited link and create object using it
So far, all I have done is the simple, slow way:
public List<Link> searchLinks(string name)
{
List<Link> foundLinks = new List<Link>();
// getHtmlDocument() just returns HtmlDocument using input url.
HtmlDocument doc = getHtmlDocument(AU_SEARCH_URL + fixSpaces(name));
var link_list = doc.DocumentNode.SelectNodes(@"/html/body/div[@id='parent-container']/div[@id='main-content']/ol[@id='searchresult']/li/h2/a");
foreach (var link in link_list)
{
// TODO Threads
// getObject() creates object using data gathered
foundLinks.Add(getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value)));
}
return foundLinks;
}
To make it faster, I need to use threads, but I'm not sure how I should approach it. I can't just start threads at random; I need to wait for them to finish. thread.Join() more or less solves the "wait for threads to finish" problem, but I think it stops being fast, because each thread would only be launched after the earlier one has finished.

The simplest way to offload the work to multiple threads would be to use Parallel.ForEach() in place of your current loop. Something like this:
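Parallel.ForEach(link_list, link =>
{
    var found = getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value));
    // List<T> is not safe for concurrent writes, so serialize the Add calls.
    lock (foundLinks) { foundLinks.Add(found); }
});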
I'm not sure if there are other threading concerns in your overall code. (Note, for example, that this would no longer guarantee that the data would be added to foundLinks in the same order.) But as long as there's nothing explicitly preventing concurrent work from taking place then this would take advantage of threading over multiple CPU cores to process the work.
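If the original order matters, one option is to write each result into a preallocated array slot by index instead of appending to a shared list. The following is a minimal sketch of that idea; OrderedParallel and Map are hypothetical names, and it assumes the inputs are available as an IList<T> that is not modified while the loop runs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper (not part of the original answer).
public static class OrderedParallel
{
    // Applies transform to every input in parallel and returns the results in input order.
    public static TResult[] Map<TInput, TResult>(IList<TInput> inputs, Func<TInput, TResult> transform)
    {
        if (inputs == null) throw new ArgumentNullException(nameof(inputs));
        if (transform == null) throw new ArgumentNullException(nameof(transform));

        var results = new TResult[inputs.Count];
        // Each iteration writes only to its own slot, so no lock is needed.
        Parallel.For(0, inputs.Count, i =>
        {
            results[i] = transform(inputs[i]);
        });
        return results;
    }
}
```

Because Parallel.For returns only after every iteration has finished, the caller gets a fully populated array without any explicit Join or wait logic.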

Maybe you should use the thread pool.
Example from MSDN:
using System;
using System.Threading;
public class Fibonacci
{
private int _n;
private int _fibOfN;
private ManualResetEvent _doneEvent;
public int N { get { return _n; } }
public int FibOfN { get { return _fibOfN; } }
// Constructor.
public Fibonacci(int n, ManualResetEvent doneEvent)
{
_n = n;
_doneEvent = doneEvent;
}
// Wrapper method for use with thread pool.
public void ThreadPoolCallback(Object threadContext)
{
int threadIndex = (int)threadContext;
Console.WriteLine("thread {0} started...", threadIndex);
_fibOfN = Calculate(_n);
Console.WriteLine("thread {0} result calculated...", threadIndex);
_doneEvent.Set();
}
// Recursive method that calculates the Nth Fibonacci number.
public int Calculate(int n)
{
if (n <= 1)
{
return n;
}
return Calculate(n - 1) + Calculate(n - 2);
}
}
public class ThreadPoolExample
{
static void Main()
{
const int FibonacciCalculations = 10;
// One event is used for each Fibonacci object.
ManualResetEvent[] doneEvents = new ManualResetEvent[FibonacciCalculations];
Fibonacci[] fibArray = new Fibonacci[FibonacciCalculations];
Random r = new Random();
// Configure and start threads using ThreadPool.
Console.WriteLine("launching {0} tasks...", FibonacciCalculations);
for (int i = 0; i < FibonacciCalculations; i++)
{
doneEvents[i] = new ManualResetEvent(false);
Fibonacci f = new Fibonacci(r.Next(20, 40), doneEvents[i]);
fibArray[i] = f;
ThreadPool.QueueUserWorkItem(f.ThreadPoolCallback, i);
}
// Wait for all threads in pool to calculate.
WaitHandle.WaitAll(doneEvents);
Console.WriteLine("All calculations are complete.");
// Display the results.
for (int i= 0; i<FibonacciCalculations; i++)
{
Fibonacci f = fibArray[i];
Console.WriteLine("Fibonacci({0}) = {1}", f.N, f.FibOfN);
}
}
}

Sorting a Queue

I have to simulate a process scheduler using the SRTN algorithm, and I'm having trouble with one part.
I have a queue of a custom class called 'Process' that I need to sort based on a field called 'last_prediction'. My code works most of the time, but if you look at time:19 of my output, the order of the ready queue is wrong (it should be: 1004(1.5) 1002(2) 1003(2)).
Here is my code:
int count = ReadyQueue.Count;
// Copy Queue into Vector
ArrayList temp = new ArrayList();
for (int i = 0; i < count; i++)
{
Process p = (Process)ReadyQueue.Dequeue();
temp.Add(p);
}
// Sort Vector
for (int i = 0; i < count; i++)
{
double min = ((Process)temp[i]).last_prediction;
for (int j=i+1; j<count; j++)
{
if ( ((Process)temp[j]).last_prediction < min )
{
min = ((Process)temp[j]).last_prediction;
Process dummy = (Process)temp[j];
temp[j] = temp[i];
temp[i] = dummy;
}
}
}
// Copy Vector back into Queue
for (int i = 0; i < count; i++)
{
Process p = (Process)temp[i];
ReadyQueue.Enqueue(p);
}
EDIT: OK, I'm trying to use IComparer, similar to what you gave, hughdbrown. Now I get a different error:
public class Process
{
public int process_id;
public int arrival_time;
public int total_time;
public int avg_burst;
public int actual_burst;
public int last_burst; // SRTN
public double last_prediction; // SRTN
public int io_delay;
public int context_switch_delay;
public class ProcessSort : IComparer
{
public int Compare(object x, object y)
{
var a = x as Process;
var b = y as Process;
double aNum = a.last_prediction;
double bNum = b.last_prediction;
return Compare(aNum, bNum);
}
}
}
This is the error I get now:
Unhandled Exception: System.InvalidOperationException: Failed to compare two elements in the array. ---> System.NullReferenceException: Object reference not set to an instance of an object.

I would use a real sorting routine on this array, not a hand-crafted insertion/bubble sort. Add a comparison function to your object.
I'd also use a generic collection, such as List<Process>, rather than ArrayList. You might be interested in using this C# PriorityQueue code from my website. It has Queue semantics and maintains items in sorted order.
Later: Your IComparable code would be something like this:
public class Process : IComparable
{
double last_prediction; // double, to match the field in the question
public int CompareTo(object obj)
{
Process right = obj as Process;
return this.last_prediction.CompareTo(right.last_prediction);
}
}
Later still: here is a complete test program that has a sortable Process. Tested in Mono on Ubuntu.
using System;
using System.Collections.Generic;
using System.Text;
namespace Comparer
{
public class Process : IComparable
{
int last_prediction;
public Process(int p)
{
this.last_prediction = p;
}
public int CompareTo(object obj)
{
Process right = obj as Process;
return this.last_prediction.CompareTo(right.last_prediction);
}
public int Prediction { get { return this.last_prediction; } }
}
class MainClass
{
public static void Main (string[] args)
{
List<Process> list = new List<Process>();
for (int i = 0; i < 10; i++)
list.Add(new Process(10 - i));
System.Console.WriteLine("Current values:");
foreach (Process p in list)
System.Console.WriteLine("Process {0}", p.Prediction);
list.Sort();
System.Console.WriteLine("Sorted values:");
foreach (Process p in list)
System.Console.WriteLine("Process {0}", p.Prediction);
}
}
}
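For the IComparer variant attempted in the question's edit, the likely cause of the NullReferenceException is that Compare(aNum, bNum) inside ProcessSort resolves to the same Compare(object, object) method: the two doubles are boxed, the "as Process" casts return null, and a.last_prediction then throws. A sketch of a comparer that avoids this by calling double.CompareTo directly is shown below; the trimmed-down Process class and the ProcessPredictionComparer name are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down stand-in for the question's Process class (only the fields used here).
public class Process
{
    public int process_id;
    public double last_prediction;
}

// Hypothetical comparer name; orders processes by last_prediction, nulls first.
public class ProcessPredictionComparer : IComparer<Process>
{
    public int Compare(Process x, Process y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x == null) return -1;
        if (y == null) return 1;
        // Compare the doubles directly instead of calling Compare again.
        return x.last_prediction.CompareTo(y.last_prediction);
    }
}
```

It would be used as list.Sort(new ProcessPredictionComparer()) on a List<Process>.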

Have you considered using the ArrayList.Sort method instead of attempting to write your own sort?

Here's how I would sort the Process objects. Let's use a List<Process> instead of an ArrayList so that we don't have to keep casting it back and forth. I haven't done much with queues in C# so I'm afraid I can't help much with those. And please note that this code is untested. :)
int count = ReadyQueue.Count;
// Copy Queue into Vector
List<Process> listProcesses = new List<Process>();
for(int i = 0; i < count; i++)
{
Process p = (Process)ReadyQueue.Dequeue();
listProcesses.Add(p);
}
// Sort Vector
listProcesses.Sort(CompareProcessesByPrediction);
// Copy Vector back into Queue
foreach(Process p in listProcesses)
ReadyQueue.Enqueue(p);
private static int CompareProcessesByPrediction(Process proc1, Process proc2)
{
    // If both are non-null, compare their predictions.
    // Otherwise a null reference sorts before a non-null one.
    if (proc1 == null)
        return proc2 == null ? 0 : -1;
    else
        return proc2 == null ? 1 : proc1.last_prediction.CompareTo(proc2.last_prediction);
}
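One detail that may matter for the expected output in the question, where two processes share the same prediction: List<T>.Sort is documented as an unstable sort, so items with equal keys can change their relative order, whereas LINQ's OrderBy is stable. A minimal sketch that rebuilds a generic queue with a stable sort follows; QueueSorting and SortInPlace are hypothetical names, and it assumes a Queue<T> rather than the non-generic Queue.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the original answers).
public static class QueueSorting
{
    // Rebuilds the queue in ascending key order; items with equal keys keep their relative order.
    public static void SortInPlace<T>(Queue<T> queue, Func<T, double> key)
    {
        if (queue == null) throw new ArgumentNullException(nameof(queue));
        if (key == null) throw new ArgumentNullException(nameof(key));

        // ToList materializes the sorted sequence before the queue is cleared.
        List<T> ordered = queue.OrderBy(key).ToList();
        queue.Clear();
        foreach (T item in ordered)
        {
            queue.Enqueue(item);
        }
    }
}
```

For the scheduler this would be called as QueueSorting.SortInPlace(readyQueue, p => p.last_prediction), assuming readyQueue is a Queue<Process>.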

Yes, use ArrayList.Sort.
If your array contains only numbers, create a new numeric array, because ArrayList.Sort compares strings as text rather than as numbers.
Then sort it with ArrayList.Sort,
take the number at the position you want, and convert it back to a string if needed.

How to generate random values that don't keep giving me the same value in 'runs' of numbers?

Hi I coded this OneAtRandom() extension method:
public static class GenericIListExtensions
{
public static T OneAtRandom<T>(this IList<T> list)
{
list.ThrowIfNull("list");
if (list.Count == 0)
throw new ArgumentException("OneAtRandom() cannot be called on 'list' with 0 elements");
int randomIdx = new Random().Next(list.Count);
return list[randomIdx];
}
}
Testing it using this unit test fails:
[Test]
public void ShouldNotAlwaysReturnTheSameValueIfOneAtRandomCalledOnListOfLengthTwo()
{
int SomeStatisticallyUnlikelyNumberOf = 100;
IList<string> list = new List<string>() { FirstString, SecondString };
bool firstStringFound = false;
bool secondStringFound = false;
SomeStatisticallyUnlikelyNumberOf.Times(() =>
{
string theString = list.OneAtRandom();
if (theString == FirstString) firstStringFound = true;
if (theString == SecondString) secondStringFound = true;
});
Assert.That(firstStringFound && secondStringFound);
}
It seems that int randomIdx = new Random().Next(list.Count); is generating the same number 100 times in a row, possibly because the seed is based on the time?
How can I get this to work properly?
Thanks :)

You shouldn't be calling new Random() on every call. On the .NET Framework the parameterless constructor seeds the generator from the system clock, so instances created in quick succession get the same seed and produce the same sequence of numbers. Create one Random object at the start of your application and pass it into your function as a parameter.
public static class GenericIListExtensions
{
public static T OneAtRandom<T>(this IList<T> list, Random random)
{
list.ThrowIfNull("list");
if (list.Count == 0)
throw new ArgumentException("OneAtRandom() cannot be called on 'list' with 0 elements");
int randomIdx = random.Next(list.Count);
return list[randomIdx];
}
}
This also has the advantage of making your code more testable as you can pass in a Random that is seeded to a value of your choice so that your tests are repeatable.

It's generating the same number 100 times because you create a new generator on every call, and instances created in quick succession can start from the same default seed.
Move the "new Random()" to the constructor or a static var, and use the generated object.
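The static-variable suggestion could be sketched as follows. This is an illustration rather than the answerer's code: the RandomPick class name is hypothetical, the null check replaces the question's ThrowIfNull extension (which is not shown), and the lock is added on the assumption that the method may be called from several threads, since Random instances are not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class name; mirrors the question's OneAtRandom extension method.
public static class RandomPick
{
    // One generator shared by all calls, created once.
    private static readonly Random Shared = new Random();
    private static readonly object Gate = new object();

    public static T OneAtRandom<T>(this IList<T> list)
    {
        if (list == null) throw new ArgumentNullException(nameof(list));
        if (list.Count == 0)
            throw new ArgumentException("OneAtRandom() cannot be called on 'list' with 0 elements", nameof(list));

        int index;
        // Random is not thread-safe, so guard access to the shared instance.
        lock (Gate)
        {
            index = Shared.Next(list.Count);
        }
        return list[index];
    }
}
```

The trade-off compared with passing a Random in as a parameter is that tests can no longer supply a seeded generator for repeatable results.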

You could use a seed based on the current time to create the instance of Random. A sample on MSDN uses the following code:
int randomInstancesToCreate = 4;
Random[] randomEngines = new Random[randomInstancesToCreate];
for (int ctr = 0; ctr < randomInstancesToCreate; ctr++)
{
randomEngines[ctr] = new Random(unchecked((int) (DateTime.Now.Ticks >> ctr)));
}
