My console app reads a huge volume of data from text files and saves it to a DB. To do this, I store the data in a DataTable, and I want to dump this DataTable to the DB every 5 minutes. (If I dump all the data at once, I have to fill the DataTable with the whole data set, and in that case I get an OutOfMemoryException.)
public void ProcessData()
{
string[] files=File.ReadAllLines(path);
foreach(var item in files)
{
DataRow dtRow= dataTable.NewRow();
dtRow["ID"]= .... //some code here;
dtRow["Name"]= .... //some code here;
dtRow["Age"]= .... //some code here;
var timer = new Timer(v => SaveData(), null, 0, 5*60*1000);
}
}
public void SaveData(string tableName, DataTable dataTable )
{
//Some code Here
//After dumping data to DB, clear DataTable
dataTable.Rows.Clear();
}
What I want is for the code to keep filling the DataTable and to call the SaveData() method every 5 minutes. This should continue until all files have been processed.
However, I have seen that when SaveData() is called, it executes 4-5 times. Sometimes it is not called every 5 minutes.
I am not sure how to proceed here. How can I fix this? Can any other approach be used? Any help is appreciated.
Is it essential that you read each text file in completely with ReadAllLines? This will consume a large amount of memory. Why not read x lines from a file, save them to the database, then continue until the end of the file is reached?
Your biggest problem is instantiating new Timer instances in your foreach. New Timer objects in every foreach call mean multiple threads calling SaveData concurrently, meaning dataTable being processed and saved to the database multiple times concurrently, possibly (and likely) before rows are cleared, thus duplicating much of your file into the database.
Before I provide a solution to the question as asked, I wanted to point out that saving data in a 5 minute interval has a distinct code smell to it. As has been pointed out, I would suggest some approach that loads and saves data based on some data size rather than an arbitrary time interval. That said, I will go ahead and address your question on the assumption that there is a reason you must go with 5 minute interval save.
First, we need to set up our Timer correctly, which you'll notice I create outside of the foreach loop. A Timer keeps firing on its interval; it does not just wait and execute once.
Second, we have to take steps to ensure thread-safe data integrity on our intermediate data store (in your case you used DataTable, but I am using a List of a custom class, because DataTable is too costly for what we want to do). You'll notice I accomplish this by locking before updates to our List.
Updates to your data processing class:
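private volatile bool isComplete = false; // written by ProcessData, read on the timer thread
private readonly object DataStoreLock = new object();
private List<MyCustomClass> myDataStore = new List<MyCustomClass>(); // must exist before the first Add
private Timer myTimer;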
public void ProcessData()
{
myTimer = new Timer(SaveData, null, TimeSpan.Zero, TimeSpan.FromMinutes(5.0));
foreach (var item in File.ReadLines(path))
{
var myData = new MyCustomClass()
{
ID = 0, // Some code here
Name = "Some code here",
Age = 0 // Some code here
};
lock (DataStoreLock)
{
myDataStore.Add(myData);
}
}
isComplete = true;
}
public void SaveData(object arg)
{
// Our first step is to check if timed work is done.
if (isComplete)
{
myTimer.Dispose();
myTimer = null;
}
// Our next step is to create a local instance of the data store to work on, which
// allows ProcessData to continue populating while our DB actions are being performed.
List<MyCustomClass> lDataStore;
lock (DataStoreLock)
{
lDataStore = myDataStore;
myDataStore = new List<MyCustomClass>();
}
//Some code DB code here.
}
EDIT: I've changed the enumeration to go through ReadLines rather than ReadAllLines. Read Remarks under the ReadLines method on MSDN. ReadAllLines will be a blocking call, while ReadLines will allow enumeration to be processed while reading the file. I can't imagine a scenario otherwise where your foreach would be running for more than 5 minutes if the file had been read all to memory already.
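The swap-under-lock step in SaveData can be pulled out into a small helper. This is only a minimal sketch of that idea, not the exact code above; the SwapBuffer<T> name and its Add/Drain methods are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical helper: the reading loop calls Add, the timer callback calls Drain.
public sealed class SwapBuffer<T>
{
    private readonly object _gate = new object();
    private List<T> _items = new List<T>();

    public void Add(T item)
    {
        lock (_gate)
        {
            _items.Add(item);
        }
    }

    // Hands back everything collected so far and starts a fresh list,
    // so the lock is held only for the swap, not for the database write.
    public List<T> Drain()
    {
        lock (_gate)
        {
            List<T> drained = _items;
            _items = new List<T>();
            return drained;
        }
    }
}
```

With something like this, the timer callback would call Drain() and write the returned list to the database while the reading loop keeps calling Add().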
Here is a suggestion for how to implement the code, combined with the suggestion from the other answer:
public void ProcessData()
{
int i = 1;
foreach(var item in File.ReadLines(path)) //This line has been edited
{
DataRow dtRow= dataTable.NewRow();
dtRow["ID"]= .... //some code here;
dtRow["Name"]= .... //some code here;
dtRow["Age"]= .... //some code here;
dataTable.Rows.Add(dtRow); // NewRow() alone does not add the row to the table
if (i%25 == 0) //you can change the 25 here to something else
{
SaveData(/* table name */, /* dataTable */);
}
i++;
}
SaveData(/* table name */, /* dataTable */);
}
public void SaveData(string tableName, DataTable dataTable )
{
//Some code Here
//After dumping data to DB, clear DataTable
dataTable.Rows.Clear();
}
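If you prefer to keep the counting out of the loop body, the same size-based batching can be written as a reusable iterator. This is a sketch under my own naming (InBatches is a hypothetical extension method, not a framework API):

```csharp
using System;
using System.Collections.Generic;

public static class BatchExtensions
{
    // Hypothetical helper: yields lists of up to batchSize items, with a
    // final shorter list for any remainder.
    public static IEnumerable<List<T>> InBatches<T>(this IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return Iterate(source, batchSize);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // fresh list, the caller may keep the old one
            }
        }
        if (batch.Count > 0)
        {
            yield return batch; // leftover rows that did not fill a whole batch
        }
    }
}
```

The loop would then look roughly like foreach (var batch in File.ReadLines(path).InBatches(25)), filling the DataTable from each batch and calling SaveData once per batch. Because ReadLines is lazy, only one batch of lines needs to be in memory at a time.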
I have two methods as below
private void MethodB_GetId()
{
//Calling Method A continuously in different threads
//Let's say its calling for Id = 1 to 100
}
private void MethodA_GetAll()
{
List<string> lst;
lock(_locker)
{
lst = SomeService.Get(); //This Get returns all 100 ids in one shot.
//Some other processing and then return result.
}
}
Now the client calls MethodB_GetId continuously to fetch data for ids 1 to 100, in random order. (It requires the data for some of these 100 ids, not all of it.)
MethodA_GetAll gets all the data from the network (maybe a cache or a database) in one shot and returns the whole collection to method B, which then extracts the record it is interested in.
If MethodA_GetAll() makes the GetAll() call multiple times, fetching the same records again is useless. So I can put a lock around it: while one thread is fetching the records, the others are blocked.
Let's say MethodA_GetAll, called for Id = 1, acquires the lock and all the others wait for the lock to be released.
What I want is that once the data has been made available by any one thread, no other thread makes the call again.
Solution option:
1. Make List global to that class and thread safe. (I don't have that option)
I somehow need thread 1 to tell all the other threads that it already has the records, so they should not fetch them again.
Something like:
lock(_locker && Lst!=null) //Note: here lst is local to every thread
{
//If this satisfy then only fetch records
}
Please excuse the poorly framed question. I posted this in a bit of a hurry.
It sounds like you want to create a threadsafe cache. One way to do this is to use Lazy<T>.
Here's an example for a cache of type List<string>:
public sealed class DataProvider
{
public DataProvider()
{
_cache = new Lazy<List<string>>(createCache);
}
public void DoSomethingThatNeedsCachedList()
{
var list = _cache.Value;
// Do something with list.
Console.WriteLine(list[10]);
}
readonly Lazy<List<string>> _cache;
List<string> createCache()
{
// Dummy implementation.
return Enumerable.Range(1, 100).Select(x => x.ToString()).ToList();
}
}
When you need to access the cached value, you just access _cache.Value. If it hasn't yet been created, then the method you passed to the Lazy<T>'s constructor will be called to initialise it. In the example above, this is the createCache() method.
This is done in a threadsafe manner, so that if two threads try to access the cached value simultaneously when it hasn't been created yet, one of the threads will actually end up calling createCache() and the other thread will be blocked until the cached value has been initialised.
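To see that behaviour for yourself, here is a small sketch that reads a Lazy<T> from many threads and counts how often the factory runs. The LazyOnceDemo class and the 50 ms sleep are just illustration; the thread-safety mode is spelled out explicitly, although it is also what the Lazy<T>(Func<T>) constructor uses by default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LazyOnceDemo
{
    // Runs many concurrent reads of a Lazy<T> and returns how many times
    // the factory was invoked.
    public static int CountFactoryCalls(int readers)
    {
        int calls = 0;
        var cache = new Lazy<List<string>>(() =>
        {
            Interlocked.Increment(ref calls);
            Thread.Sleep(50); // simulate a slow fetch
            return Enumerable.Range(1, 100).Select(x => x.ToString()).ToList();
        }, LazyThreadSafetyMode.ExecutionAndPublication);

        Parallel.For(0, readers, _ =>
        {
            List<string> list = cache.Value; // blocks until the value is initialised
            if (list.Count != 100) throw new InvalidOperationException("cache not initialised");
        });

        return calls;
    }
}
```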
You can try double-checked locking on lst:
private volatile List<string> lst; // volatile so other threads see the published list
private void MethodA_GetAll()
{
if (lst == null)
{
lock (_locker)
{
if (lst == null)
{
// do your thing
}
}
}
}
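Filled in, the pattern could look something like the sketch below. The IdCache class and the fetch delegate are hypothetical stand-ins (the delegate plays the role of SomeService.Get), and it assumes the list can live in a field, which the question says may not be possible.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdCache
{
    private readonly object _locker = new object();
    private readonly Func<List<string>> _fetch; // hypothetical stand-in for SomeService.Get
    private volatile List<string> _lst;

    public IdCache(Func<List<string>> fetch)
    {
        _fetch = fetch ?? throw new ArgumentNullException(nameof(fetch));
    }

    public List<string> GetAll()
    {
        List<string> local = _lst;      // first check, without taking the lock
        if (local == null)
        {
            lock (_locker)
            {
                local = _lst;           // second check, now under the lock
                if (local == null)
                {
                    local = _fetch();   // only the first thread gets here
                    _lst = local;       // publish for everyone else
                }
            }
        }
        return local;
    }
}
```

Threads that arrive while the first fetch is running wait on the lock, then find the list already populated and return it without fetching again.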
So I'm running a Parallel.ForEach that basically generates a bunch of data which is ultimately going to be saved to a database. However, since the collection of data can get quite large, I need to be able to occasionally save/clear the collection so as to not run into an OutOfMemoryException.
I'm new to using Parallel.ForEach, concurrent collections, and locks, so I'm a little fuzzy on what exactly needs to be done to make sure everything works correctly (i.e. we don't get any records added to the collection between the Save and Clear operations).
Currently I'm saying, if the record count is above a certain threshold, save the data in the current collection, within a lock block.
ConcurrentStack<OutRecord> OutRecs = new ConcurrentStack<OutRecord>();
object StackLock = new object();
Parallel.ForEach(inputrecords, input =>
{
lock(StackLock)
{
if (OutRecs.Count >= 50000)
{
Save(OutRecs);
OutRecs.Clear();
}
}
OutRecs.Push(CreateOutputRecord(input));
});
if (OutRecs.Count > 0) Save(OutRecs);
I'm not 100% certain whether or not this works the way I think it does. Does the lock stop other instances of the loop from writing to output collection? If not is there a better way to do this?
Your lock will work correctly, but it will not be very efficient, because all your worker threads will be forced to pause for the entire duration of each save operation. Also, locks tend to be (relatively) expensive, so taking a lock in each iteration of each thread is a bit wasteful.
One of your comments mentioned giving each worker thread its own data storage: yes, you can do this. Here's an example that you could tailor to your needs:
Parallel.ForEach(
// collection of objects to iterate over
inputrecords,
// delegate to initialize thread-local data
() => new List<OutRecord>(),
// body of loop
(inputrecord, loopstate, localstorage) =>
{
localstorage.Add(CreateOutputRecord(inputrecord));
if (localstorage.Count > 1000)
{
// Save() must be thread-safe, or you'll need to wrap it in a lock
Save(localstorage);
localstorage.Clear();
}
return localstorage;
},
// finally block gets executed after each thread exits
localstorage =>
{
if (localstorage.Count > 0)
{
// Save() must be thread-safe, or you'll need to wrap it in a lock
Save(localstorage);
localstorage.Clear();
}
});
One approach is to define an abstraction that represents the destination for your data. It could be something like this:
public interface IRecordWriter<T> // perhaps come up with a better name.
{
void WriteRecord(T record);
void Flush();
}
Your class that processes the records in parallel doesn't need to worry about how those records are handled or what happens when there are too many of them. The implementation of IRecordWriter handles all those details, making your other class easier to test.
An implementation of IRecordWriter could look something like this:
public abstract class BufferedRecordWriter<T> : IRecordWriter<T>
{
private readonly ConcurrentQueue<T> _buffer = new ConcurrentQueue<T>();
private readonly int _maxCapacity;
private bool _flushing;
protected BufferedRecordWriter(int maxCapacity = 100) // constructor name must match the class
{
_maxCapacity = maxCapacity;
}
public void WriteRecord(T record)
{
_buffer.Enqueue(record);
if (_buffer.Count >= _maxCapacity && !_flushing)
Flush();
}
public void Flush()
{
_flushing = true;
try
{
var recordsToWrite = new List<T>();
while (_buffer.TryDequeue(out T dequeued))
{
recordsToWrite.Add(dequeued);
}
if(recordsToWrite.Any())
WriteRecords(recordsToWrite);
}
finally
{
_flushing = false;
}
}
protected abstract void WriteRecords(IEnumerable<T> records);
}
When the buffer reaches the maximum size, all the records in it are sent to WriteRecords. Because _buffer is a ConcurrentQueue it can keep reading records even as they are added.
The WriteRecords method can be anything specific to how you write your records. Instead of this being an abstract class, the actual output to a database or file could be yet another dependency that gets injected into this one. You can make decisions like that, refactor, and change your mind, because the very first class isn't affected by those changes. All it knows about is the IRecordWriter interface, which doesn't change.
You might notice that I haven't made absolutely certain that Flush won't execute concurrently on different threads. I could put more locking around this, but it really doesn't matter. This will avoid most concurrent executions, but it's okay if concurrent executions both read from the ConcurrentQueue.
This is just a rough outline, but it shows how all of the steps become simpler and easier to test if we separate them. One class converts inputs to outputs. Another class buffers the outputs and writes them. That second class can even be split into two - one as a buffer, and another as the "final" writer that sends them to a database or file or some other destination.
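As an illustration of the testing point, here is a sketch of an in-memory writer that could stand in for the real one in a unit test. InMemoryRecordWriter and RecordProcessor are hypothetical names of mine; the interface is repeated only so the snippet compiles on its own.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Repeated from above so the sketch compiles on its own.
public interface IRecordWriter<T>
{
    void WriteRecord(T record);
    void Flush();
}

// Hypothetical test double: keeps records in memory instead of a database.
public sealed class InMemoryRecordWriter<T> : IRecordWriter<T>
{
    private readonly object _gate = new object();
    private readonly List<T> _pending = new List<T>();
    private readonly List<T> _written = new List<T>();

    public void WriteRecord(T record)
    {
        lock (_gate) { _pending.Add(record); }
    }

    public void Flush()
    {
        lock (_gate)
        {
            _written.AddRange(_pending);
            _pending.Clear();
        }
    }

    public int WrittenCount
    {
        get { lock (_gate) { return _written.Count; } }
    }
}

// Hypothetical processor: it only knows about IRecordWriter<T>.
public static class RecordProcessor
{
    public static void Process(IEnumerable<int> inputs, IRecordWriter<string> writer)
    {
        Parallel.ForEach(inputs, input => writer.WriteRecord("record " + input));
        writer.Flush(); // write whatever is still buffered at the end
    }
}
```

A test can pass an InMemoryRecordWriter<string> to Process and check WrittenCount afterwards, without touching a database.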
I have a function that updates the database every second (as data arrives continuously over the network). I wanted to put a delay on that update function so that it updates the database table every 5 minutes.
Here is my code:
if (ip==StrIp)
{
Task.Delay(300000).ContinueWith(_=>
{ //I'm Using Task.Delay to make delay
var res= from i in dc.Pins //LINQ Query
where i.ip== ip
select i;
foreach (var p in res)
{
p.time= System.DateTime.Now;
p.temperature= temp;
.
. //some other values
.
}
datacontext.SubmitChanges();
});
}
It is working and updates the data every 5 minutes. Now I want the data to be updated immediately the first time, when the application starts, and every 5 minutes after that. Right now my code isn't doing that.
How can I make a delay that is skipped the first time but applied to the subsequent data iterations?
Thanks in advance.
You could use a flag to determine whether it is the first time your method is called, e.g.:
private uint _counter = 0;
public void YourMethod()
{
if (ip == StrIp)
{
Action<Task> action = _ =>
{
var res = from i in dc.Pins //LINQ Query
where i.ip == ip
select i;
//...
datacontext.SubmitChanges();
};
if (_counter++ == 0)
action(null); // first call: run immediately; the Task argument is not used
else
Task.Delay(300000).ContinueWith(action);
}
}
Extract the inner logic of the task into a function/method (the refactoring tools of VS or R# can do this automatically) and call the new function/method at start and on the interval.
I personally would go in another direction:
Have an in-memory queue that gets filled with data as it comes into your app. Then I would have a thread/task etc. which checks the queue every 5 minutes and updates the database accordingly. Remember to lock the queue for updates (concurrency). The ConcurrentQueue of .NET is one way to do it.
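A rough sketch of that queue idea, assuming a System.Threading.Timer does the periodic check. The PeriodicFlusher<T> class and the save callback are hypothetical; the callback stands in for your database update. A due time of zero makes the first check run right away and the later ones once per period.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public sealed class PeriodicFlusher<T> : IDisposable
{
    private readonly ConcurrentQueue<T> _queue = new ConcurrentQueue<T>();
    private readonly Action<IReadOnlyList<T>> _save; // hypothetical DB update callback
    private readonly Timer _timer;
    private int _flushing; // 0 = idle, 1 = a flush is running

    public PeriodicFlusher(Action<IReadOnlyList<T>> save, TimeSpan period)
    {
        _save = save ?? throw new ArgumentNullException(nameof(save));
        // dueTime of zero: the first flush runs right away, then once per period.
        _timer = new Timer(_ => Flush(), null, TimeSpan.Zero, period);
    }

    // Called from the network-receiving code whenever data arrives.
    public void Enqueue(T item) => _queue.Enqueue(item);

    public void Flush()
    {
        // Skip this tick if the previous flush has not finished yet.
        if (Interlocked.CompareExchange(ref _flushing, 1, 0) != 0) return;
        try
        {
            var batch = new List<T>();
            while (_queue.TryDequeue(out T item)) batch.Add(item);
            if (batch.Count > 0) _save(batch);
        }
        finally
        {
            Volatile.Write(ref _flushing, 0);
        }
    }

    public void Dispose() => _timer.Dispose();
}
```

You would create it once with TimeSpan.FromMinutes(5) and call Enqueue for every incoming reading. Note that the very first tick will probably find an empty queue, so if the first reading must be written immediately you could also call Flush() yourself after the first Enqueue.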
Will parallelism help with performance for a locked object, should it be run single threaded, or is there another technique?
I noticed that exceptions were thrown when accessing a dataset and adding rows from multiple threads. Therefore I created a "thread-safe" version that adds rows by locking the table prior to updating the row. This implementation works, but it appears slow with many transactions.
public partial class HaMmeRffl
{
public partial class PlayerStatsDataTable
{
public void AddPlayerStatsRow(int PlayerID, int Year, int StatEnum, int Value, DateTime Timestamp)
{
lock (TeamMemberData.Dataset.PlayerStats)
{
HaMmeRffl.PlayerStatsRow testrow = TeamMemberData.Dataset.PlayerStats.FindByPlayerIDYearStatEnum(PlayerID, Year, StatEnum);
if (testrow == null)
{
HaMmeRffl.PlayerStatsRow newRow = TeamMemberData.Dataset.PlayerStats.NewPlayerStatsRow();
newRow.PlayerID = PlayerID;
newRow.Year = Year;
newRow.StatEnum = StatEnum;
newRow.Value = Value;
newRow.Timestamp = Timestamp;
TeamMemberData.Dataset.PlayerStats.AddPlayerStatsRow(newRow);
}
else
{
testrow.Value = Value;
testrow.Timestamp = Timestamp;
}
}
}
}
}
Now I can call this safely from multiple threads, but does it actually buy me anything? Can I do this differently for better performance? For instance, is there any way to use the System.Collections.Concurrent namespace to optimize performance, or are there any other methods?
In addition, I update the underlying database after the entire dataset is updated, and that takes a very long time. Would that be considered an I/O operation, and would it be worth using parallel processing by updating the database after each row (or some number of rows) is updated in the dataset?
UPDATE
I wrote some code to test concurrent vs. sequential processing. It shows that concurrent processing takes about 30% longer, so I should use sequential processing here. I assume this is because the lock on the data table makes the overhead of the ConcurrentQueue more costly than the gains from parallel processing. Is this conclusion correct, and is there anything I can do to speed up the processing, or am I stuck because, for a DataTable, "You must synchronize any write operations"?
Here is my test code which might not be scientifically correct. Here is the timer and calls between them.
dbTimer.Restart();
Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRow = InsertToPlayerQ(addUpdatePlayers);
Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRow = InsertToPlayerStatQ(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRow, addPlayerStatRow);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds single threaded", dbTimer.Elapsed.TotalSeconds);
dbTimer.Restart();
ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows = InsertToPlayerQueue(addUpdatePlayers);
ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows = InsertToPlayerStatQueue(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRows, addPlayerStatRows);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds concurrently", dbTimer.Elapsed.TotalSeconds);
In both examples I add to the Queue and ConcurrentQueue in an identical manner single threaded. The only difference is the insertion into the datatable. The single-threaded approach inserts as follows:
private static void UpdatePlayerStatsInDB(Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
try
{
HaMmeRffl.PlayersRow.PlayerValue row;
while (addPlayerRows.Count > 0)
{
row = addPlayerRows.Dequeue();
TeamMemberData.Dataset.Players.AddPlayersRow(
row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
row.Active, row.NewsUpdate);
}
}
catch (Exception)
{
TeamMemberData.Dataset.Players.RejectChanges();
}
try
{
HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
while (addPlayerStatRows.Count > 0)
{
row = addPlayerStatRows.Dequeue();
TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
}
}
catch (Exception)
{
TeamMemberData.Dataset.PlayerStats.RejectChanges();
}
TeamMemberData.Dataset.Players.AcceptChanges();
TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The concurrent version adds rows as follows:
private static void UpdatePlayerStatsInDB(ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
Action actionPlayer = () =>
{
HaMmeRffl.PlayersRow.PlayerValue row;
while (addPlayerRows.TryDequeue(out row))
{
TeamMemberData.Dataset.Players.AddPlayersRow(
row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
row.Active, row.NewsUpdate);
}
};
Action actionPlayerStat = () =>
{
HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
while (addPlayerStatRows.TryDequeue(out row))
{
TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
}
};
Action[] actions = new Action[Environment.ProcessorCount * 2];
for (int i = 0; i < Environment.ProcessorCount; i++)
{
actions[i * 2] = actionPlayer;
actions[i * 2 + 1] = actionPlayerStat;
}
try
{
// Start ProcessorCount concurrent consuming actions.
Parallel.Invoke(actions);
}
catch (Exception)
{
TeamMemberData.Dataset.Players.RejectChanges();
TeamMemberData.Dataset.PlayerStats.RejectChanges();
}
TeamMemberData.Dataset.Players.AcceptChanges();
TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The timings are 4.6 seconds for the single-threaded version and 6.1 seconds for the Parallel.Invoke version.
Locks and transactions are not good for parallelism and performance.
1) Try to avoid the lock: will different threads need to update the same row in the dataset?
2) Minimize lock time.
For the db operation you may try the batch update feature of ADO.NET: http://msdn.microsoft.com/en-us/library/ms810297.aspx
Multithreading can help up to a point, because once the data crosses your app boundary you will start waiting for I/O. Here you can do asynchronous processing, because your app does not have control over various parameters (resource access, network speed, etc.); this will give a better user experience (if it is a UI app).
For your scenario, you may want to use some sort of producer/consumer queue: as soon as a row is available in the queue, a different thread starts processing it. But again, this will only help up to a point.
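One way to avoid locking the DataTable during the parallel phase, sketched under my own assumptions: stage the upserts in a ConcurrentDictionary keyed by (PlayerID, Year, StatEnum), then apply the staged values to the table from a single thread afterwards. PlayerStatStaging is a hypothetical class, and keeping the entry with the newest timestamp is my choice, not something the question specifies.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class PlayerStatStaging
{
    // Key: (PlayerID, Year, StatEnum). Value: (Value, Timestamp).
    private readonly ConcurrentDictionary<(int, int, int), (int, DateTime)> _latest =
        new ConcurrentDictionary<(int, int, int), (int, DateTime)>();

    // Safe to call from many threads; the DataTable is not touched here.
    public void AddOrUpdate(int playerId, int year, int statEnum, int value, DateTime timestamp)
    {
        (int, DateTime) incoming = (value, timestamp);
        _latest.AddOrUpdate(
            (playerId, year, statEnum),
            incoming,
            // Assumption: when two updates hit the same key, the newer timestamp wins.
            (key, existing) => existing.Item2 > timestamp ? existing : incoming);
    }

    // Call from one thread after the parallel phase, then write each
    // entry to the DataTable without any locking.
    public KeyValuePair<(int, int, int), (int, DateTime)>[] Snapshot()
    {
        return _latest.ToArray();
    }
}
```

Whether this is actually faster than the sequential version is something you would have to measure; it only moves the contention out of the DataTable.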
I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?
namespace ParallelData
{
public partial class ParallelData : Form
{
public ParallelData()
{
InitializeComponent();
}
private static readonly char[] Separators = { ',', ' ' };
private static void ProcessFile()
{
var lines = File.ReadLines("BigData.csv");
var numbers = ProcessRawNumbers(lines);
var rowTotal = new List<double>();
var totalElements = 0;
foreach (var values in numbers)
{
var sumOfRow = values.Sum();
rowTotal.Add(sumOfRow);
totalElements += values.Count;
}
MessageBox.Show(totalElements.ToString());
}
private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
{
var numbers = new List<List<double>>();
/*System.Threading.Tasks.*/
Parallel.ForEach(lines, line =>
{
lock (numbers)
{
numbers.Add(ProcessLine(line));
}
});
return numbers;
}
private static List<double> ProcessLine(string line)
{
var list = new List<double>();
foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
{
double i;
if (Double.TryParse(s, out i))
{
list.Add(i);
}
}
return list;
}
private void button2_Click(object sender, EventArgs e)
{
ProcessFile();
}
}
}
I'm not sure it's a good idea. Depending on your hardware, the CPU may not be the bottleneck; the disk read speed will be.
Another point: if your storage hardware is a magnetic hard disk, then the disk read speed is strongly related to how the file is physically stored on the disk. If the file is not fragmented (i.e. all file chunks are stored sequentially on the disk), you'll get better performance if you read line by line sequentially.
One solution would be to read the whole file in one go (if you have enough memory; for 1 million rows it should be OK) using File.ReadAllLines, store all lines in a string array, then process them (i.e. parse using string.Split etc.) in your Parallel.ForEach, if the row order is not important.
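A minimal sketch of that read-first, parse-in-parallel approach, using PLINQ instead of Parallel.ForEach so no lock around a shared list is needed. The CsvSums class is hypothetical, and parsing with the invariant culture is my assumption; the question's code uses the current culture.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

public static class CsvSums
{
    private static readonly char[] Separators = { ',', ' ' };

    // Reads the whole file on one thread, then parses the lines in parallel.
    public static double[] RowTotals(string path)
    {
        string[] lines = File.ReadAllLines(path); // sequential disk access
        return lines
            .AsParallel()
            .AsOrdered() // keeps results in file order; drop it if order is irrelevant
            .Select(SumLine)
            .ToArray();
    }

    private static double SumLine(string line)
    {
        double sum = 0;
        foreach (string s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: numbers use '.' as the decimal separator.
            if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out double value))
            {
                sum += value;
            }
        }
        return sum;
    }
}
```

As with any parallel version, measure it against a plain loop first; for cheap per-line work the parallel overhead may well outweigh the gain.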
In general you should try to avoid having disk access on multiple threads. The disk is a bottleneck and will block, so might impact performance.
If the size of the lines in the file is not an issue, you should probably read the entire file in first, and then process in parallel.
If the file is too large to do that or it's not practical, then you could use BlockingCollection to load it. Use one thread to read the file and populate the BlockingCollection and then Parallel.ForEach to process the items in it. BlockingCollection allows you to specify the max size of the collection, so it will only read more lines from the file as what's already in the collection is processed and removed.
static void Main(string[] args)
{
string filename = @"c:\vs\temp\test.txt";
int maxEntries = 2;
var c = new BlockingCollection<String>(maxEntries);
var taskAdding = Task.Factory.StartNew(delegate
{
var lines = File.ReadLines(filename);
foreach (var line in lines)
{
c.Add(line); // when there are maxEntries items
// in the collection, this line
// and thread will block until
// the processing thread removes
// an item
}
c.CompleteAdding(); // this tells the collection there's
// nothing more to be added, so the
// enumerator in the other thread can
// end
});
// No need to wait for the adding thread here: GetConsumingEnumerable()
// blocks until an item is available or CompleteAdding() has been called.
// (A busy-wait on c.Count would spin forever if the file were empty.)
Parallel.ForEach(c.GetConsumingEnumerable(), i =>
{
// NOTE: GetConsumingEnumerable() removes items from the
// collection as it enumerates over it, this frees up
// the space in the collection for the other thread
// to write more lines from the file
Console.WriteLine(i);
});
Console.ReadLine();
}
As with some of the others, though, I have to ask the question: Is this something you really need to try optimizing through parallelization, or would a single-threaded solution perform well enough? Multithreading adds a lot of complexity and it's sometimes not worth it.
What kind of performance are you seeing that you want to improve upon?
I ran that code on my computer, and it looks like using Parallel to read a CSV file without any CPU-expensive computation makes no sense. It takes more time to run in parallel than on one thread. Here are my results:
For code above:
2699ms 2712ms (Checked twice just to confirm results)
Then with:
private static IEnumerable<List<double>> ProcessRawNumbers2(IEnumerable<string> lines)
{
var numbers = new List<List<double>>();
foreach(var line in lines)
{
lock (numbers)
{
numbers.Add(ProcessLine(line));
}
}
return numbers;
}
Gives me: 2075ms 2106ms
So I think that if the numbers in the CSV do not need any heavy computation in the program before they are stored, then it makes no sense to use parallelism in a case like this, as it only adds overhead.