Reading a CSV file with a million rows in parallel in C#

I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?
namespace ParallelData
{
    public partial class ParallelData : Form
    {
        public ParallelData()
        {
            InitializeComponent();
        }

        private static readonly char[] Separators = { ',', ' ' };

        private static void ProcessFile()
        {
            var lines = File.ReadLines("BigData.csv");
            var numbers = ProcessRawNumbers(lines);

            var rowTotal = new List<double>();
            var totalElements = 0;

            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }

        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }

        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }

        private void button2_Click(object sender, EventArgs e)
        {
            ProcessFile();
        }
    }
}

I'm not sure it's a good idea. Depending on your hardware, the CPU won't be the bottleneck; the disk read speed will.
Another point: if your storage hardware is a magnetic hard disk, then disk read speed is strongly related to how the file is physically stored on the disk; if the file is not fragmented (i.e. all file chunks are sequentially stored on the disk), you'll have better performance if you read line by line sequentially.
One solution would be to read the whole file at once (if you have enough memory space; for 1 million rows it should be OK) using File.ReadAllLines, store all lines in a string array, then process them (i.e. parse using string.Split, etc.) in your Parallel.ForEach, if row order is not important.
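For instance, a minimal sketch of that approach, reusing the ProcessLine helper from the question (each parallel iteration writes only to its own array slot, so no lock is needed and row order is even preserved):

private static List<List<double>> ProcessAllAtOnce(string path)
{
    // Read the whole file into memory in one sequential pass.
    string[] lines = File.ReadAllLines(path);

    // Parse in parallel; each iteration writes only to its own slot,
    // so no locking is required.
    var numbers = new List<double>[lines.Length];
    Parallel.For(0, lines.Length, i =>
    {
        numbers[i] = ProcessLine(lines[i]);
    });
    return numbers.ToList();
}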

In general you should try to avoid having disk access on multiple threads. The disk is a bottleneck and will block, which can hurt performance.
If the size of the file is not an issue, you should probably read the entire file in first, and then process it in parallel.
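A minimal sketch of that, for instance with PLINQ (again assuming the ProcessLine helper from the question):

// Read everything sequentially first, then parse in parallel.
// AsOrdered() keeps results in file order; drop it if order doesn't matter.
var numbers = File.ReadAllLines("BigData.csv")
    .AsParallel()
    .AsOrdered()
    .Select(ProcessLine)
    .ToList();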
If the file is too large for that, or it's not practical, then you could use a BlockingCollection to load it. Use one thread to read the file and populate the BlockingCollection, and then Parallel.ForEach to process the items in it. BlockingCollection allows you to specify the max size of the collection, so it will only read more lines from the file as what's already in the collection is processed and removed.
static void Main(string[] args)
{
    string filename = @"c:\vs\temp\test.txt";
    int maxEntries = 2;
    var c = new BlockingCollection<String>(maxEntries);

    var taskAdding = Task.Factory.StartNew(delegate
    {
        var lines = File.ReadLines(filename);
        foreach (var line in lines)
        {
            c.Add(line);    // when there are maxEntries items
                            // in the collection, this line
                            // and thread will block until
                            // the processing thread removes
                            // an item
        }

        c.CompleteAdding(); // this tells the collection there's
                            // nothing more to be added, so the
                            // enumerator in the other thread can
                            // end
    });

    while (c.Count < 1)
    {
        // this is here simply to give the adding thread time to
        // spin up in this much simplified sample
    }

    Parallel.ForEach(c.GetConsumingEnumerable(), i =>
    {
        // NOTE: GetConsumingEnumerable() removes items from the
        // collection as it enumerates over it; this frees up
        // the space in the collection for the other thread
        // to write more lines from the file
        Console.WriteLine(i);
    });

    Console.ReadLine();
}
As with some of the others, though, I have to ask the question: Is this something you really need to try optimizing through parallelization, or would a single-threaded solution perform well enough? Multithreading adds a lot of complexity and it's sometimes not worth it.
What kind of performance are you seeing that you want to improve upon?

I checked this code on my computer, and it looks like using Parallel to read a CSV file without any CPU-expensive computation makes no sense. It takes more time to run in parallel than in one thread. Here are my results:
For the code above:
2699 ms, 2712 ms (checked twice just to confirm the results)
Then with:
private static IEnumerable<List<double>> ProcessRawNumbers2(IEnumerable<string> lines)
{
    var numbers = new List<List<double>>();
    foreach (var line in lines)
    {
        lock (numbers)
        {
            numbers.Add(ProcessLine(line));
        }
    }
    return numbers;
}
Gives me: 2075 ms, 2106 ms
So I think that if the numbers in the CSV do not need to be computed somehow (with some extensive calculation or so) and then stored in the program, it makes no sense to use parallelism in a case like this, as it adds overhead.
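For reference, a minimal sketch of how such timings can be collected, using the method names from the code above (the file path is a placeholder):

var sw = System.Diagnostics.Stopwatch.StartNew();
var numbers = ProcessRawNumbers(File.ReadLines("BigData.csv"));
sw.Stop();
Console.WriteLine($"Parallel: {sw.ElapsedMilliseconds}ms");

sw.Restart();
numbers = ProcessRawNumbers2(File.ReadLines("BigData.csv")).ToList();
sw.Stop();
Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds}ms");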

Related

Unsure how to release memory caused by function in winforms

My problem is that I am unable, or don't know how, to clear the memory flooded by images (bitmaps) that are no longer used. The function's purpose is to change the background of the form to a new image every x amount of seconds.
The memory usage inevitably balloons until it crashes. But even when I change to a different window, I run this.Close() and the memory usage is still constantly increasing.
Here is the function:
public async void WaitSomeTime(String[] favs, int time)
{
    while (true)
    {
        var rnd = new Random();
        favs = favs.OrderBy(item => rnd.Next()).ToArray();
        foreach (string fav in favs)
        {
            await Task.Delay(time);
            Image img = new Bitmap(fav);
            this.pictureBoxBG.Image = img;
        }
    }
}
So far I've tried the Dispose method, but to no avail; I don't completely understand it. I've tried the using statement, but that causes an error in Program.cs (the entry point). I'm sure it's a simple fix, but I'm out of ideas and GPT-3 isn't helping very well. Thanks in advance.
Using the Diagnostic Tools in Visual Studio we can get an idea of how your current code behaves with respect to process memory over a ~5-minute period. Updates are set to occur ~10 times a second to provide a test load. Even though the image files I'm using for testing aren't very big, the effect on memory consumption is apparent.
One approach you could try that seems to keep it in check is to explicitly dispose of images and then collect garbage. It seems reasonable to apply this after each complete pass through the list of favs, but you get the general drift and can apply it wherever you choose.
To be clear, the critical step is to Dispose the previous image, and GC will take care of it from there. But if your goal is to keep the footprint as low as possible, this "extraordinary measure" allows you to exercise more control.
public partial class MainForm : Form
{
    public MainForm()
    {
        InitializeComponent();
        pictureBoxBG.SizeMode = PictureBoxSizeMode.StretchImage;
        string dir = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Images");
        string[] favs = Enumerable.Range(0, 10).Select(_ => Path.Combine(dir, $"image-{_}.png")).ToArray();
        _task = WaitSomeTime(favs, 100);
    }

    Task _task;
    Random rnd = new Random();

    public async Task WaitSomeTime(String[] favs, int time)
    {
        while (true)
        {
            List<Image> trash = new List<Image>();
            favs = favs.OrderBy(item => rnd.Next()).ToArray();
            foreach (string fav in favs)
            {
                await Task.Delay(time);
                if (pictureBoxBG.Image != null)
                {
                    trash.Add(pictureBoxBG.Image);
                }
                Image img = new Bitmap(fav);
                pictureBoxBG.Image = img;
            }
            for (int i = 0; i < trash.Count; i++)
            {
                trash[i].Dispose();
                trash[i] = null;
            }
            // Requesting an expedited disposal of managed memory.
            GC.Collect();
        }
    }
}

ConcurrentQueue in a ConcurrentDictionary Duplicate Error

I have a thread that handles message receiving every 10 seconds, and another one that writes these messages to the database every minute.
Each message has a different sender which is named serialNumber in my case.
Therefore, I created a ConcurrentDictionary like below.
public ConcurrentDictionary<string, ConcurrentQueue<PacketModel>> _dicAllPackets;
The key of the dictionary is serialNumber and the value is the collection of one minute's messages. The reason I collect a minute of data, instead of going to the database every 10 seconds, is so I can go once every minute and cut the number of database trips to one sixth.
public class ShotManager
{
    private const int SLEEP_THREAD_FOR_FILE_LIST_DB_SHOOTER = 25000;
    private bool ACTIVE_FILE_DB_SHOOT_THREAD = false;
    private List<Devices> _devices = new List<Devices>();
    public ConcurrentDictionary<string, ConcurrentQueue<PacketModel>> _dicAllPackets;

    public ShotManager()
    {
        ACTIVE_FILE_DB_SHOOT_THREAD = Utility.GetAppSettings("AppConfig", "0", "ACTIVE_LIST_DB_SHOOT") == "1";
        init();
    }

    private void init()
    {
        using (iotemplaridbContext dbContext = new iotemplaridbContext())
            _devices = (from d in dbContext.Devices select d).ToList();

        if (_dicAllPackets is null)
            _dicAllPackets = new ConcurrentDictionary<string, ConcurrentQueue<PacketModel>>();

        foreach (var device in _devices)
        {
            if (!_dicAllPackets.ContainsKey(device.SerialNumber))
                _dicAllPackets.TryAdd(device.SerialNumber, new ConcurrentQueue<PacketModel> { });
        }
    }

    public void Spinner()
    {
        while (ACTIVE_FILE_DB_SHOOT_THREAD)
        {
            try
            {
                Parallel.ForEach(_dicAllPackets, devicePacket =>
                {
                    Thread.Sleep(100);
                    readAndShot(devicePacket);
                });
                Thread.Sleep(SLEEP_THREAD_FOR_FILE_LIST_DB_SHOOTER);
                //init();
            }
            catch (Exception ex)
            {
                //init();
                tLogger.EXC("Spinner exception for write...", ex);
            }
        }
    }

    public void EnqueueObjectToQueue(string serialNumber, PacketModel model)
    {
        if (_dicAllPackets != null)
        {
            if (!_dicAllPackets.ContainsKey(serialNumber))
                _dicAllPackets.TryAdd(serialNumber, new ConcurrentQueue<PacketModel> { });
            else
                _dicAllPackets[serialNumber].Enqueue(model);
        }
    }

    private void readAndShot(KeyValuePair<string, ConcurrentQueue<PacketModel>> keyValuePair)
    {
        StringBuilder sb = new StringBuilder();
        if (keyValuePair.Value.Count() <= 0)
        {
            return;
        }
        sb.AppendLine($"INSERT INTO ......) VALUES(");
        // the reason why I don't use while(TryDequeue(out ..)){..} is that there's constantly
        // enqueuing to this dictionary, so the thread would be occupied with a single device for too long
        for (int i = 0; i < 10; i++)
        {
            keyValuePair.Value.TryDequeue(out PacketModel packet);
            if (packet != null)
            {
                /*
                *** do something and fill the sb...
                */
            }
            else
            {
                Console.WriteLine("No packet found! For Device: " + keyValuePair.Key);
                break;
            }
        }
        insertIntoDB(sb.ToString()[..(sb.Length - 5)] + ";");
    }
}
EnqueueObjectToQueue is called from a different class, like below.
private void packetToDictionary(string serialNumber, string jsonPacket, string messageTimeStamp)
{
    PacketModel model = new PacketModel
    {
        MachineData = jsonPacket,
        DataInsertedAt = messageTimeStamp
    };
    _shotManager.EnqueueObjectToQueue(serialNumber, model);
}
I call the above function from the message handler itself.
private void messageReceiveHandler(object sender, MessageReceviedEventArgs e)
{
    //do something...parse from e and call the func
    string jsonPacket = ""; //something parsed from e
    string serialNumber = ""; //something parsed from e
    string message_timestamp = DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss");
    ThreadPool.QueueUserWorkItem(state => packetToDictionary(serialNumber, jsonPacket, message_timestamp));
}
The problem is that sometimes packets are enqueued under the wrong serialNumber, or repeat themselves (duplicate entries).
Is it clever to use ConcurrentQueue in a ConcurrentDictionary like this?
No, it's not a good idea to use a ConcurrentDictionary with nested ConcurrentQueues as values. It's impossible to update this structure atomically. Take this for example:
if (!_dicAllPackets.ContainsKey(serialNumber))
    _dicAllPackets.TryAdd(serialNumber, new ConcurrentQueue<PacketModel> { });
else
    _dicAllPackets[serialNumber].Enqueue(model);
This little piece of code is riddled with race conditions. A thread that is running this code can be intercepted by another thread at any point between the ContainsKey, TryAdd, the [] indexer and the Enqueue invocations, altering the state of the structure, and invalidating the conditions on which the correctness of the current thread's work is based.
A ConcurrentDictionary is a good idea when you have a simple Dictionary that contains immutable values, you want to use it concurrently, and using a lock around each access could potentially create significant contention. You can read more about this here: When should I use ConcurrentDictionary and Dictionary?
My suggestion is to switch to a simple Dictionary<string, Queue<PacketModel>>, and synchronize it with a lock. If you are careful and avoid doing anything irrelevant while holding the lock, the lock will be released so quickly that other threads will rarely be blocked by it. Use the lock just to protect the reading and updating of a specific entry of the structure, and nothing else.
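A minimal sketch of that suggestion, reusing the PacketModel type from the question (the field names here are illustrative):

private readonly object _locker = new object();
private readonly Dictionary<string, Queue<PacketModel>> _allPackets
    = new Dictionary<string, Queue<PacketModel>>();

public void EnqueueObjectToQueue(string serialNumber, PacketModel model)
{
    lock (_locker)
    {
        // Get or create the queue for this device, then enqueue,
        // all under the same lock so no state can change in between.
        if (!_allPackets.TryGetValue(serialNumber, out Queue<PacketModel> queue))
        {
            queue = new Queue<PacketModel>();
            _allPackets.Add(serialNumber, queue);
        }
        queue.Enqueue(model);
    }
}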
Alternative designs
A ConcurrentDictionary<string, Queue<PacketModel>> structure might be a good option, under the condition that you never remove queues from the dictionary. Otherwise there is still room for race conditions to occur. You should use exclusively the GetOrAdd method to atomically get or add a queue in the dictionary, and always use the queue itself as a locker before doing anything with it (either reading or writing):
Queue<PacketModel> queue = _dicAllPackets
    .GetOrAdd(serialNumber, _ => new Queue<PacketModel>());

lock (queue)
{
    queue.Enqueue(model);
}
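The reading side must lock the same queue object. For example, a hypothetical drain of up to 10 packets per pass, mirroring the readAndShot loop above:

Queue<PacketModel> queue = _dicAllPackets
    .GetOrAdd(serialNumber, _ => new Queue<PacketModel>());

var batch = new List<PacketModel>();
lock (queue)
{
    // Take at most 10 packets per pass, as in the original loop.
    while (batch.Count < 10 && queue.Count > 0)
    {
        batch.Add(queue.Dequeue());
    }
}
// Build and run the INSERT outside the lock, using the local batch.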
Using a ConcurrentDictionary<string, ImmutableQueue<PacketModel>> is also possible, because in this case the value of the ConcurrentDictionary is immutable, and you won't need to lock anything. You'll need to always use the AddOrUpdate method, in order to update the dictionary with a single call, as an atomic operation.
_dicAllPackets.AddOrUpdate
(
    serialNumber,
    key => ImmutableQueue.Create<PacketModel>(model),
    (key, queue) => queue.Enqueue(model)
);
The queue.Enqueue(model) call inside the updateValueFactory delegate does not mutate the queue. Instead it creates a new ImmutableQueue<PacketModel> and discards the previous one. The immutable collections are not very efficient in general. But if your goal is to minimize the contention between threads, at the cost of increasing the work that each thread has to do, then you might find them useful.
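Dequeuing atomically from this structure requires a compare-and-swap loop. A sketch, using a hypothetical TryDequeuePacket helper (not part of the original code):

private bool TryDequeuePacket(string serialNumber, out PacketModel model)
{
    while (true)
    {
        if (!_dicAllPackets.TryGetValue(serialNumber, out ImmutableQueue<PacketModel> queue)
            || queue.IsEmpty)
        {
            model = null;
            return false;
        }
        ImmutableQueue<PacketModel> updated = queue.Dequeue(out model);
        // TryUpdate succeeds only if no other thread replaced the queue
        // in the meantime; otherwise loop and retry against the fresh value.
        if (_dicAllPackets.TryUpdate(serialNumber, updated, queue))
        {
            return true;
        }
    }
}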

C# code takes too long to run. Is there a way to make it finish quicker?

I need some help. If you input a directory into my code, it goes into every folder in that directory and gets every single file. This way, I managed to bypass the "AccessDeniedException" with my code, BUT if the directory is one that contains a lot of data and folders (example: C:/), it just takes way too much time.
I don't really know how to multithread and I could not find any help on the internet. Is there a way to make the code run faster by multithreading? Or is it possible to ask the code to use more memory or cores? I really don't know and could use advice.
My code to go into every file in every subdirectory:
public static List<string> Files = new List<string>();
public static List<string> Exceptions = new List<string>();

public MainWindow()
{
    InitializeComponent();
}

private static void GetFilesRecursively(string directory)
{
    try
    {
        foreach (string A in Directory.GetDirectories(directory))
            GetFilesRecursively(A);
        foreach (string B in Directory.GetFiles(directory))
            AddtoList(B);
    }
    catch (System.Exception ex) { Exceptions.Add(ex.ToString()); }
}

private static void AddtoList(string Result)
{
    Files.Add(Result);
}

private void Btn_Click(object sender, RoutedEventArgs e)
{
    GetFilesRecursively(Textbox1.Text);
    foreach (string C in Files)
        Textbox2.Text += $"{C} \n";
}
You don't need recursion to avoid inaccessible files. You can use the EnumerateFiles overload that accepts an EnumerationOptions parameter and set EnumerationOptions.IgnoreInaccessible to true:
var options = new EnumerationOptions
{
    IgnoreInaccessible = true,
    RecurseSubdirectories = true
};
var files = Directory.EnumerateFiles(somePath, "*", options);
The loop that appends file paths is very expensive too. Not only does it create a new temporary string on each iteration, it also forces a UI redraw. You could improve speed and memory usage (which, due to garbage collection, also affects performance) by creating a single string, e.g. with String.Join or a StringBuilder:
var text = String.Join("\n", files);
Textbox2.Text = text;
String.Join uses a StringBuilder internally whose internal buffer gets reallocated each time it's full. The previous buffer has to be garbage-collected. One could avoid even this by using a StringBuilder with a specific capacity. Even a rough estimate can reduce reallocations significantly:
var builder = new StringBuilder(4096);
foreach (var file in files)
{
    builder.AppendLine(file);
}
Create a class so you can add a private field to count the depth of the directory.
Add a TaskCompletionSource<T> property to the class, await its Task only if the depth is over the limit, and trigger an event so your UI can hook into the action and ask the user.
If the user cancels, the task fails; if the user confirms, it continues.
Some logic code:
public class FileLocator
{
    public FileLocator(int maxDepth = 6)
    {
        _maxDepth = maxDepth;
        this.TaskSource = new TaskCompletionSource<bool>();
        this.ConfirmTask = this.TaskSource.Task;
    }

    private int _maxDepth;
    private int _depth;

    public event Action<FileLocator> OnReachMaxDepth;
    public Task ConfirmTask;
    public TaskCompletionSource<bool> TaskSource { get; }

    public async Task<List<string>> GetFilesRecursivelyAsync(string path)
    {
        var result = new List<string>();
        foreach (xxxxxxx)
        {
            xxxxxxxxxxxxxx;
            this._depth += 1;
            if (_depth == _maxDepth)
            {
                OnReachMaxDepth?.Invoke(this);
            }
            if (_depth >= _maxDepth)
            {
                try
                {
                    await ConfirmTask;
                    continue;
                }
                catch
                {
                    return result;
                }
            }
        }
        return result;
    }
}
And call it:
var locator = new FileLocator();
locator.OnReachMaxDepth += (x) =>
{
    var result = UI.Confirm();
    if (result) { x.TaskSource.SetResult(true); }
    else { x.TaskSource.SetException(new Exception()); }
};
var result = await locator.GetFilesRecursivelyAsync("C:");

Is parallelism useful with a lock for DataTable write transactions

Will parallelism help with performance for a locked object, should it be run single threaded, or is there another technique?
I noticed that when accessing a dataset and adding rows from multiple threads, exceptions were thrown. Therefore I created a "thread-safe" version to add rows by locking the table prior to updating the row. This implementation works, but it appears slow with many transactions.
public partial class HaMmeRffl
{
    public partial class PlayerStatsDataTable
    {
        public void AddPlayerStatsRow(int PlayerID, int Year, int StatEnum, int Value, DateTime Timestamp)
        {
            lock (TeamMemberData.Dataset.PlayerStats)
            {
                HaMmeRffl.PlayerStatsRow testrow = TeamMemberData.Dataset.PlayerStats.FindByPlayerIDYearStatEnum(PlayerID, Year, StatEnum);
                if (testrow == null)
                {
                    HaMmeRffl.PlayerStatsRow newRow = TeamMemberData.Dataset.PlayerStats.NewPlayerStatsRow();
                    newRow.PlayerID = PlayerID;
                    newRow.Year = Year;
                    newRow.StatEnum = StatEnum;
                    newRow.Value = Value;
                    newRow.Timestamp = Timestamp;
                    TeamMemberData.Dataset.PlayerStats.AddPlayerStatsRow(newRow);
                }
                else
                {
                    testrow.Value = Value;
                    testrow.Timestamp = Timestamp;
                }
            }
        }
    }
}
Now I can call this safely from multiple threads, but does it actually buy me anything? Can I do this differently for better performance? For instance, is there any way to use the System.Collections.Concurrent namespace to optimize performance, or any other methods?
In addition, I update the underlying database after the entire dataset is updated, and that takes a very long time. Would that be considered an I/O operation, and would it be worth using parallel processing by updating the database after each row (or some number of rows) is updated in the dataset?
UPDATE
I wrote some code to test concurrent vs sequential processing, which shows it takes about 30% longer to do concurrent processing, so I should use sequential processing here. I assume this is because the lock on the database makes the overhead on the ConcurrentQueue more costly than the gains from parallel processing. Is this conclusion correct, and is there anything I can do to speed up the processing, or am I stuck, since for a DataTable "You must synchronize any write operations"?
Here is my test code, which might not be scientifically correct. Here is the timer and the calls between them.
dbTimer.Restart();
Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRow = InsertToPlayerQ(addUpdatePlayers);
Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRow = InsertToPlayerStatQ(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRow, addPlayerStatRow);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds single threaded", dbTimer.Elapsed.TotalSeconds);
dbTimer.Restart();
ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows = InsertToPlayerQueue(addUpdatePlayers);
ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows = InsertToPlayerStatQueue(addUpdatePlayers);
UpdatePlayerStatsInDB(addPlayerRows, addPlayerStatRows);
dbTimer.Stop();
System.Diagnostics.Debug.Print("Writing to the dataset took {0} seconds concurrently", dbTimer.Elapsed.TotalSeconds);
In both examples I add to the Queue and ConcurrentQueue in an identical manner, single-threaded. The only difference is the insertion into the DataTable. The single-threaded approach inserts as follows:
private static void UpdatePlayerStatsInDB(Queue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, Queue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
    try
    {
        HaMmeRffl.PlayersRow.PlayerValue row;
        while (addPlayerRows.Count > 0)
        {
            row = addPlayerRows.Dequeue();
            TeamMemberData.Dataset.Players.AddPlayersRow(
                row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
                row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
                row.Active, row.NewsUpdate);
        }
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.Players.RejectChanges();
    }
    try
    {
        HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
        while (addPlayerStatRows.Count > 0)
        {
            row = addPlayerStatRows.Dequeue();
            TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
                row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
        }
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.PlayerStats.RejectChanges();
    }
    TeamMemberData.Dataset.Players.AcceptChanges();
    TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The concurrent version adds as follows:
private static void UpdatePlayerStatsInDB(ConcurrentQueue<HaMmeRffl.PlayersRow.PlayerValue> addPlayerRows, ConcurrentQueue<HaMmeRffl.PlayerStatsRow.PlayerStatValue> addPlayerStatRows)
{
    Action actionPlayer = () =>
    {
        HaMmeRffl.PlayersRow.PlayerValue row;
        while (addPlayerRows.TryDequeue(out row))
        {
            TeamMemberData.Dataset.Players.AddPlayersRow(
                row.PlayerID, row.Name, row.PosEnum, row.DepthEnum,
                row.TeamID, row.RosterTimestamp, row.DepthTimestamp,
                row.Active, row.NewsUpdate);
        }
    };

    Action actionPlayerStat = () =>
    {
        HaMmeRffl.PlayerStatsRow.PlayerStatValue row;
        while (addPlayerStatRows.TryDequeue(out row))
        {
            TeamMemberData.Dataset.PlayerStats.AddUpdatePlayerStatsRow(
                row.PlayerID, row.Year, row.StatEnum, row.Value, row.Timestamp);
        }
    };

    Action[] actions = new Action[Environment.ProcessorCount * 2];
    for (int i = 0; i < Environment.ProcessorCount; i++)
    {
        actions[i * 2] = actionPlayer;
        actions[i * 2 + 1] = actionPlayerStat;
    }

    try
    {
        // Start ProcessorCount concurrent consuming actions.
        Parallel.Invoke(actions);
    }
    catch (Exception)
    {
        TeamMemberData.Dataset.Players.RejectChanges();
        TeamMemberData.Dataset.PlayerStats.RejectChanges();
    }

    TeamMemberData.Dataset.Players.AcceptChanges();
    TeamMemberData.Dataset.PlayerStats.AcceptChanges();
}
The difference in time is 4.6 seconds for the single-threaded version and 6.1 seconds for the Parallel.Invoke version.
Locks and transactions are not good for parallelism and performance.
1) Try to avoid the lock: will different threads need to update the same row in the dataset?
2) Minimize the lock time.
For the DB operation you may try the batch update feature of ADO.NET, sketched below: http://msdn.microsoft.com/en-us/library/ms810297.aspx
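For illustration, a minimal sketch of an ADO.NET batch update with a SqlDataAdapter; the connection string, SQL, and column names are placeholders, not taken from the original code:

using System.Data;
using System.Data.SqlClient;

// Placeholder connection string and INSERT statement.
using (var connection = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter())
{
    adapter.InsertCommand = new SqlCommand(
        "INSERT INTO PlayerStats (PlayerID, Year, StatEnum, Value, Timestamp) " +
        "VALUES (@PlayerID, @Year, @StatEnum, @Value, @Timestamp)", connection);
    adapter.InsertCommand.Parameters.Add("@PlayerID", SqlDbType.Int, 4, "PlayerID");
    adapter.InsertCommand.Parameters.Add("@Year", SqlDbType.Int, 4, "Year");
    adapter.InsertCommand.Parameters.Add("@StatEnum", SqlDbType.Int, 4, "StatEnum");
    adapter.InsertCommand.Parameters.Add("@Value", SqlDbType.Int, 4, "Value");
    adapter.InsertCommand.Parameters.Add("@Timestamp", SqlDbType.DateTime, 8, "Timestamp");

    // Batching requires that the command not return a row per statement.
    adapter.InsertCommand.UpdatedRowSource = UpdateRowSource.None;

    // Send up to 100 inserts per round trip instead of one at a time.
    adapter.UpdateBatchSize = 100;

    adapter.Update(TeamMemberData.Dataset.PlayerStats);
}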
Multithreading can help up to an extent, because once the data crosses your app boundary you will start waiting for I/O. Here you can do asynchronous processing, because your app does not have control over various parameters (resource access, network speed, etc.); this will give a better user experience (if it's a UI app).
Now for your scenario, you may want to use some sort of producer/consumer queue: as soon as a row is available in the queue, a different thread starts processing it, but again this will only work up to an extent (see the sketch below).
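A minimal sketch of such a queue, assuming the PlayerStatValue row type from the code above and a hypothetical WriteToDatabase helper:

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded queue: producers block once 1000 rows are waiting.
var rows = new BlockingCollection<HaMmeRffl.PlayerStatsRow.PlayerStatValue>(boundedCapacity: 1000);

// A single consumer drains rows and writes them, so the DataTable/DB
// access itself stays single-threaded and needs no lock.
var writer = Task.Run(() =>
{
    foreach (var row in rows.GetConsumingEnumerable())
    {
        WriteToDatabase(row); // hypothetical helper
    }
});

// Any number of producer threads can call rows.Add(row) concurrently.
// When the last producer finishes:
rows.CompleteAdding();
writer.Wait();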

C# : Call a method every 5 minutes from a foreach loop

My console app is reading a huge volume of data from text files that will be saved to a DB. For this purpose, I am storing the data in a DataTable and I want to dump this DataTable to the DB every 5 minutes. (If I dump the whole data at once, I have to fill the DataTable with the whole set of data, and in that case I get an OutOfMemoryException.)
public void ProcessData()
{
    string[] files = File.ReadAllLines(path);
    foreach (var item in files)
    {
        DataRow dtRow = dataTable.NewRow();
        dtRow["ID"] = ....   //some code here;
        dtRow["Name"] = .... //some code here;
        dtRow["Age"] = ....  //some code here;

        var timer = new Timer(v => SaveData(), null, 0, 5 * 60 * 1000);
    }
}

public void SaveData(string tableName, DataTable dataTable)
{
    //Some code Here
    //After dumping data to DB, clear DataTable
    dataTable.Rows.Clear();
}
What I wanted here is for the code to continue filling the DataTable, and to call the SaveData() method every 5 minutes. This should continue until all files have been processed.
However, I have seen that when the SaveData() method is called, it executes 4-5 times. Sometimes it is not called every 5 minutes at all.
I am not sure how to proceed here. How do I fix this? Can any other approach be used here? Any help is appreciated.
Is it essential that you read each text file in completely? ReadAllLines will consume a large amount of memory. Why not read x lines from a file, save them to the database, then continue until the end of the file is reached?
Your biggest problem is instantiating new Timer instances in your foreach. A new Timer object on every iteration means multiple threads calling SaveData concurrently, so dataTable may be processed and saved to the database multiple times concurrently, possibly (and likely) before rows are cleared, thus duplicating much of your file into the database.
Before I provide a solution to the question as asked, I want to point out that saving data on a 5-minute interval has a distinct code smell to it. As has been pointed out, I would suggest an approach that loads and saves data based on some data size rather than an arbitrary time interval. That said, I will go ahead and address your question on the assumption that there is a reason you must go with a 5-minute interval save.
First, we need to set up our Timer correctly; you'll notice I create it outside of the foreach loop. A Timer continues running on an interval, not just waiting and executing once.
Second, we have to take steps to ensure thread-safe data integrity on our intermediate data store (in your case you used a DataTable, but I am using a List of a custom class, because DataTable is too costly for what we want to do). You'll notice I accomplish this by locking before updates to our List.
Updates to your data processing class:
private bool isComplete = false;
private object DataStoreLock = new object();
private List<MyCustomClass> myDataStore = new List<MyCustomClass>();
private Timer myTimer;

public void ProcessData()
{
    myTimer = new Timer(SaveData, null, TimeSpan.Zero, TimeSpan.FromMinutes(5.0));
    foreach (var item in File.ReadLines(path))
    {
        var myData = new MyCustomClass()
        {
            ID = 0, // Some code here
            Name = "Some code here",
            Age = 0 // Some code here
        };
        lock (DataStoreLock)
        {
            myDataStore.Add(myData);
        }
    }
    isComplete = true;
}

public void SaveData(object arg)
{
    // Our first step is to check if timed work is done.
    if (isComplete)
    {
        myTimer.Dispose();
        myTimer = null;
    }

    // Our next step is to create a local instance of the data store to work on, which
    // allows ProcessData to continue populating while our DB actions are being performed.
    List<MyCustomClass> lDataStore;
    lock (DataStoreLock)
    {
        lDataStore = myDataStore;
        myDataStore = new List<MyCustomClass>();
    }

    //Some DB code here.
}
EDIT: I've changed the enumeration to go through ReadLines rather than ReadAllLines. Read the Remarks under the ReadLines method on MSDN. ReadAllLines is a blocking call, while ReadLines allows enumeration to proceed while reading the file. I can't imagine a scenario otherwise where your foreach would run for more than 5 minutes if the file had already been read entirely into memory.
Here is a suggestion on how to implement the code together with the suggestion from the other answer:
public void ProcessData()
{
    int i = 1;
    foreach (var item in File.ReadLines(path)) //This line has been edited
    {
        DataRow dtRow = dataTable.NewRow();
        dtRow["ID"] = ....   //some code here;
        dtRow["Name"] = .... //some code here;
        dtRow["Age"] = ....  //some code here;

        if (i % 25 == 0) //you can change the 25 here to something else
        {
            SaveData(/* table name */, /* dataTable */);
        }
        i++;
    }
    SaveData(/* table name */, /* dataTable */);
}

public void SaveData(string tableName, DataTable dataTable)
{
    //Some code Here
    //After dumping data to DB, clear DataTable
    dataTable.Rows.Clear();
}
