Getting Duplicate Objects in Producer/Consumer ConcurrentDictionary C#

I'm stuck on a problem and am wondering if I just have coded something incorrectly. The application polls every few seconds and grabs every record from a table whose sole purpose is to signify what records to act upon.
Please note I've left out the error handling code for space and readability.
//Producing Thread, this is triggered every 5 seconds... UGH, I hate timers
foreach (var Record in GetRecordsFromDataBase()) // returns a dictionary
{
    if (!ConcurrentDictionary.ContainsKey(Record.Key))
        ConcurrentDictionary.TryAdd(Record.Key, Record.Value);
}
This code works great, with the irritating fact that it may/will select the same record multiple times until said record(s) is/are processed. By processed, each selected record is being written into its own newly created, uniquely named file. Then a stored procedure is called for that record's key to remove it from the database at which point that particular key is removed from the ConcurrentDictionary.
// Consuming Thread, located within another loop to allow
// the below code to continue to cycle until instructed
// to terminate
while (!ConcurrentDictionary.IsEmpty)
{
    var Record = ConcurrentDictionary.Take(1).First();
    WriteToNewFile(Record.Value);
    RemoveFromDatabase(Record.Key);
    ConcurrentDictionary.TryRemove(Record.Key, out var removed);
}
For a throughput test I added 20k+ records into the table and then turned the application loose. I was quite surprised when I noticed 22k+ files that continued to increase well into 100k+ territory.
What am I doing wrong??? Have I completely misunderstood what the concurrent dictionary is used for? Did I forget a semi-colon somewhere?

First, eliminate the call to Contains. TryAdd already checks for duplicates, and returns false if the item is already present.
foreach (var Record in GetRecordsFromDataBase()) // returns a dictionary
{
    ConcurrentDictionary.TryAdd(Record.Key, Record.Value);
}
The next problem I see is that ConcurrentDictionary.Take(1).First() is not a good way to get an item from the dictionary, since the take-then-remove sequence isn't atomic. I think you want to use a BlockingCollection<T> instead. It is specifically designed for implementing a producer-consumer pattern.
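For illustration, here is a minimal BlockingCollection<T> sketch, assuming integer keys and string values (both placeholders, not your actual record type):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        // BlockingCollection wraps a ConcurrentQueue by default and gives
        // an atomic, blocking Take -- no Take(1).First() race.
        var queue = new BlockingCollection<KeyValuePair<int, string>>();

        var consumer = Task.Run(() =>
        {
            // GetConsumingEnumerable blocks until an item is available
            // and ends cleanly once CompleteAdding() has been called.
            foreach (var record in queue.GetConsumingEnumerable())
                Console.WriteLine($"Processing {record.Key}: {record.Value}");
        });

        for (int i = 0; i < 10; i++)
            queue.Add(new KeyValuePair<int, string>(i, "record " + i));

        queue.CompleteAdding(); // tell the consumer no more items are coming
        consumer.Wait();
    }
}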
Lastly, I think your problems don't really have to do with the dictionary, but with the database. The dictionary itself is thread-safe, but your dictionary is not atomic with the database. So suppose record A is in the database. GetRecordsFromDataBase() pulls it and adds it to the dictionary. Then it begins processing record A (I assume this is in another thread). Then, that first loop again calls GetRecordsFromDataBase() and gets record A again. Simultaneously, record A is processed and removed from the database. But it's too late! GetRecordsFromDataBase() already grabbed it! So that initial loop adds it to the dictionary again, after it has been removed.
I think you may need to take records that are to be processed and move them into another table entirely, so they can't get picked up a second time. Doing this at the C# level, rather than the database level, is going to be a problem. Either that, or you don't want to be adding records to the queue while processing records. A sketch of the claim-at-the-database idea follows.
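A hedged sketch of that idea, assuming SQL Server and a hypothetical WorkQueue table with a Status column (all table, column, and variable names here are placeholders):

using System.Data.SqlClient;

// Atomically claim up to 100 pending rows so a second poll cannot
// pick them up again. UPDATE ... OUTPUT does the claim and the read
// in a single statement.
const string claimSql = @"
    UPDATE TOP (100) WorkQueue
    SET    Status = 'InProgress'
    OUTPUT inserted.RecordKey, inserted.RecordValue
    WHERE  Status = 'Pending';";

using (var conn = new SqlConnection(connectionString)) // connectionString assumed
using (var cmd = new SqlCommand(claimSql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            var key = reader.GetInt32(0);    // RecordKey assumed int
            var value = reader.GetString(1); // RecordValue assumed string
            // hand the claimed record to the consumer here...
        }
    }
}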

What am I doing wrong???
The foreach (add) loop adds any record returned from the database that is not already in the dictionary.
The while (remove) loop writes each item to a file, removes it from the database, and then removes it from the dictionary.
This logic looks correct. But there is a race:
GetRecordsFromDataBase(); // returns records 1 through 10.
switch context to remove loop.
WriteToNewFile(Record.Value); // write record 5
RemoveFromDatabase(Record.Key); // remove record 5 from db
ConcurrentDictionary.TryRemove(Record.Key); // remove record 5 from dictionary
switch back to add loop
ConcurrentDictionary.TryAdd(Record.Key, Record.Value); // add record 5 even though it is no longer in the DB, because it was part of the records returned by GetRecordsFromDataBase()
After the item is removed the foreach loop adds it again. This is why your file count is multiplying.
foreach (var Record in GetRecordsFromDataBase()) // returns a dictionary
{
    if (!ConcurrentDictionary.ContainsKey(Record.Key)) // this if is not required; TryAdd will do
        ConcurrentDictionary.TryAdd(Record.Key, Record.Value);
}
Try something like this:
Add loop:
foreach (var Record in GetRecordsFromDataBase()) // returns a dictionary
{
    if (ConcurrentDictionary.TryAdd(Record.Key, false)) // only adds the record if it has not been seen before
    {
        ConcurrentQueue.Enqueue(Record); // enqueue the record for the consumer
    }
}
Remove loop:
if (ConcurrentQueue.TryDequeue(out var record))
{
    if (ConcurrentDictionary.TryUpdate(record.Key, true, false)) // flip the flag from false to true, so only one consumer wins
    {
        WriteToNewFile(record.Value);   // write the record to its own file
        RemoveFromDatabase(record.Key); // remove the record from the db
    }
}
This will leave an entry in the dictionary for each record processed. You can remove those entries from the dictionary eventually, but multithreading involving a db can be tricky.
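For completeness, the two loops above assume shared instances along these lines (the key and record types are placeholders for whatever your records actually use):

using System.Collections.Concurrent;
using System.Collections.Generic;

// key -> processed flag; false = queued, true = being/already processed
var ConcurrentDictionary = new ConcurrentDictionary<int, bool>();
// the hand-off channel between the add loop and the remove loop
var ConcurrentQueue = new ConcurrentQueue<KeyValuePair<int, string>>();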

Related

Prevent insertion of duplicate documents into Lotus Notes database

I have a C# Web API hosted in IIS which has a POST method that takes a list of document ids to insert into a Lotus Notes database.
The POST method can be called multiple times, and I want to prevent insertion of duplicate documents.
This is the code (in a static class) that is called from the POST:
lock (thisLock)
{
    var id = "some unique id";
    doc = vw.GetDocumentByKey(id, false);
    if (doc == null)
    {
        NotesDocument docNew = db.CreateDocument();
        // some more processing
        docNew.Save(true, false, false);
    }
}
Even with the lock in place, I am running into scenarios where duplicate documents are inserted. Is it because a request can be executed on a new process? What is the best way to prevent this from happening?
Your problem is: GetDocumentByKey depends on the view index being up to date. On a busy server there is no guarantee that this is true. You can TRY to call vw.Update, but unfortunately this does not trigger an update of the view index, so it might have no effect (it just updates the vw object to represent what has changed in the backend; if the backend did not update, it does nothing).
You could use db.Search("IdField = \"" + id + "\"", null, 0) instead, as the search does not rely on an index being rebuilt. This will be slightly slower, but should be far more accurate.
You might want to store the inserted ids in some singleton object, or even a simple static list, and lock on this list: whoever obtains the lock verifies that the ids it wants to insert are not present, then adds them to the list itself.
You only need to keep them for a short time, just long enough that two concurrent posts with the same content cannot both insert and the view index has a chance to update. So store a timestamp along with each id, so you can clean out older records if the list grows long. A sketch of this guard follows.
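A minimal sketch of that guard, assuming string ids and a hypothetical ten-minute retention window:

using System;
using System.Collections.Generic;
using System.Linq;

static class InsertGuard
{
    static readonly object IdLock = new object();
    static readonly Dictionary<string, DateTime> RecentIds = new Dictionary<string, DateTime>();

    // Returns true if the caller is the first to claim this id recently
    // and may proceed with the insert.
    public static bool TryClaimId(string id)
    {
        lock (IdLock)
        {
            // Drop entries older than the retention window.
            var cutoff = DateTime.UtcNow.AddMinutes(-10);
            foreach (var stale in RecentIds.Where(kv => kv.Value < cutoff)
                                           .Select(kv => kv.Key).ToList())
                RecentIds.Remove(stale);

            if (RecentIds.ContainsKey(id))
                return false; // a concurrent post already inserted this id

            RecentIds[id] = DateTime.UtcNow;
            return true;
        }
    }
}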

How do I find out if a DynamoDB table is empty?

How can I find out if a DynamoDB table contains any items using the .NET SDK?
One option is to do a Scan operation, and check the returned Count. But Scans can be costly for large tables and should be avoided.
The DescribeTable item count is not a real-time value; it is updated roughly every six hours.
The best way is to scan once, without any filter expression, and check the count. This need not be costly: with a Limit of 1 you are not scanning the entire table, since you don't need to keep paging to find out whether the table has any item at all.
A single scan page returns at most 1 MB of data.
If the use case requires a real-time answer, this is the best (and only) option available; see the sketch below.
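A sketch using the low-level .NET SDK (the synchronous client is shown; newer SDK versions expose ScanAsync instead, and the table name is a placeholder):

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();

// Limit = 1 stops the scan after evaluating a single item, so the
// empty check stays cheap even on a large table.
var response = client.Scan(new ScanRequest
{
    TableName = "FooTable",
    Limit = 1
});

bool isEmpty = response.Count == 0;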
Edit: While the below appears to work fine with small tables on localhost, the docs state
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so only use DescribeTable if you don't need an accurate, up to date figure.
Original:
It looks like the best way to do this is to use the DescribeTable method on AmazonDynamoDBClient:
AmazonDynamoDBClient client = ...
if (client.DescribeTable("FooTable").Table.ItemCount == 0)
// do stuff

C# - Replacing SharePoint list data nightly

I have a SharePoint list on a site that I want to update nightly from a SQL Server DB, preferably using C#. Here is the catch: I do not know whether any records were removed or added, or whether any field in any record has been updated. I would believe then the simplest thing to do is remove the data from the list and then replace it with the new list data. But is there any simple way to do this? I would hate to remove 3000+ items line by line from the list and then add the 3000+ records one at a time.
It depends on your environment. If you don't have that much load on the systems at night, I would prefer one of the following ways:
1) Build a timer job, delete the list (not the items one by one, because this is slow), recreate the list, and import the items from the db. When we are talking about 3,000 - 5,000 elements, this is not that much, and I think it can be done in under 10 minutes.
2) Loop through the SharePoint list items and check field by field whether each was updated in the db; if yes, update it.
I would prefer to delete the list and import the complete table, because we are not talking about that much data.
Another way, which is a good idea, is to use BCS or BDC. Then you would have the data always in place and synced with the db. Look at:
https://msdn.microsoft.com/en-us/library/office/jj163782.aspx
https://msdn.microsoft.com/de-de/library/ee231515(v=vs.110).aspx
Unfortunately there is no "easy" and/or elegant way to delete all the items in a list, like the DELETE statement in SQL. You can either delete the entire list and recreate it, if the list can be easily created from a list definition, or, if your concern is performance, use the SPWeb method ProcessBatchData (available since SP 2007). You can use it to batch-process commands and avoid the performance penalty of issuing 6,000 separate commands to the server. However, it still requires you to pass an ugly XML document that lists all the items to be deleted or added; a sketch of a batch delete follows.
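For reference, a rough sketch of a ProcessBatchData batch delete (server-side object model; web is an assumed SPWeb, the list name is a placeholder, and error handling is omitted):

using System.Text;
using Microsoft.SharePoint;

SPList list = web.Lists["MyList"];
StringBuilder sb = new StringBuilder();
sb.Append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><Batch>");
foreach (SPListItem item in list.Items)
{
    // each Method element deletes one item by ID
    sb.AppendFormat(
        "<Method><SetList Scope=\"Request\">{0}</SetList>" +
        "<SetVar Name=\"ID\">{1}</SetVar>" +
        "<SetVar Name=\"Cmd\">Delete</SetVar></Method>",
        list.ID, item.ID);
}
sb.Append("</Batch>");
web.ProcessBatchData(sb.ToString());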
The ideal way is to enumerate all the rows from the database and check whether each row already exists in the SharePoint list, using a primary field value. If it already exists, simply update it[1]. Otherwise, add a new item. A sketch follows.
[1] - Optionally, while updating, compare the list item field values with the database column values and write only the fields that changed; otherwise skip the item.
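A rough sketch of that upsert, assuming a server-side timer job, a DataTable named dbTable loaded from SQL, and a hypothetical KeyField text column on the list (all names are placeholders):

using System.Data;
using Microsoft.SharePoint;

SPList list = web.Lists["MyList"];
foreach (DataRow row in dbTable.Rows)
{
    string key = row["KeyField"].ToString();

    // Look up an existing item by the primary field value.
    SPQuery query = new SPQuery
    {
        Query = "<Where><Eq><FieldRef Name='KeyField'/>" +
                "<Value Type='Text'>" + key + "</Value></Eq></Where>",
        RowLimit = 1
    };
    SPListItemCollection matches = list.GetItems(query);
    SPListItem item = matches.Count > 0 ? matches[0] : list.Items.Add();

    // Only write fields that actually changed, to avoid needless updates.
    string title = row["Title"].ToString();
    if ((string)item["Title"] != title)
    {
        item["KeyField"] = key;
        item["Title"] = title;
        item.Update();
    }
}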

How to query an SQLite db in batches

I am using C# with .NET 4.5. I am making a scraper which collects specific data. Each time a value is scraped, I need to make sure it hasn't already been added to the SQLite db.
To do this, I am making a call each time a value is scraped to query against the db to check if it contains the value, and if not, I make another call to insert the value into the db.
Since I am scraping multiple values per second, this gets to be very IO-intensive, with constant calls to the db.
My question is, is there any better way to do this? Perhaps I could queue the values scraped and then run a batch query at once? Is that possible?
I see three approaches:
Use INSERT OR IGNORE, which will reject an entry if it is already present (based on the primary key and unique fields). Or use plain INSERT (or its equivalent, INSERT OR ABORT), which will return SQLITE_CONSTRAINT, a value you will have to catch and handle if you want to count failed insertions.
Accumulate, outside the database, the updates you want to make. When you have accumulated enough (or all of them), start a transaction (BEGIN;), do your insertions (you can use INSERT OR IGNORE here as well), and commit the transaction (COMMIT;) -- see the sketch after this list.
You could pre-fetch a list of the items you already have and check against that list in memory, if your data model allows it.
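A minimal sketch of approach 2, assuming the Microsoft.Data.Sqlite package (System.Data.SQLite is analogous) and a hypothetical scraped(value) table with a unique constraint on value:

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

static void FlushBatch(string dbPath, IEnumerable<string> values)
{
    using (var conn = new SqliteConnection("Data Source=" + dbPath))
    {
        conn.Open();
        // One transaction per batch: a single disk sync instead of one per value.
        using (var tx = conn.BeginTransaction())
        using (var cmd = conn.CreateCommand())
        {
            cmd.Transaction = tx;
            cmd.CommandText = "INSERT OR IGNORE INTO scraped(value) VALUES ($v)";
            var p = cmd.Parameters.Add("$v", SqliteType.Text);

            foreach (var v in values)
            {
                p.Value = v;
                cmd.ExecuteNonQuery(); // duplicates are silently skipped
            }
            tx.Commit();
        }
    }
}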

Remove duplicate values from ListView with the lower TIME value?

I have a listview control that is filled with returned records from a SQL Statement. The fields may be something like:
SSN       | NAME | DATE     | TIME   | SYS
111222333 | Bell | 20140130 | 121507 | P
123456789 | John | 20140225 | 135000 | P
123456789 | John | 20140225 | 135002 | N
The "duplicates" are generated from a ChangeLog, such as a change of address. Due to bad database design I have no control over however, an address change will create 2 records if a member happens to be a member of both SYS.
What would be the best way to go through each record in my listview, find duplicate values of SSN & DATE (There can be a record generated for both SYS if person is a member of both), and remove the duplicate value with the lower TIME value?
I'm trying to do a code-based solution instead of SQL because the true SQL statement is already highly complex and this application needs to only be maintained until October.
For this, I've assumed you have some class with the record's properties exposed for easy access, like SSN, Date and Time, and that they are all strings. In the code below I refer to this object as Record.
HINT: You might instead want to remove the items with the SYS flag set to N instead of judging by time (it probably doesn't make a difference).
I did not use any lambda functions, on purpose, to keep this simple and easy to read.
Call this code every time you load items into the ListView. It would actually be a better idea to sanitize the list before you load it into the ListView, but the code below answers your question based on the available info.
// Turn the ListView's ItemCollection into an easy to use List<Record>
List<Record> records = myListView.Items.OfType<Record>().ToList();

// Collect the records that share another record's SSN and Date
// but carry the lower Time value
List<Record> recordsToRemove = new List<Record>();
foreach (var record in records)
{
    foreach (var r in records)
    {
        if (record.SSN == r.SSN && record.Date == r.Date && record != r)
        {
            // keep the record with the higher Time; mark the other for removal
            Record loser = int.Parse(r.Time) > int.Parse(record.Time) ? record : r;
            if (!recordsToRemove.Contains(loser))
                recordsToRemove.Add(loser);
        }
    }
}

// Now actually remove the items from the ListView
foreach (var record in recordsToRemove)
{
    myListView.Items.Remove(record);
}
