Kafka: track offsets in all partitions - C#

I will try to explain what I am trying to achieve.
All I know is the topic name, and from that I need to drill down to its partitions. First I tried
consumer.Subscribe(topics)
and
consumer.Assignment
but if there is no delay between the calls, it returns an empty list.
I could use consumer.Assign(..), but I don't know the exact partitions and offsets yet.
Next, once I am able to get down to the partition level, I need to get the low/high offsets for a time range.
For example, topic "test" has 5 partitions, and I need to extract the info (partition, offset) for all messages inserted between 10:00 and 10:05.
If any additional info is needed, just let me know.
Thanks

I'm not 100% clear on what you are aiming for, but some information about assignment may help here.
The assignment() method returns an empty list before the first call to poll() after a consumer joins the group, and again after a rebalance. This is because, with automatic assignment, the consumer only learns its assignment as one of the steps inside the poll method, just before the actual records are fetched.
You can find out the actual assigned partitions either by calling poll at least once before calling assignment() - I think that is what you have discovered - or by passing a ConsumerRebalanceListener when calling subscribe(). Its onPartitionsAssigned method is invoked during the poll - essentially a callback - with the collection of newly assigned partitions as its argument, so your code can discover the current assignment before any records are fetched.
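That is the Java API; since you are on the C# client, the closest equivalent I'm aware of is SetPartitionsAssignedHandler on ConsumerBuilder. A rough sketch, with broker address, group id and topic name as placeholders:

using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",   // placeholder
    GroupId = "my-group"                   // placeholder
};

// The handler fires inside Consume() (the poll equivalent), before any
// records are returned, so the assignment is known at that point.
using var consumer = new ConsumerBuilder<Ignore, string>(config)
    .SetPartitionsAssignedHandler((c, partitions) =>
    {
        foreach (var tp in partitions)
            Console.WriteLine($"Assigned: {tp.Topic} [{tp.Partition}]");
    })
    .Build();

consumer.Subscribe("test");                // placeholder topic
var result = consumer.Consume(TimeSpan.FromSeconds(10));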
Hope this helps a bit - I have written up a blog post about this aspect of assignment but haven't yet published it - I'll add a link when I do, if it sounds like this is the issue you are facing.

I went for a slightly different approach:
Load the metadata from IAdminClient to get all available partitions for the topic.
Create a TopicPartitionTimestamp with the start timestamp I need to consume from.
Assign to it and consume from there.
I also chose to start the consumption of each partition on a separate thread.
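Roughly, with Confluent.Kafka, that looks something like the sketch below (broker, topic and time range are placeholders; I'm assuming OffsetsForTimes is used to turn the start timestamp into a per-partition offset, and showing a single consumer rather than one thread per partition):

using System;
using System.Linq;
using Confluent.Kafka;

var bootstrap = "localhost:9092";   // placeholder
var topic = "test";                 // placeholder
var start = new DateTime(2023, 1, 1, 10, 0, 0, DateTimeKind.Utc);   // placeholder range
var end = start.AddMinutes(5);

using var admin = new AdminClientBuilder(
    new AdminClientConfig { BootstrapServers = bootstrap }).Build();
using var consumer = new ConsumerBuilder<Ignore, string>(
    new ConsumerConfig { BootstrapServers = bootstrap, GroupId = "offset-scan" }).Build();

// 1. Load metadata to find every partition of the topic.
var metadata = admin.GetMetadata(topic, TimeSpan.FromSeconds(10));
var partitions = metadata.Topics.Single(t => t.Topic == topic).Partitions;

// 2. Translate the start timestamp into an offset for each partition.
var timestamps = partitions
    .Select(p => new TopicPartitionTimestamp(
        new TopicPartition(topic, p.PartitionId), new Timestamp(start)))
    .ToList();
var startOffsets = consumer.OffsetsForTimes(timestamps, TimeSpan.FromSeconds(10));

// 3. Assign those partition/offset pairs and consume until the
//    message timestamps pass the end of the range.
consumer.Assign(startOffsets);
while (true)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(5));
    if (result == null) break;                               // no more data for now
    if (result.Message.Timestamp.UtcDateTime > end) break;   // past the range
    Console.WriteLine($"{result.TopicPartition} @ {result.Offset}");
}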

How do I find out if a DynamoDB table is empty?

How can I find out if a DynamoDB table contains any items using the .NET SDK?
One option is to do a Scan operation, and check the returned Count. But Scans can be costly for large tables and should be avoided.
The DescribeTable item count does not return a real-time value; it is only updated approximately every six hours.
The best way is to run a single Scan with no filter expression and check the count. This need not be costly: you scan the table only once, and you don't need to scan it recursively just to find out whether it contains any item at all.
A single Scan call returns at most 1 MB of data.
If the use case requires a real-time value, this is the best (and only) option available.
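A rough sketch of that check with the .NET SDK (v3-style async API; the table name is a placeholder, and Limit = 1 is my addition to keep the scan as small as possible):

using System;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Scan with no filter expression; Limit = 1 stops after a single item is read.
var client = new AmazonDynamoDBClient();
var response = await client.ScanAsync(new ScanRequest
{
    TableName = "FooTable",   // placeholder
    Limit = 1
});

bool isEmpty = response.Count == 0;
Console.WriteLine(isEmpty ? "Table is empty" : "Table has items");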
Edit: While the below appears to work fine with small tables on localhost, the docs state
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so only use DescribeTable if you don't need an accurate, up-to-date figure.
Original:
It looks like the best way to do this is to use the DescribeTable method on AmazonDynamoDBClient:
AmazonDynamoDBClient client = ...
if (client.DescribeTable("FooTable").Table.ItemCount == 0)
// do stuff

Assign unique numbers in database when saving the form

I have a form where you can create an order, and when you save it, the application checks the database (Oracle) for the last order number and assigns the next one to the newly saved order. What I found is that if two users save a new order at the same time, or a few seconds apart, the connection speed means my app fails to assign different numbers to the two new orders: both check the last assigned number at the same time, so both orders end up with the same number.
I have some ideas, but all of them have advantages and disadvantages:
1. Have the system wait a few seconds and re-check the order number when the user saves the order. But if both users saved at the same time, the re-check will also run at the same time later, and I guess I would end up with the same problem.
2. Have the system check the order numbers (a check already runs every time the treeview is refreshed), detect duplicates, and flag them to the user with a highlight in the treeview. But if any documents are attached to an order before the check runs, I end up with documents whose names and contents carry a different number than the order they belong to.
3. Have the system check all order numbers periodically and give one of the duplicates a new number. This has the same document problem as #2, and it might also cause performance issues.
4. Assign the order number when the user requests a new order rather than when he saves it. I could combine this with #1 and re-check whether the number is already in use in the database, then reassign a new one if so. Again, if documents have already been attached, someone has to go and fix them.
One way to possibly stop documents from being attached to duplicates is to let the user enter only part of the information, save or apply it, run the re-check from #1, and only then, if nothing is found, let the user add documents. This could be combined with any of the options above, but I don't want to delay the user's work while the numbers are being checked.
Please let me know if you see any improvements to the ideas above, or if you have new ones.
I need the best solution possible, one that affects the users' current workflow as little as possible.
If your order ID is just a number, you can use an Oracle sequence.
CREATE SEQUENCE order_id;
And before you save the record get a new order number.
SELECT order_id.NEXTVAL FROM DUAL;
See also Oracle/PLSQL: Sequences (Autonumber)
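From C#, getting the number just before the save might look roughly like this (assuming Oracle.ManagedDataAccess; the table and column names are made up):

using System;
using Oracle.ManagedDataAccess.Client;

using var conn = new OracleConnection("<your connection string>");
conn.Open();

// Get the next order number from the sequence...
using var next = new OracleCommand("SELECT order_id.NEXTVAL FROM DUAL", conn);
decimal orderId = Convert.ToDecimal(next.ExecuteScalar());

// ...and use it when saving the order (made-up table/columns).
using var insert = new OracleCommand(
    "INSERT INTO orders (id, created_at) VALUES (:id, SYSDATE)", conn);
insert.Parameters.Add(new OracleParameter("id", orderId));
insert.ExecuteNonQuery();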

Generating unique ID for clients on a system running over a LAN in C#

I have a simple client registration system that runs over a network. The system is supposed to generate a unique three-digit ID (the primary key) with the current year appended (e.g. 001-2013). However, I've run into the problem that the same primary key gets generated when two users on different computers (over a LAN) try to register different clients.
Also, what if the user cancels the registration after an ID has already been generated? I have to reuse that ID for another client. I've read about static variables, but that didn't solve my problem. I'd really appreciate your ideas.
Unique and sequential IDs are hard to implement. To achieve this completely you would have to serialize the commit that creates the client information, so that the ID is generated only when the data is actually stored; otherwise you'll end up with holes whenever something goes wrong during submission.
If you don't need strictly sequential numbers, handing out ranges of IDs (1-22, 23-44, ...) to each system is a common approach. Instead of ranges you can hand out lists of IDs to use ({1, 3, 233, 234}, {235, 236, 237}) if you need to use as many IDs as possible.
Issue:
New item -001 is created, but not saved yet
New item -002 is created, but not saved yet
Item -001 is cancelled
What to do with ID -001?
The easiest solution is to simply not assign an ID until an item is definitely stored.
An alternative: when finally saving an item, look up the first free ID. If the item from step 2 (#2) is saved before the one from step 1, #2 gets ID -001. When #1 is then saved, the saving logic sees that its claimed ID (-001) is already in use, so it assigns -002. So IDs get reassigned.
Finally, you can simply find the next free ID when creating a new item. In the three steps described above, this means you initially have a gap where -001 is supposed to be. If you now create a new item, your code will see that -001 is unused and assign it to the new item.
But (and this depends entirely on requirements you didn't specify) -001 was then created later in time than -002, and I don't know whether that is allowed. Furthermore, at any given moment you can have a gap in your numbering where an item has been cancelled. If that happens at the end of a reporting period, it will cause errors (-033, -034, -036).
You also might want to include an auto-incrementing primary key instead of this invoice number or whatever it is.
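A minimal sketch of the first option (assign only when actually stored): let the database hand out the sequential number via an identity column or sequence as part of the INSERT, and derive the display ID from it afterwards. Everything here is illustrative, not your actual schema:

using System;

// The database assigns the number at save time; we only format it here.
static string ToDisplayId(int dbNumber, int year) => $"{dbNumber:D3}-{year}";

// e.g. the INSERT returned 1 for the first client saved in 2013:
Console.WriteLine(ToDisplayId(1, 2013));   // prints "001-2013"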

IndexedPageItemView not returning items when using an offset from the beginning

We have a new application in the works where we use streaming subscriptions to read emails as they arrive in our mailbox. Since we would also like to read all past emails (roughly 50,000), we need some way of finding them manually instead of relying on the notification event provided by EWS. The plan is to serialize the list of emails we read and parse them for future database testing scenarios.
I originally posted this question on StackOverflow asking how to retrieve all emails from a mailbox regardless of which folder they are in. Essentially, the mailbox I'm after organizes emails by year, month and day; issuing a FindFolders() operation yields around 1,300 folders, and I would rather not issue a FindItems() operation for every folder, as it would take hours upon hours to read every email. Henning Krause mentioned that the managed API does not support querying Exchange for emails in a given list of parent folder ids, so instead I have to use the FindItem operation.
After some testing I was able to retrieve my limit of items (1,000). Since the FindItem operation doesn't directly tell you whether there are more items for your query, I simply check whether it returned 1,000 items; if so, it is safe to query again, until the result count is less than 1,000. Each time through the loop I increment my offset by 1,000.
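Conceptually it is the same paging loop the EWS Managed API performs with an ItemView offset against a single folder, roughly like this (folder, URL and page size here are just placeholders; my actual requests use the raw FindItem operation as mentioned above):

using System;
using Microsoft.Exchange.WebServices.Data;

var service = new ExchangeService(ExchangeVersion.Exchange2013)
{
    UseDefaultCredentials = true,
    Url = new Uri("https://example.com/EWS/Exchange.asmx")   // placeholder
};

int offset = 0;
const int pageSize = 1000;
FindItemsResults<Item> page;
do
{
    // Page through the folder, offsetting from the beginning each time.
    var view = new ItemView(pageSize, offset, OffsetBasePoint.Beginning);
    page = service.FindItems(WellKnownFolderName.Inbox, view);   // placeholder folder
    foreach (var item in page.Items)
        Console.WriteLine(item.Subject);
    offset += page.Items.Count;
} while (page.MoreAvailable);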
When my offset is set to 0 I get the first 1,000 emails with no issues. On the second iteration I set my offset to 1,000 and receive no items, even though the operation reports success. Here is what I receive when setting an offset (if it matters, I'm offsetting from the beginning). Below you can see one of the many FindItemResponseMessage nodes, which clearly shows that there are 29 items in the view and that all of them have already been returned.
<m:FindItemResponseMessage ResponseClass="Success">
<m:ResponseCode>NoError</m:ResponseCode>
<m:RootFolder IndexedPagingOffset="29" TotalItemsInView="29" IncludesLastItemInRange="true">
<t:Items />
</m:RootFolder>
</m:FindItemResponseMessage>
What am I missing?
Thank you.

Can someone explain map-reduce in C#?

Can anyone please explain the concept of map-reduce, particularly in Mongo?
I also use C#, so any specifics in that area would also be useful.
One way to understand Map-Reduce coming from C# and LINQ is to think of it as a SelectMany() followed by a GroupBy() followed by an Aggregate() operation.
In a SelectMany() you are projecting a sequence, but each element can become multiple elements. This is equivalent to using multiple emit statements in your map operation; the map operation can also choose not to call emit at all, which is like having a Where() clause inside your SelectMany().
In a GroupBy() you are collecting elements with the same key, which is what map-reduce does with the key value you emit from the map operation.
In the Aggregate() or reduce step you take the collection associated with each group key and combine it in some way to produce one result per key. Often this combination is simply adding up a '1' value emitted with each key from the map step, but sometimes it's more complicated.
One important caveat with MongoDB's map-reduce is that the reduce operation must accept and output the same data type, because it may be applied repeatedly to partial sets of the grouped data. If you are passed an array of values, don't simply take its length, because it might be a partial result from an earlier reduce operation.
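As a small illustrative sketch (plain LINQ, not MongoDB code), a word count maps onto that analogy like this:

using System;
using System.Collections.Generic;
using System.Linq;

var documents = new List<string> { "the quick brown fox", "the lazy dog the fox" };

var counts = documents
    // "map": each document emits one (word, 1) pair per word
    .SelectMany(doc => doc.Split(' ').Select(word => new { Key = word, Value = 1 }))
    // group by the emitted key
    .GroupBy(pair => pair.Key)
    // "reduce": combine the values for each key into a single result
    .Select(g => new { Word = g.Key, Count = g.Aggregate(0, (acc, p) => acc + p.Value) });

foreach (var c in counts)
    Console.WriteLine($"{c.Word}: {c.Count}");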
Here's a spot to get started with Map Reduce in Mongo. The cookbook has a few examples, I would focus on these two.
I like to think of map-reduces in the context of "data warehousing jobs" or "rollups". You're basically taking detailed data and "rolling up" a smaller version of that data.
In SQL you would normally do this with sum() and avg() and group by. In MongoDB you would do this with a Map Reduce. The basic premise of a Map Reduce is that you have two functions.
The first function (map) is basically a giant for loop that runs over your data and "emits" certain keys and values. The second function (reduce) is a giant loop over all of the emitted data. The map says "hey, this is the data you want to summarize" and the reduce says "hey, this array of values reduces to this single value".
The output from a map-reduce can come in many forms (typically flat files). In MongoDB, the output is actually a new collection.
C# Specifics
In MongoDB, all map-reduces are performed inside the JavaScript engine, so both the map and reduce functions are written in JavaScript. The various drivers will let you build the JavaScript and issue the command; however, this is not how I normally do it.
The preferred method for running map-reduce jobs is to put the JS in a file and then run mongo map_reduce.js. Generally you'll do this on the server somewhere, as a cron job or a scheduled task.
Why?
Well, map-reduce is not "real-time", especially with a big data set. It's really designed to be used in a batch fashion. Don't get me wrong, you can call it from your code, but generally you don't want users to initiate map-reduce jobs. Instead you want those jobs to be scheduled, and you want users to be querying the results :)
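That said, if you do want to issue one from C#, a rough sketch with the 2.x MongoDB .NET driver looks something like this (database, collection and field names are made up, and the exact API differs between driver versions):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");   // placeholder
var db = client.GetDatabase("shop");                         // placeholder
var orders = db.GetCollection<BsonDocument>("orders");       // placeholder

// The map and reduce functions are still JavaScript, sent to the server.
var map = new BsonJavaScript("function() { emit(this.category, this.total); }");
var reduce = new BsonJavaScript("function(key, values) { return Array.sum(values); }");

var options = new MapReduceOptions<BsonDocument, BsonDocument>
{
    OutputOptions = MapReduceOutputOptions.Inline
};

var results = orders.MapReduce(map, reduce, options).ToList();
foreach (var doc in results)
    Console.WriteLine(doc);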
Map-reduce is a way to process data where you have a map stage/function that identifies all the data to be processed and processes it, row by row.
Then you have a reduce step/function that can be run multiple times, for example once per server in a cluster and then once on the client, to return a final result.
Here is a Wikipedia article describing it in more detail:
http://en.wikipedia.org/wiki/MapReduce
And here is the MongoDB documentation for map-reduce:
http://www.mongodb.org/display/DOCS/MapReduce
A simple example: find the longest string in a list.
The map step will loop over the list calculating the length of each string; the reduce step will loop over the results from the map and, for each one, keep the longest.
This can of course be much more complex, but that's the essence of it.
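In C#, that example could be sketched like this:

using System;
using System.Collections.Generic;
using System.Linq;

var strings = new List<string> { "a", "abcd", "ab" };

// map: project each string to (string, length)
var mapped = strings.Select(s => new { Value = s, Length = s.Length });

// reduce: keep the longest seen so far
var longest = mapped.Aggregate((best, next) => next.Length > best.Length ? next : best);

Console.WriteLine(longest.Value);   // "abcd"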
