Reading multiple Hashes from Redis in one call - c#

I want to search for a key with highest value from multiple hashes in Redis. My keys are of this format -
emp:1, emp:2,...emp:n
Each having values in this format -
1. name ABC
2. salary 1234
3. age 23
I want to find an oldest employee from these Hashes. From what I have read about Redis there is no way to read multiple hashes in one call. Which means I need to iterate through all the emp keys and call HGETALL on each to get the desired result (I do have a set where all the emp ids are stored).
Is there a way I can minimize the number of hits to get this working?

You can use a pipeline in Redis to run multiple commands and get their responses. That should allow you to execute multiple HGETALL commands. See the docs for more info. Not sure what library you are using for C#, but it should provide a way for you to use a pipeline.
You could also create a Lua script to iterate over the Redis keys and return the hash for the oldest employee.

tldr;
Yes you are right
... there is no way to read multiple hashes in one call ...
And so is #TheDude
... You could also create a Lua script to iterate over the Redis keys ...
Adding to it
It Appears that you are using Redis as a database. You have stored all your domain data and now you want to query it. This is misuse of Redis. It can be done but that is not what it was meant for. For this activity, if you use a real database it will be easier and more performant.
Redis is meant for caching frequently-used data[Note:1]. Note the two words (1) caching and (2) frequently-used. Caching is to store temporarily. If you want permanent storage - after server reboot - go for a database. Frequently-used says not to store All your data in there. Store only the subset that is actively being used. You can use Redis with all your data and even with permanent-store turned on, but then you have to tread very carefully.
For you purpose it seems using a generic database and SELECT MAX(age) FROM ... will be equally good if not better.
Or maybe,
You have quoted only part of the real problem and actually you are following the Redis best practices. In that cases I would suggest having a separate Sorted Set. For every employee inserted into the main keyset, also do ZADD employeeages 80 Alen where 80 is the age and Alen is presumable the ID of the person Alen.
To get the person ('s ID) with the maximum age, you can do
ZREVRANGEBYSCORE employeeages +inf -inf WITHSCORES LIMIT 0 1
If that looks bizarre then you are right - this is something very interesting! This will get your data not only in a single call, but in a single step in that call! Consider this: lets say you have a million employees (waao). Then this approach to get the oldest employee will be fastest, using a database and SELECT MAX(... will be runner up and your HGETALL or Lua script will be the slowest.
Use this approach if ages of your employees are frequently changing - like scores of players of an online game and you frequently want to query the topper or the looser - like updating the leaderboards. The downside of using this approach in place of a database is high redundancy. When (say) the address of an employee changes, you need to change a lot of records and to do that you need to make a lot of calls.
[1] As noted in comments, Redis is much more than just a cache for frequently-used data. I believe for this discussion, this definition is sufficient.

Related

Best approach to track Amount field on Invoice table when InvoiceItem items change?

I'm building an app where I need to store invoices from customers so we can track who has paid and who has not, and if not, see how much they owe in total. Right now my schema looks something like this:
Customer
- Id
- Name
Invoice
- Id
- CreatedOn
- PaidOn
- CustomerId
InvoiceItem
- Id
- Amount
- InvoiceId
Normally I'd fetch all the data using Entity Framework and calculate everything in my C# service, (or even do the calculation on SQL Server) something like so:
var amountOwed = Invoice.Where(i => i.CustomerId == customer.Id)
.SelectMany(i => i.InvoiceItems)
.Select(ii => ii.Amount)
.Sum()
But calculating everything every time I need to generate a report doesn't feel like the right approach this time, because down the line I'll have to generate reports that should calculate what all the customers owe (sometimes go even higher on the hierarchy).
For this scenario I was thinking of adding an Amount field on my Invoice table and possibly an AmountOwed on my Customer table which will be updated or populated via the InvoiceService whenever I insert/update/delete an InvoiceItem. This should be safe enough and make the report querying much faster.
But I've also been searching some on this subject and another recommended approach is using triggers on my database. I like this method best because even if I were to directly modify a value using SQL and not the app services, the other tables would automatically update.
My question is:
How do I add a trigger to update all the parent tables whenever an InvoiceItem is changed?
And from your experience, is this the best (safer, less error-prone) solution to this problem, or am I missing something?
There are many examples of triggers that you can find on the web. Many are poorly written unfortunately. And for future reference, post DDL for your tables, not some abbreviated list. No one should need to ask about the constraints and relationships you have (or should have) defined.
To start, how would you write a query to calculate the total amount at the invoice level? Presumably you know the tsql to do that. So write it, test it, verify it. Then add your amount column to the invoice table. Now how would you write an update statement to set that new amount column to the sum of the associated item rows? Again - write it, test it, verify it. At this point you have all the code you need to implement your trigger.
Since this process involves changes to the item table, you will need to write triggers to handle all three types of dml statements - insert, update, and delete. Write a trigger for each to simplify your learning and debugging. Triggers have access to special tables - go learn about them. And go learn about the false assumption that a trigger works with a single row - it doesn't. Triggers must be written to work correctly if 0 (yes, zero), 1, or many rows are affected.
In an insert statement, the inserted table will hold all the rows inserted by the statement that caused the trigger to execute. So you merely sum the values (using the appropriate grouping logic) and update the appropriate rows in the invoice table. Having written the update statement mentioned in the previous paragraphs, this should be a relatively simple change to that query. But since you can insert a new row for an old invoice, you must remember to add the summed amount to the value already stored in the invoice table. This should be enough direction for you to start.
And to answer your second question - the safest and easiest way is to calculate the value every time. I fear you are trying to solve a problem that you do not have and that you may never have. Generally speaking, no one cares about invoices that are of "significant" age. You might care about unpaid invoices for a period of time, but eventually you write these things off (especially if the amounts are not significant). Another relatively easy approach is to create an indexed view to calculate and materialize the total amount. But remember - nothing is free. An indexed view must be maintained and it will add extra processing for DML statements affecting the item table. Indexed views do have limitations - which are documented.
And one last comment. I would strongly hesitate to maintain a total amount at any level higher than invoice. Above that level one frequently wants to filter the results in any ways - date, location, type, customer, etc. At this level you are approaching data warehouse functionality which is not appropriate for a OLTP system.
First of all never use triggers for business logic. Triggers are tricky and easily forgettable. It will be hard to maintain such application.
For most cases you can easily populate your reporting data via entity framework or SQL query. But if it requires lots of joins then you need to consider using staging tables. Because reporting requires data denormalization. To populate staging tables you can use SQL jobs or other schedule mechanism (Azure Scheduler maybe). This way you won't need to work with lots of join and your reports will populate faster.

C# Zip Function to iterate through two lists of objects

I have a program that creates a list of objects from a file, and also creates a list of the same type of object, but with fewer/and some different properties, from the database, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
list from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyways, I see there's a Zip method. I was wondering if I could use that to say "if there's a match between the two lists, keep the data in list 1 but update the address id property of each object in list 1 to the address Id from list 2".
Is that possible?
The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the zip method, since it has less work involved) and if it works within the parameters of your requirements (time or any other parameter, memory footprint?), you can stop there.
Otherwise you have to off load it to the database. Mind you, you would have to hold the 1 million records from files and 1 million records from DB in memory at the same time if you want to use the zip method.
The problem with pushing everything to the database is, inserting that many records is resource (time, space etc) consuming. Moreover if you want to do that maybe everyday, it is going to be more difficult, resource wise.
Your question didn't say if this was going to be a one time thing or a daily event in a production environment. Even that is going to make a difference in which approach to choose.
To repeat, you would have to try different approaches to see which will work best for you based on your requirements: is this a one time thing? How much resources does the process have? How much time does it have? and possibly many more.
It kindof sounds also like a job for .Aggregate() aka
var aggreg = list1.Aggregate(otherListPrefilled, (acc,elemFrom1) =>
{
// some code to create the joined data, usig elemFrom1 to find
// and modify the correct element in otherListPrefilled
return acc;
});
normally I would use an empty otherListPrefilled, not sure how it performs on 100k data items though.
If its a onetime thing, its probably faster to put your file to a csv, import it in your database as temporary table and join the data in sql.

API - filter big list with word fragment

I have asp.net web api application. In database I have a big list (between 100.000 and 200.000) of pairs like id:name and this list could be changed quite rarely. I need to implement filtering like this /pair/filter?fragment=bla. It should return first 25 pairs where any word in name starts with word fragment. I see two approachs here: 1st approach is to load data into cache (HttpRuntimeCache, redis or smth like this) to increase loading time and filter in linq. But I think there will be problems with time required for serialiazing/deserialiazing. Another approach: for instance I have a pair 22:some title here so I need to provide separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with primary key on both columns and separate index on FRAGMENT column to make queries faster. Any offers and remarks are welcome.
UPD: now I've refreshed my mind. I don't want to query database because requests happen quite often. So now I see the best solution is
load entire list in memory
build trie structure which keeps hashset of values in each node
in case of one text fragment - just return the hashset from trie node, in case of few fragments - find all hashsets and get their intersection
You could try a full-text index on your current DB (if its supported) and the CONTAINS keyword like so
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" in the entire string, and also match the string "Monkeys blabla"
I dont really understand your question but if you want to query any table you can do so since you already have the queryString. You can try this out.
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesnt help. Try to to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

Bulk insert strategy from c# to SQL Server

In our current project, customers will send collection of a complex/nested messages to our system. Frequency of these messages are approx. 1000-2000 msg/per seconds.
These complex objects contains the transaction data (to be added) as well as master data (which will be added if not found). But instead of passing the ids of the master data, customer passes the 'name' column.
System checks if master data exist for these names. If found, it uses the ids from database otherwise create this master data first and then use these ids.
Once master data ids are resolved, system inserts the transactional data to a SQL Server database (using master data ids). Number of master entities per message are around 15-20.
Following are the some strategies we can adopt.
We can resolve master ids first from our C# code (and insert master data if not found) and store these ids in C# cache. Once all ids are resolved, we can bulk insert the transactional data using SqlBulkCopy class. We can hit the database 15 times to fetch the ids for different entities and then hit database one more time to insert the final data. We can use the same connection will close it after doing all this processing.
We can send all these messages containing master data and transactional data in single hit to the database (in the form of multiple TVP) and then inside stored procedure, create the master data first for the missing ones and then insert the transactional data.
Could anyone suggest the best approach in this use case?
Due to some privacy issue, I cannot share the actual object structure. But here is the hypothetical object structure which is very close to our business object.
One such message will contain information about one product (its master data) and its price details (transaction data) from different vendors:
Master data (which need to be added if not found)
Product name: ABC, ProductCateory: XYZ, Manufacturer: XXX and some other other details (number of properties are in the range of 15-20).
Transaction data (which will always be added)
Vendor Name: A, ListPrice: XXX, Discount: XXX
Vendor Name: B, ListPrice: XXX, Discount: XXX
Vendor Name: C, ListPrice: XXX, Discount: XXX
Vendor Name: D, ListPrice: XXX, Discount: XXX
Most of the information about the master data will remain the same for a message belong to one product (and will change less frequently) but transaction data will always fluctuate. So, system will check if the product 'XXX' exist in the system or not. If not it check if the 'Category' mentioned with this product exist of not. If not, it will insert a new record for category and then for product. This will be done to for Manufacturer and other master data.
Multiple vendors will be sending data about multiple products (2000-5000) at the same time.
So, assume that we have 1000 suppliers, Each vendor is sending data about 10-15 different products. After each 2-3 seconds, every vendor sends us the price updates of these 10 products. He may start sending data about new products, but which will not be very frequent.
You would likely be best off with your #2 idea (i.e. sending all of the 15 - 20 entities to the DB in one shot using multiple TVPs and processing as a whole set of up to 2000 messages).
Caching master data lookups at the app layer and translating prior to sending to the DB sounds great, but misses something:
You are going to have to hit the DB to get the initial list anyway
You are going to have to hit the DB to insert new entries anyway
Looking up values in a dictionary to replace with IDs is exactly what a database does (assume a Non-Clustered Index on each of these name-to-ID lookups)
Frequently queried values will have their datapages cached in the buffer pool (which is a memory cache)
Why duplicate at the app layer what is already provided and happening right now at the DB layer, especially given:
The 15 - 20 entities can have up to 20k records (which is a relatively small number, especially when considering that the Non-Clustered Index only needs to be two fields: Name and ID which can pack many rows into a single data page when using a 100% Fill Factor).
Not all 20k entries are "active" or "current", so you don't need to worry about caching all of them. So whatever values are current will be easily identified as the ones being queried, and those data pages (which may include some inactive entries, but no big deal there) will be the ones to get cached in the Buffer Pool.
Hence, you don't need to worry about aging out old entries OR forcing any key expirations or reloads due to possibly changing values (i.e. updated Name for a particular ID) as that is handled naturally.
Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use-cases are for when non-database processes are requesting the same data over and over in pure read-only purposes. But this particular scenario is one in which data is being merged and the list of lookup values can be changing frequently (moreso due to new entries than due to updated entries).
That all being said, Option #2 is the way to go. I have done this technique several times with much success, though not with 15 TVPs. It might be that some optimizations / adjustments need to be made to the method to tune this particular situation, but what I have found to work well is:
Accept the data via TVP. I prefer this over SqlBulkCopy because:
it makes for an easily self-contained Stored Procedure
it fits very nicely into the app code to fully stream the collection(s) to the DB without needing to copy the collection(s) to a DataTable first, which is duplicating the collection, which is wasting CPU and memory. This requires that you create a method per each collection that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record in the for or foreach loop.
TVPs are not great for statistics and hence not great for JOINing to (though this can be mitigated by using a TOP (#RecordCount) in the queries), but you don't need to worry about that anyway since they are only used to populate the real tables with any missing values
Step 1: Insert missing Names for each entity. Remember that there should be a NonClustered Index on the [Name] field for each entity, and assuming that the ID is the Clustered Index, that value will naturally be a part of the index, hence [Name] only will provide a covering index in addition to helping the following operation. And also remember that any prior executions for this client (i.e. roughly the same entity values) will cause the data pages for these indexes to remain cached in the Buffer Pool (i.e. memory).
;WITH cte AS
(
SELECT DISTINCT tmp.[Name]
FROM #EntityNumeroUno tmp
)
INSERT INTO EntityNumeroUno ([Name])
SELECT cte.[Name]
FROM cte
WHERE NOT EXISTS(
SELECT *
FROM EntityNumeroUno tab
WHERE tab.[Name] = cte.[Name]
)
Step 2: INSERT all of the "messages" in simple INSERT...SELECT where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool due to Step 1
Finally, keep in mind that conjecture / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation since there might be additional details that have not been shared that could influence what is considered "ideal" here.
I will say that if the Messages are insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex and required full syncing (updates and deletes) and did additional validations and creation of related operational data (not lookup values). Using SqlBulkCopy might be faster on straight inserts (though for only 2000 records I doubt there is much difference if any at all), but this assumes you are loading directly to the destination tables (messages and lookups) and not into intermediary / staging tables (and I believe Vlad's idea is to SqlBulkCopy directly to the destination tables). However, as stated above, using an external cache (i.e. not the Buffer Pool) is also more error prone due to the issue of updating lookup values. It could take more code than it's worth to account for invalidating an external cache, especially if using an external cache is only marginally faster. That additional risk / maintenance needs to be factored into which method is overall better for your needs.
UPDATE
Based on info provided in comments, we now know:
There are multiple Vendors
There are multiple Products offered by each Vendor
Products are not unique to a Vendor; Products are sold by 1 or more Vendors
Product properties are singular
Pricing info has properties that can have multiple records
Pricing info is INSERT-only (i.e. point-in-time history)
Unique Product is determined by SKU (or similar field)
Once created, a Product coming through with an existing SKU but different properties otherwise (e.g. category, manufacturer, etc) will be considered the same Product; the differences will be ignored
With all of this in mind, I will still recommend TVPs, but to re-think the approach and make it Vendor-centric, not Product-centric. The assumption here is that Vendor's send files whenever. So when you get a file, import it. The only lookup you would be doing ahead of time is the Vendor. Here is the basic layout:
Seems reasonable to assume that you already have a VendorID at this point because why would the system be importing a file from an unknown source?
You can import in batches
Create a SendRows method that:
accepts a FileStream or something that allows for advancing through a file
accepts something like int BatchSize
returns IEnumerable<SqlDataRecord>
creates a SqlDataRecord to match the TVP structure
for loops though the FileStream until either BatchSize has been met or no more records in the File
perform any necessary validations on the data
map the data to the SqlDataRecord
call yield return;
Open the file
While there is data in the file
call the stored proc
pass in VendorID
pass in SendRows(FileStream, BatchSize) for the TVP
Close the file
Experiment with:
opening the SqlConnection before the loop around the FileStream and closing it after the loops are done
Opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside of the FileStream loop
Experiment with various BatchSize values. Start at 100, then 200, 500, etc.
The stored proc will handle inserting new Products
Using this type of structure you will be sending in Product properties that are not used (i.e. only the SKU is used for the look up of existing Products). BUT, it scales very well as there is no upper-bound regarding file size. If the Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on and it did handle updating Product info that was different for any of its properties!), then fine. No increase in memory at the app layer or DB layer to handle even 10 million Products. The time the import takes should increase in step with the amount of Products sent.
UPDATE 2
New details related to Source data:
comes from Azure EventHub
comes in the form of C# objects (no files)
Product details come in through O.P.'s system's APIs
is collected in single queue (just pull data out insert into database)
If the data source is C# objects then I would most definitely use TVPs as you can send them over as is via the method I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the Price/Offer per Vendor details but regular input params for the singular Property attributes. For example:
CREATE PROCEDURE dbo.ImportProduct
(
#SKU VARCHAR(50),
#ProductName NVARCHAR(100),
#Manufacturer NVARCHAR(100),
#Category NVARCHAR(300),
#VendorPrices dbo.VendorPrices READONLY,
#DiscountCoupons dbo.DiscountCoupons READONLY
)
SET NOCOUNT ON;
-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
SELECT *
FROM dbo.Products pr
WHERE pr.SKU = #SKU
)
)
BEGIN
INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
VALUES (#SKU, #ProductName, #Manufacturer, #Category, ...);
END;
...INSERT data from TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
From a DB point of view, there's no such fast thing than BULK INSERT (from csv files for example). The best is to bulk all data asap, then process it with stored procedures.
A C# layer will just slow down the process, since all the queries between C# and SQL will be thousands times slower than what Sql-Server can directly handle.

NHibernate matching property as a substring

I'm trying to find the "best" way to match, for example, politicians' names in RSS articles. The names will be stored in a database accessed with NHibernate. As an example:
Id Name
--- ---------------
1 David Cameron
2 George Osborne
3 Alistair Darling
At the time of writing, the BBC politics news RSS feed has an item with the description
Backbench Conservative MPs put pressure on Chancellor George Osborne to stop rail firms in England increasing commuter fares by up to 11%.
For this article, I would like to detect that George Osborne is mentioned. I realise that there are several ways that this could be done, e.g. selecting all the politicians' names first, and comparing them in code, or doing the NHibernate equivalent of a LIKE.
The application itself would have a few dozen feeds, which will be queried at most every 15 minutes. Obviously there are speed, memory and scaling concerns, so I would like to ask for a recommended approach (and NHibernate query if relevant).
As we were discussing on the comments, I believe that there is a simpler approach to this problem:
Keep a list of the politicians' in memory. Because these entities won't be updated often, it's safe to work like this. Just implement an expiration logic to refresh it from the database sooner or later.
For each downloaded feed entry, simply run foreach Name in Politicians FeedEntry.Content.Contains(Name) (or something like it) before saving the entry to the database.
There you go, no complex query needed and less I/O for your solution.
Along the following lines I would either use use a regex expression or a contains to get the politicians that match the feed. The politician names and ids can be a simple collection in memory.
Then the the feed can be saved in a memcached or redis (even a db would do) with a guid. Then save the associated guid in a table that holds politician_id, feed_guid.
For some statistics you can also have a table which is an aggregate of politician_id, num_articles_mentioned where the num_articles_mentioned is incremented by 1.
You can wrap the above in a transaction if needed.

Categories

Resources