I'm trying to find the "best" way to match, for example, politicians' names in RSS articles. The names will be stored in a database accessed with NHibernate. As an example:
Id  Name
--- ----------------
1   David Cameron
2   George Osborne
3   Alistair Darling
At the time of writing, the BBC politics news RSS feed has an item with the description
Backbench Conservative MPs put pressure on Chancellor George Osborne to stop rail firms in England increasing commuter fares by up to 11%.
For this article, I would like to detect that George Osborne is mentioned. I realise that there are several ways that this could be done, e.g. selecting all the politicians' names first, and comparing them in code, or doing the NHibernate equivalent of a LIKE.
The application itself would have a few dozen feeds, which will be queried at most every 15 minutes. Obviously there are speed, memory and scaling concerns, so I would like to ask for a recommended approach (and NHibernate query if relevant).
As we were discussing in the comments, I believe that there is a simpler approach to this problem:
Keep a list of the politicians in memory. Because these entities won't be updated often, it's safe to work like this. Just implement some expiration logic to refresh the list from the database periodically.
For each downloaded feed entry, loop over the names and check FeedEntry.Content.Contains(Name) (or something like it) before saving the entry to the database.
There you go, no complex query needed and less I/O for your solution.
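A minimal sketch of that idea, assuming the names are loaded through some delegate (for example an NHibernate query). The Politician and PoliticianMatcher types and the refresh interval are hypothetical, and the class is not thread-safe as written:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity for illustration; the real mapping is not shown in the question.
public sealed class Politician
{
    public Politician(int id, string name) { Id = id; Name = name; }
    public int Id { get; private set; }
    public string Name { get; private set; }
}

public sealed class PoliticianMatcher
{
    private readonly Func<IReadOnlyList<Politician>> _load; // assumed: e.g. an NHibernate query
    private readonly TimeSpan _ttl;
    private IReadOnlyList<Politician> _cache = new List<Politician>();
    private DateTime _loadedAtUtc = DateTime.MinValue;

    public PoliticianMatcher(Func<IReadOnlyList<Politician>> load, TimeSpan ttl)
    {
        _load = load;
        _ttl = ttl;
    }

    // Returns every cached politician whose name occurs in the text.
    public List<Politician> FindMentions(string text)
    {
        // Reload the in-memory list once it is older than the configured lifetime.
        if (DateTime.UtcNow - _loadedAtUtc > _ttl)
        {
            _cache = _load();
            _loadedAtUtc = DateTime.UtcNow;
        }

        // Plain case-insensitive substring check, one pass per name.
        return _cache
            .Where(p => text.IndexOf(p.Name, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

With a few hundred names and a few dozen feeds polled every 15 minutes, this linear scan should be cheap; the substring check will also match inside longer words, which may or may not matter for names.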
Along the following lines, I would use either a regular expression or a Contains check to get the politicians that match the feed. The politician names and ids can be a simple collection in memory.
Then the feed can be saved in memcached or Redis (even a DB would do) with a GUID. Then save the associated GUID in a table that holds politician_id, feed_guid.
For some statistics you can also have a table which is an aggregate of politician_id, num_articles_mentioned where the num_articles_mentioned is incremented by 1.
You can wrap the above in a transaction if needed.
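The regex variant could look roughly like this. It is only a sketch: MentionFinder is a hypothetical helper, the name list is assumed to be non-empty, and the id lookup is assumed to use a case-insensitive dictionary. The returned ids are what would go into the politician_id, feed_guid table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MentionFinder
{
    // Builds one alternation over all names, wrapped in word boundaries,
    // so the text is scanned once instead of once per name.
    public static Regex BuildRegex(IEnumerable<string> names)
    {
        var alternatives = names.Select(Regex.Escape);
        return new Regex(
            @"\b(?:" + string.Join("|", alternatives) + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Returns the distinct politician ids mentioned in the text.
    // idsByName is assumed to be created with StringComparer.OrdinalIgnoreCase.
    public static HashSet<int> FindIds(Regex regex, IDictionary<string, int> idsByName, string text)
    {
        var ids = new HashSet<int>();
        foreach (Match match in regex.Matches(text))
        {
            int id;
            if (idsByName.TryGetValue(match.Value, out id))
            {
                ids.Add(id);
            }
        }
        return ids;
    }
}
```

Unlike a plain Contains, the word boundaries stop a name from matching inside a longer word.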
I want to search for the key with the highest value across multiple hashes in Redis. My keys are of this format -
emp:1, emp:2,...emp:n
Each having values in this format -
1. name ABC
2. salary 1234
3. age 23
I want to find the oldest employee from these hashes. From what I have read about Redis, there is no way to read multiple hashes in one call, which means I need to iterate through all the emp keys and call HGETALL on each to get the desired result (I do have a set where all the emp ids are stored).
Is there a way I can minimize the number of hits to get this working?
You can use a pipeline in Redis to run multiple commands and get their responses. That should allow you to execute multiple HGETALL commands. See the docs for more info. Not sure what library you are using for C#, but it should provide a way for you to use a pipeline.
You could also create a Lua script to iterate over the Redis keys and return the hash for the oldest employee.
TL;DR
Yes you are right
... there is no way to read multiple hashes in one call ...
And so is @TheDude
... You could also create a Lua script to iterate over the Redis keys ...
Adding to it
It appears that you are using Redis as a database. You have stored all your domain data and now you want to query it. This is a misuse of Redis. It can be done, but that is not what it was meant for. For this activity, a real database will be easier to use and will perform better.
Redis is meant for caching frequently-used data [1]. Note the two terms: (1) caching and (2) frequently-used. Caching means storing temporarily. If you want permanent storage that survives a server reboot, go for a database. Frequently-used means you should not store all your data in there. Store only the subset that is actively being used. You can use Redis with all your data, and even with persistence turned on, but then you have to tread very carefully.
For your purpose, it seems a generic database and SELECT MAX(age) FROM ... will be equally good, if not better.
Or maybe,
You have quoted only part of the real problem and you are actually following Redis best practices. In that case I would suggest having a separate Sorted Set. For every employee inserted into the main keyset, also do ZADD employeeages 80 Alen, where 80 is the age and Alen is presumably the ID of the person Alen.
To get the person ('s ID) with the maximum age, you can do
ZREVRANGEBYSCORE employeeages +inf -inf WITHSCORES LIMIT 0 1
If that looks bizarre then you are right - this is something very interesting! This gets your data not only in a single call, but in a single step within that call. Consider this: let's say you have a million employees (wow). Then this approach to get the oldest employee will likely be the fastest (a sorted-set range lookup costs O(log N) plus the number of items returned), a database with SELECT MAX(... will be the runner-up, and your HGETALL loop or Lua script will be the slowest.
Use this approach if the ages of your employees change frequently - like the scores of players in an online game - and you frequently want to query the highest or the lowest entry, as when updating leaderboards. The downside of using this approach in place of a database is high redundancy. When (say) the address of an employee changes, you need to change a lot of records, and to do that you need to make a lot of calls.
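To make the trade-off concrete, here is the same "keep a second structure ordered by age" idea sketched in plain C# rather than in Redis. EmployeeAgeIndex is a hypothetical class; it only mirrors the bookkeeping that ZADD and ZREVRANGEBYSCORE do for you and does not talk to a Redis server:

```csharp
using System;
using System.Collections.Generic;

public sealed class EmployeeAgeIndex
{
    // Primary lookup: employee id -> age (stands in for the emp:N hashes).
    private readonly Dictionary<string, int> _ageById = new Dictionary<string, int>();

    // Secondary index ordered by age, then id (stands in for the sorted set).
    private readonly SortedSet<(int Age, string Id)> _byAge = new SortedSet<(int Age, string Id)>();

    // Comparable in spirit to: ZADD employeeages <age> <id>
    public void Upsert(string id, int age)
    {
        int oldAge;
        if (_ageById.TryGetValue(id, out oldAge))
        {
            _byAge.Remove((oldAge, id)); // keep the index in step with the primary data
        }
        _ageById[id] = age;
        _byAge.Add((age, id));
    }

    // Comparable in spirit to: ZREVRANGEBYSCORE employeeages +inf -inf WITHSCORES LIMIT 0 1
    public (int Age, string Id)? Oldest()
    {
        if (_byAge.Count == 0)
        {
            return null;
        }
        return _byAge.Max;
    }
}
```

The extra Remove/Add on every update is the redundancy mentioned above: each write has to touch both structures.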
[1] As noted in comments, Redis is much more than just a cache for frequently-used data. I believe for this discussion, this definition is sufficient.
I have a program that creates a list of objects from a file, and also creates a list of the same type of object from the database, but with fewer (and some different) properties, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
List from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyways, I see there's a Zip method. I was wondering if I could use that to say "if there's a match between the two lists, keep the data in list 1 but update the address id property of each object in list 1 to the address Id from list 2".
Is that possible?
The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the zip method, since it has less work involved) and if it works within the parameters of your requirements (time or any other parameter, memory footprint?), you can stop there.
Otherwise you have to offload it to the database. Mind you, you would have to hold the 1 million records from the file and the 1 million records from the DB in memory at the same time if you want to use the zip method.
The problem with pushing everything to the database is that inserting that many records consumes a lot of resources (time, space, etc.). Moreover, if you want to do that regularly, maybe every day, it is going to be even more demanding resource-wise.
Your question didn't say whether this is going to be a one-time thing or a daily event in a production environment. That, too, makes a difference in which approach to choose.
To repeat, you would have to try different approaches to see which works best for you based on your requirements: Is this a one-time thing? How many resources does the process have? How much time does it have? And possibly many more.
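One thing worth keeping in mind when trying the in-code route: Enumerable.Zip pairs elements by position, so it only gives the right answer if both lists are already in the same order with no gaps. A dictionary lookup keyed on the compared fields does not have that restriction. The sketch below assumes a CustObj shape based on the question (the property names are guesses) and exact, case-sensitive matching on Address, City and State:

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the question's CustObj; property names are illustrative.
public sealed class CustObj
{
    public int AddressId { get; set; }
    public string Address { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zip { get; set; } // only populated for rows read from the file
}

public static class AddressMatcher
{
    // Same three fields that the question's IEquatable implementation compares.
    private static (string, string, string) Key(CustObj c)
    {
        return (c.Address, c.City, c.State);
    }

    // Copies AddressId from matching DB rows onto the file rows.
    // Returns the number of file rows that were updated.
    public static int CopyIds(IEnumerable<CustObj> fromFile, IEnumerable<CustObj> fromDb)
    {
        var idByKey = new Dictionary<(string, string, string), int>();
        foreach (var dbRow in fromDb)
        {
            idByKey[Key(dbRow)] = dbRow.AddressId; // if the DB has duplicates, the last one wins
        }

        int updated = 0;
        foreach (var fileRow in fromFile)
        {
            int id;
            if (idByKey.TryGetValue(Key(fileRow), out id))
            {
                fileRow.AddressId = id;
                updated++;
            }
        }
        return updated;
    }
}
```

This is one pass over each list, so it should scale roughly linearly with the row counts, but both lists still have to fit in memory as noted above.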
It also kind of sounds like a job for .Aggregate(), e.g.
var aggreg = list1.Aggregate(otherListPrefilled, (acc, elemFrom1) =>
{
    // some code to create the joined data, using elemFrom1 to find
    // and modify the correct element in acc (i.e. otherListPrefilled)
    return acc;
});
Normally I would use an empty otherListPrefilled. I'm not sure how it performs on 100k data items, though.
If it's a one-time thing, it's probably faster to export your file to CSV, import it into your database as a temporary table, and join the data in SQL.
So for example I have Collection of Documents like this:
{
hotField1 : 0,
hotField2 : "",
coldField1 : 0,
...
coldFieldN : ""
}
In this scope, cold properties are written once and accessed sometimes, while hot properties are written and then fairly often accessed/updated (but in different use cases; they are not the same sub-document or parts of the same object).
The number of documents is fairly large (1M and more), and the size of the hot data is at least ten times smaller than the cold data.
Since partial update is still the most wanted but not yet implemented feature, the only way to update hotField1 is:
Request the full document
Change either hotField1 or hotField2
Write back the whole document
This is costly in terms of RUs, and doesn't scale well.
So the question is: how should such data and calls be organized in DocumentDB to minimize costs?
Alternatives I have found:
1. Obviously best: retrieve one property, change it, update it. Not available yet.
2. Split the data across two collections, and use stored procedures to retrieve from the main collection and then from the dictionary?
3. Put hotField1 and hotField2 in a subdocument ({ sub: { hf1: 0, hf2: "" } }) and somehow update only that? (I'm not sure whether this is possible.)
PS. C# is in the tags because of the client library we use. If it lacks something, it's OK to use the REST interface instead.
While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With fewer properties, there would be less bandwidth consumed during updates, and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this activity negates any RU savings gained by splitting the document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
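As a rough illustration of that note, the split documents might be modelled like this. Everything here is hypothetical: the class names, the DocType discriminator property, and the ":hot" id convention are only one way to tell the document types apart, and whether it saves RUs still has to be benchmarked:

```csharp
using System;

// Small, frequently updated document.
public sealed class HotPart
{
    public string Id { get; set; }
    public string DocType { get { return "hot"; } }  // discriminator for queries and stored procedures
    public string ParentId { get; set; }             // id of the matching cold document
    public int HotField1 { get; set; }
    public string HotField2 { get; set; }
}

// Large, rarely changing document.
public sealed class ColdPart
{
    public string Id { get; set; }
    public string DocType { get { return "cold"; } }
    public int ColdField1 { get; set; }
    public string ColdFieldN { get; set; }
}

public static class HotColdSplitter
{
    // Derives the hot document's id from the cold one, so it can be read
    // directly by id without a query.
    public static string HotIdFor(string coldId)
    {
        return coldId + ":hot";
    }

    public static HotPart NewHotPart(string coldId, int hotField1, string hotField2)
    {
        return new HotPart
        {
            Id = HotIdFor(coldId),
            ParentId = coldId,
            HotField1 = hotField1,
            HotField2 = hotField2
        };
    }
}
```

An update to hotField1 then reads and rewrites only the small HotPart document, at the price of a second read whenever the full record is needed.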
Document-based NoSQL stores replace the whole document whenever you change one or all of its properties.
In terms of cost, you are charged on a per-collection basis.
So, suppose you have a DB with two collections in it, each on the S1 performance tier at $25/month:
$25 x 2 = $50
If you need better performance and move one of them to S2, you'll be charged (S2 plus the remaining S1):
$50 + $25 = $75
I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the fragment.
I see two approaches here. The first is to load the data into a cache (HttpRuntime.Cache, Redis or something like this) to reduce loading time, and filter with LINQ. But I think there will be problems with the time required for serializing/deserializing.
The other approach: for instance, I have the pair 22:some title here, so I need to provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
Update: I've changed my mind. I don't want to query the database because requests happen quite often. So now I think the best solution is:
load the entire list into memory
build a trie structure which keeps a hashset of values in each node
in the case of one text fragment, just return the hashset from the trie node; in the case of several fragments, find all the hashsets and take their intersection
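The three steps above can be sketched as follows. This is only an illustration under some assumptions: PrefixIndex is a hypothetical class, words are split on spaces and lowercased, and results are ordered by id before taking the first 25:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class PrefixIndex
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        // Ids of every pair that has a word whose prefix ends at this node.
        public readonly HashSet<int> Ids = new HashSet<int>();
    }

    private readonly Node _root = new Node();

    // Indexes each word of the name, adding the id to every node along the word's path.
    public void Add(int id, string name)
    {
        foreach (var word in Split(name))
        {
            var node = _root;
            foreach (var ch in word)
            {
                Node next;
                if (!node.Children.TryGetValue(ch, out next))
                {
                    next = new Node();
                    node.Children[ch] = next;
                }
                node = next;
                node.Ids.Add(id);
            }
        }
    }

    // Returns up to 'take' ids whose names contain, for every fragment,
    // at least one word starting with that fragment.
    public List<int> Search(string query, int take = 25)
    {
        HashSet<int> result = null;
        foreach (var fragment in Split(query))
        {
            var ids = Lookup(fragment);
            if (ids == null)
            {
                return new List<int>(); // one fragment matches nothing, so the intersection is empty
            }
            if (result == null)
            {
                result = new HashSet<int>(ids); // copy so the index itself is never modified
            }
            else
            {
                result.IntersectWith(ids);
            }
        }
        return result == null
            ? new List<int>()
            : result.OrderBy(i => i).Take(take).ToList();
    }

    // Walks the trie along the fragment; null means no word has this prefix.
    private HashSet<int> Lookup(string fragment)
    {
        var node = _root;
        foreach (var ch in fragment)
        {
            if (!node.Children.TryGetValue(ch, out node))
            {
                return null;
            }
        }
        return node.Ids;
    }

    private static IEnumerable<string> Split(string text)
    {
        return text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Keeping a hashset in every node makes lookups a simple walk, but it duplicates ids along each word's path, so memory use for 100,000 to 200,000 names is worth measuring before committing to this design.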
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so (in SQL Server, a prefix term has to be wrapped in double quotes inside the search string):
SELECT * FROM tableName WHERE CONTAINS(name, '"bla*"');
This will look for words starting with "bla" anywhere in the string, so it also matches the string "Monkeys blabla".
I don't really understand your question, but if you want to query a table you can do so, since you already have the query string. You can try this out (note that StartsWith only matches the beginning of the whole name, not of each word in it):
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.
I'm writing an application that I will use to keep up with my monthly budget. This will be a C# .NET 4.0 Winforms application.
Essentially my budget will be a matrix of data if you look at it visually. The columns are the "dates" at which that budget item will be spent. For example, I have 1 column for every Friday in the month. The Y axis is the name of the budget item (Car payment, house payment, eating out, etc). There are also categories, which are used to group the budget item names that are similar. For example, a category called "Housing" would have budget items called Mortgage, Rent, Electricity, Home Insurance, etc.
I need a good way to store this data from a code design perspective. Basically I've thought of two approaches:
One, I can have a "BudgetItem" class that has a "Category", "Value", and "Date". I would have a linear list of these items and each time I wanted to find a value by either date or category, I iterate this list in some form or fashion to find the value. I could probably use LINQ for this.
Second, I could use a 2D array which is indexed first by column (date) and second by row. I'd have to maintain categories and budget item names in a separate list and join the data together when I do my lookups somehow.
What is the best way to store this data in code? I'm leaning more towards the first solution but I wanted to see what you guys think. Later on when I implement my data persistence, I want to be able to persist this data to SQL server OR to an XML file (one file per monthly budget).
While your first approach looks nicer, the second could obviously be faster (depending on how you implement it). However, for a desktop application that is not performance-critical, your first idea is definitely better, especially because it will help you a lot when maintaining your code. Also remember that Entity Framework could be really nice in this situation.
Finally, if you know how to work with XML, I think it is a better fit for this type of project. A database is only worth it when you have a fair number of tables. As you explained, you will only have 2 tables (budgetitem and category), so I don't think you need a database for such a simple thing.
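A minimal sketch of the first approach, for illustration only. The Name property and the BudgetQueries helper are assumptions added here (the question only lists Category, Value and Date), and the syntax is kept simple enough for the C# compiler that ships with .NET 4.0:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class BudgetItem
{
    public string Category { get; set; } // e.g. "Housing"
    public string Name { get; set; }     // e.g. "Rent" (assumed property)
    public DateTime Date { get; set; }   // the column this item falls under
    public decimal Value { get; set; }
}

public static class BudgetQueries
{
    // Total of all items in one category.
    public static decimal TotalFor(IEnumerable<BudgetItem> items, string category)
    {
        return items.Where(i => i.Category == category).Sum(i => i.Value);
    }

    // Rebuilds the "matrix" view on demand: item name -> (date -> value).
    public static Dictionary<string, Dictionary<DateTime, decimal>> ToMatrix(IEnumerable<BudgetItem> items)
    {
        return items
            .GroupBy(i => i.Name)
            .ToDictionary(
                byName => byName.Key,
                byName => byName
                    .GroupBy(i => i.Date.Date)
                    .ToDictionary(byDate => byDate.Key, byDate => byDate.Sum(i => i.Value)));
    }
}
```

The flat list stays the single source of truth, which also maps naturally onto rows in SQL Server or elements in an XML file, and the grid is derived from it only when the UI needs it.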