C# Zip Function to iterate through two lists of objects

I have a program that creates a list of objects from a file, and also creates a list of the same type of object, but with fewer (and some different) properties, from the database, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
List from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyway, I see there's a Zip method. I was wondering if I could use that to say "if there's a match between the two lists, keep the data in list 1 but update the address ID property of each object in list 1 to the address ID from list 2".
Is that possible?

The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the Zip method, since it involves less work) and, if it works within the parameters of your requirements (time, memory footprint, or whatever else matters), stop there.
Otherwise you have to offload it to the database. Mind you, if you want to use the Zip method you would have to hold the 1 million records from the file and the 1 million records from the DB in memory at the same time.
The problem with pushing everything to the database is that inserting that many records is resource-consuming (time, space, etc.). Moreover, if you want to do that every day, it is going to be even more demanding, resource-wise.
Your question didn't say whether this is going to be a one-time thing or a daily event in a production environment. Even that is going to make a difference in which approach to choose.
To repeat: you would have to try different approaches to see which works best for you based on your requirements. Is this a one-time thing? How many resources does the process have? How much time does it have? And possibly many more.
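One caveat worth spelling out: Zip pairs elements strictly by position, so it only does what you want if both lists are sorted identically. The more usual way to match them in memory is a dictionary keyed on the fields your IEquatable compares. A minimal sketch, assuming a hypothetical CustObj shaped like the question describes:

using System.Collections.Generic;
using System.Linq;

class CustObj   // hypothetical shape, based on the fields described above
{
    public int AddressId { get; set; }
    public string Address { get; set; }
    public string City { get; set; }
    public string State { get; set; }
}

static void UpdateAddressIds(List<CustObj> fileList, List<CustObj> dbList)
{
    // Build an O(1) lookup from the DB list, keyed on the compared fields.
    // Assumes those fields are unique in the DB list; use GroupBy/First if not.
    var dbLookup = dbList.ToDictionary(
        c => (c.Address, c.City, c.State),
        c => c.AddressId);

    foreach (var fileObj in fileList)
    {
        if (dbLookup.TryGetValue((fileObj.Address, fileObj.City, fileObj.State), out var id))
            fileObj.AddressId = id;   // keep the file data, take the DB's AddressId
    }
}

This is a single pass over each list, so it should stay fast even at a million records, at the cost of holding both lists in memory.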

It also sounds kind of like a job for .Aggregate(), e.g.
var aggreg = list1.Aggregate(otherListPrefilled, (acc, elemFrom1) =>
{
    // some code to create the joined data, using elemFrom1 to find
    // and modify the correct element in otherListPrefilled
    return acc;
});
Normally I would use an empty otherListPrefilled; not sure how it performs on 100k data items, though.
If it's a one-time thing, it's probably faster to export your file to a CSV, import it into your database as a temporary table, and join the data in SQL.
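A rough sketch of that route, assuming SQL Server with invented table and column names (SqlBulkCopy is the usual fast path for loading a file into a staging table; use System.Data.SqlClient on older stacks):

using System.Data;
using Microsoft.Data.SqlClient;

static void BulkLoadAndMatch(DataTable fileRows, string connectionString)
{
    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Stage the file rows in a temp table (it lives for this connection).
    using (var create = new SqlCommand(
        "CREATE TABLE #FileAddresses (Address nvarchar(200), City nvarchar(100), State nvarchar(50))",
        conn))
        create.ExecuteNonQuery();

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#FileAddresses" })
        bulk.WriteToServer(fileRows);

    // Let the database do the matching as one set-based join.
    using var cmd = new SqlCommand(@"
        SELECT f.Address, f.City, f.State, a.AddressId
        FROM #FileAddresses f
        JOIN Addresses a ON a.Address = f.Address
                        AND a.City    = f.City
                        AND a.State   = f.State", conn);
    using var reader = cmd.ExecuteReader();
    // ... read the matched AddressIds back ...
}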

Related

Is it better to do an additional query, or to get all the data and filter in the client?

I have a query with EF Core in which I would like to include a navigation property and, since it is an ICollection, filter which of its items to load.
It is something like this:
myDbContext.MyEntity.Where(x => x.ID == 1).Include(x => x.MyCollection.Where(y => y.isEnabled == true));
However, I get an error because it is not possible to filter the included properties.
In fact, the items in the collection will be few, about 6 or 7, so I was thinking that I could include them all and filter the data in the client later.
Another option would be to get the main entity first and, in a second query, get only the children that I really need.
I have always read that connections to the database are expensive, so it is better to do as few queries as possible; but I have also read that best practice is to fetch only the data I need, filtering in the query rather than in the client.
But in this case, with EF Core, it seems that I can't filter in the query, so I would like to know which is better: two queries that get only the data I need, or one query that gets all the data and filters it later in the client.
Which is longer? One long piece of string, or two shorter pieces of string?
You don't know, because I haven't told you the actual lengths. You don't know if it's a 1m string versus two 5cm strings, or a 10cm string versus two 8cm strings.
And your question here is the same. It's better to do fewer queries than many, and it's better to do short queries than long queries. When a choice is on only one of those metrics (e.g. the shorter query from doing a simple Where on the database vs a simple Where in memory on all results) then we can make sound a priori judgements about which is likely to be the more efficient, and choose accordingly.
When, though, we have competing factors in play, we have to:
1. Decide whether we even care: if they're going to be pretty fast either way, it might not be worth worrying about; find bigger fish to fry.
2. Measure.
3. Make sure what we are measuring is realistic.
The third point is important as one can often create data sets that would make one come out the victor, and other data sets that would make the other win. We need to make sure we're correctly modelling what is encountered in real life.
When the difference is small, or if they are both fast either way (and/or the use is so rare that it's still not a big deal), then just go for whichever is easier to code and maintain.
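To make the two candidates concrete, here is a minimal sketch using the names from the question, assuming myDbContext is a live DbContext (for what it's worth, filtered Include with exactly the syntax the question attempts was later added in EF Core 5.0):

using System.Linq;
using Microsoft.EntityFrameworkCore;

// Option A: one query, pull the whole collection, filter in the client.
var entityA = myDbContext.MyEntity
    .Include(x => x.MyCollection)
    .First(x => x.ID == 1);
var enabledA = entityA.MyCollection.Where(y => y.isEnabled).ToList();

// Option B: two short queries, each returning only what is needed.
var entityB = myDbContext.MyEntity.First(x => x.ID == 1);
var enabledB = myDbContext.Entry(entityB)
    .Collection(x => x.MyCollection)
    .Query()                          // explicit loading, so the filter runs server-side
    .Where(y => y.isEnabled)
    .ToList();

Measure both against realistic data before committing to either.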

Reading multiple Hashes from Redis in one call

I want to search for the key with the highest value across multiple hashes in Redis. My keys are of this format -
emp:1, emp:2,...emp:n
Each having values in this format -
1. name ABC
2. salary 1234
3. age 23
I want to find the oldest employee from these hashes. From what I have read about Redis, there is no way to read multiple hashes in one call. That means I need to iterate through all the emp keys and call HGETALL on each to get the desired result (I do have a set where all the emp IDs are stored).
Is there a way I can minimize the number of hits to get this working?
You can use a pipeline in Redis to run multiple commands and get their responses. That should allow you to execute multiple HGETALL commands. See the docs for more info. Not sure what library you are using for C#, but it should provide a way for you to use a pipeline.
You could also create a Lua script to iterate over the Redis keys and return the hash for the oldest employee.
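For example, with the StackExchange.Redis client (an assumption; the question doesn't name a library), a batch flushes all the HGETALL commands to the server in one network write, much like a pipeline:

using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

static async Task<HashEntry[][]> GetAllEmployeesAsync(IDatabase db, int[] empIds)
{
    var batch = db.CreateBatch();
    var tasks = empIds
        .Select(id => batch.HashGetAllAsync($"emp:{id}"))
        .ToArray();
    batch.Execute();                  // send everything in one go
    return await Task.WhenAll(tasks); // one HashEntry[] per employee hash
}

Finding the oldest employee is then a client-side scan over the returned hashes.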
tl;dr:
Yes, you are right:
... there is no way to read multiple hashes in one call ...
And so is @TheDude:
... You could also create a Lua script to iterate over the Redis keys ...
Adding to it
It appears that you are using Redis as a database: you have stored all your domain data and now you want to query it. This is a misuse of Redis. It can be done, but that is not what it was meant for. For this activity, a real database will be easier and more performant.
Redis is meant for caching frequently-used data [1]. Note the two words: (1) caching and (2) frequently-used. Caching is temporary storage. If you want permanent storage - surviving a server reboot - go for a database. Frequently-used means you should not store all your data in there; store only the subset that is actively being used. You can use Redis with all your data and even with the permanent store turned on, but then you have to tread very carefully.
For your purpose, it seems a generic database and SELECT MAX(age) FROM ... will be equally good, if not better.
Or maybe,
You may have quoted only part of the real problem, and actually you are following Redis best practices. In that case I would suggest keeping a separate sorted set. For every employee inserted into the main keyset, also do ZADD employeeages 80 Alen, where 80 is the age and Alen is presumably the ID of the person Alen.
To get the person ('s ID) with the maximum age, you can do
ZREVRANGEBYSCORE employeeages +inf -inf WITHSCORES LIMIT 0 1
If that looks bizarre, then you are right - this is something very interesting! It will get your data not only in a single call, but in a single step within that call. Consider this: let's say you have a million employees (wow). Then this approach to getting the oldest employee will be the fastest, a database with SELECT MAX(... will be the runner-up, and your HGETALL loop or Lua script will be the slowest.
Use this approach if the ages of your employees change frequently - like the scores of players in an online game where you frequently want to query the top or bottom scorer, as with leaderboards. The downside of using this approach in place of a database is high redundancy: when (say) the address of an employee changes, you need to change a lot of records, and to do that you need to make a lot of calls.
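The same idea sketched with StackExchange.Redis (again an assumed client; the key and member names come from the example above, and connection is a ConnectionMultiplexer):

using StackExchange.Redis;

IDatabase db = connection.GetDatabase();

// Maintain the index whenever an employee is written:
db.SortedSetAdd("employeeages", "Alen", 80);

// Oldest employee (ID plus age) in a single call:
SortedSetEntry[] oldest = db.SortedSetRangeByScoreWithScores(
    "employeeages",
    order: Order.Descending,
    take: 1);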
[1] As noted in comments, Redis is much more than just a cache for frequently-used data. I believe for this discussion, this definition is sufficient.

Improving nested objects filtering speed

Here's a problem I experience (simplified example):
Let's say I have several tables:
One customer can have many products, and a product can have multiple features.
On my ASP.NET front-end I have a grid with customer info, something like this:
Name Address
John 222 1st st
Mark 111 2nd st
What I need is an ability to filter customers by feature. So, I have a dropdown list of available features that are connected to a customer.
What I currently do:
1. I return a DataTable of customers from a stored procedure. I store it in ViewState.
2. I return a DataTable of features connected to customers from a stored procedure. I store it in ViewState.
3. When a filter is selected, I run the stored procedure again with the new feature_id filter, where I do the joins again to show only customers that have the selected feature.
My problem: It is very slow.
I think that possible solutions would be:
1. On page load, return ALL the data in one ViewState variable - basically three lists of nested objects. This will make my page load slowly.
2. Perform async loading in some smart way. How?
Any better solutions?
Edit:
This is a simplified example; I also need to filter customers by a property that is connected to the Customer table through six tables.
The way I deal with these scenarios is by passing XML to SQL Server and then running a join against that. The XML would look something like:
<Features><Feat Id="2" /><Feat Id="5" /><Feat Id="8" /></Features>
Then you can pass that XML into SQL Server (there are different ways depending on the version), but in the newer versions it's a lot easier than it used to be:
http://www.codeproject.com/Articles/20847/Passing-Arrays-in-SQL-Parameters-using-XML-Data-Ty
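A rough sketch of that approach on SQL Server 2005+, with invented table names (connection is assumed to be an open SqlConnection; the XML parameter is shredded with .nodes() and joined like a table):

using System.Data;
using Microsoft.Data.SqlClient;

var featureXml = "<Features><Feat Id=\"2\" /><Feat Id=\"5\" /><Feat Id=\"8\" /></Features>";

using var cmd = new SqlCommand(@"
    SELECT DISTINCT c.Id, c.Name, c.Address
    FROM @features.nodes('/Features/Feat') AS fx(f)
    JOIN CustomerFeatures cf ON cf.FeatureId = fx.f.value('@Id', 'int')
    JOIN Customers c        ON c.Id = cf.CustomerId", connection);
cmd.Parameters.Add("@features", SqlDbType.Xml).Value = featureXml;
using var reader = cmd.ExecuteReader();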
Also, don't put any of that in ViewState; there's really no reason for that.
Storing an entire list of customers in ViewState is going to be hideously slow; storing all information for all customers in ViewState is going to be worse, unless your entire customer base is very very small, like about 30 records.
For a start, why are you loading all the customers into ViewState? If you have any significant number of customers, load the data a page at a time. That will at least reduce the amount of data flowing over the wire and might speed up your stored procedure as well.
In your position, I would focus on optimizing the data retrieval first (including minimizing the amount you return), and then worry about faster ways to store and display it. If you're up against unusual constraints that prevent this (very slow database; no profiling tools; not allowed to change stored procedures) then please let us know.
Solution 1: Include whatever criteria you need to filter on in your query, only return and render the requested records. No need to use viewstate.
Solution 2: Retrieve some reasonable page limit of customers, filter on the browser with javascript. Allow easy navigation to the next page.
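A sketch of Solution 1 combined with paging (table names, featureId, pageIndex, and pageSize are all invented for illustration; OFFSET/FETCH needs SQL Server 2012+), so only the rows actually displayed ever leave the database:

using System;
using System.Data;
using Microsoft.Data.SqlClient;

using var cmd = new SqlCommand(@"
    SELECT c.Name, c.Address
    FROM Customers c
    WHERE @featureId IS NULL
       OR EXISTS (SELECT 1 FROM CustomerFeatures cf
                  WHERE cf.CustomerId = c.Id AND cf.FeatureId = @featureId)
    ORDER BY c.Name
    OFFSET @skip ROWS FETCH NEXT @pageSize ROWS ONLY", connection);
cmd.Parameters.Add("@featureId", SqlDbType.Int).Value = (object)featureId ?? DBNull.Value;
cmd.Parameters.Add("@skip", SqlDbType.Int).Value = pageIndex * pageSize;
cmd.Parameters.Add("@pageSize", SqlDbType.Int).Value = pageSize;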

Competitions: Storing an arbitrary number of fields

I'm going to be creating competitions on the current site I'm working on. Each competition is not going to be the same and may have a varying number of input fields that a user must complete to be part of the competition, e.g.:
Competition 1 might just require a firstname
Competition 2 might require a firstname, lastname and email address.
I will also be building a tool to observe these entries so that I can look at each individual entry.
My question is what is the best way to store an arbitrary number of fields? I was thinking of two options, one being to write each entry to a CSV file containing all the entries of the competition, the other being to have a db table with a varchar field in the database that just stores an entire entry as text. Both of these methods seem messy, is there any common practice for this sort of task?
I could in theory create a db table with a column for every possible field, but it won't work when the competition has specific requirements such as "Tell us in 100 words why..." or "Enter your 5 favourite things that.."
ANSWERED:
I have decided to use the method described below where there are multiple generic columns that can be utilized for different purposes per competition.
Initially I was going to use EAV, and I still think it might be slightly more appropriate for this specific scenario. But it is generally recommended against because of its poor scalability and complicated querying, and I wouldn't want to get into the habit of using it. Both answers worked absolutely fine in my tests.
I think you are right to be cautious about EAV as it will make your code a bit more complex, and it will be a bit more difficult to do ad-hoc queries against the table.
I've seen many enterprise apps simply adopt something like the following schema -
t_Comp_Data
-----------
CompId
Name
Surname
Email
Field1
Field2
Field3
...
Fieldn
In this instance, the generic fields (Field1 etc) mean different things for the different competitions. For ease of querying, you might create a different view for each competition, with the proper field names aliased in.
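If views aren't practical, the same aliasing can live in code: a per-competition map from the logical field name to the physical column. A sketch with invented names, assuming entries come back as DataRows:

using System.Collections.Generic;
using System.Data;

// Which physical column holds which logical field, per competition.
var fieldMap = new Dictionary<int, Dictionary<string, string>>
{
    [1] = new() { ["FirstName"] = "Field1" },
    [2] = new() { ["FirstName"] = "Field1", ["LastName"] = "Field2", ["Email"] = "Field3" },
};

string GetEntryValue(DataRow entry, int competitionId, string logicalName) =>
    entry[fieldMap[competitionId][logicalName]]?.ToString();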
I'm usually hesitant to use it, but this looks like a good situation for the Entity-attribute-value model if you use a database.
Basically, you have a CompetitionEntry (entity) table with the standard fields that make up every entry (Competition_id, maybe dates, etc.), and then a CompetitionEntryAttribute table with CompetitionEntry_id, Attribute and Value. You probably also want another table with template attributes for each competition, for creating new entries.
Unfortunately you will only be able to store one datatype, which will likely have to be a large nvarchar.
Another disadvantage is the difficulty of querying against EAV databases.
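Sketched as classes, the shape described above would be something like this (all names illustrative):

using System;
using System.Collections.Generic;

class CompetitionEntry
{
    public int Id { get; set; }
    public int CompetitionId { get; set; }
    public DateTime EnteredOn { get; set; }
    public List<CompetitionEntryAttribute> Attributes { get; set; } = new();
}

class CompetitionEntryAttribute
{
    public int CompetitionEntryId { get; set; }
    public string Attribute { get; set; }  // e.g. "FirstName", "WhyIn100Words"
    public string Value { get; set; }      // everything stored as text (large nvarchar)
}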
Another option is to create one table per competition (possibly in code, as part of competition creation), but depending on the number of competitions this may be impractical.

XML-serialized object in a database field. Is it good design?

Suppose I have one table that holds blogs.
The schema looks like :
ID (int)| Title (varchar 50) | Value (longtext) | Images (longtext)| ....
In the Images field I store an XML-serialized list of images that are associated with the blog.
Should I use another table for this purpose?
Yes, you should put the images in another table. Having several values in the same field indicates denormalized data and makes it hard to work with the database.
As with all rules, there are exceptions where it makes sense to put XML with multiple values into one field in the database. The first rule is that:
The data should always be read/written together. There is no need to read or update just one of the values.
If that is fulfilled, there can be a number of reasons to put the data together in one field:
Storage efficiency, if space has proved to be a problem.
Retrieval efficiency, if performance has proved to be a problem.
Schema flexibility, where one XML field can eliminate tens or hundreds of different tables.
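For contrast, here is the normalized alternative sketched as entity classes (names invented), with one row per image instead of one XML blob:

using System.Collections.Generic;

class Blog
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Value { get; set; }
    public List<BlogImage> Images { get; set; } = new();  // one row per image
}

class BlogImage
{
    public int Id { get; set; }
    public int BlogId { get; set; }   // FK back to Blog
    public string Url { get; set; }
}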
I would certainly use another table. If you use XML, what happens when you need to go through and update the references to all the images? (Would you rather just do an UPDATE blog_images SET ..., or parse the XML for each row, make the update, and then regenerate the updated XML for each?)
Well, it is a bit "inner platform", but it will work. A separate table would allow better image querying, although on some RDBMS platforms this could also be achieved via an XML-type column and SQL/XML.
If this data only has to be opaque storage, then maybe. However, keep in mind you'll generally have to bring back the entire XML to the app-tier to do anything interesting with it (or: depending on platform, use SQL/XML, but I advise against this, as the DB isn't the place to do such processing in most cases).
My advice in all other cases: separate table.
That depends on whether you'd need to query on the actual image data itself. If you see a possible need to query on certain images, or images with certain attributes, then it would probably be best to store that image data in a different way.
Otherwise, leave it the way it is.
But remember, only include the fields in your SELECT when you need them.
Should I use another table for this purpose?
Not necessarily. You just have to ensure that you are not selecting the Images field in your queries when you don't need it. But if you wanted to normalize your schema, you could use another table and perform a join when you need the images.
