I recently became aware of the notion of a LazyList, and I would like to use it in my work.
I have several methods which may retrieve hundreds of thousands of entries from the database, and I want to return a LazyList<T> rather than a typical List<T>.
I could only find Lazy<List<T>>, which, as I understand it, is not the same thing: Lazy<List<T>> only makes the initialization of the list lazy, and that's not what I need.
I want to give an example from the Scheme language, in case anyone has used it.
Basically it is implemented with linked nodes, where the value of a given node needs to be calculated, and node.next is actually a function that must be evaluated to retrieve the next node.
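To make that concrete, here is a minimal sketch of such a structure in C# (the LazyNode type and its members are my own names, purely for illustration): each node holds a value plus a delegate that computes the next node only when it is first asked for.

using System;

public class LazyNode<T>
{
    private readonly Func<LazyNode<T>> _nextThunk;
    private LazyNode<T> _next;
    private bool _nextEvaluated;

    public T Value { get; private set; }

    public LazyNode(T value, Func<LazyNode<T>> nextThunk)
    {
        Value = value;
        _nextThunk = nextThunk;
    }

    // The tail is only computed the first time Next is read, then cached.
    public LazyNode<T> Next
    {
        get
        {
            if (!_nextEvaluated)
            {
                _next = _nextThunk != null ? _nextThunk() : null;
                _nextEvaluated = true;
            }
            return _next;
        }
    }
}

Walking Next then forces one node at a time, much like a delayed cdr in Scheme.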
I wonder how to actually handle lists with around 400k entries. It sounds very expensive to hold a List that is a couple of MB in size and which, depending on the DB operations the program needs to do, could possibly grow to GBs.
I'm currently using .NET 4.5 with C# 4.
Instead of returning a List<T> or LazyList, why not yield return the results? This is much better than retrieving all rows up front: it streams the results row by row, which is better for memory management.
For example: (PSEUDO)
private IEnumerable<Row> GetRows(SqlConnection connection)
{
    var resultSet = connection.ExecuteQuery(.....);   // pseudo
    resultSet.Open();
    try
    {
        while (resultSet.FetchNext())
        {
            Row row = resultSet.CurrentRow;   // read one row (pseudo)
            yield return row;                 // hand it to the caller; execution resumes here for the next row
        }
    }
    finally
    {
        resultSet.Close();                    // runs when the caller finishes or abandons the enumeration
    }
}

foreach (var row in GetRows(connection))
{
    // handle the row.
}
This way the result set is handled one row at a time.
Related
I'm building a multithreaded program that handles big data, and I wonder what I can do to tweak it.
Right now I have on the order of 50 million entries in a normal List, and since I use multithreading I use a lock statement.
public string getUsername()
{
    string user = null;
    lock (UsersToCheckExistList)
    {
        user = UsersToCheckExistList.First();
        UsersToCheckExistList.Remove(user);
    }
    return user;
}
When I run smaller lists of around 500k lines it works much faster, but when I load a bigger list of 5-50 million it starts to slow down. One way to solve this issue is to dynamically create many small lists and store them in a Dictionary, and that is the way I think I will go. But as I want to learn more about optimizing, I wonder if there is a better solution for this task?
All I want is to get a value from the collection and remove it from the collection at the same time.
You're using the wrong tools for the job: explicit locking is quite expensive, not to mention that the cost of removing the head of a List is O(Count). If you want a collection that is accessed concurrently, it's best to use the types in System.Collections.Concurrent, as they are heavily optimised for concurrent access. From your use case it seems you want a queue of users, so use a ConcurrentQueue:
ConcurrentQueue<string> UsersQueue;

public string getUsername()
{
    string user = null;
    // TryDequeue atomically removes and returns the head; user stays null if the queue is empty.
    UsersQueue.TryDequeue(out user);
    return user;
}
The problem is that removing the first item from a list is O(n), so as your list grows it takes longer to remove the first item. You would probably be better off using a Queue instead. Since you need thread safety, you could use ConcurrentQueue, which handles efficient locking for you.
You can put them all in a ConcurrentBag (https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentbag-1?view=netframework-4.8); each thread can then use the TryTake method to grab one entry and remove it at the same time, so you don't need to worry about doing your own locking.
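As a minimal sketch (assuming the bag is populated elsewhere before the worker threads start):

using System.Collections.Concurrent;

class UserWork
{
    // Filled up front with the usernames to check.
    private readonly ConcurrentBag<string> _usersToCheck = new ConcurrentBag<string>();

    public string GetUsername()
    {
        string user;
        // TryTake atomically removes one item; it returns false (and user stays null) if the bag is empty.
        return _usersToCheck.TryTake(out user) ? user : null;
    }
}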
If you have enough RAM for your data, you should definitely use ConcurrentQueue for FIFO access to your data.
But if you don't have enough RAM, you can try using a database. Modern databases can cache data very effectively, so you will have almost instant access to your data while saving the OS from swapping.
I have a function which will be called thousands of times per day, and I want to optimize this function so it will be as fast and efficient as possible.
In this function, a list is checked and, based on the result of this check, different actions happen. My question is: what is the most efficient way to determine how many elements are in this list?
Obviously, you can just check like this:
List<Objects> data = GetData();

if (data.Count == 0)
{
    //Do something
}
else if (data.Count < 5)
{
    //Do something with a small list
}
else
{
    //Do something with a larger list
}
Is this already the fastest/most efficient way of doing this?
I came up with an alternative, but I would like some suggestions
List<Objects> data = GetData();
int amountOfObjects = data.Count();

if (amountOfObjects == 0)
{
    //Do something
}
else if (amountOfObjects < 5)
{
    //Do something with a small list
}
else
{
    //Do something with a larger list
}
You should use the Count property, as it is a pre-calculated value and does not need to be recalculated when you use it, whereas the Count() method will try to be smart and work out whether it needs to recount or not, but even that check costs more than just reading Count.
So just use what you initially had.
For List<T>, the Count property really just returns a field, because the implementation is an array-backed list that has to know exactly how many elements it contains. Therefore, you won't gain any performance by trying to cache this value or anything like that. It is simply not a problem.
This situation may be different with other collection implementations. For example, a linked list conceptually has no idea how many elements it contains and would have to walk them to count, which is an expensive operation (note that .NET's own LinkedList<T> does keep a running count, but the LINQ Count() over a plain IEnumerable<T> really does have to enumerate everything).
Edit: Your alternative using Count() is actually a very bad thing. Accessing the Count property on a List<T> variable compiles to a direct (non-virtual) call, whereas Count() involves a cast and a virtual method call through an interface. That costs more, and the JIT compiler can do less magic, such as inlining.
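To see the practical difference, here is a small sketch (the names are mine): Count on a List<T> is a plain property read, while the LINQ Count() over a sequence that is not an ICollection<T> has to walk every element.

using System;
using System.Collections.Generic;
using System.Linq;

class CountDemo
{
    static void Main()
    {
        List<int> list = Enumerable.Range(0, 1000000).ToList();

        int viaProperty = list.Count;     // O(1): reads a field the list already maintains
        int viaExtension = list.Count();  // still cheap (Count() spots ICollection<T>), but via a cast and interface call

        // A plain iterator has no stored count, so Count() must enumerate all of it: O(n).
        IEnumerable<int> sequence = list.Where(x => x % 2 == 0);
        int viaEnumeration = sequence.Count();

        Console.WriteLine("{0} {1} {2}", viaProperty, viaExtension, viaEnumeration);
    }
}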
Let's say I have a relatively large list of an object MyObjectModel called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count();
StringBuilder ListOfObjectID = new StringBuilder();

foreach (MyObjectModel ThisObject in MyBigList)
{
    ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}

int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
Which is the better option, or could yet another approach be better still? I'm concerned about passing around objects of this size and about memory usage.
I don't think you'll see a material difference between these two approaches. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.,) are of zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these other considerations. It doesn't look like too much code either way. I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to check the time needed for each method. See here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
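For instance, a small sketch (OptionA and OptionB are placeholders for your two code paths):

using System;
using System.Diagnostics;

static class Timing
{
    public static void Measure(string label, Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();  // start timing
        action();                             // run the code path under test
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
    }
}

// Usage:
// Timing.Measure("Option A", OptionA);
// Timing.Measure("Option B", OptionB);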
I think there are other issues you need to verify for an ASP.NET application:
Where do you read your list from? If you read it from the database, would it be more efficient to do the work in the database, within a stored procedure?
Where is it stored? Is it only read and then discarded, or is it kept in session or application state?
I have a CheckBoxList data-bound with encrypted ListItem values, and I wrote a method to return an array holding the checked items on postback.
Its signature would be similar to the one below:
private Array GetCheckedItems(CheckBoxList ctrlChkbox)
{
    //decrypt and push to array
}
Is this an optimal type to return? I will be accessing the array items again to push them individually into the database. (I will also be binding the same data to a GridView again to show the records; it's a single-page form with a GridView to show records.)
Which collection types might give me better performance than an array? Something key-based would be nice, I feel. Please advise.
Regards,
Deeptechtons
Performance questions around collections are quite difficult to answer.
A plain array gives good performance when the item count is known up front (as it seems to be for you, since the checked items are available from the UI) and you access it in a straightforward way.
A few notes on List<T>, since you said you would bind the data back to a GridView:
Depending on the number of elements, the thing to watch for is boxing/unboxing; I think that would be your main issue.
Extracting values to push into the database and binding them to a GridView may be two different uses of your data.
If boxing/unboxing is more of a concern for you than collecting the elements, a LinkedList<T> could be a way to insert and read them one after the other.
If there are many elements (I don't know how many), AddRange() on List<T> is also worth considering.
There are always many ways to do it, and it's hard to say which one is best.
The way I currently populate business objects is using something similar to the snippet below.
using (SqlConnection conn = new SqlConnection(Properties.Settings.Default.CDRDatabase))
{
    using (SqlCommand comm = new SqlCommand(SELECT, conn))
    {
        conn.Open();

        using (SqlDataReader r = comm.ExecuteReader(CommandBehavior.CloseConnection))
        {
            while (r.Read())
            {
                Ailias ailias = PopulateFromReader(r);
                tmpList.Add(ailias);
            }
        }
    }
}
private static Ailias PopulateFromReader(IDataReader reader)
{
    Ailias ailias = new Ailias();

    if (!reader.IsDBNull(reader.GetOrdinal("AiliasId")))
    {
        ailias.AiliasId = reader.GetInt32(reader.GetOrdinal("AiliasId"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("TenantId")))
    {
        ailias.TenantId = reader.GetInt32(reader.GetOrdinal("TenantId"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("Name")))
    {
        ailias.Name = reader.GetString(reader.GetOrdinal("Name"));
    }
    if (!reader.IsDBNull(reader.GetOrdinal("Extention")))
    {
        ailias.Extention = reader.GetString(reader.GetOrdinal("Extention"));
    }

    return ailias;
}
Does anyone have any suggestions of how to improve performance on something like this? Bear in mind that PopulateFromReader, for some types, contains more database look-ups in order to populate the object fully.
One obvious change would be to replace this kind of statement:
ailias.AiliasId = reader.GetInt32(reader.GetOrdinal("AiliasId"));
with
ailias.AiliasId = reader.GetInt32(constAiliasId);
where constAiliasId is a constant holding the ordinal of the field AiliasId.
This avoids the ordinal lookups in each iteration of the loop.
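Put together, the loading loop might look something like this (a sketch based on the code in the question, with the ordinals resolved once before the loop instead of per row):

private static List<Ailias> LoadAiliases(SqlDataReader r)
{
    List<Ailias> result = new List<Ailias>();

    // Look up each ordinal once, not once per column per row.
    int ordAiliasId  = r.GetOrdinal("AiliasId");
    int ordTenantId  = r.GetOrdinal("TenantId");
    int ordName      = r.GetOrdinal("Name");
    int ordExtention = r.GetOrdinal("Extention");

    while (r.Read())
    {
        Ailias ailias = new Ailias();
        if (!r.IsDBNull(ordAiliasId))  { ailias.AiliasId  = r.GetInt32(ordAiliasId); }
        if (!r.IsDBNull(ordTenantId))  { ailias.TenantId  = r.GetInt32(ordTenantId); }
        if (!r.IsDBNull(ordName))      { ailias.Name      = r.GetString(ordName); }
        if (!r.IsDBNull(ordExtention)) { ailias.Extention = r.GetString(ordExtention); }
        result.Add(ailias);
    }

    return result;
}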
If the data volume is high, the overhead of building a huge list can become a bottleneck; in that case, it can be more efficient to use a streaming object model, i.e.
public IEnumerable<YourType> SomeMethod(...args...) {
    using(connection+reader) {
        while(reader.Read()) {
            YourType item = BuildObj(reader);
            yield return item;
        }
    }
}
The consuming code (via foreach etc.) then only has to deal with a single object at a time. If they want a list, they can build one (with new List<YourType>(sequence), or in .NET 3.5: sequence.ToList()).
This involves a few more method calls (an additional MoveNext()/Current per sequence item, hidden behind the foreach), but you will never notice this when you have out-of-process data such as from a database.
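Roughly, the foreach over the returned sequence expands to something like the following, which is where those extra calls come from:

// What "foreach (var item in SomeMethod(...))" boils down to:
using (IEnumerator<YourType> e = SomeMethod(/* ...args... */).GetEnumerator())
{
    while (e.MoveNext())            // one extra interface call per item...
    {
        YourType item = e.Current;  // ...plus one property read per item
        // handle item
    }
}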
Your code looks almost identical to a lot of our business object loading functions. When we suspect DAL performance issues, we take a look at a few things.
How many times are we hopping out to the DB? Is there any way we can connect less often and bring back larger chunks of data via multiple result sets (we use stored procedures)? So, instead of each child object loading its own data, the parent fetches all the data for itself and its children. You can run into fragile SQL (sort orders that need to match, etc.) and tricky loops to walk over the DataReaders, but we have found it more efficient than multiple DB trips; see the sketch below.
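As a sketch of that multiple-result-set approach (the stored procedure name, the connection string variable, and the shape of the results are assumptions, not your actual schema): one round trip returns the parent rows followed by the child rows, and NextResult() moves the reader from one set to the next.

// connectionString is assumed to be defined elsewhere; "GetParentWithChildren" is a hypothetical proc.
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand("GetParentWithChildren", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    conn.Open();

    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // populate the parent objects from the first result set
        }

        reader.NextResult();  // advance to the child rows returned by the same call
        while (reader.Read())
        {
            // populate the child objects and attach them to their parents
        }
    }
}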
Fire up a packet sniffer/network monitor to see exactly how much data is being transmitted across the wire. You may be surprised to see how massive some of the result sets are. If they are, then you might think about alternate ways of approaching the issue. Like lazy/defer loading some child data.
Make sure that you are using all of the results you are asking for. For example, going from SELECT * FROM (with 30 fields being returned) to simply SELECT Id, Name FROM (if that is all you needed) could make a large difference.
AFAIK, that is as fast as it gets. Perhaps the slowness is in the SQL query/server. Or somewhere else.
It's likely the real problem is the multiple, per-object lookups that you mention. Have you looked closely to see if they can all be put into a single stored procedure?