Indexing subcategories vs finding them dynamically (performance) - c#

I'm building a web-based store application and I have to deal with many subcategories nested within each other. The problem is, I have no idea whether my script will handle thousands of them (the new system will replace the old one, so I know what traffic to expect). At the moment, with only about 30 products added across different categories, the response lag from the local server is already 1-2 seconds longer than on other pages.
My code is the following:
BazaArkadiaDataContext db = new BazaArkadiaDataContext();
List<A_Kategorie> Podkategorie = new List<A_Kategorie>();

public int IdKat { get; set; }

protected void Page_Load(object sender, EventArgs e)
{
    if (!IsPostBack)
    {
        List<A_Produkty> Produkty = new List<A_Produkty>(); //list of all products within the category and remaining subcategories

        if (Page.RouteData.Values["IdKategorii"] != null)
        {
            string tmpkat = Page.RouteData.Values["IdKategorii"].ToString();
            int index = tmpkat.IndexOf("-");
            if (index > 0)
                tmpkat = tmpkat.Substring(0, index);
            IdKat = db.A_Kategories.Where(k => k.ID == Convert.ToInt32(tmpkat)).Select(k => k.IDAllegro).FirstOrDefault();
        }
        else
            return;

        PobierzPodkategorie(IdKat);

        foreach (var item in Podkategorie)
        {
            var x = db.A_Produkties.Where(k => k.IDKategorii == item.ID);
            foreach (var itemm in x)
            {
                Produkty.Add(itemm);
            }
        }

        //data binding here
    }
}
List<A_Kategorie> PobierzPodkategorie(int IdKat, List<A_Kategorie> kat = null)
{
    List<A_Kategorie> Kategorie = new List<A_Kategorie>();
    if (kat != null)
        Kategorie.Concat(kat);
    Kategorie = db.A_Kategories.Where(k => k.KatNadrzedna == IdKat).ToList();
    if (Kategorie.Count() > 0)
    {
        foreach (var item in Kategorie)
        {
            PobierzPodkategorie(item.IDAllegro, Kategorie);
            Podkategorie.Add(item);
        }
    }
    return Kategorie;
}
TMC;DR*
My function PobierzPodkategorie recursively walks the subcategories (each subcategory has a KatNadrzedna column holding the ID of its parent category, which is stored in IDAllegro), selects all products with that subcategory ID and adds them to the Produkty list. The database structure is pretty wicked, as the category list is downloaded from another shop service's server, and it needed its own ID column in case the foreign server changes the structure.
There are more than 30 000 entries in the category list, some of them will have 5 or more parents, and the website will show only the main categories and their subcategories (the "lower" subcategories are needed by an external shop connected via SOAP).
My question is:
Will adding an index table to the database (category 123 is the parent of 1234, 12738, ...) improve performance, or is it just a waste of time? (The index would have to be updated whenever the API version changes, and I have no idea how often that will be.) Or is there another way to do it?
I'm asking because changing the script won't be possible once it is in production, and I don't know how the DB engine will handle lots of requests - I'd really appreciate any help with this.
The database is MSSQL
*Too much code; didn't read

The big efficiency gain you can get is to load all subproducts in a single query. The time saved by reducing network trips can be huge. If 1 is a root category and 12 a child category, you can query all root categories and their children like:
select *
from Categories
where len(Category) <= 2
An index on Category would not help with the above query. But it's good practice to have a primary key on any table. So I'd make Category the primary key. A primary key is unique, preventing duplicates, and it is indexed automatically.
Moving away from RBAR (row by agonizing row) has more effect than proper tuning of the database. So I'd tackle that first.
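On the C# side, the same principle applies to the product lookup in the question: instead of issuing one query per subcategory in a loop, collect the subcategory IDs first and fetch all their products in a single round trip. A minimal sketch using the names from the question (LINQ to SQL translates Contains into an IN clause):
    // Collect the IDs of all subcategories found by PobierzPodkategorie...
    var ids = Podkategorie.Select(k => k.ID).ToList();

    // ...and load every matching product in one query instead of one query per category.
    List<A_Produkty> Produkty = db.A_Produkties
        .Where(p => ids.Contains(p.IDKategorii))
        .ToList();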

You definitely should move the recursion into the database. It can be done with a WITH statement and Common Table Expressions. Then create a view or stored procedure and map it to your application.
With that you should be able to reduce the number of SQL queries to two (or even one).
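For example, a recursive CTE can return a whole category subtree in one statement. A rough sketch, assuming the table and column names from the question and that the LINQ to SQL context is used to run the raw SQL via ExecuteQuery:
    string sql = @"
        WITH Podkat AS
        (
            SELECT * FROM A_Kategorie WHERE KatNadrzedna = {0}
            UNION ALL
            SELECT k.* FROM A_Kategorie k
            INNER JOIN Podkat p ON k.KatNadrzedna = p.IDAllegro
        )
        SELECT * FROM Podkat";

    // One round trip for the whole subtree instead of one query per level.
    List<A_Kategorie> podkategorie = db.ExecuteQuery<A_Kategorie>(sql, IdKat).ToList();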

Related

C# - Creating a list by filtering a pre-existing list

I am very new to C# lists and databases, please keep this in mind.
I have a list of workouts saved in a database; the table also has a UserID field so that each workout added to the table is tied to a specific user. I want to make a list view so that when a user logs in, they can see only their own workouts.
I have tried to do this by creating a new list that excludes all the workouts that don't have that user's primary key/UserID:
public void Read()
{
    using (UserDataContext context = new UserDataContext())
    {
        DatabaseWorkouts = context.Workouts.ToList(); // Saves the workouts from the database into a list
        // DatabaseWorkouts = context.Workouts.FindAll(item => item.UserID != Globals.primaryKey); I thought this would work

        foreach (var item in DatabaseWorkouts.ToList())
        {
            if (DatabaseWorkouts.Exists(item => item.UserID != Globals.primaryKey))
            {
                DatabaseWorkouts.Remove(item);
            }
        }

        ItemList.ItemsSource = DatabaseWorkouts; // Displays the list on the listview in the GUI
    }
}
I have run many tests with the code above, and I think it only displays the most recent workouts that match the condition, instead of all the workouts that match the condition.
Please help
Instead of fetching all the workouts and then removing the ones that don't belong to the user, you could just directly fetch the user's ones.
Assuming that Globals.primaryKey is the targeted user's id, you can do the following
var userWorkouts = context.Workouts.Where(w => w.UserId == Globals.primaryKey).ToList();
ItemList.ItemsSource = userWorkouts;

Sorting a list by parent/child relationship so I have definitely processed a parent before I try to process the child

I have a system where people can send me a List<ProductCategory>.
I need to do some mapping to my system and then save these to the database.
The incoming data is in this format:
public string ExternalCategoryID { get; set; }
public string ExternalCategoryName { get; set; }
public string ExternalCategoryParentID { get; set; }
The incoming list is in no particular order. If ExternalCategoryParentID is null then this is a top level category. The parent child relationship can be any depth - i.e. Technology > TVs > Samsung > 3D > 40" > etc > etc
When I'm saving I need to ensure I've already saved the parent - I can't save TVs until I have saved Technology. The ExternalCategoryID is likely to be an int but this has no relevance on the parent child relationship (a parent can have a higher or lower id than a child).
How can I order this list so that I can loop through it and be certain that, for any child, I have already processed its parent?
The only way I can think of is to get all where ExternalCategoryParentID == null then get all where the ExternalCategoryParentID is in this "Top Level" list, then get the next set of children... etc. but this can't be the best solution. I'd prefer to sort first, then have a single loop to process. I have found this post, but it relies on createdDate which isn't relevant to me.
Turns out it wasn't so difficult after all. I wrote this function to do it - you pass in the original list and it returns a sorted list.
It works by looping through the list and checking whether any item in the list has an ID equal to the current item's parent ID. If there is one, we skip the current item and continue; if there isn't, we add the current item to sortedList, remove it from the original list and continue. This ensures that items are inserted into the sorted list only after their parent.
private List<HeisenbergProdMarketplaceCategory> SortByParentChildRelationship(List<HeisenbergProdMarketplaceCategory> heisenbergMarketplaceCategories)
{
    List<HeisenbergProdMarketplaceCategory> sortedList = new List<HeisenbergProdMarketplaceCategory>();

    // We can check that a category doesn't have a parent in the same list - if it does, leave it and continue.
    // Eventually the list will be empty.
    while (heisenbergMarketplaceCategories.Count > 0)
    {
        for (int i = heisenbergMarketplaceCategories.Count - 1; i >= 0; i--)
        {
            if (heisenbergMarketplaceCategories.SingleOrDefault(p => p.ExternalCategoryID == heisenbergMarketplaceCategories[i].ExternalCategoryParentID) == null)
            {
                sortedList.Add(heisenbergMarketplaceCategories[i]);
                heisenbergMarketplaceCategories.RemoveAt(i);
            }
        }
    }

    return sortedList;
}
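A quick usage sketch (incomingCategories and SaveCategory are placeholders for the caller's list and whatever mapping/saving is done per category):
    // Parents are guaranteed to appear before their children in the returned list.
    var ordered = SortByParentChildRelationship(incomingCategories);
    foreach (var category in ordered)
    {
        SaveCategory(category); // hypothetical save/mapping step
    }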
We can check whether the field value is null, substitute int.MaxValue for it, and then order by descending to get it at the top. In other words, take ExternalCategoryID = field.Value ?? int.MaxValue and then order by descending.
Sample:
var query = context.Categories.Select(o => new { Obj = o, OrderID = o.OrderID ?? int.MaxValue }).OrderByDescending(o => o.OrderID).Select(o => o.Obj);

How to Optimize Code Performance in .NET [closed]

I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?
static void ExportFromLegacy()
{
    var usersQuery = _oldDb.users.Where(x =>
        x.status == "active");

    int BatchSize = 1000;
    var errorCount = 0;
    var successCount = 0;
    var batchCount = 0;

    // Using MoreLinq's Batch for sequences
    // https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
    foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
    {
        Console.WriteLine(String.Format("Batch count at {0}", batchCount));
        batchCount++;

        foreach (var user in batch)
        {
            try
            {
                var userData = _oldDb.userData.Where(x =>
                    x.user_id == user.user_id).ToList();

                if (userData.Count > 0)
                {
                    // Insert into table
                    var newData = new newData()
                    {
                        UserId = user.user_id // shortened code for brevity.
                    };
                    _db.newUserData.Add(newData);
                    _db.SaveChanges();

                    // Insert item(s) into table
                    foreach (var item in userData.items)
                    {
                        if (!_db.userDataItems.Any(x => x.id == item.id))
                        {
                            var newItem = new Item()
                            {
                                UserId = user.user_id, // shortened code for brevity.
                                DataId = newData.id    // id from object created above
                            };
                            _db.userDataItems.Add(newItem);
                        }
                        _db.SaveChanges();
                        successCount++;
                    }
                }
            }
            catch (Exception ex)
            {
                errorCount++;
                Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id.ToString(), DateTime.Now));
                Console.WriteLine("Message: " + ex.Message);
                Console.WriteLine("InnerException: " + ex.InnerException);
            }
        }
    }

    Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
    Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
    Console.WriteLine(String.Format("Total running time: {0}", (DateTime.Now - exportStart).ToString(@"hh\:mm\:ss")));
}
Unfortunately, the major issue is the number of database round-trips.
You make a round-trip:
For every user, to retrieve the user data by user id from the old database
For every user, to save the user data in the new database
For every user, to save the user data items in the new database
So if you have 3 million users, and every user has an average of 5 user data items, that means at least 3m + 3m + 15m = 21 million database round-trips, which is insane.
The only way to dramatically improve the performance is to reduce the number of database round-trips.
Batch - Retrieve user by id
You can quickly reduce the number of database round-trips by retrieving all the user data at once, and since you don't have to track the entities, use AsNoTracking() for even more performance gains.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
    .AsNoTracking()
    .Where(x => list.Contains(x.user_id))
    .ToList();

foreach (var userData in userDatas)
{
    ....
}
This change alone should already save you a few hours.
Batch - Save Changes
Every time you save a user data or item, you perform a database round-trip.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library allows you to perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
You can either call BulkSaveChanges at the end of the batch, or build lists to insert and use BulkInsert directly for even more performance.
You will, however, have to use a relation to the newData instance instead of using the ID directly.
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all users for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
        .AsNoTracking()
        .Where(x => list.Contains(x.user_id))
        .ToList();

    // Create lists used for BulkInsert
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);
        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    _db.BulkInsert(newDatas);
    _db.BulkInsert(newDataItems);
}
EDIT: Answer subquestion
One of the properties of a newDataItem is the id of newData (e.g. newDataItem.newDataId), so newData would have to be saved first in order to generate its id. How would I BulkInsert if there is a dependency on another object?
You must use navigation properties instead. With a navigation property you never have to specify the parent id; you set the parent object instance instead.
public class UserData
{
    public int UserDataID { get; set; }
    // ... properties ...
    public List<UserDataItem> Items { get; set; }
}

public class UserDataItem
{
    public int UserDataItemID { get; set; }
    // ... properties ...
    public UserData OwnerData { get; set; }
}

var userData = new UserData();
var userDataItem = new UserDataItem();

// Use navigation property to set the parent.
userDataItem.OwnerData = userData;
Tutorial: Configure One-to-Many Relationship
Also, I don't see a BulkSaveChanges in your example code. Would that have to be called after all the BulkInserts?
BulkInsert inserts directly into the database. You don't have to call SaveChanges or BulkSaveChanges; once you invoke the method, it's done ;)
Here is an example using BulkSaveChanges:
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all users for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
        .AsNoTracking()
        .Where(x => list.Contains(x.user_id))
        .ToList();

    // Create lists used for BulkSaveChanges
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);
        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    var context = new UserContext();
    context.userDatas.AddRange(newDatas);
    context.userDataItems.AddRange(newDataItems);
    context.BulkSaveChanges();
}
BulkSaveChanges is slower than BulkInsert because it has to use some internal methods from Entity Framework, but it is still way faster than SaveChanges.
In the example, I create a new context for every batch to avoid memory issues and gain some performance. If you re-use the same context for all batches, you will end up with millions of tracked entities in the ChangeTracker, which is never a good idea.
Entity Framework is a very bad choice for importing large amounts of data. I know this from personal experience.
That being said, I found a few ways to optimize things when I tried to use it in the same way you are.
The Context will cache objects as you add them, and the more inserts you do, the slower future inserts will get. My solution was to limit each context to about 500 inserts before I disposed of that instance and created a new one. This boosted performance significantly.
I was able to make use of multiple threads to increase performance, but you will have to be very careful about resource contention. Each thread will definitely need its own Context, don't even think about trying to share it between threads. My machine had 8 cores, so threading will probably not help you as much; with a single core I doubt it will help you at all.
Turn off change tracking with AutoDetectChangesEnabled = false; change tracking is incredibly slow. Unfortunately this means you have to modify your code to make all changes directly through the context. No more Entity.Property = "Some Value";, it becomes Context.Entity(e => e.Property).SetValue("Some Value"); (or something like that, I don't remember the exact syntax), which makes the code ugly.
Any queries you do should definitely use AsNoTracking.
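Putting those tips together, a rough sketch (assuming an EF6-style DbContext; ExportContext is a placeholder and the entity names are borrowed from the question's code):
    const int FlushEvery = 500;
    int pending = 0;
    var context = new ExportContext();
    context.Configuration.AutoDetectChangesEnabled = false; // turn off change detection

    foreach (var user in _oldDb.users.AsNoTracking().Where(x => x.status == "active"))
    {
        context.newUserData.Add(new newData { UserId = user.user_id });

        if (++pending % FlushEvery == 0)
        {
            context.SaveChanges();
            context.Dispose();                                   // recycle the context to keep the ChangeTracker small
            context = new ExportContext();
            context.Configuration.AutoDetectChangesEnabled = false;
        }
    }

    context.SaveChanges(); // flush the final partial batch
    context.Dispose();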
With all that, I was able to cut a ~20 hour process down to about 6 hours, but I still don't recommend using EF for this. It was an extremely painful project, due almost entirely to my poor choice of EF for adding data. Please use something else... anything else...
I don't want to give the impression that EF is a bad data access library; it is great at what it was designed to do. Unfortunately, this is not what it was designed for.
I can think of a few options.
1) You could get a small speed increase by moving your _db.SaveChanges() below the closing bracket of your foreach():
foreach (...)
{
}
successCount += _db.SaveChanges();
2) Add the items to a list, and then add them to the context:
List<ObjClass> list = new List<ObjClass>();
foreach (...)
{
    list.Add(new ObjClass() { ... });
}
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
3) If it's a big amount of data, save in bunches:
List<ObjClass> list = new List<ObjClass>();
int cnt = 0;
foreach (...)
{
    list.Add(new ObjClass() { ... });
    if (++cnt % 100 == 0) // bunches of 100
    {
        _db.newUserData.AddRange(list);
        successCount += _db.SaveChanges();
        list.Clear();
        // Optional if a HUGE amount of data
        if (cnt % 1000 == 0)
        {
            _db = new MyDbContext();
        }
    }
}
// Don't forget that!
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
4) If it's TOOOO big, consider using bulk inserts. There are a few examples on the internet and a few free libraries around.
Ref: https://blogs.msdn.microsoft.com/nikhilsi/2008/06/11/bulk-insert-into-sql-from-c-app/
With most of these options you lose some control over error handling, as it is difficult to know which record failed.
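For option 4, one common way to bulk insert from C# without extra libraries is ADO.NET's SqlBulkCopy. A minimal sketch (the table and column names here are placeholders, not taken from the question):
    // Build an in-memory table matching the destination schema (placeholder columns).
    var table = new DataTable();
    table.Columns.Add("UserId", typeof(int));
    table.Columns.Add("DataId", typeof(int));
    // ... one table.Rows.Add(...) per record to import ...

    using (var bulk = new SqlBulkCopy(connectionString)) // connectionString: your target DB
    {
        bulk.DestinationTableName = "dbo.userDataItems";
        bulk.BatchSize = 5000;
        bulk.WriteToServer(table); // single bulk operation instead of row-by-row inserts
    }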

LINQ To SQL Include existing object in insert

In this method, I am inserting a new item (Room) into the database. That process functions as expected.
But, in addition to that, each time I add a room, I want to add a piece of furniture as the initial piece. Each item of type Furniture has a RoomID to designate its location, so a Room contains a collection of Furniture. Below, I am retrieving the "primary" piece of furniture from the database, adding it to the room's furniture collection, and submitting changes. The room gets added to the database, but the Furniture.RoomID column remains null.
public void AddResidentToUniverse(int residentID, int universeID)
{
    Universe uni = _context.Universes.FirstOrDefault(u => u.UniverseID == universeID);
    Resident res = _context.Residents.FirstOrDefault(r => r.ResidentID == residentID);

    if (uni != null && res != null)
    {
        Room e = new Room();
        Furniture primary = _context.Furnitures.FirstOrDefault(p => p.FurnitureID == new FurnitureController().GetPrimary(universeID).FurnitureID);

        e.UniverseID = uni.UniverseID;
        e.RoomName = res.RootName;
        e.ResidentID = residentID;
        e.Expired = null;
        e.Furniture.Add(primary);

        uni.Rooms.Add(e);
        _context.SubmitChanges();
    }
}
You need to add a line that tells your database what you want to insert. For example,
uni.Rooms.InsertOnSubmit(Room object);
uni.Furniture.InsertOnSubmit(furniture piece);
after this, you can write your
uni.SubmitChanges();
line.
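For reference, in LINQ to SQL InsertOnSubmit is defined on the DataContext's Table<T> and SubmitChanges on the DataContext itself, so a hedged sketch with the names from the question might look like this (the Rooms table property on _context is an assumption):
    Room e = new Room();
    _context.Rooms.InsertOnSubmit(e);   // mark the new room for insertion
    e.Furniture.Add(primary);           // if the association is mapped, this should set primary.RoomID on submit
    _context.SubmitChanges();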
I finally bit the bullet, erased my dbml, dropped and recreated the tables, and recreated my dbml. The Furniture.RoomID column updates correctly now. A totally unsatisfying, ham-handed and brute-force approach, I know.

Efficient Way To Query Nested Data

I need to select a number of 'master' rows from a table, also returning for each result a number of detail rows from another table. What is a good way of achieving this without multiple queries (one for the master rows plus one per master row to get the detail rows)?
For example, with a database structure like below:
MasterTable:
- MasterId BIGINT
- Name NVARCHAR(100)
DetailTable:
- DetailId BIGINT
- MasterId BIGINT
- Amount MONEY
How would I most efficiently populate the data object below?
IList<MasterDetail> data;

public class Master
{
    private readonly List<Detail> _details = new List<Detail>();

    public long MasterId
    {
        get; set;
    }

    public string Name
    {
        get; set;
    }

    public IList<Detail> Details
    {
        get
        {
            return _details;
        }
    }
}

public class Detail
{
    public long DetailId
    {
        get; set;
    }

    public decimal Amount
    {
        get; set;
    }
}
Normally, I'd go for the two grids approach - however, you might also want to look at FOR XML - it is fairly easy (in SQL Server 2005 and above) to shape the parent/child data as XML and load it from there.
SELECT parent.*,
(SELECT * FROM child
WHERE child.parentid = parent.id FOR XML PATH('child'), TYPE)
FROM parent
FOR XML PATH('parent')
Also - LINQ-to-SQL supports this type of model, but you need to tell it which data you want ahead of time, via DataLoadOptions.LoadWith:
// sample from MSDN
Northwnd db = new Northwnd(@"c:\northwnd.mdf");
DataLoadOptions dlo = new DataLoadOptions();
dlo.LoadWith<Customer>(c => c.Orders);
db.LoadOptions = dlo;

var londonCustomers =
    from cust in db.Customers
    where cust.City == "London"
    select cust;

foreach (var custObj in londonCustomers)
{
    Console.WriteLine(custObj.CustomerID);
}
If you don't use LoadWith, you will get n+1 queries - one master, and one child list per master row.
It can be done with a single query like this:
select MasterTable.MasterId,
MasterTable.Name,
DetailTable.DetailId,
DetailTable.Amount
from MasterTable
inner join
DetailTable
on MasterTable.MasterId = DetailTable.MasterId
order by MasterTable.MasterId
Then, in pseudocode:
foreach (row in result)
{
    if (row.MasterId != currentMaster.MasterId)
    {
        list.Add(currentMaster);
        currentMaster = new Master { MasterId = row.MasterId, Name = row.Name };
    }
    currentMaster.Details.Add(new Detail { DetailId = row.DetailId, Amount = row.Amount });
}
list.Add(currentMaster);
list.Add(currentMaster);
There are a few edges to knock off there, but it should give you the general idea.
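A more concrete sketch of that loop over a SqlDataReader (the command variable and the column ordinals are assumptions; handling the very first row is one of those edges, solved here by the null check):
    var list = new List<Master>();
    Master currentMaster = null;

    using (var reader = command.ExecuteReader()) // command: SqlCommand running the join above
    {
        while (reader.Read())
        {
            long masterId = reader.GetInt64(0);
            if (currentMaster == null || currentMaster.MasterId != masterId)
            {
                currentMaster = new Master { MasterId = masterId, Name = reader.GetString(1) };
                list.Add(currentMaster);
            }

            currentMaster.Details.Add(new Detail
            {
                DetailId = reader.GetInt64(2),
                Amount = reader.GetDecimal(3)
            });
        }
    }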
select < columns > from master
select < columns > from master M join Child C on M.Id = C.MasterID
You can do it with two queries and one pass on each result set:
Query all masters ordered by MasterId, then query all details, also ordered by MasterId. Then, with two nested loops, iterate over the master data and create a new Master object for each row in the main loop; in the nested loop, iterate over the details while they have the same MasterId as the current Master object and populate its _details collection.
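A sketch of that merge (masterRows and detailRows stand for the two MasterId-ordered result sets, and the detail rows are assumed to expose their MasterId):
    var masters = new List<Master>();
    int d = 0;

    foreach (var row in masterRows)          // query 1: all masters, ordered by MasterId
    {
        var master = new Master { MasterId = row.MasterId, Name = row.Name };

        // query 2: all details, ordered by MasterId - consume while they match the current master.
        while (d < detailRows.Count && detailRows[d].MasterId == master.MasterId)
        {
            master.Details.Add(new Detail { DetailId = detailRows[d].DetailId, Amount = detailRows[d].Amount });
            d++;
        }

        masters.Add(master);
    }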
Depending on the size of your dataset, you can pull all of the data into your application in memory with two queries (one for all masters and one for all nested data) and then use that to programmatically create the sublists for each of your objects, giving something like:
List<Master> allMasters = GetAllMasters();
List<Detail> allDetail = GetAllDetail();

foreach (Master m in allMasters)
    foreach (Detail d in allDetail.FindAll(delegate(Detail det) { return det.MasterId == m.MasterId; }))
        m.Details.Add(d);
You're essentially trading memory footprint for speed with this approach. You can easily adapt this so that GetAllMasters and GetAllDetail only return the master and detail items you're interested in. Also note that for this to work you need to add a MasterId property to the Detail class.
This is an alternative you might consider. It does cost $150 per developer, but time is money too...
We use an object persistence layer called Entity Spaces that generates the code for you to do exactly what you want, and you can regenerate whenever your schema changes. Populating the objects with data is transparent. Using the objects you described above would look like this (excuse my VB, but it works in C# too):
Dim master As New BusinessObjects.Master
master.LoadByPrimaryKey(43)
Console.WriteLine(master.Name)

For Each detail As BusinessObjects.Detail In master.DetailCollectionByMasterId
    Console.WriteLine(detail.Amount)
    detail.Amount *= 1.15
Next

With master.DetailCollectionByMasterId.AddNew
    .Amount = 13
End With

master.Save()
