I'm trying to determine which approach is more efficient, performance-wise, when checking whether items already exist in a database.
I'm using LINQ to SQL on WP7 with SQL Server CE.
I'm going to be importing multiple objects into the database. There is a good chance that some of those objects already exist in the database, so I need to check each item as it comes in: if it's already there, skip it, otherwise add it.
Two approaches came to mind. The first is using a foreach and checking whether an object with the same name already exists in the db:
foreach (var item in items)
{
    // Make an individual call to the db for every item
    var possibleItem = /* SQL statement with a WHERE condition on the item's name */;
}
Making individual calls to the db, though, sounds pretty resource-intensive. The other idea was to do a full select of all the objects in the db and store them in a list, then run basically the same foreach, except now I don't have to hit the db for each item because I have direct access to the list. What are your thoughts on these approaches? Is there a better way?
If you can easily sort your items, you could do a single ordered select from the database and step through your in-memory list while reading the results to determine which items are new. That should be much faster than multiple selects while conserving memory.
using (var reader = cmd.ExecuteReader())
{
    while (reader.HasRows && reader.Read())
    {
        var id = reader.GetInt32(0);
        // test how id compares to the memory list
    }
}
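To make that "test how id compares" step concrete, here is a minimal sketch of the merge, assuming your incoming items are sorted by Id, the query behind cmd is ordered by that same key, and there is an Item class with an Id property (these names are assumptions, not from your code):

int index = 0;
var newItems = new List<Item>();

using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        var dbId = reader.GetInt32(0);

        // Everything in the sorted in-memory list with a smaller id
        // cannot exist in the database, so it is new.
        while (index < items.Count && items[index].Id < dbId)
            newItems.Add(items[index++]);

        // Equal ids mean the item already exists, so skip it.
        if (index < items.Count && items[index].Id == dbId)
            index++;
    }
}

// Whatever is left after the reader is exhausted is also new.
while (index < items.Count)
    newItems.Add(items[index++]);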
But if memory is of no concern, I'd probably just read all keys from the database into memory for simplicity.
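A minimal sketch of that simpler variant with LINQ to SQL, assuming a DataContext db, an Items table and Name as the uniqueness criterion (all of these names are assumptions):

// Load every existing name once, then check membership in memory.
var existingNames = new HashSet<string>(db.Items.Select(i => i.Name));

foreach (var item in items)
{
    if (existingNames.Contains(item.Name))
        continue;                      // already in the database, skip it

    db.Items.InsertOnSubmit(item);     // queue the new item
    existingNames.Add(item.Name);      // avoid duplicates within the import itself
}

db.SubmitChanges();                    // one round trip for all the inserts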
I have a SharePoint list on a site that I want to update nightly from a SQL Server DB, preferably using C#. Here is the catch: I do not know whether any records were removed or added, or whether any field in any record has been updated. I believe the simplest thing to do is remove the data from the list and then replace it with the new data. But is there any simple way to do this? I would hate to remove 3000+ items from the list one by one and then add the 3000+ records one at a time.
It depends on your environment. If there is not much load on the systems at night, I would prefer one of the following ways:
1) Build a timer job, delete the list (not the items one by one, because that is slow), recreate the list and import the items from the db. When we are talking about 3,000 - 5,000 elements, that is not much and should be done in under 10 minutes (see the sketch below).
2) Loop through the SharePoint list items and check field by field whether each one was updated in the db; if yes, update it.
I would prefer to delete the list and import the complete table, because we are not talking about that much data.
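A rough sketch of option 1 with the server object model, assuming the timer job already has the SPWeb and a DataTable rows loaded from SQL Server (the list title and column mapping are made up for illustration):

SPList oldList = web.Lists.TryGetList("ImportedData");   // SharePoint 2010+; on 2007 look the list up and catch the exception
if (oldList != null)
    oldList.Delete();                                     // drop the whole list, not item by item

Guid newListId = web.Lists.Add("ImportedData", "Nightly import from SQL",
                               SPListTemplateType.GenericList);
SPList newList = web.Lists[newListId];

foreach (DataRow row in rows.Rows)
{
    SPListItem item = newList.Items.Add();
    item["Title"] = row["Name"];                          // map whatever columns you need
    item.Update();
}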
Another good option is to use BCS or BDC. Then the data would always be in place and synced with the db. Look at
https://msdn.microsoft.com/en-us/library/office/jj163782.aspx
https://msdn.microsoft.com/de-de/library/ee231515(v=vs.110).aspx
Unfortunately there is no "easy" or elegant way to delete all the items in a list, like a DELETE statement in SQL. You can either delete the entire list and recreate it, if the list can easily be created from a list definition, or, if your concern is performance, use batching: since SP 2007 the SPWeb class has a method called ProcessBatchData. You can use it to batch commands and avoid the performance penalty of issuing 6000 separate calls to the server. However, it still requires you to pass a rather ugly XML document that lists all the items to be deleted or added.
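For the ProcessBatchData route, the batch XML looks roughly like this. This is a sketch that assumes list and web are already resolved; the CAML batch format shown is the commonly documented one, but double-check it against your SharePoint version:

// Build one batch that deletes every item in the list in a single server call.
var sb = new StringBuilder();
sb.Append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><Batch OnError=\"Continue\">");

foreach (SPListItem item in list.Items)
{
    sb.AppendFormat(
        "<Method ID=\"{0}\"><SetList>{1}</SetList>" +
        "<SetVar Name=\"Cmd\">Delete</SetVar>" +
        "<SetVar Name=\"ID\">{0}</SetVar></Method>",
        item.ID, list.ID);
}

sb.Append("</Batch>");
web.ProcessBatchData(sb.ToString());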
The ideal way is to enumerate all the rows from the database and see if each row already exists in the SharePoint list using a primary field value. If it already exists, simply update it[1]. Otherwise add a new item.
[1] - Optionally, while updating, compare the list item field values with the database column values and update only if any field has changed; otherwise skip it.
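A sketch of that upsert loop, assuming a list field named "PrimaryKey" mirrors the database key and a DataTable dbRows holds the source rows (all of these names are illustrative, not taken from your setup):

// Index the existing list items by their key once, instead of scanning the list per row.
var existingByKey = list.Items.Cast<SPListItem>()
    .ToDictionary(i => Convert.ToString(i["PrimaryKey"]));

foreach (DataRow row in dbRows.Rows)
{
    string key = Convert.ToString(row["Id"]);
    SPListItem item;
    bool isNew = !existingByKey.TryGetValue(key, out item);

    if (isNew)
        item = list.Items.Add();      // new row in the database, add a list item

    // Write only when something actually changed, to avoid unnecessary versions.
    if (isNew || Convert.ToString(item["Title"]) != Convert.ToString(row["Name"]))
    {
        item["PrimaryKey"] = key;
        item["Title"] = row["Name"];
        item.Update();
    }
}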
I have a problem concerning application performance: I have many tables, each with millions of records. I am running select statements over them using joins, where clauses and order by on different criteria (specified by the user at runtime). I want to page my records, but no matter what I do with my SQL statements I cannot match the performance of serving pages directly from memory. The problem really starts when I have to filter my records by some dynamic criteria specified at runtime. I have tried everything, such as the ROW_NUMBER() function combined with a "WHERE RowNo BETWEEN" clause, CTEs, temp tables, etc. Those SQL solutions perform well only if I don't include filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that my app has several lists that virtually present paged millions of records, and those records are built with very complex SQL statements).
All my tables have a primary key of type INT.
So I came up with an idea: why not create a "server" used only for select statements? The server first loads all records from all tables and stores them in HashSet<T> collections, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented so that two records are "equal" only if their Ids are equal (don't scream, you will see later why I am not using all record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution in which I load only differential changes. So I invented a changelog table for each table that I want to cache. The changelog only receives inserts, which mark dirty rows (updates or deletes) and record newly inserted ids, all of it maintained by triggers. Whenever an in-memory select comes in, I first check whether I must sync something (by interrogating the changelog). If something must be applied, I load the changelog, apply those changes in memory and finally clear that changelog (or perhaps just remember the highest changelog id that I've applied ...).
In order to apply the changelog in O(N), where N is the changelog size, I am using this algorithm (see the sketch after the list):
For each log entry:
identify the in-memory Dictionary<int, T> whose key is the primary key;
if it's a delete log, call dictionary.Remove(id) (O(1));
if it's an update log, also call dictionary.Remove(id) (O(1)) and move this id into a "to be inserted" collection;
if it's an insert log, move this id into the "to be inserted" collection;
finally, refresh the cache by selecting all data from the corresponding table where Id is in the "to be inserted" collection.
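A minimal sketch of that loop, assuming a ChangeLogEntry with Operation and RecordId properties, a per-table cache of type Dictionary<int, T>, and a LoadRecordsByIds helper that issues the single "WHERE Id IN (...)" select (all of these names are assumptions):

var toReload = new List<int>();

foreach (var log in changeLog)
{
    switch (log.Operation)
    {
        case ChangeOperation.Delete:
            cache.Remove(log.RecordId);      // O(1) removal of the dead row
            break;

        case ChangeOperation.Update:
            cache.Remove(log.RecordId);      // drop the stale copy...
            toReload.Add(log.RecordId);      // ...and reload it in one query below
            break;

        case ChangeOperation.Insert:
            toReload.Add(log.RecordId);
            break;
    }
}

// One round trip: SELECT ... WHERE Id IN (toReload), then refresh the cache.
foreach (var record in LoadRecordsByIds(toReload))
    cache[record.Id] = record;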
For filtering, I am compiling expression trees into Func<T, List<FilterCriteria>, bool> delegates. Using this mechanism I am performing much faster than SQL.
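As an illustration of the compiled-filter idea (not the poster's actual code), a single equality criterion could be compiled once and reused like this; the property names and usage are assumptions:

using System.Linq.Expressions;

static Func<T, bool> CompileEqualsFilter<T>(string propertyName, object value)
{
    // Builds x => x.<propertyName> == value and compiles it to a delegate,
    // so the reflection cost is paid once instead of per row.
    var parameter = Expression.Parameter(typeof(T), "x");
    var property = Expression.Property(parameter, propertyName);
    var constant = Expression.Constant(value, property.Type);
    var body = Expression.Equal(property, constant);
    return Expression.Lambda<Func<T, bool>>(body, parameter).Compile();
}

// Usage against the in-memory cache (Customer/City are placeholder names):
// var filter = CompileEqualsFilter<Customer>("City", "Berlin");
// var page = cache.Values.Where(filter).Skip(pageIndex * pageSize).Take(pageSize).ToList();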
I know that SQL Server 2012 has caching support and the upcoming SQL Server version will support even more, but my client has SQL Server 2005, so ... I can't benefit from this.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job; I think it is nearly impossible to out-smart it.
Unless your data has some kind of implicit structure which might speed things up and which the optimizer cannot be aware of, such "I'll do my own speedy trick" approaches normally won't help...
Performance problems should first be solved where they occur:
the table structures and relations
indexes and statistics
quality of SQL statements
Even many million rows are no problem if the design and the queries are good...
If your queries do a lot of computation, or you need to retrieve data out of tricky structures (nested lists with recursive reads, XML...), I'd go the data-warehouse path and write some denormalized tables for quick selects. Of course you will have to deal with the fact that you are reading "old" data. If your data does not change much, you could trigger all changes into a denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your poorly performing queries together with the relevant structure details and ask for a review. There are dedicated Stack Exchange sites for this, such as Code Review. If it's not too big, you might try it here as well...
The data
I have a collection with around 300,000 vacations. Every vacation has several categories, countries, cities, activities and other subobjects. This data needs to be inserted into a MySQL / SQL Server database. I have the luxury of being able to truncate the entire database and start clean every time the parser program is run.
What I have tried
I have tried working with Entity Framework, which is also where my preference lies. To keep Entity Framework's performance up I have created a construction in which 300 items at a time are taken out of the vacations collection, parsed and inserted by Entity Framework, with its context disposed afterwards. The program finishes in a matter of minutes using this method. If I fill the context with all 300k vacations from the collection (and their subobjects), it takes hours.
int total = vacationsObjects.Count;
for (int i = 0; i < total; i += Math.Min(300, (total - i)))
{
    var set = vacationsObjects.Skip(i).Take(300);
    int enumerator = 0;
    using (var database = InitializeContext())
    {
        foreach (VacationModel vacationData in set)
        {
            enumerator++;
            Vacations vacation = new Vacations
            {
                ProductId = vacationData.ExternalId,
                Name = vacationData.Name,
                Description = vacationData.Description,
                Price = vacationData.Price,
                Url = vacationData.Url,
            };

            foreach (string category in vacationData.Categories)
            {
                var existingCategory = database.Categories.Local.FirstOrDefault(c => c.CategoryName == category);
                if (existingCategory != null)
                    vacation.Categories.Add(existingCategory);
                else
                {
                    vacation.Categories.Add(new Category
                    {
                        CategoryName = category
                    });
                }
            }

            database.Vacations.Add(vacation);
        }
        database.SaveChanges();
    }
}
The downside (and possibly dealbreaker) of this method is figuring out the relationships. As you can see, when adding a Category I check whether it has already been created in the local context and, if so, use that. But what if it has been added in a previous set of 300? I don't want to query the database multiple times for every vacation to check whether an entity already resides within it.
Possible solution
I could keep a dictionary in memory containing the categories that have been added. I'd need to figure out how to attach these categories to the proper vacations (or vice versa) and insert them, including their respective relations, into the database.
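A sketch of that idea, keeping one dictionary alive across the 300-item batches; when a category was saved by an earlier context, it is attached to the current one so EF does not insert it twice. The categoryCache, the attach step and the assumption that Category has a store-generated Id key are mine, not from the original code:

// Lives outside the batching loop so it survives each disposed context.
var categoryCache = new Dictionary<string, Category>();

// Inside the per-vacation loop, instead of querying database.Categories.Local:
foreach (string categoryName in vacationData.Categories)
{
    Category category;
    if (!categoryCache.TryGetValue(categoryName, out category))
    {
        category = new Category { CategoryName = categoryName };
        categoryCache[categoryName] = category;
    }
    else if (category.Id != 0 && database.Entry(category).State == EntityState.Detached)
    {
        // Already saved by a previous batch: attach it so EF treats it as existing
        // and only creates the relationship, not a duplicate category row.
        database.Categories.Attach(category);
    }

    vacation.Categories.Add(category);
}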
Possible alternatives
Segregate the context and the transaction -
Purely theoretical; I do not know if I'm making any sense here. Maybe I could have EF's context keep track of all objects and take manual control over the inserting part. I have messed around with this, trying to work with manual transaction scopes, to no avail.
Stored procedure -
I could write a stored procedure that handles and inserts my data. I'm not a big fan of this alternative, as I would like to keep the flexibility of switching between MySQL and SQL Server. Also, I would be in the dark as to where to begin.
Intermediary CSV file -
Instead of inserting the parsed data directly into the RDBMS, I could export it into one or more CSV files and make use of import tools such as MySQL's LOAD DATA INFILE.
Alternative database systems
Databases such as Azure Table Storage, MongoDB or RavenDB could be an option. However, I would prefer to stick to a traditional RDBMS due to compatibility with my skill set and tools.
I have been working on and researching this problem for a couple of weeks now. It seems like the best way of finding a solution that fits is by simply trying the different possibilities and observing the result. I was hoping that I could receive some pointers or tips from your personal experiences.
If you insert each record separately, the whole operation will take a lot of time. The bottleneck is the SQL queries between client and server: each query takes time, so try to avoid issuing many of them. For a huge amount of data it is much better to let the server process it in bulk. The best solution is to use a dedicated import tool: in MySQL you can use LOAD DATA, in MSSQL there is BULK INSERT. To import your data, you need a .csv file.
To handle foreign keys correctly, you must assign the key values yourself before importing. If the destination tables are empty, you can simply create a .csv file with predefined primary and foreign keys (see the sketch below). Otherwise you can pull the existing records from the server, merge in your data, then load everything back.
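For instance, a small sketch of writing the categories file with pre-assigned keys so the vacation-category link file can reference them; the file path, delimiter and column order are assumptions you would adapt to your LOAD DATA / BULK INSERT format:

int nextCategoryId = 1;
var categoryIds = new Dictionary<string, int>();

using (var writer = new StreamWriter(@"C:\import\categories.csv"))
{
    foreach (var name in vacationsObjects.SelectMany(v => v.Categories).Distinct())
    {
        int id = nextCategoryId++;
        categoryIds[name] = id;                  // remember the key for the link-table file
        writer.WriteLine("{0},{1}", id, name);
    }
}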
Time
Since you can afford to do only INSERTs, one suggestion is to try the EntityFramework.BulkInsert extension. I have used it to save up to 200K records and it works fine. Just include it in your project and write something like this:
context.BulkInsert(listOfEntities);
This should solve (or at least greatly improve over the plain EF version) the time dimension of your problem.
Data integrity
Keeping everything in one transaction does not sound reasonable (I expect the 300K parent records to generate at least 3M records overall), so I would try the following approach:
1) insert your entities using bulk insert.
2) call a stored procedure to check data integrity
If the insertion takes a long time and the chance of failure is relatively high, you can look up what has already been loaded and have the process skip it:
1) do smaller bulk inserts for a batch of vacation records and all their child records. Ensure each batch runs in a transaction; a single BULK INSERT runs atomically (no transaction needed), but for several statements it seems trickier.
2) if the process fails, you still have complete vacation data in your database (no partially imported vacations)
3) restart the process, but first load the existing vacation records (parents only). Using EF, a faster way is AsNoTracking, which spares the change-tracking overhead (great for large lists):
var existingVacations = context.Vacation.AsNoTracking().Select(v => v.VacationSourceIdentifier);
As suggested by Alexei, EntityFramework.BulkInsert is a very good solution if your model is supported by this library.
You can also use Entity Framework Extensions (PRO Version), which allows you to use BulkSaveChanges and bulk operations (Insert, Update, Delete and Merge).
It supports both of your providers: MySQL and SQL Server.
// Upgrade SaveChanges performance with BulkSaveChanges
var context = new CustomerContext();
// ... context code ...
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(operation => operation.BatchSize = 1000);
// Use direct bulk operation
context.BulkInsert(customers);
Disclaimer: I'm the owner of the project Entity Framework Extensions
I have a listview control that is filled with returned records from a SQL Statement. The fields may be something like:
SSN------|NAME|DATE----|TIME--|SYS
111222333|Bell|20140130|121507|P
123456789|John|20140225|135000|P
123456789|John|20140225|135002|N
The "duplicates" are generated from a ChangeLog, such as a change of address. Due to bad database design I have no control over however, an address change will create 2 records if a member happens to be a member of both SYS.
What would be the best way to go through each record in my listview, find duplicates by SSN & DATE (a record can be generated for both SYS values if a person is a member of both), and remove the duplicate with the lower TIME value?
I'm trying to do a code-based solution instead of SQL because the true SQL statement is already highly complex and this application needs to only be maintained until October.
For this, I've assumed you have some class that exposes the record's properties with easy access, such as SSN, Date and Time, and that SSN and Time are strings. In the code below I refer to this object as Record.
HINT: You might instead want to remove items based on the SYS flag instead of judging by time (it probably doesn't make a difference).
I did not use any lambda functions, on purpose, to keep this simple and easy to read.
Call this code every time you load items into the ListView. It would actually be a better idea to sanitize the list before you load it into the ListView, but the code below is a solution to your question based on the available info.
//Turn the ListView's ItemCollection into an easy to use List<Record>
List<Record> records = myListView.Items.OfType<Record>().ToList();

//Grab records that share an SSN and Date but have a lower Time value
List<Record> recordsToRemove = new List<Record>();
foreach (var record in records)
{
    foreach (var r in records)
    {
        // Same person and same day, but not the same row (assumes Record also exposes a Date property)
        if (record.SSN == r.SSN && record.Date == r.Date && record != r)
        {
            // Keep the record with the higher (more recent) Time, mark the other for removal
            if (int.Parse(r.Time) > int.Parse(record.Time))
                recordsToRemove.Add(record);
            else
                recordsToRemove.Add(r);
        }
    }
}
//Now actually remove the items from the ListView
foreach (var record in recordsToRemove)
{
    myListView.Items.Remove(record);
}
Currently I have a large set of data stored in a database.
This set of data (3000 records) will be retrieved by users frequently.
The method that I'm using right now is:
Retrieve this set of records from the database
Convert it into a DataTable
Store it in a cache object
Search for results in this cache object based on the query
CachePostData.Select(string.Format("Name LIKE '%{0}%'", txtItemName.Text));
Bind the result to a repeater (with paging that displays 40 records per page)
But I notice that the performance is not good (about 4 seconds per request). So I am wondering: is there a better way to do this? Or should I just retrieve the result from the database for every query?
DataTable.Select is probably not the most efficient way to search an in-memory cache, but it certainly shouldn't take anything like 4 seconds to search 3000 rows.
The first step is to find out where your performance bottlenecks are. I'm betting it has nothing to do with searching the cache, but you can easily find out, e.g. with code something like:
var stopwatch = Stopwatch.StartNew();
var result = CachePostData.Select(string.Format("Name LIKE '%{0}%'", txtItemName.Text));
WriteToLog("Read from cache took {0} ms", stopwatch.Elapsed.TotalMilliseconds);
where WriteToLog traces somewhere (e.g. System.Diagnostics.Trace, System.Diagnostics.Debug, or a logging framework such as log4net).
If you are looking for alternatives for caching, you could simply cache a generic list of entity objects, and use Linq to search the list:
var result = CachePostData.Where(x => x.Name.Contains(txtItemName.Text));
This is probably slightly more efficient (for example, it doesn't need to parse the "NAME LIKE ..." filter expression), but again, I don't expect this is your bottleneck.
I think using a cached DataTable would be more efficient, as it reduces the hits on your database server. You can store the DataTable in the cache and then reuse it. Something like this:
public DataTable myDatatable()
{
    DataTable dt = HttpContext.Current.Cache["key"] as DataTable;
    if (dt == null)
    {
        // Cache miss: load the data from the database (LoadFromDatabase is a
        // placeholder for your existing data-access code), then cache the result.
        dt = LoadFromDatabase();
        HttpContext.Current.Cache["key"] = dt;
    }
    return dt;
}
Also check SqlCacheDependency
You can also make the cached entry expire after a particular time interval, like this:
HttpContext.Current.Cache.Insert("key", dt, null, DateTime.Now.AddHours(2),
System.Web.Caching.Cache.NoSlidingExpiration);
Also check DataTable caching performance
It is a bit hard to recommend the correct solution for your problem without knowing how many hits on the database you actually expect. In most cases I would not cache the data in the ASP.NET cache for filtering, because searching with DataTable.Select basically performs a table scan and cannot take advantage of database indexing. Unless you run into really heavy load, most database servers should be capable of performing this task with less delay than filtering the DataTable in .NET.
If your database supports fulltext search (i.e. MSSQL or MySQL), you could create a fulltext index on the name column and query that. Fulltext search should give you even faster response times for these types of LIKE queries.
Generally, caching data for faster access is good, but in this case, the DataTable is most likely inferior to the database server in terms of searching for data. You could still use the cache to display unfiltered data faster and without hitting the database.