Linq and adding a large amount of data

Linq and adding a large amount of data - c#

I need insert to large amount of data to sqlite db.
I uses Linq to Entities.
I have problem to adding large amount of data 1M+.
Not enough memory or a very long time.
This code - fast, but requires a lot of memory:
// query - IQueryable of DbfRecord
// db - ObjectContext
int i = 0;
foreach (var item in query) {
db.AddToKladrs(new Kladr() {
Id = item.GetField(0),
ParentId = item.GetField(1),
RegionId = item.GetField(3),
Name = item.GetField(2),
Index = item.GetField(4)
});
if(++i % 4000 == 0)
db.SaveChanges(SaveOptions.AcceptAllChangesAfterSave);
}
This code - not resource-intensive, but very slow:
// query - IQueryable of DbfRecord
// db - ObjectContext
foreach (var item in query) {
db.ExecuteStoreCommand("insert into [Kladr] values({0}, {1}, {2}, {3}, {4})",
item.GetField(0),
item.GetField(1),
item.GetField(3),
item.GetField(2),
item.GetField(4)
);
}
I missed the try-catch construction and ghost types.
Help me find the best solution!

You can use the SqlBulkCopy for copying large amounts of data. Havn't tried it with SQL lite but it should work.
Link 1
Link 2
Update :
Here is a good answer by Marc Gravell. how-to-do-a-bulk-insert-linq-to-entities

Related

How can I improve the performance of adding many entities to DbContext.Set<T>().Attach(...) in EF6 [duplicate]

I'm looking for the fastest way of inserting into Entity Framework.
I'm asking this because of the scenario where you have an active TransactionScope and the insertion is huge (4000+). It can potentially last more than 10 minutes (default timeout of transactions), and this will lead to an incomplete transaction.

To your remark in the comments to your question:
"...SavingChanges (for each
record)..."
That's the worst thing you can do! Calling SaveChanges() for each record slows bulk inserts extremely down. I would do a few simple tests which will very likely improve the performance:
Call SaveChanges() once after ALL records.
Call SaveChanges() after for example 100 records.
Call SaveChanges() after for example 100 records and dispose the context and create a new one.
Disable change detection
For bulk inserts I am working and experimenting with a pattern like this:
using (TransactionScope scope = new TransactionScope())
{
MyDbContext context = null;
try
{
context = new MyDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
int count = 0;
foreach (var entityToInsert in someCollectionOfEntitiesToInsert)
{
++count;
context = AddToContext(context, entityToInsert, count, 100, true);
}
context.SaveChanges();
}
finally
{
if (context != null)
context.Dispose();
}
scope.Complete();
}
private MyDbContext AddToContext(MyDbContext context,
Entity entity, int count, int commitCount, bool recreateContext)
{
context.Set<Entity>().Add(entity);
if (count % commitCount == 0)
{
context.SaveChanges();
if (recreateContext)
{
context.Dispose();
context = new MyDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
}
}
return context;
}
I have a test program which inserts 560.000 entities (9 scalar properties, no navigation properties) into the DB. With this code it works in less than 3 minutes.
For the performance it is important to call SaveChanges() after "many" records ("many" around 100 or 1000). It also improves the performance to dispose the context after SaveChanges and create a new one. This clears the context from all entites, SaveChanges doesn't do that, the entities are still attached to the context in state Unchanged. It is the growing size of attached entities in the context what slows down the insertion step by step. So, it is helpful to clear it after some time.
Here are a few measurements for my 560000 entities:
commitCount = 1, recreateContext = false: many hours (That's your current procedure)
commitCount = 100, recreateContext = false: more than 20 minutes
commitCount = 1000, recreateContext = false: 242 sec
commitCount = 10000, recreateContext = false: 202 sec
commitCount = 100000, recreateContext = false: 199 sec
commitCount = 1000000, recreateContext = false: out of memory exception
commitCount = 1, recreateContext = true: more than 10 minutes
commitCount = 10, recreateContext = true: 241 sec
commitCount = 100, recreateContext = true: 164 sec
commitCount = 1000, recreateContext = true: 191 sec
The behaviour in the first test above is that the performance is very non-linear and decreases extremely over time. ("Many hours" is an estimation, I never finished this test, I stopped at 50.000 entities after 20 minutes.) This non-linear behaviour is not so significant in all other tests.

This combination increase speed well enough.
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;

You should look at using the System.Data.SqlClient.SqlBulkCopy for this. Here's the documentation, and of course there are plenty of tutorials online.
Sorry, I know you were looking for a simple answer to get EF to do what you want, but bulk operations are not really what ORMs are meant for.

as it was never mentioned here I want to recomment EFCore.BulkExtensions here
context.BulkInsert(entitiesList); context.BulkInsertAsync(entitiesList);
context.BulkUpdate(entitiesList); context.BulkUpdateAsync(entitiesList);
context.BulkDelete(entitiesList); context.BulkDeleteAsync(entitiesList);
context.BulkInsertOrUpdate(entitiesList); context.BulkInsertOrUpdateAsync(entitiesList); // Upsert
context.BulkInsertOrUpdateOrDelete(entitiesList); context.BulkInsertOrUpdateOrDeleteAsync(entitiesList); // Sync
context.BulkRead(entitiesList); context.BulkReadAsync(entitiesList);

The fastest way would be using bulk insert extension, which I developed
note: this is a commercial product, not free of charge
It uses SqlBulkCopy and custom datareader to get max performance. As a result it is over 20 times faster than using regular insert or AddRange
usage is extremely simple
context.BulkInsert(hugeAmountOfEntities);

I agree with Adam Rackis. SqlBulkCopy is the fastest way of transferring bulk records from one data source to another. I used this to copy 20K records and it took less than 3 seconds. Have a look at the example below.
public static void InsertIntoMembers(DataTable dataTable)
{
using (var connection = new SqlConnection(#"data source=;persist security info=True;user id=;password=;initial catalog=;MultipleActiveResultSets=True;App=EntityFramework"))
{
SqlTransaction transaction = null;
connection.Open();
try
{
transaction = connection.BeginTransaction();
using (var sqlBulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, transaction))
{
sqlBulkCopy.DestinationTableName = "Members";
sqlBulkCopy.ColumnMappings.Add("Firstname", "Firstname");
sqlBulkCopy.ColumnMappings.Add("Lastname", "Lastname");
sqlBulkCopy.ColumnMappings.Add("DOB", "DOB");
sqlBulkCopy.ColumnMappings.Add("Gender", "Gender");
sqlBulkCopy.ColumnMappings.Add("Email", "Email");
sqlBulkCopy.ColumnMappings.Add("Address1", "Address1");
sqlBulkCopy.ColumnMappings.Add("Address2", "Address2");
sqlBulkCopy.ColumnMappings.Add("Address3", "Address3");
sqlBulkCopy.ColumnMappings.Add("Address4", "Address4");
sqlBulkCopy.ColumnMappings.Add("Postcode", "Postcode");
sqlBulkCopy.ColumnMappings.Add("MobileNumber", "MobileNumber");
sqlBulkCopy.ColumnMappings.Add("TelephoneNumber", "TelephoneNumber");
sqlBulkCopy.ColumnMappings.Add("Deleted", "Deleted");
sqlBulkCopy.WriteToServer(dataTable);
}
transaction.Commit();
}
catch (Exception)
{
transaction.Rollback();
}
}
}

I would recommend this article on how to do bulk inserts using EF.
Entity Framework and slow bulk INSERTs
He explores these areas and compares perfomance:
Default EF (57 minutes to complete adding 30,000 records)
Replacing with ADO.NET Code (25 seconds for those same 30,000)
Context Bloat- Keep the active Context Graph small by using a new context for each Unit of Work (same 30,000 inserts take 33 seconds)
Large Lists - Turn off AutoDetectChangesEnabled (brings the time down to about 20 seconds)
Batching (down to 16 seconds)
DbTable.AddRange() - (performance is in the 12 range)

[2019 Update] EF Core 3.1
Following what have been said above, disabling AutoDetectChangesEnabled in EF Core worked perfectly: the insertion time was divided by 100 (from many minutes to a few seconds, 10k records with cross tables relationships)
The updated code is :
context.ChangeTracker.AutoDetectChangesEnabled = false;
foreach (IRecord record in records) {
//Add records to your database
}
context.ChangeTracker.DetectChanges();
context.SaveChanges();
context.ChangeTracker.AutoDetectChangesEnabled = true; //do not forget to re-enable

I've investigated Slauma's answer (which is awesome, thanks for the idea man), and I've reduced batch size until I've hit optimal speed. Looking at the Slauma's results:
commitCount = 1, recreateContext = true: more than 10 minutes
commitCount = 10, recreateContext = true: 241 sec
commitCount = 100, recreateContext = true: 164 sec
commitCount = 1000, recreateContext = true: 191 sec
It is visible that there is speed increase when moving from 1 to 10, and from 10 to 100, but from 100 to 1000 inserting speed is falling down again.
So I've focused on what's happening when you reduce batch size to value somewhere in between 10 and 100, and here are my results (I'm using different row contents, so my times are of different value):
Quantity | Batch size | Interval
1000 1 3
10000 1 34
100000 1 368
1000 5 1
10000 5 12
100000 5 133
1000 10 1
10000 10 11
100000 10 101
1000 20 1
10000 20 9
100000 20 92
1000 27 0
10000 27 9
100000 27 92
1000 30 0
10000 30 9
100000 30 92
1000 35 1
10000 35 9
100000 35 94
1000 50 1
10000 50 10
100000 50 106
1000 100 1
10000 100 14
100000 100 141
Based on my results, actual optimum is around value of 30 for batch size. It's less than both 10 and 100. Problem is, I have no idea why is 30 optimal, nor could have I found any logical explanation for it.

As other people have said SqlBulkCopy is the way to do it if you want really good insert performance.
It's a bit cumbersome to implement but there are libraries that can help you with it. There are a few out there but I will shamelesslyplug my own library this time: https://github.com/MikaelEliasson/EntityFramework.Utilities#batch-insert-entities
The only code you would need is:
using (var db = new YourDbContext())
{
EFBatchOperation.For(db, db.BlogPosts).InsertAll(list);
}
So how much faster is it? Very hard to say because it depends on so many factors, computer performance, network, object size etc etc. The performance tests I've made suggests 25k entities can be inserted at around 10s the standard way on localhost IF you optimize your EF configuration like mentioned in the other answers. With EFUtilities that takes about 300ms. Even more interesting is that I have saved around 3 millions entities in under 15 seconds using this method, averaging around 200k entities per second.
The one problem is ofcourse if you need to insert releated data. This can be done efficently into sql server using the method above but it requires you to have an Id generation strategy that let you generate id's in the app-code for the parent so you can set the foreign keys. This can be done using GUIDs or something like HiLo id generation.

Dispose() context create problems if the entities you Add() rely on other preloaded entities (e.g. navigation properties) in the context
I use similar concept to keep my context small to achieve the same performance
But instead of Dispose() the context and recreate, I simply detach the entities that already SaveChanges()
public void AddAndSave<TEntity>(List<TEntity> entities) where TEntity : class {
const int CommitCount = 1000; //set your own best performance number here
int currentCount = 0;
while (currentCount < entities.Count())
{
//make sure it don't commit more than the entities you have
int commitCount = CommitCount;
if ((entities.Count - currentCount) < commitCount)
commitCount = entities.Count - currentCount;
//e.g. Add entities [ i = 0 to 999, 1000 to 1999, ... , n to n+999... ] to conext
for (int i = currentCount; i < (currentCount + commitCount); i++)
_context.Entry(entities[i]).State = System.Data.EntityState.Added;
//same as calling _context.Set<TEntity>().Add(entities[i]);
//commit entities[n to n+999] to database
_context.SaveChanges();
//detach all entities in the context that committed to database
//so it won't overload the context
for (int i = currentCount; i < (currentCount + commitCount); i++)
_context.Entry(entities[i]).State = System.Data.EntityState.Detached;
currentCount += commitCount;
} }
wrap it with try catch and TrasactionScope() if you need,
not showing them here for keeping the code clean

I know this is a very old question, but one guy here said that developed an extension method to use bulk insert with EF, and when I checked, I discovered that the library costs $599 today (for one developer). Maybe it makes sense for the entire library, however for just the bulk insert this is too much.
Here is a very simple extension method I made. I use that on pair with database first (do not tested with code first, but I think that works the same). Change YourEntities with the name of your context:
public partial class YourEntities : DbContext
{
public async Task BulkInsertAllAsync<T>(IEnumerable<T> entities)
{
using (var conn = new SqlConnection(Database.Connection.ConnectionString))
{
await conn.OpenAsync();
Type t = typeof(T);
var bulkCopy = new SqlBulkCopy(conn)
{
DestinationTableName = GetTableName(t)
};
var table = new DataTable();
var properties = t.GetProperties().Where(p => p.PropertyType.IsValueType || p.PropertyType == typeof(string));
foreach (var property in properties)
{
Type propertyType = property.PropertyType;
if (propertyType.IsGenericType &&
propertyType.GetGenericTypeDefinition() == typeof(Nullable<>))
{
propertyType = Nullable.GetUnderlyingType(propertyType);
}
table.Columns.Add(new DataColumn(property.Name, propertyType));
}
foreach (var entity in entities)
{
table.Rows.Add(
properties.Select(property => property.GetValue(entity, null) ?? DBNull.Value).ToArray());
}
bulkCopy.BulkCopyTimeout = 0;
await bulkCopy.WriteToServerAsync(table);
}
}
public void BulkInsertAll<T>(IEnumerable<T> entities)
{
using (var conn = new SqlConnection(Database.Connection.ConnectionString))
{
conn.Open();
Type t = typeof(T);
var bulkCopy = new SqlBulkCopy(conn)
{
DestinationTableName = GetTableName(t)
};
var table = new DataTable();
var properties = t.GetProperties().Where(p => p.PropertyType.IsValueType || p.PropertyType == typeof(string));
foreach (var property in properties)
{
Type propertyType = property.PropertyType;
if (propertyType.IsGenericType &&
propertyType.GetGenericTypeDefinition() == typeof(Nullable<>))
{
propertyType = Nullable.GetUnderlyingType(propertyType);
}
table.Columns.Add(new DataColumn(property.Name, propertyType));
}
foreach (var entity in entities)
{
table.Rows.Add(
properties.Select(property => property.GetValue(entity, null) ?? DBNull.Value).ToArray());
}
bulkCopy.BulkCopyTimeout = 0;
bulkCopy.WriteToServer(table);
}
}
public string GetTableName(Type type)
{
var metadata = ((IObjectContextAdapter)this).ObjectContext.MetadataWorkspace;
var objectItemCollection = ((ObjectItemCollection)metadata.GetItemCollection(DataSpace.OSpace));
var entityType = metadata
.GetItems<EntityType>(DataSpace.OSpace)
.Single(e => objectItemCollection.GetClrType(e) == type);
var entitySet = metadata
.GetItems<EntityContainer>(DataSpace.CSpace)
.Single()
.EntitySets
.Single(s => s.ElementType.Name == entityType.Name);
var mapping = metadata.GetItems<EntityContainerMapping>(DataSpace.CSSpace)
.Single()
.EntitySetMappings
.Single(s => s.EntitySet == entitySet);
var table = mapping
.EntityTypeMappings.Single()
.Fragments.Single()
.StoreEntitySet;
return (string)table.MetadataProperties["Table"].Value ?? table.Name;
}
}
You can use that against any collection that inherit from IEnumerable, like that:
await context.BulkInsertAllAsync(items);

Yes, SqlBulkUpdate is indeed the fastest tool for this type of task. I wanted to find "least effort" generic way for me in .NET Core so I ended up using great library from Marc Gravell called FastMember and writing one tiny extension method for entity framework DB context. Works lightning fast:
using System.Collections.Generic;
using System.Linq;
using FastMember;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;
namespace Services.Extensions
{
public static class DbContextExtensions
{
public static void BulkCopyToServer<T>(this DbContext db, IEnumerable<T> collection)
{
var messageEntityType = db.Model.FindEntityType(typeof(T));
var tableName = messageEntityType.GetSchema() + "." + messageEntityType.GetTableName();
var tableColumnMappings = messageEntityType.GetProperties()
.ToDictionary(p => p.PropertyInfo.Name, p => p.GetColumnName());
using (var connection = new SqlConnection(db.Database.GetDbConnection().ConnectionString))
using (var bulkCopy = new SqlBulkCopy(connection))
{
foreach (var (field, column) in tableColumnMappings)
{
bulkCopy.ColumnMappings.Add(field, column);
}
using (var reader = ObjectReader.Create(collection, tableColumnMappings.Keys.ToArray()))
{
bulkCopy.DestinationTableName = tableName;
connection.Open();
bulkCopy.WriteToServer(reader);
connection.Close();
}
}
}
}
}

I'm looking for the fastest way of inserting into Entity Framework
There are some third-party libraries supporting Bulk Insert available:
Z.EntityFramework.Extensions (Recommended)
EFUtilities
EntityFramework.BulkInsert
See: Entity Framework Bulk Insert library
Be careful, when choosing a bulk insert library. Only Entity Framework Extensions supports all kind of associations and inheritances and it's the only one still supported.
Disclaimer: I'm the owner of Entity Framework Extensions
This library allows you to perform all bulk operations you need for your scenarios:
Bulk SaveChanges
Bulk Insert
Bulk Delete
Bulk Update
Bulk Merge
Example
// Easy to use
context.BulkSaveChanges();
// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);
// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);
// Customize Primary Key
context.BulkMerge(customers, operation => {
operation.ColumnPrimaryKeyExpression =
customer => customer.Code;
});

One of the fastest ways to save a list
you must apply the following code
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
AutoDetectChangesEnabled = false
Add, AddRange & SaveChanges: Doesn't detect changes.
ValidateOnSaveEnabled = false;
Doesn't detect change tracker
You must add nuget
Install-Package Z.EntityFramework.Extensions
Now you can use the following code
var context = new MyContext();
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.ValidateOnSaveEnabled = false;
context.BulkInsert(list);
context.BulkSaveChanges();

Try to use a Stored Procedure that will get an XML of the data that you want to insert.

I have made an generic extension of #Slauma s example above;
public static class DataExtensions
{
public static DbContext AddToContext<T>(this DbContext context, object entity, int count, int commitCount, bool recreateContext, Func<DbContext> contextCreator)
{
context.Set(typeof(T)).Add((T)entity);
if (count % commitCount == 0)
{
context.SaveChanges();
if (recreateContext)
{
context.Dispose();
context = contextCreator.Invoke();
context.Configuration.AutoDetectChangesEnabled = false;
}
}
return context;
}
}
Usage:
public void AddEntities(List<YourEntity> entities)
{
using (var transactionScope = new TransactionScope())
{
DbContext context = new YourContext();
int count = 0;
foreach (var entity in entities)
{
++count;
context = context.AddToContext<TenancyNote>(entity, count, 100, true,
() => new YourContext());
}
context.SaveChanges();
transactionScope.Complete();
}
}

SqlBulkCopy is super quick
This is my implementation:
// at some point in my calling code, I will call:
var myDataTable = CreateMyDataTable();
myDataTable.Rows.Add(Guid.NewGuid,tableHeaderId,theName,theValue); // e.g. - need this call for each row to insert
var efConnectionString = ConfigurationManager.ConnectionStrings["MyWebConfigEfConnection"].ConnectionString;
var efConnectionStringBuilder = new EntityConnectionStringBuilder(efConnectionString);
var connectionString = efConnectionStringBuilder.ProviderConnectionString;
BulkInsert(connectionString, myDataTable);
private DataTable CreateMyDataTable()
{
var myDataTable = new DataTable { TableName = "MyTable"};
// this table has an identity column - don't need to specify that
myDataTable.Columns.Add("MyTableRecordGuid", typeof(Guid));
myDataTable.Columns.Add("MyTableHeaderId", typeof(int));
myDataTable.Columns.Add("ColumnName", typeof(string));
myDataTable.Columns.Add("ColumnValue", typeof(string));
return myDataTable;
}
private void BulkInsert(string connectionString, DataTable dataTable)
{
using (var connection = new SqlConnection(connectionString))
{
connection.Open();
SqlTransaction transaction = null;
try
{
transaction = connection.BeginTransaction();
using (var sqlBulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, transaction))
{
sqlBulkCopy.DestinationTableName = dataTable.TableName;
foreach (DataColumn column in dataTable.Columns) {
sqlBulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}
sqlBulkCopy.WriteToServer(dataTable);
}
transaction.Commit();
}
catch (Exception)
{
transaction?.Rollback();
throw;
}
}
}

Here is a performance comparison between using Entity Framework and using SqlBulkCopy class on a realistic example: How to Bulk Insert Complex Objects into SQL Server Database
As others already emphasized, ORMs are not meant to be used in bulk operations. They offer flexibility, separation of concerns and other benefits, but bulk operations (except bulk reading) are not one of them.

Use SqlBulkCopy:
void BulkInsert(GpsReceiverTrack[] gpsReceiverTracks)
{
if (gpsReceiverTracks == null)
{
throw new ArgumentNullException(nameof(gpsReceiverTracks));
}
DataTable dataTable = new DataTable("GpsReceiverTracks");
dataTable.Columns.Add("ID", typeof(int));
dataTable.Columns.Add("DownloadedTrackID", typeof(int));
dataTable.Columns.Add("Time", typeof(TimeSpan));
dataTable.Columns.Add("Latitude", typeof(double));
dataTable.Columns.Add("Longitude", typeof(double));
dataTable.Columns.Add("Altitude", typeof(double));
for (int i = 0; i < gpsReceiverTracks.Length; i++)
{
dataTable.Rows.Add
(
new object[]
{
gpsReceiverTracks[i].ID,
gpsReceiverTracks[i].DownloadedTrackID,
gpsReceiverTracks[i].Time,
gpsReceiverTracks[i].Latitude,
gpsReceiverTracks[i].Longitude,
gpsReceiverTracks[i].Altitude
}
);
}
string connectionString = (new TeamTrackerEntities()).Database.Connection.ConnectionString;
using (var connection = new SqlConnection(connectionString))
{
connection.Open();
using (var transaction = connection.BeginTransaction())
{
using (var sqlBulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, transaction))
{
sqlBulkCopy.DestinationTableName = dataTable.TableName;
foreach (DataColumn column in dataTable.Columns)
{
sqlBulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}
sqlBulkCopy.WriteToServer(dataTable);
}
transaction.Commit();
}
}
return;
}

As per my knowledge there is no BulkInsert in EntityFramework to increase the performance of the huge inserts.
In this scenario you can go with SqlBulkCopy in ADO.net to solve your problem

All the solutions written here don't help because when you do SaveChanges(), insert statements are sent to database one by one, that's how Entity works.
And if your trip to database and back is 50 ms for instance then time needed for insert is number of records x 50 ms.
You have to use BulkInsert, here is the link: https://efbulkinsert.codeplex.com/
I got insert time reduced from 5-6 minutes to 10-12 seconds by using it.

Another option is to use SqlBulkTools available from Nuget. It's very easy to use and has some powerful features.
Example:
var bulk = new BulkOperations();
var books = GetBooks();
using (TransactionScope trans = new TransactionScope())
{
using (SqlConnection conn = new SqlConnection(ConfigurationManager
.ConnectionStrings["SqlBulkToolsTest"].ConnectionString))
{
bulk.Setup<Book>()
.ForCollection(books)
.WithTable("Books")
.AddAllColumns()
.BulkInsert()
.Commit(conn);
}
trans.Complete();
}
See the documentation for more examples and advanced usage. Disclaimer: I am the author of this library and any views are of my own opinion.

[NEW SOLUTION FOR POSTGRESQL]
Hey, I know it's quite an old post, but I have recently run into similar problem, but we were using Postgresql. I wanted to use effective bulkinsert, what turned out to be pretty difficult. I haven't found any proper free library to do so on this DB. I have only found this helper:
https://bytefish.de/blog/postgresql_bulk_insert/
which is also on Nuget. I have written a small mapper, which auto mapped properties the way Entity Framework:
public static PostgreSQLCopyHelper<T> CreateHelper<T>(string schemaName, string tableName)
{
var helper = new PostgreSQLCopyHelper<T>("dbo", "\"" + tableName + "\"");
var properties = typeof(T).GetProperties();
foreach(var prop in properties)
{
var type = prop.PropertyType;
if (Attribute.IsDefined(prop, typeof(KeyAttribute)) || Attribute.IsDefined(prop, typeof(ForeignKeyAttribute)))
continue;
switch (type)
{
case Type intType when intType == typeof(int) || intType == typeof(int?):
{
helper = helper.MapInteger("\"" + prop.Name + "\"", x => (int?)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type stringType when stringType == typeof(string):
{
helper = helper.MapText("\"" + prop.Name + "\"", x => (string)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type dateType when dateType == typeof(DateTime) || dateType == typeof(DateTime?):
{
helper = helper.MapTimeStamp("\"" + prop.Name + "\"", x => (DateTime?)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type decimalType when decimalType == typeof(decimal) || decimalType == typeof(decimal?):
{
helper = helper.MapMoney("\"" + prop.Name + "\"", x => (decimal?)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type doubleType when doubleType == typeof(double) || doubleType == typeof(double?):
{
helper = helper.MapDouble("\"" + prop.Name + "\"", x => (double?)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type floatType when floatType == typeof(float) || floatType == typeof(float?):
{
helper = helper.MapReal("\"" + prop.Name + "\"", x => (float?)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
case Type guidType when guidType == typeof(Guid):
{
helper = helper.MapUUID("\"" + prop.Name + "\"", x => (Guid)typeof(T).GetProperty(prop.Name).GetValue(x, null));
break;
}
}
}
return helper;
}
I use it the following way (I had entity named Undertaking):
var undertakingHelper = BulkMapper.CreateHelper<Model.Undertaking>("dbo", nameof(Model.Undertaking));
undertakingHelper.SaveAll(transaction.UnderlyingTransaction.Connection as Npgsql.NpgsqlConnection, undertakingsToAdd));
I showed an example with transaction, but it can also be done with normal connection retrieved from context. undertakingsToAdd is enumerable of normal entity records, which I want to bulkInsert into DB.
This solution, to which I've got after few hours of research and trying, is as you could expect much faster and finally easy to use and free! I really advice you to use this solution, not only for the reasons mentioned above, but also because it's the only one with which I had no problems with Postgresql itself, many other solutions work flawlessly for example with SqlServer.

Taking several notes, this is my implementation with improvements mine and from other answers and comments.
Improvements:
Getting the SQL connection string from my Entity
Using SQLBulk just in some parts, the rest only Entity Framework
Using the same Datetable column names that uses the SQL Database without need of mapping each column
Using the same Datatable name that uses SQL Datatable
public void InsertBulkDatatable(DataTable dataTable)
{
EntityConnectionStringBuilder entityBuilder = new EntityConnectionStringBuilder(ConfigurationManager.ConnectionStrings["MyDbContextConnectionName"].ConnectionString);
string cs = entityBuilder.ProviderConnectionString;
using (var connection = new SqlConnection(cs))
{
SqlTransaction transaction = null;
connection.Open();
try
{
transaction = connection.BeginTransaction();
using (var sqlBulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, transaction))
{
sqlBulkCopy.DestinationTableName = dataTable.TableName; //Uses the SQL datatable to name the datatable in c#
//Maping Columns
foreach (DataColumn column in dataTable.Columns) {
sqlBulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
}
sqlBulkCopy.WriteToServer(dataTable);
}
transaction.Commit();
}
catch (Exception)
{
transaction.Rollback();
}
}
}

The secret is to insert into an identical blank staging table. Inserts are lightening quick. Then run a single insert from that into your main large table. Then truncate the staging table ready for the next batch.
ie.
insert into some_staging_table using Entity Framework.
-- Single insert into main table (this could be a tiny stored proc call)
insert into some_main_already_large_table (columns...)
select (columns...) from some_staging_table
truncate table some_staging_table

Have you ever tried to insert through a background worker or task?
In my case, im inserting 7760 registers, distributed in 182 different tables with foreign key relationships ( by NavigationProperties).
Without the task, it took 2 minutes and a half.
Within a Task ( Task.Factory.StartNew(...) ), it took 15 seconds.
Im only doing the SaveChanges() after adding all the entities to the context. (to ensure data integrity)

You may use Bulk package library.
Bulk Insert 1.0.0 version is used in projects having Entity framework >=6.0.0 .
More description can be found here-
Bulkoperation source code

TL;DR
I know it is an old post, but I have implemented a solution starting from one of those proposed by extending it and solving some problems of this; moreover I have also read the other solutions presented and compared to these it seems to me to propose a solution that is much more suited to the requests formulated in the original question.
In this solution I extend Slauma's approach which I would say is perfect for the case proposed in the original question, and that is to use Entity Framework and Transaction Scope for an expensive write operation on the db.
In Slauma's solution - which incidentally was a draft and was only used to get an idea of the speed of EF with a strategy to implement bulk-insert - there were problems due to:
the timeout of the transaction (by default 1 minute extendable via code to max 10 minutes);
the duplication of the first block of data with a width equal to the size of the commit used at the end of the transaction (this problem is quite weird and circumvented by means of a workaround).
I also extended the case study presented by Slauma by reporting an example that includes the contextual insertion of several dependent entities.
The performances that I have been able to verify have been of 10K rec/min inserting in the db a block of 200K wide records approximately 1KB each. The speed was constant, there was no degradation in performance and the test took about 20 minutes to run successfully.
The solution in detail
the method that presides over the bulk-insert operation inserted in an example repository class:
abstract class SomeRepository {
protected MyDbContext myDbContextRef;
public void ImportData<TChild, TFather>(List<TChild> entities, TFather entityFather)
where TChild : class, IEntityChild
where TFather : class, IEntityFather
{
using (var scope = MyDbContext.CreateTransactionScope())
{
MyDbContext context = null;
try
{
context = new MyDbContext(myDbContextRef.ConnectionString);
context.Configuration.AutoDetectChangesEnabled = false;
entityFather.BulkInsertResult = false;
var fileEntity = context.Set<TFather>().Add(entityFather);
context.SaveChanges();
int count = 0;
//avoids an issue with recreating context: EF duplicates the first commit block of data at the end of transaction!!
context = MyDbContext.AddToContext<TChild>(context, null, 0, 1, true);
foreach (var entityToInsert in entities)
{
++count;
entityToInsert.EntityFatherRefId = fileEntity.Id;
context = MyDbContext.AddToContext<TChild>(context, entityToInsert, count, 100, true);
}
entityFather.BulkInsertResult = true;
context.Set<TFather>().Add(fileEntity);
context.Entry<TFather>(fileEntity).State = EntityState.Modified;
context.SaveChanges();
}
finally
{
if (context != null)
context.Dispose();
}
scope.Complete();
}
}
}
interfaces used for example purposes only:
public interface IEntityChild {
//some properties ...
int EntityFatherRefId { get; set; }
}
public interface IEntityFather {
int Id { get; set; }
bool BulkInsertResult { get; set; }
}
db context where I implemented the various elements of the solution as static methods:
public class MyDbContext : DbContext
{
public string ConnectionString { get; set; }
public MyDbContext(string nameOrConnectionString)
: base(nameOrConnectionString)
{
Database.SetInitializer<MyDbContext>(null);
ConnectionString = Database.Connection.ConnectionString;
}
/// <summary>
/// Creates a TransactionScope raising timeout transaction to 30 minutes
/// </summary>
/// <param name="_isolationLevel"></param>
/// <param name="timeout"></param>
/// <remarks>
/// It is possible to set isolation-level and timeout to different values. Pay close attention managing these 2 transactions working parameters.
/// <para>Default TransactionScope values for isolation-level and timeout are the following:</para>
/// <para>Default isolation-level is "Serializable"</para>
/// <para>Default timeout ranges between 1 minute (default value if not specified a timeout) to max 10 minute (if not changed by code or updating max-timeout machine.config value)</para>
/// </remarks>
public static TransactionScope CreateTransactionScope(IsolationLevel _isolationLevel = IsolationLevel.Serializable, TimeSpan? timeout = null)
{
SetTransactionManagerField("_cachedMaxTimeout", true);
SetTransactionManagerField("_maximumTimeout", timeout ?? TimeSpan.FromMinutes(30));
var transactionOptions = new TransactionOptions();
transactionOptions.IsolationLevel = _isolationLevel;
transactionOptions.Timeout = TransactionManager.MaximumTimeout;
return new TransactionScope(TransactionScopeOption.Required, transactionOptions);
}
private static void SetTransactionManagerField(string fieldName, object value)
{
typeof(TransactionManager).GetField(fieldName, BindingFlags.NonPublic | BindingFlags.Static).SetValue(null, value);
}
/// <summary>
/// Adds a generic entity to a given context allowing commit on large block of data and improving performance to support db bulk-insert operations based on Entity Framework
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="context"></param>
/// <param name="entity"></param>
/// <param name="count"></param>
/// <param name="commitCount">defines the block of data size</param>
/// <param name="recreateContext"></param>
/// <returns></returns>
public static MyDbContext AddToContext<T>(MyDbContext context, T entity, int count, int commitCount, bool recreateContext) where T : class
{
if (entity != null)
context.Set<T>().Add(entity);
if (count % commitCount == 0)
{
context.SaveChanges();
if (recreateContext)
{
var contextConnectionString = context.ConnectionString;
context.Dispose();
context = new MyDbContext(contextConnectionString);
context.Configuration.AutoDetectChangesEnabled = false;
}
}
return context;
}
}

Configuration.LazyLoadingEnabled = false;
Configuration.ProxyCreationEnabled = false;
these are too effect to speed without AutoDetectChangesEnabled = false; and i advise to use different table header from dbo. generally i use like nop,sop,tbl etc..

C# EF Query optimisation

I have the following function using entity framework, that checks a db table for expired product data, the entire query seems extremely slow and i am wondering if there is a better / more efficient way of adapting this query as there is always a bulk amount of data to clean sometimes upto 3k items.
It is purely the selection item that can take 4-10 mins dependant on db size on the day.
Datetime format Example = 2021-03-31T23:59:59.000
using (ProductContext db = new ProductContext())
{
log.Info("Checking for expired data in the Product Data db");
//rule = now - configured hours, default = 2 hours in past.
var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
var rowData = db.ProductData.Where(le => Convert.ToDateTime(le.ProductEndDate.Value.ToString().Trim()) < dtCheck).ToList();
rowData.ForEach(i => {log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
log.Info($"Number of expired Products being removed: {rowData.Count()}");
db.ProductData.RemoveRange(rowData);
db.SaveChanges();
log.Info(rowData.Count == 0
? "No expired data present."
: $"Number of expired assets Successfully removed from database = {rowData.Count}");
}
Thanks in advance
EDIT::
Thanks to all the suggestions i will be looking at the ORM comments made by Panagiotis Kanavos below in regard to direct queries for this type of item and amended the column datatype based on the comment from Panagiotis Kanavos again. Finally the .ToList comment by jdweng removed the lag immediately so this at least gets my moving faster for now while i look at the suggestions by Panagiotis Kanavos as i think that is probably the best way forward?
Faster code is now:
using (ProductContext db = new ProductContext())
{
log.Info("Checking for expired data in the Product Data db");
//rule = now - configured hours, default = 2 hours in past.
var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
// Ammended DB Table from nvarchar to DateTime to allow direct comparison based on comment by: Panagiotis Kanavos
// Removed ToList() which returned the IEnumerable immediately based on comment by: jdweng
var rowData = db.ProductData.Where(le => le.ProductEndDate < dtCheck);
log.Info($"Number of expired Products being removed: {rowData.Count()}");
// added print out on debug only.
if(log.IsDebugEnabled)
rowData.ToList().ForEach(i => {log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
var rowCount = rowData.Count();
db.ProductData.RemoveRange(rowData);
db.SaveChanges();
log.Info(rowCount == 0
? "No expired data present."
: $"Number of expired assets Successfully removed from database = {rowCount}");
}
So grateful to all the useful comments below and grateful for the time you all took to respond and help me learn from this.

How to Optimize Code Performance in .NET [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?
static void ExportFromLegacy()
{
var usersQuery = _oldDb.users.Where(x =>
x.status == 'active');
int BatchSize = 1000;
var errorCount = 0;
var successCount = 0;
var batchCount = 0;
// Using MoreLinq's Batch for sequences
// https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
Console.WriteLine(String.Format("Batch count at {0}", batchCount));
batchCount++;
foreach(var user in batch)
{
try
{
var userData = _oldDb.userData.Where(x =>
x.user_id == user.user_id).ToList();
if (userData.Count > 0)
{
// Insert into table
var newData = new newData()
{
UserId = user.user_id; // shortened code for brevity.
};
_db.newUserData.Add(newData);
_db.SaveChanges();
// Insert item(s) into table
foreach (var item in userData.items)
{
if (!_db.userDataItems.Any(x => x.id == item.id)
{
var item = new Item()
{
UserId = user.user_id, // shortened code for brevity.
DataId = newData.id // id from object created above
};
_db.userDataItems.Add(item);
}
_db.SaveChanges();
successCount++;
}
}
}
catch(Exception ex)
{
errorCount++;
Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id.ToString(), DateTime.Now));
Console.WriteLine("Message: " + ex.Message);
Console.WriteLine("InnerException: " + ex.InnerException);
}
}
}
Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
Console.WriteLine(String.Format("Total running time: {0}", (exportStart - DateTime.Now).ToString(#"hh\:mm\:ss")));
}

Unfortunately, the major issue is the number of database round-trip.
You make a round-trip:
For every user, you retrieve user data by user id in the old database
For every user, you save user data in the new database
For every user, you save user data item in the new database
So if you say you have 3 million users, and every user has an average of 5 user data item, it mean you do at least 3m + 3m + 15m = 21 million database round-trip which is insane.
The only way to dramatically improve the performance is by reducing the number of database round-trip.
Batch - Retrieve user by id
You can quickly reduce the number of database round-trip by retrieving all user data at once and since you don't have to track them, use "AsNoTracking()" for even more performance gains.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
foreach(var userData in userDatas)
{
....
}
You should already have saved a few hours only with this change.
Batch - Save Changes
Every time you save a user data or item, you perform a database round-trip.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library allows to perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
You can either call BulkSaveChanges at the end of the batch or create a list to insert and use directly BulkInsert instead for even more performance.
You will, however, have to use a relation to the newData instance instead of using the ID directly.
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
_db.BulkInsert(newDatas);
_db.BulkInsert(newDataItems);
}
EDIT: Answer subquestion
One of the properties of a newDataItem, is the id of newData. (ex.
newDataItem.newDataId.) So newData would have to be saved first in
order to generate its id. How would I BulkInsert if there is a
dependency of an another object?
You must use instead navigation properties. By using navigation property, you will never have to specify parent id but set the parent object instance instead.
public class UserData
{
public int UserDataID { get; set; }
// ... properties ...
public List<UserDataItem> Items { get; set; }
}
public class UserDataItem
{
public int UserDataItemID { get; set; }
// ... properties ...
public UserData OwnerData { get; set; }
}
var userData = new UserData();
var userDataItem = new UserDataItem();
// Use navigation property to set the parent.
userDataItem.OwnerData = userData;
Tutorial: Configure One-to-Many Relationship
Also, I don't see a BulkSaveChanges in your example code. Would that
have to be called after all the BulkInserts?
Bulk Insert directly insert into the database. You don't have to specify "SaveChanges" or "BulkSaveChanges", once you invoke the method, it's done ;)
Here is an example using BulkSaveChanges:
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
var context = new UserContext();
context.userDatas.AddRange(newDatas);
context.userDataItems.AddRange(newDataItems);
context.BulkSaveChanges();
}
BulkSaveChanges is slower than BulkInsert due to having to use some internal method from Entity Framework but still way faster than SaveChanges.
In the example, I create a new context for every batch to avoid memory issue and gain some performance. If you re-use the same context for all batchs, you will have millions of tracked entities in the ChangeTracker which is never a good idea.

Entity Framework is a very bad choice for importing large amounts of data. I know this from personal experience.
That being said, I found a few ways to optimize things when I tried to use it in the same way you are.
The Context will cache objects as you add them, and the more inserts you do, the slower future inserts will get. My solution was to limit each context to about 500 inserts before I disposed of that instance and created a new one. This boosted performance significantly.
I was able to make use of multiple threads to increase performance, but you will have to be very careful about resource contention. Each thread will definitely need its own Context, don't even think about trying to share it between threads. My machine had 8 cores, so threading will probably not help you as much; with a single core I doubt it will help you at all.
Turn off ChangeTracking with AutoDetectChangesEnabled = false;, change tracking is incredibly slow. Unfortunately this means you have to modify your code to make all changes directly through the context. No more Entity.Property = "Some Value";, it becomes Context.Entity(e=> e.Property).SetValue("Some Value"); (or something like that, I don't remember the exact syntax), which makes the code ugly.
Any queries you do should definitely use AsNoTracking.
With all that, I was able to cut a ~20 hour process down to about 6 hours, but I still don't recommend using EF for this. It was an extremely painful project due almost entirely to my poor choice of EF to add data. Please use something else... anything else...
I don't want to give the impression that EF is a bad data access library, it is great at what it was designed to do, unfortunately this is not what it was designed for.

I can think on a few options.
1) A little speed increase could be done by moving your _db.SaveChanges() under your foreach() close bracket
foreach (...){
}
successCount += _db.SaveChanges();
2) Add items to a list, and then to context
List<ObjClass> list = new List<ObjClass>();
foreach (...)
{
list.Add(new ObjClass() { ... });
}
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
3) If it's a big amount of dada, save on bunches
List<ObjClass> list = new List<ObjClass>();
int cnt=0;
foreach (...)
{
list.Add(new ObjClass() { ... });
if (++cnt % 100 == 0) // bunches of 100
{
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
// Optional if a HUGE amount of data
if (cnt % 1000 == 0)
{
_db = new MyDbContext();
}
}
}
// Don't forget that!
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
4) If TOOOO big, considere using bulkinserts. There are a few examples on internet and a few free libraries around.
Ref: https://blogs.msdn.microsoft.com/nikhilsi/2008/06/11/bulk-insert-into-sql-from-c-app/
On most of these options you loose some control on error handling as it is difficult to know which one failed.

Indexing subcategories vs finding them dynamically (performance)

I'm building a web-based store application, and I have to deal with many nested subcategories within each other. The point is, I have no idea whether my script will handle thousands (the new system will replace the old one, so I know what traffic I have to expect) - at the present day, respond lag from the local server is 1-2 seconds more than other pages with added about 30 products in different categories.
My code is the following:
BazaArkadiaDataContext db = new BazaArkadiaDataContext();
List<A_Kategorie> Podkategorie = new List<A_Kategorie>();
public int IdKat { get; set; }
protected void Page_Load(object sender, EventArgs e)
{
if (!IsPostBack)
{
List<A_Produkty> Produkty = new List<A_Produkty>(); //list of all products within the category and remaining subcategories
if (Page.RouteData.Values["IdKategorii"] != null)
{
string tmpkat = Page.RouteData.Values["IdKategorii"].ToString();
int index = tmpkat.IndexOf("-");
if (index > 0)
tmpkat = tmpkat.Substring(0, index);
IdKat = db.A_Kategories.Where(k => k.ID == Convert.ToInt32(tmpkat)).Select(k => k.IDAllegro).FirstOrDefault();
}
else
return;
PobierzPodkategorie(IdKat);
foreach (var item in Podkategorie)
{
var x = db.A_Produkties.Where(k => k.IDKategorii == item.ID);
foreach (var itemm in x)
{
Produkty.Add(itemm);
}
}
//data binding here
}
}
List<A_Kategorie> PobierzPodkategorie(int IdKat, List<A_Kategorie> kat = null)
{
List<A_Kategorie> Kategorie = new List<A_Kategorie>();
if (kat != null)
Kategorie.Concat(kat);
Kategorie = db.A_Kategories.Where(k => k.KatNadrzedna == IdKat).ToList();
if (Kategorie.Count() > 0)
{
foreach (var item in Kategorie)
{
PobierzPodkategorie(item.IDAllegro, Kategorie);
Podkategorie.Add(item);
}
}
return Kategorie;
}
TMC;DR*
My function PobierzPodkategorie recursively seeks through subcategories (subcategory got KatNadrzedna column for its parent category, which is placed in IDAllegro), selects all the products with the subcategory ID and adds it to the Produkty list. The database structure is pretty wicked, as the category list is downloaded from another shop service server and it needed to get our own ID column in case the foreign server would change the structure.
There are more than 30 000 entries in the category list, some of them will have 5 or more parents, and the website will show only main categories and subcategories ("lower" subcategories are needed by external shop connected with SOAP).
My question is
Will adding index table to the database (Category 123 is parent for 1234, 12738...) will improve the performance, or is it just waste of time? (The index should be updated when version of API changes and I have no idea how often would it be) Or is there other way to do it?
I'm asking because changing the script will not be possible in production, and I don't know how the db engine handles lots of requests - I'd really appreciate any help with this.
The database is MSSQL
*Too much code; didn't read

The big efficiency gain you can get is to load all subproducts in a single query. The time saved by reducing network trips can be huge. If 1 is a root category and 12 a child category, you can query all root categories and their children like:
select *
from Categories
where len(Category) <= 2
An index on Category would not help with the above query. But it's good practice to have a primary key on any table. So I'd make Category the primary key. A primary key is unique, preventing duplicates, and it is indexed automatically.
Moving away from RBAR (row by agonizing row) has more effect than proper tuning of the database. So I'd tackle that first.

You definitely should move the recursion into database. It can be done using WITH statement and Common Table Expressions. Then create a view or stored procedure and map it to you application.
With that you should be able to reduce SQL queries to two (or even one).

Speed issues fetching Thousands of rows in a SQLCE database

Quick disclaimer first: I am a complete noob when it comes to databases. I know how to form an SQL query and... well, that's pretty much it - I figured that'd be enough to start with. Performance optimizations would come later.
'Later' has arrived. I need your help.
I'm doing NLP-processing on news articles. The articles are taken from the Internet and stored in a database. Users give me an input period to analyze, I bring up all the articles in this period, analyze them and show them some graphs in return. I currently have a rather naive approach to this - I don't limit the number of articles returned. About 250 articles a day * 6 months is 45,000 records, a rather large number.
I'm experiencing mediocre fetch performance. I'm using C# + SQLCE (an easy DB to start with, with no set up cost). I tried indexing the database to no avail. I'm suspecting the problems comes from either
asking for so much data in one single query.
using SQLCE
Am I utterly crazy to try and fetch thousands of records all in 1 call ? Was SQLCE a stupid choice to make ? I basically need practical advice on this. Also, if you could point me to good alternatives to solve my problem that's even more awesome.
Your help is of great value to me - thanks in advance!
EDIT - Below is to command I use to get my articles:
using (SqlCeCommand com1 = new SqlCeCommand(mySqlRequestString, myConnectionString))
{
SqlCeResultSet res = com1.ExecuteResultSet(ResultSetOptions.Scrollable);
if (res.HasRows)
{
//Use the get ordinal method so we don’t have to worry about remembering what order our SQL put the field names in.
int ordGuid = res.GetOrdinal("Id"); int ordUrl = res.GetOrdinal("Url"); int ordPublicationDate = res.GetOrdinal("PublicationDate");
int ordTitle = res.GetOrdinal("Title"); int ordContent = res.GetOrdinal("Content"); int ordSource = res.GetOrdinal("Source");
int ordAuthor = res.GetOrdinal("Author"); int ordComputedKeywords = res.GetOrdinal("ComputedKeywords"); int ordComputedKeywordsDate = res.GetOrdinal("ComputedKeywordsDate");
//Get all the Articles
List<Article> articles = new List<Article>();
if (res.ReadFirst())
{
// Read the first record and get its data
res.ReadFirst();
Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
string[] computedKeywords = res.IsDBNull(ordComputedKeywords)?new string[]{}: res.GetString(ordComputedKeywords).Split(',').ToArray();
DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
}
// Read the remaining records
while (res.Read())
{
Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
string[] computedKeywords = res.IsDBNull(ordComputedKeywords) ? new string[] { } : res.GetString(ordComputedKeywords).Split(',').ToArray();
DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
}
return articles.ToArray();
}
}

You should only fetch one page of results at a time.
SqlCE is great for testing or very low usage applications, but you should really consider SQL Express or full blown SQL Server.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.