Can I do threads instead of tasks to make this run faster?
I'm trying to get 114,000 products into the database. As my code stands right now, I get about 100 products into the database per minute.
My tasks (producers) each scrape an XML file containing product data, package it in the Product class, then queue it for the consumer.
My consumer takes each product from the queue and puts it into the database one at a time. I use Entity Framework, so it's not safe to use from multiple threads.
public static void GetAllProductsFromIndexes_AndPutInDB(List<IndexModel> indexes, ProductContext context)
{
BlockingCollection<IndexModel> inputQueue = CreateInputQueue(indexes);
BlockingCollection<Product> productsQueue = new BlockingCollection<Product>(5000);
var consumer = Task.Run(() =>
{
foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
{
InsertProductInDB(readyProduct, context);
}
});
var producers = Enumerable.Range(0, 100)
.Select(_ => Task.Run(() =>
{
foreach (IndexModel index in inputQueue.GetConsumingEnumerable())
{
Product product = new Product();
byte[] unconvertedByteArray;
string xml;
string url = @"https://data.Icecat.biz/export/freexml.int/en/";
unconvertedByteArray = DownloadIcecatFile(index.IndexNumber.ToString() + ".xml", url);
xml = Encoding.UTF8.GetString(unconvertedByteArray);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
GetProductDetails(product, xmlDoc, index);
XmlNodeList nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductFeature"));
product.FeaturesLink = GetProductFeatures(product, nodeList);
nodeList = (xmlDoc.SelectNodes("ICECAT-interface/Product/ProductGallery/ProductPicture"));
product.Images = GetProductImages(nodeList);
productsQueue.Add(product);
}
})).ToArray();
Task.WaitAll(producers);
productsQueue.CompleteAdding();
consumer.Wait();
}
A couple of things you must do.
Detach each Product entity after you insert it, or they will all accumulate in the change tracker.
Don't call SaveChanges after every product; batch up a hundred or so, like this:
var consumer = Task.Run(() =>
{
var batch = new List<Product>();
foreach (Product readyProduct in productsQueue.GetConsumingEnumerable())
{
batch.Add(readyProduct);
if (batch.Count >= 100)
{
context.Products.AddRange(batch);
context.SaveChanges();
foreach (var p in batch)
{
context.Entry(p).State = EntityState.Detached;
}
batch.Clear();
}
}
context.Products.AddRange(batch);
context.SaveChanges();
foreach (var p in batch)
{
context.Entry(p).State = EntityState.Detached;
}
});
If you're on EF Core and your provider supports it (like SQL Server), you'll even get statement batching. You should expect several hundred rows per second using basic best practices here. If you need more than that, you can switch to a bulk load API (like SqlBulkCopy for SQL Server).
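If you end up needing that, here is a minimal sketch of bulk-loading one batch with SqlBulkCopy; the Products table name, the Name column, and the connection string are assumptions and would have to match your real schema.
// Sketch only (uses System.Data and System.Data.SqlClient).
// The destination table and columns are assumed; adjust to your schema.
static void BulkInsertBatch(List<Product> batch, string connectionString)
{
    // Build an in-memory table whose columns mirror the destination table.
    var table = new DataTable();
    table.Columns.Add("Name", typeof(string));
    // ...one column per Product property you want to persist

    foreach (var p in batch)
    {
        table.Rows.Add(p.Name); // assumed property
    }

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "Products"; // assumed table name
        bulk.BatchSize = batch.Count;
        bulk.WriteToServer(table);
    }
}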
First, read the speed rant to make sure this is even worth investigating.
Can I do threads instead of tasks to make this run faster?
Extremely unlikely. Multithreading has been used as a cheap way to implement multitasking for a while, but it is technically only useful if the task is CPU bound. You are doing a DB operation, which will be network bound - and more likely DB bound (databases apply additional bottlenecks as part of their reliability and concurrency protections).
I'm trying to get 114,000 products into the db.
Then your best bet is not trying to do that in code. Every DBMS worth its memory footprint has bulk insert options. Doing that in C# code will just make it slower and less reliable.
At best you add the network load of sending the data to the DB to the whole operation. At worst, you make it even slower than that. It is one of the most common mistakes with DBs, thinking you can beat the DBMS performance with code. It will not work.
Related
I have about 100 items (allRights) in the database and about 10 IDs to be searched (inputRightsIds). Which one is better: first get all rights and then search through them (Variant 1), or make 10 checking requests to the database (Variant 2)?
Here is some example code:
DbContext db = new DbContext();
int[] inputRightsIds = new int[10]{...};
Variant 1
var allRights = db.Rights.ToList();
foreach( var right in allRights)
{
for (int i = 0; i < inputRightsIds.Length; i++)
{
if(inputRightsIds[i] == right.Id)
{
// Do something
}
}
}
Variant 2
for (int i = 0; i < inputRightsIds.Length; i++)
{
if (db.Rights.Any(r => r.Id == inputRightsIds[i]))
{
// Do something
}
}
Thanks in advance!
As others have already stated, you should do the following.
var matchingIds = from r in db.Rights
where inputRightsIds.Contains(r.Id)
select r.Id;
foreach(var id in matchingIds)
{
// Do something
}
But this is different from both of your approaches. In your first approach you are making one SQL call to the DB that returns more results than you are interested in. The second makes multiple SQL calls, each returning part of the information you want. The query above makes one SQL call to the DB and returns only the data you are interested in. This is the best approach, as it reduces the two bottlenecks of making multiple calls to the DB and having too much data returned.
You can use the following:
db.Rights.Where(right => inputRightsIds.Contains(right.Id));
They should be very similar speeds since both must enumerate the arrays the same number of times. There might be subtle differences in speed between the two depending on the input data but in general I would go with Variant 2. I think you should almost always prefer LINQ over manual enumeration when possible. Also consider using the following LINQ statement to simplify the whole search to a single line.
var matches = db.Rights.Where(r => inputRightsIds.Contains(r.Id));
...//Do stuff with matches
Don't forget to pull all your items into memory if you want to process the list further:
var itemsFromDatabase = db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList();
Or you could even enumerate through the collection and do some work on each item:
db.Rights.Where(r => inputRightsIds.Contains(r.Id)).ToList().ForEach(item => {
//your code here
});
I have a slight issue with the following scenario:
I'm given a list of ID values; for each ID I need to run a SELECT query (where the ID is a parameter), then combine all the result sets into one big one and return it to the caller.
Since the query might run for minutes per ID (that's another issue, but for the moment I consider it a given fact), and there can be thousands of IDs in the input, I tried to use tasks. With that approach I experience a slow but steady increase in memory use.
As a test, I made a simple sequential solution too, this has normal memory usage graph, but as expected, very slow. There's an increase while it's running, but then everything drops back to the normal level when it's finished.
Here's the skeleton of code:
public class RowItem
{
public int ID { get; set; }
public string Name { get; set; }
//the rest of the properties
}
public List<RowItem> GetRowItems(List<int> customerIDs)
{
var rowItems = new List<RowItem>();

// this solution has the memory leak
var tasks = new List<Task<List<RowItem>>>();
foreach (var customerID in customerIDs)
{
var task = Task.Factory.StartNew(() => ProcessCustomerID(customerID));
tasks.Add(task);
}
while (tasks.Any())
{
var index = Task.WaitAny(tasks.ToArray());
var task = tasks[index];
rowItems.AddRange(task.Result);
tasks.RemoveAt(index);
}
// this works fine, but slow
foreach (var customerID in customerIDs)
{
rowItems.AddRange(ProcessCustomerID(customerID));
}
return rowItems;
}
private List<RowItem> ProcessCustomerID(int customerID)
{
var rowItems = new List<RowItem>();
using (var conn = new OracleConnection("XXX"))
{
conn.Open();
var sql = "SELECT * FROM ...";
using (var command = new OracleCommand(sql, conn))
{
using (var dataReader = command.ExecuteReader())
{
using (var dataTable = new DataTable())
{
dataTable.Load(dataReader);
rowItems = dataTable
.Rows
.OfType<DataRow>()
.Select(
row => new RowItem
{
ID = Convert.ToInt32(row["ID"]),
Name = row["Name"].ToString(),
//the rest of the properties
})
.ToList();
}
}
}
conn.Close();
}
return rowItems;
}
What am I doing wrong when using tasks? According to this MSDN article, I don't need to bother disposing them manually, but there's barely anything else. I guess ProcessCustomerID is OK, as it's called in both variations.
update
To log the current memory usage I used Process.GetCurrentProcess().PrivateMemorySize64, but I noticed the problem in Task Manager >> Processes
Using Entity Framework, your ProcessCustomerID method could look like:
List<RowItem> rowItems;
using (var ctx = new OracleEntities())
{
    rowItems = ctx.Customer
        .Where(o => o.id == customerID)
        .Select(o => new RowItem
        {
            ID = o.id,
            Name = o.Name,
            //the rest of the properties
        })
        .ToList();
}
return rowItems;
Unless you are transferring large amounts of data like images, video, or blobs, this should be near instantaneous with 1k rows as the result.
If it is unclear what is taking the time, and you use pre-10g Oracle, it will be really hard to monitor this. However, if you use Entity Framework you can attach monitoring to it: http://www.hibernatingrhinos.com/products/efprof
As of at least a year ago, Oracle supported Entity Framework 5.
In the sequential version they are executed one by one; in parallel they literally all get started at the same time, consuming your resources and creating deadlocks.
I don't think you have any evidence of a memory leak in the parallel execution.
Maybe garbage collection occurs at different times, and that's why you experienced two different readings. You cannot expect it to release memory in real time; .NET garbage collection occurs only when required. Have a look at "Fundamentals of Garbage Collection".
Task Manager or Process.GetCurrentProcess().PrivateMemorySize64 may not be a very accurate way to find memory leaks. If you use them, at least make sure you force a full garbage collection and wait for pending finalizers before reading the memory counters:
GC.Collect();
GC.WaitForPendingFinalizers();
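A rough sketch of sampling memory only after a full collection (the numbers will still fluctuate, just less misleadingly):
// Sketch (uses System and System.Diagnostics): force a full collection
// before sampling, so uncollected transient garbage doesn't look like a leak.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

long privateBytes = Process.GetCurrentProcess().PrivateMemorySize64;
Console.WriteLine("Private memory after full GC: {0} MB", privateBytes / (1024 * 1024));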
Update 2011-05-20 12:49AM: The foreach is still 25% faster than the parallel solution for my application. And don't use the collection count for max parallelism; use something closer to the number of cores on your machine.
I have an IO bound task that I would like to run in parallel. I want to apply the same operation to every file in a folder. Internally, the operation results in a Dispatcher.Invoke that adds the computed file info to a collection on the UI thread. So, in a sense, the work result is a side effect of the method call, not a value returned directly from the method call.
This is the core loop that I want to run in parallel
foreach (ShellObject sf in sfcoll)
ProcessShellObject(sf, curExeName);
The context for this loop is here:
var curExeName = Path.GetFileName(Assembly.GetEntryAssembly().Location);
using (ShellFileSystemFolder sfcoll = ShellFileSystemFolder.FromFolderPath(_rootPath))
{
//This works, but is not parallel.
foreach (ShellObject sf in sfcoll)
ProcessShellObject(sf, curExeName);
//This doesn't work.
//My attempt at PLINQ. This code never calls method ProcessShellObject.
var query = from sf in sfcoll.AsParallel().WithDegreeOfParallelism(sfcoll.Count())
let p = ProcessShellObject(sf, curExeName)
select p;
}
private String ProcessShellObject(ShellObject sf, string curExeName)
{
String unusedReturnValueName = sf.ParsingName;
try
{
DesktopItem di = new DesktopItem(sf);
//Update DesktopItem stuff
di.PropertyChanged += new PropertyChangedEventHandler(DesktopItem_PropertyChanged);
ControlWindowHelper.MainWindow.Dispatcher.Invoke(
(Action)(() => _desktopItemCollection.Add(di)));
}
catch (Exception ex)
{
}
return unusedReturnValueName ;
}
Thanks for any help!
+tom
EDIT: Regarding the update to your question. I hadn't spotted that the task was IO-bound - and presumably all the files are from a single (traditional?) disk. Yes, that would go slower - because you're introducing contention in a non-parallelizable resource, forcing the disk to seek all over the place.
IO-bound tasks can still be parallelized effectively sometimes - but it depends on whether the resource itself is parallelizable. For example, an SSD (which has much smaller seek times) may completely change the characteristics you're seeing - or if you're fetching over the network from several individually-slow servers, you could be IO-bound but not on a single channel.
You've created a query, but never used it. The simplest way of forcing the query to be fully evaluated would be to use Count() or ToList(), or something similar. However, a better approach would be to use Parallel.ForEach:
var options = new ParallelOptions { MaxDegreeOfParallelism = sfcoll.Count() };
Parallel.ForEach(sfcoll, options, sf => ProcessShellObject(sf, curExeName));
I'm not sure that setting the max degree of parallelism like that is the right approach though. It may work, but I'm not sure. A different way of approaching this would be to start all the operations as tasks, specifying TaskCreationOptions.LongRunning.
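A rough, untested sketch of that alternative, reusing the names from the question:
// Sketch: start each operation as a long-running task so the scheduler
// backs it with a dedicated thread instead of a thread-pool slot.
var tasks = sfcoll
    .Select(sf => Task.Factory.StartNew(
        () => ProcessShellObject(sf, curExeName),
        CancellationToken.None,
        TaskCreationOptions.LongRunning,
        TaskScheduler.Default))
    .ToArray();

Task.WaitAll(tasks);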
Your query object created via LINQ is an IEnumerable. It gets evaluated only when you enumerate it (e.g. via a foreach loop):
var query = from sf in sfcoll.AsParallel().WithDegreeOfParallelism(sfcoll.Count())
let p = ProcessShellObject(sf, curExeName)
select p;
foreach(var q in query)
{
// ....
}
// or:
var results = query.ToArray(); // also enumerates query
You should add a line at the end:
var results = query.ToList();
I have an SQL stored procedure that fetches the TOP 1000 records from a table that functions like a queue; in this table there will be roughly 30,000-40,000 records. A call to the SP takes ~4 seconds (there's an XML column), so finishing all the calls will take ~2 minutes.
I thought about using multi-threaded calls and inserting the records into a synchronized dictionary/list.
Has anyone done that before? Is there an efficient way to finish the calls as soon as possible?
Thanks...
Consider optimizing the query before resorting to threads.
In my experience, when beginners at multi-threading implement threads, it usually does not improve performance. Worse, it usually introduces subtle errors which can be difficult to debug.
Optimize the query first, and you may find that you don't need threads.
Even if you implemented them, eventually you'll have SQL Server doing too much work, and the threaded requests will simply have to wait.
A basic mistake is wanting to insert into the database from multiple threads, overloading the server with connections and locks, and eventually bringing it to its knees.
If you are READING the data, you will do much better if you find a query that will perform faster and fetch as much data as you can at once.
To me, it seems like your problem is not solvable at this level - maybe if you elaborate on what you want to do, you'll get better advice.
EDIT:
I did use SQL as a queue once - and I just remembered - to dequeue, you'll have to use the result from the first query as the input to the second, so threads are out of the question. Or you'll have to MARK your queued data 'done' in the database, and your READ becomes an UPDATE, resulting in locking.
If you are reading, and you want to react as soon as possible, you can use a DataReader, read ALL of the data, and chunk your processing into threads - read 100 records, fork a thread and pass the records to it, then the next records, and so on. That way you'll be able to balance your resource usage.
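A minimal sketch of that idea; the connection string, the query, the chunk size, and the ProcessChunk method are all placeholders:
// Sketch (uses System.Data.SqlClient and System.Threading.Tasks): read everything
// with a DataReader, handing off chunks of 100 rows to worker tasks so processing
// overlaps with reading. ProcessChunk is a hypothetical method you'd supply.
var pending = new List<object[]>(100);
var workers = new List<Task>();

using (var conn = new SqlConnection("<connection string>"))
using (var cmd = new SqlCommand("SELECT TOP 1000 * FROM QueueTable", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            var values = new object[reader.FieldCount];
            reader.GetValues(values);
            pending.Add(values);

            if (pending.Count == 100)
            {
                var chunk = pending;                // hand the filled chunk to a worker
                pending = new List<object[]>(100);  // start a fresh chunk
                workers.Add(Task.Run(() => ProcessChunk(chunk)));
            }
        }
    }
}

if (pending.Count > 0)
    workers.Add(Task.Run(() => ProcessChunk(pending)));

Task.WaitAll(workers.ToArray());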
Try reading the data asynchronously using a DataReader; fetch the columns that can uniquely identify the row in the database. Populate a queue holding the returned data values (a custom object) and run worker threads to perform the task against the queue.
You have to decide how many worker threads should be used to perform the task, as threads have their own overhead and, if not implemented correctly, can be a nightmare.
If you really have to, you can start BGWorkers that individually make connections to the server and report back with their progress.
I did the same thing for an elaborate export/import application to move roughly 50 GB of data (4 GB DeflateStream'ed), except I only used the BGWorker to do the work consecutively, not concurrently, without locking up the UI thread.
It isn't clear if you're selecting the 1000 most recently added rows, or the 1000 rows with the highest value in a particular column, nor is it clear whether your rows are mutable -- i.e. a row might qualify for the top 1000 yesterday but then get updated so that it no longer qualifies today. But if the individual rows are not mutable, you could have a separate table for the TOP1000, and when the 1001st row is inserted into it, an after-insert trigger would move the 1001st row (however you determine that row) to a HISTORY table. That would make the selection virtually instantaneous: select * from TOP1000. You'd simply combine the two tables with a UNION when you need to query the TOP1000 and HISTORY as though they were one table. Or instead of trigger you could wrap the insert and 1001st-row delete in a transaction.
Different can of worms, though, if the rows mutate, and can move in and out of the top 1000.
public struct BillingData
{
public int CustomerTrackID, CustomerID;
public DateTime BillingDate;
}
public Queue<BillingData> customerQueue = new Queue<BillingData>();
volatile static int ThreadProcessCount = 0;
readonly static object threadprocesslock = new object();
readonly static object queuelock = new object();
readonly static object countlock = new object();
AsyncCallback asyncCallback;
// Pulling the data asynchronously from the database
private void StartProcess()
{
SqlCommand command = SQLHelper.GetCommand("GetRecordsByBillingTrackID");
command.Connection = SQLHelper.GetConnection("Con");
SQLHelper.DeriveParameters(command);
command.Parameters["@TrackID"].Value = trackid;
asyncCallback = new AsyncCallback(FetchData);
command.BeginExecuteXmlReader(asyncCallback, command);
}
public void FetchData(IAsyncResult c1)
{
SqlCommand comm1 = (SqlCommand)c1.AsyncState;
System.Xml.XmlReader xr = comm1.EndExecuteXmlReader(c1);
xr.Read();
string data = "";
while (!xr.EOF)
{
data = xr.ReadOuterXml();
XmlDocument dom = new XmlDocument();
dom.LoadXml("<data>" + data + "</data>");
BillingData billingData;
billingData.CustomerTrackID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerTrackID"].Value);
billingData.CustomerID = Convert.ToInt32(dom.FirstChild.ChildNodes[0].Attributes["CustomerID"].Value);
billingData.BillingDate = Convert.ToDateTime(dom.FirstChild.ChildNodes[0].Attributes["BillingDate"].Value);
lock (queuelock)
{
if (!customerQueue.Contains(billingData))
{
customerQueue.Enqueue(billingData);
}
}
AssignThreadProcessToTheCustomer();
}
xr.Close();
}
// Assign the Threads based on the data pulled
private void AssignThreadProcessToTheCustomer()
{
int TotalProcessThreads = 5;
int TotalCustomersPerThread = 5;
if (ThreadProcessCount < TotalProcessThreads)
{
int ThreadsNeeded = (customerQueue.Count % TotalCustomersPerThread == 0) ? (customerQueue.Count / TotalCustomersPerThread) : (customerQueue.Count / TotalCustomersPerThread + 1);
int count = 0;
if (ThreadsNeeded > ThreadProcessCount)
{
count = ThreadsNeeded - ThreadProcessCount;
if ((count + ThreadProcessCount) > TotalProcessThreads)
count = TotalProcessThreads - ThreadProcessCount;
}
for (int i = 0; i < count; i++)
{
ThreadProcess objThreadProcess = new ThreadProcess(this);
ThreadPool.QueueUserWorkItem(objThreadProcess.BillingEngineThreadPoolCallBack, count);
lock (threadprocesslock)
{
ThreadProcessCount++;
}
        }
    }
}
public void BillingEngineThreadPoolCallBack(object threadContext)
{
BillingData? billingData = null;
while (true)
{
lock (queuelock)
{
billingData = ProcessCustomerQueue();
}
if (billingData != null)
{
StartBilling(billingData.Value);
}
else
break;
More....
}
This is my code:
void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
NASDataContext _db = new NASDataContext();
List<Instellingen> newOnes = new List<Instellingen>();
List<InstellingGegeven> li = _db.InstellingGegevens.ToList();
foreach (InstellingGegeven i in li) {
if (_db.Instellingens.Count(q => q.INST_LOC_REF == i.INST_LOC_REF && q.INST_LOCNR == i.INST_LOCNR && q.INST_REF == i.INST_REF && q.INST_TYPE == i.INST_TYPE) <= 0) {
// There is no item yet. Create one.
Instellingen newInst = new Instellingen();
newInst.INST_LOC_REF = i.INST_LOC_REF;
newInst.INST_LOCNR = i.INST_LOCNR;
newInst.INST_REF = i.INST_REF;
newInst.INST_TYPE = i.INST_TYPE;
newInst.Opt_KalStandaard = false;
newOnes.Add(newInst);
}
}
_db.Instellingens.InsertAllOnSubmit(newOnes);
_db.SubmitChanges();
}
Basically, the InstellingGegevens table gets filled in by some procedure from another server.
The thing I then need to do is check whether there are new records in this table, and insert the new ones into Instellingens.
This code runs for about 4 minutes on 15k records. How do I optimize it? Or is a stored procedure the only way?
This code runs in a timer, firing every 6 hours. If a stored procedure is best, how do I use it from a timer?
Timer Tim = new Timer(21600000); // 6 hours
Tim.Elapsed += new ElapsedEventHandler(fixInstellingenTabel);
Tim.Start();
Doing this in a stored procedure would be a lot faster. We do something quite similar, only there are about 100k items in the table, it's updated every five minutes, and it has a lot more fields in it. Our job takes about two minutes to run, and then it does updates in several tables across three databases, so your job would reasonably take only a couple of seconds.
The query you need would just be something like:
create procedure UpdateInstellingens as
insert into Instellingens (
INST_LOC_REF, INST_LOCNR, INST_REF, INST_TYPE, Opt_KalStandaard
)
select q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE, cast(0 as bit)
from InstellingGegevens q
left join Instellingens i
on q.INST_LOC_REF = i.INST_LOC_REF and q.INST_LOCNR = i.INST_LOCNR
and q.INST_REF = i.INST_REF and q.INST_TYPE = i.INST_TYPE
where i.INST_LOC_REF is null
You can run the procedure from a job in the SQL server, without involving any application at all, or you can use ADO.NET to execute the procedure from your timer.
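For the ADO.NET route, a rough sketch of calling the procedure from your existing timer callback (connection string and error handling omitted):
// Sketch (uses System.Data and System.Data.SqlClient):
// run the stored procedure from the timer's Elapsed handler.
void fixInstellingenTabel(object source, ElapsedEventArgs e)
{
    using (var conn = new SqlConnection("<connection string>"))
    using (var cmd = new SqlCommand("UpdateInstellingens", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}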
One way you could optimise this is by changing the Count(...) <= 0 into Any(). However, an even better optimisation would be to retrieve this information in a single query outside the loop:
var instellingens = _db.Instellingens
.Select(q => new { q.INST_LOC_REF, q.INST_LOCNR, q.INST_REF, q.INST_TYPE })
.Distinct()
.ToDictionary(q => q, q => true);
(On second thought, a HashSet would be most appropriate here, but there is unfortunately no ToHashSet() extension method. You can write one of your own if you like!)
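For instance, a minimal extension method along these lines would do:
// A minimal ToHashSet extension method.
public static class EnumerableExtensions
{
    public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
    {
        return new HashSet<T>(source);
    }
}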
And then inside your loop:
if (!instellingens.ContainsKey(new { i.INST_LOC_REF, i.INST_LOCNR,
    i.INST_REF, i.INST_TYPE })) {
    // There is no item yet. Create one.
    // ...
}
Then you can optimise the loop itself by making it lazy-retrieve:
// No need for the List<InstellingGegeven>
foreach (InstellingGegeven i in _db.InstellingGegevens) {
// ...
}
What Guffa said, but using LINQ here is not the best course if performance is what you are after. LINQ to SQL, like every other ORM, sacrifices performance for usability, which is usually a great tradeoff for typical application execution paths. On the other hand, SQL is very, very good at set-based operations, so that really is the way to fly here.