Can BulkWrite operation be used for multiple collections? [duplicate] - c#

I see that Mongo has bulk insert, but I don't see any way to do bulk inserts across multiple collections.
Since I can't find it anywhere, I'm assuming it's not available in Mongo.
Is there a specific reason for that?

You are correct in that the bulk API operates on single collections only.
There is no specific reason, but the APIs are in general collection-scoped, so a "cross-collection bulk insert" would be a design deviation.
You can of course set up multiple bulk API objects in a program, each on a different collection, as in the sketch below. Keep in mind that while this wouldn't be transactional (in the start-transaction/commit/rollback sense), neither is bulk insert.
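For illustration, something like this with the C# driver (the database and collection names are just made up for the example); each collection gets its own BulkWrite call:

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var db = client.GetDatabase("shop");   // hypothetical database

var orders = db.GetCollection<BsonDocument>("orders");
var customers = db.GetCollection<BsonDocument>("customers");

// One BulkWrite per collection; there is no single call that spans both.
orders.BulkWrite(new WriteModel<BsonDocument>[]
{
    new InsertOneModel<BsonDocument>(new BsonDocument("item", "abc")),
    new InsertOneModel<BsonDocument>(new BsonDocument("item", "def"))
});

customers.BulkWrite(new WriteModel<BsonDocument>[]
{
    new InsertOneModel<BsonDocument>(new BsonDocument("name", "Alice"))
});
```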

Related

Fast Loading/Storing of large tree/graph-ish data structures using conventional database (long)

Requirements:
Let's have a conventional MySQL database server.
Let's have a C# .NET app using the MySQL connector.
Let's have a set of tables (designed to fit the complex tree/graph-like data structures and their relations).
The data structures can be fairly large and can contain many (hundreds to thousands) blob items (10 to 100 KiB per item).
Allow loading/storing of data structures from/to the database as fast as possible.
Let's have three freely convertible representations of any data structure (sub)item: XML, C# object in memory, SQL.
Current solution:
It emerged through a sort of evolution, which means no deep knowledge of applicable/typical methods was present (and, as usual, the original requirements were not so demanding).
Each data structure (sub)item implements a custom ISqlSerializable interface with (among others) ReadSql(...) and WriteSql(...) methods (inspired by IXmlSerializable, which also has to be implemented to meet the XML serialization requirement).
A custom (de)serializer calls these methods for each (root) data structure.
These calls emit SQL commands for reading/writing the data structure itself, followed by serialization requests for its children (if any) - the same approach IXmlSerializable uses.
This sequence emits the SQL code for the whole tree/graph regardless of which item is the root (you can fully de/serialize any data structure by calling ISqlSerializable on it - the root can be almost any structure implementing ISqlSerializable).
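To make the pattern concrete, here is a rough sketch of what such an interface and implementation might look like (the real signatures aren't shown in the question, so these are assumptions); it also shows why the per-item SQL emission turns into N round trips:

```csharp
using System.Collections.Generic;
using MySql.Data.MySqlClient;

// Hypothetical shape of the custom interface; the real one also has ReadSql(...).
public interface ISqlSerializable
{
    void WriteSql(MySqlConnection connection);
}

public class TreeNode : ISqlSerializable
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<TreeNode> Children { get; } = new List<TreeNode>();

    public void WriteSql(MySqlConnection connection)
    {
        // One INSERT for this node...
        using (var cmd = new MySqlCommand(
            "INSERT INTO nodes (id, name) VALUES (@id, @name)", connection))
        {
            cmd.Parameters.AddWithValue("@id", Id);
            cmd.Parameters.AddWithValue("@name", Name);
            cmd.ExecuteNonQuery();
        }

        // ...then each child handles itself, so a tree of N nodes
        // costs N round trips to the server.
        foreach (var child in Children)
            child.WriteSql(connection);
    }
}
```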
Problems:
This approach is terribly slow, because e.g. reading/writing N adjacent objects from/to a single table means N SELECT/INSERT commands instead of possibly a single efficient one.
A sort of caching has been introduced (for loading, so far) to speed up the process:
A preselected root structure prefetches all the tables/rows/columns via DataAdapters/DataSets and complex SQL commands (roughly as in the sketch below).
All the (sub)structures then read themselves from this cache and do not emit any SQL to the server.
This dramatically improved the speed, but it is still a lot of hardwired SQL-emitting code for a few preselected root structures.
The responsibility for loading/storing the children's data now lies with the parent, which needs to know a wider context ("what exactly is the full rest of me") than in the previous case ("handle myself, then let the children handle themselves").
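Roughly, the prefetch looks something like this - a sketch only, with invented table and column names, assuming MySQL Connector/NET's DataAdapter API:

```csharp
using System.Data;
using MySql.Data.MySqlClient;

static DataSet PrefetchTree(string connectionString, int treeId)
{
    // Invented tables/columns; the point is one round trip per table
    // instead of one per row.
    var cache = new DataSet();
    using (var connection = new MySqlConnection(connectionString))
    {
        connection.Open();

        var nodes = new MySqlDataAdapter(
            "SELECT id, parent_id, name FROM nodes WHERE tree_id = @tree", connection);
        nodes.SelectCommand.Parameters.AddWithValue("@tree", treeId);
        nodes.Fill(cache, "nodes");

        var blobs = new MySqlDataAdapter(
            "SELECT node_id, blob_data FROM node_blobs WHERE tree_id = @tree", connection);
        blobs.SelectCommand.Parameters.AddWithValue("@tree", treeId);
        blobs.Fill(cache, "node_blobs");
    }

    // The (sub)structures then read themselves from cache.Tables["nodes"] and
    // cache.Tables["node_blobs"] without emitting any further SQL.
    return cache;
}
```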
Questions:
What is the typical method for solving such a task? I mean, what would people who do this kind of task "every day" use? I don't suspect this is a particularly special scenario...
A nice candidate seems to be the stored procedure approach:
SQL/app code isolation.
Possibility to update/tune the SQL code with no/little impact on the app code.
Probably better efficiency when everything SQL-related runs on the server side.
Is that the best way (before we rebuild the whole app)?
Is there a standard way to create a stored-procedure-driven conversion that accepts/generates multi-table DataSets and writes/reads multiple tables to/from a MySQL database (something more practical than a simple 'hello world' example that would perform super fast even on a 386)?
Please note: I'm not expecting fully working copy/paste source code in the first answer, just general ideas/thoughts/pointers from developers more experienced in this particular area, considering the current state and possible future improvements. I hope I'll manage the rest. Thanks!

Inserting/updating huge amount of rows into SQL Server with C#

I have to parse a big XML file and import (insert/update) its data into various tables with foreign key constraints.
So my first thought was: create a list of SQL insert/update statements and execute them all at once using SqlCommand.ExecuteNonQuery().
Another method I found was one shown by AMissico, where I would execute the SQL commands one by one. No one complained, so I think it's also a viable practice.
Then I found out about SqlBulkCopy, but it seems that I would have to create a DataTable with the data I want to upload - so one SqlBulkCopy per table. For this I could create a DataSet.
I think every option supports SqlTransaction. It's approximately 100 to 20,000 records per table.
Which option would you prefer and why?
You say that the XML is already in the database. First, decide whether you want to process it in C# or in T-SQL.
C#: You'll have to send all the data back and forth once, but C# is a far better language for complex logic. Depending on what you do, it can be orders of magnitude faster.
T-SQL: No need to copy data to the client, but you have to live with the capabilities and performance profile of T-SQL.
Depending on your case, one might be far faster than the other (it's not clear which).
If you want to compute in C#, use a single streaming SELECT to read the data and a single SqlBulkCopy to write it. If your writes are not insert-only, write to a temp table and execute as few DML statements as possible to update the target table(s) - ideally a single MERGE, roughly as in the sketch below.
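For the insert/update case, a rough sketch of that pattern (the staging table and column names are made up):

```csharp
using System.Data;
using System.Data.SqlClient;

// Sketch only: stage the rows with one SqlBulkCopy, then apply them with a single MERGE.
static void BulkUpsert(string connectionString, DataTable rows)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        using (var create = new SqlCommand(
            "CREATE TABLE #Staging (Id INT PRIMARY KEY, Name NVARCHAR(100))", connection))
        {
            create.ExecuteNonQuery();
        }

        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "#Staging";
            bulk.WriteToServer(rows);          // one streamed bulk insert
        }

        using (var merge = new SqlCommand(@"
            MERGE dbo.Target AS t
            USING #Staging AS s ON t.Id = s.Id
            WHEN MATCHED THEN UPDATE SET t.Name = s.Name
            WHEN NOT MATCHED THEN INSERT (Id, Name) VALUES (s.Id, s.Name);", connection))
        {
            merge.ExecuteNonQuery();           // one DML statement for all rows
        }
    }
}
```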
If you want to stay in T-SQL minimize the number of statements executed. Use set-based logic.
All of this is simplified/shortened. I left out many considerations because they would be too long for a Stack Overflow answer. Be aware that the best strategy depends on many factors. You can ask follow-up questions in the comments.
Don't do it from C# unless you have to; it's a huge overhead, and SQL can do it so much faster and better by itself.
Insert into the table from the XML file using INSERT INTO ... SELECT, for example along the lines sketched below.
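A sketch only, with an invented file path, element names and target table; the whole import is a single set-based statement that the application merely triggers:

```csharp
using System.Data.SqlClient;

// Sketch only: path, element names and target table are invented.
const string importSql = @"
    INSERT INTO dbo.Orders (OrderId, CustomerName)
    SELECT x.value('(OrderId/text())[1]',      'int'),
           x.value('(CustomerName/text())[1]', 'nvarchar(100)')
    FROM (SELECT CAST(BulkColumn AS XML) AS Doc
          FROM OPENROWSET(BULK 'C:\data\orders.xml', SINGLE_BLOB) AS b) AS src
    CROSS APPLY src.Doc.nodes('/Orders/Order') AS t(x);";

using (var connection = new SqlConnection("Server=.;Database=Shop;Integrated Security=true"))
using (var command = new SqlCommand(importSql, connection))
{
    connection.Open();
    command.ExecuteNonQuery();   // the server does all the work in one statement
}
```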

How entity framework works for large number of records? [closed]

I have already seen an unanswered question on this here.
My question is -
Is EF really production ready for large application?
The question originated from these underlying questions -
EF pulls all the records into memory and then performs the query operation. How would EF behave when a table has around ~1000 records?
For a simple edit I have to pull the record, edit it, and then push it to the database using SaveChanges().
I faced a similar situation where we had a large database with many tables of 7-10 million records each. We used Entity Framework to display the data. To get decent performance, here's what I learned - my 10 golden rules for Entity Framework:
1. Understand that a call to the database is made only when the actual records are required; all the other operations just build up the query (SQL). So try to fetch only the slice of data you need rather than requesting a large number of records. Trim the fetch size as much as possible.
2. Yes, you should use stored procedures where necessary (in some cases stored procedures are the better choice; they are not as evil as some make you believe). Import them into your model and create function imports for them. You can also call them directly via ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, but EF has a really odd way of calling scalar functions: "SELECT dbo.blah(@id)".
3. EF performs slower when it has to populate an entity with a deep hierarchy, so be extremely careful with deeply nested entities.
4. Sometimes, when you are requesting records and are not required to modify them, you should tell EF not to track property changes (AutoDetectChanges); that way record retrieval is much faster (see the sketch after this list).
5. Indexing your database is always good, but with EF it becomes very important. The columns you use for retrieval and sorting should be properly indexed.
6. When your model is large, the VS2010/VS2012 model designer gets really unwieldy, so break your model into medium-sized models. There is a limitation that entities from different models cannot be shared, even though they may point to the same table in the database.
7. When you have to make changes to the same entity in different places, use the same entity instance, make all the changes, and save it only once. The point is to AVOID retrieving the same record, changing it and saving it multiple times (a real performance gain).
8. When you need the data from only one or two columns, try not to fetch the full entity. You can either execute your SQL directly or use a mini entity/projection. You may also need to cache some frequently used data in your application.
9. Transactions are slow; be careful with them.
10. SQL Profiler (or any query profiler) is your friend. Run it while developing your application to see what EF sends to the database. When you perform a join using LINQ or a lambda expression in your application, EF often generates a Select-Where-In-Select style query which may not always perform well. If you find such a case, roll up your sleeves, perform the join on the DB side and have EF retrieve the results. (I forgot this one earlier - the most important one!)
If you keep these things in mind, EF should give you almost the same performance as plain ADO.NET, if not the same.
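To make a few of the rules concrete, here is a small sketch (the context and entity names are made up; it illustrates rules 1, 4 and 8 using the EF6 API):

```csharp
using System;
using System.Data.Entity;   // EF6; provides AsNoTracking()
using System.Linq;

// Hypothetical DbContext and entity: trim the fetch, skip change tracking
// for read-only work, and project instead of materializing full entities.
using (var db = new ShopContext())
{
    db.Configuration.AutoDetectChangesEnabled = false;   // read-only, no tracking overhead

    var cutoff = DateTime.UtcNow.AddDays(-30);
    var recentOrderSummaries = db.Orders
        .AsNoTracking()
        .Where(o => o.CreatedOn >= cutoff)                // filter on the server
        .Select(o => new { o.Id, o.Total })               // only the two columns we need
        .Take(100)
        .ToList();
}
```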
1. EF pulls all the records into memory then performs the query operation. How would EF behave when a table has around ~1000 records?
That's not true! EF fetches only the necessary records, and queries are transformed into proper SQL statements. EF can cache objects locally within a DataContext (and track all changes made to entities), but as long as you follow the rule of keeping the context open only when needed, it won't be a problem.
2. For a simple edit I have to pull the record, edit it and then push it to the db using SaveChanges()
It's true, but I would not bother doing anything about it unless you really see performance problems. Because point 1 is not true, you'll only fetch one record from the DB before it's saved. You can bypass even that by building the SQL query as a string and sending it as a plain string.
EF translates your LINQ query into an SQL query, so it doesn't pull all records into memory. The generated SQL might not always be the most efficient, but a thousand records won't be a problem at all.
Yes, that's one way of doing it (assuming you only want to edit one record). If you are changing several records, you can get them all with one query, and SaveChanges() will persist all of those changes (see the sketch below).
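A minimal sketch of that pattern (the context and entity are hypothetical): one query pulls every affected record, the edits are tracked in memory, and a single SaveChanges persists them all.

```csharp
using System.Linq;

using (var db = new ShopContext())
{
    var discontinued = db.Products
        .Where(p => p.Discontinued)
        .ToList();                  // single SELECT for all affected rows

    foreach (var product in discontinued)
        product.Price = 0m;         // changes tracked by the context

    db.SaveChanges();               // persisted in one go
}
```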
EF is not a bad ORM framework. It is a different one, with its own characteristics. Compare Microsoft Entity Framework 6 against, say, NetTiers, which is powered by Microsoft Enterprise Library 6.
These are two entirely different beasts. The accepted answer is really good because it goes through the nuances of EF6. What's key to understand is that each ORM has its own strengths and weaknesses. Compare the project requirements and its data access patterns against the ORM's behavior patterns.
For example: NetTiers will always give you higher raw performance than EF6, but that is primarily because it is not a point-and-click ORM; as part and parcel of generating the ORM you will be optimizing your data model, adding custom stored procedures where relevant, and so on. If you engineered your data model with the same effort for EF6, you could probably get close to the same performance.
Also consider whether you can modify the ORM. For example, with NetTiers you can add extensions to the CodeSmith templates to include your own design patterns over and above what is generated by the base ORM library.
Also consider that EF6 makes significant use of reflection, whereas NetTiers (or any library powered by Microsoft Enterprise Library) will make heavy use of generics instead. These are two entirely different approaches: EF6 is based on dynamic reflection, whereas NetTiers is based on static reflection. Which is faster and which is better depends entirely on the usage patterns required of the ORM.
Sometimes a hybrid approach works better: consider, for example, EF6 for Web API OData endpoints, a few large tables wrapped with NetTiers and Microsoft Enterprise Library with custom stored procedures, and a few large master-data tables wrapped with a custom-built write-through object cache where, on initial load, the record set is streamed into the memory cache using an ADO data reader.
These are all different, and they all have their best-fit scenarios: EF6, NetTiers, NHibernate, Wilson OR Mapper, XPO from DevExpress, etc.
There is no simple answer to your question. The main thing is what you want to do with your data, and whether you really need that much data at one time.
EF translates your queries to SQL, so at that point there are no objects in memory. When you fetch the data, the selected records are loaded into memory. If you are selecting a large number of large objects, that can be a performance killer if you need to manipulate them all.
If you don't need to manipulate them all, you can disable change tracking and enable it later for the single objects you do need to manipulate.
So you see, it depends on the type of your application.
If you need to manipulate a mass of data efficiently, then don't use an OR mapper!
Otherwise EF is fine, but consider how many objects you really need at one time and what you want to do with them.

How can I bulk upload and transform relational data to aggregates in RavenDB?

I'm trying to get my head around how to do efficient bulk inserts of relational data into RavenDB, particularly where converting from relational data to aggregates.
Let's say we have two dump files of two tables: Orders and OrderItems. They're too big to load into memory, so I read them as streams. I can read through each table and create a document in RavenDB corresponding to each row. I can do this as bulk operations using batched requests. Easy and efficient so far.
Then I want to transform this on the server, getting rid of the OrderItems and integrating them into their parent Order documents. How can I do this without thousands of round trips?
The answer seems to lie somewhere between set-based updates, live projections and denormalized updates, but I don't know where.
You're going to need to do this with denormalised updates and set-based updates. Take a look at the PATCH API to see what it offers. You only need set-based updates if you plan on updating several docs at once; you can patch against a known doc directly using the PATCH API.
Live projections will only help you when you are getting the results of a query/index; they don't change the docs themselves, only what is returned from the server to the client.
However, I'd recommend that, if possible, you combine an Order and the corresponding OrderItems in memory before you send them to RavenDB. You can still stream the data from the dump files, just with some buffering if needed. This will be the simplest option (a rough sketch follows).
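For instance, if both dump files are (or can be) sorted by order id, the two streams can be merge-joined without holding more than one order in memory. This is only a sketch: the Read* helpers and the Order/OrderItem types are hypothetical, and it assumes the client's BulkInsert API.

```csharp
using Raven.Client.Document;

var store = new DocumentStore { Url = "http://localhost:8080", DefaultDatabase = "Shop" };
store.Initialize();

using (var bulkInsert = store.BulkInsert())
using (var items = ReadOrderItemsSortedByOrderId().GetEnumerator())   // hypothetical stream over the dump
{
    var hasItem = items.MoveNext();
    foreach (var order in ReadOrdersSortedById())                     // hypothetical stream over the dump
    {
        // Attach every item belonging to this order before storing the aggregate.
        while (hasItem && items.Current.OrderId == order.Id)
        {
            order.Items.Add(items.Current);
            hasItem = items.MoveNext();
        }
        bulkInsert.Store(order);
    }
}
```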
Updated
I've made some sample code that shows how to do this. It patches the Comments array/list within a particular Post doc, in this case "Posts/1".
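The sample code itself isn't reproduced here, but the patch it describes would look roughly like this, assuming the older RavenDB client's scripted patch API (the comment contents are invented):

```csharp
using System.Collections.Generic;
using Raven.Abstractions.Data;

// store is an initialized DocumentStore, as in the earlier sketch.
store.DatabaseCommands.Patch("Posts/1", new ScriptedPatchRequest
{
    // Push a new entry onto the Comments array of the Posts/1 document.
    Script = "this.Comments.push(newComment);",
    Values = new Dictionary<string, object>
    {
        { "newComment", new { Author = "someone", Text = "Nice post!" } }
    }
});
```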

Design Strategy: Query and Update data across 2 different databases

We have a requirement in which we need to query data across 2 different databases (1 in SQL Server and the other in Oracle).
Here are the scenarios which need to be implemented:
Query: Get the data from one database and match for values in other
Update: Get the data from one database and update the objects in other
Technology that we are using: ASP.net, C#
The options that we have thought about:
Staging area in one database
Linked Server (can't go with this approach as it is not allowed due to an organization-wide policy)
Create web services
Create 2 different DALs and perform list operations on the data from the 2 sources in the DAL
I would like to know what the best design strategy is for dealing with this kind of scenario, and what the pros and cons of that approach are.
Is it not possible to use an SSIS package to do the data transformation between the 2 servers and invoke it either via the ASP.NET/C# project or via a scheduled job run on demand?
Will the results from one of the databases be small enough to efficiently pass around?
If so, I would suggest treating the databases as two independent datasources.
If the datasets are large, then you may have to consider some form of ETL into a staging area on one of the databases. You may have issues if you need the queries to return up-to-date data from both databases, because then you would need real-time ETL.
There is an article here about performing distributed transactions between Microsoft SQL server and Oracle:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1054237.html
I don't know how well this works, however if it does work, this will probably be the best solution for you:
It will almost certainly be the fastest method of querying across multiple database servers.
It should also allow for true transactional support even when writing to both databases.
The best strategy for this would be to use a Linked Server, as it is designed for querying and writing to heterogeneous databases as you described above. But due to the policy constraint you mentioned, this is obviously not an option.
Therefore, to achieve the result you want with the best performance, here is what I suggest:
Decide which database contains only the lookup data (the minimal dataset) that you will query to pull the info out
Insert the lookup data using bulk copy into a temp/dummy table in the main database (the one containing most of the data that you will want to retrieve and return to the caller)
Use a stored procedure or query to join the temp table with other tables in your main database to retrieve the desired dataset (a sketch of these steps follows below)
The decision whether to write this as a web service or not isn't going to change the data retrieval process, but consideration should be given to reducing the overhead of data transfer time by keeping the process as close to your DB server as possible, either on the same machine or within a LAN/high-speed connection.
Data updates will be quite straightforward: just the standard two-phase operation of pulling the data out of one database and updating the other.
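A rough sketch of those three steps (assuming the managed Oracle provider; all table, column and connection details are invented):

```csharp
using System.Data;
using System.Data.SqlClient;
using Oracle.ManagedDataAccess.Client;   // assumed Oracle provider

// Pull the small lookup set from Oracle, bulk copy it into a temp table on the
// SQL Server connection, then do the join there.
static DataTable GetMatchedRows(string oracleConnStr, string sqlConnStr)
{
    using (var sql = new SqlConnection(sqlConnStr))
    {
        sql.Open();
        using (var create = new SqlCommand("CREATE TABLE #Lookup (Id INT PRIMARY KEY)", sql))
            create.ExecuteNonQuery();

        using (var oracle = new OracleConnection(oracleConnStr))
        using (var cmd = new OracleCommand("SELECT id FROM lookup_source", oracle))
        {
            oracle.Open();
            using (var reader = cmd.ExecuteReader())
            using (var bulk = new SqlBulkCopy(sql) { DestinationTableName = "#Lookup" })
            {
                bulk.WriteToServer(reader);   // stream Oracle rows straight into the temp table
            }
        }

        var result = new DataTable();
        new SqlDataAdapter(
            "SELECT m.* FROM dbo.MainData m JOIN #Lookup l ON l.Id = m.LookupId", sql)
            .Fill(result);
        return result;
    }
}
```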
It's hard to tell what the best solution is, but we have a scenario that's nearly the same.
Real-time data:
For real-time data updates we use web services, since in our case the two different databases belong to distinct projects. So every project offers a web service which can be used for data retrieval and data update. That has the advantage that a project does not have to care about database structure changes as long as the web service interface does not change.
Static data:
Static data (e.g. employees) is mirrored for faster access. For that large amount of data we use flat files for the nightly update.
For static data I think it's important to explicitly define data owners. For every piece of data it should be clear which database holds the original and which databases only hold shadow copies for faster access.
So static data is read-only in the shadow database, or only updatable through designated web services.
The problem with using multiple data sources in your .NET code is that you run the risk of having your CRUD ops fail ACID tests and having data inconsistencies.
I would be most inclined to pursue @Will A's comment on your question:
Set up replication to a remote server, then link the two remote servers.
Have multiple DALs and handle it in the application. Thousands of records is not a big number; you only need to worry if you get into the hundreds of thousands or millions, in which case your application will hang.
Use LINQ to perform data operations on the result sets rather than looping through them (see the sketch below).
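For example, a sketch with two hypothetical DALs, one per database, joined in memory with LINQ to Objects (the DAL types, entity properties and connection strings are all invented):

```csharp
using System;
using System.Linq;

// Each DAL returns a plain in-memory list from its own database;
// LINQ to Objects does the matching.
var sqlOrders = new OrderDal(sqlServerConnectionString).GetOpenOrders();      // SQL Server
var oracleCustomers = new CustomerDal(oracleConnectionString).GetCustomers(); // Oracle

var matched =
    from order in sqlOrders
    join customer in oracleCustomers on order.CustomerId equals customer.Id
    select new { order.Id, customer.Name, order.Total };

foreach (var row in matched)
    Console.WriteLine($"{row.Id}: {row.Name} - {row.Total}");
```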
