I'd like to transfer a large amount of data from SQL Server to MongoDB (around 80 million records) using a solution I wrote in C#.
I want to transfer, say, 200 000 records at a time, but my problem is keeping track of what has already been transferred. Normally I'd do it as follows:
Gather IDs from destination to exclude from source scope
Read from source (Excluding IDs already in destination)
Write to destination
Repeat
The problem is that I build a string in C# containing all the IDs that exist in the destination, for the purpose of excluding those from source selection, eg.
select * from source_table where id not in (<My large list of IDs>)
Now you can imagine what happens here once I have already inserted 600 000+ records and then build a string with all those IDs: it gets huge and slows things down even more. So I'm looking for a way to iterate through, say, 200 000 records at a time, like a cursor, but I have never done something like this before, so I'm here looking for advice.
Just as a reference, I do my reads as follows
using (SqlConnection conn = new SqlConnection(myConnStr))
{
    conn.Open();
    // The painful part: the WHERE clause is built from a huge string of IDs
    using (SqlCommand cmd = new SqlCommand("select * from mytable where id not in (" + bigListOfIDs + ")", conn))
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            //Populate objects for insertion into MongoDB
        }
    }
}
So basically, I want to know how to iterate through large amounts of data without selecting all that data in one go, or having to filter the data using large strings. Any help would be appreciated.
Need more rep to comment, but if you sort by your ID column you could change your WHERE clause to become
select * from source_table where @lastUsedId < id and id <= @lastUsedId + 200000
which will give you the range of 200 000 you asked for, and you only need to store a single integer instead of the whole list of IDs.
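A minimal sketch of that range approach in C# (a sketch only: it assumes an integer id column, reuses the question's myConnStr, and leaves the MongoDB mapping as a comment):
const int batchSize = 200000;

using (var conn = new SqlConnection(myConnStr))
{
    conn.Open();

    // Find the upper bound once so we know when to stop (assumes a numeric id column).
    long maxId;
    using (var maxCmd = new SqlCommand("select max(id) from source_table", conn))
        maxId = Convert.ToInt64(maxCmd.ExecuteScalar());

    for (long lastUsedId = 0; lastUsedId < maxId; lastUsedId += batchSize)
    {
        using (var cmd = new SqlCommand(
            "select * from source_table where id > @from and id <= @to order by id", conn))
        {
            cmd.Parameters.AddWithValue("@from", lastUsedId);
            cmd.Parameters.AddWithValue("@to", lastUsedId + batchSize);

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Populate objects for insertion into MongoDB here,
                    // then write the whole batch with the driver's InsertMany.
                }
            }
        }
    }
}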
There are many different ways of doing this, but I would suggest first that you don't try to reinvent the wheel but look at existing programs.
There are many programs designed to export and import data between different databases; some are very flexible and expensive, but others come with free options, and most DBMS products include something.
Option 1:
Use the SQL Server Management Studio (SSMS) export wizard.
This allows you to export to a variety of destinations, and you can even write complex queries if required. More information here:
https://www.mssqltips.com/sqlservertutorial/202/simple-way-to-export-data-from-sql-server/
Option 2:
Export your data in ascending ID order.
Store the last exported ID in a table.
Export the next set of data where ID > lastExportedID (a rough sketch follows below).
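For example, something along these lines (a sketch only; the connection string, table name and batch size are placeholders, and the ID column is assumed to be an integer):
const int batchSize = 200000;
long lastExportedId = 0;   // load this from your tracking table at start-up

while (true)
{
    int rowsInBatch = 0;

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "select top (@batch) * from source_table where id > @last order by id", conn))
    {
        cmd.Parameters.AddWithValue("@batch", batchSize);
        cmd.Parameters.AddWithValue("@last", lastExportedId);
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                rowsInBatch++;
                lastExportedId = Convert.ToInt64(reader["id"]);
                // map and buffer the row for export here
            }
        }
    }

    if (rowsInBatch == 0) break;   // nothing left to export
    // write the batch to the destination, then persist lastExportedId back to the tracking table
}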
Option 3:
Create a copy of your data in a back-up table.
Export from this table, and delete the records as you export them.
Going a bit old school today, trying to fix a bit of software running in a manufacturing plant that imports data from a CSV file into an Access database, which is then used to generate Crystal Reports!
The original CSV file is in RANK (1, 2, 3, etc.) order, but when it's imported using the following code it often arrives in the Access database out of order.
using (OleDbCommand accComm2 = new OleDbCommand(String.Format("SELECT TOP 100 PERCENT vCSV.* INTO {0} FROM [Text;FMT=Delimited;HDR=YES;IMEX=1;DATABASE={1}].[{2}] as vCSV ORDER BY vCSV.RANK", dbtable, Path.GetDirectoryName(fname), Path.GetFileName(fname)), accConn, accTran))
{
accComm2.ExecuteNonQuery();
}
This causes issues further down the line, as the quirky Crystal Reports manufacturing summaries that are created require the data to be in rank order.
Because the Crystal Report is grouped into batches, you can't do the sorting on the Crystal Reports end.
Does anyone have any ideas why the above wouldn't be working? We've even tried importing into a temporary table and then into the live table, effectively doing the sort twice, but it still sometimes fails!
You should sort in a subquery before inserting. Sorting while inserting has no effect in Access, besides the preview you can get while inserting.
SELECT vCSV.* INTO {0} FROM (SELECT TOP 100 PERCENT * FROM [Text;FMT=Delimited;HDR=YES;IMEX=1;DATABASE={1}].[{2}] ORDER BY RANK) as vCSV
Note that, strictly speaking, the resulting table has no index thus is a heap and has no defined sort order (there is no first/second/nth record, any ordering is equally valid). Practically, this will likely work, though.
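Applied to the command in the question, that would look roughly like this (a sketch using the same placeholders, dbtable, fname, accConn and accTran, as above):
string sql = String.Format(
    "SELECT vCSV.* INTO {0} FROM " +
    "(SELECT TOP 100 PERCENT * FROM [Text;FMT=Delimited;HDR=YES;IMEX=1;DATABASE={1}].[{2}] ORDER BY RANK) as vCSV",
    dbtable, Path.GetDirectoryName(fname), Path.GetFileName(fname));

using (OleDbCommand accComm2 = new OleDbCommand(sql, accConn, accTran))
{
    accComm2.ExecuteNonQuery();   // the ORDER BY now lives in the subquery, not on the INSERT itself
}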
I am sorry, I know this question has been asked many times, but I still have not found the best answer.
I am worried that the application will take a long time to download or filter the records. Assume I have a table called tbl_customer, and tbl_customer has more than 10,000 rows.
The first question: I am using a DataGridView to display the records. Would it be ideal to download all the records, up to 10,000 rows, into the DataGridView? Or had I better put a limit on the number of rows?
Second question: what is the best way to filter records in tbl_customer? Do we just need to query using SQL? Or use LINQ? Or maybe there is a better way?
For now, I only use this way:
DataTable dtCustomer = new DataTable();
using (SqlConnection conn = new SqlConnection(cs.connString))
using (SqlCommand cmd = new SqlCommand("SELECT customerName, customerAddress FROM tbl_customer WHERE customerAddress = @address ORDER BY customerName ASC;", conn))
{
    cmd.Parameters.AddWithValue("@address", addressValue);   // parameterised rather than concatenated
    using (SqlDataAdapter adap = new SqlDataAdapter(cmd))
    {
        adap.Fill(dtCustomer);
    }
}
dgvListCustomer.DataSource = dtCustomer;
Then I learned about LINQ, so I do it like this:
DataTable dtCustomer = new DataTable();
using (SqlConnection conn = new SqlConnection(cs.connString))
{
string query = "SELECT * FROM tbl_customer ORDER BY customerName ASC;";
using (SqlDataAdapter adap = new SqlDataAdapter(query, conn))
{
adap.Fill(dtCustomer);
}
}
var resultCustomer = (from row in dtCustomer.AsEnumerable()
                      where row.Field<string>("customerAddress") == addressValue
                      select new
                      {
                          customerName = row["customerName"].ToString(),
                          customerAddress = row["customerAddress"].ToString(),
                      }).ToList();   // materialise the result so the grid can bind to it
dgvListCustomer.DataSource = resultCustomer;
Is the workflow SQL > DataTable > LINQ > DataGridView suitable for filtering records? Any better suggestions are most welcome.
Thank you :)
I am worried that the application will take a long time to download or filter the records.
Welcome - you seem to live in a world like mine, where performance is measured in milliseconds, and yes, on a low-power server it will likely take more than a millisecond (0.001 seconds) to hot-load and filter 10,000 rows.
As such, my advice is not to put that database on a tablet or mobile phone but to use at least a decent desktop-class computer or VM for the database server.
As a hint: I regularly run queries against a billion-row table and it is fast. Anything below a million rows is a joke these days - in fact it was nothing worth mentioning when I started with databases more than 15 years ago. You are the guy asking whether it is better to have a Ferrari or a Porsche because you are concerned whether either of those cars goes faster than 20 km/h.
Would it be ideal to download all the records, up to 10,000 rows, into the DataGridView?
In order to get fired? Yes. Old rule with databases: never load more data than you have to, especially when you have no clue what you need. Forget the SQL side - you will get UI problems with 10,000 rows and more, especially usability issues.
Do we just need to query using SQL? Or use LINQ?
Hint: LINQ is also using SQL under the hood. The question is more: how much time do you want to spend writing boring, repetitive code for handwritten SQL like in your examples? Especially given that you also do "smart" things like referencing fields by name rather than ordinal, and asking for "select *" instead of a field list - both obvious beginner mistakes.
What you should definitely not do - but you do - is use a DataTable. Get a decent book about programming databases, and read the manual for LINQ - by which I am not sure what you mean: LINQ is a language feature for the compiler, and you need an implementation behind it, which could be NHibernate, Entity Framework, Linq2Sql or BLToolkit, to name just a FEW that turn a LINQ query into a SQL statement.
Is the workflow SQL > DataTable > LINQ > DataGridView suitable for filtering records?
A Ferrari is also suitable for transporting 20 tons of coal from A to B - it is just the worst possible car for the job. Your stack is likely the worst I have seen, but it is suitable in the sense that you CAN do it - slowly, with lots of memory use, but you will get a result, and hopefully fired. You pull the data from a high-performance database into a DataTable, then use a non-integrating technology (LINQ to Objects) to filter it, without any indexing, only to push it into yet another layer.
Just to give you an idea - this would get you removed from quite a few "beginning programming" courses.
What about: LINQ, period. Pull a filtered collection of business objects that goes straight to the UI. Period.
Read at least some of the sample code for the technologies you use.
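As a rough illustration of that shape only (it assumes Entity Framework as the LINQ provider; MyDbContext, the Customer entity and the 500-row cap are made-up names for the sketch):
using (var db = new MyDbContext())
{
    var customers = db.Customers
        .Where(c => c.CustomerAddress == addressValue)    // filtered on the server, not in memory
        .OrderBy(c => c.CustomerName)
        .Select(c => new { c.CustomerName, c.CustomerAddress })
        .Take(500)                                        // never push thousands of rows into a grid
        .ToList();

    dgvListCustomer.DataSource = customers;
}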
I'm creating a Windows application in which I need to get data from one table using ADO.NET (or any other way in C#, if there is one). The table apparently has around 100,000 records and it takes forever to download.
Is there any faster way to get the data?
I tried a DataReader, but it still isn't fast enough.
The data-reader API is about as direct as you can get. The important thing is: where is the time going?
is it bandwidth in transferring the data?
or is it in the fundamental query?
You can find out by running the query locally on the machine and seeing how long it takes. If bandwidth is your limit, then all you can really do is remove columns you don't actually need (don't do select *), or pay for a fatter pipe between you and the server. In some cases, querying the data locally and returning it in some compressed form might help - but then you're really talking about something like a web service, which has other bandwidth considerations.
More likely, though, the problem is the query itself. Often, things like:
writing sensible TSQL
adding an appropriate index
avoiding cursors, complex processing, etc.
make all the difference.
You might want to implement a need-to-know-basis method: only pull down the first chunk of data that is needed, and then, when the next set is needed, pull those rows.
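For instance, something like this paged fetch (a sketch only; it assumes SQL Server 2012+ for OFFSET/FETCH, and the table, column and connection-string names are placeholders):
public static DataTable GetPage(string connectionString, int pageIndex, int pageSize)
{
    var page = new DataTable();
    const string sql =
        "SELECT Col1, Col2, Col3 FROM dbo.YourTable " +
        "ORDER BY Id OFFSET @skip ROWS FETCH NEXT @take ROWS ONLY";

    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, con))
    {
        cmd.Parameters.AddWithValue("@skip", pageIndex * pageSize);
        cmd.Parameters.AddWithValue("@take", pageSize);

        using (var da = new SqlDataAdapter(cmd))
        {
            da.Fill(page);   // only pageSize rows cross the wire per call
        }
    }
    return page;
}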
It's probably your query that is slow, not the streaming process. You should show us your SQL query; then we could help you improve it.
Assuming you want to get all 100,000 records from your table, you could use a SqlDataAdapter to fill a DataTable or a SqlDataReader to fill a List<YourCustomClass>.
The DataTable approach (since I don't know your fields, it's difficult to show a class):
var table = new DataTable();
const string sql = "SELECT * FROM dbo.YourTable ORDER BY SomeColumn";
using(var con = new SqlConnection(Properties.Settings.Default.ConnectionString))
using(var da = new SqlDataAdapter(sql, con))
{
da.Fill(table);
}
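For completeness, the SqlDataReader variant might look roughly like this (YourCustomClass and the column names are stand-ins, since the real schema isn't shown):
var items = new List<YourCustomClass>();
const string sql = "SELECT Id, Name FROM dbo.YourTable ORDER BY SomeColumn";

using (var con = new SqlConnection(Properties.Settings.Default.ConnectionString))
using (var cmd = new SqlCommand(sql, con))
{
    con.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            items.Add(new YourCustomClass
            {
                Id = reader.GetInt32(0),     // read by ordinal to avoid repeated name lookups
                Name = reader.GetString(1)
            });
        }
    }
}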
We are pulling a huge amount of data from a SQL Server database. It has around 25,000 rows with 2,500 columns. The requirement is to read the data and export it to a spreadsheet, so pagination is not an option. When there are fewer records it is able to pull the data, but when it grows to the size I mentioned above it throws an exception.
public DataSet Exportexcel(string Username)
{
    Database db = DatabaseFactory.CreateDatabase(Config);
    DbCommand dbCommand = db.GetStoredProcCommand("Sp_ExportADExcel");
    db.AddInParameter(dbCommand, "#Username", DbType.String, Username);
    return db.ExecuteDataSet(dbCommand);
}
Please help me in resolving this issue.
The requirement is to read the data and export it to spread sheet, so
pagination is not a choice.
Why not read the data in steps? Instead of getting all the records at once, get a limited number of records each time and write them to Excel. Continue until you have processed all the records.
Your problem is purely down to the fact that you are trying to extract so much data in one go.
You may get around the problem by installing more memory in the machine doing the query, but this is just a bodge.
You would do best to retrieve such amounts of data in steps.
You could quite easily read the data back row by row and export/append it in CSV format to a file, and this could all even be done in a stored procedure.
You don't say what database you are using, but handling such large amounts of data is what database engines are designed to cope with.
Other than that, when handling large quantities of data in C# code it's best to look into using generic collections of lightweight types, as these avoid the boxing and per-object overhead of general-purpose containers and so reduce the memory footprint.
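As a sketch of that row-by-row export (the stored procedure name comes from the question; the connection string, parameter name, output path and the lack of CSV quoting/escaping are simplifications for illustration):
using (var con = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("Sp_ExportADExcel", con) { CommandType = CommandType.StoredProcedure })
using (var writer = new StreamWriter(@"C:\export\MyData.csv"))
{
    cmd.Parameters.AddWithValue("@Username", username);
    con.Open();

    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            var fields = new string[reader.FieldCount];
            for (int i = 0; i < reader.FieldCount; i++)
                fields[i] = Convert.ToString(reader[i]);

            writer.WriteLine(string.Join(",", fields));   // only one row held in memory at a time
        }
    }
}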
You can use batch-processing logic to fetch the records in batches, say 5,000 records per execution, and store the result in a temp DataSet; once all processing is done, dump the data from the temp DataSet to Excel.
You can use C# BulkCopy class for this purpose.
If it is enough to have the data available in Excel as CSV, you can use bulk copy (bcp):
bcp "select col1, col2, col3 from database.schema.SomeTable" queryout "c:\MyData.txt" -c -t"," -r"\n" -S ServerName -T
This is magnitudes faster and has a small footprint.
How can I do this? I have about 10,000 records in an Excel file and I want to insert all of them as fast as possible into an Access database.
Any suggestions?
What you can do is something like this:
Dim AccessConn As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=C:\Test Files\db1 XP.mdb")
AccessConn.Open()
Dim AccessCommand As New System.Data.OleDb.OleDbCommand("SELECT * INTO [ReportFile] FROM [Text;DATABASE=C:\Documents and Settings\...\My Documents\My Database\Text].[ReportFile.txt]", AccessConn)
AccessCommand.ExecuteNonQuery()
AccessConn.Close()
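The example above pulls from a text file; for an Excel workbook the same SELECT ... INTO trick might look roughly like this in C# (the workbook path, sheet name Sheet1$ and target table name are assumptions):
using (var conn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Test Files\db1 XP.mdb"))
{
    conn.Open();

    // Jet reads the workbook through its Excel ISAM driver and inserts everything in a single statement.
    string sql = "SELECT * INTO [ImportedData] " +
                 @"FROM [Excel 8.0;HDR=YES;DATABASE=C:\Test Files\Records.xls].[Sheet1$]";

    using (var cmd = new OleDbCommand(sql, conn))
    {
        cmd.ExecuteNonQuery();
    }
}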
Switch off the indexing on the affected tables before starting the load, and rebuild the indexes from scratch after the bulk load has finished. Rebuilding the indexes from scratch is faster than trying to keep them up to date while loading large amounts of data into a table.
If you choose to insert row by row, then you may want to consider using transactions: open a transaction, insert 1,000 records, commit the transaction, and repeat. This should work fine.
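A rough sketch of that batching pattern (the table, columns, connection string and the rowsFromExcel collection are all made up for illustration):
using (var conn = new OleDbConnection(accessConnectionString))
{
    conn.Open();
    OleDbTransaction tran = conn.BeginTransaction();
    int inBatch = 0;

    foreach (var row in rowsFromExcel)
    {
        using (var cmd = new OleDbCommand("INSERT INTO MyTable (Col1, Col2) VALUES (?, ?)", conn, tran))
        {
            cmd.Parameters.AddWithValue("?", row.Col1);
            cmd.Parameters.AddWithValue("?", row.Col2);
            cmd.ExecuteNonQuery();
        }

        if (++inBatch == 1000)
        {
            tran.Commit();                     // flush this batch
            tran = conn.BeginTransaction();    // start the next one
            inBatch = 0;
        }
    }

    tran.Commit();                             // commit any remaining rows
}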
Use the default data import features in Access. If that does not suit your needs and you want to use C#, use standard ADO.NET and simply write record-for-record. 10K records should not take too long.