I have a rather long-winded question. I have recently been given a task by my employer to create a custom DBF-to-SQL migration utility in C# for 117 *.dbf files. Each table has approximately 40-50 columns and over 100 rows (e.g. a property management database).
How I am tackling it is the following:
Convert a dbf file into a DataTable object.
Add the DataTable object to a List<DataTable> which is contained in the model object.
Bind list to a DataGridView for column viewing purposes.
This is all done in a background worker which works fine.
The next thing I need to do is allow the user to convert and save this list into a very large *.sql file (or, optionally, migrate it directly to SQL Express). Again, I attempt to do this in a background thread.
This is where I run into problems. I have a method that accepts a DataTable object and returns a string. In it, a StringBuilder concatenates all the columns into a CREATE TABLE statement and then appends the associated INSERT statements to include the data.
This method is executed in a loop, passing in each DataTable from the List<DataTable> stored in the model object.
Now, this works fine up until about the fourth or fifth DataTable, at which point an OutOfMemoryException is thrown. I make sure to initialize and dispose of any objects I am no longer using. I have even gone as far as changing all my string concatenation to StringBuilder append logic to take advantage of the StringBuilder's better memory management.
I am pretty sure all my objects are deallocated and garbage collected, so I assume the problem lies in the fact that I am storing all 117 tables in a list on the model object. Whenever I need to access this list, I simply pass a reference to the model object. As soon as I start building SQL statements for all the tables, the combination of the DataTable list and the StringBuilder runs out of memory.
I neglected to mention that I am new to the industry, fresh out of college. I have been programming for many years, but only recently have I been following 'best practice'. So my question to all of you is: am I tackling this project the wrong way? Is there a better way to do it, and if so, could you shed some light on what you would do in my place?
Alright, I did what made sense: I simply wrote the SQL directly to the file rather than building it in a string and then writing that string to the file. This appears to have done the trick. Not sure why I didn't think of it earlier.
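For anyone hitting the same wall, here is a minimal sketch of that streaming approach, assuming each DataTable is converted much as before. The method name, the model.Tables list, the output path, and the blanket NVARCHAR(MAX) column type are all illustrative, not the actual implementation:

```csharp
using System;
using System.Data;
using System.IO;
using System.Linq;

static void WriteSqlScript(DataTable table, TextWriter writer)
{
    // CREATE TABLE built from the column schema (every column as NVARCHAR(MAX) to keep the sketch short).
    writer.WriteLine("CREATE TABLE [" + table.TableName + "] (");
    writer.WriteLine(string.Join(",\r\n",
        table.Columns.Cast<DataColumn>()
             .Select(c => "    [" + c.ColumnName + "] NVARCHAR(MAX)")));
    writer.WriteLine(");");

    // One INSERT per row, written straight to disk instead of being buffered in a StringBuilder.
    foreach (DataRow row in table.Rows)
    {
        writer.WriteLine("INSERT INTO [" + table.TableName + "] VALUES (" +
            string.Join(", ", row.ItemArray.Select(v =>
                "'" + Convert.ToString(v).Replace("'", "''") + "'")) + ");");
    }
}

// One StreamWriter for the whole dump, one call per DataTable:
// using (var writer = new StreamWriter("dump.sql"))
//     foreach (DataTable table in model.Tables)   // the List<DataTable> held by the model
//         WriteSqlScript(table, writer);
```

This way only the statement currently being written is ever held in memory, regardless of how many tables are in the list.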
I've hit a wall when it comes to adding a new entity object (a regular SQL table) to the Data Context using LINQ-to-SQL. This isn't about the drag-and-drop method that is cited regularly across many other threads; that method has worked for me repeatedly without issue.
The end goal is relatively simple. I need to find a way to add a table that gets created at runtime via a stored procedure to the current Data Context of the LINQ-to-SQL dbml file. I'll then need to be able to use the regular LINQ query methods/extension methods (InsertOnSubmit(), DeleteOnSubmit(), Where(), Contains(), FirstOrDefault(), etc...) on this new table object through the existing Data Context. Essentially, I need to find a way to procedurally create the code that would otherwise be automatically generated when you use the drag-and-drop method during development (when the application isn't running), but have it generate this same code while the application is running, via a command and/or event trigger.
More Detail
There's one table that gets used a lot and, over the course of an entire year, collects many thousands of rows. Each row contains a timestamp and this table needs to be divided into multiple tables based on the year that the row was added.
Current Solution (using one table)
Single table with tens of thousands of rows which are constantly queried against.
Table is added to Data Context during development using drag-and-drop, so there are no additional coding issues
Significant performance decrease over time
Goals (using multiple tables)
(Complete) While the application is running, use C# code to check if a table for the current year already exists. If it does, no action is taken. If not, a new table gets created using a stored procedure, with the current year as a prefix on the table name (2017_TableName, 2018_TableName, 2019_TableName, and so on...). A sketch of this step follows after this list.
(Incomplete) While the application is still running, add the newly created table to the active LINQ-to-SQL Data Context (the same code that would otherwise be added using drag-and-drop during development).
(Incomplete) Run regular LINQ queries against the newly added table.
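For reference, the completed first goal might look roughly like the following. The stored procedure name (dbo.CreateYearTable), its @TableName parameter, and connectionString are assumptions, not code from my actual project:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

string tableName = DateTime.Now.Year + "_TableName";

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Is there already a table for the current year?
    bool exists;
    using (var check = new SqlCommand(
        "SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = @name", conn))
    {
        check.Parameters.AddWithValue("@name", tableName);
        exists = (int)check.ExecuteScalar() > 0;
    }

    // If not, let the stored procedure create it.
    if (!exists)
    {
        using (var create = new SqlCommand("dbo.CreateYearTable", conn))
        {
            create.CommandType = CommandType.StoredProcedure;
            create.Parameters.AddWithValue("@TableName", tableName);
            create.ExecuteNonQuery();
        }
    }
}
```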
Final Thoughts
Other than the above, my only other concern is how to write the C# code that references a table that may or may not already exist. Is it possible to use a variable in place of the standard 'DB_DataContext.2019_TableName' methodology in order to actually get the table's data into a UI control? Is there a way to simply create an Enumerable of all the tables where the name is prefixed with a year and then select the most current table?
From what I've read so far, the most likely solution seems to involve the use of a SQL add-on like SQLMetal or Huagati, which (based solely on what I've read) will generate the code I need at runtime and update the corresponding dbml file. I have no experience using these types of add-ons, so any additional insight into them would be appreciated.
Lastly, I've seen some references to LINQ-to-Entities and/or LINQ-to-Objects. Would these be the components I'm looking for?
Thanks for reading through a rather lengthy first post. Any comments/criticisms are welcome.
The simplest way to achieve what you want is to redirect in SQL Server and leave your client code alone. At design time, create your L2S DataContext or EF DbContext referencing a database with only a single table. Then at run time, substitute a view or synonym for that table that points to the "current year" table.
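A rough sketch of what that run-time substitution could look like from C#, assuming the Data Context was generated against an object named dbo.TableName, the yearly tables follow the 2019_TableName naming from the question, and the original design-time table has been renamed or dropped so that name is free (all object names are placeholders):

```csharp
using System;
using System.Data.SqlClient;

string year = DateTime.Now.Year.ToString();
string sql =
    "IF OBJECT_ID('dbo.TableName', 'SN') IS NOT NULL DROP SYNONYM dbo.TableName; " +
    "EXEC('CREATE SYNONYM dbo.TableName FOR dbo.[" + year + "_TableName]');";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();  // from now on, queries against TableName hit this year's physical table
}
```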
HOWEVER, this should not be necessary in the first place. SQL Server supports partitioning, so you can store the data in physically separate structures but keep a single logical table. SQL Server also supports columnstore tables, which can compress and store many millions of rows with excellent performance.
I have a very large number of rows (10 million) that I need to select out of a SQL Server table. I will go through each record, parse it (they are XML), and then write each one back to a database via a stored procedure.
The question I have is, what's the most efficient way to do this?
The way I am doing it currently is to open two SqlConnections (one for reading, one for writing). The read connection uses a SqlDataReader that basically does a SELECT * FROM the table, and I loop through the result set. After I parse each record, I do an ExecuteNonQuery (using parameters) on the second connection.
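For clarity, a bare-bones sketch of that two-connection, row-by-row pattern; table, column, and procedure names are placeholders, and ParseXml stands in for the parsing step:

```csharp
using System.Data;
using System.Data.SqlClient;

using (var readConn = new SqlConnection(connectionString))
using (var writeConn = new SqlConnection(connectionString))
{
    readConn.Open();
    writeConn.Open();

    using (var select = new SqlCommand("SELECT Id, XmlPayload FROM dbo.SourceTable", readConn))
    using (var reader = select.ExecuteReader())
    {
        while (reader.Read())
        {
            string parsed = ParseXml(reader.GetString(1));   // placeholder for the XML parsing step

            using (var write = new SqlCommand("dbo.SaveParsedRow", writeConn))
            {
                write.CommandType = CommandType.StoredProcedure;
                write.Parameters.AddWithValue("@Id", reader.GetInt32(0));
                write.Parameters.AddWithValue("@Value", parsed);
                write.ExecuteNonQuery();   // one round trip per row
            }
        }
    }
}
```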
Are there any suggestions to make this more efficient, or is this just the way to do it?
Thanks
It seems that you are writing rows one-by-one. That is the slowest possible model. Write bigger batches.
There is no need for two connections when you use MARS. Unfortunately, MARS forces a 14-byte row versioning tag in each written row. That might be totally acceptable, or not.
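As one illustration of "bigger batches": buffer the parsed rows in a DataTable and flush them into a staging table with SqlBulkCopy every few thousand rows, replacing the per-row stored procedure call. The staging table, column names, and writeConnectionString are assumptions:

```csharp
using System.Data;
using System.Data.SqlClient;

// Buffer for parsed rows, shaped like the (assumed) staging table.
var buffer = new DataTable();
buffer.Columns.Add("Id", typeof(int));
buffer.Columns.Add("Value", typeof(string));

// Inside the existing reader loop:
//     buffer.Rows.Add(id, parsedValue);
//     if (buffer.Rows.Count >= 10000) FlushBatch(buffer, writeConnectionString);
// ...and call FlushBatch once more after the loop for the remainder.

static void FlushBatch(DataTable batch, string connectionString)
{
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.ParsedStaging";   // assumed staging table
        bulk.WriteToServer(batch);
    }
    batch.Clear();
}
```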
I had a very similar situation, and here is what I did:
I made two copies of the same database.
One is optimized for reading and the other for writing.
In config, I kept two connection strings: ConnectionRead and ConnectionWrite.
Now, in the data layer, when I have a read statement (SELECT...), I switch to the ConnectionRead connection string, and when writing I use the other one.
Since I have to keep both databases in sync, I use SQL replication for that job.
I understand the implementation depends on many aspects, but the approach may help you.
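A minimal sketch of that data-layer switch, assuming the two connection strings live in app.config under the names above (the Db class and its method names are illustrative):

```csharp
using System.Configuration;
using System.Data.SqlClient;

public static class Db
{
    private static readonly string ReadCs =
        ConfigurationManager.ConnectionStrings["ConnectionRead"].ConnectionString;
    private static readonly string WriteCs =
        ConfigurationManager.ConnectionStrings["ConnectionWrite"].ConnectionString;

    // SELECTs go to the read-optimized copy.
    public static SqlConnection OpenForRead()
    {
        var conn = new SqlConnection(ReadCs);
        conn.Open();
        return conn;
    }

    // INSERT/UPDATE/DELETE go to the write-optimized copy (replication keeps the two in sync).
    public static SqlConnection OpenForWrite()
    {
        var conn = new SqlConnection(WriteCs);
        conn.Open();
        return conn;
    }
}
```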
I agree with Tim Schmelter's post - I did something very similar. I used a SQLCLR procedure that read the data from an XML column in a SQL table into an in-memory table using .NET (System.Data), then used the .NET System.Xml namespace to deserialize the XML, populated another in-memory table (in the shape of the destination table), and used SqlBulkCopy to populate that destination SQL table with the parsed attributes I needed.
SQL Server is engineered for set-based operations. Whenever I'm shredding/iterating row by row, I tend to use SQLCLR, as .NET is generally better at iterative, data-manipulative processing. An exception to my rule is when working with a little metadata for data-driven processes or cleanup routines, where I may use a cursor.
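To illustrate the shape of that approach, here it is sketched as plain client-side .NET rather than inside a SQLCLR procedure; the table, column, and attribute names are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Xml;

var shredded = new DataTable();                       // shaped like the destination table
shredded.Columns.Add("Id", typeof(int));
shredded.Columns.Add("Name", typeof(string));

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Id, XmlData FROM dbo.Source", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Deserialize the XML column and pull out the attributes needed.
            var doc = new XmlDocument();
            doc.LoadXml(reader.GetString(1));
            shredded.Rows.Add(reader.GetInt32(0), doc.DocumentElement.GetAttribute("name"));
        }
    }
}

// One set-based load of everything that was parsed.
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Destination";
    bulk.WriteToServer(shredded);
}
```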
I need some help. I've been back and forth on which direction I should go, and of the options available, there are none I like or can use.
I wrote a generic data dump tool that pulls data from a specified server and dumps it to a comma-delimited file. Its configuration and the query to run come from a SQL table created specifically for this tool. However, I have a new requirement: some data dumps need data pulled from different servers and merged together, but I don't want to alter the tool for this "custom" type of pull/dump. I'm trying to keep it generic so I'm not constantly coding on it. My thought is to create a lib that my reporting tool can use for each of these custom pulls, where the data returned by the lib is a SqlDataReader object. However, since this lib will have to pull from different servers and merge the data, is it possible for the lib to create a SqlDataReader of its own from the pulled data and return it to the data dump tool, or am I overthinking this?
I don't want to return an array because that's not how the tool loops through data now; some of my existing data dumps are millions of rows, so the existing loop is a data reader loop to keep memory usage down. However, the libs may create a two-dimensional array internally, as long as it can be converted to a SqlDataReader object before returning. That way I don't have to change much of the looping within my application.
Hope that all makes sense. It has been bouncing around in my head, so I ended up rewriting this about 10 times.
Edit: Keep in mind that each record will be scattered across 3 servers and has to be merged. These are three different processes that work together but have their own servers. ID from server 1 will relate to Server1ID on Server2, for example.
All the ADO.NET data access classes implement common interfaces, so you can return an IDataReader instead of a SqlDataReader.
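For example, even if the lib merges everything into a DataTable internally, it can still hand back a reader through the common interface (GetMergedData and PullAndMergeFromAllServers are hypothetical names):

```csharp
using System.Data;

public IDataReader GetMergedData()
{
    DataTable merged = PullAndMergeFromAllServers();   // hypothetical merge step
    return merged.CreateDataReader();                  // a DbDataReader, which is also an IDataReader
}
```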
I've come up with a solution. I'll write the object, which will be dynamic based on the "custom" report to be generated. This object will pull data from the first server and insert it into a local SQL Server table. It will then go to the next server, pull data based on the first data pulled, and update the local table. Finally, it will hit the last server to pull the remaining data, which also needs to be merged into my local table. Once everything is merged correctly, I'll SELECT * back as the DataReader the original caller (the data dump exe) needs. This seems to be the only real way to make this work without modifying the original exe for each custom data pull.
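A sketch of the hand-off at the end of that process, assuming the merged rows land in a local table called dbo.MergedReportData (the method and table names are placeholders):

```csharp
using System.Data;
using System.Data.SqlClient;

public IDataReader GetCustomReportReader(string localConnectionString)
{
    var conn = new SqlConnection(localConnectionString);
    conn.Open();
    var cmd = new SqlCommand("SELECT * FROM dbo.MergedReportData", conn);

    // CloseConnection: when the dump tool disposes the reader, the connection closes too.
    return cmd.ExecuteReader(CommandBehavior.CloseConnection);
}
```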
Thanks for everyone's input.
EDIT: Solution (kind of)
So, what I did had very little in common with what I originally wanted to do, but my application now works much faster (DataSets that took upward of 15 minutes to process now go through in 30-40 seconds tops). Here's roughly what I did:
- Read spreadsheet & populate DataTable/DataSet normally
- [HACK WARNING] Instead of using UpdateDataSet, I generate my own SQL queries, mostly by having a skeleton string for each type of update (e.g. String skeleton = "UPDATE ... SET ... WHERE ..."). I then consult the template database and replace the placeholder ... with the appropriate entries.
- [MORE HACK WARNING] The way I dealt with errors was by manually checking whether those errors would occur. So if I know I am about to do an insert, I run an error-checking command before the actual insert; the error checker constructs a JOIN statement that checks whether any of the entries in the user's DataSet already exist in the database. Just by executing the JOIN command, I get back a DataSet with the results, so I know that if there is anything in it, those are the errors. Then I can proceed to print them. A sketch of this check follows below.
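A rough sketch of that pre-insert check, with a parameterised IN list standing in for the JOIN described above. Here uploadedTable is the DataTable built from the user's spreadsheet, and the table/column names and connectionString are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Text;

// Assumes the upload contains at least one row.
var keys = uploadedTable.Rows.Cast<DataRow>().Select(r => r["KeyColumn"]).ToList();

var errors = new DataTable();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand { Connection = conn })
{
    var sql = new StringBuilder("SELECT * FROM dbo.TargetTable WHERE KeyColumn IN (");
    for (int i = 0; i < keys.Count; i++)
    {
        if (i > 0) sql.Append(", ");
        sql.Append("@k").Append(i);
        cmd.Parameters.AddWithValue("@k" + i, keys[i]);
    }
    cmd.CommandText = sql.Append(")").ToString();

    using (var adapter = new SqlDataAdapter(cmd))
        adapter.Fill(errors);   // any rows returned already exist, i.e. they would violate the PK
}
```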
If anyone needs more details, I'll be happy to provide them. It's a fairly specific question, so I should probably keep this outline fairly high level.
Original Question
For (good) reasons outside of my control, I need to use the Database.UpdateDataSet() method from Microsoft's Enterprise Library. The way my project will work, I am letting the user make changes to the database (multiple databases, multiple schemas, multiple tables, but always only one at a time) by uploading Excel spreadsheets to a web application. The spreadsheets follow a design/template specified by me (usually). I am at the point where I read the spreadsheet, turn it into a DataTable/DataSet, and use (dynamically generated) prepared statements to make the appropriate changes to the database. Here's the problem:
Each spreadsheet only allows for one type of change (insert/update/delete). I want to make it so that if the user uploads an insert spreadsheet, but several (let's say 10) of the entries are already in the database, I not only return an error, but also tell them which entries (DataRows) violated the primary key constraint.
The ideal solution would be to get a DataSet with the list of errors back, but I don't see how I can do that. Perhaps there is a way to construct the prepared statements in such a way that if a DataRow is to be inserted (following the example from above), it proceeds normally; however, if it attempts to update or delete, it is skipped and added to an error collection of some sort?
Note that I am trying to avoid using stored procedures. Since the number of different templates will grow extremely quickly after deployment, it is important that I stay away from manually written code and as close to a database-driven model as possible.
I have a DataSet fetched from an ODBC data source (MySQL), and I need to temporarily put it into a SQL Server DB to be fetched by another process (and then transformed further).
Instead of creating a nicely formatted database table, I'd rather have one field of type text and whack the stuff in there.
This is only a data migration exercise, a one-off, so I don't care about elegance, performance, or any other aspects of it.
All I want is to be able to de-serialize the "text-blob" (or binary) back into anything resembling a dataset. A Dictionary object would do the trick too.
Any quick fixes for me? :)
Use DataSet.WriteXml to write out your "text-blob", then use DataSet.ReadXml later when you want to translate the "text-blob" back into a DataSet and perform whatever subsequent manipulations you want.
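A quick sketch of that round trip, where dataSet is the DataSet pulled from MySQL (the schema is included so the column types survive):

```csharp
using System.Data;
using System.IO;

// Serialize:
string blob;
using (var sw = new StringWriter())
{
    dataSet.WriteXml(sw, XmlWriteMode.WriteSchema);
    blob = sw.ToString();                       // store this in your text column
}

// ...later, in the other process:
var restored = new DataSet();
using (var sr = new StringReader(blob))         // blob read back from the text column
{
    restored.ReadXml(sr, XmlReadMode.ReadSchema);
}
```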
Why don't you do everything in an SSIS package: extract from MySQL, transform as you wish, and load it wherever you need it?