I'm working on a pet project concerning efficiency and data, and this is the mission statement:
I have an Access DB with x tables, each containing anywhere from roughly 2,000 up to around 15,000 entries in the format [shipment][box][serial][name][region].
Now, I have another table (called Bluuur) with n entries. I'd like to compare this table (which contains serials) against all the tables in the Access DB and return the serial matches along with the name of the record that matched.
So the output should be something like:
Timmy 360 (for a comparison in which Timmy had 360 matches against Bluuur)
Note: I'm developing an application to do this.
I would use an OleDbConnection with a connection string like:
OleDbConnection connToFile = new OleDbConnection(String.Format(@"Provider=Microsoft.Jet.OLEDB.4.0; Data Source={0}; Extended Properties=""Excel 8.0;HDR=Yes"";", fileName));
A similar connection string works for MS Access. Then load both tables into memory and compare.
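A rough in-memory sketch of that idea (the shipments.mdb path, the Shipments1 table name, and the serial/name column names are all assumptions; the Jet provider also needs a 32-bit process, use the ACE provider for .accdb files):

// Requires: using System; using System.Collections.Generic; using System.Data.OleDb;
string connStr = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\shipments.mdb;";
var serials = new HashSet<string>();
var matchesPerName = new Dictionary<string, int>();

using (var conn = new OleDbConnection(connStr))
{
    conn.Open();

    // 1. Load all serials from Bluuur into memory.
    using (var cmd = new OleDbCommand("SELECT serial FROM Bluuur", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            serials.Add(reader.GetString(0));
    }

    // 2. Scan one shipment table and count matches per name.
    using (var cmd = new OleDbCommand("SELECT serial, name FROM Shipments1", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            if (serials.Contains(reader.GetString(0)))
            {
                string name = reader.GetString(1);
                int n;
                matchesPerName.TryGetValue(name, out n);
                matchesPerName[name] = n + 1;
            }
        }
    }
}

foreach (var pair in matchesPerName)
    Console.WriteLine("{0} {1}", pair.Key, pair.Value);   // e.g. "Timmy 360"

Repeat step 2 for each table in the database (the table list can be enumerated with conn.GetSchema("Tables") or supplied by the user).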
Update
OK, sorry that I didn't get your question initially.
The answer depends on the requirements:
Is the table in XLS static or dynamic? If it is static, import the XLS into MS Access and, if you just need the values for your own use, use the Query Editor to select and join the tables, e.g.: select a.Name, count(b.ID) from tableA a inner join tableB b on a.Name = b.Name group by a.Name
If the table in XLS is dynamic, but you still only need to work with it for your own purposes, create a linked data source in MS Access pointing to that XLS file and, once again, use the Query Editor to perform the selection.
If the purpose of all this is a web page/application which will connect to both MS Access and the XLS, join the data, and do so regularly, you have two potential solutions: do it in memory, or do it with a saved query and then use OleDbConnection/OleDbDataAdapter to load the data into the application and display it to the user. The in-memory approach may not be the best performance-wise, so write the MS Access query that joins and groups the data as you need, and use OleDbConnection to connect to the MS Access MDB file and execute the query. Or, if you need to do this for multiple tables in MS Access, write the query text yourself directly in the code, execute it for each join, and then sum the results.
If I understand correctly, the one table (which you need to compare against) is not in the MS Access DB. A quick solution seems to be: import the "Bluuur" table into the Access database (most probably this is possible with the Access import data wizard). Now you can use simple JOIN queries to test against all the other tables in the DB, as in the sketch below.
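A rough sketch of one such query run from the application (table and column names are assumptions):

// Requires: using System; using System.Data.OleDb;
string sql = @"SELECT t.name, COUNT(*) AS Matches
               FROM Shipments1 AS t INNER JOIN Bluuur AS b ON t.serial = b.serial
               GROUP BY t.name";

using (var conn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\shipments.mdb;"))
using (var cmd = new OleDbCommand(sql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            Console.WriteLine("{0} {1}", reader[0], reader[1]);   // e.g. "Timmy 360"
    }
}

Run the same query once per shipment table, swapping in the table name.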
Problem statement: The requirement is straightforward: we have a flat file (CSV, basically) which we need to load into one of the tables in a SQL Server database. The problem arises when we have to derive a new column (not present in the flat file) and populate it along with the rest of the columns from the file.
The derivation logic for the new column is: find the max date of "TransactionDate".
The entire exercise is to be performed in SSIS, and we were hoping to get it done using a Data Flow Task, but we are stuck on how to derive the new column and then add it to the destination flow.
Ideas:
Use a Data Flow Task to read the file and store it in a recordset, so that in the Control Flow we could use a Script Task to read it as a DataTable, use LINQ (or something similar) to determine the max value, and push it to another Data Flow to be consumed by the SQL table (but I guess this would require creating a table type in the database, which I would rather avoid)
Perform the entire operation in the Data Flow Task itself, which would require an asynchronous transformation (to get all the data and find the max value)
We are kind of out of ideas here; any lead would be much appreciated, and do let us know if any further information is required.
Run a Data Flow Task to insert the data into your destination table. Follow that up with an Execute SQL task that calculates MAX(TransactionDate) from the values in the table and updates the rows whose MaxTransactionDate is still NULL (or whatever other new-record indicator you use).
I'd like to transfer a large amount of data from SQL Server to MongoDB (Around 80 million records) using a solution I wrote in C#.
I want to transfer say 200 000 records at a time, but my problem is keeping track of what has already been transferred. Normally I'd do it as follows:
Gather IDs from destination to exclude from source scope
Read from source (Excluding IDs already in destination)
Write to destination
Repeat
The problem is that I build a string in C# containing all the IDs that exist in the destination, for the purpose of excluding those from the source selection, e.g.:
select * from source_table where id not in (<My large list of IDs>)
Now you can imagine what happens when I have already inserted 600 000+ records and then build a string with all those IDs: it gets huge and slows things down even more. So I'm looking for a way to iterate through, say, 200 000 records at a time, like a cursor, but I have never done anything like this before, so I'm here looking for advice.
Just as a reference, I do my reads as follows:
SqlConnection conn = new SqlConnection(myConnStr);
conn.Open();
SqlCommand cmd = new SqlCommand("select * from mytable where id not in (" + bigListOfIDs + ")", conn);
SqlDataReader reader = cmd.ExecuteReader();
if (reader.HasRows)
{
    while (reader.Read())
    {
        //Populate objects for insertion into MongoDB
    }
}
So basically, I want to know how to iterate through large amounts of data without selecting all that data in one go, or having to filter the data using large strings. Any help would be appreciated.
Need more rep to comment, but if you sort by your id column you could change your where clause to become:
select * from source_table where <lastUsedId> < id and id <= <lastUsedId> + 200000
which will give you the range of 200 000 you asked for, and you only need to store a single integer between batches.
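A rough sketch of that loop, reusing myConnStr from your snippet and assuming id is a numeric column (gaps in id just mean a batch returns fewer than 200 000 rows):

// Requires: using System; using System.Data.SqlClient;
const int batchSize = 200000;

long maxId;
using (var conn = new SqlConnection(myConnStr))
using (var cmd = new SqlCommand("select max(id) from source_table", conn))
{
    conn.Open();
    maxId = Convert.ToInt64(cmd.ExecuteScalar());
}

for (long lastUsedId = 0; lastUsedId < maxId; lastUsedId += batchSize)
{
    using (var conn = new SqlConnection(myConnStr))
    using (var cmd = new SqlCommand("select * from source_table where id > @from and id <= @to", conn))
    {
        cmd.Parameters.AddWithValue("@from", lastUsedId);
        cmd.Parameters.AddWithValue("@to", lastUsedId + batchSize);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // Populate objects for insertion into MongoDB, same as before,
                // but only for this window of ids.
            }
        }
    }
}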
There are many different ways of doing this, but I would suggest first that you don't try to reinvent the wheel; look at existing programs instead.
There are many programs designed to export and import data between different databases. Some are very flexible and expensive, but others have free options, and most DBMSs include something.
Option 1:
Use the SQL Server Management Studio (SSMS) Export Wizard.
This allows you to export to different destinations. You can even write complex queries if required. More information here:
https://www.mssqltips.com/sqlservertutorial/202/simple-way-to-export-data-from-sql-server/
Option 2:
Export your data in ascending ID order.
Store the last exported ID in a table.
Export the next set of data where ID > lastExportedID
Option 3:
Create a copy of your data in a back-up table.
Export from this table, and delete the records as you export them.
So I have a program that processes a lot of data; it pulls the data from the source and then saves it into the database. The only condition is that there cannot be duplicates based on a single name field, and there is no way to pull only the new data; I have to pull it all.
The way I have it working right now is that it pulls all the data from the source, then all of the data from the DB (~1.5M records), then compares them and sends only the new entries. However, this is not very good in terms of RAM, since it can use around 700 MB.
I looked into a way to let the DB handle more of the processing and found the extension method below on another question, but my concern with using it is that it might be even more inefficient, since it has to check 200 000 records one by one against all the records in the DB:
public static T AddIfNotExists<T>(this DbSet<T> dbSet, T entity, Expression<Func<T, bool>> predicate = null) where T : class, new()
{
    // Only add the entity if no row matching the predicate (or, without a predicate, no row at all) exists.
    var exists = predicate != null ? dbSet.Any(predicate) : dbSet.Any();
    return !exists ? dbSet.Add(entity) : null;
}
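For reference, a call site could look roughly like this (MyDbContext, Entries, and pulledEntries are hypothetical names). Note that every call still fires one Any() query against the database, which is exactly the per-record round trip you are worried about:

// Hypothetical context/set names; Name is the field that must stay unique.
using (var context = new MyDbContext())
{
    foreach (var entry in pulledEntries)   // the records pulled from the source
    {
        context.Entries.AddIfNotExists(entry, e => e.Name == entry.Name);
    }
    context.SaveChanges();
}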
So, any ideas as to how I might be able to handle this problem in an efficient manner?
1) Just a suggestion: because this task seems to be "data intensive", you could use a dedicated tool: SSIS. Create one package with one Data Flow Task, like this:
OLE DB Source (to read the data from the source)
Lookup transformation (to check whether each source value already exists in the target table; this needs indexing)
No-match output (source value doesn't exist), which then feeds into
OLE DB Destination (which inserts the source value into the target table).
I would use this approach if I had to compare/search on a small set of columns (these columns should be indexed).
2) Another suggestion is to simply insert the source data into a #TempTable and then:
INSERT dbo.TargetTable (...)
SELECT ...
FROM #TempTable
EXCEPT
SELECT ...
FROM dbo.TargetTable
I would use this approach if I had to compare/search on all columns.
You'll have to run your own tests to see which approach is right/fastest for you.
If you stick with EF, there are a few ways of improving your algorithm:
Download only the Name field from your source table (create a new DTO object that maps to the same table but contains only the name field you compare against). That will minimize memory usage.
Create and store a hash of the name field in the database, then download only the hashes to the client. For each input record: if its hash isn't there, just add it; if the hash is there, verify with dbSet.Any(..). With an MD5 hash that is only 16 bytes * 1.5M records ~ 23 MB; with CRC32 it is 4 bytes * 1.5M records ~ 6 MB.
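A rough sketch of the hashing idea with MD5 (the NameHash column and the entity/collection names are assumptions; here the hash is stored as a Base64 string to keep the example short):

// Requires: using System; using System.Collections.Generic; using System.Linq;
//           using System.Security.Cryptography; using System.Text;
static string HashName(string name)
{
    using (var md5 = MD5.Create())
        return Convert.ToBase64String(md5.ComputeHash(Encoding.UTF8.GetBytes(name)));
}

// Download only the stored hashes, not the full rows.
var knownHashes = new HashSet<string>(context.Entries.Select(e => e.NameHash));

// Keep only the source records whose name hash is not in the database yet.
var newEntries = pulledEntries.Where(e => !knownHashes.Contains(HashName(e.Name))).ToList();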
The high memory usage is because all the data you add is kept in the DB context. If you work in chunks (let's say 4000 records) and then close the context, the memory will be released:
using (var context = new MyDBContext())
{
    context.DBSet.Add(...); // and no more than 4000 records here
    context.SaveChanges();
}
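A sketch of the chunked loop around that (the context/set names are placeholders; newEntries stands for the list of records that still need inserting):

// Requires: using System.Linq;
// Process the new entries 4000 at a time so each context (and its change
// tracker) stays small and is disposed after every batch.
const int chunkSize = 4000;
for (int i = 0; i < newEntries.Count; i += chunkSize)
{
    using (var context = new MyDBContext())
    {
        foreach (var entry in newEntries.Skip(i).Take(chunkSize))
            context.DBSet.Add(entry);
        context.SaveChanges();
    }
}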
A better solution is to use SQL statements, which can be executed via:
model.Database.ExecuteSqlCommand(
    @"INSERT target (name, value)
      SELECT @Name, @Value
      WHERE NOT EXISTS (SELECT name FROM target WHERE name = @Name);",
    new SqlParameter("@Name", ..),
    new SqlParameter("@Value", ..));
(Mind to set the parameter type for each SqlParameter.)
If the expected number of inserted values is high, you might benefit from using a table-valued parameter (System.Data.SqlDbType.Structured, assuming you use MS SQL Server) and executing a similar query in chunks, but it is a little more complex.
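A rough sketch of the table-valued parameter route (the dbo.NameValueType type name is an assumption; the table type has to be created on the server beforehand):

// Requires: using System.Data; using System.Data.SqlClient;
// Assumes a table type such as:
//   CREATE TYPE dbo.NameValueType AS TABLE (name nvarchar(100), value nvarchar(100));
var rows = new DataTable();
rows.Columns.Add("name", typeof(string));
rows.Columns.Add("value", typeof(string));
// ... fill rows with one chunk of source data ...

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(@"
    INSERT target (name, value)
    SELECT s.name, s.value
    FROM @rows AS s
    WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.name = s.name);", conn))
{
    var p = cmd.Parameters.Add("@rows", SqlDbType.Structured);
    p.TypeName = "dbo.NameValueType";
    p.Value = rows;
    conn.Open();
    cmd.ExecuteNonQuery();
}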
Depending on the SQL server, you could also load the data into a temporary table (e.g. from a CSV file, or with BULK INSERT on MS SQL) and then execute a similar query.
At the moment, I source my data from a SQL Server (2008) database. The current method is to use a DataTable, which is then passed around and used:
if (parameters != null)
{
    SqlDataAdapter _dataAdapter = new SqlDataAdapter(SqlQuery, CreateFORSConnection());
    foreach (var param in parameters)
    {
        _dataAdapter.SelectCommand.Parameters.AddWithValue(param.Name, param.Value);
    }
    DataTable ExtractedData = new DataTable(TableName);
    _dataAdapter.Fill(ExtractedData);
    return ExtractedData;
}
return null;
But now the user has said that we can also get data from txt files, which have the same structure as the tables in SQL Server. So if I have a table called 'Customer', then I have a CSV file for Customer with the same column structure. The first line in the CSV holds the column names, which match my tables.
Would it be possible to read the txt file into a DataTable and then run a SELECT on that DataTable somehow? Most of my queries are single-table queries:
SELECT * FROM Table WHERE Code = 111
There is, however, ONE case where I do a join. That may be a bit more tricky, but I can make a plan. If I can get the txt files into data tables first, I can work with that.
Using the above code, can I not just change the connection string to read from a CSV instead of SQL Server?
First, you'll need to read the CSV data into a DataTable. There are many CSV parsers out there, but since you prefer using ADO.NET, you can use the OleDb provider. See the following article, and the sketch after it:
http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser
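A rough sketch of that approach (the folder path is an assumption; the connection string points at the folder, and the file name acts as the table name):

// Requires: using System.Data; using System.Data.OleDb;
string folder = @"C:\data";   // folder containing Customer.csv
string connStr = string.Format(
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""text;HDR=Yes;FMT=Delimited"";",
    folder);

var customers = new DataTable("Customer");
using (var conn = new OleDbConnection(connStr))
using (var adapter = new OleDbDataAdapter("SELECT * FROM [Customer.csv]", conn))
{
    adapter.Fill(customers);   // column names come from the CSV header row
}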
Joining is a bit harder, since the two sets of data live in different places. But what you can do is get two DataTables (one from each source) and then use LINQ to join them; see the linked question and the sketch below.
Inner join of DataTables in C#
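For example, with two DataTables already filled, the join could look roughly like this (column names are assumptions; AsEnumerable/Field need a reference to System.Data.DataSetExtensions):

// Requires: using System; using System.Data; using System.Linq;
var joined = from c in customers.AsEnumerable()
             join o in orders.AsEnumerable()
                 on c.Field<int>("Code") equals o.Field<int>("CustomerCode")
             select new
             {
                 Code = c.Field<int>("Code"),
                 Name = c.Field<string>("Name"),
                 OrderDate = o.Field<DateTime>("OrderDate")
             };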
You could read the text file into a List<string> (if there is just 1 column per file), and then use LINQ to query the list. For example:
var result = from entry in myList
             where entry == "111"
             select entry;
Of course, this example is kind of useless since all you get back is the same string you are searching for. But if there are multiple columns in the file, and they match the columns in your DataTable, why not read the file into the data table, and then use LINQ to query the table?
Here is a simple tutorial about how to use LINQ to query a DataTable:
http://blogs.msdn.com/b/adonet/archive/2007/01/26/querying-datasets-introduction-to-linq-to-dataset.aspx
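For the single-table query from your question, the LINQ-to-DataSet equivalent is roughly this (assuming the DataTable is called customers and Code is an integer column; Field<T> needs System.Data.DataSetExtensions):

// Equivalent of: SELECT * FROM Table WHERE Code = 111
var rows = from r in customers.AsEnumerable()
           where r.Field<int>("Code") == 111
           select r;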
I have been tasked to write a module for importing data into a client's system.
I thought to break the process into 4 parts:
1. Connect to the data source (SQL, Excel, Access, CSV, ActiveDirectory, Sharepoint and Oracle) - DONE
2. Get the available tables/data groups from the source - DONE
i. Get the available fields form the selected table/data group - DONE
ii. Get all data from the selected fields - DONE
3. Transform data to the user's requirements
4. Write the transformed data to the MSSQL target
I am trying to plan how to handle complex data transformations like:
Get column A from Table tblA, inner joined to column FA from table tblB, and concatenate these two with a semicolon in between.
OR
Get column C from table tblC on the source where column tblC.D is not in column G of table tblG on the target database.
My worry is not the visual side, but how to represent such an operation in code.
I am NOT asking for sample code, but rather for some creative ideas.
The data transformations will not be defined as free text, but as drag-and-drop objects that represent actions.
I am a bit lost, and need some fresh input.
Maybe you can grab some ideas from this open-source project: Rhino ETL.
See my answer: Manipulate values in a datatable?