I need to get around 50k-100k records from a table. Two of the fields hold very long strings. Field1 is up to 2048 characters and field2 is up to 255.
Getting just these two fields for 50k rows takes around 120 seconds. Is there a way to use compression or somehow optimize the retrieval of this data? I'm using a data adapter to fill a data table.
Note: It's just a select statement, no where clause.
Simple answer: DON'T pull 50,000 to 100,000 rows. Period. Mass transfers always take time, and compression would put a lot of stress on the CPU. I have yet to come across a case where pulling that much data, outside of pure data transfers, is a worthwhile proposition - most of the time it is a sign of a bad architecture.
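If the rows genuinely have to come across, the cheapest change is usually to stream them with a SqlDataReader and process each row as it arrives, rather than buffering everything in a DataTable through the adapter. A rough sketch - the table name, column names, and the per-row handler are all made up for illustration:

using System.Data.SqlClient;

const string connectionString = "<your connection string>";

using var conn = new SqlConnection(connectionString);
conn.Open();

using var cmd = new SqlCommand("SELECT Field1, Field2 FROM dbo.MyTable", conn);
using var reader = cmd.ExecuteReader();

while (reader.Read())
{
    string field1 = reader.GetString(0);   // up to 2048 characters
    string field2 = reader.GetString(1);   // up to 255 characters
    ProcessRow(field1, field2);            // placeholder for whatever the app does per row
}

static void ProcessRow(string field1, string field2)
{
    // Placeholder: handle one row at a time instead of holding 50k-100k rows in memory.
}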
I'm attempting to use bulk insert to replace the current row-by-row insertion in my project. Some insertion requests (BillType.Booklet) are a list with one row, and others are a list with multiple rows.
public async Task CreateBill(List<BillReceiverDto> Receivers, BillType BillType)
{
    var bulkList = new List<BillReceiverDto>();

    if (BillType == BillType.Booklet)
    {
        // Booklet bills only ever insert the first receiver.
        bulkList.Add(Receivers.FirstOrDefault());
    }
    else
    {
        bulkList.AddRange(Receivers);
    }

    await _dbContextProvider.GetDbContext().BulkInsertAsync(bulkList);
}
Bulk insert has great performance for inserting huge amounts of data, especially more than 100 rows; it inserts 5,000 entities in 75 milliseconds. But is it efficient to use bulk insert for a list with one row? Are there any drawbacks, such as overhead?
Disclaimer: I'm the owner of Entity Framework Extensions
It depends on the library you are using, Aref Hemati.
In our library, if there are 10 entities or fewer to insert, we use a SQL statement directly, so the SqlBulkCopy overhead is not incurred.
So using our library even with one row is fine, but it is obviously optimized for hundreds and thousands of rows.
I believe the BulkInsertAsync extension uses SqlBulkCopy under the hood. In which case, I blogged some benchmarks not that long ago, that might be useful.
While I didn't focus on single-row inserts, there did seem to be a cost when using SqlBulkCopy for lower numbers of rows (100) versus a Table Valued Parameter approach. As the volumes ramp up, SqlBulkCopy pulls away on performance, but there was a noticeable overhead at low volume. How much of an overhead? In the grand scheme of things, you're probably not going to notice tens of milliseconds.
If you're dealing with up to hundreds of rows, I'd actually recommend a Table Valued Parameter approach for performance. Larger volumes - SqlBulkCopy.
Depending on your needs/views on overheads here, I'd be tempted to check how many rows you have to insert and use the mechanism that best fits the volume. Personally, I wouldn't use SqlBulkCopy for low numbers of rows if that is a very typical scenario, because of the overhead.
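A rough sketch of that volume check - the threshold, table name, column names, DTO shape, and the dbo.ReceiverType table type are all assumptions for illustration, not anything from the question:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

public class BillReceiverDto
{
    public string Name { get; set; }       // assumed shape; the real DTO isn't shown
    public decimal Amount { get; set; }
}

public static class ReceiverWriter
{
    private const int BulkCopyThreshold = 100;   // assumed cut-over point

    public static async Task InsertAsync(string connectionString, List<BillReceiverDto> receivers)
    {
        // Both paths take the rows as a DataTable.
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Amount", typeof(decimal));
        foreach (var r in receivers)
            table.Rows.Add(r.Name, r.Amount);

        if (receivers.Count >= BulkCopyThreshold)
        {
            // High volume: SqlBulkCopy amortizes its setup cost.
            using var bulk = new SqlBulkCopy(connectionString)
            {
                DestinationTableName = "dbo.BillReceivers"
            };
            await bulk.WriteToServerAsync(table);
        }
        else
        {
            // Low volume: a table-valued parameter avoids the SqlBulkCopy overhead.
            using var conn = new SqlConnection(connectionString);
            await conn.OpenAsync();
            using var cmd = new SqlCommand(
                "INSERT INTO dbo.BillReceivers (Name, Amount) SELECT Name, Amount FROM @rows", conn);
            var p = cmd.Parameters.AddWithValue("@rows", table);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.ReceiverType";   // assumed user-defined table type
            await cmd.ExecuteNonQueryAsync();
        }
    }
}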
Blog:
https://www.sentryone.com/blog/sqlbulkcopy-vs-table-valued-parameters-bulk-loading-data-into-sql-server
First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
public class Operator
{
    string identifier;   // i.e. "7/1/2017 MN01 Day"
    string name1;
    string name2;
    string id1;
    string id2;
}
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary (hash table):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and a lookup into the dictionary is an O(1) operation. This means the overall time complexity becomes roughly linear, instead of the quadratic behaviour you get from scanning the 12,000-item list for every one of the millions of rows.
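A rough sketch of how the lookup fits into the write loop - this assumes the Operator fields are accessible (e.g. made public), and BuildIdentifier is a made-up stand-in for however a query row maps to a key like "7/1/2017 MN01 Day":

using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;

static class ReportWriter
{
    public static void WriteReport(IDataReader reader, List<Operator> operators, StreamWriter writer)
    {
        // Built once: O(n) over the ~12,000 operators.
        Dictionary<string, Operator> byId = operators.ToDictionary(o => o.identifier);

        while (reader.Read())                              // ~4-10 million rows, streamed
        {
            string key = BuildIdentifier(reader);          // row -> identifier string
            if (byId.TryGetValue(key, out Operator op))    // O(1) average, no list scan
            {
                writer.WriteLine($"{reader.GetValue(0)},{op.name1},{op.name2},{op.id1},{op.id2}");
            }
        }
    }

    static string BuildIdentifier(IDataRecord row)
    {
        // Hypothetical mapping from the query row to the identifier string.
        return $"{row.GetString(1)} {row.GetString(2)} {row.GetString(3)}";
    }
}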
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be being wasted packaging up every one of those millions of lines and squirting them through the network wires to your machine, only to discard most of them because, say, they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be put to a stored procedure which is TRIGGERed by appropriate insert, update, and delete events ...)
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering writing using multiple threads. This SO post has some helpful info, and you could of course Google for more.
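The usual shape of a multi-threaded write is a producer/consumer queue: the query/lookup loop keeps producing finished lines while one dedicated task drains the queue to disk. A rough sketch - the file name and the ProduceCsvLines placeholder are made up:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var lines = new BlockingCollection<string>(boundedCapacity: 10_000);

// Single writer task: the only thread touching the file.
var writerTask = Task.Run(() =>
{
    using var writer = new StreamWriter("report.csv");
    foreach (string line in lines.GetConsumingEnumerable())
        writer.WriteLine(line);
});

// Producer: the existing query + lookup loop just queues finished lines.
foreach (string line in ProduceCsvLines())
    lines.Add(line);

lines.CompleteAdding();   // no more lines are coming
await writerTask;         // let the writer flush and close the file

static IEnumerable<string> ProduceCsvLines()
{
    // Placeholder for the real per-row work.
    for (int i = 0; i < 1_000_000; i++)
        yield return $"row{i},valueA,valueB";
}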
So here is the deal,
I will be dealing with 1-second data that may accumulate for up to a month, with up to 40 sensors (columns). The data will arrive in exactly one-second increments, and I need to be able to quickly run logic on these sensor values. So, doing the math:
30 days * 24 hours/day * 60 min/hour * 60 sec/min is roughly 2.6 million rows
2.6 million rows x 40 columns is roughly:
100 million data points
Most of the data is of the double datatype, but I will have booleans, ints, and strings as well. I am entirely new to this, so all my ideas on how to handle it are entirely original; they may be absurd, and I may be missing the obvious solution. So here are some options I have considered:
1) One massive table - I am concerned about the performance of this option.
2) Tables by date, reducing the size of an individual table to 86,400 rows.
3) Tables by hour, dealing with only 3,600 rows per table - but at that point I start to have an excessive number of tables.
Let me give a little more detail on the nature of the logic: it is entirely sequential, meaning I will start at the beginning and go from the first row all the way to the last. I will essentially be making passes through the data with a number of different algorithms in order to produce my desired results. I am running SQL on Azure. My natural inclination is to move fast and make a lot of mistakes, but in this case I think some experience and advice on how I set up this database will pay off. So, any suggestions?
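For reference, the sequential pass I have in mind looks roughly like this - just streaming the rows in time order rather than loading whole tables into memory (the table name, column names, and connection string are placeholders):

using System.Data.SqlClient;

const string connectionString = "<your Azure SQL connection string>";

using var conn = new SqlConnection(connectionString);
conn.Open();

using var cmd = new SqlCommand(
    "SELECT RecordedAt, Sensor01, Sensor02 /* ...more sensor columns... */ " +
    "FROM dbo.SensorReadings ORDER BY RecordedAt", conn);
cmd.CommandTimeout = 0;   // a full scan of ~2.6 million rows can take a while

using var reader = cmd.ExecuteReader();
while (reader.Read())
{
    var timestamp = reader.GetDateTime(0);
    double sensor01 = reader.GetDouble(1);
    double sensor02 = reader.GetDouble(2);
    // run the sequential algorithms on each one-second sample here
}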
I have a giant (100Gb) csv file with several columns and a smaller (4Gb) csv also with several columns. The first column in both datasets have the same category. I want to create a third csv with the records of the big file which happen to have a matching first column in the small csv. In database terms it would be a simple join on the first column.
I am trying to find the best approach to go about this in terms of efficiency. As the smaller dataset fits in memory, I was thinking of loading it into a set structure, then reading the big file line by line, querying the in-memory set, and writing to the output file on a positive match.
Just to frame the question in SO terms, is there an optimal way to achieve this?
EDIT: This is a one time operation.
Note: the language is not relevant; I'm open to suggestions on column- or row-oriented databases, Python, etc.
Something like
import csv

def main():
    with open('smallfile.csv', 'rb') as inf:
        in_csv = csv.reader(inf)
        categories = set(row[0] for row in in_csv)

    with open('bigfile.csv', 'rb') as inf, open('newfile.csv', 'wb') as outf:
        in_csv = csv.reader(inf)
        out_csv = csv.writer(outf)
        out_csv.writerows(row for row in in_csv if row[0] in categories)

if __name__ == "__main__":
    main()
I presume you meant 100 gigabytes, not 100 gigabits; most modern hard drives top out around 100 MB/s, so expect it to take around 16 minutes just to read the data off the disk.
If you are only doing this once, your approach should be sufficient. The only improvement I would make is to read the big file in chunks instead of line by line. That way you don't have to hit the file system as much. You'd want to make the chunks as big as possible while still fitting in memory.
If you will need to do this more than once, consider pushing the data into a database. You could insert all the data from the big file and then "update" that data using the second, smaller file to get a complete database with one large table holding all the data. If you use a NoSQL database like Cassandra, this should be fairly efficient, since Cassandra is pretty good at handling writes.
I am working with a webservice that accepts an ADO.NET DataSet. The webservice will reject a submission if over 1000 rows are changed across all of the ten or so tables in the dataset.
I need to take my dataset and break it apart into chunks of less than 1000 changed rows. I can use DataSet.GetChanges() to produce a reduced dataset, but that still may exceed the changed row limit. Often a single table will have more than 1000 changes.
Right now, I think I need to:
Create an empty copy of the dataset.
Iterate over the DataTableCollection and .Add rows individually to the new tables until I get to the limit.
Start a new dataset, and repeat until I've gone through everything.
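In code, the plan would look something like this (the submission call at the end is just a stand-in for the real web-service client):

using System.Collections.Generic;
using System.Data;

static class DataSetSplitter
{
    // Yields DataSets that each contain at most maxRows changed rows.
    public static IEnumerable<DataSet> SplitChanges(DataSet source, int maxRows = 1000)
    {
        DataSet changes = source.GetChanges();
        if (changes == null) yield break;            // nothing to submit

        DataSet batch = changes.Clone();             // same tables and schema, no rows
        batch.EnforceConstraints = false;            // related rows may land in different chunks
        int count = 0;

        foreach (DataTable table in changes.Tables)
        {
            foreach (DataRow row in table.Rows)
            {
                batch.Tables[table.TableName].ImportRow(row);   // ImportRow preserves RowState
                if (++count == maxRows)
                {
                    yield return batch;
                    batch = changes.Clone();
                    batch.EnforceConstraints = false;
                    count = 0;
                }
            }
        }

        if (count > 0) yield return batch;           // whatever is left over
    }
}

// usage: foreach (DataSet chunk in DataSetSplitter.SplitChanges(myDataSet)) { /* submit chunk */ }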
Am I missing a simpler approach to this?
This is asking for trouble. Often changes to one table are dependent on changes to another. You don't want to split those up, or bad things will occur - problems that may be difficult to debug - unless you are very, very careful about this. Most likely, the "right" thing to do here is to submit changes to the webservice more frequently instead of batching them up so much.