I'm in the process of programming a web application that gets data from an inverter to which a PV cell is attached. I read the data from a CSV file. Every 20 seconds, the CSV file gains a line with the data for that point in time (each line contains: timestamp, current power, energy).
The CSV file is saved to a database when the application is started (when the index action is called in the controller). It's all working.
Since the database receives data at 20-second intervals, it is growing rapidly. And since I use graphs on my web application to show the energy the PV system supplies me over the year, I have to summarize the 20-second data, which also takes computing time. I do this in the index action as well.
So whenever the user opens the page, the data is updated. If I switch from one view to another and back again, for example, the index action in the associated controller is called again, so the page takes time to load again and my application becomes slow.
What do I need to do to solve such a problem?
Ok, then.
In our IT industry, we often come across the term "data warehousing".
What this means (in most cases) is that we have a LOT of transaction data. Think of the very high transaction rate generated by people shopping on Amazon: a HUGE number of transactions.
But what if we want to report on such data? Say we want sales by hour, or maybe we only need them per day.
Well, for that "house" of data we don't store each single transaction, but a sum total, accumulated over a time period chosen by the developer of that data warehouse system.
So, you probably don't need to capture each 20-second data point (maybe you do?).
So, as you stated, you get a data point every 20 seconds. Given that a year has about 31 million seconds, that means roughly 1.5 million data points per year.
However, perhaps you don't need such fine resolution. If you take the data and sum it into, say, 1-minute intervals, you're down to about 525,600 data points per year (and if you report by month, that's only about 43,800 points per month).
And maybe a resolution of 5 minutes is more than fine for your needs. At that resolution, a whole year of data becomes only 105,120 data points.
And thus for a graph or display of one month of data, we are only pulling 8,760 data points. That is not at all a large query for any database system these days.
So, you might want to think of this as a "miniature" data warehousing project, in which you do lose some data granularity, but what remains is still sufficient for your reporting needs.
What time slot or "gap" you choose will be based on YOUR requirements.
So what does the above suggest?
You would need a routine that reads the CSV, groups the data by the time slot you have chosen, sums into the existing data points, and appends new ones.
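For example, here is a minimal C# sketch of such a routine. The CSV layout (timestamp;power;energy), the semicolon separator, and the 5-minute slot size are assumptions, and instead of the real database upsert it just prints each bucket:

using System;
using System.IO;
using System.Linq;

class CsvRollup
{
    static void Main()
    {
        // Assumed layout per line: timestamp;power;energy
        var buckets = File.ReadLines("inverter.csv")
            .Select(line => line.Split(';'))
            .Select(f => new { Time = DateTime.Parse(f[0]), Power = double.Parse(f[1]), Energy = double.Parse(f[2]) })
            // Round each timestamp down to its 5-minute slot.
            .GroupBy(r => new DateTime(r.Time.Year, r.Time.Month, r.Time.Day, r.Time.Hour, (r.Time.Minute / 5) * 5, 0))
            .Select(g => new
            {
                Slot = g.Key,
                AvgPower = g.Average(r => r.Power),
                Energy = g.Max(r => r.Energy)   // assumes the energy column is a cumulative meter reading
            });

        foreach (var b in buckets)
        {
            // In the real application, upsert this row into the summary table instead of printing it.
            Console.WriteLine("{0:u}  avg power: {1:F1}  energy: {2:F2}", b.Slot, b.AvgPower, b.Energy);
        }
    }
}

Whether you average or sum the power column, and whether the energy column is cumulative or per-interval, depends on what your inverter actually writes, so check that before settling on the aggregate functions.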
Such a routine would not only vastly reduce the number of rows of data, it would of course also significantly speed up reports and graphing on that data.
So, you could easily drop from about 1.5 million rows per year down to, say, 100,000 rows per year. With an index on the date, reporting on that data, be it daily, weekly, or monthly, becomes far more manageable, and you have reduced the data by more than a factor of 10. Thus you have a lot more headroom in the database, a lot less data, and after 10 years you would only be at around 1 million rows - not a lot for even the free Express edition of SQL Server.
Also, since you can't control when the device appends data to that CSV, I would consider renaming the file before you read it and deleting it only after you are done. That reduces the possibility of losing data during your read-and-delete operation.
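Something along these lines (the file names are placeholders):

using System;
using System.IO;

class CsvSnapshot
{
    static void Main()
    {
        string live = "inverter.csv";                             // the file the device appends to
        string work = "inverter_" + DateTime.Now.Ticks + ".csv";  // our private snapshot

        // Rename first; new samples then land in a freshly created csv
        // (assuming the logger re-creates the file), so nothing is lost mid-read.
        File.Move(live, work);

        foreach (string line in File.ReadLines(work))
        {
            // ... parse the line and import it into the database ...
        }

        File.Delete(work);   // only delete once the import has succeeded
    }
}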
Related
I am busy writing a program for our company to monitor the performance of some systems of ours.
So basically every minute I receive values with each system's current state, and I store them in a database. That's all good and fine.
But now I need to be able to generate a report (well, a graph) of its performance over a time scale, be that hour, day, week, month or year.
Now that's all well, I can just query by timestamp, but when generating a yearly report I really do not need one-minute accuracy, and yet I would need to store 60*24*365 records per machine per year.
I want to somehow store all the values and then summarize them to reduce the amount of data that needs to be stored, so that a yearly report looks at, say, every 2 days rather than every minute, and likewise a monthly report can look at every day or every half hour.
Now my question is actually. How do I store this in the database?
Should I make tables for each summarized version, so minute => hour => day => week, etc.?
Or do I keep one big table with all the entries and just use SQL to summarize it?
I have never had to do anything like this before and I don't really know where to even start thinking about it. It sounds a bit like data warehousing, but please note this is not on a huge scale; it will be monitoring 5-10 services with a web interface. The idea I have in mind I actually got from MikroTik routers. I doubt many of you have worked with them, but they have a great resource-graphing system built in that is simple but shows what it needs to.
I think you can keep the original table at the grain of minutes, because in the future your needs may change (say from 2 days to 1 day, or even to every 6 hours). To manage that table efficiently, consider partitioning it. Then create another table that holds the summarized data, and write SQL that inserts into that aggregation table from the original table. So basically you are doing extract, transform and load (ETL) here. Hope it helps.
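As a rough sketch of that ETL step from C# (the table and column names are invented; adjust them to your schema):

using System;
using System.Data.SqlClient;

class HourlyRollup
{
    static void Main()
    {
        // Roll yesterday's minute-grain readings up into an hourly summary table.
        const string sql = @"
            INSERT INTO ReadingsHourly (MachineId, SlotStart, AvgValue, MaxValue)
            SELECT MachineId,
                   DATEADD(hour, DATEDIFF(hour, 0, ReadingTime), 0),   -- truncate to the hour
                   AVG(Value),
                   MAX(Value)
            FROM ReadingsMinute
            WHERE ReadingTime >= @from AND ReadingTime < @to
            GROUP BY MachineId, DATEADD(hour, DATEDIFF(hour, 0, ReadingTime), 0);";

        using (var conn = new SqlConnection("your connection string here"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@from", DateTime.Today.AddDays(-1));
            cmd.Parameters.AddWithValue("@to", DateTime.Today);
            conn.Open();
            cmd.ExecuteNonQuery();   // typically run from a nightly scheduled job
        }
    }
}

Once the rollup has been verified, you can prune or archive the minute-grain rows older than whatever window you still need at full resolution.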
I am trying to develop a system in which I sync my database with a third-party database through a provided API.
The API has a format in which we can provide a From-Date and a To-Date.
Problems
There is no API which gives me only modified records.
The data is too large (1000 records/day average)
Need a scheduler so all the records are updated automatically
I also need to keep track of modified records (which is the biggest problem, as I can't get them by modified date)
Note: As per the previous requirement, I have already developed the system in which I can specify the From-Date and To-Date and the records get updated (it's complete, with a GUI; no AJAX was used). Yet even if I request just one day's records, the system gets a timeout error.
Note 2: I really shouldn't say this, but the client is very strict (dumb); he just needs the solution, nothing else will do.
Assuming that the data doesn't need to be "fresh", can you not write a process that runs hourly or nightly, fetching that day's worth of data and processing it into your DB?
Obviously this would only work if you're sure previous records are not updated?
Does the API provide batches?
Why did you choose a web client with AJAX to process this data? Would a Windows or console application be better suited?
If the data is too big to retrieve by any given query, you're just going to have to do it by ID. Figure out a good size (100 records? 250?), and just spin through every record in the system by groups of that size.
You didn't say if you're pulling down data, pushing up data, or both. If you're only pulling it down, then that's the best you can do, and it will get slower and slower as more records are added. If you're only pushing it, then you can track a "pushed date". If it's both, how do you resolve conflicts?
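If you do go the by-ID route, the loop itself is simple; here is a rough sketch where FetchBatch and Upsert are stand-ins for whatever the real API and your data layer actually look like:

using System;
using System.Collections.Generic;
using System.Linq;

class BatchSync
{
    const int BatchSize = 250;   // pick a size the API and your timeout can live with

    static void Main()
    {
        List<int> allIds = LoadKnownIds();   // every record id tracked locally

        for (int i = 0; i < allIds.Count; i += BatchSize)
        {
            var batch = allIds.Skip(i).Take(BatchSize).ToList();
            foreach (var record in FetchBatch(batch))
                Upsert(record);              // insert or update the local copy
        }
    }

    // Placeholders for the real API client and the local data access code.
    static IEnumerable<object> FetchBatch(List<int> ids) { return Enumerable.Empty<object>(); }
    static void Upsert(object record) { }
    static List<int> LoadKnownIds() { return new List<int>(); }
}

At 1,000 new records a day the full sweep stays manageable for quite a while, but if the API ever exposes a modified-since filter, switch to that instead.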
I have a list of ints that's stored in the DB as a comma-separated string (4345,324,24,2424,64567,33...). This string could become quite large and contain 2-3 thousand numbers, and it's used quite frequently.
I'm thinking that instead of reading it from the DB every time it's needed, it'd be better to store it in the session after it's loaded the first time.
How much memory would a list of 1,000 int require? Does the memory size also depend on the int itself such that storing a larger int (234,332) takes more space than a smaller int (544)?
Is it better to read it once and store it in the session at the cost of memory, or to read it often and discard it from memory after rendering?
Thanks for your suggestions.
I think you are heading in the wrong direction. Storing it in the DB will likely be the better option: not in comma-separated format, but as a table of int values.
Storing data in the session will reduce scalability significantly. You might start getting OutOfMemoryExceptions and wondering why that is happening.
So my suggestion is: read from the DB when needed, apply appropriate indexes, and it will be very fast.
The way you are heading is:
Day #1, 1 user - Hmm, should I store data in Session, why not. Should work fast. No need to query DB. Also easy to do.
Day #10, 5 users - Need to store another data structure, will put this to the session too, why not? Session is very fast.
Day #50, 10 users - There is a control that is heavy to render. I will make it smart: render it once, then put it in the Session and reuse it on every postback.
Day #100, 20 users - Sometimes the web site is slow, don't know why. But it is only sometimes, so not a big deal.
Day #150, 50 users - It's gotten slow. Need a better CPU and more memory? We need to buy a better server; the hardware is old.
Day #160, 60 users - Got a new server, works much faster. Problem solved.
Day #200, 100 users - Slow again, why? This is the newest, most expensive server!
Day #250, 150 users - The application pool is getting recycled all the time. Why? OutOfMemoryException? What is this? I will google it.
Day #300, 200 users - Users complain, we lose customers. I read about WinDbg, need to try using it.
Day #350, 200 users - Should we start using network load balancing? We can buy two servers! Bought a server, tried to use it, didn't work: too many dependencies on the Session.
Day #400, 200 users - Can't get new customers, old customers go away. Started using WinDbg and found out that almost all the memory is used by the Session.
Day #450, 200 users - Starting a big project called 'Get rid of Session'.
Day #500, 250 users - The server is so fast now.
I've been there, seen that. Basically, my advice: don't go this way.
An int in C# is always 4 bytes (no matter what the value). A list of 1,000 ints is therefore ~4,000 bytes. I say approximately because the list structure will add some overhead. A few thousand ints in a list shouldn't be a problem for a modern computer.
I would not recommend storing it in the session, since that's going to cause memory pressure. If you have a series of integers tied to a single record, it sounds like you have a missing many to one relationship - why not store the ints in a separate table with a foreign key to the original table?
Integers are of a fixed size in .NET. Assuming you store it in an array instead of a List (since you are probably not adding to or removing from it), it would take up roughly 32 bits * the number of elements. 1000 ints in an array = roughly 32000 bits, or a little under 4 KB.
An int usually takes 32 bits (4 bytes), so 1000 of them would take about 4KB.
It doesn't matter how large the number is. They're always stored in the same space.
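If you want to sanity-check that on your own machine, here is a rough measurement sketch (GC.GetTotalMemory only gives an approximation):

using System;
using System.Collections.Generic;

class ListSizeCheck
{
    static void Main()
    {
        long before = GC.GetTotalMemory(true);

        var list = new List<int>(1000);
        for (int i = 0; i < 1000; i++)
            list.Add(int.MaxValue);   // the magnitude of the value makes no difference

        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(list);

        // Expect roughly 4,000 bytes for the backing array plus a small fixed overhead.
        Console.WriteLine("Approximate size: {0} bytes", after - before);
    }
}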
Is this list of int's unique to a session? If not, cache it at the server level and set an expiration on it. 1 copy of the list.
context.Cache.Add(...
I do this and refresh it every 5 minutes with a large amount of data. This way it's pretty "fresh" but only 1 connection takes the hit to populate it.
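For completeness, here is what that call can look like spelled out; the key name is a placeholder and the 5-minute absolute expiration matches the refresh interval mentioned above:

using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

static class IdListCache
{
    public static void Store(HttpContext context, List<int> ids)
    {
        context.Cache.Add(
            "ValidIdList",                    // cache key (placeholder name)
            ids,                              // the list loaded from the database
            null,                             // no cache dependency
            DateTime.UtcNow.AddMinutes(5),    // absolute expiration: refreshed every 5 minutes
            Cache.NoSlidingExpiration,
            CacheItemPriority.Normal,
            null);                            // no removal callback
    }
}

Unlike Session, the Cache keeps a single copy per application and can evict entries under memory pressure, which is exactly the behaviour you want here.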
I am working with a rather large mysql database (several million rows) with a column storing blob images. The application attempts to grab a subset of the images and runs some processing algorithms on them. The problem I'm running into is that, due to the rather large dataset that I have, the dataset that my query is returning is too large to store in memory.
For the time being, I have changed the query to not return the images. While iterating over the resultset, I run another select which grabs the individual image that relates to the current record. This works, but the tens of thousands of extra queries have resulted in a performance decrease that is unacceptable.
My next idea is to limit the original query to 10,000 results or so, and then keep querying over spans of 10,000 rows. This seems like the middle of the road compromise between the two approaches. I feel that there is probably a better solution that I am not aware of. Is there another way to only have portions of a gigantic resultset in memory at a time?
Cheers,
Dave McClelland
One option is to use a DataReader. It streams the data, but it's at the expense of keeping an open connection to the database. If you're iterating over several million rows and performing processing for each one, that may not be desirable.
I think you're heading down the right path of grabbing the data in chunks, probably using MySQL's LIMIT clause, correct?
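For reference, the data reader option looks roughly like this (connection string, table and column names are assumptions; this uses MySQL Connector/NET):

using System;
using MySql.Data.MySqlClient;

class ImageStreamer
{
    static void Main()
    {
        using (var conn = new MySqlConnection("your connection string here"))
        using (var cmd = new MySqlCommand("SELECT Id, ImageBlob FROM Images", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    byte[] image = (byte[])reader[1];   // only the current row's blob is held in memory

                    Process(id, image);                 // run the processing algorithm per row
                }
            }   // the connection stays open for the whole loop - that's the trade-off
        }
    }

    static void Process(int id, byte[] image) { /* ... */ }
}

If individual images are very large, ExecuteReader(CommandBehavior.SequentialAccess) together with GetBytes lets you stream each blob in pieces instead of materializing it as one byte array.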
When dealing with such large datasets it is important not to need to have it all in memory at once. If you are writing the result out to disk or to a webpage, do that as you read in each row. Don't wait until you've read all rows before you start writing.
You also could have set the images to DelayLoad = true so that they are only fetched when you need them rather than implementing this functionality yourself. See here for more info.
I see 2 options.
1) If this is a Windows app (as opposed to a web app), you can read each image using a data reader and dump the file to a temp folder on disk; then you can do whatever processing you need against the physical file.
2) Read and process the data in small chunks. 10k rows can still be a lot, depending on how large the images are and how much processing you want to do. Returning 5k rows at a time and reading more in a separate thread when you are down to 1k remaining to process can make for a seamless process.
Also while not always recommended, forcing garbage collection before processing the next set of rows can help to free up memory.
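A sketch of the chunked variant from option 2, using LIMIT/OFFSET (chunk size and names are placeholders; since both values are plain integers they are formatted straight into the SQL here):

using System;
using MySql.Data.MySqlClient;

class ChunkedReader
{
    static void Main()
    {
        const int chunkSize = 5000;

        using (var conn = new MySqlConnection("your connection string here"))
        {
            conn.Open();

            for (int offset = 0; ; offset += chunkSize)
            {
                int rows = 0;

                // ORDER BY a stable key so the pages don't overlap or skip rows.
                string sql = string.Format(
                    "SELECT Id, ImageBlob FROM Images ORDER BY Id LIMIT {0} OFFSET {1}", chunkSize, offset);

                using (var cmd = new MySqlCommand(sql, conn))
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        rows++;
                        Process(reader.GetInt32(0), (byte[])reader[1]);
                    }
                }

                if (rows < chunkSize)
                    break;   // last chunk processed
            }
        }
    }

    static void Process(int id, byte[] image) { /* ... */ }
}

Note that large OFFSET values get slower as you page deeper, since MySQL still scans past the skipped rows; if Id is the primary key, remembering the last Id you read and using WHERE Id > lastId instead scales better.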
I've used a solution like one outlined in this tutorial before:
http://www.asp.net/(S(pdfrohu0ajmwt445fanvj2r3))/learn/data-access/tutorial-25-cs.aspx
You could use multi-threading to pre-pull a portion of the next few data sets: at first pull rows 1-10,000, and in the background pull 10,001-20,000 and 20,001-30,000. You could also delete earlier pages of the data to conserve memory if that is an issue (say, if you are at rows 50,000-60,000, delete rows 1-10,000). Use the user's current "page" as a pointer to pull the next range of data or to drop out-of-range data.
Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking of storage of the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item ids are of the form 2007123456, 2010001234 (10 digits, where the first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be sub-second. Also, if there is an update on an existing itemId, I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit to indicate whether the item is present or not. That would limit the array length to a little over 1 MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as memory mapped files with the .NET 4 framework, I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
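To make the bit-array idea concrete, here is a small sketch of the per-user structure I have in mind (sized for the 10 million ids mentioned above, assuming ids start at 0):

class UserItemBits
{
    // One bit per possible item id: 10,000,000 ids / 8 = 1.25 MB per user.
    private readonly byte[] bits = new byte[10000000 / 8];

    public void Add(int itemId)
    {
        bits[itemId >> 3] |= (byte)(1 << (itemId & 7));
    }

    public void Remove(int itemId)
    {
        bits[itemId >> 3] &= (byte)~(1 << (itemId & 7));
    }

    public bool Contains(int itemId)
    {
        return (bits[itemId >> 3] & (1 << (itemId & 7))) != 0;
    }
}

System.Collections.BitArray would do the same job without the bit twiddling; the hand-rolled byte array just maps more directly onto a memory mapped file later.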
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths that should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL server 2008 under the following conditions.
Table with two columns (userid,itemid) both are Int
Clustered index on the two columns
Added ~800,000 items for each of 180 users - a total of 144 million rows
Allocated 4 GB of RAM for SQL Server
Dual-core 2.66 GHz laptop
SSD disk
Use a SqlDataReader to read all itemid's into a List
Loop over all users
If I run one thread, it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade: adding a third thread brings a lot of the queries up to 2 seconds, a fourth thread up to 4 seconds, and a fifth spikes some of the queries up to 50 seconds.
The CPU is maxed out while this is going on, even on one thread. My test app takes some of it due to the tight loop, and SQL takes the rest.
Which leads me to the conclusion that it won't scale very well, at least not on the hardware I tested. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? But that makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data, storing it as bits in memory mapped files. It performs much better: six threads yield access times between 0.02 s and 0.06 s, purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And whereas the SQL database took 4 GB, the files on disk took 23 MB.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes starting at offset 250,750,000; all space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, so the file is effectively treated as a bit array. And if the id sequence suddenly changes, the file will allocate space in another part.
To retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file and then check each bit in those ranges. Checking whether a particular id is set is also very fast: if it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see whether the correct bit is set.
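A stripped-down sketch of the read/write side (setting the NTFS sparse flag itself is done separately, as in the linked article):

using System.IO;
using System.IO.MemoryMappedFiles;

class SparseBitFile
{
    private readonly MemoryMappedFile file;
    private readonly MemoryMappedViewAccessor view;

    public SparseBitFile(string path, long maxItemId)
    {
        // One bit per possible id. With the sparse flag set on the file,
        // only the regions that are actually written get allocated on disk.
        file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, maxItemId / 8 + 1);
        view = file.CreateViewAccessor();
    }

    public void Set(long itemId)
    {
        long offset = itemId >> 3;
        byte current = view.ReadByte(offset);
        view.Write(offset, (byte)(current | (1 << (int)(itemId & 7))));
    }

    public bool Contains(long itemId)
    {
        return (view.ReadByte(itemId >> 3) & (1 << (int)(itemId & 7))) != 0;
    }
}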
So for the particular scenario where you have many ids that you want to check as fast as possible, this is the best approach I've found so far.
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2,000 users isn't too bad, but with 10 million related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching, etc. that you need, and they perform very well.
They also allow for better scalability in the future. If you suddenly need to deal with two million users and billions of settings, having a good DB in place will make scaling a non-issue.