I have a list of 2048 doubles that represent the amplitude of different samples in a signal. I'm constantly re-sampling the signal (say 20 times every second) and would like to store the data in an SQL database.
One complete measurement of the signal looks something like this:
List<double> sineCurve = new List<double>();
for (int i = 0; i < 2048; ++i)
{
sineCurve.Add(50 + 50 * Math.Sin(i / Math.PI));
}
What's the best way to store this data? The simplest way seems to be to create a table with 2049 columns and store each measurement as a new row:
string sql = "create table sample( measurement_id int not null, sample_value1 double, sample_value2 double, ... sample_value2048 double )";
Is this the preferred way of storing this type of data?
Edit: I would like to run a test where data is collected and saved to the database continuously over a few days. Then I would like to process the data by looking for certain patterns in each list of samples, min/max values and so forth.
I would create a table like this, where all the samples of one measurement are stored in a single BLOB entry:
CREATE TABLE measurement (id INTEGER PRIMARY KEY, generated_at TIMESTAMP, sample BLOB, SequenceNr INTEGER);
Then if you want to look for measurements in a given time range it would look kind of like this:
SELECT m.sample FROM measurement m
WHERE m.generated_at BETWEEN startDate AND endDate
ORDER BY m.SequenceNr;
You should know in advance what kinds of queries you want to run against your data. Extract those relevant values before saving a record and store them in separate fields; otherwise, save the array as binary in a BLOB field.
If other queries come up later, you will have to reprocess the BLOB values and pick out the values you need.
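For the "save the array as binary" option, here is a minimal sketch of packing one measurement into a byte[] for the blob column (and unpacking it again), assuming the 2048-sample sineCurve list from the question; the actual INSERT is left out.

// Pack the doubles into a byte array for the measurement.sample column.
byte[] blob = new byte[sineCurve.Count * sizeof(double)];
Buffer.BlockCopy(sineCurve.ToArray(), 0, blob, 0, blob.Length);

// ...write `blob` to measurement.sample with a parameterized INSERT...

// Unpack the blob back into samples when reading it out again.
double[] samples = new double[blob.Length / sizeof(double)];
Buffer.BlockCopy(blob, 0, samples, 0, blob.Length);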
I am processing large files in C# (hopefully) and I need a way to determine the number of distinct values in each column of a file. I have read all the questions I can find relating to determining distinct values with C#. The challenge is that, due to the large size of some files and the potential for tens of millions of distinct values in a column (and potentially hundreds of columns, of all sorts of datatypes), creating lists, dictionaries, or arrays for each column and then using the techniques described in previously answered questions would put me in danger of hitting the 2 GB memory limitation.
Currently, I am reading/processing the files one line at a time and for each row "cleaning and sanitizing" the data, updating aggregate results, then writing each processed row in an output file which is then bulk inserted to SQL. Performance thus far is actually pretty decent.
Since the data is ultimately landed in MS SQL, as a fallback I can use SQL to determine distinct values but I would ideally like to be able to do this before landing in SQL. Any thoughts or suggestions are appreciated.
Update: For each field I have created a Hash Table and added new distinct values to each. At the end of processing, I use
myDistinctValues.Count
to obtain the count. This works fine for small files but as I feared, with a large file I get
System.OutOfMemoryException
thrown. Per a suggestion, I did try adding
<runtime>
<gcAllowVeryLargeObjects enabled="true"/>
</runtime>
to my application config but that did not help.
Though my solution is not elegant and there is surely a better one out there (BTree?), I found something that worked and thought I'd share it. I can't be the only one out there looking to determine distinct counts for fields in very large files. That said, I don't know how well this will scale to hundreds of millions or billions of records. At some point, with enough data, one will hit the 2GB size limit for a single array.
What didn't work:
For very large files: a hash table for each field, populated in real time as I iterate through the file, then using hashtable.Count. The collective size of the hash tables causes a System.OutOfMemoryException before reaching the end of the file.
Importing the data to SQL and then using SQL on each column to determine the distinct count. It takes WAY too long.
What did work:
For the large files with tens of millions of rows, I first conduct an analysis on the first 1000 rows, in which I create a hash table for each field and populate it with the distinct values.
For any field with more than 50 distinct values out of those 1000, I mark the field with a boolean flag HasHighDensityOfDistinctValues = true.
For any such field with HasHighDensityOfDistinctValues == true, I create a separate text file and, as I iterate through the main file, write the values for just that field out to the field-specific text file.
For fields with a lower density of distinct values, I maintain an in-memory hash table for each field and add the distinct values to it.
I noticed that in many of the high-density fields there are repeated values (such as a PersonID) across multiple consecutive rows, so to reduce the number of entries in the field-specific text files, I store the previous value of the field and only write to the text file if the current value does not equal the previous value. That cut down significantly on the total size of the field-specific text files.
Once done iterating through the main file being processed, I iterate through my FieldProcessingResults class and for each field, if HasHighDensityOfDistinctValues==true, I read each line in the field-specific text file and populate the field-specific hash table with distinct values, then use HashTable.Count to determine the count of distinct values.
Before moving on to the next field, I store the count associated with that field, then clear the hash table with myHashTable.Clear(). I close and delete the field-specific text file before moving on to the next field.
In this manner, I am able to get the count of distinct values for each field without necessarily having to concurrently populate and maintain an in-memory hash table for each field, which had caused the out of memory error.
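For anyone wanting a concrete picture, here is a minimal sketch of the approach described above; the FieldStat type, the spill-file handling, and all names are hypothetical and only illustrate the low-density/high-density split with the previous-value suppression.

using System.Collections.Generic;
using System.IO;

// Hypothetical per-field bookkeeping for the two-pass distinct count.
class FieldStat
{
    public bool HasHighDensityOfDistinctValues;
    public HashSet<string> DistinctValues = new HashSet<string>(); // low-density fields only
    public StreamWriter SpillFile;                                 // high-density fields only
    public string PreviousValue;
    public int DistinctCount;
}

static class DistinctCounting
{
    public static void RecordValue(FieldStat stat, string fieldValue)
    {
        if (stat.HasHighDensityOfDistinctValues)
        {
            // Only write when the value changes, to shrink the field-specific temp file.
            if (fieldValue != stat.PreviousValue)
            {
                stat.SpillFile.WriteLine(fieldValue);
                stat.PreviousValue = fieldValue;
            }
        }
        else
        {
            stat.DistinctValues.Add(fieldValue);
        }
    }

    public static int CountSpilledDistinct(string spillFilePath)
    {
        // Second pass: only one field's hash set is alive in memory at a time.
        var distinct = new HashSet<string>();
        foreach (string line in File.ReadLines(spillFilePath))
            distinct.Add(line);
        return distinct.Count;
    }
}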
Have you considered getting a hash code of each value (assuming it cannot be larger than 128 bytes), creating a hash set, and doing something like this:
using System;
using System.Collections.Generic;

class Program
{
    private static HashSet<object> _hashes = new HashSet<object>();

    static void Main(string[] args)
    {
        List<object> vals = new List<object> { 1, 'c', "as", 2, 1 };
        foreach (var v in vals)
            Console.WriteLine($"Is unique: {IsUniq(v)}");
        Console.ReadKey();
    }

    // HashSet<T>.Add returns false when the value is already present.
    private static bool IsUniq(object v)
    {
        return _hashes.Add(v);
    }
}
It should be like 100-150 megabytes of raw data for 1 million elements.
How many distinct values are you expecting? I used the following simple app:
using System;
using System.Collections.Generic;
class Program
{
static void Main(string[] args)
{
Dictionary<string, int> d = new Dictionary<string, int>();
Random r = new Random();
for (int i = 0; i < 100000000; i++) {
string s = Guid.NewGuid().ToString();
d[s] = r.Next(0, 1000000);
if (i % 100000 == 0)
{
Console.Out.WriteLine("Dict size: " + d.Count);
}
}
}
}
Together with .NET 4.6.1 and an x64 build target, I got to 40 million unique objects and 5.5 gigabytes of memory consumed before I ran out of memory on my machine (it's busy with other things at the moment, sorry).
If you're going to be using arrays, you might need an app.config that looks like:
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<startup>
<supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6.1"/>
</startup>
<runtime>
<gcAllowVeryLargeObjects enabled="true"/>
</runtime>
</configuration>
You should be able to work out what sort of memory you'll need to track the distinct values and their counts. I recommend you work on one column at a time if you think it'll be in the hundreds of millions.
Just a minor clarification too: when I read "the number of distinct values" it makes me think you want to track the count of how many times each value appears. This is why I used Dictionary<string, int>: the string is the distinct value being counted and the int is the count.
If you're looking to de-dupe a list of X million/billion values down to just the distinct ones, with no need to count occurrences, then a HashSet might be lighter weight.
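A tiny sketch of that lighter-weight option; columnValues here is just a placeholder for whatever you stream in.

// A HashSet when only the distinct values matter, no per-value counts.
var distinct = new HashSet<string>();
foreach (string value in columnValues)   // columnValues: placeholder for your streamed input
    distinct.Add(value);
int distinctCount = distinct.Count;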
Have you tried loading the file into a DataTable and then doing your distinct selection via a DataView (which does not create a copy)?
Check out
https://social.msdn.microsoft.com/Forums/vstudio/en-US/fccda8dc-4515-4133-9022-2cb6bafa8ad9/how-does-a-dataview-act-in-memory?forum=netfxbcl
Here is some pseudo code
Read from file into a DataTable
Create a DataView with a sort on the column you want
UniqueCount = 0
CurrentValue = "<some impossible value>"
For each ViewRow in DataView
    If CurrentValue <> ViewRow["MyColumn"]
        UniqueCount++
        CurrentValue = ViewRow["MyColumn"]
UniqueCount should give you your result
This will be efficient because you are only using two variables, UniqueCount and CurrentValue, to loop through the data.
You are also sorting in the DataView, which does not make a copy of the data while processing.
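In C#, that pseudo code might look roughly like the sketch below; the "MyColumn" name is just a placeholder.

using System.Data;

static int CountDistinct(DataTable table, string columnName)
{
    // Sort via a DataView so equal values end up adjacent, then count value transitions.
    var view = new DataView(table) { Sort = columnName };
    int uniqueCount = 0;
    object currentValue = null;
    foreach (DataRowView rowView in view)
    {
        object value = rowView[columnName];
        if (!Equals(value, currentValue))
        {
            uniqueCount++;
            currentValue = value;
        }
    }
    return uniqueCount;
}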
Hope this helps
I am trying to devise an algorithm for finding duplicate rows in an Excel file across multiple worksheets.
Assume the following:
1) the file can be extremely large. Consider a file that has 10 worksheets (each with 1,048,576 rows) and say 30 columns of data
2) the duplicate rows may not be on the same worksheet. For example Sheet1/Row20 may be a duplicate of Sheet5/Row123456
3) to determine if a row is a duplicate of another, one or more of its columns could be used as a user-specified condition (it's not always the case that all columns must be the same; the user could, for example, specify that a duplicate is when columns 2, 3 and 5 are the same)
4) the order of the underlying data cannot be changed (no sorting the data first and then checking adjacent rows).
5) the algorithm must be memory efficient. Storing all the values of the columns of a row in a Dictionary will take up too much memory. Not all of the data can be stored in memory (read into .NET multidim arrays) at the same time since that would effectively double the memory usage since it is already stored in Excel.
6) the algorithm must minimize IO with the Excel object model. Continually retrieving a single row of data at a time (or performing some other built in Excel interop operation) from Excel can be slow.
So far I have had two different ideas for the Algorithm:
Algorithm 1)
a) Create a Dictionary<int, List<Tuple<int, int>>> where the dictionary key is the hash value of the desired column values in a particular row and List<Tuple<int, int>> is a list of worksheet index/row index pairs that compute to that hash code
b) Read in a large chunk of the data from Excel at a time (say 50,000 rows) and fill up the Dictionary.
c) Find all entries in the Dictionary where the List has Count > 1, then go through those rows and check whether they are in fact duplicates by reading the data from Excel again and comparing the actual values this time.
Algorithm 2)
Similar to Algorithm 1, but use two (or maybe three) distinct and independent hash functions to create a Tuple<int, int> or Tuple<int, int, int> as the key for the Dictionary. If the hash functions are independent then there would be a near-zero probability of a collision at a specific key unless the rows are in fact equal, so step c) of Algorithm 1 could be omitted.
To get the hashkey used in algo1, I would do something like this:
private int GetHashKey(List<object> columns)
{
    unchecked // let the multiply-add wrap around instead of throwing on overflow
    {
        int hash = 23;
        foreach (var o in columns)
            hash = hash * 31 + (o?.GetHashCode() ?? 0); // guard against empty cells returning null
        return hash;
    }
}
If I wanted to do Algorithm 2, I would need to define an extension method GetHashCode2() for object (or at least for the possible return data types of Range.Value2, which are string, double, bool and int).
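One possibility for that second, independent hash is an FNV-1a style hash over the value's string representation; a rough sketch follows, where the GetHashCode2 name and the string-based approach are assumptions, not a standard API.

// Hypothetical second hash, independent of object.GetHashCode(): FNV-1a over the
// characters of the value's string representation.
public static class HashExtensions
{
    public static int GetHashCode2(this object o)
    {
        if (o == null) return 0;
        unchecked
        {
            const uint fnvPrime = 16777619;
            uint hash = 2166136261;
            foreach (char c in o.ToString())
            {
                hash ^= c;
                hash *= fnvPrime;
            }
            return (int)hash;
        }
    }
}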
Can anyone think of a better solution?
What are people's thoughts on Algo1 vs Algo2?
If people think Algo2 is better, any idea how I could create a GetHashCode2() function that is efficient and robust and produces different hash codes than GetHashCode()?
Using: SQL Server 2008, Entity Framework
I am summing columns in a table across a date/time range. The table is straight-forward:
DataId bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
DateCollected datetime NOT NULL,
Length int NULL,
Width int NULL,
-- etc... (several more measurement variables)
Once I have the date/time range, I use linq-to-EF to get the query back
var query = _context.Data.Where(d =>
(d.DateCollected > df &&
d.DateCollected < dt));
I then construct my data structure using the sum of the data elements I’m interested in
DataRowSum row = new DataRowSum
{
Length_Sum = query.Sum(d => d.Length.HasValue ? (long)d.Length.Value : 0),
Width_Sum = query.Sum(d => d.Width.HasValue ? (long)d.Width.Value : 0),
// etc... (sum up the rest of the measurement variables)
};
While this works, it results in a lot of DB round trips and is quite slow. Is there a better way to do this? If it means doing it in a stored procedure, that’s fine with me. I just need to get the performance better since we’ll only be adding more measurement variables and this performance issue will just get worse.
SQL Server is very good at rolling up summary values. Create a proper stored procedure that calculates the sums for you. This will give you maximum performance, especially if you don't actually need the tabular data in your client program. Just have SQL Server roll up the summary and send back a whole lot less data. One of the reasons I generally don't like LINQ is that it tempts programmers to do things like what you are trying to do (pull a set and do 'something' against every row) instead of taking advantage of the database engine and all its capabilities.
Do this with aggregate functions and grouping in the SQL. LINQ will never figure out how to do this fast.
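That said, if staying in LINQ is preferred, one pattern that lets the provider emit a single grouped query is grouping on a constant; the sketch below assumes EF6's translation behavior and the entity from the question.

// All sums computed in one round trip by grouping the filtered rows on a constant.
var totals = _context.Data
    .Where(d => d.DateCollected > df && d.DateCollected < dt)
    .GroupBy(d => 1)
    .Select(g => new
    {
        Length_Sum = g.Sum(d => (long?)d.Length) ?? 0,
        Width_Sum = g.Sum(d => (long?)d.Width) ?? 0
        // ...remaining measurement variables...
    })
    .FirstOrDefault();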
Is there a way to show a certain number of random records from a database table, but heavily influenced by the date and time of creation?
For example:
showing 10 records at random, but
showing the latest with more frequency than the earliest.
Say there are 100 entries in the news table:
the latest (by date/time) record would have an almost 100% chance of being selected,
the first (by date/time) record would have an almost 0% chance of being selected,
the 50th (by date/time) record would have a 50% chance of being selected.
Is there such a thing in MSSQL directly? Or is there some function (best practice) in C# I can use for this?
Thanks
A quite simplistic way might be something like the following. Or at least it might give you a basis to start with.
WITH N AS
(
SELECT id,
headline,
created_date,
POWER(ROW_NUMBER() OVER (ORDER BY created_date ASC),2) * /*row number squared*/
ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [Weight] /*Random Number*/
FROM news
)
SELECT TOP 10
id,
headline,
created_date FROM N
ORDER BY [Weight] DESC
For a random sample, see Limiting Result Sets by Using TABLESAMPLE. Eg. Select a sample of 100 rows from a table:
SELECT FirstName, LastName
FROM Person.Person
TABLESAMPLE (100 ROWS);
For weighted sample, with preference to most recent records (I missed this when first reading the question), then Martin's solution is better.
You can pick an N using an exponential distribution (for instance), then SELECT TOP(N) ordered by date, and choose the last row.
You can choose the exponent according to the number of existing rows.
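A small sketch of drawing such an N via the inverse-CDF method; the mean value here is an assumption that controls how strongly recent rows are favored.

// Draw N from an exponential distribution: N = -mean * ln(1 - U), with U uniform in [0, 1).
var rand = new Random();
double mean = 10.0;   // assumed: average offset from the newest row
int n = 1 + (int)(-mean * Math.Log(1.0 - rand.NextDouble()));
// Then run SELECT TOP(n) ... ORDER BY created_date DESC and take the last row returned.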
Unfortunately I don't know MSSQL, but I can give a high-level suggestion.
Get the date in UNIX time (or some other increasing integral representation).
Divide this value by the maximum value in the column to get a fraction.
Generate a random number and multiply it by that fraction.
Sort your rows by this value and take the top N.
This will give more weight to the most recent results. If you want to adjust the relative frequency of older vs. newer results, you can apply an exponential or logarithmic function to the values before taking the ratio. If you're interested, let me know and I can provide more info.
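A rough C# sketch of these steps, assuming an in-memory records collection with a CreatedDate property (names are placeholders); it normalizes against both the oldest and newest timestamps so the fractions actually spread across [0, 1].

// Score = random * (age fraction); newer rows get fractions near 1 and so win more often.
long minTicks = records.Min(r => r.CreatedDate.Ticks);
long maxTicks = records.Max(r => r.CreatedDate.Ticks);
var rand = new Random();
var topN = records
    .Select(r => new
    {
        Row = r,
        Score = rand.NextDouble() *
                (r.CreatedDate.Ticks - minTicks) / (double)(maxTicks - minTicks)
    })
    .OrderByDescending(x => x.Score)
    .Take(10)
    .Select(x => x.Row)
    .ToList();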
If you can filter results after DB access, or you can submit a query with ORDER BY and process the results with a reader, then you can add a probabilistic bias to the selection. Note that the higher the bias, the harder the test inside the if becomes, and the more random the process is.
var table = ...                    // Ordered with the latest records first
int nItems = 10;                   // Number of items you want
double bias = 0.5;                 // Probabilistic bias: 0 = deterministic (top nItems); closer to 1 = more random
Random rand = new Random();
var results = new List<DataRow>(); // For example...
for (int i = 0; i < table.Rows.Count && results.Count < nItems; i++)
{
    if (rand.NextDouble() > bias)
        // Pick the current item probabilistically
        results.Add(table.Rows[i]); // Or reader.Next()[...]
}
I need to represent a lookup table in C#; here is the basic structure:
Name Range Multiplier
Active 10-20 0.5
What do you guys suggest?
I will need to lookup on range and retrieve the multiplier.
I will also need to lookup using the name.
Update:
It will have maybe 10-15 rows in total.
Range is an integer data type.
What you actually have is two lookup tables: one by Name and one by Range. There are several ways you can represent these in memory depending on how big the table will get.
The mostly-likely fit for the "by-name" lookup is a dictionary:
var MultiplierByName = new Dictionary<string, double>() { {"Active",.5}, {"Other", 1.0} };
The range is trickier. For that you will probably want to store either just the minimum or the maximum of each range, depending on how your ranges work. You may also need to write a function to reduce any given integer to its corresponding stored key value (hint: use integer division or the mod operator).
From there you can choose another dictionary (Dictionary<int, double>), or if it works out right you could make your reduce function return a sequential int and use a List<double> so that your 'key' just becomes an index.
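For instance, if the ranges happened to be fixed-width buckets of ten (10-19, 20-29, and so on; an assumption here), the reduce function could be plain integer division:

// Hypothetical: value / 10 maps 10-19 to bucket 1, 20-29 to bucket 2, etc.
var multiplierByBucket = new Dictionary<int, double> { { 1, 0.5 }, { 2, 1.0 } };

int ReduceToBucket(int value) => value / 10;

double multiplier = multiplierByBucket[ReduceToBucket(15)];   // 15 -> bucket 1 -> 0.5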
But like I said: to know for sure what's best we really need to know the scope and nature of the data in the lookup, and the scenario you'll use to access it.
Create a class to represent each row. It would have Name, RangeLow, RangeHigh and Multiplier properties. Create a list of such rows (read from a file or entered in the code), and then use LINQ to query it:
from r in LookupTable
where r.RangeLow <= x && r.RangeHigh >= x
select r.Multiplier;
Sometimes simplicity is best. How many entries are we looking at, and are the ranges integer ranges as you seem to imply in your example? While there are several approaches I can think of, the first one that comes to mind is to maintain two different lookup dictionaries, one for the name and one for the value (range) and then just store redundant info in the range dictionary. Of course, if your range is keyed by doubles, or your range goes into the tens of thousands I'd look for something different, but simplicity rules in my book.
I would implement this using a DataTable, assuming there was no pressing reason to use another data type. DataTable.Select would work fine for running a lookup on Name or Range. You do lose some performance using a DataTable for this, but with 10-15 records, would it matter that much?
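A small sketch of what that could look like, assuming a DataTable named lookup with Name, RangeLow, RangeHigh and Multiplier columns (the column split is an assumption borrowed from the class-based answer above):

// DataTable.Select takes a filter expression, much like a WHERE clause.
DataRow[] byName = lookup.Select("Name = 'Active'");
DataRow[] byRange = lookup.Select("RangeLow <= 15 AND RangeHigh >= 15");
double multiplier = (double)byRange[0]["Multiplier"];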