First time posting to a questions site, but I have a somewhat complex problem I've been looking at for a few days.
Background
At work we're implementing a new billing system. However, we want to take the unprecedented step of auditing the new billing system against the old one (which is significantly more robust) on an ongoing basis. The reason is that the new billing system is a lot more flexible for our new rate plans, so marketing is really pressing us to get it in place.
We had our IT group develop a report, for a ridiculous amount of money, that runs at 8 AM each morning against the previous day's data, compares records to find byte-count discrepancies, and generates the report. This isn't very useful for us: for one, it runs the next day, and secondly, if it shows bad results we have no indication of why we had a problem the day before.
So we want to build our own system that hooks into any possible data source (at first only the new and old systems' User Data Records (UDRs)) and compares the results in near real time.
Just some notes on the scale: each billing system produces roughly 6 million records per day, at a total file size of about 1 GB.
My Proposed set-up
Essentially, buy some servers. We have budget for several 8-core / 32 GB RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer; it would be written out to persistent storage, with aggregate statistics stored in a database.
Each record essentially contains a record-id from the platform, correlation-id, username, login-time, duration, bytes-in, bytes-out, and a few other fields.
I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a self-balancing binary search tree keyed on the username. The next level would be something like a skip list based on date, so we would have next matched record, next day, next hour, next month, next year, etc. Finally we would have our matched-record object, essentially just a holder which references the UDR record object from system A and the UDR record object from system B.
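Something like this rough sketch is what I have in mind (SortedDictionary standing in for the self-balancing tree, since it is a red-black tree internally, and a sorted list per user standing in for the date layer; all names here are just placeholders):

using System;
using System.Collections.Generic;

class UdrRecord {                      // one record from either platform
    public string RecordId { get; set; }
    public string CorrelationId { get; set; }
    public string Username { get; set; }
    public DateTime LoginTime { get; set; }
    public long BytesIn { get; set; }
    public long BytesOut { get; set; }
}

class MatchedRecord {                  // holder pairing the two platforms' records
    public UdrRecord SystemA { get; set; }
    public UdrRecord SystemB { get; set; }
    public long Discrepancy =>
        Math.Abs((SystemA?.BytesIn ?? 0) - (SystemB?.BytesIn ?? 0));
}

class UserHistory {
    // Matched records ordered by time; next-hour/next-day navigation can be
    // layered on top. Assumes one record per timestamp per user for simplicity.
    public SortedList<DateTime, MatchedRecord> Records { get; } =
        new SortedList<DateTime, MatchedRecord>();
}

class AuditIndex {
    // Self-balancing (red-black) tree keyed on username.
    public SortedDictionary<string, UserHistory> Users { get; } =
        new SortedDictionary<string, UserHistory>();
}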
I'd run a number of internal analytics as data is added to see if the new billing system has choked or started showing large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.
Problem
The problem I have is that aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query, say for the top contributors to an alarm, see which records contributed to the discrepancy, and dig in and investigate. Originally I wanted to use a syntax similar to a Wireshark filter, with some SQL mixed in.
Example:
udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepency_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I'm not 100% up to speed on the .NET 3.5 stuff. Also, I'm not sure whether it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach getting a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.
I am fully prepared to hard-code many of the possible queries and just have them as reports that are run with some parameters, but if there is a nice clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.
Actually, for the above type of query, the dynamic LINQ stuff is quite a good fit. Otherwise you'll have to write pretty much the same thing anyway - a parser, and a mechanism for mapping it to attributes. Unfortunately it isn't an exact hit, since you need to split things like OrderBy, and dates need to be parameterized - but here's a working example:
// requires the Dynamic LINQ sample (System.Linq.Dynamic)
using System;
using System.Linq;
using System.Linq.Dynamic;

class Udr {
    public int BytesIn { get; set; }
    public UdrAnalysis Analysis { get; set; }
    public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
    public int Discrepency { get; set; }
    public int DiscrepencyPercent { get; set; }
}
static class Program {
    static void Main() {
        Udr[] data = new[] {
            new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
                      Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 } },
            new Udr { BytesIn = 500, StartedDate = DateTime.Today,
                      Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 } }
        };
        DateTime when = DateTime.Parse("2008-11-10 22:00:44");
        var query = data.AsQueryable().Where(
                @"bytesin > 1000 && (analysis.discrepency > 100000
                    || analysis.discrepencypercent > 100)
                    && starteddate > @0", when)
            .OrderBy("analysis.discrepency DESC")
            .Take(10);
        foreach (var item in query) {
            Console.WriteLine(item.BytesIn);
        }
    }
}
Of course, you could take the dynamic LINQ sample and customize the parser to do more of what you need...
Whether you use DLINQ or not, I suspect that you'll want to use LINQ somewhere in the solution, because it provides so many bits of what you want.
How much protection do you need from your users, and how technical are they? If this is only for a few very technical internal staff (e.g. who are already developers) then you could just let them write a C# expression and then use CSharpCodeProvider to compile the code - then apply it to your data.
Obviously this requires your users to be able to write C# - or at least just enough of it for a query expression - and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you don't really want huge amounts of data crossing an AppDomain boundary.)
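A minimal sketch of that approach, assuming the full-framework CodeDOM API and a public Udr type (error handling, caching and sandboxing are omitted):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

static class UserQueryCompiler
{
    // Wraps the user's expression in a predicate over Udr, compiles it in memory
    // and hands back a delegate you can pass to Where().
    public static Func<Udr, bool> Compile(string userExpression)
    {
        string source = @"
            using System;
            public static class UserQuery
            {
                public static bool Predicate(Udr udr) { return " + userExpression + @"; }
            }";

        var provider = new CSharpCodeProvider();
        var options = new CompilerParameters { GenerateInMemory = true };
        options.ReferencedAssemblies.Add("System.Core.dll");
        // The compiled code needs to see the Udr type, so reference the host assembly too.
        options.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);

        CompilerResults results = provider.CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors)
            throw new InvalidOperationException(results.Errors[0].ErrorText);

        var method = results.CompiledAssembly
            .GetType("UserQuery")
            .GetMethod("Predicate");
        return udr => (bool)method.Invoke(null, new object[] { udr });
    }
}

// Usage: var pred = UserQueryCompiler.Compile("udr.BytesIn > 1000");
//        var hits = data.Where(pred).ToList();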
On the subject of LINQ in general - again, a good fit due to your size issues:
"Just some notes on the scale: each billing system produces roughly 6 million records per day, at a total file size of about 1 GB."
LINQ can be used fully with streaming solutions. For example, your "source" could be a file reader. The Where would then iterate over the data checking individual rows without having to buffer the entire thing in memory:
static IEnumerable<Foo> ReadFoos(string path) {
    return from line in ReadLines(path)
           let parts = line.Split('|')
           select new Foo { Name = parts[0],
                            Size = int.Parse(parts[1]) };
}
static IEnumerable<string> ReadLines(string path) {
    using (var reader = File.OpenText(path)) {
        string line;
        while ((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
This is now lazy loading... we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; this works particularly well if you need to group data, since it doesn't buffer everything.
Finally - if you want binary storage, serializers like protobuf-net can be used to create streaming solutions. At the moment it works best with the "push" approach of Push LINQ, but I expect I could invert it for regular IEnumerable<T> if needed.
Related
I am reading 40,000 small objects/rows from SQLite with EF Core, and it's taking 18 seconds, which is too long for my UWP app.
When this happens, CPU usage on a single core reaches 100%, but disk read speed sits at around 1%.
var dataPoints = _db.DataPoints.AsNoTracking().ToArray();
Without AsNoTracking() the time taken is even longer.
DataPoint is a small POCO with a few primitive properties. Total amount of data I am loading is 4.5 MB.
public class DataPointDto
{
    [Key]
    public ulong Id { get; set; }

    [Required]
    public DateTimeOffset TimeStamp { get; set; }

    [Required]
    public bool trueTime { get; set; }

    [Required]
    public double Value { get; set; }
}
Question: Is there a better way of loading this many objects, or am I stuck with this level of performance?
Fun fact: x86 takes 11 seconds, x64 takes 18. 'Optimise code' shaves off a second. Using Async pushes execution time to 30 seconds.
Most answers follow the common wisdom of loading less data, but in some circumstances such as here you Absolutely Positively Must load a lot of entities. So how do we do that?
Cause of poor performance
Is it unavoidable for this operation to take this long?
Well, it's not. We are loading only a few megabytes of data from disk; the cause of poor performance is that the data is split across 40,000 tiny entities. The database can handle that, but Entity Framework seems to struggle with setting up all those entities, change tracking, etc. If we do not intend to modify the data, there is a lot we can do.
I tried three things
Primitives
Load just one property, then you get a list of primitives.
List<double> dataPoints = _db.DataPoints.Select(dp => dp.Value).ToList();
This bypasses all of the entity creation normally performed by Entity Framework. This query took 0.4 seconds, compared to 18 seconds for the original query. We are talking about a 45x (!) improvement.
Anonymous types
Of course, most of the time we need more than just an array of primitives.
We can create new objects right inside the LINQ query. Entity Framework won't create the entities it normally would, and the operation runs much faster. We can use anonymous objects for convenience.
var query = db.DataPoints.Select(dp => new { ID = dp.sensorID, Timestamp = dp.TimeStamp, Value = dp.Value });
This operation takes 1.2 seconds, compared to 18 seconds when retrieving the same amount of data normally.
Tuples
I found that in my case using tuples instead of anonymous types improved performance a little; the following query executed roughly 30% faster:
var query = db.DataPoints.Select(dp => Tuple.Create(dp.sensorID, dp.TimeStamp, dp.Value));
Other ways
You cannot use structures inside LINQ queries, so that's not an option.

In many cases you can combine many records together to reduce the overhead associated with retrieving many individual records. By retrieving fewer, larger records you could improve performance. For instance, in my use case I've got some measurements being taken every 5 minutes, 24/7. At the moment I am storing them individually, and that's silly: nobody will ever query less than a day's worth of them. I plan to update this post when I make the change and find out how performance changed. (A rough sketch of this idea follows this list.)

Some recommend using an object-oriented DB or a micro ORM. I have never used either, so I can't comment.
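For illustration, a hypothetical sketch of that batched entity (the name and serialization choice are mine, not measured):

// Hypothetical batched entity: 288 five-minute samples per day stored as one row.
public class DailyDataPointBatch
{
    [Key]
    public ulong Id { get; set; }

    [Required]
    public DateTimeOffset Day { get; set; }

    // The day's samples serialized into a single column, e.g. a byte[] of doubles
    // or a JSON string, so a one-day query touches one row instead of 288.
    [Required]
    public byte[] Values { get; set; }
}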
You can use a different technique to load all your items: create your own logic to load parts of the data while the user is scrolling the ListView (I guess you are using one).
Fortunately, UWP has an easy way to do this: incremental loading.
Please see the documentation and example:
https://msdn.microsoft.com/library/windows/apps/Hh701916
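A rough sketch of what that can look like (IncrementalDataPointSource and MyDbContext are invented names; ISupportIncrementalLoading is the real UWP hook the ListView consumes):

using System.Collections.ObjectModel;
using System.Linq;
using System.Runtime.InteropServices.WindowsRuntime;
using System.Threading.Tasks;
using Windows.Foundation;
using Windows.UI.Xaml.Data;

// The ListView asks for more items as the user scrolls; each request pages
// another chunk out of the database instead of loading all 40,000 rows up front.
public class IncrementalDataPointSource : ObservableCollection<DataPointDto>,
                                          ISupportIncrementalLoading
{
    private readonly MyDbContext _db = new MyDbContext();  // hypothetical context
    private int _loaded;

    public bool HasMoreItems { get; private set; } = true;

    public IAsyncOperation<LoadMoreItemsResult> LoadMoreItemsAsync(uint count)
    {
        return AsyncInfo.Run(async cancel =>
        {
            var page = await Task.Run(() =>
                _db.DataPoints.AsNoTracking()
                   .OrderBy(dp => dp.Id)
                   .Skip(_loaded)
                   .Take((int)count)
                   .ToList(), cancel);

            foreach (var item in page)
                Add(item);                       // back on the UI thread here

            _loaded += page.Count;
            HasMoreItems = page.Count == (int)count;
            return new LoadMoreItemsResult { Count = (uint)page.Count };
        });
    }
}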
Performance test on 26 million records (1 datetime, 1 double, 1 int), EF Core 3.1.5:
Anonymous types or tuples as suggested in the accepted answer: about 20 sec, 1.3 GB RAM.
Struct: about 15 sec, 0.8 GB RAM.
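A sketch of what such a struct projection can look like (property names are illustrative; EF Core 3.x evaluates the final Select on the client, so constructing a struct there is allowed):

// Illustrative struct matching the record shape above (1 datetime, 1 double, 1 int).
public readonly struct DataPointStruct
{
    public DataPointStruct(DateTimeOffset timeStamp, double value, int flags)
    {
        TimeStamp = timeStamp;
        Value = value;
        Flags = flags;
    }

    public DateTimeOffset TimeStamp { get; }
    public double Value { get; }
    public int Flags { get; }
}

// Project straight into the struct instead of materializing full entities.
var points = db.DataPoints
    .AsNoTracking()
    .Select(dp => new DataPointStruct(dp.TimeStamp, dp.Value, dp.Flags))
    .ToList();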
I'm currently exploring the potential of graph databases for some processes in my industry.
I started with Neo4jClient one week ago, so I'm below the standard beginner :-)
I'm very excited about Neo4j, but I'm facing huge performance issues and I need help.
The first step in my project is to populate Neo4j from existing text files.
Those files are composed of lines formatted using a simple pattern:
StringID=StringLabel(String1,String2,...,StringN);
For example, if I consider the following line:
#126=TYPE1(#80,#125);
I would like to create one node with label "TYPE1", and 2 properties:
1) a unique ID using ObjectID: "#126" in above example
2) a string containing all parameters for future use: "#80,#125" in above example
I must consider that I will deal with multiple forward references, as in the example below:
#153=TYPE22('0BTBFw6f90Nfh9rP1dl_3P',#144,#6289,$);
The line defining the node with StringID "#6289" will be parsed later in the file.
So, to solve my file import problem, I've defined the following class:
public class myEntity
{
    public string propID { get; set; }
    public string propATTR { get; set; }

    public myEntity()
    {
    }
}
Because of the forward references in my text file (and, no doubt, my poor Neo4j knowledge...), I've decided to work in 3 steps:
In a first loop, I extract strLABEL, strID and strATTRIBUTES from each line parsed from my file, then I add one Neo4j node for each line using the following code:
strLabel = "(entity:" + strLABEL + " { propID: {newEntity}.propID })";
graphClient.Cypher
.Merge(strLabel)
.OnCreate()
.Set("entity = {newEntity}")
.WithParams(new {
newEntity = new {
propID = strID,
propATTR = strATTRIBUTES
}
})
.ExecuteWithoutResults();
Then I match all nodes created in Neo4j using the following code:
var queryNode = graphClient.Cypher
    .Match("(nodes)")
    .Return(nodes => new {
        NodeEntity = nodes.As<myEntity>(),
        Labels = nodes.Labels()
    });
And finally I loop over all nodes, split the propATTR property for each node, and add one relation for each ObjectID found in propATTR using the following code:
graphClient.Cypher
    .Match("(myEnt1)", "(myEnt2)")
    .Where((myEntity myEnt1) => myEnt1.propID == strID)
    .AndWhere((myEntity myEnt2) => myEnt2.propID == matchAttr)
    .CreateUnique("myEnt1-[:INTOUCHWITH]->myEnt2")
    .ExecuteWithoutResults();
When I explore the database populated by that code using Cypher, the resulting nodes and relations are the right ones, and Neo4j execution speed is very fast for any queries I've tested.
It's very impressive, and I'm convinced there is huge potential for Neo4j in my industry.
But my big issue today is the time required to populate the database (my config: Win8 x64, 32 GB RAM, SSD, Intel Core i7-3840QM 2.8 GHz):
For a small test case (6,400 lines) it took 13 s to create 6,373 nodes, and 94 s more to create 7,800 relations.
On a real test case (40,000 lines) it took 496 s to create 38,898 nodes, and 3,701 s more to create 89,532 relations (yes: more than one hour!).
I have no doubt such poor performance results directly from my poor Neo4jClient knowledge.
It would be a tremendous help if the community could advise me on how to solve this bottleneck.
Thanks in advance for your help.
Best regards
Max
While I don't have the exact syntax in my head to write down for you, I would suggest you look at splitting your propATTR values when you read them initially, and storing them directly as an array/collection in Neo4j. This would hopefully then enable you to do your relationship creation in bulk within Neo4j, rather than iterating the nodes externally and executing so many sequential transactions.
The latter part might look something like:
MATCH (myEnt1),(myEnt2) WHERE myEnt1.propID IN myEnt2.propATTR
CREATE UNIQUE (myEnt1)-[:INTOUCHWITH]->(myEnt2)
Sorry, my Cypher is a bit rusty, but the point is to try to transfer the load fully into the Neo4j engine, rather than making continual round trips between your application logic and the Neo4j server. I suggest that it's probably these round trips that are killing your performance, and not so much the individual work involved in each transaction, so minimising the number of transactions would be the way to go.
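Roughly, reusing the Neo4jClient fluent calls from your question (the exact Cypher may need tweaking):

// 1) While importing, split the attribute string and store it as an array property.
graphClient.Cypher
    .Merge("(entity:" + strLABEL + " { propID: {propID} })")
    .OnCreate()
    .Set("entity.propATTR = {propATTR}")
    .WithParams(new
    {
        propID = strID,
        propATTR = strATTRIBUTES.Split(',')   // e.g. ["#80", "#125"]
    })
    .ExecuteWithoutResults();

// 2) After the import, create every relationship in a single server-side statement
//    instead of one round trip per relationship.
graphClient.Cypher
    .Match("(myEnt1)", "(myEnt2)")
    .Where("myEnt1.propID IN myEnt2.propATTR")
    .CreateUnique("(myEnt1)-[:INTOUCHWITH]->(myEnt2)")
    .ExecuteWithoutResults();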
I have a table that lists all errors and warnings for a device (a hardware device whose details we store in the DB). That is covered by a single DeviceLog table.
The DeviceLog table stores both current and archived errors; this means, though, that for devices with a large archive it is very slow to get out the current error/warning state.
device.Errors = databaseDevice.Errors
    .Where(e => e.Current &&
                e.LogEntryType == Models.DeviceLogEntryType.Error)
    .Select(e => new DeviceErrorLog())
    .ToList();
So in this current case there are about 5,000 entries in DeviceLog for this specific device, but no current ones, so device.Errors.Count() == 0; but if, in IntelliSense, I hover over databaseDevice.Errors it shows the actual full count.
Is this behaviour expected? If so, how can I make this faster? I thought this should be a fast operation, especially as I am specifying a very direct and easily searchable subset of the data.
To clarify the data structure:
public class DeviceLogEntry
{
    public DeviceLogEntryType /* An Enum */ LogEntryType { get; set; }
    public bool Current { get; set; }
}
"but if, in IntelliSense, I hover over databaseDevice.Errors it shows the actual full count"
Why shouldn't it? If you are hovering over databaseDevice.Errors, then IntelliSense is going to try to get you useful information about databaseDevice.Errors, of which the Count (produced by running select count(*) from errors) is an example.
It is device.Errors that includes the Where constraint, not databaseDevice.Errors.
"If so, how can I make this faster? I thought this should be a fast operation"
Running select count(*) should be a fast operation. But that's also irrelevant to your code, since your code isn't what IntelliSense ran in this case; your code is:
databaseDevice.Errors
    .Where(e => e.Current &&
                e.LogEntryType == Models.DeviceLogEntryType.Error)
    .Select(e => new DeviceErrorLog())
    .ToList()
Which will execute something like:
select * from errors where e.current = 1 and e.LogType = 52
(Or whatever integer that enum value corresponds to).
And then building a list from it. It doesn't matter what IntelliSense did in a case that doesn't correspond to the actual executed code.
Two things you can do to improve performance though:
Check the indices on the database table, as being able to quickly look up on Current and LogEntryType will affect how fast the underlying SQL executes (a sketch follows this list).
Drop the ToList() unless you'll be dealing with those results more than once. If you will be dealing with it more than once, then that's great; get it into memory and hit it repeatedly as needed. If you'll not be dealing with it more than once, or if those ways you deal with it will involve further Where, then don't waste time building a list, just to go and query that list again, when you could just query the results themselves.
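As a hedged sketch of the index suggestion from the first point (this assumes EF Core with fluent model configuration; on EF6 the equivalent would be an [Index] attribute or a manual migration):

// In your DbContext: a composite index covering the columns used in the Where
// clause, so the "current errors" lookup does not scan the whole DeviceLog table.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<DeviceLogEntry>()
        .HasIndex(e => new { e.Current, e.LogEntryType });
}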
From my understanding of Entity Framework, the behaviour described above is as expected. The context's databaseDevice.Errors knows how many records there are in the table; however, due to lazy loading, the actual results are not fetched until the .ToList() is executed. The full table is not fetched from the database.
I would not expect any performance issues with this but, if there are, adding some indexes to the table should fix that.
I want to show similar products, so-called variants, for a product. Currently I am doing it as below:
public IList<Product> GetVariants(string productName)
{
    EFContext db = new EFContext(); // using Entity Framework
    return db.Products
        .Where(product => product.ProductName == productName)
        .ToList();
}
But this results in an exact match only, that is, the current product itself. I am thinking of using Levenshtein distance as a basis to get similar products. But before that, I want to check what the majority of developers do for getting variants:
Is it good to use Levenshtein distance? Is it used in industry for this purpose?
Do I have to add another table in the database showing the variants for a product when adding the product to the database?
I used the Jaro-Winkler distance effectively to account for typos in one system I wrote a while back. IMO, it's much better than a simple edit distance calculation, as it can account for string lengths fairly effectively. See this question on SO for open source implementations.
I ended up writing it in C# and importing it into SQL Server as a SQL CLR function, but it was still relatively slow. It worked in my case mostly because such queries were executed infrequently (100-200 a day).
If you expect a lot of traffic, you'd have to build an index to make these lookups faster. One strategy for this would be to periodically compute the distance between each pair of products and store it in an index table if it exceeds a certain threshold. To reduce the amount of work that needs to be done, you can run this only once or twice a day, and you can limit it to only new or modified records since the last run. You can then look up similar products and order by distance quickly.
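A rough sketch of that periodic job (JaroWinkler.Similarity, the ProductSimilarity table and the ProductId key are hypothetical placeholders; plug in one of the open-source implementations linked above):

// Periodic job: precompute pairwise similarity and keep only pairs above a threshold.
public void RefreshSimilarityIndex(EFContext db, double threshold = 0.85)
{
    var products = db.Products
        .Select(p => new { p.ProductId, p.ProductName })
        .ToList();

    for (int i = 0; i < products.Count; i++)
    {
        for (int j = i + 1; j < products.Count; j++)
        {
            // Jaro-Winkler returns a similarity score in [0, 1]; higher = more alike.
            double score = JaroWinkler.Similarity(
                products[i].ProductName, products[j].ProductName);

            if (score >= threshold)
            {
                db.ProductSimilarities.Add(new ProductSimilarity
                {
                    ProductIdA = products[i].ProductId,
                    ProductIdB = products[j].ProductId,
                    Score = score
                });
            }
        }
    }

    db.SaveChanges();
}

// Lookup then becomes a cheap indexed query:
// db.ProductSimilarities.Where(s => s.ProductIdA == id).OrderByDescending(s => s.Score)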
Is it somehow possible to chain together several LINQ queries on the same IEnumerable?
Some background,
I have some files, 20-50 GB in size; they will not fit in memory. Some code parses messages from such a file and basically does:
public IEnumerable<Record> ReadRecordsFromStream(Stream inStream) {
    Record msg;
    while ((msg = ReadRecord(inStream)) != null) {
        yield return msg;
    }
}
This allows me to perform interesting queries on the records,
e.g. find the average duration of a Record:
var records = ReadRecordsFromStream(stream);
var avg = records.Average(x => x.Duration);
Or perhaps the number of records per hour/minute
var x = from t in records
        group t by t.Time.Hour + ":" + t.Time.Minute into g
        select new { Period = g.Key, Frequency = g.Count() };
And there are a dozen or so more queries I'd like to run to pull relevant info out of these records. Some of the simple queries can certainly be combined into a single query, but this seems to get unmanageable quite fast.
Now, each time I run these queries, I have to read the file from the beginning again and reparse all the records; parsing a 20 GB file 20 times takes time, and is a waste.
What can I do to make just one pass over the file, but run several LINQ queries against it?
You might want to consider using Reactive Extensions for this. It's been a while since I've used it, but you'd probably create a Subject<Record>, attach all your queries to it (as appropriate IObservable<T> variables) and then hook up the data source. That will push all the data through the various aggregations for you, only reading from disk once.
While the exact details elude me without downloading the latest build myself, I blogged on this a couple of times: part 1; part 2. (Various features that I complained about being missing in part 1 were added :)
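As a minimal sketch of that Subject<Record> idea (using the current System.Reactive package, and assuming Duration is numeric as in your Average query):

using System;
using System.IO;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// In the same class as ReadRecordsFromStream.
void RunQueriesInOnePass(Stream stream)
{
    var records = new Subject<Record>();

    // Attach every aggregation before pushing any data.
    records.Average(r => r.Duration)
           .Subscribe(avg => Console.WriteLine("Average duration: {0}", avg));

    records.GroupBy(r => r.Time.Hour + ":" + r.Time.Minute)
           .SelectMany(g => g.Count().Select(c => new { Period = g.Key, Frequency = c }))
           .Subscribe(x => Console.WriteLine("{0}: {1}", x.Period, x.Frequency));

    // Single pass: every record is pushed through all the queries at once.
    foreach (var record in ReadRecordsFromStream(stream))
    {
        records.OnNext(record);
    }
    records.OnCompleted();   // the aggregates emit their results here
}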
I have done this before for logs of 3-10 MB per file. I haven't reached that file size, but I did try executing this against 1 GB+ of total log files without consuming much RAM. You may try what I did.
There's a technology that allows you to do this kind of thing. It's called a database :)