Performance issues using neo4jclient - C#

I'm currently exploring graph database potential for some processes in my industry.
I started with Neo4jClient a week ago, so I'm below the standard beginner :-)
I'm very excited about Neo4j, but I'm facing huge performance issues and I need help.
The first step in my project is to populate Neo4j from existing text files.
Those files are composed of lines formatted using a simple pattern:
StringID=StringLabel(String1,String2,...,StringN);
For example, if I consider the following line:
#126=TYPE1(#80,#125);
I would like to create one node with label "TYPE1", and 2 properties:
1) a unique ID using ObjectID: "#126" in above example
2) a string containing all parameters for future use: "#80,#125" in above example
I must consider that I will deal with multiple forward references, as in the example below:
#153=TYPE22('0BTBFw6f90Nfh9rP1dl_3P',#144,#6289,$);
The line defining the node with StringID "#6289" will be parsed later in the file.
So, to solve my file import problem, I've defined the following class:
public class myEntity
{
    public string propID { get; set; }
    public string propATTR { get; set; }

    public myEntity()
    {
    }
}
Because of the forward references in my text file (and, no doubt, my poor Neo4j knowledge...)
I've decided to work in 3 steps:
In a first loop, I extract strLABEL, strID and strATTRIBUTES from each line parsed from my file,
then I add one Neo4j node for each line using the following code:
strLabel = "(entity:" + strLABEL + " { propID: {newEntity}.propID })";
graphClient.Cypher
.Merge(strLabel)
.OnCreate()
.Set("entity = {newEntity}")
.WithParams(new {
newEntity = new {
propID = strID,
propATTR = strATTRIBUTES
}
})
.ExecuteWithoutResults();
Then I match all nodes created in Neo4j using the following code:
var queryNode = graphClient.Cypher
    .Match("(nodes)")
    .Return(nodes => new {
        NodeEntity = nodes.As<myEntity>(),
        Labels = nodes.Labels()
    });
Finally, I loop over all nodes, split each node's propATTR property, and add one relationship for each ObjectID found in propATTR using the following code:
graphClient.Cypher
    .Match("(myEnt1)", "(myEnt2)")
    .Where((myEntity myEnt1) => myEnt1.propID == strID)
    .AndWhere((myEntity myEnt2) => myEnt2.propID == matchAttr)
    .CreateUnique("myEnt1-[:INTOUCHWITH]->myEnt2")
    .ExecuteWithoutResults();
When I explore the database populated by that code using Cypher, the resulting nodes and relationships are the right ones, and Neo4j executes every query I've tested very fast.
It's very impressive and I'm convinced there is huge potential for Neo4j in my industry.
But my big issue today is the time required to populate the database (my config: Win8 x64, 32 GB RAM, SSD, Intel Core i7-3840QM 2.8 GHz):
For a small test case (6,400 lines),
it took me 13 s to create 6,373 nodes, and 94 s more to create 7,800 relationships.
On a real test case (40,000 lines),
it took me 496 s to create 38,898 nodes, and 3,701 s more to create 89,532 relationships (yes: more than one hour!).
I have no doubt such poor performance results directly from my poor neo4jclient knowledge.
It would be a tremendous help for me if the community could advise me on how to solve that bottleneck.
Thanks in advance for your help.
Best regards
Max

While I don't have the exact syntax in my head to write down for you, I would suggest you look at splitting your propATTR values when you read them initially, and storing them directly as an array/collection in Neo4j. This would hopefully then enable you to do your relationship creation in bulk within Neo4j, rather than iterating the nodes externally and executing so many sequential transactions.
The latter part might look something like:
MATCH (myEnt1),(myEnt2) WHERE myEnt1.propID IN myEnt2.propATTR
CREATE UNIQUE (myEnt1)-[:INTOUCHWITH]->(myEnt2)
Sorry my Cypher is a bit rusty, but the point is to try to transfer the load fully into the Neo4j engine, rather than the continual round-trips between your application logic and the Neo4j server. I suggest that it's probably these round trips that are killing your performance, and not so much the individual work involved in each transaction, so minimising the number of transactions would be the way to go.
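For what it's worth, a rough neo4jclient sketch of that idea (untested, and it assumes propATTR has been stored as a collection property rather than the comma-separated string used above; it also keeps the relationship direction from the original code, i.e. attribute owner to referenced node):

graphClient.Cypher
    .Match("(owner)", "(target)")
    .Where("target.propID IN owner.propATTR")          // only works if propATTR is a collection
    .CreateUnique("(owner)-[:INTOUCHWITH]->(target)")
    .ExecuteWithoutResults();

The whole relationship-creation step then becomes a single round-trip to the server.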

Related

LINQ Optimization for searching if an object exists in a list within a list

Currently I have 7,000 video entries, and I'm having a hard time optimizing the search for Tags and Actress.
This is the code I am trying to modify. I tried using HashSet; it is my first time using it, but I don't think I am doing it right.
Dictionary dictTag = JsonPairtoDictionary(tagsId,tagsName);
Dictionary dictActresss = JsonPairtoDictionary(actressId, actressName);
var listVid = new List<VideoItem>(db.VideoItems.ToList());
HashSet<VideoItem> lll = new HashSet<VideoItem>(listVid);
foreach (var tags in dictTag)
{
    lll = new HashSet<VideoItem>(lll.Where(q => q.Tags.Exists(p => p.Id == tags.Key)));
}
foreach (var actress in dictActresss)
{
    listVid = listVid.Where(q => q.Actress.Exists(p => p.Id == actress.Key)).ToList();
}
In the first part I get all the videos in the DB by using db.VideoItems.ToList().
Then it goes through a loop to check whether a tag exists.
Each VideoItem has a List<Tags>, and I use Exists to check whether a tag matches.
Then the same thing with Actress.
I am not sure if it's because I am in Debug mode and Application Insights is active, but it is slow. I also get around 10-15 events per second with baseType:RemoteDependencyData, which I am not sure means it is still connected to the database (it shouldn't be, since I should only be working with a new list of all videos) or something else.
After 7 minutes it is still processing, and that's the longest I have waited.
I am afraid to put this on my live site, since it will eat up my resources like candy.
Instead of optimizing the LINQ, you should optimize your database query.
Databases are great at optimized searches and creating subsets and will most likely be faster than anything you write. If you need to create a subset based on more than one database parameter, I would recommend looking into creating some indexes and using those.
Edit:
Example of a DB query that would eliminate the first for loop (which is actually multiple nested loops and is where the time delay comes from):
select * from videos where tag in [list of tags]
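In C# terms, a rough sketch only (assuming db is an Entity Framework context so the filter is translated to SQL rather than evaluated in memory, and assuming Tags is a mapped navigation collection):

// Hypothetical sketch: push the tag filter into the database instead of calling ToList() first.
// tagIds stands in for the keys of dictTag; Contains/Any are translated to SQL IN/EXISTS.
var tagIds = dictTag.Keys.ToList();
var listVid = db.VideoItems
    .Where(v => v.Tags.Any(t => tagIds.Contains(t.Id)))
    .ToList();

Like the SQL above, this matches videos having any of the tags; tighten it if you need all tags to match.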
Edit2
To make sure this is most efficient, require the database to index on the TAGS column. To create the index:
CREATE INDEX video_tags_idx ON videos (tag)
Use 'explain' to see if the index is being used automatically (it should be):
explain select * from videos where tag in [list of tags]
If it doesn't show your index as being used you can look up the syntax to force the use of it.
The problem was not the LINQ optimization itself but the under-utilization of Microsoft SQL through my ApplicationDbContext.
I realized this when I found http://www.albahari.com/nutshell/predicatebuilder.aspx
Because of the keyword search there can be multiple keywords, and the code I wrote above doesn't push the work down to SQL, which caused the long execution time.
Using the predicate builder, it is possible to create dynamic conditions in LINQ.
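As a hedged illustration only (assuming the PredicateBuilder from the page above, plus LinqKit's AsExpandable so Entity Framework can translate the composed expression), the dynamic keyword conditions might be built roughly like this:

// Sketch, not production code: build one OR-expression covering all selected tags,
// then let the database evaluate it in a single query.
var predicate = PredicateBuilder.False<VideoItem>();
foreach (var tag in dictTag)
{
    var id = tag.Key; // capture the loop variable for the closure
    predicate = predicate.Or(v => v.Tags.Any(t => t.Id == id));
}
var filtered = db.VideoItems.AsExpandable().Where(predicate).ToList();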

C#, Best way to loop through NHibernate objects without loading entire set into memory

I'm trying to loop through a large table and write the entries to a CSV file. If I load all objects into memory I get an OutOfMemoryException. My Employer class is mapped with Fluent NHibernate.
Here's what I've tried:
This loads all objects on the first iteration and crashes:
var myQuerable = DataProvider.GetEmployer(); // returns IQueryable
foreach (var emp in myQuerable)
{
// stuff...
}
No luck here:
var myEnumerator = myQuerable.GetEnumerator();
I thought this would work:
for (int i = 0; i <= myQuerable.Count(); i++)
{
    Employer e = myQuerable.ElementAt(i);
}
but am getting this exception:
Could not parse expression
'value(NHibernate.Linq.NhQueryable`1[MyProject.Model.Employer]).ElementAt(0)':
This overload of the method 'System.Linq.Queryable.ElementAt' is currently not supported
Am I missing something here? Is this even possible with nHibernate?
Thanks!
I don't think loading your entries one by one will fully solve your problem, as it heads in another bad direction: heavy load on the database side and longer response times for your C# method. I can't imagine how long it would take, since the OutOfMemoryException you're already getting indicates a huge number of records. The mechanism you should really use is pagination. There are various materials on the Internet about this topic, such as "NHibernate 3 paging and determining the total number of rows".
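A minimal paging sketch (assuming NHibernate's LINQ provider via session.Query<T>() and a stable ordering property such as an Id on Employer; adjust to your mapping):

// Sketch: read the table one page at a time so only pageSize entities are
// hydrated per round-trip; write each page to the CSV before fetching the next.
const int pageSize = 1000;
int page = 0;
IList<Employer> batch;
do
{
    batch = session.Query<Employer>()      // NHibernate.Linq
                   .OrderBy(e => e.Id)     // a stable order is required for paging
                   .Skip(page * pageSize)
                   .Take(pageSize)
                   .ToList();
    foreach (var emp in batch)
    {
        // write emp to the CSV...
    }
    session.Clear();                       // drop the first-level cache between pages
    page++;
} while (batch.Count == pageSize);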
Cheers!
Looks like I'm going to have to follow this article:
http://ayende.com/blog/4548/nhibernate-streaming-large-result-sets
or use straight ADO for performance.
Thanks for the help!

How to use MongoDB as unique/enumeration store

This seems to be like a common use case... but somehow I cannot get it working.
I'm attempting to use MongoDB as an enumeration store with unique items. I've created a collection with a byte[] Id (the unique ID) and a timestamp (a long, used for enumeration). The store is quite big (terabytes) and distributed among different servers. I am able to re-build the store from scratch currently, since I'm still in the testing phase.
What I want to do is two things:
Create a unique id for each item that I insert. This basically means that if I insert the same ID twice, MongoDB will detect this and give an error. This approach seems to work fine.
Continuously enumerate the store for new items by other processes. The approach I took was to add a second index to InsertID and used a high precision timestamp on this along with the server id and a counter (just to make it unique and ascending).
In the best scenario this would mean that the enumerator would keep track of an index cursor for every server. From what I've learned about MongoDB query processing, I expected this behavior. However, when I try to execute the code (below) it seems to take forever to get anything.
long lastid = 0;
while (true)
{
    DateTime first = DateTime.UtcNow;
    foreach (var item in collection.FindAllAs<ContentItem>().OrderBy((a) => (a.InsertId)).Take(100))
    {
        lastid = item.InsertId;
    }
    Console.WriteLine("Took {0:0.00} for 100", (DateTime.UtcNow - first).TotalSeconds);
}
I've read about cursors, but am unsure if they fulfill the requirements when new items are inserted into the store.
As I said, I'm not bound to any table structure or something like that... the only things that are important is that I can get new items over time and without getting duplicate items.
-Stefan.
Somehow I figured it out... more or less...
I created the query manually and ended up with something like this:
db.documents.find({ "InsertId" : { "$gt" : NumberLong("2020374866209304106") } }).limit(10).sort({ "InsertId" : 1 });
The LINQ query I put in the question doesn't generate this query. After some digging in the code I found that it should be this LINQ query:
foreach (var item in collection.AsQueryable().Where((a)=>(a.InsertId > lastid)).OrderBy((a) => (a.InsertId)).Take(100))
The AsQueryable() seems to be the key to execute the rewriting of LINQ to MongoDB queries.
This gives results, but still they appeared to be slow (4 secs for 10 results, 30 for 100). However, when I added 'explain()' I noticed '0 millis' in the query execution.
I stopped the process doing bulk inserts and tada, it works, and fast. In other words: the issues I was having were due to the locking behavior of MongoDB, and due to the way I interpreted the linq implementation. Since the former is the result of initial bulk-filling the data store, this means that the problem is solved.
On the 'negative' part of the solution: I would have preferred a solution that involved serializable cursors or something like that... this 'take' solution has to iterate the b-tree over and over again. If someone has an answer for this, please let me know.
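For reference, a minimal polling sketch stitching the two snippets above together (my own rough assumption of the surrounding loop, using the driver's AsQueryable() and a simple sleep when nothing new arrives; untested at scale):

// Sketch: tail the collection by InsertId, 100 documents at a time.
long lastid = 0;
while (true)
{
    var batch = collection.AsQueryable()
                          .Where(a => a.InsertId > lastid)
                          .OrderBy(a => a.InsertId)
                          .Take(100)
                          .ToList();
    foreach (var item in batch)
    {
        lastid = item.InsertId;   // remember the high-water mark
        // process item...
    }
    if (batch.Count == 0)
    {
        Thread.Sleep(500);        // back off briefly when no new items arrived
    }
}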
-Stefan.

LINQ (to objects), running several queries over the same IEnumerable?

Is it somehow possible to chain together several LINQ queries on the same IEnumerable?
Some background:
I have some files, 20-50 GB in size, that will not fit in memory. Some code parses messages from such a file and basically does:
public IEnumerable<Record> ReadRecordsFromStream(Stream inStream) {
    Record msg;
    while ((msg = ReadRecord(inStream)) != null) {
        yield return msg;
    }
}
This allows me to perform interesting queries on the records.
e.g. find the average duration of a Record
var records = ReadRecordsFromStream(stream);
var avg = records.Average(x => x.Duration);
Or perhaps the number of records per hour/minute
var x = from t in records
        group t by t.Time.Hour + ":" + t.Time.Minute into g
        select new { Period = g.Key, Frequency = g.Count() };
And there are a dozen or so more queries I'd like to run to pull relevant info out of these records. Some of the simple queries can certainly be combined into a single query, but this seems to get unmanageable quite fast.
Now, each time I run these queries, I have to read the file from the beginning again and reparse all the records; parsing a 20 GB file 20 times takes time and is a waste.
What can I do to make just one pass over the file, but run several LINQ queries against it?
You might want to consider using Reactive Extensions for this. It's been a while since I've used it, but you'd probably create a Subject<Record>, attach all your queries to it (as appropriate IObservable<T> variables) and then hook up the data source. That will push all the data through the various aggregations for you, only reading from disk once.
While the exact details elude me without downloading the latest build myself, I blogged on this a couple of times: part 1; part 2. (Various features that I complained about being missing in part 1 were added :)
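A rough sketch of that shape (my guess at the wiring, using Subject<Record> from System.Reactive; treat the details as illustrative rather than a definitive implementation, and note it assumes Duration is numeric as in the queries above):

using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// Push each parsed Record through a Subject so several aggregations
// run concurrently over a single pass of the file.
var records = new Subject<Record>();

// Query 1: average duration.
records.Average(r => r.Duration)
       .Subscribe(avg => Console.WriteLine("Average duration: {0}", avg));

// Query 2: number of records per hour:minute bucket.
records.GroupBy(r => r.Time.Hour + ":" + r.Time.Minute)
       .SelectMany(g => g.Count().Select(c => new { Period = g.Key, Frequency = c }))
       .Subscribe(x => Console.WriteLine("{0} -> {1}", x.Period, x.Frequency));

// One pass over the file feeds every attached query.
foreach (var msg in ReadRecordsFromStream(stream))
{
    records.OnNext(msg);
}
records.OnCompleted();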
I have done this before for logs of 3-10 MB per file. I haven't reached that file size, but I have tried executing this over 1 GB+ of total log files without consuming that much RAM. You may try what I did.
There's a technology that allows you to do this kind of thing. It's called a database :)

Querying Complex Data Structure in Memory

First time posting to a questions site, but I sort of have a complex problem I've been looking at for a few days.
Background
At work we're implementing a new billing system. However, we want to take the unprecedented step of actually auditing the new billing system against the old one, which is significantly more robust, on an ongoing basis. The reason is that the new billing system is a lot more flexible for our new rate plans, so marketing is really pushing us to get this new billing system in place.
We had our IT group develop a report, for a ridiculous amount of money, that runs at 8 AM each morning on yesterday's data, compares records to find byte-count discrepancies, and generates the report. This isn't very useful for us since, for one, it runs the next day, and secondly, if it shows bad results, we don't have any indication of why we may have had a problem the day before.
So we want to build our own system, that hooks into any possible data source (at first only the new and old systems User Data Records (UDR)) and compares the results in near real-time.
Just some notes on the scale, each billing system produces roughly 6 million records / day at a total file size of about 1 gig.
My Proposed set-up
Essentially, buy some servers; we have budget for several 8-core / 32 GB RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer (it would be written out to persistent storage), with aggregate statistics stored in a database.
Each record essentially contains a record-id from the platform, correlation-id, username, login-time, duration, bytes-in, bytes-out, and a few other fields.
I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a self-balancing binary search tree on the username. The next level would be something like a skip list based on date, so we would have next matched_record, next day, next hour, next month, next year, etc. Finally, we would have our matched-record object, essentially just a holder which references the udr_record object from system A and the udr_record object from system B.
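A bare-bones sketch of those types (the names are my own placeholders, and the date-based skip list is reduced to a sorted dictionary keyed by timestamp just to show the shape):

using System;
using System.Collections.Generic;

// Placeholder types only, to illustrate the proposed structure.
enum Platform { A, B }

class UdrRecord
{
    public string RecordId { get; set; }
    public string CorrelationId { get; set; }
    public string Username { get; set; }
    public DateTime LoginTime { get; set; }
    public long Duration { get; set; }
    public long BytesIn { get; set; }
    public long BytesOut { get; set; }
    public Platform Source { get; set; }
}

// Holder that pairs the two platforms' versions of the same event.
class MatchedRecord
{
    public UdrRecord SystemA { get; set; }
    public UdrRecord SystemB { get; set; }
}

class UserNode
{
    // Stand-in for the date-based layer: matched records ordered by time.
    public SortedDictionary<DateTime, MatchedRecord> Records =
        new SortedDictionary<DateTime, MatchedRecord>();
}

// Stand-in for the self-balancing tree on username (SortedDictionary is a
// red-black tree internally).
var users = new SortedDictionary<string, UserNode>();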
I'd run a number of internal analytics as data is added to see whether the new billing system has choked or started having large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.
Problem
The problem I have is that aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query for, say, the top contributors to an alarm, see which records contributed to the discrepancy, and dig in and investigate. Originally, I wanted to use a syntax similar to a Wireshark filter, with some SQL mixed in.
Example:
udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepency_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I'm not 100% up to speed on the .NET 3.5 stuff. Also, I'm not sure whether it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach getting a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.
I am fully prepared to hard-code many of the possible queries and just have them as reports that are run with some parameters, but if there is a nice, clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.
Actually, for the above type of query, the dynamic LINQ stuff is quite a good fit. Otherwise you'll have to write pretty much the same thing anyway: a parser, and a mechanism for mapping it to attributes. Unfortunately it isn't an exact hit, since you need to split things like OrderBy, and dates need to be parameterized, but here's a working example:
// Requires the DynamicQueryable extensions from the Dynamic LINQ sample (System.Linq.Dynamic).
using System;
using System.Linq;
using System.Linq.Dynamic;

class Udr { // formatted for space
    public int BytesIn { get; set; }
    public UdrAnalysis Analysis { get; set; }
    public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
    public int Discrepency { get; set; }
    public int DiscrepencyPercent { get; set; }
}
static class Program {
    static void Main() {
        Udr[] data = new[] {
            new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
                      Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 }},
            new Udr { BytesIn = 500, StartedDate = DateTime.Today,
                      Analysis = new UdrAnalysis { Discrepency = 50000, DiscrepencyPercent = 130 }}
        };
        DateTime when = DateTime.Parse("2008-11-10 22:00:44");
        var query = data.AsQueryable().Where(
                @"bytesin > 1000 && (analysis.discrepency > 100000
                    || analysis.discrepencypercent > 100)
                    && starteddate > @0", when)
            .OrderBy("analysis.discrepency DESC")
            .Take(10);
        foreach (var item in query) {
            Console.WriteLine(item.BytesIn);
        }
    }
}
Of course, you could take the dynamic LINQ sample and customize the parser to do more of what you need...
Whether you use DLINQ or not, I suspect that you'll want to use LINQ somewhere in the solution, because it provides so many bits of what you want.
How much protection do you need from your users, and how technical are they? If this is only for a few very technical internal staff (e.g. who are already developers) then you could just let them write a C# expression and then use CSharpCodeProvider to compile the code - then apply it on your data.
Obviously this requires your users to be able to write C# - or at least just enough of it for a query expression - and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you don't really want huge amounts of data crossing an AppDomain boundary.)
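A hedged sketch of that approach (the Udr type, the UserFilter wrapper and the method names are scaffolding I've invented for illustration; the CodeDOM calls themselves are standard):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

static class UserFilterCompiler
{
    // Wrap the user's expression in a static method, compile it in memory,
    // and hand back a delegate that can be applied to the in-memory records.
    public static Func<Udr, bool> Compile(string userExpression)
    {
        string source = @"
            public static class UserFilter
            {
                public static bool Matches(" + typeof(Udr).FullName + @" udr)
                {
                    return " + userExpression + @";
                }
            }";

        var parameters = new CompilerParameters { GenerateInMemory = true };
        parameters.ReferencedAssemblies.Add("System.Core.dll");
        parameters.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);

        var results = new CSharpCodeProvider().CompileAssemblyFromSource(parameters, source);
        if (results.Errors.HasErrors)
            throw new InvalidOperationException(results.Errors[0].ErrorText);

        var method = results.CompiledAssembly.GetType("UserFilter").GetMethod("Matches");
        return (Func<Udr, bool>)Delegate.CreateDelegate(typeof(Func<Udr, bool>), method);
    }
}

// Usage: var filter = UserFilterCompiler.Compile("udr.BytesIn > 1000");
//        var hits = allRecords.Where(filter);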
On the subject of LINQ in general - again, a good fit due to your size issues:
Just some notes on the scale, each billing system produces roughly 6 million records / day at a total file size of about 1 gig.
LINQ can be used fully with streaming solutions. For example, your "source" could be a file reader. The Where would then iterate over the data checking individual rows without having to buffer the entire thing in memory:
static IEnumerable<Foo> ReadFoos(string path) {
    return from line in ReadLines(path)
           let parts = line.Split('|')
           select new Foo { Name = parts[0],
                            Size = int.Parse(parts[1]) };
}
static IEnumerable<string> ReadLines(string path) {
    using (var reader = File.OpenText(path)) {
        string line;
        while ((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
This is now lazy loading... we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
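For instance (hypothetical file path; Where(string) here is the extension from the dynamic LINQ sample shown above):

// Still streaming: each line is parsed, tested and discarded as we go.
var bigFoos = ReadFoos("data.txt")
    .AsQueryable()
    .Where("Size > 1000");

foreach (var foo in bigFoos)
{
    Console.WriteLine("{0}: {1}", foo.Name, foo.Size);
}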
If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; this works particularly well if you need to group data, since it doesn't buffer everything.
Finally - if you want binary storage, serializers like protobuf-net can be used to create streaming solutions. At the moment it works best with the "push" approach of Push LINQ, but I expect I could invert it for regular IEnumerable<T> if needed.
