I have a graph of data that I'm pulling from an OAuth source using several REST calls and storing relationally in a database. The data structure ends up having about 5-10 tables with several one-to-many relationships. I'd like to periodically go and re-retrieve that information to see if updates are necessary in my database.
Since I'm going to be doing this for many users and their data will likely not change very often, my goal is to avoid loading my database unnecessarily. My strategy is to query the data from my OAuth provider, hash the results, and compare the hash to the last one I generated for the same dataset. If the hashes don't match, then I would simply start a transaction in the database, blow away all the data for that user, re-write the data, and close the transaction. This saves me the time of reading the data back out of the database and doing all the compare work to see what's changed, which rows were added, deleted, changed, etc.
So my question: if I glue all my data together in memory as a big string and use C#'s GetHashCode(), is that a fairly reliable mechanism to check whether my data has changed? Or are there better techniques for skinning this cat?
Thanks
Yes, that's a fairly reliable mechanism to detect changes. I do not know the probability of collisions in the GetHashCode() method, but I'd assume it to be safe.
Better methods: Can't the data have a version stamp or timestamp that is set every time something changes?
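If you want a fingerprint that is stable across processes and has a negligible collision probability, a cryptographic hash such as SHA-256 is a safer bet than GetHashCode(), which is only 32 bits and is not guaranteed to be stable across runs or .NET versions. A minimal sketch of the hash-and-compare idea, with illustrative names:

using System;
using System.Security.Cryptography;
using System.Text;

public static class ChangeDetector
{
    // Returns a stable fingerprint for the concatenated data. SHA-256 is 256 bits,
    // so accidental collisions are practically impossible, unlike GetHashCode(),
    // which may differ between runs or framework versions.
    public static string ComputeFingerprint(string concatenatedData)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(concatenatedData));
            return Convert.ToBase64String(hash);
        }
    }
}

// Usage: only rewrite the user's rows when the fingerprint differs from the stored one.
// if (ChangeDetector.ComputeFingerprint(newData) != lastStoredFingerprint) { /* rewrite */ }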
Hi experts :) I'm using WPF with SQL Server.
Problem 1: Lots of data gets created and must be saved to the DB every second, but at the same time multiple parts of the program write to the same tables. Saving to the DB every second is not efficient because DB calls are expensive. Do you disagree, and what should I do? I'm not sure what the best approach is; when would XML or text files be more useful?
Problem 2: I have to retrieve data from the same tables that problem 1 is saving to, so I can show it on live graphs. Would this cause read/write problems?
Loading a lot of data with one-by-one inserts is not a good idea. Take a look at SqlBulkCopy.
Databases handle concurrency very well; you can wrap the writes in a proper transaction so that readers only see the data once a complete write is done.
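A minimal SqlBulkCopy sketch, assuming the data for one interval has been collected into a DataTable whose columns match the destination table (the table name and connection string are illustrative):

using System.Data;
using System.Data.SqlClient;

public static class BulkSaver
{
    public static void BulkSave(DataTable batch, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                // The destination table must already exist; columns are matched
                // by ordinal unless ColumnMappings are set explicitly.
                bulkCopy.DestinationTableName = "dbo.Measurements";
                bulkCopy.WriteToServer(batch);
            }
        }
    }
}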
Considering that you have a time limit, you have to process the data in some way within 1 second.
I would suggest:
Problem 1: Save the data you generate in chunks pushed onto a Stack<..> (or a queue). Then, from another thread, process that Stack<..> and save each chunk to the DB until the Stack<..> is empty. There is no guarantee that you will be able to save the data to the DB within 1 second, but once it is in memory you can drain it as fast as the database allows (a rough sketch follows at the end of this answer).
Problem 2: Having the data already in memory, you can achieve the maximum possible performance while remaining within acceptable memory limits.
It's hard to suggest something really practical here, as performance is always strictly domain specific and cannot be described completely in a short question. But this solution can be taken as a basic guideline.
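A rough sketch of that producer/consumer idea, using a thread-safe BlockingCollection<T> instead of a plain Stack<..> so the writer thread can drain it safely (the chunk type and the save method are placeholders):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class DataChunk
{
    // the rows generated in the last interval (placeholder)
}

public class ChunkWriter
{
    private readonly BlockingCollection<DataChunk> _pending = new BlockingCollection<DataChunk>();

    // Producer side: called by the parts of the program that generate data.
    // This only touches memory, so it returns immediately.
    public void Enqueue(DataChunk chunk)
    {
        _pending.Add(chunk);
    }

    // Consumer side: a single background task drains the collection and writes to the DB,
    // so slow database calls never block the producers.
    public Task StartDraining()
    {
        return Task.Run(() =>
        {
            foreach (var chunk in _pending.GetConsumingEnumerable())
            {
                SaveChunkToDatabase(chunk); // e.g. with SqlBulkCopy, as suggested above
            }
        });
    }

    private void SaveChunkToDatabase(DataChunk chunk)
    {
        // bulk insert here
    }
}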
1. You can use caching in order to persist your data; you can save it with the Cache class.
Link : http://msdn.microsoft.com/en-us/library/system.web.caching.cache.add.aspx
2. You don't have a problem with the second scenario; you can use a Transaction in order to ensure that you only read committed data.
Link : http://msdn.microsoft.com/en-us/library/system.transactions.transaction.aspx
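The Cache class from the first link lives in System.Web; a small sketch of its use, assuming HttpRuntime.Cache is available in your process (the key and expiration values are illustrative):

using System;
using System.Web;
using System.Web.Caching;

public static class ReadingsCache
{
    public static void CacheLatest(object readings)
    {
        // Cache for 30 seconds; readers get the in-memory copy instead of
        // hitting the database on every request.
        HttpRuntime.Cache.Add(
            "latestReadings",               // cache key (illustrative)
            readings,
            null,                           // no CacheDependency
            DateTime.UtcNow.AddSeconds(30), // absolute expiration
            Cache.NoSlidingExpiration,
            CacheItemPriority.Normal,
            null);                          // no removal callback
    }
}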
Another option is to break up the table(s).
Are your other functions only writing to a subset of the fields of the records?
Then you could have a 1-to-1 mapping between two tables: one for the initial data, the other for functionA. It depends on how well you can partition the backend needs, but it can significantly reduce collisions.
The basic idea is to look at your tables more like objects. So you have a base table, and then, if it's a type 1 thingy, you add a 1-to-1 link to the relevant table. The business functions around type 1 only need to write to that table, never the base entity table.
A possibility anyway.
I have to ensure data integrity for a large database table. So, if a crafty admin manually changes the table (not via the UI), I want to be able to detect it.
My idea is to have an HMAC for each record and calculate an incremental HMAC for the table whenever a user changes it via the UI:
Calculate HMAC for first record - HMAC_Current.
Calculate HMAC for a new record - HMAC_i
Calculate new HMAC for the table as HMAC_Current = HMAC(HMAC_Current + HMAC_i).
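A minimal sketch of that chained calculation, assuming HMACSHA256 and that each record is serialized to a string before hashing (key handling and serialization are placeholders):

using System;
using System.Security.Cryptography;
using System.Text;

public class TableHmac
{
    private readonly byte[] _key;

    public TableHmac(byte[] key) { _key = key; }

    // HMAC_i for a single record.
    public byte[] HmacForRecord(string serializedRecord)
    {
        using (var hmac = new HMACSHA256(_key))
            return hmac.ComputeHash(Encoding.UTF8.GetBytes(serializedRecord));
    }

    // HMAC_Current = HMAC(HMAC_Current + HMAC_i), applied when a record is appended.
    public byte[] Chain(byte[] hmacCurrent, byte[] hmacRecord)
    {
        var combined = new byte[hmacCurrent.Length + hmacRecord.Length];
        Buffer.BlockCopy(hmacCurrent, 0, combined, 0, hmacCurrent.Length);
        Buffer.BlockCopy(hmacRecord, 0, combined, hmacCurrent.Length, hmacRecord.Length);
        using (var hmac = new HMACSHA256(_key))
            return hmac.ComputeHash(combined);
    }
}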
Pros:
There is no need to calculate the HMAC for the entire table each time a user adds a record via the UI.
Cons:
When a user deletes or changes a record, I have to recalculate the HMAC chain from that record to the end of the table.
When I want to check data integrity, I have to check the HMAC for each record, then calculate the HMAC for the entire table from top to bottom and compare it with HMAC_Current.
Is there a better way to do it?
I see a number of problems with this approach:
If your sysdba has access to all the data, what's stopping them from messing with the HMACs as well? E.g.: they revert all changes made to the table in the last month, then put back the HMAC from last month. Is data integrity "preserved" in this case?
What stops them from subverting the application to mess with the HMACs? E.g.: if they don't have access to the application, they could change the password for a user and access the application as that user to mess with records.
Even if you can get this to work, what's it good for? Say you find a HMAC mismatch. Now who do you hold responsible? An admin? A user? Data corruption?
The better solution is to use auditing. You can set up all kinds of auditing on Oracle and have the audits saved somewhere even the DBA can't touch. Additionally, there's a huge advantage in using auditing: you can know who changed what. With your scheme, you can't possibly know that.
You can even set up FGA (fine-grained auditing) so that it'll only audit specific columns and also know what the values were before and after a change, which isn't possible with standard auditing.
Reference: Configuring and Administering Auditing
Well, the first issue is that you don't trust your admins. If that's the case, why are they still there? Admins need full rights to production databases, so they must be trustworthy.
If the issue is that there are occasional disputes about who made changes, then set up audit tables with triggers. Trustworthy admins will not bypass the triggers (even though they can). Only admins should have delete rights to audit tables.
Audit tables are a requirement for most enterprise systems. If you did not set rights through stored procs, it is likely that many internal users have the rights they need to affect the database directly, which makes it easier for people to commit fraud. It may not be the admins at all who are affecting the data. Make sure you record who made the change and when, as well as the change itself.
SQL Server also has a way to audit structural changes to the DB. I don't know if Oracle does as well, but this is also a handy thing to audit.
Are triggers available in your solution? If so, you can Write Managed Triggers using C# and add any logic you want in that code.
This approach to 'integrity' is not really an approach to integrity - this is more like security patchwork.
So, first of all try to accomplish the same with better security model.
In your scenario, you have to calculate, store, and check the HMAC. If the check fails, you have to escalate.
If you set up your security properly (it is almost always possible to arrange things so that no admin needs direct write access to your tables), then you don't have to check.
Moving as much of your business logic as possible into the database will allow you to make stored procedures the only interface for changing the data, in which case integrity would be guaranteed.
I am trying to use MongoDB, C# and NoRM to work on some sample projects, but at this point I'm having a much harder time wrapping my head around the data model. With RDBMSs, related data is no problem. In MongoDB, however, I'm having a difficult time deciding what to do with these relationships.
Let's use StackOverflow as an example... I have no problem understanding that the majority of data on a question page should be included in one document. Title, question text, revisions, comments... all good in one document object.
Where I start to get hazy is on the question of user data like username, avatar, reputation (which changes especially often)... Do you denormalize and update thousands of document records every time there is a user change or do you somehow link the data together?
What is the most efficient way to accomplish a user relationship without causing tons of queries to happen on each page load? I noticed the DbReference<T> type in NoRM, but haven't found a great way to use it yet. What if I have nullable optional relationships?
Thanks for your insight!
The balance that I have found is using SQL as the normalized database and Mongo as the denormalized copy. I use an ESB to keep them in sync with each other. I use a concept that I call "prepared documents" and "stored documents". Stored documents are data that is only kept in Mongo. Useful for data that isn't relational. The prepared documents contain data that can be rebuilt using the data within the normalized database. They act as living caches in a way - they can be rebuilt from scratch if the data ever falls out of sync (in complicated documents this is an expensive process because these documents require many queries to be rebuilt). They can also be updated one field at a time. This is where the service bus comes in. It responds to events sent after the normalized database has been updated and then updates the relevant Mongo prepared documents.
Use each database for its strengths. Allow SQL to be the write database that ensures data integrity. Let Mongo be the read-only database that is blazing fast and can contain sub-documents, so you need fewer queries.
** EDIT **
I just re-read your question and realized what you were actually asking for. I'm leaving my original answer in case it's helpful at all.
The way I would handle the Stack Overflow example you gave is to store the user id in each comment. You would load up the post, which would have all of the comments in it. That's one query.
You would then traverse the comment data and pull out an array of user ids that you need to load. Then load those as a batch query (using the Q.In() query operator). That's two queries total. You would then need to merge the data together into a final form. There is a balance that you need to strike between when to do it like this and when to use something like an ESB to manually update each document. Use what works best for each individual scenario of your data structure.
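For what it's worth, a rough sketch of that two-query pattern using the official MongoDB C# driver rather than NoRM (the class, property, and collection names are illustrative):

using System.Collections.Generic;
using System.Linq;
using MongoDB.Bson;
using MongoDB.Driver;

public class Comment { public ObjectId UserId { get; set; } public string Text { get; set; } }
public class Post    { public ObjectId Id { get; set; } public List<Comment> Comments { get; set; } }
public class User    { public ObjectId Id { get; set; } public string Name { get; set; } public int Reputation { get; set; } }

public static class PostLoader
{
    public static (Post post, List<User> users) Load(IMongoDatabase db, ObjectId postId)
    {
        // Query 1: the post document, with its comments embedded.
        var post = db.GetCollection<Post>("posts")
                     .Find(p => p.Id == postId)
                     .First();

        // Query 2: batch-load the referenced users instead of one query per comment.
        var userIds = post.Comments.Select(c => c.UserId).Distinct().ToList();
        var users = db.GetCollection<User>("users")
                      .Find(Builders<User>.Filter.In(u => u.Id, userIds))
                      .ToList();

        // The caller merges the user data (name, reputation, avatar) into the view.
        return (post, users);
    }
}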
I think you need to strike a balance.
If I were you, I'd just reference the userid instead of their name/reputation in each post.
Unlike with an RDBMS, though, you would opt to have the comments embedded in the document.
Why do you want to avoid denormalization and updating 'thousands of document records'? MongoDB is designed for denormalization. Stack Overflow handles millions of pieces of data in the background, and some of it can be stale for a short period; that's okay.
So the main idea of the above is that you should have denormalized documents in order to display them quickly in the UI.
You can't query by referenced document, so either way you need denormalization.
I also suggest having a look at the CQRS and event sourcing architectures. They will allow you to update all of this data through a queue.
I need some input on how to design a database layer.
In my application I have a List of T. The information in T comes from multiple database tables.
There are of course multiple ways to do this.
Two ways that I can think of are:
Chatty database layer, but cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...) {
    list.Add(new SomeX() {
        prop1 = dataRow["someId1"],
        // one cache/DB lookup per row
        prop2 = GetSomeValueFromCacheOrDb(dataRow["someId2"])
    });
}
The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests, with all the network latency that entails.
Another problem is that the users could have been deleted after we got the list from the database but before we try to get them from the cache/DB, which means that we will have null problems, which we have to handle manually.
The good thing is that it's highly cacheable.
Non-chatty, but not cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...) {
    list.Add(new SomeX() {
        prop1 = dataRow["someId1"],
        // value already joined in by the query, no extra lookup
        prop2 = dataRow["someValue"]
    });
}
The problem that I see with the above is that it's hard to cache, since potentially every user has a unique list. The other problem is that it requires a lot of joins, which could result in a lot of reads against the database.
The good thing is that we know for sure that all the information exists after the query is run (inner join, etc.).
Not so chatty, but still cacheable:
A third option could be to first loop through the data rows and collect all of the necessary someId2 values, and then make one more database request to fetch them all at once.
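A rough sketch of that third option, assuming SomeX has an int prop1 and a string prop2, and that the related values live in a hypothetical SomeTable (all table, column, and method names here are illustrative):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

public class SomeX { public int prop1; public string prop2; } // as in the question; types assumed

public static class SomeXLoader
{
    // Builds the list with exactly one extra round trip for the related values.
    public static List<SomeX> Load(DataTable table, string connectionString)
    {
        // Pass 1: collect the distinct someId2 values from the rows already in memory.
        var ids = table.Rows.Cast<DataRow>()
                            .Select(r => (int)r["someId2"])
                            .Distinct()
                            .ToList();

        // Pass 2: one database request fetches all the related values at once.
        // (Building the IN list from the ids is acceptable here because they are ints;
        // otherwise use parameters or a table-valued parameter.)
        var values = new Dictionary<int, string>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Id, SomeValue FROM SomeTable WHERE Id IN (" + string.Join(",", ids) + ")",
            connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    values[reader.GetInt32(0)] = reader.GetString(1);
        }

        // Pass 3: build the list, resolving prop2 from the in-memory dictionary.
        return table.Rows.Cast<DataRow>()
                         .Select(r => new SomeX
                         {
                             prop1 = (int)r["someId1"],
                             prop2 = values[(int)r["someId2"]]
                         })
                         .ToList();
    }
}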
"The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that."
True. Could also create unnecessary contention and consume server resources maintaining locks as you iterate over a query.
"Another problem is that the users could have been deleted after we got the list from the database but before we are trying to get it from cache/db, which means that we will have null-problems."
If I take that quote, then this quote:
"The good thing is that it's highly cacheable."
Is not true, because you've cached stale data. So strike off the only advantage so far.
But to directly answer your question, the most efficient design, which seems to be what you are asking, is to use the database for what it is good for, enforcing ACID compliance and various constraints, most notably pk's and fk's, but also for returning aggregated answers to cut down on round trips and wasted cycles on the app side.
This means you either put SQL into your app code, which has been ruled to be Infinite Bad Taste by the Code Thought Police, or go to sprocs. Either one works. Putting the code into the App makes it more maintainable, but you'll never be invited to any more elegant OOP parties.
Some suggestions:
SQL is a set-based language, so don't design things around iterating in loops. Even with stored procedures, I still see cursors now and then where a set-based query would solve the issue. So always try to get the information with one query. Sometimes this isn't possible, but in the majority of cases it will be. You can also design views to make your querying easier if you have a schema with many tables, so the information that is needed can be pulled with one statement.
Use proxies. Let's say I have an object with 50 properties. At first you display a list of objects to the user. In this case, I would create a proxy of the most important properties and display that to the user, maybe two or three important ones like name, ID, etc. This cuts down on the amount of information sent initially. When the user actually wants to edit or change the object, make a second query to get the "full" object. Only get what you need. This is especially important over the web when serializing XML between the layers.
Come up with a paging strategy. Most systems work fine until they get a lot of data, and then the query comes to a halt because it is returning thousands of data rows/records. Page early and often. If you are doing a web application, paging directly in the database will probably be the most performant because only the paged data is sent between the layers (a small sketch of database-side paging follows these suggestions).
Data caching depends on the data. For highly volatile data (changing all the time) caching isn't worth it. But for semi-volatile or non-volatile data, caching can be worth it, although you have to manage the cache either directly or indirectly if you are using a built-in framework.
A good place to use a cache is, say, a zip codes table. Certainly, those don't change often, and you could cache them to boost performance if you had a zip code drop-down in your application. This is just an example, but caching IMO depends on the type of data.
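As a concrete illustration of the paging suggestion above, a minimal sketch that pages directly in the database using OFFSET/FETCH (this assumes SQL Server 2012 or later; the table and column names are illustrative):

using System.Collections.Generic;
using System.Data.SqlClient;

public static class CustomerPager
{
    // Pages in the database so only one page of rows crosses the wire.
    public static List<string> GetPage(string connectionString, int pageIndex, int pageSize)
    {
        var names = new List<string>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            @"SELECT Name
              FROM Customers
              ORDER BY Name
              OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY", connection))
        {
            command.Parameters.AddWithValue("@Offset", pageIndex * pageSize);
            command.Parameters.AddWithValue("@PageSize", pageSize);
            connection.Open();
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    names.Add(reader.GetString(0));
        }
        return names;
    }
}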
Imagine the following:
I have a table of 57,000 items that I regularly use in my application to figure out things like targeting groups, etc.
Instead of querying the database 300,000 times a day for a table whose data hardly ever changes, is there a way to store its information in my application and query the data in memory directly? Or do I have to create some sort of custom datatype for each row and iterate through, testing each row to check for the results I want?
After some googling, the closest thing I could find is an in-memory database.
thank you,
- theo
SQLite supports in-memory tables.
For 57,000 items that you will be querying against (and want immediately available), I would not recommend just simple caching. For that many items I'd recommend either a distributed memory cache (even if it's only one machine) such as Memcache, Velocity, etc., or going with your initial idea of using an in-memory database.
Also, if you use a full-fledged ORM such as NHibernate, you can hook it up to clients for the distributed caching tools with almost no work. Many of the major caching tools have NHibernate providers, including Memcache, Velocity, and some others. This might be a better solution, as it only caches the data that is actually being used rather than all the data that might be needed.
Read up on Caching
It sounds like this is application-level data rather than user-level, so you should look into "Caching Application Data".
Here are some samples of caching DataTables.
If you only need to find rows using the same key all the time, a simple Dictionary<Key, Value> could very well be all that you need. To me, 57,000 items doesn't really sound like that much unless each row contains a huge amount of data. However, if you need to search by different columns, an in-memory database is most likely the way to go.
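A minimal sketch of that Dictionary approach, loading the table once at startup and looking rows up by key afterwards (the Item type, query, and column names are illustrative):

using System.Collections.Generic;
using System.Data.SqlClient;

public class Item { public int Id; public string Name; public string TargetGroup; }

public static class ItemCache
{
    private static Dictionary<int, Item> _byId;

    // Load the 57,000 rows once (e.g. at application start) and keep them in memory.
    public static void Load(string connectionString)
    {
        _byId = new Dictionary<int, Item>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Id, Name, TargetGroup FROM Items", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
                while (reader.Read())
                    _byId[reader.GetInt32(0)] = new Item
                    {
                        Id = reader.GetInt32(0),
                        Name = reader.GetString(1),
                        TargetGroup = reader.GetString(2)
                    };
        }
    }

    // All 300,000 daily lookups hit memory instead of the database.
    public static Item Find(int id)
    {
        Item item;
        return _byId.TryGetValue(id, out item) ? item : null;
    }
}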