Testing your databases via C# tests

I've been assigned the task of testing our database. It's a test database, so we can do anything we want to it and easily roll it back. I've been given this task because we're still more or less in the design phase, meaning any change can happen at any point in the project (renaming the Person.FirstName column to [First_Name] and later renaming it to [First Name], for example). My goal is to establish a rough estimate of what kind of pain we're walking into when we make changes, so we can plan for it ahead of time. We can expect these kinds of changes during production too.
The items I have on my list are and have written tests for:
Send in the word null (not a literal null, but the string "null"), because with dynamic SQL it can flip out thinking you really mean NULL. We found this out when someone with the last name "null" caused an exception to be thrown.
Send in single quotes, because dynamic SQL doesn't like single quotes. Again, someone with one in their name caused a crash.
Having never done this before, that's about all I know that can crash. Any other ideas out there? We're trying to emulate data a user may enter.
edit 1: Our problem is we have a search screen with roughly 25 fields a user can search by. Some of these search fields are simple (e.g. first name); some are less simple (category 1 with a date less than 2, but also category 2 with a date greater than 2, OR category 4 at any period of time). The search screen allows a user to select different operators and predicates for each of these 25 fields. Is there a better way to handle this than dynamic SQL? I'm in a position, and at a point in time, where we can change to something different if it's better.
edit 2: I don't know if it's worth mentioning, but we use LINQ to access the stored procedures. My research has shown that dynamic LINQ won't do what we need it to do like a dynamic SQL query will. May be wrong though.

Does "'; Drop Table Person; --" cause a crash too?
You should really consider moving your strategy away from dynamic SQL to parameterized queries to avoid SQL injection techniques.
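For illustration, here's a minimal parameterized sketch using ADO.NET (the connection string, table, and column names are just placeholders for whatever your schema actually uses):

using System.Data.SqlClient;

// The search value travels as a parameter, so names such as O'Brien or
// the literal string "null" can't break or alter the statement.
static void FindByLastName(string connectionString, string lastName)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT PersonId, FirstName, LastName FROM Person WHERE LastName = @lastName",
        conn))
    {
        cmd.Parameters.AddWithValue("@lastName", lastName);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // consume reader["FirstName"], etc.
            }
        }
    }
}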
As for the C# testing of your database, you can use transaction-wrapped queries and NUnit to do unit testing, of a fashion. Strictly speaking, unit testing is supposed to separate your application from the data store so that the component parts can be tested without the performance penalties of accessing and modifying the data store. However, you can use very similar techniques to test your data store if that's what you decide. Create the transaction in the TestFixtureSetUp and roll it back in the TestFixtureTearDown; that way your database will be back to its original state when your testing is complete.
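Here's a rough sketch of that pattern, assuming NUnit and System.Transactions (the fixture-level attributes are OneTimeSetUp/OneTimeTearDown in NUnit 3; older versions call them TestFixtureSetUp/TestFixtureTearDown):

using System.Transactions;
using NUnit.Framework;

[TestFixture]
public class PersonSearchTests
{
    private TransactionScope _scope;

    [OneTimeSetUp]
    public void BeginTransaction()
    {
        _scope = new TransactionScope();
    }

    [OneTimeTearDown]
    public void RollBackTransaction()
    {
        // Disposing without calling Complete() rolls everything back,
        // returning the database to its original state.
        _scope.Dispose();
    }

    [Test]
    public void SearchSurvivesLastNameOfNull()
    {
        // e.g. insert a person whose last name is the string "null",
        // run the search, and assert no exception is thrown
    }
}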
You should certainly be aware that there is a performance penalty when testing data stores in this manner though. Your unit tests won't perform like the rest of your application - assuming the rest of your application is performance tuned.


InsertedDate, set Default Value in DB or Set in application?

I am writing a new application and I am in the design phase. A few of the db tables need an InsertedDate field which records when the record was inserted.
I am thinking of setting the Default Value to GetDate()
Is there an advantage to doing this in the application over setting the default value in the database?
I think it's better to set the Default Value to GetDate() in SQL Server rather than in your application. You can use it to order the table by insertion time, and setting it from the application looks like needless overhead, unless you want to specify some particular date with the insert, which I believe defeats the purpose.
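To illustrate (SQL Server assumed; the table, column, and constraint names here are hypothetical), the default lives in the database, and any INSERT that omits the column picks it up:

using System.Data.SqlClient;

static void AddInsertedDateDefault(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // One-time DDL: every INSERT that omits InsertedDate now gets
        // the server's current time.
        new SqlCommand(
            @"ALTER TABLE Person
              ADD CONSTRAINT DF_Person_InsertedDate
              DEFAULT GETDATE() FOR InsertedDate;", conn).ExecuteNonQuery();

        // This insert never mentions InsertedDate; the default fills it in.
        var insert = new SqlCommand(
            "INSERT INTO Person (FirstName) VALUES (@name)", conn);
        insert.Parameters.AddWithValue("@name", "Alice");
        insert.ExecuteNonQuery();
    }
}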
If you're setting the default value in your application, then whenever you manually INSERT a record into the database you'll need to remember to set this field yourself to avoid NULLs.
Personally, I prefer to set default values in the database where possible, but others might have differing opinions on this.
If you do it in your application, you can unit test it. In the projects I've been on, especially when using an ORM, we do all default operations in the code.
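As a sketch of what that looks like (the entity and property names here are hypothetical), the default sits in the constructor, where a plain unit test can reach it without a database:

using System;

public class Person
{
    public Person()
    {
        // Application-owned default: trivially assertable in a unit test.
        InsertedDate = DateTime.UtcNow;
    }

    public DateTime InsertedDate { get; set; }
}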
When designing, I always put a lot of importance on separation of concern, which for me, in the context of "database functionality vs application functionality", boils down to the question: "Who owns the data?". I have always been of the opinion that my code owns my data - never the database. The database is simply the container for data. This is similar to saying that I own my clothes, rather than my dresser owning my clothes. My dresser performs an important function, making my clothes available in an organized fashion, but I am always the agent putting clothes in the dresser, and I am responsible for their organization.
I'm sure many will have a problem with this analogy, saying that modern databases are much more powerful than my dresser, but in my experience the more functionality I put in the database layer, the more confusing projects get and the more blurred the line between data and functionality (e.g. database stored procedures, etc). Admittedly, yours is a simple example of this concept, but once a precedent is set, anything goes.
Another thing I'd like to address is the ease-of-use factor. I reject the idea that I should choose a particular implementation just because it's convenient (avoiding nulls, differing server times, etc.). To me, choosing such implementations is equivalent to saying: "It's alright if my code doesn't work. I'll avoid using my code rather than fixing it and making it robust."
I'm sure there are many cases, perhaps at extreme scale or due to other business requirements, when database-layer functionality is not only warranted but necessary, but my experience tells me the more you can keep your functionality in your code, the cleaner, simpler, and more robust your application will be.

How is using Entity + LINQ not just essentially hard coding my queries?

So I've been developing with Entity + LINQ for a bit now and I'm really starting to wonder about best practices. I'm used to the model of "if I need to get data, reference a stored procedure". Stored procedures can be changed on the fly if needed and don't require code recompiling. I'm finding that my queries in my code are looking like this:
List<int> intList = (from query in context.DBTable
                     where query.ForeignKeyId == fkIdToSearchFor
                     select query.ID).ToList();
and I'm starting to wonder what the difference is between that and this:
List<int> intList = SomeMgrThatDoesSQLExecute.GetResults(
    string.Format(@"SELECT [ID]
                    FROM DBTable
                    WHERE ForeignKeyId = {0}",
                  fkIdToSearchFor));
My concern is that I'm essentially hard-coding the query into the code. Am I missing something? Is that the point of Entity? If I need to do any real query work, should I put it in a sproc?
The power of Linq doesn't really make itself apparent until you need more complex queries.
In your example, think about what you would need to do if you wanted to apply a filter of some kind to your query. Using string-built SQL, you would have to append values with a string builder while protecting against SQL injection (or go through the additional effort of preparing the statement with parameters).
Let's say you wanted to add a condition to your LINQ query:
IQueryable<Table> results = from query in context.Table
                            where query.ForeignKeyId == fldToSearchFor
                            select query;
You can take that and make it:
results = results.Where(r => r.Value > 5);
The resulting SQL would look like (with the filter value bound as a parameter):
SELECT * FROM Table WHERE ForeignKeyId = @fldToSearchFor AND Value > 5
Until you enumerate the result set, any extra conditions you want to decorate will get added in for you, resulting in a much more flexible and powerful way to make dynamic queries. I use a method like this to provide filters on a list.
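A sketch of that pattern (the context, entity, and criteria types here are hypothetical stand-ins); nothing touches the database until the caller enumerates the result:

using System.Linq;

static IQueryable<Table> BuildSearch(MyDataContext context, SearchCriteria criteria)
{
    IQueryable<Table> query = context.Table;

    // Each optional criterion decorates the query only when supplied.
    if (criteria.ForeignKeyId.HasValue)
        query = query.Where(t => t.ForeignKeyId == criteria.ForeignKeyId.Value);

    if (criteria.MinValue.HasValue)
        query = query.Where(t => t.Value > criteria.MinValue.Value);

    // Still no SQL sent; enumeration (e.g. ToList) issues a single
    // parameterized statement containing only the applied filters.
    return query;
}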
I personally avoid hard-coding SQL requests (as in your second example). Writing LINQ instead of raw SQL gives you:
ease of use (Intellisense, type checking...)
the power of the LINQ language (which is often simpler than SQL once there's some complexity, multiple joins, etc.)
the power of anonymous types
errors surfacing right away at compile time, not at runtime two months later...
easier refactoring if you want to rename a table/column/... (you won't forget to rename anything with LINQ because of the compile-time checks)
loose coupling between your requests and your database (what if you move from Oracle to SQL Server? With LINQ you won't change your code; with hardcoded requests you'll have to review all of them)
LINQ vs stored procedures: you put the logic in your code, not in your database. See discussion here.
if I need to get data, reference a stored procedure. Stored procedures
can be changed on the fly if needed and don't require code recompiling
-> if you need to update your model, you'll probably also have to update your code to take the DB change into account, so I don't think it'll help you avoid a recompilation most of the time.
Is LINQ hard-coding all your queries into your application? Yes, absolutely.
Let's consider what this means to your application.
If you want to make a change to how you obtain some data, you must make a change to your compiled code; you can't make a "hotfix" to your database.
But, if you're changing a query because of a change in your data model, you're probably going to have to change your domain model to accommodate the change.
Let's assume your model hasn't changed and your query is changing because you need to supply more information to the query to get the right result. This kind of change most certainly requires that you change your application to allow the use of the new parameter to add additional filtering to the query.
Again, let's assume you're happy to use a default value for the new parameter and the application doesn't need to specify it. The query might now include a field as part of the result. You don't have to consume this additional field, though, and can ignore the extra information being sent over the wire. But it has introduced a bit of a maintenance problem, in that your SQL is out of step with your application's interpretation of it.
In this very specific scenario where you either aren't making an outward change to the query, or your application ignores the changes, you gain the ability to deploy your SQL-only change without having to touch the application or bring it down for any amount of time (or if you're into desktops, deploy a new version).
Realistically, when it comes to making changes to a system, the majority of your time is going to be spent designing and testing your queries, not deploying them (and if it isn't, then you're in a scary place). The benefit of having your query in LINQ is how much easier it is to write and test them in isolation of other factors, as unit tests or part of other processes.
The only real reason to use Stored Procedures over LINQ is if you want to share your database between several systems using a consistent API at the SQL-layer. It's a pretty horrid situation, and I would prefer to develop a service-layer over the top of the SQL database to get away from this design.
Yes, if you're good at SQL, you can get all that with stored procs, and benefit from better performance and some maintenance benefits.
On the other hand, LINQ is type-safe, slightly easier to use (as developers are accustomed to it from non-db scenarios), and can be used with different providers (it translates to provider-specific code). Anything that implements IQueryable can be queried the same way with LINQ.
Additionally, you can pass partially constructed queries around, and they will be lazy evaluated only when needed.
So, yes, you are hard coding them, but, essentially, it's your program's logic, and it's hard coded like any other part of your source code.
I also wondered about that, but the mentality that database interactivity belongs only in the database tier is an outmoded concept. Having a dedicated DBA churn out stored procedures while multiple front-end developers wait is truly an inefficient use of development dollars.
Java shops hold to this paradigm because they do not have a tool such as LINQ, and so hold to it out of necessity. .NET shops are finding that the need for a database person is lessened, if not fully removed, because once the structure is in place, the data can be manipulated more efficiently (as shown by all the other responders to this post) in the data layer using LINQ than by raw SQL.
If one has the chance to use code-first and create a database model from the ground up, originating the entity classes in code and then letting EF do the magic of actually creating the database, one understands how the database is truly just a tool that stores data, and that stored procedures and a dedicated DBA are not needed.

Fresh data vs. locally cached data in LINQ to SQL

Okay, so I have a LINQ to SQL system set up on a WCF service. My application contains a reference to this service which it uses to collect data from an SQL database. I use a DataContext object which was generated by SQLMetal.exe.
I have two entity collections in my DataContext object, Clients and Groups. Each Client contains a field which says which Groups it belongs to (a comma-separated list of Group IDs).
In the application, I have a table of Clients. If I select one and click a button, a second table shows details of the Groups that Client is a part of.
Here's the question: When I click this button, do I go to the database for the Groups each time, or should I load the Groups when the application starts and sift through those? The latter would be quicker, but I want a concurrent solution.
The second question (I know there shouldn't be two really, but I just realised I might be confused on this issue): when I run a LINQ query on a collection in my DataContext object, am I getting the latest database data?
Thanks.
For your second question: yes, each query against LINQ to SQL results in a SQL statement issued to the backing database. To clarify further, this happens each time you attempt to enumerate a LINQ statement; I don't mean to imply that every LINQ statement is sent to the database immediately, which of course it isn't.
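A small sketch of the distinction, assuming the SQLMetal-generated context and Clients table from the question (the Name field is assumed for illustration):

// Defining the query sends nothing to the database...
var query = from c in context.Clients
            where c.Name.StartsWith("A")
            select c;

// ...the SQL is issued here, at enumeration, so you see the database's
// state as of this call, not as of the query's definition.
var clients = query.ToList();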
The first part on caching vs. querying each time is dependent upon other factors. Is it necessary? Meaning, is there a performance hit you're trying to correct? Also how "stale" can the data afford to be before it becomes a concern to your users? Those are question you'd need to take back to the application owners to decide.
As with most real world performance questions... it depends. The best way to tell is to write your application with what 'feels right' and then if performance is a concern, measure and change accordingly.
Yes, you'll be getting the latest database information.
Unless loading the groups takes a significant amount of time, don't cache.
Premature optimisation is never a good idea.
Just bear in mind that you might want to later; if you make sure your collection of groups is nicely decoupled, optimising when you have to will be relatively simple.

database, requests, performance, cache

I need some input on how to design a database layer.
In my application I have a List of T. Each T holds information from multiple database tables.
There are of course multiple ways to do this.
Two ways that I can think of are:
chatty database layer and cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...)
{
    list.Add(new SomeX()
    {
        prop1 = dataRow["someId1"],
        prop2 = GetSomeValueFromCacheOrDb(dataRow["someId2"])
    });
}
The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that.
Another problem is that the users could have been deleted after we got the list from the database but before we try to get them from the cache/db, which means we will have null problems that we have to handle manually.
The good thing is that it's highly cacheable.
not chatty, but not cacheable:
List<SomeX> list = new List<SomeX>();
foreach (...)
{
    list.Add(new SomeX()
    {
        prop1 = dataRow["someId1"],
        prop2 = dataRow["someValue"]
    });
}
The problem that I see with the above is that it's hard to cache, since potentially all users have unique lists. The other problem is that it requires a lot of joins, which could result in a lot of reads against the database.
The good thing is that we know for sure that all information exists after the query is run (inner join etc)
not so chatty, but still cacheable:
A third option could be to first loop through the data rows, collect all the necessary someId2 values, and then make one more database request to get them all.
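For instance (a hedged sketch; LINQ to SQL and all the names here are illustrative), collect the ids first and resolve them in one round trip, since Contains translates to a SQL IN clause:

using System.Data;
using System.Linq;

// Gather every someId2 from the rows already in hand...
var ids = table.Rows.Cast<DataRow>()
                    .Select(r => (int)r["someId2"])
                    .Distinct()
                    .ToList();

// ...then one query resolves them all (Contains becomes IN (...)).
var valuesById = context.SomeLookup
                        .Where(x => ids.Contains(x.Id))
                        .ToDictionary(x => x.Id, x => x.Value);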
"The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that."
True. Could also create unnecessary contention and consume server resources maintaining locks as you iterate over a query.
"Another problem is that the users could have been deleted after we got the list from the database but before we are trying to get it from cache/db, which means that we will have null-problems."
If I take that quote, then this quote:
"The good thing is that it's highly cacheable."
Is not true, because you've cached stale data. So strike off the only advantage so far.
But to directly answer your question, the most efficient design, which seems to be what you are asking, is to use the database for what it is good for, enforcing ACID compliance and various constraints, most notably pk's and fk's, but also for returning aggregated answers to cut down on round trips and wasted cycles on the app side.
This means you either put SQL into your app code, which has been ruled to be Infinite Bad Taste by the Code Thought Police, or go to sprocs. Either one works. Putting the code into the App makes it more maintainable, but you'll never be invited to any more elegant OOP parties.
Some suggestions:
SQL is a set-based language, so don't design things around iterating in loops. Even with stored procedures, you still see cursors now and then when a set-based query would solve the issue. So always try to get the information with one query. Sometimes this isn't possible, but in the majority of cases it will be. You can also design views to make your querying easier if you have a schema with many tables and need to pull the information with one statement.
Use proxies. Say I have an object with 50 properties. At first you display a list of objects to the user; in this case, create a proxy of the most important properties and display only those, maybe two or three important ones like name, ID, etc. This cuts down on the amount of information sent initially. When the user actually wants to edit or change the object, make a second query to get the "full" object. Only get what you need. This is especially important over the web, where you're serializing XML between the layers.
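For example, a LINQ projection along these lines (entity and property names are hypothetical) puts only the projected columns into the generated SELECT:

using System.Linq;

// The list view gets a lightweight "proxy" with just two columns...
var summaries = context.People
                       .Select(p => new { p.Id, p.Name })
                       .ToList();

// ...and the full 50-property entity is fetched only on demand.
var full = context.People.Single(p => p.Id == selectedId);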
Come up with a paging strategy. Most systems work fine until they accumulate a lot of data, and then queries grind to a halt because they return thousands of rows. Page early and often. In a web application, paging directly in the database will probably be the most performant approach, because only the paged data is sent between the layers.
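A sketch of database-side paging with LINQ's Skip/Take (names hypothetical; the provider translates this into a paged SQL statement):

using System.Linq;

// Only one page of rows crosses the wire; a stable ORDER BY is needed
// for Skip/Take to page deterministically.
var page = context.Orders
                  .OrderBy(o => o.Id)
                  .Skip(pageIndex * pageSize)
                  .Take(pageSize)
                  .ToList();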
Data caching depends on the data. For highly volatile data (changing all the time), caching isn't worth it. But for semi-volatile or non-volatile data, caching can be worth it, though you have to manage the cache either directly or indirectly if you are using a built-in framework.
A good place to use a cache: say you have a zip codes table. Certainly those don't change often, and you could cache them to boost performance if you had a zip code drop-down in your application. This is just an example, but caching IMO depends on the type of data.
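A minimal sketch of such a lookup cache (all names hypothetical; no invalidation, which is usually acceptable for data this static, and a lock or Lazy<T> would be wanted in multithreaded code):

using System.Collections.Generic;
using System.Linq;

public static class ZipCodeCache
{
    private static List<ZipCode> _zipCodes;

    public static List<ZipCode> Get(MyDataContext context)
    {
        // One database hit on first use; later calls are memory-only.
        if (_zipCodes == null)
            _zipCodes = context.ZipCodes.ToList();
        return _zipCodes;
    }
}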

How to prevent Sql-Injection on User-Generated Sql Queries

I have a project (private ASP.NET website, password protected, HTTPS) where one of the requirements is that the user be able to enter SQL queries that directly query the database. I need to allow these queries while preventing them from damaging the database itself and from accessing or updating data they shouldn't be able to access/update.
I have come up with the following rules for implementation:
Use a db user that only has permission for Select Table/View and Update Table (thus any other commands like drop/alter/truncate/insert/delete will just not run).
Verify that the statement begins with the words "Select" or "Update"
Verify (using Regex) that there are no instances of semi-colons in the statement that are not surrounded by single-quotes, white space and letters. (The thought here is that the only way that they could include a second query would be to end the first with a semi-colon that is not part of an input string).
Verify (using Regex) that the user has permission to access the tables being queried/updated, including those in joins, etc. This includes any subqueries. (Part of the way this will be accomplished is that the user will be using a set of table names that do not actually exist in the database; part of the query parsing will be to substitute in the correct corresponding table names.)
Am I missing anything?
The goal is that the users be able to query/update tables to which they have access in any way that they see fit, and to prevent any accidental or malicious attempts to damage the db. (And since a requirement is that the user generate the sql, I have no way to parametrize the query or sanitize it using any built-in tools that I know of).
This is a bad idea, and not just from an injection-prevention perspective. It's really easy for a user that doesn't know any better to accidentally run a query that will hog all your database resources (memory/cpu), effectively resulting in a denial of service attack.
If you must allow this, it's best to keep a completely separate server for these queries, and use replication to keep it pretty close to an exact mirror of your production system. Of course, that won't work with your UPDATE requirement.
But I want to say again: this just won't work. You can't protect your database if users can run ad hoc queries.
What about this? Just imagine the SELECT is wrapped in an EXEC (the hex below decodes to drop table a):
select convert(varchar(50),0x64726F70207461626C652061)
My gut reaction is that you should focus on setting the account privileges and grants as tightly as possible. Look at your RDBMS security documentation thoroughly, there may well be features you are not familiar with that would prove helpful (e.g. Oracle's Virtual Private Database, I believe, may be useful in this kind of scenario).
In particular, your idea to "Verify (using Regex) that the user has permission to access the tables being queried/updated, included in joins, etc." sounds like you would be trying to re-implement security functionality already built into the database.
Well, you already have enough people telling you "dont' do this", so if they aren't able to dissuade you, here are some ideas:
INCLUDE the Good, Don't try to EXCLUDE the bad
(I think the proper terminology is whitelisting vs. blacklisting)
By that, I mean don't look for evil or invalid stuff to toss out (there are too many ways it could be written or disguised), instead look for valid stuff to include and toss out everything else.
You already mentioned in another comment that you are looking for a list of user-friendly table names, and substituting the actual schema table names. This is what I'm talking about--if you are going to do this, then do it with field names, too.
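A hedged sketch of that substitution idea (the names are made up): anything not in the map is rejected outright rather than filtered:

using System;
using System.Collections.Generic;

static class TableWhitelist
{
    // User-facing names map to real schema objects; unknown names throw.
    private static readonly Dictionary<string, string> Allowed =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "People", "dbo.Person" },
            { "Orders", "dbo.SalesOrder" }
        };

    public static string Resolve(string userFacingName)
    {
        string actual;
        if (!Allowed.TryGetValue(userFacingName, out actual))
            throw new ArgumentException("Table not permitted: " + userFacingName);
        return actual; // include the good; everything else is tossed out
    }
}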
I'm still leaning toward a graphical UI of some sort, though: select tables to view here, select fields you want to see here, use some drop-downs to build a where clause, etc. A pain, but still probably easier.
What you're missing is the ingenuity of an attacker finding holes in your application.
I can virtually guarantee that you won't be able to close all the holes if you allow this. There might even be bugs in the database engine that you don't know about, but attackers do, that allow a SQL statement you deem safe to wreak havoc in your system.
In short: This is a monumentally bad idea!
As the others indicate, letting end users do this is not a good idea. I suspect the requirement isn't that the user really needs ad hoc SQL, but rather a way to get and update data in ways not initially foreseen. To allow queries, do as Joel suggests and keep a "read only" database, but use a reporting application such as Microsoft Reporting Services or Data Dynamics ActiveReports to let users design and run ad hoc reports. Both, I believe, have ways to present users with a filtered view of "their" data.
For the updates, it is trickier; I don't know of existing tools to do this. One option may be to design your application so that developers can quickly write plugins to expose new forms for updating data. The plugin would need to expose a UI form, code for checking that the current user can execute it, and code for executing it. Your application would load all plugins and expose the forms that a user has access to.
Even seemingly secure technology like Dynamic LINQ is not safe from code-injection issues, and you are talking about providing low-level access.
No matter how hard you sanitize queries and tune permissions, it probably will still be possible to freeze your DB by sending over some CPU-intensive query.
So one of the "protection options" is to display a message stating that all queries accessing restricted objects or causing bad side effects will be logged against the user's account and reported to the admins immediately.
Another option: just try to look for a better alternative (i.e. if you really need to process and update data, why not expose an API to do this safely?)
One (maybe overkill) option could be to use a compiler for a reduced SQL language, something like using JavaCC with a modified SQL grammar that only allows SELECT statements. You could then receive the query, compile it, and run it only if it compiles.
For C#, I know of Irony, but I've never used it.
You can do a huge amount of damage with an update statement.
I had a project similar to this, and our solution was to walk the user through a very annoying wizard allowing them to make the choices, but the query itself is constructed behind the scenes by the application code. Very laborious to create, but at least we were in control of the code that finally executed.
The question is: do you trust your users? If your users have had to log into the system, you are using HTTPS, and you've taken precautions against XSS attacks, then SQL injection is a smaller issue. Running the queries under a restricted account ought to be enough if you trust the legitimate users. For years I've been running MyLittleAdmin on the web and have yet to have a problem.
If you run under a properly restricted SQL account, the select convert(varchar(50),0x64726F70207461626C652061) trick won't get very far, and you can defend against resource-hogging queries by setting a short timeout on your database requests. People could still run incorrect updates, but then that just comes back to: do you trust your users?
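Something like this sketch combines the two defenses (the restricted connection string is a placeholder): run the ad hoc text under the locked-down account with a short timeout:

using System.Data;
using System.Data.SqlClient;

static DataTable RunAdHoc(string restrictedConnectionString, string userSql)
{
    using (var conn = new SqlConnection(restrictedConnectionString))
    using (var cmd = new SqlCommand(userSql, conn))
    {
        // A short timeout limits how long a runaway query can hog resources.
        cmd.CommandTimeout = 5; // seconds

        conn.Open();
        var result = new DataTable();
        new SqlDataAdapter(cmd).Fill(result);
        return result;
    }
}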
You are always taking a managed risk attaching any database to the web, but then that's what backups are for.
If they don't have to perform really advanced queries, you could provide a UI that only allows certain choices, like a drop-down list with "update, delete, select"; the next drop-down would then automatically populate with a list of available tables, etc., similar to the query builder in SQL Management Studio.
Then in your server-side code you would convert these groups of UI elements into SQL statements and use a parameterized query to stop malicious content.
This is a terribly bad practice. I would create a handful of stored procedures to handle everything you'd want to do, even the more advanced queries. Present them to the user, let them pick the one they want, and pass your parameters.
The answer above mine is also extremely good.
Although I agree with Joel Coehoorn and SQLMenace, some of us do have "requirements". Instead of having them send ad hoc queries, why not create a visual query builder, like the ones found in the MS sample applications at asp.net, or try this link.
I am not against the points made by Joel. He is correct. Having users (remember, we are talking about users here; they could care less about what you want to enforce) throw queries at the database is like an app without a business logic layer, not to mention the additional questions to answer when certain results don't match other supporting application results.
Here is another example. The hacker doesn't need to know the real table name; he/she can run undocumented procs like this:
sp_msforeachtable 'print ''?'''
Just imagine drop in place of print.
Plenty of answers say that it's a bad idea, but sometimes that's what the requirements insist on. There is one gotcha that I haven't seen mentioned in the "if you have to do it anyway" suggestions, though:
Make sure that any update statements include a WHERE clause. It's all too easy to run
UPDATE ImportantTable
SET VitalColumn = NULL
and miss out the important
WHERE UserID = #USER_NAME
If an update is required across the whole table then it's easy enough to add
WHERE 1 = 1
Requiring the WHERE clause doesn't stop a malicious user from doing bad things, but it should reduce accidental whole-table changes.
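A naive sketch of that guard (easily fooled by clever input, but it catches the accidental case):

using System;
using System.Text.RegularExpressions;

static void RequireWhereClause(string sql)
{
    bool isUpdate = Regex.IsMatch(sql, @"^\s*UPDATE\b", RegexOptions.IgnoreCase);
    bool hasWhere = Regex.IsMatch(sql, @"\bWHERE\b", RegexOptions.IgnoreCase);

    // Refuse UPDATEs without a WHERE; whole-table changes must opt in
    // explicitly with something like WHERE 1 = 1.
    if (isUpdate && !hasWhere)
        throw new InvalidOperationException(
            "UPDATE statements must include a WHERE clause.");
}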
