I'm working as a UI tester at a small software company. In order to make my life easier, I'm trying to write a scraper in Python that will automatically generate some of the standard tests run on every page. Testing is done using QuickTest Pro and needs to be written in VBScript. Every page that creates data needs to have a full case, where every field on the page gets filled out, and a number of reduced cases, where only required fields get filled out.
The full case should be easy -- I plan to set up a requests.Session object with an already-authenticated cookie, send a GET request to the appropriate page, and parse the response with BeautifulSoup.
The reduced cases I'm less sure of how to approach. I can think of three ways to go about it, but none of them sound great:
A) Try to submit a blank page. Check the response for error messages of the form "* <field> is a required field." Look for the fields whose names are closest to the one specified. Fill them out. Try to submit again, and repeat, adding fields until it goes through successfully, and return a list of fields.
This isn't great because it's difficult to identify what field the error message corresponds to. A message stating that "* Birth date is required" might actually be referring to a form element with an HTML ID of "dob_entry1." I'm also testing on a development copy of the source, so it's not unusual for partially filled out forms to cause a server error, and I'd probably need to manually clean up any data that this approach creates.
B) Send in a fully filled-out form. Find the database record(s) that just got created, and find out which columns are NOT NULL. Match column names to field names, and return the resulting list.
This seems more promising, but I'm not sure how to go about finding the records that were created. Logs (except for errors) are not turned on for the MySQL server, and the server has ~15 databases on it, all of which are being worked on by developers, so I can't mess with the server's global variables to turn it on. I could query the database for all of the values that I just passed in, but there's a pretty huge amount of data already on the db, so it's unlikely that I would be able, for example, to figure out which date of birth is the one that I just submitted.
From some Googling, tools like mysqlsniffer (http://hackmysql.com/mysqlsniffer) might be an option, but I'm wary of doing anything to the server as a whole since the developers will be using other databases on the server at the same time. I don't have much experience with SQL, so I'm not very sure how to go about doing this.
C) Somehow parse the C# source code to find the query that corresponds to a given page. Find out which columns it affects, query the database to find out which are NOT NULL, match the column names to field names and return a list.
I have no experience with C# so I don't know how feasible this is, but if it were PHP I think it would be pretty simple. I could find the source for the site if I poked around but I haven't looked at any of it yet. The website is ~10 years old and is pretty massive, so matching page names to source files is probably non-trivial.
I imagined that finding out which fields of a form are required to submit a page would be a pretty common task for scrapers, but Google hasn't turned up much. Are any of these approaches reasonable? Is there an easy solution that I'm missing out on?
I think your first choice - figuring out from the HTML response which fields are required - is your safest bet. Trying to match field names to database column names can be a real problem - you have no idea how many layers the data goes through before being saved in the database, and the field names may look nothing like the column names.
Seeing if a field is required shouldn't be too hard - start with a full form and submit it to see that it's legal. Then send the form again, without the first field. If you're getting an error - the field is required. Fill the first field again, clear the second one and try again. Do this for every field in the form.
The web application will need to be stable enough for this to work. You should be able to tell the difference between a missing field error and a server error.
Oh, and do check @Ming Slogar's comment - if the HTML guys marked the fields as required in the HTML, you'll have a lot of free time on your hands.
Related
I am trying to understand some basics of Lucene, the full text search engine. More specifically I am looking at Lucene.Net.
Today I have an old legacy .NET 4.8 web app. Some of it is MVC, but the newer parts follow a pretty nice API-first pattern. The app holds a lot of records (approximately half a million) with tons of different fields. The search functionality there is outdated, to say the least: a ton of old Linq2SQL queries that fan out into LIKE queries.
I would like to introduce a new and better way to search records, so I started looking at Lucene.Net. But I am trying to understand one key concept, and I can't seem to find the answer anywhere, and I think it might be because it cannot be done, but I would like to make sure.
Is it possible to set up Lucene to monitor a SQL table or view so I don't have to maintain the Lucene index from within my code? The code of this app does not lend itself to easily keeping a Lucene index updated when things are added, changed or deleted, but the database is a good source of truth. I can live with a small delay in having the index up to date. Basically, I would like to define for each business model which fields are part of the index and what the id is, and then be able to query with that index from the C# server-side code of my web app.
Is such a scenario even possible or am I asking too much?
It's totally possible, but not out of the box. You have to implement it if you want it. Fundamentally, you need to implement three things:
- A way to know every time a piece of relevant data in the SQL database changes.
- A place to capture information about that change; call it a change log.
- A routine that reads the change log, applies those changes to the Lucene.Net index, and then marks the record in the change log as processed.
There are of course lots of different ways to handle each of these.
This SO answer, "Lucene.Net index updates, when a manual change is done in SQL Database", provides more details on one way this can be accomplished.
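To make the third piece a bit more concrete, here's a rough sketch of what that routine could look like against Lucene.Net 4.8. The RecordChange shape is just a stand-in for whatever your change log ends up capturing (triggers, rowversion polling, SQL Server Change Tracking, etc.), and the indexed fields are made up:

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical shape of a change-log row; adjust to whatever you capture.
public class RecordChange
{
    public string RecordId { get; set; }
    public string ChangeType { get; set; }   // "Insert", "Update" or "Delete"
    public string Title { get; set; }
    public string Body { get; set; }
}

public static class ChangeLogIndexer
{
    public static void ApplyPendingChanges(string indexPath, IEnumerable<RecordChange> changes)
    {
        var dir = FSDirectory.Open(indexPath);
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

        using (var writer = new IndexWriter(dir, config))
        {
            foreach (var change in changes)
            {
                if (change.ChangeType == "Delete")
                {
                    writer.DeleteDocuments(new Term("id", change.RecordId));
                }
                else
                {
                    // Insert and update are the same operation: replace any document with this id.
                    var doc = new Document
                    {
                        new StringField("id", change.RecordId, Field.Store.YES),
                        new TextField("title", change.Title, Field.Store.YES),
                        new TextField("body", change.Body, Field.Store.NO)
                    };
                    writer.UpdateDocument(new Term("id", change.RecordId), doc);
                }
                // Mark the change-log row as processed here (left out of the sketch).
            }
            writer.Commit();
        }
    }
}
```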
EDIT: Solution (kind of)
So, what I did had very little in common with what I originally wanted to do, but my application now works much faster (DataSets that took upward of 15 minutes to process now go through in 30-40 seconds tops). Here's roughly what I did:
- Read spreadsheet & populate DataTable/DataSet normally
- [HACK WARNING] Instead of using UpdateDataSet, I generate my own SQL queries, mostly by having a skeleton string for each type of update (e.g. String skeleton = "UPDATE ... SET ... WHERE ..."). I then consult the template database and replace the placeholder ... with the appropriate entries.
- [MORE HACK WARNING] The way I dealt with errors was by manually checking whether those errors will occur. So if I know I am about to do an insert, I'll run an error-checking command before the actual insert; what the error checker will do is construct a JOIN statement, checking whether any of the entries in the user's DataSet already exist in the database. Just by executing the JOIN command, I get back a DataSet with the results, so I know that if there is anything there, it's the errors. Then I can proceed to print them.
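To give a rough idea of the shape of that check (sketched with a parameterized IN list standing in for the JOIN; Id, targetTable and connectionString are placeholder names, and real templates may well have composite keys):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Linq;

// Sketch of the pre-insert existence check. "Id", targetTable and connectionString
// are placeholders; the table name comes from the template definitions, never from user input.
public DataTable FindExistingRows(DataTable upload, string targetTable, string connectionString)
{
    var conflicts = new DataTable();
    var ids = upload.AsEnumerable().Select(r => r["Id"]).ToList();
    if (ids.Count == 0)
        return conflicts;

    var paramNames = ids.Select((_, i) => "@p" + i).ToList();
    string sql = "SELECT Id FROM " + targetTable +
                 " WHERE Id IN (" + string.Join(", ", paramNames) + ")";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        for (int i = 0; i < ids.Count; i++)
            cmd.Parameters.AddWithValue(paramNames[i], ids[i]);

        conn.Open();
        new SqlDataAdapter(cmd).Fill(conflicts);   // any rows returned are the duplicates to report
    }
    return conflicts;
}
```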
If anyone needs more details, I'll be happy to provide them. It's a fairly specific question, so I should probably keep this outline fairly high level.
Original Question
For (good) reasons outside of my control, I need to use the Database.UpdateDataSet() method from Microsoft's Enterprise Library. The way my project will work, I am letting the user make changes to the database (multiple databases, multiple schemas, multiple tables, but always only one at a time) by uploading Excel spreadsheets to a web application. The spreadsheets follow a design/template specified by me (usually). I am at the stage where I read the spreadsheet, turn it into a DataTable/DataSet, and use (dynamically generated) prepared statements to make the appropriate changes to the database. Here's the problem:
Each spreadsheet only allows for one type of change (insert/update/delete). I want to make it so if the user uploads an insert spreadsheet, but several (let's say 10) of the entries are already in the database, I not only return with an error, but also tell them which entries (DataRows) violated the primary key constraint.
The ideal solution would be to get a DataSet with the list of errors back, but I don't see how I can do that. Perhaps there is a way to construct the prepared statements in such a way that if a DataRow is to be inserted (following the example from above), it proceeds normally; however, if it attempts to update or delete, it skips it and adds it to an error collection of some sort?
Note that I am trying to avoid using stored procedures. Since the number of different templates will grow extremely quickly after deployment, it is important that I stay away from manually written code and stay as close to a database-driven model as possible.
I've never looked much into what .NET offers for user input validation, mostly because I dislike the way it typically won't let you unfocus a control until you enter valid data (I believe the DataGridView does this).
On the other hand, I found that I often need to validate what I'll describe below and I wonder if sticking to .NET standards here will make it any easier.
I'll typically have a dialog box that, among other controls, has two combo boxes: one to select a data table among the existing tables, and one to select a column among the columns in the currently selected table. This is easy enough so far, but since this is a dialog, on opening it I need to show the values that were selected the last time the dialog was shown, if they still exist in the database; otherwise, if the table still exists, select some other column; otherwise, if any table exists at all, select another table and column and warn the user that his selection has changed; and if there are no tables, simply show a message and close the dialog.
Of course this is not the only case. Sometimes it will be a bit more complex and every time I will try to figure out again what's the best way to handle it. I wonder if there is already a pattern, particularly one that .NET offers I can apply to the case I describe above? If so, I'm sure I'll figure out how to apply it to other cases.
The answer will depend quite a bit on your implementation specifics.
However, what we finally settled on for this was to pass the existing display and value pair to the method that retrieves the data.
Once the data is retrieved, we check whether that previously selected value is present in the retrieved data and, if it is not, we add a record holding that display/value pair to the collection of data that is returned.
Implementing this functionality at the point of data retrieval allows us to support the same functionality in any client (asp.net, silverlight, etc).
We do go back and forth occasionally on whether it is appropriate to add the logic to the business object, but there are enough exceptions (i.e. web services, simple collections, etc) that we always end up back at the above design.
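In code, that approach comes down to something like the sketch below; LoadColumnsFromDb and the "Value"/"Display" column names are placeholders for your own data-access layer.

```csharp
using System.Data;
using System.Linq;

// Sketch: the retrieval method receives the previously selected value/display
// pair and re-adds it when the fresh query no longer returns it.
public DataTable GetColumns(string table, object lastValue, string lastDisplay)
{
    DataTable columns = LoadColumnsFromDb(table);   // hypothetical existing data-access call

    if (lastValue != null &&
        !columns.AsEnumerable().Any(r => Equals(r["Value"], lastValue)))
    {
        DataRow row = columns.NewRow();
        row["Value"] = lastValue;
        row["Display"] = lastDisplay;
        columns.Rows.Add(row);
    }
    return columns;
}
```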
Is there an easier way to prevent a duplicate insert after refresh? The way I do it now is to select everything with all fields except ID as parameters; if a record exists, I don't insert. Is there a way to possibly detect a refresh?
Assuming it's a database, you could put a unique constraint on the combination of "all fields except ID" and catch the exception on an insert or update.
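For example (SQL Server assumed; the table, columns and surrounding variables are made up), the insert can treat the specific duplicate-key error as "already saved" instead of as a failure:

```csharp
using System.Data.SqlClient;

// Assumes a unique constraint/index already covers the relevant columns.
try
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "INSERT INTO Orders (CustomerId, ProductId, OrderDate) " +
        "VALUES (@customerId, @productId, @orderDate)", conn))
    {
        cmd.Parameters.AddWithValue("@customerId", customerId);
        cmd.Parameters.AddWithValue("@productId", productId);
        cmd.Parameters.AddWithValue("@orderDate", orderDate);
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}
catch (SqlException ex)
{
    // 2627 = primary key/unique constraint violation, 2601 = unique index violation (SQL Server).
    if (ex.Number != 2627 && ex.Number != 2601)
        throw;
    // Otherwise the row is already there (e.g. a refresh re-posted the form), so don't insert again.
}
```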
I agree with @Austin Salonen that you should start by protecting the DB with primary keys, unique constraints and foreign keys.
That done, many websites will include some JS behind submit buttons to disable the button immediately before sending on the request. This way, users who double click don't send two requests.
I think you may want to use the EXISTS function.
Here's a simple explanation of EXISTS I found through Google.
Like Dereleased said, use a 303-based redirect. Make the form submission use POST; after saving, send a 303 status with a Location header pointing to the post-submit URL. That URL will be fetched via GET, so a refresh will not re-post the data.
It has been a long time since I have done any real web work. But back in the 1.1 days I remember using IDs associated with a postback to determine if a refresh had occurred.
After a quick search I think this is the article I based my solution from:
http://msdn.microsoft.com/en-us/library/ms379557(VS.80).aspx
It basically shows you how to build a new page class that you can inherit from. The base class exposes a method that you call when you are doing something that shouldn't be repeated on a refresh, and an IsPageRefresh method to track whether a refresh has occurred.
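As a rough illustration of that ticket idea (a reconstruction of the general pattern, not the article's actual code): every rendered page carries a one-time ticket, and a postback that re-submits a ticket we've already consumed is treated as a refresh.

```csharp
using System;
using System.Web.UI;

public class RefreshAwarePage : Page
{
    protected bool IsPageRefresh { get; private set; }

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        if (!IsPostBack)
        {
            ViewState["Ticket"] = Guid.NewGuid().ToString();
            return;
        }

        string submitted = (string)ViewState["Ticket"];
        IsPageRefresh = submitted == (string)Session["ConsumedTicket"];

        Session["ConsumedTicket"] = submitted;            // this ticket is now used up
        ViewState["Ticket"] = Guid.NewGuid().ToString();  // fresh ticket for the next legitimate post
    }
}
```

Pages inherit from RefreshAwarePage and check IsPageRefresh before repeating anything that shouldn't run twice.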
That article was the basis for a lot of variations with similar goals, so it should be a good place to start. Unfortunately I can't remember enough about how it went to really give any more help.
I second the option to redirect a user to another (confirmation) page after the request has been submitted (a record inserted into the database). That way they will not be able to do a refresh.
You could also have a flag that indicates whether the insert request has been submitted and store it either on the page (with javascript) or in the session. You could also go further and store it somewhere else but that's an architectural decision on the part of your web application.
If you're using an AJAX request to insert a record then it's a bit harder to prevent this on the client side.
I'd rather use an indicator/flag than compare the fields. This, of course, depends on your records. For example, if it is a simple shop and the user wants to make an identical order, then you will treat it as a duplicate order and effectively prevent the functionality.
What DB are you using? If it's MySQL, and certain other factors of your implementation align, you could always use INSERT IGNORE INTO .... EDIT: struck out, since this doesn't apply to SQL Server.
Alternatively, you could create "handler" pages, e.g. your process looks like this:
User attempts to perform "action"
User is sent to "doAction.xxx"
"doAction.xxx" completes, and redirects to "actionDone.xxx"
???
Profit!
EDIT: After re-reading your question, I'm going to lean more towards the second solution; by creating an intermediate page with a redirect (usually an HTTP/1.1 303 See Other) you can usually prevent this kind of confusion. Checking uniques on the database is always a good idea, but for the simple case of not wanting a refresh to repeat the last action, this is an elegant and well-established solution.
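In ASP.NET terms the handler page boils down to something like the sketch below (page and method names are made up). Note that Response.Redirect() sends a 302; the 303 status is set by hand here, though in practice either status gets you the same refresh-safe behaviour.

```csharp
// doAction.aspx.cs (names are made up). Do the work, then redirect so the page
// the user ends up on was fetched with GET; refreshing it won't repeat the POST.
protected void Page_Load(object sender, EventArgs e)
{
    if (IsPostBack)
    {
        SaveOrder();   // placeholder for whatever the "action" actually is

        Response.StatusCode = 303;                       // "See Other"
        Response.AddHeader("Location", "actionDone.aspx");
        Response.End();
    }
}
```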
I would like to know what are the best practice programming tasks in relation to users submitting data through a web form to a website.
I am particularly interested in any C# or VB.NET commands that should be used throughout the process from the moment the user hits the submit button until the data hits the database.
I have been reading about reasons why you may want to take precautions against things such as SQL injection, etc.
Avoiding SQL injections is quite simple - just use parameterized queries, or an ORM such as LINQ to SQL or nHibernate (which all use parameters under the hood). The library takes care of everything for you, and has been thoroughly vetted.
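For example, with plain ADO.NET (the table, column and surrounding variable names are made up), the user-supplied value never becomes part of the SQL text:

```csharp
using System.Data.SqlClient;

// The user-supplied value travels as a parameter, never as part of the SQL string.
string sql = "SELECT Id, Title FROM Articles WHERE Author = @author";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@author", userSuppliedAuthor);
    conn.Open();

    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // ... read reader["Id"], reader["Title"] ...
        }
    }
}
```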
After that, you're safe until it's time to write the data back out to other users. You always want to store the data as close to the original user input as possible. Another way to say this is - don't store a scrubbed version (unless you also store the original alongside it). Scrubbing is a one-way process - it destroys information. It's always easy to scrub again if you need to, but you can't un-scrub something.
However, storing the original format means you do need to make sure you encode the output before you write it to the browser. This prevents users from putting malicious cross-site scripts and other things into your data that might be rendered on other users' pages.
At the highest level, just keep in mind that all the work should be done as late as possible. Be liberal in what you accept (do only what is necessary to protect yourself) and strict in what you send (encode everything, scrub the hell out of it, transform it, etc). You want to have a "pure" copy which is altered to conform to the target output.
If you are serious about it, read this book: 19 Deadly Sins of Software Security
Using Linq2Sql you get protection from SQL injection. Alternatively, use .Parameters with parameterized queries.
When you send the data back on the page, you have to prevent the data from running JavaScript by encoding it. Use http://msdn.microsoft.com/en-us/library/w3te6wfz.aspx
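For instance (the label and the stored value are made up), encoding turns any markup in the data into harmless text:

```csharp
using System.Web;

// Encode before writing user data into the page so embedded markup/script is
// displayed as text instead of being executed by the browser.
string stored = commentFromDatabase;              // may contain "<script>..." etc.
string safe = HttpUtility.HtmlEncode(stored);     // "<" becomes "&lt;", ">" becomes "&gt;"

commentLabel.Text = safe;                         // hypothetical Label control on the page
```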
And overall, consider any use of that data a chance for attack and look for ways to prevent it. For example, using user data as a filename to access/save something can mean access to unintended resources (by adding ..\).
You can't go wrong with the following general rules:
Validate everywhere! Where you validate determines the quality of the user experience. The closer to the user, the less safe it is but more responsive. The farther away, the safer but tends to give worse error messages.
Validate at the front-end to give the user a responsive error.
Validate in the middle to give the user nicer error messages.
Validate in the database (constraints and such) to keep your database sane.
Use parameters early, and use them often! Find those square pegs early.
Coerce data into the correct types as quickly as possible. (This is a form of validation.) If something is an int, don't handle it like a string.
Don't throw away errors when checking parameters. If your regex doesn't match, or your try { parse } catch { } gets triggered, it's important you know why and don't continue! (There's a short sketch of this after these rules.)
Whether you use LINQ or roll-your-own SQL: do not build SQL statements with user-supplied data. EVER. Use parameterized queries and/or stored procedure calls. If you must piece-together SQL as strings, don't do it with user data. Get the "untrustworthy" data stored and manipulate it as needed later, in a separate query.
Encode all data passed to the user. The bad data may not be their fault, don't trash their world.
Assume that anything they pass you is full of JavaScript and HTML. Assume that "binary" data will find its way in. Someone will run your web page on something other than a "browser" eventually. Your "phone number" field will be used to store an .EXE.
Return all data encoded and harmless. Don't assume that "because it's in the database" (or that it's an int, or that it's just a 1 character string) that it's harmless.
Assume that eventually your database will fail you somehow. A developer will drop in "test" data, you'll miss an edge case above, or something may run amok and insert all-purpose crap. This crap has to be passed to the user safely.
Nobody's perfect: especially you. Plan for that.
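As a small example of coercing early without discarding the failure (the names are made up):

```csharp
using System;

// Coerce the raw form value to its real type up front, and make a failed
// parse loud instead of silently falling back to a default.
int quantity;
if (!int.TryParse(quantityText, out quantity))
{
    throw new ArgumentException("Quantity must be a whole number, got: " + quantityText);
}

ProcessOrder(quantity);   // hypothetical downstream call that now receives a typed value
```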
While pretty much all of the guidelines on the Open Web Application Security Project (OWASP) site are useful, here are their guidelines on data validation.