I'm in the process of writing an HTML screen scraper. What would be the best way to create unit tests for this?
Is it "ok" to have a static html file and read it from disk on every test?
Do you have any suggestions?
To guarantee that the test can be run over and over again, you should have a static page to test against. (I.e. reading it from disk is OK.)
If you write a test that touches the live page on the web, that's probably not a unit test but an integration test. You could have those too.
For my Ruby + Mechanize scrapers I've been experimenting with integration tests that transparently test against as many versions of the target page as possible.
Inside the tests I'm overloading the scraper's HTTP fetch method to automatically re-cache a newer version of the page, in addition to an "original" copy saved manually. Each integration test then runs against:
the original manually-saved page (somewhat like a unit test)
the freshest version of the page we have
a live copy from the site right now (which is skipped if offline)
... and raises an exception if the number of fields returned by them differs (e.g. they've changed the name of a thumbnail class). This still provides some resilience against tests breaking just because the target site is down.
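A minimal Python sketch of that idea (the original uses Ruby + Mechanize; here `Scraper`, `scrape_titles`, and the saved page contents are all invented for illustration). The fetch step is overridden so the same parsing code runs against each saved version, and the test fails if the versions yield different numbers of fields:

```python
# Sketch: run one scraper against several saved versions of a page and
# flag a divergence in the number of extracted fields. The live-copy leg
# (skipped when offline) is omitted here to keep the example self-contained.
import re

class Scraper:
    def fetch(self, url):
        raise NotImplementedError  # the real version would do an HTTP GET

    def scrape_titles(self, url):
        html = self.fetch(url)
        return re.findall(r'<h2 class="title">(.*?)</h2>', html)

class CannedScraper(Scraper):
    """Test double: serves a saved page instead of hitting the network."""
    def __init__(self, html):
        self.html = html

    def fetch(self, url):
        return self.html

original = '<h2 class="title">A</h2><h2 class="title">B</h2>'   # saved manually
freshest = '<h2 class="title">A2</h2><h2 class="title">B2</h2>'  # auto re-cached

counts = {name: len(CannedScraper(page).scrape_titles("http://example.com"))
          for name, page in [("original", original), ("freshest", freshest)]}
if len(set(counts.values())) != 1:
    raise AssertionError(f"field counts diverged: {counts}")
```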
Files are OK, but remember: your screen scraper processes text. You should have various unit tests that "scrape" different pieces of text hard-coded within each unit test. Each piece should "provoke" the various parts of your scraper method.
This way you completely remove dependencies on anything external, both files and web pages, and your tests will be easier to maintain individually since they no longer depend on external files. Your unit tests will also execute (slightly) faster ;)
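A short sketch of this style in Python: each test hard-codes the fragment of HTML it is meant to provoke, so nothing depends on files or the network. The `extract_price` function and the fragments are invented examples, not the asker's code:

```python
# Sketch: unit tests with HTML fragments hard-coded inline, one fragment
# per behaviour being provoked.
import re
import unittest

def extract_price(html):
    # invented scraper under test: pull a dollar price out of a fragment
    m = re.search(r'<span class="price">\$?([\d.]+)</span>', html)
    return float(m.group(1)) if m else None

class PriceScraperTest(unittest.TestCase):
    def test_plain_price(self):
        self.assertEqual(extract_price('<span class="price">$9.99</span>'), 9.99)

    def test_missing_price(self):
        self.assertIsNone(extract_price('<div>no price here</div>'))

unittest.main(argv=["prog"], exit=False)
```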
To create your unit tests, you need to know how your scraper works and what sorts of information you think it should be extracting. Using simple web pages as unit tests could be OK depending on the complexity of your scraper.
For regression testing, you should absolutely keep files on disk.
But if your ultimate goal is to scrape the web, you should also keep a record of common queries and the HTML that comes back. This way, when your application fails, you can quickly capture all past queries of interest (using say wget or curl) and find out if and how the HTML has changed.
In other words, regression test both against known HTML and against unknown HTML from known queries. If you issue a known query and the HTML that comes back is identical to what's in your database, you don't need to test it twice.
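One hedged sketch of "don't test it twice": keep a store of query → HTML on record, and only re-run the expensive checks when the freshly fetched HTML differs from what's stored. Here `fetch` is a stub standing in for wget/curl, and the store is an in-memory dict rather than a database:

```python
# Sketch: skip regression checks for a known query when the HTML that
# comes back is identical to what's already on record.
import hashlib

store = {"q=widget": "<html>old widget page</html>"}  # known HTML on record

def fetch(query):
    # stand-in for wget/curl; pretend the site has changed for this query
    return "<html>new widget page</html>"

def digest(html):
    return hashlib.sha256(html.encode()).hexdigest()

def needs_retest(query):
    fresh = fetch(query)
    known = store.get(query)
    changed = known is None or digest(fresh) != digest(known)
    if changed:
        store[query] = fresh  # record the new HTML for inspection
    return changed

print(needs_retest("q=widget"))  # the page changed, so a re-test is due
```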
Incidentally, I've had much better luck screen scraping ever since I stopped trying to scrape raw HTML and started instead to scrape the output of w3m -dump, which is ASCII and is so much easier to deal with!
You need to think about what it is you are scraping.
Static Html (html that is not bound to change drastically and break your scraper)
Dynamic Html (Loose term, html that may drastically change)
Unknown (html that you pull specific data from, regardless of format)
If the html is static, then I would just use a couple different local copies on disk. Since you know the html is not bound to change drastically and break your scraper, you can confidently write your test using a local file.
If the HTML is dynamic (again, a loose term), then you may want to go ahead and use live requests in the test. If you use a local copy in this scenario and the test passes, you might expect the live HTML to behave the same, when in fact it may fail. In this case, by testing against the live HTML every time, you know immediately, before deployment, whether your screen scraper is up to par.
Now if you simply don't care about the format of the HTML, the order of the elements, or the structure, because you are just pulling out individual elements based on some matching mechanism (regex or other), then a local copy may be fine, but you may still want to lean towards testing against live HTML. If the live HTML changes, specifically the parts you are looking for, then your test may pass against a local copy but fail come deployment.
My opinion would be to test against live HTML if you can. This will prevent your local tests from passing when the live HTML would fail, and vice versa. I don't think there is a best practice with screen scrapers, because screen scrapers are unusual little buggers in themselves. If a website or web service does not expose an API, a screen scraper is sort of a cheesy workaround to getting the data you want.
What you're suggesting sounds sensible. I'd perhaps have a directory of suitable test HTML files, plus data on what to expect for each one. You can further populate that with known problematic pages as/when you come across them, to form a complete regression test suite.
You should also perform integration tests for actually talking HTTP (including not just successful page fetches, but also 404 errors, unresponsive servers etc.)
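The directory-of-test-pages idea above can be sketched as a small regression driver: each saved page sits next to a record of what the scraper should extract from it, and new problematic pages just get dropped into the directory. The `count_links` scraper, file names, and expected-data format are all illustrative assumptions (the directory is built in a temp dir here only to keep the demo self-contained; normally the files would be checked in):

```python
# Sketch: walk a directory of saved HTML files, each paired with a JSON
# file describing the expected scrape result, and collect any failures.
import json
import os
import re
import tempfile

def count_links(html):
    # invented scraper under test: count anchor tags
    return len(re.findall(r"<a\b", html))

cases_dir = tempfile.mkdtemp()
with open(os.path.join(cases_dir, "page1.html"), "w") as f:
    f.write('<a href="/x">x</a><a href="/y">y</a>')
with open(os.path.join(cases_dir, "page1.json"), "w") as f:
    json.dump({"links": 2}, f)

failures = []
for name in sorted(os.listdir(cases_dir)):
    if not name.endswith(".html"):
        continue
    with open(os.path.join(cases_dir, name)) as f:
        html = f.read()
    with open(os.path.join(cases_dir, name[:-5] + ".json")) as f:
        expected = json.load(f)
    if count_links(html) != expected["links"]:
        failures.append(name)

print(failures)  # empty list means every saved page still scrapes correctly
```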
I would say that depends on how many different tests you need to run.
If you need to check for a large number of different things in your unit test, you might be better off generating HTML output as part of your test initialization. It would still be file-based, but you would have an extensible pattern:
Initialize HTML file with fragments for Test A
Execute Test A
Delete HTML file
That way when you add test ZZZZZ down the road, you would have a consistent way of providing test data.
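The initialize / execute / delete pattern above maps naturally onto xUnit-style setup and teardown. A hedged Python sketch, with an invented fragment and scraper (in C#/MSTest the equivalent hooks would be test initialize/cleanup methods):

```python
# Sketch: generate the HTML file in setUp, run the test, delete the file
# in tearDown -- a consistent way of providing test data per test.
import os
import re
import tempfile
import unittest

def scrape_file(path):
    # invented scraper under test: pull list items out of a file on disk
    with open(path) as f:
        return re.findall(r"<li>(.*?)</li>", f.read())

class GeneratedHtmlTest(unittest.TestCase):
    def setUp(self):
        # Initialize HTML file with fragments for this test
        fd, self.path = tempfile.mkstemp(suffix=".html")
        with os.fdopen(fd, "w") as f:
            f.write("<ul><li>a</li><li>b</li></ul>")

    def tearDown(self):
        # Delete HTML file
        os.remove(self.path)

    def test_items(self):
        self.assertEqual(scrape_file(self.path), ["a", "b"])

unittest.main(argv=["prog"], exit=False)
```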
If you are just running a limited number of tests, and it will stay that way, a few pre-written static HTML files should be fine.
Certainly do some integration tests as Rich suggests.
You're creating an external dependency, which is going to be fragile.
Why not create a TestContent project populated with a bunch of resource files? Copy 'n' paste your source HTML into the resource file(s), and then you can reference them in your unit tests.
Sounds like you have several components here:
Something that fetches your HTML content
Something that strips away the chaff and produces just the text that must be scraped
Something that actually looks at the content and transforms it into your database/whatever
You should test (and probably implement) these parts of the scraper independently.
There's no reason you shouldn't be able to get content from any where (i.e. no HTTP).
There's no reason you wouldn't want to strip away the chaff for purposes other than scraping.
There's no reason to only store data into your database via scraping.
So.. there's no reason to build and test all these pieces of your code as a single large program.
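The three components above can be sketched as independent functions, each testable on its own; the function names and record shape are invented for illustration:

```python
# Sketch: fetch / strip / transform as separate, independently testable
# pieces, wired together only at the end.
import io
import re

def fetch(source):
    """Fetching: content can come from anywhere, not just HTTP."""
    return source.read()

def strip_chaff(html):
    """Stripping: reduce the page to just the text that must be scraped."""
    return re.sub(r"<script.*?</script>", "", html, flags=re.S)

def transform(text):
    """Transforming: turn the content into records for the database."""
    return [{"title": t} for t in re.findall(r"<h1>(.*?)</h1>", text)]

# a file-like object stands in for the network, so no HTTP is needed
page = io.StringIO("<script>junk()</script><h1>Hello</h1>")
records = transform(strip_chaff(fetch(page)))
print(records)
```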
Then again... maybe we're over complicating things?
You should probably query a static page on disk for all but one or two tests. But don't forget those tests that touch the web!
I don't see why it matters where the html originates from as far as your unit tests are concerned.
To clarify: your unit test is processing the HTML content; where that content comes from is immaterial, so reading it from a file is fine for your unit tests. As you say in your comment, you certainly don't want to hit the network for every test, as that is just overhead.
You also might want to add an integration test or two to check that you're processing URLs correctly (i.e. that you are able to connect to and process external URLs).
Related
I was just wondering: given an input file (Excel, XML, etc.), can we generate unit test code in C#? Consider, for example, that I need to validate a database. In the input Excel file, I will specify which attributes to set, which to retrieve, the expected values, etc. I can also provide the queries to run for these. So given all these inputs from my side, can I create a unit test method in C# through some tool, script, or another program? Sorry if this sounds dumb. Thank you for the help.
A unit test should check that your software works correctly/as expected, not that your data is correct. To be precise, you should test the software that imported the data into your database. When the data is already in the database, you can write a validation script or something similar, which has nothing to do with a unit test (though the script itself may of course be tested for working correctly).
You should, however, test whether the queries your software runs against the database are correct and work as expected, with both arbitrary and real-world data.
Even when code generation is involved, you do not want to check that the process of generating the source code works correctly (at least not unless you wrote your own code generator). Simply assume the generator works as expected and continue with the stuff you can handle yourself.
I had a similar question some time back, though not in the context of unit tests. This code that can be generated from another file/database table is called Boilerplate Code.
So if you ask whether this can be done, the answer is yes. But if you wonder whether it should be done, the answer is no. Boilerplate code is not ideal for unit tests. Tests are mutable: on catching an edge case that you did not consider earlier, you may have to add a few more tests.
Also, unit tests are often used not just to test the code but to drive code development. This method is known as Test-Driven Development (abbr. TDD). It would be a mess to "drive" your development from boilerplate tests.
I've got an MVC website with many different steps a user has to take to get through it. There are validation checks and timed sections (for legal requirements). Having to do an integration test each time I need to test a small change to a page is a real headache. Ideally I want to know if there is a way (maybe a plugin?) that will allow me to right-click a view, somehow specify a fake model object, and open it directly.
What I am ultimately looking to test is how any new client side scripting (which combines razor/javascript/jQuery) looks and works on a variety of browsers. This isn't about testing functionality of my controllers.
Design time data
Design-time data is commonly used in WPF; there is an article here that describes a technique for showing design-time data in MVC:
http://blog.dezfowler.com/2010/11/adding-design-mode-to-your-mvc-app.html
This should provide you a method to "somehow specify a fake model object and open it directly".
That might be all you're after, or:
cURL
Can be used with real time or design time data as above.
I use cURL executed from batch files and output the content to a number of files.
For example, this batch might simulate logging on:
Logon.bat:
echo Index without logon
curl http://localhost/index.html
echo Logon
curl http://localhost/login.html --data "username=a&password=p" --dump-header auth.txt
echo Index after logon
curl http://localhost/index.html --cookie auth.txt
RunAll.bat:
call Logon.bat > logon_result.txt
The first time I run it, I also manually review the pages in a browser, then I know I can commit these batch result files (such as logon_result.txt) as the expected output.
Subsequent times I run the batch files, any changes are highlighted in revision control. At that point I review the differences and either OK them and commit them as the new expected output, or I fix a bug.
I usually use this for WebAPI integration testing, but it should work for any HTTP-served page. One specific scenario to bear in mind: for sweeping changes to a shared layout, for example, you may not want to check every page manually. So make sure everything is checked and committed before the layout change; then little bugs won't be hidden within a vast number of changes.
I've caught some bad bugs with this technique. Ever put a System.Web.Mvc.AuthorizeAttribute on an ApiController instead of a System.Web.Http.AuthorizeAttribute? It doesn't block unauthorized users, but the code looks fine.
You may also want to set up a new clean database, or restore a snapshot of one, as the first task of the RunAll.bat file, so that any data displayed on pages is the same each run and doesn't show up as a change.
Testing a web application is a pretty big topic, but let's keep it simple:
To correctly test your application, you have to design the application in a way
that all business logic can be tested via normal unit tests
all data access can be abstracted and mocked
data access can be integration tested separately
If you have an MVC website, you usually should have all business logic separated from any UI. This should actually enable you to use standard unit test projects to test, let's say, 80% of your code. Of course, you have to write a lot of code to test it right...
If you have tons of business logic in your views, this will result in code that is very hard to test.
The only way to do it (I know about) is automated UI testing.
To do this, there are some useful frameworks available, also visual studio offers some tools to automate tests.
In general it works like this: you define the actions you would usually perform as a user in a web browser. Any action the user would take can potentially be tested by scripting it.
How hard this is depends very much on how complex and/or dynamic your UI actually is. The more fancy stuff you have, the harder it will get to write a test script...
Following are some great articles about automation testing:
http://visualstudiomagazine.com/Articles/2012/12/01/Automated-UI-Testing.aspx
http://blog.typemock.com/2012/05/22/automated-testing-of-asp-net-mvc-applications
here is also a quick video about how to get automated UI tests running in VS2012:
http://channel9.msdn.com/Events/Visual-Studio/Visual-Studio-2012-Virtual-Launch/Automated-UI-testing-with-Visual-Studio-2012
I am looking for a way to programmatically create unit tests using MSTest. I would like to loop through a series of configuration data and create tests dynamically based on the information. The configuration data will not be available at compile time and may come from an external data source such as a database or an XML file. Scenario: Load configuration data into a test harness and loop through the data while creating a new test for each element. Would like each dynamically created test to be reported (success/fail) on separately.
You can use data-driven testing, depending on how complex your data is. If you are just substituting values and testing that your code can handle those inputs, that might be the way to go, but this doesn't really sound like what you are after. (You could make this more complex; after all, all you are doing is pulling in values from a data source and then making a programmatic decision based on them.)
All MSTest really does is run a series of tests and then produce the results (in an XML file), which are then interpreted by the calling application. It's just a wrapper for executing the methods you designate through attributes.
What it sounds like you're asking is to write C# code dynamically, and have it execute in the harness.
If you really want to run this through MSTest you could:
Build a method (or series of methods) which reads the XML file
Write out the C# code (I would maybe look at T4 templates for this; personally I would use F#, but I'm more partial to functional languages, and it would be easier for me)
Call csc.exe (the C# compiler)
Invoke MSTest
You could also write MSIL code into the running application directly and try to get MSTest to execute it, which for some might be fun, but that could be time-consuming and is not guaranteed to work (I haven't tried it, so I don't know what the pitfalls would be).
Based on this, it might be easier to quickly build your own harness which will interpret your XML file, dynamically build out your test scenarios, and produce the same results file. (After all, the results are what's important, not how you got there.) Since you said the data won't be available at compile time, I would guess that you aren't interested in viewing the results in the Visual Studio window.
Actually, personally, I wouldn't use XML as your domain-specific language (DSL). Parsing it is easy, because .NET already does that for you, but it's limiting in how it can define what your methods do. It's meant for conveying data, and although technically code is a form of data, it doesn't have sufficient expressive strength to convey many abilities of a more formal language. This is just my personal opinion, though, and there are many ways to skin a cat.
I am developing an ETL process that extracts business data from one database into a data warehouse. The application is NOT using NHibernate, LINQ to SQL, or Entity Framework. The application has its own generated data access classes that generate the necessary SQL statements to perform CRUD.
As one can imagine, developers who write code that generates custom SQL can easily make mistakes.
I would like to write a program that generates testing data (Arrange), then performs the ETL process (Act), and validates the data warehouse (Assert).
I don't think it is hard to write such a program. However, what worries me is that in the past my company attempted something similar, ending up with a bunch of unmaintainable unit tests that constantly failed because of the many changes to the database schema as new features were added.
My plan is to write an integration test that runs on the build machine, and not any unit tests, to ensure the ETL process works. The testing data cannot be generated totally at random because business logic determines how data is loaded into the data warehouse. We have a custom development tool that generates new data access classes when there is a change in the database definition.
I would love feedback from the community and advice on writing such an integration test so that it is easy to maintain. Some ideas I have:
Save a backup testing database in version control (TFS); developers will need to modify the backup database when there are data changes to the source or data warehouse.
Developers maintain the testing data manually through the testing program (C# in this case). This program would have a basic framework for developers to generate their testing data.
When the test database is initialized, it generates random data. Developers will need to write code to override certain randomly generated data to ensure the test passes.
I welcome any suggestions
Thanks
Hey dsum,
although I don't really know the whole architecture of your ETL, I would say that integration testing should only be another step in your testing process.
Even if the unit testing ended up in a mess on the first attempt, you should keep in mind that for many cases a single unit test is the best place to check. Or do you want to drill the whole integration test further down, say for a three-way case or something similar, in order to guarantee the right flow in each of the three conditions?
Messy unit tests are only the result of messy production code. Don't feel offended; that's just my opinion. Unit tests force coders to keep a clean coding style and make the whole thing much more maintainable.
So... my goal is that you think about not only performing integration testing on the whole thing, because unit tests (if they are used in the right way) can pinpoint problems in more detail.
Regards,
MacX
First, let me say that I think that's a good plan; I have done something similar using Oracle & PL/SQL some years ago. IMHO your problem is mainly an organizational one, not a technical one:
You must have someone who is responsible to extend and maintain the test code.
Responsibility for maintaining the test data must be clear (and provide mechanisms for easy test data maintenance; same applies to any verification data you might need)
The whole team should know that no code will go into the production environment as long as the test fails. If the test fails, first priority of the team should be to fix it (the code or the test, whatever is right). Train them not to work on any new feature as long as the test breaks!
After a bug fix, it should be easy for the one who fixed it to verify that the part of the integration which failed before does not fail afterwards. That means it should be possible to run the whole test (or at least parts of it) quickly and easily from any developer machine. Quickness can become a problem for an ETL process if your test is too big, so focus on testing a lot of things with as little data as possible. And perhaps you can break the whole test into smaller pieces which can be executed step by step.
If you want to maintain data while performing data integration testing of the ETL process, you could also go through these steps, since integration testing covers the ETL process and the related applications together. For example:
1. Set up test data in the source system.
2. Execute the ETL process to load the test data into the target.
3. View or process the data in the target system.
4. Validate the data and the application functionality that uses the data.
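The four steps above can be sketched as an Arrange/Act/Assert integration test, with in-memory SQLite standing in for the source database and the data warehouse. The table layout and the trivial stand-in ETL (aggregating order amounts per customer) are invented for illustration, not the asker's schema:

```python
# Sketch: Arrange test data in the source, Act by running the ETL,
# Assert on the warehouse contents.
import sqlite3

def run_etl(src, dwh):
    # stand-in ETL: aggregate order amounts per customer into a fact table
    rows = src.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall()
    dwh.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

src = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")

# 1. Set up test data in the source system (Arrange)
src.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [("acme", 10.0), ("acme", 5.0), ("beta", 7.5)])
dwh.execute("CREATE TABLE fact_sales (customer TEXT, total REAL)")

# 2. Execute the ETL process to load the target (Act)
run_etl(src, dwh)

# 3./4. View and validate the data in the target system (Assert)
totals = dict(dwh.execute("SELECT customer, total FROM fact_sales"))
print(totals)
```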
I'm building a small specialized search engine for price info. The engine will only collect specific segments of data on each site. My plan is to split the process into two steps.
Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example, the price of a product). My problem is that this script somehow has to be unique for each site I pull from, it has to be able to handle really ugly HTML (so I don't think XSLT will do...), and I need to be able to change it on the fly as the target sites update and change. I will finally take the specific values and write them to a database to make them searchable.
Could you please give me some hints on the best way to architect this? Would you do it differently than described above?
Well, I would go with the way you describe.
1.
How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2.
I would go for HtmlAgilityPack for HTML parsing. It's very forgiving and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites.
I'm on the run and going to expand on this answer asap.
Yes, a WebClient can work well for this. The WebBrowser control will work as well, depending on your requirements. If you are going to load the document into an HtmlDocument (the IE HTML DOM), then it might be easier to use the WebBrowser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml DLL as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most ugly HTML that you throw at them, as well as allowing a nicer way to parse the HTML: document.GetElementsByTagName to get a collection of tag objects, for example.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
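A hedged sketch of the strategy-pattern suggestion: one extraction strategy per site, looked up by host, so a change on one site only touches its own strategy (a real system might load these via reflection or a plugin mechanism, as the answer notes). The site names and regexes are invented:

```python
# Sketch: per-site scraping strategies selected by host name.
import re
from urllib.parse import urlparse

STRATEGIES = {
    "shop-a.example": lambda html: re.search(r'id="price">([\d.]+)', html),
    "shop-b.example": lambda html: re.search(r'class="cost">([\d.]+)', html),
}

def extract_price(url, html):
    host = urlparse(url).netloc
    strategy = STRATEGIES.get(host)
    if strategy is None:
        raise KeyError(f"no scraping strategy registered for {host}")
    m = strategy(html)
    return float(m.group(1)) if m else None

price = extract_price("http://shop-a.example/item",
                      '<span id="price">9.99</span>')
print(price)
```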
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically, it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites grows larger and larger. Keeping the XML up to date, and trying to keep a generic set of XML and code that handles any type of site, is difficult. But if the type and number of sites are small, then this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable in the code I have worked on in the past. Implementing a type of pipeline could be a good approach if you think the domain is complex enough to warrant it, but even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML that is out there continues to amaze me...
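The cleaning-pipeline idea can be sketched as an ordered list of regex substitutions applied before parsing. The specific rules here (dropping images, mapping `<b>` to `<strong>`, collapsing whitespace) are purely illustrative:

```python
# Sketch: a small regex-based cleaning pipeline run over raw HTML
# before it reaches the parser.
import re

CLEANING_STEPS = [
    (re.compile(r"<img[^>]*>", re.I), ""),                         # drop images
    (re.compile(r"<b>(.*?)</b>", re.I | re.S), r"<strong>\1</strong>"),
    (re.compile(r"\s+"), " "),                                     # collapse whitespace
]

def clean(html):
    for pattern, replacement in CLEANING_STEPS:
        html = pattern.sub(replacement, html)
    return html.strip()

cleaned = clean('<img src="x.gif">  <b>Deal!</b>   ')
print(cleaned)  # → <strong>Deal!</strong>
```

Because each step is just an entry in a list, new fix-ups for newly discovered dodgy HTML can be appended without touching the parser itself.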