In ASP.NET we had Request Validation but in ASP.NET Core there is no such thing.
How can we protect an ASP.NET Core app against XSS in the best way?
Request validation gone:
https://nvisium.com/resources/blog/2017/08/08/dude-wheres-my-request-validation.html
The author recommends a RegEx on models, like:
[RegularExpression(@"^[a-zA-Z0-9 -']*$", ErrorMessage = "Invalid characters detected")]
public string Name { get; set; }
...but that does not work for globalization/internationalization, i.e. non-Latin characters like æ, ø, å, 汉字.
The X-XSS-Protection header provides only >limited< XSS protection: https://dotnetcoretutorials.com/2017/01/10/set-x-xss-protection-asp-net-core/ It can be set like this, but browser support is limited afaik:
public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
app.Use(async (context, next) =>
{
context.Response.Headers.Add("X-Xss-Protection", "1");
await next();
});
app.UseMvc();
}
The documentation from Microsoft is two years old: https://learn.microsoft.com/en-us/aspnet/core/security/cross-site-scripting?view=aspnetcore-2.1 and does not really cover it.
I am thinking to do something simple like:
myField = myField.Replace("<", "").Replace(">", "").Replace("&", "").Repl...;
on all data submission - but it seems kind of wonky.
I have asked Microsoft the same question, but I am interested to hear how people are solving this problem in real-life applications.
Update: what we are trying to accomplish:
In our application, we have webforms where people can input names, email, content and similar. The data is stored in a database and will be viewed on a frontend system and possibly other systems in the future (like RSS feeds, JSON, whatever). Some forms contain rich-text editors (TinyMCE) and allow users to markup their texts. Malicious users could enter <script>alert('evil stuff');</script> in the fields. What is the best way to strip the evil characters in ASP.NET Core before it reaches the database - I prefer evil scripts not to be stored in the database at all.
I figured something like this could work:
const string RegExInvalidCharacters = @"[^&<>\""'/]*$";
[RegularExpression(RegExInvalidCharacters, ErrorMessage = "InvalidCharacters")]
public string Name { get; set; }
[RegularExpression(RegExInvalidCharacters, ErrorMessage = "InvalidCharacters")]
public string Content { get; set; }
...
You can use the HtmlSanitizer NuGet package in ASP.NET Core.
One of the best ways to prevent stored/reflected XSS is to HTML-encode the output. You may also encode before you store it in the DB, since you don't need the output from these fields to be in HTML anyway.
The solution with the Regex won't always work, because it relies on a blacklist. It's always better and more secure to rely on a whitelist (which you don't need in this case) or to HTML-encode the output if possible.
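A minimal sketch of encoding at output time, assuming the raw value is stored as submitted and only encoded when it is written into HTML. `System.Net.WebUtility.HtmlEncode` is part of the base class library; the `OutputEncoding` class and its `RenderComment` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

public static class OutputEncoding
{
    // Hypothetical helper: wraps untrusted text in a paragraph,
    // HTML-encoding it first so markup characters are rendered as text.
    public static string RenderComment(string untrusted)
    {
        return "<p>" + WebUtility.HtmlEncode(untrusted) + "</p>";
    }
}

public static class Program
{
    public static void Main()
    {
        // The script tag is written out as inert text instead of being executed.
        Console.WriteLine(OutputEncoding.RenderComment("<script>x</script>"));
        // prints: <p>&lt;script&gt;x&lt;/script&gt;</p>
    }
}
```

Because nothing is stripped or rejected on input, this approach does not interfere with names that contain characters such as æ, ø, å or 汉字.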
I know this is a year old now, but for your (and others') reference, you may want to look at creating a ResourceFilter or Middleware which will sanitize incoming requests. You can use whatever tools or custom code you want there; the key is that all incoming requests go through this filter to be sanitized of bad data.
Be sure to use the correct filter for your application/need. A ResourceFilter will run before model binding, and an ActionFilter will run after model binding.
Specifically, what are you trying to do here? Prevent posts containing content that, if rendered unsanitised, could result in an XSS attack?
If so, as I recently discussed with a colleague, you kind of can't, depending on your site.
You can provide client-side restrictions on the data posted, but this can obviously be bypassed, so what's your action trying to do? Prevent content being posted that, when rendered unsanitised, is a potential XSS risk?
What is your post endpoint responsible for? Is it responsible for how other systems may render some output it has received?
I would argue your main XSS risk is in how an app renders your data. If you're not sanitising/encoding output based on the app that is using the data then you're probably doing it wrong.
Remember that a potential XSS issue is only a real issue if you're outputting something to a webpage or similar. This is not really the problem of the endpoint that receives the data.
I know this has been answered already, and quite adequately, but I want to add my answer as a way of getting some feedback as well. This is what I have done in this case, which is very similar to the question at hand.
Scenario
We have a front-end application built in Vue.js wrapped in "Quasar.dev", which is being served by a .NET 5.0 API on the back side.
Solution
We have "public models" which are objects we use to send to the client application. They are not named and formatted as their database fields. We use "AutoMapper" to map from database fields (or models) to public fields (or models). Once the public model has been mapped we then send the public model to the user. Then inside the AutoMapper profile, I added an extension to the String object called .Encoded() which encodes whatever comes from the database into the public model field using the following code...
public static string Encoded(this string value)
{
return HttpUtility.HtmlEncode(value);
}
Now, every time I add new profiles for public models, I just make sure to add that call for encoding on strings I do not trust.
How about that?
I have just taken over an ASP.NET MVC project and some refactoring is required, but I wanted to get some thoughts / advice for best practices.
The site has an SQL Server backend and here is a review of the projects inside the solution:
DomainObjects (one class per database table)
DomainORM (mapping code from objects to DB)
Models (business logic)
MVC (regular ASP.NET MVC web setup)
---- Controllers
---- ViewModels
---- Views
---- Scripts
The first "issue" I see is that while the Domain objects classes are pretty much POCO with some extra "get" properties around calculated fields, there is some presentation code in the Domain Objects. For example, inside the DomainObjects project, there is a Person object and I see this property on that class:
public class Person
{
public virtual string NameIdHTML
{
get
{
return "<a href='/People/Detail/" + Id + "'>" + Name + "</a> (" + Id + ")";
}
}
}
so obviously having HTML-generated content inside the domain object seems wrong.
Refactor Approaches:
My first instinct was to move this to the ViewModel class inside the MVC project, but I see that there are a lot of views that hit this code so I don't want to duplicate code in each view model.
The second idea was to create PersonHTML class that was either:
2a. A wrapper that took in a Person in the constructor or
2b. A class that inherited from Person and then has all of these HTML rendering methods.
The view Model would convert any Person object to a PersonHTML object and use that for all rendering code.
I just wanted to see:
If there is a best practice here as it seems like this is a common problem / pattern that comes up
How bad is this current state considered because besides feeling wrong, it is not really causing any major problems understanding the code or creating any bad dependencies. Any arguments to help describe why leaving code in this state is bad from a real practical sense (vs. a theoretical separation of concerns argument) would be helpful as well as there is debate in the team whether it's worth it to change.
I like TBD's comment. It's wrong because you are mixing domain concerns with UI concerns. This causes a coupling that you could avoid.
As for your suggested solutions, I don't really like any of them.
Introducing a view model. Yes, we should use view models, but we don't want to pollute them with HTML code. An example of using a view model would be if you've got a parent object, person type, and you want to show the person type on the screen. You would fill the view model with the person type name and not a full person type object, because you only need the person type name on the screen. Or if your domain model had first and last name separate, but your view calls for FullName, you would populate the view model's FullName and return that to the view.
PersonHtml class. I'm not even sure what that would do. Views are what represent the HTML in an ASP.NET MVC application. You've got two options here:
a. You could create a display template for your model. Here's a link to a Stack Overflow question about display templates: How to make display template in MVC 4 project
b. You could also write an HtmlHelper method that would generate the correct HTML for you, something like @Html.DisplayNameLink(...). Those would be your best options. Here's a link for understanding HtmlHelpers: https://download.microsoft.com/download/1/1/f/11f721aa-d749-4ed7-bb89-a681b68894e6/ASPNET_MVC_Tutorial_9_CS.pdf
I've wrestled with this myself. When I had code in views that was more logic-based than HTML, I created an enhanced version of the HtmlBuilder. I extended certain domain objects to automatically print out this helper, with its contents based off of domain functions, that could then just be printed onto a view. However, the code becomes very cluttered and unreadable (especially when you're trying to figure out where it came from); for these reasons, I suggest removing as much presentation/view logic from the domain as possible.
However, after that I decided to take another look at Display and Editor Templates. And I've come to appreciate them more, especially when combined with T4MVC, FluentValidation, and custom Metadata Providers, among other things. I've found using HtmlHelpers and extending the metadata or routing table to be a much cleaner way of doing things, but you also start playing with systems that are less documented. However, this case is relatively simple.
So, first off, I would ensure you have a route defined for that entity, which it looks like you would with the default MVC route, so you can simply do this in a view:
//somewhere in the view, set the values to the desired value for the person you have
@{
var id = 10; //random id
var name = "random name";
}
//later:
@name ( @id )
Or, with T4MVC:
@name ( @id )
This means, with regards to the views/viewmodels, the only dependency these have is the id and name of the Person, which I would presume your existing view models ought to have (removing that ugly var id = x from above):
<a href="@Url.Action("Detail", "People", new { id = Model.PersonId } )">
@Model.Name ( @Model.PersonId )
</a>
Or, with T4MVC:
<a href="@Url.Action( MVC.People.Detail( Model.PersonId ) )">
@Model.Name ( @Model.PersonId )
</a>
Now, as you said, several views consume this code, so you would need to change the views to conform to the above. There are other ways to do it, but every suggestion I have would require changing the views, and I believe this is the cleanest way. This also has the benefit of using the route table, meaning that if the routing system is updated, then the updated URL would print out here without worries, as opposed to hard-coding it in the domain object as a URL (which depends on the route system having been set up in a specific manner for that URL to work).
One of my other suggestions would be to build an Html Helper, called Html.LinkFor( c => model ) or something like that, but unless you want it to dynamically determine the controller/action based off the type, that is kind of unnecessary.
How bad is this current state considered because besides feeling wrong, it is not really causing any major problems understanding the code or creating any bad dependencies.
The current state is very bad, and not only because UI code is included in domain code. That would already be pretty bad, but this is worse. The NameIdHTML property returns a hardcoded link to the person's UI page. Even in UI code you should not hardcode these kinds of links. That is what LinkExtensions.ActionLink and UrlHelper.Action are for.
If you change your controller or your route, the link will change. LinkExtensions and UrlHelper are aware of this and you don't need any further changes. When you use a hardcoded link, you need to find all places in your code where such a link is hardcoded (and you need to be aware that those places exist). To make matters even worse, the code you need to change is in the business logic which is in the opposite direction of the chain of dependencies. This is a maintenance nightmare and a major source of errors. You need to change this.
If there is a best practice here as it seems like this is a common problem / pattern that comes up.
Yes, there is a best practice and that is using the mentioned LinkExtensions.ActionLink and UrlHelper.Action methods whenever you need a link to a page returned by a controller action. The bad news is that this means changes at multiple spots in your solution. The good news is that it's easy to find those spots: just remove the NameIdHTML property and the errors will pop up. Unless you are accessing the property by reflection. You will need to do a more careful code search in this case.
You will need to replace NameIdHTML by code that uses LinkExtensions.ActionLink or UrlHelper.Action to create the link. I assume that NameIdHTML returns HTML code that should be used whenever this person is to be shown on an HTML page. I also assume that this is a common pattern in your code. If my assumption is true, you can create a helper class that converts business objects to their HTML representations. You can add extension methods to that class that will provide the HTML representation of your objects. To make my point clear I assume (hypothetically), that you have a Department class that also has Name and Id and that has a similar HTML representation. You can then overload your conversion method:
public static class BusinessToHtmlHelper {
public static MvcHtmlString FromBusinessObject( this HtmlHelper html, Person person) {
string personLink = html.ActionLink(person.Name, "Detail", "People",
new { id = person.Id }, null).ToHtmlString();
return new MvcHtmlString(personLink + " (" + person.Id + ")");
}
public static MvcHtmlString FromBusinessObject( this HtmlHelper html,
Department department) {
string departmentLink = html.ActionLink(department.Name, "Detail", "Departments",
new { id = department.Id }, null).ToHtmlString();
return new MvcHtmlString(departmentLink + " (" + department.Id + ")");
}
}
In your views you need to replace NameIdHTML by a call to this helper method. For example this code...
@person.NameIdHTML
...would need to be replaced by this:
@Html.FromBusinessObject(person)
That would also keep your views clean and if you decide to change the visual representation of Person you can easily change BusinessToHtmlHelper.FromBusinessObject without changing any views. Also, changes to your route or controllers will be automatically reflected by the generated links. And the UI logic remains with the UI code, while business code stays clean.
If you want to keep your code completely free from HTML, you can create a display template for your person. The advantage is that all your HTML is with the views, with the disadvantage of needing a display template for each type of HTML link you want to create. For Person the display template would look something like this:
@model Person
@Html.ActionLink(Model.Name, "Detail", "People", new { id = Model.Id }, null) ( @Html.DisplayFor(p => p.Id) )
You would have to replace your references to person.NameIdHTML by this code (assuming your model contains a Person property of type Person):
@Html.DisplayFor(m => m.Person)
You can also add display templates later. You can create BusinessToHtmlHelper first and as a second refactoring step in the future, you change the helper class after introducing display templates (like the one above):
public static class BusinessToHtmlHelper {
public static MvcHtmlString FromBusinessObject<T>( this HtmlHelper<T> html, Person person) {
return html.DisplayFor( m => person );
}
//...
}
If you were careful only to use links created by BusinessToHtmlHelper, there will be no further changes required to your views.
It's not easy to provide a perfect answer to this issue. Although a total separation of layers is desirable, it often causes a lot of useless engineering problems.
Although everyone is ok with the fact that the business layer must not know too much about the presentation/UI layer, I think it's acceptable for it to know these layers do exist, of course without too many details.
Once you have declared that, then you can use a very underused interface: IFormattable. This is the interface that string.Format uses.
So, for example, you could first define your Person class like this:
public class Person : IFormattable
{
public string Id { get; set; }
public string Name { get; set; }
public override string ToString()
{
// reroute standard method to IFormattable one
return ToString(null, null);
}
public virtual string ToString(string format, IFormatProvider formatProvider)
{
if (format == null)
return Name;
if (format == "I")
return Id;
// note WebUtility is now defined in System.Net so you don't need a reference on "web" oriented assemblies
if (format == "A")
return string.Format(formatProvider, "<a href='/People/Detail/{0}'>{1}</a>", WebUtility.UrlEncode(Id), WebUtility.HtmlEncode(Name));
// implement other smart formats
return Name;
}
}
This is not perfect, but at least you'll be able to avoid defining hundreds of specialized properties and keep the presentation details in a ToString method that was meant specifically for presentation details.
From calling code, you would use it like this:
string.Format("{0:A}", myPerson);
or use MVC's HtmlHelper.FormatValue. A lot of APIs in .NET support IFormattable (StringBuilder.AppendFormat, for example).
You can refine the system, and do this instead:
public virtual string ToString(string format, IFormatProvider formatProvider)
{
...
if (format.StartsWith("A"))
{
string url = format.Substring(1);
return string.Format(formatProvider, "<a href='{0}{1}'>{2}</a>", url, WebUtility.UrlEncode(Id), WebUtility.HtmlEncode(Name));
}
...
return Name;
}
You would use it like this:
string.Format("{0:A/People/Detail/}", person)
So you don't hardcode the url in the business layer. With the web as a presentation layer, you'll usually have to pass a CSS class name in the format to avoid hardcoding style in the business layer. In fact, you can come up with quite sophisticated formats. After all, this is what's done with objects such as DateTime if you think about it.
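As a rough sketch of carrying both the URL prefix and a CSS class in the format string: the layout "A<urlPrefix>|<cssClass>" used here is an assumption made up for this example, not an established convention, and the class name "person-link" is hypothetical. The name is HTML-encoded before it is written into the markup.

```csharp
using System;
using System.Net;

public class Person : IFormattable
{
    public string Id { get; set; }
    public string Name { get; set; }

    public override string ToString() => ToString(null, null);

    // Assumed (hypothetical) format layout: "A<urlPrefix>" or "A<urlPrefix>|<cssClass>".
    // Any other format, or none, falls back to the plain name.
    public string ToString(string format, IFormatProvider formatProvider)
    {
        if (string.IsNullOrEmpty(format) || !format.StartsWith("A", StringComparison.Ordinal))
            return Name;

        string[] parts = format.Substring(1).Split('|');
        string url = parts[0] + WebUtility.UrlEncode(Id);
        string css = parts.Length > 1
            ? " class='" + WebUtility.HtmlEncode(parts[1]) + "'"
            : "";

        return "<a href='" + WebUtility.HtmlEncode(url) + "'" + css + ">"
            + WebUtility.HtmlEncode(Name) + "</a>";
    }
}

public static class Program
{
    public static void Main()
    {
        var person = new Person { Id = "42", Name = "Ann & Bob" };
        Console.WriteLine(string.Format("{0:A/People/Detail/|person-link}", person));
        // prints: <a href='/People/Detail/42' class='person-link'>Ann &amp; Bob</a>
    }
}
```

The business class still knows that an anchor tag exists, but the URL and the styling hook both come from the caller.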
You can even go further and use some ambient/static property that tells you if you're running in a web context so it works automatically, like this:
public class Address : IFormattable
{
public string Recipient { get; set; }
public string Line1 { get; set; }
public string Line2 { get; set; }
public string ZipCode { get; set; }
public string City { get; set; }
public string Country { get; set; }
....
public virtual string ToString(string format, IFormatProvider formatProvider)
{
// http://stackoverflow.com/questions/3179716/how-determine-if-application-is-web-application
if ((format == null && InWebContext) || format == "H")
return string.Join("<br/>", Recipient, Line1, Line2, ZipCode + " " + City, Country);
return string.Join(Environment.NewLine, Recipient, Line1, Line2, ZipCode + " " + City, Country);
}
}
Ideally, you will want to refactor your code to use view models. The view models can have utility methods for simple string formatting e.g.
public string FullName => $"{FirstName} {LastName}";
But strictly NO HTML! (Being a good citizen :D)
You can then create various Editor/Display templates in the following directories:
Views/Shared/EditorTemplates
Views/Shared/DisplayTemplates
Name the templates after the model object type, e.g.
AddressViewModel.cshtml
You can then use the following to render display/editor templates:
@Html.DisplayFor(m => m.Address)
@Html.EditorFor(m => m.Address)
If the property type is AddressViewModel, then the AddressViewModel.cshtml from the EditorTemplates, or DisplayTemplates directory will be used.
You can further control the rendering by passing in options to the template like so:
@Html.DisplayFor(m => m.Address, new { show_property_name = false, ... })
You can access these values in the template cshtml file like so:
@{
var showPropertyName = ViewData.ContainsKey("show_property_name") ? (bool)ViewData["show_property_name"] : true;
...
}
@if(showPropertyName)
{
@Html.TextBoxFor(m => m.PropertyName)
}
This allows for a lot of flexibility, but also the ability to override the template that is used by applying the UIHint attribute to the property like so:
[UIHint("PostalAddress")]
public AddressViewModel Address { get; set; }
The DisplayFor/EditorFor methods will now look for the 'PostalAddress.cshtml' template file, which is just another template file like AddressViewModel.cshtml.
I always break down UI into templates like this for projects that I work on, as you can package them via NuGet and use them in other projects.
You could also add them to a new class library project and have them compiled into a DLL, which you can just reference in your MVC projects. I have used RazorFileGenerator to do this previously (http://blog.davidebbo.com/2011/06/precompile-your-mvc-views-using.html), but now prefer using NuGet packages, as it allows for versioning of the views.
I guess you need to have a plan before you change it. Yes, projects that you mentioned don't sound correct, but that does not mean the new plan is better.
First, existing projects (this will help you see what to avoid):
DomainObjects containing database tables? That sounds like DAL. I'm assuming that those objects are actually stored in the DB (e.g. if they are Entity Framework classes) and not mapped from them (e.g. using Entity Framework and then mapping results back to these objects), otherwise you have too many mappings (1 from EF to data objects, and 2 from data objects to Models). I've seen that done; it's a very typical mistake in layering. So if you have that, don't repeat that. Also, don't name projects containing data row objects as DomainObjects. Domain means Model.
DomainORM - Ok, but I'd combine it with the data row objects. Makes no sense to keep the mapping project separate, if it's tightly coupled with data objects anyway. It's like pretending you can replace one without the other.
Models - good name, it could mention Domain too, that way nobody would name other projects with this very important word.
NameIdHTML property - bad idea on business objects. But that's a minor refactoring - move that into a method that lives somewhere else, not inside your business logic.
Business objects looking like DTOs - also bad idea. Then what's the point of the business logic? my own article on this: How to Design Business Objects
Now, what you need to target (if you are ready to refactor):
Business logic hosting project should be platform independent - no mention of HTML, or HTTP, or anything related to a concrete platform.
DAL - should reference business logic (not other way around), and should be responsible for mapping as well as holding the data objects.
MVC - keep thin controllers by moving logic out to business logic (where the logic is really a business logic), or into so called Service layer (a.k.a. Application logic layer - optional and exists if necessary to put application specific code out of controllers).
My own article on layering: Layering Software Architecture
Real reasons to do so:
Reusable business logic on potentially several platforms (today you are web only, tomorrow you can be web and services, or desktop too). All different platforms should be using same business logic ideally, if they belong to the same bounded context.
Manageable complexity in the long run, which is the well-known factor for choosing something like DDD (domain-driven) vs data-driven design. It comes with a learning curve, so you invest in it initially. In the long run, you keep your maintenance cost low, like receiving premiums perpetually. Beware of your opponents: they will argue it's completely different from what they've been doing, and it will seem complex to them (due to the learning curve and thinking iteratively to maintain the good design in the long run).
First consider your goal and Kent Beck's points about Economics of Software Development. Probably, the goal of your software is to deliver value and you should spend your time on doing something valuable.
Second, wear your Software Architect's hat, and make some kind of calculation. This is how you back up the choice to spend resources on this or to spend on something else.
Leaving code in that state would be bad, if within the next 2 years it were going to:
increase the number of unhappy customers
reduce your company's revenue
increase the number of software failure/bugs/crashes
increase cost of maintenance or change of your code
surprise developers, causing them to waste hours of time in a misunderstanding
increase the cost of onboarding new developers
If these things are unlikely to happen as a result of the code then don't waste your team's life on straightening pencils. If you can't identify a real negative cost-consequence of the code, then the code is probably okay and it's your theory that should change.
My guess would be “The cost of changing this code is probably higher than the cost of problems it causes.” But you are better placed to guess actual cost of problems. In your example the cost of change might be quite low. Add this to your option 2 refactor list:
————————————————————————————————————
2c. Use extension methods in the MVC app to add presentation know-how to domain objects with minimal code.
public static class PersonViewExtensions
{
const string NameAndOnePhoneFormat="{0} ({1})";
public static string NameAndOnePhone(this Person person)
{
var phone = person.MobilePhone ?? person.HomePhone ?? person.WorkPhone;
return string.Format(NameAndOnePhoneFormat, person.Name, phone);
}
}
Where you have embedded HTML, the code in @Sefe's answer, using extension methods on the HtmlHelper class, is pretty much what I would do. Doing that is a great feature of ASP.NET MVC.
———————————————————————————————————————
But this approach should be the learned habit of the whole team. Don't ask your boss for a budget for refactoring. Ask your boss for a budget for learning: books, time to do coding katas, budget for taking the team to developer meetups.
Do not, whatever you do, do the amateur-software-architecture-thing of thinking, “this code doesn't conform to X, therefore we must spend time and money changing it even though we can show no concrete value for that expense.”
Ultimately, your goal is to add value. Spending money on learning will add value; spending money on delivering new features or removing bugs may add value; spending money on rewriting working code only adds value if you are genuinely removing defects.
...before everything, I'm doing this out of curiosity only. No real-world application here, just knowledge and tinkering...
ASP.NET Views have properties like Model and ViewData, and even have methods as well.
You can even use @using just like in a regular class.cs file.
I know that it is of type WebViewPage<TModel>
My main question is: is it a class?
It should be because it's a type, but..
I should be able to also do this then (Razor engine):
@{
public class Person
{
//etc...
}
var p = new Person();
}
<span>@p.Name</span>
However I can't.. why?
note: currently a C#, ASP.net beginner.
Sure, you need to use the functions keyword in order to drop down to exposing class-level things like fields, properties, methods, and inner types:
@functions {
public class Person
{
public string Name { get; set; }
}
}
@{
var p = new Person();
}
<span>@p.Name</span>
This will work just fine.
That being said, keep in mind that the only purpose of these inner classes is if you need to define a type only for use within a view. Myself, I've never found a need to do this for classes. However, I have taken advantage of this technique to add new methods that are not syntactically possible with helper methods.
You can't do it because Razor markup is compiled into a sequence of statements inside a method within the generated class derived from WebViewPage or WebViewPage<TModel>, and C# does not allow a class to be declared inside a method body.
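To picture this, here is a heavily simplified, hypothetical stand-in for what the generated class might look like. It is not the real generated code: `GeneratedView` and its `StringBuilder` output are assumptions for illustration, and the real class derives from WebViewPage. The point is only where each kind of Razor block ends up.

```csharp
using System;
using System.Text;

// Hypothetical, simplified stand-in for the class generated from a view.
public class GeneratedView
{
    private readonly StringBuilder _output = new StringBuilder();

    // Content of a @functions block is emitted at class level,
    // where nested types, fields, properties and methods are legal.
    public class Person
    {
        public string Name { get; set; }
    }

    // Content of @{ ... } blocks and the markup itself is emitted inside
    // a method body; a class declaration is not valid C# at this position.
    public string Execute()
    {
        var p = new Person { Name = "Alice" };
        _output.Append("<span>").Append(p.Name).Append("</span>");
        return _output.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(new GeneratedView().Execute());
        // prints: <span>Alice</span>
    }
}
```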
The more important question though, is why would you want to do this? Prefer to keep Razor free of this kind of logic - its job should be to produce layout, not do any kind of business logic or business data transformation. Do all the heavy lifting in your action method and deliver a Model that describes the data required to render the layout in a format that requires only simple Razor markup to process.
There are quite a few tutorials around that describe how to approach MVC and Razor. I dug up this one that is brief but does a reasonable job of covering an end-to-end story that might help you get the idea. It does include using EF to get data as well, which might be more than you were bargaining for - but it's worth a read to get the full picture of how a whole architecture hangs together: http://weblogs.asp.net/shijuvarghese/archive/2011/01/06/developing-web-apps-using-asp-net-mvc-3-razor-and-ef-code-first-part-1.aspx
Yes, Views are classes. They are compiled into a temporary assembly (so they don't have access to internal members of the main assembly, which is good to know when dealing with dynamic/anonymous types).
I think that Razor has a rule that disallows declaring inner classes, haven't checked.
I am creating a generic Windows Form that accepts T and uses reflection with custom attributes to create labels and input controls at run-time.
Example:
class GenericForm<T>: Form where T : ICloneable<T>
{
}
Here's a link to a previous question for the form code: SO Question.
This form could accept the following entity class as an example:
class Vehicle: ICloneable<Vehicle>
{
public int Id { get; set; }
public string Name { get; set; }
public string Description { get; set; }
}
As you could imagine, the magic behind the form would use reflection to determine data types, validation criteria, preferred control types to use, etc.
Rather than re-inventing the wheel, I thought it would be worth asking on SO if anyone knows of such frameworks. Needless to say, I'm looking for something simple rather than a bulky framework.
eXpressApp Framework (XAF) can generate UI on the fly. In a simple case, a programmer will create business entities only, and will not care of UI at all.
As far as I know, there are no frameworks that generate the UI code at runtime. There are plenty of tools (code generators) that do this beforehand. But you wouldn't have the advantage of "only" changing the code - you'd have an extra step where you would need to start the code generator.
If you really want to create the UI information at runtime - I'd generate Attributes for your properties, that would tell your UI generator how to deal with this property (if no Attribute is given - have a default for your datatypes). It's a lot of coding but could save you time for small to medium projects in the future.
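A minimal sketch of that idea, under the assumption that a form generator only needs a label and a control name per property. `UiFieldAttribute`, `UiPlanner` and the control names are hypothetical; the sketch returns a textual plan instead of creating real Windows Forms controls so that it stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical attribute telling the UI generator how to present a property.
[AttributeUsage(AttributeTargets.Property)]
public sealed class UiFieldAttribute : Attribute
{
    public string Label { get; }
    public string Control { get; }

    public UiFieldAttribute(string label, string control)
    {
        Label = label;
        Control = control;
    }
}

public class Vehicle
{
    public int Id { get; set; } // no attribute: the default for int is used

    [UiField("Vehicle name", "TextBox")]
    public string Name { get; set; }

    [UiField("Description", "MultiLineTextBox")]
    public string Description { get; set; }
}

public static class UiPlanner
{
    // Returns one "label -> control" entry per public instance property.
    public static List<string> Describe(Type entityType)
    {
        var plan = new List<string>();
        foreach (PropertyInfo prop in entityType.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            var hint = prop.GetCustomAttribute<UiFieldAttribute>();
            string label = hint != null ? hint.Label : prop.Name;
            string control = hint != null ? hint.Control : DefaultControlFor(prop.PropertyType);
            plan.Add(label + " -> " + control);
        }
        return plan;
    }

    // Fallback when no attribute is given: pick a control from the data type.
    private static string DefaultControlFor(Type type)
    {
        if (type == typeof(bool)) return "CheckBox";
        if (type == typeof(DateTime)) return "DateTimePicker";
        if (type == typeof(int) || type == typeof(decimal) || type == typeof(double)) return "NumericUpDown";
        return "TextBox";
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (string line in UiPlanner.Describe(typeof(Vehicle)))
            Console.WriteLine(line);
    }
}
```

A real generator would replace the strings with control instances and add validation hints to the attribute, but the reflection loop would keep the same shape.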
Another thing you could do is to externalize your UI information into an XML file and have a generator for that one. There's actually a framework that does that - have a look at the re-motion framework. I don't know if the UI part is free, but it has some functionality (i.e. mixins) that could help you fulfil your task.
My requirement is to download and scrape various HTML pages, extracting lists of objects from the code on the page depending on what object type we are looking for on that page. E.g. one page might contain an embedded list of doctors' surgeries, another might contain a list of primary trusts, etc. I have to view the pages one by one and end up with lists of the appropriate object types.
The way I have chosen to do this is to have a Generic class called HTMLParser<T> where T : IEntity, new()
IEntity is the interface that all the object types that can be scraped will implement, though I haven't figured out yet what the interface members will be.
So you will effectively be able to say
HTMLParser<Surgery> parser = new HTMLParser<Surgery>(URL, XSD SCHEMA DOC);
IList<Surgery> results = parser.Parse();
Parse() will validate that the HTML string downloaded from the URL contains a block that conforms to the XSD document provided, then will somehow use this template to extract a List<Surgery> of Surgery objects, each one corresponding to an XML block in the HTML string.
The problems I have are
I'm not sure how to specify the template for each object type in a nice way, other than HTMLParser<Surgery> parser = new HTMLParser<Surgery>(new URI("...."), Surgery.Template); which is a bit clunky. Can anyone suggest a better way using .NET 3.0/4.0?
I'm not sure how, in a generic way, I can take the HTML string, take an XSD or XML template document, and return a generic list of constructed objects of the generic type. Can anyone suggest how to do this?
Finally, I'm not convinced generics are the right solution to this problem as it's starting to seem very convoluted. Would you agree with or condemn my choice of solution here and if not, what would you do instead?
I'm not convinced that generics are the right solution, either. I implemented something very similar to this using good old inheritance, and I still think that's the right tool for the job.
Generics are useful when you want to perform the same operations on different types. Collections are a good example of where generics are very handy.
Inheritance, on the other hand, is useful when you want an object to inherit common functionality, but then extend and/or modify that functionality. Doing that with generics is messy.
My scraper base class looks something like this:
public abstract class ScraperBase
{
    // Common methods for making web requests, etc.

    // When you want to download and scrape a page, you call this:
    public List<string> DownloadAndScrape(string url)
    {
        // Make the request and download the page (WebClient is one option).
        string pageText;
        using (var client = new System.Net.WebClient())
        {
            pageText = client.DownloadString(url);
        }
        // Then call Scrape ...
        return Scrape(pageText);
    }

    // And an abstract Scrape method that returns a List<string>.
    // Inheritors implement this method.
    public abstract List<string> Scrape(string pageText);
}
There's some other stuff in there for logging, error reporting, etc., but that's the gist of it.
Now, let's say I have a WordPress blog scraper:
public class WordpressBlogScraper : ScraperBase
{
    // Just implement the Scrape method.
    public override List<string> Scrape(string pageText)
    {
        var results = new List<string>();
        // Do WordPress-specific parsing and add the extracted data to results.
        return results;
    }
}
And I can do the same thing to write a Blogspot scraper, or a custom scraper for any page, site, or class of data.
I actually tried to do something similar, but rather than using inheritance I used a scraper callback function. Something like:
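public delegate List<string> PageScraperDelegate(string pageText);

public class PageScraper
{
    public List<string> DownloadAndScrape(string url, PageScraperDelegate callback)
    {
        // Download the data into pageText (WebClient is one option).
        string pageText;
        using (var client = new System.Net.WebClient())
        {
            pageText = client.DownloadString(url);
        }
        return callback(pageText);
    }
}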
You can then write:
var myScraper = new PageScraper();
myScraper.DownloadAndScrape("http://example.com/index.html", ScrapeExample);
private List<string> ScrapeExample(string pageText)
{
    var results = new List<string>();
    // Do the scraping here and add the extracted data to results.
    return results;
}
That works reasonably well, and eliminates having to create a new class for every scraper type. However, I found that in my situation it was too limiting. I ended up needing a different class for almost every type of scraper, so I just went ahead and used inheritance.
I would rather focus on your parser/verifier classes, as designing them properly will be crucial to the ease of future usage. I think it's more important to decide how the mechanism will determine which parser/verifier to use based on the input.
Also, what happens when you're told you need to parse yet another type of website, say for Invoice entities? Will you be able to extend your mechanism in two easy steps to handle such a requirement?
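One possible way to make that selection explicit is a small registry in which each parser declares which inputs it handles. This is only a sketch under assumptions: `IPageParser`, `ParserRegistry`, the two parser classes and the URL-path rule are all hypothetical, and the parsing itself is stubbed out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface: each parser states whether it can handle a given input.
public interface IPageParser
{
    bool CanHandle(Uri url);
    List<string> Parse(string pageText);
}

public class SurgeryParser : IPageParser
{
    public bool CanHandle(Uri url)
    {
        return url.AbsolutePath.StartsWith("/surgeries", StringComparison.Ordinal);
    }

    public List<string> Parse(string pageText)
    {
        // Surgery-specific extraction would go here; this stub tags the input.
        return new List<string> { "surgery:" + pageText };
    }
}

public class InvoiceParser : IPageParser
{
    public bool CanHandle(Uri url)
    {
        return url.AbsolutePath.StartsWith("/invoices", StringComparison.Ordinal);
    }

    public List<string> Parse(string pageText)
    {
        // Invoice-specific extraction would go here; this stub tags the input.
        return new List<string> { "invoice:" + pageText };
    }
}

public class ParserRegistry
{
    private readonly List<IPageParser> parsers = new List<IPageParser>();

    public void Register(IPageParser parser)
    {
        parsers.Add(parser);
    }

    // Returns the first registered parser that accepts the URL.
    public IPageParser Resolve(Uri url)
    {
        foreach (IPageParser parser in parsers)
        {
            if (parser.CanHandle(url))
            {
                return parser;
            }
        }
        throw new NotSupportedException("No parser registered for " + url);
    }
}
```

With this shape, supporting a new entity type takes two steps: write a class implementing `IPageParser`, then call `Register` with an instance of it. Whether the URL path is the right discriminator depends on the sites being scraped.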