Best practice to implement the same HTML parser for several different websites?

I'm trying to code an HTML parser in C#. I need to get data from, let's say, 10 gambling websites. I'm trying to figure out the best approach.
At first, I thought of writing one big function that parses all of the websites with a switch statement, but I believe it's overkill and would be too long. I use the HTML Agility Pack, so each implementation will have a similar yet different structure.
What is the best way to implement such a structure?

Make a base class with the common parts and create a sub-class for each different parser. The functions that change from parser to parser can be declared as abstract so they have to be overridden in the different sub-classes.
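As a minimal sketch of that idea: the base class below owns the shared loop and leaves one abstract method for the site-specific part. All names (OddsEntry, SiteParserBase, SiteAParser, SiteBParser) and the row formats are hypothetical, and plain text rows stand in for the nodes you would really get from the HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical result type: what every site parser is expected to produce.
public sealed class OddsEntry
{
    public string Event { get; }
    public decimal Price { get; }

    public OddsEntry(string eventName, decimal price)
    {
        Event = eventName;
        Price = price;
    }
}

public abstract class SiteParserBase
{
    // Common part: walk the rows, skip blank ones, collect the results.
    public IReadOnlyList<OddsEntry> Parse(string page)
    {
        var results = new List<OddsEntry>();
        foreach (var row in page.Split('\n'))
        {
            var trimmed = row.Trim();
            if (trimmed.Length == 0) continue;
            results.Add(ParseRow(trimmed));
        }
        return results;
    }

    // Site-specific part: every subclass has to override this.
    protected abstract OddsEntry ParseRow(string row);
}

// Assumed format for hypothetical site A: "Event;Price".
public sealed class SiteAParser : SiteParserBase
{
    protected override OddsEntry ParseRow(string row)
    {
        var parts = row.Split(';');
        return new OddsEntry(parts[0], decimal.Parse(parts[1], CultureInfo.InvariantCulture));
    }
}

// Assumed format for hypothetical site B: "Price|Event".
public sealed class SiteBParser : SiteParserBase
{
    protected override OddsEntry ParseRow(string row)
    {
        var parts = row.Split('|');
        return new OddsEntry(parts[1], decimal.Parse(parts[0], CultureInfo.InvariantCulture));
    }
}
```

Adding an eleventh site then means adding one small subclass; the shared loop stays untouched.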

You could implement a strategy pattern. It would be something along the lines of having an abstract class (perhaps with some shared methods) that each concrete class extends, overriding the abstract method. Using a factory method you could then select the appropriate concrete class to call for parsing the HTML (perhaps depending on the site URL or some configuration).
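A rough sketch of the factory half of this, under some assumptions: an interface stands in for the abstract class to keep the example short, the host names are made up, and the parsers only return a marker string instead of doing real HTML work.

```csharp
using System;
using System.Collections.Generic;

// The strategy contract (hypothetical name).
public interface ISiteParser
{
    string Parse(string html);
}

// Two placeholder strategies; real ones would hold the site-specific logic.
public sealed class AlphaParser : ISiteParser
{
    public string Parse(string html) => "alpha:" + html.Length;
}

public sealed class BetaParser : ISiteParser
{
    public string Parse(string html) => "beta:" + html.Length;
}

public static class ParserFactory
{
    // Maps a host name to a function that creates the matching parser.
    // The hosts are invented for illustration.
    private static readonly Dictionary<string, Func<ISiteParser>> Registry =
        new Dictionary<string, Func<ISiteParser>>(StringComparer.OrdinalIgnoreCase)
        {
            ["alpha.example"] = () => new AlphaParser(),
            ["beta.example"] = () => new BetaParser(),
        };

    // Selects the concrete parser from the site URL.
    public static ISiteParser Create(Uri siteUrl)
    {
        if (Registry.TryGetValue(siteUrl.Host, out var make))
        {
            return make();
        }
        throw new NotSupportedException("No parser registered for " + siteUrl.Host);
    }
}
```

Calling code would look like ParserFactory.Create(new Uri("https://alpha.example/odds")).Parse(html) and never needs a switch statement of its own.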

There are lots of ways to go with that. Being a start-simple guy, I'd begin by implementing one website/parser combination.
Then I'd look at what was common.
They all have a url.
They will all have some Parse thingy
And presumably you want to extract the same sort of information from each one.
And then you want to do something with that info.
That suggests a website class
A class to navigate through the website and get the page(s)
A parsing class
A parsing information class.
You could use inheritance, though my first thought was an interface.
Either way you should end up with a collection of websites to parse, each one described by its own instance.
From there you could simply do a foreach, you could schedule, you could do them in parallel. More to the point you could add and remove targets, keep going on the others when one of them twiddles with their site, or goes down...
Prove your idea with one site, your infrastructure with two, and batter away at the others, while having deployed something that works and see if anything happens in the real world that you hadn't thought of.
Big bangs are for making universes, not applications.
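To make the "collection of websites" idea concrete, here is a small sketch. IWebsite, StubWebsite and Runner are hypothetical names, and the stub takes a delegate in place of real fetching and parsing. The loop records a failure and keeps going, so one site going down does not stop the others.

```csharp
using System;
using System.Collections.Generic;

// One instance of this describes one target site (hypothetical interface).
public interface IWebsite
{
    Uri Url { get; }
    string FetchAndParse();
}

// Stand-in implementation: the delegate replaces real download/parse work.
public sealed class StubWebsite : IWebsite
{
    private readonly Func<string> _work;

    public StubWebsite(string url, Func<string> work)
    {
        Url = new Uri(url);
        _work = work;
    }

    public Uri Url { get; }

    public string FetchAndParse() => _work();
}

public static class Runner
{
    // Processes every site; a failing site is logged and skipped.
    public static Dictionary<Uri, string> RunAll(IEnumerable<IWebsite> sites, List<string> errors)
    {
        var results = new Dictionary<Uri, string>();
        foreach (var site in sites)
        {
            try
            {
                results[site.Url] = site.FetchAndParse();
            }
            catch (Exception ex)
            {
                // Deliberately broad at this top-level loop: note the error, move on.
                errors.Add(site.Url + ": " + ex.Message);
            }
        }
        return results;
    }
}
```

Because the sites are just items in a collection, the same list could later be handed to a scheduler or a parallel loop without changing the site classes.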

Related

How do I handle similar objects with different properties?

I'm building an RSS client and using the Argotic framework. It provides different classes for different kinds of feeds like Atom, RSS, and OPML. These classes don't inherit from any other class and they don't implement a common interface for accessing their properties.
There is a GenericSyndicationFeed type that implements an overloaded method where you can pass in an AtomFeed or RssFeed. If I want to use the "more" strongly typed classes I would essentially need two code paths (one for Atom and one for RSS) everywhere in my program. Obviously, I'm not going to do this.
There is no documentation from the author other than the API documentation, so I'm kind of at a loss as to why it was implemented this way instead of taking full advantage of the complete classes. One thing that bothers me is that I can't get the authors of an item when using the GenericSyndicationItem type.
What can I do here? Make a wrapper class? Or inherit from the RssFeed and AtomFeed classes and implement an interface to expose the properties I feel should be similar from both?

When you are using a third-party library and the library doesn't meet your architectural needs: adapt! But how?
You've already identified some of your options and there are more:
Wrap existing classes in new classes using the Adapter Pattern
Extend and unify disparate classes by implementing a common interface
Refactor the original code to use Polymorphism natively
If the existing classes really have no common base class at all, then the first two options are both about the same amount of work. Wrapping has the advantage of a little looser coupling in case you ever decide to switch to a different framework. Extending avoids a lot of code like adaptee.AdapteeMethod since you can call base methods without specifying an instance. In this case I would lean towards the adapter pattern unless there is at least some common base class you can exploit through inheritance.
The last serious option is refactoring the code to be more object-oriented, and I only recommend this approach if you are intending to contribute back to the project and have the blessing of the project's author. The reason is that you have working code that you probably don't fully understand, and messing around with it just risks breaking it. Leave the working code alone and adapt it from the outside.
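A minimal sketch of the adapter option follows. LegacyRssFeed and LegacyAtomFeed are invented stand-ins for two unrelated third-party classes; they are not the real Argotic types and their members are assumptions. IFeed is the common interface you would define yourself.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for two unrelated library classes (NOT the real Argotic API).
public sealed class LegacyRssFeed
{
    public string ChannelTitle = "RSS title";
    public string ManagingEditor = "alice@example.com";
}

public sealed class LegacyAtomFeed
{
    public string Title = "Atom title";
    public List<string> AuthorNames = new List<string> { "Bob", "Carol" };
}

// The common view the rest of the program codes against.
public interface IFeed
{
    string Title { get; }
    IReadOnlyList<string> Authors { get; }
}

// Each adapter wraps one library type and translates its members.
public sealed class RssFeedAdapter : IFeed
{
    private readonly LegacyRssFeed _adaptee;

    public RssFeedAdapter(LegacyRssFeed adaptee) { _adaptee = adaptee; }

    public string Title => _adaptee.ChannelTitle;

    public IReadOnlyList<string> Authors => new[] { _adaptee.ManagingEditor };
}

public sealed class AtomFeedAdapter : IFeed
{
    private readonly LegacyAtomFeed _adaptee;

    public AtomFeedAdapter(LegacyAtomFeed adaptee) { _adaptee = adaptee; }

    public string Title => _adaptee.Title;

    public IReadOnlyList<string> Authors => _adaptee.AuthorNames;
}
```

With this in place there is a single code path over IFeed, and swapping the underlying library later only touches the adapters.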

It has been a really long time since I wrote Argotic (it was written before System.ServiceModel.Syndication existed in .NET), but since the concept of author exists in both RSS 2.0 and Atom, I don't really recall why the generic feed item did not include an Authors collection. It may have been because the outline elements in an OPML document do not have the concept of author. Poor design on my part obviously.
The bottom line is that I was still young and learning, and Argotic, while useful when it was written 3 years ago, is woefully in need of a major refactoring. If System.ServiceModel.Syndication can fulfill your needs, I recommend you use that to parse your syndication feeds.
Since you have the full source code to Argotic and it is not fulfilling your needs, you could add an Authors collection to the generic syndication item class and populate it when consuming an RSS or Atom feed.
You most definitely have my blessing to refactor as you see fit, regardless of whether you contribute back. I passed off project responsibilities years ago and am not sure what state it is in these days.
That said, if you know the format of the feed prior to consuming it, you can do one of the following:
// When the feed is known to be RSS:
RssFeed rssFeed = RssFeed.Create(new Uri("http://www.pwop.com/feed.aspx?show=dotnetrocks&filetype=master"));
// When the feed is known to be Atom:
AtomFeed atomFeed = AtomFeed.Create(new Uri("http://news.google.com/?output=atom"));

Proper usage of XML inline documentation for derived classes?

While I think I understand why inline XML documentation (i.e. using three slashes - ///) isn't working for me, I'd like to get some guidance on how to work around my "problem".
I have an interface, and two derived classes. One derived class is for simulation, and the other is for talking to real hardware.
It's very likely that the hardware implementation would do something special that the simulator doesn't need to do. I have XML documentation for the hardware methods, and not for the simulator. However, when I hover over the method name, I don't get documentation in the tooltip at all, presumably because the XML docs aren't associated with the interface.
This certainly makes sense, and I plan to just put my documentation in the interface instead and live with it. I am still curious, though... how does everyone else do this? Is there some magical way to make the tooltip aggregate all of the valid XML docs? In other words, since the compiler doesn't know which derived class is being used, is there a way for it to show XML docs for all classes that implement this interface?

This won't solve all your problems but GhostDoc can quickly insert documentation into a derived class using the base class documentation. It's worth taking a look anyway if you're doing XML documentation.

Since you are programming to an interface, there is no way to pass through the XML documentation from the implementation. The separation means that the two "sides" don't know about each other. Like you said, you could have two different implementations of that interface. In that case, you would have a conflict. That isn't a big deal for two, but what about 200? Besides, the point of using an interface is that you don't care how it is implemented. You know that when you call through an interface, the implementation will follow the contract. Use the XML comments on the interface to describe the contract, not the implementation of the contract.
I can feel your pain on this one and I'm not sure that there is a better solution.
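For illustration, a sketch of documenting the contract on the interface. The IMotor, SimulatedMotor and HardwareMotor names and the travel limit are hypothetical. The inheritdoc tag is a documentation tag that newer tooling understands; whether the tooltip actually picks it up depends on your IDE and documentation generator, so treat that part as an assumption to verify.

```csharp
using System;

public interface IMotor
{
    /// <summary>Gets the last commanded position, in millimetres.</summary>
    double Position { get; }

    /// <summary>Moves the axis to the given position.</summary>
    /// <param name="positionMm">Target position, in millimetres.</param>
    void MoveTo(double positionMm);
}

public sealed class SimulatedMotor : IMotor
{
    /// <inheritdoc />
    public double Position { get; private set; }

    /// <inheritdoc />
    public void MoveTo(double positionMm)
    {
        Position = positionMm;
    }
}

public sealed class HardwareMotor : IMotor
{
    // Hypothetical travel limit of the physical axis.
    private const double MaxTravelMm = 250.0;

    /// <inheritdoc />
    public double Position { get; private set; }

    /// <inheritdoc />
    /// <remarks>Implementation detail: clamps the target to the axis travel range.</remarks>
    public void MoveTo(double positionMm)
    {
        Position = Math.Max(0.0, Math.Min(MaxTravelMm, positionMm));
    }
}
```

A caller holding an IMotor sees the interface's documentation, which is the contract; hardware-only notes stay on the hardware class for whoever reads that class directly.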

class definition and implementation in C# vs C++

With C++, I can have one class definition in a header file, and have a multiple implementation files by including the header file.
With C#, it seems that there is no such header file, as one class should contain both definition/implementation.
I wonder if the number of lines can be very big, because one can't separate the class into multiple files. Am I correct? I mean, in some cases, one can't change the class design to have smaller classes. In this case, is there a way to solve this problem?

You can separate a class into multiple files using the partial keyword
public partial class ClassNameHere
{
}
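A slightly fuller sketch of what that looks like in practice. The Inventory class and the file names in the comments are made up; in a real project each part would live in its own file, and the compiler merges them into one class.

```csharp
// File: Inventory.Storage.cs (hypothetical file name)
using System.Collections.Generic;

public partial class Inventory
{
    // State declared in this part is visible to every other part.
    private readonly Dictionary<string, int> _stock = new Dictionary<string, int>();

    public void Add(string item, int quantity)
    {
        // CountOf is defined in the other part of the same class.
        _stock[item] = CountOf(item) + quantity;
    }
}

// File: Inventory.Reporting.cs (hypothetical file name)
public partial class Inventory
{
    public int CountOf(string item)
    {
        return _stock.TryGetValue(item, out var quantity) ? quantity : 0;
    }
}
```

Note that this splits one class across files by topic; it does not separate declaration from implementation the way a C++ header does.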

It is possible to split the definition of a class, a struct, or an interface over two or more source files using the partial keyword modifier; see the MSDN documentation on partial classes.

Partial classes only give you so much. There is still no way, that I know of, to split your class definition from its implementation such that each exists in a separate file. So if you like to develop based on a need-to-know paradigm, then you are sort of stuck. Basically there are three levels a developer can work at...
1) Owns all the code, has access to all of it, and maintains all of it.
2) Wishes to use some useful base classes, which may form part of a framework or may just be useful classes with some virtual methods, and wishes to extend or re-implement some virtual base class methods of interest. Now the developer should not need to go and look at the code in the base classes in order to understand things at a functional level. If you understand the job of a function and its input and output parameters, there is no need to go and scratch inside source code. If you think there's a bug, or an optimization is needed, then refer to the developer from 1) who owns and maintains the base code. Of course there's nothing saying that 1) and 2) cannot be the same developer, in which case we have no problem. In fact, I suspect this is more often than not the case. Nevertheless, it is still good practice to keep things well separated according to the level at which you are working.
3) Needs to use an already packaged / sealed object / component DLL, which exposes the relevant interfaces.
Within the context of C#, 1) and 3) have no problems. With 2) I believe there is no way to get round this (unless you change from exposing virtual base methods to exposing interface methods which can be reimplemented in a component owning the would-be base class). If I want to have a look at a class definition to browse over the methods, scaffolding functions, etc., I have to look at a whole lot of source code as well, which just gets in the way of what I am trying to focus on.
Of course, if there is class definition documentation external to how we normally do it (in headers and source files), then I must admit that, within the context of 2), there is no reason to ever look into a class definition file to gain functional knowledge.
So maybe the clever Toms who came up with C# decided to mix class definition with implementation in an attempt to encourage developers to keep external documents for their class definitions and interfaces, which most IT companies severely lack.

Use a partial class as @sparks suggests, or split it into several classes. It's a good rule of thumb that, if you can't fit a class onto a couple of pages, it's complicated enough to need breaking apart.

Design Perspective: Static Methods vs. Classes

Although this is a fairly common problem, I am struggling with the best way to approach it (if it needs to be approached at all in this case).
I have inherited a website (ASP.NET, C#) part of which contains a class full of static methods (it's a very large class, honestly). One method in particular is for sending e-mails. It has every possible parameter I can think of and it works well enough. However, the internals of that particular method are rather cumbersome to manage and understand due to the fact that everything is shoved inside - particularly when most of the parameters aren't used. In addition, it is somewhat difficult to handle errors, again, due to all the parameters for this one method.
Would it make more sense to actually have an EMail class which is instantiated when you want to send an e-mail? This just "feels" more right to me, though I can't fully explain why. What are your thoughts on the way to go in this particular case? How about in general?
Thanks.

What you're describing sounds like an example of the aphorism, "You can write FORTRAN in any language."
A massive class full of static methods is often (not always) a sign that somebody just didn't "get" OOP, was stuck in a procedural-programming mindset and was trying to twist the language to do what he wanted.
As a rule of thumb: If any method, static or instance, takes more than about 5 parameters, it's often a sign that the method is trying to do too many things at once, and is a good candidate for refactoring into one or more classes.
Also, if the static methods are not really related, then they should at least be split up into classes that implement related functionality.
I'm actually wondering why you'd have a "send e-mail" method at all, given that the System.Net.Mail namespace handles just about every case and is configurable via the app.config/web.config file, so you don't need to pass it a server name or port. Is this perchance a "notification" method, something that individual pages are supposed to call out to in order to send one of several "standard" messages based on templates with various values filled in, and certain headers/footers automatically added? If so, there are a number of designs for this type of interaction that are much easier to work with than what you seem to have inherited (e.g. MailDefinition).
Update: Now having seen your comment that this is being used for exception handling, I think that the most appropriate solution is an actual exception handler. There are a ton of resources on this. For ASP.NET WebForms, I actually took the one Jeff Atwood wrote years ago, ported it to C# and made a few changes (like ignoring 404 errors). There are a number of good links in this previous question.
My preference these days is just to treat exception handling (and subsequent e-mailing of exception reports) as a subset of logging. log4net has an SmtpAppender that's quite capable, and you can configure it to only be used for "fatal" errors (i.e. unhandled exceptions - in your handler, you just make a LogFatal call).
The important thing, which you'll no doubt pick up from the SO link above and any referenced links, is that there are actually two anti-patterns here - the "miscellaneous" static class, and catching exceptions that you don't know how to handle. This is a poor practice in .NET - in most cases you should only catch application-specific exceptions that you can recover from, and let all other exceptions bubble up, installing a global exception handler if necessary.

Here are the Microsoft guidelines for when to use static types, generally.
Some things I would add, personally:
You must use static types to write extension methods.
Static types can make unit testing hard as they are difficult/impossible to mock.
Static types lend themselves to immutability and referentially transparent functions, which can be a good design (they do not enforce it by themselves, since static fields can still be mutable). So use them for things which are designed to be immutable and have no external dependencies. E.g., System.Math.
Some argue that the Singleton pattern is a bad idea. In any event, it would be wrong to think of static types as Singletons; they're much more broad than that.
This particular case has side-effects (sending e-mails) and doesn't appear to require extension methods. So it doesn't fit into what I would see as the useful case for static types. On the other hand, using an object would allow mocking the e-mail, which would be helpful for a unit test. So I think you're correct to say that a static type is inappropriate here.
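To show what "mocking the e-mail" could look like, here is a small sketch. IEmailSender, ErrorReporter and RecordingEmailSender are hypothetical names; a production implementation of the interface would wrap whatever mail API the site already uses, which is not shown here.

```csharp
using System;
using System.Collections.Generic;

// The seam: code depends on this interface, not on a static method.
public interface IEmailSender
{
    void Send(string to, string subject, string body);
}

// Example consumer that gets its sender injected.
public sealed class ErrorReporter
{
    private readonly IEmailSender _sender;
    private readonly string _adminAddress;

    public ErrorReporter(IEmailSender sender, string adminAddress)
    {
        _sender = sender ?? throw new ArgumentNullException(nameof(sender));
        _adminAddress = adminAddress ?? throw new ArgumentNullException(nameof(adminAddress));
    }

    public void Report(Exception error)
    {
        _sender.Send(_adminAddress, "Unhandled error: " + error.GetType().Name, error.Message);
    }
}

// Test double: records messages instead of sending anything.
public sealed class RecordingEmailSender : IEmailSender
{
    public List<string> Sent { get; } = new List<string>();

    public void Send(string to, string subject, string body)
    {
        Sent.Add(to + "|" + subject + "|" + body);
    }
}
```

A unit test can construct ErrorReporter with the recording sender and assert on Sent, which is not possible when the call goes to a static method directly.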

Oh my gosh yes.
It sounds like it's an old Classic ASP app that was ported.
It violates the single responsibility principle. If you can, refactor that class and use overloading for that function.

That is an example of the Utils anti-pattern.
It is always a good idea to separate those methods according to their responsibility. Creating an Email class is definitely a Good Idea™. It will give you a much nicer interface to use, and it allows you to mock out the Email in tests.

See The Little Manual of API Design, which describes the benefits of classes having minimal constructors and lots of getters/setters over the alternative of using constructor/methods having many parameters.
Since most of the parameters of the methods you mention are not used, a better approach is to use simple constructors that assume reasonable default settings for the internal variables. Having setter methods allows you to then set the few parameters (and only those parameters) that require non-default values.
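A sketch of that shape, with assumed names and defaults: EmailMessage, its properties and the default sender address are all invented for illustration. The constructor takes only what every message needs, and everything else starts at a default that a setter can change.

```csharp
using System;

public sealed class EmailMessage
{
    // Only the values every message must have go through the constructor.
    public EmailMessage(string to, string subject)
    {
        To = to ?? throw new ArgumentNullException(nameof(to));
        Subject = subject ?? throw new ArgumentNullException(nameof(subject));
    }

    public string To { get; }
    public string Subject { get; }

    // Optional settings with (assumed) reasonable defaults.
    public string From { get; set; } = "noreply@example.com";
    public string Body { get; set; } = "";
    public bool IsHtml { get; set; } = false;
}
```

A caller that needs one non-default value sets just that one, for example new EmailMessage("bob@example.com", "Hello") { IsHtml = true }, instead of passing a long list of mostly unused arguments.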

Creating a Catch-All AppToolbox Class - Is this a Bad Practice?

I'm never sure where to place functions like:
String PrettyPhone( String phoneNumber ) // return formatted (999) 999-9999
String EscapeInput( String inputString ) // gets rid of SQL-escapes like '
I create a Toolbox class for each application that serves as a repository for functions that don't neatly fit into another class. I've read that such classes are bad programming practice, specifically bad Object Oriented Design. However, said references seem more the opinion of individual designers and developers more than an over-arching consensus. So my question is, Is a catch-all Toolbox a poor design pattern? If so, why, and what alternative is there?

Great question. I always find that any sufficiently complex project requires "utility" classes. I think this is simply because the nature of object-oriented programming forces us to place things in a neatly structured hierarchical taxonomy, when this isn't always feasible or appropriate (e.g. try creating an object model for mammals, and then squeeze the platypus in). This is the problem which motivates work on aspect-oriented programming (cf. cross-cutting concerns). Often what goes into a utility class are things that are cross-cutting concerns.
One alternative to using toolbox or utility classes is to use extension methods to provide additional needed functionality to primitive types. However, the jury is still out on whether or not that constitutes good software design.
My final word on the subject is: go with it if you need, just make sure that you aren't short-cutting better designs. Of course, you can always refactor later on if you need to.
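As an illustration of the extension-method alternative, here is the question's PrettyPhone written as an extension on string. The class name, method name and the choice to return the input unchanged when it does not contain exactly ten digits are assumptions made for this sketch.

```csharp
using System;
using System.Linq;

public static class PhoneFormattingExtensions
{
    // Formats a value containing exactly 10 ASCII digits as (999) 999-9999.
    // Anything else is returned unchanged.
    public static string ToPrettyPhone(this string phoneNumber)
    {
        if (phoneNumber == null) throw new ArgumentNullException(nameof(phoneNumber));

        var digits = new string(phoneNumber.Where(c => c >= '0' && c <= '9').ToArray());
        if (digits.Length != 10) return phoneNumber;

        return "(" + digits.Substring(0, 3) + ") "
             + digits.Substring(3, 3) + "-"
             + digits.Substring(6);
    }
}
```

The call site then reads input.ToPrettyPhone() rather than Toolbox.PrettyPhone(input). It is still a static method underneath, so this changes discoverability more than design.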

I think a static helper class is the first thing that comes to mind. It is so common that some even refer to it as part of object-oriented design. However, the biggest problem with helper classes is that they tend to become a large dump. I saw this happen on a few of the larger projects I was involved in. You're working on a class and don't know where to stick this and that function, so you put it in your helper class. At that point your helpers don't communicate well what they do. The name 'helper' or 'util' in the class name doesn't mean anything. I think nearly all OO gurus advocate against helpers, since you can very easily replace them with more descriptive classes if you give it enough thought. I tend to agree with this approach, as I believe that helpers violate the single responsibility principle. Honestly, take this with a grain of salt. I'm a little opinionated on OOP :)

In these examples I would be more inclined to extend String:
class PhoneNumber extends String
{
public override string ToString()
{
// return (999) 999-9999
}
}
If you write down all the places you need these functions you can figure out what actually uses it and then add it to the appropriate class. That can sometimes be difficult but still something you should aim for.
EDIT:
As pointed out below, you cannot inherit from String in C# (it is sealed). The point I was trying to make is that this operation is performed on a phone number, so that is where the function belongs:
interface PhoneNumber
{
string Formatted();
}
If you have different formats you can interchange implementations of PhoneNumber without littering your code with if statements, e.g.,
Instead of:
if (country == Countries.UK) output = Toolbox.PhoneNumberUK(phoneNumber);
else output = Toolbox.PhoneNumberUS(phoneNumber);
You can just use:
output = phoneNumber.Formatted();
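A self-contained sketch of interchangeable implementations. The interface is named IPhoneNumber here to follow the usual C# convention, the two classes are hypothetical, and the grouping rules are deliberately simplified (real national formats have more cases).

```csharp
using System;

public interface IPhoneNumber
{
    string Formatted();
}

// Assumes exactly 10 digits; renders (999) 999-9999.
public sealed class UsPhoneNumber : IPhoneNumber
{
    private readonly string _digits;

    public UsPhoneNumber(string digits)
    {
        if (digits == null || digits.Length != 10) throw new ArgumentException("Expected 10 digits.", nameof(digits));
        _digits = digits;
    }

    public string Formatted() =>
        "(" + _digits.Substring(0, 3) + ") " + _digits.Substring(3, 3) + "-" + _digits.Substring(6);
}

// Simplified assumption: 11 digits shown as a 5-digit code plus 6 digits.
public sealed class UkPhoneNumber : IPhoneNumber
{
    private readonly string _digits;

    public UkPhoneNumber(string digits)
    {
        if (digits == null || digits.Length != 11) throw new ArgumentException("Expected 11 digits.", nameof(digits));
        _digits = digits;
    }

    public string Formatted() => _digits.Substring(0, 5) + " " + _digits.Substring(5);
}
```

The country decision is made once, when the object is created; every later use just calls Formatted() without caring which implementation it holds.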

There is nothing wrong with this. One thing to try is breaking it up into logical parts. By doing this you can keep your IntelliSense clean.
MyCore.Extensions.Formatting.People
MyCore.Extensions.Formatting.Xml
MyCore.Extensions.Formatting.Html

My experience has been that utility functions seldom occur in isolation. If you need a method for formatting telephone numbers, then you will also need one for validating phone numbers, and parsing phone numbers. Following the YAGNI principle, you certainly wouldn't want to write such things until they're actually needed, but I think it's helpful to just go ahead and separate such functionality into individual classes. The growth of those classes from single methods into minor subsystems will then happen naturally over time. I have found this to be the easiest way to keep the code organized, understandable, and maintainable over the long term.

When I create an application, I typically create a static class that contains static methods and properties that I can't figure out where else to put.
It's not an especially good design, but that's sort of the point: it gives me a place to localize a whole class of design decisions that I haven't thought out yet. Generally as the application grows and is refined through refactoring, it becomes clearer where these methods and properties actually ought to reside. Mercifully, the state of refactoring tools is such that those changes are usually not exceptionally painful to make.
I've tried doing it the other way, but the other way is basically implementing an object model before I know enough about my application to design the object model properly. If I do that, I spend a fair amount of time and energy coming up with a mediocre solution that I have to revisit and rebuild from the ground up at some point in the future. Well, okay, if I know I'm going to be refactoring this code, how about I skip the step of designing and building the unnecessarily complicated classes that don't really work?
For instance, I've built an application that is being used by multiple customers. I figured out pretty early on that I needed to have a way of separating out methods that need to work differently for different customers. I built a static utility method that I could call at any point in the program where I needed to call a customized method, and stuck it in my static class.
This worked fine for months. But there came a point at which it was just beginning to look ugly. And so I decided to refactor it out into its own class. And as I went through my code looking at all the places where this method was being called, it became extremely clear that all of the customized methods really needed to be members of an abstract class, the customers' assemblies needed to contain a single derived class that implements all of the abstract methods, and then the program just needed to get the name of the assembly and the namespace out of its configuration and create an instance of the custom features class at startup. It was really simple for me to find all of the methods that had to be customized, since all I needed to do was find every place that my load-a-custom-feature method was being called. It took me the better part of an afternoon to go through the entire codebase and rationalize this design, and the end result is really flexible and robust and solves the right problem.
The thing is, when I first implemented that method (actually it was three or four interrelated methods), I recognized that it wasn't the right answer. But I didn't know enough to decide what the right answer was. So I went with the simplest wrong answer until the right answer became clear.
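The end state described above could be sketched roughly like this. CustomFeatures, AcmeFeatures, the two example methods and the loader are all hypothetical; the original answer does not show its code. The type name would normally come from configuration; here it is simply passed in as a string.

```csharp
using System;

// Everything that varies per customer is an abstract member here.
public abstract class CustomFeatures
{
    public abstract string InvoiceFooter();
    public abstract decimal Discount(decimal orderTotal);
}

// In the described design this class would live in the customer's own assembly.
public sealed class AcmeFeatures : CustomFeatures
{
    public override string InvoiceFooter() => "Thank you for choosing Acme.";

    public override decimal Discount(decimal orderTotal) => orderTotal >= 100m ? 10m : 0m;
}

public static class CustomFeaturesLoader
{
    // Creates the customer-specific class from its assembly-qualified type name.
    public static CustomFeatures Load(string assemblyQualifiedTypeName)
    {
        var type = Type.GetType(assemblyQualifiedTypeName, throwOnError: true);
        if (!typeof(CustomFeatures).IsAssignableFrom(type))
        {
            throw new InvalidOperationException(type.FullName + " does not derive from CustomFeatures.");
        }
        return (CustomFeatures)Activator.CreateInstance(type);
    }
}
```

At startup the program would call Load once with the configured name and keep the instance; the rest of the code calls the abstract members and never checks which customer it is running for.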

I think the reason it's frowned upon is because the "toolbox" can grow and you will be loading a ton of resources every time you want to call a single function.
It's also more elegant to have the methods that apply to the objects in the actual class - just makes more sense.
That being said, I personally don't think it's a problem, but would avoid it simply for the reasons above.

I posted a comment, but thought I'd elaborate a bit more.
What I do is create a Common library with namespaces: [Organisation].[Product].Common as the root and a sub namespace Helpers.
A few people on here mention things like creating a class and shoving some stuff they don't know where else to put in there. Wrong. I'd say, even if you need one helper method, it is related to something, so create a properly named (IoHelper, StringHelper, etc.) static helper class and put it in the Helpers namespace. That way, you get some structure and you get some sort of separation of concerns.
In the root namespace, you can use instance utility classes that do require state (they exist!). And needless to say also use an appropriate class name, but don't suffix with Helper.
