My requirement is to download and scrape various HTML pages, extracting lists of objects from the code on each page depending on what object type we are looking for on that page. E.g. one page might contain an embedded list of doctors' surgeries, another might contain a list of primary trusts, etc. I have to view the pages one by one and end up with lists of the appropriate object types.
The way I have chosen to do this is to have a Generic class called HTMLParser<T> where T : IEntity, new()
IEntity is the interface that all the object types that can be scraped will implement, though I haven't figured out yet what the interface members will be.
So you will effectively be able to say
HTMLParser<Surgery> parser = new HTMLParser<Surgery>(URL, XSD SCHEMA DOC);
IList<Surgery> results = parser.Parse();
Parse() will validate that the HTML string downloaded from the URL contains a block that conforms to the XSD document provided, then will somehow use this template to extract a List<Surgery> of Surgery objects, each one corresponding to an XML block in the HTML string.
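A rough skeleton of the shape I'm describing (the field names and the XmlSerializer idea are just a sketch, nothing is settled):
using System;
using System.Collections.Generic;

// Rough skeleton only; the constructor parameters and the note about
// XmlSerializer are assumptions, not settled design.
public class HTMLParser<T> where T : IEntity, new()
{
    private readonly Uri url;
    private readonly string xsdSchema;

    public HTMLParser(Uri url, string xsdSchema)
    {
        this.url = url;
        this.xsdSchema = xsdSchema;
    }

    public IList<T> Parse()
    {
        // 1. Download the HTML from this.url.
        // 2. Validate the embedded XML block(s) against this.xsdSchema.
        // 3. Deserialize each matching block into a T (e.g. with XmlSerializer).
        throw new NotImplementedException();
    }
}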
The problems I have are:
I'm not sure how to specify the template for each object type in a nice way, other than HTMLParser<Surgery> parser = new HTMLParser<Surgery>(new Uri("...."), Surgery.Template); which is a bit clunky. Can anyone suggest a better way using .NET 3.0/4.0?
I'm not sure how, in a generic way, I can take the HTML string, take an XSD or XML template document, and return a generic list of constructed objects of the generic type. Can anyone suggest how to do this?
Finally, I'm not convinced generics are the right solution to this problem as it's starting to seem very convoluted. Would you agree with or condemn my choice of solution here and if not, what would you do instead?
I'm not convinced that generics are the right solution, either. I implemented something very similar to this using good old inheritance, and I still think that's the right tool for the job.
Generics are useful when you want to perform the same operations on different types. Collections are a good example of where generics are very handy.
Inheritance, on the other hand, is useful when you want an object to inherit common functionality, but then extend and/or modify that functionality. Doing that with generics is messy.
My scraper base class looks something like this:
public abstract class ScraperBase
{
    // Common methods for making web requests, etc.

    // When you want to download and scrape a page, you call this:
    public List<string> DownloadAndScrape(string url)
    {
        // Make the request and download the page into pageText
        // (using the common request helpers elided above),
        // then call Scrape ...
        string pageText = DownloadPage(url);
        return Scrape(pageText);
    }

    // And an abstract Scrape method that returns a List<string>.
    // Inheritors implement this method.
    public abstract List<string> Scrape(string pageText);
}
There's some other stuff in there for logging, error reporting, etc., but that's the gist of it.
Now, let's say I have a Wordpress blog scraper:
public class WordpressBlogScraper : ScraperBase
{
    // just implement the Scrape method
    public override List<string> Scrape(string pageText)
    {
        // do Wordpress-specific parsing and return data.
    }
}
And I can do the same thing to write a Blogspot scraper, or a custom scraper for any page, site, or class of data.
I actually tried to do something similar, but rather than using inheritance I used a scraper callback function. Something like:
public delegate List<string> PageScraperDelegate(string pageText);

public class PageScraper
{
    public List<string> DownloadAndScrape(string url, PageScraperDelegate callback)
    {
        // Download the page into pageText (download helper elided),
        // then hand it to the callback.
        string pageText = DownloadPage(url);
        return callback(pageText);
    }
}
You can then write:
var myScraper = new PageScraper();
myScraper.DownloadAndScrape("http://example.com/index.html", ScrapeExample);
private List<string> ScrapeExample(string pageText)
{
    // do the scraping here and return a List<string>
}
That works reasonably well, and eliminates having to create a new class for every scraper type. However, I found that in my situation it was too limiting. I ended up needing a different class for almost every type of scraper, so I just went ahead and used inheritance.
I would rather focus on your parser/verifier classes, as designing them properly will be crucial to ease of future use. I think it's more important how the mechanism will determine which parser/verifier to use based on the input.
Also, what happens when you're told you need to parse yet another type of website, say for Invoice entities - will you be able to extend your mechanism in 2 easy steps to handle such a requirement?
Update#2 as of year 2022
All these years have passed and still no good answer.
Decided to revive this question.
I'm trying to implement something like the idea shown in the diagram at the end of the question.
Everything from the abstract class Base down to the DoSomething classes is already coded.
My "Service" needs to provide to the consumer "actions" of the type "DoSomethings" that the service has "registered". At this point I see myself repeating (copy/paste) the following logic in the service class:
public async Task<Obj1<XXXX>> DoSomething1(....params....)
{
    var action = new DoSomething1(constructParams);
    return await action.Go(....params....);
}
I would like to know if there is any way in C# to "register" all the "DoSomething" actions I want in a different way? Something more dynamic and less "copy/paste", which at the same time gives me intellisense in my consumer class? Some kind of "injecting" a list of accepted "DoSomething" actions for that service.
Update#1
After reading the suggestion from PanagiotisKanavos about MEF, and checking other IoC options, I was not able to find exactly what I am looking for.
My objective is to have my Service1 class (and all similar ones) behave like a DynamicObject, but where the accepted methods are defined in its own constructor (where I specify exactly which DoSomethingX I am offering as a method call).
Example:
I have several actions (DoSomethingX) such as "BuyCar", "SellCar", "ChangeOil", "StartEngine", etc.
Now, I want to create a service "CarService" that should only offer the actions "StartEngine" and "SellCar", while I might have other "Services" with other combinations of "actions". I want to define this logic inside the constructor of each service. Then, in the consumer class, I just want to do something like:
var myCarService = new CarService(...paramsX...);
var res1 = myCarService.StartEngine(...paramsY...);
var res2 = myCarService.SellCar(...paramsZ...);
And I want to have intellisense when I use the "CarService"....
In conclusion: the objective is to "register" in each Service which methods it provides, by giving it a list of "DoSomethingX", and automatically offer them as "methods"... I hope I was able to explain my objective/wish.
In other words: I just want to be able to say that my class Service1 is "offering" the actions DoSomething1, DoSomething2 and DoSomething3, but with as few lines as possible. Something like the concept of class attributes, where I could do something similar to this:
// THEORETICAL CODE
[RegisterAction(typeof(DoSomething1))]
[RegisterAction(typeof(DoSomething2))]
[RegisterAction(typeof(DoSomething3))]
public class Service1
{
    // NO NEED OF EXTRA LINES....
}
For me, MEF/MAF are really something you might do last in a problem like this. The first step is to work out your design. I would do the following:
Implement the decorator design pattern (or a similar structural pattern of your choice). I pick decorator because that looks like what you are going for: supplementing certain classes with shared functionality that isn't defined in those classes (i.e. composition seems preferred in your example, as opposed to inheritance). A sketch follows after this list. See here: http://www.dofactory.com/net/decorator-design-pattern
Validate the step 1 POC to work out whether it would do what you want if it was added as a separate DLL (i.e. by making a different csproj baked in at build time).
Evaluate whether MEF or MAF is right for you (depending on how heavyweight you want to go). Compare those against other techniques like microservices (which would philosophically change your current approach).
Implement your choice of hot swapping (MEF is probably the most logical based on the info you have provided).
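For step 1, a minimal decorator sketch in the spirit of the DoSomething classes might look like this (IAction, Go, and the logging concern are invented here purely to illustrate the pattern, they are not your actual base class):
using System;
using System.Threading.Tasks;

// Minimal decorator sketch; IAction/Go and the logging concern are assumptions.
public interface IAction
{
    Task<string> Go(string input);
}

public class DoSomething1 : IAction
{
    public Task<string> Go(string input) => Task.FromResult("did " + input);
}

// The decorator adds shared behaviour around any IAction without touching it.
public class LoggingActionDecorator : IAction
{
    private readonly IAction inner;

    public LoggingActionDecorator(IAction inner) => this.inner = inner;

    public async Task<string> Go(string input)
    {
        Console.WriteLine("Calling {0} with {1}", inner.GetType().Name, input);
        var result = await inner.Go(input);
        Console.WriteLine("Result: {0}", result);
        return result;
    }
}

// Usage: IAction action = new LoggingActionDecorator(new DoSomething1());
//        var result = await action.Go("engine");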
You could use Reflection.
In class Service1 define a list of BaseAction types that you want to provide:
List<Type> providedActions = new List<Type>();
providedActions.Add(typeof(DoSomething1));
providedActions.Add(typeof(DoSomething2));
Then you can write a single DoSomething method which selects the correct BaseAction at run-time:
public async Task<Obj1<XXXX>> DoSomething(string actionName, ....params....)
{
    Type t = providedActions.Find(x => x.Name == actionName);
    if (t != null)
    {
        var action = (BaseAction)Activator.CreateInstance(t);
        return await action.Go(....params....);
    }
    else
        return null;
}
The drawback is that the client doesn't know the actions provided by the service unless you implement an ad hoc method like:
public List<string> ProvidedActions()
{
    List<string> lst = new List<string>();
    foreach (Type t in providedActions)
        lst.Add(t.Name);
    return lst;
}
Maybe RealProxy can help you? If you create an ICarService interface which inherits IAction1 and IAction2, you can then create a proxy object which will:
Find all the interfaces ICarService inherits.
Find implementations of these interfaces (using an action factory or reflection).
Create the action list for the service.
Delegate each call, in its Invoke method, to one of the actions.
This way you will have intellisense as you want, and actions will be building blocks for the services. Some kind of multi-inheritance hack :)
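A rough sketch of that idea with .NET Framework's RealProxy (the ICarService type, the action objects, and the name-based lookup here are all assumptions):
using System;
using System.Collections.Generic;
using System.Runtime.Remoting.Messaging;
using System.Runtime.Remoting.Proxies;

// Rough sketch; ICarService, the action instances and how they are located
// are assumptions for illustration (RealProxy is .NET Framework only).
public class ServiceProxy<TService> : RealProxy
{
    // Maps a method name exposed by TService to the object that handles it.
    private readonly Dictionary<string, object> actions;

    public ServiceProxy(Dictionary<string, object> actions)
        : base(typeof(TService))
    {
        this.actions = actions;
    }

    public TService GetService()
    {
        return (TService)GetTransparentProxy();
    }

    public override IMessage Invoke(IMessage msg)
    {
        var call = (IMethodCallMessage)msg;

        // Forward the call to the same-named method on the registered action.
        object action = actions[call.MethodName];
        object result = action.GetType()
                              .GetMethod(call.MethodName)
                              .Invoke(action, call.Args);

        return new ReturnMessage(result, null, 0, call.LogicalCallContext, call);
    }
}

// Usage (hypothetical): the consumer codes against ICarService and keeps
// intellisense, while calls are routed to the action instances at run-time.
// var proxy = new ServiceProxy<ICarService>(new Dictionary<string, object>
// {
//     { "StartEngine", new StartEngineAction() },
//     { "SellCar", new SellCarAction() },
// });
// ICarService service = proxy.GetService();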
At this point I am really tempted to do the following:
Make my own class attribute RegisterAction (just like I wrote in my "theoretical" example)
Extend the Visual Studio Build Process
Then, in my public class LazyProgrammerSolutionTask : Microsoft.Build.Utilities.Task, try to find the service classes and identify the RegisterAction attributes.
Then, for each one, inject via reflection my own method (the one that I keep copy/pasting)... and of course take the "signature" from the corresponding target "action" class.
In the end, compile everything again.
Then my "next project" that will consume this project (library) will have the intellisence that I am looking for....
One thing, that I am really not sure, it how the "debug" would work on this....
Since this is also still a theoretically (BUT POSSIBLE) solution, I do not have yet a source code to share.
Meanwhile, I will leave this question open for other possible approaches.
I must disclose that I've never attempted anything of the sort, so this is a thought experiment. A couple of wild ideas I'd explore here:
extension methods
You could declare and implement all your actions as extension methods against the base class. This, I believe, will cover your intellisense requirements. Then you have each implementation check whether it's registered against the calling type before proceeding (use attributes, an interface hierarchy, or whatever other means you prefer). This will get a bit noisy in intellisense, as every method will be displayed on the base class, and this is where you could potentially filter the list down with a custom intellisense plugin.
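A rough sketch of the extension-method idea (ServiceBase, RegisterActionAttribute and StartEngineAction are invented names for illustration):
using System;
using System.Linq;

// Rough sketch only; the type names are assumptions, not your actual classes.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public class RegisterActionAttribute : Attribute
{
    public Type ActionType { get; private set; }
    public RegisterActionAttribute(Type actionType) { ActionType = actionType; }
}

public abstract class ServiceBase { }

[RegisterAction(typeof(StartEngineAction))]
public class CarService : ServiceBase { }

public class StartEngineAction
{
    public void Go() { /* real action logic would live here */ }
}

public static class ServiceActions
{
    // Every action is an extension method on ServiceBase, so it shows up in
    // intellisense for all services; the runtime check rejects unregistered ones.
    public static void StartEngine(this ServiceBase service)
    {
        EnsureRegistered(service, typeof(StartEngineAction));
        new StartEngineAction().Go();
    }

    private static void EnsureRegistered(ServiceBase service, Type actionType)
    {
        bool registered = service.GetType()
            .GetCustomAttributes(typeof(RegisterActionAttribute), true)
            .Cast<RegisterActionAttribute>()
            .Any(a => a.ActionType == actionType);

        if (!registered)
            throw new InvalidOperationException(
                service.GetType().Name + " does not register " + actionType.Name);
    }
}

// Usage: new CarService().StartEngine();   // compiles and passes the check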
custom intellisense plugin
You could write a plugin that would scan the current code base (see Roslyn), analyze your current service method registrations (by means of attributes, interfaces or whatever you prefer) and build a list of autocomplete methods that apply in each particular case.
This way everything stays functional without installing any special plugins into your dev environment; the custom VS plugin would be there purely for convenience.
If you have a set of actions in your project that you want to invoke, maybe you could look at it from a CQS (Command Query Separation) perspective, where you define a command and a handler for that command that actually performs the action. Then you can use a dispatcher to dispatch a command to its handler dynamically. The code may look similar to:
public class StartEngine
{
    public StartEngine(...params...)
    {
    }
}

public class StartEngineHandler : ICommandHandler<StartEngine>
{
    public StartEngineHandler(...params...)
    {
    }

    public async Task Handle(StartEngine command)
    {
        // Start engine logic
    }
}

public class CommandDispatcher : ICommandDispatcher
{
    private readonly Container container;

    public CommandDispatcher(Container container) => this.container = container;

    public async Task Dispatch<T>(T command) =>
        await container.GetInstance<ICommandHandler<T>>().Handle(command);
}
// Client code
await dispatcher.Dispatch(new StartEngine(params, to, start, engine));
These two articles will give you more context on the approach: Meanwhile... on the command side of my architecture, Meanwhile... on the query side of my architecture.
There is also the MediatR library, which solves a similar task, that you may want to check.
If the approaches above do not fit your needs and you want to "dynamically" inject actions into your services, Fody can be a good way to implement it. It instruments the assembly during the build, after the IL is generated, so you could implement your own weaver to generate methods in the classes decorated with your RegisterAction attribute.
I am setting up a feature on my site that requires displaying content that is stored in the database. This content will frequently have links to other resources on the site.
Stored in the db:
"Lorem ipsum dolor sit <a href='http://mysite.com/somecontroller/someaction'>amet</a>."
For debugging and other reasons, I don't want to hard-code that URL. I want to replace http://mysite.com/somecontroller/someaction with UrlHelper.Action("someaction", "somecontroller").
I could write a replacement based on my own made-up convention. That would be easy and would accomplish what I want. However, I thought I might be missing some cleaner or more standard solution. So, is there a cleaner, more maintainable way to do this, or should I go ahead with a convention-based replacement?
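For illustration, such a convention-based replacement might look something like this (the ~route(controller,action) token format is just something I made up):
using System.Text.RegularExpressions;
using System.Web.Mvc;

// Hypothetical convention: content in the database stores links as
// "~route(controller,action)" tokens instead of hard-coded URLs.
public static class ContentLinkExpander
{
    private static readonly Regex Token = new Regex(
        @"~route\((?<controller>\w+),(?<action>\w+)\)", RegexOptions.Compiled);

    public static string Expand(string content, UrlHelper url)
    {
        // Replace each token with whatever URL the routing table generates today.
        return Token.Replace(content, m =>
            url.Action(m.Groups["action"].Value, m.Groups["controller"].Value));
    }
}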
Replacement is probably best, and easiest to maintain. Another way to handle it is to allow ASP.NET to parse the code. You'd have to create your own VirtualPathProvider, which would allow views to be pulled from the database in addition to the file system. Then you could use
Html.RenderPartial(nameVirtualPathProviderUnderstands);
Which would render the string in the database as if it were a file. This isn't very safe, though, and you would need to modify the content before the VirtualPathProvider rendered it. Here is a guide which explains how to implement your own VirtualPathProvider.
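A bare-bones sketch of that approach (the "~/DbViews/" prefix and the LoadViewFromDatabase lookup are assumptions, and a real provider also needs GetCacheDependency overrides):
using System;
using System.IO;
using System.Text;
using System.Web.Hosting;

// Sketch only; path convention and database lookup are hypothetical.
public class DatabaseVirtualPathProvider : VirtualPathProvider
{
    private static bool IsDatabasePath(string virtualPath)
    {
        return virtualPath.StartsWith("~/DbViews/", StringComparison.OrdinalIgnoreCase);
    }

    public override bool FileExists(string virtualPath)
    {
        return IsDatabasePath(virtualPath) || Previous.FileExists(virtualPath);
    }

    public override VirtualFile GetFile(string virtualPath)
    {
        return IsDatabasePath(virtualPath)
            ? new DatabaseVirtualFile(virtualPath, LoadViewFromDatabase(virtualPath))
            : Previous.GetFile(virtualPath);
    }

    private static string LoadViewFromDatabase(string virtualPath)
    {
        // Hypothetical: fetch the stored markup keyed by the virtual path.
        return "<p>content from the database</p>";
    }
}

public class DatabaseVirtualFile : VirtualFile
{
    private readonly string content;

    public DatabaseVirtualFile(string virtualPath, string content)
        : base(virtualPath)
    {
        this.content = content;
    }

    public override Stream Open()
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(content));
    }
}

// Registered once at application start-up, e.g. in Global.asax:
// HostingEnvironment.RegisterVirtualPathProvider(new DatabaseVirtualPathProvider());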
I made a 'hack' like so in one of my projects:
public class UrlLinks
{
    public static string HomeLink { get; protected set; }

    static UrlLinks()
    {
        HomeLink = Link<HomeController>(x => x.Index());
    }

    protected static string Link<TControllerType>(Expression<Func<TControllerType, object>> expression)
    {
        // Build "/{controller}/{action}" from the controller type name and the
        // method call captured in the expression.
        var body = expression.Body is UnaryExpression unary ? unary.Operand : expression.Body;
        var methodName = ((MethodCallExpression)body).Method.Name;
        var controllerName = typeof(TControllerType).Name.Replace("Controller", string.Empty);
        return "/" + controllerName + "/" + methodName;
    }
}
Basically, the link would be constructed from the typename and the method name via the expression and generic parameter, so if I refactored, I would pick up the new method and controller names.
Then in my view I could do this - eliminating the need for magic strings:
<a href="@UrlLinks.HomeLink">Home</a>
How would you design an application (classes, interfaces in a class library) in .NET when we have a fixed database design on our side and we need to support imports of data from third-party data sources, which will most likely be in XML?
For instance, let us say we have a Products table in our DB which has columns
Id
Title
Description
TaxLevel
Price
and on the other side we have for instance Products:
ProductId
ProdTitle
Text
BasicPrice
Quantity.
Currently I do it like this:
Have the third-party XML converted to classes and XSDs, and then deserialize its contents into strongly typed objects (what we get as a result of this process is classes like ThirdPartyProduct, ThirdPartyClassification, etc.).
Then I have methods like this:
InsertProduct(ThirdPartyProduct newproduct)
I do not use interfaces at the moment but I would like to. What I would like is to implement something like
public class Contoso_ProductSynchronization : ProductSynchronization
{
    public void InsertProduct(ContosoProduct p)
    {
        Product product = new Product(); // this is our Entity class

        // do the assignments from p to product here

        using (SyncEntities db = new SyncEntities())
        {
            // ....
            db.AddToProducts(product);
        }
    }

    // The problem is Product and ContosoProduct have no architectural connection
    // right now, so I cannot do this:
    public void InsertProduct(ContosoProduct p)
    {
        Product product = (Product)p;

        using (SyncEntities db = new SyncEntities())
        {
            // ....
            db.AddToProducts(product);
        }
    }
}
where ProductSynchronization will be an interface or an abstract class. There will most likely be many implementations of ProductSynchronization. I cannot hardcode the types - classes like ContosoProduct or NorthwindProduct might be created from the third-party XMLs (so preferably I would continue to use deserialization).
Hopefully someone will understand what I'm trying to explain here. Just imagine you are the seller and you have numerous providers, each of which uses their own proprietary XML format. I don't mind the development, which will of course be needed every time a new format appears, because it will only require 10-20 methods to be implemented; I just want the architecture to be open and support that.
In your replies, please focus on design and not so much on data access technologies because most are pretty straightforward to use (if you need to know, EF will be used for interacting with our database).
[EDIT: Design note]
OK, from a design perspective I would run XSLT on the incoming XML to transform it into a unified format. It is also very easy to validate the resulting XML against a schema.
Using XSLT, I would stay away from any interface or abstract class and just have one class implementation in my code: the internal class. It keeps the code base clean, and the XSLTs themselves should be pretty short if the data is as simple as you state.
Documenting the transformations can easily be done wherever you have your project documentation.
If you decide you absolutely want to have one class per XML (or if you perhaps got a .NET DLL instead of XML from one customer), then I would make the proxy class inherit an interface or abstract class (based on your internal class), and implement the mappings per property as needed in the proxy classes. This way you can cast any such class to your base/internal class.
But it seems to me that doing the conversion/mapping in code will make the design a bit messier.
[Original Answer]
If I understand you correctly you want to map a ThirdPartyProduct class over to your own internal class.
Initially I am thinking class mapping. Use something like AutoMapper and configure the mappings as you create your XML-deserializing proxies. If your deserialization ends up with the same property names as your internal class, then there's less configuration to do for the mapper. Convention over configuration.
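A minimal sketch of that mapping with a recent AutoMapper API (ThirdPartyProduct and Product use the property names listed in the question; the exact configuration is illustrative):
using AutoMapper;

// Sketch only; property names taken from the question, configuration is illustrative.
public static class ProductMapping
{
    public static IMapper Create()
    {
        var config = new MapperConfiguration(cfg =>
            cfg.CreateMap<ThirdPartyProduct, Product>()
               .ForMember(d => d.Title, o => o.MapFrom(s => s.ProdTitle))
               .ForMember(d => d.Description, o => o.MapFrom(s => s.Text))
               .ForMember(d => d.Price, o => o.MapFrom(s => s.BasicPrice)));

        return config.CreateMapper();
    }
}

// Usage after deserializing the third-party XML:
// var mapper = ProductMapping.Create();
// Product internalProduct = mapper.Map<Product>(thirdPartyProduct);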
I'd like to hear anyones thoughts on going this route.
Another approach would be to add a .ToInternalProduct(ThirdPartyClass) method in a Converter class, and keep adding more as you add more external classes.
The third approach is for the XSLT guys. If you love XSLT, you could transform the XML into something which can be deserialized into your internal product class.
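A sketch of that XSLT route from the C# side (the stylesheet path and the assumption that the transformed XML maps straight onto the internal Product are illustrative):
using System.IO;
using System.Xml.Serialization;
using System.Xml.Xsl;

// Sketch only; the stylesheet and its output format are assumptions.
public static class XsltProductImporter
{
    public static Product Import(string thirdPartyXmlPath, string xsltPath)
    {
        // Transform the third-party XML into the unified internal format.
        var transform = new XslCompiledTransform();
        transform.Load(xsltPath);

        using (var transformed = new MemoryStream())
        {
            transform.Transform(thirdPartyXmlPath, null, transformed);
            transformed.Position = 0;

            // Deserialize the unified XML directly into the internal entity.
            var serializer = new XmlSerializer(typeof(Product));
            return (Product)serializer.Deserialize(transformed);
        }
    }
}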
Which one of these three I'd choose would depend on the skills of the programmer and who will maintain adding new external classes. The XSLT approach would require no recompiling of code as new formats arrive. That might be an advantage.
I am building a library to automatically create forms for Objects in the project that I am working on.
The codebase is in C#, and essentially we have a HUGE number of different objects to store information about different things. If I send these objects to the client side as JSON, it is easy enough to programmatically inspect them to generate a form for all of the properties.
The problem is that I want to be able to create a simple way of enforcing permissions and doing validation on the client side. It needs to be done on a field by field level.
In JavaScript I would do this by creating a parallel object structure, which has some sort of { permissions : "someLevel", validator : someFunction } object at the nodes, with empty nodes implying free permissions and universal validation. This would let me simply iterate over the new object and the permissions object, run the check, and deal with the result.
Because I am overfamiliar with the hammer that is JavaScript, this is really the only way I can see to deal with this problem. My first implementation thus uses reflection to let me treat objects as dictionaries that can be programmatically iterated over, and then I just have dictionaries of dictionaries of PermissionRule objects which can be compared against.
Very javascripty. Very awkward.
Is there some better way that I can do this? Essentially a way to associate a data set with each property, and then iterate over those properties.
Or else am I Doing It Wrong?
It sounds like you are describing custom attributes - i.e.
[Permissions("someLevel"), Validator("someFunction")]
public string Foo {get;set;}
This requires some reflection to read the attributes, but is quite a nice way of decorating types / members / etc. You might also look at the pre-rolled [PrincipalPermission] for security checks. Is this what you mean?
Note the above would require:
public class PermissionsAttribute : Attribute
{
    private readonly string permissions;
    public string Permissions { get { return permissions; } }

    public PermissionsAttribute(string permissions)
    {
        this.permissions = permissions;
    }
}
(and similar for the other)
You can read them out with Attribute.GetCustomAttributes
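A small sketch of reading those attributes back while iterating a model's properties (how missing attributes and failures are handled here is an assumption):
using System;
using System.Reflection;

// Sketch only; the "free" fallback mirrors the question's empty-node convention.
public static class PermissionInspector
{
    public static void Inspect(object model)
    {
        foreach (PropertyInfo prop in model.GetType().GetProperties())
        {
            var permissions = (PermissionsAttribute)Attribute.GetCustomAttribute(
                prop, typeof(PermissionsAttribute));

            // A missing attribute implies free permissions, as described above.
            string level = permissions != null ? permissions.Permissions : "free";

            Console.WriteLine("{0}: {1} = {2}", prop.Name, level, prop.GetValue(model, null));
        }
    }
}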
This question is related to a previous post of mine Here. Basically, I want to inject a DAO into an entity i.e.
public class User
{
    IUserDAO userDAO;

    public User()
    {
        userDAO = IoCContainer.Resolve<IUserDAO>();
    }

    public User(IUserDAO userDAO)
    {
        this.userDAO = userDAO;
    }

    // Wrapped DAO methods, e.g.
    public User Save()
    {
        return userDAO.Save(this);
    }
}
Here, if I had custom methods in my DAO, then I'd basically have to wrap them in the entity object. So if I had an IUserDAO.Register() I would then have to create a User.Register() method to wrap it.
What would be better is to create a proxy object where the methods from the DAO are dynamically assigned to the User object. So I might have something that looks like this:
var user = DAOProxyService.Create(new User());
user.Save();
This would mean that I can keep the User entity as a pretty dumb class suitable for data transfer over the wire, but also magically give it a bunch of DAO methods.
This is very much out of my comfort zone though, and I wondered what I would need to accomplish this. Could I use Castle's DynamicProxy? Also, would the C# compiler be able to cope with this and know about the dynamically added methods?
Feel free to let me know if this is nonsense.
EDIT:
What we need to do is somehow declare DAOProxyService.Create() as returning a User object -- at compile time. This can be done with generics.
This isn't quite true; what I want to return isn't a User object but a User object with dynamically added UserDAO methods. As this class isn't defined anywhere, the compiler will not know what to make of it.
What I am essentially returning is a new object that looks like User : IUserDAO, so I guess I could cast as required. But this seems messy.
Looks like what I am looking for is similar to this: Mixins
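Castle DynamicProxy's mixin support gets close to that shape; a hedged sketch (the UserDAO class and the exact IUserDAO members used here are assumptions based on the snippets above):
using Castle.DynamicProxy;

// Sketch of the mixin idea; UserDAO and the usage below are assumptions.
public static class DAOProxyService
{
    private static readonly ProxyGenerator Generator = new ProxyGenerator();

    public static User Create(User user, IUserDAO dao)
    {
        var options = new ProxyGenerationOptions();
        options.AddMixinInstance(dao); // the generated proxy also implements IUserDAO

        // The proxy type derives from User, so consumers still see a User and
        // can cast it to IUserDAO to reach the mixed-in DAO methods.
        return Generator.CreateClassProxyWithTarget(user, options);
    }
}

// Usage (hypothetical):
// var user = DAOProxyService.Create(new User(), new UserDAO());
// ((IUserDAO)user).Register();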
I was initially going to say what you ask cannot work. But with some tweaking, we might be able to get it to work.
var is just a compiler feature. When you write:
var x = GetSomeValue();
the compiler says "'GetSomeValue' is defined as returning a string, so the programmer must of meant to write 'string x = GetSomeValue();'". Note that the compiler says this; this change is done at compile time.
You want to define a class (DAOProxyService) which essentially returns an Object. This will work, but "var User" would be the same as "Object user".
What we need to do it somehow declare DAOProxyService.Create() as returning a User object -- at compile time. This can be done with generics:
class DAOProxyService
{
    static DAOProxyService<T> Create<T>(T obj) { ...... }
}
It's not entirely automatic, but you might consider using a variation of Oleg Sych's method for generating decorator classes. Whenever IUserDAO changes (new method, etc) just regenerate the file. Better than maintaining it manually :-)
http://www.olegsych.com/2007/12/how-to-use-t4-to-generate-decorator-classes/