interesting news article/blog post scraping problem

I need to scrape the text of blog posts to build a summary description of each post, similar to what techmeme.com does. That's not a problem when it's one or a handful of blogs. However, the set of blogs to scrape is variable and unbounded. How would you go about doing this?
I've used the HTML Agility Pack and YQL in the past, but there's nothing built into either of those to handle this requirement.
One thought I had was to search for div ids and div attributes named things like content, post, article etc. and see how that worked, but I'm not really leaning in this direction. The other idea was to search for the biggest text node in the HTML document and assume that's the node I want, which could lead to some false positives. The final idea was to create a crowdsourced data repository on Google Apps that would let the community manage (read: create, update, delete) the XPath mappings for most of the popular news/blog platforms. You could then query this list by domain or blog platform type and get the requisite XPath, but this seems like a hella undertaking.
Of course, I know some of you have ideas that will work better than any of my hare-brained ones.
What are your thoughts?

The only sure-fire way of doing this is to have a class for each blog. That way you can do what you need in the implementation of each specific class for each specific blog.
So you'll have an abstract base class that processes a blog and returns the data/info you need from a blog.
For example:
public abstract class BlogProcessor
{
    public abstract BlogResult ProcessBlog(string url);
}
Where BlogResult is a type you define that has all the information you'll need from a blog, such as title, date, tags, post etc.
Each descendant knows how to extract this information for the blog it is specialized for.
In your calling code you'll treat these descendant classes polymorphically, like so:
foreach (var url in BlogsToParse)
{
    var blogProcessor = BlogProcessorFactory.CreateInstance(url);
    var blogResult = blogProcessor.ProcessBlog(url);
    /* Do Something with blogResult */
}
Does that make sense?
In the implementation of each "ProcessBlog" method you could use HtmlAgilityPack to do the specific parsing.
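The answer does not show BlogProcessorFactory, so here is a minimal sketch of how it could pick a processor by host name. Everything except BlogProcessor, BlogResult, ProcessBlog and CreateInstance is a hypothetical name introduced for illustration (WordPressProcessor, GenericProcessor, the host entry), and the processors return stub results instead of doing real parsing.

```csharp
using System;
using System.Collections.Generic;

public class BlogResult
{
    public string Title { get; set; }
    public string Post { get; set; }
}

public abstract class BlogProcessor
{
    public abstract BlogResult ProcessBlog(string url);
}

// Hypothetical processor for one specific blog; real code would parse the page here.
public class WordPressProcessor : BlogProcessor
{
    public override BlogResult ProcessBlog(string url)
    {
        return new BlogResult { Title = "stub", Post = "parsed by WordPressProcessor" };
    }
}

// Hypothetical fallback for blogs that have no dedicated processor yet.
public class GenericProcessor : BlogProcessor
{
    public override BlogResult ProcessBlog(string url)
    {
        return new BlogResult { Title = "stub", Post = "parsed by GenericProcessor" };
    }
}

public static class BlogProcessorFactory
{
    // Host name -> processor constructor. The entry below is a made-up example.
    private static readonly Dictionary<string, Func<BlogProcessor>> ByHost =
        new Dictionary<string, Func<BlogProcessor>>(StringComparer.OrdinalIgnoreCase)
        {
            { "example.wordpress.com", () => new WordPressProcessor() },
        };

    public static BlogProcessor CreateInstance(string url)
    {
        string host = new Uri(url).Host;
        Func<BlogProcessor> create;
        return ByHost.TryGetValue(host, out create) ? create() : new GenericProcessor();
    }
}
```

Keeping the host table in one place means adding support for a new blog is one new class plus one dictionary entry; the calling loop does not change.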

Related

How to format/style ///<summary> in Web API 2

Maybe this isn't even possible, but it seems silly that I can't figure it out (nor can find anything conclusive after searching).
With a MVC/C# Web API 2 project, your controllers can be documented using something like:
///<summary>
///This is something really cool that you should use. I want <b>this bold</b>.
///</summary>
[HttpPost]
public MyResponse MyMethod(SomeInput input)
{
    ....
}
When the API runs, the project automatically builds the help site, and I can see the above endpoint/method and its description (the <summary> text), but I've yet to figure out how to apply any sort of styling to the summary. It appears that the HTML tags get stripped from the help page's output. Notice in my example above, I have "<b>this bold</b>". I'm not so much concerned about bold, but more interested in being able to use unordered lists (<ul>) and other basic HTML tags to do some really basic formatting.
Is this even possible?
Is there a trick to it?
Is there some other markup/formatting I should be using?
Note - The actual endpoint that I'm trying to document at the moment happens to be a MIME multipart form, and the framework won't document those out of the box. To get around this, I've created some helper methods in HelpPageConfigurationExtensions (to determine if the current endpoint view is one that requires custom documentation), logic in HelpPageApiModel.cshtml to determine if it should show the stock documentation or the custom docs, a helper library that contains the custom doc information, and a series of helper functions that use some reflection to rapidly build HTML tables for the rest of the help page's documentation (e.g. the request and response objects). I'm mentioning this because maybe I just need to further extend my custom doc library to include (hard code) the <summary> value, and then in the view I can just @Html.Raw it -- as opposed to trying to get the actual method's <summary> to output with formatting.
Thoughts?
Thanks!

nunit - set Order attribute from custom attribute of Test method

Let's say we have a custom attribute:
[Precondition(1, "Some precondition")]
This would implement [Test, Order(1), Description("Some precondition")]
Can I access and modify the Order attribute (or create one) for this method?
I can modify the Description and Author, but Order is not a possibility.
I have tried:
1: context.Test.Properties["Order"][0] = order;
2: method.CustomAttributes.GetEnumerator()
by walking the stack frames with
Object[] attributes = method.GetCustomAttributes(typeof(PreconditionAttribute), false);
if (attributes.Length >= 1){...}
3:
OrderAttribute orderAttribute = (OrderAttribute)Attribute.GetCustomAttribute(i, typeof(OrderAttribute));
orderAttribute.Order = _order;
This fails because Order is read-only.
If I try orderAttribute.Order = new OrderAttribute(myOrd), it doesn't do anything.

I have two answers to choose from. One is in the vein of "Don't do this" and the other is about how to do it. Just for fun, I'm putting both answers up, separately, so they can compete with one another. This one is about why I don't think this is a good idea.
It's easy enough to write either
[Test, Order(1), Description("xxx")] or the equivalent...
[Test(Description="xxx"), Order(1)]
The proposed attribute gives users a second way to specify order, making it possible to assign two different orders to a test. Which of two attributes will win the day depends on (1) how each one is implemented, (2) the order in which the attributes are listed and (3) the platform on which you are running. For all practical purposes, it's non-deterministic.
Keeping the two things separate allows devs to decide which they need independently... which is why NUnit keeps them separate.
Using the standard attributes means that the devs can rely on the NUnit documentation to tell them what the attributes do. If you implement your own attribute, you should document what it does by itself as well as what it does in the presence of the standard attributes... As stated above, that's difficult to predict.
I know this isn't a real answer in SO terms, but it's not pure opinion either. There are real technical issues in providing the kind of solution you want. I'd love to see what people think of it in comparison with the "how to" answer I'm going to post next.

See my prior answer first! If you really want to do this, here's the how-to...
In order to combine the action of two existing attributes, you need equivalent code to those two attributes.
In this case both are extremely simple and both have about the same amount of code. DescriptionAttribute is based on PropertyAttribute so some of its code is hidden. OrderAttribute has a bit more logic because it checks to make sure the order has not already been set. Ultimately, both of them have code that implements the IApplyToTest interface.
Because they are both simple, I would copy the code, in order to avoid relying on implementation details that could change. Start with the slightly more complete OrderAttribute. Change its name. Modify the ApplyToTest method to set the description. You're done!
It will look something like this, depending on the names you use for properties...
public void ApplyToTest(Test test)
{
    if (!test.Properties.ContainsKey(PropertyNames.Order))
        test.Properties.Set(PropertyNames.Order, Order);
    test.Properties.Set(PropertyNames.Description, Description);
}
A comment on what you tried...
There is no reason to think that creating an attribute in your code will do anything. NUnit has no way to know about those attributes. Your attribute cannot modify the code so that the test magically has other attributes. The only way Attributes communicate with NUnit is by having their interfaces (like IApplyToTest) called. And only attributes actually present in the code will receive such a call.
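To make the copy-and-modify approach concrete, here is a sketch of what the whole combined attribute might look like. It assumes NUnit 3 and uses the PreconditionAttribute name from the question; the Order and Description property names and the SampleTests class are illustrative choices, not something the answer specifies. Note that the test method still carries [Test], since this sketch only combines Order and Description.

```csharp
using System;
using NUnit.Framework;
using NUnit.Framework.Interfaces;
using NUnit.Framework.Internal;

// Sketch: one attribute that sets both the order and the description of a test.
[AttributeUsage(AttributeTargets.Method, AllowMultiple = false, Inherited = true)]
public class PreconditionAttribute : NUnitAttribute, IApplyToTest
{
    public int Order { get; }
    public string Description { get; }

    public PreconditionAttribute(int order, string description)
    {
        Order = order;
        Description = description;
    }

    // NUnit calls this while building the test, for attributes present on the method.
    public void ApplyToTest(Test test)
    {
        // Same guard as OrderAttribute: do not overwrite an order that is already set.
        if (!test.Properties.ContainsKey(PropertyNames.Order))
            test.Properties.Set(PropertyNames.Order, Order);
        test.Properties.Set(PropertyNames.Description, Description);
    }
}

public class SampleTests
{
    [Test, Precondition(1, "Some precondition")]
    public void RunsFirst()
    {
    }
}
```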

How do I update all tags on a blog post?

I have a blog website built using C#, ASP.NET and Entity Framework. I am setting up a page to create a blog post which allows the user to add tags. This works fine. However, when it comes to editing the blog post I am sure I must be missing a trick. I can't work out how I would update all the tags attached to the blog post in a single simple process.
The Blog and Tag entities are set up as many-to-many.
So currently I have:
_blog = blogRepo.GetBlogByID(blog.Id);
_blog.Tags = blog.Tags;
blogRepo.UpdateBlog(_blog);
blogRepo.Save();
This works fine if I'm adding new tags. However, if I'm removing tags it only works on the Entity Framework side of things. As soon as the DB context re-initialises, it picks up from the database that the tag is still attached to the blog.
E.g. I have tag "test" added to the blog. I edit the blog, remove the tag "test" and save the blog. The blog is returned by the same request with the tag list empty. If I then make another request for the blog, the tag "test" is back again.
I thought I could just remove all tags from the blog each time and then add any which are there. But I'm sure there must be a better way. Or is something configured wrongly, and this should work with the current setup?
Any help appreciated. Particularly if it points out something stupid which I'm not seeing.

You can't simply assign a new child list to an entities object and expect Entities to figure out all your changes. You have to do that yourself. Here's a simple way to do it. It's not the most efficient, and there are tricks to speeding this up, but it works.
First you need to get the existing list of tags. I'm assuming GetBlogByID() does this. Then, rather than assign a new list of tags, you need to call Remove() on each tag you want removed. Here's an example:
// Generate a list of tags to remove
var tagsToRemove = _blog.Tags.Except(myNewListOfTags).ToArray();
foreach (var toRemove in tagsToRemove)
    _blog.Tags.Remove(toRemove);
// ...save changes
Now, as an optimization if there are a lot of tags, I will sometimes do a direct SQL call to delete all the many-to-many relationships, and then add them all again using Entities, rather than having to figure out each add and remove operation.
_myDbContext.Database.ExecuteSqlCommand(
    "DELETE FROM BlogTagsManyToManyTable WHERE BlogId = @BlogId",
    new SqlParameter("@BlogId", blogId));
I can then add a new list of Blog Tags without having to do any special work.
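The remove-then-add bookkeeping can be sketched without Entity Framework at all. The snippet below is an illustration under assumptions: Tag is a stand-in class with an integer Id, TagSync is a hypothetical helper, and tags are compared by Id because Except() on entity objects falls back to reference equality unless the entity overrides Equals. With a real context, the tags being added would presumably also need to be tracked by that same context.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for the Tag entity; only Id matters for the comparison.
public class Tag
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class TagSync
{
    // Mutates 'current' so that it holds exactly the tags in 'desired', matched by Id.
    public static void Apply(ICollection<Tag> current, IEnumerable<Tag> desired)
    {
        var desiredList = desired.ToList();
        var desiredIds = new HashSet<int>(desiredList.Select(t => t.Id));

        // Remove tags that are no longer wanted (ToList avoids mutating while enumerating).
        foreach (var stale in current.Where(t => !desiredIds.Contains(t.Id)).ToList())
            current.Remove(stale);

        // Add tags that are not attached yet; HashSet.Add returns false for Ids already present.
        var currentIds = new HashSet<int>(current.Select(t => t.Id));
        foreach (var tag in desiredList)
        {
            if (currentIds.Add(tag.Id))
                current.Add(tag);
        }
    }
}
```

Called as TagSync.Apply(_blog.Tags, blog.Tags) before saving, this replaces the plain assignment from the question with explicit Remove() and Add() calls on the loaded collection.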

Url Encoding an array

This might seem dirty but it's for documentation purposes I swear!
I am accessing my services using GETs in my documentation so people can try things out without needing to get too complicated.
Appending x-http-method-override=POST to the URL forces the server to take a GET as a POST
This is all good except when I need to POST an array of objects. This would be simple in a standard POST but today I have a new breed of nightmare.
The expected POST looks like:
{"name":"String","slug":"String","start":"String","end":"String","id":"String","artists":[{"id":"String","name":"String","slug":"String"}],"locationId":"String"}
As you can see there is an array of artists up in here.
I have tried to do the following:
model/listing?start=10:10&end=12:30&artists[0].name=wayne&artists[0].id=artists-289&locationid=locations-641&x-http-method-override=POST
But to no avail.
How can I get an array of objects into a URL so that ServiceStack will be happy with it?!
I appreciate this is not the done thing, but it's making explaining my endpoints infinitely easier with clickable example URLs.

You can use JSV to encode complex objects in the URL. This should work for your DTO:
model/listing?name=wayne&artists=[{id:artists-289,name:sample1},{id:artists-290,name:sample2}]&locationId=locations-641
You can programmatically create JSV from an arbitrary object using the ToJsv extension method in ServiceStack.Text.
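As a rough sketch of that last point, the snippet below builds the artists value with ToJsv and escapes it for use in a query string. It assumes the ServiceStack.Text package is referenced; depending on the version, the extension method lives in the ServiceStack or the ServiceStack.Text namespace, so both are imported. Artist and JsvUrlExample are hypothetical names, and the exact JSV text produced is not shown because it depends on the serializer configuration.

```csharp
using System;
using ServiceStack;       // ToJsv lives here in newer ServiceStack.Text versions (assumption)
using ServiceStack.Text;  // ...and here in older ones

// Hypothetical DTO mirroring one entry of the "artists" array from the question.
public class Artist
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class JsvUrlExample
{
    public static string BuildListingUrl()
    {
        var artists = new[]
        {
            new Artist { Id = "artists-289", Name = "sample1" },
            new Artist { Id = "artists-290", Name = "sample2" },
        };

        // Serialize the array to JSV, then escape it so it is safe inside a query string.
        string jsv = artists.ToJsv();
        return "model/listing?artists=" + Uri.EscapeDataString(jsv)
             + "&locationId=locations-641&x-http-method-override=POST";
    }
}
```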

MongoDB + NoRM- Concurrency and collections

Let's say we have the following document structure:
class BlogPost
{
    [MongoIdentifier]
    public Guid Id { get; set; }
    public string Body { get; set; }
    ....
}
class Comment
{
    [MongoIdentifier]
    public Guid Id { get; set; }
    public string Body { get; set; }
}
If we assume that multiple users may post comments for the same post, what would be the best way to model the relation between these?
If Post has a collection of comments, I might get concurrency problems, won't I?
And placing an FK-like attribute on Comment seems too relational, doesn't it?

You basically have two options: 1. Aggregate comments in the post document, or 2. Model post and comment as separate documents.
If you aggregate the comments, you should either a) implement a revision number on the post, allowing you to detect race conditions and implement handling of optimistic concurrency, or b) add new comments with a MongoDB modifier - e.g. something like
var posts = mongo.GetCollection<Post>();
var crit = new { Id = postId };
var mod = new { Comments = M.Push(new Comment(...)) };
posts.Update(crit, mod, false, false);
If you model post and comment as separate documents, handling concurrency is probably easier, but you lose the ability to load a post and its comments with a single findOne command.
In my opinion, (1) is by far the most interesting option because it models the post as an aggregate object, which is exactly what it is when you put your OO glasses on :). It's definitely the document-oriented approach, whereas (2) resembles the flat structure of a relational database.
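Option (a), the revision number, can be illustrated without a database. The sketch below uses an in-memory stand-in for the collection (InMemoryPostStore, PostDoc and CommentWriter are all hypothetical names, not NoRM or MongoDB APIs): a save succeeds only if the stored revision still matches the one that was loaded, and the writer reloads and retries otherwise. Against a real MongoDB collection the same check would presumably go into the update criteria.

```csharp
using System.Collections.Generic;

// Hypothetical post document carrying a revision counter.
public class PostDoc
{
    public int Revision;
    public List<string> Comments = new List<string>();
}

// In-memory stand-in for the posts collection, holding a single post.
public class InMemoryPostStore
{
    private readonly object _gate = new object();
    private PostDoc _stored = new PostDoc();

    // Returns a private copy, as a driver would when deserializing a document.
    public PostDoc Load()
    {
        lock (_gate)
        {
            return new PostDoc { Revision = _stored.Revision, Comments = new List<string>(_stored.Comments) };
        }
    }

    // Saves only if nobody else has saved since 'doc' was loaded.
    public bool TrySave(PostDoc doc)
    {
        lock (_gate)
        {
            if (doc.Revision != _stored.Revision)
                return false; // stale copy: another writer got there first
            _stored = new PostDoc { Revision = doc.Revision + 1, Comments = new List<string>(doc.Comments) };
            return true;
        }
    }
}

public static class CommentWriter
{
    // Optimistic concurrency: load, modify, try to save, and retry on conflict.
    public static void AddComment(InMemoryPostStore store, string comment)
    {
        while (true)
        {
            var doc = store.Load();
            doc.Comments.Add(comment);
            if (store.TrySave(doc))
                return;
        }
    }
}
```

The retry loop costs extra round trips under contention, which is why the modifier-based option (b) is usually the simpler choice for a plain append.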

This is one of the canonical NoSQL examples. The standard method for doing this is to store the Comments as an array of objects inside of the BlogPost.
To avoid concurrency problems MongoDB provides several atomic operations. In particular there are several update modifiers that work well with "sub-documents" or "sub-arrays".
For something like "add this comment to the post", you would typically use the $push modifier, which appends the comment to the Post's array of comments.
I see that you're using the "NoRM" drivers. It looks like they have support for atomic commands, as evidenced by their tests. In fact, their tests perform a "push this comment to the blog post" operation.

They sort of give an example of how you'd model it over on the MongoDB page on inserting - I think you'd want a collection of comments exposed as a property on your post. You'd add comments to a given Post entity and this would do away with tying a Comment entity back to its parent Post entity which, as you are right to question, is something that makes sense in a RDBMS but not so much in a NoSQL solution.
As far as concurrency goes, if you don't trust Mongo to handle that for you, it's probably a big hint that you shouldn't be building an application on top of it.

I've created a test app that spawns 1000 concurrent threads adding "Comments" to the same "Post", and the result is that a lot of comments are lost.
So MongoDB treats child collections as a single value; it does not merge changes by default.
If I have a Comments collection on the post, then I get concurrency problems when two or more users add comments at exactly the same time (unlikely but possible).
So is it possible to add a comment to the post.comments collection without updating the entire post object?
