Tactics for analyzing data scraped from the web - c#

I'm working on an app and want to have a feature where the user can input a URL and the app will try to pull certain information from the page. I've already decided on HTML Agility Pack for fetching and parsing the HTML itself and creating a DOM that's easy to traverse. My hangup is trying to formulate a generic way to find the information I want from within that DOM.
For example, say the app expects the user to provide a URL to a product page of some sort, and I want to parse out information like price, model, etc. I could always go the route of writing specialized code for the major websites I expect (this answer touches on that), but the goal is not to specialize. Some items can be identified pretty easily (e.g. price), but other information might be labelled with more varied language (e.g. SKU vs. part number vs. stock number vs. item number, etc.).
One thought I've had so far is to identify likely locations for each piece of information I'm trying to extract and, if the confidence is low, present a "preview" to the user and let them approve or reject it. Obviously, though, I'm looking to maximize confidence and minimize demand on the user.
A second thought would be to specialize on major sites and fall back to the generic algorithm as a catch-all. I could possibly collect anonymous usage data to learn which sites are most requested.
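To make the heuristic idea concrete, here is a rough sketch of the kind of keyword/pattern scoring pass that could run over an HTML Agility Pack DOM. The class names, synonym lists, and scoring weights are purely illustrative and would need tuning against real pages; a low-confidence result is exactly where the "preview for the user" flow would kick in.

// Minimal sketch: score DOM text nodes by how strongly they match label
// synonyms plus a value pattern, then keep the best candidate per field.
// FieldCandidate, the synonym lists, and the weights are illustrative only.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class FieldCandidate
{
    public string Field { get; set; }
    public string Value { get; set; }
    public double Confidence { get; set; }
}

class GenericProductScraper
{
    // Synonyms for each field we try to locate; extend as new wording shows up.
    static readonly Dictionary<string, string[]> Labels = new Dictionary<string, string[]>
    {
        ["price"] = new[] { "price", "our price", "now" },
        ["sku"]   = new[] { "sku", "part number", "item number", "stock number", "model" }
    };

    public static IEnumerable<FieldCandidate> Extract(string url)
    {
        var doc = new HtmlWeb().Load(url);
        var textNodes = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text && !string.IsNullOrWhiteSpace(n.InnerText));

        foreach (var field in Labels)
        {
            var candidates = new List<FieldCandidate>();
            foreach (var node in textNodes)
            {
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                // Crude scoring: a label synonym in the text plus a plausible value pattern.
                double score = field.Value.Count(label =>
                    text.IndexOf(label, StringComparison.OrdinalIgnoreCase) >= 0);
                if (field.Key == "price" && Regex.IsMatch(text, @"[\$£€]\s?\d+([.,]\d{2})?"))
                    score += 2;
                if (score > 0)
                    candidates.Add(new FieldCandidate { Field = field.Key, Value = text, Confidence = score });
            }
            var best = candidates.OrderByDescending(c => c.Confidence).FirstOrDefault();
            if (best != null)
                yield return best;   // low-confidence results could trigger the "preview" flow
        }
    }
}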

Related

How to display JSON data on view page

I am working on a project where I have a JSON file and am trying to pass that data through MVC to be displayed on a web page. As part of the JSON file, there is some data that I am trying to pass to my view; it looks like:
"htmlContent": "<p>It was over two decades ago that Bill Gates declared ‘Content is King’ and the consensus still stands today with it arguably being the most important part of designing a website. </p><p>Content is essentially your UX. It encompasses the images, words, videos and data featured across your website. If the main purpose of your site is to share valuable and relevant content to engage your audience, then you should be considering your content long before embarking on a web project. All too often, businesses miss the opportunity to create impressive UX designs, instead waiting until the later stages of the project to sign off content which inevitably creates new challenges to overcome. </p>\r\n<p>Having a research strategy in place that supports a content-first design approach should be at the top of your agenda. When businesses choose to design content-first, they are putting their valuable resources centre stage, conveying their brand through effective and engaging UX design. Throughout this blog, we will share our tips on how you can develop a content-first design approach. </p>\r\n<h2><strong>How to develop a content-first design approach </strong> </h2>\r\n<p>Content can no longer be an after-thought, but there’s no denying that generating content can be a tricky. To get you thinking along the right lines and help put pen to paper, follow our top tips: </p>\r\n<h3><strong>Ask lots of questions</strong> </h3>\r\n<p>Generating content that successfully satisfies what your customers want to know requires a lot of research. Get into the habit of asking open-ended questions that answer the Who, What, Where, When, Why and How. Using this approach will allow you to delve deep and gain an understanding of what your website should include to build a considered site map. </p>\r\n<h3><strong>Consider your Information Architecture (IA)</strong> </h3>\r\n<p>How your content is organised and divided across the website is a crucial aspect of UX design. Without effective sorting, most users would be completely lost when navigating a site and there’s no point having memorable features if they can’t be found! Use card sorting exercises, tree tests, user journey mapping and user flow diagrams to form an understanding of how best to display your content in a logical and accessible way. </p>\r\n<h3><strong>Conduct qualitative and quantitative research</strong> </h3>\r\n<p>Although Google Analytics is extremely useful, it doesn’t hold all the answers. Google Analytics is great at telling you <em>what</em> your users are doing, but it doesn’t give you the insight into <em>why</em> they’re doing it. Qualitative one-to-one user interviews is an effective method of really getting to grips with your user needs to understand why they do what they do. User testing also falls into this category. Seeing a user navigate through your website on a mobile phone in day to day life can give you great insight for UX design in terms of context and situation. </p>\r\n<h3><strong>Align your content strategy with long-term business goals</strong> </h3>\r\n<p>Before beginning your web project, it’s important to understand the goals of the project and the pain points you are trying to solve. Include all the necessary stakeholders within this research to gain a comprehensive understanding of these insights before embarking on your web design project. 
</p>\r\n<h3><strong>Content first, design second</strong> </h3>\r\n<p>Avoid designing content boxes across your website and trying to squeeze the content into these boxes. When designing a new website, it may seem counter intuitive to begin with a page of words rather than a design mock-up. But, it’s important to remember that Lorem Ipsum isn’t going to help anyone either. Begin with the content your users need and then design out from there. Capturing the content and its structure can be done in many ways; we like to build content models based on IA site maps and qualitative user testing such as card sorting and user journey mapping. </p>\r\n<p>By using a content-first design approach, you can understand what content needs to fit into your website design. Analysing your website’s content needs in the early stages or, even better, prior to the project beginning, can effectively inform and shape all touch points ultimately generating an optimised result with reduced time delays and constraints along the way. If you have a web project in mind and need help on how to get started, get in touch with the team today. </p>",
In the view I am then accessing this JSON data through a foreach loop, like so:
#jsondata.htmlContent
This gets the 'htmlContent' from the JSON file, but when I open the web page the 'htmlContent' is not working as I would expect it to: the '<p>' tags do not display as paragraphs; instead, the content on the web page is exactly the same as the raw JSON string.
How would I go about displaying the data with the tags rendered as HTML?
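A note that may help, since the symptom matches the default behaviour: ASP.NET MVC Razor HTML-encodes string output by default, which is why the tags show up as literal text. If the HTML is trusted (it comes from your own JSON file, not from user input), Html.Raw tells Razor to emit it unencoded. A minimal sketch; the model and loop shape here are only illustrative, not the asker's actual code:

@* Razor encodes string output by default; Html.Raw emits the stored markup as-is. *@
@foreach (var jsondata in Model)
{
    <div class="article-body">
        @Html.Raw(jsondata.htmlContent)
    </div>
}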

PDF Creating Server

I've been tasked to create (or find something that already works): a centralized server with an API that can return a PDF file when passed some data and the name of a template. It has to be a robust, enterprise-ready solution. The goal is as follows:
A series of templates for different company things. (Invoices, Orders, Order Plannings, etc)
A way of returning a PDF from external software (Websites, ERP, etc)
Can be an existing, ready-made enterprise solution, but they are pressing for a custom one.
Can be any language, but we don't have any dedicated Java programmers in-house. We are PHP / .NET; some of us dabble in Java, but the learning curve could be a little steep.
So, I've been reading. One option we've considered is installing a JasperReports Server, creating the templates in Jaspersoft Studio, and then using the API to return the PDF files. A colleague favors this option because it's mostly done already, but first, it's Java, and second, I think it's like using a hammer to crack a nut.
Another option we've been toying with is to use C# with iTextSharp to create a server and build our own API that returns exactly the PDF with the data we need. Doing this would have some benefits, like using the database connector we've already made and extracting most of the data from the database instead of having to pass around a big chunk of data. But out of the box it doesn't really have a templating system; we'd have to create something with XMLWorker or with C# classes, and it's not as "easy" as drag and drop. For this case I've also been reading about XFA, but the documentation on the iText site is misleading and unclear.
I've also been reading about some other alternatives, like PrinceXML, PDFBox, FOP, etc., but the concept would be the same as with iText: we'd have to do it ourselves.
My vote, even if it's more work, is to go the iText route and use HTML / CSS for the templates, but my colleagues claim that the templates should be easy to change, possibly every other week (I doubt it), and that HTML / CSS would be too much work.
So the real question is: how do other businesses approach this? Did I leave anything out of my search? Is there an easier way to achieve this?
PS: I didn't know if SO would be the correct place for this question, but I'm mostly lost and risking a "too broad question" or "off topic" tag doesn't seem that bad.
EDIT:
Input should be sent with the same request. If we decide on the C# route, we can get ~70% of the data from the ERP directly, but in any case it should accept a POST request with some data (the template, plus the data needed for that template, like invoice data, or just the invoice ID if we have access to the ERP).
Output should be a PDF (not interested in other formats, just PDF).
Templates will be updated only by IT. (Mostly us, the development team).
Performance-wise, I don't know how much muscle we'll need, but right now, without any growth, we are looking at ~500-1000 PDFs daily, mostly printed from 10:00 to 10:30 and from 12:00 to 13:00, then maybe 100 more over the rest of the day.
Peak load should not be more than ~10,000 daily, when the planets align and it's sales season (twice a year). That should be our ceiling for the years to come.
The templates have some requirements:
Have repeating blocks (invoice lines, for example).
Have images as background, as watermark and as blocks.
Have to be multi language (translatable, with the same data).
Have some blocks that are only shown on a condition.
Blocks dependent on the page (PDF header / page header / page footer / PDF footer)
Templates may have to do calculations over some of the data. I don't think we'll ever need this, but it's something the company may ask for in the future.
The PDFs don't need to be stored, as we have a document management system; maybe in the future we could link them.
Extra data: Right now we are using "Fast-Reports v2 VCL"
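For context on the iTextSharp option discussed above, here is a minimal sketch of what the HTML-template route could look like, assuming iTextSharp 5 with the XMLWorker add-on. The template loading and the {{token}} substitution are purely illustrative stand-ins for a real templating engine (Razor, Scriban, etc.); repeating blocks, conditional blocks, and headers/footers would need more work than shown here.

// Rough sketch: fill an HTML template with request data, then let XMLWorker
// convert the (X)HTML into a PDF. Hypothetical helper, not production code.
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

static class PdfRenderer
{
    public static byte[] Render(string templateHtmlPath, IDictionary<string, string> data)
    {
        // Naive placeholder substitution; a real templating engine would go here.
        string html = File.ReadAllText(templateHtmlPath);
        foreach (var pair in data)
            html = html.Replace("{{" + pair.Key + "}}", pair.Value);

        using (var output = new MemoryStream())
        {
            var document = new Document(PageSize.A4);
            var writer = PdfWriter.GetInstance(document, output);
            document.Open();
            // XMLWorker parses the XHTML and writes it into the PDF document.
            XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, new StringReader(html));
            document.Close();
            return output.ToArray();
        }
    }
}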
Your question shows you've been considering the problem in detail before asking for help so I'm sure SO will be friendly.
Certainly one thing you haven't detailed much in your description is the broader functional requirements. You mentioned cracking a nut with a hammer, but I think you are focused mostly on the technology/interfacing. If you consider your broader requirements for the documents you need to create, and the variables involved, it might be a bigger nut than you think.
The approach I would suggest is to prototype solutions, assuming you have some room to do so. From your research, pick maybe the best three to try, which may well include the custom build you have in mind. Put them through some real use-cases end to end - as rough as you like, but realistic. One or two key documents you need to output should be used across all solutions. Make sure you are covering the most important or most common requirements in terms of:
Input Format(s) - who can/should be updating templates. What is the ideal requirement and what is the minimum requirement?
Output Requirement(s) - who are you delivering to and what formats are essential/desirable
Data Requirement(s) - what are your sources of data and how hard/easy is it to get data from your sources to the reporting system in the format needed?
Template feature(s) - if you are using templates, what features do the templates need? This includes input format(s), but I was mostly thinking of features of the engine like repeating/conditional content, image insertion, table manipulation etc., i.e. are your invoices, orders and planning documents plain or complex?
API requirements - do you have any broader API requirements. You mentioned you use PHP so a PHP library or Web/Web Service is likely to be a good starting point.
Performance - you haven't mentioned any performance characteristics but certainly if you are working at scale (enterprise) it would be worth even rough-measuring the throughput.
iText and Jasper are certainly enterprise grade engines you can rely on. You may wish to look at Docmosis (please note I work for the company) and probably do some searches for PDF libraries that use templates.
A web service interface is possibly a key feature you might want to look at. A REST API is easy to call from PHP and virtually any technology stack. It means you will likely have options about how you can architect a solution, and it's typically easy to prototype against. If you decide to go down the prototyping path and try Docmosis, start with the cloud service since you can prototype/integrate very quickly.
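To illustrate how little code prototyping against a REST endpoint tends to take, here is a sketch of a call from .NET (since the asker's team is PHP / .NET). The URL, route, and JSON body are entirely made up; adapt them to whichever service (custom, Jasper, Docmosis, ...) is being evaluated.

// Hypothetical client: POST a template name plus data, save the returned PDF.
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class PdfClient
{
    static readonly HttpClient Http = new HttpClient();

    public static async Task DownloadInvoiceAsync(string invoiceId)
    {
        var body = new StringContent(
            "{\"template\":\"invoice\",\"invoiceId\":\"" + invoiceId + "\"}",
            Encoding.UTF8, "application/json");

        using (var response = await Http.PostAsync("https://pdf.example.local/api/render", body))
        {
            response.EnsureSuccessStatusCode();
            byte[] pdf = await response.Content.ReadAsByteArrayAsync();
            File.WriteAllBytes(invoiceId + ".pdf", pdf);
        }
    }
}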
I hope that helps.
From my years of experience in working with PDF I think you should pay attention to the following points:
Performance: API-based PDF generation will usually be the fastest, compared to HTML- or XML-to-PDF generation (because of the additional conversion layer involved). Considering peaks in the load, you may want to calculate the cost of scaling up generation by adding more servers (and estimate the cost of the additional servers or resources required per additional PDF file per day).
Ease of iteration and change: how often will you need to adjust templates? If you are going to create templates just once (with some iterations) and then require no changes, you should be OK just coding them against the API. Otherwise, you should strongly consider using HTML or XML for templates to simplify changes and reduce the complexity of maintaining them;
Search and indexing: if you may need to run searches across the created documents, consider storing indexes of the generated documents, or perhaps store the source data in XML alongside each generated PDF file;
Long-term preservation: you should conform to the PDF/A sub-format if you are looking for long-term digital preservation of your documents. See the veraPDF open source initiative, which you can use to validate generated and incoming PDF documents for conformance to PDF/A requirements;
Preserving source files: the PDF format itself was not designed to be edited (though some PDF editors exist), so consider preserving the source data so that you can regenerate PDF documents later and possibly introduce additional output formats.

Any tools, libraries or suggestions to simplify dynamic question functionality?

I am working on an ASP.NET project that is relatively simple except for one requirement: custom questionnaires must be attached to specific types of tasks. These questionnaires need to be customized regularly, and no development within the app itself should be needed to add questionnaires. The questionnaires currently do not require an editing tool and can be added by uploading a template, changing something in a DB, whatever. They can be stored in any format, and the resulting output needs to be captured so it can be edited or viewed later.
The types of questions in the questionnaire could be:
Selections (select one from a list)
Input (text, integers, dates, etc)
Yes/No
The ability to display questions based on answers to other questions. For example, if they answer yes to question X, display question Y, else display question Z. I also need to be able to apply data validation such as required fields, ranges, etc. on questions (which could probably all be captured by basic regex).
The simplest break down would be:
Create a new event.
Based on the type of event display a specific questionnaire.
Questionnaires can change over time, but each change can be considered a new version; data will always be related to a specific version and will not need to be migrated to updated versions.
The questionnaire output (data elements and a final calculated value) must be captured.
XML output (or any other format) of data elements entered.
The optimal (unicorn) scenario would be to have a basic template, in XML or something similar that a user can learn to create easily, stored and versioned in a DB. When a user creates a new event, the app would fetch the appropriate template and display the questionnaire to the user. The user would fill it out, the result would be posted in some output format (again, XML would be nice but not required), and that output would be attached to the event. Done.
Are there any .NET compatible tools/libraries that I could leverage to accomplish this? InfoPath seems like a tool that might be of use but I have almost zero experience with it so I am not sure about its constraints / implementation and if it is just overkill. The solution needs to be contained within the ASP.NET application. An external editor tool for creating templates would be ok but the templates must be viewable and editable on the web with no constraints to the user.
Can anyone provide examples of this being done or hints on how you might have tackled this?
Since the application is relatively easy to create other than this one feature, I would rather not spend 80% of my time trying to implement the custom questionnaire functionality and spend more time on the problem the application is trying to solve.
Tech available: ASP.NET, Silverlight, SQL Server
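To make the "unicorn scenario" above a bit more concrete, here is a rough sketch of how a hand-editable XML template could be parsed into question objects with LINQ to XML, including the conditional-display rule. The schema (question/@id, @type, @dependsOn, @showWhen, @validate) and all names are invented purely for illustration; the real schema would be whatever the template authors agree on.

// Example template the sketch assumes (invented schema):
// <questionnaire version="3">
//   <question id="X" type="yesno"  text="Was equipment used?" />
//   <question id="Y" type="text"   text="Which equipment?" dependsOn="X" showWhen="yes" validate="^.{1,100}$" />
//   <question id="Z" type="select" text="Reason?" dependsOn="X" showWhen="no">
//     <option>Not required</option><option>Unavailable</option>
//   </question>
// </questionnaire>
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class Question
{
    public string Id, Type, Text, DependsOn, ShowWhen, ValidationPattern;
    public List<string> Options = new List<string>();
}

static class QuestionnaireTemplate
{
    public static List<Question> Parse(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc.Root.Elements("question").Select(q => new Question
        {
            Id = (string)q.Attribute("id"),
            Type = (string)q.Attribute("type"),
            Text = (string)q.Attribute("text"),
            DependsOn = (string)q.Attribute("dependsOn"),
            ShowWhen = (string)q.Attribute("showWhen"),
            ValidationPattern = (string)q.Attribute("validate"),
            Options = q.Elements("option").Select(o => o.Value).ToList()
        }).ToList();
    }

    // A question is visible when it has no dependency, or its dependency's answer matches.
    public static bool IsVisible(Question q, IDictionary<string, string> answers) =>
        q.DependsOn == null ||
        (answers.TryGetValue(q.DependsOn, out var given) && given == q.ShowWhen);
}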
I would suggest having a look at a DotNetNuke implementation; I am sure there are a lot of viable options (if not all free).
DotNetNuke
Have a look at the Forge to see free plugins
Consider evaluating SurveyMaster at CodePlex. It's licensed under Microsoft Public License (Ms-PL), and you can modify its source for your needs.

How do I data mine various news sources?

I'm working on a free web application that will analyze top news stories throughout the day and provide stats. Most news websites offer RSS feeds, which work fine for knowing which stories to retrieve. However, the problems arise when attempting to get the full news story from the news website itself. At the moment, I have a separate NewsSource class for each source (CNN, NY Times, etc.) that reads the appropriate RSS feed(s), follows each link, and strips out the body. This seems tedious and very unmanageable when a news website decides to change the HTML structure of its articles.
Is there a service (preferably free) that already aggregates multiple news sources with the full article content (not just a summary)? If not, do you have any suggestions for handling multiple sources with different HTML structures that may change without notice?
Use Readability. Search for a Readability port for the language you use.
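For a rough idea of what that approach does under the hood, here is a toy sketch of the core heuristic (text-density scoring) using HTML Agility Pack. Real Readability ports add many more signals (link density, class-name hints, and so on); this only shows the central idea of picking the container with the most paragraph text instead of writing per-site rules.

// Illustrative only: keep the densest text container rather than parsing each site by hand.
using System.Linq;
using HtmlAgilityPack;

static class ArticleExtractor
{
    public static string ExtractMainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Score likely containers by how much paragraph text they directly hold.
        var best = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "div" || n.Name == "article" || n.Name == "section")
            .OrderByDescending(n => n.Elements("p").Sum(p => p.InnerText.Length))
            .FirstOrDefault();

        return best == null
            ? string.Empty
            : string.Join("\n\n", best.Elements("p").Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim()));
    }
}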

Can Omniture SiteCatalyst or any web analytics software track critical page views?

I need to track only human visits to my article pages. I hear that SiteCatalyst is the best of the best for page tracking. Here is what I am trying to do: I need to track every human visit if possible, because this will affect the amount of money I have to pay. I will need to download site statistics for all of my pages with an accurate hit count. Again, I don't want to track spiders/bots. Once I download the site statistics, I will use them to update the hit counts for each of my articles. Then I will pay my writers according to how many hits they receive. Is SiteCatalyst able to do this? If not, who do you think can do something like this?
Luke - quick answer: there is currently no 100% accurate way to get this.
Omniture's SiteCatalyst does provide a very good tool for acquiring visitor information. You can acquire visitor information from any of the other vendors as well including the free option Google Analytics.
You may have been led to believe, as I had, that Omniture strips out all bots and spiders by default. Omniture states that most bots and spiders do not load images or execute JavaScript, which is what they rely upon for tracking. I am not sure what the exact percentage is, but not all bots and spiders act in this way.
In order to get a more accurate report on the number of "humans", you will need to know the visitor's IP address and possibly the user agent. You can capture the agent and IP in PHP with these two variables: $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_ADDR']. You will then need to strip out the IP addresses of known bots/spiders from your reporting. You can do this with lists like this one: http://www.user-agents.org/index.shtml, or manually by looking at the user agent. Beware of relying on the user agent, as a bot can easily spoof it. This will never be 100% accurate because new bots/spiders pop up every day. I suggest looking further into "click fraud".
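The same capture-and-filter idea translates directly to ASP.NET if that's your stack; a minimal sketch follows (the PHP variables above are the equivalents). The bot signature list here is only a tiny illustrative sample; in practice you would maintain it from a source like user-agents.org, and as noted above, user agents can be spoofed.

// Best-effort filter: count a hit only when the user agent doesn't look like a known bot.
using System;
using System.Linq;
using System.Web;

static class HumanHitFilter
{
    static readonly string[] BotSignatures = { "bot", "crawler", "spider", "slurp" };

    public static bool LooksHuman(HttpRequest request)
    {
        string userAgent = request.UserAgent ?? string.Empty;
        string ip = request.UserHostAddress;   // could also be checked against a bot IP list

        bool agentIsBot = BotSignatures.Any(sig =>
            userAgent.IndexOf(sig, StringComparison.OrdinalIgnoreCase) >= 0);

        // Spoofable, so treat this as a best-effort filter rather than a guarantee.
        return !agentIsBot && !string.IsNullOrEmpty(ip);
    }
}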
Contact me if you want further info.
Omniture also weeds out traffic from known bots/spiders. But yeah... there is an accepted margin of error in the analytics industry, because it can never be 100% accurate given the nature of the currently available technology.
