How to get hidden data in an HTML file - C#

I am trying to get the comments of an Instagram post with C#. The problem is the 'Load more comments' button: when I click it and inspect the page's HTML in Firefox, new <li> elements appear out of nowhere. Is this data fetched from a database, or is it embedded in the HTML file? Is there a way to reach that data? I tried SgmlReader, but I couldn't get all of the data I'm looking for.

Related

Recursive HTML Parsing using C#

I'm trying to export HTML content (tables) to CSV files using C#, and based on my research here, one of the best ways to implement this is with the HTML Agility Pack.
I haven't started coding and testing this yet because I need to be sure it's doable first. The HTML table on the website receives push messages from the server, so its contents are updated in real time and can change at any moment. What I would like to do is export the table to CSV every time the table changes (e.g. row added, row deleted, cell contents modified).
I am not sure whether this can be done with the HTML Agility Pack, or with C# at all.
Please advise and thank you in advance.
Since this is dynamically updating data, it sounds like a headless browser would be a better fit for what you're looking to do. Something like espion.io or PhantomJS. A headless browser would let you respond to these data pushes and capture the HTML for further processing.
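Once a headless browser hands you the current table, the "export on every change" part can be done in plain C#. The following is only a minimal sketch under assumptions: the class name `TableCsvExporter` and its methods are hypothetical, and the rows are assumed to have already been extracted from the captured HTML as sequences of cell strings.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: turns table rows into CSV and rewrites the file
// only when the content differs from the previous snapshot.
public sealed class TableCsvExporter
{
    private string lastCsv; // CSV text of the last snapshot written

    // Quote a field if it contains a comma, a quote, or a line break.
    public static string EscapeField(string field)
    {
        field = field ?? "";
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Join cells with commas and rows with CRLF.
    public static string ToCsv(IEnumerable<IEnumerable<string>> rows)
    {
        return string.Join("\r\n",
            rows.Select(row => string.Join(",", row.Select(f => EscapeField(f)))));
    }

    // Returns true when the table changed and the file was written.
    public bool ExportIfChanged(IEnumerable<IEnumerable<string>> rows, string path)
    {
        string csv = ToCsv(rows);
        if (csv == lastCsv)
            return false;
        File.WriteAllText(path, csv, Encoding.UTF8);
        lastCsv = csv;
        return true;
    }
}
```

You would call `ExportIfChanged` each time the headless browser reports or polls a new snapshot of the table; unchanged snapshots are skipped because the generated CSV text is compared with the previous one.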

C# Webbrowser incorrectly loads html of page

I am developing an Amazon account checker for a customer, but I have run into a problem at the stage where I need to retrieve order information from Amazon. As soon as the document is loaded I get the HTML, but in the orders section I get "There was a problem loading your orders..." whereas the web browser displays the orders properly. Is there any way to synchronize the loading of the HTML (as in HtmlDocument) with the displayed contents? A hint is enough. Regards.

Get a snapshot of posted HTML page?

I'm using ExpertPDF to convert a couple of web pages to PDF, and there's one that I'm having difficulties with. This page only renders content when data is POSTed to it, and the content is text and a PNG graph (the graph is the most important piece).
I tried creating a page with a form that auto-submits from the body's onload='' event. If I go to this page, it auto-posts to the third-party page and I get the page I expect. But it appears ExpertPDF won't take a 'snapshot' if the page is redirected.
I tried using HttpWebRequest/HttpWebResponse and WebClient, but have only been able to retrieve the HTML, which doesn't include the PNG graph.
Any idea how I can create a MemoryStream that includes the HTML AND the PNG graph, or post to the page and then somehow send ExpertPDF to that URL to take a snapshot of the posted results?
Help is greatly appreciated - I've spent too much time on this one *sniff*.
Thanks!
In HTML/HTTP the web page (the HTML) is a separate resource from any images it includes. So you would need to parse the HTML and find the URL that points to your graph, and then make a second request to that URL to get the image. (This is unless the page spits the image out inline, which is pretty rare, and if that were the case you probably wouldn't be asking.)
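The two-request approach described above could look roughly like the sketch below. It is an illustration under assumptions, not a drop-in solution: `GraphImageFetcher` and its methods are hypothetical names, the graph is assumed to be the first `<img>` in the response, and a regular expression stands in for a real HTML parser, which is brittle on unusual markup. If the third-party site ties the graph to a session, the second request may also need the same cookies as the POST.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper: finds the first <img> in the HTML and fetches it.
public static class GraphImageFetcher
{
    // Captures the quoted value of the src attribute of an <img> tag.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of the first image, or null if there is none.
    public static Uri FindFirstImageUrl(string html, Uri pageUrl)
    {
        Match match = ImgSrc.Match(html);
        if (!match.Success)
            return null;
        // Attribute values are HTML-encoded (&amp; etc.), so decode first.
        string src = WebUtility.HtmlDecode(match.Groups[1].Value);
        // Resolve a relative src against the URL the page was loaded from.
        return new Uri(pageUrl, src);
    }

    // Second request: download the image bytes (null if no image was found).
    public static async Task<byte[]> DownloadFirstImageAsync(
        HttpClient client, string html, Uri pageUrl)
    {
        Uri imageUrl = FindFirstImageUrl(html, pageUrl);
        if (imageUrl == null)
            return null;
        return await client.GetByteArrayAsync(imageUrl);
    }
}
```

Here `html` is the string you already get back from WebClient, and `pageUrl` is the address you posted to, so that relative image paths resolve correctly.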
A quick look at ExpertPDF's FAQ page shows an entry that deals specifically with your problem. I would recommend you take a look at it.
** UPDATE **
Take a look at the second FAQ question:
Q: When I convert a HTML string to PDF, the external CSS files and images are not applied in the rendered PDF document.
You can take the original (single) response from your WebClient and convert that into a string and pass that string to ExpertPDF based on the answer to that question.

C# extract content from HTML document

I was wondering how I can do something similar to what Facebook does when a link is posted, or what link-shortening services do: get the title of the page and its content.
Example:
My idea is to get only the plain text from a web page. For example, if the URL is a newspaper article, how can I get only the article's text, as shown in the image? So far I have been trying to use the HtmlAgilityPack, but I can never get the text clean.
Note this app is for Windows Phone 7.
You're on the right track with HtmlAgilityPack.
If you want all the text of the page, use the InnerText property. But I suggest you go with the meta description tag (if available).
EDIT - Go for the meta description. I believe that's what Facebook is doing:
Facebook link sample
Site source
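Reading the meta description with HtmlAgilityPack could be sketched as follows. This assumes the HtmlAgilityPack package is referenced; `PagePreview` and `GetDescription` are hypothetical names, and falling back to the page title when there is no description is my own assumption rather than anything Facebook is known to do. It uses `Descendants` with LINQ instead of XPath, on the assumption that XPath support may not be available in every build of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: picks a short preview text for a page.
public static class PagePreview
{
    // Returns the meta description, else the <title> text, else "".
    public static string GetDescription(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Look for <meta name="description" content="...">.
        HtmlNode meta = doc.DocumentNode.Descendants("meta")
            .FirstOrDefault(n => string.Equals(
                n.GetAttributeValue("name", ""), "description",
                StringComparison.OrdinalIgnoreCase));

        string text = meta != null ? meta.GetAttributeValue("content", "") : "";

        if (text.Trim().Length == 0)
        {
            // Fallback (assumption): use the page title instead.
            HtmlNode title = doc.DocumentNode.Descendants("title").FirstOrDefault();
            text = title != null ? title.InnerText : "";
        }

        // Turn entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(text).Trim();
    }
}
```

The description is usually a sentence or two written by the site itself, which tends to be cleaner than anything you can carve out of the page body's InnerText.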

Scraping content from webpage

I need to scrape a remote html page looking for images and links. I need to find an image that is "most likely" the product image on the page and links that are "near" that image. I currently do this with a javascript bookmarklet so that I am able to get the rendered x/y coordinates of images and links to help me determine if those are the ones that I want.
What I want is the ability to get this information from just a URL, without the bookmarklet. The issue is that if I use the URL with something like HttpWebRequest and fetch the HTML on the server, I will not have location values, since the page wasn't rendered in a browser. I need the locations of images and links to help me decide which images and links I want.
So how can I get HTML from a remote site on the server AND use the rendered location values of the DOM elements to help me locate images and links?
As you indicate, doing this purely through inspection of the html is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.
You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.
You can download it from http://htmlagilitypack.codeplex.com/
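A rough sketch of that download-and-parse approach is below. Assumptions: the HtmlAgilityPack package is referenced, `PageScraper` and its methods are hypothetical names, and `HttpClient` is swapped in for HttpWebRequest purely for brevity. Note that this only yields the URLs found in the markup; it gives no rendered x/y positions, so it does not replace a real browser for the layout part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper: downloads a page and lists URLs found in its markup.
public static class PageScraper
{
    // Collects the given attribute from every <tag> element as an absolute URL.
    public static List<Uri> ExtractUrls(string html, Uri pageUrl, string tag, string attribute)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<Uri>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants(tag))
        {
            string raw = HtmlEntity.DeEntitize(node.GetAttributeValue(attribute, ""));
            Uri absolute;
            // Skip elements without the attribute or with an unparsable value.
            if (raw.Length > 0 && Uri.TryCreate(pageUrl, raw, out absolute))
                urls.Add(absolute);
        }
        return urls;
    }

    // Downloads the page, then prints its image and link URLs.
    public static async Task PrintImagesAndLinksAsync(Uri pageUrl)
    {
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(pageUrl);
            foreach (Uri image in ExtractUrls(html, pageUrl, "img", "src"))
                Console.WriteLine("image: " + image);
            foreach (Uri link in ExtractUrls(html, pageUrl, "a", "href"))
                Console.WriteLine("link:  " + link);
        }
    }
}
```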
