I need to make a Windows desktop application in C# that downloads all the PDFs from a website. I have the link to the website, but the problem I am facing is that the PDFs are not in a specific folder on the website; they are scattered all over it.
What I need is help finding all those links so I can download them, or any other advice that could help with my problem.
Thanks in advance for any help.
Scrape through all the pages
Find all the "*.pdf" URLs
Resolve them to absolute URLs and simply download :)
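The three steps above could be sketched in C# roughly as follows. This is a minimal sketch that assumes static HTML pages reachable from a start URL (www.example.com is a placeholder); regex-based link extraction is fragile, and a real HTML parser such as HtmlAgilityPack would be more robust:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class PdfCrawler
{
    static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // Pull every href out of a page and resolve it against the page's own URL,
    // so relative links like "docs/a.pdf" become absolute.
    public static List<Uri> ExtractLinks(string html, Uri page)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
            if (Uri.TryCreate(page, m.Groups[1].Value, out var link))
                links.Add(link);
        return links;
    }

    static async Task Main()
    {
        var start = new Uri("http://www.example.com/");   // your site here
        var http = new HttpClient();
        var seen = new HashSet<string>();
        var queue = new Queue<Uri>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            if (!seen.Add(page.AbsoluteUri)) continue;     // already visited

            string html;
            try { html = await http.GetStringAsync(page); }
            catch (HttpRequestException) { continue; }     // skip unreachable pages

            foreach (var link in ExtractLinks(html, page))
            {
                if (link.Host != start.Host) continue;     // stay on the same domain
                if (link.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    if (seen.Add(link.AbsoluteUri))
                        File.WriteAllBytes(Path.GetFileName(link.LocalPath),
                                           await http.GetByteArrayAsync(link));
                }
                else
                {
                    queue.Enqueue(link);                   // another page to scan
                }
            }
        }
    }
}
```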
Please be more specific: are you trying to get all the PDFs from a single HTML page, or from the whole domain?
What you are trying to do is known as web scraping. There are some libraries that can make your task easier; one of them is IronWebScraper, but it is a paid one.
An extensive list of NuGet packages that can be used for web scraping is available here.
In ASP.NET, is there a way to create a WinForm (or something of the sort) as a PDF viewer rather than using the browser?
I need to restrict printing, URL-to-PDF, and right-click save-as-image, or anything else. If I can do this through the browser, where can I find some sample code?
I understand there are hacks to get that PDF downloaded, and PRINT SCREEN is another option, as is any image-capture program. This is an intranet site, so I am aware of these things and not worried about them. We just need to make it difficult for non-technical employees.
The PDFs hold important reporting information that we do not want printed and left in their office anymore.
Does anyone have sample code to achieve this, or any tutorial or leads on the direction I need to take?
FYI, the site is on .NET 3.5 and WebForms; an upgrade to 4 or 4.5 has not been approved yet.
Thanks in advance
I'm looking for the best way to display a PDF document on a website. Surely I need to convert it to JPEG or GIF for the browser to handle it. I read a few posts, but most refer to GhostScript and its pdf2image. That solution calls for starting a process that saves a copy of the PDF to the file system, which then has to be loaded back into memory for display. Frankly, I find it a bit clumsy. For those of you who have done it, what library did you use? If you could attach a link to some examples, I'd greatly appreciate it.
I'm developing a web application that helps manage a manufacturing process and is accessed from Android tablets. The company has a stockpile of documentation in PDF files that is to be delivered to production managers. I'd love the solution to be akin to the Crystal Reports Viewer control, but I understand that I have to stick to PDF-to-image conversion. Please give me some advice here.
My advice is: don't overthink this.
You can simply add a link to the PDF file, which will open on a new tab.
You can take a look at http://mozilla.github.io/pdf.js/ which will allow you to render a PDF on the client side.
Or, if you decide to go with Ghostscript, you can take a look at http://ghostscriptnet.codeplex.com
By all accounts, the PDF Focus .NET library seems to be the best solution. A word of advice: add a cleanup method to the page unload event to delete all the temporary files that were used to feed source into the image controls when displaying the pictures on the website.
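That cleanup advice could be sketched as a WebForms unload handler. This is only an illustration: the folder name "~/TempImages" and the per-session file naming scheme are assumptions, not anything from the library:

```csharp
using System;
using System.IO;
using System.Web.UI;

public class ReportViewerPage : Page
{
    // Fires at the end of the page lifecycle; delete the temporary images
    // that were generated to feed the image controls on this page.
    protected void Page_Unload(object sender, EventArgs e)
    {
        string tempDir = Server.MapPath("~/TempImages");
        if (!Directory.Exists(tempDir)) return;

        // Delete only this session's files so concurrent users are unaffected
        // (assumes files were written as "<sessionid>_<page>.png").
        foreach (string file in Directory.GetFiles(tempDir, Session.SessionID + "_*.png"))
        {
            try { File.Delete(file); }
            catch (IOException) { /* file may still be streaming; skip it */ }
        }
    }
}
```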
We are looking for a web control capable of displaying PDF files, videos, and PowerPoint presentations. This is part of a research school website where students can view these files. No download should be permitted. I appreciate your suggestions!
We have used PDF.js (https://mozillalabs.com/en-US/pdfjs/), which is very good, and downloads can be prevented.
I want to find a way to automate pulling the total installs for each app my company has in Google Play. I want to bring this into our own internal database so we can marry it up with the Google Analytics and in-app information we already have.
Can anyone point me to the SDK or API I need, and maybe some helpful hints on automating this feed on an ongoing basis?
Thanks,
I was looking into this and, among others, saw this question in my search.
The answer to that question states that you would need to scrape the data.
Google Play Developer Statistics API
That page states that you can manually download a CSV file.
https://support.google.com/googleplay/android-developer/bin/answer.py?hl=en&answer=139628&topic=16285&ctx=topic
You could try using Selenium to automatically download the CSV file and then read it in.
The answer on this page may also help you:
Is there an API to get sales report on Google Play?
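Once the CSV export is on disk (however it was fetched), reading it into something database-ready might look like the sketch below. The column names ("Date", "Package Name", "Total User Installs") and the date format are assumptions for illustration, so check the header row and encoding of your actual export before relying on them:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class InstallRow
{
    public DateTime Date;
    public string Package;
    public int TotalInstalls;
}

class CsvImport
{
    public static List<InstallRow> Parse(TextReader reader)
    {
        var rows = new List<InstallRow>();
        reader.ReadLine();                              // skip the header row
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Naive split: fine as long as no field contains a quoted comma.
            string[] cols = line.Split(',');
            rows.Add(new InstallRow
            {
                Date = DateTime.Parse(cols[0]),
                Package = cols[1],
                TotalInstalls = int.Parse(cols[2])
            });
        }
        return rows;
    }

    static void Main()
    {
        using (var reader = new StreamReader("installs.csv"))
            foreach (var r in Parse(reader))
                Console.WriteLine($"{r.Date:d} {r.Package} {r.TotalInstalls}");
    }
}
```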
I want to develop an ASP.NET page through which I can specify the URL of any page that contains links to many files and directories. I want to download them all, similar to the DownThemAll plugin for Firefox.
i.e.
"MyPage.htm" file contains many links to files/directories located on the same server.
Now I want to write a function that can download all these files if I provide
"www.mycustomdomain.com/Mypage.htm" as input.
I hope the question is clear.
Fetch the web page as HTML. Google "c# fetch file from web"; the first link will give you the idea.
Then find the links with regular expressions.
An example regex pattern for links on www.x.com would be:
(http://www.x.com/.*?)
(But it's better if you also include the A tag in your regex pattern.)
And download the files as shown in:
http://www.csharp-examples.net/download-files/
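Put together, those steps might look like the sketch below. The domain is the example one from the question; the regex anchors on the A tag as suggested, but regex-based HTML parsing remains brittle, so treat this as a starting point:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class DownloadAll
{
    // Matches <a ... href="..."> and captures the URL, following the advice
    // to anchor the pattern on the A tag rather than on bare URLs.
    public static MatchCollection FindLinks(string html)
    {
        var pattern = new Regex("<a[^>]+href\\s*=\\s*[\"']([^\"']+)[\"']",
                                RegexOptions.IgnoreCase);
        return pattern.Matches(html);
    }

    static void Main()
    {
        var baseUri = new Uri("http://www.mycustomdomain.com/");
        using (var client = new WebClient())
        {
            string html = client.DownloadString(new Uri(baseUri, "Mypage.htm"));
            foreach (Match m in FindLinks(html))
            {
                // Resolve relative hrefs against the page's base URL.
                var url = new Uri(baseUri, m.Groups[1].Value);
                string name = System.IO.Path.GetFileName(url.LocalPath);
                if (name.Length == 0) continue;   // a directory link, not a file
                client.DownloadFile(url, name);
            }
        }
    }
}
```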
I hope I understand your question: you have an HTM file with a list of links, these links point to specific files on a remote server, and you want to download all of those files.
There is no fail-proof way to do this.
Check this question: How do you parse an HTML in vb.net. Even though it is for VB.NET, it is related to what you asked. You can get an array of links and then start downloading the files.
You can use the Computer.Network.DownloadFile method to download the remote file and save it to a location of your choice.
This is not a fail-proof method, because if a download requires authentication, it will download the HTML page instead (usually the login page).
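One way to soften that failure mode is to check what actually came back before saving it. A sketch (the URL and file name are placeholders), which simply treats an HTML response as a likely login-page redirect:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SafeDownload
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        using (var resp = await http.GetAsync("http://www.example.com/report.pdf"))
        {
            string type = resp.Content.Headers.ContentType?.MediaType ?? "";
            if (type == "text/html")
            {
                // Authentication probably redirected us to the login page,
                // so don't save this as if it were the requested file.
                Console.WriteLine("Got HTML back; authentication is likely required.");
                return;
            }
            byte[] bytes = await resp.Content.ReadAsByteArrayAsync();
            System.IO.File.WriteAllBytes("report.pdf", bytes);
        }
    }
}
```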