Download all the links from any page - c#

I want to develop an ASP.NET page through which I can specify the URL of any page that contains links to many files and directories, and download them all, similar to the DownThemAll plugin for Firefox.
i.e. the file "MyPage.htm" contains many links to files/directories located on the same server.
Now I want to write a function that can download all these files if I provide
"www.mycustomdomain.com\Mypage.htm" as input.
I hope the question is clear.

Fetch the web page as HTML. Google "c# fetch file from web"; the first link will give you the idea.
Then find the links with regular expressions.
An example regex pattern for links on www.x.com would be:
(http://www.x.com/.*?)
(It is better if you also include the A tag in your regex pattern.)
Then download the files as shown in:
http://www.csharp-examples.net/download-files/
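A minimal sketch of that approach, assuming the page is publicly reachable and the links are absolute URLs on the same domain (the regex, domain, and folder names below are illustrative, not from the original answer):

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class LinkDownloader
{
    static void Main()
    {
        // Illustrative input page; replace with the real URL.
        string pageUrl = "http://www.mycustomdomain.com/Mypage.htm";

        using (var client = new WebClient())
        {
            // 1. Fetch the page as HTML.
            string html = client.DownloadString(pageUrl);

            // 2. Find links with a regular expression, anchored on the A tag
            //    as the answer suggests.
            var matches = Regex.Matches(html,
                @"<a\s+[^>]*href\s*=\s*[""'](?<url>http://www\.mycustomdomain\.com/[^""']+)[""']",
                RegexOptions.IgnoreCase);

            // 3. Download each linked file into a local folder.
            Directory.CreateDirectory("downloads");
            foreach (Match m in matches)
            {
                string fileUrl = m.Groups["url"].Value;
                string fileName = Path.GetFileName(new Uri(fileUrl).LocalPath);
                if (string.IsNullOrEmpty(fileName)) continue; // skip directory links
                client.DownloadFile(fileUrl, Path.Combine("downloads", fileName));
            }
        }
    }
}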

Hope I understand your question. You have an HTM file with a list of links, these links point to specific files on a remote server, and you want to download all the files.
There is no foolproof way to do this.
Check this question: How do you parse an HTML in vb.net. Even though it is for VB.NET, it is related to what you asked for. You can get an array of links and then start downloading the files.
You can use the My.Computer.Network.DownloadFile method to download the remote file and save it to a location of your choice.
This is not a foolproof method, because if a download requires authentication it will download the HTML page instead [usually the login page].
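If the files sit behind standard HTTP authentication, one way to avoid getting the login page back is to attach credentials to the request. A hedged C# sketch (the URL, credentials, and content-type check are illustrative, and this only helps for servers that actually accept HTTP authentication):

using System;
using System.Net;

static class ProtectedDownload
{
    // Supply credentials so a protected file is served instead of the HTML
    // login page, then sanity-check what came back.
    public static bool TryDownload(string fileUrl, string localPath,
                                   string user, string password)
    {
        using (var client = new WebClient())
        {
            client.Credentials = new NetworkCredential(user, password);
            client.DownloadFile(fileUrl, localPath);

            // If the response was HTML, we probably got a login or error page
            // rather than the file we wanted.
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            return contentType == null ||
                   !contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
        }
    }
}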

Related

How to get chrome to recognize bookmarks in file download URL

I have searched everywhere and cannot find how to get Chrome to recognize bookmarks in a doc download. (It's hard to search for this when bookmarks and anchors have two meanings.)
For instance, the URL http://my.website.com/myfile.doc#mybookmark downloads the file just fine, but I cannot get the browser to scroll to the bookmark. I've tried the built-in Google Docs viewer, as well as Microsoft's, and neither works.
https://products.office.com/en-US/office-online/view-office-documents-online
https://docs.google.com/viewer?url=my.website.com/myfile#mybookmark
Does anyone know how this can be done?
If I can do it in code, or in some other way, please feel free to say so. I just need the client to download a file from a link and have it go to the bookmark I specify.
(The site is written in VB.NET and the clients use Chrome.) C# is fine as well.
Thanks!

Does Google Accept Only Sitemaps With a .txt Extension?

I have finished working on my ASP.NET 4.0 website. Now that I am about to publish it in the next few days, I am looking for resources that can help me rank my site better on popular search engines. My site displays both static and dynamic content. For the dynamic content I will be generating a dynamic sitemap each week. My problem is that I read on the Google Webmaster website that Google accepts sitemaps only with a .txt extension (https://support.google.com/webmasters/answer/183668?hl=en). The original instructions are quoted as:
For best results, use the following guidelines for creating text file sitemaps:
* You must fully specify all URLs in your sitemap, as Google attempts to crawl them exactly as you list them.
* Your text file must use UTF-8 encoding.
* Your text file should contain nothing but the list of URLs.
* You can name the text file anything you wish, provided it has a .txt extension (for instance, sitemap.txt).
As I have mentioned, I will be using C# code to dynamically generate an XML sitemap for my site, but I am not sure I will be able to write XML (using C#) into .txt files. I have very little knowledge of writing XML with C# (such as by using the XmlWriter class). I have found a website whose sitemap file has an .xml extension (http://www.mikesdotnetting.com/sitemap). Can anybody tell me what I need to do to complete this final step of my project? Another thing I am interested to know is whether I should resubmit my sitemap to Google every time a link is modified. Google says a submitted sitemap must contain no more than 50,000 URLs and be less than 50 MB.
Submitting every dynamic URL isn't necessarily going to improve your ranking. A lot of your ranking will depend on your PageRank, not the number of URLs you have. You need good content, and people who link to your site because they think your content is good.
You read the document wrong. The very first line says:
In addition to the standard XML format, Google also accepts the following file types as sitemaps:
The .txt file extension is only required when you submit a plain text file that lists only the URLs. Notice how even their submission example has an .xml extension, since it uses the XML sitemap format.
Instead of generating a new sitemap file weekly, it would be much simpler to write a handler that generates the sitemap upon request and caches the data for a period of time, so you're not regenerating it on every request for the sitemap file.
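A minimal sketch of that handler idea, assuming an ASP.NET IHttpHandler; the cache key, 24-hour cache duration, and the GetPageUrls() helper are hypothetical placeholders, not part of the original answer:

using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Xml;

public class SitemapHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Serve a cached copy if we generated one recently.
        string xml = context.Cache["sitemap"] as string;
        if (xml == null)
        {
            xml = BuildSitemap(GetPageUrls());
            context.Cache.Insert("sitemap", xml, null,
                DateTime.UtcNow.AddHours(24),
                System.Web.Caching.Cache.NoSlidingExpiration);
        }

        context.Response.ContentType = "text/xml";
        context.Response.Write(xml);
    }

    // Writes the standard <urlset> sitemap format with XmlWriter.
    private static string BuildSitemap(IEnumerable<string> urls)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { Indent = true };
        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartDocument();
            writer.WriteStartElement("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
            foreach (string url in urls)
            {
                writer.WriteStartElement("url");
                writer.WriteElementString("loc", url);
                writer.WriteElementString("changefreq", "weekly");
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
            writer.WriteEndDocument();
        }
        return sb.ToString();
    }

    // Hypothetical helper: a real site would query its database of dynamic pages.
    private static IEnumerable<string> GetPageUrls()
    {
        return new[] { "http://www.example.com/", "http://www.example.com/page1" };
    }
}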
If Google knows where the sitemap is, it will check it periodically anyway, so resubmitting it might not get you anything. You also don't want to submit too often, as Google and other search engines may think you are trying to spam them. That's why the XML sitemap definition has elements for change frequency, so they know how often to re-spider the page.
FYI, don't expect to see good rankings right away. It takes a while, months even, depending on the popularity of your site and the quality of the content. The volume of links won't help, and Google will find them anyway when it spiders. It is possible to do too much.

Issue in downloading a video from YouTube?

I am trying to make a video download application for desktop in C#.
Now the problem is that the following code works fine:
WebClient webOne = new WebClient();
string temp1 = "http://www.c-sharpcorner.com/UploadFile/shivprasadk/visual-studio-and-net-tips-and-tricks-15/Media/Tip15.wmv";
webOne.DownloadFile(new Uri(temp1), "video.wmv");
But the following code doesn't:
temp1="http://www.youtube.com/watch?v=Y_..."
(in this case, a 200-400 kilobyte junk file gets downloaded)
The difference between the two URLs is obvious: the first one contains the exact file name, while the other seems to be encoded in some way...
I was unable to find any proper and satisfactory solution to the problem, so I would highly appreciate a little help here. Thanks.
Note:
From one of the questions here I got a link, http://youtubefisher.codeplex.com/, so I visited it, got the source code, and read it. It's great work, but what I don't get is how in the world that person came to know what structures and classes he had to write to download a YouTube video, why he had to go through all that trouble, and why my method isn't working.
Someone please guide. Thanks again.
In order to download a video from YouTube, you have to find the actual video location, not the page that you use to watch the video. The http://www.youtube.com/watch?v=... URL is an HTML page (much like this one) that loads the video from its source location and displays it. Normally, you have to parse the HTML and extract the video location from it.
In your case, you found code that does this already - and lucky you, because downloading videos from YouTube is not simple at all. Looking at the link you provided in your question, the magic behind the madness is available in YoutubeService.cs / GetDownloadUrl():
http://youtubefisher.codeplex.com/SourceControl/changeset/view/68461#1113202
That method parses the HTML page returned by a YouTube watch URL and finds the actual video content. The added complexity is that YouTube videos can also come in a variety of different formats.
If you need to convert the video format after downloading, I recommend FFmpeg.
EDIT: In response to your comment: you didn't look at the source code of YoutubeFisher at all, did you? I'd recommend analysing the file I mentioned (YoutubeService.cs). Having taken a quick look myself, you'll have to parse the yt.playerConfig variable within the HTML page.
Use that source to help you.
EDIT: In response to your second comment: "Actually I am trying to develop an application that can download video from any video site." You say that like it's easy; FYI, it's not. Since every video website is different, you can't just write something that will work for everything out of the box. If I had to do it, though, here's how I would: I would write custom parsers for the major video sharing websites (Metacafe, YouTube, whatever else) so that those ones are guaranteed to work. After that, I would write a "fallover", if you will. Basically, if you're requesting a video from an unknown website, it would scour the HTML looking for known video extensions (flv, wmv, mp4, etc.) and then extract the URL from that.
You could use a regex to extract the URL in the latter case, or a combination of something like IndexOf, Substring, and LastIndexOf.
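A rough sketch of that fallback idea using the regex route, assuming the video URL appears literally in the page markup (many sites load videos through scripts, where this won't find anything; the extension list is illustrative):

using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class VideoLinkScanner
{
    // Scans a page's HTML for absolute URLs ending in a known video extension.
    public static List<string> FindVideoUrls(string pageUrl)
    {
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        var results = new List<string>();
        foreach (Match m in Regex.Matches(html,
            @"https?://[^\s""'<>]+\.(flv|wmv|mp4|avi|mov)",
            RegexOptions.IgnoreCase))
        {
            if (!results.Contains(m.Value))
                results.Add(m.Value);
        }
        return results;
    }
}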
I found this page at CodeProject; it shows you how to make a very efficient YouTube downloader using no third-party libraries. Remember that it is sometimes necessary to slightly modify the code, as YouTube occasionally changes its web structure, which may interfere with the way your app interacts with YouTube.
Here is the link; there you can also download the C# project files and view them directly.
CodeProject - Youtube downloader using C# .NET

Downloading all PDF files from a website

I need to make a Windows desktop application in C# that downloads all the PDFs from a website. I have the link to the website, but the problem I am facing is that the PDFs are not in a specific folder on the website but are scattered all over.
What I need is help finding all those links so I can download them, or any other advice that could help me with my problem.
Thanks in advance for all the help.
1. Scrape through all the pages
2. Find all the "*.pdf" URLs
3. Reconstruct them and simply download :) (see the sketch below)
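A small sketch of those three steps for a single page, assuming the PDF links appear in href attributes; relative links are reconstructed against the page URL, and crawling a whole site would wrap this in a loop over discovered pages (the URL and folder name are illustrative):

using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class PdfGrabber
{
    static void Main()
    {
        var baseUri = new Uri("http://www.example.com/somepage.html"); // illustrative
        Directory.CreateDirectory("pdfs");

        using (var client = new WebClient())
        {
            string html = client.DownloadString(baseUri);

            // Pull every href that ends in .pdf, whether absolute or relative.
            foreach (Match m in Regex.Matches(html,
                @"href\s*=\s*[""']([^""']+\.pdf)[""']", RegexOptions.IgnoreCase))
            {
                // Reconstruct relative links against the page we scraped.
                var pdfUri = new Uri(baseUri, m.Groups[1].Value);
                string fileName = Path.GetFileName(pdfUri.LocalPath);
                client.DownloadFile(pdfUri, Path.Combine("pdfs", fileName));
            }
        }
    }
}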
Please be more specific: are you trying to get all the PDFs from one HTML page or from the whole domain?
What you are trying to do is known as web scraping. There are libraries that can make your task easier; one of them is IronWebScraper, but it's a paid one.
An extensive list of NuGet packages that can be used for web scraping is available here.

How Can I Get the Files and Folders of a Specific Website, Like the IDM Grabber, in C#

If you have worked with IDM (Internet Download Manager), it has an item named Grabber that searches a specific website, gets the files and folders of that website, and lets you download them using IDM.
I would like to do something similar in C#. I would like to download HTML web pages and extract links from those pages. I would also like to detect directories and attempt to search their contents, possibly parsing "Index Of" directory listing pages.
How would I go about doing this?
Use regex, or use the HtmlAgilityPack (http://htmlagilitypack.codeplex.com/), to parse the website and find links to files. You may need to check the extension of each file, i.e. only keep links that end in .zip|.exe|.msi|.rar|.png|.pdf|.gif|.jpg|.jpeg.
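A hedged sketch of the HtmlAgilityPack route; the HtmlWeb/HtmlDocument API is the library's own, but the extension list and the way links are resolved are just one reasonable choice:

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class FileLinkGrabber
{
    private static readonly string[] Extensions =
        { ".zip", ".exe", ".msi", ".rar", ".png", ".pdf", ".gif", ".jpg", ".jpeg" };

    // Returns the absolute URLs of all links on the page that look like files.
    public static List<string> GetFileLinks(string pageUrl)
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(pageUrl);

        var links = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");

            // Resolve relative links against the page URL.
            var uri = new Uri(new Uri(pageUrl), href);
            if (Extensions.Any(ext =>
                    uri.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
            {
                links.Add(uri.AbsoluteUri);
            }
        }
        return links;
    }
}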
I once wrote a "Web Spider" to do this and published the source code over at Code Project.
If you want to do it as an end user, I found that the free HTTrack Website Copier works pretty well.
