make script to download all Mp3 files from a page - c#

I have a page that contains links to .mp3/.wav files in this format:
File Name
What I need is a script that will download all of these files instead of me downloading them myself.
I know I could use a regular expression to do something like that, but I don't know how, and which is the best choice for the job (Java, C#, JavaScript)?
Any help will be appreciated.
Thanks in advance.

You could use SgmlReader to parse the DOM and extract all the anchor links and then download the corresponding resources:
using System;
using System.Net;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";
            reader.Href = "http://www.example.com";

            // load the page into an XmlDocument via SgmlReader
            var doc = new XmlDocument();
            doc.Load(reader);

            // select the href attributes of anchors that point to mp3/wav files
            var anchors = doc.SelectNodes("//a/@href[contains(., 'mp3') or contains(., 'wav')]");
            foreach (XmlAttribute href in anchors)
            {
                using (var client = new WebClient())
                {
                    var data = client.DownloadData(href.Value);
                    // TODO: do something with the downloaded data, e.g. (assuming absolute urls)
                    // File.WriteAllBytes(Path.GetFileName(new Uri(href.Value).LocalPath), data);
                }
            }
        }
    }
}

Well, if you want to go hard-core, I think parsing the page with DOMDocument (http://php.net/manual/en/class.domdocument.php) and retrieving the files with cURL would do it, if you're OK with PHP.
How many files are we talking about here?

Python's Beautiful Soup library is well-suited to this task:
http://www.crummy.com/software/BeautifulSoup/
Could be used in this way:
import urllib2, re
from BeautifulSoup import BeautifulSoup

# open the URL
page = urllib2.urlopen("http://www.foo.com")
# parse the page
soup = BeautifulSoup(page)
# get all anchor elements
anchors = soup.findAll("a")
# keep only anchors whose href points to a .wav or .mp3 file
filteredAnchors = filter(lambda a: re.search(r"\.wav", a["href"]) or re.search(r"\.mp3", a["href"]), anchors)
urlsToDownload = map(lambda a: a["href"], filteredAnchors)
# download each url (this assumes the hrefs are absolute)
for url in urlsToDownload:
    data = urllib2.urlopen(url).read()
    with open(url.split("/")[-1], "wb") as f:
        f.write(data)
See here for instructions on downloading the mp3's from their URLs: How do I download a file over HTTP using Python?

Related

How to get only files from entire html read in a c# console app?

I need to get every single file from a URL so that I can then iterate over them.
The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.
Here is the code I have so far:
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Example
{
    public class MyExample
    {
        public static void Main(String[] args)
        {
            string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine(html);
                }
            }
            Console.ReadLine();
        }
    }
}
This returns the entire HTML of the URL. However, I just need the files (all the images) so I can work with them as I expect.
Any idea how to achieve this?
I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.
Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))
Build your regex object, iterate over the matches, and download each image:
var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // html is the page source you already downloaded
foreach (System.Text.RegularExpressions.Match match in matches)
{
    // capture group 1 holds just the image path, without the leading HREF="
    var imagePath = match.Groups[1].Value;
    Console.WriteLine(imagePath);
}
Now, concatenate the base url https://www.paz.cl with the image relative path obtained above, issue another request to that url to download the image and process it as you wish.
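A minimal sketch of that last step, assuming the captured paths are relative to the directory listing (the base URI and file handling below are illustrative):
// minimal sketch: resolve each captured path against the base url and download it
// (requires System, System.IO and System.Net)
var baseUri = new Uri("https://www.paz.cl/imagenes_cotizador/BannerPrincipal/");
using (var client = new WebClient())
{
    foreach (System.Text.RegularExpressions.Match match in matches)
    {
        var imageUri = new Uri(baseUri, match.Groups[1].Value);   // absolute url of the image
        var localFile = Path.GetFileName(imageUri.LocalPath);     // e.g. "banner1.jpg"
        client.DownloadFile(imageUri.AbsoluteUri, localFile);     // saved to the working directory
    }
}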
You can use the HTML Agility Pack, for example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// SelectNodes returns null when nothing matches
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.Attributes["href"].Value);
}
You can use AngleSharp to load and parse the HTML page, and then extract all the information you need.
// TODO: add a reference to the NuGet package AngleSharp
private static async Task Main(string[] args)
{
    var config = Configuration.Default.WithDefaultLoader();
    var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(address);

    // the source url of every <img> element on the page
    var images = document.Images.Select(img => img.Source);
}
AngleSharp implements the W3C standards, so it works better than HtmlAgilityPack on real-world web pages.
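If you also need to save those images to disk, here is a minimal follow-up sketch (not part of AngleSharp itself; it assumes the extracted URLs are absolute and uses HttpClient from System.Net.Http):
// sketch: download each extracted image url (assumes absolute urls)
using (var http = new HttpClient())
{
    foreach (var src in images)
    {
        var bytes = await http.GetByteArrayAsync(src);
        var fileName = Path.GetFileName(new Uri(src).LocalPath); // e.g. "banner1.jpg"
        File.WriteAllBytes(fileName, bytes);                     // saved to the working directory
    }
}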

How to use Html Agility Pack to pick out the specific text?

Below is an example portion of a block of the Html that I am trying to extract information from:
<a href="https://secure.tibia.com/community/?subtopic=characters&name=Alemao+Golpista" >Alemao Golpista</a></td><td style="width:10%;" >51</td><td style="width:20%;" >Knight</td></tr><tr class="Even" style="text-align:right;" ><td style="width:70%;text-align:left;" >
I am basically grabbing the entire HTML, which is a list of players online, and trying to append them to a list with their Name (Alemao Golpista), Level (51), and Vocation (Knight).
Using regex for it is a pain in the ass and pretty slow. How would I go about it using the Agility Pack?
Don't ever use regex to parse HTML files. As has already been stated, you should use whatever HtmlAgilityPack examples you can find, even though they are scarce on its site and the documentation isn't easy to find.
To get you started, here is how you can load an HtmlDocument and get the anchor tags' href attributes.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    // url is the address of the page you want to load
    var temp = new Uri(url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

// collect the href attribute of every anchor on the page
HtmlNodeCollection c = htmlDoc.DocumentNode.SelectNodes("//a");
List<string> urls = new List<string>();
foreach (HtmlNode n in c)
{
    urls.Add(n.GetAttributeValue("href", ""));
}
The above code gets you all the links of the web page in a list of strings.
You should look into XPath, and you should also get the HAP documentation and read it. I couldn't find the documentation anywhere, so I uploaded the copy I already had on my computer.
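For the name/level/vocation data specifically, here is a rough sketch of the XPath side, assuming each player row looks like the snippet in the question (name anchor in the first cell, level and vocation in the next two cells):
// rough sketch: extract name, level and vocation from each player row
// (assumes the table layout shown in the question)
var rows = htmlDoc.DocumentNode.SelectNodes("//tr[td/a]");
if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        var cells = row.SelectNodes("./td");
        if (cells == null || cells.Count < 3) continue;

        var nameNode = cells[0].SelectSingleNode(".//a");
        if (nameNode == null) continue;

        string name = nameNode.InnerText.Trim();
        string level = cells[1].InnerText.Trim();
        string vocation = cells[2].InnerText.Trim();
        Console.WriteLine("{0} - {1} - {2}", name, level, vocation);
    }
}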

Parse Html Document Get All input fields with ID and Value

I have several thousand HTML invoices generated by ASP.NET (messy HTML) that I'm trying to parse and save into a database.
Basically something like:
foreach (var htmlDoc in HtmlFolder)
{
    foreach (var inputBox in htmlDoc)
    {
        // Make collection of IDs and Values, insert into DB
    }
}
From all the other questions I've read, the best tool for this type of problem is the HtmlAgilityPack; however, for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack?
Thanks in advance
A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc);  // load and parse the file
var fields = doc["input"];                     // get the input fields with a CSS selector
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value())); // get the id/value pairs
To get the CHM to work, you probably need to view the file's properties in Windows Explorer and unblock it (the "Unblock" option near the bottom of the Properties dialog).
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
// import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
var nodes = doc.DocumentNode
    .SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------

// looping through the nodes:
// the XPath interface can return null when no nodes are found
if (nodes != null)
{
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}
The easiest way to add the HtmlAgility Pack is using NuGet:
PM> Install-Package HtmlAgilityPack
Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (that's built on top of HtmlAgility pack by the way!) : https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument { OptionCheckSyntax = false }; // avoid exceptions when the html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument);     // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag

// Find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
    Console.WriteLine(inputField["type"]);
    Console.WriteLine(inputField["value"]);
    if (inputField.HasAttribute("id"))
        Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'
You can download the HtmlAgilityPack documentation CHM file from here.
If the CHM file's contents are not visible, un-check the "Always ask before opening this file" check-box in the warning dialog shown when you open the file.
Note: the above dialog appears for unsigned files.
Source: HtmlAgilityPack Documentation

how to find image tags in source code using C#?

I am working with C#. I have the content of a web page stored in a single variable, and a text box where I can paste any URL to show the full source code of that link. Now I want to find all the image tags (where each one begins and where it ends), and I would also like to merge everything except the image tags.
Can anyone tell me how to do this?
Assuming you want to parse the content server-side, you can use the Html Agility Pack.
See this question.
Try this:
// doc is an HtmlAgilityPack HtmlDocument already loaded with the page source
var images = doc.DocumentNode.SelectNodes("//img");
if (images != null)
{
    foreach (HtmlNode image in images)
    {
        // replace each <img> element with a text node containing its alt text
        var alt = image.GetAttributeValue("alt", "");
        var nodeForReplace = HtmlTextNode.CreateNode(alt);
        image.ParentNode.ReplaceChild(nodeForReplace, image);
    }
}

var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
    doc.Save(writer);
}

How could I download a file like .doc or .pdf from internet to my harddrive using c#

How could I download a file like a .doc or .pdf from the internet to my hard drive using C#?
using (var client = new System.Net.WebClient())
{
    client.DownloadFile("url", "localFilename");
}
The simplest way is to use WebClient.DownloadFile.
Use the WebClient class:
using (WebClient wc = new WebClient())
    wc.DownloadFile("http://a.com/foo.pdf", @"D:\foo.pdf");
Edit based on comments:
Based on your comments, I think what you are trying to do is download e.g. the PDF files that are linked to from an HTML page. In that case you can:
1. Download the page (with WebClient, see above).
2. Use the HtmlAgilityPack to find all the links within the page that point to PDF files.
3. Download the PDF files (a rough sketch of this follows below).
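A rough sketch of those three steps, assuming the page links to the PDFs with absolute URLs (the URL below is illustrative; relative links would need to be resolved against the page's base URI first):
// rough sketch of the three steps above (requires HtmlAgilityPack and System.Linq)
using (var wc = new WebClient())
{
    // 1. download the page
    string pageHtml = wc.DownloadString("http://example.com/documents.html"); // illustrative url

    // 2. find the links that point to pdf files
    var doc = new HtmlDocument();
    doc.LoadHtml(pageHtml);
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links == null) return;

    var pdfLinks = links
        .Select(a => a.Attributes["href"].Value)
        .Where(href => href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));

    // 3. download each pdf into the working directory
    foreach (var href in pdfLinks)
    {
        wc.DownloadFile(href, Path.GetFileName(new Uri(href).LocalPath));
    }
}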
I am developing a crawler where, if I specify a keyword (e.g. SHA algorithm) and select the option .pdf or .doc, the crawler should download files of the selected format into a target folder.
Based on your clarification this is a solution using google to get the results of the search:
DownloadSearchHits("SHA", "pdf");
...
public static void DownloadSearchHits(string searchTerm, string fileType)
{
using (WebClient wc = new WebClient())
{
string html = wc.DownloadString(string.Format("http://www.google.com/search?q={0}+filetype%3A{1}", searchTerm, fileType));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var pdfLinks = doc.DocumentNode
.SelectNodes("//a")
.Where(link => link.Attributes["href"] != null
&& link.Attributes["href"].Value.EndsWith(".pdf"))
.Select(link => link.Attributes["href"].Value)
.ToList();
int index = 0;
foreach (string pdfUrl in pdfLinks)
{
wc.DownloadFile(pdfUrl,
string.Format(#"C:\download\{0}.{1}",
index++,
fileType));
}
}
}
In general, though, you should ask a question about a particular problem you have with an implementation you already have; based on your question, you are a long way from being able to implement a standalone crawler.
Use WebClient.DownloadFile() from System.Net
Using WebClient.DownloadFile:
http://msdn.microsoft.com/en-us/library/system.net.webclient.downloadfile.aspx
using (var client = new WebClient())
{
    // DownloadFile returns void; it saves the resource directly to the given path
    client.DownloadFile(url, filename);
}
