Retrieve HTML from links on page - c#

I am using the following method to retrieve the source code from my website-
class WorkerClass1
{
public static string getSourceCode(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
sr.Close();
return sourceCode;
}
}
And then implement the WorkerClass1 as so-
private void button1_Click(object sender, EventArgs e)
{
string url = textBox1.Text;
string sourceCode = WorkerClass1.getSourceCode(url);
StreamWriter sw = new StreamWriter(@"path");
sw.Write(sourceCode);
sw.Close();
}
This works great and retrieves the HTML from my home page. However, there are links at the bottom of the page which I want to follow once the first page has been retrieved.
Is there a way I could modify my current code to do this?

Yes, of course.
What I would do is scan the HTML with a regular expression that looks for links. I would put each match in a queue or similar data structure, and then use the same method to retrieve the source behind each queued link.
Consider looking at HTMLAgilityPack for the parsing; it might be easier, although a regular expression for links should be quite simple to find using Google.
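The queue-based approach described above could be sketched roughly as follows. This is only an illustration, not a finished crawler: the class name LinkCrawler, the maxPages limit, and the href pattern are assumptions of mine, and the pattern is deliberately naive (an HTML parser is more robust). The fetch parameter is any function that returns the HTML for a URL, such as WorkerClass1.getSourceCode from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question's code.
public static class LinkCrawler
{
    // Naive pattern: captures the quoted href value of <a ...> tags.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\shref\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the absolute http/https URLs linked from the given page.
    public static List<string> ExtractLinks(string html, Uri pageUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links against the page they were found on.
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp
                    || absolute.Scheme == Uri.UriSchemeHttps))
            {
                // Drop any #fragment so the same page is not queued twice.
                links.Add(absolute.GetLeftPart(UriPartial.Query));
            }
        }
        return links;
    }

    // Breadth-first crawl: fetch a page, queue its links, repeat.
    // Stops after maxPages pages so it cannot run forever.
    public static Dictionary<string, string> Crawl(
        string startUrl, Func<string, string> fetch, int maxPages)
    {
        var pages = new Dictionary<string, string>();
        var queue = new Queue<string>();
        queue.Enqueue(new Uri(startUrl).AbsoluteUri);

        while (queue.Count > 0 && pages.Count < maxPages)
        {
            string url = queue.Dequeue();
            if (pages.ContainsKey(url)) continue; // already visited

            string html = fetch(url);
            pages[url] = html;

            foreach (string link in ExtractLinks(html, new Uri(url)))
            {
                if (!pages.ContainsKey(link)) queue.Enqueue(link);
            }
        }
        return pages;
    }
}
```

From the button handler you could then call LinkCrawler.Crawl(textBox1.Text, WorkerClass1.getSourceCode, 20) and write each returned page to its own file. Note that this follows every link it finds, including links to other sites, so you may want to filter by host first.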

Related

read a string from one point to another c#

I have a problem with string reading, I will explain the problem:
I have this code to read a web page and put it in a string:
System.Net.WebRequest request = System.Net.WebRequest.Create(textBox1.Text);
using (System.Net.WebResponse response = request.GetResponse())
{
using (System.IO.Stream stream = response.GetResponseStream())
{
using (StreamReader sr = new StreamReader(stream))
{
html = sr.ReadToEnd();
}
}
}
Now I would like to take only some parts of this string. How can I do that? If I use Substring, it doesn't return the pieces I selected.
Example of a substring code:
Name = html.Substring((html.IndexOf("og:title")+19), (html.Substring(html.IndexOf("og:title") +19).FirstOrDefault(x=> x== '>')));
I would like it to start after the "og:title" and stop at the '>', but it doesn't work.
The result is, for example:
"Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" conte"
It is easier if you use a library to do it, for example you can take a look at this
If I understood what you want, your code should look like the following:
static void Main(string[] args)
{
const string startingToken = "og:title\"";
const string endingToken = "\">";
var html = "<html><meta property=\"og:title\" Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" content></html>";
var indexWhereOgTitleBegins = html.IndexOf(startingToken);
var htmlTrimmedHead = html.Substring(indexWhereOgTitleBegins + startingToken.Length);
var indexOfTheEndingToken = htmlTrimmedHead.IndexOf(endingToken);
var parsedText = htmlTrimmedHead.Substring(0, indexOfTheEndingToken).TrimStart(' ').TrimEnd(' ');
Console.WriteLine(parsedText);
}
Note that you can also use regular expressions to achieve the same in fewer lines of code, but regular expressions are not always easy to manage.
Take a look at this answer:
Parsing HTML String
Your question title is probably not correct, because it looks more specific to HTML parsing.
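For comparison, the regular-expression variant mentioned above might look like the sketch below. It assumes the real page contains a tag of the form <meta property="og:title" content="..."> with the attributes in exactly that order, which is my guess based on the +19 offset in the question; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class OgTitleParser
{
    // Returns the content of the og:title meta tag, or null if it is not found.
    // Assumes property="og:title" comes directly before content="...".
    public static string GetOgTitle(string html)
    {
        Match m = Regex.Match(
            html,
            "<meta\\s+property=\"og:title\"\\s+content=\"([^\"]*)\"",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling OgTitleParser.GetOgTitle(html) returns only the text between the quotes of the content attribute. If the attribute order varies between pages, an HTML parser is the safer choice.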

How to get only files from entire html read in a c# console app?

I need to get every single file from a URL so then I can iterate over them.
The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.
Here is the code I have done so far
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
namespace Example
{
public class MyExample
{
public static void Main(String[] args)
{
string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
string html = reader.ReadToEnd();
Console.WriteLine(html);
}
}
Console.ReadLine();
}
}
}
This returns the entire HTML of the URL. However, I just need the files (all images) so I can work with them as I expect.
Any idea how to achieve this?
I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.
Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))
Build your regex object, iterate over the matches, and download each image:
var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // this is your html string
foreach (System.Text.RegularExpressions.Match match in matches) {
// Group 1 holds the path captured between the quotes
var imagePath = match.Groups[1].Value;
Console.WriteLine(imagePath);
}
Now, concatenate the base url https://www.paz.cl with the image relative path obtained above, issue another request to that url to download the image and process it as you wish.
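That last step could be sketched as below. Treat it as an outline under assumptions: the class and method names are mine, the target folder is whatever you choose, and WebClient is used only because it matches the rest of this page (newer .NET versions mark it obsolete in favour of HttpClient).

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper for illustration only.
public static class ImageDownloader
{
    // Combines the site's base URL with a relative path taken from the listing.
    public static Uri ResolveImageUrl(string baseUrl, string imagePath)
    {
        return new Uri(new Uri(baseUrl), imagePath);
    }

    // Downloads one image into targetFolder and returns the local file path.
    public static string DownloadImage(Uri imageUrl, string targetFolder)
    {
        string fileName = Path.GetFileName(imageUrl.LocalPath);
        string targetPath = Path.Combine(targetFolder, fileName);
        using (var client = new WebClient())
        {
            client.DownloadFile(imageUrl, targetPath);
        }
        return targetPath;
    }
}
```

Inside the foreach loop you would call ResolveImageUrl("https://www.paz.cl", imagePath), pass the result to DownloadImage, and then hand the returned file path to ImageMagick.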
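You can use the HTML Agility Pack, for example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// Select only anchors that have an href attribute
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches
if (htmlNodes != null)
{
foreach (var node in htmlNodes)
{
Console.WriteLine(node.Attributes["href"].Value);
}
}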
You can use AngleSharp to load and parse the html page. Then you can extract all the information you need.
// TODO add a reference to NuGet package AngleSharp
// Needs: using System.Linq; using System.Threading.Tasks; using AngleSharp;
private static async Task Main(string[] args)
{
var config = Configuration.Default.WithDefaultLoader();
var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
var images = document.Images.Select(img=>img.Source);
}
AngleSharp implements the W3C standard, so it tends to work better than HTMLAgilityPack on real-world web pages.

Trying to download string and can't search in it context

I am using WebClient to download the HTML of a website as a string, and then I try to manipulate the string using Substring and IndexOf.
When I use Substring, IndexOf or Contains, something strange sometimes happens:
sometimes it shows text (HTML code) and sometimes it doesn't show anything at all.
using (WebClient client = new WebClient())
{
htmlCode = client.DownloadString("https://www.google.com");
}
This is my code for getting the HTML code from a website.
Now, for example, on this site I want to get the source of a specific img (or another attribute):
using (StringReader reader = new StringReader(htmlCode))
{
string inputLine;
while ((inputLine = reader.ReadLine()) != null)
{
if (inputLine.Contains("img"))
{
RichTextBox.Text += inputLine;
}
}
}
There may be some syntax problems, but please ignore them; they are not important.
Do you have an alternative or better way to get the HTML source code of a page and work with it? It has to be an HTTPS site, and I would like a good explanation of it.
Sorry for the noob question.

How to open txt file on localhost and change is content

I want to open a CSS file using C# 4.5 and change only one file at a time.
Doing it like this gives me the exception "URI formats are not supported".
What is the most effective way to do it?
Can I find the line and replace it without reading the whole file?
Can I find the line that I am looking for and then start inserting text until
the cursor is pointing at some character?
public void ChangeColor()
{
string text = File.ReadAllText("http://localhost:8080/game/Css/style.css");
text = text.Replace("class='replace'", "new value");
File.WriteAllText("D://p.htm", text);
}
I believe File.ReadAllText is expecting a file path, not a URL.
No, you cannot search/replace sections of a text file without reading and re-writing the whole file. It's just a text file, not a database.
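If the stylesheet is reachable as a local file path rather than a URL, the read-and-rewrite approach could look like this minimal sketch. The class and method names are invented for illustration, and it assumes the file is small enough to hold in memory.

```csharp
using System.IO;

// Hypothetical helper for illustration only.
public static class CssEditor
{
    // Replaces oldValue with newValue on every line that contains it,
    // rewrites the whole file, and returns the number of lines changed.
    public static int ReplaceInFile(string path, string oldValue, string newValue)
    {
        string[] lines = File.ReadAllLines(path);
        int changed = 0;
        for (int i = 0; i < lines.Length; i++)
        {
            if (lines[i].Contains(oldValue))
            {
                lines[i] = lines[i].Replace(oldValue, newValue);
                changed++;
            }
        }
        // Only touch the file when something actually changed.
        if (changed > 0) File.WriteAllLines(path, lines);
        return changed;
    }
}
```

The path argument must be a file system path (for example the physical location of style.css on the server), not the http://localhost address.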
The most effective way to do it is to declare any control whose CSS you want to alter as "runat=server" and then modify its CssClass property. There is no known alternative way to modify the CSS file directly. Anything else is just a hack, and a very inefficient way to do it.
As mentioned before, File.ReadAllText does not support URLs. The following is a working example with WebRequest:
{
Uri uri = new Uri("http://localhost:8080/game/Css/style.css");
WebRequest req = WebRequest.Create(uri);
WebResponse web = req.GetResponse();
Stream stream = web.GetResponseStream();
string content = string.Empty;
using (StreamReader sr = new StreamReader(stream))
{
content = sr.ReadToEnd();
}
// Strings are immutable: Replace returns a new string, so assign the result
content = content.Replace("class='replace'", "new value");
using (StreamWriter sw = new StreamWriter("D://p.htm"))
{
sw.Write(content);
sw.Flush();
}
}

A problem parsing a HTML tag with HTML Agility Pack C#

This seems like it should be an easy thing to do, but I am having some major issues with it. I am trying to parse for a specific tag with the HAP. I used Firebug to find the XPath I want and came up with //*[@id="atfResults"]. I believe my issue is with the " characters, since they signal the start and end of a string. I have tried making it a literal string, but I get errors. I have attached the function:
public List<string> GetHtmlPage(string strURL)
{
// the html retrieved from the page
WebResponse objResponse;
WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
objResponse = objRequest.GetResponse();
// the using keyword will automatically dispose the object
// once complete
using (StreamReader sr =
new StreamReader(objResponse.GetResponseStream()))
{//*[@id="atfResults"]
string strContent = sr.ReadToEnd();
// Close and clean up the StreamReader
sr.Close();
/*Regex regex = new Regex("<body>((.|\n)*?)</body>", RegexOptions.IgnoreCase);
//Here we apply our regular expression to our string using the
//Match object.
Match oM = regex.Match(strContent);
Result = oM.Value;*/
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(strContent));
HtmlNode root = doc.DocumentNode;
List<string> itemTags = new List<string>();
string listingtag = "//*[@id="atfResults"]";
foreach (HtmlNode link in root.SelectNodes(listingtag))
{
string att = link.OuterHtml;
itemTags.Add(att);
}
return itemTags;
}
}
You can escape it:
string listingtag = "//*[@id=\"atfResults\"]";
If you wanted to use a verbatim string, it would be:
string listingtag = @"//*[@id=""atfResults""]";
As you can see, verbatim strings don't really provide a benefit here.
However, you can instead use:
HtmlNode link = doc.GetElementbyId("atfResults");
This will also be slightly faster.
Have you tried this:
string listingtag = "//*[@id='atfResults']";
