How to extract all http-links from a website? - C#

I have a task to write a program in C# that finds all http-links on a website. So far I've written the following function for it:
async static void DownloadWebPage(string url)
{
    using (HttpClient client = new HttpClient())
    using (HttpResponseMessage response = await client.GetAsync(url))
    using (HttpContent content = response.Content)
    {
        string[] resArr;
        string result = await content.ReadAsStringAsync();
        resArr = result.Split(new string[] { "href" }, StringSplitOptions.RemoveEmptyEntries); // splitting
        // here must be some code which finds all necessary http-links in resArr
        Console.WriteLine("Main page of " + url + " size = " + result.Length.ToString());
    }
}
Using this function I load the web page content into a string, split that string into an array using "href" as the separator, and then check each array element for the substring that contains the http-link. The problem starts when the string is split: I can't find the http-links, which I think is due to the content format of the string. How can I fix it?
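For clarity, the splitting-and-filtering step described above might be sketched like this (a reconstruction of the intent, not code from the question):

// Rough reconstruction of the missing step: look for "http" in each piece
// produced by the "href" split and cut at the closing quote (an assumption
// about the markup; real pages are messier).
foreach (var piece in resArr)
{
    var start = piece.IndexOf("http");
    if (start >= 0)
    {
        var end = piece.IndexOf('"', start + 1);
        var link = end > start ? piece.Substring(start, end - start) : piece.Substring(start);
        Console.WriteLine(link);
    }
}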

I once did something similar. My solution was to change the HTML so that it conformed to the XML rules.
(This could be the problem with this solution: I believe my HTML was to some extent predefined, so I only had to change a few things that I knew were not XML-conformant.)
After that you can simply search the "a" nodes and read the href attribute.
Unfortunately I can't find my code anymore; it was too long ago.
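A minimal sketch of that idea, assuming the HTML has already been made XML-conformant (the snippet and class name here are made up for illustration):

using System;
using System.Linq;
using System.Xml.Linq;

class HrefFromXhtmlSketch
{
    static void Main()
    {
        // Assume the page has already been cleaned up into valid XML.
        var xhtml = "<html><body><a href=\"http://example.com/a\">A</a><a href=\"http://example.com/b\">B</a></body></html>";

        var doc = XDocument.Parse(xhtml);

        // Search the "a" nodes and read the href attribute, keeping only http-links.
        var links = doc.Descendants("a")
                       .Select(a => (string)a.Attribute("href"))
                       .Where(href => href != null && href.StartsWith("http"));

        foreach (var link in links)
            Console.WriteLine(link);
    }
}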

Related

read a string from one point to another c#

I have a problem with string reading; let me explain it.
I have this code to read a web page and put it into a string:
System.Net.WebRequest request = System.Net.WebRequest.Create(textBox1.Text);
using (System.Net.WebResponse response = request.GetResponse())
{
    using (System.IO.Stream stream = response.GetResponseStream())
    {
        using (StreamReader sr = new StreamReader(stream))
        {
            html = sr.ReadToEnd();
        }
    }
}
Now I would like to take only some parts of this string. How can I do that? If I use Substring it doesn't take the pieces I selected.
Example of the Substring code:
Name = html.Substring((html.IndexOf("og:title")+19), (html.Substring(html.IndexOf("og:title") +19).FirstOrDefault(x=> x== '>')));
I would like it to start after "og:title" and stop at the '>', but it doesn't work.
The result is, for example:
"Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" conte"
It is easier if you use a library to do it; for example, you can take a look at this.
Your code, if I understood what you want, should look like the following:
static void Main(string[] args)
{
    const string startingToken = "og:title\"";
    const string endingToken = "\">";

    var html = "<html><meta property=\"og:title\" Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" content></html>";

    var indexWhereOgTitleBegins = html.IndexOf(startingToken);
    var htmlTrimmedHead = html.Substring(indexWhereOgTitleBegins + startingToken.Length);
    var indexOfTheEndingToken = htmlTrimmedHead.IndexOf(endingToken);
    var parsedText = htmlTrimmedHead.Substring(0, indexOfTheEndingToken).TrimStart(' ').TrimEnd(' ');

    Console.WriteLine(parsedText);
}
Note that you can also use regular expressions to achieve the same thing in fewer lines of code, but regexes are not always easy to manage.
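For instance, with the same sample html string, a regex version might look like this (just a sketch; the pattern assumes the value ends at the first "> in the tag):

using System;
using System.Text.RegularExpressions;

class OgTitleRegexSketch
{
    static void Main()
    {
        var html = "<html><meta property=\"og:title\" Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" content></html>";

        // Capture everything between og:title" and the next "> in the tag.
        var match = Regex.Match(html, "og:title\"\\s*(.*?)\">");
        if (match.Success)
            Console.WriteLine(match.Groups[1].Value.Trim());
    }
}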
Take a look at this answer:
Parsing HTML String
Your question title is probably not accurate; the question is really about HTML parsing.

How to get only files from entire html read in a c# console app?

I need to get every single file from a URL so that I can then iterate over them.
The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.
Here is the code I have so far:
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Example
{
    public class MyExample
    {
        public static void Main(String[] args)
        {
            string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine(html);
                }
            }
            Console.ReadLine();
        }
    }
}
This returns the entire HTML of the URL. However, I just need the files (all the images) so I can work with them as I expect.
Any idea how to achieve this?
I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.
Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))
Build your regex object, iterate over the matches, and download each image:
var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // this is your html string
foreach (var match in matches)
{
    var imagePath = match.ToString().Substring("HREF=\"".Length);
    Console.WriteLine(imagePath);
}
Now, concatenate the base url https://www.paz.cl with the image relative path obtained above, issue another request to that url to download the image and process it as you wish.
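Continuing from the matches above, the download step could look roughly like this (a sketch; the file naming and the use of the directory URL as the base are assumptions):

// Sketch of the download step: resolve each captured path against the page URL
// and save the file next to the executable (naming is an assumption).
var baseUri = new Uri("https://www.paz.cl/imagenes_cotizador/BannerPrincipal/");
using (var client = new System.Net.WebClient())
{
    foreach (System.Text.RegularExpressions.Match match in matches)
    {
        var imagePath = match.Groups[1].Value;
        var imageUri = new Uri(baseUri, imagePath);
        var fileName = System.IO.Path.GetFileName(imageUri.LocalPath);
        client.DownloadFile(imageUri, fileName);
    }
}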
You can use the HTML Agility Pack, for example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.Attributes["href"].Value);
}
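For this particular question you would then keep only the image links and resolve them against the page URL, roughly like this (the extension filter is an assumption):

foreach (var node in htmlNodes)
{
    var href = node.Attributes["href"].Value;
    // Keep only images (assumed extensions) and build an absolute URL.
    if (href.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase) ||
        href.EndsWith(".png", StringComparison.OrdinalIgnoreCase))
    {
        var absolute = new Uri(new Uri("https://www.paz.cl/imagenes_cotizador/BannerPrincipal/"), href);
        Console.WriteLine(absolute);
    }
}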
You can use AngleSharp to load and parse the html page. Then you can extract all the information you need.
// TODO add a reference to NuGet package AngleSharp
private static async Task Main(string[] args)
{
    var config = Configuration.Default.WithDefaultLoader();
    var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(address);
    var images = document.Images.Select(img => img.Source);
}
AngleSharp implements the W3C standards, so it works better than HtmlAgilityPack on real-world web pages.
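From there, downloading each image for the ImageMagick step might look roughly like this (a sketch that continues inside the async Main above; saving next to the executable is an assumption):

// Download each image source collected by AngleSharp.
using (var http = new HttpClient())
{
    foreach (var src in images)
    {
        var bytes = await http.GetByteArrayAsync(src);
        var fileName = Path.GetFileName(new Uri(src).LocalPath);
        File.WriteAllBytes(fileName, bytes);
    }
}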

Method HttpContent.ReadAsStringAsync() returns invalid html code. How can that be?

Here is some code to request the html code of a URL:
var currentUrl = "https://www.google.com";
HttpClient client = new HttpClient();

System.Diagnostics.Process.Start(currentUrl); // here I open a browser with the same URL

var response = await client.GetAsync(currentUrl);
string sourse = null;
if (response != null && response.StatusCode == HttpStatusCode.OK)
{
    sourse = await response.Content.ReadAsStringAsync(); // here I get the html code
}
So, the question is: why are the html code I get from the program and the html code of the real page I opened in the browser a few seconds earlier different? It doesn't make sense.
Also, here is some proof. The first image shows the number of characters in the browser html code
... and in the program html code.
I gave the simplest evidence to make the question easier.
But if I dig deeper, the html code seems to come out of nowhere. When I parse a specific page that should contain 39 products, the program returns html code with only 6 products (which, incidentally, are not among the 39 that are actually on the page in the browser). That is why I asked the question so simply.
Really, I made a new project a minute ago with this code, and it works incorrectly as I wrote above. To get the code that the program returns, I can look it up in the sourse variable or dump it to a file and then compare, like:
using (FileStream fs = new FileStream("report.txt", FileMode.OpenOrCreate))
using (StreamWriter SW = new StreamWriter(fs))
{
    SW.WriteLine(sourse);
}

Trying to download a string and can't search in its content

I am using WebClient to download the html string from a website, and then I try to manipulate the string using Substring and IndexOf.
Also, sometimes when I use the functions Substring, IndexOf or Contains a strange thing happens:
sometimes it shows a text (HTML code) and sometimes it doesn't show anything at all.
using (WebClient client = new WebClient())
{
    htmlCode = client.DownloadString("https://www.google.com");
}
This is my code for getting the html code from a website.
Now, for example, on this site I want to get the source of an image - a specific img (or another attribute).
using (StringReader reader = new StringReader(htmlCode))
{
    string inputLine;
    while ((inputLine = reader.ReadLine()) != null)
    {
        if (inputLine.Contains("img"))
        {
            RichTextBox.Text += inputLine;
        }
    }
}
There may be some syntax problems, but don't look at them; they are not important.
Do you have an alternative or better way to get the HTML source code of a page and work with it? It has to work with an HTTPS site, and I would like a good explanation of it.
Sorry for the noob question.
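As in the answers above, an HTML parser is usually sturdier than line-by-line string searching; a minimal HtmlAgilityPack sketch for pulling every img src from the downloaded string might look like this (the library choice is mine, not the OP's):

// Sketch: parse the downloaded HTML and read every img src attribute.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);

var imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
if (imgNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var img in imgNodes)
    {
        Console.WriteLine(img.Attributes["src"].Value);
    }
}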

How to open a txt file on localhost and change its content

I want to open a CSS file using C# 4.5 and change only one file at a time.
Doing it like this gives me the exception "URI formats are not supported".
What is the most effective way to do it?
Can I find the line and replace it without reading the whole file?
Can I find the line that I am looking for and then start to insert text until the cursor is pointing at some char?
public void ChangeColor()
{
    string text = File.ReadAllText("http://localhost:8080/game/Css/style.css");
    text = text.Replace("class='replace'", "new value");
    File.WriteAllText("D://p.htm", text);
}
I believe File.ReadAllText is expecting a file path, not a URL.
No, you cannot search/replace sections of a text file without reading and re-writing the whole file. It's just a text file, not a database.
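A local-file version of the read-modify-write the question is attempting could look like this (the path is made up):

// Read the whole file, replace the text, and write it all back (path is hypothetical).
var path = @"C:\sites\game\Css\style.css";
var text = File.ReadAllText(path);
text = text.Replace("class='replace'", "new value");
File.WriteAllText(path, text);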
The most effective way to do it is to declare any control whose CSS you want to alter with runat="server" and then modify its CssClass property. There is no good way to modify the CSS file directly; any other approach is just a hack and a very inefficient way to do it.
As mentioned before, File.ReadAllText does not support URLs. The following is a working example with WebRequest:
{
    Uri uri = new Uri("http://localhost:8080/game/Css/style.css");
    WebRequest req = WebRequest.Create(uri);
    WebResponse web = req.GetResponse();
    Stream stream = web.GetResponseStream();

    string content = string.Empty;
    using (StreamReader sr = new StreamReader(stream))
    {
        content = sr.ReadToEnd();
    }

    // String.Replace returns a new string, so the result must be assigned back.
    content = content.Replace("class='replace'", "new value");

    using (StreamWriter sw = new StreamWriter("D://p.htm"))
    {
        sw.Write(content);
        sw.Flush();
    }
}
