Read a string from one point to another in C#

I have a problem with string reading, which I will explain.
I have this code to read a web page into a string:
System.Net.WebRequest request = System.Net.WebRequest.Create(textBox1.Text);
using (System.Net.WebResponse response = request.GetResponse())
{
    using (System.IO.Stream stream = response.GetResponseStream())
    {
        using (StreamReader sr = new StreamReader(stream))
        {
            html = sr.ReadToEnd();
        }
    }
}
Now I would like to take only some parts of this string. How can I do that? When I use Substring it doesn't extract the pieces I select.
Example of a substring code:
Name = html.Substring((html.IndexOf("og:title")+19), (html.Substring(html.IndexOf("og:title") +19).FirstOrDefault(x=> x== '>')));
I would like it to start after "og:title" and stop at the '>', but it doesn't work.
The result is example:
"Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" conte"

It is easier if you use a library to do it; for example you can take a look at this.
The problem with your code is that the second argument you pass to Substring is the character '>' itself (returned by FirstOrDefault), which is implicitly converted to its numeric character code rather than an index or length, so you get the wrong slice. If I understood what you want, your code should look like the following:
static void Main(string[] args)
{
    const string startingToken = "og:title\"";
    const string endingToken = "\">";
    var html = "<html><meta property=\"og:title\" Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" content></html>";
    var indexWhereOgTitleBegins = html.IndexOf(startingToken);
    var htmlTrimmedHead = html.Substring(indexWhereOgTitleBegins + startingToken.Length);
    var indexOfTheEndingToken = htmlTrimmedHead.IndexOf(endingToken);
    var parsedText = htmlTrimmedHead.Substring(0, indexOfTheEndingToken).Trim(' ');
    Console.WriteLine(parsedText);
}
Note that you can also achieve the same with regular expressions in fewer lines of code, but regexes are not always easy to manage.
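As a sketch of that regex alternative, using the same sample markup as the snippet above (the pattern is an assumption tailored to this input shape, not a general og:title parser):

```csharp
using System;
using System.Text.RegularExpressions;

class OgTitleRegexSketch
{
    static void Main()
    {
        var html = "<html><meta property=\"og:title\" Valchiria “Intera” Pendragon\">\n<meta property=\"og:image\" content></html>";
        // capture everything between og:title" and the next ">
        var match = Regex.Match(html, "og:title\"\\s*(.*?)\">");
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value); // Valchiria “Intera” Pendragon
        }
    }
}
```

The lazy `(.*?)` stops at the first `">` it reaches, which is what makes this terminate at the end of the attribute rather than at the end of the line.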
Take a look at this answer:
Parsing HTML String
Your question title is probably not accurate, because this looks more like an HTML-parsing question.

Related

How to get only files from entire html read in a c# console app?

I need to get every single file from a URL so then I can iterate over them.
The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.
Here is the code I have done so far
using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Example
{
    public class MyExample
    {
        public static void Main(String[] args)
        {
            string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine(html);
                }
            }
            Console.ReadLine();
        }
    }
}
This returns the entire HTML of the URL. However, I just need the files (all images) so I can work with them as I expect.
Any idea how to achieve this?
I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.
Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))
Build your regex object, iterate over the matches, and download each image:
var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // html is the string you already read
foreach (System.Text.RegularExpressions.Match match in matches)
{
    // Groups[1] is just the captured path, so there is no HREF=" prefix to strip
    var imagePath = match.Groups[1].Value;
    Console.WriteLine(imagePath);
}
Now concatenate the base URL https://www.paz.cl with the relative image path obtained above, issue another request to that URL to download the image, and process it as you wish.
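A minimal sketch of that last step. The stub html line stands in for the real page source, and the actual download call is left commented out so the snippet runs offline:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class ImageLinkSketch
{
    static void Main()
    {
        // stub standing in for the html string fetched in the question's code
        string html = "<A HREF=\"/imagenes_cotizador/BannerPrincipal/banner1.jpg\">banner1.jpg</A>";
        var baseUri = new Uri("https://www.paz.cl");
        var regex = new Regex("HREF=\"([^\"]+\\.(jpg|png))", RegexOptions.IgnoreCase);

        foreach (Match match in regex.Matches(html))
        {
            // Uri resolves the captured relative path against the base URL
            var absolute = new Uri(baseUri, match.Groups[1].Value);
            Console.WriteLine(absolute);
            // using (var client = new WebClient())
            //     client.DownloadFile(absolute, System.IO.Path.GetFileName(absolute.LocalPath));
        }
    }
}
```

Letting `Uri` do the concatenation avoids double or missing slashes when joining the base URL and the relative path by hand.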
You can use The HTML Agility Pack
for example
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.Attributes["href"].Value);
}
You can use AngleSharp to load and parse the html page. Then you can extract all the information you need.
// TODO add a reference to NuGet package AngleSharp
private static async Task Main(string[] args)
{
    var config = Configuration.Default.WithDefaultLoader();
    var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(address);
    var images = document.Images.Select(img => img.Source);
}
AngleSharp implements the W3C standard, so it tends to handle real-world web pages better than HtmlAgilityPack.

How to Get all the websites related to the keyword in Windows Form C#

Here is my process:
I have a textbox where the user enters a keyword, for example games; after pressing Enter, all the websites related to games should be output in the Windows Form.
Basically I tried using the Google Search API, using this code:
const string apiKey = "";
const string searchEngineId = "";
const string query = "games";
CustomsearchService customSearchService = new CustomsearchService(new Google.Apis.Services.BaseClientService.Initializer() { ApiKey = apiKey });
Google.Apis.Customsearch.v1.CseResource.ListRequest listRequest = customSearchService.Cse.List(query);
listRequest.Cx = searchEngineId;
Search search = listRequest.Execute();
foreach (var item in search.Items)
{
    Console.WriteLine("Title : " + item.Title + Environment.NewLine + "Link : " + item.Link + Environment.NewLine + Environment.NewLine);
}
But my problem is the limitation of 100 queries/day and 10 results/query, which is not workable for me.
So I decided to use HttpWebRequest and HttpWebResponse approach,
Here is the code which I saw from the internet:
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];
string GS = "http://google.com/search?q=sample";
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(GS);
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0);
My problem with this is that it returns the whole HTML. Is it possible to get only the URLs, like with the Google Search API?
That's the way it works, you either have to pay for the API, or parse the HTML - the legality of the latter is questionable.
Using an HTML parser with CSS selectors, it is not that much work (the solution is based on this Java tutorial: http://mph-web.de/web-scraping-with-java-top-10-google-search-results/). I used Dcsoup (https://github.com/matarillo/dcsoup, an incomplete Jsoup port) for the example, since I'm used to Jsoup (https://jsoup.org/apidocs/), but there may be other HTML parsers for C# that are better maintained.
// query results on page 14, to demonstrate that the result limit is avoided
int resultPage = 130;
string keyword = "test";
string searchUrl = "http://www.google.com/search?q=" + keyword + "&start=" + resultPage;
System.Net.WebClient webClient = new System.Net.WebClient();
string htmlResult = webClient.DownloadString(searchUrl);
Supremes.Nodes.Document doc = Supremes.Dcsoup.Parse(htmlResult, "http://www.google.com/");
// parse with css selector
foreach (Supremes.Nodes.Element result in doc.Select("h3.r a"))
{
    string title = result.Text;
    string url = result.Attr("href");
    // do something useful with the search result
    System.Diagnostics.Debug.WriteLine(title + " -> " + url);
}
The needed selector h3.r a might change. A more stable alternative might be to parse all elements and retrieve those with an href attribute, or at least to build in a check: query a search term that has many results, and if your selector finds nothing, send yourself a notification so you can repair the selector.
See also this answer regarding getting the results for the exact search term: https://stackoverflow.com/a/37268746/1661938

Removing formatting (\t\r) from a Stream Reader

I'm importing an .sql file that I created with SQLExplorer into my .NET program using StreamReader (which will eventually be passed through OdbcConnection, fyi.) The problem is that when I use the ReadToEnd() method it not only imports the SQL itself, but it imports all of the formatting. So the string is littered with \r and \t and the like.
I've been looking at both using split or possibly regex to break the string down and remove the unwanted bits and pieces. But before throwing a bunch of effort into that I wondered if there was perhaps something I'm missing in the StreamReader class? Is there a way to tell it to just ignore the formatting characters?
Here's the code I have right now:
public static Object SQLQueryFileCall(String SQLQueryFileName)
{
    string SQLQuery = "";
    string directory = System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);
    SQLQueryFileName = directory + "\\" + SQLQueryFileName;
    // read in the file and pass to ODBC, return an Object[] of whatever comes back...
    try
    {
        using (StreamReader myStreamReader = new StreamReader(SQLQueryFileName))
        {
            SQLQuery = myStreamReader.ReadToEnd();
            Console.WriteLine(SQLQuery);
        }
    }
    catch (Exception e)
    {
        // report the error
        Console.WriteLine("File could not be read:");
        string error = e.Message;
        MessageBox.Show(error);
        return null;
    }
    // a return is needed on the success path for this to compile
    return SQLQuery;
}
Feel free to offer any advice you might have on the code, seeing as I'm pretty new.
But yeah, mostly I'm just hoping there's a method in the StreamReader class that I'm just not understanding. I've gone to Microsoft's online documentation, and I feel I've given it a good look, but then again, I'm new and perhaps the concept skipped over my head?
Any help?
NOTE: There are multiple \t that are in the middle of some of the lines, and they do need to be removed. Hence using trim would...be tricky at least.
Well, myStreamReader.ReadToEnd() will get you everything. The easiest way to get rid of most unneeded whitespace is to read the file line by line and simply .Trim() every line:
using (StreamReader myStreamReader = new StreamReader(SQLQueryFileName))
{
    string line;
    List<string> lines = new List<string>();
    while ((line = myStreamReader.ReadLine()) != null)
        lines.Add(line.Trim());
    SQLQuery = string.Join("\n", lines);
    Console.WriteLine(SQLQuery);
}
SQL, by definition, shouldn't have a problem with whitespace like tabs and newlines throughout code. Verify that your actual SQL is correct first.
Also, blindly stripping whitespace could have an impact on textual data contained within your script: what happens if you have a string literal that contains a tab character?
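To make that concern concrete, here is a small sketch (the SQL string is invented for illustration). Per-line trimming keeps a tab that sits inside a string literal, while blanket regex stripping destroys it:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class WhitespaceStrippingDemo
{
    static void Main()
    {
        // a literal containing a real tab, as it might appear in an INSERT statement
        string sql = "INSERT INTO t VALUES ('a\tb');\r\n\tSELECT * FROM t;";

        // trimming each line only touches leading/trailing whitespace,
        // so the tab inside 'a\tb' survives
        var trimmed = string.Join("\n", sql.Split('\n').Select(l => l.Trim()));

        // blanket removal also deletes the tab inside the literal
        var stripped = Regex.Replace(sql, @"\t|\n|\r", "");

        Console.WriteLine(trimmed.Contains("a\tb"));   // True
        Console.WriteLine(stripped.Contains("a\tb"));  // False
    }
}
```

So the per-line Trim() approach is the safer of the two whenever scripts may contain tab characters in data.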
using (StreamReader myStreamReader = new StreamReader(SQLQueryFileName))
{
    string line;
    while ((line = myStreamReader.ReadLine()) != null)
    {
        Console.WriteLine(line.Trim().Replace("\t", " "));
    }
}
If it is a display and not content issue I think you could use WPF and a RichTextBox.
SQLQuery = Regex.Replace(SQLQuery, @"\t|\n|\r", "");
Better to use the following, since the regex above could remove content inside SQL insertions:
using (StreamReader reader = new StreamReader(fileName))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        SQLQuery += line.Trim() + "\n";
}
This is what helped me in 2020, in case someone is looking for this:
using (StreamReader sr = new StreamReader(stream))
{
    return Regex.Replace(sr.ReadToEnd().Trim(), @"\s", "");
}

Retrieve HTML from links on page

I am using the following method to retrieve the source code from my website-
class WorkerClass1
{
    public static string getSourceCode(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
        StreamReader sr = new StreamReader(resp.GetResponseStream());
        string sourceCode = sr.ReadToEnd();
        sr.Close();
        return sourceCode;
    }
}
And then use WorkerClass1 like so:
private void button1_Click(object sender, EventArgs e)
{
    string url = textBox1.Text;
    string sourceCode = WorkerClass1.getSourceCode(url);
    StreamWriter sw = new StreamWriter(@"path");
    sw.Write(sourceCode);
    sw.Close();
}
This works great and retrieves the HTML from my home page; however, there are links at the bottom of the page which I want to follow once the first page has been retrieved.
Is there a way I could modify my current code to do this?
Yes, of course.
What I would do is read the HTML with a regular expression looking for links. For each match, put the link in a queue or similar data structure, then fetch that link's source with the same method.
Consider looking at HtmlAgilityPack for the parsing; it might be easier, even though finding links with a regex should be quite simple.
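A sketch of that queue idea, reusing getSourceCode from the question's WorkerClass1 (the href regex is a deliberate simplification, and the maxPages cap is an invented safeguard against crawling forever):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkCrawler
{
    // naive href extraction; an HTML parser like HtmlAgilityPack would be more robust
    public static IEnumerable<string> ExtractLinks(string html)
    {
        foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\"", RegexOptions.IgnoreCase))
            yield return m.Groups[1].Value;
    }

    // breadth-first crawl: fetch a page, queue every link found on it,
    // then repeat for each queued link, capped at maxPages
    public static void Crawl(string startUrl, int maxPages)
    {
        var queue = new Queue<string>();
        var visited = new HashSet<string>();
        queue.Enqueue(startUrl);

        while (queue.Count > 0 && visited.Count < maxPages)
        {
            var url = queue.Dequeue();
            if (!visited.Add(url)) continue; // skip pages we've already fetched

            string html = WorkerClass1.getSourceCode(url);
            foreach (var link in ExtractLinks(html))
                queue.Enqueue(link);
        }
    }
}
```

The visited set is what keeps pages that link back to each other from being fetched twice.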

A problem parsing a HTML tag with HTML Agility Pack C#

This seems like it should be an easy thing to do, but I am having some major issues with it. I am trying to parse for a specific tag with the HAP. I used Firebug to find the XPath I want and came up with //*[@id="atfResults"]. I believe my issue is with the quotes, since " signals the start and end of a string. I have tried making it a verbatim string but I get errors. I have attached the function:
public List<string> GetHtmlPage(string strURL)
{
    // the html retrieved from the page
    WebResponse objResponse;
    WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
    objResponse = objRequest.GetResponse();
    // the using keyword will automatically dispose the object once complete
    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
    {
        // target XPath: //*[@id="atfResults"]
        string strContent = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
        /*Regex regex = new Regex("<body>((.|\n)*?)</body>", RegexOptions.IgnoreCase);
        //Here we apply our regular expression to our string using the
        //Match object.
        Match oM = regex.Match(strContent);
        Result = oM.Value;*/
        HtmlDocument doc = new HtmlDocument();
        doc.Load(new StringReader(strContent));
        HtmlNode root = doc.DocumentNode;
        List<string> itemTags = new List<string>();
        string listingtag = "//*[@id="atfResults"]"; // <-- this line won't compile: unescaped quotes
        foreach (HtmlNode link in root.SelectNodes(listingtag))
        {
            string att = link.OuterHtml;
            itemTags.Add(att);
        }
        return itemTags;
    }
}
You can escape it:
string listingtag = "//*[@id=\"atfResults\"]";
If you wanted to use a verbatim string, it would be:
string listingtag = @"//*[@id=""atfResults""]";
As you can see, verbatim strings don't really provide a benefit here.
However, you can instead use:
HtmlNode link = doc.GetElementbyId("atfResults");
This will also be slightly faster.
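A minimal sketch of that approach on a stub document (note that HtmlAgilityPack spells the method GetElementbyId, with a lowercase b; the snippet assumes the HtmlAgilityPack NuGet package is referenced):

```csharp
using System;
using HtmlAgilityPack;

class GetByIdSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"atfResults\">hello</div>");
        // no XPath string, so there is no quote-escaping problem at all
        HtmlNode node = doc.GetElementbyId("atfResults");
        Console.WriteLine(node.InnerText); // hello
    }
}
```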
Have you tried this:
string listingtag = "//*[@id='atfResults']";

Categories

Resources