How to retrieve specific HTML information from a given website

I'm trying to program an API for Discord and I need to retrieve two pieces of information from the HTML code of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N), specifically the URL of the character picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the name of the character (in this case Youji Kudou). Afterwards I need to save those two pieces of information to JSON.
I am using HtmlAgilityPack for this, but I can't quite figure it out. The following is my first attempt:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
{
Console.WriteLine(node.InnerHtml);
}
}
Unfortunately, this produces no output. If I followed the path correctly (which is probably the first mistake) it should be "tr/td/div/a/img". I get no errors, it runs, yet I get no output.
My second attempt is:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
var script = htmlDoc.DocumentNode.Descendants()
.Where(n => n.Name == "tr/td/a/img")
.First().InnerText;
// Return the data of spect and stringify it into a proper JSON object
var engine = new Jurassic.ScriptEngine();
var result = engine.Evaluate("(function() { " + script + " return src; })()");
var json = JSONObject.Stringify(engine, result);
Console.WriteLine(json);
Console.ReadKey();
}
But this also doesn't work.
How can I extract the required information?
EDIT:
So, I've come quite a bit further now and found a way to get the link; it was rather simple. But now I'm stuck on finding the name of the character. Every character page is structured the same way (only the last number in the URL changes), so I want to process many of them in a for loop. Here's how I tried to do it:
for (int i = 1; i <= 1000; i++)
{
HtmlWeb web = new HtmlWeb();
var html = "https://myanimelist.net/character/" + i;
var htmlDoc = web.Load(html);
foreach (var item in htmlDoc.DocumentNode.SelectNodes("//*[@]"))
{
string n;
n = item.GetAttributeValue("src", "");
foreach (var item2 in htmlDoc.DocumentNode.SelectNodes("//*[@src and @alt='" + n + "']"))
{
Console.WriteLine(item2.GetAttributeValue("src", ""));
}
}
}
In the first foreach I try to search for the name, which is always located at the same position (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91, and to be specific: http://prntscr.com/o1xzbk), but I haven't found out how yet, since the HTML structure around it doesn't have any identifying element I can latch onto. The second foreach loop searches for the URL, which works by now, and n should give me the name, so that I can look it up for each different character.

I was able to extract the character name and image from https://myanimelist.net/character/214 using the following method:
public static CharacterData ExtractCharacterNameAndImage(string url)
{
//Use the following if you are OK with hardcoding the structure of <div> elements.
//var tableXpath = "/html/body/div[1]/div[3]/div[3]/div[2]/table";
//Use the following if you are OK with hardcoding the fact that the relevant table comes first.
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
Where CharacterData is defined as follows:
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
Afterwards, the character data can be serialized to JSON using any of the tools from How to write a JSON file in C#?, e.g. json.net:
var url = "https://myanimelist.net/character/214";
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
Which outputs
{
"Name": "Youji Kudou",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
If you would prefer the Name to include the Japanese in parentheses, replace GetDirectInnerText() with just InnerText, which results in:
{
"Name": "Youji Kudou (工藤耀爾)",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
Alternatively, if you prefer you could pull the character name from the document title:
var title = string.Concat(htmlDoc.DocumentNode.SelectNodes("/html/head/title").Select(n => n.InnerText.Trim()));
var index = title.IndexOf("- MyAnimeList.net");
if (index >= 0)
title = title.Substring(0, index).Trim();
How did I determine the correct XPath strings?
Firstly, using Firefox 66, I opened the debugger and loaded https://myanimelist.net/character/214 in the window with the debugging tools visible.
Next, following the instructions from How to find xpath of an element in firefox inspector, I selected the Youji Kudou (工藤耀爾) node and copied its XPath, which turned out to be:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
I then tried to select this node using SelectNodes()... and got a null result. But why? To determine this I created a debugging routine that would break the path into successively longer portions and determine where the failure occurs:
static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i-1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
This output the following:
Input path: /html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
Testing "/html": result count = 1
Testing "/html/body": result count = 1
Testing "/html/body/div[1]": result count = 1
Testing "/html/body/div[1]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]": result count = null
As you can see, something goes wrong when selecting the <tbody> path element. Manual inspection of the InnerHtml returned by selecting /html/body/div[1]/div[3]/div[3]/div[2]/table revealed that the HTML received by the HtmlWeb object does not contain the <tbody> tag. The most likely explanation is that browsers such as Firefox insert an implicit <tbody> element when they build the DOM from a table, whereas HtmlAgilityPack keeps the markup as the server sent it. Once I omitted the tbody path element I was able to query for the character name successfully using:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]
A similar process provided the following working path for the image:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img
Since the two queries are finding contents in the same <table>, in my final code I selected the table only once in a separate step, and removed some of the hardcoding as to the specific nesting of <div> elements.
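The missing <tbody> step can be reproduced offline. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the markup is a made-up stand-in for the real page, and TbodyDemo and CountMatches are hypothetical names used only for illustration:

```csharp
using System;
using HtmlAgilityPack;

public static class TbodyDemo
{
    // Returns how many nodes the XPath selects.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Hypothetical markup: a table written without <tbody>, as in the raw page source.
        var html = "<html><body><table><tr><td>Youji Kudou</td></tr></table></body></html>";

        // Path as copied from a browser inspector: finds nothing, because the parser adds no <tbody>.
        Console.WriteLine(CountMatches(html, "/html/body/table/tbody/tr/td"));

        // Same path with the tbody step removed: finds the single cell.
        Console.WriteLine(CountMatches(html, "/html/body/table/tr/td"));
    }
}
```

When a path copied from a browser returns null, dropping tbody steps is usually the first thing worth trying.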

Alright, to finish up: I've rounded out the code, gratefully assisted by dbc, and integrated it almost completely into the project. In case someone has an identical question later on, here it is. For a defined range of IDs, this outputs all the character names, links and images, writes them to a JSON file, and could be adapted for other websites.
using System;
using System.Linq;
using Newtonsoft.Json;
using HtmlAgilityPack;
using System.IO;
namespace SearchingHTML
{
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
public class Program
{
public static CharacterData ExtractCharacterNameAndImage(string url)
{
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
public static void Main()
{
int max = 10000;
string fileName = @"C:\Users\path of your file.json";
Console.WriteLine("Environment version: " + Environment.Version);
Console.WriteLine("Json.NET version: " + typeof(JsonSerializer).Assembly.FullName);
Console.WriteLine("HtmlAgilityPack version: " + typeof(HtmlDocument).Assembly.FullName);
Console.WriteLine();
for (int i = 6; i <= max; i++)
{
try
{
var url = "https://myanimelist.net/character/" + i;
var htmlDoc = new HtmlWeb().Load(url);
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
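// Append to the file; the using block closes the writer even if the write fails.
using (TextWriter tsw = new StreamWriter(fileName, true))
{
tsw.WriteLine(json);
}
}
catch (Exception ex)
{
// Report pages that could not be loaded or parsed instead of silently ignoring them.
Console.WriteLine("Skipping character " + i + ": " + ex.Message);
}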
}
}
}
}
/*******************************************************************************************************************************
****************************************************IF TESTING IS REQUIRED*****************************************************
*******************************************************************************************************************************
*
* static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i - 1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
*******************************************************************************************************************************
*********************************************FOR TESTING ENTER THIS INTO MAIN CLASS********************************************
*******************************************************************************************************************************
*
* var url2 = "https://myanimelist.net/character/256";
var data2 = ExtractCharacterNameAndImage(url2);
var json2 = JsonConvert.SerializeObject(data2, Formatting.Indented);
Console.WriteLine(json2);
var nameXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
var imageXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefox);
TestSelect(htmlDoc, imageXpathFromFirefox);
var nameXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]";
var imageXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefoxFixed);
TestSelect(htmlDoc, imageXpathFromFirefoxFixed);
*******************************************************************************************************************************
*******************************************************************************************************************************
*******************************************************************************************************************************
*/
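One thing to keep in mind with the loop above: appending one serialized object after another produces a file of concatenated JSON objects, which most parsers will not accept as a single JSON document. If a single valid file is wanted, one option is to collect the results in a list and serialize it once. The following is a minimal sketch, assuming Json.NET is referenced; JsonArrayDemo, ToJsonArray and the temp-folder output path are hypothetical, and the sample entry stands in for the results of ExtractCharacterNameAndImage:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class CharacterData
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}

public static class JsonArrayDemo
{
    // Serializes the whole collection at once so the output is one valid JSON array.
    public static string ToJsonArray(IEnumerable<CharacterData> characters)
    {
        return JsonConvert.SerializeObject(characters, Formatting.Indented);
    }

    public static void Main()
    {
        // Sample values taken from the output shown earlier in this thread; a real run
        // would add the result of ExtractCharacterNameAndImage(url) for each page.
        var characters = new List<CharacterData>
        {
            new CharacterData
            {
                Name = "Youji Kudou",
                ImageUrl = "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
                Url = "https://myanimelist.net/character/214"
            }
        };

        // Hypothetical output location; replace with your own path.
        var fileName = Path.Combine(Path.GetTempPath(), "characters.json");
        File.WriteAllText(fileName, ToJsonArray(characters));
        Console.WriteLine(File.ReadAllText(fileName));
    }
}
```

The trade-off is that nothing is written until the loop finishes, so for very long runs writing one object per line may still be the more practical choice.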

Related

Lucene 4.8 facets usage

I have difficulty understanding this example of how to use facets:
https://lucenenet.apache.org/docs/4.8.0-beta00008/api/Lucene.Net.Demo/Lucene.Net.Demo.Facet.SimpleFacetsExample.html
My goal is to create an index in which each document field has a facet, so that at search time I can choose which facets to use to navigate the data.
What I am confused about is the setup of facets at index creation. To summarize my question: is an index with facets compatible with ReferenceManager?
Does the DirectoryTaxonomyWriter need to be actually written and persisted on disk, or will it be embedded into the index itself, so that it is just temporary? I mean, given the code indexWriter.AddDocument(config.Build(taxoWriter, doc)); of the example, I expect it's temporary and will be embedded into the index (but then the example also shows that you need the taxonomy to drill down on facets). So can the taxonomy be tied to the index in some way, so that they are handled together by ReferenceManager?
If not, may I just use the same folder I use for storing the index?
Here is a more detailed list of the points that confuse me:
In my scenario I am indexing the documents asynchronously (background process) and then fetching the index ASAP through ReferenceManager in an ASP.NET application. I hope this way of fetching the index is compatible with the DirectoryTaxonomyWriter needed by facets.
Then I modified the code I wrote, introducing the taxonomy writer as indicated in the example, but I am a bit confused. It seems I can't store the DirectoryTaxonomyWriter in the same folder as the index because the folder is locked. Do I need to persist it, or will it be embedded into the index (so that a RAMDirectory is enough)? If I need to persist it in a different directory, can I safely persist it in a subdirectory?
Here is the code I am actually using:
private static void BuildIndex (IndexEntry entry)
{
string targetFolder = ConfigurationManager.AppSettings["IndexFolder"] ?? string.Empty;
//** LOG
if (System.IO.Directory.Exists(targetFolder) == false)
{
string message = @"Index folder not found";
_fileLogger.Error(message);
_consoleLogger.Error(message);
return;
}
var metadata = JsonConvert.DeserializeObject<IndexMetadata>(File.ReadAllText(entry.MetdataPath) ?? "{}");
string[] header = new string[0];
List<dynamic> csvRecords = new List<dynamic>();
using (var reader = new StreamReader(entry.DataPath))
{
CsvConfiguration csvConfiguration = new CsvConfiguration(CultureInfo.InvariantCulture);
csvConfiguration.AllowComments = false;
csvConfiguration.CountBytes = false;
csvConfiguration.Delimiter = ",";
csvConfiguration.DetectColumnCountChanges = false;
csvConfiguration.Encoding = Encoding.UTF8;
csvConfiguration.HasHeaderRecord = true;
csvConfiguration.IgnoreBlankLines = true;
csvConfiguration.HeaderValidated = null;
csvConfiguration.MissingFieldFound = null;
csvConfiguration.TrimOptions = CsvHelper.Configuration.TrimOptions.None;
csvConfiguration.BadDataFound = null;
using (var csvReader = new CsvReader(reader, csvConfiguration))
{
csvReader.Read();
csvReader.ReadHeader();
csvReader.Read();
header = csvReader.HeaderRecord;
csvRecords = csvReader.GetRecords<dynamic>().ToList();
}
}
string targetDirectory = Path.Combine(targetFolder, "Index__" + metadata.Boundle + "__" + DateTime.Now.ToString("yyyyMMdd_HHmmss") + "__" + Path.GetRandomFileName().Substring(0, 6));
System.IO.Directory.CreateDirectory(targetDirectory);
//** LOG
{
string message = @"..creating index : {0}";
_fileLogger.Information(message, targetDirectory);
_consoleLogger.Information(message, targetDirectory);
}
using (var dir = FSDirectory.Open(targetDirectory))
{
using (DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(dir))
{
Analyzer analyzer = metadata.GetAnalyzer();
var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
using (IndexWriter writer = new IndexWriter(dir, indexConfig))
{
long entryNumber = csvRecords.Count();
long index = 0;
long lastPercentage = 0;
foreach (dynamic csvEntry in csvRecords)
{
Document doc = new Document();
IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
var indexedMetadataFiled = metadata.IdexedFields;
foreach (string headField in header)
{
if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
continue;
var field = new Field(headField,
((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO,
indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO
);
doc.Add(field);
var facetField = new FacetField(headField, (string)dynamicCsvEntry[headField]);
doc.Add(facetField);
}
long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
if (percentage > lastPercentage && percentage % 10 == 0)
{
_consoleLogger.Information($"..indexing {percentage}%..");
lastPercentage = percentage;
}
writer.AddDocument(doc);
index++;
}
writer.Commit();
}
}
}
//** LOG
{
string message = @"Index Created : {0}";
_fileLogger.Information(message, targetDirectory);
_consoleLogger.Information(message, targetDirectory);
}
}

Web-scrape project writing too much information

I'm trying to modify the code below to scrape jobs from www.itoworld.com/careers. The jobs are in a table format, and the code returns all the <td> values.
I believe it comes from the line:
var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling
However, I want it to write:
<a class="std-btn" href="http://www.itoworld.com/office-manager/">Office Manager</a>
Currently it is writing
<a href='http://www.itoworld.com/office-manager/' target='_blank'>Office ManagerOffice & AdminCambridgeFind out more</a>
I plan on 'brute force' modifying the output to remove unnecessary extras but was hoping there is a smarter way to do this. Is there a way for example to remove the second and third ParentNode after they have been called? (So they do not get written?)
public string ExtractIto()
{
string sUrl = "http://www.itoworld.com/careers/";
GlobusHttpHelper ghh = new GlobusHttpHelper();
List<Links> link = new List<Links>();
bool Next = true;
int count = 1;
string html = ghh.getHtmlfromUrl(new Uri(string.Format(sUrl)));
HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
hd.LoadHtml(html);
var hn = hd.DocumentNode.SelectSingleNode("//*[@class='btn-wrapper']");
var hnc = hn.SelectNodes(".//a");
foreach (var node in hnc)
{
try
{
var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling;
Links l = new Links();
l.Name = ParseHtmlContainingText(parentnode.InnerText);
l.Link = node.GetAttributeValue("href", "");
link.Add(l);
}
catch (Exception)
{
// Skip links whose surrounding markup does not match the expected layout.
}
}
string Xml = getXml(link);
return WriteXml(Xml);
}
For completeness below is the definition of ParseHtmlContainingText
public string ParseHtmlContainingText(string htmlString)
{
return Regex.Replace(Regex.Replace(WebUtility.HtmlDecode(htmlString), @"<[^>]+>| ", ""), @"\s{2,}", " ").Trim();
}

You just need to create a "name node" and use that for your parse method.
I tested with this code and it worked for me.
var parentnode = node.ParentNode.ParentNode.ParentNode.FirstChild.NextSibling;
var nameNode = parentnode.FirstChild;
Links l = new Links();
l.Name = ParseHtmlContainingText(nameNode.InnerText);
l.Link = node.GetAttributeValue("href", "");

C# Sort Script/Picking the last Int from a foreach

I need some help with my sort script. I want to sort some files.
This is how the name is constructed: Name#Page#Version
I can pick out the name/category and the page, but I don't know how to pick the latest version.
Here you can see an example.
foreach(string files in Directory.GetFiles(path).OrderBy(fi => fi.Length))
{
try
{
filename = Path.GetFileNameWithoutExtension(files);
index = filename.LastIndexOf("#");
index2 = filename.LastIndexOf("#",index-1);
strversion = filename.Substring(index+1);
strpage = filename.Substring(index2+1);
strpage = strpage.Substring(0, strpage.LastIndexOf("#"));
page = Int32.Parse(strpage);
version = Int32.Parse(strversion);
Console.WriteLine("Page: "+page);
Console.WriteLine("Version: "+version);
if (filename.Contains("SMA"))
{
if (page == 1)
{
Console.WriteLine(filename);
}
}
}
catch (ArgumentOutOfRangeException e)
{
Console.WriteLine(e.Message);
}
}

You're overcomplicating things. You can split the string by # and get what you want from the resulting array:
var fileName = "SMA#1#2";
var parts = fileName.Split('#');
var name = parts[0];
var page = parts[1];
var version = parts[2];
EDIT
As for getting the last version for each page, you're probably better off creating some sort of class for your file and then grouping by page, and then sorting by version, and then selecting the first one:
public class Program
{
public static void Main()
{
var fileNames = new[] { "SMA#1#1", "SMA#1#2", "SMA#1#3", "SMA#2#1", "SMA#2#3" };
var files = (from fileName in fileNames
select fileName.Split('#') into parts
let name = parts[0]
let page = Int32.Parse(parts[1])
let version = Int32.Parse(parts[2])
select new MyFile(name, page, version)).ToList();
var grouped = files.GroupBy(x => x.Page).ToList();
foreach (var group in grouped)
{
var ordered = group.OrderByDescending(x => x.Version);
Console.WriteLine($"Page {group.Key} highest version: {ordered.First().Version}");
}
}
}
public class MyFile
{
public string Name { get; set; }
public int Page { get; set; }
public int Version { get; set; }
public MyFile(string name, int page, int version)
{
Name = name;
Page = page;
Version = version;
}
}
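If only the highest version number per page is needed, the grouping idea can be written more compactly. The following is a minimal sketch that uses only the base class library; LatestVersionDemo and LatestVersionPerPage are hypothetical names, and it assumes every name has exactly the form Name#Page#Version:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatestVersionDemo
{
    // Maps each page number to the highest version found for it.
    // Assumes every name has exactly the form Name#Page#Version.
    public static Dictionary<int, int> LatestVersionPerPage(IEnumerable<string> fileNames)
    {
        return fileNames
            .Select(f => f.Split('#'))
            // Group the version numbers by page number.
            .GroupBy(p => int.Parse(p[1]), p => int.Parse(p[2]))
            // Keep only the largest version in each group.
            .ToDictionary(g => g.Key, g => g.Max());
    }

    public static void Main()
    {
        var fileNames = new[] { "SMA#1#1", "SMA#1#2", "SMA#1#3", "SMA#2#1", "SMA#2#3" };
        foreach (var pair in LatestVersionPerPage(fileNames))
            Console.WriteLine($"Page {pair.Key} highest version: {pair.Value}");
    }
}
```

For the sample names, both page 1 and page 2 map to version 3. Using Max() avoids sorting each group just to read its first element.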

If I correctly understand your requirement, you want to
filter out every file not containing "SMA"
then order by page
then by version
You can achieve this quite declaratively using LINQ:
var orderedFileNames =
fileNames
.Where(fn=>fn.Contains("SMA"))
// parse name
.Select(fn => fn.Split('#'))
// pull parts into anonymous type
.Select(fn => new {
Name = fn[0], Page = int.Parse(fn[1]), Version = int.Parse(fn[2])
})
.OrderBy(fn=>fn.Name)
.ThenBy(fn=>fn.Page)
.ThenBy(fn=>fn.Version);

int lastIndex = fileName.LastIndexOf("#");
string version = fileName.Substring(lastIndex + 1);
Is that what you are looking for?

Reading Specific text from a website

I am trying to build a database, but I need to get info from a website, mainly the title, date, length and genre from the IMDb website. I have tried about 50 different things and it is just not working.
Here is my code.
public string GetName(string URL)
{
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(URL);
var Attr = doc.DocumentNode.SelectNodes("//*[@id=\"overview - top\"]/h1/span[1]@itemprop")[0];
return Name;
}
When I run this it just gives me an XPathException. I just want it to return the title of a movie. For now I am using this movie as an example for testing, but I want it to work with all movies: http://www.imdb.com/title/tt0405422
I am using the HtmlAgilityPack.

The last bit of your XPath is not valid. Also, to get only a single element from an HtmlDocument you can use SelectSingleNode() instead of SelectNodes():
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt0405422/");
var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";
var span = doc.DocumentNode.SelectSingleNode(xpath);
var title = span.InnerText;
Console.WriteLine(title);
output :
The 40-Year-Old Virgin
demo link : *
https://dotnetfiddle.net/P7U5A7
*) the demo shows that the correct title is printed, along with an error specific to .NET Fiddle (you can safely ignore the error).
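Because site layouts change, it is worth guarding against SelectSingleNode() returning null, which is what it does when nothing matches. The following is a minimal sketch, assuming the HtmlAgilityPack package is referenced; the markup is a hypothetical fragment modelled on the XPath in this answer rather than the live IMDb page, and TitleDemo and GetText are names invented for the example:

```csharp
using System;
using HtmlAgilityPack;

public static class TitleDemo
{
    // Returns the trimmed text of the first node matching the XPath,
    // or null when nothing matches.
    public static string GetText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        // Decode entities such as &amp; before returning the text.
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main()
    {
        // Hypothetical markup, not the real page.
        var html = "<div id='overview-top'><h1><span class='itemprop'>The 40-Year-Old Virgin</span></h1></div>";
        var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";

        Console.WriteLine(GetText(html, xpath) ?? "(no match)");
        // A page whose layout no longer fits the XPath yields the fallback text.
        Console.WriteLine(GetText("<p>changed layout</p>", xpath) ?? "(no match)");
    }
}
```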

I am making something similar, and this is my code, which gets info from the imdb.com website:
string html = getUrlData(imdbUrl + "combined");
Id = match(@"<link rel=""canonical"" href=""http://www.imdb.com/title/(tt\d{7})/combined"" />", html);
if (!string.IsNullOrEmpty(Id))
{
status = true;
Title = match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
OriginalTitle = match(@"title-extra"">(.*?)<", html);
Year = match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", html);
Rating = match(@"<b>(\d.\d)/10</b>", html);
Genres = matchAll(@"<a.*?>(.*?)</a>", match(@"Genre.?:(.*?)(</div>|See more)", html));
Directors = matchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Directed by</a></h5>(.*?)</table>", html));
Cast = matchAll(@"<td class=""nm""><a.*?href=""/name/.*?/"".*?>(.*?)</a>", match(@"<h3>Cast</h3>(.*?)</table>", html));
Plot = match(@"Plot:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
Runtime = match(@"Runtime:</h5><div class=""info-content"">(\d{1,4}) min[\s]*.*?</div>", html);
Languages = matchAll(@"<a.*?>(.*?)</a>", match(@"Language.?:(.*?)(</div>|>.?and )", html));
Countries = matchAll(@"<a.*?>(.*?)</a>", match(@"Country:(.*?)(</div>|>.?and )", html));
Poster = match(@"<div class=""photo"">.*?<a name=""poster"".*?><img.*?src=""(.*?)"".*?</div>", html);
if (!string.IsNullOrEmpty(Poster) && Poster.IndexOf("media-imdb.com") > 0)
{
Poster = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY200.jpg");
PosterLarge = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY500.jpg");
PosterFull = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY0.jpg");
}
else
{
Poster = string.Empty;
PosterLarge = string.Empty;
PosterFull = string.Empty;
}
ImdbURL = "http://www.imdb.com/title/" + Id + "/";
if (GetExtraInfo)
{
string plotHtml = getUrlData(imdbUrl + "plotsummary");
}
//Match single instance
private string match(string regex, string html, int i = 1)
{
return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
}
//Match all instances and return as ArrayList
private ArrayList matchAll(string regex, string html, int i = 1)
{
ArrayList list = new ArrayList();
foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
list.Add(m.Groups[i].Value.Trim());
return list;
}
Maybe you will find something useful

Windows Form app find Link on Web

I need to create a method that finds the newest version of an application on a website (a Hudson server) and allows downloading it.
Until now I have used regex to scan all the HTML, find the href attributes, and search for the string I want.
I want to know if there is a simpler way to do so.
I have attached the code I use today:
namespace SDKGui
{
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href;
}
}
static class LinkFinder
{
public static string Find(string file)
{
string t=null;
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
t = Regex.Replace(value, @"\s*<.*?>\s*", "",
RegexOptions.Singleline);
if (t.Contains("hms_sdk_tool_"))
{
i.Text = t;
list.Add(i);
break;
}
}
return t;
}
}
}

It is easy to collect all href values and filter against any of your conditions using HtmlAgilityPack. The following method shows how to access a page, get all <a> tags, and return a list of all href values containing hms_sdk_tool_:
private List<string> HtmlAgilityCollectHrefs(string url)
{
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(url);
var a_nodes = doc.DocumentNode.SelectNodes("//a");
return a_nodes.Select(p => p.GetAttributeValue("href", "")).Where(n => n.Contains("hms_sdk_tool_")).ToList();
}
Or, if you are interested in 1 return string, use
private string GetLink(string url)
{
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(url);
var a_nodes = doc.DocumentNode.SelectNodes("//a");
return a_nodes.Select(p => p.GetAttributeValue("href", "")).Where(n => n.Contains("hms_sdk_tool_")).FirstOrDefault();
}
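One caveat with both methods: SelectNodes() returns null rather than an empty collection when the page contains no matching elements, so the LINQ chain would throw on such a page. The following is a minimal, null-safe sketch, assuming the HtmlAgilityPack package is referenced; it parses an inline string instead of loading a URL, and HrefDemo, CollectHrefs and the sample file names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HrefDemo
{
    // Collects href values containing the marker.
    // Returns an empty list when the document has no links at all.
    public static List<string> CollectHrefs(string html, string marker)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Only anchors that actually carry an href attribute are selected.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return new List<string>();
        return anchors
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(h => h.Contains(marker))
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical page content.
        var html = "<a href='/job/hms_sdk_tool_1.2.zip'>build</a><a href='/other.zip'>other</a>";
        foreach (var href in CollectHrefs(html, "hms_sdk_tool_"))
            Console.WriteLine(href);

        // A page without links yields an empty list instead of an exception.
        Console.WriteLine(CollectHrefs("<p>no links</p>", "hms_sdk_tool_").Count);
    }
}
```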
