I have an input string that contains HTML. It includes images, and I want to change the src attribute of each img element.
My code so far is below:
if (htmlStr.Contains("img"))
{
var html = new HtmlDocument();
html.LoadHtml(htmlStr);
var images = html.DocumentNode.SelectNodes("//img");
if (images != null && images.Count > 0)
{
for (int i = 0; i < images.Count; i++)
{
string imageSrc = images[i].Attributes["src"].Value;
string newSrc = "MyNewValue";
images[i].SetAttributeValue("src", newSrc);
}
}
//htmlStr= ???
}
return htmlStr;
What I am missing is how to update the htmlStr I am returning so that it contains the newSrc value for each image.
As far as I can tell, you have two options:
// Will give you a raw string.
// Not ideal if you are planning to
// send this over the network, or save as a file.
var updatedStr = html.DocumentNode.OuterHtml;
// Will let you write to any stream.
// Here, I'm just writing to a string builder as an example.
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
html.Save(writer);
}
// These two methods generate the same result, though.
Debug.Assert(string.Equals(updatedStr, sb.ToString()));
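Putting the question's loop and the OuterHtml option together, the whole operation might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; ImgSrcRewriter and ReplaceImageSources are hypothetical names chosen for illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ImgSrcRewriter
{
    // Hypothetical helper: returns htmlStr with the src of every <img> set to newSrc.
    public static string ReplaceImageSources(string htmlStr, string newSrc)
    {
        var html = new HtmlDocument();
        html.LoadHtml(htmlStr);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = html.DocumentNode.SelectNodes("//img");
        if (images == null)
            return htmlStr;

        foreach (var img in images)
            img.SetAttributeValue("src", newSrc);

        // Serialize the modified document back into a string.
        return html.DocumentNode.OuterHtml;
    }
}

public static class ImgSrcRewriterDemo
{
    public static void Main()
    {
        var input = "<p>Hi <img src=\"old.png\" alt=\"x\"></p>";
        Console.WriteLine(ImgSrcRewriter.ReplaceImageSources(input, "MyNewValue"));
    }
}
```

The null check also makes the initial htmlStr.Contains("img") test unnecessary, since a document without images is simply returned unchanged.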
I was using this piece of code till today and it was working fine:
for (int page = 1; page <= reader.NumberOfPages; page++)
{
var cpage = reader.GetPageN(page);
var content = cpage.Get(PdfName.CONTENTS);
var ir = (PRIndirectReference)content;
var value = reader.GetPdfObject(ir.Number);
if (value.IsStream())
{
PRStream stream = (PRStream)value;
var streamBytes = PdfReader.GetStreamBytes(stream);
var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));
try
{
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TK_STRING)
{
string strs = tokenizer.StringValue;
if (!(br = excludeList.Any(st => strs.Contains(st))))
{
//strfor += tokenizer.StringValue;
if (!string.IsNullOrWhiteSpace(strs) &&
!stringsList.Any(i => i == strs && excludeHeaders.Contains(strs)))
stringsList.Add(strs);
}
}
}
}
finally
{
tokenizer.Close();
}
}
}
But today I got an exception for one PDF file: Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRIndirectReference'.
On debugging I found that the error is at this line: var ir = (PRIndirectReference)content;. For this file, the page contents I'm extracting come back as an array (a PdfArray) instead of a single indirect reference.
I would be really grateful if anyone could help me with this. Thanks in advance.
EDIT :
The PDF contents are paragraphs, tables, headers and footers, and in a few cases images. I'm not concerned about the images, as I'm skipping them.
As you can see from the code, I'm trying to add the words to a list of strings, so I expect the output to be plain text; words, to be specific.
That turned out to be really easy! I don't know why I couldn't work it out.
PdfReader reader = new PdfReader(name);
List<string> stringsList = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
//directly get the contents into a byte stream
var streamByte = reader.GetPageContent(page);
var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamByte));
var sb = new StringBuilder(); //use a string builder instead
try
{
while (tokenizer.NextToken())
{
if (tokenizer.TokenType == PRTokeniser.TK_STRING)
{
var currentText = tokenizer.StringValue;
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
sb.Append(currentText); //append the re-encoded text, not the raw token
}
}
}
finally
{
//add appended strings into a string list
if(sb != null)
stringsList.Add(sb.ToString());
tokenizer.Close();
}
}
I'm trying to program an API for discord and I need to retrieve two pieces of information out of the HTML code of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N), specifically the URL of the Character Picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the name of the character (in this case Youji Kudou). Afterwards I need to save those two pieces of information to JSON.
I am using HTMLAgilityPack for this, yet I can't quite see through it. The following is my first attempt:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
{
Console.WriteLine(node.InnerHtml);
}
}
Unfortunately, this produces no output. If I followed the path correctly (which is probably the first mistake) it should be "tr/td/div/a/img". I get no errors, it runs, yet I get no output.
My second attempt is:
public static void Main()
{
var html = "https://myanimelist.net/character/214";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
var script = htmlDoc.DocumentNode.Descendants()
.Where(n => n.Name == "tr/td/a/img")
.First().InnerText;
// Return the data of spect and stringify it into a proper JSON object
var engine = new Jurassic.ScriptEngine();
var result = engine.Evaluate("(function() { " + script + " return src; })()");
var json = JSONObject.Stringify(engine, result);
Console.WriteLine(json);
Console.ReadKey();
}
But this also doesn't work.
How can I extract the required information?
EDIT:
I've made some progress and found a way to get the link; it was rather simple. But now I'm stuck on finding the name of the character. Every character page is structured the same way (only the last number in the URL changes), so I want to fetch many of them in a for loop. Here's how I tried to do it:
for (int i = 1; i <= 1000; i++)
{
HtmlWeb web = new HtmlWeb();
var html = "https://myanimelist.net/character/" + i;
var htmlDoc = web.Load(html);
foreach (var item in htmlDoc.DocumentNode.SelectNodes("//*[@]"))
{
string n;
n = item.GetAttributeValue("src", "");
foreach (var item2 in htmlDoc.DocumentNode.SelectNodes("//*[@src and @alt='" + n + "']"))
{
Console.WriteLine(item2.GetAttributeValue("src", ""));
}
}
}
In the first foreach I want to search for the name, which is always located at the same position (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91 and, to be specific, http://prntscr.com/o1xzbk), but I haven't found out how yet, since the HTML at that position has no distinguishing element I can select on. The second foreach loop searches for the image URL, which works by now; n is supposed to hold the name so that I can match the image for each character.
I was able to extract the character name and image from https://myanimelist.net/character/214 using the following method:
public static CharacterData ExtractCharacterNameAndImage(string url)
{
//Use the following if you are OK with hardcoding the structure of <div> elements.
//var tableXpath = "/html/body/div[1]/div[3]/div[3]/div[2]/table";
//Use the following if you are OK with hardcoding the fact that the relevant table comes first.
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
Where CharacterData is defined as follows:
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
Afterwards, the character data can be serialized to JSON using any of the tools from How to write a JSON file in C#?, e.g. json.net:
var url = "https://myanimelist.net/character/214";
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
Which outputs
{
"Name": "Youji Kudou",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
If you would prefer the Name to include the Japanese in parentheses, replace GetDirectInnerText() with just InnerText, which results in:
{
"Name": "Youji Kudou (工藤耀爾)",
"ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
"Url": "https://myanimelist.net/character/214"
}
Alternatively, if you prefer you could pull the character name from the document title:
var title = string.Concat(htmlDoc.DocumentNode.SelectNodes("/html/head/title").Select(n => n.InnerText.Trim()));
var index = title.IndexOf("- MyAnimeList.net");
if (index >= 0)
title = title.Substring(0, index).Trim();
How did I determine the correct XPath strings?
Firstly, using Firefox 66, I opened the debugger and loaded https://myanimelist.net/character/214 in the window with the debugging tools visible.
Next, following the instructions from How to find xpath of an element in firefox inspector, I selected the Youji Kudou (工藤耀爾) node and copied its XPath, which turned out to be:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
I then tried to select this node using SelectNodes()... and got a null result. But why? To determine this I created a debugging routine that would break the path into successively longer portions and determine where the failure occurs:
static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i-1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
This output the following:
Input path: /html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
Testing "/html": result count = 1
Testing "/html/body": result count = 1
Testing "/html/body/div[1]": result count = 1
Testing "/html/body/div[1]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]": result count = null
As you can see, something goes wrong when selecting the <tbody> path element. Manual inspection of the InnerHtml returned by selecting /html/body/div[1]/div[3]/div[3]/div[2]/table revealed that the HTML parsed by the HtmlWeb object contains no <tbody> tag. A likely explanation is that browsers insert an implied <tbody> when they build the DOM, so the XPath copied from Firefox describes the browser's tree and not the raw markup; a difference in the request headers sent by Firefox and HtmlWeb is another possibility. Once I omitted the tbody path element I was able to query for the character name successfully using:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]
A similar process provided the following working path for the image:
/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img
Since the two queries are finding contents in the same <table>, in my final code I selected the table only once in a separate step, and removed some of the hardcoding as to the specific nesting of <div> elements.
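The tbody adjustment can also be applied mechanically to any XPath copied from a browser. The following is only a sketch using the base class library; XPathCleanup and RemoveTbodySteps are hypothetical names, and it assumes the parsed markup contains no explicit <tbody> elements (if a page does include them, the steps must stay).

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleanup
{
    // Hypothetical helper: removes "tbody" steps (with an optional [n] index)
    // from an XPath that was copied out of a browser's inspector.
    public static string RemoveTbodySteps(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }
}

public static class XPathCleanupDemo
{
    public static void Main()
    {
        var fromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
        // prints /html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]
        Console.WriteLine(XPathCleanup.RemoveTbodySteps(fromFirefox));
    }
}
```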
Demo fiddle here.
Alright, to finish up: I've rounded off the code, gratefully assisted by dbc, and integrated it almost completely into the project. In case someone has an identical question later on, here it is. For a defined range of numbers, it outputs all the character names, links and images, writes them into a JSON file, and could be adapted for other websites.
using System;
using System.Linq;
using Newtonsoft.Json;
using HtmlAgilityPack;
using System.IO;
namespace SearchingHTML
{
public class CharacterData
{
public string Name { get; set; }
public string ImageUrl { get; set; }
public string Url { get; set; }
}
public class Program
{
public static CharacterData ExtractCharacterNameAndImage(string url)
{
var tableXpath = "/html/body//table";
var nameXpath = "tr/td[2]/div[4]";
var imageXpath = "tr/td[1]/div[1]/a/img";
var htmlDoc = new HtmlWeb().Load(url);
var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();
return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}
public static void Main()
{
int max = 10000;
string fileName = @"C:\Users\path of your file.json";
Console.WriteLine("Environment version: " + Environment.Version);
Console.WriteLine("Json.NET version: " + typeof(JsonSerializer).Assembly.FullName);
Console.WriteLine("HtmlAgilityPack version: " + typeof(HtmlDocument).Assembly.FullName);
Console.WriteLine();
for (int i = 6; i <= max; i++)
{
try
{
var url = "https://myanimelist.net/character/" + i;
var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);
Console.WriteLine(json);
// Append to the output file; "using" closes the writer even if writing fails.
using (TextWriter tsw = new StreamWriter(fileName, true))
{
tsw.WriteLine(json);
}
}
catch (Exception ex)
{
// Report pages that fail to load or parse instead of silently ignoring them.
Console.WriteLine("Skipping " + i + ": " + ex.Message);
}
}
}
}
}
/*******************************************************************************************************************************
****************************************************IF TESTING IS REQUIRED*****************************************************
*******************************************************************************************************************************
*
* static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
Console.WriteLine("\nInput path: " + xpath);
var splitPath = xpath.Split('/');
for (int i = 2; i <= splitPath.Length; i++)
{
if (splitPath[i - 1] == "")
continue;
var thisPath = string.Join("/", splitPath, 0, i);
Console.Write("Testing \"{0}\": ", thisPath);
var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
}
}
*******************************************************************************************************************************
*********************************************FOR TESTING ENTER THIS INTO MAIN CLASS********************************************
*******************************************************************************************************************************
*
* var url2 = "https://myanimelist.net/character/256";
var data2 = ExtractCharacterNameAndImage(url2);
var json2 = JsonConvert.SerializeObject(data2, Formatting.Indented);
Console.WriteLine(json2);
var nameXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
var imageXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefox);
TestSelect(htmlDoc, imageXpathFromFirefox);
var nameXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]";
var imageXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img";
TestSelect(htmlDoc, nameXpathFromFirefoxFixed);
TestSelect(htmlDoc, imageXpathFromFirefoxFixed);
*******************************************************************************************************************************
*******************************************************************************************************************************
*******************************************************************************************************************************
*/
I am attempting to create a crawler that returns only the links from a website, and I have got it to the point where it returns the page's HTML.
I now want to use an if statement to check that the string has been returned and, if it has, search for all <a> tags and show me each href link.
But I don't know which object to check or what value I should be checking for.
Here is what I have so far:
namespace crawler
{
class Program
{
static void Main(string[] args)
{
System.Net.WebClient wc = new System.Net.WebClient();
string WebData = wc.DownloadString("https://www.abc.net.au/news/science/");
Console.WriteLine(WebData);
// if
}
}
}
You can have a look at the HTML Agility Pack.
With it, you can find all links on a web page like this:
var hrefs = new List<string>();
var hw = new HtmlWeb();
HtmlDocument document = hw.Load(/* your url here */);
foreach(HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute attribute = link.Attributes["href"];
if (!string.IsNullOrWhiteSpace(attribute.Value))
hrefs.Add(attribute.Value);
}
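Since the question already has the page in a string, the same query can be run on that string with LoadHtml instead of downloading the page again. The sketch below assumes the HtmlAgilityPack NuGet package; LinkExtractor and ExtractHrefs are hypothetical names. It also covers the "check that the string is returned" part and the case where SelectNodes finds nothing, in which case it returns null.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a href="..."> in webData.
    public static List<string> ExtractHrefs(string webData)
    {
        var hrefs = new List<string>();

        // Nothing was downloaded, so there is nothing to search.
        if (string.IsNullOrWhiteSpace(webData))
            return hrefs;

        var document = new HtmlDocument();
        document.LoadHtml(webData);

        // SelectNodes returns null when no node matches the XPath.
        var links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return hrefs;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");
            if (!string.IsNullOrWhiteSpace(href))
                hrefs.Add(href);
        }
        return hrefs;
    }
}

public static class LinkExtractorDemo
{
    public static void Main()
    {
        var webData = "<p><a href=\"/news/one\">One</a> <a>no href</a> <a href=\"/news/two\">Two</a></p>";
        foreach (var href in LinkExtractor.ExtractHrefs(webData))
            Console.WriteLine(href);
    }
}
```

In the question's program, the call would be LinkExtractor.ExtractHrefs(WebData) after DownloadString returns.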
Firstly you can make a function to return the whole website HTML code as you have done. Here is the one I have!
public string GetPageContents()
{
string link = "https://www.abc.net.au/news/science/";
string pageContent = "";
WebClient web = new WebClient();
// Disposing the reader and stream here closes the connection.
using (Stream stream = web.OpenRead(link))
using (StreamReader reader = new StreamReader(stream))
{
pageContent = reader.ReadToEnd();
}
return pageContent;
}
Then you could make a function that returns a substring or a list of substrings (if you wanted all <a> tags, you would probably get more than one).
List<string> divTags = GetBetweenTags(pageContents, "<div>", "</div>");
This would give you a list in which you could, for example, run another search for <a> tags inside each of those <div> tags.
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
Regex rx = new Regex(startTag + "(.*?)" + endTag);
MatchCollection col = rx.Matches(pageContents);
List<string> tags = new List<string>();
foreach(Match s in col)
tags.Add(s.ToString());
return tags;
}
Edit: Wow, I didn't know about HTML Agility Pack. Thanks @Gauravsa, I'll update my project to use it!
imgs = doc.DocumentNode.SelectNodes("//img");
foreach (HtmlNode img in imgs)
{
string imageIdString = image.Id.ToString();
img.SetAttributeValue("src", "/ImageBrowser/ImageById/" + imageIdString);
}
I get a proper value for the ID, but the img source stays unchanged and I can't work out why.
I tried to handle it as described here:
Need to replace an img src attrib with new value
Edit1: The requested code
string input = sectionEditModel.Content;
string htmlstring = sectionEditModel.Content;
string htmlstringdecoded = HttpUtility.HtmlDecode(htmlstring);
HtmlDocument doc = new HtmlDocument();
List<string> urls = new List<string>();
DbImgBrowser.Models.Image image = null;
doc.LoadHtml(htmlstringdecoded);
var files = new FilesRepository();
HtmlNodeCollection imgs = new HtmlNodeCollection(doc.DocumentNode);
imgs = doc.DocumentNode.SelectNodes("//img");
if (imgs != null && imgs.Count > 0)
{
foreach (HtmlNode img in imgs)
{
HtmlAttribute srcs = img.Attributes[@"src"];
urls.Add(srcs.Value);
{
foreach (string Value in urls){
string AttrVal = img.GetAttributeValue("src", null);
if(AttrVal.Contains("base64"))
{
byte[] data = Convert.FromBase64String(Value.Substring(Value.IndexOf(",") + 1));
var pFolder = files.GetFolderByPath(string.Empty);
if (pFolder != null)
{
image = new DbImgBrowser.Models.Image()
{
Name = Guid.NewGuid().ToString(),
Folder = pFolder,
Image1 = data
};
files.Db.Images.Add(image);
files.Db.SaveChanges();
string imageIdString = image.Id.ToString();
img.SetAttributeValue("src", "/ImageBrowser/ImageById/" + imageIdString);
files.Db.SaveChanges();
}
}
Edit2: Example paths: before base64 example image
Path by Url example /ImageBrowser/Image?path=Test2.PNG
Wanted Result src="ImageBrowser/ImageById/"ID" (1-1000)
Edit3: Still none of the src values are changed.
The answer is very simple.
I was only modifying a local doc; I had to write it back to the section's content and save the section:
SectionsRepository.SaveSection(Section sec)
I've stored all URLs in my application with "http://" - I now need to go through and replace all of them with "https://". Right now I have:
foreach (var link in links)
{
if (link.Contains("http:"))
{
/// do something, slice or replace or what?
}
}
I'm just not sure what the best way to update the string would be. How can this be done?
If you're dealing with URIs, you probably want to use UriBuilder, since doing a string replace on structured data like URIs is not a good idea.
var builder = new UriBuilder(link);
builder.Scheme = "https";
Uri modified = builder.Uri;
It's not clear what the type of links is, but you can create a new collection with the modified uris using linq:
IEnumerable<string> updated = links.Select(link => {
var builder = new UriBuilder(link);
builder.Scheme = "https";
return builder.ToString();
});
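One detail worth checking with this approach: as far as I know, setting Scheme leaves the Port that was parsed from the original URL untouched (80 for a plain http link), so the result can come out as https://host:80/... The sketch below resets the port to -1, which tells UriBuilder to use the scheme's default, but only when the original link relied on its default port. HttpsUpgrade and ToHttps are hypothetical names.

```csharp
using System;

public static class HttpsUpgrade
{
    // Hypothetical helper: returns link with its scheme switched to https.
    public static string ToHttps(string link)
    {
        var builder = new UriBuilder(link);

        // Remember whether the original URL relied on its scheme's default port.
        bool usedDefaultPort = builder.Uri.IsDefaultPort;

        builder.Scheme = Uri.UriSchemeHttps;

        // -1 means "use the default port of the current scheme" (443 for https).
        if (usedDefaultPort)
            builder.Port = -1;

        return builder.Uri.AbsoluteUri;
    }
}

public static class HttpsUpgradeDemo
{
    public static void Main()
    {
        Console.WriteLine(HttpsUpgrade.ToHttps("http://example.com/a?b=1"));
        Console.WriteLine(HttpsUpgrade.ToHttps("http://example.com:8080/x"));
    }
}
```

An explicit non-default port such as 8080 is kept as-is, on the assumption that the server also answers https on that port; whether that holds depends on the server.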
The problem is your strings are in a collection, and since strings are immutable you can't change them directly. Since you didn't specify the type of links (List? Array?) the right answer will change slightly. The easiest way is to create a new list:
links = links.Select(link => link.Replace("http://","https://")).ToList();
However if you want to minimize the number of changes and can access the string by index you can just loop through the collection:
for(int i = 0; i < links.Length; i++ )
{
links[i] = links[i].Replace("http://","https://");
}
Based on your current code, link cannot be replaced with anything because the foreach loop variable is read-only (see here: Why can't I modify the loop variable in a foreach?). Use a for loop instead:
for(int a = 0; a < links.Length; a++ )
{
links[a] = links[a].Replace("http:/","https:/");
}
http://myserver.xom/login.aspx?returnurl=http%3a%2f%2fmyserver.xom%2fmyaccount.aspx&q1=a%20b%20c&q2=c%2b%2b
What about URLs that also contain a URL in the query string, like the one above? I think those should be replaced as well. Because of the URL encoding (escaping), this is the hard part of the job.
private void BlaBla()
{
// call the replacing function
Uri myNewUrl = ConvertHttpToHttps(myOriginalUrl);
}
private Uri ConvertHttpToHttps(Uri originalUri)
{
Uri result = null;
int httpsPort = 443;// if needed assign your own value or implement it as parametric
string resultQuery = string.Empty;
NameValueCollection urlParameters = HttpUtility.ParseQueryString(originalUri.Query);
if (urlParameters != null && urlParameters.Count > 0)
{
StringBuilder sb = new StringBuilder();
foreach (string key in urlParameters)
{
if (sb.Length > 0)
sb.Append("&");
string value = urlParameters[key].Replace("http://", "https://");
string valuEscaped = Uri.EscapeDataString(value);// this is important
sb.Append(string.Concat(key, "=", valuEscaped));
}
resultQuery = sb.ToString();
}
UriBuilder resultBuilder = new UriBuilder("https", originalUri.Host, httpsPort, originalUri.AbsolutePath);
resultBuilder.Query = resultQuery;
result = resultBuilder.Uri;
return result;
}
Use string.Replace and some LINQ:
var httpsLinks = links.Select(l=>l.Replace("http://", "https://"));