How to get links from google html search results in c#? - c#

I have this code that fetches the search results from Google as an HTML string:
WebClient webClient = new WebClient();
string htmlString = webClient.DownloadString("http://www.google.com/search?q=" + searchQuery);
Any idea how to extract only the links from it?
I guess I could do a string search, but that doesn't seem very elegant...
I found this code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(htmlString);
var selectNodes = htmlDoc.DocumentNode.SelectNodes("//li[@class='g']");
foreach (var node in selectNodes)
{
//node.InnerText will give you the text content of the li tags ...
}
But I'm getting an exception because htmlDoc.DocumentNode.SelectNodes("//li[@class='g']") returns null...

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//*[@background or @lowsrc or @src or @href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        // Prefix every URL-bearing attribute with _newPath.
        if (link.Attributes["background"] != null)
            link.Attributes["background"].Value = _newPath + link.Attributes["background"].Value;
        if (link.Attributes["href"] != null)
            link.Attributes["href"].Value = _newPath + link.Attributes["href"].Value;
        if (link.Attributes["lowsrc"] != null)
            link.Attributes["lowsrc"].Value = _newPath + link.Attributes["lowsrc"].Value;
        if (link.Attributes["src"] != null)
            link.Attributes["src"].Value = _newPath + link.Attributes["src"].Value;
    }
}
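For the original question (collecting only the links), a minimal sketch along the same lines might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (LinkExtractor, GetLinks) are made up for illustration. It selects every anchor that has an href attribute and guards against the null that SelectNodes returns when nothing matches. Google's result markup changes over time, so a narrower XPath for result links only is left out here.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a href="..."> in the HTML.
    public static List<string> GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            // No matching nodes: SelectNodes gives null rather than an empty collection.
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static void Main()
    {
        string html = "<p><a href=\"http://example.com/a\">A</a> <a href=\"/b\">B</a></p>";
        foreach (string link in GetLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Returning an empty list instead of null keeps the calling foreach free of null checks.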

Related

Add div in HTML after <body ...> in c#

Requirement:
Insert custom HTML right after the body tag in a string.
I solved it with HtmlAgilityPack like this:
StringBuilder sb = new StringBuilder();
sb.Append(customStringWithHtmlContent);
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(sb.ToString());
// Create new node from newcontent
HtmlNode newNode = HtmlNode.CreateNode("<div>" + someStringWithAdditionalContent + "</div>");
// Get body node
HtmlNode body = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{// Add new node as first child of body
body.PrependChild(newNode);
}
var docContent = htmlDoc.DocumentNode.InnerHtml;
This looks good, but on some HTML pages the structure gets changed: closing div tags are moved and the page renders badly.
Second solution:
if (sb.ToString().Contains("<body>"))
{
sb.Replace("<body>", "<body><div>" + someStringWithAdditionalContent + "</div>");
}
This looks good too, but it doesn't work for a body tag with attributes, such as
<body style="someAttr:value ..." ...>
Any ideas? Other solutions?
RegEx? There's probably a more elegant way but the basic idea:
string input = "<body style=\"someAttr\"><tag>sdsdsa</tag></body>";
Regex Pattern = new Regex(@"(<body.*?>)(.*?)(<\/body>)", RegexOptions.Compiled);
var updatedText = Pattern.Replace(input, match =>
{
string newMatch = match.Groups[2].Value;
string newContent = "<div>" + "someStringWithAdditionalContent" + "</div>";
return match.Groups[1].Value + newContent + newMatch + match.Groups[3].Value;
});
Console.WriteLine(updatedText);
Output:
<body style="someAttr"><div>someStringWithAdditionalContent</div><tag>sdsdsa</tag></body>
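A slightly narrower variant of the same idea, as a sketch only: match just the opening body tag (with or without attributes, in any letter case) and append the new content to it, so the rest of the document never has to be captured. The names BodyInjector and InsertAfterBodyTag are invented for this example. It assumes no attribute value on the body tag contains a literal > character; a real HTML parser is safer when that cannot be guaranteed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyInjector
{
    // Matches only the opening <body ...> tag, attributes included.
    private static readonly Regex BodyOpenTag =
        new Regex(@"<body\b[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: inserts content right after the first opening body tag.
    public static string InsertAfterBodyTag(string html, string content)
    {
        // A match evaluator is used so that "$" inside content is not treated
        // as a substitution pattern; the count of 1 limits it to the first tag.
        return BodyOpenTag.Replace(html, m => m.Value + content, 1);
    }

    public static void Main()
    {
        string input = "<body style=\"someAttr\"><tag>sdsdsa</tag></body>";
        string extra = "<div>someStringWithAdditionalContent</div>";
        Console.WriteLine(InsertAfterBodyTag(input, extra));
    }
}
```

Because nothing after the opening tag is matched, line breaks inside the body do not matter, and the input is returned unchanged when no body tag is found.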

parsing an element in a div with html agility pack [C#]

I'm using Html Agility Pack on a website to extract some data. Parsing some of the HTML I need is easy but I am having trouble with this (slightly complex?) piece of HTML.
<tr>
<td>
<div onmouseover="toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')" onmouseout="toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')" onclick="togglestick('clue_J_1_1_stuck')">
...
I need to get the value of the em element with class "correct_response" inside the div's onmouseover attribute, based on the clue_J_X_Y value. I don't know how to get any further than this:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr//td/div[@onmouseover]");
Some help would be appreciated.
I don't know what you're supposed to get out from the em. But I will give you all the data you say you need to figure it out.
First we load the HTML.
string html = "<tr>" +
"<td>" +
"<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Console.WriteLine(doc.DocumentNode.OuterHtml);
Then we get the value of the attribute, onmouseover.
string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
It will return FAILED if it fails to find an attribute named "onmouseover". Now we extract the parameters of the toggle call, each of which is enclosed in a pair of apostrophes (').
//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false; string temp = "";
for(int i=0; i<toggle.Length; i++)
{
if (toggle[i] == '\'' && flag== true)
{
toggleVariables.Add(temp);
temp = "";
flag = false;
}
else if (flag)
{
temp += toggle[i];
}
else if (toggle[i] == '\'')
{
flag = true;
}
}
After that we have a list with 3 entries. In this case it will contain the following:
clue_J_1_1
clue_J_1_1_stuck
&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
Now we can create a new HtmlDocument with the HTML code from the third parameter. But first we have to convert it into workable HTML since the third parameter contains escape characters from HTML.
//Make it into workable HTML
toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
//New HtmlDocument
HtmlDocument htmlInsideToggle = new HtmlDocument();
htmlInsideToggle.LoadHtml(toggleVariables[2]);
Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
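As an alternative to the character loop, the quoted parameters can also be pulled out with a regular expression and decoded in the same pass. This is only a sketch: the class and method names (ToggleArguments, Extract) are made up, it uses WebUtility.HtmlDecode from System.Net instead of HttpUtility so that no reference to System.Web is needed, and like the loop it assumes that no parameter contains an apostrophe of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ToggleArguments
{
    // Hypothetical helper: returns every '...'-quoted argument, HTML-decoded.
    public static List<string> Extract(string call)
    {
        var arguments = new List<string>();
        // Group 1 captures the text between a pair of apostrophes.
        foreach (Match match in Regex.Matches(call, "'([^']*)'"))
        {
            arguments.Add(WebUtility.HtmlDecode(match.Groups[1].Value));
        }
        return arguments;
    }

    public static void Main()
    {
        string toggle = "toggle('clue_J_1_1', 'clue_J_1_1_stuck', " +
                        "'&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;')";
        foreach (string argument in Extract(toggle))
        {
            Console.WriteLine(argument);
        }
    }
}
```

The third entry of the returned list can then be passed to LoadHtml exactly as in the steps above.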
And done. The code in its entirety is below.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;
using System.Web;
namespace test
{
class Program
{
public static void Main(string[] args)
{
string html = "<tr>" +
"<td>" +
"<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Console.WriteLine(doc.DocumentNode.OuterHtml);
string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
//Clean up string
//Console.WriteLine(toggle);
//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false; string temp = "";
for(int i=0; i<toggle.Length; i++)
{
if (toggle[i] == '\'' && flag== true)
{
toggleVariables.Add(temp);
temp = "";
flag = false;
}
else if (flag)
{
temp += toggle[i];
}
else if (toggle[i] == '\'')
{
flag = true;
}
}
//Make it into workable HTML
toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
//New HtmlDocument
HtmlDocument htmlInsideToggle = new HtmlDocument();
htmlInsideToggle.LoadHtml(toggleVariables[2]);
Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
//You're on your own from here
Console.ReadKey();
}
}
}

C# Nested Loops

I'm having trouble with some loops.
I'm using HtmlAgilityPack. I have a TXT file with several links (one per line). For each link in the file, I want to load the page, extract a node via XPath, and write the result to a memo control.
The problem is that the code only produces output for the last line of the file. Where am I going wrong?
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
{
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
break;
}
}
Try changing
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
to
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
You're overwriting memoEdit1.Text every time. Try
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
instead - note the += instead of =, which adds the new text every time.
Incidentally, constantly appending strings together isn't really the best way. Something like this might be better:
var Webget = new HtmlWeb();
var builder = new StringBuilder();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
{
builder.AppendFormat("{0}\r\n", node.ChildNodes[0].InnerHtml);
break;
}
}
memoEdit1.Text = builder.ToString();
Or, using LINQ:
var Webget = new HtmlWeb();
memoEdit1.Text = string.Join(
"\r\n",
File.ReadAllLines("c:\\test.txt")
.Select(line => Webget.Load(line).DocumentNode.SelectNodes("//*[@id='title-article']").First().ChildNodes[0].InnerHtml));
If you are only selecting one node in the inner loop, use SelectSingleNode instead. Also, the better practice when concatenating strings in a loop is to use StringBuilder:
StringBuilder builder = new StringBuilder();
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
builder.AppendLine(doc.DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
}
memoEdit1.Text = builder.ToString();
Using LINQ, it would look like this:
var Webget = new HtmlWeb();
var result = File.ReadLines("c:\\test.txt")
.Select(line => Webget.Load(line).DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
memoEdit1.Text = string.Join(Environment.NewLine, result);

HtmlAgilityPack : Issues getting content of anchor tag within a string

I have the section of HTML code listed below, and I need the content within the anchor tag.
HtmlDocument newHtml = new HtmlDocument();
newHtml.OptionOutputAsXml = true;
var content = @"<div class=""business-name-container"">
<span class=""tier_info""></span>
<h3 class=""title fn org"">
<a class=""url link"">Foo</a>
</h3>
</div>";
newHtml.Load(content);
HtmlNode doc = newHtml.DocumentNode;
var findContent = doc.SelectNodes("//a[@class='url link']");
foreach (var acontent in findContent)
{
if (acontent.InnerHtml != null)
{
Console.WriteLine("Content: " + acontent.InnerHtml);
}
}
But I'm not getting any results.
I want the output to be "Content: Foo".
Replace
Console.WriteLine("Content: " + acontent.InnerHtml);
With
Console.WriteLine("Content: " + acontent.InnerText);
Or, even better, something like this:
var result = doc.Descendants("a")
.Where(x => x.GetAttributeValue("class", "") == "url link")
.Select(x => x.InnerText);

Modifying InnerXml of a text XmlNode

I traverse an HTML document with SGML and XmlDocument. When I find an XmlNode whose type is Text, I need to change its value so that it contains an XML element. I can't change InnerXml because it's read-only. I tried changing InnerText, but then the tag delimiters < and > are encoded as &lt; and &gt;. For example:
<p>
This is a text that will be highlighted.
<anothertag />
<......>
</p>
I'm trying to change to:
<p>
This is a text that will be <span class="highlighted">highlighted</span>.
<anothertag />
<......>
</p>
What is the easiest way to modify the value of a text XmlNode?
I have a workaround. I don't know whether it is a real solution, but it produces the result I want. Please comment on whether this code is a worthwhile solution.
private void traverse(ref XmlNode node)
{
XmlNode prevOldElement = null;
XmlNode prevNewElement = null;
var element = node.FirstChild;
do
{
if (prevNewElement != null && prevOldElement != null)
{
prevOldElement.ParentNode.ReplaceChild(prevNewElement, prevOldElement);
prevNewElement = null;
prevOldElement = null;
}
if (element.NodeType == XmlNodeType.Text)
{
var el = doc.CreateElement("text");
//Here is the manipulation of the InnerXml.
el.InnerXml = element.Value.Replace(a_search_term, "<b>" + a_search_term + "</b>");
//I don't replace element right now, because element.NextSibling will be null.
//So I replace the new element after getting the next sibling.
prevNewElement = el;
prevOldElement = element;
}
else if (element.HasChildNodes)
traverse(ref element);
}
while ((element = element.NextSibling) != null);
if (prevNewElement != null && prevOldElement != null)
{
prevOldElement.ParentNode.ReplaceChild(prevNewElement, prevOldElement);
}
}
Also, I remove <text> and </text> strings after the traverse function:
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
var html = doc.FirstChild;
traverse(ref html);
textBox1.Text = doc.OuterXml.Replace("<text>", String.Empty).Replace("</text>", String.Empty);
using System;
using System.Xml;
public class Sample {
public static void Main() {
XmlDocument doc = new XmlDocument();
doc.LoadXml(
"<p>" +
"This is a text that will be highlighted." +
"<br />" +
"<img />" +
"</p>");
string ImpossibleMark = "_*_";
XmlNode elem = doc.DocumentElement.FirstChild;
string thewWord ="highlighted";
if(elem.NodeType == XmlNodeType.Text){
string OriginalXml = elem.ParentNode.InnerXml;
while(OriginalXml.Contains(ImpossibleMark)) ImpossibleMark += ImpossibleMark;
elem.InnerText = elem.InnerText.Replace(thewWord, ImpossibleMark);
string replaceString = "<span class=\"highlighted\">" + thewWord + "</span>";
elem.ParentNode.InnerXml = elem.ParentNode.InnerXml.Replace(ImpossibleMark, replaceString);
}
Console.WriteLine(doc.DocumentElement.InnerXml);
}
}
The InnerText property will give you the text content of all the child nodes of the XmlNode. What you really want to set is the InnerXml property, which will be construed as XML, not as text.
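Another way to get the same result without string-replacing any markup is to split the text node through the DOM: put the text before the term, a new element, and the text after the term in place of the original node. The sketch below assumes only System.Xml; the names TextNodeHighlighter and HighlightFirst are invented for illustration, and it wraps only the first occurrence of the term in a given text node.

```csharp
using System;
using System.Xml;

public static class TextNodeHighlighter
{
    // Hypothetical helper: wraps the first occurrence of term inside textNode
    // in <span class="highlighted">. Returns false when the term is not found.
    public static bool HighlightFirst(XmlText textNode, string term)
    {
        if (string.IsNullOrEmpty(term)) return false;

        string value = textNode.Value;
        int index = value.IndexOf(term, StringComparison.Ordinal);
        if (index < 0) return false;

        XmlDocument doc = textNode.OwnerDocument;
        XmlNode parent = textNode.ParentNode;

        XmlElement span = doc.CreateElement("span");
        span.SetAttribute("class", "highlighted");
        span.InnerText = term;

        // Insert "before" text, the span and "after" text ahead of the
        // original node, then remove the original node.
        parent.InsertBefore(doc.CreateTextNode(value.Substring(0, index)), textNode);
        parent.InsertBefore(span, textNode);
        parent.InsertBefore(doc.CreateTextNode(value.Substring(index + term.Length)), textNode);
        parent.RemoveChild(textNode);
        return true;
    }

    public static string Demo()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<p>This is a text that will be highlighted.<br /><img /></p>");
        HighlightFirst((XmlText)doc.DocumentElement.FirstChild, "highlighted");
        return doc.DocumentElement.InnerXml;
    }

    public static void Main()
    {
        Console.WriteLine(Demo());
    }
}
```

Since the new text nodes are created through CreateTextNode, characters such as & or < in the surrounding text stay correctly escaped, which is the part that string replacement on InnerXml can get wrong.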
