link extraction using HtmlAgilityPack and c#

I want to extract the result links from a Google search page.
My code does extract links, but not the ones I expected.
It extracts every link found in an "a href" tag, and not all of them are actual results: ad links and Google's own links are included too.
What should I do?
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Xml;
namespace Search
{
public partial class Form1 : Form
{
// load snippet
HtmlAgilityPack.HtmlDocument htmlSnippet = new HtmlAgilityPack.HtmlDocument();
public Form1()
{
InitializeComponent();
}
private void btn1_Click(object sender, EventArgs e)
{
listBox1.Items.Clear();
StringBuilder sb = new StringBuilder();
byte[] ResultsBuffer = new byte[8192];
string SearchResults = "http://google.com/search?q=" + txtKeyWords.Text.Trim();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(SearchResults);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
count = resStream.Read(ResultsBuffer, 0, ResultsBuffer.Length);
if (count != 0)
{
tempString = Encoding.ASCII.GetString(ResultsBuffer, 0, count);
sb.Append(tempString);
}
}
while (count > 0);
string sbb = sb.ToString();
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.OptionOutputAsXml = true;
html.LoadHtml(sbb);
HtmlNode doc = html.DocumentNode;
foreach (HtmlNode link in doc.SelectNodes("//a[@href]"))
{
//HtmlAttribute att = link.Attributes["href"];
string hrefValue = link.GetAttributeValue("href", string.Empty);
// if ()
{
int index = hrefValue.IndexOf("&");
if (index > 0)
{
hrefValue = hrefValue.Substring(0, index);
listBox1.Items.Add(hrefValue.Replace("/url?q=", ""));
}
}
}
}
}
}
If I keep working with the "a href" tags, I have to add a condition to the if statement, but I don't know what condition to use here:
if ()
I also read somewhere about extracting the cite tag instead of the a href tag. Can anybody help?

To get the links that are contained in the cite elements, simply access their inner text, like:
HtmlWeb w = new HtmlWeb();
var hd = w.Load("http://www.google.com/search?q=veverke");
var cites = hd.DocumentNode.SelectNodes("//cite");
foreach (var cite in cites)
Console.WriteLine(cite.InnerText);
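If you would rather keep working with the a elements, one possible condition for the empty if () is sketched below. It assumes that result hrefs have the form "/url?q=<target>&...", as the Replace("/url?q=", "") call in the question suggests; Google's markup changes over time, so treat that as an assumption. GoogleLinkFilter and ExtractTarget are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;

public static class GoogleLinkFilter
{
    // Hypothetical helper: returns the target URL for hrefs shaped like
    // "/url?q=<target>&...", or null for anything else (ads, navigation,
    // Google's own links).
    public static string ExtractTarget(string href)
    {
        const string prefix = "/url?q=";
        if (string.IsNullOrEmpty(href) || !href.StartsWith(prefix, StringComparison.Ordinal))
            return null;

        // Keep only the q parameter: cut at the first '&' (this also covers "&amp;").
        string rest = href.Substring(prefix.Length);
        int amp = rest.IndexOf('&');
        if (amp >= 0)
            rest = rest.Substring(0, amp);

        // The target may be percent-encoded inside the query string.
        string target = Uri.UnescapeDataString(rest);

        Uri uri;
        if (!Uri.TryCreate(target, UriKind.Absolute, out uri))
            return null;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        // Skip links that point back to Google itself.
        string host = uri.Host.ToLowerInvariant();
        if (host == "google.com" || host.EndsWith(".google.com", StringComparison.Ordinal))
            return null;

        return target;
    }
}
```

Inside the foreach loop you would then call ExtractTarget(hrefValue) and add the result to listBox1 only when it is not null.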

Related

how to copy highlighted text from pdf file

I am using the iTextSharp library to develop a C# application that merges all annotation comments from two different PDF files into another PDF file.
With the code below I am able to find the highlighted text, but not with proper formatting. Please help; thanks in advance.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace PdfFileApp
{
public class pdftotext
{
public static void ReadAnnotation()
{
int pageTo = 0;
try
{
using (iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader("D:\\DEMO_Supp_First Proof.pdf"))
{
pageTo = reader.NumberOfPages;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
PdfDictionary page = reader.GetPageN(i);
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if (annots != null)
foreach (PdfObject annot in annots.ArrayList)
{
PdfDictionary annotationDic = (PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(annot);
PdfDictionary pdfDictionary = annots.GetAsDict(i);
PdfName subType = (PdfName)annotationDic.Get(PdfName.SUBTYPE);
var author = pdfDictionary.GetAsString(PdfName.T);
if (subType.Equals(PdfName.HIGHLIGHT))
{
PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(float.Parse(coordinates.ArrayList[0].ToString(), CultureInfo.InvariantCulture.NumberFormat), float.Parse(coordinates.ArrayList[1].ToString(), CultureInfo.InvariantCulture.NumberFormat),
float.Parse(coordinates.ArrayList[2].ToString(), CultureInfo.InvariantCulture.NumberFormat), float.Parse(coordinates.ArrayList[3].ToString(), CultureInfo.InvariantCulture.NumberFormat));
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy;
StringBuilder sb = new StringBuilder();
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
Console.WriteLine(sb.ToString());
Console.ReadLine();
var annotatedWord = sb.Replace(System.Environment.NewLine, string.Empty);
}
}
}
}
}
catch (Exception ex)
{
string error = ex.Message;
}
}
}
}

Install the Aspose.PDF library from the NuGet package manager and use the code below.
string highlightedText = "";
var document = new Aspose.Pdf.Document(@"Path");
Aspose.Pdf.Facades.PdfAnnotationEditor annotationEditor = new Aspose.Pdf.Facades.PdfAnnotationEditor();
annotationEditor.BindPdf(document);
// Extract annotations
var annotationTypes = new[] { Aspose.Pdf.Annotations.AnnotationType.FreeText, Aspose.Pdf.Annotations.AnnotationType.Highlight };
var annotations = annotationEditor.ExtractAnnotations(1, 2, annotationTypes);
foreach (var annotation in annotations)
{
var extractAnnotation = (Aspose.Pdf.Annotations.HighlightAnnotation)annotation;
highlightedText += extractAnnotation.GetMarkedText();
}

I use ASP.net to generate a sitemap.xml, get Google Webmaster says: "Your Sitemap appears to be an HTML page."

I am trying to return a generated .xml file, but Google picks it up as an HTML page. So I get: "Your Sitemap appears to be an HTML page. Please use a supported sitemap format instead."
Here is the ASP.NET controller action that generates the sitemap.xml:
[Route("sitemap.xml")]
public async Task<IActionResult> SitemapXmlAsync()
{
using (var client = new HttpClient())
{
try
{
client.BaseAddress = new Uri("https://api.badgag.com/api/generateSitemap");
var response = await client.GetAsync("");
response.EnsureSuccessStatusCode();
var stringResult = await response.Content.ReadAsStringAsync();
var pages = JsonConvert.DeserializeObject<String[]>(stringResult);
String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>";
xml += "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">";
foreach (string s in pages)
{
xml += "<sitemap>";
xml += "<loc>" + s + "</loc>";
//xml += "<lastmod>" + DateTime.Now.ToString("yyyy-MM-dd") + "</lastmod>";
xml += "</sitemap>";
}
xml += "</sitemapindex>";
return Content(xml, "text/xml");
}
catch (HttpRequestException httpRequestException)
{
return BadRequest($"Error getting sitemap: {httpRequestException.Message}");
}
}
}
I assume I am missing something. Setting a different header?
You can see the result here:
https://badgag.com/sitemap.xml
Thanks in advance :)

I found this article about creating an XML sitemap with ASP.NET; it helped me create a sitemap.ashx file with the correct sitemap layout that Google and Bing require.
It basically uses XmlTextWriter to generate the required tags for a sitemap, and writes the XML to the response through HttpContext. Here is the code from this site:
using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Text;
using System.Web;
using System.Xml;
namespace Mikesdotnetting
{
public class Sitemap : IHttpHandler
{
public void ProcessRequest(HttpContext context) {
context.Response.ContentType = "text/xml";
using (XmlTextWriter writer = new XmlTextWriter(context.Response.OutputStream, Encoding.UTF8)) {
writer.WriteStartDocument();
writer.WriteStartElement("urlset");
writer.WriteAttributeString("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");
writer.WriteStartElement("url");
string connect = ConfigurationManager.ConnectionStrings["MyConnString"].ConnectionString;
string url = "http://www.mikesdotnetting.com/";
using (SqlConnection conn = new SqlConnection(connect)) {
using (SqlCommand cmd = new SqlCommand("GetSiteMapContent", conn)) {
cmd.CommandType = CommandType.StoredProcedure;
conn.Open();
using (SqlDataReader rdr = cmd.ExecuteReader()) {
// Get the date of the most recent article
rdr.Read();
writer.WriteElementString("loc", string.Format("{0}Default.aspx", url));
writer.WriteElementString("lastmod", string.Format("{0:yyyy-MM-dd}", rdr[0]));
writer.WriteElementString("changefreq", "weekly");
writer.WriteElementString("priority", "1.0");
writer.WriteEndElement();
// Move to the Article IDs
rdr.NextResult();
while (rdr.Read()) {
writer.WriteStartElement("url");
writer.WriteElementString("loc", string.Format("{0}Article.aspx?ArticleID={1}", url, rdr[0]));
if (rdr[1] != DBNull.Value)
writer.WriteElementString("lastmod", string.Format("{0:yyyy-MM-dd}", rdr[1]));
writer.WriteElementString("changefreq", "monthly");
writer.WriteElementString("priority", "0.5");
writer.WriteEndElement();
}
// Move to the Article Type IDs
rdr.NextResult();
while (rdr.Read()) {
writer.WriteStartElement("url");
writer.WriteElementString("loc", string.Format("{0}ArticleTypes.aspx?Type={1}", url, rdr[0]));
writer.WriteElementString("priority", "0.5");
writer.WriteEndElement();
}
// Finally move to the Category IDs
rdr.NextResult();
while (rdr.Read()) {
writer.WriteStartElement("url");
writer.WriteElementString("loc", string.Format("{0}Category.aspx?Category={1}", url, rdr[0]));
writer.WriteElementString("priority", "0.5");
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
writer.Flush();
}
context.Response.End();
}
}
}
}
public bool IsReusable {
get {
return false;
}
}
}
}

Try LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
const string URL = @"https://badgag.com/sitemap.xml";
static void Main(string[] args)
{
XDocument doc = XDocument.Load(URL);
XNamespace ns = doc.Root.GetDefaultNamespace();
string[] locations = doc.Descendants(ns + "loc").Select(x => (string)x).ToArray();
}
}
}
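LINQ to XML can also be used on the generating side. The sketch below builds the same sitemapindex document as the controller in the question, but lets XElement escape characters such as & in the URLs, which plain string concatenation does not do. This is not confirmed to be the cause of the Google message; SitemapIndexBuilder and Build are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SitemapIndexBuilder
{
    // Hypothetical helper: turns a list of sitemap URLs into a sitemapindex document.
    public static string Build(IEnumerable<string> pages)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

        var doc = new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement(ns + "sitemapindex",
                // XElement escapes special characters in the text content.
                pages.Select(p => new XElement(ns + "sitemap",
                    new XElement(ns + "loc", p)))));

        // XDocument.ToString() omits the XML declaration, so prepend it explicitly.
        return doc.Declaration + doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

In the controller from the question, the string building could then be replaced by return Content(SitemapIndexBuilder.Build(pages), "text/xml");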

Loading a ListView from file, only displays every other record

I have created a method to load a ListView from a text file. However, the problem is that my ListView is only displaying every other record. For some reason it only displays even numbered items...2,4,6,etc. I need it to display all the items. Any help would be greatly appreciated. Here is the method...
public partial class Admin : Form
{
const string ITEMSFILE = "items.dat";
const char DELIM = ',';
List<Item> itemList = new List<Item>();
public Admin()
{
InitializeComponent();
load2();
itemsListView.View = View.Details;
itemsListView.Columns.Add("ID");
itemsListView.Columns.Add("Item Name");
itemsListView.Columns.Add("Cost");
itemsListView.Columns.Add("Category");
itemsListView.AutoResizeColumns(ColumnHeaderAutoResizeStyle.ColumnContent);
itemsListView.AutoResizeColumns(ColumnHeaderAutoResizeStyle.HeaderSize);
}
public void load2()
{
using (FileStream file = new FileStream(ITEMSFILE, FileMode.Open, FileAccess.Read))
using (StreamReader reader = new StreamReader(file))
{
string recordIn;
string[] fields;
string itemline;
itemsListView.Items.Clear();
try
{
while ((itemline = reader.ReadLine()) != null)
{
recordIn = reader.ReadLine();
fields = recordIn.Split(DELIM);
itemsListView.Items.Add(new ListViewItem(new string[] { fields[0], fields[1], fields[2], fields[3] }));
}
}
catch (NullReferenceException)
{
//TODO
}
}
}
}
Here is the text files content....
1,Bud light,3.50,Drinks
2,Michelob ultra,3.50,Drinks
3,Heineken,4.00,Drinks
4,Miller lite,3.50,Drinks
5,Busch,2.50,Drinks
6,Pabst,2.50,Drinks
Example of faulty output...
UPDATE - REVISED WORKING CODE
Disappointed I was downvoted. Here is my revised working code...
using System;
using static System.Console;
using System.IO;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Collections;
namespace LoadList
{
public partial class Admin : Form
{
const string ITEMSFILE = "items.dat";
const char DELIM = ',';
List<Item> itemList = new List<Item>();
public Admin()
{
InitializeComponent();
load2();
}
public void load2()
{
using (FileStream file = new FileStream(ITEMSFILE, FileMode.Open, FileAccess.Read))
using (StreamReader reader = new StreamReader(file))
{
string recordIn;
string[] fields;
string itemline;
itemsListView.Items.Clear();
while (!reader.EndOfStream)
{
recordIn = reader.ReadLine();
fields = recordIn.Split(DELIM);
itemsListView.Items.Add(new ListViewItem(new string[] { fields[0], fields[1], fields[2], fields[3] }));
}
}
}
}
}

You are reading two lines:
...
while ((itemline = reader.ReadLine()) != null) <- 1st
{
recordIn = reader.ReadLine(); <- 2nd
...
Get rid of the 2nd one and use itemline instead of recordIn
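A minimal sketch of the corrected loop, separated from the form so it can be tried on its own. ItemFileReader and ReadRecords are hypothetical names; the only point is that ReadLine is called exactly once per iteration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ItemFileReader
{
    // Hypothetical helper: reads every line from the reader and splits it on the delimiter.
    public static List<string[]> ReadRecords(TextReader reader, char delim)
    {
        var records = new List<string[]>();
        string line;

        // One ReadLine per iteration: the line tested for null is the line that gets used.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                continue; // skip blank lines instead of failing on fields[1]

            records.Add(line.Split(delim));
        }
        return records;
    }
}
```

In load2() each returned string[] can be passed straight to new ListViewItem(fields).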

get title tag by html agility pack

I'm trying to use HtmlAgilityPack to get the links and titles of search results.
I have this code:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Xml;
namespace Search
{
public partial class Form1 : Form
{
// load snippet
HtmlAgilityPack.HtmlDocument htmlSnippet = new HtmlAgilityPack.HtmlDocument();
public Form1()
{
InitializeComponent();
}
private void btn1_Click(object sender, EventArgs e)
{
listBox1.Items.Clear();
StringBuilder sb = new StringBuilder();
byte[] ResultsBuffer = new byte[8192];
string SearchResults = "http://google.com/search?q=" + txtKeyWords.Text.Trim();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(SearchResults);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
count = resStream.Read(ResultsBuffer, 0, ResultsBuffer.Length);
if (count != 0)
{
tempString = Encoding.ASCII.GetString(ResultsBuffer, 0, count);
sb.Append(tempString);
}
}
while (count > 0);
string sbb = sb.ToString();
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.OptionOutputAsXml = true;
html.LoadHtml(sbb);
HtmlNode doc = html.DocumentNode;
foreach (HtmlNode link in doc.SelectNodes("//a[@href]"))
{
//HtmlAttribute att = link.Attributes["href"];
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!hrefValue.ToString().ToUpper().Contains("GOOGLE") && hrefValue.ToString().Contains("/url?q=") && hrefValue.ToString().ToUpper().Contains("HTTP://"))
{
int index = hrefValue.IndexOf("&");
if (index > 0)
{
hrefValue = hrefValue.Substring(0, index);
listBox1.Items.Add(hrefValue.Replace("/url?q=", ""));
}
}
}
}
}
}
This code returns the result links for a query. I want to get the title tag for each link too. How can I get the title for each link?
Can anybody help?

If, by 'title', you mean the displayed text of the link, then you can get it from InnerText property of each HtmlNode link :
foreach (HtmlNode link in doc.SelectNodes("//a[@href]"))
{
.....
var title = link.InnerText.Trim();
}

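To get the title of each linked page rather than the anchor text of the link, load each link's target and read its title element:
foreach (HtmlNode link in doc.SelectNodes("//a[@href]"))
{
// HtmlWeb.Load expects a URL string, so pass the href value rather than the node itself.
// The href must be an absolute URL for the page to load.
string url = link.GetAttributeValue("href", string.Empty);
HtmlWeb htmlWeb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDocument = htmlWeb.Load(url);
var title = htmlDocument.DocumentNode.SelectSingleNode("html/head/title").InnerText;
}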

c# regular expression for finding links in <a> with specific ending

I need a regex pattern for finding links in a string (with HTML code) to get the links with file endings like .gif or .png
Example String:
<a href="//site.com/folder/picture.png">picture.png</a>
For now I get everything between the " " and the text between the <a> and </a>.
I want to get this:
Href = //site.com/folder/picture.png
String = picture.png
My code so far:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Diagnostics;
using System.Drawing;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace downloader
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
string url = textBox1.Text;
string s = gethtmlcode(url);
foreach (LinkItem i in LinkFinder.Find(s))
{
richTextBox1.Text += Convert.ToString(i);
}
}
static string gethtmlcode(string url)
{
using (WebClient client = new WebClient())
{
string htmlCode = client.DownloadString(url);
return htmlCode;
}
}
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href + "\n\t" + Text + "\n\t";
}
}
static class LinkFinder
{
public static List<LinkItem> Find(string file)
{
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
RegexOptions.Singleline);
i.Text = t;
list.Add(i);
}
return list;
}
}
}
}

I can suggest using HtmlAgilityPack for this task. Install using Manage NuGet Packages for Solution menu, and add the following method:
/// <summary>
/// Collects a href attribute values and a node values if image extension is jpg or png
/// </summary>
/// <param name="html">HTML string or an URL</param>
/// <returns>A key-value pair list of href values and a node values</returns>
private List<KeyValuePair<string, string>> GetLinksWithHtmlAgilityPack(string html)
{
var result = new List<KeyValuePair<string, string>>();
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.SelectNodes("//a");
if (nodes != null)
foreach (var node in nodes)
if (Path.GetExtension(node.InnerText.Trim()).ToLower() == ".png" ||
Path.GetExtension(node.InnerText.Trim()).ToLower() == ".jpg")
result.Add(new KeyValuePair<string,string>(node.GetAttributeValue("href", null), node.InnerText));
return result;
}
Then, use it as (I am using a dummy string, just for demo)
var result = GetLinksWithHtmlAgilityPack("picture.pngpicture.bmp");
Output:
Or, with a URL, something like:
var result = GetLinksWithHtmlAgilityPack("http://www.google.com");
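The method above checks the extension of the link text. If the link text is not a file name, the same check could be applied to the href value instead. The sketch below is one way to do that under the assumption that a query string or fragment should be ignored; ImageLinkFilter and HasExtension are hypothetical names.

```csharp
using System;
using System.Linq;

public static class ImageLinkFilter
{
    // Hypothetical helper: true when the href's file name ends with one of the
    // given extensions (for example ".png"), compared case-insensitively.
    public static bool HasExtension(string href, params string[] extensions)
    {
        if (string.IsNullOrEmpty(href))
            return false;

        // Ignore the query string and fragment, if any.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        // Take the part after the last '/' and look for its last '.'.
        string file = path.Substring(path.LastIndexOf('/') + 1);
        int dot = file.LastIndexOf('.');
        if (dot < 0)
            return false;

        string ext = file.Substring(dot);
        return extensions.Any(e => string.Equals(e, ext, StringComparison.OrdinalIgnoreCase));
    }
}
```

With it, the condition inside the loop could become ImageLinkFilter.HasExtension(node.GetAttributeValue("href", null), ".png", ".gif").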
