Substring without breaking HTML in C#

I'm trying to take a description that was entered in a WYSIWYG editor and show only a substring of it, e.g.
This is some <span style="font-weight:bold;">text</span>
I'd like to limit the length of some descriptions without breaking the HTML. If I just take a substring and append "...", the cut can land inside a tag and break the markup.
I've tried:
string HtmlSubstring(string html, int maxlength)
{
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";
    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);
    int i = 0;
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        if (match.Value.Length == 1 && i < maxlength)
        {
            content.Append(match.Value);
            i++;
        }
        else if (match.Value.Length > 1)
        {
            content.Append(match.Value);
        }
    }
    return Regex.Replace(content.ToString(), emptytags, string.Empty);
}
but it doesn't quite get me there!

Use the HTML Agility Pack to load the HTML and then get InnerText.
var document = new HtmlDocument();
document.LoadHtml("...");
string text = document.DocumentNode.InnerText;
Also see C#: HtmlAgilityPack extract inner text
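InnerText drops the markup altogether. If the tags need to survive the cut, one alternative is to copy tags through unchanged, count only the text characters, and close whatever is still open at the cut point. The following is a minimal sketch of that idea, not a full HTML parser: HtmlTruncator and Truncate are hypothetical names, the input is assumed to be a reasonably well-formed fragment, and comments, attribute values containing '>', and entities such as &amp; are not handled specially.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
public static class HtmlTruncator
{
    // Elements that never take a closing tag, so they are not tracked as open.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        new[] { "br", "hr", "img", "input", "link", "meta" },
        StringComparer.OrdinalIgnoreCase);

    public static string Truncate(string html, int maxLength)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int visible = 0; // text characters copied so far (tags are not counted)
        int i = 0;

        while (i < html.Length && visible < maxLength)
        {
            if (html[i] != '<')
            {
                output.Append(html[i]);
                visible++;
                i++;
                continue;
            }

            int end = html.IndexOf('>', i);
            if (end < 0) break; // unterminated tag: stop rather than emit half of it

            // Copy the whole tag and work out its element name.
            string tag = html.Substring(i, end - i + 1);
            string name = tag.Trim('<', '>', '/').Split(' ', '\t', '\r', '\n')[0];
            output.Append(tag);
            i = end + 1;

            if (tag.StartsWith("</", StringComparison.Ordinal))
            {
                // Closing tag: pop only if it matches the innermost open element.
                if (openTags.Count > 0 &&
                    string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
                {
                    openTags.Pop();
                }
            }
            else if (!tag.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
            {
                openTags.Push(name);
            }
        }

        // Input left over means the text was cut short.
        if (i < html.Length) output.Append("...");

        // Close every element that was still open at the cut, innermost first.
        while (openTags.Count > 0)
        {
            output.Append("</").Append(openTags.Pop()).Append('>');
        }
        return output.ToString();
    }
}
```

A call such as HtmlTruncator.Truncate(description, 100) would then replace the plain Substring call. Note that the ellipsis is placed inside the innermost open element, and that it is also appended when only closing tags remain after the cut.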

Related

Extract HREF values from HTML string

I am trying to write a crawler that returns only the links from a website. So far it returns the page's HTML.
I now want to use an if statement to check that the string was returned and, if it was, search it for all <a> tags and show me each href value,
but I don't know which object to check or what value I should be checking for.
Here is what I have so far:
using System;

namespace crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Net.WebClient wc = new System.Net.WebClient();
            // Download the raw HTML of the page as a single string.
            string WebData = wc.DownloadString("https://www.abc.net.au/news/science/");
            Console.WriteLine(WebData);
            // if
        }
    }
}
You can have a look at the HTML Agility Pack.
With it you can find all links on a web page like this:
var hrefs = new List<string>();
var hw = new HtmlWeb();
HtmlDocument document = hw.Load(/* your url here */);
// SelectNodes returns null when nothing matches, so guard before iterating.
var links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute attribute = link.Attributes["href"];
        if (!string.IsNullOrWhiteSpace(attribute.Value))
            hrefs.Add(attribute.Value);
    }
}
First, you can write a function that returns the whole page's HTML, as you have done. Here is the one I use:
public string GetPageContents()
{
    string link = "https://www.abc.net.au/news/science/";
    string pageContent = "";
    WebClient web = new WebClient();
    // Disposing the stream and reader at the end of the using block closes both.
    using (Stream stream = web.OpenRead(link))
    using (StreamReader reader = new StreamReader(stream))
    {
        pageContent = reader.ReadToEnd();
    }
    return pageContent;
}
Then you could write a function that returns a list of substrings (if you wanted all <a> tags, for example, you would probably get more than one).
List<string> divTags = GetBetweenTags(pageContents, "<div>", "</div>");
This gives you a list in which you could, for example, run another search for <a> tags inside each of those <div> tags.
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
    // Lazy match of everything between the start and end tag.
    Regex rx = new Regex(startTag + "(.*?)" + endTag);
    MatchCollection col = rx.Matches(pageContents);
    List<string> tags = new List<string>();
    foreach (Match s in col)
        tags.Add(s.ToString());
    return tags;
}
Edit: Wow, I didn't know about the HTML Agility Pack. Thanks @Gauravsa, I'll update my project to use it!

Function to exclude specific tags from xml string

I have an XML string containing HTML tags such as <p>, <a>, <img>, <link>, etc.
I want to write a generic function that takes a list of HTML tags (or a single tag) to exclude from the XML string and returns the whole string without those tags.
public static readonly String[] htmlTags = new String[] { "p", "a", "img" };
string result = strString.ExcludeHTMLTags(htmlTags); // I will write the String extension myself; please suggest how to exclude the tags from the existing string.
EDIT:
I am trying the code below:
/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source, String[] htmlTags)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < source.Length; i++)
{
foreach (String htmlTag in htmlTags)
{
char let = source[i];
String tag = "<" + "htmlTag"; //How to handle this as this is character
if (let == tag)
{
inside = true;
continue;
}
if (let == '>')
{
inside = false;
continue;
}
if (!inside)
{
array[arrayIndex] = let;
arrayIndex++;
}
}
}
return new string(array, 0, arrayIndex);
}
EDIT 2: Using Regex
String[] htmlTags = new String[] { "a", "img", "p" };
private const string STR_RemoveHtmlTagRegex = "</?{0}[^<]*?>";
public static string RemoveHtmlTag(String input, String[] htmlTags)
{
String strResult = String.Empty;
foreach (String htmlTag in htmlTags)
{
Regex reg = new Regex(String.Format(STR_RemoveHtmlTagRegex, htmlTag.Trim()), RegexOptions.IgnoreCase);
strResult = reg.Replace(input, String.Empty);
input = strResult;
}
return strResult;
}
Now the problem is that it does not remove the content of the tag: if a tag wraps the text "Testing", the result still contains "Testing". I want to remove the whole element, content included.
Convert the HTML to a DOM tree and remove the element nodes whose names appear in the given list of excluded tags.
Have you tried the Html Agility Pack? It is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is built as a .NET code library that lets you parse "out of the web" HTML files, and you can fix a string the way you want, modify the DOM, add nodes, copy nodes, and so on.
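The DOM approach suggested above could look roughly like this with the Html Agility Pack. It is a sketch under assumptions: HtmlTagRemover and RemoveElements are hypothetical names, tag names are passed without angle brackets (for example "a", "img"), and the HtmlAgilityPack NuGet package is referenced. Removing the element node also removes its content, which is what the regex version leaves behind.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration.
public static class HtmlTagRemover
{
    public static string RemoveElements(string html, IEnumerable<string> tagNames)
    {
        // Case-insensitive lookup so "A" and "a" are treated alike.
        var names = new HashSet<string>(tagNames, StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the matches with ToList() first; removing nodes while
        // enumerating Descendants() would modify the tree being walked.
        var toRemove = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && names.Contains(n.Name))
            .ToList();

        foreach (HtmlNode node in toRemove)
        {
            node.Remove(); // drops the element together with its children
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

It would be called as HtmlTagRemover.RemoveElements(input, new[] { "a", "img" }). How the library serializes the remaining markup can vary between versions, so the output is worth checking against real input.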

how to use html agility pack to extract all url from html text

I usually extract file names from HTML text using regex, but I have heard that the HTML Agility Pack is good for parsing HTML. How can I use it to extract all URLs from HTML data? Can anyone guide me with sample code? Thanks.
This is my current regex-based code, which works fine.
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

private ArrayList GetFilesName(string Source)
{
    ArrayList arrayList = new ArrayList();
    // RegexOptions.IgnoreCase is the named form of the raw value 1.
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Skip absolute http:// URLs and keep only the file name of the rest.
        if (!match.Value.StartsWith("http://"))
        {
            arrayList.Add(Path.GetFileName(match.Value));
        }
    }
    return arrayList;
}

private string ReplaceSrc(string Source)
{
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Point every src at the local images/ folder.
        string value = match.Value;
        string str = string.Concat("images/", Path.GetFileName(value));
        Source = Source.Replace(value, str);
    }
    return Source;
}
Something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var images = doc.DocumentNode.Descendants("img")
.Where(i => i.GetAttributeValue("src", null) != null)
.Select(i => i.Attributes["src"].Value);
This selects all the <img> elements in the document that have a src attribute set and returns their URLs.
Select all img tags with non-empty src attribute (otherwise you will get NullReferenceException during getting attribute value):
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//img[@src!='']")
.Select(i => i.Attributes["src"].Value);

How to get table from Wikipedia

I want to put one table from Wikipedia into an XML file and then parse it in C#. Is it possible? If so, can I save only the Title and Genre columns to the XML?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");
You can use a web browser:
//First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();
//When page loaded
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
if (table.GetAttribute("className").Equals("wikitable"))
{
foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
{
int columncount = 1;
foreach (HtmlElement td in tr.GetElementsByTagName("td"))
{
//Title
if (columncount == 4)
{
Title.Add(td.InnerText);
}
//Genre
if (columncount == 7)
{
Genre.Add(td.InnerText);
}
columncount++;
}
}
}
}
Now you have two lists (Genre and Title).
You can then convert them to an XML file.
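One way to do that last step is LINQ to XML. The sketch below pairs the two lists by position and writes one element per film. FilmXml, Build, and the element names Films, Film, Title, and Genre are hypothetical choices, and it assumes the lists are aligned so that entry i of each list belongs to the same table row.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class FilmXml
{
    public static XDocument Build(IEnumerable<string> titles, IEnumerable<string> genres)
    {
        // Zip pairs the lists by position and stops at the shorter one.
        var films = titles.Zip(genres, (title, genre) =>
            new XElement("Film",
                new XElement("Title", title),
                new XElement("Genre", genre)));

        return new XDocument(new XElement("Films", films));
    }
}
```

Calling FilmXml.Build(Title, Genre).Save("films.xml") would write the file, where films.xml is a placeholder path. XElement escapes special characters in the text, so titles containing & or < are safe.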
You can use this code:
Find the HTML element you are interested in, then use a regular expression to parse the rest of its data.
This code looks for tables whose width is 150 and extracts all the link URLs inside them.
HtmlElementCollection links = webBrowser1.Document.GetElementsByTagName("table"); //get collection in link
{
foreach (HtmlElement link_data in links) //parse for each collection
{
String width = link_data.GetAttribute("width");
{
if (width != null && width == "150")
{
Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);
MatchCollection category_urls = linkX.Matches(link_data.OuterHtml);
if (category_urls.Count > 0)
{
foreach (Match match in category_urls)
{
//rest of the code
}
}
}
}
}
}
Also consider using the Wikipedia API to zero in on a particular section of a Wikipedia page:
https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext
The API documentation describes how you can format the search results for subsequent parsing.

how to get all html tags from html file in the list using regular expression

The file contains tags such as
<html><head></head><body><span class=style32></span>....
I want only the HTML tag names (i.e. span, head, body) in a list, with no duplicates.
Please help me, I'm new to regular expressions.
var tagList = new List<string>();
string pattern = @"(?<=</?)([^ >/]+)";
var matches = Regex.Matches(file, pattern);
for (int i = 0; i < matches.Count; i++)
{
tagList.Add(matches[i].ToString());
}
//to obtain non duplicate list
tagList = tagList.Distinct().ToList();
