Hi guys, I'm trying to take a description that was entered in a WYSIWYG editor and take a substring of it, e.g.:
This is some <span style="font-weight:bold;">text</span>
I'd like to limit the length of some descriptions without breaking the HTML. If I just take a substring and add "...", it breaks the HTML tags.
I've tried:
string HtmlSubstring(string html, int maxlength)
{
string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";
var expression = new Regex(string.Format("({0})|(.?)", htmltag));
MatchCollection matches = expression.Matches(html);
int i = 0;
StringBuilder content = new StringBuilder();
foreach (Match match in matches)
{
if (match.Value.Length == 1 && i < maxlength)
{
content.Append(match.Value);
i++;
}
else if (match.Value.Length > 1)
{
content.Append(match.Value);
}
}
return Regex.Replace(content.ToString(), emptytags, string.Empty);
}
but it doesn't quite get me there!
Use the HTML Agility Pack to load the HTML and then get InnerText.
var document = new HtmlDocument();
document.LoadHtml("...");
string text = document.DocumentNode.InnerText;
Also see C#: HtmlAgilityPack extract inner text
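To make the InnerText suggestion concrete, here is a minimal sketch of a truncating helper. The class name HtmlPreview and the method Truncate are hypothetical names chosen for illustration, and the sketch assumes the HtmlAgilityPack package is referenced. Note that it returns plain text, so formatting such as the bold span is dropped rather than preserved.

```csharp
using System;
using HtmlAgilityPack;

static class HtmlPreview
{
    // Hypothetical helper: returns at most maxLength characters of the visible text.
    public static string Truncate(string html, int maxLength)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // InnerText drops the markup; DeEntitize turns entities such as &amp; back into characters.
        string text = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);

        // Only append the ellipsis when something was actually cut off.
        return text.Length <= maxLength
            ? text
            : text.Substring(0, maxLength) + "...";
    }
}
```

For example, HtmlPreview.Truncate(description, 100) could be called wherever the shortened description is displayed.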
I am attempting to create a crawler that returns only the links from a website. So far it returns the page's HTML.
I now want to use an if statement to check that the string was returned and, if it was, search it for all <a> tags and show me each href link,
but I don't know what object to check or what value I should be checking for.
Here is what I have so far:
namespace crawler
{
class Program
{
static void Main(string[] args)
{
System.Net.WebClient wc = new System.Net.WebClient();
string WebData = wc.DownloadString("https://www.abc.net.au/news/science/");
Console.WriteLine(WebData);
// if
}
}
}
You can have a look at the HTML Agility Pack.
Then you can find all the links on a web page like this:
var hrefs = new List<string>();
var hw = new HtmlWeb();
HtmlDocument document = hw.Load(/* your url here */);
// "//a[@href]" selects every <a> element that has an href attribute.
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute attribute = link.Attributes["href"];
    if (!string.IsNullOrWhiteSpace(attribute.Value))
        hrefs.Add(attribute.Value);
}
First, you can write a function that returns the whole page's HTML, as you have done. Here is the one I have:
public string GetPageContents()
{
    string link = "https://www.abc.net.au/news/science/";
    string pageContent = "";
    WebClient web = new WebClient();
    Stream stream;
    stream = web.OpenRead(link);
    // Read the whole response body into a string.
    using (StreamReader reader = new StreamReader(stream))
    {
        pageContent = reader.ReadToEnd();
    }
    stream.Close();
    return pageContent;
}
Then you could write a function that returns a substring or a list of substrings (if you wanted all <a> tags, you would probably get more than one match).
List<string> divTags = GetBetweenTags(pageContents, "<div>", "</div>");
This would give you a list that you could then search again, for example for <a> tags inside each of those <div> tags.
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
    // Lazily match everything from startTag up to the next endTag.
    Regex rx = new Regex(startTag + "(.*?)" + endTag);
    MatchCollection col = rx.Matches(pageContents);
    List<string> tags = new List<string>();
    foreach (Match s in col)
        tags.Add(s.ToString());
    return tags;
}
Edit: Wow, I didn't know about the HTML Agility Pack. Thanks @Gauravsa, I'll update my project to use it!
I have been given an XML string containing HTML tags such as <p>, <a>, <img>, <link>, etc.
I want to write a generic function that takes a list of HTML tags (or a single tag) to exclude from the XML string and returns the whole string without those tags.
public static readonly String[] htmlTags = new String[] { "p", "a", "img" };
string result = strString.ExcludeHTMLTags(htmlTags); //I will write the String extension, not an issue; please suggest how to exclude tags from the existing string.
EDIT:
I am trying below code:
/// <summary>
/// Remove HTML tags from string using char array.
/// </summary>
public static string StripTagsCharArray(string source, String[] htmlTags)
{
char[] array = new char[source.Length];
int arrayIndex = 0;
bool inside = false;
for (int i = 0; i < source.Length; i++)
{
foreach (String htmlTag in htmlTags)
{
char let = source[i];
String tag = "<" + "htmlTag"; //How to handle this as this is character
if (let == tag)
{
inside = true;
continue;
}
if (let == '>')
{
inside = false;
continue;
}
if (!inside)
{
array[arrayIndex] = let;
arrayIndex++;
}
}
}
return new string(array, 0, arrayIndex);
}
EDIT 2: Using Regex
String[] htmlTags = new String[] { "a", "img", "p" };
private const string STR_RemoveHtmlTagRegex = "</?{0}[^<]*?>";
public static string RemoveHtmlTag(String input, String[] htmlTags)
{
String strResult = String.Empty;
foreach (String htmlTag in htmlTags)
{
Regex reg = new Regex(String.Format(STR_RemoveHtmlTagRegex, htmlTag.Trim()), RegexOptions.IgnoreCase);
strResult = reg.Replace(input, String.Empty);
input = strResult;
}
return strResult;
}
Now the problem is that it does not remove the tag's content: if a tag wraps the text "Testing", it still returns "Testing". I want to remove the whole tag, including its content.
Convert the HTML to a DOM tree and remove every element node whose name is in the given list of excluded tags.
Have you tried the Html Agility Pack? It is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that lets you parse "out of the web" HTML files, fix up a string the way you want, modify the DOM, add nodes, copy nodes, and so on.
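The DOM approach described above could be sketched as follows. This is only an illustration, assuming the HtmlAgilityPack package is referenced; TagRemover and RemoveElements are hypothetical names, not part of the library. Removing an element node also removes its children, so the enclosed text disappears along with the tag.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TagRemover
{
    // Hypothetical helper: removes every element whose name is in tagNames, content included.
    public static string RemoveElements(string html, params string[] tagNames)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Materialize the matches first so the tree is not modified while it is being enumerated.
        var doomed = document.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element &&
                        tagNames.Contains(n.Name, StringComparer.OrdinalIgnoreCase))
            .ToList();

        foreach (HtmlNode node in doomed)
        {
            node.Remove();
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

A call such as TagRemover.RemoveElements(input, "a", "img", "p") takes bare tag names, without angle brackets.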
I often extract file names from HTML text using regex, but I have heard that the Html Agility Pack is good for parsing HTML. How can I use the Html Agility Pack to extract all URLs from HTML data? Can anyone guide me with sample code? Thanks.
This is my code sample, which works fine.
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

private ArrayList GetFilesName(string Source)
{
    ArrayList arrayList = new ArrayList();
    // Capture the value of every src="..." attribute, ignoring case.
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Keep only the file name of sources that do not start with http://.
        if (!match.Value.StartsWith("http://"))
        {
            arrayList.Add(Path.GetFileName(match.Value));
        }
    }
    return arrayList;
}

private string ReplaceSrc(string Source)
{
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Point each source at the local images/ folder.
        string value = match.Value;
        string str = string.Concat("images/", Path.GetFileName(value));
        Source = Source.Replace(value, str);
    }
    return Source;
}
Something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var images = doc.DocumentNode.Descendants("img")
.Where(i => i.GetAttributeValue("src", null) != null)
.Select(i => i.Attributes["src"].Value);
This selects all the <img> elements in the document that have a src attribute set and returns their URLs.
Select all img tags with a non-empty src attribute (otherwise you will get a NullReferenceException when reading the attribute value):
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//img[@src!='']")
.Select(i => i.Attributes["src"].Value);
I want to put one table from Wikipedia into an XML file and then parse it in C#. Is that possible? If so, can I save only the Title and Genre columns to the XML?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");
You can use a web browser:
//First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();
//When page loaded
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
if (table.GetAttribute("className").Equals("wikitable"))
{
foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
{
int columncount = 1;
foreach (HtmlElement td in tr.GetElementsByTagName("td"))
{
//Title
if (columncount == 4)
{
Title.Add(td.InnerText);
}
//Genre
if (columncount == 7)
{
Genre.Add(td.InnerText);
}
columncount++;
}
}
}
}
Now you have two lists (genre and title).
You can simply convert them to an XML file.
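One way to do that conversion is with LINQ to XML, as in the sketch below. The class name FilmXml and the element names films, film, title and genre are assumptions chosen for illustration. The two lists are paired by position, so pairing stops at the end of the shorter list.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class FilmXml
{
    // Hypothetical helper: pairs each title with the genre at the same index.
    public static XDocument Build(IList<string> titles, IList<string> genres)
    {
        var films = titles.Zip(genres, (title, genre) =>
            new XElement("film",
                new XElement("title", title),
                new XElement("genre", genre)));

        return new XDocument(new XElement("films", films));
    }
}
```

With the lists from the loop above, FilmXml.Build(Title, Genre).Save("films.xml") would write the file, where films.xml is just an example file name.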
You can use this code:
Find the HTML tag you are interested in and use a regular expression to parse the rest of the data.
This code finds each table whose width is 150 and collects all the URLs / navigation URLs inside it.
HtmlElementCollection links = webBrowser1.Document.GetElementsByTagName("table"); //get collection in link
{
foreach (HtmlElement link_data in links) //parse for each collection
{
String width = link_data.GetAttribute("width");
{
if (width != null && width == "150")
{
Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);
MatchCollection category_urls = linkX.Matches(link_data.OuterHtml);
if (category_urls.Count > 0)
{
foreach (Match match in category_urls)
{
//rest of the code
}
}
}
}
}
}
Also consider looking at the Wikipedia API to zero in on a particular section of a Wikipedia page:
https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext
The API documentation describes how you can format the search results for subsequent parsing.
The file contains tags like this:
<html><head></head><body><span class=style32></span>....
I want only the HTML tag names (i.e. span, head, body) in a list, with no duplicates.
Please help me; I'm new to regular expressions.
var tagList = new List<string>();
// Match the tag name that follows "<" or "</".
string pattern = @"(?<=</?)([^ >/]+)";
var matches = Regex.Matches(file, pattern);
for (int i = 0; i < matches.Count; i++)
{
    tagList.Add(matches[i].ToString());
}
// Remove duplicates from the list.
tagList = tagList.Distinct().ToList();