HtmlAgilityPack parse text blocks - c#

I am making a small web analysis tool and need to somehow extract all the text blocks on a given url that contain more than X amount of words.
The method i currently use is this:
public string getAllText(string _html)
{
string _allText = "";
try
{
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(_html);
var root = document.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
_allText = sb.ToString();
}
catch (Exception)
{
}
_allText = System.Web.HttpUtility.HtmlDecode(_allText);
return _allText;
}
The problem here is that i get all text returned, even if its a meny text, a footer text with 3 words, etc.
I want to analyse the actual content on a page, so my idea is to somehow only parse the text that could be content (ie text blocks with more than X words)
Any ideas how this could be achieved?

Well, first approach can be a simple word count analisys on each node.InnerText value using string.Split function:
string[] words;
words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);
and append only text where words.Length is larger than 3.
Also see this question answer for some more tricks in raw text gathering.

Related

How to decode HTML into string?

I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.
For example we have this piece of HTML:
<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>
Tried regex solutions, HttpUtility.HtmlDecode method. And all of them give this output: Some textSome more text. Words get connected where they should be separate. Is there a way to decode string without merging words?
It's not clear what separator you wan between things that were not separated in the first place. So I used NewLine \n.
Where(x=>!string.IsNullOrWhiteSpace(x) will remove the empty element that will result in a lot of \n\n in more complex html doc
var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);
var result = string.Join(
"\n",
htmlDocument
.DocumentNode
.ChildNodes
.Select(x=> x.InnerText)
.Where(x=>!string.IsNullOrWhiteSpace(x))
);
Result:
"Some text\nSome more text"
easy way to do it is to use HTML Agility pack:
HtmlDocument htmlDocument= new HtmlDocument();
htmlDocument.Load(htmlString);
string res=htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTRESTING ELEMENT").InnerText
You can use something as follows. In this sample i have used new line to separate inner text, hope you can adapt this to suite your scenario.
public static string GetPlainTextFromHTML(string inputText)
{
// Extracted plain text
var plainText = string.Empty;
if(string.IsNullOrWhiteSpace(inputText))
{
return plainText;
}
var htmlNote = new HtmlDocument();
htmlNote.LoadHtml(inputText);
var nodes = htmlNote.DocumentNode.ChildNodes;
if(nodes == null)
{
return plainText;
}
StringBuilder innerString = new StringBuilder();
// Replace <p> with new lines
foreach (HtmlNode node in nodes)
{
innerString.Append(node.InnerText);
innerString.Append("\\n");
}
plainText = innerString.ToString();
return plainText;
}
You can use a regex : <(div|/div|br|p|/p)[^>]{0,}>

Get specific words from string c#

I am working on a final year project. I have a file that contain some text. I need to get words form this file that contain "//jj" tag. e.g abc//jj, bcd//jj etc.
suppose file is containing the following text
ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj
dsdsd sfsfhf//vv
dfdfdf
I need all the words that are associated with //jj tag. I am stuck here past few days.
My code that i am trying
// Create OpenFileDialog
Microsoft.Win32.OpenFileDialog dlg = new Microsoft.Win32.OpenFileDialog();
// Set filter for file extension and default file extension
dlg.DefaultExt = ".txt";
dlg.Filter = "Text documents (.txt)|*.txt";
// Display OpenFileDialog by calling ShowDialog method
Nullable<bool> result = dlg.ShowDialog();
// Get the selected file name and display in a TextBox
string filename = string.Empty;
if (result == true)
{
// Open document
filename = dlg.FileName;
FileNameTextBox.Text = filename;
}
string text;
using (var streamReader = new StreamReader(filename, Encoding.UTF8))
{
text = streamReader.ReadToEnd();
}
string FilteredText = string.Empty;
string pattern = #"(?<before>\w+) //jj (?<after>\w+)";
MatchCollection matches = Regex.Matches(text, pattern);
for (int i = 0; i < matches.Count; i++)
{
FilteredText="before:" + matches[i].Groups["before"].ToString();
//Console.WriteLine("after:" + matches[i].Groups["after"].ToString());
}
textbx.Text = FilteredText;
I cant find my result please help me.
With LINQ you could do this with one line:
string[] taggedwords = input.Split(' ').Where(x => x.EndsWith(#"//jj")).ToArray();
And all your //jj words will be there...
Personally I think Regex is overkill if that's definitely how the string will look. You haven't specified that you definitely need to use Regex so why not try this instead?
// A list that will hold the words ending with '//jj'
List<string> results = new List<string>();
// The text you provided
string input = #"ffafa adada//bb adad ssss//jj aad adad adadad aaada dsdsd//jj dsdsd sfsfhf//vv dfdfdf";
// Split the string on the space character to get each word
string[] words = input.Split(' ');
// Loop through each word
foreach (string word in words)
{
// Does it end with '//jj'?
if(word.EndsWith(#"//jj"))
{
// Yes, add to the list
results.Add(word);
}
}
// Show the results
foreach(string result in results)
{
MessageBox.Show(result);
}
Results are:
ssss//jj
dsdsd//jj
Obviously this is not quite as robust as a regex, but you didn't provide any more detail for me to go on.
You have an extra space in your regex, it assumes there's a space before "//jj". What you want is:
string pattern = #"(?<before>\w+)//jj (?<after>\w+)";
This regular expression will yield the words you are looking for:
string pattern = "(\\S*)\\/\\/jj"
A bit nicer without backslash escaping:
(\S*)\/\/jj
Matches will include the //jj but you can get the word from the first bracketed group.

Find & Replace Multiple Instagram Urls In A String Using C#

I want to find all the instagram urls within a string, and replace them with the embed url.
But I'm keen on performance, as this could be 5 to 20 posts each anything up to 6000 characters with an unknown amount of instagram urls in which need converting.
Url examples (Could be any of these in each string, so would need to match all)
http://instagram.com/p/xPnQ1ZIY2W/?modal=true
http://instagram.com/p/xPnQ1ZIY2W/
http://instagr.am/p/xPnQ1ZIY2W/
And this is what I need to replace them with (An embedded version)
<img src="http://instagram.com/p/xPnQ1ZIY2W/media/?size=l" class="instagramimage" />
I was thinking about going for regex? But is this the quickest and most performant way of doing this?
Any examples greatly appreciated.
Something like:
Regex reg = new Regex(#"http://instagr\.?am(?:\.com)?/\S*");
Edited regex. However i would combine this with a stringreader and do it line by line. Then put the string (modified or not) into a stringbuilder:
string original = #"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
StringBuilder result = new StringBuilder();
using (StringReader reader = new StringReader(original))
{
while (reader.Peek() > 0)
{
string line = reader.ReadLine();
if (reg.IsMatch(line))
{
string url = reg.Match(line).ToString();
result.AppendLine(reg.Replace(line,string.Format("<img src=\"{0}\" class=\"instagramimage\" />",url)));
}
else
{
result.AppendLine(line);
}
}
}
Console.WriteLine(result.ToString());
You mean like this?
class Program
{
private static Regex reg = new Regex(#"http://instagr\.?am(?:\.com)?/\S*", RegexOptions.Compiled);
private static Regex idRegex = new Regex(#"(?<=p/).*?(?=/)",RegexOptions.Compiled);
static void Main(string[] args)
{
string original = #"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
StringBuilder result = new StringBuilder();
using (StringReader reader = new StringReader(original))
{
while (reader.Peek() > 0)
{
string line = reader.ReadLine();
if (reg.IsMatch(line))
{
string url = reg.Match(line).ToString();
result.AppendLine(reg.Replace(line, string.Format("<img src=\"http://instagram.com/p/{0}/media/?size=1\" class=\"instagramimage\" />", idRegex.Match(url).ToString())));
}
else
{
result.AppendLine(line);
}
}
}
Console.WriteLine(result.ToString());
}
}
A well-crafted and compiled regular expression is hard to beat, especially since you're doing replacements, not just searching, but you should test to be sure.
If the Instagram URLs are only within HTML attributes, here's my first stab at a pattern to look for:
(?<=")(https?://instagr[^">]+)
(I added a check for https as well, which you didn't mention but I believe is supported by Instagram.)
Some false positives are theoretically possible, but it will perform better than pedantically matching every legal variation of an Instagram URL. (The ">" check is just in case the HTML is missing the end quote for some reason.)

Searching strings in txt file

I have a .txt file with a list of 174 different strings. Each string has an unique identifier.
For example:
123|this data is variable|
456|this data is variable|
789|so is this|
etc..
I wish to write a programe in C# that will read the .txt file and display only one of the 174 strings if I specify the ID of the string I want. This is because in the file I have all the data is variable so only the ID can be used to pull the string. So instead of ending up with the example about I get just one line.
eg just
123|this data is variable|
I seem to be able to write a programe that will pull just the ID from the .txt file and not the entire string or a program that mearly reads the whole file and displays it. But am yet to wirte on that does exactly what I need. HELP!
Well the actual string i get out from the txt file has no '|' they were just in the example. An example of the real string would be: 0111111(0010101) where the data in the brackets is variable. The brackets dont exsist in the real string either.
namespace String_reader
{
class Program
{
static void Main(string[] args)
{
String filepath = #"C:\my file name here";
string line;
if(File.Exists(filepath))
{
StreamReader file = null;
try
{
file = new StreamReader(filepath);
while ((line = file.ReadLine()) !=null)
{
string regMatch = "ID number here"; //this is where it all falls apart.
Regex.IsMatch (line, regMatch);
Console.WriteLine (line);// When program is run it just displays the whole .txt file
}
}
}
finally{
if (file !=null)
file.Close();
}
}
Console.ReadLine();
}
}
}
Use a Regex. Something along the lines of Regex.Match("|"+inputString+"|",#"\|[ ]*\d+\|(.+?)\|").Groups[1].Value
Oh, I almost forgot; you'll need to substitute the d+ for the actual index you want. Right now, that'll just get you the first one.
The "|" before and after the input string makes sure both the index and the value are enclosed in a | for all elements, including the first and last. There's ways of doing a Regex without it, but IMHO they just make your regex more complicated, and less readable.
Assuming you have path and id.
Console.WriteLine(File.ReadAllLines(path).Where(l => l.StartsWith(id + "|")).FirstOrDefault());
Use ReadLines to get a string array of lines then string split on the |
You could use Regex.Split method
FileInfo info = new FileInfo("filename.txt");
String[] lines = info.OpenText().ReadToEnd().Split(' ');
foreach(String line in lines)
{
int id = Convert.ToInt32(line.Split('|')[0]);
string text = Convert.ToInt32(line.Split('|')[1]);
}
Read the data into a string
Split the string on "|"
Read the items 2 by 2: key:value,key:value,...
Add them to a dictionary
Now you can easily find your string with dictionary[key].
first load the hole file to a string.
then try this:
string s = "123|this data is variable| 456|this data is also variable| 789|so is this|";
int index = s.IndexOf("123", 0);
string temp = s.Substring(index,s.Length-index);
string[] splitStr = temp.Split('|');
Console.WriteLine(splitStr[1]);
hope this is what you are looking for.
private static IEnumerable<string> ReadLines(string fspec)
{
using (var reader = new StreamReader(new FileStream(fspec, FileMode.Open, FileAccess.Read, FileShare.Read)))
{
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
}
var dict = ReadLines("input.txt")
.Select(s =>
{
var split = s.Split("|".ToArray(), 2);
return new {Id = Int32.Parse(split[0]), Text = split[1]};
})
.ToDictionary(kv => kv.Id, kv => kv.Text);
Please note that with .NET 4.0 you don't need the ReadLines function, because there is ReadLines
You can now work with that as any dictionary:
Console.WriteLine(dict[12]);
Console.WriteLine(dict[999]);
No error handling here, please add your own
You can use Split method to divide the entire text into parts sepparated by '|'. Then all even elements will correspond to numbers odd elements - to strings.
StreamReader sr = new StreamReader(filename);
string text = sr.ReadToEnd();
string[] data = text.Split('|');
Then convert certain data elements to numbers and strings, i.e. int[] IDs and string[] Strs. Find the index of the given ID with idx = Array.FindIndex(IDs, ID.Equals) and the corresponding string will be Strs[idx]
List <int> IDs;
List <string> Strs;
for (int i = 0; i < data.Length - 1; i += 2)
{
IDs.Add(int.Parse(data[i]));
Strs.Add(data[i + 1]);
}
idx = Array.FindIndex(IDs, ID.Equals); // we get ID from input
answer = Strs[idx];

How can I strip HTML tags from a string in ASP.NET?

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
Go download HTMLAgilityPack, now! ;) Download LInk
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in c# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages are, among better performance the ability to replace named and numbered HTML entities (those like &amp; and &203;) and comment blocks replacement and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NETs XML reader is an option. This can fail on well formatted HTML though so always add a catch with regx as a backup. Note this is NOT fast, but it does provide a nice opportunity for old school step through debugging.
public static string RemoveHTMLTags(string content)
{
var cleaned = string.Empty;
try
{
StringBuilder textOnly = new StringBuilder();
using (var reader = XmlNodeReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Text)
textOnly.Append(reader.ReadContentAsString());
}
}
cleaned = textOnly.ToString();
}
catch
{
//A tag is probably not closed. fallback to regex string clean.
string textOnly = string.Empty;
Regex tagRemove = new Regex(#"<[^>]*(>|$)");
Regex compressSpaces = new Regex(#"[\s\r\n]+");
textOnly = tagRemove.Replace(content, string.Empty);
textOnly = compressSpaces.Replace(textOnly, " ");
cleaned = textOnly;
}
return cleaned;
}
string result = Regex.Replace(anytext, #"<(.|\n)*?>", string.Empty);
I've looked at the Regex based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break, let alone mal-formmed HTML from the wild. And what about entities like &? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an html fragment. Also decodes HTML entities like &. Returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
if (String.IsNullOrEmpty(html))
return html;
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
{
return WebUtility.HtmlDecode(html);
}
var sb = new StringBuilder();
var i = 0;
foreach (var node in doc.DocumentNode.ChildNodes)
{
var text = node.InnerText.SafeTrim();
if (!String.IsNullOrEmpty(text))
{
sb.Append(text);
if (i < doc.DocumentNode.ChildNodes.Count - 1)
{
sb.Append(Environment.NewLine);
}
}
i++;
}
var result = sb.ToString();
return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
if (str == null)
return null;
return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerability & how to prevent it - i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes > etc).
For those who are complining about Michael Tiptop's solution not working, here is the .Net4+ way of doing it:
public static string StripTags(this string markup)
{
try
{
StringReader sr = new StringReader(markup);
XPathDocument doc;
using (XmlReader xr = XmlReader.Create(sr,
new XmlReaderSettings()
{
ConformanceLevel = ConformanceLevel.Fragment
// for multiple roots
}))
{
doc = new XPathDocument(xr);
}
return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
// XmlDocument or JavaScript's innerText
}
catch
{
return string.Empty;
}
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP to get the text out of a HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter,i.e. keep some tags, you may need some code like this by using HTMLagilityPack:
public string StripTags(HtmlNode documentNode, IList keepTags)
{
var result = new StringBuilder();
foreach (var childNode in documentNode.ChildNodes)
{
if (childNode.Name.ToLower() == "#text")
{
result.Append(childNode.InnerText);
}
else
{
if (!keepTags.Contains(childNode.Name.ToLower()))
{
result.Append(StripTags(childNode, keepTags));
}
else
{
result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
}
}
}
return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
Simply use string.StripHTML();

Categories

Resources