I'm attempting to use the HTMLAgilityPack to retrieve and edit the inner text of some HTML. The inner text of each node I retrieve needs to be checked for matching strings, and those matching strings need to be highlighted, like so:
var HtmlDoc = new HtmlDocument();
HtmlDoc.LoadHtml(item.Content);
var nodes = HtmlDoc.DocumentNode.SelectNodes("//div[@class='guide_subtitle_cell']/p");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(Methods.HighlightWords(htmlNode.InnerText, searchstring)), htmlNode);
}
This is the code for the HighlightWords method I use:
public static string HighlightWords(string input, string searchstring)
{
if (input == null || searchstring == null)
{
return input;
}
var lowerstring = searchstring.ToLower();
var words = lowerstring.Split(' ').ToList();
for (var i = 0; i < words.Count; i++)
{
Match m = Regex.Match(input, words[i], RegexOptions.IgnoreCase);
if (m.Success)
{
string ReplaceWord = string.Format("<span class='search_highlight'>{0}</span>", m.Value);
input = Regex.Replace(input, words[i], ReplaceWord, RegexOptions.IgnoreCase);
}
}
return input;
}
Can anyone suggest how to get this working or indicate what I'm doing wrong?
The problem is that HtmlTextNode.CreateNode can only create one node. When you add a <span> inside, that's another node, and CreateNode throws the exception you see.
Make sure that you are only doing a search and replace on the lowest leaf nodes (nodes with no children). Then rebuild each such node as follows:
1. Create a new empty node to replace the old one.
2. Search for the text in .InnerText.
3. Use HtmlTextNode.Create to add the plain text before the text you want to highlight.
4. Add your new <span> with the highlighted text using HtmlNode.CreateNode.
5. Search for the next occurrence (start back at 1) until no more occurrences are found.
Your function HighlightWords must be returning multiple top-level HTML nodes. For example:
<p>foo</p>
<span>bar</span>
The HtmlAgilityPack only allows one top-level node to be returned. You can hardcode the return value for HighlightWords to test.
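A minimal sketch of one way around the single-top-level-node limit, assuming it is acceptable to wrap the highlighted markup in one <p> element (the nodes being replaced in the question are <p> elements). The class and method names here are hypothetical, not part of HtmlAgilityPack. The sketch also escapes each search word and does the replacement in a single pass, so later words cannot match inside the <span> markup that was just inserted:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HighlightSketch
{
    // Hypothetical helper: returns markup with exactly one top-level <p> element.
    public static string HighlightAsSingleNode(string input, string searchstring)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrWhiteSpace(searchstring))
        {
            return "<p>" + input + "</p>";
        }

        // Escape each word so characters like '.' or '(' are matched literally.
        var words = searchstring
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);

        // One alternation pattern means the input is scanned only once.
        var pattern = string.Join("|", words);

        var highlighted = Regex.Replace(
            input,
            pattern,
            m => "<span class='search_highlight'>" + m.Value + "</span>",
            RegexOptions.IgnoreCase);

        return "<p>" + highlighted + "</p>";
    }
}
```

The returned string has a single root element, so it could be passed to the CreateNode call from the question in place of the raw HighlightWords result.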
Also, this post has run across the same problem.
Related
I have some xml files that look like sample file
I want to remove invalid xref nodes from it but keep the contents of those nodes as it is.
An xref node is valid if the value of its rid attribute exactly matches the id attribute of some node anywhere in the file, so the output for the sample above should look something like sample output file
The code I've written thus far is below
XDocument doc=XDocument.Load(@"D:\sample\sample.xml",LoadOptions.None);
var ids = from a in doc.Descendants()
where a.Attribute("id") !=null
select a.Attribute("id").Value;
var xrefs=from x in doc.Descendants("xref")
where x.Attribute("rid")!=null
select x.Attribute("rid").Value;
if (ids.Any() && xrefs.Any())
{
foreach(var xref in xrefs)
{
if (!ids.Contains(xref))
{
string content= File.ReadAllText(@"D:\sample\sample.xml");
string result=Regex.Replace(content,"<xref ref-type=\"[^\"]+\" rid=\""+xref+"\">(.*?)</xref>","$1");
File.WriteAllText(@"D:\sample\sample.xml",result);
}
}
Console.WriteLine("complete");
}
else
{
Console.WriteLine("No value found");
}
Console.ReadLine();
The problem is that when the values of xref contain characters like ., *, ( etc., they need to be escaped properly for the regex replace, or the replace can mess up the file.
Does anyone have a better solution to the problem?
You don't need regex to do this. Instead use element.ReplaceWith(element.Nodes()) to replace node with its children. Sample code:
XDocument doc = XDocument.Load(@"D:\sample\sample.xml", LoadOptions.None);
// use HashSet, since you only use it for lookups
var ids = new HashSet<string>(from a in doc.Descendants()
where a.Attribute("id") != null
select a.Attribute("id").Value);
// select both element itself (for update), and value of "rid"
var xrefs = from x in doc.Descendants("xref")
where x.Attribute("rid") != null
select new { element = x, rid = x.Attribute("rid").Value };
if (ids.Any()) {
var toUpdate = new List<XElement>();
foreach (var xref in xrefs) {
if (!ids.Contains(xref.rid)) {
toUpdate.Add(xref.element);
}
}
if (toUpdate.Count > 0) {
foreach (var xref in toUpdate) {
// replace with contents
xref.ReplaceWith(xref.Nodes());
}
doc.Save(@"D:\sample\sample.xml");
}
}
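The same ReplaceWith(Nodes()) idea can be tried without touching a file. The following is a small in-memory sketch; the class name, method name, and sample XML are made up for illustration and are not taken from the asker's files:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XrefSketch
{
    // Hypothetical helper: unwraps every xref whose rid matches no id in the document.
    public static string UnwrapInvalidXrefs(string xml)
    {
        var doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        // Collect all id values once; HashSet gives fast lookups.
        var ids = new HashSet<string>(
            doc.Descendants().Attributes("id").Select(a => a.Value));

        // Take a snapshot with ToList() so the tree can be modified safely afterwards.
        var invalid = doc.Descendants("xref")
            .Where(x => x.Attribute("rid") != null
                        && !ids.Contains(x.Attribute("rid").Value))
            .ToList();

        foreach (var xref in invalid)
        {
            // Keep the contents, drop the xref wrapper.
            xref.ReplaceWith(xref.Nodes());
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the rid values are compared as plain strings, characters such as ., * or ( need no escaping here.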
I have an Index view where I would like to show a list of news articles. The Text property is a string containing HTML that comes from an HTML editor; since the HTML content can be really long, I would like to show only the first <p> element.
I am doing that:
public ActionResult Index()
{
var articles = db.Articles.ToList().Select(a => new{Title = a.Title,
Tags = a.Tags,
Id = a.Id,
Text = (System.Xml.Linq.XDocument.Parse(a.Text).Descendants("p").FirstOrDefault())
}).ToList();
return View(articles);
}
But the HTML string has no root node, so the LINQ query throws an exception. How can I handle this case?
Thanks in advance for any suggestion
It might be a shorthand solution, but should wrapping your xml in a root node not fix the problem?
System.Xml.Linq.XDocument.Parse(
String.Format("<myRootNode>{0}</myRootNode>" , a.Text)
)
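A minimal sketch of the wrapping approach, assuming the editor output is well-formed XML; markup such as an unclosed <br> or an undeclared entity like &nbsp; would still make the parser throw. The class, method, and wrapper element names are hypothetical:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FirstParagraphSketch
{
    // Hypothetical helper: returns the first <p> element as markup, or null if there is none.
    public static string FirstParagraph(string fragment)
    {
        // Wrap the fragment so the parser sees exactly one root element.
        var wrapped = XElement.Parse("<root>" + fragment + "</root>");

        var firstP = wrapped.Descendants("p").FirstOrDefault();
        return firstP == null
            ? null
            : firstP.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, FirstParagraph("<p>One</p><p>Two</p>") returns "<p>One</p>".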
You can do it by using regex
static String GetTheFirstPElement(String rawHtml)
{
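// Singleline lets .*? span line breaks inside a paragraph
Regex myRegex = new Regex(@"(<p[^>]*>.*?</p>)", RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection matches = myRegex.Matches(rawHtml);
// Cast<Match>() (System.Linq) is needed where MatchCollection is only a non-generic IEnumerable
var firstMatch = matches.Cast<Match>().FirstOrDefault();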
return firstMatch != null ? firstMatch.Value : null ;
}
I need to go from a list like this:
/home
/home/room1
/home/room1/subroom
/home/room2
/home/room2/miniroom
/home/room2/bigroom
/home/room2/hugeroom
/home/room3
to an xml file. I've tried using LINQ to XML to do this but I just end up getting confused and not sure what to do from there. Any help is much appreciated!
Edit:
I want the XML file to look something like this:
<home>
<room1>
<subroom>This is a subroom</subroom>
</room1>
<room2>
<miniroom>This is a miniroom</miniroom>
<bigroom>This is a bigroom</bigroom>
<hugeroom>This is a hugeroom</hugeroom>
</room2>
<room3></room3>
</home>
The text inside the tags ("This is a subroom", etc.) is optional, but would be really nice to have!
Ok buddy, here's a solution.
Couple of notes and explanation.
Your text structure can be split up into lines and then again by the slashes into the names of the XML nodes. If you think of the text in this way, you get a list of "lines" broken into a list of names.
/home
First of all, the first line, /home, is the root of the XML; we can get rid of it and just create an XDocument object with that name as the root element:
var xDoc = new XDocument(new XElement("home"));
Of course we don't want to hard code things but this is just an example. Now, on to the real work:
/home/room1/
/home/room1/bigroom
etc...
as a List<T> then it will look like this
myList = new List<List<string>>();
... [ add the items ]
myList[0][0] = home
myList[0][1] = room1
myList[1][0] = home
myList[1][1] = room1
myList[1][2] = bigroom
So what we can do to get the above structure is use string.Split() multiple times to break your text first into lines, then into parts of each line, and end up with a multidimensional array-style List<T> that contains List<T> objects, in this case, List<List<string>>.
First let's create the container object:
var possibleNodes = new List<List<string>>();
Next, we should split the lines. Let's call the variable that holds the text, "text".
var splitLines = text
.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.ToList();
This gives us a List but our lines are still not broken up. Let's split them again by the slash (/) character. This is where we build our node names. We can do this in a ForEach and just add to our list of possible nodes:
splitLines.ForEach(l =>
possibleNodes.Add(l
.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
.ToList()
)
);
Now, we need to know the DEPTH of the XML. Your text shows that there will be 3 nodes of depth. The node depth is the maximum depth of any one given line of nodes, now stored in the List<List<string>>; we can use the .Max() method to get this:
var nodeDepth = possibleNodes.Max(n => n.Count);
A final setup step: We don't need the first line, because it's just "home" and it will be our root node. We can just create an XDocument object and give it this first line to use as the name of Root:
// Create the root node
XDocument xDoc = new XDocument(new XElement(possibleNodes[0][0]));
// We don't need it anymore
possibleNodes.RemoveAt(0);
Ok, here is where the real work happens, let me explain the rules:
We need to loop through the outer list, and through each inner list.
We can use the list indexes to understand which node to add to or which names to ignore
We need to keep hierarchy proper and not duplicate nodes, and some XLinq helps here
The loops - see the comments for a detailed explanation:
// This gets us looping through the outer nodes
for (var i = 0; i < possibleNodes.Count; i++)
{
// Here we go "sideways" by going through each inner list (each broken down line of the text)
for (var ii = 1; ii < nodeDepth; ii++)
{
// Some lines have more depth than others, so we have to check this here since we are looping on the maximum
if (ii < possibleNodes[i].Count)
{
// Let's see if this node already exists
var existingNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii]));
// Let's also see if a parent node was created in the previous loop iteration.
// This will tell us whether to add the current node at the root level, or under another node
var parentNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii - 1]));
// If the current node has already been added, we do nothing (this if statement is not entered into)
// Otherwise, existingNode will be null and that means we need to add the current node
if (null == existingNode)
{
// Now, use parentNode to decide where to add the current node
if (null == parentNode)
{
// The parent node does not exist; therefore, the current node will be added to the root node.
xDoc.Root.Add(new XElement(possibleNodes[i][ii]));
}
else
{
// There IS a parent node for this node!
// Therefore, we must add the current node to the parent node
// (remember, parent node is the previous iteration of the inner for loop on nodeDepth )
var newNode = new XElement(possibleNodes[i][ii]);
parentNode.Add(newNode);
// Add "this is a" text (bonus!) -- only adding this text if the current node is the last one in the list.
if (possibleNodes[i].Count -1 == ii)
{
newNode.Add(new XText("This is a " + newNode.Name.LocalName));
}
}
}
}
}
}
The bonus here is this code will work with any number of nodes and build your XML.
To check it, XDocument has a nifty overridden .ToString() implementation that just spits out all of the XML it is holding, so all you do is this:
Console.Write(xDoc.ToString());
And, you'll get this result:
(Note I added a test node to make sure it works with more than 3 levels)
Below, you will find the entire program with your test text, etc, as a working solution:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
namespace XmlFromTextString
{
class Program
{
static void Main(string[] args)
{
// This simulates text from a file; note that it must be flush to the left of the screen or else the extra spaces
// add unneeded nodes to the lists that are generated; for simplicity of code, I chose not to implement clean-up of that and just
// ensure that the string literal is not indented from the left of the Visual Studio screen.
string text =
@"/home
/home/room1
/home/room1/subroom
/home/room2
/home/room2/miniroom
/home/room2/test/thetest
/home/room2/bigroom
/home/room2/hugeroom
/home/room3";
var possibleNodes = new List<List<string>>();
var splitLines = text
.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.ToList();
splitLines.ForEach(l =>
possibleNodes.Add(l
.Split(new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries)
.ToList()
)
);
var nodeDepth = possibleNodes.Max(n => n.Count);
// Create the root node
XDocument xDoc = new XDocument(new XElement(possibleNodes[0][0]));
// We don't need it anymore
possibleNodes.RemoveAt(0);
// This gets us looping through the outer nodes
for (var i = 0; i < possibleNodes.Count; i++)
{
// Here we go "sideways" by going through each inner list (each broken down line of the text)
for (var ii = 1; ii < nodeDepth; ii++)
{
// Some lines have more depth than others, so we have to check this here since we are looping on the maximum
if (ii < possibleNodes[i].Count)
{
// Let's see if this node already exists
var existingNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii]));
// Let's also see if a parent node was created in the previous loop iteration.
// This will tell us whether to add the current node at the root level, or under another node
var parentNode = xDoc.Root.Descendants().FirstOrDefault(d => d.Name.LocalName == (possibleNodes[i][ii - 1]));
// If the current node has already been added, we do nothing (this if statement is not entered into)
// Otherwise, existingNode will be null and that means we need to add the current node
if (null == existingNode)
{
// Now, use parentNode to decide where to add the current node
if (null == parentNode)
{
// The parent node does not exist; therefore, the current node will be added to the root node.
xDoc.Root.Add(new XElement(possibleNodes[i][ii]));
}
else
{
// There IS a parent node for this node!
// Therefore, we must add the current node to the parent node
// (remember, parent node is the previous iteration of the inner for loop on nodeDepth )
var newNode = new XElement(possibleNodes[i][ii]);
parentNode.Add(newNode);
// Add "this is a" text (bonus!) -- only adding this text if the current node is the last one in the list.
if (possibleNodes[i].Count -1 == ii)
{
newNode.Add(new XText("This is a " + newNode.Name.LocalName));
// For the same default text on all child-less nodes, use this:
// newNode.Add(new XText("This is default text"));
}
}
}
}
}
}
Console.Write(xDoc.ToString());
Console.ReadKey();
}
}
}
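One thing to be aware of: the loops above look nodes up by name anywhere under the root, so they rely on every node name being unique in the tree. As a hedged alternative, here is a sketch that walks each path from the root and only looks among the direct children at each step. It assumes every path starts with the same first segment; the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PathTreeSketch
{
    // Hypothetical helper: builds one element tree from slash-separated paths.
    public static XElement Build(IEnumerable<string> paths)
    {
        XElement root = null;

        foreach (var path in paths)
        {
            var parts = path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0)
            {
                continue; // skip blank lines
            }

            // Assumption: the first segment is the same on every line.
            if (root == null)
            {
                root = new XElement(parts[0]);
            }

            // Walk down from the root, creating missing children on the way.
            var current = root;
            foreach (var name in parts.Skip(1))
            {
                var child = current.Element(name);
                if (child == null)
                {
                    child = new XElement(name);
                    current.Add(child);
                }
                current = child;
            }
        }

        return root;
    }
}
```

With this shape, two rooms that share a child name (for example /home/room1/closet and /home/room2/closet) each get their own child element.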
Time for LINQ magic?
// load file into string[]
var input = File.ReadAllLines("TextFile1.txt");
// in case you have more than one home in your file
var homes =
new XDocument(
new XElement("root",
from line in input
let items = line.Split(new[] { "/" }, StringSplitOptions.RemoveEmptyEntries)
group items by items[0] into g
select new XElement(g.Key,
from rooms in g.OrderBy(x => x.Length).Skip(1)
group rooms by rooms[1] into g2
select new XElement(g2.Key,
from name in g2.OrderBy(x => x.Length).Skip(1)
select new XElement(name[2], string.Format("This is a {0}", name[2]))))));
// get the right home
var home = new XDocument(homes.Root.Element("home"));
UPDATED: I still have this problem, better explanation.
I have a list of XElements and I'm iterating through them to check whether each one matches a regex pattern. If there's a match, I need to replace the value of the current element without affecting its child elements.
For example,
<root>{REGEX:#Here}<child>Element</child> more content</root>
In that case, I need to replace {REGEX:#Here}, which is directly under the root element but is not a child element! If I use:
string newValue = xElement.ToString();
if(ReplaceRegex(ref newValue))
xElement.ReplaceAll(newValue);
I'm losing the child elements, and the tags get converted to &lt;child&gt;Element in the value.
If I use:
xElement.SetValue(newValue);
The value of the xElement will be,
"{REGEX:Replaced} Element more content"
thus losing child elements as well.
What can I do to replace the value that will keep the child elements and work if the regex pattern is under the root element or child elements.
PS: I will add the regex function here for understanding purpose
private bool ReplaceRegex(ref string text)
{
bool match = false;
Regex linkRegex = new Regex(@"\{XPath:.*?\}", System.Text.RegularExpressions.RegexOptions.Multiline);
Match m = linkRegex.Match(text);
while (m.Success)
{
match = true;
string substring = m.Value;
string xpath = substring.Replace("{XPath:", string.Empty).Replace("}", string.Empty);
object temp = this.Container.Data.XPathEvaluate(xpath);
text = text.Replace(substring, Utility.XPathResultToString(temp));
m = m.NextMatch();
}
return match;
}
private void ReplaceRegex(XElement xElement)
{
if(xElement.HasElements)
{
foreach (XElement subElement in xElement.Elements())
this.ReplaceRegex(subElement);
}
foreach(var node in xElement.Nodes().OfType<XText>())
{
string value = node.Value;
if(this.ReplaceRegex(ref value))
node.Value = value;
}
}
EDIT :
Regarding your mixed-content comment, edited the code to take care of text nodes. See if it works.
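To see the text-node approach in isolation, here is a small self-contained sketch. It swaps the XPath lookup for a fixed replacement string, and the class name, method name, and pattern are assumptions made for the demo:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class MixedContentSketch
{
    // Hypothetical helper: rewrites only text nodes, so child elements stay in place.
    public static void ReplaceInTextNodes(XElement element, Regex pattern, string replacement)
    {
        // Snapshot the text nodes first, then edit each one in place.
        foreach (var text in element.DescendantNodes().OfType<XText>().ToList())
        {
            text.Value = pattern.Replace(text.Value, replacement);
        }
    }

    public static void Main()
    {
        var root = XElement.Parse(
            "<root>{REGEX:Here}<child>Element</child> more content</root>");

        ReplaceInTextNodes(root, new Regex(@"\{REGEX:.*?\}"), "Replaced");

        // The <child> element is still present after the replacement.
        Console.WriteLine(root.ToString(SaveOptions.DisableFormatting));
    }
}
```

DescendantNodes() covers text at every depth, so no explicit recursion is needed in this variant.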
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
From this code, I get a bunch of h2 and h3-elements. In these, I'd like to insert an ID-attribute, with the value equal to (the content in the header, minus special chars and ToLower()). I also need this value as a separate string, as I need to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
string idValue = StripTextValue(match.Groups[4].Value);
returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?
There are many recommendations against using regex to parse HTML, so you could use e.g. Html Agility Pack for this:
var html = @"<h2>Some sort of header!</h2>";
HtmlDocument document= new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
foreach (HtmlNode header in headers)
{
var innerText = header.InnerText;
var idValue = StripTextValue(innerText);
if (header.Attributes["id"] != null)
{
header.Attributes["id"].Value = idValue;
}
else
{
header.Attributes.Add("id", idValue);
}
}
}
This code finds all the <h2> and <h3> elements in the given document, reads the inner text of each, and sets (or adds) an id attribute on it.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
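StripTextValue is never shown in the question, so the following is only a guess at what it might look like: a sketch that lowercases the text, collapses every run of characters outside a-z and 0-9 into a hyphen, and prefixes "#", which matches the sample output above. The containing class name is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderIdSketch
{
    // Hypothetical implementation of the StripTextValue helper used above.
    public static string StripTextValue(string text)
    {
        var lowered = (text ?? string.Empty).ToLowerInvariant();

        // Replace each run of non-alphanumeric characters with a single hyphen,
        // then drop hyphens left over at either end.
        var slug = Regex.Replace(lowered, "[^a-z0-9]+", "-").Trim('-');

        return "#" + slug;
    }
}
```

For "Some sort of header!" this returns "#some-sort-of-header". To keep both values for later use, the loop could add idValue and innerText to a Dictionary<string, string>, as in the question.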