I have an Index view where I would like show a list of news article, the Text property is a string which contains a html string coming from a html editor; now the html content could be really long, so I would like show only the first <p> element.
I am doing that:
public ActionResult Index()
{
var articles = db.Articles.ToList().Select(a => new{Title = a.Title,
Tags = a.Tags,
Id = a.Id,
Text = (System.Xml.Linq.XDocument.Parse(a.Text).Descendants("p").FirstOrDefault())
}).ToList();
return View(articles);
}
But in the html string there is not a root node, so the Linq query fall in exception, How I can manage this case?
Thanks in advance for any suggestion
It might be a shorthand solution, but should wrapping your xml in a root node not fix the problem?
System.Xml.Linq.XDocument.Parse(
String.Format("<myRootNode>{0}</myRootNode>" , a.Text)
)
You can do it by using regex
static String GetTheFirstPElement(String rawHtml)
{
Regex myRegex = new Regex(#"(<p[^>]*>.*?</p>)", RegexOptions.IgnoreCase);
MatchCollection matches = myRegex.Matches(rawHtml);
var firstMatch = matches.FirstOrDefault() ;
return firstMatch != null ? firstMatch.Value : null ;
}
Related
I'm attempting to use the HTMLAgilityPack to get retrieve and edit inner text of some HTML. The inner text of each node i retrieve needs to be checked for matching strings and those matching strings to be highlighted like so:
var HtmlDoc = new HtmlDocument();
HtmlDoc.LoadHtml(item.Content);
var nodes = HtmlDoc.DocumentNode.SelectNodes("//div[#class='guide_subtitle_cell']/p");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(Methods.HighlightWords(htmlNode.InnerText, searchstring)), htmlNode);
}
This is the code for the HighlightWords method I use:
public static string HighlightWords(string input, string searchstring)
{
if (input == null || searchstring == null)
{
return input;
}
var lowerstring = searchstring.ToLower();
var words = lowerstring.Split(' ').ToList();
for (var i = 0; i < words.Count; i++)
{
Match m = Regex.Match(input, words[i], RegexOptions.IgnoreCase);
if (m.Success)
{
string ReplaceWord = string.Format("<span class='search_highlight'>{0}</span>", m.Value);
input = Regex.Replace(input, words[i], ReplaceWord, RegexOptions.IgnoreCase);
}
}
return input;
}
Can anyone suggest how to get this working or indicate what i'm doing wrong?
The problem is that HtmlTextNode.CreateNode can only create one node. When you add a <span> inside, that's another node, and CreateNode throws the exception you see.
Make sure that you are only doing a search and replace on the lowest leaf nodes (nodes with no children). Then rebuild that node by:
Create a new empty node to replace the old one
Search for the text in .InnerText
Use HtmlTextNode.Create to add the plain text before the text you want to highlight
Then add your new <span> with the highlighted text with HtmlNode.CreateNode
Then search for the next occurrence (start back at 1) until no more occurrences are found.
Your function HighlightWords must be returning multiple top-level HTML nodes. For example:
<p>foo</p>
<span>bar</span>
The HtmlAgilityPack only allows one top-level node to be returned. You can hardcode the return value for HighlightWords to test.
Also, this post has run across the same problem.
I have an XElement that looks like this:
<User ID="11" Name="Juan Diaz" LoginName="DN1\jdiaz" xmlns="http://schemas.microsoft.com/sharepoint/soap/directory/" />
How can I use XML to extract the value of the LoginName attribute? I tried the following, but the q2 "Enumeration yielded no results".
var q2 = from node in el.Descendants("User")
let loginName = node.Attribute(ns + "LoginName")
select new { LoginName = (loginName != null) };
foreach (var node in q2)
{
Console.WriteLine("LoginName={0}", node.LoginName);
}
var xml = #"<User ID=""11""
Name=""Juan Diaz""
LoginName=""DN1\jdiaz""
xmlns=""http://schemas.microsoft.com/sharepoint/soap/directory/"" />";
var user = XElement.Parse(xml);
var login = user.Attribute("LoginName").Value; // "DN1\jdiaz"
XmlDocument doc = new XmlDocument();
doc.Load("myFile.xml"); //load your xml file
XmlNode user = doc.getElementByTagName("User"); //find node by tag name
string login = user.Attributes["LoginName"] != null ? user.Attributes["LoginName"].Value : "unknown login";
The last line of code, where it's setting the string login, the format looks like this...
var variable = condition ? A : B;
It's basically saying that if condition is true, variable equals A, otherwise variable equals B.
from the docs for XAttribute.Value:
If you are getting the value and the attribute might not exist, it is more convenient to use the explicit conversion operators, and assign the attribute to a nullable type such as string or Nullable<T> of Int32. If the attribute does not exist, then the nullable type is set to null.
I ended up using string manipulation to get the value, so I'll post that code, but I would still like to see an XML approach if there is one.
string strEl = el.ToString();
string[] words = strEl.Split(' ');
foreach (string word in words)
{
if (word.StartsWith("LoginName"))
{
strEl = word;
int first = strEl.IndexOf("\"");
int last = strEl.LastIndexOf("\"");
string str2 = strEl.Substring(first + 1, last - first - 1);
//str2 = "dn1\jdiaz"
}
}
string htmlHeaderPattern = ("(<h[2|3])>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
From this code, I get a bunch of h2 and h3-elements. In these, I'd like to insert an ID-attribute, with the value equal to (the content in the header, minus special chars and ToLower()). I also need this value as a separate string, as I need to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
string idValue = StripTextValue(match.Groups[4].Value);
returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?
There are a lot of mentions regarding not to use regex when parsing HTML, so you could use e.g. Html Agility Pack for this:
var html = #"<h2>Some sort of header!</h2>";
HtmlDocument document= new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
foreach (HtmlNode header in headers)
{
var innerText = header.InnerText;
var idValue = StripTextValue(innerText);
if (header.Attributes["id"] != null)
{
header.Attributes["id"].Value = idValue;
}
else
{
header.Attributes.Add("id", idValue);
}
}
}
This code finds all the <h2> and <h3> elements in the document passed, gets inner text from there and setting(or adding) id attributes to them.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
I am developing a Windows Forms application which is interacting with a web site.
Using a WebBrowser control I am controlling the web site and I can iterate through the tags using:
HtmlDocument webDoc1 = this.webBrowser1.Document;
HtmlElementCollection aTags = webDoc1.GetElementsByTagName("a");
Now, I want to get a particular text from the tag which is below:
Show Assigned<br>
Like here I want to get the number 244 which is equal to assignedto in above tag and save it into a variable for further use.
How can I do this?
You can try splitting a string by ';' values, and then each string by '=' like this:
string aTag = ...;
foreach(var splitted in aTag.Split(';'))
{
if(splitted.Contains("="))
{
var leftSide = splitted.Split('=')[0];
var rightSide = splitted.Split('=')[1];
if(leftSide == "assignedto")
{
MessageBox.Show(rightSide); //It should be 244
//Or...
int num = int.Parse(rightSide);
}
}
}
Other option is to use Regexes, which you can test here: www.regextester.com. And some more info on regexes: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
Hope it helps!
If all cases are similar to this and you don't mind a reference to System.Web in your Windows Forms application, tou can do something like this:
using System;
public class Program
{
static void Main()
{
string href = #"issue?status=-1,1,2,3,4,5,6,7&
#sort=-activity&#search_text=&#dispname=Show Assigned&
#filter=status,assignedto&#group=priority&
#columns=id,activity,title,creator,status&assignedto=244&
#pagesize=50&#startwith=0";
href = System.Web.HttpUtility.HtmlDecode(href);
var querystring = System.Web.HttpUtility.ParseQueryString(href);
Console.WriteLine(querystring["assignedto"]);
}
}
This is a simplified example and first you need to extract the href attribute text, but that should not be complex. Having the href attribute text you can take advantage that is basically a querystring and reuse code in .NET that already parses query strings.
To complete the example, to obtain the href attribute text you could do:
HtmlElementCollection aTags = webBrowser.Document.GetElementsByTagName("a");
foreach (HtmlElement element in aTags)
{
string href = element.GetAttribute("href");
}
I have a linq expression that I've been playing with in LINQPad and I would like to refactor the expression to replace all the tests for idx == -1 with a single test. The input data for this is the result of a free text search on a database used for caching Active Directory info. The search returns a list of display names and associated summary data from the matching database rows. I want to extract from that list the display name and the matching Active Directory entry. Sometimes the match will only occur on the display name so there may be no further context. In the example below, the string "Sausage" is intended to be the search term that returned the two items in the matches array. Clearly this wouldn't be the case for a real search because there is no match for Sausage in the second array item.
var matches = new []
{
new { displayName = "Sausage Roll", summary = "|Title: Network Coordinator|Location: Best Avoided|Department: Coordination|Email: Sausage.Roll#somewhere.com|" },
new { displayName = "Hamburger Pattie", summary = "|Title: Network Development Engineer|Location: |Department: Planning|Email: Hamburger.Pattie#somewhere.com|" },
};
var context = (from match in matches
let summary = match.summary
let idx = summary.IndexOf("Sausage")
let start = idx == -1 ? 0 : summary.LastIndexOf('|', idx) + 1
let stop = idx == -1 ? 0 : summary.IndexOf('|', idx)
let ctx = idx == -1 ? "" : string.Format("...{0}...", summary.Substring(start, stop - start))
select new { displayName = match.displayName, summary = ctx, })
.Dump();
I'm trying to create a list of names and some context for the search results if any exists. The output below is indicative of what Dump() displays and is the correct result:
displayName summary
---------------- ------------------------------------------
Sausage Roll ...Email: Sausage.Roll#somewhere.com...
Hamburger Pattie
Edit: Regex version is below, definitely tidier:
Regex reg = new Regex(#"\|((?:[^|]*)Sausage[^|]*)\|");
var context = (from match in matches
let m = reg.Match(match.summary)
let ctx = m.Success ? string.Format("...{0}...", m.Groups[1].Value) : ""
select new { displayName = match.displayName, context = ctx, })
.Dump();
(I know this doesn't answer your specific question), but here's my contribution anyway:
You haven't really described how your data comes in. As #Joe suggested, you could use a regex or split the fields as I've done below.
Either way I would suggested refactoring your code to allow unit testing.
Otherwise if your data is invalid / corrupt whatever, you will get a runtime error in your linq query.
[TestMethod]
public void TestMethod1()
{
var matches = new[]
{
new { displayName = "Sausage Roll", summary = "|Title: Network Coordinator|Location: Best Avoided|Department: Coordination|Email: Sausage.Roll#somewhere.com|" },
new { displayName = "Hamburger Pattie", summary = "|Title: Network Development Engineer|Location: |Department: Planning|Email: Hamburger.Pattie#somewhere.com|" },
};
IList<Person> persons = new List<Person>();
foreach (var m in matches)
{
string[] fields = m.summary.Split('|');
persons.Add(new Person { displayName = m.displayName, Title = fields[1], Location = fields[2], Department = fields[3] });
}
Assert.AreEqual(2, persons.Count());
}
public class Person
{
public string displayName { get; set; }
public string Title { get; set; }
public string Location { get; set; }
public string Department { get; set; }
/* etc. */
}
Or something like this:
Regex reg = new Regex(#"^|Email.*|$");
foreach (var match in matches)
{
System.Console.WriteLine(match.displayName + " ..." + reg.Match(match.summary) + "... ");
}
I haven't tested this, probably not even correct syntax but just to give you an idea of how you could do it with regex.
Update
Ok, i've seen your answer and it's good that you posted it because I think i didn't explain it clearly.
I expected your answer to look something like this at the end (tested using LINQPad now, and now i understand what you mean by using LINQPad because it actually does run a C# program not just linq commands, awesome!) Anyway this is what it should look like:
foreach (var match in matches)
Console.WriteLine(string.Format("{0,-20}...{1}...", match.displayName, Regex.Match(match.summary, #"Email:(.*)[|]").Groups[1]));
}
That's it, the whole thing, take linq out of it, completely!
I hope this clears it up, you do not need linq at all.
like this?
var context = (from match in matches
let summary = match.summary
let idx = summary.IndexOf("Sausage")
let test=idx == -1
let start =test ? 0 : summary.LastIndexOf('|', idx) + 1
let stop = test ? 0 : summary.IndexOf('|', idx)
let ctx = test ? "" : string.Format("...{0}...", summary.Substring(start, stop - start))
select new { displayName = match.displayName, summary = ctx, })
.Dump();