string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
This code gives me all the h2 and h3 elements. Into each of these I'd like to insert an id attribute whose value is derived from the header's content (special characters removed, then ToLower()). I also need that value as a separate string, because I have to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
string idValue = StripTextValue(match.Groups[4].Value);
returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?
Parsing HTML with regex is widely discouraged, so you could use a parser such as Html Agility Pack for this instead:
var html = @"<h2>Some sort of header!</h2>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
foreach (HtmlNode header in headers)
{
var innerText = header.InnerText;
var idValue = StripTextValue(innerText);
if (header.Attributes["id"] != null)
{
header.Attributes["id"].Value = idValue;
}
else
{
header.Attributes.Add("id", idValue);
}
}
}
This code finds all the <h2> and <h3> elements in the document, reads the inner text of each one, and sets (or adds) its id attribute.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
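If you would still rather stay with regex for simple, attribute-free headers, a MatchEvaluator lets you build the replacement and fill the dictionary in one pass. The sketch below is only an illustration: the names HeaderIds and AddHeaderIds are made up, and this StripTextValue is a guess at what your own helper does (lower-case, non-alphanumeric runs turned into hyphens, "#" prefix as in your example output).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Hypothetical stand-in for StripTextValue: lower-case the text, turn
    // every run of non-alphanumeric characters into one hyphen, trim hyphens.
    public static string StripTextValue(string text)
    {
        string slug = Regex.Replace(text.ToLowerInvariant(), "[^a-z0-9]+", "-");
        return "#" + slug.Trim('-');
    }

    // Adds an id attribute to every <h2>/<h3> and records id -> header text.
    public static string AddHeaderIds(string html, IDictionary<string, string> headers)
    {
        // Group 1 = tag name, group 2 = header text.
        // Assumes headers without attributes that do not span several lines.
        const string pattern = "<(h[23])>(.*?)</\\1>";
        return Regex.Replace(html, pattern, match =>
        {
            string tag = match.Groups[1].Value;
            string text = match.Groups[2].Value;
            string id = StripTextValue(text);
            headers[id] = text; // a repeated header overwrites the earlier entry
            return "<" + tag + " id=\"" + id + "\">" + text + "</" + tag + ">";
        });
    }
}
```

Calling HeaderIds.AddHeaderIds(mainBody, returnValue) returns the rewritten markup and leaves the id/text pairs in returnValue. Headers that carry attributes or nested tags are exactly where this approach starts to break down and the parser becomes the safer choice.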
I'm attempting to use the HtmlAgilityPack to retrieve and edit the inner text of some HTML. The inner text of each node I retrieve needs to be checked for matching strings, and those matching strings need to be highlighted, like so:
var HtmlDoc = new HtmlDocument();
HtmlDoc.LoadHtml(item.Content);
var nodes = HtmlDoc.DocumentNode.SelectNodes("//div[@class='guide_subtitle_cell']/p");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(Methods.HighlightWords(htmlNode.InnerText, searchstring)), htmlNode);
}
This is the code for the HighlightWords method I use:
public static string HighlightWords(string input, string searchstring)
{
if (input == null || searchstring == null)
{
return input;
}
var lowerstring = searchstring.ToLower();
var words = lowerstring.Split(' ').ToList();
for (var i = 0; i < words.Count; i++)
{
Match m = Regex.Match(input, words[i], RegexOptions.IgnoreCase);
if (m.Success)
{
string ReplaceWord = string.Format("<span class='search_highlight'>{0}</span>", m.Value);
input = Regex.Replace(input, words[i], ReplaceWord, RegexOptions.IgnoreCase);
}
}
return input;
}
Can anyone suggest how to get this working or indicate what I'm doing wrong?
The problem is that HtmlTextNode.CreateNode can only create one node. When you add a <span> inside, that's another node, and CreateNode throws the exception you see.
Make sure that you only search and replace on the lowest leaf nodes (nodes with no children). Then rebuild each such node as follows:
1. Create a new empty node to replace the old one.
2. Search for the text in .InnerText.
3. Use HtmlTextNode.CreateNode to add the plain text that precedes the text you want to highlight.
4. Add your new <span> containing the highlighted text with HtmlNode.CreateNode.
5. Search for the next occurrence (start back at 1) until no more occurrences are found.
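The search part of those steps can be sketched without touching the DOM at all: split the node's text into alternating plain and matched segments, then create one node per segment. The code below is only an illustration under a few assumptions: HighlightSplitter and Split are hypothetical names, the input text is non-null, and each search word is matched literally (escaped with Regex.Escape, which the original HighlightWords does not do).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HighlightSplitter
{
    // Splits text into (segment, isMatch) pairs in their original order.
    public static List<KeyValuePair<string, bool>> Split(string text, string searchString)
    {
        var segments = new List<KeyValuePair<string, bool>>();
        string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0)
        {
            // Nothing to search for: the whole text is one plain segment.
            segments.Add(new KeyValuePair<string, bool>(text, false));
            return segments;
        }

        // Escape each word so characters like '.' or '(' are matched literally.
        string pattern = string.Join("|", Array.ConvertAll(words, w => Regex.Escape(w)));
        int position = 0;
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            // Plain text between the previous match and this one.
            if (m.Index > position)
                segments.Add(new KeyValuePair<string, bool>(text.Substring(position, m.Index - position), false));
            segments.Add(new KeyValuePair<string, bool>(m.Value, true));
            position = m.Index + m.Length;
        }

        // Trailing plain text after the last match.
        if (position < text.Length)
            segments.Add(new KeyValuePair<string, bool>(text.Substring(position), false));
        return segments;
    }
}
```

Each segment flagged false would become a text node and each segment flagged true a <span class='search_highlight'> element appended to the replacement node, so no single CreateNode call ever has to produce more than one node.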
Your function HighlightWords must be returning multiple top-level HTML nodes. For example:
<p>foo</p>
<span>bar</span>
The HtmlAgilityPack only allows one top-level node to be returned. You can hardcode the return value for HighlightWords to test.
Also, this post has run across the same problem.
I have a webpage. If I look at the "view-source" of the page, I find multiple instances of the following statement:
<td class="my_class" itemprop="main_item">statement 1</td>
<td class="my_class" itemprop="main_item">statement 2</td>
<td class="my_class" itemprop="main_item">statement 3</td>
I want to extract data like this:
statement 1
statement 2
statement 3
To accomplish this, I have written a method "GetContent" which takes a URL as its parameter and copies the entire source of the webpage into a C# string.
private string GetContent(string url)
{
HttpWebResponse response = null;
StreamReader respStream = null;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Timeout = 100000;
response = (HttpWebResponse)request.GetResponse();
respStream = new StreamReader(response.GetResponseStream());
return respStream.ReadToEnd();
}
Now I want to create a method "GetMyList" which will extract the list I want. I am searching for the possible regex which can serve my purpose. Any help is highly appreciated.
Using the Html Agility Pack, this would be really easy:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
//var nodes = doc.DocumentNode.SelectNodes("//td//text()");
var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
var list = new List<string>();
foreach (var m in nodes)
{
list.Add(m.InnerText);
}
But if you want regex, try this:
string regularExpressionPattern1 = @"<td.*?>(.*?)<\/td>";
Regex regex = new Regex(regularExpressionPattern1, RegexOptions.Singleline);
MatchCollection collection = regex.Matches(html.ToString());
var list = new List<string>();
foreach (Match m in collection)
{
list.Add( m.Groups[1].Value);
}
Hossein's answer is pretty much the solution (and I would recommend using a parser if you have the option), but a regular expression with non-capturing parentheses (?:...) also gives you the extracted data, statement 1 or statement 2, as you need it:
IEnumerable<string> GetMyList(string str)
{
foreach (Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
yield return m.Groups[1].Value;
}
See Explanation at regex101 for a more detailed description.
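Both patterns above match every <td> on the page. If only the cells marked with itemprop="main_item" are wanted, the attribute can be made part of the pattern. This is a rough sketch rather than a robust solution: CellExtractor is a made-up name, and the pattern assumes the attribute is written with double quotes exactly as in the question's markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class CellExtractor
{
    public static List<string> GetMyList(string html)
    {
        // Matches only <td> tags carrying itemprop="main_item"; group 1 is the cell content.
        const string pattern = "<td\\b[^>]*\\bitemprop=\"main_item\"[^>]*>(.*?)</td>";
        var list = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            // Decode entities such as &amp; and drop surrounding whitespace.
            list.Add(WebUtility.HtmlDecode(m.Groups[1].Value).Trim());
        }
        return list;
    }
}
```

Passing the string returned by GetContent(url) to CellExtractor.GetMyList yields one entry per matching cell. Nested tables or tags inside the cell would still need a real parser.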
I have an Index view where I would like to show a list of news articles. The Text property is a string containing HTML that comes from an HTML editor. Since the HTML content can be really long, I would like to show only the first <p> element.
This is what I am doing:
public ActionResult Index()
{
var articles = db.Articles.ToList().Select(a => new{Title = a.Title,
Tags = a.Tags,
Id = a.Id,
Text = (System.Xml.Linq.XDocument.Parse(a.Text).Descendants("p").FirstOrDefault())
}).ToList();
return View(articles);
}
But the HTML string has no root node, so the LINQ query throws an exception. How can I handle this case?
Thanks in advance for any suggestions.
It might be a shorthand solution, but should wrapping your xml in a root node not fix the problem?
System.Xml.Linq.XDocument.Parse(
String.Format("<myRootNode>{0}</myRootNode>" , a.Text)
)
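Put together as a helper, the idea might look like the sketch below. ArticlePreview and FirstParagraph are hypothetical names, and the approach only works when the editor's output happens to be well-formed XML: an unclosed <br> or an entity such as &nbsp; makes XElement.Parse throw an XmlException.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ArticlePreview
{
    // Returns the first <p> element as markup, or null when there is none.
    // Assumes the fragment is well-formed XML once wrapped in a root element.
    public static string FirstParagraph(string htmlFragment)
    {
        XElement root = XElement.Parse("<root>" + htmlFragment + "</root>");
        // Element names are case-sensitive here, so <P> would not be found.
        XElement first = root.Descendants("p").FirstOrDefault();
        return first == null ? null : first.ToString(SaveOptions.DisableFormatting);
    }
}
```

In the Index action, Text = ArticlePreview.FirstParagraph(a.Text) would then replace the direct XDocument.Parse call.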
You can do it by using regex
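static String GetTheFirstPElement(String rawHtml)
{
// \b keeps <pre> from matching; Singleline lets '.' span line breaks inside a paragraph.
Regex myRegex = new Regex(@"(<p\b[^>]*>.*?</p>)", RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Match returns the first occurrence directly, so no MatchCollection is needed.
Match firstMatch = myRegex.Match(rawHtml);
return firstMatch.Success ? firstMatch.Value : null;
}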
I am trying to read in POST data to an ASPX (c#) page. I have got the post data now inside a string. I am now wondering if this is the best way to use it. Using the code here (http://stackoverflow.com/questions/10386534/using-request-getbufferlessinputstream-correctly-for-post-data-c-sharp) I have the following string
<callback variable1="foo1" variable2="foo2" variable3="foo3" />
As this is now in a string, I am splitting based on a space.
string[] pairs = theResponse.Split(' ');
Dictionary<string, string> results = new Dictionary<string, string>();
foreach (string pair in pairs)
{
string[] paramvalue = pair.Split('=');
results.Add(paramvalue[0], paramvalue[1]);
Debug.WriteLine(paramvalue[0].ToString());
}
The trouble comes when a value has a space in it. For example, variable3="foo 3" upsets the code.
Is there a better way to parse the incoming HTTP POST variables within the string?
You might want to treat it as XML directly:
// just use 'theResponse' here instead
var xml = "<callback variable1=\"foo1\" variable2=\"foo2\" variable3=\"foo3\" />";
// once inside an XElement you can get all the values
var ele = XElement.Parse(xml);
// an example of getting the attributes out
var values = ele.Attributes().Select(att => new { Name = att.Name, Value = att.Value });
// or print them
foreach (var attr in ele.Attributes())
{
Console.WriteLine("{0} - {1}", attr.Name, attr.Value);
}
Of course you can change that last line to whatever you want, the above is a rough example.
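If you want the same Dictionary<string, string> your original loop was building, the attributes can be projected straight into one. A minimal sketch, with CallbackParser and ParseAttributes as made-up names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CallbackParser
{
    // Maps each attribute name of the root element to its value.
    public static Dictionary<string, string> ParseAttributes(string xml)
    {
        return XElement.Parse(xml)
            .Attributes()
            .ToDictionary(a => a.Name.LocalName, a => a.Value);
    }
}
```

Because the XML parser handles the quoting, a value such as variable3="foo 3" comes through intact, which is the case that broke the Split(' ') approach.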
Given the (specimen - real markup may be considerably more complicated) markup and constraints listed below, could anyone propose a solution (C#) more effective/efficient than walking the whole tree to retrieve { "##value1##", "##value2##", "##value3##" }, i.e. a list of tokens that are going to be replaced when the markup is actually used.
Note: I have no control over the markup, structure of the markup or format/naming of the tokens that are being replaced.
<markup>
<element1 attributea="blah">##value1##</element1>
<element2>##value2##</element2>
<element3>
<element3point1>##value1##</element3point1>
<element3point2>##value3##</element3point2>
<element3point3>apple</element3point3>
</element3>
<element4>pear</element4>
</markup>
How about:
var keys = new HashSet<string>();
Regex.Replace(input, "##[^#]+##", match => {
keys.Add(match.Value);
return ""; // doesn't matter
});
foreach (string key in keys) {
Console.WriteLine(key);
}
This:
doesn't bother parsing the xml (just string manipulation)
only includes the unique values (no need to return a MatchCollection with the duplicates we don't want)
However, it may build a larger string, so maybe just Matches:
var matches = Regex.Matches(input, "##[^#]+##");
var result = matches.Cast<Match>().Select(m => m.Value).Distinct();
foreach (string s in result) {
Console.WriteLine(s);
}
I wrote a quick program using your sample; this should do the trick.
class Program
{
//I just copied your stuff to Test.xml
static void Main(string[] args)
{
XDocument doc = XDocument.Load("Test.xml");
var verbs=new Dictionary<string,string>();
//Add the values to replace here
verbs.Add("##value3##", "mango");
verbs.Add("##value1##", "potato");
ReplaceStuff(verbs, doc.Root.Elements());
doc.Save("Test2.xml");
}
//A simple recursive replace method
static void ReplaceStuff(Dictionary<string,string> verbs,IEnumerable<XElement> elements)
{
foreach (var e in elements)
{
if (e.Elements().Count() > 0)
ReplaceStuff(verbs, e.Elements() );
else
{
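// Look up the trimmed text, so the key that is checked is also the key that is used.
string replacement;
if (verbs.TryGetValue(e.Value.Trim(), out replacement))
e.Value = replacement;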
}
}
}
}