I have a string like this:
<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>
This is a templating use case. Each query will be replaced by a different value (i.e. its SQL result). Is it possible to do this with the Regex.Replace method?
The solution I'm thinking of is to use Regex.Match in a first pass to collect all the matches, and then string.Replace in a second pass to replace them one by one. Is there a better way to solve this?
var source =
@"<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>";
var result = Regex.Replace(
source,
"(?<=<query>).*?(?=</query>)",
match => Sql.Execute(match.Value));
Sql.Execute is a placeholder for whatever logic you use to execute your query. Its return value replaces the original contents of each <query>…</query> element.
If you want the query tags removed as well, match them too and use a named capture group rather than lookarounds:
var result = Regex.Replace(
source,
"<query>(?<q>.*?)</query>",
match => Sql.Execute(match.Groups["q"].Value));
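A self-contained sketch of the capture-group variant, for trying it out without a database. The QueryTemplate class, its Render method, and the canned-results table are assumptions made for illustration; Execute stands in for the Sql.Execute placeholder above and is not a real API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTemplate
{
    // Hypothetical stand-in for Sql.Execute: looks the query up in a table of canned results.
    private static readonly Dictionary<string, string> FakeResults =
        new Dictionary<string, string>
        {
            { "select * from table1", "result one" },
            { "select * from table2", "result two" },
        };

    public static string Execute(string query)
    {
        string result;
        // Unknown queries are replaced by an empty string in this sketch.
        return FakeResults.TryGetValue(query.Trim(), out result) ? result : string.Empty;
    }

    // Replaces every <query>...</query> element, tags included, with the result of its query.
    public static string Render(string source)
    {
        return Regex.Replace(
            source,
            "<query>(?<q>.*?)</query>",
            match => Execute(match.Groups["q"].Value),
            RegexOptions.Singleline); // lets a query span several lines
    }
}
```

Calling QueryTemplate.Render(source) on the string from the question should leave the <div> wrappers in place and put each canned result where its <query> element was. The evaluator runs once per match, so a single pass is enough and no separate Match-then-Replace step is needed.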
You could use Html Agility Pack to select the query tags first and then replace their inner text with whatever you want:
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach(var node in queries)
{
if(node.InnerText=="select * from table1")
{
// InnerText has no setter, so assign through InnerHtml
node.InnerHtml="your result";
}
}
You could also use a dictionary that maps each query (the key) to its replacement (the value):
var dict = new Dictionary<string, string>();
dict.Add("select * from table1","your result");
//...
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach(var node in queries)
{
if(dict.ContainsKey(node.InnerText))
{
// InnerText has no setter, so assign through InnerHtml
node.InnerHtml=dict[node.InnerText];
}
}
Regex is a poor fit for parsing HTML in general, but you don't need to parse HTML here; you only need what is inside the <query>xxx</query> pattern.
The rest of the document doesn't matter, since (according to your question) you don't want to traverse, validate or change it.
So, in this particular case, I would use a regex rather than an HTML parser:
var pattern = "<query>.+?</query>";
Then replace every match with the string Replace method.
I have a string something like this:
<BU Name="xyz" SerialNo="3838383" impression="jdhfl87lkjh8937ljk" />
I want to extract values like this:
Name = xyz
SerialNo = 3838383
impression = jdhfl87lkjh8937ljk
How to get these values in C#?
I am using C# 3.5.
If for some reason you don't want to use an XML parser, you can use a regular expression to achieve this.
Use this regular expression:
(\w+)="(\w+)"
Use it like this:
var input = @"<BU Name=""xyz"" SerialNo=""3838383"" impression=""jdhfl87lkjh8937ljk"" />";
var pattern = @"(\w+)=""(\w+)""";
var result = Regex.Matches(input, pattern);
foreach (var match in result.Cast<Match>())
{
// Groups[1] is the attribute name, Groups[2] its value
Console.WriteLine("{0} = {1}", match.Groups[1].Value, match.Groups[2].Value);
}
Result:
//Name = xyz
//SerialNo = 3838383
//impression = jdhfl87lkjh8937ljk
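For comparison, here is a minimal sketch of the XML parser route that the answer mentions as the alternative. It assumes the input is a single well-formed XML element, and uses LINQ to XML (available from .NET 3.5). The AttributeReader class and its Read method are names made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AttributeReader
{
    // Parses a single well-formed element and returns its attributes as name/value pairs.
    public static Dictionary<string, string> Read(string element)
    {
        return XElement.Parse(element)
            .Attributes()
            .ToDictionary(a => a.Name.LocalName, a => a.Value);
    }
}
```

AttributeReader.Read(input)["Name"] would then give xyz. Unlike the \w+ pattern, this also copes with values that contain spaces or punctuation, but it throws if the input is not well-formed XML.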
Users enter content in a text editor, and it is then submitted to the database.
Before storing it in the database, I want to remove the empty lines at the beginning and end of the content (those in the middle must not be removed).
I want to use JavaScript and C#.
Sample content:
<div>
<p><span><br></span></p>
<span>a<br/>bc</span>
<p>te<br>st</p>
<p>\n<span>\n</span></p>
<p><span><br/></span></p>
</div>
What I need is:
<div>
<span>a<br/>bc</span>
<p>te<br>st</p>
</div>
Can anyone help me?
Well if I understand what you are trying to accomplish, this should solve your problem:
string input = @"
<div>
<p><span><br></span></p>
<span>a<br/>bc</span>
<p>te<br>st</p>
<p>\n<span>\n</span></p>
<p><span><br/></span></p>
</div>
";
string pattern = @"(<p>)?(\\n|<br/?>)?<span>(<br/?>|\\n)</span>(</p>)?";
System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(pattern);
string final = reg.Replace(input, String.Empty);
Console.WriteLine(final);
The code above will return:
<div>
<span>a<br/>bc</span>
<p>te<br>st</p>
</div>
You could then go about trimming every line, as it looks like it needs it.
It is not mentioned in the question whether you want to clean up your content on the client or server side.
If it should be done on the server, please don't use regex for it. Why? See this excellent answer. Use an HTML parser instead, e.g. Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(var node in doc.DocumentNode.SelectNodes("//div|//span|//p"))
if (string.IsNullOrWhiteSpace(node.InnerText.Replace(@"\n", string.Empty)))
node.Remove();
var result = doc.DocumentNode.OuterHtml;
But it could be done even more simply on the client (also without regex) by using jQuery:
var dom = $(html);
dom.find('p,span,div').each(function() {
if ($(this).text().trim() == '')
$(this).remove();
});
var result = dom.wrap('<div>').parent().html();
I'm using HTML Agility Pack, and I'm trying to replace the InnerText of some tags like this:
protected void GerarHtml()
{
List<string> labels = new List<string>();
string patch = @"C:\EmailsMKT\" +
Convert.ToString(Session["ssnFileName"]) + ".html";
DocHtml.Load(patch);
//var titulos = DocHtml.DocumentNode.SelectNodes("//*[@class='lblmkt']");
foreach (HtmlNode titulo in
DocHtml.DocumentNode.SelectNodes("//*[@class='lblmkt']"))
{
titulo.InnerText.Replace("test", lbltitulo1.Text);
}
DocHtml.Save(patch);
}
The HTML:
<div><label id="titulo1" class="lblmkt">teste</label></div>
Strings are immutable (you should be able to find much documentation on this).
Methods of the String class do not alter the instance, but rather create a new, modified string.
Thus, your call to:
titulo.InnerText.Replace("test", lbltitulo1.Text);
does not alter InnerText, but returns the string you want InnerText to be.
In addition, InnerText is read-only, so the result has to be assigned through a writable property such as InnerHtml (or the Text of the underlying HtmlTextNode, as shown in Set InnerText with HtmlAgilityPack).
Try the following line instead (assign the result of the string operation to the property again):
titulo.InnerHtml = titulo.InnerHtml.Replace("test", lbltitulo1.Text);
I was able to get the result like this:
HtmlTextNode Hnode = null;
Hnode = DocHtml.DocumentNode.SelectSingleNode("//label[@id='titulo1']//text()") as HtmlTextNode;
Hnode.Text = lbltitulo1.Text;
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
From this code, I get a bunch of h2 and h3-elements. In these, I'd like to insert an ID-attribute, with the value equal to (the content in the header, minus special chars and ToLower()). I also need this value as a separate string, as I need to store it for later use.
Input: <h3>Some sort of header!</h3>
Output: <h3 id="#some-sort-of-header">Some sort of header!</h3>
Plus, I need the values "#some-sort-of-header" and "Some sort of header!" stored in a dictionary or list or whatever else.
This is what I have so far:
string htmlHeaderPattern = ("(<h[2|3]>.*</h[2|3]>)");
MatchCollection matches = Regex.Matches(mainBody, htmlHeaderPattern, RegexOptions.Compiled);
Dictionary<string,string> returnValue = new Dictionary<string, string>();
foreach (Match match in matches)
{
string idValue = StripTextValue(match.Groups[4].Value);
returnValue.Add(idValue, match.Groups[4].Value);
}
MainBody = Regex.Replace(mainBody, htmlHeaderPattern, "this is where i must replace all the headers with one with an ID-attribute?");
Any regex-wizards out there to help me?
It is often said that regex should not be used to parse HTML, so you could use e.g. Html Agility Pack for this:
var html = @"<h2>Some sort of header!</h2>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var headers = document.DocumentNode.SelectNodes("//h2|//h3");
if (headers != null)
{
foreach (HtmlNode header in headers)
{
var innerText = header.InnerText;
var idValue = StripTextValue(innerText);
if (header.Attributes["id"] != null)
{
header.Attributes["id"].Value = idValue;
}
else
{
header.Attributes.Add("id", idValue);
}
}
}
This code finds all the <h2> and <h3> elements in the document, takes the inner text of each, and sets (or adds) its id attribute.
With this example you should get something like:
<h2 id='#some-sort-of-header'>Some sort of header!</h2>
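The question also asks for the id and header text to be stored for later use. If the headers are known to be plain <h2>/<h3> tags without attributes or nested markup, a single Regex.Replace with a match evaluator can do both in one pass. This is a sketch under that assumption. HeaderIds, AddIds and Slugify are made-up names, and Slugify is only a guess at what the asker's StripTextValue does; it leaves out the leading "#", which normally belongs to the link rather than to the id.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Hypothetical stand-in for StripTextValue: lower-cases the text and
    // turns every run of non-alphanumeric characters into a single hyphen.
    public static string Slugify(string text)
    {
        string slug = Regex.Replace(text.ToLowerInvariant(), "[^a-z0-9]+", "-");
        return slug.Trim('-');
    }

    // Adds an id attribute to each attribute-less <h2>/<h3> and records id -> header text.
    public static string AddIds(string html, IDictionary<string, string> headers)
    {
        return Regex.Replace(
            html,
            "<(h[23])>(.*?)</\\1>", // \1 makes the closing tag match the opening one
            match =>
            {
                string tag = match.Groups[1].Value;
                string text = match.Groups[2].Value;
                string id = Slugify(text);
                headers[id] = text; // a repeated header text overwrites the earlier entry
                return "<" + tag + " id=\"" + id + "\">" + text + "</" + tag + ">";
            });
    }
}
```

Pass in an empty Dictionary<string, string>; after the call it holds one entry per header. Note the character class is [23] rather than [2|3], since inside brackets the | is matched literally.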
I have data in an html file, in a table:
<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>
How do I split a single row into an array or list?
string row = streamReader.ReadLine();
List<string> data = row.Split //... how do I do this bit?
string artist = data[1];
Short answer: never try to parse HTML from the wild with regular expressions. It will most likely come back to haunt you.
Longer answer: As long as you can absolutely, positively guarantee that the HTML that you are parsing fits the given structure, you can use string.Split() as Jenni suggested.
string html = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";
string[] values = html.Split(new string[] { "<tr>","</tr>","<td>","</td>" }, StringSplitOptions.RemoveEmptyEntries);
List<string> list = new List<string>(values);
Listing the tags independently keeps this slightly more readable, and the .RemoveEmptyEntries will keep you from getting an empty string in your list between adjacent closing and opening tags.
If this HTML is coming from the wild, or from a tool that may change - in other words, if this is more than a one-off transaction - I strongly encourage you to use something like the HTML Agility Pack instead. It's pretty easy to integrate, and there are lots of examples on the Intarwebs.
If your HTML is well-formed you could use LINQ to XML:
string input = @"<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>";
var xml = XElement.Parse(input);
// query each row
foreach (var row in xml.Elements("tr"))
{
foreach (var item in row.Elements("td"))
{
Console.WriteLine(item.Value);
}
Console.WriteLine();
}
// if you really need a string array...
var query = xml.Elements("tr")
.Select(row => row.Elements("td")
.Select(item => item.Value)
.ToArray());
foreach (var item in query)
{
// foreach over item content
// or access via item[0...n]
}
You could try:
string[] data = Regex.Split(row, "<tr><td>|</td><td>|</td></tr>"); // the first and last entries are empty strings
But it depends on how regular the HTML is. Is it programmatically generated, or does a human write it? You should only use a regular expression if you're sure it will always be generated the same way; otherwise you should use a proper HTML parser.
When parsing HTML, I usually turn to the HTML Agility Pack.