How to remove empty lines using regex in JavaScript and C#

Users enter content in a text editor, and it is then submitted to the database.
Before storing it in the database, I want to remove the empty lines at the beginning and end of the content (empty lines in the middle must not be removed).
I want to do this in JavaScript and C#.
Sample content:
<div>
<p><span><br></span></p>
<span>a<br/>bc</span>
<p>te<br>st</p>
<p>\n<span>\n</span></p>
<p><span><br/></span></p>
</div>
What I need is:
<div>
<span>a<br/>bc</span>
<p>te<br>st</p>
</div>
Can anyone help?

Well if I understand what you are trying to accomplish, this should solve your problem:
string input = @"
<div>
<p><span><br></span></p>
<span>a<br/>bc</span>
<p>te<br>st</p>
<p>\n<span>\n</span></p>
<p><span><br/></span></p>
</div>
";
string pattern = @"(<p>)?(\\n|<br/?>)?<span>(<br/?>|\\n)</span>(</p>)?";
System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(pattern);
string final = reg.Replace(input, String.Empty);
Console.WriteLine(final);
The code above prints the following (apart from the blank lines left where elements were removed):
<div>
<span>a<br/>bc</span>
<p>te<br>st</p>
</div>
You could then trim every line and drop the blank ones, as the output looks like it needs it.
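The regex above removes every match, including any in the middle of the content. As a minimal JavaScript sketch of trimming only the leading and trailing blank blocks, assuming each block sits on its own line as in the sample (the names isBlankMarkup and trimBlankEdges are hypothetical, and this is not a general HTML parser):

```javascript
// Returns true when a line of markup has no visible text:
// tags, literal "\n" sequences and whitespace are ignored.
function isBlankMarkup(line) {
  return line
    .replace(/<[^>]*>/g, "") // drop every tag, including <br>
    .replace(/\\n/g, "")     // drop literal backslash-n sequences
    .trim() === "";
}

// Removes blank lines only from the start and the end of the list;
// blank lines between non-blank ones are kept.
function trimBlankEdges(innerLines) {
  let start = 0;
  let end = innerLines.length;
  while (start < end && isBlankMarkup(innerLines[start])) start++;
  while (end > start && isBlankMarkup(innerLines[end - 1])) end--;
  return innerLines.slice(start, end);
}

// The lines between <div> and </div> from the sample content.
const inner = [
  "<p><span><br></span></p>",
  "<span>a<br/>bc</span>",
  "<p>te<br>st</p>",
  "<p>\\n<span>\\n</span></p>",
  "<p><span><br/></span></p>",
];
const trimmed = trimBlankEdges(inner);
```

Here trimmed holds only the two middle lines; wrapping them in the div again gives the desired output.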

It is not mentioned in the question whether you want to clean up your content on the client or server side.
If it should be done on the server, please don't use regex for it. Why? See this excellent answer. Use an HTML parser instead, e.g. HtmlAgilityPack:
var doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (var node in doc.DocumentNode.SelectNodes("//div|//span|//p"))
    if (string.IsNullOrWhiteSpace(node.InnerText.Replace(@"\n", string.Empty)))
        node.Remove();
var result = doc.DocumentNode.OuterHtml;
But it could be done even more simply on the client (also without regex) by using jQuery:
var dom = $(html);
dom.find('p,span,div').each(function() {
    if ($(this).text().trim() == '')
        $(this).remove();
});
var result = dom.wrap('<div>').parent().html();

Related

C# regex replace, but with a different replacement value each time

I have a string like this:
<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>
This is a templating use case. Each query will be replaced by a different value (i.e. the SQL result). Is it possible to use the Regex Replace method to do this?
The solution I'm thinking of is to use Regex.Match in a first pass to collect all the matches, and then use string.Replace in a second pass to replace the matches one by one. Is there a better way to solve this?
var source =
@"<div>
<query>select * from table1</query>
</div>
<div>
<query>select * from table2</query>
</div>";
var result = Regex.Replace(
source,
"(?<=<query>).*?(?=</query>)",
match => Sql.Execute(match.Value));
The Sql.Execute is a placeholder function for whatever logic you invoke to execute your query. Upon completion, its results will substitute the original <query>…</query> contents.
If you want the query tags to be eliminated, then use a named capture group rather than lookarounds:
var result = Regex.Replace(
source,
"<query>(?<q>.*?)</query>",
match => Sql.Execute(match.Groups["q"].Value));
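For comparison, the same callback-per-match idea can be sketched in JavaScript, where String.prototype.replace also accepts a function. This is only an illustration: runQuery is a hypothetical stand-in for whatever executes the query, just as Sql.Execute is a placeholder above.

```javascript
// Hypothetical stand-in for whatever actually runs the query.
function runQuery(sql) {
  return "[result of: " + sql + "]";
}

const source = [
  "<div>",
  "<query>select * from table1</query>",
  "</div>",
  "<div>",
  "<query>select * from table2</query>",
  "</div>",
].join("\n");

// The callback receives the whole match first, then each capture group.
// Returning only the query result drops the surrounding <query> tags.
const result = source.replace(
  /<query>(.*?)<\/query>/g,
  (match, sql) => runQuery(sql)
);
```

Each match is replaced by its own return value, so no second pass over the string is needed.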
You could use Html Agility Pack to first get the query tags and then replace their inner text with whatever you want:
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach (var node in queries)
{
    if (node.InnerText == "select * from table1")
    {
        node.InnerText = "your result";
    }
}
You could also use a dictionary, with the query text as the key and the replacement as the value:
var dict = new Dictionary<string, string>();
dict.Add("select * from table1", "your result");
//...
var html = new HtmlDocument();
html.Load(filepath);
var queries = html.DocumentNode.SelectNodes("//query");
foreach (var node in queries)
{
    // ContainsKey is a direct hash lookup
    if (dict.ContainsKey(node.InnerText))
    {
        node.InnerText = dict[node.InnerText];
    }
}
We know regex is not good for HTML parsing, but I don't think you need to parse HTML here; you simply need to get what is inside the <query>xxx</query> pattern.
So the rest of the document doesn't matter, since (according to your question) you don't want to traverse, validate, or change it.
In this particular case, I would use regex rather than an HTML parser:
var pattern = "<query>.+?</query>";
Then replace each match using the string Replace method.

C# JSON date issue

I have a date in my DB of 2014-03-03 05:00:00, which is being rendered in JSON as /Date(-6xxxxx)/, and I call this method to parse it:
function parseJsonDate(dateString) {
    var result = new Date(+dateString.replace(/\/Date\((-?\d+)\)\//gi, "$1"));
    var result = new Date(parseInt(dateString.replace('/Date(', '')));
    result.format("dd-MM-yyyy");
    return result;
}
When running, I comment out one of the result lines, but get the same result for both:
The method is being called from a jQuery template like this:
<tr>
<td>
<span id="approvedDate"><i class="glyphicon glyphicon-time" data-toggle="tooltip"
data-original-title="approved date"></i> ${parseJsonDate(AuditDate)}</span>
</td>
</tr>
EDIT
What a muppet. I spent so long thinking this was a JSON conversion issue that I totally forgot to go back and check my Dapper code. My ApprovalHistory object had AuditDate, but I was asking for EnteredDate in the SQL. So it was doing as expected.
aaaaaaaarrr :-)
I am seeing something fishy there
var result = new Date(+dateString.replace(/\/Date\((-?\d+)\)\//gi, "$1"));
var result = new Date(parseInt(dateString.replace('/Date(', '')));
You are declaring two variables named result in the same scope; is that intentional?
What does result.format do? Since result is a Date object, I wouldn't assume that calling it changes result from a Date to a string.
Maybe
var s = result.format("dd-MM-yyyy");
return s;
is what you really want to do?
You can do this once the AJAX call completes, which saves you from having to parse the Date(xxxxx) format over and over again:
data = data.replace(/\"\\\/Date\((-?\d+)\)\\\/\"/g, '$1')
This converts "\/Date(xxxx)\/" in the raw JSON text to xxxx, and then you can just call new Date(xxxx) to make a new Date object.
Maybe you could use something like this:
var str = (result.getMonth() + 1) + "-" + result.getDate() + "-" + result.getFullYear();
return str;
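Putting these suggestions together, here is a minimal sketch that parses the /Date(...)/ form and returns a formatted string without relying on a Date.prototype.format extension. The timestamp 1393822800000 is an illustrative value (2014-03-03 05:00:00 UTC), not one taken from the question, formatDate is a hypothetical helper, and the choice of UTC getters is an assumption:

```javascript
// Extracts the millisecond timestamp from a "/Date(1234567890)/" string.
// Returns null when the input does not have that shape.
function parseJsonDate(dateString) {
  const m = /\/Date\((-?\d+)\)\//.exec(dateString);
  return m ? new Date(Number(m[1])) : null;
}

// Formats as dd-MM-yyyy using UTC fields so the output does not depend
// on the machine's time zone (an assumption; use local getters if needed).
function formatDate(date) {
  const dd = String(date.getUTCDate()).padStart(2, "0");
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  return dd + "-" + mm + "-" + date.getUTCFullYear();
}

const parsed = parseJsonDate("/Date(1393822800000)/");
const text = formatDate(parsed); // "03-03-2014"
```

Returning the formatted string, rather than the Date object, is what makes the template show the expected text.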

C# Scrape data from wiki page (screen-scraping)

I want to scrape a Wiki page. Specifically, this one.
My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).
For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page
SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).
My code so far is (copied and edited from various websites)...
WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";
getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!
HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;
List<string> anchorTags = new List<string>();
foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
string att = deployment.OuterHtml;
anchorTags.Add(att);
}
However, I am getting an "ArgumentException was unhandled - Illegal characters in path" error.
What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution, I'd be glad to comply.
What's wrong with the code? To be blunt, everything. :P
The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.
The contents of the page (the part we're interested in) looks something like this:
<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
<!-- ... -->
<b>SBS8987B</b>
(SLBP 192/194*)
<br>
<b>SBS8988Z</b>
(SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
<br>
<b>SBS8989X</b>
(SLBP SP)
<br>
<!-- ... -->
</p>
Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:
// Requires: using System.Linq;
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";
    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();
    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);
    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
        + "/following-sibling::p[1]" // move to the first `p` element (where the actual content is in) after the header
        + "/b"; // select the `b` elements
    // Get the elements from the specified XPath
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments // from the list of registration numbers
        where b.InnerText == reg // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)
    // The query should yield exactly one result (or we have a problem) or none (null)
    var content = query.SingleOrDefault();
    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);
    return decoded;
}

Split html row into string array

I have data in an html file, in a table:
<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>
How do I split a single row into an array or list?
string row = streamReader.ReadLine();
List<string> data = row.Split //... how do I do this bit?
string artist = data[1];
Short answer: never try to parse HTML from the wild with regular expressions. It will most likely come back to haunt you.
Longer answer: As long as you can absolutely, positively guarantee that the HTML that you are parsing fits the given structure, you can use string.Split() as Jenni suggested.
string html = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";
string[] values = html.Split(new string[] { "<tr>","</tr>","<td>","</td>" }, StringSplitOptions.RemoveEmptyEntries);
List<string> list = new List<string>(values);
Listing the tags independently keeps this slightly more readable, and the .RemoveEmptyEntries will keep you from getting an empty string in your list between adjacent closing and opening tags.
If this HTML is coming from the wild, or from a tool that may change - in other words, if this is more than a one-off transaction - I strongly encourage you to use something like the HTML Agility Pack instead. It's pretty easy to integrate, and there are lots of examples on the Intarwebs.
If your HTML is well-formed you could use LINQ to XML:
string input = @"<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>";
var xml = XElement.Parse(input);
// query each row
foreach (var row in xml.Elements("tr"))
{
    foreach (var item in row.Elements("td"))
    {
        Console.WriteLine(item.Value);
    }
    Console.WriteLine();
}
// if you really need a string array...
var query = xml.Elements("tr")
    .Select(row => row.Elements("td")
        .Select(item => item.Value)
        .ToArray());
foreach (var item in query)
{
    // foreach over item content
    // or access via item[0...n]
}
You could try:
Row.Split /<tr><td>|<\/td><td>|<\/td><\/tr>/
But it depends on how regular the HTML is. Is it programmatically generated, or does a human write it? You should only use a regular expression if you're sure it will always be generated the same way; otherwise you should use a proper HTML parser.
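As a runnable illustration of the regex split, here is a minimal JavaScript sketch. It assumes the row has exactly the tr/td structure shown in the question, with no attributes or nested markup, and splitRow is a hypothetical helper name:

```javascript
// Splits one table row on its <tr>, </tr>, <td> and </td> tags and
// drops the empty strings that appear between adjacent tags.
function splitRow(row) {
  return row.split(/<\/?t[rd]>/).filter((cell) => cell !== "");
}

const cells = splitRow(
  "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>"
);
const artist = cells[1]; // "MC Hammer"
```

Filtering out the empty strings plays the same role as StringSplitOptions.RemoveEmptyEntries in the C# answer.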
When parsing HTML, I usually turn to the HTML Agility Pack.

Getting only the DIRECT InnerText of an IHTMLElement

Consider the following html code:
<div id='x'><div id='y'>Y content</div>X content</div>
I'd like to extract only the content of 'x'. However, its innerText property includes the content of 'y' as well. I tried iterating over its children and all properties but they only return the inner tags.
How can I access through the IHTMLElement interface only the actual data of 'x'?
Thanks
Use something like:
function getText(el) {
    var txt = el.innerHTML;
    // replace() returns a new string, so the result must be assigned
    txt = txt.replace(/<(.)*>/g, "");
    return txt;
}
Since el.innerHTML returns
<div id='y'>Y content</div>X content
the function getText would return
X content
Maybe this'll help.
Use the childNodes collection to return the child elements and text nodes.
You need to QueryInterface (QI) for IHTMLDOMNode from the IHTMLElement for that.
Here is the final code as suggested by Sheng (just a part of the sample, of course):
mshtml.IHTMLElementCollection c = ((mshtml.HTMLDocumentClass)(wbBrowser.Document)).getElementsByTagName("div");
foreach (IHTMLElement div in c)
{
    if (div.className == "lyricbox")
    {
        IHTMLDOMNode divNode = (IHTMLDOMNode)div;
        IHTMLDOMChildrenCollection children = (IHTMLDOMChildrenCollection)divNode.childNodes;
        foreach (IHTMLDOMNode child in children)
        {
            Console.WriteLine(child.nodeValue);
        }
    }
}
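The same childNodes idea can be sketched in JavaScript: keep only the text nodes that are direct children of the element. The function name directText is hypothetical, and the plain object below only stands in for a real DOM element so the sketch runs outside a browser:

```javascript
const TEXT_NODE = 3; // DOM nodeType value for text nodes

// Concatenates the values of the element's direct text-node children,
// ignoring child elements and everything nested inside them.
function directText(element) {
  return Array.from(element.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.nodeValue)
    .join("");
}

// Stand-in for <div id='x'><div id='y'>Y content</div>X content</div>
const x = {
  childNodes: [
    { nodeType: 1, nodeValue: null },        // the nested <div id='y'>
    { nodeType: 3, nodeValue: "X content" }, // the direct text node
  ],
};
const own = directText(x); // "X content"
```

In a browser, the same function can be called with a real element, since element.childNodes exposes nodeType and nodeValue in the same way.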
Since innerText() doesn't work with IE, I guess there is no real way.
Maybe try solving the issue on the server side by generating the content like this:
<div id='x'><div id='y'>Y content</div>X content</div>
<div id='x-plain'>_plain X content_</div>
"Plain X content" represents your C#-generated content for the element.
You can then access it by referring to getObject('x-plain').innerHTML().
