Regular Expression to ignore empty href

I have written a function that prefixes the value of each href with a base URL. For example,
<a href="/somepage.htm" id="test">
should be replaced with
<a href="http://www.stackoverflow.com/somepage.htm" id="test">
Cases where no replacement is needed:
<a href="http://www.stackoverflow.com/somepage.htm" id="test">
<a href="#" id="test">
<a href="javascript:alert('test');" id="test">
<a href="" id="test">
I have written the following method. It works for all of these cases except the empty href value:
public static string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (String.IsNullOrEmpty(text))
{
return text;
}
String value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
return value.Replace(absoluteUrl + "/", absoluteUrl);
}
I added ?!http|javascript|# to skip values that start with http, javascript or #, and that works for those cases. But take this part of the pattern:
(?!http|javascript|#)(.*?)
If I replace the * with +:
(?!http|javascript|#)(.+?)
it still does not work for the empty case.

Changing * to + does not work, because you got the two quantifiers the wrong way round:
* means "zero or more"
+ means "one or more"
So with + you are forcing some content to be present, rather than allowing the content to be missing.
Another thing you got wrong is the placement. The * at that place applies to the . before it. Together, they mean "zero or more characters". So this part already does not require any content. Therefore, since your regex currently does not behave as you expect with empty content, something else must be responsible.
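To see the two quantifiers side by side on the input from the question, here is a minimal sketch. The names QuantifierDemo and CaptureHref are made up for illustration; only Regex.Match is real API.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns what the capture group grabs for the given
    // lazy quantifier ("*?" or "+?"), or null when the pattern does not match.
    public static string CaptureHref(string html, string quantifier)
    {
        Match m = Regex.Match(html, "href=\"(." + quantifier + ")\"");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With <a href="" id="test">, the *? version captures the empty string. The +? version must consume at least one character, so it swallows the closing quote and keeps going until the next quote, which means the group ends up holding text from outside the href value.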
Looking at the preceding expressions:
(?!http|javascript|#)(.*?)
The ?! is a zero-width negative lookahead. Zero-width. Negative. That means that it will not require any content either.
So, I got your code, pasted it into the online compiler, then I fed it with your example <a href="" id="test">:
using System.IO;
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
string text = "<a href=\"\" id=\"test\">";
string pattern = "src|href";
string absoluteUrl = "YADA";
string value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Console.WriteLine(value);
}
}
and guess what, it works:
Compiling the source code....
$mcs main.cs -out:demo.exe 2>&1
Executing the program....
$mono demo.exe
<a href="YADA" id="test">
So, either you are not telling the truth, or you changed the code when posting it here, or something else entirely is messed up in your code, sorry.
EDIT:
So, it turned out that the href="" was meant to be ignored.
Then the simplest thing you can do is add another negative lookahead that blocks the href="" case explicitly. However, note that this group has to go in a different place. The current group sits inside the href quotes, so it cannot "peek" at what the whole quoted value looks like. The new group must come before the opening quote.
"<(.*?)(" + pattern + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>"
Note that just before the first quote of the href, I've added (?!\"\"), which ensures that the opening quote is never immediately followed by another quote.
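Putting the extra lookahead back into the method from the question gives roughly the following. This is a sketch under assumptions: the class name HrefRewriter and the shortened method name are made up, and it assumes absoluteUrl contains no $ characters, since those would be read as substitution tokens in the replacement string.

```csharp
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Prefixes relative src/href values with absoluteUrl.
    // Values that are empty or start with http, javascript or # are left alone.
    public static string RelativeToAbsolute(string html, string absoluteUrl, string attributes = "src|href")
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        // (?!\"\") sits before the opening quote and rejects href="".
        string pattern = "<(.*?)(" + attributes + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>";
        string replacement = "<$1$2=\"" + absoluteUrl + "$3\"$4>";
        return Regex.Replace(html, pattern, replacement, RegexOptions.IgnoreCase);
    }
}
```

Calling HrefRewriter.RelativeToAbsolute(html, "http://www.stackoverflow.com") rewrites only the /somepage.htm case from the question; the other four inputs come back unchanged.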

I know that you are asking for a regex.
But here is an alternative, because I think using Uri.IsWellFormedUriString is worth it.
This way you can also reuse the helper functions:
public string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (isHrefRelativeURIPath(text)){
// strip any leading slashes from the relative path before joining it to the base URL
text = absoluteUrl + "/" + System.Text.RegularExpressions.Regex.Replace(text, @"^\/+", "");
}
return text;
}
public bool isHrefRelativeURIPath(string value) {
if (isLink(value) ||
value.StartsWith("#") ||
value.StartsWith("javascript"))
{
return false;
}
// Others Custom exclusions
return true;
}
public bool isLink(string value) {
if (String.IsNullOrEmpty(value))
return false;
return Uri.IsWellFormedUriString("http://" + value, UriKind.Absolute);
}

Related

Pass old regular expression to new custom regular expression to exclude specific characters

I have a program with a lot of string constants being used to allow specific characters via regular expressions. I now have a list of characters I want to block everywhere, but I don't want to have to go back through all my old string constants and rewrite them. Instead, I want to create a list of restricted characters and edit that list in only one place (in case it changes in the future). I'll then run all the string constants through a custom regular expression.
I have the list of restricted characters defined in web.config like so:
<add key="RestrChar" value="\!#%<>|&;"/>
Calling a custom regular expression like this:
[RestrictCharRegExpress(ConstantStringName, ErrorMessage = CustomErrMsg)]
public string StringName
Class is defined as follows:
public class RestrictCharRegExpressAttribute : RegularExpressionAttribute
{
public RestrictCharRegExpressAttribute(string propRegex) : base(GetRegex(propRegex)){ }
private static string GetRegex(string propRegex)
{
string restrictedChars = ConfigurationManager.AppSettings.Get("RestrChar");
return Regex.Replace(propRegex, $"[{restrictedChars}]+", "");
}
}
Now this works when ConstantStringName specifically includes some of the characters I want to exclude like this:
public const string ConstantStringName = "^[-a-z A-Z.0-9/!&\"()]{1,40}$";
! and & are explicitly included so they will get replaced with nothing. But this won't work if the characters I'm trying to exclude aren't explicitly listed and are instead included via a list like this:
public const string ConstantStringName = "^[ -~\x0A\x0D]{1,40}$";
I've tried adding a negative lookahead like this:
return propRegex + "(?![" + restrictedChars + "])";
But that doesn't work in both cases. Also tried the negated set:
int i = propRegex.IndexOf(']');
if (i != -1)
{
propRegex = propRegex.Insert(i, "[^" + restrictedChars + "]");
return propRegex;
}
Still not working for both cases. Finally I tried character class subtraction:
int i = propRegex.IndexOf(']');
if (i != -1)
{
propRegex = propRegex.Insert(i, "-[" + restrictedChars + "]");
return propRegex;
}
And once again I achieved failure.
Does anyone have any other ideas how I can achieve my goal to exclude a set of characters no matter what set of regex rules are passed into my custom regular expression?

Actually figured out what I'm trying to do:
int indexPropRegex = propRegex.IndexOf('^');
string restrictedCharsAction = "(?!.*[" + restrictedChars + "])";
propRegex = indexPropRegex == -1 ? propRegex.Insert(0, restrictedCharsAction) : propRegex.Insert(indexPropRegex +1, restrictedCharsAction);
return propRegex;
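As a standalone sketch of the same idea: the names RestrictedPattern and AddRestriction are hypothetical, and restrictedChars is assumed to be already safe to place inside a character class. Two deliberate variations from the snippet above, both my own assumptions: it uses [\s\S]* instead of .* so a restricted character after a line break is also caught, and it only treats ^ as an anchor when it is the first character, so a ^ inside a negated class is not mistaken for one.

```csharp
using System.Text.RegularExpressions;

public static class RestrictedPattern
{
    // Adds a negative lookahead that rejects any input containing
    // one of the restricted characters, whatever the original pattern allows.
    public static string AddRestriction(string propRegex, string restrictedChars)
    {
        // [\s\S]* also crosses line breaks, which a plain .* does not by default.
        string guard = "(?![\\s\\S]*[" + restrictedChars + "])";
        return propRegex.StartsWith("^")
            ? propRegex.Insert(1, guard)
            : guard + propRegex;
    }
}
```

For the second constant from the question, AddRestriction("^[ -~\\x0A\\x0D]{1,40}$", "!#%<>|&;") yields a pattern that still accepts "hello world" but rejects "hello!".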

Acting on the indentation of a c# multiline string

I want to write some Html from c# (html is an example, this might be other languages..)
For example:
string div = @"<div class=""className"">
                  <span>Mon text</span>
               </div>";
will produce:
<div class="className">
                  <span>Mon text</span>
               </div>
which is not great from the HTML point of view...
The only way to get correctly indented HTML is to indent the C# code like this:
string div = @"<div class=""className"">
  <span>Mon text</span>
</div>";
We get the correctly indented HTML:
<div class="className">
  <span>Mon text</span>
</div>
But indenting the C# like this really hurts the readability of the code...
Is there a way to act on the indentation in the C# language?
If not, does someone have a better tip than:
string div = "<div class=\"className\">" + Environment.NewLine +
" <span>Mon text</span>" + Environment.NewLine +
"</div>";
and better than
var sbDiv = new StringBuilder();
sbDiv.AppendLine("<div class=\"className\">");
sbDiv.AppendLine(" <span>Mon text</span>");
sbDiv.AppendLine("</div>");
What I use as a solution:
Many thanks to @Yotam for his answer.
I wrote a small extension to make the alignment "dynamic":
/// <summary>
/// Align a multiline string from the indentation of its first line
/// </summary>
/// <remarks>The </remarks>
/// <param name="source">The string to align</param>
/// <returns></returns>
public static string AlignFromFirstLine(this string source)
{
if (String.IsNullOrEmpty(source)) {
return source;
}
if (!source.StartsWith(Environment.NewLine)) {
throw new FormatException("String must start with a NewLine character.");
}
int indentationSize = source.Skip(Environment.NewLine.Length)
.TakeWhile(Char.IsWhiteSpace)
.Count();
string indentationStr = new string(' ', indentationSize);
return source.TrimStart().Replace($"\n{indentationStr}", "\n");
}
Then I can use it like this:
private string GetHtml(string className)
{
return $@"
        <div class=""{className}"">
          <span>Texte</span>
        </div>".AlignFromFirstLine();
}
This returns the correct HTML:
<div class="myClassName">
  <span>Texte</span>
</div>
One limitation is that it only works with space indentation...
Any improvement is welcome!
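One possible way around the space-only limitation is to strip the exact whitespace prefix of the first content line instead of rebuilding it from spaces. The following is only a sketch: the names IndentHelper and AlignFromFirstLinePrefix are made up, and it normalizes line endings to \n, which the original extension does not do.

```csharp
using System;
using System.Linq;

public static class IndentHelper
{
    // Removes the leading whitespace of the first content line (spaces or tabs)
    // from every line that starts with that same prefix.
    public static string AlignFromFirstLinePrefix(string source)
    {
        if (string.IsNullOrEmpty(source))
        {
            return source;
        }
        string text = source.Replace("\r\n", "\n");
        if (!text.StartsWith("\n"))
        {
            throw new FormatException("String must start with a new line.");
        }
        string body = text.Substring(1);
        int n = 0;
        while (n < body.Length && (body[n] == ' ' || body[n] == '\t'))
        {
            n++;
        }
        string prefix = body.Substring(0, n);
        var lines = body.Split('\n')
            .Select(l => l.StartsWith(prefix, StringComparison.Ordinal) ? l.Substring(n) : l);
        return string.Join("\n", lines);
    }
}
```

Lines that do not start with the same prefix are left untouched, so mixed tab and space indentation is not silently mangled.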

You could wrap the string to the next line to get the desired indentation:
string div =
@"
<div class=""className"">
  <span>Mon text</span>
</div>"
.TrimStart(); // to remove the additional new-line at the beginning
Another nice solution (disadvantage: it depends on the indentation level!):
string div = @"
    <div class=""className"">
      <span>Mon text</span>
    </div>".TrimStart().Replace("\n    ", "\n");
It just removes the indentation from the string. Make sure the number of spaces in the first argument of Replace matches the number of spaces in your indentation.

I like this solution more, but how about:
string div = "<div class='className'>\n"
+ " <span>Mon text</span>\n"
+ "</div>";
This gets rid of some clutter:
Replace " inside strings with ' so that you don't need to escape the quote. (Single quotes in HTML appear to be legal.)
You can then also use regular "" string literals instead of @"".
Use \n instead of Environment.NewLine.
Note that the string concatenation is performed during compilation, by the compiler. (See also this and this blog post on the subject by Eric Lippert, who previously worked on the C# compiler.) There is no runtime performance penalty.

Inspired by trimIndent() in Kotlin.
This code:
var x = @"
    anything
      you
    want
    ".TrimIndent();
will produce a string:
anything
  you
want
or "\nanything\n  you\nwant\n"
Implementation:
public static string TrimIndent(this string s)
{
string[] lines = s.Split('\n');
IEnumerable<int> firstNonWhitespaceIndices = lines
.Skip(1)
.Where(it => it.Trim().Length > 0)
.Select(IndexOfFirstNonWhitespace);
int firstNonWhitespaceIndex;
if (firstNonWhitespaceIndices.Any()) firstNonWhitespaceIndex = firstNonWhitespaceIndices.Min();
else firstNonWhitespaceIndex = -1;
if (firstNonWhitespaceIndex == -1) return s;
IEnumerable<string> unindentedLines = lines.Select(it => UnindentLine(it, firstNonWhitespaceIndex));
return String.Join("\n", unindentedLines);
}
private static string UnindentLine(string line, int firstNonWhitespaceIndex)
{
if (firstNonWhitespaceIndex < line.Length)
{
if (line.Substring(0, firstNonWhitespaceIndex).Trim().Length != 0)
{
return line;
}
return line.Substring(firstNonWhitespaceIndex, line.Length - firstNonWhitespaceIndex);
}
return line.Trim().Length == 0 ? "" : line;
}
private static int IndexOfFirstNonWhitespace(string s)
{
char[] chars = s.ToCharArray();
for (int i = 0; i < chars.Length; i++)
{
if (chars[i] != ' ' && chars[i] != '\t') return i;
}
return -1;
}

If it is one long string, you can always keep it in a text file and read it into your variable, e.g.
string text = File.ReadAllText(@"c:\file.txt", Encoding.UTF8);
This way you can format it any way you want using a text editor, and it won't negatively affect the look of your code.
If you're changing parts of the string on the fly, StringBuilder is your best option. Or, if you did decide to read the string from a text file, you could include {0} placeholders in it and then use string.Format(text, "text1", "text2", etc.) to fill in the required parts.

Regex to remove single-line SQL comments (--)

Question:
Can anybody give me a working regex expression (C#/VB.NET) that can remove single line comments from a SQL statement ?
I mean these comments:
-- This is a comment
not those
/* this is a comment */
because I already can handle the star comments.
I have made a little parser that removes those comments when they are at the start of the line, but they can also appear after code or, worse, inside a SQL string: 'hello --Test -- World'
Comments after code should also be removed (but not those inside a SQL string, of course, if possible).
Surprisingly, I could not get the regex working. I would have assumed the star comments to be more difficult, but actually they aren't.
As requested, here is my code to remove /**/-style comments.
(To make it ignore SQL-style strings, you have to substitute the strings with a unique identifier (I used 4 concatenated), then apply the comment removal, then substitute the strings back.)
static string RemoveCstyleComments(string strInput)
{
string strPattern = @"/[*][\w\d\s]+[*]/";
//strPattern = @"/\*.*?\*/"; // Doesn't work
//strPattern = "/\\*.*?\\*/"; // Doesn't work
//strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work
//strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work
// http://stackoverflow.com/questions/462843/improving-fixing-a-regex-for-c-style-block-comments
strPattern = @"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/"; // Works !
string strOutput = System.Text.RegularExpressions.Regex.Replace(strInput, strPattern, string.Empty, System.Text.RegularExpressions.RegexOptions.Multiline);
Console.WriteLine(strOutput);
return strOutput;
} // End Function RemoveCstyleComments

I will disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments that are not in a string (even the OP could do that); the real problem is comment markers inside a string. Lookarounds offer a little hope, but they are still not enough. Knowing that a quote precedes the marker on the line guarantees nothing. The only thing that guarantees anything is the parity of the number of preceding quotes, which is something you can't find with a regular expression. So simply go with a non-regex approach.
EDIT:
Here's the c# code:
String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
char[] quotes = { '\'', '"'};
int newCommentLiteral, lastCommentLiteral = 0;
while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
{
int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
{
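int eol = sql.IndexOf("\r\n", newCommentLiteral); //search from the comment, not from the start of the string
if (eol == -1)
eol = sql.Length; //no more newline, meaning end of the string
else
eol += 2; //remove the line break together with the comment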
sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
lastCommentLiteral = newCommentLiteral;
}
else //this is within a string, find string ending and moving to it
{
int singleQuote = sql.IndexOf("'", newCommentLiteral);
if (singleQuote == -1)
singleQuote = sql.Length;
int doubleQuote = sql.IndexOf('"', newCommentLiteral);
if (doubleQuote == -1)
doubleQuote = sql.Length;
lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;
//instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
}
}
Console.WriteLine(sql);
What this does: it finds every comment marker (--). For each one, it checks whether the marker is inside a string by counting the quotes between the current match and the previous one. If that number is even, the marker starts a comment, so it is removed (find the next end of line and remove what's in between). If it is odd, the marker is inside a string, so the code finds the end of the string and moves past it. This snippet is based on a weird SQL trick: 'this" is a valid string, even though the two quotes differ. If that is not true for your SQL dialect, you should try a completely different approach. I'll write a program for that too if that's the case, but this one is faster and more straightforward.
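For dialects where only single quotes delimit strings, the same idea can be written as a single pass over the characters. This is a sketch under stated assumptions, not the method above: the names SqlLineComments and Strip are made up, double quotes are treated as ordinary characters, and the line break after a comment is kept rather than removed.

```csharp
using System.Text;

public static class SqlLineComments
{
    // Removes -- comments that are not inside a single-quoted string.
    // A doubled quote ('') inside a string closes and immediately reopens it,
    // so the quote state stays correct without special handling.
    public static string Strip(string sql)
    {
        var sb = new StringBuilder(sql.Length);
        bool inString = false;
        int i = 0;
        while (i < sql.Length)
        {
            char c = sql[i];
            if (c == '\'')
            {
                inString = !inString;
                sb.Append(c);
                i++;
            }
            else if (!inString && c == '-' && i + 1 < sql.Length && sql[i + 1] == '-')
            {
                // skip to the end of the line, leaving the line break in place
                while (i < sql.Length && sql[i] != '\r' && sql[i] != '\n')
                {
                    i++;
                }
            }
            else
            {
                sb.Append(c);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

Block comments are out of scope here; a /* ... */ section that contains -- or a quote would need its own state.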

You want something like this for the simple case
-{2,}.*
The -{2,} looks for a dash that happens 2 or more times
The .* gets the rest of the lines up to the newline
But for the edge cases, it appears that SinistraD is correct that you cannot catch everything. However, here is an article about how this can be done in C# with a combination of code and regex.

This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable
private static string removeComments(string sql)
{
string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
}
Note: it is designed to eliminate both /**/-style comments as well as -- style. Remove |/\*(.|\n)*?\*/ to get rid of the /**/ checking. Also be sure you are using the RegexOptions.IgnorePatternWhitespace Regex option!!
I wanted to be able to handle double-quotes too, but since T-SQL doesn't support them, you could get rid of |[""][^""]*[""] too.
Adapted from here.
Note (Mar 2015): In the end, I wound up using Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. In the end I was much more confident with the results having used Antlr, and it's worked well.

using System.Text.RegularExpressions;
public static string RemoveSQLCommentCallback(Match SQLLineMatch)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
bool open = false; //opening of SQL String found
char prev_ch = ' ';
foreach (char ch in SQLLineMatch.ToString())
{
if (ch == '\'')
{
open = !open;
}
else if ((!open && prev_ch == '-' && ch == '-'))
{
break;
}
sb.Append(ch);
prev_ch = ch;
}
return sb.ToString().Trim('-');
}
Usage:
public static void Main()
{
string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
//for every matching line call callback func
string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
}
Regex.Replace finds every line that contains a dash-dash sequence and calls the parsing function for each match.

As a late solution, the simplest way is to do it using ScriptDom-TSqlParser:
// https://michaeljswart.com/2014/04/removing-comments-from-sql/
// http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
public static string StripCommentsFromSQL(string SQL)
{
Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser =
new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);
System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;
Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments =
parser.Parse(new System.IO.StringReader(SQL), out errors);
// clear comments
string result = string.Join(
string.Empty,
fragments.ScriptTokenStream
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
.Select(x => x.Text));
return result;
}
or, instead of using the Microsoft parser, you can use the ANTLR4 TSqlLexer
or without any parser at all:
private static System.Text.RegularExpressions.Regex everythingExceptNewLines =
new System.Text.RegularExpressions.Regex("[^\r\n]");
// http://drizin.io/Removing-comments-from-SQL-scripts/
// http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
{
//based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = @"--(.*?)\r?\n";
var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = @"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = @"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = System.Text.RegularExpressions.Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new System.NotImplementedException();
},
System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
return noComments;
}

I don't know if C#/VB.net regex is special in some way but traditionally s/--.*// should work.

In PHP, I'm using this code to strip comments from SQL (single-line comments only):
$sqlComments = '#(([\'"`]).*?[^\\\]\2)|((?:\#|--).*?$)\s*|(?<=;)\s+#ms';
/* Commented version
$sqlComments = '#
(([\'"`]).*?[^\\\]\2) # $1 : Skip single & double quoted + backticked expressions
|((?:\#|--).*?$) # $3 : Match single line comments
\s* # Trim after comments
|(?<=;)\s+ # Trim after semi-colon
#msx';
*/
$uncommentedSQL = trim( preg_replace( $sqlComments, '$1', $sql ) );
preg_match_all( $sqlComments, $sql, $comments );
$extractedComments = array_filter( $comments[ 3 ] );
var_dump( $uncommentedSQL, $extractedComments );
To remove all comments see Regex to match MySQL comments

Replace with .Replace/.Regex

I am using Html.Raw(Html.Encode()) to allow a small subset of HTML. For example, I want bold, italic, code, etc. I am not sure it's the right method, and the code seems pretty ugly.
Input
Hello, this text will be [b]bold[/b]. [code]alert("Test...")[/code]
Output
Code
@Html.Raw(Html.Encode(Model.Body)
.Replace(Environment.NewLine, "<br />")
.Replace("[b]", "<b>")
.Replace("[/b]", "</b>")
.Replace("[code]", "<div class='codeContainer'><pre name='code' class='javascript'>")
.Replace("[/code]", "</pre></div>"))
My Solution
I want to do it a bit differently. Instead of BB tags I want to use simpler markers. For example, * will stand for bold: if I input This text is *bold*. it will be replaced with This text is <b>bold</b>.. Kind of like what this website uses, by the way.
Problem
To implement this I need some Regex, and I have little to no experience with it. I've searched many sites, but no luck.
My implementation looks something like this, but it fails because I can't replace a char with a string.
static void Main(string[] args)
{
string myString = "Hello, this text is *bold*, this text is also *bold*. And this is code: ~MYCODE~";
string findString = "\\*";
int firstMatch, nextMatch;
Match match = Regex.Match(myString, findString);
while (match.Success == true)
{
Console.WriteLine(match.Index);
firstMatch = match.Index;
match = match.NextMatch();
if (match.Success == true)
{
nextMatch = match.Index;
myString = myString[firstMatch] = "<b>"; // Ouch!
}
}
Console.ReadLine();
}

To implement this I need some Regex
Ah no, you don't need Regex. Manipulating HTML with Regex can lead to undesired effects. You could simply use MarkdownSharp, which by the way is what this site uses to safely render Markdown markup into HTML.
Like this:
var markdown = new Markdown();
string html = markdown.Transform(SomeTextContainingMarkDown);
Of course to polish this you would write an HTML helper so that in your view:
@Html.Markdown(Model.Body)
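If pulling in a Markdown library is not an option, the narrow case from the question (*bold* and ~code~ markers) can be handled by encoding first and only then substituting the markers. This is a rough sketch, not a Markdown implementation: the names LightMarkup and ToHtml are made up, the mapping of ~ to a <code> tag is an assumption, and markers never span line breaks.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class LightMarkup
{
    // Encodes the raw input, then turns *text* into <b>text</b>
    // and ~text~ into <code>text</code>.
    public static string ToHtml(string input)
    {
        // Encode first so user-supplied tags can never reach the output as HTML.
        string encoded = WebUtility.HtmlEncode(input ?? string.Empty);
        encoded = Regex.Replace(encoded, @"\*(.+?)\*", "<b>$1</b>");
        encoded = Regex.Replace(encoded, @"~(.+?)~", "<code>$1</code>");
        return encoded;
    }
}
```

Because the regex runs on already-encoded text, it only ever inserts the two tags it knows about, which is the part that goes wrong when HTML itself is manipulated with a regex.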

Simple text to HTML conversion

I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Environment.NewLine + Environment.NewLine, "</p><p>");
html.Replace(Environment.NewLine, "<br/>");
html.Append("</p>");
The above code does not generate correct HTML when there are more than two line breaks in a row. Output like <br/></p><p> is not good; the <br/> should be removed.

I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
public static string TextToHtml(string text)
{
text = HttpUtility.HtmlEncode(text);
text = text.Replace("\r\n", "\r");
text = text.Replace("\n", "\r");
text = text.Replace("\r", "<br>\r\n");
text = text.Replace("  ", " &nbsp;");
return text;
}
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
return text;
}

Your other option is to take the text box contents and, instead of trying to handle line and paragraph breaks, just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>

Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.

How about throwing it in a <pre> tag. Isn't that what it's there for anyway?

I know this is an old post, but I've recently been in a similar problem using C# with MVC4, so thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach(var line in description.Split('\n'))
{
Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix any variable or identifier with @, because of the Razor syntax in the ASP.NET MVC framework. I've shown this with Console.Write, but you should be able to figure out how to implement it in your specific project based on this :)

Combining all of the above, plus considering titles and subtitles within the text, gives this:
public static string ToHtml(this string text)
{
var sb = new StringBuilder();
var sr = new StringReader(text);
var str = sr.ReadLine();
while (str != null)
{
str = str.TrimEnd();
str = str.Replace("  ", " &nbsp;"); // strings are immutable, so keep the result
if (str.Length > 80)
{
sb.AppendLine($"<p>{str}</p>");
}
else if (str.Length > 0)
{
sb.AppendLine($"{str}<br/>");
}
str = sr.ReadLine();
}
return sb.ToString();
}
The snippet could be enhanced by defining rules for short strings.

I realize I am 13 years late with this answer,
but maybe someone else needs it.
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static Regex _breakRegex = new("(\r?\n)+");
private static Regex _paragraphBreakRegex = new("(?:\r?\n){2,}");
public static string ConvertTextToHtml(string description) {
// two or more consecutive line breaks separate paragraphs
string[] descriptionParagraphs = _paragraphBreakRegex.Split(description.Trim());
if (descriptionParagraphs.Length > 0)
{
description = string.Empty;
foreach (string line in descriptionParagraphs)
{
description += $"<p>{line}</p>";
}
}
// any line break left inside a paragraph becomes a <br/>
return _breakRegex.Replace(description, "<br/>");
}
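The paragraph splitting above and the HTML encoding from the first answer can be combined. The following is a minimal sketch under assumptions: the names PlainTextHtml and Convert are made up, and it uses System.Net.WebUtility.HtmlEncode in place of HttpUtility so it does not depend on System.Web.

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextHtml
{
    private static readonly Regex ParagraphBreak = new Regex(@"(?:\r?\n){2,}");
    private static readonly Regex LineBreak = new Regex(@"\r?\n");

    // Two or more line breaks start a new <p>; a single line break becomes <br/>.
    public static string Convert(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
        {
            return string.Empty;
        }
        var paragraphs = ParagraphBreak.Split(text.Trim())
            .Select(p => "<p>" + LineBreak.Replace(WebUtility.HtmlEncode(p), "<br/>") + "</p>");
        return string.Concat(paragraphs);
    }
}
```

Because runs of blank lines are consumed by the paragraph split, the <br/></p><p> sequence from the question cannot occur.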
