Acting on the indentation of a c# multiline string - c#

I want to write some HTML from C# (HTML is just an example; it could be any other language).
For example:
string div = @"<div class=""className"">
                  <span>Mon text</span>
               </div>";
will produce:
<div class="className">
                  <span>Mon text</span>
               </div>
which is not very nice from the HTML point of view.
The only way to get correctly indented HTML is to indent the C# code like this:
string div = @"<div class=""className"">
    <span>Mon text</span>
</div>";
We get the correctly indented HTML:
<div class="className">
    <span>Mon text</span>
</div>
But indenting the C# like this really hurts the readability of the code.
Is there a way to act on the indentation in the C# language?
If not, does someone have a better tip than:
string div = "<div class=\"className\">" + Environment.NewLine +
" <span>Mon text</span>" + Environment.NewLine +
"</div>";
and better than
var sbDiv = new StringBuilder();
sbDiv.AppendLine("<div class=\"className\">");
sbDiv.AppendLine(" <span>Mon text</span>");
sbDiv.AppendLine("</div>");
What I use as a solution:
Many thanks to @Yotam for his answer.
I wrote a small extension method to make the alignment "dynamic":
/// <summary>
/// Align a multiline string from the indentation of its first line
/// </summary>
/// <remarks>The </remarks>
/// <param name="source">The string to align</param>
/// <returns></returns>
public static string AlignFromFirstLine(this string source)
{
if (String.IsNullOrEmpty(source)) {
return source;
}
if (!source.StartsWith(Environment.NewLine)) {
throw new FormatException("String must start with a NewLine character.");
}
int indentationSize = source.Skip(Environment.NewLine.Length)
.TakeWhile(Char.IsWhiteSpace)
.Count();
string indentationStr = new string(' ', indentationSize);
return source.TrimStart().Replace($"\n{indentationStr}", "\n");
}
Then I can use it like this:
private string GetHtml(string className)
{
    return $@"
        <div class=""{className}"">
            <span>Texte</span>
        </div>".AlignFromFirstLine();
}
This returns the correct HTML:
<div class="myClassName">
    <span>Texte</span>
</div>
One limitation is that it only works with space indentation.
Any improvements are welcome!
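For the space-only limitation mentioned above, one possible direction is to strip the exact whitespace prefix of the first content line instead of a run of spaces. The following is only a sketch under assumptions: the names IndentationSketch and AlignFromFirstLineAnyWhitespace are made up for illustration, and unlike the original extension it normalizes line endings to \n and tolerates a string that does not start with a newline.

```csharp
using System;
using System.Linq;

public static class IndentationSketch
{
    // Hypothetical variant of AlignFromFirstLine: captures the exact whitespace
    // prefix (spaces and/or tabs) of the first content line and removes that
    // same prefix from every line that starts with it.
    public static string AlignFromFirstLineAnyWhitespace(this string source)
    {
        if (string.IsNullOrEmpty(source))
        {
            return source;
        }

        // Normalize line endings so \r\n and \n sources behave the same.
        string[] lines = source.Replace("\r\n", "\n").Split('\n');

        // Skip the empty first line produced by opening the literal with a newline.
        int first = lines.Length > 1 && lines[0].Length == 0 ? 1 : 0;

        string prefix = new string(
            lines[first].TakeWhile(c => c == ' ' || c == '\t').ToArray());

        var aligned = lines
            .Skip(first)
            .Select(line => line.StartsWith(prefix, StringComparison.Ordinal)
                ? line.Substring(prefix.Length)
                : line);

        return string.Join("\n", aligned);
    }
}
```

Lines that do not start with the captured prefix are left untouched, so mixed tab/space indentation is not repaired, only preserved.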

You could wrap the string to the next line to get the desired indentation:
string div =
@"
<div class=""className"">
    <span>Mon text</span>
</div>"
.TrimStart(); // removes the additional newline at the beginning
Another nice solution (disadvantage: it depends on the indentation level!):
string div = @"
    <div class=""className"">
        <span>Mon text</span>
    </div>".TrimStart().Replace("\n    ", "\n");
It simply removes the indentation from the string. Make sure the number of spaces in the first argument of Replace matches the number of spaces in your indentation.

I like this solution more, but how about:
string div = "<div class='className'>\n"
+ " <span>Mon text</span>\n"
+ "</div>";
This gets rid of some clutter:
Replace " inside strings with ' so that you don't need to escape the quote. (Single quotes in HTML appear to be legal.)
You can then also use regular "" string literals instead of @"".
Use \n instead of Environment.NewLine.
Note that the string concatenation is performed during compilation, by the compiler. (See also this and this blog post on the subject by Eric Lippert, who previously worked on the C# compiler.) There is no runtime performance penalty.

Inspired by trimIndent() in Kotlin.
This code:
var x = @"
    anything
      you
    want
    ".TrimIndent();
will produce a string:

anything
  you
want

or "\nanything\n  you\nwant\n"
Implementation:
public static string TrimIndent(this string s)
{
string[] lines = s.Split('\n');
IEnumerable<int> firstNonWhitespaceIndices = lines
.Skip(1)
.Where(it => it.Trim().Length > 0)
.Select(IndexOfFirstNonWhitespace);
int firstNonWhitespaceIndex;
if (firstNonWhitespaceIndices.Any()) firstNonWhitespaceIndex = firstNonWhitespaceIndices.Min();
else firstNonWhitespaceIndex = -1;
if (firstNonWhitespaceIndex == -1) return s;
IEnumerable<string> unindentedLines = lines.Select(it => UnindentLine(it, firstNonWhitespaceIndex));
return String.Join("\n", unindentedLines);
}
private static string UnindentLine(string line, int firstNonWhitespaceIndex)
{
if (firstNonWhitespaceIndex < line.Length)
{
if (line.Substring(0, firstNonWhitespaceIndex).Trim().Length != 0)
{
return line;
}
return line.Substring(firstNonWhitespaceIndex, line.Length - firstNonWhitespaceIndex);
}
return line.Trim().Length == 0 ? "" : line;
}
private static int IndexOfFirstNonWhitespace(string s)
{
char[] chars = s.ToCharArray();
for (int i = 0; i < chars.Length; i++)
{
if (chars[i] != ' ' && chars[i] != '\t') return i;
}
return -1;
}

If it is one long string, you can always keep the string in a text file and read it into your variable, e.g.
string text = File.ReadAllText(@"c:\file.txt", Encoding.UTF8);
This way you can format it any way you want using a text editor, and it won't negatively affect the look of your code.
If you're changing parts of the string on the fly, then StringBuilder is your best option. Alternatively, if you do decide to read the string from a text file, you could include {0} placeholders in it and then use string.Format(text, "text1", "text2", etc.) to fill in the required parts.
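A minimal sketch of the template-file idea, assuming the template lives at a path the caller supplies. The names TemplateSketch and Render are hypothetical, not part of any library:

```csharp
using System;
using System.IO;
using System.Text;

public static class TemplateSketch
{
    // Reads a template from disk and fills its {0}, {1}, ... placeholders.
    // templatePath is a hypothetical location chosen by the caller.
    public static string Render(string templatePath, params object[] values)
    {
        string template = File.ReadAllText(templatePath, Encoding.UTF8);
        return string.Format(template, values);
    }
}
```

One thing to keep in mind with this approach: string.Format treats every { and } as placeholder syntax, so literal braces in the template (for example in inline CSS or JavaScript) have to be doubled as {{ and }}.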

Related

Get Text Between Two Strings (HTML) in C#

I am trying to parse a website's HTML and then get text between two strings.
I wrote a small function to get text between two strings.
public string getBetween(string strSource, string strStart, string strEnd)
{
int Start, End;
if (strSource.Contains(strStart) && strSource.Contains(strEnd))
{
Start = strSource.IndexOf(strStart, 0) + strStart.Length;
End = strSource.IndexOf(strEnd, Start);
return strSource.Substring(Start, End - Start);
}
else
{
return string.Empty;
}
}
I have the HTML stored in a string called 'html'. Here is a part of the HTML that I am trying to parse:
<div class="info">
<div class="content">
<div class="address">
<h3>Andrew V. Kenny</h3>
<div class="adr">
67 Romines Mill Road<br/>Dallas, TX 75204 </div>
</div>
<p>Curious what <strong>Andrew</strong> means? Click here to find out!</p>
So, I use my function like this.
string m2 = getBetween(html, "<div class=\"address\">", "<p>Curious what");
string fullName = getBetween(m2, "<h3>", "</h3>");
string fullAddress = getBetween(m2, "<div class=\"adr\">", "<br/>");
string city = getBetween(m2, "<br/>", "</div>");
The output of the full name works fine, but the others have additional spaces in them for some reason. I tried various ways to avoid them (such as completely copying the spaces from the source and adding them in my function) but it didn't work.
I get an output like this:
fullName = "Andrew V. Kenny"
fullAddress = " 67 Romines Mill Road"
city = "Dallas, TX 75204 "
There are spaces in the city and address which I don't know how to avoid.
Trim the strings and the unnecessary spaces will be gone:
fullName = fullName.Trim ();
fullAddress = fullAddress.Trim ();
city = city.Trim ();
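If the trimming should always happen, it can be folded into the helper itself. This is a sketch, not the asker's code: ExtractSketch and GetBetweenTrimmed are made-up names, and it also returns an empty string when the end marker is not found after the start marker, a case in which the original getBetween would throw.

```csharp
using System;

public static class ExtractSketch
{
    // Same idea as getBetween above, but trims the result before returning it.
    // Returns an empty string when either marker is missing.
    public static string GetBetweenTrimmed(string source, string start, string end)
    {
        int startIndex = source.IndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0) return string.Empty;
        startIndex += start.Length;

        int endIndex = source.IndexOf(end, startIndex, StringComparison.Ordinal);
        if (endIndex < 0) return string.Empty;

        // Trim() removes leading and trailing whitespace, including line breaks.
        return source.Substring(startIndex, endIndex - startIndex).Trim();
    }
}
```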

Regular Expression to ignore empty href

I have written one function to replace value of href with somevalue + original href value
say:-
<a href="/somepage.htm" id="test">
replace with
<a href="http//www.stackoverflow.com/somepage.htm" id="test">
Places where no replacement needed:-
<a href="http//www.stackoverflow.com/somepage.htm" id="test">
<a href="#" id="test">
<a href="javascript:alert('test');" id="test">
<a href="" id="test">
I have written following method, working with all the cases but not with blank value of href
public static string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (String.IsNullOrEmpty(text))
{
return text;
}
String value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
return value.Replace(absoluteUrl + "/", absoluteUrl);
}
I wrote ?!http|javascript|# to ignore http, javascript and #, and it works for those cases. But if we take the following part
(?!http|javascript|#)(.*?)
and replace the * with +
(?!http|javascript|#)(.+?)
it still does not work for the empty case.
Changing * to + does not work, because it does the opposite of what you want:
* means "zero or more"
+ means "one or more"
So with + you are forcing content to be present, rather than allowing it to be missing.
The placement matters too. The * at that position applies to the preceding . and together they mean "zero or more characters". So this part already does not require any content. Therefore, since your regex currently does not work with empty content, something else seems to be requiring it.
Looking at the preceding expressions:
(?!http|javascript|#)(.*?)
The ?! is a zero-width negative lookahead. Zero-width. Negative. That means that it will not require any content either.
So, I got your code, pasted it into the online compiler, then I fed it with your example <a href="" id="test">:
using System.IO;
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
string text = "<a href=\"\" id=\"test\">";
string pattern = "src|href";
string absoluteUrl = "YADA";
string value = Regex.Replace(text, "<(.*?)(" + pattern + ")=\"(?!http|javascript|#)(.*?)\"(.*?)>", "<$1$2=\"" + absoluteUrl + "$3\"$4>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Console.WriteLine(value);
}
}
and guess what it works:
Compiling the source code....
$mcs main.cs -out:demo.exe 2>&1
Executing the program....
$mono demo.exe
<a href="YADA" id="test">
So either you are not telling the truth, or you changed the code when posting it here, or something else entirely is messed up in your code, sorry.
EDIT:
So, it turned out that the href="" case was meant to be ignored.
Then the simplest thing you can do is add another negative lookahead that blocks the href="" case explicitly. However, note that the placement of that group will be different. The current group is inside the quotes of the href, so it cannot "peek" at what the whole quoted value looks like. The new group must come before the quotes.
"<(.*?)(" + pattern + ")=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>"
Note that just before the first quote of the href, I've added a (?!\"\") that ensures the opening quote is not immediately followed by a closing quote.
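To see the extra lookahead in action, here is a small sketch that hard-codes the src|href alternation into the pattern shown above. The names HrefSketch and MakeAbsolute are made up for illustration, and http://example.com in the usage note is just a placeholder base URL.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefSketch
{
    // Pattern from the answer above, with the (?!\"\") lookahead placed
    // before the opening quote so that href="" is left untouched.
    private const string Pattern =
        "<(.*?)(src|href)=(?!\"\")\"(?!http|javascript|#)(.*?)\"(.*?)>";

    public static string MakeAbsolute(string text, string absoluteUrl)
    {
        return Regex.Replace(
            text,
            Pattern,
            "<$1$2=\"" + absoluteUrl + "$3\"$4>",
            RegexOptions.IgnoreCase);
    }
}
```

With this sketch, MakeAbsolute("<a href=\"/somepage.htm\" id=\"test\">", "http://example.com") rewrites the link, while the href="" and href="#" tags come back unchanged.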
I know that you are asking for a regex.
But here is an alternative, because I think using Uri.IsWellFormedUriString is worth it.
This way you can also reuse the helper functions:
public string RelativeToAbsoluteURLS(string text, string absoluteUrl, string pattern = "src|href")
{
if (isHrefRelativeURIPath(text)){
// Strip leading slashes from the relative path before prefixing the absolute URL.
text = absoluteUrl + "/" + System.Text.RegularExpressions.Regex.Replace(text, @"^\/+", "");
}
return text;
}
public bool isHrefRelativeURIPath(string value) {
if (isLink(value) ||
value.StartsWith("#") ||
value.StartsWith("javascript"))
{
return false;
}
// Others Custom exclusions
return true;
}
public bool isLink(string value) {
if (String.IsNullOrEmpty(value))
return false;
return Uri.IsWellFormedUriString("http://" + value, UriKind.Absolute);
}

Escape command line arguments in c#

Short version:
Is it enough to wrap the argument in quotes and escape \ and " ?
Code version
I want to pass the command line arguments string[] args to another process using ProcessStartInfo.Arguments.
ProcessStartInfo info = new ProcessStartInfo();
info.FileName = Application.ExecutablePath;
info.UseShellExecute = true;
info.Verb = "runas"; // Provides Run as Administrator
info.Arguments = EscapeCommandLineArguments(args);
Process.Start(info);
The problem is that I get the arguments as an array and must merge them into a single string. An argument could be crafted to trick my program:
my.exe "C:\Documents and Settings\MyPath \" --kill-all-humans \" except fry"
According to this answer I have created the following function to escape a single argument, but I might have missed something.
private static string EscapeCommandLineArguments(string[] args)
{
string arguments = "";
foreach (string arg in args)
{
arguments += " \"" +
arg.Replace ("\\", "\\\\").Replace("\"", "\\\"") +
"\"";
}
return arguments;
}
Is this good enough or is there any framework function for this?
It's more complicated than that, though!
I was having a related problem (writing a front-end .exe that calls the back-end with all parameters passed plus some extra ones), so I looked at how people do that and ran into your question. Initially all seemed good doing it as you suggest: arg.Replace(@"\", @"\\").Replace(quote, @"\" + quote).
However, when I call with arguments c:\temp a\\b, this gets passed as c:\temp and a\\b, which leads to the back-end being called with "c:\\temp" "a\\\\b". That is incorrect, because it will be received as two arguments c:\\temp and a\\\\b, which is not what we wanted! We have been overzealous with escapes (Windows is not Unix!).
So I read http://msdn.microsoft.com/en-us/library/system.environment.getcommandlineargs.aspx in detail, and it describes how those cases are handled: backslashes are treated as escape characters only in front of a double quote.
There is a twist in how multiple \ are handled, and the explanation can leave one dizzy for a while. I'll try to rephrase the unescape rule here: say we have a substring of N \ followed by ". When unescaping, we replace that substring with int(N/2) \, and if and only if N was odd, we add " at the end.
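The rule can be written down as a tiny function. This is only an illustration of the arithmetic, not a command-line parser; the names BackslashRuleSketch and DecodeRunBeforeQuote are made up. For even N the quote that follows is consumed as a delimiter and does not appear in the result.

```csharp
using System;

public static class BackslashRuleSketch
{
    // Decodes a run of n backslashes that is immediately followed by a double quote.
    public static string DecodeRunBeforeQuote(int n)
    {
        // Every pair of backslashes collapses into one: int(N/2).
        string backslashes = new string('\\', n / 2);

        // An odd count means the last backslash escapes the quote,
        // so the quote becomes a literal character in the argument.
        bool quoteIsLiteral = n % 2 == 1;

        return quoteIsLiteral ? backslashes + "\"" : backslashes;
    }
}
```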
The encoding for such decoding goes like this: for an argument, find each substring of zero or more \ followed by " and replace it with twice as many \, followed by \". Which we can do like so:
s = Regex.Replace(arg, @"(\\*)" + "\"", @"$1$1\" + "\"");
That's all...
PS. ... not. Wait, wait - there is more! :)
We did the encoding correctly, but there is a twist because you are enclosing all parameters in double quotes (in case some of them contain spaces). There is a boundary issue: if a parameter ends with \, adding " after it will break the meaning of the closing quote. For example, c:\one\ two is parsed to c:\one\ and two, then re-assembled to "c:\one\" "two", which will be (mis)understood as one argument c:\one" two (I tried that, I am not making it up). So in addition we need to check whether the argument ends with \ and, if so, double the number of backslashes at the end, like so:
s = "\"" + Regex.Replace(s, @"(\\+)$", @"$1$1") + "\"";
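Putting the two steps together gives something like the following sketch. The two Regex.Replace calls are the ones from this answer; the wrapper names ArgumentEscapeSketch, EscapeArgument and EscapeArguments are hypothetical, and the sketch always quotes, even when an argument would not need it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ArgumentEscapeSketch
{
    public static string EscapeArgument(string arg)
    {
        // Step 1: double any run of backslashes that precedes a quote,
        // then escape the quote itself.
        string s = Regex.Replace(arg, @"(\\*)" + "\"", @"$1$1\" + "\"");

        // Step 2: double trailing backslashes so they do not escape the
        // closing quote, then wrap the whole argument in quotes.
        return "\"" + Regex.Replace(s, @"(\\+)$", @"$1$1") + "\"";
    }

    // Joins the escaped arguments with single spaces, ready for
    // ProcessStartInfo.Arguments.
    public static string EscapeArguments(string[] args)
    {
        return string.Join(" ", args.Select(EscapeArgument));
    }
}
```

Backslashes that are neither followed by a quote nor at the end of the argument are left alone, which is what keeps c:\temp and a\\b intact.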
My answer was similar to Nas Banov's answer but I wanted double quotes only if necessary.
Cutting out extra unnecessary double quotes
My code avoids putting double quotes around every argument unnecessarily, which is important when you are getting close to the character limit for parameters.
/// <summary>
/// Encodes an argument for passing into a program
/// </summary>
/// <param name="original">The value that should be received by the program</param>
/// <returns>The value which needs to be passed to the program for the original value
/// to come through</returns>
public static string EncodeParameterArgument(string original)
{
if( string.IsNullOrEmpty(original))
return original;
string value = Regex.Replace(original, @"(\\*)" + "\"", @"$1\$0");
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"");
return value;
}
// This is an EDIT
// Note that this version does the same but handles new lines in the arguments
public static string EncodeParameterArgumentMultiLine(string original)
{
if (string.IsNullOrEmpty(original))
return original;
string value = Regex.Replace(original, @"(\\*)" + "\"", @"$1\$0");
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"", RegexOptions.Singleline);
return value;
}
Explanation
To escape the backslashes and double quotes correctly, you can just replace any run of backslashes followed by a double quote with:
string value = Regex.Replace(original, @"(\\*)" + "\"", @"\$1$0");
That is, one extra backslash plus twice the original backslashes, followed by the original double quote: '\' + originalbackslashes + originalbackslashes + '"'. I used $1$0 since $0 already contains the original backslashes and the original double quote, which makes the replacement nicer to read.
value = Regex.Replace(value, @"^(.*\s.*?)(\\*)$", "\"$1$2$2\"");
This can only ever match an entire line that contains whitespace.
If it matches, it adds double quotes to the beginning and end.
If there were originally backslashes at the end of the argument, they will not have been escaped; now that there is a double quote at the end, they need to be. So they are duplicated, which escapes them all and prevents them from unintentionally escaping the final double quote.
It does a minimal match for the first section so that the last .*? doesn't eat into the final backslashes.
Output
These inputs produce the following outputs (input -> output):
hello -> hello
\hello\12\3\ -> \hello\12\3\
hello world -> "hello world"
\"hello\" -> \\\"hello\\\"
\"hello\ world -> "\\\"hello\ world"
\"hello\\\ world\ -> "\\\"hello\\\ world\\"
hello world\\ -> "hello world\\\\"
I have ported a C++ function from the Everyone quotes command line arguments the wrong way article.
It works fine, but you should note that cmd.exe interprets command line differently. If (and only if, like the original author of article noted) your command line will be interpreted by cmd.exe you should also escape shell metacharacters.
/// <summary>
/// This routine appends the given argument to a command line such that
/// CommandLineToArgvW will return the argument string unchanged. Arguments
/// in a command line should be separated by spaces; this function does
/// not add these spaces.
/// </summary>
/// <param name="argument">Supplies the argument to encode.</param>
/// <param name="force">
/// Supplies an indication of whether we should quote the argument even if it
/// does not contain any characters that would ordinarily require quoting.
/// </param>
private static string EncodeParameterArgument(string argument, bool force = false)
{
if (argument == null) throw new ArgumentNullException(nameof(argument));
// Unless we're told otherwise, don't quote unless we actually
// need to do so --- hopefully avoid problems if programs won't
// parse quotes properly
if (force == false
&& argument.Length > 0
&& argument.IndexOfAny(" \t\n\v\"".ToCharArray()) == -1)
{
return argument;
}
var quoted = new StringBuilder();
quoted.Append('"');
var numberBackslashes = 0;
foreach (var chr in argument)
{
switch (chr)
{
case '\\':
numberBackslashes++;
continue;
case '"':
// Escape all backslashes and the following
// double quotation mark.
quoted.Append('\\', numberBackslashes*2 + 1);
quoted.Append(chr);
break;
default:
// Backslashes aren't special here.
quoted.Append('\\', numberBackslashes);
quoted.Append(chr);
break;
}
numberBackslashes = 0;
}
// Escape all backslashes, but let the terminating
// double quotation mark we add below be interpreted
// as a metacharacter.
quoted.Append('\\', numberBackslashes*2);
quoted.Append('"');
return quoted.ToString();
}
I was running into issues with this, too. Instead of unparsing args, I went with taking the full original commandline and trimming off the executable. This had the additional benefit of keeping whitespace in the call, even if it isn't needed/used. It still has to chase escapes in the executable, but that seemed easier than the args.
var commandLine = Environment.CommandLine;
var argumentsString = "";
if(args.Length > 0)
{
// Re-escaping args to be the exact same as they were passed is hard and misses whitespace.
// Use the original command line and trim off the executable to get the args.
var argIndex = -1;
if(commandLine[0] == '"')
{
//Double-quotes mean we need to dig to find the closing double-quote.
var backslashPending = false;
var secondDoublequoteIndex = -1;
for(var i = 1; i < commandLine.Length; i++)
{
if(backslashPending)
{
backslashPending = false;
continue;
}
if(commandLine[i] == '\\')
{
backslashPending = true;
continue;
}
if(commandLine[i] == '"')
{
secondDoublequoteIndex = i + 1;
break;
}
}
argIndex = secondDoublequoteIndex;
}
else
{
// No double-quotes, so args begin after first whitespace.
argIndex = commandLine.IndexOf(" ", System.StringComparison.Ordinal);
}
if(argIndex != -1)
{
argumentsString = commandLine.Substring(argIndex + 1);
}
}
Console.WriteLine("argumentsString: " + argumentsString);
I published a small project on GitHub that handles most issues with command line encoding/escaping:
https://github.com/ericpopivker/Command-Line-Encoder
There is a CommandLineEncoder.Utils.cs class, as well as Unit Tests that verify the Encoding/Decoding functionality.
I wrote you a small sample to show you how to use escape chars in command line.
public static string BuildCommandLineArgs(List<string> argsList)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (string arg in argsList)
{
sb.Append("\"\"" + arg.Replace("\"", @"\" + "\"") + "\"\" ");
}
if (sb.Length > 0)
{
sb = sb.Remove(sb.Length - 1, 1);
}
return sb.ToString();
}
And here is a test method:
List<string> myArgs = new List<string>();
myArgs.Add("test\"123"); // test"123
myArgs.Add("test\"\"123\"\"234"); // test""123""234
myArgs.Add("test123\"\"\"234"); // test123"""234
string cmargs = BuildCommandLineArgs(myArgs);
// result: ""test\"123"" ""test\"\"123\"\"234"" ""test123\"\"\"234""
// when you pass this result to your app, you will get this args list:
// test"123
// test""123""234
// test123"""234
The point is to wrap each arg in double-double quotes ( ""arg"" ) and to replace every quote inside the arg value with an escaped quote ( test\"123 ).
static string BuildCommandLineFromArgs(params string[] args)
{
if (args == null)
return null;
string result = "";
if (Environment.OSVersion.Platform == PlatformID.Unix
||
Environment.OSVersion.Platform == PlatformID.MacOSX)
{
foreach (string arg in args)
{
result += (result.Length > 0 ? " " : "")
+ arg
// Escape backslashes first so the backslashes added below are not doubled.
.Replace(@"\", @"\\")
.Replace(@" ", @"\ ")
.Replace("\t", "\\\t")
.Replace(@"""", @"\""")
.Replace(@"<", @"\<")
.Replace(@">", @"\>")
.Replace(@"|", @"\|")
.Replace(@"#", @"\#")
.Replace(@"&", @"\&");
}
}
else //Windows family
{
bool enclosedInApo, wasApo;
string subResult;
foreach (string arg in args)
{
enclosedInApo = arg.LastIndexOfAny(
new char[] { ' ', '\t', '|', '#', '^', '<', '>', '&'}) >= 0;
wasApo = enclosedInApo;
subResult = "";
for (int i = arg.Length - 1; i >= 0; i--)
{
switch (arg[i])
{
case '"':
subResult = @"\""" + subResult;
wasApo = true;
break;
case '\\':
subResult = (wasApo ? @"\\" : @"\") + subResult;
break;
default:
subResult = arg[i] + subResult;
wasApo = false;
break;
}
}
result += (result.Length > 0 ? " " : "")
+ (enclosedInApo ? "\"" + subResult + "\"" : subResult);
}
}
return result;
}
An Alternative Approach
If you're passing a complex object such as nested JSON and you have control over the system that's receiving the command line arguments, it's far easier to just encode the command line arg/s as base64 and then decode them from the receiving system.
See here: Encode/Decode String to/from Base64
Use Case: I needed to pass a JSON object that contained an XML string in one of the properties which was overly complicated to escape. This solved it.
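A minimal sketch of the Base64 round trip, assuming both sides agree on UTF-8. The names Base64ArgumentSketch, Encode and Decode are made up for illustration; Convert.ToBase64String and Convert.FromBase64String are the standard .NET calls.

```csharp
using System;
using System.Text;

public static class Base64ArgumentSketch
{
    // Sender side: turn the payload into a single Base64 token.
    public static string Encode(string value) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(value));

    // Receiver side: recover the original payload from args[i].
    public static string Decode(string encoded) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
}
```

The standard Base64 alphabet contains only letters, digits, +, / and =, so the encoded argument has no spaces, quotes or backslashes that the quoting rules discussed above would need to handle. The encoded form is roughly a third longer than the original, which matters near the command line length limit.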
Does a nice job of adding arguments, but doesn't escape. Added comment in method where escape sequence should go.
public static string ApplicationArguments()
{
List<string> args = Environment.GetCommandLineArgs().ToList();
args.RemoveAt(0); // remove executable
StringBuilder sb = new StringBuilder();
foreach (string s in args)
{
// todo: add escape double quotes here
sb.Append(string.Format("\"{0}\" ", s)); // wrap all args in quotes
}
return sb.ToString().Trim();
}
Copy sample code function from this url:
http://csharptest.net/529/how-to-correctly-escape-command-line-arguments-in-c/index.html
You can get command line to execute for example like this:
String cmdLine = EscapeArguments(Environment.GetCommandLineArgs().Skip(1).ToArray());
Skip(1) skips executable name.

Simple text to HTML conversion

I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Environment.NewLine + Environment.NewLine, "</p><p>");
html.Replace(Environment.NewLine, "<br/>");
html.Append("</p>");
The above code doesn't work right, as in generating correct html, if there are more than 2 line breaks in a row. Having html like <br/></p><p> is not good; the <br/> can be removed.
I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
public static string TextToHtml(string text)
{
text = HttpUtility.HtmlEncode(text);
text = text.Replace("\r\n", "\r");
text = text.Replace("\n", "\r");
text = text.Replace("\r", "<br>\r\n");
text = text.Replace("  ", " &nbsp;");
return text;
}
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
return text;
}
Your other option is to take the text box contents and, instead of trying to handle line and paragraph breaks, just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>
Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.
How about throwing it in a <pre> tag. Isn't that what it's there for anyway?
I know this is an old post, but I've recently been in a similar problem using C# with MVC4, so thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach(var line in description.Split('\n'))
{
Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix any variable or identifier with @, because of the Razor syntax in the ASP.NET MVC framework. I've shown this with Console.Write, but you should be able to figure out how to implement this in your specific project based on this :)
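Note that the simplified loop above writes the raw text into the markup. Outside Razor (which encodes output by default), a sketch that also HTML-encodes each line and skips empty ones might look like this. ParagraphSketch and LinesToParagraphs are hypothetical names; WebUtility.HtmlEncode is the standard System.Net helper.

```csharp
using System;
using System.Linq;
using System.Net;

public static class ParagraphSketch
{
    // Wraps every non-empty line in <p>...</p> after HTML-encoding it.
    public static string LinesToParagraphs(string text)
    {
        var paragraphs = text
            .Replace("\r\n", "\n")              // normalize Windows line endings
            .Split('\n')
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)     // drop blank lines instead of emitting <p></p>
            .Select(line => "<p>" + WebUtility.HtmlEncode(line) + "</p>");

        return string.Concat(paragraphs);
    }
}
```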
Combining all previous plus considering titles and subtitles within the text comes up with this:
public static string ToHtml(this string text)
{
var sb = new StringBuilder();
var sr = new StringReader(text);
var str = sr.ReadLine();
while (str != null)
{
str = str.TrimEnd();
str = str.Replace("  ", " &nbsp;");
if (str.Length > 80)
{
sb.AppendLine($"<p>{str}</p>");
}
else if (str.Length > 0)
{
sb.AppendLine($"{str}<br/>");
}
str = sr.ReadLine();
}
return sb.ToString();
}
The snippet could be enhanced by defining rules for short strings.
I realize I am 13 years late with this answer,
but maybe someone else needs it.
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static Regex _breakRegex = new("(\r?\n)+");
private static Regex _paragraphBreakRegex = new("(?:\r?\n){2,}");
public static string ConvertTextToHtml(string description) {
string[] descriptionParagraphs = _paragraphBreakRegex.Split(description.Trim());
if (descriptionParagraphs.Length > 0)
{
description = string.Empty;
foreach (string line in descriptionParagraphs)
{
description += $"<p>{line}</p>";
}
}
return _breakRegex.Replace(description, "<br/>");
}

Apostrophe (') in XPath query

I use the following XPath query to list the objects under a site: ListObject[@Title='SomeValue']. SomeValue is dynamic. This query works as long as SomeValue does not contain an apostrophe ('). I tried using an escape sequence too, but it didn't work.
What am I doing wrong?
This is surprisingly difficult to do.
Take a look at the XPath Recommendation, and you'll see that it defines a literal as:
Literal ::= '"' [^"]* '"'
| "'" [^']* "'"
Which is to say, string literals in XPath expressions can contain apostrophes or double quotes but not both.
You can't use escaping to get around this. A literal like this:
'Some&apos;Value'
will match this XML text:
Some&amp;apos;Value
This does mean that it's possible for there to be a piece of XML text that you can't generate an XPath literal to match, e.g.:
<elm att="&quot;&apos;"/>
But that doesn't mean it's impossible to match that text with XPath, it's just tricky. In any case where the value you're trying to match contains both single and double quotes, you can construct an expression that uses concat to produce the text that it's going to match:
elm[@att=concat('"', "'")]
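As a quick check of the concat trick, the sketch below loads a one-element document whose attribute value is a double quote followed by an apostrophe and selects it with that expression. XPathConcatSketch and MatchesBothQuotes are made-up names; the XmlDocument calls are the standard System.Xml ones.

```csharp
using System;
using System.Xml;

public static class XPathConcatSketch
{
    // Returns true when the concat() expression finds the element whose
    // attribute value consists of a double quote followed by an apostrophe.
    public static bool MatchesBothQuotes()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root><elm att=\"&quot;&apos;\"/></root>");

        // XPath as the engine sees it: /root/elm[@att=concat('"', "'")]
        XmlNode node = doc.SelectSingleNode("/root/elm[@att=concat('\"', \"'\")]");
        return node != null;
    }
}
```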
So that leads us to this, which is a lot more complicated than I'd like it to be:
/// <summary>
/// Produce an XPath literal equal to the value if possible; if not, produce
/// an XPath expression that will match the value.
///
/// Note that this function will produce very long XPath expressions if a value
/// contains a long run of double quotes.
/// </summary>
/// <param name="value">The value to match.</param>
/// <returns>If the value contains only single or double quotes, an XPath
/// literal equal to the value. If it contains both, an XPath expression,
/// using concat(), that evaluates to the value.</returns>
static string XPathLiteral(string value)
{
// if the value contains only single or double quotes, construct
// an XPath literal
if (!value.Contains("\""))
{
return "\"" + value + "\"";
}
if (!value.Contains("'"))
{
return "'" + value + "'";
}
// if the value contains both single and double quotes, construct an
// expression that concatenates all non-double-quote substrings with
// the quotes, e.g.:
//
// concat("foo", '"', "bar")
StringBuilder sb = new StringBuilder();
sb.Append("concat(");
string[] substrings = value.Split('\"');
for (int i = 0; i < substrings.Length; i++ )
{
bool needComma = (i>0);
if (substrings[i] != "")
{
if (i > 0)
{
sb.Append(", ");
}
sb.Append("\"");
sb.Append(substrings[i]);
sb.Append("\"");
needComma = true;
}
if (i < substrings.Length - 1)
{
if (needComma)
{
sb.Append(", ");
}
sb.Append("'\"'");
}
}
sb.Append(")");
return sb.ToString();
}
And yes, I tested it with all the edge cases. That's why the logic is so stupidly complex:
foreach (string s in new[]
{
"foo", // no quotes
"\"foo", // double quotes only
"'foo", // single quotes only
"'foo\"bar", // both; double quotes in mid-string
"'foo\"bar\"baz", // multiple double quotes in mid-string
"'foo\"", // string ends with double quotes
"'foo\"\"", // string ends with run of double quotes
"\"'foo", // string begins with double quotes
"\"\"'foo", // string begins with run of double quotes
"'foo\"\"bar" // run of double quotes in mid-string
})
{
Console.Write(s);
Console.Write(" = ");
Console.WriteLine(XPathLiteral(s));
XmlElement elm = d.CreateElement("test");
d.DocumentElement.AppendChild(elm);
elm.SetAttribute("value", s);
string xpath = "/root/test[@value = " + XPathLiteral(s) + "]";
if (d.SelectSingleNode(xpath) == elm)
{
Console.WriteLine("OK");
}
else
{
Console.WriteLine("Should have found a match for {0}, and didn't.", s);
}
}
Console.ReadKey();
}
I ported Robert's answer to Java (tested in 1.6):
/**
 * Produce an XPath literal equal to the value if possible; if not, produce
 * an XPath expression that will match the value.
 *
 * Note that this function will produce very long XPath expressions if a value
 * contains a long run of double quotes.
 *
 * @param value The value to match.
 * @return If the value contains only single or double quotes, an XPath
 * literal equal to the value. If it contains both, an XPath expression,
 * using concat(), that evaluates to the value.
 */
public static String XPathLiteral(String value) {
if(!value.contains("\"") && !value.contains("'")) {
return "'" + value + "'";
}
// if the value contains only single or double quotes, construct
// an XPath literal
if (!value.contains("\"")) {
System.out.println("Doesn't contain Quotes");
String s = "\"" + value + "\"";
System.out.println(s);
return s;
}
if (!value.contains("'")) {
System.out.println("Doesn't contain apostrophes");
String s = "'" + value + "'";
System.out.println(s);
return s;
}
// if the value contains both single and double quotes, construct an
// expression that concatenates all non-double-quote substrings with
// the quotes, e.g.:
//
// concat("foo", '"', "bar")
StringBuilder sb = new StringBuilder();
sb.append("concat(");
// A limit of -1 keeps trailing empty strings, which String.split(regex)
// would otherwise drop; without it, double quotes at the end of the value
// (including runs of them) are lost.
String[] substrings = value.split("\"", -1);
for (int i = 0; i < substrings.length; i++) {
boolean needComma = (i > 0);
if (!substrings[i].equals("")) {
if (i > 0) {
sb.append(", ");
}
sb.append("\"");
sb.append(substrings[i]);
sb.append("\"");
needComma = true;
}
if (i < substrings.length - 1) {
if (needComma) {
sb.append(", ");
}
sb.append("'\"'");
}
System.out.println("Step " + i + ": " + sb.toString());
}
// No special handling is needed when the value ends in an apostrophe:
// apostrophes stay inside the double-quoted pieces.
sb.append(")");
String s = sb.toString();
System.out.println(s);
return s;
}
Hope this helps somebody!
EDIT: After a heavy unit testing session, and checking the XPath Standards, I have revised my function as follows:
public static string ToXPath(string value) {
const string apostrophe = "'";
const string quote = "\"";
if(value.Contains(quote)) {
if(value.Contains(apostrophe)) {
throw new XPathException("Illegal XPath string literal.");
} else {
return apostrophe + value + apostrophe;
}
} else {
return quote + value + quote;
}
}
It appears that XPath 1.0 has no character-escaping mechanism for string literals at all; it's quite primitive, really. Evidently my original code only worked by coincidence. My apologies for misleading anyone!
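To see these literal rules in action, here is a minimal Java sketch using the JDK's javax.xml.xpath (an XPath 1.0 engine). The class name XPathLiteralRules and the eval helper are made up for illustration. It shows that each quote character can appear inside a literal delimited by the other one, and that a value containing both can still be built with concat(), as the other answers do, rather than rejected.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class XPathLiteralRules {

    // Evaluates an XPath expression against a trivial document and
    // returns the result converted to a string.
    static String eval(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader("<root/>")));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // An apostrophe is fine inside a double-quoted literal.
        System.out.println(eval("\"it's\""));
        // A double quote is fine inside a single-quoted literal.
        System.out.println(eval("'say \"hi\"'"));
        // Both at once: no single literal can hold the value, so the
        // pieces are joined with concat().
        System.out.println(eval("concat(\"it's \", '\"', \"quoted\", '\"')"));
    }
}
```

So the revised ToXPath above is safe but stricter than it has to be: it throws on values that concat() could still express.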
Original answer below for reference only - please ignore
For safety, make sure that every occurrence of the 5 predefined XML entities in your XPath string is escaped, e.g.
public static string ToXPath(string value) {
return "'" + XmlEncode(value) + "'";
}
public static string XmlEncode(string value) {
StringBuilder text = new StringBuilder(value);
text.Replace("&", "&amp;");
text.Replace("'", "&apos;");
text.Replace(@"""", "&quot;");
text.Replace("<", "&lt;");
text.Replace(">", "&gt;");
return text.ToString();
}
I have done this before and it works fine. If it doesn't work for you, maybe there is some additional context to the problem that you need to make us aware of.
By far the best approach to this problem is to use the facilities provided by your XPath library to declare an XPath-level variable that you can reference in the expression. The variable value can then be any string in the host programming language, and isn't subject to the restrictions of XPath string literals. For example, in Java with javax.xml.xpath:
XPathFactory xpf = XPathFactory.newInstance();
final Map<String, Object> variables = new HashMap<>();
xpf.setXPathVariableResolver(new XPathVariableResolver() {
public Object resolveVariable(QName name) {
return variables.get(name.getLocalPart());
}
});
XPath xpath = xpf.newXPath();
XPathExpression expr = xpath.compile("ListObject[@Title=$val]");
variables.put("val", someValue);
NodeList nodes = (NodeList)expr.evaluate(someNode, XPathConstants.NODESET);
For C# XPathNavigator you would define a custom XsltContext as described in this MSDN article (you'd only need the variable-related parts of this example, not the extension functions).
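For anyone who wants to try the Java variable approach end to end, here is a self-contained sketch. The class name XPathVariableDemo, the countMatches helper, and the sample document are assumptions for illustration only. The point is that the searched value is handed to the engine as a variable, so it never goes through the XPath parser and needs no quoting.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class XPathVariableDemo {

    // Counts ListObject elements whose Title attribute equals the given value.
    static int countMatches(String xml, String title) throws XPathExpressionException {
        Map<String, Object> variables = new HashMap<>();
        variables.put("val", title);

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Resolve $name references from the map by local name.
        xpath.setXPathVariableResolver((QName name) -> variables.get(name.getLocalPart()));

        NodeList nodes = (NodeList) xpath.evaluate(
                "/root/ListObject[@Title = $val]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Sample document made up for this sketch.
        String xml = "<root>"
                + "<ListObject Title=\"it's &quot;quoted&quot;\"/>"
                + "<ListObject Title=\"plain\"/>"
                + "</root>";
        System.out.println(countMatches(xml, "it's \"quoted\""));
    }
}
```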
Most of the answers here focus on how to use string manipulation to cobble together an XPath that uses string delimiters in a valid way.
I would say the best practice is not to rely on such complicated and potentially fragile methods.
The following applies to .NET since this question is tagged with C#. Ian Roberts has provided what I think is the best solution for when you're using XPath in Java.
Nowadays, you can use LINQ to XML to query XML documents in a way that lets you use your variables in the query directly. This is not XPath, but the purpose is the same.
For the example given in OP, you could query the nodes you want like this:
var value = "Some value with 'apostrophes' and \"quotes\"";
// doc is an instance of XElement or XDocument
IEnumerable<XElement> nodes =
doc.Descendants("ListObject")
.Where(lo => (string)lo.Attribute("Title") == value);
or to use the query comprehension syntax:
IEnumerable<XElement> nodes = from lo in doc.Descendants("ListObject")
where (string)lo.Attribute("Title") == value
select lo;
.NET also provides a way to use XPath variables in your XPath queries. Sadly, it's not easy to do this out of the box, but with a simple helper class that I provide in this other SO answer, it's quite easy.
You can use it like this:
var value = "Some value with 'apostrophes' and \"quotes\"";
var variableContext = new VariableContext { { "matchValue", value } };
// ixn is an instance of IXPathNavigable
XPathNodeIterator nodes = ixn.CreateNavigator()
.SelectNodes("ListObject[@Title = $matchValue]",
variableContext);
Here is an alternative to Robert Rossney's StringBuilder approach, perhaps more intuitive:
/// <summary>
/// Produce an XPath literal equal to the value if possible; if not, produce
/// an XPath expression that will match the value.
///
/// Note that this function will produce very long XPath expressions if a value
/// contains a long run of double quotes.
///
/// From: http://stackoverflow.com/questions/1341847/special-character-in-xpath-query
/// </summary>
/// <param name="value">The value to match.</param>
/// <returns>If the value contains only single or double quotes, an XPath
/// literal equal to the value. If it contains both, an XPath expression,
/// using concat(), that evaluates to the value.</returns>
public static string XPathLiteral(string value)
{
// If the value contains only single or double quotes, construct
// an XPath literal
if (!value.Contains("\""))
return "\"" + value + "\"";
if (!value.Contains("'"))
return "'" + value + "'";
// If the value contains both single and double quotes, construct an
// expression that concatenates all non-double-quote substrings with
// the quotes, e.g.:
//
// concat("foo",'"',"bar")
List<string> parts = new List<string>();
// First, put a '"' after each component in the string.
foreach (var str in value.Split('"'))
{
if (!string.IsNullOrEmpty(str))
parts.Add('"' + str + '"'); // (edited -- thanks Daniel :-)
parts.Add("'\"'");
}
// Then remove the extra '"' after the last component.
parts.RemoveAt(parts.Count - 1);
// Finally, put it together into a concat() function call.
return "concat(" + string.Join(",", parts) + ")";
}
You can quote an XPath string by using search and replace.
In F#
let quoteString (s : string) =
if not (s.Contains "'" ) then sprintf "'%s'" s
else if not (s.Contains "\"") then sprintf "\"%s\"" s
else "concat('" + s.Replace ("'", "', \"'\", '") + "')"
I haven't tested it extensively, but seems to work.
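Since the F# version above is described as lightly tested, here is a rough Java port of the same search-and-replace idea, with a round-trip check through the JDK's XPath engine. The names XPathQuote, quote and roundTrips are invented for this sketch. Each apostrophe is replaced by a closing quote, the literal "'", and a reopening quote, so the value becomes the arguments of a concat() call.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of any library.
class XPathQuote {

    // Returns an XPath 1.0 expression that evaluates to the string s.
    static String quote(String s) {
        if (!s.contains("'")) {
            return "'" + s + "'";
        }
        if (!s.contains("\"")) {
            return "\"" + s + "\"";
        }
        // Both quote kinds present: split at each apostrophe and emit it
        // as its own double-quoted argument of concat().
        return "concat('" + s.replace("'", "', \"'\", '") + "')";
    }

    // Evaluates the quoted form and checks that it yields the original string.
    static boolean roundTrips(String s) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String result = xpath.evaluate(quote(s), new InputSource(new StringReader("<root/>")));
        return s.equals(result);
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(quote("a'b\"c"));
        System.out.println(roundTrips("'foo\"\"bar"));
    }
}
```

One thing to note: when the value starts or ends with an apostrophe, this produces empty '' arguments, which are harmless but make the expression a little longer than necessary.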
I really like Robert's answer, but I feel like the code could be a little denser.
using System.Linq;
namespace Humig.Csp.Common
{
public static class XpathHelpers
{
public static string XpathLiteralEncode(string literalValue)
{
return string.IsNullOrEmpty(literalValue)
? "''"
: !literalValue.Contains("\"")
? $"\"{literalValue}\""
: !literalValue.Contains("'")
? $"'{literalValue}'"
: $"concat({string.Join(",'\"',", literalValue.Split('"').Select(k => $"\"{k}\""))})";
}
}
}
I have also created a unit test with all the test cases:
using HtmlAgilityPack;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace Humig.Csp.Common.Tests
{
[TestClass()]
public class XpathHelpersTests
{
[DataRow("foo")] // no quotes
[DataRow("\"foo")] // double quotes only
[DataRow("'foo")] // single quotes only
[DataRow("'foo\"bar")] // both; double quotes in mid-string
[DataRow("'foo\"bar\"baz")] // multiple double quotes in mid-string
[DataRow("'foo\"")] // string ends with double quotes
[DataRow("'foo\"\"")] // string ends with run of double quotes
[DataRow("\"'foo")] // string begins with double quotes
[DataRow("\"\"'foo")] // string begins with run of double quotes
[DataRow("'foo\"\"bar")] // run of double quotes in mid-string
[TestMethod()]
public void XpathLiteralEncodeTest(string attrValue)
{
var doc = new HtmlDocument();
var hnode = doc.CreateElement("html");
var body = doc.CreateElement("body");
var div = doc.CreateElement("div");
div.Attributes.Add("data-test", attrValue);
doc.DocumentNode.AppendChild(hnode);
hnode.AppendChild(body);
body.AppendChild(div);
var literalOut = XpathHelpers.XpathLiteralEncode(attrValue);
string xpath = $"/html/body/div[@data-test = {literalOut}]";
var result = doc.DocumentNode.SelectSingleNode(xpath);
Assert.AreEqual(div, result, $"did not find a match for {attrValue}");
}
}
}
If you're not going to have any double-quotes in SomeValue, you can use escaped double-quotes to specify the value you're searching for in your XPath search string.
ListObject[@Title=\"SomeValue\"]
You can fix this issue by using double quotes instead of single quotes in the XPath expression.
For example:
element.XPathSelectElements(String.Format("//group[@title=\"{0}\"]", "Man's"));
I had this problem a while back, and seemingly the simplest (though not the fastest) solution is to add a new node to the XML document that has an attribute with the value 'SomeValue', then look for that attribute value using a simple XPath search. After you're finished with the operation, you can delete the "temporary node" from the XML document.
This way, the whole comparison happens "inside", so you don't have to construct the weird XPath query.
I seem to remember that in order to speed things up, you should be adding the temp value to the root node.
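A rough Java sketch of this temporary-value trick, in case it helps. The class name TempNodeLookup, the attribute name tmp-search-value, and the sample document are all assumptions made up for illustration. It puts the value on the root element as suggested above, compares against it from inside the XPath expression, and removes it again afterwards.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of any library.
class TempNodeLookup {

    // Counts ListObject elements whose Title equals the given value,
    // without ever embedding the value in the XPath expression.
    static int countMatches(String xml, String title) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Assumed attribute name; choose one that cannot clash with real data.
        String tempAttr = "tmp-search-value";
        root.setAttribute(tempAttr, title);
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Compare each Title against the temporary attribute on the root.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//ListObject[@Title = /*/@" + tempAttr + "]",
                    doc, XPathConstants.NODESET);
            return nodes.getLength();
        } finally {
            // Always clean up, even if evaluation fails.
            root.removeAttribute(tempAttr);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><ListObject Title=\"it's &quot;quoted&quot;\"/></root>";
        System.out.println(countMatches(xml, "it's \"quoted\""));
    }
}
```

Keep in mind that this modifies the document while the query runs, so it is not suitable if other threads are reading the same DOM.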
Good luck...
