Removing HTML from messages safely - c#

I need to output all of the plaintext within messages that may include valid and/or invalid HTML and possibly text that is superficially similar to HTML (i.e. non-HTML text within <...> such as: < why would someone do this?? >).
It is more important that I preserve all non-HTML content than it is to strip out all HTML, but ideally I would like to get rid of as much of the HTML as possible for readability.
I am currently using HTML Agility Pack, but I am having issues where non-HTML within < and > is also removed, for example:
my function:
text = HttpUtility.HtmlDecode(text);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(text);
text = doc.DocumentNode.InnerText;
simple example input*:
this text has <b>weird < things</b> going on >
actual output (unacceptable, lost the word "things"):
this text has weird going on >
desired output:
this text has weird < things going on >
Is there a way to remove only legitimate HTML tags within HTML Agility Pack without stripping out other content that may include < and/or >? Or do I need to manually create a white-list of tags to remove like in this question? That is my fallback solution but I'm hoping there is a more complete solution built in to HTML Agility Pack (or another tool) that I just haven't been able to find.
*(real input often has a ton of unneeded HTML in it, I can give a longer example if that would be useful)

You could use this pattern to replace the HTML tags:
</?[a-zA-Z][a-zA-Z0-9 \"=_-]*?>
Explanation:
<
maybe / (as it may be closing tag)
match a-z or A-Z as the first letter
MAYBE match any of a-z, or A-Z, 0-9, "=_- indefinitely
>
Final Code:
using System;
using System.Text.RegularExpressions;
namespace Regular
{
class Program
{
static void Main(string[] args)
{
string yourText = "this text has <b>weird < things</b> going on >";
string newText = Regex.Replace(yourText, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", "");
Console.WriteLine(newText);
}
}
}
Outputs:
this text has weird < things going on >
#corey-ogburn's comment is not correct as <[space]abc> would be replaced.
As you only want to strip them off the string I don't see a reason where you'd want to check if you have a tag starting/ending, but you could easily make it with regex.
It's not always a good choice to use RegEx to parse HTML, but I think it'd be fine if you want to parse simple text.

I wrote this a really long time ago to do something similar. You might use it as a starting point:
You'll need:
using System;
using System.Collections.Generic;
And the code:
/// <summary>
/// Instances of this class strip HTML/XML tags from a string
/// </summary>
public class HTMLStripper
{
public HTMLStripper() { }
public HTMLStripper(string source)
{
m_source = source;
stripTags();
}
private const char m_beginToken = '<';
private const char m_endToken = '>';
private const char m_whiteSpace = ' ';
private enum tokenType
{
nonToken = 0,
beginToken = 1,
endToken = 2,
escapeToken = 3,
whiteSpace = 4
}
private string m_source = string.Empty;
private string m_stripped = string.Empty;
private string m_tagName = string.Empty;
private string m_tag = string.Empty;
private Int32 m_startpos = -1;
private Int32 m_endpos = -1;
private Int32 m_currentpos = -1;
private IList<string> m_skipTags = new List<string>();
private bool m_tagFound = false;
private bool m_tagsStripped = false;
/// <summary>
/// Gets or sets the source string.
/// </summary>
/// <value>
/// The source string.
/// </value>
public string source { get { return m_source; } set { clear(); m_source = value; stripTags(); } }
/// <summary>
/// Gets the string stripped of HTML tags.
/// </summary>
/// <value>
/// The string.
/// </value>
public string stripped { get { return m_stripped; } set { } }
/// <summary>
/// Gets or sets a value indicating whether [HTML tags were stripped].
/// </summary>
/// <value>
/// <c>true</c> if [HTML tags were stripped]; otherwise, <c>false</c>.
/// </value>
public bool tagsStripped { get { return m_tagsStripped; } set { } }
/// <summary>
/// Adds the name of an HTML tag to skip stripping (leave in the text).
/// </summary>
/// <param name="value">The value.</param>
public void addSkipTag(string value)
{
if (value.Length > 0)
{
// Trim start and end tokens from skipTags if present and add to list
CharEnumerator tmpScanner = value.GetEnumerator();
string tmpString = string.Empty;
while (tmpScanner.MoveNext())
{
if (tmpScanner.Current != m_beginToken && tmpScanner.Current != m_endToken) { tmpString += tmpScanner.Current; }
}
if (tmpString.Length > 0) { m_skipTags.Add(tmpString); }
}
}
/// <summary>
/// Clears this instance.
/// </summary>
public void clear()
{
m_source = string.Empty;
m_tag = string.Empty;
m_startpos = -1;
m_endpos = -1;
m_currentpos = -1;
m_tagsStripped = false;
}
/// <summary>
/// Clears all.
/// </summary>
public void clearAll()
{
this.clear();
m_skipTags.Clear();
}
/// <summary>
/// Strips the HTML tags.
/// </summary>
private void stripTags()
{
// Preserve source and make a copy for stripping
m_stripped = m_source;
// Find first tag
getNext();
// If there are any tags (if next tag is string.Empty we are at EOS)...
if (m_tagName != string.Empty)
{
do
{
// If the tag we found is not to be skipped...
if (!m_skipTags.Contains(m_tagName))
{
// Remove tag from string
m_stripped = m_stripped.Remove(m_startpos, m_endpos - m_startpos + 1);
m_tagsStripped = true;
}
// Get next tag, rinse and repeat (if next tag is string.Empty we are at EOS)
getNext();
} while (m_tagName != string.Empty);
}
}
/// <summary>
/// Steps the pointer to the next HTML tag.
/// </summary>
private void getNext()
{
m_tagFound = false;
m_tag = string.Empty;
m_tagName = string.Empty;
bool beginTokenFound = false;
CharEnumerator scanner = m_stripped.GetEnumerator();
// If we're not at the beginning of the string, move the enumerator to the appropriate location in the string
if (m_currentpos != -1)
{
Int32 index = 0;
do
{
scanner.MoveNext();
index += 1;
} while (index < m_currentpos + 1);
}
while (!m_tagFound && m_currentpos + 1 < m_stripped.Length)
{
// Find next begin token
while (scanner.MoveNext())
{
m_currentpos += 1;
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_startpos = m_currentpos;
beginTokenFound = true;
break;
}
}
// If a begin token is found, find next end token
if (beginTokenFound)
{
while (scanner.MoveNext())
{
m_currentpos += 1;
// If we find another begin token before finding an end token we are not in a tag
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_tagFound = false;
beginTokenFound = true;
break;
}
// If the char immediately following a begin token is a white space we are not in a tag
if (m_currentpos - m_startpos == 1 && evaluateChar(scanner.Current) == tokenType.whiteSpace)
{
m_tagFound = false;
beginTokenFound = true;
break;
}
// End token found
if (evaluateChar(scanner.Current) == tokenType.endToken)
{
m_endpos = m_currentpos;
m_tagFound = true;
break;
}
}
}
if (m_tagFound)
{
// Found a tag, get the info for this tag
m_tag = m_stripped.Substring(m_startpos, (m_endpos + 1) - m_startpos);
m_tagName = m_stripped.Substring(m_startpos + 1, m_endpos - m_startpos - 1);
// If this tag is to be skipped, we do not want to reset the position within the string
// Also, if we are at the end of the string (EOS) we do not want to reset the position
if (!m_skipTags.Contains(m_tagName) && m_currentpos != stripped.Length)
{
m_currentpos = -1;
}
}
}
}
/// <summary>
/// Evaluates the next character.
/// </summary>
/// <param name="value">The value.</param>
/// <returns>tokenType</returns>
private tokenType evaluateChar(char value)
{
tokenType returnValue = new tokenType();
switch (value)
{
case m_beginToken:
returnValue = tokenType.beginToken;
break;
case m_endToken:
returnValue = tokenType.endToken;
break;
case m_whiteSpace:
returnValue = tokenType.whiteSpace;
break;
default:
returnValue = tokenType.nonToken;
break;
}
return returnValue;
}
}

Related

How to Trim exactly 1 whitespace after splitting the string

I have made a program that evaluates a string by splitting it at a pipeline, the string are randomly generated and sometimes whitespace is a part of what need to be evaluated.
HftiVfzRIDBeotsnU uabjvLPC | LstHCfuobtv eVzDUBPn jIRfai
This string is same length on either side(2 x whitespace on left side of pipeline), but my problem comes when i have to trim the space on both sides of the pipeline (i do this after splitting)
is there some way of making sure that i only trim 1 single space instead of them all.
my code so far:
foreach (string s in str)
{
int bugCount = 0;
string[] info = s.Split('|');
string testCase = info[0].TrimEnd();
char[] testArr = testCase.ToCharArray();
string debugInfo = info[1].TrimStart();
char[] debugArr = debugInfo.ToCharArray();
int arrBound = debugArr.Count();
for (int i = 0; i < arrBound; i++)
if (testArr[i] != debugArr[i])
bugCount++;
if (bugCount <= 2 && bugCount != 0)
Console.WriteLine("Low");
if (bugCount <= 4 && bugCount != 0)
Console.WriteLine("Medium");
if (bugCount <= 6 && bugCount != 0)
Console.WriteLine("High");
if (bugCount > 6)
Console.WriteLine("Critical");
else
Console.WriteLine("Done");
}
Console.ReadLine();
You have 2 options.
If there is always 1 space before and after the pipe, split on {space}|{space}.
myInput.Split(new[]{" | "},StringSplitOptions.None);
Otherwise, instead of using TrimStart() & TrimEnd() use SubString.
var split = myInput.Split('|');
var s1 = split[0].EndsWith(" ")
? split[0].SubString(0,split[0].Length-1)
: split[0];
var s2 = split[1].StartsWith(" ")
? split[1].SubString(1) // to end of line
: split[1];
Note, there is some complexity here - if the pipe has no space around it, but the last/first character is a legitimate (data) space character the above will cut it off. You need more logic, but hopefully this will get you started!
There is no way to tell the Trim.. methods family to stop after cutting out some number of characters.
In general case, you'd need to do it manually by inspecting the parts obtained after Split and checking their first/last characters and substring'ing to get the correct part.
However, in your case, there's a much simpler way - the Split can also take a string as an argument, and even more - a set of strings:
string[] info = s.Split(new []{ " | " });
// or even
string[] info = s.Split(new []{ " | ", " |", "| ", "|" });
That should take care of the single spaces around the pipe | character by simply treating them as a part of the separator.
This is a string extension to trim space for count times, just in case.
public static class StringExtension
{
/// <summary>
/// Trim space at the end of string for count times
/// </summary>
/// <param name="input"></param>
/// <param name="count">number of space at the end to trim</param>
/// <returns></returns>
public static string TrimEnd(this string input, int count = 1)
{
string result = input;
if (count <= 0)
{
return result;
}
if (result.EndsWith(new string(' ', count)))
{
result = result.Substring(0, result.Length - count);
}
return result;
}
/// <summary>
/// Trim space at the start of string for count times
/// </summary>
/// <param name="input"></param>
/// <param name="count">number of space at the start to trim</param>
/// <returns></returns>
public static string TrimStart(this string input, int count = 1)
{
string result = input;
if (count <= 0)
{
return result;
}
if (result.StartsWith(new string(' ', count)))
{
result = result.Substring(count);
}
return result;
}
}
In the main
static void Main(string[] args)
{
string a = "1234 ";
string a1 = a.TrimEnd(1); // returns "1234 "
string a2 = a.TrimEnd(2); // returns "1234"
string a3 = a.TrimEnd(3); // returns "1234 "
string b = " 5678";
string b1 = b.TrimStart(1); // returns " 5678"
string b2 = b.TrimStart(2); // returns "5678"
string b3 = b.TrimStart(3); // returns " 5678"
}

Check if string contains rage of Unicode characters

What is the best way to check if string contains specified Unicode character? My problem is I cannot parse string/characters to format \u[byte][byte][byte][byte]. I followed many tutorials and threads here on StackOverflow, but when I have method like this one:
private bool ContainsInvalidCharacters(string name)
{
if (translation.Any(c => c > 255))
{
byte[] bytes = new byte[name.Length];
Buffer.BlockCopy(name.ToCharArray(), 0, bytes, 0, bytes.Length);
string decoded = Encoding.UTF8.GetString(bytes, 0, name.Length);
(decoded.Contains("\u0001"))
{
//do something
}
}
I get output like: "c\0o\0n\0t\0i\0n\0g\0u\0t\0".
This really is not my cup of tea. I will be grateful for any help.
If I were to picture a rage of Unicode characters that would be my bet:
ლ(~•̀︿•́~)つ︻̷┻̿═━一
So to answer your question, that is to check string for such rage you could simply:
private bool ContainsInvalidCharacters(string name)
{
return name.IndexOf("ლ(~•̀︿•́~)つ︻̷┻̿═━一") != -1;
}
;)
Is this what you want?
public static bool ContainsInvalidCharacters(string name)
{
return name.IndexOfAny(new[]
{
'\u0001', '\u0002', '\u0003',
}) != -1;
}
and
bool res = ContainsInvalidCharacters("Hello\u0001");
Note the use of '\uXXXX': the ' denote a char instead of a string.
Check this also
/// <summary>
/// Check invalid character based on the pattern
/// </summary>
/// <param name="text">The string</param>
/// <returns></returns>
public static string IsInvalidCharacters(this string text)
{
string pattern = #"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]";
var match = Regex.Match(text, pattern, "");
return match.Sucess;
}

How to replace text with index value in C#

I have a text file which contains a repeated string called "map" for more than 800 now I would like to replace them with map to map0, map1, map2, .....map800.
I tried this way but it didn't work for me:
void Main() {
string text = File.ReadAllText(#"T:\File1.txt");
for (int i = 0; i < 2000; i++)
{
text = text.Replace("map", "map"+i);
}
File.WriteAllText(#"T:\File1.txt", text);
}
How can I achieve this?
This should work fine:
void Main() {
string text = File.ReadAllText(#"T:\File1.txt");
int num = 0;
text = (Regex.Replace(text, "map", delegate(Match m) {
return "map" + num++;
}));
File.WriteAllText(#"T:\File1.txt", text);
}
/// <summary>
/// Replaces each existing key within the original string by adding a number to it.
/// </summary>
/// <param name="original">The original string.</param>
/// <param name="key">The key we are searching for.</param>
/// <param name="offset">The offset of the number we want to start with. The default value is 0.</param>
/// <param name="increment">The increment of the number.</param>
/// <returns>A new string where each key has been extended with a number string with "offset" and beeing incremented with "increment".The default value is 1.</returns>
/// <example>
/// Assuming that we have an original string of "mapmapmapmap" and the key "map" we
/// would get "map0map1map2map3" as result.
/// </example>
public static string AddNumberToKeyInString(string original, string key, int offset = 0, int increment = 1)
{
if (original.Contains(key))
{
int counter = offset;
int position = 0;
int index;
// While we are withing the length of the string and
// the "key" we are searching for exists at least once
while (position < original.Length && (index = original.Substring(position).IndexOf(key)) != -1)
{
// Insert the counter after the "key"
original = original.Insert(position + key.Length, counter.ToString());
position += index + key.Length + counter.ToString().Length;
counter += increment;
}
}
return original;
}
It's because you are replacing the same occurrence of map each time. So the resulting string will have map9876543210 map9876543210 map9876543210 for 10 iterations, if the original string was "map map map". You need to find each individual occurrence of map, and replace it. Try using the indexof method.
Something along these lines should give you an idea of what you're trying to do:
static void Main(string[] args)
{
string text = File.ReadAllText(#"C:\temp\map.txt");
int mapIndex = text.IndexOf("map");
int hitCount = 0;
int hitTextLength = 1;
while (mapIndex >= 0 )
{
text = text.Substring(0, mapIndex) + "map" + hitCount++.ToString() + text.Substring(mapIndex + 2 + hitTextLength);
mapIndex = text.IndexOf("map", mapIndex + 3 + hitTextLength);
hitTextLength = hitCount.ToString().Length;
}
File.WriteAllText(#"C:\temp\map1.txt", text);
}
Due to the fact that strings are immutable this wouldn't be the ideal way to deal with large files (1MB+) as you would be creating and disposing the entire string for each instance of "map" in the file.
For an example file:
map hat dog
dog map cat
lost cat map
mapmapmaphat
map
You get the results:
map0 hat dog
dog map1 cat
lost cat map2
map3map4map5hat
map6

OpenXML replace text in all document

I have the piece of code below. I'd like replace the text "Text1" by "NewText", that's work. But when I place the text "Text1" in a table that's not work anymore for the "Text1" inside the table.
I'd like make this replacement in the all document.
using (WordprocessingDocument doc = WordprocessingDocument.Open(String.Format("c:\\temp\\filename.docx"), true))
{
var body = doc.MainDocumentPart.Document.Body;
foreach (var para in body.Elements<Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("##Text1##"))
text.Text = text.Text.Replace("##Text1##", "NewText");
}
}
}
}
Your code does not work because the table element (w:tbl) is not contained in
a paragraph element (w:p). See the following MSDN article for more information.
The Text class (serialized as w:t) usually represents literal text within a Run element in a
word document. So you could simply search for all w:t elements (Text class) and replace your
tag if the text element (w:t) contains your tag:
using (WordprocessingDocument doc = WordprocessingDocument.Open("yourdoc.docx", true))
{
var body = doc.MainDocumentPart.Document.Body;
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains("##Text1##"))
{
text.Text = text.Text.Replace("##Text1##", "NewText");
}
}
}
Borrowing on some other answers in various places, and with the fact that four main obstacles must be overcome:
Delete any high level Unicode chars from your replace string that cannot be read from Word (from bad user input)
Ability to search for your find result across multiple runs or text elements within a paragraph (Word will often break up a single sentence into several text runs)
Ability to include a line break in your replace text so as to insert multi-line text into the document.
Ability to pass in any node as the starting point for your search so as to restrict the search to that part of the document (such as the body, the header, the footer, a specific table, table row, or tablecell).
I am sure advanced scenarios such as bookmarks, complex nesting will need more modification on this, but it is working for the types of basic word documents I have run into so far, and is much more helpful to me than disregarding runs altogether or using a RegEx on the entire file with no ability to target a specific TableCell or Document part (for advanced scenarios).
Example Usage:
var body = document.MainDocumentPart.Document.Body;
ReplaceText(body, replace, with);
The code:
using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
namespace My.Web.Api.OpenXml
{
public static class WordTools
{
/// <summary>
/// Find/replace within the specified paragraph.
/// </summary>
/// <param name="paragraph"></param>
/// <param name="find"></param>
/// <param name="replaceWith"></param>
public static void ReplaceText(Paragraph paragraph, string find, string replaceWith)
{
var texts = paragraph.Descendants<Text>();
for (int t = 0; t < texts.Count(); t++)
{ // figure out which Text element within the paragraph contains the starting point of the search string
Text txt = texts.ElementAt(t);
for (int c = 0; c < txt.Text.Length; c++)
{
var match = IsMatch(texts, t, c, find);
if (match != null)
{ // now replace the text
string[] lines = replaceWith.Replace(Environment.NewLine, "\r").Split('\n', '\r'); // handle any lone n/r returns, plus newline.
int skip = lines[lines.Length - 1].Length - 1; // will jump to end of the replacement text, it has been processed.
if (c > 0)
lines[0] = txt.Text.Substring(0, c) + lines[0]; // has a prefix
if (match.EndCharIndex + 1 < texts.ElementAt(match.EndElementIndex).Text.Length)
lines[lines.Length - 1] = lines[lines.Length - 1] + texts.ElementAt(match.EndElementIndex).Text.Substring(match.EndCharIndex + 1);
txt.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve); // in case your value starts/ends with whitespace
txt.Text = lines[0];
// remove any extra texts.
for (int i = t + 1; i <= match.EndElementIndex; i++)
{
texts.ElementAt(i).Text = string.Empty; // clear the text
}
// if 'with' contained line breaks we need to add breaks back...
if (lines.Count() > 1)
{
OpenXmlElement currEl = txt;
Break br;
// append more lines
var run = txt.Parent as Run;
for (int i = 1; i < lines.Count(); i++)
{
br = new Break();
run.InsertAfter<Break>(br, currEl);
currEl = br;
txt = new Text(lines[i]);
run.InsertAfter<Text>(txt, currEl);
t++; // skip to this next text element
currEl = txt;
}
c = skip; // new line
}
else
{ // continue to process same line
c += skip;
}
}
}
}
}
/// <summary>
/// Determine if the texts (starting at element t, char c) exactly contain the find text
/// </summary>
/// <param name="texts"></param>
/// <param name="t"></param>
/// <param name="c"></param>
/// <param name="find"></param>
/// <returns>null or the result info</returns>
static Match IsMatch(IEnumerable<Text> texts, int t, int c, string find)
{
int ix = 0;
for (int i = t; i < texts.Count(); i++)
{
for (int j = c; j < texts.ElementAt(i).Text.Length; j++)
{
if (find[ix] != texts.ElementAt(i).Text[j])
{
return null; // element mismatch
}
ix++; // match; go to next character
if (ix == find.Length)
return new Match() { EndElementIndex = i, EndCharIndex = j }; // full match with no issues
}
c = 0; // reset char index for next text element
}
return null; // ran out of text, not a string match
}
/// <summary>
/// Defines a match result
/// </summary>
class Match
{
/// <summary>
/// Last matching element index containing part of the search text
/// </summary>
public int EndElementIndex { get; set; }
/// <summary>
/// Last matching char index of the search text in last matching element
/// </summary>
public int EndCharIndex { get; set; }
}
} // class
} // namespace
public static class OpenXmlTools
{
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
#"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML which give exception on save
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
if (string.IsNullOrEmpty(text)) return "";
return _invalidXMLChars.Replace(text, "");
}
}
Maybe this solution is easier
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
//1. Copy all the file into a string
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
docText = sr.ReadToEnd();
//2. Use regular expression to replace all text
Regex regexText = new Regex(find);
docText = regexText.Replace(docText, replace);
//3. Write the changed string into the file again
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
sw.Write(docText);

multiline formatting for verbatim strings in c# (prefix with #)

I love using the #"strings" in c#, especially when I have a lot of multi-line text. The only annoyance is that my code formatting goes to doodie when doing this, because the second and greater lines are pushed fully to the left instead of using the indentation of my beautifully formatted code. I know this is by design, but is there some option/hack way of allowing these lines to be indented, without adding the actual tabs/spaces to the output?
adding example:
var MyString = #" this is
a multi-line string
in c#.";
My variable declaration is indented to the "correct" depth, but the second and further lines in the string get pushed to the left margin- so the code is kinda ugly. You could add tabs to the start of line 2 and 3, but the string itself would then contain those tabs... make sense?
How about a string extension? Update: I reread your question and I hope there is a better answer. This is something that bugs me too and having to solve it as below is frustrating but on the plus side it does work.
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
public static class StringExtensions
{
public static string StripLeadingWhitespace(this string s)
{
Regex r = new Regex(#"^\s+", RegexOptions.Multiline);
return r.Replace(s, string.Empty);
}
}
}
And an example console program:
using System;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string x = #"This is a test
of the emergency
broadcasting system.";
Console.WriteLine(x);
Console.WriteLine();
Console.WriteLine("---");
Console.WriteLine();
Console.WriteLine(x.StripLeadingWhitespace());
Console.ReadKey();
}
}
}
And the output:
This is a test
of the emergency
broadcasting system.
---
This is a test
of the emergency
broadcasting system.
And a cleaner way to use it if you decide to go this route:
string x = #"This is a test
of the emergency
broadcasting system.".StripLeadingWhitespace();
// consider renaming extension to say TrimIndent() or similar if used this way
Cymen has given the right solution. I use a similar approach as derived from Scala's stripMargin() method. Here's what my extension method looks like:
public static string StripMargin(this string s)
{
return Regex.Replace(s, #"[ \t]+\|", string.Empty);
}
Usage:
var mystring = #"
|SELECT
| *
|FROM
| SomeTable
|WHERE
| SomeColumn IS NOT NULL"
.StripMargin();
Result:
SELECT
*
FROM
SomeTable
WHERE
SomeColumn IS NOT NULL
I can't think of an answer that would completely satisfy your question, however you could write a function that strips leading spaces from lines of text contained in a string and call it on each creation of such a string.
var myString = TrimLeadingSpacesOfLines(#" this is a
a multi-line string
in c#.");
Yes it is a hack, but you specified your acceptance of a hack in your question.
Here is a longish solution which tries to mimic textwrap.dedent as much as possible. The first line is left as-is and expected not to be indented. (You can generate the unit tests based on the doctests using doctest-csharp.)
/// <summary>
/// Imitates the Python's
/// <a href="https://docs.python.org/3/library/textwrap.html#textwrap.dedent">
/// <c>textwrap.dedent</c></a>.
/// </summary>
/// <param name="text">Text to be dedented</param>
/// <returns>array of dedented lines</returns>
/// <code doctest="true">
/// Assert.That(Dedent(""), Is.EquivalentTo(new[] {""}));
/// Assert.That(Dedent("test me"), Is.EquivalentTo(new[] {"test me"}));
/// Assert.That(Dedent("test\nme"), Is.EquivalentTo(new[] {"test", "me"}));
/// Assert.That(Dedent("test\n me"), Is.EquivalentTo(new[] {"test", " me"}));
/// Assert.That(Dedent("test\n me\n again"), Is.EquivalentTo(new[] {"test", "me", " again"}));
/// Assert.That(Dedent(" test\n me\n again"), Is.EquivalentTo(new[] {" test", "me", " again"}));
/// </code>
private static string[] Dedent(string text)
{
var lines = text.Split(
new[] {"\r\n", "\r", "\n"},
StringSplitOptions.None);
// Search for the first non-empty line starting from the second line.
// The first line is not expected to be indented.
var firstNonemptyLine = -1;
for (var i = 1; i < lines.Length; i++)
{
if (lines[i].Length == 0) continue;
firstNonemptyLine = i;
break;
}
if (firstNonemptyLine < 0) return lines;
// Search for the second non-empty line.
// If there is no second non-empty line, we can return immediately as we
// can not pin the indent.
var secondNonemptyLine = -1;
for (var i = firstNonemptyLine + 1; i < lines.Length; i++)
{
if (lines[i].Length == 0) continue;
secondNonemptyLine = i;
break;
}
if (secondNonemptyLine < 0) return lines;
// Match the common prefix with at least two non-empty lines
var firstNonemptyLineLength = lines[firstNonemptyLine].Length;
var prefixLength = 0;
for (int column = 0; column < firstNonemptyLineLength; column++)
{
char c = lines[firstNonemptyLine][column];
if (c != ' ' && c != '\t') break;
bool matched = true;
for (int lineIdx = firstNonemptyLine + 1; lineIdx < lines.Length;
lineIdx++)
{
if (lines[lineIdx].Length == 0) continue;
if (lines[lineIdx].Length < column + 1)
{
matched = false;
break;
}
if (lines[lineIdx][column] != c)
{
matched = false;
break;
}
}
if (!matched) break;
prefixLength++;
}
if (prefixLength == 0) return lines;
for (var i = 1; i < lines.Length; i++)
{
if (lines[i].Length > 0) lines[i] = lines[i].Substring(prefixLength);
}
return lines;
}

Categories

Resources