I have a web site built with DevExpress controls including Reports. The main language used is Hebrew so the basic direction is RTL. However there is often a need to include English text, LTR, mixed in the Hebrew text. Their web controls support RTL and usually there isn't a problem with mixed texts.
The problem is with their reports, which until recently did not support RTL. Creating a report entirely in Hebrew was not too much of a problem. The trouble starts when Hebrew and English are mixed: then the text becomes scrambled.
I succeeded in fixing that with the following code:
private string FixBiDirectionalString(string textToFix)
{
try
{
char RLE = '\u202B';
char PDF = '\u202C';
char LRM = '\u200E';
char RLM = '\u200F';
StringBuilder sb = new StringBuilder(textToFix.Replace("\r", "").Replace("\n", string.Format("{0}", '\u000A')));
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex("[A-Za-z0-9-+ ]+");
System.Text.RegularExpressions.MatchCollection mc = r.Matches(sb.ToString());
foreach (System.Text.RegularExpressions.Match m in mc)
{
double tmp;
if (m.Value == " ")
continue;
if (double.TryParse(RemoveAcceptedChars(m.Value), out tmp))
continue;
sb.Replace(m.Value, LRM + m.Value + RLM);
}
return RLE + sb.ToString() + PDF;
}
catch { return textToFix; } // fall back to the unmodified input
}
private string RemoveAcceptedChars(string p)
{
return p.Replace("+", "").Replace("-", "").Replace("*", "").Replace("/", "");
}
This code is based on code I found in one of the comments on the article "XtraReports RTL: bidirectional text drawing".
However, I still had a problem with spaces between Hebrew and English words disappearing or being misplaced.
How can that be fixed? (I'm still using an older version of reports that doesn't support RTL).
I fixed it by first trimming the leading and trailing spaces from each string that matched the English-alphabet regex, and then adding the spaces back explicitly, positioned relative to the Unicode control characters.
string mTrim = m.Value.Trim();
sb.Replace(m.Value, " " + LRM + mTrim + " " + RLM);
The problem occurs because spaces are neutral (weakly directional) characters: their direction depends on the text around them, and in mixed text a space can end up on the wrong side. This code forces one space to be part of the general RTL direction and one to be part of the LTR segment, so the words are displayed properly separated.
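The same idea can be sketched as a single Regex.Replace pass, which avoids re-replacing a run that occurs more than once in the text (a side effect of calling sb.Replace(m.Value, ...) inside the loop). This is only a sketch: the class and method names are hypothetical, the "is it just a number" check is simplified to "contains no Latin letter", and whether the older XtraReports renderer draws the result the same way is an assumption that has to be checked on a real report.

```csharp
using System.Text.RegularExpressions;

public static class BidiText
{
    private const char RLE = '\u202B'; // RIGHT-TO-LEFT EMBEDDING
    private const char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING
    private const char LRM = '\u200E'; // LEFT-TO-RIGHT MARK
    private const char RLM = '\u200F'; // RIGHT-TO-LEFT MARK

    // Runs of Latin letters, digits, '+', '-' and spaces.
    private static readonly Regex LatinRun = new Regex("[A-Za-z0-9+ -]+");

    public static string Fix(string text)
    {
        string marked = LatinRun.Replace(text, m =>
        {
            string run = m.Value.Trim();
            // Leave whitespace-only runs and runs without any Latin letter
            // (plain numbers, signs) untouched.
            if (run.Length == 0 || !Regex.IsMatch(run, "[A-Za-z]"))
                return m.Value;
            // The space before LRM stays in the RTL context; the space
            // after the run stays inside the LTR segment.
            return " " + LRM + run + " " + RLM;
        });
        return RLE + marked + PDF;
    }
}
```

Calling BidiText.Fix on a Hebrew sentence containing one English word wraps that word as space, LRM, word, space, RLM, and wraps the whole string in RLE ... PDF.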
I found the most popular answer to this question is:
Regex.Replace(value, "[^a-zA-Z0-9]+", " ", RegexOptions.Compiled);
However, if users type a non-English name when billing, this method treats the non-English letters as special characters and removes them.
Is there a way to make this work for most users, since my website is multi-language?
Make it Unicode aware:
var res = Regex.Replace(value, @"[^\p{L}\p{M}\p{N}]+", " ");
If you want to keep only the ASCII digits, use 0-9 in the character class instead of \p{N}.
The regex matches one or more symbols other than Unicode letters (\p{L}), diacritics (\p{M}) and digits (\p{N}).
You might consider var res = Regex.Replace(value, @"\W+", " "), but it will keep _ since the underscore is a "word" character.
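As a quick illustration of the Unicode-aware pattern, here is a minimal sketch. The class and method names are hypothetical, and the trailing Trim() is an addition of mine, not part of the answer above.

```csharp
using System.Text.RegularExpressions;

public static class NameCleaner
{
    // Replace every run of characters that are not letters, combining marks
    // or digits (in any script) with a single space, then trim the ends.
    public static string Clean(string value)
    {
        return Regex.Replace(value, @"[^\p{L}\p{M}\p{N}]+", " ").Trim();
    }
}
```

With this, NameCleaner.Clean("José-שלום 123!") keeps the accented and Hebrew letters and returns "José שלום 123", whereas the [^a-zA-Z0-9] version would strip them.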
I found that the best way to achieve this and make it work with all languages is to build a pattern with all the banned characters. Look at this code:
string input = @"heya's #FFFFF , CUL8R M8 how are you?'"; // This is the input string
string regex = @"[!""#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]"; // Banned characters; add every character you don't want to be displayed here.
Match m;
while ((m = Regex.Match(input, regex)) != null)
{
if (m.Success)
input = input.Remove(m.Index, m.Length);
else // if m.Success is false: break, because while loop can be infinite
break;
}
input = Regex.Replace(input, " {2,}", " "); // if the string has two or more spaces together, change them to one
MessageBox.Show(input);
Hope it works!
PS: As others posted here, there are other ways. I personally prefer this one even though it's more code. Choose the one that best fits your needs.
Until now I thought HttpUtility.HtmlDecode("&nbsp;") returned a space. But the code below always returns false.
string text = "&amp;nbsp;";
text = HttpUtility.HtmlDecode(text);
string space = " ";
if (String.Compare(space, text) == 0)
return true;
else
return false;
The same happens when I try Server.HtmlDecode().
Why is it so?
Any help would be much appreciated
Thanks,
N
The HTML entity &nbsp; doesn't represent a space, it represents a non-breaking space.
The non-breaking space has character code 160:
string nbspace = "\u00A0";
Also, as Marc Gravell noticed, you have double encoded the code, so you would need to decode it twice to get the character:
string text = "&amp;nbsp;";
text = HttpUtility.HtmlDecode(HttpUtility.HtmlDecode(text));
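A small console sketch makes the difference visible. It uses System.Net.WebUtility rather than HttpUtility so that it runs without a reference to System.Web; that swap, and the class name, are my own choices.

```csharp
using System;
using System.Net;

public static class NbspDemo
{
    public static void Main()
    {
        // A single decode of the entity yields U+00A0, not U+0020.
        string once = WebUtility.HtmlDecode("&nbsp;");
        Console.WriteLine((int)once[0]);   // 160
        Console.WriteLine(once == " ");    // False

        // Double-encoded input needs two decodes to reach the character.
        string first = WebUtility.HtmlDecode("&amp;nbsp;");
        Console.WriteLine(first);          // &nbsp;
        string second = WebUtility.HtmlDecode(first);
        Console.WriteLine(second == "\u00A0"); // True
    }
}
```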
I'm cleaning the html like this:
var text = WebUtility.HtmlDecode(html)
.Replace("\u00A0", " ") // Replace non breaking space with space.
.Replace("  ", " ") // Shrink multiple spaces into one space.
.Trim();
The HTML &amp;nbsp; doesn't mean any kind of space. It means, literally, the text &nbsp; - for example, if you were writing HTML that was talking about HTML, you might need to include the text &nbsp;, which you would do by writing the HTML &amp;nbsp;.
If you had:
string text = "&nbsp;";
then that would decode to a non-breaking space.
Hello, I faced the same issue a few minutes ago.
I solved it this way:
string text = "&nbsp;";
text = Server.HtmlDecode(text).Trim();
so now:
text == "" is true (the Trim() at the end removes the non-breaking space)
I have a need to get rid of all line breaks that appear in my strings (coming from db).
I do it using code below:
value.Replace("\r\n", "").Replace("\n", "").Replace("\r", "")
I can see that there's at least one character acting like line ending that survived it. The char code is 8232.
It's very lame of me, but I must say this is the first time I have a pleasure of seeing this char. It's obvious that I can just replace this char directly, but I was thinking about extending my current approach (based on replacing combinations of "\r" and "\n") to something much more solid, so it would not only include the '8232' char but also all others not-found-by-me yet.
Do you have a bullet-proof approach for such a problem?
EDIT#1:
It seems to me that there are several possible solutions:
use Regex.Replace
remove all chars if it's IsSeparator or IsControl
replace with " " if it's IsWhiteSpace
create a list of all possible line endings ( "\r\n", "\r", "\n",LF ,VT, FF, CR, CR+LF, NEL, LS, PS) and just replace them with empty string. It's a lot of replaces.
I would say that the best results will be after applying 1st and 4th approaches but I cannot decide which will be faster. Which one do you think is the most complete one?
EDIT#2
I posted an answer below.
Below is the extension method solving my problem. LineSeparator and ParagraphEnding can be of course defined somewhere else, as static values etc.
public static string RemoveLineEndings(this string value)
{
if(String.IsNullOrEmpty(value))
{
return value;
}
string lineSeparator = ((char) 0x2028).ToString();
string paragraphSeparator = ((char)0x2029).ToString();
return value.Replace("\r\n", string.Empty)
.Replace("\n", string.Empty)
.Replace("\r", string.Empty)
.Replace(lineSeparator, string.Empty)
.Replace(paragraphSeparator, string.Empty);
}
According to Wikipedia, there are numerous line terminators you may need to handle (including the one you mention):
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
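One way to cover the whole list in a single pass is a character class, sketched below. The class name and the optional replacement parameter are hypothetical; the set of characters is exactly the list above.

```csharp
using System.Text.RegularExpressions;

public static class LineEndings
{
    // LF, VT, FF, CR (U+000A..U+000D), NEL, LS and PS. CR+LF is covered
    // because both of its characters are in the class.
    private static readonly Regex Terminators =
        new Regex(@"[\u000A-\u000D\u0085\u2028\u2029]+");

    // Removes each run of line terminators, or replaces the run with the
    // given replacement (for example a single space).
    public static string Remove(string value, string replacement = "")
    {
        return string.IsNullOrEmpty(value)
            ? value
            : Terminators.Replace(value, replacement);
    }
}
```

Because the pattern matches runs, LineEndings.Remove(text, " ") turns a CR+LF pair or several consecutive blank lines into one space rather than several.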
8232 (0x2028) and 8233 (0x2029) are the only other ones you might want to eliminate. See the documentation for char.IsSeparator.
Props to Yossarian on this one, I think he's right. Replace all whitespace with a single space:
data = Regex.Replace(data, @"\s+", " ");
I'd recommend finding ALL the whitespace (char.IsWhiteSpace) and replacing it with a single space. IsWhiteSpace takes care of all the weird Unicode whitespace characters.
This is my first attempt at this, but I think this will do what you want....
var controlChars = from c in value.ToCharArray() where Char.IsControl(c) select c;
foreach (char c in controlChars)
value = value.Replace(c.ToString(), "");
Also, see this link for details on other methods you can use: Char Methods
Have you tried string.Replace(Environment.NewLine, "") ? That usually gets a lot of them for me.
Check out this link: http://msdn.microsoft.com/en-us/library/844skk0h.aspx
You will have to play around and build a regular expression that works for you. But here's the skeleton...
static void Main(string[] args)
{
StringBuilder txt = new StringBuilder();
txt.Append("Hello \n\n\r\t\t");
txt.Append( Convert.ToChar(8232));
System.Console.WriteLine("Original: <" + txt.ToString() + ">");
System.Console.WriteLine("Cleaned: <" + CleanInput(txt.ToString()) + ">");
System.Console.Read();
}
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
return Regex.Replace(strIn, @"[^\w\.@-]", "");
}
Assuming that 8232 is a Unicode code point (U+2028), you can do this:
value.Replace("\u2028", string.Empty);
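Personally, I'd go with the following (note that char.IsControl does not cover U+2028 or U+2029, which are separators rather than control characters):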
public static String RemoveLineEndings(this String text)
{
StringBuilder newText = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
if (!char.IsControl(text, i))
newText.Append(text[i]);
}
return newText.ToString();
}
If you've a string say "theString" then
use the method Replace and give it the arguments shown below:
theString = theString.Replace(System.Environment.NewLine, "");
Here are some quick solutions with .NET regex:
To remove any whitespace from a string: s = Regex.Replace(s, @"\s+", ""); (\s matches any Unicode whitespace chars)
To remove all whitespace BUT CR and LF: s = Regex.Replace(s, @"[\s-[\r\n]]+", ""); ([\s-[\r\n]] is a character class containing a subtraction construct, it matches any whitespace but CR and LF)
To remove any vertical whitespace, subtract \p{Zs} (any horizontal whitespace but tab) and \t (tab) from \s: s = Regex.Replace(s, @"[\s-[\p{Zs}\t]]+", "");.
Wrapping the last one into an extension method:
public static string RemoveLineEndings(this string value)
{
return Regex.Replace(value, @"[\s-[\p{Zs}\t]]+", "");
}
See the regex demo.
buildLetter.Append("</head>").AppendLine();
buildLetter.Append("").AppendLine();
buildLetter.Append("<style type="text/css">").AppendLine();
Assume the above contents reside in a file. I want to write a snippet that
removes any line that appends an empty string "" and puts an escape character before
the inner quotation marks. The final output would be:
buildLetter.Append("</head>").AppendLine();
buildLetter.Append("<style type=\"text/css\">").AppendLine();
The outer " .... " quotes are not considered special characters. The special characters may be single
or double quotation marks.
I could do it with the find-and-replace feature of Visual Studio. However, in my case I
want it to be written in C# or VB.NET.
Any help will be appreciated.
Perhaps this does what you want:
string s = File.ReadAllText("input.txt");
string empty = "buildLetter.Append(\"\").AppendLine();" + Environment.NewLine;
s = s.Replace(empty, "");
s = Regex.Replace(s, @"(?<="").*(?="")",
match => { return match.Value.Replace("\"", "\\\""); }
);
Result:
buildLetter.Append("</head>").AppendLine();
buildLetter.Append("<style type=\"text/css\">").AppendLine();
Ok,
I have a string in a sql table like this
hello /r/n this is a test /r/n new line.
When I retrieve this string using C# for a multiline textbox, I expect the escape characters to become newlines.
But they do not, and I get the line exactly as it is above.
It seems that the string returned from the table is taken literally, but I want the newlines!
How do i get this to work?
Thanks in advance..
/ is not the escape character, \ is. What you need is:
hello \r\n this is a test \r\n new line.
Things like "\n" (not "/n" by the way) are escape characters in programming languages, not inherently in text. (In particular, they're programming-language dependent.)
If you want to make
hello\r\nthis is a test\r\nnew line
format as
hello
this is a test
new line
you'll need to do the parsing yourself, replacing "\r" with carriage return, "\n" with newline etc, handling "\\" as a literal backslash etc. I've typically found that a simple parser which just remembers whether or not the previous character was a backslash is good enough for most purposes. Something like this:
static string Unescape(string text)
{
StringBuilder builder = new StringBuilder(text.Length);
bool escaping = false;
foreach (char c in text)
{
if (escaping)
{
// We're not handling \uxxxx etc
escaping = false;
switch(c)
{
case 'r': builder.Append('\r'); break;
case 'n': builder.Append('\n'); break;
case 't': builder.Append('\t'); break;
case '\\': builder.Append('\\'); break;
default:
throw new ArgumentException("Unhandled escape: " + c);
}
}
else
{
if (c == '\\')
{
escaping = true;
}
else
{
builder.Append(c);
}
}
}
if (escaping)
{
throw new ArgumentException("Unterminated escape sequence");
}
return builder.ToString();
}
There are more efficient ways of doing it (skipping from backslash to backslash and appending whole substrings of non-escaped text, basically) but this is simple.
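For comparison, the "skip from backslash to backslash" variant might look roughly like the sketch below. It handles the same four escapes and throws in the same cases; the class and method names are hypothetical, and no timing claim is made here without measuring.

```csharp
using System;
using System.Text;

public static class Escapes
{
    public static string UnescapeFast(string text)
    {
        int slash = text.IndexOf('\\');
        if (slash < 0) return text; // nothing to unescape

        var builder = new StringBuilder(text.Length);
        int start = 0;
        while (slash >= 0)
        {
            // Copy the unescaped stretch before the backslash in one call.
            builder.Append(text, start, slash - start);
            if (slash + 1 >= text.Length)
                throw new ArgumentException("Unterminated escape sequence");
            switch (text[slash + 1])
            {
                case 'r': builder.Append('\r'); break;
                case 'n': builder.Append('\n'); break;
                case 't': builder.Append('\t'); break;
                case '\\': builder.Append('\\'); break;
                default:
                    throw new ArgumentException("Unhandled escape: " + text[slash + 1]);
            }
            // Continue searching after the two-character escape.
            start = slash + 2;
            slash = start < text.Length ? text.IndexOf('\\', start) : -1;
        }
        // Copy whatever follows the last escape.
        builder.Append(text, start, text.Length - start);
        return builder.ToString();
    }
}
```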
I had a similar problem with my SQLite database the other day. Users are able to enter multi-line text into a textbox, and when it's inserted into the database, it goes in like:
"This is \r\n a multiline test"
It was coming out of System.Data.SQLite as:
"This is \\r\\n a multiline test"
I ended up using:
String.Replace("\\r","\r").Replace("\\n","\n")
to replace each of the escaped characters, which formatted the text properly when displayed in either a multi-line textbox or label.
Looks like you need to use System.Web.HttpUtility.HtmlDecode on the string before putting it in the TextBox.
textBox.Value = HttpUtility.HtmlDecode(myString);
I do not see why a Parser is required in this instance, given that the language in question is C# and I assume that the OS is 32-bit Windows which treats \r\n as the NewLine character.
The following simple code (Windows app) works for me:
string s = "hello \r\n this is a test \r\n new line.";
// Set Multiline to True.
textBox1.Multiline = true;
// Expand the height of the textbox to be able to view the full text.
textBox1.Height = 100;
textBox1.Text = s;
The textbox shows text as:
hello
this is a test
new line.
I think the issue here is that the string as seen by the user is
hello \r\n this is a test \r\n new line.
while the C# code sees it as
"hello \\r\\n this is a test \\r\\n new line."
A much simpler solution would be the following
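s = Regex.Replace(s, @"\\\\", @"\");
which replaces @"\\" with @"\". Don't be confused by the regex match pattern: the \ has to be escaped in the match but not in the replacement. Note that this only collapses doubled backslashes; it does not turn the two-character sequence \r into a carriage return.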