C# string.IsNullOrWhiteSpace("\t") == true - c#

I have a line of code
var delimiter = string.IsNullOrWhiteSpace(foundDelimiter) ? "," : foundDelimiter;
when foundDelimiter is "\t", string.IsNullOrWhiteSpace returns true.
Why? And what is the appropriate way to work around this?

\t is the tab character, which is whitespace. In C# you can do either of these to get a tab:
var tab1 = "\t";
var tab2 = " ";
var areEqual = tab1 == tab2; //returns true
Edit: As noted by Magus, SO is converting my tab character into spaces when the answer gets rendered. If you're in your IDE you'd just hit quote, tab, quote.
As far as a workaround goes, I'd suggest you just add a check for tabs in your conditional.
var delimiter = string.IsNullOrWhiteSpace(foundDelimiter) && foundDelimiter != "\t" ? "," : foundDelimiter;
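If the only goal is to fall back to a comma when no delimiter was detected at all, another option is to test with string.IsNullOrEmpty, which is false for a tab. The sketch below is only an illustration: the helper name PickDelimiter is made up, and it assumes that a delimiter consisting of a single space should also be kept (the conditional above would replace a space with a comma).

```csharp
using System;

public static class DelimiterDemo
{
    // Hypothetical helper: fall back to "," only when nothing was found at all.
    public static string PickDelimiter(string foundDelimiter)
    {
        // IsNullOrEmpty is true for null and "", but false for "\t" or " ".
        return string.IsNullOrEmpty(foundDelimiter) ? "," : foundDelimiter;
    }

    public static void Main()
    {
        Console.WriteLine(PickDelimiter(null) == ",");   // True
        Console.WriteLine(PickDelimiter("") == ",");     // True
        Console.WriteLine(PickDelimiter("\t") == "\t");  // True
        Console.WriteLine(PickDelimiter(";") == ";");    // True
    }
}
```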

Welcome to Unicode.
What did you expect would happen? HT (horizontal tab) has been a whitespace character for decades. The "classic" C-language definition of white-space characters consists of the US-ASCII characters:
SP: space (0x20,' ')
HT: horizontal tab (0x09,'\t')
LF: line feed (0x0A, '\n')
VT: vertical tab (0x0B, '\v')
FF: form feed (0x0C, '\f')
CR: carriage return (0x0D, '\r')
Unicode is a little more...ecumenical in its approach: its definition of white-space characters is this set:
Members of the Unicode category SpaceSeparator:
SPACE (U+0020)
OGHAM SPACE MARK (U+1680)
MONGOLIAN VOWEL SEPARATOR (U+180E)
EN QUAD (U+2000)
EM QUAD (U+2001)
EN SPACE (U+2002)
EM SPACE (U+2003)
THREE-PER-EM SPACE (U+2004)
FOUR-PER-EM SPACE (U+2005)
SIX-PER-EM SPACE (U+2006)
FIGURE SPACE (U+2007)
PUNCTUATION SPACE (U+2008)
THIN SPACE (U+2009)
HAIR SPACE (U+200A)
NARROW NO-BREAK SPACE (U+202F)
MEDIUM MATHEMATICAL SPACE (U+205F)
IDEOGRAPHIC SPACE (U+3000)
Members of the Unicode category LineSeparator, which consists solely of
LINE SEPARATOR (U+2028)
Member of the Unicode category ParagraphSeparator, which consists solely of
PARAGRAPH SEPARATOR (U+2029)
These Basic Latin/C0 Controls/US-ASCII characters:
CHARACTER TABULATION (U+0009)
LINE FEED (U+000A)
LINE TABULATION (U+000B)
FORM FEED (U+000C)
CARRIAGE RETURN (U+000D)
These C1 Controls and Latin-1 Supplement characters
NEXT LINE (U+0085)
NO-BREAK SPACE (U+00A0)
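A quick way to see this definition in action is to ask char.IsWhiteSpace about a few of the characters listed above. This is just a small sketch; the Describe helper and the sample characters are chosen for illustration, and U+200B (zero width space) is included as an example of a character that is not in the list.

```csharp
using System;

public static class WhitespaceDemo
{
    // Illustrative helper: formats a code point together with the framework's verdict.
    public static string Describe(char c)
    {
        return string.Format("U+{0:X4}: {1}", (int)c, char.IsWhiteSpace(c));
    }

    public static void Main()
    {
        // space, tab, no-break space, line separator, ideographic space,
        // zero width space (not whitespace), and an ordinary letter
        char[] samples = { ' ', '\t', '\u00A0', '\u2028', '\u3000', '\u200B', 'x' };
        foreach (char c in samples)
        {
            Console.WriteLine(Describe(c));
        }

        // A string made only of characters from the list counts as "white space".
        Console.WriteLine(string.IsNullOrWhiteSpace("\u00A0\t\u3000")); // True
    }
}
```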
If you don't like the definition, roll your own along these lines (plug in your own character set):
public static bool IsNullOrCLanguageWhitespace( this string s )
{
bool value = ( s == null || rxWS.IsMatch(s) ) ;
return value ;
}
private static Regex rxWS = new Regex( @"^[ \t\n\v\f\r]*$") ;
You might want to add a char analog as well:
public static bool IsCLanguageWhitespace( this char c )
{
bool value ;
switch ( c )
{
case ' ' : value = true ; break ;
case '\t' : value = true ; break ;
case '\n' : value = true ; break ;
case '\v' : value = true ; break ;
case '\f' : value = true ; break ;
case '\r' : value = true ; break ;
default : value = false ; break ;
}
return value ;
}

Related

Regular expression replace (C#)

How to make Regex.Replace for the following texts:
1) "Name's", "Sex", "Age", "Height_(in)", "Weight (lbs)"
2) " LatD", "LatM ", 'LatS', "NS", "LonD", "LonM", "LonS", "EW", "City", "State"
Result:
1) Name's, Sex, Age, Height (in), Weight (lbs)
2) LatD, LatM, LatS, NS, LonD, LonM, LonS, EW, City, State
Spaces between brackets can be any size (Example 1). There may also be incorrect spaces in brackets (Example 2). Also, instead of spaces, the "_" sign can be used (Example 1). And instead of double quotes, single quotes can be used (Example 2).
As a result, words must be separated with a comma and a space.
Snippet of my code
StreamReader fileReader = new StreamReader(...);
var fileRow = fileReader.ReadLine();
fileRow = Regex.Replace(fileRow, "_", " ");
fileRow = Regex.Replace(fileRow, "\"", "");
var fileDataField = fileRow.Split(',');
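I don't know C# syntax well, but this regex does the job (\h is the PCRE shorthand for horizontal whitespace; .NET does not support \h, so substitute [ \t] there):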
Find: (?:_|^["']\h*|\h*["']$|\h*["']\h*,\h*["']\h*)
Replace: A space
Explanation:
(?: # non capture group
_ # underscore
| # OR
^["']\h* # beginning of line, quote or apostrophe, 0 or more horizontal spaces
| # OR
\h*["']$ # 0 or more horizontal spaces, quote or apostrophe, end of line
| # OR
\h*["']\h* # 0 or more horizontal spaces, quote or apostrophe, 0 or more horizontal spaces
, # comma
\h*["']\h* # 0 or more horizontal spaces, quote or apostrophe, 0 or more horizontal spaces
) # end group
Demo
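For the C# side, here is one possible sketch that produces the requested output with Regex.Split and Regex.Replace. It is not a translation of the pattern above: the class and method names are made up, and it assumes that fields never contain commas and never legitimately begin or end with a quote or apostrophe.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderCleaner
{
    // Hypothetical helper: normalizes one header row into "A, B, C" form.
    public static string Clean(string row)
    {
        // Underscores stand in for spaces.
        string text = row.Replace('_', ' ');

        var fields = Regex.Split(text, @"\s*,\s*")
            // Strip quotes, apostrophes and whitespace from the two ends of each
            // field only, so the apostrophe inside Name's survives.
            .Select(f => Regex.Replace(f, @"^[\s""']+|[\s""']+$", ""));

        return string.Join(", ", fields);
    }

    public static void Main()
    {
        Console.WriteLine(Clean("\"Name's\", \"Sex\", \"Age\", \"Height_(in)\", \"Weight (lbs)\""));
        Console.WriteLine(Clean("\" LatD\", \"LatM \", 'LatS', \"NS\""));
    }
}
```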
How about a simple straight string manipulation way?
using System;
using System.Linq;
static void Main(string[] args)
{
string dirty1 = "\"Name's\", \"Sex\", \"Age\", \"Height_(in)\", \"Weight (lbs)\"";
string dirty2 = "\" LatD\", \"LatM \", 'LatS', \"NS\", \"LonD\", \"LonM\", \"LonS\", \"EW\", \"City\", \"State\"";
Console.WriteLine(Clean(dirty1));
Console.WriteLine(Clean(dirty2));
Console.ReadKey();
}
private static string Clean(string dirty)
{
return dirty.Split(',').Select(item => item.Trim(' ', '"', '\'')).Aggregate((a, b) => string.Join(", ", a, b));
}
private static string CleanNoLinQ(string dirty)
{
string[] items = dirty.Split(',');
for(int i = 0; i < items.Length; i++)
{
items[i] = items[i].Trim(' ', '"', '\'');
}
return String.Join(", ", items);
}
You can even replace the LINQ with a foreach and then string.Join().
Easier to understand - easier to maintain.

weird regex behavior in the tokenization

I am using the following regex to tokenize:
reg = new Regex("([ \\t{}%$^&*():;_–`,\\-\\d!\"?\n])");
The regex is supposed to filter out everything later; however, the input string format that I am having a problem with has the following form:
; "string1"; "string2"; "string...n";
For the string ; "social life"; "city life"; "real life", the result, as far as I know, should look like the following:
; White " social White life " ; White " city White life " ; White " real White life "
However there is a problem such that, I get the output in the following form
; empty White empty " social White life " empty ; empty White empty " city White life " empty ; empty White empty " real White life " empty
White: means White-Space,
empty: means empty entry in the split array.
My code for split is as following:
string[] ret = reg.Split(input);
for (int i = 0; i < ret.Length; i++)
{
if (ret[i] == "")
Response.Write("empty<br>");
else
if (ret[i] == " ")
Response.Write("White<br>");
else
Response.Write(ret[i] + "<br>");
}
Why do I get these empty entries? They show up especially when ; is followed by a space and then a ", in which case the result looks like the following:
; empty White empty "
Can I get an explanation of why the split adds empty entries, and how to remove them without an additional O(n) pass or another data structure besides ret?
In my experience, splitting at regex matches is almost always not the best idea. You'll get much better results through plain matching.
And regexes are very well suited for tokenization purposes, as they let you implement a state machine really easily, just take a look at that:
\G(?:
(?<string> "(?>[^"\\]+|\\.)*" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
Demo - use this with RegexOptions.IgnorePatternWhitespace of course.
Here, each match will have the following properties:
It will start at the end of the previous match, so there will be no unmatched text
It will contain exactly one matching group
The name of the group tells you the token type
You can ignore the whitespace group, and you should raise an error if you ever encounter a matching invalid group.
The string group will match an entire quoted string, it can handle escapes such as \" inside the string.
The invalid group should always be last in the pattern. You may add rules for other token types.
Some example code:
var regex = new Regex(@"
\G(?:
(?<string> ""(?>[^""\\]+|\\.)*"" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
", RegexOptions.IgnorePatternWhitespace);
var input = "; \"social life\"; \"city life\"; \"real life\"";
var groupNames = regex.GetGroupNames().Skip(1).ToList();
foreach (Match match in regex.Matches(input))
{
var groupName = groupNames.Single(name => match.Groups[name].Success);
var group = match.Groups[groupName];
Console.WriteLine("{0}: {1}", groupName, group.Value);
}
This produces the following:
separator: ;
whitespace:
string: "social life"
separator: ;
whitespace:
string: "city life"
separator: ;
whitespace:
string: "real life"
See how much easier it is to deal with these results rather than using split?
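As for where the empty entries in the question come from: Regex.Split returns the text between consecutive matches, and because the pattern is wrapped in a capturing group, the matched delimiters are returned as well. When two delimiters are adjacent (or one sits at the very start or end of the input), the text between them is an empty string. The sketch below uses a shortened, made-up pattern and input purely to show the effect; skipping the empty entries inside the existing loop avoids a second pass.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Simplified delimiter set (space, semicolon, double quote) for illustration.
    private const string Pattern = "([ ;\"])";

    // Raw result: captured delimiters plus the (possibly empty) text between them.
    public static string[] RawSplit(string input)
    {
        return Regex.Split(input, Pattern);
    }

    // Same result with the empty in-between strings dropped.
    public static string[] Tokens(string input)
    {
        return RawSplit(input).Where(s => s.Length > 0).ToArray();
    }

    public static void Main()
    {
        string input = "; \"a\"";
        Console.WriteLine(RawSplit(input).Length); // 9
        Console.WriteLine(Tokens(input).Length);   // 5
        Console.WriteLine(string.Join("|", Tokens(input)));
    }
}
```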

Regex: Matching individual characters only (no characters contained in strings)

Aim: To split a string Regex.Split(...) based on a pattern of individual characters, leaving the character matched at the beginning of a split list.
Problem: One of the characters can appear in other parts of the string I don't wish to split on and I'm getting more list items than intended.
Example of string to split: T 2 TBS PO And > Qd PRN MIX X A 3 TB \ A 4 TB Xmon UG
Outcome desired:
T 2 TBS PO And
> Qd PRN MIX
X A 3 TB
\ A 4 TB Xmon UG
Pattern: (?=[#\+X\\>])
This works for everything except the X. Instead of the desired outcome, I'm getting it split in undesired places.
Current outcome:
T 2 TBS PO And
> Qd PRN MI
X
X A 3 TB
\ A 4 TB X
mon UG
Basically, I need it to split on one of these characters only when the character is on its own, not when it is part of a longer string of characters.
Thanks in advance for your help
UPDATE: Oops! I seem to have forgotten to mention that the centre of the pattern, the characters to split by, has been pulled from a table and technically, I don't know the X is there beforehand (the characters may also change).
For this reason, Jonny 5/Jerry's suggestion seems the most viable to me. I'll test when I get into work.
You could put some \s in there to make sure the characters you are matching are alone:
(?<=\s)(?=[#\+\\>X]\s)
(?<=\s) makes sure the character is preceded by a space, and the space that follows makes sure the character is followed by a space.
Note: where 'space' is mentioned above, it actually means whitespace, tab, newline, carriage return.
Just split your regex into two alternatives joined with a pipe:
(?=[#\+\\>])|(?=\bX\b)
(?=[#\+\\>]) checks for your regular characters.
(?=\bX\b) checks for a standalone X.
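Since the delimiters come from a table and may change, the pattern can also be assembled at run time. The following is a rough sketch of the whitespace-based idea, not a drop-in answer: the method name is invented, each delimiter is passed through Regex.Escape, and a delimiter only counts when it has whitespace or a string boundary on both sides (so, unlike the (?<=\s) version, a delimiter at the very start of the input also matches).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenSplitter
{
    // Hypothetical helper: splits before every delimiter that stands alone.
    public static List<string> SplitOnStandalone(string source, IEnumerable<string> delimiters)
    {
        // Escape each delimiter so characters like + or \ are taken literally.
        string alternatives = string.Join("|", delimiters.Select(d => Regex.Escape(d)));

        // Zero-width match: no non-space before, then a delimiter with no non-space after it.
        string pattern = @"(?<!\S)(?=(?:" + alternatives + @")(?!\S))";

        return Regex.Split(source, pattern)
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        string input = @"T 2 TBS PO And > Qd PRN MIX X A 3 TB \ A 4 TB Xmon UG";
        string[] delimiters = { "#", "+", "\\", ">", "X" };
        foreach (string part in SplitOnStandalone(input, delimiters))
        {
            Console.WriteLine(part);
        }
    }
}
```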
Why not roll your own instead of using a regular expression:
public IEnumerable<string> CustomSplit( string source )
{
StringBuilder buf = new StringBuilder();
for ( int i = 0 ; i < source.Length ; ++i )
{
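char curr = source[i] ;
char prev = i > 0 ? source[i-1] : ' ' ;
char next = i+1 < source.Length ? source[i+1] : ' ' ;
// X is a delimiter only when it stands alone, so the X in "MIX" or "Xmon" does not split
bool isDelimiter = curr == '#'
|| curr == '+'
|| curr == '\\'
|| curr == '>'
|| ( curr == 'X' && char.IsWhiteSpace(prev) && char.IsWhiteSpace(next) )
;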
if ( isDelimiter )
{
if ( buf.Length > 0 ) yield return buf.ToString() ;
buf.Length = 0 ;
}
buf.Append(curr) ;
}
// return the last element, if there is one.
if ( buf.Length > 0 ) yield return buf.ToString() ;
}

checking input for morse code converter

I want to check the input from the user to make sure that they only enter dots and dashes; any other letters or numbers should give back an error message. Also, I want to allow the user to enter a space, so when I am converting, how can I remove or ignore the whitespace?
string permutations;
string entered = "";
do
{
Console.WriteLine("Enter Morse Code: \n");
permutations = Console.ReadLine();
.
.
} while(entered.Length != 0);
Thanks!
string permutations = string.Empty;
Console.WriteLine("Enter Morse Code: \n");
permutations = Console.ReadLine(); // read the console
bool isValid = Regex.IsMatch(permutations, @"^[-. ]+$"); // true if it only contains spaces, dots or dashes
if (isValid) //if input is proper
{
permutations = permutations.Replace(" ",""); //remove whitespace from string
}
else //input is not proper
{
Console.WriteLine("Error: Only dot, dashes and spaces are allowed. \n"); //display error
}
Let's assume that you separate letters by a single space and words by two spaces. Then you can test if your string is well formatted by using a regular expression like this
bool ok = Regex.IsMatch(entered, @"^(\.|-)+(\ {1,2}(\.|-)+)*$");
Regular expression explained:
^ is the beginning of the string.
\.|- is a dot (escaped with \ as the dot has a special meaning within Regex) or (|) a minus sign.
+ means one or more repetitions of what's left to it (dot or minus).
\ {1,2} one or two spaces (they are followed by dots or minuses again (\.|-)+).
* repeats the space(s) followed by dots or minuses zero or more times.
$ is the end of the line.
You can split the string at the spaces with
string[] parts = input.Split();
Two spaces will create an empty entry. This allows you to detect word boundaries. E.g.
"-- --- .-. ... .  -.-. --- -.. .".Split();
produces the following string array
{string[10]}
[0]: "--"
[1]: "---"
[2]: ".-."
[3]: "..."
[4]: "."
[5]: ""
[6]: "-.-."
[7]: "---"
[8]: "-.."
[9]: "."
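Putting the validation and the splitting together might look roughly like this. It is a sketch under the same assumption as above (one space between letters, two between words); the class and method names are invented, and the character class [.-] is just a shorter spelling of (\.|-).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MorseInput
{
    // Groups of dots/dashes separated by one or two spaces, nothing else.
    private static readonly Regex WellFormed = new Regex(@"^[.-]+( {1,2}[.-]+)*$");

    public static bool IsValid(string entered)
    {
        return entered != null && WellFormed.IsMatch(entered);
    }

    // Returns one array of letter codes per word. Call only on input that passed IsValid.
    public static string[][] SplitWords(string entered)
    {
        return entered
            .Split(new[] { "  " }, StringSplitOptions.None) // two spaces separate words
            .Select(word => word.Split(' '))                // one space separates letters
            .ToArray();
    }

    public static void Main()
    {
        string entered = "-- --- .-. ... .  -.-. --- -.. .";
        if (!IsValid(entered))
        {
            Console.WriteLine("Error: only dots, dashes and spaces are allowed.");
            return;
        }
        foreach (string[] word in SplitWords(entered))
        {
            Console.WriteLine(string.Join(" ", word));
        }
    }
}
```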

What RegEx string will find the last (rightmost) group of digits in a string?

Looking for a regex string that will let me find the rightmost (if any) group of digits embedded in a string. We only care about contiguous digits. We don't care about sign, commas, decimals, etc. Those, if found should simply be treated as non-digits just like a letter.
This is for replacement/incrementing purposes so we also need to grab everything before and after the detected number so we can reconstruct the string after incrementing the value so we need a tokenized regex.
Here are examples of what we are looking for:
"abc123def456ghi" should identify the '456'
"abc123def456ghi789jkl" should identify the '789'
"abc123def" should identify the '123'
"123ghi" should identify the '123'
"abc123,456ghi" should identify the '456'
"abc-654def" should identify the '654'
"abcdef" shouldn't return any match
As an example of what we want, it would be something like starting with the name 'Item 4-1a', extracting out the '1' with everything before being the prefix and everything after being the suffix. Then using that, we can generate the values 'Item 4-2a', 'Item 4-3a' and 'Item 4-4a' in a code loop.
Now if I were looking for the first set, this would be easy. I'd just find the first contiguous block of 0 or more non-digits for the prefix, then the block of 1 or more contiguous digits for the number, and then everything else to the end would be the suffix.
The issue I'm having is how to define the prefix as including all (if any) numbers except the last set. Everything I try for the prefix keeps swallowing that last set, even when I've tried anchoring it to the end by basically reversing the above.
How about:
^(.*?)(\d+)(\D*)$
then increment the second group and concat all 3.
Explanation:
^ : Beginning of string
( : start of 1st capture group
.*? : any number of any char not greedy
) : end group
( : start of 2nd capture group
\d+ : one or more digits
) : end group
( : start of 3rd capture group
\D* : any number of non digit char
) : end group
$ : end of string
The first capture group will match all characters until the first digit of last group of digits before the end of the string.
or if you can use named group
^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$
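In C#, the "increment the second group and concat all 3" step could be sketched as follows. The method name is made up for illustration, and the sketch assumes plain ASCII digits, a number that fits in a long, and that leading zeros do not need to be preserved.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastNumber
{
    private static readonly Regex Pattern =
        new Regex(@"^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$");

    // Hypothetical helper: bumps the rightmost group of digits by one.
    // Strings without any digits are returned unchanged.
    public static string Increment(string s)
    {
        Match m = Pattern.Match(s);
        if (!m.Success)
        {
            return s;
        }
        long next = long.Parse(m.Groups["number"].Value) + 1;
        return m.Groups["prefix"].Value + next + m.Groups["suffix"].Value;
    }

    public static void Main()
    {
        string name = "Item 4-1a";
        for (int i = 0; i < 3; i++)
        {
            name = Increment(name);
            Console.WriteLine(name); // Item 4-2a, then Item 4-3a, then Item 4-4a
        }
    }
}
```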
Try next regex:
(\d+)(?!.*\d)
Explanation:
(\d+) # One or more digits.
(?!.*\d) # (zero-width) Negative look-ahead: Don't find any characters followed with a digit.
EDIT (off-topic for the question): This answer is incorrect, but the question has already been answered in other posts. To avoid deleting this one, I will use the same regex another way; for example, in Perl it could be used like this to get the same result as in C# (increment the last number):
s/(\d+)(?!.*\d)/$1 + 1/e;
You can also try little bit simpler version:
(\d+)[^\d]*$
This should do it:
Regex regexObj = new Regex(@"
# Grab last set of digits, prefix and suffix.
^ # Anchor to start of string.
(.*) # $1: Stuff before last set of digits.
(?<!\d) # Anchor start of last set of digits.
(\d+) # $2: Last set of one or more digits.
(\D*) # $3: Zero or more trailing non digits.
$ # Anchor to end of string.
", RegexOptions.IgnorePatternWhitespace);
What about not using Regex? Here's a code snippet (for a console app):
string[] myStringArray = new string[] { "abc123def456ghi", "abc123def456ghi789jkl", "abc123def", "123ghi", "abcdef", "abc-654def", "abc123,456ghi" };
char[] numberSet = new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
foreach (string myString in myStringArray)
{
    Console.WriteLine("your string - {0}", myString);
    // position of the last digit anywhere in the string
    int end = myString.LastIndexOfAny(numberSet);
    if (end == -1)
        Console.WriteLine("no number");
    else
    {
        // walk left while the previous character is still a digit,
        // so any non-digit (letter, comma, dash) ends the group
        int start = end;
        while (start > 0 && Array.IndexOf(numberSet, myString[start - 1]) >= 0)
            start--;
        string prefix = myString.Substring(0, start);
        string number = myString.Substring(start, end - start + 1);
        string suffix = myString.Substring(end + 1);
        Console.WriteLine("prefix - {0}", prefix);
        Console.WriteLine("number - {0}", number);
        Console.WriteLine("suffix - {0}", suffix);
        Console.WriteLine("_________________");
    }
}
Console.Read();
