Comparing two strings but no effect - C#

So I have a list of terminals: a, b, c, d
I have a production like
A > aA | cC |a
I'm trying to check whether a part of production A is a terminal, that is, whether that part exists in the list of terminals. The problem is that when I compare the two parts, the result is always false. I have tried Equals, Contains and ==; the result is the same each time and I don't know why.
This is the code where I split the production and compare the two parts:
foreach (Production production in productions)
{
    String prod = production.ToString();
    String[] right = prod.Trim().Split('>');
    String justRightPart = right[1];
    String[] separate = justRightPart.Trim().Split('|');
    Boolean ok = true;
    foreach (String s in separate)
    {
        foreach (string terminal in terminals)
        {
            Console.WriteLine("Terminal: " + terminal + " string part is " + s);
            // "bool" is a reserved keyword in C#, so the flag is named isMatch here
            Boolean isMatch = terminal.Contains(s) || (terminal == s);
            Console.WriteLine("bool : " + isMatch);
        }
    }
}
and the result is always false, even when it prints:
Terminal: a string part is a
Why are they not equal?
Any suggestions?

There is probably whitespace in s that isn't in terminal, or vice versa. Try adding single quotes around the values when you print them, e.g.:
Console.WriteLine("Terminal: '" + terminal + "' string part is '" + s + "'");
It might be clearer if you rewrite the print using positional arguments, e.g.:
Console.WriteLine("Terminal: '{0}' string part is '{1}'", terminal, s);

I suggest using Linq, e.g.
List<String> terminals = new List<string>() { "a", "b", "c" };
string source = "A > aA | cC | a";
var hasTerminal = source
    .Substring(source.IndexOf('>') + 1)      // at the right of ">"
    .Split('|')                              // split to parts
    .Select(part => part.Trim())             // trim each part (leading/trailing white spaces)
    .Any(part => terminals.Contains(part));  // ... if part exists in list of terminals
You may want to implement a debugging report / log as well:
var report = source
    .Substring(source.IndexOf('>') + 1)
    .Split('|')
    .Select(part => part.Trim())
    .Select(part => $"{part,4} is {(terminals.Contains(part) ? "a term" : "NOT a term")}");
Console.Write(string.Join(Environment.NewLine, report));
The outcome is
  aA is NOT a term
  cC is NOT a term
   a is a term

Related

The specificity of sorting

The code of the character '-' is 45 and the code of the character 'a' is 97, so '-' < 'a' is clearly true.
Console.WriteLine((int)'-' + " " + (int)'a');
Console.WriteLine('-' < 'a');
45 97
True
Hence the result of the following sort is correct
var a1 = new string[] { "a", "-" };
Console.WriteLine(string.Join(" ", a1));
Array.Sort(a1);
Console.WriteLine(string.Join(" ", a1));
a -
- a
But why is the result of the following sort wrong?
var a2 = new string[] { "ab", "-b" };
Console.WriteLine(string.Join(" ", a2));
Array.Sort(a2);
Console.WriteLine(string.Join(" ", a2));
ab -b
ab -b
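This is because of the culture-sensitive sort, which is the default. It gives the hyphen a very small weight, so the - is effectively ignored:
"-" is treated like "", which sorts before "a"
"-b" is treated like "b", which sorts after "ab"
The CompareOptions documentation describes this:
https://msdn.microsoft.com/en-us/library/system.globalization.compareoptions(v=vs.110).aspx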
The .NET Framework uses three distinct ways of sorting: word sort, string
sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
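If the order by character code is what you expect, one option is to ask for an ordinal comparison explicitly. The following is a minimal sketch, not the only way to do it; the class and method names (OrdinalSortDemo, SortOrdinal) are made up for illustration, while Array.Sort and StringComparer.Ordinal are the standard .NET APIs.

```csharp
using System;

public static class OrdinalSortDemo
{
    // Returns a sorted copy of the array, comparing strings by the
    // numeric value of each character (ordinal sort) instead of the
    // culture-sensitive default.
    public static string[] SortOrdinal(string[] items)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    public static void Main()
    {
        var a2 = new[] { "ab", "-b" };
        // '-' (45) is smaller than 'a' (97), so "-b" comes first here.
        Console.WriteLine(string.Join(" ", SortOrdinal(a2)));
    }
}
```

With the ordinal comparer the hyphen is no longer given a special weight, so the two-character case sorts the same way as the single-character case.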

Weird regex behavior in tokenization

I am using the following regex to tokenize:
reg = new Regex("([ \\t{}%$^&*():;_–`,\\-\\d!\"?\n])");
The regex is supposed to filter out everything later; however, the input string format that I am having a problem with has the following form:
; "string1"; "string2"; "string...n";
As far as I know, the result for the string ; "social life"; "city life"; "real life" should look like this:
; White " social White life " ; White " city White life " ; White " real White life "
However, I get the output in the following form:
; empty White empty " social White life " empty ; empty White empty " city White life " empty ; empty White empty " real White life " empty
White: means white space,
empty: means an empty entry in the split array.
My code for the split is as follows:
string[] ret = reg.Split(input);
for (int i = 0; i < ret.Length; i++)
{
    if (ret[i] == "")
        Response.Write("empty<br>");
    else if (ret[i] == " ")
        Response.Write("White<br>");
    else
        Response.Write(ret[i] + "<br>");
}
Why do I get these empty entries? They show up especially when ; is followed by a space and then ", where the result looks like this:
; empty White empty "
Can I get an explanation of why the command adds empty entries, and how to remove them without any additional O(n) pass or a data structure other than ret?
In my experience, splitting at regex matches is almost always not the best idea. You'll get much better results through plain matching.
And regexes are very well suited for tokenization, as they let you implement a state machine really easily. Just take a look at this:
\G(?:
    (?<string> "(?>[^"\\]+|\\.)*" )
  | (?<separator> ; )
  | (?<whitespace> \s+ )
  | (?<invalid> . )
)
Use this with RegexOptions.IgnorePatternWhitespace, of course.
Here, each match will have the following properties:
It will start at the end of the previous match, so there will be no unmatched text
It will contain exactly one matching group
The name of the group tells you the token type
You can ignore the whitespace group, and you should raise an error if you ever encounter a matching invalid group.
The string group will match an entire quoted string, it can handle escapes such as \" inside the string.
The invalid group should always be last in the pattern. You may add rules for other token types.
Some example code:
var regex = new Regex(@"
    \G(?:
        (?<string> ""(?>[^""\\]+|\\.)*"" )
      | (?<separator> ; )
      | (?<whitespace> \s+ )
      | (?<invalid> . )
    )
", RegexOptions.IgnorePatternWhitespace);
var input = "; \"social life\"; \"city life\"; \"real life\"";
// Skip(1) drops group "0", which is the whole match
var groupNames = regex.GetGroupNames().Skip(1).ToList();
foreach (Match match in regex.Matches(input))
{
    // exactly one named group succeeds per match; its name is the token type
    var groupName = groupNames.Single(name => match.Groups[name].Success);
    var group = match.Groups[groupName];
    Console.WriteLine("{0}: {1}", groupName, group.Value);
}
This produces the following:
separator: ;
whitespace:
string: "social life"
separator: ;
whitespace:
string: "city life"
separator: ;
whitespace:
string: "real life"
See how much easier it is to deal with these results rather than using split?
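As for why Split produces the empty entries in the first place: Regex.Split returns the text between consecutive matches, and when two delimiters are adjacent (or a delimiter sits at the very start or end of the input) that text is the empty string. Because the pattern is wrapped in a capturing group, the delimiters themselves are returned as well. The sketch below illustrates this with a deliberately reduced delimiter set (space, semicolon and double quote only, which is an assumption made to keep the example short); the names SplitDemo, RawSplit and NonEmptyTokens are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Reduced delimiter set for illustration; the capturing group makes
    // Split return the delimiters along with the text between them.
    private static readonly Regex Delimiters = new Regex("([ ;\"])");

    // Plain Split: adjacent delimiters yield "" entries between them.
    public static string[] RawSplit(string input)
    {
        return Delimiters.Split(input);
    }

    // Same split, with the empty entries skipped while enumerating.
    public static string[] NonEmptyTokens(string input)
    {
        return Delimiters.Split(input).Where(t => t.Length > 0).ToArray();
    }

    public static void Main()
    {
        foreach (var token in RawSplit("; \"social life\""))
            Console.WriteLine("[" + token + "]");
    }
}
```

If the tokens are only consumed in a loop, the Where filter can be enumerated directly instead of calling ToArray, so no second array is needed; it still checks each entry once.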

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
    input,
    @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
    m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
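The multi-line lambda mentioned above could look roughly like the following sketch. It reuses the same pattern and strips whitespace from the matched text before taking the last four characters. The names AccountMasker and Mask are made up for illustration. One more assumption: because the pattern's optional \s can make the match begin with the space that separates it from the preceding word, the sketch puts one leading whitespace character back in that case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccountMasker
{
    // Same pattern as in the answer: a run of words that each contain a digit.
    private static readonly Regex DigitWords =
        new Regex(@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+");

    public static string Mask(string input)
    {
        return DigitWords.Replace(input, m =>
        {
            // Assumption: keep the separating space if the match starts with one.
            string lead = char.IsWhiteSpace(m.Value[0]) ? m.Value.Substring(0, 1) : "";
            // Remove all whitespace so the last four characters are never spaces.
            string compact = Regex.Replace(m.Value, @"\s", "");
            return lead + "xxxx" + compact.Substring(Math.Max(0, compact.Length - 4));
        });
    }

    public static void Main()
    {
        Console.WriteLine(Mask("111 2233 33"));
        Console.WriteLine(Mask("this is a a1234 b5678 test string"));
    }
}
```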
I don't think regex is the best way to solve this problem, which is why I am posting this answer. For situations this complex, building the corresponding regex is too difficult and, what is worse, its clarity and adaptability are much lower than those of a longer-code approach.
The code below delivers the functionality you are after; it is clear enough and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";   // collects the current run of words that contain digits
foreach (string word in temp)
{
    if (word.Any(char.IsDigit))
    {
        previousNum = true;
        tempOutput = tempOutput + word;
    }
    else
    {
        if (previousNum)
        {
            // the run has ended: mask it and append it to the output
            if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
            output = output + " " + tempOutput;
            previousNum = false;
            tempOutput = "";   // reset, otherwise a later run would include this one
        }
        output = output + " " + word;
    }
}
if (previousNum)
{
    // the input ended while still inside a run
    if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
    output = output + " " + tempOutput;
    previousNum = false;
}
output = output.Trim();   // drop the leading space added by the concatenation
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. As for the C# code, you'll need to check each match for a space at the beginning or end of the matched sequence (e.g., in the last example the space before and after will be selected).
Here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
    match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
        match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

C# Email Address validation

I just want to clarify one thing. Per a client request, we have to create a regular expression that allows an apostrophe in an email address.
My question: according to the RFC standard, can an email address contain an apostrophe? If so, how do I rewrite the regular expression to allow it?
The regular expression below implements the official RFC 2822 standard for email addresses. Using this regular expression in actual applications is NOT recommended. It is shown to illustrate that with regular expressions there's always a trade-off between what's exact and what's practical.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
You could use the simplified one:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
And yes, apostrophe is allowed in the email, as long as it is not in domain name.
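As a quick check, here is a minimal sketch that runs the simplified pattern against an address with an apostrophe in the local part and one with an apostrophe in the domain. Two things are assumptions added for the example: the ^ and $ anchors, so the whole input has to match, and RegexOptions.IgnoreCase, since the pattern only lists lowercase letters. The names SimpleEmailCheck and LooksValid are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleEmailCheck
{
    // The simplified pattern from above, anchored so the whole input must match.
    private static readonly Regex Simplified = new Regex(
        @"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*" +
        @"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$",
        RegexOptions.IgnoreCase);

    public static bool LooksValid(string address)
    {
        return Simplified.IsMatch(address);
    }

    public static void Main()
    {
        // Apostrophe in the local part is accepted by the pattern.
        Console.WriteLine(LooksValid("o'brien@example.com"));
        // Apostrophe in the domain is not.
        Console.WriteLine(LooksValid("obrien@exa'mple.com"));
    }
}
```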
Here's the validation attribute I wrote. It validates pretty much every "raw" email address, that is, those of the form local-part@domain. It doesn't support any of the other, more...creative constructs that the RFCs allow (this list is not comprehensive by any means):
comments (e.g., jsmith@whizbang.com (work))
quoted strings (escaped text, to allow characters not allowed in an atom)
domain literals (e.g. foo@[123.45.67.012])
bang-paths (aka source routing)
angle addresses (e.g. John Smith <jsmith@whizbang.com>)
folding whitespace
double-byte characters in either local-part or domain (7-bit ASCII only).
etc.
It should accept almost any email address that can be expressed thusly
foo.bar@bazbat.com
without requiring the use of quotes ("), angle brackets ('<>') or square brackets ([]).
No attempt is made to validate that the rightmost DNS label in the domain is a valid TLD (top-level domain). That is because the list of TLDs is far larger now than the "big 6" (.com, .edu, .gov, .mil, .net, .org) plus 2-letter ISO country codes. ICANN actually updates the TLD list daily, though I suspect that the list doesn't actually change daily. Further, ICANN just approved a big expansion of the generic TLD namespace. And some email addresses don't have what you'd recognize as a TLD (did you know that postmaster@. is theoretically valid and mailable? Mail to that address should get delivered to the postmaster of the DNS root zone.)
Extending the regular expression to support domain literals shouldn't be too difficult.
Here you go. Use it in good health:
using System;
using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;
namespace ValidationHelpers
{
    [AttributeUsage( AttributeTargets.Property | AttributeTargets.Field , AllowMultiple = false )]
    sealed public class EmailAddressValidationAttribute : ValidationAttribute
    {
        static EmailAddressValidationAttribute()
        {
            RxEmailAddress = CreateEmailAddressRegex();
            return;
        }
        private static Regex CreateEmailAddressRegex()
        {
            // references: RFC 5321, RFC 5322, RFC 1035, plus errata.
            string atom             = @"([A-Z0-9!#$%&'*+\-/=?^_`{|}~]+)" ;
            string dot              = @"(\.)" ;
            string dotAtom          = "(" + atom + "(" + dot + atom + ")*" + ")" ;
            string dnsLabel         = "([A-Z]([A-Z0-9-]{0,61}[A-Z0-9])?)" ;
            string fqdn             = "(" + dnsLabel + "(" + dot + dnsLabel + ")*" + ")" ;
            string localPart        = "(?<localpart>" + dotAtom + ")" ;
            string domain           = "(?<domain>" + fqdn + ")" ;
            string emailAddrPattern = "^" + localPart + "@" + domain + "$" ;
            Regex instance = new Regex( emailAddrPattern , RegexOptions.Singleline | RegexOptions.IgnoreCase );
            return instance;
        }
        private static Regex RxEmailAddress;
        public override bool IsValid( object value )
        {
            string s = Convert.ToString( value ) ;
            bool fValid = string.IsNullOrEmpty( s ) ;
            // we'll take an empty field as valid and leave it to the [Required] attribute to enforce that it's been supplied.
            if ( !fValid )
            {
                Match m = RxEmailAddress.Match( s ) ;
                if ( m.Success )
                {
                    string emailAddr = m.Value ;
                    string localPart = m.Groups[ "localpart" ].Value ;
                    string domain    = m.Groups[ "domain" ].Value ;
                    bool fLocalPartLengthValid = localPart.Length >= 1 && localPart.Length <= 64 ;
                    bool fDomainLengthValid    = domain.Length >= 1 && domain.Length <= 255 ;
                    bool fEmailAddrLengthValid = emailAddr.Length >= 1 && emailAddr.Length <= 256 ; // might be 254 in practice -- the RFCs are a little fuzzy here.
                    fValid = fLocalPartLengthValid && fDomainLengthValid && fEmailAddrLengthValid ;
                }
            }
            return fValid ;
        }
    }
}
Cheers!

Add one space after every two characters and add a character in front of every single character

I want to add one space after every two characters, and add a character in front of every single character.
This is my code:
string str2;
str2 = str1.ToCharArray().Aggregate("", (result, c) => result += ((!string.IsNullOrEmpty(result) && (result.Length + 1) % 3 == 0) ? " " : "") + c.ToString());
I have no problem separating every two characters with one space, but how do I know whether the separated string ends with a single character, so that I can add a character in front of it?
I understand that my question is confusing, as I'm not sure how to put what I want into words.
So I'll just give an example:
I have this string:
0123457
After separating every two characters with a space, I'll get:
01 23 45 7
I want to add a 6 in front of the 7.
Note: Numbers are dependent on user's input, so it's not always the same.
Thanks.
[TestMethod]
public void StackOverflowQuestion()
{
    var input = "0123457";
    var temp = Regex.Replace(input, @"(.{2})", "$1 ");
    Assert.AreEqual("01 23 45 7", temp);
}
Try something like this:
static string ProcessString(string input)
{
    StringBuilder buffer = new StringBuilder(input.Length * 3 / 2);
    for (int i = 0; i < input.Length; i++)
    {
        // insert a space before every even index except the first
        if ((i > 0) && (i % 2 == 0))
            buffer.Append(" ");
        buffer.Append(input[i]);
    }
    return buffer.ToString();
}
Naturally you'd need to add in some logic about the extra numbers, but the basic idea should be clear from the above.
Maybe you can try this, if I understand your request correctly:
String.Length % 2
If the result is 0, you are done after the first iteration; if not, just add a character in front of the last one.
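Combining the length check with the pairing loop could look like the sketch below. It is only an illustration under one assumption: the character to put in front of the lone last character is passed in by the caller (the question does not say how it is chosen). The names PairFormatter and PairUp are made up.

```csharp
using System;
using System.Text;

public static class PairFormatter
{
    // Groups the input into space-separated pairs. If the length is odd,
    // the last character would stand alone, so filler is placed in front of it.
    public static string PairUp(string input, char filler)
    {
        var buffer = new StringBuilder(input.Length * 3 / 2 + 2);
        bool oddLength = input.Length % 2 == 1;
        for (int i = 0; i < input.Length; i++)
        {
            // A new pair starts at every even index except the first.
            if (i > 0 && i % 2 == 0)
                buffer.Append(' ');
            // Only the final character of an odd-length input is unpaired.
            if (oddLength && i == input.Length - 1)
                buffer.Append(filler);
            buffer.Append(input[i]);
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(PairUp("0123457", '6'));
    }
}
```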
I think this is what you asked for
string str1 = "3322356";
string str2;
str2 = String.Join(" ",
    str1.ToCharArray().Aggregate("",
        (result, c) => result += ((!string.IsNullOrEmpty(result) &&
            (result.Length + 1) % 3 == 0) ? " " : "") + c.ToString())
    .Split(' ').ToList().Select(
        x => x.Length == 1
            ? String.Format("{0}{1}", Int32.Parse(x) - 1, x)
            : x).ToArray());
result is "33 22 35 56"
