Regex expression not filtering special symbols - c#

I'm currently using the following line of code:
Regex Regex_Alpha = new Regex(@"[a-zA-Z]+('[a-zA-Z])?[a-zA-Z]*");
What I want to do is filter the input of text fields so that it contains only letters and the apostrophe symbol (actually, I still want to add more, but I'm trying to resolve this first).
Right now, it is accepting ALL characters, even numbers.
With my understanding of Regex, I tried to formulate my own expression along the lines of:
Regex Regex_Alpha = new Regex(@"^[a-zA-Z'-"+$);
It filters numbers, but doesn't accept the apostrophe symbol. I tried removing the @ sign and escaping the apostrophe with a backslash, but it still doesn't work.
What is the best approach to filter the input so that it only accepts letters and the apostrophe? (I'll do the rest of the symbols once I understand how this one should work)

As I've commented, your first regular expression is a pretty good shot at "letters, with a single apostrophe not at either end". However, it matches any string that contains even a single letter, because a regular expression looks for a match anywhere in the input, not for whether the entire input matches.
You can fix this by doing what you've done in your second regular expression: just put a ^ at the start and a $ at the end. This means the start and end of the expression have to match the start and end of the input, so it ensures the whole input is made up only of letters and a possible apostrophe.
Regarding your second regular expression, you have a few problems.
If you want a double quote in a @"..." string literal, you need to put two double quotes. (I think this might just be a typing mistake in your question, as what you currently have wouldn't even compile.)
You need to close your character class with a ], otherwise the pattern is not a valid character class at all.
If you want a literal hyphen in a character class, it has to go at the start or end (or be escaped), or it gets mistaken for a "between" hyphen (as in A-Z).
The expression @"^[a-zA-Z'""-]+$" should match "any string entirely made of letters, apostrophes, quotes or hyphens".
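A small sketch of the difference the anchors make, assuming a console program with top-level statements (.NET 6 or later); the variable names and sample inputs are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Unanchored: IsMatch succeeds if the pattern matches anywhere in the input.
var unanchored = new Regex(@"[a-zA-Z]+('[a-zA-Z])?[a-zA-Z]*");
// Anchored: the whole input must be letters with at most one inner apostrophe.
var anchored = new Regex(@"^[a-zA-Z]+('[a-zA-Z])?[a-zA-Z]*$");
// Character-class version: letters, apostrophes, double quotes and hyphens only.
var charClass = new Regex(@"^[a-zA-Z'""-]+$");

Console.WriteLine(unanchored.IsMatch("abc123"));        // True: "abc" matches somewhere
Console.WriteLine(anchored.IsMatch("abc123"));          // False: digits break the full match
Console.WriteLine(anchored.IsMatch("O'Brien"));         // True
Console.WriteLine(charClass.IsMatch("rock-'n'-roll"));  // True
```

Note that the character-class version allows any number of apostrophes in any position, while the anchored first pattern allows at most one, and only between letters.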

How to match string by using regular expression which will not allow same special character at same time?

I'm trying to match a string that must not contain two special characters in a row.
My regular expression is:
[RegularExpression(@"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
This meets all my requirements except for the two issues below.
This is my string: bracks
Acceptable:
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (the regular expression above already handles these)
Not acceptable:
Issue 1: b.. or bra..cks, b..racks, bra...cks (two or more special characters together)
Issue 2: bra  cks (two or more whitespace characters together)

You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
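As a quick check of that lookahead pattern, here is a sketch in C# (top-level statements assumed; the sample inputs are taken from the question plus one made-up edge case):

```csharp
using System;
using System.Text.RegularExpressions;

// The lookahead rejects any input that contains two adjacent special characters.
var noDoubles = new Regex(@"^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$");

foreach (var s in new[] { "bra-cks", "b.r.a.c.ks", "bra..cks", "bra  cks", ".bracks" })
{
    Console.WriteLine($"{s}: {noDoubles.IsMatch(s)}");
}
```

The last input, ".bracks", is accepted: this pattern only forbids adjacent special characters and does not by itself require an alphanumeric character at the start or end.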

The goal
From what I can tell from your description and pattern, you are trying to match text that starts and ends with an alphanumeric character (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ in your original pattern) and that does not contain two consecutive (adjacent) special characters, which, again guessing from the regex, are . & ' - and the space
What was wrong
^+ - I think here you wanted to ensure that the match starts at the beginning of the line/string, so you don't need the + here
[a-zA-Z0-9.&' '-] - in this character class you doubled the ', which is unnecessary
Solution
Please try pattern
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more repetitions of the preceding pattern
$ - anchor, match the end of the string
Regex demo

Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.
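To compare the two corrected patterns from this answer side by side, a short C# sketch (top-level statements assumed; the names strict and spaced are invented here):

```csharp
using System;
using System.Text.RegularExpressions;

// One delimiter at a time, always surrounded by alphanumerics.
var strict = new Regex(@"^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$");
// Follow-up: a delimiter may additionally be padded with single spaces.
var spaced = new Regex(@"^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$");

foreach (var s in new[] { "bra-cks", "b.r.a.c.ks", "bra cks", "bra..cks", "bra  cks", "b..", "bra & cks" })
{
    Console.WriteLine($"{s}: strict={strict.IsMatch(s)}, spaced={spaced.IsMatch(s)}");
}
```

Only the last input, "bra & cks", is treated differently by the two patterns: the first rejects it and the follow-up accepts it.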

Difficulty finding where to insert "word exclusion" in a regex

I know the regex for excluding words, roughly anyway. It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?

ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
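For concision, let's make the negative lookaheads more compact, replacing them with a single (?!.*(?: |-|'){2}). Note that this is slightly stricter than the three separate lookaheads: it also rejects mixed pairs, such as a space followed by a dash.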
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes [a-zA-Z], so you can drop [a-zA-Z]. But are you sure you want to match all letters in any script? (Thai etc). If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)
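To illustrate the effect of the \b boundaries, here is a sketch in C# (the question does not name its regex flavor, so .NET is an assumption, as are the shortened word list and the sample inputs; RegexOptions.IgnoreCase stands in for the inline (?i)):

```csharp
using System;
using System.Text.RegularExpressions;

// Without boundaries: any occurrence of a listed word, even inside another word, is rejected.
var embedded = new Regex(
    @"^(?!.*(?:the|and|or|a))(?!.*(?: |-|'){2})[\\0-9 :/\p{L}'-]{1,64}$",
    RegexOptions.IgnoreCase);
// With \b boundaries: only whole words are rejected.
var wholeWord = new Regex(
    @"^(?!.*\b(?:the|and|or|a)\b)(?!.*(?: |-|'){2})[\\0-9 :/\p{L}'-]{1,64}$",
    RegexOptions.IgnoreCase);

Console.WriteLine(embedded.IsMatch("catalog"));       // False: contains the letter "a"
Console.WriteLine(wholeWord.IsMatch("catalog"));      // True: "a" is not a separate word here
Console.WriteLine(wholeWord.IsMatch("cat and dog"));  // False: "and" is a whole word
Console.WriteLine(wholeWord.IsMatch("blue  sky"));    // False: two consecutive spaces
```

With short entries such as "a" or "or" in the list, the version without boundaries rejects almost every ordinary input, so the bounded form is probably what is wanted here.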

use this:
^(?!.*(the|and|or|a|given|some|that|this|then|than))(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$
see demo

Regular expression to replace a string

I'm working on some code inherited from someone else and trying to understand some regular expression code in C#:
Regex.Replace(query, @"""[^""~]+""([^~]|$)",
m => string.Format(field + "_exact:{0}", m.Value))
What is the above regular expression doing? This is in relation to input from a user performing a search. It's doing a replace of the query string using the pattern provided in the second argument, with the value of the third. But what is that regular expression? For the life of me, it doesn't make sense. Thanks.

As far as I can see, xanatos' answer is correct. I tried to understand the regex, so here it comes:
"[^"~]+"([^~]|$)
You can test our regex and play with the single parts for better understanding at http://www.regexpal.com/
1.) a single character
"
The first pattern is a literal character. Since there is no statement of relative position, it can occur everywhere.
2.) a character class
[^"~]
The next expression is the []-bracket. This is a character class. It defines a set of characters, any one of which may come next. It is a placeholder for one single character. So let's look inside to see which content is allowed:
^"~
The definition of the character class begins with a caret (^), which is a special character. A caret right after the opening square bracket negates the character class. So it's "upside down": any character that is not listed in the class matches.
In this case, every character is possible except the two excluded ones: " or ~.
3.) a special character
+
The next expression, a plus, tells the engine to attempt to match the preceding token once or more.
So the character class has to match one or more times in a row.
4.) a single character
"
To match, the input must then contain a second double quote. It pairs with the one from 1.), since the character class in 2.), repeated by 3.), does not permit a double quote.
5.) a group with alternation
([^~]|$)
The first structure to examine here is the ()-bracket. This is a capturing group.
It is not a lookaround: a lookaround is written (?=...) or (?!...) and only asserts a position without extending the match.
Whatever this group matches therefore becomes part of the overall match.
The group has two alternatives, which are connected by a logical OR via the pipe symbol: |
So what follows the closing quote must be either
[^~] one single character that is anything except ~ (this character is consumed and included in the match)
or
$ the end of the string (or the end of a line, if multiline mode is enabled)
I'll try to edit my answer to a better format, since this is my first post, I first have to check out how this is working.. :)
Update:
To "detect" an asterisk at the start or end of the line, do the following:
First, it's a special character, so you have to escape it with a backslash: \*
To define the position, you can use:
^ to look at the beginning of the line,
$ end of the line
The overall expression would be:
^\* in front of the expression to search for an * at the beginning of the line, and \*$ at the end of the regex to demand an * at the end.
In your case you can add \*$ as a further alternative in the last group to detect an * at the end:
([^~]|$|\*$)
and to force an * at the end, delete the other alternatives:
(\*$)

The @ makes it necessary to escape every " with a second ", so "". Without it, you would escape the " as \", but I consider it better to always use @ in regexes, because the \ is used quite often, and it's boring and unreadable to always have to escape it as \\.
Let's see what the regex really is:
Console.WriteLine(@"""[^""~]+""([^~]|$)");
is
"[^"~]+"([^~]|$)
So now we can look at the "real" regex.
It looks for a " followed by one or more non-" and non-~ followed by another " followed by a non-~ or the end of the string. Note that the match could start after the start of the string and it could end before the end of the string (with a non-~)
For example in
car"hello"help
it would match "hello"h
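Putting the original call together with that reading of the pattern, a runnable sketch might look like this (the value of field and the sample queries are invented for illustration; top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

string field = "title";  // hypothetical field name, not from the question
string pattern = @"""[^""~]+""([^~]|$)";

// Same call shape as in the question: prefix every quoted phrase
// that is not directly followed by a ~.
string Rewrite(string query) =>
    Regex.Replace(query, pattern, m => string.Format(field + "_exact:{0}", m.Value));

Console.WriteLine(Rewrite("car \"hello\" help"));  // car title_exact:"hello" help
Console.WriteLine(Rewrite("\"hello\"~2"));         // "hello"~2 (unchanged)
Console.WriteLine(Regex.Match("car\"hello\"help", pattern).Value);  // "hello"h
```

Because the character after the closing quote is part of m.Value, it is carried over into the replacement unchanged.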

Regex for non-alphabets and non-numerals

Please provide a regular expression for the following in C#/.NET:
I need a regex for non-alphabetic characters (anything other than a to z and A to Z) and non-numerals (anything other than 0 to 9).
In other words, a regular expression that matches everything except letters and the digits 0 to 9.
Kindly suggest a solution.

You can use a negated character class here:
[^a-zA-Z0-9]
Above regex will match a single character which can't be a latin lowercase or uppercase letter or a number.
The ^ at the start of the character class (the part between [ and ]) negates the complete class so that it matches anything not in the class, instead of normal character class behavior.
To make it useful, you probably want one of those:
Zero or more such characters
[^a-zA-Z0-9]*
The asterisk (*) here signifies that the preceding part can be repeated zero or more times.
One or more such characters
[^a-zA-Z0-9]+
The plus (+) here signifies that the preceding part can be repeated one or more times.
A complete (possibly empty) string, consisting only of such characters
^[^a-zA-Z0-9]*$
Here the characters ^ and $ have a meaning as anchors, matching the start and end of the string, respectively. This ensures that the entire string consists of characters not in that character class and no other characters come before or after them.
A complete (non-empty) string, consisting only of such characters
^[^a-zA-Z0-9]+$
Elaborating a bit, this won't (and can't) make sure that you won't use any other characters, possibly from other scripts. The string аеΒ would be completely valid with the above regular expression, because it uses letters from Greek and Cyrillic. Furthermore there are other pitfalls. The string á will pass above regular expression, while the string ́a will not (because it constructs the letter á from the letter a and a combining diacritical mark).
So negated character classes have to be taken with care at times.
I can also use numerals from other scripts, if I wanted to: ١٢٣ :-)
You can use the character class
[^\p{L&}\p{Nd}]
if you need to take care of the above things.
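A sketch of the difference in C# (top-level statements assumed). As far as I know, .NET does not accept the \p{L&} shorthand, so this sketch swaps in \p{L} (all letters), which is slightly broader; the test strings are written with \u escapes so the script of each character is unambiguous:

```csharp
using System;
using System.Text.RegularExpressions;

// ASCII-only notion of "letter or digit".
var asciiOnly = new Regex(@"^[^a-zA-Z0-9]+$");
// Unicode-aware: excludes letters and decimal digits of any script.
var unicodeAware = new Regex(@"^[^\p{L}\p{Nd}]+$");

string cyrillicGreek = "\u0430\u0435\u0392";  // Cyrillic a, Cyrillic ie, Greek capital beta
string arabicDigits = "\u0661\u0662\u0663";   // Arabic-Indic digits one, two, three

Console.WriteLine(asciiOnly.IsMatch("!@#"));             // True
Console.WriteLine(asciiOnly.IsMatch(cyrillicGreek));     // True: not ASCII letters
Console.WriteLine(unicodeAware.IsMatch(cyrillicGreek));  // False: they are letters
Console.WriteLine(asciiOnly.IsMatch(arabicDigits));      // True: not ASCII digits
Console.WriteLine(unicodeAware.IsMatch(arabicDigits));   // False: they are decimal digits
```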

just negate the class:
[^A-Za-z0-9]

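To obey the locale setting, use the POSIX class below (in engines that support POSIX bracket expressions; .NET's Regex class does not):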
[^[:alnum:]]

regular expression should split , that are contained outside the double quotes in a CSV file?

This is the sample
"abc","abcsds","adbc,ds","abc"
Output should be
abc
abcsds
adbc,ds
abc

Try this:
"(.*?)"
if you need to put this regex inside a literal, don't forget to escape it:
Regex re = new Regex("\"(.*?)\"");

This is a tougher job than you realize -- not only can there be commas inside the quotes, but there can also be quotes inside the quotes. Two consecutive quotes inside of a quoted string does not signal the end of the string. Instead, it signals a quote embedded in the string, so for example:
"x", "y,""z"""
should be parsed as:
x
y,"z"
So, the basic sequence is something like this:
Find the first non-white-space character.
If it was a quote, read up to the next quote. Then read the next character.
Repeat until that next character is not also a quote.
If the next (non-whitespace) character is not a comma, input is malformed.
If it was not a quote, read up to the next comma.
Skip the comma, repeat the whole process for the next field.
Note that despite the tag, I'm not providing a regex -- I'm not at all sure I've seen a regex that can really handle this properly.
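The steps above can be sketched as a small hand-written parser. This is only an illustration under simplifying assumptions (a single line, no embedded line breaks, whitespace trimmed only before a field and after a closing quote); ParseCsvLine is a name invented here, and top-level statements are assumed:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    int i = 0;
    while (true)
    {
        // Find the first non-whitespace character of the field.
        while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
        var field = new StringBuilder();
        if (i < line.Length && line[i] == '"')
        {
            i++;  // skip the opening quote
            while (true)
            {
                if (i >= line.Length) throw new FormatException("Unterminated quoted field");
                if (line[i] != '"') { field.Append(line[i++]); continue; }
                // A doubled quote is an embedded quote; a single one closes the field.
                if (i + 1 < line.Length && line[i + 1] == '"') { field.Append('"'); i += 2; }
                else { i++; break; }
            }
            while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
            if (i < line.Length && line[i] != ',')
                throw new FormatException("Expected a comma after a quoted field");
        }
        else
        {
            // Unquoted field: read up to the next comma.
            while (i < line.Length && line[i] != ',') field.Append(line[i++]);
        }
        fields.Add(field.ToString());
        if (i >= line.Length) break;
        i++;  // skip the comma and continue with the next field
    }
    return fields;
}

foreach (var f in ParseCsvLine("\"x\", \"y,\"\"z\"\"\"")) Console.WriteLine(f);
```

For real data, a dedicated CSV library is likely the safer choice; the sketch only shows that the quoting rules are easy to follow with a character-by-character scan.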

This answer has a C# solution for dealing with CSV.
In particular, the line
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
contains the Regex used to split properly, i.e., taking quoting and escaping into consideration.
Basically what it says is, match any comma that is followed by an even number of quote marks (including zero). This effectively prevents matching a comma that is part of a quoted string, since the quote character is escaped by doubling it.
Keep in mind that the quotes in the above line are doubled for the sake of the string literal. It might be easier to think of the expression as
,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
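Applied to the sample line from the question, that expression could be used roughly as follows (a sketch with top-level statements; stripping the outer quotes with Trim is a shortcut added here that is only adequate for simple fields without embedded quotes):

```csharp
using System;
using System.Text.RegularExpressions;

// Split on commas that are followed by an even number of quote characters.
var splitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

string line = "\"abc\",\"abcsds\",\"adbc,ds\",\"abc\"";
string[] parts = splitter.Split(line);

foreach (var part in parts)
{
    // Each part still carries its surrounding quotes, e.g. "adbc,ds".
    Console.WriteLine(part.Trim('"'));
}
```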

If you can be sure there are no inner, escaped quotes, then I guess it's ok to use a regular expression for this. However, most modern languages already have proper CSV parsers.
Using a proper parser is the correct answer to this. Text::CSV for Perl, for example.
However, if you're dead set on using regular expressions, I'd suggest you "borrow" from some sort of module, like this one:
http://metacpan.org/pod/Regexp::Common::balanced
