Adding "Max Length" to Regex - c#

How can I extend an existing regex so that it also rejects input longer than a maximum length of (let's say) 255 characters?
I've got the following regex:
([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)
I've tried it like that, but failed:
{.,255([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)}

If it has to be a purely regex-based solution, the best way of doing this is to use lookarounds.
View this example: http://regex101.com/r/yM3vL0
What I am doing here is only matching strings that are at most three characters long. Granted, for my example, this is not the best way to do it. But ignore that, I'm just trying to show an example that will work for you.
You also have to anchor your pattern. Without the anchor, the engine simply retries the match at later positions, where fewer than 256 characters remain, so the lookaround succeeds there and the length limit has no effect.
In other words, you can use the following in your regular expression to limit it to at most 255 characters:
^(?!^.{256})([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)
I also feel it is my duty to tell you your regular expression is bad and you should feel bad.
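The effect of the anchor can be seen in a small C# sketch. The pattern [a-z]+ and the limit of 3 characters are placeholders chosen for illustration; they are not taken from the question.

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical stand-in pattern; the intended limit is 3 characters.
    public static bool Unanchored(string input)
    {
        // No anchor: the engine may begin at a later position where
        // fewer than four characters remain, so the lookahead passes there.
        return Regex.IsMatch(input, @"(?!.{4})[a-z]+");
    }

    public static bool Anchored(string input)
    {
        // Anchored: the lookahead is only ever tested at the start of the input.
        return Regex.IsMatch(input, @"^(?!.{4})[a-z]+");
    }
}
```

With the input "abcdef", Unanchored still reports a match (it finds "def" near the end), while Anchored rejects the string.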

A regex is not really made to solve all problems. In this case, I'd suggest that testing a length of 255 is going to be expensive because that's going to require 255 states in the underlying representation. Instead, just test the length of the string separately.
But if you really must, you need a bounded quantifier that allows anything from zero characters up to the maximum, anchored at both ends, so something like:
^.{0,255}$
This will match any string of 255 or fewer characters.

Why not just check for Max Length of the string as well? If you're using DataAnnotations, you can stick [StringLength(255)] on the property.
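If you're using ASP.NET validators, note that a RangeValidator checks the range of a value rather than its length; setting MaxLength on the TextBox or using a CustomValidator fits this case better.
If you're using a custom validation function, it's much more readable (and faster) to check the length before you throw a complex regex against it.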
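A minimal sketch of the "check the length first" approach in a custom validation function. The class name, the method name and the placeholder pattern are assumptions for illustration; substitute the real pattern from the question.

```csharp
using System.Text.RegularExpressions;

public static class LengthFirstValidator
{
    // Hypothetical placeholder for the real (complex) pattern.
    private static readonly Regex Shape = new Regex(@"^[a-z]+$");

    public static bool IsValid(string input, int maxLength = 255)
    {
        // Cheap length test first; the regex never runs on oversized input.
        if (input == null || input.Length > maxLength)
        {
            return false;
        }
        return Shape.IsMatch(input);
    }
}
```

Keeping the two checks separate also makes it possible to report "too long" and "wrong format" as different errors.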

You "may" be able to use a look-ahead as follows:
^(?=.{0,255}$)your regex here$
So...
^(?=^.{0,255}$)([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
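The look-ahead prefix can be added to any existing pattern programmatically. The helper below is a sketch under assumptions: the names BoundedRegex and WithMaxLength are made up, and it uses \A and \z with RegexOptions.Singleline instead of ^ and $, which is a slightly stricter variant (no trailing newline is tolerated and newlines count toward the length).

```csharp
using System.Text.RegularExpressions;

public static class BoundedRegex
{
    // Wraps an existing pattern so that the whole input must match it
    // and must be at most maxLength characters long.
    public static Regex WithMaxLength(string pattern, int maxLength)
    {
        string bounded = @"\A(?=.{0," + maxLength + @"}\z)(?:" + pattern + @")\z";
        // Singleline lets '.' in the look-ahead count newline characters too.
        return new Regex(bounded, RegexOptions.Singleline);
    }
}
```

For example, BoundedRegex.WithMaxLength("[a-z]+", 5) accepts "abcde" but rejects "abcdef".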

Related

RegEx for a specific string pattern

Using C#, I will be handling character arrays of info, looking for the following pattern:
a pipe (0x7C), 2 to 7 pairs of characters, followed by another pipe (0x7C).
Stated another way:
|1122[33][44][55][66][77]|
The character pairs consist of characters whose range is from 33-124 decimal ( '!' to '|').
Pairs 3 through 7 are optional, but occur in order, if they occur, so you could have
|1122| <---shortest
|112233|
|11223344|
|1122334455|
|112233445566|
|11223344556677| <---longest
I want to 1) find out if this pattern exists in the character array, and 2) extract the individual pairs. These tasks can be separate. I think the best approach to this would be a RegEx, but so far I haven't been able to dream up an expression to get the job done.
Is a RegEx the way to go and what would a solution for the RegEx itself be?
Is there a better way?
Chuck
If I understand your question correctly, the pattern would be:
\|([!-|]{2}){2,7}\|
Or, to capture each pair in its own group:
\|([!-|]{2})([!-|]{2})([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?([!-|]{2})?\|
I'm not sure whether the range works directly like that; if it doesn't, you may need to list the characters explicitly, as in [A-Za-z!##$......].
Also, I think you don't want to include the pipe (|) in the range, since a pair could then swallow the closing delimiter, so [!-{] might be better.
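In .NET the short form of the pattern is enough to extract the pairs, because a repeated group keeps every capture it made. The sketch below assumes the [!-{] range suggested above; the class and method names are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class PairExtractor
{
    // A pipe, 2 to 7 pairs of characters in the range '!'..'{', then a pipe.
    private static readonly Regex Framed = new Regex(@"\|([!-{]{2}){2,7}\|");

    // Returns the pairs of the first framed block, or an empty array if none is found.
    public static string[] ExtractPairs(string input)
    {
        Match m = Framed.Match(input);
        if (!m.Success)
        {
            return new string[0];
        }
        // Group 1 repeats, so its Captures collection holds one entry per pair.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }
}
```

A character array can be passed in as new string(chars). For the input "xx|112233|yy" the method returns the three pairs 11, 22 and 33.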

Matching operators

I have a text input containing lots of operators, variables and English words. From this input I have to extract just the operators.
As of now I'm using regular expression matching, so which operators are matched depends on the regular expression. The problem is that = is matched inside <=, and & is matched inside &&. I need to match = and <= as separate operators.
Is there any better way for matching the operators other than regex?
As far as regex goes, you could have the pattern match the special (compound) cases first and the catch-all last, using simple alternation. For your simple input: /<=|&&|=|&/. This isn't necessarily terrible; you can still put whatever your catch-all is after that: /special1|...specialN|special-chars-catch-all/
This technique is useful in cases where a greedy expression would grab too much: in if($x==-1) you would want ==, not ==-
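A sketch of the ordered-alternation idea in C#. The operator list is an assumption for illustration; extend it with whatever operators the input language has, always listing compound operators before the single characters they begin with.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class OperatorScanner
{
    // Alternatives are tried left to right, so "<=" wins over "<" and "=".
    private static readonly Regex Ops = new Regex(@"<=|>=|==|&&|=|<|>|&");

    public static string[] FindOperators(string text)
    {
        return Ops.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();
    }
}
```

For "if (a <= b && c = d)" this yields <=, && and =, each as one operator.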
Look at the extended variants in your RE language.
In most RE languages /[<](?![=])/ will match a "<" that stands alone but not the "<" in "<=", for example. The (?! ... ) means "except when followed by ...". The term for this is Negative Look-ahead Assertion. These are sometimes spelled differently, as they are less standard than most other constructs, but they are usually available. They never consume any characters, but they can make matching slower.
The "except when preceded by" form, the Negative Look-behind Assertion, is sometimes also available, but you may wish to avoid it. It is seldom clear to a reader and can also make matching slower.
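Both assertions are available in .NET. The sketch below finds a "<" that is not part of "<=" and an "=" that is not part of a longer operator; the set of neighbouring characters excluded by the look-behind is an assumption for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class BareOperators
{
    // "<" only when it is not followed by "=".
    private static readonly Regex BareLess = new Regex(@"<(?!=)");

    // "=" only when not preceded by <, >, = or ! and not followed by "=".
    private static readonly Regex BareEquals = new Regex(@"(?<![<>=!])=(?!=)");

    public static int[] FindBareLessThan(string text)
    {
        return BareLess.Matches(text).Cast<Match>().Select(m => m.Index).ToArray();
    }

    public static int[] FindBareEquals(string text)
    {
        return BareEquals.Matches(text).Cast<Match>().Select(m => m.Index).ToArray();
    }
}
```

Because the assertions consume nothing, each match is exactly one character long and its Index points at the operator itself.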
There probably is. But as an alternative, you could have your regex as (e.g.):
[><=&|]+
(Modify to your specifications - not sure if you want addition, subtraction, ++ for incrementing etc too).
The + means "one or more" and so the regex matches as many characters as possible, meaning that if <= is in the text, it will match <= rather than < and then =.
Then, only once you've extracted all the matches, loop through them all and classify them.
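A sketch of the "grab the whole run, classify afterwards" approach. The table of known operators is an assumption for illustration; a run that is not in the table (for example "=<") is reported as unknown rather than silently split.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OperatorClassifier
{
    // One or more operator characters in a row form a single candidate.
    private static readonly Regex Runs = new Regex(@"[><=&|]+");

    // Assumed operator table; adjust to the language being parsed.
    private static readonly HashSet<string> Known = new HashSet<string>
    {
        "<", ">", "=", "<=", ">=", "==", "&", "&&", "|", "||"
    };

    // Returns each run together with a flag saying whether it is a known operator.
    public static List<KeyValuePair<string, bool>> Classify(string text)
    {
        var result = new List<KeyValuePair<string, bool>>();
        foreach (Match m in Runs.Matches(text))
        {
            result.Add(new KeyValuePair<string, bool>(m.Value, Known.Contains(m.Value)));
        }
        return result;
    }
}
```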
I think you might still be able to get regex to do what you want.
If you want to completely abandon it, please forgive me and ignore my suggestion :)
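If you want to use regex to detect just =, you could use [^<>=]=[^<>=], which means "match the equals sign only when it is neither preceded nor followed by <, > or another =". Note that this also consumes the neighbouring characters and needs a character on each side, so it misses an = at the very start or end of the input.
For ampersands, a quantifier such as &{1} does not help, because it is equivalent to a plain & and still matches inside &&; the same neighbour test, [^&]&[^&], detects one (and only one) ampersand.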
(NB you might need to escape a couple of those symbols with \)
I hope that might help. Good luck.
K.
If you do multiple passes, you can also find the compound operators and then replace them with other characters before a pass that finds the simple ones.
This is often a useful approach anyway: to slowly overwrite your interpreted string as it is processed, so that what is left when you are done is just tokens. RE processors often return index ranges. So you can easily go back and overwrite that range with something no one else will match later (like a control-character token, a NUL, or a tilde).
An advantage is that you can then have debug code that does a verification pass to check that you have not left anything around uninterpreted.
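A sketch of the multi-pass idea, assuming a small made-up operator set and using NUL as the filler character. The first pass records each compound operator and overwrites it with filler of the same length, so indexes stay valid and the second pass cannot see its characters again.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiPassTokenizer
{
    public static List<string> Tokenize(string text)
    {
        var found = new List<string>();

        // Pass 1: record compound operators and blank them out with NULs.
        string masked = Regex.Replace(text, @"<=|>=|==|&&", m =>
        {
            found.Add(m.Value);
            return new string('\0', m.Length);
        });

        // Pass 2: only simple operators are left to find.
        foreach (Match m in Regex.Matches(masked, @"[<>=&]"))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

The result is grouped by pass rather than by position in the text; if source order matters, record m.Index alongside each operator and sort afterwards.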

How Can I Check If a C# Regular Expression Is Trying to Match 1-(and-only-1)-Character Strings?

Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means I only allow the users to search for 1-character strings. If a user tries to search for multi-character strings, an error message will be displayed.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched strings' length is okay, though it's going to be kind of slow.
Yet, the original question is very rare and tedious.
A regexp would be .{1}
This will allow any char, though. If you only want alphanumeric characters, you can use [a-z0-9]{1} or the shorthand \w{1} (note that \w also matches uppercase letters and the underscore).
Another option is to limit the number of chars a user can type in an input field: set a maxlength on it.
Yet another option is to save the form's input field to a char and not a string, although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char?
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
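A sketch of applying the user's pattern to one character at a time, so that a multi-character pattern simply never matches. The class and method names are made up; the pattern is wrapped in \A(?:...)\z so that it has to cover the whole one-character string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleCharMatcher
{
    // Returns the positions in text whose single character matches the pattern.
    public static List<int> MatchingPositions(string pattern, string text)
    {
        var whole = new Regex(@"\A(?:" + pattern + @")\z");
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            // Each test string is exactly one char long.
            if (whole.IsMatch(text[i].ToString()))
            {
                positions.Add(i);
            }
        }
        return positions;
    }
}
```

This works per UTF-16 char, so characters outside the Basic Multilingual Plane (surrogate pairs) would need extra handling.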
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
E.g.:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z\w matches [AB]|[0-9]|[x-z]|\w
Which cases do you need to support?
This would be somewhat easy to parse and validate.

How to Check if a String is a "string" or a RegEx?

How can I check if a String in a textbox is a plain String or a RegEx?
I'm searching through a text file line by line.
Either by .Contains(Textbox.Text); or by Regex(Textbox.Text) Match(currentLine)
(I know, syntax isn't working like this, it's just for presentation)
Now my Program is supposed to autodetect if Textbox.Text is in form of a RegEx or if it is a normal String.
Any suggestions? Write my own little RegEx to detect if Textbox contains a RegEx?
Edit:
I failed to add that my Strings can be very simple, like Foo or 0005.
I'm trying the suggested solutions right away!
You can't detect regular expressions with a regular expression, as regular expressions themselves are not a regular language.
However, the easiest thing you could do is probably to try to compile a regex from your textbox contents: if it succeeds, you know that it's a valid regex; if it fails, you know it's not.
But this would classify ordinary strings like "foo" as a regular expression too. Depending on what you need to do, this may or may not be a problem. If it's a search string, then the results are identical for this case. In the case of "foo.bar" they would differ, though, since it's a valid regex but matches different things than the string itself.
My advice, also stated in another comment, would be that you simply always enable regex search, since splitting the code paths here gains you nothing aside from a dubious performance benefit (which is unlikely to make any difference, if there is much of a benefit at all).
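The "try to compile it" check can be sketched as follows. The names are made up; the Regex constructor throws an ArgumentException when the pattern cannot be parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexProbe
{
    // True if the text parses as a .NET regular expression.
    // Says nothing about whether it was *meant* to be one.
    public static bool IsValidRegex(string pattern)
    {
        if (pattern == null)
        {
            return false;
        }
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

As noted above, plain strings such as "foo" and "foo.bar" also pass this test, so it is only useful for rejecting malformed patterns like "foo(".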
Many strings could be a regex, and every regex could actually be a string.
Consider "thin.": it could be either a string ('.' is a dot) or a regex ('.' is any character).
I would just add a checkbox where the user indicates if he enters a regex, as usual in many applications.
One possible solution depending on your definition of string and regex would be to check if the string contains any regex typical characters.
You could do something like this:
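using System.Text.RegularExpressions;

string s = "NotARegex";
// Regex.Escape also escapes white space and '#', so a plain string
// containing either would wrongly be classified as a regex here.
if (s == Regex.Escape(s))
{
    // no regex metacharacters: treat it as a plain string
}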
Try and use it in a regex and see if an exception is thrown.
This approach only checks if it is a valid regex, not whether it was intended to be one.
Another approach could be to check if it is surrounded by slashes (i.e. "/foo/"). Surrounding regexes with slashes is common practice (although you must remove the slashes before feeding it into the regex library).

Difficulty with Simple Regex (match prefix/suffix)

I'm trying to develop a regex that will be used in a C# program.
My initial regex was:
(?<=\()\w+(?=\))
Which successfully matches "(foo)" - matching but excluding from output the open and close parens, to produce simply "foo".
However, if I modify the regex to:
\[(?<=\()\w+(?=\))\]
and I try to match against "[(foo)]" it fails to match. This is surprising. I'm simply prepending and appending the literal open and close brackets around my previous expression. I'm stumped. I use Expresso to develop and test my expressions.
Thanks in advance for your kind help.
Rob Cecil
Your look-behind is the problem. Here's how the string is being processed:
We see [ in the string, and it matches the \[ in the regex.
The look-behind then asks whether the character just before the current position is a '('. This fails, because that character is the '[' we just matched.
At least that's what I would guess is causing the problem.
Try this regex instead:
(?<=\[\()\w+(?=\)\])
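The three expressions can be compared with a small C# sketch; the class, constant and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    public const string Original = @"(?<=\()\w+(?=\))";
    public const string Broken = @"\[(?<=\()\w+(?=\))\]";
    public const string Fixed = @"(?<=\[\()\w+(?=\)\])";

    // Returns the matched text, or null when the pattern does not match.
    public static string Extract(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }
}
```

Extract(Original, "(foo)") and Extract(Fixed, "[(foo)]") both return "foo", while Extract(Broken, "[(foo)]") returns null.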
Out of context, it is hard to judge, but the look-behind here is probably overkill. They are useful to exclude strings (as in strager's example) and in some other special circumstances where simple REs fail, but I often see them used where simpler expressions are easier to write, work in more RE flavors and are probably faster.
In your case, you could probably write (\b\w+\b) for example, or even (\w+) using natural bounds, or if you want to distinguish (foo) from -foo- (for example), using \((\w+)\).
Now, perhaps the context dictates this convoluted use (or perhaps you were just experimenting with look-behind), but it is good to know alternatives.
Now, if you are just curious why the second expression doesn't work: these are known as "zero-width assertions". They check that what follows or precedes the current position conforms to what is expected, but they don't consume any of the string, so whatever comes after a look-ahead (or before a look-behind) in the pattern must match those same characters. E.g. if you put something after a positive look-ahead that doesn't match what was asserted, the RE is sure to fail.
