I have a text (string) that I want in all upper case, except the following:
Words starting with : (colon)
Words or strings surrounded by double quotation marks, ""
Words or strings surrounded by single quotation marks, ''
Everything else should be replaced with its upper case counterpart, and formatting (whitespaces, line breaks, etc.) should remain.
How would I go about doing this using Regex (C# style/syntax)?
I think you are looking for something like this:
text = Regex.Replace(text, @":\w+|""[^""]*""|'[^']*'|(.)",
match => match.Groups[1].Success ?
match.Groups[1].Value.ToUpper() : match.Value);
:\w+ - match words with a colon.
"[^"]*"|'[^']*' - match quoted text. For escaped quotes, you may use:
"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'
(.) - capture anything else (you can also try ([^"':]*|.), it might be faster).
Next, we use a callback for Regex.Replace to do two things:
Determine if we need to keep the text as-is, or
Return the upper-case version of the text.
Working example: http://ideone.com/ORFU8
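For anyone who wants to try the idea outside .NET, here is a minimal sketch of the same callback approach in Python (my choice of language, not the asker's). The pattern is the one from the answer; the function name upper_except_protected is made up for illustration.

```python
import re

# Same alternation as the C# answer: protected tokens first, then a
# one-character capture group for everything else.
PATTERN = re.compile(r""":\w+|"[^"]*"|'[^']*'|(.)""")


def upper_except_protected(text: str) -> str:
    """Upper-case everything except :words and quoted strings."""
    def repl(match: re.Match) -> str:
        # Group 1 only participates when none of the protected
        # alternatives matched, so that is the text to upper-case.
        if match.group(1) is not None:
            return match.group(1).upper()
        return match.group(0)

    return PATTERN.sub(repl, text)


if __name__ == "__main__":
    print(upper_except_protected("select :id, \"Name\" from 'tbl'\nwhere x"))
```

Line breaks are never matched by (.) here, so they pass through untouched, which keeps the formatting intact.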
You can start with this RegEx:
\b(?<![:"'])(\w+?)(?!["'])\b
But of course you will have to improve it yourself if it is not enough.
For example, it will also skip "dfgdfg' (mismatched quotation marks).
The word that is found is in the first capture group ($1).
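To see what this starting pattern does and does not handle, here is a small Python sketch (Python is my assumption, the regex is unchanged; upper_words is a hypothetical helper name). It upper-cases every word the pattern finds.

```python
import re

# The pattern from the answer: a whole word that is not directly preceded
# by a colon or quote and not directly followed by a quote.
WORD = re.compile(r"""\b(?<![:"'])(\w+?)(?!["'])\b""")


def upper_words(text: str) -> str:
    """Upper-case each word matched by WORD, leave the rest alone."""
    return WORD.sub(lambda m: m.group(1).upper(), text)


if __name__ == "__main__":
    print(upper_words('say :name "a b c" now'))
```

Only words that touch a quote are protected, so a word in the middle of a longer quoted string still gets upper-cased. That is one of the things you would need to improve.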
I'm trying to match a string that must not contain two special characters in a row.
My regular expression is:
[RegularExpression(@"^+[a-zA-Z0-9]+[a-zA-Z0-9.&' '-]+[a-zA-Z0-9]$")]
This meets all my requirements except for the two issues below.
This is my string: bracks
Acceptable:
bra-cks, b-r-a-c-ks, b.r.a.c.ks, bra cks (the regular expression above already handles these)
Not acceptable:
Issue 1: b.. or bra..cks, b..racks, bra...cks (two or more special characters together)
Issue 2: bra  cks (two or more whitespace characters together)
You can use a negative lookahead to invalidate strings containing two consecutive special characters:
^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$
Demo: https://regex101.com/r/7j14bu/1
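A quick way to check the lookahead version against the examples from the question, sketched in Python (my assumption; the pattern itself is unchanged and is_valid is a made-up name):

```python
import re

# Reject any string containing two adjacent characters from [.&' -],
# then allow only letters, digits and those special characters.
NO_DOUBLE_SPECIAL = re.compile(r"^(?!.*[.&' -]{2})[a-zA-Z0-9.&' -]+$")


def is_valid(name: str) -> bool:
    """True when the name has no two special characters in a row."""
    return NO_DOUBLE_SPECIAL.match(name) is not None


if __name__ == "__main__":
    for sample in ["bra-cks", "b.r.a.c.ks", "bra cks", "bra..cks", "bra  cks", "-bracks"]:
        print(repr(sample), is_valid(sample))
```

Note that this pattern does not require the string to start or end with an alphanumeric character, so "-bracks" is accepted.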
The goal
From what I can tell from your description and pattern, you are trying to match text that starts and ends with an alphanumeric character (due to ^+[a-zA-Z0-9] and [a-zA-Z0-9]$ in your original pattern), and inside it you don't want any two consecutive (adjacent) special characters, which, again guessing from the regex, are . & ' - and space.
What was wrong
^+ - I think you wanted to make sure the match starts at the beginning of the line/string, so you don't need + here
[a-zA-Z0-9.&' '-] - in this character class you doubled ', which is unnecessary
Solution
Please try this pattern:
^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$
Pattern explanation
^ - anchor, match the beginning of the string
[a-zA-Z0-9] - character class, match one of the characters inside []
(?:...) - non capturing group
(?!...) - negative lookahead
[.& '-]{2,} - match 2 or more of characters inside character class
[a-zA-Z0-9.& '-] - character class, match one of the characters inside []
* - match zero or more repetitions of the preceding pattern
$ - anchor, match the end of the string
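The pattern explained above can be exercised with a few of the asker's samples. This is only a sketch in Python (my assumption; the regex is copied as-is and is_valid_strict is a hypothetical name):

```python
import re

# Alphanumeric first and last character; in between, each character is
# allowed only if it does not start a run of two or more specials.
STRICT = re.compile(
    r"^[a-zA-Z0-9](?:(?![.& '-]{2,})[a-zA-Z0-9.& '-])*[a-zA-Z0-9]$"
)


def is_valid_strict(name: str) -> bool:
    """True when the name passes the anchored, lookahead-guarded pattern."""
    return STRICT.match(name) is not None


if __name__ == "__main__":
    for sample in ["bracks", "bra-cks", "bra..cks", "bra  cks", "bracks-", "b"]:
        print(repr(sample), is_valid_strict(sample))
```

One side effect worth knowing: because the first and last characters are matched separately, a single-character string such as "b" is rejected.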
Some remarks on your current regex:
It looks like you placed the + quantifiers before the pattern you wanted to quantify, instead of after. For instance, ^+ doesn't make much sense, since ^ is just the start of the input, and most regex engines would not even allow that.
The pattern [a-zA-Z0-9.&' '-]+ doesn't distinguish between alphanumerical and other characters, while you want the rules for them to be different. Especially for the other characters you don't want them to repeat, so that + is not desired for those.
In a character class it doesn't make sense to repeat the same character, like you have a repeat of a quote ('). Maybe you wanted to somehow delimit the space, but realise that those quotes are interpreted literally. So probably you should just remove them. Or if you intended to allow for a quote, only list it once.
Here is a correction (add the quote if you still need it):
^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$
Follow-up
Based on a comment, I suspect you would allow a non-alphanumerical character to be surrounded by single spaces, even if that gives a sequence of more than one non-alphanumerical character. In that case use this:
^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$
So here the space gets a different role: it can optionally occur before and after a delimiter (one of ".&-"), or it can occur on its own. The brackets around the spaces are not needed, but I used them to stress that the space is intended and not a typo.
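To compare the correction with the follow-up variant, here is a small Python sketch (Python is my assumption; both regexes are copied from above, the constant and function names are made up):

```python
import re

# Correction: runs of alphanumerics joined by exactly one delimiter.
SIMPLE = re.compile(r"^[a-zA-Z0-9]+(?:[.& -][a-zA-Z0-9]+)*$")

# Follow-up: a delimiter may additionally be padded with single spaces.
SPACED = re.compile(r"^[a-zA-Z0-9]+(?:(?:[ ]|[ ]?[.&-][ ]?)[a-zA-Z0-9]+)*$")


def matches(pattern: re.Pattern, text: str) -> bool:
    """True when the whole text matches the given pattern."""
    return pattern.match(text) is not None


if __name__ == "__main__":
    for sample in ["b-r-a-c-ks", "bra cks", "bra - cks", "bra..cks", "bra  cks"]:
        print(repr(sample), matches(SIMPLE, sample), matches(SPACED, sample))
```

The only sample where the two differ is "bra - cks": the first pattern rejects it, the second accepts it.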
So I have the following regex for finding places where whitespace needs to be added:
(\S)(\()
For a string like "SomeText(Somemoretext)" I want to produce "SomeText (Somemoretext)". The pattern matches "t(", so my replacement removes the "t" from the string, which is not good. I also do not know what the preceding character could be; I'm merely trying to detect the absence of whitespace.
Is there a better expression to use, or is there a way to exclude the found character from the match so that I can safely replace without catching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
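The same substitution can be sketched in Python, where the group reference is written \1 instead of $1 (Python is my assumption here, and both function names are hypothetical). I have added the lookbehind form next to it so the two can be compared.

```python
import re


def space_before_paren(text: str) -> str:
    """Capture the non-space character and put it back via \\1."""
    return re.sub(r"(\S)\(", r"\1 (", text)


def space_before_paren_lookbehind(text: str) -> str:
    """Alternative: assert the non-space character without consuming it."""
    return re.sub(r"(?<=\S)\(", " (", text)


if __name__ == "__main__":
    print(space_before_paren("test1() test2()"))
    print(space_before_paren_lookbehind("SomeText(Somemoretext)"))
```

Both give the same result on these inputs; which one reads better is mostly a matter of taste.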
If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words ([a-zA-Z0-9_]) that are followed by a bracket without a space.
(If I got your problem correctly) you can replace the full match with " (" and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).
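A rough Python sketch of how the two patterns in this answer behave (Python is my assumption; the regexes are copied unchanged and the helper names are made up):

```python
import re


def add_space_after_word(text: str) -> str:
    """First pattern: only fires when a word character precedes '('."""
    return re.sub(r"\b(?=[^\s])\(", " (", text)


def paren_span_in_unspaced_line(line: str):
    """Second pattern: the whole line must be free of whitespace.

    Returns the (start, end) span of group 1, or None when the line
    does not match.
    """
    m = re.match(r"^(?:\S*)(\()(?:\S*)$", line)
    return m.span(1) if m else None


if __name__ == "__main__":
    print(add_space_after_word("SomeText(Somemoretext)"))
    print(add_space_after_word("wow!(x)"))
    print(paren_span_in_unspaced_line("SomeText(Somemoretext)"))
    print(paren_span_in_unspaced_line("Some Text(x)"))
```

The first pattern leaves "wow!(x)" alone because there is no word boundary between "!" and "(". The second matches nothing at all once the line contains a space.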
I need to keep only the first punctuation mark in each run of punctuation marks between words. For example, if the string is:
string str = "hello,.,.,.world.,.?,.";
the result I want is:
hello, world.
It would be good to know both how to clean up such a string after it has been entered and how to prevent more than one punctuation mark and one whitespace character between words from being typed directly into the textbox.
You can try this pattern: (?<=[,.])[,.?]+
See it working here: https://regex101.com/r/di5Ebw/1
If you have a list of special punctuation marks that you want to strip, adjust the character class [,.] accordingly.
(In the example, the match covers the characters you want to remove: just replace the match with an empty string, as you can see in the SUBSTITUTION panel at the bottom.)
[EDIT] Extend the match cases.
If you don't want to list the characters yourself, let this do it for you: (?<=\W)(?<! )\W+
See it working here: https://regex101.com/r/di5Ebw/2
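Here is a minimal Python sketch of both patterns applied to the string from the question (Python is my assumption; the patterns are unchanged, the function names are hypothetical):

```python
import re


def strip_listed(text: str) -> str:
    """Remove , . ? characters that directly follow a , or ."""
    return re.sub(r"(?<=[,.])[,.?]+", "", text)


def strip_any_nonword(text: str) -> str:
    """Remove non-word runs that follow a non-word, non-space character."""
    return re.sub(r"(?<=\W)(?<! )\W+", "", text)


if __name__ == "__main__":
    sample = "hello,.,.,.world.,.?,."
    print(strip_listed(sample))
    print(strip_any_nonword(sample))
    print(strip_any_nonword("hello, world"))
```

Two things to keep in mind: neither pattern inserts the space shown in the desired output "hello, world.", and the \W version also swallows a space that follows a punctuation mark, since a space is a non-word character too.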
.NET regular expressions have a punctuation class, so a simple way of achieving the required result is to search for (\w\p{P})\p{P}+ and replace with $1.
For a regular expression that handles exactly the few punctuation characters used in the question the regular expression (\w[.,?])[.,?]+ can be used.
(Note, the above shows the regular expressions. Their C# strings are "(\\w\\p{P})\\p{P}+" and "(\\w[.,?])[.,?]+".)
Explanation. This looks for a word character (\w) followed by one punctuation character and captures these two characters. Any immediately following punctuation characters are matched by the \p{P}+. The whole match is replaced by the capture.
The \p{name} construct is defined in the .NET documentation as "Matches any single character in the Unicode general category or named block specified by name."
The \p{P} category is defined there as "All punctuation characters". There are also several subcategories of punctuation, but it may be best to look at Unicode to understand them.
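As a sketch of the capture-and-replace idea, here is the explicit-character-class version in Python (my assumption; Python's standard re module has no \p{P}, so only the [.,?] variant is shown, and $1 becomes \1). The second function, which appends a space after the kept mark, is my own addition to get closer to the asker's "hello, world." and is not part of the answer.

```python
import re

# Word character plus one punctuation mark is captured; the extra
# punctuation that follows is matched and dropped by the replacement.
RUN = re.compile(r"(\w[.,?])[.,?]+")


def keep_first_mark(text: str) -> str:
    """Replace each punctuation run with its first mark only."""
    return RUN.sub(r"\1", text)


def keep_first_mark_spaced(text: str) -> str:
    """Hypothetical variant: also put one space after the kept mark."""
    return RUN.sub(r"\1 ", text).strip()


if __name__ == "__main__":
    sample = "hello,.,.,.world.,.?,."
    print(keep_first_mark(sample))
    print(keep_first_mark_spaced(sample))
```

The spaced variant only adds a space where a run was collapsed; a lone comma between two words is left exactly as typed.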
When user pastes something like this (from notepad for example):
multi
line@email.com
into an input text box, the line break disappears and it looks like this:
multi line@email.com
But whatever the line break is converted to does not match this regex:
'\s|\t|\r|\n|\0','i'
so this invalid character passes through js validation to the .NET application code I am working on.
It is interesting but this text editor does the same transformation, that is why I had to post original sample as code. I would like to find out what the line break got converted to, so I can add a literal to the regex but I don't know how. Many thanks!
Here is the whole snippet:
var invalidChars = new RegExp('(^[.])|[<]|[>]|[(]|[)]|[\]|[,]|[;]|[:]|([.])[.]|\s|\t|\r|\n|\0', 'i');
if (text.match(invalidChars)) {
return false;
}
Your immediate problem is escaping. You're using a string literal to create the regex, like this:
'(^[.])|[<]|[>]|[(]|[)]|[\]|[,]|[;]|[:]|([.])[.]|\s|\t|\r|\n|\0'
But before it ever reaches the RegExp constructor, the [\] becomes []; \s becomes s; \0 becomes 0; and \t, \r and \n are converted to the characters they represent (tab, carriage return and linefeed, respectively). That won't happen if you use a regex literal instead, but you still have to escape the backslash to match a literal backslash.
Your regex also has way more brackets than it needs. I think this is what you were trying for:
/^\.|\.\.|[<>()\\,;:\s]/
That matches a dot at the beginning, two consecutive dots, or one of several forbidden characters including any whitespace character (\s matches any whitespace character, not just a space).
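If you want to sanity-check the simplified pattern away from the browser, here is a small sketch in Python (my assumption; the question is about JavaScript). A Python raw string plays the same role as the JavaScript regex literal: the backslashes reach the regex engine untouched. The function name has_invalid_chars is made up.

```python
import re

# Leading dot, two consecutive dots, or any single forbidden character
# (including every kind of whitespace via \s).
INVALID = re.compile(r"^\.|\.\.|[<>()\\,;:\s]")


def has_invalid_chars(text: str) -> bool:
    """True when the text contains something the validator should reject."""
    return INVALID.search(text) is not None


if __name__ == "__main__":
    print(has_invalid_chars("multi\r\nline@email.com"))
    print(has_invalid_chars("user@example.com"))
```

The pasted line break (carriage return plus line feed) is caught by \s, so no separate \r or \n alternatives are needed.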
Ok - here it is
vbCrLF
This is what pasted line breaks are converted to. I added a (vbCrLF) group and those spaces are now detected. Thanks, Dan1M
http://forums.asp.net/t/1183613.aspx?Multiline+Textbox+Input+not+showing+line+breaks+in+Repeater+Control
I've searched for hours and already tried tons of different patterns. There's a simple thing I want to achieve with regex, but somehow it just won't do what I want:
Possible Strings
String1
This is some text \0"§%lfsdrlsrblabla\0\0\0}dfglpdfgl
String2
This is some text
String3
This is some text \0
Desired Match/Result
This is some text
I simply want to match everything - until and except the \0 - resulting in only 1 Match. (everything before the \0)
It is important in my case that it matches every time, even when the \0 is not present.
Thanks for your help!
You can try with this pattern:
@"^(?:[^\\]+|\\(?!0))+"
In other words: all characters except backslashes or backslashes not followed by 0
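Here is a minimal Python sketch of this pattern against the three sample strings (Python is my assumption; the regex is unchanged and text_before_marker is a hypothetical name). I am treating \0 as the two literal characters backslash and zero, which is what the pattern looks for.

```python
import re

# Runs of non-backslash characters, or a backslash not followed by 0.
BEFORE_MARKER = re.compile(r"^(?:[^\\]+|\\(?!0))+")


def text_before_marker(text: str):
    """Return the text before the first literal \\0, or None if no match."""
    m = BEFORE_MARKER.match(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(repr(text_before_marker(r'This is some text \0"§%lfsdrlsrblabla\0\0\0}dfglpdfgl')))
    print(repr(text_before_marker("This is some text")))
    print(repr(text_before_marker(r"This is some text \0")))
```

The match keeps the space in front of \0, so trim the result if you need exactly "This is some text". Because of the trailing +, a string that starts with \0 does not match at all.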
I like
@"^((?!\\0).)*"
because it's very easy to adapt to any arbitrary string. The basic trick is the negative lookahead, which asserts that the string starting at this point doesn't match the regular expression inside. We follow this with a wildcard, so the group means "any character that is not the start of my delimiter". If your delimiter should change, this is an easy update, just:
@"^((?!--STRING--).)*"
as long as you properly escape that string. Heck, with this pattern, you're merely a regex-escape function (Regex.Escape in .NET) away from supporting any delimiter string.
Bonus: using * instead of + will return a blank string as a valid match when your string starts with your delimiter.
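The "any delimiter" idea can be sketched like this in Python, where re.escape plays the role of the regex-escape function (Python is my assumption; before_delimiter is a made-up helper name):

```python
import re


def before_delimiter(text: str, delimiter: str) -> str:
    """Return everything before the first occurrence of delimiter.

    Builds ^((?!<escaped delimiter>).)* so each consumed character is
    checked not to be the start of the delimiter.
    """
    pattern = "^((?!" + re.escape(delimiter) + ").)*"
    # The * quantifier allows an empty match, so match() never returns None.
    return re.match(pattern, text).group(0)


if __name__ == "__main__":
    print(repr(before_delimiter(r"This is some text \0abc", "\\0")))
    print(repr(before_delimiter("abc--STRING--def", "--STRING--")))
    print(repr(before_delimiter(r"\0abc", "\\0")))
```

As written, the dot does not match a line break, so the match also stops at the first newline. Add the single-line/dot-all option if the text can span several lines.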