I am looking for a way to get words out of a sentence. I am pretty far with the following expression:
\b([a-zA-Z]+?)\b
but in some cases it counts a word when I don't want it to, e.g. a word followed by more than one period, like "text..". So, in my regex I want the period to appear at the end of a word zero or one times. Inserting \.? did not do the trick, and variations on this have not yielded anything fruitful either.
Hope someone can help!
A single dot means any character. You must escape it as
\.?
Maybe you want an expression like this:
\w+\.?
or
\p{L}+\.?
You need to add \.? (and not .?) because the period has special meaning in regexes.
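The difference between an unescaped and an escaped dot is easy to check. The snippet below is a minimal sketch in Python; the thread does not name a language, so the use of Python's re module and the sample strings are assumptions made for illustration.

```python
import re

# An unescaped dot matches any single character except a newline.
print(bool(re.fullmatch(r"a.c", "abc")))    # True
print(bool(re.fullmatch(r"a.c", "a.c")))    # True

# An escaped dot matches only a literal period.
print(bool(re.fullmatch(r"a\.c", "abc")))   # False
print(bool(re.fullmatch(r"a\.c", "a.c")))   # True

# \.? makes that literal period optional (zero or one occurrence).
print(re.findall(r"[a-zA-Z]+\.?", "one. two three."))  # ['one.', 'two', 'three.']
```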
To avoid a match on your example "text..", it is not enough to add \.? to check whether the first character after the word is a dot; you also need to look one character further and check the second character after the word.
I ended up with something like this:
\w{2,}\.?[^.]
You should also consider that a sentence does not always end with a period; it may also end with ! or ? and the like.
I usually use rubulator.com to quickly test a regexp.
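One way to "look one character further" without consuming that character is a negative lookahead. The following is a hedged sketch in Python (the language and the helper name words are assumptions; the lookahead is an alternative to the [^.] approach above, not the answerer's own pattern). It keeps a word unless the word is immediately followed by two periods.

```python
import re

# A run of letters bounded by word boundaries, rejected when the
# two characters right after it are both periods.
WORD = re.compile(r"\b[A-Za-z]+\b(?!\.{2})")

def words(sentence):
    """Return the words of `sentence`, skipping any word followed by '..'."""
    return WORD.findall(sentence)

print(words("This is text.. and more text."))
# ['This', 'is', 'and', 'more', 'text']
```

Because the lookahead is zero-width, the period itself is never part of the match, so a single trailing period does not need to be stripped afterwards.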
I know the regex for excluding words, roughly anyway. It would be (!?wordToIgnore|wordToIgnore2|wordToIgnore3)
But I have an existing, complicated regex that I need to add this to, and I am a bit confused about how to go about that. I'm still pretty new to regex, and it took me a very long time to make this particular one, but I'm not sure where to insert it or how ...
The regex I have is ...
^(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:/\p{L}'-]{1,64}$)$
This should only allow the person typing to insert between 1 and 64 letters that match that pattern, cannot start with a space, quote, double quote, special character, a dash, an escape character, etc, and only allows a-z both upper and lowercase, can include a space, ":", a dash, and a quote anywhere but the beginning.
But I want to forbid them from using certain words, so I have this list of words that I want to be forbidden, I just cannot figure out how to get that to fit into here.. I tried just pasting the whole .. "block" in, and that didn't work.
?!the|and|or|a|given|some|that|this|then|than
Has anyone encountered this before?
ciel, first off, congratulations for getting this far trying to build your regex rule. If you want to read something detailed about all kinds of exclusions, I suggest you have a look at Match (or replace) a pattern except in situations s1, s2, s3 etc
Next, in your particular situation, here is how we could approach your regex.
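For concision, let's make the negative lookaheads more compact, replacing the three of them with a single (?!.*(?: |-|'){2}). Note that this is slightly stricter than the original: it also rejects mixed pairs such as a space followed by a dash. To reject only a doubled character, use (?!.*([ '-])\1) instead.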
In your character class, the \: just escapes the colon, needlessly so as : is enough. I assume you wanted to add a backslash character, and if so we need to use \\
\p{L} includes a-zA-Z, so you can drop a-zA-Z from the character class. But are you sure you want to match letters in every script (Thai, etc.)? If so, remember to set the u flag after the regex string.
For your "bad word exclusion" applying to the whole string, place it at the same position as the other lookarounds, i.e., at the head of the string, but using the .* as in your other exclusions: (?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3)) It does not matter which lookahead comes first because lookarounds do not change your position in the string. For more on this, see Mastering Lookahead and Lookbehind
This gives us this glorious regex (I added the case-insensitive flag):
^(?i)(?!.*(?:wordToIgnore|wordToIgnore2|wordToIgnore3))(?!.*(?: |-|'){2})(?:[\\0-9 :/\p{L}'-]{1,64}$)$
Of course if you don't want unicode letters, replace \p{L} with a-z
Also, if you want to make sure that the wordToIgnore is a real word, as opposed to an embedded string (for instance you don't want cat but you are okay with catalog), add boundaries to the lookahead rule: (?!.*\b(?:wordToIgnore|wordToIgnore2|wordToIgnore3)\b)
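As a rough illustration of where the word-exclusion lookahead sits, here is a simplified sketch in Python. Several things in it are assumptions: the thread's regex flavor is not Python, Python's re module has no \p{L}, so a-zA-Z stands in for it, the backslash is left out of the character class, and the names FORBIDDEN, PATTERN and is_valid are made up for the example. The word list comes from the question.

```python
import re

# Forbidden words from the question; matched as whole words only.
FORBIDDEN = ["the", "and", "or", "a", "given", "some", "that", "this", "then", "than"]

PATTERN = re.compile(
    r"^(?!.*\b(?:" + "|".join(FORBIDDEN) + r")\b)"  # no forbidden whole word anywhere
    r"(?!.*(?: |-|'){2})"                            # no two separators in a row
    r"[a-zA-Z0-9 :/'-]{1,64}$",                      # ASCII stand-in for \p{L}
    re.IGNORECASE,
)

def is_valid(text):
    """Return True if `text` passes both lookaheads and the character class."""
    return PATTERN.match(text) is not None

print(is_valid("thermal catalog"))  # True: "the" and "a" occur only inside words
print(is_valid("The catalog"))      # False: "The" is a forbidden whole word
print(is_valid("blue  sky"))        # False: two spaces in a row
```

Both lookaheads are anchored at the start and neither consumes input, so their order can be swapped without changing the result.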
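Use this (the \b word boundaries make sure that short entries such as "a" are rejected only as whole words, not as part of every word that contains them):
^(?!.*\b(the|and|or|a|given|some|that|this|then|than)\b)(?!.*[ ]{2})(?!.*[']{2})(?!.*[-]{2})(?:[a-zA-Z0-9 \:\p{L}'-]{1,64}$)$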
see demo
I am using the following regex to locate in a document any series of characters that begins with two dashes (--) and ends with a line feed character (\n).
return @"(^--).*?(?=\r|\n)";
Almost works but only when there is a space between the -- and the next character.
return @"(?:--\s).*?(?=\r|\n)";
Almost works but only when there is no space between the -- and the next character.
How do I get my return whether a space is following the -- or not?
I know nothing of regex other than what it's capable of. I found both of these sample patterns online. Thanks for your assistance.
You need to use \s? to capture either 0 or 1 spaces.
One use of the question mark in regex is to make the preceding character (or group of characters) optional: it matches zero or one occurrence, but not more than one.
Also, if you ever have the desire to learn regex for yourself, visit http://www.regular-expressions.info to learn and http://www.regexpal.com to practice.
Assuming that you are searching for substrings in a larger string and want to capture the substring between -- and \n, you could use an expression like:
--(.*)\r?\n
Which can be quoted in C# like this:
@"--(.*)\r?\n"
If you just want to make sure that a string starts with -- and ends with \n you could use:
(?s)^--.*\n\z
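The two suggestions above (an optional space after the dashes, and a capture group that stops at the line break) can be combined. The sketch below uses Python rather than the C# of the question, purely for illustration; the sample text and the names COMMENT and comments are assumptions.

```python
import re

# "--", an optional single space or tab, then lazily capture everything
# up to the line break. The optional \r keeps Windows line endings
# out of the captured text.
COMMENT = re.compile(r"--[ \t]?(.*?)\r?\n")

def comments(text):
    """Return the text that follows each '--' up to the end of its line."""
    return COMMENT.findall(text)

sample = "SELECT 1; -- first comment\r\n--second comment\nno comment here\n"
print(comments(sample))  # ['first comment', 'second comment']
```

A final line that has no trailing line break is not matched by this pattern; replacing \r?\n with (?=\r?\n|$) would be one way to cover that case.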
I am trying to use Regex to find out if a string matches *abc - in other words, it starts with anything but finishes with "abc"?
What is the regex expression for this?
I tried *abc but "Regex.Matches" returns true for xxabcd, which is not what I want.
abc$
You need the $ to match the end of the string.
.*abc$
should do.
So you have a few "fish" here, but here's how to fish.
An online expression library and .NET-based tester: RegEx Library
An online Ruby-based tester (faster than the .NET one) Rubular
A Windows app for testing expressions (most fully-featured, but no zero-width look-aheads or look-behinds) RegEx Coach
Try this instead:
.*abc$
The $ matches the end of the line.
^.*abc$
Will capture any line ending in abc.
It depends on what exactly you're looking for. If you're trying to match whole lines, like:
a line with words and spacesabc
you could do:
^.*abc$
Where ^ matches the beginning of a line and $ the end.
But if you're matching words in a line, e.g.
trying to match thisabc and thisabc but not thisabcd
You will have to do something like:
\w*abc(?!\w)
This means: match any number of consecutive word characters, followed by abc, as long as the next character is not a word character (e.g. it is whitespace, punctuation, or the end of the line).
If you want a string of exactly 4 characters ending in abc, use /^.abc$/
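The answers above can be compared side by side. This is a small sketch in Python (the question itself uses .NET's Regex class, so the language is an assumption made for illustration); the test strings come from the thread.

```python
import re

# Whole-string check: the $ anchor forces "abc" to sit at the end.
print(bool(re.search(r"abc$", "xxabc")))      # True
print(bool(re.search(r"abc$", "xxabcd")))     # False

# Without the anchor, "abc" may appear anywhere, which is the
# behaviour the question complains about.
print(bool(re.search(r"abc", "xxabcd")))      # True

# Word-level check: words ending in "abc" inside a longer line.
line = "trying to match thisabc and thisabc but not thisabcd"
print(re.findall(r"\w*abc(?!\w)", line))      # ['thisabc', 'thisabc']

# Exactly four characters ending in "abc".
print(bool(re.search(r"^.abc$", "xabc")))     # True
print(bool(re.search(r"^.abc$", "xxabc")))    # False
```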
I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to capture the (.*?) part in the middle and didn't want the spaces at the beginning or the end to get in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string.
?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
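To see what the (?:...) groups change, the sketch below compares the original pattern with a variant that uses plain capturing parentheses. It is written in Python rather than the C# of the question, for illustration only; the names NON_CAPTURING, CAPTURING and domain are assumptions.

```python
import re

# The pattern from the question: only the middle group captures.
NON_CAPTURING = re.compile(r"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]")

# The same pattern with ordinary parentheses: three capture groups.
CAPTURING = re.compile(r"\[url(\s*)\]www\.(.*?)\[/url(\s*)\]")

def domain(text):
    """Return whatever follows 'www.' inside the first [url] tag, or None."""
    m = NON_CAPTURING.search(text)
    return m.group(1) if m else None

print(NON_CAPTURING.groups, CAPTURING.groups)            # 1 3
print(domain("[url ]www.foo.com[/url ]"))                # foo.com
print(CAPTURING.search("[url ]www.foo.com[/url]").groups())
# (' ', 'foo.com', '')
```

With the capturing variant the domain moves to group 2, which is the kind of renumbering that the non-capturing form avoids.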
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
I need to find all matches of word which strictly begins with "$" and contains only digits. So I wrote
[$]\d+
which gave me 4 matches for
$10 $10 $20a a$20
so I thought of using word boundaries using \b:
[$]\d+\b
But it again matched
a$20 for me.
I tried
\b[$]\d+\b
but I failed.
I'm looking for saying, ACCEPT ONLY IF THE WORD STARTS WITH $ and is followed by DIGITS. How do I tell IT STARTS WITH $, because I think \b is making it assume word boundaries which means surrounded inside alphanumeric characters.
What is the solution?
Not the best solution but this should work. (It does with your test case)
(?<=\s+|^)\$\d+\b
Have you tried
\B\$\d+\b
You were close, you just need to escape the $:
\B\$\d+\b
See the example matches here: http://regexhero.net/tester/?id=79d0ac3b-dd2c-4872-abb4-6a9780c91fc1
Try ^\$\d+
where ^ denotes the beginning of the string. Note that this only matches an amount at the very start of the input.
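The suggestions in this thread can be tried against the sample from the question. The sketch below is in Python, which is an assumption (the thread's tester link points to a .NET tool). Python's re module does not accept the variable-width lookbehind (?<=\s+|^), so the second pattern swaps in (?<!\S), meaning "not preceded by a non-whitespace character", as a stand-in for it.

```python
import re

sample = "$10 $10 $20a a$20"

# \B requires that there is no word boundary before the $, so a letter or
# digit directly in front of it (as in "a$20") blocks the match.
print(re.findall(r"\B\$\d+\b", sample))        # ['$10', '$10']

# Stand-in for the lookbehind answer: the $ must be at the start of the
# string or directly after whitespace.
print(re.findall(r"(?<!\S)\$\d+\b", sample))   # ['$10', '$10']

# The two differ when punctuation precedes the $.
print(re.findall(r"\B\$\d+\b", "%$30"))        # ['$30']
print(re.findall(r"(?<!\S)\$\d+\b", "%$30"))   # []
```

Which of the two is preferable depends on whether an amount glued to other punctuation should count as a word that "starts with $".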