Regular Expression to find separate words?

Here's a quickie for you RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For instance, I'd like it to find the first two words in any sentence.
Example "Hi there, how are you?" - Return would be "Hi there"
Example "How are you doing?" - Return would be "How are"

Try this:
^\w+\s+\w+
Explanation: one or more word characters, then one or more whitespace characters, then one or more word characters, anchored at the start of the string.
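As a minimal sketch in C# (the class and method names are illustrative and not part of the original answer), the pattern can be applied with Regex.Match. Because of the ^ anchor it only looks at the very start of the input:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstTwoWords
{
    // Hypothetical helper: returns the first two words, or null when the
    // input does not start with two whitespace-separated words.
    public static string Find(string sentence)
    {
        Match m = Regex.Match(sentence, @"^\w+\s+\w+");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Find("Hi there, how are you?")); // Hi there
        Console.WriteLine(Find("How are you doing?"));     // How are
    }
}
```

An input such as "Hi, there" returns null with this pattern, because the comma between the two words is neither a word character nor whitespace.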

Regular expressions can be used to parse natural language, and they are a natural tool for pulling out words. After gathering the words, use a dictionary to see if they're actually words in a particular language.
The premise is to define a regular expression that will split out 99.9% of possible words, with "word" being the key definition.
I assume C# uses a regex engine that is compatible with Perl 5.8 syntax.
This is my ASCII definition of how to split out words (expanded):
regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )'
and Unicode (more has to be added or subtracted to suit specific encodings):
regex = '[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )'
To find ALL of the words, put the regex string into a regex (I don't know C#):
@matches = $text =~ /$regex/xg
where /xg are the expanded and global modifiers. Note that there is only capture group 1 in the regex string so the intervening text is not captured.
To find just the FIRST TWO:
@matches = $text =~ /(?:$regex)(?:$regex)/x
Below is a Perl sample. Anyway, play around with it. Cheers!
use strict;
use warnings;
use utf8; # the source below contains non-ASCII literals (é)
binmode (STDOUT,':utf8');
# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;
# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;
my $text = q(
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
);
print "\n**\n$text\n";
my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
# =======================================
my $junk = q(
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
);
print "\n\n**\n$junk\n";
# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
Output:
**
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included
**
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
First 2 words
--------------------
Hi
there
Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-
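For C# readers, here is a rough port of the Unicode version, offered as a sketch rather than a tested equivalent of the Perl sample. Two assumptions are worth stating: .NET spells the Unicode classes with braces (\p{L}, \p{N}, \p{P}), and it does not support POSIX bracket classes such as [[:punct:]], so only the Unicode pattern is ported. The class name, method name and the limit parameter are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Same structure as the Perl pattern: skip whitespace/punctuation, then
    // capture a word that may contain punctuation as long as more word
    // material (or more punctuation) follows it.
    private static readonly Regex Word = new Regex(
        @"[\s\p{P}]*([\p{L}\p{N}_-](?:[\p{L}\p{N}_-]|\p{P}(?=[\p{L}\p{N}\p{P}_-]))*)");

    // Returns up to 'limit' words; pass 2 to get just the first two.
    public static List<string> Words(string text, int limit = int.MaxValue)
    {
        var result = new List<string>();
        foreach (Match m in Word.Matches(text))
        {
            if (result.Count >= limit) break;
            result.Add(m.Groups[1].Value); // group 1 holds the word itself
        }
        return result;
    }

    public static void Main()
    {
        foreach (string w in Words("Hi there, how are you?", 2))
            Console.WriteLine(w);
    }
}
```

Iterating over Regex.Matches plays the role of Perl's /g modifier here, and reading Groups[1] leaves out the skipped whitespace and punctuation.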

@Rubens Farias:
Per my comment, here's the code I used:
public int startAt = 0;
private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
    // Start at word boundary, find one or more word chars, one or more whitespaces, one or more word chars, end at word boundary
    Regex regex = new Regex(@"\b\w+\s+\w+\b");
    if (startAt <= txtTest.Text.Length)
    {
        Match match = regex.Match(txtTest.Text, startAt);
        if (!match.Success) return; // no more word pairs
        MessageBox.Show(match.Value);
        startAt = match.Index + match.Length; // update the starting position to the end of the last match
    }
}
Each time the button is clicked it grabs pairs of words quite nicely, proceeding through the text in the txtTest TextBox and finding the pairs sequentially until the end of the string is reached.
@sln: Thanks for the extremely detailed response!

Related

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?

This works: (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s capture collection to find a variable amount of field data in a single record.
You could extend the regex further to get all records at once, where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms)                       # Inline modifiers: multi-line, dot-all
^ \# \s+                    # Beginning of record
(?:                         # Quantified group, 1 or more times, get all fields of record at once
    \s*                     # Trim leading wsp
    (                       # (1 start), capture collector for variable fields
        (?:                 # One char at a time, but not comma or begin of record
            (?!
                ,
              | ^ \# \s+
            )
            .
        )*?
    )                       # (1 end)
    \s*                     # Trim trailing wsp
    (?: , | $ )             # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
    CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
    Console.WriteLine("-------------------------");
    for (int i = 0; i < ccOL1.Count; i++)
        Console.WriteLine(" '{0}'", ccOL1[i].Value );
    _mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
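The output above ends with an empty capture. A small variation, sketched here with hypothetical names (TagReader, ReadTags), reads the same capture collection but drops zero-length captures, so the caller gets only the tags. It assumes a single record in the input, as in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagReader
{
    // Same pattern as in the answer above.
    private static readonly Regex Record = new Regex(
        @"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");

    // Returns the non-empty field captures of the first record.
    public static string[] ReadTags(string input)
    {
        Match m = Record.Match(input);
        if (!m.Success) return new string[0];
        return m.Groups[1].Captures
            .Cast<Capture>()
            .Select(c => c.Value)
            .Where(v => v.Length > 0) // skip empty captures
            .ToArray();
    }

    public static void Main()
    {
        foreach (string tag in ReadTags("#\ntag1, tag with space, !##%^, \U0001F984"))
            Console.WriteLine("'{0}'", tag);
    }
}
```

The '??' in the output above is most likely the console failing to display the emoji; the captured value itself is unaffected by that.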

Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());

You can pretty much do it without regex. Just split it like this:
var result = input
    .Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
    .Skip(1)
    .SelectMany(x => x.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                      .Select(y => y.Trim()));

How to match string that contains ^ in regular expression?

I tried to build a regular expression using an online tool but did not succeed. Here is the string I need to match:
27R4FF^27R4FF Text until end
It always starts with an alphanumeric string (case-insensitive),
then always a caret sign ^ (no space before or after),
then an alphanumeric string,
then always one white space,
then a string until the end.
Here is the regular expression that is not working for me:
((?:[a-z][a-z]*[0-9]+[a-z0-9]*))(\^)((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+))
C# code:
string txt = "784SFS^784SFS Value is here";
var regs = @"((?:[a-z][a-z]*[0-9]+[a-z0-9]*))(\^)((?:[a-z][a-z]*[0-9]+[a-z0-9]*)).*?((?:[a-z][a-z]+))";
Regex r = new Regex(regs, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(txt);
Console.Write(m.Success ? "matched" : "didn't match");
Console.ReadLine();
Help appreciated. Thanks

Verbatim ^[^\W_]+\^[^\W_]+[ ].*$
^ # BOS
[^\W_]+ # Alphanum
\^ # Caret
[^\W_]+ # Alphanum
[ ] # Space
.* # Anything
$ # EOS
Output
** Grp 0 - ( pos 0 , len 28 )
27R4FF^27R4FF Text until end
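If the individual parts are needed rather than just a yes/no answer, the same pattern can carry named groups. This is only a sketch: the group names left, right and rest, and the class and method names, are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaretRecord
{
    // Same pattern as above, with each part wrapped in a named group.
    private static readonly Regex Pattern = new Regex(
        @"^(?<left>[^\W_]+)\^(?<right>[^\W_]+)[ ](?<rest>.*)$");

    // Returns { left, right, rest }, or null when the line does not match.
    public static string[] Split(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["left"].Value,
            m.Groups["right"].Value,
            m.Groups["rest"].Value
        };
    }

    public static void Main()
    {
        string[] parts = Split("27R4FF^27R4FF Text until end");
        Console.WriteLine(string.Join(" | ", parts)); // 27R4FF | 27R4FF | Text until end
    }
}
```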

I wasn't sure whether the string 'until end' should be matched.
This works for
27R4FF^27R4FF Text
^\w+\^\w+\s\w+$
If you have some spaces at the end, try
^\w+\^\w+\s[\w\s]+$

Try this: https://regex101.com/r/hD0hV0/2
^[\da-z]+\^[\da-z]+\s.*$
...or commented (assumes RegexOptions.IgnorePatternWhitespace if you're using the format in code):
^ # always starts...
[\da-z]+ # ...with alphanumeric (case-insensitive)
\^ # then always caret sign ^ (no space before & after)
[\da-z]+ # then alphanumeric string
\s # then always one white space
.* # then string...
$ # ...until end.
The other answers don't actually match what you describe (at the time of this writing) because \w matches underscore and you didn't mention any limitations on "the string at the end".
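The underscore point can be checked with a short sketch that runs the \w-based pattern from the earlier answer and the explicit [\da-z] pattern against the same inputs. The class name, method names and the underscore test string are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnderscoreCheck
{
    // \w-based pattern from the earlier answer.
    public static bool MatchesWordClass(string s) =>
        Regex.IsMatch(s, @"^\w+\^\w+\s\w+$");

    // Explicit alphanumeric pattern from the answer above.
    public static bool MatchesAlnum(string s) =>
        Regex.IsMatch(s, @"^[\da-z]+\^[\da-z]+\s.*$", RegexOptions.IgnoreCase);

    public static void Main()
    {
        string withUnderscore = "27R_4FF^27R4FF Text";
        Console.WriteLine(MatchesWordClass(withUnderscore)); // True: \w accepts '_'
        Console.WriteLine(MatchesAlnum(withUnderscore));     // False

        string sample = "27R4FF^27R4FF Text until end";
        Console.WriteLine(MatchesWordClass(sample)); // False: only one trailing word allowed
        Console.WriteLine(MatchesAlnum(sample));     // True
    }
}
```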

Regex to match numbers written as words, digits or roman numerals

I'm trying to match a number written as a word, digit or roman numeral. Here's a bunch of samples
CHAPTER 1
CHAPTER 2
CHAPTER THREE
CHAPTER IV
CHAPTER TWENTY TWO
I'm pretty bad at regex, here's what I've got so far.
(CHAPTER (([0-9]+)|(/* words - see below */)|( /* roman - see below */)))
// words
(TWENTY|THIRTY|etc)?( |-)?(ONE|TWO|THREE|FOUR|FIVE|etc)?
// roman
(I|II|III|IV|V|etc)+
The statement catches CHAPTER 1, CHAPTER 2 and CHAPTER THREE, but tries to match IV as a word (I'm guessing it's matching FIVE somehow?). TWENTY TWO doesn't match at all.
Can anyone help? Here's the full regex
(CHAPTER (
([0-9]+)|
((TWENTY|THIRTY)?( |-)?(ONE|TWO|THREE|FOUR|FIVE)?)|
((I|II|III|IV|V)+)
))
NOTE:
The point of this is to convert these text representations to actual integers. I have methods to do this in each case, so I do need to distinguish between the various cases

Since you've already got parsers, which hopefully fail gracefully if given something which superficially looks like valid roman/text input but isn't, you could just call them all and see which pass.
If you don't just want to call them all, this regex should identify which parser to pass each input to.
var re = new Regex(
    @"CHAPTER (?:(?<arabic>\d+)|(?<roman>[IVXLCDM]+)|(?<text>[A-Z ]+))");
called for example as
var input = @"CHAPTER 1
CHAPTER 2
CHAPTER THREE
CHAPTER IV
CHAPTER TWENTY TWO";
foreach (Match match in re.Matches(input))
{
if (match.Groups["arabic"].Success)
{
Console.WriteLine("Pass {0} to Arabic parser", match.Groups["arabic"].Value);
}
else if (match.Groups["roman"].Success)
{
Console.WriteLine("Pass {0} to Roman parser", match.Groups["roman"].Value);
}
else if (match.Groups["text"].Success)
{
Console.WriteLine("Pass {0} to Text parser", match.Groups["text"].Value);
}
}
results in
Pass 1 to Arabic parser
Pass 2 to Arabic parser
Pass THREE to Text parser
Pass IV to Roman parser
Pass TWENTY TWO to Text parser

Regex for Roman numerals: \bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b
Regex for digits: \d+
Regex for literals: [A-Z ]+
Combine all of these into:
CHAPTER (?:(?<digits>\d+)|(?<roman>\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b)|(?<literal>[A-Z ]+))
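One caveat worth checking before relying on the combined pattern: every part of the Roman sub-pattern is optional, so it can succeed with an empty match at a word boundary, and for "CHAPTER THREE" the roman group then wins before the literal group is tried. The sketch below compares the combined pattern with a guarded variant. The guard (a lookahead for a Roman letter at the start, and a negative lookahead for any further capital letter at the end) is an assumption of this sketch, not part of the answer, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChapterClassifier
{
    // Combined pattern from the answer above, unchanged.
    public const string Original =
        @"CHAPTER (?:(?<digits>\d+)|(?<roman>\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b)|(?<literal>[A-Z ]+))";

    // Guarded variant (assumption): the Roman branch must start on a Roman
    // letter and must consume the whole run of capital letters.
    public const string Guarded =
        @"CHAPTER (?:(?<digits>\d+)|(?<roman>(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?![A-Z]))|(?<literal>[A-Z ]+))";

    // Returns "group:value" for the first named group that took part in the match.
    public static string Classify(string pattern, string line)
    {
        Match m = Regex.Match(line, pattern);
        if (!m.Success) return "none";
        foreach (string name in new[] { "digits", "roman", "literal" })
        {
            if (m.Groups[name].Success) return name + ":" + m.Groups[name].Value;
        }
        return "none";
    }

    public static void Main()
    {
        Console.WriteLine(Classify(Original, "CHAPTER THREE")); // roman:   (empty Roman match)
        Console.WriteLine(Classify(Guarded, "CHAPTER THREE"));  // literal:THREE
        Console.WriteLine(Classify(Guarded, "CHAPTER IV"));     // roman:IV
    }
}
```

With the guard in place, words that merely begin with Roman letters fall through to the literal branch instead of producing an empty or partial Roman match.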

CHAPTER (?:\d+|(?:XVIII|XVII|XIII|VIII|XIV|XVI|XII|III|VII|XV|VI|IV|XI|IX|XX|II|X|V|I)|(?:(?<d>TWENTY|THIRTY|FORTY|FIFTY|SIXTY|SEVENTY|EIGHTY|NINETY)?(?(d)(?: (?:ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE))?|(?:ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN|ELEVEN|TWELVE|THIRTEEN|FOURTEEN|FIFTEEN|SIXTEEN|SEVENTEEN|EIGHTEEN|NINETEEN))))
breakdown and explanation:
CHAPTER                 // match "CHAPTER " literally
(?:                     // then either:
  \d+                   // 1: digits
  |
  (?:                   // or 2: roman numerals (I to XVIII and XX; XIX is not listed) (note: make sure to order them by length!)
    XVIII|XVII|XIII|VIII|XIV|XVI|XII|III|VII|XV|VI|IV|XI|IX|XX|II|X|V|I
  )
  |                     // or 3: words
  (?:
    (?<d>               // first, one of the literals "TWENTY", "THIRTY", etc...
      TWENTY|THIRTY|FORTY|FIFTY|SIXTY|SEVENTY|EIGHTY|NINETY
    )?                  // ...if possible
    (?(d)               // then, if the previous group matched...
      (?:               // ...a space...
        (?:             // ...and the numbers "ONE" to "NINE"
          ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE
        )
      )?                // ...if possible.
      |
      (?:               // otherwise, one of "ONE" to "NINETEEN"
        ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN|ELEVEN|TWELVE|THIRTEEN|FOURTEEN|FIFTEEN|SIXTEEN|SEVENTEEN|EIGHTEEN|NINETEEN
      )
    )
  )
)

Regex replace between and including tags

I have the following line of text (META Title):
Buy [ProductName][Text] at a great price [/Text] from [ShopName] today.
I am replacing depending on what values I have.
I have it working as I require however I can't find the correct regex to replace:
[Text] at a great price [/Text]
The words (in and between the square brackets) change, so the only thing that will remain the same is:
[][/]
i.e. I may also want to replace
[TestText]some test text[/TestText] with nothing.
I have this working:
System.Text.RegularExpressions.Regex.Replace(SEOContent, @"\[Text].*?\[/Text]", @"");
I presumed the regex of:
[.*?].*?\[/.*?]
Would work but it didn't! - I'm coding in ASP.NET C#
Thanks in advance,
Dave

Use a named capture to get the node name of [..], then find it again using \k<..>.
(\[(?<Tag>[^\]]+)\][^\[]+\[/\k<Tag>\])
Broken down using Ignore Pattern Whitespace and an example program.
string pattern = @"
( # Begin our Match
\[ # Look for the [ escape anchor
(?<Tag>[^\]]+) # Place anything that is not another ] into the named match Tag
\] # Anchor of ]
[^\[]+ # Get all the text to the next anchor
\[/ # Anchor of the closing [...] tag
\k<Tag> # Use the named capture subgroup Tag to balance it out
\] # Properly closed end tag/node.
) # Match is done";
string text = "[TestText]some test text[/TestText] with nothing.";
Console.WriteLine (Regex.Replace(text, pattern, "Jabberwocky", RegexOptions.IgnorePatternWhitespace));
// Outputs
// Jabberwocky with nothing.
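Applied to the META title from the question, the same named back-reference can strip the paired tags and the text between them by replacing with an empty string. This is a minimal sketch; the class and method names are hypothetical. It also hints at why the attempted [.*?].*?\[/.*?] did not work: its leading [ is not escaped, so [.*?] is read as a character class rather than a literal bracket pair.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // [Name] ... [/Name], where the closing name must equal the opening one.
    private static readonly Regex PairedTag = new Regex(
        @"\[(?<Tag>[^\]]+)\][^\[]+\[/\k<Tag>\]");

    // Removes every [X]...[/X] section, tags included.
    public static string Strip(string text)
    {
        return PairedTag.Replace(text, "");
    }

    public static void Main()
    {
        string title = "Buy [ProductName][Text] at a great price [/Text] from [ShopName] today.";
        Console.WriteLine(Strip(title)); // Buy [ProductName] from [ShopName] today.
    }
}
```

Standalone tokens such as [ProductName] and [ShopName] are left alone because they have no matching closing tag.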
As an aside, I would actually create a tokenizing regex (using a regex If with the above pattern) and replace within matches by identifying the sections by named captures. Then, in the replace, use a match evaluator to replace the identified tokens, such as:
string pattern = @"
(?(\[(?<Tag>[^\]]+)\][^\[]+\[/\k<Tag>\]) # If statement to check []..[/] situation
( # Yes it is, match into named captures
\[
(?<Token>[^\]]+) # What is the text inside the [ ], into Token
\]
(?<TextOptional>[^\[]+) # Optional text to reuse
\[
(?<Closing>/[^\]]+) # The closing tag info
\]
)
| # Else, let us start a new check for either [] or plain text
(?(\[) # If a [ is found it is a token.
( # Yes process token
\[
(?<Token>[^\]]+) # What is the text inside the [ ], into Token
\]
)
| # Or (No of the second if) it is just plain text
(?<Text>[^\[]+) # Put it into the text match capture.
)
)
";
string text = @"Buy [ProductName] [Text]at a great price[/Text] from [ShopName] today.";
Console.WriteLine (
Regex.Replace(text,
pattern,
new MatchEvaluator((mtch) =>
{
if (mtch.Groups["Text"].Success) // If just text, return it.
return mtch.Groups["Text"].Value;
if (mtch.Groups["Closing"].Success) // If a Closing match capture group reports success, then process
{
return string.Format("Reduced Beyond Comparison (Used to be {0})", mtch.Groups["TextOptional"].Value);
}
// Otherwise its just a plain old token, swap it out.
switch ( mtch.Groups["Token"].Value )
{
case "ProductName" : return "Jabberwocky";
case "ShopName" : return "StackOverFlowiZon";
}
return "???"; // If we get to here...we have failed...need to determine why.
}),
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture));
// Outputs:
// Buy Jabberwocky Reduced Beyond Comparison (Used to be at a great price) from StackOverFlowiZon today.

Regex for matching season and episode

I'm making a small app for myself, and I want to find strings that match a pattern, but I could not find the right regular expression.
Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi
That is an example of a string I have, and I only want to know if it contains a substring of the form S[NUMBER]E[NUMBER], with each number at most 2 digits long.
Can you give me a clue?

Regex
Here is the regex using named groups:
S(?<season>\d{1,2})E(?<episode>\d{1,2})
Usage
Then, you can get named groups (season and episode) like this:
string sample = "Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi";
Regex regex = new Regex(@"S(?<season>\d{1,2})E(?<episode>\d{1,2})");
Match match = regex.Match(sample);
if (match.Success)
{
string season = match.Groups["season"].Value;
string episode = match.Groups["episode"].Value;
Console.WriteLine("Season: " + season + ", Episode: " + episode);
}
else
{
Console.WriteLine("No match!");
}
Explanation of the regex
S // match 'S'
( // start of a capture group
?<season> // name of the capture group: season
\d{1,2} // match 1 to 2 digits
) // end of the capture group
E // match 'E'
( // start of a capture group
?<episode> // name of the capture group: episode
\d{1,2} // match 1 to 2 digits
) // end of the capture group
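When the season and episode are needed as numbers, the named groups can be parsed straight away. The sketch below wraps this in a hypothetical TryParse helper; it swaps \d for [0-9], an assumption made so that int.Parse only ever sees ASCII digits, since \d in .NET can also match other Unicode digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EpisodeTag
{
    // Same shape as the regex above, with [0-9] instead of \d (see lead-in).
    private static readonly Regex Pattern = new Regex(
        @"S(?<season>[0-9]{1,2})E(?<episode>[0-9]{1,2})");

    // Returns true and fills season/episode when the file name contains SxxEyy.
    public static bool TryParse(string fileName, out int season, out int episode)
    {
        season = 0;
        episode = 0;
        Match m = Pattern.Match(fileName);
        if (!m.Success) return false;
        season = int.Parse(m.Groups["season"].Value);
        episode = int.Parse(m.Groups["episode"].Value);
        return true;
    }

    public static void Main()
    {
        int s, e;
        if (TryParse("Stargate.SG-1.S01E08.iNT.DVDRip.XviD-LOCK.avi", out s, out e))
            Console.WriteLine("Season {0}, Episode {1}", s, e); // Season 1, Episode 8
    }
}
```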

There's a great online test site here: http://gskinner.com/RegExr/
Using that, here's the regex you'd want:
S\d\dE\d\d
You can do lots of fancy tricks beyond that though!

Take a look at some of the media software like XBMC; they all have pretty robust regex filters for TV shows.
See here, here

The regex I would put for S[NUMBER1]E[NUMBER2] is
S(\d\d?)E(\d\d?) // (\d\d?) means one or two digits
You can get NUMBER1 by <matchresult>.group(1), NUMBER2 by <matchresult>.group(2).

I would like to propose a slightly more complex regex. It doesn't handle ". : - _"
because I replace them with spaces first:
str_replace(
array('.', ':', '-', '_', '(', ')'), ' ',
This is the capture regex that splits the title into title, season and episode:
(.*)\s(?:s?|se)(\d+)\s?(?:e|x|ep)\s?(\d+)
e.g. Da Vinci's Demons se02ep04 and variants
https://regex101.com/r/UKWzLr/3
The only case that I can't cover is a space between the season marker and the number, because the letter s or se then becomes part of the title, which does not work for me. Anyhow, I haven't seen such a case, but it is still an issue.
Edit:
I managed to get around it with a second line:
$title = $matches[1];
$title = preg_replace('/(\ss|\sse)$/i', '', $title);
This way I remove trailing ' s' and ' se' endings if the name is part of a series.
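Since the rest of this page is C#, here is a rough C# rendering of the same two-step idea, offered as a sketch and not a line-by-line port of the PHP. The class and method names are hypothetical, and [0-9] is used in place of \d so that int.TryParse only sees ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EpisodeTitleParser
{
    public static bool TryParse(string name, out string title, out int season, out int episode)
    {
        title = "";
        season = 0;
        episode = 0;

        // Step 1: replace the separators listed in the answer with spaces.
        string cleaned = Regex.Replace(name, @"[.:\-_()]", " ");

        // Step 2: the capture regex from the answer (title, season, episode).
        Match m = Regex.Match(
            cleaned,
            @"(.*)\s(?:s?|se)([0-9]+)\s?(?:e|x|ep)\s?([0-9]+)",
            RegexOptions.IgnoreCase);
        if (!m.Success) return false;

        // Step 3: drop a trailing " s" or " se" left on the title.
        title = Regex.Replace(m.Groups[1].Value, @"(\ss|\sse)$", "", RegexOptions.IgnoreCase);
        return int.TryParse(m.Groups[2].Value, out season)
            && int.TryParse(m.Groups[3].Value, out episode);
    }

    public static void Main()
    {
        string title;
        int s, e;
        if (TryParse("Da Vinci's Demons se02ep04", out title, out s, out e))
            Console.WriteLine("{0} / season {1} / episode {2}", title, s, e);
    }
}
```

For the file name from the question, the separators are replaced first, so the title comes back as "Stargate SG 1" rather than "Stargate.SG-1".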
