I encountered the problem when I tried to run my regex function on my text, which can be found here.
With an HttpRequest I fetch the text from the link above. Then I run my regex to clean up the text before finding the most frequent occurrences of a certain word.
After cleaning up the text I split the string by whitespace, added it to a string array, and noticed there was a huge difference in the number of elements.
Does anyone know why this happens? Searching the raw text for the word " the " gives 6806 hits.
raw data correct answer is 6806
And with my regex I get - 8073 hits
with regex
The regex I'm using is here in the sandbox with the text, and below in the code.
//Application storing.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
// Cleaning up a bit
var words = CleanByRegex(rawSource);
string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);
string CleanByRegex(string rawSource)
{
Regex r = RemoveSpecialChars();
return r.Replace(rawSource, " ");
}
// arr {string[220980]} - with regex
// arr {string[157594]} - without regex
foreach (var word in arr)
{
// some logic
}
```
partial class Program
{
[GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
private static partial Regex RemoveSpecialChars();
}
```
I have tried debugging it and I suspect that I'm adding trailing whitespace, but I don't know how to handle it.
I have tried adding a whitespace-removing regex that replaces multiple whitespace characters with a single one.
The regex would look something like this: [ ]{2,}
partial class Program
{
[GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
private static partial Regex RemoveWhiteSpaceTrails();
}
It would be helpful if you describe what you're trying to clean up.
However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely lead to occurrences of " the " that weren't there before:
The quick brown fox jumps over the
lazy dog
//the+newline does not match
//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more *the+space*
If you change your search to something not so common, for example Seward, then you should see the same results before and after the regex.
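A minimal Java sketch of this effect (the question's code is C#; the class name `TheCount` and the helper `countOccurrences` are made up for illustration). It counts non-overlapping occurrences of " the " before and after replacing every non-alphanumeric character with a space, which is what the first alternative of the question's regex does:

```java
class TheCount {
    // Counts non-overlapping occurrences of needle in text (hypothetical helper).
    static int countOccurrences(String text, String needle) {
        int count = 0;
        int from = text.indexOf(needle);
        while (from != -1) {
            count++;
            from = text.indexOf(needle, from + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String raw = "The quick brown fox jumps over the\nlazy dog";
        // Every character that is not a letter or digit becomes a space,
        // so the newline after "the" turns into a space too.
        String cleaned = raw.replaceAll("[^a-zA-Z0-9]", " ");
        System.out.println(countOccurrences(raw, " the "));     // prints 0
        System.out.println(countOccurrences(cleaned, " the ")); // prints 1
    }
}
```

The text itself has not grown; only the number of places where the search string fits has.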
At first I believed the regex created more text when I replaced matches with string.Empty or " ". That is not true; it just created more matches.
I believed this because I thought searching in Chrome via Ctrl+F would give me every occurrence of a word, and that isn't necessarily true.
I tried my code on a subset of the Lorem Ipsum text instead, because I doubted that Chrome's search really gave the correct answer.
Short answer is NO.
If I search for " the ", I won't get "the" + Environment.NewLine, which @simmetric demonstrated.
Another scenario is sentences that begin with the word "The ". Since I am interested in the words in the text, I used the regex \w+ to get the words, which returns a MatchCollection (usable as an IList<Match>) that I then loop through to add each value to my dictionary.
Code Demonstration
var rawSource = "Some text";
var words = CleanByRegex(rawSource);
IList<Match> CleanByRegex(string rawSource)
{
IList<Match> r = Regex.Matches(rawSource, "\\w+");
return r;
}
foreach (var word in words)
{
if (word.Value.Length >= 1) // at least 1 character
{
if (dictionary.ContainsKey(word.Value)) //if it's in the dictionary
dictionary[word.Value] = dictionary[word.Value] + 1; //Increment the count
else
dictionary[word.Value] = 1; //put it in the dictionary with a count 1
}
}
This is an extra exercise given to us in our Uni course, in which we need to find whether a sentence contains a palindrome or not. Finding out whether a single word is a palindrome is fairly easy, but the given sentence could look like this: "Dog cat - kajak house". My logic is, using functions I already wrote, to first determine whether each character is a letter and delete it if not. Then I count the number of spaces + 1 to find out how many words there are in the sentence, prepare an array of those words, and call a function that checks whether a word is a palindrome on every element of the array. However, the double space left behind would mess everything up in the "counting" phase. I've spent around an hour fiddling with code to do this, but I can't wrap my head around it. Could anyone help me? Note that I'm not supposed to use any external methods or libraries. I've done this using RegEx and it was fairly easy, however I'd like to do this "legally". Thanks in advance!
Just split on space, the trick is to remove empties. Google StringSplitOptions.RemoveEmptyEntries
then obviously join with one clean space
You could copy all the characters you are interested in to a new string:
var copy = new StringBuilder();
// Indicates whether the previous character was whitespace
var whitespace = true;
foreach (var character in originalString)
{
if (char.IsLetter(character))
{
copy.Append(character);
whitespace = false;
}
else if (char.IsWhiteSpace(character) && !whitespace)
{
copy.Append(character);
whitespace = true;
}
else
{
// Ignore other characters
}
}
If you're looking for the most rudimentary/olde-worlde option, this will work. But no-one would actually do this, other than in an illustrative manner.
string test = "Dog cat  kajak house"; // Your string minus any non-letters (note the double space)
int prevLen;
do
{
    prevLen = test.Length;
    test = test.Replace("  ", " "); // Repeat-replace double spaces until none left
} while (prevLen > test.Length);
Personally, I'd probably do the following to end up with an array of words to check:
string[] words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
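For the exercise's end goal, here is a rough Java sketch of the same idea (the thread is C#; the class and method names are made up, and skipping one-letter words is my own assumption). It splits on runs of whitespace, so double spaces never produce empty words, and then checks each word. Whether a standard-library split counts as "legal" is for the exercise to decide:

```java
class PalindromeWords {
    // True if the word reads the same in both directions, ignoring case.
    static boolean isPalindrome(String word) {
        String w = word.toLowerCase();
        for (int i = 0, j = w.length() - 1; i < j; i++, j--) {
            if (w.charAt(i) != w.charAt(j)) {
                return false;
            }
        }
        return true;
    }

    // Splits on one or more whitespace characters and tests every word.
    // Assumption: one-letter words are not counted as palindromes.
    static boolean containsPalindrome(String sentence) {
        for (String word : sentence.trim().split("\\s+")) {
            if (word.length() > 1 && isPalindrome(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsPalindrome("Dog cat  kajak house")); // prints true
        System.out.println(containsPalindrome("Dog cat house"));        // prints false
    }
}
```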
If I wanted to parse back a string to only return the first all capital words in it how would I do that?
Example:
"OTHER COMMENTS These are other comments that would be here. Some more
comments"
I want to just return "OTHER COMMENTS"
These first upper case words can be many and the exact count is
unknown.
There could be other words in the string after with all caps
that I just want to ignore.
You can use a combination of Split (to break the sentence into words), SkipWhile (to skip words that aren't all caps), ToUpper (to test the word against its upper-case counterpart), and TakeWhile (to take all sequential upper-case words once one is found). Finally, these words can be re-joined using Join:
string words = "OTHER COMMENTS These are other comments that would be here. " +
"Some more comments";
string capitalWords = string.Join(" ", words
.Split()
.SkipWhile(word => word != word.ToUpper())
.TakeWhile(word => word == word.ToUpper()));
You can loop through the string as an array of chars. To check if a char is uppercase, use Char.IsUpper (https://www.dotnetperls.com/char-islower). In the loop, if the char is uppercase, set a flag to record that we have started reading the run, and add that char to a collection of chars. Keep looping, and once a char is no longer uppercase while the flag is still true, break out of the loop. Then return the collection of chars as a string.
Hope that helps.
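A rough Java sketch of the character loop described above (names are made up for illustration). Taken literally, a per-character check would stop at the first space, so this version is my own adaptation: it treats whitespace as a word separator and only keeps complete words, discarding a partial word such as the "T" of "These". It assumes the string starts with the capital words and that they contain letters only:

```java
class LeadingCaps {
    // Returns the leading run of all-uppercase words, joined by single spaces.
    static String leadingUpperWords(String input) {
        StringBuilder result = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i <= input.length(); i++) {
            // Treat the end of the string like a trailing space so the last word is flushed.
            char c = (i == input.length()) ? ' ' : input.charAt(i);
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    if (result.length() > 0) {
                        result.append(' ');
                    }
                    result.append(word);
                    word.setLength(0);
                }
            } else if (Character.isUpperCase(c)) {
                word.append(c);
            } else {
                break; // first non-uppercase character ends the run; the partial word is dropped
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "OTHER COMMENTS These are other comments that would be here. Some more comments";
        System.out.println(leadingUpperWords(input)); // prints OTHER COMMENTS
    }
}
```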
var input = "OTHER COMMENTS These are other comments that would be here. Some more comments";
var output = String.Join(" ", input.Split(' ').TakeWhile(w => w.ToUpper() == w));
Split it into words, then take words while the uppercase version of the word is the same as the word. Then combine them back with the space separator.
You could also use Regex:
using System.Text.RegularExpressions;
...
// The Regex pattern is any number of capital letters followed by a non-word character.
// You may have to adjust this a bit.
Regex r = new Regex(@"([A-Z]+\W)+");
string s = "OTHER COMMENTS These are other comments that would be here. Some more comments";
MatchCollection m = r.Matches(s);
// Only return the first match if there are any matches.
if (m.Count > 0)
{
Console.WriteLine(m[0]);
}
I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
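Here is a sketch of how that pattern might be used from Java (the class name and sample input are made up). The only change is the escaping needed for a Java string literal. Note that group 2 still contains the backslashes of any escaped quotes; removing them would be a separate step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokens {
    // Same regex as above, escaped for a Java string literal.
    private static final Pattern TOKEN =
        Pattern.compile("(?:(['\"])(.*?)(?<!\\\\)(?>\\\\\\\\)*\\1|([^\\s]+))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 2 is set for quoted strings, group 3 for unquoted words.
            tokens.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The actual input text is: say "a \"b\" c" done
        for (String token : tokenize("say \"a \\\"b\\\" c\" done")) {
            System.out.println(token);
        }
    }
}
```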
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I found so far, but creates also empty (null) matches, which he excludes in his program. These empty matches also appear from regex testers (e.g. rubular.com).
If you turn the searches around (first look for the quoted parts and then the space-separated words), then you might do it in one pass with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to make a split ignore the space in between.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using C#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.)
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
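A minimal Java sketch of such a function (names are made up for illustration). It walks the string once, remembers which quote character, if any, is currently open, and cuts a token at every whitespace character outside quotes. It assumes quotes are balanced and never escaped, so an apostrophe inside a word (as in Foo's) would be treated as an opening quote:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;           // 0 means we are not inside quotes
        boolean hasToken = false; // true once the current token has started
        for (char c : input.toCharArray()) {
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;          // closing quote: drop it
                } else {
                    current.append(c);  // keep everything inside quotes, spaces included
                }
            } else if (c == '"' || c == '\'') {
                quote = c;              // opening quote: drop it, remember its type
                hasToken = true;
            } else if (Character.isWhitespace(c)) {
                if (hasToken) {         // whitespace outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

Because tokens are appended as they are found, the original order is kept without any extra bookkeeping.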
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a space ONLY if it is followed by an even number of double quotes.
I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:
\b(good|great)\w*\b
Can anyone point me in the right direction?
Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).
Edit 2: Here is some sample text: "This is a really great story."
I might want to match "this" or "really", or I might want to match "eall" or "reat".
If you can guarantee that there are no reserved regex characters in your word list (or if you escape them), you could just use this code to make a big word list into @"(a|big|word|list)". There's nothing wrong with the | operator as you're using it, as long as those () surround it. It sounds like the \w* and the \b patterns are what are interfering with your matches.
String[] pattern_list = whatever;
String regex = String.Format("({0})", String.Join("|", pattern_list));
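For comparison, a Java sketch of the same idea that also covers the "if you escape them" case (class and method names are made up for illustration). `Pattern.quote` wraps each word so that any regex metacharacters in it are matched literally:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationBuilder {
    // Builds "(w1|w2|...)" with every word quoted so metacharacters are literal.
    static Pattern anyOf(List<String> words) {
        String alternation = words.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "(", ")"));
        return Pattern.compile(alternation);
    }

    // Collects every match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern p = anyOf(List.of("eall", "reat"));
        System.out.println(findAll(p, "This is a really great story.")); // prints [eall, reat]
    }
}
```

There is no \b or \w* here, so parts of words match as well as whole words.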
(good)*(great)*
after your edit:
\b(g*o*o*d*)*(g*r*e*a*t*)*\b
I think you are asking for something you don't really mean.
If you want to search for any part of the word, you are literally searching for letters,
e.g. searching for {Jack, Jim} in "John and Shelly are cool"
is searching for all the letters in the names {J,a,c,k,i,m}:
*J*ohn *a*nd Shelly *a*re
and for that you don't need regex :)
In my opinion,
a suffix tree can help you with that:
http://en.wikipedia.org/wiki/Suffix_tree#Functionality
enjoy.
I'm not sure I understand the problem correctly:
If you want to match "great" or "reat" you can express this by a pattern like:
"g?reat"
This simply says that the "reat"-part must exist and the "g" is optional.
This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.
If you have the two words "great" and "good" and you want to match them both with an optional "g", you can write it like this:
(g?reat|g?ood)
And if you want to include a word-boundary like:
\b(g?reat|g?ood)
You should be aware that this would not match anything like "breat" because you have the "reat" but the "r" is not at the word boundary because of the "b".
So if you want to match whole words that contain a substring like "reat" or "ood" then you should try:
"\b\w*?(reat|ood)\w*\b"
This reads:
1. Beginning at a word boundary, match any number of word characters, but don't be greedy.
2. Match "reat" or "ood"; this ensures that only words containing one of them are matched.
3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.
This will match:
"goodness", "good", "ood" (if a complete word)
It can be read as: Give me all complete words that contain "ood" or "reat".
Is that what you are looking for?
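A small Java sketch of the "whole words that contain a substring" pattern (the class name and sample text are made up). It uses \w* for the trailing part, so that a word ending in the substring, such as "good", also matches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsContaining {
    // Whole words that contain "reat" or "ood" anywhere inside them.
    private static final Pattern WORD = Pattern.compile("\\b\\w*?(reat|ood)\\w*\\b");

    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("goodness is good, a great treat"));
        // prints [goodness, good, great, treat]
    }
}
```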
I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following code to create a regex for a given word, although the resulting pattern has the potential to become very long and slow:
function wordPermutations( $word, $minLength = 2 )
{
$perms = array( );
for ($start = 0; $start < strlen( $word ); $start++)
{
for ($end = strlen( $word ); $end > $start; $end--)
{
$perm = substr( $word, $start, ($end - $start));
if (strlen( $perm ) >= $minLength)
{
$perms[] = $perm;
}
}
}
return $perms;
}
Test Code:
$perms = wordPermutations( 'great', 3 ); // get all permutations of "great" that are 3 or more chars in length
var_dump( $perms );
echo ( '/\b('.implode( '|', $perms ).')\b/' );
Example Output:
array
0 => string 'great' (length=5)
1 => string 'grea' (length=4)
2 => string 'gre' (length=3)
3 => string 'reat' (length=4)
4 => string 'rea' (length=3)
5 => string 'eat' (length=3)
/\b(great|grea|gre|reat|rea|eat)\b/
Just check for the boolean that Regex.IsMatch() returns.
if (Regex.IsMatch(line, "condition") && Regex.IsMatch(line, "condition2"))
The line will then match both regexes.
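The Java counterpart would look roughly like this (a sketch with made-up names). `Matcher.find()` is used because, like Regex.IsMatch, it succeeds when the pattern occurs anywhere in the line, whereas `matches()` would require the whole line to match:

```java
import java.util.regex.Pattern;

class BothMatch {
    // True only if both regexes are found somewhere in the line.
    static boolean matchesBoth(String line, String regex1, String regex2) {
        return Pattern.compile(regex1).matcher(line).find()
            && Pattern.compile(regex2).matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesBoth("good and great", "good", "great")); // prints true
        System.out.println(matchesBoth("good only", "good", "great"));      // prints false
    }
}
```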