How to match all words separated by spaces in RegEx? - c#

I am studying regex, but I still find it hard to learn.
So my problem is this: I have been given a set of keywords:
The quick brown fox
which I have to find in a bunch of sentences like:
the Brown SexyFox Jumps soQuickly in the backyard...
If there is any match with these words (not case-sensitive):
The, the, brown, Brown, fox, Fox, quick, Quick
then I can say that the return value is true.
How do I do this in regex? I was thinking of splitting the words, putting them in an array, and looping over them with .Contains(...), but I know that is not ideal.
Actually, I have another concern, but I'm afraid to post it as a new question.
So my second question is: how does regex read the pattern? What takes the highest and lowest priority?
Anyway, please help me with my problem.
EDIT
Sorry for the late response, but the solution of @PatrikW seems not to work.
I have a static class:
public static bool ValidateRegex(string value, string regex)
{
value += ""; // Fail safe for null
Regex obj = new Regex(regex, RegexOptions.IgnoreCase);
if (value.Trim() == "")
return false;
else
{
return obj.IsMatch(value);
}
}
Construct regex pattern:
keyword = "maria";
string regexPattern = "(?<=\b)(";
string Or = string.Empty;
foreach (string item in keyword.Split(new char[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries).ToList())
{
regexPattern += Or + "(" + item + ")";
Or = "|";
}
regexPattern += ")(?=\b)";
Data information:
List<Friend> useritems = null;
useritems = ((List<Friend>)SessonHandler.Data.FriendList).Where(i =>
Utility.ValidateRegex(i.LastName, regexPattern) ||
Utility.ValidateRegex(i.FirstName, regexPattern) ||
Utility.ValidateRegex(i.MiddleName, regexPattern)).ToList();
//regexPattern = "(?<=\b)((maria))(?=\b)"
//LastName = "MARIA CALIBRI"
//FirstName = "ALICE"
//MiddleName = null
Maybe I did something wrong with the code. Please help.
EDIT 2
I forgot the @ sign. This must work now:
string regexPattern = @"(?<=\b)(";
.
.
.
regexPattern += @")(?=\b)";
The answer below is correct.

What Felice showed is the more dynamic solution, but here's a pattern for finding the exact keywords you've got:
"(?<=\b)((The)|(quick)|(brown)|(fox))(?=\b)"
Because of the leading and trailing word-boundary assertions (\b inside the lookbehind and the lookahead), it will only match whole words and not parts of them.
Here's an example:
// RegexOptions must be passed to the constructor; Regex.Options is read-only.
Regex foxey = new Regex(@"(?<=\b)((The)|(quick)|(brown)|(fox))(?=\b)", RegexOptions.IgnoreCase);
bool doesMatch = foxey.IsMatch("the Brown SexyFox Jumps soQuickly in the backyard...");
Edit - Regex engine:
Simply put, the Regex-engine walks through the input-string one character at a time, starting at the leftmost one, checking it against the first part of the regex-pattern we've written. If it matches, the parser moves to the next character and checks it against the next part of the pattern. If it manages to successfully walk through the whole pattern, that is a match.
You can read about how the internals of regex work just by searching for "regex engine" or something along those lines. Here's a pick:
http://www.regular-expressions.info/engine.html
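For the dynamic case raised in the question, the same whole-word idea can be sketched as a small helper that builds the alternation from the keyword string. This is only a sketch under assumptions: the class and method names (KeywordMatcher, ContainsAnyKeyword) are made up for illustration, the keywords are split on the same separators used in the question, and each keyword is passed through Regex.Escape so that characters such as "." or "+" are treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class KeywordMatcher
{
    // Returns true if any keyword occurs in text as a whole word (case-insensitive).
    public static bool ContainsAnyKeyword(string text, string keywords)
    {
        if (string.IsNullOrWhiteSpace(text) || string.IsNullOrWhiteSpace(keywords))
            return false;

        // Split on the separators used in the question and escape regex metacharacters.
        string[] alternatives = keywords
            .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape)
            .ToArray();
        if (alternatives.Length == 0)
            return false;

        // Verbatim strings (@"...") keep \b as the regex word boundary.
        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }
}
```

With the sample data from the question, KeywordMatcher.ContainsAnyKeyword("the Brown SexyFox Jumps soQuickly in the backyard...", "The quick brown fox") returns true because "the" and "Brown" are whole words, while "SexyFox" and "soQuickly" alone would not count.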

Related

Why is my Regex for removing special characters adding more words to my text?

I encountered the problem when I tried to run my regex function on my text, which can be found here.
With an HttpRequest I fetch the text from the link above. Then I run my regex to clean up the text before counting the occurrences of a certain word.
After cleaning up the text I split the string by whitespace into a string array and noticed there was a huge difference in the number of elements.
Does anyone know why this happens? In the raw data, the word " the " has 6806 hits, which is the correct answer.
With my regex I get 8073 hits.
The regex I'm using is here in the sandbox with the text and below in the code.
//Application storing.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
// Cleaning up a bit
var words = CleanByRegex(rawSource);
string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);
string CleanByRegex(string rawSource)
{
Regex r = RemoveSpecialChars();
return r.Replace(rawSource, " ");
}
// arr {string[220980]} - with regex
// arr {string[157594]} - without regex
foreach (var word in arr)
{
// some logic
}
```
partial class Program
{
[GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
private static partial Regex RemoveSpecialChars();
}
```
I have tried debugging it, and I suspect that I'm adding trailing whitespace, but I don't know how to handle it.
I have tried to add a whitespace-removing regex that replaces multiple whitespace characters with a single one.
The regex would look something like this: [ ]{2,}
partial class Program
{
[GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
private static partial Regex RemoveWhiteSpaceTrails();
}
It would be helpful if you describe what you're trying to clean up.
However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely lead to occurrences of " the " that weren't there before:
The quick brown fox jumps over the
lazy dog
//the+newline does not match
//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more *the+space*
If you change your search to something not so common, for example Seward, then you should see the same results before and after the regex.
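The effect described above can be reproduced on the two-line example. The following is a minimal sketch, not the asker's actual code: the class and method names are made up, and the cleaning pattern is a simplified stand-in for the one in the question. It counts the literal " the " before and after cleaning, and compares that with a whole-word count.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper used only to illustrate the answer above.
public static class TheCounter
{
    // Counts (possibly overlapping) case-insensitive occurrences of needle in text.
    public static int CountOccurrences(string text, string needle)
    {
        int count = 0, index = 0;
        while ((index = text.IndexOf(needle, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index++; // advance by one so adjacent hits such as " the the " are both seen
        }
        return count;
    }

    // Simplified stand-in for the cleaning regex: every non-alphanumeric becomes a space.
    public static string Clean(string raw) => Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

    // Counts "the" as a whole word, independent of the surrounding whitespace.
    public static int CountWholeWord(string text, string word) =>
        Regex.Matches(text, @"\b" + Regex.Escape(word) + @"\b", RegexOptions.IgnoreCase).Count;
}
```

For the input "The quick brown fox jumps over the\nlazy dog", CountOccurrences(raw, " the ") gives 0 on the raw text and 1 on the cleaned text, while CountWholeWord gives 2 in both cases, which is why a whole-word count is the more stable measure.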
At first I believed the regex created more text when I replaced matches with string.Empty or " ". That is not true; I just created more matches.
I believed it because I thought the search in Chrome via Ctrl+F would give me all the occurrences of a word, and this isn't necessarily true.
I tried my code on a subset of the Lorem Ipsum text instead, because I questioned whether the search in Chrome really gives the correct answer.
Short answer is NO.
If I search for " the ", I won't get "the" followed by Environment.NewLine, which @simmetric proved.
Another scenario is sentences that begin with the word "The ". Since I am curious about the words in the text, I used the regex \w+ to get the words, which returns a MatchCollection (usable as an IList<Match>) that I later loop through to add each value to my dictionary.
Code Demonstration
var rawSource = "Some text";
var words = CleanByRegex(rawSource);
IList<Match> CleanByRegex(string rawSource)
{
IList<Match> r = Regex.Matches(rawSource, "\\w+");
return r;
}
foreach (var word in words)
{
if (word.Value.Length >= 1) // at least 1 character
{
if (dictionary.ContainsKey(word.Value)) //if it's in the dictionary
dictionary[word.Value] = dictionary[word.Value] + 1; //Increment the count
else
dictionary[word.Value] = 1; //put it in the dictionary with a count 1
}
}

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this and see the demo. Replace with an empty string. You can use a lookahead here to check whether the quoted section contains the word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched characters with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is called a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*') Because this regex is inside a capturing group, the matched characters are stored in group index 1.
| Next comes the regex alternation operator.
'[^']*' This matches all the remaining single-quoted strings (the ones that do not contain dark). Note that it won't match a single-quoted string containing dark, because those strings were already matched by the pattern before the | alternation operator.
Finally, replacing every match with the contents of group index 1 gives you the desired output: group 1 is empty for the strings matched by the second alternative, so those are removed.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); // trim the leading and trailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); // trim the trailing space again

PO Box RegEx At the Start of a String

I've looked around this site for a good PO Box regex and didn't find any that I liked or worked consistently, so I tried my hand at making my own... I feel pretty good about it, but I'm sure the kind folks here on SO can poke some holes in it :) So... what problems do you see with this and what false-positives/false-negatives can you think up that would get through?
One caveat that I can see is that the PO Box pattern has to be at the start of the string, but what else is wrong with it?
public bool AddressContainsPOB(string Addr)
{
string input = Addr.Trim().ToLower();
bool Result = false;
Regex regexObj1 = new Regex(@"^p(ost){0,1}(\.){0,1}(\s){0,2}o(ffice){0,1}(\.){0,1}((\s){1}|b{1}|[1-9]{1})");
Regex regexObj2 = new Regex(@"^pob((\s){1}|[0-9]{1})");
Regex regexObj3 = new Regex(@"^box((\s){1}|[0-9]{1})");
Match match1 = regexObj1.Match(input);
if (match1.Success)
{ Result = true; }
Match match2 = regexObj2.Match(input);
if (match2.Success)
{ Result = true; }
Match match3 = regexObj3.Match(input);
if (match3.Success)
{ Result = true; }
return Result;
}
What do you expect from us? You don't even give us valid/invalid strings. Have you tested your regexes somehow?
What I see at the first glance, without knowing something about valid input is:
One caveat that I can see is that the PO Box pattern has to be at the start of the string
Do you want to match it only at the start of the string or not? You need to know that and define it in your pattern. If you don't want to, then remove the start of the string anchor ^ and replace it with a word boundary \b.
{1} is superfluous, you can just remove it.
For {0,1} there is a shortform ?, I like this better, because it is shorter.
^box((\s){1}|[0-9]{1}) matches either "box" followed by a whitespace OR followed by a digit. Is this really what you want to match?
(\.) in the first regex: Why do you group a single dot?
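Applying those suggestions to the code in the question could look roughly like the sketch below. It keeps the three original alternatives and only changes notation ({0,1} becomes ?, {1} is dropped, single-character groups are removed), so it should accept and reject the same inputs as the original; it does not try to fix false positives. The class name PoBoxCheck and the null guard are additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rewrite of the asker's method; same three patterns, shorter notation.
public static class PoBoxCheck
{
    private static readonly Regex PoBoxPrefix = new Regex(
        @"^(?:p(?:ost)?\.?\s{0,2}o(?:ffice)?\.?(?:\s|b|[1-9])" + // regexObj1
        @"|pob(?:\s|[0-9])" +                                     // regexObj2
        @"|box(?:\s|[0-9]))");                                    // regexObj3

    public static bool AddressContainsPOB(string addr)
    {
        if (addr == null) return false; // assumption: treat a missing address as "no PO Box"
        string input = addr.Trim().ToLower();
        return PoBoxPrefix.IsMatch(input);
    }
}
```

"P.O. Box 12", "post office box 1" and "box 7" are accepted, and "123 Main St" is rejected. The pattern still accepts "Pobble Street", because "p", "o", "b" satisfies the first alternative, which is the kind of false positive the question asks about.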

How to get all words of a string in c#?

I have a paragraph in a single string and I'd like to get all the words in that paragraph.
My problem is that I don't want the punctuation marks attached to words (',','.',''','"',';',':','!','?'), nor \n, \t, etc.
I also don't want suffixes such as 's and 'm; for example, world's should only return world.
In the example
he said. "My dog's bone, toy, are missing!"
the list should be: he said my dog bone toy are missing
Expanding on Shan's answer, I would consider something like this as a starting point:
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
Why include the ' character? Because this will prevent words like "we're" from being split into two words. After capturing it, you can manually strip out the suffix yourself (whereas otherwise, you couldn't recognize that re is not a word and ignore it).
So:
static string[] GetWords(string input)
{
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
var words = from m in matches.Cast<Match>()
where !string.IsNullOrEmpty(m.Value)
select TrimSuffix(m.Value);
return words.ToArray();
}
static string TrimSuffix(string word)
{
int apostropheLocation = word.IndexOf('\'');
if (apostropheLocation != -1)
{
word = word.Substring(0, apostropheLocation);
}
return word;
}
Example input:
he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?
Example output:
[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]
One limitation of this approach is that it will not handle acronyms well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out if it's a full stop afterwards (i.e., by checking that it's the only period in the word as well as the last character).
Hope this is helpful for you:
string[] separators = new string[] {"\'s", ",", ".", "!", "\'", " "}; // "'s" is listed before "'" so it is matched first
string text = "My dog's bone, toy, are missing!";
foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
Console.WriteLine(word);
See Regex word boundary expressions, What is the most efficient way to count all of the words in a richtextbox?. Moral of the story is that there are many ways to approach the problem, but regular expressions are probably the way to go for simplicity.
split on whitespace, trim anything that isn't a letter on the resulting strings.
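That suggestion can be sketched without any regex. The names WordSplitter, GetWords and TrimNonLetters are made up for this illustration, and "letter" is taken to mean char.IsLetter. Note that this approach only trims the ends of each token, so an inner apostrophe as in dog's survives and would still need the suffix stripping shown in the earlier answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating "split on whitespace, trim non-letters".
public static class WordSplitter
{
    public static List<string> GetWords(string input)
    {
        var words = new List<string>();
        // A null separator array makes Split use whitespace as the delimiter.
        foreach (string token in input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = TrimNonLetters(token);
            if (word.Length > 0)
                words.Add(word);
        }
        return words;
    }

    // Removes leading and trailing characters that are not letters.
    public static string TrimNonLetters(string token)
    {
        int start = 0, end = token.Length - 1;
        while (start <= end && !char.IsLetter(token[start])) start++;
        while (end >= start && !char.IsLetter(token[end])) end--;
        return token.Substring(start, end - start + 1);
    }
}
```

For the example sentence from the question, GetWords returns he, said, My, dog's, bone, toy, are, missing.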
Here's a looping replace method... not fast, but a way to solve it...
string result = "string to cut ' stuff. ! out of";
".',!#".ToCharArray().ToList().ForEach(a => result = result.Replace(a.ToString(),""));
This assumes you want to place it back in the original string, not a new string or a list.

How can I find a string after a specific string/character using regex

I am hopeless with regex (C#), so I would appreciate some help.
Basically, I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan. Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment: is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to have the "words to look for" constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of whitespace characters
\w* = 0 or more word characters (letters, digits, and underscore)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(fyi) The "@" before strings means that \ no longer means something special, so you can type @"c:\fun.txt" instead of "c:\\fun.txt"
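The difference matters for regex patterns in particular, because in a regular C# string "\b" is the backspace character, not the regex word boundary. A small sketch (the class and method names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of verbatim (@"...") versus regular string literals.
public static class VerbatimDemo
{
    // Both literals denote the same 10-character path.
    public static bool SamePath() => @"c:\fun.txt" == "c:\\fun.txt";

    // Verbatim: the regex engine receives \b and treats it as a word boundary.
    public static bool VerbatimMatches() => Regex.IsMatch("a maria b", @"\bmaria\b");

    // Regular literal: "\b" becomes U+0008 (backspace), so the pattern looks for
    // backspace characters around "maria" and does not match ordinary text.
    public static bool RegularMatches() => Regex.IsMatch("a maria b", "\bmaria\b");
}
```

Here SamePath() and VerbatimMatches() return true, while RegularMatches() returns false. Writing "\\bmaria\\b" in a regular literal would be equivalent to the verbatim version.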
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
/KeywordB\: (\w+)/
This matches the word that comes directly after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
