How to get text between two strings? - C#

The string I want to extract text from is given below.
String:
Hello Mr John and Hello Ms Rita
Regex
Hello(.*?)Rita
I am trying to get the text between the two strings "Hello" and "Rita". With the regex above, the result is
Mr John and Hello Ms
which is wrong. I need only "Ms". Can anyone help me write a proper regex for this situation?

Use a tempered greedy token:
Hello((?:(?!Hello|Rita).)*)Rita
      ^^^^^^^^^^^^^^^^^^^^
The (?:(?!Hello|Rita).)* part is the tempered greedy token: it matches any character, one at a time, as long as that character does not start a Hello or Rita sequence, so the captured text cannot contain either word. You may add word boundaries \b if you need to check for whole words.
To get Ms without spaces on both ends, use this variation:
Hello\s*((?:(?!Hello|Rita).)*?)\s*Rita
Adding ? after * forms the lazy quantifier *?, which matches as few characters as needed to find a match, and \s* matches zero or more whitespace characters.
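As a minimal sketch of how the trimmed variation could be used from C#, the snippet below runs it against the sample string from the question. The variable names are illustrative, and the top-level statement style assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample string from the question.
string input = "Hello Mr John and Hello Ms Rita";

// Tempered greedy token: each consumed character must not start "Hello" or "Rita".
// \s* on both sides keeps the surrounding whitespace out of capture group 1.
string pattern = @"Hello\s*((?:(?!Hello|Rita).)*?)\s*Rita";

Match m = Regex.Match(input, pattern);
string between = m.Success ? m.Groups[1].Value : "";

Console.WriteLine(between); // Ms
```

The attempt that starts at the first Hello fails because the token refuses to step over the second Hello, so the engine moves on and succeeds from the second one.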

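To get the match closest to the ending word, put a greedy .* in front of the initial word and let it consume as much as possible, so that Hello matches at its last occurrence before Rita. The text you want is then in capture group 1.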
.*Hello(.*?)Rita
Or without whitespace in captured: .*Hello\s*(.*?)\s*Rita
Or with use of two capture groups: .*(Hello\s*(.*?)\s*Rita)
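A short sketch of the greedy-prefix idea in C#, using the whitespace-trimming pattern above on the question's sample string (variable names are illustrative; top-level statements assume C# 9 or later). Because the leading .* is part of the overall match, the wanted text has to be read from capture group 1 rather than from the match value.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Hello Mr John and Hello Ms Rita";

// The greedy .* first runs to the end of the string and then backtracks,
// so "Hello" is matched at its last position that is still followed by "Rita".
Match m = Regex.Match(input, @".*Hello\s*(.*?)\s*Rita");

string whole = m.Value;            // the entire input, including the prefix
string between = m.Groups[1].Value;

Console.WriteLine(between); // Ms
```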

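Your (.*?) picks up too much text because the regex engine starts matching at the first "Hello" it finds, and the lazy .*? then expands only as far as needed to reach "Rita". Laziness shortens the match on the right, not on the left, so it grabs everything from the first "Hello" to "Rita" at the end.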
One easy way you could get what you want is with this regular expression:
Hello (\S+) Rita
\S matches any non-whitespace character, so \S+ matches any consecutive string of non-whitespace characters, i.e. a single word.
This would be a bit more robust, allowing for multiple spaces or other whitespace between the words:
Hello\s+(\S+)\s+Rita

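You can use a lookbehind and a lookahead, so that the match itself is the text in between: (?<=Hello).*?(?=Rita). Note that on its own this still starts right after the first Hello, so for the sample string it has the same problem as the original regex.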
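The sketch below compares the lookaround pattern as posted with a variant that adds a tempered token, (?:(?!Hello).)*?, so the match cannot run across a second Hello. The variant is an assumption of this example, not part of the answer above; variable names are illustrative and top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Hello Mr John and Hello Ms Rita";

// As posted: the match begins right after the FIRST "Hello".
string plain = Regex.Match(input, @"(?<=Hello).*?(?=Rita)").Value;

// Variant (assumption): no consumed character may start another "Hello",
// so only the text after the last "Hello" before "Rita" can match.
string tempered = Regex.Match(input, @"(?<=Hello)(?:(?!Hello).)*?(?=Rita)").Value.Trim();

Console.WriteLine($"[{plain}]");    // [ Mr John and Hello Ms ]
Console.WriteLine($"[{tempered}]"); // [Ms]
```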

Related

How to match exactly one or more characters inside boundary

Currently I am using this pattern: [HelloWorld]{1,}.
So if my input is Hello, it matches.
But if my input is WorldHello, it still matches, which is not right.
So how do I make the input string match exactly the value inside the pattern?
Just get rid of the square brackets and the comma, and you're good to go!
HelloWorld{1}
In regex what's between square brackets is a character set.
So [HelloWorld] matches 1 character that's in the set [edlorHW].
And .{1,} or .+ both match 1 or more characters.
What you probably want is the literal word.
So the regex would simply be "HelloWorld".
That would match HelloWorld in the string "blaHelloWorldbla".
What if you want the word to be a single word, and not part of another word?
Then you could use word boundaries \b, which mark the transition between a word character (\w, roughly [A-Za-z0-9_]) and a non-word character (\W), or between a word character and the start or end of the string.
For example, @"\bHelloWorld\b" gets a match from "bla HelloWorld bla" but not from "blaHelloWorldbla".
Note that the regex string this time was preceded by @.
That makes it a verbatim string, in which the backslashes don't have to be escaped with another backslash.
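To make the difference between the character set, the literal word, and the word-boundary version concrete, here is a small sketch using Regex.IsMatch. The anchored pattern ^HelloWorld$ at the end is an addition of this example, on the assumption that "match exactly" means the whole input must equal the word; top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Character set: one or more characters out of {H, e, l, o, W, r, d}, in any order.
bool setMatch = Regex.IsMatch("WorldHello", @"[HelloWorld]{1,}");            // true

// Literal word: the characters must appear in exactly this order.
bool literalMatch = Regex.IsMatch("WorldHello", @"HelloWorld");              // false
bool embedded = Regex.IsMatch("blaHelloWorldbla", @"HelloWorld");            // true

// Word boundaries: the word must not be glued to other word characters.
bool boundedEmbedded = Regex.IsMatch("blaHelloWorldbla", @"\bHelloWorld\b"); // false
bool boundedAlone = Regex.IsMatch("bla HelloWorld bla", @"\bHelloWorld\b");  // true

// Assumption: anchoring with ^ and $ requires the whole input to be the word.
bool exact = Regex.IsMatch("HelloWorld", @"^HelloWorld$");                   // true
bool exactFail = Regex.IsMatch("bla HelloWorld bla", @"^HelloWorld$");       // false

Console.WriteLine($"{setMatch} {literalMatch} {embedded} {boundedEmbedded} {boundedAlone} {exact} {exactFail}");
```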
It seems you should check your pattern with an online regex tester, and you could also study the C# regex reference.
Try this pattern:
[a-zA-Z]{1,}

Regex to return the word before the match

I've been trying to extract the word before the match. For example, I have the following sentence:
"Allatoona was a town located in extreme southeastern Bartow County, Georgia."
I want to extract the word before "County", which here is "Bartow".
I've tried the following regex to extract that word:
\w\sCounty,
What I get returned is "w County" when what I wanted is just the word Bartow.
Any assistance would be greatly appreciated. Thanks!
You can use this regex with a lookahead to find word before County:
\w+(?=\s+County)
(?=\s+County) is a positive lookahead that asserts presence of 1 or more whitespaces followed by word County ahead of current match.
If you want to avoid lookahead then you can use a capture group:
(\w+)\s+County
and extract captured group #1 from match result.
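Both variants can be tried side by side in C#. The following is only a sketch on the sentence from the question, with illustrative variable names and top-level statements (C# 9 or later assumed).

```csharp
using System;
using System.Text.RegularExpressions;

string sentence = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";

// Lookahead: "County" is checked but not consumed, so the match value is just the word.
string viaLookahead = Regex.Match(sentence, @"\w+(?=\s+County)").Value;

// Capture group: the match covers "Bartow County"; group 1 holds only the word.
Match m = Regex.Match(sentence, @"(\w+)\s+County");
string viaGroup = m.Groups[1].Value;

Console.WriteLine(viaLookahead); // Bartow
Console.WriteLine(viaGroup);     // Bartow
```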
Your \w\sCounty, regex returns w County because \w matches a single character that is either a letter, digit, or _. It does not match a whole word.
To match 1 or more symbols, you need to use a + quantifier and to capture the part you need to extract you can rely on capturing groups, (...).
So, you can fix your pattern by merely replacing \w with (\w+) and then, after getting a match, accessing Match.Groups[1].Value.
However, if the county name contains a non-word symbol, like a hyphen, \w+ won't match it. A \S+ matching 1 or more non-whitespace symbols might turn out a better option in that case.
See a C# demo:
var s = "Allatoona was a town located in extreme southeastern Bartow County, Georgia.";
var m = Regex.Match(s, @"(\S+)\s+County");
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value); // Bartow
}
You can use this regex to find the word before County:
([\w]*.?\s+).?County
[\w]* matches any number of word characters.
.? allows for an optional special character in the sentence, such as , . or !
\s+ matches the blank spaces (it also works if there is a double blank space in the sentence).
.? before County allows for a special character placed there.
If you want to find more than one word, add {n} after the group, like this: ([\w]*.?\s+){3}.?County

Extract only numbers from text

I'm trying to extract only numbers from a string/text. Below is the regex pattern I'm using.
Regex regex = new Regex(@"[\d+]\S+");
string extract_from = " 12 abcd 1-2-3a a123z 1.2.3.4 xyz";
From the string "extract_from" above, the regex is extracting the numbers
12
1-2-3a
123z
1.2.3.4
The regex extracts correctly except for the second and third results, "1-2-3a" and "123z", which should not be extracted because they contain letters. What can I add to the pattern so that numbers with letters mixed in are not extracted?
Dashes and dots are OK, just not letters.
Here, change \S in the regex to \s (note the capitalization).
\S matches anything but whitespace; \s matches whitespace.
Regex regex = new Regex(@"[\d+]\s+");
Try this one:
[0-9\-.]+\s+
That will allow expressions with more than one decimal, and dashes inside them, vs just at the beginning.
You can use regexhero.net or www.regexplanet.com to test your regex expressions, they're very powerful tools.
Output from your given input would be the following matches:
12
1.2.3.4
Edit, based on comment from OP
This regex shouldn't require a space at the beginning. If you need to match a number at the end of the line, it's probably simplest to just add a special case for it:
[0-9\-.]+\s|[0-9\-.]+$
Use this pattern to catch anything that contains no letters:
(?!\S*[a-zA-Z])\b([^a-zA-Z\s]+)\b
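As a rough sketch, this is how the lookahead pattern above could be applied to the string from the question with Regex.Matches. The variable names are illustrative and top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string extractFrom = " 12 abcd 1-2-3a a123z 1.2.3.4 xyz";

// The negative lookahead rejects any start position from which the remaining
// run of non-whitespace characters contains a letter; group 1 then takes the
// letter-free token between two word boundaries.
var regex = new Regex(@"(?!\S*[a-zA-Z])\b([^a-zA-Z\s]+)\b");

var numbers = new List<string>();
foreach (Match m in regex.Matches(extractFrom))
{
    numbers.Add(m.Groups[1].Value);
}

Console.WriteLine(string.Join(", ", numbers)); // 12, 1.2.3.4
```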

Regex - Get matches of #[SomeText] in a string

I want to get all matches of #[SomeText] pattern in a string.
For example, for this string:
here is #[text1] some text #[text2]
I want #[text1] and #[text2].
I'm using Regex Hero to check my pattern matching online,
and my pattern works fine when there's one expression to match,
For example:
here is #[text1] text
but with more than one, I get both matches with the text in the middle.
This is my regex:
#\[.*\]
I would appreciate assistance in isolating the occurrences.
The problem here is that you are using a greedy quantifier (*). To capture every occurrence, use the lazy quantifier (*?) with a global modifier (in C#, Regex.Matches already returns all matches, so no modifier is needed):
/(#\[.*?\])/g
Take a look here https://regex101.com/r/pH0gA5/1
This should work :
#\[(.*?)\]
Details :
(.*?) : match everything in a non-greedy way and capture it.
Because the *? quantifier is lazy (non-greedy), it matches as few characters as possible to allow the overall match attempt to succeed, i.e. text1. For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.
.* is greedy by default, so it finds only one match, treating "text1] some text #[text2" as the text between the two square brackets.
If you add a question mark after the .*, it will match the minimum number of characters before reaching a ].
So the regex #\[.*?\] does what you want.
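The greedy and lazy behaviours can be compared directly in C#. The sketch below uses Regex.Matches on the example string from the question; variable names are illustrative and top-level statements assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string text = "here is #[text1] some text #[text2]";

// Greedy: .* runs to the last ], so both tags end up in a single match.
MatchCollection greedy = Regex.Matches(text, @"#\[.*\]");

// Lazy: .*? stops at the first ], giving one match per tag.
MatchCollection lazy = Regex.Matches(text, @"#\[(.*?)\]");

var tags = new List<string>();
foreach (Match m in lazy)
{
    tags.Add(m.Value);                                     // e.g. #[text1]
    Console.WriteLine($"{m.Value} -> {m.Groups[1].Value}"); // inner text in group 1
}

Console.WriteLine($"greedy: {greedy.Count}, lazy: {lazy.Count}"); // greedy: 1, lazy: 2
```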

Regular expression to match exactly the start of a string

I'm trying to build a regular expression in C# to check whether a string follows a specific format.
The format I want is: [digits][white space][dot][letters]
For example:
123 .abc follows the format
12345 .def follows the format
123 abc does not follow the format
I wrote this expression, but it does not work completely:
Regex.IsMatch(exampleString, @"^\d+ .")
^ matches the start of the string, and you got it right.
\d+ matches one or more digits, and you got that one right as well.
A space in a regex matches a literal space, so that works too!
However, a . is a wildcard and will match any one character. You will need to escape it with a backslash like this if you want to match a literal period: \..
To match letters now, you can use [a-z]+ right after the period.
@"^\d+ \.[a-z]+"
The dot is a special character in regex, which matches any character (except, typically, newlines). To match a literal ., you need to escape it:
Regex.IsMatch(exampleString, @"^\d+ \.")
If you want to include the condition for the succeeding letters, use:
Regex.IsMatch(exampleString, @"^\d+ \.[A-Za-z]+$")
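A quick check of the fully anchored pattern against the three examples from the question might look like this. It is a sketch with illustrative variable names; top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Digits, one space, a literal dot, then letters up to the end of the string.
string pattern = @"^\d+ \.[A-Za-z]+$";

bool first = Regex.IsMatch("123 .abc", pattern);    // true
bool second = Regex.IsMatch("12345 .def", pattern); // true
bool third = Regex.IsMatch("123 abc", pattern);     // false

// The original pattern accepts the bad input because the unescaped . matches 'a'.
bool originalAcceptsBad = Regex.IsMatch("123 abc", @"^\d+ .");

Console.WriteLine($"{first} {second} {third} {originalAcceptsBad}");
```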
To get yours to match, keep in mind that the period in a regular expression is a special character that matches any character, so you'll need to escape it.
In addition, \s matches any white-space character (spaces, tabs, line breaks).
^\d+\s+\..+
(untested)
