Split string into sentences using regular expression

I need to match a string like "one. two. three. four. five. six. seven. eight. nine. ten. eleven" into groups of four sentences. I need a regular expression to break the string into a group after every fourth period. Something like:
string regex = @"(.*.\s){4}";
System.Text.RegularExpressions.Regex exp = new System.Text.RegularExpressions.Regex(regex);
string result = exp.Replace(toTest, ".\n");
doesn't work because it will replace the text before the periods, not just the periods themselves. How can I count just the periods and replace them with a period and new line character?

. in a regex means "any character",
so the .*. in your regex matches any run of one or more characters (it is equivalent to .+).
You were probably looking for [^.]*[.] - a series of characters that are not "."s, followed by a ".".

Try defining the method
private string AppendNewLineToMatch(Match match) {
return match.Value + Environment.NewLine;
}
and using
string result = exp.Replace(toTest, AppendNewLineToMatch);
This should call the method for each match, and replace it with that method's result. The method's result would be the matching text and a newline.
EDIT: Also, I agree with Oliver. The correct regex definition should be:
string regex = @"([^.]*[.]\s*){4}";
Another edit: Fixed the regex, hopefully I got it right this time.

Are you forced to do this via regex? Wouldn't it be easier to just split the string then process the array?

I'm not sure if configurator's answer got mangled by the editor or what, but it doesn't work.
The correct pattern is
string regex = @"([^.]*[.]){4}\s*";
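As a minimal sketch of how this pattern could be combined with the MatchEvaluator idea from the earlier answer: the class name SentenceGrouper and the method BreakEveryFourth are hypothetical, the group is made non-capturing, and trimming the trailing whitespace before adding the line break is an assumption about the desired output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceGrouper
{
    // Four repetitions of "anything up to and including a period",
    // followed by the whitespace that separates one group from the next.
    private static readonly Regex FourSentences =
        new Regex(@"(?:[^.]*[.]){4}\s*");

    // Hypothetical helper: puts a line break after every fourth period.
    public static string BreakEveryFourth(string text)
    {
        // The evaluator keeps the matched sentences, drops the trailing
        // whitespace, and appends the line break.
        return FourSentences.Replace(text, m => m.Value.TrimEnd() + "\n");
    }
}
```

Text after the last complete group of four (here "nine. ten. eleven") contains fewer than four periods, so it is not matched and stays as it is.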

Search expression: @"(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)(?:([^\.]+?).\s)"
Replace expression: "$1 $2 $3 $4.\n"
I ran this expression in RegexBuddy with .NET regex selected, and the output is:
one two three four.
five six seven eight.
nine. ten. eleven
I tried with a @"(?:([^.]+?).\s){4}" type of arrangement, but a repeated capturing group only keeps its last occurrence (i.e. the last word), so when it comes to replacing, you will lose three words out of four. Please, someone correct me if I am wrong.

In this case it would seem that regex is a bit of overkill. I would recommend using String.split and then breaking up the resulting array of strings. It should be far simpler and far more reliable than trying to make a regex do what you're trying to do.
Something like this might be a bit easier to read and debug.
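import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

String s = "one. two. three. four. five. six. seven. eight. nine. ten. eleven";
// split() takes a regex, so the period has to be escaped
String[] splitString = s.split("\\.\\s*");
List<String> li = new ArrayList<>();
for (int i = 0; i < splitString.length; i += 4) {
    // the last group may hold fewer than four sentences
    int end = Math.min(i + 4, splitString.length);
    li.add(String.join(". ", Arrays.copyOfRange(splitString, i, end)) + ".");
}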
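Since the question is about C#, the same split-and-regroup idea might look roughly like this there. This is only a sketch: SentenceSplitter and Group are hypothetical names, and it assumes sentences are separated by a period followed by a single space.

```csharp
using System;
using System.Collections.Generic;

public static class SentenceSplitter
{
    // Hypothetical helper: returns groups of up to `size` sentences.
    public static List<string> Group(string text, int size)
    {
        // Assumption: sentences are separated by ". " (period plus one space).
        string[] sentences = text.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries);
        var groups = new List<string>();
        for (int i = 0; i < sentences.Length; i += size)
        {
            // The last group may hold fewer than `size` sentences.
            int count = Math.Min(size, sentences.Length - i);
            string group = string.Join(". ", sentences, i, count);
            // Split removed the period after the group's last sentence;
            // restore it unless this is the end of the input.
            if (i + count < sentences.Length) group += ".";
            groups.Add(group);
        }
        return groups;
    }
}
```

Calling SentenceSplitter.Group(text, 4) on the sample string gives three groups, the last one holding the three leftover sentences.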

Related

Regular Expression - Remove zeroes inside an expression

I need to remove leading zeroes from the numerical part of an expression (using .net 2.0 C# Regex class).
Ex:
PAR0000034 -> PAR34
WP0003204 -> WP3204
I tried the following:
//keep starting characters, get rid of leading zeroes, keep remaining digits
string result = Regex.Replace(inputStr, "^(.+)(0+)(/d*)", "$1$3", RegexOptions.IgnoreCase)
Obviously, it did not work. I need a bit of help to find the mistake.

You don't need a regular expression for that, the Split method can do that for you.
Splitting on '0', removing empty entries (i.e. between the multiple zeroes), and limiting the result to two strings will give you the two strings before and after the leading zeroes. Then you just put those two strings together again:
string result = String.Concat(
input.Split(new char[] { '0' }, 2, StringSplitOptions.RemoveEmptyEntries)
);

In your expression the .+ part is greedy, so it captures the whole string. Also,
use a backslash instead of a slash for the digit class: \d
string result = Regex.Replace(inputStr, @"^([^0]+)(0+)(\d*)", "$1$3");
Or use look behind instead:
string result = Regex.Replace(inputStr, "(?<=[a-zA-Z])0+", "");

This works for me:
Regex.Replace("PPP00001001", "([^0]*)0+(.*)", "$1$2");

The phrase "leading zeroes" is confusing, since the zeroes you're talking about aren't actually at the beginning of the string. But if I understand you correctly, you want this:
string result = Regex.Replace(inputStr, "^(.*?)0+", "$1");
There are actually several ways to do it, with and without regex, but the above is probably the shortest and easiest to understand. The important part is the .*? lazy quantifier. This will ensure that it a) finds only the first string of zeroes, and b) deletes all the "leading" zeroes in the string.
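A small self-contained sketch of this kind of replacement follows. It assumes, based on the two examples in the question, that the prefix consists only of ASCII letters; ZeroStripper and Strip are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ZeroStripper
{
    // Group 1: the letters at the start (kept).
    // 0+     : the zeroes that follow them (dropped).
    // Group 2: the remaining digits (kept).
    private static readonly Regex Pattern =
        new Regex(@"^([A-Za-z]+)0+(\d*)$");

    public static string Strip(string code)
    {
        // Inputs that do not fit the pattern are returned unchanged.
        return Pattern.Replace(code, "${1}${2}");
    }
}
```

Because the zero run is greedy and the digit group is allowed to be empty, zeroes inside the number (as in WP0003204) are kept while the ones directly after the letters are removed.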

Regex to isolate a specific substring

I have this string I have retrieved from a File.ReadAllText:
6 11 rows processed
As you can see there is always an integer specifying the line number in this document. What I am interested in is the integer that comes after it and the words "rows processed". So in this case I am only interested in the substring "11 rows processed".
So, knowing that each line will start with an integer and then some white space, I need to be able to isolate the integer that follows it and the words "rows processed" and return that to a string by itself.
I have been told this is easy to do with Regex, but so far I haven't the faintest clue how to build it.

You don't need regular expressions for this. Just split on the whitespace:
var fields = s.Split(new char[0], StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(String.Join(" ", fields.Skip(1)));
Here, I am using the fact that if you pass an empty array as the char [] parameter to String.Split, it splits on all whitespace.

This should work for what you need:
\d+(.*)
This searches for one or more digits (\d+) and then puts everything afterwards in a group:
. = any character
* = repeater (zero or more of the preceding value, which is any character in the pattern above)
() = grouping
However, Jason is correct in that you only need to use a split function.

If you need to use a Regex it would be like this:
string result = null;
Match match = Regex.Match(row, @"^\s*\d+\s*(.*)");
if (match.Success)
result = match.Groups[1].Value;
The regex matches from the start of the row: first spaces, if any, then digits, and then more spaces. Finally it captures the rest of the line, which is returned as the result.

This is done easily with Regex.Replace() using the following regular expression...
^\d+\s+
So it'd be something like this:
return Regex.Replace(text, @"^\d+\s+", "");
Basically you're just trimming the first number \d and the whitespace \s that follows.
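Because the text in the question comes from File.ReadAllText, it may contain many such lines. One possible sketch for that case uses RegexOptions.Multiline so that ^ anchors at every line start. RowCounts and Extract are hypothetical names, and the sketch assumes every line of interest has the shape "<line number> <count> rows processed".

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowCounts
{
    // With RegexOptions.Multiline, ^ matches at the start of every line.
    // The leading line number is matched but not captured.
    private static readonly Regex Pattern =
        new Regex(@"^\s*\d+\s+(\d+ rows processed)", RegexOptions.Multiline);

    // Returns the "<count> rows processed" part of every matching line.
    public static List<string> Extract(string fileText)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(fileText))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Matching the literal words instead of .* avoids picking up a trailing carriage return when the file uses Windows line endings.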

Example in PHP (the C# regex should be compatible):
$line = "6 11 rows processed";
$resp = preg_match("/[0-9]+\s+(.*)/",$line,$out);
echo $out[1];
I hope I understood your point.

Regex which ensures no character is repeated

I need to ensure that an input string follows these rules:
It should contain upper case characters only.
NO character should be repeated in the string.
eg. ABCA is not valid because 'A' is being repeated.
For the upper case thing, [A-Z] should be fine.
But I am lost on how to ensure there are no repeated characters.
Can someone suggest a method using regular expressions?

You can do this with .NET regular expressions although I would advise against it:
string s = "ABCD";
bool result = Regex.IsMatch(s, @"^(?:([A-Z])(?!.*\1))*$");
Instead I'd advise checking that the length of the string is the same as the number of distinct characters, and checking the A-Z requirement separately:
bool result = s.Cast<char>().Distinct().Count() == s.Length;
Alternatively, if performance is a critical issue, iterate over the characters one by one and keep a record of which you have seen.
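The character-by-character check could be sketched like this. UniqueUpper and IsValid are hypothetical names, and rejecting the empty string is an assumption, since the question does not say how it should be treated.

```csharp
using System.Collections.Generic;

public static class UniqueUpper
{
    // True when every character is in A-Z and none occurs twice.
    public static bool IsValid(string s)
    {
        var seen = new HashSet<char>();
        foreach (char c in s)
        {
            if (c < 'A' || c > 'Z') return false; // not an upper case letter
            if (!seen.Add(c)) return false;       // Add returns false for a duplicate
        }
        // Assumption: an empty string is not considered valid.
        return s.Length > 0;
    }
}
```

The loop stops at the first offending character, so it never looks at more than 27 characters of any input.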

This cannot be done via regular expressions, because they are context-free. You need at least a context-sensitive grammar, so the only way to achieve this is to write the function by hand.
See formal grammar for background theory.

Why not check for a character which is repeated or not in uppercase instead ? With something like ([A-Z])?.*?([^A-Z]|\1)

Use negative lookahead and backreference.
string pattern = @"^(?!.*(.).*\1)[A-Z]+$";
string s1 = "ABCDEF";
string s2 = "ABCDAEF";
string s3 = "ABCDEBF";
Console.WriteLine(Regex.IsMatch(s1, pattern));//True
Console.WriteLine(Regex.IsMatch(s2, pattern));//False
Console.WriteLine(Regex.IsMatch(s3, pattern));//False
\1 matches the first captured group. Thus the negative lookahead fails if any character is repeated.

This isn't regex, and it would be slow, but you could create an array of the contents of the string, and then iterate through the array comparing n to n++
=Waldo

It can be done using what is called a backreference.
I am a Java programmer, so I will show you how it is done in Java (for C#, see here).
final Pattern aPattern = Pattern.compile("([A-Z]).*\\1");
final Matcher aMatcher1 = aPattern.matcher("ABCDA");
System.out.println(aMatcher1.find());
final Matcher aMatcher2 = aPattern.matcher("ABCDA");
System.out.println(aMatcher2.find());
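The regular expression is ([A-Z]).*\\1, which translates to: any character from 'A' to 'Z' as group 1 (([A-Z])), then anything else (.*), then group 1 again (\\1).
In C# the backreference inside the pattern is also written \1 (for example @"([A-Z]).*\1"); $1 is used only in replacement strings.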
Hope this helps.

Regex: I want this AND that AND that... in any order

I'm not even sure if this is possible or not, but here's what I'd like.
String: "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870"
I have a text box where I type in the search parameters, and they are space delimited. Because of this, I want to return a match if string1 is in the string and then string2 is in the string, OR string2 is in the string and then string1 is in the string. I don't care what order the strings are in, but they ALL (there will sometimes be more than 2) have to be in the string.
So for instance, in the provided string I would want:
"FEB Low"
or
"Low FEB"
...to return as a match.
I'm REALLY new to regex; I only read some tutorials on here, but that was a while ago and I need to get this done today. Monday I start a new project which is much more important, and I can't be distracted by this issue. Is there any way to do this with regular expressions, or do I have to iterate through each part of the search filter and permute the order? Any and all help is extremely appreciated. Thanks.
UPDATE:
The reason I don't want to iterate through a loop and am looking for the best performance wise is because unfortunately, the dataTable I'm using calls this function on every key press, and I don't want it to bog down.
UPDATE:
Thank you everyone for your help, it was much appreciated.
CODE UPDATE:
Ultimately, this is what I went with.
string sSearch = nvc["sSearch"].ToString().Replace(" ", ")(?=.*");
if (sSearch != null && sSearch != "")
{
Regex r = new Regex("^(?=.*" + sSearch + ").*$", RegexOptions.IgnoreCase);
_AdminList = _AdminList.Where<IPB>(
delegate(IPB ipb)
{
//Concatenated all elements of IPB into a string
bool returnValue = r.IsMatch(strTest); //strTest is the concatenated string
return returnValue;
}).ToList<IPB>();
}
}
The IPB class has X number of elements, and in no one table throughout the site I'm working on are the columns in the same order. Therefore, I needed an any-order search and I didn't want to have to write a lot of code to do it. There were other good ideas in here, but I know my boss really likes Regex (preaches them) and therefore I thought it'd be best if I went with that for now. If for whatever reason the site's performance slips (intranet site) then I'll try another way. Thanks everyone.

You can use (?=…) positive lookahead; it asserts that a given pattern can be matched. You'd anchor at the beginning of the string, and one by one, in any order, look for a match of each of your patterns.
It'll look something like this:
^(?=.*one)(?=.*two)(?=.*three).*$
This will match a string that contains "one", "two", "three", in any order (as seen on rubular.com).
Depending on the context, you may want to anchor on \A and \Z, and use single-line mode so the dot matches everything.
This is not the most efficient solution to the problem. The best solution would be to parse out the words in your input and put them into an efficient set representation, etc.
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
More practical example: password validation
Let's say that we want our password to:
Contain between 8 and 15 characters
Must contain an uppercase letter
Must contain a lowercase letter
Must contain a digit
Must contain one of special symbols
Then we can write a regex like this:
^(?=.{8,15}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).*$
 \__________/\_________/\_________/\_________/\______________/
    length      upper      lower      digit        symbol
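For the search-box case in the question, the lookahead pattern could be assembled from the space-delimited terms roughly as follows. This is a sketch, not the asker's final code: AllTermsSearch and Build are hypothetical names, and passing each term through Regex.Escape is an added precaution so that characters such as "." or "(" typed into the box are treated literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AllTermsSearch
{
    // Hypothetical helper: builds ^(?=.*term1)(?=.*term2)... from a
    // space-delimited search box value.
    public static Regex Build(string search)
    {
        var lookaheads = search
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Escape each term so regex metacharacters are matched literally.
            .Select(term => "(?=.*" + Regex.Escape(term) + ")");

        // Singleline lets the dot cross line breaks; IgnoreCase mirrors
        // the case-insensitive search described in the question.
        return new Regex("^" + string.Concat(lookaheads),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Build("FEB Low") and Build("Low FEB") produce different patterns that accept the same strings, since each lookahead is checked independently from the start of the input. An empty search box yields the pattern ^, which matches everything.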

Why not just do a simple check for the text since order doesn't matter?
string test = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
test = test.ToUpper();
bool match = ((test.IndexOf("FEB") >= 0) && (test.IndexOf("LOW") >= 0));
Do you need it to use regex?

I think the most expedient thing for today will be to string.Split(' ') the search terms and then iterate over the results confirming that sourceString.Contains(searchTerm)
var source = @"NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870".ToLowerInvariant();
var search = "FEB Low";
var terms = search.Split(' ');
bool all_match = !terms.Any(term => !(source.Contains(term.ToLowerInvariant())));
Notice that we use Any() to set up a short-circuit, so if the first term fails to match, we skip checking the second, third, and so forth.
This is not a great use case for RegEx. The string manipulation necessary to take an arbitrary number of search strings and convert that into a pattern almost certainly negates the performance benefit of matching the pattern with the RegEx engine, though this may vary depending on what you're matching against.
You've indicated in some comments that you want to avoid a loop, but RegEx is not a one-pass solution. It is not hard to create horrifically non-performant searches that loop and step character by character, such as the infamous catastrophic backtracking, where a very simple match takes thousands of steps to return false.

The answer by @polygenelubricants is both complete and perfect, but I had a case where I wanted to match a date and something else, e.g. a 10-digit number. The lookahead did not work for that case and I could not do it with lookaheads alone, so I used named groups:
(?:.*(?P<1>[0-9]{10}).*(?P<2>2[0-9]{3}-(?:0?[0-9]|1[0-2])-(?:[0-2]?[0-9]|3[0-1])).*)+
and this way the number is always group 1 and the date is always group 2. Of course it has a few flaws but it was very useful for me and I just thought I should share it! ( take a look https://www.debuggex.com/r/YULCcpn8XtysHfmE )

var text = @"NS306Low FEBRUARY 2FEB0078/9/201013B1-9-1Low31 AUGUST 19870";
var matches = Regex.Matches(text, @"(FEB)|(Low)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
output:
Low
FEB
FEB
Low
should get you started

You don't have to test each permutation, just split your search into multiple parts "FEB" and "Low" and make sure each part matches. That will be far easier than trying to come up with a regex which matches the whole thing in one go (which I'm sure is theoretically possible, but probably not practical in reality).

Use string.Split(). It will return an array of substrings that are delimited by a specified string/char. The code will look something like this.
int maximumSize = 100;
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string[] individualString = myString.Split(new char[] { ' ' }, maximumSize);
For more information
http://msdn.microsoft.com/en-us/library/system.string.split.aspx
Edit:
If you really wanted to use Regular Expressions this pattern will work.
[^ ]+
And you will just use Regex.Matches();
The code will be something like this:
string myString = "NS306 FEBRUARY 20078/9/201013B1-9-1Low31 AUGUST 19870";
string pattern = "[^ ]+";
Regex rgx = new Regex(pattern);
foreach(Match match in rgx.Matches(myString))
{
//do stuff with match.value
}

Replacing numbers in strings with C#

I thought I'd do a regex replace
Regex r = new Regex("[0-9]");
return r.Replace(sz, "#");
on a file named aa514a3a.4s5. It works exactly as I expect: it replaces all the numbers, including the numbers in the extension. How do I make it NOT replace the numbers in the extension? I tried numerous regex strings, but I am beginning to think that it's an all-or-nothing pattern, so I can't do this. Do I need to separate the extension from the string, or can I use regex?

This one does it for me:
(?<!\.[0-9a-z]*)[0-9]
This does a negative lookbehind (the string must not occur before the matched string) on a period, followed by zero or more alphanumeric characters. This ensures only numbers are matched that are not in your extension.
Obviously, the [0-9a-z] must be replaced by whichever characters you expect in your extension.

I don't think you can do that with a single regular expression.
Probably best to split the original string into base and extension; do the replace on the base; then join them back up.
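A minimal sketch of the split, replace, and rejoin approach, assuming the input is a bare file name without a directory part. DigitMasker and MaskBaseName are hypothetical names; Path.GetFileNameWithoutExtension and Path.GetExtension come from System.IO.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class DigitMasker
{
    // Replaces digits with '#' in the base name only.
    // Assumption: fileName has no directory part.
    public static string MaskBaseName(string fileName)
    {
        string baseName = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName); // includes the leading "."
        return Regex.Replace(baseName, "[0-9]", "#") + extension;
    }
}
```

For the file name in the question, DigitMasker.MaskBaseName("aa514a3a.4s5") yields "aa###a#a.4s5". A name without an extension has all of its digits replaced.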

Yes, I think you'd be better off separating the extension.
If you are sure there is always a 3-character extension at the end of your string, the easiest, most readable/maintainable solution would be to only perform the replace on
yourString.Substring(0, yourString.Length-4)
..and then append
yourString.Substring(yourString.Length-4, 4)

Why not run the regex on the substring?
String filename = "aa514a3a.4s5";
String nameonly = filename.Substring(0,filename.Length-4);
