I have a string in the following format:
"ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00";
What I am trying to do is find the next group, where a group starts at a pipe that is not followed by a -.
So the above string splits into 3 sections:
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
I played around with the following code, but it doesn't seem to do anything: it does not give me the position of the next block, i.e. the next pipe character that is not followed by a dash (-).
String pattern = @"^+|[A-Z][A-Z][A-Z]$";
My logic in the above is:
1:Start from the beginning
2:Find a pipe character which is not followed by a dash char
3:Return its position
4:Which I will eventually use to substring the blocks
5:And do this till the end of the string
Pls be kind as I have no idea how regex works, I am just making an attempt to use it. Thanks, language is C#
You can use the Regex.Split method with a pattern of \|(?!-).
Notice that you need to escape the | character, since it is a regex metacharacter used for alternation. The (?!-) is a negative look-ahead that rejects the match when the | is immediately followed by a dash.
var pattern = @"\|(?!-)";
var results = Regex.Split(input, pattern);
foreach (var match in results) {
    Console.WriteLine(match);
}
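Since the question asks for the positions of the delimiters, here is a minimal sketch that collects the index of every pipe not followed by a dash, using Regex.Matches with the same pattern. The class and method names (PipePositions, Find) are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PipePositions
{
    // Returns the index of every '|' that is not immediately followed by '-'.
    public static List<int> Find(string input)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(input, @"\|(?!-)"))
        {
            positions.Add(m.Index);
        }
        return positions;
    }

    public static void Main()
    {
        var input = "ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00";
        int start = 0;
        foreach (int pos in Find(input))
        {
            // Each block runs from the previous delimiter up to (not including) this one.
            Console.WriteLine(input.Substring(start, pos - start));
            start = pos + 1;
        }
        // The final block has no trailing pipe.
        Console.WriteLine(input.Substring(start));
    }
}
```

If only the blocks are needed and not their positions, Regex.Split as shown above is the shorter route.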
My Regex logic for this was:
the delimiter is pipe "[|]"
we will gather a series of characters that are not our delimiter
"(" not our delimiter ")" but at least one character "+"
"[^|]" is not our delimiter
"[|][-]" is also not our delimiter
Variable "pattern" could use a "*" instead of "+" if empty segments are acceptable. The pattern ends with a "?" since our final string segment (in your example) does not have a pipe character.
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace ConsoleTest1
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = "ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00";
            var pattern = "([^|]|[|][-])+[|]?";
            Match m;
            m = Regex.Match(input, pattern);
            while (m.Success) {
                Debug.WriteLine(String.Format("Match from {0} for {1} characters", m.Index, m.Length));
                Debug.WriteLine(input.Substring(m.Index, m.Length));
                m = m.NextMatch();
            }
        }
    }
}
Output is:
Match from 0 for 50 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|
Match from 50 for 49 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|
Match from 99 for 49 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
I have a regex code written in C# that basically adds a space between a number and a unit with some exceptions:
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+", @"$1");
dosage_value = Regex.Replace(dosage_value, @"(\d)%\s+", @"$1%");
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d+)?)", @"$1 ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+%", @"$1% ");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+:", @"$1:");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+e", @"$1e");
dosage_value = Regex.Replace(dosage_value, @"(\d)\s+E", @"$1E");
Example:
10ANYUNIT
10:something
10 : something
10 %
40 e-5
40 E-05
should become
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
Exceptions are: %, E, e and :.
I have tried, but since my regex knowledge is not top-notch, would someone be able to help me reduce this code with same expected results?
Thank you!
For your example data, you might use 2 capture groups where the second group is in an optional part.
In the callback of replace, check if capture group 2 exists. If it does, use it in the replacement; otherwise add a space.
(\d+(?:\.\d+)?)(?:\s*([%:eE]))?
( Capture group 1
\d+(?:\.\d+)? match 1+ digits with an optional decimal part
) Close group 1
(?: Non-capturing group, matched as a whole
\s*([%:eE]) Match optional whitespace chars, and capture 1 of % : e E in group 2
)? Close non capture group and make it optional
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] strings = new string[]
{
    "10ANYUNIT",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
string pattern = @"(\d+(?:\.\d+)?)(?:\s*([%:eE]))?";
var result = strings.Select(s =>
    Regex.Replace(
        s, pattern, m =>
            m.Groups[1].Value + (m.Groups[2].Success ? m.Groups[2].Value : " ")
    )
);
Array.ForEach(result.ToArray(), Console.WriteLine);
Output
10 ANYUNIT
10:something
10: something
10%
40e-5
40E-05
In .NET, \d also matches digits from other scripts, \s also matches a newline, and without a leading boundary the pattern can start matching in the middle of a token, so a slightly more precise pattern is:
\b([0-9]+(?:\.[0-9]+)?)(?:[\p{Zs}\t]*([%:eE]))?
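To make the difference between the character classes concrete, here is a small sketch. The Arabic-Indic digit (U+0663) is just one example of a non-ASCII decimal digit, and the class name DigitClassDemo is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    public static void Main()
    {
        string arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

        // By default, \d matches any Unicode decimal digit.
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"\d"));     // True
        // [0-9] is limited to the ASCII digits.
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"[0-9]"));  // False
        // RegexOptions.ECMAScript restricts \d to [0-9].
        Console.WriteLine(Regex.IsMatch(arabicIndicThree, @"\d", RegexOptions.ECMAScript)); // False

        // \s matches a newline; [\p{Zs}\t] only matches space separators and tabs.
        Console.WriteLine(Regex.IsMatch("\n", @"\s"));         // True
        Console.WriteLine(Regex.IsMatch("\n", @"[\p{Zs}\t]")); // False
    }
}
```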
I think you need something like this:
dosage_value = Regex.Replace(dosage_value, @"(\d+(\.\d*)?)\s*((E|e|%|:)+)\s*", @"$1$3 ");
Group 1 - (\d+(\.\d*)?)
Any number like 123 1241.23
Group 3 - ((E|e|%|:)+)
One or more of the special symbols E e % : (this is $3 in the replacement, because the optional decimal part (\.\d*) is Group 2)
Group 1 and Group 3 can be separated by any amount of whitespace.
If this does not work the way you want, please provide some samples to test.
For me this is too complex to handle with a single regex, so I suggest splitting it into separate checks. See the code example below: I used four different regexes. The first is described in detail; the rest can be deduced from that explanation.
using System;
using System.Text.RegularExpressions;
var testStrings = new string[]
{
    "10mg",
    "10:something",
    "10 : something",
    "10 %",
    "40 e-5",
    "40 E-05",
};
foreach (var testString in testStrings)
{
    Console.WriteLine($"Input: '{testString}', parsed: '{RegexReplace(testString)}'");
}
string RegexReplace(string input)
{
    // First look for exponential notation.
    // Pattern is: match zero or more whitespaces \s*
    // Then match one or more digits and store them in the first capturing group (\d+)
    // Then match one or more whitespaces again.
    // Then match the exponent part ([eE][-+]?\d+) and store it in the second capturing group.
    // It matches a lower- or uppercase 'e' with an optional (due to the ? operator) dash/plus sign and one or more digits.
    // Then match zero or more whitespaces.
    var expForMatch = Regex.Match(input, @"\s*(\d+)\s+([eE][-+]?\d+)\s*");
    if (expForMatch.Success)
    {
        return $"{expForMatch.Groups[1].Value}{expForMatch.Groups[2].Value}";
    }
    var matchWithColon = Regex.Match(input, @"\s*(\d+)\s*:\s*(\w+)");
    if (matchWithColon.Success)
    {
        return $"{matchWithColon.Groups[1].Value}:{matchWithColon.Groups[2].Value}";
    }
    var matchWithPercent = Regex.Match(input, @"\s*(\d+)\s*%");
    if (matchWithPercent.Success)
    {
        return $"{matchWithPercent.Groups[1].Value}%";
    }
    var matchWithUnit = Regex.Match(input, @"\s*(\d+)\s*(\w+)");
    if (matchWithUnit.Success)
    {
        return $"{matchWithUnit.Groups[1].Value} {matchWithUnit.Groups[2].Value}";
    }
    return input;
}
Output is:
Input: '10mg', parsed: '10 mg'
Input: '10:something', parsed: '10:something'
Input: '10 : something', parsed: '10:something'
Input: '10 %', parsed: '10%'
Input: '40 e-5', parsed: '40e-5'
Input: '40 E-05', parsed: '40E-05'
C# Regex
I have the following list of strings:
"New patient, brief"
"New patient, limited"
"Established patient, brief"
"Established patient, limited"
"New diet patient"
"Established diet patient"
"School Physical"
"Deposition, 1 hour"
"Deposition, 2 hour"
I would like to separate these strings into groups using regex.
The first pattern I see is:
"New" or "Established" -- will be the first word of the matched pattern. This word will need to be captured and returned. Of this pattern, "patient" must be present without need to capture. Any word after "patient" must be captured.
I've tried: ((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)
but the return match gives:
Full match 0-3 `New`
Group 1. 0-0 ``
Group 2. 0-3 `New`
Not at all what I am looking for.
string input = "New patient, limited";
string pattern = @"((?=.*\bNew\b))(?=.*\bpatient\b)([A-Za-z0-9\-]+)";
MatchCollection matches = Regex.Matches(input, pattern);
GroupCollection groups = matches[0].Groups;
foreach (Match match in matches)
{
    Console.WriteLine("First word: {0}", match.Groups[1].Value);
    Console.WriteLine("Last words: {0}", match.Groups[2].Value);
    Console.WriteLine();
}
Console.WriteLine();
Thank you for any help with this.
Edit #1
For strings like "New patient, limited"
output should be: "New" "limited"
For strings like "Deposition, 1 hour" where "hour" is present,
output should be: "Deposition, 1 hour"
For strings where there are no words after "patient" but "patient" is present, like
"New diet patient",
output should be: "New" "diet"
For strings where neither "patient" nor "hour" is present, the entire string should be returned. i.e like "School Physical" should return the entire string,
"School Physical".
As I said, this is my ultimate quest. At the moment, I am trying to focus on separating out only the first pattern :). Much Thanks.
I suggest using
^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)
Details
^(?:(?!\b(?:New|Established)\b).)*$ - any string that has no New or Established as whole words
| - or
\b(New|Established) - a whole word New or Established (put into Group 1)
\s+ - 1+ whitespaces
(?:patient\b\W*)? - an optional non-capturing group matching 1 or 0 occurrences of patient followed with word boundary and 0+ non-word chars
(.+) - Group 2: any 1 or more chars other than line break chars.
The code will look like
var match = Regex.Match(s, @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)");
If Group 1 is not matched (!match.Groups[1].Success), grab the whole match, match.Value. Else, grab match.Groups[1].Value and match.Groups[2].Value.
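The group check described above can be sketched as follows. The class and method names (VisitTypeParser, Parse) and the choice to return an empty array when nothing matches are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VisitTypeParser
{
    private const string Pattern =
        @"^(?:(?!\b(?:New|Established)\b).)*$|\b(New|Established)\s+(?:patient\b\W*)?(.+)";

    // Returns one element (the whole match) when Group 1 did not participate,
    // otherwise two elements: Group 1 and Group 2.
    public static string[] Parse(string s)
    {
        Match match = Regex.Match(s, Pattern);
        if (!match.Success)
        {
            return new string[0]; // assumption: signal "no match" with an empty array
        }
        if (!match.Groups[1].Success)
        {
            return new[] { match.Value };
        }
        return new[] { match.Groups[1].Value, match.Groups[2].Value };
    }

    public static void Main()
    {
        string[] inputs = { "New patient, limited", "New diet patient", "School Physical", "Deposition, 1 hour" };
        foreach (string s in inputs)
        {
            Console.WriteLine("{0} -> {1}", s, string.Join(" | ", Parse(s)));
        }
    }
}
```

Note that for "New diet patient" Group 2 is "diet patient" rather than "diet", because the optional patient part is only skipped when it directly follows the first word.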
I am trying to match the pattern <two alpha chars> single space <two digits> single space <two digits> and remove all occurrences of it from a string.
var myRegex = @"(?:^|[\s]|[, ]|[.]|[\n]|[\t])([A-Za-z]{2}\s[0-9]{2}\s[0-9]{2})($|[,]|[.]|[\s]|[\n]|[\t])";
string myString = "this 02 34, HU 23 76 , hh 76 745 1.HO 12 33. HO 34 56";
var matches = Regex.Matches(myString, myRegex);
foreach (Match match in matches)
{
    myString = myString.Replace(match.Value, "");
}
In above variable myString "this 02 34" will not match as there is no space or period or comma or new line or tab. This is expected behavior.
But "HO 34 56" is not matched either, because it is not followed by a space, period, comma, newline, or tab. How can I include it in the match while still not matching "hh 76 745"?
After executing above code, I expect myString variable to have "this 02 34, , hh 76 745 1.. "
Use this regex with word boundaries:
\b[A-Za-z]{2}\s[0-9]{2}\s[0-9]{2}\b
Details:
\b - a leading word boundary
[A-Za-z]{2} - 2 alpha
\s - a whitespace
[0-9]{2} - 2 digits
\s - a whitespace
[0-9]{2} - 2 digits
\b - a trailing word boundary.
If you need to say "not preceded with alpha", replace the first \b with (?<![a-zA-Z]), and if you want to say "not followed with digit", replace the last \b with (?!\d). That is, use lookarounds, which, like word boundaries, are zero-width assertions.
If you really want to match that chunk only when it is preceded by the beginning of the string, whitespace, a comma, or a period, and followed by the end of the string, whitespace, a comma, or a period, use
(?<=^|[\s,.])[A-Za-z]{2}\s[0-9]{2}\s[0-9]{2}(?=$|[\s,.])
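As a rough sketch of how the word-boundary pattern could be applied to the string from the question: Regex.Replace removes every occurrence in a single pass, so the loop over String.Replace is not needed. The class and method names (ChunkRemover, FindChunks, RemoveChunks) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChunkRemover
{
    private const string Pattern = @"\b[A-Za-z]{2}\s[0-9]{2}\s[0-9]{2}\b";

    // Lists the chunks the pattern finds, in order of appearance.
    public static string[] FindChunks(string text)
    {
        return Regex.Matches(text, Pattern).Cast<Match>().Select(m => m.Value).ToArray();
    }

    // Removes every chunk; surrounding spaces and punctuation are left untouched.
    public static string RemoveChunks(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }

    public static void Main()
    {
        string myString = "this 02 34, HU 23 76 , hh 76 745 1.HO 12 33. HO 34 56";
        Console.WriteLine(string.Join(" | ", FindChunks(myString)));
        Console.WriteLine("[" + RemoveChunks(myString) + "]");
    }
}
```

Because the boundaries are zero-width, the space before "HU 23 76" is not consumed, so the result keeps both spaces around the removed chunk.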
I need a regex that is to be used for text substitution. Example: text to be matched is ABC (which could be surrounded by square brackets), substitution text is DEF. This is basic enough. The complication is that I don't want to match the ABC text when it is preceded by the pattern \[[\d ]+\]\. - in other words, when it is preceded by a word or set of words in brackets, followed by a period.
Here are some examples of source text to be matched, and the result, after the regex substitution would be made:
1. [xxx xxx].[ABC] > [xxx xxx].[ABC] (does not match - first part fits the pattern)
2. [xxx xxx].ABC > [xxx xxx].ABC (does not match - first part fits the pattern)
3. [xxx.ABC > [xxx.DEF (matches - first part has no closing bracket)
4. [ABC] > [DEF] (matches - no first part)
5. ABC > DEF (matches - no first part)
6. [xxx][ABC] > [xxx][DEF] (matches - no period in between)
7. [xxx]. [ABC] > [xxx] [DEF] (matches - space in between)
What it comes down to is: how can I specify the preceding pattern that when present as described will prevent a match? What would the pattern be in this case? (C# flavor of regex)
You want a negative look-behind expression. These look like (?<!pattern), so:
(?<!\[[\d ]+\]\.)\[?ABC\]?
Note that this does not force a matching pair of square brackets around ABC; it just allows for an optional open bracket before and an optional close bracket after. If you wanted to force a matching pair or none, you'd have to use alternation:
(?<!\[[\d ]+\]\.)(?:ABC|\[ABC\])
This uses non-capturing parentheses to delimit the alternation. If you want to actually capture ABC, you can of course turn that into a capture group.
ETA: The reason the first expression seems to fail is that it is matching on ABC], which is not preceded by the prohibited text. The open bracket [ is optional, so it just doesn't match that. The way around this is to shift the optional open bracket [ into the negative look-behind assertion, like so:
(?<!\[[\d ]+\]\.\[?)ABC\]?
An example of what it matches and doesn't:
[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
matched: ABC
ABC: match (expected: match)
matched: ABC
[ABC]: match (expected: match)
matched: ABC]
[ABC[: match (expected: fail)
matched: ABC
Trying to make the presence of an open bracket [ force a matching close bracket ], as the second pattern intended, is trickier, but this seems to work:
(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))
An example of what it matches and doesn't:
[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
matched: ABC
ABC: match (expected: match)
matched: ABC
[ABC]: match (expected: match)
matched: ABC]
[ABC[: fail (expected: fail)
The examples were generated using this code:
// Compile and run with: mcs so_regex.cs && mono so_regex.exe
using System;
using System.Text.RegularExpressions;
public class SORegex {
    public static void Main() {
        string[] values = {"[123].[ABC]", "[123 456].[ABC]", "[123.ABC", "ABC", "[ABC]", "[ABC["};
        string[] expected = {"fail", "fail", "match", "match", "match", "fail"};
        string pattern = @"(?<!\[[\d ]+\]\.\[?)ABC\]?"; // Don't force [ to match ].
        //string pattern = @"(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))"; // Force balanced brackets.
        Console.WriteLine("pattern: {0}", pattern);
        int i = 0;
        foreach (string text in values) {
            Match m = Regex.Match(text, pattern);
            bool isMatch = m.Success;
            Console.WriteLine("{0}: {1} (expected: {2})", text, isMatch? "match" : "fail", expected[i++]);
            if (isMatch) Console.WriteLine("\tmatched: {0}", m.Value);
        }
    }
}
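For the substitution itself, a minimal sketch might look like the following. It is a variation on the look-behind above, not the exact pattern from this answer: the optional [ stays inside the look-behind and the trailing \]? is dropped, so only the text ABC is replaced and any brackets around it are left alone. The class name AbcReplacer is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AbcReplacer
{
    // Assumption: only "ABC" itself is replaced; brackets are not part of the match.
    private const string Pattern = @"(?<!\[[\d ]+\]\.\[?)ABC";

    public static string Replace(string text)
    {
        return Regex.Replace(text, Pattern, "DEF");
    }

    public static void Main()
    {
        string[] values = { "[123].[ABC]", "[123 456].ABC", "[123.ABC", "[ABC]", "ABC", "[123][ABC]", "[123]. [ABC]" };
        foreach (string text in values)
        {
            Console.WriteLine("{0} > {1}", text, Replace(text));
        }
    }
}
```

This relies on .NET allowing variable-length patterns inside a look-behind; many other regex flavors reject the + and ? there.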
My file has data like this:
/Pages 2 0 R/Type /Catalog/AcroForm
/Count 1 /Kids [3 0 R]/Type /Pages
/Filter /FlateDecode/Length 84
What is the regular expression to get this output?
Pages Type Catalog AcroForm Count Kids Type Pages Filter FlateDecode Length
I want to fetch string after '/' & before 2nd '/' or space.
Thanks in advance.
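using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s = @"/Pages 2 0 R/Type /Catalog/AcroForm
/Count 1 /Kids [3 0 R]/Type /Pages
/Filter /FlateDecode/Length 84";
        // A name runs from '/' up to the next '/' or whitespace. Requiring trailing
        // whitespace would skip names followed directly by '/', such as Catalog.
        var regex = new Regex(@"/([^\s/]+)");
        foreach (Match item in regex.Matches(s))
        {
            Console.WriteLine(item.Groups[1].Value);
        }
    }
}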
Remark: Don't use regular expressions to parse PDF files.
\/[^\/\s]+
\/ -- A slash (escaped)
[^ ] -- A character class not (^) containing...
\/ -- ... slashes ...
\s -- ... or whitespace
+ -- One or more of these
Here it is for C#:
@"/([^\s/]+)"
You can test it here just adding what is in between quotes:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
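A minimal sketch of using that pattern on the sample data from the question, reading the name from capture group 1. The class and method names (PdfNameExtractor, ExtractNames) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PdfNameExtractor
{
    // Collects the text after each '/' up to the next '/' or whitespace.
    public static List<string> ExtractNames(string data)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(data, @"/([^\s/]+)"))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        string data = "/Pages 2 0 R/Type /Catalog/AcroForm\n"
                    + "/Count 1 /Kids [3 0 R]/Type /Pages\n"
                    + "/Filter /FlateDecode/Length 84";
        Console.WriteLine(string.Join(" ", ExtractNames(data)));
    }
}
```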
I wouldn't use a regex for this; I find that plain string operations are more readable:
string[] parts = input.Split('/');
foreach (string part in parts)
{
    if (part.Contains(" "))
    {
        // Get everything before the space
    }
    else
    {
        // Get whole string
    }
}