I'm trying to split a string into tokens (via regular expressions)
in the following way:
Example #1
input string: 'hello'
first token: '
second token: hello
third token: '
Example #2
input string: 'hello world'
first token: '
second token: hello world
third token: '
Example #3
input string: hello world
first token: hello
second token: world
i.e., only split up the string if it is NOT in single quotation marks, and single quotes should be in their own token.
This is what I have so far:
string pattern = @"'|\s";
Regex RE = new Regex(pattern);
string[] tokens = RE.Split("'hello world'");
This will work for example #1 and example #3 but it will NOT work for example #2.
I'm wondering if there's theoretically a way to achieve what I want with regular expressions?
You could build a simple lexer, which would involve consuming each of the tokens one by one. So you would have a list of regular expressions and would attempt to match one of them at each point. That is the easiest and cleanest way to do this if your input is anything beyond the very simple.
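As a rough illustration of the lexer idea, here is a minimal C# sketch. The class name QuoteLexer, the four rules, and the insideQuotes flag are assumptions made for this example, not something from the question. Each rule is anchored with \G so it can only match at the current position, and the lexer consumes one token at a time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names and rules are assumptions.
public static class QuoteLexer
{
    // \G anchors each pattern at the position handed to Match(input, pos),
    // so a rule can only match right where the lexer currently stands.
    private static readonly Regex Quote = new Regex(@"\G'");
    private static readonly Regex QuotedText = new Regex(@"\G[^']+");
    private static readonly Regex Whitespace = new Regex(@"\G\s+");
    private static readonly Regex Word = new Regex(@"\G[^'\s]+");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        bool insideQuotes = false;
        int pos = 0;

        while (pos < input.Length)
        {
            Match m = Quote.Match(input, pos);
            if (m.Success)
            {
                // A single quote is always its own token and toggles the state.
                tokens.Add(m.Value);
                insideQuotes = !insideQuotes;
            }
            else if (insideQuotes)
            {
                // Inside quotes: everything up to the next quote is one token.
                m = QuotedText.Match(input, pos);
                tokens.Add(m.Value);
            }
            else
            {
                // Outside quotes: skip whitespace, otherwise take one word.
                m = Whitespace.Match(input, pos);
                if (!m.Success)
                {
                    m = Word.Match(input, pos);
                    tokens.Add(m.Value);
                }
            }
            pos += m.Length; // every branch consumed at least one character
        }
        return tokens;
    }
}
```

With this sketch, QuoteLexer.Tokenize("'hello world'") yields the three tokens ', hello world and ', while QuoteLexer.Tokenize("hello world") yields hello and world. An unmatched quote simply leaves the rest of the input in "inside quotes" mode; whether that is acceptable depends on your input.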
Use a token parser to split the input into tokens. Use regex to find string patterns.
'[^']+' will match text inside single quotes. If you want it grouped, (')([^']+)('). If no matches are found, then just use a regular string split. I don't think it makes sense to try to do the whole thing in one regular expression.
EDIT: It seems from your comments on the question that you actually want this applied over a larger block of text rather than just simple inputs like you indicated. If that's the case, then I don't think a regular expression is your answer.
While it would be possible to match ' and the text inside separately, and also alternatively match the text alone, RegExp does not allow an indefinite number of matches. Or better said, you can only match those objects you explicitly state in the expression. So ((\w+)+\b) could theoretically match all words one-by-one. The outer group will correctly match the whole text, and also the inner group will match the words separately correctly, but you will only be able to reference the last match.
There is no way to match a group of matched matches (weird sentence). The only possible way would be to match the string and then split it into separate words.
Not exactly what you are trying to do, but regular expression conditions might help out as you look for a solution:
(?<quot>')?(?<words>(?(quot)[^']|\w)+)(?(quot)')
If a quote is found, it matches non-quote characters up to the closing quote. Otherwise it matches word characters. Your results are in the groups named "quot" and "words".
You'll have a hard time using Split here, but you can use a MatchCollection to find all matches in your string:
string str = "hello world, 'HELLO WORLD': we'll be fine.";
MatchCollection matches = Regex.Matches(str, @"(')([^']+)(')|(\w+)");
The regex searches for a string between single quotes. If it cannot find one, it takes a single word.
Now it gets a little tricky: .NET returns a collection of Match objects. Each Match has several Groups. The first Group holds the whole match ('hello world'), and the rest hold the sub-matches (', hello world, '). You also get many empty, unsuccessful Groups.
You can still iterate easily and get your matches. Here's an example using LINQ:
var tokens = from match in matches.Cast<Match>()
from g in match.Groups.Cast<Group>().Skip(1)
where g.Success
select g.Value;
tokens is now a collection of strings:
hello, world, ', HELLO WORLD, ', we, ll, be, fine
You can first split on quoted string, and then further tokenize.
foreach (String s in Regex.Split(input, @"('[^']+')")) {
// Check first if s is a quote.
// If so, split out the quotes.
// If not, do what you intend to do.
}
(Note: you need the parentheses, i.e. a capturing group, in the pattern to make sure Regex.Split returns the quoted parts too.)
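One way the loop above could be filled in is sketched below. The class name SplitTokenizer and the choice to break unquoted text on whitespace with \S+ are assumptions for this example. Because the split pattern contains a capturing group, Regex.Split returns the quoted pieces alongside the text between them (plus empty strings at the edges, which produce no tokens here).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are assumptions.
public static class SplitTokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();

        // The capturing group makes Split return the delimiters as well.
        foreach (string part in Regex.Split(input, @"('[^']+')"))
        {
            // Is this part a complete quoted string?
            Match quoted = Regex.Match(part, @"\A'([^']+)'\z");
            if (quoted.Success)
            {
                // Split out the quotes into their own tokens.
                tokens.Add("'");
                tokens.Add(quoted.Groups[1].Value);
                tokens.Add("'");
            }
            else
            {
                // Unquoted text: one token per run of non-whitespace characters.
                foreach (Match word in Regex.Matches(part, @"\S+"))
                    tokens.Add(word.Value);
            }
        }
        return tokens;
    }
}
```

For example, SplitTokenizer.Tokenize("say 'hello world' now") gives say, ', hello world, ', now. Note that a stray apostrophe in unquoted text (as in we'll) stays inside its word in this sketch.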
Try this Regular Expression:
([']*)([a-z]+)([']*)
This finds zero or more single quotes at the beginning and end of a word. Between them it finds 1 or more characters in the a-z set (unless you make it case insensitive, it will only find lowercase characters). The groups are arranged so that group 1 has the leading ', group 2 has a word (words are split by anything that is not a character a-z), and the last group has the trailing single quote if it exists.
Related
I am trying to extract a character/digit from a string that is between single quotes, and it seems I am failing to write the correct pattern.
Test string - only value that changes is the single character/digit in single quotes
[+] Random session part: 'm'
I am using the following pattern but it returns empty
var line = "[+] Random session part: 'm'";
Regex pattern = new Regex(@"(?<=\')(.*?)(?=\')");
Match match = pattern.Match(line);
Debug.Log($"{match.Groups["postfix"].Value}");
int postFix = int.Parse(match.Groups["postfix"].Value);
What am I missing?
You have an overly complicated regex, and you are looking for a group named 'postfix' in your match, while your regex does not have such a named group.
A simpler regex would be:
'(.)'
This looks for a single character between two single quotes, and has that character wrapped in a capture group. Put a breakpoint after your match row, and you can explore the matched object.
You can explore the regex above with your match here:
https://regexr.com/77b0m
BTW: Your code tries to parse the string "m" into an int. This will throw an error; you should probably handle that case with int.TryParse.
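Putting the two suggestions together, a minimal sketch could look like this. The class name SessionPart and its two methods are hypothetical names for this example. It reads the character through the numbered capture group 1 (the pattern has no named group) and uses int.TryParse so a non-numeric value such as "m" does not throw.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are assumptions.
public static class SessionPart
{
    // Returns the single character between single quotes, or null if absent.
    public static string Extract(string line)
    {
        Match match = Regex.Match(line, "'(.)'");
        return match.Success ? match.Groups[1].Value : null;
    }

    // Returns the parsed number, or null when the value is not an integer.
    public static int? ParseOrNull(string value)
    {
        return int.TryParse(value, out int number) ? number : (int?)null;
    }
}
```

SessionPart.Extract("[+] Random session part: 'm'") returns "m", and SessionPart.ParseOrNull("m") returns null instead of throwing.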
You can use this regex:
'(.)' // match a single character between single quotes
or
(?<=\')(.*?)(?=\') // lookarounds around a non-greedy match
I have the following string:
"483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)"
What I am trying to do, is to retrieve all values (if they exist) for the following letters (I highlighted necessary values in bold below):
483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)
I tried to retrieve values one by one with the following regex:
string regex = "NP: ^[0-9]^[\\s\\d]([.,\\s\\d][0-9]{1,4})?$";
But with no luck either (I am a newbie in Regex patterns).
Is it possible to retrieve all values in one pass (and then simply loop through the results), or do I have to go one key at a time?
Here is my full code:
string sTest = "483 432,96 (HM: 369 694,86; ZP: 32 143,48; NP: 4 507,19; SP: 40 800,62; SDS: 4 389,84; IP: 9 497,14; PvN: 3 157,25; ÚP: 3 102,14; GP: 808,28; PRFS: 15 332,16)";
string regex = "NP: ^[0-9]^[\\s\\d]([.,\\s\\d][0-9]{1,4})?$";
System.Text.RegularExpressions.MatchCollection coll = System.Text.RegularExpressions.Regex.Matches(sTest, regex);
String result = coll[0].Groups[1].Value;
You can't get them all with one regex, unless you are absolutely sure that they will all appear next to each other. Also, what would be the point of getting them all and having to split the result afterwards anyway? Here is a regex which would find the values you wanted:
(ZP|NP|SP|SDS|IP|PvN|ÚP|GP|PRFS): ([^;)]+)
Now the first group will be the key and the second group will be the value.
The idea is:
(x|y|z) matches either x or y or z
[^;)]+ matches something, which is not ; (because this is how they are currently delimited) or ) (for the last position) one or more times
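As a sketch of how that pattern might be used from C#, the following collects every key and value into a dictionary with Regex.Matches. The class name KeyValueExtractor and the choice of a Dictionary are assumptions for this example; the pattern is the one given above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are assumptions.
public static class KeyValueExtractor
{
    private static readonly Regex Pair =
        new Regex(@"(ZP|NP|SP|SDS|IP|PvN|ÚP|GP|PRFS): ([^;)]+)");

    public static Dictionary<string, string> Extract(string input)
    {
        var values = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(input))
        {
            // Group 1 is the key, group 2 is the value up to ';' or ')'.
            values[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return values;
    }
}
```

Called on the string from the question, Extract(sTest)["NP"] is "4 507,19" and Extract(sTest)["PRFS"] is "15 332,16"; HM is not in the alternation, so it is not returned. The values are left as strings here; converting them to numbers would need culture-aware parsing because of the space and comma separators.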
I tried to retrieve values one by one with the following regex:
Let's fix your one-by-one regex:
Caret ^ outside the [] character class means "start of line", so your expression with two carets in different places will not match anything.
Use \d instead of [0-9] and \D instead of [^0-9]
Here is one expression that matches NP: pattern (demo):
NP: \d+\D\d+([.,]\d{1,4})?
Now convert it to an expression that matches other tags like this:
(NP|ZP|SP|...): \d+\D\d+([.,]\d{1,4})?
Applying this pattern in a loop repeatedly will let you extract the tags one by one.
I'm trying to extract a string between two quotes, and I thought I had my regex working, but it's giving me two strings in my GroupCollection, and I can't get it to ignore the first one, which includes the first quote and ID=
The string that I want to parse is
Test ID="12345" hello
I want to return 12345 in a group, so that I can manipulate it in code later. I've tried the following regex: http://regexr.com/3bgtl, with this code:
nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
The problem is that the GroupCollection contains two entries:
ID="12345
12345
I just want it to return the second one.
Use positive lookbehind operator:
GroupCollection ids = Regex.Match(nodeValue, "(?<=ID=\")[^\"]*").Groups;
You also used a capturing group (the parentheses); this is why you get 2 results.
There are a few ways to accomplish this. I like named capture groups for readability.
Regex with named capture group:
"(?<capture>.*?)"
And your code would be:
match.Groups["capture"].Value
Your code is totally OK and is the most efficient of all the solutions suggested here. Capturing groups are the quickest and least resource-consuming way to match substrings inside larger texts.
All you need to do with your regex is just access the captured group 1 that is defined by the round brackets. Like this:
var nodeValue = "Test ID=\"12345\" hello";
GroupCollection ids = Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups;
Console.WriteLine(ids[1].Value);
// or just on one line
// Console.WriteLine(Regex.Match(nodeValue, "ID=\"([^\"]*)").Groups[1].Value);
Please have a look at Grouping Constructs in Regular Expressions:
Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string. You can use grouping constructs to do the following:
Match a subexpression that is repeated in the input string.
Apply a quantifier to a subexpression that has multiple regular expression language elements. For more information about quantifiers, see Quantifiers in Regular Expressions.
Include a subexpression in the string that is returned by the Regex.Replace and Match.Result methods.
Retrieve individual subexpressions from the Match.Groups property and process them separately from the matched text as a whole.
Note that if you do not need overlapping matches, the capturing group mechanism is the best solution here.
I'm trying to learn regex, but still have no clue. I have this line of code, which successfully separates the placeholder 'FirstWord' by the '{' delimiter from all following text:
var regexp = new Regex(@"(?<FirstWord>.*?)\{(?<TextBetweenCurlyBrackets>.*?)\}");
Which reads this string with no problem:
Greetings{Hello World}
What I want to do is to replace the '{' with a character chain like for instance '/>>'
so I tried this:
var regexp = new Regex(@"(?<FirstWord>.*?)\/>>(?<OtherText>.*?)\");
I removed the last bracket and replaced the first one with '/>>', but it throws an ArgumentException. What would the correct character combination look like?
/ does not need to be escaped, unless you use it as the pattern delimiter:
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)\"
Also, your last \ has to go. In a C# verbatim string (@"...") it does not escape the closing ", so the pattern itself ends in a dangling \, which the regex parser rejects (hence the ArgumentException):
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)"
And since you most likely want to fetch everything until the END of the string (.*? will fetch as few characters as required to satisfy the expression), you should use $ at the end or some other sort of delimiter (whitespace, linebreak, etc.).
@"(?<FirstWord>.*?)/>>(?<OtherText>.*?)$"
Example:
(.*?)/>>(.*?)$
Removing the trailing $ will fetch the empty string for the second match group, because "" is the shortest string possible satisfying the expression .*?
(.*?)/>>(.*?)$ on This/>>Test One will match This and Test One
(.*?)/>>(.*?)\s on This/>>Test One will match This and Test
(.*?)/>>(.*?) on This/>>Test One will match This and ""
Note: I'm saying "" is the shortest string possible satisfying the expression .*? on purpose! A frequent mistake is to interpret .*?a as "everything until a":
Regex is greedy by default!
Searching for the expression (.*?)a$ on "caba" will NOT fail to match. It will return cab, because cab followed by a satisfies the expression AND cab is the shortest string possible for any match starting at that position.
One might also expect b to be matched - but regex is working from left to right, hence aborting once it found cab - even if b would be shorter.
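The cases above are easy to check from C#. The following sketch uses a hypothetical helper, LazyDemo.Captures, which simply returns the values of the numbered groups of the first match, so the effect of the trailing $, \s, or nothing at all can be compared directly.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are assumptions.
public static class LazyDemo
{
    // Returns the values of groups 1..n of the first match,
    // or null if the pattern does not match at all.
    public static string[] Captures(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return null;

        var values = new string[m.Groups.Count - 1];
        for (int i = 1; i < m.Groups.Count; i++)
            values[i - 1] = m.Groups[i].Value;
        return values;
    }
}
```

LazyDemo.Captures(@"(.*?)/>>(.*?)$", "This/>>Test One") returns "This" and "Test One"; with \s in place of $ the second value is "Test"; with nothing after the second group it is the empty string. LazyDemo.Captures(@"(.*?)a$", "caba") returns "cab".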
Duplicate
Regex for variable declaration and initialization in c#
I was looking for a Regular Expression to parse CSV values, and I came across this Regular Expression
[^,]+
Which does my work by splitting the words on every occurrence of a ",". What I want to know is this: say I have the string
value_name v1,v2,v3,v4,...
Now I want a regular expression to find me the words v1,v2,v3,v4..
I tried ->
^value_name\s+([^,]+)*
But it didn't work for me. Can you tell me what I am doing wrong? I remember working on regular expressions and their state machine implementation. Doesn't it work in the same way?
If a string starts with value_name followed by one or more whitespaces, go to the next state. In that state, read a word until a "," comes. Then do it again! And each word will be grouped!
Am I wrong in understanding it?
You could use a Regex similar to those proposed:
(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?
The first group is non-capturing and would match the start of the line and the value_name.
To ensure that the Regex is still valid over all matches, we make that group optional by using the '?' modifier (meaning match at most once).
The second group is capturing and would match your vXX data.
The third group is non-capturing and would match the ',' and any whitespace before and after it.
Again, we make it optional by using the '?' modifier, otherwise the last 'vXX' group would not match unless we ended the string with a final ','.
In your trials, the Regex wouldn't match multiple times: you have to remember that if you want a Regex to match multiple occurrences in a string, the whole Regex needs to match every single occurrence in the string, so you have to build your Regex not only to match the start of the string 'value_name', but also to match every occurrence of 'vXX' in it.
In C#, you could list all matches and groups using code like this:
Regex r = new Regex(@"(?:^value_name\s+)?([^,]+)(?:\s*,\s*)?");
Match m = r.Match(subjectString);
while (m.Success) {
    for (int i = 1; i < m.Groups.Count; i++) {
        Group g = m.Groups[i];
        if (g.Success) {
            // matched text: g.Value
            // match start: g.Index
            // match length: g.Length
        }
    }
    m = m.NextMatch();
}
I would expect it only to get v1 in the group, because the first comma is "blocking" it from grabbing the rest of the fields. How you handle this is going to depend on the methods you use on the regular expression, but it may make sense to make two passes: first grab all the fields separated by commas and then break things up on spaces. Perhaps ^value_name\s+(?:([^,]+),?)* instead.
Oh yeah, lists....
/(?:^value_name\s+|,\s*)([^,]+)/g will theoretically grab them, but you will have to use RegExp.exec() in a loop to get the capture, rather than the whole match.
I wish pre-matches worked in JS :(.
Otherwise, go with Logan's idea: /^value_name\s+([^,]+(?:,\s*[^,]+)*)$/ followed by .split(/,\s*/);
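In .NET specifically, the repeated-group pattern suggested earlier, ^value_name\s+(?:([^,]+),?)*, can return every field from a single match, because a Group keeps each of its captures in its Captures collection rather than only the last one. Here is a minimal sketch; the class name CsvFields and the Trim() call are assumptions for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are assumptions.
public static class CsvFields
{
    public static List<string> Fields(string line)
    {
        var fields = new List<string>();
        Match m = Regex.Match(line, @"^value_name\s+(?:([^,]+),?)*");
        if (m.Success)
        {
            // Group 1 is repeated; Captures holds one entry per repetition.
            foreach (Capture c in m.Groups[1].Captures)
                fields.Add(c.Value.Trim());
        }
        return fields;
    }
}
```

CsvFields.Fields("value_name v1,v2,v3,v4") returns v1, v2, v3 and v4. In flavors without a captures collection (JavaScript, for instance), only the last repetition of the group is available, which is why the loop or split approaches above are needed there.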