I know a bit about regular expressions, but far from enough to figure out this one.
I have tried to find something that could help me, but I had a hard time understanding how to construct the regular expression in C#.
Here is what I need. Say I have a string like the following:
string s = "this is (a (string))";
What I need is to focus on the parentheses.
I want to be able to split this string up into the following List/Array "parts".
1) "this", "is", "a (string)"
or
2) "this", "is", "(a (string))".
I would like to know how to do both 1) and 2). Does anyone have an idea of how to solve this problem?
Can this be solved using regex? Does anyone know a good guide for learning it?
Hope someone can help.
Greetings.
If you want to split with some kind of escape (a space does not count as a separator when it is within parentheses), you
can implement it with a simple loop, without regular expressions:
private static IEnumerable<String> SplitWithEscape(String source) {
    if (String.IsNullOrEmpty(source))
        yield break;

    int escapeCount = 0; // current nesting depth of parentheses
    int start = 0;       // index where the current part begins

    for (int i = 0; i < source.Length; ++i) {
        char ch = source[i];

        if (escapeCount > 0) {
            // inside parentheses: only track the nesting depth
            if (ch == '(')
                escapeCount += 1;
            else if (ch == ')')
                escapeCount -= 1;
        }
        else {
            if (ch == ' ') {
                yield return source.Substring(start, i - start);
                start = i + 1; // skip the space itself
            }
            else if (ch == '(')
                escapeCount += 1;
        }
    }

    // return the last part, unless the parentheses were left unbalanced
    if ((start < source.Length) && (escapeCount == 0))
        yield return source.Substring(start);
}
....
String source = "this is (a (string))";
String[] split = SplitWithEscape(source).ToArray();
Console.Write(String.Join("; ", split));
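For variant 1) from the question, one option is to run the same kind of split and then strip one enclosing pair of parentheses from each part. The following is a minimal sketch under that assumption, not a definitive implementation; the names ParenSplitSketch, SplitOutsideParens and SplitAndUnwrap are made up for illustration, and a part such as "(a)(b)" would be unwrapped incorrectly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParenSplitSketch
{
    // Variant 2: split on spaces that are outside any parentheses.
    public static IEnumerable<string> SplitOutsideParens(string source)
    {
        if (string.IsNullOrEmpty(source))
            yield break;

        int depth = 0; // current nesting depth
        int start = 0; // index where the current part begins
        for (int i = 0; i < source.Length; ++i)
        {
            char ch = source[i];
            if (ch == '(')
            {
                depth++;
            }
            else if (ch == ')' && depth > 0)
            {
                depth--;
            }
            else if (ch == ' ' && depth == 0)
            {
                if (i > start) // skip empty parts caused by repeated spaces
                    yield return source.Substring(start, i - start);
                start = i + 1;
            }
        }
        if (start < source.Length)
            yield return source.Substring(start);
    }

    // Variant 1: same split, then drop one enclosing pair of parentheses.
    // Assumption: a part that starts with '(' and ends with ')' is wrapped
    // by a single matching pair.
    public static IEnumerable<string> SplitAndUnwrap(string source)
    {
        return SplitOutsideParens(source).Select(part =>
            part.Length >= 2 && part[0] == '(' && part[part.Length - 1] == ')'
                ? part.Substring(1, part.Length - 2)
                : part);
    }
}
```

Calling ParenSplitSketch.SplitAndUnwrap("this is (a (string))") gives the parts this, is and a (string), while SplitOutsideParens keeps the last part as (a (string)).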
You can try something like this:
([^\(\s]+)\s+([^\(\s]+)\s+\((.*)\)
But this will only match a fixed number of words in your input string, in this case two words before the parentheses. The final regex will depend on your exact requirements.
.NET regex supports balanced constructs. Thus, you can always safely use .NET regex to match substrings between a balanced number of delimiters that may have something inside them.
So, you can use
\(((?>[^()]+|\((?<o>)|\)(?<-o>))*(?(o)(?!)))\)|\S+
to match parenthesized substrings (while capturing the contents in-between parentheses into Group 1) or match all non-whitespace chunks (\S+ matches 1+ non-whitespace symbols).
See Grouping Constructs in Regular Expressions, Matching Nested Constructs with Balancing Groups or What are regular expression Balancing Groups? for more details on how balancing groups work.
If you need to extract all the match values and captured values, you need to get all matched groups that are not empty or whitespace. So, use this C# code:
var line = "this is (a (string))";
var pattern = @"\(((?>[^()]+|\((?<o>)|\)(?<-o>))*(?(o)(?!)))\)|\S+";
var result = Regex.Matches(line, pattern)
.Cast<Match>()
.SelectMany(x => x.Groups.Cast<Group>()
.Where(m => !string.IsNullOrWhiteSpace(m.Value))
.Select(t => t.Value))
.ToList();
foreach (var s in result) // DEMO
Console.WriteLine(s);
Maybe you can use ((?<=\()[^}]*(?=\)))|\W+ to split in words and then get the content in the group 1...
Related
So I have quite a big problem...
I get a string like:
'x,y',2,4,'y,z'
And I need to seperate it into
'x,y'
2
4
'y,z'
Nothing I tried came anywhere near the expected result...
Thanks in advance!
If you're looking for a quick solution, try this (simple loop and no regular expressions):
private static IEnumerable<string> CsvSplitter(string source) {
if (string.IsNullOrEmpty(source))
yield break; //TODO: you may want to throw exception in case source == null
int lastIndex = 0;
bool inQuot = false;
for (int i = 0; i < source.Length; ++i) {
char c = source[i];
if (inQuot)
inQuot = c != '\'';
else if (c == '\'')
inQuot = true;
else if (c == ',') {
yield return source.Substring(lastIndex, i - lastIndex);
lastIndex = i + 1;
}
}
//TODO: you can well have invalid csv (unterminated quotation):
// if (inQuot)
// throw new FormatException("Incorrect CSV");
yield return source.Substring(lastIndex);
}
Sample:
string source = @"'x,y',2,4,'y,z'";
string[] result = CsvSplitter(source).ToArray();
Console.Write(string.Join(Environment.NewLine, result));
Output:
'x,y'
2
4
'y,z'
However, in the general case you are better off looking for a ready-made CSV parser.
If you want to go the regex way, you can use
('.*?'|[^,]+)
and browse the capture groups, but I strongly recommend you to use a CSV parser.
If nested quotes are not allowed, we can retrieve the required parts with a simple regex '.*?'|[^,]+:
var input = "'x,y',2,4,'y,z'";
var parts = Regex
.Matches(input, "'.*?'|[^,]+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
Console.WriteLine(string.Join(Environment.NewLine, parts));
Demo: https://dotnetfiddle.net/qo5aHz
Although the .NET flavour makes it possible to write a regex that handles nested quotes, it would be rather hard, so it's best to use a ready-made CSV parser, for example the TextFieldParser that ships with .NET.
I have been trying to write code that checks whether a given string contains certain strings in a certain pattern.
To be precise, for example:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
Now, I want to extract
"homo sapiens", "human" and "woman" but NOT "man"
from the above list, as they follow the pattern, i.e. a string preceded by ~, or one of the strings inside parentheses that start with ~.
So far I have come up with:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
var prunedList = new List<string>();
foreach(var term in checkList)
{
    var pattern = @"~(\s)*(\(\s*)?(\(?\w\s*\)?)*" + term + @"(\s*\))?";
    Match m = Regex.Match(mainString, pattern);
    if(m.Success)
    {
        prunedList.Add(term);
    }
}
But this pattern does not work in all cases...
Can anyone suggest how this can be done?
I wrote a simple parser that works well for the example you gave.
I don't know what the expected behavior is for a string that ends in this pattern: ~(some words (ie, no closing parenthesis with valid opening)
I'm sure you could clean this up some...
private bool Contains(string source, string given)
{
return ExtractValidPhrases(source).Any(p => RegexMatch(p, given));
}
private bool RegexMatch(string phrase, string given)
{
return Regex.IsMatch(phrase, string.Format(@"\b{0}\b", given), RegexOptions.IgnoreCase);
}
private IEnumerable<string> ExtractValidPhrases(string source)
{
bool valid = false;
var parentheses = new Stack<char>();
var phrase = new StringBuilder();
for(int i = 0; i < source.Length; i++)
{
if (valid) phrase.Append(source[i]);
switch (source[i])
{
case '~':
valid = true;
break;
case ' ':
if (valid && parentheses.Count == 0)
{
yield return phrase.ToString();
phrase.Clear();
}
if (parentheses.Count == 0) valid = false;
break;
case '(':
if (valid)
{
parentheses.Push('(');
}
break;
case ')':
if (valid)
{
parentheses.Pop();
}
break;
}
}
//if (valid && !parentheses.Any()) yield return phrase.ToString();
if (valid) yield return phrase.ToString();
}
Here are the tests I used:
// NUnit tests
[Test]
[TestCase("Homo Sapiens", true)]
[TestCase("human", true)]
[TestCase("woman", true)]
[TestCase("man", false)]
public void X(string given, bool shouldBeFound)
{
const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
Assert.AreEqual(shouldBeFound, Contains(mainString, given));
}
[Test]
public void Y()
{
const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
var checkList = new List<string> {"homo sapiens", "human", "man", "woman"};
var expected = new List<string> { "homo sapiens", "human", "woman" };
var filtered = checkList.Where(s => Contains(mainString, s));
CollectionAssert.AreEquivalent(expected, filtered);
}
The language of balanced parentheses is not regular, and as a result you cannot accomplish what you want using regular expressions in the formal sense. A better approach would be to use traditional string parsing with a couple of counters (one for open parens and one for close parens) or a stack to create a model similar to a pushdown automaton.
To get a better idea of the concept check out PDAs on Wikipedia. http://en.wikipedia.org/wiki/Pushdown_automaton
Below is an example using a stack to get strings inside the outermost parens (pseudo code).
Stack stack = new Stack();
char[] stringToParse = originalString.toCharArray();
for (int i = 0; i < stringToParse.Length; i++)
{
if (stringToParse[i] == '(')
stack.push(i);
if (stringToParse[i] == ')')
string StringBetweenParens = originalString.GetSubstring(stack.pop(), i);
}
Now of course this is a contrived example and would need some work to do more serious parsing, but it gives you the basic idea of how to do it. I've left out things like; the correct function names (don't feel like looking them up right now), how to get text in nested parens like getting "inner" out of the string "(outer (inner))" (that function would return "outer (inner)"), and how to store the strings you get back.
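For reference, the same stack idea can be written as compilable C#. This is only a sketch: the names ParenSpans and ExtractParenthesized are invented here, results are collected in a list, and an unmatched closing parenthesis is simply ignored, which is an assumption rather than something the pseudo code specifies.

```csharp
using System;
using System.Collections.Generic;

public static class ParenSpans
{
    // Returns the text inside every matched pair of parentheses.
    // Inner pairs close first, so they appear before the pairs enclosing them.
    public static List<string> ExtractParenthesized(string text)
    {
        var results = new List<string>();
        var openPositions = new Stack<int>(); // indexes of '(' not yet closed

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '(')
            {
                openPositions.Push(i);
            }
            else if (text[i] == ')' && openPositions.Count > 0)
            {
                int open = openPositions.Pop();
                // take the characters strictly between the two parentheses
                results.Add(text.Substring(open + 1, i - open - 1));
            }
        }
        return results;
    }
}
```

For the input "(outer (inner))" this yields "inner" followed by "outer (inner)"; keeping only the entries recorded when the stack becomes empty would give the outermost groups alone.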
Simply for academic reasons, I would like to present the regex solution, too. Mostly, because you are probably using the only regex engine that is capable of solving this.
After clearing up some interesting issues about the combination of .NET's unique features, here is the code that gets you the desired results:
string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string> { "homo sapiens", "human", "man", "woman" };
// build subpattern "(?:homo sapiens|human|man|woman)"
string searchAlternation = "(?:" + String.Join("|", checkList.ToArray()) + ")";
MatchCollection matches = Regex.Matches(
mainString,
@"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)"+searchAlternation,
RegexOptions.IgnoreCase
);
Now how does this work? Firstly, .NET supports balancing groups, which allow for detection of correctly nested patterns. Every time we capture something with a named capturing group
(like (?<Depth>somepattern)) it does not overwrite the last capture, but instead is pushed onto a stack. We can pop one capture from that stack with (?<-Depth>). This will fail, if the stack is empty (just like something that does not match at the current position). And we can check whether the stack is empty or not with (?(Depth)patternIfNotEmpty|patternIfEmpty).
In addition to that, .NET has one of the very few regex engines that support unrestricted variable-length lookbehinds. If we can use these two features together, we can look to the left of one of our desired strings and see whether there is a ~( somewhere outside the current nesting structure.
But here is the catch. Lookbehinds are executed from right to left in .NET, which means that we need to push closing parens and pop on encountering opening parens, instead of the other way round.
So here is for some explanation of that murderous regex (it's easier to understand if you read the lookbehind from bottom to top, just like .NET would do):
(?<= # lookbehind
~ # if there is a literal ~ to the left of our string, we're good
| # OR
(?(Depth)(?!)) # if there is something left on the stack, we started outside
# of the parentheses that end at "~("
~[(] # match a literal ~(
(?> # subpattern to analyze parentheses. the > makes the group
# atomic, i.e. suppresses backtracking. Note: we can only do
# this, because the three alternatives are mutually exclusive
[^()]+ # consume any non-parens characters without caring about them
| # OR
(?<-Depth>)? # pop the top of stack IF possible. the last ? is necessary for
# cases like "human", where we start with a ( before there was a )
# which could be popped.
[(] # match a literal (
| # OR
(?<Depth>[)]) # match a literal ) and push it onto the stack
)* # repeat for as long as possible
) # end of lookbehind
(?:homo sapiens|human|man|woman)
# match one of the words in the check list
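The push and pop mechanics described above are easier to see in isolation. The sketch below uses the same balancing-group constructs, but reading left to right and without any lookbehind, to test whether the parentheses in a string are balanced. The class and method names are hypothetical, and this is not part of the solution above, just an illustration of the technique.

```csharp
using System.Text.RegularExpressions;

public static class BalancedCheck
{
    // (?<Depth>) pushes an empty capture for every "(",
    // (?<-Depth>) pops one for every ")" and fails if the stack is empty,
    // (?(Depth)(?!)) fails the match if anything is left on the stack.
    private static readonly Regex Balanced = new Regex(
        @"^(?>[^()]+|\((?<Depth>)|\)(?<-Depth>))*(?(Depth)(?!))$");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

BalancedCheck.IsBalanced("(a (b))") returns true, while "(()" and ")(" both return false.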
Parenthesis checking is a context-free language or grammar, which requires a stack for checking. Regular expressions are suitable for regular languages. They do not have memory, therefore they cannot be used for such purposes.
To check this you need to scan the string and count the parentheses:
initialize count to 0
scan the string
if current character is ( then increment count
if current character is ) then decrement count
if count is negative, raise an error that parentheses are inconsistent; e.g., )(
In the end, if count is positive, then there are some unclosed parentheses
If count is zero, then the test is passed
Or in C#:
public static bool CheckParentheses(string input)
{
int count = 0;
foreach (var ch in input)
{
if (ch == '(') count++;
if (ch == ')') count--;
// if a parenthesis is closed without being opened return false
if(count < 0)
return false;
}
// in the end the test is passed only if count is zero
return count == 0;
}
You see, since regular expressions in the formal sense are not capable of counting, they cannot check such patterns.
This is not possible using regular expressions.
You should abandon idea of using them and use normal string operations like IndexOf.
What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:
max,emily,john = ["max", "emily", "john"]
BUT
max,"emily,kate",john = ["max", "emily,kate", "john"]
Looking to use in C#: Regex.Split(string, "PATTERN-HERE");
Thanks.
Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.
You might try something like this instead:
public static IEnumerable<string> SplitCSV(string csvString)
{
var sb = new StringBuilder();
bool quoted = false;
foreach (char c in csvString) {
if (quoted) {
if (c == '"')
quoted = false;
else
sb.Append(c);
} else {
if (c == '"') {
quoted = true;
} else if (c == ',') {
yield return sb.ToString();
sb.Length = 0;
} else {
sb.Append(c);
}
}
}
if (quoted)
throw new ArgumentException("Unterminated quotation mark.", "csvString");
yield return sb.ToString();
}
It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.
This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():
You could use the regex (please don't!)
(?<=^(?:[^"]*"[^"]*")*[^"]*) # assert that there is an even number of quotes before...
\s*,\s* # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$) # as well as after the comma.
if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves becoming part of the match.
This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
A little late maybe but I hope I can help someone else
String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
foreach ( String s in cols ) {
Console.WriteLine(s);
}
Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.
Here's our simple regex:
"[^"]*"|(,)
The left side of the alternation matches complete "quoted strings" tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.
This program shows how to use the regex (see the results at the bottom of the online demo):
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
string s1 = @"max,""emily,kate"",john";
var myRegex = new Regex(@"""[^""]*""|(,)");
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return "SplitHere";
});
string[] splits = Regex.Split(replaced,"SplitHere");
foreach (string split in splits) Console.WriteLine(split);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I have a string like "AAA 101 B202 C 303 " and I want to get rid of the space between char and number if there is any.
So after operation, the string should be like "AAA101 B202 C303 ". But I am not sure whether regex could do this?
Any help? Thanks in advance.
Yes, you can do this with regular expressions. Here's a short but complete example:
using System;
using System.Text.RegularExpressions;
class Test
{
static void Main()
{
string text = "A 101 B202 C 303 ";
string output = Regex.Replace(text, @"(\p{L}) (\d)", @"$1$2");
Console.WriteLine(output); // Prints A101 B202 C303
}
}
(If you're going to do this a lot, you may well want to compile a regular expression for the pattern.)
The \p{L} matches any unicode letter - you may want to be more restrictive.
You can do something like
([A-Z]+)\s?(\d+)
And replace with
$1$2
The expression can be tightened up, but the above should work for your example input string.
What it does is declare a group containing letters (the first set of parentheses), then an optional space (\s?), and then a group of digits (\d+). The groups can be used in the replacement by referring to their index, so when you want to get rid of the space, just replace with $1$2.
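In C#, that pattern and replacement could be applied as in the following sketch. The class and method names are placeholders, and the pattern is taken as written above, so it only recognises uppercase ASCII letters.

```csharp
using System.Text.RegularExpressions;

public static class SpaceRemover
{
    // Group 1 captures the letters, group 2 the digits; the optional
    // whitespace between them is matched but not captured, so the
    // replacement "$1$2" drops it.
    public static string RemoveLetterDigitSpace(string text)
    {
        return Regex.Replace(text, @"([A-Z]+)\s?(\d+)", "$1$2");
    }
}
```

With the question's input "AAA 101 B202 C 303 " this returns "AAA101 B202 C303 ".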
While not as concise as Regex, the C# code for something like this is fairly straightforward and very fast-running:
StringBuilder sb = new StringBuilder();
for(int i=0; i<s.Length; i++)
{
    // exclude spaces preceded by a letter and followed by a digit
    if(!(s[i] == ' '
        && i-1 >= 0 && char.IsLetter(s[i-1])
        && i+1 < s.Length && char.IsDigit(s[i+1])))
    {
        sb.Append(s[i]);
    }
}
return sb.ToString();
Just for fun (because the act of programming is/should be fun sometimes) :o) I'm using LINQ with Aggregate:
var result = text.Aggregate(
string.Empty,
(acc, c) => char.IsLetter(acc.LastOrDefault()) && Char.IsDigit(c) ?
acc + c.ToString() : acc + (char.IsWhiteSpace(c) && char.IsLetter(acc.LastOrDefault()) ?
string.Empty : c.ToString())).TrimEnd();
I read a string from the console. How do I make sure it only contains English characters and digits?
Assuming that by "English characters" you are simply referring to the 26-character Latin alphabet, this would be an area where I would use regular expressions: ^[a-zA-Z0-9]*$
For example:
if( Regex.IsMatch(Console.ReadLine(), "^[a-zA-Z0-9]*$") )
{ /* your code */ }
The benefit of regular expressions in this case is that all you really care about is whether or not a string matches a pattern, and this is where regular expressions work wonderfully. The pattern clearly captures your intent, and it's easy to extend if your definition of "English characters" expands beyond just the 26 alphabetic ones.
There's a decent series of articles here that teach more about regular expressions.
Jørn Schou-Rode's answer provides a great explanation of how the regular expression presented here works to match your input.
You could match it against this regular expression: ^[a-zA-Z0-9]*$
^ matches the start of the string (ie no characters are allowed before this point)
[a-zA-Z0-9] matches any letter from a-z in lower or upper case, as well as digits 0-9
* lets the previous match repeat zero or more times
$ matches the end of the string (ie no characters are allowed after this point)
To use the expression in a C# program, you will need to import System.Text.RegularExpressions and do something like this in your code:
bool match = Regex.IsMatch(input, "^[a-zA-Z0-9]*$");
If you are going to test a lot of lines against the pattern, you might want to compile the expression:
Regex pattern = new Regex("^[a-zA-Z0-9]*$", RegexOptions.Compiled);
for (int i = 0; i < 1000; i++)
{
string input = Console.ReadLine();
pattern.IsMatch(input);
}
The accepted answer does not work for white space or punctuation. The code below was tested with this input:
Hello: 1. - a; b/c \ _(5)??
(Is English)
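// Note: inside the character class, " -_" is a range from space (U+0020) to
// underscore (U+005F); that range is what admits most ASCII punctuation.
Regex regex = new Regex("^[a-zA-Z0-9. -_?]*$");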
string text1 = "سلام";
bool fls = regex.IsMatch(text1); //false
string text2 = "123 abc! ?? -_)(/\\;:";
bool tru = regex.IsMatch(text2); //true
Another way is to check that each character is a lowercase letter, an uppercase letter, a digit, or white space, using IsLower, IsUpper, IsDigit and IsWhiteSpace.
Something like:
private bool IsAllCharEnglish(string Input)
{
foreach (var item in Input.ToCharArray())
{
if (!char.IsLower(item) && !char.IsUpper(item) && !char.IsDigit(item) && !char.IsWhiteSpace(item))
{
return false;
}
}
return true;
}
and to use it:
string str = "فارسی abc";
IsAllCharEnglish(str); // return false
str = "These are english 123";
IsAllCharEnglish(str); // return true
Do not use regex or LINQ; they are slower than looping over the characters of the string.
Performance test
My solution:
private static bool is_only_eng_letters_and_digits(string str)
{
foreach (char ch in str)
{
if (!(ch >= 'A' && ch <= 'Z') && !(ch >= 'a' && ch <= 'z') && !(ch >= '0' && ch <= '9'))
{
return false;
}
}
return true;
}
Do you have web access? I assume that cannot be guaranteed, but Google has a language API that will detect the language of the text you pass to it.
google language api
bool onlyEnglishCharacters = !EnglishText.Any(a => a > '~');
Seems cheap, but it worked for me, legit easy answer.
Hope it helps anyone.
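bool AllAscii(string str)
{
    // c <= 127 restricts the check to ASCII; char.IsLetterOrDigit on its own
    // also accepts letters and digits from other scripts
    return str.All(c => c <= 127 && char.IsLetterOrDigit(c));
}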
Something like this (if you want to control input):
static string ReadLettersAndDigits() {
StringBuilder sb = new StringBuilder();
ConsoleKeyInfo keyInfo;
while ((keyInfo = Console.ReadKey(true)).Key != ConsoleKey.Enter) {
char c = char.ToLower(keyInfo.KeyChar);
if (('a' <= c && c <= 'z') || char.IsDigit(c)) {
sb.Append(keyInfo.KeyChar);
Console.Write(c);
}
}
return sb.ToString();
}
If you don't want to use regex, an alternative is to check the ASCII code of each character: if it lies within the ranges below, it is either an English letter or a digit (this might not be the best solution):
foreach (char ch in str)
{
    int x = (int)ch;
    if ((x >= 65 && x <= 90) || (x >= 97 && x <= 122))
    {
        // this is an English letter, i.e. A, B, C, a, b, c...
    }
    else if (x >= 48 && x <= 57)
    {
        // this is a digit
    }
    else
    {
        // this is something different
    }
}
http://en.wikipedia.org/wiki/ASCII for full ASCII table.
But I still think, RegEx is the best solution.
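I agree with the regular expression answers. However, you could simplify it to just "^[\w]+$". \w is any "word character", which is equivalent to [a-zA-Z_0-9] only when Unicode matching is switched off; by default .NET also treats letters and digits from other scripts as word characters, so you would need RegexOptions.ECMAScript to restrict it. I don't know if you want underscores as well.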
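The difference between the two \w behaviours can be checked with a small sketch like the one below. The class and method names are invented for this illustration; RegexOptions.ECMAScript is the standard .NET option that limits \w to [a-zA-Z_0-9].

```csharp
using System.Text.RegularExpressions;

public static class WordCharCheck
{
    // Default .NET behaviour: \w also matches non-ASCII letters and digits.
    public static bool IsWordCharsUnicode(string input)
    {
        return Regex.IsMatch(input, @"^\w+$");
    }

    // ECMAScript mode: \w is limited to [a-zA-Z_0-9].
    public static bool IsWordCharsAscii(string input)
    {
        return Regex.IsMatch(input, @"^\w+$", RegexOptions.ECMAScript);
    }
}
```

Both methods accept "abc_123", but only IsWordCharsUnicode accepts a string containing an accented letter such as "héllo".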
More on regexes in .net here: http://msdn.microsoft.com/en-us/library/ms972966.aspx#regexnet_topic8
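As many pointed out, the accepted answer works only if there is a single word in the string. As there are no answers that cover the case of multiple words or even sentences in the string, here is an expression that is true when the string contains a letter outside the ASCII range, i.e. a non-English letter: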
stringToCheck.Any(x=> char.IsLetter(x) && !((int)x >= 63 && (int)x <= 126));
<?php
$string="हिन्दी";
$string="Manvendra Rajpurohit";
echo strlen($string); echo '<br>';
echo mb_strlen($string, 'utf-8');
echo '<br>';
if(strlen($string) != mb_strlen($string, 'utf-8'))
{
echo "Please enter English words only:(";
}
else {
echo "OK, English Detected!";
}
?>