Regex.Split everything inside square brackets [] - c#

I'm really a n00b when it comes to regular expressions. I've been trying to Split a string wherever there's a [----anything inside-----] for example.
string s = "Hello Word my name_is [right now I'm hungry] Julian";
string[] words = Regex.Split( s, "------");
The outcome would be "Hello Word my name_is " and " Julian"

The regex you want to use is:
Regex.Split( s, "\\[.*?\\]" );
Square brackets are special characters (specifying a character group), so they have to be escaped with a backslash. Inside the square brackets, you want any sequence of characters EXCEPT a close square bracket. There are a couple of ways to handle that. One is to specify [^\]]* (explicitly specifying "not a close square bracket"). The other, as I used in my answer, is to specify that the match is not greedy by affixing a question mark after it. This tells the regular expression processor not to greedily consume as many characters as it can, but to stop as soon as the next expression is matched.
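As a minimal sketch of the two options described above, the following compares the lazy form with the negated character class. The class and method names (BracketSplitDemo, SplitLazy, SplitNegated) are made up for illustration and are not from the original answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BracketSplitDemo
{
    // Lazy quantifier: ".*?" stops at the first closing bracket it can reach.
    public static string[] SplitLazy(string input) =>
        Regex.Split(input, @"\[.*?\]");

    // Negated character class: consume anything that is not a closing bracket.
    public static string[] SplitNegated(string input) =>
        Regex.Split(input, @"\[[^\]]*\]");
}
```

Both methods return "Hello Word my name_is " and " Julian" for the string in the question. One difference worth knowing: without RegexOptions.Singleline the dot does not match a newline, so only the negated-class form handles bracketed text that spans lines.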

@"\[.*?\]" will match the bracketed text.

Another way to write it:
Regex.Split(str, @"\[[^]]*\]");

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So I have the following regex for finding the place where whitespace needs to be added:
(\S)(\()
For a string like "SomeText(Somemoretext)" I want to produce "SomeText (Somemoretext)". The regex matches "t(", so my replacement removes the "t" from the string, which is not what I want. I also don't know in advance what that character will be; I'm only trying to detect the absence of whitespace.
Is there a better expression to use, or a way to exclude the found character from the match, so that I can replace safely without touching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words (made of [a-zA-Z0-9_]) that are followed by a bracket without a space in between.
If I understood your problem correctly, you can replace the full match with " (" (a space followed by the opening parenthesis) and get exactly what you need.
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).
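For readers who do want the preceding character excluded from the match, here is a sketch using a lookbehind. This is a different technique from the answers above, and the names ParenSpacing and AddSpaceBeforeParen are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ParenSpacing
{
    // (?<=\S) asserts that a non-whitespace character precedes "(" without
    // consuming it, so only the parenthesis itself is replaced.
    public static string AddSpaceBeforeParen(string input) =>
        Regex.Replace(input, @"(?<=\S)\(", " (");
}
```

With this, "SomeText(Somemoretext)" becomes "SomeText (Somemoretext)", and a parenthesis that already has a space in front of it is left alone.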

One single regular expression to match multiple alphanumeric words from 15 to 20 characters

I need to find all the words that have between 15 and 20 characters in a big string. I also want to avoid getting a long word with something else at the end (for example 1234567890abcdef@asdf.com). I don't want that to be a result, only words. Right now I'm splitting the string using white space as the token, and for each word I'm applying the following regular expression:
^[a-zA-Z0-9]{15,20}$
Is there any chance to do both things using one regular expression?
I'm using C#.
Good examples to catch:
1234567890abcdeg
qwertyuiopasdfgh
1234567890abcdeg, (catch it but remove ",")
Examples to avoid: 1234567890abcdeg@gmail.com
Don't use start/end anchors (^/$), but word delimiters (\b):
\b[a-zA-Z0-9]{15,20}(?=[\s,]|$)
I used (?=[\s,]|$) instead of the end delimiter to force a space character or a comma or the end of the string. Expand it as needed.
You may want to do likewise for the first \b if you need to, for instance: (?<=\s|^).
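A small sketch of this approach, combining the lookbehind for the start with the lookahead for the end, so that no manual splitting is needed. The names LongWordFinder and Find are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LongWordFinder
{
    // Start: preceded by whitespace or the start of the string.
    // End: followed by whitespace, a comma, or the end of the string.
    private static readonly Regex Word =
        new Regex(@"(?<=\s|^)[a-zA-Z0-9]{15,20}(?=[\s,]|$)");

    public static List<string> Find(string text)
    {
        var result = new List<string>();
        foreach (Match m in Word.Matches(text))
            result.Add(m.Value); // the trailing comma is never part of the match
        return result;
    }
}
```

Because the comma is only checked by the lookahead, a word followed by "," is returned without it, while a token that continues with some other character (as in an e-mail address) is skipped entirely.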
Normally, you would use word boundaries (\b) before and after the alphanumerics:
\b[a-zA-Z0-9]{15,20}\b
However, there's a small detail to take into account: underscores ("_") are also considered word characters. The previous regex won't match the following text:
12345678901234567_
In order to avoid it, you can check if it's preceded and followed by either a \b or a "_", with lookarounds.
Regex:
(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)
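The difference between the two patterns can be checked with a short sketch like the one below. The class and method names are hypothetical; the patterns are the ones given in this answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UnderscoreBoundary
{
    // Plain word boundaries: fails when the word touches an underscore,
    // because "_" is itself a word character.
    public static bool MatchesWithWordBoundary(string text) =>
        Regex.IsMatch(text, @"\b[a-zA-Z0-9]{15,20}\b");

    // Lookarounds that accept either a word boundary or an underscore.
    // Returns the first match, or an empty string if there is none.
    public static string FirstWithLookarounds(string text) =>
        Regex.Match(text, @"(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)").Value;
}
```

For "12345678901234567_" the first method returns false, while the second returns the 17 digits without the underscore.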

Regex to remove all spaces, periods and other non-word characters from a string

I have this regex, [^\w\.@-], which removes any character that is not a word character from the given string, and it works fine, except for two cases that I also want it to handle: it should remove any spaces and full stops (.) that exist in the string.
Can you please help me edit this regex? I tried to learn regex from resources on the internet, but it doesn't seem that easy.
Regex.Replace(title, @"[^\w\.@-]", "", RegexOptions.None, TimeSpan.FromSeconds(1.5));
Remove the dot from your negated character class. You only need to place in the negated character class those characters that you want to keep in the resulting string.
You can use:
string repl = Regex.Replace(title, @"[^\w@-]", "", RegexOptions.None, TimeSpan.FromSeconds(1.5));
Space is already being removed since space is not considered a word character.
Your regex is fine. It appears that the problem is in the way that you are trying to use it.
The replacement does not happen in place, you need to capture the result in order to get the new string:
var newTitle = Regex.Replace(title, @"[^\w\.@-]", "", RegexOptions.None, TimeSpan.FromSeconds(1.5));
This expression works as expected: it keeps only word characters, dots, dashes, and @ signs.
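The two answers can be compared side by side with a sketch like this. It assumes the extra character to keep is the at sign, and the names TitleCleaner, Strict and KeepDots are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TitleCleaner
{
    private static readonly TimeSpan Timeout = TimeSpan.FromSeconds(1.5);

    // Keeps word characters, '@' and '-'; spaces, dots and everything else go.
    public static string Strict(string title) =>
        Regex.Replace(title, @"[^\w@-]", "", RegexOptions.None, Timeout);

    // Same, but dots survive because '.' is listed in the negated class.
    public static string KeepDots(string title) =>
        Regex.Replace(title, @"[^\w.@-]", "", RegexOptions.None, Timeout);
}
```

In both cases the result is the return value; the input string is never modified in place, since .NET strings are immutable.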

Regex expression not filtering special symbols

I'm currently using the following line of code:
Regex Regex_Alpha = new Regex(@"[a-zA-Z]+('[a-zA-Z])?[a-zA-Z]*");
What I want to do is filter the input of text fields with the condition that input should only be letters and the apostrophe symbol (actually, I still want to add more, but I'm trying to resolve this first).
Right now, it is accepting ALL characters, even numbers.
With my understanding of Regex, I tried to formulate my own expression in the line of:
Regex Regex_Alpha = new Regex(@"^[a-zA-Z'-"+$);
It filters numbers, but doesn't accept the apostrophe symbol. I tried removing the @ sign and escaping the apostrophe with a backslash, but still no use.
What should be the best approach to filter the input so that it only accepts letters and apostrophe? (I'll do the rest of the symbols once I understand how this one should work)
As I've commented, your first regular expression is a pretty good shot at "letters, with a single apostrophe not at either end". However, it matches any string that contains even a single letter, because a regular expression looks for any match in the input, not for whether the entire input matches.
You can fix this by doing what you've done in your second regular expression - just put a ^ at the start and a $ at the end. This means the start and end of the expression have to match the start and end of the input, so it ensures the whole input is only made up of letters and a possible apostrophe.
Regarding your second regular expression, you have a few problems.
If you want a double quote in a @"..." string literal, you need to put two double quotes. (I think this might just be a typing mistake in your question, as what you currently have wouldn't even compile.)
You need to close your character class with a ]. Without it the character class is unterminated, and .NET rejects the pattern when the Regex is constructed.
If you want a hyphen in a character class, it has to go at the start or end, or it gets mistaken for a "between" hyphen (as in A-Z).
The expression @"^[a-zA-Z'""-]+$" should match "any string entirely made of letters, apostrophes, quotes or hyphens".
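A minimal sketch of both fixes, restricted to letters and apostrophes as the question asks. The class and member names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class NameValidator
{
    // Whole input must consist of letters and apostrophes, in any position.
    private static readonly Regex LettersAndApostrophes =
        new Regex(@"^[a-zA-Z']+$");

    // The first regex from the question, anchored: letters with at most
    // one apostrophe, which may not be at either end.
    private static readonly Regex SingleInnerApostrophe =
        new Regex(@"^[a-zA-Z]+('[a-zA-Z])?[a-zA-Z]*$");

    public static bool IsLettersAndApostrophes(string input) =>
        LettersAndApostrophes.IsMatch(input);

    public static bool HasSingleInnerApostrophe(string input) =>
        SingleInnerApostrophe.IsMatch(input);
}
```

Without the ^ and $ anchors, IsMatch would return true for input such as "abc123", because the letters at the start already satisfy the pattern.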

regular expression should split , that are contained outside the double quotes in a CSV file?

This is the sample
"abc","abcsds","adbc,ds","abc"
Output should be
abc
abcsds
adbc,ds
abc
Try this:
"(.*?)"
If you need to put this regex inside a regular string literal, don't forget to escape the quotes:
Regex re = new Regex("\"(.*?)\"");
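A sketch of how the capture group might be read back, assuming every field is quoted and no field contains an embedded quote. The names QuotedFields and Extract are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class QuotedFields
{
    // Assumption: every field is wrapped in quotes and contains no
    // doubled ("") quotes; this pattern does not handle those.
    public static List<string> Extract(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(line, "\"(.*?)\""))
            fields.Add(m.Groups[1].Value); // group 1 is the text between the quotes
        return fields;
    }
}
```

For the sample line this yields abc, abcsds, adbc,ds and abc.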
This is a tougher job than you realize -- not only can there be commas inside the quotes, but there can also be quotes inside the quotes. Two consecutive quotes inside a quoted string do not signal the end of the string. Instead, they signal a quote embedded in the string, so for example:
"x", "y,""z"""
should be parsed as:
x
y,"z"
So, the basic sequence is something like this:
Find the first non-white-space character.
If it was a quote, read up to the next quote. Then read the next character.
Repeat until that next character is not also a quote.
If the next (non-whitespace) character is not a comma, input is malformed.
If it was not a quote, read up to the next comma.
Skip the comma, repeat the whole process for the next field.
Note that despite the tag, I'm not providing a regex -- I'm not at all sure I've seen a regex that can really handle this properly.
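The steps above can be sketched as a small hand-written parser for a single line. This is only one possible reading of those steps: the names CsvLine and ParseFields are hypothetical, and quoted fields that span several lines are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; handles one line at a time.
public static class CsvLine
{
    public static List<string> ParseFields(string line)
    {
        var fields = new List<string>();
        int i = 0;
        while (true)
        {
            // Find the first non-white-space character of the field.
            while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
            var field = new StringBuilder();
            if (i < line.Length && line[i] == '"')
            {
                i++; // skip the opening quote
                while (true)
                {
                    if (i >= line.Length)
                        throw new FormatException("Unterminated quoted field.");
                    char c = line[i++];
                    if (c != '"') { field.Append(c); continue; }
                    // A quote followed by another quote is an embedded quote.
                    if (i < line.Length && line[i] == '"') { field.Append('"'); i++; continue; }
                    break; // a lone quote closes the field
                }
                while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
                if (i < line.Length && line[i] != ',')
                    throw new FormatException("Expected a comma after a quoted field.");
            }
            else
            {
                // Unquoted field: read up to the next comma.
                while (i < line.Length && line[i] != ',') field.Append(line[i++]);
            }
            fields.Add(field.ToString());
            if (i >= line.Length) break;
            i++; // skip the comma and continue with the next field
        }
        return fields;
    }
}
```

For the example input shown earlier in this answer, the method returns the two fields x and y,"z".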
This answer has a C# solution for dealing with CSV.
In particular, the line
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
contains the Regex used to split properly, i.e., taking quoting and escaping into consideration.
Basically what it says is, match any comma that is followed by an even number of quote marks (including zero). This effectively prevents matching a comma that is part of a quoted string, since the quote character is escaped by doubling it.
Keep in mind that the quotes in the above line are doubled for the sake of the string literal. It might be easier to think of the expression as
,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
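A sketch of how this splitter might be used on one line. The unquoting step after the split is an addition for illustration, as are the names CsvRegexSplit and Split; the pattern itself is the one quoted above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class CsvRegexSplit
{
    // Matches a comma only when an even number of quotes follows it,
    // i.e. when the comma is outside any quoted field.
    private static readonly Regex Splitter =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

    public static string[] Split(string line)
    {
        string[] parts = Splitter.Split(line);
        for (int i = 0; i < parts.Length; i++)
        {
            string p = parts[i].Trim();
            // Strip the surrounding quotes and collapse doubled quotes.
            if (p.Length >= 2 && p[0] == '"' && p[p.Length - 1] == '"')
                p = p.Substring(1, p.Length - 2).Replace("\"\"", "\"");
            parts[i] = p;
        }
        return parts;
    }
}
```

The split itself leaves the quotes on each piece, which is why the loop removes them afterwards.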
If you can be sure there are no inner, escaped quotes, then I guess it's ok to use a regular expression for this. However, most modern languages already have proper CSV parsers.
Using a proper parser is the correct answer to this: Text::CSV for Perl, for example.
However, if you're dead set on using regular expressions, I'd suggest you "borrow" from some sort of module, like this one:
http://metacpan.org/pod/Regexp::Common::balanced
