I want to replace single occurrences of a character but not two in a string using C#.
For example, I want to replace & with an empty string, but not when the occurrence is &&. For instance, a&b&&c would become ab&&c after the replacement.
If I use a regex like &[^&], it will also match the character after the & and I don't want to replace it.
Another solution I found is to iterate over the string characters.
Do you know a cleaner solution to do that?
To only match one & (not preceded or followed by &), use look-arounds (?<!&) and (?!&):
(?<!&)&(?!&)
See regex demo
You tried to use a negated character class, which still matches (and consumes) a character. To only check for the absence or presence of some character without consuming it, you need a look-ahead or look-behind.
See regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u).
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt.
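To make the difference concrete, here is a minimal sketch comparing the negated character class from the question with the look-around pattern. The class and method names (LookaroundDemo, WithNegatedClass, WithLookarounds) are made up for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Negated character class: the character after '&' is part of the match,
    // so it is removed together with the '&'.
    public static string WithNegatedClass(string input) =>
        Regex.Replace(input, "&[^&]", "");

    // Look-arounds only assert what surrounds the '&'; they consume nothing,
    // so only the isolated '&' itself is removed.
    public static string WithLookarounds(string input) =>
        Regex.Replace(input, "(?<!&)&(?!&)", "");
}
```

For the input a&b&&c, WithNegatedClass returns a& (the b and c are lost), while WithLookarounds returns ab&&c.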
You can match both & and && (or any number of repetition) and only replace the single one with an empty string:
str = Regex.Replace(str, "&+", m => m.Value.Length == 1 ? "" : m.Value);
You can use this regex: @"(?<!&)&(?!&)"
var str = Regex.Replace("a&b&&c", @"(?<!&)&(?!&)", "");
Console.WriteLine(str); // ab&&c
You can go with this:
public static string replacement(string oldString, char charToRemove)
{
    string newString = "";
    bool found = false;
    foreach (char c in oldString)
    {
        // Skips only the first occurrence of charToRemove; later occurrences are kept
        if (c == charToRemove && !found)
        {
            found = true;
            continue;
        }
        newString += c;
    }
    return newString;
}
Which is as generic as possible
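Since the question asks to drop only occurrences that are not part of a pair, a loop-based variant would have to look at the neighbouring characters. The following is just a sketch of that idea; IsolatedCharRemover and RemoveIsolated are hypothetical names, not part of any answer above.

```csharp
using System.Text;

public static class IsolatedCharRemover
{
    // Removes every occurrence of target that has no identical neighbour
    // directly before or after it; runs of two or more are left untouched.
    public static string RemoveIsolated(string input, char target)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            bool prevSame = i > 0 && input[i - 1] == target;
            bool nextSame = i + 1 < input.Length && input[i + 1] == target;

            // An isolated target is skipped; everything else is copied.
            if (c == target && !prevSame && !nextSame) continue;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling IsolatedCharRemover.RemoveIsolated("a&b&&c", '&') returns ab&&c. A StringBuilder is used here instead of string concatenation to avoid allocating a new string per character.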
I would use something like this, which IMO should be better than using Regex:
public static class StringExtensions
{
    public static string ReplaceFirst(this string source, char oldChar, char newChar)
    {
        if (string.IsNullOrEmpty(source)) return source;
        int index = source.IndexOf(oldChar);
        if (index < 0) return source;
        var chars = source.ToCharArray();
        chars[index] = newChar;
        return new string(chars);
    }
}
I'll contribute to this statement from the comments:
in this case, only the substring with odd number of '&' will be replaced by all the "&" except the last "&" . "&&&" would be "&&" and "&&&&" would be "&&&&"
This is a pretty neat solution using balancing groups (though I wouldn't call it particularly clean nor easy to read).
Code:
string str = "11&222&&333&&&44444&&&&55&&&&&";
str = Regex.Replace(str, "&((?:(?<2>&)(?<-2>&)?)*)", "$1$2");
Output:
11222&&333&&44444&&&&55&&&&
ideone demo
It always matches the first & (not captured).
If it's followed by an even number of &, they're matched and stored in $1. The second group is captured by the first of the pair, but then it's subtracted by the second.
However, if there's an odd number of &, the optional group (?<-2>&)? does not match, and the group is not subtracted. Then, $2 will capture an extra &.
For example, matching the subject "&&&&", the first char is consumed and it isn't captured (1). The second and third chars are matched, but $2 is subtracted (2). For the last char, $2 is captured (3). The last 3 chars were stored in $1, and there's an extra & in $2.
Then, the substitution "$1$2" == "&&&&".
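For comparison, the same rule (drop one & from every run of odd length, leave even runs alone) can also be sketched with a match evaluator instead of balancing groups. This is an alternative technique, not the approach of the answer above; OddRunTrimmer and DropOneFromOddRuns are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class OddRunTrimmer
{
    // Matches each run of '&'. Runs of odd length lose one '&';
    // runs of even length are returned unchanged.
    public static string DropOneFromOddRuns(string input) =>
        Regex.Replace(input, "&+",
            m => m.Length % 2 == 1 ? m.Value.Substring(1) : m.Value);
}
```

With the input 11&222&&333&&&44444&&&&55&&&&& this returns 11222&&333&&44444&&&&55&&&&, the same output as the balancing-group pattern.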
Related
I am trying to make a catch exception for input string for example
if user enters
123test456
the program to say
The first 3 characters must be letter
so it should accept
wowTest456
You can use Take(), with .All() including letter condition,
var validString = inputString
    .Take(3)
    .All(x => Char.IsLetter(x));
You can solve it using Regex too. Credit to @JohnathanBarclay.
bool isInvalidString = Regex.IsMatch(inputString, @"^\d{3}");
Explanation:
^ : Matches the beginning of the string
\d: Matches the digit characters
{3} : Matches exactly 3 occurrences of the preceding token.
To turn this into a positive check, you might try \w instead of \d:
bool validString = Regex.IsMatch(inputString, @"^\w{3}");
However, \w matches digits and _ (underscore) as well as letters, so this would still accept 123test456. To allow only letters in the first three positions, use a character range like below:
bool validString = Regex.IsMatch(inputString, @"^[a-zA-Z]{3}");
Try Online
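One detail worth noting about the Take(3).All(...) approach: for an input shorter than three characters, Take(3) simply yields fewer items, so All can still return true. A hedged sketch that adds an explicit length check follows; PrefixValidator and StartsWithThreeLetters are hypothetical names.

```csharp
using System.Linq;

public static class PrefixValidator
{
    // True only if the input has at least three characters and the first
    // three are all letters. char.IsLetter also accepts non-ASCII letters.
    public static bool StartsWithThreeLetters(string input) =>
        input != null
        && input.Length >= 3
        && input.Take(3).All(c => char.IsLetter(c));
}
```

StartsWithThreeLetters("wowTest456") is true, while "123test456" and the two-character input "ab" are both rejected.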
I have this string
TEST_TEXT_ONE_20112017
I want to eliminate _20112017, which is an underscore followed by digits (the digits can vary); my goal is to have only
TEST_TEXT_ONE
So far I have this but I get the entire string, is there something I'm missing?
Regex r = new Regex(@"\b\w+[0-9]+\b");
MatchCollection words = r.Matches("TEST_TEXT_ONE_20112017");
foreach (Match word in words)
{
    string w = word.Groups[0].Value;
    //I still get the entire string
}
Notes for your consideration:
You should use parentheses to mark groups for capture, or use a named group. The first group (index=0) is the entire match; you probably want index=1 instead.
\w stands for word character and it already includes both underscore and digits. If you want to match anything before the numbers, then you should consider using . instead of \w.
By default + is greedy, so your \w+ will consume the last underscore and all but the very last digit as well. You probably want to explicitly require an underscore before the last block of digits.
I would suggest considering whether you want to find a matching substring or require the entire string to match. If the latter, then consider using the start and end markers: ^ and $.
If you know you want to eliminate exactly 8 digits, then you could give an explicit count like \d{8}.
For example this should work:
Regex r = new Regex(@"^(.+)_\d+$");
MatchCollection words = r.Matches("TEST_TEXT_ONE_20112017");
foreach (Match word in words)
{
    string w = word.Groups[1].Value;
}
Alternative
Use a zero-width positive lookahead assertion to check what comes next without including it in the match. The syntax is (?=stuff). This gives shorter code and avoids digging through Groups altogether:
Regex r = new Regex(@"^.+(?=_\d+$)");
String result = r.Match("TEST_TEXT_ONE_20112017").Value;
Note that we require the end marker $ within the positive lookahead group.
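If the goal is only to drop the trailing underscore-plus-digits block, the same anchoring idea can be turned around: match the suffix and replace it with nothing. This is a sketch of that variation; SuffixStripper and RemoveNumericSuffix are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class SuffixStripper
{
    // Removes a final "_<digits>" block; input without such a suffix
    // is returned unchanged.
    public static string RemoveNumericSuffix(string input) =>
        Regex.Replace(input, @"_\d+$", "");
}
```

RemoveNumericSuffix("TEST_TEXT_ONE_20112017") returns TEST_TEXT_ONE, and a string such as TEST_TEXT_ONE with no numeric suffix comes back as-is.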
Regex r = new Regex(@"(\b.+)_([0-9]+)\b");
String w = r.Match("TEST_TEXT_ONE_20112017").Groups[1].Value; //TEST_TEXT_ONE
or:
String w = r.Match("TEST_TEXT_ONE_20112017").Groups[2].Value; //20112017
This seems a bit overkill for Regex in my opinion. As an alternative you could just split on the _ character and rebuild the string:
private static string RemoveDate(string input)
{
    string[] parts = input.Split('_');
    return string.Join("_", parts.Take(parts.Length - 1));
}
Or if the date suffix is always the same length, you could also just substring:
private static string RemoveDateFixedLength(string input)
{
    //Removes last 9 characters (8 for date, 1 for underscore)
    return input.Substring(0, input.Length - 9);
}
However I feel like the first approach is better, this is just another option.
Fiddle here
I'm struggling to get this regex pattern exactly right, and am open to other options outside of regex if someone has a better alternative.
The situation:
I'm basically looking to parse a T-SQL "in" clause against a text column in C#. So, I need to take a string value like this:
"'don''t', 'do', 'anything', 'stupid'"
And interpret that as a list of values (I'll take care of the double single quotes later):
"don''t"
"do"
"anything"
"stupid"
I have a regex that works for most cases, but I'm struggling to generalize it to the point where it will accept any character OR a doubled-up single quote inside my group: (?:')([a-z0-9\s(?:'(?='))]+)(?:')[,\w]*
I'm fairly experienced with regexes, but have rarely, if ever, found a need for look-arounds (so downgrade my assessment of my regex experience accordingly).
So, to put this another way, I'm wanting to take a string of comma-delimited values, each enclosed in single quotes but can contain doubled single quotes, and output each such value.
EDIT
Here's a non-working example with my current regex (my problem is I need to handle all characters in my grouping and stop when I encounter a single quote not followed by a second single quote):
"'don''t', 'do?', 'anything!', '#stupid$'"
If you still think about a regex-based solution, you can use the following regex:
'(?:''|[^'])*'
Or an "un-rolled" version suggested by @sln:
'[^']*(?:''[^']*)*'
See demo
It is fairly simple: it matches either a doubled single quotation mark OR anything that is not a single quotation mark. There is no need for any look-behinds or look-aheads. It does not take care of any escaped entities, but I do not see this requirement in your question.
Moreover, this regex will return matches that are easy to access and deal with:
var text = "'don''t', 'do', 'anything', 'stupid'";
var re = new Regex(@"'[^']*(?:''[^']*)*'"); // Updated thanks to @sln, previous (@"'(?:''|[^'])*'");
var match_values = re.Matches(text).Cast<Match>().Select(p => p.Value).ToList();
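Output: a list of four values, each still wrapped in its enclosing quotes: 'don''t', 'do', 'anything', 'stupid'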
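Because the matches still include the enclosing quotes and the doubled quotes, some post-processing is usually needed. Below is a minimal sketch of that step built on the same un-rolled pattern; QuotedListParser and Parse are hypothetical names, and the unescaping rule ('' becomes ') is an assumption based on T-SQL quoting.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedListParser
{
    private static readonly Regex Term = new Regex(@"'[^']*(?:''[^']*)*'");

    // Returns each quoted value without its enclosing quotes,
    // with doubled single quotes collapsed to one.
    public static List<string> Parse(string text) =>
        Term.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value.Substring(1, m.Value.Length - 2).Replace("''", "'"))
            .ToList();
}
```

For the input 'don''t', 'do', 'anything', 'stupid' this yields the four strings don't, do, anything and stupid.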
If you want to use the Capture Collection feature, you can grab them all in a
single pass.
# @"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+"""
"
\s*
(?:
'
( # (1 start)
[^']*
(?:
'' [^']*
)*
) # (1 end)
'
\s*
(?:
, \s*
| (?= " )
)
)+
"
C# code:
string strSrc = "\"'don''t', 'do', 'anything', 'stupid'\"";
Regex rx = new Regex(@"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+""");
Match srcMatch = rx.Match(strSrc);
if (srcMatch.Success)
{
    CaptureCollection cc = srcMatch.Groups[1].Captures;
    for (int i = 0; i < cc.Count; i++)
        Console.WriteLine("{0} = '{1}'", i, cc[i].Value);
}
Output:
0 = 'don''t'
1 = 'do'
2 = 'anything'
3 = 'stupid'
Press any key to continue . . .
Why don't you split on ', ':
Regex regex = new Regex(@"'\s*,\s*'");
string[] substrings = regex.Split(str);
And then take care of the extra single quotes by Trimming
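A sketch of the split idea is shown below. Instead of trimming every part, it strips only the outermost pair of quotes before splitting, so a value that itself begins or ends with an escaped quote is not damaged. SplitParser and Parse are hypothetical names, and the sketch assumes well-formed input that starts and ends with a single quote.

```csharp
using System.Text.RegularExpressions;

public static class SplitParser
{
    // Assumes input like: 'a', 'b', 'c'
    // Doubled quotes inside values are left as-is for later unescaping.
    public static string[] Parse(string str)
    {
        string trimmed = str.Trim();
        if (trimmed.Length < 2) return new string[0];

        // Drop the first and last quote, then split on the quote-comma-quote separators.
        string inner = trimmed.Substring(1, trimmed.Length - 2);
        return Regex.Split(inner, @"'\s*,\s*'");
    }
}
```

For 'don''t', 'do', 'anything', 'stupid' this returns don''t, do, anything and stupid. Note that a value containing the sequence quote-comma-quote would be split incorrectly, so this is only suitable when values cannot contain that sequence.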
Looks to me like you're over-thinking the problem. A quoted string with an escaped quote looks just like two strings without escaped quotes, one right after the other (not even spaces between them).
(?:'[^']*')+
Of course, you'll have to remove the enclosing quotes, but you probably had to do some post-processing anyway, to unescape the escaped quotes.
Also note that I'm not trying to validate the input or work around possible errors; for example, I don't bother matching the commas between the strings. If the input is well formed, this regex should be all you need.
In the interest of maintainability, I decided against a regex and followed the advice of using a state machine. Here's the crux of my implementation:
string currentTerm = string.Empty;
State currentState = State.BetweenTerms;
int index = 0; // position of c in valueToParse (the declaration was missing from this excerpt)
foreach (char c in valueToParse)
{
    switch (currentState)
    {
        // if between terms, only need to do something if we encounter a single quote, signalling to start a new term
        // encloser is client-specified char to look for (e.g. ')
        case State.BetweenTerms:
            if (c == encloser)
            {
                currentState = State.InTerm;
            }
            break;
        case State.InTerm:
            if (c == encloser)
            {
                if (valueToParse.Length > index + 1 && valueToParse[index + 1] == encloser && valueToParse.Length > index + 2)
                {
                    // if next character is also encloser then add it and move on
                    currentTerm += c;
                }
                else if (currentTerm.Length > 0 && currentTerm[currentTerm.Length - 1] != encloser)
                {
                    // on an encloser and didn't just add encloser, so we are done
                    // converterFunc is a client-specified Func<string,T> to return terms in the specified type (to allow for converting to int, for example)
                    yield return converterFunc(currentTerm);
                    currentTerm = string.Empty;
                    currentState = State.BetweenTerms;
                }
            }
            else
            {
                currentTerm += c;
            }
            break;
    }
    index++;
}
if (currentTerm.Length > 0)
{
    yield return converterFunc(currentTerm);
}
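The excerpt above depends on members that are not shown (State, encloser, converterFunc). For readers who want something they can run, here is a simplified, self-contained variation of the same state-machine idea. It is not the author's implementation: QuotedTermReader and ReadTerms are hypothetical names, it returns plain strings instead of using a converter, it skips the second quote of a doubled pair by advancing the index, and it silently drops an unterminated final term.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuotedTermReader
{
    private enum State { BetweenTerms, InTerm }

    public static IEnumerable<string> ReadTerms(string valueToParse, char encloser = '\'')
    {
        var currentTerm = new StringBuilder();
        var state = State.BetweenTerms;

        for (int i = 0; i < valueToParse.Length; i++)
        {
            char c = valueToParse[i];
            if (state == State.BetweenTerms)
            {
                // Ignore separators; an encloser opens a new term.
                if (c == encloser) state = State.InTerm;
            }
            else if (c != encloser)
            {
                currentTerm.Append(c);
            }
            else if (i + 1 < valueToParse.Length && valueToParse[i + 1] == encloser)
            {
                // Doubled encloser: keep one literal encloser and skip the second.
                currentTerm.Append(encloser);
                i++;
            }
            else
            {
                // Lone encloser closes the term.
                yield return currentTerm.ToString();
                currentTerm.Clear();
                state = State.BetweenTerms;
            }
        }
    }
}
```

Enumerating ReadTerms("'don''t', 'do', 'anything', 'stupid'") yields don't, do, anything and stupid.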
Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this. See demo. Replace with an empty string. You can use a lookahead here to check whether the quoted text contains the word dark.
https://www.regex101.com/r/rG7gX4/12
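One caveat with the lookahead pattern: a quoted string that is skipped is not consumed, so its closing quote can later pair up with the opening quote of the next quoted string. The sketch below shows this on a made-up input and contrasts it with a callback-based replacement, which consumes every quoted string before deciding. QuotePairingDemo and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class QuotePairingDemo
{
    // Lookahead version: a quoted string containing "dark" is skipped,
    // but its closing quote remains available as a new starting point.
    public static string WithLookahead(string input) =>
        Regex.Replace(input, @"'(?![^']*\bdark\b)[^']*'", string.Empty);

    // Callback version: every quoted string is matched as a whole,
    // then kept or dropped depending on its content.
    public static string WithCallback(string input) =>
        Regex.Replace(input, @"'[^']*'",
            m => m.Value.Contains("dark") ? m.Value : string.Empty);
}
```

For the input a 'dark side' b 'x' c, WithLookahead returns a 'dark sidex' c (the text between the two quoted strings is removed instead), whereas WithCallback returns a 'dark side' b  c. With the sentence from the question the two behave the same, because no quoted string follows 'dark side'.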
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
    if (match.Value.Contains("dark"))
        return match.Value;
    // You can add more cases here
    return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
    ? match.Value
    : string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
Something like this would work. You can add all the strings you want to keep to the excludedString array:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
    var endIndex = test.IndexOf('\'', startIndex + 1);
    var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
    if (!excludedString.Contains(subString.Replace("'", "")))
    {
        test = test.Remove(startIndex, (endIndex - startIndex) + 1);
    }
    else
    {
        startIndex = endIndex + 1;
    }
}
Another method through regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1
DEMO
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
IDEONE
Explanation:
(...) is a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*') Because this pattern sits inside a capturing group, the matched characters are stored in group index 1.
| Next comes the regex alternation operator.
'[^']*' This matches all the remaining single-quoted strings (those that do not contain the word dark). It never matches a quoted string containing dark, because those were already matched by the pattern before the | alternation operator.
Finally, replacing every match with the contents of group index 1 gives the desired output.
I made this attempt, which I think is what you had in mind (a solution using Split, Contains, etc., without regex):
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
    string str = separated[i];
    str = str.Trim(); //trim the surrounding spaces
    if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
    {
        result += str + " "; // add space after each added string
    }
}
result = result.Trim(); //trim the trailing space again
I have text like this
Inc12345_Month
Ted12345_Month
J8T12345_Month
What I need to do is extract the 12345 and also remove everything before it. This will be done in C#
.+?(?=\d_Monthly) was working in a regex tester online but when I put it in my code it only returned 5_Month.
Edit: the 12345 could be a variable length so I cannot [0-9] multiple times.
Edit2: Code this was just to try and remove everything before the 12345
string text = /* the above text pulled in from a file */;
Regex reg = new Regex(@".+?(?=\d+_Monthly)");
text = reg.Replace(text, "");
You can use this function to strip it:
private static Regex getNumberAndBeyondRegex = new Regex(@"^.{2}\D+(\d.*)$", RegexOptions.Compiled);
public static string GetNumberAndBeyond(string input)
{
    var match = getNumberAndBeyondRegex.Match(input);
    if (!match.Success) throw new ArgumentException("String isn't in the correct format.", "input");
    return match.Groups[1].Value;
}
The regex at work is ^.{2}\D+(\d.*)$
It works by grabbing anything that's a number, after at least one character that isn't a number. It'll not only match _Month but also other endings.
The regex consists of a few parts:
^ matches the beginning of the string
.{2} matches any two characters, to prevent a digit from matching if it's the first or 2nd character, you can increase this number to be equal to the minimum prefix length - 1
\D+ matches at least one character that isn't a number
( starts capturing a group
\d.* matches one digit followed by everything after it
) closes the capturing group
$ matches the end of the string
There are a lot of different regex flavors, and many of them differ slightly in escaping, capturing, replacing, and other details.
For testing .NET regexes online I use the free version of RegexHero. It shows a popup every now and then, but it makes up for that by showing live results, capture groups, and instant replacing, next to having quite a lot of other features.
If you want to match anywhere within the string, you can use the regex \d+_Month, which is very similar to your original regex. In code:
new Regex(@"\d+_Month").Match(input).Value
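If only the number itself is wanted, without the _Month suffix, a capturing group around \d+ can be read from each match. The sketch below applies this to multi-line text; MonthIdExtractor and ExtractIds are hypothetical names, and the sample values in the usage note are invented.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MonthIdExtractor
{
    // Returns the digits that directly precede "_Month" for every
    // occurrence in the text (group 1 of each match).
    public static List<string> ExtractIds(string text) =>
        Regex.Matches(text, @"(\d+)_Month")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();
}
```

For the three lines Inc12345_Month, Ted678_Month and J8T9_Month this returns 12345, 678 and 9. The 8 in J8T is skipped because it is not directly followed by _Month.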
Edit:
Based on the format you supplied in the comment I've created a regex and function to parse the entire file name:
private static Regex parseFileNameRegex = new Regex(@"^.*\D(\d+)_Month_([a-zA-Z]+)\.(\w+)$", RegexOptions.Compiled);
public static bool TryParseFileName(string fileName, out int id, out string month, out string fileExtension)
{
    id = 0; month = null; fileExtension = null;
    if (fileName == null) return false;
    var match = parseFileNameRegex.Match(fileName);
    if (!match.Success) return false;
    if (!int.TryParse(match.Groups[1].Value, out id) || id < 1) return false; // Convert the ID into a number
    month = match.Groups[2].Value;
    fileExtension = match.Groups[3].Value;
    return true;
}
In the parse function it requires the ID to be at least 1, 0 isn't accepted (and negative numbers won't match the regex), if you don't want this restriction, simply remove || id < 1 from the function.
Using the function would look like:
int id; string month, fileExtension;
if (!TryParseFileName("CompanyName_ClientName12345_Month_Nov.pdf", out id, out month, out fileExtension))
    throw new FormatException("File name is incorrectly formatted."); // Do whatever you want when you get an invalid filename
// Use id, month and fileExtension here :)
The regex ^.*\D(\d+)_Month_([a-zA-Z]+)\.(\w+)$ works like:
^ matches the beginning of the string
.*\D matches any prefix (possibly empty) followed by one non-digit character, so the ID must be preceded by a non-digit
(\d+) captures at least 1 number, this is the ID
_Month_ is the literal text in between
([a-zA-Z]+) matches and captures at least 1 letter, this is the month
\. matches a . character
(\w+) matches and captures one or more word characters (letters, digits, and underscore), this is the file extension
$ matches the end of the string
Using :
Regex reg = new Regex(@"\D+(?=(\d+)_Monthly)");
is more explicit, the result is in Groups[1].
Part by part:
.+?
Lazily match one or more characters, i.e. as few as possible before the rest of the pattern can match.
(?=
start a positive lookahead (it checks what follows without consuming it)
\d
Match exactly 1 decimal digit, which explains what you are seeing: the rest of the number is matched by .+?, which is outside the lookahead
_Monthly
match the literal text
)
end lookahead
I think what you want is to consume everything up to the last non-digit before the number:
.*?\D(?=\d+_Monthly)
I guess you are missing the + sign after \d
.+?(?=\d+_Monthly)
This should ask for one or more digits.
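One thing to watch when such a pattern is used with Regex.Replace: Replace substitutes every match, not just the first. After the prefix has been removed, .+? can match a single digit as long as more digits followed by the suffix remain, so the number is eaten digit by digit. The sketch below illustrates this and one possible way around it, requiring a non-digit right before the lookahead. PrefixStripDemo and its method names are hypothetical, and _Month is used as the suffix to match the sample data in the question.

```csharp
using System.Text.RegularExpressions;

public static class PrefixStripDemo
{
    // Unanchored lazy pattern: after "Inc" is removed, "1", "2", "3" and "4"
    // each match in turn because digits plus "_Month" still follow them.
    public static string Unanchored(string input) =>
        Regex.Replace(input, @".+?(?=\d+_Month)", "");

    // Requiring a non-digit immediately before the lookahead means a match
    // can never end inside the number, so the digits survive.
    public static string StopAtLastNonDigit(string input) =>
        Regex.Replace(input, @".*?\D(?=\d+_Month)", "");
}
```

Unanchored("Inc12345_Month") returns 5_Month, while StopAtLastNonDigit returns 12345_Month for both Inc12345_Month and J8T12345_Month.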
If you don't need anything before the number, this should work:
(\d+_Month)
I use Derek Slager's regex tester when I'm working with C# regex.
Better dotnet regular expression tester