Regex syntax in a C# application - c#

I am trying to figure out how to replace all punctuation in a string with a space while keeping one special character: '-'
For example, the sentence
"hi! I'm an out-of-the-box person, did you know ?"
should be transformed into
"hi I m an out-of-the-box person did you know "
I know the solution will be a single-line Regex expression, but I'm really not used to "thinking" in Regex. What I have tried so far is replacing all '-' with '9', then replacing all punctuation with ' ', then replacing all '9' back with '-'. It works, but it is awful (especially if the input contains '9' characters):
string s = @"Hello! Hi want to remove all punctuations but not ' - ' signs ... Please help ;)";
s = s.Replace("-", "9");
s = Regex.Replace(s, @"[\W_]", " ");
s = s.Replace("9", "-");
So, can someone help me write a Regex that only catches punctuation other than '-'?

How about replacing matches for the following regex with a space:
[^\w\s-]|_
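This matches any character that is not a word character (letters, digits, and underscore), whitespace, or a dash; the |_ alternative additionally matches the underscore, which \w would otherwise protect.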
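As a minimal sketch of how this pattern could be applied to the sentence from the question (the variable names are illustrative, and the top-level statements assume C# 9 or later):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "hi! I'm an out-of-the-box person, did you know ?";

// Replace every character that is not a word character, whitespace or '-',
// plus the underscore, with a single space.
string output = Regex.Replace(input, @"[^\w\s-]|_", " ");

// Brackets make the trailing spaces visible.
Console.WriteLine($"[{output}]");
```

Each punctuation mark becomes a space of its own, so a mark that was already next to a space leaves two consecutive spaces behind.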

This regex should help. Use character class subtraction to remove some characters from a character class.
var expected = Regex.Replace(subject, @"[_\W-[\-\s]]", " ");

You can do this by using Linq:
var chars = s.Select(c => char.IsPunctuation(c) && c != '-' ? ' ' : c);
var result = new string(chars.ToArray());

Place everything you consider punctuation into a set [ ... ] and match it as a single character inside a group ( ... ) to be replaced. Here is an example where I replace !, ., ,, ', and ?.
string text = "hi! I'm an out-of-the-box person, did you know ?";
Console.WriteLine (
Regex.Replace(text, "([!.,'?])", " ")
);
// result:
// hi I m an out-of-the-box person did you know
Update
For the regex purist who doesn't want to list a set explicitly, one can use set subtraction. I still specify a set, \W, which matches any non-word character, including the -. But with the set subtraction syntax -[ ... ] we can name the - as a character to exclude.
Here is that example
Regex.Replace(text, @"([\W-[-]])", " ")
// result:
// hi I m an out-of-the-box person did you know

Related

How do I remove specific character before and after single quote using regex

I have a text string containing single quotes, and I'd like to use a regular expression to remove the outer parentheses while keeping the ones next to the single quotes. Could anyone suggest a way? Thank you.
For example,
I have (name equal '('John')') and the result I expect is name equal '('John')'
// Using Regex
string input = "(name equal '('John')')";
Regex rx = new Regex(@"^\((.*?)\)$");
Console.WriteLine(rx.Match(input).Groups[1].Value);
// Using Substring method
string input = "(name equal '('John')')";
var result = input.Substring(1, input.Length - 2);
Console.WriteLine(result);
Result:
name equal '('John')'
Try this:
var replaced = Regex.Replace("(name equal '('John')')", @"\((.+?'\)')\)", "${1}");
The Regex class is in the System.Text.RegularExpressions namespace.
Use a negative lookbehind (?<! ) and a negative lookahead (?! ), which reject a match when the parenthesis is adjacent to a ', such as
(?<!')\(|\)(?!')
The example explains it as a comment:
string pattern =
@"
(?<!')\( # Match an open paren that does not have a tick behind it
| # or
\)(?!') # Match a closed paren that does not have a tick after it
";
var text = "(name equal '('John')')";
// Ignore Pattern whitespace allows us to comment the pattern ONLY, does not affect processing.
var final = Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnorePatternWhitespace);
Result
name equal '('John')'

Regex & C#: Replace all Special Characters except Emojis

I need to replace all special characters in a string except the following (which includes alphabetic characters):
:)
:P
;)
:D
:(
This is what I have now:
string input = "Hi there!!! :)";
string output = Regex.Replace(input, "[^0-9a-zA-Z]+", "");
This replaces all special characters. How can I modify this to not replace mentioned characters (emojis) but replace any other special character?
You may use a known technique: match and capture what you need to keep, match (without capturing) what you want to remove, and replace with a backreference to Group 1:
(:(?:[D()P])|;\))|[^0-9a-zA-Z\s]
Replace with $1. Note I added \s to the character class, but in case you do not need spaces, remove it.
Pattern explanation:
(:(?:[D()P])|;\)) - Group 1 (what we need to keep):
:(?:[D()P]) - a : followed with either D, (, ) or P
| - or
;\) - a ;) substring
(here, you may extend the capture group with more |-separated branches).
| - or ...
[^0-9a-zA-Z\s] - match any char other than ASCII digits, letters (and whitespace, but as I mentioned, you may remove \s if you do not need to keep spaces).
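A minimal C# sketch of the capture-and-replace technique described above, using the pattern as given and the sample input from the question (the variable names are illustrative; top-level statements assume C# 9 or later):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Hi there!!! :)";

// Group 1 captures the emoticons to keep; the second branch matches
// every other special character without capturing it.
string pattern = @"(:(?:[D()P])|;\))|[^0-9a-zA-Z\s]";

// "$1" puts the emoticon back when Group 1 matched and is empty otherwise,
// so everything matched by the second branch is removed.
string output = Regex.Replace(input, pattern, "$1");

Console.WriteLine(output);
```

The order of the branches matters: the emoticon branch comes first, so a ":" that starts an emoticon is captured before the catch-all class can consume it.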
I would use a Regex to match all emojis and select them out of the text
string input = "Hi there!!! :)";
string output = string.Concat(Regex.Matches(input, "[;:][DP)(]+").Cast<Match>().Select(x => x.Value));
Pattern [;:][DP)(]+
[;:] starts with : or ;
[DP)(] followed by D, P, ) or (
+ one or more

Replace Single WhiteSpace without Replacing Multiple WhiteSpace

I have a string in the format:
abc def ghi    xyz
I would like to end with it in format:
abcdefghi xyz
What is the best way to do this? In this particular case, I could just strip off the last three characters, remove spaces, and then add them back at the end, but this won't work for cases in which the multiple spaces are in the middle of the string.
In short, I want to remove all single whitespaces, and then replace all multiple whitespaces with a single one. Each of those steps is easy enough by itself, but combining them seems a bit less straightforward.
I'm willing to use regular expressions, but I would prefer not to.
This approach uses regular expressions but hopefully in a way that's still fairly readable. First, split your input string on multiple spaces:
var pattern = @" {2,}"; // match two or more spaces
var groups = Regex.Split(input, pattern);
Next, remove the (individual) spaces from each token:
var tokens = groups.Select(group => group.Replace(" ", String.Empty));
Finally, join your tokens with single spaces:
var result = String.Join(" ", tokens.ToArray());
This example uses a literal space character rather than 'whitespace' (which includes tabs, linefeeds, etc.) - substitute \s for ' ' if you need to split on multiple whitespace characters rather than actual spaces.
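Put together, the three steps might look like the following sketch. It assumes the sample input has four spaces before xyz; the variable names are illustrative, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed sample: single spaces between abc, def, ghi and four spaces before xyz.
string input = "abc def ghi    xyz";

// Step 1: split on runs of two or more spaces.
string[] groups = Regex.Split(input, @" {2,}");

// Step 2: remove the remaining (single) spaces inside each piece.
var tokens = groups.Select(g => g.Replace(" ", string.Empty));

// Step 3: join the pieces with one space each.
string result = string.Join(" ", tokens);

Console.WriteLine(result);
```

Splitting first means each run of multiple spaces is turned into exactly one separator, regardless of how long the run was.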
Well, Regular Expressions would probably be the fastest here, but you could implement some algorithm that uses a lookahead for single spaces and then replaces multiple spaces in a loop:
// Remove all single whitespaces (and the last space of each longer run)
for (int i = 0; i < sourceString.Length; i++)
{
    if (sourceString[i] == ' ')
    {
        if (i < sourceString.Length - 1 && sourceString[i + 1] != ' ')
            sourceString = sourceString.Remove(i, 1);
    }
}
// Replace multiple whitespaces
while (sourceString.Contains("  ")) // Two spaces here!
    sourceString = sourceString.Replace("  ", " ");
But hey, that code is pretty ugly and slow compared to a proper regular expression...
For a Non-REGEX option you can use:
string str = "abc def ghi    xyz";
var result = str.Split(); //This will remove single spaces from the result
StringBuilder sb = new StringBuilder();
bool ifMultipleSpacesFound = false;
for (int i = 0; i < result.Length;i++)
{
if (!String.IsNullOrWhiteSpace(result[i]))
{
sb.Append(result[i]);
ifMultipleSpacesFound = false;
}
else
{
if (!ifMultipleSpacesFound)
{
ifMultipleSpacesFound = true;
sb.Append(" ");
}
}
}
string output = sb.ToString();
The output would be:
output = "abcdefghi xyz"
Here's an approach which uses some fairly subtle logic:
public static string RemoveUnwantedSpaces(string text)
{
var sb = new StringBuilder();
char lhs = '\0';
char mid = '\0';
foreach (char rhs in text)
{
if (rhs != ' ' || (mid == ' ' && lhs != ' '))
sb.Append(rhs);
lhs = mid;
mid = rhs;
}
return sb.ToString().Trim();
}
How it works:
We will examine each possible three-character subsequence linearly across the string (in a kind of three-character sliding window). These three characters will be represented, in order, by the variables lhs, mid and rhs.
For each rhs character in the string:
If it's not a space we should output it.
If it is a space, and the previous character was also space but the one before that isn't, then this is the second in a sequence of at least two spaces, and therefore we should output one space.
Otherwise, don't output a space because this is either the first or the third (or later) space in a sequence of two or more spaces and in either case we don't want to output a space: If this happens to be the first in a sequence of two or more spaces, a space will be output when the second space comes along. If this is the third or later, we've already output a space for it.
The subtlety here is that I've avoided special casing the beginning of the sequence by initialising the lhs and mid variables with non-space characters. It doesn't matter what those values are, as long as they are not spaces, but I made them \0 to indicate that they are special values.
After second thought here is a one-line regex solution:
Regex.Replace("abc def ghi    xyz", "( )( )*([^ ])", "$2$3");
The result of this is "abcdefghi xyz".
ORIGINAL ANSWER:
Two-line regex solution:
var tmp = Regex.Replace("abc def ghi    xyz", "( )([^ ])", "$2");
tmp is "abcdefghi   xyz"
then:
var result = Regex.Replace(tmp, "( )+", " ");
result is "abcdefghi xyz"
Explanation:
The first line of code removes single whitespaces and removes one whitespace from each run of multiple whitespaces (so there are 3 spaces in tmp between the letters i and x).
The second line just replaces multiple whitespaces with one.
In-depth explanation of the first line:
We match the input string against a regex that matches one space followed by a non-space character. We also put these two characters in separate groups (we use ( ) for anonymous grouping).
So for the "abc def ghi    xyz" string we have these matches and groups:
match: " d" group1: " " group2: "d"
match: " g" group1: " " group2: "g"
match: " x" group1: " " group2: "x"
We use the substitution syntax of the Regex.Replace method to replace each match with the content of the second group (which is the non-whitespace character).

Getting the substring after a character in C# using regex

I have the following input string:
string val = "[01/02/70]\nhello world ";
I want to get all the text after the last ] character.
Example output for a sample string above:
\nhello world
In C#, use Substring() with IndexOf:
val = val.Substring(val.IndexOf(']') + 1);
If you have multiple ] symbols, and you want to get everything after the last one, use LastIndexOf:
string val = "[01/02/70]\nhello [01/02/80] world ";
val = val.Substring(val.LastIndexOf(']') + 1); // => " world "
If you are a fan of Regex, you might want to use a Regex.Replace like
string val = "[01/02/70]\nhello [01/02/80] world ";
val = Regex.Replace(val, @"^.*\]", string.Empty, RegexOptions.Singleline); // => " world "
Notes on REGEX:
RegexOptions.Singleline makes . match a linebreak
^ - matches beginning of string
.* - matches 0 or more characters but as many as possible (greedy matching)
\] - matches a literal ] (escaping it is not strictly required outside a character class, but it makes the intent explicit).
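To see why the Singleline option matters here, the following sketch runs the same pattern with and without it on the multi-bracket sample (the variable names are illustrative; top-level statements assume C# 9 or later):

```csharp
using System;
using System.Text.RegularExpressions;

string val = "[01/02/70]\nhello [01/02/80] world ";

// With Singleline, '.' also matches '\n', so the greedy .* reaches the last ']'.
string withSingleline = Regex.Replace(val, @"^.*\]", string.Empty, RegexOptions.Singleline);

// Without it, .* stops at the first line break, so only the first bracket group is removed.
string withoutSingleline = Regex.Replace(val, @"^.*\]", string.Empty);

Console.WriteLine($"[{withSingleline}]");
Console.WriteLine($"[{withoutSingleline}]");
```

Without the option, the remainder still begins with the line break and keeps the second bracketed date.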
You need to use a lookbehind assertion. You also have to enable the DOTALL modifier, so that the pattern matches the newline characters in between as well.
"(?s)(?<=\\]).*"
(?s) - DOTALL modifier.
(?<=\\]) - lookbehind which asserts that the match must be preceded by a close bracket
.* - Matches any character zero or more times.
or
"(?s)(?<=\\])[\\s\\S]*"
Try this if you don't want to match the following newline character.
@"(?<=\][\n\r]*).*"

Building a regex, how to remove redundant line breaks?

I have a string like this
"a a a a aaa b c d e f a g a aaa aa a a"
I want to turn it into either
"a b c d e f a g a"
or
"a b c d e f a g a "
(whichever's easier, it doesn't matter since it'll be HTML)
"a"s are line breaks ( \r\n ), in case that changes anything.
Generally your code should be:
s.replace(new RegExp("(\\S)(?:\\s*\\1)+","g"), "$1");
But depending on what the characters a, b, c, ... represent in your case, you might need to change \\S to another class, such as [^ ], and \\s to [ ], if you want \r and \n to be collapsed as well:
s.replace(new RegExp("([^ ])(?:[ ]*\\1)+","g"), "$1");
However, if a is going to represent the string \r\n, then you would need a slightly more complicated pattern:
s.replace(new RegExp("(\\r\\n|\\S)(?:[^\\S\\r\\n]*\\1)+","g"), "$1");
Went with this:
private string GetDescriptionFor(HtmlDocument document)
{
string description = CrawlUsingMetadata(XPath.ResourceDescription, document);
Regex regex = new Regex(@"(\r\n(?:[ ])*|\n(?:[ ])*){3,}", RegexOptions.Multiline | RegexOptions.IgnoreCase);//(?:[^\S\r\n|\n]*\1)+
string result = regex.Replace(description, "\n\n");
string decoded = HttpUtility.HtmlDecode(result);
return decoded;
}
It does, as it's supposed to, ignore all line breaks except cases where it matches three or more continuous line breaks, ignoring whitespace, and replaces those matches with \n\n.
If I understand the problem correctly, the goal is to remove duplicate copies of a specific character/string, possibly separated by spaces. You can do that by replacing matches of the regular expression (a\s*)+ with "a ": the + covers multiple consecutive copies, and a\s* matches an a followed by optional whitespace. How precisely you do that depends on the language: in Perl it's $str =~ s/(a\s*)+/a /g, in Ruby it's str.gsub(/(a\s*)+/, "a "), and so on.
The fact that a is actually \r\n shouldn't complicate things, but might mean that the replacement would work better as s/(\r\n[ \t]*)+/\r\n/g (since \s overlaps with \r and \n).
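In C#, the \r\n variant suggested above could be sketched as follows. The sample string and variable names are made up for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: three line breaks in a row, the first followed by a stray space.
string input = "x\r\n \r\n\r\ny";

// One or more "\r\n plus optional spaces/tabs" units collapse into a single line break.
string collapsed = Regex.Replace(input, @"(\r\n[ \t]*)+", "\r\n");

// Escape the line breaks so the result is visible on one console line.
Console.WriteLine(collapsed.Replace("\r\n", "\\r\\n"));
```

Using [ \t] instead of \s keeps the repeated group from swallowing the \r\n of the next unit through the whitespace class, so each line break is matched by the explicit \r\n.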
If you need C# code and you want to collapse JUST \r\n strings with leading and trailing whitespaces, then the solution is pretty simple:
string result = Regex.Replace(input, @"\s*\r\n\s*", "\r\n");
Try this one:
Regex.Replace(inputString, @"(\r\n\s+)", " ");
