Building a regex, how to remove redundant line breaks? - c#

I have a string like this
"a a a a aaa b c d e f a g a aaa aa a a"
I want to turn it into either
"a b c d e f a g a"
or
"a b c d e f a g a "
(whichever's easier, it doesn't matter since it'll be HTML)
"a"s are line breaks ( \r\n ), in case that changes anything.

Generally your code should be:
s.replace(new RegExp("(\\S)(?:\\s*\\1)+","g"), "$1");
Check this fiddle.
However, depending on what the characters a, b, c, ... represent in your case, you might need to change \\S to another class, such as [^ ], and \\s to [ ], if you want \r and \n to be collapsed as well >>
s.replace(new RegExp("([^ ])(?:[ ]*\\1)+","g"), "$1");
Check this fiddle.
However, if a represents the string \r\n, then you need a slightly more complicated pattern >>
s.replace(new RegExp("(\\r\\n|\\S)(?:[^\\S\\r\\n]*\\1)+","g"), "$1");
Check this fiddle.

Went with this:
private string GetDescriptionFor(HtmlDocument document)
{
string description = CrawlUsingMetadata(XPath.ResourceDescription, document);
Regex regex = new Regex(@"(\r\n(?:[ ])*|\n(?:[ ])*){3,}", RegexOptions.Multiline | RegexOptions.IgnoreCase);//(?:[^\S\r\n|\n]*\1)+
string result = regex.Replace(description, "\n\n");
string decoded = HttpUtility.HtmlDecode(result);
return decoded;
}
It does, as it's supposed to, ignore all line breaks except cases where it matches three or more continuous line breaks, ignoring whitespace, and replaces those matches with \n\n.

If I understand the problem correctly, the goal is to remove duplicate copies of a specific character/string, possibly separated by spaces. You can do that by replacing matches of the regular expression (a\s*)+ with "a ": the + matches multiple consecutive copies, and a\s* matches an a followed by optional whitespace. How precisely you do that depends on the language: in Perl it's $str =~ s/(a\s*)+/a /g, in Ruby it's str.gsub(/(a\s*)+/, "a "), and so on.
The fact that a is actually \r\n shouldn't complicate things, but might mean that the replacement would work better as s/(\r\n[ \t]*)+/\r\n/g (since \s overlaps with \r and \n).
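Since the question is tagged C#, here is a minimal sketch of the same replacement using Regex.Replace. The class and method names (LineBreakCollapser, Collapse) are made up for illustration, and it assumes the line breaks are always \r\n and that only spaces and tabs may sit between them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakCollapser
{
    // Hypothetical helper: collapses every run of CRLF line breaks
    // (each optionally followed by spaces or tabs) into a single CRLF.
    public static string Collapse(string input)
    {
        return Regex.Replace(input, @"(\r\n[ \t]*)+", "\r\n");
    }

    public static void Main()
    {
        string input = "\r\n \r\n\r\nb\r\nc\r\n\t\r\nd";
        // Show the line breaks as <br> so the result is visible on one line.
        Console.WriteLine(Collapse(input).Replace("\r\n", "<br>"));
        // prints: <br>b<br>c<br>d
    }
}
```

Note that spaces or tabs directly after the last line break of a run are consumed as well, so indentation at the start of the following line is lost.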

If you need C# code and you want to collapse JUST \r\n strings with leading and trailing whitespaces, then the solution is pretty simple:
string result = Regex.Replace(input, @"\s*\r\n\s*", "\r\n");
Check this code here.

Try this one:
Regex.Replace(inputString, @"(\r\n\s+)", " ");

Related

Regex & C#: Replace all Special Characters except Emojis

I need to replace all special characters in a string except the following (which includes alphabetic characters):
:)
:P
;)
:D
:(
This is what I have now:
string input = "Hi there!!! :)";
string output = Regex.Replace(input, "[^0-9a-zA-Z]+", "");
This replaces all special characters. How can I modify this to not replace mentioned characters (emojis) but replace any other special character?
You may use a known technique: match and capture what you need and match only what you want to remove, and replace with the backreference to Group 1:
(:(?:[D()P])|;\))|[^0-9a-zA-Z\s]
Replace with $1. Note I added \s to the character class, but in case you do not need spaces, remove it.
See the regex demo
Pattern explanation:
(:(?:[D()P])|;\)) - Group 1 (what we need to keep):
:(?:[D()P]) - a : followed with either D, (, ) or P
| - or
;\) - a ;) substring
(here, you may extend the capture group with more |-separated branches).
| - or ...
[^0-9a-zA-Z\s] - match any char other than ASCII digits, letters (and whitespace, but as I mentioned, you may remove \s if you do not need to keep spaces).
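For reference, a short sketch of how that pattern could be applied in C#. The names EmoticonFilter and StripSpecials are invented for this example. It relies on the fact that in .NET a replacement reference to a group that did not participate in the match expands to an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmoticonFilter
{
    // Hypothetical helper: keeps the listed emoticons (captured in Group 1)
    // and removes every other character that is not an ASCII letter,
    // digit or whitespace.
    public static string StripSpecials(string input)
    {
        return Regex.Replace(input, @"(:[D()P]|;\))|[^0-9a-zA-Z\s]", "$1");
    }

    public static void Main()
    {
        Console.WriteLine(StripSpecials("Hi there!!! :)"));
        // prints: Hi there :)
    }
}
```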
I would use a RegEx to match all emojis and select them out of the text
string input = "Hi there!!! :)";
string output = string.Concat(Regex.Matches(input, "[;:][DP)(]+").Cast<Match>().Select(x => x.Value));
Pattern [;:][DP)(]+
[;:] starts with : or ;
[DP)(] ends with D, P, ) or (
+ one or more
(Inside a character class no | separators are needed; a | there would match a literal pipe character.)

Regex syntax in a C# application

I am trying to figure out how to replace all punctuation in a string with a space, while keeping one special character: '-'
For example, the sentence
"hi! I'm an out-of-the-box person, did you know ?"
should be transformed into
"hi I m an out-of-the-box person did you know "
I know the solution will be a single line Regex expression, but I'm really not used to "think" in Regex, so what I have tried so far is replacing all '-' by '9', then replacing all punctuation by ' ', then re-replacing all '9' by '-'. It works, but this is awful (especially if the input contains some '9' characters) :
string s = @"Hello! Hi want to remove all punctuations but not ' - ' signs ... Please help ;)";
s = s.Replace("-", "9");
s = Regex.Replace(s, @"[\W_]", " ");
s = s.Replace("9", "-");
So, can someone help me write a Regex that only catches punctuation other than '-'?
How about replacing matches for the following regex with a space:
[^\w\s-]|_
This says: any character that is not a word character (letter, digit or underscore), whitespace, or dash; or an underscore.
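A minimal sketch of using that pattern from C#, applied to the sentence from the question. The names PunctuationCleaner and ReplacePunctuation are assumptions for the example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationCleaner
{
    // Hypothetical helper: replaces every character that is not a word
    // character, whitespace or '-' (and also '_') with a single space.
    public static string ReplacePunctuation(string input)
    {
        return Regex.Replace(input, @"[^\w\s-]|_", " ");
    }

    public static void Main()
    {
        string text = "hi! I'm an out-of-the-box person, did you know ?";
        Console.WriteLine(ReplacePunctuation(text));
    }
}
```

Each punctuation mark becomes its own space, so a mark that was already next to a space leaves two spaces in a row. That makes no difference once rendered as HTML, but you may want a second pass that collapses runs of spaces otherwise.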
This regex should help. Use Character class subtraction to remove some character from character classes.
var expected = Regex.Replace(subject, @"[_\W-[\-\s]]", " ");
You can do this by using Linq:
var chars = s.Select(c => char.IsPunctuation(c) && c != '-' ? ' ' : c);
var result = new string(chars.ToArray());
Place everything you consider punctuation into a set [ ... ] and look for that as a single match character in a ( ... ) to be replaced. Here is an example where I seek to replace !, ., ,, ', and ?.
string text = "hi! I'm an out-of-the-box person, did you know ?";
Console.WriteLine (
Regex.Replace(text, "([!.,'?])", " ")
);
// result:
// hi I m an out-of-the-box person did you know
Update
For the regex purist who doesn't want to specify a set, one can use set subtraction. I still specify a set, which searches for any non-word character \W; that matches all punctuation, including the -. But by using set subtraction -[ ... ] we can list the - as a character to be excluded.
Here is that example
Regex.Replace(text, @"([\W-[-]])", " ")
// result:
// hi I m an out-of-the-box person did you know

Regexp skip pattern

Problem
I need to replace all asterisk symbols('*') with percent symbol('%'). The asterisk symbols in square brackets should be ignored.
Example
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "Hel[*o], w*rld!";
var output = Regex.Replace(input, "What_pattern_should_be_there?", "%");
Assert.AreEqual("Hel[*o], w%rld!", output);
}
Try using a look ahead:
\*(?![^\[\]]*\])
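A quick sketch of that look-ahead in C#, run against the input from the question. AsteriskLookahead and Replace are names chosen for this example. The pattern treats an asterisk as "inside brackets" whenever a ] follows it with no other bracket in between, so it assumes the brackets are balanced and not nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsteriskLookahead
{
    // Hypothetical helper: replaces '*' with '%' unless the asterisk is
    // followed by a ']' with no '[' or ']' in between.
    public static string Replace(string input)
    {
        return Regex.Replace(input, @"\*(?![^\[\]]*\])", "%");
    }

    public static void Main()
    {
        Console.WriteLine(Replace("Hel[*o], w*rld!"));
        // prints: Hel[*o], w%rld!
    }
}
```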
Here's a bit stronger solution, which takes care of [] blocks better, and even escaped \[ characters:
string text = @"h*H\[el[*o], w*rl\]d!";
string pattern = @"
\\. # Match an escaped character. (to skip over it)
|
\[ # Match a character class
(?:\\.|[^\]])* # which may also contain escaped characters (to skip over it)
\]
|
(?<Asterisk>\*) # Match `*` and add it to a group.
";
text = Regex.Replace(text, pattern,
match => match.Groups["Asterisk"].Success ? "%" : match.Value,
RegexOptions.IgnorePatternWhitespace);
If you don't care about escaped characters you can simplify it to:
\[ # Skip a character class
[^\]]* # until the first ']'
\]
|
(?<Asterisk>\*)
Which can be written without comments as: @"\[[^\]]*\]|(?<Asterisk>\*)".
To understand why it works we need to understand how Regex.Replace works: for every position in the string it tries to match the regex. If it fails, it moves one character. If it succeeds, it moves over the whole match.
Here, we have dummy matches for the [...] blocks so we may skip over the asterisks we don't want to replace, and match only the lonely ones. That decision is made in a callback function that checks if Asterisk was matched or not.
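Put together, the simplified variant might look like the sketch below. The class and method names (BracketAwareReplacer, Replace) are made up for illustration. It uses the Regex.Replace overload that takes a MatchEvaluator callback.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketAwareReplacer
{
    // Hypothetical helper: a whole [...] block is matched and returned
    // unchanged, while a '*' outside any block lands in the Asterisk group
    // and is replaced with '%'.
    public static string Replace(string input)
    {
        return Regex.Replace(
            input,
            @"\[[^\]]*\]|(?<Asterisk>\*)",
            m => m.Groups["Asterisk"].Success ? "%" : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(Replace("H*]e*l[*o], w*rl[*d*o] [o*] [o*o]."));
        // prints: H%]e%l[*o], w%rl[*d*o] [o*] [o*o].
    }
}
```

Because the bracketed block is consumed as a single match, asterisks in the middle of a block such as [o*o] are skipped as well, and a stray ] without an opening [ does not protect the asterisks before it.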
I couldn't come up with a pure RegEx solution. Therefore I am providing you with a pragmatic solution. I tested it and it works:
[Test]
public void Replace_all_asterisks_outside_the_square_brackets()
{
var input = "H*]e*l[*o], w*rl[*d*o] [o*] [o*o].";
var actual = ReplaceAsterisksNotInSquareBrackets(input);
var expected = "H%]e%l[*o], w%rl[*d*o] [o*] [o*o].";
Assert.AreEqual(expected, actual);
}
private static string ReplaceAsterisksNotInSquareBrackets(string s)
{
Regex rx = new Regex(@"(?<=\[[^\[\]]*)(?<asterisk>\*)(?=[^\[\]]*\])");
var matches = rx.Matches(s);
s = s.Replace('*', '%');
foreach (Match match in matches)
{
s = s.Remove(match.Groups["asterisk"].Index, 1);
s = s.Insert(match.Groups["asterisk"].Index, "*");
}
return s;
}
EDITED
Okay here is my final attempt ;)
Using negative lookbehind (?<!) and negative lookahead (?!).
var output = Regex.Replace(input, @"(?<!\[)\*(?!\])", "%");
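This also passes the test in the comment to another answer, "Hel*o], w*rld!". Note that it only looks at the characters immediately next to the asterisk, so an asterisk in the middle of a bracketed group such as [o*o] is still replaced.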

How to Replace a word in a String with a blank in C#?

Is there an algorithm that replaces a word in a string with a blank?
I'm having problems with this.
My scenario is replacing a word with a blank, but some words also occur inside other words, so the replacement also hits that part of the longer word.
Ex. Hello Ello llo lo o
if I replace llo with _
output:
He___ E__ ____ lo o
I think you should have some luck with Regex.Replace and using the word boundary special character (\b) to indicate word boundaries. I'm no expert, though. But do try that.
Example:
string input = "llo Hello Ello llo lo o llo, and oh hello llo! Look, a yellow llo.";
string output = Regex.Replace(input, @"\bllo\b", "___");
Console.WriteLine(output);
Output:
___ Hello Ello ___ lo o ___, and oh hello ___! Look, a yellow ___.
Note this caught the "llo" at the beginning, as well as those followed by punctuation ("llo,", "llo!", and "llo.") without erroneously stripping out those that were part of other words ("Hello", "Ello", "yellow").
Edit: The following is in response to your comment. My understanding is that you have two TextBoxes, preview and question, and a ListBox, chosenwordlist. I believe this is what you want to do:
// Start by setting preview to the same text as question.
preview.Text = question.Text;
for (int i = 0; i < chosenwordlist.Items.Count; ++i)
{
string word = chosenwordlist.Items[i].ToString();
// Notice the verbatim literal (@) for the SECOND "\b" as well.
string pattern = @"\b"+ word + @"\b";
// Since the text in your question TextBox isn't changing, you need to base each
// replacement off of what's in the preview TextBox, which IS (changing).
preview.Text = Regex.Replace(preview.Text, pattern, "__________");
}
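One thing to watch when the words come from user input: characters such as . or + have a special meaning in a regex. Below is a sketch of the same loop with each word passed through Regex.Escape first. WordBlanker and BlankOut are hypothetical names, and the list of words stands in for the ListBox items.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordBlanker
{
    // Hypothetical helper: replaces each whole-word occurrence of the given
    // words with "___", treating the words as literal text.
    public static string BlankOut(string text, IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            // Regex.Escape turns metacharacters such as '.' into literals.
            string pattern = @"\b" + Regex.Escape(word) + @"\b";
            text = Regex.Replace(text, pattern, "___");
        }
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(BlankOut("Hello Ello llo lo o", new[] { "llo" }));
        // prints: Hello Ello ___ lo o
        Console.WriteLine(BlankOut("a.b axb a.b", new[] { "a.b" }));
        // prints: ___ axb ___
    }
}
```

Without the escaping, the pattern for "a.b" would also match "axb", because an unescaped . matches any character. Keep in mind that \b only works as expected when the word starts and ends with a word character.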
You can use regular expressions
http://www.regular-expressions.info/dotnet.html
I hope I have understood you correctly. Using Regex you can achieve this:
Regex myRegex = new Regex(@"(?<![a-z])[a-z](?![a-z])", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Debug.WriteLine(myRegex.Replace("test-a-b-c-test", "!"));
Will print: test-!-!-!-test
From the original string: test-a-b-c-test

RegEx Problem using .NET

I have a little problem with a RegEx pattern in C#. Here are the rules:
input: 1234567
expected output: 123/1234567
Rules:
Get the first three digits of the input. //123
Add /
Append the original input. //123/1234567
The expected output should look like this: 123/1234567
Here's my regex pattern:
Regex rx = new Regex(@"((\w{1,3})(\w{1,7}))");
but the output is incorrect: 123/4567
I think this is what you're looking for:
string s = @"1234567";
s = Regex.Replace(s, @"(\w{3})(\w+)", @"$1/$1$2");
Instead of trying to match part of the string, then match the whole string, just match the whole thing in two capture groups and reuse the first one.
It's not clear why you need a RegEx for this. Why not just do:
string x = "1234567";
string result = x.Substring(0, 3) + "/" + x;
Another option is:
string s = Regex.Replace("1234567", @"^\w{3}", "$&/$&");
That would match 123 and replace it with 123/123, leaving the tail 4567 in place.
^\w{3} - Matches the first 3 characters.
$& - replace with the whole match.
You could also do @"^(\w{3})", "$1/$1" if you are more comfortable with it; it is better known.
Use positive look-ahead assertions, as they don't 'consume' characters in the current input stream, while still capturing input into groups:
Regex rx = new Regex(@"(?=(?'group1'\w{1,3}))(?=(?'group2'\w{1,7}))");
group1 should be 123, group2 should be 1234567.
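A sketch of how the look-ahead idea could be used to build the output, with the capturing groups placed inside the assertions. The names PrefixFormatter, Format, first and whole are assumptions for this example. The match itself is zero-length; only the groups carry text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixFormatter
{
    // Hypothetical helper: both look-aheads start at the same position, so
    // "first" captures up to three characters and "whole" up to seven,
    // without either one consuming the input.
    public static string Format(string input)
    {
        Match m = Regex.Match(input, @"(?=(?<first>\w{1,3}))(?=(?<whole>\w{1,7}))");
        return m.Success
            ? m.Groups["first"].Value + "/" + m.Groups["whole"].Value
            : input;
    }

    public static void Main()
    {
        Console.WriteLine(Format("1234567"));
        // prints: 123/1234567
    }
}
```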
