Remove Dashes but Not Hyphens - c#

I want to remove dashes before, after, and between spaced words, but not hyphenated words.
This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.
should become:
This is a test-sentence. Test One-Two--Three---Four.
Remove multiple dashes ---.
Keep multiple hyphens Three---Four.
I was trying to do it with this:
http://rextester.com/SXQ57185
string sentence = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
string regex = @"(?<!\w)\-(?!\-)|(?<!\-)\-(?!\w)";
sentence = Regex.Replace(sentence, regex, "");
Console.WriteLine(sentence);
But the output is:
This is a test-sentence. Test - One-TwoThree-Four--.

What I would recommend is a combination of a positive lookbehind and a positive lookahead against the characters that you don't want the dashes to be next to. In your case, that would be spaces and full stops. If either the lookbehind or the lookahead matches, you want to remove that dash.
This would be: ((?<=[\s\.])\-+)|(\-+(?=[\s\.])).
Breaking this down:
((?<=[\s\.])\-+) - match hyphens that follow either a space or a full stop
| - or
(\-+(?=[\s\.])) - match hyphens that are followed by either a space or a full stop
Here's a JavaScript example showcasing that:
const string = 'This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.';
const regex = /((?<=[\s\.])\-+)|(\-+(?=[\s\.]))/g;
console.log(string.replace(regex, ''));
And this can also be seen on Regex101.
Note that you'll probably also want to trim the excess spaces after using this, which can simply be done with .Trim() in C#.

You can use \b|\s for this task.
/(\b|\s)(-{3})(\b|\s)/g
DEMO
Breakdown shamelessly copied from regex101.com:
/(\b|\s)(-{3})(\b|\s)/g
1st Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
2nd Capturing Group (-{3})
-{3} matches the character - literally (case sensitive)
{3} Quantifier — Matches exactly 3 times
3rd Capturing Group (\b|\s)
1st Alternative \b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])

You may just match all hyphens in between word chars, and remove all others with a simple
Regex.Replace(s, @"\b(-+)\b|-", "$1")
See the regex demo
Details
\b(-+)\b - word boundary, followed with 1+ hyphens, and then again a word boundary (that is, hyphen(s) in between letters, digits and underscores)
| - or
- - a hyphen in other contexts (it will be removed).
See the C# demo:
var s = "This- -is - a test-sentence. -Test- --- One-Two--Three---Four----.";
var result = Regex.Replace(s, @"\b(-+)\b|-", "$1");
Console.WriteLine(result);
// => This is a test-sentence. Test One-Two--Three---Four.
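One detail worth spelling out: removing a free-standing dash leaves the spaces on both sides of it in place, so the raw result contains doubled spaces (for example between "is" and "a"). The sketch below applies the same \b(-+)\b|- replacement and then collapses whitespace runs. The DashCleaner class and Clean method are hypothetical names used only for illustration, and the whitespace step is an assumption about the desired output, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashCleaner
{
    // Hypothetical helper: keeps hyphen runs that sit between two word
    // characters and removes every other hyphen.
    public static string Clean(string input)
    {
        // "$1" re-inserts the captured hyphens; when the second alternative
        // matched, group 1 is empty, so the hyphen is simply dropped.
        string withoutDashes = Regex.Replace(input, @"\b(-+)\b|-", "$1");

        // Assumption: doubled spaces left behind by removed dashes are unwanted.
        return Regex.Replace(withoutDashes, @"\s{2,}", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("This- -is - a test-sentence. -Test- --- One-Two--Three---Four----."));
        // This is a test-sentence. Test One-Two--Three---Four.
    }
}
```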

Related

Regex to match a "global::" prefixed fully qualified C# type name

I'm trying to match fully qualified C# type names, but the + after \w+ captures too much:
global((::|\.)\w+(?!\s|\())+
Tried to play with quantifiers and negative lookahead but without success.
Online sandbox:
https://regex101.com/r/L6Y8kv/1
Sample:
public global::libebur128.EBUR128StateInternal D
{
get
{
var __result0 = global::libebur128.EBUR128StateInternal.__GetOrCreateInstance(((__Internal*)__Instance)->d, false);
return __result0;
}
Result:
global::libebur128.EBUR128StateInterna
global::libebur128.EBUR128StateInternal.__GetOrCreateInstanc
Expected:
global::libebur128.EBUR128StateInternal
global::libebur128.EBUR128StateInternal
For the example data, you might use:
\bglobal::[^\W_]+(?:\.[^\W_]+)*
The pattern matches:
\bglobal:: A word boundary, followed by matching global::
[^\W_]+ Match 1+ word characters excluding _
(?:\.[^\W_]+)* Optionally repeat matching . and 1+ word characters excluding _
See a regex101 demo.
If the last part should not be followed by ( and you don't want to take the underscore into account, you might add a word boundary and a negative lookahead:
\bglobal::\w+(?:\.\w+)*\b(?!\()
The pattern matches:
\b A word boundary
global:: Match literally
\w+ Match 1+ word chars
(?:\.\w+)* Optionally repeat . and 1+ word chars
\b A word boundary (to prevent backtracking to make the next assertion true)
(?!\() Negative lookahead, assert not ( directly to the right of the current position
regex101 demo
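As a rough C# sketch of the second pattern, the code below collects every match from a piece of source text. GlobalTypeNames and Find are hypothetical names chosen for this illustration, and the sample input is shortened from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlobalTypeNames
{
    // Hypothetical helper: returns each "global::"-prefixed name that is not
    // immediately followed by an opening parenthesis.
    public static string[] Find(string source)
    {
        return Regex.Matches(source, @"\bglobal::\w+(?:\.\w+)*\b(?!\()")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string source =
            "public global::libebur128.EBUR128StateInternal D\n" +
            "var __result0 = global::libebur128.EBUR128StateInternal.__GetOrCreateInstance(x, false);";

        foreach (string name in Find(source))
            Console.WriteLine(name);
        // global::libebur128.EBUR128StateInternal
        // global::libebur128.EBUR128StateInternal
    }
}
```

On the second line the engine first consumes the method name as well, then backtracks past it, because the trailing \b prevents it from stopping in the middle of an identifier.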

Get string after the last comma or the last number using Regex in C#

How can I get the string after the last comma or the last number using regex for these examples:
"Flat 1, Asker Horse Sports" -- get the string after ",", result: "Asker Horse Sports"
"9 Walkers Barn" -- get the string after "9", result: "Walkers Barn"
I need a regex that supports both cases, or two different regex rules, one per case.
I tried /,[^,]*$/ and (.*),[^,]*$ to get the strings after the last comma but no luck.
You can use
[^,\d\s][^,\d]*$
See the regex demo (and a .NET regex demo).
Details
[^,\d\s] - any char but a comma, digit and whitespace
[^,\d]* - any char but a comma and digit
$ - end of string.
In C#, you may also tell the regex engine to search for the match from the end of the string with the RegexOptions.RightToLeft option (this can make matching more efficient, although it might not be necessary here if the input strings are short):
var output = Regex.Match(text, @"[^,\d\s][^,\d]*$", RegexOptions.RightToLeft)?.Value;
You were on the right track with the capture group in (.*),[^,]*$, but the group should contain the part that you are looking for.
If there has to be a comma or digit present, you could match until the last occurrence of either of them, and capture what follows in the capturing group.
^.*[\d,]\s*(.+)$
^ Start of string
.* Match any char except a newline 0+ times
[\d,] Match either , or a digit
\s* Match 0+ whitespace chars
(.+) Capture group 1, match any char except a newline 1+ times
$ End of string
.NET regex demo | C# demo
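A small sketch that puts the two approaches side by side may help when choosing between them. TrailingText and its two methods are hypothetical names for illustration. Returning an empty string when nothing matches is also an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingText
{
    // Matches the final run of text that contains no comma and no digit,
    // scanning from the end of the string.
    public static string TrailingRun(string text)
    {
        return Regex.Match(text, @"[^,\d\s][^,\d]*$", RegexOptions.RightToLeft).Value;
    }

    // Requires a comma or digit to be present and captures what follows the last one.
    public static string AfterLastCommaOrDigit(string text)
    {
        Match m = Regex.Match(text, @"^.*[\d,]\s*(.+)$");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(TrailingRun("Flat 1, Asker Horse Sports"));   // Asker Horse Sports
        Console.WriteLine(AfterLastCommaOrDigit("9 Walkers Barn"));     // Walkers Barn
    }
}
```

The two differ when the input has neither a comma nor a digit: the first pattern still returns the whole trailing run, while the second finds no match.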

Regex replace special character

I need help in my regex.
I need to remove the special character found in the start of text
for example I have a text like this
.just a $#text this should not be incl#uded
The output should be like this
just a text this should not be incl#uded
I've been testing my regex here but I can't make it work
([\!-\/\;-\@]+)[\w\d]+
How do I limit the regex to check only the text that starts in special characters?
Thank you
Use \B[!-/;-@]+\s*\b:
var result = Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
See the regex demo
Details
\B - the position other than a word boundary (there must be start of string or a non-word char immediately to the left of the current position)
[!-/;-@]+ - 1 or more ASCII punctuation chars or symbols in the ranges ! to / and ; to @
\s* - 0+ whitespace chars
\b - a word boundary, there must be a letter/digit/underscore immediately to the right of the current location.
If you plan to remove all punctuation and symbols, use
var result = Regex.Replace(s, @"\B[\p{P}\p{S}]+\s*\b", "");
See another regex demo.
Note that \p{P} matches any punctuation symbols and \p{S} matches any symbols.
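To check both variants against the sample from the question, here is a minimal sketch. LeadingPunctuation and its methods are hypothetical names. The second range in the ASCII class is written as ;-@ here, since a character range has to run from a lower to a higher code point.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingPunctuation
{
    // ASCII variant: strips punctuation that starts a word, but not
    // punctuation inside a word (\B fails right after a word character).
    public static string StripAscii(string s)
    {
        return Regex.Replace(s, @"\B[!-/;-@]+\s*\b", "");
    }

    // Unicode variant: any punctuation (\p{P}) or symbol (\p{S}).
    public static string StripUnicode(string s)
    {
        return Regex.Replace(s, @"\B[\p{P}\p{S}]+\s*\b", "");
    }

    public static void Main()
    {
        string input = ".just a $#text this should not be incl#uded";
        Console.WriteLine(StripAscii(input));
        // just a text this should not be incl#uded
        Console.WriteLine(StripUnicode(input));
        // just a text this should not be incl#uded
    }
}
```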
Use a lookbehind:
(^[.$#]+|(?<= )[.$#]+)
The ^[.$#]+ part matches the special characters at the start of a line.
The (?<= )[.$#]+ part matches the special characters at the start of a word inside the sentence.
Add your special characters in the character group [] as you need.
Following are two possible options from your question details. Hope it will help you.
string input = ".just a $#text this should not be incl#uded";
// Remove all special characters from the whole string
string output1 = Regex.Replace(input, @"[^0-9a-zA-Z\ ]+", "");
// Remove leading special characters from each word in the string; other special characters are kept
var split = input.Split();
string output2 = string.Join(" ", split.Select(s => Regex.Replace(s, @"^[^0-9a-zA-Z]+", "")).ToArray());
Negative lookahead is fine here :
(?![\.\$#].*)[\S]+
https://regex101.com/r/i0aacp/11/
[\S]+ matches one or more non-whitespace characters
(?![\.\$#].*) is a negative lookahead: the run matched by [\S]+ must not start with any of . $ #

How to create a repeating non-capturing group?

I'm trying to create what I think is a repeating non-capturing group, and I just can't figure out how.
In plain words, I want to match:
Any number that is both:
Preceded by any number of blocks that contain no spaces and are not purely numeric.
Followed by any number of blocks that contain no spaces and are not purely numeric.
Here is what I tried:
Pattern: (?:\w.)+(\d+)(?:.\w+)+
Test Set:
3.AAA
AAA.BBB
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.4
AAA.3.BBB.4.CCC
AAA.3.BBB.CCC
AAA.3.BBB.CCC.4
AAA.3.BBB.CCC.4.DDD
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.4
ZZZ.AAA.3.BBB.4.CCC
ZZZ.AAA.3.BBB.CCC
ZZZ.AAA.3.BBB.CCC.4
ZZZ.AAA.3.BBB.CCC.4.DDD
I would want it to match only to:
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC
Note: I saw some other posts asking the same-ish question, but I can't use the answers because they were all like "Instead of trying to repeat a group, just match 'this' and it will work for your specific case".
Code
See regex in use here
^(?:(?!(?:\.|^)\d+\.)\S)+\.\d+\.(?:(?!\.\d+(?:\.|$))\S)+$
Results
Input
3.AAA
AAA.BBB
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.4
AAA.3.BBB.4.CCC
AAA.3.BBB.CCC
AAA.3.BBB.CCC.4
AAA.3.BBB.CCC.4.DDD
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.4
ZZZ.AAA.3.BBB.4.CCC
ZZZ.AAA.3.BBB.CCC
ZZZ.AAA.3.BBB.CCC.4
ZZZ.AAA.3.BBB.CCC.4.DDD
Output
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC
Explanation
^ Assert position at the start of the line
(?:(?!(?:\.|^)\d+\.)\S)+ Match the following one or more times
  (?!(?:\.|^)\d+\.) Negative lookahead ensuring what follows doesn't match
    (?:\.|^) Match either of the following
      \. Match a literal dot character .
      ^ Assert position at the start of the line
    \d+ Match one or more digits
    \. Match a literal dot character .
  \S Match any non-whitespace character
\. Match a literal dot character .
\d+ Match one or more digits
\. Match a literal dot character .
(?:(?!\.\d+(?:\.|$))\S)+ Match the following one or more times
  (?!\.\d+(?:\.|$)) Negative lookahead ensuring what follows doesn't match
    \. Match a literal dot character .
    \d+ Match one or more digits
    (?:\.|$) Match either of the following
      \. Match a literal dot character .
      $ Assert position at the end of the line
  \S Match any non-whitespace character
$ Assert position at the end of the line
There is a bit simpler solution:
^(?:(?!\d+\.)\w+\.)+\d+(?:\.(?!\d+(?=\.|$))\w+)+$
See the .NET regex demo (since it is a multiline demo, \r? has to be added before $, it is not necessary when matching standalone strings).
Details
^ - start of string
(?:(?!\d+\.)\w+\.)+ - 1 or more occurrences (due to (?:...)+) of any 1+ word chars (letters, digits, _ - due to \w+) that are not all digits followed with a dot (note that to match only letters and digits, you need to use [\w-[_]] or [^\W_] instead of \w, or if you are really after matching the blocks that may even have symbols or punctuation, replace \w with [^\s.] - any char but whitespace or dot)
\d+ - 1 or more digits
(?:\.(?!\d+(?=\.|$))\w+)+ - 1 or more occurrences of
\. - a dot
(?!\d+(?=\.|$)) - not followed with 1+ digits (\d+) followed with a dot or end of string
\w+ - 1 or more word chars
$ - end of string.
C# demo:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        var lst = new List<string> {"3.AAA", "AAA.BBB", "AAA.3.BBB", "AAA.3.B555B", "AAA.3.BBB.4",
            "AAA.3.BBB.4.CCC", "AAA.3.BBB.CCC", "AAA.3.BBB.CCC.4", "AAA.3.BBB.CCC.4.DDD",
            "ZZZ.AAA.3.BBB","ZZZ.AAA.3.BBB.4","ZZZ.AAA.3.BBB.4.CCC", "ZZZ.AAA.3.BBB.CCC",
            "ZZZ.AAA.3.BBB.CCC.4", "ZZZ.AAA.3.BBB.CCC.4.DDD"};
        // Blocks are matched with [^\s.]+ (any chars but whitespace or dot) instead of \w+
        var rx = new Regex(@"^(?:(?!\d+\.)[^\s.]+\.)+\d+(?:\.(?!\d+(?=\.|$))[^\s.]+)+$",
            RegexOptions.Compiled | RegexOptions.ECMAScript);
        foreach (var s in lst)
        {
            if (rx.IsMatch(s))
                Console.WriteLine(s);
        }
    }
}
Results:
AAA.3.BBB
AAA.3.B555B
AAA.3.BBB.CCC
ZZZ.AAA.3.BBB
ZZZ.AAA.3.BBB.CCC

Regex expression to match whole word with special characters not working ? [duplicate]

I was going through this question
C#, Regex.Match whole words
It says for match whole word use "\bpattern\b"
This works fine for match whole word without any special characters since it is meant for word characters only!
I need an expression to match words with special characters also. My code is as follows
class Program
{
static void Main(string[] args)
{
string str = Regex.Escape("Hi temp% dkfsfdf hi");
string pattern = Regex.Escape("temp%");
var matches = Regex.Matches(str, "\\b" + pattern + "\\b" , RegexOptions.IgnoreCase);
int count = matches.Count;
}
}
But it fails because of %. Do we have any workaround for this?
There can be other special characters like 'space','(',')', etc
If you have non-word characters then you cannot use \b. You can use the following
@"(?<=^|\s)" + pattern + @"(?=\s|$)"
Edit: As Tim mentioned in comments, your regex is failing precisely because \b fails to match the boundary between % and the white-space next to it because both of them are non-word characters. \b matches only the boundary between word character and a non-word character.
See more on word boundaries here.
Explanation
@"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
# Match either the regular expression below (attempting the next alternative only if this one fails)
^ # Assert position at the beginning of the string
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
)
temp% # Match the characters “temp%” literally
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
"
If the pattern can contain characters that are special to Regex, run it through Regex.Escape first.
This you did, but do not escape the string that you search through - you don't need that.
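Putting both pieces of advice together, a minimal sketch could look like this: only the search term goes through Regex.Escape, and the whitespace lookarounds replace \b. WholeWord and Count are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWord
{
    // Counts occurrences of `word` that are delimited by whitespace or by
    // the start/end of the string. The searched text itself is not escaped.
    public static int Count(string text, string word)
    {
        string pattern = @"(?<=^|\s)" + Regex.Escape(word) + @"(?=\s|$)";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("Hi temp% dkfsfdf hi", "temp%")); // 1
        Console.WriteLine(Count("Hi temp%x hi", "temp%"));        // 0
        Console.WriteLine(Count("call (a) now", "(a)"));          // 1
    }
}
```

This treats only whitespace as a delimiter, so a term followed directly by a comma or full stop would not count as a whole word. Whether that is acceptable depends on the input.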
output = Regex.Replace(output, "(?<!\w)-\w+", "")
output = Regex.Replace(output, " -"".*?""", "")
