Regex for retaining numbers in the replacement of group containing numbers - c#

Regarding the possible dupe post: Replace only some groups with Regex
This is not a dupe as the post replaces the group with static text, what I want is to replace the group by retaining the text in the group.
I have some texts which contain pattern like:
\super 1 \nosupersub
\super 2 \nosupersub
...
\super 592 \nosupersub
I want to replace them using regex such that they become:
<sup>1</sup>
<sup>2</sup>
...
<sup>592</sup>
So, I am using the following regex (note the group (\d+)):
RegexOptions options = RegexOptions.Multiline; //as of v1.3.1.0 default is multiline
mytext = Regex.Replace(mytext, @"\s?\\super\s?(\d+)\s?\\nosupersub\s", @"<sup>\1</sup>", options);
However, instead of getting what I want, I got all the results replaced with <sup>\1</sup>:
<sup>\1</sup>
<sup>\1</sup>
...
<sup>\1</sup>
If I try the regex replacement using a text editor like https://www.sublimetext.com and also using Python, it is OK.
How to get such group replacement of (\d+) like that (retain the number) in C#?

Many regex tools use the \1 notation to refer to a group's value in the replacement pattern (same in syntax to a backreference). For whatever reason, Microsoft chose to instead use $1 for the notation in the .NET implementation of regex. Note that backreferences still use the \1 syntax in .NET. It's only the syntax in the replacement pattern which is different. See the Substitutions section of this page for more info.
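To make the difference concrete, here is a minimal sketch of the replacement from the question written with $1. The names SuperscriptDemo and ToSup are made up for illustration, and the pattern is a slightly simplified version of the one in the question, so treat it as a starting point rather than a drop-in fix.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SuperscriptDemo
{
    // Hypothetical helper: rewrites "\super N \nosupersub" as "<sup>N</sup>".
    public static string ToSup(string text)
    {
        // Verbatim (@) strings keep the backslashes intact for the regex engine.
        // In the replacement string, $1 (not \1) refers to the first capture group.
        return Regex.Replace(text, @"\\super\s?(\d+)\s?\\nosupersub", "<sup>$1</sup>");
    }
}
```

Calling SuperscriptDemo.ToSup(@"\super 592 \nosupersub") should give <sup>592</sup>. If the group reference is ever followed directly by a digit in the replacement, writing it as ${1} avoids ambiguity.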

I haven't tested this code and wrote it from memory so this might not work but the general idea is there.
Why use regex at all?
List<string> output = new List<string>();
foreach (string line in myText.Split(new string[] { Environment.NewLine }, StringSplitOptions.None))
{
string alteredLine = line.Replace(@"\super", "").Replace(@"\nosupersub", "").Trim();
int n;
if (Int32.TryParse(alteredLine, out n))
{
output.Add("<sup>" + n + "</sup>");
}
else
{
//Add the original input in case it failed?
output.Add(line);
}
}
or for a linq version:
myText = string.Join(Environment.NewLine,
myText.Split(new string[] { Environment.NewLine }, StringSplitOptions.None)
.Select(l => "<sup>" + l.Replace(@"\super", "").Replace(@"\nosupersub", "").Trim() + "</sup>"));

Related

Regex to match alphanumeric except specific substring

Edit:
MANDATORY CONDITION:
Regex has to be inserted into the following statement:
Regex regex = new Regex("<REGEX_STRING>");
val= regex.Matches(val).Cast<Match>().Aggregate("", (s, e) => s + e.Value, s => s);
I found out that I can't use Regex.Replace() method as it was suggested in the answer below.
I am looking for a RegEx that would have to follow two conditions:
accept only a-z, A-Z, 0-9, \s (one or more), and ignore _ (that's why \w is not an option)
[!] exclude any {sq} "substring" anywhere inside the string
*{sq} - it's literally this 4-chars string, not any shortcut for ASCII sign !
What I have so far is:
\b(?!sq)[a-zA-Z0-9 ]*
but this RegEx cuts everything when _ shows up + it also excludes i.e whole [sq].
So for example for a given string:
test[sq]uirrel{sq}_things I should get testsquirrelthings and what I get is: testuirrel
Small input | expected output table below:
Input string | Expected output
Na#me | Name
M2a_ny | M2any
Vari{sq}o#us | Various
test [sq]uirrel h23ere! | test squirrel h23ere
I would really appreciate any help, it's the most complicated RegEx I have ever come across 🙄
The problem is that it is not possible in .NET regex to match any text but a multicharacter sequence.
You will have to use a terrible workaround like
((?:(?!{sq})[A-Za-z0-9\s])+)|{sq}
and you will need to get Group 1 values. See the .NET regex demo. Here is a C# demo:
var texts = new List<string> { "Na#me","M2a_ny","Vari{sq}o#us","test [sq]uirrel h23ere!" };
var pattern = @"((?:(?!{sq})[A-Za-z0-9\s])+)|{sq}";
foreach (var text in texts) {
var result = Regex.Matches(text, pattern).Cast<Match>()
.Aggregate("", (s, e) => s + e.Groups[1].Value, s => s);
Console.WriteLine(result);
}
// => Name, M2any, Various, test squirrel h23ere
A better, Regex.Replace based solution
You can remove {sq} and all chars other than letters, digits and whitespace using
Regex.Replace(text, @"{sq}|[^a-zA-Z0-9\s]", "")
or
Regex.Replace(text, @"{sq}|[^\p{L}\p{N}\s]", "")
The \p{L} / \p{N} version can be used to support any Unicode letters/digits.
See the .NET regex demo.
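As a rough sketch of the Regex.Replace approach applied to the inputs from the question: the Sanitizer class and Clean method are hypothetical names, and the braces are escaped here purely as a precaution (an assumption on my part, the unescaped form shown above is what the answer uses).

```csharp
using System;
using System.Text.RegularExpressions;

public static class Sanitizer
{
    // Hypothetical helper: drops the literal "{sq}" marker and every character
    // that is not an ASCII letter, an ASCII digit or whitespace.
    public static string Clean(string text)
    {
        return Regex.Replace(text, @"\{sq\}|[^a-zA-Z0-9\s]", "");
    }
}
```

With this, Sanitizer.Clean("Vari{sq}o#us") should return Various, while [sq] only loses its brackets because it is not the {sq} marker.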

How to use (?!...) regex pattern to skip the whole unmatched part?

I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var separator = "__";
var pattern = String.Format("((?!{0}).)*", separator);
var regex = new Regex(pattern);
foreach (var item in regex.Matches("first__second"))
Console.WriteLine(item);
}
}
It works fine when a SEPARATOR is a single character, but when it is longer than 1 character I get an unexpected result. In the code above the second matched string is "_second" instead of "second". How shall I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value and it can be for example "\r\n".
You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "plum--pear";
string pattern = "-"; // Split on hyphens
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
}
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'
The .NET regex does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle those scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
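For the literal-separator case, a minimal sketch could look like this. LiteralSplit.On is a made-up helper name; the String.Split overload taking a string array and StringSplitOptions is the standard one.

```csharp
using System;

public static class LiteralSplit
{
    // Hypothetical helper: splits on a fixed, possibly multi-character separator
    // without involving regex at all.
    public static string[] On(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }
}
```

LiteralSplit.On("first__second", "__") should yield first and second, with the whole separator consumed.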
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
See the regex demo
It matches 1+ (due to the final +) occurrences of either a quoted substring, i.e. ", 0+ chars other than " and then " (the "[^"]*" branch), or (|) any char other than CR, LF and " (see [^\r\n"]).
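To see that pattern at work on an input with a line break inside quotes, here is a small sketch that wraps it in a helper. QuoteAwareLines.Split is a hypothetical name, and the sketch assumes CR/LF line separators (as hard-coded in the pattern) and balanced quotes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteAwareLines
{
    // Hypothetical helper: returns the lines of s, treating CR/LF inside
    // double quotes as part of the line rather than as a separator.
    public static List<string> Split(string s)
    {
        return Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the input "a \"x\r\ny\" b\r\nc" this should return two items: the first keeps the quoted line break, the second is c.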

Regex: C# extract text within double quotes

I want to extract only those words within double quotes. So, if the content is:
Would "you" like to have responses to your "questions" sent to you via email?
The answer must be
you
questions
Try this regex:
\"[^\"]*\"
or
\".*?\"
Explanation:
[^ character_group ]
Negation: Matches any single character that is not in character_group.
*?
Matches the previous element zero or more times, but as few times as possible.
And some sample code:
foreach (Match match in Regex.Matches(inputString, "\"([^\"]*)\""))
Console.WriteLine(match.Groups[1].Value); // group 1 is the text without the quotes
//or in LINQ
var result = from Match match in Regex.Matches(inputString, "\"([^\"]*)\"")
select match.Groups[1].Value;
Based on @Ria's answer:
static void Main(string[] args)
{
string str = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
var reg = new Regex("\".*?\"");
var matches = reg.Matches(str);
foreach (var item in matches)
{
Console.WriteLine(item.ToString());
}
}
The output is:
"you"
"questions"
You can use string.TrimStart('"') and string.TrimEnd('"') (or simply Trim('"')) to remove the double quotes if you don't want them.
I like the regex solutions. You could also think of something like this
string str = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
var stringArray = str.Split('"');
Then take the odd elements from the array. If you use linq, you can do it like this:
var stringArray = str.Split('"').Where((item, index) => index % 2 != 0);
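Here is a small sketch of the split-and-take-odd-elements idea as a reusable helper. QuotedBySplit.Extract is a made-up name, and the approach assumes the quotes in the input are balanced.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class QuotedBySplit
{
    // Hypothetical helper: after splitting on '"', the pieces at odd indexes
    // are the ones that sat between an opening and a closing quote.
    public static List<string> Extract(string input)
    {
        return input.Split('"')
            .Where((item, index) => index % 2 != 0)
            .ToList();
    }
}
```

For the sentence from the question this should give you and questions. Note that an unterminated quote is not detected: the text after a lone opening quote also lands at an odd index and is returned as if it were quoted.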
This also borrows the regex from @Ria, but lets you loop over the matches by index and remove the quotes from each one:
string strText = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
MatchCollection mc = Regex.Matches(strText, "\"([^\"]*)\"");
for (int z=0; z < mc.Count; z++)
{
Response.Write(mc[z].ToString().Replace("\"", ""));
}
I combine Regex and Trim:
const string searchString = "This is a \"search text\" and \"another text\" and not \"this text";
var collection = Regex.Matches(searchString, "\\\"(.*?)\\\"");
foreach (var item in collection)
{
Console.WriteLine(item.ToString().Trim('"'));
}
Result:
search text
another text
Try this (\"\w+\")+
I suggest you download Expresso
http://www.ultrapico.com/Expresso.htm
I needed to do this in C# for parsing CSV and none of these worked for me so I came up with this:
\s*(?:(?:(['"])(?<value>(?:\\\1|[^\1])*?)\1)|(?<value>[^'",]+?))\s*(?:,|$)
This will parse out a field with or without quotes and will exclude the quotes from the value while keeping embedded quotes and commas. <value> contains the parsed field value. Without using named groups, either group 2 or 3 contains the value.
There are better and more efficient ways to do CSV parsing and this one will not be effective at identifying bad input. But if you can be sure of your input format and performance is not an issue, this might work for you.
Slight improvement on the answer by @Ria:
\"[^\" ][^\"]*\"
Will recognize a starting double quote only when not followed by a space to allow trailing inch specifiers.
Side effect: It will not recognize "" as a quoted value.

c# Regex question

I have a problem dealing with the @ symbol in Regex. I am trying to remove @sometext from a text string, but I can't seem to find anywhere that uses the @ as a literal. I have tried myself, but it doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "@text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurrences of "@sometext" from string test via the method
Regex.Replace(test, "@sometext", "")
or for any word starting with "@" you can use
Regex.Replace(test, "@\\w+", "")
If you specifically need a separate word (i.e. nothing like @comp within tom@comp.com), you may precede the regex with a custom word-boundary check (\b does not work here):
Regex.Replace(test, "(^|\\W)@\\w+", "")
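One thing to be aware of with the last pattern: the character matched by (^|\W) is part of the match, so an empty replacement removes it as well. A possible variant, sketched below under the assumption that you want to keep that character, puts it back with $1. MentionStripper.Strip is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class MentionStripper
{
    // Hypothetical helper: removes standalone @words but leaves addresses
    // such as tom@comp.com alone. $1 restores the non-word character
    // (or the empty start-of-string match) captured in front of the @.
    public static string Strip(string text)
    {
        return Regex.Replace(text, @"(^|\W)@\w+", "$1");
    }
}
```

For example, MentionStripper.Strip("(@bob) tom@comp.com") should return () tom@comp.com, whereas the empty replacement would also drop the opening parenthesis.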
You can use:
^\s@([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove @something from this string: I want to remove @something from this string.
var regex = new Regex("@\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B@\w+
This will match any string starting with an @ character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"\B@\w+",
@"redacted");
is the following string:
redacted redacted this2@3that redacted redacted@beta@gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}",
@"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

How can I find a string after a specific string/character using regex

I am hopeless with regex (C#), so I would appreciate some help:
Basically, I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan: Sorry for reopening the question but there is quite an important thing missing in that regex. As I wrote in my last comment, Is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to have the "words to look for" constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of whitespace characters
\w* = 0 or more word characters (letters, digits and underscore)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(fyi) The "@" before a string literal means that \ no longer means something special, so you can type @"c:\fun.txt" instead of "c:\\fun.txt"
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
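If the keyword list is only known at run time, the alternation can be assembled in code. The following is just a sketch: KeywordFinder.FirstWords is a hypothetical helper, it escapes each keyword with Regex.Escape as a precaution, and it assumes you only want the single word that follows each keyword (not a variable number of words).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordFinder
{
    // Hypothetical helper: maps each keyword found in source to the first
    // word that follows "keyword:" (optional whitespace allowed after the colon).
    public static Dictionary<string, string> FirstWords(IEnumerable<string> keywords, string source)
    {
        // Builds e.g. "KeywordA|KeywordB"; Regex.Escape guards against
        // keywords that contain regex metacharacters.
        var alternation = string.Join("|", keywords.Select(Regex.Escape));
        var pattern = "(?<key>" + alternation + @"):\s*(?<value>\w+)";

        var found = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            // The indexer overwrites, so a repeated keyword keeps its last value.
            found[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return found;
    }
}
```

Called with the keywords KeywordA and KeywordB on "KeywordA: TextToFind the rest KeywordB: Other text", it should map KeywordA to TextToFind and KeywordB to Other.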
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
/KeywordB\: (\w+)/
This matches the word that comes right after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
