C# Regex returning multiple lines of text - c#

I have the following function:
public static string ReturnEmailAddresses(string input)
{
string regex1 = #"\[url=";
string regex2 = #"mailto:([^\?]*)";
string regex3 = #".*?";
string regex4 = #"\[\/url\]";
Regex r = new Regex(regex1 + regex2 + regex3 + regex4, RegexOptions.IgnoreCase | RegexOptions.Multiline);
MatchCollection m = r.Matches(input);
if (m.Count > 0)
{
StringBuilder sb = new StringBuilder();
int i = 0;
foreach (var match in m)
{
if (i > 0)
sb.Append(Environment.NewLine);
string shtml = match.ToString();
var innerString = shtml.Substring(shtml.IndexOf("]") + 1, shtml.IndexOf("[/url]") - shtml.IndexOf("]") - 1);
sb.Append(innerString); //just titles
i++;
}
return sb.ToString();
}
return string.Empty;
}
As you can see I define a url in the "markdown" format:
[url = http://sample.com]sample.com[/url]
In the same way, emails are written in that format too:
[url=mailto:service#paypal.com.au]service#paypal.com.au[/url]
However when i pass in a multiline string, with multiple email addresses, it only returns the first email only. I would like it to have multple matches, but I cannot seem to get that working?
For example
[url=mailto:service#paypal.com.au]service#paypal.com.au[/url] /r/n a whole bunch of text here /r/n more stuff here [url=mailto:anotheremail#paypal.com.au]anotheremail#paypal.com.au[/url]
This will only return the first email above?

The mailto:([^\?]*) part of your pattern is matching everything in your input string. You need to add the closing bracket ] to the inside of your excluded characters to restrict that portion from overflowing outside of the "mailto" section and into the text within the "url" tags:
\[url=mailto:([^\?\]]*).*?\[\/url\]
See this link for an example: https://regex101.com/r/zcgeW8/1

You can extract desired result with help of positive lookahead and positive lookbehind. See http://www.rexegg.com/regex-lookarounds.html
Try regex: (?<=\[url=mailto:).*?(?=\])
Above regex will capture two email addresses from sample string
[url=mailto:service#paypal.com.au]service#paypal.com.au[/url] /r/n a whole bunch of text here /r/n more stuff here [url=mailto:anotheremail#paypal.com.au]anotheremail#paypal.com.au[/url]
Result:
service#paypal.com.au
anotheremail#paypal.com.au

Related

My regex is not matching if a line break is found

I have a large string separated by line breaks.
Example:
This is my first sentence and here i will search for the word my
This is my second sentence
Using the code below, if I search for 'my' it will only return the 2 instances of 'my' from the first sentence and not the second.
I wish to display the sentence the phrase is found in - which works fine but its just that it does not search anything after the first line break if found.
Code;
var regex = new Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Singleline));
var results = regex.Matches(largeStringInHere);
for (int i = 0; i < results.Count; i++)
{
searchCriteriaFound.Append((results[i].Value.Trim()));
searchCriteriaFound.Append(Environment.NewLine);
}
Code Edit:
string pattern = #".*(" + userSearchCriteraInHere + ")+.*";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(largeStringInHere, pattern, options))
{
searchCriteriaFound.Append(m.Value);
}
var userSearchCriteraInHere = "my";
var largeStringInHere = #"This is my first sentence and here i will search for the word my.
This is my second sentence.";
var regex = new Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere), RegexOptions.Singleline);
var results = regex.Matches(largeStringInHere);
Console.WriteLine(results.Count);
var searchCriteriaFound = new StringBuilder();
for (int i = 0; i < results.Count; i++)
{
searchCriteriaFound.Append((results[i].Value.Trim()));
searchCriteriaFound.Append(Environment.NewLine);
}
Console.Write(searchCriteriaFound.ToString());
This returns the following output:
2
This is my first sentence and here i will search for the word my.
This is my second sentence.
I did need to add periods at the end of your sentences, as your regex expects them.
Is there a particular reason not to just search for the word "my" multiple times in the following way:
(my)+
You can test it over at the following URL on Regex101: https://regex101.com/r/QIHWKf/1
If you want to match the whole sentence that has "my" you can use the following:
.*(my)+.*
https://regex101.com/r/QIHWKf/2
Here your full match is the whole sentence, and your first group match is the "my".
Change
Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Singleline)
To
Regex(string.Format("[^.!?;]*({0})[^.?!;]*[.?!;]", userSearchCriteraInHere, RegexOptions.Multiline)
This changes the meaning of the symbols ^ and $ to be at the beginning/end of a line, rather than the entire string.
You could use a word boundary \b to prevent it from being part of a larger match like for example mystery and change the option to RegexOptions.Multiline instead of RegexOptions.Singleline to let ^ and $ match the end of the line.
^.*\bmy\b.*$
Regex demo
Test
To get all lines containing 'my' word, you can try this:
Code
static string GetSentencesContainMyWord(StreamReader file)
{
int counter = 0;
string line;
var sb = new StringBuilder();
while ((line = file.ReadLine()) != null)
{
if (line.Contains("my"))
sb.Append(line + Environment.NewLine);
counter++;
}
return sb.ToString();
}

How to replace the text between two characters in c#

I am bit confused writing the regex for finding the Text between the two delimiters { } and replace the text with another text in c#,how to replace?
I tried this.
StreamReader sr = new StreamReader(#"C:abc.txt");
string line;
line = sr.ReadLine();
while (line != null)
{
if (line.StartsWith("<"))
{
if (line.IndexOf('{') == 29)
{
string s = line;
int start = s.IndexOf("{");
int end = s.IndexOf("}");
string result = s.Substring(start+1, end - start - 1);
}
}
//write the lie to console window
Console.Write Line(line);
//Read the next line
line = sr.ReadLine();
}
//close the file
sr.Close();
Console.ReadLine();
I want replace the found text(result) with another text.
Use Regex with pattern: \{([^\}]+)\}
Regex yourRegex = new Regex(#"\{([^\}]+)\}");
string result = yourRegex.Replace(yourString, "anyReplacement");
string s = "data{value here} data";
int start = s.IndexOf("{");
int end = s.IndexOf("}", start);
string result = s.Substring(start+1, end - start - 1);
s = s.Replace(result, "your replacement value");
To get the string between the parentheses to be replaced, use the Regex pattern
string errString = "This {match here} uses 3 other {match here} to {match here} the {match here}ation";
string toReplace = Regex.Match(errString, #"\{([^\}]+)\}").Groups[1].Value;
Console.WriteLine(toReplace); // prints 'match here'
To then replace the text found you can simply use the Replace method as follows:
string correctString = errString.Replace(toReplace, "document");
Explanation of the Regex pattern:
\{ # Escaped curly parentheses, means "starts with a '{' character"
( # Parentheses in a regex mean "put (capture) the stuff
# in between into the Groups array"
[^}] # Any character that is not a '}' character
* # Zero or more occurrences of the aforementioned "non '}' char"
) # Close the capturing group
\} # "Ends with a '}' character"
The following regular expression will match the criteria you specified:
string pattern = #"^(\<.{27})(\{[^}]*\})(.*)";
The following would perform a replace:
string result = Regex.Replace(input, pattern, "$1 REPLACE $3");
For the input: "<012345678901234567890123456{sdfsdfsdf}sadfsdf" this gives the output "<012345678901234567890123456 REPLACE sadfsdf"
You need two calls to Substring(), rather than one: One to get textBefore, the other to get textAfter, and then you concatenate those with your replacement.
int start = s.IndexOf("{");
int end = s.IndexOf("}");
//I skip the check that end is valid too avoid clutter
string textBefore = s.Substring(0, start);
string textAfter = s.Substring(end+1);
string replacedText = textBefore + newText + textAfter;
If you want to keep the braces, you need a small adjustment:
int start = s.IndexOf("{");
int end = s.IndexOf("}");
string textBefore = s.Substring(0, start-1);
string textAfter = s.Substring(end);
string replacedText = textBefore + newText + textAfter;
the simplest way is to use split method if you want to avoid any regex .. this is an aproach :
string s = "sometext {getthis}";
string result= s.Split(new char[] { '{', '}' })[1];
You can use the Regex expression that some others have already posted, or you can use a more advanced Regex that uses balancing groups to make sure the opening { is balanced by a closing }.
That expression is then (?<BRACE>\{)([^\}]*)(?<-BRACE>\})
You can test this expression online at RegexHero.
You simply match your input string with this Regex pattern, then use the replace methods of Regex, for instance:
var result = Regex.Replace(input, "(?<BRACE>\{)([^\}]*)(?<-BRACE>\})", textToReplaceWith);
For more C# Regex Replace examples, see http://www.dotnetperls.com/regex-replace.

Regex finding any characters in between ( )

i have a long text and in the text there is many something like this ( hello , hi ) or (hello,hi) , i have to take the space into account . how do i detect them in a long text and retrieve the hello and hi word and add to a list from the text? currently i use this regex :
string helpingWordPattern = "(?<=\\()(.*?)(?<=\\))";
Regex regexHelpingWord = new Regex(helpingWordPattern);
foreach (Match m in regexHelpingWord.Matches(lstQuestion.QuestionContent))
{
// removing "," and store helping word into a list
string str = m.ToString();
if (str.Contains(","))
{
string[] strWords = str.Split(','); // Will contain a ) with a word , e.g. ( whole) )
if(strWords.Contains(")"))
{
strWords.Replace(")", ""); // Try to remove them. ERROR here cos i can't use array with replace.
}
foreach (string words in strWords)
{
options.Add(words);
}
}
}
I google and search for the correct regex , the regex i use suppose to remove the ) too but it doesn't .
Put the \\( \\) bracket-matchers, outside the group you wish to capture?
Regex regex = new Regex( "\\((.*?)\\)");
foreach (Match m in regex.Matches( longText)) {
string inside = Match.Groups[1]; // inside the brackets.
...
}
Then use Match.Groups[1], not the whole text of the match.
You can also use this regex pattern:
(?<=[\(,])(.*?)(?=[\),])
(?<=[\(,])(\D*?)(?=[\),]) // for anything except number
Break Up:
(?<=[\(,]) = Positive look behind, looks for `(`or `,`
(.*?) = Looks for any thing except new line, but its lazy(matches as less as possible)
(?=[\),]) = Positive look ahead, looks for `)` or `,` after `hello` or `hi` etc.
Demo
EDIT
You can try this sample code for achievement: (untested)
List<string> lst = new List<string>();
MatchCollection mcoll = Regex.Matches(sampleStr,#"(?<=[\(,])(.*?)(?=[\),])")
foreach(Match m in mcoll)
{
lst.Add(m.ToString());
Debug.Print(m.ToString()); // Optional, check in Output window.
}
There are a lot of different ways you could do this... Below is some code using regex to match / split.
string input = "txt ( apple , orange) txt txt txt ( hello, hi,5 ) txt txt txt txt";
List Options = new List();
Regex regexHelpingWord = new Regex(#"\((.+?)\)");
foreach (Match m in regexHelpingWord.Matches(input))
{
string words = Regex.Replace(m.ToString(), #"[()]", "");
Regex regexSplitComma = new Regex(#"\s*,\s*");
foreach (string word in regexSplitComma.Split(words))
{
string Str = word.Trim();
double Num;
bool isNum = double.TryParse(Str, out Num);
if (!isNum) Options.Add(Str);
}
}

A More Efficient Way to Parse a String in C#

I have this code that reads a file and creates Regex groups. Then I walk through the groups and use other matches on keywords to extract what I need. I need the stuff between each keyword and the next space or newline. I am wondering if there is a way using the Regex keyword match itself to discard what I don't want (the keyword).
//create the pattern for the regex
String VSANMatchString = #"vsan\s(?<number>\d+)[:\s](?<info>.+)\n(\s+name:(?<name>.+)\s+state:(?<state>.+)\s+\n\s+interoperability mode:(?<mode>.+)\s\n\s+loadbalancing:(?<loadbal>.+)\s\n\s+operational state:(?<opstate>.+)\s\n)?";
//set up the patch
MatchCollection VSANInfoList = Regex.Matches(block, VSANMatchString);
// set up the keyword matches
Regex VSANNum = new Regex(#" \d* ");
Regex VSANName = new Regex(#"name:\S*");
Regex VSANState = new Regex(#"operational state\S*");
//now we can extract what we need since we know all the VSAN info will be matched to the correct VSAN
//match each keyword (name, state, etc), then split and extract the value
foreach (Match m in VSANInfoList)
{
string num=String.Empty;
string name=String.Empty;
string state=String.Empty;
string s = m.ToString();
if (VSANNum.IsMatch(s)) { num=VSANNum.Match(s).ToString().Trim(); }
if (VSANName.IsMatch(s))
{
string totrim = VSANName.Match(s).ToString().Trim();
string[] strsplit = Regex.Split (totrim, "name:");
name=strsplit[1].Trim();
}
if (VSANState.IsMatch(s))
{
string totrim = VSANState.Match(s).ToString().Trim();
string[] strsplit=Regex.Split (totrim, "state:");
state=strsplit[1].Trim();
}
It looks like your single regex should be able to gather all you need. Try this:
string name = m.Groups["name"].Value; // Or was it m.Captures["name"].Value?

c# Regex question

I have a problem dealing with the # symbol in Regex, I am trying to remove #sometext
from a text string can't seem to find anywhere where it uses the # as a literal. I have tried myself but doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "#text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurences of "#sometext" from string test via the method
Regex.Replace(test, "#sometext", "")
or for any word starting with "#" you can use
Regex.Replace(test, "#\\w+", "")
If you need specifically a separate word (i.e. nothing like #comp within tom#comp.com) you may preceed the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)#\\w+", "")
You can use:
^\s#([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove #something from this string: I want to remove #something from this string.
var regex = new Regex("#\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B#\w+
This will match any string starting with an # character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"\B#\w+",
#"redacted");
is the following string:
redacted redacted this2#3that redacted redacted#beta#gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
#"#This1 #That2_thing this2#3that #the5Others #alpha#beta#gamma",
#"(^|[^a-z0-9_])[#\uFF20][a-z0-9_]{1,20}",
#"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

Categories

Resources