How do I match a regex pattern and extract data from it

I can have zero or more substrings within a text area in the format {key-value}Some text{/key}.
For example: This is my {link-123}test{/link} text area
I'd like to iterate through all items that match this pattern, perform an action based on the key and value, then replace the substring with a new string (an anchor link that is retrieved by the action based on the key).
How would I achieve this in C#?

If these tags are not nested, then you only need to iterate once over the file; if nesting is possible, then you need to do one iteration for each level of nesting.
This answer assumes that braces only occur as tag delimiters (and not, for example, inside comments):
result = Regex.Replace(subject,
    @"\{                    # opening brace
    (?<key>\w+)             # Match the key (alnum), capture into the group 'key'
    -                       # dash
    (?<value>\w+)           # Match the value (alnum), capture it as above
    \}                      # closing brace
    (?<content>             # Match and capture into the group 'content':
     (?:                    # Match...
      (?!\{/?\k<key>)       # (unless there's an opening or closing tag
      .                     # of the same name right here) any character
     )*                     # any number of times
    )                       # End of capturing group
    \{/\k<key>\}            # Match the closing tag.",
    new MatchEvaluator(ComputeReplacement), RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

public String ComputeReplacement(Match m) {
    // You can vary the replacement text for each match on-the-fly
    // m.Groups["key"].Value will contain the key
    // m.Groups["value"].Value will contain the value of the match
    // m.Groups["content"].Value will contain the content between the tags
    return ""; // change this to return the string you generated here
}
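To make the "one iteration for each level of nesting" point concrete, here is a minimal sketch that simply re-runs the replacement until the text stops changing. The class name NestedTags, the method ReplaceAll and the maxPasses safety cap are my own assumptions for illustration, and the pattern is the one above written on a single line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTags
{
    // Same structure as the commented pattern above, without the whitespace/comments.
    private static readonly Regex Tag = new Regex(
        @"\{(?<key>\w+)-(?<value>\w+)\}(?<content>(?:(?!\{/?\k<key>).)*)\{/\k<key>\}",
        RegexOptions.Singleline);

    // Hypothetical helper: applies the evaluator once per nesting level.
    // maxPasses is an assumed cap so a replacement that produces new tags cannot loop forever.
    public static string ReplaceAll(string text, MatchEvaluator evaluator, int maxPasses = 10)
    {
        for (int pass = 0; pass < maxPasses; pass++)
        {
            string next = Tag.Replace(text, evaluator);
            if (next == text)
                break; // no tag was replaced in this pass, so we are done
            text = next;
        }
        return text;
    }
}
```

With an evaluator such as m => "[" + m.Groups["value"].Value + ":" + m.Groups["content"].Value + "]", the input {b-1}outer {b-2}inner{/b} done{/b} needs two passes: the inner tag is replaced first, then the outer one.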

Something like this?
Regex.Replace(text,
    @"[{](?<key>[^-]+)-(?<value>[^}]+)[}](?<content>.*?)[{][/]\k<key>[}]",
    match => {
        var key = match.Groups["key"].Value;
        var value = match.Groups["value"].Value;
        var content = match.Groups["content"].Value;
        return string.Format("The content of {0}-{1} is {2}", key, value, content);
    });
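For the anchor-link case from the question, a sketch along the same lines might look like this. The TagReplacer class, the Expand method and the /items/{value} URL format are assumptions made up for the example; the real lookup that turns a key and value into a link would go where the switch is.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagReplacer
{
    // {key-value}content{/key}; \k<key> forces the closing tag to repeat the key.
    private static readonly Regex TagPattern = new Regex(
        @"\{(?<key>\w+)-(?<value>\w+)\}(?<content>.*?)\{/\k<key>\}",
        RegexOptions.Singleline);

    public static string Expand(string text)
    {
        return TagPattern.Replace(text, m =>
        {
            string key = m.Groups["key"].Value;
            string value = m.Groups["value"].Value;
            string content = m.Groups["content"].Value;

            switch (key)
            {
                case "link":
                    // Hypothetical URL scheme; substitute the real action here.
                    return "<a href=\"/items/" + value + "\">" + content + "</a>";
                default:
                    // Unknown keys are left exactly as they were.
                    return m.Value;
            }
        });
    }
}
```

Calling TagReplacer.Expand("This is my {link-123}test{/link} text area") returns This is my <a href="/items/123">test</a> text area, and tags with a key other than link pass through unchanged.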

Use the .NET regular expression library (System.Text.RegularExpressions). Here is an example that uses the Matches method:
http://www.dotnetperls.com/regex-matches
For more involved replacements, consider a parser generator such as ANTLR:
http://www.antlr.org/wiki/display/ANTLR3/Antlr+3+CSharp+Target
Here is the example from the linked article on Regex.Matches:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Input string.
        const string value = @"said shed see spear spread super";
        // Get a collection of matches.
        MatchCollection matches = Regex.Matches(value, @"s\w+d");
        // Use foreach loop.
        foreach (Match match in matches)
        {
            foreach (Capture capture in match.Captures)
            {
                Console.WriteLine("Index={0}, Value={1}", capture.Index, capture.Value);
            }
        }
    }
}
For more information on the C# regular expression syntax, you could use this cheat sheet:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

Related

How do I regex match each individual word within backticks?

I am trying to get results for each individual word within backticks. For example,
if I have something like this text
some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`
I want the search results to be:
match
these_words
th_is_wor
THIS_WOR
thi_sqw
word_snake
I'm essentially trying to get each "word", word being one or more english letter or underscore characters, between each set of backticks.
I currently have the following regex, which seems to match ALL the text between each set of backticks:
/(?<=`)(\b([^`\\]|\w|_)*\b)(?=`)/gi
This uses a positive lookbehind to find text that comes after a ` character: (?<=`)
Followed by a capture group for zero or more things, where each thing is not a ` and not a \, or is a word character, or is an _ character, all within word boundaries: (\b([^`\\]|\w|_)*\b)
Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.
This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:
match these_words th_is_wor
THIS_WOR thi_sqw
word_snake
If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!
Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.
Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!
using System;
using System.Text.RegularExpressions;

class Program {
    static void Main(string[] args) {
        string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
        string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
        string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
        string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
        // Call Matches method without specifying any options.
        try {
            foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
            Console.WriteLine();
            foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
        }
        catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.
        Console.WriteLine();
        // Call Matches method for case-insensitive matching.
        try {
            foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
            Console.WriteLine();
            foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
        }
        catch (RegexMatchTimeoutException) {}
    }
}
His explanation was as follows; you can also paste his regex into regexr.com for more info:
var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";
var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick
var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word
// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";
// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";

With C# you can use the Group.Captures Property and then get the capture group values.
Note that \w also matches _
`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`
Explanation
`              Match a backtick literally
(?:            Non-capturing group, repeated as a whole
[\p{Zs}\t]*    Match optional horizontal whitespace
(\w+)          Capture group 1: match 1+ word characters
[\p{Zs}\t]*    Match optional horizontal whitespace
)+             Close the non-capturing group and repeat it 1 or more times
`              Match a backtick literally
See a .NET regex demo and a C# demo.
For example:
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
    // Select needs "using System.Linq;"
    string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
    Console.WriteLine(String.Join(',', result));
}
Output
match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake
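If you want something closer to the "generic formula" the question asks for (backticks, quotes, or any other single delimiter character), a plain two-step approach also works: first match each delimited span, then match the words inside it. This is only a sketch of that alternative, not the method from the answers above; the class name DelimitedWords and the method Extract are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedWords
{
    // Hypothetical helper: returns every \w+ run found between pairs of the delimiter.
    public static List<string> Extract(string text, char delimiter)
    {
        string d = Regex.Escape(delimiter.ToString());
        var words = new List<string>();

        // Step 1: each delimited span; group 1 is the text between the two delimiters.
        foreach (Match span in Regex.Matches(text, d + "(.*?)" + d, RegexOptions.Singleline))
        {
            // Step 2: each run of word characters (letters, digits, underscore) in that span.
            foreach (Match word in Regex.Matches(span.Groups[1].Value, @"\w+"))
                words.Add(word.Value);
        }
        return words;
    }
}
```

For the sentence in the question, DelimitedWords.Extract(sentence, '`') yields match, these_words, th_is_wor, THIS_WOR, thi_sqw and word_snake as six separate entries; passing '"' instead handles the quoted variant.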

C# - Save text into variable

I want to save an e-mail-address out of a .txt-file into a string variable. This is my code:
String path = "C:\\Users\\test.txt";
string from;
var fro = new Regex("from: (?<fr>)");
using (var reader = new StreamReader(File.OpenRead(@path)))
{
    while (true)
    {
        var nextLine = reader.ReadLine();
        if (nextLine == null)
            break;
        var matchb = fro.Match(nextLine);
        if (matchb.Success)
        {
            from = matchb.Groups["fr"].Value;
            Console.WriteLine(from);
        }
    }
}
I know that matchb.Success is true, however from won't be displayed correctly. I'm afraid it has something to do with the escape sequence, but I was unable to find anything helpful on the internet.
The textfile might look like this:
LOG 00:01:05 processID=123456-12345 from: test@test.org
LOG 00:01:06 processID=123456-12345 OK

Your (?<fr>) pattern defines a named group "fr" that matches an empty string.
To fill the group with some value you need to define the group pattern.
If you plan to match the rest of the line, you may use .*. To match a sequence of non-whitespace chars, use \S+. To match a sequence of non-whitespace chars that has a @ inside, use \S+@\S+. All three approaches will work for the current scenario.
In C#, it will look like
var fro = new Regex(@"from: *(?<fr>\S+@\S+)");
Note that @"..." is a verbatim string literal where a single backslash defines a literal backslash, so you do not have to double it. I also suggest using the * quantifier to match 0 or more spaces before the email. You might want to use \s* (to match any 0+ whitespace chars) or [\p{Zs}\t]* (to match only horizontal whitespace chars) instead.
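Put together, a small sketch of the fixed extraction could look like the following. The LogSenders class and its Extract method are names I made up for the example, and it takes the lines as a sequence so it can be fed from File.ReadLines(path) or from an in-memory array.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogSenders
{
    // The named group now has a pattern of its own, so it captures the address.
    private static readonly Regex FromPattern = new Regex(@"from:\s*(?<fr>\S+@\S+)");

    // Hypothetical helper: returns the address from every line that contains "from: ...".
    public static List<string> Extract(IEnumerable<string> lines)
    {
        var result = new List<string>();
        foreach (string line in lines)
        {
            Match m = FromPattern.Match(line);
            if (m.Success)
                result.Add(m.Groups["fr"].Value);
        }
        return result;
    }
}
```

For the two sample log lines in the question, this returns a single entry, test@test.org; the second line has no from: part and is skipped.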

Find String Between Two Identical Control Separators?

I'm reading from a file, and need to find a string that is enclosed by two identical non-printable values/control separators, in this case 'RS'.
How would I go about doing this? Would I need some form of regex?

RS stands for Record Separator, and it has a value of 30 (or 0x1E in hexadecimal). You can use this regular expression:
\x1E([\w\s]*?)\x1E
That matches the RS, then any letters, digits, underscores or whitespace, and then the RS again. The ? makes the quantifier lazy, so the regex matches as few characters as possible in case there are more RS characters afterwards.
If you prefer not to match numbers, you could use [a-zA-Z\s] instead of [\w\s].
Example:
string fileContents = "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end.";
MatchCollection matches = Regex.Matches(fileContents, @"\x1E([\w\s]*?)\x1E");
if (matches.Count == 0)
    return; // Not found, display an error message and exit.
foreach (Match match in matches)
{
    if (match.Groups.Count > 1)
        Console.WriteLine(match.Groups[1].Value);
}
As you can see, you get a collection of Match objects, and each match.Value holds the whole matched string, including the separators. match.Groups holds all matched groups: the first is again the whole matched string (that's the default), followed by each of your own groups (those in parentheses). In this case the regex has only one group, so you just need the second entry in that list.

Using regex you can do something like this:
string pattern = string.Format("{0}(.*){1}", firstString, secondString);
var matches = Regex.Matches(myString, pattern);
foreach (Match match in matches)
{
    foreach (Capture capture in match.Captures)
    {
        // Do stuff. capture.Value still includes firstString and secondString, so strip them from it.
    }
}
This builds the pattern from the two separators and then uses Regex.Matches to find the strings that match it.
Remember to escape any regex special characters in the separators (for example with Regex.Escape).
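As a sketch of what the escaping advice looks like in practice: the version below runs both separators through Regex.Escape, uses a lazy (.*?) so that several separated strings on one line are not merged into one match, and reads group 1 so the separators do not have to be stripped afterwards. The lazy quantifier and the group are my changes to the idea above, and BetweenMarkers / Find are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BetweenMarkers
{
    // Hypothetical helper: returns each piece of text found between first and second.
    public static List<string> Find(string text, string first, string second)
    {
        // Regex.Escape turns characters such as ( * [ . into literals.
        string pattern = Regex.Escape(first) + "(.*?)" + Regex.Escape(second);

        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Singleline))
            found.Add(m.Groups[1].Value); // group 1 excludes the separators
        return found;
    }
}
```

For example, BetweenMarkers.Find("x (*one*) y (*two*)", "(*", "*)") returns one and two, even though both separators consist of regex metacharacters.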

You can use Regex.Matches, I'm using X as the separator in this example:
var fileContents = "Xsomething1X Xsomething2X Xsomething3X";
var results = Regex.Matches(fileContents, @"(X).*?(\1)");
Then you can loop over results to do anything you want with the matches.
The \1 in the regex means "reference the first group". I've put X between () so it becomes group 1, then I use \1 to say that the match in this place should be exactly the same as group 1.

You don't need a regular expression for that.
Read the contents of the file (File.ReadAllText).
Split on the separator character (String.Split).
If you know there's only one occurrence of your string, take the second array element (result[1]). Otherwise, take every other entry (result.Where((x, i) => i % 2 == 1)).
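A minimal sketch of the split approach, applied to an in-memory string rather than a file (with a file you would pass in the result of File.ReadAllText). The class and method names are assumptions, and the extra check that drops a trailing unterminated piece is my addition.

```csharp
using System;
using System.Linq;

public static class RecordSeparatorSplit
{
    // Hypothetical helper: returns the pieces that sit between pairs of RS characters.
    public static string[] Between(string contents)
    {
        const char RS = '\u001E'; // Record Separator, decimal 30

        string[] parts = contents.Split(RS);

        // Odd-indexed parts are the ones enclosed by separators. The last part is
        // skipped when it has an odd index, because then it has no closing separator.
        return parts
            .Where((part, i) => i % 2 == 1 && i < parts.Length - 1)
            .ToArray();
    }
}
```

For "Something \u001Eyour string\u001E more things \u001Eanother text\u001E end." this returns your string and another text.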

Regex within a regex?

Truth is, I'm having a hard time writing a regex string to parse something in the form of
[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]
This string would be parsed so that I can dynamically build tabs as demonstrated here. Initially I tried a regex pattern like \[\[\[tab name=(?'name'.*?) content=(?'content'.*?)\]\]\]
But I realized I couldn't get the tab as a whole and build upon a query without doing a Regex.Replace. Is it possible to take the entire tab leading up to the pipe symbol as a group and then parse that group down into its key/value pairs?
This is the current regex string I'm working with \[\[\[(?'tab'tab name=(?'name'.*?) content=(?'content'.*?))\]\]\]
And here is my code for performing the regex. Any guidance would be appreciated.
public override string BeforeParse(string markupText)
{
    if (CompiledRegex.IsMatch(markupText))
    {
        // Replaces the [[[code lang=sql|xxx]]]
        // with the HTML tags (surrounded with {{{roadkillinternal}}.
        // As the code is HTML encoded, it doesn't get butchered by the HTML cleaner.
        MatchCollection matches = CompiledRegex.Matches(markupText);
        foreach (Match match in matches)
        {
            string tabname = match.Groups["name"].Value;
            string tabcontent = HttpUtility.HtmlEncode(match.Groups["content"].Value);
            markupText = markupText.Replace(match.Groups["content"].Value, tabcontent);
            markupText = Regex.Replace(markupText, RegexString, ReplacementPattern, CompiledRegex.Options);
        }
    }
    return markupText;
}

Is this what you want?
string input = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
Regex r = new Regex(@"tab name=([a-z0-9]+) content=([a-z0-9]+)(\||])");
foreach (Match m in r.Matches(input))
{
    Console.WriteLine("{0} : {1}", m.Groups[1].Value, m.Groups[2].Value);
}
http://regexr.com/3boot

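Maybe String.Split would be better in this case? For example, something like this:
string str = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
// Strip the surrounding brackets first, then split into individual tabs.
foreach (var entry in str.Trim('[', ']').Split('|')) {
    var eqBlocks = entry.Split('=');
    // eqBlocks[1] looks like "dog content"; drop the trailing " content".
    var tabName = eqBlocks[1].Replace(" content", "");
    var content = eqBlocks[2];
}
Ugly code, but it should work as long as names and contents contain no '=' or '|'.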

Try this:
Starts with a word boundary and followed only by allowed characters.
/\b[\w =]*/g
https://regex101.com/r/cI7jS7/1

Just distill the regex pattern down to the individual tab pattern, such as name=??? content=???, and match only that. Each tab then becomes its own Match (two in your example) from which the data can be extracted.
string text = @"[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
string pattern = @"name=(?<Name>[^\s]+)\scontent=(?<Content>[^\s|\]]+)";
var result = Regex.Matches(text, pattern)
                  .OfType<Match>()
                  .Select(mt => new
                  {
                      Name = mt.Groups["Name"].Value,
                      Content = mt.Groups["Content"].Value,
                  });
The result is an enumerable of anonymous-type objects, one per tab, which can be bound directly to the control.
Note that in the set notation [^\s|\]] the pipe | is treated as a literal inside the set and not as an alternation. The bracket ] does have to be escaped to be treated as a literal. The set therefore reads: "match anything that is not (^) whitespace, a pipe, or a closing bracket".
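If you do want the literal "regex within a regex" from the question, i.e. grab the whole [[[...]]] block first and then parse each tab out of it, a two-stage sketch could look like this. It is an alternative to the single pattern above, not a replacement for it; the TabParser class, both patterns and the use of KeyValuePair are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TabParser
{
    // Stage 1: everything between [[[ and ]]].
    private static readonly Regex Block = new Regex(
        @"\[\[\[(?<body>.*?)\]\]\]", RegexOptions.Singleline);

    // Stage 2: one "tab name=... content=..." entry.
    private static readonly Regex Tab = new Regex(
        @"^tab name=(?<name>.+?) content=(?<content>.*)$", RegexOptions.Singleline);

    // Hypothetical helper: returns (name, content) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string markup)
    {
        var tabs = new List<KeyValuePair<string, string>>();
        foreach (Match block in Block.Matches(markup))
        {
            // Tabs inside a block are separated by the pipe symbol.
            foreach (string part in block.Groups["body"].Value.Split('|'))
            {
                Match tab = Tab.Match(part.Trim());
                if (tab.Success)
                    tabs.Add(new KeyValuePair<string, string>(
                        tab.Groups["name"].Value, tab.Groups["content"].Value));
            }
        }
        return tabs;
    }
}
```

For the example markup this gives two pairs, dog/cat and dog2/cat2. Because the content is taken up to the end of each pipe-separated part, it may contain spaces, at the cost of not allowing a literal | inside it.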

Regex to extract Variable Part

I have a string containing this: @[User::RootPath]+"Dim_MyPackage10.dtsx" and I need to extract the [User::RootPath] part using a regex. So far I have this regex: [a-zA-Z0-9]*\.dtsx but I don't know how to proceed further.

For the variable, why not consume only what is needed by using a negated set [^ ] to match everything except the characters in the set?
The ^ inside the brackets negates the set, as in the pattern below, where it matches everything that is not a ] or a quote (").
We can then place the actual matches in named capture groups (?<{NameHere}> ) and extract them accordingly:
string pattern = @"(?:@\[)(?<Path>[^\]]+)(?:\]\+\"")(?<File>[^\""]+)(?:"")";
// Pattern is (?:@\[)(?<Path>[^\]]+)(?:\]\+\")(?<File>[^\"]+)(?:")
// w/o the "'s escapes for the C# parser
string text = @"@[User::RootPath]+""Dim_MyPackage10.dtsx""";
var result = Regex.Match(text, pattern);
Console.WriteLine("Path: {0}{1}File: {2}",
    result.Groups["Path"].Value,
    Environment.NewLine,
    result.Groups["File"].Value
);
/* Outputs
Path: User::RootPath
File: Dim_MyPackage10.dtsx
*/
(?: ) means match but don't capture; we use those groups as de facto anchors for the pattern and keep them out of the capture groups.

Use this regex pattern:
\[[^[\]]*\]
Check this demo.

Your regex will match any number of alphanumeric characters, followed by .dtsx. In your example, it would match MyPackage10.dtsx.
If you want to match Dim_MyPackage10.dtsx you need to add an underscore to your list of allowed characters in the regex: [a-zA-Z0-9_]*\.dtsx
If you want to match the [User::RootPath], you need a regex that will stop at the last / (or \, depends on which type of slashes you use in the paths): something like this: .*\/ (or .*\\)

From the answers and comments - and the fact that none has been 'accepted' so far - it appears to me that the question/problem is not completely clear. If you're looking for the pattern [User::SomeVariable] where only 'SomeVariable' is, well, variable, then you may try:
\[User::\w+]
to capture the full expression.
Furthermore, if you wish to detect that pattern, but then need only the "SomeVariable" part, you may try:
(?<=\[User::)\w+(?=])
which uses look-arounds.
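In C# the two patterns could be used as in the sketch below. The class name SsisVariable and the two method names are invented for the example; when nothing matches, Match.Value is an empty string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SsisVariable
{
    // Returns the whole bracketed reference, e.g. [User::RootPath].
    public static string FullReference(string expression)
    {
        return Regex.Match(expression, @"\[User::\w+]").Value;
    }

    // Returns only the variable name; the look-arounds check the
    // surrounding "[User::" and "]" without including them in the match.
    public static string NameOnly(string expression)
    {
        return Regex.Match(expression, @"(?<=\[User::)\w+(?=])").Value;
    }
}
```

For the string from the question, FullReference returns [User::RootPath] and NameOnly returns RootPath.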

Here it is bro
using System;
using System.Text.RegularExpressions;

namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            String sourcestring = "source string to match with pattern";
            Regex re = new Regex(@"\[\S+\]");
            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}
