I have this HTML:
<a href="...">This is the content.</a>
I just need to get rid of the anchor tag html around the content text, so that all I end up with is "This is the content".
Can I do this using Regex.Replace?
Use this regex: <a[^>]+?>(.*?)</a>
Run it with the Regex class and iterate through the resulting match collection
to get the inner text.
string text = "<a href=\"...\">test</a>";
Regex rx = new Regex("<a[^>]+?>(.*?)</a>");
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
Console.WriteLine("{0} matches found. \n", matches.Count);
// Report on each match.
foreach (Match match in matches)
{
    Console.WriteLine(match.Value);
    Console.WriteLine("Groups:");
    foreach (var g in match.Groups)
    {
        Console.WriteLine(g.ToString());
    }
}
Console.ReadLine();
Output:
1 matches found.
<a href="...">test</a>
Groups:
<a href="...">test</a>
test
The text matched by the expression in () is stored in the second item of the match's Groups collection (the first item is the whole match itself). Each parenthesized expression gets its own entry in the Groups collection. See MSDN for further information.
If you had to use Replace, this'd work for simple string content inside the tag:
Regex r = new Regex("<[^>]+>");
string result = r.Replace(@"<a href=""..."">This is the content.</a>", "");
Console.WriteLine("Result = \"{0}\"", result);
Good luck
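Since the question asks specifically about Regex.Replace, here is a minimal sketch that removes only the anchor tags and keeps their inner text by substituting group 1 (`$1`). The class and method names are made up for illustration, and it assumes the anchors are not nested.

```csharp
using System.Text.RegularExpressions;

public static class AnchorStripper
{
    // Assumption: anchors are not nested, so a lazy (.*?) up to the next </a> is enough.
    private static readonly Regex Anchor =
        new Regex(@"<a\b[^>]*>(.*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every <a ...>text</a> with its inner text ($1 refers to capture group 1).
    public static string StripAnchors(string html)
    {
        return Anchor.Replace(html, "$1");
    }
}
```

Calling `AnchorStripper.StripAnchors("<a href=\"#\">This is the content.</a>")` returns `This is the content.`, and any other tags in the input are left alone.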
You could also use groups in Regex.
For example, the following would give you the content of any tag.
Regex r = new Regex(@"<a.*>(.*)</a>");
// Regex r = new Regex(@"<.*>(.*)</.*>"); or any kind of tag
var m = r.Match(@"<a href=""..."">This is the content.</a>");
string content = m.Groups[1].Value;
You define groups in a regex with parentheses. Note that group 0 is the whole match, so the first parenthesized group is group 1.
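One caveat with the greedy `.*` in that pattern: when the input contains more than one anchor, group 1 ends up holding the text of the last one. The sketch below (helper name is hypothetical) lets you compare a greedy and a lazy pattern on the same input.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns capture group 1 of the first match, or an empty string when nothing matches.
    public static string FirstGroup(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For the input `<a href="x">one</a> and <a href="y">two</a>`, the greedy pattern `<a.*>(.*)</a>` yields `two`, because the first `.*` swallows everything up to the last opening tag. The lazy pattern `<a[^>]*>(.*?)</a>` yields `one`.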
Related
I am trying to get results for each individual word within backticks. For example,
if I have something like this text
some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`
I want the search results to be:
match
these_words
th_is_wor
THIS_WOR
thi_sqw
word_snake
I'm essentially trying to get each "word", word being one or more english letter or underscore characters, between each set of backticks.
I currently have the following regex that seems to match ALL the text between each set of backticks:
/(?<=`)(\b([^`\\]|\w|_)*\b)(?=`)/gi
This uses a positive lookbehind to find text that comes after a ` character: (?<=`)
Followed by a capture group for zero or more things such that the thing is not a `, not a \, is a word character, or is an _ character within word boundaries: (\b([^`\\]|\w|_)*\b)
Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.
This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:
match these_words th_is_wor
THIS_WOR thi_sqw
word_snake
If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!
Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.
Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!
using System;
using System.Text.RegularExpressions;
class Program {
    static void Main(string[] args) {
        string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
        string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
        string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
        string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
        // Call Matches method without specifying any options.
        try {
            foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
            Console.WriteLine();
            foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
        }
        catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.
        Console.WriteLine();
        // Call Matches method for case-insensitive matching.
        try {
            foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
            Console.WriteLine();
            foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
                Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
        }
        catch (RegexMatchTimeoutException) {}
    }
}
His explanation for this was as follows, but you can paste his regex into regexr.com for more info
var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";
var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick
var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word
// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";
// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
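If a single lookbehind pattern feels hard to maintain, a two-pass approach is a simpler alternative. This is a different technique from the one above, sketched here only for comparison: first match each delimited span, then pull the \w+ runs out of it. The class and method names are made up, and the delimiter is assumed to be a plain character such as a backtick or a double quote (not a regex metacharacter).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedWords
{
    // Assumption: 'delimiter' is a simple character like ` or " that needs no
    // special handling inside a character class.
    public static List<string> Extract(string text, char delimiter)
    {
        string d = Regex.Escape(delimiter.ToString());
        // Pass 1: each delimiter ... delimiter span; group 1 is the text in between.
        string spanPattern = d + "([^" + d + "]*)" + d;

        var words = new List<string>();
        foreach (Match span in Regex.Matches(text, spanPattern))
        {
            // Pass 2: every run of word characters (\w includes '_') inside the span.
            foreach (Match word in Regex.Matches(span.Groups[1].Value, @"\w+"))
            {
                words.Add(word.Value);
            }
        }
        return words;
    }
}
```

With the backtick sentence from the question, `DelimitedWords.Extract(sentence, '`')` gives match, these_words, th_is_wor, THIS_WOR, thi_sqw and word_snake. The trade-off is that it does not report positions in the original string unless you add the span's index yourself.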
With C# you can use the Group.Captures Property and then get the capture group values.
Note that \w also matches _
`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`
Explanation
` Match a backtick literally
(?: Non-capturing group, repeated as a whole
[\p{Zs}\t]* Match optional spaces or tabs
(\w+) Capture group 1, match 1+ word characters
[\p{Zs}\t]* Match optional spaces or tabs
)+ Close the non-capturing group and repeat it 1 or more times
` Match a backtick literally
See a .NET regex demo and a C# demo.
For example:
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
    string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
    Console.WriteLine(String.Join(',', result));
}
Output
match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake
I have been trying for two days to solve this problem: I have a MatchCollection, the pattern contains a group, and I want a list of the values captured by that group (there are two or more matches).
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "^<tr>$?<td>$?[D-M][i-r],[' '][0-3][1-9].[0-1][1-9].[0-9][0-9]$?</td>$?<td>$?([1-9][0-2]?)$?</td>$?";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
string s = groups[1].Value;
Datum2.Text = s;
}
But only the last match (2) appears in the TextBox "Datum2".
I know that I have to use e.g. a listbox, but the Groups[1].Value is a string...
Thanks for your help and time.
Dieter
The first thing to correct in the code is Datum2.Text = s;, which overwrites the text in Datum2 on every iteration, so only the last match survives.
Now, about your regex:
^ forces a match at the beginning of the input, so there is really only 1 match. If you remove it, it'll match twice.
I can't tell what was intended with the $? scattered all over the pattern (just take them out).
[' '] matches either a quote or a space (there is no need to repeat characters in a character class).
All dots in [0-3][1-9].[0-1][1-9].[0-9][0-9] need to be escaped. Otherwise a dot matches any character.
[0-1][1-9] matches all months except "10". The second character should be [0-9] (or \d). The same applies to the day, where [0-3][1-9] misses 10, 20 and 30.
Code:
string input = "<tr><td>Mi, 09.09.15</td><td>1</td><td>PK</td><td>E</td><td>123</td><td></td></tr><tr><td>Mi, 09.09.15</td><td>2</td><td>ER</td><td>ER</td><td>234</td><td></td></tr>";
string Patter2 = "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";
Regex r2 = new Regex(Patter2);
MatchCollection mc2 = r2.Matches(input);
string s= "";
foreach (Match match in mc2)
{
GroupCollection groups = match.Groups;
s = s + " " + groups[1].Value;
}
Datum2.Text = s;
Output:
1 2
You should know that regex is not the right tool for parsing HTML. It'll work for simple cases, but for real-world cases do consider using HTML Agility Pack.
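Since the question asked for a list rather than one concatenated string, the same loop can collect each Groups[1].Value into a List<string>, which can then be bound to a ListBox or similar. A minimal sketch using the corrected pattern from this answer (class and method names are made up):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowNumbers
{
    // Corrected pattern from the answer: a date cell followed by a numeric cell.
    private const string Pattern =
        "<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>([1-9][0-2]?)</td>";

    // Returns the value of capture group 1 for every match, in document order.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(html, Pattern))
        {
            values.Add(match.Groups[1].Value);
        }
        return values;
    }
}
```

For the input string from the question this returns a list containing "1" and "2".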
Truth is, I'm having a hard time writing a regex string to parse something in the form of
[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]
This markup would be parsed so that I can dynamically build tabs as demonstrated here. Initially I tried a regex pattern like \[\[\[tab name=(?'name'.*?) content=(?'content'.*?)\]\]\]
But I realized I couldn't get the tab as a whole and build upon a query without doing a Regex.Replace. Is it possible to take the entire tab leading up to the pipe symbol as a group and then parse that group down into the sub key/value pairs?
This is the current regex string I'm working with \[\[\[(?'tab'tab name=(?'name'.*?) content=(?'content'.*?))\]\]\]
And here is my code for performing the regex. Any guidance would be appreciated.
public override string BeforeParse(string markupText)
{
if (CompiledRegex.IsMatch(markupText))
{
// Replaces the [[[code lang=sql|xxx]]]
// with the HTML tags (surrounded with {{{roadkillinternal}}.
// As the code is HTML encoded, it doesn't get butchered by the HTML cleaner.
MatchCollection matches = CompiledRegex.Matches(markupText);
foreach (Match match in matches)
{
string tabname = match.Groups["name"].Value;
string tabcontent = HttpUtility.HtmlEncode(match.Groups["content"].Value);
markupText = markupText.Replace(match.Groups["content"].Value, tabcontent);
markupText = Regex.Replace(markupText, RegexString, ReplacementPattern, CompiledRegex.Options);
}
}
return markupText;
}
Is this what you want?
string input = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
Regex r = new Regex(@"tab name=([a-z0-9]+) content=([a-z0-9]+)(\||])");
foreach (Match m in r.Matches(input))
{
    Console.WriteLine("{0} : {1}", m.Groups[1].Value, m.Groups[2].Value);
}
http://regexr.com/3boot
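Maybe string.Split would work better in this case? For example, something like this:
string str = "[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
// Strip the surrounding brackets first, then split into one entry per tab.
foreach (var entry in str.Trim('[', ']').Split('|'))
{
    var eqBlocks = entry.Split('=');
    // eqBlocks[1] looks like "dog content", so keep only the part before the space.
    var tabName = eqBlocks[1].Split(' ')[0];
    var content = eqBlocks[2];
}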
Ugly code, but should work.
Try this:
It starts at a word boundary and is followed only by allowed characters (word characters, spaces and =).
/\b[\w =]*/g
https://regex101.com/r/cI7jS7/1
Just distill the regex pattern down to the individual tab pattern, name=??? content=???, and match only that. Each tab then becomes its own Match (two in your example) from which the data can be extracted.
string text = @"[[[tab name=dog content=cat|tab name=dog2 content=cat2]]]";
string pattern = @"name=(?<Name>[^\s]+)\scontent=(?<Content>[^\s|\]]+)";
var result = Regex.Matches(text, pattern)
                  .OfType<Match>()
                  .Select(mt => new
                  {
                      Name = mt.Groups["Name"].Value,
                      Content = mt.Groups["Content"].Value,
                  });
The result is an enumerable of anonymous-type objects, one per tab, which can be bound directly to the control.
Note that in the set notation [^\s|\]] the pipe | is treated as a literal in the set and not as an alternation. The bracket ] does have to be escaped, though, to be treated as a literal. So the set reads: "match any character that is not (^) whitespace, a pipe or a closing bracket".
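The question also asked whether the whole [[[...]]] block can be matched once and then broken down per tab. In .NET that is possible because a group inside a repeated construct keeps every capture in Group.Captures. The following is only a sketch under the assumption that names contain no whitespace and contents contain no | or ]; the class name and pattern are illustrative, not taken from the answers.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TabBlockParser
{
    // One match for the whole block; the inner group repeats once per tab.
    private static readonly Regex Block = new Regex(
        @"\[\[\[(?:tab name=(?<name>[^\s|\]]+) content=(?<content>[^|\]]+)\|?)+\]\]\]");

    public static List<KeyValuePair<string, string>> Parse(string markup)
    {
        var tabs = new List<KeyValuePair<string, string>>();
        Match m = Block.Match(markup);
        if (!m.Success)
        {
            return tabs;
        }

        // Each repetition added one capture to both named groups, in order.
        CaptureCollection names = m.Groups["name"].Captures;
        CaptureCollection contents = m.Groups["content"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            tabs.Add(new KeyValuePair<string, string>(names[i].Value, contents[i].Value));
        }
        return tabs;
    }
}
```

For the sample input this yields the pairs (dog, cat) and (dog2, cat2). Matching the individual name=... content=... pattern with Regex.Matches, as shown above, is still the simpler route when the surrounding block does not need validating.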
I can't find a regex for <#anystring#>.
Ex: <#sda#> or <#32dwdwwd#>, or any string between <# and #>
I tried "<#[^<#]+#>", but this only outputs the first occurrence.
string sample = "\n\n<#sample01#> jus some words <#sample02#> <#sample03#> just some words ";
Match match = Regex.Match(sample, "<#[^<#]+#>");
if (match.Success)
{
foreach (Capture capture in match.Captures)
{
Console.WriteLine(capture.Value);
}
}
You are using the Regex.Match method. If you read the documentation, you will see that it returns only the first match.
Try the Regex.Matches method instead; it returns a MatchCollection.
It would look something like this (careful: not tested, written directly here):
string sample = "\n\n<#sample01#> jus some words <#sample02#> <#sample03#> just some words ";
MatchCollection mc = Regex.Matches(sample, "<#(.*?)#>");
foreach (Match m in mc)
{
    Console.WriteLine(m.Groups[0]);
}
Try this one; it should work.
UPDATED
<#(.*?)#>
The dot is any character except new line (\n).
The * means 0 or more.
The ? is used to make it ungreedy.
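Putting the lazy pattern together with Regex.Matches, a small helper (names are made up) can return just the text between <# and #> by reading group 1 of each match:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenFinder
{
    // Returns the text between each <# and #> pair, in order of appearance.
    public static List<string> FindTokens(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, "<#(.*?)#>"))
        {
            // Groups[0] would be the whole "<#...#>"; Groups[1] is only the inner text.
            tokens.Add(m.Groups[1].Value);
        }
        return tokens;
    }
}
```

For the sample string from the question this gives sample01, sample02 and sample03. Because the dot does not match \n by default, a token that spans several lines would need RegexOptions.Singleline.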
I have the following working pattern:
String pattern = @"<(?<field>[^/>]+)>(?<data>.*)</\k<field>>";
String tcSQML = "<Pri_Key1>62</Pri_Key1><First1>SAM</First1><Last1>SPADE</Last1> <GstNo1></GstNo1><Pri_Key2>63</Pri_Key2><First2>TONY</First2><Last2>TUNE</Last2><GstNo2></GstNo2><Pri_Key3>64</Pri_Key3><First3>FRANK</First3><Last3>FAST</Last3><GstNo3></GstNo3><Pri_Key4>65</Pri_Key4><First4>BILLIE</First4><Last4>BLADES</Last4><GstNo4></GstNo4>";
MatchCollection matches = Regex.Matches(tcSQML, pattern, RegexOptions.Singleline);
foreach (Match m in matches)
{
Console.WriteLine(m.Groups["field"].ToString());
Console.WriteLine(m.Groups["number"].ToString());
}
What I want is to also capture the number after the field name. E.g. in Pri_Key1, Pri_Key should be the field and 1 should be the number. I cannot figure out how to introduce this new number group into the pattern. I tried a few variations and nothing works. I am not good with regex, so help and an explanation would be appreciated.
This is untested:
String pattern = @"<(?<field>[^0-9/>]+)(?<number>[^/>]*)>(?<data>.*)</\k<field>\k<number>>";
I just added a new group after the field group. The field group now matches any string that does not contain a digit, / or >. The number group matches whatever is left between the field and the end of the tag.
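To try the idea out, here is a small sketch with two variations that are my own assumptions rather than part of the answer: the number group is restricted to digits ([0-9]*), and the data group is lazy (.*?) so it stops at the first matching closing tag. The class and method names are made up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // field  = tag name up to the first digit
    // number = trailing digits of the tag name (may be empty)
    // data   = everything up to the matching closing tag
    private static readonly Regex Element = new Regex(
        @"<(?<field>[^0-9/>]+)(?<number>[0-9]*)>(?<data>.*?)</\k<field>\k<number>>",
        RegexOptions.Singleline);

    // Returns one "field|number|data" line per element found.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Element.Matches(input))
        {
            lines.Add(m.Groups["field"].Value + "|" +
                      m.Groups["number"].Value + "|" +
                      m.Groups["data"].Value);
        }
        return lines;
    }
}
```

For `<Pri_Key1>62</Pri_Key1><First1>SAM</First1><GstNo1></GstNo1>` this produces `Pri_Key|1|62`, `First|1|SAM` and `GstNo|1|`. Tag names with digits in the middle would be split at the first digit, so this only suits names shaped like the ones in the question.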