Regex pattern BBCode to Wiki Notation, C# - c#

I am tasked with converting BBCode to wiki notation, and thanks to the many examples on SO I have cracked most of the tougher nuts. This is my first foray into Regex and I'm trying to learn it as I go (I would prefer StringBuilder, but it doesn't seem to work with BBCode). There are 4 items I need replaced for which I cannot seem to create the proper pattern: (original string on left, what I need on right after double dash)
The first item is a problem child because the wiki engine adds a new line where the spaces are. It is not a separate field but part of a larger string, so I can't Trim() it. I am currently using
result = result.Replace("[b]", "*").Replace("[/b]", "*");
The img issue is that I need to somehow carry the attributes over in the given format, if possible.
For the last 2 I am stumped. I have used
Regex r = new Regex(@"<a .*?href=['""](.+?)['""].*?>(.+?)</a>");
foreach (var match in r.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string href = match.Groups[1].Value;
string txt = match.Groups[2].Value;
string wikilink = "[" + txt + "|" + href + "]";
sb.Remove(match.Groups[2].Index, match.Groups[2].Length);
sb.Insert(match.Groups[2].Index, wikilink);
}
in the past for HTML but can't seem to refactor it for my current needs. Suggestions, links to resources, all would be appreciated.
EDIT
Solved the img issue, though it's not pretty and I still risk removing a closing [/img] tag that may not be caught earlier. The [img] code is fairly consistent, so I used:
Regex imgparser = new Regex(@"\[img[^\]]*\]([^\[]*)");
foreach (var itag in imgparser.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string isrc = itag.Groups[1].Value;
string wikipic = itag.ToString().Replace("[img ", "!" + isrc).Replace("width=", "!width=").Replace("height=", ",height=").Replace("]" + isrc, string.Empty);
result = result.Replace(itag.ToString(), wikipic);
}
result = result.Replace("[/img]", "!");

I can give you a small example for the last case:
string str1 = "[url=http://aadqsdqsd]link[/url]";
var pattern = @"^\[url=(.*)\](.*)\[\/url\]$";
var match = Regex.Match(str1, pattern);
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
//[link| http://aadqsdqsd]
Is that what you want?
EDIT
If you want to match a larger string you can do:
var strTomatch = "[url=http://1]link1[/url][url=http://2]link2[/url]" + Environment.NewLine +
"[url = http://3]link3[/url]" + Environment.NewLine +
"[url=http://4]link4[/url]";
var match = Regex.Match(strTomatch, @"\[url\s*=\s*(.*?)\](.*?)\[\/url\]", RegexOptions.Multiline);
while (match.Success)
{
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
Debug.WriteLine(result);
match = match.NextMatch();
}
Output
[link1| http://1]
[link2| http://2]
[link3| http://3]
[link4| http://4]
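If the goal is to rewrite the links in place rather than print them, the same pattern can be handed to Regex.Replace. The following is only a sketch under that assumption: the names BbCodeLinks and UrlsToWiki are made up for illustration, and the [text|href] output format (no space after the pipe) is taken from the wikilink line in the question, not from any wiki specification.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbCodeLinks
{
    // Same idea as the pattern above: group 1 is the href, group 2 is the link text.
    private static readonly Regex UrlTag = new Regex(
        @"\[url\s*=\s*(.*?)\](.*?)\[/url\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Rewrites every [url=href]text[/url] as [text|href] in a single pass.
    public static string UrlsToWiki(string input)
    {
        return UrlTag.Replace(input, "[$2|$1]");
    }
}
```

Because Regex.Replace builds a new string, there is no need to track match indexes or iterate in reverse order.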

Related

Regex from a html parsing, how do I grab a specific string?

I'm trying to specifically get the string after charactername= and before " >. How would I use regex to allow me to catch only the player name?
This is what I have so far, and it's not working: it doesn't actually print anything. client.DownloadString returns a string like this:
<a href="https://my.examplegame.com/charactername=Atro+Roter" >
So I know it actually gets the string; I'm just stuck on the regex.
using (var client = new WebClient())
{
//Example of what the string looks like on Console when I Console.WriteLine(html)
//<a href="https://my.examplegame.com/charactername=Atro+Roter" >
// I want the "Atro+Roter"
string html = client.DownloadString(worldDest + world + inOrderName);
string playerName = "https://my.examplegame.com/charactername=(.+?)\" >";
MatchCollection m1 = Regex.Matches(html, playerName);
foreach (Match m in m1)
{
Console.WriteLine(m.Groups[1].Value);
}
}
I'm trying to specifically get the string after charactername= and before " >. 
So, you just need a lookbehind with lookahead and use LINQ to get all the match values into a list:
var input = "your input string";
var rx = new Regex(@"(?<=charactername=)[^""]+(?="")");
var res = rx.Matches(input).Cast<Match>().Select(p => p.Value).ToList();
The res variable should hold all your character names now.
I assume your issue is trying to parse the URL. Don't - use what .NET gives you:
var playerName = "https://my.examplegame.com/?charactername=NAME_HERE";
var uri = new Uri(playerName);
var queryString = HttpUtility.ParseQueryString(uri.Query);
Console.WriteLine("Name is: " + queryString["charactername"]);
This is much easier to read and no doubt more performant.
Working sample here: https://dotnetfiddle.net/iJlBKW
Forward slashes can be escaped with backslashes like this: \/ (in .NET this is optional, because / has no special meaning in a regex pattern).
string input = @"<a href=""https://my.examplegame.com/charactername=Atro+Roter"" >";
string playerName = @"https:\/\/my.examplegame.com\/charactername=(.+?)""";
Match match = Regex.Match(input, playerName);
string result = match.Groups[1].Value;
Result = Atro+Roter
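Combining the answers above, a single capture group is enough to collect every name on the page. This is only a sketch: PlayerNames and Extract are hypothetical names, and it assumes the name always sits between charactername= and the closing double quote of the href, as in the sample string from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlayerNames
{
    // Group 1 captures everything after "charactername=" up to the next double quote.
    private static readonly Regex NamePattern =
        new Regex(@"charactername=([^""]+)""");

    // Returns the raw (still URL-encoded) names in the order they appear.
    public static List<string> Extract(string html)
    {
        var names = new List<string>();
        foreach (Match m in NamePattern.Matches(html))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }
}
```

The values are returned exactly as they appear in the markup, so "Atro+Roter" keeps its plus sign; decoding it is a separate step.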

String operation in C#

I have input strings arriving in the following format:
"http://testing/site/name/lists/tasks"
"http://testing/site/name1/lists/tasks"
"http://testing/site/name2/lists/tasks" etc.,
How can I extract only name, name1, name2, etc. from this string?
Here is what I have tried:
SiteName = (Url.Substring("http://testing/site/".Length)).Substring(Url.Length-12)
It is throwing an exception stating StartIndex cannot be greater than the number of characters in the string. What is wrong with my expression? How can I fix it? Thanks.
A better option would be to use Regex matching/replacing.
But the following will also work, assuming that all the URLs follow the same pattern:
var value = Url.Replace(@"http://testing/site/", "").Replace(@"/lists/tasks", "");
The other option would be to use Uri:
var uriAddress = new Uri(@"http://testing/site/name/lists/tasks");
and then break down the URI parts according to your requirement.
This is a job for regexp:
string strRegex = @"http://testing/site/(.+)/lists/tasks";
RegexOptions myRegexOptions = RegexOptions.IgnoreCase;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = @"http://testing/site/name/lists/tasks" + "\r\n" + @"http://testing/site/name1/lists/tasks" + "\r\n" + @"http://testing/site/name2/lists/tasks" + "\r\n" + @"http://testing/site/name3/lists/tasks";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here. The site name is in the first group: myMatch.Groups[1].Value
}
}
You could also use the Uri class to get the desired part:
string[] urlString = urlText.Split();
Uri uri = default(Uri);
List<string> names = urlString
.Where(u => Uri.TryCreate(u, UriKind.Absolute, out uri))
.Select(u => uri.Segments.FirstOrDefault(s => s.StartsWith("name", StringComparison.OrdinalIgnoreCase)))
.ToList();
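This assumes that the part always starts with "name".
The exception occurs because Substring with a single argument takes the index of the starting character and consumes everything to the end of the string; the second call is made on the already-shortened string, so Url.Length-12 is past its end. It is a little naive, but you can start at character 20 (the length of "http://testing/site/"): Url.Substring(20), which still leaves the trailing /lists/tasks to remove.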
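The Substring approach from the question can be made to work with the two-argument overload, where the second argument is a length counted from the start index. A minimal sketch, assuming every URL has exactly the fixed prefix and suffix shown in the question (SiteNames and Extract are hypothetical names):

```csharp
using System;

public static class SiteNames
{
    private const string Prefix = "http://testing/site/";
    private const string Suffix = "/lists/tasks";

    // Returns the part between the fixed prefix and suffix,
    // or null if the URL does not have that shape.
    public static string Extract(string url)
    {
        if (url.Length < Prefix.Length + Suffix.Length ||
            !url.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase) ||
            !url.EndsWith(Suffix, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }
        // Substring(startIndex, length): length is measured from startIndex,
        // not from the beginning of the string.
        return url.Substring(Prefix.Length, url.Length - Prefix.Length - Suffix.Length);
    }
}
```

The length check guards against inputs where the prefix and suffix would overlap, which would otherwise produce a negative length.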

c# Regex question

I have a problem dealing with the @ symbol in Regex. I am trying to remove @sometext
from a text string, but I can't seem to find anywhere that uses the @ as a literal. I have tried myself, but it doesn't remove the word from the string. Any ideas?
public string removeAtSymbol(string input)
{
Regex findWords = new Regex(______);//Find the words like "@text"
Regex[] removeWords;
string test = input;
MatchCollection all = findWords.Matches(test);
removeWords = new Regex[all.Count];
int index = 0;
string[] values = new string[all.Count];
YesOutputBox.Text = " you got here";
foreach (Match m in all) //List all the words
{
values[index] = m.Value.Trim();
index++;
YesOutputBox.Text = YesOutputBox.Text + " " + m.Value;
}
for (int i = 0; i < removeWords.Length; i++)
{
removeWords[i] = new Regex(" " + values[i]);
// If the words appears more than one time
if (removeWords[i].Matches(test).Count > 1)
{
removeWords[i] = new Regex(" " + values[i] + " ");
test = removeWords[i].Replace(test, " "); //Remove the first word.
}
}
return test;
}
You can remove all occurrences of "@sometext" from the string test via the method
Regex.Replace(test, "@sometext", "")
or for any word starting with "@" you can use
Regex.Replace(test, "@\\w+", "")
If you specifically need a separate word (i.e. nothing like @comp within tom@comp.com) you may precede the regex with a special word boundary (\b does not work here):
Regex.Replace(test, "(^|\\W)@\\w+", "")
You can use:
^\s@([A-Za-z0-9_]+)
as the regex to recognize Twitter usernames.
Regex to remove @something from this string: I want to remove @something from this string.
var regex = new Regex("@\\w*");
string result = regex.Replace(stringWithAt, "");
Is that what you are looking for?
I've had good luck applying this pattern:
\B@\w+
This will match any string starting with an @ character that contains alphanumeric characters, plus some linking punctuation like the underscore character, if it does not occur on a boundary between alphanumeric and non-alphanumeric characters.
The result of executing this code:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"\B@\w+",
@"redacted");
is the following string:
redacted redacted this2@3that redacted redacted@beta@gamma
If this question is Twitter-specific, then Twitter provides an open source library that helps capture Twitter-specific entities like links, mentions and hashtags. This java file contains the code defining the regular expressions that Twitter uses, and this yml file contains test strings and expected outcomes of many unit tests that exercise the regular expressions in the Twitter library.
Twitter's mention-matching pattern (extracted from their library, modified to remove unnecessary capture groups, and edited to make sense in the context of a replacement) is shown below. The match should be performed in a case-insensitive manner.
(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}
Here is an example which reproduces the results of the first replacement in my answer:
string result = Regex.Replace(
@"@This1 @That2_thing this2@3that @the5Others @alpha@beta@gamma",
@"(^|[^a-z0-9_])[@\uFF20][a-z0-9_]{1,20}",
@"$1redacted",
RegexOptions.IgnoreCase);
Note the need to include the substitution $1 since the first capture group can't be directly converted into an atomic zero-width assertion.

.NET Regex question

I'm trying to parse some data out of a website. The problem is that a javascript generates the data, thus I can't use a HTML parser for it. The string inside the source looks like:
<a href="http:www.domain.compid.php?id=123">
Everything is constant except the id that comes after the =. I don't know how many times the string will occur either. Would appreciate any help and an explanation on the regex example if possible.
Do you need to save any of it? A blanket regex href="[^"]+"> will match the entire string. If you need to save a specific part, let me know.
EDIT: To save the id, note the parentheses after id=, which mark the part to capture. To retrieve it, use the match object's Groups property.
string source = "a href=\"http:www.domain.compid.php?id=123\">";
Regex re = new Regex("href=\"[^\"]+id=([^\"]+)\">");
Match match = re.Match(source);
if(match.Success)
{
Console.WriteLine("It's a match!\nI found:{0}", match.Groups[0].Value);
Console.WriteLine("And the id is {0}", match.Groups[1].Value);
}
EDIT: example using MatchCollection
MatchCollection mc = re.Matches(source);
foreach(Match m in mc)
{
//do the same as above. except use "m" instead of "match"
//though you don't have to check for success in each m match object
//since it wouldn't have been added to the MatchCollection if it wasn't a match
}
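To make the MatchCollection loop concrete, here is a minimal sketch that gathers every id into a list. The names IdExtractor and ExtractIds are hypothetical, and the pattern assumes the ids are purely numeric and appear as an id= query parameter inside an href attribute, which matches the sample string but is not confirmed for the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdExtractor
{
    // Group 1 captures the digits that follow "?id=" or "&id=" inside an href value.
    private static readonly Regex IdPattern =
        new Regex("href=\"[^\"]*[?&]id=(\\d+)\"");

    // Returns every captured id, in document order.
    public static List<string> ExtractIds(string source)
    {
        var ids = new List<string>();
        foreach (Match m in IdPattern.Matches(source))
        {
            // Every Match in the collection is a successful match,
            // so no Success check is needed here.
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }
}
```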
This does the parsing in javascript and creates a csv-string:
var re = /<a href="http:www.domain.compid.php\?id=(\d+)">/;
var source = document.body.innerHTML;
var result = "result: ";
var match = re.exec(source);
while (match != null) {
result += match[1] + ",";
source = source.substring(match.index + match[0].length);
match = re.exec(source);
}
If the HTML content is not used for anything else on the server, it should be sufficient to send the ids.
EDIT: For performance and reliability it's probably better to use built-in JavaScript functions (or jQuery) to find the URLs instead of searching the entire content:
var re = /www.domain.compid.php\?id=(\d+)/;
var as = document.getElementsByTagName('a');
var result = "result: ";
for (var i = 0; i < as.length; i++) {
var match = re.exec(as[i].getAttribute('href'));
if (match != null) {
result += match[1] + ",";
}
}

Using Regex to edit a string in C#

I'm just beginning to use Regex so bear with my terminology. I have a regex pattern that is working properly on a string. The string could be in the format "text [pattern] text". Therefore, I also have a regex pattern that negates the first pattern. If I print out the results from each of the matches everything is shown correctly.
The problem I'm having is that I want to add text into the string, and that changes the indexes of the matches in a regex MatchCollection. For example, if I wanted to enclose the found match in <td> match </td> tags, I have the following code:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
if (mc.Count > 0)
{
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index, mc[i].Length);
text = text.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
}
}
This works great for the first match. But as you'd expect the mc[i].Index is no longer valid because the string has changed. Therefore, I tried to search for just a single match in the for loop for the amount of matches I would expect (mc.Count), but then I keep finding the first match.
So hopefully without introducing more regex to make sure it's not the first match and with keeping everything in one string, does anybody have any input on how I could accomplish this? Thanks for your input.
Edit: Thank you all for your responses, I appreciate all of them.
It can be as simple as:-
string newString = Regex.Replace("abc", "b", "<td>${0}</td>");
Results in a<td>b</td>c.
In your case:-
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
text = r.Replace(text, "<td>${0}</td>");
This will replace all occurrences of negRegexPattern with the content of that match surrounded by the td element.
Although I agree that the Regex.Replace answer above is the best choice, just to answer the question you asked: how about replacing from the last match to the first? That way the string only grows after the matches you have not processed yet, so the indexes of the earlier matches are still valid.
for (int i = mc.Count - 1; i >= 0; --i)
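The back-to-front idea can be sketched in full as follows. This is only an illustration: BackToFront and WrapMatches are hypothetical names, and a StringBuilder is used instead of repeated string.Remove/Insert calls purely to avoid allocating a new string per edit. Note that the loop has to run down to index 0 inclusive, otherwise the first match is skipped.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class BackToFront
{
    // Wraps every match of pattern in <td>...</td>, editing from the last match to the first.
    public static string WrapMatches(string text, string pattern)
    {
        MatchCollection mc = Regex.Matches(text, pattern);
        var sb = new StringBuilder(text);
        for (int i = mc.Count - 1; i >= 0; --i)
        {
            // Edits at higher indexes never shift the positions of earlier matches,
            // so mc[i].Index is still valid when we get to it.
            sb.Remove(mc[i].Index, mc[i].Length);
            sb.Insert(mc[i].Index, "<td>" + mc[i].Value + "</td>");
        }
        return sb.ToString();
    }
}
```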
static string Tabulate(Match m)
{
return "<td>" + m.ToString() + "</td>";
}
static void Replace()
{
string text = "your text";
string result = Regex.Replace(text, "your_regexp", new MatchEvaluator(Tabulate));
}
You can try something like this:
Regex.Replace(input, pattern, match =>
{
return "<tr>" + match.Value + "</tr>";
});
Keep a counter before the loop starts, and add the number of characters you inserted every time, i.e.:
Regex r = new Regex(negRegexPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
MatchCollection mc = r.Matches(text);
int counter = 0;
for (int i = 0; i < mc.Count; i++)
{
text = text.Remove(mc[i].Index + counter, mc[i].Length);
text = text.Insert(mc[i].Index + counter, "<td>" + mc[i].Value + "</td>");
counter += ("<td>" + "</td>").Length;
}
I haven't tested this, but it SHOULD work.
