.NET Regex question - c#

I'm trying to parse some data out of a website. The problem is that the data is generated by JavaScript, so I can't use an HTML parser for it. The string inside the source looks like:
<a href="http:www.domain.compid.php?id=123">
Everything is constant except the id that comes after the =. I don't know how many times the string will occur either. Would appreciate any help and an explanation on the regex example if possible.

Do you need to save any of it? A blanket regex href="[^"]+"> will match the entire string. If you need to save a specific part, let me know.
EDIT: To save the id, note the parentheses after id=, which tell the regex to capture that part. To retrieve it, use the match object's Groups property.
string source = "a href=\"http:www.domain.compid.php?id=123\">";
Regex re = new Regex("href=\"[^\"]+id=([^\"]+)\">");
Match match = re.Match(source);
if(match.Success)
{
Console.WriteLine("It's a match!\nI found:{0}", match.Groups[0].Value);
Console.WriteLine("And the id is {0}", match.Groups[1].Value);
}
EDIT: example using MatchCollection
MatchCollection mc = re.Matches(source);
foreach(Match m in mc)
{
//do the same as above. except use "m" instead of "match"
//though you don't have to check for success in each m match object
//since it wouldn't have been added to the MatchCollection if it wasn't a match
}
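As a minimal sketch of the MatchCollection approach, the loop below collects every id into a list. The class name IdExtractor and method name ExtractIds are made up for illustration, and the pattern assumes the id is numeric and appears as a query-string parameter inside the href value.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdExtractor
{
    // Group 1 captures the digits that follow "?id=" or "&id=" inside an href value.
    // Assumption: ids are purely numeric, as in the question's example.
    private static readonly Regex IdPattern =
        new Regex("href=\"[^\"]*[?&]id=(\\d+)\"");

    // Returns every captured id, in document order.
    public static List<string> ExtractIds(string source)
    {
        var ids = new List<string>();
        foreach (Match m in IdPattern.Matches(source))
        {
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }
}
```

Calling IdExtractor.ExtractIds(pageSource) returns an empty list when nothing matches, so the caller does not need a separate Success check.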

This does the parsing in javascript and creates a csv-string:
var re = /<a href="http:www.domain.compid.php\?id=(\d+)">/;
var source = document.body.innerHTML;
var result = "result: ";
var match = re.exec(source);
while (match != null) {
result += match[1] + ",";
source = source.substring(match.index + match[0].length);
match = re.exec(source);
}
Demo. If the html-content is not used for anything else on the server it should be sufficient to send the ids.
EDIT: For performance and reliability it's probably better to use built-in JavaScript functions (or jQuery) to find the URLs instead of searching the entire content:
var re = /www.domain.compid.php\?id=(\d+)/;
var as = document.getElementsByTagName('a');
var result = "result: ";
for (var i = 0; i < as.length; i++) {
var match = re.exec(as[i].getAttribute('href'));
if (match != null) {
result += match[1] + ",";
}
}

Related

How to get a loop of all tagged users

I am trying to get all tagged users from a String in ASP.NET
For example the string "Hello my name is #Naveh and my friend is named #Amit", I would like it to return me "Naveh" and "Amit" in a way I can send each of those user a notification method, like a loop on the code behind.
The only way I know to catch those Strings is by the 'Replace' method like that: (But that is only good for editing of course)
Regex.Replace(comment, @"#([\S]+)", @"<b>$1</b>")
You can't loop those strings like that. How can I loop all of the tagged users in the code behind?
You should probably use Regex.Match.
E.g.
string pat = @"#([a-z]+)";
string src = "Hello my name is #Naveh and my friend is named #Amit";
string output = "";
// Instantiate the regular expression object.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
// Match the regular expression pattern against a text string.
Match m = r.Match(src);
while (m.Success)
{
string matchValue = m.Groups[1].Value; //m.Groups[0] = "#Name". m.Groups[1] = "Name"
output += "Match: " + matchValue + "\r\n";
m = m.NextMatch();
}
Console.WriteLine(output);
Console.ReadLine();
You can use Regex.Matches to get a MatchCollection object and loop through it with foreach (see MSDN).
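A rough sketch of the Regex.Matches variant follows. TagFinder and FindTaggedUsers are hypothetical names, and the pattern assumes a tag is "#" followed by word characters; adjust it if user names can contain other characters.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // Returns the name part of every "#Name" tag found in the comment.
    // Assumption: a tag is '#' followed by one or more word characters.
    public static List<string> FindTaggedUsers(string comment)
    {
        var users = new List<string>();
        foreach (Match m in Regex.Matches(comment, @"#(\w+)"))
        {
            // Groups[0] is the whole tag ("#Naveh"); Groups[1] is just the name.
            users.Add(m.Groups[1].Value);
        }
        return users;
    }
}
```

The returned list can then be iterated in the code-behind to send each user a notification.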

Regex pattern BBCode to Wiki Notation, C#

I am tasked with converting BB code to wiki notation, and thanks to the many examples on SO I have cracked most of the tougher nuts. This is my first foray into regex and I'm trying to learn it as I go (I would prefer StringBuilder, but it doesn't seem to work with BB code). There are 4 items I need replaced for which I cannot seem to create the proper pattern: (original string on left, what I need on right after double dash)
The first item is a problem child because the wiki engine adds a new line where the spaces are. It is not a separate field but part of a larger string, so I can't TRIM() it. I am currently using
result = result.Replace("[b]", "*").Replace("[/b]", "*");
For the img issue, I need to somehow include the attributes, if possible, in the given format.
For the last 2 I am stumped. I have used
Regex r = new Regex(@"<a .*?href=['""](.+?)['""].*?>(.+?)</a>");
foreach (var match in r.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string href = match.Groups[1].Value;
string txt = match.Groups[2].Value;
string wikilink = "[" + txt + "|" + href + "]";
sb.Remove(match.Groups[2].Index, match.Groups[2].Length);
sb.Insert(match.Groups[2].Index, wikilink);
}
in the past for HTML but can't seem to refactor it for my current needs. Suggestions and links to resources would all be appreciated.
EDIT
Solved the img issue, though it's not pretty and I still risk removing a closing [/img] tag that may not be caught earlier. The [img] code is fairly consistent, so I used:
Regex imgparser = new Regex(@"\[img[^\]]*\]([^\[]*)");
foreach (var itag in imgparser.Matches(multistring).Cast<Match>().OrderByDescending(m => m.Index))
{
string isrc = itag.Groups[1].Value;
string wikipic = itag.ToString().Replace("[img ", "!" + isrc).Replace("width=", "!width=").Replace("height=", ",height=").Replace("]" + isrc, string.Empty);
result = result.Replace(itag.ToString(), wikipic);
}
result = result.Replace("[/img]", "!");
I can give you a little example for the last case:
string str1 = "[url=http://aadqsdqsd]link[/url]";
var pattern = @"^\[url=(.*)\](.*)\[\/url\]$";
var match = Regex.Match(str1, pattern);
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
//[link| http://aadqsdqsd]
Is that what you want?
EDIT
If you want to match a larger string, you can do:
var strTomatch = "[url=http://1]link1[/url][url=http://2]link2[/url]" + Environment.NewLine +
"[url = http://3]link3[/url]" + Environment.NewLine +
"[url=http://4]link4[/url]";
var match = Regex.Match(strTomatch, @"\[url\s*=\s*(.*?)\](.*?)\[\/url\]", RegexOptions.Multiline);
while (match.Success)
{
var result = string.Format("[{0}| {1}]", match.Groups[2].Value, match.Groups[1].Value);
Debug.WriteLine(result);
match = match.NextMatch();
}
Output
[link1| http://1]
[link2| http://2]
[link3| http://3]
[link4| http://4]
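If the goal is to rewrite the tags in place rather than print them, Regex.Replace with group substitutions may be enough. The sketch below is only one way to do it; BbCodeConverter and ConvertUrls are hypothetical names, and the output format [text|href] is assumed from the wikilink string built in the question.

```csharp
using System.Text.RegularExpressions;

public static class BbCodeConverter
{
    // Group 1 is the target address, group 2 is the link text.
    // Singleline lets the link text span line breaks; IgnoreCase accepts [URL=...].
    private static readonly Regex UrlTag = new Regex(
        @"\[url\s*=\s*(.*?)\](.*?)\[/url\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Rewrites every [url=href]text[/url] as [text|href] and leaves other text untouched.
    public static string ConvertUrls(string text)
    {
        return UrlTag.Replace(text, "[$2|$1]");
    }
}
```

Because the replacement happens in a single pass over the whole string, there is no need to walk the matches in reverse order or to adjust indexes in a StringBuilder.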

Regex from a html parsing, how do I grab a specific string?

I'm trying to specifically get the string after charactername= and before " >. How would I use regex to allow me to catch only the player name?
This is what I have so far, and it's not working: it doesn't print anything. client.DownloadString returns a string like this:
<a href="https://my.examplegame.com/charactername=Atro+Roter" >
So, I know it actually gets string, I'm just stuck on the regex.
using (var client = new WebClient())
{
//Example of what the string looks like on Console when I Console.WriteLine(html)
//<a href="https://my.examplegame.com/charactername=Atro+Roter" >
// I want the "Atro+Roter"
string html = client.DownloadString(worldDest + world + inOrderName);
string playerName = "https://my.examplegame.com/charactername=(.+?)\" >";
MatchCollection m1 = Regex.Matches(html, playerName);
foreach (Match m in m1)
{
Console.WriteLine(m.Groups[1].Value);
}
}
I'm trying to specifically get the string after charactername= and before " >. 
So, you just need a lookbehind with lookahead and use LINQ to get all the match values into a list:
var input = "your input string";
var rx = new Regex(@"(?<=charactername=)[^""]+(?="")");
var res = rx.Matches(input).Cast<Match>().Select(p => p.Value).ToList();
The res variable should hold all your character names now.
I assume your issue is trying to parse the URL. Don't - use what .NET gives you:
var playerName = "https://my.examplegame.com/?charactername=NAME_HERE";
var uri = new Uri(playerName);
var queryString = HttpUtility.ParseQueryString(uri.Query);
Console.WriteLine("Name is: " + queryString["charactername"]);
This is much easier to read and no doubt more performant.
Working sample here: https://dotnetfiddle.net/iJlBKW
Forward slashes can be escaped with backslashes like this: \/ (in .NET this is optional, because / has no special meaning in a pattern).
string input = @"<a href=""https://my.examplegame.com/charactername=Atro+Roter"" >";
string playerName = @"https:\/\/my.examplegame.com\/charactername=(.+?)""";
Match match = Regex.Match(input, playerName);
string result = match.Groups[1].Value;
Result = Atro+Roter
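Note that the captured value is still URL-encoded ("Atro+Roter"). If the readable name is wanted, one option is to decode each capture, as in this sketch. CharacterNames and Extract are hypothetical names, and the pattern assumes the name ends at the closing quote or at an "&".

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class CharacterNames
{
    // Captures everything after "charactername=" up to the next quote or ampersand.
    private static readonly Regex NamePattern =
        new Regex(@"charactername=([^""&]+)");

    // Returns the decoded character names found in the HTML, in document order.
    public static List<string> Extract(string html)
    {
        var names = new List<string>();
        foreach (Match m in NamePattern.Matches(html))
        {
            // UrlDecode turns '+' into a space and resolves %xx escapes.
            names.Add(WebUtility.UrlDecode(m.Groups[1].Value));
        }
        return names;
    }
}
```

Skip the UrlDecode call if the raw form is what gets passed on to another URL.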

regex for PropertyName e.g. HelloWorld2HowAreYou would get Hello HelloWorld2 HelloWorld2How

I need a regex for PropertyName e.g. HelloWorld2HowAreYou would get:
Hello HelloWorld2 HelloWorld2How etc.
I want to use it in C#
[A-Z][a-z0-9]+ would give you all words that start with capital letter. You can write code to concat them one by one to get the complete set of words.
For example matching [A-Z][a-z0-9]+ against HelloWorld2HowAreYou with global flag set, you will get the following matches.
Hello
World2
How
Are
You
Just iterate through the matches and concat them to form the words.
Port this to C#
var s = "HelloWorld2HowAreYou";
var r = /[A-Z][a-z0-9]+/g;
var m;
var matches = [];
while((m = r.exec(s)) != null)
matches.push(m[0]);
var o = "";
for(var i = 0; i < matches.length; i++)
{
o += matches[i]
console.log(o + "\n");
}
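A possible C# port of the script above, offered as a sketch: PropertyNameSplitter and CumulativePrefixes are made-up names, and the pattern is the same [A-Z][a-z0-9]+ used in this answer.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PropertyNameSplitter
{
    // Splits the name into capitalized words and returns the running concatenation:
    // first word, first two words, first three words, and so on.
    public static List<string> CumulativePrefixes(string name)
    {
        var prefixes = new List<string>();
        var current = new StringBuilder();
        foreach (Match m in Regex.Matches(name, "[A-Z][a-z0-9]+"))
        {
            current.Append(m.Value);
            prefixes.Add(current.ToString());
        }
        return prefixes;
    }
}
```

Characters that the pattern does not match (for example a lowercase prefix such as "my" in "myValue") are dropped from the prefixes, which is the same behavior as the JavaScript version.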
I think something like this is what you want:
var s = "HelloWorld2HowAreYou";
Regex r = new Regex("(?=[A-Z]|$)(?<=(.+))");
foreach (Match m in r.Matches(s)) {
Console.WriteLine(m.Groups[1]);
}
The output is (as seen on ideone.com):
Hello
HelloWorld2
HelloWorld2How
HelloWorld2HowAre
HelloWorld2HowAreYou
How it works
The regex is based on two assertions:
(?=[A-Z]|$) matches positions just before an uppercase, and at the end of the string
(?<=(.+)) is a capturing lookbehind for .+ behind the current position into group 1
Essentially, the regex translates to:
"Everywhere just before an uppercase, or at the end of the string"...
"grab everything behind you if it's not an empty string"

How can I find a string after a specific string/character using regex

I am hopeless with regex (C#), so I would appreciate some help:
Basically I need to parse a text and I need to find the following information inside the text:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan: Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment, is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or: I could have a different regex for each keyword (there will only be a handful). But I still don't know how to have the "words to look for" constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of spaces
\w* = 0 or more word characters (non-space, basically)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(fyi) The "@" before strings means that \ no longer means something special, so you can type @"c:\fun.txt" instead of "c:\\fun.txt"
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
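For the "how many words depends on the keyword" part of the update, one option is to build the pattern at run time and put the word count into a quantifier. This is only a sketch under assumptions: KeywordReader and WordsAfter are hypothetical names, a "word" is taken to be a run of \w characters, and any punctuation between the keyword and the first word (the colon, spaces, asterisks) is skipped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordReader
{
    // Returns exactly wordCount whitespace-separated words that follow the keyword,
    // or null when the keyword is absent or fewer words follow it.
    public static string WordsAfter(string text, string keyword, int wordCount)
    {
        if (wordCount < 1) throw new ArgumentOutOfRangeException(nameof(wordCount));

        // \W* skips the optional ':' and any other non-word characters after the keyword.
        // The group then takes one word plus (wordCount - 1) further words.
        string pattern = Regex.Escape(keyword)
            + @"\W*(\w+(?:\s+\w+){" + (wordCount - 1) + "})";

        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The per-keyword word counts could be kept in a Dictionary<string, int> and passed to WordsAfter one keyword at a time.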
/KeywordB\: (\w+)/
This matches the word that comes right after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
