I want to remove specific tags from an HTML string. I am using HtmlAgilityPack, but that removes entire nodes, so I want to 'enhance' it to keep the inner HTML. It's all working, but I have serious performance issues. This made me change the String.Replace to a Regex.Replace, and it is already 4 times faster. The replacement needs to be case-insensitive. This is my current code:
var scrubHtmlTags = new[] {"strong","span","div","b","u","i","p","em","ul","ol","li","br"};
var stringToSearch = "LargeHtmlContent";
foreach (var stringToScrub in scrubHtmlTags)
{
    stringToSearch = Regex.Replace(stringToSearch, "<" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
    stringToSearch = Regex.Replace(stringToSearch, "</" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
}
There are still improvements to make, however:
It should be possible to get rid of <b> as well as </b> in one run, I assume...
Is it possible to do all string replacements in one run?
To do it in one run you can use this:
stringToSearch = Regex.Replace(stringToSearch, "<\\/?" + string.Format("(?:{0})", string.Join("|", scrubHtmlTags)) + ".*?>", "", RegexOptions.IgnoreCase);
But keep in mind that this may fail on several cases.
If I were your manager ... (koff, koff) ... I would reject your code and tell you, nay, require(!) you, to "listen to Thomas Ayoub," in his #1 post to the first entry on this thread. You are well on your way to creating completely-unmaintainable code here: code that was written because it seemed, to someone who wasn’t talking to anyone else, to have “solved” the immediate problem that s/he faced at the time.
Going back to your original task-description, you say that you “wish to remove specific tags from an HTML string.” You further state that you are already using HtmlAgility (good, good ...), but then you object(!) that it “removes entire nodes.”
“ ’scuse me, but ...” exactly what did you expect it to do? A “tag,” I surmise, is a (DOM) “node.”
So, faced with what you call “a performance problem,” instead of(!) questing for the inevitable bug(!!) in your previous code, you decided to throw caution to the four winds, and to thereby inflict it upon the project and the rest of the team.
And that, as an old-phart Manager, would be where I would step in.
I would exercise my “authority has its privileges” and instruct you ... order you ... to abandon your present approach and to go back to find-and-fix the bugs in your original approach. But, going one step further, I would order you first to “find” the bugs, then to present your proposed(!) solution to the Team and to me, before authorizing you (by Team consensus) to implement your proposed fix.
(And I would like to think that, after you spent a suitable amount of time “calling me an a**hole” (of course ...), you would come to understand why I responded in this way, and why I took the time to say as much on Stack-whatever.com.)
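For completeness, here is roughly what the HtmlAgilityPack route could look like (a sketch only, not the poster's code, assuming the HtmlAgilityPack package and the same scrubHtmlTags list as the question; RemoveChild with keepGrandChildren set to true is what preserves the inner HTML):

using HtmlAgilityPack;

static class TagScrubber
{
    public static string ScrubTags(string html, string[] tagsToScrub)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Build an XPath union such as "//strong|//span|//div" for the unwanted tags.
        var xpath = "//" + string.Join("|//", tagsToScrub);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return html; // nothing to scrub

        foreach (var node in nodes)
        {
            // keepGrandChildren: true unwraps the tag but keeps its content in place.
            node.ParentNode.RemoveChild(node, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}

Whether this is faster than the regex loop would have to be measured against the actual "LargeHtmlContent", but it avoids the correctness problems a regex has with attributes, self-closing tags and malformed markup.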
You might try this:
foreach (var stringToScrub in scrubHtmlTags)
{
    stringToSearch = Regex.Replace(
        stringToSearch,
        "</?" + stringToScrub + ">", "",
        RegexOptions.IgnoreCase);
}
But I would try to use one expression to remove them all.
I am trying to figure out a viable way to go about parsing this CSV file. Currently I am using FileHelpers, which is great, but it seems to be having issues with this CSV file.
Each record in the CSV file is contained in quotes and delimited by a comma.
The records have commas within them, and 1 record out of the 90,000 records I'm dealing with has a single stray " that mucks up the ReadLine.
The record looks like this: "24" Blah ",
So I'm looking to write a regex to insert into the BeforeReadRecord event that will go through and replace all instances of " with a space.
I'm fairly new to regex, but I'm not finding any way to exclude three cases.
Case one: each line starts with a "
Case two: each line ends with a "
Case three: each field is separated by ","
I am trying to figure out how I could exclude those three cases and be left replacing just any straggler ".
So far I've been failing miserably and am not even sure there is a way to accomplish this. Perhaps someone knows of a better CSV parser that handles this one odd case as well?
EDIT: Well, here's what I ended up with. It takes a little time to process (it also just changes any outlier " to ', which is fine, since the data that contains quotes is needed for queries). I'm looking for any pitfalls I may be falling into that would make it faster, but it seemed to be the quickest solution so far (it took about 7 seconds for 92,000 records), and there doesn't seem to be any way around checking every line. My previous solution was a nasty nested if that seemed to add 30 seconds or so over the course of processing the records. It accounts for all scenarios except where someone decides to put a random ", at the end of a field... hoping I don't run into a record like that, but it wouldn't surprise me.
// in its own method
engine.BeforeReadRecord += (sender, args) =>
    args.RecordLine = checkQuote(args.RecordLine);
var records = engine.ReadFile(reportFilePath);

private static string checkQuote(string checkString)
{
    if (checkString.Substring(0, 1) == @"""")
    {
        string removeQuote = @"""" + checkString.Replace(@"""", "'")
                                                .Replace(@"','", @""",""")
                                                .Remove(checkString.Length - 1, 1)
                                                .Remove(0, 1) + @"""";
        return removeQuote;
    }
    else
        return checkString;
}
File format readers typically don't handle malformed input well. Why should they? If you give a CSV reader bad data, I would expect it to barf. I've rarely had good luck with computer software that makes assumptions about what I meant.
Do you really need a regular expression? If you define a straggler as the last quote character when the total count is odd, then it's trivial to handle: just count the quotes and, if the number is odd, remove the last one.
For example:
var quoteCount = inputString.Count(c => c == '\"');
if ((quoteCount % 2) == 1)
{
    inputString = inputString.Remove(inputString.LastIndexOf('\"'));
}
Done and done.
You could also do it in a single pass with a loop, but that's probably overkill. I strongly suspect that sanitizing the input is not a major bottleneck in your program.
For more complex patterns (i.e. you're looking for "," or for a quote at the start and end), you just write a simple state machine. It's probably a dozen lines of code.
I realize that you might be able to do this with regular expressions. I find regex great for finding stuff and doing simple replacements. For more complicated rules like "replace quote with space unless the quote is at the beginning or end of line or next to a comma", I find it hard to come up with a good expression. For example, what about this case:
"first name","last name","","phone"
You have to take that blank field (i.e. "") into account. You also have to take into account spaces between fields (i.e. "first" , "last" , ""), and a whole host of other things. I'm reasonably sure that regex can do it. My experience has been that I can usually write the simple state machine and prove that it's correct faster than I can puzzle out the required regex. And it's certain that I'll more easily understand the state machine six months later.
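To make the "simple state machine" suggestion concrete, here is a minimal single-pass sketch. This is my own interpretation rather than code from the answer: it treats a quote as structural only when it starts or ends the line or sits next to a comma, and replaces anything else with a space.

using System.Text;

static class QuoteScrubber
{
    public static string ReplaceStrayQuotes(string line)
    {
        var sb = new StringBuilder(line.Length);
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c != '"')
            {
                sb.Append(c);
                continue;
            }

            // A quote is "legitimate" if it opens the line, closes the line,
            // or sits next to a field-separating comma.
            bool startsLine = i == 0;
            bool endsLine = i == line.Length - 1;
            bool beforeComma = i + 1 < line.Length && line[i + 1] == ',';
            bool afterComma = i > 0 && line[i - 1] == ',';

            if (startsLine || endsLine || beforeComma || afterComma)
                sb.Append(c);   // structural quote: keep it
            else
                sb.Append(' '); // stray quote: replace with a space
        }
        return sb.ToString();
    }
}

Hooked into the BeforeReadRecord event, this could play the same role as the checkQuote method in the question's edit, though it makes the same "no stray quote right next to a comma" assumption the asker mentions.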
In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, HTML-encoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now actually inserting real HTML breaks opens me up to a subtle vulnerability: the Enter key. The Regex.Replace code is there to deal with a malicious user just standing on the Enter key and filling the page with crap.
This is a fix for big crap floods of pure whitespace, but it still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks, all the way down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it, and that I will instead need heuristic techniques or Bayesian filters. Hopefully, someone has an easier, better way.
EDIT: Perhaps I wasn't clear in the problem description: the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from a crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
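A rough sketch of that idea follows. The 0.2 cut-off is a placeholder you would have to tune against real comments, as the suggestion says; everything else uses the standard DeflateStream API.

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class FloodCheck
{
    // Returns true when the input compresses suspiciously well, which is a
    // rough proxy for "mostly repeated characters and line breaks".
    public static bool LooksLikeCrapFlood(string input, double minRatio = 0.2)
    {
        if (string.IsNullOrEmpty(input))
            return false;

        byte[] raw = Encoding.UTF8.GetBytes(input);
        using (var compressed = new MemoryStream())
        {
            using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            double ratio = (double)compressed.Length / raw.Length;
            return ratio < minRatio;
        }
    }
}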
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline.
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string, copying them to a StringBuilder and filtering as you go.
Any character that passes a char.IsWhiteSpace() test is not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>s to be added until you have hit a non-whitespace character.)
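A minimal sketch of that loop, under my own interpretation of the filtering rule (collapsing each whitespace run to either a single <br/> or a single space is an assumption, not the answerer's exact wording):

using System.Text;

static class CommentFilter
{
    // Collapses runs of whitespace: a run containing a newline becomes one <br/>,
    // any other run becomes a single space. Non-whitespace is copied through.
    public static string CollapseWhitespace(string encodedInput)
    {
        var sb = new StringBuilder(encodedInput.Length);
        bool inWhitespaceRun = false;
        bool runHadNewline = false;

        foreach (char c in encodedInput)
        {
            if (char.IsWhiteSpace(c))
            {
                inWhitespaceRun = true;
                if (c == '\n' || c == '\r')
                    runHadNewline = true;
                continue;
            }

            if (inWhitespaceRun)
            {
                sb.Append(runHadNewline ? "<br/>" : " ");
                inWhitespaceRun = false;
                runHadNewline = false;
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}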
edit
If you want to stop the user entering any old crap, give up now. You will never find a filter that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc.). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
This is not the most efficient way of handling this, nor the smartest (disclaimer), but if your text is not too big it doesn't matter much, and short of any smarter algorithm (note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length):
You could just Split on whitespace characters (add any you can think of, short of \n), then Join with just one space, then split on \n (to get lines) and join the lines with <br />. While joining the lines you can test for line.Length > 2, for example, or something similar.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc.
Again, not the most efficient or perfect way of handling this, but it would give you something fast.
EDIT: To filter 'same lines' you could use e.g. DistinctUntilChanged from Ix, the Interactive Extensions (see the NuGet package Ix-experimental, I think), which filters out consecutive identical lines, and you could add a line-length test on top of that.
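For illustration, a sketch of that idea, assuming the System.Interactive (Ix) package, which provides the EnumerableEx extension methods:

using System;
using System.Linq; // EnumerableEx extensions are exposed here when System.Interactive is referenced

static class RepeatFilter
{
    public static string RemoveRepeatedLines(string input)
    {
        var lines = input.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        // DistinctUntilChanged drops consecutive duplicates ("a", "a", "a" -> "a"),
        // which collapses the "same character on every line" flood.
        var filtered = lines.DistinctUntilChanged();

        return string.Join("<br />", filtered);
    }
}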
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
    static void Main() {
        // Arbitrary cutoff used to join short strings.
        const int Cutoff = 6;

        string input =
            "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
            "unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
            "\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
        input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.

        StringBuilder temp = new StringBuilder();
        List<string> result = new List<string>();

        var items = input.Split(
            new[] { '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries)
            .Select(i => new { i.Length, Value = i });

        foreach (var item in items) {
            if (item.Length > Cutoff) {
                if (temp.Length > 0) {
                    result.Add(temp.ToString());
                    temp.Clear();
                }
                result.Add(item.Value);
                continue;
            }
            if (temp.Length > 0) { temp.Append(" "); }
            temp.Append(item.Value);
        }

        if (temp.Length > 0) {
            result.Add(temp.ToString());
        }

        Console.WriteLine(String.Join("<br />", result));
    }
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution, but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
    static void Main() {
        string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
                       "\r\nunsanatized\r\nbreaks\r\n\r\n";
        input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);

        string output = Regex.Replace(
            input,
            "\\\n+",
            "<br />",
            RegexOptions.Multiline);

        Console.WriteLine(output);
    }
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks
I have been doing a little work with regex over the past week and managed to make a lot of progress; however, I'm still a bit of a n00b. I have a regex written in C#:
string isMethodRegex =
    @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
    @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\" +
    @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
var IsMethodRegex = new Regex(isMethodRegex);
For some reason, when calling the regular expression IsMethodRegex.IsMatch() it hangs for 30+ seconds on the following string:
"\t * Returns collection of active STOP transactions (transaction type 30) "
Does anyone know how the internals of Regex work and why this would be so slow on matching this string and not others? I have had a play with it and found that if I take out the * and the parentheses then it runs fine. Perhaps the regular expression is poorly written?
Any help would be so greatly appreciated.
EDIT: I think the performance issue comes from the way the <parameters> matching group is written. I have rearranged it to match a first parameter, then any number of successive parameters, or optionally none at all. I have also changed the \s* between the parameter type and name to \s+ (I think the \s* was responsible for a LOT of backtracking, because it allows zero spaces, so that object could match as obj and ect with \s* matching no spaces), and it seems to run a lot faster:
string isMethodRegex =
    @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
    @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s*(?<method>[a-zA-Z\<\>_1-9]+)\s*\" +
    @"((?<parameters>((\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)" +
    @"(\s*,\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)*\s*))?\)";
EDIT: As duly pointed out by @Dan, the following is simply because the Regex can exit early.
This is indeed a really bizarre situation, but if I remove the two optional matching at the beginning (for public/private/internal/protected and static/virtual/abstract) then it starts to run almost instantaneously again:
string isMethodRegex =
    @"\b(public|private|internal|protected)\s*(static|virtual|abstract)" +
    @"(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\" +
    @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
var IsMethodRegex = new Regex(isMethodRegex);
string s = "\t * Returns collection of active STOP transactions (transaction type 30) ";
Console.WriteLine(IsMethodRegex.IsMatch(s));
Technically you could split this into four separate regexes, one for each possibility, to deal with this particular situation. However, as you attempt to deal with more and more complicated scenarios, you will likely run into this performance issue again and again, so this is probably not the ideal approach.
I changed some 0-or-more (*) matches to 1-or-more (+), where I think it makes sense for your regex (it's more suited to Java and C# than to VB.NET):
string isMethodRegex =
    @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
    @"\s*(?<returnType>[a-zA-Z\<\>_1-9]+)\s+(?<method>[a-zA-Z\<\>_1-9]+)\s+\" +
    @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]+\s+[a-zA-Z_1-9]+\s*)[,]?\s*)+)\)";
It's fast now.
Please check if it still returns the result you expect.
For some background on bad regexes, look here.
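As a side note not raised in the original answers: if you must run a pattern that might backtrack badly, .NET (4.5 and later) lets you put a hard timeout on matching, so a pathological input fails fast instead of hanging. A hedged sketch, using the slow pattern and input from the question:

using System;
using System.Text.RegularExpressions;

class TimeoutExample
{
    static void Main()
    {
        // The pattern and input are the ones from the question; the 250 ms timeout is arbitrary.
        string pattern = @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?" +
                         @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\" +
                         @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
        string input = "\t * Returns collection of active STOP transactions (transaction type 30) ";

        var isMethodRegex = new Regex(pattern, RegexOptions.None, TimeSpan.FromMilliseconds(250));
        try
        {
            Console.WriteLine(isMethodRegex.IsMatch(input));
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "no match" rather than letting the thread hang.
            Console.WriteLine("Match timed out; treating as no match.");
        }
    }
}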
Have you tried compiling your Regex?
string pattern = #"\b[at]\w+";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
string text = "The threaded application ate up the thread pool as it executed.";
MatchCollection matches;
Regex optionRegex = new Regex(pattern, options);
Console.WriteLine("Parsing '{0}' with options {1}:", text, options.ToString());
// Get matches of pattern in text
matches = optionRegex.Matches(text);
// Iterate matches
for (int ctr = 1; ctr <= matches.Count; ctr++)
    Console.WriteLine("{0}. {1}", ctr, matches[ctr - 1].Value);
Then the Regular Expression is only slow on the first execution.
I am making a simple console application for a home project. Basically, it monitors a folder for any files being added.
FileSystemWatcher fsw = new FileSystemWatcher(#"c:\temp");
fsw.Created += new FileSystemEventHandler(fsw_Created);
bool monitor = true;
while (monitor)
{
    fsw.WaitForChanged(WatcherChangeTypes.Created, 1000);
    if (Console.KeyAvailable)
    {
        monitor = false;
    }
}
Show("User has quit the process...", ConsoleColor.Yellow);
When a new file arrives, 'WaitForChanged' returns, and I can then start the work.
What I need to do is check the filename for patterns. In real life, I am putting video files into this folder. Based on the filename, I will have rules, which move the files into specific directories. So for now, I'll have a list of KeyValue pairs... holding a RegEx (I think?), and a folder. So, if the filename matches a regex, it moves it into the related folder.
An example of a filename is:
CSI- NY.S07E01.The 34th Floor.avi
So, my Regex needs to look at it, and see if the words CSI "AND" (NY "OR" NewYork "OR" New York) exist. If they do, I will then move them to a \Series\CSI\NY\ folder.
I need the AND, because another file example for a different series is:
CSI- Crime Scene Investigation.S11E16.Turn On, Tune In, Drop Dead
So, for this one, I would need to have some NOTs. So, I need to check if the filename has CSI, but NOT ("New York" or "NY" or "NewYork")
Could someone assist me with these RegExs? Or maybe, there's a better method?
You can try to store the conditions in a Func<string, bool>:
Dictionary<Func<string, bool>, string> dic = new Dictionary<Func<string, bool>, string>();
Func<string, bool> f1 = x => x.Contains("CSI") && (x.Contains("NY") || x.Contains("New York"));
dic.Add(f1, "C://CSI/");

foreach (var pair in dic)
{
    if (pair.Key.Invoke("CSI- NY.S07E01.The 34th Floor.avi"))
    {
        // copy
        return;
    }
}
I think you have the right idea. The nice thing about this approach is that you can add/remove/edit the regular expressions in a config file or some other store, which means you don't have to recompile the project every time you want to keep track of a new show.
A regular expression for CSI AND NY would look something like this.
First, if you want to check whether CSI exists in the filename, the regex is simply "CSI". Keep in mind it's case-sensitive by default.
If you want to check whether NY, New York or NewYork exists in the filename, the regex is "((NY)|(New York)|(NewYork))". The bars indicate OR and the parentheses are used to designate groups.
In order to combine the two you could run both regexes, and in some cases (where perhaps order is unimportant) this might be easier. However, if you always expect the show type to come after the show name, the syntax would be "(CSI).*((NY)|(New York)|(NewYork))". The period means "any character" and the asterisk means zero or more.
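For illustration, here is how those patterns might be used from C#, together with a NOT variant for the second filename in the question. The negative lookahead and the IgnoreCase option are my additions, not part of the answer above, and the folder comments are only placeholders:

using System.Text.RegularExpressions;

static class FilenameRules
{
    public static void Classify(string fileName)
    {
        // CSI AND one of the NY variants, with CSI expected to come first.
        bool isCsiNewYork = Regex.IsMatch(
            fileName, @"(CSI).*((NY)|(New York)|(NewYork))", RegexOptions.IgnoreCase);

        // CSI but NOT any NY variant, using a negative lookahead over the whole name.
        bool isCsiOriginal = Regex.IsMatch(
            fileName, @"^(?!.*((NY)|(New York)|(NewYork))).*CSI", RegexOptions.IgnoreCase);

        if (isCsiNewYork)
        {
            // move the file to \Series\CSI\NY\ (the folder from the question)
        }
        else if (isCsiOriginal)
        {
            // move the file to whatever folder holds the original CSI (hypothetical)
        }
    }
}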
This does not look like a one-regex job, even if you succeed in tossing the whole thing into one. Regexes which match "anything without a given word" are a pain. I'd rather stick with two regexes for each rule: one which should match, and another which should NOT match, for the rule to be triggered. If you need your "CSI" and "NY" but don't want to fix any particular order within the filename, you may as well switch from a pair of regexes to a pair of lists of regexes. In general it's better to put this logic into code and configuration and keep the regexes as simple as possible. And yes, you're quite likely to get away with a simple substring search; there is no real need for regexes as long as you keep your code smart enough.
Well, people have already given you some advice about doing this using:
Regular expressions
Func and storing exactly the C# code that will be executed against the file
so I'll just give you a different one.
I disagree with using Regular Expressions for this purpose. I agree with @Anton S. Kraievoy: I don't like regexes to match anything without a given word. It is easier to check: !text.Contains(word)
The second option looks perfect if you are looking for a fast solution, but...
If that is a more complex application, and you want to design it correctly, I think you should:
Define how you will store those patterns (in a class with members, or in a string, etc.). A string example could be:
"CSI" & ("NY" || "Las Vegas")
Then write a module that will match a filename with that pattern.
You're creating kind of a DSL.
Why is it better than just pasting in the C# code directly?
Well, because:
You can easily change the semantics of your patterns
You can generate the validation code in any language you want, because you're storing patterns in a generic way.
The thing is how to match a pattern against a filename.
You have some options:
Write the grammar of your pattern and write a parser by yourself
Generate the code (I'm not 100% sure if it is possible; that depends on the grammar): write a regex that will convert your grammar into C# code.
Like: "A" & "B" to string.Contains("A") && string.Contains("B") or something like that.
Use a tool to do that, like ANTLR.
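To give a flavour of the "write a parser yourself" option, here is a deliberately small sketch of such a DSL. It is entirely my own illustration, not part of the answer: it accepts quoted terms combined with &, || and parentheses, like the "CSI" & ("NY" || "Las Vegas") example above, and compiles a pattern into a predicate.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Compiles patterns like: "CSI" & ("NY" || "Las Vegas") into a predicate over a
// filename. '&' binds tighter than '||'; terms are case-insensitive substrings.
static class PatternDsl
{
    public static Func<string, bool> Compile(string pattern)
    {
        var tokens = Regex.Matches(pattern, @"""(?<term>[^""]*)""|\|\||&|\(|\)")
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToList();
        int pos = 0;
        var predicate = ParseOr(tokens, ref pos);
        if (pos != tokens.Count)
            throw new FormatException("Unexpected token: " + tokens[pos]);
        return predicate;
    }

    static Func<string, bool> ParseOr(List<string> tokens, ref int pos)
    {
        var left = ParseAnd(tokens, ref pos);
        while (pos < tokens.Count && tokens[pos] == "||")
        {
            pos++;
            var l = left;
            var r = ParseAnd(tokens, ref pos);
            left = s => l(s) || r(s);
        }
        return left;
    }

    static Func<string, bool> ParseAnd(List<string> tokens, ref int pos)
    {
        var left = ParsePrimary(tokens, ref pos);
        while (pos < tokens.Count && tokens[pos] == "&")
        {
            pos++;
            var l = left;
            var r = ParsePrimary(tokens, ref pos);
            left = s => l(s) && r(s);
        }
        return left;
    }

    static Func<string, bool> ParsePrimary(List<string> tokens, ref int pos)
    {
        if (tokens[pos] == "(")
        {
            pos++;
            var inner = ParseOr(tokens, ref pos);
            if (tokens[pos++] != ")")
                throw new FormatException("Expected ')'");
            return inner;
        }
        string term = tokens[pos++].Trim('"');
        return s => s.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

Compile returns a Func<string, bool> you can store next to a destination folder, much like the Func dictionary suggested earlier; error handling and escaping are deliberately minimal here.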
I want to parse a config file sorta thing, like so:
[KEY:Value]
[SUBKEY:SubValue]
Now I started with a StreamReader, converting lines into character arrays, when I figured there's gotta be a better way. So I ask you, humble reader, to help me.
One restriction is that it has to work in a Linux/Mono environment (1.2.6 to be exact). I don't have the latest 2.0 release (of Mono), so try to restrict language features to C# 2.0 or C# 1.0.
I considered it, but I'm not going to use XML. I am going to be writing this stuff by hand, and hand editing XML makes my brain hurt. :')
Have you looked at YAML?
You get the benefits of XML without all the pain and suffering. It's used extensively in the Ruby community for things like config files, pre-prepared database data, etc.
Here's an example:
customer:
  name: Orion
  age: 26
  addresses:
    - type: Work
      number: 12
      street: Bob Street
    - type: Home
      number: 15
      street: Secret Road
There appears to be a C# library here, which I haven't used personally, but YAML is pretty simple, so "how hard can it be?" :-)
I'd say it's preferable to inventing your own ad-hoc format (and dealing with parser bugs).
I was looking at almost this exact problem the other day: this article on string tokenizing is exactly what you need. You'll want to define your tokens as something like:
#"(?<level>\s) | " +
#"(?<term>[^:\s]) | " +
#"(?<separator>:)"
The article does a pretty good job of explaining it. From there you just start eating up tokens as you see fit.
Protip: For an LL(1) parser (read: easy), tokens cannot share a prefix. If you have abc as a token, you cannot have ace as a token
Note: The article's missing the | characters in its examples, just throw them in.
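A rough sketch of "eating up tokens" with named groups (my own illustration of the article's approach, not code from it; the bracket tokens and the + on the term group are additions so that the sample config from the question actually tokenizes):

using System;
using System.Text.RegularExpressions;

class Tokenizer
{
    static void Main()
    {
        string input = "[KEY:Value]\n\t[SUBKEY:SubValue]";

        var tokenRegex = new Regex(
            @"(?<level>\s+)|" +
            @"(?<open>\[)|" +
            @"(?<close>\])|" +
            @"(?<separator>:)|" +
            @"(?<term>[^:\s\[\]]+)");

        foreach (Match m in tokenRegex.Matches(input))
        {
            if (m.Groups["term"].Success)
                Console.WriteLine("term:      " + m.Groups["term"].Value);
            else if (m.Groups["separator"].Success)
                Console.WriteLine("separator: " + m.Value);
            else if (m.Groups["open"].Success || m.Groups["close"].Success)
                Console.WriteLine("bracket:   " + m.Value);
            else if (m.Groups["level"].Success)
                Console.WriteLine("level:     (whitespace, length " + m.Value.Length + ")");
        }
    }
}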
There is another YAML library for .NET which is under development. Right now it supports reading YAML streams and has been tested on Windows and Mono. Write support is currently being implemented.
Using a library is almost always preferable to rolling your own. Here's a quick list of "Oh, I'll never need that/I didn't think about that" points which will end up coming back to bite you later down the line:
Escaping characters. What if you want a : in the key or ] in the value?
Escaping the escape character.
Unicode
Mix of tabs and spaces (see the problems with Python's whitespace-sensitive syntax)
Handling different return character formats
Handling syntax error reporting
Like others have suggested, YAML looks like your best bet.
You can also use a stack, and use a push/pop algorithm. This one matches open/closing tags.
public string check()
{
    ArrayList tags = getTags();
    int stackSize = tags.Count;
    Stack stack = new Stack(stackSize);
    foreach (string tag in tags)
    {
        if (!tag.Contains("/"))
        {
            stack.Push(tag);
        }
        else
        {
            if (stack.Count > 0)
            {
                string startTag = (string)stack.Pop();
                startTag = startTag.Substring(1, startTag.Length - 1);
                string endTag = tag.Substring(2, tag.Length - 2);
                if (!startTag.Equals(endTag))
                {
                    return "Error: no matching end tag";
                }
            }
            else
            {
                return "Error: no matching opening tag";
            }
        }
    }
    if (stack.Count > 0)
    {
        return "Error: no matching end tag";
    }
    return "Xml is valid";
}
You can probably adapt so you can read the contents of your file. Regular expressions are also a good idea.
It looks to me like you would be better off using an XML-based config file, as there are already .NET classes which can read and store the information for you relatively easily. Is there a reason that this is not possible?
@Bernard: It is true that hand editing XML is tedious, but the structure that you are presenting already looks very similar to XML.
Then yes, has a good method there.
@Gishu
Actually, once I'd accommodated escaped characters, my regex ran slightly slower than my hand-written top-down recursive parser, and that's without the nesting (linking sub-items to their parents) and error reporting the hand-written parser had.
The regex was slightly faster to write (though I do have a bit of experience with hand parsers), but that's without good error reporting. Once you add that, it becomes slightly harder and longer to do.
I also find the hand-written parser easier to understand the intention of. For instance, here is a snippet of the code:
private static Node ParseNode(TextReader reader)
{
    Node node = new Node();
    int indentation = ParseWhitespace(reader);
    Expect(reader, '[');
    node.Key = ParseTerminatedString(reader, ':');
    node.Value = ParseTerminatedString(reader, ']');
    return node;
}
Regardless of the persisted format, using a Regex would be the fastest way of parsing.
In Ruby it'd probably be a few lines of code.
\[KEY:(.*)\]
\[SUBKEY:(.*)\]
These two would get you the Value and SubValue in the first group. Check out MSDN on how to match a regex against a string.
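For completeness, a small sketch of matching those two patterns in C# (the sample strings and class name are just illustrative):

using System;
using System.Text.RegularExpressions;

class ConfigParser
{
    static void Main()
    {
        string line = "[KEY:Value]";
        string subLine = "    [SUBKEY:SubValue]";

        Match key = Regex.Match(line, @"\[KEY:(.*)\]");
        Match subKey = Regex.Match(subLine, @"\[SUBKEY:(.*)\]");

        if (key.Success)
            Console.WriteLine("Value = " + key.Groups[1].Value);       // prints "Value"
        if (subKey.Success)
            Console.WriteLine("SubValue = " + subKey.Groups[1].Value); // prints "SubValue"
    }
}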
This is something everyone should have in their kitty. Pre-Regex days would seem like the Ice Age.