I'm trying to create a fairly simple parser using Irony, but am coming to the conclusion that Irony may not be suitable in this particular case.
These is an example of what I'm trying to parse:
server_name example.com *.example.com www.example.*;
server_name www.example.com ~^www\d+\.example\.com$;
server_name ~^(?<subdomain>.+?)\.(?<domain>.+)$;
I'm using FreeTextLiterals with either a space or semi-colon as a terminator
var serverNamevalue = new FreeTextLiteral("serverNameValue", FreeTextOptions.None, " ", ";");
I'm then using the MakePlusRule to pick up one or more server_name values:
httpCoreServerName.Rule = "server_name" + httpCoreServerNameItems + semicolon;
httpCoreServerNameItems.Rule = MakePlusRule(httpCoreServerNameItems, serverNamevalue);
However - I think there's a problem with having whitespace as a terminator for the FreeTextLiteral in this case. When I run this, I get a parser error. If I substitute the whitespace for another specific character to act as terminator (and also add this a delimiter in the call to MakePlusRule) - it works fine.
Does anyone have any ideas as to how I could deal with this in Irony?
I posted this question over at the Irony project on Codeplex where Roman Ivantsov - the developer of Irony - confirmed there was an issue with the parser when using semi-colons with FreeTextLiterals.
Roman has helpfully fixed / patched this issue. I've dowloaded the latest source and can confirm it's fixed the issue.
Related
Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This has worked pretty smoothly until I have reached my corner case of companies with names that end in s. I end up with issue s where sometimes it creates a s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not. It seems that Word is using an different character for and apostrophe(') than the standard one that is created when I use the key on my keyboard in Visual Studio. If I write a find and replace using my keyboard it will not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if the is a proper way of doing this. I tried "'" but that does not work. I just want to know if using the copied character from Word is the proper way of doing this, and is there a way to do it so that both characters work so I don't have an issue with docs that may be created with a different program.
The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
I ran in the very same issue before and I used this as regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to:
Regex apostropheReplace = new Regex("s\\[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s\'")
I found these were the five most common type of single quotes and apostrophes used.
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s\\['’]s");
docText = apostropheReplace.Replace(docText, "s\'")
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as #Leonardo-Seccia points out, there are other character encodings that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
new Regex("s\\['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s\'")
But you can certainly add other character encodings as needed. A more complete list of character encodings can be found here - to add them to the above Regex, simply change the "U+" to "u" and add it to the list after another "\" character. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the RegEx string from
Regex("s\\['\u2018\u2019\u201A\u201b]s")
to
Regex("s\\['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of what character encodings are the most "proper" for inclusion in your Regex based on your use cases.
I have a text file where I wish to search if a set of lines exists and update/overwrite them or if the set of lines does not exists, add them.
Here is the text file.
#
# Virtual Hosts
#
# If you want to maintain multiple domains/hostnames on your
# machine you can setup VirtualHost containers for them. Most configurations
# use only name-based virtual hosts so the server doesn't need to worry about
# IP addresses. This is indicated by the asterisks in the directives below.
#
# Please see the documentation at
# <URL:http://httpd.apache.org/docs/2.2/vhosts/>
# for further details before you try to setup virtual hosts.
#
# You may use the command line option '-S' to verify your virtual host
# configuration.
#
# Use name-based virtual hosting.
#
##NameVirtualHost *:80
#
# VirtualHost example:
# Almost any Apache directive may go into a VirtualHost container.
# The first VirtualHost section is used for all requests that do not
# match a ServerName or ServerAlias in any <VirtualHost> block.
#
##<VirtualHost *:80>
##ServerAdmin postmaster#dummy-host.localhost
##DocumentRoot "C:/xampp/htdocs/dummy-host.localhost"
##ServerName dummy-host.localhost
##ServerAlias www.dummy-host.localhost
##ErrorLog "logs/dummy-host.localhost-error.log"
##CustomLog "logs/dummy-host.localhost-access.log" combined
##</VirtualHost>
##<VirtualHost *:80>
##ServerAdmin postmaster#dummy-host2.localhost
##DocumentRoot "C:/xampp/htdocs/dummy-host2.localhost"
##ServerName dummy-host2.localhost
##ServerAlias www.dummy-host2.localhost
##ErrorLog "logs/dummy-host2.localhost-error.log"
##CustomLog "logs/dummy-host2.localhost-access.log" combined
##</VirtualHost>
<VirtualHost *:80>
ServerAdmin postmaster#dummy-host2.localhost
DocumentRoot "C:/xampp/htdocs/dummy-host2.localhost"
ServerName dummy-host2.localhost
ServerAlias www.dummy-host2.localhost
#ErrorLog "logs/dummy-host2.localhost-error.log"
CustomLog "logs/dummy-host2.localhost-access.log" combined
</VirtualHost>
Here is the skeleton code. I first open the text file, and read it to the end so that I can use the Regex class. (I chose this because the code looks cleaner and concise rather than doing it the C way - looping). But it isn't that simple because I need to check first a set of lines.
StreamReader reader = new StreamReader(path);
string content = reader.ReadToEnd();
reader.Close();
// Replace the strings using Regex replace method
StreamWriter writer = new StreamWriter(path);
writer.Write(content);
writer.Close();
Given a port number, I appended it to this pattern
string virtualHost = "<VirtualHost *:" + cls_globalvariables.portNumber + ">";
And I used
Match match = Regex.Match(content, virtualHost);
to find the index of the search pattern. I also had to find the index of its closing tag and replace them with an updated version of those lines. I have no problems of searching the ending line but I do have a problem of distinguishing the commented from uncommented lines.
Regex.Match returns the first occurrence of the search pattern which is the commented line. What I wanted to do was to search patterns without comments but how do I do that? I began thinking in C schemes such as looping character by character backwards and forwards starting from the match.Index until I detect a delimiter of "\r\n". Is there an efficient C# way to solve this?
You can either do it the way Jesse suggests, or if you want to stick to Regex, just put ^ at the start of your regex and set RegexOptions.Multiline.
Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string. For more information, see the "Multiline Mode" section in the Regular Expression Options topic.
I've got a .NET 3.5 web application written in C# doing some URL rewriting that includes a file path, and I'm running into a problem. When I call string.Split('/') it matches both '/' and '\' characters. Is that... supposed to happen? I assumed that it would notice that the ASCII values were different and skip it, but it appears that I'm wrong.
// url = 'someserver.com/user/token/files\subdir\file.jpg
string[] buffer = url.Split('/');
The above code gives a string[] with 6 elements in it... which seems counter intuitive. Is there a way to force Split() to match ONLY the forward slash? Right now I'm lucky, since the offending slashes are at the end of the URL, I can just concatenate the rest of the elements in the string[], but it's a lot of work for what we're doing, and not a great solution to the underlying problem.
Anyone run into this before? Have a simple answer? I appreciate it!
More Code:
url = HttpContext.Current.Request.Path.Replace("http://", "");
string[] buffer = url.Split('/');
Turns out, Request.Path and Request.RawUrl are both changing my slashes, which is ridiculous. So, time to research that a bit more and figure out how to get the URL from a function that doesn't break my formatting. Thanks everyone for playing along with my insanity, sorry it was a misleading question!
When I try the following:
string url = #"someserver.com/user/token/files\subdir\file.jpg";
string[] buffer = url.Split('/');
Console.WriteLine(buffer.Length);
... I get 4. Post more code.
Something else is happening, paste more code.
string str = "a\\b/c\\d";
string[] ts = str.Split('/');
foreach (string t in ts)
{
Console.WriteLine(t);
}
outputs
a\b
c\d
just like it should.
My guess is that you are converting / into \ somewhere.
You could use regex to convert all \ slashes to a temp char, split on /, then regex the temp chars back to \. Pain in the butt, but one option.
I suspect (without seeing your whole application) that the problem lies in the semantics of path delimiters in URLs. It sounds like you are trying to attach a semantic value to backslashes within your application that is contrary to the way HTTP protocols define and use backslashes.
This is just a guess, of course.
The best way to solve this problem might be modifying the application to encode the path in some other way (such as "%5C" for backslashes, maybe?).
those two functions are probably converting \ to / because \ is not a valid character in a URL (see Which characters make a URL invalid?). The browser (NOT C#, as you are inferring) is assuming that when you are using that invalid character, you mean /, so it is "fixing" it for you. If you want \ in your URL, you need to encode it first.
The browsers themselves are actually the ones that make that change in the request, even if it is behind the scenes. To verify this, just turn on fiddler and look at the URLs that are actually getting sent when you go to a URL like this. IE and Chrome actually change the \ to / in the URL field on the browser itself, FireFox doesn't, but the request goes through that way anyways.
Update:
How about this:
Regex.Split(url, "/");
Is there any classes to convert ascii to xml characterset preferably opensource i will be using this class either in vc++ or C#
My ascii has some printable characters which is not there in xml character set
i just tried to sen a resume which is in ascii character set and i tried to store it in a online crm and i got this error message
javax.xml.bind.UnmarshalException
- with linked exception:
[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,22]
Message: Character reference "" is an invalid XML character.]
Thanks in advance
I had the same problem with Excel using the OpenXML document creation in C#.
My Excel Export feature would blow-up when building a doc with a bad ASCII character.
Somehow the string data, in my company's database, has funky characters in it.
Even though I used the Microsoft DocumentFormat.OpenXML assembly from their OpenXML SDK 2.0, it still didn't take care of this when assigning string values using their objects.
The Fix:
t.Text = Regex.Replace(sValue, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]", "?");
This cleans up the sValue string by removing the offending characters and replacing them with a question mark. You could replace with any string or just use an empty string.
The XML Spec Allows 0x09 (TAB), 0x0A (LF - Line Feed or NL - New Line), and 0x0D (CR - Carriage Return). The RegEx above takes care not remove those.
The XML 1.1 Spec allows you to escape some of these characters.
For example: Using for 0x03 appears as in HTML and as L in Office documents and notepad.
I use Asp.net and this is automatically taken care of in my GridView, so I do not need to replace these values - but I believe it may be the browser that takes care of it for all I know.
I thought of escaping these values in OpenXML, but when I looked at the output, it showed the excape markup. So MikeTeeVee still shows up as MikeTeeVee in Excel instead of something like MikeTeeVee, or MikeLTeeVee. This is why I preferred the Mike?TeeVee approach.
My hunch is this is a bug in the current OpenXML which encodes the allowed XML ASCII characters, but allows the unsupported ASCII characters to slip on through.
UPDATE:
I forgot I could look up how these characters are displayed using the "Open XML SDK 2.0 Productivity Tool" to see inside docs like Excel.
There I found it uses the format: _x0000_
Remember: XML 1.0 does not support escaping these values, but XML 1.1 does, so if you're using 1.1, then you can use this code to escape them.
Regular XML 1.1 Escaping:
t.Text = Regex.Replace(s, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return (byte)(m.Value[0]) == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("&#x" + string.Format("{0:00}", (byte)(m.Value[0])) + ";");
});
If you're escaping strings for OpenXML, then use this instead:
t.Text = Regex.Replace(s, #"[\x00-\x08]|[\x0B\x0C]|[\x0E-\x19]|[\uD800-\uDFFF]|[\uFFFE\uFFFF]",
delegate(Match m)
{
return (byte)(m.Value[0]) == 0 //0x00 is not Supported in 1.0 or 1.1
? ""
: ("_x" + string.Format("{0:0000}", (byte)(m.Value[0])) + "_");
});
Your text won't have any printable characters which aren't available in XML - but it may have some unprintable characters which aren't available in XML.
In particular, Unicode values U+0000 to U+001F are invalid except for tab. carriage return and line feed. If you really need those other control characters, you'll have to create your own form of escaping for them, and unescape them at the other end.
The character reference  is indeed not a valid XML character. You probably want either 
 or 
.
Out of curiousity, I took a few minutes to write a simple routinein C# to pump out a XML string of the 128 ASCII characters, to my surprise, .NET didn't output a really valid XML document. I guess the way I output the element text wasn't quite right. Anyway here is the code (comments are welcomed):
XmlDocument doc = new XmlDocument();
doc.AppendChild(doc.CreateXmlDeclaration("1.0", "us-ascii", ""));
XmlElement elem = doc.CreateElement("ASCII");
doc.AppendChild(elem);
byte[] b = new byte[1];
for (int i = 0; i < 128; i++)
{
b[0] = Convert.ToByte(i);
XmlElement e = doc.CreateElement("ASCII_" + i.ToString().PadLeft(3,'0'));
e.InnerText = System.Text.ASCIIEncoding.ASCII.GetString(b);
elem.AppendChild(e);
}
Console.WriteLine(doc.OuterXml);
Here is the formatted output:
<?xml version="1.0" encoding="us-ascii" ?>
<ASCII>
<ASCII_000></ASCII_000>
<ASCII_001></ASCII_001>
<ASCII_002></ASCII_002>
<ASCII_003></ASCII_003>
<ASCII_004></ASCII_004>
<ASCII_005></ASCII_005>
<ASCII_006></ASCII_006>
<ASCII_007></ASCII_007>
<ASCII_008></ASCII_008>
<ASCII_009> </ASCII_009>
<ASCII_010>
</ASCII_010>
<ASCII_011></ASCII_011>
<ASCII_012></ASCII_012>
<ASCII_013>
</ASCII_013>
<ASCII_014></ASCII_014>
<ASCII_015></ASCII_015>
<ASCII_016></ASCII_016>
<ASCII_017></ASCII_017>
<ASCII_018></ASCII_018>
<ASCII_019></ASCII_019>
<ASCII_020></ASCII_020>
<ASCII_021></ASCII_021>
<ASCII_022></ASCII_022>
<ASCII_023></ASCII_023>
<ASCII_024></ASCII_024>
<ASCII_025></ASCII_025>
<ASCII_026></ASCII_026>
<ASCII_027></ASCII_027>
<ASCII_028></ASCII_028>
<ASCII_029></ASCII_029>
<ASCII_030></ASCII_030>
<ASCII_031></ASCII_031>
<ASCII_032> </ASCII_032>
<ASCII_033>!</ASCII_033>
<ASCII_034>"</ASCII_034>
<ASCII_035>#</ASCII_035>
<ASCII_036>$</ASCII_036>
<ASCII_037>%</ASCII_037>
<ASCII_038>&</ASCII_038>
<ASCII_039>'</ASCII_039>
<ASCII_040>(</ASCII_040>
<ASCII_041>)</ASCII_041>
<ASCII_042>*</ASCII_042>
<ASCII_043>+</ASCII_043>
<ASCII_044>,</ASCII_044>
<ASCII_045>-</ASCII_045>
<ASCII_046>.</ASCII_046>
<ASCII_047>/</ASCII_047>
<ASCII_048>0</ASCII_048>
<ASCII_049>1</ASCII_049>
<ASCII_050>2</ASCII_050>
<ASCII_051>3</ASCII_051>
<ASCII_052>4</ASCII_052>
<ASCII_053>5</ASCII_053>
<ASCII_054>6</ASCII_054>
<ASCII_055>7</ASCII_055>
<ASCII_056>8</ASCII_056>
<ASCII_057>9</ASCII_057>
<ASCII_058>:</ASCII_058>
<ASCII_059>;</ASCII_059>
<ASCII_060><</ASCII_060>
<ASCII_061>=</ASCII_061>
<ASCII_062>></ASCII_062>
<ASCII_063>?</ASCII_063>
<ASCII_064>#</ASCII_064>
<ASCII_065>A</ASCII_065>
<ASCII_066>B</ASCII_066>
<ASCII_067>C</ASCII_067>
<ASCII_068>D</ASCII_068>
<ASCII_069>E</ASCII_069>
<ASCII_070>F</ASCII_070>
<ASCII_071>G</ASCII_071>
<ASCII_072>H</ASCII_072>
<ASCII_073>I</ASCII_073>
<ASCII_074>J</ASCII_074>
<ASCII_075>K</ASCII_075>
<ASCII_076>L</ASCII_076>
<ASCII_077>M</ASCII_077>
<ASCII_078>N</ASCII_078>
<ASCII_079>O</ASCII_079>
<ASCII_080>P</ASCII_080>
<ASCII_081>Q</ASCII_081>
<ASCII_082>R</ASCII_082>
<ASCII_083>S</ASCII_083>
<ASCII_084>T</ASCII_084>
<ASCII_085>U</ASCII_085>
<ASCII_086>V</ASCII_086>
<ASCII_087>W</ASCII_087>
<ASCII_088>X</ASCII_088>
<ASCII_089>Y</ASCII_089>
<ASCII_090>Z</ASCII_090>
<ASCII_091>[</ASCII_091>
<ASCII_092>\</ASCII_092>
<ASCII_093>]</ASCII_093>
<ASCII_094>^</ASCII_094>
<ASCII_095>_</ASCII_095>
<ASCII_096>`</ASCII_096>
<ASCII_097>a</ASCII_097>
<ASCII_098>b</ASCII_098>
<ASCII_099>c</ASCII_099>
<ASCII_100>d</ASCII_100>
<ASCII_101>e</ASCII_101>
<ASCII_102>f</ASCII_102>
<ASCII_103>g</ASCII_103>
<ASCII_104>h</ASCII_104>
<ASCII_105>i</ASCII_105>
<ASCII_106>j</ASCII_106>
<ASCII_107>k</ASCII_107>
<ASCII_108>l</ASCII_108>
<ASCII_109>m</ASCII_109>
<ASCII_110>n</ASCII_110>
<ASCII_111>o</ASCII_111>
<ASCII_112>p</ASCII_112>
<ASCII_113>q</ASCII_113>
<ASCII_114>r</ASCII_114>
<ASCII_115>s</ASCII_115>
<ASCII_116>t</ASCII_116>
<ASCII_117>u</ASCII_117>
<ASCII_118>v</ASCII_118>
<ASCII_119>w</ASCII_119>
<ASCII_120>x</ASCII_120>
<ASCII_121>y</ASCII_121>
<ASCII_122>z</ASCII_122>
<ASCII_123>{</ASCII_123>
<ASCII_124>|</ASCII_124>
<ASCII_125>}</ASCII_125>
<ASCII_126>~</ASCII_126>
<ASCII_127></ASCII_127>
</ASCII>
Update:
Added XML decalration with "us-ascii" encoding
Possibly you don't fully understand what a character set is. XML is not a character set, though XML based output does use character sets to encode data.
I'd recommend reading through Joel Spolsky's excellent post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), then come back and have another go at your question.
You won't need an additional library to do that. From different encodings to embedded binary data, all of that is possible through the common .net library. Can you just give a simple example?
I'm using an HTML sanitizing whitelist code found here:
http://refactormycode.com/codes/333-sanitize-html
I needed to add the "font" tag as an additional tag to match, so I tried adding this condition after the <img tag check
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, #"<font
(\s*size=""\d{1}"")?
(\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Aside from the condition above, my code is almost identical to the code in the page I linked to. When I try to test this in C#, it throws an exception saying "Not enough )'s". I've counted the parenthesis several times and I've run the expression through a few online Javascript-based regex testers and none of them seem to tell me of any problems.
Am I missing something in my Regex that is causing a parenthesis to escape? What do I need to do to fix this?
UPDATE
After a lot of trial and error, I remembered that the # sign is a comment in regexes. The key to fixing this is to escape the # character. In case anyone else comes across the same problem, I've included my fix (just escaping the # sign)
if (tagname.StartsWith("<font"))
{
// detailed <font> tag checking
// Non-escaped expression (for testing in a Regex editor app)
// ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
if (!IsMatch(tagname, #"<font
(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
\s*?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
}
Your IsMatch Method is using the option RegexOptions.IgnorePatternWhitespace, that allows you to put comments inside the regular expressions, so you have to scape the # chatacter, otherwise it will be interpreted as a comment.
if (!IsMatch(tagname,#"<font(\s*size=""\d{1}"")?
(\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
(\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
\s?>"))
{
html = html.Remove(tag.Index, tag.Length);
}
I don't see anything obviously wrong with the regex. I would try isolating the problem by removing pieces of the regex until the problem goes away and then focus on the part that causes the issue.
It works fine for me... what version of the .NET framework are you using, and what is the exact exception?
Also - what does you IsMatch method look like? is this just a pass-thru to Regex.IsMatch?
[update] The problem is that the OP's example code didn't show they are using the IgnorePatternWhitespace regex option; with this option it doesn't work; without this option (i.e. as presented) the code is fine.
Download Chris Sells Regex Designer. Its a great free tool for testing .NET regex's.
I'm not sure this regex is going to do what you want because it depends on the order of the attributes matching what you have in the regex. If for example face="Arial" preceeded size="5" then face= wouldn't match.
There are some escaping problems in your regex. You need to escape your " with \ You need to escape your # with \ You need to use \s in Courier New instead of just the space. You need to use the RegexOptions.IgnorePatternWhitespace and RegexOptions.IgnoreCase options.
<font
(\s+size=\"\d{1}\")?
(\s+color=\"((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)\")?
(\s+face=\"(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)\")?
The # characters are what was causing the exception with the somewhat misleading missing ) message.