Escaping \x from strings - c#

Well, I got this little method:
static string escapeString(string str) {
    string s = str.Replace(@"\r", "\r").Replace(@"\n", "\n").Replace(@"\t", "\t");
    Regex regex = new Regex(@"\\x(..)");
    var matches = regex.Matches(s);
    foreach (Match match in matches) {
        s = s.Replace(match.Value, ((char)Convert.ToByte(match.Value.Replace(@"\x", ""), 16)).ToString());
    }
    return s;
}
It replaces "\x65" in the string I get from args[0].
But my problem is that "\\x65" gets replaced too, so I end up with "\e". I tried to come up with a regex that checks whether there is more than one backslash, but I had no luck.
Can somebody give me a hint?

You can keep hacking regexes together with things like "(\s|\w)\\x(..)" to exclude the \\x65 case. That will be brittle, though, since there is no guarantee that your \x65 sequence always has a space or word character in front of it; it could be at the very beginning of the input. Also, your regex will match \xTT, which obviously isn't valid hex. Consider replacing the '.' with a character class, like "\\x([0-9a-fA-F]{2})".
If this were a school project, I would do something like the following. You can replace all occurrences of "\\" with another, unlikely sequence such as "_#!!#!!#_", run the regex and replacements, and then replace every occurrence of the unlikely sequence with "\". For example:
String s = inputString.Replace(@"\\", @"_#!!#!!#_");
// do all of the regex, replacements, etc here
String output = s.Replace(@"_#!!#!!#_", @"\");
However, you shouldn't do this in production code because if your input stream ever has the magic sequence then you will get extra backslashes.
It's obvious that you are writing some kind of interpolator. I feel obligated to recommend looking into something more robust, like lexers that use regexes to form finite state machines. Wikipedia has some great articles on this topic, and I'm a big fan of ANTLR. It may be overengineering now, but if you keep running into these special cases, consider solving your problem in a more general way.
Start reading here for the theory: http://en.wikipedia.org/wiki/Lexical_analysis

Use a negative look-behind:
Regex regex = new Regex(@"(?<!([^\\]|^)\\)\\x(..)");
This asserts that the previous character is not a solo backslash, but without making the previous character part of the match (look-arounds do not consume characters).
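As a rough sketch of how the look-behind could be wired into the replacement step: the class and method names below (HexUnescape, Unescape) are made up for illustration, and the stricter [0-9a-fA-F]{2} class is an assumption borrowed from the other answer rather than part of the pattern above. It does the substitution in one pass with Regex.Replace and a match evaluator, so the result of one replacement can never be re-scanned by the next.

```csharp
using System;
using System.Text.RegularExpressions;

static class HexUnescape
{
    // \xHH, but only when the backslash is not preceded by a single, solo backslash.
    // (?: ... ) keeps the look-behind non-capturing, so the hex digits are group 1.
    static readonly Regex HexEscape =
        new Regex(@"(?<!(?:[^\\]|^)\\)\\x([0-9a-fA-F]{2})");

    public static string Unescape(string s)
    {
        // Convert the two hex digits to a byte, then to the matching char.
        return HexEscape.Replace(
            s,
            m => ((char)Convert.ToByte(m.Groups[1].Value, 16)).ToString());
    }
}
```

With this sketch, Unescape(@"a\x65b") gives "aeb", while @"a\\x65" is left alone. It only looks at one preceding backslash, so longer runs such as four backslashes followed by x65 are still not handled correctly; that is where a real lexer starts to pay off.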

Related

How can I split the following string into a string array

I want to split the following:
name[]address[I]dob[]nationality[]occupation[]
So my results would be:
name[]
address[I]
dob[]
nationality[]
occupation[]
I have tried using Regex.Split but can't get these results.
You can use Regex.Split with the following regex:
(?<=])(?=[a-z])
which will split between a closing square bracket on the left and a letter on the right. This is done using lookaround assertions. They don't consume any characters of the match, so in this situation they're pretty handy for matching a position between two characters.
Basically it means exactly what I wrote: (?<=]) will match a point in the string preceded by a closing bracket, while (?=[a-z]) matches a point in the string (both zero-width, i.e. between characters) where a letter follows. You can tweak that a little if your input data looks different from what you gave us in the question.
You could also simplify it a little, at the expense of readability, by using (?<=])\b. But I would advise against that since \b is tied to \w which is a really ugly thing, usually. It would work roughly the same, but not quite, as \b in this context amounts to (?=[\w]) and \w matches a lot more things, namely decimal digits and an underscore too.
Quick PowerShell test (it uses the same regex implementation since it's .NET underneath):
PS> 'name[]address[I]dob[]nationality[]occupation[]' -split '(?<=])(?=[a-z])'
name[]
address[I]
dob[]
nationality[]
occupation[]
Just for completeness, there is also another option. You can either split the string between the tokens you want to retain, or you could just collect all matches of tokens you want to keep. In the latter case you'll need a pattern that matches what you need, such as
[a-z]+\[[^\]]*]
or what Dennis gave as an answer (I just tend to avoid \w and \b except for quick and dirty hacks or golfing since I maintain that they have no useful application). You can use that with Regex.Matches.
Generally both approaches can work fine, it then depends on whether the split or the match pattern is easier to understand. And for Regex.Matches you'll get Match objects so you don't actually end up with a string[] if you need that, so that'd require .Select(m => m.Value) as well.
In this case I guess neither regex should be left alone without a comment explaining what it does. I can read them just fine, but many developers are a little uneasy around regexes and especially more advanced concepts like lookaround often warrant an explanation.
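To make the two options concrete, here is a small sketch of both, using the patterns from this answer. The class and method names (BracketTokens, BySplit, ByMatches) are only illustrative, and the closing brackets are written as \] purely for readability. Cast<Match>() is included because on older .NET Framework versions MatchCollection is not a generic IEnumerable<Match>.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class BracketTokens
{
    // Split at every position that has ']' on the left and a lowercase letter on the right.
    public static string[] BySplit(string input)
    {
        return Regex.Split(input, @"(?<=\])(?=[a-z])");
    }

    // Collect every token of the form letters + '[' + anything but ']' + ']'.
    public static string[] ByMatches(string input)
    {
        return Regex.Matches(input, @"[a-z]+\[[^\]]*\]")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the input from the question, both methods should return the same five strings; which one to prefer is mostly a question of which pattern the next reader will understand faster.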
text.Split(new Char[] { ']' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s + "]").ToArray();
Use this regex pattern:
\w*\[\w*\]
A regular expression should be fine. You could also locate the opening and closing square brackets with string.IndexOf, for example:
IEnumerable<string> Results(string input)
{
    int currentIndex = -1;
    while (true)
    {
        currentIndex++;
        int openingBracketIndex = input.IndexOf("[", currentIndex);
        int closingBracketIndex = input.IndexOf("]", currentIndex);
        if (openingBracketIndex == -1 || closingBracketIndex == -1)
            yield break;
        yield return input.Substring(currentIndex, closingBracketIndex - currentIndex + 1);
        currentIndex = closingBracketIndex;
    }
}
string inputString = "name[]address[I]dob[]nationality[]occupation[]";
var result = Regex.Matches(inputString, @".*?\[I?\]").Cast<Match>().Select(m => m.Groups[0].Value).ToArray();

Need some C# Regular Expression Help

I'm trying to come up with a regular expression that will stop at the first occurrence of </ol>. My current regex sort of works, but only if </ol> has spaces on either side. For instance, instead of stopping at the first instance in the line below, it stops at the second:
some random text and HTML</ol></b> bla </ol>
Here's the pattern I'm currently using: string pattern = @"some random text(.|\r|\n)*</ol>";
What am I doing wrong?
string pattern = @"some random text(.|\r|\n)*?</ol>";
Note the question mark after the star: it makes the quantifier non-greedy, which means it matches as little as possible rather than as much as possible (the greedy default).
Make your wild-card "ungreedy" by adding a ?. e.g.
some random text(.|\r|\n)*?</ol>
                          ^- Addition
This will make regex match as few characters as possible, instead of matching as many (standard behavior).
Oh, and regex shouldn't parse [X]HTML
While not a Regex, why not simply use the Substring functions, like:
string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>"));
That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
This regex matches everything from the beginning of the string up to the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so is quite efficient:
Regex pattern = new Regex(
    @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
    RegexOptions.IgnoreCase);
resultString = pattern.Match(text).Value;
Others have already explained the missing ? that makes the quantifier non-greedy. I want to suggest another change as well.
I don't like your (.|\r|\n) part. An alternation of single characters is usually simpler and easier to read as a character class (and possibly more efficient, though I haven't measured it). Note, however, that . loses its special meaning inside a character class, so [.\r\n] would only match a literal dot or a line break; [\s\S] is the usual class for "any character, including line breaks".
But in your special case, where the alternatives to the . are only newline characters, even that is not the best way. Here you should do this:
Regex A = new Regex(@"some random text.*?</ol>", RegexOptions.Singleline);
Use the Singleline modifier. It makes the . match newline characters as well.
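A minimal side-by-side sketch of the greedy and the non-greedy quantifier on the sample line from the question, both with RegexOptions.Singleline. The class and method names (FirstClosingTag, Greedy, Lazy) are invented for this example.

```csharp
using System.Text.RegularExpressions;

static class FirstClosingTag
{
    // Greedy: .* runs to the end of the input and backtracks to the LAST </ol>.
    public static string Greedy(string text)
    {
        return Regex.Match(text, @"some random text.*</ol>", RegexOptions.Singleline).Value;
    }

    // Lazy: .*? grows one character at a time and stops at the FIRST </ol>.
    public static string Lazy(string text)
    {
        return Regex.Match(text, @"some random text.*?</ol>", RegexOptions.Singleline).Value;
    }
}
```

For "some random text and HTML</ol></b> bla </ol>", Lazy returns the text up to and including the first </ol>, while Greedy returns the whole line.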

What is the best way of splitting up a string by capital letters in C#?

What is the best way of splitting up a string by capital letters in C#?
Example:
HelloStackOverflow Users.How Are you doing?
Expected result:
Hello Stack Overflow Users. How are you doing?
You can use a regex:
static readonly Regex splitter = new Regex(@"\s+|(?=\s*[A-Z]+)|(?<=[,.?!])");
var spacedOut = splitter.Replace(str, " ");
This uses a lookahead to match the spot before a capital letter (with \s* to swallow the whitespace).
It uses a lookbehind to match the spot after punctuation.
It depends how you define "best".
Unless you want a trivial implementation (blindly insert a space in front of every uppercase letter), I'd avoid regex and just write the few lines of code that do precisely what I need - create a destination StringBuilder, do a foreach through the characters of the string, copying characters across and inserting extra spaces when appropriate - you'll just need to keep a state variable to know if the previous character was uppercase. This will make it easy to handle all the possible special cases (first character is uppercase, acronyms, characters following punctuation or whitespace, single words like "A", culture-sensitive handling, etc).
Why wouldn't I use regex?
Firstly, if you want to handle all the special cases well, you'll probably need quite advanced regex skills, and the result will be an undecipherable "magic string" (difficult to read/maintain, as perfectly demonstrated by @Slaks IMHO - can you read and understand his regex in under 10 seconds?). A simple loop will be much easier to write, test, debug, read and upgrade unless you (and anyone else who might have to read/maintain your code in future) have been doing regexes for years.
Secondly, a loop through the characters is very simple. The regex will almost certainly be slower due to the higher level of generalisation it provides. This may or may not be an issue for you, but efficiency could be a significant factor when defining "best".
Thirdly, I'm an old dog and I don't see much point in using clever new tricks to solve problems that a simple for loop can handle :-) ... I often see programmers using "cool" obfuscated LINQ queries and Regexes in place of a simple 2-or-3-line loop, and it makes me think of the old adage "to a man with a hammer, everything looks like a nail". Regex, like all tools, has its place. And I'm not convinced this justifies anything that complex.
I'm an oldschool guy, I would write it using StringBuilder because I do not speak regexish:
var sb = new StringBuilder(input.Length);
int nextIndexToAdd = 0;
for (int i = 1; i < input.Length; i++)
    if (char.IsUpper(input[i])
        && !char.IsWhiteSpace(input[i - 1])
        && (!char.IsUpper(input[i - 1]) || (i < input.Length - 1 && !char.IsUpper(input[i + 1]))))
    {
        sb.Append(input.Substring(nextIndexToAdd, i - nextIndexToAdd));
        sb.Append(" ");
        nextIndexToAdd = i;
    }
sb.Append(input.Substring(nextIndexToAdd));
string result = sb.ToString();
This handles both IAmFromUSA and HelloStack...

Regex vs String.Contains

Hola. I'm failing to write a method to test for words within a plain text or html document. I was reasonably literate with regex, and I am newer to c# (from way more java).
Just 'cause,
string html = source.ToLower();
string plaintext = Regex.Replace(html, @"<(.|\n)*?>", " "); // remove tags
plaintext = Regex.Replace(plaintext, @"\s+", " "); // remove excess white space
and then,
string tag = "c++";
bool foundAsRegex = Regex.IsMatch(plaintext, @"\b" + Regex.Escape(tag) + @"\b");
bool foundAsContains = plaintext.Contains(tag);
For a case where "c++" should be found, sometimes foundAsRegex is true and sometimes false. My google-fu is weak, so I didn't get much back on "what the hell". Any ideas or pointers welcome!
edit:
I'm searching for matches on skills in resumes. for example, the distinct value "c++".
edit:
a real excerpt is given below:
"...administration- c, c++, perl, shell programming..."
The problem is that \b matches between a word character and a non-word character. Given the expression \bc\+\+\b, you have a problem. "+" is a non-word character. So searching for the pattern in "xxx c++, xxx", you're not going to find anything. There's no "word break" after the "+" character.
If you're looking for non-word characters then you'll have to change your logic. Not sure what the best thing would be. I suppose you can use \W, but then it's not going to match at the beginning or end of the line, so you'll need (^|\W) and (\W|$) ... which is ugly. And slow, although perhaps still fast enough depending on your needs.
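One way to sketch the "(^|\W) ... (\W|$)" idea without consuming the neighbouring characters is to use negative lookarounds instead; this is a substitution on my part, not the pattern from the answer above. The names SkillMatcher and ContainsToken are hypothetical. The pattern requires that the token is neither preceded nor followed by a word character, which also covers the start and end of the input.

```csharp
using System.Text.RegularExpressions;

static class SkillMatcher
{
    // True when token occurs in text with no word character directly before or after it.
    public static bool ContainsToken(string text, string token)
    {
        // Regex.Escape turns "c++" into "c\+\+" so the plus signs are literal.
        string pattern = @"(?<!\w)" + Regex.Escape(token) + @"(?!\w)";
        return Regex.IsMatch(text, pattern);
    }
}
```

On the excerpt from the question this finds "c++" where the \b version does not, and it rejects "abc++". Note that searching for "c" alone would still hit the "c" in "c++", because "+" is not a word character; whether that matters depends on how the skill list is defined.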
Your regular expression is turning into:
/\bc\+\+\b/
Which means you're looking for a word boundary, followed by the string c++, followed by another word boundary. This means it won't match on strings like abc++, whereas plaintext.Contains will succeed.
If you can give us examples of where your regex fails when you expected it to succeed, then we can give you a more definite answer.
Edit: My original regex was /\bc++\b/, which is incorrect, as c++ is being passed to Regex.Escape(), which escapes out regular expression metacharacters like +. I've fixed it above.

Any ideas why this does not work? C#

public class MyExample
{
    public static void Main(String[] args)
    {
        string input = "The Venture Bros</p></li>";
        // Call Regex.Match
        Match m = Regex.Match(input, "/show_name=(.*?)&show_name_exact=true\">(.*?)</i");
        // Check Match instance
        if (m.Success)
        {
            // Get Group value
            string key = m.Groups[1].Value;
            Console.WriteLine(key);
            // alternate-1
        }
    }
}
I want "The Venture Bros" as output (in this example).
try this :
string input = "The Venture Bros</p></li>";
// Call Regex.Match
Match m = Regex.Match(input, "show_name=(.*?)&show_name_exact=true\">(.*?)</a");
// Check Match instance
if (m.Success)
{
    // Get Group value
    string key = m.Groups[2].Value;
    Console.WriteLine(key);
    // alternate-1
}
I think it's because you're trying to do the perl-style slashes on the front and the end. A couple of other answerers have been confused by this already. The way he's written it, he's trying to do case-insensitive by starting and ending with / and putting an i on the end, the way you'd do it in perl.
But I'm pretty sure that .NET regexes don't work that way, and that's what's causing the problem.
Edit: to be more specific, look into RegexOptions, an example I pulled from MSDN is like this:
Dim rx As New Regex("\b(?<word>\w+)\s+(\k<word>)\b", RegexOptions.Compiled Or RegexOptions.IgnoreCase)
The key there is the "RegexOptions.IgnoreCase", that'll cause the effect that you were trying for with /pattern/i.
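In C#, the difference might look like the following sketch. The class and method names (CaseInsensitive, PerlStyle, DotNetStyle) and the "hello" pattern are made up for illustration; the point is only that .NET treats the slashes and the trailing i as literal characters of the pattern.

```csharp
using System.Text.RegularExpressions;

static class CaseInsensitive
{
    // Perl-style delimiters are NOT special in .NET: this looks for the
    // literal text "/hello/i", slashes and trailing 'i' included.
    public static bool PerlStyle(string input)
    {
        return Regex.IsMatch(input, "/hello/i");
    }

    // The .NET way: pass the option separately.
    public static bool DotNetStyle(string input)
    {
        return Regex.IsMatch(input, "hello", RegexOptions.IgnoreCase);
    }
}
```

So PerlStyle("HELLO") is false while DotNetStyle("HELLO") is true. The inline form "(?i)hello" is another way to get the same effect inside the pattern itself.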
The correct regex in your case would be
^.*&show_name_exact=true\"\>(.*)</a></p></li>$
regexp is tricky, but at http://www.regular-expressions.info/ you can find a great tutorial
/?show_name=(.*)&show_name_exact=true\">(.*)
would work as you expect I believe. But another thing I notice, is that you're trying to get the value of group[1], but I believe that you want the value of group[2], because there will be 3 groups, the first is the match, and the second is the first group...
Gl ;)
Because of the question mark before show_name. It is in input but not in pattern, thus no match.
Also, you try to match </i but the input doesn't contain this (it contains </li>).
First, the regex starts with "/show_name", but the target string has "/?show_name", so the pattern cannot match at the point where you expect it to.
This will cause the whole regex to fail.
Ok, let's break this down.
Test Data: "The Venture Bros</p></li>"
Original Regex: "/show_name=(.*?)&show_name_exact=true\">(.*?)</i"
Working Regex: "/\?show_name=(.*)&show_name_exact=true\">(.*)</a"
We'll start at the left and work our way to the right, through the regex.
"?" became "\?": an unescaped "?" means that the preceding character or group is optional. When we put a backslash before it, it matches a literal question mark instead.
"(.*?)" became "(.*)": the parentheses denote a group. Note that a "?" directly after "*" does not mean "optional"; it makes the quantifier lazy (match as little as possible). For an input that contains a single link, both forms capture the same text, so the change is harmless here, but the lazy form is the safer choice if a line could contain more than one link.
"</i" became "</a": this change was made to match your actual text, which terminates the anchor with a "</a>" tag.
Suggested Regex: "[\\W]show_name=([^><\"]*)&show_name_exact=true\">([^<]*)<"
(The extra \'s were added to provide proper c# string escaping.)
A good tool for testing regular expressions in c#, is the regex-freetool at code.google.com
