.Net Regex to highlight keywords including special characters

Keyword highlighters that are available on the internet do not highlight special characters,
e.g. http://sites.google.com/site/yewiki/aspnet/highlighting-multiple-search-keywords-in-aspnet
How can I make them highlight any characters, e.g. C++?

The code in the example pretty much just takes the search string as the Regex and replaces the spaces with the or operator (|). Special characters entered will be misinterpreted as Regex operators. Much like the code example does a .Replace(" ", "|"), you can do a series of replaces like .Replace("#", "\\#") to make sure the special characters are escaped in the Regex and not interpreted with their special meaning.
I'm not sure exactly what you are after, but you could also just append "\\#" or whatever special characters you are looking for to the Regex. I assume if you are doing a C++-like code highlighter your Regex will be a constant, and not a typed-in search string like the example you gave.
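A minimal sketch of that idea, assuming the search string is a space-separated list of keywords and that matches should be wrapped in <b> tags. The class name KeywordHighlighter and the method Highlight are made up for illustration. Instead of a chain of .Replace calls it uses Regex.Escape, which escapes all regex metacharacters in one go:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the linked article.
public static class KeywordHighlighter
{
    // Wraps every occurrence of any keyword in <b>...</b>.
    // Each keyword is escaped, so "C++" is matched literally.
    public static string Highlight(string text, string searchTerms)
    {
        string[] keywords = searchTerms
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Longest first, so "C++" wins over a shorter keyword such as "C".
            .OrderByDescending(k => k.Length)
            .Select(k => Regex.Escape(k))
            .ToArray();

        if (keywords.Length == 0)
        {
            return text;
        }

        string pattern = string.Join("|", keywords);
        return Regex.Replace(
            text,
            pattern,
            m => "<b>" + m.Value + "</b>",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Prints: I write <b>C++</b> and <b>C#</b> code.
        Console.WriteLine(Highlight("I write C++ and C# code.", "C++ C#"));
    }
}
```

The sketch does not HTML-encode the input text; in a web page you would probably want to do that before inserting the tags.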

Well, first you need to be clear about what you mean by "highlight any characters".
Do you want to highlight all characters that aren't letters or numbers? Or, in the case of C++, do you want to highlight the whole word?
Once you've got it straight, you can use a regex table like this one to work out a suitable regex for matching.
Or better still you can re-use something like syntax-highlighter or google-code-prettify
There's also a well-written article on codingthewheel.com that may be helpful to you.

Related

Removing non alpha characters

What is the best way in order to remove all non-alpha characters in C#? I have looked up Regex but it doesn't seem to recognise Regex when I do:
string cleanString = "";
string dirtyString = "I don't_8 really know what ! 6 non alpha- is?";
cleanString = Regex.Replace(dirtyString, "[^A-Za-z0-9]", "");
Regex comes with a red wiggly line underneath. Is there a way I can simply remove non-alpha characters, and if so can someone provide me with a sample? I'm not sure if loops and arrays are the way to go, and also how can I get all non-alpha characters? I'm assuming I have to do something like: if it doesn't equal A-Z or 0-9, then replace it with ""?

You can do it using LINQ like so:
var cleanString = new string(dirtyString.Where(Char.IsLetter).ToArray());
You can check other Char checks on MSDN.

Regex comes with a red wiggly line underneath.
Then either:
The compilation prediction isn't working correctly (it does sometimes get things wrong).
You don't have a using System.Text.RegularExpressions in the code, so it can't work out you mean System.Text.RegularExpressions.Regex when you say Regex.
To return to your original question:
What is the best way in order to remove all non-alpha characters in C#?
The approach you take is good for small strings, though [^A-Za-z0-9] will remove non-alphanumerics and [^A-Za-z] non-alphabetical characters. This is assuming you are already restricted to (or want to add a restriction to) US-ASCII characters. To include letters like á, œ, ß or δ because you're dealing with real words rather than computer code, I'd use @"\P{L}" to keep only letters, or @"[^\p{L}\p{N}]" to keep all letters and numbers.
If you are dealing with a very large piece of text (many kilobytes) then you are better off reading it through a filtering stream that strips the characters you don't want as you go.
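As a rough sketch of both suggestions (the class and method names are made up for illustration): the first three methods compare the ASCII-only and Unicode-aware patterns, and CopyLetters shows one possible way to filter character by character while reading, here with a plain TextReader/TextWriter pair rather than a custom stream class.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper names, for illustration only.
public static class NonAlphaFilter
{
    // Keeps only US-ASCII letters.
    public static string AsciiLettersOnly(string s) =>
        Regex.Replace(s, "[^A-Za-z]", "");

    // Keeps every Unicode letter (category L).
    public static string LettersOnly(string s) =>
        Regex.Replace(s, @"\P{L}", "");

    // Keeps every Unicode letter and number (categories L and N).
    public static string LettersAndNumbersOnly(string s) =>
        Regex.Replace(s, @"[^\p{L}\p{N}]", "");

    // Copies only letters from input to output, one char at a time,
    // so the whole text never has to be held in memory.
    // Assumption: characters outside the Basic Multilingual Plane
    // (surrogate pairs) are dropped, since char.IsLetter(char) is false for them.
    public static void CopyLetters(TextReader input, TextWriter output)
    {
        int c;
        while ((c = input.Read()) != -1)
        {
            if (char.IsLetter((char)c))
            {
                output.Write((char)c);
            }
        }
    }

    public static void Main()
    {
        string dirty = "Stra\u00DFe 42!"; // \u00DF is the letter ß

        Console.WriteLine(AsciiLettersOnly(dirty));      // ß is removed
        Console.WriteLine(LettersOnly(dirty));           // ß is kept, digits removed
        Console.WriteLine(LettersAndNumbersOnly(dirty)); // ß and digits are kept

        using (var reader = new StringReader(dirty))
        using (var writer = new StringWriter())
        {
            CopyLetters(reader, writer);
            Console.WriteLine(writer.ToString());
        }
    }
}
```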

Regex setting word characters and matching exact word

I need my C# regex to only match full words, and I need to make sure that +-/*() delimit words as well (I'm not sure if the last part is already set that way.) I find regexes very confusing and would like some help on the matter.
Currently, my regex is:
public Regex codeFunctions = new Regex("draw_line|draw_rectangle|draw_circle");
Thank you! :)

Try
public Regex codeFunctions = new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");
The \b means match a word boundary, i.e. a transition from a non-word character to a word character (or vice versa).
Word characters include alphabet characters, digits, and the underscore symbol. Non-word characters include everything else, including +-/*(), so it should work fine for you.
See the Regex Class documentation for more details.
The @ at the start of the string makes the string a verbatim string, otherwise you have to type two backslashes to make one backslash.
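A small sketch to check that behaviour, using the pattern above; the class name, the method FindFunctions and the sample input are assumptions made for the demonstration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class WordBoundaryDemo
{
    private static readonly Regex CodeFunctions =
        new Regex(@"\b(draw_line|draw_rectangle|draw_circle)\b");

    // Returns every whole-word match of the three function names.
    public static string[] FindFunctions(string code)
    {
        return CodeFunctions.Matches(code)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string code = "x = draw_line(1,2)+draw_circle*3; my_draw_line(); draw_lines();";

        // Prints: draw_line, draw_circle
        // "my_draw_line" and "draw_lines" are skipped because '_' and 's'
        // are word characters, so there is no boundary next to the name.
        Console.WriteLine(string.Join(", ", FindFunctions(code)));
    }
}
```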

Do you want to match any words, or just the words listed above? To match an arbitrary word, substitute this for the bit that creates the Regex object:
new Regex(@"\b(\w+)\b");
In the future, if you want more characters to be treated as whitespace (for example, underscores), I would recommend String.Replace-ing them to a space character. There may be a clever way to get the same effect with regular expressions, but personally I think it would be too clever. The String.Replace version is obvious.
Also, I can't help but recommend that you read up on regular expressions. Yes, they look like line noise until you get used to them, but once you do they're convenient and there are plenty of good resources out there to help you.

C# Regex Escape Sequences

Is there a complete list of regex escape sequences somewhere? I found this, but it was missing \\ and \e for starters. Thus far I have come up with this regex pattern that hopefully matches all the escape sequences:
@"\\([bBdDfnreasStvwWnAZG\\]|x[A-Z0-9]{2}|u[A-Z0-9]{4}|\d{1,3}|k<\w+>)"

Alternatively, if you only want to escape a string correctly, you could just depend on Regex.Escape() which will do the necessary escaping for you.
Hint: There is also a Regex.Unescape()
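A short sketch of both calls; the class name and sample strings are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class EscapeDemo
{
    // True if the escaped form of 'literal' matches exactly 'literal' itself.
    public static bool EscapedPatternMatchesLiteral(string literal)
    {
        string pattern = "^" + Regex.Escape(literal) + "$";
        return Regex.IsMatch(literal, pattern);
    }

    public static void Main()
    {
        string escaped = Regex.Escape("C++");
        Console.WriteLine(escaped);                  // prints C\+\+
        Console.WriteLine(Regex.Unescape(escaped));  // prints C++

        // Unescaped, "C++" is not even a valid pattern (nested quantifier);
        // escaped, it matches the literal text.
        Console.WriteLine(EscapedPatternMatchesLiteral("C++")); // prints True
    }
}
```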

This MSDN page (Regular Expression Language Elements) is a good starting place, with this subpage specifically about escape sequences.

Don't forget the zillions of possible unicode categories: \p{Lu}, \P{Sm} etc.
There are too many of these for you to match individually, but I suppose you could use something along the lines of \\[pP]\{[A-Za-z0-9 \-_]+?\} (untested).
And there's also the simpler stuff that's missing from your list: \., \+, \*, \? etc etc.
If you're simply trying to unescape an existing regex then you could try Regex.Unescape. It's not perfect, but it's probably better than anything you or I could knock up in a short space of time.

How to Check if a String is a "string" or a RegEx?

How can I check if a String in a textbox is a plain String or a RegEx?
I'm searching through a text file line by line.
Either by .Contains(Textbox.Text); or by Regex(Textbox.Text).Match(currentLine)
(I know, syntax isn't working like this, it's just for presentation)
Now my Program is supposed to autodetect if Textbox.Text is in form of a RegEx or if it is a normal String.
Any suggestions? Write my own little RegEx to detect if Textbox contains a RegEx?
Edit:
I failed to add that my Strings can be very simple, like Foo or 0005.
I'm trying the suggested solutions right away!

You can't detect regular expressions with a regular expression, as regular expressions themselves are not a regular language.
However, the easiest you probably could do is trying to compile a regex from your textbox contents and when it succeeds you know that it's a regex. If it fails, you know it's not.
But this would classify ordinary strings like "foo" as a regular expression too. Depending on what you need to do, this may or may not be a problem. If it's a search string, then the results are identical for this case. In the case of "foo.bar" they would differ, though, since it's a valid regex but matches different things than the string itself.
My advice, also stated in another comment, would be that you simply always enable regex search, since splitting the code paths here gains you nothing apart from a dubious performance benefit (which is unlikely to make any difference, if there is much of a benefit at all).
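The "try to compile it" check could be sketched roughly as follows. The class and method names are made up; the sketch relies on the Regex constructor throwing an ArgumentException (or a subclass of it) when the pattern cannot be parsed:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RegexCheck
{
    // True if 'pattern' can be parsed as a .NET regular expression.
    // This says nothing about whether the user meant it as one.
    public static bool IsValidRegex(string pattern)
    {
        if (pattern == null)
        {
            return false;
        }

        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidRegex("foo"));        // True: plain text is a valid regex too
        Console.WriteLine(IsValidRegex("foo.bar"));    // True
        Console.WriteLine(IsValidRegex("(unclosed"));  // False: missing ')'

        // The same text gives different results as a string and as a regex:
        Console.WriteLine("fooXbar".Contains("foo.bar"));        // False
        Console.WriteLine(Regex.IsMatch("fooXbar", "foo.bar"));  // True
    }
}
```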

Many strings could be a regex, every regex could actually be a string.
Consider the string "thin.": it could either be a string ('.' is a dot) or a regex ('.' is any character).
I would just add a checkbox where the user indicates whether they are entering a regex, as is usual in many applications.

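One possible solution, depending on your definition of string and regex, would be to check if the string contains any characters typical of a regex.
You could do something like this:
string s = "Foo";
if (s == Regex.Escape(s))
{
    // nothing needed escaping, so treat it as a plain string
}
Note that Regex.Escape also escapes white space and #, so a string such as "I'm not a Regex" does not pass this check.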

Try and use it in a regex and see if an exception is thrown.
This approach only checks if it is a valid regex, not whether it was intended to be one.
Another approach could be to check if it is surrounded by slashes (i.e. '/foo/'). Surrounding regexes with slashes is common practice (although you must remove the slashes before feeding it into the regex library).

Regex match words that are not part of a larger word

I am trying to use Regex in C# to look for a list of keywords in a bunch of text. However I want to be very specific about what the "surrounding" text can be for something to count as a keyword.
So for example, the keyword "hello" should be found in (hello), hello., hello< but not in hellothere.
My main problem is that I don't REQUIRE the separators, if the keyword is the first word or the last word it's okay. I guess another way to look at it is that the beginning-of-the-file and the end-of-the-file should be acceptable separators.
I'm new to Regex so I was hoping someone could help me get the pattern right. So far I have:
[ <(.]+?keyword[<(.]+?
where <, (, . are some example separators and keyword is of course the keyword I am looking for.

You could use the word boundary anchor:
\bkeyword\b
which would find your keyword only when not part of a larger word.
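A quick sketch of that check against the examples from the question; the class and method names are made up. One caveat worth labelling: \b only helps when the keyword itself starts and ends with a word character. For keywords such as "C++" the second method shows a lookaround variant, which is a swapped-in alternative and not part of the answer above:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WholeWordSearch
{
    // Uses \b on both sides; fine for keywords made of word characters.
    public static bool ContainsWord(string text, string keyword)
    {
        return Regex.IsMatch(text, @"\b" + Regex.Escape(keyword) + @"\b");
    }

    // Alternative: require that no word character sits directly before or
    // after the keyword. Also works when the keyword ends in a symbol.
    public static bool ContainsKeyword(string text, string keyword)
    {
        return Regex.IsMatch(text, @"(?<!\w)" + Regex.Escape(keyword) + @"(?!\w)");
    }

    public static void Main()
    {
        foreach (string sample in new[] { "(hello)", "hello.", "hello<", "hello", "hellothere" })
        {
            // True for every sample except "hellothere".
            Console.WriteLine(sample + " -> " + ContainsWord(sample, "hello"));
        }

        Console.WriteLine(ContainsWord("I like C++.", "C++"));    // False: no \b after '+'
        Console.WriteLine(ContainsKeyword("I like C++.", "C++")); // True
    }
}
```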

You will want to look into the word boundary (\b) to avoid matching keywords that appear as a part of another word (as in your hellothere example).
You can also add matching at beginning of line (^) and end of line ($) to control the position where keywords may appear.

I think you want something like:
(^|[ <(.]+?)keyword($|[<(.]+?)
The ^ and $ chars symbolise the start and end of the input text, respectively. (If you specify the Multiline option, they match at the start/end of each line rather than of the whole text; for your case the default behaviour, without Multiline, is what you seem to want.)
