Removing non alpha characters - c#

What is the best way in order to remove all non-alpha characters in C#? I have looked up Regex but it doesn't seem to recognise Regex when I do:
string cleanString = "";
string dirtyString = "I don't_8 really know what ! 6 non alpha- is?";
cleanString = Regex.Replace(dirtyString, "[^A-Za-z0-9]", "");
Regex comes with a red wiggly line underneath. Is there a way I can simply remove non-alpha characters, and if so, can someone provide me with a sample? I'm not sure if loops and arrays are the way to go, and also how can I get all non-alpha characters? I'm assuming I have to do something like: if it doesn't equal A-Z or 0-9, then replace it with ""?

You can do it using LINQ like so:
var cleanString = new string(dirtyString.Where(Char.IsLetter).ToArray());
You can check other Char checks on MSDN.

Regex comes with a red wiggly line underneath.
Then either:
The compilation prediction isn't working correctly (it does sometimes get things wrong).
You don't have a using System.Text.RegularExpressions; directive in the code, so the compiler can't work out that you mean System.Text.RegularExpressions.Regex when you say Regex.
To return to your original question:
What is the best way in order to remove all non-alpha characters in C#?
The approach you take is good for small strings, though [^A-Za-z0-9] will remove non-alphanumerics and [^A-Za-z] non-alphabetical characters. This is assuming you are already restricted to (or want to add a restriction to) US-ASCII characters. To include letters like á, œ, ß or δ because you're dealing with real words rather than computer code, I'd use @"\P{L}" to allow all letters, or @"[^\p{L}\p{N}]" to allow all letters and numbers.
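To make the difference between those patterns concrete, here is a minimal sketch. The class and method names (AlphaFilter, KeepAsciiLetters and so on) are made up for illustration; only Regex.Replace and the patterns themselves come from the discussion above.

```csharp
using System.Text.RegularExpressions; // without this directive, Regex is not recognised

public static class AlphaFilter
{
    // ASCII only: keeps A-Z and a-z, removes everything else.
    public static string KeepAsciiLetters(string input) =>
        Regex.Replace(input, "[^A-Za-z]", "");

    // Unicode-aware: \P{L} matches any character that is NOT a letter.
    public static string KeepLetters(string input) =>
        Regex.Replace(input, @"\P{L}", "");

    // Unicode-aware: keeps letters (\p{L}) and numbers (\p{N}).
    public static string KeepLettersAndNumbers(string input) =>
        Regex.Replace(input, @"[^\p{L}\p{N}]", "");
}
```

With the dirtyString from the question, KeepAsciiLetters should give "Idontreallyknowwhatnonalphais", while KeepLettersAndNumbers also keeps the 8 and the 6.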
If you are dealing with a very large piece of text (many kilobytes) then you are better off reading it through a filtering stream that strips the characters you don't want as you go.
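One way to read "filtering as you go" is to copy from a TextReader to a TextWriter in chunks, so the whole text never has to sit in a single string. This is only a sketch under that assumption; LetterStreamFilter and CopyLettersOnly are hypothetical names, and the 4096-character buffer size is arbitrary.

```csharp
using System.IO;

public static class LetterStreamFilter
{
    // Copies characters from reader to writer, skipping anything that is not a letter.
    // Note: this tests UTF-16 code units one by one, so letters outside the
    // Basic Multilingual Plane (stored as surrogate pairs) are dropped as well.
    public static void CopyLettersOnly(TextReader reader, TextWriter writer)
    {
        var buffer = new char[4096];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Compact the letters to the front of the buffer, then write them out.
            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                if (char.IsLetter(buffer[i]))
                    buffer[kept++] = buffer[i];
            }
            writer.Write(buffer, 0, kept);
        }
    }
}
```

For a file you would pass a StreamReader and a StreamWriter; for a quick test, a StringReader and a StringWriter work just as well.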

Related

LinqToSql - Is there an alternative to chaining .Replace() to remove characters from database column on comparison?

I'm trying to do a where, where a database value starts with a user input string, but I need to remove spaces and special characters from both sides of the string.
Any of these:
O B
OB
O'B
Should match any of these:
O'Brian
O Brian
OBrian
Normally I would use regex to take out all non-alphanumeric characters, but LinqtoSql doesn't seem to be able to use regex. I did find SqlMethods.Like for pattern matching, but I cant use it to remove characters from both sides as far as I can tell.
The only way I've been able to figure out how to do this is with something like this:
x.LastName.Replace(" ", String.Empty).Replace("'", String.Empty).Replace("[", String.Empty).Replace("]", String.Empty).StartsWith(lastname.Replace(" ", String.Empty).Replace("'", String.Empty).Replace("[", String.Empty).Replace("]", String.Empty))
Which works, but having a .Replace for every single non alpha character is very lengthy and difficult to read. Is there a better way?

Prevent Regex from devouring optional part of the match

I've searched extensively but I can't find a simple answer to this and my Regex experience is limited. I'd appreciate a simple solution that is explained, please.
I have a very large string and I need to substitute certain words in it as follows:
Example: wherever you find the string "LINK-ABC" make it "LINK_ABC".
I wrote my Regex Match and Replace strings:
@"LINK-ABC", @"LINK_ABC" and it worked.
But there were a couple of things I had not recognized.
There COULD be words in the file like this:
LINK-ABC-DEF LINK-ABC-GHI-JKL ... and so on.
So I get "LINK_ABC-DEF" etc. (which is NOT what I want; this should have remained intact...)
Once I realized the problem it seemed that what I REALLY wanted was to recognize ONLY the word being matched and leave any cases where it was in combination with something else, unchanged. It seemed to me that if I checked for a space or period on the Match word, that should do it, so...
@"LINK-ABC[ |\\.]",@"LINK_ABC"
... and now I have stumbled.
Sample string:
link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx.
Match/Replace string:
link-xxx[ |\\.],link_xxx
Result string:
link_xxxlink-aaa-sss link-xxx-bbb link_xxxlink_xxx
The replacements are correct, BUT the trailing space or period has been "devoured" and so the result string is wrong.
Is there a way that I can match so that if it matches on space, the replacement will have a space and if it matches on a period, the replacement will have a period? I s'pose I could do 2 separate matches but I'd like to increase my understanding of Regex and do it more elegantly if it is possible.
You should be able to achieve the behavior you want with "capture groups"
var matchstring = @"link-xxx([ \.]|$)";
var fixstr = @"link_xxx$1";
The parenthesis around the last part of the matchstring will retain whatever matched inside it, and the $1 in the fixstr will substitute whatever was captured by that group.
I've also modified your punctuation section a little bit, presuming you want to replace a match if it happens to be the last word in the input (by adding the |$). A | inside a character class [] is a literal | character, so I removed it assuming you don't actually expect that in your input.
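Putting the capture-group answer together with the sample string from the question, a small sketch might look like this. LinkRenamer and both method names are invented for the example. The second method uses a lookahead, which is not what the answer above proposes; it is shown only as an alternative that checks the following character without consuming it.

```csharp
using System.Text.RegularExpressions;

public static class LinkRenamer
{
    // Capture group: the space, period or end-of-input is captured as $1
    // and written back by the replacement string.
    public static string RenameWithCapture(string input) =>
        Regex.Replace(input, @"link-xxx([ \.]|$)", "link_xxx$1");

    // Alternative (not from the answer): a lookahead asserts what follows
    // without making it part of the match, so nothing needs to be put back.
    public static string RenameWithLookahead(string input) =>
        Regex.Replace(input, @"link-xxx(?=[ .]|$)", "link_xxx");
}
```

For the sample string "link-xxx link-aaa-sss link-xxx-bbb link-xxx link-xxx." both methods should leave link-aaa-sss and link-xxx-bbb alone and keep the spaces and the final period.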

.Net Regex to highlight keywords including special characters

Keywords highlighters that are available on the internet do not highlight special characters.
e.g. http://sites.google.com/site/yewiki/aspnet/highlighting-multiple-search-keywords-in-aspnet
How can I make them highlight any characters, e.g. C++?
The code in the example is pretty much just taking the search string as the Regex and replacing the spaces with the or operator (|). Special characters entered will be misinterpreted as Regex operators. Much like the code example does a .Replace(" ", "|"), you can do a series of replaces like .Replace("#", "\#") to make sure the specials are escaped in the Regex and not interpreted as their special meaning.
I'm not sure exactly what you are after, but you could also just append the "\#" or whatever specials you are looking for to the Regex expression. I assume if you are doing a C++ like code highlighter your Regex will be a constant, and not a typed in search string like the example you gave.
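Rather than chaining a .Replace per special character, .NET has Regex.Escape, which escapes all regex metacharacters in one call. That is a swapped-in technique, not what the linked example does. A rough sketch, assuming the search terms are separated by spaces and that wrapping a hit in <b> tags counts as highlighting (KeywordHighlighter and Highlight are hypothetical names):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordHighlighter
{
    public static string Highlight(string text, string searchTerms)
    {
        string[] keywords = searchTerms
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .OrderByDescending(k => k.Length)   // try longer keywords first
            .Select(k => Regex.Escape(k))       // "C++" becomes "C\+\+"
            .ToArray();

        // An empty pattern would match at every position, so bail out early.
        if (keywords.Length == 0) return text;

        string pattern = string.Join("|", keywords);
        return Regex.Replace(text, pattern,
            m => "<b>" + m.Value + "</b>", RegexOptions.IgnoreCase);
    }
}
```

With this, a search for C++ is treated as the literal text C++ instead of "C followed by quantifiers".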
Well first you need to get clear what you mean by "highlight any characters".
Do you want to highlight all characters that aren't letters or numbers? Or do you want in the case of C++ to highlight the whole word?
Once you've got it straight, you can use a regex table like this one to work out a suitable regex for matching.
Or better still you can re-use something like syntax-highlighter or google-code-prettify
There's also a well-written article on codingthewheel.com that may be helpful to you.

How Can I Check If a C# Regular Expression Is Trying to Match 1-(and-only-1)-Character Strings?

Maybe this is a very rare (or even dumb) question, but I do need it in my app.
How can I check if a C# regular expression is trying to match 1-character strings?
That means, I only allow the users to search 1-character strings. If the user is trying to search multi-character strings, an error message will be displayed to the users.
Did I make myself clear?
Thanks.
Peter
P.S.: I saw an answer about calculating the final matched strings' length, but for some unknown reason, the answer is gone.
I thought about it for a while, and I think calculating the final matched strings' length is okay, though it's gonna be kind of slow.
Yet, the original question is very rare and tedious.
a regexp would be .{1}
This will allow any char though. If you only want alphanumeric then you can use [a-z0-9]{1} or the shorthand \w{1} (note that \w also matches uppercase letters and the underscore).
Another option is to limit the number of chars a user can type in an input field: set a maxlength on it.
Yet another option is to save the forms input field to a char and not a string although you may need some handling around this to prevent errors.
Why not use maxlength and save to a char.
You can look for unescaped *, +, {}, ? etc. and count the number of characters (don't forget to flatten the [] as one character).
Basically you have to parse your regex.
Instead of validating the regular expression, which could be complicated, you could apply it only on single characters instead of the whole string.
If this is not possible, you may want to limit the possibilities of regular expression to some certain features. For instance the user can only enter characters to match or characters to exclude. Then you build up the regex in your code.
eg:
ABC matches [ABC]
^ABC matches [^ABC]
A-Z matches [A-Z]
# matches [0-9]
\w matches \w
AB#x-z\w matches [AB]|[0-9]|[x-z]|\w
which cases do you need to support?
This would be somewhat easy to parse and validate.
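The first suggestion above (apply the expression only to single characters rather than validating it) could be sketched like this. SingleCharSearch and MatchingChars are made-up names, and the sketch assumes the user's pattern is at least a valid regex; the Regex constructor throws an ArgumentException if it is not.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SingleCharSearch
{
    // Returns the characters of text that the user's pattern matches when it is
    // applied to each character on its own.
    public static string MatchingChars(string text, string userPattern)
    {
        // \A and \z force the pattern to match the entire one-character input,
        // so a pattern that needs two or more characters never matches anything.
        var regex = new Regex(@"\A(?:" + userPattern + @")\z");
        return new string(text.Where(c => regex.IsMatch(c.ToString())).ToArray());
    }
}
```

A multi-character pattern such as ab simply yields no hits, which may be enough of a signal to show the error message.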

I need a regular expression to convert US tel number to link

Basically, the input field is just a string. People input their phone number in various formats. I need a regular expression to find and convert those numbers into links.
Input examples:
(201) 555-1212
(201)555-1212
201-555-1212
555-1212
Here's what I want:
(201) 555-1212 - Notice the space is gone
(201)555-1212
201-555-1212
555-1212
I know it should be more robust than just removing spaces, but it is for an internal web site that my employees will be accessing from their iPhone. So, I'm willing to "just get it working."
Here's what I have so far in C# (which should show you how little I know about regular expressions):
strchk = Regex.Replace(strchk, @"\b([\d{3}\-\d{4}|\d{3}\-\d{3}\-\d{4}|\(\d{3}\)\d{3}\-\d{4}])\b", "<a href='tel:$&'>$&</a>", RegexOptions.IgnoreCase);
Can anyone help me by fixing this or suggesting a better way to do this?
EDIT:
Thanks everyone. Here's what I've got so far:
strchk = Regex.Replace(strchk, @"\b(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b", "<a href='tel:$1'>$1</a>", RegexOptions.IgnoreCase);
It is picking up just about everything EXCEPT those with (nnn) area codes, with or without spaces between it and the 7 digit number. It does pick up the 7 digit number and link it that way. However, if the area code is specified it doesn't get matched. Any idea what I'm doing wrong?
Second Edit:
Got it working now. All I did was remove the \b from the start of the string.
Remove the [] and add \s* (zero or more whitespace characters) around each \-.
Also, you don't need to escape the -. (You can take out the \ from \-)
Explanation: [abcA-Z] is a character group, which matches a, b, c, or any character between A and Z.
It's not what you're trying to do.
Edits
In response to your updated regex:
Change [-\.\s] to [-\.\s]+ to match one or more of any of those characters (eg, a - with spaces around it)
The problem is that \b doesn't match the boundary between a space and a (.
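A quick way to see this: \b only matches between a word character (letter, digit or underscore) and a non-word character, and a space and a ( are both non-word characters. The sketch below uses a simplified (nnn) nnn-nnnn pattern of my own, not the asker's full regex, and WordBoundaryDemo is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Simplified pattern for "(nnn) nnn-nnnn", used only for this demonstration.
    private const string Number = @"\(\d{3}\)\s*\d{3}-\d{4}";

    // Requires a word boundary immediately before the opening parenthesis.
    public static bool MatchesWithBoundary(string input) =>
        Regex.IsMatch(input, @"\b" + Number);

    // Same pattern without the leading \b.
    public static bool MatchesWithoutBoundary(string input) =>
        Regex.IsMatch(input, Number);
}
```

For "call (201) 555-1212" the version with \b finds nothing, because there is no boundary between the space and the (, while the version without it matches.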
Afaik, no phone enters the other characters, so why not replace [^0-9] with '' ?
Here's a regex I wrote for finding phone numbers:
(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}
It's pretty flexible... allows a variety of formats.
Then, instead of killing yourself trying to replace it w/out spaces using a bunch of back references, instead pass the match to a function and just strip the spaces as you wanted.
C#/.net should have a method that allows a function as the replace argument...
Edit: They call it a MatchEvaluator. That example uses a delegate, but I'm pretty sure you could use the slightly less verbose
(m) => m.Value.Replace(" ", "")
or something. working from memory here.
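For what it's worth, here is how the MatchEvaluator idea might look end to end, using the phone regex from this answer unchanged. PhoneLinker and Linkify are names invented for the sketch, and the tel: link format is copied from the question. As written, the pattern requires an area code, so a bare seven-digit number such as 555-1212 is left alone.

```csharp
using System.Text.RegularExpressions;

public static class PhoneLinker
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex Phone = new Regex(
        @"(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}");

    // Wraps each phone number in a tel: link. The lambda is converted to a
    // MatchEvaluator; it strips whitespace for the href and keeps the
    // original text as the visible label.
    public static string Linkify(string text) =>
        Phone.Replace(text, m =>
        {
            string target = Regex.Replace(m.Value, @"\s", "");
            return "<a href='tel:" + target + "'>" + m.Value + "</a>";
        });
}
```

Calling PhoneLinker.Linkify on "(201) 555-1212" should produce a link whose href is tel:(201)555-1212, with the space gone as the question asks.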
