C# change numbers to small indexes in string - c#

I have the string "H2O" (the chemical formula for water). I would like to change it so that all the numbers in the string are rendered as subscripts (the 2 would become a small index next to the letter H). How can I do that?

Assuming that you have the means of displaying the subscript unicode characters, you could easily write your own extension method for subscripting:
public static string Subscript(this string normal)
{
if(normal == null) return normal;
var res = new StringBuilder();
foreach(var c in normal)
{
char c1 = c;
// I'm not quite sure if char.IsDigit(c) returns true for, for example, '³',
// so I'm using the safe approach here
if (c >= '0' && c <= '9')
{
// 0x208x is the unicode offset of the subscripted number characters
c1 = (char)(c - '0' + 0x2080);
}
res.Append(c1);
}
return res.ToString();
}
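On the doubt raised in the code comment: as far as I know, char.IsDigit returns true only for characters in the DecimalDigitNumber category, so it is false for '³' (category OtherNumber) but true for digits from other scripts, such as the Arabic-Indic digit three. The explicit range check therefore looks like the safer choice. A minimal sketch to check this; the class name SubscriptDemo and the CategoryOf helper are only illustrative:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SubscriptDemo
{
    // Same logic as the extension method above, repeated so the sketch is self-contained.
    public static string Subscript(string normal)
    {
        if (normal == null) return null;
        var res = new StringBuilder(normal.Length);
        foreach (char c in normal)
        {
            // Only ASCII digits are shifted into U+2080..U+2089; everything else is copied.
            res.Append(c >= '0' && c <= '9' ? (char)(c - '0' + 0x2080) : c);
        }
        return res.ToString();
    }

    // Illustrative helper: exposes the Unicode category that char.IsDigit relies on.
    public static UnicodeCategory CategoryOf(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c);
    }
}
```

For example, SubscriptDemo.Subscript("H2O") yields "H₂O", while SubscriptDemo.CategoryOf('\u00B3') reports OtherNumber rather than DecimalDigitNumber.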

As has been pointed out in the comments, you should typically use some presentation technology for this kind of formatting. For example, in HTML, you could display your text through:
<span>H<sub>2</sub>O</span>
However, Unicode allocates a Superscripts and Subscripts block that includes the decimal digits, which you can take advantage of. Since Unicode is supported natively in .NET, including within string literals, you can use your desired character directly:
text = text.Replace("H2O", "H₂O");
Note: Using the Unicode subscript character means that your H₂O string should render correctly in any Unicode-aware application, irrespective of its formatting technology (HTML, RTF, PDF, XPS, etc.), provided the font in use has a glyph for it.
The string renders as expected in a TextBox under Windows Forms; in the original screenshot the font was changed to Cambria, 11.25pt, to improve legibility.
Edit: If you want to convert all numerals (not just 2 into ₂), you could use @Tobias's code. Here is a regex adaptation of it. I've included a lookbehind since I assume that all numerals to be subscripted must be preceded by a letter.
// [0-9] rather than \d: in .NET, \d also matches non-ASCII digits, which the arithmetic below would mangle
text = Regex.Replace(text, @"(?<=[A-Za-z])[0-9]",
    match => ((char)(match.Value[0] - '0' + '₀')).ToString());
The above would transform a string like
CF3CH2Cl + Br2 → CF3CHBrCl + HBr
into
CF₃CH₂Cl + Br₂ → CF₃CHBrCl + HBr

Related

Align Non-English Text To Left

I'm printing an order in a receipt-like format to a RichTextBox. Everything works fine when the item name is in English, but once the item name is in a right-to-left language such as Hebrew or Arabic, the overall format becomes messy.
Example when all text is in English:
1...5...10...15...20...25...30...35...40...45.48
ITM                          Price   QTY   Value
------------------------------------------------
Test                         6,000   x1    6,000
test02                       0       x1        0
test03                       0       x1        0
As you can see, everything is tidy and well formatted, but when an item has a Hebrew or Arabic name, this is what happens:
1...5...10...15...20...25...30...35...40...45.48
ITM                          Price   QTY   Value
------------------------------------------------
Test                         6,000   x1    6,000
1,500 تيست x1 1,500
As you can see, the non-English text shifts under the Price column. As I mentioned, this happens only with languages written from right to left.
My code which does the formatting
int Item_Length = -29;
int Price_Length = -8;
int Qty_Length = -3;
int Value_Length = 8;
string Seperator = "------------------------------------------------"+"\n";
string ruler = "1...5...10...15...20...25...30...35...40...45.48"+"\n";
rTxtReceipt.Text = ruler;
string Headers = string.Format("{0,"+Item_Length+"}{1,"+Price_Length+"}{2,"+Qty_Length+"}{3,"+Value_Length+"}", "ITM", "Price", "QTY", "Value")+"\n";
rTxtReceipt.AppendText(Headers);
rTxtReceipt.AppendText(Seperator);
string Rows = null;
foreach (var item in Items_List)
{
Rows += string.Format("{0,"+Item_Length+"}{1," + Price_Length + ":N0}{2," + Qty_Length + "}{3," + Value_Length + ":N0}", item.ItemName, item.ItemSellPrice, ("x" + item.SellsQty), item.SellsValue) + "\n";
}
rTxtReceipt.AppendText(Rows);
Where rTxtReceipt is a RichTextBox Control.
Can anyone advise how to make all the text align from left to right, regardless of the language?
I do have a function that detects whether text is in English or not, but I don't know what to change when the text is not in English.
public bool IsEnglish(string inputstring)
{
    Regex regex = new Regex(@"[A-Za-z0-9 .,-=+(){}\[\]\\]");
    MatchCollection matches = regex.Matches(inputstring);
    // English only if every single character matched the class above
    return matches.Count == inputstring.Length;
}
The problem has to do with how RTL characters behave when surrounded by numbers in bidirectional text. For more information, you may read this article:
Right-to-left language support and bidirectional text
You wouldn't have faced this problem if, say, instead of the price immediately following the item name, you had a Latin letter: تيست x1,500.
Well obviously, that's not what you want. So, to force the direction of text to remain LTR and prevent the numbers from messing it up, you simply add a "hidden character" immediately after the item name. Fortunately, there's a specific character that is used for this purpose. It's called a Left-to-Right Mark.
First, add the following constant somewhere appropriate:
const string LtrMark = "\u200E";
And then you can simply do something like this: *
Rows += string.Format(format, item.ItemName + LtrMark, item.ItemSellPrice, ...
// ^^^^^^^
Now, you don't have to worry about checking whether the item name is in English or not. Simply inserting that character will work for both LTR and RTL languages. Do note, however, that you need to use a fixed-width font that works for both English and Arabic (Courier New, for example).
* Because the LTR mark is a zero-width character, it will make the following values appear one character behind. To avoid this, you may use Item_Length - 1 for the items or Item_Length + 1 for the headers to make sure they're aligned.
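Putting the mark and the width adjustment from the footnote together, a row formatter could look roughly like the sketch below. The class name ReceiptFormatting, the method FormatRow, and the decimal/int parameter types are assumptions for illustration; the question does not show the types of the item properties.

```csharp
using System;
using System.Globalization;

public static class ReceiptFormatting
{
    // U+200E LEFT-TO-RIGHT MARK: invisible, but strongly LTR for the bidi algorithm.
    public const string LtrMark = "\u200E";

    // Column widths from the question; a negative width means left-aligned.
    // The item column is one wider (-30 instead of -29) to make up for the
    // zero-width mark, which string.Format still counts as a character.
    private const string RowFormat = "{0,-30}{1,-8:N0}{2,-3}{3,8:N0}";

    public static string FormatRow(string itemName, decimal price, int qty, decimal value)
    {
        return string.Format(CultureInfo.InvariantCulture, RowFormat,
            itemName + LtrMark, price, "x" + qty, value);
    }
}
```

Each row built this way has 49 characters in memory but occupies 48 columns on screen, so it lines up with the unchanged header and separator.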

How to compare Unicode characters that "look alike"?

I ran into a surprising issue.
I loaded a text file in my application and I have some logic that compares values containing µ.
And I realized that even if the texts look the same, the comparison returns false.
Console.WriteLine("μ".Equals("µ")); // returns false
Console.WriteLine("µ".Equals("µ")); // return true
In the second line the character µ is copy-pasted.
However, these might not be the only characters like this.
Is there any way in C# to compare characters which look the same but are actually different?
Because they really are different symbols even though they look the same: the first is the actual Greek letter and has char code 956 (0x3BC), and the second is the micro sign and has char code 181 (0xB5).
References:
Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)
So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:
public static void Main()
{
var s1 = "μ";
var s2 = "µ";
Console.WriteLine(s1.Equals(s2)); // false
Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true
}
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormKC);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System;
using System.Text;
class Program
{
static void Main(string[] args)
{
char first = 'μ';
char second = 'µ';
// Technically you only need to normalize U+00B5 to obtain U+03BC, but
// if you're unsure which character is which, you can safely normalize both
string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
Console.WriteLine(first.Equals(second)); // False
Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
}
}
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
They have different character codes:
Console.WriteLine((int)'μ'); //956
Console.WriteLine((int)'µ'); //181
Where:
Display  Friendly Code  Decimal Code  Hex Code  Description
====================================================================
μ        &mu;           &#956;        &#x3BC;   Lowercase Mu
µ        &micro;        &#181;        &#xB5;    micro sign Mu
For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.
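Both points can be illustrated with a small helper that compares strings after NFKC normalization. This is only a sketch; the names CompatComparer and CompatEquals are made up for the example.

```csharp
using System;
using System.Text;

public static class CompatComparer
{
    // Ordinal comparison after NFKC normalization of both operands.
    public static bool CompatEquals(string a, string b)
    {
        if (a == null || b == null) return a == b;
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }
}
```

CompatEquals("\u03BC", "\u00B5") is true because the micro sign decomposes to mu, whereas CompatEquals("A", "\u0391") (Latin A versus Greek capital alpha) stays false, since no normalization form relates those two letters.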
Search both characters in a Unicode database and see the difference.
One is the Greek small letter μ and the other is the micro sign µ.
Name : MICRO SIGN
Block : Latin-1 Supplement
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Decomposition : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror : N
Index entries : MICRO SIGN
Upper case : U+039C
Title case : U+039C
Version : Unicode 1.1.0 (June, 1993)
Name : GREEK SMALL LETTER MU
Block : Greek and Coptic
Category : Letter, Lowercase [Ll]
Combine : 0
BIDI : Left-to-Right [L]
Mirror : N
Upper case : U+039C
Title case : U+039C
See Also : micro sign U+00B5
Version : Unicode 1.1.0 (June, 1993)
EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:
"μ".ToUpper().Equals("µ".ToUpper()); //This always return true.
EDIT
After reading the comments: yes, it is not good to use the above method, because it may give wrong results for some other inputs. For this we should normalize using full compatibility decomposition, as mentioned in the wiki. (Thanks to the answer posted by BoltClock.)
static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
static string MICRO_SIGN = new String(new char[] { '\u00B5' });
public static void Main()
{
string Mus = "µμ";
string NormalizedString = null;
int i = 0;
do
{
string OriginalUnicodeString = Mus[i].ToString();
if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
Console.WriteLine(" INFORMATIO ABOUT GREEK_SMALL_LETTER_MU");
else if (OriginalUnicodeString.Equals(MICRO_SIGN))
Console.WriteLine(" INFORMATIO ABOUT MICRO_SIGN");
Console.WriteLine();
ShowHexaDecimal(OriginalUnicodeString);
Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
Console.Write("Form C Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
Console.Write("Form D Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
Console.Write("Form KC Normalized: ");
ShowHexaDecimal(NormalizedString);
NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
Console.Write("Form KD Normalized: ");
ShowHexaDecimal(NormalizedString);
Console.WriteLine("_______________________________________________________________");
i++;
} while (i < 2);
Console.ReadLine();
}
private static void ShowHexaDecimal(string UnicodeString)
{
Console.Write("Hexa-Decimal Characters of " + UnicodeString + " are ");
foreach (short x in UnicodeString.ToCharArray())
{
Console.Write("{0:X4} ", x);
}
Console.WriteLine();
}
Output
INFORMATIO ABOUT MICRO_SIGN
Hexa-Decimal Characters of µ are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ are 00B5
Form D Normalized: Hexa-Decimal Characters of µ are 00B5
Form KC Normalized: Hexa-Decimal Characters of μ are 03BC
Form KD Normalized: Hexa-Decimal Characters of μ are 03BC
________________________________________________________________
INFORMATIO ABOUT GREEK_SMALL_LETTER_MU
Hexa-Decimal Characters of μ are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of μ are 03BC
Form D Normalized: Hexa-Decimal Characters of μ are 03BC
Form KC Normalized: Hexa-Decimal Characters of μ are 03BC
Form KD Normalized: Hexa-Decimal Characters of μ are 03BC
________________________________________________________________
While reading information in Unicode_equivalence I found
The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
So to compare for compatibility equivalence we should normally use FormKC (NFKC normalization) or FormKD (NFKD normalization).
I was a little curious to know more about all the Unicode characters, so I wrote a sample that iterates over every UTF-16 code unit, and I got some results I want to discuss.
Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
All the characters whose FormC and FormD normalized values were not equivalent also had non-equivalent FormKC and FormKD normalized values, except these characters:
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞', 8159 '῟', 8173 '῭', 8174 '΅'
Extra characters whose FormKC and FormKD normalized values were not equivalent, but whose FormC and FormD normalized values were equivalent
Total: 119
Characters: 452 'DŽ' 453 'Dž' 454 'dž' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒'
12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚'
12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱'
12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
There are some characters which cannot be normalized; they throw ArgumentException if you try
Total:2081
Characters(int value): 55296-57343, 64976-65007, 65534
These links can be really helpful for understanding the rules that govern Unicode equivalence:
Unicode_equivalence
Unicode_compatibility_characters
Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.
You ask "how to compare them" but you don't tell us what you want to do.
There are at least two main ways to compare them:
Either you compare them directly as they are, and they are different,
or you use Unicode compatibility normalization if you need a comparison that finds them to match.
There could be a problem, though, because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike, you should roll your own normalization or comparison functions.
For a more specific solution we need to know your specific problem. What is the context in which you came across this problem?
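A "roll your own" comparison that treats only the micro sign and mu as alike, and nothing else, might look like the sketch below. The class LookalikeFolding and its one-entry table are hypothetical; you would extend the table with whatever pairs your application needs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LookalikeFolding
{
    // Hypothetical table: add the pairs your application must treat as equal.
    private static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '\u00B5', '\u03BC' }, // MICRO SIGN -> GREEK SMALL LETTER MU
    };

    // Replaces each mapped character; all other characters pass through untouched.
    public static string Fold(string text)
    {
        if (text == null) return null;
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char mapped;
            sb.Append(Map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }

    public static bool FoldedEquals(string a, string b)
    {
        return string.Equals(Fold(a), Fold(b), StringComparison.Ordinal);
    }
}
```

Unlike compatibility normalization, this leaves ligatures, full-width forms and the like alone, so only the pairs listed in the table ever compare equal.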
If I wanted to be pedantic, I would say that your question doesn't make sense, but since we are approaching Christmas and the birds are singing, I'll proceed with this.
First off, the two entities that you are trying to compare are glyphs. A glyph is part of a set of glyphs provided by what is usually known as a "font", the thing that usually comes in a ttf, otf, or whatever file format you are using.
Glyphs are a representation of a given symbol, and since that representation depends on a specific set, you can't just expect to have two similar, or even identical, symbols. The phrase doesn't make sense unless you consider the context; you should at least specify which font or set of glyphs you are considering when you formulate a question like this.
What is usually used to solve a problem similar to the one you are encountering is OCR, essentially software that recognizes and compares glyphs. I don't know whether C# provides OCR by default, but it's generally a really bad idea unless you really need OCR and know what to do with it.
You could end up interpreting a physics book as an ancient Greek book, not to mention that OCR is generally expensive in terms of resources.
There is a reason why those characters are localized the way they are localized; just don't do that.
It's possible to draw both chars with the same font style and size using the DrawString method. After the two bitmaps with the symbols have been generated, you can compare them pixel by pixel.
The advantage of this method is that you can compare not only exactly equal characters, but similar ones too (within a definite tolerance).

How to insert a Symbol (Pound, Euro, Copyright) into a Textbox

I can use the Alt Key with the Number Pad to type symbols, but how do I programmatically insert a Symbol (Pound, Euro, Copyright) into a Textbox?
I have a configuration screen so I need to dynamically create the \uXXXX's.
In C#, the Unicode character literal \uXXXX where the X's are hex characters will let you specify Unicode characters. For example:
\u00A3 is the Pound sign, £.
\u20AC is the Euro sign, €.
\u00A9 is the copyright symbol, ©.
You can use these Unicode character literals just like any other character in a string.
For example, "15 \u00A3 per item" would be the string "15 £ per item".
You can put such a string in a textbox just like you would with any other string.
Note: You can also just copy (Ctrl+C) a symbol off of a website, like Wikipedia (Pound sign), and then paste (Ctrl+V) it directly into a string literal in your C# source code file. C# source code files use Unicode natively. This approach completely relieves you from ever even having to know the four hex digits for the symbol you want.
To parallel the example above, you could make the same string literal as simply "15 £ per item".
Edit: If you want to dynamically create the Unicode character from its hex string, you can use this:
public static char HexToChar(string hex)
{
return (char)ushort.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
For example, HexToChar("20AC") will get you the Euro sign.
If you want to do the opposite operation dynamically:
public static string CharToHex(char c)
{
return ((ushort)c).ToString("X4");
}
For example CharToHex('€') will get you "20AC".
The choice of ushort corresponds to the range of possible char values, shown here.
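Since the hex digits come from a configuration screen, it may be worth rejecting bad input instead of letting Parse throw. The following is a minimal sketch under that assumption; SymbolParsing and TryHexToString are illustrative names, and it returns a string rather than a char so that code points above U+FFFF also work.

```csharp
using System;
using System.Globalization;

public static class SymbolParsing
{
    // Returns false instead of throwing when the text is not a valid code point.
    public static bool TryHexToString(string hex, out string symbol)
    {
        symbol = null;
        int codePoint;
        if (!int.TryParse(hex, NumberStyles.AllowHexSpecifier,
                CultureInfo.InvariantCulture, out codePoint))
            return false;
        // Reject values outside Unicode and the surrogate range,
        // both of which char.ConvertFromUtf32 refuses.
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            return false;
        symbol = char.ConvertFromUtf32(codePoint);
        return true;
    }
}
```

With this, "20AC" produces the Euro sign, while input such as "xyz" or "D800" simply returns false and can be reported back to the user.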
I can't believe this was difficult to find on the internet!
For future developers: if you have the Unicode code point, it's easy to do, e.g.:
C#:
var selectionIndex = txt.SelectionStart;
string copyrightUnicode = "00A9";
int value = int.Parse(copyrightUnicode, System.Globalization.NumberStyles.HexNumber);
string symbol = char.ConvertFromUtf32(value).ToString();
txt.Text = txt.Text.Insert(selectionIndex, symbol);
txt.SelectionStart = selectionIndex + symbol.Length;
VB.Net
Dim selectionIndex = txt.SelectionStart
Dim copyrightUnicode As String = "00A9"
Dim value As Integer = Integer.Parse(copyrightUnicode, System.Globalization.NumberStyles.HexNumber)
Dim symbol As String = Char.ConvertFromUtf32(value).ToString()
txt.Text = txt.Text.Insert(selectionIndex, symbol)
txt.SelectionStart = selectionIndex + symbol.Length

How to fix encoding, im getting ascii values of 63 when it should be regular white spaces

In my C# code, I am extracting text from a PDF, but the text it gives back has some weird characters. If I search for "CLE action", which I know is in the PDF document, I get false; I found out that after extraction the space between the two words has an ASCII byte value of 63.
Is there a quick way to fix the encoding on the text?
Currently I am using this method, but I think it's slow and only works for that one character. Is there some fast method that works for all characters?
public static string fix_encoding(string src)
{
StringWriter return_str = new StringWriter();
byte[] byte_array = Encoding.ASCII.GetBytes(src.Substring(0, src.Length));
int len = byte_array.Length;
byte byt;
for(var i=0; i<len; i+=1)
{
byt = byte_array[i];
if (byt == 63)
{
return_str.Write(" ");
}
else
{
return_str.Write(Encoding.ASCII.GetString(byte_array, i, 1));
}
}
return return_str.ToString();
}
This is how I call this method:
StringWriter output = new StringWriter();
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy()));
currentText = fix_encoding(output.ToString());
It is possible that the spaces you extract from the PDF file are not regular spaces (" "), but other kinds of spaces defined in Unicode, for example an em space or a no-break space.
If the extracted text contains such a space, and you search the text for a normal space, you won't find it, because it is not identical.
Your fix_encoding function converts the string to ASCII. None of the unusual kinds of spaces exist in ASCII, and by default non-ASCII characters are converted to a question mark (byte value 63). So in your fix_encoding function you see a question mark, even though the original text has a different character.
This means that in your fix_encoding function you should not convert to ASCII, but replace unusual spaces with a normal space. The following function replaces every non-ASCII character with a space, but you could also use Char.IsWhiteSpace to determine which characters to replace with a normal space.
public static string remove_non_ascii(string src)
{
return Regex.Replace(src, @"[^\u0000-\u007F]", " ");
}
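The Char.IsWhiteSpace variant mentioned above could be sketched as follows (SpaceNormalizer and NormalizeSpaces are illustrative names). It keeps accented letters and other non-ASCII text intact and only touches whitespace. Note that it also turns tabs and line breaks into spaces, and that zero-width characters such as U+200B are not whitespace according to Char.IsWhiteSpace, so they pass through unchanged.

```csharp
using System;
using System.Text;

public static class SpaceNormalizer
{
    // Replaces every character that char.IsWhiteSpace accepts
    // (no-break space, em space, tab, ...) with a plain ASCII space.
    public static string NormalizeSpaces(string src)
    {
        if (src == null) return null;
        var sb = new StringBuilder(src.Length);
        foreach (char c in src)
        {
            sb.Append(char.IsWhiteSpace(c) ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

After this step, SpaceNormalizer.NormalizeSpaces(currentText).Contains("CLE action") should succeed even when the PDF used a no-break space between the two words.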

Remove formatting from a string: "(123) 456-7890" => "1234567890"?

I have a string when a telephone number is inputted - there is a mask so it always looks like "(123) 456-7890" - I'd like to take the formatting out before saving it to the DB.
How can I do that?
One possibility using linq is:
string justDigits = new string(s.Where(c => char.IsDigit(c)).ToArray());
Adding the cleaner/shorter version, thanks to craigmoliver:
string justDigits = new string(s.Where(char.IsDigit).ToArray());
You can use a regular expression to remove all non-digit characters:
string phoneNumber = "(123) 456-7890";
phoneNumber = Regex.Replace(phoneNumber, @"[^\d]", "");
Then further on - depending on your requirements - you can either store the number as a string or as an integer. To convert the number to an integer type you will have the following options:
// throws if phoneNumber is null or cannot be parsed
long number = Int64.Parse(phoneNumber, NumberStyles.Integer, CultureInfo.InvariantCulture);
// same as Int64.Parse, but returns 0 if phoneNumber is null
number = Convert.ToInt64(phoneNumber);
// does not throw, but returns true on success
if (Int64.TryParse(phoneNumber, NumberStyles.Integer,
CultureInfo.InvariantCulture, out number))
{
// parse was successful
}
Since nobody did a for loop.
long GetPhoneNumber(string PhoneNumberText)
{
// Returns 0 on error
StringBuilder TempPhoneNumber = new StringBuilder(PhoneNumberText.Length);
for (int i=0;i<PhoneNumberText.Length;i++)
{
if (!char.IsDigit(PhoneNumberText[i]))
continue;
TempPhoneNumber.Append(PhoneNumberText[i]);
}
PhoneNumberText = TempPhoneNumber.ToString();
if (PhoneNumberText.Length == 0)
return 0;// No point trying to parse nothing
long PhoneNumber = 0;
if(!long.TryParse(PhoneNumberText,out PhoneNumber))
return 0; // Failed to parse string
return PhoneNumber;
}
used like this:
long phoneNumber = GetPhoneNumber("(123) 456-7890");
Update
As pr commented, many countries have zeros at the beginning of the number. If you need to support that, you have to return a string, not a long. To change my code to do that:
1) Change the function's return type from long to string.
2) Make the function return null instead of 0 on error.
3) On a successful parse, make it return PhoneNumberText.
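Applied to the loop above, those three changes might give something like the sketch below. The names PhoneNumbers and GetPhoneDigits are illustrative, and one deliberate deviation is labeled in the code: it checks for ASCII digits with a range test instead of char.IsDigit, which would also accept digits from other scripts.

```csharp
using System;
using System.Text;

public static class PhoneNumbers
{
    // Returns the digits as a string (leading zeros preserved),
    // or null when the input contains no digits at all.
    public static string GetPhoneDigits(string phoneNumberText)
    {
        if (string.IsNullOrEmpty(phoneNumberText)) return null;
        var digits = new StringBuilder(phoneNumberText.Length);
        foreach (char c in phoneNumberText)
        {
            // Assumption: only ASCII digits are wanted, hence no char.IsDigit here.
            if (c >= '0' && c <= '9') digits.Append(c);
        }
        return digits.Length == 0 ? null : digits.ToString();
    }
}
```

Since nothing is parsed to a number any more, there is no TryParse step and no overflow to worry about.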
You can make it work for that number with the addition of a simple regex replacement, but I'd look out for higher initial digits. For example, (876) 543-2019 will overflow an integer variable.
string digits = Regex.Replace(formatted, @"\D", String.Empty, RegexOptions.Compiled);
Aside from all of the other correct answers, storing phone numbers as integers or otherwise stripping out formatting might be a bad idea.
Here are a couple of considerations:
Users may provide international phone numbers that don't fit your expectations, so the usual groupings for standard US numbers wouldn't fit.
Users may NEED to provide an extension, e.g. (555) 555-5555 ext#343. The # key is actually on the dialer/phone, but can't be encoded in an integer. Users may also need to supply the * key.
Some devices allow you to insert pauses (usually with the character P), which may be necessary for extensions or menu systems, or dialing into certain phone systems (eg, overseas). These also can't be encoded as integers.
[EDIT]
It might be a good idea to store both an integer version and a string version in the database. Also, when storing strings, you could reduce all punctuation to whitespace using one of the methods noted above. A regular expression for this might be:
// (222) 222-2222 ext# 333 -> 222 222 2222 # 333
phoneString = Regex.Replace(phoneString, @"[^\d#*P]", " ");
// (222) 222-2222 ext# 333 -> 2222222222333 (information lost)
phoneNumber = Regex.Replace(phoneString, @"[^\d]", "");
// you could try to avoid losing "ext" strings as in (222) 222-2222 ext.333 thus:
phoneString = Regex.Replace(phoneString, @"ex\w+", "#");
phoneString = Regex.Replace(phoneString, @"[^\d#*P]", " ");
Try this:
string s = "(123) 456-7890";
UInt64 i = UInt64.Parse(
s.Replace("(","")
.Replace(")","")
.Replace(" ","")
.Replace("-",""));
You should be safe with this since the input is masked.
You could use a regular expression or you could loop over each character and use char.IsNumber function.
You would be better off using regular expressions. An int by definition is just a number, but you desire the formatting characters to make it a phone number, which is a string.
There are numerous posts about phone number validation, see A comprehensive regex for phone number validation for starters.
As many answers already mention, you need to strip out the non-digit characters first before trying to parse the number. You can do this using a regular expression.
Regex.Replace("(123) 456-7890", @"\D", String.Empty) // "1234567890"
However, note that the largest positive value int can hold is 2,147,483,647 so any number with an area code greater than 214 would cause an overflow. You're better off using long in this situation.
Leading zeros won't be a problem for North American numbers, as area codes cannot start with a zero or a one.
Alternative using Linq:
string phoneNumber = "(403) 259-7898";
var phoneStr = new string(phoneNumber.Where(i=> i >= 48 && i <= 57).ToArray());
This is basically a special case of "C#: Removing common invalid characters from a string: improve this algorithm", where your formatting characters, including white space, are treated as "bad characters".
'You can put this in a module, or inside the main form, in VB.NET
Public Function ClearFormat(ByVal Strinput As String) As String
    Dim hasil As String = ""
    Dim Hrf As Char
    For i = 0 To Strinput.Length - 1
        Hrf = Strinput.Substring(i, 1)
        If IsNumeric(Hrf) Then
            hasil &= Hrf
        End If
    Next
    'Return the collected digits, not the unmodified input
    Return hasil
End Function
'You can call this function like this:
' Phone = ClearFormat(Phone)
public static string DigitsOnly(this string phoneNumber)
{
return new string(
new[]
{
// phoneNumber[0], (
phoneNumber[1], // 6
phoneNumber[2], // 1
phoneNumber[3], // 7
// phoneNumber[4], )
// phoneNumber[5],
phoneNumber[6], // 8
phoneNumber[7], // 6
phoneNumber[8], // 7
// phoneNumber[9], -
phoneNumber[10], // 5
phoneNumber[11], // 3
phoneNumber[12], // 0
phoneNumber[13] // 9
});
}
