Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need a regex pattern which can detect if the given text is in English or not, but I want to include the following:
Allowing spaces
Allowing numbers and words
Allowing multiple lines and tabs
Allowing all special characters !@#$%^&*()_-+={}|/<>~`':";[]
Allowing URLs, emails
If the given text contains any character other than English ones, it should be considered non-English text. This applies if the text contains Arabic letters/words like "ا ب ت ... etc.", French letters like "é, â ... etc.", and likewise for all other languages
In brief, I need to know whether the given text, any text in any format, is in English or not. I tried a lot of patterns but couldn't get it to work, and I can't use a language detector because the application will be used offline.
Samples of the texts which should not be accepted:
Hello! ... é
مرحبا بك
للتحميل اضغط هنا ... http://www.google.com
So, if the text contains any non-English letter, it should be considered non-English text.
I think I found it. I tried the Basic Latin Unicode block, and it works fine so far. I used:
"^[\u0000-\u007F]+$"
The idea is to check that the given text is written using English letters only, while still allowing special characters. So a text like "I met my friend in a café" is considered non-English, because the text must contain only English letters and no others, even in a name, a place, etc. This is exactly what I need.
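As a minimal sketch of how that pattern could be wrapped in C#: the class and method names below are hypothetical, and the null check is an added assumption rather than part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class BasicLatinCheck
{
    // Same pattern as above: one or more characters, all within U+0000..U+007F.
    private static readonly Regex BasicLatinOnly = new Regex(@"^[\u0000-\u007F]+$");

    // Hypothetical helper: true only for non-empty text made of Basic Latin characters.
    public static bool IsBasicLatinOnly(string text)
    {
        return text != null && BasicLatinOnly.IsMatch(text);
    }
}
```

Because of the `+` quantifier, an empty string is rejected as well; swap it for `*` if empty input should pass.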
Thank you all.
Resources:
http://kourge.net/projects/regexp-unicode-block
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
Regular expression to match non-English characters?
In theory it would be possible, but only if the regex contained every word in the English dictionary.
You can create a regex that detects non-English characters. That will detect text that is definitely not English, but it won't be able to confirm that a text definitely is English.
This should work:
@"[^\t\w\d\s$-/:-?{-~!""^_`\[\]]+"
If there is a match, the text contains non-English letters or characters.
BTW, you are only testing whether the text is limited to characters that an English-speaking person would normally use, NOT which language it is in.
To detect a language you need something like Natural Language Processing, NOT regex.
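One caveat when running this negated class in .NET: by default `\w`, `\d` and `\s` also match non-ASCII letters, digits and whitespace, so accented or Arabic letters are not flagged. The sketch below assumes `RegexOptions.ECMAScript` is acceptable for your use, which restricts those classes to ASCII. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NonEnglishDetector
{
    // The negated class from the answer above, with the quote doubled for a verbatim string.
    private const string Pattern = @"[^\t\w\d\s$-/:-?{-~!""^_`\[\]]+";

    // ECMAScript mode makes \w, \d and \s ASCII-only, so any other letter counts as a match.
    public static bool HasNonEnglish(string text)
    {
        return Regex.IsMatch(text, Pattern, RegexOptions.ECMAScript);
    }

    // Default mode, shown for comparison: Unicode letters satisfy \w and go undetected.
    public static bool HasNonEnglishDefault(string text)
    {
        return Regex.IsMatch(text, Pattern);
    }
}
```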
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 years ago.
I have a silly problem with an ASP.NET project that I have been working on for more than 5 years.
Today the Trim() function suddenly stopped working.
Notes:
I updated the project framework from 4.7.2 to 4.8 and the problem still happens.
I tried TrimStart() and TrimEnd(); they have the same problem as Trim().
Replace() still works fine, but using it everywhere is not a good solution for me.
Trim() works fine in new projects.
I also checked the char code of the space: it is 32.
Any idea?
This is the code:
string val = "abo "; // "abo\x0020\x0020\x0020\x0020\x200e"
string userName = val.Trim();
You can reproduce the problem at this link.
Update:
Thank you all for the comments. I also found a simple way to check the end of the string: put the cursor at the end and press Backspace. The first press appears to do nothing and only the second press starts deleting, because of the \x200e character at the end.
Any idea how to trim hidden characters from the left and right and treat them just like spaces?
String.Trim works. If it didn't, hundreds of thousands of developers would have noticed 16 years ago.
The string ends with a formatting character, specifically \x200e, the Left-to-Right Mark. That is definitely not whitespace. Calling Char.GetUnicodeCategory('\u200e') returns Format. I suspect the input came from mixed Arabic and Latin text, perhaps something copied by a user from a longer string?
One way to handle this is to use String.Trim(char[]), specifying the LTR mark along with other characters. That's not quite the same as String.Trim() though, which removes any character for which Char.IsWhiteSpace() returns true:
var userName = val.Trim(' ', '\t', '\r', '\n', '\u200e');
Another option would be to use a regular expression that trims both whitespace (\s) and characters in the Format Unicode category (Cf), but only at the start (^[\p{Cf}\s]+) or the end ([\p{Cf}\s]+$):
string userName = Regex.Replace(val, @"(^[\p{Cf}\s]+)|([\p{Cf}\s]+$)", "");
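The two trimming ideas above could be packaged roughly as follows. This is only a sketch: `HiddenCharTrimmer` and its method names are hypothetical, and the regex-free variant is an added alternative that treats Format (Cf) characters the same way as whitespace.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class HiddenCharTrimmer
{
    // Runs of whitespace or Format (Cf) characters at the start or the end of the string.
    private static readonly Regex EdgeJunk =
        new Regex(@"^[\p{Cf}\s]+|[\p{Cf}\s]+$", RegexOptions.Compiled);

    // Regex variant: removes the leading and trailing runs, leaves the middle untouched.
    public static string TrimWithRegex(string value)
    {
        return EdgeJunk.Replace(value, "");
    }

    // Regex-free variant: walk inward from both ends until a "real" character is found.
    public static string TrimManually(string value)
    {
        int start = 0;
        int end = value.Length - 1;
        while (start <= end && IsJunk(value[start])) start++;
        while (end >= start && IsJunk(value[end])) end--;
        return value.Substring(start, end - start + 1);
    }

    private static bool IsJunk(char c)
    {
        return char.IsWhiteSpace(c)
            || char.GetUnicodeCategory(c) == UnicodeCategory.Format;
    }
}
```

Both methods only touch the ends of the string, so a Left-to-Right Mark in the middle of a value is preserved.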
Perhaps a better option would be to prevent unexpected characters using input validation, and require that the input TextBox contains only letter or letter and digit characters. After all, the user could paste some other unexpected non-printable character. It's better to warn the user than try to handle all possible bad data.
Usernames are typically letter and number combinations without whitespace. All ASP.NET stacks allow validation. Modern browsers allow regular expression validation directly in the input element, so we could come up with a regex that allows only valid characters, e.g.:
<input type="text" required pattern="[A-Za-z0-9]+" ..../>
The Unicode general categories L (letters) and N (numbers) could be used to match letters and numbers in any language, e.g. [\p{L}\p{N}]+, just as Cf is used to match format characters.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
Having a hard time getting my regex to work correctly. Essentially, all I need is a valid number regex that just allows for one comma. Here's what I have tried:
[0-9]*[,]\\d
(This was when I thought I might have a number with multiple commas, not the case anymore)
[0-9][,]\\d
and
^\d+(?:[\,]\d+)?$ (http://regexr.com/3ggn5)
The latter seemed to work best; however, when I input 1,23134 it doesn't break the rule. How can I improve it so that an invalid number such as 1,23232 is rejected, while a valid number such as 1,232 is accepted?
UPDATE
This is the code surrounding, just using a RegularExpression annotation:
[RegularExpression(@"^\d+(?:[\,]\d+)?$", ErrorMessage = ...]
UPDATE 2
By valid number I simply mean a number that is correctly formatted to United States standards. Example of valid numbers:
1
10
100
1,000
1000
10,000
10000
100,000
100000
..etc
In the United States, a number either has no commas at all or has a comma before every group of three digits, counting from the right (the leading group may be shorter, so 1,000 is valid). If a number uses commas, it typically uses them at every third digit, so I would assume a number like 1,00000000 isn't valid.
Examples of invalid numbers:
1,1
1,00
12,12
Basically, if anywhere else in the world uses a comma in a place that isn't after the third digit, that would be invalid for what I need. I simply need numbers that may or may not have commas.
This regex accepts numbers in several valid formats:
^-?(\d+|\d{1,3}(?:,\d{3})+)?(\.\d+)?$
It rejects numbers with too many digits after a comma
It rejects wrong dot notation
Numbers with no comma will pass
If you need neither negative numbers nor decimals, you can simplify it:
^(?:\d+|\d{1,3}(?:,\d{3})+)$
And if you don't want numbers without a comma either (e.g. 1345):
^\d{1,3}(?:,\d{3})+$
P.S.: For users coming from the non-English-speaking world, you can replace the comma with a space in all of these regexes, and they will work the same way.
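To check the simplified pattern against the valid and invalid samples from the question, a small helper along these lines could be used. The class and method names are hypothetical; the pattern is the second one above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class UsNumberFormat
{
    // Either plain digits, or 1-3 leading digits followed by one or more ",ddd" groups.
    private static readonly Regex Grouped =
        new Regex(@"^(?:\d+|\d{1,3}(?:,\d{3})+)$");

    // Hypothetical helper: true when the input is a whole number with valid comma grouping.
    public static bool IsValid(string input)
    {
        return input != null && Grouped.IsMatch(input);
    }
}
```

With this helper, 1, 1000, 1,000 and 100,000 are accepted, while 1,1, 1,00, 12,12, 1,23134 and 1,00000000 are rejected.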
This question already has answers here:
Sending a string containing special characters through a TcpClient (byte[])
(3 answers)
Closed 5 years ago.
I want to encode and then decode a string that contains multilingual characters, where the language, length and character positions (e.g. Chinese characters at indexes 8-10) are unknown.
Is it even possible to have a "universal" encoder? Or some algorithm that knows how to decode this?
Searching the web turned up only solutions that require knowing where the special characters are and which language they belong to, and I can't even know the language itself.
Any ideas?
EDIT:
Example: a string that consists of several languages, such as:
"Hello {CHINESE} my {LATIN} is rusted"
which consists of English, Chinese, and Latin.
But when I do
var test = ASCIIEncoding.ASCII.GetBytes(someStr);
and then
ASCIIEncoding.ASCII.GetString(test)
the "special characters" (i.e., non-English characters) are converted to question marks.
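Don't use ASCII encoding: it only covers 128 characters, so it can't represent text that mixes languages, and every character outside that range is replaced with a question mark.
Use Unicode instead:
var test = Encoding.Unicode.GetBytes(someStr);   // UTF-16 bytes, every character preserved
var test1 = Encoding.Unicode.GetString(test);    // decodes back to the original string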
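A quick way to see the difference is to round-trip the same string through several encodings. The helper below is a hypothetical sketch. `Encoding.UTF8` is included as an assumption on my part, since it also covers every Unicode character and is common for network and file data.

```csharp
using System.Text;

public static class RoundTripDemo
{
    // Encodes the text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    // True when the encoding can represent every character of the text.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        return RoundTrip(text, encoding) == text;
    }
}
```

For a mixed string such as "Hello \u4f60\u597d", `Encoding.Unicode` and `Encoding.UTF8` both return the original text, while `Encoding.ASCII` returns "Hello ??".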
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Good Afternoon,
I would like a regex to validate if the user typed two names or more and not only the first name, as in this example:
Ex: Fabrício (no match)
Ex:
Fabrício Oliveira (match)
Fabrício Oliveira Xavier (match)
Note: The expression must accept accented characters
Here's a Regex that only relies on a whitespace separator:
^\S+(\s\S+)+$
It makes these assumptions:
No name has a space, tab, or newline in it
Names are separated by exactly one space, tab, or newline
* this is based on @juharr's comment but with the parentheses to allow more than two names.
Edit: You can play around with this Regex here https://regex101.com/r/nS3hN8/1
Edit2: Added the beginning and ending anchors to the regex
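Here is a short sketch of how the anchored pattern could be used from C#. The class and method names are hypothetical. Since `\S` matches any non-whitespace character, accented letters need no special handling.

```csharp
using System.Text.RegularExpressions;

public static class FullNameCheck
{
    // One name, followed by one or more groups of "single whitespace + name".
    private static readonly Regex TwoOrMoreNames = new Regex(@"^\S+(\s\S+)+$");

    // Hypothetical helper: true when the input contains at least two names.
    public static bool HasAtLeastTwoNames(string input)
    {
        return input != null && TwoOrMoreNames.IsMatch(input);
    }
}
```

Under the assumptions listed above, "Fabrício" does not match, while "Fabrício Oliveira" and "Fabrício Oliveira Xavier" do. Two consecutive spaces between names cause a mismatch.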
Try this regular expression, which matches two words of any length separated by a space:
new Regex(@"\w+ \w+");
Really, you could just use this:
if (Regex.Match(stringname, @"\w+\s\w+").Success)
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
I need to extract all the hashtags (#hashtag), mentions (@user) and links from a string.
Right now I'm using this one:
@"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|@|#|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
But it doesn't recognize usernames that start with _, like "@_me", and links like this one (https://blogs.windows.com/windowsexperience/2015/12/03/whats-new-for-windows-10-iot-this-fall/#.VmB1q2NPg2A.twitter) are only partially recognized.
How can I improve my regex to cover all the possible cases?
Try this pattern (remember to turn the RegexOptions.IgnorePatternWhitespace option on):
(?'tag'(@|\#)(\w|_)+)
|
(?'link'((https?://)|(www\.))[\w$-_.+!*'(),]+)
For this string:
My name is @_dave from #chicago. Visit my city at www.choosechicago.com/things-to-do/ Have a nice day!
It makes 3 captures: 2 under the tag group (@_dave and #chicago) and one under the link group (www.choosechicago.com/things-to-do/).
You can check it with a regex tester like Regex Storm
Explanation
RegexOptions.IgnorePatternWhitespace allows you to break your pattern into multiple lines for easier readability. Instead of this:
(?'tag'(@|#)(\w|_)+)|(?'link'www\.[\w$-_.+!*'(),]+)
You can write this when you turn on the option:
(?'tag'(@|\#)(\w|_)+) # capture @ and # tags into the tag group
|
(?'link'www\.[\w$-_.+!*'(),]+) # capture hyperlinks, must begin with www
(?'tag'...) defines a capture group named tag, so you can refer to it by name, Groups["tag"], rather than by its positional value, Groups[1].
[\w$-_.+!*'(),]+ defines the list of characters allowed in a URL, which I got from this question. I haven't checked the RFC specs, so don't burn me if I missed a few.
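Putting the pieces together, the named groups can be read back like this. This is a sketch only: `SocialTokenExtractor` and `Extract` are hypothetical names, and the pattern is the multi-line version from this answer, with mentions written as `@`.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SocialTokenExtractor
{
    // IgnorePatternWhitespace lets the pattern span lines and carry # comments,
    // which is why the literal hash sign has to be escaped as \#.
    private const string Pattern = @"
        (?'tag'(@|\#)(\w|_)+)                            # mentions and hashtags
        |
        (?'link'((https?://)|(www\.))[\w$-_.+!*'(),]+)   # links starting with http, https or www
    ";

    private static readonly Regex Tokens =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    // Fills two lists: one with tags (mentions and hashtags), one with links.
    public static void Extract(string text, List<string> tags, List<string> links)
    {
        foreach (Match match in Tokens.Matches(text))
        {
            if (match.Groups["tag"].Success) tags.Add(match.Groups["tag"].Value);
            if (match.Groups["link"].Success) links.Add(match.Groups["link"].Value);
        }
    }
}
```

Run against the sample sentence above, the tag list receives the two tags and the link list receives www.choosechicago.com/things-to-do/.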