Identify problematic characters in a string

Identify problematic characters in a string - c#

I want to be able to identify problematic characters in a string saved in my sql server using LINQ to Entities.
Problematic characters are characters which had problem in the encoding process.
This is an example of a problematic string : "testing�stringáאç".
In the above example only the � character is considered as problematic.
So for example the following string isn't considered problematic:"testingstringáאç".
How can I check this Varchar and identify that there are problematic chars in it?
Notice that my preferred solution is to identify it via a LINQ to entities query , but other solutions are also welcome - for example: some store procedure maybe?
I tried to play with Regex and with "LIKE" statement but with no success...

Check out the Encoding class.
It has a DecoderFallback Property and a EncoderFallback Property that lets you detect and substitute bad characters found during decoding.

.Net and NVARCHAR both use Unicode, so there is nothing inherently "problematic" (at least not for BMP characters).
So you first have to define what "problematic" in meant to mean:
characters are not mapped in target codepages
Simply convert between encodings and check whether data is lost:
CONVERT(NVARCHAR, CONVERT(VARCHAR, #originalNVarchar)) = #originalNVarchar
Note that you can use SQL Server collations using the COLLATE clause rather than using the default database collation.
characters cannot be displayed due to the fonts used
This cannot be easily done in .Net

You can do something like this:
DECLARE #StringWithProblem NVARCHAR(20) = N'This is '+NCHAR(8)+N'roblematic';
DECLARE #ProblemChars NVARCHAR(4000) = N'%['+NCHAR(0)+NCHAR(1)+NCHAR(8)+']%'; --list all problematic characters here, wrapped in %[]%
SELECT PATINDEX(#ProblemChars, #StringWithProblem), #StringWithProblem;
That gives you the index of the first problematic character or 0 if none is found.

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?

This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.

I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.

Set your code IDE language to UTF-8
Add <meta charset="utf-8"> to your webpage header where you collect data form.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.

Depending on how the server is setup, you have to change the encode accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

C# OLEDB driver not reading null character

I am trying to read a dbf file through ADO using the FoxPro OLEDB driver. I can query fine however there are some special characters which do not seem to be coming through. They are not printable characters as disappear when clicked on however are definitely not the same via OLEDB as they are in FoxPro.
For example, the following field through Visual FoxPro:
When this is accessed through OLEDB it displays as the following:
I've narrowed this down to the fact that the first string contains the ASCII code 0 (null) character as the 10th character - this is valid however so I do not wish to remove it, but whatever I try the string ends after 9 characters when reading with ADO.

You don't show us any code and the image links are broken, we are left out with guesses. I have been using VFPOLEDB driver from C# for years and do no have this problem. I believe you are trying to describe a problem that exists on C# side and not VFP side. In VFP even the char(0) is a valid character. In C# however (docs are misleading IMO, says this is not the case but it is) strings are ASCIIZ strings where char(0) is accepted as the end of string. This should be your problem. You could simply read as a byte array instead, casting the field to a blob. Something like:
Instead of plain SQL like this:
select myField from myTable
Do like this and cast:
select cast(myField as w) as myField from myTable
EDIT: Images were not broken but blocked for me by my ISP, go figure why.

SQL Like Operator (single quotation) [duplicate]

The MySQL documentation says that it should be \'. However, both scite and mysql shows that '' works. I saw that and it works. What should I do?

The MySQL documentation you cite actually says a little bit more than you mention. It also says,
A “'” inside a string quoted with “'” may be written as “''”.
(Also, you linked to the MySQL 5.0 version of Table 8.1. Special Character Escape Sequences, and the current version is 5.6 — but the current Table 8.1. Special Character Escape Sequences looks pretty similar.)
I think the Postgres note on the backslash_quote (string) parameter is informative:
This controls whether a quote mark can be represented by \' in a string literal. The preferred, SQL-standard way to represent a quote mark is by doubling it ('') but PostgreSQL has historically also accepted \'. However, use of \' creates security risks...
That says to me that using a doubled single-quote character is a better overall and long-term choice than using a backslash to escape the single-quote.
Now if you also want to add choice of language, choice of SQL database and its non-standard quirks, and choice of query framework to the equation, then you might end up with a different choice. You don't give much information about your constraints.

Standard SQL uses doubled-up quotes; MySQL has to accept that to be reasonably compliant.
'He said, "Don''t!"'

What I believe user2087510 meant was:
name = 'something'
name = name.replace("'", "\\'")
I have also used this with success.

There are three ways I am aware of. The first not being the prettiest and the second being the common way in most programming languages:
Use another single quote: 'I mustn''t sin!'
Use the escape character \ before the single quote': 'I mustn\'t sin!'
Use double quotes to enclose string instead of single quotes: "I mustn't sin!"

just write '' in place of ' i mean two times '

Here's an example:
SELECT * FROM pubs WHERE name LIKE "%John's%"
Just use double quotes to enclose the single quote.
If you insist in using single quotes (and the need to escape the character):
SELECT * FROM pubs WHERE name LIKE '%John\'s%'

Possibly off-topic, but maybe you came here looking for a way to sanitise text input from an HTML form, so that when a user inputs the apostrophe character, it doesn't throw an error when you try to write the text to an SQL-based table in a DB. There are a couple of ways to do this, and you might want to read about SQL injection too.
Here's an example of using prepared statements and bound parameters in PHP:
$input_str = "Here's a string with some apostrophes (')";
// sanitise it before writing to the DB (assumes PDO)
$sql = "INSERT INTO `table` (`note`) VALUES (:note)";
try {
$stmt = $dbh->prepare($sql);
$stmt->bindParam(':note', $input_str, PDO::PARAM_STR);
$stmt->execute();
} catch (PDOException $e) {
return $dbh->errorInfo();
}
return "success";
In the special case where you may want to store your apostrophes using their HTML entity references, PHP has the htmlspecialchars() function which will convert them to '. As the comments indicate, this should not be used as a substitute for proper sanitisation, as per the example given.

Replace the string
value = value.replace(/'/g, "\\'");
where value is your string which is going to store in your Database.
Further,
NPM package for this, you can have look into it
https://www.npmjs.com/package/mysql-apostrophe

I think if you have any data point with apostrophe you can add one apostrophe before the apostrophe
eg. 'This is John's place'
Here MYSQL assumes two sentence 'This is John' 's place'
You can put 'This is John''s place'. I think it should work that way.

In PHP I like using mysqli_real_escape_string() which escapes special characters in a string for use in an SQL statement.
see https://www.php.net/manual/en/mysqli.real-escape-string.php

Why SQL Server stored procedure doesn't recognize text from Visual Studio?

Ok. Will try to explain with images... This is my SQL Server and my query:
As you see I getting the result. But then I start my app in VS2013, put break point when I want to call my stored procedure and copy text from VS:
And paste Name in Qhuery:
But I didn't get the result! The names ABSOLUTELY THE SAME!
This Query doesnt't work:
SELECT TOP 1 [Employee].[EmployeeID]
FROM [Employee]
WHERE [Employee].[FullName] = 'Brad Oelmann'

I agree the initial suspect is a "special character" that shows up as whitespace pasting in SSMS.
It has happened to me filtering client data with t-sql.
To replace special characters, there is a good starting point here:
.NET replace non-printable ASCII with string representation of hex code
In that case, they're looking for "control characters" in particular and doing a fancy replacement, but the idea of finding the special characters RegEx is the same.
You can look at all kinds of special sets of characters here:
http://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx
But it might be easier to define what you do want if you are doing something specific like a name.
For example, you can replace anything that isn't an English letter (for one example) with a space:
str = System.Text.RegularExpressions.Regex.Replace( _
str, _
"[^a-zA-Z]", _
" ")

It's really stupid, but I got simple solution. Since my DB Table contains only ~50 records, I retyped all names and now it works. So the problem was not in VS but in SQL Server side.
If somebady will have similar problem, first of all try to update data in your table somehow. You can try to select all data, copy-paste in in notepad and put it back in SQL Server.

Replacing specific Unicode characters in strings read from Excel

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.
I retrieve the cell contents into a string thus:
string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();
Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):
s = Regex.Replace(s, "\u0094", "\u0022");
But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:
bool a = s.Contains("\u0094");
but it returns false. However:
bool b = s.Contains("”");
returns true.
My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

Yes strings in .Net are UTF-16.
You're doing it right; perhaps your hex-math is incorrect.
The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:
((int)"”"[0]).ToString("X") returns "201D"
"”" == "\u201D" returns true
"\u0094" == "" (right hand side is the empty string) returns false
A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.
References - Jon Skeet's articles on:
Strings
Unicode

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters.
That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Identify problematic characters in a string - c#

Check out the Encoding class. It has a DecoderFallback Property and a EncoderFallback Property that lets you detect and substitute bad characters found during decoding.

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

C# OLEDB driver not reading null character

SQL Like Operator (single quotation) [duplicate]

Why SQL Server stored procedure doesn't recognize text from Visual Studio?

Replacing specific Unicode characters in strings read from Excel

Categories

Resources