linq search for French characters - c#

I'm using EF and have a simple LINQ statement that searches for words.
There is a search TextBox and a submit button.
When searchText contains "march" it finds e.g. "des marchés", but if I search for "marché" it finds nothing. So the problem is the French character.
listAgendaItems = dc.agenda.Where(a =>
a.libelle_activite.Contains(searchText)
).ToList<agenda>();
The database and the table Agenda have extended properties -> Collation : French_CI_AS
So how can I make sure I get the French words too, with characters like "é" and "à"?
I also tried searching for "marche", but it doesn't find "marchés".

Your collation French_CI_AS is "Case-Insensitive", "Accent-Sensitive". If you want a query for "marches" to match "marchés", you need French_CI_AI as your collation. In most languages, that's actually NOT what native speakers want, because the accents are semantically important, but that may depend on the circumstances or context.
If, in fact, your users do always want accent insensitive searches, you should set that collation property to AI instead of AS on the table (or the specific fields). Otherwise, if the need is rare, you can apply collation to a table in MS Sql on a per-query basis; keep in mind that if there is no index on that collation there may be a substantial performance cost. That may be nearly immaterial when you're doing a %wildcard% query, however, since you'll generally have a full table scan in that case anyway.
The last I checked, it wasn't possible to specify a collation in a LINQ query directly, so if you need accent insensitivity on an ad-hoc basis, you'd need to run a direct SQL query through your data context.
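To make the difference between accent-sensitive (AS) and accent-insensitive (AI) matching concrete, here is a minimal Python sketch. It is not how SQL Server applies a collation, and the function names are hypothetical. It only shows the idea: decompose each character, drop the combining accent marks, then compare case-insensitively.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (é -> e, à -> a)."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_ci_ai(haystack: str, needle: str) -> bool:
    """Case-insensitive, accent-insensitive substring test (illustration only)."""
    return strip_accents(needle).casefold() in strip_accents(haystack).casefold()


if __name__ == "__main__":
    print("marche" in "des marchés")                  # plain, accent-sensitive test
    print(contains_ci_ai("des marchés", "marche"))    # accent-insensitive test
```

Filtering this way in application code means loading the rows first, so for large tables the collation-based approaches described above are usually the better fit.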
Edited:
Based on your comment, it sounds like you are allowing HTML content to be stored in your database. You have numeric character references in your table, which SQL Server knows nothing about, since they're a feature of HTML, XML and SGML. You can only make this searchable if those characters are string literals in a suitable encoding.
NVARCHAR will store the content in Unicode, specifically UTF-16, and VARCHAR will use Windows-1252 with French collation.
If you are accepting this input via web forms, make sure the page encoding is appropriate. If you are only supporting modern browsers (essentially anything IE5+), UTF-8 is well supported, so you should consider using UTF-8 for all of your requests and responses.
Make sure in your web.config, you have something like this:
<configuration>
<system.web>
<globalization
requestEncoding="utf-8"
responseEncoding="utf-8" />
</system.web>
</configuration>
If you already have data stored with those numeric character references in your database, you can unescape them by translating &#ddddd; into literal UTF-16 sequences, and store them again. Make sure you don't accidentally unescape semantically important NCRs like the greater-than, less-than or ampersand codepoints.
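As a rough sketch of that unescaping step, the Python below replaces decimal numeric character references with literal characters while leaving the references for &, < and > untouched. The function name and the protected set are assumptions for illustration. Hexadecimal references (&#x...;) are deliberately not handled here.

```python
import re

# Code points that must stay escaped so stored HTML keeps its meaning: & < >
PROTECTED_CODE_POINTS = {38, 60, 62}

DECIMAL_NCR = re.compile(r"&#(\d+);")


def unescape_ncrs(text: str) -> str:
    """Replace decimal NCRs such as &#233; with the literal character."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if (
            code_point in PROTECTED_CODE_POINTS
            or code_point == 0
            or code_point > 0x10FFFF
            or is_surrogate
        ):
            return match.group(0)  # leave the reference exactly as it was
        return chr(code_point)

    return DECIMAL_NCR.sub(replace, text)


if __name__ == "__main__":
    print(unescape_ncrs("des march&#233;s &#38; &#60;b&#62;"))
```

Run something like this over a copy of the data first and compare before and after, since the right protected set depends on how the content is rendered.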

Related

Encoding problem on MySql Database from Blazor form on hosting environment [duplicate]

I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as SeÃ±or for Señor or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte UTF-8 codes, which are needed by Emoji and some Chinese characters.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
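The byte patterns listed above can be checked outside MySQL. This small Python sketch (the helper name is made up) prints the UTF-8 bytes of a string as hex, which is what HEX(col) should show for correctly stored utf8mb4 text.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as uppercase hex, like MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


if __name__ == "__main__":
    for sample in [" ", "A", "é", "я", "新", "👽"]:
        print(repr(sample), utf8_hex(sample))
```

If HEX(col) for a column shows something else for the same characters, the stored bytes are not what they should be, whatever the client displays.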
More details
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor):
One of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
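Three of these symptoms can be reproduced in a few lines of Python, which may help when deciding which case you are looking at. This is only an illustration of the byte-level mismatch, not of what MySQL does internally. The variable names are arbitrary.

```python
original = "Señor"

# Question marks: the target character set has no ñ, so it is replaced by '?'.
question_marks = original.encode("ascii", errors="replace").decode("ascii")

# Mojibake: correct UTF-8 bytes are read as if they were latin1.
mojibake = original.encode("utf-8").decode("latin-1")

# Black diamond: latin1 bytes are read as if they were UTF-8; the invalid
# byte is shown as U+FFFD, the replacement character.
black_diamond = original.encode("latin-1").decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(question_marks)
    print(mojibake)
    print(black_diamond)
```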
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
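The hex values quoted for Double Encoding can be reproduced with a short Python sketch. It assumes the wrong charset in the middle step is Windows-1252 (which is what MySQL calls latin1). The function names are hypothetical.

```python
def double_encode_hex(text: str) -> str:
    """Encode to UTF-8, misread the bytes as cp1252, encode to UTF-8 again."""
    once = text.encode("utf-8")
    misread = once.decode("cp1252")   # the mistaken "these bytes are latin1" step
    twice = misread.encode("utf-8")
    return twice.hex().upper()


def undo_misread(mojibake: str) -> str:
    """Reverse a single misread: turn text like 'Ã©' back into 'é'."""
    return mojibake.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    print(double_encode_hex("é"))
    print(double_encode_hex("👽"))
    print(undo_misread("Ã©"))
```

The reverse step only works when every misread byte maps to a cp1252 character. A few byte values have no mapping in Python's cp1252 codec, so treat this as a diagnostic aid rather than a repair tool.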
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
Set your IDE's file encoding to UTF-8.
Add <meta charset="utf-8"> to the header of the webpage where you collect form data.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. utf8 from what you said should work the best. However, if you're getting weird characters, it might help if you change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++

Semicolon in url as a separator for query strings

I keep hearing that W3C recommends to use ";" instead of "&" as a query string separator.
We recommend that HTTP server implementors, and in particular, CGI
implementors support the use of ";" in place of "&" to save authors
the trouble of escaping "&" characters in this manner.
Can somebody please explain why ";" is recommended instead of "&"?
Also, I tried using ";" instead of "&" (example: .com?str1=val1;str2=val2). When reading Request.QueryString["str1"] I get "val1;str2=val2". So if ";" is recommended, how do we read the query strings?
As the linked document says, ; is recommended over & because
the use of the "&" character to separate form fields interacts with its use in SGML attribute values to delimit character entity references.
For example, say you want your URL to be ...?q1=v1&q2=v2
There's nothing wrong with & there. But if you want to put that query into an HTML attribute, <a href="...?q1=v1&q2=v2">, it breaks because, inside an HTML attribute, & represents the start of a character entity. You have to escape the & as &amp;, giving <a href="...?q1=v1&amp;q2=v2">, and it'd be easier if you didn't have to.
; isn't overloaded like this at all; you can put one in an HTML attribute and not worry about it. Thus it'd be much simpler if servers recognised ; as a query parameter separator.
However, by the look of things (based on your experiment), ASP.Net doesn't recognise it as such. How to get it to? I'm not sure you can.
In short, HTML is a big mess (due to its leniency), and using semicolons helps to simplify this a LOT.
In order to use semicolons as the separator, I don't know if .NET allows this customization or whether we developers need to write our own methods to process the QueryString. .NET does give us access to the raw QueryString, and we can run with it from there. This is what I did. I wrote my own methods, which wasn't too hard, but it took a lot of testing time and debugging, some of which was Microsoft's fault for not even conforming to web standards when dealing with surrogate pairs. I made sure my implementation works with the full range of Unicode characters including the Multilingual plane (thus for Chinese and Japanese characters, etc.).
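For illustration, here is roughly what such a hand-written parser looks like, sketched in Python rather than .NET. The function name is hypothetical, and this is not the implementation described above. It splits the raw query string on ';' and URL-decodes each key and value.

```python
from urllib.parse import unquote_plus


def parse_semicolon_query(query: str) -> dict:
    """Parse a ';'-separated query string into a dict of decoded values."""
    params = {}
    for pair in query.lstrip("?").split(";"):
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        # unquote_plus turns '+' into a space and decodes %XX as UTF-8.
        params[unquote_plus(key)] = unquote_plus(value)
    return params


if __name__ == "__main__":
    print(parse_semicolon_query("?str1=val1;str2=val2"))
```

In this sketch a repeated key keeps only its last value. A real implementation would have to decide how to handle duplicates.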
Before adding my own findings, I also want to confirm and include the great info that Rawling, Jeevan, and BeniBela have pointed out in Rawling's answer and their comments on that answer: it is incorrect in HTML to not escape them, but it usually works, but only because parsers are so tolerant. With that, I also explain why such improper encoding can lead to bugs (which probably most developers fall victim to).
One cannot depend on this leniency of improperly encoding ampersands in QueryStrings, and sometimes this leniency leads to nasty bugs. Let's say for instance a QueryString passes a random ASCII string (or user input) and they are not properly encoded. Then 'amp;' which follows '&' gets decoded and the unexpected consequence is that 'amp;' is essentially 'swallowed'. (By swallowed, I mean it gets 'eaten' or it goes missing.) A practical usage scenario is when the user is asked for input that goes into a database and the user inputs HTML (like here at StackOverflow) but because it is not posted correctly then nasty bugs develop.
The real advantage of the ';' separator is in simplicity: proper encoding of ampersand separated QueryStrings takes two steps of complication for URL strings in an HTML page (and in XML too). First keys and values should be URL encoded and then all concatenated, and then the whole QueryString or URL should be HTML encoded (or for XML, encoded with a very similar encoding to HTML encoding). Also don't forget that the encoding process for HTML encoding and URL encoding are different, and it's important that they are different. A developer needs to be careful between the two. And since they are similar, it's not uncommon to see them mixed up by novice programmers.
A good example of a potential problematic URL is when passing two name/values in a QueryString:
a = 'me & you', and
b = 'you & me'.
Here, using '&' as a separator, then '?a=me+%26+you&b=you+%26+me' is a proper querystring BUT it should also be HTML encoded before being written to HTML source code. This is important to be bug free. Most developers aren't careful to do this two step process of first URL encoding the keys and values and then HTML encoding the full URL in the HTML source. It's no wonder why, when I had to sit down and seriously think this process through and test out my conclusions thoroughly. Imagine when the name value is 'year=año' or far more complex when we need Chinese or Japanese characters that use surrogate pairs to represent them!
For the same above key value pairs for a and b, when using ';' as the separator, the process is MUCH simpler. As a matter of fact, the ampersand separator makes the process more than twice as complex as using the semicolon separator! Here's the same info represented using the ';' as a separator: '?a=me+%26+you;b=you+%26+me'. We notice that the only difference though is that there's no '&' in the string. But using this ';' separator means that no second process of HTML encoding the URL or QueryString is needed. Now imagine if I were writing HTML and wanted correct HTML and needed to write the HTML to explain all this! All this HTML encoding with '&' really adds a lot of complication (and for many developers, quite a lot of confusion too).
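The two-step process for '&' versus the single step for ';' can be seen side by side in a short Python sketch using the a and b values from the example. The variable names are arbitrary. html.escape stands in for whatever HTML encoder your platform provides.

```python
from html import escape
from urllib.parse import quote_plus, urlencode

params = {"a": "me & you", "b": "you & me"}

# '&' separator: step 1 URL-encodes keys and values, step 2 HTML-encodes the result.
amp_query = urlencode(params)
amp_in_html = escape(amp_query)

# ';' separator: URL-encode keys and values, then join. HTML encoding changes nothing.
semi_query = ";".join(f"{quote_plus(k)}={quote_plus(v)}" for k, v in params.items())
semi_in_html = escape(semi_query)

if __name__ == "__main__":
    print(amp_query)
    print(amp_in_html)
    print(semi_query)
    print(semi_in_html)
```

The second and fourth lines are what belongs inside an href attribute. Only the '&' version differs from its raw query string.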
Novice developers would simply not HTML encode the QueryString or URL, which is CORRECT when ; is the separator. But it leaves room for bugs when ampersand is improperly encoded. So '?someText=blah&blah' would need proper encoding.
Also in .NET, we can write XML documentation for our methods. Well, just today, I wrote a little explanation that used the above 'a=me+%26+you&b=you+%26+me' example. And in my XML, I had to manually type all those &amp; character entities for the XML. XML documentation is picky, so one must correctly encode ampersands. But the leniency in HTML adds to ambiguity.
Perhaps this wasn't too confusing. But all the confusion or difficulty is due to using a character which should be HTML encoded as the separator, thus '&' is the culprit. And semicolon relieves all that complication.
One last consideration: with how much more complicated the '&' separator makes this process, it's no wonder to me why the Microsoft implementation of surrogate pairs in QueryStrings still does not follow the official specifications. And if you write your own methods, you MUST account for Microsoft's incorrect use of percent-encoding surrogate pairs. The official specs forbid percent-encoding of surrogate pairs in UTF-8. So anyone who writes their own methods which also handle the full range of Unicode characters, beware of this.

Identify problematic characters in a string

I want to be able to identify problematic characters in a string saved in my sql server using LINQ to Entities.
Problematic characters are characters which had problem in the encoding process.
This is an example of a problematic string : "testing�stringáאç".
In the above example only the � character is considered as problematic.
So for example the following string isn't considered problematic:"testingstringáאç".
How can I check this Varchar and identify that there are problematic chars in it?
Notice that my preferred solution is to identify it via a LINQ to entities query , but other solutions are also welcome - for example: some store procedure maybe?
I tried to play with Regex and with "LIKE" statement but with no success...
Check out the Encoding class.
It has a DecoderFallback Property and a EncoderFallback Property that lets you detect and substitute bad characters found during decoding.
.Net and NVARCHAR both use Unicode, so there is nothing inherently "problematic" (at least not for BMP characters).
So you first have to define what "problematic" is meant to mean:
characters are not mapped in target codepages
Simply convert between encodings and check whether data is lost:
CONVERT(NVARCHAR(MAX), CONVERT(VARCHAR(MAX), @originalNVarchar)) = @originalNVarchar
Note that you can use SQL Server collations using the COLLATE clause rather than using the default database collation.
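The same round-trip idea can be sketched in Python: try to encode each character in the target code page and collect the ones that do not survive. The function names and the default code page (cp1252) are assumptions for illustration. A second helper checks directly for U+FFFD, the replacement character that shows up as � in the question.

```python
def lossy_characters(text: str, codepage: str = "cp1252") -> list:
    """Return the characters of text that cannot be encoded in codepage."""
    lost = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


def has_replacement_char(text: str) -> bool:
    """True if text already contains U+FFFD from an earlier failed decode."""
    return "\ufffd" in text


if __name__ == "__main__":
    sample = "testing\ufffdstringáאç"
    print(lossy_characters(sample))
    print(has_replacement_char(sample))
```

Note that the round-trip check depends on the code page: with cp1252 the Hebrew letter is reported as lossy too, while the U+FFFD check flags only the character the question calls problematic.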
characters cannot be displayed due to the fonts used
This cannot be easily done in .Net
You can do something like this:
DECLARE @StringWithProblem NVARCHAR(20) = N'This is '+NCHAR(8)+N'roblematic';
DECLARE @ProblemChars NVARCHAR(4000) = N'%['+NCHAR(0)+NCHAR(1)+NCHAR(8)+']%'; --list all problematic characters here, wrapped in %[]%
SELECT PATINDEX(@ProblemChars, @StringWithProblem), @StringWithProblem;
That gives you the index of the first problematic character or 0 if none is found.
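If the check has to run in application code instead, the PATINDEX behaviour is easy to mimic. This Python sketch (hypothetical function name) returns the 1-based position of the first character from a given set, or 0 if none is present.

```python
def first_problem_index(text: str, problem_chars: set) -> int:
    """Return the 1-based index of the first problematic character, or 0."""
    for position, ch in enumerate(text, start=1):
        if ch in problem_chars:
            return position
    return 0


if __name__ == "__main__":
    problem_chars = {chr(0), chr(1), chr(8)}
    print(first_problem_index("This is " + chr(8) + "roblematic", problem_chars))
```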

Handling Alphabets (Norwegian and Danish) in Sql Server

I have an application with a lot of products whose names contain special letters like é, è, ê, ó, ò, â, and ô.
These letters cause problems: when I store them in SQL Server, they get replaced by ?. I also run into problems during processing.
How can I handle these?
Should I keep using string to handle them, or use something else?
What should their data types be in SQL Server?
Any help is appreciated.
Have you tried using nvarchar as the datatype? This is usually recommended when storing non-English text (the cost is more storage space). We use nvarchar for Finnish text (ä ö å), and have no problems or special processing. If writing to a stream, then make sure to use the iso-8859-1 encoding (at least for scandic languages. Eastern European languages use a different one).
If it's not possible for you to change the datatype, let me know and we can come up with a different solution.

My website could possibly receive both Unicode and Non-Unicode characters

A user on my website could add a comment on an item that contains both Arabic and English characters. Sometimes maybe just Arabic, sometimes just English, and sometimes French!
It's an international website, so you can't predict which characters will be stored in my application.
My website has nothing to do with Facebook, but I need my comment TextBoxes to accept and show any characters from any language!
...So how do you think I could achieve this ?
All strings in .NET are unicode strings (see documentation of the String class for more information). So taking input from the user in a mix of languages shouldn't be a problem.
If you plan to store this information in the database you will need to make sure the database columns are of type nchar or nvarchar. As others pointed out, when you run queries against these columns out of SSMS you will need to prefix Unicode strings with N to make sure they are handled properly. When executing queries from code you should use parameterized queries (or an ORM, which would probably be better) and ADO.NET will take care of properly constructing queries for you.
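As a small illustration of the parameterized-query point, here is a Python sketch using the standard library's sqlite3 module as a stand-in for SQL Server and ADO.NET. The table and function names are made up. The text is passed as a parameter instead of being pasted into the SQL string, so the driver handles the Unicode value itself.

```python
import sqlite3


def store_and_fetch(comment: str) -> str:
    """Insert a comment with a parameterized query and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
        # '?' placeholders: the value travels separately from the SQL text.
        conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
        row = conn.execute(
            "SELECT body FROM comments WHERE body = ?", (comment,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(store_and_fetch("Hello مرحبا, des marchés"))
```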
There are two elements here:
displaying the characters - this is handled on the user side; if something is missing there, the outcome will be gibberish, but you can't affect that
what you can affect is the way you save characters in your database - convert everything to UTF-8 regardless of the input. Any popular browser is able to render UTF-8.
If you use Unicode as the charset on the web pages and in the database, then you don't have to worry about where users are from, since they will all type Unicode into your textboxes.
First, make sure the database fields that may store Unicode characters are NVARCHAR, not VARCHAR.
Keep in mind that NVARCHAR takes twice the storage per character.
For example, a non-MAX column in SQL Server is limited to 8000 bytes,
so a field declared as NVARCHAR(4000) already uses that whole 8000-byte limit.
Second, set the language and charset in your code or page according to the language the user selects while browsing,
for example by reading it from a URL like http://website.tld/ar/
<meta http-equiv="Content-Language" content="ar" >
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" >
So when the language in the URL changes, you change the meta tags of your page to match that language,
and that is it.
Regards
