I'm revamping an old .NET 2.0 website to give it the look and feel of our new CI. Since there was money left over, I was told to review the code-behind as well.
I have now run into a serious problem with the charset: on almost all pages the German "special" characters like ß ä ö ü are rendered correctly, but on one page every special character is rendered as a plain one. In this case ö --> o; ä --> a; ß --> ?
The text the query fetches from the database is displayed correctly in the debugger, but gets messed up as soon as it's rendered in the browser.
I've set the charset to ISO-8859-1 in the master page as well as in web.config.
Help is much appreciated - thanks in advance.
Marco
Have you set the metatag in the head section of the rendered HTML?
I.e.
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">
Ah, sorry! I've misread the line about the masterpage.
The problem wasn't code-related at all.
The admin who set up the machine didn't install the full Oracle client we normally use. Instead he just copied over the Instant Client, set up tnsnames.ora and was done with it.
This is why there were never any Oracle entries in the registry telling the client which charset to use.
Instead of bothering with the registry, I just set it as an environment variable:
NLS_LANG = GERMAN_GERMANY.<charset> (Oracle's format is LANGUAGE_TERRITORY.CHARSET; the Oracle name for ISO-8859-1 is WE8ISO8859P1)
Problem solved.
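A rough illustration of the symptom, in Python rather than Oracle: as far as I know, a client with no NLS_LANG falls back to a 7-bit default (AMERICAN_AMERICA.US7ASCII), and any conversion to 7-bit ASCII has to substitute whatever it cannot represent. The function below only approximates that kind of lossy substitution; to_7bit_lossy is a made-up name and this is not Oracle's actual conversion table.

```python
import unicodedata


def to_7bit_lossy(text: str) -> str:
    """Approximate a lossy conversion to 7-bit ASCII (illustration only)."""
    out = []
    for ch in text:
        # Decompose the character and keep only its base letter (ö -> o).
        base = unicodedata.normalize("NFD", ch)[0]
        # Anything still outside ASCII (for example ß) becomes a question mark.
        out.append(base if ord(base) < 128 else "?")
    return "".join(out)


if __name__ == "__main__":
    print(to_7bit_lossy("Größe ändern"))
```

The output shows the same pattern as in the question: umlauts lose their dots and ß turns into a question mark.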
I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
???? instead of Asian characters. Even for European text, I got Se?or for Señor.
Strange gibberish (Mojibake?) such as SeÃ±or or æ–°æµªæ–°é—» for 新浪新闻.
Black diamonds, such as Se�or.
Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
Have your editor, etc. set to UTF-8.
HTML forms should start like <form accept-charset="UTF-8">.
Have your bytes encoded as UTF-8.
Establish UTF-8 as the encoding being used in the client.
Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
<meta charset=UTF-8> at the beginning of HTML
Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
For a blank space (in any language): 20
For English: 4x, 5x, 6x, or 7x
For most of Western Europe, accented letters should be Cxyy
Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
Most of Asia: Exyyzz
Emoji and some of Chinese: F0yyzzww
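To see what the expected hex looks like without touching the database, you can encode a few sample characters yourself. A small Python sketch; utf8_hex is just a helper name chosen here, and the sample characters are arbitrary.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as upper-case hex, like MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


if __name__ == "__main__":
    samples = {
        "blank space": " ",
        "English letter": "A",
        "Western Europe": "ñ",
        "Cyrillic": "Ж",
        "Chinese": "新",
        "Emoji": "👽",
    }
    for label, ch in samples.items():
        print(f"{label:15} {utf8_hex(ch)}")
```

If HEX(col) for a stored value differs from what this prints for the same text, the data in the table is not clean UTF-8.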
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
The bytes to be stored are not encoded as utf8mb4. Fix this.
Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
The bytes to be stored are not encoded as utf8. Fix this.
The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
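The four symptoms can be reproduced outside MySQL, which may help in recognising which one you have. This is only a Python sketch of the byte-level mistakes, not what the server does internally; the function names are invented for the example, and latin-1 / cp1252 stand in for "whatever non-UTF-8 encoding was involved".

```python
def question_marks(text: str) -> str:
    # Converting to a charset that lacks the character replaces it with "?".
    return text.encode("ascii", errors="replace").decode("ascii")


def black_diamond(text: str) -> str:
    # latin-1 bytes displayed as if they were UTF-8: invalid bytes show as U+FFFD.
    return text.encode("latin-1").decode("utf-8", errors="replace")


def mojibake(text: str) -> str:
    # UTF-8 bytes displayed as if they were cp1252 (Windows "latin1").
    return text.encode("utf-8").decode("cp1252")


def truncated(text: str) -> str:
    # latin-1 bytes stored as if they were UTF-8: keep only the valid prefix.
    raw = text.encode("latin-1")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return raw[:exc.start].decode("utf-8")


if __name__ == "__main__":
    for fn in (question_marks, black_diamond, mojibake, truncated):
        print(f"{fn.__name__:15} {fn('Señor')}")
```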
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
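The double-encoded hex values above can be reproduced, and reversed, in a few lines of Python. This is a sketch under the assumption that the "wrong" charset was cp1252 (which is roughly what MySQL calls latin1); double_encode and repair are hypothetical helper names, and repair only works if every intermediate character exists in that charset.

```python
def double_encode(text: str, wrong: str = "cp1252") -> bytes:
    """Encode as UTF-8, misread the bytes as `wrong`, and encode again."""
    return text.encode("utf-8").decode(wrong).encode("utf-8")


def repair(raw: bytes, wrong: str = "cp1252") -> str:
    """Undo one round of double encoding."""
    return raw.decode("utf-8").encode(wrong).decode("utf-8")


if __name__ == "__main__":
    for ch in ("é", "👽"):
        bad = double_encode(ch)
        print(ch, ch.encode("utf-8").hex().upper(), "->", bad.hex().upper())
        print("repaired:", repair(bad))
```

Try the repair on a copy of the data first; applying it to rows that were stored correctly will raise an error or mangle them.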
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
I had similar issues with two of my projects after a server migration. After searching and trying a lot of solutions, I came across this one:
mysqli_set_charset($con,"utf8mb4");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update your database so that every CHARACTER SET and COLLATION is utf8mb4, or at least one that supports UTF-8 data.
For Java:
When making a JDBC connection, add useUnicode=yes&characterEncoding=UTF-8 as parameters to the connection URL and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
If it does not work, happy hunting for the right solution.
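For what it's worth, most Python drivers also let you request the charset when the connection is opened, which avoids issuing SET statements by hand. (As far as I can tell from the MySQL manual, SET NAMES already sets the client, connection and results charsets, while SET CHARACTER SET resets the connection charset to the database default, so the order of the statements above matters.) A minimal sketch assuming the third-party PyMySQL driver; connection_kwargs and open_connection are hypothetical helper names and the credentials are placeholders.

```python
def connection_kwargs(host: str, user: str, password: str, database: str) -> dict:
    """Build connection arguments that ask for utf8mb4 from the start."""
    return {
        "host": host,
        "user": user,
        "password": password,
        "database": database,
        "charset": "utf8mb4",  # the driver negotiates this with the server
    }


def open_connection(**overrides):
    """Open a connection with PyMySQL (assumed to be installed)."""
    import pymysql  # third-party; imported lazily so the sketch loads without it

    kwargs = connection_kwargs("localhost", "app_user", "secret", "app_db")
    kwargs.update(overrides)
    return pymysql.connect(**kwargs)
```

Usage would be something like conn = open_connection(database="shop"); after that, SELECT HEX(col) is still the way to confirm what actually got stored.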
Set the file encoding in your code editor/IDE to UTF-8
Add <meta charset="utf-8"> to the header of the web page where you collect form data.
Check your MySQL table definition looks like this:
CREATE TABLE your_table (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8');
$dbL = new PDO($pdo, $user, $pass, $options);
If you already have a large database with the above problem, you can try SIDU to export with the correct charset, and import back with UTF-8.
Depending on how the server is set up, you have to change the encoding accordingly. From what you said, utf8 should work best. However, if you're getting weird characters, it might help to change the webpage encoding to ANSI.
This helped me when I was setting up a PHP MySQLi. This might help you understand more: ANSI to UTF-8 in Notepad++
I'm building some tables from data in our databases. It is from a lot of international sources so I was having encoding issues and I think I got them all cleared up. But now I'm seeing some strange output and can't figure out why.
This is a C# app in VS2010. Running in Debug, I see the string in my class begins:
Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>
But when I assign with:
td = htmlDoc.CreateElement("td");
td.Attributes.Add("rowspan", "5");
td.Attributes.Add("valign", "top");
td.InnerHtml = this.DRGuideNote.ToString();
The td.InnerHtml shows
Animal and vegetable oils 1 < 5=\"\" mw=\"\"><br>5-50 MW 30 <br>
Why is it putting the equals signs and escaped quotes into that text? It doesn't do it across all the data, just a few files. Any ideas? (PS: there are HTML breaks in the strings that are not showing up; how do I post so that HTML is ignored? I tried the "indent with 4 spaces" approach but it didn't seem to work.)
HTML Agility Pack's HTML parser is treating the < as the opening character of an HTML tag. So when it parses the 5 and the MW, it thinks it's inside a tag, and so it is treating them as tag attributes. This treatment stops once it runs into the <br> which forces it to close the tag.
The reason it works in browsers is that browsers generally follow the HTML5 spec for handling invalid HTML. The spec has a lot of rules for how to handle invalid HTML, with the goal of making sense of what the intent was. In this situation the spec says that a less-than sign (<) followed by a space should just be treated as text. HAP's parser doesn't deal with this particular edge case. So I wouldn't call this a bug so much as a limitation of HAP's native HTML parser.
An alternative to HAP is CsQuery (nuget) which uses a complete HTML5 parser (the same HTML parser as Firefox in fact), and can handle this kind of markup.
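If switching parsers is not an option, another workaround is to escape the stray "<" before handing the string to the parser. Here is a sketch of the idea in Python; the same lookahead pattern should carry over to Regex.Replace in C#. It assumes a "<" only starts real markup when it is followed by a letter, "/", "!" or "?", and escape_stray_lt is a name made up for this example.

```python
import re

# A "<" that is NOT followed by a character that can start a tag or comment.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(fragment: str) -> str:
    """Turn text-like '<' characters into &lt; and leave real tags alone."""
    return _STRAY_LT.sub("&lt;", fragment)


if __name__ == "__main__":
    note = "Animal and vegetable oils 1 < 5 MW <br>5-50 MW 30 <br>"
    print(escape_stray_lt(note))
```

This will not catch text such as "a<b" with no space, because that looks like the start of a tag; data like that needs a real fix at the source.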
I'm using this code:
$(document).ready(function () {
var breadCrumps = $('.breadcrumb');
breadCrumps.find('span').text("<%= ArticleSectionData.title %>");
});
title is a property whose values are encoded in Unicode (I think). These are Greek letters. On the local IIS development server (embedded in Visual Studio) the characters are displayed correctly, but on the test server they appear as:
&#931;
Do you know of any solution for this problem?
Thanks for your help
EDIT:
I have changed the code a little bit:
breadCrumps.find('span').text(<%= ArticleSectionData.title %>);
And now it works correctly, encoding is frustrating ...
If you are working off a different database in test than in dev, then I suspect the issue is with the data. If you are storing HTML entities (e.g. &#931;) in your database, then you need to use .html(). If you are storing actual Unicode characters (e.g. Σ) in the database, then you need to use .text(). The way to represent Σ in HTML is with &#931;. But if you set the text of an element to &#931;, it displays that literally; the innerHTML of that element would then contain &amp;#931;.
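The difference between a stored entity and a stored character is easy to check with Python's html module. A small sketch, taking &#931; (the decimal entity for Σ) as an assumed example value; as_text_node and as_character are illustrative names only.

```python
import html


def as_text_node(value: str) -> str:
    """What ends up in the markup when the value is inserted as plain text."""
    return html.escape(value)


def as_character(value: str) -> str:
    """Decode any HTML entities in the value into real characters."""
    return html.unescape(value)


if __name__ == "__main__":
    stored = "&#931;"
    print(as_text_node(stored))   # the ampersand itself gets escaped
    print(as_character(stored))   # the entity becomes the Greek letter
```

So one option is to decode entities once, on the server, and store real characters; then inserting the value as text is the right call everywhere.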
I don't know the root of the problem, but you can use http://www.strictly-software.com/htmlencode to decode the entity back to a Sigma
A user on my website could add a comment on an item that contains both Arabic and English characters. Sometimes it may be just Arabic, sometimes just English, and sometimes French!
It's an international website, so you can't predict which characters will be stored in my application.
My website has nothing to do with Facebook, but I need my comment TextBoxes to be able to accept and show any characters from any language!
...So how do you think I could achieve this?
All strings in .NET are unicode strings (see documentation of the String class for more information). So taking input from the user in a mix of languages shouldn't be a problem.
If you plan to store this information in the database you will need to make sure the database columns are of type nchar or nvarchar. As others pointed out, when you run queries against these columns out of SSMS you will need to prefix Unicode strings with N to make sure they are handled properly. When executing queries from code you should use parameterized queries (or an ORM, which would probably be better) and ADO.NET will take care of properly constructing queries for you.
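To illustrate the parameterized-query point: the text travels as a bound value rather than being pasted into the SQL string, so no N prefix or manual escaping is involved. The sketch below uses Python with an in-memory SQLite database purely as a stand-in (the question is about SQL Server and ADO.NET); the comments table and round_trip function are invented for the example.

```python
import sqlite3


def round_trip(comment: str) -> str:
    """Store a comment through a bound parameter and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
        # The "?" placeholder keeps the text out of the SQL statement itself.
        conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
        (stored,) = conn.execute("SELECT body FROM comments").fetchone()
        return stored
    finally:
        conn.close()


if __name__ == "__main__":
    mixed = "مرحبا, hello, ça va ?"
    print(round_trip(mixed) == mixed)
```

In ADO.NET the equivalent is a SqlCommand with a parameter of an nvarchar type, as described above.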
There are two elements here:
displaying the characters - this is handled on the user's side; if something is missing there (a font, for example) the outcome will be gibberish, but you can't affect that
what you can affect is the way you save characters in your database - convert everything to UTF-8 regardless of the input. Any popular browser is able to render UTF-8.
If you use Unicode as the charset on the web pages and in the database, then you don't have to worry about where users are from, since they will all type Unicode into your textboxes.
First, make sure that the database fields that may store Unicode characters are nvarchar, not varchar.
Be aware that nvarchar takes twice the storage per character:
for example, the maximum declared size of a varchar column in SQL Server is 8000 bytes,
which means that a field declared as nvarchar(4000) already uses up all 8000 bytes.
Second, set the language and charset in your code or page according to the language the user selected, which you can
read from a URL like http://website.tld/ar/
<meta http-equiv="Content-Language" content="ar" >
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" >
So whenever the language in the URL changes, you change the meta tags of your page
to match that language, and that is it.
Regards
My web application sends emails to users. The email contains a link for further user action. Our security standards require that the link in the email cannot be clickable. However, the email clients recognize https:// in the email and auto-link the URL.
Any ideas on how to stop email clients from auto-linking? I am thinking that if I drop the https://, it may stop the auto-linking. But if I have to keep the https://, is there any way to avoid auto-linking?
The link in the email is dynamically constructed in the c# code.
I know this thread is old, but I just had this issue myself, and wasn't thrilled by the GIF image fix. If you're working with HTML emails, a slightly nicer solution is to break up the link text with a non-rendering tag that tricks the parser. I'm a fan of a simple nonexistent <z>:
https<z>://securesite.</z>com
It even works in Stack Overflow posts: https://securesite.com.
Hope this helps someone.
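Since the link in the question is built in code, the tag can be inserted at that point. Below is a Python sketch of a variation on this trick that wraps only the "://" part in the made-up <z> tag; break_autolink is a hypothetical helper, and whether a given mail client is fooled has to be tested per client.

```python
import html


def break_autolink(url: str, tag: str = "z") -> str:
    """Return HTML that displays the URL but splits it with an unknown tag."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme separator found; just return the escaped text unchanged.
        return html.escape(url)
    return f"{html.escape(scheme)}<{tag}>://</{tag}>{html.escape(rest)}"


if __name__ == "__main__":
    print(break_autolink("https://securesite.com"))
```

Escaping the parts with html.escape keeps query strings containing & valid inside the HTML body.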
I too wish to disable this, as I believe there are "valid" reasons for not wanting auto-linking (one being that the designer wants it that way, and they are currently paying the bills).
In an email we send that has no images, the header contains the domain name:
EXTRANET.EXAMPLE.COM
I even put inline styles to make sure it stays white on a black background:
<span style="font-size: 1.5em;padding: 0.5em 0;text-transform: uppercase; font-weight:bold;color:#FFFFFF;text-decoration:none;">EXTRANET.EXAMPLE.COM</span>
Gmail makes this a link, adds an underline and also turns it bright blue instead of the intended white.
At first I tried replacing the dots with the HTML entity &#46;, which made it look fine, but didn't fool the Gmail parser.
So, I added a spanned space, which works just fine (i.e. it fools Gmail's parser):
<span style="font-size: 1.5em;padding: 0.5em 0;text-transform: uppercase; font-weight:bold;color:#FFFFFF;text-decoration:none;">EXTRANET<span style="font-size:0.1em"> </span>.<span style="font-size:0.1em"> </span>EXAMPLE<span style="font-size:0.1em"> </span>.<span style="font-size:0.1em"> </span>COM</span>
Just create a plain <span> tag around the colon (<span>:</span>) or something like that :)
Replace the actual text with a small GIF image that looks like text.
Email parsers will not recognize text within an image.
My application has a similar security requirement. The solution we used was to add an underscore to the beginning of the URL (_http://).
Sorry to dredge up an old question, but I just tried the answer suggested by pieman72, and found that it didn't work in Outlook 2007–2013. However, wrapping the individual elements of the URL within table cells did fool the Outlook parser:
Visit <table><tr><td>www.</td><td>website</td><td>.com</td></tr></table> for more information.
I ran a sample message through the Email On Acid test suite and found that it eluded the parser on all the major e-mail clients which automatically convert URLs (Outlook, iOS, Android 2.2, etc.) I did not run any deliverability tests.
@raugfer suggests in another answer: wrap the email/URL with an anchor.
<a name="myname">test@email.com</a>
Quoting from that answer:
Since the text is already wrapped in a hyperlink, Gmail gives up and
leave it alone. :)
(Note: also worked for Apple mail client.)
Necroing the question, I know, but it's relevant... I'd like to present a reasonable scenario where Gmail's auto-linking (at least - haven't tested other clients) doesn't make sense.
A client has an application form on their site, where visitors fill out some personal information and submit it. The system then sends a notification email to the client, presenting the information the visitor supplied.
I'm wanting to enhance the email sent to the client by adding a <textarea> at the bottom, with the fields the visitor filled out presented in CSV format so that the client can simply copy it all and paste it into a spreadsheet.
Gmail, however, fails to recognize that the URLs and email addresses are inside a <textarea> tag, and "helpfully" adds the ... link code around the URL/email - inside the <textarea>. This results in the raw HTML link code showing up in the <textarea>.
This is what I did:
Replace all instances of "." with <span style=""color:transparent; font-size:0px;"">[{</span>.<span style=""color:transparent; font-size:0px;"">}]</span>
Replace all instances of "@" with <span style=""color:transparent; font-size:0px;"">[{</span>@<span style=""color:transparent; font-size:0px;"">}]</span>
These characters stopped it from parsing links and email addresses, but aren't visible to the user. The downside is that when you copy and paste an email address, for example, you end up with: "test1{[{.}]}domain{[{.}]}com"
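The two replacements can be done in one pass. A Python sketch, assuming the characters to guard are the dot and the at sign and that the input is plain text rather than HTML; defuse and HIDDEN are names invented here, and the exact marker text that ends up in a copy-paste depends on the client.

```python
import re

# Invisible wrapper placed on both sides of each guarded character.
HIDDEN = '<span style="color:transparent; font-size:0px;">{}</span>'


def defuse(text: str) -> str:
    """Surround every '.' and '@' with invisible marker spans."""
    def wrap(match: re.Match) -> str:
        return HIDDEN.format("[{") + match.group(0) + HIDDEN.format("}]")

    return re.sub(r"[.@]", wrap, text)


if __name__ == "__main__":
    print(defuse("test1@domain.com"))
```

Stripping the tags from the result gives the text a user would get when copying, which is the trade-off mentioned above.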