C#, UTF-8 and encoding characters

This is a shot-in-the-dark, and I apologize in advance if this question sounds like the ramblings of a madman.
As part of an integration with a third party, I need to UTF8-encode some string info using C# so I can send it to the target server via multipart form. The problem is that they are rejecting some of my submissions, probably because I'm not encoding their contents correctly.
Right now, I'm trying to figure out how a dash or hyphen -- I can't tell which it is just by looking at it -- is received or interpreted by the target server as ?~#~S (yes, that's a 5-character string and is not your browser glitching out). And unfortunately I don't have a thorough enough understanding of Encoding.UTF8.GetBytes() to know how to use the byte array to begin identifying where the problem might lie.
If anybody can provide any tips or advice, I would greatly appreciate it. So far my only friend has been MSDN, and not much of one at that.
UPDATE 1: After some more digging around, I discovered that using System.Web.HttpUtility.UrlEncode() to encode an EM DASH character ("—") will hex-encode it into "%e2%80%94".
I'm currently sending this info in a HttpWebRequest post, with a content type of "application/x-www-form-urlencoded" -- could this be what's causing the problem? And if so, what is the proper way to encode a series of name-value pairs whose values may contain Unicode characters, such that it will be understood by a server expecting a UTF-8 request?

byte[] test = System.Text.Encoding.UTF8.GetBytes("-");
Should give you
test[0] = 0x2D (45 as integer).
Verify that you're sending 0x2D to the target server.
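Since the question is unsure whether the character is a hyphen or a dash, it may help to dump the UTF-8 bytes of each candidate and compare them with what actually goes over the wire. The following is a minimal sketch; the class and method names (Utf8Inspector, ToUtf8Hex) are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Inspector
{
    // Returns the UTF-8 bytes of a string as hex pairs, e.g. "E2-80-94".
    public static string ToUtf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf8Hex("-"));      // HYPHEN-MINUS (U+002D): 2D
        Console.WriteLine(ToUtf8Hex("\u2013")); // EN DASH (U+2013): E2-80-93
        Console.WriteLine(ToUtf8Hex("\u2014")); // EM DASH (U+2014): E2-80-94
    }
}
```

A plain hyphen is a single byte, while both dashes take three bytes. If the server displays several junk characters for one dash, it is probably reading those three bytes with a single-byte character set.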

You may need to add a "charset=utf-8" parameter to your Content-Type header. Note that the Content-Encoding header will not help here: it describes compression (gzip, for example), not the character set. The header should contain the following:
Content-Type: multipart/form-data; charset=utf-8
Otherwise, the web server won't know your bytes are UTF-8 bytes, so it will misinterpret them.
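For the "application/x-www-form-urlencoded" case raised in the question's update, one possible approach is to percent-encode every name and value separately with Uri.EscapeDataString, which escapes non-ASCII characters as their UTF-8 bytes. The sketch below makes some assumptions: the FormBody class and the field names are hypothetical, and whether the target server honors a charset parameter on this content type is something you would need to confirm with the third party.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class FormBody
{
    // Assumption: the server accepts a charset parameter on this content type.
    public const string ContentType = "application/x-www-form-urlencoded; charset=utf-8";

    // Percent-encodes each name and value on its own, then joins the pairs with '&'.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    public static void Main()
    {
        var fields = new[]
        {
            new KeyValuePair<string, string>("title", "A \u2014 B"), // contains an EM DASH
            new KeyValuePair<string, string>("company", "AT&T"),
        };

        string body = Build(fields);
        Console.WriteLine(body); // title=A%20%E2%80%94%20B&company=AT%26T

        // The encoded body is pure ASCII, so these are the bytes to write to the request stream.
        byte[] payload = Encoding.UTF8.GetBytes(body);
        Console.WriteLine(payload.Length);
    }
}
```

Because the separators "=" and "&" are added after escaping, a literal ampersand inside a value cannot be mistaken for a field boundary.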

Related

Header encoding issues with Mailkit and Mimekit

When I use Mailkit to send emails, I noticed that it automatically decides to encode both the content and the headers. The content encoding is perfect; however, some email clients have difficulty decoding headers like the ones below.
Is there a way to instruct the client not to encode certain headers?
List-Unsubscribe:
=?us-ascii?q?=3Chttps=3A=2F=2Fbarlinkar=2Eus19=2Elist-manage=2Ecom=2Funsubscribe=3Fu=3D8c60690?=
=?us-ascii?q?5a7e637766f218816b&id=3D2e47bac84d&e=3D407e758886&c=3De27229afde=3E=2C?=
=?us-ascii?q?_=3Cmailto=3Aunsubscribe-mc=2Eus19=5F8c606905a7e637766f218816b=2Ee27229a?=
=?us-ascii?q?fde-407e758886=40mailin=2Emcsv=2Enet=3Fsubject=3Dunsubscribe=3E?=
X-Report-Abuse:
=?us-ascii?q?=3Chttps=3A=2F=2Fmailchimp=2Ecom=2Fcontact=2Fabuse=2F=3Fu=3D8c606905a7e637766f218?=
=?us-ascii?q?816b&id=3De27229afde&e=3D407e758886=3E?=
To: k****@****.***
EDIT: Jstedfast pointed out some errors and I fixed them but the overall result is the same.

I doubt the problem is that the header value is encoded. Your value is invalid to begin with.
Here's the raw value that you are using:
https://barlinkar.us19.list-manage.com/unsubscribe?u=8c606905a7e637766f218816b&id=2e47bac84d&e=407e758886&c=e27229afde>, <mailto:unsubscribe-mc.us19_8c606905a7e637766f218816b.e27229afde-407e758886@mailin.mcsv.net?subject=unsubscribe>List-Unsubscribe-Post: List-Unsubscribe=One-Click
Do you see anything wrong with that?
First, each URL should be enclosed in <>'s. Your first URL is missing the leading < character.
Secondly, you are including the List-Unsubscribe-Post header in the value of the List-Unsubscribe header. They need to be 2 distinct headers.
In other words, the receiving client is probably getting confused as to what the value is supposed to be because it is completely borked.
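To illustrate the two points above, here is a small sketch that builds the List-Unsubscribe value with every URI wrapped in angle brackets and emits List-Unsubscribe-Post as a separate header. It uses plain strings rather than any MimeKit API, and the ListUnsubscribe class and the example.com addresses are placeholders, not values from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListUnsubscribe
{
    // Wraps each URI in angle brackets and joins them with ", ".
    public static string FormatValue(IEnumerable<string> uris)
    {
        return string.Join(", ", uris.Select(u => "<" + u + ">"));
    }

    public static void Main()
    {
        string value = FormatValue(new[]
        {
            "https://example.com/unsubscribe?u=123",              // placeholder URL
            "mailto:unsubscribe@example.com?subject=unsubscribe", // placeholder address
        });

        // Two distinct headers, each with its own value.
        Console.WriteLine("List-Unsubscribe: " + value);
        Console.WriteLine("List-Unsubscribe-Post: List-Unsubscribe=One-Click");
    }
}
```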

Decoding querystring values on the server side

I'm running a web service on my server using WCF and .Net 4. The contract type is WebGet. Here is my problem: at one point, someone was sending URL-encoded data through the querystring, so I added HttpUtility.UrlDecode to decode the parameters. I think that fixed my issue at the time. Now I've sent a URL-encoded string to the service, and I see that the string has already been URL-decoded when it comes into the method (before it even gets to HttpUtility.UrlDecode).
So now I'm confused. If the .Net code decodes the value before it gets to my method, why would I need to call decode explicitly? For a time it apparently didn't, so is this a recent change to the underlying .Net framework?
My problem now is that my users are sending unencoded data that looks like "abc%1234", and I'm getting "abc34": the decoding is eating 3 characters. However, if I URL-encode the % sign so the data is "abc%251234", the value coming into the method is "abc%1234" (what I expected), and then the call to HttpUtility.UrlDecode changes it to "abc34" (which is not what I expected).
I'm not sure how to proceed here. Do I rip out the explicit call to UrlDecode until the data starts coming across encoded again, or is there a better way to handle this?

It's a subtle thing in documentation, easily missed:
HttpRequest.QueryString Property
Property Value
NameValueCollection
The query string variables sent by the client. Keys and values are
URL-decoded.
So if you access the query string via the HttpRequest.QueryString (or Params) collection, the values are already decoded.
You can get to the raw string through RawUrl or QueryString.ToString(), but then you have to handle it manually (splitting, manipulation, etc.).
End of day, %:
Because the percent ("%") character serves as the indicator for
percent-encoded octets, it must be percent-encoded as "%25" for that
octet to be used as data within a URI.
REF: RFC3986
Hth
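The effect of decoding twice can be reproduced in isolation. The sketch below uses System.Net.WebUtility.UrlDecode as a stand-in for HttpUtility.UrlDecode so that it does not need a reference to System.Web; that substitution, and the DoubleDecodeDemo class name, are assumptions made for illustration.

```csharp
using System;
using System.Net;

public static class DoubleDecodeDemo
{
    // Decodes a value twice, the way a framework decode followed by an explicit decode would.
    public static string DecodeTwice(string value)
    {
        return WebUtility.UrlDecode(WebUtility.UrlDecode(value));
    }

    public static void Main()
    {
        string sent = "abc%251234";

        // First decode (what the framework already does): %25 becomes a literal '%'.
        string once = WebUtility.UrlDecode(sent);
        Console.WriteLine(once); // abc%1234

        // Second decode (the explicit call): "%12" is now read as an escape for
        // the control character U+0012, which is typically invisible when displayed.
        string twice = DecodeTwice(sent);
        Console.WriteLine(twice.Length); // 6
    }
}
```

The second pass does not so much drop characters as replace "%12" with a single non-printable character. That is consistent with the "abc34" in the question and suggests the value should be decoded exactly once.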

Proper way to handle the ampersand character in JSON string send to REST web service

OK,
I am using the System.Runtime.Serialization and the DataContractJsonSerialization.
The problem is that my request contains a property value with the & character, say AT&T, and I get an error response: Invalid JSON Data.
I thought the library would handle the escaping, but now I see that serialization leaves the ampersand untouched.
Yes, that is valid for JSON.
But it is a problem for my POST request, because the server I send it to responds with an error whenever the data contains an ampersand. Hence this question.
HttpUtility.HtmlEncode is in the System.Web library, so the way to go is Uri.EscapeUriString. I tried it, but it made no difference: with or without it, all requests work fine except when a value contains an ampersand.
EDIT: The HttpUtility class has been ported to the Windows Phone SDK, but the preferred way to encode a string should still be Uri.EscapeUriString.
My first thought was to get my hands dirty and start replacing the special characters that would cause a problem on the server, but I wonder: is there another solution that would be efficient and 'proper'?
I should tell that I use
// Convert the string into a byte array.
byte[] postBytes = Encoding.UTF8.GetBytes(data);
To convert the JSON to a byte[] and write to the Stream.
And,
request.ContentType = "application/x-www-form-urlencoded";
As the WebRequest.ContentType.
So, have I messed something up, or is there something I am missing?
Thank you.

The problem was that I was encoding the whole request string, including the key.
My request had the form data={JSON} and I was encoding all of it, but only the {JSON} part should be encoded.
string requestData = "data=" + Uri.EscapeDataString(json); // worked perfectly!
Stupid hole to step into.
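The difference between the two approaches can be seen side by side. This is only a sketch of how the mistake might have looked: the JsonFormField class, its method names, and the sample JSON are hypothetical.

```csharp
using System;

public static class JsonFormField
{
    // Mistake: escaping the key and '=' along with the JSON hides the field name.
    public static string EncodeWholeRequest(string json)
    {
        return Uri.EscapeDataString("data=" + json);
    }

    // Fix: escape only the JSON value and leave "data=" as written.
    public static string EncodeValueOnly(string json)
    {
        return "data=" + Uri.EscapeDataString(json);
    }

    public static void Main()
    {
        string json = "{\"name\":\"AT&T\"}";
        Console.WriteLine(EncodeWholeRequest(json)); // data%3D%7B%22name%22%3A%22AT%26T%22%7D
        Console.WriteLine(EncodeValueOnly(json));    // data=%7B%22name%22%3A%22AT%26T%22%7D
    }
}
```

In the second form the ampersand inside AT&T travels as %26, so the server no longer treats it as the start of a new form field.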

Have you tried replacing the ampersand with &amp; for the POST?

Which HttpUtility decode method to use?

This may be a silly question, but it trips me up every time.
HttpUtility has the methods HtmlDecode and UrlDecode. Do these two methods decode anything (Html/Http related) I might find? When do I have to use them, and which one am I supposed to use?
Just now I hit an error. This is my error log:
Payment receiver was not payment@mysite.com. (it was payment%40mysite.com).
But, I wrapped the email address here in HttpUtility.HtmlDecode before using it. It turns out I have to use .UrlDecode instead, but this email address didn't come from a URL so this wasn't obvious to me.
Can someone clarify this?

See What is meant by htmlencode and urlencode?
It's the reverse of your case, but essentially you need to use UrlEncode/Decode anytime you are using an address of sorts (urls and yes, email addresses). HtmlEncode/Decode is for code that typically a browser would render (html/xml tags).

This same encoding is also used in Form POST requests as well.
My guess is something read it 'naked' without decoding it.
Html Encoding/Decoding is only used to escape strings that contain characters that would otherwise be interpreted as html control characters. The process turns the characters into html entities and back again.
Url Encoding is to get around the fact that many characters are not allowed in Uris; or because they too could be misinterpreted. Thus the percent encoding is used.
Percent encoding is also used in the body of http requests.
In both cases, of course, it's also a way of expressing a specific character code in a request/response independent of character sets; but equally, interpreting what is meant by a particular code can also be dependent on knowing a particular character set. Generally you don't worry about that - but it can be important (especially in the HTML case).

URLEncode converts characters that aren't allowed in a URL into character equivalents which are parsable as a URL. In your example @ became %40. URLDecode reverses this.
HTMLEncode is similar to URLEncode, but the target environment is text NESTED inside of HTML. This prevents the browser from interpreting your content as HTML, but when rendered it should look like the decoded version. HTMLDecode reverses this.

When you see %xx, this means percent encoding has occurred. This is a URL encoding scheme, so you need to use UrlEncode / UrlDecode.
The HtmlEncode and HtmlDecode methods are for encoding and decoding elements for HTML display, so things like & get encoded to &amp; and > to &gt;.
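A quick way to see which method applies is to run both over the value from the error log. The sketch below uses System.Net.WebUtility, which offers the same four operations without a System.Web reference; treating it as equivalent to HttpUtility for these inputs is an assumption, and the DecodeChoice class name is made up.

```csharp
using System;
using System.Net;

public static class DecodeChoice
{
    public static void Main()
    {
        string received = "payment%40mysite.com";

        // HtmlDecode only understands entities such as &amp;, so %40 is left alone.
        Console.WriteLine(WebUtility.HtmlDecode(received)); // payment%40mysite.com

        // UrlDecode understands percent escapes: %40 is '@'.
        Console.WriteLine(WebUtility.UrlDecode(received));  // payment@mysite.com

        // HTML encoding and decoding work on markup characters instead.
        Console.WriteLine(WebUtility.HtmlEncode("AT&T > 1"));        // AT&amp;T &gt; 1
        Console.WriteLine(WebUtility.HtmlDecode("AT&amp;T &gt; 1")); // AT&T > 1
    }
}
```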

Should HtmlEncode be used when updating twitter status with Tweetsharp?

I am using tweetsharp to send tweets.
var response = _twitter.AuthenticateWith(item.TwitterToken, item.TwitterSecret)
.Statuses().Update(HttpUtility.HtmlEncode(item.Tweet)).AsXml().Request().Response;
As you may have noticed, I am HtmlEncoding the message above, which can cause it to go over 140 chars. Is encoding the message this way necessary? Does tweetsharp or twitter recommend sending messages without encoding first?

TweetSharp will handle all of the encoding for you. Just pass it the string you want to post.

From here:
The Twitter API supports UTF-8
encoding. Please note that angle
brackets ("<" and ">") are
entity-encoded to prevent Cross-Site
Scripting attacks for web-embedded
consumers of JSON API output. The
resulting encoded entities do count
towards the 140 character limit. When
requesting XML, the response is UTF-8
encoded. Symbols and characters
outside of the standard ASCII range
may be translated to HTML entities.
This says to me that you should indeed make sure that your output is encoded (not necessarily HTML encoded) to UTF-8. Have you tried to UTF-8 encode and then submit, then look at the output of "special" characters?
