I struggle with safely encoding html-like text in json. The text should be written into a <textarea>, transferred by ajax to the server (.net45 mvc) and stored in a database in a json-string.
When transferring to the server, I get the famous "A potentially dangerous Request.Form value was detected" 500 server error. To avoid this message, I use the [AllowHtml] attribute on the model property that is transferred. By doing so I open up an XSS vulnerability, in case anyone pastes in { "key1": "<script>alert(\"danger!\")</script>" }. As such, I would like to use something like
tableData.Json = AntiXssEncoder.HtmlEncode(json, true);
Problem is I cannot do this on the full json string, as it will render something like
{&quot;key1&quot;: ...}
which of course is not what I want. It should be more like
{ "key1": "&lt;script&gt;alert(&quot;danger!&quot;)&lt;/script&gt;" }
With this result the user can write whatever code they want, but I can prevent it from being rendered as HTML and just display it as ordinary text. Does anyone know how to traverse JSON with C# (Newtonsoft Json.NET) such that strings can be encoded with AntiXssEncoder.HtmlEncode(... , ....);? Or am I on the wrong track here?
Edit:
The data is non-uniform, so deserialization into uniform objects is not an option.
The data will probably be opened to the public, so storing the data encoded would ease my soul.
If you already have the data as a JSON string, you could parse it into proper objects with something like Json.NET using JsonConvert.DeserializeObject() (or anything else, there are actually quite a few options to choose from). Once it's plain objects, you can go through them and apply any encoding you want, then serialize them again into a JSON string. You can also have a look at this question and its answers.
Another approach that you may take is to leave the data alone until you actually insert it into the page DOM. You can store unencoded data in the database, and you can even send it to the client as JSON without HTML encoding (of course it needs to be encoded for JSON, but any serializer does that). You need to be careful not to write it directly into the page source this way, but as long as it's an AJAX response with an application/json content type, it's fine. Then on the client, when you insert it into the actual textarea, you need to make sure you insert it as text, and not as html. Technically this could mean using jQuery's .text() instead of .html(), or the relevant method of your template engine or client-side data binding solution (text: instead of html: in Knockout, #: instead of #= in say Kendo UI, etc.)
The advantage of this latter approach is that when sending the data, the server (something like an API) does not need to know or care about where or how a client will use the data; it's just data. The client may need different encoding for an HTML or a Javascript context, and the server cannot necessarily choose the right one.
If you know it's just that text area though where this data is needed, you can of course take the first (your original) approach, encode it on the server, that's equally good (some may argue that's even better in that scenario).
The problem with answering this question is that details count a lot. In theory, there are a myriad of ways you could do it right, but sometimes a good solution differs from a vulnerable one in one single character.
So this is the solution I went for. I added the [AllowHtml] attribute in the ViewModel, so that I could send raw html from the textarea (through ajax).
With this attribute I avoid the System.Web.HttpRequestValidationException that MVC gives to protect against XSS dangers.
Then I traverse the json-string by parsing it as a JToken and encode the strings:
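using System.Collections.Generic;
using System.Web.Security.AntiXss;
using Newtonsoft.Json.Linq;

public class JsonUtils
{
    // Parses the JSON, HTML-encodes every string value, and returns the re-serialized JSON.
    public static string HtmlEncodeJTokenStrings(string jsonString)
    {
        var reconstruct = JToken.Parse(jsonString);

        // Walk the token tree iteratively with an explicit stack.
        var stack = new Stack<JToken>();
        stack.Push(reconstruct);
        while (stack.Count > 0)
        {
            var item = stack.Pop();
            if (item.Type == JTokenType.String)
            {
                var valueItem = item as JValue;
                if (valueItem == null)
                    continue;
                var value = valueItem.Value<string>();
                valueItem.Value = AntiXssEncoder.HtmlEncode(value, true);
            }

            // Objects, arrays and properties expose their contents as children.
            foreach (var child in item.Children())
            {
                stack.Push(child);
            }
        }
        return reconstruct.ToString();
    }
}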
The resulting json-string will still be valid and I store it in DB. Now, when printing it in a View, I can use the strings directly from json in JS.
When opening it again in another <textarea> for editing, I have to decode the html entities. For that I "stole" some js-code (decodeHtmlEntities) from string.js; of course adding the licence and credit note.
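If decoding on the server were preferred over the client-side decodeHtmlEntities call, the same traversal idea could be reversed. The following is only a sketch under that assumption: the class and method names are hypothetical, and it uses System.Net.WebUtility.HtmlDecode as the decoder rather than anything from the original solution.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using Newtonsoft.Json.Linq;

public static class JsonDecodeSketch
{
    // Hypothetical counterpart to HtmlEncodeJTokenStrings:
    // HTML-decodes every string value and leaves property names untouched.
    public static string HtmlDecodeJTokenStrings(string jsonString)
    {
        JToken root = JToken.Parse(jsonString);

        // A bare string at the root has no descendants, so handle both shapes.
        IEnumerable<JToken> tokens = root is JContainer container
            ? container.Descendants()
            : new[] { root };

        // ToList() snapshots the values before any of them is modified.
        foreach (JValue value in tokens.OfType<JValue>()
                                       .Where(v => v.Type == JTokenType.String)
                                       .ToList())
        {
            value.Value = WebUtility.HtmlDecode((string)value.Value);
        }

        return root.ToString();
    }
}
```

Whether to decode on the server or in the browser depends on where the textarea is populated; the sketch only shows that the stored JSON can be walked the same way in both directions.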
Hope this helps anyone.
Where I work, they use an application called Checkmarx to analyze the security of the application.
In one of these analyses, the tool detected the following problems:
Reflected XSS All Clients:
The application's GetBarcosNaoVinculados embeds untrusted data in the
generated output with Json, at line 1243 of
.../Controllers/AdminUserController.cs. This untrusted data is
embedded straight into the output without proper sanitization or
encoding, enabling an attacker to inject malicious code into the
output. The attacker would be able to alter the returned web page by
simply providing modified data in the user input usuarioId, which is
read by the GetBarcosNaoVinculados method at line 1243 of
.../Controllers/AdminUserController.cs. This input then flows through
the code straight to the output web page, without sanitization.
public JsonResult GetBarcosNaoVinculados(string usuarioId)
.....
.....
return Json(barcosNaoVinculados, JsonRequestBehavior.AllowGet)
Elsewhere in the system it gives the same problem but with these two methods
The application's LoadCodeRve embeds untrusted data in the generated
output with SerializeObject, at line 738 of
.../BR.Rve.UI.Site/Controllers/InfoApontamentoController.cs. This
untrusted data is embedded straight into the output without proper
sanitization or encoding, enabling an attacker to inject malicious
code into the output. The attacker would be able to alter the returned
web page by saving malicious data in a data-store ahead of time. The
attacker's modified data is then read from the database by the Buscar
method with Where, at line 78 of .../Repository/Repository.cs. This
untrusted data then flows through the code straight to the output web
page, without sanitization.
public virtual IEnumerable<TEntity> Buscar(Expression<Func<TEntity, bool>>predicate)
return Dbset.Where(predicate);
public string LoadCodeRve()
return JsonConvert.SerializeObject(items);
It seems that it has to do with the treatment given to the JSON format, would anyone know how to treat this type of problem?
As the warning message indicates, you need to perform some form of input validation (or sanitization) and also, as a secure coding best practice, output encoding before rendering the output into the page. Checkmarx searches for the existence of these "sanitizers", which are predefined in its queries. One example is the AntiXSS library (e.g. the JavaScriptEncode function).
The two critical lines to look out for are already pointed out by Checkmarx:
return Json(barcosNaoVinculados, JsonRequestBehavior.AllowGet)
and
return JsonConvert.SerializeObject(items);
Wherever these values (JSON or string) end up on a page, they need to be escaped. Depending on the templating engine you are using, you might already get XSS protection out of the box. For example, "The Razor engine used in MVC automatically encodes all output sourced from variables, unless you work really hard to prevent it doing so", unless of course you use the Html.Raw helper method.
As promoters of application security we believe in not trusting the input and having layers of defenses so my suggestion is to explicitly indicate that you want to encode the output by passing in JsonSerializerSettings argument:
return JsonConvert.SerializeObject(items, new JsonSerializerSettings { StringEscapeHandling = StringEscapeHandling.EscapeHtml });
The only dilemma here is that Checkmarx might not recognize this as a sanitizer, because it may not be in its predefined list of sanitizers. You could always present this solution as an argument to the security team that runs the scans.
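To make the effect of that setting concrete, here is a minimal sketch. The class name and the sample dictionary are made up for illustration; only JsonConvert, JsonSerializerSettings and StringEscapeHandling.EscapeHtml come from Json.NET.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;

public static class EscapeHtmlSketch
{
    // Serializes with HTML-sensitive characters (<, >, &, ', ") written as \uXXXX escapes.
    public static string Serialize(object items)
    {
        var settings = new JsonSerializerSettings
        {
            StringEscapeHandling = StringEscapeHandling.EscapeHtml
        };
        return JsonConvert.SerializeObject(items, settings);
    }

    // Illustrative payload, not taken from the application in question.
    public static Dictionary<string, string> SampleItems()
    {
        return new Dictionary<string, string>
        {
            ["key1"] = "<script>alert('danger!')</script>"
        };
    }
}
```

The escaped output is still ordinary JSON: a parser turns the \uXXXX sequences back into the original characters, so the data is unchanged for legitimate consumers, while the raw text no longer contains a literal <script> tag if it is accidentally placed in an HTML context.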
For the case of the JsonResult return, you may want to javascript encode the barcosNaoVinculados variable:
return Json(HttpUtility.JavaScriptStringEncode(barcosNaoVinculados), JsonRequestBehavior.AllowGet)
Checkmarx may not recognize this either. You can try the encoders that Checkmarx does recognize (e.g. Encoder.JavaScriptEncode or AntiXss.JavaScriptEncode), but I don't think these NuGet packages will work in your project type.
I have a Windows service that publishes data to a .NET client and a web client over SignalR. I've recently had some funny issues, but can't quite get a consistent behavior.
The problem lies in serializing the degrees sign, i.e. "°C". Most of the time it is serialized correctly, but I've had a few times where I see the following in my debugger:
See how the first time the "°" is serialized correctly, but the second time we see the question marks in the diamonds?
I've read this means it is an invalid UTF-8 character. But then why do all the other properties serialize correctly? This is a screenshot where you see one correct and one incorrect, but the entire JSON contains hundreds of these "°C" strings that look correct.
So why this one exception? It's not always the same position/property, and it doesn't always happen. This makes me think it must be a combination with preceding/succeeding characters, no?
Any ideas how to fix this or at least how to investigate this further?
Update
This is how I do serialization. I set it up on startup:
var serializerSettings = new JsonSerializerSettings();
serializerSettings.Converters.Add(new StructuredAmountJsonConverter());
var serializer = JsonSerializer.Create(serializerSettings);
GlobalHost.DependencyResolver = new AutofacDependencyResolver(_lifetimeScope);
GlobalHost.DependencyResolver.Register(typeof(JsonSerializer), () => serializer);
What's happening here is I'm telling SignalR to use Autofac to resolve dependencies. Then I register my JSON.NET serializer. The JSON.NET serializer has one custom converter, which converts my Amount class to the structure you see above (with a value and a unit property).
So you could think the problem lies in the converter, but then why is it working 95% of the time? Or should I specify the encoding in my converter?
Update 2
I've been using Fiddler to capture my network traffic and I can't see the wrong characters there. So I'm guessing the encoding problem is at the client side. I will investigate further.
Update 3
I've managed to capture the traffic in Fiddler and while it looks good in the Text view, when I select the HexView I do see something weird:
Notice how it says "Â°C" instead of "°C". So maybe it is sending it from the server in the wrong way?
Also, keep in mind my client is a .NET (WPF) client. This is my code to connect on the client side (simplified):
var url = "myUrl...";
_hubConnection = new HubConnection(url);
var hubProxy = _hubConnection.CreateHubProxy("MyHub");
hubProxy.On<object>("receive", OnDataReceived);
await _hubConnection.Start();
And when receiving data:
var message = JsonConvert.DeserializeObject<MyDataContract>(obj.ToString(), new StructuredAmountJsonConverter());
Update 4
This post makes me think this is happening:
The server/SignalR is sending my data as UTF-8, but the client is expecting latin-1 or Windows-1252 (probably the latter). So now I need to find out how I can make it use UTF-8.
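A small experiment can help tell the two symptoms apart. This is a sketch with hypothetical helper names and it does not diagnose the SignalR setup itself: the first method shows what UTF-8 bytes look like when read as a single-byte encoding (ISO-8859-1 is used here because it is always available), and the second shows one general way the "question mark in a diamond" (U+FFFD) can appear, namely when a multi-byte sequence is split and each part is decoded on its own. Whether either of these is what happens in this application is an assumption to verify.

```csharp
using System.Text;

public static class EncodingMismatchSketch
{
    // Encodes text as UTF-8, then decodes the bytes with a single-byte encoding.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Decodes the UTF-8 bytes in two independent chunks, split at splitIndex.
    // If the split lands inside a multi-byte sequence, the default decoder
    // substitutes U+FFFD for the pieces it cannot interpret.
    public static string DecodeInChunks(string text, int splitIndex)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        string first = Encoding.UTF8.GetString(bytes, 0, splitIndex);
        string second = Encoding.UTF8.GetString(bytes, splitIndex, bytes.Length - splitIndex);
        return first + second;
    }
}
```

For "\u00B0C" (the degree sign followed by C), DecodeUtf8AsLatin1 yields "Â°C", matching the hex view, while DecodeInChunks with splitIndex 1 yields a string containing U+FFFD. The first symptom points at a wrong charset on the reading side; the second points at how the bytes are buffered before decoding.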
May I know how you are serializing the objects? I think there is a way to specify the encoding types for serializing special characters like this. Here is a link I found- Special characters in object making JSON invalid (DataContractJsonSerializer)
Say I have a sample Json format string as
string per1 = @"[{""Email"":""AAA"",""mj_campaign_id"":""22"",""mj_contact_id"":""PPP"",""customcampaign"":""AAA"",""blocked"":""22"",""hard_bounce"":""PPP"",""blocked"":""22"",""hard_bounce"":""PPP""},"
+ @"{""Email"":""BBB"",""mj_campaign_id"":""25"",""mj_contact_id"":""QQQ"",""customcampaign"":""AAA"",""blocked"":""22"",""hard_bounce"":""PPP"",""blocked"":""22""},"
+ @"{""Email"":""CCC"",""mj_campaign_id"":""38"",""mj_contact_id"":""RRR"",""customcampaign"":""AAA"",""blocked"":""22"",""hard_bounce"":""PPP""}]";
I am trying to deserialize it using
var result = JsonConvert.DeserializeObject(per1);
It's working fine as long as all the rows of the string have values for the following attributes: Email, mj_campaign_id, mj_contact_id, customcampaign, blocked, hard_bounce, error_related_to, error. But when I skip some attribute values in some rows, it throws an error saying
Can not add Newtonsoft.Json.Linq.JValue to Newtonsoft.Json.Linq.JObject.
Any help would be appreciated. Thanks
Your error is because you are not assigning a value to an object, which you need to do. If you remove the value, at least add an empty string.
THAT SAID!
Herein lies the danger of manually building JSON strings. You should always avoid it if you can. If you are reading from a web page, that web page should serialize the payload for you, and then you should deserialize it with whatever you are using to pull in the payload (controller, restful service, etc). The beauty of .NET is that it handles all of this plumbing for you and you really are going to run into painful issues if you try to reinvent the .NET wheel
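As a rough illustration of letting the serializer do the plumbing, the sketch below builds rows as dictionaries and reads them back through JObject. The class, the method names and the reduced set of fields are illustrative assumptions, not the asker's real data model.

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class CampaignJsonSketch
{
    // Builds the JSON with the serializer instead of concatenating strings.
    // A row may simply omit an attribute; the output stays valid JSON.
    public static string BuildJson()
    {
        var rows = new List<Dictionary<string, string>>
        {
            new Dictionary<string, string>
            {
                ["Email"] = "AAA", ["mj_campaign_id"] = "22", ["blocked"] = "22"
            },
            new Dictionary<string, string>
            {
                ["Email"] = "BBB", ["mj_campaign_id"] = "25" // no "blocked" here
            }
        };
        return JsonConvert.SerializeObject(rows);
    }

    // Reads an attribute from a row, falling back to an empty string when it is absent.
    public static string ReadOrDefault(JObject row, string name)
    {
        return (string)row[name] ?? "";
    }
}
```

Parsing the result with JArray.Parse and calling ReadOrDefault on each row then works whether or not a given attribute is present, which is the case the hand-built string struggles with.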
Say I have a textBox and a property to get and set its value:
public string SomeText
{
    get { return HttpUtility.HtmlEncode(textBox.Text); }
    set { textBox.Text = HttpUtility.HtmlEncode(value); }
}
I have used HtmlEncode to prevent Javascript injection attacks. After thinking about it though I'm thinking I only need the HtmlEncode on the getter. The setter is only used by the system and can not be accessed by an external user.
Is this correct?
A couple points;
First:
You should really only encode values when you display them, and not any other time. By encoding them as you get the value from the box, and also when you paste in, you could end up with a real mess, that will just get worse and worse any time someone edits the values. You should not encode the values (against HTML/Javascript injection - you DO need to protect against SQL injection, of course) upon saving to the database in most cases, especially if that value could later be edited. In such a case, you actually need to decode it upon loading it back... not encode it again. But again; it's much simpler only to encode when displaying it (which includes displaying for editing, btw)
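The "real mess" is easy to reproduce. In this sketch (the class and member names are hypothetical, and System.Net.WebUtility stands in for whichever HTML encoder the application uses), the raw text is stored untouched and encoded only in the display accessor; EncodeTwice shows what happens when an already-encoded value is encoded again on the next edit cycle.

```csharp
using System.Net;

public class CommentSketch
{
    // Stored exactly as the user typed it.
    public string RawText { get; set; }

    // Encoded once, at the moment of display.
    public string DisplayHtml
    {
        get { return WebUtility.HtmlEncode(RawText); }
    }

    // Demonstrates the double-encoding problem: each pass re-encodes the ampersands.
    public static string EncodeTwice(string text)
    {
        return WebUtility.HtmlEncode(WebUtility.HtmlEncode(text));
    }
}
```

With RawText set to "Tom & Jerry", DisplayHtml is "Tom &amp; Jerry", whereas EncodeTwice returns "Tom &amp;amp; Jerry", which the browser would show as the literal text "Tom &amp; Jerry".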
Second:
HtmlEncode protects against injecting HTML - which can include a <script> block which would run Javascript, true. But this also protects against generally malicious HTML that has nothing to do with Javascript. But protecting against Javascript injection is almost a different thing; that is, if you might ever display something entered by the user in, say, a javascript alert('%VARIABLE'); you have to do a totally different kind of encoding there than what you are doing.
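A short comparison may make the difference tangible. This is a sketch with hypothetical method names; it assumes System.Web's HttpUtility is available in the project, and it uses WebUtility.HtmlEncode for the HTML side.

```csharp
using System.Net;
using System.Web;

public static class ContextEncodingSketch
{
    // Suitable for text placed between HTML tags.
    public static string ForHtmlBody(string userInput)
    {
        return WebUtility.HtmlEncode(userInput);
    }

    // Suitable for a value emitted as a JavaScript string literal;
    // the result already includes the surrounding double quotes.
    public static string ForJavaScriptString(string userInput)
    {
        return HttpUtility.JavaScriptStringEncode(userInput, true);
    }
}
```

A backslash is a simple test case: HTML encoding leaves it alone because it means nothing in HTML, yet inside a JavaScript string literal it starts an escape sequence. ForHtmlBody returns a\b unchanged for the input a\b, while ForJavaScriptString returns "a\\b" with the backslash doubled.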
Yes. You only need to encode strings that you have accepted from the users and you have to show inside your pages.
In my DB, I have a text that is markdown'ed. The same way SO does when showing excerpts of questions, I would like to get the first N characters of the text, i.e. all formatting must be removed. Of course the MD -> HTML step must be avoided and the work must be done on the MD'ed text. Performance is a requirement. Thx.
In my DB, I have a text that is markdown'ed. The same way SO does when showing excerpts of questions, I would like to get the first N characters of the text, i.e. all formatting must be removed.
We store both representations of the text in the database:
Raw Markdown suitable for editing
HTML-ized version suitable for output
and when we display it, we use the HTML-ized output version and simply apply our standard HTML stripping algorithms.
Forgive me if I'm misunderstanding (or simply under-understanding) what you need to do here, but it occurs to me that if there are more reads (page views) than inserts (additions of new markdown'ed records) to this database, then from a performance standpoint you may make the biggest gain by saving a version of the text with all markup stripped in a separate field in the database. That way your front-end doesn't have to repeatedly parse what it reads from the database before displaying it in the browser; it would be parsed only once, when a new record is added.
Whether or not this actually makes sense from a performance standpoint depends on a variety of variables specific to your situation... how big the text entries are, how often records are inserted versus read, etc.
The way that I would handle this is by defining a formatter interface for the class containing/representing your marked down text. You'd then have concrete implementations that support HTML formatting and plain text formatting. All you would need to do is inject the correct implementation and call the formatter.
Your plain text formatter could simply iterate through the characters in the string, copying characters until it hits some markdown. It would then skip the markdown and start outputting again when it hits the text.
using System;

public interface IFormatter
{
    string Format();
}

public class HtmlFormatter : IFormatter
{
    public string Format()
    {
        // ...string translated to HTML...
        throw new NotImplementedException();
    }
}

public class PlainTextFormatter : IFormatter
{
    public string Format()
    {
        // ...go through and remove all markdown and return rest
        throw new NotImplementedException();
    }
}

public class Post
{
    public IFormatter Formatter { get; set; }

    // Falls back to HTML output when no formatter is injected.
    public Post(IFormatter formatter)
    {
        this.Formatter = formatter ?? new HtmlFormatter();
    }

    public string Format()
    {
        return this.Formatter.Format();
    }
}
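For the plain text side, a very rough sketch of the stripping step could look like the following. It swaps the character-by-character loop described above for a few regular expressions, it covers only inline links and images, heading and blockquote markers, and emphasis or inline-code characters, and all names are hypothetical. It is not a Markdown parser: for instance, it also removes underscores inside words such as snake_case.

```csharp
using System.Text.RegularExpressions;

public static class MarkdownExcerptSketch
{
    // Removes a small, assumed subset of Markdown syntax and collapses whitespace.
    public static string ToPlainText(string markdown)
    {
        string text = markdown;
        // Inline links and images: keep only the visible text.
        text = Regex.Replace(text, @"!?\[([^\]]*)\]\([^)]*\)", "$1");
        // Heading (#) and blockquote (>) markers at the start of a line.
        text = Regex.Replace(text, @"^\s{0,3}(#{1,6}|>)\s*", "", RegexOptions.Multiline);
        // Emphasis and inline-code markers.
        text = Regex.Replace(text, @"[*_`]", "");
        // Collapse runs of whitespace, including line breaks, into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns at most maxChars characters of the stripped text.
    public static string Excerpt(string markdown, int maxChars)
    {
        string plain = ToPlainText(markdown);
        return plain.Length <= maxChars ? plain : plain.Substring(0, maxChars);
    }
}
```

Running ToPlainText once when a post is saved and storing the result, as suggested in the other answers, would keep this cost off the read path.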
Here is the path I'm taking: I will modify the markdown code so that, with a switch, I can either produce html or simple text. Once the excerpt has been generated, I will surely store it in the DB.
I won't tag any answer as the solution since there are many ways to do it. Everyone gets my vote ;)