Serialize as NDJSON using Json.NET - c#

Is it possible to serialize to NDJSON (Newline Delimited JSON) using Json.NET? The Elasticsearch API uses NDJSON for bulk operations, and I can find nothing suggesting that this format is supported by any .NET libraries.
This answer provides guidance for deserializing NDJSON, and it was noted that one could serialize each row independently and join with newline, but I would not necessarily call that supported.

As Json.NET does not currently have a built-in method to serialize a collection to NDJSON, the simplest answer would be to write to a single TextWriter using a separate JsonTextWriter for each line, setting CloseOutput = false for each:
public static partial class JsonExtensions
{
    public static void ToNewlineDelimitedJson<T>(Stream stream, IEnumerable<T> items)
    {
        // Let caller dispose the underlying stream
        using (var textWriter = new StreamWriter(stream, new UTF8Encoding(false, true), 1024, true))
        {
            ToNewlineDelimitedJson(textWriter, items);
        }
    }

    public static void ToNewlineDelimitedJson<T>(TextWriter textWriter, IEnumerable<T> items)
    {
        var serializer = JsonSerializer.CreateDefault();
        foreach (var item in items)
        {
            // Formatting.None is the default; I set it here for clarity.
            using (var writer = new JsonTextWriter(textWriter) { Formatting = Formatting.None, CloseOutput = false })
            {
                serializer.Serialize(writer, item);
            }
            // https://web.archive.org/web/20180513150745/http://specs.okfnlabs.org/ndjson/
            // Each JSON text MUST conform to the [RFC7159] standard and MUST be written to the stream followed by the newline character \n (0x0A).
            // The newline character MAY be preceded by a carriage return \r (0x0D). The JSON texts MUST NOT contain newlines or carriage returns.
            textWriter.Write("\n");
        }
    }
}
Sample fiddle.
Since the individual NDJSON lines are likely to be short but the number of lines might be large, this answer suggests a streaming solution to avoid allocating a single string larger than 85 kB. As explained in Newtonsoft Json.NET Performance Tips, such large strings end up on the large object heap and may subsequently degrade application performance.
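As an illustration, a usage sketch (the anonymous item shapes and the output file name are just for this example):
// Usage sketch; the item shapes and "bulk.ndjson" are hypothetical.
var items = new object[]
{
    new { index = new { _index = "test", _id = "1" } },
    new { field1 = "value1" },
};
using (var stream = File.Create("bulk.ndjson"))
{
    JsonExtensions.ToNewlineDelimitedJson(stream, items);
}
// bulk.ndjson now holds one JSON object per line, suitable for the Elasticsearch _bulk endpoint.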

You could try this:
string ndJson = JsonConvert.SerializeObject(value, Formatting.Indented);
but now I see that you don't just want the serialized object to be pretty printed. If the object you are serializing is some kind of collection or enumeration, couldn't you just do this yourself by serializing each element?
StringBuilder sb = new StringBuilder();
foreach (var element in collection)
{
    sb.AppendLine(JsonConvert.SerializeObject(element, Formatting.None));
}
// use the NDJSON output
Console.WriteLine(sb.ToString());

Related

How to deserialize a JSONP response (preferably with JsonTextReader and not a string)?

I am trying to consume a web service that claims to return JSON, but actually always returns JSONP. I don't see a way to change that service's behavior.
I would like to use NewtonSoft Json.Net to parse the result. I have declared a class, let's call it MyType that I want to deserialize the inner JSON result into.
JSONP:
parseResponse({
    "total" : "13,769",
    "lower" : "1",
    "upper" : "20"})
As you can see this is not correct JSON as it has parseResponse( prefix and ) suffix. While this example is very simple, the actual response can be quite long, on the order of 100Ks.
MyType:
public class MyType
{
    public Decimal total;
    public int lower;
    public int upper;
}
After I get my web service response into a stream and JsonTextReader I try to deserialize like this:
(MyType)serializer.Deserialize(jsonTextReader, typeof(MyType));
Of course I get null for a result because there is that pesky parseResponse with round brackets.
I've taken a look at this question, which unfortunately does not help. I'm actually using a JsonTextReader to feed in the JSON, rather than a string (and prefer to keep it that way, to avoid the performance hit of creating a huge string). Even if I used the suggestion from that question, it looks dangerous because it uses a global replace. If there is no good way to use a stream, an answer with safe parsing of strings would be okay.
If I interpret your question as follows:
I am trying to deserialize some JSON from a Stream. The "JSON" is actually in JSONP format and so contains some prefix and postfix text I would like to ignore. How can I skip the prefix and postfix text while still reading and deserializing directly from stream rather than loading the entire stream into a string?
Then you can deserialize your JSON from a JSONP stream using the following extension method:
public static class JsonExtensions
{
    public static T DeserializeEmbeddedJsonP<T>(Stream stream)
    {
        using (var textReader = new StreamReader(stream))
            return DeserializeEmbeddedJsonP<T>(textReader);
    }

    public static T DeserializeEmbeddedJsonP<T>(TextReader textReader)
    {
        using (var jsonReader = new JsonTextReader(textReader.SkipPast('(')))
        {
            var settings = new JsonSerializerSettings
            {
                CheckAdditionalContent = false,
            };
            return JsonSerializer.CreateDefault(settings).Deserialize<T>(jsonReader);
        }
    }
}

public static class TextReaderExtensions
{
    public static TTextReader SkipPast<TTextReader>(this TTextReader reader, char ch) where TTextReader : TextReader
    {
        while (true)
        {
            var c = reader.Read();
            if (c == -1 || c == ch)
                return reader;
        }
    }
}
Notes:
Prior to constructing the JsonTextReader I construct a StreamReader and skip past the first '(' character in the stream. This positions the StreamReader at the beginning of the actual JSON.
Before deserialization I set JsonSerializerSettings.CheckAdditionalContent = false to tell the serializer to ignore any characters after the end of the JSON content. Oddly enough it is necessary to do this explicitly despite the fact that the default value seems to be false already, since the underlying field is nullable.
The same code can be used to deserialize embedded JSONP from a string by passing a StringReader to DeserializeEmbeddedJsonP<T>(TextReader reader);. Doing so avoids the need to create a new string by trimming the prefix and postfix text and so may improve performance and memory use even for smaller strings.
Sample working .Net fiddle.
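A minimal usage sketch against the sample payload from the question (the StringReader here stands in for the web-service stream; any TextReader or Stream works):
// Sketch only: the JSONP text is the sample from the question.
var jsonp = @"parseResponse({""total"" : ""13,769"", ""lower"" : ""1"", ""upper"" : ""20""})";
var result = JsonExtensions.DeserializeEmbeddedJsonP<MyType>(new StringReader(jsonp));
Console.WriteLine(result.lower); // 1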
It looks like it's returning JSONP. It's kind of weird that a web service would do that by default, without you including "?callback". In any case, if that's just the way it is, you can easily use a regex to strip off the method call:
var x = WebServiceCall();
x = Regex.Replace(x, @"^.+?\(|\)$", "");

Parse a string containing several json objects into something more convenient

The crux of the problem here is that I don't know any C#, yet find myself adding a feature to some test infrastructure which happens to be written in C#. I suspect this question is entirely trivial and beg your patience in answering. My colleagues who originally wrote this stuff are all out of the office.
I am parsing a string representing one or more json objects. So far I can get the first object, but can't work out how to access the remainder.
public class demo
{
    public void minimal()
    {
        // Note - the input is not quite json! I.e. I don't have
        // [{"Name" : "foo"}, {"Name" : "bar"}]
        // Each individual object is well formed, they just aren't in
        // a convenient array for easy parsing.
        // The string representations of the objects are literally concatenated.
        string data = @"{""Name"": ""foo""} {""Name"" : ""bar""}";
        System.Xml.XmlDictionaryReader jsonReader =
            JsonReaderWriterFactory.CreateJsonReader(Encoding.UTF8.GetBytes(data),
                new System.Xml.XmlDictionaryReaderQuotas());
        System.Xml.Linq.XElement root = XElement.Load(jsonReader);
        Assert.AreEqual(root.XPathSelectElement("//Name").Value, "foo");
        // The following clearly doesn't work
        Assert.AreEqual(root.XPathSelectElement("//Name").Value, "bar");
    }
}
I'm roughly at the point of rolling enough of a parser to work out where to split the string by counting braces but am hoping that the library support will do this for me.
The ideal end result is a sequential datastructure of your choice (list, vector? don't care) containing one System.Xml.Linq.XElement for each json object embedded in the string.
Thanks!
Edit: A roughly viable example, mostly due to George Richardson. I'm playing fast and loose with the type system (I'm not sure dynamic is available in C# 3.0), but the end result seems to be predictable.
public class demo
{
    private IEnumerable<Newtonsoft.Json.Linq.JObject> DeserializeObjects(string input)
    {
        var serializer = new JsonSerializer();
        using (var strreader = new StringReader(input))
        {
            using (var jsonreader = new JsonTextReader(strreader))
            {
                jsonreader.SupportMultipleContent = true;
                while (jsonreader.Read())
                {
                    yield return (Newtonsoft.Json.Linq.JObject)serializer.Deserialize(jsonreader);
                }
            }
        }
    }

    public void example()
    {
        string json = @"{""Name"": ""foo""} {""Name"" : ""bar""} {""Name"" : ""baz""}";
        var objects = DeserializeObjects(json);
        var array = objects.ToArray();
        Assert.AreEqual(3, array.Length);
        Assert.AreEqual(array[0]["Name"].ToString(), "foo");
        Assert.AreEqual(array[1]["Name"].ToString(), "bar");
        Assert.AreEqual(array[2]["Name"].ToString(), "baz");
    }
}
You are going to want to use Json.NET for your actual deserialization needs. The big problem I see here is that your JSON data is just concatenated together, which means you are going to have to extract each object from the string. Luckily Json.NET's JsonReader has a SupportMultipleContent property which does just this:
public void Main()
{
    string json = @"{""Name"": ""foo""} {""Name"" : ""bar""} {""Name"" : ""baz""}";
    IEnumerable<dynamic> deserialized = DeserializeObjects(json);
    string name = deserialized.First().Name; // name is "foo"
}

IEnumerable<object> DeserializeObjects(string input)
{
    JsonSerializer serializer = new JsonSerializer();
    using (var strreader = new StringReader(input))
    {
        using (var jsonreader = new JsonTextReader(strreader))
        {
            jsonreader.SupportMultipleContent = true;
            while (jsonreader.Read())
            {
                yield return serializer.Deserialize(jsonreader);
            }
        }
    }
}

How can I output a raw byte array into my XML using XmlWriter?

I'm trying to output some raw byte data in some of my XML nodes.
I do not believe the Base64 output to be suitable for my solution.
My current work is as follows:
To save to the file:
(Member function in the container class Foo)
public void save(String file)
{
    XmlWriterSettings settings = new XmlWriterSettings();
    XmlSerializer serializer = new XmlSerializer(typeof(Foo));
    XmlWriter writer = XmlWriter.Create(file, settings);
    serializer.Serialize(writer, this);
}
To serialize the class (the class implements IXmlSerializable):
(The data in Bytes is the raw data)
public void WriteXml(XmlWriter writer)
{
    char[] temp = new char[Bytes.Length];
    for (int i = 0; i < temp.Length; i++)
    {
        int n = (int)Bytes[i];
        temp[i] = (char)n;
    }
    writer.WriteRaw(temp, 0, temp.Length);
}
I'm certain that after this operation the data in temp exactly matches the data in Bytes, but after I have serialized the class, the raw data in the output file does not seem to match, although some parts look similar. I have also tried playing around with encoding settings on the XmlWriter, but that frequently ends in exceptions.
"I do not believe the Base64 output to be suitable for my solution." o_O, well... then this is not a programming question, but a philosophical one...
Still, and assuming many things, such as that the array Bytes contains the byte data of a file containing a serialized instance of Foo as per your save() method, keep in mind that char represents a Unicode character and is 2 bytes in size... You are adding more bits when converting each byte to a char...
Oh encodings, encodings, encodings... That's why there exists a Base64!!!
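If Base64 turns out to be acceptable after all, a minimal sketch of WriteXml using XmlWriter's built-in Base64 support might look like this (ReadXml would decode the element content with XmlReader.ReadElementContentAsBase64 or Convert.FromBase64String):
public void WriteXml(XmlWriter writer)
{
    // Encodes the raw bytes as Base64 text; nothing is lost to character re-encoding.
    writer.WriteBase64(Bytes, 0, Bytes.Length);
}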

Saving a Dictionary<String, Int32> in C# - Serialization?

I am writing a C# application that needs to read about 130,000 (String, Int32) pairs at startup into a Dictionary. The pairs are stored in a .txt file and are thus easily modifiable by anyone, which is dangerous in this context. I would like to ask if there is a way to save this dictionary so that the information can be reasonably safely stored, without losing performance at startup. I have tried using BinaryFormatter, but the problem is that while the original program takes between 125 ms and 250 ms at startup to read the information from the txt and build the dictionary, deserializing the resulting binary files takes up to 2 s, which is not too much by itself, but compared to the original performance it is an 8-16x slowdown.
Note: Encryption is important, but most important is a way to save and read the dictionary from disk - possibly from a binary file - without having to call Convert.ToInt32 on each line, thus improving performance.
Interesting question. I did some quick tests and you are right - BinaryFormatter is surprisingly slow:
Serialize 130,000 dictionary entries: 547ms
Deserialize 130,000 dictionary entries: 1046ms
When I coded it with a StreamReader/StreamWriter with comma separated values I got:
Serialize 130,000 dictionary entries: 121ms
Deserialize 130,000 dictionary entries: 111ms
But then I tried just using a BinaryWriter/BinaryReader:
Serialize 130,000 dictionary entries: 22ms
Deserialize 130,000 dictionary entries: 36ms
The code for that looks like this:
public void Serialize(Dictionary<string, int> dictionary, Stream stream)
{
    BinaryWriter writer = new BinaryWriter(stream);
    writer.Write(dictionary.Count);
    foreach (var kvp in dictionary)
    {
        writer.Write(kvp.Key);
        writer.Write(kvp.Value);
    }
    writer.Flush();
}

public Dictionary<string, int> Deserialize(Stream stream)
{
    BinaryReader reader = new BinaryReader(stream);
    int count = reader.ReadInt32();
    var dictionary = new Dictionary<string, int>(count);
    for (int n = 0; n < count; n++)
    {
        var key = reader.ReadString();
        var value = reader.ReadInt32();
        dictionary.Add(key, value);
    }
    return dictionary;
}
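A usage sketch, assuming the two methods above are accessible in scope ("data.bin" is just an example path):
// Usage sketch; "data.bin" is a hypothetical file name.
using (var output = File.Create("data.bin"))
    Serialize(dictionary, output);

using (var input = File.OpenRead("data.bin"))
    dictionary = Deserialize(input);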
As others have said though, if you are concerned about users tampering with the file, encryption, rather than binary formatting is the way forward.
If you want to have the data relatively safely stored, you can encrypt the contents. If you just encrypt it as a string and decrypt it before your current parsing logic, you should be safe. And, this should not impact performance that much.
See Encrypt and decrypt a string for more information.
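As a rough sketch of that idea (assuming AES from System.Security.Cryptography; key and IV management are deliberately left out and would need proper handling in practice):
// Sketch only: the key must be 16/24/32 bytes and the IV 16 bytes; hard-coding them is NOT safe.
static byte[] EncryptBytes(byte[] plain, byte[] key, byte[] iv)
{
    using (var aes = Aes.Create())
    using (var encryptor = aes.CreateEncryptor(key, iv))
        return encryptor.TransformFinalBlock(plain, 0, plain.Length);
}

static byte[] DecryptBytes(byte[] cipher, byte[] key, byte[] iv)
{
    using (var aes = Aes.Create())
    using (var decryptor = aes.CreateDecryptor(key, iv))
        return decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
}
The encrypted bytes could wrap the BinaryWriter output shown in the other answers.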
Encryption comes at the cost of key management. And, of course, even the fastest encryption/decryption algorithms are slower than no encryption at all. Same with compression, which will only help if you are I/O-bound.
If performance is your main concern, start looking at where the bottleneck actually is. If the culprit really is the Convert.ToInt32() call, I imagine you can store the Int32 bits directly and get away with a simple cast, which should be faster than parsing a string value. To obfuscate the strings, you can xor each byte with some fixed value (a sketch appears after the code below), which is fast but provides nothing more than a speed bump for a determined attacker.
Perhaps something like:
static void Serialize(string path, IDictionary<string, int> data)
{
    using (var file = File.Create(path))
    using (var writer = new BinaryWriter(file))
    {
        writer.Write(data.Count);
        foreach (var pair in data)
        {
            writer.Write(pair.Key);
            writer.Write(pair.Value);
        }
    }
}

static IDictionary<string, int> Deserialize(string path)
{
    using (var file = File.OpenRead(path))
    using (var reader = new BinaryReader(file))
    {
        int count = reader.ReadInt32();
        var data = new Dictionary<string, int>(count);
        while (count-- > 0)
        {
            data.Add(reader.ReadString(), reader.ReadInt32());
        }
        return data;
    }
}
Note this doesn't do anything re encryption; that is a separate concern. You might also find that adding deflate into the mix reduces file IO and increases performance:
static void Serialize(string path, IDictionary<string, int> data)
{
    using (var file = File.Create(path))
    using (var deflate = new DeflateStream(file, CompressionMode.Compress))
    using (var writer = new BinaryWriter(deflate))
    {
        writer.Write(data.Count);
        foreach (var pair in data)
        {
            writer.Write(pair.Key);
            writer.Write(pair.Value);
        }
    }
}

static IDictionary<string, int> Deserialize(string path)
{
    using (var file = File.OpenRead(path))
    using (var deflate = new DeflateStream(file, CompressionMode.Decompress))
    using (var reader = new BinaryReader(deflate))
    {
        int count = reader.ReadInt32();
        var data = new Dictionary<string, int>(count);
        while (count-- > 0)
        {
            data.Add(reader.ReadString(), reader.ReadInt32());
        }
        return data;
    }
}
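For the xor obfuscation mentioned earlier, a minimal sketch (the key byte 0x5A is arbitrary; this is a deterrent, not encryption):
// Obfuscation only: xoring with a fixed byte is trivially reversible.
static byte[] XorBytes(byte[] data, byte key = 0x5A)
{
    var result = new byte[data.Length];
    for (int i = 0; i < data.Length; i++)
        result[i] = (byte)(data[i] ^ key);
    return result; // applying the same xor again restores the original bytes
}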
Is it safe enough to use BinaryFormatter instead of storing the contents directly in a text file? Obviously not, because anyone can easily "destroy" the file by opening it in Notepad and adding something, even if all they see is strange characters. It's better to store it in a database. But if you insist on your approach, you can easily improve the performance a lot by using parallel programming in C# 4.0 (you can easily find a lot of useful examples by googling it). Something like this:
// just an example
Dictionary<string, int> source = GetTheDict();
var grouped = source.GroupBy(x =>
{
    if (x.Key.First() >= 'a' && x.Key.First() <= 'z') return "File1";
    else if (x.Key.First() >= 'A' && x.Key.First() <= 'Z') return "File2";
    return "File3";
});
Parallel.ForEach(grouped, g =>
{
    ThreeStreamsToWriteToThreeFilesParallelly(g);
});
An alternative to Parallel is creating several threads yourself; reading from and writing to different files on separate threads will be faster.
Well, using a BinaryFormatter isn't really a safe way to store the pairs, as you can write a very simple program to deserialize it (after, say, running Reflector on your code to get the type).
How about encrypting the txt file?
With something like this, for example? (For maximum performance, try without compression.)
