Strategy for splitting a large JSON file

I'm trying to split very large JSON files into smaller files for a given array. For example:
{
"headerName1": "headerVal1",
"headerName2": "headerVal2",
"headerName3": [{
"element1Name1": "element1Value1"
},
{
"element2Name1": "element2Value1"
},
{
"element3Name1": "element3Value1"
},
{
"element4Name1": "element4Value1"
},
{
"element5Name1": "element5Value1"
},
{
"element6Name1": "element6Value1"
}]
}
...down to { "elementNName1": "elementNValue1" } where N is a large number
The user provides the name of the array to be split (in this example "headerName3") and the number of array objects per file, e.g. 1,000,000.
This would result in a set of files, each containing the top-level name:value pairs (headerName1, headerName2) and up to 1,000,000 of the headerName3 objects.
I'm using the excellent Newtonsoft Json.NET and understand that I need to do this using a stream.
So far I have looked at reading in JToken objects to establish where the PropertyName == "headerName3" occurs when reading in the tokens, but what I would like to do is then read in the entire JSON object for each object in the array without having to continue parsing the JSON into JTokens.
Here's a snippet of the code I am building so far:
using (StreamReader oSR = File.OpenText(strInput))
{
using (var reader = new JsonTextReader(oSR))
{
while (reader.Read())
{
if (reader.TokenType == JsonToken.StartObject)
{
intObjectCount++;
}
else if (reader.TokenType == JsonToken.EndObject)
{
intObjectCount--;
if (intObjectCount == 1)
{
intArrayRecordCount++;
// Here I want to read the entire object for this record into an untyped JSON object
if( intArrayRecordCount % 1000000 == 0)
{
//write these to the split file
}
}
}
}
}
}
I don't know the structure of the JSON itself (and in fact am not concerned with it), and the objects within the array can have varying structures. I am therefore not deserializing to classes.
Is this the right approach? Is there a set of methods in the JSON.net library I can easily use to perform such operation?
Any help appreciated.

You can use JsonWriter.WriteToken(JsonReader reader, true) to stream individual array entries and their descendants from a JsonReader to a JsonWriter. You can also use JProperty.Load(JsonReader reader) and JProperty.WriteTo(JsonWriter writer) to read and write entire properties and their descendants.
Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to new file(s).
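Before looking at the full state machine, it may help to see the WriteToken building block in isolation. The following is a minimal sketch, not part of the original answer: the class and method names (WriteTokenDemo, CopyArrayEntries) are made up for illustration, and it assumes the input is a root-level JSON array with no comments.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class WriteTokenDemo
{
    // Hypothetical helper: copies each entry of a root-level JSON array
    // to its own string without building a JToken for it.
    public static List<string> CopyArrayEntries(string json)
    {
        var results = new List<string>();
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            if (!reader.Read() || reader.TokenType != JsonToken.StartArray)
                throw new JsonException("Expected a JSON array");

            // Each Read() lands on the first token of the next entry (or EndArray).
            while (reader.Read() && reader.TokenType != JsonToken.EndArray)
            {
                var sw = new StringWriter();
                using (var writer = new JsonTextWriter(sw))
                {
                    // Streams the current token and all of its descendants;
                    // afterwards the reader sits on the entry's last token.
                    writer.WriteToken(reader, true);
                }
                results.Add(sw.ToString());
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (var entry in CopyArrayEntries("[{\"a\":1},{\"b\":[2,3]}]"))
            Console.WriteLine(entry);
    }
}
```

Because each entry is streamed token by token, memory use depends on the nesting depth of an entry rather than on the size of the whole array.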
Here's a prototype implementation that takes a TextReader and a callback function to create sequential output TextWriter objects for the split file:
enum SplitState
{
InPrefix,
InSplitProperty,
InSplitArray,
InPostfix,
}
public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
{
List<JProperty> prefixProperties = new List<JProperty>();
List<JProperty> postFixProperties = new List<JProperty>();
List<JsonWriter> writers = new List<JsonWriter>();
SplitState state = SplitState.InPrefix;
long count = 0;
try
{
using (var reader = new JsonTextReader(textReader))
{
bool doRead = true;
while (doRead ? reader.Read() : true)
{
doRead = true;
if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
continue;
if (reader.Depth == 0)
{
if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
throw new JsonException("JSON root container is not an Object");
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
{
if ((string)reader.Value == tokenName)
{
state = SplitState.InSplitProperty;
}
else
{
if (state == SplitState.InSplitProperty)
state = SplitState.InPostfix;
var property = JProperty.Load(reader);
doRead = false; // JProperty.Load() will have already advanced the reader.
if (state == SplitState.InPrefix)
{
prefixProperties.Add(property);
}
else
{
postFixProperties.Add(property);
}
}
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
{
state = SplitState.InSplitArray;
}
else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
{
state = SplitState.InSplitProperty;
}
else if (state == SplitState.InSplitArray && reader.Depth == 2)
{
if (count % maxItems == 0)
{
var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
writers.Add(writer);
writer.WriteStartObject();
foreach (var property in prefixProperties)
property.WriteTo(writer);
writer.WritePropertyName(tokenName);
writer.WriteStartArray();
}
count++;
writers.Last().WriteToken(reader, true);
}
else
{
throw new JsonException("Internal error");
}
}
}
foreach (var writer in writers)
using (writer)
{
writer.WriteEndArray();
foreach (var property in postFixProperties)
property.WriteTo(writer);
writer.WriteEndObject();
}
}
finally
{
// Make sure files are closed in the event of an exception.
foreach (var writer in writers)
using (writer)
{
}
}
}
This method leaves all the files open until the end in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files, this won't work. If postfix properties are never encountered in practice, you can just close each file before opening the next and throw an exception in case any postfix properties are found. Otherwise you may need to parse the large file in two passes or close and reopen the split files to append them.
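The "close each file before opening the next" variant described above could look roughly like the following. This is a sketch under stated assumptions, not the answer's own code: the names JsonSplitterSketch, SplitJsonNoPostfix, and WriteSlices are hypothetical, the input is assumed to contain no comments, and any property that follows the split array causes an exception instead of being buffered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonSplitterSketch
{
    // Hypothetical variant of SplitJson: only one output is open at a time and
    // postfix properties are rejected. Returns the number of outputs created.
    public static int SplitJsonNoPostfix(TextReader textReader, string tokenName, long maxItems,
        Func<int, TextWriter> createStream, Formatting formatting)
    {
        if (maxItems <= 0) throw new ArgumentOutOfRangeException(nameof(maxItems));
        var prefix = new List<JProperty>();
        int fileCount = 0;
        bool seenArray = false;
        using (var reader = new JsonTextReader(textReader))
        {
            if (!reader.Read() || reader.TokenType != JsonToken.StartObject)
                throw new JsonException("JSON root container is not an Object");
            reader.Read(); // first property name, or EndObject
            while (reader.TokenType == JsonToken.PropertyName)
            {
                string name = (string)reader.Value;
                reader.Read(); // move to the property value
                if (name != tokenName)
                {
                    if (seenArray)
                        throw new JsonException("Postfix property not supported: " + name);
                    // JToken.Load leaves the reader on the last token of the value.
                    prefix.Add(new JProperty(name, JToken.Load(reader)));
                }
                else
                {
                    if (reader.TokenType != JsonToken.StartArray)
                        throw new JsonException("Split property is not an array");
                    seenArray = true;
                    fileCount += WriteSlices(reader, prefix, tokenName, maxItems, createStream, formatting, fileCount);
                }
                reader.Read(); // next property name, or EndObject
            }
        }
        return fileCount;
    }

    private static int WriteSlices(JsonReader reader, List<JProperty> prefix, string tokenName,
        long maxItems, Func<int, TextWriter> createStream, Formatting formatting, int firstIndex)
    {
        int opened = 0;
        long inCurrent = 0;
        JsonTextWriter writer = null;
        try
        {
            while (reader.Read() && reader.TokenType != JsonToken.EndArray)
            {
                if (writer == null)
                {
                    writer = new JsonTextWriter(createStream(firstIndex + opened)) { Formatting = formatting };
                    opened++;
                    writer.WriteStartObject();
                    foreach (var property in prefix)
                        property.WriteTo(writer);
                    writer.WritePropertyName(tokenName);
                    writer.WriteStartArray();
                }
                writer.WriteToken(reader, true); // copy one entry and its descendants
                if (++inCurrent == maxItems)
                {
                    Finish(writer);
                    writer = null;
                    inCurrent = 0;
                }
            }
            if (writer != null)
            {
                Finish(writer); // last, partially filled slice
                writer = null;
            }
        }
        finally
        {
            // Only reached with an open writer if an exception interrupted the loop.
            if (writer != null) writer.Close();
        }
        return opened;
    }

    private static void Finish(JsonTextWriter writer)
    {
        writer.WriteEndArray();
        writer.WriteEndObject();
        writer.Close(); // by default this also closes the underlying TextWriter
    }
}
```

With this shape the number of simultaneously open files stays at one, at the cost of failing on inputs that do have properties after the array. An empty array produces no output files at all, which may or may not be what you want.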
Here is an example of how to use the method with an in-memory JSON string:
private static void TestSplitJson(string json, string tokenName)
{
var builders = new List<StringBuilder>();
using (var reader = new StringReader(json))
{
SplitJson(reader, tokenName, 2, i => { builders.Add(new StringBuilder()); return new StringWriter(builders.Last()); }, Formatting.Indented);
}
foreach (var s in builders.Select(b => b.ToString()))
{
Console.WriteLine(s);
}
}

Related

ReadOuterXml is throwing OutOfMemoryException reading part of large (1 GB) XML file

I am working with a large XML file, and while the application is running, the XmlTextReader.ReadOuterXml() method throws an OutOfMemoryException.
The code looks like this:
XmlTextReader xr = null;
try
{
xr = new XmlTextReader(fileName);
while (xr.Read() && success)
{
if (xr.NodeType != XmlNodeType.Element)
continue;
switch (xr.Name)
{
case "A":
var xml = xr.ReadOuterXml();
var n = GetDetails(xml);
break;
}
}
}
catch (Exception ex)
{
//Do stuff
}
Using:
private int GetDetails (string xml)
{
var rootNode = XDocument.Parse(xml);
var xnodes = rootNode.XPathSelectElements("//A/B").ToList();
//Then working on list of nodes
}
While loading the XML file, the application throws the exception on the xr.ReadOuterXml() line. What can be done to avoid this? The size of the XML file is almost 1 GB.

The most likely reason you are getting an OutOfMemoryException in ReadOuterXml() is that you are trying to read a substantial portion of the 1 GB XML document into a string, and are hitting the maximum string length in .NET.
So, don't do that. Instead load directly from the XmlReader using XDocument.Load() with XmlReader.ReadSubtree():
using (var xr = XmlReader.Create(fileName))
{
while (xr.Read() && success)
{
if (xr.NodeType != XmlNodeType.Element)
continue;
switch (xr.Name)
{
case "A":
{
// ReadSubtree() positions the reader at the EndElement of the element read, so the
// next call to Read() moves to the next node.
using (var subReader = xr.ReadSubtree())
{
var doc = XDocument.Load(subReader);
GetDetails(doc);
}
}
break;
}
}
}
And then in GetDetails() do:
private int GetDetails(XDocument rootDocument)
{
var xnodes = rootDocument.XPathSelectElements("//A/B").ToList();
//Then working on list of nodes
return xnodes.Count;
}
Not only will this use less memory, it will also be more performant. ReadOuterXml() uses a temporary XmlWriter to copy the XML in the input stream to an output StringWriter (which you then parse a second time). This version of the algorithm completely skips this extra work. It also avoids creating strings large enough to go on the large object heap which can cause additional performance issues.
If this is still using too much memory you will need to implement SAX-like parsing for your XML where you only load one element <B> at a time. First, introduce the following extension method:
public static partial class XmlReaderExtensions
{
public static IEnumerable<XElement> WalkXmlElements(this XmlReader xmlReader, Predicate<Stack<XName>> filter)
{
Stack<XName> names = new Stack<XName>();
while (xmlReader.Read())
{
if (xmlReader.NodeType == XmlNodeType.Element)
{
names.Push(XName.Get(xmlReader.LocalName, xmlReader.NamespaceURI));
if (filter(names))
{
using (var subReader = xmlReader.ReadSubtree())
{
yield return XElement.Load(subReader);
}
}
}
if ((xmlReader.NodeType == XmlNodeType.Element && xmlReader.IsEmptyElement)
|| xmlReader.NodeType == XmlNodeType.EndElement)
{
names.Pop();
}
}
}
}
Then, use it as follows:
using (var xr = XmlReader.Create(fileName))
{
Predicate<Stack<XName>> filter =
(stack) => stack.Peek().LocalName == "B" && stack.Count > 1 && stack.ElementAt(1).LocalName == "A";
foreach (var element in xr.WalkXmlElements(filter))
{
//Then working on the specific node.
}
}

using (var reader = XmlReader.Create(fileName))
{
XmlDocument oXml = new XmlDocument();
while (reader.Read())
{
oXml.Load(reader);
}
}
For me, the code above resolved the issue: it loads the content into an XmlDocument through the XmlDocument.Load() method instead of going through a string.

Can we deserialize InstallState file?

I am trying to pass the information in the InstallState file to the installer class, which will then perform the uninstall.
But before passing it, I need to convert the information to a System.Collections.IDictionary savedState.
Is it possible to deserialize the InstallState file for this purpose?

If you use the AssemblyInstaller class, it appears (although this doesn't seem to be documented) that it will, in general, ignore any passed savedState parameter and will instead deal with the INSTALLSTATE file itself (writing it during install, reading it during uninstall).
If you're unable to use it, for some reason, you can probably use a disassembly tool to extract the necessary code from its Uninstall method to perform the deserialization (I believe, and it appears so, that the specific serialization methods used vary between .NET versions, so I'd recommend using the one appropriate to whichever .NET version you're currently working with).
This is the Uninstall method, decompiled from System.Configuration.Install (File Version 4.6.1590.0):
public override void Uninstall(IDictionary savedState)
{
this.PrintStartText(Res.GetString("InstallActivityUninstalling"));
if (!this.initialized)
{
this.InitializeFromAssembly();
}
string installStatePath = this.GetInstallStatePath(this.Path);
if ((installStatePath != null) && File.Exists(installStatePath))
{
FileStream input = new FileStream(installStatePath, FileMode.Open, FileAccess.Read);
XmlReaderSettings settings = new XmlReaderSettings {
CheckCharacters = false,
CloseInput = false
};
XmlReader reader = null;
if (input != null)
{
reader = XmlReader.Create(input, settings);
}
try
{
if (reader != null)
{
NetDataContractSerializer serializer = new NetDataContractSerializer();
savedState = (Hashtable) serializer.ReadObject(reader);
}
goto Label_00C6;
}
catch
{
object[] args = new object[] { this.Path, installStatePath };
base.Context.LogMessage(Res.GetString("InstallSavedStateFileCorruptedWarning", args));
savedState = null;
goto Label_00C6;
}
finally
{
if (reader != null)
{
reader.Close();
}
if (input != null)
{
input.Close();
}
}
}
savedState = null;
Label_00C6:
base.Uninstall(savedState);
if ((installStatePath != null) && (installStatePath.Length != 0))
{
try
{
File.Delete(installStatePath);
}
catch
{
object[] objArray2 = new object[] { installStatePath };
throw new InvalidOperationException(Res.GetString("InstallUnableDeleteFile", objArray2));
}
}
}
You'll notice that it doesn't use whatever was passed to it as savedState: by the time it uses that variable for anything (here, passing it to its base class), it has either overwritten it from the INSTALLSTATE file or assigned null to it.

JObject.SelectToken Equivalent in .NET

I need to remove the outer node of a JSON. So an example would be :
{
app: {
...
}
}
Any ideas on how to remove the outer node, so we get only
{
...
}
WITHOUT using JSON.NET, only tools in the .NET Framework (C#).
In Json.NET I used:
JObject.Parse(json).SelectToken("app").ToString();
Alternatively, any configuration of the DataContractJsonSerializer that makes it ignore the root when deserializing would also work. The way I do the deserialization now is:
protected T DeserializeJsonString<T>(string jsonString)
{
T tempObject = default(T);
using (var memoryStream = new MemoryStream(Encoding.Unicode.GetBytes(jsonString)))
{
var serializer = new DataContractJsonSerializer(typeof(T));
tempObject = (T)serializer.ReadObject(memoryStream);
}
return tempObject;
}
Note that the root object's property name can differ from case to case. For example it can be "transaction".
Thanks for any suggestion.

There is no equivalent to SelectToken built into .Net. But if you simply want to unwrap an outer root node and do not know the node name in advance, you have the following options.
If you are using .Net 4.5 or later, you can deserialize to a Dictionary<string, T> with DataContractJsonSerializer.UseSimpleDictionaryFormat = true:
protected T DeserializeNestedJsonString<T>(string jsonString)
{
using (var memoryStream = new MemoryStream(Encoding.Unicode.GetBytes(jsonString)))
{
var serializer = new DataContractJsonSerializer(typeof(Dictionary<string, T>));
serializer.UseSimpleDictionaryFormat = true;
var dictionary = (Dictionary<string, T>)serializer.ReadObject(memoryStream);
if (dictionary == null || dictionary.Count == 0)
return default(T);
else if (dictionary.Count == 1)
return dictionary.Values.Single();
else
{
throw new InvalidOperationException("Root object has too many properties");
}
}
}
Note that if your root object contains more than one property, you cannot deserialize to a Dictionary<TKey, TValue> to get the first property since the order of the items in this class is undefined.
On any version of .Net that supports the data contract serializers, you can take advantage of the fact that DataContractJsonSerializer inherits from XmlObjectSerializer to call JsonReaderWriterFactory.CreateJsonReader() to create an XmlReader that actually reads JSON, then skip forward to the first nested "element":
protected T DeserializeNestedJsonStringWithReader<T>(string jsonString)
{
var reader = JsonReaderWriterFactory.CreateJsonReader(Encoding.Unicode.GetBytes(jsonString), System.Xml.XmlDictionaryReaderQuotas.Max);
int elementCount = 0;
while (reader.Read())
{
if (reader.NodeType == System.Xml.XmlNodeType.Element)
elementCount++;
if (elementCount == 2) // At elementCount == 1 there is a synthetic "root" element
{
var serializer = new DataContractJsonSerializer(typeof(T));
return (T)serializer.ReadObject(reader, false);
}
}
return default(T);
}
This technique looks odd (parsing JSON with an XmlReader?), but with some extra work it should be possible to extend this idea to create SAX-like parsing functionality for JSON that is similar to SelectToken(), skipping forward in the JSON until a desired property is found, then deserializing its value.
For instance, to select and deserialize specific named properties, rather than just the first root property, the following can be used:
public static class DataContractJsonSerializerExtensions
{
public static T DeserializeNestedJsonProperty<T>(string jsonString, string rootPropertyName)
{
// Check for count == 2 because there is a synthetic <root> element at the top.
Predicate<Stack<string>> match = s => s.Count == 2 && s.Peek() == rootPropertyName;
return DeserializeNestedJsonProperties<T>(jsonString, match).FirstOrDefault();
}
public static IEnumerable<T> DeserializeNestedJsonProperties<T>(string jsonString, Predicate<Stack<string>> match)
{
DataContractJsonSerializer serializer = null;
using (var reader = JsonReaderWriterFactory.CreateJsonReader(Encoding.UTF8.GetBytes(jsonString), XmlDictionaryReaderQuotas.Max))
{
var stack = new Stack<string>();
while (reader.Read())
{
if (reader.NodeType == System.Xml.XmlNodeType.Element)
{
stack.Push(reader.Name);
if (match(stack))
{
serializer = serializer ?? new DataContractJsonSerializer(typeof(T));
yield return (T)serializer.ReadObject(reader, false);
}
if (reader.IsEmptyElement)
stack.Pop();
}
else if (reader.NodeType == XmlNodeType.EndElement)
{
stack.Pop();
}
}
}
}
}
See Mapping Between JSON and XML for details on how JsonReaderWriterFactory maps JSON to XML.
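To see the synthetic root element and the type attributes that the counting logic above relies on, you can dump the XML view of a small JSON document. This is an illustrative sketch only; the class and method names (JsonXmlMappingDemo, ToMappedXml) are made up here.

```csharp
using System;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class JsonXmlMappingDemo
{
    // Hypothetical helper: returns the XML infoset that the JSON reader exposes.
    public static string ToMappedXml(string json)
    {
        using (var reader = JsonReaderWriterFactory.CreateJsonReader(
            Encoding.UTF8.GetBytes(json), XmlDictionaryReaderQuotas.Max))
        {
            // XElement.Load consumes the reader exactly as it would a real XML reader.
            return XElement.Load(reader).ToString(SaveOptions.DisableFormatting);
        }
    }

    public static void Main()
    {
        // The outer <root type="object"> element is synthetic; "app" is the
        // second element, which is why the code above waits for elementCount == 2.
        Console.WriteLine(ToMappedXml("{\"app\":{\"name\":\"x\"}}"));
    }
}
```

For this input the output should contain an app element nested inside root, each carrying type="object", with the name value typed as a string.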

How to deserialize JSON with duplicate property names in the same object

I have a JSON string that I expect to contain duplicate keys that I am unable to make JSON.NET happy with.
I was wondering if anybody knows the best way (maybe using a JsonConverter?) to get JSON.NET to change a JObject's child JObjects into JArrays when it sees duplicate key names?
// For example: This gives me a JObject with a single "JProperty\JObject" child.
var obj = JsonConvert.DeserializeObject<object>("{ \"HiThere\":1}");
// This throws:
// System.ArgumentException : Can not add Newtonsoft.Json.Linq.JValue to Newtonsoft.Json.Linq.JObject.
obj = JsonConvert.DeserializeObject<object>("{ \"HiThere\":1, \"HiThere\":2, \"HiThere\":3 }");
The actual JSON I am trying to deserialize is much more complicated and the duplicates are nested at multiple levels. But the code above demonstrates why it fails for me.
I understand that the JSON is not correct which is why I am asking if JSON.NET has a way to work around this. For argument's sake let's say I do not have control over the JSON. I actually do use a specific type for the parent object but the particular property that is having trouble will either be a string or another nested JSON object. The failing property type is "object" for this reason.

Interesting question. I played around with this for a while and discovered that while a JObject cannot contain properties with duplicate names, the JsonTextReader used to populate it during deserialization does not have such a restriction. (This makes sense if you think about it: it's a forward-only reader; it is not concerned with what it has read in the past). Armed with this knowledge, I took a shot at writing some code that will populate a hierarchy of JTokens, converting property values to JArrays as necessary if a duplicate property name is encountered in a particular JObject. Since I don't know your actual JSON and requirements, you may need to make some adjustments to it, but it's something to start with at least.
Here's the code:
public static JToken DeserializeAndCombineDuplicates(JsonTextReader reader)
{
if (reader.TokenType == JsonToken.None)
{
reader.Read();
}
if (reader.TokenType == JsonToken.StartObject)
{
reader.Read();
JObject obj = new JObject();
while (reader.TokenType != JsonToken.EndObject)
{
string propName = (string)reader.Value;
reader.Read();
JToken newValue = DeserializeAndCombineDuplicates(reader);
JToken existingValue = obj[propName];
if (existingValue == null)
{
obj.Add(new JProperty(propName, newValue));
}
else if (existingValue.Type == JTokenType.Array)
{
CombineWithArray((JArray)existingValue, newValue);
}
else // Convert existing non-array property value to an array
{
JProperty prop = (JProperty)existingValue.Parent;
JArray array = new JArray();
prop.Value = array;
array.Add(existingValue);
CombineWithArray(array, newValue);
}
reader.Read();
}
return obj;
}
if (reader.TokenType == JsonToken.StartArray)
{
reader.Read();
JArray array = new JArray();
while (reader.TokenType != JsonToken.EndArray)
{
array.Add(DeserializeAndCombineDuplicates(reader));
reader.Read();
}
return array;
}
return new JValue(reader.Value);
}
private static void CombineWithArray(JArray array, JToken value)
{
if (value.Type == JTokenType.Array)
{
foreach (JToken child in value.Children())
array.Add(child);
}
else
{
array.Add(value);
}
}
And here's a demo:
class Program
{
static void Main(string[] args)
{
string json = @"
{
""Foo"" : 1,
""Foo"" : [2],
""Foo"" : [3, 4],
""Bar"" : { ""X"" : [ ""A"", ""B"" ] },
""Bar"" : { ""X"" : ""C"", ""X"" : ""D"" },
}";
using (StringReader sr = new StringReader(json))
using (JsonTextReader reader = new JsonTextReader(sr))
{
JToken token = DeserializeAndCombineDuplicates(reader);
Dump(token, "");
}
}
private static void Dump(JToken token, string indent)
{
Console.Write(indent);
if (token == null)
{
Console.WriteLine("null");
return;
}
Console.Write(token.Type);
if (token is JProperty)
Console.Write(" (name=" + ((JProperty)token).Name + ")");
else if (token is JValue)
Console.Write(" (value=" + token.ToString() + ")");
Console.WriteLine();
if (token.HasValues)
foreach (JToken child in token.Children())
Dump(child, indent + " ");
}
}
Output:
Object
  Property (name=Foo)
    Array
      Integer (value=1)
      Integer (value=2)
      Integer (value=3)
      Integer (value=4)
  Property (name=Bar)
    Array
      Object
        Property (name=X)
          Array
            String (value=A)
            String (value=B)
      Object
        Property (name=X)
          Array
            String (value=C)
            String (value=D)
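The observation this answer rests on, that the reader itself does not reject duplicate names, can be checked in isolation. The following is a minimal sketch; the class and method names (DuplicateNameProbe, ReadPropertyNames) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class DuplicateNameProbe
{
    // Hypothetical helper: lists every property name the reader encounters,
    // in document order, without building a JObject.
    public static List<string> ReadPropertyNames(string json)
    {
        var names = new List<string>();
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.PropertyName)
                    names.Add((string)reader.Value);
            }
        }
        return names;
    }

    public static void Main()
    {
        var names = ReadPropertyNames("{ \"HiThere\":1, \"HiThere\":2, \"HiThere\":3 }");
        Console.WriteLine(string.Join(", ", names));
    }
}
```

The reader simply reports each PropertyName token as it goes; the uniqueness check only happens later, when something tries to add the properties to a JObject.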

Brian Rogers - Here is the helper function of the JsonConverter that I wrote. I modified it based on your comment that a JsonTextReader is just a forward-only reader that doesn't care about duplicate property names.
private static object GetObject(JsonReader reader)
{
switch (reader.TokenType)
{
case JsonToken.StartObject:
{
var dictionary = new Dictionary<string, object>();
while (reader.Read() && (reader.TokenType != JsonToken.EndObject))
{
if (reader.TokenType != JsonToken.PropertyName)
throw new InvalidOperationException("Unknown JObject conversion state");
string propertyName = (string) reader.Value;
reader.Read();
object propertyValue = GetObject(reader);
object existingValue;
if (dictionary.TryGetValue(propertyName, out existingValue))
{
if (existingValue is List<object>)
{
var list = existingValue as List<object>;
list.Add(propertyValue);
}
else
{
var list = new List<object> {existingValue, propertyValue};
dictionary[propertyName] = list;
}
}
else
{
dictionary.Add(propertyName, propertyValue);
}
}
return dictionary;
}
case JsonToken.StartArray:
{
var list = new List<object>();
while (reader.Read() && (reader.TokenType != JsonToken.EndArray))
{
object propertyValue = GetObject(reader);
list.Add(propertyValue);
}
return list;
}
default:
{
return reader.Value;
}
}
}

You should not be deserializing to the generic type object; use a more specific type.
However, your JSON is malformed, which is your main problem.
You have:
"{ \"HiThere\":1, \"HiThere\":2, \"HiThere\":3 }"
But it should be:
"{ \"HiTheres\": [{\"HiThere\":1}, {\"HiThere\":2}, {\"HiThere\":3} ]}"
Or
"{ \"HiThereOne\":1, \"HiThereTwo\":2, \"HiThereThree\":3 }"
Your JSON is one object with three fields that all have the same name ("HiThere"),
which won't work.
The JSON I have shown gives:
An array (HiTheres) of three objects, each with a field named HiThere
Or
One object with three fields with different names (HiThereOne, HiThereTwo, HiThereThree)
Have a look at http://jsoneditoronline.org/index.html
And http://json.org/

yield pattern, state machine flow

I have the following file and I am using an iterator block to parse certain recurring nodes/parts within the file. I initially used regex to parse the entire file, but when certain fields were not present in a node, it would not match. So I am trying to use the yield pattern. The file format is shown below, followed by the code I am using. All I want from the file are the REPLICATE nodes, each as an individual part, so I can fetch fields within it using a key string and store them in a collection of objects. I can start parsing where the first replicate occurs, but I am unable to end it where the replicate node ends.
File Format:
X_HEADER
{
DATA_MANAGEMENT_FIELD_2 NA
DATA_MANAGEMENT_FIELD_3 NA
DATA_MANAGEMENT_FIELD_4 NA
SYSTEM_SOFTWARE_VERSION NA
}
Y_HEADER
{
DATA_MANAGEMENT_FIELD_2 NA
DATA_MANAGEMENT_FIELD_3 NA
DATA_MANAGEMENT_FIELD_4 NA
SYSTEM_SOFTWARE_VERSION NA
}
COMPLETION
{
NUMBER 877
VERSION 4
CALIBRATION_VERSION 1
CONFIGURATION_ID 877
}
REPLICATE
{
REPLICATE_ID 1985
ASSAY_NUMBER 656
ASSAY_VERSION 4
ASSAY_STATUS Research
DILUTION_ID 1
}
REPLICATE
{
REPLICATE_ID 1985
ASSAY_NUMBER 656
ASSAY_VERSION 4
ASSAY_STATUS Research
}
Code:
static IEnumerable<IDictionary<string, string>> ReadParts(string path)
{
using (var reader = File.OpenText(path))
{
var current = new Dictionary<string, string>();
string line;
while ((line = reader.ReadLine()) != null)
{
if (string.IsNullOrWhiteSpace(line)) continue;
if (line.StartsWith("REPLICATE"))
{
yield return current;
current = new Dictionary<string, string>();
}
else
{
var parts = line.Split('\t');
}
if (current.Count > 0) yield return current;
}
}
}
public static void parseFile(string fileName)
{
foreach (var part in ReadParts(fileName))
{
// part["FIELD1"] will retrieve certain values from the REPLICATE part here
}
}

Well, it sounds like you just need to "close" a section when you get a closing brace, and only yield return at that point. For example:
static IEnumerable<IDictionary<string, string>> ReadParts(string path)
{
using (var reader = File.OpenText(path))
{
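// BadDataException below is assumed to be a custom exception type; it is not part of .NET.
string line;
string currentName = null;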
IDictionary<string, string> currentMap = null;
while ((line = reader.ReadLine()) != null)
{
if (string.IsNullOrWhiteSpace(line))
{
continue;
}
if (line == "{")
{
if (currentName == null || currentMap != null)
{
throw new BadDataException("Open brace at wrong place");
}
currentMap = new Dictionary<string, string>();
}
else if (line == "}")
{
if (currentName == null || currentMap == null)
{
throw new BadDataException("Closing brace at wrong place");
}
// Isolate the "REPLICATE-only" requirement to a single
// line - if you ever need other bits, you can change this.
if (currentName == "REPLICATE")
{
yield return currentMap;
}
currentName = null;
currentMap = null;
}
else if (!line.StartsWith("\t"))
{
if (currentName != null || currentMap != null)
{
throw new BadDataException("Section name at wrong place");
}
currentName = line;
}
else
{
if (currentName == null || currentMap == null)
{
throw new BadDataException("Name/value pair at wrong place");
}
var parts = line.Substring(1).Split('\t');
if (parts.Length != 2)
{
throw new BadDataException("Invalid name/value pair");
}
currentMap[parts[0]] = parts[1];
}
}
}
}
Now that's a pretty ghastly function, to be honest. I suspect I'd put this in its own class instead (possibly a nested one) to store the state, and make each handler its own method. Heck, this is actually a situation where the state pattern could make sense :)
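One way the "own class, one method per handler" idea might look is sketched below. It is not the answerer's code: the SectionParser name and its members are hypothetical, InvalidDataException stands in for the custom BadDataException, and lines are classified by parser state and split on the first space or tab rather than by a leading tab, which is an assumption about the file format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical class: the parsing state lives in fields and each kind of
// line has its own small handler method.
public sealed class SectionParser
{
    private string currentName;                     // section name seen before "{"
    private Dictionary<string, string> currentMap;  // non-null only inside braces

    public IEnumerable<IDictionary<string, string>> Parse(TextReader reader, string wantedSection)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0) continue;

            if (line == "{") OpenSection();
            else if (line == "}")
            {
                var finished = CloseSection(wantedSection);
                if (finished != null) yield return finished;
            }
            else if (currentMap == null) StartSection(line);
            else AddPair(line);
        }
    }

    private void StartSection(string name)
    {
        if (currentName != null)
            throw new InvalidDataException("Section name at wrong place");
        currentName = name;
    }

    private void OpenSection()
    {
        if (currentName == null || currentMap != null)
            throw new InvalidDataException("Open brace at wrong place");
        currentMap = new Dictionary<string, string>();
    }

    // Returns the finished map only for the wanted section, otherwise null.
    private IDictionary<string, string> CloseSection(string wantedSection)
    {
        if (currentMap == null)
            throw new InvalidDataException("Closing brace at wrong place");
        var result = currentName == wantedSection ? currentMap : null;
        currentName = null;
        currentMap = null;
        return result;
    }

    private void AddPair(string line)
    {
        int split = line.IndexOfAny(new[] { ' ', '\t' });
        if (split < 0)
            throw new InvalidDataException("Invalid name/value pair");
        currentMap[line.Substring(0, split)] = line.Substring(split + 1).Trim();
    }
}
```

A caller would then write something like foreach (var part in new SectionParser().Parse(File.OpenText(path), "REPLICATE")) and read part["REPLICATE_ID"]. Since the state is held in instance fields, a fresh parser should be created for each file.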

private IEnumerable<IDictionary<string, string>> ParseFile(System.IO.TextReader reader)
{
string token = reader.ReadLine();
while (token != null)
{
bool isReplicate = token.StartsWith("REPLICATE");
token = reader.ReadLine(); //consume this token to either skip it or parse it
if (isReplicate)
{
yield return ParseBlock(ref token, reader);
}
}
}
private IDictionary<string, string> ParseBlock(ref string token, System.IO.TextReader reader)
{
if (token != "{")
{
throw new Exception("Missing opening brace.");
}
token = reader.ReadLine();
var result = ParseValues(ref token, reader);
if (token != "}")
{
throw new Exception("Missing closing brace.");
}
token = reader.ReadLine();
return result;
}
private IDictionary<string, string> ParseValues(ref string token, System.IO.TextReader reader)
{
IDictionary<string, string> result = new Dictionary<string, string>();
while (token != "}" && token != null)
{
var args = token.Split('\t');
if (args.Length < 2)
{
throw new Exception();
}
result.Add(args[0], args[1]);
token = reader.ReadLine();
}
return result;
}

If you add a yield return current; after your while loop is over, you will get the final dictionary.
I believe it would be better to check for '}' as the end of the current block, and then put the yield return there. Although you can't use regex to parse the entire file, you can use regex to search for the key-value pairs within the lines. The following iterator code should work. It will only return dictionaries for REPLICATE blocks.
// Check for lines that are a key-value pair, separated by whitespace.
// Note that value is optional
static string partPattern = @"^(?<Key>\w*)(\s+(?<Value>.*))?$";
static IEnumerable<IDictionary<string, string>> ReadParts(string path)
{
using (var reader = File.OpenText(path))
{
string line;
while ((line = reader.ReadLine()) != null)
{
// Ignore lines that just contain whitespace
if (string.IsNullOrWhiteSpace(line)) continue;
// This is a new replicate block, start a new dictionary
if (line.Trim().CompareTo("REPLICATE") == 0)
{
yield return parseReplicateBlock(reader);
}
}
}
}
private static IDictionary<string, string> parseReplicateBlock(StreamReader reader)
{
// Make sure we have an opening brace
VerifyOpening(reader);
string line;
var currentDictionary = new Dictionary<string, string>();
while ((line = reader.ReadLine()) != null)
{
// Ignore lines that just contain whitespace
if (string.IsNullOrWhiteSpace(line)) continue;
line = line.Trim();
// Since our regex used groupings (?<Key> and ?<Value>),
// we can do a match and check to see if our groupings
// found anything. If they did, extract the key and value.
Match m = Regex.Match(line, partPattern);
if (m.Groups["Key"].Length > 0)
{
currentDictionary.Add(m.Groups["Key"].Value, m.Groups["Value"].Value);
}
else if (line.CompareTo("}") == 0)
{
return currentDictionary;
}
}
// We exited the loop before we found a closing brace, throw an exception
throw new ApplicationException("Missing closing brace");
}
private static void VerifyOpening(StreamReader reader)
{
string line;
while ((line = reader.ReadLine()) != null)
{
// Ignore lines that just contain whitespace
if (string.IsNullOrWhiteSpace(line)) continue;
if (line.Trim().CompareTo("{") == 0)
{
return;
}
else
{
throw new ApplicationException("Missing opening brace");
}
}
throw new ApplicationException("Missing opening brace");
}
Update: I made sure that the regex string includes cases where there is no value. In addition, the group indexes were all changed to use the group name to avoid any issues if the regex string is modified.
