I have recently been tasked with writing a piece of software that will import Excel files.
The problem I am trying to solve is that my company has around 100 clients, and each supplies a file in a different layout, in as much as the columns in a file will differ between clients, but the pertinent information is present in each file.
The process is complicated by the fact that different files require different operations.
In one file, for example, a column needs to be inserted after a specific column, and then the result of a calculation needs to be placed into that column. In that same sheet, an address is supplied across 9 columns; this address needs to be moved into the last 6 of the 9 columns and then have the first 3 columns removed.
What I don't want to do is write the processing logic for each file (around 100, as mentioned) and thereby get trapped into the drudgery of having to maintain this code and be responsible for adding new customer files as they come in.
What I want to do is create a Rule or Processing engine of sorts whereby I can have basic rules like "Insert Column", "Remove Column", "Insert Calculation", "Format a, b, c, d, e & f Columns To Use d, e & f" - the reason being so that configuring the read and process of any new file can be done through a front-end piece of software by an end user (obviously with some training on what to do).
Is there a pattern or strategy that might fit this? I have read about rules engines, but the best examples of these are simple boolean comparisons like "Age = 15" or "Surname = 'Smith'"; I can't find a decent example of doing something like "Insert Column after Column G" followed by "Put G - 125 into Column H".
Any help here, or a pointer to a good approach, would be greatly appreciated.
Let me see if I can help you out here.
Correct me if I am wrong, but it seems like all your input and output files contain data in columns and columns only.
In that case, you should imagine your problem as a transformation of X input columns to Y output columns. For each client, you will need a configuration that specifies the transform. The configuration might look like the following:
Y1 = X1
Y2 = X1 + X2
Y3 = X3 + " some string"
As you can see, your configuration lines are simply C# expressions. You can use the LINQ Expression class to build an expression tree from your transformation formulas; these expressions can then be compiled and used to perform the actual transform. In C# terms, you will build a static transform method per client that takes a list as input and returns a list as output. If you use Expressions, you will have to parse the configuration files yourself.
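To make the Expression-tree approach concrete, here is a minimal sketch of building and compiling the equivalent of the configuration line "Y2 = X1 + X2" for string columns. The column-to-index mapping and the use of string concatenation are my assumptions for illustration:

public class ExpressionTransformDemo
{
    public static void Main()
    {
        // Build the equivalent of: input => input[0] + input[1]
        var input = System.Linq.Expressions.Expression.Parameter(typeof(string[]), "input");
        var x1 = System.Linq.Expressions.Expression.ArrayIndex(input, System.Linq.Expressions.Expression.Constant(0));
        var x2 = System.Linq.Expressions.Expression.ArrayIndex(input, System.Linq.Expressions.Expression.Constant(1));

        // "+" on strings maps to string.Concat(string, string)
        var body = System.Linq.Expressions.Expression.Call(
            typeof(string).GetMethod("Concat", new[] { typeof(string), typeof(string) }),
            x1, x2);

        var transform = System.Linq.Expressions.Expression
            .Lambda<System.Func<string[], string>>(body, input)
            .Compile();

        System.Console.WriteLine(transform(new[] { "foo", "bar" })); // prints "foobar"
    }
}

A real implementation would parse each configuration line into such a tree; this only shows the building blocks.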
You can also use the Roslyn Compiler Services, which can support proper C# syntax. This way, you can literally have a static method which can do the transform. This also relieves you of the parsing duties.
In either case, you will still have to deal with questions like: should I treat every column as a string (which means your support staff must explicitly instruct the configuration GUI to parse the needed columns into numbers), or should I automatically convert number-like fields into numbers (support then does no extra configuration, but may hit issues with columns such as IDs that contain numbers yet should be treated as strings to avoid improper handling)?
In summary, my approach is:
Create config file per client.
Convert the config file into a C# method dynamically using Expressions or Roslyn.
Provide a GUI for generating this config - this way the support person can easily specify the transform without knowing your special syntax (Expressions) or C# syntax (Roslyn). When saving config, you can generate one method per client in a single assembly (or separate assembly per client) and persist it. Let's call it client library.
Your main application can do all the standard stuff of reading from excel, validating, etc and then call the client library method to generate the output in a standard format, which can be further processed in your main application.
Hope you got the gist.
Edit: Adding some code to demonstrate. The code is a bit long-winded, but commented for understanding.
// this data represents your excel data
var data = new string[][] {
    new string[] { "col_1_1", "10", "09:30" },
    new string[] { "col_2_1", "12", "09:40" }
};

// you should read this from your client specific config file/section
// Remember: you should provide a GUI tool to build this config
var config = @"
    output.Add(input[0]);
    int hours = int.Parse(input[1]);
    DateTime date = DateTime.Parse(input[2]);
    date = date.AddHours(hours);
    output.Add(""Custom Text: "" + date);
";

// this template code should be picked up from a
// non client specific config file/section
var code = @"
using System;
using System.Collections.Generic;
using System.Linq;

namespace ClientLibrary {
    static class ClientLibrary {
        public static List<string> Client1(string[] input) {
            var output = new List<string>();
            <<code-from-config>>
            return output;
        }
    }
}
";

// Inject client configuration into template to form full code
code = code.Replace("<<code-from-config>>", config);

// Compile your dynamic method and get a reference to it
var references = new MetadataReference[] {
    MetadataReference.CreateFromFile(typeof(object).Assembly.Location),
    MetadataReference.CreateFromFile(typeof(Enumerable).Assembly.Location)
};
CSharpCompilation compilation = CSharpCompilation.Create(
    "ClientLibrary",
    syntaxTrees: new[] { CSharpSyntaxTree.ParseText(code) },
    references: references,
    options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

MethodInfo clientMethod = null;
using (var ms = new MemoryStream()) {
    EmitResult result = compilation.Emit(ms);
    if (!result.Success) {
        foreach (Diagnostic diagnostic in result.Diagnostics) {
            Console.Error.WriteLine("{0}: {1}", diagnostic.Id, diagnostic.GetMessage());
        }
    } else {
        ms.Seek(0, SeekOrigin.Begin);
        Assembly assembly = Assembly.Load(ms.ToArray());
        clientMethod = assembly.GetType("ClientLibrary.ClientLibrary").GetMethod("Client1");
    }
}
if (clientMethod == null)
    return;

// Do transformation
foreach (string[] row in data) {
    var output = clientMethod.Invoke(null, new object[] { row }) as List<string>;
    Console.WriteLine(string.Join("|", output));
}
You will need some nuget libraries to compile this, and their corresponding using clauses
nuget install Microsoft.Net.Compilers # Install C# and VB compilers
nuget install Microsoft.CodeAnalysis # Install Language APIs and Services
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;
As you will notice, the only piece left to worry about is the GUI to auto-generate the code for the transformation, which I have not provided here. For simple transforms that should be very easy, but for complex transforms it will be more involved.
It sounds like you're expecting your end user to be tech-savvy enough to understand this configuration mechanism that you're going to write. If they can handle that level of technical detail, it might be simpler to give them an Excel workbook and an official Excel template that contains all the columns your import app needs, and let them manually massage the data to the spec.
Otherwise, I would suggest a strategy-pattern-based solution to build a library of "data massager" classes for known formats, and just add new classes as new formats are encountered. e.g.
public interface IClientDataImporter
{
    List<MyCustomRowStructure> Import(string filename);
}

// client 1 importer
public class ClientOneImporter : IClientDataImporter
{
    public List<MyCustomRowStructure> Import(string filename)
    {
        var result = new List<MyCustomRowStructure>();
        // ..... insert custom logic here
        return result;
    }
}

// client 2 importer
public class ClientTwoImporter : IClientDataImporter
{
    public List<MyCustomRowStructure> Import(string filename)
    {
        var result = new List<MyCustomRowStructure>();
        // ..... insert custom logic here
        return result;
    }
}
// repeat up to however many formats you need
// then.....
public class ExcelToDatabaseImporter
{
    public void ImportExcelFile(string filename, string clientName)
    {
        var myValidData = GetClientDataImporter(clientName).Import(filename);
        StickMyDataToMyDatabase(myValidData); // this is where you would load the structure into the db... won't need to touch this every time a new format is encountered
    }

    public IClientDataImporter GetClientDataImporter(string clientName)
    {
        switch (clientName)
        {
            case "ClientOne":
                return new ClientOneImporter();
            case "ClientTwo":
                return new ClientTwoImporter();
            default:
                throw new ArgumentException("No importer for client");
        }
    }
}
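A variation on the factory above, not part of the original answer, replaces the switch with a dictionary of registrations, so adding a new client becomes a single line rather than a new case:

public class ImporterRegistry
{
    private readonly Dictionary<string, Func<IClientDataImporter>> _importers =
        new Dictionary<string, Func<IClientDataImporter>>(StringComparer.OrdinalIgnoreCase)
        {
            { "ClientOne", () => new ClientOneImporter() },
            { "ClientTwo", () => new ClientTwoImporter() },
            // register new clients here
        };

    public IClientDataImporter GetClientDataImporter(string clientName)
    {
        Func<IClientDataImporter> factory;
        if (!_importers.TryGetValue(clientName, out factory))
            throw new ArgumentException("No importer for client: " + clientName);
        return factory();
    }
}

The registry could even be populated by reflection over types implementing IClientDataImporter, removing the manual registration step entirely.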
I would suggest you maintain an XML configuration file for each Excel file. The XML configuration would be read by a tool, perhaps a console application, which generates a new CSV file based on the XML configuration.
As an XML configuration file can easily be edited in any text editor, users can update it themselves.
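A sketch of what such a per-client XML mapping might look like, read with XDocument. The element and attribute names here are invented for illustration, not a standard schema:

public class XmlConfigDemo
{
    public static void Main()
    {
        // Hypothetical per-client mapping: which source column feeds each output field.
        var xml = @"<mapping client='ClientOne'>
                      <column source='0' target='AccountNumber' />
                      <column source='3' target='Amount' />
                    </mapping>";

        var doc = System.Xml.Linq.XDocument.Parse(xml);
        foreach (var col in doc.Root.Elements("column"))
        {
            System.Console.WriteLine("{0} -> {1}",
                (string)col.Attribute("source"),
                (string)col.Attribute("target"));
        }
    }
}

In practice you would load the XML from a file per client and drive the CSV generation from the parsed mapping.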
I'm looking for a solution where I can fetch the nth word before through the nth word after my searched keyword in a string.
ex.
string searchString= "For several years I’ve had a little “utility” function that I’ve used in several projects that I use to convert property names into strings. One use case is for instance when mapping code to some data source or third party API that where the names are used as keys...";
string keywordToSearch="instance";
int wordsToFetch=5;
output would be: One use case is for instance when mapping code to some
Currently, I'm working on a text mining project in which I have to extract files and search for a particular keyword and its surrounding words in the extracted string.
Previously I was fetching the first sentence of the string whenever I found the desired keyword, but now the requirement has changed as described above.
Here is the code snippet:
using System.Linq;
using System.Text.RegularExpressions;
using System;
public class Program
{
public static void Main()
{
var sentence = "For several years I’ve had a little “utility” function that I’ve used in several projects that I use to convert property names into strings. One use case is for instance when mapping code to some data source or third party API that where the names are used as keys. The method uses “static reflection”, or rather it parses the expression tree from a lambda expression, to figure out the name of a property that the lambda expression returns the value of.Look, good against remotes is one thing, good against the living, that’s something else.";
var keyword = "instance";
var keyToSearch = new Regex("[^.!?;]*(" + keyword + ")[^.!?;]*");
var m = keyToSearch.Matches(sentence);
var result1 = Enumerable.Range(0, m.Count).Select(index => m[index].Value).ToList();
Console.WriteLine("Output:- {0} ",result1[0]);
}
}
dotnetFiddle
here is the output I got
Output:- One use case is for instance when mapping code to some
data source or third party API that where the names are used as keys
This gives me the first sentence in which the desired keyword occurs. Any suggestions on what changes I should make to get the newly required output?
How about this: 1) split the string into words, 2) find the index of your keyword, and 3) take a range of words starting 5 before the found index.
using System;
using System.Linq;

namespace Foo
{
    class Program
    {
        static void Main(string[] args)
        {
            var sentence = "For several years I’ve had a little “utility” function that I’ve used in several projects that I use to convert property names into strings. One use case is for instance when mapping code to some data source or third party API that where the names are used as keys. The method uses “static reflection”, or rather it parses the expression tree from a lambda expression, to figure out the name of a property that the lambda expression returns the value of.Look, good against remotes is one thing, good against the living, that’s something else.";
            var keyword = "instance";
            var words = sentence.Split(' '); // split into words
            int index = Array.FindIndex(words, w => w.Equals(keyword)); // find the index within

            // take 11 words from 5 before the index of your keyword
            var r = Enumerable
                .Range(index - 5, 11)
                .Select(i => words[i]);

            var result = string.Join(" ", r);
            Console.WriteLine("Output:- {0} ", result);
            Console.ReadKey();
        }
    }
}
This produces your desired output, but doesn't deal with:
Multiple matches
Matching on a different case
The possibility of an IndexOutOfRangeException when getting the desired words
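One way those three gaps could be closed, clamping the window to the array bounds and comparing case-insensitively. The variable names are my own, not from the answer above:

public class KeywordContext
{
    public static void Main()
    {
        var sentence = "One use case is for instance when mapping code to some data source";
        var keyword = "Instance"; // different case on purpose
        var words = sentence.Split(' ');

        // Find every matching index, not just the first, ignoring case.
        var indexes = System.Linq.Enumerable.Range(0, words.Length)
            .Where(i => words[i].Equals(keyword, System.StringComparison.OrdinalIgnoreCase));

        foreach (int index in indexes)
        {
            // Clamp the window so we never index outside the array.
            int start = System.Math.Max(0, index - 5);
            int end = System.Math.Min(words.Length - 1, index + 5);
            System.Console.WriteLine(string.Join(" ",
                System.Linq.Enumerable.Skip(words, start).Take(end - start + 1)));
        }
    }
}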
Thanks to @Ath, I've got a solution:
using System;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var sentence = "case is for instance doooo mapping code to some data source or third party API that where the names are used as keys. The method uses “static reflection”, or rather it parses the expression tree from a lambda expression, to figure out the name of a property that the lambda expression returns the value of.Look, good against remotes is one thing, good against the living, that’s something else.For several years I’ve had a little “utility” function that I’ve used in several projects that I use to convert property names into strings. One use case is for instance when mapping code to some data source or third party API that where the names are used as keys. The method uses “static reflection”, or rather it parses the expression tree from a lambda expression, to figure out the name of a property that the lambda expression returns the value of.Look, good against remotes is one thing, good against the living, that’s something else.";
        var keyword = "instance";
        int wordFreq = 2;

        var words = sentence.Split(' '); // split into words
        int foundIndex = Array.FindIndex(words, w => w.Equals(keyword)); // find the index within

        // take wordFreq words either side of the keyword, clamped to the array bounds
        int start = Math.Max(foundIndex - wordFreq, 0);
        int count = Math.Min(wordFreq * 2 + 1, words.Length - start);
        var wordsArray = Enumerable
            .Range(start, count)
            .Select(i => words[i])
            .ToArray();

        var outPut = string.Join(" ", wordsArray);
        Console.WriteLine("Output:- {0} ", outPut);
    }
}
Hope I handled all possible exceptions!
DotNetFiddle
Right now I'm working on a game engine. To be more efficient and to keep data hidden from the end user, I'm trying to use serialization on a modified form of Wavefront's *.OBJ format. I have multiple structs set up to represent data, and the serialization of the objects works fine, except that it takes up a significant amount of file space (at least 5× that of the original OBJ file).
To be specific, here's a quick example of what the final object would be (in a JSON-esque format):
{
[{float 5.0, float 2.0, float 1.0}, {float 7.0, float 2.0, float 1.0}, ...]
// ^^^ vertex positions
// other similar structures for colors, normals, texture coordinates
// ...
[[{int 1, int 1, int 1}, {int 2, int 2, int 1}, {int 3, int 3, int 2}], ...]
//represents one face; represents the following
//face[vertex{position index, text coords index, normal index}, vertex{}...]
}
Basically, my main issue with this method of serializing data (binary format) is that it saves the names of the structs, not just the values. I'd love to keep the data in the format I have already, just without saving the struct names in my data. I want to save something similar to the above, yet still be able to recompile with a different struct name later.
Here's the main object I'm serializing and saving to a file:
[Serializable()] //the included structs have this applied
public struct InstantGameworksObjectData
{
public Position[] Positions;
public TextureCoordinates[] TextureCoordinates;
public Position[] Normals;
public Face[] Faces;
}
Here's the method in which I serialize and save the data:
IFormatter formatter = new BinaryFormatter();
long Beginning = DateTime.Now.Ticks / 10000000;
foreach (string file in fileNames)
{
Console.WriteLine("Begin " + Path.GetFileName(file));
var output = InstantGameworksObject.ConvertOBJToIGWO(File.ReadAllLines(file));
Console.WriteLine("Writing file");
Stream fileOutputStream = new FileStream(outputPath + @"\" + Path.GetFileNameWithoutExtension(file) + ".igwo", FileMode.Create, FileAccess.Write, FileShare.None);
formatter.Serialize(fileOutputStream, output);
Console.WriteLine(outputPath + @"\" + Path.GetFileNameWithoutExtension(file) + ".igwo");
}
The output, of course, is in binary/hex (depending on what program you use to view the file), and that's great. But putting it into a hex-to-text converter online reveals the struct and field names embedded as plain text.
In the long run, this could mean gigabytes worth of useless data. How can I save my C# object with the data in the correct format, just without the extra meta-clutter?
As you correctly note, the standard framework binary formatters include a host of metadata about the structure of the data. This is to try to keep the serialised data self-describing. If they were to separate the data from all that metadata, then the smallest change to the structure of classes would render the previously serialised data useless. By that token, I doubt you'd find any standard framework method of serialising binary data that didn't include all the metadata.
Even ProtoBuf includes the semantics of the data in the file data, albeit with less overhead.
Given that the structure of your data follows the reasonably common and well established form of 3D object data, you could roll your own format for your assets which strips the semantics and only stores the raw data. You can implement read and write methods easily using the BinaryReader/BinaryWriter classes (which would be my preference). If you're looking to obfuscate data from the end user, there are a variety of different ways that you could achieve that with this approach.
For example:
public static InstantGameworksObjectData ReadIgwoObject(BinaryReader pReader)
{
    var lOutput = new InstantGameworksObjectData();

    int lVersion = pReader.ReadInt32(); // Useful in case you ever want to change the format

    int lPositionCount = pReader.ReadInt32(); // Store the length of the Position array before the data so you can pre-allocate the array.
    lOutput.Positions = new Position[lPositionCount];
    for (int lPositionIndex = 0; lPositionIndex < lPositionCount; ++lPositionIndex)
    {
        lOutput.Positions[lPositionIndex] = new Position();
        lOutput.Positions[lPositionIndex].X = pReader.ReadSingle();
        lOutput.Positions[lPositionIndex].Y = pReader.ReadSingle();
        lOutput.Positions[lPositionIndex].Z = pReader.ReadSingle();
        // or if you prefer... lOutput.Positions[lPositionIndex] = Position.ReadPosition(pReader);
    }

    int lTextureCoordinateCount = pReader.ReadInt32();
    lOutput.TextureCoordinates = new TextureCoordinate[lTextureCoordinateCount];
    for (int lTextureCoordinateIndex = 0; lTextureCoordinateIndex < lTextureCoordinateCount; ++lTextureCoordinateIndex)
    {
        lOutput.TextureCoordinates[lTextureCoordinateIndex] = new TextureCoordinate();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].X = pReader.ReadSingle();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].Y = pReader.ReadSingle();
        lOutput.TextureCoordinates[lTextureCoordinateIndex].Z = pReader.ReadSingle();
        // or if you prefer... lOutput.TextureCoordinates[lTextureCoordinateIndex] = TextureCoordinate.ReadTextureCoordinate(pReader);
    }

    // ... read Normals and Faces the same way ...

    return lOutput;
}
As far as space efficiency and speed goes, this approach is hard to beat. However, this works well for the 3D objects as they're fairly well-defined and the format is not likely to change, but this approach may not extend well to the other assets that you want to store.
If you find you are needing to change class structures frequently, you may find you have to write lots of if-blocks based on version to correctly read a file, and have to regularly debug issues where the data in the file is not quite in the format you expect. A happy medium might be to use something such as ProtoBuf for the bulk of your development until you're happy with the structure of your data object classes, and then writing raw binary Read/Write methods for each of them before you release.
I'd also recommend some Unit Tests to ensure that your Read and Write methods are correctly persisting the object to avoid pulling your hair out later.
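For symmetry, the matching write side might look like the following sketch. It assumes Position exposes X/Y/Z float fields as in the read example, and elides the sections the read example also elides:

public static void WriteIgwoObject(BinaryWriter pWriter, InstantGameworksObjectData pData)
{
    pWriter.Write(1); // format version, mirrored by the reader

    pWriter.Write(pData.Positions.Length); // length prefix so the reader can pre-allocate
    foreach (Position lPosition in pData.Positions)
    {
        pWriter.Write(lPosition.X);
        pWriter.Write(lPosition.Y);
        pWriter.Write(lPosition.Z);
    }

    // ... repeat the count-then-elements pattern for TextureCoordinates, Normals and Faces
}

A round-trip unit test (write then read, compare every field) is the easiest way to keep the two methods in sync.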
Hope this helps
A little background. I am new to using C# in a professional setting. My experience is mainly in SQL. I have a file that I need to parse through to pull out certain pieces of information. I can figure out how to parse through each line, but have gotten stuck on searching for specific pieces of information. I am not interested in someone finishing this code for me. Instead, I am interested in pointers on where I can go from here.
Here is an example of the code I have written.
class Program
{
private static Dictionary<string, List<string>> _arrayLists = new Dictionary<string, List<string>>();
static void Main(string[] args)
{
string filePath = "c:\\test.txt";
StreamReader reader = new StreamReader(filePath);
string line;
while (null !=(line = reader.ReadLine()))
{
if (line.ToLower().Contains("disconnected"))
{
// needs to continue on search for Disconnected or Subscribed
}
else
{
if (line.ToLower().Contains("subscribed"))
{
// program needs to continue reading file
// looking for and assigning values to
// dvd, cls, jhd, dxv, hft
// records start at Subscribed and end at ;
}
}
}
}
}
A little bit of explanation of the file: I basically need to pull the data existing between the word Subscribed and the first ; I come to. Specifically, I need to take values such as dvd = 234 and assign them to variables of the same name in the code. Not every record will have the same variables.
Here is an example of the text file that I need to parse through.
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info
Edit
My apologies on the vague question. I have to learn how to ask better questions.
My thought process was to make sure the program associated all the details between subscribed and the ; as one record. I think the part that I am confused on is in reading the lines. In my head I see the loop reading the line Subscribed, and then going into a method and reading the next line and assigning the value, and so on until it hits the ;. Once that was done I am trying to figure out how to tell the program to exit that method, but to continue reading from the line right after the semi-colon. Perhaps I am over thinking this.
I will take the advice I have been give and see what I can come up with to solve this. Thank you.
From your question as it is now, it is not clear what specific problem you are struggling with. I'd suggest you edit your question to provide the specific challenges you'd like to overcome. Currently your problem statement is "have gotten stuck on searching for specific pieces of information", which is as unspecific as it can get.
Having said that I'll try to help you.
First, you will never get into an if like that:
line.ToLower().Contains("Disconnected")
Here you convert all the characters to lower case, and then you are trying to find a substring with capital "D" in it. The expression above will (almost) always evaluate to false.
Secondly, in order for your application to do what you want to do it needs to track the current parsing state. I'm going to ignore the "Disconnected" bit now, as you have not shown what significance it has.
I'll be assuming that you are trying to find everything between Subscribed and first semicolon in the file. I'll also make a couple of other assumption regarding to what can constitute a string, which I won't list here. These can be wrong, but this is my best guess given the information you've provided.
Your program will start in a "looking for subscription" state. You have already set up the read loop, which is good. In this loop you read lines of the file until you find one that contains the word Subscribed.
Once you have found such a line, your parser needs to move to a "parsing subscription" state. In this state, when you read lines you look for lines like jjd = 768, perhaps with a semicolon at the end. You can check whether a line matches a pattern by using Regular Expressions.
Regular Expressions can also divide a match into capturing groups, so that you can extract the name (jjd) and the value (768) separately. The presence or absence of the semicolon could be another RegEx group.
Note that RegEx is not the only way to handle this, but this is the first that comes to mind.
You then keep matching lines against your regex and extracting names and values until you come across the semicolon, at which point you switch back to the "looking for subscription" state.
You use the current state, to decide how to process the next read line.
You continue until the end of the file.
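The state machine and regex described above might be sketched like this. The pattern and group names are my guesses at the file format shown in the question:

public class SubscriptionParser
{
    public static void Main()
    {
        // Matches lines like "dvd = 234," or "hft = 876;" capturing the name,
        // the value, and whether the record-terminating semicolon is present.
        var pairPattern = new System.Text.RegularExpressions.Regex(
            @"^\s*(?<name>\w+)\s*=\s*(?<value>\d+)\s*(?<end>;?)");
        bool inRecord = false;

        foreach (string line in System.IO.File.ReadLines(@"c:\test.txt"))
        {
            if (!inRecord)
            {
                if (line.Contains("Subscribed"))
                    inRecord = true; // state: "parsing subscription"
                continue;
            }

            var m = pairPattern.Match(line);
            if (!m.Success)
                continue; // "more annoying info" lines are skipped

            System.Console.WriteLine("{0} = {1}", m.Groups["name"].Value, m.Groups["value"].Value);

            if (m.Groups["end"].Value == ";")
                inRecord = false; // state: back to "looking for subscription"
        }
    }
}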
Generally you want to read up on parsing.
Hope this helps.
As with all code solutions to problems, there are many possible ways to achieve what you are looking for. Some will work better than others. Below is one way that could help point you in the right direction.
You can check if the string starts with a keyword or value such as "dvd" (see MSDN String.StartsWith).
If it does then you can split the string into an array of parts (see MSDN String.Split).
You can then get the values of each part from the string array using the index of the value you want.
Do what you need to with the value retrieved.
Continue checking each line for your key business rules (i.e. the semicolon that will end the section). Maybe you could check the last character of the string (see String.EndsWith).
When processing text files containing semi-structured data, state variables can simplify the algorithm. In the code below, a boolean state variable isInRecord is used to track when a line is in a record.
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApplication19
{
    public class Program
    {
        private readonly static String _testData = @"
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info";
        public static void Main(String[] args)
        {
            /* Create a temporary file containing the test data. */
            var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData), Path.GetRandomFileName());
            File.WriteAllText(testFile, _testData);
            try
            {
                var p = new Program();
                var records = p.GetRecords(testFile);
                foreach (var kvp in records)
                {
                    Console.WriteLine("Record #" + kvp.Key);
                    foreach (var entry in kvp.Value)
                    {
                        Console.WriteLine("  " + entry);
                    }
                }
            }
            finally
            {
                File.Delete(testFile);
            }
        }

        private Dictionary<String, List<String>> GetRecords(String path)
        {
            var results = new Dictionary<String, List<String>>();
            var recordNumber = 0;
            var isInRecord = false;
            using (var reader = new StreamReader(path))
            {
                String line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (line.StartsWith("Disconnected"))
                    {
                        // needs to continue on search for Disconnected or Subscribed
                        isInRecord = false;
                    }
                    else if (line.StartsWith("Subscribed"))
                    {
                        // program needs to continue reading file
                        // looking for and assigning values to
                        // dvd, cls, jhd, dxv, hft
                        // records start at Subscribed and end at ;
                        isInRecord = true;
                        recordNumber++;
                    }
                    else if (isInRecord)
                    {
                        // Check if the line has a general format of "something = something".
                        var parts = line.Split("=".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                        if (parts.Length != 2)
                            continue;

                        // Update the relevant dictionary key, or add a new key.
                        List<String> entries;
                        if (results.TryGetValue(recordNumber.ToString(), out entries))
                            entries.Add(line);
                        else
                            results.Add(recordNumber.ToString(), new List<String>() { line });

                        // Determine if the isInRecord state variable should be toggled.
                        var lastCharacter = line[line.Length - 1];
                        if (lastCharacter == ';')
                            isInRecord = false;
                    }
                }
            }
            return results;
        }
    }
}
I am trying to import a file with multiple record definitions in it. Each one can also have a header record, so I thought I would define a definition interface like so.
public interface IRecordDefinition<T>
{
    bool Matches(string row);
    T MapRow(string row);
    bool AreRecordsNested { get; }
    GenericLoadClass ToGenericLoad(T input);
}
I then created a concrete implementation for a class.
public class TestDefinition : IRecordDefinition<Test>
{
    public bool Matches(string row)
    {
        return row.Split('\t')[0] == "1";
    }

    public Test MapRow(string row)
    {
        var columns = row.Split('\t');
        return new Test { val = columns[0].parseDate("ddmmYYYY") };
    }

    public bool AreRecordsNested
    {
        get { return true; }
    }

    public GenericLoadClass ToGenericLoad(Test input)
    {
        return new GenericLoadClass { Value = input.val };
    }
}
However for each File Definition I need to store a list of the record definitions so I can then loop through each line in the file and process it accordingly.
Firstly, am I on the right track, or is there a better way to do it?
I would split this process into two pieces.
First, a specific process to split the file with multiple types into multiple files. If the files are fixed width, I have had a lot of luck with regular expressions. For example, assume the following is a text file with three different record types.
TE20110223 A 1
RE20110223 BB 2
CE20110223 CCC 3
You can see there is a pattern here, hopefully the person who decided to put all the record types in the same file gave you a way to identify those types. In the case above you would define three regular expressions.
string pattern1 = @"^TE(?<DATE>[0-9]{8})(?<NEXT1>.{2})(?<NEXT2>.{2})";
string pattern2 = @"^RE(?<DATE>[0-9]{8})(?<NEXT1>.{3})(?<NEXT2>.{2})";
string pattern3 = @"^CE(?<DATE>[0-9]{8})(?<NEXT1>.{4})(?<NEXT2>.{2})";
Regex Regex1 = new Regex(pattern1);
Regex Regex2 = new Regex(pattern2);
Regex Regex3 = new Regex(pattern3);
StringBuilder FirstStringBuilder = new StringBuilder();
StringBuilder SecondStringBuilder = new StringBuilder();
StringBuilder ThirdStringBuilder = new StringBuilder();
string Line = "";
Match LineMatch;
FileInfo myFile = new FileInfo("yourFile.txt");
using (StreamReader s = new StreamReader(f.FullName))
{
while (s.Peek() != -1)
{
Line = s.ReadLine();
LineMatch = Regex1.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
LineMatch = Regex2.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
LineMatch = Regex3.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
}
}
Next, take the split files and run them through a generic process, that you most likely already have, to import them. This works well because when the process inevitably fails, you can narrow it to the single record type that is failing and not impact all the record types. Archive the main text file along with the split files and your life will be much easier as well.
Dealing with these kinds of transmitted files is hard, because someone else controls them and you never know when they are going to change. Logging the original file as well as a receipt of the import is very important and shouldn't be overlooked either. You can make that as simple or as complex as you want, but I tend to write a receipt to a db, copy the primary key from that table into a foreign key in the table I have imported the data into, and then never change that data. I like to keep an unmolested copy of the import on the file system as well as on the DB server, because there are inevitable conversion / transformation issues that you will need to track down.
Hope this helps, because this is not a trivial task. I think you are on the right track, but instead of processing/importing each line separately...write them to a separate file. I am assuming this is financial data, which is one of the reasons I think provability at every step is important.
I think the FileHelpers library solves a number of your problems:
Strong types
Delimited
Fixed-width
Record-by-Record operations
I'm sure you could consolidate this into a type hierarchy that could tie in custom binary formats as well.
Have you looked at something using Linq? This is a quick example of Linq to Text and Linq to Csv.
I think it would be much simpler to use "yield return" and IEnumerable to get what you want working. This way you could probably get away with having only one method on your interface.
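As a sketch of the yield-return idea (the class name, file name, and delimiter below are assumptions, not anything from a library):

```csharp
using System.Collections.Generic;
using System.IO;

static class LineReader
{
    // Lazily stream records from a delimited text file, splitting each
    // line into fields. Nothing is read from disk until the caller
    // starts enumerating, and only one line is held in memory at a time.
    public static IEnumerable<string[]> ReadRecords(string path, char delimiter)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line.Split(delimiter);
            }
        }
    }
}
```

A caller can then just `foreach` over `LineReader.ReadRecords("client.csv", ',')` and apply whatever per-record transforms it likes, which is what makes a single-method interface plausible.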
My project requires a file where I will store key/value pair data that should be able to be read and modified by the user. I want the program to just expect the keys to be there, and I want to parse them from the file as quickly as possible.
I could store them in XML, but XML is way too complex, and it would require traversing nodes, and child nodes and so on; all I want is some class that takes a file and generates key/value pairs. I want as little error handling as possible, and I want it done with as little code as possible.
I could code a class like that myself, but I'd rather learn how it's done in the framework than reinvent the wheel. Is there some built-in magic class in .NET (3.5) that is able to do so?
MagicClass kv = new MagicClass("Settings.ini"); // It doesn't necessarily have to be an INI file; it can be any simple key/value pair format.
string Value1 = kv.get("Key1");
...
If you're looking for a quick, easy function and don't want to use the .NET app/user config setting files or worry about the serialization issues that sometimes occur over time, the following static function can load a file formatted as KEY=VALUE.
public static Dictionary<string, string> LoadConfig(string settingfile)
{
    var dic = new Dictionary<string, string>();
    if (File.Exists(settingfile))
    {
        var settingdata = File.ReadAllLines(settingfile);
        for (var i = 0; i < settingdata.Length; i++)
        {
            var setting = settingdata[i];
            // Split on the first '=' only, so values may themselves contain '='
            var sidx = setting.IndexOf("=");
            if (sidx >= 0)
            {
                var skey = setting.Substring(0, sidx);
                var svalue = setting.Substring(sidx + 1);
                if (!dic.ContainsKey(skey))
                {
                    dic.Add(skey, svalue);
                }
            }
        }
    }
    return dic;
}
Note: I'm using a Dictionary, so keys must be unique, which is usually the case with settings.
USAGE:
var settingfile = AssemblyDirectory + "\\mycustom.setting";
var settingdata = LoadConfig(settingfile);
if (settingdata.ContainsKey("lastrundate"))
{
    string svalue;
    if (settingdata.TryGetValue("lastrundate", out svalue))
    {
        DateTime lout;
        // Only overwrite lastrun if the stored value actually parses
        if (DateTime.TryParse(svalue, out lout))
        {
            lastrun = lout;
        }
    }
}
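Since the question also asks for the file to be modifiable, a matching save helper (the class and method names here are assumptions, chosen to mirror LoadConfig) could write the dictionary back out in the same KEY=VALUE shape:

```csharp
using System.Collections.Generic;
using System.IO;

public static class SettingWriter
{
    // Write the dictionary back out in the same KEY=VALUE format
    // that LoadConfig above reads, one pair per line.
    public static void SaveConfig(string settingfile, Dictionary<string, string> settings)
    {
        var lines = new List<string>();
        foreach (var kv in settings)
        {
            lines.Add(kv.Key + "=" + kv.Value);
        }
        File.WriteAllLines(settingfile, lines);
    }
}
```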
Use the KeyValuePair class for your Key and Value, then just serialize a List to disk with an XmlSerializer.
That would be the simplest approach, I feel. You wouldn't have to worry about traversing nodes; calling the Deserialize function will do that for you. The user could also edit the values in the file if they wish.
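One caveat: `XmlSerializer` only serializes public read/write members, and `KeyValuePair<TKey, TValue>`'s `Key`/`Value` properties are read-only, so they come out as empty elements. A minimal serializable pair type works instead; the `Pair` class and helper names below are assumptions for the sketch:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Pair
{
    public string Key;   // public fields (or read/write properties) are what XmlSerializer writes
    public string Value;
}

public static class PairFile
{
    public static void Save(string path, List<Pair> pairs)
    {
        var serializer = new XmlSerializer(typeof(List<Pair>));
        using (var stream = File.Create(path))
        {
            serializer.Serialize(stream, pairs);
        }
    }

    public static List<Pair> Load(string path)
    {
        var serializer = new XmlSerializer(typeof(List<Pair>));
        using (var stream = File.OpenRead(path))
        {
            return (List<Pair>)serializer.Deserialize(stream);
        }
    }
}
```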
I don't know of any built-in class to parse INI files. I've used Nini when I needed to do so. It's licensed under the MIT/X11 license, so there's no issue with including it in a closed-source program.
It's very easy to use. So if you have a Settings.ini file formatted this way:
[Configuration]
Name = Jb Evain
Phone = +330101010101
Using it would be as simple as:
var source = new IniConfigSource("Settings.ini");
var config = source.Configs["Configuration"];
string name = config.Get("Name");
string phone = config.Get("Phone");
If you want the user to be able to read and modify the file, I suggest a comma-delimited pair, one per line:
key1,value1
key2,value2
...
Parsing is simple: read the file, split at newlines and commas, then take the resulting elements in pairs.
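A minimal sketch of that parse, assuming keys and values never themselves contain commas (the class name is an assumption):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CommaPairs
{
    public static Dictionary<string, string> Parse(string path)
    {
        var pairs = new Dictionary<string, string>();
        // Split the whole file at newlines and commas, then take the tokens two at a time.
        var tokens = File.ReadAllText(path)
            .Split(new[] { '\r', '\n', ',' }, StringSplitOptions.RemoveEmptyEntries);
        for (var i = 0; i + 1 < tokens.Length; i += 2)
        {
            pairs[tokens[i].Trim()] = tokens[i + 1].Trim();
        }
        return pairs;
    }
}
```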
Format the file this way:
key1=value1
key2=value2
Read the entire file into a string (File.ReadAllText does that in one call) and split on both '=' and newline characters. Make sure you also call string.Trim() on each element as you traverse the list, and pop each pair into a Hashtable or Dictionary.