I have a flat file with an unfortunately dynamic column structure. There is a value that is in a hierarchy of values, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...
The same feed the next day may resemble this:
StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...
The thing is, I don't care much about all the tiers; I only care about the id of the last (bottom) tier, plus all the row data that is not part of the tier columns. I need to normalize the feed to something resembling this before injecting it into a relational database:
StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...
What would be an efficient, easy-to-read mechanism for determining the last tier object id, and organizing the data as described? Every attempt I've made feels kludgy to me.
Some things I've done:
I have tried to examine the column names for regular expression patterns, identify the columns that are tiered, order them by name descending, and select the first record... but I lose the ordinal column number this way, so that didn't look good.
I have placed the columns I want into an IDictionary<string, int> object to reference, but again reliably collecting the ordinal of the dynamic columns is an issue, and it seems this would be rather non-performant.
I ran into a similar problem a few years ago. I used a Dictionary to map the columns; it was not pretty, but it worked.
First make a Dictionary:
private Dictionary<int, int> GetColumnDictionary(string headerLine)
{
    Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
    List<string> columnNames = headerLine.Split('|').ToList();
    string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
    for (int index = 0; index < columnNames.Count; index++)
    {
        if (columnNames[index] == "StatisticID")
        {
            columnDictionary.Add(0, index);
        }
        if (columnNames[index] == "FileId")
        {
            columnDictionary.Add(1, index);
        }
        if (columnNames[index] == maxTierObjectColumnName)
        {
            columnDictionary.Add(2, index);
        }
        if (columnNames[index] == "Status")
        {
            columnDictionary.Add(3, index);
        }
    }
    return columnDictionary;
}
private string GetMaxTierObjectColumnName(List<string> columnNames)
{
    // Note: alphabetical ordering breaks once the tier index exceeds 9
    // ("Tier10..." sorts before "Tier2..."); edit this function if that can happen.
    var maxTierObjectColumnName = columnNames
        .Where(c => c.Contains("Tier") && c.Contains("Object"))
        .OrderBy(c => c)
        .Last();
    return maxTierObjectColumnName;
}
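If more than nine tiers are ever possible, one option is to order by the numeric tier index instead of alphabetically. A sketch of that idea (assuming the Tier<N>ObjectId naming pattern and a using for System.Text.RegularExpressions):

private string GetMaxTierObjectColumnName(List<string> columnNames)
{
    // Order by the numeric tier index so "Tier10ObjectId" sorts after
    // "Tier2ObjectId"; assumes every tier column matches Tier<N>ObjectId.
    return columnNames
        .Where(c => c.Contains("Tier") && c.Contains("Object"))
        .OrderBy(c => int.Parse(Regex.Match(c, @"\d+").Value))
        .Last();
}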
And after that it's simply a matter of running through the file:
private List<DataObject> ParseFile(string fileName)
{
    List<DataObject> dataObjects = new List<DataObject>();
    using (StreamReader streamReader = new StreamReader(fileName))
    {
        string headerLine = streamReader.ReadLine();
        Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var lineValues = line.Split('|');
            dataObjects.Add(
                new DataObject()
                {
                    StatisticId = lineValues[columnDictionary[0]],
                    FileId = lineValues[columnDictionary[1]],
                    ObjectId = lineValues[columnDictionary[2]],
                    Status = lineValues[columnDictionary[3]]
                }
            );
        }
    }
    return dataObjects;
}
I hope this helps (even a little bit).
Personally I would not try to reformat your file. I think the easiest approach would be to parse each row from the front and the back. For example:
itemArray = getMyItems();
statisticId = itemArray[0];
fileId = itemArray[1];
// ...and so on for the rest of your pre-tier columns.
// Then get the second-to-last column, which will be the last tier:
lastTierId = itemArray[itemArray.Length - 2];
Since you know the last tier will always be second from the end, you can just start at the end and work backwards. This seems much easier than trying to reformat the data file.
If you really want to create a new file, you could use this approach to get the data you want to write out.
I don't know C# syntax, but something along these lines:
split line in parts with | as separator
get parts [0], [1], [length - 2] and [length - 1]
pass the parts to the database handling code
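In C#, that might look roughly like this (a sketch; the variable names are illustrative):

var parts = line.Split('|');
var statisticId = parts[0];
var fileId = parts[1];
var objectId = parts[parts.Length - 2]; // the last tier is second from the end
var status = parts[parts.Length - 1];
// pass statisticId, fileId, objectId and status to the database handling code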
I have a CSV File with these headers:
date;clock;value
My aim is to select the CSV line with a specific date to get the corresponding value.
For example:
I want to select date 20.08.22 and the result should be 130
15.08.22;07:05;100
20.08.22;08:04;130
21.08.22;10:04;150
With this code snippet I read the lines of the csv file:
// Class-level collections that the parser fills (assumed from context):
private List<string> datum = new List<string>();
private List<string> uhrzeit = new List<string>();
private List<double> wert = new List<double>();

private void Werte_aus_CSV_auslesen()
{
    var path = @"E:\werte.csv";
    using (TextFieldParser csvParser = new TextFieldParser(path))
    {
        csvParser.CommentTokens = new string[] { "#" };
        csvParser.SetDelimiters(new string[] { ";" });
        csvParser.HasFieldsEnclosedInQuotes = true;
        // Skip the row with the column names
        csvParser.ReadLine();
        while (!csvParser.EndOfData)
        {
            // Read current line fields; the pointer moves to the next line.
            string[] fields = csvParser.ReadFields();
            datum.Add(fields[0]);
            uhrzeit.Add(fields[1]);
            wert.Add(double.Parse(fields[2], CultureInfo.InvariantCulture));
        }
    }
}
The approach you are using is going to have to scan the entire CSV every time you lookup a value. This might be a performance problem if this method is called multiple times. It would be better to build a dictionary that maps the date to the value that can be built once and reused for each subsequent lookup.
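As a minimal sketch of that caching idea using only the base library (assuming the semicolon-separated format shown above and that each date appears only once):

var lookup = File.ReadLines(@"E:\werte.csv")
    .Skip(1) // skip the header row
    .Select(l => l.Split(';'))
    .ToDictionary(
        f => DateTime.ParseExact(f[0], "dd.MM.yy", CultureInfo.InvariantCulture),
        f => double.Parse(f[2], CultureInfo.InvariantCulture));

// Build once, then every lookup is O(1):
Console.WriteLine(lookup.TryGetValue(new DateTime(2022, 8, 20), out var v) ? v.ToString() : "not found");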
I maintain a couple of libraries that make this pretty easy: Sylvan.Data and Sylvan.Data.Csv. Here is a complete C# 10 console app that demonstrates how to accomplish this:
using Sylvan.Data.Csv;
using Sylvan.Data;

// data: would normally use CsvDataReader.Create(csvFileName, opts);
var data =
    new StringReader(
@"date;clock;value
15.08.22;07:05;100
20.08.22;08:04;130
21.08.22;10:04;150
");

// parameter:
var selectDate = new DateTime(2022, 8, 20);

// configure settings so the csv reader understands your data
var opts = new CsvDataReaderOptions
{
    DateTimeFormat = "dd'.'MM'.'yy",
    // ignore clock, as it isn't used
    Schema = new CsvSchema(Schema.Parse("date:date,clock,value:int"))
};

var csvReader = CsvDataReader.Create(data, opts);

// create a dictionary to cache the CSV data for quick lookups;
// building it scans the whole dataset, but subsequent lookups
// will be blazing fast.
var dict =
    csvReader
        .GetRecords<Record>() // bind the CSV data to the Record class
        .ToDictionary(r => r.Date, r => r.Value);

Console.WriteLine(dict.TryGetValue(selectDate, out var value) ? value.ToString() : "Value not found");

class Record
{
    public DateTime Date { get; set; }
    public int Value { get; set; }
}
Matched arrays/lists like datum, uhrzeit, and wert, which relate values across collections by index, are an anti-pattern... something to avoid. It is much better to create a class with a property for each of the values, and then have one collection holding instances of that class.
public class MyData
{
    public DateTime date { get; set; }
    public int value { get; set; }
}
(Of course, give it a better name than "MyData")
Newer code might also use a record instead of a class.
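For example, the same shape as a C# 9 positional record (property names capitalized here per .NET convention, which is an adjustment to the class above):

public record MyData(DateTime Date, int Value);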
We can further improve this by separating the code that reads the CSV data from the code that composes the objects. Start with something like this:
private IEnumerable<string[]> Werte_aus_CSV_auslesen(string path)
{
    using (TextFieldParser csvParser = new TextFieldParser(path))
    {
        csvParser.CommentTokens = new string[] { "#" };
        csvParser.SetDelimiters(new string[] { ";" });
        csvParser.HasFieldsEnclosedInQuotes = true;
        // Skip the row with the column names
        csvParser.ReadLine();
        while (!csvParser.EndOfData)
        {
            // Read current line fields; the pointer moves to the next line.
            yield return csvParser.ReadFields();
        }
    }
}
Notice how it accepts an input and returns an object (the enumerable with the data). Also notice how it avoids anything to do with processing the individual rows. It is only concerned with parsing the CSV/SSV inputs. It doesn't care what fields you expect to find, and can handle any file input with a header line, hash comments, and semi-colon field separators.
Since this gives us string[] values, we also add a method to transform a string[] into a class instance. I like to start out with this as a static method of the class itself, but as a project grows to have many of these methods they may eventually be moved to their own static type:
public class MyData
{
    public DateTime date { get; set; }
    public int value { get; set; }

    public static MyData FromCSVRow(string[] input)
    {
        return new MyData()
        {
            date = DateTime.ParseExact($"{input[0]} {input[1]}", "dd.MM.yy HH:mm", null),
            value = int.Parse(input[2])
        };
    }
}
And now with all that out of the way, we can finally put it all together to get your answer:
var targetDate = new DateTime(2022, 8, 20);
var csv = Werte_aus_CSV_auslesen(@"E:\werte.csv");
var rows = csv.Select(MyData.FromCSVRow);
var result = rows.Where(r => r.date.Date == targetDate);
If we really wanted to, we could even treat all that as a single line of code (it's probably better to keep it separate, for readability/maintainability):
var result = Werte_aus_CSV_auslesen(@"E:\werte.csv")
    .Select(MyData.FromCSVRow)
    .Where(r => r.date.Date == new DateTime(2022, 8, 20));
Note result is still an IEnumerable<MyData>, because there might be more than one row matching the criteria. If you are really sure there will only be one matching record, you can use this:
var result = rows.Where(r => r.date.Date == targetDate).FirstOrDefault();
or this:
var result = rows.Where(r => r.date.Date == targetDate).First();
depending on what you want to happen if no match is found.
One of the nice features here is this checks each record as it reads the file, and will stop reading the file as soon as it finds a match, which is potentially a very nice performance win.
I am writing a C# program that will grab some data from a pipe-delimited file with 400 columns in it. I'm only required to work with 6 of the columns in each row. The file does not have headers, and the first line is a 5-column row with a general description of the file (file name, batch date, number of records, total, report id). Before I create a class with 400 fields in it, I was curious whether anyone here had a better idea of how to approach this. Thanks for your time.
Well, you don't mention much as to how you're loading the file, but I imagine it is using System.IO and then doing a string split on each line. If so, you need not extract every field from the resulting split array.
Imagine you only needed two columns, the second and fourth, and had a class to accept each row as follows:
public class row
{
    public string field2;
    public string field4;
}
Then you would extract your data like this:
IEnumerable<row> parsed =
    File.ReadLines(@"path to file")
        .Skip(1)
        .Select(line => {
            var splitted = line.Split('|');
            return new row {
                field2 = splitted[1],
                field4 = splitted[3]
            };
        });
You could use the Microsoft.VisualBasic.FileIO reference and then do something like this:
using (var parser = new TextFieldParser(file))
{
    Int32 skipHeader = 0;
    parser.SetDelimiters("|");
    while (!parser.EndOfData)
    {
        // Processing row
        string[] fields = parser.ReadFields();
        Int32 x = 0;
        if (skipHeader > 0)
        {
            foreach (var field in fields)
            {
                if (x == 0)
                {
                    // SAVE STUFF TO VARIABLE
                }
                else if (x == 4)
                {
                    // SAVE MORE STUFF
                }
                else if (x == 20)
                {
                    // SAVE LAST STUFF
                    break; // THIS IS THE LAST COLUMN OF DATA NEEDED, SO BREAK
                }
                x++;
            }
            // DO SOMETHING WITH ALL THE SAVED STUFF AND CLEAR IT OUT
        }
        else
        {
            skipHeader++;
        }
    }
}
I have thousands of lines of data in a text file that I want to make easily searchable (I am hoping to turn it into XML or another type of large data structure, though I am not sure which would be best for what I have in mind).
The data looks like this for each line:
Book 31, Thomas,George, 32, 34, 154
(each book is not unique; these are index entries, so a book will appear in several rows listing who is mentioned in it, and the numbers are the pages on which they are listed)
So I am kind of lost on how to do this. I would want to read the .txt file and trim out all the spaces and commas; I basically get how to prep the data, but how would I programmatically create that many elements and values in XML, or populate some other large data structure?
If your CSV file does not change too much and the structure is stable, you could simply parse it into a list of objects at startup.
private class BookInfo
{
    public string title { get; set; }
    public string person { get; set; }
    public List<int> pages { get; set; }
}

private List<BookInfo> allbooks = new List<BookInfo>();

public void parse()
{
    // You could also read the file line by line here to avoid
    // reading the complete file into memory.
    var lines = File.ReadAllLines(filename);
    foreach (var l in lines)
    {
        var info = l.Split(',').Select(x => x.Trim()).ToArray();
        var b = new BookInfo
        {
            title = info[0],
            person = info[1] + ", " + info[2],
            pages = info.Skip(3).Select(x => int.Parse(x)).ToList()
        };
        allbooks.Add(b);
    }
}
Then you can easily search the allbooks list with, for instance, LINQ.
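For instance, to get all pages on which a given person appears in a given book (a sketch using the properties above):

var pages = allbooks
    .Where(b => b.title == "Book 31" && b.person == "Thomas, George")
    .SelectMany(b => b.pages)
    .ToList();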
EDIT
Now that you have clarified your input, I have adapted the parsing a little to better fit your needs.
If you want to search your booklist by either the title or the person more easily, you can also create a lookup on each of the properties
var titleLookup = allbooks.ToLookup(x=> x.title);
var personLookup = allbooks.ToLookup(x => x.person);
So personLookup["Thomas, George"] will give you a list of all BookInfos that mention "Thomas, George", and titleLookup["Book 31"] will give you a list of all BookInfos for "Book 31", i.e. all persons mentioned in that book.
If you want to make the CSV file easily searchable, you can convert it to a DataTable.
If you want XML data, you can use LINQ to XML to search it.
The following class generates either a DataTable or an XML string. You can pass a delimiter and includeHeader, or use the defaults:
class CsvUtility
{
    public DataTable Csv2DataTable(string fileName, bool includeHeader = false, char separator = ',')
    {
        IEnumerable<string> reader = File.ReadAllLines(fileName);
        var data = new DataTable("Table");
        var headers = reader.First().Split(separator);
        if (includeHeader)
        {
            foreach (var header in headers)
            {
                data.Columns.Add(header.Trim());
            }
            reader = reader.Skip(1);
        }
        else
        {
            for (int index = 0; index < headers.Length; index++)
            {
                var header = "Field" + index; // headers[index];
                data.Columns.Add(header);
            }
        }
        foreach (var row in reader)
        {
            if (row != null) data.Rows.Add(row.Split(separator));
        }
        return data;
    }

    public string Csv2Xml(string fileName, bool includeHeader = false, char separator = ',')
    {
        var dt = Csv2DataTable(fileName, includeHeader, separator);
        var stream = new StringWriter();
        dt.WriteXml(stream);
        return stream.ToString();
    }
}
Example usage:
CsvUtility csv = new CsvUtility();
var dt = csv.Csv2DataTable("f1.txt");
// Search for string in any column
DataRow[] filteredRows = dt.Select("Field1 LIKE '%" + "Thomas" + "%'");
//search in certain field
var filtered = dt.AsEnumerable().Where(r => r.Field<string>("Field1").Contains("Thomas"));
//generate xml
var xml= csv.Csv2Xml("f1.txt");
Console.WriteLine(xml);
/*
output of xml for your sample:
<DocumentElement>
<Table>
<Field0>Book 31</Field0>
<Field1> Thomas</Field1>
<Field2>George</Field2>
<Field3> 32</Field3>
<Field4> 34</Field4>
<Field5> 154</Field5>
</Table>
</DocumentElement>
*/
Hello, I need some assistance with this issue. Hopefully I can describe it well.
I have a parser that goes through a document, finds session IDs, strips some tags from them, and places them into a list.
while ((line = sr.ReadLine()) != null)
{
    Match sID = sessionId.Match(line);
    if (sID.Success)
    {
        string sid = sID.ToString();
        string sIDString = Regex.Replace(sid, "<[^>]+>", string.Empty);
        sessionIDList.Add(sIDString);
    }
}
Then I go through the list and get the distinct session IDs.
List<String> distinctSessionID = sessionIDList.Distinct().ToList();
Now I need to go through the document again and add the lines that match each session ID to the list. This is the part I am having an issue with.
Do I need to create a 2D list so I can add the matching log lines to the corresponding session IDs?
I was looking at this but cannot figure out a way to copy over my distinct list and then add the lines I need into the new array.
From what I can test, it looks like this would add the values into the master list:
List<List<string>> masterLists = new List<List<string>>();
foreach (string value in distinctSessionID)
{
    masterLists[0].Add(value);
}
How do I add the lines I need to the corresponding master list? Say the value added at masterList[0] is 1; how do I add the lines to 1?
masterList[0][0].Add(myLines);
Basically I want
Sessionid1
-------> related log line
-------> Related log line
SessionID2
-------> related log line
-------> related log line.
So on and so forth. I have the parsing all working; it's just getting the values into a second list of strings that is the issue.
Thanks,
What you can do is simply create a class with public properties and make a list of that custom class.
public class Session
{
    public int SessionId { get; set; }
    // Initialize the list so Add() below doesn't throw a NullReferenceException.
    public List<string> SessionLog { get; set; } = new List<string>();
}

List<Session> objList = new List<Session>();
var session1 = new Session();
session1.SessionId = 1;
session1.SessionLog.Add("description line1");
objList.Add(session1);
Here is one way to do it:
public class MultiDimDictList: Dictionary<string, List<int>> { }
MultiDimDictList myDictList = new MultiDimDictList ();
Foreach (string value in distinctSessionID)
{
myDictList.Add(value, new List<int>());
for(int j=0; j < lengthofLines; j++)
{
myDictList[value].Add(myLine);
}
}
You would need to replace lengthofLines with a number to indicate how many iterations of lines you have.
See Charles Bretana's answer here
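Alternatively, LINQ's GroupBy can build the same session-to-lines structure in one pass. A sketch, where allLines and GetSessionId (a hypothetical helper that applies the same regex and tag-stripping as above) are assumptions:

// Group log lines by session id in one pass.
Dictionary<string, List<string>> logsBySession = allLines
    .Where(l => GetSessionId(l) != null)
    .GroupBy(l => GetSessionId(l))
    .ToDictionary(g => g.Key, g => g.ToList());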
I'm working on an importer that takes tab-delimited text files. The first line of each file contains 'columns' like ItemCode, Language, ImportMode etc., and there can be varying numbers of columns.
I'm able to get the names of each column, whether there's one or 10 and so on. I use a method to achieve this that returns List<string>:
private List<string> GetColumnNames(string saveLocation, int numColumns)
{
    var data = File.ReadAllLines(saveLocation);
    var columnNames = new List<string>();
    for (int i = 0; i < numColumns; i++)
    {
        var cols = from lines in data
                       .Take(1)
                       .Where(l => !string.IsNullOrEmpty(l))
                       .Select(l => l.Split(delimiter.ToCharArray(), StringSplitOptions.None))
                       .Select(value => string.Join(" ", value))
                   let split = lines.Split(' ')
                   select new
                   {
                       Temp = split[i].Trim()
                   };
        foreach (var x in cols)
        {
            columnNames.Add(x.Temp);
        }
    }
    return columnNames;
}
If I always knew what columns to expect, I could just create a new object, but since I don't, I'm wondering whether there is a way I can dynamically create an object with properties that correspond to whatever GetColumnNames() returns?
Any suggestions?
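One option worth knowing about (a sketch, not the approach the answer below takes) is ExpandoObject, which implements IDictionary<string, object> and so lets you attach properties named after whatever columns you find. columnNames and values are the illustrative inputs here:

using System.Dynamic;

dynamic record = new ExpandoObject();
var map = (IDictionary<string, object>)record;
for (int i = 0; i < columnNames.Count; i++)
{
    // Property name comes from the file's header row.
    map[columnNames[i]] = values[i];
}
// record.ItemCode, record.Language, etc. now resolve at runtime.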
For what it's worth, here's how I used DataTables to achieve what I wanted.
// saveLocation is the file location
// numColumns comes from another method that gets the number of columns in the file
var columnNames = GetColumnNames(saveLocation, numColumns);
var table = new DataTable();
foreach (var header in columnNames)
{
    table.Columns.Add(header);
}

// itemAttributeData is the file split into lines; each line must be
// split into fields (tab-delimited here) so values land in separate columns
foreach (var row in itemAttributeData)
{
    table.Rows.Add(row.Split('\t'));
}
Although there was a bit more work involved to be able to manipulate the data in the way I wanted, Karthik's suggestion got me on the right track.
You could create a dictionary of strings where the first string is the "property" name and the second string is its value.
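For instance (a minimal sketch of that idea; columnNames and fields are the illustrative inputs):

var row = new Dictionary<string, string>();
for (int i = 0; i < columnNames.Count; i++)
{
    // Key by column name so values can be looked up without a fixed class.
    row[columnNames[i]] = fields[i];
}
var language = row["Language"]; // hypothetical column lookup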