I want to use the CSV reader from the Encog library, like this:
var format = new CSVFormat('.', ' ');
IVersatileDataSource source = new CSVDataSource(filename, false, format);
var data = new VersatileMLDataSet(source);
Is it possible to get the original data from the variable data? I have to show the records from the CSV to the user in a DataGridView before I use them for the neural network. I want to be able to modify the original data as well. According to the documentation there is a Data property, but it doesn't work for me. If I try something like:
data.Data[1][1]
I get a null reference exception. There is another problem with using the data before normalization. I want to get the record count with:
data.GetRecordCount()
But I get the error "You must normalize the dataset before using it." So even if I have not used the data yet, I have to normalize it? If that is true, then it is probably better to use my own CSV reader and then load the data into Encog from memory, right?
So I just looked at the Encog source code on GitHub. Thankfully your question is well defined and narrow in scope, so I can provide an answer. Unfortunately, you probably won't like it.
Basically, when you pass in your IVersatileDataSource into the constructor for VersatileMLDataSet, it gets placed into a private readonly field called _source. There is no abstraction around _source, so you cannot access it from outside of VersatileMLDataSet.
The Data property will indeed only be populated during the normalization process. There also don't appear to be any public fields within CSVDataSource that are of any value to you (again, all private).
If you just wanted to look at a single column of data, you could stay within Encog and look at Encog.Util.NetworkUtil.QuickCSVUtils. There are methods within this class that will help you pick up a file and get a single column of data out quickly.
If you wanted to get the full CSV data out of a file within Encog, you could use the Encog.Util.CSV.ReadCSV class to get the data. This is the underlying implementation used by your code anyway when you go through QuickCSVUtils. You will have to provide some wrapper logic around ReadCSV, similar to QuickCSVUtils. If you go this route, I'd recommend peeking into that class to see how it's using ReadCSV. Essentially ReadCSV reads a single line at a time.
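For illustration, here is a rough sketch of that approach (verify the ReadCSV members against your Encog version); it reads every raw row into memory so you could bind it to a DataGridView before any normalization happens:
var rawRows = new List<string[]>();
// same filename/format you pass to CSVDataSource
var csv = new ReadCSV(filename, false, format);
while (csv.Next())
{
    // iris.data.csv has 5 columns; adjust for your own file
    var row = new string[5];
    for (int i = 0; i < row.Length; i++)
    {
        row[i] = csv.Get(i);
    }
    rawRows.Add(row);
}
csv.Close();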
But if you really need to read the RAW csv data from within the VersatileMLDataSet class, your best bet would be to provide your own implementation inside a custom class derived from VersatileMLDataSet.
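A minimal sketch of that idea (the class name is mine, and it assumes VersatileMLDataSet is not sealed); it simply keeps its own reference to the source, since the base class hides it in a private field:
public class ExposedVersatileMLDataSet : VersatileMLDataSet
{
    // Our own copy of the reference that the base class keeps private.
    public IVersatileDataSource Source { get; private set; }

    public ExposedVersatileMLDataSet(IVersatileDataSource source) : base(source)
    {
        Source = source;
    }
}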
There are a couple of steps you need to perform after reading in your file:
You have to define your column types
Analyze data
Map output
Set up normalization strategy
Get your data count
Optionally clone the data.Data to keep originals
The code is below with appropriate comments.
var filename = @"iris.data.csv";
var format = new CSVFormat('.', ',');
IVersatileDataSource source = new CSVDataSource(filename, false, format);
var data = new VersatileMLDataSet(source);
// Define columns to read data in.
data.DefineSourceColumn("Col1", 0, ColumnType.Continuous);
data.DefineSourceColumn("Col2", 1, ColumnType.Continuous);
data.DefineSourceColumn("Col3", 2, ColumnType.Continuous);
data.DefineSourceColumn("Col4", 3, ColumnType.Continuous);
ColumnDefinition outputColumn = data.DefineSourceColumn("Col5", 4, ColumnType.Nominal);
// Analyze data
data.Analyze();
// Output mapping
data.DefineSingleOutputOthersInput(outputColumn);
// Set normalization strategy
data.NormHelper.NormStrategy = new BasicNormalizationStrategy(-1, 1, -1, 1);
data.Normalize();
// Get count
var count = data.GetRecordCount();
// Clone to get original data
var originalData = data.Data.Clone();
For more details check the quickstart paper.
Sample data I'm using comes from here.
We have working code in C# that utilizes SqlBulkCopy to insert records into a table from a stored procedure source. At a high level:
Reads data from a stored procedure that puts the records into a DataTable. Essentially it calls the SP and uses a data adapter to fill the DataTable. Let's call this srcDataTable.
Dynamically maps the column names between source and destination through configuration, a table that's similar to the following:
TargetTableName | ColumnFromSource  | ColumnInDestination | DefaultValue | Formatting
TableA          | StudentFirstName  | FirstName           | NULL         | NULL
TableA          | StudentLastName   | LastName            | NULL         | NULL
TableA          | Birthday          | Birthdate           | 1/1/1900     | dd/MM/yyyy
Based on the mapping from #2, set up new rows from srcDataTable using .NewRow() of a DataRow in another DataTable that matches the structure of the destination table (the one ColumnInDestination refers to). Let's call this targetDataTable. As you can see from the table, there may be instances where the value from the source is not specified or needs to be formatted a certain way. This is the primary reason why we're having to add data rows on the fly to another data table, and the adjustment/defaulting of the values is handled in code.
Call SqlBulkCopy to write all the rows in targetDataTable to the actual SQL table.
This approach has been working alright in tandem with stored procedures that utilize FETCH and OFFSET so they only return X rows at a time to deal with memory constraints. Unfortunately, as we're getting more and more data sources that are north of 50 million rows, and since we're having to share servers, we need to find a faster way to do this while keeping memory consumption in check. Researching options, it seems like using an IDataReader with SqlBulkCopy would allow us to limit the memory consumption of the code and avoid having to handle fetching X records at a time in the stored procedure itself.
In terms of preserving current functionality, it looks like we can use SqlBulkCopy's column mappings (SqlBulkCopyColumnMapping) to keep mapping the fields even if they're named differently. What I can't confirm, however, is the defaulting or formatting of the values.
Is there a way to extend the DataReader's Read() method so that we can introduce that same logic to revise whatever value will be written to the destination if there's configuration asking us to? So a) check if the current row has a value populated from the source, b) default its value to the destination table if configured, and c) apply formatting rules as it gets written to the destination table.
You appear to be asking "can I make my own class that implements IDataReader and has some altered logic to the Read() method?"
The answer's yes; you can write your own data reader that does whatever it likes in Read(); it could even format the server's hard disk as soon as it's called. When you're implementing an interface you aren't "extend[ing] the DataReader's Read method"; you're providing your own implementation that externally appears to obey a specific contract, but the implementation detail is entirely up to you. If you want, upon every Read, to pull a row from db X into a temp array and zip through the array tweaking the values to apply a default or some other adjustment before returning true, that's fine.
If you wanted to do the value adjustment in the GetXXX methods instead, that's also fine; you're writing the reader, so you decide. All the bulk copier is going to do is call Read until it returns false and write the data it gets from e.g. GetValue. (If it wasn't immediately clear: Read doesn't produce the data that will be written, GetValue does. Read is just an instruction to move to the next set of data that must be written, but it doesn't even have to do that. You could implement Read as { return DateTime.Now.DayOfWeek == DayOfWeek.Monday; } and GetValue as { return Guid.NewGuid().ToString(); }, and your copy operation would spend until 23:59:59.999 filling the database with GUIDs, but only on Monday.)
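To make that concrete, here is a minimal sketch (the class and delegate names are mine, not from the question) of a reader that wraps an existing IDataReader and pushes every value through a transform callback, which is where the defaulting/formatting rules would live:
using System;
using System.Data;

public sealed class TransformingDataReader : IDataReader
{
    private readonly IDataReader _inner;
    private readonly Func<int, object, object> _transform; // (ordinal, raw value) -> value to write

    public TransformingDataReader(IDataReader inner, Func<int, object, object> transform)
    {
        _inner = inner;
        _transform = transform;
    }

    // SqlBulkCopy mostly needs Read, FieldCount and GetValue
    // (plus GetOrdinal/GetSchemaTable when mappings use column names).
    public bool Read() => _inner.Read();
    public int FieldCount => _inner.FieldCount;
    public object GetValue(int i) => _transform(i, _inner.GetValue(i));
    public int GetOrdinal(string name) => _inner.GetOrdinal(name);
    public DataTable GetSchemaTable() => _inner.GetSchemaTable();

    public bool IsDBNull(int i)
    {
        var value = GetValue(i);
        return value == null || value == DBNull.Value;
    }

    // Everything else simply delegates to the wrapped reader.
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public string GetName(int i) => _inner.GetName(i);
    public string GetDataTypeName(int i) => _inner.GetDataTypeName(i);
    public Type GetFieldType(int i) => _inner.GetFieldType(i);
    public int GetValues(object[] values)
    {
        int count = Math.Min(values.Length, FieldCount);
        for (int i = 0; i < count; i++) values[i] = GetValue(i);
        return count;
    }
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length)
        => _inner.GetBytes(i, fieldOffset, buffer, bufferOffset, length);
    public char GetChar(int i) => (char)GetValue(i);
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length)
        => _inner.GetChars(i, fieldOffset, buffer, bufferOffset, length);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public IDataReader GetData(int i) => _inner.GetData(i);
    public int Depth => _inner.Depth;
    public bool IsClosed => _inner.IsClosed;
    public int RecordsAffected => _inner.RecordsAffected;
    public bool NextResult() => _inner.NextResult();
    public void Close() => _inner.Close();
    public void Dispose() => _inner.Dispose();
}
You would then hand new TransformingDataReader(srcReader, yourTransform) to SqlBulkCopy.WriteToServer instead of the raw reader.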
The question is a bit unclear. It looks like the actual question is whether it's possible to transform data before using SqlBulkCopy with a data reader.
There are a lot of ways to do it, and the appropriate one depends on how the rest of the ETL code works. Does it only work with data readers? Or does it load batches of rows that can be modified in memory?
Use IEnumerable<> and ObjectReader
FastMember's ObjectReader class creates an IDataReader wrapper over any IEnumerable<T> collection. This means that both strongly-typed .NET collections and iterator results can be sent to SqlBulkCopy.
IEnumerable<string> lines = File.ReadLines(filePath);
using (var bcp = new SqlBulkCopy(connection))
using (var reader = ObjectReader.Create(lines, "FileName"))
{
    bcp.DestinationTableName = "SomeTable";
    bcp.WriteToServer(reader);
}
It's possible to create a transformation pipeline using LINQ queries and iterator methods this way, and feed the result to SqlBulkCopy using ObjectReader. The code is a lot simpler than trying to create a custom IDataReader.
In this example, Dapper can be used to return query results as an IEnumerable<>:
IEnumerable<Order> orders = connection.Query<Order>(
    "select ... where category=@category",
    new { category = "Cars" });

var ordersWithDate = orders.Select(ord => new OrderWithDate {
    ....
    SaleDate = DateTime.Parse(ord.DateString, CultureInfo.GetCultureInfo("en-GB"))
});

using var reader = ObjectReader.Create(ordersWithDate, "Id", "SaleDate", ...);
Custom transforming data readers
It's also possible to create custom data readers by implementing the IDataReader interface. Libraries like ExcelDataReader and CsvHelper provide such wrappers over their results. CsvHelper's CsvDataReader creates an IDataReader wrapper over the parsed CSV results. The downside to this is that IDataReader has a lot of methods to implement. The GetSchemaTable method will have to be implemented to provide column and type information to later transformation steps and SqlBulkCopy.
IDataReader may be dynamic, but it requires adding a lot of hand-coded type information to work. In CsvDataReader most methods just forward the call to the underlying CsvReader, e.g.:
public long GetInt64(int i)
{
return csv.GetField<long>(i);
}
public string GetName(int i)
{
return csv.Configuration.HasHeaderRecord
? csv.HeaderRecord[i]
: string.Empty;
}
But GetSchemaTable() is 70 lines, with defaults that aren't optimal. Why use string as the column type when the parser can already parse date and numeric data, for example?
One way to get around this is to create a new custom IDataReader using a copy of the previous reader's schema table and adding the extra columns. CsvDataReader's constructor accepts a DataTable schemaTable parameter to handle cases where its own GetSchemaTable isn't good enough. That DataTable could be modified to add extra columns:
/// <param name="csv">The CSV.</param>
/// <param name="schemaTable">The DataTable representing the file schema.</param>
public CsvDataReader(CsvReader csv, DataTable schemaTable = null)
{
this.csv = csv;
csv.Read();
if (csv.Configuration.HasHeaderRecord)
{
csv.ReadHeader();
}
else
{
skipNextRead = true;
}
this.schemaTable = schemaTable ?? GetSchemaTable();
}
A DerivedColumnReader could be created that does just that in its constructor :
// Constructor of a generic DerivedColumnReader<TSource, TResult> class
public DerivedColumnReader(string sourceName, string targetName, Func<TSource, TResult> func, DataTable schemaTable)
{
    ...
    AddSchemaColumn(schemaTable, targetName);
    _schemaTable = schemaTable;
}
void AddSchemaColumn(DataTable dt,string targetName)
{
var row = dt.NewRow();
row["AllowDBNull"] = true;
row["BaseColumnName"] = targetName;
row["ColumnName"] = targetName;
row["ColumnMapping"] = MappingType.Element;
row["ColumnOrdinal"] = dt.Rows.Count+1;
row["DataType"] = typeof(TResult);
//20-30 more properties
dt.Rows.Add(row);
}
That's a lot of boilerplate that's eliminated with LINQ.
Just providing closure to this. The main question really is how we can avoid running into out-of-memory exceptions when fetching data from SQL without employing FETCH and OFFSET in the stored procedure. The resolution didn't require getting fancy with a custom reader similar to SqlDataReader; it just needed count checking and calling SqlBulkCopy in batches. The code is similar to what's written below:
using (var dataReader = sqlCmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    int rowCount = 0;
    while (dataReader.Read())
    {
        DataRow dataRow = SourceDataSet.Tables[source.ObjectName].NewRow();
        for (int i = 0; i < SourceDataSet.Tables[source.ObjectName].Columns.Count; i++)
        {
            dataRow[SourceDataSet.Tables[source.ObjectName].Columns[i]] = dataReader[i];
        }
        SourceDataSet.Tables[source.ObjectName].Rows.Add(dataRow);
        rowCount++;
        if (rowCount % recordLimitPerBatch == 0)
        {
            // Apply our field mapping
            ApplyFieldMapping();
            // Write it up
            WriteRecordsIntoDestinationSQLObject();
            // Remove from our dataset once we get to this point
            SourceDataSet.Tables[source.ObjectName].Rows.Clear();
        }
    }

    // Flush any remaining rows that didn't fill a complete batch
    if (SourceDataSet.Tables[source.ObjectName].Rows.Count > 0)
    {
        ApplyFieldMapping();
        WriteRecordsIntoDestinationSQLObject();
        SourceDataSet.Tables[source.ObjectName].Rows.Clear();
    }
}
Here ApplyFieldMapping() makes field-specific changes to the contents of the DataTable, and WriteRecordsIntoDestinationSQLObject() performs the actual bulk write to the destination table. This allowed us to call the stored procedure just once to fetch the data and let the loop keep memory in check by writing records out and clearing them when we hit a preset recordLimitPerBatch.
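For reference, a rough sketch (my assumptions, not the original code) of what WriteRecordsIntoDestinationSQLObject() can boil down to, with the column mappings taken from the configuration table shown in the question:
private void WriteRecordsIntoDestinationSQLObject()
{
    // destinationConnectionString is assumed to come from configuration
    using (var bcp = new SqlBulkCopy(destinationConnectionString))
    {
        bcp.DestinationTableName = "TableA";               // TargetTableName from the mapping table
        bcp.ColumnMappings.Add("FirstName", "FirstName");  // DataTable column -> destination column
        bcp.ColumnMappings.Add("LastName", "LastName");
        bcp.ColumnMappings.Add("Birthdate", "Birthdate");
        bcp.BatchSize = recordLimitPerBatch;
        bcp.WriteToServer(SourceDataSet.Tables[source.ObjectName]);
    }
}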
I would like a general learning pipeline (predict a label from N features, for example), in the sense that one of my input CSVs would have 5 features and another would have 10 features (those two CSVs would obviously produce different models; I don't want to combine them in any way, I just want to run the same program on both CSVs).
However, to load the features, I need to use
TextLoader(...).CreateFrom<ClassA>()
where ClassA defines my schema. Its properties need to reflect the CSV format; therefore the CSV must always have the same number of columns.
I have noticed CustomTextLoader but it's obsolete. Any ideas? Thank you.
Taking a look at the source: (https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML/Data/TextLoader.cs)
CreateFrom looks like nothing more than a helper method that populates Arguments.Columns and Arguments, both of which are publicly accessible. This means that you could write your own implementation.
TextLoader tl = new TextLoader(inputFileName);
tl.Arguments.HasHeader = useHeader;
tl.Arguments.Separator = new[] { separator };
tl.Arguments.AllowQuoting = allowQuotedStrings;
tl.Arguments.AllowSparse = supportSparse;
tl.Arguments.TrimWhitespace = trimWhitespace;
And now the important part: you'll need to populate TextLoader.Arguments.Column with an entry for each column in your data set. If you know ahead of time that you'll have 5 or 10 columns, that would be the simplest; otherwise, I'd peek into the CSV to figure it out (see the sketch after the snippet below).
tl.Arguments.Column = new TextLoaderColumns[numColumns];
// each element must be constructed before its properties are set
tl.Arguments.Column[0] = new TextLoaderColumns();
tl.Arguments.Column[0].Name = ...
tl.Arguments.Column[0].Source = ... // see the docs
tl.Arguments.Column[0].Type = ...
// and so on.
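The "peek into the CSV" step could be as simple as this sketch (it assumes separator is the same char used above):
int numColumns;
using (var sr = new System.IO.StreamReader(inputFileName))
{
    // Count the fields on the first line to size the Column array.
    var firstLine = sr.ReadLine() ?? string.Empty;
    numColumns = firstLine.Split(separator).Length;
}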
jaket - thank you for your answer. I can see how that would work for loading the data into the TextLoader. However, how would you then train the model, as the pipeline's Train() method also requires you to pass in an object defining the data schema:
PredictionModel<ClassA, ClassAPrediction> model = pipeline.Train<ClassA, ClassAPrediction>();
I have been reading posts trying to figure out how to get this done. The title says it all. I am trying to take simple data from a text file and load it in memory, then let the user manipulate (add/delete) the data in memory and have it added to a List<Automobile>, then have it write what is in memory back to the same file, overwriting what is there. I have tried to use different parts of MemoryStream() and tried to use StreamReader(). I would get an error saying "Argument 1: cannot convert from 'string' to 'Exercise6_DealerVehicleInventory.Automobile'". When I would use MemoryStream, it would give me an error saying "Cannot implicitly convert type 'System.IO.StreamReader' to 'string'".
I am not that familiar with the .NET Framework and everything that can be done with it. What is the best way to go about doing what the title of my post says? I have been reading for the past few days and have not been able to figure this out. I am still very new to all that C# has to offer when writing applications.
PS: Where it says "Exercise6", this is not for school by any means. This is something that I was given and told to use online for help/answers if I had issues.
If there is another post that explains all of this, please point me to that post because I have not found a post/answer to what I am trying to get done.
C# makes this very easy for you.
string content = System.IO.File.ReadAllText("filename.txt");
// change content here
System.IO.File.WriteAllText("filename.txt", content);
To simplify things, you should add using System.IO; to the top of your file, and then you don't have to include System.IO in the body of the code.
I assume that:
Each line in the input file represents a single automobile.
You want to convert this list of strings to list of automobiles.
For simplicity, let's define the class Automobile as follows:
public class Automobile
{
public string Name;
//you can add more fields here.
}
Let's say that your input file "automobiles.txt" has the following lines:
auto1
auto2
auto3
Your program should be:
// read file content
string sFilePath = @"C:\Users\user\Desktop\automobiles.txt";
string[] aLines = File.ReadAllLines(sFilePath);
// initialize an empty list of automobiles
List<Automobile> lAutomobiles = new List<Automobile>();
// initialize a list of automobiles from file content
aLines.ToList().ForEach(line => lAutomobiles.Add(new Automobile { Name = line }));
// here, do whatever you want with the automobiles list
lAutomobiles.ForEach(auto => auto.Name += "_processed");
// write the processed data to the file
File.WriteAllLines(sFilePath, lAutomobiles.Select(auto => auto.Name));
After running this code, the lines in automobiles.txt are:
auto1_processed
auto2_processed
auto3_processed
The error says that you are trying to add a string value to a list of Automobile type. So you should create a complete Automobile class that holds all the info about an automobile; then, when adding to the list, you can write something like this:
List<Automobile> auto = new List<Automobile>();
auto.Add(new Automobile() { Name = "user's auto name", Model = "user's auto model" });
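If each line of your text file carries both values, e.g. "Name,Model", a short sketch of loading the whole file into such a list (it needs using System.IO; and using System.Linq;, and the file name is assumed) could look like:
List<Automobile> autos = File.ReadAllLines("automobiles.txt")
    .Select(line => line.Split(','))                                     // one "Name,Model" pair per line
    .Select(parts => new Automobile { Name = parts[0], Model = parts[1] })
    .ToList();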
I have a number of properties in Properties.Settings.Default whose names all start with "store" followed by an integer. These numbers follow in sequence, and what I would like to do after the method is fired off is to increase the number in the property name, i.e. go from "store1" to "store2".
I keep getting an "identifier expected" error. I'm rather new at programming, so any help would be appreciated.
public void store()
{
storename1.ForeColor = Color.Orange;
if (File.Exists(Filedestination))
{
File.Delete(Filedestination);
}
NumberOfScales = Properties.Settings.Default.("store"+ Convert.ToString(storeNumber) + "NrOfScales");
StartRange = EndRange - Properties.Settings.Default.DegrendelNrOfScales;
IPRange = Properties.Settings.Default.DegrendelIPRange;
CurrentRange = StartRange;
PingScales();
}
I don't even know how I can read a property with the name ("store" + Convert.ToString(storeNumber) + "NrOfScales"). If I knew how to do that, it would shorten the code by at least 9/10ths as I would not have to redo this for every single instance of all the stores that I have. Is there any way I can get this to work?
At first glance, it seems like you possibly chose the wrong place to store your data. Is there any particular reason why you are using Windows Forms' application settings (Settings) to store data?
If you really want to do it that way, IIRC you can access a setting by its name using Properties.Settings.Default["PropertyName"], where you can replace "PropertyName" with any expression that yields a string, e.g. "store" + Convert.ToString(storeNumber) + "NrOfScales" (or, more succinctly in Visual Studio 2015 or later, $"store{storeNumber}NrOfScales"). You will get back an object that you'll have to cast to whatever type of value you stored in there, e.g.:
var numberOfScales = (int)Properties.Settings.Default[$"store{storeNumber:D}NrOfScales"];
Some hints about syntax used here:
The [] syntax is called an "indexer".
$"…" is for string interpolation. It often allows for neater concatenation of strings than by using +.
The D (decimal) format specifier used in $"…{…:D}…" makes sure that storeNumber will be formatted as a decimal without any thousands/decimal separators.
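Putting those pieces together, a minimal sketch (assuming the settings are user-scoped and hold int values) of reading a setting by name, changing it, and persisting it:
string key = $"store{storeNumber}NrOfScales";
int numberOfScales = (int)Properties.Settings.Default[key];  // read via the indexer
Properties.Settings.Default[key] = numberOfScales + 1;       // write back the same way
Properties.Settings.Default.Save();                          // persist user-scoped settings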
Now, back to my initial question, if you're open to other means of storing data, let me point out a few alternatives:
If you only need the data during one single execution of your program, i.e. the data does not need to be persisted from one run of the program to the next, then a Dictionary<string, int> might be sufficient. Dictionaries allow you to associate int values with string keys and look them up by those keys (see the short sketch after this list).
If your data is actually user content / business data, then don't store it as "application settings". At the least, store the data to a simple file (possibly to Isolated Storage) using the facilities under System.IO (File.Create, File.Open, StreamWriter, etc.). If you want to store structured data, you could make use of relational databases (see e.g. SQLite, SQL Server Compact, or SQL Server) or document databases.
If the data you're storing is in fact data that influences the setup / configuration of your application, then your current use of application settings might be fine.
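Here is the short sketch of the dictionary alternative mentioned above (the values are made up):
var nrOfScalesByStore = new Dictionary<string, int>
{
    ["store1"] = 4,
    ["store2"] = 7
};
int numberOfScales = nrOfScalesByStore[$"store{storeNumber}"];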
I want to read an XML file's data and extract information from it to show in a form, in a C# .NET VS2008 environment.
I can't read this data into a list in memory, but I want to have a .exe file from the project that can run on another computer system. So I think I can't use a database for saving and retrieving the data!
Please help me to solve my problem!
Use System.Xml.XmlReader to read the XML file. This is a forward only reader which only loads a bit of XML at a time.
Combine this with an iterator method which produces an IEnumerable, such as this example, where MyXmlData is a class representing what is held in the XML file and that your forms can work with:
public IEnumerable<MyXmlData> QueryFile(String xmlFile)
{
using (var reader = XmlReader.Create(xmlFile))
{
// These are the variables you want to read for each 'object' in the
// xml file.
var prop1 = String.Empty;
var prop2 = 0;
var prop3 = DateTime.Today;
while (reader.Read())
{
// Here you'll have to read an xml node at a time.
// As you read, assign appropriate values to the variables
// declared above.
if (/* Have we finished reading an item? */)
{
// Once you've completed reading xml representing a single
// MyXmlData object, return it using yield return.
yield return new MyXmlData(prop1, prop2, prop3);
}
}
}
}
The value returned from that method is a sequence of MyXmlData objects where each one will be created on demand as the file is read, one at a time. This will greatly reduce memory requirements for a large XML file.
You can then query your MyXmlData objects using Linq functions. For example, you can emulate paging by using the Take and Skip methods.
// Third page - 50 objects per page.
QueryFile(@"x.xml").Skip(100).Take(50);
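For completeness, here is a rough sketch of how the reading loop could be filled in; the element and attribute names are made up and would need to match your file:
public IEnumerable<MyXmlData> QueryFile(String xmlFile)
{
    using (var reader = XmlReader.Create(xmlFile))
    {
        // Assumes one <Item Prop1="..." Prop2="..." Prop3="..."/> element per object.
        while (reader.ReadToFollowing("Item"))
        {
            var prop1 = reader.GetAttribute("Prop1");
            var prop2 = int.Parse(reader.GetAttribute("Prop2"), CultureInfo.InvariantCulture);
            var prop3 = DateTime.Parse(reader.GetAttribute("Prop3"), CultureInfo.InvariantCulture);
            yield return new MyXmlData(prop1, prop2, prop3);
        }
    }
}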
Microsoft recently provided a syndication class in WCF; you can use it for this task.
You should look into VTD-XML; it is the most efficient XML parser in terms of memory usage without losing random access and XPath support.
http://vtd-xml.sf.net