ML.NET - Loading variable number of feature columns - c#

I would like a general learning pipeline (predict a label from N features, for example), in the sense that one of my input CSVs would have 5 features and another would have 10 features. Those two CSVs would obviously produce different models; I don't want to combine them in any way, I just want to run the same program on both CSVs.
However, to load the features, I need to use
TextLoader(...).CreateFrom<ClassA>()
where ClassA defines my schema. Its properties need to reflect the CSV format, therefore the CSV must always have the same number of columns.
I have noticed CustomTextLoader but it's obsolete. Any ideas? Thank you.

Taking a look at the source: (https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML/Data/TextLoader.cs)
CreateFrom looks like nothing more than a helper method that populates Arguments.Column and the other Arguments fields, all of which are publicly accessible. This means that you could write your own implementation.
TextLoader tl = new TextLoader(inputFileName);
tl.Arguments.HasHeader = useHeader;
tl.Arguments.Separator = new[] { separator };
tl.Arguments.AllowQuoting = allowQuotedStrings;
tl.Arguments.AllowSparse = supportSparse;
tl.Arguments.TrimWhitespace = trimWhitespace;
And now the important part: you'll need to populate TextLoader.Arguments.Column with an entry for each column in your data set. If you know ahead of time that you'll have 5 or 10 columns, that would be the simplest; otherwise, I'd peek into the CSV file to figure out the number of columns (see the sketch after the code below).
tl.Arguments.Column = new TextLoaderColumns[numColumns];
tl.Arguments.Column[0] = new TextLoaderColumns(); // each entry must be instantiated before use
tl.Arguments.Column[0].Name = ...
tl.Arguments.Column[0].Source = ... // see the docs
tl.Arguments.Column[0].Type = ...
// and so on for each column.
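If the column count isn't known up front, one way to discover it is to peek at the first line of the file before building the column array. A minimal sketch (assumes the separator character is not quoted inside the header line):

// Count the columns by splitting the first line of the CSV.
int numColumns;
using (var reader = new System.IO.StreamReader(inputFileName))
{
    var firstLine = reader.ReadLine() ?? string.Empty;
    numColumns = firstLine.Split(separator).Length;
}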

jaket - thank you for your answer. I can see how that would work for loading the data into the TextLoader. However, how would you then train the model, given that the pipeline's Train() method also requires you to pass in an object defining the data schema:
PredictionModel<ClassA, ClassAPrediction> model = pipeline.Train<ClassA, ClassAPrediction>();
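In case it helps anyone landing here later: the newer MLContext-based API (ML.NET 1.x) dropped the compile-time schema class entirely, so both the loader columns and the feature concatenation can be built from a list at runtime. A rough sketch, assuming a regression task, a numeric label in the last column, and generated feature names F0..Fn (numFeatures would come from peeking at the file, as above):

using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// One TextLoader.Column per feature, plus the label, built at runtime.
var columns = Enumerable.Range(0, numFeatures)
    .Select(i => new TextLoader.Column($"F{i}", DataKind.Single, i))
    .Append(new TextLoader.Column("Label", DataKind.Single, numFeatures))
    .ToArray();

IDataView data = mlContext.Data.LoadFromTextFile(inputFileName, columns,
    separatorChar: ',', hasHeader: true);

// Concatenate whatever feature columns exist into a single "Features" vector.
var featureNames = Enumerable.Range(0, numFeatures).Select(i => $"F{i}").ToArray();
var pipeline = mlContext.Transforms.Concatenate("Features", featureNames)
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Label"));

var model = pipeline.Fit(data);

Prediction can then go through model.Transform(someData) and reading the score column from the resulting IDataView, which also avoids needing a prediction class with a fixed schema.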


C# SQLDataReader accessing by column name

I have a Visual Basic application with a connection to an MS SQL database. I have code which defines a SqlDataReader, opens the connection, and calls ExecuteReader(). I use the following code to retrieve the data from the reader:
While myDataReader.Read()
    Session("menu_PEO") = myDataReader("menu_PEO")
    Session("menu_Transfer") = myDataReader("menu_Transfer")
    Session("menu_Loan") = myDataReader("menu_loan")
End While
menu_PEO, menu_Transfer, and menu_loan are 3 of the column headings in the data that the SQL returns.
I am now tasked with converting the code to C#. I have the following code in C# which works:
while (dataReader.Read())
{
    dbMenuPEO = dataReader.GetString(1);
    dbMenuTransfer = dataReader.GetString(2);
    dbMenuLoan = dataReader.GetString(3);
}
Since my SQL is a SELECT *, I cannot guarantee the order of the columns being returned, so I don't want to rely on the hard-coded index passed to GetString.
Is there a way in C# to specify the column name that I would like to retrieve, similar to the way it works in Visual Basic?
Thanks!
There are two ways to do this. The easy way is to use the named indexer:
dbMenuPEO = (string)dataReader["menu_PEO"];
Which is essentially the same as:
dbMenuPEO = (string)dataReader.GetValue(dataReader.GetOrdinal("menu_PEO"));
You can move the ordinal lookup outside of the loop, which is a bit more efficient. You can also use the strongly-typed accessors, which avoid the cast. This can be important when reading value types (int, DateTime, etc.) as it avoids needing to box the value, which would happen with GetValue.
var menuPeoIdx = dataReader.GetOrdinal("menu_PEO");
while (dataReader.Read())
{
    dbMenuPEO = dataReader.GetString(menuPeoIdx);
}
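The same pattern works for value-type columns; caching the ordinals and using the typed getters (GetInt32, GetDateTime, etc.) avoids both the per-row lookup and the boxing that GetValue incurs. A small sketch with hypothetical column names:

var idIdx = dataReader.GetOrdinal("menu_Id");           // hypothetical int column
var createdIdx = dataReader.GetOrdinal("menu_Created"); // hypothetical datetime column
while (dataReader.Read())
{
    int id = dataReader.GetInt32(idIdx);
    DateTime created = dataReader.GetDateTime(createdIdx);
}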
Generally, you should never use SELECT * in a production system; always explicitly list the columns you want. Doing so allows the underlying tables to change without breaking the consuming code, and it avoids over-selecting columns that you don't need. SELECT * is a great tool for "exploring" a data set, but it should only be used as a dev tool.
I have code which defines a SqlDataReader, opens the connection, and executes the ExecuteReader()
And isn't it the most incredibly tedious code to have to write? Many people have thought so over the years, and many things have been invented to relieve you of the tedium. MarkPflug's answer directly addresses your question, but just in case you aren't aware that significant productivity boosts are available, I'd like to introduce you to one of these technologies.
Is there a way in C# to specify the column name that I would like to retrieve, similar to the way it works in Visual Basic?
Here's a way to do it where you don't have to type the column name yet again; it avoids repeating the same thing you've already typed twice (once for the variable name, once in the SQL).
Use the NuGet package manager built into Visual Studio to install Dapper.
Then let's say you have a class that holds your data:
//C#
record DbMenu(string DbMenuPEO, string DbMenuTransfer, string DbMenuLoan);
'or VB, if you like that sort of thing
Class DbMenu
    Public Property DbMenuPEO As String
    Public Property DbMenuTransfer As String
    Public Property DbMenuLoan As String
End Class
You can get Dapper to make your query, add any parameters, open your connection, download your data, fill a list of your classes, close the connection and return it, all in one line of code:
//C#
using var conn = ... //code here that gets a connection; doesn't need to be open
var myListOfDbMenus = conn.Query<DbMenu>("SELECT * FROM ... ");
'VB
Using conn = ...
Dim myListOfDbMenus = conn.Query(Of DbMenu)("SELECT * FROM ... ")
End Using
The short version is: your C# class properties should be named the same as your columns. If they aren't, it's easiest to use AS xyz in the SQL to align the names. If you want to write a parameterized query, you provide @parameterNames that are the same as the property names of an anonymous object you pass along with your query:
var q = conn.Query<Type>("SELECT ... WHERE col = @val1", new { val1 = "hello" });
If you like writing SQL and having that low-level control, or don't want to use an ORM like EF, then Dapper lets you carry on writing the SQL directly as you're doing, but takes away all the repetitive surrounding boilerplate.
The SqlDataReader class exposes a set of instance methods that allow you to retrieve the value of a field as a specific type, such as GetInt32() and GetString(), which can be used in combination with the GetOrdinal() method.
Example
dataReader.GetString(dataReader.GetOrdinal("YOUR FIELD NAME HERE"))

Is there a way to Clone one or many Entities (records) in Code

NOTE: at this time I am stuck on 2sxc v9.43.2 on this project.
After selecting a set of records from my Content Type, I need to be able to duplicate them, changing 1 of the fields along the way. Here is my almost-working idea so far. The use case is simple: they have Programs that people can register for. They change each Season, but only a little (prices, dates/times, etc.), and they need the current Season live and unchanged while they edit the next Season. So we are still in the fall season (EntityId 1732) with 97 active programs. We want to click a button and clone all 97 programs as-is, but into the new next Season (1735 below).
Three questions:
If this way works, what syntax would work on ent/Attributes to deliver the "object" as needed in the fields.Add() line?
Is there another 2sxc way to do this? Some other variant of the App.Data.Create() method or some other method in the API? I just need to duplicate the record with 1 field (Season) changed.
Is there a better way to do this in the latest versions of 2sxc, v11.7+?
// we are going to duplicate the current Season's programs in to the new season
// cheating for now, pre-made new 1735 in Seasons, current is 1732
var programs = AsDynamic(App.Data["Programs"])
    .Where(p => ((List<DynamicEntity>)p.Season).First().EntityId == selectedSeason.EntityId);
// #programs.Count() // 97
foreach (var copy in programs)
{
    var fields = new Dictionary<string, object>();
    var ent = AsEntity(copy);
    foreach (var attr in ent.Attributes)
    {
        if (attr.Key == "Season")
        {
            fields.Add(attr.Key, new List<int> { 1735 });
        }
        else
        {
            fields.Add(attr.Key, ent.GetBestValue(attr.Key)); // object??
        }
    }
    App.Data.Create("Programs", fields);
}
There are at least 3 ways to clone:
1. the simple way, using the edit UI
2. the hard way, using the C# / server API
3. the semi-hard way, using the REST API
The simple way is to use the edit UI. You can see an example in the replace-dialog; there is a copy button there. It opens the edit UI with an existing item but tells it it's a copy, so on save it creates a new one.
Combine this with a prefill or something and I think you would be good to go.
The second way is using the App.Data.Create - your code looks fairly good. I assume it also works and you were just wondering if there was a 1-liner - or am I mistaken?
The last way is using the JS REST API. Basically, write some JS that gets an item, changes the object (resets the id) and posts it back to the endpoint for saving.
Just stumbled upon a situation where I needed to create an entity and set a field value whose type is another entity. If that's your question #1, you need to add the EntityGuid there:
fields.Add(attr.Key, attr.EntityGuid);
That should bind one entity to another. And no, I haven't stumbled upon a better way to copy an entity than just creating a new one. At least so far.

Encog C#, VersatileMLDataSet from CSV, how to get original data?

I want to use the CSV reader from the Encog library, like this:
var format = new CSVFormat('.', ' ');
IVersatileDataSource source = new CSVDataSource(filename, false, format);
var data = new VersatileMLDataSet(source);
Is it possible to get the original data from the variable data? I have to show records from the CSV to the user in a DataGridView before I use it for the neural network. I want to be able to modify the original data as well. According to the documentation there is a Data property, but it doesn't work for me. If I try something like:
data.Data[1][1]
I get a null reference exception. There is another problem with using the data before normalization. I want to get the count of records with:
data.GetRecordCount()
But I get the error "You must normalize the dataset before using it". So even though I have not used the data yet, I have to normalize it? If this is true, then it is probably better to use my own CSV reader and then load the data into Encog from memory, right?
So I just looked at the Encog source code on GitHub. Thankfully your question is well defined and narrow in scope, so I can provide an answer. Unfortunately, you probably won't like it.
Basically, when you pass in your IVersatileDataSource into the constructor for VersatileMLDataSet, it gets placed into a private readonly field called _source. There is no abstraction around _source, so you cannot access it from outside of VersatileMLDataSet.
The Data property indeed will only be populated during the normalization process. There also don't appear to be any public fields within CSVDataSource of any value to you (again, all private).
If you just wanted to look at a single column of data, you could stay within Encog and look at Encog.Util.NetworkUtil.QuickCSVUtils. There are methods within this class that will help you pick up a file and get a single column of data out quickly.
If you wanted to get the full CSV data out of a file within Encog, you could use the Encog.Util.CSV.ReadCSV class to get the data. This is the underlying implementation used by your code anyway when you instantiate QuickCSVUtils. You will have to provide some wrapper logic around ReadCSV, similar to QuickCSVUtils. If you go this route, I'd recommend peeking into that class to see how it uses ReadCSV. Essentially, ReadCSV reads a single line at a time.
But if you really need to read the RAW csv data from within the VersatileMLDataSet class, your best bet would be to provide your own implementation inside a custom class derived from VersatileMLDataSet.
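Separately, if all you need is to show the raw rows in the DataGridView (and let the user edit them) before handing the file to Encog, you don't strictly need Encog for that part. A minimal sketch that loads the CSV into a DataTable, which a DataGridView can bind to directly (assumes a simple comma-separated file without quoted fields; dataGridView1 is your grid control):

var table = new System.Data.DataTable();
foreach (var line in System.IO.File.ReadLines(filename))
{
    var parts = line.Split(',');
    // Add columns the first time through (or if a later row is wider).
    while (table.Columns.Count < parts.Length)
        table.Columns.Add($"Col{table.Columns.Count + 1}");
    table.Rows.Add(parts);
}
dataGridView1.DataSource = table;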
There are a couple of steps you need to do after reading in your file:
You have to define your column types
Analyze data
Map output
Set up normalization strategy
Get your data count
Optionally clone the data.Data to keep originals
The code is below with appropriate comments.
var filename = @"iris.data.csv";
var format = new CSVFormat('.', ',');
IVersatileDataSource source = new CSVDataSource(filename, false, format);
var data = new VersatileMLDataSet(source);
// Define columns to read data in.
data.DefineSourceColumn("Col1", 0, ColumnType.Continuous);
data.DefineSourceColumn("Col2", 1, ColumnType.Continuous);
data.DefineSourceColumn("Col3", 2, ColumnType.Continuous);
data.DefineSourceColumn("Col4", 3, ColumnType.Continuous);
ColumnDefinition outputColumn = data.DefineSourceColumn("Col5", 4, ColumnType.Nominal);
// Analyze data
data.Analyze();
// Output mapping
data.DefineSingleOutputOthersInput(outputColumn);
// Set normalization strategy
data.NormHelper.NormStrategy = new BasicNormalizationStrategy(-1, 1, -1, 1);
data.Normalize();
// Get count
var count = data.GetRecordCount();
// Clone to get original data
var originalData = data.Data.Clone();
For more details check the quickstart paper.
Sample data I'm using comes from here.

Copy Row from DataTable to another with different column schemas

I am working on optimizing some code I have been assigned from a previous employee's code base. Beyond the fact that the code is pretty well "spaghettified" I did run into an issue where I'm not sure how to optimize properly.
The below snippet is not an exact replication, but should detail the question fairly well.
He is taking one DataTable from an Excel spreadsheet and placing rows into a consistently formatted DataTable which later updates the database. This seems logical to me; however, the way he is copying data seems convoluted, and it is a royal pain to modify, maintain, or add new formats.
Here is what I'm seeing:
private void VendorFormatOne()
{
    // dtSubmit is declared with its column schema elsewhere
    for (int i = 0; i < dtFromExcelFile.Rows.Count; i++)
    {
        dtSubmit.Rows.Add(i);
        dtSubmit.Rows[i]["reference_no"] = dtFromExcelFile.Rows[i]["VENDOR REF"];
        dtSubmit.Rows[i]["customer_name"] = dtFromExcelFile.Rows[i]["END USER ID"];
        // etc etc etc
    }
}
To me this is completely overkill for mapping columns to a different schema, but I can't think of a way to do this more gracefully. In the actual solution, there are about 20 of these methods, all using different formats from dtFromExcelFile and the column list is much longer. The column schema of dtSubmit remains the same across the board.
I am looking for a way to avoid having to manually map these columns every time the company needs to load a new file from a vendor. Is there a way to do this more efficiently? I'm sure I'm overlooking something here, but did not find any relevant answers on SO or elsewhere.
This might be overkill, but you could define an XML file that describes which Excel column maps to which database field, then input that along with each new Excel file. You'd want to whip up a class or two for parsing and consuming that file, and perhaps another class for validating the Excel file against the XML file.
Depending on the size of your organization, this may give you the added bonus of being able to offload that tedious mapping to someone less skilled. However, it is quite a bit of setup work and if this happens only sparingly, you might not get a significant return on investment for creating so much infrastructure.
Alternatively, if you're using MS SQL Server, this is basically what SSIS is built for, though in my experience, most programmers find SSIS quite tedious.
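To make the XML idea concrete: the mapping file only needs source/destination pairs, and the consuming code becomes one generic loop instead of 20 hand-written methods. A rough sketch (the element names, method name, and file layout are all made up for illustration):

// Example mapping file, one per vendor format:
// <mappings>
//   <map source="VENDOR REF" destination="reference_no" />
//   <map source="END USER ID" destination="customer_name" />
// </mappings>

// using System.Data; using System.Linq; using System.Xml.Linq;
private void ImportWithMapping(DataTable dtFromExcelFile, DataTable dtSubmit, string mappingFile)
{
    // Read the source/destination pairs from the mapping file.
    var maps = XDocument.Load(mappingFile).Root.Elements("map")
        .Select(m => new
        {
            Source = (string)m.Attribute("source"),
            Destination = (string)m.Attribute("destination")
        })
        .ToList();

    // Copy each Excel row into the submit table using the mapping.
    foreach (DataRow sourceRow in dtFromExcelFile.Rows)
    {
        var targetRow = dtSubmit.NewRow();
        foreach (var map in maps)
            targetRow[map.Destination] = sourceRow[map.Source];
        dtSubmit.Rows.Add(targetRow);
    }
}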
I had originally intended this just as a comment but ran out of space. It's in reply to Micah's answer and your first comment therein.
The biggest problem here is the amount of XML mapping would equal that of the manual mapping in code
Consider building a small tool that, given an Excel file with two columns, produces the XML mapping file. Now you can offload the mapping work to the vendor, or an intern, or indeed anyone who has a copy of the requirement doc for a particular vendor project.
Since the file is then loaded at runtime in your import app or whatever, you can change the mappings without having to redeploy the app.
Having used exactly this kind of system many, many times in the past, I can tell you this: you will be very glad you took the time to do it - especially the first time you get a call right after deployment along the lines of "oops, we need to add a new column to the data we've given you, and we realised that we've misspelled the 19th column by the way."
About the only thing that can perhaps go wrong is data type conversions, but you can build that into the mapping file (type from/to) and generalise your import routine to perform the conversions for you.
Just my 2c.
A while ago I ran into a similar problem where I had over 400 columns from 30-odd tables to be mapped to about 60 in the actual table in the database. I had the same dilemma of whether to go with a schema or write something custom.
There was so much duplication that I ended up writing a simple helper class with a couple of overridden methods that basically took in a column name from the import table and spat out the database column name. Also, for the column names, I built a separate class of the format
public static class ColumnName
{
    public const string FirstName = "FirstName";
    public const string LastName = "LastName";
    // ...
}
Same thing goes for TableNames as well.
This made it much simpler to maintain table names and column names. Also, this handled duplicate columns across different tables really well avoiding duplicate code.
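A sketch of what such a mapping helper might look like, using the earlier question's vendor columns as the example (the class and method names here are made up):

// using System;
public static class ColumnMapper
{
    // Takes a column name from the vendor's import file and
    // returns the corresponding database column name.
    public static string FromVendorOne(string importColumn)
    {
        switch (importColumn)
        {
            case "VENDOR REF": return "reference_no";
            case "END USER ID": return "customer_name";
            // ... one case per column in this vendor's format
            default:
                throw new ArgumentException($"Unmapped column: {importColumn}");
        }
    }
}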

To use Class or not to use Class, many objects

My program (a console app) basically reads a very big CSV file and processes it. There are columns in the file that I feel can be grouped together and are best served by a class.
For example, the first line is the title and the second line onward are the values; each column has this structure. So I need to group the title, the location of the column, and the values. The easiest way is to create a class.
This is what the data looks like:
title1, title2, title3, ...
1,1,2, ...
20,30,5000,...
.
.
.
class tt
{
    string title;
    int column;
    List<int> val = new List<int>();
}
But the problem is there are some 1,000 columns, which translates to 1,000 objects. Is this a good approach? I'm not sure.
A class with 1000 members would sound.... unusual, to be blunt. Since it is unlikely that the code is going to be referring to any of those by name, I would say it would be self-defeating to create members per-value. But for "should I create a class" - well, you don't have many options - it certainly would make a very bad struct. Actually, I suspect this may be a fair scenario for DataTable - it is not something I usually recommend, but for the data you are describing, it will probably do the job fine. Yes, it has overheads - but it optimises away a number of issues - for example, by storing data internally in typed columns (rather than rows, as you might expect), it avoids having to box all the values during storage (although they still tend to get boxed during access, those boxes are collected during gen-0, so are cheap).
"Thousands" are low numbers in most computing scenarios.
Note: if you actually mean 1000 rows, and the columns are actually sane (say, less than 50, all meaningful), then I would say "sure, create a class that maps the data" - for example something like:
class Customer {
    public int Id {get;set;}
    public string Name {get;set;}
    //...
}
There are several tools that can help you read this from CSV into the objects, typically into a List<Customer> or IEnumerable<Customer>.
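One such tool is CsvHelper (available on NuGet); a minimal sketch, assuming the CSV has a header row whose names match the Customer properties:

// using System.Collections.Generic; using System.Globalization;
// using System.IO; using System.Linq; using CsvHelper;
using var reader = new StreamReader("customers.csv");
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
List<Customer> customers = csv.GetRecords<Customer>().ToList();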
but the problem is there are some 1,000 columns , which translate to 1,000 objects. is this a good approach
Yes, translating to 1,000 objects (if you need them) should not bother you.
But for so many columns, creating classes is a hell of a lot of work. I won't do that unless it is absolutely necessary. You can use a DataTable.
If your file is a big CSV file, I suggest loading it into a DataSet. I don't see any benefit in using a class for each column.
The CSV looks like a transposed regular data table, therefore each column is actually a row.
