I have been struggling to create the proper data structure for ML.net and to load it into my application. Essentially, I have an application where the training data is dynamic: its type and/or size is not known until runtime. In addition, I have to convert the training data from non-standard primitive types (i.e. App_Boolean or App_Number rather than simply bool or double, etc.). This has been proving to be a problem as I try to convert my training data into a generic data type that can then be loaded from memory using the LoadFromEnumerable function.
I have four basic data type classes:
public class MLIntData
{
    public MLIntData(string label, List<object> l)
    {
        Label = label;
        foreach (App_Integer element in l)
            Features.Add((int)element.Value);
    }

    public List<int> Features { get; set; } = new List<int>();
    public string Label { get; set; } = "";
}

public class MLNumberData
{
    public MLNumberData(string label, List<object> l)
    {
        Label = label;
        foreach (App_Number element in l)
            Features.Add((double)element.Value);
    }

    public List<double> Features { get; set; } = new List<double>();
    public string Label { get; set; } = "";
}

public class MLBoolData
{
    public MLBoolData(string label, List<object> l)
    {
        Label = label;
        foreach (App_Boolean element in l)
            Features.Add((bool)element.Value);
    }

    public List<bool> Features { get; set; } = new List<bool>();
    public string Label { get; set; } = "";
}

public class MLTextData
{
    public MLTextData(string label, List<object> l)
    {
        Label = label;
        foreach (App_String element in l)
            Features.Add(element.Value.ToString());
    }

    public List<string> Features { get; set; } = new List<string>();
    public string Label { get; set; } = "";
}
So, each data class contains a label for the data and then a list of features of type bool, double, int, or string.
Now, in my ML.net code I'm trying to load the training data and then create an IDataView object from it. First I loop through the input data (which originally arrives as the generic type object) and create the new data classes.
List<object> data = new List<object>();
for (int i = 0; i < input.Count; i++)
{
    MLCodifiedData codifiedData = input[i].Value as MLCodifiedData;
    if (codifiedData == null) // skip entries that are not MLCodifiedData
        continue;
    Type dataType = codifiedData.Features[0].GetType();
    if (dataType == typeof(App_Boolean))
    {
        data.Add(new MLBoolData(codifiedData.Label, codifiedData.Features));
    }
    else if (dataType == typeof(App_Number))
    {
        data.Add(new MLNumberData(codifiedData.Label, codifiedData.Features));
    }
    else if (dataType == typeof(App_Integer))
    {
        data.Add(new MLIntData(codifiedData.Label, codifiedData.Features));
    }
    else if (dataType == typeof(App_String))
    {
        data.Add(new MLTextData(codifiedData.Label, codifiedData.Features));
    }
}
IDataView TrainingData = mlContext.Data.LoadFromEnumerable<object>(data);
I have tried creating a schema definition, which can be passed as the second parameter to LoadFromEnumerable, but I can't get that to work. I've also tried building a schema with the schema builder, but that doesn't seem to work either. Right now, I'm using one of the datasets included in the sample files. And to preempt questions: yes, I know I could simply load the data from a file and read it in that way. However, in my app I need to first read the CSV into memory and then build the data structure, so I can't really use the many examples that are geared toward reading in a CSV file using the LoadFromTextFile method. Can anyone provide support as to how I could set up a dynamic in-memory collection and convert it into an IDataView object?
I have a requirement where a user will upload a CSV file in the format below, containing around 1.8 to 2 million records:
SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2
44,608646,TEXAS AVE,ODESSA,TX,79762,
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212
What I have to do is first validate the file and then save it to the database only if it validates successfully with no errors. The validations I have to apply differ per column. For example:
SITE_ID: integer only, required
HOUSE: integer, required
STREET: alphanumeric, required
CITY: alphabets only, required
STATE: 2 alphabets only, required
ZIP: 5 digits only, required
APARTMENT: integer only, optional
I need a generic way of applying these validations to their respective columns. What I have tried so far: I converted the CSV file to a DataTable and planned to validate each cell with a regex, but this doesn't seem like a generic or good solution to me. Can anyone help me in this regard and point me in the right direction?
Here is one efficient method:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Data.OleDb;
using System.Text.RegularExpressions;
using System.IO;

namespace ConsoleApplication23
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.csv";

        static void Main(string[] args)
        {
            CSVReader csvReader = new CSVReader();
            DataSet ds = csvReader.ReadCSVFile(FILENAME, true);
            RegexCompare compare = new RegexCompare();
            DataTable errors = compare.Get_Error_Rows(ds.Tables[0]);
        }
    }
    class RegexCompare
    {
        public static Dictionary<string, RegexCompare> dict = new Dictionary<string, RegexCompare>()
        {
            { "SITE_ID", new RegexCompare() { columnName = "SITE_ID", pattern = @"[^\d]+", positiveNegative = false, required = true } },
            { "HOUSE", new RegexCompare() { columnName = "HOUSE", pattern = @"[^\d]+", positiveNegative = false, required = true } },
            { "STREET", new RegexCompare() { columnName = "STREET", pattern = @"[A-Za-z0-9 ]+", positiveNegative = true, required = true } },
            { "CITY", new RegexCompare() { columnName = "CITY", pattern = @"[A-Za-z ]+", positiveNegative = true, required = true } },
            { "STATE", new RegexCompare() { columnName = "STATE", pattern = @"[A-Za-z]{2}", positiveNegative = true, required = true } },
            { "ZIP", new RegexCompare() { columnName = "ZIP", pattern = @"\d{5}", positiveNegative = true, required = true } },
            { "APARTMENT", new RegexCompare() { columnName = "APARTMENT", pattern = @"\d*", positiveNegative = true, required = false } },
        };

        string columnName { get; set; }
        string pattern { get; set; }
        // true: the whole value must match the pattern; false: any match means the value is invalid
        Boolean positiveNegative { get; set; }
        Boolean required { get; set; }

        public DataTable Get_Error_Rows(DataTable dt)
        {
            DataTable dtError = null;
            foreach (DataRow row in dt.AsEnumerable())
            {
                Boolean error = false;
                foreach (DataColumn col in dt.Columns)
                {
                    RegexCompare regexCompare = dict[col.ColumnName];
                    object colValue = row.Field<object>(col.ColumnName);
                    if (regexCompare.required)
                    {
                        if (colValue == null)
                        {
                            error = true;
                            break;
                        }
                    }
                    else
                    {
                        if (colValue == null)
                            continue;
                    }
                    string colValueStr = colValue.ToString();
                    Match match = Regex.Match(colValueStr, regexCompare.pattern);
                    if (regexCompare.positiveNegative)
                    {
                        if (!match.Success)
                        {
                            error = true;
                            break;
                        }
                        // the pattern must cover the entire value, not just part of it
                        if (colValueStr.Length != match.Value.Length)
                        {
                            error = true;
                            break;
                        }
                    }
                    else
                    {
                        if (match.Success)
                        {
                            error = true;
                            break;
                        }
                    }
                }
                if (error)
                {
                    if (dtError == null) dtError = dt.Clone();
                    dtError.Rows.Add(row.ItemArray);
                }
            }
            return dtError;
        }
    }
    public class CSVReader
    {
        public DataSet ReadCSVFile(string fullPath, bool headerRow)
        {
            string path = fullPath.Substring(0, fullPath.LastIndexOf("\\") + 1);
            string filename = fullPath.Substring(fullPath.LastIndexOf("\\") + 1);
            DataSet ds = new DataSet();
            try
            {
                if (File.Exists(fullPath))
                {
                    string ConStr = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=\"Text;HDR={1};FMT=Delimited\"", path, headerRow ? "Yes" : "No");
                    string SQL = string.Format("SELECT * FROM {0}", filename);
                    OleDbDataAdapter adapter = new OleDbDataAdapter(SQL, ConStr);
                    adapter.Fill(ds, "TextFile");
                    ds.Tables[0].TableName = "Table1";
                }
                foreach (DataColumn col in ds.Tables["Table1"].Columns)
                {
                    col.ColumnName = col.ColumnName.Replace(" ", "_");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return ds;
        }
    }
}
Here's a rather overengineered but really fun generic method, where you give attributes to your class to match them to CSV column headers:
First step is to parse your CSV. There are a variety of methods out there, but my favourite is the TextFieldParser that can be found in the Microsoft.VisualBasic.FileIO namespace. The advantage of using this is that it's 100% native; all you need to do is add Microsoft.VisualBasic to the references.
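For reference, here is a minimal sketch of such a TextFieldParser-based splitter. The SplitFile name and its signature are assumptions chosen to match the usage shown later in this answer; the parser settings shown are the standard ones for comma-delimited data with quoted fields.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

public static class CsvSplitter
{
    // Reads the whole file into a list of field arrays, one per CSV line.
    public static List<string[]> SplitFile(string path, Encoding encoding, char delimiter)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(path, encoding))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(delimiter.ToString());
            parser.HasFieldsEnclosedInQuotes = true; // handles quoted fields containing commas
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```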
Having done that, you have the data as List<String[]>. Now, things get interesting. See, now we can create a custom attribute and add it to our class properties:
The attribute class:
[AttributeUsage(AttributeTargets.Property)]
public sealed class CsvColumnAttribute : System.Attribute
{
    public String Name { get; private set; }
    public Regex ValidationRegex { get; private set; }

    public CsvColumnAttribute(String name) : this(name, null) { }

    public CsvColumnAttribute(String name, String validationRegex)
    {
        this.Name = name;
        this.ValidationRegex = new Regex(validationRegex ?? "^.*$");
    }
}
The data class:
public class AddressInfo
{
    [CsvColumnAttribute("SITE_ID", "^\\d+$")]
    public Int32 SiteId { get; set; }
    [CsvColumnAttribute("HOUSE", "^\\d+$")]
    public Int32 House { get; set; }
    [CsvColumnAttribute("STREET", "^[a-zA-Z0-9- ]+$")]
    public String Street { get; set; }
    [CsvColumnAttribute("CITY", "^[a-zA-Z0-9- ]+$")]
    public String City { get; set; }
    [CsvColumnAttribute("STATE", "^[a-zA-Z]{2}$")]
    public String State { get; set; }
    [CsvColumnAttribute("ZIP", "^\\d{1,5}$")]
    public Int32 Zip { get; set; }
    [CsvColumnAttribute("APARTMENT", "^\\d*$")]
    public Int32? Apartment { get; set; }
}
As you see, what I did here was link every property to a CSV column name, and give it a regex to validate the contents. On non-required stuff, you can still do regexes, but ones that allow empty values, as shown in the Apartment one.
Now, to actually match the columns to the CSV headers, we need to get the properties of the AddressInfo class, check for each property whether it has a CsvColumnAttribute, and, if it does, match its name to the column headers of the CSV file data. Once we have that, we have a list of PropertyInfo objects, which can be used to dynamically fill in the properties of new objects created for all rows.
This method is completely generic, allows giving the columns in any order in the CSV file, and parsing will work for any class once you assign the CsvColumnAttribute to the properties you want to fill in. It will automatically validate the data, and you can handle failures however you want. In this code, all I do is skip invalid lines, though.
// Note: this method needs System.Reflection (PropertyInfo), System.ComponentModel
// (TypeConverter, TypeDescriptor) and System.Text.RegularExpressions in the usings.
public static List<T> ParseCsvInfo<T>(List<String[]> split) where T : new()
{
    // No template row, or only a template row but no data. Abort.
    if (split.Count < 2)
        return new List<T>();
    String[] templateRow = split[0];
    // Create a dictionary of column headers and their index in the file data.
    Dictionary<String, Int32> columnIndexing = new Dictionary<String, Int32>();
    for (Int32 i = 0; i < templateRow.Length; i++)
    {
        // ToUpperInvariant is optional, of course. You could have case sensitive headers.
        String colHeader = templateRow[i].Trim().ToUpperInvariant();
        if (!columnIndexing.ContainsKey(colHeader))
            columnIndexing.Add(colHeader, i);
    }
    // Prepare the arrays of property parse info. We set the length
    // so the highest found column index exists in it.
    Int32 numCols = columnIndexing.Values.Max() + 1;
    // Actual property to fill in
    PropertyInfo[] properties = new PropertyInfo[numCols];
    // Regex to validate the string before parsing
    Regex[] propValidators = new Regex[numCols];
    // Type converters for automatic parsing
    TypeConverter[] propConverters = new TypeConverter[numCols];
    // Go over the properties of the given type, see which ones have a
    // CsvColumnAttribute, and put these in the list at their CSV index.
    foreach (PropertyInfo p in typeof(T).GetProperties())
    {
        object[] attrs = p.GetCustomAttributes(true);
        foreach (Object attr in attrs)
        {
            CsvColumnAttribute csvAttr = attr as CsvColumnAttribute;
            if (csvAttr == null)
                continue;
            Int32 index;
            if (!columnIndexing.TryGetValue(csvAttr.Name.ToUpperInvariant(), out index))
            {
                // If no valid column is found, and the regex for this property
                // does not allow an empty value, then all lines are invalid.
                if (!csvAttr.ValidationRegex.IsMatch(String.Empty))
                    return new List<T>();
                // No valid column found: ignore this property.
                break;
            }
            properties[index] = p;
            propValidators[index] = csvAttr.ValidationRegex;
            // Automatic type converter. This function could be enhanced by giving a
            // list of custom converters as extra argument and checking those first.
            propConverters[index] = TypeDescriptor.GetConverter(p.PropertyType);
            break; // Only handle one CsvColumnAttribute per property.
        }
    }
    List<T> objList = new List<T>();
    // Start from 1 since the first line is the template with the column names.
    for (Int32 i = 1; i < split.Count; i++)
    {
        Boolean abortLine = false;
        String[] line = split[i];
        // Make a new object of the given type.
        T obj = new T();
        for (Int32 col = 0; col < properties.Length; col++)
        {
            // It is possible a line is not long enough to contain all columns.
            String curVal = col < line.Length ? line[col] : String.Empty;
            PropertyInfo prop = properties[col];
            // This can be null if the column was not found but wasn't required.
            if (prop == null)
                continue;
            // Check validity. Abort buildup of this object if not valid.
            Boolean valid = propValidators[col].IsMatch(curVal);
            if (!valid)
            {
                // Add logging here? We have the line and column index.
                abortLine = true;
                break;
            }
            // Automated parsing. Always use nullable types for nullable properties.
            Object value = propConverters[col].ConvertFromString(curVal);
            prop.SetValue(obj, value, null);
        }
        if (!abortLine)
            objList.Add(obj);
    }
    return objList;
}
To use on your CSV file, simply do
// the function using VB's TextFieldParser
List<String[]> splitData = SplitFile(datafile, new UTF8Encoding(false), ',');
// The above function, applied to the AddressInfo class
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(splitData);
And that's it. Automatic parsing and validation, all through some added attributes on the class properties.
Note, if splitting the data in advance would give too much of a performance hit for large data, that's not really a problem; the TextFieldParser works from a Stream wrapped in a TextReader, so instead of giving a List<String[]> you can just give a stream and do the csv parsing on the fly inside the ParseCsvInfo function, simply reading per CSV line directly from the TextFieldParser.
I didn't do that here because the original use case for csv reading for which I wrote the reader to List<String[]> included automatic encoding detection, which required reading the whole file anyway.
I would suggest using a CSV library to read the file.
For example you can use LumenWorksCsvReader: https://www.nuget.org/packages/LumenWorksCsvReader
Your approach with a regex validation is actually OK.
For example, you could create a "validation dictionary" and check every CSV value against its regex expression.
Then you can build a function that validates a CSV file with such a "validation dictionary".
See here:
string lsInput = @"SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2
44,608646,TEXAS AVE,ODESSA,TX,79762,
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212";

Dictionary<string, string> loValidations = new Dictionary<string, string>();
loValidations.Add("SITE_ID", @"^\d+$"); // it can only be an integer and it is required.
//....

bool lbValid = true;
using (CsvReader loCsvReader = new CsvReader(new StringReader(lsInput), true, ','))
{
    while (loCsvReader.ReadNextRecord())
    {
        foreach (var loValidationEntry in loValidations)
        {
            if (!Regex.IsMatch(loCsvReader[loValidationEntry.Key], loValidationEntry.Value))
            {
                lbValid = false;
                break;
            }
        }
        if (!lbValid)
            break;
    }
}
Console.WriteLine($"Valid: {lbValid}");
Here is another way to accomplish your needs using Cinchoo ETL, an open source file helper library.
First, define a POCO class with DataAnnotations validation attributes as below:
public class Site
{
    [Required(ErrorMessage = "SiteID can't be null")]
    public int SiteID { get; set; }
    [Required]
    public int House { get; set; }
    [Required]
    public string Street { get; set; }
    [Required]
    [RegularExpression("^[a-zA-Z][a-zA-Z ]*$")]
    public string City { get; set; }
    [Required(ErrorMessage = "State is required")]
    [RegularExpression("^[A-Z][A-Z]$", ErrorMessage = "Incorrect state code.")]
    public string State { get; set; }
    [Required]
    [RegularExpression("^[0-9][0-9]*$")]
    public string Zip { get; set; }
    public int Apartment { get; set; }
}
Then use this class with ChoCSVReader to load the file and check its validity with the Validate()/IsValid() methods as below:
using (var p = new ChoCSVReader<Site>("*** YOUR CSV FILE PATH ***")
    .WithFirstLineHeader(true)
    )
{
    Exception ex;
    Console.WriteLine("IsValid: " + p.IsValid(out ex));
}
Hope it helps.
Disclaimer: I'm the author of this library.
I am trying to convert a DataTable to a list in a web service, in order to consume it in JavaScript. I tried the code below:
rows = (from p in dt.AsEnumerable()
        select new TableProperty() { }).ToList<TableProperty>();
for (int i = 0; i < rows.Count; i++)
{
    // Get List of table Properties..
    List<TableProperty> tProperties = new List<TableProperty>();
    for (int j = 0; j < dt.Columns.Count; j++)
    {
        try
        {
            TableProperty propertyValue = new TableProperty()
            {
                name = dt.Columns[j].ToString(),
                serial = dt.Columns[j].ToString(),
                address = dt.Columns[j].ToString(),
                mobile = dt.Columns[j].ToString(),
                password = dt.Columns[j].ToString()
            };
            tProperties.Add(propertyValue);
        }
        catch (Exception ex)
        {
            ex.ToString();
        }
    }
}
return rows;
where dt is my DataTable and TableProperty is the class in which I set and get my properties.
When I receive the result in JavaScript, my value returns null:
function draw_Result(ddata) {
    alert(ddata.length);
    for (var i = 0; i < ddata.length; i++) {
        alert("for");
        try {
            var post = ddata[i].name.toString();
            alert(post);
        }
        catch (e) {
            alert(e.message);
        }
    }
}
Any help?
rows = (from p in dt.AsEnumerable()
        select new TableProperty() { }).ToList<TableProperty>();
Have you tried debugging the program to see what rows is after you assign it like that?
Also, you seem to first be assigning the rows, then looping over that collection only to create a completely new collection tProperties (which would probably contain what you'd expect) and at the end you return rows again. What is the purpose of tProperties?
It looks as if rows were a list of empty TableProperty objects, since for each row (from p in dt.AsEnumerable()) you select an empty object (select new TableProperty() { }), and at the end you convert the enumerable to a list of still-empty objects.
You'd need to do your mapping at the point where you create the TableProperty, but that would imply that you know how many columns there are:
rows = (from p in dt.AsEnumerable()
        select new TableProperty()
        {
            name = p[0].ToString(),
            serial = p[1].ToString(),
            address = p[2].ToString(),
            mobile = p[3].ToString(),
            password = p[4].ToString()
        }).ToList<TableProperty>();
If you don't know how many columns to expect, you can always go with just the plain Enumerable - i.e. do the for loops using
var rows = dt.AsEnumerable()
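For illustration, here is a minimal sketch of that column-agnostic loop. The Flatten helper name and the sample columns are made up for this example; the point is that iterating dt.Columns lets you handle a row without knowing its schema at compile time:

```csharp
using System.Collections.Generic;
using System.Data;

public static class TableDump
{
    // Walks every row and column without assuming a fixed schema,
    // collecting each cell as "columnName=value".
    public static List<string> Flatten(DataTable dt)
    {
        var cells = new List<string>();
        foreach (DataRow row in dt.Rows)
            foreach (DataColumn col in dt.Columns)
                cells.Add(col.ColumnName + "=" + row[col]);
        return cells;
    }
}
```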
Just a tip: you could always make an extension to the DataTable object. When you extend the DataTable object this way, you make it possible to do this:
List<TableProperties> ListOfTableProperties = YourDataTable.DataTableToList<TableProperties>();
Here is how to do it:
Say you have this data model (a class):
public class TableProperties
{
    public int Indexer { get; set; }
    public string DisplayName { get; set; }
    public string TypeName { get; set; }
}
.. and you have this data table to work with:
DataTable tbl = new DataTable("test");
tbl.Columns.Add("Indexer");
tbl.Columns.Add("DisplayName");
tbl.Columns.Add("TypeName");
tbl.Rows.Add(2, "teststring1", "teststring2");
.. and when you do this:
List<TableProperties> urk = tbl.DataTableToList<TableProperties>();
... you get a list of type TableProperties.
The extension you need for this can be put in a static class of its own somewhere in your project. Here is the code for making such a class:
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Reflection;

public static class ExtensionMethods
{
    /// <summary>
    /// Converts a DataTable to a strongly typed list
    /// </summary>
    /// <typeparam name="T">Generic object</typeparam>
    /// <param name="table">DataTable</param>
    /// <returns>List with generic objects</returns>
    public static List<T> DataTableToList<T>(this DataTable table) where T : class, new()
    {
        try
        {
            List<T> list = new List<T>();
            foreach (var row in table.AsEnumerable())
            {
                T obj = new T();
                foreach (var prop in obj.GetType().GetProperties())
                {
                    try
                    {
                        PropertyInfo propertyInfo = obj.GetType().GetProperty(prop.Name);
                        propertyInfo.SetValue(obj, Convert.ChangeType(row[prop.Name], propertyInfo.PropertyType), null);
                    }
                    catch
                    {
                        // Property has no matching column or the value can't be converted; skip it.
                        continue;
                    }
                }
                list.Add(obj);
            }
            return list;
        }
        catch
        {
            return null;
        }
    }
}
Note that, for example, integers are not enclosed in quotes in the final result; the Convert.ChangeType call in the extension method converts each value to the property's type. Remember that the names of the properties must correspond to the field names of the DataTable object.
If you do not know the field names and/or do not have a data model at hand, you can use Linq.
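As a sketch of that LINQ approach (the ToDictionaryList name is made up for this example), each row can be projected into a column-name-to-value dictionary when no model class exists:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class TableToDictionaries
{
    // When no data model is at hand, project each row into a
    // dictionary keyed by column name instead of a typed object.
    public static List<Dictionary<string, object>> ToDictionaryList(DataTable table)
    {
        return table.Rows.Cast<DataRow>()
            .Select(row => table.Columns.Cast<DataColumn>()
                .ToDictionary(col => col.ColumnName, col => row[col]))
            .ToList();
    }
}
```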