When uploading data, importing data, or starting a model building session, you can specify metadata about the columns in your dataset. This allows the Nexosis API to better understand how your data should be used and interpreted when running machine learning algorithms.
Note: Setting column metadata is entirely optional. For simple datasets with only a few columns, the contents of the columns are inferred, and providing metadata may not make a difference in how the algorithms execute.
## dataType

The `dataType` property defines the kind of data that a column contains. The supported data types are `numeric`, `numericMeasure`, `logical`, `string`, `date`, and `text`.

A `logical` column accepts the following value pairs:

- `True` / `False`
- `1` / `0`
- `On` / `Off`
- `Yes` / `No`
Our handling of Imputation and Aggregation is discussed in more detail in the sections below.
## resultInterval

Forecast and impact sessions produce results at a specific interval, specified by the `resultInterval` parameter. That interval is something like `hour`, `day`, `week`, etc.

When you request one of these sessions, the Nexosis API goes through the following process against your data to prepare it for the session:

1. It aggregates the data along the `timestamp` column, up to the `resultInterval` requested in the session. The Nexosis API doesn't care whether your data is a daily/hourly/weekly rollup, or whether it's a raw feed of sensor data.
2. Once the data is aggregated to the `resultInterval`, it looks for gaps in the aggregated data and imputes missing values.

Depending on the nature of the data you're working with, you may need the Nexosis API to act differently with regard to the handling of missing values (Imputation) and the aggregation of that data (e.g. if your data is hourly but you're forecasting by day).
For example, if a column of data is a record of the number of sales, then the defaults (Imputing with a zero and summing up data when aggregating) make sense. However, if a column is a temperature reading then those defaults don’t make sense. In that case you’d probably want a missing value to be filled in with the average of the adjacent values, and to use the mean temperature when aggregating.
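As a rough sketch of how those two defaults differ, here is what zero-filling a count column versus neighbor-averaging a measurement column looks like. This is illustrative only, not the Nexosis implementation, and the helper names are ours:

```python
def impute_zeroes(values):
    """Fill gaps (None) with 0 -- sensible for counts such as sales."""
    return [0 if v is None else v for v in values]

def impute_mean_of_neighbors(values):
    """Fill gaps with the average of the nearest non-missing values on
    either side -- sensible for readings such as temperature."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            left = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] is not None), None)
            right = next((filled[j] for j in range(i + 1, len(filled))
                          if filled[j] is not None), None)
            neighbors = [x for x in (left, right) if x is not None]
            filled[i] = sum(neighbors) / len(neighbors) if neighbors else 0
    return filled

sales = [12, None, 9]        # a missing hour means no sales were recorded
temps = [60.0, None, 70.0]   # a missing reading is best interpolated

print(impute_zeroes(sales))              # [12, 0, 9]
print(impute_mean_of_neighbors(temps))   # [60.0, 65.0, 70.0]
```

The point of the sketch is only that the sensible fill value depends on what the column measures, which is why the API lets you override the strategy per column.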
Each `dataType` available for a column of data in the Nexosis API comes with a set of default Imputation and Aggregation strategies that we feel are sensible choices for those data types. If you wish to manually override the strategy used by the API for a column of data, you can do so through the `imputation` and `aggregation` fields.
The supported Imputation strategies are:

- `zeroes` – A missing value is filled in with 0.
- `mean` – A missing value is filled in with the average of the nearest values we can find on either side of the gap.
- `median` – A missing value is filled in with the median of the rest of the values in that column.
- `mode` – A missing value is filled in with the mode of the rest of the values in that column.
- `max` – A missing value is filled in with the maximum of the rest of the values in that column.
- `min` – A missing value is filled in with the minimum of the rest of the values in that column.
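The column-wide strategies (`median`, `mode`, `max`, `min`) all compute a single fill value from the rest of the column. A minimal sketch, with names of our own choosing rather than anything from the API:

```python
from statistics import median, mode

# Map strategy names to functions computed over the non-missing values.
STRATEGIES = {
    "median": median,
    "mode": mode,
    "max": max,
    "min": min,
}

def impute(values, strategy):
    present = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](present)
    return [fill if v is None else v for v in values]

column = [3, 1, None, 3, 7]
print(impute(column, "mode"))  # [3, 1, 3, 3, 7]
print(impute(column, "max"))   # [3, 1, 7, 3, 7]
print(impute(column, "min"))   # [3, 1, 1, 3, 7]
```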
The supported Aggregation strategies, used when rolling your data up to the session's `resultInterval`, are:

- `sum` – The resulting value is the sum of all column values that fall within the `resultInterval`.
- `mean` – The resulting value is the average of all column values that fall within the `resultInterval`.
- `median` – The resulting value is the median of all column values that fall within the `resultInterval`.
- `mode` – The resulting value is the mode of all column values that fall within the `resultInterval`.
- `max` – The resulting value is the maximum of all column values that fall within the `resultInterval`.
- `min` – The resulting value is the minimum of all column values that fall within the `resultInterval`.
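To make the aggregation step concrete, here is a sketch of rolling hourly values up to a daily `resultInterval`. The API does this server-side; this client-side version is only for illustration:

```python
from collections import defaultdict
from statistics import mean

def aggregate(rows, strategy):
    """rows: (day, value) pairs; returns {day: aggregated value}."""
    buckets = defaultdict(list)
    for day, value in rows:
        buckets[day].append(value)
    funcs = {"sum": sum, "mean": mean, "max": max, "min": min}
    return {day: funcs[strategy](vals) for day, vals in buckets.items()}

hourly = [("2017-01-01", 5), ("2017-01-01", 7), ("2017-01-02", 4)]
print(aggregate(hourly, "sum"))   # {'2017-01-01': 12, '2017-01-02': 4}
print(aggregate(hourly, "mean"))  # daily averages instead of totals
```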
Default Imputation and Aggregation strategies are defined for each `dataType` available. Those defaults are below.

| DataType | Typical Usage | Imputation Default | Aggregation Default |
| --- | --- | --- | --- |
| Numeric | Number of sales, etc. | zeroes | sum |
| NumericMeasure | Temperature reading, or a value that's an interval itself (transactions/sec, etc.) | mean | mean |
| Logical | Values that are a simple yes/no | mode | mode |
| String | Categorical data as described in DataTypes | mode | mode |
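The table above can be encoded as a simple lookup, which is handy when generating column metadata programmatically (the helper below is ours, not part of any client library):

```python
# Default strategies per dataType, as listed in the table above.
DEFAULTS = {
    "numeric":        {"imputation": "zeroes", "aggregation": "sum"},
    "numericMeasure": {"imputation": "mean",   "aggregation": "mean"},
    "logical":        {"imputation": "mode",   "aggregation": "mode"},
    "string":         {"imputation": "mode",   "aggregation": "mode"},
}

def default_strategies(data_type):
    return DEFAULTS[data_type]

print(default_strategies("numeric"))
# {'imputation': 'zeroes', 'aggregation': 'sum'}
```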
Date values are generally used for timestamps, and so you cannot explicitly set imputation or aggregation strategy for these.
Text columns are not currently used by time-series sessions, and so text values are never aggregated. Missing text values are treated as empty.
Impact sessions should be started with an override of column roles if the value being tested for impact is defined as a feature in the dataset. This ensures that the algorithms process the dataset as if the impact feature was not present. The column which specifies the impact event should be set to a role of `None`.
Refer to the Specifying Features tutorial for a more in-depth look at overriding column roles.
```json
{
    "columns" : {
        "timeStamp" : {
            "dataType" : "date",
            "role" : "timestamp"
        },
        "sales" : {
            "dataType" : "numeric",
            "role" : "target"
        },
        "temperature" : {
            "dataType" : "numericMeasure",
            "role" : "feature"
        },
        "promotion" : {
            "dataType" : "logical",
            "role" : "feature"
        },
        "peakCustomersPerHour" : {
            "dataType" : "numeric",
            "role" : "feature",
            "imputation" : "zeroes",
            "aggregation" : "mean"
        }
    }
}
```
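If you are building this metadata from code, the same structure is just a nested dictionary serialized to JSON. The request URL and header in the comment below are assumptions for illustration, not confirmed endpoint details:

```python
import json

# The columns metadata from the example above, as a Python dict.
payload = {
    "columns": {
        "timeStamp": {"dataType": "date", "role": "timestamp"},
        "sales": {"dataType": "numeric", "role": "target"},
        "temperature": {"dataType": "numericMeasure", "role": "feature"},
        "promotion": {"dataType": "logical", "role": "feature"},
        "peakCustomersPerHour": {
            "dataType": "numeric",
            "role": "feature",
            "imputation": "zeroes",    # override: a missing hour means 0 customers
            "aggregation": "mean",     # override: average peaks, don't sum them
        },
    }
}

body = json.dumps(payload)
# e.g. requests.put("https://example.invalid/v1/data/mydataset",  # hypothetical URL
#                   headers={"api-key": "YOUR_KEY"}, data=body)
print(json.loads(body)["columns"]["peakCustomersPerHour"]["aggregation"])  # mean
```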