Implementation of Decision Tree Classifier

The decision tree is grown using either depth-first or breadth-first search. When determining the optimal split for each node, either the gini or entropy methods can be used to calculate the node impurity. The algorithm always selects the split that produces the maximum reduction in impurity.

class DecisionTreeClassifier : public ml::BaseModel

Implement decision tree using recursive partitioning algorithm.

Public Functions

DecisionTreeClassifier(nlohmann::json model_parameters, shared_ptr<Logger> logger): Constructor.

inline ~DecisionTreeClassifier(): Destructor.

virtual void set_data(TrainTestData &&train_test)

Initialize data.

Note the use of std::move(object).member instead of move(object.member) See https://oliora.github.io/2016/02/12/where-to-put-std-move.html.

vector<unordered_map<double, int, DoubleHash, DoubleEqual>> get_classes_frequencies(const vector<int> &indices): Compile number of occurrences of each unique class for each output variable.

double get_impurity(const vector<int> &indices)

Calculate node impurity by evaluating the impurity separately for each output variable using the gini or entropy method and averaging the result.

This is supposedly how scikit-learn implements a decision tree. See https://stackoverflow.com/questions/50715574/how-is-the-impurity-decrease-of-a-split-computed-in-case-we-have-multiple-output

pair<shared_ptr<TreeNode>, shared_ptr<TreeNode>> split_node(shared_ptr<TreeNode> node): Determine best feature and value for splitting a node.

void breadth_first_search(): Grow decision tree using breadth first search.

void depth_first_search(shared_ptr<TreeNode> node): Grow decision tree using depth first search.

virtual void fit()

Grow a decision tree using the recursive partitioning algorithm.

Calls either depth_first_search or breadth_first_search.

vector<vector<double>> predict(): Perform inference using decision tree grown by call to fit method.

void report_fit_results(): Log characteristics of decision tree after training.

virtual void evaluate(): Evaluate model performance on test data.

Public Members

string impurity_method = "gini"

Criterion to quantify individual node impurities.

Allowed values are ‘gini’ and ‘entropy’.

string search_algorithm = "breadth"

Algorithm to grow decision tree.

Allowed values are “breadth” or “depth”. They correspond to breadth-first and depth-first searches respectively.

int total_nodes = 0: Total number of nodes.

int number_leaf_nodes = 0: Number of leaf nodes.

int number_splits = 0: Number of splits.

int max_depth = 0: Maximum depth of all nodes.

double max_feature_fraction = 1.0: Fraction determining the number of features that will be considered during each split.

int max_features = 0: Maximum number of features that will be considered during each split.

vector<double> feature_importances = {}: Feature Importances.

vector<vector<double>> train_features = {}: Training features.

vector<vector<double>> train_labels = {}: Training labels.

vector<vector<double>> test_features = {}: Test features.

vector<vector<double>> test_labels = {}: Test labels.

shared_ptr<TreeNode> root: Pointer to tree root.