
C45Learning Class

C4.5 Learning algorithm for Decision Trees.
Inheritance Hierarchy
System.Object
  Accord.MachineLearning.ParallelLearningBase
    Accord.MachineLearning.DecisionTrees.Learning.DecisionTreeLearningBase
      Accord.MachineLearning.DecisionTrees.Learning.C45Learning

Namespace:  Accord.MachineLearning.DecisionTrees.Learning
Assembly:  Accord.MachineLearning (in Accord.MachineLearning.dll) Version: 3.8.0
Syntax
[SerializableAttribute]
public class C45Learning : DecisionTreeLearningBase, 
	ISupervisedLearning<DecisionTree, double[], int>

The C45Learning type exposes the following members.

Constructors

C45Learning()
Initializes a new instance of the C45Learning class.

C45Learning(DecisionTree)
Initializes a new instance of the C45Learning class to continue learning an existing decision tree.

C45Learning(DecisionVariable[])
Initializes a new instance of the C45Learning class using the given decision variables as attributes.

Properties
Attributes
Gets or sets the collection of attributes to be processed by the induced decision tree.
(Inherited from DecisionTreeLearningBase.)

AttributeUsageCount (protected)
Gets how many times each attribute has already been used in the current path. In the original C4.5 and ID3 algorithms, attributes could be re-used only once, but in the framework implementation this behaviour can be adjusted by setting the Join property.
(Inherited from DecisionTreeLearningBase.)

Join
Gets or sets how many times a single variable can be integrated into the decision process. In the original ID3 algorithm, a variable can join only once per decision path (the path from the root to a leaf). If set to zero, a single variable can participate as many times as needed. Default is 1.
(Inherited from DecisionTreeLearningBase.)

MaxHeight
Gets or sets the maximum allowed height when learning a tree. If set to zero, the tree can have arbitrary height. Default is 0.
(Inherited from DecisionTreeLearningBase.)

MaxVariables
Gets or sets the maximum number of variables that can enter the tree. A value of zero indicates there is no limit. Default is 0 (no limit on the number of variables).
(Inherited from DecisionTreeLearningBase.)

Model
Gets or sets the decision tree being learned.
(Inherited from DecisionTreeLearningBase.)

ParallelOptions
Gets or sets the parallelization options for this algorithm.
(Inherited from ParallelLearningBase.)

SplitStep
Gets or sets the step between candidate split points considered when dividing continuous columns into binary classes. Default is 1.

Token
Gets or sets a cancellation token that can be used to cancel the algorithm while it is running.
(Inherited from ParallelLearningBase.)
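
For instance, these hyper-parameters can be set through an object initializer; a minimal sketch (the values below are purely illustrative):

// A minimal sketch of configuring the learner through the properties
// listed above (the values below are purely illustrative):
var teacher = new C45Learning()
{
    Join = 2,       // let each variable appear at most twice per path
    MaxHeight = 5,  // limit the height of the induced tree
    SplitStep = 1   // consider candidate thresholds at every sample
};

// Long-running learning procedures can be cancelled through the Token property:
var cts = new System.Threading.CancellationTokenSource();
teacher.Token = cts.Token;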
Methods
Add
Adds the specified variable to the list of Attributes.
(Inherited from DecisionTreeLearningBase.)

ComputeError (obsolete)
Computes the prediction error for the tree over a given set of inputs and outputs.

Equals
Determines whether the specified object is equal to the current object.
(Inherited from Object.)

Finalize (protected)
Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection.
(Inherited from Object.)

GetEnumerator
Returns an enumerator that iterates through the collection.
(Inherited from DecisionTreeLearningBase.)

GetHashCode
Serves as the default hash function.
(Inherited from Object.)

GetType
Gets the Type of the current instance.
(Inherited from Object.)

Learn(Double[][], Int32[], Double[])
Learns a model that can map the given inputs to the given outputs.

Learn(Int32[][], Int32[], Double[])
Learns a model that can map the given inputs to the given outputs.

Learn(Nullable<Int32>[][], Int32[], Double[])
Learns a model that can map the given inputs to the given outputs.

MemberwiseClone (protected)
Creates a shallow copy of the current Object.
(Inherited from Object.)

Run (obsolete)
Runs the learning algorithm, creating a decision tree modeling the given inputs and outputs.

ToString
Returns a string that represents the current object.
(Inherited from Object.)
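
As a hedged sketch of the discrete Learn overload, using hypothetical XOR-style toy data with two binary variables:

// A hedged sketch of the Learn(Int32[][], Int32[], Double[]) overload,
// using hypothetical XOR-style toy data with two binary variables:
var xorTeacher = new C45Learning(new[]
{
    new DecisionVariable("x", 2), // discrete variable with 2 symbols
    new DecisionVariable("y", 2)  // discrete variable with 2 symbols
});

int[][] xorInputs =
{
    new[] { 0, 0 }, new[] { 0, 1 },
    new[] { 1, 0 }, new[] { 1, 1 }
};
int[] xorOutputs = { 0, 1, 1, 0 }; // exclusive-or labels

DecisionTree xorTree = xorTeacher.Learn(xorInputs, xorOutputs);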
Extension Methods
HasMethod
Checks whether an object implements a method with the given name.
(Defined by ExtensionMethods.)

IsEqual
Compares two objects for equality, performing an elementwise comparison if the elements are vectors or matrices.
(Defined by Matrix.)

To(Type) (overloaded)
Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime.
(Defined by ExtensionMethods.)

To<T> (overloaded)
Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime.
(Defined by ExtensionMethods.)
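
As a quick, hedged illustration of the To<T>() extension (the same conversion used in the Iris examples below):

// Convert a string matrix into a numeric one at runtime
// (the literal values here are purely illustrative):
double[][] m = new[] { new[] { "2.5", "3.0" } }.To<double[][]>();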
Remarks

C4.5 (Quinlan, 1993) is an extension of the ID3 decision-tree induction algorithm. Unlike ID3, it can handle continuous attributes, splitting them by searching for the threshold that yields the best information gain, and it can learn from samples containing missing values, as the examples below demonstrate.

Examples

This example shows the simplest way to induce a decision tree with continuous variables.

            // In this example, we will process the famous Fisher's Iris dataset, in 
            // which the task is to classify whether the features of an Iris flower 
            // belong to an Iris setosa, an Iris versicolor, or an Iris virginica:
            // 
            //  - https://en.wikipedia.org/wiki/Iris_flower_data_set
            // 

            // First, let's load the dataset into an array of text that we can process
            string[][] text = Resources.iris_data.Split(new[] { "\r\n" },
                StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(','));

            // The first four columns contain the flower features
            double[][] inputs = text.GetColumns(0, 1, 2, 3).To<double[][]>();

            // The last column contains the expected flower type
            string[] labels = text.GetColumn(4);

            // Since the labels are represented as text, the first step is to convert
            // those text labels into integer class labels, so we can process them
            // more easily. For this, we will create a codebook to encode class labels:
            // 
            var codebook = new Codification("Output", labels);

            // With the codebook, we can convert the labels:
            int[] outputs = codebook.Translate("Output", labels);

            // And we can use the C4.5 for learning:
            C45Learning teacher = new C45Learning();

            // Finally induce the tree from the data:
            var tree = teacher.Learn(inputs, outputs);

            // To get the estimated class labels, we can use
            int[] predicted = tree.Decide(inputs);

            // The classification error (0.0266) can be computed as 
            double error = new ZeroOneLoss(outputs).Loss(predicted);

            // Moreover, we may decide to convert our tree to a set of rules:
            DecisionSet rules = tree.ToRules();

            // And using the codebook, we can inspect the tree reasoning:
            string ruleText = rules.ToString(codebook, "Output",
                System.Globalization.CultureInfo.InvariantCulture);

            // The output is:
            string expected = @"Iris-setosa =: (2 <= 2.45)
Iris-versicolor =: (2 > 2.45) && (3 <= 1.75) && (0 <= 7.05) && (1 <= 2.85)
Iris-versicolor =: (2 > 2.45) && (3 <= 1.75) && (0 <= 7.05) && (1 > 2.85)
Iris-versicolor =: (2 > 2.45) && (3 > 1.75) && (0 <= 5.95) && (1 > 3.05)
Iris-virginica =: (2 > 2.45) && (3 <= 1.75) && (0 > 7.05)
Iris-virginica =: (2 > 2.45) && (3 > 1.75) && (0 > 5.95)
Iris-virginica =: (2 > 2.45) && (3 > 1.75) && (0 <= 5.95) && (1 <= 3.05)
";

This is the same example as above, but the decision variables are specified manually.

            // In this example, we will process the famous Fisher's Iris dataset, in 
            // which the task is to classify whether the features of an Iris flower 
            // belong to an Iris setosa, an Iris versicolor, or an Iris virginica:
            // 
            //  - https://en.wikipedia.org/wiki/Iris_flower_data_set
            // 

            // First, let's load the dataset into an array of text that we can process
            string[][] text = Resources.iris_data.Split(new[] { "\r\n" },
                StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(','));

            // The first four columns contain the flower features
            double[][] inputs = text.GetColumns(0, 1, 2, 3).To<double[][]>();

            // The last column contains the expected flower type
            string[] labels = text.GetColumn(4);

            // Since the labels are represented as text, the first step is to convert
            // those text labels into integer class labels, so we can process them
            // more easily. For this, we will create a codebook to encode class labels:
            // 
            var codebook = new Codification("Output", labels);

            // With the codebook, we can convert the labels:
            int[] outputs = codebook.Translate("Output", labels);

            // Create a teaching algorithm:
            var teacher = new C45Learning()
            {
                new DecisionVariable("sepal length", DecisionVariableKind.Continuous),
                new DecisionVariable("sepal width", DecisionVariableKind.Continuous),
                new DecisionVariable("petal length", DecisionVariableKind.Continuous),
                new DecisionVariable("petal width", DecisionVariableKind.Continuous),
            };

            // Use the learning algorithm to induce a new tree:
            DecisionTree tree = teacher.Learn(inputs, outputs);

            // To get the estimated class labels, we can use
            int[] predicted = tree.Decide(inputs);

            // The classification error (0.0266) can be computed as 
            double error = new ZeroOneLoss(outputs).Loss(predicted);

            // Moreover, we may decide to convert our tree to a set of rules:
            DecisionSet rules = tree.ToRules();

            // And using the codebook, we can inspect the tree reasoning:
            string ruleText = rules.ToString(codebook, "Output",
                System.Globalization.CultureInfo.InvariantCulture);

            // The output is:
            string expected = @"Iris-setosa =: (petal length <= 2.45)
Iris-versicolor =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length <= 7.05) && (sepal width <= 2.85)
Iris-versicolor =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length <= 7.05) && (sepal width > 2.85)
Iris-versicolor =: (petal length > 2.45) && (petal width > 1.75) && (sepal length <= 5.95) && (sepal width > 3.05)
Iris-virginica =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length > 7.05)
Iris-virginica =: (petal length > 2.45) && (petal width > 1.75) && (sepal length > 5.95)
Iris-virginica =: (petal length > 2.45) && (petal width > 1.75) && (sepal length <= 5.95) && (sepal width <= 3.05)
";

This example shows how to handle missing values in the training data.

            // In this example, we will be using a modified version of the famous Play Tennis 
            // example by Tom Mitchell (1997), where some values have been replaced by missing 
            // values. We will use NaN double values to represent values missing from the data.

            // Note: this example uses DataTables to represent the input data, 
            // but this is not required. The same could be performed using plain
            // double[][] matrices and vectors instead.
            DataTable data = new DataTable("Tennis Example with Missing Values");

            data.Columns.Add("Day", typeof(string));
            data.Columns.Add("Outlook", typeof(string));
            data.Columns.Add("Temperature", typeof(string));
            data.Columns.Add("Humidity", typeof(string));
            data.Columns.Add("Wind", typeof(string));
            data.Columns.Add("PlayTennis", typeof(string));

            data.Rows.Add("D1", "Sunny", "Hot", "High", "Weak", "No");
            data.Rows.Add("D2", null, "Hot", "High", "Strong", "No");
            data.Rows.Add("D3", null, null, "High", null, "Yes");
            data.Rows.Add("D4", "Rain", "Mild", "High", "Weak", "Yes");
            data.Rows.Add("D5", "Rain", "Cool", null, "Weak", "Yes");
            data.Rows.Add("D6", "Rain", "Cool", "Normal", "Strong", "No");
            data.Rows.Add("D7", "Overcast", "Cool", "Normal", "Strong", "Yes");
            data.Rows.Add("D8", null, "Mild", "High", null, "No");
            data.Rows.Add("D9", null, "Cool", "Normal", "Weak", "Yes");
            data.Rows.Add("D10", null, null, "Normal", null, "Yes");
            data.Rows.Add("D11", null, "Mild", "Normal", null, "Yes");
            data.Rows.Add("D12", "Overcast", "Mild", null, "Strong", "Yes");
            data.Rows.Add("D13", "Overcast", "Hot", null, "Weak", "Yes");
            data.Rows.Add("D14", "Rain", "Mild", "High", "Strong", "No");

            // Create a new codification codebook to convert 
            // the strings above into numeric, integer labels:
            var codebook = new Codification()
            {
                DefaultMissingValueReplacement = Double.NaN
            };

            // Learn the codebook
            codebook.Learn(data);

            // Use the codebook to convert all the data
            DataTable symbols = codebook.Apply(data);

            // Grab the training input and output instances:
            string[] inputNames = new[] { "Outlook", "Temperature", "Humidity", "Wind" };
            double[][] inputs = symbols.ToJagged(inputNames);
            int[] outputs = symbols.ToArray<int>("PlayTennis");

            // Create a new learning algorithm
            var teacher = new C45Learning()
            {
                Attributes = DecisionVariable.FromCodebook(codebook, inputNames)
            };

            // Use the learning algorithm to induce a new tree:
            DecisionTree tree = teacher.Learn(inputs, outputs);

            // To get the estimated class labels, we can use
            int[] predicted = tree.Decide(inputs);

            // The classification error (~0.214) can be computed as 
            double error = new ZeroOneLoss(outputs).Loss(predicted);

            // Moreover, we may decide to convert our tree to a set of rules:
            DecisionSet rules = tree.ToRules();

            // And using the codebook, we can inspect the tree reasoning:
            string ruleText = rules.ToString(codebook, "PlayTennis",
                System.Globalization.CultureInfo.InvariantCulture);

            // The output should be:
            string expected = @"No =: (Outlook == Sunny)
No =: (Outlook == Rain) && (Wind == Strong)
Yes =: (Outlook == Overcast)
Yes =: (Outlook == Rain) && (Wind == Weak)
";

The next example shows how to induce a decision tree for a more complicated example, again using a codebook to manage how input variables should be encoded. It also shows how to obtain a compiled version of the decision tree for deciding the class labels for new samples with maximum performance.

// This example uses the Nursery Database available from the University of
// California Irvine repository of machine learning databases, available at
// 
//   http://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.names
// 
// The description paragraph is listed as follows.
// 
//   Nursery Database was derived from a hierarchical decision model
//   originally developed to rank applications for nursery schools. It
//   was used during several years in 1980's when there was excessive
//   enrollment to these schools in Ljubljana, Slovenia, and the
//   rejected applications frequently needed an objective
//   explanation. The final decision depended on three subproblems:
//   occupation of parents and child's nursery, family structure and
//   financial standing, and social and health picture of the family.
//   The model was developed within expert system shell for decision
//   making DEX (M. Bohanec, V. Rajkovic: Expert system for decision
//   making. Sistemica 1(1), pp. 145-157, 1990.).
// 

// Let's begin by loading the raw data. This string variable contains
// the contents of the nursery.data file as a single, continuous text.
// 
string nurseryData = Resources.nursery;

// Those are the input columns available in the data
// 
string[] inputColumns =
{
    "parents", "has_nurs", "form", "children",
    "housing", "finance", "social", "health"
};

// And this is the output, the last column of the data.
// 
string outputColumn = "output";


// Let's populate a data table with this information.
// 
DataTable table = new DataTable("Nursery");
table.Columns.Add(inputColumns);
table.Columns.Add(outputColumn);

string[] lines = nurseryData.Split(
    new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

foreach (var line in lines)
    table.Rows.Add(line.Split(','));


// Now, we have to convert the textual, categorical data found
// in the table to a more manageable discrete representation.
// 
// For this, we will create a codebook to translate text to
// discrete integer symbols:
// 
Codification codebook = new Codification(table);

// And then convert all data into symbols
// 
DataTable symbols = codebook.Apply(table);
double[][] inputs = symbols.ToArray(inputColumns);
int[] outputs = symbols.ToArray<int>(outputColumn);

// We can either specify the decision attributes we want
// manually, or we can ask the codebook to do it for us:
DecisionVariable[] attributes = DecisionVariable.FromCodebook(codebook, inputColumns);

// Now, let's create the C4.5 algorithm:
C45Learning c45 = new C45Learning(attributes);

// and induce a decision tree from the data:
DecisionTree tree = c45.Learn(inputs, outputs);

// To get the estimated class labels, we can use
int[] predicted = tree.Decide(inputs);

// And the classification error (of 0.0) can be computed as 
double error = new ZeroOneLoss(outputs).Loss(predicted);

// To compute a decision for one of the input points,
//   such as the 25-th example in the set, we can use
// 
int y = tree.Decide(inputs[25]); // should be 1

// Finally, we can also convert our tree to a native
// function, improving efficiency considerably, with
// function, improving efficiency considerably, with
// 
Func<double[], int> func = tree.ToExpression().Compile();

// Again, to compute a new decision, we can just use
// 
int z = func(inputs[25]);
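
As a final hedged sketch for this example, the numeric decision can be translated back to its original text label through the codebook (assuming Revert inverts the codification applied above):

// Translate the numeric decision back into its original text label
// (a hedged sketch; Revert is assumed to invert the codification):
string answer = codebook.Revert(outputColumn, z);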

The next example shows how to estimate the true performance of a decision tree model using cross-validation:

// Ensure we have reproducible results
Accord.Math.Random.Generator.Seed = 0;

// Get some data to be learned. We will be using the Wisconsin
// (Diagnostic) Breast Cancer dataset, where the goal is to determine
// whether the characteristics extracted from a breast cancer exam
// correspond to a malignant or benign type of cancer:
var data = new WisconsinDiagnosticBreastCancer();
double[][] input = data.Features; // 569 samples, 30-dimensional features
int[] output = data.ClassLabels;  // 569 samples, 2 different class labels

// Let's say we want to measure the cross-validation performance of
// a decision tree with a maximum tree height of 5 and where variables
// are able to join the decision path at most 2 times during evaluation:
var cv = CrossValidation.Create(

    k: 10, // We will be using 10-fold cross validation

    learner: (p) => new C45Learning() // here we create the learning algorithm
    {
        Join = 2,
        MaxHeight = 5
    },

    // Now we have to specify how the tree performance should be measured:
    loss: (actual, expected, p) => new ZeroOneLoss(expected).Loss(actual),

    // This function can be used to perform any special
    // operations before the actual learning is done, but
    // here we will just leave it as simple as it can be:
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),

    // Finally, we have to pass the input and output data
    // that will be used in cross-validation. 
    x: input, y: output
);

// After the cross-validation object has been created,
// we can call its .Learn method with the input and 
// output data that will be partitioned into the folds:
var result = cv.Learn(input, output);

// We can grab some information about the problem:
int numberOfSamples = result.NumberOfSamples; // should be 569
int numberOfInputs = result.NumberOfInputs;   // should be 30
int numberOfOutputs = result.NumberOfOutputs; // should be 2

double trainingError = result.Training.Mean; // should be 0.017771153143274855
double validationError = result.Validation.Mean; // should be 0.0755952380952381

// If desired, compute an aggregate confusion matrix for the validation sets:
GeneralConfusionMatrix gcm = result.ToConfusionMatrix(input, output);
double accuracy = gcm.Accuracy; // result should be 0.92442882249560632

The next example shows how to find the best parameters for a decision tree using grid-search cross-validation:

// Ensure results are reproducible
Accord.Math.Random.Generator.Seed = 0;

// This is a sample code showing how to use Grid-Search in combination with 
// Cross-Validation  to assess the performance of Decision Trees with C4.5.

var parkinsons = new Parkinsons();
double[][] input = parkinsons.Features;
int[] output = parkinsons.ClassLabels;

// Create a new Grid-Search with Cross-Validation algorithm. Even though the
// generic, strongly-typed approach used across the framework is most of the
// time easier to handle, combining those two methods in a single call can be
// difficult. For this reason, the framework offers a specialized method for
// combining those two algorithms:
var gscv = GridSearch.CrossValidate(

    // Here we can specify the range of the parameters to be included in the search
    ranges: new
    {
        Join = GridSearch.Range(fromInclusive: 1, toExclusive: 20),
        MaxHeight = GridSearch.Range(fromInclusive: 1, toExclusive: 20),
    },

    // Indicate how learning algorithms for the models should be created
    learner: (p, ss) => new C45Learning
    {
        // Here, we can use the parameters we have specified above:
        Join = p.Join,
        MaxHeight = p.MaxHeight,
    },

    // Define how the model should be learned, if needed
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),

    // Define how the performance of the models should be measured
    loss: (actual, expected, r) => new ZeroOneLoss(expected).Loss(actual),

    folds: 3, // use k = 3 in k-fold cross validation

    x: input, y: output // so the compiler can infer generic types
);

// If needed, control the parallelization degree
gscv.ParallelOptions.MaxDegreeOfParallelism = 1;

// Search for the best decision tree
var result = gscv.Learn(input, output);

// Get the best cross-validation result:
var crossValidation = result.BestModel;

// Get an estimate of its error:
double bestAverageError = result.BestModelError;

double trainError = result.BestModel.Training.Mean;
double trainErrorVar = result.BestModel.Training.Variance;
double valError = result.BestModel.Validation.Mean;
double valErrorVar = result.BestModel.Validation.Variance;

// Get the best values for the parameters:
int bestJoin = result.BestParameters.Join;
int bestHeight = result.BestParameters.MaxHeight;

// Use the best parameter values to create the final 
// model using all the training and validation data:
var bestTeacher = new C45Learning
{
    Join = bestJoin,
    MaxHeight = bestHeight,
};

// Use the best parameters to create the final tree model:
DecisionTree finalTree = bestTeacher.Learn(input, output);
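
Finally, the finished tree classifies new samples exactly as in the earlier examples:

// The final tree can then be used to classify new samples:
int prediction = finalTree.Decide(input[0]);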
See Also