# C45Learning Class

Namespace: Accord.MachineLearning.DecisionTrees.Learning

```csharp
[SerializableAttribute]
public class C45Learning : DecisionTreeLearningBase,
    ISupervisedLearning<DecisionTree, double[], int>
```

The C45Learning type exposes the following members.
## Constructors

Name | Description
---|---
C45Learning() | Creates a new C4.5 learning algorithm.
C45Learning(DecisionTree) | Creates a new C4.5 learning algorithm.
C45Learning(DecisionVariable[]) | Creates a new C4.5 learning algorithm.
## Properties

Name | Description
---|---
Attributes | Gets or sets the collection of attributes to be processed by the induced decision tree. (Inherited from DecisionTreeLearningBase.)
AttributeUsageCount | Gets how many times each attribute has already been used in the current path. In the original C4.5 and ID3 algorithms, attributes could be re-used only once, but in the framework implementation this behaviour can be adjusted by setting the Join property. (Inherited from DecisionTreeLearningBase.)
Join | Gets or sets how many times a single variable can enter the decision process. In the original ID3 algorithm, a variable can join only once per decision path (the path from the root to a leaf). If set to zero, a single variable can participate as many times as needed. Default is 1. (Inherited from DecisionTreeLearningBase.)
MaxHeight | Gets or sets the maximum allowed height when learning a tree. If set to zero, the tree can have arbitrary height. Default is 0. (Inherited from DecisionTreeLearningBase.)
MaxVariables | Gets or sets the maximum number of variables that can enter the tree. Default is 0 (no limit on the number of variables). (Inherited from DecisionTreeLearningBase.)
Model | Gets or sets the decision tree being learned. (Inherited from DecisionTreeLearningBase.)
ParallelOptions | Gets or sets the parallelization options for this algorithm. (Inherited from ParallelLearningBase.)
SplitStep | Gets or sets the step at which the samples will be divided when splitting continuous columns into binary classes. Default is 1.
Token | Gets or sets a cancellation token that can be used to cancel the algorithm while it is running. (Inherited from ParallelLearningBase.)
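As an illustrative sketch (not part of the original reference), the inherited hyper-parameters above can be combined through C#'s object-initializer syntax before calling Learn; the particular values below are arbitrary examples, not recommendations:

```csharp
// Configuration sketch for the properties documented above
// (assumes the Accord.MachineLearning.DecisionTrees.Learning namespace):
var teacher = new C45Learning()
{
    Join = 2,         // each variable may appear at most twice per decision path
    MaxHeight = 5,    // limit the height of the induced tree to 5 levels
    MaxVariables = 0, // 0 means no limit on the number of variables in the tree
    SplitStep = 1     // consider every candidate threshold for continuous columns
};
```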
## Methods

Name | Description
---|---
Add | Adds the specified variable to the list of Attributes. (Inherited from DecisionTreeLearningBase.)
ComputeError | Obsolete. Computes the prediction error for the tree over a given set of inputs and outputs.
Equals | Determines whether the specified object is equal to the current object. (Inherited from Object.)
Finalize | Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection. (Inherited from Object.)
GetEnumerator | Returns an enumerator that iterates through the collection. (Inherited from DecisionTreeLearningBase.)
GetHashCode | Serves as the default hash function. (Inherited from Object.)
GetType | Gets the Type of the current instance. (Inherited from Object.)
Learn(Double[][], Int32[], Double[]) | Learns a model that can map the given inputs to the given outputs.
Learn(Int32[][], Int32[], Double[]) | Learns a model that can map the given inputs to the given outputs.
Learn(Nullable&lt;Int32&gt;[][], Int32[], Double[]) | Learns a model that can map the given inputs to the given outputs.
MemberwiseClone | Creates a shallow copy of the current Object. (Inherited from Object.)
Run | Obsolete. Runs the learning algorithm, creating a decision tree modeling the given inputs and outputs.
ToString | Returns a string that represents the current object. (Inherited from Object.)
## Extension Methods

Name | Description
---|---
HasMethod | Checks whether an object implements a method with the given name. (Defined by ExtensionMethods.)
IsEqual | Compares two objects for equality, performing an elementwise comparison if the elements are vectors or matrices. (Defined by Matrix.)
To(Type) | Overloaded. Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime. (Defined by ExtensionMethods.)
To&lt;T&gt;() | Overloaded. Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime. (Defined by ExtensionMethods.)
## Examples

This example shows the simplest way to induce a decision tree with continuous variables.
```csharp
// In this example, we will process the famous Fisher's Iris dataset in
// which the task is to classify whether the features of an Iris flower
// belong to an Iris setosa, an Iris versicolor, or an Iris virginica:
//
//  - https://en.wikipedia.org/wiki/Iris_flower_data_set
//
// First, let's load the dataset into an array of text that we can process
string[][] text = Resources.iris_data.Split(new[] { "\r\n" },
    StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(','));

// The first four columns contain the flower features
double[][] inputs = text.GetColumns(0, 1, 2, 3).To<double[][]>();

// The last column contains the expected flower type
string[] labels = text.GetColumn(4);

// Since the labels are represented as text, the first step is to convert
// those text labels into integer class labels, so we can process them
// more easily. For this, we will create a codebook to encode class labels:
var codebook = new Codification("Output", labels);

// With the codebook, we can convert the labels:
int[] outputs = codebook.Translate("Output", labels);

// And we can use C4.5 for learning:
C45Learning teacher = new C45Learning();

// Finally, induce the tree from the data:
var tree = teacher.Learn(inputs, outputs);

// To get the estimated class labels, we can use
int[] predicted = tree.Decide(inputs);

// The classification error (0.0266) can be computed as
double error = new ZeroOneLoss(outputs).Loss(predicted);

// Moreover, we may decide to convert our tree to a set of rules:
DecisionSet rules = tree.ToRules();

// And using the codebook, we can inspect the tree reasoning:
string ruleText = rules.ToString(codebook, "Output",
    System.Globalization.CultureInfo.InvariantCulture);

// The output is:
string expected = @"Iris-setosa =: (2 <= 2.45)
Iris-versicolor =: (2 > 2.45) && (3 <= 1.75) && (0 <= 7.05) && (1 <= 2.85)
Iris-versicolor =: (2 > 2.45) && (3 <= 1.75) && (0 <= 7.05) && (1 > 2.85)
Iris-versicolor =: (2 > 2.45) && (3 > 1.75) && (0 <= 5.95) && (1 > 3.05)
Iris-virginica =: (2 > 2.45) && (3 <= 1.75) && (0 > 7.05)
Iris-virginica =: (2 > 2.45) && (3 > 1.75) && (0 > 5.95)
Iris-virginica =: (2 > 2.45) && (3 > 1.75) && (0 <= 5.95) && (1 <= 3.05)
";
```
This is the same example as above, but the decision variables are specified manually.
```csharp
// In this example, we will process the famous Fisher's Iris dataset in
// which the task is to classify whether the features of an Iris flower
// belong to an Iris setosa, an Iris versicolor, or an Iris virginica:
//
//  - https://en.wikipedia.org/wiki/Iris_flower_data_set
//
// First, let's load the dataset into an array of text that we can process
string[][] text = Resources.iris_data.Split(new[] { "\r\n" },
    StringSplitOptions.RemoveEmptyEntries).Apply(x => x.Split(','));

// The first four columns contain the flower features
double[][] inputs = text.GetColumns(0, 1, 2, 3).To<double[][]>();

// The last column contains the expected flower type
string[] labels = text.GetColumn(4);

// Since the labels are represented as text, the first step is to convert
// those text labels into integer class labels, so we can process them
// more easily. For this, we will create a codebook to encode class labels:
var codebook = new Codification("Output", labels);

// With the codebook, we can convert the labels:
int[] outputs = codebook.Translate("Output", labels);

// Create a teaching algorithm:
var teacher = new C45Learning()
{
    new DecisionVariable("sepal length", DecisionVariableKind.Continuous),
    new DecisionVariable("sepal width", DecisionVariableKind.Continuous),
    new DecisionVariable("petal length", DecisionVariableKind.Continuous),
    new DecisionVariable("petal width", DecisionVariableKind.Continuous),
};

// Use the learning algorithm to induce a new tree:
DecisionTree tree = teacher.Learn(inputs, outputs);

// To get the estimated class labels, we can use
int[] predicted = tree.Decide(inputs);

// The classification error (0.0266) can be computed as
double error = new ZeroOneLoss(outputs).Loss(predicted);

// Moreover, we may decide to convert our tree to a set of rules:
DecisionSet rules = tree.ToRules();

// And using the codebook, we can inspect the tree reasoning:
string ruleText = rules.ToString(codebook, "Output",
    System.Globalization.CultureInfo.InvariantCulture);

// The output is:
string expected = @"Iris-setosa =: (petal length <= 2.45)
Iris-versicolor =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length <= 7.05) && (sepal width <= 2.85)
Iris-versicolor =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length <= 7.05) && (sepal width > 2.85)
Iris-versicolor =: (petal length > 2.45) && (petal width > 1.75) && (sepal length <= 5.95) && (sepal width > 3.05)
Iris-virginica =: (petal length > 2.45) && (petal width <= 1.75) && (sepal length > 7.05)
Iris-virginica =: (petal length > 2.45) && (petal width > 1.75) && (sepal length > 5.95)
Iris-virginica =: (petal length > 2.45) && (petal width > 1.75) && (sepal length <= 5.95) && (sepal width <= 3.05)
";
```
This example shows how to handle missing values in the training data.
```csharp
// In this example, we will be using a modified version of the famous Play Tennis
// example by Tom Mitchell (1998), where some values have been replaced by missing
// values. We will use NaN double values to represent values missing from the data.
//
// Note: this example uses DataTables to represent the input data,
// but this is not required. The same could be performed using plain
// double[][] matrices and vectors instead.
DataTable data = new DataTable("Tennis Example with Missing Values");

data.Columns.Add("Day", typeof(string));
data.Columns.Add("Outlook", typeof(string));
data.Columns.Add("Temperature", typeof(string));
data.Columns.Add("Humidity", typeof(string));
data.Columns.Add("Wind", typeof(string));
data.Columns.Add("PlayTennis", typeof(string));

data.Rows.Add("D1", "Sunny", "Hot", "High", "Weak", "No");
data.Rows.Add("D2", null, "Hot", "High", "Strong", "No");
data.Rows.Add("D3", null, null, "High", null, "Yes");
data.Rows.Add("D4", "Rain", "Mild", "High", "Weak", "Yes");
data.Rows.Add("D5", "Rain", "Cool", null, "Weak", "Yes");
data.Rows.Add("D6", "Rain", "Cool", "Normal", "Strong", "No");
data.Rows.Add("D7", "Overcast", "Cool", "Normal", "Strong", "Yes");
data.Rows.Add("D8", null, "Mild", "High", null, "No");
data.Rows.Add("D9", null, "Cool", "Normal", "Weak", "Yes");
data.Rows.Add("D10", null, null, "Normal", null, "Yes");
data.Rows.Add("D11", null, "Mild", "Normal", null, "Yes");
data.Rows.Add("D12", "Overcast", "Mild", null, "Strong", "Yes");
data.Rows.Add("D13", "Overcast", "Hot", null, "Weak", "Yes");
data.Rows.Add("D14", "Rain", "Mild", "High", "Strong", "No");

// Create a new codification codebook to convert
// the strings above into numeric, integer labels:
var codebook = new Codification()
{
    DefaultMissingValueReplacement = Double.NaN
};

// Learn the codebook
codebook.Learn(data);

// Use the codebook to convert all the data
DataTable symbols = codebook.Apply(data);

// Grab the training input and output instances:
string[] inputNames = new[] { "Outlook", "Temperature", "Humidity", "Wind" };
double[][] inputs = symbols.ToJagged(inputNames);
int[] outputs = symbols.ToArray<int>("PlayTennis");

// Create a new learning algorithm
var teacher = new C45Learning()
{
    Attributes = DecisionVariable.FromCodebook(codebook, inputNames)
};

// Use the learning algorithm to induce a new tree:
DecisionTree tree = teacher.Learn(inputs, outputs);

// To get the estimated class labels, we can use
int[] predicted = tree.Decide(inputs);

// The classification error (~0.214) can be computed as
double error = new ZeroOneLoss(outputs).Loss(predicted);

// Moreover, we may decide to convert our tree to a set of rules:
DecisionSet rules = tree.ToRules();

// And using the codebook, we can inspect the tree reasoning:
string ruleText = rules.ToString(codebook, "PlayTennis",
    System.Globalization.CultureInfo.InvariantCulture);

// The output should be:
string expected = @"No =: (Outlook == Sunny)
No =: (Outlook == Rain) && (Wind == Strong)
Yes =: (Outlook == Overcast)
Yes =: (Outlook == Rain) && (Wind == Weak)
";
```
The next example shows how to induce a decision tree for a more complicated problem, again using a codebook to manage how the input variables should be encoded. It also shows how to obtain a compiled version of the decision tree that can classify new samples with maximum performance.
```csharp
// This example uses the Nursery Database available from the University of
// California Irvine repository of machine learning databases, available at
//
//   http://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.names
//
// The description paragraph is listed as follows.
//
//   Nursery Database was derived from a hierarchical decision model
//   originally developed to rank applications for nursery schools. It
//   was used during several years in 1980's when there was excessive
//   enrollment to these schools in Ljubljana, Slovenia, and the
//   rejected applications frequently needed an objective
//   explanation. The final decision depended on three subproblems:
//   occupation of parents and child's nursery, family structure and
//   financial standing, and social and health picture of the family.
//   The model was developed within expert system shell for decision
//   making DEX (M. Bohanec, V. Rajkovic: Expert system for decision
//   making. Sistemica 1(1), pp. 145-157, 1990.).

// Let's begin by loading the raw data. This string variable contains
// the contents of the nursery.data file as a single, continuous text.
string nurseryData = Resources.nursery;

// Those are the input columns available in the data
string[] inputColumns =
{
    "parents", "has_nurs", "form", "children",
    "housing", "finance", "social", "health"
};

// And this is the output, the last column of the data.
string outputColumn = "output";

// Let's populate a data table with this information.
DataTable table = new DataTable("Nursery");
table.Columns.Add(inputColumns);
table.Columns.Add(outputColumn);

string[] lines = nurseryData.Split(
    new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

foreach (var line in lines)
    table.Rows.Add(line.Split(','));

// Now, we have to convert the textual, categorical data found
// in the table to a more manageable discrete representation.
//
// For this, we will create a codebook to translate text to
// discrete integer symbols:
Codification codebook = new Codification(table);

// And then convert all data into symbols
DataTable symbols = codebook.Apply(table);
double[][] inputs = symbols.ToArray(inputColumns);
int[] outputs = symbols.ToArray<int>(outputColumn);

// We can either specify the decision attributes we want
// manually, or we can ask the codebook to do it for us:
DecisionVariable[] attributes = DecisionVariable.FromCodebook(codebook, inputColumns);

// Now, let's create the C4.5 algorithm:
C45Learning c45 = new C45Learning(attributes);

// and induce a decision tree from the data:
DecisionTree tree = c45.Learn(inputs, outputs);

// To get the estimated class labels, we can use
int[] predicted = tree.Decide(inputs);

// And the classification error (of 0.0) can be computed as
double error = new ZeroOneLoss(outputs).Loss(tree.Decide(inputs));

// To compute a decision for one of the input points,
// such as the 25-th example in the set, we can use
int y = tree.Decide(inputs[25]); // should be 1

// Finally, we can also convert our tree to a native
// function, improving efficiency considerably, with
Func<double[], int> func = tree.ToExpression().Compile();

// Again, to compute a new decision, we can just use
int z = func(inputs[25]);
```
The next example shows how to estimate the true performance of a decision tree model using cross-validation:
```csharp
// Ensure we have reproducible results
Accord.Math.Random.Generator.Seed = 0;

// Get some data to be learned. We will be using the Wisconsin
// (Diagnostic) Breast Cancer dataset, where the goal is to determine
// whether the characteristics extracted from a breast cancer exam
// correspond to a malignant or benign type of cancer:
var data = new WisconsinDiagnosticBreastCancer();
double[][] input = data.Features; // 569 samples, 30-dimensional features
int[] output = data.ClassLabels;  // 569 samples, 2 different class labels

// Let's say we want to measure the cross-validation performance of
// a decision tree with a maximum tree height of 5 and where variables
// are able to join the decision path at most 2 times during evaluation:
var cv = CrossValidation.Create(

    k: 10, // We will be using 10-fold cross validation

    // Here we create the learning algorithm
    learner: (p) => new C45Learning()
    {
        Join = 2,
        MaxHeight = 5
    },

    // Now we have to specify how the tree performance should be measured:
    loss: (actual, expected, p) => new ZeroOneLoss(expected).Loss(actual),

    // This function can be used to perform any special
    // operations before the actual learning is done, but
    // here we will just leave it as simple as it can be:
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),

    // Finally, we have to pass the input and output data
    // that will be used in cross-validation.
    x: input, y: output
);

// After the cross-validation object has been created,
// we can call its .Learn method with the input and
// output data that will be partitioned into the folds:
var result = cv.Learn(input, output);

// We can grab some information about the problem:
int numberOfSamples = result.NumberOfSamples; // should be 569
int numberOfInputs = result.NumberOfInputs;   // should be 30
int numberOfOutputs = result.NumberOfOutputs; // should be 2

double trainingError = result.Training.Mean;     // should be 0.017771153143274855
double validationError = result.Validation.Mean; // should be 0.0755952380952381

// If desired, compute an aggregate confusion matrix for the validation sets:
GeneralConfusionMatrix gcm = result.ToConfusionMatrix(input, output);
double accuracy = gcm.Accuracy; // result should be 0.92442882249560632
```
The next example shows how to find the best parameters for a decision tree using grid-search cross-validation:
```csharp
// Ensure results are reproducible
Accord.Math.Random.Generator.Seed = 0;

// This sample code shows how to use Grid-Search in combination with
// Cross-Validation to assess the performance of Decision Trees with C4.5.
var parkinsons = new Parkinsons();
double[][] input = parkinsons.Features;
int[] output = parkinsons.ClassLabels;

// Create a new Grid-Search with Cross-Validation algorithm. Even though the
// generic, strongly-typed approach used across the framework is most of the
// time easier to handle, combining both methods in a single call can be
// difficult. For this reason, the framework offers a specialized method for
// combining those two algorithms:
var gscv = GridSearch.CrossValidate(

    // Here we can specify the range of the parameters to be included in the search
    ranges: new
    {
        Join = GridSearch.Range(fromInclusive: 1, toExclusive: 20),
        MaxHeight = GridSearch.Range(fromInclusive: 1, toExclusive: 20),
    },

    // Indicate how learning algorithms for the models should be created
    learner: (p, ss) => new C45Learning
    {
        // Here, we can use the parameters we have specified above:
        Join = p.Join,
        MaxHeight = p.MaxHeight,
    },

    // Define how the model should be learned, if needed
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),

    // Define how the performance of the models should be measured
    loss: (actual, expected, r) => new ZeroOneLoss(expected).Loss(actual),

    folds: 3, // use k = 3 in k-fold cross validation

    x: input, y: output // so the compiler can infer generic types
);

// If needed, control the parallelization degree
gscv.ParallelOptions.MaxDegreeOfParallelism = 1;

// Search for the best decision tree
var result = gscv.Learn(input, output);

// Get the best cross-validation result:
var crossValidation = result.BestModel;

// Get an estimate of its error:
double bestAverageError = result.BestModelError;

double trainError = result.BestModel.Training.Mean;
double trainErrorVar = result.BestModel.Training.Variance;
double valError = result.BestModel.Validation.Mean;
double valErrorVar = result.BestModel.Validation.Variance;

// Get the best values for the parameters:
int bestJoin = result.BestParameters.Join;
int bestHeight = result.BestParameters.MaxHeight;

// Use the best parameter values to create the final
// model using all the training and validation data:
var bestTeacher = new C45Learning
{
    Join = bestJoin,
    MaxHeight = bestHeight,
};

// Use the best parameters to create the final tree model:
DecisionTree finalTree = bestTeacher.Learn(input, output);
```