BagOfWords Class

Bag of words.
Inheritance Hierarchy
System.Object
  Accord.MachineLearning.ParallelLearningBase
    Accord.MachineLearning.BagOfWords

Namespace:  Accord.MachineLearning
Assembly:  Accord.MachineLearning (in Accord.MachineLearning.dll) Version: 3.8.0
Syntax
[SerializableAttribute]
public class BagOfWords : ParallelLearningBase, IBagOfWords<string[]>, 
	ITransform<string[], double[]>, ICovariantTransform<string[], double[]>, 
	ITransform, ITransform<string[], int[]>, ICovariantTransform<string[], int[]>, 
	ITransform<string[], Sparse<double>>, ICovariantTransform<string[], Sparse<double>>, 
	IUnsupervisedLearning<BagOfWords, string[], int[]>

The BagOfWords type exposes the following members.

Constructors
Properties
  Name    Description
Public property CodeToString
Gets the reverse dictionary which translates integer labels into string tokens.
Public property MaximumOccurance
Gets or sets the maximum number of occurrences of a word which should be registered in the feature vector. Default is 1 (if a word occurs, the corresponding feature is set to 1).
Public property NumberOfInputs
Gets the number of inputs accepted by the model.
Public property NumberOfOutputs
Gets the number of outputs generated by the model.
Public property NumberOfWords
Gets the number of words in this codebook.
Public property ParallelOptions
Gets or sets the parallelization options for this algorithm.
(Inherited from ParallelLearningBase.)
Public property StringToCode
Gets the forward dictionary which translates string tokens to integer labels.
Public property Token
Gets or sets a cancellation token that can be used to cancel the algorithm while it is running.
(Inherited from ParallelLearningBase.)
Methods
  Name    Description
Public method Compute (Obsolete.)
Computes the Bag of Words model.
Public method Equals
Determines whether the specified object is equal to the current object.
(Inherited from Object.)
Protected method Finalize
Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection.
(Inherited from Object.)
Public method, static member GetDefaultClusteringAlgorithm
Creates the default clustering algorithm for Bag-of-Words models (KMeans).
Public method GetFeatureVector (Obsolete.)
Gets the codeword representation of a given text.
Public method GetHashCode
Serves as the default hash function.
(Inherited from Object.)
Public method GetType
Gets the Type of the current instance.
(Inherited from Object.)
Public method Learn(String[], Double[])
Learns a model that can map the given inputs to the desired outputs.
Public method Learn(String[][], Double[])
Learns a model that can map the given inputs to the desired outputs.
Protected method MemberwiseClone
Creates a shallow copy of the current Object.
(Inherited from Object.)
Public method ToString
Returns a string that represents the current object.
(Inherited from Object.)
Public method Transform(String[])
Applies the transformation to an input, producing an associated output.
Public method Transform(String[][])
Applies the transformation to a set of input vectors, producing an associated set of output vectors.
Public method Transform(String[], Sparse<Double>)
Applies the transformation to an input, producing an associated output.
Public method Transform(String[], Double[])
Applies the transformation to an input, producing an associated output.
Public method Transform(String[], Int32[])
Applies the transformation to an input, producing an associated output.
Public method Transform(String[][], Sparse<Double>[])
Applies the transformation to a set of input vectors, producing an associated set of output vectors.
Public method Transform(String[][], Double[][])
Applies the transformation to a set of input vectors, producing an associated set of output vectors.
Public method Transform(String[][], Int32[][])
Applies the transformation to a set of input vectors, producing an associated set of output vectors.
Extension Methods
  Name    Description
Public extension method HasMethod
Checks whether an object implements a method with the given name.
(Defined by ExtensionMethods.)
Public extension method IsEqual
Compares two objects for equality, performing an elementwise comparison if the elements are vectors or matrices.
(Defined by Matrix.)
Public extension method To(Type) (Overloaded.)
Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime.
(Defined by ExtensionMethods.)
Public extension method To<T> (Overloaded.)
Converts an object into another type, irrespective of whether the conversion can be done at compile time or not. This can be used to convert generic types to numeric types during runtime.
(Defined by ExtensionMethods.)
Remarks
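The bag-of-words (BoW) model extracts fixed-length feature vectors from sequences of varying length, such as texts with different numbers of words. Each position in the vector corresponds to one word of the learned codebook, and its value records how often that word occurs in the input, up to the limit set by MaximumOccurance.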
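
To make the idea concrete, the following is a minimal sketch of a bag-of-words transform written in plain C#, without any Accord.NET types. It is an illustration only and is not the library's implementation: the class name BowSketch, its two methods, and the maximumOccurrence parameter are hypothetical names chosen for this sketch. It assumes that the codebook assigns indices in order of first appearance and that tokens missing from the codebook are ignored.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; NOT the Accord.NET implementation.
// All names here (BowSketch, BuildCodebook, Transform) are hypothetical.
public static class BowSketch
{
    // Assigns each distinct token an integer index, in order of first appearance.
    public static Dictionary<string, int> BuildCodebook(IEnumerable<string[]> documents)
    {
        var codebook = new Dictionary<string, int>();
        foreach (string[] document in documents)
        {
            foreach (string token in document)
            {
                if (!codebook.ContainsKey(token))
                    codebook.Add(token, codebook.Count);
            }
        }
        return codebook;
    }

    // Counts how often each codebook token occurs in the document,
    // capping every count at maximumOccurrence (1 yields a 0/1 vector).
    public static double[] Transform(Dictionary<string, int> codebook,
        string[] document, int maximumOccurrence = 1)
    {
        var features = new double[codebook.Count];
        foreach (string token in document)
        {
            int index;
            // Tokens that are not in the codebook are skipped (an assumption of this sketch).
            if (codebook.TryGetValue(token, out index) && features[index] < maximumOccurrence)
                features[index]++;
        }
        return features;
    }
}
```

Whatever the length of the input, the output always has one entry per codebook token, which is what allows the vectors to be passed to a classifier that expects fixed-size inputs.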
Examples
// The Bag-of-Words model can be used to extract fixed-length feature
// vectors from sequences of arbitrary length, such as texts:


string[] texts =
{
    @"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas molestie malesuada 
      nisi et placerat. Curabitur blandit porttitor suscipit. Nunc facilisis ultrices felis,
      vitae luctus arcu semper in. Fusce ut felis ipsum. Sed faucibus tortor ut felis placerat
      euismod. Vestibulum pharetra velit et dolor ornare quis malesuada leo aliquam. Aenean 
      lobortis, tortor iaculis vestibulum dictum, tellus nisi vestibulum libero, ultricies 
      pretium nisi ante in neque. Integer et massa lectus. Aenean ut sem quam. Mauris at nisl 
      augue, volutpat tempus nisl. Suspendisse luctus convallis metus, vitae pretium risus 
      pretium vitae. Duis tristique euismod aliquam",

    @"Sed consectetur nisl et diam mattis varius. Aliquam ornare tincidunt arcu eget adipiscing. 
      Etiam quis augue lectus, vel sollicitudin lorem. Fusce lacinia, leo non porttitor adipiscing, 
      mauris purus lobortis ipsum, id scelerisque erat neque eget nunc. Suspendisse potenti. Etiam 
      non urna non libero pulvinar consequat ac vitae turpis. Nam urna eros, laoreet id sagittis eu,
      posuere in sapien. Phasellus semper convallis faucibus. Nulla fermentum faucibus tellus in 
      rutrum. Maecenas quis risus augue, eu gravida massa."
};

string[][] words = texts.Tokenize();

// Create a new BoW with options:
var codebook = new BagOfWords()
{
    MaximumOccurance = 1 // the resulting vector will have only 0's and 1's
};

// Learn the codebook (note: this should be done using the training set only)
codebook.Learn(words);


// Now, we can use the learned codebook to extract fixed-length
// representations of the different texts (paragraphs) above:

// Extract a feature vector from the text 1:
double[] bow1 = codebook.Transform(words[0]);

// Extract a feature vector from the text 2:
double[] bow2 = codebook.Transform(words[1]);

// we could also have transformed everything at once, i.e.
// double[][] bow = codebook.Transform(words);


// Now, since we have fixed-length representations (both bow1 and bow2 have
// the same size), we can pass them to any classifier or machine learning
// method. For example, we can pass them to a logistic regression classifier
// to discern between the first and second paragraphs.

// Let's create a logistic regression learner to separate the two paragraphs:
var learner = new IterativeReweightedLeastSquares<LogisticRegression>()
{
    Tolerance = 1e-4,  // Let's set some convergence parameters
    Iterations = 100,  // maximum number of iterations to perform
    Regularization = 0
};

// Now, we use the learning algorithm to learn the distinction between the two:
LogisticRegression reg = learner.Learn(new[] { bow1, bow2 }, new[] { false, true });

// Finally, we can predict using the classifier:
bool c1 = reg.Decide(bow1); // Should be false
bool c2 = reg.Decide(bow2); // Should be true

The following example shows how to use Bag-of-Words to convert other kinds of sequences into fixed-length representations. In particular, we apply Bag-of-Words to data from the PENDIGITS handwritten digit recognition dataset and then classify the resulting representations using a SupportVectorMachine.

// The Bag-Of-Words model can be used to extract finite-length feature 
// vectors from sequences of arbitrary length, like handwritten digits

// Ensure we get reproducible results
Accord.Math.Random.Generator.Seed = 0;

// Download the PENDIGITS dataset from UCI ML repository
var pendigits = new Pendigits(path: localDownloadPath);

// Get and pre-process the training set
double[][][] trainInputs = pendigits.Training.Item1;
int[] trainOutputs = pendigits.Training.Item2;

// Pre-process the digits so each of them is centered and scaled
trainInputs = trainInputs.Apply(Accord.Statistics.Tools.ZScores);

// Create a Bag-of-Words learning algorithm
var bow = new BagOfWords<double[], KMeans>()
{
    Clustering = new KMeans(5),
};

// Use the BoW to create a quantizer
var quantizer = bow.Learn(trainInputs);

// Extract vector representations from the pen sequences
double[][] trainVectors = quantizer.Transform(trainInputs);

// Create a new learning algorithm for support vector machines
var teacher = new MulticlassSupportVectorLearning<ChiSquare, double[]>
{
    Learner = (p) => new SequentialMinimalOptimization<ChiSquare, double[]>()
    {
        Complexity = 1
    }
};

// Use the learning algorithm to create a classifier
var svm = teacher.Learn(trainVectors, trainOutputs);

// Compute predictions for the training set
int[] trainPredicted = svm.Decide(trainVectors);

// Check the performance of the classifier by comparing with the ground-truth:
var m1 = new GeneralConfusionMatrix(predicted: trainPredicted, expected: trainOutputs);
double trainAcc = m1.Accuracy; // should be 0.690


// Prepare the testing set
double[][][] testInputs = pendigits.Testing.Item1;
int[] testOutputs = pendigits.Testing.Item2;

// Apply the same normalizations
testInputs = testInputs.Apply(Accord.Statistics.Tools.ZScores);

double[][] testVectors = quantizer.Transform(testInputs);

// Compute predictions for the test set
int[] testPredicted = svm.Decide(testVectors);

// Check the performance of the classifier by comparing with the ground-truth:
var m2 = new GeneralConfusionMatrix(predicted: testPredicted, expected: testOutputs);
double testAcc = m2.Accuracy; // should be 0.600
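
For sequences of numeric vectors, such as the pen strokes above, the "words" are cluster centroids rather than string tokens. The following plain C# sketch illustrates that quantization step under stated assumptions: the centroids are taken as already computed (for instance by a clustering algorithm such as k-means), distances are squared Euclidean, and the class and method names (QuantizerSketch, Nearest, Histogram) are hypothetical. It is not the Accord.NET implementation.

```csharp
using System;

// Illustrative sketch only; NOT the Accord.NET implementation.
// All names here (QuantizerSketch, Nearest, Histogram) are hypothetical.
public static class QuantizerSketch
{
    // Returns the index of the centroid closest to the point
    // (squared Euclidean distance; ties go to the lowest index).
    public static int Nearest(double[][] centroids, double[] point)
    {
        int best = 0;
        double bestDistance = double.PositiveInfinity;
        for (int i = 0; i < centroids.Length; i++)
        {
            double distance = 0;
            for (int j = 0; j < point.Length; j++)
            {
                double diff = point[j] - centroids[i][j];
                distance += diff * diff;
            }
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    // Maps a variable-length sequence of points to a fixed-length histogram
    // with one bin per centroid.
    public static double[] Histogram(double[][] centroids, double[][] sequence)
    {
        var histogram = new double[centroids.Length];
        foreach (double[] point in sequence)
            histogram[Nearest(centroids, point)]++;
        return histogram;
    }
}
```

With five centroids, as in KMeans(5) above, every sequence maps to a five-element vector regardless of how many points it contains.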
See Also