Machine Learning Overview {#ml_intro}
=========================

[TOC]

Training Data {#ml_intro_data}
=============

In machine learning algorithms there is a notion of training data. Training data includes several
components:

-   A set of training samples. Each training sample is a vector of values (in Computer Vision it is
    sometimes referred to as a feature vector). Usually all the vectors have the same number of
    components (features); the OpenCV ml module assumes that. Each feature can be ordered (i.e. its
    values are floating-point numbers that can be compared with each other and strictly ordered,
    i.e. sorted) or categorical (i.e. its value belongs to a fixed set of values that can be
    integers, strings etc.).
-   An optional set of responses corresponding to the samples. Training data with no responses is
    used in unsupervised learning algorithms that learn the structure of the supplied data based on
    distances between different samples. Training data with responses is used in supervised
    learning algorithms, which learn the function mapping samples to responses. Usually the
    responses are scalar values, ordered (when we deal with a regression problem) or categorical
    (when we deal with a classification problem; in this case the responses are often called
    "labels"). Some algorithms, most noticeably neural networks, can handle not only scalar, but
    also multi-dimensional or vector responses.
-   Another optional component is the mask of missing measurements. Most algorithms require all the
    components in all the training samples to be valid, but some other algorithms, such as decision
    trees, can handle cases of missing measurements.
-   In the case of a classification problem the user may want to give different weights to
    different classes. This is useful, for example, when:
    -   the user wants to shift the prediction accuracy towards a lower false-alarm rate or a
        higher hit-rate;
    -   the user wants to compensate for significantly different amounts of training samples from
        different classes.
-   In addition to that, each training sample may be given a weight, if the user wants the
    algorithm to pay special attention to certain training samples and adjust the training model
    accordingly.
-   Also, the user may wish not to use the whole training data at once, but rather use parts of it,
    e.g. to do parameter optimization via a cross-validation procedure.

As you can see, training data can have a rather complex structure; besides, it may be very big
and/or not entirely available, so there is a need to make an abstraction for this concept. The
OpenCV ml module provides the cv::ml::TrainData class for that.

@sa cv::ml::TrainData

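For illustration, here is a minimal sketch of packing samples and responses into a
cv::ml::TrainData object (the matrix contents are hypothetical placeholders):

@code{.cpp}
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

using namespace cv;
using namespace cv::ml;

int main()
{
    // Four 2-dimensional samples stored as rows (hypothetical values).
    Mat samples = (Mat_<float>(4, 2) << 1.f, 2.f,
                                        3.f, 4.f,
                                        5.f, 6.f,
                                        7.f, 8.f);
    // One categorical response (class label) per sample.
    Mat responses = (Mat_<int>(4, 1) << 0, 0, 1, 1);

    // ROW_SAMPLE tells ml that each training sample occupies one row.
    Ptr<TrainData> data = TrainData::create(samples, ROW_SAMPLE, responses);

    // `data` can now be passed to StatModel::train() of any ml algorithm.
    return 0;
}
@endcode

The later sketches on this page reuse a `data` object prepared this way.
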
Normal Bayes Classifier {#ml_intro_bayes}
=======================

This simple classification model assumes that feature vectors from each class are normally
distributed (though not necessarily independently distributed). So, the whole data distribution
function is assumed to be a Gaussian mixture, one component per class. Using the training data the
algorithm estimates mean vectors and covariance matrices for every class, and then it uses them for
prediction.

@sa cv::ml::NormalBayesClassifier

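A minimal sketch of training and prediction, assuming a `data` object prepared as in the TrainData
example above:

@code{.cpp}
Ptr<NormalBayesClassifier> nb = NormalBayesClassifier::create();
nb->train(data);                                   // estimates per-class means and covariances

Mat query = (Mat_<float>(1, 2) << 2.f, 3.f);       // hypothetical query sample
float label = nb->predict(query);                  // returns the predicted class label
@endcode
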
K-Nearest Neighbors {#ml_intro_knn}
===================

The algorithm caches all training samples and predicts the response for a new sample by analyzing a
certain number (__K__) of the nearest neighbors of the sample using voting, calculating a weighted
sum, and so on. The method is sometimes referred to as "learning by example" because for prediction
it looks for the feature vector with a known response that is closest to the given vector.

@sa cv::ml::KNearest

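A minimal sketch, again assuming the `data` object from the TrainData example:

@code{.cpp}
Ptr<KNearest> knn = KNearest::create();
knn->setDefaultK(3);                               // K used by predict()
knn->train(data);

Mat query = (Mat_<float>(1, 2) << 2.f, 3.f);       // hypothetical query sample
Mat results;
knn->findNearest(query, 3, results);               // vote among the 3 nearest neighbors
@endcode
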
Support Vector Machines {#ml_intro_svm}
=======================

Originally, support vector machines (SVM) were a technique for building an optimal binary (2-class)
classifier. Later the technique was extended to regression and clustering problems. SVM is a
particular case of kernel-based methods. It maps feature vectors into a higher-dimensional space
using a kernel function and builds an optimal linear discriminating function in this space or an
optimal hyper-plane that fits the training data. In the case of SVM, the kernel is not defined
explicitly. Instead, a distance between any 2 points in the hyper-space needs to be defined.

The solution is optimal, which means that the margin between the separating hyper-plane and the
nearest feature vectors from both classes (in the case of a 2-class classifier) is maximal. The
feature vectors that are the closest to the hyper-plane are called _support vectors_, which means
that the position of the other vectors does not affect the hyper-plane (the decision function).

The SVM implementation in OpenCV is based on @cite LibSVM .

@sa cv::ml::SVM

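A minimal sketch of configuring and training an SVM classifier, assuming the `data` object from the
TrainData example (parameter values are hypothetical and problem-dependent):

@code{.cpp}
Ptr<SVM> svm = SVM::create();
svm->setType(SVM::C_SVC);                          // n-class classification with penalty C
svm->setKernel(SVM::RBF);                          // radial basis function kernel
svm->setC(1.0);                                    // outlier penalty (hypothetical value)
svm->setGamma(0.5);                                // RBF kernel parameter (hypothetical value)
svm->setTermCriteria(TermCriteria(TermCriteria::MAX_ITER + TermCriteria::EPS, 1000, 1e-6));
svm->train(data);
@endcode
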
Prediction with SVM {#ml_intro_svm_predict}
-------------------

StatModel::predict(samples, results, flags) should be used. Pass flags=StatModel::RAW_OUTPUT to get
the raw response from SVM (in the case of regression, 1-class or 2-class classification problem).

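A minimal sketch, assuming the `svm` object trained above:

@code{.cpp}
Mat query = (Mat_<float>(1, 2) << 2.f, 3.f);            // hypothetical query sample
Mat labels, rawValues;
svm->predict(query, labels);                            // predicted class labels
svm->predict(query, rawValues, StatModel::RAW_OUTPUT);  // raw decision-function responses
@endcode
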
Decision Trees {#ml_intro_trees}
==============

The ML classes discussed in this section implement Classification and Regression Tree algorithms
described in @cite Breiman84 .

The class cv::ml::DTrees represents a single decision tree or a collection of decision trees. It is
also the base class for RTrees and Boost.

A decision tree is a binary tree (a tree where each non-leaf node has two child nodes). It can be
used either for classification or for regression. For classification, each tree leaf is marked with
a class label; multiple leaves may have the same label. For regression, a constant is assigned to
each tree leaf, so the approximation function is piecewise constant.

@sa cv::ml::DTrees

Predicting with Decision Trees {#ml_intro_trees_predict}
------------------------------

To reach a leaf node and obtain a response for the input feature vector, the prediction procedure
starts at the root node. From each non-leaf node the procedure goes to the left (selects the left
child node as the next observed node) or to the right based on the value of a certain variable whose
index is stored in the observed node. The following variables are possible:

-   __Ordered variables.__ The variable value is compared with a threshold that is also stored in
    the node. If the value is less than the threshold, the procedure goes to the left. Otherwise, it
    goes to the right. For example, if the weight is less than 1 kilogram, the procedure goes to the
    left, else to the right.

-   __Categorical variables.__ A discrete variable value is tested to see whether it belongs to a
    certain subset of values (also stored in the node) from a limited set of values the variable
    could take. If it does, the procedure goes to the left. Otherwise, it goes to the right. For
    example, if the color is green or red, go to the left, else to the right.

So, in each node, a pair of entities (`variable_index`, `decision_rule` (threshold/subset)) is used.
This pair is called a _split_ (split on the variable `variable_index`). Once a leaf node is reached,
the value assigned to this node is used as the output of the prediction procedure.

Sometimes, certain features of the input vector are missing (for example, in the dark it is
difficult to determine the object color), and the prediction procedure may get stuck in a certain
node (in the mentioned example, if the node is split by color). To avoid such situations, decision
trees use so-called _surrogate splits_. That is, in addition to the best "primary" split, every tree
node may also be split on one or more other variables with nearly the same results.

Training Decision Trees {#ml_intro_trees_train}
-----------------------

The tree is built recursively, starting from the root node. All training data (feature vectors and
responses) is used to split the root node. In each node the optimum decision rule (the best
"primary" split) is found based on some criterion. For classification the Gini "purity" criterion is
used, and for regression the sum of squared errors. Then, if necessary, the surrogate splits are
found. They resemble the results of the primary split on the training data. All the data is divided
between the left and the right child nodes using the primary and the surrogate splits (as it is done
in the prediction procedure). Then, the procedure recursively splits both left and right nodes. At
each node the recursive procedure may stop (that is, stop splitting the node further) in one of the
following cases:

-   The depth of the constructed tree branch has reached the specified maximum value.
-   The number of training samples in the node is less than the specified threshold, so it is not
    statistically representative to split the node further.
-   All the samples in the node belong to the same class or, in case of regression, the variation is
    too small.
-   The best split found does not give any noticeable improvement compared to a random choice.

When the tree is built, it may be pruned using a cross-validation procedure, if necessary. That is,
some branches of the tree that may lead to overfitting of the model are cut off. Normally, this
procedure is only applied to standalone decision trees. Tree ensembles usually build trees that are
small enough and use their own protection schemes against overfitting.

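The stopping and pruning criteria above map onto cv::ml::DTrees training parameters. A minimal
sketch, assuming the `data` object from the TrainData example (values are hypothetical):

@code{.cpp}
Ptr<DTrees> dtree = DTrees::create();
dtree->setMaxDepth(10);               // maximum depth of a tree branch
dtree->setMinSampleCount(5);          // do not split nodes with fewer samples
dtree->setRegressionAccuracy(0.01f);  // stop when the regression variation is small enough
dtree->setCVFolds(0);                 // cross-validation pruning folds (0 disables pruning)
dtree->setUseSurrogates(false);       // surrogate splits (for missing data / variable importance)
dtree->train(data);
@endcode
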
Variable Importance {#ml_intro_trees_var}
-------------------

Besides prediction, which is the obvious use of decision trees, a tree can also be used for various
kinds of data analysis. One of the key properties of the constructed decision tree algorithms is the
ability to compute the importance (relative decisive power) of each variable. For example, in a spam
filter that uses a set of words occurring in the message as a feature vector, the variable
importance rating can be used to determine the most "spam-indicating" words and thus help keep the
dictionary size reasonable.

The importance of each variable is computed over all the splits on this variable in the tree,
primary and surrogate ones. Thus, to compute variable importance correctly, the surrogate splits
must be enabled in the training parameters, even if there is no missing data.

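In OpenCV, variable importance can be queried from tree ensembles such as cv::ml::RTrees (see the
Random Trees section below). A minimal sketch, assuming the `data` object from the earlier examples:

@code{.cpp}
Ptr<RTrees> forest = RTrees::create();
forest->setCalculateVarImportance(true);           // must be enabled before training
forest->train(data);
Mat importance = forest->getVarImportance();       // one relative score per feature
@endcode
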
Boosting {#ml_intro_boost}
========

A common machine learning task is supervised learning. In supervised learning, the goal is to learn
the functional relationship \f$F: y = F(x)\f$ between the input \f$x\f$ and the output \f$y\f$ .
Predicting the qualitative output is called _classification_, while predicting the quantitative
output is called _regression_.

Boosting is a powerful learning concept that provides a solution to the supervised classification
learning task. It combines the performance of many "weak" classifiers to produce a powerful
committee @cite HTF01 . A weak classifier is only required to be better than chance, and thus can be
very simple and computationally inexpensive. However, many of them, smartly combined, result in a
strong classifier that often outperforms most "monolithic" strong classifiers such as SVMs and
Neural Networks.

Decision trees are the most popular weak classifiers used in boosting schemes. Often the simplest
decision trees with only a single split node per tree (called _stumps_) are sufficient.

The boosted model is based on \f$N\f$ training examples \f$\{(x_i,y_i)\}_{1}^{N}\f$ with
\f$x_i \in R^K\f$ and \f$y_i \in \{-1, +1\}\f$ . \f$x_i\f$ is a \f$K\f$ -component vector. Each
component encodes a feature relevant to the learning task at hand. The desired two-class output is
encoded as -1 and +1.

Different variants of boosting are known as Discrete AdaBoost, Real AdaBoost, LogitBoost, and Gentle
AdaBoost @cite FHT98 . All of them are very similar in their overall structure. Therefore, this
chapter focuses only on the standard two-class Discrete AdaBoost algorithm, outlined below.
Initially the same weight is assigned to each sample (step 2). Then, a weak classifier \f$f_m(x)\f$
is trained on the weighted training data (step 3a). Its weighted training error and scaling factor
\f$c_m\f$ are computed (step 3b). The weights are increased for training samples that have been
misclassified (step 3c). All weights are then normalized, and the process of finding the next weak
classifier is repeated for the remaining \f$M-1\f$ classifiers. The final classifier \f$F(x)\f$ is
the sign of the weighted sum over the individual weak classifiers (step 4).

__Two-class Discrete AdaBoost Algorithm__

1.  Given \f$N\f$ examples \f$\{(x_i,y_i)\}_{1}^{N}\f$ with \f$x_i \in R^K,\; y_i \in \{-1, +1\}\f$ .

2.  Assign weights \f$w_i = 1/N,\; i = 1,...,N\f$ .

3.  Repeat for \f$m = 1,2,...,M\f$ :

    -   (a) Fit the classifier \f$f_m(x) \in \{-1,1\}\f$, using weights \f$w_i\f$ on the training
        data.

    -   (b) Compute \f$err_m = E_w [1_{(y \neq f_m(x))}],\; c_m = \log((1 - err_m)/err_m)\f$ .

    -   (c) Set \f$w_i \Leftarrow w_i \exp[c_m 1_{(y_i \neq f_m(x_i))}],\; i = 1,2,...,N,\f$ and
        renormalize so that \f$\sum_i w_i = 1\f$ .

4.  Classify new samples _x_ using the formula: \f$\textrm{sign}\left(\sum_{m=1}^{M} c_m f_m(x)\right)\f$ .

@note Similar to the classical boosting methods, the current implementation supports two-class
classifiers only. For M \> 2 classes, there is the __AdaBoost.MH__ algorithm (described in
@cite FHT98) that reduces the problem to the two-class problem, yet with a much larger training set.

To reduce computation time for boosted models without substantially losing accuracy, the influence
trimming technique can be employed. As the training algorithm proceeds and the number of trees in
the ensemble is increased, a larger number of the training samples are classified correctly and with
increasing confidence, so those samples receive smaller weights on the subsequent iterations.
Examples with a very low relative weight have a small impact on the weak classifier training. Thus,
such examples may be excluded during the weak classifier training without having much effect on the
induced classifier. This process is controlled with the weight_trim_rate parameter. Only examples
with the summary fraction weight_trim_rate of the total weight mass are used in the weak classifier
training. Note that the weights for __all__ training examples are recomputed at each training
iteration. Examples deleted at a particular iteration may be used again for learning some of the
weak classifiers further @cite FHT98 .

@sa cv::ml::Boost

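A minimal sketch of configuring a boosted classifier with stumps as weak learners, assuming a
two-class `data` object as in the earlier examples (values are hypothetical):

@code{.cpp}
Ptr<Boost> boost = Boost::create();
boost->setBoostType(Boost::DISCRETE);              // Discrete AdaBoost, as outlined above
boost->setWeakCount(100);                          // number of weak classifiers M
boost->setMaxDepth(1);                             // depth-1 trees, i.e. stumps
boost->setWeightTrimRate(0.95);                    // influence trimming threshold
boost->train(data);
@endcode
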
Prediction with Boost {#ml_intro_boost_predict}
---------------------

StatModel::predict(samples, results, flags) should be used. Pass flags=StatModel::RAW_OUTPUT to get
the raw sum from the Boost classifier.

Random Trees {#ml_intro_rtrees}
============

Random trees were introduced by Leo Breiman and Adele Cutler:
<http://www.stat.berkeley.edu/users/breiman/RandomForests/> . The algorithm can deal with both
classification and regression problems. Random trees is a collection (ensemble) of tree predictors
that is called a _forest_ further in this section (the term was also introduced by L. Breiman). The
classification works as follows: the random trees classifier takes the input feature vector,
classifies it with every tree in the forest, and outputs the class label that received the majority
of "votes". In the case of regression, the classifier response is the average of the responses over
all the trees in the forest.

All the trees are trained with the same parameters but on different training sets. These sets are
generated from the original training set using the bootstrap procedure: for each training set, you
randomly select the same number of vectors as in the original set ( =N ). The vectors are chosen
with replacement. That is, some vectors will occur more than once and some will be absent. At each
node of each trained tree, not all the variables are used to find the best split, but a random
subset of them. A new subset is generated for each node. However, its size is fixed for all the
nodes and all the trees. It is a training parameter set to \f$\sqrt{number\_of\_variables}\f$ by
default. None of the built trees are pruned.

In random trees there is no need for any accuracy estimation procedures, such as cross-validation or
bootstrap, or a separate test set to get an estimate of the training error. The error is estimated
internally during the training. When the training set for the current tree is drawn by sampling with
replacement, some vectors are left out (the so-called _oob (out-of-bag) data_ ). The size of the oob
data is about N/3 . The classification error is estimated using this oob data as follows:

-   For each vector that is oob relative to the i-th tree, get a prediction using that same i-th
    tree.

-   After all the trees have been trained, for each vector that has ever been oob, find the
    class-<em>winner</em> for it (the class that received the majority of votes in the trees where
    the vector was oob) and compare it to the ground-truth response.

-   Compute the classification error estimate as the ratio of the number of misclassified oob
    vectors to all the vectors in the original data. In the case of regression, the oob error is
    computed as the squared difference between the predicted and the ground-truth responses for the
    oob vectors, divided by the total number of vectors.

For a random trees usage example, please see the letter_recog.cpp sample in the OpenCV distribution.

@sa cv::ml::RTrees

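A minimal sketch of training a random forest, assuming the `data` object from the earlier examples
(parameter values are hypothetical):

@code{.cpp}
Ptr<RTrees> rtrees = RTrees::create();
rtrees->setMaxDepth(10);                           // depth limit for each tree
rtrees->setActiveVarCount(0);                      // 0 = use sqrt(number_of_variables) per split
rtrees->setTermCriteria(TermCriteria(TermCriteria::MAX_ITER + TermCriteria::EPS, 100, 0.01));
rtrees->train(data);

Mat query = (Mat_<float>(1, 2) << 2.f, 3.f);       // hypothetical query sample
float response = rtrees->predict(query);           // majority vote (or average for regression)
@endcode
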
__References:__

-   _Machine Learning_, Wald I, July 2002.
    <http://stat-www.berkeley.edu/users/breiman/wald2002-1.pdf>
-   _Looking Inside the Black Box_, Wald II, July 2002.
    <http://stat-www.berkeley.edu/users/breiman/wald2002-2.pdf>
-   _Software for the Masses_, Wald III, July 2002.
    <http://stat-www.berkeley.edu/users/breiman/wald2002-3.pdf>
-   And other articles from the web site
    <http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm>

Expectation Maximization {#ml_intro_em}
========================

The Expectation Maximization (EM) algorithm estimates the parameters of a multivariate probability
density function in the form of a Gaussian mixture distribution with a specified number of mixtures.

Consider a set of \f$N\f$ feature vectors { \f$x_1, x_2,...,x_{N}\f$ } from a d-dimensional
Euclidean space drawn from a Gaussian mixture:

\f[p(x;a_k,S_k,\pi_k) = \sum_{k=1}^{m} \pi_k p_k(x), \quad \pi_k \geq 0, \quad \sum_{k=1}^{m} \pi_k = 1,\f]

\f[p_k(x) = \varphi(x;a_k,S_k) = \frac{1}{(2\pi)^{d/2}\,|S_k|^{1/2}} \exp\left\{-\frac{1}{2}(x-a_k)^T S_k^{-1}(x-a_k)\right\},\f]

where \f$m\f$ is the number of mixtures, \f$p_k\f$ is the normal distribution density with the mean
\f$a_k\f$ and the covariance matrix \f$S_k\f$, and \f$\pi_k\f$ is the weight of the k-th mixture.
Given the number of mixtures \f$m\f$ and the samples \f$x_i\f$, \f$i=1..N\f$, the algorithm finds
the maximum-likelihood estimates (MLE) of all the mixture parameters, that is, \f$a_k\f$, \f$S_k\f$
and \f$\pi_k\f$ :

\f[L(x,\theta) = \log p(x,\theta) = \sum_{i=1}^{N}\log\left(\sum_{k=1}^{m} \pi_k p_k(x_i)\right) \to \max_{\theta \in \Theta},\f]

\f[\Theta = \left\{(a_k,S_k,\pi_k): a_k \in \mathbbm{R}^d,\; S_k = S_k^T > 0,\; S_k \in \mathbbm{R}^{d \times d},\; \pi_k \geq 0,\; \sum_{k=1}^{m} \pi_k = 1\right\}.\f]

The EM algorithm is an iterative procedure. Each iteration includes two steps. At the first step
(Expectation step or E-step), you find the probability \f$p_{i,k}\f$ (denoted \f$\alpha_{ki}\f$ in
the formula below) that sample i belongs to mixture k using the currently available mixture
parameter estimates:

\f[\alpha_{ki} = \frac{\pi_k\varphi(x_i;a_k,S_k)}{\sum\limits_{j=1}^{m}\pi_j\varphi(x_i;a_j,S_j)} .\f]

At the second step (Maximization step or M-step), the mixture parameter estimates are refined using
the computed probabilities:

\f[\pi_k = \frac{1}{N} \sum_{i=1}^{N} \alpha_{ki}, \quad a_k = \frac{\sum\limits_{i=1}^{N}\alpha_{ki}x_i}{\sum\limits_{i=1}^{N}\alpha_{ki}}, \quad S_k = \frac{\sum\limits_{i=1}^{N}\alpha_{ki}(x_i-a_k)(x_i-a_k)^T}{\sum\limits_{i=1}^{N}\alpha_{ki}}\f]

Alternatively, the algorithm may start with the M-step when the initial values for \f$p_{i,k}\f$ can
be provided. Another alternative, when \f$p_{i,k}\f$ are unknown, is to use a simpler clustering
algorithm to pre-cluster the input samples and thus obtain initial \f$p_{i,k}\f$ . Often (including
in machine learning) the k-means algorithm is used for that purpose.

One of the main problems of the EM algorithm is the large number of parameters to estimate. The
majority of the parameters reside in covariance matrices, which are \f$d \times d\f$ elements each,
where \f$d\f$ is the feature space dimensionality. However, in many practical problems, the
covariance matrices are close to diagonal or even to \f$\mu_k I\f$ , where \f$I\f$ is an identity
matrix and \f$\mu_k\f$ is a mixture-dependent "scale" parameter. So, a robust computation scheme
could start with harder constraints on the covariance matrices and then use the estimated parameters
as an input for a less constrained optimization problem (often a diagonal covariance matrix is
already a good enough approximation).

@sa cv::ml::EM

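A minimal sketch of fitting a Gaussian mixture with EM, assuming `samples` is a floating-point
matrix with one d-dimensional sample per row (the parameter values are hypothetical):

@code{.cpp}
Mat labels, probs;
Ptr<EM> em = EM::create();
em->setClustersNumber(3);                          // number of mixtures m
em->setCovarianceMatrixType(EM::COV_MAT_DIAGONAL); // constrain S_k to be diagonal
em->trainEM(samples, noArray(), labels, probs);    // labels: most probable mixture per sample,
                                                   // probs: posterior probabilities alpha_ki
@endcode
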
References:
-   [Bilmes98] J. A. Bilmes. _A Gentle Tutorial of the EM Algorithm and its Application to Parameter
    Estimation for Gaussian Mixture and Hidden Markov Models_. Technical Report TR-97-021,
    International Computer Science Institute and Computer Science Division, University of California
    at Berkeley, April 1998.

Neural Networks {#ml_intro_ann}
===============

ML implements feed-forward artificial neural networks or, more particularly, multi-layer perceptrons
(MLP), the most commonly used type of neural networks. An MLP consists of an input layer, an output
layer, and one or more hidden layers. Each layer of an MLP includes one or more neurons
directionally linked with the neurons from the previous and the next layer. The example below
represents a 3-layer perceptron with three inputs, two outputs, and a hidden layer including five
neurons:

![image](pics/mlp.png)

All the neurons in an MLP are similar. Each of them has several input links (it takes the output
values from several neurons in the previous layer as input) and several output links (it passes the
response to several neurons in the next layer). The values retrieved from the previous layer are
summed up with certain weights, individual for each neuron, plus the bias term. The sum is
transformed using the activation function \f$f\f$ , which may also be different for different
neurons.

![image](pics/neuron_model.png)

In other words, given the outputs \f$x_j\f$ of the layer \f$n\f$ , the outputs \f$y_i\f$ of the
layer \f$n+1\f$ are computed as:

\f[u_i = \sum_j (w^{n+1}_{i,j} x_j) + w^{n+1}_{i,bias}\f]

\f[y_i = f(u_i)\f]

Different activation functions may be used. ML implements three standard functions:

-   Identity function ( cv::ml::ANN_MLP::IDENTITY ): \f$f(x)=x\f$

-   Symmetrical sigmoid ( cv::ml::ANN_MLP::SIGMOID_SYM ): \f$f(x)=\beta(1-e^{-\alpha x})/(1+e^{-\alpha x})\f$ ,
    which is the default choice for MLP. The standard sigmoid with \f$\beta =1, \alpha =1\f$ is
    shown below:

    ![image](pics/sigmoid_bipolar.png)

-   Gaussian function ( cv::ml::ANN_MLP::GAUSSIAN ): \f$f(x)=\beta e^{-\alpha x^2}\f$ , which is not
    completely supported at the moment.

In ML, all the neurons have the same activation functions, with the same free parameters
( \f$\alpha, \beta\f$ ) that are specified by the user and are not altered by the training
algorithms.

So, the whole trained network works as follows:

1.  Take the feature vector as input. The vector size is equal to the size of the input layer.
2.  Pass the values as input to the first hidden layer.
3.  Compute the outputs of the hidden layer using the weights and the activation functions.
4.  Pass the outputs further downstream until you compute the output layer.

So, to compute the network, you need to know all the weights \f$w^{n+1}_{i,j}\f$ . The weights are
computed by the training algorithm. The algorithm takes a training set, multiple input vectors with
the corresponding output vectors, and iteratively adjusts the weights to enable the network to give
the desired response to the provided input vectors.

The larger the network size (the number of hidden layers and their sizes) is, the greater the
potential network flexibility. The error on the training set could be made arbitrarily small. But at
the same time the learned network also "learns" the noise present in the training set, so the error
on the test set usually starts increasing after the network size reaches a limit. Besides, larger
networks are trained much longer than smaller ones, so it is reasonable to pre-process the data,
using cv::PCA or a similar technique, and train a smaller network on only the essential features.

Another MLP feature is an inability to handle categorical data as is. However, there is a
workaround. If a certain feature in the input or output (in the case of an n-class classifier for
\f$n>2\f$ ) layer is categorical and can take \f$M>2\f$ different values, it makes sense to
represent it as a binary tuple of M elements, where the i-th element is 1 if and only if the feature
is equal to the i-th value out of the M possible. It increases the size of the input/output layer
but speeds up the training algorithm convergence and at the same time enables "fuzzy" values of such
variables, that is, a tuple of probabilities instead of a fixed value.

ML implements two algorithms for training MLPs. The first algorithm is a classical random sequential
back-propagation algorithm. The second (default) one is a batch RPROP algorithm.

@sa cv::ml::ANN_MLP

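A minimal sketch of setting up and training an MLP with one hidden layer. It assumes a
Ptr<TrainData> object `annData` whose samples and (possibly vector) responses are both stored as
floating-point matrices, as required by ANN_MLP; the layer sizes and parameter values are
hypothetical:

@code{.cpp}
Ptr<ANN_MLP> mlp = ANN_MLP::create();
Mat layerSizes = (Mat_<int>(1, 3) << 2, 5, 1);     // 2 inputs, 5 hidden neurons, 1 output
mlp->setLayerSizes(layerSizes);
mlp->setActivationFunction(ANN_MLP::SIGMOID_SYM, 1.0, 1.0);  // symmetrical sigmoid, alpha = beta = 1
mlp->setTrainMethod(ANN_MLP::RPROP);               // batch RPROP (the default method)
mlp->setTermCriteria(TermCriteria(TermCriteria::MAX_ITER + TermCriteria::EPS, 300, 1e-4));
mlp->train(annData);
@endcode
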
Logistic Regression {#ml_intro_lr}
===================

ML implements logistic regression, which is a probabilistic classification technique. Logistic
Regression is a binary classification algorithm which is closely related to Support Vector Machines
(SVM). Like SVM, Logistic Regression can be extended to work on multi-class classification problems
like digit recognition (i.e. recognizing digits like 0, 1, 2, 3, ... from the given images). This
version of Logistic Regression supports both binary and multi-class classification (for multi-class
it creates multiple 2-class classifiers). In order to train the logistic regression classifier,
Batch Gradient Descent and Mini-Batch Gradient Descent algorithms are used (see
<http://en.wikipedia.org/wiki/Gradient_descent_optimization>). Logistic Regression is a
discriminative classifier (see <http://www.cs.cmu.edu/~tom/NewChapters.html> for more details).
Logistic Regression is implemented as a C++ class in LogisticRegression.

In Logistic Regression, we try to optimize the training parameter \f$\theta\f$ such that the
hypothesis \f$0 \leq h_\theta(x) \leq 1\f$ is achieved. We have \f$h_\theta(x) = g(\theta^T x)\f$,
where \f$g(z) = \frac{1}{1+e^{-z}}\f$ is the logistic or sigmoid function. The term "Logistic" in
Logistic Regression refers to this function. For given data of a binary classification problem of
classes 0 and 1, one can determine that the given data instance belongs to class 1 if \f$h_\theta(x)
\geq 0.5\f$ or to class 0 if \f$h_\theta(x) < 0.5\f$ .

In Logistic Regression, choosing the right parameters is of utmost importance for reducing the
training error and ensuring high training accuracy:

-   The learning rate can be set with the @ref cv::ml::LogisticRegression::setLearningRate
    "setLearningRate" method. It determines how fast we approach the solution. It is a positive
    real number.

-   Optimization algorithms like Batch Gradient Descent and Mini-Batch Gradient Descent are
    supported in LogisticRegression. It is important to specify the number of iterations these
    optimization algorithms have to run. The number of iterations can be set with @ref
    cv::ml::LogisticRegression::setIterations "setIterations". This parameter can be thought of as
    the number of steps taken, while the learning rate specifies whether each step is long or short.
    These two parameters together define how fast we arrive at a possible solution.

-   In order to compensate for overfitting, regularization is performed, which can be enabled with
    @ref cv::ml::LogisticRegression::setRegularization "setRegularization". One can specify what
    kind of regularization has to be performed by passing one of @ref
    cv::ml::LogisticRegression::RegKinds "regularization kinds" to this method.

-   The logistic regression implementation provides a choice of two training methods: Batch
    Gradient Descent or Mini-Batch Gradient Descent. To specify this, call @ref
    cv::ml::LogisticRegression::setTrainMethod "setTrainMethod" with either @ref
    cv::ml::LogisticRegression::BATCH "LogisticRegression::BATCH" or @ref
    cv::ml::LogisticRegression::MINI_BATCH "LogisticRegression::MINI_BATCH". If the training method
    is set to @ref cv::ml::LogisticRegression::MINI_BATCH "MINI_BATCH", the size of the mini batch
    has to be set to a positive integer with @ref cv::ml::LogisticRegression::setMiniBatchSize
    "setMiniBatchSize".

A sample set of training parameters for the Logistic Regression classifier can be initialized as
follows:
@snippet samples/cpp/logistic_regression.cpp init

@sa cv::ml::LogisticRegression

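A minimal sketch of training and prediction with LogisticRegression. It assumes `samples` (CV_32F,
one sample per row) and `responses` (CV_32F class labels) already exist; the parameter values are
hypothetical:

@code{.cpp}
Ptr<LogisticRegression> lr = LogisticRegression::create();
lr->setLearningRate(0.001);
lr->setIterations(100);
lr->setRegularization(LogisticRegression::REG_L2);
lr->setTrainMethod(LogisticRegression::BATCH);
lr->train(samples, ROW_SAMPLE, responses);

Mat predicted;
lr->predict(samples, predicted);                   // predicted class labels, one per input row
@endcode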