WO2012012680A2 - Classification using correntropy - Google Patents

Classification using correntropy Download PDF

Info

Publication number
WO2012012680A2
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
correntropy
label
correntopy
statistical similarity
Prior art date
Application number
PCT/US2011/044932
Other languages
French (fr)
Other versions
WO2012012680A3 (en)
Inventor
Jose Carlos Principe
Abhishek Singh
Original Assignee
University Of Florida Research Foundation, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Florida Research Foundation, Inc. filed Critical University Of Florida Research Foundation, Inc.
Priority to US13/811,317 priority Critical patent/US9269050B2/en
Priority to CA2806053A priority patent/CA2806053A1/en
Publication of WO2012012680A2 publication Critical patent/WO2012012680A2/en
Publication of WO2012012680A3 publication Critical patent/WO2012012680A3/en

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00: Computing arrangements based on specific mathematical models

Definitions

  • Classification assigns labels to data based upon a decision rule.
  • a convex surrogate loss function is used as the training loss function in many classification procedures.
  • convex surrogates are preferred because of the virtues that convexity brings - unique optimum, efficient optimization using convex optimization tools, amenability to theoretical analysis of error bounds, etc.
  • convex functions are poor approximations for a wide variety of problems.
  • FIG. 1 is a graph illustrating examples of surrogate loss functions in accordance with various embodiments of the present disclosure.
  • FIG. 2 is a graph illustrating examples of correntropy loss (c-loss) functions in accordance with various embodiments of the present disclosure.
  • FIG. 3 is an example of an empirical risk function in a 2D space of errors in accordance with various embodiments of the present disclosure.
  • FIGS. 4 and 6 are graphical representations of examples of systems for robust classification using c-loss functions of FIG. 2 in accordance with various embodiments of the present disclosure.
  • FIG. 5 is a graph illustrating an example of derivatives of a c-loss function versus kernel width in accordance with various embodiments of the present disclosure.
  • FIG. 7 is a flow chart illustrating an example of robust classification using the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
  • FIGS. 8A-8F, 9A-9C, and 10A-10C are graphs of examples of classification results based at least in part upon a c-loss function module using the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
  • FIG. 11 is a graphical representation of an example of a processing device used to implement at least a portion of the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
  • Embodiments disclosed herein include systems and methods for robust classification using correntropy.
  • this application describes a loss function for classification that is induced by correntropy, referred to as the c-loss function.
  • a discriminant function is obtained by optimizing the c-loss function using a neural network that is substantially insensitive to outliers and resilient to overfitting and overtraining. This potentially leads to better generalization performance when compared to a squared loss function, which is common in neural network classifiers.
  • the described methods of training classifiers may provide a practical way of obtaining better results on real world classification problems, and an exemplary embodiment uses a simple gradient-based online training procedure for reducing the empirical risk.
  • Classification aims at assigning class labels to data (patterns) using an 'optimal' decision rule that is learned using a set of pre-labeled training samples.
  • This 'optimal' decision rule or discriminant function f is learned by minimizing an empirical risk, which is a sample average of a loss function.
  • the loss (a function of the prediction f(x) and the true label y) may be thought of as the price paid for predicting the label to be f(x), instead of y.
  • the procedure for learning the discriminant function f is called empirical risk minimization.
  • a natural loss function for classification is the misclassification error rate (or the 0-1 loss) stated in:
  • 0-1 loss function directly relates to the probability of misclassification.
  • the product yf(x) is called the margin (denoted by α) and can be treated as a measure of correctness of the decision for the sample X.
  • an empirical risk e.g., the sample average of the 0-1 loss
  • the first surrogate loss function φ (103) is a hinge loss function used in support vector machines (SVMs).
  • SVM uses the hinge loss, [1 - yf(x)]_+.
  • the second surrogate loss function φ (106) is a square loss function used in training neural networks, radial basis function networks, etc.
  • the loss function for training the weights of a neural network or a radial basis function network is typically the square loss function, (y - f(x))², or equivalently (1 - yf(x))².
  • convex surrogates of the 0-1 loss function are preferred because of the virtues that convexity brings: unique optima, efficient optimization using convex optimization tools, amenability to theoretical analysis of error bounds, etc.
  • convex surrogates are poor approximations to the 0-1 loss function.
  • Convex surrogates tend to be unbounded and offer poor robustness to outliers.
  • Another important limitation is that the complexity of convex optimization algorithms grows quickly as the amount of data increases.
  • Correntropy is a generalized correlation function or a robust measure of statistical similarity between two random variables that makes use of second and higher order statistics.
  • maximizing the similarity between the prediction f(x) and the target y in the correntropy sense effectively generates (i.e., induces) a non-convex, smooth loss function (referred to as a c-loss function) that can be used to train the classifier using an online gradient-based technique.
  • the c-loss function is used to train a neural network classifier using backpropagation. Without any increase in computational complexity, a better generalization performance on real world datasets may be obtained using the c-loss function when compared to the traditional squared loss function in neural network classifiers.
  • correntropy between the random variables is computed as:
  • Correntropy is a measure of how similar two random variables are, within a small neighborhood determined by the kernel width σ.
  • metrics like mean squared error (MSE) provide a global measure.
  • the loss function should be chosen such that minimization of the expected risk is equivalent to maximization of correntropy.
  • the correntropy induced loss function or the c-loss function is defined as:
  • Referring to FIG. 2, shown is a graph 200 of the c-loss function for different values of the kernel width parameter σ plotted against the margin α.
  • the 0-1 loss is also shown.
  • the c-loss function is a better approximation of the 0-1 loss function as compared to the squared or hinge loss functions, which are illustrated in FIG. 1.
  • the empirical risk obtained using the c-loss function behaves like the L2 norm for small errors (i.e., samples correctly classified with high confidence). As the errors increase, it behaves like the L1 norm and approaches the L0 norm for very large errors (i.e., misclassified samples).
  • the kernel size (or, equivalently, the distance from the origin in the space of errors) dictates the rate at which the empirical risk transitions from L2 to L0 behavior in the error space.
  • FIG. 3 depicts an example of the transition from an L2 norm like behavior close to the origin, to L0 behavior far away, with an L1 like region in between.
  • the c-loss function is a non-convex function of the margin. Therefore, it is difficult to optimize the c-loss function using sophisticated convex optimization techniques in order to obtain the optimal discriminant function f.
  • the c-loss function is a smooth function, it is possible to optimize the c-loss function using, for example, a first order (steepest descent) or a second order (e.g., Newton's method) gradient-based procedure, or a combination thereof (e.g., a conjugate gradient, a Hessian approximation method (e.g., a Levenberg-Marquardt algorithm)).
  • the c-loss function is used to train a neural network classifier by backpropagating the errors using any of the gradient-based methods discussed above.
  • FIG. 4 is a block diagram illustrating a nonlimiting embodiment of a system for robust classification using correntropy.
  • the system 400a includes a classifier 403, a c-loss function module 406, a gradient-based module 409, and a weight update module 412.
  • the classifier 403 classifies a value of x by predicting the label to be f(x).
  • the c-loss function module 406 determines a statistical similarity between the predicted label and the true label y using correntropy as described in EQN. 7 above.
  • the c-loss function module 406 outputs a c-loss value described by EQNS. 9 and 10 above.
  • the gradient-based module 409 calculates a change in the c-loss values, which minimizes the expected risk as described in EQN. 14 above, for example.
  • the gradient-based module 409 is a first, second, or mixed order gradient-based module, for example.
  • the gradient-based module 409 outputs the change to the weight update module 412, which updates the weights for the classifier based at least in part on a c-loss value and the change in c-loss values calculated by the gradient-based module 409.
  • the c-loss function is a non-convex function of the margin. Therefore it is difficult to optimize it using sophisticated convex optimization techniques, in order to obtain the optimal discriminant function f. However, since it is a smooth function, it is possible to optimize it using gradient based procedures. We therefore use the c-loss function to train a neural network classifier by backpropagating the errors using gradient descent.
  • weights of a multilayer perceptron may be updated by moving opposite to the gradient of the empirical risk computed using the c-loss function:
  • N_0 is the number of output layer PEs.
  • EQNS. 22 and 23 can be used to update or train the weights of a neural network classifier using the c-loss function.
  • the computational complexity of the weight updates remains the same as in the case of the conventional square loss function.
  • Referring to FIG. 5, shown is a graph 500 illustrating an example of derivatives of the c-loss function versus kernel width σ.
  • the derivative of the square loss function (which results in a linear function) is also shown.
  • the derivatives are plotted as a function of the error th
  • Combined loss function = α · (c-loss) + (1 - α) · (square loss) EQN. 15
  • FIG. 6 shown is a block diagram illustrating another nonlimiting embodiment of a system for robust classification using correntropy. Similar to the system 400a illustrated in FIG. 4, the system 400b in FIG. 6 includes a classifier 403, a gradient-based module 409, and a weight update module 412. However, the system 400b includes a combined loss function module 606 instead of merely a c-loss function module 406. The combined loss function module 606 determines a statistical similarity using a weighted combination of the c-loss function and the square loss function. The combined loss function module 606 outputs a combined loss function based at least in part on EQN. 15 above.
  • FIG. 7 is a flow chart illustrating a nonlimiting embodiment of a method of robust classification using correntropy.
  • the method 700 includes boxes 703, 706, 709, and 712.
  • a data value is classified by predicting a label for the data value using a discriminant function including a plurality of weights.
  • a statistical similarity between the predicted label and an actual label is determined based at least in part on a correntropy loss function.
  • the statistical similarity is also based at least in part on a square loss function.
  • the statistical similarity may be based at least in part on a weighted combination of the correntropy loss function and the square loss function (referred to herein as a combined loss function).
  • the expected risk associated with the predicted label is minimized based at least in part on the statistical similarity.
  • the minimization of the expected risk includes calculating a gradient of (i.e., change in) the statistical similarity.
  • the weights of the discriminant function are updated based at least in part on the minimized expected risk.
  • the performance of neural network classifiers trained with the c-loss function and with the traditional square loss function was compared in classifying the Pima Indians Diabetes dataset.
  • This data includes eight physiological measurements (features) from 768 subjects. The objective was to classify a test subject into either the diabetic or non- diabetic group, based on these physiological measurements. Out of the 768 samples, 400 samples were used to train the classifier, and the remaining samples were used to test for generalization.
  • Networks having a single hidden layer and a sigmoidal non-linearity in each of the PEs are used.
  • the datasets used are normalized to have unit variance along each feature.
  • the networks were trained with 10 epochs of the training dataset.
  • the kernel size σ in the c-loss function was set to be 0.5. Other values of kernel size such as 0.6 and 0.7 were also tried, with little change in performance.
  • the generalization performance for both the methods was obtained by testing on a separate test dataset.
  • FIGS. 8A-8F are graphs of percentage of correct classification for a system 400 including a c-loss function module and a system including a square loss function, versus number of training epochs, for the Pima Indians Diabetes dataset.
  • the total number of training epochs varies in each figure such that the total number of training epochs is respectively 200, 100, 50, 30, 15, and 5 in FIGS. 8A, 8B, 8C, 8D, 8E, and 8F.
  • the c-loss function was implemented using the combined loss function as described by EQN. 15, in order to avoid local minima.
  • FIGS. 8A-8F illustrate how the c-loss function improves the classification performance compared to the square loss function, over the training epochs. These plots were obtained after averaging over 100 Monte Carlo simulations with random initial weights of the network and random selection of training samples from the datasets.
  • Referring to FIGS. 9A-9C, shown are graphs of examples of the percentage of correct classification for 5, 10 and 20 PEs in the hidden layer, respectively.
  • the average classification performance (on the test set) computed from 100 Monte Carlo trials is shown for different training durations (epochs).
  • the tables in FIGS. 9A, 9B, and 9C illustrate the evaluation results for 5, 10 and 20 PEs, respectively. In each of the three tables, the best performance of c-loss function is better than that obtained by the square loss.
  • Referring to FIGS. 10A-10C, shown are graphs of examples of the percentage of correct classification for 5, 10 and 20 PEs in the hidden layer, respectively.
  • the mean classification performance is tabulated for both loss functions, for various training epochs.
  • the c-loss function again outperforms the square loss function in terms of generalization.
  • the use of the c-loss function after the 5th epoch provides an improvement in classification performance.
  • the systems and methods provided herein can be implemented in hardware, software, firmware, or a combination thereof.
  • the method can be implemented in software or firmware that is stored in a memory and that is executed by a suitable instruction execution system.
  • the system can be implemented with any or a combination of the following technologies, which are all well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), digital signal processor (DSP), etc.
  • the adaptive systems described above may be implemented in a processing device 1100 such as the one illustrated in FIG. 11.
  • the signal processing device 1100 includes a receiver 1103, a transmitter 1106, a processing unit 1109, a bus 1112 and a memory 1115.
  • the memory 1115 stores application specific software 1118 including modules that include instructions that when executed by the processing unit 1109 perform various operations.
  • the modules include, for example, a classifier 1121, a c-loss module 1124, a gradient-based module 1127, and a weight update module 1130.
  • the application software includes a combination loss function module configured to apply a combination of a c-loss function and a square loss function to an input predicted label and an actual label.
  • the application specific software 1118 can also be stored on a variety of computer-readable media for use by, or in connection with, a variety of computer-related systems or methods.
  • a "computer-readable medium” stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device.
  • the computer-readable medium may include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory) (electronic), an optical fiber (optical), a portable compact disc read- only memory (CDROM) (optical), a digital versatile disc (optical), a high definition digital versatile disc (optical), and a Blu-ray Disc (optical).

Abstract

Various methods and systems are provided for classification using correntropy. In one embodiment, a classifying device includes a processing unit and memory storing instructions in modules that when executed by the processing unit cause the classifying device to adaptively classify a data value using a correntropy loss function. In another embodiment, a method includes adjusting a weight of a classifier based at least in part on a change in a correntropy loss function signal and classifying a data value using the classifier. In another embodiment, a method includes classifying a data value by predicting a label for the data value using a discriminant function, determining a correntropy statistical similarity between the predicted label and an actual label based at least in part on a correntropy loss function, and minimizing an expected risk associated with the predicted label based at least in part on the correntropy statistical similarity.

Description

CLASSIFICATION USING CORRENTROPY
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to copending U.S. provisional application entitled "CLASSIFICATION USING CORRENTROPY" having serial no. 61/366,662, filed July 22, 2010, which is entirely incorporated herein by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0002] This invention was made with Government support under Grant ECS-0601271 awarded by the National Science Foundation. The Government has certain rights in this invention.
BACKGROUND
[0003] Classification assigns labels to data based upon a decision rule. A convex surrogate loss function is used as the training loss function in many classification procedures. Within the statistical learning community, convex surrogates are preferred because of the virtues that convexity brings: unique optimum, efficient optimization using convex optimization tools, amenability to theoretical analysis of error bounds, etc. However, convex functions are poor approximations for a wide variety of problems.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
[0005] FIG. 1 is a graph illustrating examples of surrogate loss functions in accordance with various embodiments of the present disclosure.
[0006] FIG. 2 is a graph illustrating examples of correntropy loss (c-loss) functions in accordance with various embodiments of the present disclosure.
[0007] FIG. 3 is an example of an empirical risk function in a 2D space of errors in accordance with various embodiments of the present disclosure.
[0008] FIGS. 4 and 6 are graphical representations of examples of systems for robust classification using c-loss functions of FIG. 2 in accordance with various embodiments of the present disclosure.
[0009] FIG. 5 is a graph illustrating an example of derivatives of a c-loss function versus kernel width in accordance with various embodiments of the present disclosure.
[0010] FIG. 7 is a flow chart illustrating an example of robust classification using the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
[0011] FIGS. 8A-8F, 9A-9C, and 10A-10C are graphs of examples of classification results based at least in part upon a c-loss function module using the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
[0012] FIG. 11 is a graphical representation of an example of a processing device used to implement at least a portion of the systems of FIGS. 4 and 6 in accordance with various embodiments of the present disclosure.
DESCRIPTION
[0013] Embodiments disclosed herein include systems and methods for robust classification using correntropy. For example, this application describes a loss function for classification that is induced by correntropy, referred to as the c-loss function. A discriminant function is obtained by optimizing the c-loss function using a neural network that is substantially insensitive to outliers and resilient to overfitting and overtraining. This potentially leads to better generalization performance when compared to a squared loss function, which is common in neural network classifiers. The described methods of training classifiers may provide a practical way of obtaining better results on real world classification problems, and an exemplary embodiment uses a simple gradient-based online training procedure for reducing the empirical risk.
[0014] Classification aims at assigning class labels to data (patterns) using an 'optimal' decision rule that is learned using a set of pre-labeled training samples. This 'optimal' decision rule or discriminant function f is learned by minimizing an empirical risk, which is a sample average of a loss function. The loss (a function of the prediction f(x) and the true label y) may be thought of as the price paid for predicting the label to be f(x), instead of y. The procedure for learning the discriminant function f is called empirical risk minimization.
[0015] A natural loss function for classification is the misclassification error rate (or the 0-1 loss) stated in:

l_{0-1}(y, f(x)) = ||(-yf(x))_+||_0 , EQN. 1

where (·)_+ denotes the positive part and ||·||_0 denotes the L0 norm. The misclassification error rate may be thought of as a count of the number of incorrect classifications made by the discriminant function f. Therefore, the 0-1 loss function directly relates to the probability of misclassification.
[0016] Optimization of the risk based on such a loss function, however, is computationally intractable due to its non-continuity and non-convexity. Therefore, a surrogate loss function is used as the training loss function. Correntropy between two random variables is a generalized correlation function or a robust measure of statistical similarity that makes use of higher order statistics. In a classification setting, maximizing the similarity between the prediction f(x) and the target y in the correntropy sense effectively induces a non-convex, smooth loss function (or c-loss) that may be used to train the classifier using an online gradient based technique. This loss function is used to train a neural network classifier using backpropagation without an increase in computational complexity.
[0017] Given a sample set of observations D_n = {(x_i, y_i), i = 1, 2, ..., n}, assumed to be independent and identically distributed (i.i.d.) realizations of a random pair (X, Y), the goal of classification is to select a function f from a class of functions F, such that the sign of f(x) is an accurate prediction of Y under an unknown joint distribution P(X, Y). Here, x ∈ X is the input vector and Y ∈ {-1, 1} is the class label. In other words, select f ∈ F that minimizes the risk R(f) given by:

R(f) = E[l_{0-1}(Yf(X))] = P(Y ≠ sign(f(X))) . EQN. 2
[0018] The product yf(x) is called the margin (denoted by α) and can be treated as a measure of correctness of the decision for the sample X. Given the sample set D_n of realizations, an empirical risk (e.g., the sample average of the 0-1 loss) is as expressed by:

R̂(f) = (1/n) Σ_{i=1}^n l_{0-1}(y_i f(x_i)) . EQN. 3
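For illustration only, the following is a nonlimiting Python sketch of the 0-1 loss of EQN. 1 and the empirical risk of EQN. 3 evaluated on a small toy set of labels and predictions; the identifiers used (zero_one_loss, empirical_risk) are illustrative assumptions and do not appear in the embodiments above.

    import numpy as np

    def zero_one_loss(y, f_x):
        # EQN. 1: 1 if the margin y*f(x) is negative (misclassification), 0 otherwise.
        return (y * f_x < 0).astype(float)

    def empirical_risk(y, f_x):
        # EQN. 3: sample average of the 0-1 loss over the training set.
        return np.mean(zero_one_loss(y, f_x))

    # Toy example: three correct decisions and one misclassification.
    y = np.array([1.0, -1.0, 1.0, -1.0])
    f_x = np.array([0.8, -0.3, 0.1, 0.6])
    print(empirical_risk(y, f_x))  # 0.25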
[0019] Optimization of the empirical risk as expressed above, however, is computationally intractable primarily because of the discontinuity of the 0-1 loss function. Due to the discontinuity of the 0-1 loss function, the optimization procedure involves choosing a surrogate φ(α) = φ(yf(x)) as the loss function. What results is the minimization of the φ-risk and empirical φ-risk defined by:

R_φ(f) = E[φ(Yf(X))] EQN. 4

R̂_φ(f) = (1/n) Σ_{i=1}^n φ(y_i f(x_i)) EQN. 5
[0020] Referring to FIG. 1, shown is a graph 100 of two commonly used surrogate loss functions φ versus the margin α. The first surrogate loss function φ (103) is a hinge loss function used in support vector machines (SVMs). The SVM uses the hinge loss, [1 - yf(x)]_+. The second surrogate loss function φ (106) is a square loss function used in training neural networks, radial basis function networks, etc. The loss function for training the weights of a neural network or a radial basis function network is typically the square loss function, (y - f(x))², or equivalently (1 - yf(x))² for labels y ∈ {-1, 1}.
[0021] Within the statistical learning community, convex surrogates of the 0-1 loss function are preferred because of the virtues that convexity brings: unique optima, efficient optimization using convex optimization tools, amenability to theoretical analysis of error bounds, etc. However, convex surrogates are poor approximations to the 0-1 loss function. Convex surrogates tend to be unbounded and offer poor robustness to outliers. Another important limitation is that the complexity of convex optimization algorithms grows quickly as the amount of data increases.
[0022] There is a large class of problems where optimization cannot be done using convex programming techniques. For example, the training of deep networks for large scale artificial intelligence (AI) problems primarily relies on online, gradient-based methods. Such neural network based classification machines can benefit from non-convex surrogates, as they can potentially be closer approximations to the 0-1 loss function. Additionally, non-convex surrogates can have better scalability, robustness, and generalization performance. Although non-convex surrogates do not offer many theoretical guarantees, the empirical evidence that they work better in practical engineering applications is overwhelming.
[0023] An example of a loss function for classification that utilizes a statistical measure called correntropy is described. Correntropy is a generalized correlation function or a robust measure of statistical similarity between two random variables that makes use of second and higher order statistics. In a classification setting, maximizing the similarity between the prediction f(x) and the target y in the correntropy sense effectively generates (i.e., induces) a non-convex, smooth loss function (referred to as a c-loss function) that can be used to train the classifier using an online gradient-based technique. The c-loss function is used to train a neural network classifier using backpropagation. Without any increase in computational complexity, a better generalization performance on real world datasets may be obtained using the c-loss function when compared to the traditional squared loss function in neural network classifiers.
[0024] Cross correntropy, or simply correntropy, between two random variables X and Y is a generalized similarity measure defined as:

v(X, Y) = E[κ_σ(X - Y)] , EQN. 6

where κ_σ is a Gaussian kernel with width parameter σ. In practice, given only a finite number of realizations of the random variables, correntropy between the random variables is computed as:

v̂(X, Y) = (1/n) Σ_{i=1}^n κ_σ(x_i - y_i) . EQN. 7
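As a nonlimiting illustration, the following Python sketch computes the sample correntropy estimator of EQN. 7 with a Gaussian kernel; the kernel normalization constant is omitted here (an assumption of the sketch), and the identifiers (gaussian_kernel, correntropy) are illustrative only.

    import numpy as np

    def gaussian_kernel(u, sigma):
        # Gaussian kernel with width parameter sigma; the normalization constant is
        # omitted in this sketch, which only rescales the similarity values.
        return np.exp(-u**2 / (2.0 * sigma**2))

    def correntropy(x, y, sigma=0.5):
        # EQN. 7: sample average of the kernel evaluated at the pairwise differences.
        return np.mean(gaussian_kernel(np.asarray(x) - np.asarray(y), sigma))

    print(correntropy([1.0, -1.0, 1.0], [0.9, -0.8, 0.7], sigma=0.5))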
[0025] Correntropy is a measure of how similar two random variables are, within a small neighborhood determined by the kernel width σ . In contrast, metrics like mean squared error (MSE) provide a global measure. The localization provided by the kernel width proves to be very useful in reducing the detrimental effects of outliers and impulsive noise.
[0026] In a classification setting, the similarity between the classifier output and the true label is maximized, in the correntropy sense. Therefore, the loss function should be chosen such that minimization of the expected risk is equivalent to maximization of correntropy. The correntropy induced loss function or the c-loss function is defined as:

l_C(y, f(x)) = β[1 - κ_σ(y - f(x))] , EQN. 8

[0027] The c-loss function can also be expressed in terms of the classification margin α = yf(x) as:

l_C(α) = β[1 - κ_σ(1 - α)] , EQN. 9

which, writing out the Gaussian kernel, becomes

l_C(α) = β[1 - exp(-(1 - α)²/(2σ²))] , EQN. 10

where β is a positive scaling constant chosen such that l_C(α = 0) = 1. Therefore, β = [1 - exp(-1/(2σ²))]^(-1).
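For illustration only, the following nonlimiting Python sketch evaluates the c-loss of EQNS. 9 and 10 as a function of the margin, with β chosen so that l_C(α = 0) = 1; the identifiers (c_loss, margins) are illustrative assumptions.

    import numpy as np

    def c_loss(margin, sigma=0.5):
        # EQN. 10: c-loss as a function of the margin alpha = y*f(x),
        # scaled by beta so that the loss equals 1 at alpha = 0.
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        return beta * (1.0 - np.exp(-(1.0 - margin)**2 / (2.0 * sigma**2)))

    margins = np.array([-1.0, 0.0, 0.5, 1.0])  # badly misclassified to perfectly classified
    print(c_loss(margins, sigma=0.5))          # decreases toward 0 as the margin approaches 1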
[0028] Referring next to FIG. 2, shown is a graph 200 of the c-loss function for different values of the kernel width parameter σ plotted against the margin α. The 0-1 loss is also shown. As can be seen in FIG. 2, the c-loss function is a better approximation of the 0-1 loss function as compared to the squared or hinge loss functions, which are illustrated in FIG. 1.
[0029] The expected risk associated with the c-loss function is expressed by:

R_C(f) = β(1 - E[κ_σ(1 - Yf(X))]) EQN. 11

= β(1 - E[κ_σ(Y - f(X))]) EQN. 12

= β(1 - v(Y, f(X))) EQN. 13

[0030] Minimizing the expected risk is equivalent to maximizing the similarity (in the correntropy sense) between the predicted label f(x) and the true label y. Upon changing variables based on ε = Y - f(X), the estimator of correntropy can be written as:

v̂(ε) = (1/n) Σ_{i=1}^n κ_σ(e_i) EQN. 14
[0031] From the Parzen density estimation principle, it can be seen that EQN. 14 is an estimator of the probability density function (pdf) of ε, evaluated at 0. Therefore, maximizing the correntropy of the errors ε of a classifier essentially maximizes p_ε(0). This is a more natural quantity to optimize for a classifier, as compared with quantities such as the sum of squared errors.
[0032] In the space of the errors e = y - f(x), the empirical risk obtained using the c-loss function behaves like the L2 norm for small errors (i.e., samples correctly classified with high confidence). As the errors increase, it behaves like the L1 norm and approaches the L0 norm for very large errors (i.e., misclassified samples). The kernel size (or, equivalently, the distance from the origin in the space of errors) dictates the rate at which the empirical risk transitions from L2 to L0 behavior in the error space. FIG. 3 depicts an example of the transition from an L2 norm like behavior close to the origin, to L0 behavior far away, with an L1 like region in between. The empirical risk function in the 2D space of errors e = y - f(x) of FIG. 3 was obtained using the c-loss function with σ = 0.5.
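To illustrate this transition numerically, the following nonlimiting Python sketch evaluates the c-loss in the error space for small, moderate, and large errors; the particular error values chosen are illustrative assumptions.

    import numpy as np

    def c_loss_of_error(e, sigma=0.5):
        # c-loss written in terms of the error e = y - f(x), with beta as in EQN. 10.
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        return beta * (1.0 - np.exp(-e**2 / (2.0 * sigma**2)))

    errors = np.array([0.05, 0.1, 0.2, 1.0, 2.0, 5.0])
    print(c_loss_of_error(errors))
    # Small errors: the loss grows roughly quadratically (L2-like behavior).
    # Large errors: the loss saturates near beta, approaching a constant count (L0-like behavior).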
[0033] The c-loss function is a non-convex function of the margin. Therefore, it is difficult to optimize the c-loss function using sophisticated convex optimization techniques in order to obtain the optimal discriminant function f. However, since the c-loss function is a smooth function, it is possible to optimize the c-loss function using, for example, a first order (steepest descent) or a second order (e.g., Newton's method) gradient-based procedure, or a combination thereof (e.g., a conjugate gradient, a Hessian approximation method (e.g., a Levenberg-Marquardt algorithm)). The c-loss function is used to train a neural network classifier by backpropagating the errors using any of the gradient-based methods discussed above.
[0034] From the equations involved in backpropagation, it can be observed that the magnitude of the derivative of the loss function, with respect to the current value of the error, controls the size of the steps in the weight update equation. In other words, the derivative of the loss evaluated at the current value of the error, controls how much the discriminant function (weights) are influenced by the input sample that produced the error.
[0035] FIG. 4 is a block diagram illustrating a nonlimiting embodiment of a system for robust classification using correntropy. The system 400a includes a classifier 403, a c-loss function module 406, a gradient-based module 409, and a weight update module 412. The classifier 403 classifies a value of x by predicting the label to be f(x). The c-loss function module 406 determines a statistical similarity between the predicted label f(x) and the true label y using correntropy as described in EQN. 7 above. The c-loss function module 406 outputs a c-loss value described by EQNS. 9 and 10 above.
[0036] The gradient-based module 409 calculates a change in the c-loss values, which minimizes the expected risk as described in EQN. 14 above, for example. In various embodiments, the gradient-based module 409 is a first, second, or mixed order gradient-based module, for example. The gradient-based module 409 outputs the change to the weight update module 412, which updates the weights for the classifier based at least in part on a c-loss value and the change in c-loss values calculated by the gradient-based module 409.
[0037] The c-loss function is a non-convex function of the margin. Therefore it is difficult to optimize it using sophisticated convex optimization techniques, in order to obtain the optimal discriminant function f. However, since it is a smooth function, it is possible to optimize it using gradient based procedures. We therefore use the c-loss function to train a neural network classifier by backpropagating the errors using gradient descent.
[0038] Before deriving the backpropagation equations, the notation used to denote the variables in a neural network is summarized below for convenience.

w_jk^n : the weight between the processing element (PE) k and the PE j of the previous layer, at the nth iteration.

y_j^n : output of the PE j of the previous layer, at the nth iteration.

net_k^n = Σ_j w_jk^n y_j^n : weighted sum of all outputs y_j^n of the previous layer, at the nth iteration.

g(·) : sigmoidal squashing function in each PE, g(z) = 1/(1 + e^(-z)) . EQN. 15

y_k^n = g(net_k^n) : output of the PE k of the current layer, at the nth iteration.

y^n ∈ {±1} : the true label (or desired signal), for the nth sample.
[0039] The weights of a multilayer perceptron (MLP) may be updated by moving opposite to the gradient of the empirical risk computed using the c-loss function:

w_jk^(n+1) = w_jk^n - η ∂R_C(e)/∂w_jk^n . EQN. 16

Using the chain rule, the above equation can be written as:

w_jk^(n+1) = w_jk^n - η (∂R_C(e)/∂e_n)(∂e_n/∂w_jk^n) . EQN. 17

Since this is an online procedure, the derivative of the risk with respect to the error at the nth iteration, e_n, is essentially the derivative of the c-loss function, evaluated at e_n. Therefore,

w_jk^(n+1) = w_jk^n + η (∂l_C(e)/∂e_n) g'(net_k) y_j^n . EQN. 18

The above equation is the general rule for updating all the weights of the MLP, and it is called the Delta rule, written simply as:

w_jk^(n+1) = w_jk^n + η δ_k y_j^n , EQN. 19

where

δ_k = (∂l_C(e)/∂e_n) g'(net_k) . EQN. 20
[0040] Depending on the cost function, and the type of weights (belonging to the output layer or a hidden layer), the computation of δ_k in the above equation differs. For the output layer weights, the computation is as follows:

δ_o = (∂l_C(e)/∂e_n) g'(net_o) EQN. 21

= β₁ e_n exp(-e_n²/(2σ²)) g'(net_o) EQN. 22

where β₁ = β/σ². For the previous (hidden) layer, the 'deltas' are computed as:

δ_j = g'(net_j) Σ_{k=1}^{N_0} δ_k w_jk EQN. 23

where N_0 is the number of output layer PEs.
[0041] EQNS. 22 and 23 can be used to update or train the weights of a neural network classifier using the c-loss function. The computational complexity of the weight updates remains the same as in the case of the conventional square loss function.
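By way of a nonlimiting illustration, the following Python sketch performs one online update of a single-hidden-layer MLP using the delta rule of EQNS. 19-23 with the c-loss derivative of EQN. 22; the layer sizes, learning rate, and identifiers are illustrative assumptions rather than a reference implementation of the embodiments.

    import numpy as np

    def g(z):
        # Sigmoidal squashing function of EQN. 15.
        # (A bipolar nonlinearity such as tanh may be substituted for +/-1 labels.)
        return 1.0 / (1.0 + np.exp(-z))

    def dg(z):
        s = g(z)
        return s * (1.0 - s)

    def c_loss_derivative(e, sigma=0.5):
        # EQN. 22: dl_C/de = beta1 * e * exp(-e^2 / (2*sigma^2)), with beta1 = beta / sigma^2.
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        return (beta / sigma**2) * e * np.exp(-e**2 / (2.0 * sigma**2))

    def online_update(x, y, W1, W2, eta=0.1, sigma=0.5):
        # Forward pass through one hidden layer and a single output PE.
        net_hidden = W1 @ x
        y_hidden = g(net_hidden)
        net_out = W2 @ y_hidden
        y_out = g(net_out)
        e = y - y_out                                   # error for this sample

        # Output layer delta (EQNS. 21-22) and hidden layer deltas (EQN. 23).
        delta_out = c_loss_derivative(e, sigma) * dg(net_out)
        delta_hidden = dg(net_hidden) * (W2 * delta_out)

        # Delta rule (EQN. 19): weight change is eta * delta * input to that weight.
        W2 = W2 + eta * delta_out * y_hidden
        W1 = W1 + eta * np.outer(delta_hidden, x)
        return W1, W2

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(5, 8))  # 5 hidden PEs, 8 input features (illustrative sizes)
    W2 = rng.normal(scale=0.1, size=5)
    x = rng.normal(size=8)
    W1, W2 = online_update(x, y=1.0, W1=W1, W2=W2)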
[0042] From EQNS. 19 and 20, it can be seen that the magnitude of the derivative of the loss function, with respect to the current value of the error, essentially controls the size of the steps in the weight update equation. In other words, the derivative of the loss evaluated at the current value of the error, controls how much the discriminant function (weights) are influenced by the input sample that produced the error.
[0043] Referring to FIG. 5, shown is a graph 500 illustrating an example of derivatives of the c-loss function versus kernel width σ. For comparison purposes, the derivative of the square loss function (which results in a linear function) is also shown. The derivatives are plotted as a function of the error e_n produced by the nth input sample. Therefore, if 1 < |e_n| < 2, the sample is misclassified, and if 0 < |e_n| < 1, the sample is correctly classified. As can be seen in FIG. 5, if the square loss function is used, the samples which produce high errors (i.e., samples which are badly misclassified) yield a high derivative. This means that the weights or the decision boundary of the classifier is highly affected by outliers or impulsive input samples. In contrast, the derivative of the c-loss function attenuates the influence of highly erroneous samples while computing the weights or the discriminant function. Since the discriminant function is insensitive to these outliers, better generalization can be obtained in real world datasets which are often plagued with noisy samples.
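As a nonlimiting numerical illustration of this attenuation, the following Python sketch compares the derivative of a square loss with the c-loss derivative of EQN. 22 for a small error and a large error; the values chosen are illustrative only.

    import numpy as np

    def square_loss_derivative(e):
        # The square-loss derivative is proportional to the error (a linear function).
        return e

    def c_loss_derivative(e, sigma=0.5):
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        return (beta / sigma**2) * e * np.exp(-e**2 / (2.0 * sigma**2))

    for e in (0.5, 1.8):  # correctly classified sample vs. badly misclassified sample
        print(e, square_loss_derivative(e), c_loss_derivative(e, sigma=0.5))
    # The square-loss derivative keeps growing with the error, while the c-loss
    # derivative shrinks for large errors, attenuating the influence of outliers.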
[0044] Also, for a kernel size like σ = 0.5, the effect of samples near the decision boundary (e = 1) is also attenuated. This means that 'confusing' samples which lie near the decision boundary are not given much importance while learning the classifier. This results in a more regularized solution that is less prone to overfitting due to overtraining.
[0045] The localization provided by the kernel in the c-loss function makes the gradient descent algorithm susceptible to getting stuck in local minima. Therefore, the c-loss function may be useful primarily in the vicinity of the optimal solution. One approach is to first allow convergence using the traditional square loss function and then switch to the c-loss function. Another approach, using a weighted combination of the two loss functions instead, is described by:

Combined loss function = α · (c-loss) + (1 - α) · (square loss) EQN. 15
[0046] The value of α is linearly increased from 0 to 1 over the total number of training epochs. Therefore, for the ith epoch, α_i = i/N, where N is the total number of training epochs. Such an approach means that the square loss function is used in the beginning of training, and there is a smooth switch over to the c-loss function towards the end of training.
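A nonlimiting Python sketch of this weighted combination and the linear α schedule follows; the helper names (combined_loss, square_loss) and the epoch counts are illustrative assumptions.

    import numpy as np

    def square_loss(margin):
        # Square loss written in terms of the margin alpha = y*f(x): (1 - alpha)^2.
        return (1.0 - margin)**2

    def c_loss(margin, sigma=0.5):
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        return beta * (1.0 - np.exp(-(1.0 - margin)**2 / (2.0 * sigma**2)))

    def combined_loss(margin, epoch, n_epochs, sigma=0.5):
        # EQN. 15 with alpha_i = i/N: start near the square loss, end near the c-loss.
        a = epoch / float(n_epochs)
        return a * c_loss(margin, sigma) + (1.0 - a) * square_loss(margin)

    for epoch in (1, 5, 10):
        print(epoch, combined_loss(0.2, epoch, n_epochs=10))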
[0047] Referring to FIG. 6, shown is a block diagram illustrating another nonlimiting embodiment of a system for robust classification using correntropy. Similar to the system 400a illustrated in FIG. 4, the system 400b in FIG. 6 includes a classifier 403, a gradient-based module 409, and a weight update module 412. However, the system 400b includes a combined loss function module 606 instead of merely a c-loss function module 406. The combined loss function module 606 determines a statistical similarity using a weighted combination of the c-loss function and the square loss function. The combined loss function module 606 outputs a combined loss function based at least in part on EQN. 15 above.
[0048] FIG. 7 is a flow chart illustrating a nonlimiting embodiment of a method of robust classification using correntropy. The method 700 includes boxes 703, 706, 709, and 712. In box 703, a data value is classified by predicting a label for the data value using a discriminant function including a plurality of weights. In box 706 a statistical similarity between the predicted label and an actual label is determined based at least in part on a correntropy loss function. In some embodiments, the statistical similarity is also based at least in part on a square loss function. Further, the statistical similarity may be based at least in part on a weighted combination of the correntropy loss function and the square loss function (referred to herein as a combined loss function). In box 709, the expected risk associated with the predicted label is minimized based at least in part on the statistical similarity. The minimization of the expected risk includes calculating a gradient of (i.e., change in) the statistical similarity. In box 712, the weights of the discriminant function are updated based at least in part on the minimized expected risk.
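For illustration only, the following nonlimiting Python sketch walks through boxes 703-712 for a simple linear discriminant f(x) = w·x trained by online gradient descent on the c-loss; the toy data, step size, and identifiers are assumptions of the sketch and not the claimed embodiments.

    import numpy as np

    def c_loss_grad_wrt_w(x, y, w, sigma=0.5):
        # Gradient of the c-loss (EQN. 9 with margin alpha = y * w.x) with respect to w.
        beta = 1.0 / (1.0 - np.exp(-1.0 / (2.0 * sigma**2)))
        margin = y * (w @ x)
        dloss_dmargin = -(beta / sigma**2) * (1.0 - margin) \
            * np.exp(-(1.0 - margin)**2 / (2.0 * sigma**2))
        return dloss_dmargin * y * x

    # Toy two-class data: Gaussian clouds centered at (1, 1) and (-1, -1).
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(size=(200, 2)) + 1.0,
                   rng.normal(size=(200, 2)) - 1.0])
    Y = np.hstack([np.ones(200), -np.ones(200)])

    w = np.zeros(2)
    eta = 0.05
    for x, y in zip(X, Y):
        # Box 703: classify by predicting a label with the discriminant function.
        # Boxes 706/709: evaluate the c-loss gradient (the change in similarity).
        # Box 712: update the weights of the discriminant function.
        w = w - eta * c_loss_grad_wrt_w(x, y, w)

    accuracy = np.mean(np.sign(X @ w) == Y)
    print(w, accuracy)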
[0049] For purposes of comparison, the performance of neural network classifiers trained with the c-loss function and with the traditional square loss function was compared in classifying the Pima Indians Diabetes dataset. This data includes eight physiological measurements (features) from 768 subjects. The objective was to classify a test subject into either the diabetic or non-diabetic group, based on these physiological measurements. Out of the 768 samples, 400 samples were used to train the classifier, and the remaining samples were used to test for generalization.
[0050] Networks having a single hidden layer and a sigmoidal non-linearity in each of the PEs are used. The datasets used are normalized to have unit variance along each feature. The networks were trained with 10 epochs of the training dataset. The kernel size σ in the c-loss function was set to be 0.5. Other values of kernel size such as 0.6 and 0.7 were also tried, with little change in performance. The generalization performance for both the methods was obtained by testing on a separate test dataset.
[0051] FIGS. 8A-8F are graphs of percentage of correct classification for a system 400 including a c-loss function module and a system including a square loss function, versus number of training epochs, for the Pima Indians Diabetes dataset. The total number of training epochs varies in each figure such that the total number of training epochs is respectively 200, 100, 50, 30, 15, and 5 in FIGS. 8A, 8B, 8C, 8D, 8E, and 8F. In these evaluations, the c-loss function was implemented using the combined loss function as described by EQN. 15, in order to avoid local minima. FIGS. 8A-8F illustrate how the c-loss function improves the classification performance compared to the square loss function, over the training epochs. These plots were obtained after averaging over 100 Monte Carlo simulations with random initial weights of the network and random selection of training samples from the datasets.
[0052] In addition, the effect of the number of PEs on the classification results was examined using the Pima Indians Diabetes dataset. The evaluation was repeated with 5, 10 and 20 PEs in the hidden layer of the MLP using the two loss functions. Referring to FIGS. 9A-9C, shown are the graphs of examples of the percentage of correct classification for 5, 10 and 20 PEs in the hidden layer, respectively. The average classification performance (on the test set) computed from 100 Monte Carlo trials is shown for different training durations (epochs). The tables in FIGS. 9A, 9B, and 9C illustrate the evaluation results for 5, 10 and 20 PEs, respectively. In each of the three tables, the best performance of the c-loss function is better than that obtained by the square loss. In these evaluations, in order to avoid local optima, training was carried out by first training with the square loss function and then switching over to the c-loss function. Since the switch to the c-loss function is made only after the 5th epoch, the first 5 values for the c-loss results are not shown. As indicated in the graphs of FIGS. 9A-9C, generalization performance improves with use of the c-loss function.
[0053] Comparison of the classification using a c-loss function module and using a square loss function was also performed using a Wisconsin Breast Cancer dataset. This dataset consists of nine-dimensional samples, belonging to two classes. Out of the 683 samples, 300 were used to train the classifiers. Referring to FIGS. 10A-10C, shown are the graphs of examples of the percentage of correct classification for 5, 10 and 20 PEs in the hidden layer, respectively. Along the same lines as the examination of FIGS. 9A-9C, the mean classification performance is tabulated for both loss functions, for various training epochs. The c-loss function again outperforms the square loss function in terms of generalization. As indicated in the graphs of FIGS. 10A-10C, the use of the c-loss function after the 5th epoch provides an improvement in classification performance.
[0054] The systems and methods provided herein can be implemented in hardware, software, firmware, or a combination thereof. In one embodiment, the method can be implemented in software or firmware that is stored in a memory and that is executed by a suitable instruction execution system. If implemented in hardware, as in an alternative embodiment, the system can be implemented with any or a combination of the following technologies, which are all well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), digital signal processor (DSP), etc.
[0055] In some embodiments, the adaptive systems described above may be implemented in a processing device 1100 such as the one illustrated in FIG. 11. The signal processing device 1100 includes a receiver 1103, a transmitter 1106, a processing unit 1109, a bus 1112 and a memory 1115. The memory 1115 stores application specific software 1118 including modules that include instructions that when executed by the processing unit 1109 perform various operations. In the embodiment illustrated in FIG. 11, the modules include, for example, a classifier 1121, a c-loss module 1124, a gradient-based module 1127, and a weight update module 1130. In other embodiments, the application software includes a combination loss function module configured to apply a combination of a c-loss function and a square loss function to an input predicted label and an actual label.
[0056] The application specific software 1118 can also be stored on a variety of computer-readable media for use by, or in connection with, a variety of computer-related systems or methods. In the context of this disclosure, a "computer-readable medium" stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium may include the following: an electrical connection (electronic) having one or more wires, a portable computer diskette (magnetic), a random access memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or Flash memory) (electronic), an optical fiber (optical), a portable compact disc read-only memory (CDROM) (optical), a digital versatile disc (optical), a high definition digital versatile disc (optical), and a Blu-ray Disc (optical).
[0057] Any process descriptions or blocks should be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the embodiments described in the present disclosure in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present disclosure.
[0058] It should be emphasized that the above-described embodiments in the present disclosure are merely possible examples of implementations, merely set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) of the disclosure without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and the present disclosure and protected by the following claims.

Claims

CLAIMS
Therefore, at least the following is claimed:
1. A classifying device comprising:
a processing unit; and
a memory storing instructions in modules that when executed by the processing unit cause the classifying device to adaptively classify a data value using a correntropy loss function.
2. The classifying device of claim 1, wherein the stored instructions include a classifier module that when executed by the processing unit is configured to predict a label for the data value using a discriminant function, the discriminant function including a plurality of weights based at least in part upon the correntropy loss function.
3. The classifying device of claim 2, wherein the stored instructions further include a weight update module that when executed by the processing unit is configured to update at least one of the plurality of weights based at least in part on the correntropy loss function.
4. The classifying device of claim 3, wherein the at least one weight is updated based at least in part upon a minimized correntropy expected risk based at least in part on a correntropy statistical similarity determined from the predicted label and an actual label based at least in part on the
correntropy loss function.
5. A method comprising:
adjusting a weight of a classifier based at least in part on a change in a correntropy loss function signal; and
classifying a data value using the classifier.
6. The method of claim 5, wherein classifying the data value comprises predicting a label for the data value using a discriminant function, the discriminant function including the adjusted weight.
7. The method of claim 6, further comprising providing the label for rendering on a display device.
8. The method of claim 5, wherein a plurality of weights are updated based at least in part on a change in a correntropy loss function signal.
9. The method of claim 5, wherein the change in the correntropy loss function signal is based upon a statistical similarity between the predicted label and an actual label, the statistical similarity based at least in part on a correntropy loss function.
10. A method comprising:
classifying a data value by predicting a label for the data value using a discriminant function;
determining a correntropy statistical similarity between the predicted label and an actual label based at least in part on a correntropy loss function; and
minimizing an expected risk associated with the predicted label based at least in part on the correntropy statistical similarity.
11. The method of claim 10, wherein the discriminant function includes at least one weight, and the method further comprising updating at least one weight based at least in part on the minimized correntropy expected risk.
12. The method of claim 10, wherein determining the correntropy statistical similarity between the predicted label and the actual label is also based at least in part on a square loss function.
13. The method of claim 12, wherein the determined correntropy statistical similarity is based at least in part on a weighted combination of a correntropy loss function and a square loss function.
14. The method of claim 10, further comprising obtaining the data value.
15. A system comprising:
a classifier configured to predict a label for the data value using a discriminant function, the discriminant function including a plurality of weights; a loss function module configured to determine a correntropy statistical similarity between the predicted label and an actual label based at least in part on a correntropy loss function;
a gradient-based module configured to calculate a change in the correntropy statistical similarity; and
a weight update module configured to update the plurality of weights based at least in part on the correntropy statistical similarity and the change in the correntropy statistical similarity.
16. The system of claim 15, wherein the gradient-based module is a first, second or mixed order gradient-based module.
17. The system of claim 15, wherein the gradient-based module is a second order gradient-based module configured to calculate a change in the correntropy statistical similarity using a Newtonian method.
18. The system of claim 15, wherein the gradient-based module is a mixed order gradient-based module configured to calculate a change in the correntropy statistical similarity using a Hessian approximation method.
19. The system of claim 15, wherein the gradient-based module is a conjugate gradient-based module.
PCT/US2011/044932 2010-07-22 2011-07-22 Classification using correntropy WO2012012680A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/811,317 US9269050B2 (en) 2010-07-22 2011-07-22 Classification using correntropy
CA2806053A CA2806053A1 (en) 2010-07-22 2011-07-22 Classification using correntropy

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US36666210P 2010-07-22 2010-07-22
US61/366,662 2010-07-22

Publications (2)

Publication Number Publication Date
WO2012012680A2 true WO2012012680A2 (en) 2012-01-26
WO2012012680A3 WO2012012680A3 (en) 2012-04-12

Family

ID=45497472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/044932 WO2012012680A2 (en) 2010-07-22 2011-07-22 Classification using correntropy

Country Status (3)

Country Link
US (1) US9269050B2 (en)
CA (1) CA2806053A1 (en)
WO (1) WO2012012680A2 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424652B2 (en) 2011-06-30 2016-08-23 University Of Florida Research Foundation, Inc. Adaptive background estimation
WO2014097598A1 (en) * 2012-12-17 2014-06-26 日本電気株式会社 Information processing device which carries out risk analysis and risk analysis method
US11537930B2 (en) * 2013-03-04 2022-12-27 Nec Corporation Information processing device, information processing method, and program
US10127475B1 (en) * 2013-05-31 2018-11-13 Google Llc Classifying images
US9842390B2 (en) * 2015-02-06 2017-12-12 International Business Machines Corporation Automatic ground truth generation for medical image collections
CN106548210B (en) 2016-10-31 2021-02-05 腾讯科技(深圳)有限公司 Credit user classification method and device based on machine learning model training
US10832308B2 (en) * 2017-04-17 2020-11-10 International Business Machines Corporation Interpretable rule generation using loss preserving transformation
US11842280B2 (en) * 2017-05-05 2023-12-12 Nvidia Corporation Loss-scaling for deep neural network training with reduced precision
US20210082577A1 (en) * 2017-05-15 2021-03-18 Koninklijke Philips N.V. System and method for providing user-customized prediction models and health-related predictions based thereon
WO2019169155A1 (en) * 2018-02-28 2019-09-06 Carnegie Mellon University Convex feature normalization for face recognition
CN109657693B (en) * 2018-10-22 2023-08-01 中国科学院软件研究所 Classification method based on correntropy and transfer learning
CN110929563A (en) * 2019-10-14 2020-03-27 广东浩迪创新科技有限公司 Electrical appliance identification method and device
CN111091911A (en) * 2019-12-30 2020-05-01 重庆同仁至诚智慧医疗科技股份有限公司 System and method for screening stroke risk
WO2023086585A1 (en) * 2021-11-12 2023-05-19 Covera Health Re-weighted self-influence for labeling noise removal in medical imaging data
CN116128047B (en) * 2022-12-08 2023-11-14 西南民族大学 Migration learning method based on countermeasure network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040236611A1 (en) * 2003-04-30 2004-11-25 Ge Financial Assurance Holdings, Inc. System and process for a neural network classification for insurance underwriting suitable for use by an automated system
US20050015251A1 (en) * 2001-05-08 2005-01-20 Xiaobo Pi High-order entropy error functions for neural classifiers

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007027839A2 (en) * 2005-09-01 2007-03-08 University Of Florida Research Foundation, Inc. Device and methods for enhanced matched filtering based on correntropy
WO2007053831A2 (en) * 2005-10-31 2007-05-10 University Of Florida Research Foundation, Inc. Optimum nonlinear correntropy filter
WO2008133679A1 (en) 2007-04-26 2008-11-06 University Of Florida Research Foundation, Inc. Robust signal detection using correntropy
WO2011100491A2 (en) 2010-02-12 2011-08-18 University Of Florida Research Foundation Inc. Adaptive systems using correntropy
US9424652B2 (en) 2011-06-30 2016-08-23 University Of Florida Research Foundation, Inc. Adaptive background estimation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050015251A1 (en) * 2001-05-08 2005-01-20 Xiaobo Pi High-order entropy error functions for neural classifiers
US20040236611A1 (en) * 2003-04-30 2004-11-25 Ge Financial Assurance Holdings, Inc. System and process for a neural network classification for insurance underwriting suitable for use by an automated system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABHISHEK SINGH ET AL.: 'A Loss Function for Classification Based on a Robust Similarity Metric' NEURAL NETWORKS (IJCNN), THE 2010 INTERNATIONAL JOINT CONFERENCE ON. 18 July 2010 - 23 July 2010, pages 1 - 6 *
ABHISHEK SINGH ET AL.: 'Using Correntropy as a Cost Function in Linear Adaptive Filters' NEURAL NETWORKS, 2009. IJCNN 2009. INTERNATIONAL JOINT CONFERENCE ON. 14 June 2009 - 19 June 2009, pages 2950 - 2955 *

Also Published As

Publication number Publication date
CA2806053A1 (en) 2012-01-26
US20130132315A1 (en) 2013-05-23
WO2012012680A3 (en) 2012-04-12
US9269050B2 (en) 2016-02-23

Similar Documents

Publication Publication Date Title
US9269050B2 (en) Classification using correntropy
Memon et al. Generalised fuzzy c‐means clustering algorithm with local information
Wang et al. Uncertainty estimation for stereo matching based on evidential deep learning
Adhikari et al. Conditional spatial fuzzy C-means clustering algorithm for segmentation of MRI images
Yin et al. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks
dos Santos Coelho et al. A RBF neural network model with GARCH errors: application to electricity price forecasting
Frank et al. Locally weighted naive bayes
Singh et al. A loss function for classification based on a robust similarity metric
Sannasi Chakravarthy et al. Detection and classification of microcalcification from digital mammograms with firefly algorithm, extreme learning machine and non‐linear regression models: A comparison
Eguchi et al. A class of logistic‐type discriminant functions
Pavlidis et al. λ‐Perceptron: An adaptive classifier for data streams
Felizardo et al. Comparative study of bitcoin price prediction using wavenets, recurrent neural networks and other machine learning methods
Menezes et al. Width optimization of RBF kernels for binary classification of support vector machines: A density estimation-based approach
CN110532921 (en) Generalized label multi-Bernoulli video multi-target tracking based on SSD detection
Abdelhamid et al. Innovative feature selection method based on hybrid sine cosine and dipper throated optimization algorithms
Broumand et al. Discrete optimal Bayesian classification with error-conditioned sequential sampling
Lee et al. Localization uncertainty estimation for anchor-free object detection
Yang et al. Explainable uncertainty quantifications for deep learning-based molecular property prediction
Villmann et al. Self-adjusting reject options in prototype based classification
Niaf et al. Handling uncertainties in SVM classification
Wu et al. SMOTE-Boost-based sparse Bayesian model for flood prediction
Liu et al. Act: Semi-supervised domain-adaptive medical image segmentation with asymmetric co-training
Zhao et al. Broad learning approach to Surrogate-Assisted Multi-Objective evolutionary fuzzy clustering algorithm based on reference points for color image segmentation
Zhao et al. Cost-sensitive online classification with adaptive regularization and its applications
Kong et al. Image segmentation using a hierarchical student's‐t mixture model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11810435

Country of ref document: EP

Kind code of ref document: A2

ENP Entry into the national phase

Ref document number: 2806053

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 13811317

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11810435

Country of ref document: EP

Kind code of ref document: A2