1 Introduction
Similarity-based clustering and semi-supervised learning methods segment the data based on a similarity measure between the data points. Among similarity-based data clustering methods, spectral clustering (Ng et al., 2001) identifies clusters of complex shapes lying on low-dimensional manifolds via the normalized graph Laplacian of a data similarity matrix. The pairwise clustering method (Shental et al., 2003) uses a message-passing algorithm to infer the cluster labels in a pairwise undirected graphical model whose pairwise potential functions are constructed from a similarity matrix. K-means (Hartigan and Wong, 1979) searches for data clusters at a local minimum of the sum of within-cluster dissimilarities. The representative similarity-based semi-supervised learning algorithm is label propagation (Zhu et al., 2003; Zhou et al., 2003), which is effective and widely used. With a predefined similarity graph, it determines the labels of unlabeled data by minimizing an objective function over the similarity graph, defined as the sum of products of pairwise similarities and squared label differences. Label propagation therefore encourages local smoothness of the labels according to the edge weights of the similarity graph. The typical label propagation algorithm (Zhu et al., 2003) yields a harmonic solution that can also be interpreted via random walks from the unlabeled data to the labeled data.
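To make the harmonic-solution view concrete, the following is a minimal sketch of label propagation in the style of Zhu et al. (2003); the two-cluster toy data, the Gaussian similarity, and all variable names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def harmonic_label_propagation(W, Y_l):
    """Harmonic solution of Zhu et al. (2003): the labels of the unlabeled
    points minimize sum_ij W_ij (f_i - f_j)^2 subject to the given labels.
    The first Y_l.shape[0] rows of W are assumed to be the labeled points."""
    n_l = Y_l.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                       # unnormalized graph Laplacian
    L_uu = L[n_l:, n_l:]            # unlabeled-unlabeled block
    L_ul = L[n_l:, :n_l]            # unlabeled-labeled block
    # Setting the gradient on the unlabeled block to zero gives
    # L_uu F_u + L_ul Y_l = 0, hence the closed-form harmonic solution:
    return np.linalg.solve(L_uu, -L_ul @ Y_l)

# Toy data on a line: labeled points first (x=0 in class 0, x=5 in class 1),
# then one unlabeled point near each labeled one.
X = np.array([0.0, 5.0, 0.1, 5.1])
W = np.exp(-np.subtract.outer(X, X) ** 2)
Y_l = np.array([[1.0, 0.0], [0.0, 1.0]])
F_u = harmonic_label_propagation(W, Y_l)
labels_u = F_u.argmax(axis=1)       # predicted classes of the unlabeled points
```

By the maximum principle for harmonic functions, the entries of the solution stay within the range of the boundary (given) label values.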
The success of similarity-based learning methods depends heavily on the underlying pairwise similarity over the data, which in most cases is constructed empirically, e.g. by a Gaussian kernel or a K-Nearest-Neighbor (KNN) graph. In this paper, we present a discriminative similarity learning framework for clustering and semi-supervised learning wherein the discriminative similarity is derived from the generalization error bound for a kernel classifier learned from hypothetical labeling. When the popular Support Vector Machines (SVMs) are used in this framework, unsupervised SVMs (Xu et al., 2004) are deduced for the unsupervised case, and Semi-Supervised or Transductive SVMs (Vapnik, 1998; Joachims, 1999; Chapelle et al., 2008) are deduced for the semi-supervised case. A kernel classifier motivated by similarity learning (Balcan et al., 2008; Cortes et al., 2013) is used in our framework. By generalization analysis via Rademacher complexity, the generalization error bound for the kernel classifier learned from hypothetical labeling is expressed as the sum of pairwise similarities between data from different classes. Such pairwise similarity, parameterized by the weights of the learned kernel classifier, serves as the discriminative similarity induced by this generalization bound for clustering and semi-supervised learning. Although similarity is often used to quantify the local affinity between data points, the term "discriminative similarity" here refers to a learned similarity that improves the discriminative capability of some classification method, such as the aforementioned kernel classifier. Moreover, we prove that a discriminative similarity of the same form can also be induced by the error bound on the integrated squared error of kernel density classification. Our discriminative similarity learning framework is related to a class of discriminative clustering methods which classify unlabeled data by various measures on discriminative unsupervised classifiers, including the generalization error (Xu et al., 2004) and the entropy of the posterior distribution of the label (Gomes et al., 2010). Discriminative clustering methods (Xu et al., 2004) predict the labels of unlabeled data by minimizing the generalization error bound for the unsupervised classifier. Xu et al. (2004) propose unsupervised SVMs, which learn a binary classifier partitioning the unlabeled data with maximum margin between different clusters. The theoretical properties of unsupervised SVMs are analyzed in (Karnin et al., 2012). Gomes et al. (2010) learn a kernel logistic regression classifier regularized by the entropy of the posterior distribution of the class label.
The paper is organized as follows. We introduce the formulation of the discriminative similarity learning framework in Section 2, then derive the generalization error bound for the kernel classifier learned from hypothetical labeling in Section 3, where the discriminative similarity is induced by the error bound. Section 4 shows that a discriminative similarity can also be induced by kernel density classification, and Section 5 extends the analysis to classifiers based on general similarity functions that are not necessarily PSD kernels. The application of the discriminative similarity learning framework to data clustering and semi-supervised learning is presented in Section 6, and we conclude the paper in Section 7. Throughout this paper, the term kernel stands for a PSD kernel unless otherwise noted.
2 Discriminative Similarity Framework
2.1 Discriminative Similarity Framework for Clustering
The discriminative clustering literature (Xu et al., 2004; Gomes et al., 2010) has demonstrated the potential of multi-class classification for the clustering problem. Inspired by the natural connection between clustering and classification, we propose a framework that learns discriminative similarity for clustering by unsupervised classification, modeling the clustering problem as a multi-class classification problem: a classifier is learned from training data built by a hypothetical labeling, which can be any possible cluster labeling. The optimal hypothetical labeling is taken to be the one whose associated classifier has the minimum generalization error bound. To study the generalization bound for the classifier learned from hypothetical labeling, the concept of a classification model is needed. Given unlabeled data, a classification model is constructed for any hypothetical labeling as below:
Definition 1
The classification model corresponding to a hypothetical labeling for clustering consists of the data labeled by the hypothetical labeling and a classifier learned from them. The labeled data are assumed to be i.i.d. samples drawn from a joint distribution over the data and its class label, and the family of all such distributions is considered. The classifier is learned using the labeled data as training data, and the generalization error of the classification model is defined as the generalization error of this classifier. The optimal hypothetical labeling minimizes the generalization error bound for the classification model. Different discriminative clustering models can be derived with different choices of the classifier. When SVMs are used in the above discriminative model, unsupervised SVMs (Xu et al., 2004) are obtained. When two nonparametric classifiers, the nearest neighbor classifier and the plug-in classifier, are used in this framework, the unsupervised classification method in (Yang et al., 2014) is recovered.
2.2 Discriminative Similarity Framework for SemiSupervised Learning
Similar to the clustering case, the discriminative similarity framework for semi-supervised learning also searches for the optimal hypothetical labeling such that the associated classifier has the minimum generalization error bound.
Suppose the data are comprised of a labeled set and an unlabeled set, where the first points carry given labels from a fixed number of classes and the labels of the remaining points are to be determined. Given any hypothetical labeling of the unlabeled data, the classification model for semi-supervised learning is defined as below:
Definition 2
The classification model corresponding to a hypothetical labeling for semi-supervised learning consists of the data labeled by the given labels together with the hypothetical labeling, and a classifier learned from them. The labeled data are assumed to be i.i.d. samples drawn from a joint distribution over the data and its class label, and the family of all such distributions is considered. The classifier is learned using the labeled data as training data, and the generalization error of the classification model is defined as the generalization error of this classifier.
Our framework of learning discriminative similarity for semi-supervised learning searches for the optimal hypothetical labeling that minimizes the generalization error bound for the classification model defined above. Different specific discriminative semi-supervised learning models can be derived with different choices of the classifier. For example, when SVMs are used, a model with the same optimization problem as the well-known Semi-Supervised or Transductive SVMs (Vapnik, 1998; Joachims, 1999; Chapelle et al., 2008) is obtained (note that Transductive SVMs, despite the name, also learn an inductive rule).
2.3 Discriminative Similarity induced by Kernel Classifier
We employ a kernel classifier in the discriminative similarity learning framework of Section 2.1 and Section 2.2, so as to induce a discriminative similarity parameterized by the learnable weights of the kernel classifier. The kernel classifier is designed based on similarity learning methods. Balcan et al. (2008) propose a classification method using general similarity functions: the classification rule measures the similarity of the test datum to each class and assigns it to the class that maximizes the weighted average similarity between the test datum and the training data belonging to that class. In (Cortes et al., 2013), a kernel function is used as the similarity function, and the generalization error of a kernel classifier is derived using a properly defined kernel margin, where the classifier uses the average instead of the weighted average when computing point-to-class similarity. Inspired by these similarity learning methods, we propose a hypothesis that measures the similarity of a datum to a class using a kernel as the similarity function:
(1) 
where the kernel is the isotropic Gaussian kernel (with the constant that makes it integrate to one omitted) with a bandwidth parameter, and the weights are nonnegative and sum to a fixed total. Instead of the expectation-based similarity in (Balcan et al., 2008), the hypothesis is a finite-sample-based similarity between a datum and a class, leading to a tractable optimization problem, as shown in the next section. The similarity-based kernel classifier predicts the label of a datum as the one for which the point-to-class similarity is maximized.
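As a concrete illustration, the following sketch implements the weighted point-to-class similarity rule described above; the toy data, the uniform weights, and the function names are assumptions for illustration, and the normalizing constant of the Gaussian kernel is omitted as in the text.

```python
import numpy as np

def point_to_class_scores(x, X, y, alpha, sigma, n_classes):
    """Weighted point-to-class similarity: for each class c, sum the
    Gaussian kernel similarities K_sigma(x, x_i) over training points of
    class c, weighted by the nonnegative weights alpha_i."""
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
    return np.array([np.sum(alpha[y == c] * k[y == c])
                     for c in range(n_classes)])

def predict(x, X, y, alpha, sigma, n_classes):
    # The classifier assigns x to the class of maximal point-to-class similarity.
    return int(np.argmax(point_to_class_scores(x, X, y, alpha, sigma, n_classes)))

# Toy data: two well-separated classes, uniform weights summing to one.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y = np.array([0, 0, 1, 1])
alpha = np.full(len(X), 1.0 / len(X))
```

With nonuniform weights, individual training points can contribute more or less to the class scores, which is what the learnable weights of the kernel classifier exploit.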
The generalization error bound for the kernel classifier is derived in the following section. Note that we only need to derive the error bound for kernel classifiers in the discriminative similarity framework for clustering: the error bound for clustering also applies to semi-supervised learning, since fixing some of the labels does not affect the derivation of the generalization bound.
3 Discriminative Similarity from Generalization Bound for Kernel Classifier
In this section, the generalization error bound for the classification model in Definition 1 or Definition 2 with the kernel classifier is derived as a sum of discriminative similarities between data from different classes.
To analyze the generalization bound for the kernel classifier, the following notation is introduced. The weights of the classifier are nonnegative and sum to a fixed total, and for each class a column vector collects the weights belonging to that class, with zeros in the positions of data from other classes. The margin of a labeled sample is the gap between its point-to-class similarity for its own class and the largest point-to-class similarity for any other class; the sample is classified correctly if its margin is positive. We then derive the generalization error bound using the Rademacher complexity of the function class comprised of all possible margin functions. The Rademacher complexity (Bartlett and Mendelson, 2003; Koltchinskii, 2001) of a function class is defined below:
Definition 3
Let $\sigma_1, \ldots, \sigma_n$ be i.i.d. random variables such that $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$. The Rademacher complexity of a function class $\mathcal{F}$ is defined as

(2) $\mathcal{R}_n(\mathcal{F}) = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(\mathbf{x}_i)\Big]$
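For a finite function class, the expectation in Definition 3 can be estimated numerically by Monte Carlo over the Rademacher signs; the sketch below is illustrative, uses the $(1/n)\sup$ formulation, and all names are assumptions.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=4000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ] for a finite class.
    F_values is an (m, n) array: m functions evaluated at n sample points."""
    rng = np.random.default_rng(seed)
    _, n = F_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        total += np.max(F_values @ sigma) / n     # sup over the finite class
    return total / n_draws

# A class closed under negation has strictly positive complexity,
# while a singleton class has complexity close to zero.
f = np.ones(10)
r_pair = empirical_rademacher(np.vstack([f, -f]))
r_single = empirical_rademacher(f[None, :])
```

Richer classes (here, the pair {f, -f} versus the singleton {f}) can correlate better with random signs, which is exactly what the complexity measures.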
Let the Gram matrix of the data be computed by the kernel. Based on the generalization analysis of Koltchinskii and Panchenko (2002), Theorem 1 presents the generalization error bound for the unsupervised kernel classifier in terms of the empirical error and the Rademacher complexity of the margin function class. Inspired by the regularization used in multi-class kernel-based vector machines (Crammer and Singer, 2001), we propose a regularization term on the classifier weights that is required to be bounded by a positive constant. Denote the space of all hypotheses associated with a given label as
(3) 
and define the hypothesis space as the collection of these spaces over all labels. Lemma 1 shows that the Rademacher complexity of the properly defined function class is bounded by the regularization term with high probability:
Lemma 1
When the kernel bandwidth satisfies a mild condition involving a positive constant, then with high probability over the data, the Rademacher complexity of the function space satisfies
(4) 
With the bounded Rademacher complexity of the function class, Theorem 1 presents the generalization error bound for the kernel classifier in the discriminative similarity learning framework.
Theorem 1
(Error of the Kernel Classifier) Given the classification model in Definition 1 or Definition 2, with high probability over the labeled data with respect to any distribution in the family, the generalization error of the kernel classifier satisfies
(5) 
where the bound involves the empirical error of the kernel classifier on the labeled data, a constant, and a quantity defined as
(6) 
Moreover, under a mild condition on the margin parameter, the empirical error is
(7) 
With the error bound in Theorem 1, searching for the optimal hypothetical labeling amounts to solving the following optimization problem
(8) 
where the feasible set constrains the weights of the kernel classifier and a weighting parameter controls the regularization term. Similar to the optimization problem of SVMs (Vapnik, 1998; Joachims, 1999), this balancing parameter trades off the empirical error against the regularization term. Substituting the empirical error into (8),
(9) 
where
(10) 
where the bandwidth is tuned appropriately. The first term of the objective function (9) is the sum of similarities between data from different classes, with each pairwise term being the discriminative similarity between two points induced by the generalization error bound (5). In the next section, it is shown that a discriminative similarity of the same form as that induced by our generalization analysis, namely (10), can also be induced from the perspective of kernel density classification with kernel density estimators that have nonuniform weights. This supports the theoretical justification for the induced discriminative similarity (10) in this section.
4 A Kernel Density Classification Perspective
The discriminative similarity can also be induced from kernel density classification with varying weights on the data; binary classification is considered in this section. For any classification model with hypothetical labeling, suppose the joint distribution over the data and labels has a probability density function, and let the induced marginal distribution over the data have a probability density function as well. Robust kernel density estimation methods (Girolami and He, 2003; Kim and Scott, 2008; Mahapatruni and Gray, 2011; Kim and Scott, 2012) suggest the following kernel density estimator, in which the kernel contributions of different data points are reflected by different nonnegative weights that sum to one:
(11) 
Based on (11), it is straightforward to obtain the following kernel density estimator of the class-conditional density function:
(12) 
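A minimal sketch of the weighted kernel density estimator of (11)-(12) follows, assuming the estimator has the form of a weighted sum of Gaussian kernels with nonnegative weights summing to one; the toy data and names are illustrative.

```python
import numpy as np

def weighted_kde(x, X, w, sigma):
    """Weighted Gaussian KDE: f_hat(x) = sum_i w_i * N(x; x_i, sigma^2 I),
    with nonnegative weights w summing to one."""
    d = X.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    sq = np.sum((X - x) ** 2, axis=1)
    return float(np.sum(w * np.exp(-sq / (2.0 * sigma ** 2))) / norm)

# Nonuniform weights concentrate the estimate near the first two points.
X = np.array([[0.0], [0.1], [5.0]])
w = np.array([0.45, 0.45, 0.10])
dens_near = weighted_kde(np.array([0.05]), X, w, sigma=0.5)
dens_far = weighted_kde(np.array([2.5]), X, w, sigma=0.5)
```

Since the weights sum to one and each Gaussian integrates to one, the estimate itself integrates to one, so it is a valid density.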
The kernel density classifier is learned from the labeled data and constructed from the kernel density estimators (12). It resembles the Bayes classifier: it classifies a test datum based on the conditional label distribution, or equivalently, assigns the test datum to one class if the decision function is positive and to the other class otherwise. Intuitively, it is preferred that the decision function be close to the true Bayes decision function. Girolami and He (2003) and Kim and Scott (2008) propose using the Integrated Squared Error (ISE) as the metric measuring the distance between the kernel density estimators and their true counterparts, and an oracle inequality is obtained in (Kim and Scott, 2008) relating the performance of the classifier to the best possible performance of a kernel density classifier in the same category. ISE is adopted in our analysis of kernel density classification, and the ISE between the decision function and the true Bayes decision function is defined as
(13) 
An upper bound on this ISE also induces a discriminative similarity between data from different classes, as presented in the following theorem.
Theorem 2
With high probability over the labeled data, the ISE between the decision function and the true Bayes decision function satisfies
(14) 
where
(15) 
(16) 
and the Gram matrix appearing above is evaluated on the data with the kernel.
With a weighting parameter balancing its two terms, a cost function can be designed according to the empirical term and the regularization term in the ISE error bound (14). Its first term is comprised of the sum of similarities between data from different classes, and this pairwise similarity is the discriminative similarity induced by the ISE bound for kernel density classification. Note that it has the same form as the discriminative similarity (10) induced by the kernel classifier in our discriminative similarity learning framework, up to a scaling constant and the choice of the balancing parameter.
5 Extension to General SimilarityBased Classifier
We now consider using a general symmetric and continuous function, not necessarily a PSD kernel, as the similarity function in the classification model in Definition 1 or Definition 2. The hypothesis is defined as before, with the general similarity function taking the place of the kernel. In order to analyze the generalization property of the classification rule using the general similarity function, we first investigate the properties of the general similarity function and its relationship to PSD kernels in terms of the eigenvalues and eigenfunctions of the associated integral operator. The integral operator is well defined, and it can be verified to be a compact operator since the similarity function is continuous. According to the spectral theorem in operator theory, there exists an orthogonal basis of the space of square Lebesgue-integrable measurable functions consisting of eigenfunctions of the integral operator. The following lemma shows that, under certain assumptions on the eigenvalues and eigenfunctions, a general symmetric and continuous similarity can be decomposed into the difference of two PSD kernels.
Lemma 2
Suppose the similarity function is symmetric and continuous, with the eigenvalues and eigenfunctions of its integral operator as above, and suppose a suitable summability condition on the eigenvalues and eigenfunctions holds. Then the similarity function can be decomposed as the difference between two positive semidefinite kernels, namely
(17) 
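A finite-sample analogue of the decomposition in Lemma 2 can be computed by splitting the eigendecomposition of a symmetric similarity matrix by eigenvalue sign; this matrix-level sketch is illustrative and does not reproduce the operator-level condition of the lemma, and all names are assumptions.

```python
import numpy as np

def psd_split(S):
    """Split a symmetric (possibly indefinite) similarity matrix S into
    S = S_plus - S_minus with both parts positive semidefinite, by the
    sign of the eigenvalues of S."""
    vals, vecs = np.linalg.eigh(S)
    S_plus = (vecs * np.maximum(vals, 0.0)) @ vecs.T
    S_minus = (vecs * np.maximum(-vals, 0.0)) @ vecs.T
    return S_plus, S_minus

# An indefinite symmetric similarity matrix (eigenvalues +1 and -1).
S = np.array([[0.0, 1.0], [1.0, 0.0]])
S_plus, S_minus = psd_split(S)
```

When the input matrix is already PSD, the negative part vanishes and the split degenerates to the matrix itself, mirroring the kernel case.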
Remark 1
Resembling the case where a kernel serves as the similarity function in Section 3, we use a regularization term to bound the Rademacher complexity of the classification rule using the general similarity function. Similar to the analysis of the kernel classifier, the space of all hypotheses associated with a given label is defined as
(18) 
with positive numbers bounding the two components of the decomposition respectively. The margin function and the hypothesis space are defined as in the kernel case, and the general similarity-based classifier predicts the label of a datum by maximizing the point-to-class similarity. We then present the main results in this section, which bound the Rademacher complexity of the hypothesis space and the generalization error of the unsupervised general similarity-based classifier.
Lemma 3
Suppose the assumptions in Lemma 2 hold. When the bandwidth and the bounding constants satisfy mild conditions, with high probability over the data, the Rademacher complexity of the function class satisfies
(19) 
Theorem 3
(Error of the General Similarity-Based Classifier) Given the classification model in Definition 1 or Definition 2, under the assumptions of Lemma 2 and Lemma 3, with high probability over the labeled data with respect to any distribution in the family, the generalization error of the general similarity-based classifier satisfies
(20) 
where the bound involves the empirical error of the classifier on the labeled data and a constant defined in (6). Moreover, under a mild condition on the margin parameter, the empirical error is
(21) 
Remark 2
Remark 3
When the decomposition exists with both components being PSD kernels, the similarity function is the kernel of some Reproducing Kernel Kreĭn Space (RKKS) (Mary, 2003). Ong et al. (2004) and Loosli et al. (2016) analyze the problem of learning SVM-style classifiers with indefinite kernels from the Kreĭn space. However, their work shows neither when and how an indefinite, general similarity function admits a PSD decomposition, nor a generalization analysis for the similarity-based classifier using such a general indefinite function as the similarity measure. Our analysis addresses these problems in Lemma 2 and Theorem 3.
Minimizing the above bound for a general continuous similarity leads to the following formulation:
(22) 
where a weighting parameter controls the regularization term, and
(23) 
is the discriminative similarity between data from different classes induced by the generalization error bound for the general similarity-based classifier. When the similarity function is in fact a kernel, the negative part of the decomposition vanishes and this similarity reduces to the similarity in (10) induced by the kernel classifier.
Remark 4 (Similarity Machines: SVMType Classifier with General Similarity Function)
Traditional Kernel Support Vector Machines (Kernel SVMs), one of the most representative kernel machine methods, map the data into an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) associated with the chosen kernel and learn a max-margin linear classifier in the RKHS. The mapping is induced by a Mercer kernel in most cases, which is symmetric, continuous, and positive semidefinite. The past several decades have witnessed the great success of SVMs, built on the solid theoretical foundation of statistical learning, across a vast range of machine learning and pattern recognition applications. However, the requirement of positive semidefiniteness for a Mercer kernel substantially restricts the applicability of SVMs for learning a max-margin classifier with a general similarity function that is not necessarily a Mercer kernel. Based on Lemma 2, we can propose Similarity Machines as a generalization of Kernel SVMs: a framework for learning a maximum-margin classifier with a general similarity function that is symmetric but not necessarily positive semidefinite. Similarity Machines have a generalization error bound that reduces to the canonical error bound for Kernel SVMs when the similarity function is in fact a PSD kernel. The parameters of Similarity Machines can be obtained by minimizing an objective function based on this error bound. More details about the generalization analysis of Similarity Machines are included in the appendix of this paper.
6 Applications
In this section, we present new clustering and semi-supervised learning methods using the discriminative similarity induced by the kernel classifier.
6.1 Application to Data Clustering
We propose a novel data clustering method named Clustering by Discriminative Similarity via Kernel classification (CDSK), based on our discriminative similarity learning framework with the kernel classifier. CDSK aims to minimize (9). However, problem (9) involves minimization with respect to discrete cluster labels, which is NP-hard. In addition, it can produce a trivial solution that puts all the data in a single cluster due to the lack of constraints on cluster balance. Therefore, (9) is relaxed into the following optimization problem for CDSK:
(24) 
where the graph Laplacian is computed from the learned discriminative similarity matrix, with the degree matrix being diagonal and each diagonal element equal to the sum of the corresponding row of the similarity matrix. Note that the orthogonality constraint is satisfied when each column of the relaxed cluster indicator matrix is a normalized binary membership indicator vector for the corresponding cluster. Similar to spectral clustering (Ng et al., 2001), this constraint prevents imbalanced data clusters.
Problem (24) is optimized by coordinate descent. In each iteration, the optimization with respect to the cluster indicator matrix is performed with the weights fixed, which is exactly the spectral clustering problem whose solution is formed by the eigenvectors of the normalized graph Laplacian corresponding to the smallest eigenvalues; then the optimization with respect to the weights is performed with the cluster indicator matrix fixed, which is a standard constrained quadratic programming problem. The coordinate descent iterations proceed until convergence or a maximum iteration number is reached.
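The spectral subproblem of each CDSK iteration (optimizing the relaxed cluster indicators with the weights fixed) can be sketched as below; the block-structured toy similarity matrix and the function names are assumptions, and the quadratic-programming step for the weights is omitted.

```python
import numpy as np

def spectral_embedding(S, k):
    """With the similarity matrix S fixed, minimize tr(F^T L_sym F) subject
    to F^T F = I: the solution is given by the k eigenvectors of the
    normalized graph Laplacian with the smallest eigenvalues."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(S)) - (d_inv_sqrt[:, None] * S) * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return vecs[:, :k]

# Block-diagonal similarity with two clusters: rows of F coincide within
# each cluster, so simple row clustering recovers the partition.
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
F = spectral_embedding(S, 2)
```

In the full CDSK loop this step would alternate with a constrained quadratic program updating the classifier weights, and hence the similarity matrix itself.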
6.2 Application to SemiSupervised Learning
We also propose a new semi-supervised learning method based on Label Propagation using the Discriminative Similarity induced by Kernel classification (LPDSK). The formulation of LPDSK is
(25) 
where the label matrix has its elements on the labeled data set by the given labels, i.e. an element is one if the corresponding point carries that label and zero otherwise.
Similar to the clustering case, problem (25) is optimized by coordinate descent. In each iteration, the optimization with respect to the label matrix is performed with the weights fixed; with the graph Laplacian partitioned into blocks corresponding to the labeled and unlabeled data, it can be verified that this subproblem admits a closed-form solution whenever the block on the unlabeled data is invertible. The optimization with respect to the weights is the same as in the clustering problem (24). The coordinate descent iterations proceed until convergence or a maximum iteration number is reached.
7 Conclusions
We propose a novel discriminative similarity learning framework wherein the discriminative similarity is induced by the generalization error bound for the classifier learned from hypothetical labeling, and the optimal hypothetical labeling is pursued by minimizing the generalization bound for the associated classifier. A kernel classifier is employed in the proposed framework. The learnable weights in the discriminative similarity induced by the kernel classifier allow for an adaptive similarity that accommodates the local variation of the data. We also analyze the generalization property of similarity-based classifiers with a general continuous similarity function rather than a PSD kernel. We present new clustering and semi-supervised learning methods based on the discriminative similarity learning framework with the kernel classifier, namely Clustering by Discriminative Similarity via Kernel classification (CDSK) and Label Propagation by Discriminative Similarity via Kernel classification (LPDSK).
References
 Balcan et al. [2008] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008. ISSN 0885-6125. doi: 10.1007/s10994-008-5059-5.
 Bartlett and Mendelson [2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, March 2003. ISSN 1532-4435.
 Chapelle et al. [2008] Olivier Chapelle, Vikas Sindhwani, and Sathiya S. Keerthi. Optimization techniques for semi-supervised support vector machines. J. Mach. Learn. Res., 9:203–233, June 2008. ISSN 1532-4435.
 Cortes et al. [2013] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 46–54, 2013.
 Crammer and Singer [2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
 Cucker and Zhou [2007] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, New York, 2007. ISBN 9780521865593.
 Girolami and He [2003] Mark Girolami and Chao He. Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell., 25(10):1253–1264, 2003. doi: 10.1109/TPAMI.2003.1233899.
 Gomes et al. [2010] Ryan Gomes, Andreas Krause, and Pietro Perona. Discriminative clustering by regularized information maximization. In NIPS, pages 775–783, 2010.
 Hartigan and Wong [1979] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
 Joachims [1999] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1558606122.

 Karnin et al. [2012] Zohar Karnin, Edo Liberty, Shachar Lovett, Roy Schwartz, and Omri Weinstein. Unsupervised SVMs: On the complexity of the furthest hyperplane problem. Journal of Machine Learning Research - Proceedings Track, 23:2.1–2.17, 2012.
 Kim and Scott [2008] JooSeuk Kim and Clayton D. Scott. Performance analysis for L2 kernel classification. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 833–840, 2008.
 Kim and Scott [2012] JooSeuk Kim and Clayton D. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, 2012.
 Koltchinskii and Panchenko [2002] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30(1):1–50, 2002. doi: 10.1214/aos/1015362183.
 Koltchinskii [2001] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001. doi: 10.1109/18.930926.
 Koltchinskii and Panchenko [2000] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In Evarist Giné, David M. Mason, and Jon A. Wellner, editors, High Dimensional Probability II, volume 47 of Progress in Probability, pages 443–457. Birkhäuser Boston, 2000. doi: 10.1007/978-1-4612-1358-1_29.
 Loosli et al. [2016] Gaëlle Loosli, Stéphane Canu, and Cheng Soon Ong. Learning SVM in kreĭn spaces. IEEE Trans. Pattern Anal. Mach. Intell., 38(6):1204–1216, 2016. doi: 10.1109/TPAMI.2015.2477830.

Mahapatruni and Gray [2011]
Ravi Sastry Ganti Mahapatruni and Alexander G. Gray.
CAKE: convex adaptive kernel density estimation.
In
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 1113, 2011
, pages 498–506, 2011.  Mary [2003] Xavier Mary. Hilbertian subspaces, subdualities and applications. Ph.D. Dissertation, Institut National des Sciences Appliquees Rouen, 2003.

Ng et al. [2001]
Andrew Y. Ng, Michael I. Jordan, and Yair Weiss.
On spectral clustering: Analysis and an algorithm.
In NIPS, pages 849–856, 2001.  Ong et al. [2004] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with nonpositive kernels. In Machine Learning, Proceedings of the Twentyfirst International Conference (ICML 2004), Banff, Alberta, Canada, July 48, 2004, 2004. doi: 10.1145/1015330.1015443.
 Shental et al. [2003] Noam Shental, Assaf Zomet, Tomer Hertz, and Yair Weiss. Pairwise clustering and graphical models. In NIPS, 2003.
 Vapnik [1998] Vladimir N. Vapnik. Statistical Learning Theory. WileyInterscience, 1998.
 Xu et al. [2004] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In NIPS, 2004.
 Yang et al. [2014] Yingzhen Yang, Feng Liang, Shuicheng Yan, Zhangyang Wang, and Thomas S. Huang. On a theory of nonparametric pairwise similarity for clustering: Connecting clustering to classification. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 145–153, 2014.
 Zhou et al. [2003] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], 2003.
 Zhu et al. [2003] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 912–919, 2003.
8 Appendix
8.1 Proof of Lemma 1
Inspired by Koltchinskii and Panchenko (2002), we first prove that the Rademacher complexity of the function class formed by the maximum of several hypotheses is bounded by twice the sum of the Rademacher complexities of the function classes these hypotheses belong to, i.e.
(26) 
If no confusion arises, the subscript of the expectation operator is omitted in the following text. According to a theorem of Koltchinskii and Panchenko (2002), it can be verified that
Therefore,