
Robust Wasserstein profile inference and applications to machine learning

Published online by Cambridge University Press: 01 October 2019

Jose Blanchet* (Stanford University)
Yang Kang** (Columbia University)
Karthyek Murthy*** (Singapore University of Technology and Design)

*Postal address: Management Science and Engineering, Stanford University, 475 Via Ortega, Stanford, CA 94305, USA.
**Postal address: Columbia University, 1255 Amsterdam Avenue, Rm 1005, New York, NY 10027, USA.
***Postal address: Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372, Singapore.

Abstract

We show that several machine learning estimators, including the square-root least absolute shrinkage and selection operator (square-root LASSO) and regularized logistic regression, can be represented as solutions to distributionally robust optimization problems. The associated uncertainty regions are based on suitably defined Wasserstein distances. Hence, our representations allow us to view regularization as the result of introducing an artificial adversary that perturbs the empirical distribution to account for out-of-sample effects in loss estimation. In addition, we introduce RWPI (robust Wasserstein profile inference), a novel inference methodology which extends methods inspired by empirical likelihood to the setting of optimal transport costs (of which Wasserstein distances are a particular case). We use RWPI to show how to optimally select the size of the uncertainty regions and, as a consequence, to choose the regularization parameters of these machine learning estimators without cross-validation. Numerical experiments validate our theoretical findings.
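To make the abstract's two claims concrete, the following is a minimal sketch (not the authors' code) of the square-root LASSO case: the worst-case root-mean-squared loss over a Wasserstein ball of radius δ around the empirical distribution reduces to the empirical loss plus an ℓ1 penalty whose level scales as √δ, so choosing the radius fixes the regularization parameter. RWPI prescribes δ as a quantile of the profile function's limiting distribution; the pivotal placeholder quantile below, the synthetic data, and the use of cvxpy are all assumptions made for illustration, not the paper's exact prescription.

```python
# Sketch of the DRO <-> regularization correspondence for square-root LASSO.
# Assumes numpy, scipy, and cvxpy are installed; delta_n below is a stand-in
# of the right order for the RWPI quantile, not the paper's exact constant.
import numpy as np
import cvxpy as cp
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Radius of the Wasserstein uncertainty region: a pivotal n^{-1} placeholder,
# so that lam = sqrt(delta_n) has the familiar Phi^{-1}(1 - alpha/(2p)) / sqrt(n)
# order (standing in for the quantile of the RWP function's limit law).
alpha = 0.05
delta_n = norm.ppf(1 - alpha / (2 * p)) ** 2 / n
lam = np.sqrt(delta_n)

# Square-root LASSO: the regularized form equivalent to minimizing the
# worst-case root-mean-squared loss over the Wasserstein ball of radius delta_n.
beta = cp.Variable(p)
objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + lam * cp.norm1(beta)
cp.Problem(cp.Minimize(objective)).solve()

print(f"lambda = {lam:.4f}")
print("estimate:", np.round(beta.value, 3))
```

Because δ is calibrated as a distributional quantile rather than tuned by cross-validation, the same radius also carries an inference guarantee: the Wasserstein ball of radius δ around the empirical distribution is designed to contain a distribution consistent with the true parameter with probability roughly 1 − α.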

Type
Research Papers
Copyright
© Applied Probability Trust 2019 

Footnotes

The supplementary material for this article can be found at http://doi.org/10.1017/jpr.2019.49

Supplementary material: Blanchet et al. supplementary material (PDF, 366.9 KB).