
Robust Wasserstein profile inference and applications to machine learning

Published online by Cambridge University Press: 01 October 2019

Jose Blanchet* (Stanford University)
Yang Kang** (Columbia University)
Karthyek Murthy*** (Singapore University of Technology and Design)

*Postal address: Management Science and Engineering, Stanford University, 475 Via Ortega, Stanford, CA 94305, USA.
**Postal address: Columbia University, 1255 Amsterdam Avenue, Rm 1005, New York, NY 10027, USA.
***Postal address: Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372, Singapore.

Abstract

We show that several machine learning estimators, including the square-root least absolute shrinkage and selection operator (square-root LASSO) and regularized logistic regression, can be represented as solutions to distributionally robust optimization problems. The associated uncertainty regions are based on suitably defined Wasserstein distances. Hence, our representations allow us to view regularization as the result of introducing an artificial adversary that perturbs the empirical distribution to account for out-of-sample effects in loss estimation. In addition, we introduce RWPI (robust Wasserstein profile inference), a novel inference methodology which extends methods inspired by empirical likelihood to the setting of optimal transport costs (of which Wasserstein distances are a particular case). We use RWPI to show how to optimally select the size of the uncertainty regions and, as a consequence, to choose the regularization parameters of these machine learning estimators without cross-validation. Numerical experiments validate our theoretical findings.
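To make the abstract's two claims concrete, the following is a minimal sketch (not the authors' code) of the square-root LASSO case: the worst-case root-mean-squared loss over a Wasserstein ball of radius δ around the empirical distribution reduces to the empirical loss plus an ℓ1 penalty whose level scales as √δ, so choosing the radius fixes the regularization parameter. RWPI prescribes δ as a quantile of the profile function's limiting distribution; the pivotal placeholder quantile below, the synthetic data, and the use of cvxpy are all assumptions made for illustration, not the paper's exact prescription.

```python
# Sketch of the DRO <-> regularization correspondence for square-root LASSO.
# Assumes numpy, scipy, and cvxpy are installed; delta_n below is a stand-in
# of the right order for the RWPI quantile, not the paper's exact constant.
import numpy as np
import cvxpy as cp
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Radius of the Wasserstein uncertainty region: a pivotal n^{-1} placeholder,
# so that lam = sqrt(delta_n) has the familiar Phi^{-1}(1 - alpha/(2p)) / sqrt(n)
# order (standing in for the quantile of the RWP function's limit law).
alpha = 0.05
delta_n = norm.ppf(1 - alpha / (2 * p)) ** 2 / n
lam = np.sqrt(delta_n)

# Square-root LASSO: the regularized form equivalent to minimizing the
# worst-case root-mean-squared loss over the Wasserstein ball of radius delta_n.
beta = cp.Variable(p)
objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + lam * cp.norm1(beta)
cp.Problem(cp.Minimize(objective)).solve()

print(f"lambda = {lam:.4f}")
print("estimate:", np.round(beta.value, 3))
```

Because δ is calibrated as a distributional quantile rather than tuned by cross-validation, the same radius also carries an inference guarantee: the Wasserstein ball of radius δ around the empirical distribution is designed to contain a distribution consistent with the true parameter with probability roughly 1 − α.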

Type
Research Papers
Copyright
© Applied Probability Trust 2019 

Footnotes

The supplementary material for this article can be found at http://doi.org/10.1017/jpr.2019.49

Supplementary material: Blanchet et al. supplementary material (PDF, 366.9 KB).