Online Learning and Bandits with Queried Hints

Authors: Aditya Bhaskara, Sreenivas Gollapudi, Sungjin Im, Kostas Kollias, and Kamesh Munagala




File

LIPIcs.ITCS.2023.16.pdf
  • Filesize: 0.84 MB
  • 24 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.ITCS.2023.16

Author Details

Aditya Bhaskara
  • School of Computing, University of Utah, Salt Lake City, UT, USA
Sreenivas Gollapudi
  • Google Research, Mountain View, CA, USA
Sungjin Im
  • University of California, Merced, CA, USA
Kostas Kollias
  • Google Research, Mountain View, CA, USA
Kamesh Munagala
  • Computer Science Department, Duke University, Durham, NC, USA

Cite As

Aditya Bhaskara, Sreenivas Gollapudi, Sungjin Im, Kostas Kollias, and Kamesh Munagala. Online Learning and Bandits with Queried Hints. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 16:1-16:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITCS.2023.16

Abstract

We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step the online policy can probe and find out which of a small number (k) of choices has the best reward (or lowest loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with k = 2 suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from O(√(nT)) to O(n² log T), where n is the number of arms and T is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, k = 3 probes suffice to achieve parameter-independent constant regret, O(n²). Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited "advice" via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.
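
To make the probe-then-play interaction concrete, here is a minimal simulation sketch of the model for stochastic MAB as described in the abstract: at each round the policy nominates k = 2 candidate arms, a probe reveals which of the two has the better realized reward, and the policy plays that arm. The candidate-selection rule (top two UCB-style indices) and the Bernoulli instance are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Sketch of the k = 2 probe-then-play model for stochastic MAB.
# Assumptions (not from the paper): Bernoulli arms, top-two-UCB candidate rule.

rng = np.random.default_rng(0)

n, T = 5, 10_000                        # number of arms, horizon length
means = rng.uniform(0.2, 0.8, size=n)   # Bernoulli arm means, unknown to the policy
best_mean = means.max()

counts = np.zeros(n)   # number of plays per arm
sums = np.zeros(n)     # total observed reward per arm
regret = 0.0           # cumulative pseudo-regret vs. the best fixed arm

for t in range(1, T + 1):
    # UCB-style index; arms never played get +inf to force initial exploration.
    safe_counts = np.maximum(counts, 1)
    index = np.where(counts > 0,
                     sums / safe_counts + np.sqrt(2 * np.log(t) / safe_counts),
                     np.inf)

    # Probe the two arms with the highest indices (k = 2).
    i, j = np.argsort(index)[-2:]

    # Each probed arm has one realized (Bernoulli) reward this round.
    r_i = float(rng.random() < means[i])
    r_j = float(rng.random() < means[j])

    # The hint tells us which probed arm is better this round; play that arm.
    play, reward = (i, r_i) if r_i >= r_j else (j, r_j)
    counts[play] += 1
    sums[play] += reward
    regret += best_mean - means[play]

print(f"pseudo-regret after T = {T} rounds: {regret:.1f}")
```

In this sketch the hint is the comparison between the two probed arms' realized rewards; the policy still observes only the reward of the arm it plays, matching the weaker of the two probe models discussed in the abstract.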

Subject Classification

ACM Subject Classification
  • Theory of computation → Online learning algorithms
Keywords
  • Online learning
  • multi-armed bandits
  • regret
