Online Learning and Bandits with Queried Hints

Authors: Aditya Bhaskara, Sreenivas Gollapudi, Sungjin Im, Kostas Kollias, and Kamesh Munagala




File

LIPIcs.ITCS.2023.16.pdf
  • Filesize: 0.84 MB
  • 24 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.ITCS.2023.16

Author Details

Aditya Bhaskara
  • School of Computing, University of Utah, Salt Lake City, UT, USA
Sreenivas Gollapudi
  • Google Research, Mountain View, CA, USA
Sungjin Im
  • University of California, Merced, CA, USA
Kostas Kollias
  • Google Research, Mountain View, CA, USA
Kamesh Munagala
  • Computer Science Department, Duke University, Durham, NC, USA

Cite As

Aditya Bhaskara, Sreenivas Gollapudi, Sungjin Im, Kostas Kollias, and Kamesh Munagala. Online Learning and Bandits with Queried Hints. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 16:1-16:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITCS.2023.16

Abstract

We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step the online policy can probe and find out which of a small number (k) of choices has the best reward (or lowest loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with k = 2 suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from O(√(nT)) to O(n² log T), where n is the number of arms and T is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, k = 3 probes suffice to achieve parameter-independent constant regret, O(n²). Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited "advice" via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.
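
To make the probe-then-play interaction concrete, here is a minimal simulation sketch of the model for stochastic MAB as described in the abstract: at each round the policy nominates k = 2 candidate arms, a probe reveals which of the two has the better realized reward, and the policy plays that arm. The candidate-selection rule (top two UCB-style indices) and the Bernoulli instance are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Sketch of the k = 2 probe-then-play model for stochastic MAB.
# Assumptions (not from the paper): Bernoulli arms, top-two-UCB candidate rule.

rng = np.random.default_rng(0)

n, T = 5, 10_000                        # number of arms, horizon length
means = rng.uniform(0.2, 0.8, size=n)   # Bernoulli arm means, unknown to the policy
best_mean = means.max()

counts = np.zeros(n)   # number of plays per arm
sums = np.zeros(n)     # total observed reward per arm
regret = 0.0           # cumulative pseudo-regret vs. the best fixed arm

for t in range(1, T + 1):
    # UCB-style index; arms never played get +inf to force initial exploration.
    safe_counts = np.maximum(counts, 1)
    index = np.where(counts > 0,
                     sums / safe_counts + np.sqrt(2 * np.log(t) / safe_counts),
                     np.inf)

    # Probe the two arms with the highest indices (k = 2).
    i, j = np.argsort(index)[-2:]

    # Each probed arm has one realized (Bernoulli) reward this round.
    r_i = float(rng.random() < means[i])
    r_j = float(rng.random() < means[j])

    # The hint tells us which probed arm is better this round; play that arm.
    play, reward = (i, r_i) if r_i >= r_j else (j, r_j)
    counts[play] += 1
    sums[play] += reward
    regret += best_mean - means[play]

print(f"pseudo-regret after T = {T} rounds: {regret:.1f}")
```

In this sketch the hint is the comparison between the two probed arms' realized rewards; the policy still observes only the reward of the arm it plays, matching the weaker of the two probe models discussed in the abstract.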

Subject Classification

ACM Subject Classification
  • Theory of computation → Online learning algorithms
Keywords
  • Online learning
  • multi-armed bandits
  • regret
