Hardness of Learning Boolean Functions from Label Proportions

Authors: Venkatesan Guruswami, Rishi Saket



File

LIPIcs.FSTTCS.2023.37.pdf
  • Filesize: 0.74 MB
  • 15 pages

Author Details

Venkatesan Guruswami
  • Department of EECS and Simons Institute for the Theory of Computing, University of California, Berkeley, CA, USA
Rishi Saket
  • Google Research India, Bangalore, India

Cite As

Venkatesan Guruswami and Rishi Saket. Hardness of Learning Boolean Functions from Label Proportions. In 43rd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 284, pp. 37:1-37:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.FSTTCS.2023.37

Abstract

In recent years the framework of learning from label proportions (LLP) has been gaining importance in machine learning. In this setting, the training examples are aggregated into subsets or bags and only the average label per bag is available for learning an example-level predictor. This generalizes traditional PAC learning, which is the special case of unit-sized bags. The computational learning aspects of LLP were studied in recent works [R. Saket, 2021; R. Saket, 2022], which gave algorithms and hardness results for learning halfspaces in the LLP setting. In this work we focus on the intractability of LLP learning of Boolean functions. Our first result shows that given a collection of bags of size at most 2 which are consistent with an OR function, it is NP-hard to find a CNF of constantly many clauses which satisfies any constant fraction of the bags. This is in contrast with the work of [R. Saket, 2021], which gave a (2/5)-approximation for learning ORs using a halfspace. Thus, our result provides a separation between constant-clause CNFs and halfspaces as hypotheses for LLP learning of ORs.
Next, we prove the hardness of satisfying more than a 1/2 + o(1) fraction of such bags using a t-DNF (i.e. a DNF where each term has ≤ t literals) for any constant t. In usual PAC learning such hardness was known [S. Khot and R. Saket, 2008] only for learning noisy ORs. We also study the learnability of parities and show that it is NP-hard to satisfy more than a (q/2^{q-1} + o(1))-fraction of q-sized bags which are consistent with a parity using a parity, while a random-parity-based algorithm achieves a (1/2^{q-2})-approximation.
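As an informal illustration of the LLP setting described in the abstract, the following Python sketch builds size-2 bags labeled only by their average under a hidden OR function, and checks which bags a candidate hypothesis "satisfies" (i.e. reproduces the bag's label proportion). All names and parameters here are illustrative, not from the paper:

```python
import itertools
import random

def or_fn(x):
    # Hidden target concept: OR of the first two coordinates.
    return int(x[0] or x[1])

def bag_proportion(f, bag):
    # Average label of a bag under labeling function f; in LLP this
    # proportion is the only supervision available for the bag.
    return sum(f(x) for x in bag) / len(bag)

def satisfies(h, bag, target_prop):
    # A hypothesis h satisfies a bag when its average prediction
    # matches the bag's observed label proportion.
    return bag_proportion(h, bag) == target_prop

random.seed(0)
n = 4
points = list(itertools.product([0, 1], repeat=n))
# Bags of size 2; their proportions are consistent with or_fn by construction.
bags = [random.sample(points, 2) for _ in range(100)]
props = [bag_proportion(or_fn, b) for b in bags]

# The target OR satisfies every bag; a trivial constant-0 hypothesis
# satisfies only the bags whose proportion happens to be 0.
frac_or = sum(satisfies(or_fn, b, p) for b, p in zip(bags, props)) / len(bags)
frac_zero = sum(satisfies(lambda x: 0, b, p) for b, p in zip(bags, props)) / len(bags)
```

The paper's first result says that even in this benign-looking regime (bags of size at most 2, proportions exactly consistent with an OR), finding a constant-clause CNF satisfying a constant fraction of the bags is NP-hard.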

Subject Classification

ACM Subject Classification
  • Theory of computation → Problems, reductions and completeness
Keywords
  • Learning from label proportions
  • Computational learning
  • Hardness
  • Boolean functions


References

  1. S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. J. Comput. Syst. Sci., 54(2):317-331, 1997.
  2. S. Arora, C. Lund, R. Motwani, M. Sudan, and M. Szegedy. Proof verification and the hardness of approximation problems. J. ACM, 45(3):501-555, 1998.
  3. S. Arora and S. Safra. Probabilistic checking of proofs: A new characterization of NP. J. ACM, 45(1):70-122, 1998.
  4. D. Barucic and J. Kybic. Fast learning from label proportions with small bags. CoRR, abs/2110.03426, 2021. URL: https://arxiv.org/abs/2110.03426.
  5. G. Bortsova, F. Dubost, S. N. Ørting, I. Katramados, L. Hogeweg, L. H. Thomsen, M. M. W. Wille, and M. de Bruijne. Deep learning from label proportions for emphysema quantification. In MICCAI, volume 11071 of Lecture Notes in Computer Science, pages 768-776. Springer, 2018. URL: https://arxiv.org/abs/1807.08601.
  6. R. I. Busa-Fekete, H. Choi, T. Dick, C. Gentile, and A. M. Medina. Easy learning from label proportions. CoRR, abs/2302.03115, 2023. URL: https://arxiv.org/abs/2302.03115.
  7. L. Chen, T. Fu, A. Karbasi, and V. Mirrokni. Learning from aggregated data: Curated bags versus random bags. CoRR, abs/2305.09557, 2023. URL: https://arxiv.org/abs/2305.09557.
  8. L. Chen, Z. Huang, and R. Ramakrishnan. Cost-based labeling of groups of mass spectra. In Proc. ACM SIGMOD International Conference on Management of Data, pages 167-178, 2004.
  9. L. M. Dery, B. Nachman, F. Rubbo, and A. Schwartzman. Weakly supervised classification in high energy physics. Journal of High Energy Physics, 2017(5):1-11, 2017.
  10. V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM J. Comput., 41(6):1558-1590, 2012.
  11. S. Ghoshal and R. Saket. Hardness of learning DNFs using halfspaces. In Proc. STOC, pages 467-480, 2021.
  12. V. Guruswami, P. Raghavendra, R. Saket, and Y. Wu. Bypassing UGC from some optimal geometric inapproximability results. ACM Trans. Algorithms, 12(1):6:1-6:25, 2016. URL: http://eccc.hpi-web.de/report/2010/177.
  13. J. Håstad. Some optimal inapproximability results. J. ACM, 48(4):798-859, 2001.
  14. J. Hernández-González, I. Inza, L. Crisol-Ortíz, M. A. Guembe, M. J. Iñarra, and J. A. Lozano. Fitting the data from embryo implantation prediction: Learning from label proportions. Statistical Methods in Medical Research, 27(4):1056-1066, 2018.
  15. S. Khot and R. Saket. Hardness of minimizing and learning DNF expressions. In Proc. FOCS, pages 231-240, 2008.
  16. C. O'Brien, A. Thiagarajan, S. Das, R. Barreto, C. Verma, T. Hsu, J. Neufeld, and J. J. Hunt. Challenges and approaches to privacy preserving post-click conversion prediction. CoRR, abs/2201.12666, 2022. URL: https://arxiv.org/abs/2201.12666.
  17. S. N. Ørting, J. Petersen, M. Wille, L. Thomsen, and M. de Bruijne. Quantifying emphysema extent from weakly labeled CT scans of the lungs using label proportions learning. In The Sixth International Workshop on Pulmonary Image Analysis, pages 31-42, 2016.
  18. R. O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
  19. R. Raz. A parallel repetition theorem. SIAM J. Comput., 27(3):763-803, 1998.
  20. S. Rueping. SVM classifier estimation from group probabilities. In Proc. ICML, pages 911-918, 2010.
  21. R. Saket. Learnability of linear thresholds from label proportions. In Proc. NeurIPS, 2021. URL: https://openreview.net/forum?id=5BnaKeEwuYk.
  22. R. Saket. Algorithms and hardness for learning linear thresholds from label proportions. In Proc. NeurIPS, 2022. URL: https://openreview.net/forum?id=4LZo68TuF-4.
  23. L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984.
  24. J. Wojtusiak, K. Irvin, A. Birerdinc, and A. V. Baranova. Using published medical results and non-homogenous data in rule learning. In Proc. International Conference on Machine Learning and Applications and Workshops, volume 2, pages 84-89. IEEE, 2011.
  25. F. X. Yu, K. Choromanski, S. Kumar, T. Jebara, and S. F. Chang. On learning from label proportions. CoRR, abs/1402.5902, 2014. URL: https://arxiv.org/abs/1402.5902.