Natural Language Data Interfaces: A Data Access Odyssey (Invited Talk)

Author Georgia Koutrika

Author Details

Georgia Koutrika
  • Athena Research Center, Athens, Greece

Cite As

Georgia Koutrika. Natural Language Data Interfaces: A Data Access Odyssey (Invited Talk). In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 1:1-1:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Back in the 1970s, E. F. Codd worked on a prototype of a natural language question-and-answer application that would sit on top of a relational database system. Soon, natural language interfaces for databases (NLIDBs) became the holy grail for the database community. Different approaches have been proposed by the database, machine learning, and NLP communities. Interest in the topic has had its peaks and valleys. After a long and adventurous journey of almost 50 years, interest in NLIDBs has been rekindled in recent years, fueled by the need to democratize data access and by recent advances in deep learning and natural language processing in particular. There is a surge of work on natural language interfaces for databases using neural translation, and it has suddenly become hard to keep up with advances in the field. Are we close to finding the holy grail of data access? What lurking challenges do we need to overcome, and what research opportunities arise? Finally, what is the role of the database community?

Subject Classification

ACM Subject Classification
  • Computing methodologies → Machine translation
  • Information systems → Data management systems

Keywords
  • natural language data interfaces
  • NLIDBs
  • NL-to-SQL
  • text-to-SQL
  • conversational databases



