
Semantic Search of Mobile Applications Using Word Embeddings

Authors: João Coelho, António Neto, Miguel Tavares, Carlos Coutinho, Ricardo Ribeiro, Fernando Batista






Author Details

João Coelho
  • Caixa Mágica Software, Lisbon, Portugal
  • Instituto Superior Técnico, Lisbon, Portugal
António Neto
  • Caixa Mágica Software, Lisbon, Portugal
  • University Institute of Lisbon, Portugal
Miguel Tavares
  • Caixa Mágica Software, Lisbon, Portugal
  • Lusophone University of Humanities and Technologies, Lisbon, Portugal
Carlos Coutinho
  • Caixa Mágica Software, Lisbon, Portugal
  • ISTAR-IUL, University Institute of Lisbon, Portugal
Ricardo Ribeiro
  • University Institute of Lisbon, Portugal
  • INESC-ID Lisbon, Portugal
Fernando Batista
  • University Institute of Lisbon, Portugal
  • INESC-ID Lisbon, Portugal

Cite As

João Coelho, António Neto, Miguel Tavares, Carlos Coutinho, Ricardo Ribeiro, and Fernando Batista. Semantic Search of Mobile Applications Using Word Embeddings. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 12:1-12:12, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.SLATE.2021.12

Abstract

This paper proposes a set of approaches for the semantic search of mobile applications, based on their name and on the unstructured textual information contained in their description. The proposed approaches make use of word-level, character-level, and contextual word embeddings that have been trained or fine-tuned using a dataset of about 500 thousand mobile apps, collected in the scope of this work. The proposed approaches have been evaluated using a public dataset that includes information about 43 thousand applications and 56 manually annotated non-exact queries. Our results show that both character-level embeddings trained on our data and fine-tuned RoBERTa models surpass the performance of the other existing retrieval strategies reported in the literature.
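
As a rough illustration of the general idea of embedding-based retrieval described in the abstract, the following sketch ranks apps by the cosine similarity between an averaged query vector and an averaged name-plus-description vector. It is not the authors' implementation: the three-dimensional vectors, the app names, and the helper functions `embed`, `cosine`, and `search` are all hypothetical, invented only to show how a query can match an app without sharing any exact keyword with it.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# A real system would load embeddings trained on app names and descriptions.
TOY_VECTORS = {
    "photo": [0.9, 0.1, 0.0],
    "camera": [0.8, 0.2, 0.1],
    "editor": [0.6, 0.3, 0.1],
    "music": [0.0, 0.9, 0.1],
    "player": [0.1, 0.8, 0.2],
    "taxi": [0.1, 0.0, 0.9],
    "ride": [0.0, 0.1, 0.9],
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, apps):
    """Rank (name, description) pairs by similarity to the query, best first."""
    query_vec = embed(query)
    if query_vec is None:
        return []
    scored = []
    for name, description in apps:
        app_vec = embed(name + " " + description)
        if app_vec is not None:
            scored.append((cosine(query_vec, app_vec), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical apps; none of the descriptions contains the word "camera".
APPS = [
    ("SnapEdit", "photo editor"),
    ("TuneBox", "music player"),
    ("CityCab", "taxi ride"),
]

print(search("camera", APPS))
```

Here the query "camera" ranks the photo editor first even though the word never appears in its description, which is the kind of non-exact match that keyword search alone would miss. Averaging word vectors is only one simple way to build a text representation; character-level and contextual models such as those named in the abstract produce their vectors differently.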

Subject Classification

ACM Subject Classification
  • Information systems → Retrieval models and ranking
  • Information systems → Document representation
  • Information systems → Language models
  • Information systems → Search engine indexing
  • Information systems → Similarity measures
  • Computing methodologies → Machine learning
Keywords
  • Semantic Search
  • Word Embeddings
  • Elasticsearch
  • Mobile Applications

