OCRticle - a Structure-Aware OCR Application

Authors: Sofia G. Rodrigues dos Santos, J. João Dias de Almeida



File

OASIcs.SLATE.2023.8.pdf
  • Filesize: 12.69 MB
  • 14 pages

Author Details

Sofia G. Rodrigues dos Santos
  • Informatics Department, University of Minho, Braga, Portugal
J. João Dias de Almeida
  • ALGORITMI/LASI, University of Minho, Braga, Portugal

Cite As

Sofia G. Rodrigues dos Santos and J. João Dias de Almeida. OCRticle - a Structure-Aware OCR Application. In 12th Symposium on Languages, Applications and Technologies (SLATE 2023). Open Access Series in Informatics (OASIcs), Volume 113, pp. 8:1-8:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/OASIcs.SLATE.2023.8

Abstract

While there are currently many applications and websites capable of performing Optical Character Recognition (OCR), none of the widely available options offer structured OCR, i.e., OCR that maintains the text’s original structure. For example, if a document has a title, after performing OCR on it, the title should have a different formatting, in order to distinguish it from the rest of the text. This paper covers the topic of structure-aware OCR, first by describing the current state of OCR tools, then by showcasing a prototype tool capable of retaining the structure of articles scanned from an image.
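
The title example in the abstract can be illustrated with a minimal sketch. It is not the OCRticle implementation: it assumes, purely for illustration, that an OCR step has already produced text lines with a pixel height for each, and that a line much taller than the typical line is a title. The function name `to_markdown`, the `title_ratio` threshold, and the sample data are all hypothetical.

```python
from statistics import median


def to_markdown(lines, title_ratio=1.5):
    """Render OCR lines as Markdown, marking unusually tall lines as titles.

    `lines` is a list of (text, height_px) tuples in reading order.
    This is an assumed input shape, not the output format of any specific OCR tool.
    """
    if not lines:
        return ""
    # Treat the median line height as the height of ordinary body text.
    body_height = median(height for _, height in lines)
    rendered = []
    for text, height in lines:
        if height >= title_ratio * body_height:
            # Hypothetical heuristic: a line much taller than body text is a title.
            rendered.append(f"# {text}")
        else:
            rendered.append(text)
    # Separate blocks with blank lines so Markdown keeps them apart.
    return "\n\n".join(rendered)


# Made-up sample data standing in for OCR output.
sample = [
    ("Local Team Wins Cup", 48),
    ("The final was played on Sunday.", 20),
    ("Fans celebrated in the streets.", 20),
]
markdown = to_markdown(sample)
print(markdown)
```

With this sample, only the first line is emitted as a Markdown heading, so the title stays distinguishable from the body text after recognition.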

Subject Classification

ACM Subject Classification
  • Applied computing → Optical character recognition
Keywords
  • OCR
  • Optical Character Recognition
  • Data Structure
  • Data Parsing
  • Document Structure
