Young Ho Shin committed
Commit f369852 · Parent(s): 36bccd1

Clean up app.py and article.md

Files changed:
- app.py (+3, -4)
- article.md (+41, -15)
app.py CHANGED

@@ -34,17 +34,17 @@ def process_image(image):
 # !ls examples | grep png
 
 # +
-title = "Convert
+title = "Convert image to LaTeX source code"
 
 with open('article.md',mode='r') as file:
     article = file.read()
 
 description = """
-This is a demo of machine learning model trained to
+This is a demo of machine learning model trained to reconstruct the LaTeX source code of an equation from an image.
 To use it, simply upload an image or use one of the example images below and click 'submit'.
 Results will show up in a few seconds.
 
-Try rendering the
+Try rendering the generated LaTeX [here](https://quicklatex.com/) to compare with the original.
 (The model is not perfect yet, so you may need to edit the resulting LaTeX a bit to get it to render a good match.)
 
 """
@@ -61,7 +61,6 @@ examples = [
     [ "examples/7afdeff0e6.png" ],
     [ "examples/b8f1e64b1f.png" ],
 ]
-#examples =[["examples/image_0.png"], ["image_1.png"], ["image_2.png"]]
 # -
 
 iface = gr.Interface(fn=process_image,
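For readers who want to see how these pieces fit together, here is a minimal sketch of a Gradio app wired up the way this diff suggests, with `title`, `description`, `article`, and `examples` passed to `gr.Interface`. The body of `process_image`, the input/output component choices, and the launch block are placeholders for illustration, not the actual contents of app.py.

```python
# Sketch only: how the objects in this diff plausibly feed into gr.Interface.
# The body of process_image and the input/output components are placeholders.
import gradio as gr

def process_image(image):
    # Placeholder: the real function runs the OCR model on the image
    # and returns the predicted LaTeX string.
    return r"\frac{a}{b}"

title = "Convert image to LaTeX source code"

with open("article.md", mode="r") as file:
    article = file.read()

description = """
This is a demo of a machine learning model trained to reconstruct the LaTeX
source code of an equation from an image.
"""

examples = [
    ["examples/7afdeff0e6.png"],
    ["examples/b8f1e64b1f.png"],
]

iface = gr.Interface(
    fn=process_image,
    inputs="image",        # placeholder component choice
    outputs="text",        # placeholder component choice
    title=title,
    description=description,
    article=article,
    examples=examples,
)

if __name__ == "__main__":
    iface.launch()
```

Gradio renders `description` above the interface and `article` below it, which is why article.md is read into a string at startup and why the two files are edited together in this commit.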
article.md CHANGED

@@ -14,8 +14,8 @@ and the corresponding LaTeX code:
 ```
 
 
-This demo is a first step in solving
-Eventually, you'll be able to take a quick screenshot
+This demo is a first step in solving this problem.
+Eventually, you'll be able to take a quick partial screenshot from a paper
 and a program built with this model will generate its corresponding LaTeX source code
 so that you can just copy/paste straight into your personal notes.
 No more endless googling obscure LaTeX syntax!
@@ -24,25 +24,51 @@ No more endless googling obscure LaTeX syntax!
 
 Because this problem involves looking at an image and generating valid LaTeX code,
 the model needs to understand both Computer Vision (CV) and Natural Language Processing (NLP).
-There are some other projects that aim to solve the same problem with some very interesting
-
+There are some other projects that aim to solve the same problem with some very interesting models.
+These generally involve some kind of "encoder" that looks at the image and extracts/encodes the information about the equation,
 and a "decoder" that takes that information and translates it into what is hopefully both valid and accurate LaTeX code.
+The "encode" part can be done using classic CNN architectures commonly used for CV tasks, or newer vision transformer architectures.
+The "decode" part can be done with LSTMs or transformer decoders, using an attention mechanism to make sure the decoder understands long-range dependencies, e.g. remembering to close a bracket that was opened a long sequence away.
 
-
-...
-
-I chose to tackle this problem with transfer learning.
+I chose to tackle this problem with transfer learning, using an existing OCR model and fine-tuning it for this task.
 The biggest reason for this is computing constraints -
-
+GPU hours are expensive so I wanted training to be reasonably fast, on the order of a couple of hours.
 There are some other benefits to this approach,
-e.g. the architecture is already proven to be robust
+e.g. the architecture is already proven to be robust.
+I chose [TrOCR](https://arxiv.org/abs/2109.10282), a model trained at Microsoft for text recognition tasks, which uses a transformer architecture for both the encoder and decoder.
+
+For the data, I used the `im2latex-100k` dataset, which includes a total of roughly 100k formulas and images.
+Some preprocessing steps were done by Harvard NLP for the [`im2markup` project](https://github.com/harvardnlp/im2markup).
+To limit the scope of the project and simplify the task, I limited the training data to equations containing 100 LaTeX tokens or fewer.
+This covers most single-line equations, including fractions, subscripts, symbols, etc., but does not cover large multi-line equations, some of which can have up to 500 LaTeX tokens.
+GPU training was done on Kaggle in roughly 3 hours.
+You can find the full training code on my Kaggle profile [here](https://www.kaggle.com/code/younghoshin/finetuning-trocr/notebook).
+
+## What's next?
+
+There are multiple improvements that I'm hoping to make to this project.
+
+### More robust prediction
+
+If you've tried the examples above (randomly sampled from the test set), you may have noticed that the model predictions aren't quite perfect and the model occasionally misses, duplicates, or mistakes tokens.
+More training on the existing dataset could help with this.
+
+### More data
+
+There's a lot of LaTeX data available on the internet besides `im2latex-100k`, e.g. arXiv and Wikipedia.
+It's just waiting to be scraped and used for this project.
+This means a lot of hours of scraping, cleaning, and processing, but having a more diverse set of input images could improve model accuracy significantly.
+
+### Faster and smaller model
+
+The model currently takes a few seconds to process a single image.
+I would love to improve performance so that it can run in one second or less, maybe even on mobile devices.
+This might be impossible with TrOCR, which is a fairly large model designed for use on GPUs.
 
-I chose TrOCR, an OCR machine learning model trained by Microsoft on SRIOE data to produce text from receipts.
 
 <p style='text-align: center'>Made by Young Ho Shin</p>
 <p style='text-align: center'>
-<a href = "mailto: [email protected]">Email</a> |
-<a href='https://www.github.com/yhshin11'>Github</a> |
-<a href='https://www.linkedin.com/in/young-ho-shin-3995051b9/'>Linkedin</a>
-
+<a href = "mailto: [email protected]">Email</a> |
+<a href='https://www.github.com/yhshin11'>Github</a> |
+<a href='https://www.linkedin.com/in/young-ho-shin-3995051b9/'>Linkedin</a>
 </p>
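As a rough illustration of the setup described in article.md (fine-tuning TrOCR on `im2latex-100k` formulas of at most 100 tokens), the sketch below shows how the pieces might be loaded with the Hugging Face transformers library. The checkpoint name, the formula file path, and the whitespace tokenization of formulas are assumptions for illustration; the actual choices live in the linked Kaggle notebook.

```python
# Sketch only: load a TrOCR checkpoint, keep im2latex-style formulas with
# <= 100 tokens, and run inference on one example image. The checkpoint name,
# the formula file path, and whitespace tokenization are assumptions; a
# fine-tuned checkpoint would be needed to actually get LaTeX output.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

MAX_TOKENS = 100

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-stage1")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# im2markup-style formula lists store one whitespace-tokenized formula per line.
with open("im2latex_formulas.norm.lst") as fh:
    formulas = [line.strip() for line in fh]
train_formulas = [s for s in formulas if len(s.split()) <= MAX_TOKENS]
print(f"Kept {len(train_formulas)} of {len(formulas)} formulas")

# Single-image inference, the same job the demo's process_image performs.
image = Image.open("examples/7afdeff0e6.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=MAX_TOKENS)
latex = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(latex)
```

The 100-token cap matters at inference time too: capping `max_length` keeps generation fast and discourages the decoder from rambling past the end of a single-line equation.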