FOTS: Fast Oriented Text Spotting

Abhinav Sharma
5 min read · Apr 29, 2021
  1. Business Problem:- Given an image containing multiple text regions, detect those regions and recognize the text inside them.

Reading text from natural images is useful in many domains such as document analysis, robot navigation, scene understanding, self-driving cars, and image retrieval. It is also one of the most challenging tasks because of the varying fonts, sizes/scales, and text orientations in real-life images. On top of that, some of these applications (e.g., recognition from video streams) require text detection and recognition to be both precise and fast, which makes the task even more demanding.

Text detection and recognition have traditionally been treated as two separate tasks, which means building two separate models. This slows the pipeline down and makes it unsuitable for real-time scene text spotting and recognition.

“FOTS”, or Fast Oriented Text Spotting, is an architecture that aims to remedy this by using a single network for both the detection and the recognition of text. It does so by having one backbone whose convolutions are shared between the detection and recognition branches, and by introducing a new operation called RoIRotate (more about it below!).

Outline

  1. Business Problem
  2. ML Problem Formulation
  3. Source of Data and Overview
  4. Exploratory Data Analysis (EDA)
  5. Data Preprocessing/Ground Truth Generation
  6. Modeling
  7. Loss Formulation
  8. Results
  9. Future Work
  10. References

My GitHub repository with the full code:- https://github.com/abhiss4/FOTS-tf2-keras

2. ML Problem Formulation:- The problem can be formulated as a two-stage pipeline: 1) text localization/detection and 2) recognition. Text detection can be further broken down into per-pixel classification of text regions and bounding-box regression.

High-level architecture

3. Source of Data:-

a. ICDAR 15:- This dataset includes 1,000 training images and 500 test images. All the images were captured with Google Glass without taking care of positioning, so text in the scene can appear at arbitrary orientations. All the images have the same shape (720, 1280, 3), and some of them are blurry.

b. SynthText800k:- This dataset contains over 800k images of varying shapes with synthetic text rendered onto them.

Dataset examples

4. Exploratory Data Analysis:-

ICDAR sample

‘####’ in the ground-truth transcription marks regions that should be ignored during training
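For context, each ICDAR 15 image ships with a ground-truth text file where every line holds the four corner points of a quadrilateral followed by the transcription. Below is a minimal parsing sketch, not the exact code from my repo; the folder layout and file-name pattern in the usage example are assumptions.

```python
import glob
import numpy as np

def parse_icdar_gt(gt_path):
    """Parse one ICDAR-15-style ground-truth file.

    Each line looks like: x1,y1,x2,y2,x3,y3,x4,y4,transcription
    Returns the quads as an (N, 4, 2) array, the transcriptions, and a
    don't-care flag for the '####'-style entries mentioned above.
    """
    quads, texts, ignore = [], [], []
    with open(gt_path, encoding="utf-8-sig") as f:  # ICDAR files often start with a BOM
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 9:
                continue
            coords = list(map(float, parts[:8]))
            text = ",".join(parts[8:])              # the transcription itself may contain commas
            quads.append(np.array(coords).reshape(4, 2))
            texts.append(text)
            ignore.append(text.strip("#") == "")    # all-'#' transcription -> don't-care region
    return np.array(quads), texts, ignore

# Hypothetical usage over a training folder
for gt_file in glob.glob("icdar15/train_gts/gt_img_*.txt"):
    quads, texts, ignore = parse_icdar_gt(gt_file)
```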

Distribution of Sentence lengths

5. Data Preprocessing/Ground Truth Generation:-

1. Score maps:- Per-pixel maps of the bounding boxes, where a value of 1 denotes a text region (a small sketch of this ground-truth generation follows the list below).

score maps

2. Geo maps:- Per-pixel distances from each side of the bounding box (top, right, bottom, left) plus an angle map.

Geo maps

3. Training mask:- Used to ignore hard regions, i.e., text marked as ‘####’ in the transcription.

training mask

4. Transcription:- From the EDA we find the average length of the words and accordingly post-pad or truncate the encoded words with 0.
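Here is a minimal, simplified sketch of this ground-truth generation (score map, training mask, and padded transcriptions). It is not my exact repo code: the score map is filled over the full quadrilateral rather than a shrunk one, the geo maps (four side distances plus the angle) are omitted for brevity, and MAX_LEN and the character set are assumed values.

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 15                     # assumed word length chosen from the EDA
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(CHARSET)}   # 0 is reserved for padding

def make_score_and_mask(image_shape, quads, ignore, scale=0.25):
    """Build the per-pixel score map and training mask at 1/4 resolution."""
    h, w = int(image_shape[0] * scale), int(image_shape[1] * scale)
    score_map = np.zeros((h, w), dtype=np.float32)
    train_mask = np.ones((h, w), dtype=np.float32)
    for quad, skip in zip(quads, ignore):
        poly = np.round(quad * scale).astype(np.int32)
        if skip:                                  # '####' regions: zeroed out of the loss
            cv2.fillPoly(train_mask, [poly], 0.0)
        else:                                     # real text: positive score-map pixels
            cv2.fillPoly(score_map, [poly], 1.0)
    return score_map, train_mask

def encode_transcriptions(texts):
    """Map words to integer ids, then post-pad / truncate to MAX_LEN with 0."""
    seqs = [[CHAR_TO_ID.get(c, 0) for c in t.lower()] for t in texts]
    return pad_sequences(seqs, maxlen=MAX_LEN, padding="post", truncating="post")
```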

6. Modeling:-

1. Backbone:- Here I have used DenseNet121, as it gave me the best bang for my GPU memory (5M parameters against 24M in ResNets [5]). One can choose any network they like, as long as the final feature map is 4 times smaller than the input image.
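A minimal sketch of the idea, assuming a Keras DenseNet121 with include_top=False. The actual FOTS paper merges intermediate feature maps FPN-style; to keep the sketch short I simply upsample the final stride-32 features back to stride 4, so treat the wiring below as illustrative rather than my exact repo code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_shared_backbone(input_shape=(512, 512, 3)):
    """Shared convolutions whose output is 4x smaller than the input image."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = base.output                               # stride 32 with respect to the input
    for filters in (256, 128, 64):                # three 2x upsamplings -> stride 4
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return tf.keras.Model(base.input, x, name="shared_backbone")

backbone = build_shared_backbone()
print(backbone.output_shape)                      # (None, 128, 128, 64) for a 512x512 input
```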

2. Detection branch:- On the output of the backbone, 1x1 convolutions reduce the number of filters to 1 and 5 for predicting the score map and the geo map, respectively.
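A sketch of the detection head under the same assumptions: a 1-channel score map with a sigmoid, and a 5-channel geo map (four side distances plus one angle). The scaling of the distance channels (here 512 px) and the squashing of the angle into (-π/2, π/2) follow the common EAST/FOTS convention and are assumptions here, not necessarily the exact values in my repo.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_detection_head(shared_features):
    """1x1 convolutions over the shared features -> score map + geo map."""
    # 1 channel: per-pixel text / non-text probability
    score_map = layers.Conv2D(1, 1, activation="sigmoid", name="score_map")(shared_features)
    # 4 channels: distances to the top/right/bottom/left edges of the box, in pixels
    distances = layers.Conv2D(4, 1, activation="sigmoid")(shared_features)
    distances = layers.Lambda(lambda d: d * 512.0, name="distances")(distances)
    # 1 channel: box rotation angle, squashed into (-pi/2, pi/2)
    angle = layers.Conv2D(1, 1, activation="sigmoid")(shared_features)
    angle = layers.Lambda(lambda a: (a - 0.5) * np.pi, name="angle")(angle)
    geo_map = layers.Concatenate(name="geo_map")([distances, angle])
    return score_map, geo_map

# Hypothetical wiring on top of the backbone sketched earlier
features = tf.keras.Input(shape=(128, 128, 64))
score, geo = build_detection_head(features)
detector = tf.keras.Model(features, [score, geo])
```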

3. RoIRotate:- This branch is the highlight of the whole architecture: given the detected regions, it rotates, scales, and crops them into axis-aligned boxes of fixed height so they can be fed to the recognition branch.

Note that the image above is just for visualization. The actual RoIRotate operates on the feature maps produced by the shared convolutions, not on the raw images.
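Since my implementation ended up using OpenCV for this step (see the Results section), here is a minimal sketch of the idea on a plain image crop: rotate around the box centre so the box becomes axis-aligned, crop it, and resize it to a fixed height for the recognition branch. The function name, arguments, and fixed height of 8 are my assumptions for illustration.

```python
import cv2
import numpy as np

def roi_rotate(image, center, size, angle_deg, out_height=8):
    """Rotate + crop one oriented text box into an axis-aligned patch of fixed height."""
    (cx, cy), (w, h) = center, size
    # Rotate the whole image about the box centre so the box becomes axis-aligned
    M = cv2.getRotationMatrix2D((cx, cy), angle_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    # Crop the now axis-aligned region
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    crop = rotated[max(y0, 0): y0 + int(h), max(x0, 0): x0 + int(w)]
    # Resize to a fixed height, keeping the aspect ratio, as the recognizer expects
    scale = out_height / max(crop.shape[0], 1)
    out_width = max(int(crop.shape[1] * scale), 1)
    return cv2.resize(crop, (out_width, out_height))
```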

4. Recognition branch:- It is a basic CRNN-type network [2], where we reduce the size along both height and width (as opposed to only height in the research paper), then reshape the output so that the width of the feature map is aligned with the RNN time steps, followed by a dense layer.

A typical CRNN architecture
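A minimal sketch of the recognition branch described above: a couple of convolution + pooling stages that shrink height and width, a reshape so the remaining width becomes the RNN time axis, a bidirectional LSTM, and per-timestep logits over the characters (plus one CTC blank). The 8x64 crop size, filter counts, and NUM_CLASSES are assumptions for the sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 37                 # assumed: 36-character set + 1 CTC blank

def build_recognition_branch(input_shape=(8, 64, 64)):
    """CRNN-style recognizer over RoIRotate crops of the shared features."""
    inp = tf.keras.Input(shape=input_shape)
    x = inp
    for filters in (64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # halve height AND width
    # Make width the sequence (time) axis and fold height into the features
    h, w, c = (int(d) for d in x.shape[1:])
    x = layers.Permute((2, 1, 3))(x)                   # (h, w, c) -> (w, h, c)
    x = layers.Reshape((w, h * c))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    out = layers.Dense(NUM_CLASSES)(x)                 # raw logits; CTC applies the softmax
    return tf.keras.Model(inp, out, name="recognition_branch")

recognizer = build_recognition_branch()
recognizer.summary()             # 16 time steps for an 8x64 input crop
```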

7. Loss Formulation:-

  1. Dice Loss:- Instead of cross-entropy for the per-pixel classification, we use dice loss, as it has been shown to train the network better [1].
  2. Regression Loss:- An IoU loss on the predicted box distances plus 1 − cos(θ_pred − θ_gt) for the angle, where θ_pred and θ_gt are the predicted and ground-truth box angles.

  3. Recognition Loss:- CTC loss is used for the recognition branch.

  • Total Loss = Dice loss + U·(Regression loss) + V·(Recognition loss), where U and V are weighting hyperparameters.
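Below is a hedged TensorFlow sketch of these losses. The dice and angle terms follow the formulas above; the -log(IoU) form of the IoU term, the geo-map channel order (top, right, bottom, left, angle), the use of tf.nn.ctc_loss, and the weights U = V = 1 are assumptions of the sketch, not necessarily the exact values in my repo.

```python
import tensorflow as tf

def dice_loss(score_true, score_pred, mask, eps=1e-6):
    """Per-pixel classification loss on the score map, ignoring masked pixels."""
    inter = tf.reduce_sum(score_true * score_pred * mask)
    union = tf.reduce_sum(score_true * mask) + tf.reduce_sum(score_pred * mask) + eps
    return 1.0 - 2.0 * inter / union

def regression_loss(geo_true, geo_pred, score_true, mask):
    """IoU term on the four side distances plus 1 - cos() on the angle."""
    t_t, r_t, b_t, l_t, theta_t = tf.split(geo_true, 5, axis=-1)
    t_p, r_p, b_p, l_p, theta_p = tf.split(geo_pred, 5, axis=-1)
    area_true = (t_t + b_t) * (r_t + l_t)
    area_pred = (t_p + b_p) * (r_p + l_p)
    inter = (tf.minimum(t_t, t_p) + tf.minimum(b_t, b_p)) * \
            (tf.minimum(r_t, r_p) + tf.minimum(l_t, l_p))
    iou_loss = -tf.math.log((inter + 1.0) / (area_true + area_pred - inter + 1.0))
    angle_loss = 1.0 - tf.cos(theta_p - theta_t)
    pos = score_true * mask                     # only positive, unmasked pixels count
    return tf.reduce_sum((iou_loss + angle_loss) * pos) / (tf.reduce_sum(pos) + 1e-6)

def recognition_loss(labels, logits, label_length, logit_length):
    """CTC loss over the recognizer's per-timestep logits."""
    return tf.reduce_mean(tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                                         logits_time_major=False, blank_index=-1))

U, V = 1.0, 1.0                                 # assumed loss weights

def total_loss(score_t, score_p, geo_t, geo_p, mask, recog_loss):
    return dice_loss(score_t, score_p, mask) \
        + U * regression_loss(geo_t, geo_p, score_t, mask) \
        + V * recog_loss
```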

8. Results:-

Training the detection branch was easy: it took only about 20 epochs for the loss to come down.

Detection Loss vs Epoch
Recognition Loss vs Epoch
Sample Inference

Deployment Video:- https://www.youtube.com/watch?v=vZTzIqlG6b0&ab_channel=AbhinavSharma

**Note: why are the text detection results not better? Because I was not able to train detection and recognition simultaneously due to memory constraints, I had to freeze the backbone while training recognition. In addition, RoIRotate kept crashing when implemented with tf.addons, so I fell back to an OpenCV implementation of RoIRotate, which made that step non-trainable as well.

This is further illustrated by a mini recognizer model trained on the ICDAR 2013 dataset, after manually creating the RoI crops shown below.

Sample RoIs
Predictions

9. Future Work:-

1. Definitely try the same model on a Linux box to get RoIRotate working in TensorFlow.

2. Experiment with different detection losses.

10. References:-

[1] https://stats.stackexchange.com/questions/321460/dice-coefficient-loss-function-vs-cross-entropy

[2] https://arxiv.org/pdf/1507.05717.pdf

[3] https://github.com/RaidasGrisk/tf2-fots

[4] https://www.appliedaicourse.com/

[5] https://keras.io/api/applications/
