
Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework

Michal Bušta, Lukáš Neumann and Jiří Matas

Centre for Machine Perception, Department of Cybernetics

Czech Technical University, Prague, Czech Republic

bustam@fel.cvut.cz, neumalu1@cmp.felk.cvut.cz, matas@cmp.felk.cvut.cz

Abstract

A method for scene text localization and recognition is proposed. The novelties include: training of both text detection and recognition in a single end-to-end pass, the structure of the recognition CNN and the geometry of its input layer that preserves the aspect of the text and adapts its resolution to the data.

The proposed method achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets – ICDAR 2013 and ICDAR 2015 – whilst being an order of magnitude faster than competing methods: the whole pipeline runs at 10 frames per second on an NVidia K80 GPU.

1. Introduction

Scene text localization and recognition, a.k.a. text spotting, the text-in-the-wild problem or photo OCR, is an open problem with many practical applications, ranging from tools for helping the visually impaired or text translation, to use as a part of a larger integrated system, e.g. in robotics, indoor navigation or autonomous driving.

Like many areas of computer vision, the scene text field has greatly benefited from deep learning techniques and the accuracy of methods has significantly improved [12, 6]. Most work, however, focuses either solely on text localization (detection) [18, 26, 6, 15] or on recognition of manually cropped-out words [7, 24]. The problem of scene text recognition has so far always been approached ad hoc, by connecting the detection module to an existing independent recognition method [6, 15, 8].

In this paper, we propose a novel end-to-end framework which simultaneously detects and recognizes text in scene images. As the first contribution, we present a model which is trained for both text detection and recognition in a single learning framework, and we show that such a joint model outperforms the combination of state-of-the-art localization and state-of-the-art recognition methods [6, 4].

Figure 1. The proposed method detects and recognizes text in scene images at 10 fps on an NVidia K80 GPU. Ground truth in green, model output in red. The image is taken from the ICDAR 2013 dataset [13].

As the second contribution, we show how the state-of-the-art object detection methods [22, 23] can be extended for text detection and recognition, taking into account specifics of text such as the exponential number of classes (given an alphabet A, there are up to |A|^L possible classes, where L denotes the maximum text length) and the sensitivity to hidden parameters such as text aspect and rotation.

The method achieves state-of-the-art results on the standard ICDAR 2013 [13] and ICDAR 2015 [12] datasets and the pipeline runs end-to-end at 10 frames per second on an NVidia K80 GPU, which is more than 10 times faster than the fastest methods.

The rest of the paper is structured as follows. In Sec-

tion 2, previous work is reviewed. In Section 3, the pro-

posed method is described and in Section 4 evaluated. The

paper is concluded in Section 5.

2. Previous Work

2.1. Scene Text Localization

Jaderberg et al. [10] train a character-centric CNN [14],

which takes a 24 × 24 image patch and predicts a text/no-
text score, a character and a bigram class.

Figure 2. Method overview. Text region proposals are generated by a Region Proposal Network [22]. Each region with a sufficient text confidence is then normalized to a variable-width feature tensor by bilinear sampling. Finally, each region is associated with a sequence of characters or rejected as not text. (Pipeline stages: source image, fully convolutional net, region proposals, region scores, best proposals, geometric normalization with a bilinear sampler, CTC region transcription.)

The input image is scanned by the trained network in 16 scales and a text saliency map is obtained by taking the text/no-text output

of the network. Given the saliency maps, word bounding

boxes are then obtained by the run length smoothing algo-

rithm. The method is further improved in [8], where a word-

centric approach is introduced. First, horizontal bounding-

box proposals are detected by aggregating the output of

the standard Edge Boxes [29] and Aggregate Channel Fea-

ture [2] detectors. Each proposal is then classified by a Ran-

dom Forest [1] classifier to reduce the number of false pos-

itives and its position and size are further refined by a CNN

regressor, to obtain a more suitable cropping of the detected

word image.

Gupta et al. [6] propose a fully-convolutional regression

network, drawing inspiration from the YOLO object detec-

tion pipeline [21]. An image is divided into a fixed num-

ber of cells (14 × 14 in the highest resolution), where each
cell is associated with 7 values directly predicting the po-
sition, rotation and confidence of text. The values are esti-

mated by translation-invariant predictors built on top of the

first 9 convolutional layers of the popular VGG-16 architec-
ture [25], trained on synthetic data.

Tian et al. [26] adapt the Faster R-CNN architecture [23]

by horizontally sliding a 3 × 3 window on the last convo-
lutional layer of the VGG-16 [25] and applying a Recurrent

Neural Network to jointly predict the text/non-text score,

the y-axis coordinates and the anchor side-refinement. Sim-

ilarly, Liao et al. [15] adapt the SSD object detector [17] to

detect horizontal bounding boxes.

Ma et al. [18] adapt the Faster R-CNN architecture and

extend it to detect text of different orientations by adding

anchor boxes of 6 hand-crafted rotations and 3 aspects. This

is in contrast to our work, where the rotation is a continu-

ous parameter and the optimal anchor boxes dimensions are

found on the training set.

All the aforementioned methods only localize text, but

do not provide text recognition. The end-to-end scene text

recognition results, where present, are achieved by simply

connecting the particular localization method to one of the

cropped-word recognition methods (see Section 2.2).

Last but not least, the methods are significantly slower

than the proposed method, the missing recognition stage

notwithstanding.

2.2. Scene Text Recognition

Jaderberg et al. [8] take a cropped image of a single

word, resize it to a fixed size of 32 × 100 pixels and clas-
sify it as one of the words in a dictionary. In their setup, the

dictionary contains 90 000 English words and words of the
training and testing set. The classifier is trained on a dataset

of 9 million synthetic word images uniformly sampled from

this dictionary.

Shi et al. [24] train a fully-convolutional network with

a bidirectional LSTM using the Connectionist Tempo-

ral Classification (CTC), which was first introduced by

Graves et al. [5] for speech recognition to eliminate the need

for pre-segmented data. Unlike the proposed method, Shi et

al. [24] only recognize a single word per image (i.e. the out-

put is always just one sequence of characters), they resize

the source image to a fixed-sized matrix of 100 × 32 pix-
els regardless of how many characters it contains and the

method is significantly slower because of the LSTM layer.

2.3. Image Captioning

Johnson et al. [11] introduce a Fully Convolutional Lo-

calization Network (FCLN) that combines the Faster R-

CNN approach of Ren et al. [23] based on full VGG-16

[25] with bilinear sampling [9] to generate features for

LSTM that produces captions for detected objects. In our

method, we use YOLOv2 architecture [22] for its lower

complexity, we use the bilinear sampling to produce tensors

of variable width to deal with character sequence recog-

nition and we employ a different (and significantly faster)

classification stage.

3. Proposed Method

The proposed model localizes text regions in a given

scene image and provides text transcription as a sequence

of characters for all regions with text (see Figure 2). The


model is jointly optimized for both text localization and

recognition in an end-to-end training framework.

3.1. Fully Convolutional Network

We adapt the YOLOv2 architecture [22] for its accuracy

and significantly lower complexity than the standard VGG-

16 architecture [25, 11], as the full VGG-16 architecture re-

quires 30 billion operations just to process a 224×224 (0.05
Mpx) image [22]. Using YOLOv2 architecture allows us to

process images with higher resolution, which is a crucial

ability for text recognition – processing at higher resolution

is required because a 1Mpx scene image may contain text

which is 10 pixels high [12], so scaling down the source

image would make the text unreadable.

The proposed method uses the first 18 convolutional and

5 max pool layers from the YOLOv2 architecture, which is

based on 3 × 3 convolutional filters, doubling the number of channels after every pooling step and adding 1 × 1 filters to compress the representations between the 3 × 3 filters [22]. We remove the fully-connected layers to make the network fully convolutional, so the final layer of our model has the dimension W/32 × H/32 × 1024, where W and H denote the source image width and height [22].

3.2. Region Proposals

Similarly to Faster R-CNN [23] and YOLOv2 [22], we

use a Region Proposal Network (RPN) to generate region

proposals, but we add a rotation r_θ, which is crucial for successful text recognition. At each position of the last convolutional layer, the model predicts k rotated bounding boxes, where for each bounding box r we predict 6 features – its position r_x, r_y, its dimensions r_w, r_h, its rotation r_θ and its score r_p, which captures the probability that the region contains text.

The bounding box position and dimensions are encoded with respect to predefined anchor boxes using the logistic activation function, so the actual bounding box position (x, y) and dimensions (w, h) in the source image are given as

x = σ(r_x) + c_x        (1)
y = σ(r_y) + c_y        (2)
w = a_w exp(r_w)        (3)
h = a_h exp(r_h)        (4)
θ = r_θ                 (5)

where c_x and c_y denote the offset of the cell in the last convolutional layer and a_w and a_h denote the predefined width and height of the anchor box a. The rotation θ ∈ (−π/2, π/2) of the bounding box is predicted directly by r_θ.
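To make the decoding in Eqs. (1)–(5) concrete, the following NumPy sketch converts the raw predictions of one anchor into a rotated box in image coordinates. The function name, the scaling of cell offsets and anchor dimensions by the network stride, and the logistic squashing of the text score are our assumptions for this illustration; it is not the authors' implementation.

```python
import numpy as np

def decode_rpn_box(r, cell_x, cell_y, anchor_w, anchor_h, stride=32):
    """Decode the 6 raw outputs (r_x, r_y, r_w, r_h, r_theta, r_p) of one anchor.

    cell_x, cell_y are the coordinates of the cell in the last convolutional
    layer; anchor_w, anchor_h are the predefined anchor dimensions. Working in
    feature-map cell units and scaling by the stride (32) at the end is an
    assumption made for this sketch.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    x = (sigmoid(r[0]) + cell_x) * stride      # Eq. (1)
    y = (sigmoid(r[1]) + cell_y) * stride      # Eq. (2)
    w = anchor_w * np.exp(r[2]) * stride       # Eq. (3)
    h = anchor_h * np.exp(r[3]) * stride       # Eq. (4)
    theta = r[4]                               # Eq. (5), in (-pi/2, pi/2)
    score = sigmoid(r[5])                      # text probability r_p (assumed logistic)
    return x, y, w, h, theta, score
```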
We followed the approach of Redmon et al. [22] and found suitable anchor box scales and aspects by k-means clustering on the aggregated training set (see Section 3.5). Requiring the anchor boxes to have at least 60% intersection-over-union with the ground truth led to k = 14 different anchor box dimensions (see Figure 3 and the clustering sketch below).

Figure 3. Anchor box widths and heights, or equivalently scales and aspects, were obtained by k-means clustering on the training set. Requiring that each ground truth box had an intersection-over-union of at least 60% with one anchor box led to k = 14 boxes.
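The clustering itself can be sketched as k-means on ground-truth box shapes with the distance d = 1 − IoU between boxes centred at the origin, in the spirit of YOLOv2 [22]. The helper names and the stopping criterion below are illustrative assumptions; the paper only specifies the 60% IoU requirement and the resulting k = 14.

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between a (w, h) box and each centroid, both centred at the origin."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, iters=100, seed=0):
    """k-means on box shapes with distance d = 1 - IoU (YOLOv2-style)."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign every ground-truth shape to the closest (highest-IoU) centroid.
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in boxes_wh])
        new_centroids = np.array([
            boxes_wh[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# One could increase k until every ground-truth box reaches >= 0.6 IoU with its
# best anchor, which according to the paper yields k = 14.
```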

For every image, the RPN produces W/32 × H/32 × 6k outputs, where k is the number of anchor boxes at every location and 6 is the number of predicted parameters (x, y, w, h, θ and the text score).

In the training stage, we follow the YOLOv2 approach [22] by taking all positive and negative samples in the source image, where every 20 batches we randomly change the input dimension to one of {352, 416, 480, 544, 608}. A positive sample is the region with the highest intersection over union with the ground truth; the other intersecting regions are negatives.

At runtime, we found the best approach is to take all regions with the score r_p above a certain threshold p_min and to postpone the non-maxima suppression until after the recognition stage, because regions with very similar r_p scores can produce very different transcriptions, and therefore selecting the region with the highest r_p at this stage would not always correspond to the correct transcription (for example, in some cases a region containing the letters “TALY” may have a slightly higher score r_p than a region containing the full word “ITALY”). We empirically found the value p_min = 0.1 to be a reasonable trade-off between accuracy and speed.

3.3. Bilinear Sampling

Each region detected in the previous stage has a different

size and rotation and it is therefore necessary to map the

features into a tensor of canonical dimensions, which can

be used in recognition.

Faster R-CNN [23] uses the RoI pooling approach of Girshick [3], where a w × h × C region is mapped onto a fixed-size W′ × H′ × C grid (7 × 7 × 1024 in their implementation), where each cell takes the maximum activation of the w/W′ × h/H′ cells in the underlying feature layer.

In our model, we instead use bilinear sampling [9, 11] to map a w × h × C region from the source image into a fixed-height (wH′/h) × H′ × C tensor (H′ = 32). This feature representation has a key advantage over the standard RoI approach, as it allows the network to normalize rotation and scale, but at the same time persist the aspect and positioning of individual characters, which is crucial for text recognition accuracy (see Section 3.4).

Type             Channels   Size/Stride     Dim/Act
input            C          –               W × 32
conv             32         3 × 3           leaky ReLU
conv             32         3 × 3           leaky ReLU
maxpool          –          2 × 2 / 2       W/2 × 16
conv             64         3 × 3           leaky ReLU
BatchNorm        –          –               –
recurrent conv   64         3 × 3           leaky ReLU
maxpool          –          2 × 2 / 2       W/4 × 8
conv             128        3 × 3           leaky ReLU
BatchNorm        –          –               –
recurrent conv   128        3 × 3           leaky ReLU
maxpool          –          2 × 2 / 2 × 1   W/4 × 4
conv             256        3 × 3           leaky ReLU
BatchNorm        –          –               –
recurrent conv   256        3 × 3           leaky ReLU
maxpool          –          2 × 2 / 2 × 1   W/4 × 2
conv             512        3 × 2           leaky ReLU
conv             512        5 × 1           leaky ReLU
conv             |Â|        7 × 1           W/4 × 1
log softmax      –          –               –

Table 1. Fully-Convolutional Network for Text Recognition

Given the detected region features U ∈ R^{w×h×C}, they are mapped into a fixed-height tensor V ∈ R^{(wH′/h)×H′×C} as

V^c_{x′,y′} = Σ_{x=1}^{w} Σ_{y=1}^{h} U^c_{x,y} κ(x − T_x(x′)) κ(y − T_y(y′))        (6)

where κ is the bilinear sampling kernel κ(v) = max(0, 1 − |v|) and T is a point-wise coordinate transformation, which projects the co-ordinates x′ and y′ of the fixed-size tensor V to the co-ordinates x and y in the detected region feature tensor U.

The transformation allows for shift and scaling along the x- and y-axes and for rotation, and its parameters are taken directly from the region parameters (see Section 3.2).
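Equation (6) is ordinary bilinear sampling driven by a transform T built from the region parameters. The NumPy sketch below samples one region into the fixed-height grid; the exact composition of T (rotation about the region centre, scaling to height H′ = 32) is our reading of Section 3.2 under assumed conventions, not code from the authors.

```python
import numpy as np

def sample_region(U, cx, cy, w, h, theta, H_out=32):
    """Bilinearly sample a rotated w x h region of feature map U (H x W x C)
    into a fixed-height (w*H_out/h) x H_out grid, following Eq. (6)."""
    W_out = max(1, int(round(w * H_out / h)))
    V = np.zeros((H_out, W_out, U.shape[2]), dtype=U.dtype)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    for yp in range(H_out):
        for xp in range(W_out):
            # T: normalized output coords -> rotated region coords in U.
            u = (xp + 0.5) / W_out - 0.5          # in [-0.5, 0.5]
            v = (yp + 0.5) / H_out - 0.5
            x = cx + u * w * cos_t - v * h * sin_t
            y = cy + u * w * sin_t + v * h * cos_t
            # Bilinear kernel kappa(v) = max(0, 1 - |v|) over the 4 neighbours.
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < U.shape[1] and 0 <= yy < U.shape[0]:
                        wgt = max(0.0, 1 - abs(x - xx)) * max(0.0, 1 - abs(y - yy))
                        V[yp, xp] += wgt * U[yy, xx]
    return V
```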

3.4. Text Recognition

Given the normalized region from the source image, each

region is associated with a sequence of characters or re-

jected as not text in the following process.

The main problem one has to address in this step is the fact that text regions of different sizes have to be mapped to character sequences of different lengths. Traditionally, the issue is solved by resizing the input to a fixed-size matrix (typically 100 × 32 [8, 24]) and the input is then classified either by making every possible character sequence (i.e. every word) a separate class of its own [8, 6], thus requiring a list of all possible outputs in the training stage, or by having multiple independent classifiers, where each classifier predicts the character at a predefined position [7].

Figure 4. Text recognition using Connectionist Temporal Classification. Input W × 32 region (top), CTC output W/4 × |Â| as the most probable class at each column (middle) and the resulting sequence (bottom).

Our model exploits a novel fully-convolutional network (see Table 1), which takes a variable-width feature tensor W × H′ × C as an input (W = wH′/h) and outputs a matrix of size W/4 × |Â|, where A is the alphabet (e.g. all English characters). The matrix height is fixed (it is the number of character classes), but its width grows with the width of the source region and therefore with the length of the expected character sequence.

As a result, a single classifier is used regardless of the

position of the character in the word (in contrast to Jader-

berg et al. [7], where there is an independent classifier for

the character “A” as the first character in the word, an inde-

pendent classifier for the character “A” as the second charac-

ter in the word, etc). The model also does not require prior

knowledge of all words to be detected in the training stage,

in contrast to the separate class per character sequence for-

mulation [8].

The model uses Connectionist Temporal Classification

(CTC) [5, 24] to transform variable-width feature tensor

into a conditional probability distribution over label se-

quences. The distribution is then used to select the most

probable labelling sequence for the text region (see Fig-

ure 4).

Let y = y^1, y^2, · · · , y^n denote the vector of network outputs of length n over an alphabet A extended with a blank symbol “−”. The probability of a path π is then given as

p(π | y) = Π_{i=1}^{n} y^i_{π_i},   π ∈ Â^n,   Â = A ∪ {−}        (7)

where y^i_{π_i} denotes the output probability of the network predicting the label π_i at position i (i.e. the output of the final softmax layer in Table 1).


Let us further define a many-to-one mapping B : Â^n → A^{≤n}, where A^{≤n} is the set of all sequences over A of length at most n. The mapping B removes all blanks and repeated labels, which corresponds to outputting a new label every time the label prediction changes. For example,

B(−ww−al−k) = B(wwaaa−l−k−) = walk
B(−f−oo−o−−d) = B(ffoo−ooo−d) = food

The conditional probability of observing the output sequence w is then given as

p(w | y) = Σ_{π : B(π) = w} p(π | y),   w ∈ A^{≤n}        (8)

In training, an objective function that maximizes the log-likelihood of the target labellings p(w|y) is used [5]. In every training step, the probability p(w_gt | y) of every text region in the mini-batch is efficiently calculated using a forward-backward algorithm similar to HMM training [20], and the objective function derivatives are used to update the network weights using the standard back-propagation algorithm (w_gt denotes the ground truth transcription of the text region).
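In a modern framework, this forward-backward computation is available as a built-in CTC loss. The snippet below is a hedged illustration of how a per-region W/4 × |Â| log-softmax output could be fed to PyTorch's torch.nn.CTCLoss; the tensor shapes, alphabet size and blank index are assumptions for the example, not a description of the authors' implementation.

```python
import torch
import torch.nn as nn

# Assumed setup: 3 text regions in the mini-batch, 62 characters plus one
# blank symbol at index 0 (both choices are purely illustrative).
T, N, C = 40, 3, 63                                    # time steps (W/4), batch, |A^|
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # stand-in output
targets = torch.randint(1, C, (N, 10))                 # padded ground-truth label indices
input_lengths = torch.full((N,), T, dtype=torch.long)  # real widths differ per region
target_lengths = torch.tensor([4, 6, 10])              # lengths of w_gt per region

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients of -log p(w_gt | y) flow back into the network
```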

At test time, the classification output w∗ should ideally be the most probable labelling argmax_w p(w | y), which unfortunately is not tractable, and therefore we adopt the approximate approach [5] of taking the most probable path

w∗ ≈ B(argmax_π p(π | y))        (9)
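A minimal sketch of the approximate decoding in Eq. (9): take the arg-max label in every column and apply the collapsing map B (drop repeats, then blanks). The variable names and the convention that the blank has index 0 are illustrative assumptions.

```python
import numpy as np

def greedy_ctc_decode(log_probs, alphabet, blank=0):
    """Approximate w* = B(argmax_pi p(pi|y)) for one region.

    log_probs : (T, |A^|) matrix of per-column log-probabilities.
    alphabet  : string of characters whose index 0 entry is the blank symbol.
    """
    path = np.argmax(log_probs, axis=1)            # most probable path pi
    labels = []
    prev = None
    for p in path:
        if p != prev and p != blank:               # B: drop repeats and blanks
            labels.append(p)
        prev = p
    return "".join(alphabet[i] for i in labels)

# Example behaviour of B: "-ww-al-k" decodes to "walk"; a region whose columns
# are all blank decodes to the empty string and is rejected as not text.
```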

At the end of this process, each text region in the im-

age has an associated content in the form of a character se-

quence, or it is rejected as not text when all the labels are

blank.

The model typically produces many different boxes for a single text area in the image; we therefore suppress overlapping boxes by a standard non-maxima suppression algorithm based on the text recognition confidence, which is p(w∗ | y) normalized by the text length.
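This suppression step can be sketched as standard greedy NMS where the score of each box is the length-normalized transcription probability rather than the detection score r_p. The field names, the IoU threshold and the externally supplied IoU function below are assumptions for illustration only.

```python
def nms_by_transcription(boxes, iou_fn, iou_threshold=0.5):
    """Greedy non-maxima suppression scored by length-normalized recognition confidence.

    boxes  : list of dicts with keys 'poly' (region geometry), 'text' (w*) and
             'log_prob' (log p(w*|y)) -- illustrative field names.
    iou_fn : callable computing the IoU of two (possibly rotated) regions.
    """
    def score(b):
        # Recognition confidence normalized by the text length.
        return b["log_prob"] / max(1, len(b["text"]))

    kept = []
    for b in sorted(boxes, key=score, reverse=True):
        if all(iou_fn(b["poly"], k["poly"]) < iou_threshold for k in kept):
            kept.append(b)
    return kept
```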

3.5. Training

We pre-train the detection CNN using the SynthText dataset [6] (800,000 synthetic scene images with multiple words per image) for 3 epochs, with weights initialized from ImageNet [22]. The recognition CNN is pre-trained on the Synthetic Word dataset [7] (9 million synthetic cropped word images) for 3 epochs, with weights randomly initialized from the N(0, 1) distribution.
Figure 5. End-to-end scene text recognition samples from the ICDAR 2013 dataset. Model output in red, ground truth in green. Note that in some cases (e.g. top-right) text is correctly recognized even though the bounding box IoU with the ground truth is less than 80%, which would be required by the text localization protocol [13]. Best viewed zoomed in color.

As the final step, we train both networks simultaneously for 3 epochs on a combined dataset consisting of the SynthText dataset, the Synthetic Word dataset, the ICDAR 2013 Training dataset [13] (229 scene images captured by a professional camera) and the ICDAR 2015 Training dataset [12] (1000 scene images captured by Google Glass). For every image, we randomly crop up to 30% of its width and height. We use standard Stochastic Gradient Descent with momentum 0.9 and a learning rate of 10^−3, divided by 10 after each epoch. One mini-batch takes about 500 ms on an NVidia K80 GPU.
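The optimization schedule above (SGD, momentum 0.9, learning rate 10^−3 divided by 10 after each epoch) maps directly onto standard optimizer and scheduler objects. A hedged PyTorch sketch, with a tiny placeholder module and synthetic data standing in for the joint network and the combined training set:

```python
import torch
import torch.nn as nn

# Placeholders so the schedule itself is runnable; the real model is the joint
# detection + recognition network and `loader` yields the combined dataset.
model = nn.Linear(10, 1)
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(5)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Divide the learning rate by 10 after each epoch, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(3):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)  # placeholder loss
        loss.backward()
        optimizer.step()
    scheduler.step()  # lr: 1e-3 -> 1e-4 -> 1e-5
```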

                        end-to-end              word spotting           speed
Method                  strong  weak  generic   strong  weak  generic   fps
Deep2Text [28]          0.81    0.79  0.77      0.85    0.83  0.79      1.0
TextSpotter [19]        0.77    0.63  0.54      0.85    0.66  0.57      1.0
StradVision [12]        0.81    0.79  0.67      0.84    0.83  0.70      ?
Jaderberg et al. [8]    0.86    –     –         0.90    0.76  –         *0.3
Gupta et al. [6]        –       –     –         –       0.85  –         *0.4
Deep TextSpotter        0.89    0.86  0.77      0.92    0.89  0.81      *10.0

Table 2. ICDAR 2013 dataset – End-to-end scene text recognition accuracy (f-measure), depending on the lexicon size and whether digits are excluded from the evaluation (denoted as word spotting). Methods running on a GPU are marked with an asterisk.

                                  end-to-end              word spotting           speed
Method                            strong  weak  generic   strong  weak  generic   fps
TextSpotter [19]                  0.35    0.20  0.16      0.37    0.21  0.16      1.0
StradVision [12]                  0.44    –     –         0.46    –     –         ?
TextProposals + DictNet [4, 8]    0.53    0.50  0.47      0.56    0.52  0.50      0.2
Deep TextSpotter                  0.54    0.51  0.47      0.58    0.53  0.51      *9.0

Table 3. ICDAR 2015 dataset – End-to-end scene text recognition accuracy (f-measure). Methods running on a GPU are marked with an asterisk.

4. Experiments

We trained our model once^1 and then evaluated its accuracy on three standard datasets. We evaluate the model in an end-to-end setup, where the objective is to localize and recognize all words in the image in a single step, using the standard evaluation protocol associated with each dataset.

4.1. ICDAR 2013 dataset

In the ICDAR evaluation schema [13, 12], each image in

the test set is associated with a list of words (lexicon), which

contains the words that the method should localize and rec-

ognize, as well as an increasing number of random “distrac-

tor” words. There are three sizes of lists provided with each image, depending on how heavily contextualized their content is to the specific image:

• strongly contextualized – 100 words specific to each
image, contains all words in the image and the remain-

ing words are “distractors”

• weakly contextualized – all words in the testing set,
same list for every image

• generic – all words in the testing set plus 90k English
words

A word is considered correctly recognized when its Intersection-over-Union (IoU) with the ground truth is above 0.5 and the transcription is identical, using case-insensitive comparison [12].
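Under this protocol, a detection counts as a hit only if it passes both the geometric and the transcription test. A small sketch of that check, with axis-aligned boxes used for simplicity and illustrative helper names:

```python
def iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def is_correct(det_box, det_text, gt_box, gt_text):
    """ICDAR end-to-end criterion: IoU > 0.5 and case-insensitive transcription match."""
    return iou_axis_aligned(det_box, gt_box) > 0.5 and det_text.lower() == gt_text.lower()
```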

The ICDAR 2013 Dataset [13] is the most-frequently

cited dataset for scene text evaluation. It consists of 255

testing images with 716 annotated words; the images were taken by a professional camera, so text is typically horizontal and the camera is almost always aimed at it. The dataset is sometimes referred to as the Focused Scene Text dataset.

^1 Full source code and the trained model are publicly available at https://github.com/MichalBusta/DeepTextSpotter

The proposed model achieves state-of-the-art text recog-

nition accuracy (see Table 2) for all 3 lexicon sizes. In the

end-to-end setup, where all lexicon words plus all digits in an image should be recognized, the maximal f-measure it achieves is 0.89/0.86/0.77 for strongly, weakly and generically contextualized lexicons respectively. Each image is first resized to 544 × 544 pixels; the average processing time is 100 ms per image on an NVidia K80 GPU for the whole pipeline.

While trained on the same training data, our model outperforms the combination of the state-of-the-art localization method of Gupta et al. [6] with the state-of-the-art recognition method of Jaderberg et al. [8] by at least 3 percentage points on every measure, thus demonstrating the advantage

of the joint training for the end-to-end task of our model. It

is also more than 20 times faster than the method of Gupta et

al. [6].

Let us further note that our model would not be considered a state-of-the-art text localization method according to the text localization evaluation protocol, because the standard DetEval tool used for evaluation is based on a series of thresholds which require at least an 80% intersection-over-union with bounding boxes created by human annotators.

Our method in contrast does not always achieve the required

80% overlap, but it is still mostly able to recognize the text

correctly even when the overlap is lower (see Figure 5).

We argue that evaluating methods purely on text local-

ization accuracy without subsequent recognition is not very

informative, because the text localization “accuracy” only

aims to fit the way human annotators create bounding boxes

around text, but it does not give any estimate of how well a text recognition phase would read the text after a successful localization, which should be the prime objective of the text localization metrics.

Figure 6. End-to-end scene text recognition samples from the ICDAR 2015 dataset. Model output in red, ground truth in green. Best viewed zoomed in color.

Figure 7. All the images of the ICDAR 2013 Testing set where the proposed method fails to correctly recognize any text (i.e. images with 0% recall).

The main limitation of the proposed model is single characters or short snippets of digits and characters (see Fig-

ure 7), which may be partially caused by the fact that such

examples are not very frequent in the training set.

4.2. ICDAR 2015 dataset

The ICDAR 2015 dataset was introduced in the ICDAR

2015 Robust Reading Competition [12] and it uses the same

evaluation protocol as the ICDAR 2013 dataset in the previ-

ous section. The dataset consists of 500 test images, which

were collected by people wearing Google Glass devices and

walking in Singapore. Subsequently, all images with text

were selected and annotated. The images in the dataset were taken “not having text in mind”; therefore, text is much smaller and the images contain a high variability of text

fonts and sizes. They also include many realistic effects

– e.g. occlusion, perspective distortion, blur or noise, so as a

result the dataset is significantly more challenging than the

ICDAR 2013 dataset (Section 4.1), which contains typically

large horizontal text.

The proposed model achieves state-of-the-art end-to-end

text recognition accuracy (see Table 3 and Figure 6) for all

3 lexicon sizes. In our experiments, the average processing time was 110 ms per image on an NVidia K80 GPU (the image is first resized to 608 × 608 pixels), which makes the proposed model 45 times faster than the currently best published method of Gomez et al. [4].

The main failure mode of the proposed method is blurry or noisy text (see Figure 8), which are effects not present in the training set (Section 3.5). The method also often fails to detect small text (less than 15 pixels high), which again is due to the lack of such samples in the training stage.

Figure 8. Main failure modes on the ICDAR 2015 dataset. Blurred and noisy text (top), vertical text (top) and small text (bottom). Best viewed zoomed in color.

Method              recall   precision   f-measure
Method A [27]       28.33    68.42       40.07
Method B [27]       9.97     54.46       16.85
Method C [27]       1.66     4.15        2.37
Deep TextSpotter    16.75    31.43       21.85

Table 4. COCO-Text dataset – End-to-end text recognition.

4.3. COCO-Text dataset

The COCO-Text dataset [27] was created by annotating

the standard MS COCO dataset [16], which captures im-

ages of complex everyday scenes. As a result, the dataset

contains 63,686 images with 173,589 labeled text regions,

so it is two orders of magnitude larger than any other scene

text dataset. Unlike the ICDAR datasets, there is no lexicon

used in the evaluation, so methods have to recognize text

without any prior knowledge.

The proposed model demonstrates competitive results in text recognition accuracy (see Table 4 and Figure 9), being surpassed only by Method A^2.

^2 Method A [27] was authored by Google and neither the training data nor the algorithm is published.

5. Conclusion

A novel framework for scene text localization and recog-

nition was proposed. The model is trained for both text de-

tection and recognition in a single training framework.

The proposed model achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets (ICDAR 2013 and ICDAR 2015), whilst being an order of magnitude faster than the previous methods – the whole pipeline runs at 10 frames per second on an NVidia K80 GPU. Our model showed that the state-of-the-art object detection methods [22, 23] can be extended for text detection and recognition, taking into account specifics of text, while still maintaining a low computational complexity.

Figure 9. End-to-end scene text recognition samples from the COCO-Text dataset. Model output in red, ground truth in green. Best viewed zoomed in color.

We also demonstrated the advantage of the joint training

for the end-to-end task, by outperforming the ad-hoc com-

bination of the state-of-the-art localization and state-of-the-

art recognition methods [6, 4, 8], while exploiting the same

training data.

Last but not least, we showed that optimizing localiza-

tion accuracy on human-annotated bounding boxes might

not improve performance of an end-to-end system, as there

is not a clear link between how well a method fits the bound-

ing boxes created by a human annotator and how well a

method reads text. Future work includes extending the

training set with more realistic effects, single characters and

digits.

Acknowledgment

JM was supported by the Czech Science Foundation

Project GACR P103/12/G084, LN and MB by Technol-

ogy Agency of the Czech Republic research program

TE01020415 (V3C – Visual Computing Competence Cen-

ter). Lukas would also like to acknowledge the support

of the Google PhD Fellowship and the Google Research

Award.


References

[1] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, Oct. 2007.
[2] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
[3] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[4] L. Gomez-Bigorda and D. Karatzas. TextProposals: A text-specific selective search algorithm for word spotting in the wild. arXiv preprint arXiv:1604.02619, 2016.
[5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[6] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[7] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NIPS Deep Learning Workshop 2014, 2014.
[8] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016.
[9] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[10] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In European Conference on Computer Vision, pages 512–528. Springer, 2014.
[11] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
[12] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, B. Andrew, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 robust reading competition. In ICDAR 2015, pages 1156–1160. IEEE, 2015.
[13] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, et al. ICDAR 2013 robust reading competition. In ICDAR 2013, pages 1484–1493. IEEE, 2013.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. TextBoxes: A fast text detector with a single deep neural network. arXiv preprint arXiv:1611.06779, 2016.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[18] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint arXiv:1703.01086, 2017.
[19] L. Neumann and J. Matas. Real-time lexicon-free scene text localization and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1872–1885, Sept. 2016.
[20] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[24] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016.
[27] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.
[28] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):970–983, 2014.
[29] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
