This guide provides step-by-step instructions for training Tesseract 5 in a Docker container. Docker allows you to create a reproducible environment for training Tesseract OCR models. By following the steps outlined below, you can set up a Docker container with Ubuntu, install Tesseract 5 and the necessary training tools, obtain training data, organize the data, and start the training process.
-
Open the terminal.
-
Pull the Ubuntu Docker image:
docker pull ubuntu
If you are interested in a specific version, you can specify it:
docker pull ubuntu:22.04
-
Run the Docker image:
docker run -ti --rm ubuntu /bin/bash
Note: By default, the Ubuntu Docker image does not include the lsb_release command. You can use the cat command to check the OS information instead.
-
Check the OS version:
cat /etc/os-release
If the lsb-release package is not installed, update the package sources and install it:
apt update && apt install lsb-release
Verify the OS version again:
lsb_release -a
-
Create a directory to share between your host system and the Docker container. In the container's terminal, create a directory named Docker_Share at the filesystem root, which is where the container shell starts:
mkdir -p /Docker_Share
Verify that the directory was created:
ls
-
In a separate terminal on your host machine, check the current running container ID:
docker ps
Make note of the container ID.
-
Save the Docker container state as a new image:
docker commit -p container_id new_image_name
For example:
docker commit -p 3409ehfu384f myubuntu
Replace container_id with the ID of the container obtained in the previous step, and new_image_name with the desired name for the new image.
-
Verify that the new image was created:
docker images
-
Stop the Docker container:
docker stop container_id
Replace container_id with the ID of the container obtained earlier.
-
Restart the container with the shared data:
docker run -ti -v /host/machine/dir:/Docker_Share image_name /bin/bash
For example:
docker run -ti -v C:\training_data:/Docker_Share myubuntu /bin/bash
Replace /host/machine/dir with the directory path on your host machine that you want to share with the container, and image_name with the name of the new image created in the previous step. The trailing /bin/bash starts the container with an interactive shell.
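The restart step can be wrapped in a small helper that checks the host directory before the container starts. This is only a sketch under stated assumptions: the function name run_with_share and the DRY_RUN variable are hypothetical, and it assumes a POSIX shell on the host (on Windows, use the C:\ form shown in the example instead).

```shell
# Hypothetical helper: start a container from IMAGE with HOST_DIR mounted
# at /Docker_Share. Set DRY_RUN=1 to print the docker command instead of
# running it.
run_with_share() {
    host_dir=$1
    image=$2
    if [ ! -d "$host_dir" ]; then
        echo "error: host directory not found: $host_dir" >&2
        return 1
    fi
    # Resolve to an absolute path so docker treats it as a bind mount
    # rather than as a volume name.
    abs_dir=$(cd "$host_dir" && pwd)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker run -ti -v $abs_dir:/Docker_Share $image /bin/bash"
    else
        docker run -ti -v "$abs_dir:/Docker_Share" "$image" /bin/bash
    fi
}
```

Calling it as `DRY_RUN=1 run_with_share ./training_data myubuntu` prints the command that would be run, which is a cheap way to confirm the mount path before starting the container.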
-
In the container's terminal, update the package sources and install Git:
apt update && apt install git
-
Clone the Tesseract repository:
git clone https://github.com/tesseract-ocr/tesseract.git
Verify that the tesseract directory was created:
ls
-
Install auxiliary libraries required for Tesseract:
apt update && apt install autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev libwebpdemux2 libwebp-dev libopenjp2-7-dev libgif-dev libarchive-dev libcurl4-openssl-dev libicu-dev libpango1.0-dev libcairo2-dev libleptonica-dev
-
Navigate to the /tesseract directory:
cd /tesseract
-
Run the autogen.sh script:
./autogen.sh
-
Run the configure script:
./configure
-
Build and install Tesseract OCR 5:
make
make install
ldconfig
-
Install the Tesseract training tools:
make training
make training-install
-
Clone the tesstrain repository:
git clone https://github.com/tesseract-ocr/tesstrain.git
-
Navigate to the tesstrain directory:
cd /tesseract/tesstrain
-
Install wget and the required Python libraries:
apt update && apt install wget python3-pip
pip install -r requirements.txt
-
Fetch language data:
make tesseract-langdata
Setup is now complete up to and including fetching the Tesseract language data. Before continuing with model training, you need to gather and organize the training data (images and ground truth files).
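Before moving on, it can help to confirm that the build steps left the expected binaries on the PATH. The helper below is a minimal sketch: check_tools is a hypothetical name, and the tool list in the example call is an assumption about what make install and make training-install provide.

```shell
# Hypothetical helper: report which of the given commands are missing
# from PATH. Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call inside the container (not run here):
# check_tools tesseract lstmtraining combine_tessdata unicharset_extractor text2image
```

If any tool is reported as missing, rerunning the corresponding make step is usually the first thing to try.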
Next Steps:
Obtain training data files (.tif and .gt.txt).
Organize them and move them to the shared directory (Docker_Share).
Note: The following steps are required to proceed with training your Tesseract OCR model.
To train a Tesseract OCR model, you need the following training data:
- [lang].[font].exp[number].tif (image file containing a single line of text)
- [lang].[font].exp[number].gt.txt (ground truth text file)
For example:
- chi_tra.DFKai.exp0.tif
- chi_tra.DFKai.exp0.gt.txt
Optional training data includes:
- [lang].[font].exp[number].box
The .box files record the position of each character in the image, which can improve the training process and model accuracy.
Move all the training data into the directory shared with the Docker container. For example, if your shared directory on the host machine is C:\training_data, place all the .gt.txt, .tif, and .box files in that directory.
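Because every image needs a ground truth file with the same base name, a quick check of the shared directory can catch unpaired files before training. The following is a sketch only; check_pairs is a hypothetical helper, not part of tesstrain.

```shell
# Hypothetical helper: list every .tif in DIR that has no matching .gt.txt.
# Prints one line per unpaired image; returns non-zero if any are found.
check_pairs() {
    dir=$1
    rc=0
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # the glob matched nothing
        gt="${img%.tif}.gt.txt"
        if [ ! -f "$gt" ]; then
            echo "missing ground truth: $(basename "$gt")"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_pairs /Docker_Share` inside the container prints nothing and returns success when every .tif has its .gt.txt.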
-
Copy the training data from the shared directory to the appropriate location:
cp -r /Docker_Share /tesseract/tesstrain/data/[lang].[font]-ground-truth
Replace [lang].[font] with the appropriate language and font information.
-
Download the traineddata files you need from the tessdata_best repository. Always download the eng.traineddata file, whichever language you are training, plus the file for that language. For example, if you are training Chinese Traditional (chi_tra), also download the chi_tra.traineddata file.
-
Move the downloaded traineddata files into the shared directory. For example, move eng.traineddata and chi_tra.traineddata to C:\training_data on the host machine.
-
Move the traineddata files to the default training directory:
mv /Docker_Share/*.traineddata /usr/local/share/tessdata/
Now your training data is organized and ready for training the new model.
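As an alternative to downloading on the host and moving the files through the shared directory, the traineddata files could be fetched directly inside the container with wget, which was installed earlier. This is a sketch under assumptions: the URL layout in TESSDATA_BEST_URL is assumed rather than taken from this guide and should be verified against the tessdata_best repository, and both function names are hypothetical.

```shell
# Assumed URL layout for the tessdata_best repository; verify before use.
TESSDATA_BEST_URL="https://github.com/tesseract-ocr/tessdata_best/raw/main"

# Print the download URL for one language code, e.g. eng or chi_tra.
traineddata_url() {
    echo "$TESSDATA_BEST_URL/$1.traineddata"
}

# Hypothetical helper: download each requested language into DEST_DIR.
fetch_traineddata() {
    dest_dir=$1
    shift
    mkdir -p "$dest_dir" || return 1
    for lang in "$@"; do
        wget -q -O "$dest_dir/$lang.traineddata" "$(traineddata_url "$lang")" || return 1
    done
}

# Example call inside the container (not run here):
# fetch_traineddata /usr/local/share/tessdata eng chi_tra
```

Keeping the URL construction in its own function makes it easy to print and inspect the address before any download is attempted.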
-
Navigate to the training directory:
cd /tesseract/tesstrain
-
If you have .box files and want to avoid overwriting them during the training process, modify the Makefile:
apt update && apt install nano
cd /tesseract/tesstrain
nano Makefile
Locate the rules starting with %.box and comment out each rule together with its recipe line.
Original lines:
%.box: %.png %.gt.txt
	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"
%.box: %.bin.png %.gt.txt
	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"
%.box: %.nrm.png %.gt.txt
	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"
%.box: %.raw.png %.gt.txt
	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"
%.box: %.tif %.gt.txt
	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
Modified lines:
# %.box: %.png %.gt.txt
# 	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.png" -t "$*.gt.txt" > "$@"
# %.box: %.bin.png %.gt.txt
# 	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.bin.png" -t "$*.gt.txt" > "$@"
# %.box: %.nrm.png %.gt.txt
# 	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.nrm.png" -t "$*.gt.txt" > "$@"
# %.box: %.raw.png %.gt.txt
# 	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.raw.png" -t "$*.gt.txt" > "$@"
# %.box: %.tif %.gt.txt
# 	PYTHONIOENCODING=utf-8 $(PY_CMD) $(GENERATE_BOX_SCRIPT) -i "$*.tif" -t "$*.gt.txt" > "$@"
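The same edit can also be scripted rather than made by hand. The sketch below assumes GNU sed (the default in the Ubuntu image) and that each %.box rule has exactly one recipe line, as in the listing above; comment_out_box_rules is a hypothetical helper name, and the Makefile in your tesstrain checkout may differ from the one shown here.

```shell
# Hypothetical helper: comment out each "%.box:" rule header and the recipe
# line directly below it, keeping a backup copy as <file>.bak.
comment_out_box_rules() {
    # On a matching header: prefix it with "# ", print it and load the
    # next line (n), then prefix that recipe line as well.
    sed -i.bak -E '/^%\.box:/{s/^/# /;n;s/^/# /;}' "$1"
}

# Example call inside the container (not run here):
# comment_out_box_rules /tesseract/tesstrain/Makefile
```

The backup file makes it easy to restore the original rules later by copying Makefile.bak back over Makefile.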
In nano, press Ctrl + O and then Enter to save the modified Makefile, then press Ctrl + X to exit the editor.
-
Start training a new model:
make training MODEL_NAME=[lang].[font] TESSDATA=/usr/local/share/tessdata
Replace [lang].[font] with the appropriate language and font information.
-
If you want to fine-tune an existing model, use the START_MODEL parameter:
make training MODEL_NAME=[lang].[font] START_MODEL=[lang] TESSDATA=/usr/local/share/tessdata
Replace [lang].[font] with the appropriate language and font information.
-
After training, you can find the traineddata of the new model in the default output path:
cd /tesseract/tesstrain/data/[lang].[font]
ls
Replace [lang].[font] with the appropriate language and font information.
-
Copy the traineddata of the new model to the shared directory:
cp /tesseract/tesstrain/data/[lang].[font]/[lang].[font].traineddata /Docker_Share
Replace [lang].[font] with the appropriate language and font information.
The traineddata file will now be available in the shared directory on your host machine.
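The copy step can be guarded so that a missing or empty model file is reported instead of silently producing nothing in the shared directory. This is a minimal sketch; export_model is a hypothetical helper that takes the path to the traineddata file and the destination directory as arguments.

```shell
# Hypothetical helper: copy a trained model file SRC into DEST_DIR after
# checking that SRC exists and is not empty.
export_model() {
    src=$1        # e.g. the [lang].[font].traineddata path used above
    dest_dir=$2   # e.g. /Docker_Share
    if [ ! -s "$src" ]; then
        echo "error: $src is missing or empty; did training finish?" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/" && echo "copied $src to $dest_dir"
}
```

The function returns non-zero on failure, so it can be chained after the training command with && in a script.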
For detailed steps and additional information, please refer to the following resources: