Building and installing tesseract for python on Ubuntu 14.04.
root@server:/home/user/tesseract# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.1 LTS"
Install packages
sudo apt-get install python-distutils-extra tesseract-ocr tesseract-ocr-eng libopencv-dev libtesseract-dev libleptonica-dev python-all-dev swig libcv-dev python-opencv python-numpy python-setuptools build-essential subversion
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev zlib1g-dev
For tesseract training, also install the following packages:
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
Download leptonica
wget http://www.leptonica.com/source/leptonica-1.71.tar.gz
tar xvf leptonica-1.71.tar.gz
and build it
cd leptonica-1.71
./configure
make
sudo make install
Download tesseract-ocr
wget https://bitbucket.org/3togo/python-tesseract/downloads/tesseract-3.03-rc1.tar.gz
tar xvf tesseract-3.03-rc1.tar.gz
and build it
cd tesseract-3.03
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
Download (checkout) python-tesseract
svn checkout http://python-tesseract.googlecode.com/svn/trunk/src python-tesseract
I used revision 659.
and build it
cd python-tesseract
python setup.py clean
python setup.py build
python setup.py install
After that, try to run your python example.
If you get the following error:
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
AdaptedTemplates != NULL:Error:Assert failed:in file adaptmatch.cpp, line 174
Segmentation fault (core dumped)
You can fix it by patching the "mainblk.cpp" file in the tesseract-3.03/ccutil/ folder as follows.
Find this code in "mainblk.cpp":
if (argv0 != NULL) {
  datadir = argv0;
} else {
  if (getenv("TESSDATA_PREFIX")) {
    datadir = getenv("TESSDATA_PREFIX");
  } else {
#ifdef TESSDATA_PREFIX
#define _STR(a) #a
#define _XSTR(a) _STR(a)
    datadir = _XSTR(TESSDATA_PREFIX);
#undef _XSTR
#undef _STR
#endif
  }
}
// insert code here
// datadir may still be empty:
if (datadir.length() == 0) {
  datadir = "./";
At the "insert code here" mark, add:
if (getenv("TESSDATA_PREFIX")) {
  datadir = getenv("TESSDATA_PREFIX");
} else {
  // check dir with tessdata
  struct stat sb;
  if (stat("/usr/share/tesseract-ocr/tessdata", &sb) == 0 && S_ISDIR(sb.st_mode)) {
    datadir = "/usr/share/tesseract-ocr";
  }
}
and add this include at the top of the file:
#include <sys/stat.h>
Then rebuild and reinstall tesseract-ocr:
cd tesseract-3.03
make
sudo make install
After that, the data directory is resolved in this order: if the TESSDATA_PREFIX environment variable is set, it is used; otherwise, if /usr/share/tesseract-ocr/ contains a tessdata folder, that one is used; otherwise the directory of your python example module (./) is checked for a tessdata folder.
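The lookup order above can be mirrored in a few lines of Python, which may help when checking which directory the patched build would pick on a given machine. This is only a sketch: the names resolve_tessdata_prefix, SYSTEM_PREFIX, environ and isdir are hypothetical, and the argv0 and compile-time TESSDATA_PREFIX branches of the original code are deliberately left out.

```python
import os

# Directory checked by the patch in this post.
SYSTEM_PREFIX = "/usr/share/tesseract-ocr"


def resolve_tessdata_prefix(environ=None, isdir=os.path.isdir):
    """Return the parent directory of tessdata, following the patched lookup order.

    Simplified: the argv0 and compile-time TESSDATA_PREFIX branches are ignored.
    """
    if environ is None:
        environ = os.environ
    prefix = environ.get("TESSDATA_PREFIX")
    if prefix is not None:
        # 1. The environment variable wins when it is set.
        datadir = prefix
    elif isdir(os.path.join(SYSTEM_PREFIX, "tessdata")):
        # 2. Otherwise use the system-wide tessdata folder if it exists.
        datadir = SYSTEM_PREFIX
    else:
        datadir = ""
    # 3. An empty result falls back to the current directory.
    return datadir if datadir else "./"


if __name__ == "__main__":
    print(resolve_tessdata_prefix())
```

The environ and isdir parameters exist only so the function can be tried without touching the real environment or filesystem.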
P.S. By the way, the repo already contains a prebuilt deb package: https://bitbucket.org/3togo/python-tesseract/
Update:
If you get the following error message when importing the tesseract module in Python:
Traceback (most recent call last):
  File "test_module.py", line 5, in <module>
    from tesseract_ocr import TesseractOCR
  File "/home/user/ocr/module/tesseract_ocr.py", line 9, in <module>
    import tesseract
  File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/_tesseract.so: undefined symbol: cvSetData
Check that the OpenCV library is linked into _tesseract.so, because cvSetData is an OpenCV function:
ldd _tesseract.so | grep libopencv
If the output is empty, try building _tesseract.so with this command:
sudo c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/tesseract_wrap.o build/temp.linux-x86_64-2.7/main.o -lstdc++ -ltesseract -llept -lopencv_superres -lopencv_video -lopencv_videostab -lopencv_ml -lopencv_ocl -lopencv_contrib -lopencv_flann -lopencv_calib3d -lopencv_imgproc -lopencv_core -lopencv_legacy -lopencv_stitching -lopencv_features2d -lopencv_photo -lopencv_ts -lopencv_objdetect -lopencv_highgui -lopencv_gpu -o build/lib.linux-x86_64-2.7/_tesseract.so
NB: Run python setup.py build before the command above, otherwise the following errors will be printed:
c++: error: build/temp.linux-x86_64-2.7/tesseract_wrap.o: No such file or directory
c++: error: build/temp.linux-x86_64-2.7/main.o: No such file or directory
After a successful build of _tesseract.so, copy it to the python-tesseract directory:
sudo cp ./build/lib.linux-x86_64-2.7/_tesseract.so .
Then check again that the OpenCV library is linked into _tesseract.so:
ldd _tesseract.so | grep libopencv
The output should look something like this:
libopencv_core.so.2.4 => /usr/local/lib/libopencv_core.so.2.4 (0x000070d310313370)
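If you prefer to run the same check from Python (for example in a build script), a minimal sketch could look like this. It assumes ldd is available on the PATH; the helper names opencv_libs and check_shared_object are made up for illustration.

```python
import subprocess


def opencv_libs(ldd_output):
    """Return the names of OpenCV libraries listed in ldd output text."""
    libs = []
    for line in ldd_output.splitlines():
        fields = line.split()
        # Each ldd line starts with the library name, e.g. libopencv_core.so.2.4
        if fields and fields[0].startswith("libopencv"):
            libs.append(fields[0])
    return libs


def check_shared_object(path="_tesseract.so"):
    """Run ldd on the given file and return the OpenCV libraries it links to."""
    output = subprocess.check_output(["ldd", path])
    return opencv_libs(output.decode("utf-8", "replace"))
```

An empty list from check_shared_object() corresponds to the empty grep output above, i.e. OpenCV is not linked.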
Now install python-tesseract:
python setup.py install
After that, the "undefined symbol: cvSetData" problem should be solved.
If you get the following error:
>>> import tesseract
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
  File "/usr/lib/python2.7/dist-packages/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/lib/python2.7/dist-packages/_tesseract.x86_64-linux-gnu.so: undefined symbol: pixGenerateFlateData
Try an older python-tesseract svn revision (e.g. 659 or 660):
https://code.google.com/p/python-tesseract/source/detail?r=659
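Both import failures in this post share the same "undefined symbol" pattern, so a small helper can pull the symbol name out of the ImportError text and print the matching fix. This is just a sketch: the function names and the HINTS table are hypothetical, and the table only covers the two symbols discussed here.

```python
import re

# Fixes described in this post, keyed by the missing symbol.
HINTS = {
    "cvSetData": "relink _tesseract.so against the OpenCV libraries",
    "pixGenerateFlateData": "try an older python-tesseract svn revision (e.g. 659 or 660)",
}


def undefined_symbol(message):
    """Return the symbol named in an 'undefined symbol' error, or None."""
    match = re.search(r"undefined symbol: (\w+)", message)
    return match.group(1) if match else None


def hint_for(message):
    """Return a suggested fix for the error text, or None if it is a different error."""
    symbol = undefined_symbol(message)
    if symbol is None:
        return None
    return HINTS.get(symbol, "no hint recorded for %s" % symbol)


if __name__ == "__main__":
    try:
        import tesseract  # the module built in this post
    except ImportError as exc:
        print(hint_for(str(exc)))
```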
Update:
I've added a new tutorial: Installing tesseract for python on Ubuntu 15.10.