Wednesday, October 29, 2014

Installing tesseract for python on Ubuntu 14.04

Building and installing tesseract for python on Ubuntu 14.04.

root@server:/home/user/tesseract# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.1 LTS"

Install packages
sudo apt-get install python-distutils-extra tesseract-ocr tesseract-ocr-eng libopencv-dev libtesseract-dev libleptonica-dev python-all-dev swig libcv-dev python-opencv python-numpy python-setuptools build-essential subversion
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev libjpeg62-dev libtiff4-dev zlib1g-dev

For tesseract training, also install the following packages:
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev

Download leptonica
wget http://www.leptonica.com/source/leptonica-1.71.tar.gz
tar xvf leptonica-1.71.tar.gz

and build it
cd leptonica-1.71
./configure
make
sudo make install

Download tesseract-ocr
wget https://bitbucket.org/3togo/python-tesseract/downloads/tesseract-3.03-rc1.tar.gz
tar xvf tesseract-3.03-rc1.tar.gz

and build it
cd tesseract-3.03
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

Download (checkout) python-tesseract
svn checkout http://python-tesseract.googlecode.com/svn/trunk/src python-tesseract

I used revision 659.

and build it
cd python-tesseract
python setup.py clean
python setup.py build
python setup.py install

After that, try running your Python example.
If you get an error like this:
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
AdaptedTemplates != NULL:Error:Assert failed:in file adaptmatch.cpp, line 174
Segmentation fault (core dumped)

You can fix it by patching the "mainblk.cpp" file in the tesseract-3.03/ccutil/ folder as follows.

Find this code in "mainblk.cpp":
  if (argv0 != NULL) {
    datadir = argv0;
  } else {
    if (getenv("TESSDATA_PREFIX")) {
      datadir = getenv("TESSDATA_PREFIX");
    } else {
#ifdef TESSDATA_PREFIX
#define _STR(a) #a
#define _XSTR(a) _STR(a)
    datadir = _XSTR(TESSDATA_PREFIX);
#undef _XSTR
#undef _STR
#endif
    }
  }

  // insert code here

  // datadir may still be empty:
  if (datadir.length() == 0) {
    datadir = "./";

At the "insert code here" comment, add:
  if (getenv("TESSDATA_PREFIX")) {
    datadir = getenv("TESSDATA_PREFIX");
  } else {
    // check dir with tessdata
    struct stat sb;
    if (stat("/usr/share/tesseract-ocr/tessdata", &sb) == 0 && S_ISDIR(sb.st_mode)) {
      datadir = "/usr/share/tesseract-ocr";
    }
  }

and add this include at the top of the file:
#include <sys/stat.h>

Then rebuild and reinstall tesseract-ocr:
cd tesseract-3.03
make
sudo make install

After this change, the lookup order is as follows. If the TESSDATA_PREFIX environment variable is set, it is used. Otherwise, if /usr/share/tesseract-ocr/ contains a tessdata folder, that one is used. Otherwise, the directory with your Python example module (./) is checked for a tessdata folder.
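
To make the lookup order concrete, here is a minimal Python sketch of the same decision. It is only an illustration: the function name resolve_tessdata_parent and its parameters are made up for this example, and it ignores the argv0 and compile-time branches of the real mainblk.cpp.

```python
import os

SYSTEM_PARENT = "/usr/share/tesseract-ocr"
SYSTEM_TESSDATA = SYSTEM_PARENT + "/tessdata"


def resolve_tessdata_parent(environ=None, isdir=os.path.isdir):
    """Return the directory that is expected to contain "tessdata".

    Hypothetical helper mirroring the patched lookup order:
    TESSDATA_PREFIX, then the system-wide folder, then "./".
    """
    if environ is None:
        environ = os.environ
    datadir = ""
    prefix = environ.get("TESSDATA_PREFIX")
    if prefix is not None:
        # The environment variable wins whenever it is set.
        datadir = prefix
    elif isdir(SYSTEM_TESSDATA):
        # Fall back to the system-wide tessdata folder if it exists.
        datadir = SYSTEM_PARENT
    if not datadir:
        # Nothing found: use the current directory, as the C++ code does.
        datadir = "./"
    return datadir
```

Calling resolve_tessdata_parent() with no arguments uses the real environment and filesystem, so it can be run next to your example script to see which directory would be picked.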

P.S. Take a look at the repo, by the way; it already has prebuilt deb packages: https://bitbucket.org/3togo/python-tesseract/

Update:

If you get the following error message when importing the tesseract module in Python:
Traceback (most recent call last):
  File "test_module.py", line 5, in <module>
    from tesseract_ocr import TesseractOCR
  File "/home/user/ocr/module/tesseract_ocr.py", line 9, in <module>
    import tesseract
  File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/local/lib/python2.7/dist-packages/python_tesseract-0.9-py2.7-linux-x86_64.egg/_tesseract.so: undefined symbol: cvSetData

Check that the OpenCV libraries are linked into _tesseract.so, because cvSetData is an OpenCV function:
ldd _tesseract.so | grep libopencv

If the output is empty, try building _tesseract.so with this command:

sudo c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/tesseract_wrap.o build/temp.linux-x86_64-2.7/main.o -lstdc++ -ltesseract -llept -lopencv_superres -lopencv_video -lopencv_videostab -lopencv_ml -lopencv_ocl -lopencv_contrib -lopencv_flann -lopencv_calib3d -lopencv_imgproc -lopencv_core -lopencv_legacy -lopencv_stitching -lopencv_features2d -lopencv_photo -lopencv_ts -lopencv_objdetect -lopencv_highgui -lopencv_gpu -o build/lib.linux-x86_64-2.7/_tesseract.so

NB: Run python setup.py build before the command above; otherwise the following errors will be printed:
c++: error: build/temp.linux-x86_64-2.7/tesseract_wrap.o: No such file or directory
c++: error: build/temp.linux-x86_64-2.7/main.o: No such file or directory
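
A small pre-flight check can catch this before running the link command. The sketch below only looks for the two object files named in the errors above. The function name missing_objects and its isfile parameter are hypothetical, and the build/temp.linux-x86_64-2.7 path is the one from this build, so adjust it for other platforms or Python versions.

```python
import os

# Object files that the link command above expects to find.
REQUIRED_OBJECTS = ("tesseract_wrap.o", "main.o")


def missing_objects(temp_dir="build/temp.linux-x86_64-2.7", isfile=os.path.isfile):
    """Return the required object files that are absent from temp_dir."""
    return [name for name in REQUIRED_OBJECTS
            if not isfile(os.path.join(temp_dir, name))]
```

Run from the python-tesseract directory, missing_objects() returns an empty list once python setup.py build has produced both files.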

After _tesseract.so builds successfully, copy it to the python-tesseract directory:
sudo cp ./build/lib.linux-x86_64-2.7/_tesseract.so .

Then check again that the OpenCV libraries are linked into _tesseract.so:
ldd _tesseract.so | grep libopencv
The output should look something like this:
libopencv_core.so.2.4 => /usr/local/lib/libopencv_core.so.2.4 (0x000070d310313370)
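
The same check can be scripted, which is handy if you rebuild often. This is a rough sketch that assumes ldd prints one library per line, as in the output above. The names opencv_libs and check_extension are invented for this example.

```python
import subprocess


def opencv_libs(ldd_output):
    """Return the libopencv_* library names found in ldd output text."""
    libs = []
    for line in ldd_output.splitlines():
        parts = line.split()
        # The first field of each ldd line is the library name.
        if parts and parts[0].startswith("libopencv"):
            libs.append(parts[0])
    return libs


def check_extension(path="_tesseract.so"):
    """Run ldd on the extension module and list linked OpenCV libraries."""
    output = subprocess.check_output(["ldd", path]).decode("utf-8", "replace")
    return opencv_libs(output)
```

An empty list from check_extension() corresponds to the empty grep output described above.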

Now install python-tesseract:
python setup.py install

After that, the "undefined symbol: cvSetData" problem should be solved.

If you get the following error:
>>> import tesseract
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
  File "/usr/lib/python2.7/dist-packages/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
ImportError: /usr/lib/python2.7/dist-packages/_tesseract.x86_64-linux-gnu.so: undefined symbol: pixGenerateFlateData

Try using an older python-tesseract svn revision (e.g. 659 or 660).
https://code.google.com/p/python-tesseract/source/detail?r=659
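
When sorting out errors like these, the symbol name itself usually hints at which library is missing or mismatched. The sketch below pulls the symbol out of the ImportError text and guesses the library from its prefix. The prefixes are a heuristic assumed here (cv for OpenCV, pix for Leptonica, _ZN9tesseract for mangled Tesseract C++ names), not an exhaustive rule, and the function names are made up for the example.

```python
import re


def undefined_symbol(message):
    """Extract the symbol name from an 'undefined symbol: ...' message."""
    match = re.search(r"undefined symbol: (\S+)", message)
    return match.group(1) if match else None


def likely_library(symbol):
    """Guess which library a symbol belongs to (prefix heuristic only)."""
    if symbol.startswith("_ZN9tesseract"):
        # Mangled C++ name inside the "tesseract" namespace.
        return "Tesseract"
    if symbol.startswith("cv"):
        return "OpenCV"
    if symbol.startswith("pix"):
        return "Leptonica"
    return "unknown"
```

For example, wrapping import tesseract in try/except ImportError and passing str(error) through undefined_symbol and then likely_library tells you whether to look at the OpenCV linking step or at the Leptonica and python-tesseract versions.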

Update: 
I've added a new tutorial, "Installing tesseract for python on Ubuntu 15.10": http://delimitry.blogspot.com/2016/02/installing-tesseract-for-python-on.html

18 comments:

  1. Thanks VERY MUCH for this! Built this recently and still had to apply the fix for undefined symbol...

  2. Thanks a million. Installation worked perfectly using your guide and later on when I got some nasty errors, this was my life saviour. Explains everything crystal clear. Thank you again!!!!!

  3. Hi,

    I have Ubuntu 14.04 and am getting the following error while trying to rebuild _tesseract.so as you described. Can you please help me fix this? (I did run 'python setup.py build' before running the command to fix the .so.)

    surinder@suriubu:~/leptonica-1.71/tesseract-3.03/python-tesseract$ sudo c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/tesseract_wrap.o build/temp.linux-x86_64-2.7/main.o -lstdc++ -ltesseract -llept -lopencv_superres -lopencv_video -lopencv_videostab -lopencv_ml -lopencv_ocl -lopencv_contrib -lopencv_flann -lopencv_calib3d -lopencv_imgproc -lopencv_core -lopencv_legacy -lopencv_stitching -lopencv_features2d -lopencv_photo -lopencv_ts -lopencv_objdetect -lopencv_highgui -lopencv_gpu -o build/lib.linux-x86_64-2.7/_tesseract.so
    /usr/bin/ld: cannot find -lopencv_superres
    /usr/bin/ld: cannot find -lopencv_videostab
    /usr/bin/ld: cannot find -lopencv_ocl
    /usr/bin/ld: cannot find -lopencv_contrib
    /usr/bin/ld: cannot find -lopencv_legacy
    /usr/bin/ld: cannot find -lopencv_objdetect
    /usr/bin/ld: cannot find -lopencv_highgui
    collect2: error: ld returned 1 exit status


    I do have opencv on my machine as below :

    surinder@suriubu:~$ python
    Python 2.7.6 (default, Mar 22 2014, 22:59:56)
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import cv2
    >>>

    and my /usr/bin/ has an ld file of 1.1Mb as link to program

    Replies
    1. Please check that all opencv libs are installed.
      pkg-config --cflags --libs opencv

    2. Try to use older python-tesseract svn revision (e.g. 659).

  4. I solved the ld error with:
    sudo apt-get install libopencv-superres-dev libopencv-videostab-dev libopencv-ocl-dev libopencv-contrib-dev libopencv-legacy-dev libopencv-objdetect-dev libopencv-highgui-dev

  5. Hi,
    could u help me with this?
    import tesseract
    File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
    File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
    ImportError: /usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/_tesseract.so: undefined symbol: _ZN9tesseract16TessTextRendererC1Ev

    Replies
    1. Also have this issue... Anyone found a solution? Thanks!

    2. This comment has been removed by the author.

    3. Hi, please take a look at my new tutorial "Installing tesseract for python on Ubuntu 15.10" - there I've solved the problem with TessTextRenderer: http://delimitry.blogspot.com/2016/02/installing-tesseract-for-python-on.html

  6. I don't know who you are..
    But I will find you and hug you!

    Thank you so much sir :)

  7. Hello, when i try to run this line in my code tesseract.pixThresholdToBinary(pixImage, long(160)) the following occurs: TypeError: in method 'pixThresholdToBinary', argument 2 of type 'l_int32' anyone know how to fix this problem? Thanks! Manuel

  8. Has anyone found a solution for:

    ImportError: /usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/_tesseract.so: undefined symbol: _ZN9tesseract16TessTextRendererC1Ev

    I'm using Tesseract 3.04, libleptonica 1.72. Thanks in advance!

    Replies
    1. Take a look at my new tutorial "Installing tesseract for python on Ubuntu 15.10" - there updated version with TessTextRenderer is used: http://delimitry.blogspot.com/2016/02/installing-tesseract-for-python-on.html

  9. Thanks for the good tutorial, although I get stuck on the last error message in your guide:

    Traceback (most recent call last):
    File "test.py", line 2, in <module>
    import tesseract
    File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/tesseract.py", line 28, in <module>
    _tesseract = swig_import_helper()
    File "/usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/tesseract.py", line 24, in swig_import_helper
    _mod = imp.load_module('_tesseract', fp, pathname, description)
    ImportError: /usr/local/lib/python2.7/dist-packages/python_tesseract-0.9.1-py2.7-linux-x86_64.egg/_tesseract.so: undefined symbol: pixGenerateFlateData

    I have changed the contents of the file "/var/python-tesseract/allheader_mini.h" to the version from revision 659: https://code.google.com/p/python-tesseract/source/browse/trunk/src/allheaders_mini.h?spec=svn659&r=659

    I also added the following file at the same path, "/var/python-tesseract/allheader_mini_170.h": https://code.google.com/p/python-tesseract/source/browse/trunk/src/allheaders_mini_170.h?spec=svn659&r=659

    However, the error still shows up. I have restarted the server. What am I doing wrong?

    Replies
    1. BTW check my new tutorial "Installing tesseract for python on Ubuntu 15.10", may be it will help you: http://delimitry.blogspot.com/2016/02/installing-tesseract-for-python-on.html
