Python: Text Recognition in Photos and Images with PyOCR and Tesseract | Miyashin's Programming Skills Newsletter

Hello everyone! Miyashin here.

In this post, I'll use Python to try out text recognition (OCR) on photos and images.

Miyashin: OCR looks really useful for digitizing paper documents and streamlining office work!


What is OCR?
Setting up the virtual environment
Installing Tesseract
Installing PyOCR
Downloading the Japanese trained models
Sample code
1. Sample image
2. OCR results

What is OCR?

OCR stands for Optical Character Recognition. As the name suggests, it is a technology for reading handwritten or printed characters that appear in images and photos.

Setting up the virtual environment

I'll work with Anaconda and create a new conda virtual environment.

Environment name: ocr    Python: 3.10

conda create -n ocr python=3.10

Activate the environment you just created.

conda activate ocr
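
If you'd like to confirm from inside Python (for example in a notebook cell) that the right environment is active, a quick check like the one below may help. This is only a minimal sketch: the function name `env_report` is my own, and it assumes conda sets the `CONDA_DEFAULT_ENV` environment variable on activation, which is its usual behavior.

```python
import os
import sys


def env_report(expected_env="ocr", expected_python=(3, 10)):
    """Return a small dict describing whether the active environment matches."""
    active_env = os.environ.get("CONDA_DEFAULT_ENV")  # None outside conda
    python_version = tuple(sys.version_info[:2])
    return {
        "active_env": active_env,
        "python_version": python_version,
        "env_ok": active_env == expected_env,
        "python_ok": python_version == expected_python,
    }


if __name__ == "__main__":
    print(env_report())
```

If `env_ok` or `python_ok` comes back `False`, the notebook is probably running on a different kernel than the `ocr` environment.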

Installing Tesseract

Tesseract is an OCR engine that runs on a variety of systems. Start by installing it with the following command.

conda install -c conda-forge tesseract
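
To check that the install worked, you can look for the `tesseract` executable on the PATH and ask it for its version. The sketch below is one way to do that from Python; the helper name `tesseract_version` is hypothetical, and it assumes the environment is activated so that `tesseract` is on the PATH.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some older builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```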

Installing PyOCR

PyOCR is an OCR tool wrapper for Python. In other words, it is a library that lets you use Tesseract from a Python environment. Install it with the following command.

conda install -c conda-forge pyocr
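
Before moving on, it can be worth checking that the packages the sample code imports are actually visible to Python. The following is a small sketch using only the standard library; `missing_packages` is a name I made up, and `PIL` here is the import name of the Pillow package that the sample code relies on.

```python
import importlib.util


def missing_packages(names=("pyocr", "PIL")):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All set" if not missing else "Missing: " + ", ".join(missing))
```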

Downloading the Japanese trained models

Next, make OCR work with Japanese. Download the trained models from the links below.

【tessdata_best/jpn.traineddata】
https://github.com/tesseract-ocr/tessdata_best/blob/master/jpn.traineddata

【tessdata_best/jpn_vert.traineddata】
https://github.com/tesseract-ocr/tessdata_best/blob/master/jpn_vert.traineddata

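Each file can be downloaded from the spot outlined in red in the screenshot.

The files have finished downloading.

Next, put these two Japanese trained models into the folder below.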
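
If you prefer to fetch the two files from a script instead of the browser, something like the sketch below could work. It assumes GitHub serves the file itself when `/blob/` in the page URL is replaced with `/raw/`; the helper names `to_raw_url` and `download_models` are my own, and the files are fairly large, so the download may take a while.

```python
import urllib.request
from pathlib import Path

# The two page URLs listed above.
BLOB_URLS = [
    "https://github.com/tesseract-ocr/tessdata_best/blob/master/jpn.traineddata",
    "https://github.com/tesseract-ocr/tessdata_best/blob/master/jpn_vert.traineddata",
]


def to_raw_url(blob_url):
    """Turn a GitHub 'blob' page URL into its 'raw' file URL (assumed convention)."""
    return blob_url.replace("/blob/", "/raw/", 1)


def download_models(dest_dir="."):
    """Download both models into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for blob_url in BLOB_URLS:
        target = dest / blob_url.rsplit("/", 1)[-1]  # keep the original file name
        urllib.request.urlretrieve(to_raw_url(blob_url), str(target))
        saved.append(target)
    return saved
```

Calling `download_models("downloads")` would save both files into a `downloads` folder next to the notebook.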

C:\Users\[username]\anaconda3\envs\[environment name]\Library\bin\tessdata

The folder already contains two files.

Save the trained models there.
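
To double-check that both models ended up in the right place, you can list which of them are still missing from the tessdata folder. This is just a sketch: `missing_models` is a hypothetical helper, and you need to pass your own tessdata path (the one shown above, with your user name and environment name filled in).

```python
from pathlib import Path

MODEL_FILES = ("jpn.traineddata", "jpn_vert.traineddata")


def missing_models(tessdata_dir):
    """Return the model file names that are not present in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [name for name in MODEL_FILES if not (folder / name).is_file()]
```

An empty list means both files are in place.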

Sample code

I'm using Jupyter Notebook for this.

First, import the required libraries.

from PIL import Image
import sys

import pyocr
import pyocr.builders

Next, check which OCR tools are available.

tools = pyocr.get_available_tools()  # store the available OCR tools in a list
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]  # take the first (most recommended) tool
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

■ Output

"jpn" and "jpn_vert" are both listed, so Japanese is now supported.

Now run the OCR.
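
As the comment in the code notes, the language list is not sorted, so `langs[0]` is not necessarily Japanese. One way around that is to pick the language explicitly from a preference list. The helper below is a sketch of that idea; `pick_language` and its default preference order are my own choices, not part of PyOCR.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language code that is actually available."""
    for code in preferred:
        if code in available:
            return code
    raise ValueError(
        "None of %s found in available languages: %s"
        % (", ".join(preferred), ", ".join(available))
    )
```

With the setup above, `pick_language(tool.get_available_languages())` should pick "jpn" as long as the model file is in the tessdata folder.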

tools = pyocr.get_available_tools()
tool = tools[0]
txt = tool.image_to_string(
    Image.open('my_blog.png'),  # image file to read
    lang="jpn",  # set the language to Japanese
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)  # layout 6: a single uniform block of horizontal text
)

print("OCR result")
print(txt)
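
If you also want to know where each line of text sits in the image, PyOCR offers `LineBoxBuilder` in place of `TextBuilder`. The sketch below shows how that might look. The function names `describe_line_boxes` and `ocr_lines` are hypothetical, and it assumes each returned box exposes `content` and `position` as ((x1, y1), (x2, y2)).

```python
def describe_line_boxes(line_boxes):
    """Turn line boxes into (text, (x1, y1, x2, y2)) tuples."""
    rows = []
    for box in line_boxes:
        (x1, y1), (x2, y2) = box.position
        rows.append((box.content, (x1, y1, x2, y2)))
    return rows


def ocr_lines(image_path, lang="jpn", layout=6):
    """Run OCR on image_path and return each line with its bounding box."""
    # Imported here so describe_line_boxes works without PyOCR installed.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found")
    line_boxes = tools[0].image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.LineBoxBuilder(tesseract_layout=layout),
    )
    return describe_line_boxes(line_boxes)
```

For example, `ocr_lines('my_blog.png')` would return one tuple per recognized line, which is handy for drawing rectangles over the original image.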

The tesseract_layout option appears to accept the following values.

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
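
Since these are just numbers, it is easy to pass a wrong one by accident. A small lookup table, like the sketch below, lets you validate the value and print what it means before handing it to `tesseract_layout`. The names `PAGE_SEGMENTATION_MODES` and `layout_description` are my own; the descriptions are copied from the list above.

```python
PAGE_SEGMENTATION_MODES = {
    0: "Orientation and script detection (OSD) only.",
    1: "Automatic page segmentation with OSD.",
    2: "Automatic page segmentation, but no OSD, or OCR. (not implemented)",
    3: "Fully automatic page segmentation, but no OSD. (Default)",
    4: "Assume a single column of text of variable sizes.",
    5: "Assume a single uniform block of vertically aligned text.",
    6: "Assume a single uniform block of text.",
    7: "Treat the image as a single text line.",
    8: "Treat the image as a single word.",
    9: "Treat the image as a single word in a circle.",
    10: "Treat the image as a single character.",
    11: "Sparse text. Find as much text as possible in no particular order.",
    12: "Sparse text with OSD.",
    13: "Raw line. Treat the image as a single text line, "
        "bypassing hacks that are Tesseract-specific.",
}


def layout_description(layout):
    """Return the description for a layout number, or raise ValueError."""
    if layout not in PAGE_SEGMENTATION_MODES:
        raise ValueError("tesseract_layout must be between 0 and 13, got %r" % (layout,))
    return PAGE_SEGMENTATION_MODES[layout]
```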