Image Text Recognition in Python: Installing and Using Tesseract-OCR

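Want to read the text in an image with Python? This tutorial walks you through it. We will use Tesseract-OCR, a popular open-source OCR engine, together with the Python library pytesseract to extract text from images. Even if you are new to this, you can follow along step by step.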

1. About Tesseract-OCR

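Tesseract was originally developed at Hewlett-Packard, then released as open source and developed for many years under Google's sponsorship. It runs on Windows, macOS, and Linux, supports a large number of languages, and is quite accurate on clean, high-contrast text. pytesseract is a Python wrapper around the Tesseract OCR engine: it lets us call Tesseract from Python code.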

Official site:

Tesseract OCR: https://github.com/tesseract-ocr

2. Installing the Tesseract-OCR Engine

First, install the Tesseract OCR engine itself. The steps differ slightly by operating system:

Windows

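Download the installer: get the latest Windows installer from UB Mannheim: https://digi.bib.uni-mannheim.de/tesseract/
Install: run the downloaded installer and follow the prompts. Note: remember your installation path, for example C:\Program Files\Tesseract-OCR. You will need it when configuring the environment variables.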
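Configure the environment variables:
- Open "Control Panel" -> "System and Security" -> "System" -> "Advanced system settings".
- Click the "Environment Variables" button.
- Under "System variables", find "Path" and click "Edit".
- Click "New" and add the Tesseract installation path, for example C:\Program Files\Tesseract-OCR.
- The tessdata directory, which holds the language data, does not need to be on Path, because Path is only used to locate executables. If Tesseract later reports that it cannot find its language data, create a separate system variable named TESSDATA_PREFIX that points to the tessdata directory, for example C:\Program Files\Tesseract-OCR\tessdata.
- Click "OK" to save all changes, then open a new command prompt so the changes take effect.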

macOS

Install it with Homebrew:

brew install tesseract

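The base formula ships with only a small set of language data. To recognize other languages, install the tesseract-lang formula, which adds the full set of language packs (Simplified Chinese included):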

brew install tesseract-lang

Linux (Ubuntu/Debian)

sudo apt update
sudo apt install tesseract-ocr

To install an additional language pack (Simplified Chinese, for example):

sudo apt install tesseract-ocr-chi-sim

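Verifying the installation

Once installation finishes, open a command prompt or terminal and run the command below. If it prints Tesseract's version information, the installation succeeded: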

tesseract --version
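
The same check can also be run from Python, which is handy when a script should fail early with a clear message. The following is a minimal sketch using only the standard library; the function name tesseract_version is made up for this example and is not part of Tesseract or pytesseract.

```python
import shutil
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line printed by `<command> --version`,
    or None if the command cannot be found on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Depending on the Tesseract release, the version banner may be
    # written to stdout or to stderr, so look at both.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "tesseract was not found on PATH")
```

If this prints the "not found" message even though Tesseract is installed, the installation directory is most likely missing from Path.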

3. Installing the `pytesseract` Library

Next, install the Python library pytesseract:

pip install pytesseract pillow

pillow is Python's image processing library; pytesseract relies on it to open and process images.

4. A Simple Example: Extracting Text from an Image

Here is a simple example that uses pytesseract to extract the text from an image:

import pytesseract
from PIL import Image

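# Path to tesseract.exe (Windows only; needed when pytesseract cannot find it on its own).
# On macOS/Linux, or if Tesseract is already on Path, delete this line.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # change to your actual installation path

# Open the image
image = Image.open('example.png')

# Extract the text with pytesseract
text = pytesseract.image_to_string(image, lang='chi_sim')  # lang selects the language; here, Simplified Chinese

# Print the extracted text
print(text)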

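How the code works:

pytesseract.pytesseract.tesseract_cmd: tells pytesseract where the Tesseract executable is. You need this only if Tesseract is not on your Path or pytesseract cannot locate it automatically. Be sure to replace the value with your actual installation path.
Image.open('example.png'): opens the image file with Pillow. Replace 'example.png' with the file you want to recognize.
pytesseract.image_to_string(image, lang='chi_sim'): the core call. It runs the Tesseract OCR engine on the image and returns the recognized text as a string. The lang parameter selects the recognition language; 'chi_sim' means Simplified Chinese. For English images, use 'eng'.
print(text): prints the extracted text.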
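
To avoid hard-coding a Windows path in a script that may also run on macOS or Linux, the executable can be looked up first and the fixed path used only as a fallback. This is a rough sketch under stated assumptions: resolve_tesseract_cmd and DEFAULT_WINDOWS_PATH are names invented for this example, and the fallback is simply the example installation path used in this tutorial.

```python
import os
import shutil

# Assumption: the example installation path from this tutorial.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=DEFAULT_WINDOWS_PATH):
    """Return a usable path to the tesseract executable, or None.

    Prefers whatever is on PATH; otherwise uses `fallback` if that
    file actually exists.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    if os.path.isfile(fallback):
        return fallback
    return None


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd()
    print(cmd or "Tesseract not found; install it or pass its path as fallback")
```

When the function returns a path, assign it to pytesseract.pytesseract.tesseract_cmd before calling image_to_string; when it returns None, it is better to stop with a clear error than to let the OCR call fail later.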

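Things to watch out for:

Make sure the image file exists and the path is correct.
Set the lang parameter to match the language of the text in the image, and make sure the matching language pack is installed.
If the results are poor, try preprocessing the image first, for example by increasing the contrast or binarizing it.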
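
The first two points can be checked in code before calling Tesseract. The sketch below is one possible way to do it; find_problems is a hypothetical helper, and the list of installed languages is passed in by the caller rather than detected here.

```python
from pathlib import Path


def find_problems(image_path, lang, installed_langs):
    """Return a list of human-readable problems.

    An empty list means the image exists and every requested
    language is in `installed_langs`.
    """
    problems = []
    if not Path(image_path).is_file():
        problems.append(f"image file not found: {image_path}")
    # Tesseract accepts several languages joined with '+',
    # for example 'chi_sim+eng'.
    for code in lang.split("+"):
        if code not in installed_langs:
            problems.append(f"language data not installed: {code}")
    return problems


if __name__ == "__main__":
    # Example values only; replace with your own file and languages.
    for problem in find_problems("example.png", "chi_sim+eng", ["eng", "osd"]):
        print(problem)
```

The installed languages can be listed on the command line with tesseract --list-langs; recent pytesseract releases also offer pytesseract.get_languages(), which should return the same information as a Python list.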

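5. Going Further: Image Preprocessing

When the image quality is poor, running OCR on it directly may give bad results. In that case, preprocess the image first to improve recognition accuracy.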

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

# Path to tesseract.exe (Windows only; needed when pytesseract cannot find it on its own).
# On macOS/Linux, or if Tesseract is already on Path, delete this line.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # change to your actual installation path

# Open the image
image = Image.open('example.png')

# Convert to grayscale
image = image.convert('L')

# Increase the contrast (1.0 keeps the original; 2 doubles it)
enhancer = ImageEnhance.Contrast(image)
image = enhancer.enhance(2)

# Binarize (optional; tune the threshold to suit the image)
# threshold = 127
# table = [0 if i < threshold else 1 for i in range(256)]
# image = image.point(table, '1')

# Remove noise (optional)
# image = image.filter(ImageFilter.MedianFilter())

# Extract the text with pytesseract
text = pytesseract.image_to_string(image, lang='chi_sim')

# Print the extracted text
print(text)
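
The optional binarization step above uses a fixed threshold of 127, which has to be tuned by hand for each image. As one possible alternative, the threshold can be derived from the image's own histogram with Otsu's method. This is a swapped-in technique, not something the code above does, and otsu_threshold and binarization_table are names invented for this sketch. It uses only plain Python and expects the 256-entry histogram that Pillow's Image.histogram() returns for a grayscale ('L') image.

```python
def otsu_threshold(histogram):
    """Return the threshold t (1..255) that maximizes the between-class
    variance when pixels < t form one class and pixels >= t the other.

    `histogram` is a sequence of 256 pixel counts, one per gray level.
    """
    if len(histogram) != 256:
        raise ValueError("expected a 256-entry grayscale histogram")
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")

    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_below = 0   # number of pixels with value < t
    sum_below = 0      # sum of gray levels of those pixels
    best_t, best_score = 1, -1.0

    for t in range(1, 256):
        # Move gray level t-1 into the "below" class.
        weight_below += histogram[t - 1]
        sum_below += (t - 1) * histogram[t - 1]
        weight_above = total - weight_below
        if weight_below == 0 or weight_above == 0:
            continue
        mean_below = sum_below / weight_below
        mean_above = (sum_all - sum_below) / weight_above
        # Proportional to the between-class variance.
        score = weight_below * weight_above * (mean_below - mean_above) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarization_table(threshold):
    """Lookup table mapping gray levels below `threshold` to 0 (black)
    and all others to 255 (white)."""
    return [0 if level < threshold else 255 for level in range(256)]


if __name__ == "__main__":
    # Synthetic histogram: dark text pixels at 50, light background at 200.
    hist = [0] * 256
    hist[50], hist[200] = 100, 900
    t = otsu_threshold(hist)
    print("threshold:", t)
```

With Pillow, the result could be applied to the grayscale image as image = image.point(binarization_table(otsu_threshold(image.histogram()))). Whether this beats a hand-tuned threshold depends on the image; unevenly lit photos often need more than a single global threshold.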