Extract Text from Images and Scanned PDFs with Python

Images and scanned PDFs often contain valuable information, but their text is stored as part of the image rather than in an editable format. This makes it difficult to search, edit, or repurpose the content directly. Extracting the text from such documents is essential for digitizing information, improving accessibility, and boosting productivity.

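Extracting text from images and scanned PDFs relies on optical character recognition (OCR), a technique that recognizes characters in an image and converts them into machine-readable, editable text. OCR has changed how we work with scanned documents and image-based content, making their contents far easier to reuse.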

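To extract text from images and scanned PDFs in Python, we will use the Spire.OCR for Python library. It supports many languages, including English, French, German, Chinese, Japanese, and Korean. Before using the library, complete the following two steps: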

Step 1: Install Spire.OCR for Python

Spire.OCR can be installed with pip. In a terminal, run the following command:

pip install Spire.OCR
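After installing, it can help to confirm that Python can actually locate the package before running any OCR code. The following is a minimal sketch using only the standard library; the helper name is_module_available is made up for illustration, and the module name spire.ocr is taken from the import statements used in this article.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if Python can locate `module_name` on the current path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when the parent package ("spire") is missing.
        return False


# "spire.ocr" is the module imported by the examples in this article.
if is_module_available("spire.ocr"):
    print("Spire.OCR is installed.")
else:
    print("Spire.OCR not found; run: pip install Spire.OCR")
```

The same check works for spire.pdf, which the scanned-PDF example depends on.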

Step 2: Download the OCR model

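Spire.OCR provides pre-trained models for recognizing text. Download the model that matches your operating system. Models are available for Windows, Linux, and macOS (64-bit).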

Available model packages: Win-x64, Linux, macOS
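Because the model differs per operating system, a script that runs on several machines needs to pick the right model folder at run time. The sketch below shows one way to do that with the standard library. Only the folder name win-x64 comes from this article's examples; the Linux and macOS folder names, and the helper model_path_for, are placeholders to be replaced with whatever the downloaded archive extracts to.

```python
import platform
from pathlib import Path

# Model folder per OS, keyed by the value of platform.system().
# "win-x64" appears in this article; the other two names are hypothetical.
MODEL_DIRS = {
    "Windows": "win-x64",
    "Linux": "linux-x64",  # placeholder: use your extracted folder name
    "Darwin": "mac-x64",   # placeholder: use your extracted folder name
}


def model_path_for(base_dir: str, system: str = "") -> Path:
    """Return the model folder under `base_dir` for the given (or current) OS."""
    system = system or platform.system()
    if system not in MODEL_DIRS:
        raise ValueError(f"No OCR model configured for OS: {system!r}")
    return Path(base_dir) / MODEL_DIRS[system]


# The result can be assigned to ConfigureOptions.ModelPath as a string.
print(model_path_for("models", "Windows"))
```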

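Extracting text from an image is straightforward. First configure the OCR scanner with the appropriate settings (such as the recognition language and the model path), then scan the image.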

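1. Initialize the OCR scanner: create an OcrScanner object.
2. Configure the OCR settings: set the model path and the recognition language on a ConfigureOptions object, then pass it to the OcrScanner object's ConfigureDependencies method.
3. Scan the image: call the OcrScanner object's Scan method to recognize the text in the image.
4. Save the text: retrieve the recognized text and write it to a file.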

The following code shows how to extract text from an image in Python:

from spire.ocr import *

# Initialize the OCR scanner
scanner = OcrScanner()

# Configure OCR options (language and model path for text recognition)
# Supported languages include English, French, German, Korean, Chinese, etc.
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Scan the image
scanner.Scan(r'Sample.png')

# Get the extracted text
text = scanner.Text.ToString() + '\n'

# Save the extracted text to a file (mode 'a' appends to any existing output)
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')

[Figure: Extracting text from an image with OCR in Python]
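The single-image workflow above extends naturally to a whole folder of images. The sketch below is one possible way to structure that, not part of Spire.OCR: the helper ocr_folder and the list of file suffixes are assumptions, and the actual OCR step is passed in as a function so the loop stays independent of the library.

```python
from pathlib import Path
from typing import Callable, Dict

# File suffixes treated as images here; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def ocr_folder(folder: str, recognize: Callable[[str], str]) -> Dict[str, str]:
    """Run `recognize` on every image file in `folder`.

    Returns a dict mapping each image's file name to its recognized text.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = recognize(str(path))
    return results


# With a scanner configured as in the example above, `recognize` could be:
#
#     def recognize(image_path):
#         scanner.Scan(image_path)
#         return scanner.Text.ToString()
#
#     texts = ocr_folder("scans", recognize)
```

Passing the recognizer in as a callable also makes the loop easy to try out with a stand-in function before the OCR model is downloaded.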

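Sometimes it is useful to extract not only the text but also its coordinates (position) within the image. Spire.OCR exposes this information as well.

Here is how to extract text together with its bounding-box coordinates: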

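1. Initialize the OCR scanner: create an OcrScanner object.
2. Configure the OCR settings: set the model path and the recognition language on a ConfigureOptions object, then pass it to the OcrScanner object's ConfigureDependencies method.
3. Scan the image: call the Scan method to recognize the text in the image.
4. Access the bounding-box coordinates: retrieve the bounding box of each text block.
5. Save the text and coordinates: write both the text and its coordinates to a file.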

The following code shows how to extract text along with its coordinates from an image in Python:

from spire.ocr import *

# Initialize the OCR scanner
scanner = OcrScanner()

# Configure OCR options
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Scan the image
scanner.Scan(r'Sample.png')

# Extract each text block together with its bounding box
text = ''
for block in scanner.Text.Blocks:
    rectangle = block.Box
    positions = f'{block.Text} -> x: {rectangle.X}, y: {rectangle.Y}, w: {rectangle.Width}, h: {rectangle.Height}'
    text += positions + '\n'

# Save the text and coordinates to a file
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text + '\n')
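One common use of the coordinates is to rebuild reading order when blocks come back in an arbitrary sequence. The sketch below groups blocks into lines by comparing vertical centers, then sorts each line left to right. The Block dataclass is a hypothetical stand-in that mirrors the Text, X, Y, Width, and Height values printed above, and the tolerance value is an arbitrary starting point rather than a recommended setting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Stand-in for an OCR text block: the text plus its bounding box."""
    text: str
    x: float
    y: float
    width: float
    height: float


def group_into_lines(blocks: List[Block], tolerance: float = 0.5) -> List[str]:
    """Group blocks with similar vertical centers, then read each group left to right."""
    lines: List[List[Block]] = []
    # Visit blocks from top to bottom, ordered by vertical center.
    for block in sorted(blocks, key=lambda b: b.y + b.height / 2):
        center = block.y + block.height / 2
        if lines:
            last = lines[-1][-1]
            last_center = last.y + last.height / 2
            # Same line if the centers differ by less than a fraction of the height.
            if abs(center - last_center) <= tolerance * last.height:
                lines[-1].append(block)
                continue
        lines.append([block])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


sample = [
    Block("world", 120, 11, 50, 20),
    Block("Hello", 10, 10, 50, 20),
    Block("line", 80, 52, 40, 20),
    Block("Second", 10, 50, 60, 20),
]
print(group_into_lines(sample))  # ['Hello world', 'Second line']
```

To use it with real output, build one Block per item in scanner.Text.Blocks from block.Text and the fields of block.Box.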

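If your image is not stored on disk but comes from a memory buffer or a web resource, you can still process it as an image stream. Here is how to extract text from an image stream: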

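1. Initialize the OCR scanner: create an OcrScanner object.
2. Configure the OCR settings: set the model path and the recognition language on a ConfigureOptions object, then pass it to the OcrScanner object's ConfigureDependencies method.
3. Create a stream from the image: use the Stream class to load the image into memory.
4. Scan the image stream: call the Scan method with the stream and its image format.
5. Save the extracted text: write the extracted text to a file.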

The following code shows how to extract text from an image stream in Python:

from spire.ocr import *

# Initialize the OCR scanner
scanner = OcrScanner()

# Configure OCR options
configureOptions = ConfigureOptions()
configureOptions.ModelPath = r'D:\OCR\win-x64'
configureOptions.Language = 'English'
scanner.ConfigureDependencies(configureOptions)

# Create an image stream and state its format
image_stream = Stream('Sample.png')
image_format = OCRImageFormat.Png

# Scan the image stream
scanner.Scan(image_stream, image_format)
text = scanner.Text.ToString()

# Save the extracted text
with open('output.txt', 'a', encoding='utf-8') as file:
    file.write(text)
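When image data arrives as raw bytes, the format is not always known in advance, yet the stream-based Scan call above needs one. A rough sketch of detecting the format from the leading "magic" bytes follows. The helper sniff_image_format and the returned labels are illustrative only: this article shows just OCRImageFormat.Png, so whether the other labels correspond to members of that enum is an assumption to verify against the library.

```python
from typing import Optional

# Leading signature bytes of common image formats, paired with a label.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "Png"),
    (b"\xff\xd8\xff", "Jpeg"),
    (b"BM", "Bmp"),
    (b"GIF87a", "Gif"),
    (b"GIF89a", "Gif"),
    (b"II*\x00", "Tiff"),  # little-endian TIFF
    (b"MM\x00*", "Tiff"),  # big-endian TIFF
]


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return a format label based on the first bytes of `data`, or None if unknown."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return None


png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(sniff_image_format(png_header))  # Png
```

Checking the bytes is more reliable than trusting a file extension or a URL, both of which can be wrong or missing for in-memory data.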

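For scanned PDFs, each page must first be converted to an image. The Spire.PDF for Python library can be used for this. Once the PDF pages have been converted to images, OCR can be applied to each image.

Before using the code below, install Spire.PDF with the following command: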

pip install Spire.PDF

Here is how to extract text from a scanned PDF:

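1. Convert the PDF pages to images: load the scanned PDF document and use the PdfDocument class's SaveAsImage method to render each page as an image.
2. Perform OCR: apply Spire.OCR to each image to extract its text.
3. Save the extracted text: write the extracted text to a file.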

The following code shows how to extract text from a scanned PDF document in Python:

from spire.pdf import *
from spire.ocr import *

# Function to convert a PDF page to an image
def convert_pdf_page_to_image(pdf, page_index):
    # Render the specified page as an image and return it
    return pdf.SaveAsImage(page_index)

# Function to recognize text from an image
def recognize_text_from_image(imgName, language, model_path):
    # Initialize the OCR scanner and configure OCR options
    scanner = OcrScanner()
    configure_options = ConfigureOptions()
    configure_options.Language = language
    configure_options.ModelPath = model_path
    scanner.ConfigureDependencies(configure_options)
    # Perform OCR and return the recognized text
    scanner.Scan(imgName)
    data = scanner.Text.ToString()
    return data

# Load the scanned PDF document
pdf = PdfDocument()
pdf.LoadFromFile('Test.pdf')

# Open a file to save the extracted text
with open('ScannedPDF.txt', 'a', encoding='utf-8') as writer:
    for page_index in range(pdf.Pages.Count):
        # Convert the PDF page to an image and save it to disk
        image = convert_pdf_page_to_image(pdf, page_index)
        imgName = "toImage_" + str(page_index) + ".png"
        image.Save(imgName)
        # Recognize text from the image
        recognized_text = recognize_text_from_image(imgName, 'Chinese', r'D:\OCR\win-x64')
        # Write the extracted text to the file
        writer.write(f'Page {page_index + 1}:\n')
        writer.write(recognized_text)
        writer.write('\n\n')  # Add two line breaks between pages

print('Text successfully saved to "ScannedPDF.txt".')
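The example above leaves one PNG file per page in the working directory. If those intermediate images are not needed afterwards, the page loop can write them to a temporary folder that is removed automatically. The sketch below shows that pattern with the standard library; the helper ocr_pages is hypothetical, and rendering and recognition are passed in as functions so the loop itself does not depend on Spire.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ocr_pages(page_count: int,
              render_page: Callable[[int, str], None],
              recognize: Callable[[str], str]) -> List[str]:
    """Render each page to a temporary image, OCR it, and return the text per page."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for page_index in range(page_count):
            image_path = str(Path(tmp_dir) / f"page_{page_index}.png")
            render_page(page_index, image_path)  # must write the image to image_path
            texts.append(recognize(image_path))
    # The temporary folder and its images are deleted when the with-block exits.
    return texts


# With the functions from the example above, the callables could be:
#
#     def render_page(i, path):
#         convert_pdf_page_to_image(pdf, i).Save(path)
#
#     def recognize(path):
#         return recognize_text_from_image(path, 'Chinese', r'D:\OCR\win-x64')
#
#     page_texts = ocr_pages(pdf.Pages.Count, render_page, recognize)
```

Creating one scanner up front and reusing it inside recognize would also avoid reconfiguring the OCR model for every page.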

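Note that OCR accuracy depends heavily on image quality. Clear, high-contrast images that are free of blur and distortion yield better recognition results. Choosing the correct recognition language also matters.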

Source: 自由坦荡的湖泊AI


