5 Ways to Split Word Documents in Python

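Summary: To split Word documents in Python, we will use the Spire.Doc for Python module. It provides a comprehensive set of features for working with Word documents, including creating, reading, editing, converting, merging, and splitting them.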

First, make sure Spire.Doc for Python is installed. If it is not, you can install it with pip:

pip install spire.doc
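
Before running the examples, it can help to confirm that the package is visible to the interpreter you are using. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and "spire.doc" is simply the import name used by the examples in this article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate `module_name` on the current path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "spire" for "spire.doc") is missing.
        return False


if __name__ == "__main__":
    # "spire.doc" is the import name used by the examples in this article.
    print("Spire.Doc available:", is_installed("spire.doc"))
```

If this prints False, the pip command above was most likely run against a different Python environment.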

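Sections in Word divide a document into distinct parts, each with its own headers, footers, page orientation, margins, and other formatting options. Splitting a Word document by section gives you a separate file for each section, which makes it easier to navigate, edit, and collaborate on a specific part without affecting the rest of the document.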

Key steps for splitting a Word document by section

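- Open the source document: Initialize a Document instance and load the source Word document you want to split.
- Iterate through the sections: Loop through all sections in the source document. For each section:
  - Create a new document: Create a new document for the section.
  - Copy the section: Clone the section from the source document and add the clone to the new document.
  - Save the document: Save the new document to a separate file.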

Here is a simple example of splitting a Word document by section using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")

    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        with Document() as new_document:
            # Add a clone of the current section to the new document
            new_document.Sections.Add(section.Clone())

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document with a unique filename for each section
            output_file = f"Output/Section{sec_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
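
The examples in this article all save into an Output/ folder. The article does not say whether SaveToFile creates a missing folder, so a cautious approach is to create it yourself before saving. The sketch below uses only the standard library; the helper name section_output_path is hypothetical and not part of Spire.Doc.

```python
from pathlib import Path


def section_output_path(index: int, out_dir: str = "Output") -> str:
    """Create `out_dir` if needed and return the file path for a 0-based section index."""
    folder = Path(out_dir)
    # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
    folder.mkdir(parents=True, exist_ok=True)
    return str(folder / f"Section{index + 1}.docx")


if __name__ == "__main__":
    # The returned string could be passed as the first argument to SaveToFile.
    print(section_output_path(0))
```

The same pattern applies to the other examples by changing the file name template.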

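Another common way to split a Word document is by heading. This method divides the document into separate files based on a specified heading style (for example, Heading1).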

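- Open the source document: Initialize a Document instance and load the source Word document you want to split.
- Iterate through the sections: Loop through all sections in the source document. For each section:
  - Identify headings: Look for paragraphs whose style is "Heading1".
  - Create a new document: When a Heading 1 is found, create a new document and copy the heading into it.
  - Copy the content: Keep copying content into the new document until the next Heading 1 is found.
  - Save the document: Save the new document to a separate file.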

Here is a simple example of splitting a Word document by heading (Heading 1) using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as source_document:
    source_document.LoadFromFile("Headings.docx")

    # Initialize variables
    new_documents = []
    new_document = None
    new_section = None
    is_inside_heading = False

    # Iterate through all sections in the source document
    for sec_index in range(source_document.Sections.Count):
        # Access the current section
        section = source_document.Sections[sec_index]

        # Iterate through all objects in the current section
        for obj_index in range(section.Body.ChildObjects.Count):
            # Access the current object
            obj = section.Body.ChildObjects[obj_index]

            # Check if the current object is a Paragraph
            if isinstance(obj, Paragraph):
                para = obj
                # Check if the paragraph style is "Heading1"
                if para.StyleName == "Heading1":
                    # Add the previous document to the list if it exists
                    if new_document is not None:
                        new_documents.append(new_document)
                    # Create a new document and add a section to it
                    new_document = Document()
                    new_section = new_document.AddSection()
                    # Copy section settings
                    section.CloneSectionPropertiesTo(new_section)
                    # Copy the heading paragraph to the new section
                    new_section.Body.ChildObjects.Add(para.Clone())
                    is_inside_heading = True
                else:
                    if is_inside_heading:
                        # Copy paragraphs to the new section until the next Heading1
                        new_section.Body.ChildObjects.Add(para.Clone())
            else:
                if is_inside_heading:
                    # Copy non-paragraph objects to the new section
                    new_section.Body.ChildObjects.Add(obj.Clone())

    # Add the last document to the list if it exists
    if new_document is not None:
        new_documents.append(new_document)

    # Iterate through all documents in the list
    for i, doc in enumerate(new_documents):
        # Copy themes and styles from the source document to ensure consistency
        source_document.CloneThemesTo(doc)
        source_document.CloneDefaultStyleTo(doc)
        # Save the document to a separate file
        output_file = f"Output/HeadingContent{i + 1}.docx"
        doc.SaveToFile(output_file, FileFormat.Docx2016)
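
The grouping logic in the heading example can be easier to follow in isolation. The sketch below is plain Python with no Spire.Doc calls: each body object is modeled as a hypothetical (style_name, text) tuple, and split_by_heading is an illustrative helper, not a library function. It starts a new chunk at every "Heading1" and skips anything that appears before the first heading, which mirrors the is_inside_heading flag.

```python
from typing import List, Optional, Tuple

# (style_name, text): a simplified stand-in for a body child object
Block = Tuple[str, str]


def split_by_heading(blocks: List[Block], heading_style: str = "Heading1") -> List[List[Block]]:
    """Group blocks into chunks, each starting at a block with `heading_style`."""
    chunks: List[List[Block]] = []
    current: Optional[List[Block]] = None  # no chunk is open until the first heading
    for block in blocks:
        style, _ = block
        if style == heading_style:
            # A heading closes the previous chunk and opens a new one.
            current = [block]
            chunks.append(current)
        elif current is not None:
            current.append(block)
        # Blocks before the first heading are skipped.
    return chunks


sample = [
    ("Normal", "Preface"),
    ("Heading1", "Chapter 1"),
    ("Normal", "Text A"),
    ("Heading2", "1.1"),
    ("Heading1", "Chapter 2"),
    ("Normal", "Text B"),
]
chunks = split_by_heading(sample)
print([[text for _, text in chunk] for chunk in chunks])
# → [['Chapter 1', 'Text A', '1.1'], ['Chapter 2', 'Text B']]
```

Note that lower-level headings such as Heading2 stay inside the chunk of their parent Heading1, and the preface is dropped because no Heading1 precedes it.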

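Bookmarks are placeholders in a document that mark a specific location or passage, which makes them well suited to custom content management. By splitting a document at its bookmarked points, you can create separate files tailored to your needs, such as isolating individual chapters, sections, or passages for easier navigation, distribution, or editing.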

Key steps for splitting a Word document by bookmark

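- Open the source document: Initialize a Document instance and load the source Word document you want to split.
- Iterate through the bookmarks: Loop through all bookmarks in the source document. For each bookmark:
  - Create a new document: Create a new document for the bookmark.
  - Add a new section: Add a new section to the new document.
  - Copy the bookmark content: Use a BookmarksNavigator object to get the content of the current bookmark and add it to the new document.
  - Save the new document: Save the new document to a separate file.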

Here is a simple example of splitting a Word document by bookmark using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Bookmarks.docx")

    # Iterate through all bookmarks in the document
    for bookmark_index in range(document.Bookmarks.Count):
        # Access the current bookmark
        bookmark = document.Bookmarks[bookmark_index]

        # Create a new document for the current bookmark
        with Document() as new_document:
            # Add a new section to the new document
            new_section = new_document.AddSection()
            # Copy section settings
            document.Sections[0].CloneSectionPropertiesTo(new_section)

            # Create a bookmark navigator for the source document
            bookmarks_navigator = BookmarksNavigator(document)
            # Navigate to the current bookmark
            bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Get the bookmark content
            textBodyPart = bookmarks_navigator.GetBookmarkContent()

            # Add a paragraph to the new document
            paragraph = new_section.AddParagraph()
            # Add a bookmark to the paragraph with the same bookmark name
            paragraph.AppendBookmarkStart(bookmark.Name)
            paragraph.AppendBookmarkEnd(bookmark.Name)

            # Create a bookmark navigator for the new document
            new_bookmarks_navigator = BookmarksNavigator(new_document)
            # Navigate to the newly added bookmark
            new_bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Replace the content of the new bookmark with the content
            # of the current bookmark in the source document
            new_bookmarks_navigator.ReplaceBookmarkContent(textBodyPart, True)

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document to a separate file
            output_file = f"Output/Bookmark{bookmark_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
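
Conceptually, a bookmark is a named start marker and end marker wrapped around some content. The sketch below shows that idea in plain Python, without Spire.Doc: the document is modeled as a hypothetical flat list in which ("start", name) and ("end", name) tuples stand in for bookmark markers and strings stand in for content. The function bookmark_contents is illustrative only and is not how BookmarksNavigator is implemented.

```python
from typing import Dict, List, Tuple, Union

# A string is content; a ("start" | "end", name) tuple is a bookmark marker.
Token = Union[str, Tuple[str, str]]


def bookmark_contents(tokens: List[Token]) -> Dict[str, List[str]]:
    """Collect the content found between each bookmark's start and end markers."""
    open_marks: Dict[str, List[str]] = {}  # bookmarks currently open
    result: Dict[str, List[str]] = {}      # finished bookmarks
    for token in tokens:
        if isinstance(token, tuple):
            kind, name = token
            if kind == "start":
                open_marks[name] = []
            elif kind == "end" and name in open_marks:
                result[name] = open_marks.pop(name)
        else:
            # Content belongs to every bookmark that is open at this point,
            # so overlapping bookmarks each receive their own copy.
            for buffer in open_marks.values():
                buffer.append(token)
    return result


tokens = [
    "Intro",
    ("start", "ch1"), "A", "B", ("end", "ch1"),
    "gap",
    ("start", "ch2"), "C", ("end", "ch2"),
]
print(bookmark_contents(tokens))
# → {'ch1': ['A', 'B'], 'ch2': ['C']}
```

Content outside any bookmark ("Intro" and "gap" here) ends up in no output, which matches the behavior of a bookmark-based split: only bookmarked content is written to the new files.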

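In Microsoft Word, a page break marks the end of one page and the start of the next. Before you split a Word document by page break, you need to insert page breaks at the positions where you want the document to be divided.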

Steps for splitting a Word document by page break

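- Open the source document: Initialize a Document instance and load the source Word document.
- Create a new document: Set up a new document and add an initial section.
- Iterate through the sections: Loop through all sections in the source document. For each section:
  - Identify page breaks: Look for page breaks in the section's paragraphs.
  - Split at the page break: When a page break is found in a paragraph, save the current document and create a new one, then copy the content that follows the page break into the new document.
- Save the new document: Save the last new document to a separate file.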

Here is a simple example of splitting a Word document by page break using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("PageBreaks.docx")

    # Create a new document and add a section to it
    new_document = Document()
    new_section = new_document.AddSection()

    # Copy themes and styles from the source document to ensure consistency
    document.CloneDefaultStyleTo(new_document)
    document.CloneThemesTo(new_document)

    index = 0

    # Iterate through all sections in the source document
    for sec_index in range(document.Sections.Count):
        section = document.Sections[sec_index]

        # Iterate through all body child objects of each section
        for sec_obj_index in range(section.Body.ChildObjects.Count):
            sec_obj = section.Body.ChildObjects[sec_obj_index]

            # Check if the current object is a paragraph
            if isinstance(sec_obj, Paragraph):
                para = sec_obj
                # Copy section settings
                section.CloneSectionPropertiesTo(new_section)
                # Add a clone of the paragraph to the section of the new document
                new_section.Body.ChildObjects.Add(para.Clone())

                # Iterate through all child objects of the paragraph
                for para_obj_index in range(para.ChildObjects.Count):
                    para_obj = para.ChildObjects[para_obj_index]

                    # Check if the current object is a page break
                    if isinstance(para_obj, Break) and para_obj.BreakType == BreakType.PageBreak:
                        # Get the index of the page break in the paragraph
                        i = para.ChildObjects.IndexOf(para_obj)
                        # Remove the page break from the cloned paragraph
                        new_section.Body.LastParagraph.ChildObjects.RemoveAt(i)

                        # Save the current document
                        output_file = f"Output/SplitDocByPageBreak-{index}.docx"
                        new_document.SaveToFile(output_file, FileFormat.Docx)
                        index += 1

                        # Create a new document and add a section to it
                        new_document = Document()
                        new_section = new_document.AddSection()
                        document.CloneDefaultStyleTo(new_document)
                        document.CloneThemesTo(new_document)
                        section.CloneSectionPropertiesTo(new_section)

                        # Add the paragraph to the section of the new document
                        new_section.Body.ChildObjects.Add(para.Clone())
                        if new_section.Paragraphs[0].ChildObjects.Count == 0:
                            # Remove the first paragraph if it is blank
                            new_section.Body.ChildObjects.RemoveAt(0)
                        else:
                            # Remove the page break and the child objects before it
                            while i >= 0:
                                new_section.Paragraphs[0].ChildObjects.RemoveAt(i)
                                i -= 1
            else:
                # Copy non-paragraph objects to the new section
                new_section.Body.ChildObjects.Add(sec_obj.Clone())

    # Save the last document
    result = f"Output/SplitDocByPageBreak-{index}.docx"
    new_document.SaveToFile(result, FileFormat.Docx2013)
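
The core idea of the page-break split is that a break can sit in the middle of a paragraph, so the paragraph itself has to be divided. The sketch below shows this in plain Python with no Spire.Doc calls. Paragraphs are modeled as lists of items, and PAGE_BREAK is a hypothetical sentinel standing in for a page-break object. It is a simplification: unlike the example above, it drops paragraph fragments that end up empty.

```python
from typing import List

PAGE_BREAK = object()  # hypothetical marker standing in for a page-break object


def split_by_page_break(paragraphs: List[list]) -> List[List[list]]:
    """Return a list of pages; each page is a list of paragraphs (lists of items)."""
    pages: List[List[list]] = [[]]
    for para in paragraphs:
        current: list = []  # the part of this paragraph that belongs to the open page
        for item in para:
            if item is PAGE_BREAK:
                # Close the fragment before the break, then open a new page.
                if current:
                    pages[-1].append(current)
                pages.append([])
                current = []
            else:
                current.append(item)
        if current:
            pages[-1].append(current)
    return pages


sample = [["A1", "A2"], ["B1", PAGE_BREAK, "B2"], ["C1"], [PAGE_BREAK], ["D1"]]
pages = split_by_page_break(sample)
print(pages)
# → [[['A1', 'A2'], ['B1']], [['B2'], ['C1']], [['D1']]]
```

The paragraph containing B1 and B2 is divided across the first two pages, which corresponds to the step in the example that clones the paragraph into the new document and removes everything up to the break.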

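When you split a Word document into HTML pages, its content is converted to HTML and divided into separate web pages, so the document can be viewed in a browser as a series of pages.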

Key steps for splitting a Word document into HTML pages

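- Open the source document: Initialize a Document instance and load the source Word document you want to split.
- Iterate through the sections: Loop through all sections in the source document. For each section:
  - Create a new document: Create a new document for the section.
  - Copy the section: Clone the section from the source document and add the clone to the new document.
  - Embed CSS and images: Configure the new document's HTML export options so that CSS styles and images are embedded directly in the HTML page.
  - Save the document as HTML: Save the new document to a separate HTML file.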

Here is a simple example of splitting each section of a Word document into a separate HTML page using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")

    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        new_document = Document()
        # Add a clone of the current section to the new document
        new_document.Sections.Add(section.Clone())

        # Copy themes and styles from the source document to ensure consistency
        document.CloneThemesTo(new_document)
        document.CloneDefaultStyleTo(new_document)

        # Embed CSS styles and image data into the HTML page
        new_document.HtmlExportOptions.CssStyleSheetType = CssStyleSheetType.Internal
        new_document.HtmlExportOptions.ImageEmbedded = True

        # Save the new document as an HTML file
        output_file = f"Output/Section-{sec_index + 1}.html"
        new_document.SaveToFile(output_file, FileFormat.Html)
        new_document.Close()

Besides splitting a Word document into HTML pages, you can split it into many other formats, such as PDF, XPS, and Markdown, by changing the FileFormat argument passed to SaveToFile.