
EpubToSplitTxt

Epub eBook to Text Conversion and Chapter Splitting System

Converts .epub eBooks to plain text and splits the text into separate TXT files by chapter, using a configurable chapter-title pattern.

Development Log: Agent&Chat.md

✨ Features

✅ Automatically parse Epub file structure and extract plain text
✅ Intelligent chapter title recognition (supports Chinese and English formats)
✅ Split by chapter into separate TXT files with sequence numbers to maintain reading order
✅ Support for special chapters like preface, prologue, etc.
✅ UTF-8 without BOM encoding output for compatibility
✅ Stream processing for large files with low memory usage
✅ Configurable chapter matching rules

🛠️ Tech Stack

Library	Version	Purpose
.NET	9.0	Runtime Environment
VersOne.Epub	3.3.0	Epub File Parsing
HtmlAgilityPack	1.11.59	HTML Content Cleaning
Microsoft.Extensions.Configuration	-	Configuration Management

🚀 Quick Start

1. Build Project

dotnet build

2. Prepare Epub Files

Place your .epub eBook files in the RawEpub directory:

EpubToSplitTxt/
├── RawEpub/
│   ├── Novel1.epub
│   └── Novel2.epub

3. Run Program

dotnet run

4. View Results

The program will automatically generate the following directory structure:

EpubToSplitTxt/
├── IntermediateTxt/          # Intermediate files (full text)
│   ├── Novel1_Full.txt
│   └── Novel2_Full.txt
├── SplitOutput/              # Chapter split output
│   ├── Novel1/
│   │   ├── 000_Prologue.txt
│   │   ├── 001_Chapter1_Rebirth.txt
│   │   ├── 002_Chapter2_Training.txt
│   │   └── ...
│   └── Novel2/
│       └── ...

⚙️ Configuration

Configuration file: appsettings.json

{
  "Splitter": {
    "ChapterRegex": "(^第[0-9一二三四五六七八九十百千]+[章节卷].*)|(^Chapter [0-9]+.*)|(^序章.*)|(^楔子.*)|(^引子.*)|(^后记.*)|(^尾声.*)",
    "MinChapterLength": 100
  },
  "Paths": {
    "RawEpubFolder": "./RawEpub",
    "IntermediateTxtFolder": "./IntermediateTxt",
    "SplitOutputFolder": "./SplitOutput"
  }
}

Configuration Options

Option	Description	Default
`Splitter:ChapterRegex`	Regular expression for chapter title matching	The pattern shown above (Chinese and Arabic numerals, English `Chapter N`, special chapters)
`Splitter:MinChapterLength`	Minimum chapter character count (warns if below)	100
`Paths:RawEpubFolder`	Raw Epub file directory	`./RawEpub`
`Paths:IntermediateTxtFolder`	Full text intermediate file directory	`./IntermediateTxt`
`Paths:SplitOutputFolder`	Chapter split output directory	`./SplitOutput`
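
As a rough illustration of how the colon-separated option names map onto the nested JSON, the following sketch reads the two Splitter settings with Microsoft.Extensions.Configuration. It assumes the Microsoft.Extensions.Configuration.Json package is referenced; `SplitterOptions` and `ConfigSketch` are hypothetical names used only here and are not the project's `AppSettings` class.

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical holder for the two Splitter settings (not the project's AppSettings).
public sealed record SplitterOptions(string ChapterRegex, int MinChapterLength);

public static class ConfigSketch
{
    // Reads appsettings.json from the given directory and returns the Splitter section.
    public static SplitterOptions LoadSplitter(string baseDirectory)
    {
        IConfigurationRoot config = new ConfigurationBuilder()
            .SetBasePath(baseDirectory)
            .AddJsonFile("appsettings.json", optional: false)
            .Build();

        // Nested JSON objects are flattened into colon-separated keys.
        string regex = config["Splitter:ChapterRegex"]
            ?? throw new InvalidOperationException("Splitter:ChapterRegex is missing.");

        // Assumption: fall back to the documented default of 100 when the value is absent or not a number.
        int minLength = int.TryParse(config["Splitter:MinChapterLength"], out int parsed) ? parsed : 100;

        return new SplitterOptions(regex, minLength);
    }
}
```

Calling `ConfigSketch.LoadSplitter(AppContext.BaseDirectory)` would then return both values in one object. The `Paths` section could be read the same way with keys such as `Paths:RawEpubFolder`.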

📑 Supported Chapter Formats

Default supported chapter title formats:

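✅ Chinese numerals: 第一章, 第二十章, 第一百章 (Chapter 1, Chapter 20, Chapter 100)
✅ Arabic numerals: 第1章, 第001章
✅ English format: Chapter 1, Chapter 2
✅ Special chapters: 序章 (prologue), 楔子 (prelude), 引子 (introduction), 后记 (afterword), 尾声 (epilogue)

The 第… forms also accept 节 (section) or 卷 (volume) in place of 章 (chapter).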

To support other formats, modify the ChapterRegex in appsettings.json.
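
To see what the default pattern accepts, the sketch below tests single lines against it. The pattern is copied from the appsettings.json sample above; the `ChapterTitleSketch` class and `IsChapterTitle` method are hypothetical names, and whether the project trims a line before matching is not stated, so no trimming is done here.

```csharp
using System.Text.RegularExpressions;

public static class ChapterTitleSketch
{
    // Default pattern copied from the appsettings.json sample.
    private const string DefaultPattern =
        "(^第[0-9一二三四五六七八九十百千]+[章节卷].*)|(^Chapter [0-9]+.*)|(^序章.*)|(^楔子.*)|(^引子.*)|(^后记.*)|(^尾声.*)";

    private static readonly Regex ChapterRegex = new Regex(DefaultPattern);

    // Returns true when a single line of text looks like a chapter heading.
    public static bool IsChapterTitle(string line) => ChapterRegex.IsMatch(line);
}
```

With this pattern, `第一章 重生`, `第001章` and `Chapter 12 Training` match, while `chapter 1` (lowercase) and a heading preceded by whitespace do not, because the match is case-sensitive and anchored with `^`. Numerals outside the character class, such as 零 or 两, would also need to be added to the pattern before headings like 第两章 are recognized.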

📊 Processing Flow

[Epub File]
    ↓
[EpubConverter] Parse Epub structure
    ↓
[Clean HTML] Remove tags, convert entities
    ↓
[Full Text] Merge into single TXT file
    ↓
[TextSplitter] Scan lines and match chapters
    ↓
[Chapter Files] Output as separate files with sequence numbers
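
The scan-and-match step of this flow can be sketched roughly as follows. This is a minimal sketch, not the project's TextSplitter: `SplitSketch` and `SplitChapters` are hypothetical names, and dropping any text that precedes the first heading is an assumption made here for brevity.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Lazily yields (title, body) pairs, reading one line at a time.
    // Assumption: text before the first heading is dropped.
    public static IEnumerable<(string Title, string Body)> SplitChapters(TextReader reader, Regex chapterRegex)
    {
        string? title = null;
        var body = new StringBuilder();
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (chapterRegex.IsMatch(line))
            {
                // A new heading closes the chapter collected so far.
                if (title != null) yield return (title, body.ToString());
                title = line.Trim();
                body.Clear();
            }
            else if (title != null)
            {
                body.AppendLine(line);
            }
        }
        // Emit the final chapter once the input is exhausted.
        if (title != null) yield return (title, body.ToString());
    }
}
```

Because the method is an iterator, a caller can write each chapter to disk as it is produced, numbering the files with a running index (for example `$"{index:D3}_{title}.txt"`), so only one chapter is held in memory at a time.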

📝 Log Description

[INFO]: Normal processing info (parsing progress, statistics)
[WARN]: Warning info (chapter too small, no chapters matched, etc.)
[ERROR]: Error info (corrupted file, I/O errors, etc.)

⚡ Performance Optimization

✅ Pre-compiled regular expressions (RegexOptions.Compiled)
✅ Stream reading for large text files (StreamReader)
✅ Avoid loading entire text into memory at once
✅ UTF-8 output without a BOM (omits the 3-byte marker from every file)
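
The points above could be combined along these lines. This is a sketch under assumptions rather than the project's code: `StreamingSketch`, `CountHeadings` and `WriteText` are hypothetical names, and the pattern is a shortened, illustrative version of the default.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class StreamingSketch
{
    // Compiled once and reused for every line; shortened illustrative pattern.
    private static readonly Regex Heading =
        new Regex(@"^(第[0-9一二三四五六七八九十百千]+[章节卷]|Chapter [0-9]+)", RegexOptions.Compiled);

    // UTF-8 encoding that does not emit the 3-byte BOM (EF BB BF).
    private static readonly UTF8Encoding Utf8NoBom =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    // Counts heading lines while holding only one line in memory at a time.
    public static int CountHeadings(string path)
    {
        int count = 0;
        using var reader = new StreamReader(path, Encoding.UTF8);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (Heading.IsMatch(line)) count++;
        }
        return count;
    }

    // Writes text as UTF-8 without a BOM, overwriting any existing file.
    public static void WriteText(string path, string text)
    {
        using var writer = new StreamWriter(path, append: false, Utf8NoBom);
        writer.Write(text);
    }
}
```

Passing `Encoding.UTF8` to the `StreamWriter` instead would write the BOM at the start of each new file, which is why a dedicated `UTF8Encoding(false)` instance is used in this sketch.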

⚠️ Notes

Encoding: All output files are written as UTF-8 without a BOM.
Filenames: Characters that are illegal in file names are automatically replaced with underscores.
Directory Structure: A separate subfolder is created for each book so that chapters from different books do not mix.
Regex Timeout: Chapter matching uses a 1-second timeout to guard against catastrophic backtracking.
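
The filename and timeout notes could be implemented along these lines. `SafetySketch`, `SanitizeFileName` and `SafeIsMatch` are hypothetical names, and treating a timeout as "no match" is an assumption; the project may log or handle it differently.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class SafetySketch
{
    // Replaces every character that is illegal in a file name on the current OS with '_'.
    public static string SanitizeFileName(string name)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        char[] chars = name.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if (Array.IndexOf(invalid, chars[i]) >= 0) chars[i] = '_';
        }
        return new string(chars);
    }

    // Matches with a 1-second budget.
    // Assumption: a timeout is treated as "not a chapter heading".
    public static bool SafeIsMatch(string pattern, string line)
    {
        var regex = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1));
        try
        {
            return regex.IsMatch(line);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

Note that `Path.GetInvalidFileNameChars()` is platform-dependent: it returns a longer list on Windows than on Linux, so a title containing `:` or `?` is only rewritten on platforms where those characters are invalid.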

🏗️ System Architecture

Core Components

EpubConverter: Responsible for parsing Epub files and extracting plain text
TextSplitter: Responsible for chapter recognition and text splitting
AppSettings: Configuration management model
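
For orientation, the HTML-cleaning part of the conversion might look roughly like the sketch below, which uses HtmlAgilityPack to strip tags and decode entities. It is not the project's EpubConverter: `HtmlCleanSketch` and `ToPlainLines` are hypothetical names, and the assumption that all readable text sits in `h1` to `h3` and `p` elements will not hold for every book.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HtmlCleanSketch
{
    // Returns one plain-text line per heading or paragraph element found in the markup.
    public static List<string> ToPlainLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection? blocks = doc.DocumentNode.SelectNodes("//h1|//h2|//h3|//p");
        if (blocks == null) return lines;

        foreach (HtmlNode node in blocks)
        {
            // InnerText drops the tags; DeEntitize turns entities such as &amp; into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) lines.Add(text);
        }
        return lines;
    }
}
```

Emitting one line per block element keeps each heading on its own line, which matters because the chapter pattern is anchored to the start of a line.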

Dependencies

Program.cs
   ├── EpubConverter (VersOne.Epub, HtmlAgilityPack)
   ├── TextSplitter (System.Text.RegularExpressions)
   └── AppSettings (Microsoft.Extensions.Configuration)

🔧 Extension Development

Custom Chapter Matching Rules

Modify the regular expression in appsettings.json:

{
  "Splitter": {
    "ChapterRegex": "Your custom regex pattern"
  }
}

Add New Output Formats

Modify the SplitTextAsync method in TextSplitter.cs to support other formats (e.g., Markdown).
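
As one possible starting point, a chapter could be rendered as Markdown with a small helper like the one below. `MarkdownSketch` and `ToMarkdown` are hypothetical names, and how such a helper would be wired into `SplitTextAsync` depends on that method's actual signature, which is not shown here.

```csharp
using System.Text;

public static class MarkdownSketch
{
    // Renders one chapter as Markdown: the title becomes a level-1 heading,
    // and each non-empty body line becomes its own paragraph.
    public static string ToMarkdown(string title, string body)
    {
        var sb = new StringBuilder();
        sb.Append("# ").Append(title.Trim()).Append("\n\n");
        foreach (string line in body.Split('\n'))
        {
            // Trim also removes a trailing '\r' left over from Windows line endings.
            string trimmed = line.Trim();
            if (trimmed.Length == 0) continue;
            sb.Append(trimmed).Append("\n\n");
        }
        return sb.ToString();
    }
}
```

The output file would then use a `.md` extension instead of `.txt`; the sequence-number prefix can stay unchanged so that reading order is preserved.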

📄 License

This project is for personal learning and research only. Please comply with relevant copyright laws.

🤝 Contributing

Issues and Pull Requests are welcome!

Made with ❤️ using .NET 9 and VersOne.Epub
