book-sft-pipeline

v2.0.0
About this Skill

book-sft-pipeline is a complete system for converting books into SFT datasets and training style-transfer models, supporting text segmentation pipelines for long-form content. It is ideal for NLP agents that need advanced text analysis and style-transfer capabilities for literary works.

Features

Converts raw ePub files to SFT datasets
Trains style-transfer models for author-voice replication
Supports text segmentation pipelines for long-form content
Prepares training data for Tinker or similar SFT platforms
Enables building fine-tuning datasets from literary works
Creates author-voice or style-transfer models

Author: goodnight000
Updated: 3/7/2026
Installation

Universal install (auto-detect; works with Cursor, Windsurf, and VS Code):

```bash
npx killer-skills add goodnight000/KittyCourt/references/tinker-format.md
```

Agent Capability Analysis

The book-sft-pipeline MCP Server by goodnight000 is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for NLP Agents needing advanced text analysis and style-transfer capabilities for literary works

Core Value

Empowers agents to convert raw ePub files into SFT datasets and train style-transfer models, enabling the creation of author-voice models and text segmentation pipelines for long-form content.

Capabilities Granted for book-sft-pipeline MCP Server

Building fine-tuning datasets from literary works
Creating author-voice or style-transfer models
Preparing training data for Tinker or similar SFT platforms

Prerequisites & Limits

  • Requires raw ePub files as input
  • Limited to training small models
  • Specifically designed for SFT datasets and style-transfer models
Project files:

  • SKILL.md (13.6 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

When to Activate

Activate this skill when:

  • Building fine-tuning datasets from literary works
  • Creating author-voice or style-transfer models
  • Preparing training data for Tinker or similar SFT platforms
  • Designing text segmentation pipelines for long-form content
  • Training small models (8B or less) on limited data

Core Concepts

The Three Pillars of Book SFT

1. Intelligent Segmentation. Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

2. Diverse Instruction Generation. Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

3. Style Over Content. The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR AGENT                           │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │  INSTRUCTION │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │ Chunks →     │ │ Pairs →      │
│              │ │ 150-400 words│ │ Prompts      │ │ JSONL        │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                       │
       ┌───────────────┴───────────────┐
       ▼                               ▼
┌──────────────┐               ┌──────────────┐
│   TRAINING   │               │  VALIDATION  │
│    AGENT     │               │    AGENT     │
│ LoRA on      │               │ AI detector  │
│ Tinker       │               │ Originality  │
└──────────────┘               └──────────────┘

Phase 1: Text Extraction

Critical Rules

  1. Always source ePub over PDF - OCR errors become learned patterns
  2. Use paragraph-level extraction - Extract from <p> tags to preserve breaks
  3. Remove front/back matter - Copyright and TOC pollute the dataset
```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```
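The extractor above does not implement rule 3 (removing front/back matter). A minimal heuristic sketch, assuming boilerplate chapters can be recognized by their opening line (`BOILERPLATE` and `strip_front_back_matter` are illustrative names, not part of the pipeline's API):

```python
import re

# Hypothetical heuristic: drop chapters whose first line looks like
# front/back matter (copyright page, table of contents, index, etc.)
BOILERPLATE = re.compile(
    r'^(copyright|all rights reserved|table of contents|contents|'
    r'acknowledg|about the author|index)\b',
    re.IGNORECASE,
)

def strip_front_back_matter(chapters):
    """Keep only chapters that do not start with a boilerplate marker."""
    kept = []
    for chapter in chapters:
        lines = chapter.strip().splitlines()
        head = lines[0].strip() if lines else ''
        if not BOILERPLATE.match(head):
            kept.append(chapter)
    return kept
```

A keyword list like this will miss publisher-specific wording, so spot-check the first and last chapters of each book manually.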

Phase 2: Intelligent Segmentation

Smaller Chunks + Overlap

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).

```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0

    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words

    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```

Expected Results

For an 86,000-word book:

  • Old method (250-650 words): ~150 chunks
  • New method (150-400 + overlap): ~300 chunks
  • With 2 variants per chunk: 600+ training examples

Phase 3: Diverse Instruction Generation

The Key Insight

Using a single prompt template causes memorization. Diverse templates teach the underlying style.

```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```

Instruction Generation

```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash);
# llm_call is a placeholder for your LLM client
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```

Phase 4: Dataset Construction

Message Format

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```

Multiple Variants Per Chunk

```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```
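The builder's final step is rendering the examples as JSONL, one JSON object per line, the layout expected by most SFT tooling. A minimal sketch (`write_jsonl` is an illustrative helper, not part of the skill's API):

```python
import json

def write_jsonl(examples, path):
    """Serialize one training example per line (JSONL)."""
    with open(path, 'w', encoding='utf-8') as f:
        for example in examples:
            # ensure_ascii=False preserves any non-ASCII prose verbatim
            f.write(json.dumps(example, ensure_ascii=False) + '\n')
```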

Phase 5: LoRA Training on Tinker

Configuration

```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                     # 352MB adapter
    "learning_rate": 5e-4,               # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```

Why Base Model?

Use base (pretrained) models, not instruction-tuned versions:

  • Base models are more malleable for new styles
  • Instruct models have patterns that resist overwriting
  • Style is a low-level pattern that base models capture better

Training Loop

```python
import tinker
from tinker import types

# Assumes `service_client` was created earlier (a tinker ServiceClient)
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```

Phase 6: Validation

Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```

If the model applies style markers to modern scenarios, it learned style, not content.
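Checking for style markers can be partially automated. A crude, hypothetical sketch using two rough statistics (average sentence length and within-passage word repetition); what counts as a "match" against the author's baseline is left to the evaluator:

```python
import re
from collections import Counter

def style_markers(text):
    """Two rough style statistics: average sentence length in words,
    and the fraction of words that repeat within the passage."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    repeated = sum(c for w, c in counts.items() if c > 1)
    return {
        'avg_sentence_words': len(words) / max(len(sentences), 1),
        'repetition_rate': repeated / max(len(words), 1),
    }
```

Compare the model's outputs against the same statistics computed over the original book; large gaps suggest the style did not transfer.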

Originality Verification

```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```
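grep only catches the exact phrases you think to search for. A more systematic (hypothetical) check computes verbatim n-gram overlap between an output and the training texts; high overlap suggests parroting:

```python
def ngram_overlap(output, training_texts, n=8):
    """Fraction of n-word sequences in `output` that appear verbatim
    in any training text."""
    def ngrams(text, n):
        words = text.lower().split()
        return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    return len(out_grams & train_grams) / len(out_grams)
```

With n=8, a small nonzero overlap is normal English collocation; sustained overlap above a few percent warrants a closer look.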

AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

Known Issues and Solutions

Character Name Leakage

Symptom: Model uses original character names in new scenarios.
Cause: Limited name diversity from one book.
Solution: Train on multiple books or add synthetic examples.

Model Parrots Exact Phrases

Symptom: Outputs contain exact sentences from training data.
Cause: Too few prompt variations or too many epochs.
Solution: Use 15+ templates, limit to 3 epochs.

Fragmented Outputs

Symptom: Sentences feel incomplete.
Cause: Poor segmentation breaking mid-thought.
Solution: Always break at paragraph boundaries.

Guidelines

  1. Always source ePub over PDF - OCR errors become learned patterns
  2. Never break mid-sentence - Boundaries must be grammatically complete
  3. Use diverse prompts - 15+ templates, 5+ system prompts
  4. Use base models - Not instruct versions
  5. Use smaller chunks - 150-400 words for more examples
  6. Reserve test set - 50 examples minimum
  7. Test on modern scenarios - Proves style transfer vs memorization
  8. Verify originality - Grep training data for output phrases
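Guideline 6 (reserving a test set) can be sketched as a deterministic split (`split_dataset` is an illustrative helper; the fixed seed makes the split reproducible across runs):

```python
import random

def split_dataset(examples, test_size=50, seed=42):
    """Shuffle deterministically, then reserve `test_size` examples
    as a held-out test set; the rest is the training set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]
```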

Expected Results

| Metric | Value |
| --- | --- |
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% perfect |

Cost Estimate

| Component | Cost |
| --- | --- |
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| Total | ~$2.00 |

Integration with Context Engineering Skills

This example applies several skills from the Agent Skills for Context Engineering collection:

project-development

The pipeline follows the staged, idempotent architecture pattern:

  • Acquire: Extract text from ePub
  • Prepare: Segment into training chunks
  • Process: Generate synthetic instructions
  • Parse: Build message format
  • Render: Output Tinker-compatible JSONL
  • Train: LoRA fine-tuning
  • Validate: Modern scenario testing

Each phase is resumable and produces intermediate artifacts for debugging.
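Such a staged, idempotent runner might be sketched as follows (phase names, the artifact directory, and the JSON artifact format are assumptions for illustration): each phase reads the previous phase's output and persists its own, so a crashed run resumes from the last finished artifact.

```python
import json
from pathlib import Path

def run_pipeline(phases, workdir='artifacts'):
    """Run (name, fn) phases in order; skip any phase whose artifact
    already exists on disk, making reruns idempotent and resumable."""
    out_dir = Path(workdir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data = None
    for name, fn in phases:
        artifact = out_dir / f'{name}.json'
        if artifact.exists():
            data = json.loads(artifact.read_text())  # resume from artifact
            continue
        data = fn(data)
        artifact.write_text(json.dumps(data))
    return data
```

The intermediate `.json` files double as debugging snapshots: inspect `segment.json` (say) without rerunning extraction.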

context-compression

Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.

The two-tier strategy mirrors context compression evaluation:

  • Tier 1: Fast, deterministic compression
  • Tier 2: LLM-assisted for edge cases
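A hypothetical sketch of that two-tier split, where Tier 1 is a cheap deterministic check and Tier 2 defers edge cases to an LLM-assisted fixer (`llm_fix` is a placeholder callable, not a real API):

```python
def compress_chunk(chunk, min_words=150, max_words=400, llm_fix=None):
    """Tier 1: accept chunks already in range that end cleanly.
    Tier 2: defer out-of-range or mid-sentence chunks to `llm_fix`;
    without a fixer, flag them (None) for manual review."""
    words = len(chunk.split())
    ends_cleanly = chunk.rstrip().endswith(('.', '!', '?', '"'))
    if min_words <= words <= max_words and ends_cleanly:
        return chunk            # Tier 1: fast, deterministic
    if llm_fix is not None:
        return llm_fix(chunk)   # Tier 2: LLM-assisted edge case
    return None
```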

multi-agent-patterns

The pipeline uses the supervisor/orchestrator pattern:

  • Orchestrator coordinates phases and manages state
  • Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
  • Each agent receives only the information needed for its task

This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.

evaluation

Validation follows the end-state evaluation pattern:

  • Functional testing: Does output match expected style markers?
  • Originality verification: Is content genuinely generated?
  • External validation: AI detector scores

The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.

context-fundamentals

Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.

References

Related skills from Agent Skills for Context Engineering:

  • project-development - Pipeline architecture patterns
  • context-compression - Compression strategies
  • multi-agent-patterns - Agent coordination
  • evaluation - Evaluation frameworks
  • context-fundamentals - Attention and information density


Skill Metadata

Created: 2025-12-26
Last Updated: 2025-12-28
Author: Muratcan Koylan
Version: 2.0.0
Standalone: Yes (separate from main context-engineering collection)
