Killer-Skills

berdl_start

v1.0.0
GitHub

About this Skill

berdl_start is an AI agent skill that facilitates exploration of microbial ecology data through the KBase Data Lakehouse, a Spark SQL-based Delta Lakehouse hosting multiple databases and collections. It is ideal for research agents in microbial ecology that need comprehensive onboarding to the system.

Features

Presents an overview of the KBase BER Data Lakehouse, including 35 databases across 9 tenants
Routes users to the right context based on their goals
Provides information on key collections, such as kbase_ke_pa
Supports exploration of microbial ecology using Spark SQL
Hosts data in an on-prem Delta Lakehouse
Offers a system overview without requiring file reads

Author: kbaseincubator
Updated: 3/6/2026

Installation

Universal install (auto-detect; works with Cursor, Windsurf, and VS Code):

> npx killer-skills add kbaseincubator/BERIL-research-observatory/berdl_start

Agent Capability Analysis

The berdl_start MCP server by kbaseincubator is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for Research Agents in microbial ecology needing comprehensive onboarding to the KBase Data Lakehouse

Core Value

Empowers agents to route users to the right context based on their goals, leveraging Spark SQL and Delta Lakehouse capabilities, and providing an overview of the 35 databases across 9 tenants

Capabilities Granted for berdl_start MCP Server

Onboarding new researchers to the BERDL system
Routing users to specific databases based on their goals
Providing an overview of the KBase BER Data Lakehouse

Prerequisites & Limits

  • Requires access to the KBase Data Lakehouse
  • Limited to microbial ecology research context
Project files

  • SKILL.md (19.1 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

BERIL Research Observatory - Onboarding

Welcome the user and orient them to the system, then route them to the right context based on their goal.

Phase 1: System Overview

Present this information directly (no file reads needed):

What is BERDL?

The KBase BER Data Lakehouse (BERDL) is an on-prem Delta Lakehouse (Spark SQL) hosting 35 databases across 9 tenants. Key collections:

| Collection | Scale | What it contains |
|---|---|---|
| kbase_ke_pangenome | 293K genomes, 1B genes, 27K species | Species-level pangenomes from GTDB r214: gene clusters, ANI, functional annotations (eggNOG), pathway predictions (GapMind), environmental embeddings (AlphaEarth) |
| kbase_genomes | 293K genomes, 253M proteins | Structural genomics (contigs, features, protein sequences) in CDM format |
| kbase_msd_biochemistry | 56K reactions, 46K molecules | ModelSEED biochemical reactions and compounds for metabolic modeling |
| kescience_fitnessbrowser | 48 organisms, 27M fitness scores | Genome-wide mutant fitness from RB-TnSeq experiments |
| enigma_coral | 3K taxa, 7K genomes | ENIGMA SFA environmental microbiology |
| nmdc_arkin | 48 studies, 3M+ metabolomics | NMDC multi-omics (annotations, embeddings, metabolomics, proteomics) |
| PhageFoundry (5 DBs) | Various | Species-specific genome browsers for phage-host research |
| planetmicrobe_planetmicrobe | 2K samples, 6K experiments | Marine microbial ecology |
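As an illustration, a first query against the pangenome collection might be assembled like this. The table and column names (`gene_cluster`, `cluster_id`, `gene_count`, `species_id`) are illustrative assumptions, not confirmed schema; check docs/schemas/ for the real layout.

```python
def pangenome_query(species_id: str, limit: int = 100) -> str:
    """Build a Spark SQL query for one species' gene clusters.

    Follows two BERDL rules: always filter large tables by species,
    and match IDs with exact equality (IDs contain '--', so LIKE
    patterns are error-prone). Table/column names are illustrative.
    """
    if "'" in species_id:
        raise ValueError("unexpected quote in species id")
    return (
        "SELECT cluster_id, gene_count "
        "FROM kbase_ke_pangenome.gene_cluster "
        f"WHERE species_id = '{species_id}' "
        f"LIMIT {int(limit)}"
    )

# Usage (on a live session):
# spark.sql(pangenome_query("s__Escherichia_coli--RS_GCF_000005845.2"))
```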

Repo Structure

projects/           # Science projects (each has README.md + notebooks/ + data/)
docs/               # Shared knowledge base
  collections.md    # Full database inventory
  schemas/          # Per-collection schema docs
  pitfalls.md       # SQL gotchas, data sparsity, common errors
  performance.md    # Query strategies for large tables
  research_ideas.md # Future research directions
  overview.md       # Scientific context and data generation workflow
  discoveries.md    # Running log of insights
.claude/skills/     # Agent skills
data/               # Shared data extracts reusable across projects

Available Skills

| Skill | What it does |
|---|---|
| /berdl | Query BERDL databases via REST API or Spark SQL |
| /berdl-query | Run SQL queries locally with remote Spark compute (CLI tools + notebook support) |
| /berdl-minio | Transfer files between BERDL MinIO and local machine |
| /berdl-discover | Explore and document a new BERDL database |
| /literature-review | Search PubMed, bioRxiv, arXiv, Semantic Scholar, and Google Scholar for relevant biological literature |
| /synthesize | Read analysis outputs, compare against literature, and draft findings |
| /submit | Submit a project for automated review |
| /cts | Run batch compute jobs on the CTS cluster |

Note: Hypothesis generation, research planning, and notebook creation are handled automatically as part of the research workflow (Path 1 below). You don't need to invoke them separately.

Existing Projects

Discover projects dynamically — run ls projects/ to list them. Read the first line of each projects/*/README.md to get titles. Present the list to the user so they can see what's been done.
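The discovery step above can be sketched in a few lines of Python; this is a minimal sketch of the `ls projects/` + read-title routine, assuming each project README starts with a `# Title` heading:

```python
from pathlib import Path

def list_projects(root: str = "projects") -> dict[str, str]:
    """Map each project directory to the first line of its README.md,
    mirroring the manual `ls projects/` + read-first-line routine."""
    titles: dict[str, str] = {}
    for proj in sorted(Path(root).iterdir()):
        if not proj.is_dir():
            continue
        readme = proj / "README.md"
        lines = readme.read_text().splitlines() if readme.exists() else []
        # Strip the leading "# " so only the title text remains
        titles[proj.name] = lines[0].lstrip("# ").strip() if lines else "(no README)"
    return titles
```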

How Projects Work

Each project lives in projects/<name>/ with a three-file structure plus supporting directories:

projects/<name>/
├── README.md            — Project overview, reproduction, authors
├── RESEARCH_PLAN.md     — Hypothesis, approach, query strategy, revision history
├── REPORT.md            — Findings, interpretation, supporting evidence
├── REVIEW.md            — Automated review (generated by /submit)
├── notebooks/           — Analysis notebooks with saved outputs
├── data/                — Agent-derived data from queries and analysis
├── user_data/           — User-provided input data (gene lists, phenotypes, etc.)
├── figures/             — Key visualizations
└── requirements.txt     — Python dependencies

Reproducibility is required: notebooks must be committed with outputs, figures must be saved to figures/, and README must include a ## Reproduction section. See PROJECT.md for full standards.


Phase 1.5: Environment Detection

Before routing the user, run environment detection to ensure prerequisites are met:

Run the detection script:

```bash
python scripts/detect_berdl_environment.py
```

This script automatically:

  1. Detects location: Tests connectivity to spark.berdl.kbase.us:443 to determine if you're on-cluster (BERDL JupyterHub) or off-cluster (local machine)
  2. On-cluster path:
    • Automatically retrieves KBASE_AUTH_TOKEN from the environment and saves it to .env
    • Confirms direct access works (no proxy needed)
    • Reports ready status
  3. Off-cluster path:
    • Checks .env for KBASE_AUTH_TOKEN
    • Checks .venv-berdl exists
    • Checks SSH tunnels on ports 1337 and 1338
    • Checks pproxy on port 8123
    • Provides specific next steps for anything missing
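The core of that location check can be sketched in a few lines; this is a simplification of what the detection script presumably does, and the real script's logic may differ:

```python
import socket

def is_on_cluster(host: str = "spark.berdl.kbase.us",
                  port: int = 443, timeout: float = 2.0) -> bool:
    """Heuristic location check: if a TCP connection to the Spark
    endpoint succeeds, assume direct (on-cluster) access; otherwise
    assume off-cluster and fall back to the tunnel/proxy checklist."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```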

Present the output to the user and help them resolve any issues before proceeding to Phase 2.

Common resolutions:

  • Missing .venv-berdl: bash scripts/bootstrap_client.sh
  • Missing KBASE_AUTH_TOKEN: Get token from https://narrative.kbase.us/#auth2/account and add to .env
  • Off-cluster without proxy: User must start SSH tunnels (requires credentials). See .claude/skills/berdl-query/references/proxy-setup.md for full guide. Claude can start pproxy once tunnels are up.

Phase 1.7: Session Naming Reminder

If the session does not already have a name, remind the user:

Tip: Name this session for easy identification — especially useful for long-running or remote sessions where the connection may drop. A good convention is to match the project name and git branch (e.g., essential_metabolome).

This is a non-blocking reminder. Move on to Phase 2 regardless.


Phase 2: Interactive Routing

Ask the user which of these they want to do:

  1. Start a new research project
  2. Explore BERDL data
  3. Review published literature
  4. Continue an existing project
  5. Understand the system

Then follow the appropriate path below.


Path 1: Start a New Research Project (Orchestrated Workflow)

When the user wants to start a new research project, the agent drives the entire process — from ideation through review — checking in with the user at natural decision points rather than requiring manual skill invocations.

Phase A: Orientation & Ideation

Required reading before anything else:

  1. Read PROJECT.md — understand dual goals (science + knowledge capture), project structure requirements, reproducibility standards, JupyterHub workflow, Spark notebook patterns
  2. Read docs/overview.md — understand the data architecture, key tables, data generation workflow, known limitations
  3. Read docs/collections.md — full database inventory (35 databases, 9 tenants), what data is actually available
  4. Read docs/pitfalls.md and docs/performance.md (critical: read these before designing any queries or analysis)
  5. Read docs/research_ideas.md — check for existing ideas, avoid duplicating work

Additional setup (Phase 1.5 should have already checked KBASE_AUTH_TOKEN and proxy):

  6. Check gh auth status — needed for creating branches, PRs, and pushing code. If not authenticated, prompt the user to run gh auth login
  7. Note the user's ORCID and affiliation for the Authors section — ask once, remember for future projects

Then engage with the user:

  8. Chat with the user about their research interest
  9. Explore BERDL data (use /berdl queries) to check data availability, row counts, column types
  10. Check existing projects (ls projects/) — read READMEs of related projects to understand what's been done
  11. Develop 2-3 testable hypotheses with H0/H1
  12. Search literature (use /literature-review internally) for context

Phase B: Research Plan

  1. Write RESEARCH_PLAN.md with: Research Question, Hypothesis, Literature Context, Approach, Data Sources, Query Strategy, Analysis Plan, Expected Outcomes, Revision History (v1)

  2. Write slim README.md with: Title, Research Question, Status (In Progress), Overview, Quick Links, Reproduction placeholder, Authors

  3. Create project directory structure: notebooks/, data/, user_data/, figures/

  4. Suggest naming this session to match the project: "Consider naming this session {project_id} to match the branch."

  5. Create branch projects/{project_id}, switch to it, and commit README + RESEARCH_PLAN. Working on a dedicated branch from the start avoids accumulating changes on main during long-running projects. Tell the user what you're doing — if they prefer to stay on main, skip branch creation.

  6. Optional: Research plan review — Offer to run a quick review of the research plan before starting analysis. If the user accepts, invoke the plan reviewer subagent:

    ```bash
    CLAUDECODE= claude -p \
      --system-prompt "$(cat .claude/reviewer/PLAN_REVIEW_PROMPT.md)" \
      --allowedTools "Read" \
      --dangerously-skip-permissions \
      "Review the research plan at projects/{project_id}/. Read RESEARCH_PLAN.md and README.md. Also read docs/pitfalls.md, docs/performance.md, docs/collections.md, and PROJECT.md. Check docs/schemas/ for any tables referenced in the plan. List existing projects with ls projects/ and read their README.md files to check for overlap. Return a concise list of suggestions."
    ```

    Present the suggestions to the user. They can address them, note them for later, or skip — this is advisory, not blocking.

Phase C: Analysis (Notebooks)

  1. Write numbered notebooks (01_data_exploration.ipynb, 02_analysis.ipynb, etc.) following the analysis plan
  2. Notebooks are the primary audit trail — do as much work as possible in notebooks so humans can inspect intermediate results
  3. When parallel execution or complex pipelines are needed, write scripts in src/ but call them from notebooks
  4. Run notebooks — execute cells, inspect outputs, iterate
  5. As new information emerges, update RESEARCH_PLAN.md with a revision tag: - **v2** ({date}): {what changed and why}
  6. Check in code frequently — commit after each major milestone (plan written, notebooks created, data extracted, analysis complete)
  7. Re-read docs/pitfalls.md when something doesn't work as expected

Checkpoint: Results Review

After notebooks are executed and committed, pause and present the key results to the user before moving to synthesis. This is a natural decision point — the user may want to inspect figures, question a result, or request additional analysis before the interpretation gets written.

  1. Summarize the key results: main statistics, notable patterns, anything unexpected
  2. Ask: "Would you like to look at the notebooks/figures before I proceed with the writeup, or should I go ahead with /synthesize?"
  3. If the user wants to explore first, wait. If they want changes, iterate on the notebooks before proceeding.

Phase D: Synthesis & Writeup

  1. Run /synthesize to create REPORT.md with findings, interpretation, supporting evidence
  2. Commit the report
  3. Chat with user about the report — revise if needed

Phase E: Review & Submission

  1. Run /submit to validate documentation and generate REVIEW.md
  2. Fix any issues flagged by the review
  3. Commit fixes
  4. Upload project to the lakehouse: python tools/lakehouse_upload.py {project_id} (prompted by /submit after clean review)
  5. Chat with user about next steps

Throughout the Entire Workflow:

  • Check in code often — don't let work accumulate uncommitted
  • Update docs/discoveries.md when you find something interesting (tag with [project_id])
  • Update docs/pitfalls.md when you hit a gotcha (follow pitfall-capture protocol from .claude/skills/pitfall-capture/SKILL.md)
  • Update docs/performance.md when you learn a query optimization
  • Re-read docs/pitfalls.md when debugging failures — the answer may already be documented
  • Re-read docs/performance.md when queries are slow — check for existing optimization patterns
  • Follow PROJECT.md standards — notebooks with saved outputs, figures as standalone PNGs, requirements.txt, Reproduction section in README
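The real-time capture habit above can be scripted; a minimal sketch, where the entry format (a dated bullet with a [project_id] tag) is an assumption about the log's conventions:

```python
from datetime import date
from pathlib import Path

def log_discovery(text: str, project_id: str,
                  path: str = "docs/discoveries.md") -> str:
    """Append a dated, project-tagged bullet to the discoveries log
    so insights are captured as they happen, not reconstructed later."""
    entry = f"- {date.today().isoformat()} [{project_id}] {text}\n"
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```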

Path 2: Explore BERDL Data

Read these files:

  • docs/collections.md — full database inventory
  • docs/pitfalls.md — critical gotchas before querying

Then:

  • Summarize what databases are available and their scale
  • Highlight cross-collection relationships (pangenome <-> genomes <-> biochemistry <-> fitness)
  • Suggest using /berdl to start querying
  • Suggest using /berdl-discover if they want to explore a database not yet documented in docs/schemas/
  • Warn about the key pitfalls (see Critical Pitfalls below)

Path 3: Review Published Literature

Suggest using /literature-review to search biological databases. This is useful for:

  • Checking what's already known about an organism or pathway before querying BERDL
  • Finding published pangenome analyses to compare against BERDL data
  • Supporting a hypothesis with existing citations
  • Discovering methods and approaches used in similar studies

MCP setup check: The paper-search-mcp (openags/paper-search-mcp) is configured in .mcp.json. It runs via uvx paper-search-mcp. If it's not working:

  1. Ensure Python 3.10+ and uv are installed
  2. Test: uvx --from paper-search-mcp python -m paper_search_mcp.server
  3. Optionally set SEMANTIC_SCHOLAR_API_KEY in your environment for enhanced Semantic Scholar features
  4. The skill falls back to WebSearch if the MCP server is unavailable

Path 4: Continue an Existing Project

Steps:

  1. Run ls projects/ and list all projects for the user to choose from
  2. Read the chosen project's README.md
  3. Check if a REVIEW.md exists in that project directory (read it if so)
  4. Summarize where the project stands: what's done, what's next
  5. Suggest using /submit when the project is ready for review

Path 5: Understand the System

Read these files:

  • PROJECT.md — high-level goals and structure
  • docs/collections.md — database inventory
  • docs/overview.md — scientific context and data workflow

Then:

  • Walk through the dual goals (science + knowledge capture)
  • Explain the documentation workflow (tag discoveries, update pitfalls)
  • Mention the UI can be browsed at the BERDL JupyterHub
  • List the available skills and what each does
  • Point to docs/research_ideas.md for future directions

Key Principles (for the agent)

  1. Read the docs first: PROJECT.md, docs/overview.md, docs/collections.md, docs/pitfalls.md, and docs/performance.md before designing anything. Check existing projects/ to avoid duplicating work.
  2. Notebooks are the audit trail — numbered sequentially (01, 02, 03...), each self-contained with a clear purpose. Commit with saved outputs per PROJECT.md reproducibility standards.
  3. Commit early and often — after plan, after notebooks, after data extraction, after analysis, after synthesis.
  4. Branch by default — create a projects/{project_id} branch when starting a new project. Extended work on main causes merge difficulties and risks conflicting with other contributors. Tell the user what branch you're creating; if they explicitly prefer main, respect that.
  5. Update the plan — when the analysis reveals something that changes the approach, update RESEARCH_PLAN.md with a dated revision tag explaining what changed and why.
  6. Don't stop and wait — drive the process forward, checking in with the user at decision points rather than stopping after each step.
  7. Document as you go — discoveries go in docs/discoveries.md, pitfalls in docs/pitfalls.md, performance tips in docs/performance.md — captured in real-time, tagged with [project_id].
  8. Use Spark patterns from PROJECT.md: get_spark_session(), PySpark-first, .toPandas() only for final small results.

Critical Pitfalls (always mention)

Regardless of path chosen, surface these early:

  1. Species IDs contain -- — This is fine inside quoted strings in SQL. Use exact equality (WHERE id = 's__Escherichia_coli--RS_GCF_000005845.2'), not LIKE patterns.
  2. Large tables need filters — Never full-scan gene (1B rows) or genome_ani (420M rows). Always filter by species or genome ID.
  3. AlphaEarth embeddings cover only 28% of genomes (83K/293K) — check coverage before relying on them.
  4. Match Spark import to environment — On JupyterHub notebooks: spark = get_spark_session() (no import). On JupyterHub CLI/scripts: from berdl_notebook_utils.setup_spark_session import get_spark_session. Locally: from get_spark_session import get_spark_session (requires .venv-berdl + proxy chain). See docs/pitfalls.md for details.
  5. Auth token — stored in .env as KBASE_AUTH_TOKEN (not KB_AUTH_TOKEN).
  6. String-typed numeric columns — Many databases store numbers as strings. Always CAST before comparisons.
  7. Gene clusters are species-specific — Cannot compare cluster IDs across species. Use COG/KEGG/PFAM for cross-species comparisons.
  8. Avoid unnecessary .toPandas(): it pulls all data to the driver node and can be very slow or cause OOM errors. Use PySpark DataFrame operations for filtering, joins, and aggregations. Only convert to pandas for final small results (plotting, CSV export).
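Pitfalls 2 and 6 lend themselves to a small query-building helper; a sketch, where the column name in the usage line is hypothetical:

```python
def numeric_filter(column: str, op: str, value: float) -> str:
    """Build a WHERE fragment that CASTs a string-typed column to
    DOUBLE before comparing (pitfall 6): comparing raw strings
    silently gives lexicographic, not numeric, ordering."""
    if op not in {"=", "<", ">", "<=", ">=", "<>"}:
        raise ValueError(f"unsupported operator: {op!r}")
    return f"CAST({column} AS DOUBLE) {op} {value}"

# e.g. filter a (hypothetical) fitness table instead of full-scanning it:
frag = numeric_filter("fitness_score", "<", -2.0)
```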

Templates

RESEARCH_PLAN.md

```markdown
# Research Plan: {Title}

## Research Question
{Refined question after literature review}

## Hypothesis
- **H0**: {Null hypothesis}
- **H1**: {Alternative hypothesis}

## Literature Context
{Summary of what's known, key references, identified gaps}

## Query Strategy

### Tables Required
| Table | Purpose | Estimated Rows | Filter Strategy |
|---|---|---|---|
| {table} | {why needed} | {count} | {how to filter} |

### Key Queries
1. **{Description}**:
\```sql
{query}
\```

### Performance Plan
- **Tier**: {REST API / JupyterHub}
- **Estimated complexity**: {simple / moderate / complex}
- **Known pitfalls**: {list from pitfalls.md}

## Analysis Plan

### Notebook 1: Data Exploration
- **Goal**: {what to verify/explore}
- **Expected output**: {CSV/figures}

### Notebook 2: Main Analysis
- **Goal**: {core analysis}
- **Expected output**: {CSV/figures}

### Notebook 3: Visualization (if needed)
- **Goal**: {figures for findings}

## Expected Outcomes
- **If H1 supported**: {interpretation}
- **If H0 not rejected**: {interpretation}
- **Potential confounders**: {list}

## Revision History
- **v1** ({date}): Initial plan

## Authors
{ORCID, affiliation}
```

README.md

```markdown
# {Title}

## Research Question
{Refined question}

## Status
In Progress — research plan created, awaiting analysis.

## Overview
{One-paragraph summary of the hypothesis and approach}

## Quick Links
- [Research Plan](RESEARCH_PLAN.md) — hypothesis, approach, query strategy
- [Report](REPORT.md) — findings, interpretation, supporting evidence

## Reproduction
*TBD — add prerequisites and step-by-step instructions after analysis is complete.*

## Authors
{Authors}
```
