Comparative genomics of Desmophyllum pertusum
Nordic (Tisler reef, 2015 PacBio CLR + Illumina + mate-pair) versus
US (Charleston, 2023 published reference Loph_1.0) same species,
very different sequencing histories.
The 2014–2018 Tisler assembly project was abandoned for lack of a US reference and a sharp
biological question. Both are gone now: the published Loph_1.0 anchors structural comparison,
and we focus on mucus immunity — the C3 complement system, DMBT1 / SRCR
scavenger receptors — a story rooted in 600 My of conserved innate-immune scaffolding.
Every PacBio read is fully documented in the parquet DB before any narrowing.
Axis-F structural class (EXPANDER / ANCHOR / BURIED / SUSPICIOUS), full Pfam-A annotation,
kraken2 taxonomic call. The immune atlas is a query-time predicate, not a scan-time filter.
Same DB, every read, joinable on read_id.
Four rules: threads-where-they-saturate · stream-to-disk · sync-before-next · idempotent substep ledger.
Every long substep records to state/runs.jsonl. Re-runs skip completed work.
Hard-won from two real incidents — M1 OOM (181 GB Counter) and B1 1.8-core hmmscan
mis-direction — both now case studies in the design memory.
Cross-contig paired RNA-seq reads → bridge graph (JSON + parquet) → pseudo-scaffolds.
Tested at scale on one Tisler RNA sample (27.5 M pairs) against per-contig grain target: 8,635 nodes, 15,344 edges. Trinity-transcript overlay deferred — pair evidence first.
Roadmap & maps
Where the project is, where it's going, and what the fleet is doing right now. Compact dashboard of figure plan, pipeline state, cross-species scope, and open captain decisions. For the operating-model playbook (captain & chief mate roles, fleet wiring, the four pipeline rules), see helm.deepsek.no.
Figure plan (F1 – F6)
| fig | topic | data status | blocker |
|---|---|---|---|
| F1 | Sampling map + species ID (Tisler vs Charleston; D. pertusum sensu stricto) | data in hand | write-up only |
| F2 | Assembly comparison Flye v1 (33% haplotig-inflated) vs Loph_1.0 ref | partial — Flye v1 exists | haplotig purge + RNA-bridge scaffolding (deferred) |
| F3 | Cross-pop immune-domain atlas (8 pools, 124 immune HMMs + 219 curated full) | data in hand + Job D curated landing | fold full-Pfam totals into card |
| F4 | DMBT1 / SRCR family architecture — Lophelia + cnidaria-wide | closed — three converging evidence sources say SRCR is present but DMBT1 architecture (SRCR + CUB + ZP) is genuinely absent in D. pertusum | cnidaria-wide cross-species check running on t5 now (34 proteomes) |
| F5 | RNA-seq DE — mucus-immunity + mito-immune nexus expression | 11/12 samples aligned (NN1 broken-fastq excluded); 40,679 gene × 11 sample count matrix built | captain decisions: D-vs-NN biology, headline contrast |
| F6 | Complement cascade architecture (lectin → MAC; classical absent) | data in hand — 5 A2M / 44 MACPF / 134 Ficolin / 161 lectin_C proteins in Loph_1.0; lectin pathway dominant in both populations, US +56% Ficolin / +57% Ldl_recept_a | refresh card with cnidaria-wide complement portrait |
Pipeline state map
| stage | host | state | output |
|---|---|---|---|
| Tisler PacBio data (12 cells) | t5 | ✓ done | 25.4× CLR, 14.17 Gbp on disk |
| Loph_1.0 reference fetch | t5 | ✓ done | 2,858 contigs, 557 Mb |
| Flye v1 assembly | t5 | ✓ done | 31,741 contigs, N50 34.4 kb (haplotig-inflated, expected) |
| ARAGONITE M0-M5 + axis-F partition | t5 | ✓ done | per-region architects + mortar pools |
| Cross-pop diff (8 pools, 4.3 M reads) | super | ✓ done | shared / partial / private / strict-private × nor / us |
| Kraken on all 8 pools (PlusPF) | super | ✓ done | cross-pop signal mostly real biology, not host contamination |
| 124-HMM immune-Pfam on all 8 pools | super | ✓ done | 8,425 hits across 62 domains |
| us_mortar immune-Pfam (Job B) | super | ✓ done | 702 hits, 0 CUB/ZP |
| Loph_1.0 proteome immune-Pfam scan | t5 | ✓ done | 53 SRCR proteins, 0 with CUB/ZP/Mucin/VWD/Trefoil |
| Mito-orthologs Pfam-A scan (22,911 cnidarian proteins) | t5 | ✓ done | 49,249 hits, 706/727 Lophelia mito-orthologs annotated |
| 124 cnidarian nucleotide BLAST DBs | t5 | ✓ done | 18 GB, 124 species, 1 missing (Astrangia empty dir) |
| tblastn cross-cnidaria immune hunt | t5 | ✓ done | 1.84 M rows, 124 species × 275 queries |
| RNA-seq STAR alignment (GTF-free, plain) | trx50 | ✓ done | 6 US + 5 Nordic samples aligned (NN1 broken-fastq excluded) |
| S4 featureCounts gene matrix | trx50 | ✓ done | 40,679 genes × 11 samples count matrix |
| Job D — full Pfam-A (219 curated) on 8 pools | super | running | 4/8 pools through, ~3 hr remaining |
| Cnidaria-wide proteome immune scan | t5 | running | ~6/34 species through, ~5 min remaining |
| S5 DESeq2 differential expression | trx50 | blocked | awaits captain decisions on D-vs-NN + contrast |
| S6 immune-gene DE overlay | trx50/t5 | blocked | after S5 |
Fleet map
(NFS mounts — t5 is the hub) t5 ◄──── ~/Desktop/super ─── super (192.168.1.12) — first offload + nginx vhost server t5 ◄──── ~/Desktop/trx50 ─── trx50 (192.168.1.34) — RNA-seq offload t5 ◄──── ~/Desktop/d8 ─── d8 (192.168.1.10) — disk-only (mainboard swap pending) super, trx50, d8 do NOT mount t5 (data must be pushed to each) ssh hub: t5 → super:45497, trx50:4540 (custom ports, dedicated key) exec: scripts/fire.sh — detached tmux + NFS log/exit polling
Threads budget: t5 = 25, super = 40, trx50 = 25. One heavy task per box. Between boxes, parallel is fine. Full details in helm.deepsek.no.
Cross-species comparative scope
| resource | count | used for |
|---|---|---|
| Cnidarian genome assemblies on disk | 125 | tblastn substrate |
| Annotated proteomes (NCBI RefSeq / GCF) | 34 | HMM scans, ortholog mapping |
| Per-genome nucleotide BLAST DBs built | 124 | tblastn queries — 1.84 M hits total |
| Per-species protein BLAST DBs | 34 | blastp / homology validation |
| MitoCarta human inventory → cnidaria ortholog map | 26,054 rows | mito-nuclear gene catalog |
| Mito-immune nexus (mito × immune intersection) | 985 rows | direct manuscript material — TFAM, ECSIT, MUL1, CASP9 |
Locked methodological decisions
| decision | value | rationale |
|---|---|---|
| STAR index | GTF-free, FASTA-only | no annotation priors; de novo splice junction discovery |
| STAR alignment | minimal flags (no twopass, no fancy filters) | captain preference — plain mapping |
| BAM-extract pair fix | BBTools repair.sh after samtools fastq | defensive re-pair + orphan audit |
| Threads | t5 = 25, super = 40, trx50 = 25 | per-box headroom; verify %CPU saturation |
| MAX_AA | 5000 with sliding window (W 5000, S 4800) | HMMER Plan7 is O(L × N); cap is first-class scaling param |
| Per-sample failure | flag in ledger + continue | captain rule: one broken sample shouldn't kill the batch |
| State spine | NFS + markdown (INTRO / FINDINGS) + JSONL ledger | no SSH automation, no shared agent state, slow but clear |
Open captain decisions
- D vs NN biology — what does the Tisler sample-label encoding mean (tissue, depth, treatment)? Needed before locking DESeq2 contrast direction.
- Headline contrast — within-Nordic D-vs-NN, cross-pop Nordic-vs-US, or combined
~population + group? Affects power and what the F5 figure shows. - NN1 fate — Tisler NN1 raw fastqs from L006 are broken; STAR refused them twice; BBTools reformat didn't catch the issue. Currently excluded → 5-sample Nordic. Option to hunt down the specific corrupt records, or accept N=5.
Pipeline · aragonite
Eight steps from raw PacBio BAM to a proposed contig graph. Each step's script + small-test signal + ledger key is on the repo side; the chart below is the structural map.
Design rules (the four-rule discipline, in case study form):
rule 1 — threads where they saturate (case: hmmscan --cpu 25 only used 1.8 cores
on 124-HMM scan; flipped to parallel hmmsearch across HMMs);
rule 2 — stream to disk (case: M1 Python Counter at 181 GB OOM-killed; moved to
kmc);
rule 3 — sync before next stage (within AND between pipelines, no oversubscription);
rule 4 — idempotent substeps + ledger (re-runs resume from missing substeps; pre-populated
ledger rows express "done by design").
Immune highlights
Three gene families anchor the cnidarian mucus-immunity story we're building: the SRCR scavenger-receptor family, the secreted mucin / DMBT1 architecture (multi-SRCR + CUB + ZP), and the complement cascade (lectin → MAC). Evidence comes in three layers: (i) the cross-pop diff-pool Pfam scan of 4.3 M PacBio reads (genomic domain landscape; broad but over-counts when domains are promiscuous), (ii) the truth-carrier named-ortholog stack (4 carriers · 1,233 human canonical refs · DIAMOND vs 34 cnidaria · clean 1:1 orthology), and (iii) the DESeq2 differential expression on 11 RNA-seq samples under the strict expression filter (the publishable named-ortholog DE table). The headline figure above synthesises layers (ii) and (iii). Layer (i) gives the domain landscape; the SRCR / Mucins / Complement cards below report it.
S5 DESeq2 named-ortholog volcano11-sample RNA-seq (5 Nordic / 6 US) under the strict expression filter (≥10 counts in ≥3 samples per group, both pops) — 16,611 testable genes; 473 of them are carrier-annotated immune orthologs. Positive log2FC = NOR-elevated.
The top-right cluster is the headline biology: C3 (+2.11 LFC, padj 9×10⁻⁷), MASP1/C1R/C1S consolidated hub (+2.50, padj 5×10⁻¹⁵), and C2 (+3.89, padj 2×10⁻¹³) — three classical/lectin activation-cascade complement genes coherently NOR-elevated. Supporting NOR-elevated: MUC6 gel-mucin (+3.55) and the DDX58/MDA5/LGP2 viral-RNA-sensor hub (+1.60). Cluster near origin (DMBT1/CD163L1, NLRP3, MyD88, CFH) is the constitutive innate-immune backbone — balanced both pops, no DE. MitoCarta panel (teal) is mostly housekeeping; the few extreme US↑ mito outliers are residual library-composition artifact.
Tools: pyDESeq2 0.5.4 · contrast: pop nor vs us ·
padj < 0.05 + |LFC| ≥ 1 · script
scripts/aragonite/s5d_headline_volcano.py ·
underlying data
~/scratch/s5_deseq2/deseq2_strict_filter_immune.tsv
named human orthologyFour pathway-specific carriers of canonical human reference proteins, each DIAMOND-blastp'd against 34 cnidarian proteomes. Replaces the broad-Pfam-domain approach (over-counts when domains are promiscuous: Trypsin, TSP_1, EGF, Ldl_a) with named 1:1 ortholog mapping back to Reactome / UniProt / MitoCarta3.
| carrier | human refs | Lophelia orthologs | signature finding |
|---|---|---|---|
| complement | 49 | 19 | terminal MAC absent in Lophelia (no C6-C9); consolidated hub
KAJ7394693.1 = C1R + C1S + MASP1 + MASP2 |
| mucin / SRCR | 24 | 8 | 0/26 cnidaria carry DMBT1 SRCR+CUB architecture (named-ortholog
confirmation); Lophelia DMBT1 best-hit KAJ7373414.1 = SRCR-only,
also the CD163L1 best-hit (consolidated hub) |
| mitocarta | 1,132 | 727 | 418-gene universal cnidarian mito-core (≥30/34 species) — the eukaryotic-ancestral mitochondrial backbone |
| TLR / NLR / RLR | 28 | 4 | Canonical PRR folds preserved (35/37 TLR hits carry LRR+TIR,
42/53 NLR hits carry NACHT+LRR) but reduced gene families; Lophelia hub
KAJ7375464.1 = DDX58 + IFIH1 + DHX58 (all 3 RLR paralogs collapsed) |
The consolidated-hub pattern is the deep-time signal. Across all four carriers, Lophelia (and most anthozoans) collapse multiple vertebrate paralogs onto single cnidarian loci — vertebrate gene duplications evidently happened after the cnidarian–bilaterian split. The pattern repeats four times with independent gene families: strongly publishable.
Layout: ~/cnidaria_genomes/_comparative/truth_carriers/<pathway>/ ·
per-carrier matrix tab <pathway>_pathway_cn_per_species.tsv ·
scripts scripts/aragonite/{build_,diamond_,build_*_matrix}_*.py
PF00530Scavenger Receptor Cysteine-Rich. The prime mucus-immunity target — DMBT1 carries 8–16 tandem SRCR copies per protein. 356 SRCR domain hits across the 8 diff pools — concentrated in shared sequence (nor_shared 108, us_partial 148), almost absent from strict-private pools (1 / 0). US-Atlantic carries slightly more SRCR signal than Nordic (192 vs 164, +17%), but the cross-pop divergence in SRCR is modest compared to the complement cascade below.
Canonical DMBT1-like architecture. Each bead is one ~100-aa SRCR domain (PF00530), tandemly repeated. The CUB and ZP C-terminal modules anchor the protein to the mucus matrix. Cnidarian DMBT1 homologs have been catalogued across coral and sea-anemone proteomes; whether D. pertusum carries an intact multi-SRCR mucin is what this analysis tests.
| pool | SRCR hits | reads w/ hit | Fib_C | Trypsin | Sushi | Ldl_a |
|---|---|---|---|---|---|---|
| nor_partial | 52 | 1006 | 273 | 106 | 33 | 15 |
| nor_shared | 108 | 2416 | 542 | 246 | 76 | 46 |
| nor_private | 3 | 33 | 13 | 4 | 0 | 0 |
| nor_strict_private | 1 | 4 | 0 | 1 | 0 | 0 |
| us_partial | 148 | 2911 | 993 | 324 | 63 | 61 |
| us_shared | 37 | 948 | 268 | 97 | 27 | 33 |
| us_private | 7 | 106 | 33 | 15 | 4 | 2 |
| us_strict_private | 0 | 9 | 0 | 0 | 0 | 1 |
| Nordic total | 164 | 3459 | 828 | 357 | 109 | 61 |
| US total | 192 | 3974 | 1294 | 436 | 94 | 96 |
Pool definitions (from the 4.3 M-read pb_read_diff): shared = reads carrying k-mers
present in both populations; private = reads enriched in one population only;
partial = intermediate overlap; strict_private = single-population
k-mer signature, the cleanest "this population only" pool.
Every read in every pool documented — no read discarded — per the BAC-PAC principle.
Secreted mucus PRRs combining multi-SRCR + CUB + ZP domains. The signature architecture for mucus immunity in vertebrates — and the working hypothesis for cnidarian mucosal defense at the coral–water interface.
What we look for: a single PacBio read that co-carries SRCR + CUB (or SRCR + ZP) in the same 6-frame translation. That's a positional argument — the domains are on one piece of contiguous DNA — much stronger than independent hits.
| mucin-architecture domain | Nordic | US | total (8 pools) |
|---|---|---|---|
| SRCR · scavenger-receptor (PF00530) | 164 | 192 | 356 |
| CUB · DMBT1 anchor (PF00431) | 0 | 0 | 0 |
| ZP · zona pellucida C-term (PF00100) | 0 | 0 | 0 |
| Mucin · O-glycosyl mucin core (PF01456) | 0 | 0 | 0 |
| VWD · gel-forming mucin (PF00094) | 0 | 0 | 0 |
| Trefoil · TFF-family mucin partner (PF00088) | 0 | 0 | 0 |
Three converging lines of evidence say the SRCR-containing proteins in D. pertusum are SRCR-only, never in classical DMBT1 architecture: (i) zero CUB/ZP/Mucin/VWD/Trefoil hits across all 8 cross-pop diff pools (4.3 M reads); (ii) zero co-occurrence in us_mortar (axis-F tandem-repeat pool, 769 MB, 31 SRCR hits); (iii) zero co-occurrence in Loph_1.0 reference proteome (37,484 proteins, 128 SRCR hits in 53 proteins).
Cnidaria-wide check (2026-05-17)
To rule out a D. pertusum-only quirk, we extended the same scan to 34 annotated cnidarian proteomes across Anthozoa (corals + anemones), Hydrozoa, Scyphozoa, and Myxozoa (513,899 protein-domain hits, 98 distinct immune Pfam HMMs):
| cnidaria-wide totals | SRCR proteins | CUB | ZP | Mucin | VWD | Trefoil |
|---|---|---|---|---|---|---|
| 31 species with at least one immune-Pfam hit | 2,410 | 0 | 0 | 0 | 0 | 0 |
| top SRCR carriers | Dendronephthya gigantea 1,012 · Actinia tenebrosa 715 · Oculina patagonica 438 · Acropora muricata 352 · Montipora foliosa 351 · Pocillopora verrucosa 335 | |||||
| proteins with SRCR + (CUB | ZP | Mucin | VWD | Trefoil) | 0 — across 205,278 cnidarian proteins tested | |||||
Verdict — interpretation (b) wins, cnidaria-wide. The SRCR scavenger-receptor family is ancient and expanded across Cnidaria (2,410 proteins, 7,604 domain hits, present in every major class except the myxozoan parasites with reduced genomes). But the canonical mammalian DMBT1 scaffold (multi-SRCR + CUB + ZP) is genuinely absent in Cnidaria — the CUB+ZP-anchored mucin architecture is a vertebrate / mammalian innovation. Cnidarian mucus immunity uses SRCR-only PRRs (and their non-CUB co-domain partners: Ig_2, TSP_1, Trypsin, Pkinase, Ldl_recept_a in D. pertusum's 53 SRCR proteins).
Named-ortholog confirmation (2026-05-18, DMBT1 = UniProt Q9UGM3)
The Pfam-architecture finding now re-confirmed against the named human DMBT1 reference. DIAMOND blastp of human DMBT1 (Q9UGM3, 2,413 aa) vs 34 cnidarian proteomes returns best-hits in 26/26 anthozoans (pident 30-50%, qcov 28-80%), but 0/26 carry the SRCR+CUB co-occurrence that defines vertebrate DMBT1-mediated mucus immunity — all are SRCR-only (some with Ig_2, Sushi, or Ldl_recept_a accessory). Hydrozoans (4 spp) + Rhopilema return no DMBT1 best-hit at all. The keystone PRR is anthozoan-specific in cnidaria.
Lophelia DMBT1 best-hit: KAJ7373414.1 (pident 32.5,
qcov 28.8, SRCR-only architecture, NCBI annotation: "Neurotrypsin") —
also the best-hit for CD163L1. Single Lophelia locus = both vertebrate paralogs
(consolidated-hub pattern, recurring across all four truth-carriers).
NCBI annotation caveat: NCBI's product field labels a different Lophelia
locus (OS493_018800 = our CD163 best-hit) as "Deleted in malignant brain
tumors 1" — i.e., NCBI calls a CD163-like protein DMBT1. The truth-carrier method
uses closest-sequence-match; NCBI uses a different criterion. Both are SRCR-rich;
the disagreement is documented and will be footnoted in the manuscript.
Expression evidence (Dahl-RNA1 + US SRA, ≥100 raw reads cutoff)
| category | n genes | expr Nordic | expr US | both | silent |
|---|---|---|---|---|---|
| SRCR | 53 | 11 | 24 | 9 | 27 |
| Mucin-anchor (CUB/ZP/Mucin/VWD/Trefoil) | 0 | nothing to express — gene set empty | |||
Top SRCR expression (balanced across populations):
OS493_025544 (KAJ7328140.1) — NOR 6,341 / US 6,288, pure SRCR, housekeeping-level.
OS493_013007 — NOR 4,224 / US 5,370, pure SRCR.
OS493_017426 — NOR 1,561 / US 523, pure SRCR (Nordic-favoured).
OS493_024254 — NOR 571 / US 414, SRCR + Ig_2 + Trypsin combo.
SRCR + kinase fusion, US-expressed: OS493_036508
(PK_Tyr_Ser-Thr + Pkinase + SRCR, NOR 0 / US 2,705).
Caveat — BAM-extract bias. 5 of 6 Nordic samples are samtools fastq
from BAMs previously aligned to the 142-contig us_lp_filter subset (354 Mb of 557 Mb);
NN1 is from raw cutadapt2 fastq. So Nordic counts for genes on the missing 203 Mb are
structural zeros, not biological zeros. Genes WITHIN the subset show honest cross-pop
comparison (above examples are all on-subset).
The most evidence-rich immune system in our scan, and the one that diverges most strongly between Nordic and US. 2,122 Ficolin hits + 793 MASP-like serine-protease hits + 203 Sushi + 157 MAC-pore hits across the 8 pools — 3,275 complement-arm hits total, the lectin → MAC pathway dominating in both populations. US-Atlantic carries ~56% more Ficolin signal than Nordic (1,294 vs 828), the strongest cross-pop differential we measured.
Lectin pathway dominates both populations. Ficolin (Fibrinogen_C) is the single strongest immune signal in the entire diff space — 2,122 hits, ahead of every other Pfam domain across all 124 HMMs tested. MASP-like serine proteases (Trypsin) and Factor B / MASP accessory modules (Sushi) are well attested. The membrane-attack pool (Ldl_recept_a, the C6–C9 MAC component) is also strongly cross-pop. The C3 α-2-macroglobulin thioester core itself remains under-powered in diff space — but absence here is consistent with C3 being conserved between Nordic and US (i.e., visible only in the shared bulk, not yet scanned).
| component | Pfam | Nordic | US | total | US / Nor |
|---|---|---|---|---|---|
| Ficolin · lectin path initiator | PF00147 | 828 | 1,294 | 2,122 | +56% |
| MASP · serine protease (Trypsin) | PF00089 | 357 | 436 | 793 | +22% |
| Sushi · Factor B / MASP accessory | PF00084 | 109 | 94 | 203 | –14% |
| Ldl_recept_a · MAC (C6–C9) | PF00057 | 61 | 96 | 157 | +57% |
| Complement-arm subtotal | 1,355 | 1,920 | 3,275 | +42% | |
The +56% Ficolin and +57% Ldl_recept_a enrichment in US-Atlantic is the headline cross-pop differential. Both sit on the lectin → MAC axis (recognition → pore formation), so the divergence is coherent rather than scattered — a real population-level skew in mucus complement.
RNA-seq expression evidence (Dahl-RNA1 + US SRA, ≥100 raw reads)
Loph_1.0 carries 577 complement-pathway genes (Fibrinogen_C / Lectin_C recognition + Trypsin MASP + Sushi accessory + A2M C3-core + MACPF / Ldl_recept_a MAC). STAR counts at the 100-read cutoff per sample:
| category (Pfam) | n genes | expr Nordic | expr US | both | silent |
|---|---|---|---|---|---|
| Complement — all families combined | 577 | 157 | 239 | 116 | 297 |
DESeq2 strict-filter DE results (named orthologs — 2026-05-18)
The Pfam-scan above measures genomic domain presence in cross-pop read pools. The DESeq2 results below measure per-gene transcription with the named-ortholog truth-carrier. Same underlying biology, two different evidence layers; both are kept. Filter: gene retained only if ≥10 counts in ≥3 samples per pop, both pops.
| complement gene | gene_id | log2FC | padj | direction |
|---|---|---|---|---|
| C3 · α-2-macroglobulin thioester | OS493_014903 | +2.11 | 9 × 10⁻⁷ | NOR ↑ 4.4× |
| MASP1 / C1R / C1S / MASP2 · consolidated protease hub | OS493_000518 | +2.50 | 5 × 10⁻¹⁵ | NOR ↑ 5.7× |
| C2 · classical/lectin protease | OS493_002228 | +3.89 | 2 × 10⁻¹³ | NOR ↑ 15× |
| C4A · classical/lectin thioester | OS493_024843 | +1.18 | 9.6 × 10⁻³ | NOR ↑ 2.3× |
| CFH · regulator | OS493_002696 | −1.00 | 1.8 × 10⁻³ | US ↑ 2.0× |
Headline: the classical / lectin activation cascade (C3 + MASP-hub + C2 + C4A) is coherently NOR-elevated at the transcript level. Same direction as the broad-Pfam Ficolin / MASP / Sushi enrichment but the named-ortholog DESeq2 lands the signal on Nordic, not US — because the Pfam-scan layer was measuring genomic content in diff-pool reads (US shared+partial pools are larger), while DESeq2 measures transcription per gene (Nordic transcribes more). Two consistent evidence layers when interpreted at their proper level.
Sub-threshold notes (not in strict filter):
NCBI confirms OS493_000518 = "Mannan-binding lectin serine protease 1" —
independent annotation match on our MASP1 carrier hit. The earlier
OS493_028292 (Trypsin, NOR 4 / US 85,684) and
OS493_028953 (Trypsin, NOR 0 / US 39,670) one-pop spikes were dropped by
the strict filter as library-composition artifacts (BAM-extracted Nordic samples
systematically zero on off-us_lp_filter-subset genes). Demoted to "trending
US, sub-threshold expression in Nordic — not a confident DE call."
Caveat — BAM-extract bias. 5 of 6 Nordic samples come from BAMs aligned to
the 142-contig us_lp_filter subset (354 Mb of 557 Mb full Loph_1.0). The
strict-filter mitigation drops genes off-subset; the 11 named orthologs above are all
on-subset and represent honest cross-pop transcription. Genome-wide 38.5% sig figure
(vs 52% before filter) is partially residual library-composition and partially real
environmental divergence between deep-Tisler and Atlantic-shelf populations.
Pattern-recognition receptors — TLR / NLR / RLR (carrier 4, 2026-05-18)
The fourth truth-carrier: 28 canonical human PRRs (TLR1-10, TIR adaptors, NOD1/2, NLRP1/3/6/7, NLRC4/5, NLRX1, NAIP, RLR triad). Cnidaria show a compressed-but-architecturally-complete PRR signal — canonical folds preserved, gene families reduced. Across 31 species:
| layer | universal (≥28/34) | broad (10-22/34) | absent (0/34) |
|---|---|---|---|
| RLR · viral RNA | DDX58, IFIH1, DHX58 | — | — |
| TLR adaptor | — | MYD88 (22) | TIRAP, TICAM1/2, SARM1 |
| NLR inflammasome | — | NLRP3 (22) | NLRP1 |
| TLR · surface PRR | — | TLR2 (15), TLR4 (11), TLR5 (10) | TLR1, TLR3, TLR6, TLR7, TLR9, TLR10 |
| NLR-NOD | — | NOD2 (12), NOD1 (3) | — |
Architecturally canonical: 35/37 cnidarian TLR best-hits carry both LRR + TIR; 42/53 NLR best-hits carry NACHT + LRR. The PRR machinery is preserved back to the cnidarian–vertebrate split. Viral-RNA sensing via the RIG-I/MDA5/LGP2 RLR family is universal across cnidaria; surface bacterial sensing (TLR2/4/5 + MyD88 + NLRP3 + NOD2) preserved but reduced; endosomal nucleic-acid TLRs (TLR3/7/8/9) and most TIR adaptors (TIRAP/TRIF/TRAM/SARM1) are vertebrate innovations.
Lophelia carries a minimal but architecturally-complete PRR system — 4 orthologs, all canonical:
| function | Lophelia locus | architecture | NOR_CPM / US_CPM |
|---|---|---|---|
| TLR4 | KAJ7392015.1 | LRR_8 + TIR + TIR_2 ★ | 1.1 / 0.1 |
| MyD88 adaptor | KAJ7371306.1 | Death + TIR + TIR_2 (qcov 83.8%) | 128 / 150 |
| NLRP3 inflammasome | KAJ7373972.1 | LRR_6 + NACHT ★ | 253 / 326 |
| RIG-I / MDA5 / LGP2 (hub) | KAJ7375464.1 | DEAD helicase — one locus = 3 paralogs | 100 / 27 |
DESeq2: NLRP3 + MyD88 robust and balanced both pops (constitutive backbone). RLR-hub NOR-elevated +1.60 LFC, padj 2.5×10⁻⁶ — Nordic upregulates viral-RNA sensing. TLR4 baseMean too low for confident DE call (demoted).
MitoCarta orthologs — for context
Lophelia has 635 MitoCarta orthologs (human mito-localised proteins). At the 100-read cutoff: 440 expressed in Nordic, 589 in US, 428 in both, 34 silent. The high overlap reflects the mito-machinery's housekeeping role across both populations. The mito × immune intersection is small in D. pertusum with the 124-HMM immune atlas (24 proteins in the cross-cnidaria nexus, 2 strictly in our SRCR/complement narrow set) — but the broader mito-immune crosstalk machinery (TFAM, ECSIT, MUL1, CASP9, CLPB/CLPX/LONP1) is present and conserved (see Mito × Immune nexus backlog work — 985 rows across 34 cnidarians, manuscript material for a separate angle).
(1) Classical / lectin complement is transcriptionally NOR-elevated. Coherent same-direction signal across C3 (+2.11 LFC), MASP1/C1R/C1S consolidated hub (+2.50), C2 (+3.89), and C4A (+1.18) — the entire proteolytic activation cascade up in Nordic Tisler. Reactome pathway: R-HSA-166658. Independent NCBI annotation confirms our MASP1 ortholog locus is literally labelled "Mannan-binding lectin serine protease 1". Same direction as supporting NOR-elevation in MUC6 gel-mucin (+3.55) and the DDX58/MDA5/LGP2 RLR hub (+1.60).
(2) Cnidarian mucus-immunity does not use DMBT1 architecture. Named-ortholog DIAMOND of human DMBT1 (Q9UGM3) returns best-hits in 26/26 anthozoans but 0/26 carry the SRCR+CUB co-occurrence — all SRCR-only. Confirmed by the read-level Pfam scan (0 CUB hits across 4.3 M PacBio reads in the diff space). The CUB+ZP-anchored mucin scaffold is a vertebrate innovation. Cnidarian mucus immunity uses SRCR-only PRRs.
(3) Vertebrate gene duplications collapse onto single cnidarian loci. The consolidated-hub pattern replicated across four independent truth-carriers: one Lophelia locus covers MASP1+C1R+C1S+MASP2 (complement); one covers DMBT1+CD163L1 (SRCR); one covers DDX58+IFIH1+DHX58 (RLR); one covers FCN1+FCN2+FCN3 (ficolins); etc. This is the textbook deep-time signal — vertebrate paralogs duplicated from a single ancestor after the cnidarian split. Four-carrier replication makes it publishable.
(4) Terminal MAC pore (C6-C9) is absent in Lophelia + most anthozoans + all hydrozoans. Lophelia, Acropora digitifera, Exaiptasia, and all 4 hydrozoans carry 0 named MAC pore-forming genes; C5 is present in many anthozoans but C6-C9 are not. The pore-forming complement axis appears to be largely a vertebrate innovation; cnidarian complement-mediated killing must use the C3 opsonization / protease cascade without the canonical lytic pore.
Immune Pfam atlas · 124 HMMs
The InnateDB-derived cnidaria-relevant immune Pfam atlas — 124 top-2-per-family
accessions curated from immune_family_table_cnidaria_relevant.tsv. Used as a
query-time predicate over the full per-read DB, not as a scan-time filter
(that's the post-2026-05-13 design correction).
Top families by member count in the cnidaria-relevant table:
| family | Pfam |
|---|---|
| Toll-like receptor family | PF13855 |
| TRIM/RBCC family | PF00643 |
| Peptidase S1 family | PF00089 |
| NLRP family | PF05729 |
| Complement C6/C7/C8/C9 family | PF00057 |
| STING family | PF15009 |
| 2-5A synthase family | PF10421 |
| Ficolin lectin family | PF00147 |
| HMGB family | PF00505 |
| Cathelicidin family | PF00666 |
| Invertebrate defensin (Type 1) | PF01097 |
| Mab-21 family | PF20266 |
8,425 hits across 7,865 unique reads · 62 distinct Pfam domains · 4.3 M PacBio reads scanned (super offload, 2026-05-13 → 14):
| domain | Nordic | US | total |
|---|---|---|---|
| Fibrinogen_C ← Ficolin / lectin | 828 | 1,294 | 2,122 |
| Trypsin | 357 | 436 | 793 |
| TSP_1 | 344 | 426 | 770 |
| PK_Tyr_Ser-Thr | 322 | 342 | 664 |
| TRAF-mep_MATH | 279 | 287 | 566 |
| Ank | 193 | 165 | 358 |
| SRCR ← prime target | 164 | 192 | 356 |
| Arf | 166 | 159 | 325 |
| LRR_6 | 167 | 128 | 295 |
| ubiquitin | 141 | 149 | 290 |
| Forkhead | 105 | 115 | 220 |
| Sushi | 109 | 94 | 203 |
| zf-C3HC4 | 99 | 102 | 201 |
| An_peroxidase | 66 | 105 | 171 |
| Ldl_recept_a | 61 | 96 | 157 |
| zf-RING_UBOX | 60 | 72 | 132 |
Top 16 by total hit count. Full 62-domain table sits on disk in
out/immune/<pool>/hmmer.domtblout.
Data inventory
What's on disk. Source-of-truth paths under
~/cnidaria_genomes/lophelia_pertusa/.
Nordic — Tisler reef (2015)
| Cell | Movie ID prefix | Library | Subreads size |
|---|---|---|---|
| B02_1 | m151122_020500 | P6-C4 · 10 kb insert | CCS partial |
| C02_1 | m151122_082514 | P6-C4 · 10 kb insert | 326 MB |
| D02_1 | m151122_144213 | P6-C4 · 10 kb insert | 335 MB |
| D03_1 | m151103_071355 | P6-C4 · 10 kb insert | 636 MB |
| E02_1 | m151122_210143 | P6-C4 · 10 kb insert | 894 MB |
| E03_1 | m151103_113308 | P6-C4 · 10 kb insert | 691 MB |
| F01_1 | m151031_225026 | P6-C4 · 10 kb insert | 574 MB |
| F02_1 | m151123_032108 | P6-C4 · 10 kb insert | 1.1 GB |
| F03_1 | m151103_155234 | P6-C4 · 10 kb insert | 726 MB |
| G02_1 | m151123_094519 | P6-C4 · 10 kb insert | 829 MB |
| G03_1 | m151103_201506 | P6-C4 · 10 kb insert | 700 MB |
| H02_1 | m151123_160617 | P6-C4 · 10 kb insert | 751 MB |
Aggregate: 25.4× CLR, 14.17 Gbp. 9 cells with full subreads; 2 cells (C02_1, D02_1) have CCS instead; 1 cell (B02_1) partial. Plus Tisler Illumina paired-end + 3 kb / 8 kb / 20 kb / 40 kb mate-pair libraries.
Nordic — RNA-seq (Dahl)
| Sample | Index | Pool | Reads |
|---|---|---|---|
| Dahl-RNA1-D2 | GCCAAT | D | 27.5 M pairs |
| Dahl-RNA1-D3 | CAGATC | D | 23.5 M pairs |
| Dahl-RNA1-D5 | CTTGTA | D | 26.0 M pairs |
| Dahl-RNA1-NN1 | CGATGT | NN | 23.8 M pairs |
| Dahl-RNA1-NN3 | TGACCA | NN | 25.2 M pairs |
| Dahl-RNA1-NN5 | ACAGTG | NN | 29.7 M pairs |
HiSeq2500 paired-end, STAR-aligned to us_lp_filter (Loph_1.0 142-scaffold subset).
~155 M pairs total. Dahl-RNA1-D2 used as the RNA-bridge test sample so far.
US — published reference (2023)
- Accession
- GCA_029204205.1 · Loph_1.0
- Submitter
- Lehigh / Santiago Herrera lab
- Specimen
- USNM1676648 (Smithsonian)
- Origin
- Atlantic · 31.754 N · 79.194 W (Charleston, SC)
- Depth
- 515 m
- Assembly
- scaffold N50 1.614 Mb · contig N50 438 kb · 557 Mb total · 39.5% GC
- Annotation
- RefSeq
genomic.gff+protein.faa
Plus 33 NCBI-annotated cnidarian proteomes (RefSeq-preferred) under
~/cnidaria_genomes/_comparative/proteomes/ as the
comparative backdrop across the major cnidarian clades.
People
Contributors to the AutoLoop / Tisler D. pertusum work. The list is open — contribution shape, not author order.
Co-author on the 2009 Tisler manuscript that prefigured this project. Mitochondrial + nuclear immunity arc.
Sample lead on the 2015 Tisler PacBio sequencing at NSC Norway (run 47). Continuity with the original sequencing decisions and library design.
AutoLoop / aragonite codebase, the per-read DB build, the four-rule pipeline discipline, the RNA-bridge graph stack.
Project owner. Mito-nuclear genes + innate immunity (complement, DMBT1, mucus PRRs).
FRIBIO grant 204907 (2010-era) framing: ancestral tumor/apoptosis
comparative axis. Traylor-Knowles 2022 Front Mar Sci is the designated
biological-route precedent. Story: solid science, not high-impact splash.
Names listed alphabetically by last name within each card row. Authorship and acknowledgements are finalised at submission.
Lab notebook
Dated entries from ~/autoloop/docs/lab_notebook.md, rendered on every commit.
Newest entry first. Run scripts/site/build_notebook_panel.py to refresh.
2026-05-18 (night) — S5d headline figure (named-ortholog volcano)
Script: scripts/aragonite/s5d_headline_volcano.py. Generates two
volcano flavours from the strict-filtered DESeq2 results:
headline_volcano.svg— genome-wide context (16,611 grey bg genes + 473 carrier-annotated colored fg), x ∈ [-10, 8] to clip extreme mito artefacts; shows the immune panel sits clearly above the genome-wide noise floorheadline_volcano_immune.svg— the paper figure: only carrier- annotated genes plotted, headline names labeled, palette matched to the autoloop site
Colors: complement = sandy coral (accent); mucin/SRCR = lavender; mitocarta = teal; tlr_nlr = warm tan. Background dark to match the site dark theme.
Plotted breakdown (strict filter)
| Carrier | n in panel | NOR↑ sig | US↑ sig |
|---|---|---|---|
| complement | 8 | 5 | 0 |
| mucin_srcr | 4 | 1 | 0 |
| mitocarta | 459 | 22 | 92 |
| tlr_nlr | 2 | 1 | 0 |
The immune-panel volcano (no mito background) reads cleanly: top-right cluster = C3 + MASP-hub + C2 (the activation cascade NOR↑), mid-right = MUC6 + DDX58/MDA5/LGP2 hub (NOR↑), origin cluster = DMBT1/CD163L1 + NLRP3 + MyD88 + CFH (the constitutive innate backbone — balanced both pops, no DE).
The 92 US↑ mitocarta genes vs 22 NOR↑ is the residual library- composition signal (housekeeping mito proteins systematically over-represented in the US raw libraries). Demoted to "broad transcriptional divergence with library-composition contribution" in the captioning, not headline.
Files
~/scratch/s5_deseq2/figures/ (source)
~/autoloop/docs/figures/ (repo copy)
~/Desktop/super/bok1/deepsek_web/autoloop/figures/ (site copy)
├── headline_volcano.svg genome-wide context
├── headline_volcano.png
├── headline_volcano_immune.svg paper figure
└── headline_volcano_immune.png
Site integration
autoloop.deepsek.no Immune highlights panel gets a new
"Headline · S5 DESeq2 named-ortholog volcano" card at the top
(above the SRCR/Mucins/Complement cards), embedding the
headline_volcano_immune.svg inline. Caption explains the
top-right activation-cascade cluster, the constitutive backbone
near origin, and the residual library-composition artifact. Links
back to scripts/aragonite/s5d_headline_volcano.py + underlying
TSV for reproducibility.
Open / running (refresh)
- Label crowding near origin (DMBT1, NLRP3, CFH, C4A) — could use
adjustTextlibrary or hand-curate offsets if going to print. Acceptable for current draft. - TMM normalization fallback if reviewers push on genome-wide 38.5%
- Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog
2026-05-18 (late evening) — S5c library-bias mitigation (strict filter)
Approach: re-run DESeq2 on a strict expressed-in-both-pops filter to remove the lowly-expressed-in-one-pop genes that DESeq2's noise model flags with extreme LFCs (the "PABP4 LFC −16 padj 1e-291" artifacts). Re-extraction of raw NOR fastqs was not feasible (raw 2014-2015 data abandoned); TMM normalization deferred unless this mitigation is insufficient for reviewers.
Filter: gene retained iff ≥10 counts in ≥3 of 5 NOR samples AND ≥3
of 6 US samples. Script: scripts/aragonite/s5c_library_bias_mitigation.py.
Result — biology preserved, genome-wide partially mitigated
| Filter | Tested | Sig | % sig | NOR↑ | US↑ |
|---|---|---|---|---|---|
| Loose (sum ≥ 10) | 32,398 | 17,024 | 52.5% | 4,655 | 12,369 |
| Strict (3+/group both) | 16,969 | 6,538 | 38.5% | 2,997 | 3,541 |
- 14 percentage-point drop in genome-wide sig
- Directional skew evened out (loose 73% US-up → strict 54% US-up, near-symmetric — more biologically credible)
- Still elevated; real coral pop comparisons typically run 5-15% sig
Headline named-ortholog panel — 11/13 preserved, all 11 LFC-stable
| Gene | gene_id | loose LFC | strict LFC | LFC Δ |
|---|---|---|---|---|
| C3 | OS493_014903 | +2.14 | +2.11 | 0.03 |
| MASP1 hub | OS493_000518 | +2.53 | +2.50 | 0.03 |
| C2 | OS493_002228 | +3.92 | +3.89 | 0.03 |
| C4A | OS493_024843 | +1.22 | +1.18 | 0.04 |
| CFH | OS493_002696 | −0.97 | −1.00 | 0.03 |
| DMBT1 hub | OS493_013007 | −0.55 | −0.58 | 0.03 |
| MUC6 | OS493_020649 | +3.59 | +3.55 | 0.04 |
| MyD88 | OS493_026950 | −0.65 | −0.68 | 0.03 |
| NLRP3 | OS493_009300 | −0.88 | −0.91 | 0.03 |
| RLR hub | OS493_002237 | +1.64 | +1.60 | 0.04 |
| CD163 | OS493_018800 | −0.44 | −0.48 | 0.04 (NS both) |
| CD5L+CD6 hub | OS493_031461 | −6.30 (sig) | DROPPED | — |
| TLR4 | OS493_014951 | +4.38 (sig) | DROPPED | — |
The headline biology is rock-solid under the strict filter — C3, MASP-hub, C2, MUC6, RLR-hub all NOR-elevated with near-identical LFCs.
Two genes dropped to sub-threshold expression: - CD5L+CD6 hub — barely expressed in NOR (<10 counts), the −6.30 LFC was a noise-model extreme. Demote to "trending US-elevated, sub-threshold in Nordic" — interesting but not a confident DE call. - TLR4 — baseMean 4.2, too low for confident testing. Demote to "detectable but low in both pops."
Recommendation for the paper
- Use strict-filter results as the published DE analysis: 16,969 genes tested, 6,538 sig, immune panel preserved.
- Explicitly report loose-vs-strict comparison in methods — shows the filter is principled (not p-hacking; biology preserved).
- Headline claims: C3, MASP-hub, C2, C4A, CFH, MUC6, RLR-hub for the NOR-elevated story; DMBT1-hub balanced for the constitutive story; CD5L+CD6 and TLR4 demoted to supplement footnotes.
- Frame genome-wide 38.5% sig as "broad transcriptional divergence between geographically separated populations, possibly environmental; named-immune-ortholog signal is robust to filter assumptions."
Diagnostic notes for the residual 38.5%
Three plausible contributors: 1. Real biology — Lophelia between deep-cold Tisler and Atlantic-shelf US could have genuinely large transcriptional shifts 2. Residual library-composition bias not fully removed by strict filter 3. Sample size + within-group variance (n=5/n=6, non-replicate-matched libraries between groups)
If reviewers push back on the 38.5%, fallback option is TMM normalization (limma-voom style — more robust to composition shifts). PyDESeq2 doesn't natively support TMM but TMM factors can be computed externally and passed in as size factors. Not implemented yet — defer unless needed.
Files
~/scratch/s5_deseq2/
├── deseq2_strict_filter_full.tsv 16,969 strict-filtered DESeq2 results
├── deseq2_strict_filter_immune.tsv 487 carrier-annotated
├── library_bias_comparison.tsv per-headline-gene before/after
└── mitigation_verdict.tsv loose-vs-strict summary stats
Open / running (refresh)
- TMM normalization fallback if reviewers push on genome-wide 38.5%
- Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog
2026-05-18 (evening) — S5b D-vs-NN within-Nordic sanity check
Script: scripts/aragonite/s5b_deseq2_d_vs_nn.py. Two analyses:
(1) PCA on log-CPM all 11 samples + within-group distance summary;
(2) DESeq2 ~subpop within Nordic (D=3 vs NN=2) compared to the main
NOR-vs-US DE volume.
Verdict: POOL JUSTIFIED
| Contrast | Tested | Sig (padj<0.05, |LFC|≥1) | % sig |
|---|---|---|---|
| NOR vs US | 32,398 | 17,024 | 52.5% |
| D vs NN within Nordic | 22,609 | 400 | 1.8% |
Ratio (subpop_DE / pop_DE) = 0.023. Within-Nordic subpop signal is 2.3% of the cross-pop signal — comfortably below the pooling threshold.
PCA structure
PC1 (90.7% variance) is the NOR↔US axis — overwhelms everything else. PC2 (3.9%) is the D↔NN axis but small. Within-D variance (50.95 in PC1-2 space) is larger than the D-to-NN mean separation (57.49): D-vs-NN inside Nordic looks like ordinary sample-to-sample variability, not two distinct sub-populations. Within-NN samples are very tight (2.2); D5 is the main D-outlier driving within-D variance.
within-D 50.95
within-NN 2.20
within-US 6.43
D ↔ NN 57.49
D ↔ US 248.99
NN ↔ US 241.45
Carrier-gene check — clean
Only 3 of 501 carrier-annotated genes are D-vs-NN sig: - ITGB2 (complement receptor β subunit, CR3-β) — LFC −1.46, padj 9.7e-4 — D-elevated. The only named-ortholog with within-Nordic structure. Worth a paper footnote (possible Tisler-vs-other-Nordic tissue-state difference); does not compromise the headline. - 2 unannotated mitocarta hits (likely depth-artifact at this signal level).
Critically: NONE of the S5 headline genes (C3, MASP-hub, C2, C4A, MUC6, RLR-hub, CD5L+CD6 hub) are in the D-vs-NN sig list. The NOR-vs-US biological story is driven by genuine cross-pop signal, not by D-vs-NN heterogeneity inside the Nordic group.
Implication
The S5 NOR-vs-US contrast can be reported as published. Pool the 5 Nordic samples as one group. Footnote ITGB2 D-elevation if relevant. The library-composition caveat (NOR BAM-extracted vs US raw) remains the more important caveat — that one needs a separate mitigation (reviewer might ask).
Files
~/scratch/s5_deseq2/
├── pca_all_samples.tsv 11 samples × PC1-PC4 + subpop
├── deseq2_d_vs_nn_full.tsv 22,609 within-Nordic results
├── deseq2_d_vs_nn_immune.tsv 501 carrier-annotated within-Nordic
└── subpop_pooling_verdict.tsv NOR-vs-US vs D-vs-NN summary stats
Open / running (refresh)
- DESeq2 headline figure draft — named-ortholog panel volcano
- Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog
2026-05-18 (evening) — S5 DESeq2 — NOR vs US headline contrast
Tooling decision: no R/DESeq2 env existed on t5; installed pyDESeq2
0.5.4 into the lophelia env (canonical Python port, results
near-identical to R DESeq2; avoids R-bioconductor env bloat per the
one-env-per-project rule). Pandas downgraded 3.0.2 → 2.3.3 by the
install (pydeseq2 incompatible with pandas 3.x).
Script: scripts/aragonite/s5_deseq2_pop_contrast.py. Design
~pop (us as reference), so positive log2FC = NOR-upregulated.
Filter: gene total count ≥ 10 → 32,398 / 40,679 genes tested.
Carrier annotation merged in (660 unique Lophelia gene_ids with
truth-carrier ortholog labels).
Headline named-ortholog panel (padj < 0.05, |LFC| ≥ 1)
| Gene | gene_id | LFC | padj | direction |
|---|---|---|---|---|
| MASP1 hub (C1R+C1S+MASP1+MASP2) | OS493_000518 | +2.53 | 1.5e-15 | NOR↑ |
| C2 | OS493_002228 | +3.92 | 2e-13 | NOR↑ |
| C3 | OS493_014903 | +2.14 | 4e-7 | NOR↑ |
| MUC6 (gel-mucin) | OS493_020649 | +3.59 | 6e-6 | NOR↑ |
| DDX58/MDA5/LGP2 RLR hub | OS493_002237 | +1.64 | 1.2e-6 | NOR↑ |
| C4A | OS493_024843 | +1.22 | 6.5e-3 | NOR↑ |
| TLR4 | OS493_014951 | +4.38 | 3e-4 | NOR↑ (low baseMean) |
| CD5L+CD6 hub | OS493_031461 | −6.30 | 2.6e-11 | US↑ |
| MyD88 | OS493_026950 | −0.65 | 0.034 | barely US↑ |
| NLRP3 | OS493_009300 | −0.88 | 0.003 | barely US↑ |
| DMBT1/CD163L1 hub | OS493_013007 | −0.55 | 0.012 | barely US↑ — constitutive |
Biological pattern
Nordic Lophelia upregulates the activation-cascade complement + mucus + viral RNA sensing: C3 + MASP-hub + C2 + C4A coherently NOR-elevated (the classical/lectin proteolytic activation cascade); MUC6 gel-mucin NOR-up; DDX58/MDA5/LGP2 RLR sensing NOR-up.
US Lophelia upregulates the SRCR-secreted scavenger layer: CD5L+CD6 hub 79× higher.
Innate-immune backbone (MyD88, NLRP3, DMBT1/CD163L1 hub) is balanced — robust constitutive expression both pops, no DE.
The coherence at pathway level (multiple genes in the same biological role coordinating in the same direction) is what makes the named-ortholog panel credible despite the per-gene small n.
Per-carrier DE counts
| Carrier | tested | sig | NOR↑ | US↑ |
|---|---|---|---|---|
| Complement | 14 | 10 | 5 | 5 |
| Mucin/SRCR | 6 | 3 | 1 | 2 |
| MitoCarta | 635 | 274 | 26 | 248 |
| TLR/NLR | 4 | 2 | 2 | 0 |
Caveats — critical
- 17,024 / 32,398 genes flagged sig genome-wide (52%) — top hits are mitochondrial housekeeping (e.g. PABP4 at LFC −16, padj 1e-291). This is the library-composition bias from BAM-extracted Nordic samples vs raw US fastqs. DESeq2 size factors normalize total depth but cannot undo the upstream read-filter selection.
- The named-ortholog panel is the credible biology — coherent pathway-level patterns are unlikely to be artifact. The genome-wide volcano top is contaminated with normalization artifacts.
- D vs NN substructure still un-tested. Captain previously flagged
this as a blocker decision. Need a
~ subpopwithin-Nordic check before claiming pop = NOR is homogeneous.
Files
~/scratch/s5_deseq2/
├── deseq2_results_full.tsv 32,398 genes × DESeq2 stats
├── deseq2_results_immune.tsv 658 carrier-annotated genes
├── deseq2_results_<carrier>.tsv per-panel results
├── top_de_genes.tsv top 50 by -log10padj × |LFC|
└── state/runs.jsonl ledger
Open / running
- Library-bias mitigation — discuss with captain: re-extract NOR from raw if surviving, or subsample US to match NOR floor, or alternative normalizer (voom-TMM)
- DESeq2 headline figure draft — named-ortholog panel volcano (not the genome-wide volcano)
- Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog
2026-05-18 (late) — truth-carrier × RNA-seq expression cross-reference
Pivot: with four truth-carriers landed (carrier-defined Lophelia
orthologs in KAJ7* protein_id namespace) and the 11-sample
featureCounts matrix at trx50:out/counts/ keyed on OS493_* gene_id,
the natural next step is to join them and ask "which named immune
orthologs are actually expressed, and in which pop." Mapping built
from GTF CDS records (gene_id + protein_id in the same attribute
block): 192,784 CDS records → 37,484 gene→protein mappings. 100% of
the 766 Lophelia carrier orthologs map to a gene_id.
Script: scripts/aragonite/truth_carrier_expression.py. Outputs at
~/scratch/truth_carrier_expression/. Per-sample featureCounts
parquets concatenated → 40,679 genes × 11 samples matrix → CPM. Background:
20,100 / 40,679 genes expressed (≥1 CPM) in any NOR sample, 26,992 in
any US sample — US-side has more genes detected because Nordic data
are BAM-extracted (already filtered to Lophelia-mapping reads, depth
budget concentrated on fewer loci).
Per-carrier expression coverage
| Carrier | Orthologs | NOR-expr | US-expr | Both |
|---|---|---|---|---|
| Complement | 25 | 12 | 20 | 12 |
| Mucin/SRCR | 8 | 5 | 8 | 5 |
| MitoCarta | 727 | 539 | 724 | 536 |
| TLR/NLR | 6 | 6 | 5 | 5 |
Named-ortholog highlights (NOR_max_CPM / US_max_CPM)
| Gene | gene_id | NOR | US | direction |
|---|---|---|---|---|
| C3 | OS493_014903 | 2,381 | 359 | NOR 6.6× ↑ |
| MASP1 / C1R / C1S / MASP2 hub | OS493_000518 | 239 | 29 | NOR 8.2× ↑ |
| C2 | OS493_002228 | 699 | 56 | NOR 12.5× ↑ |
| C4A / C4B | OS493_024843 | 592 | 173 | NOR 3.4× ↑ |
| CFH | OS493_002696 | 161 | 161 | balanced |
| DMBT1 / CD163L1 hub | OS493_013007 | 310 | 291 | balanced — constitutive |
| CD163 | OS493_018800 | 24 | 44 | US 1.8× ↑ |
| MUC6 | OS493_020649 | 54 | 5 | NOR 12× ↑ |
| CD5L / CD6 hub | OS493_031461 | 0.4 | 27 | US 67× ↑ |
| TLR4 | OS493_014951 | 1.1 | 0.1 | barely expressed |
| MyD88 | OS493_026950 | 128 | 150 | robust, balanced |
| NLRP3 | OS493_009300 | 253 | 326 | robust, balanced |
| DDX58 / MDA5 / LGP2 RLR hub | OS493_002237 | 100 | 27 | NOR 3.7× ↑ |
Three validation findings
- C3 locus settled at
OS493_014903 → KAJ7334579.1. Prior Pfam- based candidate from earlier session wasOS493_014902— off by one (adjacent locus same neighbourhood). The carrier-truth call supersedes the Pfam-based one. - NCBI annotation cross-validates the MASP1 carrier hit: NCBI product for OS493_000518 is literally "Mannan-binding lectin serine protease 1" — independent confirmation that the consolidated-hub pattern (one Lophelia locus = MASP1+C1R+C1S+MASP2) is the genuine ancestral C1/MASP-like protease.
- Naming conflict flagged for paper: NCBI's product field calls
OS493_018800"Deleted in malignant brain tumors 1" (= DMBT1) — but our truth-carrier method puts that locus at CD163, and the DMBT1 best-hit atOS493_013007(which NCBI calls "Neurotrypsin"). These are two SRCR-rich Lophelia proteins with overlapping vertebrate similarity; NCBI's pipeline used a different criterion. The carrier method gives the closest sequence match; NCBI's call is plausibly a synonym chain through annotation transitive closure. Will footnote in the manuscript.
Pop-bias pattern (observational, not yet DESeq2-tested)
- NOR-elevated: classical/lectin complement core (C2, C3, C4, MASP1-hub), MUC6, viral-RNA-sensing RLR hub
- US-elevated: SRCR scavenger/regulator layer (CD5L+CD6, CD163, CD46+C4BPA at OS493_027659)
- Balanced: DMBT1-locus (~300 CPM both), CFH, MyD88, NLRP3 — constitutive innate-immune backbone
Two-population coral systems with co-expression of complement + mucus-SRCR + PRR + RLR axes, with named-ortholog evidence — this is the dataset the paper's headline figure can come from.
Caveats logged
- n=5 NOR / n=6 US, small; observations not significance-tested
- NOR samples BAM-extracted (filtered) — library-depth normalization needed before any DE call. The next step (DESeq2 S5) handles this via per-sample size factors.
- "Hypothetical protein" is the dominant NCBI annotation; the carrier cross-reference is the first identification of many Lophelia loci in functional terms.
Files
~/scratch/truth_carrier_expression/
├── gene_protein_map.tsv 37,484 OS493↔KAJ7 pairings
├── lophelia_expression_full.parquet 40,679 genes × 11 samples + CPM
├── complement_expression.tsv 25 orthologs
├── mucin_srcr_expression.tsv 8 orthologs
├── mitocarta_expression.tsv 727 orthologs
├── tlr_nlr_expression.tsv 6 orthologs
└── lophelia_immune_expression.tsv 766 rows stacked (carrier × expression)
Open / running (refresh)
- DESeq2 (S5) — now unblocked: named-ortholog list ready, gene symbols mapped, both pops sampled; ready to fire DE on the immune panels and the headline contrast
- Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog
2026-05-18 — truth-carrier stack (complement, mucin/SRCR, mitocarta)
Captain question after the cross-cnidaria complement domain-Pfam matrix (70,444 hits): "what are we using as complement 'truth' stack — human, pfam?" The honest answer was: Pfam-domain-architecture-based, which over-counts massively (Trypsin Pfam hits ALL serine proteases; TSP_1 hits all thrombospondins; etc.). Switched to named human canonical orthology as the truth standard, mirroring the existing MitoCarta workflow.
Truth-carrier architecture
Layout at ~/cnidaria_genomes/_comparative/truth_carriers/<pathway>/,
one subdir per pathway, each with the same artifact set:
<pathway>_genes.tsv gene_symbol × uniprot × category × role
<pathway>.faa carrier query FASTA (one canonical seq per gene)
<pathway>_diamond_hits.tsv.gz raw DIAMOND blastp hits, species column
<pathway>_loci_per_species.tsv per-species best-hit per human query
<pathway>_pathway_cn_per_species.tsv matrix tab: species × category → CN
diamond_db/ hits/ cached DBs + per-species raw hits
Build scripts in scripts/aragonite/:
- build_<pathway>_truth.py — fetch human canonical refs from UniProt
- diamond_<pathway>_vs_cnidaria.sh — blastp at 25 threads, e≤1e-5
- build_<pathway>_matrix.py — filter + best-hit-per-query + matrix
- Idempotent: re-fires only missing pieces; DBs cached cross-pathway.
Filters: pident ≥ 25, qcov ≥ 25-30, evalue ≤ 1e-5. CN counts DISTINCT cnidarian protein_ids matched per category (not raw hits) — so multiple human paralogs hitting the same cnidarian gene = 1 ortholog, the natural unit for deep-time orthology.
State ledger at ~/scratch/truth_carriers/state/runs.jsonl.
Carrier 1 — Complement (Reactome R-HSA-166658, 49 human refs)
Coverage: classical (C1QA/B/C, C1R, C1S), lectin (MBL2, FCN1/2/3, COLEC10/11, MASP1/2), C2/C4, alternative (CFB, CFD, CFP), C3-hub (C3, CFI, CFH, CFHR1-5), terminal-MAC (C5, C6, C7, C8A/B/G, C9), regulators (SERPING1, CD46/55/59, C4BPA/B, VTN, CLU), receptors (CR1, CR2, C3AR1, C5AR1, C5AR2, ITGAM, ITGB2).
Result: 4,156 raw DIAMOND hits → 2,767 after filters → 762 best-hit-per-query rows across 32 species (Henneguya/Myxobolus/ Thelohanellus mostly empty as expected for parasitic myxozoans).
Cnidaria-wide universal complement core (29-31/34 species): C1R, C1S, MASP1, MASP2, CFD, CFP, FCN1/2/3, ITGB2, plus C3 (26/34), C2 + C4A + C4B (27/34).
Headline finding — terminal MAC: Lophelia + Acropora digitifera + Exaiptasia + all hydrozoans carry 0 MAC pore genes (C6-C9). Stony corals and most anemones have C5 but no C6-C9. The pore-forming membrane-attack complex appears to be largely a vertebrate invention; cnidarians implement complement-mediated killing without the canonical lytic pore module.
Lophelia consolidated-hub pattern (single cnidarian locus covers multiple vertebrate paralogs):
| Lophelia locus | Vertebrate paralogs |
|---|---|
| KAJ7394693.1 | C1R + C1S + MASP1 + MASP2 (one ancestral C1/MASP-like protease) |
| KAJ7391652.1 | FCN1 + FCN2 + FCN3 (one ancestral ficolin) |
| KAJ7371504.1 | C4A + C4B (one C4 thioester) |
| KAJ7383494.1 | CR1 + CD55 (one SCR/Sushi receptor) |
| KAJ7383505.1 | CR2 + C4BPB |
This is the textbook deep-time complement architecture — vertebrate gene duplications collapse onto single cnidarian ancestors. Strong publishable story.
Lophelia C3 ortholog (named-human-truth): KAJ7334579.1 (pident 28.4,
qcov 46.5, bit 276). Cross-check with the prior Pfam-based candidate
OS493_014902 (A2M_BRD + ANATO) needed — these may be the same gene
under different namespaces (KAJ7 = protein_id, OS493 = gene_id).
Carrier 2 — Mucin / SRCR (24 human refs)
Coverage: gel-mucins (MUC2/5AC/5B/6), membrane-tethered mucins (MUC1/13/15/17/20/21), soluble mucin (MUC7), DMBT1 (the keystone), SRCR scavenger (MSR1, MARCO, SCARA3/5), SRCR lymphoid (CD5, CD6, CD163, CD163L1), SRCR secreted (CD5L/AIM, SSC4D, SSC5D, LGALS3BP). Big tandem-repeat mucins (MUC3A, MUC4, MUC12, MUC16, MUC19) skipped — their PTS repeats don't carry deep-time orthology signal.
Result: 3,623 raw hits → 1,768 filtered → 242 best-hit rows across 31 species.
Headline finding — DMBT1 architecture: cross-checked the cnidarian
DMBT1 best-hits against the cnidarian-immune Pfam scan
(cnidaria_immune_hits.parquet). All 26 anthozoan DMBT1 hits carry
SRCR alone (some with Ig_2, Sushi, or Ldl_recept_a accessory) —
0/26 carry the SRCR + CUB co-occurrence that defines vertebrate
DMBT1. Confirms the earlier F4 cross-cnidaria Pfam finding (2,410
SRCR / 0 CUB co-occurrence across 31 species) now framed against the
named human DMBT1 reference. Hydrozoans (4 spp) + Rhopilema = no
DMBT1 best-hit at all; the protein is an anthozoan-specific signal.
Vertebrate-only families absent from cnidaria: - Membrane-tethered mucins (MUC1/13/15/17/20/21): 0/34 species - Gel-forming MUC2/5AC/5B (airway/intestinal big-gel): 0/34 - Mammalian SRCR scavengers MARCO/SCARA3/SCARA5: ≤2/34
Retained ancestrally (29-31/34 species): - CD163, CD163L1, CD6 (SRCR lymphoid core) - CD5L (AIM), SSC4D, SSC5D (SRCR secreted) - MUC6 (gastric gel-mucin — the ONE gel-mucin cnidaria retain)
Lophelia: 8 mucin/SRCR orthologs. DMBT1 best-hit KAJ7373414.1
(SRCR-only, no CUB; pident 32.5, qcov 28.8) — also the CD163L1
best-hit (same consolidation pattern as complement). MUC6 ortholog
KAJ7372214.1.
Publishable statement: "Cnidaria carry abundant SRCR-domain proteins mapping to vertebrate CD163/CD163L1/CD6/CD5L/SSC4D/SSC5D but lack the SRCR+CUB co-occurrence that defines vertebrate DMBT1-mediated mucus immunity. The gel-forming mucin family (MUC2/5AC/5B) and the entire cytoplasmic-tail membrane-mucin family are absent from cnidaria — they represent vertebrate innovations. MUC6 alone is retained ancestrally."
Carrier 3 — MitoCarta3 (1,132 human refs)
Re-homed the existing workflow from ~/lophelia/data/mitocarta/ into
the standardized truth_carriers/mitocarta/ layout. DIAMOND step
already done (26,053 loci rows × 34 species). FAA carrier added now
(1,130 fetched + 2 remapped: GATD3A P30042→P0DPI2, MYG1
Q86UA3→Q9HB07).
Matrix tab pattern:
| Group | total range | notes |
|---|---|---|
| Anthozoa stony corals | 765–918 | top: Orbicella franksi 918 |
| Lophelia | 727 | lower-middle anthozoan; OXPHOS 70, central dogma 119 |
| Medusozoa | 695–790 | Hydra ~700 |
| Octocorallia | 721–807 | intermediate |
| Myxozoa (parasites) | 142–184 | known anaerobic genome reduction |
Universal cnidarian mito-core: 418 MitoCarta genes present in ≥30/34 species — OXPHOS subunits (ATP5F1A/B, SDHA, UQCRC1), TCA cycle (DLD, DLST, OAT, GLUD1/2), heme (FECH), import (HSPD1, PMPCB), ATAD3A/B nucleoid, mt-aaRS family. The eukaryotic-ancestral mitochondrial backbone.
Methodology — why "truth-carrier" framing matters
The Pfam domain-architecture approach (124-HMM cnidaria-immune atlas) gives the broad domain landscape — which Pfam domains are present in which species. It does NOT give named orthologs and over-counts when domains are promiscuous (Trypsin, TSP_1, EGF, Ldl_recept_a all in thousands of non-target proteins).
The truth-carrier approach (human canonical refs → DIAMOND → best-hit-per-query) gives named 1:1 orthology. For complement this is a 17× reduction (4,156 vs 70,444 hits) and produces a clean, defensible matrix tab keyed off the published gene names that reviewers can cross-reference against Reactome/KEGG.
Both layers are kept — the Pfam scan stays as the "what domains are where" map; the truth-carrier matrix is the "where are the named orthologs" map. Each carries different evidence and they answer different questions.
Carrier 4 — TLR / NLR / RLR (28 human refs)
Coverage: TLR family (TLR1-10), TIR adaptors (MYD88, TIRAP, TICAM1/2, SARM1), NLR-NOD (NOD1, NOD2), NLR-inflammasome (NLRP1/3/6/7), NLR-other (NLRC4/5, NLRX1, NAIP), RLR (DDX58/RIG-I, IFIH1/MDA5, DHX58/LGP2). 28 refs, 25.6 kb FAA.
Result: 2,594 raw hits → 149 best-hit rows across 31 species (the three myxozoans + Hydractinia/Hydra mostly empty).
Architecture cross-check (vs cnidaria_immune_hits.parquet): - Canonical TLR architecture (LRR + TIR): 35/37 cnidarian TLR best-hits carry both — the deep ancestral TLR fold is intact - Canonical NLR architecture (NACHT + LRR): 42/53 cnidarian NLR best-hits carry both — canonical NLR fold preserved
Compressed-but-architecturally-complete PRR signal: cnidarians retain the canonical TLR/NLR/RLR folds but use a small subset of the vertebrate-paralog gene family:
| Layer | Universal (≥28/34) | Broad (10-22/34) | Absent (0/34) |
|---|---|---|---|
| RLR (viral RNA) | DDX58, IFIH1, DHX58 | — | — |
| TLR adaptor | — | MYD88 (22) | TIRAP, TICAM1/2, SARM1 |
| NLR_inflammasome | — | NLRP3 (22) | NLRP1 |
| TLR | — | TLR2 (15), TLR4 (11), TLR5 (10) | TLR1, TLR3, TLR6, TLR7, TLR9, TLR10 |
| NLR_NOD | — | NOD2 (12), NOD1 (3) | — |
| NLR_other | — | NLRC5 (7) | NLRC4, NLRX1, NAIP |
Publishable framing: "Cnidarian innate immunity retains the canonical TLR/NLR/RLR PRR architectures (LRR+TIR, NACHT+LRR, DEAD helicase folds) but uses a structurally minimal subset of the vertebrate gene families — viral-RNA sensing via RIG-I/MDA5/LGP2 is universal, surface-bacterial sensing via TLR2/4/5 + MyD88 + NLRP3 + NOD2 is preserved, but the endosomal nucleic-acid TLRs (TLR3/7/8/9) and most TIR-domain accessory adaptors (TIRAP/TRIF/TRAM/SARM1) are absent."
Lophelia closeup — 4 PRR orthologs, all architecturally canonical:
| Function | Lophelia locus | Architecture |
|---|---|---|
| TLR4 | KAJ7392015.1 | ★ LRR_8 + TIR + TIR_2 |
| MyD88 adaptor | KAJ7371306.1 | Death + TIR + TIR_2 (qcov 83.8%) |
| NLRP3 inflammasome | KAJ7373972.1 | ★ LRR_6 + NACHT |
| RIG-I/MDA5/LGP2 (consolidated) | KAJ7375464.1 | DEAD helicase — one locus covers all 3 vertebrate RLR paralogs |
Same consolidated-hub signature seen in complement (one KAJ7 covers C1R+C1S+MASP1+MASP2) and SRCR (one KAJ7 covers DMBT1+CD163L1): vertebrate gene duplications collapse onto single ancestral cnidarian loci. This is now a four-carrier-replicated pattern — a publishable deep-time signal across complement, mucus-immunity, mitochondrial metabolism, and PRR sensing.
Open / running
- DESeq2 (S5), site nginx vhost (sudo), helm.deepsek.no — unchanged from prior open list
2026-05-17 → 18 — heavy data-mining day
F4 closed cnidaria-wide (DMBT1 architecture absent across Cnidaria)
Three converging lines of evidence said DMBT1 architecture is absent in D. pertusum: (1) zero SRCR+CUB+ZP co-occurrence across the 8 cross-pop diff pools (4.3 M reads); (2) zero in us_mortar pool (axis-F tandem repeats); (3) zero in Loph_1.0 reference proteome (53 SRCR / 0 CUB).
Extended to 34 annotated cnidarian proteomes (Anthozoa →
Hydrozoa → Scyphozoa → Myxozoa) via scan_cnidaria_immune_proteomes.sh
+ the 124-HMM immune atlas. Result: 2,410 SRCR-bearing proteins
across 31 species (the 3 missing = myxozoan parasites with reduced
genomes), ZERO with CUB / ZP / Mucin / VWD / Trefoil co-occurrence in
205,278 cnidarian proteins. Verdict cnidaria-wide: SRCR scavenger
receptor family is ancient and expanded; the canonical mammalian DMBT1
scaffold is a vertebrate / mammalian innovation absent in Cnidaria.
Cross-cnidaria immune tblastn
124 nucleotide BLAST DBs (build_per_genome_blastdbs.sh) → tblastn
277 Lophelia immune candidate proteins (53 SRCR + 5 A2M + selected
MACPF / Fibrinogen_C / Ldl_recept_a) against all 124 cnidarian
genomes. 1.84 M hits, 120 ok / 4 skip / 0 fail. Top hit-count
species: Dendronephthya (octocoral, 47K), Anthopleura (anemone, 46K),
Aulactinia (32K), Madracis (30K).
Mito × Pfam annotation (closing the mito layer)
Full Pfam-A scan on 22,911 cnidarian mito-orthologs from the prebuilt
mito_loci_per_species.tsv. 49,249 Pfam hits across 1,295 distinct
domains; 22,382 (97.7%) of mito-ortholog proteins now have ≥1
Pfam-A domain call. Lophelia specifically: 706 of 727 mito-orthologs
annotated (97.1% coverage). Output:
~/scratch/mito_pfam_scan/mito_loci_with_pfam.parquet.
RNA-seq full pipeline (per captain's plain-mapping mandate)
Rebuilt STAR index FASTA-only (no GTF priors per
feedback_star_no_gtf.md). Stripped S3 to minimal STAR flags:
--outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
--limitBAMsortRAM 30G. All 12 STAR jobs ran; 11 landed cleanly
(NN1 broken-fastq, see below). featureCounts → 40,679-gene × 11-sample
count matrix.
Per-region quantification + unannotated landscape
HOMER makeTagDirectory per sample + analyzeRepeats.pl on custom
intergenic + intronic GTFs (built via bedtools complement / subtract
on the gene GTF). Headline:
- Exonic: 40,679 intervals, 4,365 expressed in both pops, 29,917 silent
- Intronic: 153,855 intervals, 664 expressed in both pops, 151,724 silent (~1%)
- Intergenic: 38,927 intervals, 6,941 expressed in both pops, 11,599 US-only-expressed
Per-sample annotation-region breakdown: ~61% of US reads are intergenic vs 47% Nordic (Nordic biased low by BAM-extract). Major unannotated transcriptional landscape — Loph_1.0's 40,679 Trinity- based gene models from Herrera 2023 capture only ~40% of expressed sequence in this dataset.
Genomic context for top intergenic candidates
intergenic_genomic_context.py — top 50 intergenic + 20 intronic by
US max count, with flanking gene products + Pfam categories. Highlights:
- INTG_0012893 — 9.2 kb spacer between two Trypsin / MASP genes (complement-pathway hotspot), US-only expressed
- INTR_0087738 — 4.9 kb intronic transcript inside Neurotrypsin (SRCR) host gene
- INTR_0097491 — inside a ShK toxin domain host
- INTR_0147093 — inside a TNFR_c6 host gene
- INTG_0003055 — 67.9 kb intergenic spacer with 42K NOR reads (could be unannotated lncRNA / dropped gene model)
Output: ~/scratch/intergenic_context/intergenic_context.{parquet,tsv}.
Expression of immune + mito candidate genes (≥100 reads cutoff)
expressed_immune_mito_genes.py — gene_id↔protein_id map via GTF
CDS records → categorise by Pfam family + MitoCarta membership →
≥100-read cutoff per sample per population:
| category | n | NOR | US | both | silent |
|---|---|---|---|---|---|
| SRCR | 53 | 11 | 24 | 9 | 27 |
| Complement (any) | 577 | 157 | 239 | 116 | 297 |
| Mucin-anchor | 0 | — | — | — | — |
| MitoCarta | 635 | 440 | 589 | 428 | 34 |
| Mito × Immune | 2 | 2 | 2 | 2 | 0 |
Top balanced SRCR: OS493_025544 (NOR 6,341 / US 6,288). Top US-only
complement: OS493_028292 Trypsin (NOR 4 / US 85,684 — likely off-subset
BAM-extract artefact). Output:
~/scratch/expressed_immune_mito/expressed_candidate_genes.{parquet,tsv}.
Job D pivot — abandoned full Pfam-A in favour of curated atlas
Initial plan was full Pfam-A (25,545 HMMs) on all 8 diff pools. Two
issues: (i) HMMER 3.3.2 on super silently reads only first HMM from
multi-HMM file when input is gzipped — switched to HMMER 3.4 from
super's protein_db env + decompress-to-tmp pattern; (ii) even fixed,
projected wall ~50-200 hr for the heavy pools. Killed and rebuilt with
a curated 219-HMM atlas (union of 124-immune + mito-immune nexus + top
hits from Loph_1.0 proteome + top hits from mito-orthologs Pfam scan).
Re-fired. Stalled on smallest pool nor_shared (16 GB protein file,
~10 hr/pool projected) — captain killed; 4/8 small pools (privates +
strict_private) landed cleanly. Parquet fix-pass ran manually (python
trx50_rna env had pandas, base did not).
Captain steered through
- "STAR plain mapping no fancy" → GTF-free index, minimal flags
- "BBTools, hardcore programmer Brian Bushnell" → repair.sh post-BAM-extract
- "if one sample is wrong flag and move on" → all S3 scripts patched
to
set -uo pipefail+ per-sampleif ! cmd; then log_ledger failed; continueinstead ofset -ekilling the batch - "fire job B / lncrna / homer / unannotated" — sequential decisions reflected in the day's pipeline order
Open / running
- NN1 rescue (firing now, 2026-05-18): fastp strict structural validation → repair.sh re-pair → STAR smoke test → full re-align. Goal: recover one clean Nordic sample (Tisler raw cutadapt2 L006), rather than wait for raw fastqs on other servers.
- 142-subset filter (queued): restrict HOMER intergenic / intronic / exonic to the 142 contigs the Nordic BAM-extracts had honest mapping to → honest cross-pop expression comparison without the BAM-extract caveat.
- DESeq2 (S5) still blocked on captain decisions: D-vs-NN biology, headline contrast, NN1 fate (resolved once rescue lands).
Site / docs
autoloop.deepsek.no: Roadmap & maps tab added; Mucins + Complement cards refreshed with RNA-seq expression evidence. helm.deepsek.no refactored as multi-project fleet hub. Neither vhost is nginx-installed yet — pending captain's sudo step.
docs/operating_model.md (the playbook) + docs/state_architecture.md
(parquet conventions) + docs/fleet_tools.md (per-box tool matrix)
landed earlier this week as the durable contracts.
2026-05-16 — fleet operating model + DMBT1 question closed
Operating model codified
Three new docs land together as the chief-mate steering pack:
docs/operating_model.md(673 lines, 15 sections) — the playbook for running multi-host comparative-genomics projects. Captain (Bård) sets course, chief mate (Claude session) orchestrates. NFS + markdown + ledger is the spine. SSH + tmux is the only execution shortcut.docs/fleet_tools.md— comprehensive inventory of what's on each box: hardware, system-wide PATH, conda envs (per-env package versions),- shopping list of sudo installs.
docs/state_architecture.md(already in place; cross-referenced here) — where outputs live, hive-partitioned parquet schema, ledger discipline.
Plus the web-facing rendering: helm.deepsek.no vhost prepared (static
HTML, seafoam accent, helm-wheel SVG). Captain + chief-mate metaphor
explicit. Setup script + nginx conf in deepsek_web repo; user runs
helm-site.sh install + certbot --expand when ready.
Fire wrapper + SSH server live
scripts/fire.sh (commit 862322e) + docs/ssh_setup.md. Dedicated
SSH key per offload box, custom ports (super 45497, trx50 4540).
Verified end-to-end with a 3-second hello job on each box. Then used
in anger for the day's real work — env_extras install, S1_nordic
extract, S3_us align, job_b_us_mortar.
Hominidae host-mask check (super)
Ran run_host_mask.sh on super against the 4 private/strict_private
kraken-classified pools. Cross-pop signal essentially unchanged:
- Nordic total hits: 3,880 → 3,877 clean (–0.08%)
- US total hits: 4,545 → 4,476 clean (–1.5%)
- Nordic SRCR: 164 → 164 (no change); US SRCR: 192 → 188 (–2%)
- Nordic Fibrinogen_C: 828 → 825; US Fib_C: 1,294 → 1,273 (–1.6%)
Conclusion: cross-pop signal is mostly real biology, not host contamination. US-vs-Nordic differentials drop from +17% to +15% (SRCR) and +56% to +54% (Ficolin) — minor adjustments, headline story intact. Autoloop site cards stand as-is.
DMBT1 architecture question — closed
Three independent lines of evidence converged today:
| evidence source | SRCR | CUB | ZP | Mucin | VWD | Trefoil |
|---|---|---|---|---|---|---|
| 8 cross-pop diff pools (4.3 M reads) | 356 | 0 | 0 | 0 | 0 | 0 |
| us_mortar pool (Job B, 769 MB) | 31 | 0 | 0 | 0 | 0 | 0 |
| Loph_1.0 reference proteome (37,484) | 128 | 0 | 0 | 0 | 0 | 0 |
The 128 SRCRs in the Loph_1.0 proteome are spread across 53 distinct proteins. Their non-SRCR co-occurring domains: Ig_2 (×4), TSP1_CFP_C, TSP_1, Trypsin, Pkinase, Ldl_recept_a (×1 each). Zero CUB / ZP / Mucin / VWD / Trefoil — anywhere.
Verdict: interpretation (b) wins — the classical DMBT1 architecture (SRCR + CUB + ZP) is genuinely absent from D. pertusum. Cnidarian mucus immunity in this species uses SRCR-only PRRs, not the mammalian-style CUB/ZP-anchored DMBT1 scaffold. F4 has its answer.
Job A on super (per-pop conserved-bulk, 8-16 hr) is no longer urgent for the DMBT1 question. Still has value as a broad cross-pop immune atlas, but it's no longer blocking F4. Defer or fire when convenient.
trx50 pipeline status
- env_extras (DESeq2 1.50.2 + rtracklayer 1.70.1 + samtools 1.23.1 + pyarrow + ...) — done after three solver-error attempts (had to bootstrap conda-libmamba-solver into base first).
- S1_nordic_bam_extract — done in 5.5 min wall (5 samples, ~66s avg).
- S3_star_align_us — running (6 US samples sequential, ~3-4 hr at
THREADS=25;
--outSAMunmapped Withinper the BAM lesson learned earlier this week). - S3_star_align_nordic — queued behind S3_us. Asymmetric data: NN1 from raw cutadapt2 L006 fastqs, D2/D3/D5/NN3/NN5 from BAM-extracted (mapped-only against the 142-contig subset; 33.7% gene-blackhole caveat documented in INTRO.md).
- S4 featureCounts + S5 DESeq2 + S6 immune overlay — scripts not yet written; will follow when S3 lands.
Per-host budget locked
THREADS: t5 = 25, super = 40, trx50 = 25. Updated in trx50 INTRO and
memory feedback_compute_safety.md so future sessions don't re-derive.
2026-05-13 (afternoon) — Phase C: US 10% real-scale test + RNA-bridge end-to-end
Goal: take the 8-step skeleton out for a real-scale test, produce
the headline deliverables (per-read parquet DB + bridge graph), surface
any scaling issues we missed in the small tests. Knob: --fraction 0.10
on the US BAM (~207k reads).
What ran clean:
| stage | walltime | result |
|---|---|---|
| S0_subsample | 243 s | 206,793 reads at seed=19 |
| M0_partition | 593 s | 39,438 architects + 157,729 mortar (19.0% architect — consistent with 30%) |
| M1_characterize v2 kmc | 200 s | 78,876 flank rows in architects_flanks_us.tsv.gz |
| B0_facts | 0 s | 1.1 MB parquet (reads_pop=us_run=us_pct10_2026-05-13.parquet) |
| B1f_a fq2fa | 7 s | 39,438 architect sequences |
| B1f_b translate | 49 s | 236,628 proteins (6 frames × 39,438) |
| B1f_c press_check | 0 s | Pfam-A.hmm pressed DB present |
| B1f_d split_aa | 1 s | 25 chunks × 9,466 proteins |
| B1f_e hmmscan | killed after 3 hr | 0/25 chunks landed (see below) |
| B1f_mortar | skipped | pre-populated placeholder + ledger row |
| B2_kraken_raw | 17 s | kraken_pop=us_*_db=PlusPF.parquet (8.4 MB) |
| B2_kraken_c005 | 17 s | *-c0.05.parquet (8.3 MB) |
| RNA-bridge R0 | 2166 s | 27.5 M pairs extracted, 23.3 M / 56.1 M records mapped (41.5%) |
| RNA-bridge R1 | 99 s | 8,635 per-contig bridge parquets |
| RNA-bridge R2 | 2 s | bridge_graph.json — 8,635 nodes, 15,344 edges |
| RNA-bridge R3 | 3 s | pseudo_scaffolds.fa — 230 MB |
The B1f_e_hmmscan kill — long-protein pathology.
Smoke test was on smallest contig (MU825479.1, 40 architects → 240 short proteins, 91 s wall on 1 chunk). At 10% scale the per-protein cost exploded by ~4× because the protein length distribution is brutal:
chunk_001 protein length distribution
<5k aa : 0
5k-10k : 7,676
10k-20k : 1,745
>20k : 45 (max 29,705 aa)
N = 9,466 proteins, mean 8,334 aa
6-frame translation of long PacBio reads with only -m 50 (min) filter
produces enormous stretches. HMMER's Plan7 is O(L×N), so per-query time
on a 29 kaa sequence × 20 k HMMs is unbounded. After 3 hours, every
chunk's .tmp had only 18–38 hit lines and mtime hadn't moved in 28
minutes — every chunk was stuck inside a single long query.
Followup: re-architect run_hmmer_b1_full_pfam_v1.sh with a
--max-aa cap (e.g. 5000) at the translate stage, or per-protein
windowing for queries above the cap. Logged in
feedback_compute_safety.md along with the broader lesson: small
tests must sample input DIVERSITY, not just SIZE. The smoke had the
right input size (240 proteins) but the wrong input shape (all from
the short tail of the read-length distribution).
The concurrency-violation incident.
US 10% and RNA-bridge launched concurrently. Load hit 58 on a 25-core
box. User called it out: "one task uses 25 threads for headroom".
Updated feedback_pipeline_design.md rule 3: synchronize before
launching the next task — between pipelines, not just within them.
Pattern is script_A.sh && script_B.sh. The wall-time "savings" of
parallel runs are illusory — kernel context-switch overhead +
contention slow both runs — and the resulting per-stage timings are
useless for benchmarking. Sequentially-collected timings are the only
honest data for the operational map.
The pre-populated-ledger pattern.
Cancelled B1f_mortar mid-run by writing an empty-schema placeholder
parquet at the expected path + recording a B1_hmmer_full_pfam ledger
row for pop=us_mortar run=us_pct10_2026-05-13_mortar with
skipped_by_user=true. When bash reached the mortar block, its
existing idempotency check (ledger_check && [ -s parquet ]) fired the
skip branch — driver didn't need to be modified mid-run. Same pattern
let us run B2-only afterwards (--skip B1f flag + the existing ledger
made S0/M0/M1/B0 skip cleanly, only B2 actually ran for 34 s total).
Final deliverables present at expected paths:
us_pct10/reads_dir/reads_pop=us_run=us_pct10_2026-05-13.parquet 1.1 MB B0 per-read DB
us_pct10/annotations/kraken_pop=us_*_db=PlusPF.parquet 8.4 MB B2 raw
us_pct10/annotations/kraken_pop=us_*_db=PlusPF-c0.05.parquet 8.3 MB B2 c0.05
us_pct10/annotations/hmmer_pop=us_mortar_*_db=PfamAfull.parquet 12 KB placeholder (skipped)
rna_bridge_us_pct10/graph/bridge_graph.json 3.9 MB PROPOSED CONTIG GRAPH
rna_bridge_us_pct10/graph/bridge_graph_{edges,nodes}.parquet ~192 KB
rna_bridge_us_pct10/scaffolds/pseudo_scaffolds.fa 230 MB pseudo-scaffolds
us_pct10/manifest/manifest.md, manifest.tsv 17,295 files / 14 categories
Missing from the "every read documented" picture: the full-Pfam-A HMM annotation parquets (architects + mortar). Deferred until the B1f redesign with length cap lands. The kraken + axis-F + RNA-bridge layers are complete.
New scripts in this session:
scripts/aragonite/rna_bridge_chain.sh— R0→R1→R2→R3 chain-driver, records ledger entries per substep, sync barriers between R0 and R1 (BAM presence), R1 and R2 (file count), R2 and R3 (graph JSON parse).scripts/aragonite/inventory_run.py— walks a run dir tree, categorizes files (parquet/json/fasta/log/etc), joins against the ledger by run_id, emits a manifest.md + manifest.tsv. Reads the bridge_graph.json directly to surface the top-N edges as a publication-style table inside the manifest.
Pipeline operational viz (pipeline_overview_v1.html at
http://localhost:8765/) refreshed; it now shows the killed-B1f and
the completed B2 + the pre-populated mortar row, and includes the
ledger snapshot grouped by run_id with the newest 8 runs.
2026-05-13 (evening) — Phase B: 8-step pipeline skeleton, RNA-bridge stack built
Build pass: after the broad-first decision, expanded the work surface to all 8 anticipated pipeline steps:
1. run_hmmer_b1_full_pfam_v1.sh [Phase A — workhorse]
2. run_annotation_loop_full_pfam_v1.sh [Phase A — per-contig]
3. run_us30_b1_full_pfam_architects.sh [Phase A — wrapper]
4. run_us30_b1_full_pfam_mortar.sh [Phase A — wrapper]
5. run_us_pipeline.sh [Phase B — generic US pipeline,
--fraction param]
6. run_nordic_tisler_pipeline.sh [Phase B — Nordic CLR, raw FASTQ]
7. rna_bridge_R0..R3 [Phase B — 4 scripts, the
contig puzzler]
8. scripts/immune/q_immune_atlas_hits [Phase A — Phase-2 query stub]
Each script has a --test / --smoke / --self-test mode that
runs in seconds-to-minutes against a tiny canned input. Tweaking
later becomes cheap — the data path is wired end-to-end with
synthetic stand-ins.
Phase B smoke results (commit fa6c80f):
| step | smoke wall | result |
|---|---|---|
5 US --fraction 0.01 --seed 137 |
109 s | 20,882 reads → 3,972 architects + 15,881 mortar → 7,944 flank rows → 131 KB parquet, all 4 ledger entries |
6 Nordic --smoke (F01_1 head 4k) |
4 s | 975 reads (25 below 500 bp floor) → 195 architects + 780 mortar |
| 7 R0 Dahl-RNA1-D2 × 1k pairs | 1 s | 835 pairs extracted, 324/1670 primary mapped to MU826351.1 grains (19.4%) |
| 7 R1 synthetic 10-pair BAM | <1 s | 5 same-contig dropped, 3 A↔B + 2 B↔C extracted, 3 per-contig parquets |
| 7 R2 synthetic (R1 output) | <1 s | 3 nodes, 2 edges (A↔B support 3, B↔C support 2) |
| 7 R3 synthetic (R2 graph) | <1 s | 1 scaffold, 3 contigs, length 400 (100+50+100+50+100), path A → B → C |
Notable findings from the Phase B smokes:
-
CLR axis-F partition transfers cleanly. Nordic CLR M0 produced 19.5% architect fraction (195 / 975 reads); US Sequel M0 on its 30% slice produced 19.0% architect fraction (118,250 / 591,228). The axis-F length-based partition rule apparently doesn't need recalibration for CLR vs Sequel — the biology (modal read length, top-quintile threshold) is similar enough. M1's per-end k-mer/entropy thresholds remain to be cross-checked on a non-trivial Nordic N.
-
R0 silent-failure bug caught + fixed. First R0 smoke reported "done in 2s (/ records mapped)" but the pipeline actually failed at
samtools sort(duplicateg_1182SAM @SQ headers from concat'ing two per-contig grains with shared local naming). The pipeline error was silenced because the script'sset -uo pipefaildidn't triggerset -e. Added explicitif ! ( bwa-mem2 | samtools sort )guard with cleanup of partial.tmpfiles. Rule 3 (sync before next stage) reinforced. -
Target-FASTA prep is the user's responsibility, not the pipeline's. When concat'ing per-contig grains across the multi-contig loop, the same contig appears in
_norand_usdirs — prefix grain IDs by FULL dir name (not just contig), OR use ONE per-contig grains file at a time. Documented in R0's header.
Discussion (deferred to backlog): Trinity-assembled
transcripts would give ~10× stronger bridge evidence per item via
split alignments (one transcript = one categorical "this gene
crosses A→B"). Drop-in as R0b; R1/R2/R3 unchanged. Cost
~12–24 hr Trinity assembly + ~100 GB RAM. Order: pair-evidence
first at real scale, Trinity as publication-quality overlay
later. Logged in project_backlog_rna_bridge_scaffolding.md.
Sessions's net state:
git log --oneline (today)
fa6c80f aragonite Phase B: steps 5/6/7 real scripts + small tests pass
53f4f38 aragonite Phase A: full-Pfam infrastructure + 8-step skeleton
c45156a aragonite: state_ledger.py + extract_xpop_diff_pools.sh
f073358 aragonite B1 v2: parallel hmmsearch + substep ledger, v1 retired
85d945f aragonite M1 v2: kmc-backed streaming, v1 retired
8 steps. Each has a real script. Each has a small test that passes. Each is idempotent via the ledger. The narrow immune-Pfam path is kept as fast-debug; the broad full-Pfam path is canonical. Mortar gets the same library as architects. The RNA-bridge stack is built and synthetic-tested; ready for real-scale runs the moment the substrate (Nordic ARAGONITE contigs) is available.
What's NOT yet done (each is independent, each idempotent): - Full per-contig loop with full-Pfam (10 contig×pop, ~45 min) - US 30% architects+mortar full-Pfam (~15 hr combined) - US fraction=1.0 — the canonical "every US PB read documented" - Nordic 30% S0→B2 (the cross-pop comparison axis) - RNA-bridge R0→R3 end-to-end on real RNA + real contig substrate - Trinity overlay (deferred) - Immune queries (Phase 2 — atlas as predicate)
2026-05-13 (afternoon → evening) — Phase A: broad-first annotation, full-Pfam infrastructure
The shift, surfaced by the user: the immune-Pfam atlas
(lophelia/data/diagnostic_pfams.txt, 124 cnidaria-relevant Pfams
curated from InnateDB) was being used as a scan-time filter in B1
production. That's the wrong layer — the atlas belongs as a
query-time predicate over a fully-documented per-read DB. The
user's restatement:
"my goal was that we had all us pacs firstly documented in the db / parquet, so in the end we could do proper immune-related stitching via our rna seq reads, we should have built the strategy around this principle, so every read is documented, if its that or this etc."
Extended feedback_dignity_of_every_read.md to cover the
annotation layer — every read deserves full Pfam-A annotation
(20k HMMs, not 124) in the canonical parquet. The immune scan stays
as a fast-debug iteration loop (~10 min), not as the production
default. Two anti-patterns named: pre-filter at scan time, and
scanning architects only (mortar deserves the same library).
Built (commit 53f4f38):
- run_hmmer_b1_full_pfam_v1.sh — 7-substep workhorse for any
fastq input. Uses chunked hmmscan against the pressed Pfam-A
binary DB (~2.4 GB mmap-shared across worker processes).
Adaptive chunking: 1 chunk for tiny inputs (DB-load dominates),
up to THREADS chunks for large inputs. Substep markers are
per-chunk .domtblout files. Substrate tag in ledger
(architects / mortar / etc).
- run_annotation_loop_full_pfam_v1.sh — per-contig loop driver.
- run_us30_b1_full_pfam_{architects,mortar}.sh — thin wrappers
with --test mode (100-read smoke).
- scripts/immune/q_immune_atlas_hits.py — Phase-2 query stub.
Demonstrates atlas-as-predicate pattern; --test passes
synthetic 3-row filter ✓.
Smoke tests:
- Smallest contig (MU825479.1 nor, 40 architects → 240 proteins):
91 s wall, 119 MB RSS, 1 chunk. 1 hit: DUF5641 — found by
full-Pfam, missed by immune-only. Principle validated on first
invocation.
- Smallest contig pair via the per-contig loop driver: 380 s wall
(nor 90 s + us 290 s sequential), both PfamAfull.parquet files
in loop_annotations/.
2026-05-13 — B1 OOM-class miss #2 — `hmmscan` wrong direction, parallel `hmmsearch` v2
Incident. After M1 v2 landed, B1 started under the production driver
and stayed in hmmscan for 22 min. top showed 180% CPU on the
python process despite --cpu 25. ETA was ~73 min for what should
have been a 10-minute job. Caught the inefficiency mid-run, killed the
driver tree, redesigned.
Why it was slow. HMMER's docs are explicit: hmmscan
(sequences-vs-HMMs) parallelizes within a single query sequence. With
709k short proteins × 124 HMMs, there's almost no per-query work to
split; effective parallelism saturates at ~2 cores. hmmsearch
(HMMs-vs-sequences) inverts the dimension and is recommended for this
workload shape.
Redesign. run_hmmer_b1_v2.sh:
B1_a_fq2fa—zcat ... | seqkit fq2faB1_b_translate— 6 frames,-m 50min aa filterB1_c_hmm_index—hmmfetch --index(one-time)B1_d_hmmsearch—xargs -P 25across 124 HMMs, single-threaded each,.tmp→mvatomic rename. Per-HMM domtblouts ARE the substep markers.B1_e_concat—cat hmmsearch/*.domtblout > hmmer.domtbloutonly after a hard barrier verifies all 124 files exist.B1_f_ingest— parse → parquet.ingest_hmmer.pynow auto-detects hmmsearch vs hmmscan by reading the# Pipeline mode:footer (column-1 flips between the two tools).
Each of the 6 substeps records its own line in state/runs.jsonl. A
re-run after any crash resumes from the first incomplete substep; in
particular, an interrupted hmmsearch fan-out re-runs only the HMMs
whose .domtblout is missing.
Graded test (under /usr/bin/time -v):
| stage | reads | elapsed | peak RSS |
|---|---|---|---|
| b1_1k | 1,000 | 4 s | 114 MB |
| b1_10k | 10,000 | 30 s | 115 MB |
| b1_full | 118,250 | 611 s | 116 MB |
Peak RSS is basically flat because the orchestrator is a thin bash + python wrapper; the work runs inside hmmsearch subprocesses. Per-substep breakdown on the full 118k:
B1_a_fq2fa 19 s
B1_b_translate 137 s 709,500 proteins (6 frames × 118k arch)
B1_c_hmm_index 0 s (salvaged from prior test)
B1_d_hmmsearch 453 s 124 HMMs, xargs -P 25, ~90s per single hmmsearch
B1_e_concat 0 s
B1_f_ingest 0 s
total 611 s (vs v1 projected ~73 min, ~7× faster e2e,
~10× on the search step alone)
Biological readout — 30% US subset, 118,250 architects, 124-HMM immune Pfam library:
- 578 domain hits across 484 unique reads
- 37 distinct immune domains detected
- Top by hit count:
| domain | hits |
|---|---|
| Fibrinogen_C | 178 |
| Trypsin | 62 |
| TSP_1 | 53 |
| PK_Tyr_Ser-Thr | 43 |
| ubiquitin | 31 |
| TRAF-mep_MATH | 30 |
| Arf | 23 |
| Ank | 21 |
| SRCR | 17 |
| Forkhead | 15 |
SRCR — the user's prime story-target domain (mucus DMBT1/SRCR family immunity) — sits in the top 10 even on the 30% subset. Worth a closer look at those 17 hits when we move to the full-population run.
v1 ↔ v2 verification. On the 1k correctness test, the two
parquets are byte-identical (same read_id × frame × domain_name ×
domain_acc tuples). The raw domtblout files DIFFER textually because
hmmscan and hmmsearch flip target↔query columns — ingest_hmmer.py
auto-detects and normalizes, so downstream consumers see identical
data.
Landed.
- run_hmmer_b1_v2.sh is the new production B1 implementation.
- run_hmmer_b1.sh → run_hmmer_b1_v1_underthreaded.sh, retirement
header in place, kept as archived reference.
- ingest_hmmer.py gained --input-format {hmmscan,hmmsearch,auto}
with auto reading the footer; default is auto. Back-compat for
any caller that previously didn't pass the flag.
- run_us_pct30.sh driver M1 line edited to call v2 with
LEDGER=$LEDGER THREADS=$THREADS.
- B1 outputs (architects.fa, architects_aa.fa, hmmer.domtblout,
hmmer_pop=us_run=...parquet) copied into the canonical us_pct30
workdir from the v2_test run.
- Six B1 substep entries plus the rolled-up B1_hmmer recorded in
the canonical ledger with completed_via=v2_test_2026-05-13.
Rule 4 working live. After landing, restarting the driver hit
B0 skip / B1 skip immediately and moved straight to B2 kraken.
The ~10 minutes of completed B1 work was NOT re-run — first concrete
test of the substep-ledger discipline in production.
This is the case study for feedback_pipeline_design.md. Both the M1
OOM (rule 2) and the B1 1.8-core throughput (rule 1) were the same
kind of miss — picked the wrong primitive, not just a wrong flag.
The four-rule discipline catches both classes prospectively.
2026-05-13 — M1 OOM postmortem and kmc-backed v2
Incident. At 21:39:59 on 2026-05-12 the M1 stage of the 30% US Tisler run was OOM-killed. Kernel log was unambiguous:
Out of memory: Killed process 1078805 (python3)
total-vm:223529488kB anon-rss:189774672kB
~181 GB resident. The system has 251 GB, Chrome held the rest, kernel went global OOM and took out a Chrome render process alongside python.
Why it died. aragonite_characterize.py (v1) built the architect-pool
k-mer histogram in a Python Counter[int, int] keyed on canonical 21-mers,
single-process, in RAM. At ~120k architects (~3.5 Gbp) the dict held
2.5B unique k-mers × ~75 B/entry → ~190 GB. The synth run (1,212 reads)
passed at <1 GB; nothing scale-tested.
Redesign. aragonite_characterize_v2.py keeps the same output schema
but routes the global histogram through kmc on disk:
kmc -k21 -m32 -t25builds the full-pool DB on disk.- Stream architects, emit 5'/3' window FASTA, kmc that → windows DB.
kmc_tools simple full ∩ windows -ocleft→ counts only for k-mers that any window actually queries.kmc_tools transform dump -s→ load into sorted numpy arrays (uint64keys +uint32counts).- Second streaming pass: per-window
np.searchsortedfor mean/min/p50.
In-RAM lookup is bounded by window k-mer diversity, not pool diversity.
Graded scale test (under /usr/bin/time -v):
| stage | reads | input | elapsed | peak RSS |
|---|---|---|---|---|
| us_1k | 1,000 | 8 MB | 7 s | 663 MB |
| us_10k | 10,000 | 72 MB | 49 s | 2.59 GB |
| us_full | 118,250 | 839 MB | 612 s | 22 GB |
10× input → 4× RSS (sub-linear) → 12× input → 8× RSS. Bound is the
1.3 GB lookup arrays + kmc subprocess at its -m32 cap. v1 vs v2
diff on synth: 2,425 rows match byte-for-byte except the three
kmer-freq columns, which agree to two decimals.
Read-class distribution (US, 30% subset, 118,250 architects, v2):
| class | count | fraction |
|---|---|---|
| EXPANDER | 96,562 | 81.7% |
| ANCHOR | 16,344 | 13.8% |
| BURIED | 5,278 | 4.5% |
| SUSPICIOUS | 66 | 0.1% |
Landed.
- v1 renamed to aragonite_characterize_v1_oom.py with a retirement
header. Kept as archived reference, not deleted.
- v2 wired into run_us_pct30.sh M1 stage with --threads 25
--kmc-mem-gb 32.
- Test output copied into canonical us_pct30/architects_flanks_us.tsv.gz
so B0/B1/B2 can resume without re-running M1.
- Ledger entry recorded: M1_characterize completed, tool_version=v2_kmc,
walltime_s=612, peak_rss_mb=21986.
- Hard rule in feedback_compute_safety.md: no unbounded in-process
counter/dict at biology scale; small-test-first for correctness, then
graded scale-test-second (1% → 5% → 10% under /usr/bin/time -v)
before scaling.
Audit-trail wart left in place. state/runs.jsonl has two
M0_partition entries for the same run_id (23,356 then 118,250
architects, ~25 min apart). The first M0 ran to completion on a
truncated stream after samtools fastq choked; the second M0 ran on
the full subsample. Both legitimately completed, ledger appended both,
check finds the first and returns 0 so idempotency still works.
Not patched — the duplicate is itself the most honest record of what
happened. If this pattern recurs, add a supersedes field rather than
collapsing.
2026-05-12 — multi-contig test loop (5 new contigs, 105 kb → 10.7 Mb)
run_aragonite_loop.sh ran 5 new contigs × 2 pops with default
parameters, both pops parallel per contig.
| contig | size | Nor cov | US cov | Nor longest | US longest | walltime |
|---|---|---|---|---|---|---|
| MU825479 | 105 kb | 36.0% | 79.5% | 7 kb (7%) | 74 kb (70%) | 3s |
| MU825436 | 609 kb | 88.7% | 99.9% | 108 kb (18%) | 590 kb (97%) | 19s |
| MU825420 | 980 kb | 83.7% | 95.8% | 97 kb (10%) | 492 kb (50%) | 24s |
| MU826351 | 5.2 Mb | 87.8% | 98.3% | 266 kb (5%) | 918 kb (18%) | 158s |
| MU825396 | 10.7 Mb | 92.3% | 99.6% | 281 kb (3%) | 2.83 Mb (27%) | 416s |
The 10.7 Mb contig assembles at 99.6% US Sequel coverage, with a 2.83 Mb single contiguous block (27% in one piece). Walltime 416 sec (~7 min) parallel on 25 cores.
Pattern across scales (US Sequel): 80% → 99.9% → 96% → 98% → 99.6%. US ceilings around 99-100% except for very small contigs where read substrate is insufficient (105 kb / 40 architects barely supports OLC).
Pattern across scales (Nordic CLR): 36% → 89% → 84% → 88% → 92%. Increases with contig size as more substrate becomes available. Caps around 92%, fragments more aggressively (longest-block fraction drops from 7% at 105 kb to 3% at 10.7 Mb).
Walltime scaling: roughly linear with contig size (3s on 105 kb, 416s on 10.7 Mb = ~140× for 100× scale).
Across all 8 contigs (5 new + 3 prior at MU826414, MU827843, MU826831): - Mean Nordic coverage: ~82% - Mean US coverage: ~96% - US Sequel handles every contig at 80%+, most at 95%+ - Nordic CLR more variable (36-93%), depends on substrate density
Viz: loop_multi_contig_v1.html with size-vs-coverage scatter
plot (nor↔us connected per contig) and longest-block-fraction chart.
The "autolearn" angle the data surfaces: per-read characterization
matters most at the limits. The 105 kb contig fails because at 24
architects there's insufficient substrate for the greedy walker.
Larger contigs hit a different limit — fragmentation due to internal
repeat structure aragonite doesn't yet see (only end-window features).
This motivates the per-read internal characterization backlog
(project_backlog_per_read_internal_characterization.md): three paths
— sliding-window scan along reads, de novo repeat-family clustering,
kraken2 per-read tagging. Each architect read deserves to be
characterized as a full BAC equivalent, not just its ends.
The "dignity of every read" principle
(feedback_dignity_of_every_read.md) saved alongside.
2026-05-11 — third region: MU826831.1 (DMBT1-like SRCR locus, 3.2 Mb contig)
The DMBT1 test — this project's central biology question.
The only SRCR-domain hit (Pfam PF00530) in loph10_immune_loci.bed
sits at MU826831.1:2,210,737-2,255,342 within a 3.2 Mb contig —
roughly 10× larger than prior test regions. PF00530 = scavenger
receptor cysteine-rich domain, the defining domain of DMBT1 and the
cnidarian SRCR-family PRRs central to mucus immunity.
| metric | Nordic CLR | US Sequel |
|---|---|---|
| reads on contig | ~16,000 | ~13,000 |
| architects | 3,210 | 2,546 |
| grains formed | 2,313 | 1,259 |
| main / sister | 2,000 / 313 | 648 / 611 |
| truth coverage | 92.7% | 99.5% |
| longest contig | 277 kb | 1.70 Mb (53% in one piece) |
| merged blocks | 251 | 14 |
| SRCR locus (44.6 kb) coverage | 100% | 100% |
The biology: the DMBT1-like SRCR locus is fully recovered in both Nordic and US PacBio data. The 44.6 kb cluster is intact, contig context preserved. This is the central immune-genomics target of the project: it survives both chemistries' assembly walls.
The chemistry comparison: US assembles 53% of a 3.2 Mb contig in a single contiguous block. That's near-chromosome-scale assembly with a 2,700-line modular tool. Nordic fragments at 251 blocks (mean ~12 kb), consistent with its 5.5 kb mean read length hitting LINE/LTR-scale repeats.
Walltime: about 5-10 min for each pop in parallel on 25 cores (3.2 Mb scale). Pipeline scales linearly with input.
Cross-region table:
| contig | size | content | Nordic cov | US cov | Nordic longest | US longest |
|---|---|---|---|---|---|---|
| MU826414.1 | 280 kb | random | 92.1% | 93.4% | 104 kb | 249 kb |
| MU827843.1 | 273 kb | complement | 76.6% | 99.8% | 39 kb | 248 kb |
| MU826831.1 | 3.2 Mb | DMBT1/SRCR | 92.7% | 99.5% | 277 kb | 1.70 Mb |
US Sequel handles all three regions at 93-99.8% coverage with very long contiguous blocks. Nordic CLR is more variable (77-92%) and always more fragmented. The pattern across loci is consistent: chemistry > biology for assembly contiguity at this scale.
Viz pages: lophelia_mu826831_dmbt1_nor_v1.html and lophelia_mu826831_dmbt1_us_v1.html.
2026-05-11 — second region: MU827843.1 (complement C6/C7/C8/C9 contig)
Picked from data/loph10_immune_loci.bed for biological relevance. 273 kb
contig carrying two complement C6/C7/C8/C9 paralogs (the gene family
this project is built around).
| metric | Nordic CLR | US Sequel |
|---|---|---|
| reads on region | 175 | 139 |
| architects | 175 | 139 |
| grains formed | 136 | 70 |
| main / sister | 123 / 13 | 43 / 27 |
| truth coverage | 76.6% | 99.8% |
| longest contig | 38.9 kb | 247.8 kb |
| merged blocks | 27 | 2 |
US's aragonite assembly of this complement-bearing contig is near-perfect — 99.8% covered, in just 2 pieces, longest of 247.8 kb (91% of the contig in one block). Nordic gets 76.6% in 27 fragments.
This means the complement gene cluster is, in principle, recoverable from US Sequel reads alone using a 2,700-line modular OLC assembler. The biology the project is built around survives the chemistry-and-length wall.
Comparison to first region (MU826414.1, 280 kb, random):
| MU826414.1 | MU827843.1 (complement) | |
|---|---|---|
| Nordic coverage | 92.1% | 76.6% |
| US coverage | 93.4% | 99.8% |
MU827843.1 is HARDER for Nordic (76 vs 92) but EASIER for US (99.8 vs 93). Different repeat landscape between contigs probably explains it. The complement contig may have fewer 5–9 kb repeats (where Nordic CLR breaks) but tractable 10–15 kb repeats (which US Sequel reads span).
Viz pages: lophelia_mu827843_nor_v1.html and lophelia_mu827843_us_v1.html.
2026-05-11 — M5 cross-pop: US on the same MU826414.1 region
Re-ran the aragonite pipeline on the US Sequel BAM, same Loph_1.0 contig MU826414.1. Per-pop independence (corrected vision): two separate runs, truth contig used only as a final comparison lens.
Nordic vs US on identical truth target:
| metric | Nordic (CLR) | US (Sequel) |
|---|---|---|
| reads on region | 1,044 | 152 |
| architects | 209 | 152 |
| EXPANDER fraction | 94.7% | 87.5% |
| grains formed | 143 | 77 |
| main / sister | 118 / 25 | 43 / 34 |
| allele-pair links | 15 | 19 |
| truth coverage | 92.1% | 93.4% |
| merged blocks | 15 | 5 |
| longest contiguous block | 104.5 kb | 249.3 kb |
The finding: US has 6.9× fewer reads on this region but produces a 2.4× longer contiguous assembly block (89% of the 280 kb truth in a single piece). Reason: US Sequel mean read length 9.3 kb vs Nordic CLR 5.5 kb. Longer reads span more repeats → fewer assembly breaks. The read-length advantage outweighs the read-count advantage at the scale of structural assembly contiguity.
This is the first direct cross-pop assembly contiguity comparison we have. It says: for Lophelia assembly, raw read length matters more than total read count. Confirms (and quantifies) what the field has been saying about HiFi/Sequel vs RS II CLR.
Viz page added: lophelia_mu826414_us_v1.html.
2026-05-11 — ARAGONITE M0–M5 buildout
Started a tweakable, layered de novo assembler named for the CaCO₃ polymorph Lophelia secretes. Old-school philosophy: every phase a small composable script, every decision a CLI knob, every artifact a human-readable file between phases.
Scripts in autoloop/scripts/aragonite/:
| script | phase | role |
|---|---|---|
| aragonite_partition.py | M0 | split reads by size → architects + mortar |
| aragonite_characterize.py | M1 | axis F features + k-mer rarity → flank-DB; class read |
| aragonite_overlap.sh | phase 2 | minimap2 ava-pb wrapper → PAF |
| aragonite_grow.py | M2 / M3 / M3.5 | endpoint-graph OLC walker; greedy or allele |
| aragonite_mortar.py | M4 | mortar→grains mapping; per-position depth track |
| run_aragonite_pipeline.sh | driver | end-to-end on a BAM region |
| generate_architecture_viz.py | viz | design-space map page |
| generate_aragonite_run_viz.py | viz | per-run page |
Synth 1 Mb diploid validation — built scripts/synth/* for ground-truth
generation. ~1 Mb diploid (chr1_hapA + hapB at 2% het + 50 kb microbe).
pbsim3 RSII simulated 6,059 reads at 35× depth. Comparison vs miniasm /
wtdbg2 / flye documented in viz site (synth_1Mb_v1 page).
M5 real-data test on Loph_1.0 contig MU826414.1 — pipeline runs end-to-end on 1,044 Nordic reads aligning to a 280 kb region. Produces 143 grains (118 main + 25 sister, 15 allele-pair links), longest 63.7 kb, covers 92.1% of the truth contig (258 / 280 kb), longest contiguous merged block 104.5 kb across 15 blocks. ~1 sec walltime for the region.
Local viz site at http://localhost:8765/ with three entries:
- 📖 Aragonite architecture (design-space map of all phases + swap-points)
- synth_1Mb_v1 (comparative assembly view)
- lophelia_MU826414_v1 (M5 first real-data run)
Lessons logged at each milestone (the trajectory): - M0: PacBio length CDFs have no clean knee — use percentile defaults - M1: synth is too easy (93% EXPANDER); real data will discriminate more - M2: greedy produces ~8× redundant grains (parallel walks not merged) - M3: allele-fork detection works at ~40% diploid rate - M3.5: processing order matters → interleave sisters → 48.5% - M4: mortar depth is QC, not just polish — surfaces grain-confidence track - M5: real-data pipeline works; 92% truth coverage on first try
Memory references: project_axis_F_vision_corrected.md,
feedback_autolearn_goodhart.md, feedback_background_default.md,
feedback_no_root_tmp.md, feedback_compute_safety.md.
2026-05-07 to 2026-05-10 — axis F per-read DB framework (cross-pop)
Multi-day buildout of the BAC-substrate per-read database for cross-population
read characterization. Lives at ~/cnidaria_genomes/lophelia_pertusa/refree/
(t5) and synced to lophelia hub data dir.
Scripts in autoloop/scripts/ (all reference-free per the corrected vision):
- extract_read_ends.sh + _axis_F_extract_helper.py — phase 1 BAM → per-read TSV
- extract_read_ends_refree.sh + _refree_extract_helper.py — strict reference-free
variant (no is_mapped/softclip_bp columns)
- characterize_and_classify.py — 6 sequence-only features per end, classifier
- compute_features_refree.py — features only (no classifier) per-pop independent
- diag_features_refree.py — per-pop distribution diagnostic
- flatten_to_flank_db.py — read-centric TSV → flank-centric DB (one row per flank)
- autolearn_axis_F.py — autolearning parameter sweep (Goodhart-vulnerable, lessons saved)
- drill_microsat.py — dominant-motif analysis for short-period tandem hits
- peek_flank_db.sh + peek_flank_db_v2.sh + _flank_graph_helper.py —
all-vs-all flank similarity + union-find community detection
Key findings:
- US Sequel raw subreads have Q-strings stripped (all ! = Phred 0); Nordic
RS II has real per-base Q. Real chemistry / deposit-convention asymmetry.
- US has ~2× more period-2 microsatellite hits — 95% homopolymer (AA/TT, CC/GG),
consistent with Sequel chemistry artefact, not real microsatellite biology.
- Autolearn objectives are repeatedly Goodhart-vulnerable. Five iterations
exposed five different ways the optimizer cheats. Memory saved.
- v1 flank-DB peek: 75% of unmapped flanks have no similar partner — supports
d8's "contamination + cleaning artifacts dominate, real biology is rarer"
framing.
Memory references: feedback_autolearn_goodhart.md,
project_axis_F_vision_corrected.md.
2026-05-05 — Phase 1 t5 (Pfam subset + decisions HMM section)
Per d8's share_plan.md (commit fffbfe8 on lophelia hub), Phase 1
t5 work landed:
data/diagnostic_pfams.txt(in lophelia hub) — 124 unique Pfam accessions, top-2 per cnidaria-relevant family fromimmune_family_table_cnidaria_relevant.tsv(116 families). All 124 verified present in Pfam-A.hmm 38.1.Pfam-A.immune.hmm— 124-model HMM library extracted from the full Pfam-A. ~6.9 MB text + ~8 MB pressed binaries. Local on each server (regenerable). hmmscan on a 5-protein test query took ~12 ms.scripts/extract_immune_pfams.py(in autoloop) — d8-portable rebuild script (parses Pfam-A.hmm text directly; no hmmfetch SSI dependency).docs/decisions.md— t5 has filled §3 HMM methodology (hmmscan +--cut_ga+ tblout/domtblout, with the multi-domain output schema for SRCR-family paralogs). §2 blastp left for d8.
Ready for Phase 2 t5 (run hmmscan on 34 proteomes) once decisions.md is sync'd.
2026-05-04 — Flye v1 landed; cross-server collaboration with d8
Flye CLR assembly v1 finished
Walltime 5 h 19 min (2026-05-03 17:22 → 2026-05-04 22:41). Output at ~/cnidaria_genomes/lophelia_pertusa/assembly/flye_v1/assembly.fasta (744 MB). Stats:
| Metric | Value |
|---|---|
| Contigs | 31,741 |
| Total length | 744.2 Mb (33% inflated vs 557 Mb expected — haplotigs) |
| N50 | 34.4 kb |
| Max contig | 267 kb |
| Ns / gaps | 0 |
| GC% | 39.7 (matches Loph_1.0 39.5 — sanity OK) |
Inflation is the expected outbred-CLR haplotig signature at 25.4× coverage. Path forward is purge_dups → Pilon polish (Tisler HiSeq2500) → SSPACE/BESST scaffolding (3/8/20/40 kb mate-pairs) → minimap2 -ax asm5 + SyRI vs Loph_1.0 for the SV catalogue.
Cross-server collaboration with d8 set up
A partner Claude session on bok1d8 (192.168.1.10, NFS-mounted at ~/Desktop/d8/) is working in /mnt/backup/lophelia/ on the same project. Architecture:
- Shared bare git hub on t5 at
~/repos/lophelia.git. Both servers see it (d8 via NFS atDesktop/t5/bok1/repos/lophelia.git). No SSH or cloud needed. - t5 working clone at
~/lophelia/— separate from autoloop. - Etiquette: d8 won't touch t5 outside the bare hub. Big data stays local on each side.
.gitignorekeeps sequence/alignment/log files out.
What d8 has already done:
- 6 Dahl-RNA1 RNA-seq libraries recovered from STAR _Aligned.out.bam via samtools collate | samtools fastq -F 0x900 -n → /mnt/backup/lophelia/fastq_from_bam/ (~25 GB, lossless). Source raw FASTQs don't exist anywhere on disk.
- Existing Trinity outputs consolidated at /mnt/backup/lophelia/trinity/: 2022 main Trinity (402,228 transcripts → TSA GHDD01), CLC re-clustering (29.5k assemblies, 68.8k contigs), OmicsBox annotation (with cnidaria-protein-hit subsets), Trinity of unmapped reads (210k transcripts, potential Lophelia-specific paralogs).
- BLAST protein DBs for the 33 best annotated cnidarian assemblies (1,053,827 proteins) — per-species + aliased combined cnidaria_all. Build script at d8 scripts/build_blast_dbs.sh.
t5-side first commit to the shared hub: lab notebook entry summarising Flye v1 + provenance + division of labor (committed as d637e04).
Coordination — division of labor on immune-gene hunt
| Side | Tool | Why |
|---|---|---|
| t5 (us) | Pfam HMM scan (SRCR, A2M, Anaphylatoxin/C3a, Thioester, CUB, MAM, ApeC) + curated cnidarian C3 seeds → DIAMOND vs locally-rebuilt cnidaria_all |
Domain-aware: catches the SRCR-family expansion that pure homology splits |
| d8 | Seed-based blastp (human RefSeq + cnidarian C3/DMBT1 seeds) vs cnidaria_all |
Direct ortholog calls + Lophelia-paralog focus via Trinity 2022 |
Hits from both pipelines merge into docs/cnidaria_immune_catalog.md + _comparative/results/immune_summary.tsv. Reciprocal-best-hits cleanup at the end.
On BLAST DB sharing: /mnt/backup/lophelia/ is d8-local ext4 — not mountable from t5. Decided to rebuild on t5 using d8's scripts/build_blast_dbs.sh rather than rsync'ing the binary DBs across. Same _comparative/databases/ layout, same proteome inputs (symlinks happen to be valid on t5 already). Idempotent script, ~1–2 min runtime, both servers stay symmetric.
On timing: d8 starts the seed-based blastp now on the existing 33 species (no need to block on the Nordic assembly). When polish + annotation finishes for Nordic Lophelia, we slot it in as the 34th proteome and re-run; existing 33-species rows don't change.
Papers retrieved
- Herrera & Cordes 2023 (GigaByte) — the canonical Loph_1.0 data paper. PDF at
docs/papers/. - Walters et al. 2020 (Front Immunol) — Mpeg-1/Perforin-2 family in Cnidaria. Directly on the mucus-immunity arc. PDF at
docs/papers/. - Quattrini 2020 NEE + McFadden 2021 Syst Biol — nominally green-OA but no working PDF source via unpaywall/OpenAlex. Need institutional access. Walters paper covers a lot of the cnidarian-immunity context anyway.
autoloop hub set up
Bare repo at ~/repos/autoloop.git (parallel to ~/repos/lophelia.git). ~/autoloop/ working tree pushes to it as origin. d8 can git clone it to pull autoloop's canonical scripts, env yaml, and docs/{lab_notebook,decisions,samples}.md. Same etiquette as the lophelia hub.
CLR / HiFi feasibility recorded in docs/methods.md
Decision not to regenerate modern HiFi from the 2015 RS II data captured at paper-grade precision in docs/methods.md §2. Headline: 10 kb insert + RS II P6-C4 chemistry yields only 1–3 polymerase passes per ZMW, well below the 8–10 needed for modern HiFi quality. Even with full reprocessing the ceiling is ~1–3× HiFi-eq coverage of the 557 Mb genome. CLR-grade analyses (assembly, SV calling) proceed with the raw subreads; SNP-grade analyses use the Tisler HiSeq2500 short reads instead. HiFi regeneration moves to backlog as a targeted-loci tool only (validating an interesting SV at base-pair resolution if needed).
PacBio CLR mapping to Loph_1.0 — reference indexed, mapping launching
Reference decompressed + indexed once for reuse at ~/cnidaria_genomes/lophelia_pertusa/reference/work/:
Loph_1.0.fna 539 MB uncompressed reference
Loph_1.0.fna.fai 92 KB samtools index
Loph_1.0.map-pb.mmi 1.2 GB minimap2 -x map-pb pre-built index (k=19, skip=10, hpc=on)
Mapping pipeline (see docs/methods.md §3.3 for paper-grade detail):
minimap2 -ax map-pb -t 40 Loph_1.0.map-pb.mmi <36 subreads.fastq> \
| samtools sort -@ 8 -m 2G -o nordic_pb_vs_loph10.bam
samtools index ...
Output dir: ~/cnidaria_genomes/lophelia_pertusa/alignments/nordic_pb_vs_loph10_v1/.
Walltime estimate 2–4 h. SV calling with Sniffles2 follows.
Reframing: assembly polish/scaffold demoted, mapping promoted
The story arc is comparative immune-gene biology, not chromosome-quality Nordic assembly per se. Direct PacBio→Loph_1.0 alignment + SV calling answers the C3/SRCR-locus question more directly than Flye→purge_dups→Pilon→SSPACE→assembly-vs-assembly. Backlog tasks 19/20 (Pilon polish, mate-pair scaffolding) demoted from "next" to "if-needed-for-novel-Nordic-sequence". Flye v1 stays on disk as backup for any sequence not in Loph_1.0.
Nordic CLR → Loph_1.0 alignment landed
Wall time 17 min (much faster than the 2–4 h estimate — 14.17 Gbp on 40 cores with a pre-built .mmi index moves quickly). Output at cnidaria_genomes/lophelia_pertusa/alignments/nordic_pb_vs_loph10_v1/nordic_pb_vs_loph10.bam (17 GB).
Flagstat highlights:
| Metric | Value | Reading |
|---|---|---|
| Primary reads | 2,525,036 | matches subread count exactly (sanity OK) |
| Primary mapped | 2,409,639 (95.43%) | high — Nordic and US Lophelia are very close at sequence level |
| Unmapped primary | 115,397 (4.57%) | candidate Nordic-specific sequence (insertions, divergent regions, repeats) |
| Supplementary | 2,572,820 | split-mapped reads — the SV signal lives here |
| Secondary | 4,875,457 | repeat-region multi-mapping |
95.43% primary mapping is a strong signal that the Nordic and US populations are highly conserved at sequence level. The ~4.6% unmapped + 2.6 M supplementary alignments are where the structural-variation story will come from.
Sniffles2 SV calling — done in 18 s
Single-sample mode, reference Loph_1.0.fna, 16 threads, on the 17 GB BAM.
108,257 SVs called, sorted, bgzipped, tabix-indexed → nordic_pb_svs.vcf.gz (25 MB). All calls have SUPPORT ≥ 3 and QUAL ≥ 20 (sniffles defaults).
| SV class | Count | Notable |
|---|---|---|
| DEL | 58,829 | 27,290 in 100–500 bp range; 171 ≥ 50 kb |
| INS | 44,399 | 23,728 in 100–500 bp; almost none ≥ 5 kb (mean read length is the spanning ceiling) |
| DUP | 460 | 100 bp → 50 kb+ — directly relevant for SRCR-family copy-number expansion |
| INV | 844 | 212 ≥ 50 kb; smaller ones may be CLR-noise — apply PRECISE/QUAL filter when intersecting |
| BND | 3,725 | split-read breakends, untyped |
Filter sensitivity sweep (PASS only):
SUPPORT≥3 SVLEN≥50 99,975 lenient
SUPPORT≥5 SVLEN≥100 61,537 good working set
SUPPORT≥5 SVLEN≥500 18,387 high-confidence
SUPPORT≥10 SVLEN≥500 8,653 very conservative
Not filtering at the catalogue level. The right move is to keep the full VCF and filter per immune-gene locus once t5 HMM scan + d8 seed-BLAST land their coordinates. Working filter for the immune-locus intersection will likely be SUPPORT≥5 + SVLEN≥100 + PASS (61k SVs) with manual review of the largest events.
Output dir summary:
alignments/nordic_pb_vs_loph10_v1/
├── nordic_pb_vs_loph10.bam 17 GB
├── nordic_pb_vs_loph10.bam.bai 4.3 MB
├── flagstat.txt
├── nordic_pb_svs.vcf.gz 25 MB
├── nordic_pb_svs.vcf.gz.tbi tabix index
├── minimap2.stderr
├── samtools_sort.stderr
└── sniffles.stderr
2026-05-03 — project bootstrap, pivot, scaffold, Flye launched
Story arc decided
Mucus immunity in cnidarians, focused on C3 (proto-complement) and DMBT1 / SRCR-family mucosal pattern-recognition receptors. Lophelia pertusa (= Desmophyllum pertusum) as the cold-water-coral case study: assemble Nordic Lophelia from PacBio CLR, align to US GCA_029204205.1, focus on C3 + SRCR-family loci, place results in the cnidarian comparative context using ~80 reference genomes already on disk. Solid science over high-impact splash.
What landed today
US reference
Fetched GCA_029204205.1 (Loph_1.0) from NCBI FTP. 7 files, all md5-verified, written read-only with a MANIFEST.json capturing provenance.
- Specimen: USNM1676648 (Smithsonian voucher), off Charleston SC, 515 m depth, 2019-04-17
- Submitter: Lehigh University (Santiago Herrera lab)
- Assembly level: Scaffold (scaf N50 1.6 Mb, contig N50 438 kb). 557 Mb total, 39.5% GC.
- Path: ~/cnidaria_genomes/lophelia_pertusa/reference/GCA_029204205.1_Loph_1.0/
Nordic PacBio data inventoried
12 SMRT cells from Norwegian Sequencing Centre (NSC, bioinf3.hpc.uio.no), 2015 run 47, sample lead Per Mikael, sequenced on PacBio RS II with P6-C4 chemistry, 10 kb insert library. Originally landed in three legacy .tgz archives; now on disk as the proper extracted directory tree at ~/cnidaria_genomes/lophelia_pertusa/LopheliaPertusa/.
seqkit stats across all 36 .subreads.fastq files (3 per cell):
| Metric | Value |
|---|---|
| Subreads | 2,525,036 |
| Total bases | 14.17 Gbp |
| Mean read length | ~5,500 bp (lower than the 10 kb insert suggested) |
| Read N50 (per file) | 6,700–7,000 bp |
| Max read length | up to ~65 kb |
| CLR coverage of 557 Mb genome | 25.4× |
CCS data exists for only 3 of 12 cells (B02_1, C02_1, D02_1) from a 2018 reprocessing — total ~5–9× HiFi-equivalent, too sparse for whole-genome polish. The legacy .bas.h5 indexes and the Lp_*.subreads.bam modern PacBio BAM (only B02_1 has the BAM form) are extras; .subreads.fastq is enough for Flye.
Nordic Illumina data discovered
At ~/cnidaria_genomes/lophelia_pertusa/Lophelia_Illumina/FASTQ/ (47 GB total):
- Paired-end short-insert (HiSeq, 2014):
Lp_001,Lp_002— Lp_001 R1 (8.2 GB) vs R2 (3.4 GB) is mismatched, needs investigation before use; Lp_002 looks fine 8.2/8.2 - Mate-pair "jumping" libraries:
LpT_DNA2_*at 3 / 8 / 20 / 40 kb inserts, 2 lanes each — exactly the substrate for scaffolding the Flye contigs into chromosome-arm-level scaffolds - Plus a
LJD_PROCESSED_CLEANED/subdir not yet inspected (likely post-NextClip junction-cleaned mate-pair reads)
Sample naming: LpT = Lophelia, Tisler reef (Norwegian Skagerrak). Flowcell H7W1UADXX, instrument DE18INS655.
Pre-existing alignments staged into bams/
116 GB across 8 subdirs (sorted/unsorted Nordic and US RNA-seq, Tisler HiSeq → US genome via CLC Workbench, MiSeq alignments, mito BAMs for both populations). Two reference identities used: the full 2858-scaffold US assembly (Tisler HiSeq) and a filtered 142-scaffold subset (us_lp_filter, 354 Mb — RNA-seq STAR alignments). Both are subsets of GCA_029204205.1. Reusable, contingent on which scaffolds matter for the immune loci.
Cleanup
Removed redundant .tgz archives (LopheliaPertusa.tgz 48 GB, LopheliaPertusa_2.tgz 92 GB, LopheliaPertusa.bas.h5.tgz 2.9 MB) and the PerMikael/ legacy index dir. 141 GB freed, 88% → 84% disk used.
Deliberately did not touch ~/cnidaria_genomes/.tmp_zips/ — the user has an active cnidaria-genome download pipeline writing there.
Cnidarian comparative dataset
The user has 124 cnidarian species across 157 NCBI assemblies under ~/cnidaria_genomes/ (downloaded with datasets, layout: <species>/<accession>/ with protein.faa, genomic.gff, etc.).
- 33 species have at least one annotated assembly (best-per-species manifest at
_comparative/proteomes/manifest_best.tsv) - 1,053,827 proteins total across the 33 best assemblies
- 91 species are unannotated genome-only (skipped for first pass; can do
tblastnlater if a specific species matters)
Coverage looks great for the story: Acropora ×5, Nematostella, Hydra ×4, Hydractinia (allorecognition model), Stylophora pistillata, Pocillopora ×3, Orbicella ×2, Montipora ×2, Exaiptasia ×2 (Aiptasia model), Clytia, Dendronephthya, etc. Astrangia poculata lacks annotation — important for the aposymbiotic-control angle if we use that subplot; tblastn fallback.
Conda env
lophelia env at ~/miniconda3/envs/lophelia (yaml at ~/autoloop/envs/lophelia.yml). Built with mamba.
Phase 1 (assembly + alignment + variants): fastp, multiqc, seqkit, flye 2.9.6, pilon, bwa-mem2, minimap2 2.30, samtools 1.21, bcftools 1.21, sniffles 2.7.5, bedtools, gffread, diamond, ncbi-datasets-cli, pysam, biopython, pandas.
Phase 2 (comparative orthology, added today): hmmer, mafft, iqtree, trimal.
_comparative/ analysis tree scaffolded
~/cnidaria_genomes/_comparative/{proteomes,seeds,databases,searches,alignments,trees,results} — empty subdirs ready for the cross-species ortholog hunt. 33 best-proteome FASTAs symlinked into proteomes/.
Flye launched
17:22 local time. flye --pacbio-raw <36 fastqs> --genome-size 560m --threads 40 --out-dir flye_v1. First launch failed in 4 s: minimap2 wasn't on PATH because Flye was invoked by direct path without env activation. Re-launched with PATH=$ENV/bin:$PATH; now actually running, walltime estimate 4–8 h, finish window roughly 21:22–01:22.
Blocker / gotcha
Flye binary direct-path call doesn't auto-pick up sibling tools (minimap2, samtools) from the same env's bin/. Workaround: prefix PATH with $CONDA_PREFIX/bin or use mamba run -n lophelia flye .... Documented in case it bites again.