Comparative genomics of Desmophyllum pertusum

Nordic (Tisler reef, 2015 PacBio CLR + Illumina + mate-pair) versus US (Charleston, 2023 published reference Loph_1.0) same species, very different sequencing histories. The 2014–2018 Tisler assembly project was abandoned for lack of a US reference and a sharp biological question. Both are gone now: the published Loph_1.0 anchors structural comparison, and we focus on mucus immunity — the C3 complement system, DMBT1 / SRCR scavenger receptors — a story rooted in 600 My of conserved innate-immune scaffolding.

Per-read DB principle

Every PacBio read is fully documented in the parquet DB before any narrowing.

Axis-F structural class (EXPANDER / ANCHOR / BURIED / SUSPICIOUS), full Pfam-A annotation, kraken2 taxonomic call. The immune atlas is a query-time predicate, not a scan-time filter. Same DB, every read, joinable on read_id.

Honest pipeline discipline

Four rules: threads-where-they-saturate · stream-to-disk · sync-before-next · idempotent substep ledger.

Every long substep records to state/runs.jsonl. Re-runs skip completed work. Hard-won from two real incidents — M1 OOM (181 GB Counter) and B1 1.8-core hmmscan mis-direction — both now case studies in the design memory.

RNA-bridge as contig puzzler

Cross-contig paired RNA-seq reads → bridge graph (JSON + parquet) → pseudo-scaffolds.

Tested at scale on one Tisler RNA sample (27.5 M pairs) against per-contig grain target: 8,635 nodes, 15,344 edges. Trinity-transcript overlay deferred — pair evidence first.

State at 2026-05-13: 10% US PacBio slice has full per-read DB, full B2 kraken layer, full RNA-bridge graph. B1f full-Pfam annotation layer is pending a translation-length cap re-run (chunks were taking ≥3 hr per chunk against full Pfam-A on PacBio-translated proteins averaging 8.3 kaa — windowing cap at 5,000 aa landed today, re-run in flight).

Roadmap & maps

Where the project is, where it's going, and what the fleet is doing right now. Compact dashboard of figure plan, pipeline state, cross-species scope, and open captain decisions. For the operating-model playbook (captain & chief mate roles, fleet wiring, the four pipeline rules), see helm.deepsek.no.

Figure plan (F1 – F6)

figtopicdata statusblocker
F1 Sampling map + species ID (Tisler vs Charleston; D. pertusum sensu stricto) data in hand write-up only
F2 Assembly comparison Flye v1 (33% haplotig-inflated) vs Loph_1.0 ref partial — Flye v1 exists haplotig purge + RNA-bridge scaffolding (deferred)
F3 Cross-pop immune-domain atlas (8 pools, 124 immune HMMs + 219 curated full) data in hand + Job D curated landing fold full-Pfam totals into card
F4 DMBT1 / SRCR family architecture — Lophelia + cnidaria-wide closed — three converging evidence sources say SRCR is present but DMBT1 architecture (SRCR + CUB + ZP) is genuinely absent in D. pertusum cnidaria-wide cross-species check running on t5 now (34 proteomes)
F5 RNA-seq DE — mucus-immunity + mito-immune nexus expression 11/12 samples aligned (NN1 broken-fastq excluded); 40,679 gene × 11 sample count matrix built captain decisions: D-vs-NN biology, headline contrast
F6 Complement cascade architecture (lectin → MAC; classical absent) data in hand — 5 A2M / 44 MACPF / 134 Ficolin / 161 lectin_C proteins in Loph_1.0; lectin pathway dominant in both populations, US +56% Ficolin / +57% Ldl_recept_a refresh card with cnidaria-wide complement portrait

Pipeline state map

stagehoststateoutput
Tisler PacBio data (12 cells)t5✓ done25.4× CLR, 14.17 Gbp on disk
Loph_1.0 reference fetcht5✓ done2,858 contigs, 557 Mb
Flye v1 assemblyt5✓ done31,741 contigs, N50 34.4 kb (haplotig-inflated, expected)
ARAGONITE M0-M5 + axis-F partitiont5✓ doneper-region architects + mortar pools
Cross-pop diff (8 pools, 4.3 M reads)super✓ doneshared / partial / private / strict-private × nor / us
Kraken on all 8 pools (PlusPF)super✓ donecross-pop signal mostly real biology, not host contamination
124-HMM immune-Pfam on all 8 poolssuper✓ done8,425 hits across 62 domains
us_mortar immune-Pfam (Job B)super✓ done702 hits, 0 CUB/ZP
Loph_1.0 proteome immune-Pfam scant5✓ done53 SRCR proteins, 0 with CUB/ZP/Mucin/VWD/Trefoil
Mito-orthologs Pfam-A scan (22,911 cnidarian proteins)t5✓ done49,249 hits, 706/727 Lophelia mito-orthologs annotated
124 cnidarian nucleotide BLAST DBst5✓ done18 GB, 124 species, 1 missing (Astrangia empty dir)
tblastn cross-cnidaria immune huntt5✓ done1.84 M rows, 124 species × 275 queries
RNA-seq STAR alignment (GTF-free, plain)trx50✓ done6 US + 5 Nordic samples aligned (NN1 broken-fastq excluded)
S4 featureCounts gene matrixtrx50✓ done40,679 genes × 11 samples count matrix
Job D — full Pfam-A (219 curated) on 8 poolssuperrunning4/8 pools through, ~3 hr remaining
Cnidaria-wide proteome immune scant5running~6/34 species through, ~5 min remaining
S5 DESeq2 differential expressiontrx50blockedawaits captain decisions on D-vs-NN + contrast
S6 immune-gene DE overlaytrx50/t5blockedafter S5

Fleet map

                (NFS mounts — t5 is the hub)

  t5  ◄──── ~/Desktop/super  ─── super (192.168.1.12) — first offload + nginx vhost server
  t5  ◄──── ~/Desktop/trx50  ─── trx50 (192.168.1.34) — RNA-seq offload
  t5  ◄──── ~/Desktop/d8     ─── d8 (192.168.1.10) — disk-only (mainboard swap pending)

  super, trx50, d8  do NOT mount t5 (data must be pushed to each)

  ssh hub:  t5 → super:45497, trx50:4540  (custom ports, dedicated key)
  exec:     scripts/fire.sh — detached tmux + NFS log/exit polling

Threads budget: t5 = 25, super = 40, trx50 = 25. One heavy task per box. Between boxes, parallel is fine. Full details in helm.deepsek.no.

Cross-species comparative scope

resourcecountused for
Cnidarian genome assemblies on disk125tblastn substrate
Annotated proteomes (NCBI RefSeq / GCF)34HMM scans, ortholog mapping
Per-genome nucleotide BLAST DBs built124tblastn queries — 1.84 M hits total
Per-species protein BLAST DBs34blastp / homology validation
MitoCarta human inventory → cnidaria ortholog map26,054 rowsmito-nuclear gene catalog
Mito-immune nexus (mito × immune intersection)985 rowsdirect manuscript material — TFAM, ECSIT, MUL1, CASP9

Locked methodological decisions

decisionvaluerationale
STAR indexGTF-free, FASTA-onlyno annotation priors; de novo splice junction discovery
STAR alignmentminimal flags (no twopass, no fancy filters)captain preference — plain mapping
BAM-extract pair fixBBTools repair.sh after samtools fastqdefensive re-pair + orphan audit
Threadst5 = 25, super = 40, trx50 = 25per-box headroom; verify %CPU saturation
MAX_AA5000 with sliding window (W 5000, S 4800)HMMER Plan7 is O(L × N); cap is first-class scaling param
Per-sample failureflag in ledger + continuecaptain rule: one broken sample shouldn't kill the batch
State spineNFS + markdown (INTRO / FINDINGS) + JSONL ledgerno SSH automation, no shared agent state, slow but clear

Open captain decisions

  1. D vs NN biology — what does the Tisler sample-label encoding mean (tissue, depth, treatment)? Needed before locking DESeq2 contrast direction.
  2. Headline contrast — within-Nordic D-vs-NN, cross-pop Nordic-vs-US, or combined ~population + group? Affects power and what the F5 figure shows.
  3. NN1 fate — Tisler NN1 raw fastqs from L006 are broken; STAR refused them twice; BBTools reformat didn't catch the issue. Currently excluded → 5-sample Nordic. Option to hunt down the specific corrupt records, or accept N=5.
Living dashboard. Updated as pipelines complete and the manuscript shape sharpens. Source of truth for the operating model: helm.deepsek.no. Manuscript-grade findings: the Immune highlights tab.

Pipeline · aragonite

Eight steps from raw PacBio BAM to a proposed contig graph. Each step's script + small-test signal + ledger key is on the repo side; the chart below is the structural map.

us_pb_vs_loph10.bam · 2.07 M primary reads S0 subsample samtools view -s · -F 0x900 reads_us_pct*.fq.gz M0 partition · axis-F (length p80 split) aragonite_partition.py architects.fq.gz + mortar.fq.gz M1 characterize · kmc-backed (k=21 canonical) aragonite_characterize_v2.py architects_flanks_*.tsv.gz B0 facts · per-read parquet (the DB row per read) aragonite_db_init.py reads_pop=*_run=*.parquet B1f Pfam-A annotation chunked hmmscan · MAX_AA=5000 architects + mortar (symmetric) B2 kraken2 taxonomy PlusPF db · raw + conf 0.05 full pool (arch + mortar) hmmer_pop=*_db=PfamAfull.parquet · kraken_pop=*_db=PlusPF{,-c0.05}.parquet RNA-bridge sibling track (R0 → R3) R0–R3 remap RNA · extract bridges · build graph · emit scaffolds rna_bridge_chain.sh bridge_graph.json · pseudo_scaffolds.fa ── Phase 2: immune atlas + per-pop SV queries as duckdb predicates over the parquet DB ──

Design rules (the four-rule discipline, in case study form): rule 1 — threads where they saturate (case: hmmscan --cpu 25 only used 1.8 cores on 124-HMM scan; flipped to parallel hmmsearch across HMMs); rule 2 — stream to disk (case: M1 Python Counter at 181 GB OOM-killed; moved to kmc); rule 3 — sync before next stage (within AND between pipelines, no oversubscription); rule 4 — idempotent substeps + ledger (re-runs resume from missing substeps; pre-populated ledger rows express "done by design").

Immune highlights

Three gene families anchor the cnidarian mucus-immunity story we're building: the SRCR scavenger-receptor family, the secreted mucin / DMBT1 architecture (multi-SRCR + CUB + ZP), and the complement cascade (lectin → MAC). Evidence comes in three layers: (i) the cross-pop diff-pool Pfam scan of 4.3 M PacBio reads (genomic domain landscape; broad but over-counts when domains are promiscuous), (ii) the truth-carrier named-ortholog stack (4 carriers · 1,233 human canonical refs · DIAMOND vs 34 cnidaria · clean 1:1 orthology), and (iii) the DESeq2 differential expression on 11 RNA-seq samples under the strict expression filter (the publishable named-ortholog DE table). The headline figure above synthesises layers (ii) and (iii). Layer (i) gives the domain landscape; the SRCR / Mucins / Complement cards below report it.

Headline · S5 DESeq2 named-ortholog volcano

11-sample RNA-seq (5 Nordic / 6 US) under the strict expression filter (≥10 counts in ≥3 samples per group, both pops) — 16,611 testable genes; 473 of them are carrier-annotated immune orthologs. Positive log2FC = NOR-elevated.

S5 immune-panel volcano: NOR vs US, named-ortholog DE

The top-right cluster is the headline biology: C3 (+2.11 LFC, padj 9×10⁻⁷), MASP1/C1R/C1S consolidated hub (+2.50, padj 5×10⁻¹⁵), and C2 (+3.89, padj 2×10⁻¹³) — three classical/lectin activation-cascade complement genes coherently NOR-elevated. Supporting NOR-elevated: MUC6 gel-mucin (+3.55) and the DDX58/MDA5/LGP2 viral-RNA-sensor hub (+1.60). Cluster near origin (DMBT1/CD163L1, NLRP3, MyD88, CFH) is the constitutive innate-immune backbone — balanced both pops, no DE. MitoCarta panel (teal) is mostly housekeeping; the few extreme US↑ mito outliers are residual library-composition artifact.

Tools: pyDESeq2 0.5.4 · contrast: pop nor vs us · padj < 0.05 + |LFC| ≥ 1 · script scripts/aragonite/s5d_headline_volcano.py · underlying data ~/scratch/s5_deseq2/deseq2_strict_filter_immune.tsv

Truth-carriers · named human orthology

Four pathway-specific carriers of canonical human reference proteins, each DIAMOND-blastp'd against 34 cnidarian proteomes. Replaces the broad-Pfam-domain approach (over-counts when domains are promiscuous: Trypsin, TSP_1, EGF, Ldl_a) with named 1:1 ortholog mapping back to Reactome / UniProt / MitoCarta3.

carrierhuman refsLophelia orthologssignature finding
complement 49 19 terminal MAC absent in Lophelia (no C6-C9); consolidated hub KAJ7394693.1 = C1R + C1S + MASP1 + MASP2
mucin / SRCR 24 8 0/26 cnidaria carry DMBT1 SRCR+CUB architecture (named-ortholog confirmation); Lophelia DMBT1 best-hit KAJ7373414.1 = SRCR-only, also the CD163L1 best-hit (consolidated hub)
mitocarta 1,132 727 418-gene universal cnidarian mito-core (≥30/34 species) — the eukaryotic-ancestral mitochondrial backbone
TLR / NLR / RLR 28 4 Canonical PRR folds preserved (35/37 TLR hits carry LRR+TIR, 42/53 NLR hits carry NACHT+LRR) but reduced gene families; Lophelia hub KAJ7375464.1 = DDX58 + IFIH1 + DHX58 (all 3 RLR paralogs collapsed)

The consolidated-hub pattern is the deep-time signal. Across all four carriers, Lophelia (and most anthozoans) collapse multiple vertebrate paralogs onto single cnidarian loci — vertebrate gene duplications evidently happened after the cnidarian–bilaterian split. The pattern repeats four times with independent gene families: strongly publishable.

Layout: ~/cnidaria_genomes/_comparative/truth_carriers/<pathway>/ · per-carrier matrix tab <pathway>_pathway_cn_per_species.tsv · scripts scripts/aragonite/{build_,diamond_,build_*_matrix}_*.py

SRCR · PF00530

Scavenger Receptor Cysteine-Rich. The prime mucus-immunity target — DMBT1 carries 8–16 tandem SRCR copies per protein. 356 SRCR domain hits across the 8 diff pools — concentrated in shared sequence (nor_shared 108, us_partial 148), almost absent from strict-private pools (1 / 0). US-Atlantic carries slightly more SRCR signal than Nordic (192 vs 164, +17%), but the cross-pop divergence in SRCR is modest compared to the complement cascade below.

N 8 × SRCR (tandem) CUB ZP C

Canonical DMBT1-like architecture. Each bead is one ~100-aa SRCR domain (PF00530), tandemly repeated. The CUB and ZP C-terminal modules anchor the protein to the mucus matrix. Cnidarian DMBT1 homologs have been catalogued across coral and sea-anemone proteomes; whether D. pertusum carries an intact multi-SRCR mucin is what this analysis tests.

poolSRCR hitsreads w/ hitFib_CTrypsinSushiLdl_a
nor_partial5210062731063315
nor_shared10824165422467646
nor_private33313400
nor_strict_private140100
us_partial14829119933246361
us_shared37948268972733
us_private7106331542
us_strict_private090001
Nordic total164345982835710961
US total192397412944369496

Pool definitions (from the 4.3 M-read pb_read_diff): shared = reads carrying k-mers present in both populations; private = reads enriched in one population only; partial = intermediate overlap; strict_private = single-population k-mer signature, the cleanest "this population only" pool. Every read in every pool documented — no read discarded — per the BAC-PAC principle.

Mucins / DMBT1 architecture

Secreted mucus PRRs combining multi-SRCR + CUB + ZP domains. The signature architecture for mucus immunity in vertebrates — and the working hypothesis for cnidarian mucosal defense at the coral–water interface.

What we look for: a single PacBio read that co-carries SRCR + CUB (or SRCR + ZP) in the same 6-frame translation. That's a positional argument — the domains are on one piece of contiguous DNA — much stronger than independent hits.

mucin-architecture domainNordicUStotal (8 pools)
SRCR · scavenger-receptor (PF00530)164192356
CUB · DMBT1 anchor (PF00431)000
ZP · zona pellucida C-term (PF00100)000
Mucin · O-glycosyl mucin core (PF01456)000
VWD · gel-forming mucin (PF00094)000
Trefoil · TFF-family mucin partner (PF00088)000

Three converging lines of evidence say the SRCR-containing proteins in D. pertusum are SRCR-only, never in classical DMBT1 architecture: (i) zero CUB/ZP/Mucin/VWD/Trefoil hits across all 8 cross-pop diff pools (4.3 M reads); (ii) zero co-occurrence in us_mortar (axis-F tandem-repeat pool, 769 MB, 31 SRCR hits); (iii) zero co-occurrence in Loph_1.0 reference proteome (37,484 proteins, 128 SRCR hits in 53 proteins).

Cnidaria-wide check (2026-05-17)

To rule out a D. pertusum-only quirk, we extended the same scan to 34 annotated cnidarian proteomes across Anthozoa (corals + anemones), Hydrozoa, Scyphozoa, and Myxozoa (513,899 protein-domain hits, 98 distinct immune Pfam HMMs):

cnidaria-wide totalsSRCR proteinsCUBZPMucinVWDTrefoil
31 species with at least one immune-Pfam hit 2,410 00000
top SRCR carriers Dendronephthya gigantea 1,012 · Actinia tenebrosa 715 · Oculina patagonica 438 · Acropora muricata 352 · Montipora foliosa 351 · Pocillopora verrucosa 335
proteins with SRCR + (CUB | ZP | Mucin | VWD | Trefoil) 0 — across 205,278 cnidarian proteins tested

Verdict — interpretation (b) wins, cnidaria-wide. The SRCR scavenger-receptor family is ancient and expanded across Cnidaria (2,410 proteins, 7,604 domain hits, present in every major class except the myxozoan parasites with reduced genomes). But the canonical mammalian DMBT1 scaffold (multi-SRCR + CUB + ZP) is genuinely absent in Cnidaria — the CUB+ZP-anchored mucin architecture is a vertebrate / mammalian innovation. Cnidarian mucus immunity uses SRCR-only PRRs (and their non-CUB co-domain partners: Ig_2, TSP_1, Trypsin, Pkinase, Ldl_recept_a in D. pertusum's 53 SRCR proteins).

Named-ortholog confirmation (2026-05-18, DMBT1 = UniProt Q9UGM3)

The Pfam-architecture finding now re-confirmed against the named human DMBT1 reference. DIAMOND blastp of human DMBT1 (Q9UGM3, 2,413 aa) vs 34 cnidarian proteomes returns best-hits in 26/26 anthozoans (pident 30-50%, qcov 28-80%), but 0/26 carry the SRCR+CUB co-occurrence that defines vertebrate DMBT1-mediated mucus immunity — all are SRCR-only (some with Ig_2, Sushi, or Ldl_recept_a accessory). Hydrozoans (4 spp) + Rhopilema return no DMBT1 best-hit at all. The keystone PRR is anthozoan-specific in cnidaria.

Lophelia DMBT1 best-hit: KAJ7373414.1 (pident 32.5, qcov 28.8, SRCR-only architecture, NCBI annotation: "Neurotrypsin") — also the best-hit for CD163L1. Single Lophelia locus = both vertebrate paralogs (consolidated-hub pattern, recurring across all four truth-carriers).

NCBI annotation caveat: NCBI's product field labels a different Lophelia locus (OS493_018800 = our CD163 best-hit) as "Deleted in malignant brain tumors 1" — i.e., NCBI calls a CD163-like protein DMBT1. The truth-carrier method uses closest-sequence-match; NCBI uses a different criterion. Both are SRCR-rich; the disagreement is documented and will be footnoted in the manuscript.

Expression evidence (Dahl-RNA1 + US SRA, ≥100 raw reads cutoff)

categoryn genesexpr Nordicexpr USbothsilent
SRCR 531124927
Mucin-anchor (CUB/ZP/Mucin/VWD/Trefoil) 0nothing to express — gene set empty

Top SRCR expression (balanced across populations): OS493_025544 (KAJ7328140.1) — NOR 6,341 / US 6,288, pure SRCR, housekeeping-level. OS493_013007 — NOR 4,224 / US 5,370, pure SRCR. OS493_017426 — NOR 1,561 / US 523, pure SRCR (Nordic-favoured). OS493_024254 — NOR 571 / US 414, SRCR + Ig_2 + Trypsin combo.
SRCR + kinase fusion, US-expressed: OS493_036508 (PK_Tyr_Ser-Thr + Pkinase + SRCR, NOR 0 / US 2,705).

Caveat — BAM-extract bias. 5 of 6 Nordic samples are samtools fastq from BAMs previously aligned to the 142-contig us_lp_filter subset (354 Mb of 557 Mb); NN1 is from raw cutadapt2 fastq. So Nordic counts for genes on the missing 203 Mb are structural zeros, not biological zeros. Genes WITHIN the subset show honest cross-pop comparison (above examples are all on-subset).

Complement cascade

The most evidence-rich immune system in our scan, and the one that diverges most strongly between Nordic and US. 2,122 Ficolin hits + 793 MASP-like serine-protease hits + 203 Sushi + 157 MAC-pore hits across the 8 pools — 3,275 complement-arm hits total, the lectin → MAC pathway dominating in both populations. US-Atlantic carries ~56% more Ficolin signal than Nordic (1,294 vs 828), the strongest cross-pop differential we measured.

classical C1q · C1s · C1r lectin Ficolin / MASP 2,122 + 793 hits ← dominant arm alt. Factor B / D Sushi 203 hits C3 (a2m core) A2M signal under-powered in diff space C6 / C7 / C8 / C9 · MAC Ldl_recept_a + MACPF 157 Ldl_a hits

Lectin pathway dominates both populations. Ficolin (Fibrinogen_C) is the single strongest immune signal in the entire diff space — 2,122 hits, ahead of every other Pfam domain across all 124 HMMs tested. MASP-like serine proteases (Trypsin) and Factor B / MASP accessory modules (Sushi) are well attested. The membrane-attack pool (Ldl_recept_a, the C6–C9 MAC component) is also strongly cross-pop. The C3 α-2-macroglobulin thioester core itself remains under-powered in diff space — but absence here is consistent with C3 being conserved between Nordic and US (i.e., visible only in the shared bulk, not yet scanned).

componentPfamNordicUStotalUS / Nor
Ficolin · lectin path initiatorPF001478281,2942,122+56%
MASP · serine protease (Trypsin)PF00089357436793+22%
Sushi · Factor B / MASP accessoryPF0008410994203–14%
Ldl_recept_a · MAC (C6–C9)PF000576196157+57%
Complement-arm subtotal1,3551,9203,275+42%

The +56% Ficolin and +57% Ldl_recept_a enrichment in US-Atlantic is the headline cross-pop differential. Both sit on the lectin → MAC axis (recognition → pore formation), so the divergence is coherent rather than scattered — a real population-level skew in mucus complement.

RNA-seq expression evidence (Dahl-RNA1 + US SRA, ≥100 raw reads)

Loph_1.0 carries 577 complement-pathway genes (Fibrinogen_C / Lectin_C recognition + Trypsin MASP + Sushi accessory + A2M C3-core + MACPF / Ldl_recept_a MAC). STAR counts at the 100-read cutoff per sample:

category (Pfam)n genesexpr Nordicexpr USbothsilent
Complement — all families combined 577157239116297

DESeq2 strict-filter DE results (named orthologs — 2026-05-18)

The Pfam-scan above measures genomic domain presence in cross-pop read pools. The DESeq2 results below measure per-gene transcription with the named-ortholog truth-carrier. Same underlying biology, two different evidence layers; both are kept. Filter: gene retained only if ≥10 counts in ≥3 samples per pop, both pops.

complement genegene_idlog2FCpadjdirection
C3 · α-2-macroglobulin thioester OS493_014903 +2.11 9 × 10⁻⁷ NOR ↑ 4.4×
MASP1 / C1R / C1S / MASP2 · consolidated protease hub OS493_000518 +2.50 5 × 10⁻¹⁵ NOR ↑ 5.7×
C2 · classical/lectin protease OS493_002228 +3.89 2 × 10⁻¹³ NOR ↑ 15×
C4A · classical/lectin thioester OS493_024843 +1.18 9.6 × 10⁻³ NOR ↑ 2.3×
CFH · regulator OS493_002696 −1.00 1.8 × 10⁻³ US ↑ 2.0×

Headline: the classical / lectin activation cascade (C3 + MASP-hub + C2 + C4A) is coherently NOR-elevated at the transcript level. Same direction as the broad-Pfam Ficolin / MASP / Sushi enrichment but the named-ortholog DESeq2 lands the signal on Nordic, not US — because the Pfam-scan layer was measuring genomic content in diff-pool reads (US shared+partial pools are larger), while DESeq2 measures transcription per gene (Nordic transcribes more). Two consistent evidence layers when interpreted at their proper level.

Sub-threshold notes (not in strict filter): NCBI confirms OS493_000518 = "Mannan-binding lectin serine protease 1" — independent annotation match on our MASP1 carrier hit. The earlier OS493_028292 (Trypsin, NOR 4 / US 85,684) and OS493_028953 (Trypsin, NOR 0 / US 39,670) one-pop spikes were dropped by the strict filter as library-composition artifacts (BAM-extracted Nordic samples systematically zero on off-us_lp_filter-subset genes). Demoted to "trending US, sub-threshold expression in Nordic — not a confident DE call."

Caveat — BAM-extract bias. 5 of 6 Nordic samples come from BAMs aligned to the 142-contig us_lp_filter subset (354 Mb of 557 Mb full Loph_1.0). The strict-filter mitigation drops genes off-subset; the 11 named orthologs above are all on-subset and represent honest cross-pop transcription. Genome-wide 38.5% sig figure (vs 52% before filter) is partially residual library-composition and partially real environmental divergence between deep-Tisler and Atlantic-shelf populations.

Pattern-recognition receptors — TLR / NLR / RLR (carrier 4, 2026-05-18)

The fourth truth-carrier: 28 canonical human PRRs (TLR1-10, TIR adaptors, NOD1/2, NLRP1/3/6/7, NLRC4/5, NLRX1, NAIP, RLR triad). Cnidaria show a compressed-but-architecturally-complete PRR signal — canonical folds preserved, gene families reduced. Across 31 species:

layeruniversal (≥28/34)broad (10-22/34)absent (0/34)
RLR · viral RNA DDX58, IFIH1, DHX58
TLR adaptor MYD88 (22)TIRAP, TICAM1/2, SARM1
NLR inflammasome NLRP3 (22)NLRP1
TLR · surface PRR TLR2 (15), TLR4 (11), TLR5 (10)TLR1, TLR3, TLR6, TLR7, TLR9, TLR10
NLR-NOD NOD2 (12), NOD1 (3)

Architecturally canonical: 35/37 cnidarian TLR best-hits carry both LRR + TIR; 42/53 NLR best-hits carry NACHT + LRR. The PRR machinery is preserved back to the cnidarian–vertebrate split. Viral-RNA sensing via the RIG-I/MDA5/LGP2 RLR family is universal across cnidaria; surface bacterial sensing (TLR2/4/5 + MyD88 + NLRP3 + NOD2) preserved but reduced; endosomal nucleic-acid TLRs (TLR3/7/8/9) and most TIR adaptors (TIRAP/TRIF/TRAM/SARM1) are vertebrate innovations.

Lophelia carries a minimal but architecturally-complete PRR system — 4 orthologs, all canonical:

functionLophelia locusarchitectureNOR_CPM / US_CPM
TLR4KAJ7392015.1LRR_8 + TIR + TIR_2 ★1.1 / 0.1
MyD88 adaptorKAJ7371306.1Death + TIR + TIR_2 (qcov 83.8%)128 / 150
NLRP3 inflammasomeKAJ7373972.1LRR_6 + NACHT ★253 / 326
RIG-I / MDA5 / LGP2 (hub)KAJ7375464.1DEAD helicase — one locus = 3 paralogs100 / 27

DESeq2: NLRP3 + MyD88 robust and balanced both pops (constitutive backbone). RLR-hub NOR-elevated +1.60 LFC, padj 2.5×10⁻⁶ — Nordic upregulates viral-RNA sensing. TLR4 baseMean too low for confident DE call (demoted).

MitoCarta orthologs — for context

Lophelia has 635 MitoCarta orthologs (human mito-localised proteins). At the 100-read cutoff: 440 expressed in Nordic, 589 in US, 428 in both, 34 silent. The high overlap reflects the mito-machinery's housekeeping role across both populations. The mito × immune intersection is small in D. pertusum with the 124-HMM immune atlas (24 proteins in the cross-cnidaria nexus, 2 strictly in our SRCR/complement narrow set) — but the broader mito-immune crosstalk machinery (TFAM, ECSIT, MUL1, CASP9, CLPB/CLPX/LONP1) is present and conserved (see Mito × Immune nexus backlog work — 985 rows across 34 cnidarians, manuscript material for a separate angle).

For Åse — four findings the manuscript can lead with (refresh 2026-05-18, after truth-carrier + DESeq2 layers landed):

(1) Classical / lectin complement is transcriptionally NOR-elevated. Coherent same-direction signal across C3 (+2.11 LFC), MASP1/C1R/C1S consolidated hub (+2.50), C2 (+3.89), and C4A (+1.18) — the entire proteolytic activation cascade up in Nordic Tisler. Reactome pathway: R-HSA-166658. Independent NCBI annotation confirms our MASP1 ortholog locus is literally labelled "Mannan-binding lectin serine protease 1". Same direction as supporting NOR-elevation in MUC6 gel-mucin (+3.55) and the DDX58/MDA5/LGP2 RLR hub (+1.60).

(2) Cnidarian mucus-immunity does not use DMBT1 architecture. Named-ortholog DIAMOND of human DMBT1 (Q9UGM3) returns best-hits in 26/26 anthozoans but 0/26 carry the SRCR+CUB co-occurrence — all SRCR-only. Confirmed by the read-level Pfam scan (0 CUB hits across 4.3 M PacBio reads in the diff space). The CUB+ZP-anchored mucin scaffold is a vertebrate innovation. Cnidarian mucus immunity uses SRCR-only PRRs.

(3) Vertebrate gene duplications collapse onto single cnidarian loci. The consolidated-hub pattern replicated across four independent truth-carriers: one Lophelia locus covers MASP1+C1R+C1S+MASP2 (complement); one covers DMBT1+CD163L1 (SRCR); one covers DDX58+IFIH1+DHX58 (RLR); one covers FCN1+FCN2+FCN3 (ficolins); etc. This is the textbook deep-time signal — vertebrate paralogs duplicated from a single ancestor after the cnidarian split. Four-carrier replication makes it publishable.

(4) Terminal MAC pore (C6-C9) is absent in Lophelia + most anthozoans + all hydrozoans. Lophelia, Acropora digitifera, Exaiptasia, and all 4 hydrozoans carry 0 named MAC pore-forming genes; C5 is present in many anthozoans but C6-C9 are not. The pore-forming complement axis appears to be largely a vertebrate innovation; cnidarian complement-mediated killing must use the C3 opsonization / protease cascade without the canonical lytic pore.

Immune Pfam atlas · 124 HMMs

The InnateDB-derived cnidaria-relevant immune Pfam atlas — 124 top-2-per-family accessions curated from immune_family_table_cnidaria_relevant.tsv. Used as a query-time predicate over the full per-read DB, not as a scan-time filter (that's the post-2026-05-13 design correction).

Atlas families (selected)

Top families by member count in the cnidaria-relevant table:

familyPfam
Toll-like receptor familyPF13855
TRIM/RBCC familyPF00643
Peptidase S1 familyPF00089
NLRP familyPF05729
Complement C6/C7/C8/C9 familyPF00057
STING familyPF15009
2-5A synthase familyPF10421
Ficolin lectin familyPF00147
HMGB familyPF00505
Cathelicidin familyPF00666
Invertebrate defensin (Type 1)PF01097
Mab-21 familyPF20266
Top hits — cross-pop diff scan (Nordic vs US, 8 pools)

8,425 hits across 7,865 unique reads · 62 distinct Pfam domains · 4.3 M PacBio reads scanned (super offload, 2026-05-13 → 14):

domainNordicUStotal
Fibrinogen_C ← Ficolin / lectin8281,2942,122
Trypsin357436793
TSP_1344426770
PK_Tyr_Ser-Thr322342664
TRAF-mep_MATH279287566
Ank193165358
SRCR ← prime target164192356
Arf166159325
LRR_6167128295
ubiquitin141149290
Forkhead105115220
Sushi10994203
zf-C3HC499102201
An_peroxidase66105171
Ldl_recept_a6196157
zf-RING_UBOX6072132

Top 16 by total hit count. Full 62-domain table sits on disk in out/immune/<pool>/hmmer.domtblout.

SRCR is the prime target. DMBT1 — the cnidarian-relevant scavenger receptor family for mucus immunity — has 8–16 tandem SRCR copies per protein. The cross-pop diff scan now gives 356 SRCR hits in 4.3 M reads, concentrated in the shared / partial pools (US 192, Nordic 164; +17% in US-Atlantic). No SRCR+CUB or SRCR+ZP co-occurrence yet — the classic DMBT1 architecture lives elsewhere. Two next moves keep this honest: (a) scan the conserved bulk pool (non-diff reads, by far the largest substrate) to test whether DMBT1-architecture co-occurrence emerges there; (b) scan the mortar pool — tandem-repeat-flavored architectures land in mortar via axis-F, so the canonical 8–16-bead DMBT1 may sit there.

Data inventory

What's on disk. Source-of-truth paths under ~/cnidaria_genomes/lophelia_pertusa/.

Nordic — Tisler reef (2015)

CellMovie ID prefixLibrarySubreads size
B02_1m151122_020500P6-C4 · 10 kb insertCCS partial
C02_1m151122_082514P6-C4 · 10 kb insert326 MB
D02_1m151122_144213P6-C4 · 10 kb insert335 MB
D03_1m151103_071355P6-C4 · 10 kb insert636 MB
E02_1m151122_210143P6-C4 · 10 kb insert894 MB
E03_1m151103_113308P6-C4 · 10 kb insert691 MB
F01_1m151031_225026P6-C4 · 10 kb insert574 MB
F02_1m151123_032108P6-C4 · 10 kb insert1.1 GB
F03_1m151103_155234P6-C4 · 10 kb insert726 MB
G02_1m151123_094519P6-C4 · 10 kb insert829 MB
G03_1m151103_201506P6-C4 · 10 kb insert700 MB
H02_1m151123_160617P6-C4 · 10 kb insert751 MB

Aggregate: 25.4× CLR, 14.17 Gbp. 9 cells with full subreads; 2 cells (C02_1, D02_1) have CCS instead; 1 cell (B02_1) partial. Plus Tisler Illumina paired-end + 3 kb / 8 kb / 20 kb / 40 kb mate-pair libraries.

Nordic — RNA-seq (Dahl)

SampleIndexPoolReads
Dahl-RNA1-D2GCCAATD27.5 M pairs
Dahl-RNA1-D3CAGATCD23.5 M pairs
Dahl-RNA1-D5CTTGTAD26.0 M pairs
Dahl-RNA1-NN1CGATGTNN23.8 M pairs
Dahl-RNA1-NN3TGACCANN25.2 M pairs
Dahl-RNA1-NN5ACAGTGNN29.7 M pairs

HiSeq2500 paired-end, STAR-aligned to us_lp_filter (Loph_1.0 142-scaffold subset). ~155 M pairs total. Dahl-RNA1-D2 used as the RNA-bridge test sample so far.

US — published reference (2023)

Accession
GCA_029204205.1 · Loph_1.0
Submitter
Lehigh / Santiago Herrera lab
Specimen
USNM1676648 (Smithsonian)
Origin
Atlantic · 31.754 N · 79.194 W (Charleston, SC)
Depth
515 m
Assembly
scaffold N50 1.614 Mb · contig N50 438 kb · 557 Mb total · 39.5% GC
Annotation
RefSeq genomic.gff + protein.faa

Plus 33 NCBI-annotated cnidarian proteomes (RefSeq-preferred) under ~/cnidaria_genomes/_comparative/proteomes/ as the comparative backdrop across the major cnidarian clades.

People

Contributors to the AutoLoop / Tisler D. pertusum work. The list is open — contribution shape, not author order.

Åse Emblem

Co-author on the 2009 Tisler manuscript that prefigured this project. Mitochondrial + nuclear immunity arc.

Per Mikael Jonassen

Sample lead on the 2015 Tisler PacBio sequencing at NSC Norway (run 47). Continuity with the original sequencing decisions and library design.

Bård O. Karlsen

AutoLoop / aragonite codebase, the per-read DB build, the four-rule pipeline discipline, the RNA-bridge graph stack.

Steinar D. Johansen · U. Tromsø (UiT)

Project owner. Mito-nuclear genes + innate immunity (complement, DMBT1, mucus PRRs). FRIBIO grant 204907 (2010-era) framing: ancestral tumor/apoptosis comparative axis. Traylor-Knowles 2022 Front Mar Sci is the designated biological-route precedent. Story: solid science, not high-impact splash.

Names listed alphabetically by last name within each card row. Authorship and acknowledgements are finalised at submission.

Lab notebook

Dated entries from ~/autoloop/docs/lab_notebook.md, rendered on every commit. Newest entry first. Run scripts/site/build_notebook_panel.py to refresh.

2026-05-18 (night) — S5d headline figure (named-ortholog volcano)

Script: scripts/aragonite/s5d_headline_volcano.py. Generates two volcano flavours from the strict-filtered DESeq2 results:

  • headline_volcano.svg — genome-wide context (16,611 grey bg genes + 473 carrier-annotated colored fg), x ∈ [-10, 8] to clip extreme mito artefacts; shows the immune panel sits clearly above the genome-wide noise floor
  • headline_volcano_immune.svg — the paper figure: only carrier- annotated genes plotted, headline names labeled, palette matched to the autoloop site

Colors: complement = sandy coral (accent); mucin/SRCR = lavender; mitocarta = teal; tlr_nlr = warm tan. Background dark to match the site dark theme.

Plotted breakdown (strict filter)

Carrier n in panel NOR↑ sig US↑ sig
complement 8 5 0
mucin_srcr 4 1 0
mitocarta 459 22 92
tlr_nlr 2 1 0

The immune-panel volcano (no mito background) reads cleanly: top-right cluster = C3 + MASP-hub + C2 (the activation cascade NOR↑), mid-right = MUC6 + DDX58/MDA5/LGP2 hub (NOR↑), origin cluster = DMBT1/CD163L1 + NLRP3 + MyD88 + CFH (the constitutive innate backbone — balanced both pops, no DE).

The 92 US↑ mitocarta genes vs 22 NOR↑ is the residual library- composition signal (housekeeping mito proteins systematically over-represented in the US raw libraries). Demoted to "broad transcriptional divergence with library-composition contribution" in the captioning, not headline.

Files

~/scratch/s5_deseq2/figures/    (source)
~/autoloop/docs/figures/         (repo copy)
~/Desktop/super/bok1/deepsek_web/autoloop/figures/   (site copy)
├── headline_volcano.svg         genome-wide context
├── headline_volcano.png
├── headline_volcano_immune.svg  paper figure
└── headline_volcano_immune.png

Site integration

autoloop.deepsek.no Immune highlights panel gets a new "Headline · S5 DESeq2 named-ortholog volcano" card at the top (above the SRCR/Mucins/Complement cards), embedding the headline_volcano_immune.svg inline. Caption explains the top-right activation-cascade cluster, the constitutive backbone near origin, and the residual library-composition artifact. Links back to scripts/aragonite/s5d_headline_volcano.py + underlying TSV for reproducibility.

Open / running (refresh)

  • Label crowding near origin (DMBT1, NLRP3, CFH, C4A) — could use adjustText library or hand-curate offsets if going to print. Acceptable for current draft.
  • TMM normalization fallback if reviewers push on genome-wide 38.5%
  • Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog

2026-05-18 (late evening) — S5c library-bias mitigation (strict filter)

Approach: re-run DESeq2 on a strict expressed-in-both-pops filter to remove the lowly-expressed-in-one-pop genes that DESeq2's noise model flags with extreme LFCs (the "PABP4 LFC −16 padj 1e-291" artifacts). Re-extraction of raw NOR fastqs was not feasible (raw 2014-2015 data abandoned); TMM normalization deferred unless this mitigation is insufficient for reviewers.

Filter: gene retained iff ≥10 counts in ≥3 of 5 NOR samples AND ≥3 of 6 US samples. Script: scripts/aragonite/s5c_library_bias_mitigation.py.

Result — biology preserved, genome-wide partially mitigated

Filter Tested Sig % sig NOR↑ US↑
Loose (sum ≥ 10) 32,398 17,024 52.5% 4,655 12,369
Strict (3+/group both) 16,969 6,538 38.5% 2,997 3,541
  • 14 percentage-point drop in genome-wide sig
  • Directional skew evened out (loose 73% US-up → strict 54% US-up, near-symmetric — more biologically credible)
  • Still elevated; real coral pop comparisons typically run 5-15% sig

Headline named-ortholog panel — 11/13 preserved, all 11 LFC-stable

Gene gene_id loose LFC strict LFC LFC Δ
C3 OS493_014903 +2.14 +2.11 0.03
MASP1 hub OS493_000518 +2.53 +2.50 0.03
C2 OS493_002228 +3.92 +3.89 0.03
C4A OS493_024843 +1.22 +1.18 0.04
CFH OS493_002696 −0.97 −1.00 0.03
DMBT1 hub OS493_013007 −0.55 −0.58 0.03
MUC6 OS493_020649 +3.59 +3.55 0.04
MyD88 OS493_026950 −0.65 −0.68 0.03
NLRP3 OS493_009300 −0.88 −0.91 0.03
RLR hub OS493_002237 +1.64 +1.60 0.04
CD163 OS493_018800 −0.44 −0.48 0.04 (NS both)
CD5L+CD6 hub OS493_031461 −6.30 (sig) DROPPED
TLR4 OS493_014951 +4.38 (sig) DROPPED

The headline biology is rock-solid under the strict filter — C3, MASP-hub, C2, MUC6, RLR-hub all NOR-elevated with near-identical LFCs.

Two genes dropped to sub-threshold expression: - CD5L+CD6 hub — barely expressed in NOR (<10 counts), the −6.30 LFC was a noise-model extreme. Demote to "trending US-elevated, sub-threshold in Nordic" — interesting but not a confident DE call. - TLR4 — baseMean 4.2, too low for confident testing. Demote to "detectable but low in both pops."

Recommendation for the paper

  • Use strict-filter results as the published DE analysis: 16,969 genes tested, 6,538 sig, immune panel preserved.
  • Explicitly report loose-vs-strict comparison in methods — shows the filter is principled (not p-hacking; biology preserved).
  • Headline claims: C3, MASP-hub, C2, C4A, CFH, MUC6, RLR-hub for the NOR-elevated story; DMBT1-hub balanced for the constitutive story; CD5L+CD6 and TLR4 demoted to supplement footnotes.
  • Frame genome-wide 38.5% sig as "broad transcriptional divergence between geographically separated populations, possibly environmental; named-immune-ortholog signal is robust to filter assumptions."

Diagnostic notes for the residual 38.5%

Three plausible contributors: 1. Real biology — Lophelia between deep-cold Tisler and Atlantic-shelf US could have genuinely large transcriptional shifts 2. Residual library-composition bias not fully removed by strict filter 3. Sample size + within-group variance (n=5/n=6, non-replicate-matched libraries between groups)

If reviewers push back on the 38.5%, fallback option is TMM normalization (limma-voom style — more robust to composition shifts). PyDESeq2 doesn't natively support TMM but TMM factors can be computed externally and passed in as size factors. Not implemented yet — defer unless needed.

Files

~/scratch/s5_deseq2/
├── deseq2_strict_filter_full.tsv      16,969 strict-filtered DESeq2 results
├── deseq2_strict_filter_immune.tsv    487 carrier-annotated
├── library_bias_comparison.tsv        per-headline-gene before/after
└── mitigation_verdict.tsv             loose-vs-strict summary stats

Open / running (refresh)

  • TMM normalization fallback if reviewers push on genome-wide 38.5%
  • Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog

2026-05-18 (evening) — S5b D-vs-NN within-Nordic sanity check

Script: scripts/aragonite/s5b_deseq2_d_vs_nn.py. Two analyses: (1) PCA on log-CPM all 11 samples + within-group distance summary; (2) DESeq2 ~subpop within Nordic (D=3 vs NN=2) compared to the main NOR-vs-US DE volume.

Verdict: POOL JUSTIFIED

Contrast Tested Sig (padj<0.05, |LFC|≥1) % sig
NOR vs US 32,398 17,024 52.5%
D vs NN within Nordic 22,609 400 1.8%

Ratio (subpop_DE / pop_DE) = 0.023. Within-Nordic subpop signal is 2.3% of the cross-pop signal — comfortably below the pooling threshold.

PCA structure

PC1 (90.7% variance) is the NOR↔US axis — overwhelms everything else. PC2 (3.9%) is the D↔NN axis but small. Within-D variance (50.95 in PC1-2 space) is larger than the D-to-NN mean separation (57.49): D-vs-NN inside Nordic looks like ordinary sample-to-sample variability, not two distinct sub-populations. Within-NN samples are very tight (2.2); D5 is the main D-outlier driving within-D variance.

within-D     50.95
within-NN     2.20
within-US     6.43
D ↔ NN       57.49
D ↔ US      248.99
NN ↔ US     241.45

Carrier-gene check — clean

Only 3 of 501 carrier-annotated genes are D-vs-NN sig: - ITGB2 (complement receptor β subunit, CR3-β) — LFC −1.46, padj 9.7e-4 — D-elevated. The only named-ortholog with within-Nordic structure. Worth a paper footnote (possible Tisler-vs-other-Nordic tissue-state difference); does not compromise the headline. - 2 unannotated mitocarta hits (likely depth-artifact at this signal level).

Critically: NONE of the S5 headline genes (C3, MASP-hub, C2, C4A, MUC6, RLR-hub, CD5L+CD6 hub) are in the D-vs-NN sig list. The NOR-vs-US biological story is driven by genuine cross-pop signal, not by D-vs-NN heterogeneity inside the Nordic group.

Implication

The S5 NOR-vs-US contrast can be reported as published. Pool the 5 Nordic samples as one group. Footnote ITGB2 D-elevation if relevant. The library-composition caveat (NOR BAM-extracted vs US raw) remains the more important caveat — that one needs a separate mitigation (reviewer might ask).

Files

~/scratch/s5_deseq2/
├── pca_all_samples.tsv              11 samples × PC1-PC4 + subpop
├── deseq2_d_vs_nn_full.tsv          22,609 within-Nordic results
├── deseq2_d_vs_nn_immune.tsv        501 carrier-annotated within-Nordic
└── subpop_pooling_verdict.tsv       NOR-vs-US vs D-vs-NN summary stats

Open / running (refresh)

  • DESeq2 headline figure draft — named-ortholog panel volcano
  • Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog

2026-05-18 (evening) — S5 DESeq2 — NOR vs US headline contrast

Tooling decision: no R/DESeq2 env existed on t5; installed pyDESeq2 0.5.4 into the lophelia env (canonical Python port, results near-identical to R DESeq2; avoids R-bioconductor env bloat per the one-env-per-project rule). Pandas downgraded 3.0.2 → 2.3.3 by the install (pydeseq2 incompatible with pandas 3.x).

Script: scripts/aragonite/s5_deseq2_pop_contrast.py. Design ~pop (us as reference), so positive log2FC = NOR-upregulated. Filter: gene total count ≥ 10 → 32,398 / 40,679 genes tested. Carrier annotation merged in (660 unique Lophelia gene_ids with truth-carrier ortholog labels).

Headline named-ortholog panel (padj < 0.05, |LFC| ≥ 1)

Gene gene_id LFC padj direction
MASP1 hub (C1R+C1S+MASP1+MASP2) OS493_000518 +2.53 1.5e-15 NOR↑
C2 OS493_002228 +3.92 2e-13 NOR↑
C3 OS493_014903 +2.14 4e-7 NOR↑
MUC6 (gel-mucin) OS493_020649 +3.59 6e-6 NOR↑
DDX58/MDA5/LGP2 RLR hub OS493_002237 +1.64 1.2e-6 NOR↑
C4A OS493_024843 +1.22 6.5e-3 NOR↑
TLR4 OS493_014951 +4.38 3e-4 NOR↑ (low baseMean)
CD5L+CD6 hub OS493_031461 −6.30 2.6e-11 US↑
MyD88 OS493_026950 −0.65 0.034 barely US↑
NLRP3 OS493_009300 −0.88 0.003 barely US↑
DMBT1/CD163L1 hub OS493_013007 −0.55 0.012 barely US↑ — constitutive

Biological pattern

Nordic Lophelia upregulates the activation-cascade complement + mucus + viral RNA sensing: C3 + MASP-hub + C2 + C4A coherently NOR-elevated (the classical/lectin proteolytic activation cascade); MUC6 gel-mucin NOR-up; DDX58/MDA5/LGP2 RLR sensing NOR-up.

US Lophelia upregulates the SRCR-secreted scavenger layer: CD5L+CD6 hub 79× higher.

Innate-immune backbone (MyD88, NLRP3, DMBT1/CD163L1 hub) is balanced — robust constitutive expression both pops, no DE.

The coherence at pathway level (multiple genes in the same biological role coordinating in the same direction) is what makes the named-ortholog panel credible despite the per-gene small n.

Per-carrier DE counts

Carrier tested sig NOR↑ US↑
Complement 14 10 5 5
Mucin/SRCR 6 3 1 2
MitoCarta 635 274 26 248
TLR/NLR 4 2 2 0

Caveats — critical

  • 17,024 / 32,398 genes flagged sig genome-wide (52%) — top hits are mitochondrial housekeeping (e.g. PABP4 at LFC −16, padj 1e-291). This is the library-composition bias from BAM-extracted Nordic samples vs raw US fastqs. DESeq2 size factors normalize total depth but cannot undo the upstream read-filter selection.
  • The named-ortholog panel is the credible biology — coherent pathway-level patterns are unlikely to be artifact. The genome-wide volcano top is contaminated with normalization artifacts.
  • D vs NN substructure still un-tested. Captain previously flagged this as a blocker decision. Need a ~ subpop within-Nordic check before claiming pop = NOR is homogeneous.

Files

~/scratch/s5_deseq2/
├── deseq2_results_full.tsv          32,398 genes × DESeq2 stats
├── deseq2_results_immune.tsv         658 carrier-annotated genes
├── deseq2_results_<carrier>.tsv     per-panel results
├── top_de_genes.tsv                 top 50 by -log10padj × |LFC|
└── state/runs.jsonl                 ledger

Open / running

  • Library-bias mitigation — discuss with captain: re-extract NOR from raw if surviving, or subsample US to match NOR floor, or alternative normalizer (voom-TMM)
  • DESeq2 headline figure draft — named-ortholog panel volcano (not the genome-wide volcano)
  • Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog

2026-05-18 (late) — truth-carrier × RNA-seq expression cross-reference

Pivot: with four truth-carriers landed (carrier-defined Lophelia orthologs in KAJ7* protein_id namespace) and the 11-sample featureCounts matrix at trx50:out/counts/ keyed on OS493_* gene_id, the natural next step is to join them and ask "which named immune orthologs are actually expressed, and in which pop." Mapping built from GTF CDS records (gene_id + protein_id in the same attribute block): 192,784 CDS records → 37,484 gene→protein mappings. 100% of the 766 Lophelia carrier orthologs map to a gene_id.

Script: scripts/aragonite/truth_carrier_expression.py. Outputs at ~/scratch/truth_carrier_expression/. Per-sample featureCounts parquets concatenated → 40,679 genes × 11 samples matrix → CPM. Background: 20,100 / 40,679 genes expressed (≥1 CPM) in any NOR sample, 26,992 in any US sample — US-side has more genes detected because Nordic data are BAM-extracted (already filtered to Lophelia-mapping reads, depth budget concentrated on fewer loci).

Per-carrier expression coverage

Carrier Orthologs NOR-expr US-expr Both
Complement 25 12 20 12
Mucin/SRCR 8 5 8 5
MitoCarta 727 539 724 536
TLR/NLR 6 6 5 5

Named-ortholog highlights (NOR_max_CPM / US_max_CPM)

Gene gene_id NOR US direction
C3 OS493_014903 2,381 359 NOR 6.6× ↑
MASP1 / C1R / C1S / MASP2 hub OS493_000518 239 29 NOR 8.2× ↑
C2 OS493_002228 699 56 NOR 12.5× ↑
C4A / C4B OS493_024843 592 173 NOR 3.4× ↑
CFH OS493_002696 161 161 balanced
DMBT1 / CD163L1 hub OS493_013007 310 291 balanced — constitutive
CD163 OS493_018800 24 44 US 1.8× ↑
MUC6 OS493_020649 54 5 NOR 12× ↑
CD5L / CD6 hub OS493_031461 0.4 27 US 67× ↑
TLR4 OS493_014951 1.1 0.1 barely expressed
MyD88 OS493_026950 128 150 robust, balanced
NLRP3 OS493_009300 253 326 robust, balanced
DDX58 / MDA5 / LGP2 RLR hub OS493_002237 100 27 NOR 3.7× ↑

Three validation findings

  1. C3 locus settled at OS493_014903 → KAJ7334579.1. Prior Pfam- based candidate from earlier session was OS493_014902 — off by one (adjacent locus same neighbourhood). The carrier-truth call supersedes the Pfam-based one.
  2. NCBI annotation cross-validates the MASP1 carrier hit: NCBI product for OS493_000518 is literally "Mannan-binding lectin serine protease 1" — independent confirmation that the consolidated-hub pattern (one Lophelia locus = MASP1+C1R+C1S+MASP2) is the genuine ancestral C1/MASP-like protease.
  3. Naming conflict flagged for paper: NCBI's product field calls OS493_018800 "Deleted in malignant brain tumors 1" (= DMBT1) — but our truth-carrier method puts that locus at CD163, and the DMBT1 best-hit at OS493_013007 (which NCBI calls "Neurotrypsin"). These are two SRCR-rich Lophelia proteins with overlapping vertebrate similarity; NCBI's pipeline used a different criterion. The carrier method gives the closest sequence match; NCBI's call is plausibly a synonym chain through annotation transitive closure. Will footnote in the manuscript.

Pop-bias pattern (observational, not yet DESeq2-tested)

  • NOR-elevated: classical/lectin complement core (C2, C3, C4, MASP1-hub), MUC6, viral-RNA-sensing RLR hub
  • US-elevated: SRCR scavenger/regulator layer (CD5L+CD6, CD163, CD46+C4BPA at OS493_027659)
  • Balanced: DMBT1-locus (~300 CPM both), CFH, MyD88, NLRP3 — constitutive innate-immune backbone

Two-population coral systems with co-expression of complement + mucus-SRCR + PRR + RLR axes, with named-ortholog evidence — this is the dataset the paper's headline figure can come from.

Caveats logged

  • n=5 NOR / n=6 US, small; observations not significance-tested
  • NOR samples BAM-extracted (filtered) — library-depth normalization needed before any DE call. The next step (DESeq2 S5) handles this via per-sample size factors.
  • "Hypothetical protein" is the dominant NCBI annotation; the carrier cross-reference is the first identification of many Lophelia loci in functional terms.

Files

~/scratch/truth_carrier_expression/
├── gene_protein_map.tsv               37,484 OS493↔KAJ7 pairings
├── lophelia_expression_full.parquet   40,679 genes × 11 samples + CPM
├── complement_expression.tsv          25 orthologs
├── mucin_srcr_expression.tsv          8 orthologs
├── mitocarta_expression.tsv           727 orthologs
├── tlr_nlr_expression.tsv             6 orthologs
└── lophelia_immune_expression.tsv     766 rows stacked (carrier × expression)

Open / running (refresh)

  • DESeq2 (S5) — now unblocked: named-ortholog list ready, gene symbols mapped, both pops sampled; ready to fire DE on the immune panels and the headline contrast
  • Site nginx vhost (sudo), helm.deepsek.no — unchanged backlog

2026-05-18 — truth-carrier stack (complement, mucin/SRCR, mitocarta)

Captain question after the cross-cnidaria complement domain-Pfam matrix (70,444 hits): "what are we using as complement 'truth' stack — human, pfam?" The honest answer was: Pfam-domain-architecture-based, which over-counts massively (Trypsin Pfam hits ALL serine proteases; TSP_1 hits all thrombospondins; etc.). Switched to named human canonical orthology as the truth standard, mirroring the existing MitoCarta workflow.

Truth-carrier architecture

Layout at ~/cnidaria_genomes/_comparative/truth_carriers/<pathway>/, one subdir per pathway, each with the same artifact set:

<pathway>_genes.tsv                    gene_symbol × uniprot × category × role
<pathway>.faa                          carrier query FASTA (one canonical seq per gene)
<pathway>_diamond_hits.tsv.gz          raw DIAMOND blastp hits, species column
<pathway>_loci_per_species.tsv         per-species best-hit per human query
<pathway>_pathway_cn_per_species.tsv   matrix tab: species × category → CN
diamond_db/  hits/                     cached DBs + per-species raw hits

Build scripts in scripts/aragonite/: - build_<pathway>_truth.py — fetch human canonical refs from UniProt - diamond_<pathway>_vs_cnidaria.sh — blastp at 25 threads, e≤1e-5 - build_<pathway>_matrix.py — filter + best-hit-per-query + matrix - Idempotent: re-fires only missing pieces; DBs cached cross-pathway.

Filters: pident ≥ 25, qcov ≥ 25-30, evalue ≤ 1e-5. CN counts DISTINCT cnidarian protein_ids matched per category (not raw hits) — so multiple human paralogs hitting the same cnidarian gene = 1 ortholog, the natural unit for deep-time orthology.

State ledger at ~/scratch/truth_carriers/state/runs.jsonl.

Carrier 1 — Complement (Reactome R-HSA-166658, 49 human refs)

Coverage: classical (C1QA/B/C, C1R, C1S), lectin (MBL2, FCN1/2/3, COLEC10/11, MASP1/2), C2/C4, alternative (CFB, CFD, CFP), C3-hub (C3, CFI, CFH, CFHR1-5), terminal-MAC (C5, C6, C7, C8A/B/G, C9), regulators (SERPING1, CD46/55/59, C4BPA/B, VTN, CLU), receptors (CR1, CR2, C3AR1, C5AR1, C5AR2, ITGAM, ITGB2).

Result: 4,156 raw DIAMOND hits → 2,767 after filters → 762 best-hit-per-query rows across 32 species (Henneguya/Myxobolus/ Thelohanellus mostly empty as expected for parasitic myxozoans).

Cnidaria-wide universal complement core (29-31/34 species): C1R, C1S, MASP1, MASP2, CFD, CFP, FCN1/2/3, ITGB2, plus C3 (26/34), C2 + C4A + C4B (27/34).

Headline finding — terminal MAC: Lophelia + Acropora digitifera + Exaiptasia + all hydrozoans carry 0 MAC pore genes (C6-C9). Stony corals and most anemones have C5 but no C6-C9. The pore-forming membrane-attack complex appears to be largely a vertebrate invention; cnidarians implement complement-mediated killing without the canonical lytic pore module.

Lophelia consolidated-hub pattern (single cnidarian locus covers multiple vertebrate paralogs):

Lophelia locus Vertebrate paralogs
KAJ7394693.1 C1R + C1S + MASP1 + MASP2 (one ancestral C1/MASP-like protease)
KAJ7391652.1 FCN1 + FCN2 + FCN3 (one ancestral ficolin)
KAJ7371504.1 C4A + C4B (one C4 thioester)
KAJ7383494.1 CR1 + CD55 (one SCR/Sushi receptor)
KAJ7383505.1 CR2 + C4BPB

This is the textbook deep-time complement architecture — vertebrate gene duplications collapse onto single cnidarian ancestors. Strong publishable story.

Lophelia C3 ortholog (named-human-truth): KAJ7334579.1 (pident 28.4, qcov 46.5, bit 276). Cross-check with the prior Pfam-based candidate OS493_014902 (A2M_BRD + ANATO) needed — these may be the same gene under different namespaces (KAJ7 = protein_id, OS493 = gene_id).

Carrier 2 — Mucin / SRCR (24 human refs)

Coverage: gel-mucins (MUC2/5AC/5B/6), membrane-tethered mucins (MUC1/13/15/17/20/21), soluble mucin (MUC7), DMBT1 (the keystone), SRCR scavenger (MSR1, MARCO, SCARA3/5), SRCR lymphoid (CD5, CD6, CD163, CD163L1), SRCR secreted (CD5L/AIM, SSC4D, SSC5D, LGALS3BP). Big tandem-repeat mucins (MUC3A, MUC4, MUC12, MUC16, MUC19) skipped — their PTS repeats don't carry deep-time orthology signal.

Result: 3,623 raw hits → 1,768 filtered → 242 best-hit rows across 31 species.

Headline finding — DMBT1 architecture: cross-checked the cnidarian DMBT1 best-hits against the cnidarian-immune Pfam scan (cnidaria_immune_hits.parquet). All 26 anthozoan DMBT1 hits carry SRCR alone (some with Ig_2, Sushi, or Ldl_recept_a accessory) — 0/26 carry the SRCR + CUB co-occurrence that defines vertebrate DMBT1. Confirms the earlier F4 cross-cnidaria Pfam finding (2,410 SRCR / 0 CUB co-occurrence across 31 species) now framed against the named human DMBT1 reference. Hydrozoans (4 spp) + Rhopilema = no DMBT1 best-hit at all; the protein is an anthozoan-specific signal.

Vertebrate-only families absent from cnidaria: - Membrane-tethered mucins (MUC1/13/15/17/20/21): 0/34 species - Gel-forming MUC2/5AC/5B (airway/intestinal big-gel): 0/34 - Mammalian SRCR scavengers MARCO/SCARA3/SCARA5: ≤2/34

Retained ancestrally (29-31/34 species): - CD163, CD163L1, CD6 (SRCR lymphoid core) - CD5L (AIM), SSC4D, SSC5D (SRCR secreted) - MUC6 (gastric gel-mucin — the ONE gel-mucin cnidaria retain)

Lophelia: 8 mucin/SRCR orthologs. DMBT1 best-hit KAJ7373414.1 (SRCR-only, no CUB; pident 32.5, qcov 28.8) — also the CD163L1 best-hit (same consolidation pattern as complement). MUC6 ortholog KAJ7372214.1.

Publishable statement: "Cnidaria carry abundant SRCR-domain proteins mapping to vertebrate CD163/CD163L1/CD6/CD5L/SSC4D/SSC5D but lack the SRCR+CUB co-occurrence that defines vertebrate DMBT1-mediated mucus immunity. The gel-forming mucin family (MUC2/5AC/5B) and the entire cytoplasmic-tail membrane-mucin family are absent from cnidaria — they represent vertebrate innovations. MUC6 alone is retained ancestrally."

Carrier 3 — MitoCarta3 (1,132 human refs)

Re-homed the existing workflow from ~/lophelia/data/mitocarta/ into the standardized truth_carriers/mitocarta/ layout. DIAMOND step already done (26,053 loci rows × 34 species). FAA carrier added now (1,130 fetched + 2 remapped: GATD3A P30042→P0DPI2, MYG1 Q86UA3→Q9HB07).

Matrix tab pattern:

Group total range notes
Anthozoa stony corals 765–918 top: Orbicella franksi 918
Lophelia 727 lower-middle anthozoan; OXPHOS 70, central dogma 119
Medusozoa 695–790 Hydra ~700
Octocorallia 721–807 intermediate
Myxozoa (parasites) 142–184 known anaerobic genome reduction

Universal cnidarian mito-core: 418 MitoCarta genes present in ≥30/34 species — OXPHOS subunits (ATP5F1A/B, SDHA, UQCRC1), TCA cycle (DLD, DLST, OAT, GLUD1/2), heme (FECH), import (HSPD1, PMPCB), ATAD3A/B nucleoid, mt-aaRS family. The eukaryotic-ancestral mitochondrial backbone.

Methodology — why "truth-carrier" framing matters

The Pfam domain-architecture approach (124-HMM cnidaria-immune atlas) gives the broad domain landscape — which Pfam domains are present in which species. It does NOT give named orthologs and over-counts when domains are promiscuous (Trypsin, TSP_1, EGF, Ldl_recept_a all in thousands of non-target proteins).

The truth-carrier approach (human canonical refs → DIAMOND → best-hit-per-query) gives named 1:1 orthology. For complement this is a 17× reduction (4,156 vs 70,444 hits) and produces a clean, defensible matrix tab keyed off the published gene names that reviewers can cross-reference against Reactome/KEGG.

Both layers are kept — the Pfam scan stays as the "what domains are where" map; the truth-carrier matrix is the "where are the named orthologs" map. Each carries different evidence and they answer different questions.

Carrier 4 — TLR / NLR / RLR (28 human refs)

Coverage: TLR family (TLR1-10), TIR adaptors (MYD88, TIRAP, TICAM1/2, SARM1), NLR-NOD (NOD1, NOD2), NLR-inflammasome (NLRP1/3/6/7), NLR-other (NLRC4/5, NLRX1, NAIP), RLR (DDX58/RIG-I, IFIH1/MDA5, DHX58/LGP2). 28 refs, 25.6 kb FAA.

Result: 2,594 raw hits → 149 best-hit rows across 31 species (the three myxozoans + Hydractinia/Hydra mostly empty).

Architecture cross-check (vs cnidaria_immune_hits.parquet): - Canonical TLR architecture (LRR + TIR): 35/37 cnidarian TLR best-hits carry both — the deep ancestral TLR fold is intact - Canonical NLR architecture (NACHT + LRR): 42/53 cnidarian NLR best-hits carry both — canonical NLR fold preserved

Compressed-but-architecturally-complete PRR signal: cnidarians retain the canonical TLR/NLR/RLR folds but use a small subset of the vertebrate-paralog gene family:

Layer Universal (≥28/34) Broad (10-22/34) Absent (0/34)
RLR (viral RNA) DDX58, IFIH1, DHX58
TLR adaptor MYD88 (22) TIRAP, TICAM1/2, SARM1
NLR_inflammasome NLRP3 (22) NLRP1
TLR TLR2 (15), TLR4 (11), TLR5 (10) TLR1, TLR3, TLR6, TLR7, TLR9, TLR10
NLR_NOD NOD2 (12), NOD1 (3)
NLR_other NLRC5 (7) NLRC4, NLRX1, NAIP

Publishable framing: "Cnidarian innate immunity retains the canonical TLR/NLR/RLR PRR architectures (LRR+TIR, NACHT+LRR, DEAD helicase folds) but uses a structurally minimal subset of the vertebrate gene families — viral-RNA sensing via RIG-I/MDA5/LGP2 is universal, surface-bacterial sensing via TLR2/4/5 + MyD88 + NLRP3 + NOD2 is preserved, but the endosomal nucleic-acid TLRs (TLR3/7/8/9) and most TIR-domain accessory adaptors (TIRAP/TRIF/TRAM/SARM1) are absent."

Lophelia closeup — 4 PRR orthologs, all architecturally canonical:

Function Lophelia locus Architecture
TLR4 KAJ7392015.1 ★ LRR_8 + TIR + TIR_2
MyD88 adaptor KAJ7371306.1 Death + TIR + TIR_2 (qcov 83.8%)
NLRP3 inflammasome KAJ7373972.1 ★ LRR_6 + NACHT
RIG-I/MDA5/LGP2 (consolidated) KAJ7375464.1 DEAD helicase — one locus covers all 3 vertebrate RLR paralogs

Same consolidated-hub signature seen in complement (one KAJ7 covers C1R+C1S+MASP1+MASP2) and SRCR (one KAJ7 covers DMBT1+CD163L1): vertebrate gene duplications collapse onto single ancestral cnidarian loci. This is now a four-carrier-replicated pattern — a publishable deep-time signal across complement, mucus-immunity, mitochondrial metabolism, and PRR sensing.

Open / running

  • DESeq2 (S5), site nginx vhost (sudo), helm.deepsek.no — unchanged from prior open list

2026-05-17 → 18 — heavy data-mining day

F4 closed cnidaria-wide (DMBT1 architecture absent across Cnidaria)

Three converging lines of evidence said DMBT1 architecture is absent in D. pertusum: (1) zero SRCR+CUB+ZP co-occurrence across the 8 cross-pop diff pools (4.3 M reads); (2) zero in us_mortar pool (axis-F tandem repeats); (3) zero in Loph_1.0 reference proteome (53 SRCR / 0 CUB).

Extended to 34 annotated cnidarian proteomes (Anthozoa → Hydrozoa → Scyphozoa → Myxozoa) via scan_cnidaria_immune_proteomes.sh + the 124-HMM immune atlas. Result: 2,410 SRCR-bearing proteins across 31 species (the 3 missing = myxozoan parasites with reduced genomes), ZERO with CUB / ZP / Mucin / VWD / Trefoil co-occurrence in 205,278 cnidarian proteins. Verdict cnidaria-wide: SRCR scavenger receptor family is ancient and expanded; the canonical mammalian DMBT1 scaffold is a vertebrate / mammalian innovation absent in Cnidaria.

Cross-cnidaria immune tblastn

124 nucleotide BLAST DBs (build_per_genome_blastdbs.sh) → tblastn 277 Lophelia immune candidate proteins (53 SRCR + 5 A2M + selected MACPF / Fibrinogen_C / Ldl_recept_a) against all 124 cnidarian genomes. 1.84 M hits, 120 ok / 4 skip / 0 fail. Top hit-count species: Dendronephthya (octocoral, 47K), Anthopleura (anemone, 46K), Aulactinia (32K), Madracis (30K).

Mito × Pfam annotation (closing the mito layer)

Full Pfam-A scan on 22,911 cnidarian mito-orthologs from the prebuilt mito_loci_per_species.tsv. 49,249 Pfam hits across 1,295 distinct domains; 22,382 (97.7%) of mito-ortholog proteins now have ≥1 Pfam-A domain call. Lophelia specifically: 706 of 727 mito-orthologs annotated (97.1% coverage). Output: ~/scratch/mito_pfam_scan/mito_loci_with_pfam.parquet.

RNA-seq full pipeline (per captain's plain-mapping mandate)

Rebuilt STAR index FASTA-only (no GTF priors per feedback_star_no_gtf.md). Stripped S3 to minimal STAR flags: --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --limitBAMsortRAM 30G. All 12 STAR jobs ran; 11 landed cleanly (NN1 broken-fastq, see below). featureCounts → 40,679-gene × 11-sample count matrix.

Per-region quantification + unannotated landscape

HOMER makeTagDirectory per sample + analyzeRepeats.pl on custom intergenic + intronic GTFs (built via bedtools complement / subtract on the gene GTF). Headline:

  • Exonic: 40,679 intervals, 4,365 expressed in both pops, 29,917 silent
  • Intronic: 153,855 intervals, 664 expressed in both pops, 151,724 silent (~1%)
  • Intergenic: 38,927 intervals, 6,941 expressed in both pops, 11,599 US-only-expressed

Per-sample annotation-region breakdown: ~61% of US reads are intergenic vs 47% Nordic (Nordic biased low by BAM-extract). Major unannotated transcriptional landscape — Loph_1.0's 40,679 Trinity- based gene models from Herrera 2023 capture only ~40% of expressed sequence in this dataset.

Genomic context for top intergenic candidates

intergenic_genomic_context.py — top 50 intergenic + 20 intronic by US max count, with flanking gene products + Pfam categories. Highlights:

  • INTG_0012893 — 9.2 kb spacer between two Trypsin / MASP genes (complement-pathway hotspot), US-only expressed
  • INTR_0087738 — 4.9 kb intronic transcript inside Neurotrypsin (SRCR) host gene
  • INTR_0097491 — inside a ShK toxin domain host
  • INTR_0147093 — inside a TNFR_c6 host gene
  • INTG_0003055 — 67.9 kb intergenic spacer with 42K NOR reads (could be unannotated lncRNA / dropped gene model)

Output: ~/scratch/intergenic_context/intergenic_context.{parquet,tsv}.

Expression of immune + mito candidate genes (≥100 reads cutoff)

expressed_immune_mito_genes.py — gene_id↔protein_id map via GTF CDS records → categorise by Pfam family + MitoCarta membership → ≥100-read cutoff per sample per population:

category n NOR US both silent
SRCR 53 11 24 9 27
Complement (any) 577 157 239 116 297
Mucin-anchor 0
MitoCarta 635 440 589 428 34
Mito × Immune 2 2 2 2 0

Top balanced SRCR: OS493_025544 (NOR 6,341 / US 6,288). Top US-only complement: OS493_028292 Trypsin (NOR 4 / US 85,684 — likely off-subset BAM-extract artefact). Output: ~/scratch/expressed_immune_mito/expressed_candidate_genes.{parquet,tsv}.

Job D pivot — abandoned full Pfam-A in favour of curated atlas

Initial plan was full Pfam-A (25,545 HMMs) on all 8 diff pools. Two issues: (i) HMMER 3.3.2 on super silently reads only first HMM from multi-HMM file when input is gzipped — switched to HMMER 3.4 from super's protein_db env + decompress-to-tmp pattern; (ii) even fixed, projected wall ~50-200 hr for the heavy pools. Killed and rebuilt with a curated 219-HMM atlas (union of 124-immune + mito-immune nexus + top hits from Loph_1.0 proteome + top hits from mito-orthologs Pfam scan). Re-fired. Stalled on smallest pool nor_shared (16 GB protein file, ~10 hr/pool projected) — captain killed; 4/8 small pools (privates + strict_private) landed cleanly. Parquet fix-pass ran manually (python trx50_rna env had pandas, base did not).

Captain steered through

  • "STAR plain mapping no fancy" → GTF-free index, minimal flags
  • "BBTools, hardcore programmer Brian Bushnell" → repair.sh post-BAM-extract
  • "if one sample is wrong flag and move on" → all S3 scripts patched to set -uo pipefail + per-sample if ! cmd; then log_ledger failed; continue instead of set -e killing the batch
  • "fire job B / lncrna / homer / unannotated" — sequential decisions reflected in the day's pipeline order

Open / running

  • NN1 rescue (firing now, 2026-05-18): fastp strict structural validation → repair.sh re-pair → STAR smoke test → full re-align. Goal: recover one clean Nordic sample (Tisler raw cutadapt2 L006), rather than wait for raw fastqs on other servers.
  • 142-subset filter (queued): restrict HOMER intergenic / intronic / exonic to the 142 contigs the Nordic BAM-extracts had honest mapping to → honest cross-pop expression comparison without the BAM-extract caveat.
  • DESeq2 (S5) still blocked on captain decisions: D-vs-NN biology, headline contrast, NN1 fate (resolved once rescue lands).

Site / docs

autoloop.deepsek.no: Roadmap & maps tab added; Mucins + Complement cards refreshed with RNA-seq expression evidence. helm.deepsek.no refactored as multi-project fleet hub. Neither vhost is nginx-installed yet — pending captain's sudo step.

docs/operating_model.md (the playbook) + docs/state_architecture.md (parquet conventions) + docs/fleet_tools.md (per-box tool matrix) landed earlier this week as the durable contracts.

2026-05-16 — fleet operating model + DMBT1 question closed

Operating model codified

Three new docs land together as the chief-mate steering pack:

  • docs/operating_model.md (673 lines, 15 sections) — the playbook for running multi-host comparative-genomics projects. Captain (Bård) sets course, chief mate (Claude session) orchestrates. NFS + markdown + ledger is the spine. SSH + tmux is the only execution shortcut.
  • docs/fleet_tools.md — comprehensive inventory of what's on each box: hardware, system-wide PATH, conda envs (per-env package versions),
  • shopping list of sudo installs.
  • docs/state_architecture.md (already in place; cross-referenced here) — where outputs live, hive-partitioned parquet schema, ledger discipline.

Plus the web-facing rendering: helm.deepsek.no vhost prepared (static HTML, seafoam accent, helm-wheel SVG). Captain + chief-mate metaphor explicit. Setup script + nginx conf in deepsek_web repo; user runs helm-site.sh install + certbot --expand when ready.

Fire wrapper + SSH server live

scripts/fire.sh (commit 862322e) + docs/ssh_setup.md. Dedicated SSH key per offload box, custom ports (super 45497, trx50 4540). Verified end-to-end with a 3-second hello job on each box. Then used in anger for the day's real work — env_extras install, S1_nordic extract, S3_us align, job_b_us_mortar.

Hominidae host-mask check (super)

Ran run_host_mask.sh on super against the 4 private/strict_private kraken-classified pools. Cross-pop signal essentially unchanged:

  • Nordic total hits: 3,880 → 3,877 clean (–0.08%)
  • US total hits: 4,545 → 4,476 clean (–1.5%)
  • Nordic SRCR: 164 → 164 (no change); US SRCR: 192 → 188 (–2%)
  • Nordic Fibrinogen_C: 828 → 825; US Fib_C: 1,294 → 1,273 (–1.6%)

Conclusion: cross-pop signal is mostly real biology, not host contamination. US-vs-Nordic differentials drop from +17% to +15% (SRCR) and +56% to +54% (Ficolin) — minor adjustments, headline story intact. Autoloop site cards stand as-is.

DMBT1 architecture question — closed

Three independent lines of evidence converged today:

evidence source SRCR CUB ZP Mucin VWD Trefoil
8 cross-pop diff pools (4.3 M reads) 356 0 0 0 0 0
us_mortar pool (Job B, 769 MB) 31 0 0 0 0 0
Loph_1.0 reference proteome (37,484) 128 0 0 0 0 0

The 128 SRCRs in the Loph_1.0 proteome are spread across 53 distinct proteins. Their non-SRCR co-occurring domains: Ig_2 (×4), TSP1_CFP_C, TSP_1, Trypsin, Pkinase, Ldl_recept_a (×1 each). Zero CUB / ZP / Mucin / VWD / Trefoil — anywhere.

Verdict: interpretation (b) wins — the classical DMBT1 architecture (SRCR + CUB + ZP) is genuinely absent from D. pertusum. Cnidarian mucus immunity in this species uses SRCR-only PRRs, not the mammalian-style CUB/ZP-anchored DMBT1 scaffold. F4 has its answer.

Job A on super (per-pop conserved-bulk, 8-16 hr) is no longer urgent for the DMBT1 question. Still has value as a broad cross-pop immune atlas, but it's no longer blocking F4. Defer or fire when convenient.

trx50 pipeline status

  • env_extras (DESeq2 1.50.2 + rtracklayer 1.70.1 + samtools 1.23.1 + pyarrow + ...) — done after three solver-error attempts (had to bootstrap conda-libmamba-solver into base first).
  • S1_nordic_bam_extract — done in 5.5 min wall (5 samples, ~66s avg).
  • S3_star_align_us — running (6 US samples sequential, ~3-4 hr at THREADS=25; --outSAMunmapped Within per the BAM lesson learned earlier this week).
  • S3_star_align_nordic — queued behind S3_us. Asymmetric data: NN1 from raw cutadapt2 L006 fastqs, D2/D3/D5/NN3/NN5 from BAM-extracted (mapped-only against the 142-contig subset; 33.7% gene-blackhole caveat documented in INTRO.md).
  • S4 featureCounts + S5 DESeq2 + S6 immune overlay — scripts not yet written; will follow when S3 lands.

Per-host budget locked

THREADS: t5 = 25, super = 40, trx50 = 25. Updated in trx50 INTRO and memory feedback_compute_safety.md so future sessions don't re-derive.


2026-05-13 (afternoon) — Phase C: US 10% real-scale test + RNA-bridge end-to-end

Goal: take the 8-step skeleton out for a real-scale test, produce the headline deliverables (per-read parquet DB + bridge graph), surface any scaling issues we missed in the small tests. Knob: --fraction 0.10 on the US BAM (~207k reads).

What ran clean:

stage walltime result
S0_subsample 243 s 206,793 reads at seed=19
M0_partition 593 s 39,438 architects + 157,729 mortar (19.0% architect — consistent with 30%)
M1_characterize v2 kmc 200 s 78,876 flank rows in architects_flanks_us.tsv.gz
B0_facts 0 s 1.1 MB parquet (reads_pop=us_run=us_pct10_2026-05-13.parquet)
B1f_a fq2fa 7 s 39,438 architect sequences
B1f_b translate 49 s 236,628 proteins (6 frames × 39,438)
B1f_c press_check 0 s Pfam-A.hmm pressed DB present
B1f_d split_aa 1 s 25 chunks × 9,466 proteins
B1f_e hmmscan killed after 3 hr 0/25 chunks landed (see below)
B1f_mortar skipped pre-populated placeholder + ledger row
B2_kraken_raw 17 s kraken_pop=us_*_db=PlusPF.parquet (8.4 MB)
B2_kraken_c005 17 s *-c0.05.parquet (8.3 MB)
RNA-bridge R0 2166 s 27.5 M pairs extracted, 23.3 M / 56.1 M records mapped (41.5%)
RNA-bridge R1 99 s 8,635 per-contig bridge parquets
RNA-bridge R2 2 s bridge_graph.json8,635 nodes, 15,344 edges
RNA-bridge R3 3 s pseudo_scaffolds.fa — 230 MB

The B1f_e_hmmscan kill — long-protein pathology.

Smoke test was on smallest contig (MU825479.1, 40 architects → 240 short proteins, 91 s wall on 1 chunk). At 10% scale the per-protein cost exploded by ~4× because the protein length distribution is brutal:

chunk_001 protein length distribution
  <5k aa    :     0
  5k-10k    :  7,676
  10k-20k   :  1,745
  >20k      :     45   (max 29,705 aa)
  N = 9,466 proteins, mean 8,334 aa

6-frame translation of long PacBio reads with only -m 50 (min) filter produces enormous stretches. HMMER's Plan7 is O(L×N), so per-query time on a 29 kaa sequence × 20 k HMMs is unbounded. After 3 hours, every chunk's .tmp had only 18–38 hit lines and mtime hadn't moved in 28 minutes — every chunk was stuck inside a single long query.

Followup: re-architect run_hmmer_b1_full_pfam_v1.sh with a --max-aa cap (e.g. 5000) at the translate stage, or per-protein windowing for queries above the cap. Logged in feedback_compute_safety.md along with the broader lesson: small tests must sample input DIVERSITY, not just SIZE. The smoke had the right input size (240 proteins) but the wrong input shape (all from the short tail of the read-length distribution).

The concurrency-violation incident.

US 10% and RNA-bridge launched concurrently. Load hit 58 on a 25-core box. User called it out: "one task uses 25 threads for headroom". Updated feedback_pipeline_design.md rule 3: synchronize before launching the next task — between pipelines, not just within them. Pattern is script_A.sh && script_B.sh. The wall-time "savings" of parallel runs are illusory — kernel context-switch overhead + contention slow both runs — and the resulting per-stage timings are useless for benchmarking. Sequentially-collected timings are the only honest data for the operational map.

The pre-populated-ledger pattern.

Cancelled B1f_mortar mid-run by writing an empty-schema placeholder parquet at the expected path + recording a B1_hmmer_full_pfam ledger row for pop=us_mortar run=us_pct10_2026-05-13_mortar with skipped_by_user=true. When bash reached the mortar block, its existing idempotency check (ledger_check && [ -s parquet ]) fired the skip branch — driver didn't need to be modified mid-run. Same pattern let us run B2-only afterwards (--skip B1f flag + the existing ledger made S0/M0/M1/B0 skip cleanly, only B2 actually ran for 34 s total).

Final deliverables present at expected paths:

us_pct10/reads_dir/reads_pop=us_run=us_pct10_2026-05-13.parquet   1.1 MB   B0 per-read DB
us_pct10/annotations/kraken_pop=us_*_db=PlusPF.parquet            8.4 MB   B2 raw
us_pct10/annotations/kraken_pop=us_*_db=PlusPF-c0.05.parquet      8.3 MB   B2 c0.05
us_pct10/annotations/hmmer_pop=us_mortar_*_db=PfamAfull.parquet    12 KB   placeholder (skipped)
rna_bridge_us_pct10/graph/bridge_graph.json                       3.9 MB   PROPOSED CONTIG GRAPH
rna_bridge_us_pct10/graph/bridge_graph_{edges,nodes}.parquet     ~192 KB
rna_bridge_us_pct10/scaffolds/pseudo_scaffolds.fa                 230 MB   pseudo-scaffolds
us_pct10/manifest/manifest.md, manifest.tsv                              17,295 files / 14 categories

Missing from the "every read documented" picture: the full-Pfam-A HMM annotation parquets (architects + mortar). Deferred until the B1f redesign with length cap lands. The kraken + axis-F + RNA-bridge layers are complete.

New scripts in this session:

  • scripts/aragonite/rna_bridge_chain.sh — R0→R1→R2→R3 chain-driver, records ledger entries per substep, sync barriers between R0 and R1 (BAM presence), R1 and R2 (file count), R2 and R3 (graph JSON parse).
  • scripts/aragonite/inventory_run.py — walks a run dir tree, categorizes files (parquet/json/fasta/log/etc), joins against the ledger by run_id, emits a manifest.md + manifest.tsv. Reads the bridge_graph.json directly to surface the top-N edges as a publication-style table inside the manifest.

Pipeline operational viz (pipeline_overview_v1.html at http://localhost:8765/) refreshed; it now shows the killed-B1f and the completed B2 + the pre-populated mortar row, and includes the ledger snapshot grouped by run_id with the newest 8 runs.


2026-05-13 (evening) — Phase B: 8-step pipeline skeleton, RNA-bridge stack built

Build pass: after the broad-first decision, expanded the work surface to all 8 anticipated pipeline steps:

1. run_hmmer_b1_full_pfam_v1.sh        [Phase A — workhorse]
2. run_annotation_loop_full_pfam_v1.sh [Phase A — per-contig]
3. run_us30_b1_full_pfam_architects.sh [Phase A — wrapper]
4. run_us30_b1_full_pfam_mortar.sh     [Phase A — wrapper]
5. run_us_pipeline.sh                  [Phase B — generic US pipeline,
                                                  --fraction param]
6. run_nordic_tisler_pipeline.sh       [Phase B — Nordic CLR, raw FASTQ]
7. rna_bridge_R0..R3                   [Phase B — 4 scripts, the
                                                  contig puzzler]
8. scripts/immune/q_immune_atlas_hits  [Phase A — Phase-2 query stub]

Each script has a --test / --smoke / --self-test mode that runs in seconds-to-minutes against a tiny canned input. Tweaking later becomes cheap — the data path is wired end-to-end with synthetic stand-ins.

Phase B smoke results (commit fa6c80f):

step smoke wall result
5 US --fraction 0.01 --seed 137 109 s 20,882 reads → 3,972 architects + 15,881 mortar → 7,944 flank rows → 131 KB parquet, all 4 ledger entries
6 Nordic --smoke (F01_1 head 4k) 4 s 975 reads (25 below 500 bp floor) → 195 architects + 780 mortar
7 R0 Dahl-RNA1-D2 × 1k pairs 1 s 835 pairs extracted, 324/1670 primary mapped to MU826351.1 grains (19.4%)
7 R1 synthetic 10-pair BAM <1 s 5 same-contig dropped, 3 A↔B + 2 B↔C extracted, 3 per-contig parquets
7 R2 synthetic (R1 output) <1 s 3 nodes, 2 edges (A↔B support 3, B↔C support 2)
7 R3 synthetic (R2 graph) <1 s 1 scaffold, 3 contigs, length 400 (100+50+100+50+100), path A → B → C

Notable findings from the Phase B smokes:

  1. CLR axis-F partition transfers cleanly. Nordic CLR M0 produced 19.5% architect fraction (195 / 975 reads); US Sequel M0 on its 30% slice produced 19.0% architect fraction (118,250 / 591,228). The axis-F length-based partition rule apparently doesn't need recalibration for CLR vs Sequel — the biology (modal read length, top-quintile threshold) is similar enough. M1's per-end k-mer/entropy thresholds remain to be cross-checked on a non-trivial Nordic N.

  2. R0 silent-failure bug caught + fixed. First R0 smoke reported "done in 2s (/ records mapped)" but the pipeline actually failed at samtools sort (duplicate g_1182 SAM @SQ headers from concat'ing two per-contig grains with shared local naming). The pipeline error was silenced because the script's set -uo pipefail didn't trigger set -e. Added explicit if ! ( bwa-mem2 | samtools sort ) guard with cleanup of partial .tmp files. Rule 3 (sync before next stage) reinforced.

  3. Target-FASTA prep is the user's responsibility, not the pipeline's. When concat'ing per-contig grains across the multi-contig loop, the same contig appears in _nor and _us dirs — prefix grain IDs by FULL dir name (not just contig), OR use ONE per-contig grains file at a time. Documented in R0's header.

Discussion (deferred to backlog): Trinity-assembled transcripts would give ~10× stronger bridge evidence per item via split alignments (one transcript = one categorical "this gene crosses A→B"). Drop-in as R0b; R1/R2/R3 unchanged. Cost ~12–24 hr Trinity assembly + ~100 GB RAM. Order: pair-evidence first at real scale, Trinity as publication-quality overlay later. Logged in project_backlog_rna_bridge_scaffolding.md.


Sessions's net state:

git log --oneline (today)
  fa6c80f aragonite Phase B: steps 5/6/7 real scripts + small tests pass
  53f4f38 aragonite Phase A: full-Pfam infrastructure + 8-step skeleton
  c45156a aragonite: state_ledger.py + extract_xpop_diff_pools.sh
  f073358 aragonite B1 v2: parallel hmmsearch + substep ledger, v1 retired
  85d945f aragonite M1 v2: kmc-backed streaming, v1 retired

8 steps. Each has a real script. Each has a small test that passes. Each is idempotent via the ledger. The narrow immune-Pfam path is kept as fast-debug; the broad full-Pfam path is canonical. Mortar gets the same library as architects. The RNA-bridge stack is built and synthetic-tested; ready for real-scale runs the moment the substrate (Nordic ARAGONITE contigs) is available.

What's NOT yet done (each is independent, each idempotent): - Full per-contig loop with full-Pfam (10 contig×pop, ~45 min) - US 30% architects+mortar full-Pfam (~15 hr combined) - US fraction=1.0 — the canonical "every US PB read documented" - Nordic 30% S0→B2 (the cross-pop comparison axis) - RNA-bridge R0→R3 end-to-end on real RNA + real contig substrate - Trinity overlay (deferred) - Immune queries (Phase 2 — atlas as predicate)


2026-05-13 (afternoon → evening) — Phase A: broad-first annotation, full-Pfam infrastructure

The shift, surfaced by the user: the immune-Pfam atlas (lophelia/data/diagnostic_pfams.txt, 124 cnidaria-relevant Pfams curated from InnateDB) was being used as a scan-time filter in B1 production. That's the wrong layer — the atlas belongs as a query-time predicate over a fully-documented per-read DB. The user's restatement:

"my goal was that we had all us pacs firstly documented in the db / parquet, so in the end we could do proper immune-related stitching via our rna seq reads, we should have built the strategy around this principle, so every read is documented, if its that or this etc."

Extended feedback_dignity_of_every_read.md to cover the annotation layer — every read deserves full Pfam-A annotation (20k HMMs, not 124) in the canonical parquet. The immune scan stays as a fast-debug iteration loop (~10 min), not as the production default. Two anti-patterns named: pre-filter at scan time, and scanning architects only (mortar deserves the same library).

Built (commit 53f4f38): - run_hmmer_b1_full_pfam_v1.sh — 7-substep workhorse for any fastq input. Uses chunked hmmscan against the pressed Pfam-A binary DB (~2.4 GB mmap-shared across worker processes). Adaptive chunking: 1 chunk for tiny inputs (DB-load dominates), up to THREADS chunks for large inputs. Substep markers are per-chunk .domtblout files. Substrate tag in ledger (architects / mortar / etc). - run_annotation_loop_full_pfam_v1.sh — per-contig loop driver. - run_us30_b1_full_pfam_{architects,mortar}.sh — thin wrappers with --test mode (100-read smoke). - scripts/immune/q_immune_atlas_hits.py — Phase-2 query stub. Demonstrates atlas-as-predicate pattern; --test passes synthetic 3-row filter ✓.

Smoke tests: - Smallest contig (MU825479.1 nor, 40 architects → 240 proteins): 91 s wall, 119 MB RSS, 1 chunk. 1 hit: DUF5641 — found by full-Pfam, missed by immune-only. Principle validated on first invocation. - Smallest contig pair via the per-contig loop driver: 380 s wall (nor 90 s + us 290 s sequential), both PfamAfull.parquet files in loop_annotations/.


2026-05-13 — B1 OOM-class miss #2 — `hmmscan` wrong direction, parallel `hmmsearch` v2

Incident. After M1 v2 landed, B1 started under the production driver and stayed in hmmscan for 22 min. top showed 180% CPU on the python process despite --cpu 25. ETA was ~73 min for what should have been a 10-minute job. Caught the inefficiency mid-run, killed the driver tree, redesigned.

Why it was slow. HMMER's docs are explicit: hmmscan (sequences-vs-HMMs) parallelizes within a single query sequence. With 709k short proteins × 124 HMMs, there's almost no per-query work to split; effective parallelism saturates at ~2 cores. hmmsearch (HMMs-vs-sequences) inverts the dimension and is recommended for this workload shape.

Redesign. run_hmmer_b1_v2.sh:

  1. B1_a_fq2fazcat ... | seqkit fq2fa
  2. B1_b_translate — 6 frames, -m 50 min aa filter
  3. B1_c_hmm_indexhmmfetch --index (one-time)
  4. B1_d_hmmsearchxargs -P 25 across 124 HMMs, single-threaded each, .tmpmv atomic rename. Per-HMM domtblouts ARE the substep markers.
  5. B1_e_concatcat hmmsearch/*.domtblout > hmmer.domtblout only after a hard barrier verifies all 124 files exist.
  6. B1_f_ingest — parse → parquet. ingest_hmmer.py now auto-detects hmmsearch vs hmmscan by reading the # Pipeline mode: footer (column-1 flips between the two tools).

Each of the 6 substeps records its own line in state/runs.jsonl. A re-run after any crash resumes from the first incomplete substep; in particular, an interrupted hmmsearch fan-out re-runs only the HMMs whose .domtblout is missing.

Graded test (under /usr/bin/time -v):

stage reads elapsed peak RSS
b1_1k 1,000 4 s 114 MB
b1_10k 10,000 30 s 115 MB
b1_full 118,250 611 s 116 MB

Peak RSS is basically flat because the orchestrator is a thin bash + python wrapper; the work runs inside hmmsearch subprocesses. Per-substep breakdown on the full 118k:

B1_a_fq2fa         19 s
B1_b_translate    137 s   709,500 proteins (6 frames × 118k arch)
B1_c_hmm_index      0 s   (salvaged from prior test)
B1_d_hmmsearch    453 s   124 HMMs, xargs -P 25, ~90s per single hmmsearch
B1_e_concat         0 s
B1_f_ingest         0 s
total             611 s   (vs v1 projected ~73 min, ~7× faster e2e,
                          ~10× on the search step alone)

Biological readout — 30% US subset, 118,250 architects, 124-HMM immune Pfam library:

  • 578 domain hits across 484 unique reads
  • 37 distinct immune domains detected
  • Top by hit count:
domain hits
Fibrinogen_C 178
Trypsin 62
TSP_1 53
PK_Tyr_Ser-Thr 43
ubiquitin 31
TRAF-mep_MATH 30
Arf 23
Ank 21
SRCR 17
Forkhead 15

SRCR — the user's prime story-target domain (mucus DMBT1/SRCR family immunity) — sits in the top 10 even on the 30% subset. Worth a closer look at those 17 hits when we move to the full-population run.

v1 ↔ v2 verification. On the 1k correctness test, the two parquets are byte-identical (same read_id × frame × domain_name × domain_acc tuples). The raw domtblout files DIFFER textually because hmmscan and hmmsearch flip target↔query columns — ingest_hmmer.py auto-detects and normalizes, so downstream consumers see identical data.

Landed. - run_hmmer_b1_v2.sh is the new production B1 implementation. - run_hmmer_b1.shrun_hmmer_b1_v1_underthreaded.sh, retirement header in place, kept as archived reference. - ingest_hmmer.py gained --input-format {hmmscan,hmmsearch,auto} with auto reading the footer; default is auto. Back-compat for any caller that previously didn't pass the flag. - run_us_pct30.sh driver M1 line edited to call v2 with LEDGER=$LEDGER THREADS=$THREADS. - B1 outputs (architects.fa, architects_aa.fa, hmmer.domtblout, hmmer_pop=us_run=...parquet) copied into the canonical us_pct30 workdir from the v2_test run. - Six B1 substep entries plus the rolled-up B1_hmmer recorded in the canonical ledger with completed_via=v2_test_2026-05-13.

Rule 4 working live. After landing, restarting the driver hit B0 skip / B1 skip immediately and moved straight to B2 kraken. The ~10 minutes of completed B1 work was NOT re-run — first concrete test of the substep-ledger discipline in production.

This is the case study for feedback_pipeline_design.md. Both the M1 OOM (rule 2) and the B1 1.8-core throughput (rule 1) were the same kind of miss — picked the wrong primitive, not just a wrong flag. The four-rule discipline catches both classes prospectively.


2026-05-13 — M1 OOM postmortem and kmc-backed v2

Incident. At 21:39:59 on 2026-05-12 the M1 stage of the 30% US Tisler run was OOM-killed. Kernel log was unambiguous:

Out of memory: Killed process 1078805 (python3)
  total-vm:223529488kB  anon-rss:189774672kB

~181 GB resident. The system has 251 GB, Chrome held the rest, kernel went global OOM and took out a Chrome render process alongside python.

Why it died. aragonite_characterize.py (v1) built the architect-pool k-mer histogram in a Python Counter[int, int] keyed on canonical 21-mers, single-process, in RAM. At ~120k architects (~3.5 Gbp) the dict held 2.5B unique k-mers × ~75 B/entry → ~190 GB. The synth run (1,212 reads) passed at <1 GB; nothing scale-tested.

Redesign. aragonite_characterize_v2.py keeps the same output schema but routes the global histogram through kmc on disk:

  1. kmc -k21 -m32 -t25 builds the full-pool DB on disk.
  2. Stream architects, emit 5'/3' window FASTA, kmc that → windows DB.
  3. kmc_tools simple full ∩ windows -ocleft → counts only for k-mers that any window actually queries.
  4. kmc_tools transform dump -s → load into sorted numpy arrays (uint64 keys + uint32 counts).
  5. Second streaming pass: per-window np.searchsorted for mean/min/p50.

In-RAM lookup is bounded by window k-mer diversity, not pool diversity.

Graded scale test (under /usr/bin/time -v):

stage reads input elapsed peak RSS
us_1k 1,000 8 MB 7 s 663 MB
us_10k 10,000 72 MB 49 s 2.59 GB
us_full 118,250 839 MB 612 s 22 GB

10× input → 4× RSS (sub-linear) → 12× input → 8× RSS. Bound is the 1.3 GB lookup arrays + kmc subprocess at its -m32 cap. v1 vs v2 diff on synth: 2,425 rows match byte-for-byte except the three kmer-freq columns, which agree to two decimals.

Read-class distribution (US, 30% subset, 118,250 architects, v2):

class count fraction
EXPANDER 96,562 81.7%
ANCHOR 16,344 13.8%
BURIED 5,278 4.5%
SUSPICIOUS 66 0.1%

Landed. - v1 renamed to aragonite_characterize_v1_oom.py with a retirement header. Kept as archived reference, not deleted. - v2 wired into run_us_pct30.sh M1 stage with --threads 25 --kmc-mem-gb 32. - Test output copied into canonical us_pct30/architects_flanks_us.tsv.gz so B0/B1/B2 can resume without re-running M1. - Ledger entry recorded: M1_characterize completed, tool_version=v2_kmc, walltime_s=612, peak_rss_mb=21986. - Hard rule in feedback_compute_safety.md: no unbounded in-process counter/dict at biology scale; small-test-first for correctness, then graded scale-test-second (1% → 5% → 10% under /usr/bin/time -v) before scaling.

Audit-trail wart left in place. state/runs.jsonl has two M0_partition entries for the same run_id (23,356 then 118,250 architects, ~25 min apart). The first M0 ran to completion on a truncated stream after samtools fastq choked; the second M0 ran on the full subsample. Both legitimately completed, ledger appended both, check finds the first and returns 0 so idempotency still works. Not patched — the duplicate is itself the most honest record of what happened. If this pattern recurs, add a supersedes field rather than collapsing.


2026-05-12 — multi-contig test loop (5 new contigs, 105 kb → 10.7 Mb)

run_aragonite_loop.sh ran 5 new contigs × 2 pops with default parameters, both pops parallel per contig.

contig size Nor cov US cov Nor longest US longest walltime
MU825479 105 kb 36.0% 79.5% 7 kb (7%) 74 kb (70%) 3s
MU825436 609 kb 88.7% 99.9% 108 kb (18%) 590 kb (97%) 19s
MU825420 980 kb 83.7% 95.8% 97 kb (10%) 492 kb (50%) 24s
MU826351 5.2 Mb 87.8% 98.3% 266 kb (5%) 918 kb (18%) 158s
MU825396 10.7 Mb 92.3% 99.6% 281 kb (3%) 2.83 Mb (27%) 416s

The 10.7 Mb contig assembles at 99.6% US Sequel coverage, with a 2.83 Mb single contiguous block (27% in one piece). Walltime 416 sec (~7 min) parallel on 25 cores.

Pattern across scales (US Sequel): 80% → 99.9% → 96% → 98% → 99.6%. US ceilings around 99-100% except for very small contigs where read substrate is insufficient (105 kb / 40 architects barely supports OLC).

Pattern across scales (Nordic CLR): 36% → 89% → 84% → 88% → 92%. Increases with contig size as more substrate becomes available. Caps around 92%, fragments more aggressively (longest-block fraction drops from 7% at 105 kb to 3% at 10.7 Mb).

Walltime scaling: roughly linear with contig size (3s on 105 kb, 416s on 10.7 Mb = ~140× for 100× scale).

Across all 8 contigs (5 new + 3 prior at MU826414, MU827843, MU826831): - Mean Nordic coverage: ~82% - Mean US coverage: ~96% - US Sequel handles every contig at 80%+, most at 95%+ - Nordic CLR more variable (36-93%), depends on substrate density

Viz: loop_multi_contig_v1.html with size-vs-coverage scatter plot (nor↔us connected per contig) and longest-block-fraction chart.

The "autolearn" angle the data surfaces: per-read characterization matters most at the limits. The 105 kb contig fails because at 24 architects there's insufficient substrate for the greedy walker. Larger contigs hit a different limit — fragmentation due to internal repeat structure aragonite doesn't yet see (only end-window features). This motivates the per-read internal characterization backlog (project_backlog_per_read_internal_characterization.md): three paths — sliding-window scan along reads, de novo repeat-family clustering, kraken2 per-read tagging. Each architect read deserves to be characterized as a full BAC equivalent, not just its ends.

The "dignity of every read" principle (feedback_dignity_of_every_read.md) saved alongside.


2026-05-11 — third region: MU826831.1 (DMBT1-like SRCR locus, 3.2 Mb contig)

The DMBT1 test — this project's central biology question.

The only SRCR-domain hit (Pfam PF00530) in loph10_immune_loci.bed sits at MU826831.1:2,210,737-2,255,342 within a 3.2 Mb contig — roughly 10× larger than prior test regions. PF00530 = scavenger receptor cysteine-rich domain, the defining domain of DMBT1 and the cnidarian SRCR-family PRRs central to mucus immunity.

metric Nordic CLR US Sequel
reads on contig ~16,000 ~13,000
architects 3,210 2,546
grains formed 2,313 1,259
main / sister 2,000 / 313 648 / 611
truth coverage 92.7% 99.5%
longest contig 277 kb 1.70 Mb (53% in one piece)
merged blocks 251 14
SRCR locus (44.6 kb) coverage 100% 100%

The biology: the DMBT1-like SRCR locus is fully recovered in both Nordic and US PacBio data. The 44.6 kb cluster is intact, contig context preserved. This is the central immune-genomics target of the project: it survives both chemistries' assembly walls.

The chemistry comparison: US assembles 53% of a 3.2 Mb contig in a single contiguous block. That's near-chromosome-scale assembly with a 2,700-line modular tool. Nordic fragments at 251 blocks (mean ~12 kb), consistent with its 5.5 kb mean read length hitting LINE/LTR-scale repeats.

Walltime: about 5-10 min for each pop in parallel on 25 cores (3.2 Mb scale). Pipeline scales linearly with input.

Cross-region table:

contig size content Nordic cov US cov Nordic longest US longest
MU826414.1 280 kb random 92.1% 93.4% 104 kb 249 kb
MU827843.1 273 kb complement 76.6% 99.8% 39 kb 248 kb
MU826831.1 3.2 Mb DMBT1/SRCR 92.7% 99.5% 277 kb 1.70 Mb

US Sequel handles all three regions at 93-99.8% coverage with very long contiguous blocks. Nordic CLR is more variable (77-92%) and always more fragmented. The pattern across loci is consistent: chemistry > biology for assembly contiguity at this scale.

Viz pages: lophelia_mu826831_dmbt1_nor_v1.html and lophelia_mu826831_dmbt1_us_v1.html.


2026-05-11 — second region: MU827843.1 (complement C6/C7/C8/C9 contig)

Picked from data/loph10_immune_loci.bed for biological relevance. 273 kb contig carrying two complement C6/C7/C8/C9 paralogs (the gene family this project is built around).

metric Nordic CLR US Sequel
reads on region 175 139
architects 175 139
grains formed 136 70
main / sister 123 / 13 43 / 27
truth coverage 76.6% 99.8%
longest contig 38.9 kb 247.8 kb
merged blocks 27 2

US's aragonite assembly of this complement-bearing contig is near-perfect — 99.8% covered, in just 2 pieces, longest of 247.8 kb (91% of the contig in one block). Nordic gets 76.6% in 27 fragments.

This means the complement gene cluster is, in principle, recoverable from US Sequel reads alone using a 2,700-line modular OLC assembler. The biology the project is built around survives the chemistry-and-length wall.

Comparison to first region (MU826414.1, 280 kb, random):

MU826414.1 MU827843.1 (complement)
Nordic coverage 92.1% 76.6%
US coverage 93.4% 99.8%

MU827843.1 is HARDER for Nordic (76 vs 92) but EASIER for US (99.8 vs 93). Different repeat landscape between contigs probably explains it. The complement contig may have fewer 5–9 kb repeats (where Nordic CLR breaks) but tractable 10–15 kb repeats (which US Sequel reads span).

Viz pages: lophelia_mu827843_nor_v1.html and lophelia_mu827843_us_v1.html.


2026-05-11 — M5 cross-pop: US on the same MU826414.1 region

Re-ran the aragonite pipeline on the US Sequel BAM, same Loph_1.0 contig MU826414.1. Per-pop independence (corrected vision): two separate runs, truth contig used only as a final comparison lens.

Nordic vs US on identical truth target:

metric Nordic (CLR) US (Sequel)
reads on region 1,044 152
architects 209 152
EXPANDER fraction 94.7% 87.5%
grains formed 143 77
main / sister 118 / 25 43 / 34
allele-pair links 15 19
truth coverage 92.1% 93.4%
merged blocks 15 5
longest contiguous block 104.5 kb 249.3 kb

The finding: US has 6.9× fewer reads on this region but produces a 2.4× longer contiguous assembly block (89% of the 280 kb truth in a single piece). Reason: US Sequel mean read length 9.3 kb vs Nordic CLR 5.5 kb. Longer reads span more repeats → fewer assembly breaks. The read-length advantage outweighs the read-count advantage at the scale of structural assembly contiguity.

This is the first direct cross-pop assembly contiguity comparison we have. It says: for Lophelia assembly, raw read length matters more than total read count. Confirms (and quantifies) what the field has been saying about HiFi/Sequel vs RS II CLR.

Viz page added: lophelia_mu826414_us_v1.html.


2026-05-11 — ARAGONITE M0–M5 buildout

Started a tweakable, layered de novo assembler named for the CaCO₃ polymorph Lophelia secretes. Old-school philosophy: every phase a small composable script, every decision a CLI knob, every artifact a human-readable file between phases.

Scripts in autoloop/scripts/aragonite/: | script | phase | role | |---|---|---| | aragonite_partition.py | M0 | split reads by size → architects + mortar | | aragonite_characterize.py | M1 | axis F features + k-mer rarity → flank-DB; class read | | aragonite_overlap.sh | phase 2 | minimap2 ava-pb wrapper → PAF | | aragonite_grow.py | M2 / M3 / M3.5 | endpoint-graph OLC walker; greedy or allele | | aragonite_mortar.py | M4 | mortar→grains mapping; per-position depth track | | run_aragonite_pipeline.sh | driver | end-to-end on a BAM region | | generate_architecture_viz.py | viz | design-space map page | | generate_aragonite_run_viz.py | viz | per-run page |

Synth 1 Mb diploid validation — built scripts/synth/* for ground-truth generation. ~1 Mb diploid (chr1_hapA + hapB at 2% het + 50 kb microbe). pbsim3 RSII simulated 6,059 reads at 35× depth. Comparison vs miniasm / wtdbg2 / flye documented in viz site (synth_1Mb_v1 page).

M5 real-data test on Loph_1.0 contig MU826414.1 — pipeline runs end-to-end on 1,044 Nordic reads aligning to a 280 kb region. Produces 143 grains (118 main + 25 sister, 15 allele-pair links), longest 63.7 kb, covers 92.1% of the truth contig (258 / 280 kb), longest contiguous merged block 104.5 kb across 15 blocks. ~1 sec walltime for the region.

Local viz site at http://localhost:8765/ with three entries: - 📖 Aragonite architecture (design-space map of all phases + swap-points) - synth_1Mb_v1 (comparative assembly view) - lophelia_MU826414_v1 (M5 first real-data run)

Lessons logged at each milestone (the trajectory): - M0: PacBio length CDFs have no clean knee — use percentile defaults - M1: synth is too easy (93% EXPANDER); real data will discriminate more - M2: greedy produces ~8× redundant grains (parallel walks not merged) - M3: allele-fork detection works at ~40% diploid rate - M3.5: processing order matters → interleave sisters → 48.5% - M4: mortar depth is QC, not just polish — surfaces grain-confidence track - M5: real-data pipeline works; 92% truth coverage on first try

Memory references: project_axis_F_vision_corrected.md, feedback_autolearn_goodhart.md, feedback_background_default.md, feedback_no_root_tmp.md, feedback_compute_safety.md.


2026-05-07 to 2026-05-10 — axis F per-read DB framework (cross-pop)

Multi-day buildout of the BAC-substrate per-read database for cross-population read characterization. Lives at ~/cnidaria_genomes/lophelia_pertusa/refree/ (t5) and synced to lophelia hub data dir.

Scripts in autoloop/scripts/ (all reference-free per the corrected vision): - extract_read_ends.sh + _axis_F_extract_helper.py — phase 1 BAM → per-read TSV - extract_read_ends_refree.sh + _refree_extract_helper.py — strict reference-free variant (no is_mapped/softclip_bp columns) - characterize_and_classify.py — 6 sequence-only features per end, classifier - compute_features_refree.py — features only (no classifier) per-pop independent - diag_features_refree.py — per-pop distribution diagnostic - flatten_to_flank_db.py — read-centric TSV → flank-centric DB (one row per flank) - autolearn_axis_F.py — autolearning parameter sweep (Goodhart-vulnerable, lessons saved) - drill_microsat.py — dominant-motif analysis for short-period tandem hits - peek_flank_db.sh + peek_flank_db_v2.sh + _flank_graph_helper.py — all-vs-all flank similarity + union-find community detection

Key findings: - US Sequel raw subreads have Q-strings stripped (all ! = Phred 0); Nordic RS II has real per-base Q. Real chemistry / deposit-convention asymmetry. - US has ~2× more period-2 microsatellite hits — 95% homopolymer (AA/TT, CC/GG), consistent with Sequel chemistry artefact, not real microsatellite biology. - Autolearn objectives are repeatedly Goodhart-vulnerable. Five iterations exposed five different ways the optimizer cheats. Memory saved. - v1 flank-DB peek: 75% of unmapped flanks have no similar partner — supports d8's "contamination + cleaning artifacts dominate, real biology is rarer" framing.

Memory references: feedback_autolearn_goodhart.md, project_axis_F_vision_corrected.md.


2026-05-05 — Phase 1 t5 (Pfam subset + decisions HMM section)

Per d8's share_plan.md (commit fffbfe8 on lophelia hub), Phase 1 t5 work landed:

  • data/diagnostic_pfams.txt (in lophelia hub) — 124 unique Pfam accessions, top-2 per cnidaria-relevant family from immune_family_table_cnidaria_relevant.tsv (116 families). All 124 verified present in Pfam-A.hmm 38.1.
  • Pfam-A.immune.hmm — 124-model HMM library extracted from the full Pfam-A. ~6.9 MB text + ~8 MB pressed binaries. Local on each server (regenerable). hmmscan on a 5-protein test query took ~12 ms.
  • scripts/extract_immune_pfams.py (in autoloop) — d8-portable rebuild script (parses Pfam-A.hmm text directly; no hmmfetch SSI dependency).
  • docs/decisions.md — t5 has filled §3 HMM methodology (hmmscan + --cut_ga + tblout/domtblout, with the multi-domain output schema for SRCR-family paralogs). §2 blastp left for d8.

Ready for Phase 2 t5 (run hmmscan on 34 proteomes) once decisions.md is sync'd.


2026-05-04 — Flye v1 landed; cross-server collaboration with d8

Flye CLR assembly v1 finished

Walltime 5 h 19 min (2026-05-03 17:22 → 2026-05-04 22:41). Output at ~/cnidaria_genomes/lophelia_pertusa/assembly/flye_v1/assembly.fasta (744 MB). Stats:

Metric Value
Contigs 31,741
Total length 744.2 Mb (33% inflated vs 557 Mb expected — haplotigs)
N50 34.4 kb
Max contig 267 kb
Ns / gaps 0
GC% 39.7 (matches Loph_1.0 39.5 — sanity OK)

Inflation is the expected outbred-CLR haplotig signature at 25.4× coverage. Path forward is purge_dups → Pilon polish (Tisler HiSeq2500) → SSPACE/BESST scaffolding (3/8/20/40 kb mate-pairs) → minimap2 -ax asm5 + SyRI vs Loph_1.0 for the SV catalogue.

Cross-server collaboration with d8 set up

A partner Claude session on bok1d8 (192.168.1.10, NFS-mounted at ~/Desktop/d8/) is working in /mnt/backup/lophelia/ on the same project. Architecture:

  • Shared bare git hub on t5 at ~/repos/lophelia.git. Both servers see it (d8 via NFS at Desktop/t5/bok1/repos/lophelia.git). No SSH or cloud needed.
  • t5 working clone at ~/lophelia/ — separate from autoloop.
  • Etiquette: d8 won't touch t5 outside the bare hub. Big data stays local on each side. .gitignore keeps sequence/alignment/log files out.

What d8 has already done: - 6 Dahl-RNA1 RNA-seq libraries recovered from STAR _Aligned.out.bam via samtools collate | samtools fastq -F 0x900 -n/mnt/backup/lophelia/fastq_from_bam/ (~25 GB, lossless). Source raw FASTQs don't exist anywhere on disk. - Existing Trinity outputs consolidated at /mnt/backup/lophelia/trinity/: 2022 main Trinity (402,228 transcripts → TSA GHDD01), CLC re-clustering (29.5k assemblies, 68.8k contigs), OmicsBox annotation (with cnidaria-protein-hit subsets), Trinity of unmapped reads (210k transcripts, potential Lophelia-specific paralogs). - BLAST protein DBs for the 33 best annotated cnidarian assemblies (1,053,827 proteins) — per-species + aliased combined cnidaria_all. Build script at d8 scripts/build_blast_dbs.sh.

t5-side first commit to the shared hub: lab notebook entry summarising Flye v1 + provenance + division of labor (committed as d637e04).

Coordination — division of labor on immune-gene hunt

Side Tool Why
t5 (us) Pfam HMM scan (SRCR, A2M, Anaphylatoxin/C3a, Thioester, CUB, MAM, ApeC) + curated cnidarian C3 seeds → DIAMOND vs locally-rebuilt cnidaria_all Domain-aware: catches the SRCR-family expansion that pure homology splits
d8 Seed-based blastp (human RefSeq + cnidarian C3/DMBT1 seeds) vs cnidaria_all Direct ortholog calls + Lophelia-paralog focus via Trinity 2022

Hits from both pipelines merge into docs/cnidaria_immune_catalog.md + _comparative/results/immune_summary.tsv. Reciprocal-best-hits cleanup at the end.

On BLAST DB sharing: /mnt/backup/lophelia/ is d8-local ext4 — not mountable from t5. Decided to rebuild on t5 using d8's scripts/build_blast_dbs.sh rather than rsync'ing the binary DBs across. Same _comparative/databases/ layout, same proteome inputs (symlinks happen to be valid on t5 already). Idempotent script, ~1–2 min runtime, both servers stay symmetric.

On timing: d8 starts the seed-based blastp now on the existing 33 species (no need to block on the Nordic assembly). When polish + annotation finishes for Nordic Lophelia, we slot it in as the 34th proteome and re-run; existing 33-species rows don't change.

Papers retrieved

  • Herrera & Cordes 2023 (GigaByte) — the canonical Loph_1.0 data paper. PDF at docs/papers/.
  • Walters et al. 2020 (Front Immunol) — Mpeg-1/Perforin-2 family in Cnidaria. Directly on the mucus-immunity arc. PDF at docs/papers/.
  • Quattrini 2020 NEE + McFadden 2021 Syst Biol — nominally green-OA but no working PDF source via unpaywall/OpenAlex. Need institutional access. Walters paper covers a lot of the cnidarian-immunity context anyway.

autoloop hub set up

Bare repo at ~/repos/autoloop.git (parallel to ~/repos/lophelia.git). ~/autoloop/ working tree pushes to it as origin. d8 can git clone it to pull autoloop's canonical scripts, env yaml, and docs/{lab_notebook,decisions,samples}.md. Same etiquette as the lophelia hub.

CLR / HiFi feasibility recorded in docs/methods.md

Decision not to regenerate modern HiFi from the 2015 RS II data captured at paper-grade precision in docs/methods.md §2. Headline: 10 kb insert + RS II P6-C4 chemistry yields only 1–3 polymerase passes per ZMW, well below the 8–10 needed for modern HiFi quality. Even with full reprocessing the ceiling is ~1–3× HiFi-eq coverage of the 557 Mb genome. CLR-grade analyses (assembly, SV calling) proceed with the raw subreads; SNP-grade analyses use the Tisler HiSeq2500 short reads instead. HiFi regeneration moves to backlog as a targeted-loci tool only (validating an interesting SV at base-pair resolution if needed).

PacBio CLR mapping to Loph_1.0 — reference indexed, mapping launching

Reference decompressed + indexed once for reuse at ~/cnidaria_genomes/lophelia_pertusa/reference/work/:

Loph_1.0.fna           539 MB   uncompressed reference
Loph_1.0.fna.fai        92 KB   samtools index
Loph_1.0.map-pb.mmi    1.2 GB   minimap2 -x map-pb pre-built index (k=19, skip=10, hpc=on)

Mapping pipeline (see docs/methods.md §3.3 for paper-grade detail):

minimap2 -ax map-pb -t 40 Loph_1.0.map-pb.mmi <36 subreads.fastq> \
  | samtools sort -@ 8 -m 2G -o nordic_pb_vs_loph10.bam
samtools index ...

Output dir: ~/cnidaria_genomes/lophelia_pertusa/alignments/nordic_pb_vs_loph10_v1/. Walltime estimate 2–4 h. SV calling with Sniffles2 follows.

Reframing: assembly polish/scaffold demoted, mapping promoted

The story arc is comparative immune-gene biology, not chromosome-quality Nordic assembly per se. Direct PacBio→Loph_1.0 alignment + SV calling answers the C3/SRCR-locus question more directly than Flye→purge_dups→Pilon→SSPACE→assembly-vs-assembly. Backlog tasks 19/20 (Pilon polish, mate-pair scaffolding) demoted from "next" to "if-needed-for-novel-Nordic-sequence". Flye v1 stays on disk as backup for any sequence not in Loph_1.0.

Nordic CLR → Loph_1.0 alignment landed

Wall time 17 min (much faster than the 2–4 h estimate — 14.17 Gbp on 40 cores with a pre-built .mmi index moves quickly). Output at cnidaria_genomes/lophelia_pertusa/alignments/nordic_pb_vs_loph10_v1/nordic_pb_vs_loph10.bam (17 GB).

Flagstat highlights:

Metric Value Reading
Primary reads 2,525,036 matches subread count exactly (sanity OK)
Primary mapped 2,409,639 (95.43%) high — Nordic and US Lophelia are very close at sequence level
Unmapped primary 115,397 (4.57%) candidate Nordic-specific sequence (insertions, divergent regions, repeats)
Supplementary 2,572,820 split-mapped reads — the SV signal lives here
Secondary 4,875,457 repeat-region multi-mapping

95.43% primary mapping is a strong signal that the Nordic and US populations are highly conserved at sequence level. The ~4.6% unmapped + 2.6 M supplementary alignments are where the structural-variation story will come from.

Sniffles2 SV calling — done in 18 s

Single-sample mode, reference Loph_1.0.fna, 16 threads, on the 17 GB BAM.

108,257 SVs called, sorted, bgzipped, tabix-indexed → nordic_pb_svs.vcf.gz (25 MB). All calls have SUPPORT ≥ 3 and QUAL ≥ 20 (sniffles defaults).

SV class Count Notable
DEL 58,829 27,290 in 100–500 bp range; 171 ≥ 50 kb
INS 44,399 23,728 in 100–500 bp; almost none ≥ 5 kb (mean read length is the spanning ceiling)
DUP 460 100 bp → 50 kb+ — directly relevant for SRCR-family copy-number expansion
INV 844 212 ≥ 50 kb; smaller ones may be CLR-noise — apply PRECISE/QUAL filter when intersecting
BND 3,725 split-read breakends, untyped

Filter sensitivity sweep (PASS only):

SUPPORT≥3   SVLEN≥50    99,975   lenient
SUPPORT≥5   SVLEN≥100   61,537   good working set
SUPPORT≥5   SVLEN≥500   18,387   high-confidence
SUPPORT≥10  SVLEN≥500    8,653   very conservative

Not filtering at the catalogue level. The right move is to keep the full VCF and filter per immune-gene locus once t5 HMM scan + d8 seed-BLAST land their coordinates. Working filter for the immune-locus intersection will likely be SUPPORT≥5 + SVLEN≥100 + PASS (61k SVs) with manual review of the largest events.

Output dir summary:

alignments/nordic_pb_vs_loph10_v1/
├── nordic_pb_vs_loph10.bam       17 GB
├── nordic_pb_vs_loph10.bam.bai   4.3 MB
├── flagstat.txt
├── nordic_pb_svs.vcf.gz          25 MB
├── nordic_pb_svs.vcf.gz.tbi      tabix index
├── minimap2.stderr
├── samtools_sort.stderr
└── sniffles.stderr

2026-05-03 — project bootstrap, pivot, scaffold, Flye launched

Story arc decided

Mucus immunity in cnidarians, focused on C3 (proto-complement) and DMBT1 / SRCR-family mucosal pattern-recognition receptors. Lophelia pertusa (= Desmophyllum pertusum) as the cold-water-coral case study: assemble Nordic Lophelia from PacBio CLR, align to US GCA_029204205.1, focus on C3 + SRCR-family loci, place results in the cnidarian comparative context using ~80 reference genomes already on disk. Solid science over high-impact splash.

What landed today

US reference

Fetched GCA_029204205.1 (Loph_1.0) from NCBI FTP. 7 files, all md5-verified, written read-only with a MANIFEST.json capturing provenance. - Specimen: USNM1676648 (Smithsonian voucher), off Charleston SC, 515 m depth, 2019-04-17 - Submitter: Lehigh University (Santiago Herrera lab) - Assembly level: Scaffold (scaf N50 1.6 Mb, contig N50 438 kb). 557 Mb total, 39.5% GC. - Path: ~/cnidaria_genomes/lophelia_pertusa/reference/GCA_029204205.1_Loph_1.0/

Nordic PacBio data inventoried

12 SMRT cells from Norwegian Sequencing Centre (NSC, bioinf3.hpc.uio.no), 2015 run 47, sample lead Per Mikael, sequenced on PacBio RS II with P6-C4 chemistry, 10 kb insert library. Originally landed in three legacy .tgz archives; now on disk as the proper extracted directory tree at ~/cnidaria_genomes/lophelia_pertusa/LopheliaPertusa/.

seqkit stats across all 36 .subreads.fastq files (3 per cell):

Metric Value
Subreads 2,525,036
Total bases 14.17 Gbp
Mean read length ~5,500 bp (lower than the 10 kb insert suggested)
Read N50 (per file) 6,700–7,000 bp
Max read length up to ~65 kb
CLR coverage of 557 Mb genome 25.4×

CCS data exists for only 3 of 12 cells (B02_1, C02_1, D02_1) from a 2018 reprocessing — total ~5–9× HiFi-equivalent, too sparse for whole-genome polish. The legacy .bas.h5 indexes and the Lp_*.subreads.bam modern PacBio BAM (only B02_1 has the BAM form) are extras; .subreads.fastq is enough for Flye.

Nordic Illumina data discovered

At ~/cnidaria_genomes/lophelia_pertusa/Lophelia_Illumina/FASTQ/ (47 GB total):

  • Paired-end short-insert (HiSeq, 2014): Lp_001, Lp_002Lp_001 R1 (8.2 GB) vs R2 (3.4 GB) is mismatched, needs investigation before use; Lp_002 looks fine 8.2/8.2
  • Mate-pair "jumping" libraries: LpT_DNA2_* at 3 / 8 / 20 / 40 kb inserts, 2 lanes each — exactly the substrate for scaffolding the Flye contigs into chromosome-arm-level scaffolds
  • Plus a LJD_PROCESSED_CLEANED/ subdir not yet inspected (likely post-NextClip junction-cleaned mate-pair reads)

Sample naming: LpT = Lophelia, Tisler reef (Norwegian Skagerrak). Flowcell H7W1UADXX, instrument DE18INS655.

Pre-existing alignments staged into bams/

116 GB across 8 subdirs (sorted/unsorted Nordic and US RNA-seq, Tisler HiSeq → US genome via CLC Workbench, MiSeq alignments, mito BAMs for both populations). Two reference identities used: the full 2858-scaffold US assembly (Tisler HiSeq) and a filtered 142-scaffold subset (us_lp_filter, 354 Mb — RNA-seq STAR alignments). Both are subsets of GCA_029204205.1. Reusable, contingent on which scaffolds matter for the immune loci.

Cleanup

Removed redundant .tgz archives (LopheliaPertusa.tgz 48 GB, LopheliaPertusa_2.tgz 92 GB, LopheliaPertusa.bas.h5.tgz 2.9 MB) and the PerMikael/ legacy index dir. 141 GB freed, 88% → 84% disk used.

Deliberately did not touch ~/cnidaria_genomes/.tmp_zips/ — the user has an active cnidaria-genome download pipeline writing there.

Cnidarian comparative dataset

The user has 124 cnidarian species across 157 NCBI assemblies under ~/cnidaria_genomes/ (downloaded with datasets, layout: <species>/<accession>/ with protein.faa, genomic.gff, etc.).

  • 33 species have at least one annotated assembly (best-per-species manifest at _comparative/proteomes/manifest_best.tsv)
  • 1,053,827 proteins total across the 33 best assemblies
  • 91 species are unannotated genome-only (skipped for first pass; can do tblastn later if a specific species matters)

Coverage looks great for the story: Acropora ×5, Nematostella, Hydra ×4, Hydractinia (allorecognition model), Stylophora pistillata, Pocillopora ×3, Orbicella ×2, Montipora ×2, Exaiptasia ×2 (Aiptasia model), Clytia, Dendronephthya, etc. Astrangia poculata lacks annotation — important for the aposymbiotic-control angle if we use that subplot; tblastn fallback.

Conda env

lophelia env at ~/miniconda3/envs/lophelia (yaml at ~/autoloop/envs/lophelia.yml). Built with mamba.

Phase 1 (assembly + alignment + variants): fastp, multiqc, seqkit, flye 2.9.6, pilon, bwa-mem2, minimap2 2.30, samtools 1.21, bcftools 1.21, sniffles 2.7.5, bedtools, gffread, diamond, ncbi-datasets-cli, pysam, biopython, pandas.

Phase 2 (comparative orthology, added today): hmmer, mafft, iqtree, trimal.

_comparative/ analysis tree scaffolded

~/cnidaria_genomes/_comparative/{proteomes,seeds,databases,searches,alignments,trees,results} — empty subdirs ready for the cross-species ortholog hunt. 33 best-proteome FASTAs symlinked into proteomes/.

Flye launched

17:22 local time. flye --pacbio-raw <36 fastqs> --genome-size 560m --threads 40 --out-dir flye_v1. First launch failed in 4 s: minimap2 wasn't on PATH because Flye was invoked by direct path without env activation. Re-launched with PATH=$ENV/bin:$PATH; now actually running, walltime estimate 4–8 h, finish window roughly 21:22–01:22.

Blocker / gotcha

Flye binary direct-path call doesn't auto-pick up sibling tools (minimap2, samtools) from the same env's bin/. Workaround: prefix PATH with $CONDA_PREFIX/bin or use mamba run -n lophelia flye .... Documented in case it bites again.


autoloop.deepsek.no · static snapshot · last build 2026-05-13