Methodology

How MindDance retrieves, ranks, tiers, and writes about AIDD papers

MindDance is a daily research brief for people working in AI-driven drug discovery. It is not built as a broad paper dump. The goal is to retrieve a larger pool of relevant AIDD papers, keep the selection logic visible, and publish concise, traceable commentary on the strongest items.

Figure: MindDance methodology workflow overview

Positioning

Content is prioritized roughly as Drug > Chem ≈ Bio > Med. If a paper is genuinely useful for drug discovery, it can enter the pool whether it is framed as chemistry, biology, structure, methods, or translational work. Papers that are still pure AI, pure biology, pure chemistry, or pure physics are meant to be removed downstream.

The site borrows the transparency principle used by general AI briefing products, but adapts it to a domain whose boundary is much harder to draw: AIDD needs stronger filtering on relevance, not just on community buzz.

Daily cadence

The pipeline is designed to run at 08:00 Beijing time, and the publish date is the run date. Paper dates follow T+1 logic: each brief primarily covers papers dated the previous day in Beijing time, plus any relevant papers already discoverable from upstream sources when the run happens. In practice, coverage depends on how quickly each source indexes new material.
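The date arithmetic above can be sketched as follows. This is a minimal illustration, not the site's actual scheduler: the helper name `brief_dates` is hypothetical, and Beijing time is modeled as a fixed UTC+8 offset, which works because China does not observe daylight saving time.

```python
from datetime import date, datetime, timedelta, timezone

# Beijing time as a fixed UTC+8 offset (no daylight saving).
BEIJING = timezone(timedelta(hours=8))


def brief_dates(run_time: datetime) -> tuple[date, date]:
    """Return (publish_date, primary_paper_date) for one run.

    Hypothetical helper: the publish date is the run date in Beijing
    time, and the primary paper date is the day before it (T+1).
    """
    local = run_time.astimezone(BEIJING)
    publish_date = local.date()
    primary_paper_date = publish_date - timedelta(days=1)
    return publish_date, primary_paper_date


# A run at 08:00 Beijing time is 00:00 UTC on the same calendar day.
run = datetime(2025, 3, 10, 0, 0, tzinfo=timezone.utc)
publish_date, primary_paper_date = brief_dates(run)
print(publish_date, primary_paper_date)
```

The sketch only fixes the primary date. Papers that upstream sources expose late would still need to be picked up by whatever discovery window the real pipeline uses.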

Where papers come from

The current primary sources are arXiv, bioRxiv, and PubMed. They are treated as a union of candidate inputs rather than an intersection: a paper needs to appear in only one source to enter the pool.

  • arXiv: captures q-bio core categories plus broader AI and physics-adjacent categories where AIDD methods often appear.
  • bioRxiv: adds preprints from protein design, computational biology, biophysics, and pharmacology-oriented work.
  • PubMed: currently the main path for journal-style AIDD retrieval, especially medicinal chemistry, computational chemistry, structural biology, and computational biology journals.
  • Auxiliary signals: citation, repository, and community indicators are used mainly for enrichment and ranking, not as the primary retrieval channel.
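Treating the sources as a union can be illustrated with a small merge step. The sketch below is an assumption about how such a merge could work, not a description of the site's code: the function `merge_candidates`, the record fields, the DOI-or-title deduplication key, and the placeholder DOIs are all hypothetical.

```python
def merge_candidates(*source_lists: list[dict]) -> list[dict]:
    """Union candidate papers from several sources, keeping provenance.

    Hypothetical dedup rule: match on DOI when present, otherwise on
    the lowercased title. Real matching would likely need to be fuzzier.
    """
    merged: dict[str, dict] = {}
    for papers in source_lists:
        for paper in papers:
            key = paper.get("doi") or paper["title"].strip().lower()
            if key not in merged:
                merged[key] = {
                    "title": paper["title"],
                    "doi": paper.get("doi"),
                    "sources": [],
                }
            if paper["source"] not in merged[key]["sources"]:
                merged[key]["sources"].append(paper["source"])
    return list(merged.values())


# Illustrative records only; the DOIs are placeholders.
arxiv = [{"title": "Diffusion models for ligand docking", "doi": None, "source": "arXiv"}]
biorxiv = [{"title": "Antibody design with language models", "doi": "10.1101/example", "source": "bioRxiv"}]
pubmed = [
    {"title": "Antibody design with language models", "doi": "10.1101/example", "source": "PubMed"},
    {"title": "ADMET prediction benchmark", "doi": "10.1000/example", "source": "PubMed"},
]

pool = merge_candidates(arxiv, biorxiv, pubmed)
for entry in pool:
    print(entry["title"], entry["sources"])
```

A paper found by only one source still enters the pool, which is the point of a union. Recording every source that returned a paper also keeps provenance available for later display.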

Recall broadly, then filter in layers

Layer 1: rule-based screening

The first layer requires both an AI-method signal and an AIDD-domain signal. This layer is not supposed to decide the final editorial set by itself. Its job is to remove obvious noise while keeping the pool large enough for downstream scoring and LLM review.

The domain keywords are organized around the real AIDD workflow: target discovery, pockets and binding, docking and virtual screening, molecule generation and optimization, protein and antibody design, ADMET, synthesis, biomarkers, multi-omics, and translational relevance.
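The two-signal requirement can be sketched as a simple keyword check. The term lists below are illustrative assumptions: the domain terms are sampled from the workflow areas named above, while the AI terms and the function name `passes_rule_screen` are hypothetical. A real screen would likely use a much larger vocabulary.

```python
# Hypothetical AI-method vocabulary (not taken from the site).
AI_TERMS = (
    "deep learning",
    "machine learning",
    "neural network",
    "language model",
    "diffusion model",
)

# Illustrative AIDD-domain vocabulary, sampled from the workflow areas above.
DOMAIN_TERMS = (
    "target discovery",
    "binding",
    "docking",
    "virtual screening",
    "molecule generation",
    "antibody design",
    "admet",
    "retrosynthesis",
    "biomarker",
    "multi-omics",
)


def passes_rule_screen(title: str, abstract: str) -> bool:
    """Keep a paper only if it shows both an AI signal and a domain signal."""
    text = f"{title} {abstract}".lower()
    has_ai = any(term in text for term in AI_TERMS)
    has_domain = any(term in text for term in DOMAIN_TERMS)
    return has_ai and has_domain


print(passes_rule_screen("Deep learning for docking pose prediction", ""))
print(passes_rule_screen("Deep learning for image classification", ""))
print(passes_rule_screen("Crystal structure of a kinase binding pocket", ""))
```

Plain substring matching is deliberately loose here. Since this layer only removes obvious noise, false positives are cheaper than false negatives and can be handled by the later layers.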

Layer 2: multi-signal scoring

After rule-based inclusion, each paper is scored using signals that better match practitioner value than raw popularity. The most important signals currently include:

  • Publication form and venue: journals often outrank preprints, and top journals or top conferences receive stronger weight.
  • Institutional signal: leading academic labs, pharma AI teams, and recognized AIDD companies carry more weight.
  • Code and reproducibility: publicly available code and repository evidence improve rank.
  • Domain strength: whether the paper sits on the drug discovery path instead of merely touching biology or AI language.
  • Community and citation signals: used as supplementary evidence, not the sole criterion.
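One way to combine the signals listed above is a weighted sum. The sketch below is only an illustration under stated assumptions: the `PaperSignals` fields, the normalization of each signal to the range 0 to 1, and every weight value are hypothetical. The only property taken from the text is that community and citation signals play a supplementary role, so they receive the smallest weight here.

```python
from dataclasses import dataclass


@dataclass
class PaperSignals:
    """Hypothetical per-paper signals, each normalized to [0, 1]."""
    venue: float
    institution: float
    code: float
    domain: float
    community: float


# Illustrative weights only; the real values are not published in this text.
WEIGHTS = {
    "venue": 0.25,
    "institution": 0.15,
    "code": 0.15,
    "domain": 0.35,
    "community": 0.10,
}


def score(signals: PaperSignals) -> float:
    """Weighted sum of the normalized signals."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


strong_domain = PaperSignals(venue=0.5, institution=0.5, code=1.0, domain=1.0, community=0.0)
buzzy = PaperSignals(venue=0.5, institution=0.5, code=0.0, domain=0.2, community=1.0)

print(round(score(strong_domain), 2), round(score(buzzy), 2))
```

With these assumed weights, a paper with open code and strong domain fit outranks one that is popular but only loosely related to drug discovery.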

Layer 3: visible tiers instead of binary keep/drop

The site keeps three explicit tiers:

  • Featured: the strongest papers that deserve full write-ups.
  • Notable: papers worth surfacing, but not necessarily warranting long-form interpretation.
  • Candidate: papers that made it into the reviewed pool but did not enter the main brief.

This matters because hiding the candidate tier makes the product look far stricter than it really is, and removes the reader's ability to audit the daily pool.
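The three-tier idea can be sketched as a mapping from score to tier in which nothing in the reviewed pool is dropped. This is an assumption about one possible mechanism: the text does not state that tiers come from score cutoffs, and the threshold values and function names below are hypothetical.

```python
def assign_tier(score: float, featured_min: float = 0.75, notable_min: float = 0.50) -> str:
    """Map a score to a tier using hypothetical cutoffs."""
    if score >= featured_min:
        return "Featured"
    if score >= notable_min:
        return "Notable"
    return "Candidate"


def group_by_tier(scored: dict[str, float]) -> dict[str, list[str]]:
    """Group paper ids by tier; every reviewed paper stays visible."""
    tiers: dict[str, list[str]] = {"Featured": [], "Notable": [], "Candidate": []}
    for paper_id, value in scored.items():
        tiers[assign_tier(value)].append(paper_id)
    return tiers


# Illustrative scores only.
scored = {"paper-a": 0.82, "paper-b": 0.61, "paper-c": 0.30}
tiers = group_by_tier(scored)
print(tiers)
```

The key design choice is that the lowest tier is a named bucket rather than a discard step, so the full pool remains auditable.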

Layer 4: LLM judge as semantic cleanup

The LLM judge is a second-pass reviewer. It rechecks Featured and Notable papers, and can also inspect high-scoring Candidate papers. If a paper slipped through because of keyword overlap but is not genuinely AIDD, it should be pushed back down. If a semantically strong AIDD paper looked weaker in the heuristic stage, it can be promoted.
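The promote and demote behavior can be sketched with the judge passed in as a plain callable. Everything here is a hypothetical illustration: no real LLM API is called, `stub_judge` returns canned verdicts, and the one-tier step size, the verdict strings, and the `candidate_min_score` cutoff are assumptions.

```python
from typing import Callable

TIER_ORDER = ["Candidate", "Notable", "Featured"]


def needs_review(paper: dict, candidate_min_score: float = 0.45) -> bool:
    """Featured and Notable are always rechecked; Candidates only if high-scoring."""
    if paper["tier"] in ("Featured", "Notable"):
        return True
    return paper["score"] >= candidate_min_score


def apply_judge(paper: dict, judge: Callable[[dict], str]) -> dict:
    """Move a paper one tier up or down based on the judge's verdict."""
    if not needs_review(paper):
        return paper
    verdict = judge(paper)  # expected: "promote", "demote", or "keep"
    index = TIER_ORDER.index(paper["tier"])
    if verdict == "demote":
        index = max(index - 1, 0)
    elif verdict == "promote":
        index = min(index + 1, len(TIER_ORDER) - 1)
    return {**paper, "tier": TIER_ORDER[index], "judge_verdict": verdict}


# Stand-in for a real LLM call: canned verdicts keyed by paper id.
CANNED_VERDICTS = {"p1": "demote", "p2": "promote"}


def stub_judge(paper: dict) -> str:
    return CANNED_VERDICTS.get(paper["id"], "keep")


papers = [
    {"id": "p1", "tier": "Featured", "score": 0.80},   # keyword overlap, not truly AIDD
    {"id": "p2", "tier": "Candidate", "score": 0.48},  # under-ranked by heuristics
    {"id": "p3", "tier": "Candidate", "score": 0.10},  # below the review cutoff
]
reviewed = [apply_judge(p, stub_judge) for p in papers]
print([(p["id"], p["tier"]) for p in reviewed])
```

Keeping the judge behind a callable makes the tier logic testable without a model in the loop, and storing the verdict on the record lets the reason for a tier change be displayed later.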

How the site presents this

  • Homepage: explains the site's positioning, workflow, and topic map.
  • Daily brief page: shows full write-ups for Featured and short summaries for Notable.
  • Sources page: exposes Featured, Notable, and Candidate together, including score reasons and source provenance.
  • Topic pages: organize the archive along AIDD workflows rather than generic broad disciplines.

Current AIDD topic map

Based on recent AIDD reviews and research patterns, the site is easier to understand through these workflow-oriented groups:

  • Target & Mechanism: target discovery, target validation, pathways, mechanism modeling
  • Structure & Binding: protein structure, pocket modeling, docking, poses, affinity
  • Molecule Design: generation, property prediction, lead optimization, scaffold hopping
  • Developability: ADMET, toxicity, synthesizability, formulation, developability
  • Protein / Antibody / Peptide: protein design, antibody engineering, peptide therapeutics
  • Reaction & Synthesis: reaction prediction, retrosynthesis, route planning
  • Biology & Omics for Drug Discovery: biomarkers, multi-omics, patient stratification, drug response

Known limitations

  • Source breadth is still limited: retrieval is stronger than before, but still concentrated in arXiv, bioRxiv, and PubMed.
  • Date semantics depend on upstream indexing: different APIs expose new papers at different speeds.
  • Scoring and topic taxonomy are still evolving: AIDD boundaries are harder to formalize than generic AI news.
  • Write-ups are abstract-driven: useful for fast review, but not a substitute for full-paper reading.

FAQ

How is MindDance different from a generic paper index?
Generic indexes answer "how do I find papers?" MindDance answers "which AIDD papers matter today, and why?" It is intentionally selective and optimized for practitioner-facing review rather than exhaustive coverage.

Why expose the candidate tier on the Sources page?
Because transparency is part of the product. Showing candidates lets readers inspect whether the daily pool is too small, too broad, or mis-ranked instead of seeing only the final editorial output.

What does the LLM judge actually do?
It is a second semantic filter rather than a writing engine. Its main job is to push down papers that still look like pure AI, pure biology, pure chemistry, or pure physics instead of true AIDD content. It can also promote a semantically strong AIDD paper that the heuristic stage under-ranked.

Why avoid first-person commentary in the write-ups?
Because the site is structured as a research brief, not a personal essay column. The current tone is neutral and compact, centered on the problem, method, validation level, and practical significance.