Vocabulary Format

How we designed and organized the word vocabulary for Word Space 3D

Updated: January 25, 2026


Why We Expanded the Vocabulary

When we first built Word Space 3D, we had 2,405 words — a mix of nouns, verbs, and adjectives with no formal categorization. The game worked, but we noticed some limitations:

  1. UMAP quality improves with more data points. With only ~2,400 words, the 3D semantic space felt sparse in places. Growing the vocabulary to ~4,000 words gives UMAP more information to work with, producing smoother transitions and tighter semantic clusters.

  2. Mixed vocabulary is more fun. We considered going nouns-only (~3,000-4,000 quality nouns exist in common English), but that felt restrictive. Players naturally guess verbs ("run", "swim") and adjectives ("happy", "cold") — blocking these would frustrate gameplay.

  3. Part-of-speech tracking helps us curate. By tagging each word with its primary POS, we can analyze the vocabulary distribution, spot gaps, and make informed decisions about what to add or remove. The player never sees these tags — they're purely for our internal use.

The Decision

We settled on a mixed vocabulary of ~4,000 words with internal POS tracking:

  • Nouns (singular form): apple, castle, dragon
  • Verbs (infinitive form): run, swim, whisper
  • Adjectives (base form): happy, beautiful, cold

This gives us enough density for good UMAP projections while keeping the game open to natural guessing patterns.


File Format

The vocabulary file lives at server/services/data/vocab.csv.

Current Format (Recommended)

A CSV file with a header row naming the word and pos columns:

word,pos
apple,noun
run,verb
happy,adj
swim,verb
castle,noun
beautiful,adj

Legacy Format (Supported)

Simple list with one word per line (no header, no POS). The build script detects this automatically and assigns NULL to all POS values:

apple
run
happy

Part of Speech Values

POS    Description              Examples
noun   Nouns (singular)         apple, castle, dragon
verb   Verbs (infinitive form)  run, swim, whisper
adj    Adjectives               happy, beautiful, cold

Word Selection Guidelines

  1. Nouns: Use singular form (apple, not apples)
  2. Verbs: Use infinitive form (run, not running/ran)
  3. Adjectives: Use base form (happy, not happier/happiest)
  4. Avoid: Proper nouns, offensive words, very obscure words
  5. Prefer: Concrete, visualizable words that are easy to conceptualize
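
Some of these guidelines can be checked mechanically before a rebuild. The sketch below is a hypothetical helper, not part of the project: it flags unknown POS tags, treats any capital letter as a hint of a proper noun (a rough heuristic), and reports duplicate entries (an extra check assumed here, not stated in the guidelines). Judgments such as "offensive" or "obscure" still need a human reviewer.

```python
from typing import Iterable, List, Optional, Tuple

ALLOWED_POS = {"noun", "verb", "adj"}  # the three values documented above


def lint_vocab(rows: Iterable[Tuple[str, Optional[str]]]) -> List[str]:
    """Return human-readable problems found in (word, pos) pairs."""
    problems: List[str] = []
    seen = set()
    for word, pos in rows:
        # pos may be None for rows loaded from the legacy format.
        if pos is not None and pos not in ALLOWED_POS:
            problems.append(f"{word}: unknown pos {pos!r}")
        # Heuristic: capital letters suggest a proper noun.
        if word != word.lower():
            problems.append(f"{word}: contains capitals (proper noun?)")
        # Assumed extra check: each word should appear only once.
        if word in seen:
            problems.append(f"{word}: duplicate entry")
        seen.add(word)
    return problems
```

An empty list means none of these checks found a problem.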

Internal Use Only

The pos field is stored in the database but never exposed to players:

  • NOT included in embeddings (only the word itself is embedded)
  • NOT returned in API responses
  • Used only for vocabulary statistics and curation

Rebuilding the Database

After modifying vocab.csv:

cd apps/word-space-3d/server

# If vocabulary size changed significantly, retrain UMAP first:
python services/umap_reduce/fit_vocab_umap_3d.py

# Then rebuild the database:
python build_vocab_db.py

Cost: ~$0.02 per 1,000 words for embeddings.
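
As a quick worked example of that rate, the snippet below estimates the embedding cost for a given vocabulary size. The function name is hypothetical, and the rate is the approximate figure quoted above, so the result is a ballpark estimate only.

```python
COST_PER_1000_WORDS_USD = 0.02  # approximate rate quoted above


def estimate_embedding_cost(word_count: int) -> float:
    """Rough embedding cost in USD for word_count words."""
    return word_count / 1000 * COST_PER_1000_WORDS_USD


print(f"${estimate_embedding_cost(4000):.2f}")  # prints $0.08
```

At this rate, re-embedding the full ~4,000-word vocabulary costs roughly eight cents.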


Verifying Vocabulary Statistics

sqlite3 services/data/word_space.db "SELECT pos, COUNT(*) FROM words GROUP BY pos;"

Expected output for ~4,000 words (exact counts will vary):

adj|800
noun|2400
verb|800
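
The same check can be run from Python, which makes it easier to see each category's share of the vocabulary. This is a minimal sketch that assumes only what the query above shows: a words table with a pos column. The function names are hypothetical.

```python
import sqlite3
from typing import Dict, Optional


def pos_counts(conn: sqlite3.Connection) -> Dict[Optional[str], int]:
    """Count words per POS tag; legacy rows appear under the None key."""
    cur = conn.execute("SELECT pos, COUNT(*) FROM words GROUP BY pos")
    return dict(cur.fetchall())


def pos_shares(counts: Dict[Optional[str], int]) -> Dict[Optional[str], float]:
    """Convert raw counts into fractions of the whole vocabulary."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in counts.items()}


# Usage (run from the server directory, as with the sqlite3 command above):
#   conn = sqlite3.connect("services/data/word_space.db")
#   print(pos_shares(pos_counts(conn)))
```

For the expected output above, the shares come to 0.6 for nouns and 0.2 each for verbs and adjectives.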