Abstract
The English lexicon is vast, but far from chaotic. Morphological research shows that English vocabulary possesses a highly systematic internal structure, decomposable into three fundamental building blocks: roots, prefixes, and suffixes. This article presents findings from the construction and analysis of a large-scale word family derivation tree corpus — covering 7,533 word families and 54,795 derivation nodes — to reveal the structural composition patterns of English vocabulary. We find that: (1) Latin and Greek roots together account for 50.8% of all English word families; (2) the top 10 most frequent prefixes and suffixes alone cover the vast majority of common derivational relationships; (3) 86.4% of word families have a maximum depth of two or fewer (root → first-order derivation → second-order derivation). Based on these findings, we propose a root teaching value scoring system and a phased framework for structured vocabulary instruction, providing a data-driven theoretical foundation for computational approaches to English vocabulary pedagogy.
1. Introduction
English, as one of the most widely spoken languages globally, presents learners with an immense vocabulary challenge. It is estimated that educated native English speakers know approximately 20,000 word families [1], while second language learners typically need 8,000–9,000 word families for unassisted reading of English texts [2]. Under traditional word-by-word memorization approaches, achieving this target demands enormous investments of time and cognitive resources.
However, linguistic research has long established that English words are not isolated, discrete units, but rather systematic combinations of a finite set of morphemes [3]. A single root, through the addition of various prefixes and suffixes, can generate dozens of derived words, and in the largest families more than a hundred. For example, the Latin root form ("to shape") gives rise to form, inform, information, transformation, conformity, and many others, for a total of 78 words in our corpus. By mastering the meaning of the root form and the functions of core affixes, learners can systematically comprehend the entire word family. This is the leverage effect of morphological awareness in vocabulary acquisition.
Corson [4] identified a significant "Graeco-Latin / Anglo-Saxon divide" in English vocabulary: everyday language draws primarily from Germanic sources, while academic, scientific, and formal registers rely heavily on Latin and Greek. Nation [1] further argued for the word family as the fundamental unit of vocabulary instruction, demonstrating that once learners master basic affixation rules, they can automatically extend a known word to its entire family. Bauer and Nation [5] proposed a graded system of affix difficulty, providing a principled progression for morphological instruction.
While these theoretical frameworks are well-established, existing research has largely relied on small-scale manual analysis or sampling surveys. To our knowledge, our work is the first large-scale, computationally constructed corpus of English word family derivation trees. It was built from Oxford dictionary vocabulary with AI-assisted etymological tracing and derivational annotation, followed by multiple rounds of quality verification. This article aims to:
- Systematically characterize the structural composition of English vocabulary — the distribution patterns of roots, prefixes, and suffixes
- Quantitatively analyze the topological properties of derivation trees — depth, breadth, and size distributions
- Discuss the practical implications for structured vocabulary instruction
2. Corpus Construction Methodology
2.1 Data Sources and Pipeline
Our derivation tree corpus was built from the Oxford English Dictionary word list through a six-stage pipeline:
- Core vocabulary extraction: Extracting words with CEFR (Common European Framework of Reference) level tags from the Oxford dictionary
- Etymological tracing: Using large language models to trace each word's core root according to strict etymological standards
- Derivation tree generation: Building complete derivation trees for each root, strictly following the principle of atomic affixation — one affix per derivation step
- Multi-round quality correction: Including false parent-child relationship detection, affix annotation repair, POS tag correction, and cross-tree duplicate elimination
- Singleton processing: Re-homing or building independent trees for isolated words
- Merge optimization: Consolidating duplicate roots and undersized families for data consistency
The entire process adhered to five strict rules: (1) 100% etymological accuracy — no guessing based on spelling similarity; (2) internal tree consistency — every word must have a correct parent; (3) atomic affixation — each derivation step adds exactly one prefix or suffix; (4) correct extraction of Latin/Greek bound roots; (5) Germanic free morphemes use the base word form as the root.
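Rules (2) and (3) lend themselves to automated checking. The following is a minimal sketch, not the pipeline's actual validator: it assumes each node is a dict with the `word`, `parent`, and `affix_added` fields described in Section 2.2, that the root is passed separately, and that the affix `type` takes the values `"prefix"` or `"suffix"` (an assumption, since the exact encoding is not specified).

```python
def check_family(root, nodes):
    """Return a list of rule violations for one word family (empty if clean)."""
    known_words = {root} | {n["word"] for n in nodes}
    problems = []
    for n in nodes:
        # Rule 2: the parent must be the root or another word in the same tree.
        if n["parent"] not in known_words:
            problems.append(f"{n['word']}: unknown parent {n['parent']!r}")
        # Rule 3: each step carries exactly one affix, a prefix or a suffix.
        affix = n.get("affix_added")
        if not affix or affix.get("type") not in {"prefix", "suffix"}:
            problems.append(f"{n['word']}: missing or invalid affix")
    return problems


# Toy input: the last node points at a parent that is absent from the tree.
nodes = [
    {"word": "inform", "parent": "form",
     "affix_added": {"type": "prefix", "text": "in-"}},
    {"word": "information", "parent": "inform",
     "affix_added": {"type": "suffix", "text": "-ation"}},
    {"word": "nonconformist", "parent": "conformist",
     "affix_added": {"type": "prefix", "text": "non-"}},
]
problems = check_family("form", nodes)
print(problems)
```

A check of this kind only catches structural violations; etymological accuracy (rule 1) still needs dictionary evidence and human review.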
2.2 Data Schema
Each record in the corpus represents a complete word family derivation tree with the following fields:
| Field | Type | Description |
|---|---|---|
| family_id | string | Unique identifier for the word family |
| root_info | object | Root metadata: form, meaning (bilingual), source language, spelling variants |
| derivation_nodes | array | All derivation nodes, each containing: word, pos, parent, affix_added (type, text, meaning) |
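As an illustration, the schema above could be modeled with Python dataclasses. This is a sketch under assumptions: the attribute names inside `root_info` (`form`, `meaning`, `source_language`, `variants`), the `family_id` value, the POS labels, and the use of `None` for a missing parent or affix are all hypothetical, since the table gives only the top-level field names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Affix:
    type: str      # assumed values: "prefix" or "suffix"
    text: str      # e.g. "in-" or "-ation"
    meaning: str


@dataclass
class DerivationNode:
    word: str
    pos: str                       # part-of-speech label (encoding assumed)
    parent: Optional[str]          # assumed None if the node is the root itself
    affix_added: Optional[Affix]   # the single affix added at this step


@dataclass
class RootInfo:
    form: str
    meaning: dict                  # bilingual glosses keyed by language code (assumed)
    source_language: str
    variants: list = field(default_factory=list)


@dataclass
class WordFamily:
    family_id: str
    root_info: RootInfo
    derivation_nodes: list = field(default_factory=list)


# A two-node fragment of the form family; the identifier is made up.
family = WordFamily(
    family_id="example-form",
    root_info=RootInfo(form="form", meaning={"en": "to shape or fashion"},
                       source_language="Latin"),
    derivation_nodes=[
        DerivationNode("inform", "verb", "form",
                       Affix("prefix", "in-", "into")),
        DerivationNode("information", "noun", "inform",
                       Affix("suffix", "-ation", "action or state")),
    ],
)
print([n.word for n in family.derivation_nodes])
```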
3. Structural Composition of English Vocabulary
3.1 Roots: The Semantic Core
The root is the minimal unit carrying core semantic content in English words. In our corpus, 7,533 word families correspond to 7,533 distinct roots. The distribution of roots by source language is as follows:
| Source Language | Word Families | Proportion |
|---|---|---|
| Latin | 2,970 | 39.4% |
| Other languages | 1,637 | 21.7% |
| Old English / Germanic | 1,474 | 19.6% |
| Greek | 859 | 11.4% |
| French | 531 | 7.0% |
| Blends / Acronyms | 62 | 0.8% |
This distribution is consistent with Corson's [4] classic observation: Latin and Greek roots together account for 50.8% of English word families. In other words, more than half of the word families in the corpus can be systematically decoded through classical language roots. From a pedagogical perspective, this ratio gives root learning an exceptionally high "return on investment": mastering a single productive Latin root such as port ("to carry") unlocks a family of 108 words, including import, export, transport, report, support, portable, deportation, and more.
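The proportions in the table can be reproduced directly from the family counts. The short check below uses only the counts reported above; the variable names are illustrative.

```python
# Word-family counts by source language, copied from the table above.
counts = {
    "Latin": 2970,
    "Other languages": 1637,
    "Old English / Germanic": 1474,
    "Greek": 859,
    "French": 531,
    "Blends / Acronyms": 62,
}

total = sum(counts.values())
share = {lang: round(100 * n / total, 1) for lang, n in counts.items()}
classical = round(100 * (counts["Latin"] + counts["Greek"]) / total, 1)

print(total)      # number of word families
print(share)      # percentage per source language
print(classical)  # combined Latin + Greek share
```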
Germanic roots, while numerically smaller (19.6%), carry the most fundamental, high-frequency everyday vocabulary: all, one, hand, life, home, etc. These roots are themselves common words that learners already know, making them natural "anchor points" for initiating morphological instruction.
3.2 Prefixes: Modifying Semantic Direction
Prefixes attach before roots or existing words, primarily functioning to modify semantic direction — negation, repetition, excess, location, and more. Our corpus contains 694 distinct prefixes, but usage frequency is highly concentrated.
Top 10 most frequent prefixes:
| Prefix | Occurrences | Core Meaning | Examples |
|---|---|---|---|
| un- | 2,190 | negation / reversal | unhappy, undo, unusual |
| re- | 592 | again / back | rebuild, rewrite, return |
| in-/im-/ir-/il- | 338 | negation / into | impossible, illegal, irregular |
| over- | 203 | excess / above | overwork, overcome, overlook |
| de- | 178 | removal / down | decode, decrease, depart |
| dis- | 171 | negation / apart | disagree, disappear, discover |
| sub- | 153 | under / secondary | subway, subtitle, submarine |
| pre- | 134 | before | preview, prepare, predict |
| out- | 127 | beyond / external | outdoor, output, outstanding |
| a-/ab- | 119 | away / on | abnormal, abroad, asleep |
Notably, the prefix un- alone accounts for 2,190 derivational relationships, more than half of the 4,205 derivations covered by the top 10 prefixes, which together form the backbone of the English prefix system. This is consistent with Bauer and Nation's [5] treatment of un- as one of the earliest affixes to teach.
3.3 Suffixes: Determining Part of Speech and Grammatical Function
Suffixes attach after roots or existing words, with their most important function being changing part of speech — converting verbs to nouns, nouns to adjectives, etc. The corpus contains 1,970 distinct suffixes.
Top 10 most frequent suffixes:
| Suffix | Occurrences | Grammatical Function | Examples |
|---|---|---|---|
| -s | 4,448 | plural / 3rd person singular | books, runs |
| -ed | 4,013 | past tense / past participle | played, informed |
| -ing | 3,936 | progressive / gerund | running, learning |
| -ly | 3,233 | adverb formation | quickly, carefully |
| -er | 2,289 | agent / comparative | teacher, bigger |
| -ness | 1,607 | abstract noun formation | happiness, darkness |
| -y | 1,094 | adjective formation | rainy, sandy |
| -ion/-tion/-ation | 864 | noun (action/state) | education, information |
| -al | 756 | adjective formation | formal, natural |
| -ity | 717 | abstract noun formation | reality, ability |
The suffix system shows a clear two-tier differentiation. Inflectional suffixes (-s, -ed, -ing) mark grammatical categories such as number and tense; they have the highest frequency but do not change the word's basic meaning or part of speech. Derivational suffixes (-ness, -tion, -al, -ity) create new words, typically changing the part of speech and adjusting the meaning. From a teaching perspective, derivational suffixes have higher instructional value as they are the key mechanism for "word family expansion."
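Frequency tables like the two above can be derived by tallying the `affix_added` field across all nodes. The sketch below is one way to do this, assuming each family is a list of node dicts and that affix `type` is either `"prefix"` or `"suffix"`; the toy input reuses example words from the tables and is not corpus data.

```python
from collections import Counter


def affix_counts(families):
    """Tally prefix and suffix occurrences across a list of families."""
    prefixes, suffixes = Counter(), Counter()
    for nodes in families:
        for node in nodes:
            affix = node.get("affix_added")
            if not affix:
                continue  # root nodes carry no affix
            target = prefixes if affix["type"] == "prefix" else suffixes
            target[affix["text"]] += 1
    return prefixes, suffixes


toy_families = [
    [{"word": "unhappy", "affix_added": {"type": "prefix", "text": "un-"}},
     {"word": "happiness", "affix_added": {"type": "suffix", "text": "-ness"}}],
    [{"word": "undo", "affix_added": {"type": "prefix", "text": "un-"}},
     {"word": "rebuild", "affix_added": {"type": "prefix", "text": "re-"}},
     {"word": "darkness", "affix_added": {"type": "suffix", "text": "-ness"}}],
]

prefixes, suffixes = affix_counts(toy_families)
print(prefixes.most_common(2))
print(suffixes.most_common(1))
```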
3.4 Part-of-Speech Distribution
The POS distribution across all derivation nodes in the corpus:
| Part of Speech | Count | Proportion |
|---|---|---|
| Noun | 25,982 | 47.4% |
| Adjective | 14,438 | 26.3% |
| Verb | 9,978 | 18.2% |
| Adverb | 3,654 | 6.7% |
| Other | 743 | 1.4% |
Nouns constitute nearly half (47.4%) of all derived words, consistent with Bauer and Nation's [5] observation that the English derivational system is particularly productive at nominalizing verbs and adjectives through suffixes such as -tion, -ness, -ment, and -ity. This finding suggests that suffix instruction should particularly emphasize the verb/adjective → noun conversion pathway.
4. Topological Properties of Derivation Trees
4.1 Tree Depth Distribution
The "depth" of a derivation tree represents the number of affixation steps from the root to the most distant derived word. We analyzed the maximum depth across all 7,533 word families:
| Max Depth | Word Families | Proportion | Cumulative |
|---|---|---|---|
| 0 (root only) | 362 | 4.8% | 4.8% |
| 1 (root + direct derivation) | 2,869 | 38.1% | 42.9% |
| 2 (three-layer structure) | 3,274 | 43.5% | 86.4% |
| 3 | 858 | 11.4% | 97.7% |
| 4 | 155 | 2.1% | 99.8% |
| 5–6 | 15 | 0.2% | 100% |
This is a significant finding: 86.4% of word families have a maximum depth of 2 or less. This means the vast majority of word families can be structurally decoded by understanding "root + 1 to 2 affixes." Even a complex word such as uninformatively (un- + in- + form + -ative + -ly, depth 4) involves only four simple affixation steps, and just 15 families (0.2%) reach depth 5 or 6. This conclusion provides confidence for pedagogy: developing morphological analysis skills does not require mastering complex recursive rules, only understanding a finite set of combinatorial patterns.
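Because every node records its parent, the maximum depth of a family can be computed by walking parent links back to the root. The function below is a minimal sketch assuming nodes are dicts with `word` and `parent` keys and that the root is passed in separately; it is not necessarily how the reported statistics were computed.

```python
def max_depth(root, nodes):
    """Number of affixation steps from the root to the deepest word."""
    parent = {n["word"]: n["parent"] for n in nodes}

    def depth(word):
        steps = 0
        while word != root:
            word = parent[word]        # KeyError if a parent is missing
            steps += 1
            if steps > len(parent):    # guard against accidental cycles
                raise ValueError("cycle in parent links")
        return steps

    return max((depth(w) for w in parent), default=0)


# The chain leading to "uninformatively", plus one shallow sibling.
chain = [
    {"word": "inform", "parent": "form"},
    {"word": "informative", "parent": "inform"},
    {"word": "uninformative", "parent": "informative"},
    {"word": "uninformatively", "parent": "uninformative"},
    {"word": "reform", "parent": "form"},
]
print(max_depth("form", chain))
```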
4.2 Family Size Distribution
The number of words in each word family (the "breadth" of the derivation tree) distributes as follows:
| Family Size | Count | Proportion |
|---|---|---|
| 1 word (singleton) | 352 | 4.7% |
| 2–5 words | 2,858 | 37.9% |
| 6–10 words | 3,196 | 42.4% |
| 11–20 words | 878 | 11.7% |
| 21–50 words | 228 | 3.0% |
| 51+ words | 21 | 0.3% |
Statistics show that 80.3% of word families contain 2–10 words. The mean is 7.27 and the median is 6. This size falls squarely within Nation's [1] typical range for the "word family" concept and is compatible with cognitive load theory [6] — presenting 6–10 related words in a single teaching unit provides enough material to demonstrate the root's derivational power without causing information overload.
4.3 Example: Complete Derivation Structure of the form Family
The following excerpt illustrates one of the larger word families in the corpus (78 words in total), built from the Latin root form ("to shape," from Latin formare):
form [root: to shape or fashion, Latin]
├── con- + form → conform (to comply)
│   ├── conform + -ity → conformity
│   ├── conform + -ist → conformist
│   │   └── non- + conformist → nonconformist
│   └── conform + -ation → conformation
├── de- + form → deform (to distort)
│   └── deform + -ation → deformation
├── form + -al → formal
│   ├── formal + -ity → formality
│   ├── formal + -ize → formalize
│   └── in- + formal → informal
├── in- + form → inform (to tell)
│   ├── inform + -ation → information
│   │   ├── dis- + information → disinformation
│   │   └── mis- + information → misinformation
│   └── inform + -ative → informative
│       └── un- + informative → uninformative
├── re- + form → reform
│   └── reform + -ation → reformation
├── trans- + form → transform
│   └── transform + -ation → transformation
└── uni- + form → uniform
    └── uniform + -ity → uniformity
This example clearly demonstrates the core mechanism of morphology: a single root form, through prefixes (con-, de-, in-, re-, trans-, uni-, non-, dis-, mis-, un-) and suffixes (-al, -ation, -ist, -ity, -ize, -ative), systematically generates a large vocabulary network spanning different semantic domains and parts of speech.
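A text tree of this kind can be generated from the parent links alone. The following sketch assumes nodes are dicts with `word` and `parent` keys and sorts siblings alphabetically, which is a presentation choice made here, not something the corpus prescribes.

```python
from collections import defaultdict


def render_tree(root, nodes):
    """Render parent-pointer nodes as an indented text tree."""
    children = defaultdict(list)
    for n in nodes:
        children[n["parent"]].append(n["word"])

    lines = [root]

    def walk(word, indent):
        kids = sorted(children.get(word, []))
        for i, kid in enumerate(kids):
            last = i == len(kids) - 1
            lines.append(indent + ("└── " if last else "├── ") + kid)
            # Children of the last sibling are indented without a guide bar.
            walk(kid, indent + ("    " if last else "│   "))

    walk(root, "")
    return "\n".join(lines)


fragment = [
    {"word": "inform", "parent": "form"},
    {"word": "information", "parent": "inform"},
    {"word": "misinformation", "parent": "information"},
    {"word": "reform", "parent": "form"},
]
tree = render_tree("form", fragment)
print(tree)
```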
5. Root Teaching Value Scoring System
5.1 Motivation
Not all roots are equally valuable for language learners. An ideal "teaching-priority root" should simultaneously: (1) have a sufficiently large family to demonstrate morphological combinatorial power; (2) cover common, frequently encountered words; (3) involve diverse affix types to showcase rich derivational patterns.
5.2 Scoring Formula
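We designed a composite scoring system incorporating seven factors. The weights and caps listed below sum to 127 points (132 if the +5 affix bonus is counted on top of the 15-point affix diversity cap), so the theoretical maximum is roughly 130 points: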
| Factor | Weight | Calculation | Rationale |
|---|---|---|---|
| CEFR coverage ratio | 40 | Proportion of family words with CEFR tags × 40 | Ensures family words are "worth learning" standard vocabulary |
| High-frequency ratio | 20 | Proportion with COCA rank < 5,000 × 20 | Ensures coverage of commonly encountered words |
| CEFR word count | 24 (cap) | +2 per tagged word, max 12 words | Larger families provide more learning material |
| CEFR span | 15 (cap) | Highest CEFR level − lowest, ×3 per level | Wide-span families support A1-through-C2 vertical teaching |
| Anchor bonus | 5 | +5 if family contains an A1/A2 word | Basic words serve as cognitive anchors for learners |
| Affix diversity | 15 (cap) | Unique (type, text) pairs × 1.5; +5 if both prefix and suffix present | More affix types = richer derivational patterns |
| Size sweet spot | 8 | +8 for 6–15 words; <5: −5/word; >15: −1.5/word | Too small lacks material; too large risks cognitive overload |
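The table can be read as the following scoring function. This is a hedged sketch rather than the authors' implementation, and it rests on several labeled assumptions: words are dicts with optional `cefr` and `coca_rank` keys (hypothetical names); the +5 affix bonus counts toward the 15-point cap; the size penalties apply per word below 5 or above 15; and a family of exactly 5 words receives neither bonus nor penalty, since the table leaves that case open. The toy input values are invented for illustration.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def teaching_score(words, affixes):
    """Score one family.

    words:   list of dicts with optional 'cefr' and 'coca_rank' keys.
    affixes: set of (type, text) pairs used anywhere in the family.
    """
    n = len(words)
    tagged = [w for w in words if w.get("cefr")]
    levels = [CEFR_LEVELS.index(w["cefr"]) for w in tagged]
    frequent = [w for w in words
                if w.get("coca_rank") is not None and w["coca_rank"] < 5000]

    score = 0.0
    if n:
        score += 40 * len(tagged) / n        # CEFR coverage ratio
        score += 20 * len(frequent) / n      # high-frequency ratio
    score += min(2 * len(tagged), 24)        # CEFR word count, capped
    if levels:
        score += min(3 * (max(levels) - min(levels)), 15)  # CEFR span
        if min(levels) <= 1:                 # contains an A1/A2 anchor word
            score += 5

    diversity = 1.5 * len(affixes)
    if {"prefix", "suffix"} <= {kind for kind, _ in affixes}:
        diversity += 5
    score += min(diversity, 15)              # assumption: bonus inside the cap

    if 6 <= n <= 15:
        score += 8
    elif n < 5:
        score -= 5 * (5 - n)                 # assumption: per word below 5
    elif n > 15:
        score -= 1.5 * (n - 15)              # assumption: per word above 15
    return score


# Invented toy family: six tagged, high-frequency words spanning A1 to B2.
toy_words = [
    {"cefr": "A1", "coca_rank": 300}, {"cefr": "A2", "coca_rank": 1200},
    {"cefr": "A2", "coca_rank": 2500}, {"cefr": "B1", "coca_rank": 3100},
    {"cefr": "B2", "coca_rank": 4200}, {"cefr": "B2", "coca_rank": 4900},
]
toy_affixes = {("prefix", "un-"), ("suffix", "-ful"),
               ("suffix", "-ly"), ("suffix", "-less")}
toy_score = teaching_score(toy_words, toy_affixes)
print(toy_score)
```

Under these assumptions the toy family scores 40 + 20 + 12 + 9 + 5 + 11 + 8 = 105 points. Different readings of the ambiguous rows would shift scores slightly but not the overall ranking logic.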
5.3 Ranking Results
We scored and ranked all 7,533 roots, extracting the top 300 teaching-priority roots. Here are the top 15:
| Rank | Root | Meaning | Origin | Family Size | Score |
|---|---|---|---|---|---|
| 1 | all | every part, the whole of | Old English | 9 | 117.8 |
| 2 | one | a single person or thing | Old English | 12 | 115.8 |
| 3 | cid/cut | to cut or kill | Latin | 12 | 113.8 |
| 4 | door | entrance or barrier | Old English | 12 | 113.8 |
| 5 | ever | always, at any time | Old English | 12 | 110.3 |
| 6 | fresh | new, pure | Germanic | 14 | 109.0 |
| 7 | care | sorrow, anxiety | Old English | 15 | 107.7 |
| 8 | jus | law, right | Latin | 15 | 107.7 |
| 9 | life | condition of living | Old English | 15 | 107.7 |
| 10 | able | having power or skill | Latin | 14 | 107.0 |
| 11 | for | before, in front of | Old English | 12 | 106.5 |
| 12 | cess | to go, to yield | Latin | 13 | 106.3 |
| 13 | train | to pull, to train | French | 13 | 105.0 |
| 14 | loc | place, location | Latin | 12 | 104.3 |
| 15 | nature | nature, character | Latin | 13 | 104.0 |
The ranking results reveal an interesting pattern: the highest-ranked roots are not the largest word families. The biggest families like logy (124 words) and graph (116 words) actually score lower due to size penalties — they are better suited as advanced academic vocabulary materials rather than introductory teaching candidates. The truly top-ranked roots are those with "moderate size, dense high-frequency coverage, and rich affix diversity."
6. Application Framework for Structured Vocabulary Learning
6.1 Phased Learning Pathway
Synthesizing our corpus analysis with established language teaching research [1, 5, 7], we propose the following phased framework for structured vocabulary instruction:
Phase 1: Affix Awareness Activation (A1–A2)
Teach 10 core prefixes (un-, re-, in-/im-, dis-, over-, out-, pre-, mis-, under-, sub-) and 10 core derivational suffixes (-er, -ness, -ly, -ful, -less, -tion/-sion, -ment, -able/-ible, -ous, -al). These 20 affixes represent the highest-frequency, most transparent affixes as defined by Bauer and Nation [5], covering the vast majority of common derivational relationships in our corpus. The teaching goal is not to memorize affix lists but to cultivate the awareness of "attempting to decompose a long word when encountering one."
Phase 2: Germanic Core Roots (A2–B1)
Prioritize roots from Old English/Germanic sources: all, one, ever, care, life, home, hand, ground. These roots are themselves basic words that learners already know. The teaching strategy is to "activate root awareness" — helping learners discover that already, almost, altogether, always all derive from all, and alone, lonely, only, once all derive from one.
Phase 3: Productive Latin Roots (B1–B2)
Introduce systematic teaching of Latin roots. While Latin roots are "non-intuitive" for learners (as bound roots that cannot stand alone as words), they possess extraordinary productivity. Our data shows that the 15 largest word families are dominated by Latin and Greek roots, including the following (logy and graph are Greek and are taken up again in Phase 4):
- logy (study of): 124 words — biology, psychology, technology, ecology...
- graph (write): 116 words — photograph, geography, biography, paragraph...
- port (carry): 108 words — import, export, transport, report, support...
- pose (place): 98 words — compose, purpose, suppose, expose, propose...
- press (press): 89 words — express, impress, compress, suppress, depress...
- form (shape): 78 words — inform, reform, transform, uniform, formula...
- tract (draw, pull): 70 words — attract, extract, contract, distract...
- act (do, drive): 69 words — action, react, interact, exact, active...
Each of these roots gives learners a structural key to understanding roughly 70–120 words. Wei and Nation's [7] experimental research also confirmed the significant effectiveness of this "word part technique" for intermediate and advanced learners.
Phase 4: Greek Academic Roots (B2–C1)
Greek roots constitute 11.4% of the corpus but are concentrated in academic and scientific domains. These roots (logy, graph, bio, geo, psych, phil, phon) are core sources for English for Academic Purposes (EAP) and standardized tests (GRE, TOEFL). At this stage, cross-disciplinary vocabulary instruction integrated with subject content becomes particularly effective.
Phase 5: Advanced Word-Formation Decoding (C1–C2)
Develop the ability to rapidly decode derivation structures three to four layers deep. For example, when encountering the unfamiliar word disproportionately, learners should be able to decompose it as dis- (negation) + pro- (for, in relation to) + portion (share, part; from Latin portio, not portare "to carry") + -ate (adjectivalization) + -ly (adverbialization) → "in a way that is out of proportion."
6.2 Product Design Recommendations
Based on our corpus and analysis, we offer the following design recommendations for vocabulary learning products:
1. Interactive Root Derivation Maps
Visualize the derivation tree data as interactive tree diagrams. Learners can click any root to see its complete derivation family, with hover tooltips showing affix meanings and word definitions. This "panoramic view" helps learners build systematic connections between words rather than memorizing them in isolation.
2. "One Root, Many Words" Teaching Mode
Each teaching unit centers on a single root, with affix combination exercises built around it. For instance, when teaching the port root: import (in- + port) → export (ex- + port) → transport (trans- + port) → portable (port + -able) → transportation (transport + -ation), allowing learners to understand each affix's role through comparison.
3. Affix Decomposition Exercises
Present a complex word (e.g., uncomfortable) and have learners identify the root and each affix (un- + comfort + -able), then infer the whole-word meaning from the component meanings. This "reverse decoding" exercise is the core of morphological awareness training.
4. CEFR-Graded Delivery
Leverage the existing CEFR tags and COCA frequency data to deliver word families at appropriate difficulty levels. A2-level learners see simple families built around Germanic core roots, while B2-level learners encounter the complex derivational networks of productive Latin roots.
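The affix decomposition exercise in recommendation 3 can be generated automatically from the derivation data by following parent links from a word back to its root. The sketch below assumes nodes are dicts with `word`, `parent`, and `affix_added` keys as in Section 2.2; the function name is illustrative.

```python
def decompose(word, root, nodes):
    """List the affixation steps from the root up to the given word."""
    by_word = {n["word"]: n for n in nodes}
    steps = []
    while word != root:
        node = by_word[word]               # KeyError if the word is unknown
        steps.append((node["affix_added"]["text"], node["word"]))
        word = node["parent"]
        if len(steps) > len(by_word):      # guard against accidental cycles
            raise ValueError("cycle in parent links")
    return list(reversed(steps))           # innermost step first


form_nodes = [
    {"word": "inform", "parent": "form",
     "affix_added": {"type": "prefix", "text": "in-"}},
    {"word": "informative", "parent": "inform",
     "affix_added": {"type": "suffix", "text": "-ative"}},
    {"word": "uninformative", "parent": "informative",
     "affix_added": {"type": "prefix", "text": "un-"}},
]
steps = decompose("uninformative", "form", form_nodes)
for affix, result in steps:
    print(f"{affix:>7} -> {result}")
```

An exercise generator could hide the affix column and ask learners to fill it in, then reveal the stored affix meanings as feedback.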
7. Discussion and Limitations
7.1 Cross-Validation with Published Research
Our corpus (7,533 word families, 54,795 nodes) significantly exceeds the scale of previously hand-annotated morphological datasets. We cross-validate our findings against three independent lines of research:
Etymological distribution consistency: Finkenstaedt and Wolff [8] computed statistics on approximately 80,000 entries in the Shorter Oxford Dictionary, finding Latin 28.2%, French 28.3%, Germanic 25%, Greek 5.3%. Our figures (Latin 39.4%, Germanic 19.6%, Greek 11.4%, French 7.0%) appear different at first glance, but this is expected: our analysis traces to etymological roots rather than individual words. Many words that entered English through French (e.g., justice, information, government) ultimately trace to Latin roots and are classified as Latin in our root-level analysis. The combined Latin+French proportions are of the same order (ours: 46.4%, F&W: 56.5%); the remaining gap of about ten percentage points is plausibly attributable to differences in classification granularity and in the unit of counting (word families versus dictionary entries).
Prefix ranking consistency: White, Sowell, and Yanagihara [9], analyzing The American Heritage Word Frequency Book, found that un- accounts for 26% of all prefixed words, and the top four (un-, re-, in-, dis-) account for 58%. In our corpus, the top three prefixes match exactly (un-, re-, in-), while dis-, fourth in their list, ranks sixth in ours behind over- and de-. This agreement at the top of the ranking suggests that our corpus reflects the frequency distribution of the English prefix system reasonably well.
Word frequency × etymology pattern: Williams [10] found that 83% of the 1,000 most common English words are Germanic in origin, while only 25% of the rarest dictionary entries are Germanic. This pattern directly explains a striking feature of our teaching-priority ranking: the highest-scoring roots are disproportionately Germanic (all, one, ever, care, life) because they contain the highest proportion of high-frequency words with strong COCA scores. This is not coincidental but a direct reflection of English vocabulary's "Germanic core + Latin/Greek periphery" structure.
7.2 Limitations
- Inherent limitations of AI generation: Despite multiple rounds of human+AI joint verification, a small number of etymologically contested derivational relationships may remain (some words have genuinely debated etymologies in academic linguistics).
- Oxford dictionary boundary: The corpus covers standard vocabulary recorded in the Oxford dictionary but excludes specialized terminology, slang, and neologisms.
- Synchronic assumption: The morphological analysis takes a synchronic view of modern English; some historical derivational relationships (e.g., filth, historically foul + -th) may no longer be transparent to contemporary learners.
8. Conclusion
Through systematic analysis of 7,533 English word families comprising 54,795 derivation nodes, this study reveals five core patterns in the structural composition of English vocabulary:
- Highly systematic: Over 95% of word families in the corpus contain at least one derived word in addition to the root, so nearly every root participates in a derivational pattern
- Concentrated origins: Latin + Greek roots account for 50.8% of all families, offering a "learn one root, unlock dozens of words" leverage effect
- Extreme affix reuse: The top 10 prefixes and suffixes cover the vast majority of common derivations
- Manageable depth: 86.4% of word families have a maximum depth of two or less (at most three layers including the root), which keeps structural analysis tractable for instruction
- Moderate size: 80.3% of word families contain 2–10 words, aligning well with the cognitive capacity of a single teaching unit
These findings provide a solid data foundation for structured vocabulary instruction. Our root teaching value scoring system and phased learning framework combine morphological theory with large-scale corpus analysis, offering actionable guidance for the design of Computer-Assisted Language Learning (CALL) systems.
At a time when vocabulary size constitutes the core bottleneck in language learning, structured morphological pedagogy is not an optional "nice-to-have" but a necessary strategy for improving vocabulary acquisition efficiency. Our data shows that the internal structure of English vocabulary is far more orderly than it appears on the surface — and this orderliness is precisely the foundation of efficient learning.
References
[1] Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge University Press.
[2] Nation, I. S. P. (2006). "How Large a Vocabulary Is Needed For Reading and Listening?" Canadian Modern Language Review, 63(1), 59–82.
[3] Aronoff, M., & Fudeman, K. (2011). What is Morphology? (2nd ed.). Wiley-Blackwell.
[4] Corson, D. (1997). "The Learning and Use of Academic English Words." Language Learning, 47(4), 671–718.
[5] Bauer, L., & Nation, I. S. P. (1993). "Word Families." International Journal of Lexicography, 6(4), 253–279.
[6] Sweller, J. (1988). "Cognitive Load During Problem Solving: Effects on Learning." Cognitive Science, 12(2), 257–285.
[7] Wei, Z., & Nation, I. S. P. (2013). "The Word Part Technique: A Very Useful Vocabulary Teaching Technique." Modern English Teacher, 22(1), 12–16.
[8] Finkenstaedt, T., & Wolff, D. (1973). Ordered Profusion: Studies in Dictionaries and the English Lexicon. C. Winter.
[9] White, T. G., Sowell, J., & Yanagihara, A. (1989). "Teaching Elementary Students to Use Word-Part Clues." The Reading Teacher, 42(4), 302–308.
[10] Williams, J. M. (1975). Origins of the English Language: A Social and Linguistic History. Free Press.