The Morphological Structure of English Words: A Corpus-Based Analysis of 7,533 Word Families

Abstract

The English lexicon is vast, but far from chaotic. Morphological research shows that English vocabulary possesses a highly systematic internal structure, decomposable into three fundamental building blocks: roots, prefixes, and suffixes. This article presents findings from the construction and analysis of a large-scale word family derivation tree corpus — covering 7,533 word families and 54,795 derivation nodes — to reveal the structural composition patterns of English vocabulary. We find that: (1) Latin and Greek roots together account for 50.8% of all English word families; (2) the top 10 most frequent prefixes and suffixes alone cover the vast majority of common derivational relationships; (3) 86.4% of word families have a maximum depth of two or fewer (root → first-order derivation → second-order derivation). Based on these findings, we propose a root teaching value scoring system and a phased framework for structured vocabulary instruction, providing a data-driven theoretical foundation for computational approaches to English vocabulary pedagogy.

1. Introduction

English, as one of the most widely spoken languages globally, presents learners with an immense vocabulary challenge. It is estimated that educated native English speakers know approximately 20,000 word families ^[1], while second language learners typically need 8,000–9,000 word families for unassisted reading of English texts ^[2]. Under traditional word-by-word memorization approaches, achieving this target demands enormous investments of time and cognitive resources.

However, linguistic research has long established that English words are not isolated, discrete units, but rather systematic combinations of a finite set of morphemes ^[3]. A single root, through the addition of various prefixes and suffixes, can generate dozens of derived words and, in the largest families, more than a hundred. For example, the Latin root form ("to shape") gives rise to a family of 78 words in our corpus, including inform, information, transformation, and conformity. By mastering the meaning of the root form and the functions of core affixes, learners can systematically work out meanings across the entire word family. This is the leverage effect of morphological awareness in vocabulary acquisition.

Corson ^[4] identified a significant "Graeco-Latin / Anglo-Saxon divide" in English vocabulary: everyday language draws primarily from Germanic sources, while academic, scientific, and formal registers rely heavily on Latin and Greek. Nation ^[1] further argued for the word family as the fundamental unit of vocabulary instruction, demonstrating that once learners master basic affixation rules, they can automatically extend a known word to its entire family. Bauer and Nation ^[5] proposed a graded system of affix difficulty, providing a principled progression for morphological instruction.

While these theoretical frameworks are well established, existing research has largely relied on small-scale manual analysis or sampling surveys. To our knowledge, our work is the first large-scale, computationally constructed corpus of English word family derivation trees. It was built from Oxford dictionary vocabulary, with AI-assisted etymological tracing and derivational annotation, followed by multiple rounds of quality verification. This article aims to:

(1) Systematically characterize the structural composition of English vocabulary: the distribution patterns of roots, prefixes, and suffixes
(2) Quantitatively analyze the topological properties of derivation trees: depth, breadth, and size distributions
(3) Discuss the practical implications for structured vocabulary instruction

2. Corpus Construction Methodology

2.1 Data Sources and Pipeline

Our derivation tree corpus was built from the Oxford English Dictionary word list through a six-stage pipeline:

(1) Core vocabulary extraction: Extracting words with CEFR (Common European Framework of Reference) level tags from the Oxford dictionary
(2) Etymological tracing: Using large language models to trace each word's core root according to strict etymological standards
(3) Derivation tree generation: Building complete derivation trees for each root, strictly following the principle of atomic affixation (one affix per derivation step)
(4) Multi-round quality correction: Including false parent-child relationship detection, affix annotation repair, POS tag correction, and cross-tree duplicate elimination
(5) Singleton processing: Re-homing or building independent trees for isolated words
(6) Merge optimization: Consolidating duplicate roots and undersized families for data consistency

The entire process adhered to five strict rules: (1) 100% etymological accuracy — no guessing based on spelling similarity; (2) internal tree consistency — every word must have a correct parent; (3) atomic affixation — each derivation step adds exactly one prefix or suffix; (4) correct extraction of Latin/Greek bound roots; (5) Germanic free morphemes use the base word form as the root.

2.2 Data Schema

Each record in the corpus represents a complete word family derivation tree with the following fields:

Field	Type	Description
`family_id`	string	Unique identifier for the word family
`root_info`	object	Root metadata: form, meaning (bilingual), source language, spelling variants
`derivation_nodes`	array	All derivation nodes, each containing: word, pos, parent, affix_added (type, text, meaning)
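
To make the schema concrete, the following Python sketch models one record and checks two of the rules from Section 2.1: every non-root word has a parent inside its own tree, and every derivation step carries exactly one prefix or suffix. Only the three top-level field names come from the table above. The nested key names in `root_info`, the sample record, and the `check_family` helper are illustrative assumptions, not the corpus's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Affix:
    type: str      # "prefix" or "suffix"
    text: str      # e.g. "in-" or "-ation"
    meaning: str


@dataclass
class Node:
    word: str
    pos: str
    parent: Optional[str]          # assumption: None marks the root node
    affix_added: Optional[Affix]   # assumption: None for the root node


@dataclass
class Family:
    family_id: str
    root_info: dict
    derivation_nodes: list = field(default_factory=list)


def check_family(family):
    """Return a list of rule violations (empty if the tree looks consistent)."""
    problems = []
    words = {node.word for node in family.derivation_nodes}
    for node in family.derivation_nodes:
        if node.parent is None:
            continue  # root node: nothing to check
        if node.parent not in words:
            problems.append(f"{node.word}: parent {node.parent!r} not in tree")
        if node.affix_added is None or node.affix_added.type not in ("prefix", "suffix"):
            problems.append(f"{node.word}: needs exactly one prefix or suffix")
    return problems


form_family = Family(
    family_id="form",  # hypothetical identifier
    root_info={"form": "form", "meaning": "to shape",
               "source_language": "Latin", "variants": []},
    derivation_nodes=[
        Node("form", "noun", None, None),
        Node("inform", "verb", "form", Affix("prefix", "in-", "into")),
        Node("information", "noun", "inform", Affix("suffix", "-ation", "action or state")),
    ],
)

print(check_family(form_family))  # []
```

A check of this kind only covers structural consistency; it cannot confirm that a parent-child link is etymologically correct.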

3. Structural Composition of English Vocabulary

3.1 Roots: The Semantic Core

The root is the minimal unit carrying core semantic content in English words. In our corpus, 7,533 word families correspond to 7,533 distinct roots. The distribution of roots by source language is as follows:

Source Language	Word Families	Proportion
Latin	2,970	39.4%
Other languages	1,637	21.7%
Old English / Germanic	1,474	19.6%
Greek	859	11.4%
French	531	7.0%
Blends / Acronyms	62	0.8%

Figure 1: Distribution of 7,533 word family roots by source language. Latin and Greek together comprise 50.8%.

This distribution is consistent with Corson's ^[4] classic observation: Latin and Greek roots together account for 50.8% of the word families in our corpus. More than half of these families can therefore be decoded systematically through classical roots. From a pedagogical perspective, this ratio gives root learning an exceptionally high "return on investment": mastering a single productive Latin root such as port ("to carry") gives access to a family of 108 words, including import, export, transport, report, support, portable, and deportation.
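
The proportions quoted above follow directly from the counts in the table. A short Python check, using only the table's numbers (the variable names are ours), reproduces them:

```python
# Counts copied from the source-language table above.
root_counts = {
    "Latin": 2970,
    "Other languages": 1637,
    "Old English / Germanic": 1474,
    "Greek": 859,
    "French": 531,
    "Blends / Acronyms": 62,
}

total = sum(root_counts.values())
shares = {lang: round(100 * n / total, 1) for lang, n in root_counts.items()}
classical = round(100 * (root_counts["Latin"] + root_counts["Greek"]) / total, 1)

print(total)                             # 7533
print(shares["Latin"], shares["Greek"])  # 39.4 11.4
print(classical)                         # 50.8
```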

Germanic roots, while numerically smaller (19.6%), carry the most fundamental, high-frequency everyday vocabulary: all, one, hand, life, home, etc. These roots are themselves common words that learners already know, making them natural "anchor points" for initiating morphological instruction.

3.2 Prefixes: Modifying Semantic Direction

Prefixes attach before roots or existing words, primarily functioning to modify semantic direction — negation, repetition, excess, location, and more. Our corpus contains 694 distinct prefixes, but usage frequency is highly concentrated.

Top 10 most frequent prefixes:

Prefix	Occurrences	Core Meaning	Examples
un-	2,190	negation / reversal	unhappy, undo, unusual
re-	592	again / back	rebuild, rewrite, return
in-/im-/ir-/il-	338	negation / into	impossible, illegal, irregular
over-	203	excess / above	overwork, overcome, overlook
de-	178	removal / down	decode, decrease, depart
dis-	171	negation / apart	disagree, disappear, discover
sub-	153	under / secondary	subway, subtitle, submarine
pre-	134	before	preview, prepare, predict
out-	127	beyond / external	outdoor, output, outstanding
a-/ab-	119	away / on	abnormal, abroad, asleep

Figure 2: Top 10 most frequent prefixes by occurrence count. un- leads with 2,190 occurrences.

Notably, the prefix un- alone accounts for 2,190 derivational relationships, more than the other nine prefixes in the table combined (2,015). This aligns with Bauer and Nation's ^[5] treatment of un- as one of the most frequent and regular English prefixes, and therefore an early teaching target. The top 10 prefixes collectively cover 4,205 derivations, forming the core backbone of the English prefix system.

3.3 Suffixes: Determining Part of Speech and Grammatical Function

Suffixes attach after roots or existing words. Their most important function is to change part of speech: converting verbs to nouns, nouns to adjectives, and so on. The corpus contains 1,970 distinct suffixes.

Top 10 most frequent suffixes:

Suffix	Occurrences	Grammatical Function	Examples
-s	4,448	plural / 3rd person singular	books, runs
-ed	4,013	past tense / past participle	played, informed
-ing	3,936	progressive / gerund	running, learning
-ly	3,233	adverb formation	quickly, carefully
-er	2,289	agent / comparative	teacher, bigger
-ness	1,607	abstract noun formation	happiness, darkness
-y	1,094	adjective formation	rainy, sandy
-ion/-tion/-ation	864	noun (action/state)	education, information
-al	756	adjective formation	formal, natural
-ity	717	abstract noun formation	reality, ability

Figure 3: Top 10 most frequent suffixes, distinguishing inflectional (amber) from derivational (green).

The suffix system shows a clear two-tier differentiation. Inflectional suffixes (-s, -ed, -ing) mark grammatical categories such as number and tense; they have the highest frequencies but do not change a word's basic meaning or part of speech. Derivational suffixes (-ness, -ion/-tion, -al, -ity) create new words, typically changing part of speech along with meaning. The suffix -er belongs to both tiers: it is derivational in teacher and inflectional in bigger. From a teaching perspective, derivational suffixes have higher instructional value, as they are the key mechanism for "word family expansion."

3.4 Part-of-Speech Distribution

The POS distribution across all derivation nodes in the corpus:

Part of Speech	Count	Proportion
Noun	25,982	47.4%
Adjective	14,438	26.3%
Verb	9,978	18.2%
Adverb	3,654	6.7%
Other	743	1.4%

Nouns constitute nearly half (47.4%) of all derived words, consistent with Bauer and Nation's ^[5] observation that the English derivational system is particularly productive at nominalizing verbs and adjectives through suffixes such as -tion, -ness, -ment, and -ity. This finding suggests that suffix instruction should particularly emphasize the verb/adjective → noun conversion pathway.

4. Topological Properties of Derivation Trees

4.1 Tree Depth Distribution

The "depth" of a derivation tree represents the number of affixation steps from the root to the most distant derived word. We analyzed the maximum depth across all 7,533 word families:

Max Depth	Word Families	Proportion	Cumulative
0 (root only)	362	4.8%	4.8%
1 (root + direct derivation)	2,869	38.1%	42.9%
2 (three-layer structure)	3,274	43.5%	86.4%
3	858	11.4%	97.7%
4	155	2.1%	99.8%
5–6	15	0.2%	100%

Figure 4: Derivation tree depth distribution across 7,533 word families. 86.4% fall within depth 2.

This is a significant finding: 86.4% of word families have a maximum depth of 2 or less. This means the vast majority of English word families can be structurally decoded by understanding "root + 1 to 2 affixes." Even a complex word such as uninformatively (un- + in- + form + -ative + -ly, depth 4) involves only four simple affixation steps, and just 0.2% of families reach depth 5 or 6. This conclusion provides confidence for pedagogy: developing morphological analysis skills does not require mastering complex recursive rules, only understanding a finite set of combinatorial patterns.
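
Depth as defined above can be computed by following parent links from a word back to its root. The Python sketch below does this for a small fragment of the form family. The `parent_of` dictionary is an assumed in-memory layout chosen for illustration, not the corpus's storage format.

```python
# Child -> parent links for a fragment of the form family (the root maps to None).
parent_of = {
    "form": None,
    "inform": "form",
    "informative": "inform",
    "uninformative": "informative",
    "uninformatively": "uninformative",
    "formal": "form",
    "informal": "formal",
}


def depth(word, parent_of):
    """Number of affixation steps between `word` and the root of its tree."""
    steps = 0
    while parent_of[word] is not None:
        word = parent_of[word]
        steps += 1
    return steps


def max_depth(parent_of):
    """Maximum depth of a family: the depth of its deepest word."""
    return max(depth(word, parent_of) for word in parent_of)


print(depth("informal", parent_of))         # 2
print(depth("uninformatively", parent_of))  # 4
print(max_depth(parent_of))                 # 4
```

Because each step adds exactly one affix, the depth of a word equals the number of affixes attached to its root.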

4.2 Family Size Distribution

The number of words in each word family (the "breadth" of the derivation tree) distributes as follows:

Family Size	Count	Proportion
1 word (singleton)	352	4.7%
2–5 words	2,858	37.9%
6–10 words	3,196	42.4%
11–20 words	878	11.7%
21–50 words	228	3.0%
51+ words	21	0.3%

Figure 5: Word family size distribution. 80.3% contain 2–10 words, with a mean of 7.27.

Statistics show that 80.3% of word families contain 2–10 words. The mean is 7.27 and the median is 6. This size falls squarely within Nation's ^[1] typical range for the "word family" concept and is compatible with cognitive load theory ^[6] — presenting 6–10 related words in a single teaching unit provides enough material to demonstrate the root's derivational power without causing information overload.
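
The size bands and the mean can be reproduced from the numbers already given. The Python sketch below maps a family size to its band and recomputes the mean as total nodes divided by total families. The helper name `size_band` and the band labels are ours.

```python
from bisect import bisect_left

# Upper bounds and labels of the size bands used in the table above.
BOUNDS = [1, 5, 10, 20, 50]
LABELS = ["1", "2-5", "6-10", "11-20", "21-50", "51+"]


def size_band(n_words):
    """Map a family size (number of words, root included) to its band label."""
    return LABELS[bisect_left(BOUNDS, n_words)]


# Family counts per band, copied from the table above.
band_counts = {"1": 352, "2-5": 2858, "6-10": 3196,
               "11-20": 878, "21-50": 228, "51+": 21}

n_families = sum(band_counts.values())
mean_size = 54795 / n_families  # total derivation nodes / total families

print(n_families)           # 7533
print(round(mean_size, 2))  # 7.27
print(size_band(6))         # 6-10
print(size_band(78))        # 51+
```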

4.3 Example: Complete Derivation Structure of the form Family

The following illustrates one of the larger word families in the corpus (78 words), built from the Latin root form ("to shape," from Latin formare):

form [root: to shape or fashion, Latin]
├── con- + form → conform (to comply)
│   ├── conform + -ity → conformity
│   ├── conform + -ist → conformist
│   │   └── non- + conformist → nonconformist
│   └── conform + -ation → conformation
├── de- + form → deform (to distort)
│   └── deform + -ation → deformation
├── form + -al → formal
│   ├── formal + -ity → formality
│   ├── formal + -ize → formalize
│   └── in- + formal → informal
├── in- + form → inform (to tell)
│   ├── inform + -ation → information
│   │   ├── dis- + information → disinformation
│   │   └── mis- + information → misinformation
│   └── inform + -ative → informative
│       └── un- + informative → uninformative
├── re- + form → reform
│   └── reform + -ation → reformation
├── trans- + form → transform
│   └── transform + -ation → transformation
└── uni- + form → uniform
    └── uniform + -ity → uniformity

Figure 6: Partial derivation tree visualization of the Latin root "form," showing 24 of 78 words. Each step adds exactly one affix.

This example clearly demonstrates the core mechanism of morphology: a single root form, through prefixes (con-, de-, in-, re-, trans-, uni-, non-, dis-, mis-, un-) and suffixes (-al, -ation, -ity, -ize, -ive, -ative), systematically generates a vast vocabulary network spanning different semantic domains and parts of speech.
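
A tree like the one above can be rebuilt from flat (word, parent) pairs of the kind stored in `derivation_nodes`. The Python sketch below groups children under their parents and prints the result with box-drawing characters. The tuple layout and the `render` helper are illustrative assumptions, and only a small fragment of the family is included.

```python
from collections import defaultdict

# (word, parent) pairs for a fragment of the form family; the root has parent None.
nodes = [
    ("form", None),
    ("inform", "form"),
    ("information", "inform"),
    ("misinformation", "information"),
    ("informative", "inform"),
    ("reform", "form"),
    ("reformation", "reform"),
]


def render(nodes):
    """Return the tree as a list of text lines, one word per line."""
    children = defaultdict(list)
    root = None
    for word, parent in nodes:
        if parent is None:
            root = word
        else:
            children[parent].append(word)

    lines = [root]

    def walk(word, prefix):
        kids = children[word]
        for i, kid in enumerate(kids):
            last = i == len(kids) - 1
            lines.append(prefix + ("└── " if last else "├── ") + kid)
            walk(kid, prefix + ("    " if last else "│   "))

    walk(root, "")
    return lines


print("\n".join(render(nodes)))
```

The sketch prints bare words; the affix labels shown in the tree above would come from each node's `affix_added` field.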

5. Root Teaching Value Scoring System

5.1 Motivation

Not all roots are equally valuable for language learners. An ideal "teaching-priority root" should simultaneously: (1) have a sufficiently large family to demonstrate morphological combinatorial power; (2) cover common, frequently encountered words; (3) involve diverse affix types to showcase rich derivational patterns.

5.2 Scoring Formula

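We designed a composite scoring system incorporating seven factors. The weights and caps listed below sum to a theoretical maximum of 127 points, or 132 if the +5 affix bonus is applied on top of the diversity cap: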

Factor	Weight	Calculation	Rationale
CEFR coverage ratio	40	Proportion of family words with CEFR tags × 40	Ensures family words are "worth learning" standard vocabulary
High-frequency ratio	20	Proportion with COCA rank < 5,000 × 20	Ensures coverage of commonly encountered words
CEFR word count	24 (cap)	+2 per tagged word, max 12 words	Larger families provide more learning material
CEFR span	15 (cap)	Highest CEFR level − lowest, ×3 per level	Wide-span families support A1-through-C2 vertical teaching
Anchor bonus	5	+5 if family contains an A1/A2 word	Basic words serve as cognitive anchors for learners
Affix diversity	15 (cap)	Unique (type, text) pairs × 1.5; +5 if both prefix and suffix present	More affix types = richer derivational patterns
Size sweet spot	8	+8 for 6–15 words; <5: −5/word; >15: −1.5/word	Too small lacks material; too large risks cognitive overload
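
The table can be read as a single scoring function. The Python sketch below is one possible reading, not the scoring code used to produce the rankings. Several details are not specified in the table and are assumptions here: the input format, whether the +5 affix bonus counts towards the 15-point cap (assumed yes), how the size penalties scale (assumed per word of distance from the thresholds 5 and 15), and the treatment of a family of exactly 5 words (assumed neutral).

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def teaching_score(words, affixes):
    """Score one non-empty family.

    words:   list of dicts with keys "cefr" (level string or None) and
             "coca_rank" (int or None); a hypothetical input format.
    affixes: set of (type, text) pairs, e.g. {("prefix", "un-")}.
    """
    size = len(words)
    levels = [CEFR_LEVELS.index(w["cefr"]) for w in words if w["cefr"]]
    frequent = [w for w in words
                if w["coca_rank"] is not None and w["coca_rank"] < 5000]

    score = 40 * len(levels) / size        # CEFR coverage ratio
    score += 20 * len(frequent) / size     # high-frequency ratio
    score += min(2 * len(levels), 24)      # CEFR word count, capped at 12 words
    if levels:
        score += min(3 * (max(levels) - min(levels)), 15)  # CEFR span
        if min(levels) <= 1:               # family contains an A1 or A2 word
            score += 5
    # Assumption: the +5 prefix-and-suffix bonus counts towards the cap.
    diversity = 1.5 * len(affixes)
    if {kind for kind, _ in affixes} >= {"prefix", "suffix"}:
        diversity += 5
    score += min(diversity, 15)
    # Assumption: penalties grow with distance from the thresholds.
    if 6 <= size <= 15:
        score += 8
    elif size < 5:
        score -= 5 * (5 - size)
    elif size > 15:
        score -= 1.5 * (size - 15)
    return score


toy_words = [
    {"cefr": "A1", "coca_rank": 100},
    {"cefr": "B1", "coca_rank": 900},
    {"cefr": "B2", "coca_rank": 3000},
    {"cefr": None, "coca_rank": 7000},
    {"cefr": None, "coca_rank": None},
    {"cefr": None, "coca_rank": 20000},
]
toy_affixes = {("prefix", "un-"), ("suffix", "-ly"),
               ("suffix", "-ness"), ("suffix", "-ful")}

print(teaching_score(toy_words, toy_affixes))  # 69.0
```

For the toy family, the seven contributions are 20 + 10 + 6 + 9 + 5 + 11 + 8 = 69. Different readings of the ambiguous rules would shift scores for very small and very large families.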

5.3 Ranking Results

We scored and ranked all 7,533 roots, extracting the top 300 teaching-priority roots. Here are the top 15:

Rank	Root	Meaning	Origin	Family Size	Score
1	all	every part, the whole of	Old English	9	117.8
2	one	a single person or thing	Old English	12	115.8
3	cid/cut	to cut or kill	Latin	12	113.8
4	door	entrance or barrier	Old English	12	113.8
5	ever	always, at any time	Old English	12	110.3
6	fresh	new, pure	Germanic	14	109.0
7	care	sorrow, anxiety	Old English	15	107.7
8	jus	law, right	Latin	15	107.7
9	life	condition of living	Old English	15	107.7
10	able	having power or skill	Latin	14	107.0
11	for	before, in front of	Old English	12	106.5
12	cess	to go, to yield	Latin	13	106.3
13	train	to pull, to train	French	13	105.0
14	loc	place, location	Latin	12	104.3
15	nature	nature, character	Latin	13	104.0

The ranking results reveal an interesting pattern: the highest-ranked roots are not the largest word families. The biggest families like logy (124 words) and graph (116 words) actually score lower due to size penalties — they are better suited as advanced academic vocabulary materials rather than introductory teaching candidates. The truly top-ranked roots are those with "moderate size, dense high-frequency coverage, and rich affix diversity."

6. Application Framework for Structured Vocabulary Learning

6.1 Phased Learning Pathway

Synthesizing our corpus analysis with established language teaching research ^[1, 5, 7], we propose the following phased framework for structured vocabulary instruction:

Phase 1: Affix Awareness Activation (A1–A2)

Teach 10 core prefixes (un-, re-, in-/im-, dis-, over-, out-, pre-, mis-, under-, sub-) and 10 core derivational suffixes (-er, -ness, -ly, -ful, -less, -tion/-sion, -ment, -able/-ible, -ous, -al). These 20 affixes represent the highest-frequency, most transparent affixes as defined by Bauer and Nation ^[5], covering the vast majority of common derivational relationships in our corpus. The teaching goal is not to memorize affix lists but to cultivate the awareness of "attempting to decompose a long word when encountering one."

Phase 2: Germanic Core Roots (A2–B1)

Prioritize roots from Old English/Germanic sources: all, one, ever, care, life, home, hand, ground. These roots are themselves basic words that learners already know. The teaching strategy is to "activate root awareness": helping learners discover that already, almost, altogether, and always are built on all, and that alone, lonely, only, and once go back to one.

Phase 3: Productive Latin Roots (B1–B2)

Introduce systematic teaching of Latin roots. While Latin roots are "non-intuitive" for learners (as bound roots that cannot stand alone as words), they possess extraordinary productivity. Our data shows that the 15 largest word families are dominated by Latin/Greek roots:

logy (study of): 124 words — biology, psychology, technology, ecology...
graph (write): 116 words — photograph, geography, biography, paragraph...
port (carry): 108 words — import, export, transport, report, support...
pose (place): 98 words — compose, purpose, suppose, expose, propose...
press (press): 89 words — express, impress, compress, suppress, depress...
form (shape): 78 words — inform, reform, transform, uniform, formula...
tract (draw, pull): 70 words — attract, extract, contract, distract...
act (do, drive): 69 words — action, react, interact, exact, active...

Each of these roots gives learners a structural key to a family of roughly 70–120 words. Wei and Nation ^[7] also report that this "word part technique" is effective for intermediate and advanced learners.

Phase 4: Greek Academic Roots (B2–C1)

Greek roots constitute 11.4% of the corpus but are concentrated in academic and scientific domains. These roots (logy, graph, bio, geo, psych, phil, phon) are core sources for English for Academic Purposes (EAP) and standardized tests (GRE, TOEFL). At this stage, cross-disciplinary vocabulary instruction integrated with subject content becomes particularly effective.

Phase 5: Advanced Word-Formation Decoding (C1–C2)

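Develop the ability to decode derivation structures of three to four layers at sight. For example, when encountering the unfamiliar word disproportionately, learners should be able to rapidly decompose it: dis- (negation) + proportion (from Latin pro portione, "according to share") + -ate (adjective formation) + -ly (adverb formation) → "to a degree that is out of proportion." Note that port in proportion comes from Latin portio ("share, part"), not from portare ("to carry"), so this word does not belong to the port family.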

6.2 Product Design Recommendations

Based on our corpus and analysis, we offer the following design recommendations for vocabulary learning products:

1. Interactive Root Derivation Maps

Visualize the derivation tree data as interactive tree diagrams. Learners can click any root to see its complete derivation family, with hover tooltips showing affix meanings and word definitions. This "panoramic view" helps learners build systematic connections between words rather than memorizing them in isolation.

2. "One Root, Many Words" Teaching Mode

Each teaching unit centers on a single root, with affix combination exercises built around it. For instance, when teaching the port root: import (in- + port) → export (ex- + port) → transport (trans- + port) → portable (port + -able) → transportation (transport + -ation), allowing learners to understand each affix's role through comparison.

3. Affix Decomposition Exercises

Present a complex word (e.g., uncomfortable) and have learners identify the root and each affix (un- + comfort + -able), then infer the whole-word meaning from the component meanings. This "reverse decoding" exercise is the core of morphological awareness training.

4. CEFR-Graded Delivery

Leverage the existing CEFR tags and COCA frequency data to deliver word families at appropriate difficulty levels. A2-level learners see simple families built around Germanic core roots, while B2-level learners encounter the complex derivational networks of productive Latin roots.
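
The affix decomposition exercise in recommendation 3 can be generated from the derivation data by walking from a word up to its root and collecting the affix added at each step. The Python sketch below shows one way to do this. The `nodes` dictionary is a hypothetical in-memory form of one family, and `decompose` is an illustrative helper rather than part of any existing product.

```python
# word -> (parent, affix type, affix text); the root maps to (None, None, None).
# Hypothetical in-memory form of one family's derivation nodes.
nodes = {
    "comfort": (None, None, None),
    "comfortable": ("comfort", "suffix", "-able"),
    "uncomfortable": ("comfortable", "prefix", "un-"),
    "uncomfortably": ("uncomfortable", "suffix", "-ly"),
}


def decompose(word, nodes):
    """Return the word's parts in surface order: prefixes, root, suffixes."""
    prefixes, suffixes = [], []
    parent, kind, text = nodes[word]
    while parent is not None:
        if kind == "prefix":
            prefixes.append(text)      # outermost prefix is reached first
        else:
            suffixes.insert(0, text)   # outermost suffix is reached first
        word = parent
        parent, kind, text = nodes[word]
    return prefixes + [word] + suffixes


print(" + ".join(decompose("uncomfortable", nodes)))  # un- + comfort + -able
print(" + ".join(decompose("uncomfortably", nodes)))  # un- + comfort + -able + -ly
```

The result lists parts in spelling order, which suits a learner-facing exercise; the order in which the affixes were attached is still available from the parent chain.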

7. Discussion and Limitations

7.1 Cross-Validation with Published Research

Figure 7: Cross-validation of our corpus analysis against three landmark studies.

Our corpus (7,533 word families, 54,795 nodes) is large compared with the manually analyzed samples used in much earlier work on word families. We cross-validate our findings against three independent lines of research:

Etymological distribution consistency: Finkenstaedt and Wolff ^[8] computed statistics on approximately 80,000 entries in the Shorter Oxford Dictionary, finding Latin 28.2%, French 28.3%, Germanic 25%, Greek 5.3%. Our figures (Latin 39.4%, Germanic 19.6%, Greek 11.4%, French 7.0%) appear different at first glance, but this is expected — our analysis traces to etymological roots rather than individual words. Many words that entered English through French (e.g., justice, information, government) ultimately trace to Latin roots, and are classified as Latin in our root-level analysis. The combined Latin+French proportions are broadly consistent (ours: 46.4%, F&W: 56.5%), with the gap attributable to classification granularity.

Prefix ranking consistency: White, Sowell, and Yanagihara ^[9], analyzing The American Heritage Word Frequency Book, found that un- accounts for 26% of all prefixed words, and the top four (un-, re-, in-, dis-) account for 58%. In our corpus, the top three prefixes match exactly (un-, re-, in-), while dis-, fourth in their list, ranks sixth in ours (171 occurrences), just behind over- and de-. This agreement suggests that our corpus reflects the overall distribution of English prefixes reasonably well.

Word frequency × etymology pattern: Williams ^[10] found that 83% of the 1,000 most common English words are Germanic in origin, while only 25% of the rarest dictionary entries are Germanic. This pattern directly explains a striking feature of our teaching-priority ranking: the highest-scoring roots are disproportionately Germanic (all, one, ever, care, life) because they contain the highest proportion of high-frequency words with strong COCA scores. This is not coincidental but a direct reflection of English vocabulary's "Germanic core + Latin/Greek periphery" structure.

7.2 Limitations

Inherent limitations of AI generation: Despite multiple rounds of human+AI joint verification, a small number of etymologically contested derivational relationships may remain (some words have genuinely debated etymologies in academic linguistics).
Oxford dictionary boundary: The corpus covers standard vocabulary recorded in the Oxford dictionary but excludes specialized terminology, slang, and neologisms.
Synchronic assumption: The morphological analysis takes a synchronic view of modern English; some historical derivational relationships (e.g., foul and filth) may no longer be transparent to contemporary learners.

8. Conclusion

Through systematic analysis of 7,533 English word families comprising 54,795 derivation nodes, this study reveals five core patterns in the structural composition of English vocabulary:

Highly systematic: Over 95% of word families in the corpus contain at least one derived form beyond the root; only 4.7% are singletons
Concentrated origins: Latin + Greek roots account for 50.8% of all families, offering a "learn one root, unlock dozens of words" leverage effect
Extreme affix reuse: The top 10 prefixes and suffixes cover the vast majority of common derivations
Manageable depth: 86.4% of word families stay within three layers, suitable for instruction
Moderate size: 80.3% of word families contain 2–10 words, aligning well with the cognitive capacity of a single teaching unit

These findings provide a solid data foundation for structured vocabulary instruction. Our root teaching value scoring system and phased learning framework combine morphological theory with large-scale corpus analysis, offering actionable guidance for the design of Computer-Assisted Language Learning (CALL) systems.

At a time when vocabulary size constitutes the core bottleneck in language learning, structured morphological pedagogy is not an optional "nice-to-have" but a necessary strategy for improving vocabulary acquisition efficiency. Our data shows that the internal structure of English vocabulary is far more orderly than it appears on the surface — and this orderliness is precisely the foundation of efficient learning.

References

[1] Nation, I. S. P. (2001). Learning Vocabulary in Another Language. Cambridge University Press.
[2] Nation, I. S. P. (2006). "How Large a Vocabulary Is Needed For Reading and Listening?" Canadian Modern Language Review, 63(1), 59–82.
[3] Aronoff, M., & Fudeman, K. (2011). What is Morphology? (2nd ed.). Wiley-Blackwell.
[4] Corson, D. (1997). "The Learning and Use of Academic English Words." Language Learning, 47(4), 671–718.
[5] Bauer, L., & Nation, I. S. P. (1993). "Word Families." International Journal of Lexicography, 6(4), 253–279.
[6] Sweller, J. (1988). "Cognitive Load During Problem Solving: Effects on Learning." Cognitive Science, 12(2), 257–285.
[7] Wei, Z., & Nation, I. S. P. (2013). "The Word Part Technique: A Very Useful Vocabulary Teaching Technique." Modern English Teacher, 22(1), 12–16.
[8] Finkenstaedt, T., & Wolff, D. (1973). Ordered Profusion: Studies in Dictionaries and the English Lexicon. C. Winter.
[9] White, T. G., Sowell, J., & Yanagihara, A. (1989). "Teaching Elementary Students to Use Word-Part Clues." The Reading Teacher, 42(4), 302–308.
[10] Williams, J. M. (1975). Origins of the English Language: A Social and Linguistic History. Free Press.

This article was produced by the 极坐标单词（PolarWords） research team. The derivation tree corpus was constructed from Oxford English Dictionary vocabulary with AI-assisted annotation and multi-round quality verification. 极坐标单词（PolarWords） is developed by ByuTech.