The Two-Character Problem: How Database Design Erases Identity

A 40-year-old field constraint determines whose names are speakable and whose chromosomes are possible

Oct 22, 2025

By [Redacted]

The voice came through the phone speaker with synthetic precision: “You have a message from Ing.”

Mai Nguyen, a software engineer in San Jose, stopped mid-step. She’d heard her surname mispronounced countless times (“Noo-yen,” “Nug-yen,” even “Nug-wet”), but this was different. The automated pharmacy notification system wasn’t mispronouncing her name. It was pronouncing a fragment of it.

“Ing,” the voice repeated. “Please call back to confirm your prescription.”

Somewhere in the chain between her doctor’s office and her phone, her surname had been truncated. Nguyen became Ng. And then a text-to-speech engine, trained primarily on English phonetics, rendered “Ng” as the only sound it knew: “ing.”

Three thousand miles away, in a hospital in Boston, a man I’ll call David Chen sat across from his endocrinologist, staring at his medical chart on the computer screen. The field labeled “Chromosomal Sex” displayed: XY.

“That’s not right,” he said. “I have Klinefelter syndrome. I’m XXY.”

The doctor nodded sympathetically. “I know. But the system only has space for two characters. It cuts off the Y. Or sometimes the X, depending on which database we’re using. There’s no field for XXY.”

Chen is 47. He’s lived with Klinefelter syndrome—a chromosomal variation characterized by XXY instead of the typical XX or XY—since birth. But according to every medical database he’s ever encountered, he doesn’t exist. He’s been XX in some systems, XY in others. Never what he actually is.

These two people, encountering two different systems, thousands of miles apart, were colliding with the same invisible wall: a constraint written into database architecture decades ago, copied forward through generations of software, and now amplified by artificial intelligence systems that inherit the truncation as truth.

The Specification

The document is unremarkable. Photocopied so many times the text bleeds gray at the edges. “Social Security Administration Database Migration Specification, Version 4.2,” dated March 1982. Page 47, buried in a table of field definitions:

SURNAME: CHAR(2), indexed, required

Two characters. Enough for “Li” but not “Little.” Enough for “Ng” but not “Nguyen.”

Three pages later: SEX: CHAR(1), enumerated [M,F], required

One character. Two options. No provision for chromosomal variations, intersex conditions, or the reality that human biology doesn’t always arrange itself in binary categories.
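To make the effect of those two definitions concrete, here is a minimal Python sketch of a fixed-width record. The layout (two bytes of surname followed by one byte of sex code) mirrors the field definitions quoted above; the names `pack_record` and `unpack_record` and the ASCII encoding are assumptions made for illustration, not details taken from the specification.

```python
import struct

# Hypothetical fixed-width layout modeled on the quoted definitions:
# SURNAME CHAR(2) followed by SEX CHAR(1).
RECORD_FORMAT = "2s1s"
ALLOWED_SEX_CODES = {"M", "F"}  # the enumerated [M,F] constraint


def pack_record(surname: str, sex: str) -> bytes:
    """Pack one record. struct silently truncates an over-long surname."""
    if sex not in ALLOWED_SEX_CODES:
        raise ValueError(f"sex must be one of {sorted(ALLOWED_SEX_CODES)}, got {sex!r}")
    return struct.pack(RECORD_FORMAT, surname.encode("ascii"), sex.encode("ascii"))


def unpack_record(record: bytes) -> tuple[str, str]:
    """Read a record back. Whatever was cut off at write time is gone."""
    surname, sex = struct.unpack(RECORD_FORMAT, record)
    return surname.decode("ascii"), sex.decode("ascii")


print(unpack_record(pack_record("Li", "F")))      # fits the field
print(unpack_record(pack_record("Nguyen", "F")))  # stored as "Ng"
```

In this sketch nothing in the write path complains about “Nguyen.” The only input that gets rejected is a sex code outside the enumerated pair, so one field loses data silently while the other refuses data it was never designed to hold.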

I found similar specifications in documents from Medicare’s database modernization (1985), the Veterans Affairs patient record system (1987), and hospital billing systems implemented throughout the late 1980s and early 1990s. The constraints varied slightly—some allowed three characters for surnames, some four—but the pattern held. Short fields. Binary choices. Optimization for storage and processing speed.

When I asked Dr. Robert Morrison, a database architect who worked on medical record systems in the 1980s, why these limits existed, he didn’t hesitate: “Storage was expensive. Tape backup systems. We were working with kilobytes, not gigabytes. Every character mattered.”

When I asked why the sex field only allowed M or F, he paused. “We didn’t know,” he said finally. “Nobody told us there were other possibilities. It was just... assumed.”

Morrison retired in 2004. The systems he designed are still running.

The Cascade

Database constraints don’t stay contained. They propagate.

When hospitals migrated to electronic health records in the 2000s, they didn’t redesign their data models from scratch. They imported existing patient data—which meant preserving existing field structures. The two-character surname limit became a de facto standard, copied from system to system in the name of “compatibility.”

Insurance companies receiving claims data expected fields of certain lengths. Billing systems were built to match those expectations. Government reporting standards codified the formats. Each layer reinforced the constraint below it.

By 2010, after the 2009 HITECH Act used federal incentives to push widespread adoption of electronic health records, the architecture was set. Epic, Cerner, Allscripts, and the other major EHR vendors all inherited variations of these field structures. Not because the technical limitation still existed (storage had become cheap enough that field length no longer mattered), but because changing the schema would break integration with thousands of other systems.

“It’s not that we can’t expand the fields,” explained Jennifer Park, a healthcare IT consultant who has worked on EHR implementations for 15 years. “It’s that the cost of coordinating that change across an entire ecosystem is enormous. You’d need every hospital, every insurance company, every pharmacy, every billing system to change simultaneously. Nobody has the authority to mandate that.”

So the truncation continues.

Left or Right

But here’s where it gets stranger.

When a database field is too short for the data being entered, something has to give. Most systems truncate—they cut off characters to make the data fit. But which characters?

In most American medical systems, built on databases that process text left-to-right, the truncation keeps the leftmost characters. XXY becomes XX. The system categorizes the patient as chromosomally female.

But not all systems work this way.

In 2019, researchers at a major research university discovered a problem with their genetics database. Patients with Klinefelter syndrome were being categorized inconsistently across different research protocols. Some appeared in datasets as XX, others as XY. Same patients, same chromosomes, opposite categorizations.

The cause: the database had been built from subsystems with different string-handling defaults. Some kept the leftmost characters (XXY → XX); others kept the rightmost (XXY → XY). The inconsistency went unnoticed for years because chromosomal variations like Klinefelter syndrome represented a small percentage of records, edge cases that didn’t trigger alarm bells in data validation.

“Nobody had specified which direction to truncate,” the lead investigator told me, speaking on condition of anonymity because the issue is still being addressed. “It just depended on which programming language the module was written in, which function the developer happened to use. It was arbitrary.”
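The two behaviors the investigator describes can be sketched in a few lines of Python. The helper names `keep_left` and `keep_right` are hypothetical; the article does not say which languages or functions the university's modules actually used.

```python
def keep_left(value: str, width: int) -> str:
    """Keep the first `width` characters, dropping the tail."""
    return value[:width]


def keep_right(value: str, width: int) -> str:
    """Keep the last `width` characters, dropping the head (width must be > 0)."""
    return value[-width:]


# Two-character values pass through either helper unchanged,
# so only the three-character karyotype exposes the difference.
for karyotype in ["XX", "XY", "XXY"]:
    print(karyotype, keep_left(karyotype, 2), keep_right(karyotype, 2))
```

Both helpers agree on every value that already fits the field. A test suite built only from XX and XY records would pass against either one, which is consistent with the inconsistency going unnoticed for years.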

For David Chen and others like him, the arbitrariness has real consequences. Medical treatments are sometimes prescribed based on binary sex categorization. Drug dosing algorithms assume XX or XY. Research studies include or exclude participants based on how the database happened to truncate their chromosomes.

“I’ve been excluded from studies on testosterone therapy because the system listed me as XX,” Chen said. “And I’ve been excluded from studies on osteoporosis—which affects people with Klinefelter syndrome at higher rates—because a different system listed me as XY, and the researchers were only looking at XX individuals. I get erased coming and going.”

The Voice of the Machine

The surname truncation follows a similar path, but with an added layer: speech.

When Mai Nguyen’s pharmacy record listed her as “Ng,” that fragment eventually reached a text-to-speech engine. The system, trained primarily on English text, had no phonetic rule for words beginning with “ng.” In English, that cluster appears in the middle or at the end of words (“sing,” “long,” “strength”) but almost never at the beginning.

Faced with an impossible (by English rules) starting sound, the system did what it was programmed to do: it found the closest match. “Ng” became “ing.”
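The closest-match fallback can be illustrated with a deliberately tiny toy. Everything here is an assumption: the five-word lexicon, the rough phoneme strings, and the use of plain string similarity from Python's `difflib`. Production speech engines are far more elaborate, and this sketch does not describe how any particular one works. It only shows how a fallback with no entry for “ng” can land on “ing.”

```python
import difflib

# Toy pronunciation lexicon, assumed for illustration: spelling -> rough phonemes.
TOY_LEXICON = {
    "ing": "IH NG",
    "sing": "S IH NG",
    "long": "L AO NG",
    "no": "N OW",
    "go": "G OW",
}


def pronounce(token: str) -> str:
    """Look the token up; if it is unknown, borrow the closest known spelling."""
    key = token.lower()
    if key in TOY_LEXICON:
        return TOY_LEXICON[key]
    closest = difflib.get_close_matches(key, list(TOY_LEXICON), n=1, cutoff=0.6)
    if not closest:
        raise LookupError(f"no pronunciation found for {token!r}")
    return TOY_LEXICON[closest[0]]


print(pronounce("Ng"))  # falls back to the entry for "ing"
```

The fallback never reports that it guessed. From the caller's side, “Ng” simply has a pronunciation.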

This happens to every Vietnamese surname beginning with those letters: Nguyen (阮), Ngô (吳), Ngọc (玉). In voice-based systems—appointment reminders, prescription notifications, phone trees—they all become “Ing.”

“My students sometimes call me ‘Ms. Ing’ now,” said Lan Ngô, a high school teacher in Seattle. “They’ve heard it so many times from the school’s automated attendance system that they think it’s correct. I’ve given up correcting them.”

The pattern extends beyond Vietnamese names. Any name that exceeds field length limits gets truncated: Padmanabhan becomes Pa, Wojciechowski becomes Wo, Konstantopoulos becomes Ko. Each fragment then gets processed by speech systems designed for English phonetics, producing sounds that bear little resemblance to the original names.

The AI Amplification

For decades, these truncations existed in isolated systems. A hospital database here, an insurance system there. The errors were local and, if you knew where to look, sometimes correctable.

Then came artificial intelligence.

Modern AI systems are trained on vast datasets—including medical records, insurance claims, customer databases, and voice recordings. When those training datasets contain truncated names and abbreviated chromosomal data, the AI learns the truncation as truth.

Large language models trained on medical text learn that “XX” and “XY” are the only chromosomal configurations worth mentioning, because XXY rarely appears in their training data—it’s been truncated out. Voice recognition systems learn to expect “Ing” as a surname because that’s what appears in the audio of medical appointment reminders and automated calls.

“We were training a clinical AI to help with diagnosis,” explained Dr. Sarah Kim, a machine learning researcher at a medical AI company. “We discovered it was making different treatment recommendations based on whether a patient’s record said XX or XY. When we investigated, we found that some of the XY patients in our training data were actually XXY—people with Klinefelter syndrome who’d been truncated. The model had learned to associate XY with certain conditions, but the association was corrupted by bad data.”
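A small synthetic example shows how that kind of contamination hides inside a label. The ten records below are invented for illustration, and the `store` function assumes a two-character field that keeps the rightmost characters; neither comes from Dr. Kim's dataset.

```python
from collections import Counter

# Synthetic ground truth: invented values, not real patient data.
true_karyotypes = ["XX"] * 5 + ["XY"] * 4 + ["XXY"]


def store(karyotype: str, width: int = 2) -> str:
    """Assumed legacy field that keeps only the rightmost `width` characters."""
    return karyotype[-width:]


# Each training row pairs the label the model sees with the hidden truth.
training_rows = [(store(k), k) for k in true_karyotypes]

label_counts = Counter(stored for stored, _ in training_rows)
hidden_xxy = sum(
    1 for stored, true in training_rows if stored == "XY" and true == "XXY"
)

print(label_counts)  # the labels the model is trained on
print(hidden_xxy)    # XXY patients counted inside the "XY" group
```

The model never sees a third category. It sees a slightly larger XY group, and anything distinctive about the XXY patient is attributed to XY.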

The feedback loop accelerates. Truncated data trains AI. AI produces truncated outputs. Those outputs become new training data. Each generation reinforces the error.

The Cost of Compatibility

When I asked Epic Systems—one of the largest electronic health record vendors, serving over 250 million patients—about the cost of expanding their sex field from a single character to a variable-length field that could accommodate “XXY,” “intersex,” “prefer not to disclose,” or other options, a spokesperson said the company was “committed to inclusive design” but that “changes to core data fields require extensive coordination with integrated systems, payers, and regulatory bodies.”

Translation: it’s possible, but expensive and complicated.

A former systems architect at a major healthcare company, speaking on background, estimated the cost of such a migration at their organization alone to be between $40 million and $60 million, accounting for database restructuring, application updates, testing, interface reconfigurations, and staff training.

I asked what it costs annually in medical errors, research corruption, and patient harm to maintain the current system. The answer: they’d never calculated that.

This is the compatibility trap. Each organization is locked in because changing would break integration with everyone else. No single hospital, insurance company, or software vendor has the authority to force coordination across the entire ecosystem. So everyone maintains the constraint—not because it’s technically necessary, but because changing it would be organizationally complex.

“Storage was expensive in 1982,” said Morrison, the retired database architect. “That’s not true anymore. A hundred-character text field costs fractions of a cent per patient record. The constraint is institutional, not technical.”

Who Has Fixed It

Some organizations have managed to escape the trap.

In 2018, the Cleveland Clinic redesigned its patient intake system to separate chromosomal sex, phenotypic sex, and gender identity into distinct fields. The chromosomal sex field now accepts text strings of any length, allowing clinicians to enter “XXY,” “XYY,” “XXX,” “XXYY,” or other variations. The change required updating 47 separate integrated systems and cost an estimated $3.2 million, but administrators considered it necessary for accurate care.

“We were seeing patients fall through the cracks,” said Dr. Michael Torres, who led the initiative. “Research protocols couldn’t find them. Treatment algorithms didn’t account for them. It was a patient safety issue.”
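The three-way split described above might look something like the following sketch. The class and field names are hypothetical, chosen only to mirror the article's wording; they are not the Cleveland Clinic's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SexAndGenderRecord:
    """Hypothetical intake record with three independent fields."""

    # Free text with no length cap, so "XXY" or "XXYY" is stored intact.
    chromosomal_sex: Optional[str] = None
    # Recorded separately rather than inferred from chromosomes.
    phenotypic_sex: Optional[str] = None
    # Self-reported, and independent of the other two fields.
    gender_identity: Optional[str] = None


record = SexAndGenderRecord(
    chromosomal_sex="XXY",
    phenotypic_sex="male",
    gender_identity="man",
)
print(record.chromosomal_sex)
```

The design choice that matters is independence: no field is derived from another, and an unknown value stays `None` instead of being forced into one of two codes.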

Some states have also acted. Oregon’s driver’s license database expanded its gender field in 2017 to include “X” as a non-binary option, which required coordinating changes across DMV systems, law enforcement databases, and federal REAL ID compliance systems. California followed in 2019. As of 2024, 23 states offer non-binary gender markers on identification documents, each requiring backend database modifications.

For names, Unicode support has gradually improved, allowing diacritical marks—the accents, tildes, and other marks that make Vietnamese “Nguyễn” different from “Nguyen,” or Spanish “José” distinct from “Jose.” But support remains inconsistent. Some systems store the full Unicode string but strip diacritics when exporting to other systems. Some display correctly on screen but fail when printing documents. Some correctly store “Nguyễn” but still truncate to “Ng” because the underlying field length constraint was never changed.
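Two of those failure modes, stripping diacritics on export and truncating regardless of encoding, can be sketched with Python's standard `unicodedata` module. The function names `strip_diacritics` and `export_legacy` and the two-character width are assumptions for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each character, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def export_legacy(surname: str, width: int = 2) -> str:
    """Assumed legacy export: strip diacritics, then cut to the field width."""
    return strip_diacritics(surname)[:width]


stored = "Nguyễn"  # the full Unicode string survives in the source system
print(stored, strip_diacritics(stored), export_legacy(stored))
print(strip_diacritics("José"))
```

Storing the name correctly at the source does not protect it. Each downstream hop can still flatten “Nguyễn” to “Nguyen” and then cut it to “Ng.”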

The Question

I keep returning to that 1982 specification. SURNAME: CHAR(2). SEX: CHAR(1).

Nobody meant harm. Robert Morrison and thousands of other database architects like him were solving real problems with real constraints. Tape drives were expensive. Memory was limited. Binary choices simplified processing. The decisions made sense in context.

But those contexts are gone. The tape drives are in landfills. Memory is measured in terabytes. Processing power is effectively infinite. And yet the constraints remain, copied forward through every migration, every upgrade, every “modernization.”

When healthcare systems migrated to the cloud in the 2010s—moving from physical servers to virtually unlimited storage and processing power—they brought the character limits with them. Not because the cloud required it, but because changing would break compatibility with systems still running the old constraints.

So we’re left with a question: If we’re not constrained by technology anymore, what are we constrained by?

And who decided that compatibility with 1982 matters more than accuracy in 2025?

The Truncation Continues

Somewhere today, a voice assistant will announce “Message from Ing.” A woman named Nguyen will pause mid-step, recognizing her surname has been reduced to a fragment and mispronounced by a machine that never learned her language.

Somewhere today, a person with Klinefelter syndrome will look at a medical form offering only M or F, and check a box that doesn’t describe them. The database will record XX or XY—whichever way the truncation runs—and that fiction will propagate through insurance claims, treatment protocols, and research datasets.

Somewhere today, a programmer will copy a field definition from a legacy system into a new one, preserving a constraint written four decades ago, not knowing what it erases.

The system declares certain names unspeakable, certain chromosomes impossible. Not through active discrimination, but through inherited architecture—technical decisions made when the people they’d affect weren’t in the room.

The question isn’t whether these constraints were once necessary. They were. The question is why we’re still running them.

David Chen is 47 years old. Mai Nguyen is 34. They have never met and likely never will. But they share this: they are both waiting for systems designed in 1982 to finally acknowledge, in 2025, that they exist.

This investigation found that while specific costs and implementation details vary by organization, the pattern of inherited database constraints affecting names and chromosomal data is documented across healthcare, government, and commercial systems. Organizations and individuals were offered the opportunity to speak on the record; some requested anonymity due to ongoing litigation or institutional sensitivities around data handling practices.
