by Beth Jarosz
Late last year I, along with a group of talented colleagues, was helping to organize a National Academies workshop on communicating the quality of federal data. The issue, at first glance, seems like it should be simple: If it’s a survey, present margins of error on the estimates. If it’s administrative data, provide metadata. Then call it a day.
However, as many readers will understand, data quality is much more complicated and nuanced than that. Quality is a huge umbrella under which sample size, universe, coverage, timeliness, representativeness, and many other dimensions fall. In short, it’s the series of questions any good data scientist will ask themselves when trying to pick the right dataset to answer a question:
Does this data answer the question I have?
Is it current?
Does it have sufficient geographic detail?
What’s the margin of error (for survey data or modeled estimates)?
Who (or what) might be missing?
Is there another, better dataset I should be using instead?
If this is the best I can find, what caveats do I need to consider in my analysis?
What privacy protections were applied, and how might those have changed the data?
In late 2024, the committee had planned an excellent series of workshops and was excited to move ahead toward a framework for documenting quality. Unfortunately, time was not in our favor. The first event in the series was scheduled for late January. And then… the administration began scrubbing data from public websites, many federal staff were told to stop attending public events, and federal staff and contractors were shown to the exits. The workshop was “postponed” and then canceled.
Data quality was out. Erasure had entered the chat.
It now seems somewhat quaint that we wanted to geek out about ways to communicate timeliness and representativeness. About how to communicate the difference between employment estimates from BLS and those from BEA (both different, both important). About how to pick the right dataset to answer a policy question. Suddenly we had to wrestle with whether data would exist at all.
Behind the scenes, and in a parallel effort, teams began rethinking what we need to measure when we measure data quality. Key measures remain important, but now we also need to add uncomfortable new questions:
Have the data been altered or manipulated?
Have staffing cuts led to operational challenges that may affect data quality?
Have safety net program changes affected who’s in the administrative records?
Will more people refuse to share their data, for fear of it being weaponized against them?
Have contract cuts resulted in data that were collected but will never be published?
Is there a list of which surveys have been axed and which survived?
To be clear, in 2024 none of these issues were on our collective radar, because federal civil servants had robust processes in place to ensure the integrity of the data system. Now, with staffing cuts, contract cuts, and the threat of political interference, all of those questions are suddenly, painfully relevant.
Unfortunately, all of this means we need a completely new paradigm for monitoring. And that is what we’re starting to do with the Data Index.
To begin, we’re tracking whether data access has been affected:
Has a data website gone down or been altered?
Has a publication deadline been missed?
Has a notice of potential changes been posted in the Federal Register?
These checks are crude, but they are critical signals that something may be amiss.
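To make that first layer concrete, here is a minimal sketch of what an automated access check could look like: fetch each watched page, flag anything that no longer responds, and flag anything whose content hash has changed since the last run. The URLs and snapshot file below are hypothetical placeholders, not the Data Index’s actual configuration.

```python
"""
Minimal sketch of an access check: confirm a data page still responds and
flag when its content hash changes between runs. URLs and the snapshot
file are hypothetical placeholders, not real monitoring configuration.
"""
import hashlib
import json
from pathlib import Path

import requests

# Hypothetical pages to watch; a real monitor would load these from config.
WATCHED_URLS = [
    "https://www.example.gov/data/children-health",
    "https://www.example.gov/data/pregnancy-risk",
]
SNAPSHOT_FILE = Path("page_snapshots.json")


def check_pages(urls, snapshot_file):
    """Return (url, status) pairs; status is 'down', 'changed', or 'ok'."""
    previous = json.loads(snapshot_file.read_text()) if snapshot_file.exists() else {}
    current, flags = {}, []

    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            flags.append((url, "down"))        # request failed entirely
            continue
        if resp.status_code != 200:
            flags.append((url, "down"))        # page gone or blocked
            continue
        digest = hashlib.sha256(resp.content).hexdigest()
        current[url] = digest
        if url in previous and previous[url] != digest:
            flags.append((url, "changed"))     # content differs from last snapshot
        else:
            flags.append((url, "ok"))

    snapshot_file.write_text(json.dumps(current, indent=2))
    return flags


if __name__ == "__main__":
    for url, status in check_pages(WATCHED_URLS, SNAPSHOT_FILE):
        print(f"{status:>7}  {url}")
```

A real monitor would need refinements this sketch skips, such as normalizing dynamic page elements before hashing so routine template updates are not flagged as alterations, and routing every flag to human review rather than treating each hash change as erasure.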
But that is just a starting point.
Next we’ll layer on change detection to see whether published data structures differ from what has been published in the past, such as when a variable has been dropped.
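As one illustration, a structural comparison can be as simple as diffing the variable list in a new release against the prior release. This is a minimal sketch that assumes both releases are CSV files with a header row; the file names are hypothetical, and real federal releases often need format-specific readers (SAS, fixed-width, API extracts).

```python
"""
Minimal sketch of structural change detection: compare the variable list of
a newly released CSV against the prior release and report anything dropped
or added. File names are hypothetical placeholders.
"""
import csv


def read_variables(path):
    """Return the set of column names from the header row of a CSV release."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return set(header)


def compare_releases(old_path, new_path):
    """Report variables that disappeared from, or were added to, the new release."""
    old_vars = read_variables(old_path)
    new_vars = read_variables(new_path)
    return {
        "dropped": sorted(old_vars - new_vars),   # present before, missing now
        "added": sorted(new_vars - old_vars),     # new in the latest release
    }


if __name__ == "__main__":
    # Hypothetical file names for two releases of the same dataset.
    diff = compare_releases("release_2023.csv", "release_2024.csv")
    if diff["dropped"]:
        print("Variables missing from the new release:", diff["dropped"])
    else:
        print("No variables dropped.")
    if diff["added"]:
        print("New variables:", diff["added"])
```

A dropped variable is not proof of manipulation on its own, but it is exactly the kind of signal that tells subject matter experts where to look.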
In the past, data collection changes would go through a public comment process. That process provided transparency and gave the public an opportunity to weigh in before a change was made. (An excellent example of that process at work was when disability rights advocates successfully pushed back on Census Bureau plans to change disability questions in the American Community Survey.)
Unfortunately, many seemingly substantive changes are now being labeled “minor” and made without public input. One example is a change to publicly available data from the National Survey of Children’s Health (NSCH): the administration erased data on whether or not children have been bullied because of their sexual orientation or gender identity. Another example is the Pregnancy Risk Assessment Monitoring System (PRAMS), the nation’s key system for protecting the health of pregnant people and infants. PRAMS data collection was suspended for months (a fact we know only because of whistleblowers, not because of transparency), and when the data collection instrument was restored, questions about gender appear to have been removed.
Right now, to track these kinds of data quality problems, we’re relying on subject matter experts to review the data with a critical eye. For NSCH, we knew the website went down and knew who to ask for a damage assessment. For PRAMS, most of the data are confidential, so website change detection would have been insufficient. In both cases, we needed subject expertise to understand the magnitude of the loss.
We hope to automate more of these review processes, but that will take time and resources to develop. In the meantime, we are thankful to the subject experts who are lending their time and knowledge to help keep the public informed. We shouldn’t need to crowdsource transparency; the public deserves it. But we appreciate the experts who are working to fill those gaps.