RDM Weekly - Issue 035
A weekly roundup of Research Data Management resources.
Welcome to Issue 35 of the RDM Weekly Newsletter!
The content of this newsletter is divided into 4 categories:
✅ What’s New in RDM?
These are resources that have come out within the last year or so
✅ Oldies but Goodies
These are resources that came out over a year ago but continue to be excellent ones to refer to as needed
✅ Research Data Management Job Opportunities
Research data management related job opportunities that I have come across in the past week
✅ Just for Fun
A data management meme or other funny data management content
What’s New in RDM?
Resources from the past year
1. Building Realistic Fake Datasets with Pointblank
Every data practitioner eventually runs into the same problem: you need data, but you don’t have it. Maybe you’re writing tests, building a demo, or teaching a workshop and you need something that looks real but carries zero risk. Whatever the reason, the need for synthetic data is everywhere, and it comes up far more often than most of us would like to admit. Pointblank is a Python library for data validation, but over the last several releases Posit has been building out a complementary capability: data generation. The idea is simple. You define a schema (the columns, their types, and their constraints), and Pointblank produces n rows of data that conform to it. The result is a Polars or Pandas DataFrame, ready to use. This post walks through the generate_dataset() function in some depth, showing how to build realistic datasets for common scenarios (including a customer data example you might actually use), and highlighting the country-specific and coherence features that make the generated data feel surprisingly real.
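The entry above describes Pointblank's schema-first workflow: define columns and constraints, get back n conforming rows. Since `generate_dataset()`'s exact signature may vary by release, here is a minimal stdlib-only sketch of that schema-to-rows idea; the `make_rows` function and the example schema are illustrative, not Pointblank's actual API:

```python
import random

# Illustrative schema: column name -> callable that draws one value.
SCHEMA = {
    "customer_id": lambda rng: rng.randrange(10_000, 100_000),
    "country": lambda rng: rng.choice(["US", "CA", "DE", "JP"]),
    "signup_year": lambda rng: rng.randint(2018, 2025),
    "is_active": lambda rng: rng.random() < 0.8,
}

def make_rows(schema, n, seed=42):
    """Generate n rows that conform to the schema, as a list of dicts.

    A fixed seed makes the fake data reproducible, which matters for
    tests and teaching materials.
    """
    rng = random.Random(seed)
    return [{col: gen(rng) for col, gen in schema.items()} for _ in range(n)]

rows = make_rows(SCHEMA, n=5)
```

In Pointblank the result would be a Polars or Pandas DataFrame rather than a list of dicts, and the country-specific and coherence features go well beyond independent per-column draws, but the schema-first shape of the workflow is the same.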
2. The Economic Benefits of Open Science
This report is an output of PLOS’s Redefining Publishing program and it examines the economic implications of Open Science and what it would mean to move towards a research ecosystem in which research elements, including data, code, software, workflows, methods, and publications, are openly shared and valued. The evidence assessment and case studies show that the economic case for Open Science lies in its ability to enable reuse of research outputs at scale. When Open Science is implemented in ways that support reuse through appropriate infrastructure, incentives, and coordination, the sharing of data, code, software, workflows, methods, and publications can deliver measurable efficiency gains, support innovation, and strengthen long-term economic performance. The findings in this report provide a robust evidence base that supports the value of a transition toward Open Science and helps clarify where economic value is created and where barriers remain.
3. What Is Data Governance? 30 Questions and Answers
This document from The GovLab is designed as a practical reference that can be read end-to-end—as a structured primer on data governance—or used modularly to support workshops, institutional design, policy drafting, and capacity-building. The structure serves three functions. First, it establishes a shared vocabulary and clarifies distinctions between concepts that are frequently conflated, such as governance, strategy, ethics, privacy, and management. Second, it foregrounds the normative design choices that shape data governance outcomes in practice—particularly questions of purpose, principles, legitimacy, and stewardship. Third, it situates data governance within real institutional, technical, sectoral, and cross-border contexts, recognizing that what is feasible depends on capacity, architecture, and scale.
4. Evidence of Unreliable Data and Poor Data Provenance in Clinical Prediction Model Research and Clinical Practice
Clinical prediction models are often created using large routinely collected datasets. It is essential that prediction models are developed with appropriate data and methods and transparently reported to ensure that decisions are based on reliable predictions. Kaggle is a popular competition website where users learn and apply analysis skills on a range of datasets. We identified two large, publicly available Kaggle datasets, on stroke and diabetes, that lack clear data provenance but are widely used in clinical prediction models in peer-reviewed publications. The authenticity of both datasets could not be verified, and both show evidence of likely being simulated or fabricated. Data provenance assessment using nine TRIPOD+AI items revealed major deficiencies, with minimal details available for either dataset, including no information on when, where, why, or how the data were collected. From these two datasets, we found 124 clinical prediction model studies. Three prediction models had evidence of use in clinical practice, one model was cited in a medical device patent, and the models were cited in 86 review articles. The authors recommend that journals and data repositories mandate data provenance reporting to safeguard published research. Prediction models based solely on simulated or fabricated datasets should never be used to directly inform decisions on patient care.
5. Preparing National Research Ecosystems for AI: Strategies and Progress
This is the third edition of the ISC Centre for Science Futures' paper. The report offers a comprehensive analysis of the integration of artificial intelligence in science and research across various countries. It addresses both the advancements made and the challenges faced in this field, making it a valuable read for science leaders, policy-makers, AI professionals, and academics.
6. Data Management in a Community-Based Birth Cohort: What the SEMILLA Study Teaches Us
In cohort studies, systematic information management often receives limited attention in study protocols, resulting in delays, quality issues, and threats to data validity. This paper describes the data management process of a community-based cohort study, using the SEMILLA (Study of Environmental Exposure of Mothers and Infants Impacted by Large-Scale Agriculture) study conducted in Cayambe, Ecuador, as a case example, and highlights the challenges, adaptations, and lessons learned, with the aim of informing similar studies.
7. Rethinking TXT Files
In this blog post, Kristin Briney considers the accessibility of common file types. In particular, she considers how TXT is often the recommended file type for READMEs and whether other file types that offer more flexibility and accessibility should be considered instead.
Oldies but Goodies
Older resources that are still helpful
1. Grant Budgeting for Data Management and Sharing - Webinar
This 2024 webinar from the Federation of American Societies for Experimental Biology discusses budgeting for data management and sharing (DMS) in your grant applications. Hear from experts, including the Director of NIH Office of Policy for Extramural Research Administration (OPERA) and a representative from the NIH Office of Science Policy, as they discuss the NIH DMS policy and its impact on grant budgets. You'll also gain valuable insights from a data manager at Stanford Medicine on effectively planning your DMS budget from a researcher's perspective.
2. Data Management Planning Checklists
This book section contains links to a series of checklists to help guide teams through data management decisions that should be made throughout the phases of a research project. This includes decision making around documentation, data collection, data cleaning, data archiving and more.
3. Organizing your Evaluation Data: The Importance of Having a Comprehensive Data Codebook
In this blog post, Jennifer Morrow reviews what a codebook is, why codebooks are necessary, when to create them, what to include in them, and which tools you might use to create them. This post also links to several other helpful resources.
4. Code Review for Statisticians, Data Scientists & Modellers
Code review is an important part of a team's quality assurance process. This blog post provides insight into why we do code review, as well as some helpful tips for how reviewers should approach it.
Research Data Management Job Opportunities
These are data management job opportunities that I have seen posted in the last week. I have no affiliation with these organizations.
Just for Fun
Thank you for reading! If you enjoy this content, please like, comment, or share this post! You can also support this work through Buy Me A Coffee.