Data stewardship is vital to the science of today and tomorrow. Yet stewardship and its many roles are not consistently defined, conceptualized, or implemented, even within the same discipline or organization: there is a gap between theory and practice. This session begins to bridge that gap by examining roles, perspectives, and attributes of the overall stewardship enterprise, from proposal through preservation. We explore theoretical and practical approaches for understanding complex issues such as:
- tracking provenance, added value, and credit through the data lifecycle;
- defining elements of data quality;
- scaling of complex processes; and
- scientist perspectives and norms.
We seek to reexamine worldviews, explore alternatives, and evolve data stewardship.
Keywords: Data management, preservation, rescue; Data and information governance; Interoperability; Emerging informatics technologies; GIScience.
Main Sponsor: Earth and Space Science Informatics Focus Group
Related Sponsoring Sections: Atmospheric Sciences (A), Biogeosciences (B), Education (ED),
Global Environmental Change (GC), Hydrology (H), Natural Hazards (NH), Ocean Sciences (OS),
Paleoceanography and Paleoclimatology (PP), Seismology (S), Societal Impacts and Policy Sciences
(SI), Tectonophysics (T), Volcanology, Geochemistry, and Petrology (V)
PHOTOS 1 | PHOTOS 2 (courtesy of Sarah Ramdeen)
Presenter 1: Bryan Lawrence, NCAS and Meteorology, University of Reading and Centre for Environmental Data Archival, UK
Managing Data and Facilitating Science: A Spectrum of Activities in the Centre for Environmental Data Archival. (Invited)
[Presentation file] | [Additional Info.] | [AGU Link]
Abstract.
The UK Centre for Environmental Data Archival (CEDA) hosts a number of formal data centres, including the British Atmospheric Data Centre (BADC), and is a partner in a range of national and international data federations, including the InfraStructure for the European Network for Earth system Simulation, the Earth System Grid Federation, and the distributed IPCC Data Distribution Centres. The mission of CEDA is to formally curate data from, and facilitate the doing of, environmental science.
The twin aims are symbiotic: data curation helps facilitate science, and facilitating science helps with data curation. Here we cover how CEDA delivers this strategy through established internal processes supplemented by short-term projects, supported by staff in a range of roles. We show how CEDA adds value to data in the curated archive, how it supports science, and give examples of the aforementioned symbiosis.
We begin by discussing curation: CEDA has formal responsibility for curating the data products of atmospheric science and earth observation research funded by the UK Natural Environment Research Council (NERC). However, curation is not just about the provider community; the consumer communities matter too, and the consumers of these data cross the boundaries of science to include engineers and medics as well as the gamut of the environmental sciences. There is also a small but growing cohort of non-science users. For both producers and consumers of data, information about data is crucial, and a range of CEDA staff have long worked on tools and techniques for creating, managing, and delivering metadata (as well as data). CEDA "science support" staff work with scientists to help them prepare and document data for curation.
As one of a spectrum of activities, CEDA has worked on data Publication as a method of both adding value to some data and rewarding the effort put into the production of quality datasets. As such, we see this activity as both a curation and a facilitation activity.
A range of more focused facilitation activities are carried out, from providing a computing platform suitable for big-data analytics (the Joint Analysis System, JASMIN), to working on distributed data analysis (EXARCH), and the acquisition of third party data to support science and impact (e.g. in the context of the facility for Climate and Environmental Monitoring from Space, CEMS).
We conclude by confronting the view of Parsons and Fox (2013) that metaphors such as Data Publication, Big Iron, and Science Support are limiting, and suggest that the CEDA experience is that these sorts of activities can and do co-exist, much as they conclude they should. However, we also believe that within co-existing metaphors, production systems need to be limited in their scope, even if they are on a road to a more joined-up infrastructure. We shouldn't confuse what we can do now with what we might want to do in the future.
Presenter 2: Clinton Foster, Geoscience Australia
Data stewardship – a fundamental part of the scientific method (Invited)
[Presentation file] | [AGU Link]
Abstract.
This paper emphasises the importance of data stewardship as a fundamental part of the scientific method, and the need to effect cultural change to ensure engagement by earth scientists. It is differentiated from the science of data stewardship per se.
Earth System science generates vast quantities of data, and in the past, data analysis has been constrained by compute power, such that sub-sampling of data often provided the only way to reach an outcome. This is analogous to Kahneman’s System 1 heuristic, with its simplistic and often erroneous outcomes.
The development of HPC has liberated the earth sciences such that the complexity and heterogeneity of natural systems can be utilised in modelling at any scale (global, regional, or local); for example, the movement of crustal fluids. Paradoxically, now that compute power is available, it is the stewardship of the data that presents the main challenges. There is a wide spectrum of issues: from effectively handling and accessing acquired data volumes [e.g. satellite feeds per day/hour]; through agreed taxonomies to effect machine-to-machine analyses; to idiosyncratic approaches by individual scientists. The latter aside, most agree that data stewardship is essential; indeed, it is an essential part of the science workflow.
As science struggles to engage and inform on issues of community importance, such as shale gas and fraccing, all parties must have equal access to data used for decision making; without that, there will be no social licence to operate or indeed access to additional science funding (Heidorn, 2008).
The stewardship of scientific data is an essential part of the science process; but often it is regarded, wrongly, as entirely in the domain of data custodians or stewards.
Geoscience Australia has developed a set of six principles that apply to all science activities within the agency:
- Relevance to Government
- Collaborative science
- Quality science
- Transparent science
- Communicated science
- Sustained science capability
Every principle includes data stewardship: this is to effect cultural change at both collective and individual levels to ensure that our science outcomes and technical advice are effective for the Government and community.
Presenter 3: Mark Parsons, Rensselaer Polytechnic Institute
Curation Roles in Theory and Practice (Invited)
[Presentation file] | [Additional Info.] | [AGU Link]
Abstract.
To understand something very complex like the Earth system, it is helpful to have a model that sketches the major components of the system and their interactions. This helps us to understand the relative importance of those components and their behavior. The same is true when considering a complex social enterprise such as data stewardship. It is helpful to have a model or a metaphor that allows us to conceive of the entire enterprise, the key players, and their interactions.
In the spirit that “all models are wrong, but some are useful,” this presentation explores several models of data stewardship and how different conceptions of the enterprise define different roles for the participants—be they researchers or data practitioners. The goal is to illustrate how consideration of multiple theories or models of data stewardship can help us better define the actual practice necessary to ensure well-preserved and useful data.
Presenter 4: Karen Baker, Center for Informatics Research in Science and Scholarship, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Enabling Long-Term Earth Science Research: Changing Data Practices (Invited)
[Presentation file] | [AGU Link]
Abstract.
Data stewardship plans are shaped by our shared experiences. As a result, community engagement and collaborative activities are central to the stewardship of data. Since modes and mechanisms of engagement have changed, we benefit from asking anew: 'Who are the communities?' and 'What are the lessons learned?'. Data stewardship, with its long-term care perspective, is enriched by reflection on community experience. This presentation draws on data management issues and strategies originating from within long-term research communities as well as on recent studies informed by library and information science. Ethnographic case studies that capture project activities and histories are presented as resources for comparative analysis.
Agency requirements and funding opportunities are stimulating collaborative endeavors focused on data re-use and archiving. Research groups including earth scientists, information professionals, and data systems designers are recognizing the possibilities for new ways of thinking about data in the digital arena. Together, these groups are re-conceptualizing and reconfiguring for data management and data curation. A differentiation between managing data for local use and producing data for re-use in locations and fields remote from the data origin is just one example of the concepts emerging to facilitate the development of data management. While earth scientists, as data generators, have the responsibility to plan new workflows and documentation practices, data and information specialists have the responsibility to promote best practices as well as to facilitate the development of community resources such as controlled vocabularies and data dictionaries. With data-centric activities and changing data practices, the potential for creating dynamic community information environments in conjunction with the development of data facilities exists but remains elusive.
Presenter 5: Wade Sheldon, Marine Sciences, University of Georgia
Managing Data, Provenance and Chaos through Standardization and Automation at the Georgia Coastal Ecosystems LTER Site
[Presentation file] | [Additional Info.] | [AGU Link]
Abstract.
Managing data for a large, multidisciplinary research program such as a Long Term Ecological Research (LTER) site is a significant challenge, but it also presents unique opportunities for data stewardship. LTER research is conducted within multiple organizational frameworks (i.e. a specific LTER site as well as the broader LTER network), and addresses both specific goals defined in an NSF proposal and broader goals of the network; therefore, every LTER dataset can be linked to rich contextual information to guide interpretation and comparison. The challenge is how to link the data to this wealth of contextual metadata.
At the Georgia Coastal Ecosystems LTER we developed an integrated information management system (GCE-IMS) to manage, archive and distribute data, metadata and other research products as well as manage project logistics, administration and governance (figure 1). This system allows us to store all project information in one place, and provide dynamic links through web applications and services to ensure content is always up to date on the web as well as in data set metadata. The database model supports tracking changes over time in personnel roles, projects and governance decisions, allowing these databases to serve as canonical sources of project history. Storing project information in a central database has also allowed us to standardize both the formatting and content of critical project information, including personnel names, roles, keywords, place names, attribute names, units, and instrumentation, providing consistency and improving data and metadata comparability. Lookup services for these standard terms also simplify data entry in web and database interfaces.
We have also coupled the GCE-IMS to our MATLAB- and Python-based data processing tools (i.e. through database connections) to automate metadata generation and packaging of tabular and GIS data products for distribution. Data processing history is automatically tracked throughout the data lifecycle, from initial import through quality control, revision and integration by our data processing system (GCE Data Toolbox for MATLAB), and included in metadata for versioned data products. This high level of automation and system integration has proven very effective in managing the chaos and scalability of our information management program.
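Automated processing-history tracking of the kind described above follows a common pattern: each transformation appends a structured lineage record that travels with the dataset into its versioned metadata. The following is a minimal Python sketch of that general pattern only; all names are hypothetical, and the actual GCE Data Toolbox is a MATLAB package with its own API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    """A data table plus the lineage records accumulated over its lifecycle."""
    rows: list
    history: list = field(default_factory=list)

    def log(self, step: str, detail: str) -> None:
        # Each processing step appends a timestamped provenance record.
        self.history.append({
            "step": step,
            "detail": detail,
            "when": datetime.now(timezone.utc).isoformat(),
        })

def quality_control(ds: Dataset, limit: float) -> Dataset:
    """Flag out-of-range values instead of silently dropping them."""
    ds.rows = [(v, "Q" if abs(v) > limit else "") for v in ds.rows]
    ds.log("quality_control", f"range check, |value| <= {limit}")
    return ds

# Usage: by the time a product is versioned, ds.history holds the full
# import -> QC -> revision chain, ready to embed in its metadata record.
ds = Dataset(rows=[1.2, 99.0, -3.4])
ds.log("import", "loaded from raw sensor feed")
ds = quality_control(ds, limit=50.0)
```

Because every tool appends to the same history structure, the lineage is a by-product of processing rather than a separate documentation task, which is what makes this approach scale.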
Presenter 6: Lynn Yarmey and Sandy Starkweather, National Snow and Ice Data Center, University of Colorado Boulder
Metadata Standards in Theory and Practice: The Human in the Loop
[Presentation file] | [AGU Link]
Abstract.
Metadata standards are meant to enable interoperability through common, well-defined structures and are a foundation for broader cyberinfrastructure efforts. Standards are central to emerging technologies such as metadata brokering tools supporting distributed data search. However, metadata standards in practice are often poor indicators of standardized, readily interoperable metadata.
The International Arctic Systems for Observing the Atmosphere (IASOA) data portal provides discovery and access tools for aggregated datasets from ten long-term international Arctic atmospheric observing stations. The Advanced Cooperative Arctic Data and Information Service (ACADIS) Arctic Data Explorer brokers metadata to provide distributed data search across Arctic repositories. Both the IASOA data portal and the Arctic Data Explorer rely on metadata and metadata standards to support value-added services. Challenges have included: translating between different standards despite existing crosswalks, diverging implementation practices of the same standard across communities, changing metadata practices over time and the associated backwards compatibility, reconciling metadata created by data providers with standards, the lack of community-accepted definitions for key terms (e.g. 'project'), integrating controlled vocabularies, and others. Metadata record 'validity', or compliance with a standard, has been insufficient for interoperability. To overcome these challenges, both projects committed significant work to integrate and offer services over already 'standards compliant' metadata. Both efforts have shown that the 'human in the loop' is still required to fulfill the lofty theoretical promises of metadata standards.
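A crosswalk between two metadata standards is conceptually just a field mapping, but as the challenges above suggest, real records diverge from their nominal standard, so a broker needs somewhere to route fields it cannot place for human review. The following is a minimal illustrative sketch, not the actual ACADIS broker; the field names and mapping are invented for the example.

```python
# Hypothetical crosswalk: source keys (FGDC-like) -> target keys (ISO 19115-like).
CROSSWALK = {
    "title": "citation_title",
    "abstract": "identification_abstract",
    "begdate": "temporal_extent_begin",
}

def broker(record: dict) -> tuple[dict, dict]:
    """Translate a source record; collect unmappable fields for human review."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        target = CROSSWALK.get(key)
        if target is None:
            unmapped[key] = value  # the 'human in the loop' resolves these
        else:
            mapped[target] = value
    return mapped, unmapped

mapped, leftovers = broker({
    "title": "Barrow radiation flux, 2001-2010",
    "abstract": "Hourly broadband radiation measurements.",
    "project": "IASOA",  # no community-accepted definition -> left unmapped
})
```

Even this toy version shows why 'valid' records are not automatically interoperable: validity constrains the fields that are present, but the crosswalk still fails wherever two communities implemented the same standard differently or used terms the mapping does not define.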
In this talk, we 1) summarize the real-world experiences of two data discovery portals working with metadata in standard form, and 2) offer lessons learned for others who work with and rely on metadata and metadata standards.
Presenter 7: Vicki Lynn Ferrini, Lamont-Doherty Earth Observatory, Columbia University
[Presentation file] | [Additional Info.] | [AGU Link]
Abstract.
Stewardship of scientific data is fundamental to enabling new data-driven research, and it ensures the preservation, accessibility, and quality of the data. Yet researchers, especially in disciplines that typically generate and use small but complex, heterogeneous, and unstructured datasets, are challenged to fulfill the increasing demands of properly managing their data. The IEDA Data Facility (www.iedadata.org) provides tools and services that support data stewardship throughout the full life cycle of observational data in the solid earth sciences, with a focus on the data management needs of individual researchers. IEDA builds upon and brings together over a decade of development and experience in its component data systems, the Marine Geoscience Data System (MGDS, www.marine-geo.org) and EarthChem (www.earthchem.org). IEDA services include domain-focused data curation and synthesis; tools for data discovery, access, visualization, and analysis; and investigator support services that include tools for data contribution, data publication services, and data compliance support. IEDA data synthesis efforts (e.g. PetDB and the Global Multi-Resolution Topography (GMRT) Synthesis) focus on data integration and analysis while emphasizing provenance and attribution. IEDA's domain-focused data catalogs (e.g. MGDS and the EarthChem Library) provide access to metadata-rich long-tail data complemented by extensive metadata, including attribution information and links to related publications. IEDA's visualization and analysis tools (e.g. GeoMapApp) broaden access to earth science data for domain specialists and non-specialists alike, facilitating both interdisciplinary research and education and outreach efforts.
As a disciplinary data repository, a key role IEDA plays is to coordinate with its user community and to bridge the requirements and standards for data curation with both the evolving needs of its science community and emerging technologies. Development of IEDA tools and services is based first and foremost on the scientific needs of its user community. As data stewardship becomes a more integral component of the scientific workflow, IEDA investigator support services (e.g. Data Management Plan Tool and Data Compliance Reporting Tool) continue to evolve with the goal of lessening the 'burden' of data management for individual investigators by increasing awareness and facilitating the adoption of data management practices. We will highlight a variety of IEDA system components that support investigators throughout the data life cycle, and will discuss lessons learned and future directions.
Presenter 8: Anne Wilson, University of Colorado at Boulder and Foundation for Earth Science
Establishing Long Term Data Management Research Priorities via a Data Decadal Survey
[Presentation file] | [AGU Link]
Abstract.
We live in a time of unprecedented collection of and access to scientific data. Improvements in sensor technologies and modeling capabilities are constantly producing new data sources. Data sets are being used for unexpected purposes far from their point of origin, as research spans projects, discipline domains, and temporal and geographic boundaries. The nature of science is evolving, with more open science, open publications, and changes to the nature of peer review and data "publication". Data-intensive, or computational, science has been identified as a new research paradigm. There is recognition that the creation of a data set can be a contribution to science deserving of recognition comparable to other scientific publications. Federally funded projects are generally expected to make their data open and accessible to everyone.
In this dynamic environment, scientific progress is ever more dependent on good data management practices and policies. Yet current data management and stewardship practices are insufficient. Data sets created at great, and often public, expense are at risk of being lost for technological or organizational reasons. Insufficient documentation and understanding of data can mean that the data are used incorrectly or not at all. Scientific results are being scrutinized and questioned, and occasionally retracted due to problems in data management. The volume of data is greatly increasing while funding for data management is meager and generally must be found within existing budgets.
Many federal government agencies, including NASA, USGS, NOAA, and NSF, are already making efforts to address data management issues. Executive memos and directives give substantial impetus to those efforts, such as the May 9 Executive Order directing agencies to implement Open Data Policy requirements and regularly report their progress. However, these distributed efforts risk duplicating work, lack a unifying long-term strategic vision, and too often compete with other priorities of the research enterprise.
This presentation will introduce the Data Decadal Survey, an initial concept created in collaboration between the Federation of Earth Science Information Partners (ESIP) and the National Research Council's Board on Research Data and Information (BRDI). Consistent with Executive open data policies, the Survey will provide a coordinating platform to address overarching issues and identify research needs and funding priorities in scientific data management and stewardship for the long term. The Survey would address, at the broadest level, the gaps in data management knowledge and practices that hold back scientific progress, and recommend a strategy to address them. The goal is to provide a long-term strategic vision that will
- increase the meaningful availability of higher quality data,
- create or improve tools, processes, and practices that support accreditation, accountability, traceability, and reproducibility,
- redirect resources previously required by scientists for data discovery, acquisition, and reformatting to performing actual science,
thereby ultimately enhancing scientific knowledge.
Presenter of IN53C-1571: Irina Bastrakova, Geoscience Australia
Embedding Data Stewardship in Geoscience Australia
[Poster file] | [AGU Link]
Presenter of IN53C-1572: Irina Bastrakova, Geoscience Australia
Facilitating Stewardship of Scientific Data through Standards-based Workflows
[Poster file] | [AGU Link]
Presenter of IN53C-1573: Karen Baker, Center for Informatics Research in Science and Scholarship, University of Illinois at Urbana-Champaign
Outcomes of the “Data Curation for Geobiology at Yellowstone National Park” Workshop
[Poster file] | [AGU Link]
Presenter of IN53C-1574: Bob Arko, Lamont-Doherty Earth Observatory, Columbia University
Rolling Deck to Repository (R2R): Supporting Global Data Access Through the Ocean Data Interoperability Platform (ODIP)
[Poster file] | [Additional Info.] | [AGU Link]
Presenter of IN53C-1575: Patrick West, Tetherless World Constellation, Rensselaer Polytechnic Institute
Provenance Capture in Data Access and Data Manipulation Software
[Poster file (pptx)] | [Additional Info.] | [AGU Link]
Presenter of IN53C-1576: Margaret O'Brien, Santa Barbara Coastal LTER, University of California, Santa Barbara
Ensuring the Quality of Data Packages in the LTER Network Provenance Aware Synthesis Tracking Architecture Data Management System and Archive
[Poster file] | [AGU Link]
Presenter of IN53C-1577: Rebekah Cummings, University of California Los Angeles
Between land and sea: divergent data stewardship practices in deep-sea biosphere research
[Poster file] | [AGU Link]
Presenter of IN53C-1578: Sarah Ramdeen, University of North Carolina at Chapel Hill
ESIP’s Emerging Provenance and Context Content Standard Use Cases: Developing Examples and Models for Data Stewardship
[Poster file] | [Additional Info.] | [AGU Link]
Presenter of IN53C-1579: David Moroni, Physical Oceanography Distributed Active Archive Center, NASA/JPL
Dataset Lifecycle Policy Development & Implementation at the PO.DAAC
[Poster file] | [Additional Info.] | [AGU Link]
Presenter of IN53C-1580: Walter Baskin, NASA Atmospheric Sciences Data Center (ASDC)
"Best" Practices for Aggregating Subset Results from Archived Datasets
[Poster file] | [Additional Info.] | [AGU Link]
Presenter of IN53C-1581: Dirk Fleischer, GEOMAR Helmholtz Centre for Ocean Research, Kiel, Germany
Don’t leave data unattended at any time!
[Poster file] | [Additional Info.] | [AGU Link]
Contacts
Session Organizers:
- Cyndy Chandler, WHOI, cchandler-at-whoi.edu
- Lesley A. Wyborn, Geoscience Australia, now at Australian National University, Lesley.Wyborn-at-anu.edu.au
- Dawn J. Wright, Environmental Systems Research Institute and College of Earth, Ocean, and Atmospheric Sciences, Oregon State University, dwright-at-esri.com
- Deborah L. McGuinness, Rensselaer Polytechnic Institute (RPI) Tetherless World Constellation, dlm-at-cs.rpi.edu