Data sources and management



Requires local knowledge and technical expertise

Data management covers the obtaining, cleaning, storing, monitoring, reviewing, and reporting of your registry data. It determines the quality, utility, cost and burden of the data – and therefore, the chances that your registry will achieve its goals.

As part of your assessment of the data elements you require, you will have been prompted in the previous section to consider whether these data need to be collected from a primary source (i.e., primary data) or whether they have already been collected and you can re-use existing data (i.e., secondary data).

Many techniques and technologies exist for entering or moving data into a registry database, including paper case report forms (CRFs), direct data entry, facsimile or scanning systems, interactive voice response systems, and electronic CRFs. You and your advisory committee must balance flexibility (the number of options available) with data availability (when the central repository is populated), data validity (whether all methods are equally able to produce clean data), and cost.

Obtaining data

Primary data are collected specifically to populate your registry database. Primary data sources are typically used when the data of interest are unavailable elsewhere or, if available, are unlikely to be of good enough quality for use in your registry. Primary data include clinician- and patient-reported data. Such data may range from basic demographic information to validated scales of patient-reported outcomes.

Case report forms (CRFs) are questionnaires used to collect individual patient data. They can be in a number of formats:

  • A paper form – this allows users to enter clinical data onto paper, either at the time of the clinical encounter or after the clinical encounter. These paper forms can then be entered into the database locally or centrally by manual data entry or scanning technology.
  • An electronic document or spreadsheet – this can be filled in locally and then transferred or uploaded to the registry database at intervals. Validation checks can be built into such a system at the point of data entry.
  • A webpage that interfaces directly with your registry database – this enables data to be entered directly into the registry database and can have validation checks built in to identify and prompt the correction of errors at the time of entry. These can often be accessed on mobile devices.

Although it involves more upfront investment, the last of these is generally preferred because it enables real-time data validation at source and reduces the number of opportunities for data entry errors. Paper and spreadsheet options still have a role, however, for example when you are piloting your data elements or when data are being collected in remote locations with limited or unreliable internet connectivity.

Extraction
The other way to collect primary source data is to extract it directly from an external source. This requires a lot of initial and ongoing work to map your data extraction tools to the correct fields in the local source. It also limits your data elements (and their quality) to what is available in the local source. On a positive note, however, it reduces the data collection burden on clinical staff. Local sources might include:

  • Electronic health records (EHRs) – these are used to document and manage patient care, potentially capturing many different types of data – e.g., vital signs, patient history, diagnoses/conditions, treatments and therapies, laboratory results, surveys, questionnaires, etc.
  • Ancillary clinical information systems – these are used alongside EHRs by specialized departments such as radiology or laboratory sciences. These systems may have an interface with the EHR, but they typically only transmit a small fraction of the information they collect to the EHR (e.g., the interpretation of an echocardiogram vs. all the data generated during the procedure).

Before the full launch of your registry, it is essential to conduct pilot testing. This can range from testing a subset of the procedures, CRFs, or data capture systems to a full launch of the registry at a small number of sites with a limited number of patients. Through pilot testing, you can assess comprehension, acceptance, feasibility, and other factors that will influence how readily your registry will fit into the habits of those entering the data.

Documenting procedures and training
Data collection procedures must be written down in protocols, policies, and procedures. All personnel involved in data collection should be identified in these documents, and their roles should be specified. Any training or qualifications required should also be specified.

Secondary data are already being collected for purposes other than populating your registry database. Protocols for data collection, if they exist, are therefore not focused on optimizing data quality to address your registry’s objectives.

There are many possible sources of secondary data:

  • Clinical data warehouses and integrated data repositories – used by institutions or health systems to pull together data from the EHR and other systems into a common, standardized data model.
  • Administrative databases – used by medical insurance organizations (including government or private health insurance programs) to track healthcare use, evaluate coverage, and manage billing and payment. Information in these databases includes patient characteristics, e.g., insurance coverage and co-payments, and healthcare provider characteristics, e.g., specialty and location.
  • National birth and death records – used to track population death and birth data, e.g., date of birth, date of death, and recognized causes of death.
  • Aggregate databases – containing area- or population-level statistics, details about providers or medical facilities, or de-identified encounter details.
  • Distributed research networks – bring together data at a local level (behind a firewall) and provide aggregate counts or summary statistics. These can support a range of research activities, including pharmacovigilance studies, pragmatic clinical trials, and studies of treatment effectiveness.
  • Existing registries – collect data relevant to your registry. For example, if you are setting up a hemodialysis registry and there is already a peritoneal dialysis registry or a transplant registry in your country, you might want to link to these to be able to report the complete patient journey.

It will be rare that the data elements in your registry exactly match those in a secondary data source so that all you need to do is directly import the data. Most commonly, some form of transformation is required to translate the data into a consistent format for integration and analysis. You may also want to consider computational derivation, e.g., using diagnosis or procedure codes to establish the presence of co-morbidities. Documentation of such transformations and derivations is critical to ensure traceability to the original source.
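As an illustration of transformation and computational derivation, the sketch below normalizes diagnosis codes from a secondary source and derives co-morbidity flags from them, keeping the original codes alongside each flag so the derivation can be traced. The mapping table, field names, and function names are hypothetical; the code prefixes follow WHO ICD-10 (E10-E14 for diabetes mellitus, I10 for essential hypertension), and your registry would need to agree on and document its own list.

```python
from dataclasses import dataclass

# Hypothetical mapping from ICD-10 code prefixes to registry co-morbidity flags.
# Illustrative only: a real registry would define and document its own list.
COMORBIDITY_PREFIXES = {
    "diabetes": ("E10", "E11", "E12", "E13", "E14"),
    "hypertension": ("I10",),
}


@dataclass
class DerivedFlag:
    name: str
    present: bool
    source_codes: list  # original codes that triggered the flag (traceability)


def normalize_code(raw: str) -> str:
    """Transform a source code into a consistent format, e.g. ' e11.9' -> 'E11.9'."""
    return raw.strip().upper()


def derive_comorbidities(raw_codes):
    """Derive one flag per co-morbidity from a list of raw diagnosis codes."""
    codes = [normalize_code(c) for c in raw_codes]
    flags = []
    for name, prefixes in COMORBIDITY_PREFIXES.items():
        hits = [c for c in codes if c.startswith(prefixes)]
        flags.append(DerivedFlag(name, bool(hits), hits))
    return flags


flags = derive_comorbidities([" e11.9", "N18.5"])
for flag in flags:
    print(flag)
```

Storing the triggering codes with each derived value is one simple way to document the derivation at record level; the rule table itself should also be kept under version control.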

Considerations when using data from secondary data sources

  • The underlying information may not have been entered in a standardized way, and the data quality of the secondary data source will impact on the overall quality of your registry. Inspecting the secondary data for quality and completeness is therefore important.
  • Your registry may receive an annual update of large amounts of data, or there may be monthly, weekly, or even daily data transfers. A registry update schedule should be in place to ensure proper data transfer into your registry.
  • Data in your secondary data source will be updated over time, and these updates will need to be added to participant records in your registry. You will need a process for managing this.
  • The secondary data source may alter data during processing, and such changes may or may not be well documented by the data providers.
  • All organizations providing data to your registry should have a common understanding of the rules regarding access to the data, including agreements that specify ownership of the source data and permissions for re-use. These agreements should specify the roles of each institution, the legal responsibilities, and any oversight issues and should be put in place before data is transferred.
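One possible process for applying updates from a secondary data source to participant records is sketched below. It assumes, hypothetically, that each incoming row carries a participant identifier and the date it was extracted from the source; the field names and the rule "newer extract wins" are illustrative choices, not a prescribed method.

```python
from datetime import date


def apply_update(registry, incoming):
    """Merge incoming secondary-source rows into registry records.

    Records are keyed by participant ID. A row replaces the stored one only
    if its extract date is newer. Returns the IDs that were added or changed.
    """
    changed = []
    for row in incoming:
        pid = row["participant_id"]
        current = registry.get(pid)
        if current is None or row["extract_date"] > current["extract_date"]:
            registry[pid] = row
            changed.append(pid)
    return changed


registry = {
    "P001": {"participant_id": "P001", "vital_status": "alive",
             "extract_date": date(2023, 1, 1)},
}
incoming = [
    {"participant_id": "P001", "vital_status": "deceased",
     "extract_date": date(2024, 1, 1)},
    {"participant_id": "P002", "vital_status": "alive",
     "extract_date": date(2024, 1, 1)},
    # A stale row that arrives late is ignored.
    {"participant_id": "P001", "vital_status": "alive",
     "extract_date": date(2022, 1, 1)},
]
changed = apply_update(registry, incoming)
print(changed)
```

Returning the list of changed identifiers makes it straightforward to log each transfer and to review what an update actually altered.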

Data cleaning and storage

Detailed plans for how you will clean the data in your registry should be addressed upfront in a data management manual that identifies the data elements to be cleaned, describes the data validation rules or logical checks for out-of-range values, explains how missing values and values that are logically inconsistent will be handled, and discusses how duplicate patient records will be identified and managed.
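As a minimal sketch of how duplicate patient records might be identified, the example below groups records that share a normalized name and date of birth. The field names are hypothetical, and exact matching on these fields is a simplifying assumption; real registries often need fuzzier matching and manual review before records are merged.

```python
from collections import defaultdict


def match_key(record):
    """Build a comparison key from normalized names and date of birth."""
    return (
        record["family_name"].strip().lower(),
        record["given_name"].strip().lower(),
        record["date_of_birth"],  # assumed to be an ISO date string
    )


def find_possible_duplicates(records):
    """Return lists of record IDs that share the same match key."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["record_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"record_id": "R1", "family_name": "Smith", "given_name": "Ana",
     "date_of_birth": "1960-05-01"},
    {"record_id": "R2", "family_name": "Jones", "given_name": "Ben",
     "date_of_birth": "1971-11-23"},
    {"record_id": "R3", "family_name": " smith ", "given_name": "ANA",
     "date_of_birth": "1960-05-01"},
]
duplicates = find_possible_duplicates(records)
print(duplicates)
```

The output is a list of candidate groups for review, not an instruction to merge; how confirmed duplicates are handled should be set out in the data management manual.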

Ideally, automated data checks can be programmed to check validity as data are entered or uploaded. These data checks will be useful for cleaning data at the site level, while the patient or medical record is readily accessible or as paper CRFs are being entered into the database. Even relatively simple edit checks, such as range checks on laboratory values, can significantly improve the quality of data.
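A simple edit check of this kind could look like the sketch below, which flags missing and out-of-range values at the point of entry. The field names and plausibility limits are hypothetical placeholders; the actual limits should come from your data management manual.

```python
# Hypothetical plausibility limits, for illustration only.
RANGE_RULES = {
    "hemoglobin_g_dl": (3.0, 25.0),
    "systolic_bp_mmhg": (50, 300),
}


def check_record(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, (low, high) in RANGE_RULES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    return problems


# A hemoglobin of 112 suggests the value was entered in g/L rather than g/dL.
print(check_record({"hemoglobin_g_dl": 112, "systolic_bp_mmhg": 120}))
```

Returning messages rather than rejecting the record outright lets the person entering the data correct the value while the source record is still at hand.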

On a regular, pre-defined basis, your registry should generate query reports on the quality of the data received. The content of these reports will differ depending on the type of data cleaning that is required.
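One example of such a report is a completeness summary by site, sketched below. The record layout, site labels, and list of required fields are assumptions made for illustration.

```python
from collections import Counter


def completeness_by_site(records, required_fields):
    """Return the percentage of required fields that are filled in, per site."""
    total = Counter()
    missing = Counter()
    for record in records:
        site = record["site"]
        for field in required_fields:
            total[site] += 1
            if record.get(field) in (None, ""):
                missing[site] += 1
    return {
        site: round(100 * (1 - missing[site] / total[site]), 1)
        for site in total
    }


records = [
    {"site": "A", "date_of_birth": "1960-05-01", "start_date": "2024-02-01"},
    {"site": "A", "date_of_birth": "", "start_date": "2024-03-15"},
    {"site": "B", "date_of_birth": "1971-11-23", "start_date": "2024-01-10"},
]
report = completeness_by_site(records, ["date_of_birth", "start_date"])
print(report)
```

Reports like this can be sent back to sites as queries so that gaps are resolved close to the source.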

When data on a CRF are entered into your registry, the form and a log of the data entered should be maintained for the regulatory archival period. You may discover data errors long after the data have been stored in the registry, and there should be mechanisms in place to flag data errors and correct the data for subsequent use.
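A mechanism for correcting stored data while keeping a log of what was changed might be sketched as follows. The function and field names are hypothetical; the point is that the old value, the reason, and the person making the change are recorded before the value is overwritten.

```python
from datetime import datetime, timezone


def correct_value(record, field, new_value, reason, user, audit_log):
    """Change one field and append an audit entry preserving the old value."""
    audit_log.append({
        "record_id": record["record_id"],
        "field": field,
        "old_value": record.get(field),
        "new_value": new_value,
        "reason": reason,
        "changed_by": user,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    record[field] = new_value


record = {"record_id": "R1", "weight_kg": 720}
audit_log = []
correct_value(record, "weight_kg", 72, "decimal point omitted at entry",
              "data.manager", audit_log)
print(record)
print(audit_log[0]["old_value"], "->", audit_log[0]["new_value"])
```

In a production database the same idea is usually implemented with an append-only audit table rather than an in-memory list.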

Careful attention will need to be paid to the storage of your data, including physical security, firewalls, separating identifiable from clinical data, encryption, controlled access, remote access, backup frequencies, and maintaining a separate standby database in a physically remote site (in case of a fire or flood at the site of the primary database). In most countries, these issues will need to be agreed to and specified when you are securing permission to establish your registry.
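To illustrate one of these measures, separating identifiable from clinical data, the sketch below splits a record into two parts joined only by a random link identifier. The list of identifying fields is a hypothetical example, and this is only one possible approach; which fields count as identifying and how the two stores are protected will depend on your local regulations.

```python
import secrets

# Hypothetical list of identifying fields, for illustration only.
IDENTIFYING_FIELDS = {"family_name", "given_name", "date_of_birth", "national_id"}


def split_record(record):
    """Split a record into identity and clinical parts sharing a random link ID."""
    link_id = secrets.token_hex(8)  # random, carries no information about the patient
    identity = {k: v for k, v in record.items() if k in IDENTIFYING_FIELDS}
    clinical = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    identity["link_id"] = link_id
    clinical["link_id"] = link_id
    return identity, clinical


patient = {
    "family_name": "Smith",
    "given_name": "Ana",
    "date_of_birth": "1960-05-01",
    "dialysis_modality": "hemodialysis",
    "hemoglobin_g_dl": 11.2,
}
identity, clinical = split_record(patient)
print(sorted(clinical))
```

The two parts can then be stored in separate databases with different access controls, so that analysts working with clinical data never need to see names or dates of birth.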

Suggested further reading:
Chapters 6 & 11 of the AHRQ Guidelines
Chapters 6.4 & 8 of the PARENT guidelines