The accuracy of redaction tools for de-identification of medical records

Before medical records can be used safely in research via IORD and similar research databases, they are first “de-identified”. This means removing names, dates of birth, addresses, phone numbers, NHS numbers and similar details, so that researchers cannot see information that could identify who a record belongs to.

It is easy to do this when the information is stored under an obvious label like “name”. However, identifying information can occasionally end up in other parts of the data, such as the free text of a medical record. For example, a comment about a birthday might be included in the report of a scan, or a doctor might write the name of the village where a person lives in an entry about how they might have picked up an infection.

There are tools that can be used to find this information. We want to test how effective they are at finding patient identifiers in different kinds of free text, so that the identifiers can be removed. We will test these tools on scan reports, on infection and other lab reports, and on patient reviews by infection teams. Two NHS doctors will also review the same text and mark any patient identifiers. We can then see how well the tools work by comparing their results with what the doctors found. This will let us work out whether the tools are doing a good job, and which is the best to use in future.
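As a rough illustration of the comparison described above, the sketch below scores one tool against the doctors' markings for a single piece of text. It is not the study's actual method: the function name `score_tool`, the representation of each identifier as a (start, end) character position, the exact-match rule, and the choice of recall and precision as measures are all assumptions made for this example, and the positions are made up.

```python
def score_tool(doctor_spans, tool_spans):
    """Compare identifier spans flagged by a tool with spans marked by doctors.

    Each span is a hypothetical (start, end) character position in the text.
    Only exact matches count as "found" in this simplified sketch.
    """
    doctor = set(doctor_spans)
    tool = set(tool_spans)
    found = doctor & tool  # identifiers both the doctors and the tool marked

    # Recall: what share of the doctors' identifiers did the tool find?
    recall = len(found) / len(doctor) if doctor else 1.0
    # Precision: what share of the tool's flags were real identifiers?
    precision = len(found) / len(tool) if tool else 1.0

    return {
        "recall": recall,
        "precision": precision,
        "missed": sorted(doctor - tool),  # identifiers the tool left in the text
    }


# Made-up example: the doctors marked three identifiers, the tool flagged three
# spans, and two of them agree.
doctor_spans = [(0, 10), (25, 33), (50, 58)]
tool_spans = [(0, 10), (50, 58), (70, 75)]
result = score_tool(doctor_spans, tool_spans)
print(result)
```

For de-identification, the "missed" list matters most, because each missed span is an identifier that would remain visible to researchers. A real evaluation would likely also need a rule for partial overlaps, which this sketch ignores.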

See publication: Benchmarking transformer-based models for medical record deidentification: A single centre, multi-specialty evaluation
