Big data and how it revolutionised our film archive

Ahead of an event to discuss the ‘big data’ revolution in the BFI’s collections, Stephen McConnachie, BFI’s Head of Data, reflects on a decade that has transformed the way our archive creates and uses data.

Updated: 18 May 2017

By Stephen McConnachie

Video cassette digitisation in the BFI National Archive © James Cumpsty

“It is a capital mistake to theorize before one has data. Insensibly, one begins to twist the facts to suit theories, instead of theories to suit facts.” – Sherlock Holmes, A Scandal in Bohemia

If you accept that a key component of the big data revolution is (thanks to the power of computer technology) sample = all, then the BFI has been a big data organisation since its beginning. Our Monthly Film Bulletin (since incorporated into Sight & Sound magazine) started publication in 1934 with a remit to review every film released to cinemas in the UK, and to provide full cast and crew information for every film. This role as publisher of a document of record for all feature films presented to British audiences put the BFI in a good position to amass a comprehensive film dataset, long before such concepts were established.

In parallel, the BFI National Archive, founded in 1935, required the BFI to document and interpret the collections in one of the largest – and busiest – moving image archives in the world. Which films were held in the archive, on which formats, in what condition, accessed by whom, for what purposes? Which films had multiple versions, and what did each version contain? Which films were restored, using components borrowed from peer archives around the world? Which television programmes were recorded off-air for the national television archive? Who made these films and TV programmes, and who acted in them? What were their subjects and genres? Which posters, stills, designs, scripts, books and articles were held in the BFI Reuben Library?

Film cans in the BFI National Archive
© James Cumpsty

Until the start of this decade, this vast sea of information on films, TV programmes and the people who created them was held in many systems, in many data formats and technologies, managed by different parts of the BFI, with standards and approaches developed locally to serve the needs of individual departments. Searching the information meant visits to multiple BFI sites in London and the John Paul Getty Jr Conservation Centre in Hertfordshire, often scheduled in advance, sometimes supervised.

It was impossible to search across this entire sea of information to find, for example, all books, articles, press cuttings, films and television programmes, stills, posters, designs and scripts held by the BFI relating to a single filmmaker.

In 2010 the BFI’s Collections and Information department began modernising the data infrastructure, using public funding opportunities to procure and develop a state-of-the-art collections management system. The department also reformed its information governance and created dedicated roles to manage standards and practices across the data lifecycle – creation, storage, management, exchange and access.

All collections datasets were brought into one system, the Collections Information Database, or CID for short, and a public Collections Search interface was launched, enabling researchers – for the first time – to search across the BFI’s information about its collections in one web application, without leaving home.

Stacks at the BFI National Archive
© James Cumpsty

Standards were implemented, to make sure that the data was governed according to best practice, and to increase interoperability between the BFI and its peer archives around the world, including television archives such as ITV. This public search platform has enabled archivists around the world to find ‘believed lost’ films, including an early Disney short. Every week it also enables peer archives to request access to the BFI’s treasure trove of moving image related materials, and moving image researchers to obtain vital information.

This modernised data infrastructure has also streamlined the provision of authoritative film data to other parts of the BFI – for example the data in BFI Player comes from CID – and it is the data engine-room that powers major diversity data analysis projects like the research on black actors in the UK film industry that accompanied the BFI’s recent Black Star programme, celebrating black actors across film history.

Originally published: 18 May 2017
