British Library to archive one billion UK web pages by year's end

In a bid to permanently preserve Britain's stamp on the internet, the British Library will kick off an ambitious plan Saturday to archive every URL with a .uk suffix. Already tasked with collecting virtually every printed work that originates in the country, the library is now called on by law to capture 4.8 million websites, which together amount to around one billion individual pages. To tackle the mission, it will use an automated web crawler to comb through Britain's corner of the World Wide Web.

That's a lot of web crawling

The British Library started on this quest back in 2004 with the UK Web Archive, though progress so far has been slow. Until now, the library has typically requested permission from site owners before including their content in the archive, but it will no longer do so as it looks to speed things up, aiming to make the archive publicly available by year's end. Per the Associated Press, a majority of sites will be scanned annually, while more frequently updated destinations (newspapers, popular blogs, and magazines) will be recorded up to once per day.
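To make the scanning schedule concrete, here is a minimal Python sketch of how a crawler might decide which URLs are in scope and when to revisit them. It is not the library's actual software: the function names `in_scope` and `next_crawl` are hypothetical, the exact intervals are assumptions drawn from the "annually" and "up to once per day" figures above, and a daily revisit is treated as the shortest possible interval.

```python
from datetime import date, timedelta
from urllib.parse import urlsplit

# Assumed intervals: the article says only "annually" and "up to once per day".
ANNUAL = timedelta(days=365)
DAILY = timedelta(days=1)


def in_scope(url: str) -> bool:
    """Return True if the URL's host name ends in .uk (assumed scope rule)."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".uk")


def next_crawl(last_crawled: date, frequently_updated: bool) -> date:
    """Return the earliest date on which a site would be scanned again.

    frequently_updated marks sites such as newspapers, popular blogs,
    and magazines; everything else falls back to the annual scan.
    """
    interval = DAILY if frequently_updated else ANNUAL
    return last_crawled + interval
```

For example, `in_scope("https://www.bl.uk/")` is true while `in_scope("https://archive.org/")` is not, and a site flagged as frequently updated becomes eligible for another scan the day after its last one.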

Capturing URLs is only one part of the process, though; another challenge is preserving them for decades to come. Copies of the archive will be stored on servers around the country, and library staff will update file formats if and when technological advances call for it. "If we don't capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost," head of content strategy Lucie Burgess told the AP. As ambitious as the project might seem, it doesn't quite match the scope of the Internet Archive, which has amassed a collection of 240 billion web pages since 1996.