A peek inside the busy home of the Internet Archive

SAN FRANCISCO — If you’ve ever clicked on a hyperlink that’s taken you to something called the Wayback Machine to view an old web page, you’ve been introduced to the Internet Archive.
The headquarters of the Internet Archive, an impressive white-columned, Greek Revival-style temple, rises just south of the Golden Gate Bridge.
Near the entrance of the building’s nave, a triptych of towering black computer servers hums loudly.
“That is the Internet Archive,” said Mark Graham, the director of the Internet Archive’s Wayback Machine, pointing to the server stacks. Graham was leading about a dozen visitors on a weekly public tour of the headquarters on a recent Friday in March. He projected his voice to be heard over the drone of the computers. “Those machines are servers that are being used right now to record and save material. The lights are blinking — that means that something is being written to or read from those hard drives.”

The servers are live-recording the World Wide Web. The results are staggering. Every day, about 100 terabytes of material are uploaded to the Internet Archive, or about a billion URLs, with the assistance of automated crawlers. Most of that ends up in the Wayback Machine, while the rest is digitized analog media — books, television, radio, academic papers — scanned and stored on servers.
As one of the few large-scale archivists to back up the web, the Internet Archive finds itself in a unique position right now. After President Trump’s inauguration in January, some federal web pages vanished. While some pages were removed entirely, many came back online with changes that the new administration’s officials said were made to conform to Trump’s executive orders to remove “diversity, equity, inclusion, and accessibility policies.” Thousands of datasets were wiped — mostly at agencies focused on science and the environment — in the days following Trump’s return to the White House.
Information about climate change, reproductive health, gender identity and sexual orientation also has been on the chopping block. For example, pages referencing the Enola Gay — the B-29 aircraft that dropped an atomic bomb on Hiroshima and is not particularly related to LGBTQ history — were among a leaked list of posts the Pentagon flagged for removal. Some deleted pages, including those related to the Enola Gay, have resurfaced as agencies figure out how to comply with Trump’s directives.
The Internet Archive is among the few efforts that exist to catch the stuff that falls through the digital cracks, while also making that information accessible to the public. Six weeks into the new administration, Wayback Machine director Graham said, the Internet Archive had cataloged some 73,000 web pages that had existed on U.S. government websites that were expunged after Trump’s inauguration.
Graham noted that, for example, the Internet Archive is currently the only place the public can find a copy of an interactive timeline detailing the events of Jan. 6. The timeline is a product of the congressional committee that investigated the Capitol attack, and has since been taken down from its website. Graham said it’s in the public’s interest to save such records.
“How much money did our tax dollars pay to make it?” he said, referring to the timeline and committee proceedings. “It was a non-trivial exercise and it’s part of our history — and for that reason alone, worthy of preservation and worthy of exploration, of understanding.”
It’s typical for new presidential administrations to make changes to federal websites. In 2008, the Internet Archive co-created a tool called the End of Term Web Archive to track and back up such changes. But Graham said that under Trump’s second term, the scope and sheer pace of the deletions of government data have been unprecedented.
“A lot of folks are out there trying to say, ‘What the heck just happened?’” Graham said. “We’re just doing our job, trying to be the best library that we can be, trying to help preserve the cultural heritage of our time — to make this material accessible, useful to people now and into the future.”
Since Trump’s second inauguration, more people are turning to the nonprofit
According to Graham, based on the big jump in page views he’s observed over the past two months, the Internet Archive is drawing many more visitors than usual to its services — journalists, researchers and other inquiring minds. Some want to consult the archive for information lost or changed in the purge, while others aim to contribute to the archival process.
“There’s a groundswell of support for the Internet Archive because of the dramatic shift that’s going on in parts of the government web infrastructure that you wouldn’t imagine would change,” said Brewster Kahle, the founder and current director of the Internet Archive. “People are coming and rallying behind us — by using it, by pointing at things, helping organize things, by submitting content to be archived — data sets that are under threat or have been taken down.”

For example, she noted, there’s a web page on the Centers for Disease Control and Prevention’s site titled “Ending Gender-Based Violence.” It highlights CDC research showing that adolescent girls and young women bear a disproportionate burden of HIV cases worldwide, an issue driven by gender-based violence and poor access to health services. The page, which was accessible on Jan. 16 prior to Trump’s inauguration, now reads “page not found.”
Graham’s team has been working to get ahead of future purges, trying to identify and capture the material that might be at greater risk of removal, he said.
“Certainly this administration in some ways has made our job easier,” he said. “Even on the first day, they began sharing terms, words, topic areas that were going to be under examination — terms like ‘DEI.’”
The Internet Archive doesn’t catch everything. A report about the risks of bird flu to people and pets briefly appeared and disappeared on the Centers for Disease Control and Prevention website. Graham said it appeared that the Wayback Machine wasn’t able to record it in time.
“I remember, I immediately went in and I kind of held my breath like, ‘Oh, do we have that?’ And we didn’t have it,” he said.
There’s a chance it could pop up later, possibly through the stream of material coming from outside contributors and partners. Most of what the Internet Archive slurps into the Wayback Machine becomes available to the public with minimal delay. In some cases, because the organization works with different partners in the archival process, there is a delay between when the material is collected by those partners and when it’s made available through the Wayback Machine.
“I’m still keeping my fingers crossed on that one,” Graham said. When the Internet Archive’s scrapers fail to capture such data, he said, “it’s an opportunity for us to learn how we can do our jobs better.”
As the organization works to adapt, Graham said the job has him working overtime. “On a personal level, this has been a bit of a sprint,” he said. “I’ve been working seven days a week for the last many weeks. I’ve been finding myself, quite literally since the inauguration, waking up earlier with a sense of purpose and energy.”
Keeping the public front of mind
Despite its pioneering role in the digital realm, the Internet Archive team wants to keep people, not just machines, in full focus. Near the servers, clay sculptures — petite doppelgängers immortalizing people who have worked for the organization — line the walls and spill into the pews.

“We have all those little statues, which I think is a way of celebrating the people working on these collections,” Kahle said. “People have agency to build the technologies we think will serve us well. It’s [important] to have people understand how they can participate, that it’s not something happening to them. It’s ours.”
Avinash Krishna, a 22-year-old recent college graduate, visited from the Sacramento area to tour the headquarters. He said he’s been using the Internet Archive’s services for about a decade. The tour had long been on his to-do list, but a recent visit to a Wikipedia page bumped it up higher. To him, it was an example of how he’s seen the web become increasingly reliant on the archive’s tools.
“I don’t remember the page but, you know, a significant percentage of the links that were on the Wikipedia article are Internet Archive links,” he said. “That is really sad — that what people view as a primary source is something that doesn’t exist anymore.”

Krishna is grieving what’s known as digital decay or “link rot” — the massive, expanding graveyard of broken links across the web. It’s what you see when you encounter “Error 404” or “page not found.”
While the Trump administration’s scrubbing of federal web pages presents a notable example of the severed links issue, it’s long been an epidemic. A Pew Research Center study published last year found that roughly 38% of web pages on the internet that existed in 2013 were no longer accessible as of 2023. According to a Harvard Law Review study published in 2014, about half of all links cited in U.S. Supreme Court opinions no longer led to the original source material.
Kahle, who early on recognized the ephemeral nature of the web, said the rapid deterioration of the living web is a serious threat to historical preservation. “We’re building our culture on shifting sands,” he said.

A behemoth of link rot repair, the Internet Archive rescues a daily average of 10,000 dead links that appear on Wikipedia pages. In total, it’s fixed more than 23 million rotten links on Wikipedia alone, according to the organization.
The rapid decimation of government site data is just the latest challenge facing the nonprofit. Since 2020, the Internet Archive has been slapped with costly copyright lawsuits over its digitization of books and music that are not in the public domain. Record labels and book publishers have sued the nonprofit for hundreds of millions of dollars.
Founder Kahle said the costly lawsuits — which legal experts say are meant to be a deterrent — threaten the future of the archive. With a staff of some 120 people, the organization had a budget of about $28 million last year — less than a fifth of the San Francisco Public Library’s budget. It’s funded through donations big and small, as well as money that comes from museums, libraries and other institutions that pay the nonprofit to preserve their collections. On top of that, the organization has also been a target in a recent series of cyberattacks on libraries.
Even at a time when the Internet Archive is under threat, its founder Kahle appreciated that, back at the headquarters, the big room of towering servers — the lifeblood of the library — remains unobstructed, in full public view.
“It’s like open stacks,” he said. “It’s not hidden away in some bunker someplace. It’s ‘this is us.’ It comes across as a bit vulnerable, right?”
