How big is 30 terabytes of malware source code? Or 31 petabytes of submitted malware samples? For most people, those numbers are abstract. But a recent exchange between two prominent figures in cybersecurity has given us a tangible way to visualize the sheer scale of the world’s largest malware repositories.
Malware research group vx-underground, which claims to host the largest publicly known collection of malware source code, posted on X that its archive totals roughly 30 terabytes. In response, Bernardo Quintero, founder of VirusTotal — the Google-owned service that scans files across dozens of antivirus engines — noted that VirusTotal’s community-submitted malware samples amount to approximately 31 petabytes. (A petabyte is about 1,000 times larger than a terabyte.)
These datasets are critical infrastructure for cybersecurity companies, threat intelligence firms, and AI researchers who train detection models and study how malware evolves. But what does that much data actually look like in physical terms?
Stacking the Numbers: From Inches to Eiffel Towers
We decided to do some rough, back-of-the-napkin math. Using standard 3.5-inch internal hard drives — which are 1 inch tall and typically offer 1 terabyte of storage — we calculated the height of a hypothetical stack.
For vx-underground’s 30 terabytes, that’s 30 hard drives stacked on top of each other, reaching 30 inches, or about 2.5 feet. To put that in human terms, this reporter is 6 feet tall — so the stack would come up to about waist height.
For VirusTotal’s 31 petabytes, the numbers become staggering. Counting 1,024 terabytes per petabyte, that’s 31,744 hard drives. Stacked vertically, they would reach approximately 2,645 feet. The world’s tallest building, the Burj Khalifa in Dubai, stands at 2,722 feet, just 77 feet taller. The Eiffel Tower, at 1,083 feet, would be dwarfed: VirusTotal’s stack would stand roughly two-and-a-half Eiffel Towers tall.
These comparisons are, of course, simplified. Real-world storage uses far denser enterprise drives, and the total usable capacity of a 1TB drive is slightly less than advertised. But the exercise offers a visceral sense of scale that raw numbers cannot convey.
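For readers who want to check the napkin math, here is a minimal Python sketch of the calculation. It uses the article's assumptions (1 terabyte and 1 inch of height per drive) and the binary convention of 1,024 terabytes per petabyte, which is what the 31,744-drive figure implies. The function and constant names are illustrative only.

```python
import math

# Assumptions taken from the article: each 3.5-inch drive holds 1 TB
# and is 1 inch tall. Real enterprise drives are far denser.
DRIVE_CAPACITY_TB = 1
DRIVE_HEIGHT_IN = 1
INCHES_PER_FOOT = 12
TB_PER_PB = 1024  # binary convention; matches the 31,744-drive figure

# Landmark heights as quoted in the article, in feet.
BURJ_KHALIFA_FT = 2722
EIFFEL_TOWER_FT = 1083


def stack(total_tb):
    """Return (number of drives, stack height in feet) for a dataset size in TB."""
    drives = math.ceil(total_tb / DRIVE_CAPACITY_TB)
    height_ft = drives * DRIVE_HEIGHT_IN / INCHES_PER_FOOT
    return drives, height_ft


vx_drives, vx_feet = stack(30)              # vx-underground: 30 TB
vt_drives, vt_feet = stack(31 * TB_PER_PB)  # VirusTotal: 31 PB

print(f"vx-underground: {vx_drives} drives, {vx_feet:.1f} ft")
print(f"VirusTotal: {vt_drives:,} drives, {vt_feet:,.0f} ft")
print(f"Eiffel Towers: {vt_feet / EIFFEL_TOWER_FT:.2f}")
print(f"Shortfall vs. Burj Khalifa: {BURJ_KHALIFA_FT - vt_feet:.0f} ft")
```

Swapping `TB_PER_PB` for 1000, or raising `DRIVE_CAPACITY_TB` to a modern drive size, shows how sensitive the comparison is to those assumptions.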
Why This Matters Beyond the Numbers
These repositories are not just curiosities. They are essential tools for defending against cyber threats. Malware samples — both source code and compiled binaries — allow security researchers to reverse-engineer attacks, identify signatures, and train machine learning models to detect new variants. Without large, diverse datasets, the cybersecurity industry would be fighting blind.
Vx-underground’s collection focuses on source code, which is rarer and more valuable for understanding how malware is built. VirusTotal’s dataset is broader, encompassing millions of user-submitted files, many of which are benign but some of which represent advanced threats. Together, they represent a significant portion of the world’s known malware intelligence.
Implications for AI and Threat Detection
The size of these datasets also underscores a growing challenge: storage and processing. As malware becomes more sophisticated and prolific, the infrastructure required to analyze it must scale accordingly. Cloud providers, specialized hardware, and AI-driven analysis tools are all part of the solution. But the physical reality, that 31 petabytes of data would fill a stack of 1-terabyte hard drives nearly as tall as the world’s tallest building, is a reminder of the immense resources required to keep digital systems safe.
Conclusion
While the numbers behind vx-underground and VirusTotal are often discussed in abstract terms, visualizing them as a stack of hard drives brings a new level of understanding. The next time you hear about a cybersecurity dataset, remember: it’s not just data. It’s a tower of evidence, a library of threats, and a monument to the ongoing battle between attackers and defenders.
FAQs
Q1: What is vx-underground?
Vx-underground is a malware research group that maintains one of the largest publicly available collections of malware source code. Researchers use it to study how malware is built and evolves.
Q2: How does VirusTotal’s dataset compare to vx-underground’s?
VirusTotal’s dataset is approximately 31 petabytes, which is about 1,000 times larger than vx-underground’s 30 terabytes. VirusTotal’s collection includes both benign and malicious files submitted by its global user base.
Q3: Why do these datasets matter for everyday users?
These datasets are used by cybersecurity companies to train antivirus software, threat detection systems, and AI models that protect computers and networks from malware attacks. They are a critical part of the global cybersecurity infrastructure.