The VirusTotal API - Malware Automation at Scale

If you've spent much time in information security, you've probably dealt with VirusTotal in some capacity, whether searching the platform for malware or uploading malware you built to test whether various antivirus programs detect it.

If you aren't as familiar, VirusTotal is a web platform that allows information security researchers to upload malware and view a wealth of data about it, from raw binary data and extracted strings to contacted IP addresses and detection verdicts from dozens of antivirus vendors. It also allows researchers to search the platform for malware matching specific criteria, from file size to extracted file type. Finally, for more advanced researchers, it offers an API for automated research and the ability to hunt for specific malware using custom YARA rules.

This article serves as a foundational introduction to the latter functions, specifically the VirusTotal API. The VT API is incredibly powerful, fairly well documented, and has enormous potential for information security research.

The basics

The VT API is divided across multiple objects and collections. You can think of objects the same way you would in most other programming contexts: they're sets of related data in a structured format. Files are obviously the biggest object type that we're going to deal with. Collections are basically just ways of referring to sets of objects. Finally, there are relationships between objects, such as two files being related because one is a ZIP file containing the other. This is all fairly well explained in the VT API documentation page on objects, collections and relationships.
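As a concrete sketch of how this layout maps to URLs: objects live at a collection path, and relationships are exposed as sub-paths of the object's endpoint. The relationship name `bundled_files` below is one example from the VT docs, and the hash is a placeholder, not a real sample:

```python
# Sketch of the VT API v3 URL scheme for objects and their relationships.
BASE = 'https://www.virustotal.com/api/v3'

def object_url(collection: str, object_id: str) -> str:
    """URL for a single object, e.g. a file looked up by hash."""
    return f'{BASE}/{collection}/{object_id}'

def relationship_url(collection: str, object_id: str, relationship: str) -> str:
    """URL listing objects related to the given object."""
    return f'{BASE}/{collection}/{object_id}/{relationship}'

placeholder_hash = 'a' * 64  # stand-in for a real SHA256

print(object_url('files', placeholder_hash))
print(relationship_url('files', placeholder_hash, 'bundled_files'))
```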

This objects-and-collections layout makes the endpoint structure fairly intuitive. To query the VT API, you only have to know the collection name and the ID of the object you're looking for. For example, the ID for a file is its SHA256 hash, so the API endpoint I'd query to get information about that file object is https://www.virustotal.com/api/v3/files/{SHA256}. Thankfully, VT makes it even easier by indexing files in this collection by their MD5, SHA1, or SHA256 hash, so you only need to know one of the three to hit that endpoint.

Finally, authentication, which is incredibly easy: go to the API settings page on your VirusTotal account, copy your API key, and pass it in the `x-apikey` header of every request.
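A minimal sketch of an authenticated lookup, using only the standard library (the environment variable name `VT_API_KEY` in the usage comment is my choice, not VT's):

```python
import json
import os
import urllib.request

def vt_headers(api_key: str) -> dict:
    # The x-apikey header is all the authentication the v3 API needs.
    return {'x-apikey': api_key}

def lookup_file(file_hash: str, api_key: str) -> dict:
    """Fetch the file object for a given MD5/SHA1/SHA256 hash."""
    url = f'https://www.virustotal.com/api/v3/files/{file_hash}'
    req = urllib.request.Request(url, headers=vt_headers(api_key))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (makes a real request, so it needs a valid key):
# report = lookup_file('<hash>', os.environ['VT_API_KEY'])
```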

In this introduction, we're going to make some basic file lookup requests to get started. This is one of the foundational use cases for the VT API, though you'll find it's not incredibly powerful with just a free-tier account. It also means you really only need to be familiar with the files endpoint, which you can read more about in the VT API documentation.

Free account API limits

A good look at the VT API quota limits for free tier

VirusTotal offers a fairly generous free tier. That said, there are notable limitations that will push a team toward a (pricey) premium-tier account.

The biggest limiting factor is the rate limit: free-tier accounts may send at most 4 requests per minute. This is, in my opinion, prohibitively low for automation at even a modest scale, but I guess VT has to make their money somehow.

đź’ˇ
Like the article? Get research like this every week, straight to your inbox and always for free with the Valhalla Research Weekly Newsletter.

You're also allowed up to 500 requests per day and 15,500 requests per month, which isn't bad for a solo researcher. Especially as a beginner, you're likely to stay well under those limits for quite a while.

The other major limitation is the data you have access to: several fields and features are reserved for a full VT Premium account. The classic example is VT Hunting, which lets researchers sweep VT for samples matching custom YARA rules, but even on specific object-level endpoints you're missing out on some great data. If you open the "Files" object dropdown in the left panel of the VT API documentation, the fields marked with lock emojis show what you're missing out on:

Speaking from experience, these are massively useful data points. But, for a cash-strapped solo researcher, you'll have to make do.

A quick use-case

The best quick use-case for the VT API is iterating over a list of hashes you're interested in and querying VirusTotal to see what it knows about each one. This process is incredibly easy and simple to extend and scale up, especially if you have a higher-tier API account.

We're going to read in a line-separated list of various hashes, query the API, download and store all of the data and display some basic results.

"""
Requirements (all default packages or downloadable via PIP)
- os - used to load in environment variables
- requests - used to make HTTP requests
- json - used to load JSON data into dictionaries and dump it to files
- time - used to sleep to avoid rate limiting
- dotenv.load_dotenv() - used to load a dotenv file (.env) which holds our API key so I don't paste it directly in the script

"""
import os, requests, json, time
from dotenv import load_dotenv

# Load information from the .env file in the same directory as our script, which holds our API key.
load_dotenv()

# Load the API key in from the OS environment variable
api_key = os.getenv('API')

# The endpoint to fetch information about a File object, sans the file ID which we will append later
file_endpoint = 'https://www.virustotal.com/api/v3/files/'

# A list of hashes in newline separated text format
hash_file = open('./hashes.txt','r')
hashes = hash_file.readlines()
hash_file.close()

# Setting up the header object for authentication
headers = {'x-apikey':api_key}

# Iterate over all hashes...
for hash in hashes:
	# Make the request and append the unique ID
    res = requests.get(file_endpoint+hash, headers=headers)
    js = res.json()
    
    # Dump the JSON data to a file named after the file's ID
    ffile = open(f'./{hash}.json','w')
    ffile.write(json.dumps(js))
    ffile.close()
    
    # Attempt to load the data and display the MD5/SHA1/SHA256 hashes
    try:
        data = js['data']['attributes']
        md5 = data['md5']
        sha1 = data['sha1']
        sha256 = data['sha256']
        print(f'[-] ID: {hash}\n\tMD5: {md5}\n\tSHA1: {sha1}\n\tSHA256: {sha256}')
    # Tell the user when a file doesn't exist on VT
    except Exception as e:
        if 'error' in js and js['error']['code'] == 'NotFoundError':
            print(f'[x] File of ID {hash} not found')
        else:
            print(f'[x] Error: {str(e)}\nRes: {res}')
	
	# Sleep to avoid rate limiting
    print('[-] Sleeping for 15 seconds...')
    time.sleep(15)

The above code snippet is fairly straightforward.

  • We load our API key in via a .env file and the python-dotenv library
  • We load in some example hashes (in this example, I used hashes from Cisco's Manjusaka report)
  • We iterate over those hashes, making the API requests, dumping the data into JSON files and printing out some basic information.

Dumping the data into JSON files is useful for two reasons: it saves precious API usage by caching results, and it gives you real examples of an incredibly large and fairly complex JSON structure. If you implement cache checking (i.e., check whether you've already requested information on a file before making the API call), you can serve repeat lookups from disk instead of spending a request. The data structure is complex enough that you'll want an example of the JSON format to reference for future automation.

The terminal output of the example code above

Honestly, that's about it.

There are a lot of additions you can make to this code, such as adding a back-end storage interface to store your results and serve as your local cache, or pivoting on data points in the JSON structure to enrich your initial results, but I'll leave that up to your ingenuity and creativity.

If you enjoyed this article, good! I plan on making it a series. Feel free to subscribe to my weekly newsletter, where you'll get all of my articles in your inbox for free.