
What do the usage data on Drupal.org actually mean?

November 23, 2021

Many Drupal insiders recognize that the usage data collected and displayed on Drupal.org have their limitations. Since 2018 there's been a proposed Drupal core telemetry initiative to expand and improve the data collected. In the meantime, though, the usage stats are widely referred to and cited. So it's worth spending a bit of time with them. What do they actually capture?

Before digging in, a disclaimer. Circa early 2007 I wrote the first rough prototypes of what became - with a lot of great work from Earl Miles, Derek Wright, and others - the client and server components of the Drupal core Update module. But I had little or nothing to do with any of that further work and I haven't done more than glance over the current Update module code, let alone the various pieces that run on Drupal.org to receive, log, and tally the data. So my notes here are more inference than studied conclusions.

To start off, a brief step back to look at where the stats come from.

How the stats are calculated

The Drupal.org data on project usage look simple enough at first glance, but like many statistics they have some non-obvious complexity and nuance.

Take the figures for Drupal core. There are date-specific tallies for how many sites report using Drupal core, broken down by major and minor release version and - if you scroll down the page - by individual release. The explanatory text states, "the figures show the number of sites that reported they are using a given version of the project."

Where do the data come from? A Drupal install using the "standard" install profile installs a core module, Update, by default. That module checks periodically - by default, once daily - for available updates. When it does so, the instance sends data about itself to the Drupal.org servers, including data on all the projects it has installed. A "project" here is in the Drupal.org sense of something that's downloaded and installed as a package. It could be Drupal core, including all the core modules and themes, or a contributed project like Admin Toolbar, including submodules like Admin Toolbar Search.
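As a rough sketch of what that check looks like: historically, it has amounted to one HTTP request per installed project against a Drupal.org release-history endpoint, with an anonymized site key and the installed version passed as query parameters. The endpoint shape and parameter names below are assumptions from memory, not a verified copy of the current implementation.

```python
import urllib.parse

# Assumed endpoint shape; the real Update module does this in PHP,
# one request per installed project.
BASE = "https://updates.drupal.org/release-history"

def update_check_url(project, channel, site_key, version):
    """Build the kind of URL an install might request for one project."""
    query = urllib.parse.urlencode({
        "site_key": site_key,  # anonymized identifier for this install
        "version": version,    # version of the project it's running
    })
    return f"{BASE}/{project}/{channel}?{query}"

# One request for core, one per contributed project:
print(update_check_url("drupal", "current", "abc123", "9.2.7"))
print(update_check_url("admin_toolbar", "current", "abc123", "3.0.3"))
```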

To complicate things, the mechanism that triggers sending data to drupal.org - cron - relies by default on page visits, so installs with very low traffic may be missing from the data if they receive no page views at all in a given week. This fact may account for periodic dips in usage stats over periods that include holiday seasons, when certain kinds of sites receive fewer visits. Similarly, some Drupal sites won't have the Update module installed and so won't show up either.

On the flip side, it's always possible that some of what shows up in the stats doesn't come from Drupal installs at all but from some attempt to game the system. Glancing over the graphs of Drupal core usage, it's hard not to notice some questionable spikes in the data, maybe most notably two in March and April of 2018 that showed supposed Drupal 7 usage jumps of nearly twenty percent from one week to the next, followed immediately by equally steep declines. So, yeah, some oddities, and over the years there have been discussions about spikes in the data and how to address them.

Once the installs have "called home," code on Drupal.org analyzes the logged data returned by all those Drupal installs in a given week, tallies it up, and presents the results in those handy tabular lists.
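In outline, that tallying step presumably amounts to something like the following sketch. The record shape here is an assumption; only the logic of counting each install once per project per week reflects what the stats pages describe.

```python
from collections import Counter

def weekly_tally(log_entries):
    """Count distinct reporting installs per (project, version).

    `log_entries` is an iterable of dicts shaped like
    {"site_key": ..., "project": ..., "version": ...}, drawn from one
    week's logged update checks. The shape is hypothetical.
    """
    seen = set()
    counts = Counter()
    for entry in log_entries:
        # Installs check in daily by default, so count each
        # (install, project) pair at most once per week.
        key = (entry["site_key"], entry["project"])
        if key in seen:
            continue
        seen.add(key)
        counts[(entry["project"], entry["version"])] += 1
    return counts
```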

So does that mean, for example, that as of October 24, 2021, there were (doing some quick sums of the 9.0.x, 9.1.x, 9.2.x, 9.3.x, and 9.4.x numbers) at least 155,449 Drupal 9 sites?

The answer is, that depends on what you mean by "sites."

Installs vs. sites

A key thing to keep in mind when looking at the Drupal.org usage data is that they report Drupal installs--and these can be spun up for many different reasons. Some for sure are production websites. But other uses include:

  • Development environments. It's common for Drupal developers to spin up many installs they use purely for development purposes. Various dev tools have this kind of workflow baked in. For example, if you're using Pantheon to manage a Drupal site, you might spin up a different "multidev" environment for each new feature or bug you work on. If you have dozens of issues in a release, this might mean dozens of installs for each production website.
  • Code and configuration staging. Aside from one-off environments created for a particular issue, it's also common practice to use several permanent environments for staging code or configuration changes. Looking again to Pantheon, their platform features built-in support for Dev, Test, and Live environments. In other words, you might have two (or more) additional installs used for staging code and configuration changes on a single Drupal "site."
  • Evaluation. Some installs are created just to evaluate or try out Drupal or one or more of its extensions. For example, the simplytest.me service can be used to spin up short-lived Drupal installs for demo or testing purposes.
  • Continuous integration and automated testing. Drupal core and many Drupal extensions feature continuous integration including automated tests that are run on each proposed change to the software. There are many thousands of these issues open at any given time. Since tests are run for every new iteration of a proposed change classed as needing review, a given active issue can trigger multiple automated test runs in a given week, each involving a new Drupal install.

The main point: production websites are only a subset of Drupal installs.

Development installs

A sign that many reported installs are probably driven by automated testing is a data quirk that shows up whenever a new core minor version branch is opened up. For example, here's a chronological excerpt of the usage data from the weeks around the 9.2.x dev release.

Week                 9.0.x    9.1.x    9.2.x
September 27, 2020   24,405   29,493        0
October 4, 2020      23,753   26,411        0
October 11, 2020     25,506   28,148      345
October 18, 2020     24,558   10,723   19,926
October 25, 2020     26,288   16,150   18,461

What immediately stands out is the leap in usage numbers for 9.2.x. By the end of the week starting October 18, 2020 - just eight days after the 9.2.x dev release was cut - there were suddenly close to 20,000 installs reportedly running the release. Meanwhile, the usage of 9.1.x dropped by nearly as much.
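Pulling the week-over-week changes out of the table makes the shift concrete:

```python
# Figures copied from the table above.
before = {"9.0.x": 25506, "9.1.x": 28148, "9.2.x": 345}   # Oct 11
after = {"9.0.x": 24558, "9.1.x": 10723, "9.2.x": 19926}  # Oct 18

for branch in ("9.1.x", "9.2.x"):
    print(f"{branch}: {after[branch] - before[branch]:+,}")
# 9.1.x: -17,425
# 9.2.x: +19,581
```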

Really? Did tens of thousands of site developers decide to switch their sites to the newest dev branch, practically overnight?

Less immediately obvious but equally striking are the high numbers for the two development Drupal 9 release branches compared to the stable one. During this period, combined usage numbers for the then-unsupported 9.1.x and 9.2.x development releases were around twenty percent higher than usage numbers for the stable 9.0 branch.

Huh? Were the majority of Drupal 9 sites really running the completely unsupported cutting edge releases? Where did these "sites" come from?

The answer is probably in the October 6, 2020 pre-announcement of the 9.1.0-alpha1 release, which noted that "all outstanding issues filed against 9.1.x" were to be "automatically migrated to 9.2.x"--meaning they'd now be tested against the 9.2.x version rather than 9.1.x.

And voila, all those 9.2.x installs appeared.

After shooting up to nearly 20,000 installs in eight days, the 9.2.x usage figures remained relatively static in the subsequent months, settling for the most part into the 20,000-25,000 range. This pattern is consistent with usage driven mostly by automated tests.

Many higher-end Drupal sites feature their own flavour of continuous integration, with their own rules about which version or versions of core to test against--and their own potential footprint of test installs.

In short, many or most of these reported development-version installs are probably ephemeral software instances installed and then taken down by infrastructure scripts. "Sites" only in the most abstract sense. This probability means we should take care when basing conclusions on raw usage data, particularly early in a major version cycle.

Drupal 7 vs. 8+

Automated test installs, development environments, and the rest aren't new, but several of these types of install are likely to be on the increase in Drupal 8+.

In previous versions it was possible to stage configuration between multiple copies of the same site, but Drupal 8 was the first version to include explicit support for this workflow. "Reliable configuration management for safe and straightforward deployment of changes between environments" was a key selling point when Drupal 8 was announced. All things equal, we can expect a lot more use of staging environments in Drupal 8+ than in earlier versions--and therefore a lot more installs per production site.

Automated testing too may have a bigger footprint in Drupal 8+ than it did in Drupal 7. It's widely accepted that Drupal 8+ has proportionally more high-end, enterprise-level sites, where dedicated development teams, continuous integration, and automated testing are much more likely. Further, in Drupal 8+ the introduction of minor release cycles (for example, 9.0, 9.1) means at any given time there are more current or upcoming versions to test against and hence more potential test installs.

All of these factors mean, among other things, that the falling graph of Drupal core usage probably understates the decline if what we're after is data on production websites as opposed to installs.

Core vs. contrib

For core usage, there are other data sources available, like those from W3Techs, that for some questions may be a good complement or alternative to drupal.org stats.

For contributed projects, there don't tend to be other sources. But Drupal site builders often want to know which are the most-used projects for a certain area of functionality, and there, since the interest is in relative rank, whether the data do or don't include non-production sites is mostly irrelevant.

Refining the data

Is there anything we can do to exclude non-production install types from the core usage data and so more closely model numbers of production websites?

There is one easy adjustment we can make: filter out reported installs running development releases for future versions. These releases are, by definition, not suitable for production sites. For example, since at the time of writing the latest stable release branch is 9.2.x, this would mean excluding data for the 9.3.x and 9.4.x development releases.

As of the October 31 data for Drupal 9, that adjustment brings us down from a gross figure of 155,449 reported installs to 130,772, a 16% reduction.
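As a sketch, the adjustment is just a filter on the per-branch totals. The individual branch figures below are placeholders chosen for illustration; the real numbers cited above were 155,449 gross and 130,772 adjusted.

```python
def drop_future_branches(branch_totals, latest_stable):
    """Keep only branches at or below the latest stable minor branch."""
    def key(branch):
        major, minor, _ = branch.split(".")  # "9.2.x" -> (9, 2)
        return (int(major), int(minor))

    cutoff = key(latest_stable)
    return {b: n for b, n in branch_totals.items() if key(b) <= cutoff}

# Placeholder per-branch counts, not the actual Drupal.org breakdown.
totals = {"9.0.x": 10000, "9.1.x": 40000, "9.2.x": 80000,
          "9.3.x": 20000, "9.4.x": 5000}
adjusted = drop_future_branches(totals, "9.2.x")
reduction = 1 - sum(adjusted.values()) / sum(totals.values())
print(f"{reduction:.0%} reduction")  # 16% with these placeholder figures
```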

But that's probably as far as we can go without resorting to conjecture.

Could we reliably filter out installs done for the purpose of developing a module or theme? There are some relevant data sets, like the drupal.org commit log, which lists all commits pushed to drupal.org core and contrib repositories and shows up to hundreds of commits per day. But even with easy access to those data, extrapolating from them to an estimate of the number of development installs used in a given week would be mostly guesswork.

Similarly, it would be useful to be able to filter out duplicate environments of a given site. Every Drupal 8 or 9 site has a unique setting - system.site.uuid - that's shared across all environments. If it were possible to filter Drupal usage stats by unique site ID, doing so would remove the duplicate environments from the data. But the site's UUID doesn't figure in the key sent to Drupal.org to identify the particular install. Without that identifier, it's hard to see how we could distinguish duplicate environments--except, again, by making some fairly arbitrary assumptions. For example, as well as guessing how many additional environments exist per site, we'd also have to assume how likely it is that each environment is used or visited in a given week--which gets us solidly into the realm of speculation.
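If the UUID were part of what installs report, the deduplication itself would be trivial. A purely hypothetical sketch, since a site_uuid field is exactly what the reports don't include:

```python
def dedupe_by_site(install_records):
    """Collapse multiple environments of one site into a single entry.

    Hypothetical: assumes each record carried the site's
    system.site.uuid, which real reports do not.
    """
    sites = {}
    for record in install_records:
        # Keep one record per site UUID; which environment "wins"
        # is arbitrary here.
        sites.setdefault(record["site_uuid"], record)
    return list(sites.values())
```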

Conversely, is there anything we can do to include data for sites that didn't "call home" in a given week? Because data are tracked per install, drupal.org servers probably do have the needed data. For example, that could look like:

  • For sites that (a) haven't sent data in a given week, (b) have reported in non-consecutive weeks at least once in the past year, and (c) did report in one of the prior three weeks, use the most recent data for that site.

That approach would include a few sites past the date they go offline and could report non-current data for some sites, but overall could be useful for filling in missing data.
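A minimal sketch of that carry-forward rule, assuming reports keyed by (site, week index). Both the data shape and the reading of criterion (b) as "has skipped at least one week and come back within the past year" are assumptions.

```python
def report_for_week(reports, site_key, week, year=52):
    """Return this week's report for a site, or a carried-forward one.

    `reports` maps (site_key, week_index) tuples to report data.
    """
    if (site_key, week) in reports:
        return reports[(site_key, week)]  # the site did report this week

    past = sorted(w for (s, w) in reports
                  if s == site_key and week - year <= w < week)
    # (b): reported in non-consecutive weeks at least once this year.
    has_gap = any(b - a > 1 for a, b in zip(past, past[1:]))
    # (c): reported in one of the prior three weeks.
    recent = [w for w in past if w >= week - 3]
    if has_gap and recent:
        return reports[(site_key, max(recent))]
    return None  # leave the site out of this week's tally
```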

Summing up

  • Drupal usage data reflect the number of reporting installs, not production sites.
  • The easiest adjustment to make is removing stats for development releases for future versions, though this still leaves various types of non-production installs, such as additional site environments.
  • Some sites are missing from the data because they don't have the Update module installed or haven't "called home" in a given week.
  • The ratio of production sites to reported installs is likely lower for Drupal 8+ than for Drupal 7, meaning the falling graph of total core usage would likely be steeper if it captured production sites rather than all installs.

All of the above doesn't in any way mean the usage figures on drupal.org are without value. Just that they're like most statistics--best used with an eye to limitations and context.