Saturday, September 4, 2010

Usage share of newsreaders

I have noted before, by a nonscientific and utterly biased survey, that Thunderbird appeared to account for a significant share of the newsreader market (testing bug 16913 was what caused me to discover this fact). But finding any attempt to measure the usage share of newsreaders via Google has been rather frustrating. You can easily find market shares of web browsers, desktop operating systems, server operating systems (though the numbers vary wildly?), and mobile platforms. But not things like email client shares or newsreader shares.

Okay, I am not about to find market shares of email clients. I have no access to anything close to a representative sample that could work. But collecting newsreader market shares should not be that hard. After all, pretty much anyone can pick up a large, representative sample of news postings: just connect to an NNTP server of your choice. So, seeing as how it's a three-day weekend, I thought I might as well collect the data myself. The other reason for collecting this data was to demonstrate that a significant number of Thunderbird users use NNTP, so removing NNTP support would adversely affect the userbase.

Methodology

First off, I have to define what I mean by "usage share." Unlike other mediums, a relatively small number of users accounts for a relatively large share of NNTP postings. I've decided to measure it by the number of posts generated by each NNTP client, since that is easier to calculate, and I think it is more informative than measuring by individual users.

I also have to pick the subset to log. For this set of data, I collected every single news article in the Big-8 newsgroups on my school's NNTP server (news.gatech.edu), which has a retention time of a month (30 days is the exact number, I think). I did not even attempt to filter out spam messages, and I did not account for cross-posting (my script managed to crash due to races a few times, so the totals that would have accounted for cross-posting were never reported).

Essentially, this is what I did. I ran a Python script which collected every group in the Big 8 (determined by LIST ACTIVE wildmats) on the server. Then, it entered every group, performed an XOVER to find all messages, and then XHDR'd User-Agent, X-Mailer, and X-Newsreader to figure out what the user agent was. For every group, the script output a total for each full UA string into a ~2MB csv file.
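The tallying half of that script can be sketched roughly as follows. This is not the original script: the NNTP fetching (LIST ACTIVE, XOVER, XHDR) is left out, each article is assumed to arrive as a plain dict of header values, and the names pick_user_agent, tally_group, and write_csv are hypothetical. The order in which the three headers are tried is also an assumption.

```python
import csv
import io
from collections import Counter

# Headers named in the post; trying them in this order is an assumption.
UA_HEADERS = ("User-Agent", "X-Mailer", "X-Newsreader")

def pick_user_agent(headers):
    """Return the first usable UA-like header value, or None if there is none."""
    for name in UA_HEADERS:
        value = headers.get(name, "").strip()
        # Many servers answer XHDR with "(none)" when the header is absent.
        if value and value != "(none)":
            return value
    return None

def tally_group(articles):
    """Count articles per full UA string for one newsgroup."""
    counts = Counter()
    for headers in articles:
        ua = pick_user_agent(headers)
        if ua is not None:
            counts[ua] += 1
    return counts

def write_csv(per_group, out):
    """Write one row per (group, UA string) pair with its total."""
    writer = csv.writer(out)
    writer.writerow(["group", "user_agent", "count"])
    for group in sorted(per_group):
        for ua, n in per_group[group].most_common():
            writer.writerow([group, ua, n])

# Made-up sample data standing in for what XHDR would return.
sample = {
    "comp.lang.python": [
        {"User-Agent": "ExampleReader/1.0"},
        {"X-Newsreader": "OtherReader 2.3"},
        {"User-Agent": "ExampleReader/1.0"},
        {},  # no UA header at all: skipped
    ],
}
per_group = {group: tally_group(arts) for group, arts in sample.items()}
buf = io.StringIO()
write_csv(per_group, buf)
print(buf.getvalue())
```

Keeping the full UA string at this stage, and only splitting it into program and version later, means the raw csv can be reprocessed if the parsing rules change.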

I collected all of the csv data into OpenOffice.org Calc and then ran a macro which attempted to extract the program name and the version from the UA strings. Unsurprisingly, I had to do some hacks to get it to recognize SeaMonkey and Thunderbird correctly (Mnenhy was not helping). I output tables that broke readers down by version and by total program counts.
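The actual processing was done by a Calc macro, which is not shown here. As a rough Python sketch of the same idea, assuming that Mozilla-family strings carry a "Thunderbird/x.y" or "SeaMonkey/x.y" token and that other readers put a version number right after their name, something like this would do a first pass. The function parse_ua, the regexes, the alias table, and the sample strings are all illustrative assumptions.

```python
import re
from collections import Counter

# Product tokens looked for first, because Mozilla-family UA strings
# contain several name/version tokens (Mozilla, Gecko, ...).
PRIORITY = ("SeaMonkey", "Thunderbird")

# Generic fallback: shortest leading name followed by a version number.
GENERIC = re.compile(r"^\s*(.+?)[/ ]+v?(\d[\w.]*)")

# Assumed alias: "G2" is taken here to be the Google Groups agent string.
ALIASES = {"G2": "Google Groups"}

def parse_ua(ua):
    """Split a UA string into (program, version); version may be ''."""
    for product in PRIORITY:
        m = re.search(r"\b" + product + r"/(\d[\w.]*)", ua)
        if m:
            return product, m.group(1)
    m = GENERIC.match(ua)
    if m:
        name = m.group(1).strip()
        return ALIASES.get(name, name), m.group(2)
    return ua.strip(), ""

# Illustrative strings, not rows from the collected data.
examples = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) "
    "Gecko/20100802 Thunderbird/3.1.2",
    "Forte Agent 4.2/32.1118",
    "G2/1.0",
    "MyBB",
]
by_program = Counter(parse_ua(ua)[0] for ua in examples)
for ua in examples:
    print(parse_ua(ua))
```

A sketch like this does not resolve the hard cases mentioned above, such as strings that cannot be attributed to either Thunderbird or SeaMonkey; those would still need hand-written rules.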

Results

It turns out that there is an incredibly long tail of newsreaders. There are about 250 different UA strings I found. Excluding one particularly prevalent bot and those postings for which I could not find a UA string, I found around 430,000 messages (there may be some other things dropped by copy-paste errors). Of these, the top 5 newsreaders account for just 79% of the total count. By contrast, the top 5 web browsers account for very nearly 100% of the total. Some interesting newsreaders I found in the long tail:

  • Mozilla 4.8 [en] (Windows NT 5.0; U)
  • Mozilla 3.04 (WinNT; U)
  • trn 4.0-test70 (17 January 1999)
  • Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/0.8.12
  • MyBB

Finally, here is the table of the top newsreaders:

Newsreader                       Total    Percent
Google Groups                    189,536  43.94%
Thunderbird                      52,258   12.11%
Forte Agent                      47,100   10.92%
Microsoft Outlook Express        41,042   9.51%
Microsoft Windows Live Mail      11,196   2.60%
MT-NewsWatcher                   10,872   2.52%
Other                            79,359   18.40%
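As a quick check of the arithmetic, the percentages and the top-5 figure can be recomputed from the totals in the table. This is only a sketch; the variable names are arbitrary, and the numbers are copied from the table above.

```python
# Post totals per newsreader, copied from the table.
totals = {
    "Google Groups": 189536,
    "Thunderbird": 52258,
    "Forte Agent": 47100,
    "Microsoft Outlook Express": 41042,
    "Microsoft Windows Live Mail": 11196,
    "MT-NewsWatcher": 10872,
    "Other": 79359,
}

grand_total = sum(totals.values())

# Share of each newsreader as a percentage of all counted posts.
shares = {name: 100 * n / grand_total for name, n in totals.items()}

# Combined share of the five largest named newsreaders ("Other" excluded).
named = sorted((n for name, n in totals.items() if name != "Other"), reverse=True)
top5_share = 100 * sum(named[:5]) / grand_total

print(grand_total)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
print(f"top 5: {top5_share:.1f}%")
```

The grand total comes out at 431,363, which matches the "around 430,000 messages" quoted earlier.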

That Google Groups has the highest market share is not surprising, but I was surprised by the strong showing of Forte Agent and the poor showing of traditional newsreaders (e.g., tin, rn-based newsreaders). I guess this goes to show you that Windows has a surprisingly large market share in the Big 8 newsgroups. For SeaMonkey enthusiasts, your newsreader has a mere 4,187 postings (with another ~5K provided by other Mozilla distributions, some of which cannot be identified... Mnenhy made processing UA strings difficult).

In terms of individual versions, one of Outlook Express's versions clocks in at #1 with 31,617 total posts, with Thunderbird 3.1.2 trailing at a "mere" 23,661. Thunderbird has around 14,000 on the 2.x branch, 11,000 on the 3.0.x branch, and 25,000 on the 3.1.x branch. There is apparently some spoofing going on for SeaMonkey users as well (I found a dozen or so Firefox entries, which I presume are SeaMonkey-spoofed UA strings).

Another datum incidentally collected was the number of postings in each hierarchy. Here they are:

Hierarchy       Count     Largest Newsgroup
comp.*          64,360    comp.soft-sys.matlab (8,399)
humanities.*    2,460     humanities.lit.authors.shakespeare (1,455)
misc.*          28,518    misc.test (8,796)
news.*          31,635    news.list.filters (26,238)
rec.*           217,548   rec.games.pinball (14,707)
sci.*           47,948    sci.electronics.design (6,076)
soc.*           87,192    soc.retirement (6,053)
talk.*          12,513    talk.origins (6,498)

Remind me again why we have the humanities hierarchy? Almost 60% of its messages come from a single newsgroup, and it has just 8 newsgroups.
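The "almost 60%" figure is easy to reproduce from the table, along with the same ratio for the other hierarchies. A small sketch, with the counts copied from the table and arbitrary variable names:

```python
# (total posts in hierarchy, posts in its largest newsgroup), from the table.
hierarchies = {
    "comp.*": (64360, 8399),
    "humanities.*": (2460, 1455),
    "misc.*": (28518, 8796),
    "news.*": (31635, 26238),
    "rec.*": (217548, 14707),
    "sci.*": (47948, 6076),
    "soc.*": (87192, 6053),
    "talk.*": (12513, 6498),
}

# Percentage of each hierarchy's posts that sit in its single largest group.
largest_share = {
    name: 100 * largest / total for name, (total, largest) in hierarchies.items()
}

for name, pct in sorted(largest_share.items(), key=lambda item: -item[1]):
    print(f"{name:<14}{pct:5.1f}%")

most_concentrated = max(largest_share, key=largest_share.get)
```

By this measure humanities.* is not the most concentrated hierarchy: news.list.filters accounts for an even larger fraction of news.*.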

Future Work

What could be done in the future is to expand this research into binary newsgroups. However, merely counting posts becomes a less appropriate metric there, because binary newsgroups use a lot of multipart messages: just because someone uploads a ginormous binary split into 50 parts does not mean it should be counted 50 times. I also don't have access to any binary newsservers.
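One possible way to avoid counting every part is to collapse the parts of a multipart upload into a single entry before tallying. The sketch below assumes the parts are marked with a trailing "(n/m)" in the Subject line and come from the same sender; real binary postings may use other conventions, and upload_key and count_uploads are hypothetical names.

```python
import re

# Assumed convention: a multipart posting ends its subject with "(n/m)".
PART_SUFFIX = re.compile(r"\s*\(\d+/\d+\)\s*$")

def upload_key(sender, subject):
    """Key that is identical for all parts of one multipart upload."""
    return sender, PART_SUFFIX.sub("", subject)

def count_uploads(posts):
    """Count distinct uploads in a list of (sender, subject) pairs."""
    return len({upload_key(sender, subject) for sender, subject in posts})

# Made-up sample: one three-part upload plus one ordinary post.
posts = [
    ("a@example.invalid", "big.bin (1/3)"),
    ("a@example.invalid", "big.bin (2/3)"),
    ("a@example.invalid", "big.bin (3/3)"),
    ("b@example.invalid", "hello"),
]
print(count_uploads(posts))
```

The trade-off is that two genuinely separate uploads with the same sender and subject would be merged, so this is a lower bound rather than an exact count.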

Another possible improvement is to discount spam. As a brief test, I looked only at those newsgroups whose names contain `moderated'; this resulted in a paltry sample of 3,272 messages. The statistics do not appear to change much, but these newsgroups are likely not a representative sample of the Big 8 anyway.

Finally, this needs to be broadened and run repeatedly so that it collects snapshots of the data over time. The method is poor at capturing historical data, since it only sees what the server still retains, but it could be an excellent way to get data every few months going forward, so long as someone keeps collecting it.

12 comments:

Ludovic Hirlimann said...

So someone is already doing this on a monthly basis, counting all messages in the fr.* hierarchy. Results are posted monthly to fr.usenet.stats. Stats from 2010-05-01 to 2010-07-01:

20.25% : Microsoft Outlook Express
14.65% : Mozilla
13.95% : MesNews
13.49% : G2
9.31% : Thunderbird
5.02% : MacSOUP
4.12% : Microsoft Windows Mail
2.33% : Microsoft Windows Live Mail
2.32% : Forte Agent
1.81% : Gnus
1.72% : Pan
1.56% : bleachbot
1.28% : XPN
1.24% : Xnews
1.14% : slrn
5.72% : the rest.


More stats are available on a per newsgroup basis at : http://www.crampe.eu.org/statfr/

And yes, all these stats are, unfortunately for most of you, in French.

Joshua Cranmer said...

The stats collected for about the past 90 days on a free fr.* server I found:
22.5% Thunderbird
20.8% Outlook Express
12.4% MesNews
12.4% Google Groups
3.93% Windows Mail
2.73% Forte Agent
[ with a long tail ]

Anonymous said...

Why did you skip the alt.* hierarchy?

Joshua Cranmer said...

1. I don't have the patience to wait for my script to log all of alt.*
2. The server I used didn't have alt.* at all.
3. The Big-8 was the subset I wanted to log, not "the entirety of plain-text Usenet."

jmdesp said...

The truth is Usenet is really dying a slow death. I'm also a fr.* user, and I know there's another stat posted there that shows the number of unique posters is already very low compared to what it was 6 years ago, and a few tens of contributors are lost each month.

I see a phenomenon that accelerates the fall: as they lose vitality, some groups get invaded by trolls with an obsessive agenda, who are extremely persistent and manage to drive out all other contributors. Then, after finding themselves alone, they also end up leaving after some time, and the group is definitively dead.

But, for example, the proxad.* groups on news.free.fr, the private server of the ISP Free, are fairly well protected from this (fewer outside trolls, more Free users actually looking for and getting information), and so are becoming more active than most of fr.*

Similarly some groups on news.gmane.org are also very active (boost groups for example)

So maybe it's important to try to include the world of private servers in those stats, they seem to be comparatively preserved from the decline of Usenet, so probably will end up representing more and more of newsgroups usage.

If NNTP support actually goes out some day, could it be maintained through an add-on implementing it as a specific account type ?

bwinton said...

So, you set out to test the hypothesis that a significant number of Thunderbird users are using NNTP, but ended up proving that a significant number of NNTP users are using Thunderbird. ;)

And, just as an interesting data point, over the last 30 days, 52,258 posts were from Thunderbird (counting spam, and not counting users, just posts). On the other hand, just yesterday Thunderbird users created at least 56,497 new accounts. (It's a little lower than normal, perhaps because it was the Sunday of a long weekend. And that only counts people who we successfully found configs for. We don't know how many of the 72,951 other requests ended up with valid auto-detected configs.)

So, is NNTP used by a significant number of Thunderbird users? I don't think we've proven that. (But, for the record, I don't want NNTP support to be removed!)

Later,
Blake.

Dan Mosedale said...

Interesting indeed; thanks for doing this work! Another interesting future direction would be to find numbers of posts for a variety of different group conversational media (blog posts, blog comments, forum posts, social media, etc.). The trick, of course, would be to figure out how to find numbers that make suitably apples-to-apples comparisons.

Pedahzur said...

Just wondering: does this take into account the news groups that have mail-to-NNTP gateways set up? In other words, are you testing the usage of NNTP in Thunderbird, or just looking at how often "Thunderbird" would show up in the agent header, which might include the mail functionality of Thunderbird.

Joshua Cranmer said...

jmdesp: I only included the Big 8 newsgroups in the stats mostly because it's what I can access the most easily. Collecting private server stats is certainly worth doing, but it is a bit more than I wanted for a weekend project.

Also, part of my rationale for writing this post was to provide reasons for not removing NNTP support from TB. In any case, such a decision will not be made until it is feasible to implement NNTP in an extension (or so I'm told).

bwinton: Well, I realized after doing the runs that I chose probably the lowest traffic month to log the data (Europe goes on holiday for much of August), and I'm choosing a rather small subset of Usenet, so the values I produce are more of a loose lower bound.

For a graphical comparison, in this treemap of Usenet (a bit dated), I've essentially logged only the leftmost column after alt, so I'm only looking at a tiny fraction of what really exists.

Dan: One thing discussed over IRC is that this script should be run on a monthly basis so we could get more datapoints. Certainly, trying to measure more modern social media would be useful, but they are nowhere near as centralized as Usenet (that sounds like an oxymoron...). I have in mind a very rich set of data to collect for NNTP. :-)

Pedahzur: I've only collected data for the Big 8 newsgroups, which, to my knowledge, are not distributed as mailing lists anywhere. However, there are significant cases of Web Interfaces to Usenet (another I didn't mention was MATLAB's support forum user agent). If I do start logging private newsservers, I will have to find ways to identify mailing lists versus newsgroup postings.

Pedahzur said...

Hmm...you might want to check things. Unless I'm misunderstanding (which is VERY possible), you collected data from, for example, comp.*.

Per http://www.python.org/community/lists/ comp.lang.python is also accessible as a mailing list. So, a UA of Thunderbird in comp.lang.python could be an NNTP post or a mailing list post.

comp.lang.perl.moderated is also available as a mailing list: http://lists.perl.org/list/comp.lang.perl.moderated.html Although it's less clear if you can post via e-mail.

Joshua Cranmer said...

Pedahzur: There are fewer TB messages in comp.lang.python than the set of messages that are labeled as "Mnenhy" (and which I cannot distribute between TB and SM reliably). All in all, the effects of mailing list availability are probably smaller than the inherent noise in the dataset.

Mardus said...

At least in SeaMonkey 1.x (and maybe even Mozilla Application Suite, if anyone still has to use it), Mnenhy is the only way to disable and remove extensions.

SeaMonkey 2.x already has its own Add-ons manager styled after Mozilla Firefox.