Estimating article numbers

From Meta, a Wikimedia project coordination wiki

Thursday, June 28, 6:10 PM -- So Wikipedia has almost 10,000 pages. This represents a heck of a lot of work and a heck of a lot of content, and we can all be proud, but...I think there's a bit of a problem. I have a puzzle for you: how many of these pages are articles?

We all know "10,000 pages" does not mean "10,000 encyclopedia articles." There are a lot of redirection pages, Talk pages, member pages and subpages, commentary pages, Wikipedia project pages, and other non-articles. But the new reader doesn't know this, and if some news media source comes along (as one inevitably will--it's only a matter of time now), I think we might be blasted for misrepresenting the extent of our achievement. Not only would that be shameful, it would lose us the participation of potential new contributors who care about how accurately we represent our achievement, though they don't care whether we say we have 10,000 or, instead, a mere 6,000 articles.  :-)

I wouldn't envy anyone the task of counting the actual number of articles. But we could estimate the number.

Anyone care to give it a shot, and report the results?

For purposes of this exercise, I don't think we need to draw a distinction between one-sentence articles to the effect that so-and-so was a famous novelist, and treatise-length pages. Both of those can, for our purposes, be called articles, or perhaps "entries." What can't be called articles are:

  • redirection pages
  • Talk pages
  • member pages and subpages thereof
  • pages describing the Wikipedia project (e.g., the FAQ, news, announcements, etc.)
  • pages that consist only of links to other articles, with virtually no content of their own
  • any other categories?

What I'd like to do with your estimate is to make the assumption that the ratio of present pages to present articles will remain roughly the same for the next 5,000 or so pages. (Or perhaps you can tell me how long the ratio can probably be relied upon.)

Then, we can (honestly) boast "over 5,000 articles" (notice, articles, not pages) or "over 5,000 entries" on the front page. This will make our work seem more substantial, more real, and that's important if we're going to make this a reasonably serious project.

I'll bet the present number is right around 5,000, but I really don't know!

--Larry_Sanger
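
The extrapolation proposed above is plain proportional arithmetic. A minimal Python sketch, assuming purely for illustration the guess of about 5,000 articles among 10,000 pages; the function name and the 15,000-page projection are hypothetical, not measured values:

```python
def estimate_articles(total_pages, sample_pages, sample_articles):
    """Scale an article count, assuming the articles-to-pages ratio stays fixed."""
    ratio = sample_articles / sample_pages
    return round(total_pages * ratio)

# Guess from the comment above: roughly 5,000 articles in 10,000 pages (an assumption).
print(estimate_articles(15_000, 10_000, 5_000))  # 7500
```

The estimate is only as good as the assumption that the ratio holds as the wiki grows.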


A quick check as of a few minutes ago showed 9690 pages in total, of which 1348 redirect to another page, and 335 are /Talk subpages, leaving 8007 actual pages. Excluding the redirects, 4% of all pages are talk pages.
Counting only "comma pages", we have 6789 pages of which 286 are /Talk subpages, leaving 6503 non-Talk pages which contain at least one comma.
So, it seems that just about 4% of all entries excluding redirects, or 4% of all "comma pages" are /Talk subpages, if that's any help in the estimates.


I'll do a rerun tomorrow, which I hope will have some filtering of members' pages and their subpages from the totals. -- Malcolm Farmer
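
The figures quoted above can be re-derived in a few lines. This Python sketch uses only the totals given in the comment; the variable names are illustrative. With those inputs, the comma-page subtraction works out to 6503.

```python
# Totals quoted in the comment above.
total_pages = 9690
redirects = 1348
talk_pages = 335

non_redirects = total_pages - redirects      # pages that are not redirects
non_talk = non_redirects - talk_pages        # neither redirects nor /Talk
talk_share = talk_pages / non_redirects      # /Talk share of non-redirect pages

comma_pages = 6789
comma_talk = 286
comma_non_talk = comma_pages - comma_talk    # comma pages that are not /Talk
comma_talk_share = comma_talk / comma_pages  # /Talk share of comma pages

print(non_talk, f"{talk_share:.1%}")              # 8007 4.0%
print(comma_non_talk, f"{comma_talk_share:.1%}")  # 6503 4.2%
```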

How did you do these calculations? I'm not questioning your methodology; I'm just curious. --KQ

Yes, I'm curious too! --LMS

Well, I did a search for the redirect command; that gives a count of redirects (and probably a few pages discussing the redirect, but they're pages we want to exclude from the count anyway). Counting the /Talk pages is easy enough: first my Perl script requests a search, either for a blank (all pages) or a comma. The number of pages is listed at the top of the returned page, and the /Talk pages are counted by the script just going down the page line by line, counting the matches for "/Talk". I'm no great programmer, so I really appreciate how easy Perl makes it to knock up a small script for analysing stuff. - Malcolm Farmer
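
The line-by-line counting step might look roughly like the following. This is a Python sketch of the idea, not the Perl script itself; the function name and the sample listing are hypothetical, since the thread does not describe the actual layout of the search results.

```python
def count_talk_lines(result_page):
    """Count the lines of a search-result listing that mention a /Talk subpage."""
    return sum(1 for line in result_page.splitlines() if "/Talk" in line)

# Hypothetical result listing, invented for illustration.
sample = """\
3 pages found
Astronomy
Astronomy/Talk
Perl
"""
print(count_talk_lines(sample))  # 1
```

The match is case-sensitive, so a page named "/talk" in lower case would not be counted.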


See Malcolm Farmer/How many Wikipedia pages are there for Malcolm's work!

I'll say we have "over 6,000 articles" on the main page. Any objections? --LMS


A potential method of checking which articles are probably of high quality, and thus how many are "real" articles vs. "stub" entries, would be to check the number of revisions (major and minor, if possible) done to each article. I'd bet that, given a little analysis, you could determine the average number of revisions needed before an article reaches a certain degree of quality (and perhaps length, too).

Of course, this would require a more intensive stats program -- probably best if it ran with access to the Wikipedia files themselves. --Colin dellow
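
As a rough picture of what such an analysis would compute, here is a minimal Python sketch. The rows are made-up placeholders, not Wikipedia data, and the "judged good" flag is assumed to come from a human reader. Whether revision counts actually track quality is a separate question.

```python
from statistics import mean

# Made-up rows: (title, revision count, judged good by a human reader).
articles = [
    ("Example A", 12, True),
    ("Example B", 2, False),
    ("Example C", 8, True),
    ("Example D", 1, False),
]

def mean_revisions(rows, judged_good):
    """Average revision count over the rows with the given quality judgement."""
    return mean(count for _, count, good in rows if good == judged_good)

print(mean_revisions(articles, True))   # 10
print(mean_revisions(articles, False))  # 1.5
```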


The problem with counting revisions is that there seems to be little correlation between the number of revisions and the quality of the article.

  • About 1500 articles have been renamed and replaced with a redirect, the renamed article effectively being new-born.
  • Some articles get posted, and very few changes are made to them: whether this is because they are so good or because no-one else feels competent to revise them is a matter of judgement.
  • Some articles get repeated minor changes in spelling and/or layout: some get revised several times a day, like Biographical Listing.
  • And there's that quirk whereby when an article is created, articles with pre-existing links pointing to it don't show that it now exists: so people just add a space character to the page so that the new link shows up...

I think automated quality assessment via revision history is a non-starter. It needs people in the loop. -- Malcolm Farmer


I bet we have 15,000 articles now, Larry. --KQ


Maybe? Malcolm? --LMS


I have just learned that people intend to add individual pages for people who died in the Sep. 11 attack. I have no objection, but I was wondering whether we should count them as encyclopedia articles. They aren't encyclopedia articles in the traditional sense at least, since except for having been victims many (if not most) of them would never have got pages. Since fatalities are expected to number in the thousands, we could potentially have a significant inflation of the article count. And if we do decide to exclude them from the page count, might it be a good idea to include some set phrase (e.g. "victim of September 11, 2001 Terrorist Attack") so that we can easily exclude them from the page count? -- Simon J Kissane
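
A set phrase would make the exclusion a simple text filter. A minimal Python sketch, assuming page texts are available as a mapping from title to text; the function name and the sample pages are hypothetical, and only the marker phrase comes from the comment above:

```python
# Marker phrase proposed in the comment above.
MARKER = "victim of September 11, 2001 Terrorist Attack"

def count_without_marker(pages, marker=MARKER):
    """Count pages whose text does not contain the marker phrase."""
    return sum(1 for text in pages.values() if marker not in text)

# Made-up page texts, for illustration only.
pages = {
    "Perl": "Perl is a programming language.",
    "Example Person": "Example Person was a victim of September 11, 2001 Terrorist Attack.",
}
print(count_without_marker(pages))  # 1
```

This only works if the phrase is typed identically on every such page.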

I think this is a non-issue. I think the creation of new non-attack-related articles will continue at the same rate as before. New obituary pages will surely appear over a long period, not a few thousand at once. I don't think there's any chance of obituary pages "taking over". And even if they amounted to a few percent of the total number of pages for a while, so what? Wiki is not paper, after all. --Pinkunicorn