8 Jun 2011 joey   » (Master)

date formats of a decade of usenet

I've finished importing the usenet archive for oldusenet. The fun part was parsing the dates to put the posts in order.

No date format was really required on usenet, so a wide variety of formats were used. Some posts didn't have a Date, but a guess could be made from their Message-ID. Some posts had absurd dates (e.g., 1969, 1995); others had dates that were correct in every way, except that the year was left out (oops). One early post had a date of "_".
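To make the Message-ID fallback concrete, here is a minimal hand-rolled sketch. It assumes IDs shaped like "<1989Jul6.214048.28313@jarvis.csri.toronto.edu>", where the part before the second dot is a date and time. The names `messageIdTime` and `guessTime` are made up for this illustration; my actual code uses a format string for this, not this parser.

```haskell
import Data.Char (isAlpha, isDigit, toLower)
import Data.List (elemIndex)
import Data.Time (LocalTime (..), fromGregorianValid, makeTimeOfDayValid)
import Text.ParserCombinators.ReadP

-- Three-letter English month abbreviations, lower-cased for lookup.
monthNames :: [String]
monthNames =
    [ "jan", "feb", "mar", "apr", "may", "jun"
    , "jul", "aug", "sep", "oct", "nov", "dec" ]

-- Hypothetical parser for the leading "<YYYYMonD.HHMMSS." of a Message-ID.
-- Whatever follows the second dot is left unconsumed.
messageIdTime :: ReadP LocalTime
messageIdTime = do
    _     <- char '<'
    year  <- count 4 (satisfy isDigit)
    month <- count 3 (satisfy isAlpha)
    day   <- munch1 isDigit          -- the day may be one or two digits
    _     <- char '.'
    hh    <- count 2 (satisfy isDigit)
    mm    <- count 2 (satisfy isDigit)
    ss    <- count 2 (satisfy isDigit)
    _     <- char '.'
    m     <- maybe pfail (pure . (+ 1)) (elemIndex (map toLower month) monthNames)
    d     <- maybe pfail pure (fromGregorianValid (read year) m (read day))
    tod   <- maybe pfail pure
                 (makeTimeOfDayValid (read hh) (read mm) (fromInteger (read ss)))
    pure (LocalTime d tod)

-- Nothing when the Message-ID does not start with a plausible date.
-- The result carries no timezone; that still has to be guessed.
guessTime :: String -> Maybe LocalTime
guessTime s = case readP_to_S messageIdTime s of
    ((t, _) : _) -> Just t
    []           -> Nothing
```

In GHCi, `guessTime "<1989Jul6.214048.28313@jarvis.csri.toronto.edu>"` should give `Just 1989-07-06 21:40:48`, and an ID without an embedded date gives `Nothing`.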

Still, this excerpt of my code managed to parse the rest and so gives a fairly complete picture of how messy dates can possibly be. Read and weep.

      p anyzone "%d %b %y %T"         "15 Jun 88 02:27:41 GMT"
    , p anyzone "%a, %d %b %y %T"     "Thu, 22 Jun 89 20:02:03 GMT"
    , p anyzone "%a, %d-%b-%y %T"     "Thu, 15-Jun-89 18:01:56 EDT"
    , p anyzone "%d %b %y %T"         "8 Jan 90 14:07:27 -0400"
    , p anyzone "%d %b %y %H:%M"      "4 Oct 89 19:56 GMT"
    , p anyzone "%a, %d %b %y %H:%M"  "Thu, 23 May 91 02:13 PDT"
    , p anyzone "%a, %d %b %Y %T"     "Thu, 23 May 1991 07:07:00 -0400"
    , p anyzone "%a, %d %b %Y %H:%M"  "Sat, 18 May 1991 17:28 CDT"
    , p anyzone "%d %b %Y %T"         "11 Apr 1991 12:02:01 GMT"
    , p anyzone "%d-%b-%y %H:%M"      "24-Mar-90 14:22 CST"
    , p anyzone "%d %b %y, %T"        "22 May 91, 16:31:37 EST"
    , p anyzone "%d %b %Y %H:%M"      "30 June 1991 17:15 -0400"
    , p anyzone "%a, %d %b T  %T"     "Fri, 8 Feb T  09:49:39 EST"

    -- special cases
    , p (tzconst est) "%a %b %d %T EST %Y"  "Tue Jan 11 12:44:36 EST 1983"
    , p (tzconst est) "%a %b %d %T EST %y"  "Tue Jan 11 12:44:36 EST 83"
    , p (tzconst edt) "%a %b %d %T EDT %Y"  "Tue Jan 11 12:44:36 EDT 1983"
    , p (tzconst edt) "%a %b %d %T EDT %y"  "Tue Jan 11 12:44:36 EDT 83"
    , p (tzconst utc) "%a %b %d %T GMT %Y"  "Thu Nov  1 23:14:37 GMT 1990"
    , p (tzconst pdt) "%d %b %y %T -7"      "11 Jun 91 15:41:21 -7"

    -- dates with no timezone specified are guessed
    , p nozone "%d %b %y %T"         "9 Jan 90 09:33:59"
    , p nozone "%d %b %Y %T"         "10 APR 1990 05:25:28"
    , p nozone "%a %b %d %T %Y"      "Fri Feb  6 00:19:47 1981"
    , p nozone "%a %b %d %T %y"      "Fri Feb  6 00:19:47 81"
    , p nozone "%Y-%m-%d %T"         "1981-11-12 18:31:01"
    , p nozone "%y-%m-%d %T"         "81-11-12 18:31:01"
    , p nozone "%a, %d %b %y %T"     "Sat, 13 Apr 91 08:37:57"
    , p nozone "%a, %d %b %Y %T"     "Sun, 16 Jun 1991 13:23:02"
    , p nozone "%d %b, %Y %T"        "1 May, 1991 00:00:00"
    , p nozone "%d %b %y %H:%M"      "8 Jan 88 18:03"
    , p nozone "%a, %d %b %y %H:%M"  "Wed, 29 May 91 17:14"
    , p nozone "1 %b %d %T %Y"       "1 Jan 08 20:59:08 1991"

    -- this has to come near the end, as it matches greedily
    , g nozone "%a %b %d %T %Y ("    "Wed Oct 27 17:02:46 1982 (Tuesday)"
    , g nozone "%a, %d %b %y %T +"   "Tue, 21 May 91 16:46:01 +22323328"

    -- extract date from message-id headers
    -- (used for messages with no Date field)
    , g nozone "<%Y%b%d.%H%M%S."     "<1989Jul6.214048.28313@jarvis.csri.toronto.edu>"
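The excerpt doesn't show how `p` and `g` are defined, so here is a rough, self-contained sketch of the general idea only: try each format in order, and keep the formats that match just a prefix for last. It uses `parseTimeM` and `readSTime` from the `time` package; `Match`, `formats`, `tryFormat` and `parseDate` are names invented for this sketch, and timezones are ignored entirely (everything is read as UTC).

```haskell
import Data.Maybe (listToMaybe, mapMaybe)
import Data.Time (UTCTime)
import Data.Time.Format (defaultTimeLocale, parseTimeM, readSTime)

-- How much of the input a format has to account for.
data Match = Whole | Prefix

-- Hypothetical, much shortened table. Order matters: the Prefix
-- entries go last so they only see input the Whole entries rejected.
formats :: [(Match, String)]
formats =
    [ (Whole,  "%d %b %y %T")
    , (Whole,  "%d %b %Y %T")
    , (Whole,  "%Y-%m-%d %T")
    , (Prefix, "%a %b %d %T %Y (")
    , (Prefix, "%a, %d %b %y %T +")
    ]

-- Try one format against the input.
tryFormat :: String -> (Match, String) -> Maybe UTCTime
tryFormat input (Whole, fmt) =
    -- the whole string must match (surrounding whitespace is allowed)
    parseTimeM True defaultTimeLocale fmt input
tryFormat input (Prefix, fmt) =
    -- only the start must match; readSTime hands back the unparsed
    -- remainder, which is simply dropped here
    fst <$> listToMaybe (readSTime True defaultTimeLocale fmt input)

-- The first format that parses wins.
parseDate :: String -> Maybe UTCTime
parseDate input = listToMaybe (mapMaybe (tryFormat input) formats)
```

With this table, `parseDate "1981-11-12 18:31:01"` is handled by the third entry, while "Tue, 21 May 91 16:46:01 +22323328" falls through to the last one, which ignores the junk after the "+".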

(Parsing the often ambiguous, malformed, etc. timezones was fun all its own too, of course.)
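For a flavour of the timezone side, here is a small sketch that turns a zone field into an offset in minutes east of UTC. It is not my actual code: `knownZones`, `zoneOffset` and `numeric` are hypothetical names, the table assumes the North American reading of each abbreviation, and anything it does not recognize is rejected so the caller can fall back to a guess.

```haskell
import Data.Char (isDigit, toUpper)

-- Hypothetical table: offsets in minutes east of UTC, assuming the
-- North American meaning of each (otherwise ambiguous) abbreviation.
knownZones :: [(String, Int)]
knownZones =
    [ ("GMT", 0), ("UTC", 0)
    , ("EST", -300), ("EDT", -240)
    , ("CST", -360), ("CDT", -300)
    , ("MST", -420), ("MDT", -360)
    , ("PST", -480), ("PDT", -420)
    ]

-- Accepts a name ("EST"), a full offset ("-0400") or bare hours ("-7").
zoneOffset :: String -> Maybe Int
zoneOffset ('+' : ds) = numeric ds
zoneOffset ('-' : ds) = negate <$> numeric ds
zoneOffset name       = lookup (map toUpper name) knownZones

-- Unsigned offset: one or two digits are hours, four digits are HHMM.
-- Anything else (such as "22323328") is treated as malformed.
numeric :: String -> Maybe Int
numeric ds
    | null ds || not (all isDigit ds) = Nothing
    | length ds <= 2 = hours (read ds) 0
    | length ds == 4 = hours (read (take 2 ds)) (read (drop 2 ds))
    | otherwise      = Nothing
  where
    hours h m
        | h <= 14 && m < 60 = Just (h * 60 + m)
        | otherwise         = Nothing
```

Here `zoneOffset "-7"` and `zoneOffset "PDT"` both give `Just (-420)`, while `zoneOffset "+22323328"` gives `Nothing`.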

Syndicated 2011-06-08 21:28:38 from see shy jo
