<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for RhysJones</title>
    <link>http://www.advogato.org/person/RhysJones/</link>
    <description>Advogato blog for RhysJones</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Fri, 10 Feb 2012 18:18:37 GMT</pubDate>
    <item>
      <pubDate>Sat, 24 Aug 2002 16:29:54 GMT</pubDate>
      <title>24 Aug 2002</title>
      <link>http://www.advogato.org/person/RhysJones/diary.html?start=4</link>
      <guid>http://www.advogato.org/person/RhysJones/diary.html?start=4</guid>
      <description>&lt;B&gt;He who obeys, obeys Bayes?&lt;/b&gt;&lt;p&gt;

Yup, another set of possibly ill-conceived thoughts on spam filtering. There was a good round-up of what's currently going on in &lt;A HREF="http://www.ntk.net/2002/08/23/"&gt;yesterday's NTK&lt;/a&gt;, incidentally. (Search for 'Tracking' on that page.)&lt;p&gt;

What the current crop of Bayesian spam filters are trying to do is detect whether an email is written in 'normal English' or 'spam English'. That language identification problem is exactly what &lt;A HREF="http://epsilon3.georgetown.edu/~cball/languageid/"&gt;this text-based language identifier&lt;/a&gt; is trying to solve.&lt;p&gt;

So, can anyone tell me why an &lt;I&gt;n&lt;/i&gt;-gram (where 2 &amp;lt;= &lt;I&gt;n&lt;/i&gt; &amp;lt;= about 4-5) based system, working on the word level, hasn't been tried on this spam/non-spam classification problem? After all, &lt;I&gt;n&lt;/i&gt;-gram statistics are used by virtually every single speech recognition software package out there. That's because &lt;I&gt;n&lt;/i&gt;-grams are what let the speech recogniser determine which word has a reasonable chance of following the word it thinks you just said, and this information is used to favour transcriptions that form a 'valid' sentence.&lt;p&gt;

There are problems with this, of course: recognising 'spam English' as opposed to 'proper English' is much more difficult than recognising English as opposed to, say, French. But there are subtleties of vocabulary ('free' can often be followed by 'porn', for instance), that may make this a workable method.&lt;p&gt;

Well, those are my thoughts anyway. Someone's probably tried this already and will tell me why it doesn't work as well as other methods, but even in that case I really would be interested to know why.&lt;p&gt;

[edited for formatting]</description>
    </item>
    <item>
      <pubDate>Fri, 23 Aug 2002 09:07:05 GMT</pubDate>
      <title>23 Aug 2002</title>
      <link>http://www.advogato.org/person/RhysJones/diary.html?start=3</link>
      <guid>http://www.advogato.org/person/RhysJones/diary.html?start=3</guid>
      <description>For my own benefit:

&lt;p&gt; &lt;UL&gt;
&lt;LI&gt;&lt;A HREF="http://ww.telent.net/#forwarding_pointers"&gt;Links to Lisp sites&lt;/a&gt;
&lt;LI&gt;&lt;A HREF="http://www.csn.ul.ie/~caolan/pub/Portaloo/"&gt;
The current 'official' Portaloo&lt;/a&gt;
&lt;/ul&gt;</description>
    </item>
    <item>
      <pubDate>Fri, 23 Aug 2002 08:58:36 GMT</pubDate>
      <title>23 Aug 2002</title>
      <link>http://www.advogato.org/person/RhysJones/diary.html?start=2</link>
      <guid>http://www.advogato.org/person/RhysJones/diary.html?start=2</guid>
      <description>&lt;A HREF="http://www.advogato.org/person/thomasvs/"&gt;thomasvs&lt;/a&gt; wrote:

&lt;p&gt; &lt;tt&gt;...when I look at scripts I wrote five years ago, or code for my thesis, I'm ashamed ;) So the big question is : &lt;b&gt;What will I think five years from now about stuff I did today ?&lt;/b&gt;&lt;/tt&gt;

&lt;p&gt; I could've written that, but, being me, I didn't. It's an incredibly sobering thought though. The mantra of 'Do stuff. Repeat.', and the goal of continuous improvement, is still as pertinent now as it ever was. Maybe programmers, localisers and other free software advocates really are made, and not born.

</description>
    </item>
    <item>
      <pubDate>Thu, 22 Aug 2002 11:04:57 GMT</pubDate>
      <title>22 Aug 2002</title>
      <link>http://www.advogato.org/person/RhysJones/diary.html?start=1</link>
      <guid>http://www.advogato.org/person/RhysJones/diary.html?start=1</guid>
      <description>&lt;A HREF="http://www.cstr.ed.ac.uk/projects/festival"&gt;Festival&lt;/a&gt; is great. Not only it is GPLed ('course), but it's also truly multi-lingual, thanks to a Welsh speech researcher (Briony Williams).



Unfortunately, the letter-to-sound rules for Welsh are now written in Lisp.



Goal, then, is to attempt to gain a passing acquaintance with Lisp. Never thought I'd see myself typing that. Mind, never thought I'd have to use Fortran either; that's another story.</description>
    </item>
    <item>
      <pubDate>Thu, 15 Aug 2002 19:52:09 GMT</pubDate>
      <title>15 Aug 2002</title>
      <link>http://www.advogato.org/person/RhysJones/diary.html?start=0</link>
      <guid>http://www.advogato.org/person/RhysJones/diary.html?start=0</guid>
      <description>Oops. Created an account as &lt;A HREF="http://www.advogato.org/person/Rhys"&gt;Rhys&lt;/a&gt; a while back, but completely forgot the password. Decided that a new account was the better option (sorry to those who've already certified me once...)

&lt;p&gt; This week, I have been mostly looking for Welsh-speaking open source advocates. In my view, and that of a
&lt;A HREF="http://www.advogato.org/person/Telsa"&gt;few&lt;/a&gt; &lt;A HREF="http://www.tgb.org.uk"&gt;others&lt;/a&gt;, GNU/Linux Must Be Translated. &lt;A HREF="mailto:rhys@sucs.org"&gt;Mail me&lt;/a&gt; if you fit that bill. &lt;I&gt;Diolch yn dalpau.&lt;/i&gt;</description>
    </item>
  </channel>
</rss>

