<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Advogato blog for salmoni</title>
    <link>http://www.advogato.org/person/salmoni/</link>
    <description>Advogato blog for salmoni</description>
    <language>en-us</language>
    <generator>mod_virgule</generator>
    <pubDate>Sun, 19 May 2013 20:35:00 GMT</pubDate>
    <item>
      <pubDate>Thu, 19 Jan 2012 10:26:09 GMT</pubDate>
      <title>19 Jan 2012</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=593</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=593</guid>
      <description>&lt;p&gt;&lt;a href="http://www.mozilla.org" &gt;Mozilla&lt;/a&gt; are looking for a &lt;a href="http://hire.jobvite.com/CompanyJobs/Careers.aspx?c=qpX9Vfwa&amp;cs=9Kt9Vfw1&amp;page=Job%20Description&amp;j=oxt3VfwI" &gt;Quantitative user researcher&lt;/a&gt; which sounds cool. The emphasis on user research sounds right up my street, particularly the need for mastery of experimental design and statistical analysis. It kind of takes me back to my PhD and work on SalStat (still going strong).&lt;br/&gt;
&lt;p&gt;The problem is my covering letter. Can anyone here tell me which style of covering letter is preferred? Long and detailed, explaining why I meet each of the requirements? The standard 3 paragraph ["intro", "I'm cool", "thanks"]? Or some combination in between?&lt;br/&gt;
&lt;p&gt;In the meantime, I've released &lt;a href="http://roistr" &gt;Roistr&lt;/a&gt; which does some basic semantic analysis / text analytics stuff. I put up some demos but it's hard to really show how useful this thing is. It's based on the open source &lt;a href="http://radimrehurek.com/gensim/" &gt;Gensim&lt;/a&gt; toolkit along with numpy and scipy.&lt;br/&gt;
&lt;p&gt;Scipy sounds like it's going places. Travis Oliphant recently announced an initiative to bring it to big data properly. I have an idea of what he means and it would be very cool.</description>
    </item>
    <item>
      <pubDate>Thu, 7 Jul 2011 12:45:48 GMT</pubDate>
      <title>7 Jul 2011</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=592</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=592</guid>
      <description>Does anyone have any Google Plus invites that they could send (one) to me?&lt;br/&gt;
&lt;br/&gt;
In other news, wife, daughter and I are off to the Philippines for 5 weeks and hoping to get some start-up work moving over there. UX is in demand at the moment so it's a good time to be around.&lt;br/&gt;
&lt;br/&gt;
I've also been looking up versions of principal components analysis in Python and found these:&lt;br/&gt;
&lt;br/&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href="http://mdp-toolkit.sourceforge.net/" &gt;Modular Toolkit for Data Processing&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://folk.uio.no/henninri/pca_module/" &gt;PCA module for Python&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://matplotlib.sourceforge.net/api/mlab_api.html" &gt;A version in MatPlotLib&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br/&gt;
&lt;br/&gt;
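For reference, here is a minimal sketch of the calculation itself, assuming PCA done through a singular value decomposition of the mean-centred data with numpy. The function name pca and the toy data are made up for illustration and are not taken from any of the linked packages.&lt;br/&gt;
&lt;br/&gt;
```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto their first principal components."""
    data = np.asarray(data, dtype=float)
    # Centre each column so the components describe variance, not the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    # Sample variance (n - 1 denominator) captured by each kept component.
    variances = singular_values[:n_components] ** 2 / (len(data) - 1)
    return components, scores, variances


# Toy data lying exactly on the line y = 2x (illustrative only).
points = [[0, 0], [1, 2], [2, 4], [3, 6]]
components, scores, variances = pca(points, n_components=2)
```
&lt;br/&gt;
With data on a single line, all of the variance should land on the first component, whose direction is only defined up to sign.&lt;br/&gt;
&lt;br/&gt;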
All the linguistic stuff I've been doing lately is making my head spin but it's coming together. </description>
    </item>
    <item>
      <pubDate>Mon, 6 Jun 2011 07:46:33 GMT</pubDate>
      <title>6 Jun 2011</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=591</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=591</guid>
      <description>Lots happening: I've been building a semantic relevance engine - something that can accurately determine the semantic similarity of 2 text documents and it's working reasonably well. Working completely untrained, I'm getting accuracies of well above 0.8 and often above 0.9. Obviously 1.0 is the ideal but even human judgements rarely get above 0.9 with the corpora I've been using for this.&lt;br/&gt;
&lt;br/&gt;
The good thing is that I appear to be discovering new stuff almost every day about how documents are understood. There are some approaches I've used that I've not read about in the literature so there might be some useful stuff for the world here. &lt;br/&gt;
&lt;br/&gt;
However, my aim is to make a web service around this. And it's all based on open source software (Python, numpy, Scipy, Gensim etc) which is perfect. There is proprietary knowledge used, however: the corpora, how they're prepared and the architecture of the engine; but that will all come out publicly soon enough. &lt;br/&gt;
</description>
    </item>
    <item>
      <pubDate>Wed, 27 Apr 2011 15:57:39 GMT</pubDate>
      <title>27 Apr 2011</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=590</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=590</guid>
      <description>&lt;b&gt;Log Entropy models&lt;/b&gt;&#xD;
&#xD;
&lt;p&gt; I had problems when I last upgraded to version 0.7.8 of Gensim. The main issue was &#xD;
that the package I imported wasn't necessarily the one used: quite often, it &#xD;
seemed as though the top level would be from one install whereas another &#xD;
import would be from somewhere else. The net result was that parts of my &#xD;
software were looking for an id2word method in a dictionary that had never &#xD;
had one before.&#xD;
&#xD;
&lt;p&gt; However, I still want to try 0.7.8 if I can and I found a way. I downloaded and &#xD;
untarred it, and renamed it 'gensim078'. Then, I went and changed each 'from &#xD;
gensim import *' statement to 'from gensim078 import *' which seems to be &#xD;
doing the trick. I'm sure there are better ways to do it but this is working for &#xD;
me so I'm happy.&#xD;
&#xD;
&lt;p&gt; The advantages are that a) it's faster particularly for similarity calculations, &#xD;
and b) I now have access to the Log Entropy model which I'm building for &#xD;
G1750. &#xD;
&#xD;
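&lt;p&gt; For reference, a minimal sketch of log-entropy weighting in its usual textbook form: a local weight of log(1 + tf) multiplied by a global weight of 1 + sum_j p_ij * log(p_ij) / log(N), where p_ij is the share of term i's occurrences that fall in document j and N is the number of documents. The function name and toy documents are hypothetical, and Gensim's own implementation may differ in details.

```python
import math


def log_entropy_weights(docs):
    """docs: list of dicts mapping term -> raw count. Needs 2+ documents."""
    n_docs = len(docs)
    # Total occurrences of each term across the whole corpus.
    global_freq = {}
    for doc in docs:
        for term, count in doc.items():
            global_freq[term] = global_freq.get(term, 0) + count
    # Accumulate sum_j p_ij * log(p_ij) for every term.
    entropy_sum = {term: 0.0 for term in global_freq}
    for doc in docs:
        for term, count in doc.items():
            p = count / global_freq[term]
            entropy_sum[term] += p * math.log(p)
    # Terms spread evenly over all documents get a global weight near 0.
    global_weight = {
        term: 1.0 + entropy_sum[term] / math.log(n_docs)
        for term in global_freq
    }
    return [
        {term: math.log(1 + count) * global_weight[term]
         for term, count in doc.items()}
        for doc in docs
    ]


# Toy corpus: "the" appears everywhere, "cat" and "dog" in one document each.
weighted = log_entropy_weights([{"the": 1, "cat": 1}, {"the": 1, "dog": 1}])
```

&lt;p&gt; In this toy corpus "the" is weighted down to zero because it is spread evenly across both documents, while "cat" and "dog" keep their full local weight.
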
&lt;p&gt; Later tonight, I'll adjust the dictionary and begin pruning words that appear &#xD;
across lots of documents to see if that improves the focus. The program does &#xD;
seem a little 'fuzzy' as it is but that is quite a human characteristic so I'm not &#xD;
too worried. However, it will help me explore vector models and understand &#xD;
them better myself.&#xD;
&#xD;
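&lt;p&gt; A minimal sketch of that kind of pruning, assuming each document is a list of tokens. The name prune_common_terms and the 0.5 threshold are illustrative only and are not taken from Gensim or from the program described here.

```python
def prune_common_terms(documents, max_doc_fraction=0.5):
    """Drop terms that occur in more than max_doc_fraction of the documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    too_common = {
        term for term, df in doc_freq.items()
        if df / n_docs > max_doc_fraction
    }
    pruned = [[t for t in tokens if t not in too_common] for tokens in documents]
    return pruned, too_common


# Toy corpus: "the" occurs in every document, so it is pruned.
docs = [["the", "cat"], ["the", "dog"], ["the", "fish"]]
pruned, removed = prune_common_terms(docs)
```

&lt;p&gt; Raising or lowering the threshold is the knob to experiment with: too low and useful topic words disappear along with the noise.
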
&lt;p&gt; Although the results of the word-pair semantic association task were poor, &#xD;
I'm not dismayed (too much!) because my whole construction is not perfect &#xD;
and there is lots of room for improvement. The task is also useful as it gives &#xD;
me an indication of accuracy by a different means from the 20NG categorisation &#xD;
task. When I create a new corpus, I should ideally subject it to a battery of &#xD;
tests designed to test different things. With the results of these, I can work &#xD;
out whether the corpus is heading in the right direction or not. It's all good to &#xD;
have these tools even if (initially) the results are not going how I wanted them to.&#xD;
&#xD;
&lt;p&gt; I'm turning into a perfectionist. I really need to release something useful &#xD;
before I refine... Release early, release often...&#xD;
</description>
    </item>
    <item>
      <pubDate>Wed, 20 Apr 2011 20:20:45 GMT</pubDate>
      <title>20 Apr 2011</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=589</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=589</guid>
      <description>I've been having lots of fun lately with &lt;a href="http://nlp.fi.muni.cz/projekty/gensim/" &gt;Gensim&lt;/a&gt;, a Python &#xD;
framework for vector space modelling. It includes fun stuff like latent &#xD;
semantic analysis, latent Dirichlet allocation and other goodies. Allied with &#xD;
&lt;a href="http://www.nltk.org/" &gt;NLTK&lt;/a&gt;, this makes a very formidable &#xD;
Python-based NLP framework.&#xD;
&#xD;
&lt;p&gt; My task is sorting newsgroup posts into the correct groups and I've achieved a &#xD;
reasonable level of accuracy (0.92) which isn't bad given that it's entirely &#xD;
dependent upon content. However, most analyses are showing lower &#xD;
accuracies (0.70+) which isn't bad but not far enough above chance &#xD;
performance to be taken seriously. Still, there are a few ways to improve &#xD;
this and I'm conducting an enormous number of experiments to get an effective &#xD;
mental model of how vector space models work.&#xD;
&#xD;
&lt;p&gt; This is all the beginning of constructing a relevance engine which I'm sure will &#xD;
be useful to some people.&#xD;
&#xD;
&lt;p&gt; Great fun!&#xD;
</description>
    </item>
    <item>
      <pubDate>Sun, 12 Dec 2010 20:44:36 GMT</pubDate>
      <title>12 Dec 2010</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=588</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=588</guid>
      <description>This is a list of things that have to be done to get Infomap working on a &#xD;
modern Linux distribution (tried on Ubuntu 10.10).&#xD;
&#xD;
&lt;p&gt; * BLOCKSIZE in preprocessing/preprocessing_env.h : needs to be set to at least the &#xD;
highest number of words a document has in the corpus. If a document has &#xD;
more words than BLOCKSIZE, the building of the model will hang.&#xD;
&#xD;
&lt;p&gt; * Install libgdbm-dev with Synaptic or apt-get. Infomap needs a header file &#xD;
and without it, Infomap will not compile (it will not pass ./configure).&#xD;
&#xD;
&lt;p&gt; * If ndbm.h is not found: everything happens in /usr/include&#xD;
&#xD;
&lt;p&gt; ln -s gdbm-ndbm.h ndbm.h or just copy gdbm-ndbm.h to &#xD;
/usr/include/ndbm.h&#xD;
&#xD;
&lt;p&gt; Infomap will not compile (it will not pass ./configure) without this.&#xD;
&#xD;
&lt;p&gt; Then it should go through configure, make, and make install cleanly.&#xD;
&#xD;
&lt;p&gt; This is the code for CompareTerms:&#xD;
&lt;code&gt;&#xD;
&#xD;
&lt;p&gt; # term1, term2 - terms to be compared&#xD;
&#xD;
&lt;p&gt; vec1 = "associate -q term1"&#xD;
&#xD;
&lt;p&gt; vec2 = "associate -q term2"&#xD;
&#xD;
&lt;p&gt; vec1 = numpy.array(vec1)&#xD;
&#xD;
&lt;p&gt; vec2 = numpy.array(vec2)&#xD;
&#xD;
&lt;p&gt; product = numpy.sum(vec1 * vec2)&#xD;
&#xD;
&lt;p&gt; return product&#xD;
&lt;/code&gt;&#xD;
This produces an association between 2 terms.&#xD;
&#xD;
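&lt;p&gt; A minimal runnable sketch of the same calculation, assuming the two vectors have already been parsed into lists of floats. Parsing the output of associate is not shown, and the vectors below are made up. The cosine helper is an addition for cases where the vectors are not already unit length.

```python
import numpy as np


def compare_terms(vec1, vec2):
    """Dot product of two term vectors, as in the pseudocode above."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    return float(np.sum(v1 * v2))


def cosine(vec1, vec2):
    """Dot product after scaling both vectors to unit length."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    return compare_terms(v1, v2) / float(np.linalg.norm(v1) * np.linalg.norm(v2))


# Made-up vectors standing in for the output of associate (illustrative only).
association = compare_terms([1.0, 0.0, 2.0], [3.0, 4.0, 1.0])
similarity = cosine([1.0, 0.0], [2.0, 0.0])
```
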
&lt;p&gt; When calling this, the 'args' string that calls associate must be formatted as a &#xD;
single string and not by Popen. This is important when sending more than 1 &#xD;
term. If not, associate will treat the terms as a quote search rather than an &#xD;
AND search.</description>
    </item>
    <item>
      <pubDate>Mon, 6 Dec 2010 19:56:27 GMT</pubDate>
      <title>6 Dec 2010</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=587</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=587</guid>
      <description>Long time no post! I've been very busy with family and work and not had &#xD;
much time to do stuff. If there are no objections, I was thinking of reposting &#xD;
some of my UX stuff here. It's not commercial but informational and might be &#xD;
of use.&#xD;
&#xD;
&lt;p&gt; As for open source, I've been working a lot on &lt;a href="http://infomap-nlp.sourceforge.net/" &gt;Infomap&lt;/a&gt; lately for natural language processing. I &#xD;
had some problems using &lt;a href="http://code.google.com/p/semanticvectors" &gt;Semantic Vectors&lt;/a&gt;, &#xD;
namely the speed at which it does comparisons between terms. I had an idea &#xD;
for an automated information architecture creator but it was too slow. &#xD;
Infomap is much faster so I will try to use that - even though I know it's been &#xD;
superseded by Semantic Vectors. &#xD;
&#xD;
&lt;p&gt; Plus, being written in C means that it is accessible with Python whereas &#xD;
Semantic Vectors being in Java means going through Jython (and learning lots &#xD;
of new things which I don't have time for) or going through a very awkward &#xD;
process to translate.&#xD;
&#xD;
&lt;p&gt; With my first run using SV, I generated an information map much like that &#xD;
resulting from a card sort. The card sort took weeks to prepare, perform and &#xD;
analyse - and a lot of staff time. Mine ran in a few hours and got results that &#xD;
weren't entirely dissimilar to the human version. There were some odd &#xD;
surprises but that was because of the corpus (Wikipedia was what I used at &#xD;
the time) which by nature has a focus on particular topics as opposed to &#xD;
general language. This meant that the results were generally quite good but &#xD;
with one or two startling exceptions. &#xD;
&#xD;
&lt;p&gt; But integrating it with a Python backend is too hard, so back &#xD;
to Infomap. I just need to figure out how to do semantic comparisons of terms &#xD;
in Infomap. &#xD;
&#xD;
&lt;p&gt; It was a job to get going. The first problem was not having the appropriate &#xD;
symlink to a DB library and a header file. Once rectified, I had to ensure the &#xD;
BLOCKSIZE constant was set to a figure larger than the highest number of &#xD;
words. It defaults to 1 million but the longest document in the corpus was &#xD;
1.25 million words. Without doing this, I had no warning and left the program &#xD;
building its model for over a week before finding the problem. Once done, &#xD;
the model was analysed and built in under 2 hours on an Asus 701 netbook! &#xD;
&#xD;
&lt;p&gt; I remember when LSA used to take days...&#xD;
&#xD;
&lt;p&gt; So in the spirit of openness and the basis of this endeavour being in open &#xD;
source software, I will publish results here to ensure everyone is totally bored.&#xD;
</description>
    </item>
    <item>
      <pubDate>Mon, 20 Apr 2009 07:39:16 GMT</pubDate>
      <title>20 Apr 2009</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=586</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=586</guid>
      <description>&lt;p&gt;I have a &lt;a href="http://www.linkedin.com/in/alanjamessalmoni" &gt;LinkedIn&#xD;
profile&lt;/a&gt; here. Advogatoans are welcome to add me to their&#xD;
network.&#xD;
&#xD;
&lt;p&gt; Edit: This entry was already turning up in Google's search&#xD;
results less than 2 hours after writing it. I think it was&#xD;
spidered 15 minutes ago.</description>
    </item>
    <item>
      <pubDate>Sun, 19 Apr 2009 06:55:55 GMT</pubDate>
      <title>19 Apr 2009</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=585</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=585</guid>
      <description>&lt;p&gt;Life is going well in NZ. My job is enjoyable -&#xD;
thoroughly so - and I'm learning lots every day. Very little&#xD;
open source work done lately as I need to check the T&amp;amp;Cs of&#xD;
my contract to see if I'm okay. I'm sure there is no problem&#xD;
but I need to check first.&#xD;
&#xD;
&lt;p&gt;Our application for permanent residence here is going&#xD;
well. I submitted our expression of interest back on 21&#xD;
March and we were successful on 6th April which is quite&#xD;
quick really. I was expecting it to take a few months. I'm&#xD;
still waiting for the ITA form to come through by post which&#xD;
seems to be taking some time. I'm guessing that receiving it&#xD;
is really the long part of the process.&#xD;
&#xD;
&lt;p&gt;I hope it comes through quickly as my wife and daughter&#xD;
are still in the Philippines and I'm missing them so much.&#xD;
We could apply for a visitor's visa for her, but we have&#xD;
other obligations which need to be met in the immediate&#xD;
future (too much detail to go into here). Still, we chat&#xD;
every day by email and video chat. I've even managed to play&#xD;
games with Louise by webcam which ranks as a good&#xD;
achievement. It's not the same as being with her but it's&#xD;
the best I can do right now.</description>
    </item>
    <item>
      <pubDate>Mon, 23 Feb 2009 19:28:43 GMT</pubDate>
      <title>23 Feb 2009</title>
      <link>http://www.advogato.org/person/salmoni/diary.html?start=584</link>
      <guid>http://www.advogato.org/person/salmoni/diary.html?start=584</guid>
      <description>Well I made it! I'm in New Zealand working for &lt;a href="http://www.westpac.co.nz" &gt;Westpac&lt;/a&gt; as an &#xD;
interaction designer for their website. All good fun! The &#xD;
work seems really cool and I have so many ideas to &#xD;
implement.&#xD;
&#xD;
&lt;p&gt; In other news, I've been exploring neural networks to &#xD;
predict currency markets and found a modicum of success &#xD;
(though nothing that translates into a prediction system &#xD;
that I could make money out of). Been using bpnn in &#xD;
Python. Python slows things down a lot but allows &#xD;
interactive analysis. I tried updating bpnn to use numpy &#xD;
but found my version to be significantly slower (eg, 1.5 &#xD;
seconds against over 5 seconds for the new one) which is &#xD;
odd. Is it worth releasing the code? </description>
    </item>
  </channel>
</rss>
