Older blog entries for Stevey (starting at number 684)

Tagging images, and maintaining collections?

I'm an amateur photographer, although these days I tend to drop the amateur prefix, given that I shoot people for cash at least once a month.

(It isn't my main job, and I'd never actually want it to be, because I'm certain I'd become unhappy hustling for jobs and doing the promotion thing.)

Anyway over the years I've built up a large library of images, mostly organized in a hierarchy of directories beneath ~/Images.

Unlike most photographers I don't use aperture, lighttable, or any similar library management. I shoot my images in RAW, convert to JPG via rawtherapee, and keep both versions of the images.

In short I don't want to mix the "library management" functions with the "RAW conversion" because I do regard them as two separate steps. That said I'm reaching a point where I do want to start tagging images, and finding them more quickly.

In the past I wrote a couple of simple tools to inject tags into the EXIF data of images, and then indexed them. But that didn't work so well in practise. I'm starting to think instead I should index images into sqlite:

  • Size.
  • date.
  • Content hash.
  • Tags.
  • Path.

The downside is that this breaks utterly as soon as you move images around on-disk. Which is something my previous exif-manipulation was designed to avoid.

Anyway I'm thinking at the moment, but I know that the existing tools such as F-Spot, shotwell, DigiKam, and similar aren't suitable. So I either need to go standalone and use EXIF tags, accepting the fact that the tags I enter won't be visible to other tools, or I cope with the file-rename issues by attempting to update an existing sqlite database via hash/size/etc.

Syndicated 2014-04-03 11:02:30 from Steve Kemp's Blog

Some things on DNS and caching

Although there wasn't too many comments on my what would you pay for? post I did get some mails.

I was reminded about this via Mario Langs post, which echoed a couple of private mails I received.

Despite being something that I take for granted, perhaps because my hosting comes from the Bytemark, people do seem willing to pay money for DNS hosting.

Which is odd. I mean you could do it very very very cheaply if you had just four virtual machines. You can get complex and be geo-fancy, and you could use anycast on a small AS, but really? You could just deploy four virtual machines0 to provide a.ns, b.ns, c.ns, d.ns, and be better than 90% of DNS hosters out there.

The thing that many people mentioned was Git-backed, or Git-based DNS. Which would be trivial if you used tinydns, and no much harder if you used bind.

I suspect I'm "not allowed" to do DNS-things for a while, due to my contract with Dyn, but it might be worth checking...

ObRandom: Beat me to it. Register gitdns.io, or similar, and configure hooks from github to compile tinydns records.

In other news I started documenting early thoughts about my caching reverse proxy, which has now got a name stockpile.

I wrote some stub code using node.js, and although it was functional it soon became callback hell:

  • Is this resource cachable?
  • Does this thing exist in the cache already?
  • Should we return the server's response to the client, archive to memcached, or do both?

Expressing the rules neatly is also a challenge. I want the server core to be simple and the configuration to be something like:

is_cachable ( vhost, source, request, backened )
{
    /**
     * If the file is static, then it is cachable.
     */
    if ( request.url.match( /\.(jpg|png|txt|html?|gif)$/i ) ) {
        return true;
    }

    /**
     * If there is a cookie then the answer is false.
     */
    if ( request.has_cookie? ) { return false ; }

    /**
     * If the server is alive we'll now pass the remainder through it
     * if not then we'll serve from the cache.
     */
    if ( backend.alive? ) {
        return false;
    }
    else {
        return true;
    }
}

I can see there is value in judging the cachability based on the server response, but I plan to ignore that except for "Expires:", "Etag", etc ,etc)

Anyway callback hell does make me want to reexamine the existing C/C++ libraries out there. Because I think I could do better.

Syndicated 2014-03-29 18:16:12 from Steve Kemp's Blog

A diversion on off-site storage

Yesterday I took a diversion from thinking about my upcoming cache project, largely because I took some pictures inside my house, and realized my offsite backup was getting full.

I have three levels of backups:

  • Home stuff on my desktop is replicated to my wifes desktop, and vice-versa.
  • A simple server running rsync (content-free http://rsync.io/).
  • A "peering" arrangement of a small group of friends. Each of us makes available a small amount of space and we copy to-from each others shares, via rsync / scp as appropriate.

Unfortunately my rsync-based personal server is getting a little too full, and will certainly be full by next year. S3 is pricy, and I don't trust the "unlimited" storage people (backblaze,etc) to be sustainable and reliable long-term.

The pricing on Google-drive seems appealing, but I guess I'm loathe to share more data with Google. Perhaps I could dedicated a single "backup.account@gmail.com" login to that, separate from all-else.

So the diversion came along when I looked for Amazon S3-comptible, self-hosted, servers. There are a few, most of them are PHP-based, or similarly icky.

So far cloudfoundry's vlob looks the most interesting, but the main project seems stalled/dead. Sadly using s3cmd to upload files failed, but certainly the `curl` based API works as expected.

I looked at Gluster, CEPH, and similar, but didn't yet come up with a decent plan for handling offsite storage, but I know I have only six months or so before the need becomes pressing. I imagine the plan has to be using N-small servers with local storage, rather than 1-Large server, purely because pricing is going to be better that way.

Decisions decisions.

Syndicated 2014-03-26 17:53:08 from Steve Kemp's Blog

New GPG-key

I've now generated a new GPG-key for myself:

$ gpg --fingerprint 229A4066
pub   4096R/0C626242 2014-03-24
      Key fingerprint = D516 C42B 1D0E 3F85 4CAB  9723 1909 D408 0C62 6242
uid                  Steve Kemp (Edinburgh, Scotland) <steve@steve.org.uk>
sub   4096R/229A4066 2014-03-24

The key can be found online via mit.edu : 0x1909D4080C626242

This has been signed with my old key:

pub   1024D/CD4C0D9D 2002-05-29
      Key fingerprint = DB1F F3FB 1D08 FC01 ED22  2243 C0CF C6B3 CD4C 0D9D
uid                  Steve Kemp <steve@steve.org.uk>
sub   2048g/AC995563 2002-05-29

If there is anybody who has signed my old key who wishes to sign my new one then please feel free to get in touch to arrange it.

Syndicated 2014-03-24 19:26:03 from Steve Kemp's Blog

So I failed at writing some clustered code in Perl

Until this time next month I'll be posting code-based discussions only.

Recently I've been wanting to explore creating clustered services, because clusters are definitely things I use professionally.

My initial attempt was to write an auto-clustering version of memcached, because that's a useful tool. Writing the core of the service took an hour or so:

  • Simple KeyVal.pm implementation.
  • Give it the obvious methods get, set, delete.
  • Make it more interesting by creating a read-only append-log.
  • The logfile will be replayed for clustering.

At the point I was done the following code worked:

use KeyVal;

# Create an object, and set some values
my $obj = KeyVal->new( logfile => "/tmp/foo.log" );
$obj->incr( "steve" );
$obj->incr( "steve" );

print $obj->get( "steve" ) # prints 2.

# Now replay the append-only log
my $replay = KeyVal->new( logfile => "/tmp/foo.log" );
$replay->replay();

print $replay->get( "steve" ) # prints 2.

In the first case we used the primitives to increment a value twice, and then fetch it. In the second case we used the logfile the first object created to replay all prior transactions, then output the value.

Neat. The next step was to make it work over a network. Trivial.

Finally I wanted to autodetect peers, and deploy replication. Each host would send out regular messages along the lines of "Do you have updates made since $time?". Any that did would replay the logfile from the given unixtime offset.

However here I ran into problems. Peer discovery was supposed to be basic, and I figured I'd write something that did leader election by magic. Unfortunately Perls threading code is .. unpleasant:

  • I wanted to store all known-peers in a singleton.
  • Then I wanted to create threads that would announce and receive updates.

This failed. Majorly. Because you cannot launch the implementation of a class-method as a thread. Equally you cannot make a variable which is "complex" shared across threads.

I wrote some demo code which works without packages and a shared singleton:

The Ruby version, by contrast, is much more OO and neater. Meh.

I've now shelved the project.

My next, big, task was to make the network service utterly memcached compatible. That would have been fiddly, but not impossible. Right now I just use a simple line-based network protocol.

I suspect I could have got what I wanted using EventMachine, or similar, but that's a path I've not yet explored, and I'm happy enough with that decision.

Syndicated 2014-03-24 09:41:18 from Steve Kemp's Blog

That is it, I'm going to do it

That's it, I'm going to do it: I have now committed myself to writing a scalable, caching, reverse HTTP proxy.

The biggest question right now is implementation language; obviously "threading" of some kind is required so it is a choice between Perl's anyevent, Python's twisted, Rubys event machine, or node.js.

I'm absolutely, definitely, not going to use C, or C++.

Writing a a reverse proxy in node.js is almost trivial, the hard part will be working out which language to express the caching behaviour, on a per type, and per-resource basis.

I will ponder.

Syndicated 2014-03-22 14:49:49 from Steve Kemp's Blog

Time to mix things up again

I'm currently a contractor, working for/with Dyn, until April the 11th.

I need to decide what I'm doing next, if anything. In the meantime here are some diversions:

Some trivial security issues

I noticed and reported two more temporary-file issues insecure temporary file usage in apt-extracttemplates (apt), and libreadline6: Insecure use of temporary files - in _rl_trace.

Neither of those are particularly serious, but looking for them took a little time. I recently started re-auditing code, and decided to do three things:

  • Download the source code to every package installed upon this system.
  • Download the source code to all packages matching the pattern ^libpam-, and ^libruby-*.

I've not yet finished slogging through the code, but my expectation will be a few more issues. I'll guess 5-10, given my cynical nature.

NFS-work

I've been tasked with the job of setting up a small cluster running from a shared and writeable NFS-root.

This is a fun project which I've done before, PXE-booting a machine and telling it to mount a root filesystem over NFS is pretty straight-forward. The hard part is making that system writeable, such that you can boot and run "apt-get install XX". I've done it in the past using magic filesystems, or tmpfs. Either will work here, so I'm not going to dwell on it.

Another year

I had another birthday, so that was nice.

My wife took me to a water-park where we swam like fisheseses, and that tied in nicely with a recent visit to Deep Sea World, where we got to walk through a glass tunnel, beneath a pool FULL OF SHARKS, and other beasties.

Beyond that I received another Global Knife, which has now been bloodied, since I managed to slice my finger open chopping mushrooms on Friday. Oops. Currently I'm in that annoying state where I'm slowly getting used to typing with a plaster around the tip of my finger, but knowing that it'll have to come off again and I'll get confused again.

Linux Distribution

I absolutely did not start working on a "linux distribution", because that would be crazy. Do I look like a crazy-person?

All I did was play around with GNU Stow, and ponder the idea of using a minimal LibC and GNU Stow to organize things.

It went well, but the devil is always in the details.

I like the idea of a master-distribution which installs pam, ssh, etc, but then has derivitives for "This is a webserver", "This is a Ruby server", and "This is a database server".

Consider it like task-selection, but with higher ambition.

There's probably more I could say; a new kitchen sink (literally) and a new tap have made our kitchen nicer, I've made it past six months of regular gym-based workouts, and I didn't die when I went to the beach in the dark the other night, so that was nice.

Umm? Stuff?

Have a nice day. Thanks.

Syndicated 2014-03-20 11:19:57 from Steve Kemp's Blog

Sending commands to your running mail-client

So I wrote a mail client, and this morning I added the ability for it to receive input from a Unix domain socket.

In one terminal I have my email client open. In another I run:

lumailctl /tmp/foo.sock "open('/home/skx/Maildir/.livejournal.2014/');"

That opens the unix domain socket, and pipes the following command to it:

open('/home/skx/Maildir/.livejournal.2014/');

The mail client has already got the socket open, and the end result is that my mail client suddenly opens the specified mail folder, and redraws itself.

Neat.

The "open" function is obviously a lua function, which builds upon the lua primitives the client understands:

function open( folder )
   clear_selected_folders()
   set_selected_folder(folder)
   index()
end

Obviously this would be woefully insecure if it were released like this. Later I'll wire up some lua function to establish the socket, such that the user specifies where the socket is created (their home directory, ideally), and it doesn't run by default.

Syndicated 2014-03-17 11:10:20 from Steve Kemp's Blog

So load-balancers are awesome

When I was recently talking about load-balancers, and automatically adding back-ends, not just removing bad ones that go offline, I obviously spent a while looking over some.

There are several dedicated load-balancers packaged for Debian GNU/Linux, including:

In addition to actual dedicated load-balancers there are things that can be coerced into running in that way: apache2, varnish, squid, nginx, & etc.

Of the load-balancers I was immediately drawn to both pen and pound, because they have command line tools ("penctl" and "poundctl" respectively) for adding/removing/updating the running configuration.

Pen I've been using for a couple of days now, and although it suffers from some security issues I'm confident they will be resolved in the near future. (#741370)

My only outstanding task is to juggle some hosts around and stress-test the pair of them a little more before deciding on a winner.

In other news I kinda regret the whole blogspam.net API. I'd have had a far simpler life if I'd just ran the damn thing as a DNSBL in the first place. (That's essentially how it operates on the whole anyway. Submit spammy comments for long enough and you're just blacklisted, thereafter.)

Syndicated 2014-03-14 13:49:16 from Steve Kemp's Blog

Discovering back-end servers automatically?

Recently I've been pondering how to do service discovery.

Pretend you have a load balancer which accepts traffic and routes incoming requests to different back-ends. The loadbalancer might be pound, varnish, haproxy, nginx, or similar. The back-ends might be node applications, apache, or similar.

The typical configuration of the load-balancer will read:

# forward

# backends
backend web1  { .host = "10.0.0.4"; }
backend web2  { .host = "10.0.0.6"; }
backend web3  { .host = "10.0.0.5"; }

#  afterword

I've seen this same setup in many situations, and while it can easily be imagined that there might be "random HTTP servers" on your (V)LAN which shouldn't receive connections it seems like a pain to keep updating the backends.

Using UDP/multicast broadcasts it is trivial to announce "Hey I'm a HTTP-server with the name 'foo'", and it seems to me that this should allow seamless HTTP load-balancing.

To be more explicit - this is normal:

  • The load-balancer listens for HTTP requests, and forwards them to back-ends.
  • When back-ends go away they stop receiving traffic.

What I'd like to propose is another step:

  • When a new back-end advertises itself with the tag "foo" it should be automatically added and start to receive traffic.

i.e. This allows backends to be removed from service when they go offline but also to be added when they come online. Without the load-balancer needing its configuration to be updated.

This means you'd not give a static list of back-ends to your load-balancer, instead you'd say "Route traffic to any service that adfvertises itself with the tag 'foo'.".

VLANS, firewalls, multicast, udp, all come into play, but in theory this strikes me as being useful, obvious, and simple.

(Failure cases? Well if the "announcer" dies then the backend won't get traffic routed to it. Just like if the backend were offline. And clearly if a backend is announced, but not receiving HTTP-requests it would be dropped as normal.)

If I get the time this evening I'll sit down and look at some load-balancer source code to see if any are written in such a way that I could add this "broadcast discovery" as a plugin/minor change.

Syndicated 2014-03-11 13:28:36 from Steve Kemp's Blog

675 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!