
Older blog entries for oubiwann (starting at number 203)

24 Jul 2008 (updated 30 Jul 2008 at 22:04 UTC) »

In Memoria: The Great Work


The OSCON Tuesday Night Extravaganza was just fabulous: awards, laughter, brain-bending, and affirmation. The primary speakers were Mark Shuttleworth, r0ml, and Damian Conway; but I'm going to be focusing on r0ml's talk right now :-) Well, in part, anyway.

Let's back up to Monday night, first: Alex Martelli and I had a chance to wax philosophical about programming and software. It was wonderful, both because it revealed Alex's code-spirit and because of the simpatico I felt as his passionate idealism resonated with mine. As Alex talked of the holy architecture of mosques and cathedrals, of the contributions that such artisans as stonecutters, masons, sculptors, and calligraphers made, he emphasized how each individual played an essential role in bringing these wondrous works into being, and that each act was an offering to the ideals that formed the basis of the respective belief system.

What's more, though, Alex extended the analogy from religion to mysticism, saying that even more than builders of such great structures, coders are alchemists engaged in the magnum opus. We are the transmutators. In our crucibles, the opposites of function and beauty unite; performance and elegance are commingled to produce the perfection of our art. Alex was careful to point out that he intended perfection in both an abstract and practical sense. On one hand, being able to create and actually deliver code that others found useful, regardless of the sex appeal (or lack thereof), can be viewed as a form of perfection. It is accomplishment; attainment of the goal. On the other hand, it's just something that someone wanted us to write; it's not a proof of Fermat's Last Theorem. It's useful; it serves a specific function.

Before I get to r0ml's talk, I want to mention UQDS as employed by the Twisted and Divmod communities. I think it's phenomenal and I enjoy working with that system. It's a well thought-out and proven process that tends to produce code of an extremely high quality. However, it's not my natural tendency. I like quick and dirty prototypes; a little messy code goes a long way. I like to throw something out there and then fix it up and apply polish incrementally, as dictated by need.

This is why I've been enjoying the Twisted Community Code project/group on Launchpad. Not only do you have the benefits of using a tool like bazaar that lets one branch other projects on a whim, but you've got a community space to put these explorations, where others can easily see what you're doing, check it out, and try something of their own. (There's a whole 'nother blog post I have coming about that.) However, this finally brings me to r0ml's talk: a new spin on the development process.

Those of you who have seen his phenomenal rhetoric talks would be delighted to see what he did :-) He established a nice mapping from both Microsoft's development process and the one defined by Rational. He used the five canons of classical rhetoric: inventio, dispositio, elocutio, memoria, and pronuntiatio. However, the really brilliant thing was where he started the process: smack in the middle, right where I like to do it :-) And he justified this beautifully. His mapping was the following:
  • Memoria = Commit / Update
  • Pronuntiatio = Run / Use
  • Inventio = Bug Reporting / Patch Submission
  • Dispositio = Triage
  • Elocutio = Integration


The idea here being this: get what you've got done out there and in front of people's eyes. Everyone knows it's crap; don't worry about it. Get it running and get others running it. Work on what matters most and integrate the changes. Repeat and continue.

I like to tease other Twisted devs that I tend not to do test-driven development, but bug-driven testing. What's interesting is that we both start with a requirements doc: for them, it's a development plan; for me, it's a bug/TODO list. The difference is that they then engage in Inventio whereas I start with Memoria. As r0ml said, with this model there is no development, there is only maintenance.

One of the other great things that r0ml mentioned about this process is that it not only gets you the developer started more quickly, it gets others started at the same time. Each programmer is engaged in a macroscopic genetic programming effort: everyone takes the source, mutates it, evolves it, reviews it, and the best implementations (or parts thereof) survive to become the basis for the next generation. Everyone gets to write at the same time; no one is blocked.

This development approach evokes images of philosophers from the Middle Ages sending letters to each other in cryptic alchemical symbols and diagrams, with all the implicit and explicit layers of meaning. I see this methodology as establishing the true foundation of the open source art: a gnostic, spirit-(of-open-source)-ual transformation that brings us to improved states of mind and clarity.

The perfection of our art, whether sublime or mundane, can be merged in the mind of the developer as one... this union being our philosopher's stone. With each release of software engaged in this manner, we iterate the Great Work.

Syndicated 2008-07-24 03:41:00 (Updated 2008-07-30 20:37:56) from Duncan McGreggor

OSCON 2008

Hey all, thanks to a friend's amazingly generous offer, I'll be attending OSCON this year :-) I only have to pay for my airfare and food! I've contacted several people already who I know are going to be there (including Van Lindberg of Haynes and Boone and Bradley Kuhn of the SFC and the SFLC), and look forward to meeting up with others. Leave a comment or email me if you're going to be there!


Syndicated 2008-07-16 20:43:00 (Updated 2008-07-16 20:56:41) from Duncan McGreggor

5 Jul 2008 (updated 1 Aug 2008 at 22:03 UTC) »

Native LoadBalancing for Twisted Apps

Yesterday, right before midnight, I tagged the 1.1.0 release of txLoadBalancer on Launchpad after completing the last of the planned features. There are some pretty radical changes that have been developed for this release... and the coolest part is this is just the beginning :-) (See the TODO if you don't believe me!)

You can checkout from lp:~oubiwann/txloadbalancer/1.1.0 or download from PyPI. If you're a PyPI expert, I've got some questions for you at the end of this post... Been having some sucky experiences with PyPI lately :-(

So here's what's going on with txLoadBalancer:

Improved API

The biggest thing you'll notice if you're switching from PythonDirector is the massive overhaul the API has undergone. Things are cleaner and generally more modern, with a concise and well-defined module layout.

New Load Balancing Algorithm

I've added support for a weighted host scheduler. Given a weight that represents how frequently a host should be used, a host will be randomly selected based on its weight. For example, with two hosts, one having a weight of 1 and the other having a weight of 3, host 2 will be chosen about 75% of the time and host 1 will get about 25% of the requests.
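The 1-to-3 example can be tried out with a few lines of standard-library Python. This is only a sketch of weighted random selection in general, not txLB's scheduler; `pick_host` and the `weights` dictionary are names made up for the illustration.

```python
import random
from collections import Counter

def pick_host(weights, rng=random):
    """Return one host name, chosen with probability proportional to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]

# Hypothetical hosts with the weights from the example: 1 and 3.
weights = {"host1": 1, "host2": 3}

# Seeded only so that repeated runs of this sketch agree with each other.
rng = random.Random(42)

# Draw many times and see how often each host comes up.
counts = Counter(pick_host(weights, rng) for _ in range(10000))
share = counts["host2"] / 10000
print("host2 share: %.2f" % share)
```

With this many draws, the share for host 2 should land close to the 75% predicted by its weight.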

Right now, this algorithm has to make several calls to other parts of the code in order to get all the data it needs (it also builds some crazy iterators). As such, it's rather slow and performs poorly when compared to the very light-weight least-connections algorithm. That being said, the next release will include optimizations for the weighted scheduler that make use of a Twisted timer and caching.
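To see why least-connections is so light-weight by comparison, here is a rough sketch of the idea. It assumes the balancer already tracks a count of open connections per host; the `open_connections` dictionary and the function name are hypothetical, and this is not the txLB implementation.

```python
def least_connections(open_connections):
    """Return the host that currently has the fewest open connections."""
    return min(open_connections, key=open_connections.get)

# Hypothetical snapshot of open connections per proxied host.
open_connections = {"host1": 4, "host2": 1, "host3": 2}

chosen = least_connections(open_connections)
open_connections[chosen] += 1  # the caller records the new connection
```

There is no random draw and nothing to build per request: the scheduler is a single pass over counters it already has.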

Native Twisted Load-Balancing

Here's the sexiest part: you can now load-balance your Twisted application by using the txLB API; you don't even need to run the load-balancer as a separate app! This evolved as a feature after a conversation with an as-yet unnamed cloud hosting provider, a follow-up discussion with the Divmod team, and then some quiet pondering about ways in which Twisted applications could be supported in cloud/grid/massively-multi-core architectures.

The "self load-balancing" API in txLB is not a complete solution for grid-hosting, but it is a first step in one direction (we've been discussing lots of others, too, including the use of our deployment tool).

Before I show you how to use the self load-balancing API, let's take a quick look at a normal Twisted application service:
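from twisted.web import static, server
from twisted.application import service, internet

application = service.Application("Demo Web Server")
# Site wraps a resource, so serve the directory through static.File.
web = server.Site(static.File('/home/oubiwann/public_html'))
# Named webService so that it does not shadow the imported service module.
webService = internet.TCPServer(7001, web)
webService.setServiceParent(application)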
You start that with the command twistd -noy myweb.tac. For use with the next example, you can also start two more, one on port 7002 and the other on port 7003.

Now here's what you do to make a self load-balanced app:
from twisted.application import service

from txlb import manager
from txlb.model import HostMapper
from txlb.schedulers import leastc
from txlb.application.service import LoadBalancedService

proxyServices = [
    HostMapper(proxy='127.0.0.1:8080', lbType=leastc, host='host1',
               address='127.0.0.1:7001'),
    HostMapper(proxy='127.0.0.1:8080', lbType=leastc, host='host2',
               address='127.0.0.1:7002'),
    HostMapper(proxy='127.0.0.1:8080', lbType=leastc, host='host3',
               address='127.0.0.1:7003'),
]

application = service.Application('Demo LB Service')
pm = manager.proxyManagerFactory(proxyServices)
lbs = LoadBalancedService(pm)
lbs.setServiceParent(application)
As you would expect, you need to indicate the proxy host:port, the algorithm to use, and the hosts that are to be balanced. The host setup assumes that you have three services running on localhost ports 7001, 7002, and 7003. All that's needed now is to just run that code with the usual twistd -noy myapp.tac. Also, for demonstration purposes, this is a somewhat simplified example of what is possible.

This may seem like a lot of extra work when compared to the simple web host above, but think about it: we're load-balancing here :-) This saves you from having to manage yet another application. With a few extra lines of code, you can keep it all in one place and have it manage itself.

Note that this API is in development and continuing to improve. The example above is from code running in trunk. For the more verbose configuration that is in the 1.1.0 release, be sure to see ./bin/txlbWeb.tac from the source tarball. To play with the latest and greatest, you'll want to checkout the code here: lp:txloadbalancer.

Other Goodies

Here is some other good stuff in the release:
  • You can now ssh into a txLB instance and manipulate the load-balancer in real time from an interactive Python interpreter.
  • You can change the proxy to listen on a different port while the application is running (no restart required!).
  • Changes made to the configuration while running are no longer volatile; they are saved to disk (and your old config gets backed up).
  • Work from Apple, Inc. was included in this release, too (they use the old PythonDirector in their Calendaring server). This includes a bug fix and management socket feature.
  • There is a significant jump in performance between this release and the previous one. I believe this to be due to the separation of concerns in the API, but haven't yet confirmed that.

Coming Work

There are a lot of exciting features coming for txLB. Just to name a few:
  • improved weighted algorithm
  • resources-based algorithm (a scheduler that determines the weight of a proxied host by memory, CPU, etc., utilization)
  • smarter proxied host failover and recovery
  • a heartbeat manager
  • txLB-powered application cloning (when started, an app will determine if it needs to run the clone as the managing load-balancer or simply as a proxied host)
  • auto-discovery of balanced hosts
  • proxy fail-over (a balanced host taking over as manager in the event that the manager goes down)
  • ApacheMQ/Stomp integration
  • LDAP/RADIUS authentication

Additionally, I'll be putting together some basic performance metrics contrasting Apache and load-balanced Twisted apps. I will also be comparing previous versions of txLB/PythonDirector with the latest release(s).

Problems with PyPI

I will close this post on a sad note: PyPI used to be an amazing experience for me (a couple years ago, when it was still being called "cheeseshop"). Everything worked as it was supposed to. This hasn't been the case when I've used it recently (over the past few months).

For all that I say about PyPI, I allow for the fact that I may just be missing something, and it may be entirely my fault. That being said, I spent about 3 hours online last night combing through the SIG mail list, the bug list on sourceforge, and blog posts about setuptools and PyPI, and could find no answers to my questions. Well, with the possible exception of a bug report, but it doesn't look like it was confirmed by a PyPI team member, so I'm not sure if it's valid or not.

Here are my issues:
  • When I upload my project using python setup.py [sdist|bdist_egg] upload, no metadata defined in my setup() function is presented on my package's PyPI page. When I click the metadata link, it's only got three sparse lines.
  • When I manually upload from the package's PKG-INFO itself, all the metadata is presented on the page as it should be, with the exception of the long description. It is in plain text instead of ReST (I am checking that it is valid ReST using distutils settings of reporter.halt_level = 5, reporter.report_level = 1, settings.pep_references = False, and settings.trim_footnote_reference_space = None; these are the same settings that Zope Corp uses to verify the ReST that it uploads to PyPI).
  • When I manually edit the long description in the form, I get the same thing: plain text, no ReST.
  • When I upload a package that is displayed properly on PyPI (such as zc.twist; uploaded as one of my projects by changing the name), I get the same problem (this is why I think it might be something that I'm doing wrong...): no metadata, and when I upload the PKG-INFO manually, no ReST.
Why, oh why, cruel fates, does this not work any more? I used to be able to upload to PyPI without any of these issues...


Syndicated 2008-07-05 18:51:00 (Updated 2008-08-01 16:33:01) from Duncan McGreggor

Divmod Tech: Making the "Next Gen" Grade

Last night, after I already posted the latest Twisted in the News, I came across another post that would have made the list had I found it sooner. However, this is a good opportunity to give it a little extra attention.

The title of the post is "Next Gen Web Dev: Playing with Python Twisted/Nevow/Athena" and I gotta say, that made my day :-) Between that post and Colin Alston's post that I mentioned in the News, Nevow had a good week. And people are appreciating it for the right reasons. It may not be the easiest web framework to use and certainly not the best documented, but when you need the flexibility to interact with your (Twisted) web server in particular ways as well as benefit from the functionality that COMET provides, Nevow comes out shining.

It's also refreshing to see new developers entering the community who not only see the potential of these tools (designed with that potential in mind) but are capable of taking advantage of it immediately. If nothing else, the author of that post has motivated me to finally merge the Athena tutorial to trunk in order to bring the publicly available and published content in sync with the new code that's in the branch.

Update: Along similar lines, but with more details, Tristan has provided an excellent write-up for this motivation to use Twisted/Nevow/Axiom/Mantissa. Be sure to check it out!

Syndicated 2008-07-03 12:22:00 (Updated 2008-07-03 20:26:03) from Duncan McGreggor

27 Jun 2008 (updated 1 Aug 2008 at 22:03 UTC) »

So You Want Your Code to Be Asynchronous? A Twisted Interview

Prologue

This blog post was taken from a chat on a Divmod IRC channel a couple of weeks ago. Let's start with my opening comments to JP about what I hoped we could accomplish in the interview.

[1:47pm] oubiwann:exarkun: developers/users have started to understand Twisted, see the benefits of an async paradigm, and want to start writing their code making the best possible use of twisted's event driven nature
[1:48pm] oubiwann:they know how to write code using deferreds, and they're ready to get writing...
[1:48pm] oubiwann:except they're not
[1:48pm] oubiwann:because they don't know python internals
[1:49pm] oubiwann:they don't know what python can actually be used with deferreds because they don't know what requirements there are for python code that it be non-blocking in the reactor
[1:50pm] oubiwann:so you're going to help us understand the pitfalls
[1:50pm] oubiwann:how to make best guesses
[1:50pm] oubiwann:and where to look to get definitive answers

Change Your Mind


Before we go any further, I want to share a few comments and answer two questions: "Who is this for?" and "What do I need to know for this to mean something to me?" This post is for anyone who wants to write async code with Twisted and the answer to the second question is open-ended.

Let me start with what is often interpreted as effrontery: read the source code. Despite how that may have sounded, it's not another RTFM quip. The Twisted source code was specifically designed to be read (well, the code from the last two years, anyway). It was designed to be read, re-read, absorbed, pondered, and turned into living memes in your brain.

Understanding tricky topics in conceptually dense fields such as mathematics, physics, and advanced programming requires immersion. When we commit to really learning something difficult in programming, when we take the big step and dive in, we are surrounded by code. At a conceptual level, I mean that literally: it is a spatial experience. This is not something that is typically taught... the lucky few are able to do this on their own; the rest have to slowly build their intuition through experience in order to get comfortable and be productive in code space.

Our school systems tend to train us along very linear lines: there's a right answer, and a wrong answer. Don't rock the boat. Don't make the teacher uncomfortable. Follow the rules, do your homework, and don't ask too many questions. We carry these habits with us into our professional lives, and it can be quite the task to overcome such a mindset.

Experience is multidimensional. Learning is experience, not rules. When you really jump into this stuff, it will surround you. You will have an experience of the code. For me, that is a mental experience akin to looking at something from the perspective of three dimensions versus two. When I've not dedicated myself to understanding a problem, the domain, or the tools of the domain, everything looks very flat to me. It's hard to muddle through. I feel like I have no depth perception and I get easily frustrated.

When I do take the time, when I make the investment of attention and interest, the problem spaces really do become spaces, ones where my mind has a much greater freedom of movement. It's not smart people who do this kind of thing, it's committed people. Your mind is your world and it's up to you to make it what you want. No one on a mail list or IRC channel can do that for you. They can help you with the rules, provide you with valuable moral support, and guide you along the way. However, a direct experience of the code as a living world of mind comes from taking many brave leaps into the unknown.

Interview in a Blender

Jean-Paul Calderone graciously set aside some time to talk with me about creating asynchronous code in Python, particularly with the Twisted framework. As has been said many times before, simply using Twisted or deferreds doesn't make your code asynchronous. As with any tricky problem, you have to put some time and thought into what you want to accomplish and how you want to accomplish it.

I'm going to post bits of our chat in different sections, but hopefully in a way that makes sense. There's some good information here and some nice reminders. More than anything, though, this should serve as an encouragement to dig deeper.

Why Would I Ever Need Async Code?

There are a few short answers to that:
  • Your application is doing many long-running computations (or runs of a varying/unpredictable length).
  • Your application runs in an unpredictable environment (in particular, I'm thinking of network communications).
  • Your application needs to handle lots of events.
[1:55pm] oubiwann:exarkun: so, what's the first question a developer should ask themselves as they begin writing their Twisted application/library, txFoo?
[1:55pm] dash:"would everyone be better off if I just stopped now?"
[1:55pm] exarkun:oubiwann: I'm not sure I completely understand the target audience yet
[1:56pm] exarkun:my question is kind of like dash's question
[1:56pm] exarkun:why is this person doing this?
[1:57pm] oubiwann:exarkun: the audience is the group of software developers that are new to twisted, have a basic grasp of deferreds, and want their code to be properly async (using Twisted, of course)
[1:57pm] oubiwann:they don't have anything more than a passing familiarity of the reactor
[1:57pm] oubiwann:they don't know python internals

Protocols, Servers, and Clients, Oh My!

If your application can use what's already in Twisted, you're on easy street :-) If not, you may have to write your own protocols.

Let's get back to the chat:

[1:57pm] exarkun:So `foo´ is... a django-based web application?
[1:58pm] exarkun:... a unit conversion library?
[1:58pm] oubiwann:sure, that works
[1:58pm] oubiwann:unit conversion lib
[1:58pm] oubiwann:(which could be used in Django)
[1:58pm] exarkun:at a first guess, I'd say that there's probably no work to do
[1:58pm] exarkun:how could you have a unit conversion library that's not async?
[1:58pm] exarkun:that'd take some work
[1:59pm] oubiwann:let's say that the unit calculations take a really long time to run
[1:59pm] exarkun:Hm. :)
[1:59pm] idnar:you'd probably have to spawn a new process then :P
[2:00pm] exarkun:basically. probably the only other reasonable thing is for twisted-using code to use the unit conversion api with threads.
[2:00pm] exarkun:so then the question to ask "is my code threadsafe?"
[2:00pm] oubiwann:what about a messaging server
[2:00pm] oubiwann:that sends jobs out to different hosts for calcs
[2:01pm] dash:that's not going to be a tiny example
[2:01pm] exarkun:for that, the job is probably to take all the parsing and app logic and make sure it's separate from the i/o
[2:01pm] exarkun:so "am I using the socket/httplib/urllib/ftplib/XXXlib module?"
[2:03pm] exarkun:is another question for the developer to ask himself
[2:06pm] exarkun:they probably need to find the api in twisted that does what they were using a blocking api for, and switch to it
[2:07pm] exarkun:that might mean implementing a protocol, or it might mean using getPage or something
[2:07pm] exarkun:and pushing the async all the way from the bottom up to the top (maybe not in that direction)
[2:08pm] oubiwann:by "bottom" are you referring to protocol/wire-level stuff?
[2:08pm] oubiwann:exarkun: and by "top" their module's API?
[2:09pm] exarkun:yes
[2:10pm] exarkun:oubiwann: the point being, can't have a sync api implemented in terms of an async one (or at least the means by which to do so are probably beyond the scope of this post)
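
JP's advice to keep the parsing and app logic separate from the I/O can be sketched in plain Python. The `LineParser` class below is a made-up example, not anything from Twisted: it knows nothing about sockets, so the same object could be fed from a Twisted protocol's dataReceived method, from a test, or from a file.

```python
class LineParser(object):
    """Protocol logic only: turns arbitrary byte chunks into complete lines."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """Accept one chunk and return the list of lines it completed."""
        self._buffer += data
        lines = self._buffer.split(b"\r\n")
        # The last element is the trailing partial line (possibly empty);
        # keep it buffered until the rest of it arrives.
        self._buffer = lines.pop()
        return lines

# Chunks can arrive split anywhere, as they do on a real network connection.
parser = LineParser()
first = parser.feed(b"convert 3 ft")
second = parser.feed(b" m\r\nconvert 1 mi km\r\n")
```

Because the parser never blocks and never touches the network, swapping a blocking transport for an asynchronous one does not require rewriting it.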

Processes

We didn't really talk about this one. Idnar mentioned spawning processes briefly, but the discussion never really returned there. I imagine that this is fairly well understood and may not merit as much pondering as such things as threads.
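
For what it's worth, the basic idea behind Idnar's suggestion looks something like the following when written with only the standard library. This is a sketch of the isolation, not of how to do it in a Twisted application: subprocess.run blocks until the child exits, whereas a Twisted app would use reactor.spawnProcess so that the reactor keeps running. The calculation in `CHILD_CODE` is a hypothetical stand-in for real work.

```python
import subprocess
import sys

# Hypothetical stand-in for a long-running, CPU-bound calculation.
CHILD_CODE = "print(sum(i * i for i in range(1000)))"

# Run the calculation in a separate Python interpreter and collect its output.
# Note: this call waits for the child to finish before returning.
completed = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    stdout=subprocess.PIPE,
    check=True,
)
result = int(completed.stdout.decode().strip())
```

The child has its own interpreter and its own memory, so none of the thread-safety questions discussed next apply to it.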

Which brings us to...

Threads

Thread safety is the number one concern when trying to provide an asynchronous API for synchronous code. Here are some starters for background information:
Discussing threads consumed the rest of the interview:

[2:12pm] oubiwann:exarkun: so, back to your comment about "is it threadsafe" (if they are doing long-running python calculations)
[2:13pm] oubiwann:what are the problems we face when we don't ask ourselves this question?
[2:13pm] oubiwann:what happens when we try to run non-threadsafe code in the Twisted reactor?
[2:14pm] exarkun:The problem happens when we try to run non-threadsafe code in a thread to keep it from blocking the reactor thread.
[2:16pm] oubiwann:so non-thread safe code run in deferredToThread could...
[2:16pm] oubiwann:have data inconsistencies which cause non-deterministic bugs?
[2:16pm] dash:have the usual effects of running non-threadsafe code
[2:16pm] exarkun:have any problem that using non-thread safe code in a multithreaded way using any other threading api could have
[2:16pm] dash:like that, yeah
[2:17pm] exarkun:inconsistencies, non-determinism, failure only under load (ie, only after you deploy it), etc
[2:18pm] dash:i smell a research paper
[2:18pm] oubiwann:so, next question: how does one determine that python code is thread safe or not?
[2:19pm] glyph:a research *paper*?
[2:19pm] exarkun:heh
[2:19pm] glyph:research *industry* more like
[2:19pm] oubiwann:exarkun: or, if not determine, at least ask the right sorts of questions to get the developer thinking in the right direction
[2:20pm] dash:glyph: Heh heh.
[2:20pm] exarkun:oubiwann: well, is there shared mutable state? if you're calling `f´ in a thread, does it operate on objects not passed to it as arguments?
[2:20pm] exarkun:oubiwann: if not, then it's probably safe - although don't call it twice at the same time with the same arguments
[2:20pm] exarkun:oubiwann: if so, who knows
[2:20pm] dash:with the same mutable arguments, anyway
[2:23pm] oubiwann:exarkun: so, because python and/or the os doesn't do anything to make file operations atomic, I'm assuming that reading and writing file data is not threadsafe?
[2:24pm] exarkun:don't use the same python file object in multiple threads, yes.
[2:24pm] exarkun:but certain filesystem operations are atomic, and you can manipulate the same file from multiple threads (or processes) if you know what you're doing
[2:25pm] oubiwann:what about C extensions in Python? any general rules there?
[2:25pm] oubiwann:other than "if they're threadsafe, you can use them"
[2:25pm] exarkun:that's about all you can say with certainty
[2:26pm] exarkun:for dbapi2 modules, look at the `threadlevel´ attribute. that's about the most general rule you can express.
[2:26pm] exarkun:there's some stuff other than objects that gets shared between threads too that might be worth mentioning
[2:26pm] exarkun:at least to get people to think about non-object state
[2:27pm] oubiwann:such as?
[2:27pm] exarkun:like, process working directory, or uid/gid
[2:30pm] • oubiwann looks at deferToThread...
[2:31pm] • oubiwann looks at reactor.callInThread
[2:33pm] • oubiwann looks at ReactorBase.threadpool
[2:38pm] oubiwann:hrm
[2:38pm] oubiwann:internesting
[2:39pm] oubiwann:never took the time to trace that all the way back to (and then read) the Python threading module
[2:40pm] oubiwann:exarkun: are there any python modules well known for their lack of threadsafety?
[2:42pm] exarkun:oubiwann: I dunno about "well known"
[2:42pm] exarkun:oubiwann: urllib isn't threadsafe
[2:42pm] exarkun:neither is urllib2
[2:43pm] exarkun:apparently random.gauss is not thread-safe?
[2:43pm] exarkun:you generally start with the assumption that any particular api is not thread-safe
[2:44pm] exarkun:and then maybe you can demonstrate to your own satisfaction that it's thread-safe-enough for your purposes
[2:44pm] exarkun:or you can demonstrate that it isn't
[2:45pm] exarkun:grepping the stdlib for 'thread' and 'safe' is interesting
[2:45pm] oubiwann:I wonder if the stuff available in math is threadsafe....
[2:45pm] oubiwann:exarkun: heh, I was just getting ready to dl the source so I could do that :-)
[2:46pm] exarkun:the math module probably is threadsafe
[2:46pm] exarkun:maybe that's another generalization
[2:46pm] exarkun:stdlib C modules are probably threadsafe
[2:49pm] oubiwann:hrm, looks like part of random isn't threadsafe
[2:51pm] oubiwann:random.random() is safe, though
[2:53pm] oubiwann:exarkun: I really appreciate you taking the time to discuss this
[2:53pm] exarkun:np
[2:53pm] oubiwann:and thanks to dash, glyph, and idnar for contributing to the discussion :-)
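
JP's first test for thread safety was "is there shared mutable state?" Here is a small standard-library sketch of what that looks like and of one common remedy, a lock. The `SharedCounter` class and the `work` function are invented for the illustration; code handed to deferToThread faces the same question.

```python
import threading

class SharedCounter(object):
    """Shared mutable state: every thread updates the same `value`."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.value += 1

def work(counter, times):
    """Increment the shared counter `times` times."""
    for _ in range(times):
        counter.increment()

counter = SharedCounter()
threads = [threading.Thread(target=work, args=(counter, 10000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = counter.value
```

Take the lock away and increments from different threads can overwrite each other, which is exactly the kind of bug that tends to show up only under load.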

Summary

Concurrency is hard. If you want to use threads and you want to do it right and you want to avoid pitfalls and have bug-free code, you're going to be doing some head-banging. If you want to use an asynchronous framework like Twisted, you're going to have to bend your mind in a different way.

No matter what school of thought you follow for any given project, the best results will come with full commitment and immersion. Don't fear the learnin' -- embrace the pain ;-)

Update: Special thanks to Piet Delport for sorting out my endless typos!


Syndicated 2008-06-27 08:51:00 (Updated 2008-08-01 16:30:22) from Duncan McGreggor

25 Jun 2008 (updated 29 Sep 2008 at 06:03 UTC) »

Safari 3.1.1 Installer Hosed on OS X 10.5.3

I recently tried updating my Safari to the latest version, only to discover from here and here that Apple seems to have intentionally made this a 10.5.2-only update. I looked in the "Distribution" script and confirmed that this was, in fact, the case. The obvious symptom of this was that the installer told me I couldn't install Safari on any of my drives. Nice.

On those forum posts, I also discovered this great tool: Pacifist. It's been on my backburner list for a while to find a tool that could open up and extract Mac OS X packages, so for that alone I was delighted. When combined with PackageMaker, I was able to create my own installer. Even better.

If this is useful for anyone else, I've put it up here: Safari311UpdLeo_Divmod.pkg. Do note, however, that this installer has no brains: it just puts the files where they should be. It also doesn't check for your system version, so it could potentially really screw things up. Neither I, the Divmod community, nor Divmod, Inc. are responsible in any way if this installer takes your machine to the knacker's yard. However, I am using it on 10.5.3 with no issues (so far).


Syndicated 2008-06-25 21:36:00 (Updated 2008-09-29 05:28:55) from Duncan McGreggor

21 Jun 2008 (updated 1 Aug 2008 at 22:03 UTC) »

txLoadBalancer

Well today was a flurry of activity... pulled an all-nighter whipping a python load balancer into shape after some late-afternoon discussions on #divmod.

At Divmod, we're going to be labbing out some distributed services experiments with twistd servers, and one set of those experiments involves "developer friendly" load balancing. JP suggested that I take a look at how PyDirector works and see if we could use that. Which was actually interesting in a full-circle kind of way: I worked on PyDirector when I was at PBS, ages ago, where I wrote a weighted lb algorithm for it.

Jumping into the code again after a 5-year hiatus was like seeing an old friend :-)

All tonight, I worked on the following branches:
txLoadBalancer 0.9.1 and 1.0.1 are up on PyPI in the usual place.

I did lots of manual functional testing for each branch tonight, but I didn't do any TDD. While I'm still playing with it, I'll probably start adding tests as bugs crop up (BDT), and as it gets more serious I'll go fully into TDD and fill in what's missing at that point.

Tonight's mad rush was actually a great deal of fun. It's been a while since I've had the opportunity to plow through a bunch of code like that, and I enjoyed myself to near exhaustion :-) I don't think I'll be able to get to sleep tonight (er, this morning), due to the endless thinking about all the ways in which I want to use this code, mutate it, and... well, I better leave some surprises for later!

Update: I've edited the links for the latest micro-releases that fixed some issues with setup.py.

Update 2: Thanks to the heads-up in the comments from Kapil, I've patched txLoadBalancer trunk with the changes from Apple (David Reid and Wilfredo Sanchez).


Syndicated 2008-06-21 11:04:00 (Updated 2008-08-01 16:33:26) from Duncan McGreggor

21 Jun 2008 (updated 1 Aug 2008 at 22:03 UTC) »

Async Batching with Twisted: A Walkthrough

While drafting a Divmod announcement last week, I had a quick chat with a dot-bomb-era colleague of mine. Turns out, his team wants to do some cool asynchronous batching jobs, so he's taking a look at Twisted. Because he's a good guy and I like Twisted, I drew up some examples for him that should get him jump-started. Each example covers something in more depth than its predecessor, so the set is probably generally useful. Thus this blog post :-)

I didn't get a chance to show him a DeferredSemaphore example nor one for the Cooperator, so I will take this opportunity to do so. For each of the examples below, you can save the code as a text file and call it with "python filename.py", and the output will be displayed.

These examples don't attempt to give any sort of introduction to the complexities of asynchronous programming nor the problem domain of highly concurrent applications. Deferreds are covered in more depth here and here. However, hopefully this mini-howto will inspire curiosity about those :-)

Example 1: Just a DeferredList

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

def listCallback(results):
    print results

def finish(ign):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d2 = getPage('http://yahoo.com')
    dl = DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

This is one of the simplest examples you'll ever see for a deferred list in action. Get two deferreds (the getPage function returns a deferred) and use them to create a deferred list. Add callbacks to the list, garnish with a lemon.

Example 2: Simple Result Manipulation

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

def listCallback(results):
    for isSuccess, content in results:
        print "Successful? %s" % isSuccess
        if isSuccess:
            # On failure, content is a Failure object and has no length.
            print "Content Length: %s" % len(content)

def finish(ign):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d2 = getPage('http://yahoo.com')
    dl = DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

We make things a little more interesting in this example by doing some processing on the results. For this to make sense, just remember that a callback gets passed the result when the deferred action completes. If we look up the API documentation for DeferredList, we see that its callback is passed a list of (success, result) tuples, where success is a Boolean and result is the result of a deferred that was put in the list (remember, we've got two layers of deferreds here!).
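
To make the shape of that list concrete without touching the network, here is a minimal sketch that walks a hand-built results list. The sample data, the summarize name, and the plain exception standing in for a Twisted Failure are all assumptions made for illustration; it is written so that it also runs under Python 3.

```python
# Hypothetical stand-in for what a DeferredList passes to its callback:
# one (success, result) tuple per deferred, in the order they were given.
# A plain exception takes the place of a Twisted Failure here.
sampleResults = [
    (True, "<html>first page</html>"),
    (False, IOError("connection refused")),
]

def summarize(results):
    """Return one report line per (success, result) tuple."""
    lines = []
    for isSuccess, result in results:
        if isSuccess:
            lines.append("ok: %d bytes" % len(result))
        else:
            # A failed entry carries the error, not page content,
            # so asking for its length would be a mistake.
            lines.append("failed: %s" % result)
    return lines

for line in summarize(sampleResults):
    print(line)
```

The thing to notice is that the second item of each tuple only means "page content" when the first item is True.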

Example 3: Page Callbacks Too

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

def pageCallback(result):
    return len(result)

def listCallback(result):
    print result

def finish(ign):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d1.addCallback(pageCallback)
    d2 = getPage('http://yahoo.com')
    d2.addCallback(pageCallback)
    dl = DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

Here, we mix things up a little bit. Instead of doing processing on all the results at once (in the deferred list callback), we're processing them when the page callbacks fire. Our processing here is just a simple example of getting the length of the getPage deferred result: the HTML content of the page at the given URL.

Example 4: Results with More Structure

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

def pageCallback(result):
    data = {
        'length': len(result),
        'content': result[:10],
    }
    return data

def listCallback(result):
    for isSuccess, data in result:
        if isSuccess:
            print "Call to server succeeded with data %s" % str(data)

def finish(ign):
    reactor.stop()

def test():
    d1 = getPage('http://www.google.com')
    d1.addCallback(pageCallback)
    d2 = getPage('http://yahoo.com')
    d2.addCallback(pageCallback)
    dl = DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

A follow-up to the last example, here we put the data in which we are interested into a dictionary. We don't end up pulling any of the data out of the dictionary; we just stringify it and print it to stdout.

Example 5: Passing Values to Callbacks

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

def pageCallback(result, url):
    data = {
        'length': len(result),
        'content': result[:10],
        'url': url,
    }
    return data

def getPageData(url):
    d = getPage(url)
    # Extra positional arguments are passed to the callback after the result.
    d.addCallback(pageCallback, url)
    return d

def listCallback(result):
    for isSuccess, data in result:
        if isSuccess:
            print "Call to %s succeeded with data %s" % (data['url'], str(data))

def finish(ign):
    reactor.stop()

def test():
    d1 = getPageData('http://www.google.com')
    d2 = getPageData('http://yahoo.com')
    dl = DeferredList([d1, d2])
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

After all this playing, we start asking ourselves more serious questions, like: "I want to decide which values show up in my callbacks" or "Some information that is available here, isn't available there. How do I get it there?" This is how :-) Just pass the parameters you want to your callback. They'll be tacked on after the result (as you can see from the function signatures).

In this example, we needed to create our own deferred-returning function, one that wraps the getPage function so that we can also pass the URL on to the callback.

Example 6: Adding Some Error Checking

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet.defer import DeferredList

urls = [
    'http://yahoo.com',
    'http://www.google.com',
    'http://www.google.com/MicrosoftRules.html',
    'http://bogusdomain.com',
]

def pageCallback(result, url):
    data = {
        'length': len(result),
        'content': result[:10],
        'url': url,
    }
    return data

def pageErrback(error, url):
    # Returning a plain dict (rather than the failure) turns the error
    # into an ordinary result for anything further down the chain.
    return {
        'msg': error.getErrorMessage(),
        'err': error,
        'url': url,
    }

def getPageData(url):
    d = getPage(url, timeout=5)
    d.addCallback(pageCallback, url)
    d.addErrback(pageErrback, url)
    return d

def listCallback(result):
    for ignore, data in result:
        if 'err' in data:
            print "Call to %s failed with data %s" % (data['url'], str(data))
        else:
            print "Call to %s succeeded with data %s" % (data['url'], str(data))

def finish(ign):
    reactor.stop()

def test():
    deferreds = []
    for url in urls:
        d = getPageData(url)
        deferreds.append(d)
    dl = DeferredList(deferreds, consumeErrors=True)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

As we get closer to building real applications, we start getting concerned about things like catching/anticipating errors. We haven't added any errbacks to the deferred list, but we have added one to each page deferred. We've added more URLs and put them in a list to ease the pains of duplicate code. As you can see, two of the URLs should return errors: one a 404, and the other a domain that doesn't resolve (we'll see this as a timeout).
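
If the callback/errback pairing feels unfamiliar, it plays roughly the role that try/except plays in blocking code. Here is a rough synchronous sketch of what getPageData does above. It is not Twisted code: fakeFetch and its canned response are hypothetical stand-ins for getPage, and a plain IOError stands in for the failure.

```python
def fakeFetch(url):
    """Hypothetical stand-in for getPage: return a body or raise."""
    pages = {'http://example.com/': '<html>hello</html>'}
    if url not in pages:
        raise IOError("404 Not Found")
    return pages[url]

def getPageData(url):
    try:
        result = fakeFetch(url)
    except IOError as error:
        # Plays the role of pageErrback.
        return {'msg': str(error), 'err': error, 'url': url}
    # Plays the role of pageCallback.
    return {'length': len(result), 'content': result[:10], 'url': url}

for url in ['http://example.com/', 'http://example.com/missing']:
    data = getPageData(url)
    status = 'failed' if 'err' in data else 'succeeded'
    print("Call to %s %s with data %s" % (data['url'], status, data))
```

Either way, every URL ends up as a dictionary, and only the failed ones carry an 'err' key, which is what listCallback checks for.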

Example 7: Batching with DeferredSemaphore

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet import defer

maxRun = 1

urls = [
    'http://twistedmatrix.com',
    'http://twistedsoftwarefoundation.org',
    'http://yahoo.com',
    'http://www.google.com',
]

def listCallback(results):
    for isSuccess, result in results:
        if isSuccess:
            # On failure, result is a Failure object and has no length.
            print len(result)

def finish(ign):
    reactor.stop()

def test():
    deferreds = []
    sem = defer.DeferredSemaphore(maxRun)
    for url in urls:
        # sem.run waits for a free slot before calling getPage(url).
        d = sem.run(getPage, url)
        deferreds.append(d)
    dl = defer.DeferredList(deferreds)
    dl.addCallback(listCallback)
    dl.addCallback(finish)

test()
reactor.run()

These last two examples are for more advanced use cases. As soon as the reactor starts, deferreds that are ready start "firing" -- their "jobs" start running. What if we've got 500 deferreds in a list? Well, they all start processing. As you can imagine, this is an easy way to run an accidental DoS against a friendly service. Not cool.

For situations like this, what we want is a way to run only so many deferreds at a time. This is a great use for the deferred semaphore. When I repeated runs of the example above, the content lengths of the four pages returned after about 2.5 seconds. With the example rewritten to use just the deferred list (no deferred semaphore), the content lengths were returned after about 1.2 seconds. The extra time is due to the fact that I (for the sake of the example) forced only one deferred to run at a time, obviously not what you're going to want to do for a highly concurrent task ;-)

Note that without changing the code and only setting maxRun to 4, the time it takes to get the content lengths is about the same as with the plain deferred list, averaging 1.3 seconds for me (there's a little more overhead involved when using the deferred semaphore).

One last subtle note (in anticipation of the next example): the for loop creates all the deferreds at once; the deferred semaphore simply limits how many get run at a time.
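
To watch a limit like this being enforced without hitting any real servers, here is a rough analogue using asyncio.Semaphore from the Python 3 standard library. It is not Twisted's DeferredSemaphore, just the same idea in a different framework; fakeFetch, the short sleep that stands in for network time, and the stats counters are all assumptions for illustration.

```python
import asyncio

maxRun = 2
urls = ['url-%d' % i for i in range(6)]  # placeholder names, never fetched

async def fakeFetch(url, sem, stats):
    # Only maxRun coroutines may be inside this block at once.
    async with sem:
        stats['running'] += 1
        stats['peak'] = max(stats['peak'], stats['running'])
        await asyncio.sleep(0.01)  # stands in for network time
        stats['running'] -= 1
    return url

async def main():
    sem = asyncio.Semaphore(maxRun)
    stats = {'running': 0, 'peak': 0}
    # As in the example above, every job is created up front;
    # the semaphore only limits how many run at a time.
    jobs = [fakeFetch(url, sem, stats) for url in urls]
    results = await asyncio.gather(*jobs)
    return results, stats['peak']

results, peak = asyncio.run(main())
print("fetched %d urls, peak concurrency %d" % (len(results), peak))
```

Raising or lowering maxRun changes the peak but not the results, which still come back in the order the jobs were listed.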

Example 8: Throttling with Cooperator

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.internet import defer, task

maxRun = 2

urls = [
    'http://twistedmatrix.com',
    'http://twistedsoftwarefoundation.org',
    'http://yahoo.com',
    'http://www.google.com',
]

def pageCallback(result):
    print len(result)
    return result

def doWork():
    for url in urls:
        # getPage is not called until the generator is asked for this item.
        d = getPage(url)
        d.addCallback(pageCallback)
        yield d

def finish(ign):
    reactor.stop()

def test():
    deferreds = []
    coop = task.Cooperator()
    work = doWork()
    for i in xrange(maxRun):
        # Every coiterate call pulls from the same shared generator.
        d = coop.coiterate(work)
        deferreds.append(d)
    dl = defer.DeferredList(deferreds)
    dl.addCallback(finish)

test()
reactor.run()

This is the last example for this post, and it's probably the most arcane :-) This example is taken from JP's blog post from a couple years ago. Our observation in the previous example about the way that the deferreds were created in the for loop and how they were run is now our counter example. What if we want to limit when the deferreds are created? What if we're using deferred semaphore to create 1000 deferreds (but only running them 50 at a time), but running out of file descriptors? Cooperator to the rescue.

This one is going to require a little more explanation :-) Let's see if we can move through the justifications for the strangeness clearly:
  1. We need the deferreds to be yielded so that each one (and the request behind it) is not created until it's actually needed (as opposed to the situation in the deferred semaphore example where all the deferreds were created at once).
  2. We need to call doWork before the for loop so that the generator is created outside the loop and shared by every coiterate call, thus making our way through the URLs just once (calling it inside the loop would give each iteration its own generator over all four URLs).
  3. We removed the result-processing callback on the deferred list because coop.coiterate swallows our results; if we need to process, we have to do it with pageCallback.
  4. We still use a deferred list as the means to determine when all the batches have finished.
This example could have been written much more concisely: the doWork function could have been left in test as a generator expression and test's for loop could have been a list comprehension. However, the point is to show very clearly what is going on.
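
Point 2 in the list above is the easiest to get wrong, so here is a small sketch with no Twisted in it at all. Two round-robin consumers stand in for the two coiterate calls, and a log records each time a unit of work is created. The roundRobin helper, the log lists, and the one-letter URLs are hypothetical; Cooperator's real scheduling is more involved than this.

```python
urls = ['a', 'b', 'c', 'd']  # placeholders for the real URLs

def doWork(log):
    for url in urls:
        # In the real example this is where getPage would be called, so
        # a request only exists once somebody asks for the next item.
        log.append(url)
        yield url

def roundRobin(iterators):
    """Take one item from each iterator in turn until all are exhausted."""
    active = list(iterators)
    while active:
        for it in list(active):
            try:
                next(it)
            except StopIteration:
                active.remove(it)

# One generator shared by both consumers: each URL is handled once.
sharedLog = []
work = doWork(sharedLog)
roundRobin([work, work])

# A fresh generator per consumer: every URL is handled twice.
separateLog = []
roundRobin([doWork(separateLog), doWork(separateLog)])

print(sharedLog)
print(separateLog)
```

With the shared generator, the consumers split the URLs between them; with one generator each, both walk the whole list, which is the duplicated work that creating the generator outside the loop avoids.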

I hope these examples were informative and provide some practical insight on working with deferreds in your Twisted projects :-)

Syndicated 2008-06-20 07:08:00 (Updated 2008-08-01 16:30:44) from Duncan McGreggor

Syndicated 2008-06-17 00:08:00 (Updated 2008-06-20 07:48:52) from Duncan McGreggor

17 Jun 2008 (updated 12 Aug 2008 at 22:07 UTC) »

The Future of Personal Data

In a recent post about ULS systems, I said this:
The balance of power, from individuals all the way to the top of
whatever organizations exist in the future will rest in information.
Not like it is today, however. The "information economy" of the today
(+/- 10 years) will look like kids' games and playgrounds. The
information economies this will evolve into will be so completely
integrated into human existence that they will resemble the basic
necessities like water and food.
I'm not going to focus on the ULS systems topic in this post, but there is a very deep connection between privacy, personal data and all things ULS. Any thoughts of a ULS system should be coupled with how this will impact the system's users and their data. Any thought of our personal data's future existence should include the anticipated future of computing: ULS systems.

Inside and Out

In a nutshell, here's how things look:
  • Yesterday: Paid Services - You want something, you buy it. Demographic research is expensive and mostly outsourced.
  • Today: Free Services - You want something, companies give it to you for free... in exchange for your demographic data.
  • Tomorrow: Information Economy - You want something, you leverage the value of your information in brokering the service deals that mean the most to you.
What do we have right now? Companies are fighting each other over who gets to have our data for free. Yay, free stuff! We used to have to pay for that sort of thing! But paying for people to hold your data was the old, old world. Having them do it for free is the old world. Here's the new world: They pay you.

Why would they do that? Why would things shift from the current status quo? The value of personal information.

There are many ways to assess the value of personal information, but let's look at a few from the perspective of large organizations (entailing everything from government to business). Simplistically, we can assign value to a single individual's data based on the value of a large collection of many individuals' data. The more participants, the greater the value of the whole, and therefore the greater the value for each individual's data. This perspective is limited because it treats data very statically. The data may change, but in relation to the system it's "acquired" and inside as opposed to "for sale" and outside.

We Are the Markets

But the value of our data is not defined simply by the presence of bits or membership in a valued data conglomerate. Our data is not just our emails, our medical records, our purchasing trends, nor our opinions about local and national politics. Like an organism moving through an ecosystem, our data is dynamic and living; it is the very trace we leave in the world around us, be it digital or otherwise.

Any part of our lives that is ever recorded in "the system" provides data and comprises part of our movements through this system. Our traces through this digital ecosystem impact it, change it, shape its future direction. The collective behaviours (not just collective data) are immensely valuable to organizations. Their value is on-going and growing, with accrued, compounded interest.

Static data bits seem like property to us: you can buy them, you can sell them, you can store them somewhere. But moving, living data... that's a different story. That's not a buy-once commodity; ownership of that might be tantamount to slavery in a future, information-based economy. However, organizations might opt to lease it, or individuals might turn the past back on the future and offer license agreements to organizations.

More likely, though, individuals will form co-ops or communities (we have already seen this happen extensively in today's Internet) with shared mutual interest. Seeing how a group entity with shared values has a larger effect on the system than single individuals, data from such groups would likely be much more interesting and number-crunch-worthy. The greater power a group has to perturb systems' economic or political trends, the more valuable that group's data will be to other groups.

In addition, I'm sure there'd be all sorts of tiered "offerings" from individuals and groups: the juicier/more detailed the data, the higher the premium offered. The changes this will introduce to markets (global and local), legal systems, and political organizations are probably barely imaginable right now. But what would it take to get us there? What would it take for my data and your data to be valuable enough to transform the world and make Wall Street look like an old-time, irrelevant boys club?

Privacy

One thing: a fanatical devotion to privacy, pure and simple. Security and a fanatical devotion to privacy. Two things! Okay, reliability, security and a fanatical devotion to privacy. Three things!

Monty Python references aside, an economy that values the data of individuals and groups can only arise if that data is secure. If we live in a topsy-turvy world where the Government, MPAA, RIAA, the Russian Mafia, and Big Hosting Company are pirating our data, then we're hosed. However, if our data is secure and contracts are effective, then we will have a world where data is the currency. There are an incredible number of hurdles to overcome in order for this to happen, however.
  • The System - we need a system where user data can be tracked, recorded, and analyzed, and there's enough of it to matter
  • Storage - we need our own, personal banks for our data (irrefutable ownership rights and complete power over that data)
  • Transactions - we need a mechanism for engaging in secure data transactions
  • Identity - when making a transaction, we need to be able to prove unequivocally that we are who we say we are
  • Anonymity - we need to decouple activity in the system and identity, thus requiring organizations to come to us (or our groups) to get the definitive data they need
  • Recourse - we need a legal system and effective laws that protect the individuals and groups against the crimes of data-hungry organizations; fortunately, we will have had years of established precedent protecting the sellers from the buyers... oh my, how the tables turn!
And that's just off the top of my head. There's got to be tons of stuff which hasn't even occurred to me.

Closing Thoughts

Information will be as essential for us as water, yet there is a very interesting divergence from the example of a hydrological empire: each individual is the producer of some of that metaphorical water. By virtue of this difference, we hold the keys of the empire. We will be more a part of the economic and political powerbases than we have ever been at any time in human history.

Of course, that means that we've got to get ready :-) This is already being done in many different ways. Everything from community housing cooperatives to small, co-op banks; from capabilities-based programming models to secure online transactions. Like the next 20 years of research needed for ULS systems to become a reality, we've got just as much work to do in order to guarantee our place in the economies of the future.


Syndicated 2008-06-16 20:02:00 (Updated 2008-08-12 18:08:37) from Duncan McGreggor
