Older blog entries for titus (starting at number 484)

Tracy Teal's PyCon '14 submission: How I learned to stop worrying and love matplotlib

Note: This is a proposal being submitted by Tracy Teal (@tracykteal) for PyCon '14. I suggested she post it here for feedback, because she does not have her own blog. --titus


TITLE: How I learned to stop worrying and love matplotlib

CATEGORY: Science

DURATION: I prefer a 30 minute time slot

DESCRIPTION:

I was living a dual life, programming in Python and plotting in R, too worried to move R code to Python. Then I decided to make 100 plots in Python in 100 days. I documented the journey on a website, posting the plot & code in IPython Notebooks and welcoming comments. This talk will summarize lessons learned, including technical details, the process and the effects of learning in an online forum.

AUDIENCE:

Scientists interested in statistical computing with Python, those interested in learning more about NumPy and matplotlib.

PYTHON LEVEL: Beginner

OBJECTIVES

Attendees will see use cases for numpy and matplotlib, as well as one approach on how to succeed (or fail) at challenging yourself to learn something new.

DETAILED ABSTRACT:

Many scientific programmers use multiple languages for different applications, primarily because specific packages are available for their standard use cases or they're working with existing code. While these languages work well, it can limit the ability to integrate different components of a project in to one framework. The reasons not to use numpy, matplotlib and pandas is therefore often not technical, but the effort required to learn or develop a new approach when there are already so many demands on a scientist's time can be inhibiting. Additionally the development of new packages or integrated code bases are often not as valued in the academic structure.

I am one of those scientists, a microbial ecologist and bioinformatician, writing most of my code in Python and teaching it in Software Carpentry, but doing all my statistics in R. I like R and the R community and in particular, the ecological statistics package, vegan, so I haven’t felt the need to switch, but I realized my reluctance was mainly because I didn't know how to do the same things in Python, not that R was necessarily better for my workflow. So, I figured I should at least give it a try, but it was always a task on the back burner and not particularly interesting. Inspired by Jennifer Dewalt's 180 web sites in 180 days, the idea of making something in order to learn particular skills and the process of deliberate practice, I decided to start a project 100 plots in 100 days. In this project I will make a plot every (week)day for 100 days using Python. Plots encompass y=x to visualizations of multivariate statistics and genomic data. I use matplotlib, numpy and pandas, make the plots in IPython Notebook and post the notebook and comments about the process of creating that plot on my blog. I welcome comments to get more immediate feedback on the process.

This talk will focus on lessons learned during the project, both technical and about the process of learning - the expected and unexpected outcomes and how the involvement of community impacts practice.

OUTLINE

  • Intro (5 min)
  • Who am I?
  • Why this project?
  • Show the website
  • Lessons learned (18 min)
  • Technical lessons learned
  • numpy/matplotlib tricks or tips
  • any new statistical algorithms developed for numpy
  • Lessons learned about learning
  • Was this process good for learning something new? Why/ why not?
  • Deliberate practice has been shown to be the most effective way to get good at something. It involves working at something and getting feedback. Was this approach good for that?
  • Social aspects
  • Response to the project
  • Social pressures and accountability - does saying you'll do something publicly make you more likely to do it
  • Concluding remarks (2 min)
  • Would I do something like this again? Would I recommend it?
  • Questions (5 min)

ADDITIONAL NOTES

  • I'm just starting this project, inspired by both a recent Hacker News post on Jennifer Dewalt's 180 web sites in 180 days and the opportunity to present at PyCon. As such, at review time, I'll only be beginning the journey. Success for me for this project would be following through on the 100 plots in 100 (week)days, learning the fundamentals of numpy and matplotlib and making some neat and useful plots along the way. I'll share all the code for the statistics and each plot on the website, however ugly it may be. This could fail too. I could not be able to get beyond variations on a y=x plot and write terrible code. This talk will document both the success and the failures, as I hope I and others can learn from both. I do understand the risk of accepting a talk like this where I can't yet tell you what the lessons learned will be.
  • This would be my first time speaking at PyCon. I've spoken at many scientific conferences, been selected as an Everhart Lecturer at Caltech and received "Best Presentation" award at conferences. I've also been an instructor for five Software Carpentry bootcamps, including one for Women in Science and Engineering.

ADDITIONAL REQUIREMENTS

None

Syndicated 2013-09-13 22:00:00 from Living in an Ivory Basement

Data intensive biology in the cloud: instrumenting ALL the things

Here's a draft PyCon '14 proposal. Comments and suggestions welcome!


Title: Data intensive biology in the cloud: instrumenting ALL the things

Description: (400 ch)

Cloud computing offers some great opportunities for science, but most cloud computing platforms are both I/O and memory limited, and hence are poor matches for data-intensive computing. After four years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source, of course.

Audience: People who are interested in things.

Python level: Beginner/intermediate.

Objectives:

Attendees will

  • learn a bit about I/O and big-memory performance in demanding situations;
  • see performance numbers for various cloud platforms;
  • hear about why some people can't use Hadoop to process large amounts of data;
  • gain some insight into the sad state of open science;

Detailed abstract:

The cloud provides great opportunities for a variety of important computational science challenges, including reproducible science, standardized computational workflows, comparative benchmarking, and focused optimization. It can also help be a disruptive force for the betterment of science by eliminating the need for large infrastructure investments and supporting exploratory computational science on previously challenging scales. However, most cloud computing use in science so far has focused on relatively mundane "pleasantly parallel" problems. Our lab has spent many moons addressing a large, non-parallelizable "big data/big graph" problem -- sequence assembly -- with a mixture of Python and C++, some fun new data structures and algorithms, and a lot of cloud computing. Most recently we have been working on open computational "protocols", worfklows, and pipelines for democritizing certain kinds of sequence analysis. As part of this work we are tackling issues of standardized test data sets to support comparative benchmarking, targeted optimization, reproducible science, and computational standardization in biology. In this talk I'll discuss our efforts to understand where our computational bottlenecks are, what kinds of optimization and parallelization efforts make sense financially, and how the cloud is enabling us to be usefully disruptive. As a bonus I'll talk about how the focus on pleasantly paralellizable tasks has warped everyone's brains and convinced them that engineering, not research, is really interesting.

Outline:

  1. Defining the terms: cloud computing; data intensive; compute intensive.

2. Our data-intensive problem: sequence assembly and the big graph problem. The scale of the problem. A complete analysis protocol.

  1. Predicted bottlenecks, including computation and I/O.
  2. Actual bottlenecks, including NUMA architecture and I/O.

5. A cost-benefit analysis of various approaches, including buying more memory; striping data across multiple volumes; increasing I/O performance; focusing on software development; "pipelining" across multiple machines; theory vs practice in terms of implementation.

6. A discussion of solutions that won't work, including parallelization and GPU.

7. Making analysis "free" and using low-cost compute to analyze other people's data. Trying to be disruptive.

Syndicated 2013-09-09 22:00:00 from Living in an Ivory Basement

How can we do literate programming for reproducibility, in Python?

Note: Yarden Katz (the author of MISO) sent me the e-mail below, and I asked him if I could post it as a guest-post on my blog. He said yes - so here it is! Feedback solicited.

---

Hi Titus,

Hope all is well. A recent tweet you had about Ben Bolker's notes for lit. programming in R (via @hylopsar) made me think about the same for Python, with has been bugging me for a while. Wanted to see if you have any thoughts on getting the equivalent in Python.

What I've always wanted in Python is a way to simultaneously document and execute code that describes an applied analysis pipeline. Some easy way to declaratively describe and document a step-by-step analysis pipeline: Given X datasets available from some web resource, which depends on packages / tools Y, download the data and run the pipeline and ensure that you get results Z. I'd like a language that allows a description that is easily reproducible on a system that's not your own, and forces you to declaratively state things in such a way that you can't cheat with hardcoded paths or quirky settings/versions of software that apply only to your own system. A kind of "literate" pipeline for applied analysis pipelines that allows you to state your assertions/expectations along the way.

One of the main advantages of R over Python is that they have a packaging system that actually works, where as pip/setuptools/distribute are all broken and hard to use, even for Python experts, let alone people who don't want to delve into the guts of Python. So ideally I'd like a system that takes this description of the code and the inputs and executes on a new virtual environment. readthedocs.org does this for documentation, and it's a great way to ensure that you don't have unnoticed hardcoded paths, or Python libraries or packages that cannot be fetched by package managers. Because Python libraries are so hopelessly complicated and broken, and because in comp. bio we rely so often on external tools (tophat version/bowtie version/etc.) this is all the more important. Something that ensures that if you follow these steps, with these data, it'll be automatically installable on your system, and give you the expected output -- no matter what! Knowing that it runs on a server other than your own is key.

Some related tools/ideas that haven't worked very well for me for this purpose, or that only partially address this:

  • IPython notebook: I've had issues with IPython in general, but even when it works, it doesn't address the problem of describing systematically the input and output of the problem, which is key in analysis pipelines. It also doesn't give you a way to state dependencies. If I have a series of calls to numpy/scipy/matplotlib and I want to share that with you, it's good, but an applied analysis pipeline is far more complex than using a bunch of commonly available Python packages to get an output.
  • Unit tests: Standard unit tests are OK for generic software tools. But they don't really make sense for applied analysis pipelines, where the software that you're writing is basically a bunch of analysis (and possibly plotting) code, and not a generic algorithm. You're not testing internal Python library calls, and testing is only a small component of the goal (the other part is describing dependencies and data, and how the pipeline works). You're trying to describe a flow of sequential steps, with forced assertions and conditions for reproducibility. Some of these steps might not be fully automated, or might take far too long to run as a unit test. So what I'm looking for is closer to some kind of sphinx/pydoc document interspersed with executable code, than a plain Python file with unit tests.
  • Ruffus: It's too complicated for most things in my view and it doesn't give you a way to describe the data inputs, etc. It's best for pipelines that consist of internal Python functions that exist within a module, but it gives you no features for describing interaction with external world (external input data, external tools of a specific version whose output you process.) that forces you to get things somewhat right is Sphinx/Pydoc. It was for Pycogent which I occasionally contribute it to, and they had configured it so that all the inline examples in the sphinx .rst file were run in real time. That's nice though it's still running only on your own environment and has no features for describing complex data sets / inputs, it was really made for testing library calls within a Python package (like an IPython notebook) -- again, not meant for data-driven pipelines.

The ideal system would even allow you to register analysis pipelines or Python functions in some kind of web system, where each analysis can get a URI and be run with a single click dispatched to some kind of amazon node. But that's not necessary and I don't use the cloud for now.

Would love to hear your thoughts (feel free to share with others who might have views on this.) I've thought about this for a while and never found a satisfactory solution.

Thanks very much!

Best,

--Yarden

Syndicated 2013-07-12 22:00:00 from Living in an Ivory Basement

Excerpts from Coders At Work: Peter Deutsch Interview

I've been reading Peter Seibel's excellent book, Coders at Work, which is a transcription of interviews with a dozen or so very well known and impactful programmers. After the first two interviews, I found myself itching to highlight certain sections, and then I thought, heck, why not post some of the bits I found most interesting? This is a book everyone should be aware of, and it's surprisingly readable. Highly recommended.

This is the second of my blog posts. The first contained excerpts from Seibel's interview with Joe Armstrong.

The excerpts below come from Seibel's interview with Peter Deutsch, who is (among many other things) the creator and long-time maintainer of Ghostscript.

My comments are labeled 'CTB'.


On programmers

Seibel: So is it OK for people who don't have a talent for systems-level thinking to work on smaller parts of software? Can you split the programmers and the architects? Or do you really want everyone who's working on systems-style software, since it is sort of fractal, to be able think in terms of systems?

Deutsch: ... But in terms of who should do software, I don't have a good flat answer that. I do know that the further down in the plumbing the software is, the more important it is that it be built by really good people. That's an elitist point of view, and I'm happy to hold it.

...

You know the old story about the telephone and the telephone operators? The story is, sometime fairly early in the adoption of the telephone, when it was clear that use of the telephone was just expanding at an incredible rate, more and more people were having to be hired to work as operators because we didn't have dial telephones. Someone extrapolated the growth rate and said "My God. By 20 or 30 years from now, every single person will have to be a telephone operator." Well, that's happened. I think something like that may be happening in some big areas of programming as well.

CTB: This seemed like interesting commentary on the increasing ... democratization? ... of computer use.

Fast, cheap, good -- pick any two.

Deutsch: ...The problem being the old saying in the business: "fast, cheap, good -- pick any two." If you build things fast and you have some way of building them inexpensively, it's very unlikely that they're going to be good. But this school of thought says you shouldn't expect software to last.

I think behind this perhaps is a mindset of software as expense vs software as capital asset. I'm very much in the software-as-capital-asset school. When I was working at ParcPlace and Adele Goldberg was out there evangelizing object-oriented design, part of the way we talked about objects and part of the way we advocated object-oriented languages and design to our customers and potential customers is to say, "Look, you should treat software as a capital asset."

And there is no such thing as a capital asset that doesn't require ongoing maintenance and investment. You should expect that there's going to be some cost associated with maintaining a growing library of reusable software. And that is going to complicate your accounting because it means you can't charge the cost of building a piece of software only to the project or the customer that's motivating the creation of that software at this time. You have to think of it the way you would think of a capital asset.

CTB: A really good perspective that's relevant to scientists' concerns about software and data.

On how software practice has (not) improved over the last 30 years

Seibel: So you don't believe the original object-reuse pitch quite as strongly now. Was there something wrong with the theory, or has it just not worked out for historical reasons?

Deutsch: Well, part of the reason that I don't call myself a computer scientists any more is that I've seen software practice over a period of just about 50 years and it basically hasn't improved tremendously in about the last 30 years.

If you look at programming languages I would make a strong case that programming languages have not improved qualitatively in the last 40 years. There is no programming language in use today that is qualitatively better than Simula-67. I know that sounds kind of funny, but I really mean it. Java is not that much better than Simula-67.

Seibel: Smalltalk?

Deutsch: Smalltalk is somewhat better than Simula-67. But Smalltalk as it exists today essentially existed in 1976. I'm not saying that today's languages aren't better than the languages that existed 30 years ago. The language that I do all of my programming in today, Python, is, I think, a lot better than anything that was available 30 years ago. I like it better than Smalltalk.

I use the word qualitatively very deliberately. Every programming language today that I can think of, that's in substantial use, has the concept of pointer. I don't know of any way to make software built using that fundamental concept qualitatively better.

CTB: Well, that's just a weird opinion in some ways. But interesting, especially since he has been around and active for so long, and his perspective is obviously not based in ignorance.

On temptation

Deutsch: Every now and then I feel a temptation to design a programming language but then I just lie down until it goes away. But if I were to give in to that temptation, it would have a pretty fundamental cleavage between a functional part that talked only about values and had no concept of pointer, and a different sphere of some kind that talked about patterns of sharing and reference and control.

More on Smalltalk and Python

Seibel: So, despite it not being qualitatively better than Smalltalk, you still like Python better.

Deutsch: I do. There are several reasons. With Python there's a very clear story of what is a program and what it means to run a program and what it means to be part of a program. There's a concept of module, and modules declare basically what information they need from other modules. So it's possible to develop a module or a group of modules and share them with other people and those other people can come along and look at those modules and know pretty much exactly what they depend on and know what their boundaries are.

...

I've talked with the few of my buddies that are still at VisualWorks about open-sourcing the object engine, the just-in-time code generator, which, even though I wrote it, I still think is better than a lot of what's out there. Gosh, here we have Smalltalk, which has this really great code-generation machinery, which is now very mature -- it's about 20 years old and it's extremely reliable. It's a relatively simple, relatively retargetable, quite efficient just-in-time code generator that's designed to work really well with non type-declared languages. On the other hand, here's Python, which is this wonderful language with these wonderful libraries and a slow-as-mud implementation. Wouldn't it be nice if we could bring the two together?

(I'm a bit fixated on Python. OK?)

Deutsch: ... But that brings me to the other half, the other reason I like Python syntax better, which is that Lisp is lexically pretty monotonous.

Seibel: I think Larry Wall described it as a bowl of oatmeal with fingernail clippings in it.

Deutsch: Well, my description of Perl is something that looks like it came out of the wrong end of a dog. I think Larry Wall has a lot of nerve talking about language design -- Perl is an abomination as a language. But let's not go there.

CTB: heh.

Syndicated 2013-05-13 22:00:00 from Living in an Ivory Basement

Excerpts from Coders At Work: Joe Armstrong Interview

I've been reading Peter Seibel's excellent book, Coders at Work, which is a transcription of interviews with a dozen or so very well known and impactful programmers. After the first two interviews, I found myself itching to highlight certain sections, and then I thought, heck, why not post some of the bits I found most interesting? This is a book everyone should be aware of, and it's surprisingly readable. Highly recommended.

This is the first in what I expect to be a dozen or so blog posts, time permitting.

The excerpts below come from Seibel's interview with Joe Armstrong, the inventer of Erlang.

My comments are labeled 'CTB'.


On learning to program

Seibel: How did you learn to program? When did it all start?

Armstrong: When I was at school. I was born in 1950 so there weren't many computers around then. The final year of school, I suppose I must have been 17, the local council had a mainframe computer -- probably an IBM. We could write Fortran on it. It was the usual thing -- you wrote your programs on coding sheets and you sent them off. A week later the coding sheets and the punch cards came back and you had to approve them. But the people who made the punch cards would make mistakes. So it might go backwards and forwards one or two times. And then it would finally go to the computer center.

Then it went to the computer center and came back and the Fortran compilter had stopped at the first syntactic error in the program. It didn't even process the remainder of the program. It was something like three months to run your first program. I learned then, instead of sending one program you had to develop every single subroutine in parallel and sned the lot. I think I wrote a little program to dispay a chess board -- it would plot a chess board on the printer. But I had to write all the subroutines as parallel tasks because the turnaround time was so appallingly bad.

CTB: I think it's fascinating to interpret this statement in light of Erlang's pattern of small components, working in parallel (http://en.wikipedia.org/wiki/Erlang_(programming_language). Did Armstrong shape his mental architecture in this pattern from the early mainframe days, and then translate that over to programming design? Also, this made me think about unit testing in a whole new way.

On modern gizmos like "hierarchical file systems", and productivity

Armstrong: The funny thing is, thinking back, I don't think all of these modern gizmos actually make you any more productive. Hierarchical file systems -- how do they make you more productive? Most of software development goes on in your head anyway. I think having worked with that simpler system imposes a kind of disciplined way of thinking. If you haven't got a directory system and you have to put all the files in one directory, you have to be fairly disciplined. If you haven't got a revision control system, you have to be fairly disciplined. Given that you apply that discipline to what you're doing it doesn't seem to me to be any better to have hierarchical file systems and revision control. They don't solve the fundamental problem of solving your problem. They probably make it easier for groups of people to work together. For individuals I don't see any difference.

CTB: If your tools require you to be as good as Joe Armstrong in order to get things done, that's probably not a generalizable solution...

On calling out to other languages, and Domain Specific Lanaguages

Seibel: So if you were writing a big image processing work-flow system, then would you write the actual image transformation in some other language?

Armstrong: I'd write them in C or assembler or something. Or I might actually write them in a dialect of Erlang and then cross-compile the Erlang to C. Make a dialect - this kind of domain-specific language kind of idea. Or I might write Erlang programs which generate C programs rather than writing the C programs by hand. But the target language would be C or assembler or something. Whether I wrote them by hand or generated them would be the interesting question. I'm tending toward automatically generating C rather than writing it by hand because it's just easier.

CTB: heh. So, I'd just generate C automatically from a dialect of Erlang...

On debugging

Seibel: What are the techniques that you use there? Print statements?

Armstrong. Print statements. The great gods of programming said, "Thou shall put printf statements in your program at the point where yout hink it's gone wrong, recompile, and run it.

Then there's -- I don't know if I read it somewhere or if I invented it myself -- Joe's Law of Debugging, which is that all errors will be plus/minus three statements of the place you last changed the program.

CTB: one surprising commonality amongst many of the interviews thus far is the lack of use (or disdain for) debuggers. Almost everyone trots out print statements!

Syndicated 2013-04-28 22:00:00 from Living in an Ivory Basement

PyCon 2013 and Codes of Conduct, more generally

The tech community is messed up in da head, yo.

Several times since Steve Holden's I'm Sorry post I've written long blog posts about my own views on codes of conduct and professional behavior, including the views informed by some of my own extraordinarily embarrassing transgressions. I never felt that the end result had much to add to the conversation so I never posted any of 'em. Plus, they were really embarrassing transgressions.

If you want to know, until last week, I was fairly publicly on the fence about the proposed Python Software Foundation code of conduct (which is not yet public, but is based on the Ubuntu CoC, I think) because I was worried about CoCs being used to whack people inappropriately, due to nonspecificity and other things.

Three things happened at PyCon 2013 that made me decide to (a) change my mind and (b) post this short note saying so.

First, I came to PyCon with two women colleagues, one of whom was harassed nearly constantly by men, albeit on a low level. Both of them are friendly people who are willing to engage at both a personal and a technical level with others, and apparently that signals to some that they can now feel free to comment on "hotness", proposition them, and otherwise act like 14 year old guys. As one friend said, (paraphrased) "I'd be more flattered that they seem to want to sleep with me, if they'd indicated any interest in me as a human being -- you know, asked me why I was at PyCon, what I did, what I worked on, what I thought about things. But they didn't." (Honestly, if I were to judge by that reported set of interactions, any genetic component to such behavior would be weeded out in approximately one generation, 'cause such guys would only be able to reproduce through anonymous donations to sperm banks.)

Second, at an event for a subcommunity that I help not run, bad shit happened. At the same event, derogatory and not-fun sexist remarks were made, publicly and loudly, about a presenter. This made the main organizer for that event feel horrible, and put a damper on the whole event.

Third, this happened. As with #2, I found PyCon's response perfectly appropriate, which makes me much happier about the way PyCon specifically and the PSF in general are likely to enforce any code of conduct in the future. As for the person who posted Twitter pics identifying the men she felt were being sexist, I am not very upset by her actions, because she is not an official representative of PyCon or the PSF, and she did not post anonymously, so she is taking responsibility for her actions -- unlike the people harassing Jesse Noller for doing his effin' job. I do reject the notion that Adria speaks for me in the particulars, and I would guess that her claim to speak for all future women is similarly rejected by many women. Again, PyCon is not responsible for her tweet or her picture, and they should not be held accountable in any way for it; that's her personal action. (I'm more upset by the company that took this to the extent of actually firing someone over it, and I'm really glad my employer (Michigan State University) has rules, procedures, and formal hearings -- if they fire me, it will be after a certain amount of due process and not due to what is presumably Internet hearsay.)

In the end, the latter two incidents are completely overshadowed by the first, though. I'm not the parent or guardian of either of my colleagues, and I suspect they would similarly reject the idea that I speak for either of them - they're both perfectly capable of telling people what they think, frankly (and I would love to be an audience for some of those conversations). But, as both a visible member of the Python community and as the father of two small girls, I am appalled at the second-hand reports of male behavior. I'm particularly appalled at the systemic low-level harassment that seems to be considered normal behavior by some. It's not cool, it's not fun, and it doesn't even have the dubious virtue of being effective.

In summary,

  1. GOOD JOB, PYCON. The way the incidents were officially handled was really well done and speaks well of the people we have chosen to run PyCon.
  2. We need codes of conduct because they provide some minimal guidelines for people that (apparently) need 'em, because they can't figure out how to tie their own shoelaces without such guidelines.
  3. As a community, we need to change the way we treat women, because my daughters will TASER YOU ALL INTO OBLIVION in 10-20 years if we don't.

--titus

p.s. Sorry, no comments. Go blog 'em and I'll link to the blog posts, just send them to @ctitusbrown on Twitter.

Syndicated 2013-03-19 23:00:00 from Living in an Ivory Basement

My 2013 PyCon talk: Awesome Big Data Algorithms

Schedule link

Description

Random algorithms and probabilistic data structures are algorithmically efficient and can provide shockingly good practical results. I will give a practical introduction, with live demos and bad jokes, to this fascinating algorithmic niche. I will conclude with some discussions of how our group has applied this to large sequencing data sets (although this will not be the focus of the talk).

Abstract:

I propose to start with Python implementations of most of the DS & A mentioned in this excellent blog post:

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

and also discuss skip lists and any other random algorithms that catch my fancy. I'll put everything together in an IPython notebook and add visualizations as appropriate.

I'll finish with some discussion of how we've put these approaches to work in my lab's research, which focuses on compressive approaches to large data sets (and is regularly featured in my Python-ic blog, http://ivory.idyll.org/blog/).

SkipLists:

skiplist IPython Notebook

Wikipedia page on SkipLists

John Shipman's excellent writeup

William Pugh's original article

The HackerNews (oops!) reference for my reddit-attributed quote about putting a gun to someone's head and asking them to write a log-time algorithm for storing stuff: https://news.ycombinator.com/item?id=2670632

HyperLogLog:

coinflips IPython Notebook

HyperLogLog counter IPython Notebook

Aggregate Knowledge's EXCELLENT blog post on HyperLogLog. The section on Big Pattern Observables is truly fantastic :)

Flajolet et al. is the original paper. It gets a bit technical in the middle but the discussions are great.

Nick Johnson's blog post on cardinality estimation

MetaMarkets' blog post on cardinality counting

More High Scalability blog posts, this one by Matt Abrams

The obligatory Stack Overflow Q&A

Vasily Evseenko's git repo https://github.com/svpcom/hyperloglog, forked from Nelson Goncalves's git repo, https://github.com/goncalvesnelson/Log-Log-Sketch, served as the source for my IPython Notebook.

Bloom Filters:

Bloom filter notebook

The Wikipedia page is pretty good.

Everything I know about Bloom filters comes from my research.

I briefly mentioned the CountMin Sketch, which is an extension of the basic Bloom filter approach, for counting frequency distributions of objects.

Other nifty things to look at

Quotient filters

Rapidly-exploring random trees

Random binary trees

Treaps

StackOverflow question on Bloom-filter like structures that go the other way

A survey of probabilistic data structures

K-Minimum Values over at Aggregate Knowledge again

Set operations on HyperLogLog counters, again over at Aggregate Knowledge.

Using SkipLists to calculate an efficient running median

Syndicated 2013-03-15 23:00:00 from Living in an Ivory Basement

Communicating programming practice with screencasts

One of the things that I have struggled with over the years is how to teach people how to actually program -- by this I mean the minute-to-minute process and techniques of generating code, more so than syntax and data structures and algorithms. This is generally not taught explicitly in college: most undergraduate students pick it up in the process of doing homeworks, by working with other people, observing TAs, and evolving their own practice. Most science graduate students never take a formal course in programming or software development, as far as I can tell, so they pick it up haphazardly from their colleagues. Open source hackers may get their practice from sprints, but usually by the time you get to a sprint you are already wedged fairly far into your own set of habits.

Despite this lack of explicit teaching, I think it's clear that programming practice is really important. I and the other Linux/UNIX geeks I know all have a fairly small set of basic approaches -- command-line driven, mostly emacs or vi, with lots of automation at the shell -- that we apply to all of our problems, and it is all pretty optimized for the tools and tasks we have. I would be hard pressed to imagine a significantly more efficient and effective set of practices (which just tells me that there is probably something much better, but it's far away from my current practices :).

Now that I'm a professional educator, I'd like to teach this, because what I see students doing is so darned inefficient by comparison. I regularly watch students struggle with the mouse to switch between windows, copy and paste by selecting or dragging, and otherwise completely fail to make use of keyboard shortcuts. I see a lot of code being built from scratch by guess-work, without lots of Google-fu or copy/pasting and editing. Version control isn't integrated into their minute-by-minute process. Testing? Hah. We don't even teach automated testing here at MSU. It's an understatement to say that using all of these techniques together is a conceptual leap that many students seem ill-prepared to make.

Last term I co-taught an intro graduate course in computation for evolutionary biologists using IPython Notebook running in the cloud, and I made extensive use of screencasts as a way to show the students how I worked and how I thought. It went pretty well -- several students told me that they really appreciated being able to see what I was doing and hear why I was doing it, and being able to pause and rewind was very helpful when they ran into trouble.

So this term, for my database-backed Web development course, I decided to post videos of the homework solutions for the second homework, which is part of a whole-term class project to build a distributed peer-to-peer liquor cabinet and party planning Web site. (Hey, you gotta teach 'em somehow, right?)

I posted the example solutions as a github branch as well as videos showing me solving each of the sub problems in real time, with discussion:

HW 2.1 -- http://www.youtube.com/watch?v=2img0wKdokA

HW 2.2 -- http://www.youtube.com/watch?v=eQU4qImY9VM

HW 2.3 -- http://www.youtube.com/watch?v=YqL18Ip2wws

HW 2.4 -- http://www.youtube.com/watch?v=7iOITFHrqmA

HW 2.5 -- http://www.youtube.com/watch?v=0Ea5yxRCKKw

HW 2.6 -- http://www.youtube.com/watch?v=6k8pnl2SgVI

I think the videos are decent screencasts, and by breaking them down this way I made it possible for students to look only at the section they had questions about. Each screencasts is 5-10 minutes total, and now I can use them for other classes, too.

So far so good, and I doubt many students have spent much time looking at them, but maybe some will. We'll have to see if my contentment in having produced them matches their actual utility in the class :).

But then something entertaining happened. Greg Wilson is always bugging us (where "us" means pretty much anyone whose e-mail inbox he has access to) about developing hands-on examples that we can use in Software Carpentry, so I sent these videos to the SWC 'tutors' mailing list with a note that I'd love help writing better homeworks. And within an hour or so, I got back two nice polite e-mails from other members of the list, offering better *solutions. One was about HW 2.1 --

System Message: WARNING/2 (/Users/t/dev/blog-final/src/communicating-programming-practice.rst, line 88); backlink

Inline emphasis start-string without end-string.
It might be safer to .lstrip() the line before checking for comments (to allow indented comments). Also, not line[0].strip() doesn't test for lines with only white space. it tests for lines that have white space as the first character. 'not line.strip()[0]' would be all white space... That would also make 'not line' redundant.

I also got a more general offer from someone else to peer review my homework solutions, and chastising me for using

fp = open(filename)
try:
  ... do stuff ...
finally:
  fp.close()

instead of

with open(filename) as fp:
   ... do stuff

heh.

I find this little episode very entertaining. I love the notion that other people (at least one is another professor) had the spare time to watch the videos and then critique what I'd done and then send me the critique; I also like the point that the quest for perfect code is ongoing. I am particularly entertained by the fact that they are both right, and that my explanation of my code was in some cases facile, shallow, and somewhat wrong (although not significantly enough to make me redo the videos -- the perfect is the enemy of the good enough!)

And, finally, although no feedback spoke directly to this, I am in love with the notion that we can convey effective practice through video. I think this episode is a great indication that if we could get students to record themselves working through problems, we could learn how they are responding to our instruction and start to develop a deeper understanding of the traps for the novice that lie within our current programming processes.

--titus

Syndicated 2013-02-11 23:00:00 from Living in an Ivory Basement

Adding disqus, Google Analytics, and github edit links to ReadTheDocs sites

Inspired by the awesomeness of disqus on my other sites, I wanted to make it possible to enable disqus on my sites on ReadTheDocs. A bit of googling led me to Mikko Ohtamaa's excellent work on the Plone documentation, where a blinding flash of awesomeness hit me and I realized that github had, over the past year, nicely integrated online editing of source, together with pull requests.

This meant that I could now give potential contributors completely command-line-free edit ability for my documentation sites, together with single-click approval of edits, and automated insta-updating of the ReadTheDocs site. Plus disqus commenting. And Google Analytics.

I just had to have it.

Voila.

Basically, I took Mikko's awesomeness, combined it with some disqus hackery, refactored a few times, and, well, posted it.

The source is here.

Two things --

I could some JS help disabling the 'Edit this document!' stuff if the 'github_base_account' variable isn't set in page.html'. Anyone? See line 105 of page.html. You can edit online by hitting 'e' :).

It would be nice to be able to configure disqus, Google Analytics, and github editing in conf.py, but I wasn't able to figure out how to pass variables into Jinja2 from conf.py. It's probably really easy.

But otherwise it all works nicely.

Enjoy! And thanks to Mikko, as well as Eric Holscher and the RTD team, and github, for making this all so frickin' easy.

--titus

Syndicated 2012-11-03 23:00:00 from Living in an Ivory Basement

475 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!