Older blog entries for titus (starting at number 488)

The PyCon 2015 Ally's Workshop

At PyCon 2015, I had the pleasure of attending the Ally Skills Workshop, organized by @adainitiative (named after Ada Lovelace).

The workshop was a three-hour, strongly guided session built around 4-6 person group discussions of short scenarios. There's a guide to running them here, although I personally would not have wanted to run one without attending one first!

I attended the workshop for at least three reasons --

First, I want to do better myself. I have put some effort into (and received a lot of encouragement for) making my lab an increasingly open and welcoming place. While I have heard concerns about being insufficiently critical and challenging of bad ideas in science (and I have personally experienced a few rather odd situations where obviously bad ideas weren't called out in my past labs), I don't see any inherent conflict between being welcoming and being intellectually critical - in fact, I rather suspect they are mutually supportive, especially for the more junior people.

But, doing better is surprisingly challenging; everyone needs a mentor, or at least guideposts. So when I heard about this workshop, I leapt at the chance to attend!

Second, I am interested in connecting these kinds of things to my day job in academia, where I am now a professor at UC Davis. UC Davis is the home of Jonathan Eisen, who is somewhat notorious for many reasons that include boycotting and calling out conferences that have low diversity. UC Davis also has an effort to increase diversity at the faculty level, and I think that this is an important effort. I'm hoping to be involved in this when I actually take up residence in Davis, and learning to be a male ally is one way to help. Moreover, I think that Davis would be a natural home to some of these ally workshops, and so I attended the Ally Skills workshop to explore this.

And third, I was just curious! It's surprisingly tricky to confront and talk about sexism effectively, and I thought seeing how the pros did it would be a good way to start.

Interestingly, two-thirds of my lab attended the workshop as well - without me requesting it. I think they found it valuable, too.

The workshop itself

Valerie Aurora ran the workshop, and it's impossible to convey how good it was, but I'll try by picking out some choice quotes:

"You shouldn't expect praise or credit for behaving like a decent human being."

"Sometimes, you just need a flame war to happen." (paraphrase)

"LPT: Read Captain Awkward. And read the comments."

"It's not up to the victim whether you enforce your code of conduct."

"The physiological effects of alcohol are actually limited, and most effects of alcohol are socially and/or culturally mediated."

"Avoid rules lawyering. I don't now if you've ever worked with lawyers, but software engineers are almost as bad."

"One problem for male allies is the assumption that you are only talking to a woman because you are sexually interested in them."

"Trolls are good at calibrating their level of awfulness to something that you will feel guilty about moderating."

Read the blog post "Tone policing only goes one way..


Overall, a great experience and something I hope to help host more of at UC Davis.

--titus

Syndicated 2015-04-16 22:00:00 from Living in an Ivory Basement

Some thoughts on Journal Impact Factor

A colleague just e-mailed me to ask me how I felt about journal impact factor being such a big part of the Academic Ranking of World Universities - they say that 20% of the ranking weight comes from # of papers published in Nature and Science. So what do I think?

On evaluations

I'm really not a big fan of rankings and evaluations in the first place. This is largely because I feel that evaluations are rarely objective. For one very specific example, last year at MSU I got formal evaluations from both of my departments. Starting with the same underlying data (papers published, students graduated, grants submitted/awarded, money spent, classes taught, body mass/height ratio, gender weighting, eye color, miles flown, talks given, liquid volume of students' tears generated, committees served on, etc.), one department gave me a "satisfactory/satisfactory/satisfactory" while the other department gave me an "excellent/excellent/excellent." What fun! (I don't think the difference was due to the caliber of the departments.)

Did I mention that these rankings helped determine my raise for the year?

Anyhoo, I find the rating and ranking scheme within departments at MSU to be largely silly. It's done in an ad hoc manner by untrained people, and as far as I can tell, is biased against people who are not willing to sing their own praises. (I brought up the Dunning-Kruger effect in my last evaluation meeting. Heh.) That's not to say there's no serious intent -- the ratings are factored into raises, and at least one other purpose of evaluating assistant professors is so that once you fire their ass (aka "don't give them tenure") there's a paper trail of critical evaluations where you explained that they were in trouble.

Metrics are part of the job, though; departments evaluate their faculty so they can see who, if anyone, needs help or support or mentoring, and to do this, they rely at least in part on metrics. Basically, if someone has lots of funding and lots of papers, they're probably not failing miserably at being a research professor; if they're grant-poor and paper-poor, they're targets for further investigation. There are lots of ways to evaluate, but metrics seem like an inextricable part of it.

Back to the impact factor

Like faculty evaluations, ranking by the impact factor of the journals that university faculty publish in is an attempt to predict future performance using current data.

But impact factor is extremely problematic for many reasons. It's based on citations, which (over the long run) may be an OK measure of impact, but are subject to many confounding factors, including field-specific citation patterns. It's an attempt to predict the success of individual papers on a whole-journal basis, which falls apart in the face of variable editorial decisions. High-impact journals are also often read more widely than low-impact journals, which yields a troubling circularity in terms of citation numbers (you're more likely to cite a paper you've read!). Worse, the whole system is prone to being gamed in various ways, which is leading to high rates of retractions for high-impact journals, as well as outright fraud.
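For readers who haven't looked at how the number is actually produced: the standard two-year impact factor is just a ratio, sketched below in rough form (this is the textbook definition, nothing specific to any particular journal). That simplicity is part of why it's so easy to skew -- a handful of very highly cited papers, or editorial decisions about what counts as a "citable item," can move it a lot.

    \mathrm{JIF}_{2014} =
      \frac{\text{citations received in 2014 by items published in 2012--2013}}
           {\text{number of citable items published in 2012--2013}}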

Impact factor is probably a piss-poor proxy for paper impact, in other words.

If impact factor were just a thing that didn't matter, I wouldn't worry. The real trouble is that impact factor has real-world effects - many countries use the impact factor of publications as a very strong weight in funding and promotion decisions. Interestingly, the US is not terribly heavy-handed here - most universities seem pretty enlightened about considering the whole portfolio of a scientist, at least anecdotally. But I can name a dozen countries that care deeply about impact factor for promotions and raises.

And apparently impact factor affects university rankings, too!

Taking a step back, it's not clear that any good ranking scheme can exist, and if it does, we're certainly not using it. All of this is a big problem if you care about fostering good science.

The conundrum is that many people like rankings, and it seems futile to argue against measuring and ranking people and institutions. However, any formalized ranking system can be gamed and perverted, which ends up sometimes rewarding the wrong kind of people, and shutting out some of the right kind of people. (The Reed College position on the US News & World Report ranking system is worth reading here.) More generally, in any ecosystem, the competitive landscape is evolving, and a sensible measure today may become a lousy measure tomorrow as the players evolve their strategies; the stricter the rules of evaluation, and the more entrenched the evaluation system, the less likely it is to adapt, and the more misranking will result. So ranking systems need to evolve continuously.

At its heart, this is a scientific management challenge. Rankings and metrics do pretty explicitly set the landscape of incentives and competition. If our goal in science is to increase knowledge for the betterment of mankind, then the challenge for scientific management is to figure out how to incentivize behaviors that trend in that direction in the long term. If you use bad or outdated metrics, then you incentivize the wrong kind of behavior, and you waste precious time, energy, and resources. Complicating this is the management structure of academic science, which is driven by many things that include rankings and reputation - concepts that range from precise to fuzzy.

My position on all of this is always changing, but it's pretty clear that the journal system is kinda dumb and rewards the wrong behavior. (For the record, I'm actually a big fan of publications, and I think citations are probably not a terribly bad measure of impact when measured on papers and individuals, although I'm always happy to engage in discussions on why I'm wrong.) But the impact factor is especially horrible. The disproportionate effect that high-IF glamour mags like Cell, Nature and Science have on our scientific culture is clearly a bad thing - for example, I'm hearing more and more stories about editors at these journals warping scientific stories directly or indirectly to be more press-worthy - and when combined with the reproducibility crisis I'm really worried about the short-term future of science. Journal Impact Factor and other simple metrics are fundamentally problematic and are contributing to the problem, along with the current peer review culture and a whole host of other things. (Mike Eisen has written about this a lot; see e.g. this post.)

In the long term I think a much more experimental culture of peer review and alternative metrics will emerge. But what do we do for now?

More importantly:

How can we change?

I think my main advice to faculty is "lead, follow, or get out of the way."

Unless you're a recognized big shot, or willing to take somewhat insane gambles with your career, "leading" may not be productive - following or getting out of the way might be best. But there are a lot of things you can do here that don't put you at much risk, including:

  • be open to a broader career picture when hiring and evaluating junior faculty;
  • argue on behalf of alternative metrics in meetings on promotion and tenure;
  • use sites like Google Scholar to pick out some recent papers to read in depth when hiring faculty and evaluating grants;
  • avoid making (or push back at) cheap shots at people who don't have a lot of high-impact-factor papers;
  • invest in career mentoring that is more nuanced than "try for lots of C-N-S papers or else" - you'd be surprised how often this is the main advice assistant professors take away...
  • believe in and help junior faculty that seem to have a plan, even if you don't agree with the plan (or at least leave them alone ;)

What if you are a recognized big shot? Well, there are lots of things you can do. You are the people who set the tone in the community and in your department, and it behooves you to think scientifically about the culture and reward system of science. The most important thing you can do is think and investigate. What evidence is there behind the value of peer review? Are you happy with C-N-S editorial policies, and have you talked to colleagues who get rejected at the editorial review stage more than you do? Have you thought about per-article metrics? Do you have any better thoughts on how to improve the system than 'fund more people', and how would you effect changes in this direction by recognizing alternate metrics during tenure and grant review?

The bottom line is that the current evaluation systems are the creation of scientists, for scientists. It's our responsibility to critically evaluate them, and perhaps evolve them when they're inadequate; we shouldn't just complain about how the current system is broken and wait for someone else to fix it.

Addendum: what would I like to see?

Precisely predicting the future importance of papers is obviously kind of silly - see this great 1994 paper by Gans and Shepherd on rejected classics papers, for example -- and is subject to all sorts of confounding effects. But this is nonetheless what journals are accustomed to doing: editors at most journals, especially the high impact factor ones, select papers based on projected impact before sending them out for review, and/or ask the reviewers to review impact as well.

So I think we should do away with impact review and review for correctness instead. This is why I'm such a big fan of PLOS One and PeerJ, who purport to do exactly that.

But then, I get asked, what do we do about selecting out papers to read? Some (many?) scientists claim that they need the filtering effect of these selective journals to figure out what they should be reading.

There are a few responses to this.

First, it's fundamentally problematic to outsource your attention to editors at journals, for reasons mentioned above. There's some evidence that you're being drawn into a manipulated and high-retraction environment by doing that, and that should worry you.

But let's say you feel you need something to tell you what to read.

Well, second, this is technologically solvable - that's what search engines already do. There's a whole industry of search engines that give great results based on integrating free text search, automatic content classification, and citation patterns. Google Scholar does a great job here, for example.

Third, social media (aka "people you know") provides some great recommendation systems! People who haven't paid much attention to Twitter or blogging may not have noticed, but in addition to person-to-person recommendations, there are increasingly good recommendation systems coming on line. I personally get most of my paper recs from online outlets (mostly people I follow, but I've found some really smart people to follow on Twitter!). It's a great solution!

Fourth, if one of the problems is that many journals review for correctness AND impact together, why not separate them? For example, couldn't journals like Science or Nature evolve into literature overlays that highlight papers published in impact-blind journals like PLOS One or PeerJ? I can imagine a number of ways that this could work, but if we're so invested in having editors pick papers for us, why not have them pick papers that have been reviewed for scientific correctness first, and then elevate them to our attention with their magic editorial pen?

I don't see too many drawbacks to this vs the current approach, and many improvements. (Frankly this is where I see most of scientific literature going, once preprint archives become omnipresent.)

So that's where I want and expect to see things going. I don't see ranking based on predicted impact going away, but I'd like to see it more reflective of actual impact (and be measured in more diverse ways).

--titus

p.s. People looking for citations of high retraction rate, problematic peer review, and the rest could look at one of my earlier blog posts on problems with peer review. I'd be interested in more citations, though!

Syndicated 2014-11-10 23:00:00 from Living in an Ivory Basement

Putting together an online presence for a diffuse academic community - how?

I would like to build a community site. Or, more precisely, I would like to recognize, collect, and collate information from an already existing but rather diffuse community.

The focus of the community will be academic data science, or "data driven discovery". This is spurred largely by the recent selection of the Moore Data Driven Discovery Investigators, as well as the earlier Moore and Sloan Data Science Environments, and more broadly by the recognition that academia is broken when it comes to data science.

So, where to start?

For a variety of reasons -- including the main practical one, that most academics are not terribly social media integrated and we don't want to try to force them to learn new habits -- I am focusing on aggregating blog posts and Twitter.

So, the main question is... how can we most easily collect and broadcast blog posts and articles via a Web site? And how well can we integrate with Twitter?

First steps and initial thoughts

Following Noam Ross's suggestions in the above storify, I put together a WordPress blog that uses the RSS Multi Importer to aggregate RSS feeds as blog posts (hosted on NFSN). I'd like to set this up for the DDD Investigators who have blogs; those who don't can be given accounts if they want to post something. This site also uses a Twitter feed plugin to pull in tweets from the list of DDD Investigators.

The resulting RSS feed from the DDDI can be pulled into a River of News site that aggregates a much larger group of feeds.
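Under the hood, the river-of-news aggregation step is simple enough to sketch. Here's a minimal, illustrative Python version using the feedparser library; the feed URLs are made-up placeholders, and this stands in for, rather than reproduces, the actual WordPress/River4 machinery:

    #!/usr/bin/env python
    """Minimal sketch: merge several RSS/Atom feeds into one reverse-chronological river.

    The feed URLs are hypothetical placeholders; the real list would be the
    DDD Investigators' blog feeds.
    """
    import time
    import feedparser

    FEEDS = [
        "http://example-investigator-one.org/blog/feed/",
        "http://example-investigator-two.org/rss.xml",
    ]

    entries = []
    for url in FEEDS:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            # published_parsed/updated_parsed are time.struct_time objects
            stamp = entry.get("published_parsed") or entry.get("updated_parsed")
            if stamp:
                entries.append((time.mktime(stamp), entry.title, entry.link))

    # newest first -- this is the "river of news"
    entries.sort(reverse=True)
    for stamp, title, link in entries[:50]:
        print(time.strftime("%Y-%m-%d", time.localtime(stamp)), title, link)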

The WordPress setup was fairly easy and I'm going to see how stable it is (I assume it will be very stable, but shrug time will tell :). I'm upgrading my own hosting setup and once that's done, I'll try out River4.

Next steps and longer-term thoughts

Ultimately a data-driven-discovery site that has a bit more information would be nice; I could set up a mostly static site, post it on github, authorize a few people to merge, and then solicit pull requests when people want to add their info or feeds.

One thing to make sure we do is track only a portion of feeds for prolific bloggers, so that I, for example, have to tag a post specifically with 'ddd' to make it show up on the group site. This will avoid post overload.
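Conveniently, most blog engines already expose tag- or category-scoped feeds, so no filtering code is needed on the aggregator side -- you just subscribe to the tag feed instead of the firehose. A hypothetical WordPress example (the blog address is made up; the /tag/.../feed/ pattern is standard WordPress):

    # full firehose vs. only posts tagged 'ddd' (hypothetical blog address)
    FULL_FEED = "https://example-prolific-blogger.wordpress.com/feed/"
    DDD_FEED = "https://example-prolific-blogger.wordpress.com/tag/ddd/feed/"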

I'd particularly like to get a posting set up that integrates well with how I consume content. In particular, I read a lot of things via my phone and tablet, and the ability to post directly from there -- probably via e-mail? -- would be really handy. Right now I mainly post to Twitter (and largely by RTing) which is too ephemeral, or I post to Facebook, which is a different audience. (Is there a good e-mail-to-RSS feed? Or should I just do this as a WordPress blog with the postie plug-in?)

The same overall setup could potentially work for a Software Carpentry Instructor community site, a Data Carpentry Instructor community site, trainee info sites for SWC/DC trainees, and maybe also a bioinformatics trainee info site. But I'd like to avoid anything that involves a lot of administration.

Things I want to avoid

Public forums.

Private forums that I have to administer or that aren't integrated with my e-mail (which is where I get most notifications, in the end).

Overly centralized solutions; I'm much more comfortable with light moderation ("what feeds do I track?") than anything else.


Thoughts?

--titus

Syndicated 2014-10-04 22:00:00 from Living in an Ivory Basement

PyCon 2014: Community, community, community. Also, childcare.

There were lots of problems with PyCon this year. For example, the free, hi-speed wifi made you log in each day. And it was in Montreal, so one of my foreign students couldn't come because he didn't get a visa in time. The company booths were not centrally located. And, worst of all, the PyCon mugs aren't dishwasher safe.

So, as you can imagine, I was pretty disappointed.

Haha, no, just kidding. PyCon 2014 was wonderful! For the first time, I brought quite a few people from my lab; my wife & two children (6 and 3 yro) also came, because it was within driving distance from Michigan, and so we just packed everyone into a minivan and drove 700 miles.

Community

My labbies -- none of whom had ever been to PyCon before -- said that they really enjoyed the conference. In large part that was not just because of the talks, but because of all the subcommunity stuff that went on -- I heard that the various women-centric and LGBTQ meetups were great. The efforts to raise diversity and enforce the code of conduct at PyCon (especially in light of last year's happenings) paid off this year: I heard of no major CoC violations, and while mansplaining was apparently alive and well in some Q&A sessions, the overall atmosphere was really friendly and welcoming.

As Michael Crusoe pointed out to me, PyCon has clearly decided that they will focus on community more than technology -- which is probably the only truly sustainable path anyway. Highlighting this, there were plenty of talks on Python, and also plenty of talks on other things, including building community and raising awareness.

A particular highlight for me in this regard was Naomi Ceder's talk on being a member of the Python community both before and after she transitioned. What an amazing way to raise awareness, and what an excellent talk.

On Childcare

This was also the first year that PyCon had childcare. It was great! We brought our six and three year old girls, and they spent Friday, Saturday and Sunday of the conference in a meeting room in one of the conference hotels. Presumably it will be similarly located next year (PyCon 2015 will be in Montreal also), and I can't recommend it highly enough. Our girls loved it and were very happy to go back each day. They had activities, movies, and swimming - good fun.

I would suggest changing a few things next year -- first, it would be great if parents knew where childcare was going to be. As it was we stayed in the other hotel (Hyatt?), and had to walk from one hotel to the other (the Hilton, I think) before going to the conference. This extended our morning quite a bit; next year, if we bring the kids, it'd be nice to just walk them downstairs in the morning. Second, it might be nice to have the option of extending childcare a day or two; my wife had to take the children while I taught Software Carpentry on the Monday after the conference. We did make use of an in-room babysitter from the daycare to go out one evening, and that was great! She even taught our older child some French into the bargain.

From a larger perspective, it was super fun to have the kids at the conference without having to have either my wife or myself take care of 'em all the time. My wife (who is also technical) got to attend talks, as did I, and I got to introduce the kids to a few people (and make Jesse homesick for his kids) -- maybe next year we can do some more young-kid-focused gatherings?

My talk

I gave a talk again this year -- it was on instrumenting data-intensive pipelines for cloud computing. You can see all the things. It was reasonably well received; the crowd was smaller than my 2013 talk (video), because I'd avoided sexy keywords, but I got good questions and several people told me it had made them think differently about things, which is a nice outcome.

My talk was sandwiched between two other great talks, by Julia Evans (talking on Pandas) and David Beazley (talking about what happens when you lock him in a small enclosed space with a Windows computer that has Python installed along with several TB of C code). Julia's talk was hilariously high energy -- she gets way too excited about bike routes ;) -- and David's was, as usual, the highlight of the conference for me. You should go watch both of 'em.

Next year, I'm thinking about doing a talk on sequencing your own genome and interpreting the results with Python. I think that means I need to sequence my own genome first. You know, for science. Anyone got a spare $1000 lying around?

--titus

Syndicated 2014-05-04 22:00:00 from Living in an Ivory Basement

Tracy Teal's PyCon '14 submission: How I learned to stop worrying and love matplotlib

Note: This is a proposal being submitted by Tracy Teal (@tracykteal) for PyCon '14. I suggested she post it here for feedback, because she does not have her own blog. --titus


TITLE: How I learned to stop worrying and love matplotlib

CATEGORY: Science

DURATION: I prefer a 30 minute time slot

DESCRIPTION:

I was living a dual life, programming in Python and plotting in R, too worried to move R code to Python. Then I decided to make 100 plots in Python in 100 days. I documented the journey on a website, posting the plot & code in IPython Notebooks and welcoming comments. This talk will summarize lessons learned, including technical details, the process and the effects of learning in an online forum.

AUDIENCE:

Scientists interested in statistical computing with Python, those interested in learning more about NumPy and matplotlib.

PYTHON LEVEL: Beginner

OBJECTIVES

Attendees will see use cases for numpy and matplotlib, as well as one approach on how to succeed (or fail) at challenging yourself to learn something new.

DETAILED ABSTRACT:

Many scientific programmers use multiple languages for different applications, primarily because specific packages are available for their standard use cases or they're working with existing code. While these languages work well, it can limit the ability to integrate different components of a project into one framework. The reasons not to use numpy, matplotlib and pandas are therefore often not technical; rather, the effort required to learn or develop a new approach when there are already so many demands on a scientist's time can be inhibiting. Additionally, the development of new packages or integrated code bases is often not as valued in the academic structure.

I am one of those scientists, a microbial ecologist and bioinformatician, writing most of my code in Python and teaching it in Software Carpentry, but doing all my statistics in R. I like R and the R community and in particular, the ecological statistics package, vegan, so I haven’t felt the need to switch, but I realized my reluctance was mainly because I didn't know how to do the same things in Python, not that R was necessarily better for my workflow. So, I figured I should at least give it a try, but it was always a task on the back burner and not particularly interesting. Inspired by Jennifer Dewalt's 180 web sites in 180 days, the idea of making something in order to learn particular skills and the process of deliberate practice, I decided to start a project 100 plots in 100 days. In this project I will make a plot every (week)day for 100 days using Python. Plots encompass y=x to visualizations of multivariate statistics and genomic data. I use matplotlib, numpy and pandas, make the plots in IPython Notebook and post the notebook and comments about the process of creating that plot on my blog. I welcome comments to get more immediate feedback on the process.
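(For readers wondering what "plot number one" of such a project looks like: roughly the minimal matplotlib sketch below, run from an IPython Notebook or the command line. This is purely illustrative and not one of Tracy's actual plots.)

    # Illustrative only: the kind of "day 1" plot a 100-plots project starts with.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    plt.plot(x, x, label="y = x")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.savefig("plot-001.png")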

This talk will focus on lessons learned during the project, both technical and about the process of learning - the expected and unexpected outcomes and how the involvement of community impacts practice.

OUTLINE

  • Intro (5 min)
  • Who am I?
  • Why this project?
  • Show the website
  • Lessons learned (18 min)
  • Technical lessons learned
  • numpy/matplotlib tricks or tips
  • any new statistical algorithms developed for numpy
  • Lessons learned about learning
  • Was this process good for learning something new? Why/ why not?
  • Deliberate practice has been shown to be the most effective way to get good at something. It involves working at something and getting feedback. Was this approach good for that?
  • Social aspects
  • Response to the project
  • Social pressures and accountability - does saying you'll do something publicly make you more likely to do it
  • Concluding remarks (2 min)
  • Would I do something like this again? Would I recommend it?
  • Questions (5 min)

ADDITIONAL NOTES

  • I'm just starting this project, inspired by both a recent Hacker News post on Jennifer Dewalt's 180 web sites in 180 days and the opportunity to present at PyCon. As such, at review time, I'll only be beginning the journey. Success for me for this project would be following through on the 100 plots in 100 (week)days, learning the fundamentals of numpy and matplotlib and making some neat and useful plots along the way. I'll share all the code for the statistics and each plot on the website, however ugly it may be. This could fail, too. I might not be able to get beyond variations on a y=x plot, and I might write terrible code. This talk will document both the successes and the failures, as I hope I and others can learn from both. I do understand the risk of accepting a talk like this where I can't yet tell you what the lessons learned will be.
  • This would be my first time speaking at PyCon. I've spoken at many scientific conferences, been selected as an Everhart Lecturer at Caltech and received "Best Presentation" award at conferences. I've also been an instructor for five Software Carpentry bootcamps, including one for Women in Science and Engineering.

ADDITIONAL REQUIREMENTS

None

Syndicated 2013-09-13 22:00:00 from Living in an Ivory Basement

Data intensive biology in the cloud: instrumenting ALL the things

Here's a draft PyCon '14 proposal. Comments and suggestions welcome!


Title: Data intensive biology in the cloud: instrumenting ALL the things

Description: (400 ch)

Cloud computing offers some great opportunities for science, but most cloud computing platforms are both I/O and memory limited, and hence are poor matches for data-intensive computing. After four years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source, of course.

Audience: People who are interested in things.

Python level: Beginner/intermediate.

Objectives:

Attendees will

  • learn a bit about I/O and big-memory performance in demanding situations;
  • see performance numbers for various cloud platforms;
  • hear about why some people can't use Hadoop to process large amounts of data;
  • gain some insight into the sad state of open science;

Detailed abstract:

The cloud provides great opportunities for a variety of important computational science challenges, including reproducible science, standardized computational workflows, comparative benchmarking, and focused optimization. It can also be a disruptive force for the betterment of science by eliminating the need for large infrastructure investments and supporting exploratory computational science on previously challenging scales. However, most cloud computing use in science so far has focused on relatively mundane "pleasantly parallel" problems. Our lab has spent many moons addressing a large, non-parallelizable "big data/big graph" problem -- sequence assembly -- with a mixture of Python and C++, some fun new data structures and algorithms, and a lot of cloud computing. Most recently we have been working on open computational "protocols", workflows, and pipelines for democratizing certain kinds of sequence analysis. As part of this work we are tackling issues of standardized test data sets to support comparative benchmarking, targeted optimization, reproducible science, and computational standardization in biology. In this talk I'll discuss our efforts to understand where our computational bottlenecks are, what kinds of optimization and parallelization efforts make sense financially, and how the cloud is enabling us to be usefully disruptive. As a bonus I'll talk about how the focus on pleasantly parallelizable tasks has warped everyone's brains and convinced them that engineering, not research, is really interesting.
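(To make "instrumenting the pipeline" concrete, here is a minimal sketch of the kind of wrapper I mean: time a pipeline stage and record the peak memory of its child processes. The stage command is a made-up placeholder, and this is illustrative rather than our actual benchmarking code.)

    #!/usr/bin/env python
    """Minimal sketch: wrap one pipeline stage, record wall time and peak child memory.

    The command below is a placeholder; illustrative only, not our production code.
    """
    import resource
    import subprocess
    import time

    def run_stage(name, command):
        start = time.time()
        subprocess.check_call(command, shell=True)
        elapsed = time.time() - start
        # ru_maxrss is reported in kilobytes on Linux (bytes on OS X).
        peak_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
        print("%s: %.1f s wall time, %.1f MB peak child RSS" %
              (name, elapsed, peak_rss_kb / 1024.0))

    if __name__ == "__main__":
        run_stage("compress-reads", "gzip -c reads.fastq > reads.fastq.gz")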

Outline:

  1. Defining the terms: cloud computing; data intensive; compute intensive.
  2. Our data-intensive problem: sequence assembly and the big graph problem. The scale of the problem. A complete analysis protocol.
  3. Predicted bottlenecks, including computation and I/O.
  4. Actual bottlenecks, including NUMA architecture and I/O.
  5. A cost-benefit analysis of various approaches, including buying more memory; striping data across multiple volumes; increasing I/O performance; focusing on software development; "pipelining" across multiple machines; theory vs. practice in terms of implementation.
  6. A discussion of solutions that won't work, including parallelization and GPUs.
  7. Making analysis "free" and using low-cost compute to analyze other people's data. Trying to be disruptive.

Syndicated 2013-09-09 22:00:00 from Living in an Ivory Basement

How can we do literate programming for reproducibility, in Python?

Note: Yarden Katz (the author of MISO) sent me the e-mail below, and I asked him if I could post it as a guest-post on my blog. He said yes - so here it is! Feedback solicited.

---

Hi Titus,

Hope all is well. A recent tweet you had about Ben Bolker's notes for lit. programming in R (via @hylopsar) made me think about the same for Python, which has been bugging me for a while. Wanted to see if you have any thoughts on getting the equivalent in Python.

What I've always wanted in Python is a way to simultaneously document and execute code that describes an applied analysis pipeline. Some easy way to declaratively describe and document a step-by-step analysis pipeline: Given X datasets available from some web resource, which depends on packages / tools Y, download the data and run the pipeline and ensure that you get results Z. I'd like a language that allows a description that is easily reproducible on a system that's not your own, and forces you to declaratively state things in such a way that you can't cheat with hardcoded paths or quirky settings/versions of software that apply only to your own system. A kind of "literate" pipeline for applied analysis pipelines that allows you to state your assertions/expectations along the way.
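(A minimal sketch, purely to make the idea above concrete -- all of the names, URLs, versions, and checksums below are hypothetical, and this is not an existing tool. Each step declares its inputs, its external tool dependencies, the command to run, and an assertion about the expected output.)

    """Illustrative sketch of a declaratively described pipeline step.

    Everything here (URLs, tool versions, checksums, the run_step helper) is
    made up; it shows the shape of the idea, not a real system.
    """
    import hashlib
    import subprocess
    import urllib.request

    STEP = {
        "name": "align-reads",
        "inputs": {"reads.fastq.gz": "http://example.org/data/reads.fastq.gz"},
        "requires": {"bowtie2": "2.1.0"},   # external tool + expected version
        "command": "bowtie2 -x ref -U reads.fastq.gz -S aligned.sam",
        "expect": {"aligned.sam": "d41d8cd98f00b204e9800998ecf8427e"},  # hypothetical md5
    }

    def run_step(step):
        # fetch declared inputs from the web resource -- no hardcoded local paths
        for filename, url in step["inputs"].items():
            urllib.request.urlretrieve(url, filename)
        # (a real system would also verify the declared tool versions in "requires")
        subprocess.check_call(step["command"], shell=True)
        # enforce the declared expectations about the output
        for filename, md5 in step["expect"].items():
            with open(filename, "rb") as fp:
                assert hashlib.md5(fp.read()).hexdigest() == md5, filename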

One of the main advantages of R over Python is that they have a packaging system that actually works, whereas pip/setuptools/distribute are all broken and hard to use, even for Python experts, let alone people who don't want to delve into the guts of Python. So ideally I'd like a system that takes this description of the code and the inputs and executes it in a new virtual environment. readthedocs.org does this for documentation, and it's a great way to ensure that you don't have unnoticed hardcoded paths, or Python libraries or packages that cannot be fetched by package managers. Because Python libraries are so hopelessly complicated and broken, and because in comp. bio we rely so often on external tools (tophat version/bowtie version/etc.), this is all the more important. Something that ensures that if you follow these steps, with these data, it'll be automatically installable on your system, and give you the expected output -- no matter what! Knowing that it runs on a server other than your own is key.

Some related tools/ideas that haven't worked very well for me for this purpose, or that only partially address this:

  • IPython notebook: I've had issues with IPython in general, but even when it works, it doesn't address the problem of describing systematically the input and output of the problem, which is key in analysis pipelines. It also doesn't give you a way to state dependencies. If I have a series of calls to numpy/scipy/matplotlib and I want to share that with you, it's good, but an applied analysis pipeline is far more complex than using a bunch of commonly available Python packages to get an output.
  • Unit tests: Standard unit tests are OK for generic software tools. But they don't really make sense for applied analysis pipelines, where the software that you're writing is basically a bunch of analysis (and possibly plotting) code, and not a generic algorithm. You're not testing internal Python library calls, and testing is only a small component of the goal (the other part is describing dependencies and data, and how the pipeline works). You're trying to describe a flow of sequential steps, with forced assertions and conditions for reproducibility. Some of these steps might not be fully automated, or might take far too long to run as a unit test. So what I'm looking for is closer to some kind of sphinx/pydoc document interspersed with executable code, than a plain Python file with unit tests.
  • Ruffus: It's too complicated for most things in my view and it doesn't give you a way to describe the data inputs, etc. It's best for pipelines that consist of internal Python functions that exist within a module, but it gives you no features for describing interaction with the external world (external input data, external tools of a specific version whose output you process).
  • Sphinx/Pydoc: One thing that forces you to get things somewhat right is Sphinx/Pydoc. It was used for PyCogent, which I occasionally contribute to, and they had configured it so that all the inline examples in the sphinx .rst file were run in real time. That's nice, though it's still running only on your own environment and has no features for describing complex data sets / inputs; it was really made for testing library calls within a Python package (like an IPython notebook) -- again, not meant for data-driven pipelines.

The ideal system would even allow you to register analysis pipelines or Python functions in some kind of web system, where each analysis can get a URI and be run with a single click dispatched to some kind of amazon node. But that's not necessary and I don't use the cloud for now.

Would love to hear your thoughts (feel free to share with others who might have views on this.) I've thought about this for a while and never found a satisfactory solution.

Thanks very much!

Best,

--Yarden

Syndicated 2013-07-12 22:00:00 from Living in an Ivory Basement

Excerpts from Coders At Work: Peter Deutsch Interview

I've been reading Peter Seibel's excellent book, Coders at Work, which is a transcription of interviews with a dozen or so very well known and impactful programmers. After the first two interviews, I found myself itching to highlight certain sections, and then I thought, heck, why not post some of the bits I found most interesting? This is a book everyone should be aware of, and it's surprisingly readable. Highly recommended.

This is the second of my blog posts. The first contained excerpts from Seibel's interview with Joe Armstrong.

The excerpts below come from Seibel's interview with Peter Deutsch, who is (among many other things) the creator and long-time maintainer of Ghostscript.

My comments are labeled 'CTB'.


On programmers

Seibel: So is it OK for people who don't have a talent for systems-level thinking to work on smaller parts of software? Can you split the programmers and the architects? Or do you really want everyone who's working on systems-style software, since it is sort of fractal, to be able think in terms of systems?

Deutsch: ... But in terms of who should do software, I don't have a good flat answer to that. I do know that the further down in the plumbing the software is, the more important it is that it be built by really good people. That's an elitist point of view, and I'm happy to hold it.

...

You know the old story about the telephone and the telephone operators? The story is, sometime fairly early in the adoption of the telephone, when it was clear that use of the telephone was just expanding at an incredible rate, more and more people were having to be hired to work as operators because we didn't have dial telephones. Someone extrapolated the growth rate and said "My God. By 20 or 30 years from now, every single person will have to be a telephone operator." Well, that's happened. I think something like that may be happening in some big areas of programming as well.

CTB: This seemed like interesting commentary on the increasing ... democratization? ... of computer use.

Fast, cheap, good -- pick any two.

Deutsch: ...The problem being the old saying in the business: "fast, cheap, good -- pick any two." If you build things fast and you have some way of building them inexpensively, it's very unlikely that they're going to be good. But this school of thought says you shouldn't expect software to last.

I think behind this perhaps is a mindset of software as expense vs software as capital asset. I'm very much in the software-as-capital-asset school. When I was working at ParcPlace and Adele Goldberg was out there evangelizing object-oriented design, part of the way we talked about objects and part of the way we advocated object-oriented languages and design to our customers and potential customers is to say, "Look, you should treat software as a capital asset."

And there is no such thing as a capital asset that doesn't require ongoing maintenance and investment. You should expect that there's going to be some cost associated with maintaining a growing library of reusable software. And that is going to complicate your accounting because it means you can't charge the cost of building a piece of software only to the project or the customer that's motivating the creation of that software at this time. You have to think of it the way you would think of a capital asset.

CTB: A really good perspective that's relevant to scientists' concerns about software and data.

On how software practice has (not) improved over the last 30 years

Seibel: So you don't believe the original object-reuse pitch quite as strongly now. Was there something wrong with the theory, or has it just not worked out for historical reasons?

Deutsch: Well, part of the reason that I don't call myself a computer scientist any more is that I've seen software practice over a period of just about 50 years and it basically hasn't improved tremendously in about the last 30 years.

If you look at programming languages I would make a strong case that programming languages have not improved qualitatively in the last 40 years. There is no programming language in use today that is qualitatively better than Simula-67. I know that sounds kind of funny, but I really mean it. Java is not that much better than Simula-67.

Seibel: Smalltalk?

Deutsch: Smalltalk is somewhat better than Simula-67. But Smalltalk as it exists today essentially existed in 1976. I'm not saying that today's languages aren't better than the languages that existed 30 years ago. The language that I do all of my programming in today, Python, is, I think, a lot better than anything that was available 30 years ago. I like it better than Smalltalk.

I use the word qualitatively very deliberately. Every programming language today that I can think of, that's in substantial use, has the concept of pointer. I don't know of any way to make software built using that fundamental concept qualitatively better.

CTB: Well, that's just a weird opinion in some ways. But interesting, especially since he has been around and active for so long, and his perspective is obviously not based in ignorance.

On temptation

Deutsch: Every now and then I feel a temptation to design a programming language but then I just lie down until it goes away. But if I were to give in to that temptation, it would have a pretty fundamental cleavage between a functional part that talked only about values and had no concept of pointer, and a different sphere of some kind that talked about patterns of sharing and reference and control.

More on Smalltalk and Python

Seibel: So, despite it not being qualitatively better than Smalltalk, you still like Python better.

Deutsch: I do. There are several reasons. With Python there's a very clear story of what is a program and what it means to run a program and what it means to be part of a program. There's a concept of module, and modules declare basically what information they need from other modules. So it's possible to develop a module or a group of modules and share them with other people and those other people can come along and look at those modules and know pretty much exactly what they depend on and know what their boundaries are.

...

I've talked with the few of my buddies that are still at VisualWorks about open-sourcing the object engine, the just-in-time code generator, which, even though I wrote it, I still think is better than a lot of what's out there. Gosh, here we have Smalltalk, which has this really great code-generation machinery, which is now very mature -- it's about 20 years old and it's extremely reliable. It's a relatively simple, relatively retargetable, quite efficient just-in-time code generator that's designed to work really well with non type-declared languages. On the other hand, here's Python, which is this wonderful language with these wonderful libraries and a slow-as-mud implementation. Wouldn't it be nice if we could bring the two together?

(I'm a bit fixated on Python. OK?)

Deutsch: ... But that brings me to the other half, the other reason I like Python syntax better, which is that Lisp is lexically pretty monotonous.

Seibel: I think Larry Wall described it as a bowl of oatmeal with fingernail clippings in it.

Deutsch: Well, my description of Perl is something that looks like it came out of the wrong end of a dog. I think Larry Wall has a lot of nerve talking about language design -- Perl is an abomination as a language. But let's not go there.

CTB: heh.

Syndicated 2013-05-13 22:00:00 from Living in an Ivory Basement

Excerpts from Coders At Work: Joe Armstrong Interview

I've been reading Peter Seibel's excellent book, Coders at Work, which is a transcription of interviews with a dozen or so very well known and impactful programmers. After the first two interviews, I found myself itching to highlight certain sections, and then I thought, heck, why not post some of the bits I found most interesting? This is a book everyone should be aware of, and it's surprisingly readable. Highly recommended.

This is the first in what I expect to be a dozen or so blog posts, time permitting.

The excerpts below come from Seibel's interview with Joe Armstrong, the inventor of Erlang.

My comments are labeled 'CTB'.


On learning to program

Seibel: How did you learn to program? When did it all start?

Armstrong: When I was at school. I was born in 1950 so there weren't many computers around then. The final year of school, I suppose I must have been 17, the local council had a mainframe computer -- probably an IBM. We could write Fortran on it. It was the usual thing -- you wrote your programs on coding sheets and you sent them off. A week later the coding sheets and the punch cards came back and you had to approve them. But the people who made the punch cards would make mistakes. So it might go backwards and forwards one or two times. And then it would finally go to the computer center.

Then it went to the computer center and came back and the Fortran compiler had stopped at the first syntactic error in the program. It didn't even process the remainder of the program. It was something like three months to run your first program. I learned then, instead of sending one program you had to develop every single subroutine in parallel and send the lot. I think I wrote a little program to display a chess board -- it would plot a chess board on the printer. But I had to write all the subroutines as parallel tasks because the turnaround time was so appallingly bad.

CTB: I think it's fascinating to interpret this statement in light of Erlang's pattern of small components working in parallel (http://en.wikipedia.org/wiki/Erlang_(programming_language)). Did Armstrong shape his mental architecture in this pattern from the early mainframe days, and then translate that over to programming design? Also, this made me think about unit testing in a whole new way.

On modern gizmos like "hierarchical file systems", and productivity

Armstrong: The funny thing is, thinking back, I don't think all of these modern gizmos actually make you any more productive. Hierarchical file systems -- how do they make you more productive? Most of software development goes on in your head anyway. I think having worked with that simpler system imposes a kind of disciplined way of thinking. If you haven't got a directory system and you have to put all the files in one directory, you have to be fairly disciplined. If you haven't got a revision control system, you have to be fairly disciplined. Given that you apply that discipline to what you're doing it doesn't seem to me to be any better to have hierarchical file systems and revision control. They don't solve the fundamental problem of solving your problem. They probably make it easier for groups of people to work together. For individuals I don't see any difference.

CTB: If your tools require you to be as good as Joe Armstrong in order to get things done, that's probably not a generalizable solution...

On calling out to other languages, and Domain Specific Languages

Seibel: So if you were writing a big image processing work-flow system, then would you write the actual image transformation in some other language?

Armstrong: I'd write them in C or assembler or something. Or I might actually write them in a dialect of Erlang and then cross-compile the Erlang to C. Make a dialect - this kind of domain-specific language kind of idea. Or I might write Erlang programs which generate C programs rather than writing the C programs by hand. But the target language would be C or assembler or something. Whether I wrote them by hand or generated them would be the interesting question. I'm tending toward automatically generating C rather than writing it by hand because it's just easier.

CTB: heh. So, I'd just generate C automatically from a dialect of Erlang...

On debugging

Seibel: What are the techniques that you use there? Print statements?

Armstrong: Print statements. The great gods of programming said, "Thou shall put printf statements in your program at the point where you think it's gone wrong, recompile, and run it."

Then there's -- I don't know if I read it somewhere or if I invented it myself -- Joe's Law of Debugging, which is that all errors will be plus/minus three statements of the place you last changed the program.

CTB: one surprising commonality amongst many of the interviews thus far is the lack of use (or disdain for) debuggers. Almost everyone trots out print statements!

Syndicated 2013-04-28 22:00:00 from Living in an Ivory Basement

