titus is currently certified at Journeyer level.

Name: Titus Brown
Member since: 2004-10-29 00:14:46
Last Login: 2012-06-25 05:32:14

Homepage: http://ivory.idyll.org/

Notes:

Also see http://ged.msu.edu/ (work page).

Recent blog entries by titus

Syndication: RSS 2.0

PyCon 2014: Community, community, community. Also, childcare.

There were lots of problems with PyCon this year. For example, the free, hi-speed wifi made you log in each day. And it was in Montreal, so one of my foreign students couldn't come because he didn't get a visa in time. The company booths were not centrally located. And, worst of all, the PyCon mugs aren't dishwasher safe.

So, as you can imagine, I was pretty disappointed.

Haha, no, just kidding. PyCon 2014 was wonderful! For the first time, I brought quite a few people from my lab; my wife & two children (6 and 3 years old) also came, because it was within driving distance of Michigan, so we just packed everyone into a minivan and drove 700 miles.

Community

My labbies -- none of whom had ever been to PyCon before -- said that they really enjoyed the conference. In large part that was not just because of the talks, but because of all the subcommunity stuff that went on -- I heard that the various women-centric and LGBTQ meetups were great. The efforts to increase diversity and enforce the code of conduct at PyCon (especially in light of last year's happenings) paid off this year: I heard of no major CoC violations, and while mansplaining was apparently alive and well in some Q&A sessions, the overall atmosphere was really friendly and welcoming.

As Michael Crusoe pointed out to me, PyCon has clearly decided that they will focus on community more than technology -- which is probably the only truly sustainable path anyway. Highlighting this, there were plenty of talks on Python, and also plenty of talks on other things, including building community and raising awareness.

A particular highlight for me in this regard was Naomi Ceder's talk on being a member of the Python community both before and after she transitioned. What an amazing way to raise awareness, and what an excellent talk.

On Childcare

This was also the first year that PyCon had childcare. It was great! We brought our six- and three-year-old girls, and they spent Friday, Saturday and Sunday of the conference in a meeting room in one of the conference hotels. Presumably it will be similarly located next year (PyCon 2015 will be in Montreal also), and I can't recommend it highly enough. Our girls loved it and were very happy to go back each day. They had activities, movies, and swimming - good fun.

I would suggest changing a few things next year -- first, it would be great if parents knew in advance where childcare was going to be. As it was, we stayed in the other hotel (Hyatt?), and had to walk from one hotel to the other (the Hilton, I think) before going to the conference. This extended our morning quite a bit; next year, if we bring the kids, it'd be nice to just walk them downstairs in the morning. Second, it might be nice to have the option of extending childcare by a day or two; my wife had to take the children while I taught Software Carpentry on the Monday after the conference. We did make use of an in-room babysitter from the daycare to go out one evening, and that was great! She even taught our older child some French into the bargain.

From a larger perspective, it was super fun to have the kids at the conference without having to have either my wife or myself take care of 'em all the time. My wife (who is also technical) got to attend talks, as did I, and I got to introduce the kids to a few people (and make Jesse homesick for his kids) -- maybe next year we can do some more young-kid-focused gatherings?

My talk

I gave a talk again this year -- it was on instrumenting data-intensive pipelines for cloud computing. You can see all the things. It was reasonably well received; the crowd was smaller than my 2013 talk (video), because I'd avoided sexy keywords, but I got good questions and several people told me it had made them think differently about things, which is a nice outcome.

My talk was sandwiched between two other great talks, by Julia Evans (talking on Pandas) and David Beazley (talking about what happens when you lock him in a small enclosed space with a Windows computer that has Python installed along with several TB of C code). Julia's talk was hilariously high energy -- she gets way too excited about bike routes ;) -- and David's was, as usual, the highlight of the conference for me. You should go watch both of 'em.

Next year, I'm thinking about doing a talk on sequencing your own genome and interpreting the results with Python. I think that means I need to sequence my own genome first. You know, for science. Anyone got a spare $1000 lying around?

--titus

Syndicated 2014-05-04 22:00:00 from Living in an Ivory Basement

Tracy Teal's PyCon '14 submission: How I learned to stop worrying and love matplotlib

Note: This is a proposal being submitted by Tracy Teal (@tracykteal) for PyCon '14. I suggested she post it here for feedback, because she does not have her own blog. --titus


TITLE: How I learned to stop worrying and love matplotlib

CATEGORY: Science

DURATION: I prefer a 30 minute time slot

DESCRIPTION:

I was living a dual life, programming in Python and plotting in R, too worried to move R code to Python. Then I decided to make 100 plots in Python in 100 days. I documented the journey on a website, posting the plot & code in IPython Notebooks and welcoming comments. This talk will summarize lessons learned, including technical details, the process and the effects of learning in an online forum.

AUDIENCE:

Scientists interested in statistical computing with Python, and those interested in learning more about NumPy and matplotlib.

PYTHON LEVEL: Beginner

OBJECTIVES

Attendees will see use cases for numpy and matplotlib, as well as one approach to succeeding (or failing) at challenging yourself to learn something new.

DETAILED ABSTRACT:

Many scientific programmers use multiple languages for different applications, primarily because specific packages are available for their standard use cases or because they're working with existing code. While these languages work well, using several of them can limit the ability to integrate the different components of a project into one framework. The reason not to use numpy, matplotlib and pandas is therefore often not technical; rather, the effort required to learn or develop a new approach when there are already so many demands on a scientist's time can be inhibiting. Additionally, the development of new packages or integrated code bases is often not as valued in the academic structure.

I am one of those scientists: a microbial ecologist and bioinformatician, writing most of my code in Python and teaching it in Software Carpentry, but doing all my statistics in R. I like R and the R community, and in particular the ecological statistics package vegan, so I haven’t felt the need to switch. But I realized my reluctance was mainly because I didn't know how to do the same things in Python, not because R was necessarily better for my workflow. So I figured I should at least give it a try, but it was always a task on the back burner and not particularly interesting. Inspired by Jennifer Dewalt's 180 web sites in 180 days, the idea of making something in order to learn particular skills, and the process of deliberate practice, I decided to start a project, 100 plots in 100 days. In this project I will make a plot every (week)day for 100 days using Python. Plots range from y=x to visualizations of multivariate statistics and genomic data. I use matplotlib, numpy and pandas, make the plots in IPython Notebook, and post the notebook and comments about the process of creating each plot on my blog. I welcome comments, so I can get more immediate feedback on the process.
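
To make this concrete, here's a minimal sketch of the kind of "day one" plot the project starts from -- a y=x line drawn with numpy and matplotlib, roughly as it might appear in one of the IPython Notebooks (the filename and labels are illustrative, not taken from the actual project):

    import numpy as np
    import matplotlib.pyplot as plt

    # Day 1: the humble y = x plot.
    x = np.linspace(0, 10, 100)   # 100 evenly spaced points on [0, 10]
    y = x                         # the simplest relationship there is

    fig, ax = plt.subplots()
    ax.plot(x, y, label="y = x")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    fig.savefig("day001_y_equals_x.png")   # hypothetical output filename

Later plots in the series trade the toy data for multivariate statistics and genomic data, but the notebook-plus-blog-post workflow stays the same.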

This talk will focus on lessons learned during the project, both technical and about the process of learning - the expected and unexpected outcomes and how the involvement of community impacts practice.

OUTLINE

  • Intro (5 min)
      • Who am I?
      • Why this project?
      • Show the website
  • Lessons learned (18 min)
      • Technical lessons learned
          • numpy/matplotlib tricks or tips
          • any new statistical algorithms developed for numpy
      • Lessons learned about learning
          • Was this process good for learning something new? Why / why not?
          • Deliberate practice has been shown to be the most effective way to get good at something. It involves working at something and getting feedback. Was this approach good for that?
      • Social aspects
          • Response to the project
          • Social pressures and accountability - does saying you'll do something publicly make you more likely to do it?
  • Concluding remarks (2 min)
      • Would I do something like this again? Would I recommend it?
  • Questions (5 min)

ADDITIONAL NOTES

  • I'm just starting this project, inspired by both a recent Hacker News post on Jennifer Dewalt's 180 web sites in 180 days and the opportunity to present at PyCon. As such, at review time I'll only be beginning the journey. Success for me on this project would be following through on the 100 plots in 100 (week)days, learning the fundamentals of numpy and matplotlib, and making some neat and useful plots along the way. I'll share all the code for the statistics and each plot on the website, however ugly it may be. This could fail, too: I might not be able to get beyond variations on a y=x plot, and I might write terrible code. This talk will document both the successes and the failures, as I hope I and others can learn from both. I do understand the risk of accepting a talk like this when I can't yet tell you what the lessons learned will be.
  • This would be my first time speaking at PyCon. I've spoken at many scientific conferences, been selected as an Everhart Lecturer at Caltech, and received "Best Presentation" awards at conferences. I've also been an instructor for five Software Carpentry bootcamps, including one for Women in Science and Engineering.

ADDITIONAL REQUIREMENTS

None

Syndicated 2013-09-13 22:00:00 from Living in an Ivory Basement

Data intensive biology in the cloud: instrumenting ALL the things

Here's a draft PyCon '14 proposal. Comments and suggestions welcome!


Title: Data intensive biology in the cloud: instrumenting ALL the things

Description: (400 ch)

Cloud computing offers some great opportunities for science, but most cloud computing platforms are both I/O and memory limited, and hence are poor matches for data-intensive computing. After four years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source, of course.

Audience: People who are interested in things.

Python level: Beginner/intermediate.

Objectives:

Attendees will

  • learn a bit about I/O and big-memory performance in demanding situations;
  • see performance numbers for various cloud platforms;
  • hear about why some people can't use Hadoop to process large amounts of data;
  • gain some insight into the sad state of open science.

Detailed abstract:

The cloud provides great opportunities for a variety of important computational science challenges, including reproducible science, standardized computational workflows, comparative benchmarking, and focused optimization. It can also be a disruptive force for the betterment of science, by eliminating the need for large infrastructure investments and supporting exploratory computational science on previously challenging scales. However, most cloud computing use in science so far has focused on relatively mundane "pleasantly parallel" problems. Our lab has spent many moons addressing a large, non-parallelizable "big data/big graph" problem -- sequence assembly -- with a mixture of Python and C++, some fun new data structures and algorithms, and a lot of cloud computing. Most recently we have been working on open computational "protocols", workflows, and pipelines for democratizing certain kinds of sequence analysis. As part of this work we are tackling issues of standardized test data sets to support comparative benchmarking, targeted optimization, reproducible science, and computational standardization in biology. In this talk I'll discuss our efforts to understand where our computational bottlenecks are, what kinds of optimization and parallelization efforts make sense financially, and how the cloud is enabling us to be usefully disruptive. As a bonus, I'll talk about how the focus on pleasantly parallelizable tasks has warped everyone's brains and convinced them that engineering, not research, is really interesting.
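
For a concrete sense of what "instrumenting" a pipeline stage can mean, here is a minimal sketch (not the actual benchmarking code) that wraps a single stage and reports wall-clock time and peak memory using only the Python standard library; the stage name and wrapped function are placeholders:

    import time
    import resource   # Unix-only, but standard library

    def instrument(stage_name, func, *args, **kwargs):
        # Run one pipeline stage and report wall time and peak RSS.
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        # ru_maxrss is the process's peak resident set size (KB on Linux).
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print("%s: %.1f s, peak RSS %.1f MB" % (stage_name, elapsed, peak_kb / 1024.0))
        return result

    # Hypothetical usage, wrapping one stage of an assembly pipeline:
    # contigs = instrument("assemble", assemble_reads, "reads.fastq")

Per-stage numbers like these, compared across cloud instance types, are what make the cost-benefit analysis in the outline below possible.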

Outline:

  1. Defining the terms: cloud computing; data intensive; compute intensive.
  2. Our data-intensive problem: sequence assembly and the big graph problem. The scale of the problem. A complete analysis protocol.
  3. Predicted bottlenecks, including computation and I/O.
  4. Actual bottlenecks, including NUMA architecture and I/O.
  5. A cost-benefit analysis of various approaches, including buying more memory; striping data across multiple volumes; increasing I/O performance; focusing on software development; "pipelining" across multiple machines; theory vs practice in terms of implementation.
  6. A discussion of solutions that won't work, including parallelization and GPU.
  7. Making analysis "free" and using low-cost compute to analyze other people's data. Trying to be disruptive.

Syndicated 2013-09-09 22:00:00 from Living in an Ivory Basement

How can we do literate programming for reproducibility, in Python?

Note: Yarden Katz (the author of MISO) sent me the e-mail below, and I asked him if I could post it as a guest-post on my blog. He said yes - so here it is! Feedback solicited.

---

Hi Titus,

Hope all is well. A recent tweet of yours about Ben Bolker's notes for lit. programming in R (via @hylopsar) made me think about the same for Python, which has been bugging me for a while. Wanted to see if you have any thoughts on getting the equivalent in Python.

What I've always wanted in Python is a way to simultaneously document and execute code that describes an applied analysis pipeline. Some easy way to declaratively describe and document a step-by-step analysis pipeline: Given X datasets available from some web resource, which depends on packages / tools Y, download the data and run the pipeline and ensure that you get results Z. I'd like a language that allows a description that is easily reproducible on a system that's not your own, and forces you to declaratively state things in such a way that you can't cheat with hardcoded paths or quirky settings/versions of software that apply only to your own system. A kind of "literate" pipeline for applied analysis pipelines that allows you to state your assertions/expectations along the way.
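
For concreteness, here is one minimal sketch of that "declared inputs, declared dependencies, asserted outputs" shape in plain Python -- every URL, tool, version, and checksum below is hypothetical, purely illustrative:

    import hashlib
    import subprocess
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2

    # One hypothetical pipeline step: nothing hardcoded to one person's machine.
    STEP = {
        "name": "align-reads",
        "inputs": {"reads.fastq": "http://example.org/data/reads.fastq"},
        "requires": {"bowtie": "0.12.7"},   # external tool + exact version
        "command": ["bowtie", "ref_index", "reads.fastq", "out.sam"],
        "expect": {"out.sam": "9e107d9d372bb6826bd81d3542a419d6"},  # md5 of known-good output
    }

    def run_step(step):
        # Fetch the declared inputs -- no quirky local paths allowed.
        for fname, url in step["inputs"].items():
            with open(fname, "wb") as fp:
                fp.write(urlopen(url).read())
        # Run the declared command; fail loudly if the tool fails.
        subprocess.check_call(step["command"])
        # Assert that each output matches its declared checksum.
        for fname, want in step["expect"].items():
            with open(fname, "rb") as fp:
                got = hashlib.md5(fp.read()).hexdigest()
            assert got == want, "%s: unexpected output in %s" % (step["name"], fname)

A real system would also verify the declared tool versions and execute in a fresh virtual environment; the point of the sketch is just that the description, not someone's home directory, carries all the information.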

One of the main advantages of R over Python is that it has a packaging system that actually works, whereas pip/setuptools/distribute are all broken and hard to use, even for Python experts, let alone people who don't want to delve into the guts of Python. So ideally I'd like a system that takes this description of the code and the inputs and executes it on a new virtual environment. readthedocs.org does this for documentation, and it's a great way to ensure that you don't have unnoticed hardcoded paths, or Python libraries or packages that cannot be fetched by package managers. Because Python libraries are so hopelessly complicated and broken, and because in comp. bio we rely so often on external tools (tophat version/bowtie version/etc.), this is all the more important. Something that ensures that if you follow these steps, with these data, it'll be automatically installable on your system and give you the expected output -- no matter what! Knowing that it runs on a server other than your own is key.

Some related tools/ideas that haven't worked very well for me for this purpose, or that only partially address this:

  • IPython notebook: I've had issues with IPython in general, but even when it works, it doesn't address the problem of systematically describing the inputs and outputs of the pipeline, which is key in analysis pipelines. It also doesn't give you a way to state dependencies. If I have a series of calls to numpy/scipy/matplotlib and I want to share that with you, it's good, but an applied analysis pipeline is far more complex than using a bunch of commonly available Python packages to get an output.
  • Unit tests: Standard unit tests are OK for generic software tools. But they don't really make sense for applied analysis pipelines, where the software that you're writing is basically a bunch of analysis (and possibly plotting) code, and not a generic algorithm. You're not testing internal Python library calls, and testing is only a small component of the goal (the other part is describing dependencies and data, and how the pipeline works). You're trying to describe a flow of sequential steps, with forced assertions and conditions for reproducibility. Some of these steps might not be fully automated, or might take far too long to run as a unit test. So what I'm looking for is closer to some kind of sphinx/pydoc document interspersed with executable code, than a plain Python file with unit tests.
  • Ruffus: It's too complicated for most things, in my view, and it doesn't give you a way to describe the data inputs, etc. It's best for pipelines that consist of internal Python functions that exist within a module, but it gives you no features for describing interaction with the external world (external input data, external tools of a specific version whose output you process).
  • Sphinx/Pydoc: One thing that forces you to get things somewhat right is Sphinx/Pydoc with executable inline examples (a small sketch follows this list). This was done for PyCogent, which I occasionally contribute to; they had configured it so that all the inline examples in the sphinx .rst files were run in real time. That's nice, though it's still running only in your own environment and has no features for describing complex data sets / inputs; it was really made for testing library calls within a Python package (like an IPython notebook) -- again, not meant for data-driven pipelines.
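
For reference, the PyCogent/Sphinx approach mentioned above boils down to doctest: examples embedded in docstrings (or in .rst files, via Sphinx's sphinx.ext.doctest extension) are re-executed and checked against their shown output. A minimal sketch, with a made-up stand-in function:

    def gc_content(seq):
        """Return the fraction of G and C bases in a DNA sequence.

        >>> gc_content("ATGC")
        0.5
        >>> gc_content("AAAA")
        0.0
        """
        seq = seq.upper()
        return float(seq.count("G") + seq.count("C")) / len(seq)

    if __name__ == "__main__":
        import doctest
        doctest.testmod()   # re-runs the examples above; silent if they all pass

This keeps library examples honest, but it still runs only in your own environment and says nothing about external data or tools.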

The ideal system would even allow you to register analysis pipelines or Python functions in some kind of web system, where each analysis gets a URI and can be run with a single click, dispatched to some kind of Amazon node. But that's not necessary, and I don't use the cloud for now.

Would love to hear your thoughts (feel free to share with others who might have views on this.) I've thought about this for a while and never found a satisfactory solution.

Thanks very much!

Best,

--Yarden

Syndicated 2013-07-12 22:00:00 from Living in an Ivory Basement

titus certified others as follows:

  • titus certified mirwin as Apprentice
  • titus certified Demie as Journeyer
  • titus certified aero6dof as Journeyer
  • titus certified codedbliss as Journeyer
  • titus certified tcopeland as Journeyer
  • titus certified neoneye as Journeyer
  • titus certified hereticmessiah as Journeyer
  • titus certified esteve as Journeyer
  • titus certified demoncrat as Master
  • titus certified pipeman as Journeyer
  • titus certified cdfrey as Journeyer
  • titus certified Xorian as Journeyer
  • titus certified Ohayou as Apprentice
  • titus certified icherevko as Journeyer
  • titus certified bi as Journeyer

Others have certified titus as follows:

  • gnutizen certified titus as Journeyer
  • kai certified titus as Journeyer
  • lerdsuwa certified titus as Apprentice
  • mirwin certified titus as Master
  • wspace certified titus as Journeyer
  • sye certified titus as Apprentice
  • demoncrat certified titus as Journeyer
  • pipeman certified titus as Journeyer
  • hereticmessiah certified titus as Journeyer
  • esteve certified titus as Journeyer
  • cdfrey certified titus as Journeyer
  • StevenRainwater certified titus as Journeyer
  • Ankh certified titus as Apprentice
  • wingo certified titus as Journeyer
  • badvogato certified titus as Journeyer
  • oubiwann certified titus as Master
  • pvanhoof certified titus as Journeyer
  • markpratt certified titus as Journeyer
  • Omnifarious certified titus as Journeyer
  • caruzo certified titus as Master
  • fzort certified titus as Apprentice
  • bi certified titus as Journeyer
  • IainLowe certified titus as Journeyer
  • robby certified titus as Journeyer
  • percious certified titus as Master
  • kskuhlman certified titus as Journeyer
  • dangermaus certified titus as Master
