Older blog entries for mikal (starting at number 970)

21 Aug 2014 »

Don't Tell Mum I Work On The Rigs

ISBN: 1741146984
LibraryThing

I read this book while on a flight a few weeks ago. Its surprisingly readable and relatively short -- you can knock it over in a single long haul flight. The book covers the memoirs of an oil rig worker, from childhood right through to middle age. That's probably the biggest weakness of the book, it just kind of stops when the writer reaches the present day. I felt there wasn't really a conclusion, which was disappointing.
An interesting fun read however.

Tags for this post: book paul_carter oil rig memoir
Related posts: Extreme Machines: Eirik Raude; New Orleans and sea level; Kern County oil wells on I-5; What is the point that people's morals evaporate?

Comment

Syndicated 2014-08-21 13:45:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

20 Aug 2014 »

Juno nova mid-cycle meetup summary: nova-network to Neutron migration

This will be my second last post about the Juno Nova mid-cycle meetup, which covers the state of play for work on the nova-network to Neutron upgrade.

First off, some background information. Neutron (formerly Quantum) was developed over a long period of time to replace nova-network, and added to the OpenStack Folsom release. The development of new features for nova-network was frozen in the Nova code base, so that users would transition to Neutron. Unfortunately the transition period took longer than expected. We ended up having to unfreeze development of nova-network, in order to fix reliability problems that were affecting our CI gating and the reliability of deployments for existing nova-network users. Also, at least two OpenStack companies were carrying significant feature patches for nova-network, which we wanted to merge into the main code base.

You can see the announcement at http://lists.openstack.org/pipermail/openstack-dev/2014-January/025824.html. The main enhancements post-freeze were a conversion to use our new objects infrastructure (and therefore conductor), as well as features that were being developed by Nebula. I can't find any contributions from the other OpenStack company in the code base at this time, so I assume they haven't been proposed.

The nova-network to Neutron migration path has come to the attention of the OpenStack Technical Committee, who have asked for a more formal plan to address Neutron feature gaps and deprecate nova-network. That plan is tracked at https://wiki.openstack.org/wiki/Governance/TechnicalCommittee/Neutron_Gap_Coverage. As you can see, there are still some things to be merged which are targeted for juno-3. At the time of writing this includes grenade testing; Neutron being the devstack default; a replacement for nova-network multi-host; a migration plan; and some documentation. They are all making good progress, but until these action items are completed, Nova can't start the process of deprecating nova-network.

The discussion at the Nova mid-cycle meetup was around the migration planning item in the plan. There is a Nova specification that outlines one possible plan for live upgrading instances (i.e, no instance downtime) at https://review.openstack.org/#/c/101921/, but this will probably now be replaced with a simpler migration path involving cold migrations. This is prompted by not being able to find a user that absolutely has to have live upgrade. There was some confusion, because of a belief that the TC was requiring a live upgrade plan. But as Russell Bryant says in the meetup etherpad:

"Note that the TC has made no such statement on migration expectations other than a migration path must exist, both projects must agree on the plan, and that plan must be submitted to the TC as a part of the project's graduation review (or project gap review in this case). I wouldn't expect the TC to make much of a fuss about the plan if both Nova and Neutron teams are in agreement."

The current plan is to go forward with a cold upgrade path, unless a user comes forward with an absolute hard requirement for a live upgrade, and a plan to fund developers to work on it.

At this point, it looks like we are on track to get all of the functionality we need from Neutron in the Juno release. If that happens, we will start the nova-network deprecation timer in Kilo, with my expectation being that nova-network would be removed in the "M" release. There is also an option to change the default networking implementation to Neutron before the deprecation of nova-network is complete, which will mean that new deployments are defaulting to the long term supported option.

In the next (and probably final) post in this series, I'll talk about the API formerly known as Nova API v3.

Tags for this post: openstack juno nova mid-cycle summary nova-network neutron migration
Related posts: Juno nova mid-cycle meetup summary: scheduler; Juno nova mid-cycle meetup summary: ironic; Juno nova mid-cycle meetup summary: DB2 support; Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: slots; Juno nova mid-cycle meetup summary: containers

Comment

Syndicated 2014-08-19 20:37:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

19 Aug 2014 »

Juno nova mid-cycle meetup summary: slots

If I had to guess what would be a controversial topic from the mid-cycle meetup, it would have to be this slots proposal. I was actually in a Technical Committee meeting when this proposal was first made, but I'm told there were plenty of people in the room keen to give this idea a try. Since the mid-cycle Joe Gordon has written up a more formal proposal, which can be found at https://review.openstack.org/#/c/112733.

If you look at the last few Nova releases, core reviewers have been drowning under code reviews, so we need to control the review workload. What is currently happening is that everyone throws up their thing into Gerrit, and then each core tries to identify the important things and review them. There is a list of prioritized blueprints in Launchpad, but it is not used much as a way of determining what to review. The result of this is that there are hundreds of reviews outstanding for Nova (500 when I wrote this post). Many of these will get a review, but it is hard for authors to get two cores to pay attention to a review long enough for it to be approved and merged.

If we could rate limit the number of proposed reviews in Gerrit, then cores would be able to focus their attention on the smaller number of outstanding reviews, and land more code. Because each review would merge faster, we believe this rate limiting would help us land more code rather than less, as our workload would be better managed. You could argue that this will mean we just say 'no' more often, but that's not the intent, it's more about bringing focus to what we're reviewing, so that we can get patches through the process completely. There's nothing more frustrating to a code author than having one +2 on their code and then hitting some merge freeze deadline.

The proposal is therefore to designate a number of blueprints that can be under review at any one time. The initial proposal was for ten, and the term 'slot' was coined to describe the available review capacity. If your blueprint was not allocated a slot, then it would either not be proposed in Gerrit yet, or if it was it would have a procedural -2 on it (much like code reviews associated with unapproved specifications do now).

The number of slots is arbitrary at this point. Ten is our best guess of how much we can dilute core's focus without losing efficiency. We would tweak the number as we gained experience if we went ahead with this proposal. Remember, too, that a slot isn't always a single code review. If the VMWare refactor was in a slot for example, we might find that there were also ten code reviews associated with that single slot.

How do you determine what occupies a review slot? The proposal is to groom the list of approved specifications more carefully. We would collaboratively produce a ranked list of blueprints in the order of their importance to Nova and OpenStack overall. As slots become available, the next highest ranked blueprint with code ready for review would be moved into one of the review slots. A blueprint would be considered 'ready for review' once the specification is merged, and the code is complete and ready for intensive code review.

What happens if code is in a slot and something goes wrong? Imagine if a proposer goes on vacation and stops responding to review comments. If that happened we would bump the code out of the slot, but would put it back on the backlog in the location dictated by its priority. In other words there is no penalty for being bumped, you just need to wait for a slot to reappear when you're available again.

We also talked about whether we were requiring specifications for changes which are too simple. If something is relatively uncontroversial and simple (a better tag for internationalization for example), but not a bug, it falls through the cracks of our process at the moment and ends up needing to have a specification written. There was talk of finding another way to track this work. I'm not sure I agree with this part, because a trivial specification is a relatively cheap thing to do. However, it's something I'm happy to talk about.

We also know that Nova needs to spend more time paying down its accrued technical debt, which you can see in the huge amount of bugs we have outstanding at the moment. There is no shortage of people willing to write code for Nova, but there is a shortage of people fixing bugs and working on strategic things instead of new features. If we could reserve slots for technical debt, then it would help us to get people to work on those aspects, because they wouldn't spend time on a less interesting problem and then discover they can't even get their code reviewed. We even talked about having an alternating focus for Nova releases; we could have a release focused on paying down technical debt and stability, and then the next release focused on new features. The Linux kernel does something quite similar to this and it seems to work well for them.

Using slots would allow us to land more valuable code faster. Of course, it also means that some patches will get dropped on the floor, but if the system is working properly, those features will be ones that aren't important to OpenStack. Considering that right now we're not landing many features at all, this would be an improvement.

This proposal is obviously complicated, and everyone will have an opinion. We haven't really thought through all the mechanics fully, yet, and it's certainly not a done deal at this point. The ranking process seems to be the most contentious point. We could encourage the community to help us rank things by priority, but it's not clear how that process would work. Regardless, I feel like we need to be more systematic about what code we're trying to land. It's embarrassing how little has landed in Juno for Nova, and we need to be working on that. I would like to continue discussing this as a community to make sure that we end up with something that works well and that everyone is happy with.

This series is nearly done, but in the next post I'll cover the current status of the nova-network to neutron upgrade path.

Tags for this post: openstack juno nova mid-cycle summary review slots blueprint priority project management
Related posts: Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: scheduler; Juno nova mid-cycle meetup summary: ironic; Juno nova mid-cycle meetup summary: DB2 support; Juno nova mid-cycle meetup summary: containers; Juno nova mid-cycle meetup summary: cells

Comment

Syndicated 2014-08-19 00:34:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

18 Aug 2014 »

Juno nova mid-cycle meetup summary: scheduler

This post is in a series covering the discussions at the Juno Nova mid-cycle meetup. This post will cover the current state of play of our scheduler refactoring efforts. The scheduler refactor has been running for a fair while now, dating back to at least the Hong Kong summit (so about 1.5 release cycles ago).

The original intent of the scheduler sub-team's effort was to pull the scheduling code out of Nova so that it could be rapidly iterated on its own, with the eventual goal being to support a single scheduler across the various OpenStack services. For example, the scheduler that makes placement decisions about your instances could also be making decisions about the placement of your storage resources and could therefore ensure that they are co-located as much as possible.

During this process we realized that a big bang replacement is actually much harder than we thought, and the plan has morphed into being a multi-phase effort. The first step is to make the interface for the scheduler more clearly defined inside the Nova code base. For example, in previous releases, it was the scheduler that launched instances: the API would ask the scheduler to find available hypervisor nodes, and then the scheduler would instruct those nodes to boot the instances. We need to refactor this so that the scheduler picks a set of nodes, but then the API is the one which actually does the instance launch. That way, when the scheduler does move out it's not trusted to perform actions that change hypervisor state, and the Nova code does that for it. This refactoring work is under way, along with work to isolate the SQL database accesses inside the scheduler.

I would like to set expectations that this work is what will land in Juno. It has little visible impact for users, but positions us to better solve these problems in Kilo.

We discussed the need to ensure that any new scheduler is at least as fast and accurate as the current one. Jay Pipes has volunteered to work with the scheduler sub-team to build a testing framework to validate this work. Jay also has some concerns about the resource tracker work that is being done at the moment that he is going to discuss with the scheduler sub-team. Since the mid-cycle meetup there has been a thread on the openstack-dev mailing list about similar resource tracker concerns (here), which might be of interest to people interested in scheduler work.

We also need to test our assumption at some point that other OpenStack services such as Neutron and Cinder would be even willing to share a scheduler service if a central one was implemented. We believe that Neutron is interested, but we shouldn't be surprising our fellow OpenStack projects by just appearing with a complete solution. There is a plan to propose a cross-project session at the Paris summit to cover this work.

In the next post in this series we'll discuss possibly the most controversial part of the mid-cycle meetup. The proposal for "slots" for landing blueprints during Kilo.

Tags for this post: openstack juno nova mid-cycle summary scheduler
Related posts: Juno nova mid-cycle meetup summary: ironic; Juno nova mid-cycle meetup summary: DB2 support; Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: containers; Juno nova mid-cycle meetup summary: cells; Michael's surprisingly unreliable predictions for the Havana Nova release

Comment

Syndicated 2014-08-17 20:06:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

18 Aug 2014 »

Juno nova mid-cycle meetup summary: bug management

Welcome to the next exciting installment of the Nova Juno mid-cycle meetup summary. In the previous chapter, our hero battled a partially complete cells implementation, by using his +2 smile of good intentions. In this next exciting chapter, watch him battle our seemingly never ending pile of bugs! Sorry, now that I'm on to my sixth post in this series I feel like it's time to get more adventurous in the introductions.

For at least the last cycle, and probably longer, Nova has been struggling with the number of bugs filed in Launchpad. I don't think the problem is that Nova has terrible code, it is instead that we have a lot of users filing bugs, and the team working on triaging and closing bugs is small. The complexity of the deployment options with Nova make this problem worse, and that complexity increases as we allow new drivers for things like different storage engines to land in the code base.

The increasing number of permutations possible with Nova configurations is a problem for our CI systems as well, as we don't cover all of these options and this sometimes leads us to discover that they don't work as expected in the field. CI is a tangent from the main intent of this post though, so I will reserve further discussion of our CI system until a later post.

Tracy Jones and Joe Gordon have been doing good work in this cycle trying to get a grip on the state of the bugs filed against Nova. For example, a very large number of bugs (hundreds) were for problems we'd fixed, but where the bug bot had failed to close the bug when the fix merged. Many other bugs were waiting for feedback from users, but had been waiting for longer than six months. In both those cases the response was to close the bug, with the understanding that the user can always reopen it if they come back to talk to us again. Doing "quick hit" things like this has reduced our open bug count to about one thousand bugs. You can see a dashboard that Tracy has produced that shows the state of our bugs at http://54.201.139.117/nova-bugs.html. I believe that Joe has been moving towards moving this onto OpenStack hosted infrastructure, but this hasn't happened yet.

At the mid-cycle meetup, the goal of the conversation was to try and find other ways to get our bug queue further under control. Some of the suggestions were largely mechanical, like tightening up our definitions of the confirmed (we agree this is a bug) and triaged (and we know how to fix it) bug states. Others were things like auto-abandoning bugs which are marked incomplete for more than 60 days without a reply from the person who filed the bug, or unassigning bugs when the review that proposed a fix is abandoned in Gerrit.

Unfortunately, we have more ideas for how to automate dealing with bugs than we have people writing automation. If there's someone out there who wants to have a big impact on Nova, but isn't sure where to get started, helping us out with this automation would be a super helpful way to get started. Let Tracy or I know if you're interested.

We also talked about having more targeted bug days. This was prompted by our last bug day being largely unsuccessful. Instead we're proposing that the next bug day have a really well defined theme, such as moving things from the "undecided" to the "confirmed" state, or similar. I believe the current plan is to run a bug day like this after J-3 when we're winding down from feature development and starting to focus on stabilization.

Finally, I would encourage people fixing bugs in Nova to do a quick search for duplicate bugs when they are closing a bug. I wouldn't be at all surprised to discover that there are many bugs where you can close duplicates at the same time with minimal effort.

In the next post I'll cover our discussions of the state of the current scheduler work in Nova.

Tags for this post: openstack juno nova mi-cycle summary bugs
Related posts: Juno nova mid-cycle meetup summary: scheduler; Juno nova mid-cycle meetup summary: ironic; Michael's surprisingly unreliable predictions for the Havana Nova release; Juno nova mid-cycle meetup summary: DB2 support; Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: containers

Comment

Syndicated 2014-08-17 19:38:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

15 Aug 2014 »

Juno nova mid-cycle meetup summary: cells

This is the next post summarizing the Juno Nova mid-cycle meetup. This post covers the cells functionality used by some deployments to scale Nova.

For those unfamiliar with cells, it's a way of combining smaller Nova installations into a thing which feels like a single large Nova install. So for example, Rackspace deploys Nova in cells of hundreds of machines, and these cells form a Nova availability zone which might contain thousands of machines. The cells in one of these deployments form a tree: users talk to the top level of the tree, which might only contain API services. That cell then routes requests to child cells which can actually perform the operation requested.

There are a few reasons why Rackspace does this. Firstly, it keeps the MySQL databases smaller, which can improve the performance of database operations and backups. Additionally, cells can contain different types of hardware, which are then partitioned logically. For example, OnMetal (Rackspace's Ironic-based baremetal product) instances come from a cell which contains OnMetal machines and only publishes OnMetal flavors to the parent cell.

Cells was originally written by Rackspace to meet its deployment needs, but is now used by other sites as well. However, I think it would be a stretch to say that cells is commonly used, and it is certainly not the deployment default. In fact, most deployments don't run any of the cells code, so you can't really call them even a "single cell install". One of the reasons cells isn't more widely deployed is that it doesn't implement the entire Nova API, which means some features are missing. As a simple example, you can't live-migrate an instance between two child cells.

At the meetup, the first thing we discussed regarding cells was a general desire to see cells finished and become the default deployment method for Nova. Perhaps most people end up running a single cell, but in that case at least the cells code paths are well used. The first step to get there is improving the Tempest coverage for cells. There was a recent openstack-dev mailing list thread on this topic, which was discussed at the meetup. There was commitment from several Nova developers to work on this, and notably not all of them are from Rackspace.

It's important that we improve the Tempest coverage for cells, because it positions us for the next step in the process, which is bringing feature parity to cells compared with a non-cells deployment. There is some level of frustration that the work on cells hasn't really progressed in Juno, and that it is currently incomplete. At the meetup, we made a commitment to bringing a well-researched plan to the Kilo summit for implementing feature parity for a single cell deployment compared with a current default deployment. We also made a commitment to make cells the default deployment model when this work is complete. If this doesn't happen in time for Kilo, then we will be forced to seriously consider removing cells from Nova. A half-done cells deployment has so far stopped other development teams from trying to solve the problems that cells addresses, so we either need to finish cells, or get out of the way so that someone else can have a go. I am confident that the cells team will take this feedback on board and come to the summit with a good plan. Once we have a plan we can ask the whole community to rally around and help finish this effort, which I think will benefit all of us.

In the next blog post I will cover something we've been struggling with for the last few releases: how we get our bug count down to a reasonable level.

Tags for this post: openstack juno nova mid-cycle summary cells
Related posts: Juno nova mid-cycle meetup summary: ironic; Juno nova mid-cycle meetup summary: DB2 support; Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: containers; Michael's surprisingly unreliable predictions for the Havana Nova release; Juno Nova PTL Candidacy

Comment

Syndicated 2014-08-14 21:20:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

15 Aug 2014 »

More bowls and pens

The pens are quite hard to make by the way -- the wood is only a millimeter or so thick, so it tends to split very easily.

Tags for this post: wood turning 20140805-woodturning photo

Comment

Syndicated 2014-08-14 19:35:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

15 Aug 2014 »

Juno nova mid-cycle meetup summary: DB2 support

This post is one part of a series discussing the OpenStack Nova Juno mid-cycle meetup. It's a bit shorter than most of the others, because the next thing on my list to talk about is DB2, and that's relatively contained.

IBM is interested in adding DB2 support as a SQL database for Nova. Theoretically, this is a relatively simple thing to do because we use SQLAlchemy to abstract away the specifics of the SQL engine. However, in reality, the abstraction is leaky. The obvious example in this case is that DB2 has different rules for foreign keys than other SQL engines we've used. So, in order to be able to make this change, we need to tighten up our schema for the database.

The change that was discussed is the requirement that the UUID column on the instances table be not null. This seems like a relatively obvious thing to allow, given that UUID is the official way to identify an instance, and has been for a really long time. However, there are a few things which make this complicated: we need to understand the state of databases that might have been through a long chain of upgrades from previous Nova releases, and we need to ensure that the schema alterations don't cause significant performance problems for existing large deployments.

As an aside, people sometimes complain that Nova development is too slow these days, and they're probably right, because things like this slow us down. A relatively simple change to our database schema requires a whole bunch of performance testing and negotiation with operators to ensure that its not going to be a problem for people. It's good that we do these things, but sometimes it's hard to explain to people why forward progress is slow in these situations.

Matt Riedemann from IBM has been doing a good job of handling this change. He's written a tool that operators can run before the change lands in Juno that checks if they have instance rows with null UUIDs. Additionally, the upgrade process has been well planned, and is documented in the specification available on the fancy pants new specs website.

We had a long discussion about this change at the meetup, and how it would impact on large deployments. Both Rackspace and HP were asked if they could run performance tests to see if the schema change would be a problem for them. Unfortunately HP's testing hardware was tied up with another project, so we only got numbers from Rackspace. For them, the schema change took 42 minutes for a large database. Almost all of that was altering the column to be non-nullable; creating the new index was only 29 seconds of runtime. However, the Rackspace database is large because they don't currently purge deleted rows, if they can get that done before running this schema upgrade then the impact will be much smaller.

So the recommendation here for operators is that it is best practice to purge deleted rows from your databases before an upgrade, especially when schema migrations need to occur at the same time. There are some other takeaways for operators as well: if we know that operators have a large deployment, then we can ask if an upgrade will be a problem. This is why being active on the openstack-operators mailing list is important. Additionally, if operators are willing to donate a dataset to Turbo-Hipster for DB CI testing, then we can use that in our automation to try and make sure these upgrades don't cause you pain in the future.

In the next post in this series I'll talk about the future of cells, and the work that needs to be done there to make it a first class citizen.

Tags for this post: openstack juno nova mid-cycle summary sql database sqlalchemy db2
Related posts: Juno nova mid-cycle meetup summary: ironic; Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: containers; Michael's surprisingly unreliable predictions for the Havana Nova release; Exploring a single database migration; Time to document my PDF testing database

Comment

Syndicated 2014-08-14 19:20:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

14 Aug 2014 »

Review priorities as we approach juno-3

I just send this email out to openstack-dev, but I am posting it here in case it makes it more discoverable to people drowning in email:

To: openstack-dev
Subject: [nova] Review priorities as we approach juno-3

Hi.

We're rapidly approaching j-3, so I want to remind people of the
current reviews that are high priority. The definition of high
priority I am using here is blueprints that are marked high priority
in launchpad that have outstanding code for review -- I am sure there
are other reviews that are important as well, but I want us to try to
land more blueprints than we have so far. These are listed in the
order they appear in launchpad.

== Compute Manager uses Objects (Juno Work) ==

https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/compute-manager-objects-juno,n,z

This is ongoing work, but if you're after some quick code review
points they're very easy to review and help push the project forward
in an important manner.

== Move Virt Drivers to use Objects (Juno Work) ==

I couldn't actually find any code out for review for this one apart
from https://review.openstack.org/#/c/94477/, is there more out there?

== Add a virt driver for Ironic ==

This one is in progress, but we need to keep going at it or we wont
get it merged in time.

* https://review.openstack.org/#/c/111223/ was approved, but a rebased
ate it. Should be quick to re-approve.
* https://review.openstack.org/#/c/111423/
* https://review.openstack.org/#/c/111425/
* ...there are more reviews in this series, but I'd be super happy to
see even a few reviewed

== Create Scheduler Python Library ==

* https://review.openstack.org/#/c/82778/
* https://review.openstack.org/#/c/104556/

(There are a few abandoned patches in this series, I think those two
are the active ones but please correct me if I am wrong).

== VMware: spawn refactor ==

* https://review.openstack.org/#/c/104145/
* https://review.openstack.org/#/c/104147/ (Dan Smith's -2 on this one
seems procedural to me)
* https://review.openstack.org/#/c/105738/
* ...another chain with many more patches to review

Thanks,
Michael

The actual email thread is at http://lists.openstack.org/pipermail/openstack-dev/2014-August/043098.html.

Tags for this post: openstack juno review nova ptl
Related posts: Juno Nova PTL Candidacy; Thoughts from the PTL; Havana Nova PTL elections; Expectations of core reviewers; Juno nova mid-cycle meetup summary: social issues; More reviews

Comment

Syndicated 2014-08-14 13:01:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

14 Aug 2014 »

Juno nova mid-cycle meetup summary: ironic

Welcome to the third in my set of posts covering discussion topics at the nova juno mid-cycle meetup. The series might never end to be honest.

This post will cover the progress of the ironic nova driver. This driver is interesting as an example of a large contribution to the nova code base for a couple of reasons -- its an official OpenStack project instead of a vendor driver, which means we should already have well aligned goals. The driver has been written entirely using our development process, so its already been reviewed to OpenStack standards, instead of being a large code dump from a separate development process. Finally, its forced us to think through what merging a non-trivial code contribution should look like, and I think that formula will be useful for later similar efforts, the Docker driver for example.

One of the sticking points with getting the ironic driver landed is exactly how upgrade for baremetal driver users will work. The nova team has been unwilling to just remove the baremetal driver, as we know that it has been deployed by at least a few OpenStack users -- the largest deployment I am aware of is over 1,000 machines. Now, this unfortunate because the baremetal driver was always intended to be experimental. I think what we've learnt from this is that any driver which merges into the nova code base has to be supported for a reasonable period of time -- nova isn't the right place for experiments. Now that we have the stackforge driver model I don't think that's too terrible, because people can iterate quickly in stackforge, and when they have something stable and supportable they can merge it into nova. This gives us the best of both worlds, while providing a strong signal to deployers about what the nova team is willing to support for long periods of time.

The solution we came up with for upgrades from baremetal to ironic is that the deployer will upgrade to juno, and then run a script which converts their baremetal nodes to ironic nodes. This script is "off line" in the sense that we do not expect new baremetal nodes to be launchable during this process, nor after it is completed. All further launches would be via the ironic driver.

These nodes that are upgraded to ironic will exist in a degraded state. We are not requiring ironic to support their full set of functionality on these nodes, just the bare minimum that baremetal did, which is listing instances, rebooting them, and deleting them. Launch is excluded for the reasoning described above.

We have also asked the ironic team to help us provide a baremetal API extension which knows how to talk to ironic, but this was identified as a need fairly late in the cycle and I expect it to be a request for a feature freeze exception when the time comes.

The current plan is to remove the baremetal driver in the Kilo release.

Previously in this post I alluded to the review mechanism we're using for the ironic driver. What does that actually look like? Well, what we've done is ask the ironic team to propose the driver as a series of smallish (500 line) changes. These changes are broken up by functionality, for example the code to boot an instance might be in one of these changes. However, because of the complexity of splitting existing code up, we're not requiring a tempest pass on each step in the chain of reviews. We're instead only requiring this for the final member in the chain. This means that we're not compromising our CI requirements, while maximizing the readability of what would otherwise be a very large review. To stop the reviews from merging before we're comfortable with them, there's a marker review at the beginning of the chain which is currently -2'ed. When all the code is ready to go, I remove the -2 and approve that first review and they should all merge together.

In the next post I'll cover the state of adding DB2 support to nova.

Tags for this post: openstack juno nova mid-cycle summary ironic
Related posts: Juno nova mid-cycle meetup summary: social issues; Juno nova mid-cycle meetup summary: containers; Michael's surprisingly unreliable predictions for the Havana Nova release; Juno Nova PTL Candidacy; Thoughts from the PTL; Merged in Havana: fixed ip listing for single hosts

Comment

Syndicated 2014-08-14 01:49:00 from stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts)

961 older entries...