
IBM z15 Instant Recovery

AKA System Recovery Boost

The team

 

My role

Design & research lead

 

Product Managers

Stephan Wiedemer & Jake Snyder

 

Engineering Lead

Dave Surman

The pitch:
Unleashing unused "dark capacity" to expedite recovery from planned & unplanned downtime for mainframe users.

Quick project stats

80

Hours spent conducting research with users

11

Dedicated research participants (sponsor users)

50

Companies that participated in research

What is a mainframe?

Before we jump in, I just want to make sure we're on the same page. This project is all about mainframes! 

If you've ever...

  • Swiped a credit card

  • Withdrawn from an ATM

  • Booked a flight

  • Booked a hotel

 

Congrats! 

You've interacted with a mainframe! A mainframe (image on the right) is just a big refrigerator-sized computer that's really good at processing transactions (e.g., swiping a credit card).

z15-glam-shot.jpg

The (very technical) problem statement

Joining the team

I was working on another project when the product manager for the z15 hardware release approached me about this project called "Rapid Recovery." Here's the problem statement he pitched to me:

How might we speed up the IPL process using the compute power the machine already has... at no cost to the user?

Wait, what?

I had no clue what that meant, so I needed to take a step back and start to understand what the user need really was. What was the problem we were actually trying to solve?

The technical space


A vague technical problem

Think about your laptop. You know how every few months you’ll get a pop-up telling you to restart your machine so it can install updates? If you’re anything like me, you’ll click “ignore” until your laptop forces you to restart at the worst possible time. 

 

Well, mainframes are just big computers; they also have to restart if there’s a required update or a problem. Thankfully (since mainframes run such critical workloads) they only need to restart on a partition-by-partition basis. That way, you can still go grocery shopping and life can keep on moving as usual.

 

These mainframe restarts are called IPLs. IPL = Initial Program Load.

A quick look at an IPL

The process we wanted to improve

Okay, so now I knew we were basically trying to speed up how mainframe partitions restart. So how do IPLs work?

 

The way an IPL works is pretty much how you’d expect any computer restart to work. The system shuts down, it does some reconfiguration or updating while it’s down, then it boots back up. First the operating system comes online, then the different subsystems, then the middleware. At that point, you’re back to a steady state and can get back to processing workloads.

 

Sounds pretty straightforward, right? So why was a researcher being pulled into this project that was already well underway?


Joining a project underway

By the time I was tapped for this project, engineering and technical design work was already happening. It wasn’t entirely clear to me that there was even a problem to be solved. The engineering lead even admitted that he didn’t see design having an obvious role on the team.

 

 

The technical solution

  • Expedite time to reconfigure

  • 30-minute startup boost

  • Parallelized GDPS scripting

My initial design goals

  • Make sure we should actually be building this thing

  • Figure out what’s needed to create a delightful user experience

  • Make sure what we’re building is easy for users to adopt and use

Research process

Pre-research

Methods

  • SME interviews

  • Questions & assumptions exercise

  • Secondary market research

Artifacts

  • Research plan

  • Survey questions

  • Interview protocol

  • Market definition report

Generative research

Methods

  • Heuristic evaluation

  • User interviews

  • Surveys

Artifacts

  • Key insights

  • Research report

  • Technical design changes

  • Playback 0

Evaluative research

Methods

  • Focus group

  • User interviews

  • Usability assessment

  • Surveys

  • Case study analysis

Artifacts

  • Key insights

  • Key requirements

  • Usability assessment

  • Executive playback

Pre-research

questions and assumptions srb.png

Before I jumped into building a research plan and roadmap, I needed to reflect and discover internally with my new team.

SME interviews

I started by interviewing subject matter experts across the team. I spoke with key engineering leaders from the various components of the project (such as LPAR development, GDPS, and firmware), product managers, and directors.

Questions & assumptions with the team

Whenever I join a new project, I facilitate a questions and assumptions exercise with the ENTIRE team. To kick off the research efforts and start getting buy-in from my new team, I grabbed all 30 of my engineers and product managers and asked them to dump all of their questions (for users, for each other, etc.) into a Mural. They were skeptical, but it was an essential exercise for getting our research started.

Generative research

Hands-on heuristic evaluation

Before asking users about their IPL processes, I wanted to try to do an IPL myself. A test engineer cleared a test LPAR for me, verbally instructed me, and let me play in HMC and the x3270.

 

It was pretty underwhelming (which was actually an important data point that will come into play later).

User interviews

Next, my distinguished engineer and I tag-teamed a quick sprint of interviews with 5 users. IPL processes are a standard function for mainframes, so 5 initial interviews were a solid sample size to start building a persona and pulling early insights to inform our research roadmap.

 

Number: 5

Surveys

Mainframe users are awesome. Maybe I’m biased, but I adore them for their passion and dedication to these giant machines.

 

And because mainframe users are so passionate, they gather in droves at various client councils and conferences around the world. Which is a researcher’s dream! 

Once we finished up the interview sessions, my team put together some more process- and system-related questions we needed to ask the broader mainframe community. It was a mix of qualitative and quantitative questions, along with a blend of user experience and technical configuration questions. I look back at this survey as the first big point of collaboration between me and my engineers; both design and engineering were really happy with the data the survey promised to provide.

3270 screen.png
ZDC synthesis.png

I deployed paper surveys (literal paper surveys) at the Z Design Council, the zTPF User Group, and the CICS User Group.

Number: 60

Key first-round insights

IPLs are pretty easy

Insight

  • IPLs are not complicated, so we shouldn’t change the mechanics

  • IPLs take a while, which impacts users’ ability to adhere to SLAs

Resulting action

  • We deliberately did not want to change anything about IPL mechanics

We weren't addressing the whole problem

Insight

  • Our original plan was to shorten downtime, speed up startup, and give a boost

  • Research revealed LPAR shutdown to be an additional time suck

Resulting action

  • We introduced a 30-minute shutdown boost into the technical designs

We weren't providing a full experience to all users

Insight

  • We originally planned to provide a 30-minute startup and post-IPL boost

  • At large shops, an IPL can take 45 minutes, so users would not get the full experience of SRB

Resulting action

  • We elongated the startup and post-IPL boost to 60 minutes – plenty of time for users to get the full benefit of the boost

Next steps

  • Develop research plan for upcoming quarters

  • Get changes implemented into the development plan

  • Write some hills (user outcome statements)

  • Evaluate changes and hills with users

  • Usability study

Evaluative research

While my engineers were busy implementing those changes, my product managers and I set out to re-validate those decisions, get some additional context on users’ behavioral patterns, and gather some more technical requirements from users.

More surveys!

I printed out a new batch of surveys and deployed them at two user groups: the Z Design Council and GuideShare. The fun part was that all of the GuideShare users were German, so we had to deploy the survey in German! Sehr Spaß, aber ein bisschen schwierig (a lot of fun, but a little tricky).

Number: 52

german survey portfolio blank.png

Usability assessment 

In parallel, I worked with my dedicated research participants (at IBM we call them sponsor users) to complete a usability study. Since this was going to be a hardware capability deployed with the next mainframe release, we wouldn’t be able to do normal beta testing. Instead, I walked my users through command line and HMC mockups.

 

Remember, research indicated that the IPL process is pretty straightforward for users, so we didn’t want to change much about the mechanics of an IPL. Instead, in our usability testing and future designs, we wanted users to feel comfortable and in control.

Number: 11

Case study analysis

My PMs and I reviewed how other IBM Z product releases had gone, especially hardware-related releases. We paid special attention to what had gone wrong.

 

We looked at a capability from the previous mainframe release – a capability that had been considered a blockbuster. We discovered that while it had been a huge hit with users on paper, the reality was much different. Turns out the team hadn’t considered installation and getting started during their design and development, so users ended up having to spend months getting this thing to work.

 

We didn’t want to make the same mistake, so we and our sponsor users together decided that we’d have this capability turned on by default. That was unprecedented for a new mainframe capability, but it was something that would significantly enhance our users’ ability to leverage this new function.

Key second-round insights

Ease of adoption is key

Insight

  • We learned from looking at previous offerings that you can have the coolest offering, but if it takes 6 months to turn it on, it’s not as cool as you thought

Resulting action

  • We decided to have SRB turned on by default (pretty unprecedented in IBM Z)

Users value control

Insight

  • Since we were turning the capability on by default, users needed a way to turn it off before AND during a boost, just in case something went wrong

Resulting action

  • Simplified mechanisms to turn off an in-progress boost

  • Parmlib options to control the applicability of the boost

Additional technical requirements

Insight

  • SADMP exploitation of SRB is important

  • Uncertain definition of special-purpose Boost zIIP pool and temp capacity

  • z/OS isn’t the only operating system that matters

Resulting action

  • SADMP exploitation of SRB

  • Extended the use of the “real” zIIP pool for Boost purposes

  • zVM and zTPF exploitation planned

Next steps

  • Final playback to our GM and VPs

  • Final development push

  • Go to market work

  • PARTAY!

Go to market prep

Making the technical accessible to execs

I acted as the product manager for the development of SRB’s launch event collateral. I worked closely with IBM executives and the hired creative agency to build out a launch story that would make sense at the executive level rather than the esoteric systems programmer level. 

What we shipped

So what was the end result?

I am so unbelievably proud of the capability we shipped. Instant Recovery was included in every IBM z15 mainframe, and users were able to use the associated capabilities with no additional licensing charges. There are three capabilities that Instant Recovery encompasses, and users can take advantage of any combination of the three (it all depends on preferences and machine configuration). Users get a 30-minute boost for shutdown and a 60-minute boost for startup. The startup boost is extra useful since startup usually only takes 12-30 minutes, so users can use the rest of the boost to run their workloads in overdrive to make up for lost time.
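To make the boost arithmetic concrete, here is a minimal Python sketch of how much boost time is left once an IPL finishes. The 30- and 60-minute windows and the 12, 30, and 45 minute IPL durations come from this case study; the function name is hypothetical, and the model assumes the boost clock starts when the IPL starts and runs continuously, which is a simplification rather than a description of how the firmware actually accounts for boost time.

```python
# Boost windows in minutes, as stated in this case study.
ORIGINAL_STARTUP_BOOST = 30   # the originally planned startup boost
SHIPPED_STARTUP_BOOST = 60    # the startup boost after the research-driven change


def remaining_boost_minutes(ipl_minutes, boost_window=SHIPPED_STARTUP_BOOST):
    """Return the boost minutes left after the IPL completes.

    Assumption (illustrative only): the boost window starts at the same
    moment as the IPL and is never paused. The result is never negative.
    """
    return max(0, boost_window - ipl_minutes)


if __name__ == "__main__":
    # 12-30 minutes is the typical startup range; 45 is the large-shop case.
    for ipl in (12, 30, 45):
        before = remaining_boost_minutes(ipl, ORIGINAL_STARTUP_BOOST)
        after = remaining_boost_minutes(ipl, SHIPPED_STARTUP_BOOST)
        print(f"IPL {ipl:>2} min: {before:>2} min left with a 30-min boost, "
              f"{after:>2} min left with a 60-min boost")
```

Under these assumptions, a 45-minute IPL would have used up the original 30-minute window entirely, while the 60-minute window still leaves 15 minutes to work through backlog.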

The three boosts

zIIP Boost

Enables general-purpose work to run on specialty zIIP engines

Speed Boost

Allows sub-capacity mainframes to run at full-capacity speed.

GDPS Enhancement

GDPS scripting parallelization and other enhancements


Personal victory

I was awarded IBM Systems' first ever Outstanding Design Achievement Award for my work on Instant Recovery. 

My engineering lead also won an Outstanding Technical Achievement Award, and my product manager won IBM Systems' first ever Outstanding Business Achievement Award! 

Overall, this project was such a great example of a really equitable and dynamic partnership between design, engineering, and product management, and I couldn't be more proud of what we accomplished together. 

odaa.png

In the news

“System Recovery Boost can give companies a competitive edge.”

TechChannel

This “will enable instant recovery, resulting in ‘ultimate uptime’ for customers.”

Forbes

“z15 customers can restart systems and return to steady state business in up to 50% less time than previously required and complete their transactional backlogs up to twice as fast.”

eWeek

terminal talk banner-02.png

Terminal Talk podcast

I was a guest on Terminal Talk, a popular mainframe podcast! I’m the only designer to have guested on Terminal Talk, so that was pretty cool!

Marketing video
