Researcher, strategist, level 14 bard
The team
My role
Design & research lead
Product Managers
Stephan Wiedemer & Jake Snyder
Engineering Lead
Dave Surman
The pitch:
Unleashing unused "dark capacity" to speed up recovery from planned & unplanned downtime for mainframe users.
Quick project stats
80
Hours spent conducting research with users
11
Dedicated research participants (sponsor users)
50
Companies that participated in research
What is a mainframe?
Before we jump in, I just want to make sure we're on the same page. This project is all about mainframes!
If you've ever...
- Swiped a credit card
- Withdrawn from an ATM
- Booked a flight
- Booked a hotel
Congrats!
You've interacted with a mainframe! A mainframe (image on the right) is just a big refrigerator-sized computer that's really good at processing transactions (e.g., swiping a credit card).
The (very technical) problem statement
Joining the team
I was working on another project when the product manager for the z15 hardware release approached me about this project called "Rapid Recovery." Here's the problem statement he pitched to me:
How might we speed up the IPL process using the compute power the machine already has... at no cost to the user?
Wait, what?
I had no clue what that meant, so I needed to take a step back and start to understand what the user need really was. What was the problem we were actually trying to solve?
The technical space
A vague technical problem
Think about your laptop. You know how every few months you’ll get a pop-up telling you to restart your machine so it can install updates? If you’re anything like me, you’ll click “ignore” until your laptop forces you to restart at the worst possible time.
Well, mainframes are just big computers; they also have to restart if there’s a required update or a problem. Thankfully (since mainframes run such critical workloads) they only need to restart on a partition-by-partition basis. That way, you can still go grocery shopping and life can keep on moving as usual.
These mainframe restarts are called IPLs. IPL = Initial Program Load.
A quick look at an IPL
The process we wanted to improve
Okay, so now I knew we were basically trying to speed up how mainframe partitions restart. So how do IPLs work?
The way an IPL works is pretty much how you’d expect any computer restart to work. The system shuts down, it does some reconfiguration or updating while it’s down, then it boots back up. First the operating system comes online, then the different subsystems, then the middleware. At that point, you’re back to a steady state and can get back to processing workloads.
Sounds pretty straightforward, right? So why was a researcher being pulled into this project that was already well underway?
Joining a project underway
By the time I was tapped for this project, engineering and technical design work was already happening. It wasn’t entirely clear to me that there was even a problem to be solved. The engineering lead even admitted that he didn’t see design having an obvious role on the team.
The technical solution
- Expedite time to reconfigure
- 30-minute start up boost
- Parallelized GDPS scripting
My initial design goals
- Make sure we should actually be building this thing
- Figure out what’s needed to create a delightful user experience
- Make sure what we’re building is easy for users to adopt and use
Research process
Pre-research
Methods
- SME interviews
- Questions & assumptions exercise
- Secondary market research
Artifacts
- Research plan
- Survey questions
- Interview protocol
- Market definition report
Generative research
Methods
- Heuristic evaluation
- User interviews
- Surveys
Artifacts
- Key insights
- Research report
- Technical design changes
- Playback 0
Evaluative research
Methods
- Focus group
- User interviews
- Usability assessment
- Surveys
- Case study analysis
Artifacts
- Key insights
- Key requirements
- Usability assessment
- Executive playback
Pre-research
Before I jumped into building a research plan and roadmap, I needed to reflect and discover internally with my new team.
SME interviews
I started by interviewing subject matter experts across the team. I spoke with key engineering leaders from the various components of the project (such as LPAR development, GDPS, and firmware), product managers, and directors.
Questions & assumptions with the team
Whenever I join a new project, I facilitate a questions and assumptions exercise with the ENTIRE team. To kick off the research efforts and start getting buy-in from my new team, I grabbed all 30 of my engineers and product managers and asked them to dump all of their questions (for users, for each other, etc.) into a Mural. They were skeptical, but it was an essential exercise for us to get started with our research.
Generative research
Hands-on heuristic evaluation
Before asking users about their IPL processes, I wanted to try to do an IPL myself. A test engineer cleared a test LPAR for me, verbally instructed me, and let me play in HMC and the x3270.
It was pretty underwhelming (which was actually an important data point that will come into play later).
User interviews
Next, my distinguished engineer and I tag-teamed a quick sprint of interviews with 5 users. IPL processes are a standard function for mainframes, so 5 initial interviews were a solid sample size to start building a persona and pulling early insights to inform our research roadmap.
Number: 5
Surveys
Mainframe users are awesome. Maybe I’m biased, but I adore them for their passion and dedication to these giant machines.
And because mainframe users are so passionate, they gather in droves at various client councils and conferences around the world. Which is a researcher’s dream!
Once we finished up the interview sessions, my team put together some more process- and system-related questions we needed to ask the broader mainframe community. It was a mix of qualitative and quantitative questions, along with a blend of user experience and technical configuration questions. I look back at this survey as the first big point of collaboration between me and my engineers; both design and engineering were really happy with the data the survey promised to provide.
I deployed paper surveys (literal paper surveys) at the Z Design Council, the zTPF User Group, and the CICS User Group.
Number: 60
Key first-round insights
IPLs are pretty easy
Insight
- IPLs are not complicated, so we shouldn’t change the mechanics
- IPLs take a while, which impacts users’ ability to adhere to SLAs
Resulting action
- We deliberately did not want to change anything about IPL mechanics
We weren't addressing the whole problem
Insight
- Our original plan was to shorten downtime, speed up startup, and give a boost
- Research revealed LPAR shutdown to be an additional time suck
Resulting action
- We introduced a 30-minute shutdown boost into the technical designs
We weren't providing a full experience to all users
Insight
- We originally planned to provide a 30-minute startup and post-IPL boost
- At large shops, an IPL can take 45 minutes, so users would not get the full experience of SRB
Resulting action
- We elongated the startup and post-IPL boost to 60 minutes, plenty of time for users to get the full benefit of the boost
Next steps
- Develop research plan for upcoming quarters
- Get changes implemented into the development plan
- Write some hills (user outcome statements)
- Evaluate changes and hills with users
- Usability study
Evaluative research
While my engineers were busy implementing those changes, my product managers and I set out to re-validate those decisions, get some additional context on users’ behavioral patterns, and gather some more technical requirements from users.
More surveys!
I printed out a new batch of surveys and deployed them at two user groups: the Z Design Council and GuideShare. The fun part was that all of the GuideShare users were German, so we had to deploy the survey in German! Sehr Spaß, aber ein bisschen schwierig (lots of fun, but a bit tricky).
Number: 52
Usability assessment
In parallel, I worked with my dedicated research participants (at IBM we call them sponsor users) to complete a usability study. Since this was going to be a hardware capability deployed with the next mainframe release, we wouldn’t be able to do normal beta testing. Instead, I walked my users through command line and HMC mockups.
Remember, research indicated that the IPL process is pretty straightforward for users, so we didn’t want to change much about the mechanics of an IPL. Instead, in our usability testing and future designs, we wanted users to feel comfortable and in control.
Number: 11
Case study analysis
My PMs and I reviewed how other IBM Z product releases had gone, especially hardware-related releases. We paid special attention to what had gone wrong.
We looked at a capability from the previous mainframe release – a capability that had been considered a blockbuster. We discovered that while it had been a huge hit with users on paper, the reality was much different. Turns out the team hadn’t considered installation and getting started during their design and development, so users ended up having to spend months getting this thing to work.
We didn’t want to make the same mistake, so together with our sponsor users we decided that we’d have this capability turned on by default. That was unprecedented for a new mainframe capability, but it was something that would significantly enhance our users’ ability to leverage this new function.
Key second-round insights
Ease of adoption is key
Insight
- We learned from looking at previous offerings that you can have the coolest offering, but if it takes 6 months to turn it on, it’s not as cool as you thought
Resulting action
- We decided to have SRB turned on by default (pretty unprecedented in IBM Z)
Users value control
Insight
- Since we were turning the capability on by default, users needed a way to turn it off before AND during a boost, just in case something went wrong
Resulting action
- Simplified mechanisms to turn off an in-progress boost
- Parmlib options to control the applicability of the boost
Additional technical requirements
Insight
- SADMP exploitation of SRB is important
- Uncertain definition of special-purpose Boost zIIP pool and temp capacity
- z/OS isn’t the only operating system that matters
Resulting action
- SADMP exploitation of SRB
- Extended the use of the “real” zIIP pool for Boost purposes
- zVM and zTPF exploitation planned
Next steps
- Final playback to our GM and VPs
- Final development push
- Go to market work
- PARTAY!
Go to market prep
Making the technical accessible to execs
I acted as the product manager for the development of SRB’s launch event collateral. I worked closely with IBM executives and the hired creative agency to build out a launch story that would make sense at the executive level rather than the esoteric systems programmer level.
What we shipped
So what was the end result?
I am so unbelievably proud of the capability we shipped. Instant Recovery was included in every IBM z15 mainframe, and users were able to use the associated capabilities with no additional licensing charges. There are three capabilities that Instant Recovery encompasses, and users can take advantage of any combination of the three (it all depends on preferences and machine configuration). Users get a 30-minute boost for shutdown and a 60-minute boost for startup. The startup boost is extra useful since startup usually only takes 12-30 minutes, so users can use the rest of the boost to run their workloads in overdrive to make up for lost time.
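To make that arithmetic concrete, here is a minimal sketch of how much boost time is left over once startup finishes. It assumes the boost clock starts when startup begins, and the function name `leftover_boost_minutes` is made up for illustration; it is not part of any IBM tooling.

```python
def leftover_boost_minutes(boost_window_min: int, startup_min: int) -> int:
    """Minutes of boost remaining after startup completes (never negative).

    Assumption: the boost window starts counting when startup begins.
    """
    return max(boost_window_min - startup_min, 0)


# Boost windows taken from the text: the original 30-minute plan
# and the 60-minute startup boost that shipped.
ORIGINAL_PLAN_MIN = 30
SHIPPED_STARTUP_BOOST_MIN = 60

# Typical startups (12-30 minutes) and a 45-minute IPL at a large shop.
for startup in (12, 30, 45):
    shipped = leftover_boost_minutes(SHIPPED_STARTUP_BOOST_MIN, startup)
    original = leftover_boost_minutes(ORIGINAL_PLAN_MIN, startup)
    print(f"{startup}-minute startup: {shipped} min left (shipped), "
          f"{original} min left (original plan)")
```

Under the original 30-minute plan, a 45-minute IPL would have used up the whole boost before the system was even back online, which is why the window was stretched to 60 minutes.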
The three boosts
zIIP Boost
Enables general-purpose work to run on specialty zIIP engines.
Speed Boost
Allows sub-capacity mainframes to run at full-capacity speed.
GDPS Enhancement
GDPS scripting parallelization and other enhancements.
Personal victory
I was awarded IBM Systems' first ever Outstanding Design Achievement Award for my work on Instant Recovery.
My engineering lead also won an Outstanding Technical Achievement Award, and my product manager won IBM Systems' first ever Outstanding Business Achievement Award!
Overall, this project was such a great example of a really equitable and dynamic partnership between design, engineering, and product management, and I couldn't be more proud of what we accomplished together.
In the news
“System Recovery Boost can give companies a competitive edge.”
TechChannel
This “will enable instant recovery, resulting in ‘ultimate uptime’ for customers.”
Forbes
“z15 customers can restart systems and return to steady state business in up to 50% less time than previously required and complete their transactional backlogs up to twice as fast.”
eWeek
Terminal Talk podcast
I was a guest on Terminal Talk, a popular mainframe podcast! I’m the only designer to have guested on Terminal Talk, so that was pretty cool!
Marketing video