Everything Is Engineering Now
Three years of Practical Software Engineering at Lifelock
Copyright © 2012 by Cam Riley All Rights Reserved
Why Write This?
I was disappointed when I read "Coders At Work". None of the advice or conversation in the book was something I could use or apply at work the next day. Most of the people in the book were famous, or academics, and did not have the same day to day problems that I did as a Technical Manager and Technical Lead. Martin Fowler's book on Refactoring is a good example of something that has direct work application to software engineering. After reading a few pages of Refactoring, the next day I could improve how I worked and produced software.
This is not a tell all or a gossip story about Lifelock. My time at Lifelock was a positive experience which I enjoyed. What this is about is the day to day engineering issues and consequences we faced in a startup that was only a couple of years old and growing rapidly. This book is written for other Software Engineers and hopefully helps explain our decision making in what worked, what didn't and what we couldn't change due to organizational limitations.
At Lifelock I was employed as a Technical Lead and that position morphed into the position of Technical Manager. I wrote code every day for Lifelock, but the position was more than that. It was part production debugger, part environment debugger, constant support for Infrastructure and QA, as well as driving Engineering to produce higher quality code and spearheading automation efforts. I worked in a cube farm and often there was a line outside my cube of people needing my input for this or that. It was draining, hectic, stressful, rewarding and fun all at the same time.
Hopefully this book helps out other Technical Managers, Technical Leads and Software Engineers in organizations who deal day in and day out with legacy code, with feature sprints, with CTOs, with Vice Presidents, with Offshore groups, with Infrastructure, with QA and maybe even engineers who are unhappily caught in a death march.
Interviewing With Lifelock
In 2008 a software engineering friend who I had shared an office with at a previous job had moved across to a contracting firm and was working on a new project at Lifelock. That project became the death march known as Project Renaissance where the PHP website and the CMS were replaced with an Oracle stack and J2EE technologies. Oddly the front end systems were re-done in .Net which meant that engineers could not move up and down the stack as needed. There were specialist .Net front end engineers and specialist Java middleware engineers.
The company I was working for at the time, Shutterfly, had run me through several little mini death marches of three to six months in length that I was tiring of. The term death march was coined in the software engineering profession to describe projects where the timelines for completion are too short, the requirements are vague and the estimates are unrealistic. In these situations management - and the engineers - delude themselves that the go-live date is possible and will often put engineering and QA into crunch mode where people are working seven days a week and long hours each day to give the appearance of the deadline being achievable.
It is a little unfair to call those mini death marches but they felt like it. We were always late delivering the software and the quality was exceptionally low. The requirements were not vague. It was front-end work and the requirements were well written as wireframes. We tended to be overly aggressive with timelines despite padding hours into the project. Additionally, front end work is never complete until all the permutations an end user can conceivably click on are identified, and consequently those edge cases often pushed development well into the QA period.
The final straw for me was when a project started to rewrite a front end system in Flex. This became a classic death march rather than just a poor quality system. We blew a timeline and the engineering staff remained in crunch mode. The bugs kept piling up and piling up. My stress levels and nerves at that employer were never great to start with because of the project lengths and quality issues. Going through a divorce at the time contributed to the stress and the added hopelessness of working through another death march was too much. So I started interviewing.
A major concern I had when interviewing was that I didn't want to change jobs and go immediately into another death march. I was brought in for an interview by Limelight Networks which was a rising company in the Phoenix area. At the time they were in a small light industrial complex off 48th Street. Now they have the top floors of a brand new multi story office in downtown Tempe so their fortunes in the CDN market were definitely on the rise.
The project I was interviewing for had gone through multiple tech leads, had high turnover and was still not finished after over a year of work. They wanted someone to come in to get it completed and out into production. I can recall wincing at the description and asking at the time, "Is this a death march?" I suspect that question got me remembered a couple of years later when a friend of mine interviewed there and the interviewer mentioned to him that I had interviewed there. Phoenix is a small tech community in comparison to Washington DC and Northern Virginia. Nearly everyone has worked with someone else in some capacity so it is a good thing to bear in mind. Never burn bridges in Phoenix.
I interviewed with Lifelock when the Renaissance project was in its early stages. I met with several managers who left soon after I interviewed. I remember saying to the Vice President I interviewed with as I was leaving, "Are we going to work something out? Because I am willing to do this." I cannot remember his exact words, but it was to the effect of yes, we will try and make this happen. Unfortunately that management group soon left and with the chaos of the Renaissance death march I think my application was forgotten. Nearly six months after my initial interview I was given a job offer at Lifelock which I took. I knew some of the people I would be working with there and it sounded like fun, plus, I desperately needed to move on from where I was.
When I gave in my notice to go to Lifelock I left a project that was getting closer to completion but was still in crunch mode. On my last week there I came in on the weekends - both of them if I recall correctly - to help out the project. I did not see any end in sight and was happy to leave.
Experience In Quality
I often joked that JIRA and I were not friends, but there was one obvious reality at Shutterfly: the absence of unit testing produced low quality code. To make it worse, when a bug got into JIRA the only way to get it out was to resolve it - whether it was worthwhile or not - or have a Product Owner say it was no longer necessary. The alternative was to not let the bug into JIRA and keep it out by arguing it out. Arguing over each and every bug is a valid management technique for keeping the bug counts down but it causes too much friction. Another tech lead dealt with the low quality and high bug count by leaving a lot of them open and closing them a couple of projects later when the requirement provoking the bug was no longer relevant.
When we got to the end of that project, we were in constant crunch mode and just closing bugs as fast as we could. I can recall one week where one engineer alone closed one hundred and fifty and I closed just over a hundred. I graphed the rate of new bugs being created against the rate of bugs being closed and there was one weekend when we won, when we beat the inherent lack of quality of the system. It was too much of an emotional and physical toll though. We lost engineers in Phoenix because of it.
In 2002 I had worked on an NTCIP Driver project which was a lot of fun. The project included a mathematician who was a friend of mine. Java has no unsigned byte or int types, so I was often running to the mathematician for help with bitwise operations to ensure that the PMPP packets were correct. He did the bitwise operations on his fingers with each finger representing a bit. After he had worked it out on his fingers we would turn it into a Java method. In that project I was terrified of a bit or a byte being out of place so we covered that project with all manner of unit and functional tests to ensure that it worked as the spec demanded.
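As a minimal sketch of the kind of bit handling described above, and not the original driver code, the following shows how signed Java bytes can be read as unsigned values with a mask. The class name, method names and sample bytes are hypothetical and do not come from the NTCIP or PMPP specifications.

```java
// Hypothetical helper: illustrates reading Java's signed bytes as unsigned values.
class UnsignedBytes {

    // A Java byte ranges from -128 to 127; masking with 0xFF yields 0 to 255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Combine two bytes, most significant first, into an unsigned 16-bit value.
    static int toUnsigned16(byte high, byte low) {
        return ((high & 0xFF) << 8) | (low & 0xFF);
    }

    // Report whether bit `index` (0 = least significant, 7 = most) is set.
    static boolean isBitSet(byte b, int index) {
        return ((b >> index) & 1) == 1;
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xF0;
        System.out.println(raw);                                     // -16
        System.out.println(toUnsigned(raw));                         // 240
        System.out.println(toUnsigned16((byte) 0x12, (byte) 0xF0));  // 4848
        System.out.println(isBitSet(raw, 7));                        // true
    }
}
```

Small methods like these are easy to cover with unit tests, one assertion per boundary value, which is the sort of coverage that makes it possible to trust the code when the hardware disagrees with it.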
We were given a variable messaging sign [VMS] to test the new Java NTCIP driver with. These are the large electronic road signs you see with yellow lights in them. They are usually attached to an overpass or orange trailers on the side of the road. The one we were given had a board in it which could handle the new protocol. We would constantly run functional tests against it and we discovered the board didn't handle the protocol. It turned out that the manufacturer was beta testing on us by putting in their own alpha and beta boards. We ended up swapping in and out several boards before we got the whole thing working. It was the unit tests that gave us the confidence to say it was the board in the variable messaging sign that was not compliant and that our code was handling the spec correctly.
The point of the NTCIP Driver project story was that I wasn't a stranger to unit testing and had used it in the teams I had led prior to Shutterfly. I didn't press for unit testing at Shutterfly and didn't provide the leadership for it there either. That was a failure on my part. I did not want to make that mistake again. I was resolved that at Lifelock I would not give up code quality and allow myself to be in a position where I would shrug off one thousand plus bugs as being normal for a software project.
The other part that made working at Shutterfly unbearable was the hours. I am an aging Software Engineer who is 41 at the time of writing. I should know better but I was happy to pretend I am the Herculean guy that can work more and get it done by sheer force of willpower. It is conceit. I used to say, "You have to be fit to be a software engineer" because of the hours, stress and project pressures. I was wrong. You have to have the courage to say no and just dig your heels in. There are too many managers that say yes when they should say no and are happy to take advantage of engineers' good nature by having them work impossible hours for the sake of appearance. I was resolved at Lifelock that the work/life balance was going to be normal for myself and any engineers I was working with.
If there is anything I am proud of at Lifelock, it is that I achieved those two goals: code quality and work/life balance for the Tempe engineers. It was a hard slog to get there but totally worth it.
When I came to Lifelock in August of 2009 the death march Renaissance project had gone out to production and been bulldogged into a stable live system courtesy of long hours from the engineering, infrastructure and QA departments. Lifelock had gone through a boom in 2007 when they had started advertising on the right wing radio stations. It turned out that the products Lifelock offered and the demographics and concerns of the right wing radio audience were a close match. Lifelock was a startup and one of the leaders in the Identity Theft Protection market. They were constantly trying to find which market wanted their products and what they needed to sell in order to resonate with a larger customer base.
The identity theft market became possible because more and more information was flowing through the Credit Bureaus. This is a regulated industry dominated by a select few companies: Equifax, Experian and Transunion are the best known. The Fair and Accurate Credit Transactions Act allowed individuals to place alerts on their credit histories if identity theft was suspected. The legislation was intended to make fraudulent applications for credit difficult if not impossible. This meant if you had that flag on your credit history, and someone made a request for credit, then the credit bureau would ring you and say, "Is this you opening credit for …." which was hugely useful and pro-active. This started the identity theft protection industry and Lifelock was the one who recognized the market opportunities in that legislation.
Most startups have manual fulfillment and only begin to automate processes as the scale dictates or necessitates. Lifelock's initial product was to put the identity theft alerts on your credit history at the Credit Bureaus so the paying customer didn't have to worry about it. The alert only lasted for a short time so every three months someone from Lifelock would ring up the credit bureaus for each customer and have the identity theft alerts renewed on that customer's credit history.
As Lifelock's customer base grew, more and more of the fulfillment and billing processes needed to be automated, as Member Services were swamped with requests. The billing project was supposed to rectify all the tasks that did not need to be manual by making them an automated part of the billing system. True to their startup roots, Lifelock was using Paypal as their billing provider up until that time. If I recall correctly, in 2009, Lifelock was Paypal's largest recurring billing customer. By 2010 Paypal could not provide all the services that a growing Lifelock required so a new billing system was sought out. When I came to Lifelock I was placed on the Billing Project immediately.
The Billing Death March
I had left Shutterfly because of a death march. When I joined Lifelock I managed to drop myself into a death march immediately. Edward Yourdon suggests that engineers caught in death marches quit. That is not always feasible, certainly not in my situation as I had just left one company for another. It wasn't obvious until about two months into the project that we were dealing with a death march either. After we got stuck on the Service Bus work we realized the project was going nowhere fast.
Every morning Lifelock had a large scrum in a medium sized conference room. This was a leftover of the Renaissance project days when they were focused on getting the Renaissance software into production. All the engineers, QA, infrastructure, project managers, business analysts and middle managers would shuffle into this conference room and pack every square inch of space. The CTO generally ran the meeting and would go round the middle managers and project managers in turn. The rest of the technical people there were largely an audience except when a specific question was asked of them. I hated it.
I had been to about two weeks of these morning scrums and was feeling impatient. Not only was I being forced to wear uncomfortable business casual clothes but I was jammed in an uncomfortable meeting each morning. Soon after, I approached the new Vice President of Engineering and asked if I could hold the billing scrum separately to the morning meeting. She said that sounded good and that she disliked the big morning meeting as well. That large scrum disbanded and smaller scrums started occurring. The meeting had a place during Renaissance but had outgrown its usefulness.
The vendor for the new billing system was Metranet who had a .Net based product that was to be housed internally. The original approach was that Metranet would take over the billing and product catalog responsibilities. In Lifelock's existing software the recurring billing was handled in the middleware and the product catalog was Salesforce with the important pieces of data being replicated in the Lifelock database.
With the Renaissance project Lifelock had moved to a Service Oriented Architecture [SOA] structure where web services were the main method for calling between different subsystems. The goal was for our middleware and service bus to integrate with Metranet transparently via web services and our front ends and partners not notice that we had a new billing system and product catalog.
I am not sure why Metranet was chosen as the vendor for the billing system. I heard that it was in the middle for price. I had also heard that the investors in Lifelock were also investors in Metranet and that it was a case of eating your own dog food. Another thing I heard was that Metranet claimed everything was out of the box when requirements were hashed out and then when the project started suddenly everything was custom code. I don't know the truth of why Metranet was chosen.
I used to joke that we should have bought Metranet's sales people and thrown out their software. Metranet was very effective in convincing Lifelock that they were the best for our needs. This perception only started changing as it became apparent that the project was a death march and the majority of the quality issues were with Metranet's system and not Lifelock's code.
When I arrived at Lifelock the engineers were over-managed. They were being sucked into meetings all day. Young talented engineers, who should have been punching out code for six hours a day, were spending that same six hours stuck in meetings and not uttering more than five words an hour. Once I settled in at Lifelock I started an approach where I would go to the meetings and we would only bring in other engineers when we needed them. One young engineer made the comment once that when I arrived his calendar changed from one hundred percent meetings to zero percent.
I am not a fan of meetings. I dislike them as a forum for discussing issues as people tend to like to talk and you have to give everyone equal time. At Lifelock, quick hallway meetings to hash things out or to determine consensus were far better and for the most part I tried to achieve things that way. This method works well when there is limited management, but once more and more management starts piling on, the number of meetings increases.
Until early 2012 the management structure of the engineering department at Lifelock was super flat. There was the Tempe group with myself as the tech lead and the Irvine group in California with an engineer as lead and we all reported to the VP of Engineering. It was remarkably effective as the VP of Engineering set strategy and the engineering groups set about implementing those strategies. This approach and flat structure showed empirical results as the engineering group had the highest morale of any group in Lifelock in 2011.
When I first started at Lifelock I had been told we had to wear business shirts, business pants and dress shoes. This is not comfortable wear for a software engineer. I also have a muscled build so nothing in the business casual catalog really fitted me that well. I am certain business clothing is designed to be comfortable for the average business male who has a pudgy belly, chicken legs and stooped shoulders.
I had to go out and buy clothes specifically for Lifelock. I had one suit that I got married in, and outside of that I had t-shirts, jeans and work out clothing. I put up with wearing business clothes until I started noticing that a couple of upper managers were wearing plain t-shirts with dress pants and dress shoes. I think that is a massive fashion faux pas; it did not look good, but I decided that if they were wearing t-shirts, so would I.
None of my t-shirts are plain, or go with dress pants. So I wore jeans instead. It was like dominoes dropping. Within a couple of weeks all the engineers were coming into the office in t-shirts and jeans and soon after infrastructure were as well. It was unstoppable. Even though it sounds like I started that process in this telling, it was a group consensus and everyone kind of did it at the same time. Engineering and Infrastructure were just itching for the excuse not to wear business shirts.
We couldn't get the dress code dumbed down to shorts. Phoenix is a hot city and wearing jeans in the desert when it is 115F is not fun. I always thought it was strange in Australia that people would wear long pants and long shirts for business reasons when the Australian climate - other than Tasmania - is either hot or muggy. If Australia is hot, Phoenix is even hotter and during monsoon season it is brutally hot, humid and muggy. In those kinds of environments I think shorts are more than acceptable.
Engineering Gets Macs
Windows machines are essentially crippled for business courtesy of all the anti-virus software that goes on them. Trying to run Weblogic on localhost and compile branches on the command line generally makes the machine unusable for extended periods. It is incredibly frustrating. When I interviewed with Lifelock I stipulated as part of my employment that I would get a Mac and wouldn't have to use a Windows machine. When I started that was ignored and I got a standard Windows machine.
The designer guys down the hall had Macs. Supposedly it was because they are incapable of being creative with Windows machines. I chatted to the designers and asked how they got the Macs. I was told they were an exception. I determined that engineering would be the next exception to that policy and consequently we started politicking for Macs as well. We were rebuffed numerous times.
One day the CTO moved on and the current CFO took both the CFO and CTO roles for the interim period. I had lunch with him one day and in passing I was talking about our crappy machines and how they take a dive when we run functional tests against Weblogic on localhost. Our main complaint was that a task which should take thirty minutes ends up consuming a day. Very soon after that lunch we had someone from procurement asking all the engineers what machines we wanted.
Not all the engineers in Tempe wanted Macs. We had two dedicated .Net engineers but they were happy to upgrade their machines to brand new boxes with ample memory, CPU speed and hard drive space. Three of us decided that we wanted a strong separation between work and home. Consequently we refused laptops and took desktops. Two of us asked for Mac Pros; the third was the lone middleware engineer holdout who asked for a Windows desktop.
Once Macs were in the organization they were unstoppable. Prior to the influx that started with engineering there had been the dictum that only Windows and Lenovo were supported. The IT Group probably could have stopped more Macs coming in after engineering got them but a curious thing started happening - Macbook Pros that came in started getting light fingered by executives and directors. One of our engineers loved Macs and was pretty miffed to be the last to get a Macbook Pro, because someone higher in the chain saw the new machine intended for him and appropriated it for themselves.
Between upper management starting to get Macbook Pros and the rapidly expanding numbers of engineers in Tempe and Irvine with them, the Macs were there to stay. It was probably a good thing as soon after iPhones and tablets became commonplace, ousting the Blackberry and the underpowered Lenovo laptops.
We used Subversion for our source code repository. I had been using svn from the command line for managing my source code changes but the Irvine engineers started using Cornerstone and it became the de facto standard for the Mac users in engineering. It worked well, though sometimes merging was dicey. It wasn't always obvious in Cornerstone whether a conflict had been resolved, and you could mark something as resolved that still had >>>> in it. This is acceptable if it is in a java file or something that is compiled, as the build fails immediately, but not so good when it is in an html file or a config file as it ends up being a runtime error.
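One way to catch leftover markers in files that are never compiled is a small pre-commit or build-time scan. The sketch below is an illustration and not something we ran at Lifelock; the class and method names are hypothetical. It assumes the usual marker lines that begin with seven `<` or `>` characters, plus the `=======` separator line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical check: flags lines that look like unresolved merge conflict markers.
class ConflictMarkerCheck {

    // Returns the 1-based numbers of lines that start with a conflict marker.
    static List<Integer> findMarkers(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.startsWith("<<<<<<<")
                    || line.startsWith(">>>>>>>")
                    || line.equals("=======")) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    // Usage: java ConflictMarkerCheck index.html app.properties
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            List<Integer> hits = findMarkers(Files.readAllLines(Paths.get(name)));
            if (!hits.isEmpty()) {
                System.out.println(name + ": conflict markers at lines " + hits);
            }
        }
    }
}
```

The `=======` test can give false positives in plain text files that use a row of equals signs as an underline, so a real check would probably limit itself to html and config files.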
Eclipse has been the dominant Integrated Development Environment [IDE] for a while. There was a time when it was slow and unstable but that ceased to be an issue by about 2005. Prior to that I used emacs and ant off the command line. With the tools for refactoring and introspection that eclipse has I could never go back even if I wanted to. It is a remarkably productive environment.
The only issue we hit was the m2eclipse plugin which appeared to cause instability when importing projects once we changed to a larger branch structure. Upgrading to m2e caused its own issues as it didn't recognize a lot of the custom goals and plugins we had created to support our build and artifact generation process. The m2e plugin was also not so great in working out exclusions in parent files that were not in the workspace either.
We checked in .project and .settings files with the projects. When I first came to Lifelock they were not checked in and there was a text file that was passed around to get the OSB and EAR projects set up correctly in eclipse. If you ever had to change to a new workspace you had to go and find the text file then go through all the steps to make sure all the eclipse projects for the OSB config, the wsdls, the ear, the ejb and the ws projects were correct.
Since everyone was on eclipse it was a quick win to start checking in the .project files and then it was simple to import the project from the filesystem after it had been checked out. As we went to the platform architecture with multiple ears, common jars and common EJB modules it became even more necessary as otherwise engineers would have spent twenty percent of the week setting up projects in eclipse rather than coding.
When we moved to maven we continued to check in the .project and .settings files. It became normal for engineers to import existing projects into the workspace and since we had no one using IntelliJ or emacs/vim it was not a big deal. I know there are religious wars over whether to check in .project files or not, but I have always found it easier, and more efficient. It is possible open source projects with their diverse developer base may prefer not to, but in our case it made it easier to focus on developing than screwing around with the IDE.
The checking in of .project files extended to our automation projects as well. The majority of our eclipse based automation was functional testing for the front end and middletier though we had python and bash projects which supported different automation tasks. The Linux Engineers did not check in their project files as most of their stuff was one-off source files or a small grouping of source files that did not number more than five or so. Additionally they all used different editors and IDEs such as vim or komodo.
The automation projects that engineering were involved in had the .project files checked in despite them being projects with only one or two files in them. Again, it was more efficient to import the project and have it all set up for you than muck around with the IDE project settings. If there is anything I have learnt from eclipse it is that it is hard to do the same thing twice manually.
One of the things that used to drive me bonkers was people checking in class files or the compiled jar files in a target directory. Subversion didn't handle the constant change well and these files would always appear with the little green squiggle. Subversion is not particularly good at turning a previously checked in file into one that is now on ignore. You really have to identify the ignore files the first time. It is a weakness in Subversion that is annoying, but not really a show stopper.
After the Renaissance project there were multiple projects in subversion but the two important ones were the middleware and front end. Essentially we all developed off trunk. Tags were made to denote a production release. With the billing project and its incompatibilities with trunk a new branch was created off trunk for the middleware and front end. Infrastructure also provided a completely new development integration environment for the billing project as well. This was the start of Lifelock using a feature branch strategy.
When billing had been pushed into production in December of 2010 the next large incompatible branch came in and replaced two of our front end projects. From that point on we settled into the Agile approach of two to three week sprints with a production release at the end of it. Occasionally a larger feature set - such as the sales tax changes or product platform - would stay out of trunk for a period of three sprints or longer before coming in and being released to production.
We often toyed with the idea of Fowler's continuous integration or the kanban method of code production but our organizational structure did not support that speed of code movement through our system into production. Conway's law states: "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."
Conway's law is true of processes too. Organizations which design a process will match the process to existing communication structures. The process to get code from a user story to production matched the organizational structure and did not change no matter how much engineering and management tried to speed it.
Processes are not inherently bad. They largely exist to try and make sense of chaos. However, chaos is not uniform. It is often concentrated in small areas, and formalizing a process that is too heavy can penalize the areas that don't really need to be under that umbrella of chaotic inputs and outputs. Chaos is not static either. It can move or be forced out, so a process that made sense six months ago may not now. Unfortunately, once a process is established it is like its own personal bureaucracy and defies elimination.
What Lifelock called feature branching is a little different to the standard definition of a feature branch. More accurately Lifelock used sprint branches. When a new sprint was started we would create a new Version in JIRA and Greenhopper, add the user stories as JIRA tickets, and then create a new sprint branch out of Jenkins. The Jenkins job would version all the maven poms with the correct version in the Group Artifact Version [GAV] coordinates for the sprint, check the new poms in and then build the new feature branch via Continuous Integration in Jenkins. The feature branch artifacts were then pushed up to Nexus, ready to be deployed to a DEV Environment.
If, for example, the new Sprint was for Sales Tax and was number two in the sequence of sprints for that project, the Jenkins job would create a branch named SALESTAX_SP2-SNAPSHOT and then version all the poms in the branch with the same version name. That way when anything was checked into that branch, the continuous integration job in Jenkins for that branch - the CI job being created by Jenkins when the branch was created - would kick in and build artifacts with the SALESTAX_SP2-SNAPSHOT versioning.
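The naming convention in that example can be sketched as a small function. This is an illustration of the convention only, not the actual Jenkins job; the class name, the method name and the rule of stripping spaces and punctuation from the project name are assumptions made for the sketch.

```java
import java.util.Locale;

// Hypothetical helper: derives the branch and pom version name for a sprint.
class SprintBranch {

    // "Sales Tax", 2 -> "SALESTAX_SP2-SNAPSHOT"
    static String versionFor(String project, int sprint) {
        if (sprint < 1) {
            throw new IllegalArgumentException("sprint numbers start at 1");
        }
        // Assumed normalization: upper case, letters and digits only.
        String key = project.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9]", "");
        if (key.isEmpty()) {
            throw new IllegalArgumentException("project name has no usable characters");
        }
        return key + "_SP" + sprint + "-SNAPSHOT";
    }

    public static void main(String[] args) {
        System.out.println(versionFor("Sales Tax", 2)); // SALESTAX_SP2-SNAPSHOT
    }
}
```

Using one string for both the branch name and the version in every pom is what lets a CI job tie the artifacts it builds back to the branch they came from.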
When a sprint was finished, the feature branch would be merged into trunk. The version for trunk was TRUNK-SNAPSHOT. From there a release branch would be cut and those artifacts would go through stage and into production. A Lifelock feature branch included more than one feature. Often it was one to twenty User Stories that could cover the front end, the middleware, the database, batch jobs and business tasks. It was not unique to code changes and very cross functional.
One of the downsides of this form of branching was that some projects got a little too far out of sync with trunk and were difficult to merge. They were not impossible, just difficult. Maybe Perforce or Git might have made the process easier but the fact remained that some projects were sufficiently incompatible that they stayed out of trunk for three plus sprints and had difficulty merging back into trunk.
The Martin Fowler style of continuous integration where you are constantly merging into trunk and using feature toggles is a solution but one our organization could not support. We had engineers that had worked in a Fowler style environment or in Kanban structures and they loved it because of the constant flow of code getting into production. Lifelock had the hard stop of going through the inspection process with QA that was dictated by the CTO. We could not convince others that the functional testing from engineering, and later the Selenium testing from QA, could cover the majority of the inspection steps. Everyone agreed they wanted testing automation, but no-one believed we actually had it and that it had been used since 2009. Engineering could not convince the other groups that the middletier tests covered most of the inspection situations.
The positive side of Lifelock's branching methodology was that once you got into trunk, you knew your code was going out into production as no-one wanted to revert a merge. It was too large and messy. Getting into trunk was sometimes the hard part. Which sprint was going out next was often horse traded in multiple meetings between Project Managers, Product Owners, Vice Presidents and the CTO.
Engineering found that unless you had a Project Manager championing your changes and insisting their project was more important than any other Project Manager's, you had a low probability of getting your feature changes into trunk and hence production. This was a bad thing, as quality improvements were difficult to get out even when new products and features were not.
This limitation led to significant functionality being hidden in sprints or slid in as additional user stories by engineering in order to get those changes to production. Often it would be a subtask in an existing user story rather than its own user story which QA could focus on. It was unfortunate as it 'hid' important functionality and improvements and placed an extra burden on QA, who were generally good natured about the changes as they understood the importance of quality improvements.
Knowing what I do now, and how Lifelock operates, what could we do to improve the branching process? I was heavily involved in the original decision to branch so I have to take responsibility for what we ended up with. Prior to branching all Lifelock's code was done in trunk. Often a bad commit would stop development or an artifact being built and readied for production. Back then there was only really one project going at a time. When the Billing Deathmarch started we had incompatible code that was going to be incompatible for at least six months so we branched. That practice morphed into the Sprint branch structure when Lifelock went to an agile process.
In 2012 we started using feature toggles and they worked really well. The sales tax project was the best example of this as it was in production for a month before finance wanted the sales tax functionality turned on at midnight of the first calendar day of July. When we flicked it on, everything worked as expected. I know we were way behind the curve there. Facebook and Flickr both have complex feature toggle functionality, being able to turn features on and off depending on location, IP, etc. A simple on/off toggle gave Lifelock what we needed. It is an excellent technology mechanism.
If I was to do the billing deathmarch again, I would put it behind a feature toggle and an adapter that could load whichever was the required billing integration piece. Development would continue in trunk but the new development work would be hidden behind the feature toggle which could be flipped on and off depending who needed it. This process is nothing remarkable. When prototyping against a couple of different third party APIs, one of our engineers did this instinctively. This process is what Martin Fowler calls continuous integration. In Fowler's description you are continuously integrating development code with production code. It would require a change of organizational mindset for Lifelock to start developing in that manner but it is achievable.
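A minimal sketch of the toggle plus adapter idea, in Java. None of this is Lifelock's actual code: BillingGateway, LegacyBilling, VendorBilling, FeatureToggles and the "billing.vendor" toggle name are all hypothetical, and the toggle store is an in-memory map where a real system would presumably read from a database or configuration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical adapter interface: callers only ever see this type.
interface BillingGateway {
    String charge(String memberId, long cents);
}

// Existing production integration (illustrative stub).
class LegacyBilling implements BillingGateway {
    public String charge(String memberId, long cents) {
        return "legacy:" + memberId + ":" + cents;
    }
}

// New integration still under development (illustrative stub).
class VendorBilling implements BillingGateway {
    public String charge(String memberId, long cents) {
        return "vendor:" + memberId + ":" + cents;
    }
}

// Simple on/off toggles, held in memory for this sketch. Unknown toggles are off.
class FeatureToggles {
    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    void set(String name, boolean on) {
        flags.put(name, on);
    }

    boolean isOn(String name) {
        return flags.getOrDefault(name, false);
    }
}

// Chooses the implementation on every call, so flipping the toggle takes effect immediately.
class BillingGatewaySelector implements BillingGateway {
    static final String TOGGLE = "billing.vendor"; // hypothetical toggle name

    private final FeatureToggles toggles;
    private final BillingGateway legacy = new LegacyBilling();
    private final BillingGateway vendor = new VendorBilling();

    BillingGatewaySelector(FeatureToggles toggles) {
        this.toggles = toggles;
    }

    public String charge(String memberId, long cents) {
        BillingGateway target = toggles.isOn(TOGGLE) ? vendor : legacy;
        return target.charge(memberId, cents);
    }
}
```

With the toggle off by default, both implementations can live in trunk and ship to production while only the legacy path is exercised by customers.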
When I started at Lifelock I was the new guy and had no idea what everything did. I started unit testing in order to learn the code base. Lifelock had complex rules for different promocodes to make sure that people couldn't fraudulently take advantage of partner relationships we have. I was unit testing one of these when I found a bug through exploratory unit testing. When I showed the issue to another engineer a fix was made quickly in trunk and it went out with the next release a couple of days later.
That was the beginning of a long hard slog of improving quality through unit testing at Lifelock. We started with zero unit tests and three years later we had close to four thousand unit tests. In August of 2012 we had ten projects with one hundred percent unit test coverage according to Cobertura reports running from Jenkins.
One hundred percent unit test coverage is not necessary, especially as we were using data transfer objects a lot which were incredibly simple, however, even those could be unit tested to ensure they did not cause runtime issues. A lot of our data transfer objects were sent through queues and topics which meant they needed to be Serializable. Unit testing that a data transfer object was an instance of Serializable became a cheap way of avoiding a runtime issue when that data transfer object was passed through a queue for the first time.
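A sketch of that kind of check, using plain Java so it stands alone. MemberDto and SerializableCheck are hypothetical names, not classes from the Lifelock code base. In a JUnit test the two helper methods would sit behind assertions. The round trip goes a step further than the instanceof check described above, as it also catches a non-serializable field hiding inside a class that declares Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical data transfer object.
class MemberDto implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String memberId;

    MemberDto(String memberId) {
        this.memberId = memberId;
    }

    String getMemberId() {
        return memberId;
    }
}

class SerializableCheck {

    // Cheap check: does the object's type declare Serializable at all?
    static boolean isSerializable(Object dto) {
        return dto instanceof Serializable;
    }

    // Stronger check: write the object to bytes and read it back again.
    static Object roundTrip(Object dto) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(dto);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("Object did not survive serialization", e);
        }
    }
}
```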
One of the reasons why I liked our engineering team to aim for one hundred percent was the element of completeness, professionalism and discipline. It meant the engineers had combed the code to ensure everything that could be tested was, and that mentality had a follow on effect for the code quality. Cobertura is slightly forgiving as it averages all the different parameters and rounds up. You can have redlines in your code, where a line is not covered, but as long as they are only a few, then Cobertura will give the big one hundred anyway. Which is something to be aware of.
I scripted up in Jenkins some reports which noted the number of unit tests in Subversion's trunk. The job went through all the projects for the common libraries, the front end projects and middleware projects. It then tallied up the unit test code coverage for each project. I would put these into our wiki and note the plus or minus change in percentage from month to month. Negative changes were noted in red. This kept an eye on any project that was introducing new code without adding unit tests. It also showed which projects were maintaining a high level of code coverage.
The code coverage report was originally only on trunk. I started adding other projects such as our product configuration project, sales tax and encryption projects to it despite their being in a feature branch status. The reason for that was these projects had high code coverage and high numbers of unit tests while they were in development and they were meeting their code complete sprint dates as well. I thought it was a good example of the speed of development unit testing enabled. Once they hit trunk it was a simple task to change over their subversion url and keep them in the report as production projects.
EJB3 was pretty nice in that the container took care of injecting the appropriate bean when it was referenced in another bean. We took advantage of that by mocking the injected bean and then flipping that reference over to public in the unit test setup via reflection so that the mocked object could replace it before being changed back to private again. One of our talented engineers created a little utility to do that. Along with JMock and that utility it became the standard mechanism for unit testing bean code.
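The utility itself is not reproduced here, but the technique can be sketched as follows. Everything in this example is hypothetical: PromoCodeService, EnrollmentBean and TestInjector are stand-in names, and the mock would have come from JMock rather than a hand-written fake.

```java
import java.lang.reflect.Field;

// Hypothetical collaborator that the container would normally inject.
interface PromoCodeService {
    boolean isValid(String code);
}

// Hypothetical bean; in EJB3 the private field would carry an injection annotation.
class EnrollmentBean {
    private PromoCodeService promoCodeService;

    String enroll(String code) {
        return promoCodeService.isValid(code) ? "ENROLLED" : "REJECTED";
    }
}

// Illustrative stand-in for the injection utility described above.
class TestInjector {

    // Sets a private field declared on the target's own class (superclasses are not searched).
    static void inject(Object target, String fieldName, Object value) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);       // open the private field
            try {
                field.set(target, value);    // swap in the mock or fake
            } finally {
                field.setAccessible(false);  // close it again
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not inject field " + fieldName, e);
        }
    }
}
```

In a test's setup method this would look like TestInjector.inject(bean, "promoCodeService", mock), after which the bean can be exercised without a container.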
When you unit test, you write unit testable code. There is a massive difference between what people write when they don't unit test and when they do. At Lifelock we had trouble getting the culture of unit testing in. Some engineers did not write unit testable code and it showed. When someone who did unit test came to that untested code, it was a nightmare.
Static classes were the big sore thumb. We had one nasty class from the pre-unit testing days called the TypeMap. It was a series of static classes and methods that used reflection. We got around it by creating an interface and then an implementation class that called the static methods. This meant we were not stopped by being unable to mock the TypeMap itself.
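The shape of that workaround, sketched with hypothetical names. LegacyTypeMap stands in for the real TypeMap, whose methods are not shown here, and TypeLookup, StaticTypeLookup and AccountLabeler are invented for the illustration.

```java
// Stand-in for a legacy static utility that cannot be mocked directly.
final class LegacyTypeMap {
    private LegacyTypeMap() { }

    static String typeFor(String key) {
        return "type-of-" + key;
    }
}

// Interface that production code depends on instead of the static class.
interface TypeLookup {
    String typeFor(String key);
}

// Production implementation simply delegates to the static method.
class StaticTypeLookup implements TypeLookup {
    public String typeFor(String key) {
        return LegacyTypeMap.typeFor(key);
    }
}

// Code under test receives the interface, so a test can hand in a mock or a fake.
class AccountLabeler {
    private final TypeLookup lookup;

    AccountLabeler(TypeLookup lookup) {
        this.lookup = lookup;
    }

    String label(String key) {
        return "[" + lookup.typeFor(key) + "]";
    }
}
```

The static code stays exactly as it was; only the callers change, which keeps the risk of the refactoring low.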
Another issue was long methods. Engineers who don't unit test tend to write methods that go on and on forever. One example of a long method was from a third party system we had to deal with. They had a method with; seventy five if statements, sixteen else statements, eight for statements, three try catch statements and eight while statements. That system was covered by functional tests, but still unmanageable code like that was in production. This is why you have to both unit test and functional test. One or the other is not enough.
Functional tests are also known as integration testing, system testing, regression testing, acceptance testing etc. Functional tests run against the deployed system at runtime. We did this because J2EE containers are notoriously complex and even though they adhere to the spec there are still vagaries that can bite you unexpectedly. The other benefit is that functional testing exercises the system like an end user and you can quickly state if the system is 'working' or not.
I detest the term 'not working' as it can mean anything. Not working can mean nothing is responding in the system, it can mean I tried one little thing and the result wasn't what I expected so I gave up, or it can also mean a 98% pass rate but one or two minor bugs, and hence 'not working'. When a VP or someone else hears not working in a scrum blocker sirens go off in their head. Functional tests are very good for making 'not working' empirical. Broad statements like that can be qualified very quickly with functional testing.
One of the issues we hit constantly with testing was that engineers were mixing up what was a unit test and what was a functional test. The simplest way I could describe it was; if you are running tests and pull the CAT5 cable out the back of your machine and the tests fail - then it is a functional test. If you need a network connection for the test to pass it is not a unit test.
A mechanism we used to control functional tests being mixed in with unit tests was to limit the connectivity of the servers that Jenkins runs the continuous integration builds on. If a test tries to make a network connection and is denied it will fail the test. We saw this occurring on one feature branch, the problem was that the connection hung, so it hung the build as well and more and more builds piled up behind it. The end result was that the functional test got removed from that branch but at a bit of an initial cost.
We put our functional testing under their own projects for the Front End and Middle Tier. One of the benefits of running functional tests against the middleware was that those APIs did not really change that often. The middletier tests were particularly good for testing a code base on a new environment and for double checking a merge on localhost before checking it all in.
Our functional tests for the middletier would change the state of the system and then go back and check that the system's state was now what we expected. This included double checking the data in our database and third party systems. When you work in this manner you quickly realize that you need to make the entire system testable. Not just code around unit tests, but the public APIs as well need to be done in such a way that the system and its state is testable at any moment.
Consequently we added secure middleware APIs that could tell us the state of the system beyond what our front end customers and third party consumers required. The change of thinking here is that testability becomes a customer. APIs and functionality expand to include the testable nature of the system as well.
The functional tests also expanded into tools. As we made the system more and more interrogatable and testable we found that many of the tools we had written were of value to other groups as well. We exposed these through numerous dashboards which helped explain the system beyond the major interfaces that our customers, partners and members service agents interacted with.
We created a user interface to run the functional tests that was written in Wicket. It never really caught on though. There was some clever engineering and cool use of the JUnit API in that application. The functional test package names and tests names became clickable in the user interface and fired off the tests at the package and class name level. We used Wicket's ajax calls and JUnit's callbacks to put the response in a nice web interface that went green or red depending on the result at the test and package level. It is a shame it never caught on as it looked great.
We also added the ability to run the middletier functional tests through Bamboo and Jenkins. During the period where we were using Bamboo for continuous integration, myself and another engineer worked on the functional tests to make them lighter so we could run small groups of them hourly and daily on our development environments. We hoped that we could have them running constantly but our development environments by that time had degraded to a level of instability that gave our functional tests too many failures. System instability is better monitored with Nagios than the overhead of a functional test.
We started adding information to our EARs and WARs to make them interrogatable via a URL so we could get the branch location and maven version from them. We used this to then get the correct set of functional tests for that deployment, and hence environment. This was put together in Jenkins so you could run a Jenkins job with the environment as a drop down and choose a maven profile which tested specific functionality. The scripting would then support getting the correct functional tests and running them through JUnit.
We also managed to create a small subset of functional tests that became a Performance Test which ran against production every two hours and emailed out the response times to a mailing list. It became a good mechanism to get the feel of our production systems and how they were responding. These performance tests were good for determining at a glance whether an existing environment was degrading and whether new environments were comparable in performance to the existing ones.
Functional tests were very important at Lifelock for shaking out environmental issues. We had too many non production environments that needed to be supported and we never had enough middleware engineers to support all the WebLogic, service bus and Tomcat servers that these non production environments entailed. Running a quick functional test or suite of functional tests against an environment was a very accurate way of determining any issues with the environment or the deployments on that environment.
Improving the quality of the functional testing was as easy as having someone from QA sit with you and go over what the functional test does. When you have to explain it to someone and are checking the change of state in the database and in third party systems, you end up having to re-justify why data is going where it is and how that matches to user stories and other requirements. It works really well in improving what the functional tests are testing.
Functional testing is as important as unit testing. You cannot have one without the other. Functional tests were used constantly by engineering at Lifelock to ensure that the software code we developed did what engineering said it did. Using the functional tests this way meant we were able to say with certainty that both old and new features worked as intended before Service Delivery or QA received the artifact.
We tried several mechanisms of doing code reviews and we never really found a simple maintainable and sustainable way to achieve it. The first thing we tried as a project team was loading all new code up on a projector in a conference room and as a group we would go over it. This was good, but it meant we needed the time away as a group. In the case of the death march we got inundated with work and were lacking time so this process kind of drifted away and we relied on unit and functional testing to ensure quality.
When the Atlassian suite was purchased we started using Crucible. One of our principal engineers used Crucible to manage the quality of the code coming from our offshore group when they started the batch migration. He used that as his main point of interaction with the offshore engineers and their code. It did improve quality but at great cost to his productivity. He is a talented engineer and most of his time went on code reviews.
One of the things we found with the Crucible code reviews was that the same errors were being made over and over. So we started documenting these on the wiki. When the issue popped up in a code review we would post the wiki page into the Crucible comment. I am not really a fan of Crucible. It only gives a small view of the code and it lacks the intimacy of a side by side code review. When you leave a comment on Crucible, only that comment gets fixed so it can pass a review. It doesn't start any dialog or introspection on the code and why the code was written a certain way.
The best code reviews are not through Crucible but with someone sitting next to you while you review it. We don't do pair programming, that is by choice, but having someone next to you to review your code can be quite insightful. At the end of a sprint I had another engineer review my code which was complete as far as I was concerned. The code satisfied the requirements, had unit test and functional test coverage, but once we went over it we managed to get the number of unit tests down and we improved the production code by reducing the number of methods and making it more obvious what we did. It can be valuable and fun having another person - no matter their level of engineering skill - look at your code with you.
Unfortunately this is not always possible. Lifelock had an engineering group of about thirty engineers spread across three offices in Arizona, California and India. Doing side by side reviews with the Californian or Indian engineers is not really a possibility even with modern telecommunications technology and software. Crucible is a poor replacement for side by side code reviews but it is all we have for reviewing code across groups. A reality is that proximity matters and I don't see any way to get around that.
Software engineering tends to have a temporal view which is propagated by the myth of the 'hacker'. Code is quite remarkable in that it can get past QA and into production, and work well enough that a business is sustained and customers are happy, even if the code itself is not that good. A truism of software engineering is that if code gets into production it is going to hang around for a long time. So you have to write your software not to get past QA next week, but also for the poor schlub of a software engineer that is going to be wading through your code five, ten and even twenty years from now.
That might seem silly but there are still mainframe systems in production from the 1970s today. I know software that I developed in the early 2000s is still in production ten years later. Consequently, I try to make the code review ensure that the software and its comments are sufficient that I could look at this code in ten years time and know what it does or is supposed to do. Code reviews are not just about code.
JIRA and Agile
Because we had the entire Atlassian Suite, when we moved to Agile we also adopted their Greenhopper software. This is like a plugin for JIRA that can reorganize tickets into something resembling a backlog and sprint. Atlassian's software is not that great and shows its 1990s and early aughts history. For instance, traveling back and forth in the workflow often gets stuck with "form resubmission" warnings. We made do with it despite there being better software to support sprint workflows.
Often engineers put hours on a User Story as one big lump and assume that it will cover the feature development work, the unit testing, the functional testing, the documentation etc. They then find that they have run out of time to get the feature complete and they then just do the feature coding work and forget about the rest.
We discovered that adding a series of subtasks under a user story and explicitly adding sub-tasks for unit testing, functional testing, wiki documentation, etc meant that you had a Scrum Master nee Project Manager asking you each morning if you had done your unit testing. It worked quite well as the hours we added as sub-tasks were always accepted - never questioned - and the unit testing, functional testing and documentation all became part of the normal user story completion tasks. We also found that some of the Scrum Masters who previously weren't aware of these tasks started demanding them - even defending them. These subtasks became something to be expected. Which was great.
It wasn't always that way. Agile was introduced by the VP of Engineering, largely in response to the billing death march and the difficulty of getting new code and features into production. It was probably the best thing that could have been done. Different groups were trying to organize their projects in an agile manner, doing scrums, doing stand ups, etc. A truth is, unless there is a buy in for it from upper management and other groups that are involved in projects, such as Project Management and Product Management, then it does not work. A second necessary component is a weekly meeting or document which allows everyone involved to see the state of every sprint, of every user story and the burndown charts. This simple layout enables everyone in the organization to know how everyone is doing.
When the billing death march started I used to get stuck in long days of meetings where a Use Case would be hammered out between us and the third party vendor as to what the requirements are. They were huge lists of text with multiple indents that went on for way too long. The meetings that were required to build these large use cases were exhausting for everyone.
As the billing project got going we tried to impose a scrum style morning meeting over it, but the timelines were getting blown so often, and there was so much middle management between the engineers and upper management, that the project just ended in stasis. The second issue we faced was that when we did get going and started producing quality code at a good velocity, the vendor system we were integrated with was suffering from poor performance and low quality code. Which made our functional testing next to useless. Once we got on top of our system, that was the story of the project, slow times and low quality.
The change over to Agile occurred during the last part of the billing death march when it was being forced into production with the direct involvement of the VP of Engineering. The billing project didn't go agile though, it was still in the struggle of making the system production ready if not production quality. During the agile introduction Lifelock brought in a Scrum Coach who took everyone in the organization through what going agile meant. Engineering was all for it as we were sick of heavy requirements that were next to useless and then getting stuck in crunch mode until whatever the feature was hit the arbitrarily set end date.
The agile manifesto states:
• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan
So did Lifelock's adoption of Agile achieve these goals? Individuals and interactions over processes and tools, I would argue no. Agile was put like a piece of vellum over the existing organization, processes and structures with new lines drawn between the old organizational responsibilities and the new ones. There was the promise for a while there that engineers would be scrum masters and the product owners would be the end users, but it didn't happen.
Project managers took the scrum master role and the product owners remained specialists like business analysts in older structures. There are places for project managers and product owners, but when someone is working on a billing system, there is no need. The engineers as scrum masters can talk directly to the product owners in finance and billing operations.
Working software over comprehensive documentation we achieved but that was not due to agile. It was engineering that put into place the rigor of unit and functional testing. It was not a result of the agile process or project managers wanting that in the sprints. We also documented heavily our projects and modules in the wiki as well. We did this as part of the sprint so that other groups would know what the batch jobs etc did without having to look at the code itself.
Customer collaboration over contract negotiation. Because engineers were not scrum masters and product owners were not end users, I have no idea. User stories came to engineering in sprints as a-priori. They were essentially immutable and we had to take them at face value as being what the end user wanted. If we did question what a user story was then a product owner would try to give an answer, and if they couldn't then they would go back to the end user. I would have loved to have the end user in some of our sprints but organizationally we did not have the will to do it.
Lifelock's adoption of agile did stop the big ugly contract negotiation pieces. When the billing death march started I got stuck in all day meetings where big nasty happy path use cases were hammered out. This was hopelessly inefficient and the adoption of user stories stopped this style of requirement gathering.
The final agile quality of responding to change over a plan did happen. One good thing about the adoption of sprints and user stories was that a user story being late or taking longer than thought was never questioned and it was ok to push a user story out into the next sprint if it wasn't feasible. The upper management of Products and Technology should be proud of this attitude as without it this is how death marches happen. Having a realist approach to the ability of complex systems to be estimated accurately and for working software to be delivered was the greatest benefit of adopting agile.
One thing I took out of the experience was that agile will not get into a company from the bottom up. We tried and failed. Agile as a process did not work in Lifelock until the entire organization accepted it. I am left with the observation it has to be top down. Edward Yourdon claimed in his book about death marches that if you are in one then leave. It can probably be expanded further; if you are working somewhere that does not do agile, then you will be involved in a death march sooner or later. It might be wiser to leave and work somewhere else under an agile process.
The idea of service oriented architecture is alluring. It offers the promise of a very clean architectural separation of consumer and provider. The reality is that very few services are static and most are changing constantly with each sprint. While we may have decoupled the Front End and Middleware by adopting this approach, we ended up tightly coupling the build environment through maven to handle the constant change and volatility. We built all our library jars, EJB modules, ears and front end wars as part of the same compile and artifact generation process.
Everyone agrees that versioning a WSDL and its services is a good idea, but there is no really good way to do it. You can use UDDI but that is overly complex and another piece of functionality in the service bus that you cannot unit test. You also have to convince your customers for the web service to go look up a registry first. Considering many customers and partners build the XML as a string with tokenization and then send it over HTTP rather than build a web service client application, it is far too complex for most.
The other mechanism is to version through the namespace or the package name. With annotations in Java you can usually do this quickly and effectively with @WebService attributes. This puts the generated client code under a new package name, however you are left supporting backwards compatible methods and legacy code in your application. I can recall integrating with Netsuite many years ago and this was how they dealt with the versioning issues.
One possibility we tried was adding a version as an argument into the webmethod. So you could specify like; SomeService#get(id, version) where the version can be iterated. This still leaves you stuck with the return contract or XSD being the same no matter what the version. Usually you need versioning because your XSDs are either asking for more information or returning more information, so it makes this kind of versioning limited in usefulness.
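A small sketch of why this approach runs out of room. The names here are hypothetical (MemberRecord and MemberLookupService are not Lifelock services), the data is stubbed, and the web service annotations are left out so the example stands alone. The point is that every version must share one return type, so anything added in a later version has to be an optional field on the same contract.

```java
// Hypothetical return contract: one shape shared by every version.
class MemberRecord {
    final String id;
    final String name;
    final String email; // only populated from version 2 onward; null otherwise

    MemberRecord(String id, String name, String email) {
        this.id = id;
        this.name = name;
        this.email = email;
    }
}

// Hypothetical service taking the version as an argument, in the style of SomeService#get(id, version).
class MemberLookupService {
    static final int LATEST_VERSION = 2;

    MemberRecord get(String id, int version) {
        if (version < 1 || version > LATEST_VERSION) {
            throw new IllegalArgumentException("Unsupported version: " + version);
        }
        String name = "Member " + id; // stubbed data for the sketch
        // Version 2 returns more information, but only through the same type.
        String email = version >= 2 ? id + "@example.com" : null;
        return new MemberRecord(id, name, email);
    }
}
```

A version 1 client still receives the email element in its contract, just empty, which is the limitation described above.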
Another possibility is to do what Salesforce has done and make an abstract webservice object; SObject that can be interacted with. The interface to the SObject does not change but you lose some of the advantages of the SOAP protocol and generating clients. Namely that the objects you are integrating with are strongly typed. With the SObject style of structure you end up passing in a lot of strings and requesting a lot of string names. I found that the SObject format required you to have the SOQL Explorer up all the time so you could correctly identify what you needed.
With the Salesforce type approach complexity gets handed off to the developers at either end and the compiled nature of the generated client code is not as useful. You also end up with public static final strings throughout your code that match some magic string in the system which will return the appropriate type or functionality. One thing you cannot guarantee is the technical expertise of your partners. Often there is one person working in Perl, or PHP, trying to make sense of "this WSDL junk" and ending up just using strings and regular expressions to build the request and response packets.
Static or unchanging WSDLs are sufficient as long as your business model does not change. Contracts between customers and partners are always a reflection of the current business model and not all the volatile parts of that business model can be abstracted away. I can recall being involved in a government job for a state highway system. The project included numerous deliverables of designs, testing procedures etc before any coding was done. That department could do that as the business model for providing highway services has been pretty static for the last thirty years or so even with new technology.
Not all problems are technological in origin. One way of dealing with customers' or partners' demands for their business case to be different, which can cause one WSDL to split into twenty different versions, is just to resist it. We had one project where the Product Owner didn't allow the different partners to put their own permutation on what they wanted. What we provided was a uniform series of web service contracts across all partners. This is a valid technique for managing WSDL versioning: only have one version and have it sufficiently abstracted that it covers all cases for the business need. This requires a lot of people skills rather than technological skills.
If you do expose a WSDL to a customer or partner there are two things you have to do: first, generate the client code for the customer; second, write a user guide to remove any ambiguity about how to interact with that generated client code.
We found that when a WSDL was palmed off on us there was no guarantee it would compile. Most companies had an example or two, but they were in Ant and for an older WSDL, or the example was in PHP or some other scripting language. In one case, a third party we were looking to integrate with had examples on Github, but I could not get it to generate a client jar cleanly with the maven plugins for axis, axis2 or JAX-WS. In the end, I had to use the maven ant plugin to generate the code in ant, and then compile it over in maven.
At that time, myself and another developer were trying out competitors for a particular integration. Of the two companies one supplied a pre-generated client jar, while the other just a WSDL. Another engineer had completed a prototype integration of the company that supplied the generated client jar, while I was still struggling to generate a client jar for the other company. The irony was that the company which supplied us a WSDL only probably had a better API but we had to get through the frustration of generating the client jar before we even got to that stage. It is just easier to generate the client jar for your future customers and partners and avoid that initial impression altogether. It is not hard to make generating client jars part of the build and artifact process.
The second thing that is necessary is a User Guide. This should include what every field is and what every field does in explicit detail. Some of the best documentation you will come across is from Acquiring Banks who expose APIs to auth, capture and refund money. When money is involved nobody likes ambiguity or confusion. If you get even a penny out, people get mad. Consequently the Acquiring Bank APIs have excellent documentation. As part of the artifact process it is a good idea to make the User Guides themselves part of the artifact and created each time with the corresponding client jar.
Given a choice, and knowing what I know now, would I use SOAP/XML or REST/JSON? I have to say that JSON is superior as a document transport format. Using SOAP requires using all the tools that go with it, such as creating client jars and service factories. JSON is quickly readable and, if the content structure is static, can be converted quickly into an intuitive POJO with libraries like Gson or Jackson.
SOAP has a lot of overhead to it and is restrictive in how you can interact with it whereas JSON is far more flexible. For instance when we exposed the meta-data about the artifacts that were being deployed we chose JSON because it was easy to query and parse with bash scripts, python scripts and java code. With SOAP you have to have the client, then the service factory, then the module to return the data structure which is more complexity than a bash script cares for. You could just get the SOAP XML document directly via HTTP I suppose, but, why bother when JSON doesn't require the same overhead to interact with it.
Courtesy of Java annotations and the J2EE spec it was simple to create a webservice but we still had to generate the JAX-WS java class files and bundle them into the WAR file so they could be used in the EAR file. In our build process this was a weak link. It took a long time and we often had false positives of failed builds when the JAX-WS plugin got confused over the /var/tmp directory.
Interacting with JSON is also less sensitive to changes in the contract. Depending on what was calling the WSDL, if there had been a change in the WSDL contract the client code would often refuse to start and fail as a runtime issue. Deploying the wrong front end code to an environment which had an incompatible middletier would often give confusing errors. Another issue was that our functional tests used the same WSDLs that the front end did, and you could not run functional tests against a middletier with WSDLs of a different contract. Again JSON has it over SOAP/XML in this area, but that advantage can be negated if you generate client code from WADLs.
REST is no panacea though. It has the same versioning issues as SOAP does. It appears that most people version through the URI, which is not much different to SOAP being versioned through the namespace. Neither is a particularly good solution in that instance but they obviously work well enough. REST is more a convention, so you can guess what the URI will be for a CRUD operation. SOAP is more laborious in this area but there is no real clear cut winner in those situations.
Ant is definitely more flexible than maven, but for a build tool and technology maven is far superior. The simple way that maven works out of the box with Nexus is also really nice, not just for engineering but also for service delivery.
We mavenized our complete system from ant over a period of about a month. It was mainly myself doing it though with help from others. We managed to slide that new build system in with minimal disruption. We found most of the problematic ears in the dev integration environments. Since we were doing skinny ears occasionally we would find missing jars when running functional tests over them that would lead to a runtime issue.
We could use maven plugins out of the box but we ended up having to modify several plugins to get the artifacts to come out the way we wanted them. Fortunately the plugins were opensource so we could modify them to our idiosyncratic needs and build them out of Jenkins when we needed another change in them. There was only one plugin we created that did not exist in the maven plugin ecosystem. This was the OSB deployment plugin which we ported across from the Ant task we had created. Most other plugins we used were readily available and did not need any modification.
We used the skinny ear methodology of building the ears. All the dependencies were marked as provided in the parent poms' dependencyManagement. The only dependencies marked as compile were in the EAR's pom.
We had multiple levels of parent poms. The topmost parent defined the nexus location, the scm plugin, and universal plugins and reporting such as cobertura and pmd, as well as properties that were relevant to all the projects that inherited from it. The front end and middletier had their own poms, which was where the dependencyManagement was defined. The wars and ears had their own poms that inherited from these and added functionality specific to their needs.
We used the scm subversion revision output to populate the Weblogic-Application-Version in the manifest. Later on we added the maven version and the svn revision number together to make a unique string for the Weblogic-Application-Version. This helped middleware engineering and service delivery identify what was deployed where.
The manifest became where we compiled into the artifacts all the meta-data about where the artifact came from, which branch, which sprint, what was the last revision number and what version of supporting jars and modules were compiled into the jar. The maven properties were used to populate this information. We also exposed the manifest information through the ear as json so we could interrogate the ear and know what it was. We had a Jenkins job which queried for this json and ran through all our environments, identifying which ears were deployed where.
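A minimal sketch of the idea, using only the JDK's java.util.jar classes. Weblogic-Application-Version is the manifest entry described above; the other attribute names, the "-r" separator between the maven version and the svn revision, and the class name are assumptions made for the illustration, not what our build actually produced.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: names other than Weblogic-Application-Version are assumed.
final class ManifestMetadata {

    // Builds a manifest the way a build might, combining the maven version
    // and the svn revision into one unique application version string.
    static Manifest build(String mavenVersion, String svnRevision, String branch) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version has to be present for the manifest to be written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Weblogic-Application-Version", mavenVersion + "-r" + svnRevision);
        main.putValue("Build-Branch", branch);        // assumed attribute name
        main.putValue("Build-Revision", svnRevision); // assumed attribute name
        return manifest;
    }

    // Renders the main attributes as a flat JSON object with sorted keys.
    static String toJson(Manifest manifest) {
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<Object, Object> e : manifest.getMainAttributes().entrySet()) {
            sorted.put(e.getKey().toString(), e.getValue().toString());
        }
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Escapes only backslashes and double quotes; enough for simple build values.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A servlet inside the ear could return ManifestMetadata.toJson(...) for its own manifest, and a scheduled job could then poll that URL in every environment to report what is deployed where.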
As we lumped more and more into our build branches the times got slower and slower for a branch to build. By 2012 we had the automation building the front end and middletier. We also had a project that was bringing our batch into the one build process as well.
The main reason for doing this was so we could build the supporting jars with the same project.version into the different projects that the front end, middletier and batch needed. This took the guess work out of it for service delivery. We did this as some of our jars were extremely volatile and were always having new work done in them.
We discovered that maven had a switch where it would work out what could be compiled in parallel on a multicore system. Since we had so many utility jars and ejb modules we were able to take advantage of this switch. Our compile times dropped from twenty to thirty minutes, depending on how heavily the jenkins slaves were being hit, down to ten minutes.
With the addition of batch joining the others we will probably need to make our build script more dynamic and only build the parts of the system that had changes made to them. Currently it builds everything, every time. At Lifelock it is engineering that does all the scripting to support this kind of work. It is also engineering that checks all the continuous integration jobs and spends a lot of time in Jenkins.
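The switch was presumably Maven 3's -T (--threads) option. The sketch below is not Maven's scheduler; it only illustrates the underlying idea of grouping modules into waves where everything in a wave depends solely on earlier waves and can therefore be compiled at the same time. The class name and the module names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: this is not how Maven itself is implemented.
final class BuildWaves {

    // Groups modules into waves. Every module in a wave depends only on
    // modules from earlier waves, so a wave can be compiled in parallel.
    // Every dependency must itself be a key in the map.
    static List<List<String>> plan(Map<String, Set<String>> dependenciesByModule) {
        Set<String> remaining = new TreeSet<>(dependenciesByModule.keySet());
        Set<String> built = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> wave = new ArrayList<>();
            for (String module : remaining) {
                if (built.containsAll(dependenciesByModule.get(module))) {
                    wave.add(module);
                }
            }
            if (wave.isEmpty()) {
                throw new IllegalStateException("Cycle or unknown dependency among " + remaining);
            }
            built.addAll(wave);
            remaining.removeAll(wave);
            waves.add(wave);
        }
        return waves;
    }
}
```

With one utility jar, two ejb modules that depend on it and an ear that depends on both, the plan has three waves, and the two ejb modules share the middle one. The more utility jars and ejb modules sit side by side in a wave, the more a multicore build machine helps.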
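One way such a dynamic build could decide what to rebuild is sketched here, assuming the module dependency graph is already known. Given the modules touched by a commit, it walks the graph to collect everything downstream of them. This is a hypothetical helper, not something we had in place, and the module names are invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for an incremental build script.
final class AffectedModules {

    // Returns the changed modules plus everything that depends on them,
    // directly or transitively. Only these would need rebuilding.
    static Set<String> toRebuild(Map<String, Set<String>> dependenciesByModule,
                                 Set<String> changed) {
        Set<String> affected = new TreeSet<>(changed);
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (Map.Entry<String, Set<String>> e : dependenciesByModule.entrySet()) {
                // add() returns false when the module was already collected.
                if (e.getValue().contains(current) && affected.add(e.getKey())) {
                    queue.add(e.getKey());
                }
            }
        }
        return affected;
    }
}
```

A change to a volatile utility jar would still fan out to most of the system, so the saving comes mainly from changes made near the leaves, such as a single war or ear.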
My morning routine was checking over the continuous integration jobs and sending out reminders if any branches were in a broken state. Unfortunately the JAX-WS plugin had issues with clearing out the /var/tmp directory on the Jenkins slaves and on local builds on our macs. We got a lot of false positives that way where a branch broke not because of the code, but because of how a plugin interacted with the build machine.
Engineering worked heavily with infrastructure to make the build process for servers automated as well. The infrastructure guys wanted the ability to deploy the applications, such as the wars and ears, as rpms when they built their servers out of Satellite.
There is a maven rpm plugin and we got it working to the point where we could build rpms. Unfortunately I left Lifelock before that work was completed and I never got to see the rpms being pulled down from Nexus and deployed on a fresh virtual machine. One of the issues with Weblogic is that it is 1990s technology. Tomcat is far more robust and flexible. For instance when we tried rpm'ing our artifacts we did so with the war application first.
The Weblogic container does everything through mbeans, which require it to be running for any change of state to be recognized. Whereas the war rpms and their configuration rpms can just be pulled down and when tomcat is started up, voilà, it all works. Weblogic is not simple. Where do you dump the ear rpm? Worse, Weblogic configuration files get changed when you start up Weblogic. Now that Tomee exists, I don't see any real reason to use a traditional J2EE container like weblogic, websphere, glassfish or jboss.
In J2EE the local interfaces use the same JVM. This leads to putting all your ears into the one admin server rather than splitting up the different containers by usage. We mainly used remote interfaces for functional testing, some client access by the batch applications and for talking across ears.
One problem we never really solved was that the remote interfaces got versioned because we used the Weblogic-Application-Version entry in the manifest of the EAR. We did this for the middleware engineering group and service delivery. Middleware wanted to be able to do hot swaps back and forth in production and service delivery liked being able to look into the console and see what had been deployed.
From what we could work out, the Weblogic-Application-Version put that version into the JNDI name for the remote interface. So if an external ear was referencing a versioned JNDI lookup for the remote interface and that ear was no longer available then the reference would go stale and that ear would start throwing runtime errors.
Webservices between ears are probably safer for that reason, but putting an @EJB annotation with a mappedName on a field of the remote interface type was so much easier. We ended up getting around it by bouncing the managed servers after the ears had been deployed so they all came up together and attached to the correct remote interfaces.
One mechanism we came up with to get around this was pluggable EJB Modules. We started putting small amounts of work into an EJB module and using maven to pull it into an ear as an internal module so that we could use the local interfaces rather than having to use remote ones. One benefit we got from this approach was being able to share these EJB modules across ears.
Some batch jobs attached as clients to the remote interfaces exposed in the middleware. Spring's injection mechanism made it easy to wire up a remote interface into a batch job even if it was in XML. Lifelock's production middleware systems were not high performers when it came to throughput. Batch relies on getting as much data through as quickly as possible, so only operations which did not impact the middleware system heavily were done via remote lookup.
A major consumer of the remote interfaces was the functional tests. We used these to interrogate the state of the system after a normal business function had been performed. Our front end systems interacted with our middleware through webservices, so using remote interfaces meant that the methods which pulled back the state of the system were hidden from the front end applications. Their contract with the system was through the exposed webservices only.
The functional tests also comprised a lot of tools. Often we bled over from test to tool as it was easy to correct or remediate the system being in a state that was causing issues. A common one was when the data between Salesforce and the main Lifelock database was out of sync. We had tools based around remote interfaces interacting with the system that could kick off the events that would correct the data mismatch. We ended up putting little UIs over these kinds of tools so others could use them.
The enterprise archive file is a glorified zip file; same as the jar and war file. The EAR contains jar files that have session, messaging and entity beans as well as normal library jar files and war files. We tended to split ears by singular functionality. While some ears were untouched once that functionality got into production, others were touched in every sprint by just about every project. Volatility was not uniform across the ears.
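Because the ear is just a zip, the plain java.util.zip API is enough to look inside one. The sketch below writes a tiny stand-in archive in memory and reads the entry names back; the entry names used with it are made up for the illustration and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; a real ear would be read from disk instead.
final class EarListing {

    // Writes a tiny archive with empty entries, standing in for a real ear.
    static byte[] fakeEar(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Reads the archive back with the plain zip API; no J2EE classes needed.
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }
}
```

The same loop works on a jar or a war, which is handy when you need to check what a build actually packaged.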
Originally the ears were built with ant but we later migrated to maven to build them. We adopted the skinny ear paradigm of building ears. All our dependencies were marked as provided in the parent poms and only in the ear pom were the necessary dependencies marked as compile. This convention was not always followed and sometimes clashing jars would end up in the APP-INF/lib.
Maven always carries the problem that even when you are careful, jars that you don't want will end up in the classpath of the ear or war file. Fortunately you can exclude sub-dependencies but it can be a painful and frustrating series of steps to isolate the jar that is causing the ear not to deploy or run. SLF4J is a good example of this. Marking dependencies as compile can lead to conflicting logging libraries being packaged into APP-INF/lib.
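A check like the following could be run over an ear's entry names to catch one kind of clash early: the same artifact packaged into APP-INF/lib at two different versions. It is a sketch under the assumption that jars follow the usual artifactId-version.jar naming, and it will not spot two different artifacts that conflict with each other, such as competing logging bindings. The class name is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical build-time check, not something maven provides out of the box.
final class LibClashFinder {
    private static final String LIB_DIR = "APP-INF/lib/";
    // Assumes artifactId-version.jar naming with the version starting with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("(.+?)-(\\d[\\w.\\-]*)\\.jar");

    // Maps each artifact found at more than one version to its versions.
    static Map<String, Set<String>> clashes(List<String> entryNames) {
        Map<String, Set<String>> versionsByArtifact = new TreeMap<>();
        for (String entry : entryNames) {
            if (!entry.startsWith(LIB_DIR)) {
                continue;
            }
            Matcher m = JAR_NAME.matcher(entry.substring(LIB_DIR.length()));
            if (m.matches()) {
                versionsByArtifact.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        // Keep only artifacts that really do appear at several versions.
        versionsByArtifact.values().removeIf(versions -> versions.size() < 2);
        return versionsByArtifact;
    }
}
```

Failing the build when the returned map is not empty is cheaper than finding out at deployment time that the ear will not start.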
Our client jars that we used for internal and external consumption of WSDLs were in a mix of axis, axis2 and JAX-WS. This depended on who did the generation of the client code and if the WSDL client code was generated first in axis2 or JAX-WS. Sometimes getting the namespaces lined up can be frustrating and it is a relief to get started. Usually following a convention for what mechanism generates the client jar is forgotten.
We had one issue where two client jars were used in the same ear. One was from an axis generated client and the other from an axis2 generated client. The ear compiled and was packaged without issues. It also ran happily on the admin servers, but refused to deploy on a cluster. We managed to isolate it to a clash in the APP-INF/lib between the axis and axis2 dependency jars. We were careful in the future about this kind of mixing.
The simplest structure for an ear is one EJB module, one WAR file and maybe a client jar with the remote interfaces and the data transfer objects. When we started migrating older ears and placing new functionality in them we used multiple WAR files inside the ear to expose new functionality while keeping the old functionality present. We could have versioned the webservices in the old WAR context, but some of the legacy code was pretty poor and was designed for issues that we no longer had. It was easier to migrate the functionality from one WAR file to another inside the ear.
We started with one EJB module in each ear, but by the time we started splitting more and more functionality out into pluggable EJB modules the ears contained multiple EJB modules. For instance one ear that did a lot of heavy lifting had specific EJB modules to support smaller chunks of functionality such as encryption, configuration and taxation.
As we spent more and more time with the ear structure we started using its ability to support multiple jars, multiple EJB modules and multiple wars. Using maven to build the ear made this approach quite simple and without quality issues.
When Lifelock decided to do the Renaissance project they essentially chose an Oracle stack. The middleware was Weblogic and the back end system was an Oracle database. The front end was done in .Net which seemed odd. Normally it is a good idea to make your whole stack the same technology so engineers can move up and down the stack as needed.
The J2EE container came out of the 1990s era of technology, where you bought several multi-million dollar Sun servers with sixteen CPUs and tons of memory. To make the most of having one big server, the idea was you would run multiple JVMs on this big iron which supported admin and managed servers. This was done for resiliency: lose one JVM and the services the machine was exposing did not go down.
Nowadays you don't need multiple JVMs on one big server as vendors like VMWare made it easy to make multiple virtual machines that mimicked an entire server. So you could spool up multiple VMs and put a single JVM on each. Even better, if the managed servers were running hot, you just gave them more CPU and memory from the VM configuration user interface.
With cloud infrastructure the VM concept has been taken further and horizontal elasticity is being done automatically and without an admin having to go in and change things. Even more amazing, infrastructure is being put behind an API with the likes of JClouds and the different toolkits that are coming from Amazon. Infrastructure is now a software engineering problem and consequently has to go into continuous integration, have unit tests, have functional tests, etc.
J2EE was also a response to CORBA and DCOM. These were distributed systems designed for multiple terminals interacting with a system that spanned multiple data centers. I worked on a system that was grounded in CORBA. It had been designed with user interaction being done through a Java User Interface; if I remember correctly it was AWT.
The internet had overtaken this system and a browser based user interface was put over the top of it. The CORBA IDLs acted as stubs - same as J2EE's remote interfaces - and were how the war application in the tomcat container interacted with the CORBA system.
J2EE is a distributed system. The managed servers do not have to be housed locally with the admin server and using remote interfaces the ears and clients can talk with each other across a contiguous network.
The main issue is that J2EE has come out of 1990s technology and been overtaken to an extent by the internet and more recently cloud technology. The Tomcat container is simpler, and since it does not have to deal with mirroring ears across managed servers, or bean pools, it is consequently more robust. In my opinion tomcat is easier to configure and manage as well.
To cap it off Tomee is a tomcat container which supports CDI, JMS and JPA which are amazingly productive technologies. Given the choice, I would not use a J2EE container such as Weblogic. Tomcat is a simpler and more robust solution. That is without taking the service bus into account. Once that is factored in, tomcat has it all over a J2EE stack like Weblogic.
Our problems with Weblogic were legion. When I first arrived at Lifelock we all had windows machines. They were crappy and under powered. It would take a frustrating forty minutes to log in because of all the crap that was on them. I never used to turn my machine off for that reason.
Starting a Weblogic instance on Windows was another exercise in frustration. The Weblogic instance on localhost would take twenty minutes to start up. Then you would make a code change and deploy again, and wait, and then do it again, and wait again. Once we got macs, startup was just a couple of minutes on the Mac Pros, which was like manna from heaven. The Macbook Pro folks had to beg for more RAM before they got those kinds of startup times.
We had multiple environments which covered the standard needs; DEV integration, QA and Stage in addition to production. We had six environments in DEV which were a mix of clustered and single managed servers. The QA environments were all on a single managed server and stage mimicked production.
Weblogic is a bear to configure. We moved code with great speed through our system. We often had six to ten sprints going at once and every two weeks a sprint would end, merge into trunk, and then go out into production. We were often adding new configuration changes including queues, topics, supporting libs that needed to be in the Weblogic classpath, etc. Our numerous Weblogic environments were always out of sync and constantly had performance and configuration issues.
The real problem was that we had too many non production environments and too few middleware engineers to support the number we had. Often we would be down to one middleware engineer who we would then burn out by overworking them, so they would leave and get a new job that was less stressful. For instance Pearson had something like nine Weblogic engineers in Phoenix, while we would have one. We got a bad name for burning middleware engineers out as well, which made it hard to recruit. It was a vicious cycle.
We kept trying to slim down the number of environments to take some of the load off the middleware engineers that were supporting all these environments. We collapsed our backends so that all of DEV shared the same backend and all of QA and Stage shared another backend. This removed a lot of the difficulties in engineering and infrastructure managing multiple backend systems and trying to keep them in sync.
The design for the environments between 2010 and 2012 came through a design document I did. The goal of that document was to slim down the number of environments and back end systems. Consequently the design document only had DEV and STAGE. The idea was that DEV and QA would share the dev environments, working together during sprints in a DEV environment and doing regressions in stage once it had been merged to trunk.
If I was to do that again, I would not call those environments DEV. Language is important and saying DEV gave those environments the appearance of being owned by engineering. I would have called them something like the SPRINT Environments which has a connotation of shared ownership.
One of the advantages of the DEV environments was that engineering helped manage them, so they tended to be a bit more stable than the QA environments which were reliant on the overworked middleware engineers to fix any issues. The code in the DEV environments was usually more robust as engineering could deploy there whenever they wanted, so bugs were always fixed there first and didn't require the Configuration Approval Form [CAF] to get the code up to the QA or STAGE environments.
The environment design did not include any QA environments but the Director of QA kept pushing for a Quick Test or QT environment. I would convince everyone that we didn't need it, we could do that work in DEV or STAGE, but I would skip a meeting, and then the QT environment would go back in. Which was frustrating. I talked people out of it three times but ultimately it went in. I skip a lot of meetings and am not a particularly patient meeting attendee so the QT environments became part of the structure.
The QT environments became the bane of Lifelock's middleware engineering and service delivery groups. They were rushed out, poorly configured - for instance we found pointbase running on them - and they were a cause of constant problems as they were not used constantly. They were a nightmare, and worse, there was not just one QT environment, there became five of them.
It was a bad decision and we were never able to get it changed. Once something like that gets in, it is hard to remove as the organization and processes wrap around it and get stuck in doing things that way. Whenever anyone asked, or even when they didn't, I would say, "We should delete all the QT environments." We didn't though.
This is also why simpler and fewer is better. When the design of environments came up again in 2012 I deleted QT and deleted DEV as well. That meant we only had to manage one set of non-production back ends, and one series of non-production weblogic and tomcat servers. It would have made it simpler for everyone and let the middleware engineering folks concentrate on production which is what actually made us money.
One of the problems with Engineers is that we love writing software. So we assume this is the most important part of the whole process. It isn't true though. Software engineering is overhead and any code we produce that is not in production is money poorly spent. In lean development it represents waste and risk.
The environment design I did in 2012 made every environment releasable to production so we did not have the bottleneck of stage. The environments would be close enough to production in structure that we could release from them in confidence. This was becoming possible because of the server automation that was coming out of Infrastructure.
This environment design was not adopted. In fact Lifelock kept plugging along with the existing legacy environment design, with the change that infrastructure was building servers and starting the process of updating Weblogic, Red Hat and the JVM to the latest versions.
Weblogic being a struggle to configure meant that we all had unique installs on our local machines. The Weblogic configuration also wrote the install path right through all the configuration files, so thoroughly that running a perl script to replace the path was not enough to let you use someone else's Weblogic container and domains.
We decided to check the Weblogic container and a working, pre-configured domain into subversion. It was about 1.3 GB and was for mac users only, which isolated our couple of windows users. The container was put under /Library so that the path was the same for everyone rather than being under a user's home directory.
We had WLST scripts to configure the Weblogic container. Weblogic had to be running to accept the mbean changes that the WLST scripts were performing. The idea of Jython is a good one; all the best things of python with all the best things of java. Python has a lovely terse syntax that is eminently readable. Java has a massive number of libraries outside of the core java system.
Eclipse doesn't support Jython all that well. It certainly does not support WLST with anything approaching consistency. To get a WLST project in eclipse you have to create a Dynamic Web Project and then add the WLST facet. WTF. It is like the tools for the OSB in eclipse; half-arsed.
I wrote a WLST script back in 2010 to put in the datasources and queues. This was expanded by succeeding middleware engineers to include data stores and other queue configuration. We did some other WLST scripts such as flushing queues and listing all the queues in a container. It was Engineering doing those though, not Infrastructure. We also hit issues getting the WLST scripts into Jenkins.
Weblogic is not fun to configure, and WLST is not fun to program for. We toyed with using a maven plugin to configure queues as well. If I was to do it again, I would make Jenkins jobs in Java that used the MBeanServer to get into the Weblogic system and change its configuration.
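To show the shape of what I mean, here is a sketch that changes a configuration attribute through the standard javax.management API. It is a stand-in only: QueueConfig and its ObjectName are hypothetical and are not Weblogic MBeans, and it uses the local platform MBeanServer where a real Jenkins job would connect to the remote Weblogic server and use Weblogic's own object names.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Stand-in sketch: nothing here is a real Weblogic MBean or object name.
final class MBeanConfigSketch {

    // Standard MBean contract: the interface name is the class name plus "MBean".
    public interface QueueConfigMBean {
        int getMaxMessages();
        void setMaxMessages(int maxMessages);
    }

    public static final class QueueConfig implements QueueConfigMBean {
        private volatile int maxMessages = 100;

        @Override
        public int getMaxMessages() {
            return maxMessages;
        }

        @Override
        public void setMaxMessages(int maxMessages) {
            this.maxMessages = maxMessages;
        }
    }

    // Registers the stand-in bean, changes one attribute through the
    // MBeanServer API, reads it back, then unregisters the bean again.
    static int reconfigure(int newMaxMessages) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName("example.sketch:type=QueueConfig,name=ordersQueue");
            server.registerMBean(new QueueConfig(), name);
            try {
                server.setAttribute(name, new Attribute("MaxMessages", newMaxMessages));
                return (Integer) server.getAttribute(name, "MaxMessages");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException ex) {
            throw new IllegalStateException("MBean operation failed", ex);
        }
    }
}
```

The appeal over WLST is that this is ordinary Java that can be compiled, unit tested and run from the same build tooling as everything else.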
Oracle Service Bus
As Lifelock expanded in 2009 contractors were brought in to fill positions in engineering and infrastructure. There were factions amongst the contractors as well as among the middle managers. One dividing line was those that liked the Oracle Service Bus [OSB] and those that did not. The idea behind an Enterprise Service Bus is that you can hide all the internal and third party systems that a modern business uses behind one set of unified interfaces. The benefits are supposed to be loose coupling, plus the ability to aggregate or transform data flowing across the service bus.
The service bus is heavily based on XML from the SOAP protocol and the concept that any XML flowing across the bus can be transformed by XSLT into something else. SOAP dates back to Microsoft embracing XML in the late 1990s and early aughts when it was the next big technology. One problem during that time period was that Microsoft machines were highly prone to being compromised by viruses and trojans when exposed to anything from the internet so Network engineers would close off every port they could; except for port 80. Microsoft's response was to make port 80 the new default remote method invocation port and send XML SOAP packets through it.
The promise of the Service Bus always sounds great. We interviewed one engineer at Lifelock with whom I discussed our quality, development and support problems with the Service Bus, and he told me how awesome it was because you can call PL/SQL, transform the result with XSLT and then send it out as SOAP. Which sounds insanely simple. But it never is. When you actually have to do a project with a Service Bus, it is insanely slow, defies introspection and has poor quality and artifact output. The killer is that you can't unit test a Service Bus and as a result you cannot guarantee quality.
Another issue with the Service Bus is that Java Engineers always get stuck with it. The number of genuine Service Bus Engineers is insanely small and they tend to be contractors that you hire from Oracle etc. You rarely see positions being advertised for Service Bus engineers; instead it is always thrown on the software engineers who have been trained to read Java in IDEs. The crappy tools for reading XML that come with Oracle Service Bus are a nightmare in comparison. It is more productive, and more efficient, in my opinion to have Java engineers working in Java rather than working in XML and a Service Bus.
Aside from the inability to unit test, the next really disgusting problem is that you cannot automate the creation of an Oracle Service Bus artifact. The sbconfig.jar is the artifact which is deployed to a service bus instance. It has to be generated out of eclipse. The crappy mechanism for automating it is to put eclipse and Weblogic onto the Jenkins server and use them to create the sbconfig.jar on the continuous integration server. This is not a satisfactory solution. We also found that an sbconfig.jar generated out of windows differed in the files, folders and structure to one that was generated out of a mac. That is not acceptable and grounds for dismissing the OSB.
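A comparison of two archives along these lines can be scripted with the JDK's zip classes. The sketch below reports every entry that exists in only one archive or whose content differs; the entry names used with it are invented and say nothing about the real layout of an sbconfig.jar. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for comparing two builds of the same artifact.
final class ArchiveDiff {

    // Helper for trying it out: zips entry name -> text content in memory.
    static byte[] zip(Map<String, String> contentByName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> e : contentByName.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Maps each entry name to a CRC32 of its uncompressed content.
    static Map<String, Long> checksums(byte[] archive) {
        Map<String, Long> result = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                CRC32 crc = new CRC32();
                crc.update(in.readAllBytes()); // reads to the end of this entry only
                result.put(entry.getName(), crc.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result;
    }

    // Names that exist in only one archive or whose content differs.
    static Set<String> differences(byte[] left, byte[] right) {
        Map<String, Long> a = checksums(left);
        Map<String, Long> b = checksums(right);
        Set<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());
        names.removeIf(name -> Objects.equals(a.get(name), b.get(name)));
        return names;
    }
}
```

An artifact that cannot pass this check when built twice from the same revision on two machines is not something you can reason about in production.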
We had issues with the customization files as well. They had to be generated out of localhost which is not good for automation as it is dependent on a human pulling the file from a running OSB. We managed to at least automate the deployment of them through maven by using the localhost:7001 as a token. Despite automating the deployments of the OSB as much as we could we still occasionally hit production issues that could only be solved by going into the service bus console and eyeballing the setup.
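The token trick amounts to a string substitution at deployment time. A minimal sketch follows, assuming a hypothetical table of environments; the host names and the XML snippet used with it are invented for the example, and the real customization file format is not shown.

```java
import java.util.Map;

// Hypothetical helper; our actual maven plugin is not reproduced here.
final class CustomizationTokens {
    // The placeholder left in the file exported from a local OSB.
    static final String TOKEN = "localhost:7001";

    // Invented environment table; real host names would come from configuration.
    static final Map<String, String> HOST_BY_ENV = Map.of(
        "dev1", "dev1-osb.example.com:7001",
        "stage", "stage-osb.example.com:7001");

    // Swaps the localhost token for the target environment's host and port.
    static String forEnvironment(String customizationXml, String environment) {
        String target = HOST_BY_ENV.get(environment);
        if (target == null) {
            throw new IllegalArgumentException("Unknown environment: " + environment);
        }
        return customizationXml.replace(TOKEN, target);
    }
}
```

The weak point remains that the source file still has to be exported by hand from a running OSB before any of this can be automated.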
The architecture for the service bus in the technology stack meant that all the webservices went through the OSB then to the Weblogic System [WLS]. Creating web services in the WLS layer is a snap with EJB3. You throw some annotations on a java file, add the jaxws build step in the maven build and there they are. Because of the OSB's position we had to use it as a pass through for the majority of the services. Only a small number of services actually used service bus functionality like transformation or aggregation. The rest were just a straight pass through to what was in the WLS anyway. It took three years to remove the pass through services and have the front end systems hit the WLS directly.
The tools Oracle provides to work with the Service Bus are very poor. For windows users there is the Workshop For Windows which is a customized copy of Eclipse 3.3 with the OSB tool bundled in. Since most of our engineers were on macs we had separate versions of Eclipse 3.3 on our systems with the correct plugin installed. One of our engineers checked into subversion a customized copy of eclipse with the OSB plugin installed. A lot of the newer engineers ended up using this for OSB work when they could no longer avoid it.
Another issue we hit with the Service Bus was the problem of automated deployments. Originally the build system was ant based and we later mavenized the build process. There were no off the shelf ant or maven plugins that were able to deploy the sbconfig.jar and the customization file. It was a manual process from dev, to QA to Stage and production. Shudder. One of our talented young engineers ended up writing an ant task that used MBeans to deploy the sbconfig.jar and the customization file. He did it while we were receiving training on the new billing system. It was a good use of time. Automating deployments was more important. When we moved to maven I ported his ant task across to a maven plugin.
When we started the billing death march we tried to use the service bus to aggregate calls between our internal middleware and the billing system. We did this because Lifelock had invested in the service bus and this was a legitimate use of the service bus' capabilities. We quickly hit a brick wall in terms of productivity and tools. The process was slow and laborious.
We were also hampered by the service bus only being testable at runtime. We quickly got bogged down and were behind in the project. The VP of Engineering stated that it would be better if we integrated with the billing system through the Java layer. After some initial resistance it was obviously the correct decision as we could unit test the java code and have java engineers working in Java rather than in the service bus' alien tools. The service bus is not a fun technology to work with.
When I was at Shutterfly I was amazed I ever got through the interview process. We would say no to a potential candidate for the most trivial and arbitrary reasons. Lifelock was not much better. The engineers would find someone we liked and then we would get a no after others had interviewed that candidate and decided they didn't like an answer to some question. I thought it was all quite unfair so I tried to add some rigor to the process by getting candidates to unit test and interact with code rather than answering questions or scribbling on a whiteboard. I thought this approach more apt as most engineers spend most of their day in an IDE anyway.
We had some funny experiences. We were interviewing a .Net developer for sharepoint and while I have done some work in C# and .net I am not an expert. One of our engineers who was working on our billing system was a small, quiet and shy engineer who came from Hyderabad. She had a soft voice and was really sweet. I asked her to interview the candidate since I had no idea what I was asking in the .Net area. During the interview she would ask a question in a quiet, soft voice and then upon getting the answer would give no facial expression or body language that the answer was good or bad. She just went on to the next question in the same soft voice. I told her afterwards that I would not like to be interviewed by her as her technique was so brutal.
We also had two principal engineers in Tempe that would often tag team a candidate with questions. One of the senior engineers is sharp as a whip and can retain amazing and vivid detail in his head and repeat that detail back to you in a heartbeat. He was great if you forgot some complicated shortcut key in eclipse as he had them all memorized after a glance. He was a difficult interviewer as he knew a lot of detail in a lot of areas. Far more than I did.
Our other principal engineer was a big and intimidating guy with a shaved head. He had a deep booming Connecticut accent as well and tended to suck the air out of a room when he asked an interview question. The amusing part was he would say afterwards how he was being so nice to the person being interviewed, but he still managed to be intimidating.
I asked to be involved in interviews with the other groups as well. One of the skills we lacked in the other groups - such as QA, Infrastructure and Service Delivery - was engineering and scripting skills which are becoming more and more important. I would test candidates in these areas to see how closely they matched their resume. I always had problems to solve in Jenkins, so would ask a candidate to script something up that might help me in the issue I was having at the time. This was good for shaking out people who really could not script.
Unit testing served the same purpose with engineering candidates. It was good for seeing how comfortable a candidate is with code, for determining if they can code and how good their command of Java libraries and concepts is. Usually the unit tests I would have candidates do were based on existing production methods. I went through a period when I would find some code that was not covered and get the candidate to work on that, or I would delete an existing unit test and have the candidate unit test an existing method. It also offered me a good chance to hammer in that they were going to have to unit test. I think this was the most valuable part of our interviews.
This approach was also adopted by other groups. The Business Information [BI] group started having candidates do hands on SQL statements in our dev systems as part of the interview. BI has so many tools and user interfaces these days that reports can be dragged and dropped all over the place and people working in that area don't really need to know what a database looks like, or how tables have foreign keys. Since Lifelock was a startup, things weren't always that clean and database knowledge was necessary. So knowing your way around PL/SQL rather than how to point and click in Informatica was a necessity.
When I came to Lifelock the binaries were built out of eclipse and deployed manually. We sat down and started automating as much of the build and deployment process as we could. We wrote our own ant tasks to deploy ears and the OSB; once we had that we could deploy our artifacts as part of the continuous integration process. From that point on we did not have failed deployments due to the artifact itself. Which took the arbitrariness out of it. This is a lot of work though. More importantly, it is engineering work.
Everything is engineering now. Automation is an engineering problem because you have to be able to code. Some people can script, and they might be able to script well enough to run something from the command line on their machine, but automation means anyone can use it from anywhere - though normally through Jenkins - and it will achieve the same result. It takes an engineering mindset to do that.
Another reality of that is, the engineering mindset does not do anything manually twice. I used to make the comment that an engineer will do something manually once, the second time they will script it and the third time they use it they will put unit tests over it as that code is going to be around for a long time.
There was also a time at Lifelock when artifacts were put on a share drive and were taken from there to be deployed. It took a long time for the automation in deployment that engineering had been using since 2010 to make its way to other organizations. For multiple reasons, the main one being it wasn't really trusted as engineering made it and used it. When we added Nexus to the mix after the porting of the build technology to maven it made that process easier. We automated the creation of release branches and the pushing of artifacts to Nexus which made artifact management easier as well.
Infrastructure is now an engineering problem. When I was at JavaOne in 2011 I went to a talk by one of the JClouds developers who stated how he wanted to put everything to do with infrastructure behind an API. This means the code to create a server has to be checked into source control, it has to be part of continuous integration and if it is used all the time it has to have unit tests and functional tests to guarantee that it is working when it is needed.
Quality Assurance suffers from the same issue. It too is now an engineering problem. Engineering produced functional tests for the middleware, but not every front end sprint created selenium tests. We had a QA Engineer who had a Computer Science degree but due to the down economy in 2008 was unable to get an engineering job. He took up a job in QA and proceeded to automate everything he touched. The greatest benefit was that he created a testing framework using Selenium for the front end applications. Again, the engineering mindset is a huge advantage.
W. Edwards Deming argued that you should remove the need for inspection. In heavy industry such as manufacturing this is done by monitoring the process at all steps along the way using statistical analysis, or statistical process control. This shows when the system is not behaving as it should, because a measurement falls outside of a statistical bound.
In software engineering this problem has been made easier with functional testing as you get a binary result when a functional test is run: it passes or it fails. There is not the same noise in the system that there is in manufacturing. To satisfy the pass/fail outcome we needed to get our functional tests into the continuous integration process and deployment process. While we did get them into Bamboo and Jenkins, we never did so in such a way that they could replace inspection.
At Lifelock Service Delivery and Quality Assurance were under Infrastructure. I argued that those groups should be under the VP of Engineering as they perform engineering tasks. They should also be run the same as engineering, with all their work done through sprints with scrum masters and product owners until the frequency of release is daily, and even then they should remain in sprints the same as engineering does. Soon after I left there was an organization reshuffle and Service Delivery and Quality Assurance were moved under Engineering. It was a wise move. Everything is engineering now. I am sure one day manufacturing will go behind APIs as well and factories will face the same issues as infrastructure does.
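To make the idea of a statistical bound concrete, here is a minimal sketch in Python. It assumes the common convention of control limits set three standard deviations either side of the mean of a known-good baseline. The measurements and function names are made up for illustration and are not from any real process.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) bounds computed from in-control baseline samples."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(samples, limits):
    """Return the (index, value) pairs that fall outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical measurements, e.g. the width of a machined part in millimetres.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
limits = control_limits(baseline)

# New readings are compared against the limits; anything outside is flagged.
new_samples = [10.0, 10.1, 11.5, 9.9]
flagged = out_of_control(new_samples, limits)
print(flagged)
```

Only the third reading is flagged. The rest is treated as ordinary noise in the process, which is exactly the noise a pass/fail functional test does not have.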
I have to put a massive shout out to Jenkins. What a powerfully remarkable tool. We ran everything through it. Another engineer and I were at JavaOne in 2011 and were listening to a forum about how the Eclipse organization was using Hudson. Toward the end of the hour or so, the moderator asked, "Who is using Hudson/Jenkins for continuous integration?" and nearly everyone's hands went up. The moderator then asked, "Who is using Hudson/Jenkins for tasks outside of continuous integration?" and only about eight hands went up, of which two were mine and those of the other engineer I was with.
We used Jenkins for everything. Once we found out the Linux Team was automating their tasks through Jenkins, we ditched Bamboo quick smart, as Bamboo could not create jobs programmatically. This meant we couldn't create new continuous integration tasks from a script. It was manual in Bamboo which sucked. So we moved our automation across to Jenkins. We did our continuous integration in Jenkins, we also did our deployments to dev, qa, stage through Jenkins and when I left they were being done to production as well.
Engineering did all our quality reporting through Jenkins such as unit test reports, code coverage reports, performance reports, environment reports and deployment reports. We did numerous one off jobs as well, such as double checking billing status, repairing data, etc. Infrastructure used it heavily as well, Jenkins jobs kicked off server builds and provisioning, all manner of small jobs were run to provide reports on the infrastructure as well.
Probably the funniest story of that nature is that the Linux Team were required to hand in reports on what they had been working on every week at midday on Friday. Since all their work was done through JIRA, they added information in the tickets that was report related and one of the Linux Engineers wrote a script which queried JIRA, pulled the appropriate data out of the XML, bundled it up in a format their manager wanted, and then emailed it to their manager based on a Jenkins cron. Their manager ultimately wised up to what was going on and asked for them to automate the report he had to do to his boss as well. The Linux Engineers were an incorrigible lot.
So what makes a good engineer? Obviously code can get out into production and work sufficiently well that it solves a problem and can pass QA's testing but it can be what we call poor code or unmanageable code. It can also do that without unit testing, or sometimes even being built in continuous integration. If working in production is not a good determinant, what makes one engineer better than another?
Fearlessness is a good marker in my opinion. We had fearless engineers and we had helpless engineers too. Most people fall in the middle of that range. We had a legacy .Net application that no-one really liked working in. Some of the Java engineers including myself, had done some .Net work and we occasionally fixed bugs in this .Net application or if pushed did a feature. Most of the time we had specialist .Net contractors who worked in this application and were included in the sprints for any features to do with it.
This application had seen a lot of different employees and contractors work in it already by this stage and when we flicked our WSDLs over to the WLS from the OSB, it threw one of the .Net contractors for a loop. It was set up with a common library where the generated .Net code was housed. We explained that this client piece was not generated like it used to be as the WSDL had to be pointed to the new location and the bat script which generated these classes had to be updated to accommodate the WLS manner of presenting WSDLs.
Despite several of us pointing this out, and remarking that he was our .Net and Windows specialist and would have to solve this problem, he was adamant that the Java middleware was to blame and he spent an entire day emailing me the changes he had found in subversion where a service class had changed, and why this was my fault. One of our fearless engineers in Irvine had hit the same problem and, despite being a Java engineer, jumped in and wrote the bat script that could generate the client code correctly. He had gone out of his way to do it correctly, even when the specialist we had hired in this area had no idea.
That same fearless engineer fixed our WSDL compilation issues as part of continuous integration. Previously our functional testing was kind of a sore thumb as it was dependent on generating against an environment for the client code. This meant you had to have your branch up in an environment, and all deployed without issue before you could generate the client code for the functional tests.
He decided that sucked, so he went ahead and put all the WSDLs into a jar, shoved them up into Nexus as part of the continuous integration process, then pulled them down again as a dependency into the application that needed them, unpacked the dependency and then compiled the client code into the application. This meant that runtime issues now became compile time and improved how robust our applications and functional tests were. The point I am making is that he went ahead, made it happen and in doing so solved a couple of lingering issues we had stumbled over previously.
Dealing with helpless engineers could be frustrating. An Australian comedian taking off Chopper Read has a famous skit where he says, "Harden the ** up Australia". This is quite literally the answer to helpless engineers: they have to harden up. We had one engineer who, if she didn't get the answer she wanted from one person, would go to the next, and the next, and the next. We had one meeting where she had interrupted the day of six people which is not acceptable. That example was a case where putting your head down and working through it would have solved the problem.
The other form of helplessness I disliked was the "it's your fault" mechanism of getting out of work. During the end of the billing death march I would get an email daily from the VP of Engineering, "Has this bug been completed?", to which I would have to reply, "No, let me go check". This was in the .Net legacy application, and each day the contractor we had at that time would blame the middleware for the bug. I had functional tests over my work so I knew that wasn't the issue.
After a couple of days of explaining and getting the same email, I asked if I could borrow that person's machine for a while. I sat down and with him over my shoulder put in break points and worked out where the issue was in the .Net application. When I had fixed it I asked him to check in the fix. He was aware that he had been wrong and had the grace to apologize but it was frustrating as I had spent the previous couple of days explaining to him why it couldn't be the middleware.
Engineering treated infrastructure as customers. In Tempe, as we were on nearly all production issues, we would go to great lengths to debug any production issue and determine where the problem was. We had some finger pointers, but for the most part, our input and debugging was useful, appreciated and respected. It made getting things done easier with infrastructure in other avenues than if we had sat there and repeated the mantra, "I don't know, it isn't code, I don't know".
So what is it to be fearless? It is jumping in to get stuff done, because it is the right thing to do, and doing it no matter what the technology; Java, C#, Python, Perl, Bash etc. Engineers that do that and think like that are valuable. I am willing to pass on unit testing as being part of a valuable engineer as that is more the culture of the workplace and building that culture is a different problem to getting a raw engineer with talent who is willing to jump in anywhere.
Lifelock made several attempts to get in cheaper and less experienced engineers to do bug fixing and production support. It never worked, and it most likely never will. The people who are willing to work for that salary do not have the knowledge to cope, and they are left stumped in a complex J2EE environment with injection and JMock unit test cases. We have trouble getting senior engineers to write good code and understand our system, what chance does a production support engineer have?
The reality is you have to get good engineers and pay them well, you then have to let them do everything. Features, Quality, Production Support, Automation, Functional Tests, Release, Ops; everything. It is the only way you get it all done. You even have to give them root or admin privileges and let them maintain non production environments. Again, you just have to let them go and do everything. Some companies have embraced that with great success. For instance I am constantly jealous of what Etsy has been able to do.
You also need engineers to run engineers. If an engineer is running other engineers then there is no he said, she said. Both of you are in the code all day, in the source repository all day and you cannot pull the wool over the eyes of someone who is coding side by side with you. Specialist managers don't work with software. For starters they make technical decisions they have no business, right or legitimacy to make. They don't really know and they don't have the eye to see what is right and what is wrong.
Lifelock was constantly being audited for PII, PCI, ISO and we had Sarbanes Oxley coming down the pipe as well. Our compliance guy had the hardest job in the company. It didn't matter how competent he was, all it took was one person to be careless or even just unaware, and it meant our compliance guy had screwed up. No wonder he was so stressed. I can recall him jokingly describing his job as "running on bananas". Brutal. He had an accent like a New Jersey mobster despite growing up in Phoenix and a pretty strong no nonsense attitude. He loved the Tempe engineers as we put everything we knew, or thought of, straight into the wiki which meant anything we touched was documented in great detail.
One of the things that made compliance such a tough job is that many of the rules that lead to compliance are so open to interpretation by the individual auditor. We would have one audit where everything was cool and then the next one, nothing was. Our system hadn't changed and we had spent the prior year ensuring we were compliant etc. It is just so open to arbitrary interpretation that it makes a mess of things. We had part of the network shuffled around on us at the whim of an auditor and it caused havoc as we were told it had to be done 'now'. Which is not cool in complex production systems. We have processes to manage that level of change, but an auditor can just blow it out of the water based on interpretation.
Sarbanes Oxley is a good example of that. We knew it was coming up and we were going to have to become Sox compliant. I had dealt with that before and I have read the legislation as well. Much of what people think is in it is not actually there explicitly. For instance everyone 'knows' that Sarbanes Oxley says engineers cannot deploy to production, but no-one can point out where in the legislation that actually is.
I had the legislation on my desktop and when I got told that I was not allowed to do this, or to do that, I asked where it was in the legislation. It is open to interpretation and it does change from auditor to auditor and from what people think is in the legislation based on what they have heard or what they did at some other place they worked at. I think it is healthy to challenge each and every assumption because people use Sox as some absolute and saying, "Oh, you have to do that because of Sarbanes-Oxley" is an argument from authority. Make people prove it because 90% of the time they are wrong or don't know what they are talking about.
PCI [Payment Card Industry] is pretty straight forward. People know that leaking credit card or billing address information into logs, emails and unrestricted APIs is a bad thing. People are self governing when it comes to PCI compliance because they understand that credit card information escaping is a bad thing. Personally Identifiable Information [PII] is a bit different because it is dangerous when in a combination of information such as name, address and SSN or name, driver's license etc. It can look innocuous but in combination allows an intruder to triangulate a person's identity.
We had an InfoSec group which was dedicated to ensuring that no PCI or PII data was being published. For engineering the main mechanism where PCI and PII could be leaked was in the logs. The best mechanism to stop that was to explain to engineers that putting PCI and PII data in the logs was not acceptable. Engineers got that quickly and it was not an issue. Despite having an InfoSec group it also paid for engineering to keep an eye out in the logs to make sure that applications were not getting overly verbose in what was being printed into the logs.
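Keeping an eye on the logs can be partly scripted. The following is a rough sketch of that idea in Python, not what we actually ran. The two patterns are assumptions chosen for illustration: a US SSN written with dashes, and a run of 13 to 16 digits that might be a card number. The sample lines are invented and the card number is a well known test value.

```python
import re

# Illustrative patterns only. Real data will need broader and stricter rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def find_suspect_lines(lines):
    """Return (line_number, kind) for each line that looks like it leaks data."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        if SSN_PATTERN.search(line):
            suspects.append((number, "ssn"))
        elif CARD_PATTERN.search(line):
            suspects.append((number, "card"))
    return suspects


# Invented log lines; none of this is real member data.
sample = [
    "INFO  member 1042 updated billing preferences",
    "DEBUG lookup failed for ssn=123-45-6789",
    "DEBUG charging card 4111 1111 1111 1111",
]
suspects = find_suspect_lines(sample)
print(suspects)
```

A scan like this will throw false positives on any long number, so it is a prompt for a human to go and look rather than a replacement for telling engineers what must not be logged.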
During 2010 the cube farm had a reorganization where engineering was moved to the south of the building, room was made for infrastructure to be next to engineering and other changes. We had a .Net contractor that I had been working with whose cube was diagonally opposite mine. I needed a bug completed and could not find him in the new cube farm layout. I walked all around the third floor, asking people, but no-one knew. So I went up to the fourth floor and asked around as well with the same result. I ended up sending an email asking, "Which cube are you in now? I couldn't find you." The reply I got back was, "I am in India." It made me laugh. The globalized technology world is such that a cube re-org can put you in a different country. As it turns out he had a family emergency and flew back for that reason, it just happened to coincide with the cube layout change.
The reality is that a lot of the American technology industry is built on the backs of foreign labor with India and China doing most of the lifting. Lifelock would have to be the most multicultural technology group I have ever worked in - even by tech standards. There was a time when the engineering group only had two Americans in it. The rest of us were Australian, Indian, Chinese, Taiwanese, Mexican, French etc. I ended up learning how to say "This is bad" and "This is good" in Spanish, Hindi, Marathi, Telugu, Tamil, Malayalam, Mandarin and Cantonese. It always gave people a laugh as my broad yet fading Australian accent mangled the pronunciation of those words in a way only the Australian accent can.
I am also convinced that Masters Programs by American tertiary institutions are an extortion mechanism for foreigners to get access to the US labor market. We interviewed a lot of people with Masters degrees and they were no guarantee of quality or expertise in the candidate. One of our engineers was faced with the heartbreaking problem of her husband having to go back to India because her visa could no longer keep him in the US. As a result she was going back to college to do her Masters as it was easier to get a green card and have her husband with her in America. That is horrible and she was justifiably upset about it.
We had numerous other visa issues as well. One of the strangest was some bizarre rule where the person above you in title could not be on a better visa than those with a lesser title. So we lost people because they would go to a different job that gave them better remuneration and visa status. There were also issues where someone would move on, but couldn't take the job they wanted to do, were better qualified for, and were better remunerated for because of visa restrictions. In one case an engineer took a job short term to get his visa upped and then jumped to a better place immediately after.
It always amazed me in the United States that benefits were tied to employment and were at the mercy and benevolence of the company. Lifelock was an exceptionally generous company. For instance they matched the 401K at 133% for the first six percent. Health benefits were equally as generous. As someone who came from Australia and saw how the employment system and health system worked in Australia, it seems to me very inefficient to do it the way the United States does.
I know that health care being done through an employer is a quirk of history in the United States. I am also aware that America prefers to subsidize welfare through the tax code rather than by direct payments; but there is a ton of empirical evidence that single payer health care is far more efficient and cost effective. There are also numerous different mechanisms around the world that the United States could copy; Australia's mixed public/private system or the regulated health care systems of Germany, France and Switzerland. It seems strange to me that I have to take into account health care benefits when choosing a job. It should not be a responsibility that an employer has to shoulder.
Other than the health care issue, the American economy is fantastic to work in, especially for someone working in the technology field. Things tend to happen first in the United States and even in Phoenix you are somewhere close to where they are happening. The new things tend to be cheaper in the United States as well. In the 1990s Americans were doing local calls for free to an ISP and then having unlimited hours, while in Australia it was 22c per call and then the Australian ISPs had hard caps on the number of Mbs you could use. This still occurs today with broadband. Australians have a similar gripe with desirable hardware like iPhones which are far more expensive in Australia than America.
For a while Lifelock's CEO, Todd Davis, was the most visible part of Lifelock as he starred in adverts along with his social security number. He is also pretty inspirational when you interact with him. I used to joke with the network engineer who had the cube opposite me, that if you were having a bad day, go and tell Todd about it and within ten minutes you will be coming out of his office saying, "I am going to work here forever" and be invigorated. He had that kind of effect.
Lifelock was set up to be customer focused in the same way that Zappos is. On one of the walls in the member service training area there is the message emblazoned across a wall in large letters that as a member service agent you can escalate an issue a member has all the way to the CEO and whoever you call has to drop what they are doing and help. I always secretly hoped I would get a call like that but I never did. I guess the Member Service Agents were able to take care of things without involving engineering directly.
The benefits at Lifelock are generous. In comparison to other companies they are way superior. The 401K is a good example: Lifelock matched 133% for every dollar you put in up to 6%. Which is remarkable. The health care benefits were also very good. One of the executives worked out at the same gym as I did. I can recall chatting with him about the benefits and he made the comment that the board called Todd Santa Claus because of this. He also said, "The board doesn't mean that in a good way either."
Every year there was a May Event. Essentially Lifelock would put the entire workforce up in a local Phoenix resort for the weekend. There was a formal dinner where awards were handed out. I was fortunate enough to win an award one year. There was also plenty of time spent in the pool, drinking, eating, golfing, etc. I suspect a lot of these events come from how Todd came to success and from wanting to share that success.
Social Issues VS Technical Issues
Jenkins came into the Product and Technology organization through the Linux Team. When I started and we were discovering that Bamboo was not powerful enough for our needs, I ran Hudson out of the Mac Pro I was using. It was good for getting stuff up and running but obviously not good enough to be used by all of engineering. Between that problem and the Service Delivery group wanting everything on Bamboo we migrated our continuous integration and automation to Bamboo.
One of the big downsides of Bamboo was that we could not programmatically create jobs on Bamboo. This was something I had been trying to do in order to make our automation run smoothly. When Jenkins was introduced I quickly moved over the middleware continuous integration into Jenkins and the Front End automation soon followed. Eventually only our legacy .net application remained on Bamboo.
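For readers who have not done it, creating a Jenkins job from a script comes down to posting a job definition to the server's createItem endpoint. The sketch below only builds the request and does not send it. The server URL, job name and the stripped-down config XML are placeholders I have made up for illustration, and a secured Jenkins would also need credentials.

```python
import urllib.parse
import urllib.request

# Bare-bones freestyle job definition (an assumption for illustration).
# A real job would carry source control settings and build steps.
MINIMAL_CONFIG = """<?xml version='1.0' encoding='UTF-8'?>
<project>
  <description>Created from a script</description>
  <builders/>
</project>
"""


def build_create_job_request(jenkins_url, job_name, config_xml=MINIMAL_CONFIG):
    """Build the POST request that asks Jenkins to create a new job."""
    query = urllib.parse.urlencode({"name": job_name})
    url = jenkins_url.rstrip("/") + "/createItem?" + query
    return urllib.request.Request(
        url,
        data=config_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


# Hypothetical server and job name.
request = build_create_job_request("http://jenkins.example.com", "middleware-branch-1.2")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Once a job can be described as a file and pushed with one call, a release branch script can create its own continuous integration job, which is the step we could not take with Bamboo.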
The problem was that the Linux Team had set all the slave servers up for their needs. Consequently the slave servers could do more on other servers than if they had been built for engineering, quality assurance and service delivery only. Eventually the Info Security manager put his foot down - with no understanding of how everyone used Jenkins - and the create and configure permissions were removed for everyone other than the Linux Team.
It caused havoc.
We didn't just use Jenkins for continuous integration. We had deployment jobs for engineering and service delivery, we had performance testing jobs for production, we had branch creation jobs for engineering and service delivery, quality assurance had their own set of jobs as we were trying to get all our functional testing into Jenkins. Additionally, engineering ran all our version check reports, svn log reports, code coverage reports etc out of Jenkins. Simple things, such as adding a new branch version into a drop down, or adding another email address into a list were done through configure permissions. That stopped. We were also constantly making new jobs for new problems. That stopped too.
The outcry, along with the threat of code no longer getting to production in a speedy manner, was big enough that the create and configure permissions were handed back as a short term solution until something else was worked out. The real issue here was that a technical solution was taken to what is a social problem.
My wife likes to tell the story of how she went into a Dentist one day and up on the wall was a sign that said, "Please don't write on the wall." This is a useless solution to a social problem. The correct thing to do is, if someone is writing on the wall, go up to them and say, "Please don't write on the wall, writing on the walls is not socially acceptable." Rather than having every customer come in and laugh at the sign because it is silly and obvious.
It was the same with the Jenkins jobs. Only a couple of us had create permissions, and a few had configure permissions. Further, we had been using that system without causing any problems for about a year or so. Those of us with create and configure permissions were high paid professionals who had a long history at the company and had passed background checks. It is ok to say to us that we have to be careful with what we do and stay within the boundaries we are currently working in. If we do screw up then the permissions will be taken from us.
You could argue that some hacker might get in, steal some of our passwords, then use them to log into Jenkins, etc. You could also argue that if just one hacker gets through all those layers, then, it will be bad - and it would be - but it is improbable. If a hacker got in that far, there is production data they would be after. Besides, most of the people who had the capability to create and configure jobs on Jenkins before it was removed also had permission to view PCI and PII data as well.
This is one example of how technical decisions are made to social problems where the technical solution is inappropriate. It is probably not the best example, but it does make the point that it is easier to solve these issues socially by saying, "Please don't do this, or we are giving you these permissions for this specific task, please don't abuse them." Employees want to do the right thing, they want to do well, so let them. Just because an ACL lets you be absolutist about it doesn't make it the right choice. A few kind words are often better in achieving a technical outcome.
Change Management and Configuration
One of the realities of Lifelock was that Infrastructure, QA and Service Delivery were often understaffed and overworked. This had ramifications for how we moved code and configuration changes through the environments. One of the pieces of paperwork we had to fill out was the CAF, or configuration approval form. This was put in place when Service Delivery was lacking staff and configuration changes were all over the place, which often led to QA being blocked. The CAF was a piece of paperwork that slowed the process down and forced the configuration and artifact changes to be recorded.
Unfortunately paperwork is paperwork and is tedious for everyone involved. I know I was not the greatest at filling out CAFs and often got them wrong. I created a Python script in Jenkins which queried SVN for a sprint and found all the revisions that had been made. It then looked up all the JIRAs that were associated with those revisions and put them into a generated HTML CAF that Jenkins sent out by email. It turns out I wasn't the only engineer who had done this; another engineer in California had his own CAF script as well.
The reality that Service Delivery was grappling with was this: how do you record the code, configuration and artifact changes in a meaningful way that explains to all the interested bodies what has been deployed, what is new and what is old? There are multiple groups that have a stake in code and configuration changes: the members who are the end customers; the internal customers such as member services, partner operations and billing operations; and the internal groups like engineering, quality assurance, infrastructure, networking and product.
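A minimal sketch of that kind of CAF generator is below. It is not the script we ran. It assumes the commit messages carry JIRA keys of the form ABC-123, and it takes the revisions as plain (number, message) pairs rather than querying SVN itself. The function names, the UNTRACKED bucket and the HTML layout are all hypothetical.

```python
import html
import re

# Assumed convention: a JIRA key such as "ABC-123" appears in the commit message.
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def group_revisions_by_jira(revisions):
    """Map each JIRA key to the revision numbers whose messages mention it.

    `revisions` is an iterable of (revision_number, commit_message) pairs,
    for example parsed from the output of `svn log`.
    """
    by_key = {}
    for revision, message in revisions:
        # De-duplicate keys within one message while keeping their order.
        keys = list(dict.fromkeys(JIRA_KEY.findall(message)))
        if not keys:
            # Hypothetical bucket for commits that reference no JIRA.
            keys = ["UNTRACKED"]
        for key in keys:
            by_key.setdefault(key, []).append(revision)
    return by_key


def render_caf(sprint, revisions):
    """Render a simple HTML table of the JIRAs and revisions in a sprint."""
    rows = []
    for key, revs in group_revisions_by_jira(revisions).items():
        rev_list = ", ".join("r%d" % r for r in revs)
        rows.append("<tr><td>%s</td><td>%s</td></tr>" % (html.escape(key), rev_list))
    return (
        "<h1>CAF for %s</h1>\n" % html.escape(sprint)
        + "<table>\n<tr><th>JIRA</th><th>Revisions</th></tr>\n"
        + "\n".join(rows)
        + "\n</table>"
    )
```

Keeping the grouping separate from the rendering means the same data could feed an email, a wiki page or a report without touching the SVN side.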
The goal of engineering was to do continuous deployment where a code change could be tested so thoroughly after check in that it could be pushed directly into production. This kind of speed of movement of code hits organizational boundaries. Lifelock was set up with separate engineering, quality, infrastructure and service delivery groups which each had their own series of managers. Everyone agreed that moving code faster to production was a great idea but the organization structure itself was a large inhibitor.
We started developing better personal relationships with the other groups, which sped things up and largely stopped the "put a ticket in so I can work on it" style of inter-organizational communication. What some of us in engineering and infrastructure wanted was a DevOps group where all these were mingled into an organization whose single purpose was to push high quality code from user story to production as quickly as possible.
We could not get higher ups to reorganize in that way so we started creating our own informal organization underneath the existing formal ones. The first step was an IRC server and a #devops channel where everyone could gather and help with feature, production and environment issues.
We had been trying to get an IRC server into Lifelock for quite a while, but what the IT Group gave us was some Microsoft thing, like a group communicator product that was complete rubbish. It was so bad that no-one used it, even the ones that wanted to. Eventually the IRC server popped up courtesy of the Linux Group who wanted it to co-ordinate their own tasks. It was a rogue - yet secure - internal IRC server until it gained acceptance when most of engineering and infrastructure were using it constantly.
Tools do make a difference. If it is a crappy tool no-one will use it. The group communicator product was crap and so no-one used it for communication even though it would have been of benefit. There was a similar issue with the wiki. We were pretty much a Windows shop in most areas of IT and the sharepoint site was supposed to be used by everyone. Engineering and Infrastructure preferred a wiki, but we could not get one into the organization. There was one aborted try when an Apple wiki was rolled out, but it was crap and hard to use, so no-one did.
Fortunately we bought the complete Atlassian Suite and Confluence was bundled with it. The Confluence wiki is easy enough to use that pretty much within a month it stored more engineering and infrastructure documentation than sharepoint had ever done over several years. The business side continued to use sharepoint, but products and technology pretty much used the wiki for everything; from documentation, to scrum notes, to getting started for new hires and photo galleries of events.
The IRC server became central enough a tool that when we did releases everyone would jump on there and coordinate. Larger projects would also create their own channel and engineers would fill the channels up with their homegrown and handwritten bots that published commits, Jenkins job status, the artifact version and even basketball trade information into the channel.
When I had a day off I would often replace my login with a bot that would randomly say something believable into the #devops channel, like an 'lol' or another engineer's name with a question mark after it. I think I gave that bot odds of one in two hundred and fifty of replying to an event in the channel. It worked pretty well and caused some confusion, especially as my email was set to "I will be back on the ... blah blah blah".
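The reply logic of a bot like that is only a few lines. The sketch below is a reconstruction under assumptions, not the original code. The phrase list and the function name are made up, and it is not wired up to an IRC library. It only decides what, if anything, to say when an event arrives.

```python
import random

# Hypothetical phrases; the real bot's vocabulary is not recorded here.
PHRASES = ["lol", "hmm", "brb"]
REPLY_ODDS = 250  # reply to roughly one channel event in two hundred and fifty


def maybe_reply(nicks, rng=random):
    """Return a short line to say in the channel, or None to stay silent.

    `nicks` is the list of other nicknames currently in the channel.
    `rng` only needs `randrange` and `choice`, so a stub can be passed in tests.
    """
    if rng.randrange(REPLY_ODDS) != 0:
        return None
    # Either a stock phrase or someone's name followed by a question mark.
    choices = PHRASES + [nick + "?" for nick in nicks]
    return rng.choice(choices)
```

Passing the random source in as a parameter keeps the behavior testable even though the bot itself is deliberately unpredictable.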
The best bot I wrote was the 'southbot' named after an acerbic network engineer that used to work at Lifelock. He was cynical and dry to the point of draining emotions out of you. Kind of like a network engineering succubus. He also wouldn't let up so you had to walk away but he always enjoyed getting as good as he gave. There was one young Infosec intern we had who worked out how to completely neutralize the network engineer's comments. Whenever the network engineer spoke to him he would say, "Haters gonna hate." to which there was no reply. It was hilarious.
That network engineer left Lifelock in early 2012 and had been away for a while before I created the southbot in Python. It didn't do that much. Every now and then it would say something into the channel like 'stupid', 'dumb', etc. Because it was on IRC with the nick 'south' you could hear him saying it and it cut a little too close to the bone sometimes. It was irritating.
Anyway, one evening a Linux Engineer was having a bad day and the southbot wrote a 'stupid' under something he had posted to IRC. The Linux Engineer commented, "Shutup south", and then a little later, "How did south get in here?". He opped himself and kicked the south bot out of the channel. Given the reaction and obvious irritation I was pretty pleased at my IRC bot creation.
Hopefully the bots that are in IRC expand out to the kind of release and sprint management bots that the likes of Facebook use. During the product configuration project one of the bots was noting check-ins and change in state of JIRAs, while another supported basketball trades during the last days of fantasy basketball which the Tempe and Irvine offices were into. It is not a large step from there to query Jenkins for build and deployment information, to have people register as present for deployments, and some of the other cool stuff that other companies use IRC for.
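As a rough illustration of that step, the sketch below turns a Jenkins build description into a one-line channel message. It assumes the payload is the JSON that Jenkins' remote API returns for a build, with the number, result and building fields. Fetching the JSON and posting to IRC are left out, and the function name and message format are hypothetical.

```python
import json


def build_status_line(job_name, payload):
    """Turn a Jenkins build JSON document into a one-line channel message.

    `payload` is the JSON text describing a single build. Only the
    `number`, `result` and `building` fields are read.
    """
    build = json.loads(payload)
    number = build.get("number", "?")
    if build.get("building"):
        # A build that is still running has no result yet.
        state = "still building"
    else:
        state = (build.get("result") or "UNKNOWN").lower()
    return "%s #%s: %s" % (job_name, number, state)
```

A bot could call this whenever someone asks about a job in the channel, or on a timer before a release.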
Why Is Software So Stressful?
This is an example of a three day period that left me stressed and frustrated. It was five days before a release date which was not going to move and four feature branches had merged into trunk. I made a modification to trunk, compiled the branch locally, deployed it onto my local Weblogic instance, ran the functional tests and then checked it in. When Jenkins had finished compiling the branch I deployed to a dev integration environment and the ear didn't deploy.
I tracked it back to two poms that had come into trunk a couple of feature sprint merges prior. The poms had the dependencies marked as compile scope so they were flowing jars of all kinds - including our weblogic-client-full.jar which is basically a zip of bea/modules - into the APP-INF/lib of two of our ears. I started the process of pushing those dependencies into a parent pom under dependencyManagement and then in the EAR poms adding the necessary dependencies which needed the scope of compile.
This is a laborious process as each time a change is made you need to compile, deploy and then look for any runtime errors. Another engineer and I took the better part of a day and a half hunting down the no class definition found issues. It does not help that Java doesn't make it easy to find out which class is in which jar, though fortunately websites have been getting better and better at helping out there.
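A small script can take some of the pain out of that hunt. The sketch below is not something we had at the time. It walks a directory, such as an exploded EAR or a local Maven repository, and reports which jars contain a given class. It relies only on the fact that a jar is a zip archive, and the function name is hypothetical.

```python
import os
import zipfile


def find_class(root, class_name):
    """Return the sorted paths of jars under `root` that contain `class_name`.

    `class_name` is a fully qualified name such as "com.example.Foo".
    """
    # Inside a jar, com.example.Foo is stored as com/example/Foo.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except zipfile.BadZipFile:
                # Skip files that are named .jar but are not valid archives.
                continue
    return sorted(matches)
```

An empty result points at a missing dependency, while a class that shows up in more than one jar is worth a second look as a possible conflict.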
The other issue we had was our integration environments which tend to be a bit flaky and don't handle the constant process of deploy, deploy, deploy too well. Another issue we hit was that not all our dev integration environments are the same. For instance one environment we got the EARs deploying and running on was not clustered, so when we deployed to a clustered environment we got a runtime issue. So we spent more time isolating these changes as well.
To cap it off, we hit an axis2 and axis1.x incompatibility in the same EAR and we could not solve the problem in time no matter what we did. So that two days work had to be reverted and that user story backed out of the release. Immediately after, two new bugs were found, and I was back into the mode of come in early, leave late and skip lunch.
But why so stressful? Obviously release dates are hard dates that won't move. Too many people have to be lined up and their precious time allocated in advance so that a release can go smoothly. Some companies have moved to continuous deployment to take that stress and roadblock out of their working lives, but they are the leading edge companies and they most likely do not carry organizational inertia toward those goals. In the three day period I described it was the looming release date that caused a lot of the stress.
The poms being done poorly, in a manner that did not match how the build process and maven compilation and deployment were done, was another source of stress. It meant that work which had previously passed QA and was in trunk, and hence production, had to be redone so that it matched the rest of the poms. Having an EAR that wouldn't deploy was not fun and is stressful in its own right as it means the system is not production quality. An undeployable EAR is a blocker for production and integration.
Code that is good enough to pass QA once and get into production with reproducible results is relatively easy to do. However the code that can get past QA is no guarantee of quality. In this instance the poms were so poorly done that they caused a later sprint to have an artifact which failed to deploy. This happens a lot in software engineering. It is most commonly called technical debt but why does that occur?
I have seen technical debt mount up because managers just target a date and don't care how the code is created. I have also seen managers strong arm QA into saying something is certified and ready for production. Of course QA hates this with a passion, but QA Directors will get strong armed as often as QA engineers will. It makes a mockery of the whole purpose of QA but it does occur with considerable regularity. When we were interviewing QA Directors I would often say to a candidate, "You are going to have very powerful people telling you to say yes when you want to say no. What are you going to do?"
Another common problem is when managers that are not software engineers run software teams. Having a non-software manager leads to silly preventable things happening such as functional tests being checked in under the /test/ directory where the unit tests are and a build breaking on Jenkins because it is firewalled off from the third party system the functional test was trying to access.
Managers that are not software engineers tend not to understand the system from the software angle no matter how strong their technical skills are, so a lot of bad architectural or refactoring decisions are made by the manager and are then implemented by junior software engineers who don't understand the errors the manager is making. Alternatively the junior engineer doesn't feel empowered to point out the error or go against that design decision.
In these cases a senior engineer or a tech lead can spot it and fix it up, but usually when other groups see the code is when it has come into trunk and is going out to production within a couple of days. There is also a limit to how much senior engineers who are already strapped for time can rework code that is coming in. Code reviews are a solution to this but I have never seen code reviews catch on when engineers are busy trying to push out their own code and complete their own user stories.
Another issue is weak managers who say yes to every date no matter how achievable or unachievable it is. That leads to time pressure and code being pumped out with highly variable quality just so the date is hit. There ends up being a group consensus that the date is unhittable, but teams don't want to say that it is impossible, or if they do, they keep that opinion within the cube farm and it doesn't get to the upper manager who thinks they are getting an artifact on a certain date. I have seen this called the "thermocline of truth".
Technical debt can come through software engineers that don't really understand why the system was designed a certain way or the technologies they are using. J2EE being a distributed environment is a good example. Most engineers know about EJBs but they don't understand why J2EE was built with remote interfaces, messaging beans and webservices as the main mechanisms of communication. In particular there is a need to understand why distributed systems were developed and designed the way they were in order to use a J2EE system well.
Software engineers that don't unit test produce code that is remarkably different to those that do. This makes it appear like there are two different software systems being written within the one class. Michael Feathers in "Working Effectively With Legacy Code" describes legacy code as 'code without tests'. I totally agree with that definition. If code is being checked in without unit tests and functional tests you have no way in the future to determine if the code is working or not other than inspection, which is the slow process of a QA person manually clicking on things to test it.
Another reason for technical debt is software engineers that are junior and don't know any better. One of the more common mistakes is that junior engineers will commit everything at the end of the sprint in one big dump taking other developers by surprise. Often it is too late to fix things, or even to review them when the code comes in like that. I usually watch the svn logs constantly to see what code is flowing in and offer suggestions, comments, etc as I see the code come in. For engineers that are used to committing a big code dump all at once it can take a while to get them used to small and constant commits.
Another reason is that the initial design or progress that was thought to be good at the start of the project ends up being a pretty bad design at the end once everything is known. Given hindsight it would be designed a different way. Once code is in production it is very hard to change, even with the confidence that tools like refactoring and unit tests give. Even getting time to change an existing design without changing the outward behavior is hard. Most teams are organized for the purpose of creating new products for revenue generation or solving process problems such as billing and fulfillment to maximize efficiency. It is rare you get time to fix code's design.
Another reason software engineers get stressed - and infrastructure engineers for that matter too - is because the systems we have to build, maintain and work on are of increasing complexity and not giving the same productive return they should. J2EE and Weblogic are a good example of this problem. J2EE is hopelessly over specced for what most people need it for. It would be simpler to use a Tomcat container with a basic framework and interact via JSON with a central server and APIs that don't use the ridiculous overhead of a distributed system and the J2EE libraries. We make life hard for ourselves when we don't need to.
Often this is not our decision. At Lifelock I played no part in the decision to go to an Oracle stack with Weblogic and the service bus in the middletier, oracle for the database and two .Net front ends that interacted with Weblogic via web services. I have no doubt it was the fashionable and safe thing to do for an IT organization at the time but was it a good decision? Probably not.
Service oriented architecture is a buzzword but it is another way of selling "have strong APIs". We ended up integrating through Java in the middletier with the third party systems we work with. So they are hidden behind strong java interfaces in our implementation. Exposing those integration points through a tomcat container and JSON is trivial once that integration work in making the third party look like an internal API has been completed. I am slightly prejudiced as it has been my experience that the Tomcat container is more reliable and robust than a Weblogic deployment. It is also simpler to integrate with and to run locally.
I can remember when I was younger we had finished a project for a local government agency and had to integrate that code with CSC in Maryland. When another engineer and I turned up I was amazed to discover that a Rational Rose expert was there to help us integrate our project into their source code repository. If your choice of software is so complex you need a specialist just to manage simple tasks like merging code then there is a good argument it is too complex. I don't know many people who would defend Rational Rose for its simplicity.
One of the common issues we hit was that the non-production Weblogic environments became very fragile. They were robust when I started at Lifelock but entropy set in as our middleware engineering group was understaffed and overworked. These environments were constantly being worked on but the nature of how they were used, configured and managed led to them being a sore point between engineering and infrastructure. They were constantly having issues such as managed servers going walkabout, or the environment falling over, something getting stuck in a semi-deployed state, etc.
We tried to deal with these issues by adding more and more scripts to Jenkins to restart Weblogic, to clean out space on those servers, etc etc. But they remained overly complex for what we needed them for and too fragile to be of solid use. The fragility of these environments and the variability between them led to a lot of frustration between engineering and infrastructure.
The older infrastructure folks tended to think in terms of the 1990s when it came to servers and code. Their strategy was that you configured a perfectly solid server - probably on big iron or its modern equivalent, the ESX servers - and because it was configured and specced so well, and managed so tightly, any instability in the system must be caused by code, and the way to solve instability was to back the code out.
Infrastructure has been moved into the realm of engineering and QA due to server virtualization. It is an engineering problem because servers are now behind APIs and they pose the same management and feature problems as software development does. It is a QA issue because configuration is now an API and hence not stable. Configuration needs to run through the same quality processes as software does. This also means that code and infrastructure now move at the same speed of development which means always forward - never going back.
Managing Stress In Software
The simplest answer to taking the stress out of software is to make the system and code as simple as possible and ensure it is always testable so you know if something is working or not. This means 100% unit test coverage and 100% functional test coverage. Given the success of continuous integration in compiling the code and running the unit tests every time there is a check in, having the functional tests as part of that continuous integration process is a good idea as well. Doing things that are hard and error prone often is a good strategy.
Software is only partially stressful for the technical side of things. Any business is a social operation and navigating personalities is very important. If there is one piece of good advice I can give it is that saying no is ok. It is alright to say I can't give you a date because I don't know. It is also acceptable to say I can give milestones but not a date as there is just too much volatility to know. That is enough for the planners who have to orchestrate marketing, member services and all the other arms of a company that need to know what is going on. It is also ok to say I am not working weekends. We did this, and again, nothing happened, none of us got fired. Some of us did work weekends, but it was a choice, not required or expected.
Don't rely on a manager to do it for you either. I watched a middle manager cave in one of those meetings when engineering didn't and I came in on the weekend to make sure that group was not blocked. Their middle manager was in his office on the weekend, but he wasn't helping that group get through the work they needed to, he was doing 'manager work', and he was resented for it. You will have to do it yourself, as an individual, as a team and ultimately as a group.
I handed in my two weeks notice the day after my three years at Lifelock. It had been a good run and I had a fun time working there. By the end of three years I was doing the same things each day that I had been doing for a while and the technology was not going to change in the near future so I was starting to get bored. Three years is a cycle in the technology industry. After about three years people start to move on, to new opportunities, new technologies or just a change for the sake of it.
It was a pretty remarkable three years. Software Engineering took the code base from low quality to high quality through the discipline of unit testing and functional testing. Engineering along with our good friends in Service Delivery, QA and Infrastructure took deployment time from twelve months to weekly. When I left there were two major releases being done each week and three minor releases on the other days. We could not have reached that pace without all the automation that was coming from Engineering, QA, Infrastructure and Service Delivery. We also could not have done it without the groups just getting things done by working together closely.
I made a lot of friends at Lifelock and the decision to leave was tempered by the thought of not being around them forty plus hours a week. Looking back over the three years there I would like to think I played a strong role in the quality improvement of software, the increased pace of deployment for high quality code from dev to production, and the maintenance of a decent work/life balance for the Tempe engineers. It was fun.
Simple Steps To Higher Quality Code And Faster Code To Production
Here are some simple steps that engineering can take to improve the quality and design of source code. These may seem obvious or stupid, but they get broken all the time. It is crazy, but people will avoid these things for speed, because they don't know any better, sometimes out of ignorance and sometimes out of sheer laziness.
- Unit Test. This is currently a discipline, but you have to, have to, have to, have to do this. This is about the only pre-inspection metric we have which can determine quality. Make unit testing part of the culture and publish reports that make it obvious who is unit testing and who isn't. If someone is not unit testing and resisting the culture, get them on a performance review.
- Functional Test. This is also a discipline, but you have to be able to test the system at runtime and make sure functionality does what it says it does at the system level.
- Javadoc. Code will exist for ten years after it is pushed to production and the main audience for your code is other developers. Make your comments meaningful as they will be around for a long time. I know Robert Martin in "Clean Code" argues against this but from my experience I think he is wrong.
- Testability. The code and system need to be testable in all environments and at all levels. For instance the system must be testable in DEV, QA, STAGE and PROD. You must be able to determine the state of the system at any one time and be able to change that state non-destructively to ensure functionality is behaving as expected. Surprisingly this is very hard to achieve, especially in production.
- Formatting. Source formatting matters. Code should read like a book with consistent paragraph (method) sizes, consistent white space between methods and in an IDE there should be a nice balance of colors with the beautified text. Your code will be read for ten to twenty years so write code like an author and not a hacker. Take pride in how it looks and reads as well as how it functions.
- Delete any commented out code. If it is that important the source repository will have a copy of it. Commented out code is noise and should not be in source code that is in production.
- Data Transfer Objects. Where a java object is being used to transfer data from one system to another use a DTO that has public member variables and no mutators. Putting getters and setters in there only encourages someone to change the behavior with an if statement which is bad. Next thing you know your DTOs are carrying all sorts of business logic and are no longer for data transfer only.
- Don't expose Entity Beans. It is seductive to push entity beans up through the web services to the front end or third parties. It is a bad practice as the entity beans are surprisingly volatile. Data contracts are constantly changing and you are now pushing that volatility through the public APIs. It is quick to expose entity beans but it is bad practice. Use DTOs and a converter bean between the DTOs and Entities instead.
- Exception Handling. Throw runtime exceptions for methods that are internal. For public APIs throw checked exceptions. SOAP makes this easy as it pushes out formalized Faults. There are multiple ways to access a system externally and those exposed methods should adhere to this rule.
- J2EE. Parts of the spec are fantastic like the dependency injection of EJB and the asynchronicity of JMS. The big, clumsy, heavy weight and complex J2EE containers are not. If you can, use the good parts and avoid the rest.
- Refactor. If JUnit is the most important software written in the last twenty years, then Fowler's Refactoring is the most important software book written in the same time period. Refactoring is an essential skill for any software engineer working in sprints where there is no real time to design and often barely the time to settle on an approach.
- Normalization is good. In any engineering group there are those that visualize data structures better than others. Go to them for help with table design.
- Hard Coding. Don't hard code anything. This seems an obvious thing to say, but we had instances of data being hard coded and causing bugs. It happens. It is done by people who should know better as well. You have to stop it.
- Duplicate Data. Try not to duplicate data as it will become a synchronization nightmare. Again an obvious thing to say but we duplicated data to get something out into production faster and we - surprise, surprise - had synchronization issues. We eventually wrote a batch job which queried the data constantly and warned us of any data issues in this area. We sank a bunch of extra effort and energy into maintaining the duplicated data.
- Roll your own. Most out of the box logging solutions are quite poor. As an example, engineering and infrastructure had to do a logging project so that we had domain level, app level and cluster level logging from the Weblogic container.
- Make logs human and machine readable. JSON is the best format for solving both these issues at the moment.
- Embrace two or three week sprints. They are the only accepted mechanism to stop death marches.
- Add hours to a user story for design of the user story. When doing the design on a wiki for the user story add sub-tasks that have hours attached to them describing all the different steps and tasks that need to be completed. This makes the scoping quite accurate. The wiki design doesn't have to be more than one page. We had a projects space on confluence specifically for this.
- Add hours to a user story for unit testing explicitly. This stops engineers dropping off unit testing when they are jammed for time to complete a feature. This also makes the scrum master constantly ask, "Have you done the unit testing yet?"
- Add hours to a user story for functional testing explicitly.
- Add hours to a user story for wiki documentation of the project, modules and user guides.
- Have QA test with you in the Dev or Integration environment during the sprint. If they test in a separate environment to where the development is going on then it is too late. With QA testing alongside you, engineering can fix bugs faster and make the system more stable.
- Unit testable code is different to untested code. You can see the difference quickly. Untested code tends to be longer, with more conditionals and indents.
- Unit tests should be of the same quality as production code. They will be around for the same length of time.
- Keep if statements out of unit tests and functional tests. They are usually an example of a bad test.
- Run the unit tests on each compile on localhost and in continuous integration.
- Design the system to be functionally testable. This may mean exposing APIs that exist only for the functional tests to determine the state of the system.
- Have an engineer design and start the functional testing framework. The functional tests need to be extensible so experienced programmers and inexperienced programmers, such as QA or Service Delivery, can add to the tests relatively easily.
- Make the functional testing project in the same language as the primary systems. Don't use SoapUI or some scripting language. You will end up with five hundred different functional test projects. Make one functional test project that everyone has to use and if you are using Java, then make the functional test framework Java as well.
- Javadoc the functional tests, especially if it is not obvious what the test is doing.
- Add the functional test project to continuous integration. Use the excludes on the Surefire plugin in Maven so that the functional test project builds as part of continuous integration without running the functional JUnit tests. Putting all functional tests under com.company.functional means you can exclude /functional/ and not run the functional tests during continuous integration.
- Unit tests and functional tests should always be able to be run in isolation. They should also give the same answer when run individually and when run as a group.
- For quality, software projects need both unit and functional testing. You cannot guarantee quality without doing both.
- Engineers will have to write both unit and functional tests as part of a sprint. Engineers are generally better coders than QA. They are also writing the feature, and should be able to show QA and Service Delivery that they are 'done' with a repeatable test that can be run out of Jenkins.
- If Service Delivery and Upper Management don't have confidence in the functional tests then they are useless. It takes time to build that confidence. It took us nearly two years for our functional test framework to be trusted, but once it was, other groups requested the functional tests to prove features were working in different environments, including production.
- Artifact and Configuration deployment should be done the same way on localhost as it is in production. There should only be one way to deploy an artifact no matter what the environment. Same for configuration.
- Reduce the number of non-production environments. They represent overhead and waste.
- Make every non-production environment a releasable platform. The goal is to have the distance between an artifact being certified and being able to be pushed into production as small as possible.
- Configuration is best handled through server automation. When an artifact is delivered, bake it into a new VM with any configuration changes and have QA or automation test both at once.
- Artifacts should only be pulled from one repository. This sounds obvious, but it is surprising how often it does not happen. For instance, at one place I worked, depending on how an artifact was deployed, it would be taken either from the Jenkins workspace for lastSuccessfulBuild or from Nexus's public repository. They are not necessarily the same artifact.
- Don't bother with a staging repository. It slows things down too much. If an artifact survives a sprint, then make it a release artifact immediately and push it out into production.
- Jenkins is your best and only friend here. All automation should be through one tool, so people only have to log in once. Jenkins is an amazingly powerful platform for this.
- Expose error enums and other forms of fixed data via a URL and JSON so the automation can access it.
- Applications should announce who they are. Maven makes it easy to put a lot of information in the manifest file. This can be exposed to the automation so that it will always know who, what and which version it is dealing with.
- Make starting a sprint a one button click. Jenkins is flexible enough to create new jobs from other jobs. This meant all of the continuous integration jobs were correctly versioned and exactly the same, because they were created out of Jenkins when a sprint started.
- Automate the creation of a VM, Configuration and Artifacts into one step out of Jenkins so that QA can certify the whole lot as a group. Move the VM to production and delete the old production. Creating VMs is cheap now and this stops the quality and production issues with configuration and artifact changes.
- Automate all the things.
- Everything is engineering.
- Infrastructure are your friends.
- QA are your friends.
- Managers, project managers, and product owners are cool with engineers saying no. It is ok to say no. It is ok to not give a date. Give milestones etc, but if you don't know a date, don't give one. People are ok with it. You won't get fired.
- Social limitations and restrictions are always greater than technical ones. Break that barrier and change will happen faster.
- Source control. Git is the current popular one, but several others are up to the task. Source control is not just for source code. It is for configuration, build scripts, batch jobs, Jenkins jobs, database schemas and J2EE containers. Anything that changes over time should go into some kind of source control.
- JUnit. This is probably the biggest innovation in software engineering in the last twenty years.
- Jenkins. This is a remarkably flexible environment for managing any automated task. Bamboo doesn't even come close.
- User Story Software. Something to track user stories and burn down charts. Atlassian's GreenHopper is not that great, neither is JIRA for that matter.
- Wiki. All documentation should go here. This includes project designs, module documentation, trouble-shooting steps, scrum notes, reports, how-tos, getting started, photos of group events, photoshops of co-workers, etc.
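Two of the automation points above, exposing error enums as JSON and having applications announce who they are from the manifest, can be sketched together in a few lines of Java. This is only a minimal sketch under stated assumptions, not what we ran at Lifelock: the class name AppInfo, the ErrorCode values and the sample manifest entries are all invented for illustration. Implementation-Title and Implementation-Version are standard manifest attribute names that Maven can be configured to write. The HTTP layer is left out; whatever framework serves the URL would simply return these strings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class AppInfo {

    /** Hypothetical error codes; a real system would list its own. */
    enum ErrorCode {
        INVALID_INPUT(1001, "Request failed validation"),
        NOT_FOUND(1002, "Resource does not exist"),
        UPSTREAM_TIMEOUT(1003, "Downstream service timed out");

        final int code;
        final String description;

        ErrorCode(int code, String description) {
            this.code = code;
            this.description = description;
        }
    }

    /**
     * Renders every enum constant as a JSON array so automation can read
     * the fixed data instead of hard-coding it. No JSON escaping is done:
     * this assumes descriptions contain no quotes or backslashes.
     */
    static String errorCodesAsJson() {
        StringJoiner json = new StringJoiner(",", "[", "]");
        for (ErrorCode e : ErrorCode.values()) {
            json.add("{\"name\":\"" + e.name() + "\","
                    + "\"code\":" + e.code + ","
                    + "\"description\":\"" + e.description + "\"}");
        }
        return json.toString();
    }

    /** Reads the application's name and version from the manifest's main section. */
    static String identityAsJson(Manifest manifest) {
        Attributes main = manifest.getMainAttributes();
        String title = main.getValue(Attributes.Name.IMPLEMENTATION_TITLE);
        String version = main.getValue(Attributes.Name.IMPLEMENTATION_VERSION);
        return "{\"application\":\"" + title + "\",\"version\":\"" + version + "\"}";
    }

    /** Parses manifest text; a deployed app would read META-INF/MANIFEST.MF instead. */
    static Manifest parseManifest(String text) {
        try {
            return new Manifest(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample manifest content, made up for this example.
        Manifest manifest = parseManifest(
                "Manifest-Version: 1.0\n"
                + "Implementation-Title: enrollment-service\n"
                + "Implementation-Version: 2.4.1\n");
        System.out.println(identityAsJson(manifest));
        System.out.println(errorCodesAsJson());
    }
}
```

With something like this in every application, a Jenkins job can ask any running instance what it is and which version it is before deploying over it or running functional tests against it.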