zobie's blog I create software, I like music, and I'm mildly(?) OCD.

23Aug/12

Delivering a Platform for Growth!

We are in the midst of a major architectural change at SRS. As new SRSWP services are created and products begin using our SOA/SaaS platform, it is vital to understand that every service must provide a stable architecture.

SRSWP services do not grow slowly or linearly like many new products. These are foundational elements which will be integrated into other applications. They will grow by entire user-bases at a time! For example:

  • Until recently one basic service was handling about 9 requests/second at the top end.
  • Over the course of a week, two more applications started using it, and overnight the service jumped to 15 requests/second, then to 47 requests/second!
  • One app that will begin using the service in the next few months will bring a load of around 29 additional requests/second.
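To make the arithmetic behind these numbers explicit, here is a minimal sketch in Python. It assumes the 47 requests/second figure is the current peak and that the upcoming app's 29 requests/second simply adds on top of it. The function name `projected_peak` and the `headroom` parameter are hypothetical, introduced only for illustration.

```python
def projected_peak(current_peak_rps, additional_rps, headroom=0.0):
    """Return the peak request rate to plan for.

    current_peak_rps: today's observed peak, in requests/second.
    additional_rps:   expected extra load from each new consumer.
    headroom:         optional safety margin (0.5 means plan for 50% extra);
                      this is an assumption, not a figure from the post.
    """
    total = current_peak_rps + sum(additional_rps)
    return total * (1 + headroom)


# Peak load observed so far, in requests/second (figures from the list above).
observed_peaks = [9, 15, 47]

# Load expected from the app that joins in the next few months.
upcoming = [29]

planned = projected_peak(observed_peaks[-1], upcoming)
growth = planned / observed_peaks[0]

print(f"Plan for about {planned:g} requests/second, "
      f"{growth:.1f}x the original peak.")
```

Under those assumptions, the service goes from 9 to 76 requests/second, more than eight times its original peak, just by gaining three consumers.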

SRSWP services are not expanding into new markets. We are designing services which will impact all SRS products and customers. At this level, even small mistakes result in major problems. So, before any release there are four major areas that must be considered:

  1. Security: Mission-critical systems must define and enforce security policies.
    • Have we considered how to secure the service itself?
    • Have we considered how to protect the data this service touches?
    • Have we considered how to provide security at every level (application, software, operational)?
  2. Availability: Mission-critical systems cannot have downtime.
    • How will this service provide 100% availability?
      • How will we perform scheduled maintenance?
      • How will we handle operational failures (e.g. rogue web head)?
    • How are we backing up data and services?
    • How will we regularly test our ability to restore and recover from backups?
  3. Stability: Mission-critical systems cannot crash; they cannot change contracts.
    • Is the service 100% stable?
    • How do we prove stability before release and on an ongoing basis?
  4. Scalability: Mission-critical systems must seamlessly scale to support load.
    • How does this service scale to support a full load of users for the next year?
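One way to ground the availability questions above is to work out how much downtime a given target actually leaves room for. The sketch below is only an illustration: the targets listed are common examples, not SRS commitments, and the helper name `downtime_minutes_per_year` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Example targets, chosen for illustration only.
for target in (99.0, 99.9, 99.99, 100.0):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes/year")
```

At 100%, the budget is zero: scheduled maintenance and operational failures both have to be absorbed without the service ever going offline, which is why those questions need answers before release.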

Any problem with an SRSWP service harms more than just that service and the team who owns it. Problems are magnified as they cascade to every consumer and their users. Problems are reflected across SRS's product lines. This must not happen!

If you cannot confidently answer all of the questions above for a service you work on, answering them must become a top priority. Work with your team leads and VPs to make sure this doesn't fall through the cracks.