Gatekeeping


There are many approaches to gatekeeping. For instance, Amazon Web Services (AWS) offers load balancing, even across regions or countries. Some scenarios require more than that, especially when load handling and routing become part of the requirements. I would therefore like to distinguish between pure load balancers, such as those in AWS, and real gatekeepers like Netflix Zuul. Because of the extreme scaling of its business, Netflix built its own gatekeeper, Netflix Zuul, which was released in version 2 on 5th of January, 2018. The software is open source, and Netflix licenses it under the Apache License, Version 2.0. It can be used to balance, scale and manipulate requests (e.g. for consistency reasons with legacy devices). Zuul sits at the center of Netflix's ecosystem because it opens up many options for application maintenance and development:

Technical adjustment of HTTP requests that external devices do not package well
Some manufacturers do not escape URLs or URIs correctly, e.g. %20 for white space. The gatekeeper can normalize this.

Routing of requests
Some devices are not compatible with certain versions of a given piece of software. Their requests can be routed to a version that offers legacy support.

Debugging
In certain scenarios, it is crucial to route requests through certain regions or to force latency in order to reproduce issues.

Testing
For roll-outs, canary testing might be the choice for a smooth go-live (cf. architectural aspects for DevOps).
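
The first item in the list above, normalizing requests from devices that do not escape URLs, could look roughly like the following sketch. It is plain Java with no Zuul dependency; the class `PathNormalizer` and its method are hypothetical names, and the sketch only handles raw spaces, which is the case mentioned above.

```java
final class PathNormalizer {
    private PathNormalizer() {}

    /**
     * Replaces raw spaces in a request path with %20.
     * Sequences that are already escaped contain no raw space, so they stay untouched.
     */
    static String escapeSpaces(String rawPath) {
        return rawPath.replace(" ", "%20");
    }
}
```

A pre-routing step in the gatekeeper could call `PathNormalizer.escapeSpaces(path)` before the request is forwarded, so that services behind the gatekeeper only ever see well-formed paths.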

You can get a short introduction to the capabilities of Netflix's Zuul in a talk by Mikey Cohen (Netflix).

Netflix’s Microservice Ecosystem

Netflix uses Zuul as part of its microservice landscape. Microservices are addressed through Zuul, and only through Zuul. There are a couple of other systems for security, analysis and processing. As Netflix requires worldwide availability, its services are hosted in the cloud, in this case at Amazon Web Services.

Most of them are standard software components, such as:

  • Apache Kafka
  • Netflix Eureka
  • Authentication and cryptography
  • Databases
  • etc.

Global Cloud Routing

As mentioned earlier, Netflix requires global availability for its services. If a service fails, traffic has to be re-routed immediately to maintain the required user experience. Amazon offers services for restarting components; Netflix wanted to go further. In case of failure, they want to be able to take over the environment for debugging while users switch to other regions. This gives maintenance teams a new kind of flexibility to analyze and resolve issues where they happened. There are many ways to start the analysis: services can be replaced, or a whole availability zone can be moved into maintenance.
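
The idea of moving a region into maintenance while users switch elsewhere can be illustrated with a small sketch. This is not how Netflix implements it; `RegionSelector` and its methods are hypothetical, and the region names in the usage note are only examples.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class RegionSelector {
    private final List<String> preferenceOrder;
    private final Set<String> inMaintenance = new HashSet<>();

    RegionSelector(List<String> preferenceOrder) {
        // Defensive copy so later changes to the caller's list have no effect here
        this.preferenceOrder = new ArrayList<>(preferenceOrder);
    }

    /** Takes a region out of rotation so it can be inspected without user traffic. */
    void startMaintenance(String region) {
        inMaintenance.add(region);
    }

    /** Puts a region back into rotation. */
    void endMaintenance(String region) {
        inMaintenance.remove(region);
    }

    /** Returns the first region in preference order that is not in maintenance, if any. */
    Optional<String> pick() {
        return preferenceOrder.stream()
                .filter(region -> !inMaintenance.contains(region))
                .findFirst();
    }
}
```

With a selector created for `us-east-1` and `eu-west-1`, calling `startMaintenance("us-east-1")` makes `pick()` return `eu-west-1`, so the failed region stays available for debugging while users are served elsewhere.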

Accessing Services

Services are accessed through Zuul. This happens through routing rules, which Zuul applies. Traffic can be redirected to other regions or to certain services. These are available via Netflix's Origin API.
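
To make the notion of routing rules more concrete, here is a minimal sketch of a rule table in plain Java. It assumes rules are path prefixes checked in insertion order with a default target as fallback; the class `RoutingTable`, the prefixes and the target names are all hypothetical and do not reflect Zuul's actual configuration format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RoutingTable {
    // LinkedHashMap keeps insertion order, so more specific prefixes should be added first
    private final Map<String, String> rules = new LinkedHashMap<>();
    private final String defaultTarget;

    RoutingTable(String defaultTarget) {
        this.defaultTarget = defaultTarget;
    }

    /** Registers a rule: requests whose path starts with pathPrefix go to target. */
    RoutingTable addRule(String pathPrefix, String target) {
        rules.put(pathPrefix, target);
        return this;
    }

    /** Returns the target of the first matching rule, or the default target. */
    String resolve(String path) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (path.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return defaultTarget;
    }
}
```

A table built with `new RoutingTable("origin").addRule("/api/v1/", "legacy-api").addRule("/api/", "api")` would send `/api/v1/users` to `legacy-api`, `/api/v2/users` to `api`, and everything else to `origin`.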

Zuul Routing

Alexey Kuznetsov wrote the iproute2 utilities, which are now maintained by Stephen Hemminger. Today they are the standard userspace tools for configuring routing in the Linux kernel, and they have supported rule-based routing since Linux kernel 2.2. Netflix follows this principle with Zuul and offers three different filter types to process requests:

  • Pre-Routing Filter
    Can be used to resolve issues on a technical level when devices call Netflix services.
  • Routing Filter
    Can be used to address internal routing and balancing. Examples are canary releases or running multiple versions of a service.
  • Post-Routing Filter
    Can be used to handle errors and edge cases. It also offers re-routing possibilities, which is handy in case of service failure or developer debugging.
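
The canary case mentioned for the routing filter can be sketched as a stable percentage split. This is only one possible approach, assuming each client sends some identifier; `CanaryRouter`, the target names `canary` and `stable`, and the hash-based bucketing are hypothetical and not taken from Zuul.

```java
final class CanaryRouter {
    private final int canaryPercent;

    CanaryRouter(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    /** Maps a client id to a stable bucket in [0, 100), so a client always gets the same version. */
    static int bucketOf(String clientId) {
        // floorMod avoids negative buckets for negative hash codes
        return Math.floorMod(clientId.hashCode(), 100);
    }

    /** Returns "canary" for client ids whose bucket is below canaryPercent, otherwise "stable". */
    String targetFor(String clientId) {
        return bucketOf(clientId) < canaryPercent ? "canary" : "stable";
    }
}
```

Deriving the bucket from the client id rather than from a random number keeps a given device on the same version for the whole roll-out, which makes issues easier to reproduce. Raising `canaryPercent` step by step then widens the roll-out.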

Request Lifecycle of HTTP Requests

Please find a diagram of the lifecycle of HTTP calls below. Keep in mind that Origin is Netflix's API for microservices; the services are actually called there.
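
Independent of the diagram, the order in which the three filter types take part in a request can be sketched in code. The sketch assumes that filters run phase by phase (pre, route, post) and in ascending order within a phase; `FilterChainSketch` and `Step` are hypothetical stand-ins, not Zuul classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class FilterChainSketch {
    private static final List<String> PHASES = Arrays.asList("pre", "route", "post");

    /** Hypothetical stand-in for a filter: a phase, an order within that phase, and a name. */
    static final class Step {
        final String phase;
        final int order;
        final String name;

        Step(String phase, int order, String name) {
            if (!PHASES.contains(phase)) {
                throw new IllegalArgumentException("unknown phase: " + phase);
            }
            this.phase = phase;
            this.order = order;
            this.name = name;
        }
    }

    private FilterChainSketch() {}

    /** Returns filter names in execution order: by phase first, then by ascending order. */
    static List<String> executionOrder(List<Step> steps) {
        List<Step> sorted = new ArrayList<>(steps);
        sorted.sort(Comparator
                .comparingInt((Step s) -> PHASES.indexOf(s.phase))
                .thenComparingInt(s -> s.order));
        List<String> names = new ArrayList<>();
        for (Step s : sorted) {
            names.add(s.name);
        }
        return names;
    }
}
```

Registering a post filter before a pre filter therefore does not change when either runs: the phase decides first, and the order value only breaks ties within a phase.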

Filter Example

Filters for Zuul should follow the given iSAQB standards and are written in Apache Groovy. You should also evaluate your architectural requirements. Please note the sleep in the filter, which avoids a shutdown when core services are in failure mode. The random sleep prevents all clients from retrying at the same moment, for example after one second, and overloading the application while it recovers. An example from Netflix can be found below:

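import com.netflix.zuul.ZuulFilter
import com.netflix.zuul.context.RequestContext

class DeviceDelayFilter extends ZuulFilter {
    static final Random rand = new Random()

    @Override
    String filterType() {
        return 'pre' // run before the request is routed
    }

    @Override
    int filterOrder() {
        return 5 // position among the other 'pre' filters
    }

    @Override
    boolean shouldFilter() {
        // Only delay requests that identify themselves as the misbehaving device type
        def deviceType = RequestContext.getCurrentContext().getRequest().getParameter("deviceType")
        return "BrokenDevice".equals(deviceType)
    }

    @Override
    Object run() {
        // sleep() takes milliseconds: wait a random time between 0 and just under 20 seconds
        sleep(rand.nextInt(20000))
        return null
    }
}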
