Netflix: Simian Army

Netflix introduced a pattern called the Simian Army to increase quality within its deployment pipeline. The pattern is also known as the Chaos Monkey Army because it relies on components that inject failures at the application and environment level, generating chaos and instability through unplanned, ad-hoc failure scenarios. These components are named “monkeys”, a name that plays on the common image of monkeys causing mischief.

After selecting an application instance, the chaos monkey randomly chooses one of the available strategies. These strategies are configurable via a chaos.properties file. Details about Netflix's implementation can be found on GitHub.
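The selection step can be sketched roughly as follows. This is a minimal Python illustration, not Netflix's implementation: the property keys of the form `chaos.<strategy>.enabled`, the strategy names, and all function names are hypothetical and chosen only for this example.

```python
import random

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def enabled_strategies(props, prefix="chaos.", suffix=".enabled"):
    """Return the sorted names of all strategies whose flag is set to true."""
    names = []
    for key, value in props.items():
        has_name = len(key) > len(prefix) + len(suffix)
        if has_name and key.startswith(prefix) and key.endswith(suffix):
            if value.lower() == "true":
                names.append(key[len(prefix):-len(suffix)])
    return sorted(names)

def choose_strategy(props, rng=random):
    """Pick one enabled strategy uniformly at random."""
    candidates = enabled_strategies(props)
    if not candidates:
        raise ValueError("no chaos strategy is enabled")
    return rng.choice(candidates)

# Hypothetical chaos.properties content (keys are illustrative only).
EXAMPLE = """
# strategies the monkey may pick from
chaos.shutdowninstance.enabled = true
chaos.burncpu.enabled = true
chaos.filldisk.enabled = false
"""

if __name__ == "__main__":
    print(choose_strategy(parse_properties(EXAMPLE)))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which can help when replaying a failure scenario.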

Instance Shutdown (Simius Mortus)

Shuts down an application instance or the environment.

Block all network traffic (Simius Quies)

Removes all security groups from an instance, and moves it into a security group that does not allow any access. Therefore the instance is running but cannot be reached via the network.

Detach volumes (Simius Amputa)

Force-detaches all volumes from the instance, simulating a volume failure. The instance will be running while having disk I/O failure. Please note that this strategy creates a data loss scenario.

SSH Monkeys

Executes any available script on an instance after logging in via SSH. This strategy can also include attempts to remove files, which verifies that file permissions are set properly (security).

Burn-CPU (Simius Cogitarius)

This monkey runs CPU-intensive processes, simulating a noisy neighbor or a faulty CPU. The instance will effectively have a much slower CPU. This strategy is important for virtualized and cloud environments.
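The idea behind this strategy can be illustrated with a small Python sketch that keeps every core busy for a bounded time. It is an assumption-laden illustration rather than the script the monkey actually runs; the function names `burn_cpu` and `burn_all_cores` are hypothetical.

```python
import multiprocessing
import os
import time

def burn_cpu(seconds):
    """Busy-loop on one core for roughly `seconds`; returns iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations

def burn_all_cores(seconds, workers=None):
    """Start one busy-looping process per core and wait until all finish."""
    workers = workers or os.cpu_count() or 1
    procs = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return workers

if __name__ == "__main__":
    # Keep the demo short; a real experiment would run much longer.
    print("burned", burn_all_cores(0.5), "workers")
```

Bounding the run with a deadline means the load stops on its own, which is a safer default for experiments than an endless loop.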

Burn-IO (Simius Occupatus)

This monkey runs disk-intensive processes, simulating a noisy neighbor or a faulty disk. The instance will effectively have a much slower disk. This strategy is important for database servers or applications which rely on disk I/O, such as big data or business intelligence applications.

Fill Disk (Simius Plenus)

This monkey writes a huge file to the root device, filling up the disk. This can cause failures on multiple layers, such as open file handles for the operating system and middleware, or disk space issues for the persistence layers of middleware and for log files.
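A bounded variant of this strategy might look like the following Python sketch. Unlike the monkey described above, it writes a fixed number of bytes to a caller-supplied path instead of filling the root device, so it can be tried safely; the function name `fill_disk` and its parameters are hypothetical.

```python
import os
import tempfile

def fill_disk(path, total_bytes, chunk_size=1024 * 1024):
    """Write `total_bytes` of zeros to `path` in chunks; returns bytes written."""
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as handle:
        while written < total_bytes:
            step = min(chunk_size, total_bytes - written)
            handle.write(chunk[:step])
            written += step
        # Force the data to disk so the space is really consumed.
        handle.flush()
        os.fsync(handle.fileno())
    return written

if __name__ == "__main__":
    # Demo against a temporary directory with a small, bounded size.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "burn.bin")
        print(fill_disk(target, 5 * 1024 * 1024), "bytes written")
```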

Kill Processes (Simius Delirius)

This monkey kills any Java or Python program it finds every second, simulating a faulty application, a corrupted installation, or a faulty instance. The instance is fine, but the Java/Python application running on it will fail continuously.

Null-Route (Simius Desertus)

This monkey null-routes the given network to terminate any network traffic.
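As an illustration, a null route on Linux can be expressed with a blackhole route from the `ip` tool. The Python sketch below only builds the command and does not execute it; the helper name `null_route_command` and the example network are assumptions, not values taken from Netflix's setup.

```python
import ipaddress

def null_route_command(network):
    """Build an `ip route` argument list that blackholes `network` (CIDR)."""
    # ip_network validates the CIDR and rejects addresses with host bits set.
    net = ipaddress.ip_network(network, strict=True)
    return ["ip", "route", "add", "blackhole", str(net)]

if __name__ == "__main__":
    # Dry run: print the command instead of executing it (root is required).
    print(" ".join(null_route_command("10.0.0.0/8")))
```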

Fail DNS (Simius Nonomenius)

This monkey uses iptables to block port 53 for TCP and UDP connections (the DNS traffic port). This simulates a DNS server failure.
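The rules involved could look like the ones built by the following Python sketch. It assumes that outgoing DNS queries should be dropped, so it uses the OUTPUT chain; the chain choice and the helper names are assumptions for illustration, and the actual monkey may use different rules.

```python
def dns_block_commands(chain="OUTPUT", port=53):
    """Build iptables argument lists that drop TCP and UDP traffic to `port`."""
    return [
        ["iptables", "-A", chain, "-p", proto, "--dport", str(port), "-j", "DROP"]
        for proto in ("tcp", "udp")
    ]

def dns_unblock_commands(chain="OUTPUT", port=53):
    """Build the matching delete (-D) rules that undo the block."""
    return [
        ["iptables", "-D", chain, "-p", proto, "--dport", str(port), "-j", "DROP"]
        for proto in ("tcp", "udp")
    ]

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them (root is required).
    for command in dns_block_commands() + dns_unblock_commands():
        print(" ".join(command))
```

Generating the undo rules alongside the block rules makes it easier to return the instance to a healthy state once the experiment ends.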

Fail API (Simius Amnesius)

This monkey puts dummy host entries into /etc/hosts, so all communication with the affected API endpoints will fail. Other system settings might need to be manipulated to affect applications which resolve names without the hosts file; see the method of Simius Nonomenius.
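To make the mechanism concrete, the sketch below generates hosts-file lines that map API hostnames to an unreachable address. The hostnames are placeholders, the address 192.0.2.1 comes from a range reserved for documentation, and the helper names are hypothetical; the demo appends to a temporary file rather than to /etc/hosts.

```python
import os
import tempfile

def dummy_host_entries(hostnames, address="192.0.2.1"):
    """Return hosts-file text that maps every hostname to a dead address."""
    return "".join(f"{address} {name}\n" for name in hostnames)

def append_entries(hosts_path, hostnames, address="192.0.2.1"):
    """Append the dummy entries to the file at `hosts_path`."""
    with open(hosts_path, "a", encoding="utf-8") as handle:
        handle.write(dummy_host_entries(hostnames, address))

if __name__ == "__main__":
    # Demo on a scratch copy; a real run would target /etc/hosts as root.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "hosts")
        with open(scratch, "w", encoding="utf-8") as handle:
            handle.write("127.0.0.1 localhost\n")
        append_entries(scratch, ["api.example.com", "storage.example.com"])
        with open(scratch, encoding="utf-8") as handle:
            print(handle.read(), end="")
```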

Network Corruption (Simius Politicus)

This monkey uses traffic shaping to corrupt a large fraction of network packets. This simulates a degradation of the network.

Network Latency (Simius Tardus)

This monkey uses traffic shaping to introduce latency (e.g. 1 second +/- 50%) to all network packets, which simulates a degradation of the network.

Network Loss (Simius Perditus)

This monkey uses traffic shaping to drop a fraction of all network packets, which simulates a degradation of the network.
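On Linux, traffic shaping of this kind is commonly done with the netem queueing discipline of `tc`. The following Python sketch builds such commands for the three network strategies (corruption, latency, loss) without executing them. The latency defaults follow the 1 second +/- 50% example; the device name `eth0`, the corruption and loss percentages, and all helper names are assumptions made for illustration.

```python
def netem_command(kind, *params, device="eth0"):
    """Build a `tc` argument list that attaches a netem qdisc to `device`."""
    if kind not in ("delay", "loss", "corrupt"):
        raise ValueError(f"unsupported netem option: {kind}")
    return ["tc", "qdisc", "add", "dev", device, "root", "netem", kind, *params]

def latency(mean_ms=1000, jitter_fraction=0.5, device="eth0"):
    """Delay all packets by `mean_ms`, varied by +/- `jitter_fraction`."""
    jitter_ms = int(mean_ms * jitter_fraction)
    return netem_command("delay", f"{mean_ms}ms", f"{jitter_ms}ms", device=device)

def loss(percent, device="eth0"):
    """Drop the given percentage of packets."""
    return netem_command("loss", f"{percent}%", device=device)

def corruption(percent, device="eth0"):
    """Corrupt the given percentage of packets."""
    return netem_command("corrupt", f"{percent}%", device=device)

def reset(device="eth0"):
    """Remove the root qdisc again, restoring normal traffic."""
    return ["tc", "qdisc", "del", "dev", device, "root"]

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them (root is required).
    for command in (latency(), loss(7), corruption(5), reset()):
        print(" ".join(command))
```

Only one root qdisc can be attached to a device at a time, so each experiment should call the `reset` command before the next strategy is applied.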