There are three types of scalability issues that need to be addressed to scale a system globally. These are:
- Network scalability & service discovery
- Compute scalability & virtualization
- Storage scalability
One will see that organizations that offer cloud as a service address all three of these dimensions.
Network Scalability:
(A) Load Balancers
Refer to this blog, which explains how modern L4/L7 load balancers work:
https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236. I have also seen L3 load balancers used via DNS (e.g. UltraDNS SiteBacker pools).
In summary I have seen load balancers of the following types:
- Proxy based load balancers
- L3 load balancing: DNS based load balancing via pools (round-robin) or via mapping changes (Akamai), or via Anycast (See this for how BGP makes this happen: https://www.imperva.com/blog/how-anycast-works/)
- L4 load balancing via HAProxy (SSL termination via NGINX)
- L7 load balancing via HAProxy and a sidecar like Muttley (Uber), which is essentially based on health checks, traffic-controller rules, and Zookeeper nodes that are maintained at a /zone/service/ level and updated when a particular service is deployed to a machine.
- Client-side load balancers
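The client-side approach from the list above can be sketched as follows: the client itself holds the backend list and rotates through healthy instances. This is a minimal illustrative sketch, not any particular library's API; the `Backend` class and `mark_unhealthy` method are hypothetical names.

```python
# Hypothetical sketch of a client-side load balancer: the client holds the
# backend list and round-robins over the instances it believes are healthy.
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    host: str
    healthy: bool = True


class ClientSideBalancer:
    def __init__(self, backends):
        self._backends = backends
        self._counter = count()  # monotonically increasing pick index

    def mark_unhealthy(self, host):
        """Called by the client when a health check or request fails."""
        for b in self._backends:
            if b.host == host:
                b.healthy = False

    def pick(self):
        """Round-robin over currently healthy backends."""
        healthy = [b for b in self._backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends")
        return healthy[next(self._counter) % len(healthy)].host
```

In a real deployment the backend list would come from service discovery (below) rather than being hard-coded, and health state would be refreshed by periodic checks.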
(B) Service Discovery
When a service is deployed on a machine, it needs to be discoverable. This can be done in the following ways:
- DNS based service discovery such as Mesos-DNS
- DNS based service discovery using SRV records (See this https://docs.citrix.com/en-us/citrix-adc/13/dns/service-discovery-using-dns-srv-records.html)
- Zookeeper based service discovery
Storage Scalability:
Refer to http://www.cloudbus.org/reports/DistributedStorageTaxonomy.pdf for a taxonomy of Distributed Storage Systems (DSS).
In summary, distributed storage can be looked at from different perspectives. From the point of view of "functionality", the categorization is as follows:
- Archival: Provides persistent nonvolatile storage. Achieving reliability, even in the event of failure, supersedes all other objectives, and data replication is a key instrument in achieving this.
- General-purpose filesystem: A persistent, nonvolatile, POSIX-compliant filesystem, e.g. NFS, CODA, xFS
- Publish/Share: More volatile; think peer-to-peer
- Performance: Operates in parallel over a fast network, typically striping data, e.g. Zebra
- Federation middleware: Brings together various filesystems under a single API
- Custom: e.g. GFS (a combination of many of the things above)
Example:
- DHT: Store the keys associated with a node in that node's DNS records (e.g. a TXT record); the node info is obtained via the SRV record for that node (refer to: https://labs.spotify.com/2013/02/25/in-praise-of-boring-technology/)
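Core to any DHT is mapping a key to the node that owns it. A minimal consistent-hashing ring sketch, under illustrative assumptions (MD5 as the ring hash, no virtual nodes):

```python
# Minimal consistent-hashing ring for a DHT: nodes and keys hash onto the
# same ring, and a key is owned by the first node at or after its hash.
import bisect
import hashlib


def _ring_hash(s):
    # MD5 is an illustrative choice; any uniform hash works for the sketch.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_ring_hash(n), n) for n in nodes)

    def owner(self, key):
        """Node responsible for `key`: first node clockwise from its hash."""
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, _ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Consistent hashing matters here because adding or removing a node only remaps the keys adjacent to it on the ring, rather than rehashing everything.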
Compute Scalability:
There are four main categories of cluster workloads (ref: https://eng.uber.com/peloton/):
- Stateless jobs
- Stateful jobs
- Batch jobs
- Daemon jobs
The idea is to have these jobs scheduled diversely across a cluster. This is done using tools such as Borg, YARN (with the industry slowly moving to Spark), Mesos, and Kubernetes.
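The "schedule diversely" idea can be sketched as a toy placement policy that mixes workload categories on each node rather than packing one category together. The slot limit and tie-breaking policy are illustrative assumptions, not how any of the schedulers above actually work.

```python
# Toy scheduler sketch: place mixed workload categories (stateless, stateful,
# batch, daemon) so each node gets a diverse mix rather than one job type.
from collections import defaultdict

SLOTS_PER_NODE = 4  # illustrative capacity limit


def schedule(jobs, nodes):
    """Place each (name, category) job on the node with the fewest jobs of
    that category, breaking ties by overall load."""
    placement = defaultdict(list)  # node -> [(name, category), ...]
    for name, category in jobs:
        candidates = [n for n in nodes if len(placement[n]) < SLOTS_PER_NODE]
        if not candidates:
            raise RuntimeError("cluster is full")
        node = min(candidates, key=lambda n: (
            sum(1 for _, c in placement[n] if c == category),  # diversity
            len(placement[n])))                                # then load
        placement[node].append((name, category))
    return dict(placement)
```

Real schedulers layer on preemption, constraints, and resource dimensions (CPU, memory), but the core loop of scoring candidate nodes per job is the same shape.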