How sharding works in mongodb
Sharding is a method used to distribute data across multiple servers, allowing a database to scale horizontally by partitioning data into smaller, more manageable pieces called shards. Each shard holds a subset of the total data, and together they form a complete dataset.
Here’s a practical example of how sharding can be implemented in a MongoDB database.
MongoDB Sharding Example
1. Setting Up the Sharded Cluster
In MongoDB, a sharded cluster consists of three main components:
- Shards: These are the data storage units. Each shard holds a subset of the sharded data.
- Config Servers: These store the metadata and configuration settings for the cluster.
- Query Routers (mongos): These act as the interface between client applications and the sharded cluster. They route queries to the appropriate shards.
2. Installing MongoDB
First, you need to have MongoDB installed on your servers. You can follow the installation guide from the MongoDB official documentation.
3. Starting the Shards
Each shard is essentially a MongoDB instance (or replica set in a production environment). For simplicity, we'll start three MongoDB instances as shards.
# Start three MongoDB instances as shards on different ports
mongod --shardsvr --port 27018 --dbpath /data/db/shard1 --logpath /data/log/shard1.log --fork
mongod --shardsvr --port 27019 --dbpath /data/db/shard2 --logpath /data/log/shard2.log --fork
mongod --shardsvr --port 27020 --dbpath /data/db/shard3 --logpath /data/log/shard3.log --fork
4. Starting the Config Servers
The config servers store metadata about the cluster. Start three config servers:
# Start three config servers on different ports
mongod --configsvr --port 27021 --dbpath /data/db/config1 --logpath /data/log/config1.log --fork
mongod --configsvr --port 27022 --dbpath /data/db/config2 --logpath /data/log/config2.log --fork
mongod --configsvr --port 27023 --dbpath /data/db/config3 --logpath /data/log/config3.log --fork
5. Starting the Query Routers (mongos)
The query routers route queries from clients to the appropriate shards. Start a query router and connect it to the config servers:
# Start a query router and connect it to the config servers
mongos --configdb configReplSet/localhost:27021,localhost:27022,localhost:27023 --logpath /data/log/mongos.log --fork
6. Configuring the Sharded Cluster
Connect to the mongos
instance and add the shards:
# Connect to the mongos instance
mongo --port 27017
# Add shards to the cluster
sh.addShard("localhost:27018")
sh.addShard("localhost:27019")
sh.addShard("localhost:27020")
7. Enabling Sharding for a Database and Collection
Enable sharding for a database and a specific collection:
# Enable sharding for a database
sh.enableSharding("myDatabase")
# Shard a specific collection
sh.shardCollection("myDatabase.myCollection", { "shardKey": 1 })
Example Use Case: Sharding a User Collection
Imagine you have a collection of users and you want to distribute them across multiple shards based on their user IDs. Here’s how you could shard the users
collection:
# Connect to the mongos instance
mongo --port 27017
# Enable sharding for the 'users' collection
sh.enableSharding("myDatabase")
# Shard the 'users' collection on the 'userId' field
sh.shardCollection("myDatabase.users", { "userId": 1 })
How Sharding Works Internally
- Data Distribution: When a new document is inserted into the sharded collection, MongoDB uses the shard key to determine which shard the document should reside in.
- Query Routing: When a query is executed, the
mongos
router uses the shard key to route the query to the appropriate shard(s). If the shard key is not specified in the query,mongos
may need to query all shards. - Balancing: MongoDB periodically checks the distribution of data and moves chunks (ranges of data) between shards to ensure an even distribution and to prevent any single shard from becoming a bottleneck.