Published 2015-09-14 14:54:55 | 155 views | Comments: 0 | Source: compiled from the web
Sharding is MongoDB’s approach to scaling out. Sharding partitions a collection and stores the different portions on different machines. When a database’s collections become too large for existing storage, you need only add a new machine. Sharding automatically distributes collection data to the new server.
Sharding automatically balances data and load across machines. Sharding provides additional write capacity by distributing the write load over a number of mongod instances. Sharding allows users to increase the potential amount of data in the working set.
To run sharding, you set up a sharded cluster. For a description of sharded clusters, see Sharded Cluster Administration.
Within a sharded cluster, you enable sharding on a per-database basis. After enabling sharding for a database, you choose which collections to shard. For each sharded collection, you specify a shard key.
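The two steps above (enable sharding for a database, then shard a collection on a key) correspond to the `enableSharding` and `shardCollection` admin commands. The following is a minimal sketch in Python that only builds the command documents and does not connect to anything. The database `records`, the collection `people`, and the key field `zipcode` are hypothetical names chosen for illustration.

```python
def build_sharding_commands(database, collection, shard_key):
    """Return the two admin command documents, in the order they must run.

    shard_key is a dict of field name -> 1 (ascending), e.g. {"zipcode": 1}.
    """
    namespace = f"{database}.{collection}"
    return [
        # Step 1: enable sharding for the whole database.
        {"enableSharding": database},
        # Step 2: shard one collection in that database on the chosen key.
        {"shardCollection": namespace, "key": dict(shard_key)},
    ]


# Hypothetical names, for illustration only.
commands = build_sharding_commands("records", "people", {"zipcode": 1})

# With pymongo and a client connected to a mongos (assumed, not executed here):
#   for cmd in commands:
#       client.admin.command(cmd)
```

Building the documents separately from sending them keeps the sketch runnable without a cluster. In practice the commands would be issued through a mongos, not directly against a shard.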
The shard key determines the distribution of the collection’s documents among the cluster’s shards. The shard key is a field that exists in every document in the collection. MongoDB distributes documents according to ranges of values in the shard key. A given shard holds documents for which the shard key falls within a specific range of values. Shard keys, like indexes, can be either a single field or multiple fields.
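As a rough illustration of range-based distribution, the sketch below maps a shard key value to the shard that owns its range. The boundaries and shard names are hypothetical, and the model is simplified to one contiguous range per shard. Lower bounds are treated as inclusive and upper bounds as exclusive.

```python
from bisect import bisect_right

# Hypothetical layout: each shard owns one contiguous range of shard key values.
# boundaries[i] is the inclusive lower bound of the range owned by shards[i + 1].
boundaries = [1000, 5000]
shards = ["shard0", "shard1", "shard2"]


def shard_for(key_value):
    """Return the shard whose range contains key_value."""
    # bisect_right counts how many boundaries are <= key_value, which is
    # exactly the index of the owning shard.
    return shards[bisect_right(boundaries, key_value)]


print(shard_for(999))   # falls below the first boundary
print(shard_for(1000))  # a boundary value belongs to the range it opens
```

Here `shard0` owns values below 1000, `shard1` owns 1000 up to but not including 5000, and `shard2` owns everything from 5000 upward.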
Within a shard, MongoDB further partitions documents into chunks. Each chunk represents a smaller range of values within the shard’s range. When a chunk grows beyond the chunk size, MongoDB splits the chunk into smaller chunks, always based on ranges in the shard key.
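The splitting behaviour can be sketched with a toy model. This is not MongoDB's actual split algorithm: it uses a document count (`max_docs`, a hypothetical parameter) as a stand-in for the chunk size in bytes, and it assumes a split at the median key.

```python
def split_chunk(keys, max_docs):
    """Split a sorted list of shard key values into chunks of at most max_docs.

    A chunk that is too large is cut at its median key, and each half is
    split again if it is still too large.
    """
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    if len(keys) <= max_docs:
        return [keys]
    mid = len(keys) // 2
    return split_chunk(keys[:mid], max_docs) + split_chunk(keys[mid:], max_docs)


chunks = split_chunk(list(range(10)), max_docs=3)
# Each chunk still covers one contiguous key range: (min key, max key).
ranges = [(c[0], c[-1]) for c in chunks]
print(ranges)  # [(0, 1), (2, 4), (5, 6), (7, 9)]
```

Every resulting chunk covers a contiguous, non-overlapping range of shard key values, which is the property the text describes.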
Choosing the correct shard key can have a great impact on the performance, capability, and functioning of your database and cluster. Appropriate shard key choice depends on the schema of your data and the way that your application queries and writes data to the database.
The ideal shard key:
The challenge when selecting a shard key is that there is not always an obvious choice. Often, an existing field in your collection may not be the optimal key. In those situations, computing a special purpose shard key into an additional field or using a compound shard key may help produce one that is more ideal.
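One possible way to compute a special purpose shard key into an additional field is sketched below. The field names `user_id`, `created_at`, and `shard_key`, and the choice of a short hash prefix, are assumptions made for illustration. They are not a recommendation from the text.

```python
import hashlib


def with_computed_shard_key(doc):
    """Return a copy of doc with an extra, precomputed shard key field."""
    digest = hashlib.sha256(doc["user_id"].encode("utf-8")).hexdigest()
    enriched = dict(doc)
    # Hypothetical field: a short hash prefix derived from an existing field.
    enriched["shard_key"] = digest[:4]
    return enriched


doc = {"user_id": "alice", "created_at": 1442242495}
enriched = with_computed_shard_key(doc)

# A compound shard key specification combining the computed field with an
# existing one (field name -> 1 for ascending).
compound_key = {"shard_key": 1, "created_at": 1}
```

The application would have to populate the computed field on every insert, because the shard key must exist in every document in the collection.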
Balancing is the process MongoDB uses to redistribute data within a sharded cluster. When a shard has too many chunks when compared to other shards, MongoDB automatically balances the shards. MongoDB balances the shards without intervention from the application layer.
The balancing process attempts to minimize the impact that balancing can have on the cluster, by:
You may disable the balancer on a temporary basis for maintenance and limit the window during which it runs to prevent the balancing process from impacting production traffic.
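The idea behind balancing can be sketched as a loop that moves one chunk at a time from the shard with the most chunks to the shard with the fewest. This is a simplified model, not the real balancer: the single `threshold` parameter and the shard names are assumptions, and actual migrations involve copying data between shards.

```python
def balance(chunk_counts, threshold):
    """Move one chunk at a time from the fullest to the emptiest shard.

    chunk_counts maps shard name -> number of chunks. Balancing stops once
    the gap between the fullest and emptiest shard is below `threshold`.
    Returns (final counts, list of (source, destination) migrations).
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2 for the loop to settle")
    counts = dict(chunk_counts)
    migrations = []
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] < threshold:
            return counts, migrations
        counts[fullest] -= 1
        counts[emptiest] += 1
        migrations.append((fullest, emptiest))


final_counts, migrations = balance({"shard0": 10, "shard1": 2}, threshold=2)
print(final_counts)     # {'shard0': 6, 'shard1': 6}
print(len(migrations))  # 4
```

Moving a single chunk per step mirrors the goal of limiting the impact of balancing on the cluster.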
Note
The balancing procedure for sharded clusters is entirely transparent to the user and application layer. This documentation is only included for your edification and possible troubleshooting purposes.
While sharding is a powerful and compelling feature, it comes with significant infrastructure requirements (see Infrastructure Requirements for Sharded Clusters) and some added complexity. As a result, use sharding only as necessary, and when indicated by actual operational requirements. The following overview lists indications that it may be time to consider sharding.
You should consider deploying a sharded cluster if:
If these attributes are not present in your system, sharding will only add additional complexity to your system without providing much benefit. When designing your data model, if you will eventually need a sharded cluster, consider which collections you will want to shard and the corresponding shard keys.
Warning
It takes time and resources to deploy sharding, and if your system has already reached or exceeded its capacity, you will have a difficult time deploying sharding without impacting your application.
As a result, if you think you will need to partition your database in the future, do not wait until your system is over capacity to enable sharding.
A sharded cluster has the following components:
Three config servers.
These special mongod instances store the metadata for the cluster. The mongos instances cache this data and use it to determine which shard is responsible for which chunk.
For development and testing purposes you may deploy a cluster with a single configuration server process, but always use exactly three config servers for redundancy and safety in production.
Two or more shards. Each shard consists of one or more mongod instances that store the data for the shard.
These “normal” mongod instances hold all of the actual data for the cluster.
Typically each shard is a replica set. Each replica set consists of multiple mongod instances. The members of the replica set provide redundancy and high availability for the data in each shard.
Warning
MongoDB enables data partitioning, or sharding, on a per collection basis. You must access all data in a sharded cluster via the mongos instances as below. If you connect directly to a mongod in a sharded cluster you will see its fraction of the cluster’s data. The data on any given shard may be somewhat random: MongoDB provides no guarantee that any two contiguous chunks will reside on a single shard.
One or more mongos instances.
These instances direct queries from the application layer to the shards that hold the data. The mongos instances have no persistent state or data files and only cache metadata in RAM from the config servers.
Note
In most situations mongos instances use minimal resources, and you can run them on your application servers without impacting application performance. However, if you use the aggregation framework some processing may occur on the mongos instances, causing that mongos to require more system resources.
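How a mongos might use its cached metadata to decide which shards receive a query can be sketched as follows. The `chunk_map` contents and function names are hypothetical, and the model ignores everything a real mongos does beyond looking up chunk ranges.

```python
# Hypothetical cached metadata: (lower bound inclusive, upper bound exclusive, shard).
# Note that shard0 owns two chunks that are not contiguous.
chunk_map = [
    (float("-inf"), 100, "shard0"),
    (100, 200, "shard1"),
    (200, 300, "shard0"),
    (300, float("inf"), "shard2"),
]


def shards_for_range(low, high):
    """Return the shards that own chunks overlapping the key range [low, high)."""
    return sorted({shard for lo, hi, shard in chunk_map if lo < high and low < hi})


def shards_for_query(key_range=None):
    """Route a query; without a shard key range it must go to every shard."""
    if key_range is None:
        return sorted({shard for _, _, shard in chunk_map})
    return shards_for_range(*key_range)


print(shards_for_query((120, 180)))  # ['shard1']
print(shards_for_query((150, 250)))  # ['shard0', 'shard1']
print(shards_for_query())            # ['shard0', 'shard1', 'shard2']
```

Queries that include the shard key can be sent to a subset of shards, while queries without it have to be broadcast to all of them.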
Your cluster must manage a significant quantity of data for sharding to have an effect on your collection. The default chunk size is 64 megabytes, and the balancer will not begin moving data until the imbalance of chunks in the cluster exceeds the migration threshold.
Practically, this means that unless your cluster has many hundreds of megabytes of data, chunks will remain on a single shard.
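The arithmetic behind this can be sketched as follows. The chunk size of 64 megabytes comes from the text. The migration threshold value of 8 used in the example is an assumption for illustration only, and the estimate treats every chunk as full, which real chunks are not.

```python
import math

CHUNK_SIZE_MB = 64  # default chunk size stated in the text


def estimated_chunks(data_mb, chunk_size_mb=CHUNK_SIZE_MB):
    """Rough estimate: number of chunks if every chunk were completely full."""
    return max(1, math.ceil(data_mb / chunk_size_mb))


def balancer_would_act(data_mb, migration_threshold):
    """All chunks start on one shard, so the imbalance equals the chunk count."""
    return estimated_chunks(data_mb) > migration_threshold


ASSUMED_THRESHOLD = 8  # hypothetical value, not taken from the text

print(estimated_chunks(100))                       # 2
print(balancer_would_act(100, ASSUMED_THRESHOLD))  # False
print(estimated_chunks(1024))                      # 16
print(balancer_would_act(1024, ASSUMED_THRESHOLD)) # True
```

Under these assumptions, 100 megabytes of data yields only 2 chunks and stays on one shard, while about a gigabyte yields enough chunks for the balancer to start moving data.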
While there are some exceptional situations where you may need to shard a small collection of data, most of the time sharding a small collection is not worth the additional complexity and overhead unless you need additional concurrency or capacity. If you have a small data set, a properly configured single MongoDB instance or replica set will usually be more than sufficient for your persistence layer needs.
Chunk size is user configurable. However, the default value of 64 megabytes is ideal for most deployments. See the Chunk Size section in the Sharded Cluster Internals and Behaviors document for more information.