This document answers common questions about horizontal scaling using MongoDB’s sharding.
If you don’t find the answer you’re looking for, check the complete list of FAQs or post your question to the MongoDB User Mailing List.
Frequently Asked Questions:
Sometimes.
If your data set fits on a single server, you should begin with an unsharded deployment.
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
Still, all production deployments should use replica sets to provide high availability and disaster recovery.
To use replication with sharding, deploy each shard as a replica set.
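For example, assuming a replica set named rs0 with a member at the hypothetical host mongodb0.example.net:27017, you can add it as a shard from the mongo shell:

sh.addShard("rs0/mongodb0.example.net:27017")   // add the replica set rs0 as a shard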
No.
There is no automatic support in MongoDB for changing a shard key after sharding a collection. This underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to dump all data from MongoDB into an external format, drop the original sharded collection, configure sharding using a more ideal shard key, pre-split the shard key range to ensure an even initial distribution, and then restore the dumped data into MongoDB.
See shardCollection, sh.shardCollection(), Sharded Cluster Administration, the Shard Key section in the Sharded Cluster Internals and Behaviors document, Deploy a Sharded Cluster, and SERVER-4000 for more information.
In the current implementation, all databases in a sharded cluster have a “primary shard.” All unsharded collections within that database will reside on the same shard.
Sharding must be specifically enabled on a collection. After enabling sharding on the collection, MongoDB will assign various ranges of collection data to the different shards in the cluster. The cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another.
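As a minimal sketch, assuming a database named records and a collection named people sharded on a hypothetical zipcode field, you would run the following from a mongos:

sh.enableSharding("records")                           // enable sharding for the database
sh.shardCollection("records.people", { zipcode: 1 })   // shard the collection on zipcode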
The mongos routes the operation to the “old” shard, where it will succeed immediately. Then the shard mongod instances will replicate the modification to the “new” shard before the sharded cluster updates that chunk’s “ownership,” which effectively finalizes the migration process.
If a shard is inaccessible or unavailable, queries will return with an error.
However, a client may set the partial query bit, which will then return results from all available shards, regardless of whether a given shard is unavailable.
If a shard is responding slowly, mongos will merely wait for the shard to return results.
Changed in version 2.0.
The exact method for distributing queries to shards in a cluster depends on the nature of the query and the configuration of the sharded cluster. Consider a sharded collection, using the shard key user_id, that has last_login and email attributes:
For a query that selects one or more values for the user_id key:
mongos determines which shard or shards contain the relevant data based on the cluster metadata, directs the query to the required shard or shards, and returns those results to the client.
For a query that selects user_id and also performs a sort:
mongos can make a straightforward translation of this operation into a number of queries against the relevant shards, ordered by user_id. When the sorted queries return from all shards, the mongos merges the sorted results and returns the complete result to the client.
For queries that select on last_login:
These queries must run on all shards: mongos must parallelize the query over the shards and perform a merge-sort on the email field of the documents found.
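The following shell queries, assuming a users collection sharded on user_id with hypothetical field values, illustrate these cases:

db.users.find({ user_id: 120 })                                 // targeted to the shard(s) that own user_id 120
db.users.find({ user_id: { $gt: 100 } }).sort({ user_id: 1 })   // targeted; mongos merge-sorts the per-shard results
db.users.find({ last_login: { $gt: new Date("2012-01-01") } })  // no shard key: scatter/gather across all shards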
If you call the cursor.sort() method on a query in a sharded environment, the mongod for each shard will sort its results, and the mongos merges each shard’s results before returning them to the client.
If you do not use _id as the shard key, then your application/client layer must be responsible for keeping the _id field unique. It is problematic for collections to have duplicate _id values.
If you’re not sharding your collection by the _id field, then you should be sure to store a globally unique identifier in that field. The default BSON ObjectID works well in this case.
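For example, letting the shell generate the default ObjectId for _id on insert (the collection and field names here are hypothetical):

db.events.insert({ type: "click", ts: new Date() })   // _id receives a unique ObjectId automatically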
First, ensure that you’ve declared a shard key for your collection. Until you have configured the shard key, MongoDB will not create chunks, and sharding will not occur.
Next, keep in mind that the default chunk size is 64 MB. As a result, in most situations, the collection needs at least 64 MB before a migration will occur.
Additionally, the system which balances chunks among the servers attempts to avoid superfluous migrations. Depending on the number of shards, your shard key, and the amount of data, systems often require at least 10 chunks of data to trigger migrations.
You can run db.printShardingStatus() to see all the chunks present in your cluster.
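For example, from a mongos:

db.printShardingStatus()   // prints databases, shards, and chunk ranges
sh.status()                // equivalent helper in the sh namespace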
Yes. mongod creates these files as backups during normal shard balancing operations.
Once these migrations are complete, you may delete these files.
You can set noMoveParanoia to true to disable this behavior.
Each client maintains a connection to a mongos instance. Each mongos instance maintains a pool of connections to the members of a replica set supporting the sharded cluster. Clients use connections between mongos and mongod instances one at a time. Requests are not multiplexed or pipelined. When client requests complete, the mongos returns the connection to the pool.
See the System Resource Utilization section of the Linux ulimit Settings document.
mongos uses a set of connection pools to communicate with each shard. These pools do not shrink when the number of clients decreases.
This can lead to an unused mongos with a large number of open connections. If the mongos is no longer in use, it is safe to restart the process to close existing connections.
Connect to the mongos with the mongo shell, and run the following command:
db._adminCommand("connPoolStats");
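The same statistics are available through the public helper:

db.adminCommand("connPoolStats")   // reports this mongos's outgoing connection pools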
The writeback listener is a process that opens a long poll to relay writes back from a mongod or mongos after migrations to make sure they have not gone to the wrong server. The writeback listener sends writes back to the correct server if necessary.
These messages are a key part of the sharding infrastructure and should not cause concern.
Failed migrations require no administrative intervention. Chunk moves are consistent and deterministic.
If a migration fails to complete for some reason, the cluster will retry the operation. When the migration completes successfully, the data will reside only on the new shard.
See Manage Config Servers, which describes this process.
mongos instances maintain a cache of the config database that holds the metadata for the sharded cluster. This metadata includes the mapping of chunks to shards.
mongos updates its cache lazily by issuing a request to a shard and discovering that its metadata is out of date. There is no way to control this behavior from the client, but you can run the flushRouterConfig command against any mongos to force it to refresh its cache.
The mongos instances will detect these changes without intervention over time. However, if you want to force the mongos to reload its configuration, run the flushRouterConfig command against each mongos directly.
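For example, while connected to a mongos:

db.adminCommand({ flushRouterConfig: 1 })   // force this mongos to reload the cluster metadata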
The maxConns option limits the number of connections accepted by mongos.
If your client driver or application creates a large number of connections but allows them to time out rather than closing them explicitly, then it might make sense to limit the number of connections at the mongos layer.
Set maxConns to a value slightly higher than the maximum number of connections that the client creates, or the maximum size of the connection pool. This setting prevents the mongos from causing connection spikes on the individual shards. Spikes like these may disrupt the operation and memory allocation of the sharded cluster.
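For example, a mongos started as follows accepts at most 500 incoming connections (the config server hostname is hypothetical):

mongos --configdb cfg0.example.net:27019 --maxConns 500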
If the query does not include the shard key, the mongos must send the query to all shards as a “scatter/gather” operation. Each shard will, in turn, use either the shard key index or another more efficient index to fulfill the query.
If the query includes multiple sub-expressions that reference the fields indexed by the shard key and the secondary index, the mongos can route the queries to a specific shard and the shard will use the index that will allow it to fulfill most efficiently. See this document for more information.
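For example, with a collection sharded on user_id and a secondary index on email (field names as in the earlier example):

db.users.find({ user_id: 120, email: "user@example.com" })   // routed to one shard using the shard key
db.users.find({ email: "user@example.com" })                 // scatter/gather; each shard can use the email index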
Shard keys can be random. Random keys ensure optimal distribution of data across the cluster.
Sharded clusters attempt to route queries to specific shards when queries include the shard key as a parameter, because these directed queries are more efficient. In many cases, random keys can make it difficult to direct queries to specific shards.
Yes. There is no requirement that documents be evenly distributed by the shard key.
However, documents that have the same shard key value must reside in the same chunk and therefore on the same server. If your sharded data set has too many documents with the exact same shard key value, you will not be able to distribute those documents across your sharded cluster.
You can use any field for the shard key. The _id field is a common shard key.
Be aware that ObjectId() values, which are the default value of the _id field, increment as a timestamp. As a result, when used as a shard key, all new documents inserted into the collection will initially belong to the same chunk on a single shard. Although the system will eventually divide this chunk and migrate its contents to distribute data more evenly, at any moment the cluster can only direct insert operations at a single shard. This can limit the throughput of inserts. If most of your write operations are updates or read operations rather than inserts, this limitation should not impact your performance. However, if you have a high insert volume, this may be a limitation.
If you insert documents with monotonically increasing shard keys, all inserts will initially belong to the same chunk on a single shard. Although the system will eventually divide this chunk and migrate its contents to distribute data more evenly, at any moment the cluster can only direct insert operations at a single shard. This can limit the throughput of inserts.
If most of your write operations are updates or read operations rather than inserts, this limitation should not impact your performance. However, if you have a high insert volume, a monotonically increasing shard key may be a limitation.
To address this issue, you can use a field with a value that stores the hash of a key with an ascending value. While you can compute a hashed value in your application and include this value in your documents for use as a shard key, the SERVER-2001 issue will implement this capability within MongoDB.
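As a minimal sketch of the application-side approach (the database, collection, and field names are hypothetical, and sharding is assumed to already be enabled for the database), you could store a hash of the ascending value and shard on that field:

sh.shardCollection("records.events", { ts_hash: 1 })       // shard the (empty) collection on the hashed field
var id = ObjectId()                                         // ascending value
db.events.insert({ _id: id, ts_hash: hex_md5(id.str) })    // store a hash of the ascending value for even distribution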
Consider the following error messages:

ERROR: moveChunk commit failed: version is at <n>|<nn> instead of <N>|<NN>
ERROR: TERMINATING
mongod issues this message if, during a chunk migration, the shard could not connect to the config database to update chunk information at the end of the migration process. If the shard cannot update the config database after moveChunk, the cluster will have an inconsistent view of all chunks. In these situations, the primary member of the shard will terminate itself to prevent data inconsistency. If the secondary member can access the config database, the shard’s data will be accessible after an election. Administrators will need to resolve the chunk migration failure independently.
If you encounter this issue, contact the MongoDB User Group or 10gen support.
The sharded cluster balancing process controls both migrating chunks from decommissioned shards (i.e., draining) and normal cluster balancing activities. Consider the following behaviors for different versions of MongoDB in situations where you remove a shard in a cluster with an uneven chunk distribution: