
MongoDB

Module mongodb

Certified

Important Capabilities

Capability | Status | Notes
--- | --- | ---
Table-Level Lineage | Enabled by default |

This plugin extracts the following:

  • Databases and associated metadata
  • Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null scans the entire collection instead. Setting useRandomSampling: False takes the first documents found rather than a random sample, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.

Very large schemas are truncated to a maximum of 300 schema fields. This limit is configurable using the maxSchemaSize parameter.
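The sampling and truncation behaviour described above can be illustrated with a small, self-contained sketch. This is not the plugin's actual implementation: the function name infer_schema, its parameters, and the choice to keep the most frequently seen fields when truncating are all assumptions made for illustration, and it works on plain Python dictionaries rather than a live MongoDB collection.

```python
import random
from collections import Counter


def infer_schema(documents, sample_size=1000, use_random_sampling=True,
                 max_schema_size=300, seed=None):
    """Infer a flat {field_path: [type names]} schema from a list of dicts.

    Hypothetical helper; the parameter names only mirror the recipe options.
    """
    docs = list(documents)

    # sample_size=None mimics schemaSamplingSize: null (scan everything).
    if sample_size is None or sample_size >= len(docs):
        sample = docs
    elif use_random_sampling:
        sample = random.Random(seed).sample(docs, sample_size)
    else:
        # Mimics useRandomSampling: False (take the first documents found).
        sample = docs[:sample_size]

    field_counts = Counter()
    field_types = {}

    def walk(value, prefix):
        # Record every key, using dotted paths for nested documents.
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            field_counts[path] += 1
            field_types.setdefault(path, set()).add(type(child).__name__)
            if isinstance(child, dict):
                walk(child, path)

    for doc in sample:
        walk(doc, "")

    # Assumption: when truncating, keep the most frequently seen fields.
    kept = [path for path, _ in field_counts.most_common(max_schema_size)]
    return {path: sorted(field_types[path]) for path in kept}


docs = [
    {"_id": 1, "name": "a", "address": {"city": "x"}},
    {"_id": 2, "name": "b"},
    {"_id": 3, "tags": ["t"]},
]
full_schema = infer_schema(docs, sample_size=None)
truncated = infer_schema(docs, sample_size=None, max_schema_size=2)
first_only = infer_schema(docs, sample_size=1, use_random_sampling=False)
```

Here full_schema covers every field seen in any document, truncated keeps only the two most common fields, and first_only reflects just the first document, which shows how a small non-random sample can miss fields such as tags.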

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
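To make the dot notation concrete, the sketch below expands dotted field names into the nested mapping that would appear under config in the YAML recipe. The helper expand_dotted_keys is hypothetical and written only for illustration, and the pattern value "^orders$" is an invented example.

```python
def expand_dotted_keys(flat):
    """Turn {"a.b": 1} into {"a": {"b": 1}} (hypothetical helper)."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate mappings as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


config = expand_dotted_keys({
    "connect_uri": "mongodb://localhost",
    "collection_pattern.allow": ["^orders$"],      # illustrative pattern
    "collection_pattern.ignoreCase": True,
})
```

In the recipe, collection_pattern.allow therefore corresponds to an allow key nested under collection_pattern, not to a single key containing a literal dot.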

View All Configuration Options
Field [Required] | Type | Description | Default | Notes
--- | --- | --- | --- | ---
authMechanism | string | MongoDB authentication mechanism. | None |
connect_uri | string | MongoDB connection URI. | mongodb://localhost |
enableSchemaInference | boolean | Whether to infer schemas. | True |
maxDocumentSize | integer | | 16793600 |
maxSchemaSize | integer | Maximum number of fields to include in the schema. | 300 |
options | object | Additional options to pass to pymongo.MongoClient(). | None |
password | string | MongoDB password. | None |
schemaSamplingSize | integer | Number of documents to use when inferring the schema. If set to null, all documents will be scanned. | 1000 |
useRandomSampling | boolean | Whether documents for schema inference should be randomly selected. If False, documents are taken from the start of the collection. | True |
username | string | MongoDB username. | None |
env | string | The environment that all assets produced by this connector belong to. | PROD |
collection_pattern | AllowDenyPattern | Regex patterns for collections to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
collection_pattern.allow | array(string) | | None |
collection_pattern.deny | array(string) | | None |
collection_pattern.ignoreCase | boolean | Whether to ignore case sensitivity during pattern matching. | True |
database_pattern | AllowDenyPattern | Regex patterns for databases to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
database_pattern.allow | array(string) | | None |
database_pattern.deny | array(string) | | None |
database_pattern.ignoreCase | boolean | Whether to ignore case sensitivity during pattern matching. | True |
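The allow, deny, and ignoreCase settings for collection_pattern and database_pattern can be pictured with the following sketch. It is an approximation rather than the connector's own code: the function is_allowed is hypothetical, and it assumes that deny patterns take precedence over allow patterns and that patterns are matched from the start of the name using Python's re.match.

```python
import re


def is_allowed(name, allow=(".*",), deny=(), ignore_case=True):
    """Return True if name passes the allow/deny filters (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a name matching any deny pattern is always rejected.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    # Otherwise the name must match at least one allow pattern.
    return any(re.match(pattern, name, flags) for pattern in allow)


default_ok = is_allowed("orders")
denied = is_allowed("system.profile", deny=[r"system\..*"])
case_insensitive = is_allowed("Orders", allow=["^orders$"])
case_sensitive = is_allowed("Orders", allow=["^orders$"], ignore_case=False)
```

With the default values shown in the table (allow everything, deny nothing, ignore case), every database and collection is ingested; narrowing allow or adding deny entries filters the set down.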

Code Coordinates

  • Class Name: datahub.ingestion.source.mongodb.MongoDBSource
  • Browse on GitHub

Questions

If you have any questions about configuring ingestion for MongoDB, feel free to ping us on our Slack.