mongodb: insert if not exists


Question

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don't want to have duplicate documents.
  • I don't want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for each document in update:
      existing_document = collection.find_one(document)
      if not existing_document:
           document['insertion_date'] = now
      else:
           document = existing_document
      document['last_update_date'] = now
      my_collection.save(document)

My problem is that it is very slow (40 mins for less than 100 000 records, and I have millions of them in the update). I am pretty sure there is something builtin for doing this, but the document for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )

Can someone advise how to do it faster?

1
121
6/7/2015 3:20:53 AM

Accepted Answer

Sounds like you want to do an "upsert". MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:

key = {'key':'value'}
data = {'key2':'value2', 'key3':'value3'};
coll.update(key, data, upsert=True); #In python upsert must be passed as a keyword argument

This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.

Before:

{"key":"value", "key2":"Ohai."}

After:

{"key":"value", "key2":"value2", "key3":"value3"}

You can also specify what data you want to write:

data = {"$set":{"key2":"value2"}}

Now your selected document will update the value of "key2" only and leave everything else untouched.

129
12/11/2017 8:18:22 AM

As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)

Set 'insertion_date' using $setOnInsert and 'last_update_date' using $set in your upsert command.

To turn your pseudocode into a working example:

now = datetime.utcnow()
for document in update:
    collection.update_one(
        {"_id": document["_id"]},
        {
            "$setOnInsert": {"insertion_date": now},
            "$set": {"last_update_date": now},
        },
        upsert=True,
    )

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon