Analyzing MongoDB aggregation performance


I am trying to achieve millisecond (< 1 second) response time for an aggregation on a collection. However, it takes 3-5 seconds even for a small data size (~200 MB). The expected data size in the production collection is around 100 GB per shard. I have already checked the following:

- The response time is the same when I tried the query individually on each shard.
- I checked the profiling output and see high timeLockedMicros and numYield.
- I couldn't find anything unusual in the MMS charts either.

I feel I am overlooking something silly here. Any help to analyze this further is highly appreciated. Cluster and collection details are provided below.

Cluster: 6 nodes of 34 GB, 4-core machines (AWS m2.2xlarge)
Data size: 1,285 MB (213 MB per shard)
No. of records: 5.5 million (~1 M per shard)

Sample record

{     "_id" : {         "itemid" : 105182,         "devicetype" : 16,         "itemversionid" : 117971,         "timestamp" : isodate("2014-11-14t00:00:00z"),         "recordtype" : 1     },     "dim1id" : 102260,     "dim2id" : 313,     "dim3id" : 1,     "actiontype" : {         "1" : 66,         "47" : 66,         "42" : 72,         "46" : 130     } } 

Query

db.aggregateCollection.aggregate([
    { "$group" : {
        "_id"   : {},
        "type1" : { "$sum" : "$actionType.1" },
        "type2" : { "$sum" : "$actionType.2" }
    }}
])

Profile stats (from one shard)

"keyupdates" : 0,     "numyield" : 79,     "lockstats" : {         "timelockedmicros" : {             "r" : numberlong(2981456),             "w" : numberlong(0)         },         "timeacquiringmicros" : {             "r" : numberlong(3756),             "w" : numberlong(4)         }     },     "responselength" : 157,     "millis" : 3268,     "execstats" : {      }, 

UPDATE: Thanks for the prompt response, I appreciate it. Regarding the new data model and indexing: I am afraid it is not suitable for our current data because
- 99% of the records have some value for actionType.1, and
- 99% of our queries select actionType.1,
so an index on actionType.k wouldn't help, I guess.

As suggested in #2 & #3, we are doing pre-aggregation using a Spark cluster that updates MongoDB.

A little bit more about the query: the query shared earlier is a sample one, used to benchmark performance. The actual query will have a $match on timestamp and a $group on one or more fields. A typical production query covers 30 days of data; the collection currently has 15 days of data. The objective is sub-second response time for 30 days of data.
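To give an idea, here is a rough sketch of what such a production query would look like (the 30-day window and the grouping field dim1Id are picked just for illustration):

// Illustrative production-style query: $match on the timestamp stored inside _id,
// then $group on one dimension field (dim1Id chosen arbitrarily here).
db.aggregateCollection.aggregate([
    { "$match" : {
        "_id.timestamp" : {
            "$gte" : ISODate("2014-10-15T00:00:00Z"),
            "$lt"  : ISODate("2014-11-15T00:00:00Z")
        }
    }},
    { "$group" : {
        "_id"   : "$dim1Id",
        "type1" : { "$sum" : "$actionType.1" }
    }}
])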

BTW, I did some more analysis today. I took a dump of one shard and restored it in a local MongoDB installed on my MacBook. The same query took 2 seconds (it took 4s on the AWS instance), which didn't make sense because the AWS instance is at least 4 times more powerful than the MacBook (both CPU & memory).
MacBook Air - http://www.cpubenchmark.net/cpu.php?cpu=intel+core+i5-4250u+%40+1.30ghz
AWS m2.2xlarge instance - http://www.cpubenchmark.net/cpu.php?cpu=intel+xeon+e5-2665+%40+2.40ghz

I suspected fragmentation, because the data in the AWS MongoDB instance was populated over the last 15 days through the application. So I re-imported the dump on the AWS MongoDB as a separate collection. The query on this new collection took 2s, comparable to the MacBook speed, so fragmentation is one reason for sure. I am planning to do more research on fragmentation later. But though defragmenting improved performance, the fact that it took the same time as the MacBook still didn't make sense, given the AWS instance is 4 times more powerful. I then looked at CPU utilization and found that the mongod instance uses only 1 CPU (out of 4) during query execution. I am planning to install 4 shards on each machine to work around this. Please let me know if you see a better approach.
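For reference, a rough sketch of that workaround: after starting three additional mongod --shardsvr processes on each host, the new shards would be registered from a mongos (hostnames and ports below are made up):

// Register the extra per-host shards from a mongos (example host/ports).
sh.addShard("node1.example.com:27018")
sh.addShard("node1.example.com:27019")
sh.addShard("node1.example.com:27020")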

One more thing: I know my query has to scan the entire collection, but 2 seconds to scan ~200 MB of data seems high to me. Is this expected, or am I missing something?

Things I'd try:

1) You've organized the data in a way that makes grouping difficult. You may get better results if you organize each document like this:

{     ...     "actiontype" : [{k:1, v:66}, {k:47, v:66}, {k:42, v:72}, {k:46, v:130}] } 

This would allow you to create an index on 'actionType.k'. A $match on that index would reduce the overall dataset to the exact actionTypes you want for the aggregation. The query is:

db.action.aggregate([
    { $unwind : '$actionType' },
    { $group  : { _id : '$actionType.k', t : { $sum : '$actionType.v' } } }
]);
// output
{ "_id" : 46, "t" : 130 }
{ "_id" : 42, "t" : 72 }
{ "_id" : 47, "t" : 66 }
{ "_id" : 1, "t" : 66 }

Then ensureIndex on 'actionType.k'. If you're planning to filter on specific key values, the index will help quite a bit, depending on the density of keys in the documents. If you're planning to sum every key, the index won't help here.
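Roughly, with the remodeled documents that could look like this (the filter on k:1 is just an example):

// Index the key field of the remodeled actionType array.
db.action.ensureIndex({ "actionType.k" : 1 });

// Filter first so only documents containing the wanted actionType are considered,
// then unwind and re-match to drop the other array entries before summing.
db.action.aggregate([
    { $match  : { "actionType.k" : 1 } },
    { $unwind : "$actionType" },
    { $match  : { "actionType.k" : 1 } },
    { $group  : { _id : "$actionType.k", t : { $sum : "$actionType.v" } } }
]);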

2) Pre-aggregate with map-reduce and/or add these on a cron-job/setTimeout schedule. Again, depending on your update cycle and how accurate you need the data to be at any one time, set up something like this (a sketch follows the list):

  • Every hour, process the 'dirty' results
  • Add the current values to a running total
  • Mark them 'clean'

That works well if you're doing an insert-only DB.
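A minimal sketch of that hourly pass, assuming a hypothetical 'dirty' flag on incoming documents, the array-style actionType from #1, and a 'totals' collection for the running sums:

// Hypothetical hourly job: fold 'dirty' documents into running totals, then mark them clean.
db.action.find({ dirty : true }).forEach(function(doc) {
    doc.actionType.forEach(function(a) {
        db.totals.update(
            { _id : a.k },
            { $inc : { total : a.v } },
            { upsert : true }
        );
    });
    db.action.update({ _id : doc._id }, { $set : { dirty : false } });
});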

3) If the key values are changing regularly (updates instead of inserts), you might have better luck doing a changelog insert that happens at the same time as the update to the main collection.

db.changes.insert({ key: 44, change: 2 });
db.changes.insert({ key: 34, change: -2 });

Then routinely empty out the 'changes' collection, summing the values into a different collection.
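Something along these lines (the 'keyTotals' collection name is made up):

// Drain the changelog: fold each change into a per-key running total, then remove it.
db.changes.find().forEach(function(c) {
    db.keyTotals.update(
        { _id : c.key },
        { $inc : { total : c.change } },
        { upsert : true }
    );
    db.changes.remove({ _id : c._id });
});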

