paolo@bimodesign.com | +34 608 61 64 10

NoSQL

        

MongoDB - Aggragate cases

One of the tests of the MongoDB Developer Course was the aggregation framework and these three exercises and my solution. I'm trying to explain every steps for each one of them.
Firt of all, we have three different zip to import into the database, using this command:

mongoimport -d blog -c posts --drop posts.json (example 1)
mongoimport -d test -c zips --drop small_zips.json (example 2)
mongoimport -d test -c zips --drop zips.json (example 3)

and to download the three .zip files click on this link (XXXX) <---- ci saranno tutti i file -->

In the example 1 the issue is to find the most frequent author of comments (the author with the greatest number) on your blog.
So I create this aggregate statment:

db.posts.aggregate([
    { $unwind: "$comments" },
    { $group: { _id: "$comments.author", count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 1 }
])

where
unwind: Using unwind I deconstructs the array field comments for every post and send this output to the group stages. In fact if I execute only this part of aggregation,

db.posts.aggregate([
    { $unwind: "$comments" }
]).pretty()

the output will be

{
	"_id" : ObjectId("50ab0f8bbcf1bfe2536dc3f9"),
	"body" : "blabla",
	"author" : "machine",
	"title" : "Bill of Rights",
	"tags" : [
		"watchmaker",
		"santa",
		"xylophone",
		"math",
		"handsaw",
		"dream",
		"undershirt",
		"dolphin",
		"tanker",
		"action"
	],
	"comments" : {
		"email" : "ZFpiFQlp@tUhXPUjs.com",
		"author" : "Danika Loeffler",
		"body" : "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum"
	},
	"date" : ISODate("2012-11-20T05:05:15.231Z")
}
...
{
	"_id" : ObjectId("50ab0f8bbcf1bfe2536dc3f9"),
	"body" : "blabla",
	"author" : "machine",
	"title" : "Bill of Rights",
	"tags" : [
		"watchmaker",
		"santa",
		"xylophone",
		"math",
		"handsaw",
		"dream",
		"undershirt",
		"dolphin",
		"tanker",
		"action"
	],
	"comments" : {
		"email" : "RAflrgPY@DzXPkIeS.com",
		"author" : "Tawana Oberg",
		"body" : "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum"
	},
	"date" : ISODate("2012-11-20T05:05:15.231Z")
}
...

where I have a record for every comment of every post of the collection.
group: To sum by comment author from the data output unwind
sort: to sort descending
limit: to return only one record

In the example 2 the issue is to calculate the average population of cities in California (abbreviation CA) and New York (NY) (taken together) with populations over 25,000.

So I create this aggregate statment:

db.zips.aggregate([
    { $match: {$or: [ {state: "CA"}, {state: "NY"} ] } },
    { $group: { _id: { city: "$city" }, pop: { $sum: "$pop" } } },
    { $match: { "pop": { $gt: 25000 } } },
    { $group: { _id: null, avg_pop_of_city: { $avg: "$pop" } } }
])

where
match: to filter only the documents that matching with state CA and NY. The output to next stage will be several records like this

{
	"_id" : "90021",
	"city" : "LOS ANGELES",
	"loc" : [
		-118.244698,
		34.033303
	],
	"pop" : 2869,
	"state" : "CA"
}
....
{
	"_id" : "10021",
	"city" : "NEW YORK",
	"loc" : [
		-73.958805,
		40.768476
	],
	"pop" : 106564,
	"state" : "NY"
}

group: To group by city and sum the value of popolation
match: To match only the value population is greater than 25000
group: To group by all the id values (id=null) calculating the average value

The result will be:

{ "_id" : null, "avg_pop_of_city" : 44804.782608695656 }
In the example 3 the issue is to calculate the number of people who live in a zip code in the US where the city starts with a digit.

So I create this aggregate statment:
db.zips.aggregate([
    { $project: { _id: 0, city: 1, pop: 1 } },
    { $match: { city: /^\d.*/ } },
    { $group: { _id: null, pop: { $sum: "$pop" } } },
    { $sort: { city: 1} }
])

project: To change the ouput of each document as output to the match state
Using that statment the original data (example of structure of one record)

{
	"_id" : "35004",
	"city" : "ACMAR",
	"loc" : [
		-86.51557,
		33.584132
	],
	"pop" : 6055,
	"state" : "AL"
}

will change in

{ "city" : "ACMAR", "pop" : 6055 }
{ "city" : "ADAMSVILLE", "pop" : 10616 }
{ "city" : "ADGER", "pop" : 3205 }
{ "city" : "KEYSTONE", "pop" : 14218 }
{ "city" : "NEW SITE", "pop" : 19942 }
{ "city" : "ALPINE", "pop" : 3062 }

match: To match only the city that starting with digit (I used regexp sintax)
group: To group by all the id values (id=null) calculating the sum
sort: To sort ascending