paolo@bimodesign.com | +34 608 61 64 10


Spark - Difference between map and flatMap

In this post I'll show the difference between map and flatMap, which behave quite differently when it comes to lists.
Note: before reading this post, you should read the earlier one, which gives a quick introduction to some PySpark functions.
Suppose you have a PairRDD called test where each record holds an ID as the key and a list of strings as the value. A record would then look like the following tuple:

( id, ['a', 'b', 'c', 'd'] )

If you apply

test.map(lambda kv: kv[1])  # kv is the (id, list) tuple
you get an RDD where each record holds a list of strings.
If you instead apply
test.flatMap(lambda kv: kv[1])  # kv is the (id, list) tuple
you get an RDD where each record holds a single string value.
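
The two calls can be imitated without a cluster. What follows is a minimal plain-Python sketch, not Spark code: `local_map` and `local_flat_map` are hypothetical helpers written only for this illustration, and the sample records are made up.

```python
from itertools import chain

# Hypothetical sample data: (id, list of strings) records, like the PairRDD "test"
records = [(1, ['a', 'b']), (2, ['c', 'd', 'e'])]

def local_map(func, data):
    """Apply func to each record: one output element per input record."""
    return [func(rec) for rec in data]

def local_flat_map(func, data):
    """Apply func to each record, then concatenate the resulting iterables."""
    return list(chain.from_iterable(func(rec) for rec in data))

mapped = local_map(lambda kv: kv[1], records)
flattened = local_flat_map(lambda kv: kv[1], records)

print(mapped)     # [['a', 'b'], ['c', 'd', 'e']]
print(flattened)  # ['a', 'b', 'c', 'd', 'e']
```

The mapped result keeps one list per record, while the flattened result has one entry per string, which mirrors what the two RDDs described above would contain.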

With map, you get an RDD whose records are lists of tokens, so count() only counts the first level of elements, that is, the number of lists.
With flatMap, all those multi-element lists get "flattened", so every element becomes a record of the same RDD, and count() now counts all of the elements.

Example:
((x1,x2,x3,x4,x5),(y1,y2,y3,y4)).count() = 2

Now flatten this nested list:

(x1,x2,x3,x4,x5,y1,y2,y3,y4).count() = 9
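
The two counts can be checked locally. This is a small plain-Python sketch that assumes the tokens are simple placeholder strings; it uses `len` and `itertools.chain` in place of Spark's `count()` and `flatMap`, so it only illustrates the arithmetic, not the distributed execution.

```python
from itertools import chain

# Placeholder tokens standing in for x1..x5 and y1..y4
nested = [['x1', 'x2', 'x3', 'x4', 'x5'], ['y1', 'y2', 'y3', 'y4']]

# Counting the nested structure only sees the two inner lists
nested_count = len(nested)

# Flattening first puts every token at the top level
flat = list(chain.from_iterable(nested))
flat_count = len(flat)

print(nested_count)  # 2
print(flat_count)    # 9
```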