paolo@bimodesign.com | +34 608 61 64 10


Spark - example of Distinct

Before reading this post, you should read this, which gives a quick introduction to some PySpark functions. In this post I'll show the difference between using and not using the distinct function.
If I apply sortByKey without distinct to the Apache access log data, after mapping each log to a ((day, host), 1) pair:

dayToHostPairTuple = (access_logs
          .map(lambda log: ((log.date_time.day,log.host),1))
          #.distinct()
          .sortByKey())
The result will be:
[((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.126.216.37'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.91'), 1), ....]

Now, if I apply distinct (the same code with the # comment marker removed):

dayToHostPairTuple = (access_logs
          .map(lambda log: ((log.date_time.day,log.host),1))
          .distinct()
          .sortByKey())

I get this result:

[((1, u'128.126.216.37'), 1), ((1, u'128.135.36.35'), 1), ((1, u'128.138.169.91'), 1), ((1, u'128.138.169.94'), 1), ((1, u'128.149.109.74'), 1), ((1, u'128.158.20.67'), 1), ((1, u'128.158.28.33'), 1), ((1, u'128.158.36.4'), 1), ((1, u'128.158.37.244'), 1), ((1, u'128.158.42.141'), 1), ((1, u'128.158.42.193'), 1), ((1, u'128.158.45.18'), 1), ((1, u'128.158.49.61'), 1), ((1, u'128.158.50.129'), 1), ((1, u'128.158.53.223'), 1), ((1, u'128.158.54.58'), 1), ((1, u'128.158.55.116'), 1), ((1, u'128.158.56.155'), 1), ((1, u'128.158.66.97'), 1), ((1, u'128.159.105.240'), 1), ((1, u'128.159.111.138'), 1), ((1, u'128.159.111.141'), 1), ((1, u'128.159.111.174'), 1), ((1, u'128.159.111.23'), 1),
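
The same effect can be reproduced on a tiny scale without a Spark cluster. The following is only a sketch in plain Python that imitates map, distinct and sortByKey with built-ins; it is not PySpark code. The Log tuple, the sample records and their dates are made up for illustration; only the field names date_time and host come from the code above.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a parsed log row; only the two fields used above.
Log = namedtuple("Log", ["date_time", "host"])

# Made-up sample records: one host appears twice on day 1.
sample_logs = [
    Log(datetime(2015, 8, 1, 0, 0, 1), "128.135.36.35"),
    Log(datetime(2015, 8, 1, 0, 0, 7), "128.126.216.37"),
    Log(datetime(2015, 8, 1, 0, 1, 12), "128.126.216.37"),
    Log(datetime(2015, 8, 2, 9, 30, 0), "128.126.216.37"),
]

# Plays the role of .map(lambda log: ((log.date_time.day, log.host), 1))
pairs = [((log.date_time.day, log.host), 1) for log in sample_logs]

# Plays the role of .sortByKey(): ordered by the (day, host) key, duplicates stay.
without_distinct = sorted(pairs, key=lambda kv: kv[0])

# Plays the role of .distinct().sortByKey(): identical tuples collapse into one.
with_distinct = sorted(set(pairs), key=lambda kv: kv[0])

print(without_distinct)
print(with_distinct)
```

Here distinct compares whole elements, key and value together. Because every value is 1, all pairs with the same (day, host) are identical and collapse into a single element, which is why each host appears only once per day in the second result.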