OJ’s rants What would OJ do?

28Jul/0814

A Better ‘nub’

During my Haskell travels I have found myself using the nub function quite regularly. For those too lazy to click the link: nub removes duplicates from a list of items. eg:

Prelude> nub [1,1,3,3,5,5,6,6,6,1]
[1,3,5,6]

Fairly simple stuff. Until recently I hadn't bothered pondering the internal implementation of this function because I hadn't really been too worried about performance. That's no longer the case. I recently cracked open the hood of nub, and was rather surprised by what I saw.

Since the source of Prelude is in the public domain, I'm safe to show you the content:

nub                     :: (Eq a) => [a] -> [a]
nub                      = nubBy (==)
 
nubBy                   :: (a -> a -> Bool) -> [a] -> [a]
nubBy eq []              = []
nubBy eq (x:xs)          = x : nubBy eq (filter (\y -> not (eq x y)) xs)

So nub uses nubBy and passes (==) as the filter for the items. This is all well and good, but the result is an O(n2) level of complexity! In this day and age that's pretty poor for a simple filtering mechanism, especially when we have more powerful mechanisms built-in to most libraries.

Since seeing this I had added a "write my own nub" task to my TODO list, but I haven't yet got round to it. So you can imagine my delight when I found this comment by Jedai over at hvergi.net. He'd pointed out that nub was indeed inefficient, and showed an alternative implementation:

Note that nub only demands that the list element be part of the Eq typeclass. As a result it is very inefficient and a better solution must always be prefered whenever the nature of the elements allows it.

import qualified Data.Set as S
nub' :: (Ord a) => [a] -> [a]
nub' = go S.empty
  where go _ [] = []
        go s (x:xs) | S.member x s = go s xs
                    | otherwise    = x : go (S.insert x s) xs

La différence de performance est énorme.

You got that right! Thanks to Jedai (and Arnar via hvergi.net) I now have a much faster nub implementation. Cheers guys!

It just goes to show that even these days you can't always trust the libraries. There is no harm is taking a look under the hood to see what's going on! You don't have to reinvent the wheel, but there's nothing wrong with making the wheel smoother :)

  • OJ
    HA! If I wanted to fame and attention I wouldn't be talking to the likes of you! :P
  • The purpose of this blogging thing, for me at least, is to learn!

    Liar! You just want the fame and attention! ;-)
  • OJ
    @Tony: First of all let me say WOW! Tony Morris posting on my blog! I'm a HUGE fan of yours. You write some seriously amazing stuff. Kudos to you.

    I had a feeling that someone would come along and pick me up on some flaw somewhere along the line :) I was hoping for it! The purpose of this blogging thing, for me at least, is to learn! I think that's about to happen :)

    When I read up on the definition of nub it appeared that it was as simple as "removing all duplicate elements from a list while preserving the order". Is my definition incorrect?

    A hint would be lovely :) Given that my implementation of nub is a blatant "hack" which works in most cases for what I need, I'm sure I'm missing something. So yes please, can I have a hint?

    Cheers mate, thanks for posting!
    OJ
  • "it’s technically not nub"

    Neither is your given function. I won't spoil it for you. Want a hint? :)
  • OJ
    @Gwern: Thanks for your comment.

    Please consider the following:

    import qualified Data.Set as S

    nub1 :: (Ord a) => [a] -> [a]
    nub1 = S.toList . S.fromList

    nub2 :: (Ord a) => [a] -> [a]
    nub2 = go S.empty
    where
    go _ [] = []
    go s (x:xs) | S.member x s = go s xs
    | otherwise = x : go (S.insert x s) xs

    someNums :: [Int]
    someNums = [1,2,3,4,5,9,8,7,6,5,3,4,5,6,7,12,14,13,10]

    main :: IO ()
    main = do
    print $ nub1 someNums
    print $ nub2 someNums

    The result:
    *Main> main
    [1,2,3,4,5,6,7,8,9,10,12,13,14]
    [1,2,3,4,5,9,8,7,6,12,14,13,10]

    Note how they're actually different. Your express method has the side-effect of sorting the values at the same time, where the method discussed in this post does not. One of the features of nub is that it should retain the order of the items in the list. So while yours is faster, it's technically not nub :)
  • Gwern
    Actually, I have to question whether this version really is faster than the simple Set trick.

    gwern@craft:15976~>time runhaskell tmp.hs && echo \n && time runhaskell tmp2.hs && echo \n && cat tmp.hs tmp2.hs      [11:47AM]
    [1]
    runhaskell tmp.hs  7.99s user 0.05s system 99% cpu 8.066 total
    n
    [1]
    runhaskell tmp2.hs  0.87s user 0.03s system 99% cpu 0.907 total
    n
    import qualified Data.Set as S

    main = print $ nub' $ take 10000000 $ cycle [(1::Int)]

    nub' :: (Ord a) => [a] -> [a]
    nub' = go S.empty
    where go _ [] = []
    go s (x:xs) | S.member x s = go s xs
    | otherwise    = x : go (S.insert x s) xs
    import qualified Data.Set as S

    main = print $ nub' $ take 10000000 $ cycle [(1::Int)]

    nub' :: (Ord a) => [a] -> [a]
    nub' = S.toList . S.fromList%



    ---

    Given that they seem to have the same type sigs, but toList.fromList is 10x faster, I know which one I will use.
  • OJ
    I'm hoping that by the time I make my way through all of the problems I'll have a good enough grasp of the language to write something a bit more substantial without the need to constantly reference the web ;)

    Thanks for the feedback on the toList.fromList, that's exactly as I'd expect. Cheers!
  • miran
    Hehe, yeah Project Euler is great for Haskell, it helped me a lot when I was learning the basics.
    About the toList . fromList trick, it is less efficient than the function in the post but the complexity (in big O notation) is still the same. So in practice you could use the trick to avoid premature optimization but then if you need some more speed you can switch to the function that inserts sequentially.
  • OJ
    @miran: Ah! Another Euler solver :) Welcome to the club! :D I'm not up to 87 yet, but I'm working my way through sequentially. I'm currently using the problems to learn Haskell, which is why my Haskell posts are still very n00b.

    re: the toList . fromList trick - is this not less efficient given that the list is constructed from the set after the set is constructed? The method listed above generates the list while using the set as a filter.
  • miran
    OJ: Yup, most of the times that I've used nub it was on numbers. Interesting coincidence though, just before reading this post I was solving Project Euler problem 87, in which a solution involves weeding out duplicates in a list of a million numbers or so. Needless to say, nub just went on and on while putting it in a set produced the result almost instantly.
    gwern: There's a catch though. The function in the post preserves the ordering while using the toList . fromList trick does not.
  • OJ
    Thanks for the responses guys!

    @miran: That's true, it does require Ord. But most cases that isn't a problem. If there is a case where it is a problem you can always fall back on the old-skool nub :)

    @gwern: I can't see how that's cheating. It's an alternative implementation which provides a means to get the result much faster/more efficiently.

    @nhlab: Cheers for the links mate! Great comment, really informative :) I guess I came across as rather narked when I dissed nub's implementation in Prelude. I didn't mean it to sound that way. I guess for the most part developers use nub to filter out duplicates and hence it's not the fastest way of doing it. I guess what narks me more is that almost every beginner tutorial you read says to use nub when removing duplicates. If there was more visibility of the underlying implementation and the tutorials indicated that it's not the best method to use when you are using values of the Ord class then it wouldn't erk me so much.

    The beauty of it is, that posts like this one bring out these nuggets of information such as the one you posted :) Cheers!
  • Another blog post on implementing nub using Ord can be found here:

    http://blog.unsafeperformio.com/?s=nub&subm...

    We can\'t replace the version of nub in the standard libraries with the Ord version, because many data types only support Eq. However, by using GHC rules, it could be possible for the compiler to use the standard version when only Eq is available, but use the Ord version,  when possible.

    http://hackage.haskell.org/trac/ghc/ticket/1218

    One of my former coworkers insisted that understanding big-O notation was not very important because the core library maintainers would obviously pick the fastest implementation of sort, etc. Hence, you can just use the standard library functions and know you are getting the best possible algorithm.

    As your blog post shows, that is not really the case. The nub instance in the standard libraries is a fine algorithm for:

    nub :: (Eq a) => [a] -> [a]

    but, if your data type has an Ord instance, then you can clearly do much better.
  • gwern
    If you're going to 'cheat' and change the type signature/behaivour, then why not go with a much simpler definition?

    import Data.Set

    nub = toList . fomList
  • miran
    That is true but using Data.Set requires the elements to be part of the Ord typeclass while nub only requires Eq, which is a much looser requirement.
blog comments powered by Disqus