Thursday, 24 February 2011

Scala collections: Filtering each n-th element - Part 2

If you read my previous post you know that I'm trying to find the best implementation for filtering each n-th element from a collection. Since then I found two more possible implementations:
  1. Loop - Reverse
    This implementation is faster than other loop-based implementations. It benefits from replacing append :+ with concatenation ::
    def filterUsingLoopReversed[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
      var out = List.empty[A]
      val it = in.iterator
      for (i <- 0 to in.size - 1) {
        if (i % step == offset) out = it.next :: out
        else it.next
      }
      out.reverse
    }
    
  2. Range - Map
    This is simply the best implementation I have. It's not only concise but also very fast. The drawback is that it requires the input collection to be an instance of Seq
    def filterUsingMap[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
      val seq = in.toSeq
      Range(offset, in.size, step).map(seq(_))
    }
    
    Edit: Peter suggested that the perfomance of this solution depends on the implementation of the input collection. To mitigate this I've decided to convert input collection to IndexedSeq which is supposed to guarantee near constant-time element access. The performance is a bit worse because of the extra conversion time, though it's still the fastest solution I found
    def filterUsingMap[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
      val seq = in.toIndexedSeq
      Range(offset, in.size, step).map(seq(_))
    }
    

Benchmark results:
Average execution time of 100 executions

Range-Map (light green) is comparable to Loop-Buffered (dark green) when filtering small-mid collection (<10000 elements). And there is no better choice than Range-Map to filter big collections (>10000 elements).

2 komentarze:

Peter said...

The result of the benchmarks depend on the type of the collection you are filtering. On Arrays, filterUsingMap is fast, however on Lists its behaviour is terrible.

This is because you use O(N) many indexed accesses, each of which is O(1) for arrays and O(N) for Lists. Thus filterUsingMap is O(N) for arrays, but O(N*N) for lists.

Krzysztof Białek said...

Peter, you're totally right. Non-indexed lists are wrong choice for random access reads. I've tested Range-Map implementation once again with conversion in.toSeq replaced by in.toIndexedSeq. This time preformance is very good again also for non-indexed lists.