Recently I had to process huge log files. I found it to be a great opportunity to check my Scala language skills. One of the problems I had to figure out was filtering every n-th element from iterable collection. This is supposed to be a trivial task, but well, after thinking for a while I had five competitive implementations ready. Which one should I pick? Perhaps the most concise one. To make the choice easier I've decided to measure how performant they are.
- Filter using groups
This is my favorite one, the cleanest implementation.
def filterUsingGroups[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] =
in.drop(offset).grouped(step).map(_.head).toIterable
How it works:
in = (1,2,3,4,5,6,7,8,9,10), step=3, offset=2
drop(2) => (3,4,5,6,7,8,9,10)
grouped(3) => ((3,4,5),(6,7,8),(9,10))
map(_.head) => (3,6,9)
- Filter using index
def filterUsingIndex[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] =
in.zipWithIndex.filter(_._2 % step == offset).unzip._1
How it works:
in = (1,2,3,4,5,6,7,8,9,10), step=3, offset=2
zipWithIndex => ((1,0),(2,1),....,(10,9))
filter(_._2 % 3 == 2) => ((3,2), (6,5), (9,8))
unzip._1 => (3,6,9)
- Filter using recursion
def filterUsingRecursion[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
if (in.size >= step) {
val (l, r) = in.splitAt(step)
List(l.drop(offset).head) ++ filterUsingRecursion(r, step, offset)
} else List.empty[A]
}
How it works:
in = (1,2,3,4,5,6,7,8,9,10), step=3, offset=2
splitAt(3) => l=(1,2,3); r=(4,5,6,7,8,9,10)
l.drop(2).head == 3 => List(3) ++ filterUsingRecursion((4,5,6,7,8,9,10), 3, 2)
- Filter using loop (unbuffered)
def filterUsingLoop[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
var out = List.empty[A]
val it = in.iterator
for (i <- 0 to in.size - 1) {
if (i % step == offset) out = out :+ it.next
else it.next
}
out
}
- Filter using loop (buffered)
def filterUsingLoopWithBuffer[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
val out = collection.mutable.ListBuffer.empty[A]
val it = in.iterator
for (i <- 0 to in.size - 1) {
if (i % step == offset) out += it.next
else it.next
}
out
}
- Filter using tail recursion (added later)
def filterUsingTailRecursion[A](in: Iterable[A], step: Int, offset: Int): Iterable[A] = {
def filter(acc: ListBuffer[A], in: Iterable[A]): ListBuffer[A] = {
in.splitAt(step) match {
case (l, r) if (l.size >= offset) => filter(acc += l.drop(offset).head, r)
case _ => acc
}
}
filter(ListBuffer.empty[A], in)
}
In my benchmark I've called each method 10 times in a row and measured average execution time (using parameters step=3, offset=2). When benchmarking I call foreach on the result of each execution of the method under test. I do this to materialize returned collection if it would be backed by an iterator.
The chart below shows the results I've collected.
Sadly the implementation using groups is not the most performant although it's clean and concise. Anyway it's good enough when processing less than 1000 elements. That makes it quite good candidate for initial implementation.