Monday, October 25, 2010

Writing Python's groupby in C#

A couple of days ago, while working on a C# program, I had the necessity of grouping contiguous elements from a sequence given a property. A group needs to be created each time the value of the property changes in a similar way to the uniq Unix utility.

Python has a function called groupby which is part of the nice itertools module which does exactly what I want. For example:


>>> strList = ["abc","ert","bre","sd","ghj","awe","ew","gh"]
>>> [list(g) for i,g in groupby(strList,lambda x: len(x))]
[['abc', 'ert', 'bre'], ['sd'], ['ghj', 'awe'], ['ew', 'gh']]
>>> numList = [1,2,2,2,1,1,1,5,5]
>>> [list(g) for i,g in groupby(numList)]
[[1], [2, 2, 2], [1, 1, 1], [5, 5]]



In .NET a class called Enumerable with lots of extension methods to manipulate with sequences (IEnumerable<T>). This class includes an extension method called GroupBy which groups values according to a key. However it behaves more like SQL's Group by in that it considers all the values of the collection. For example:

csharp> using System.Collections.Generic;
csharp> var strList = new List<string>() {"abc","ert","bre","sd","ghj","awe","ew","gh"};

csharp> strList.GroupBy(x => x.Length);
{ { "abc", "ert", "bre", "ghj", "awe" }, { "sd", "ew", "gh" } }

csharp> var numList = new List<int>() {1,2,2,2,1,1,1,5,5};

csharp> numList.GroupBy(x => x);
{ { 1, 1, 1, 1 }, { 2, 2, 2 }, { 5, 5 } }


(The C# examples will be presented using the very useful Mono C# REPL)

Writing the equivalent function in C# turned out to be a nice programming excessive. Also trying to write the function using only "yield return" turned out the more challenging than I thought!.

As a first try we can write this function using intermediate collections to store the partial groups:

using System.Collections.Generic;
using System;
namespace Langexplr.Experiments
{
public class MyTuple<T,K>
{
public T Item1 { get; set; }
public K Item2 { get; set; }


public static MyTuple<T1,K1> Create<T1,K1>(T1 first, K1 second)
{
return new MyTuple<T1,K1>() {Item1 = first, Item2 = second};
}
}
public static class GroupByTests
{
public static IEnumerable<MyTuple<K,IList<T>>> MyGroupByWithLists<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentKey = default(K);
bool firstTime = true;
IList<T> currentGroup = new List<T>();
foreach(var aValue in en)
{
if (firstTime)
{
currentKey = keyExtraction(aValue);
firstTime = false;
}
else
{
K tmpKey = keyExtraction(aValue);
if (!tmpKey.Equals(currentKey))
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
currentGroup = new List<T>();
currentKey = tmpKey;
}

}
currentGroup.Add(aValue);
}
if (currentGroup.Count > 0)
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
}
}

}
}

Here I'm defining a class called MyTuple which is very similar to .NET 4' Tuple class to store group's key and members.

This function works as expected, for example:

csharp> strList.MyGroupByWithLists(x => x.Length).Select(x => x.Item2).ToList();
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
csharp> numList.MyGroupByWithLists(x => x).Select(x => x.Item2).ToList();
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }



One of the interesting things about the Python version of groupby is that it doesn't create an intermediate collections for each group. The itertools module reference has the code for the groupby implementation.

Trying to write a this function with similar characteristics in C# resulted in the following (scary) code:


public static IEnumerable<MyTuple<K,IEnumerable<T>>> MyGroupBy<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentGroupKey = default(K);
bool firstTime = true;
bool hasMoreElements = false;
bool yieldNewValue = false;

IEnumerator<T> enumerator = en.GetEnumerator();
hasMoreElements = enumerator.MoveNext();
while (hasMoreElements)
{
if (firstTime)
{
firstTime = false;
yieldNewValue = true;
currentGroupKey = keyExtraction(enumerator.Current);
}
else
{
K lastKey;
while((lastKey = keyExtraction(enumerator.Current)).Equals( currentGroupKey) &&
(hasMoreElements = enumerator.MoveNext()))
{

}
if(hasMoreElements &&
!lastKey.Equals(currentGroupKey))
{
currentGroupKey = lastKey;
yieldNewValue = true;
}
else
{
yieldNewValue = false;
}
}

if (yieldNewValue) {
yield return MyTuple<K,IEnumerable<T>>.Create(
currentGroupKey,
ReturnSubSequence((x) => {
hasMoreElements = enumerator.MoveNext();
return hasMoreElements &&
x.Equals(keyExtraction(enumerator.Current)); },
enumerator,
currentGroupKey,
enumerator.Current));
}
}
}
static IEnumerable<T> ReturnSubSequence<T,K>(Predicate<K> pred, IEnumerator<T> seq,K currentElement,T first)
{
yield return first;
while( pred(currentElement))
{
yield return seq.Current;
}
}



Using this function we can write:


csharp> numList.MyGroupBy().Select(x => x.Item1);
{ 1, 2, 1, 5 }
csharp> numList.MyGroupBy().Select(x => x.Item2.ToList());
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item1);
{ 3, 2, 3, 2 }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item2.ToList());
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }



One interesting fact about this way of writing the groupby function is that, you have to be very careful handling the resulting iterator/enumerable. From Python's groupby documentation:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible


For example in Python, the following effect occurs if we consume the complete iterator before consuming each group:

>>> l = groupby(strList,lambda s: len(s))
>>> consumed = list(groupby(strList,lambda s: len(s)))
>>> [list(g) for k,g in consumed]
[[], ['gh'], [], []]


In our C# version we have a similar restriction, for example:


csharp> var consumed = strList.MyGroupBy(s => s.Length).ToList();
csharp> consumed.Select(x => x.Item2);
{ { "abc" }, { "sd" }, { "ghj" }, { "ew" } }


Code for this post can be found here.