In [1]:
from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")

dplyr in Python

We need 2 things for this:

1- A data frame (using one of R's demo datasets).

In [2]:
from rpy2.robjects.packages import importr, data
datasets = importr('datasets')
mtcars_env = data(datasets).fetch('mtcars')
mtcars = mtcars_env['mtcars']

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).

In [3]:
import rpy2.ipython.html
rpy2.ipython.html.init_printing()

2- dplyr

In [4]:
from rpy2.robjects.lib.dplyr import DataFrame

With this we have the choice of chaining (D3-style)

In [5]:
dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

dataf
Out[5]:
DataFrame with 2 rows and 2 columns:
gear mean_ptw
0 1 4.0 1237.1266499803169
1 2 5.0 2574.0331639315027

or piping (magrittr style).

In [6]:
from rpy2.robjects.lib.dplyr import (filter,
                                     mutate,
                                     group_by,
                                     summarize)

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)'))

dataf
Out[6]:
DataFrame with 2 rows and 2 columns:
gear mean_ptw
0 1 4.0 1237.1266499803169
1 2 5.0 2574.0331639315027

The strings passed to the dplyr function are evaluated as expression, just like this is happening when using dplyr in R. This means that when writing mean(powertoweight) the R function mean() is used.

Using a Python function is not too difficult though. We can just call Python back from R:

In [7]:
from rpy2.rinterface import rternalize
@rternalize
def mean_np(x):
    import numpy
    return numpy.mean(x)

from rpy2.robjects import globalenv
globalenv['mean_np'] = mean_np

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw='mean_np(powertoweight)'))

dataf
Out[7]:
DataFrame with 2 rows and 3 columns:
gear mean_np_ptw mean_ptw
0 1 4.0 1237.126649980317 1237.1266499803169
1 2 5.0 2574.0331639315023 2574.0331639315027

It is also possible to carry this out without having to place the custom function in R's global environment.

In [8]:
del(globalenv['mean_np'])
In [9]:
from rpy2.robjects.lib.dplyr import StringInEnv
from rpy2.robjects import Environment
my_env = Environment()
my_env['mean_np'] = mean_np

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw=StringInEnv('mean_np(powertoweight)',
                                           my_env)))

dataf
Out[9]:
DataFrame with 2 rows and 3 columns:
gear mean_np_ptw mean_ptw
0 1 4.0 1237.126649980317 1237.1266499803169
1 2 5.0 2574.0331639315023 2574.0331639315027

note: rpy2's interface to dplyr is implementing a fix to the (non-?)issue 1323 (https://github.com/hadley/dplyr/issues/1323)

The seamless translation of transformations to SQL whenever the data are in a table can be used directly. Since we are lifting the original implementation of dplyr, it just works.

In [10]:
from rpy2.robjects.lib.dplyr import dplyr
# in-memory SQLite database broken in dplyr's src_sqlite
# db = dplyr.src_sqlite(":memory:")
import tempfile
with tempfile.NamedTemporaryFile() as db_fh:
    db = dplyr.src_sqlite(db_fh.name)
    # copy the table to that database
    dataf_db = DataFrame(mtcars).copy_to(db, name="mtcars")
    res = (dataf_db >>
           filter('gear>3') >>
           mutate(powertoweight='hp*36/wt') >>
           group_by('gear') >>
           summarize(mean_ptw='mean(powertoweight)'))
    print(res)
# 
Source: sqlite 3.8.6 [/tmp/tmpfjndk8mk]
From: <derived table> [?? x 2]

    gear mean_ptw
   (dbl)    (dbl)
1      4 1237.127
2      5 2574.033
..   ...      ...

Since we are manipulating R objects, anything available to R is also available to us. If we want to see the SQL code generated that's:

In [11]:
print(res.rx2("query")["sql"])
<SQL> SELECT "gear", "mean_ptw"
FROM (SELECT "gear", AVG("powertoweight") AS "mean_ptw"
FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "hp" * 36.0 / "wt" AS "powertoweight"
FROM "mtcars"
WHERE "gear" > 3.0) AS "zzz1"
GROUP BY "gear") AS "zzz2"

And if the starting point is a pandas data frame, do the following and start over again.

In [12]:
from rpy2.robjects import pandas2ri
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import localconverter
with localconverter(default_converter + pandas2ri.converter) as cv:
    mtcars = mtcars_env['mtcars']
    mtcars = pandas2ri.ri2py(mtcars)
print(type(mtcars))
<class 'pandas.core.frame.DataFrame'>

Using a local converter lets us also go from the pandas data frame to our dplyr-augmented R data frame.

In [13]:
with localconverter(default_converter + pandas2ri.converter) as cv:
    dataf = (DataFrame(mtcars).
             filter('gear>=3').
             mutate(powertoweight='hp*36/wt').
             group_by('gear').
             summarize(mean_ptw='mean(powertoweight)'))

dataf
Out[13]:
DataFrame with 3 rows and 2 columns:
gear mean_ptw
0 1 3.0 1633.989574118287
1 2 4.0 1237.1266499803169
2 3 5.0 2574.0331639315027

Reuse. Get things done. Don't reimplement.