# How do I discretize values in a pandas DataFrame and convert to a binary matrix?

### Question

I mean something like this:

I have a `DataFrame` with columns that may be categorical or numeric. For each observation (row), I want to generate a new row where every possible value of each variable becomes its own binary variable. For example, this matrix (first row is column labels):

``````
'a'     'b'     'c'
one     0.2     0
two     0.4     1
two     0.9     0
three   0.1     2
one     0.0     4
two     0.2     5
``````

would be converted into something like this:

``````
'a'              'b'                                                    'c'
one  two  three  [0.0,0.2)  [0.2,0.4)  [0.4,0.6)  [0.6,0.8)  [0.8,1.0]   0   1   2   3   4   5

1    0     0        0          1          0          0          0       1   0   0   0   0   0
0    1     0        0          0          1          0          0       0   1   0   0   0   0
0    1     0        0          0          0          0          1       1   0   0   0   0   0
0    0     1        1          0          0          0          0       0   0   1   0   0   0
1    0     0        1          0          0          0          0       0   0   0   0   1   0
0    1     0        0          1          0          0          0       0   0   0   0   0   1
``````

Each variable (column) in the initial matrix gets binned into all of its possible values. If it's categorical, each possible value becomes a new column. If it's a float, the values are binned in some way (say, always splitting into 10 bins). If it's an int, each possible int value can become a column, or the values can be binned as well.

FYI: in my real application, the table has up to 2 million rows, and the full "expanded" matrix may have hundreds of columns.

Is there an easy way to perform this operation?

Separately, I would also be willing to skip this step, as I am really trying to compute a Burt table (a symmetric matrix of the cross-tabulations). Is there an easy way to do something similar with the `crosstab` function? Otherwise, once the binary matrix exists, computing the cross-tabulation is just a simple matrix multiplication.
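
The matrix-multiplication route mentioned above can be sketched as follows. This is a minimal illustration, not a tuned solution: it assumes the indicator matrix fits in memory, it uses `pandas.get_dummies` (not mentioned in the original post) to build that matrix, and it only encodes columns `a` and `c` of the toy data.

```python
import pandas as pd

# Toy data from the example above; only 'a' and 'c' are used here,
# and both are treated as categorical (an assumption for illustration).
df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "c": [0, 1, 0, 2, 4, 5],
})

# Indicator matrix Z: one 0/1 column per observed value of each variable.
Z = pd.get_dummies(df, columns=["a", "c"], dtype=int)

# Burt table: Z^T Z. Diagonal entries are value counts,
# off-diagonal entries are co-occurrence counts between two values.
burt = Z.T.dot(Z)
print(burt)
```

For a table with millions of rows, a sparse indicator matrix would probably be the safer choice, but that is outside this sketch.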

1/4/2017 11:09:43 PM

You can use NumPy broadcasting to compare each column against its set of distinct values (or against bin edges):

``````
In [1]: import numpy

In [2]: df
Out[2]:
       a    b  c
0    one  0.2  0
1    two  0.4  1
2    two  0.9  0
3  three  0.1  2
4    one  0.0  4
5    two  0.2  5

In [3]: (df.a.values[:,numpy.newaxis] == df.a.unique()).astype(int)
Out[3]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])

In [4]: ((0 <= df.b.values[:,numpy.newaxis]) & (df.b.values[:,numpy.newaxis] < 0.2)).astype(int)
Out[4]:
array([[0],
       [0],
       [0],
       [1],
       [1],
       [0]])

In [5]: (df.c.values[:,numpy.newaxis] == df.c.unique()).astype(int)
Out[5]:
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])
``````

Then join all the pieces together with `numpy.hstack`, or wrap each piece in a `DataFrame` and use `pandas.concat`.
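
Putting the pieces together might look like the sketch below. The helper name `indicator`, the column labels, and the five bin edges for `b` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

def indicator(col):
    """Hypothetical helper: one 0/1 column per distinct value, via broadcasting."""
    arr = col.to_numpy()
    values = pd.unique(arr)  # distinct values in order of first appearance
    data = (arr[:, np.newaxis] == values).astype(int)
    names = [f"{col.name}={v}" for v in values]
    return pd.DataFrame(data, index=col.index, columns=names)

# Bin edges for 'b' (an illustrative choice): half-open bins [lo, hi).
edges = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
lo, hi = edges[:-1], edges[1:]
b = df["b"].to_numpy()[:, np.newaxis]
b_bins = ((lo <= b) & (b < hi)).astype(int)
# Close the last bin on the right so that a value of exactly 1.0 is kept.
b_bins[:, -1] |= (df["b"].to_numpy() == edges[-1]).astype(int)
labels = [f"b=[{l:.1f},{h:.1f})" for l, h in zip(lo, hi)]
labels[-1] = "b=[0.8,1.0]"
b_part = pd.DataFrame(b_bins, index=df.index, columns=labels)

# Join the three blocks column-wise into one binary matrix.
binary = pd.concat([indicator(df["a"]), b_part, indicator(df["c"])], axis=1)
print(binary)
```

Each row of `binary` contains exactly one 1 per original variable, which is a quick sanity check on the encoding.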

5/29/2012 8:13:14 AM

Note that I have implemented new `cut` and `qcut` functions for discretizing continuous data:

http://pandas-docs.github.io/pandas-docs-travis/basics.html#discretization-and-quantiling
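
As a rough sketch of how `cut` could feed into a binary matrix: the bin edges below are an arbitrary choice, and pairing `cut` with `pandas.get_dummies` is an assumption of this example rather than something stated in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "two", "two", "three", "one", "two"],
    "b": [0.2, 0.4, 0.9, 0.1, 0.0, 0.2],
    "c": [0, 1, 0, 2, 4, 5],
})

# Five equal-width bins on [0, 1]; right=False makes them half-open [lo, hi).
# Caveat: with right=False a value of exactly 1.0 falls outside the last bin.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
b_binned = pd.cut(df["b"], bins=edges, right=False)

# Expand every column into 0/1 indicator columns. Empty bins still get a
# column because the categorical produced by cut keeps all of its categories.
expanded = pd.get_dummies(df.assign(b=b_binned), columns=["a", "b", "c"], dtype=int)
print(expanded)
```

`qcut` takes the same role when quantile-based bins are preferred over fixed edges.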