I want to use the equivalent of the subset command in R for some Python code I am writing.

Here is my data:

```
col1 col2 col3 col4 col5
100002 2006 1.1 0.01 6352
100002 2006 1.2 0.84 304518
100002 2006 2 1.52 148219
100002 2007 1.1 0.01 6292
10002 2006 1.1 0.01 5968
10002 2006 1.2 0.25 104318
10002 2007 1.1 0.01 6800
10002 2007 4 2.03 25446
10002 2008 1.1 0.01 6408
```

I want to subset the data based on the contents of `col1` and `col2`. (The unique values in `col1` are 100002 and 10002, and in `col2` are 2006, 2007 and 2008.)

This can be done in R using the `subset` command. Is there anything similar in Python?

While the iterator-based answers are perfectly fine, if you're working with numpy arrays (as you mention that you are) there are better and faster ways of selecting things:

```
import numpy as np
data = np.array([
[100002, 2006, 1.1, 0.01, 6352],
[100002, 2006, 1.2, 0.84, 304518],
[100002, 2006, 2, 1.52, 148219],
[100002, 2007, 1.1, 0.01, 6292],
[10002, 2006, 1.1, 0.01, 5968],
[10002, 2006, 1.2, 0.25, 104318],
[10002, 2007, 1.1, 0.01, 6800],
[10002, 2007, 4, 2.03, 25446],
[10002, 2008, 1.1, 0.01, 6408] ])
# A boolean mask on the first column keeps only the matching rows
subset1 = data[data[:,0] == 100002]
subset2 = data[data[:,0] == 10002]
```

This yields

subset1:

```
array([[ 1.00002e+05, 2.006e+03, 1.10e+00, 1.00e-02, 6.352e+03],
[ 1.00002e+05, 2.006e+03, 1.20e+00, 8.40e-01, 3.04518e+05],
[ 1.00002e+05, 2.006e+03, 2.00e+00, 1.52e+00, 1.48219e+05],
[ 1.00002e+05, 2.007e+03, 1.10e+00, 1.00e-02, 6.292e+03]])
```

subset2:

```
array([[ 1.0002e+04, 2.006e+03, 1.10e+00, 1.00e-02, 5.968e+03],
[ 1.0002e+04, 2.006e+03, 1.20e+00, 2.50e-01, 1.04318e+05],
[ 1.0002e+04, 2.007e+03, 1.10e+00, 1.00e-02, 6.800e+03],
[ 1.0002e+04, 2.007e+03, 4.00e+00, 2.03e+00, 2.5446e+04],
[ 1.0002e+04, 2.008e+03, 1.10e+00, 1.00e-02, 6.408e+03]])
```

If you didn't know the unique values in the first column beforehand, you can use either `numpy.unique` or the builtin function `set` to find them.
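As a rough sketch of both approaches (the `col1` array below is just the first column of the sample data typed out by hand, so treat the names as illustrative):

```python
import numpy as np

# First column of the sample data, written out for a self-contained example
col1 = np.array([100002, 100002, 100002, 100002,
                 10002, 10002, 10002, 10002, 10002])

# np.unique returns the distinct values as a sorted array
unique_np = np.unique(col1)

# set() also works, but the result is an unordered Python set
unique_set = set(col1.tolist())
```

`np.unique` is usually the more convenient of the two here, since the result is sorted and stays a numpy array.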

Edit: I just realized that you wanted to select data where you have unique combinations of two columns... In that case, you might do something like this:

```
import itertools

col1 = data[:,0]
col2 = data[:,1]
subsets = {}
# Try every (col1, col2) pairing and keep only those that match at least one row
for val1, val2 in itertools.product(np.unique(col1), np.unique(col2)):
    subset = data[(col1 == val1) & (col2 == val2)]
    if subset.size:
        subsets[(val1, val2)] = subset
```

(I'm storing the subsets as a dict, with the key being a tuple of the combination... There are certainly other (and better, depending on what you're doing) ways to do this!)
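If you only need one particular combination, the same boolean-mask idea should work without any loop. Here is a minimal sketch on a cut-down copy of the sample data (the `mask` and `subset` names are just for illustration):

```python
import numpy as np

# A few rows of the sample data, enough to show the idea
data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002, 2006, 1.1, 0.01, 5968],
    [10002, 2007, 1.1, 0.01, 6800],
    [10002, 2007, 4, 2.03, 25446],
    [10002, 2008, 1.1, 0.01, 6408],
])

# Combine one condition per column with &; each comparison needs its own
# parentheses because & binds more tightly than ==
mask = (data[:, 0] == 10002) & (data[:, 1] == 2007)
subset = data[mask]
```

With this data, `subset` holds the two rows where `col1` is 10002 and `col2` is 2007.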

`subset()` in R is pretty much analogous to `filter()` in Python. As the reference notes, `filter()` is equivalent to a comprehension with an `if` clause, so the most concise and clear way to write the code might be

```
[ item for item in items if item.col2 == 2006 ]
```

if, for example, your data rows were in an iterable called `items`.
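To make the comparison concrete, here is a small sketch that runs both forms side by side. The `Row` namedtuple is a hypothetical stand-in for however your rows are actually stored; all that matters is that each row exposes a `col2` attribute.

```python
from collections import namedtuple

# Hypothetical row type, assumed only for this example
Row = namedtuple("Row", ["col1", "col2", "col3", "col4", "col5"])

items = [
    Row(100002, 2006, 1.1, 0.01, 6352),
    Row(100002, 2007, 1.1, 0.01, 6292),
    Row(10002, 2006, 1.1, 0.01, 5968),
    Row(10002, 2008, 1.1, 0.01, 6408),
]

# filter() returns a lazy iterator in Python 3, so wrap it in list()
with_filter = list(filter(lambda item: item.col2 == 2006, items))

# The list comprehension selects the same rows
with_comprehension = [item for item in items if item.col2 == 2006]
```

Both lists contain the same two 2006 rows; the comprehension just avoids the `lambda`.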
