GroupedDataFrames

We can perform three types of operations on a grouped DataFrame, corresponding to three verbs:

aggregate: summarize each group by computing one or more indicators, for example its mean, or its mean and standard deviation.

transform: compute, for each group, one or more columns with the same index as the group, and therefore with the same number of rows and the same labels. For example, within each group, we can normalize the values with respect to the group's mean and standard deviation.

filter: select the groups that satisfy a given criterion.

Preliminaries
import pandas as pd
import numpy as np
from os import path
DATA_DIRECTORY = path.join('..', '..', 'data')
Consider the DataFrame:
df = pd.DataFrame(
    {
        "X": ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
        "Y": np.arange(8),
        "Z": np.arange(8, 16)
    }
)
df
|   | X | Y | Z  |
|---|---|---|----|
| 0 | a | 0 | 8  |
| 1 | a | 1 | 9  |
| 2 | a | 2 | 10 |
| 3 | a | 3 | 11 |
| 4 | b | 4 | 12 |
| 5 | b | 5 | 13 |
| 6 | c | 6 | 14 |
| 7 | c | 7 | 15 |
We group by the values of X and apply the agg method, passing it the function that computes the indicator.
df.groupby('X').agg(np.mean)
| X | Y   | Z    |
|---|-----|------|
| a | 1.5 | 9.5  |
| b | 4.5 | 12.5 |
| c | 6.5 | 14.5 |
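As an aside, recent pandas versions (2.0 and later) emit a FutureWarning when NumPy functions such as np.mean are passed to agg; the string aliases are the idiomatic spelling and resolve to pandas' own optimized implementations. A minimal sketch on the same df:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "X": ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
        "Y": np.arange(8),
        "Z": np.arange(8, 16)
    }
)

# 'mean' is the string alias for the group-wise mean; no warning is emitted
result = df.groupby('X').agg('mean')
print(result)
```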
We already saw that the same result can be obtained without agg:
df.groupby('X').mean()
| X | Y   | Z    |
|---|-----|------|
| a | 1.5 | 9.5  |
| b | 4.5 | 12.5 |
| c | 6.5 | 14.5 |
Using agg allows us, on the one hand, to apply more than one function:
df.groupby('X').agg([np.mean, np.std])
| X | Y mean | Y std    | Z mean | Z std    |
|---|--------|----------|--------|----------|
| a | 1.5    | 1.290994 | 9.5    | 1.290994 |
| b | 4.5    | 0.707107 | 12.5   | 0.707107 |
| c | 6.5    | 0.707107 | 14.5   | 0.707107 |
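agg also accepts named aggregation, where each keyword argument pairs a column with a function and names the resulting output column, which avoids the two-level column header produced by a list of functions. A sketch on the same df:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "X": ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
        "Y": np.arange(8),
        "Z": np.arange(8, 16)
    }
)

# keyword = (column, function): the keyword becomes the output column name
summary = df.groupby('X').agg(
    Y_mean=('Y', 'mean'),
    Y_std=('Y', 'std'),
    Z_mean=('Z', 'mean'),
)
print(summary)
```

The resulting columns are flat (Y_mean, Y_std, Z_mean) rather than a MultiIndex.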
On the other hand, it lets us use our own functions.
To illustrate this, we modify df
to introduce missing values.
df.loc[[0, 2, 5], 'Y'] = np.nan
df.loc[6, 'Z'] = np.nan
df
|   | X | Y   | Z    |
|---|---|-----|------|
| 0 | a | NaN | 8.0  |
| 1 | a | 1.0 | 9.0  |
| 2 | a | NaN | 10.0 |
| 3 | a | 3.0 | 11.0 |
| 4 | b | 4.0 | 12.0 |
| 5 | b | NaN | 13.0 |
| 6 | c | 6.0 | NaN  |
| 7 | c | 7.0 | 15.0 |
We compute the number of missing values in each column, broken down by group:
df.groupby('X').agg(lambda x: x.isna().sum())
| X | Y   | Z   |
|---|-----|-----|
| a | 2.0 | 0.0 |
| b | 1.0 | 0.0 |
| c | 0.0 | 1.0 |
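The same count of missing values can also be obtained without a lambda: count() ignores NaN while size() does not, so their difference is the number of missing values per group. A sketch reproducing df with the NaN values introduced above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "X": ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
        "Y": np.arange(8, dtype=float),
        "Z": np.arange(8, 16, dtype=float),
    }
)
df.loc[[0, 2, 5], 'Y'] = np.nan
df.loc[6, 'Z'] = np.nan

grouped = df.groupby('X')
# size() counts all rows per group, count() only the non-NaN ones:
# size - count is therefore the number of missing values per column
missing = grouped.count().rsub(grouped.size(), axis=0)
print(missing)
```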
We load the dataset of flights that departed from one of the three NYC airports in 2013:
flights = pd.read_feather(path.join(DATA_DIRECTORY, 'flights.feather'))
flights
|   | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|------|-------|-----|----------|----------------|-----------|----------|----------------|-----------|---------|--------|---------|--------|------|----------|----------|------|--------|-----------|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400.0 | 5.0 | 15.0 | 2013-01-01 05:00:00-05:00 |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416.0 | 5.0 | 29.0 | 2013-01-01 05:00:00-05:00 |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089.0 | 5.0 | 40.0 | 2013-01-01 05:00:00-05:00 |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576.0 | 5.0 | 45.0 | 2013-01-01 05:00:00-05:00 |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762.0 | 6.0 | 0.0 | 2013-01-01 06:00:00-05:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | None | JFK | DCA | NaN | 213.0 | 14.0 | 55.0 | 2013-09-30 14:00:00-04:00 |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | None | LGA | SYR | NaN | 198.0 | 22.0 | 0.0 | 2013-09-30 22:00:00-04:00 |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764.0 | 12.0 | 10.0 | 2013-09-30 12:00:00-04:00 |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419.0 | 11.0 | 59.0 | 2013-09-30 11:00:00-04:00 |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431.0 | 8.0 | 40.0 | 2013-09-30 08:00:00-04:00 |
336776 rows × 19 columns
We want to obtain the number of cancelled flights per hour (they have NaN in the dep_time column):
flights.groupby('hour').agg(lambda x: x.isna().sum())
| hour | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | minute | time_hour |
|------|------|-------|-----|----------|----------------|-----------|----------|----------------|-----------|---------|--------|---------|--------|------|----------|----------|--------|-----------|
1.0 | 0 | 0 | 0 | 1.0 | 0 | 1.0 | 1.0 | 0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0.0 | 0 |
5.0 | 0 | 0 | 0 | 9.0 | 0 | 9.0 | 9.0 | 0 | 13.0 | 0 | 0 | 6 | 0 | 0 | 13.0 | 0.0 | 0.0 | 0 |
6.0 | 0 | 0 | 0 | 425.0 | 0 | 425.0 | 454.0 | 0 | 504.0 | 0 | 0 | 127 | 0 | 0 | 504.0 | 0.0 | 0.0 | 0 |
7.0 | 0 | 0 | 0 | 289.0 | 0 | 289.0 | 305.0 | 0 | 346.0 | 0 | 0 | 102 | 0 | 0 | 346.0 | 0.0 | 0.0 | 0 |
8.0 | 0 | 0 | 0 | 442.0 | 0 | 442.0 | 465.0 | 0 | 508.0 | 0 | 0 | 161 | 0 | 0 | 508.0 | 0.0 | 0.0 | 0 |
9.0 | 0 | 0 | 0 | 327.0 | 0 | 327.0 | 343.0 | 0 | 381.0 | 0 | 0 | 128 | 0 | 0 | 381.0 | 0.0 | 0.0 | 0 |
10.0 | 0 | 0 | 0 | 290.0 | 0 | 290.0 | 303.0 | 0 | 338.0 | 0 | 0 | 96 | 0 | 0 | 338.0 | 0.0 | 0.0 | 0 |
11.0 | 0 | 0 | 0 | 296.0 | 0 | 296.0 | 314.0 | 0 | 344.0 | 0 | 0 | 61 | 0 | 0 | 344.0 | 0.0 | 0.0 | 0 |
12.0 | 0 | 0 | 0 | 388.0 | 0 | 388.0 | 404.0 | 0 | 437.0 | 0 | 0 | 80 | 0 | 0 | 437.0 | 0.0 | 0.0 | 0 |
13.0 | 0 | 0 | 0 | 429.0 | 0 | 429.0 | 456.0 | 0 | 499.0 | 0 | 0 | 90 | 0 | 0 | 499.0 | 0.0 | 0.0 | 0 |
14.0 | 0 | 0 | 0 | 566.0 | 0 | 566.0 | 611.0 | 0 | 684.0 | 0 | 0 | 170 | 0 | 0 | 684.0 | 0.0 | 0.0 | 0 |
15.0 | 0 | 0 | 0 | 670.0 | 0 | 670.0 | 733.0 | 0 | 806.0 | 0 | 0 | 222 | 0 | 0 | 806.0 | 0.0 | 0.0 | 0 |
16.0 | 0 | 0 | 0 | 840.0 | 0 | 840.0 | 891.0 | 0 | 957.0 | 0 | 0 | 249 | 0 | 0 | 957.0 | 0.0 | 0.0 | 0 |
17.0 | 0 | 0 | 0 | 660.0 | 0 | 660.0 | 691.0 | 0 | 759.0 | 0 | 0 | 150 | 0 | 0 | 759.0 | 0.0 | 0.0 | 0 |
18.0 | 0 | 0 | 0 | 626.0 | 0 | 626.0 | 667.0 | 0 | 711.0 | 0 | 0 | 289 | 0 | 0 | 711.0 | 0.0 | 0.0 | 0 |
19.0 | 0 | 0 | 0 | 861.0 | 0 | 861.0 | 898.0 | 0 | 934.0 | 0 | 0 | 283 | 0 | 0 | 934.0 | 0.0 | 0.0 | 0 |
20.0 | 0 | 0 | 0 | 636.0 | 0 | 636.0 | 656.0 | 0 | 678.0 | 0 | 0 | 232 | 0 | 0 | 678.0 | 0.0 | 0.0 | 0 |
21.0 | 0 | 0 | 0 | 409.0 | 0 | 409.0 | 418.0 | 0 | 430.0 | 0 | 0 | 62 | 0 | 0 | 430.0 | 0.0 | 0.0 | 0 |
22.0 | 0 | 0 | 0 | 78.0 | 0 | 78.0 | 81.0 | 0 | 81.0 | 0 | 0 | 3 | 0 | 0 | 81.0 | 0.0 | 0.0 | 0 |
23.0 | 0 | 0 | 0 | 13.0 | 0 | 13.0 | 13.0 | 0 | 19.0 | 0 | 0 | 0 | 0 | 0 | 19.0 | 0.0 | 0.0 | 0 |
This gives the number of missing values for every column. Since we are only interested in dep_time, we refine our query:
flights.groupby('hour')['dep_time'].agg(lambda x: x.isna().sum())
hour
1.0       1.0
5.0       9.0
6.0     425.0
7.0     289.0
8.0     442.0
9.0     327.0
10.0    290.0
11.0    296.0
12.0    388.0
13.0    429.0
14.0    566.0
15.0    670.0
16.0    840.0
17.0    660.0
18.0    626.0
19.0    861.0
20.0    636.0
21.0    409.0
22.0     78.0
23.0     13.0
Name: dep_time, dtype: float64
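If the cancellation rate is of more interest than the raw count, the mean of the isna() mask gives the fraction of cancelled flights directly, since the mean of a boolean mask is the proportion of True values. A sketch on a tiny made-up stand-in for flights (the column names match the real dataset, the values are invented):

```python
import pandas as pd
import numpy as np

# Hypothetical miniature stand-in for the flights DataFrame
flights = pd.DataFrame(
    {
        'hour': [5.0, 5.0, 6.0, 6.0, 6.0, 6.0],
        'dep_time': [515.0, np.nan, 600.0, np.nan, np.nan, 630.0],
    }
)

# Mean of the NaN mask = fraction of cancelled flights per hour
rate = flights.groupby('hour')['dep_time'].agg(lambda x: x.isna().mean())
print(rate)
```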
So far we have computed the same indicators for every column. It is also possible to specify different functions for different columns, using a dictionary.
df.groupby('X').agg(
    {
        'Y': 'describe',
        'Z': np.mean
    }
)
| X | Y count | Y mean | Y std    | Y min | Y 25% | Y 50% | Y 75% | Y max | Z mean |
|---|---------|--------|----------|-------|-------|-------|-------|-------|--------|
| a | 2.0     | 2.0    | 1.414214 | 1.0   | 1.50  | 2.0   | 2.50  | 3.0   | 9.5    |
| b | 1.0     | 4.0    | NaN      | 4.0   | 4.00  | 4.0   | 4.00  | 4.0   | 12.5   |
| c | 2.0     | 6.5    | 0.707107 | 6.0   | 6.25  | 6.5   | 6.75  | 7.0   | 15.0   |
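The two remaining verbs from the introduction, transform and filter, follow the same groupby pattern. A minimal sketch on the original df (before the missing values were introduced):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "X": ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
        "Y": np.arange(8, dtype=float),
        "Z": np.arange(8, 16, dtype=float),
    }
)

# transform: one value per original row, same index as df;
# here, the z-score of each value within its own group
z_scores = df.groupby('X')[['Y', 'Z']].transform(lambda x: (x - x.mean()) / x.std())

# filter: keep only the rows belonging to groups that satisfy a predicate;
# here, groups with more than two rows (only group 'a' qualifies)
big_groups = df.groupby('X').filter(lambda g: len(g) > 2)

print(z_scores)
print(big_groups)
```

Note that transform returns a result aligned with the original rows (8 rows here), while filter returns a sub-DataFrame containing only the surviving groups.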