python - How to generate group based columns faster? -
i'd create column dataframe based on anther column. example, have dataframe this:
content date id bob birthday 2010.03.01 bob school 2010.04.01 tom shopping 2010.02.01 tom work 2010.09.01 tom holiday 2010.10.01
i want generate column equals size of id
, resulting dataframe looks this:
content date size id bob birthday 2010.03.01 2 bob school 2010.04.01 2 tom shopping 2010.02.01 3 tom work 2010.09.01 3 tom holiday 2010.10.01 3
the standard way seems use groupby
, transform
. code work:
df['size'] = df['date'].groupby(df.index).transform(np.size)
the problem is, transform
works slow. in dataframe 40k
rows, above code takes more 10 sec
on pc. regularly work on datasets larger 1 million rows , generating group-based variables frequent practice.
the problem lies transform
. example, if generate cumcount
on same dataframe using
# method 1: use groupby attribute 'cumcount' df['cumcount'] = df['date'].groupby(df.index).cumcount() # method 2: use 'transform' df['cumcount'] = df['date'].groupby(df.index).transform(lambda x: np.arange(0, len(x)))
method 1
takes 0.2 sec
while method 2
again takes 14 sec
. however, groupby
not seem have attributes generating columns group size, group max, group mean
, etc. there method can improve performance here?
any appreciated.
see issue here: https://github.com/pydata/pandas/issues/6496.
these equivalent, 2nd faster
in [41]: %timeit grp.transform(np.size) 1 loops, best of 3: 442 ms per loop in [40]: %timeit pd.concat([ series([r]*len(grp.groups[i])) i, r in enumerate(grp.size().values) ],ignore_index=true) 10 loops, best of 3: 135 ms per loop
this scales number of groups not number of rows
waiting on implement. not difficult, , first pull-request.
Comments
Post a Comment