python - How to generate group based columns faster? -
i'd create column dataframe based on anther column. example, have dataframe this:
content date id bob birthday 2010.03.01 bob school 2010.04.01 tom shopping 2010.02.01 tom work 2010.09.01 tom holiday 2010.10.01 i want generate column equals size of id, resulting dataframe looks this:
content date size id bob birthday 2010.03.01 2 bob school 2010.04.01 2 tom shopping 2010.02.01 3 tom work 2010.09.01 3 tom holiday 2010.10.01 3 the standard way seems use groupby , transform. code work:
df['size'] = df['date'].groupby(df.index).transform(np.size) the problem is, transform works slow. in dataframe 40k rows, above code takes more 10 sec on pc. regularly work on datasets larger 1 million rows , generating group-based variables frequent practice.
the problem lies transform. example, if generate cumcount on same dataframe using
# method 1: use groupby attribute 'cumcount' df['cumcount'] = df['date'].groupby(df.index).cumcount() # method 2: use 'transform' df['cumcount'] = df['date'].groupby(df.index).transform(lambda x: np.arange(0, len(x))) method 1 takes 0.2 sec while method 2 again takes 14 sec. however, groupby not seem have attributes generating columns group size, group max, group mean, etc. there method can improve performance here?
any appreciated.
see issue here: https://github.com/pydata/pandas/issues/6496.
these equivalent, 2nd faster
in [41]: %timeit grp.transform(np.size) 1 loops, best of 3: 442 ms per loop in [40]: %timeit pd.concat([ series([r]*len(grp.groups[i])) i, r in enumerate(grp.size().values) ],ignore_index=true) 10 loops, best of 3: 135 ms per loop this scales number of groups not number of rows
waiting on implement. not difficult, , first pull-request.
Comments
Post a Comment