python - How to generate group based columns faster? -

- June 15, 2012

i'd create column dataframe based on anther column. example, have dataframe this:

      content        date id                        bob  birthday  2010.03.01 bob    school  2010.04.01 tom  shopping  2010.02.01 tom      work  2010.09.01 tom   holiday  2010.10.01

i want generate column equals size of id, resulting dataframe looks this:

      content        date  size id                        bob  birthday  2010.03.01     2 bob    school  2010.04.01     2 tom  shopping  2010.02.01     3 tom      work  2010.09.01     3  tom   holiday  2010.10.01     3

the standard way seems use groupby , transform. code work:

df['size'] = df['date'].groupby(df.index).transform(np.size)

the problem is, transform works slow. in dataframe 40k rows, above code takes more 10 sec on pc. regularly work on datasets larger 1 million rows , generating group-based variables frequent practice.

the problem lies transform. example, if generate cumcount on same dataframe using

# method 1: use groupby attribute 'cumcount' df['cumcount'] = df['date'].groupby(df.index).cumcount() # method 2: use 'transform' df['cumcount'] = df['date'].groupby(df.index).transform(lambda x: np.arange(0, len(x)))

method 1 takes 0.2 sec while method 2 again takes 14 sec. however, groupby not seem have attributes generating columns group size, group max, group mean, etc. there method can improve performance here?

any appreciated.

see issue here: https://github.com/pydata/pandas/issues/6496.

these equivalent, 2nd faster

in [41]: %timeit grp.transform(np.size) 1 loops, best of 3: 442 ms per loop  in [40]: %timeit pd.concat([ series([r]*len(grp.groups[i])) i, r in enumerate(grp.size().values) ],ignore_index=true) 10 loops, best of 3: 135 ms per loop

this scales number of groups not number of rows

waiting on implement. not difficult, , first pull-request.

Search This Blog

EIght

python - How to generate group based columns faster? -

Comments

Post a Comment

Popular posts from this blog

windows - Single EXE to Install Python Standalone Executable for Easy Distribution -

c# - Access objects in UserControl from MainWindow in WPF -

javascript - How to name a jQuery function to make a browser's back button work? -