Or why hidden Pandas copying is slowing you down
Pandas performance can be slowed down by copying that happens under the hood. Using column renaming as an example, let's see how this hidden copying leads to an almost 4x slowdown compared with the most performant method.
We'll accomplish two things:
- Show two ways to rename columns in pandas
- Show the most performant method and an important performance bottleneck
Renaming Columns in Pandas
For this example, we create a sample data frame with 1,000 columns and 10,000 rows:
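The setup code itself is not shown here, so the following is only a minimal sketch of what it might look like. The column names (col_0 ... col_999), the new_ prefix, and the random values are assumptions made for illustration; only the shape and the names df and rename_cols come from the article.

```python
import numpy as np
import pandas as pd

N_ROWS, N_COLS = 10_000, 1_000  # sizes stated in the article

# Hypothetical column names: col_0 ... col_999
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((N_ROWS, N_COLS)),
    columns=[f"col_{i}" for i in range(N_COLS)],
)

# Hypothetical mapping from every old column name to a new one
rename_cols = {col: f"new_{col}" for col in df.columns}
```

Any data would do; what matters for the comparison is that the frame is wide and that rename_cols has an entry for every column.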
There are two methods to rename columns.
Method 1: the rename method
This is how it's most often done. A prototypical call is:
df = df.rename(columns=rename_cols)
Method 2: replace the columns attribute
Another way is to simply set the data frame's columns attribute directly:
df.columns = [rename_cols[col] for col in df.columns]
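One practical difference between the two methods is worth keeping in mind: rename leaves alone any column that is missing from the mapping, while the list comprehension above raises a KeyError for it. The sketch below uses a small hypothetical frame and a deliberately partial mapping, with dict.get as one possible way to make Method 2 tolerate missing keys.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
partial = {"a": "x", "b": "y"}  # hypothetical mapping that omits "c"

# Method 1: labels that are not in the mapping are kept as they are
renamed = df.rename(columns=partial)

# Method 2: fall back to the old name when a column is not in the mapping
direct = df.copy()
direct.columns = [partial.get(col, col) for col in direct.columns]

print(list(renamed.columns))
print(list(direct.columns))
```

When the mapping covers every column, as in this article, the plain rename_cols[col] lookup is enough.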
Method 2 is already optimized. However, there are two key parameters in the rename method; we quote their descriptions directly from the documentation:
copy: Also copy underlying data
inplace: Whether to return a new DataFrame. If True then value of copy is ignored.
We'll show the performance of Method 1 with these parameters toggled on and off, and compare that to Method 2.
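The benchmark code is not included here, so the following is a hedged sketch of how such a comparison could be run rather than the original harness. The helper names (make_frame, set_columns, benchmark, VARIANTS) are hypothetical, the default frame is smaller than the article's so the sketch runs quickly, and each call gets a fresh shallow copy created outside the timed region so that the in-place variants do not modify the source frame.

```python
import time
import warnings

import numpy as np
import pandas as pd


def make_frame(n_rows=1_000, n_cols=1_000):
    """Build a wide frame of random floats (hypothetical helper)."""
    data = np.random.default_rng(0).random((n_rows, n_cols))
    return pd.DataFrame(data, columns=[f"col_{i}" for i in range(n_cols)])


def set_columns(frame, mapping):
    """Method 2: assign the new labels directly."""
    frame.columns = [mapping[col] for col in frame.columns]
    return frame


VARIANTS = {
    "rename(copy=True)": lambda f, m: f.rename(columns=m, copy=True),
    "rename(copy=False)": lambda f, m: f.rename(columns=m, copy=False),
    "rename(inplace=True)": lambda f, m: f.rename(columns=m, inplace=True),
    "df.columns = [...]": set_columns,
}


def benchmark(df, mapping, repeats=5):
    """Return the best wall-clock time in seconds for each variant."""
    best = {}
    with warnings.catch_warnings():
        # Newer pandas versions may warn about the `copy` keyword
        warnings.simplefilter("ignore")
        for name, func in VARIANTS.items():
            timings = []
            for _ in range(repeats):
                frame = df.copy(deep=False)  # fresh frame, not timed
                start = time.perf_counter()
                func(frame, mapping)
                timings.append(time.perf_counter() - start)
            best[name] = min(timings)
    return best


df = make_frame()
rename_cols = {col: f"new_{col}" for col in df.columns}
for name, seconds in benchmark(df, rename_cols).items():
    print(f"{name:<22} {seconds * 1e3:8.3f} ms")
```

Absolute numbers and even the ratios depend on the pandas version, the frame shape, and the machine, so treat the output as indicative only.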
The results can be seen in the graph below.
What is clear is that Method 2, changing the attribute directly, is the fastest, while using the rename method with non-inplace copying is the slowest.
It's obvious that copying the data frame will slow things down, so it makes sense that copy=True is slow. But why is it still slow when copying is turned off, i.e. when we set copy=False in the rename method?
We dig a bit into the source code of the rename method to form a hypothesis. While there is a lot of control-flow logic (if/else statements), there is also some hidden copying going on. Specifically, there is this line in the rename method:
result._set_axis_nocheck(new_index, axis=axis_no, inplace=True)
Let me break this call down a little.
result is derived from self: unless inplace=True is passed, the variable result is just a copy of the data frame.
new_index is roughly the values of the new column names. Technically it's an Index object holding the column names, but that's more detail than we need.
What's the key performance bottleneck? It's doing an inplace replacement on a copy of the data frame.
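To make the idea concrete, here is a rough imitation of that pattern using only public pandas API. It is a sketch of the hypothesis, not the actual internals: the tiny frame and the mapping are hypothetical, and df.copy(deep=False) stands in for whatever copy the rename method makes when copy=False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])
rename_cols = {"a": "x", "b": "y"}  # hypothetical mapping

# Step 1: a shallow copy builds a new DataFrame object
# without duplicating the underlying values
result = df.copy(deep=False)

# Step 2: replace the column labels in place on that copy
result.columns = [rename_cols[col] for col in result.columns]

print(list(df.columns))      # the original keeps its labels
print(list(result.columns))  # the copy carries the new labels
```

Even though no values are duplicated, step 1 still has to construct a new object around the existing data, and that work is skipped entirely when the columns attribute is set on the original frame.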
_set_axis_nocheck has a comment explaining this:
While I don't want to claim that this inplace replacement on a copy is the only bottleneck, it teaches an important lesson about pandas and performance. Beware of hidden copying!