How to Rename Columns in Pandas: With Up to 4X Speed | by Ryan Louis Stevens | May, 2022

Or why hidden Pandas copying is slowing you down

Pandas’ performance can be slowed down by copying that happens under the hood. Using column renaming as an example, let’s see how this hidden copying leads to an almost 4x slowdown compared with the most performant method.

We’ll do two things:

  • Show two ways to rename columns in pandas
  • Show the most performant method and an important performance bottleneck

Renaming Columns in Pandas

For this example, we create a sample dataframe with 1,000 columns and 10,000 rows:

Create dataset for testing
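The original dataset snippet is not reproduced here. A minimal sketch of such a setup might look like the following; the column names, the random values, and the exact contents of the rename_cols mapping are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

N_ROWS, N_COLS = 10_000, 1_000

# Assumed layout: float columns named col_0 ... col_999.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((N_ROWS, N_COLS)),
    columns=[f"col_{i}" for i in range(N_COLS)],
)

# Assumed mapping from every old column name to a new one.
rename_cols = {col: f"new_{col}" for col in df.columns}
```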

There are two ways to rename dataframe columns:

Method 1: the rename method

This is how it’s most often done. A prototypical call is:

df = df.rename(columns=rename_cols)

Method 2: replace the columns attribute

Another way is to simply set the dataframe’s columns attribute directly:

df.columns = [rename_cols[col] for col in df.columns]
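As a quick sanity check, the two approaches can be compared on a tiny frame. This is only a sketch: the names small, mapping, renamed, and assigned are made up for the example.

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mapping = {"a": "alpha", "b": "beta"}

# Method 1: rename returns a new DataFrame and leaves `small` untouched.
renamed = small.rename(columns=mapping)

# Method 2: assign the new labels directly (done on a copy here so that
# `small` stays available for comparison).
assigned = small.copy()
assigned.columns = [mapping[col] for col in assigned.columns]

same_labels = list(renamed.columns) == list(assigned.columns)
```

One difference worth keeping in mind: rename leaves columns that are missing from the mapping unchanged, while the list comprehension above raises a KeyError for them. Using mapping.get(col, col) in the comprehension would give the same tolerant behavior.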

Method 2 is already optimized. Method 1, however, has two key parameters in the rename method. We quote their descriptions directly from the documentation:

  • copy: Also copy underlying data
  • inplace: Whether to return a new DataFrame. If True then value of copy is ignored.

We’ll show the performance of Method 1 with these parameters toggled on and off, and compare that to Method 2.

The results can be seen in the graph below.

What is clear is that Method 2, changing the attribute directly, is the fastest, while using the rename method with non-inplace copying is the slowest.
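A comparison along these lines can be sketched with a small timing harness. This is not the benchmark behind the graph: the helper names, the smaller frame size, and the repeat count are all assumptions, and absolute numbers will vary by machine and pandas version. Newer pandas releases are phasing out the copy keyword, so a deprecation warning is possible and is silenced here.

```python
import time
import warnings

import numpy as np
import pandas as pd


def make_frame(n_rows=1_000, n_cols=200):
    """Build a fresh frame so every timed call starts from the same state."""
    cols = [f"col_{i}" for i in range(n_cols)]
    return pd.DataFrame(np.zeros((n_rows, n_cols)), columns=cols)


def rename_copy(df, mapping):
    return df.rename(columns=mapping, copy=True)


def rename_no_copy(df, mapping):
    return df.rename(columns=mapping, copy=False)


def rename_inplace(df, mapping):
    df.rename(columns=mapping, inplace=True)
    return df


def set_columns(df, mapping):
    df.columns = [mapping[col] for col in df.columns]
    return df


def time_variant(func, repeats=5):
    """Return the best wall-clock time over several runs on fresh frames."""
    timings = []
    for _ in range(repeats):
        df = make_frame()
        mapping = {col: f"new_{col}" for col in df.columns}
        start = time.perf_counter()
        func(df, mapping)
        timings.append(time.perf_counter() - start)
    return min(timings)


variants = [rename_copy, rename_no_copy, rename_inplace, set_columns]
with warnings.catch_warnings():
    # The copy keyword may trigger a deprecation warning on newer pandas.
    warnings.simplefilter("ignore")
    results = {func.__name__: time_variant(func) for func in variants}

for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>15}: {seconds * 1e3:.3f} ms")
```

Each variant gets a fresh frame per run because the in-place variants mutate their input; reusing one frame would make later runs fail or measure something different.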

It’s obvious that copying the dataframe will slow down performance, so it makes sense that copy=True is slow. But why is it still slow when copying is turned off, i.e. when we set copy=False in the rename method?

We dig a bit into the source code of the rename method to form a hypothesis. While there’s a lot of control-flow logic (if/else statements), there’s also some hidden copying going on. Specifically, there’s this line in the rename method:

result._set_axis_nocheck(new_index, axis=axis_no, inplace=True)

I want to break this call down a little bit.

Inside rename, result is set either to self (when inplace=True) or to a copy of the dataframe (otherwise). In the non-inplace case, the variable result is just a copy of the dataframe.

The parameter new_index is roughly the values of the new column names. Technically it’s an Index object that holds the column names, but that’s more detail than we need here. What’s the key performance bottleneck? It’s doing an inplace replacement on a copy of the dataframe.
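The control flow described above can be paraphrased in a few lines. The function below is a hypothetical, heavily simplified stand-in written for illustration; it is not the pandas source, and rename_columns_sketch is an invented name.

```python
import pandas as pd


def rename_columns_sketch(df, mapping, copy=True, inplace=False):
    """Hypothetical simplification of the rename flow described in the text."""
    # Work on the frame itself when inplace, otherwise on a copy
    # (deep when copy=True, shallow when copy=False).
    result = df if inplace else df.copy(deep=copy)
    # Labels missing from the mapping are kept unchanged.
    new_labels = [mapping.get(col, col) for col in result.columns]
    # The axis is then replaced in place on `result`.
    result.columns = new_labels
    return None if inplace else result


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = rename_columns_sketch(df, {"a": "alpha"}, copy=False)
```

Even with copy=False, the non-inplace path still builds a new DataFrame object before swapping its labels, which is the extra step that setting df.columns directly avoids.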

Actually, _set_axis_nocheck has a comment explaining this:

While I don’t want to claim that this inplace copying and attribute setting is the only bottleneck, it teaches an important lesson about pandas and performance. Beware of the hidden copying!
