Polars
Polars is a data frame library written entirely in Rust, designed for high speed and efficiency. In this blog, our main focus is the comparison of pandas and polars.
Pandas vs Polars
Most of the data we collect for analysis arrives raw and uncleaned, so before performing any analysis we need to clean it, which means running many operations over the data. Polars provides a fast and efficient framework for cleaning and transforming large datasets, and this is its main advantage over pandas.
Polars is faster and more efficient than pandas. One of the significant problems with pandas is its slow speed and inefficiency when dealing with larger datasets; the major difference between the two libraries is speed.
Let us understand the speed difference between pandas and polars in reading a CSV file.
#Data Loading
from time import perf_counter
import polars as pl

start = perf_counter()
df_pl = pl.read_csv("tmdb_5000_credits.csv")  # eager (not lazy) mode
end = perf_counter()
print(f"Spent {round(end-start,2)}s.")
The above code gives the following output:
Spent 0.4s.
import pandas as pd

start = perf_counter()
df_pd = pd.read_csv("tmdb_5000_credits.csv")
end = perf_counter()
print(f"Spent {round(end-start,2)}s.")
The above code gives the following output:
Spent 0.85s
This shows that for reading a CSV file with 4,803 rows, pandas took 0.85s whereas polars took only 0.4s.
This difference in time is even greater for larger files. Let us take another example: loading fraudTrain.csv, a file with 1,296,675 rows.
First, we load it using pandas. Pandas took 4.27s to load this file.
#pandas
import time

start = time.time()
df_pd = pd.read_csv('fraudTrain.csv')
end = time.time()
print(end - start)
This gives the following output:
4.2697
Now, we load it using polars. Polars took 0.617s to load this file.
#polars
start = time.time()
df_pl = pl.read_csv('fraudTrain.csv')
end = time.time()
print(end - start)
This gives the following output.
0.6169
This example shows us that for loading a large dataset polars is much faster than pandas.
Why is Polars faster and more efficient?
Polars does not involve garbage collection, because Rust does not have a garbage collector.
Polars does not use an index for a data frame.
Polars represents data internally using Apache Arrow arrays, while pandas stores data internally using NumPy arrays.
Polars supports more parallel operations than pandas (see the short sketch after this list).
Polars supports lazy evaluation.
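As a minimal sketch of two of these points, the snippet below (which uses made-up data rather than one of the datasets from this post) builds a small data frame with no index and computes several aggregations in a single select() call, which polars' query engine is free to evaluate in parallel:

#NO INDEX, PARALLEL EXPRESSIONS (SKETCH)
import polars as pl

# A tiny, made-up data frame; note that there is no index column, unlike pandas.
df = pl.DataFrame({
    "category": ["A", "A", "B"],
    "rating": [4.0, 4.2, 3.9],
})

# All expressions passed to one select() go through polars' query engine,
# which can evaluate them in parallel.
out = df.select(
    pl.col("rating").mean().alias("mean_rating"),
    pl.col("rating").max().alias("max_rating"),
    pl.col("category").n_unique().alias("n_categories"),
)
print(out)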
Eager and Lazy API
We will try to understand lazy and eager evaluation in polars by comparing it with pandas.
Let us take an example in which we use pandas to load the Big_basket.csv file.
The following code performs these actions:
Loads the Big_basket.csv file.
Filters the data frame to the Beauty & Hygiene category and the Hair Care sub-category with a 4.1 rating.
Measures the amount of time needed to load the CSV file and apply the filter.
#PANDAS
start = time.time()
df = pd.read_csv('Big_basket.csv')
df = df[(df['category'] == 'Beauty & Hygiene') &
        (df['rating'] == 4.1) &
        (df['sub_category'] == 'Hair Care')]
end = time.time()
print(end - start)
This code gives the following output:
0.2130
This shows that pandas takes 0.213s to perform this task.
Now, let’s do this task again using polars.
#EAGER EXECUTION IN POLARS
start = time.time()
df = pl.read_csv('Big_basket.csv').filter(
    (pl.col('category') == "Beauty & Hygiene") &
    (pl.col('rating') == 4.1) &
    (pl.col('sub_category') == 'Hair Care'))
end = time.time()
print(end - start)
This code gives the following output:
0.165771
This code uses eager execution in polars: the complete CSV file is first loaded into memory and the filter is applied afterwards. Even so, eager polars takes less time than pandas for this task. An eager data frame can also be switched over to the lazy API at any point, as the sketch below shows.
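The following is a rough sketch of that eager-to-lazy pattern (timings are omitted, since they will vary by machine); it relies on the .lazy() and .collect() methods of polars:

#EAGER LOAD, LAZY FILTER (SKETCH)
df_eager = pl.read_csv('Big_basket.csv')   # whole file already in memory

result = (
    df_eager.lazy()                        # switch to the lazy API
    .filter(
        (pl.col('category') == "Beauty & Hygiene") &
        (pl.col('rating') == 4.1) &
        (pl.col('sub_category') == 'Hair Care')
    )
    .collect()                             # run the optimized query
)
print(result.shape)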
Now, we perform lazy execution in polars.
#LAZY EXECUTION IN POLARS
start = time.time()
df = pl.scan_csv('Big_basket.csv').filter(
    (pl.col('category') == "Beauty & Hygiene") &
    (pl.col('rating') == 4.1) &
    (pl.col('sub_category') == 'Hair Care')).collect()
end = time.time()
print(end - start)
This code gives the following output:
0.200
This shows that lazy execution also completes the task faster than pandas (0.200s versus 0.213s). In lazy execution, instead of loading all the rows into the data frame and then filtering, polars builds a query plan, optimizes it, and only materializes the rows that satisfy the conditions in the filter() method. On a file this small the eager and lazy timings are close, but the optimizer's pruning typically matters more as the data and queries grow.
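If you want to verify this yourself, recent polars versions let you print the optimized query plan with explain() (older releases expose a similar describe_optimized_plan() method); in the printed plan the filter conditions appear inside the CSV scan, which is the predicate pushdown at work:

#INSPECTING THE QUERY PLAN (SKETCH)
lazy_query = pl.scan_csv('Big_basket.csv').filter(
    (pl.col('category') == "Beauty & Hygiene") &
    (pl.col('rating') == 4.1) &
    (pl.col('sub_category') == 'Hair Care'))

# The plan shows the filter pushed down into the CSV scan, so rows that
# fail the conditions are never materialized.
print(lazy_query.explain())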
Let us take another example that performs an operation and measures its time.
The following code performs these actions:
Loads the titanic_train.csv file.
Fills the null values of the "Age" column with the column's mean.
Measures the amount of time needed to load the CSV file and fill in the null values.
First, we perform this task in pandas.
#pandas
start = time.time()
df = pd.read_csv('titanic_train.csv')
df['Age'] = df['Age'].fillna(df['Age'].mean())
end = time.time()
print(end - start)
This code gives the following output:
0.029804
Now we perform the same task in lazy execution in polars.
#lazy execution
start = time.time()
lazy = (
    pl.scan_csv('titanic_train.csv')
    .select(
        [
            pl.exclude('Age'),
            pl.col('Age').fill_null(value=pl.col('Age').mean()),
        ]
    )
)
lazy.collect()
end = time.time()
print(end - start)
This code gives the following output:
0.009909
The time difference between the two is easy to see: roughly 0.0298s for pandas versus 0.0099s for lazy polars.
These examples show that polars is much faster than pandas not only for loading files but also for performing different operations.
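As one more illustration of "different operations" (not timed in the original post, so treat this snippet as a sketch whose numbers will depend on your machine), the same Big_basket.csv file can be aggregated with a group-by in both libraries; the polars version again goes through the lazy optimizer:

#GROUP-BY COMPARISON (SKETCH)
import time
import pandas as pd
import polars as pl

# pandas: mean rating per category
start = time.time()
pd_result = pd.read_csv('Big_basket.csv').groupby('category')['rating'].mean()
print(f"pandas: {time.time() - start:.4f}s")

# polars (lazy): the same aggregation through the query optimizer
start = time.time()
pl_result = (
    pl.scan_csv('Big_basket.csv')
    .group_by('category')              # called groupby() in older polars releases
    .agg(pl.col('rating').mean())
    .collect()
)
print(f"polars: {time.time() - start:.4f}s")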
Sources of the datasets used in this post
https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
https://www.kaggle.com/datasets/dermisfit/fraud-transactions-dataset
https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints
https://www.kaggle.com/datasets/tedllh/titanic-train
Conclusion
To conclude, polars is faster and more efficient than pandas. Pandas still has certain features, such as reading data frames from XML or pickle files, that polars currently lacks, but polars is young and still growing. Until those gaps close, the two libraries interoperate well, as sketched below.
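For example, when a pandas-only feature is needed, a polars data frame can be handed over to pandas with to_pandas() (this requires pyarrow to be installed; the output file name below is just an example):

#POLARS TO PANDAS INTEROP (SKETCH)
import polars as pl

df_pl = pl.read_csv('Big_basket.csv')

# Convert to pandas (needs pyarrow) to use pandas-only I/O such as pickle.
df_pd = df_pl.to_pandas()
df_pd.to_pickle('big_basket.pkl')      # example output path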