Code Optimization: Filtering dataframes using exact matches in multiple columns

Filtering medium to large amounts of data to extract a relevant subset is a very common task in any data-related project. Often the data lives in pandas dataframes. In this post I want to compare some filtering options for exact matches across multiple columns. The idea is pretty simple: we have a dataframe with multiple columns and rows, as well as a list of conditions by which we want to extract data from it....

2023-11-17 · 8 min · Maurice Borgmeier
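To make the task in this teaser concrete, here is a minimal sketch of one possible approach: turning the list of conditions into a small dataframe and inner-joining it against the data. The column names, values, and the `conditions` list are made up for illustration, and this is not necessarily one of the options the post compares.

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "US"],
    "city": ["Berlin", "Hamburg", "Paris", "Boston"],
    "value": [1, 2, 3, 4],
})

# Each tuple is one exact-match condition across the two key columns.
conditions = [("DE", "Berlin"), ("US", "Boston")]

# Turn the conditions into a dataframe and keep only rows that match one of them.
wanted = pd.DataFrame(conditions, columns=["country", "city"])
filtered = df.merge(wanted, on=["country", "city"], how="inner")

print(filtered)
```

Note that duplicate entries in `conditions` would duplicate the matching rows, so deduplicating the conditions first is probably a good idea.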

Code Optimization: Finding the correct spot on the leaderboard

Imagine you’re running a sports competition with multiple disciplines and you need to keep track of the top 10 fastest times in each of them. As each athlete finishes competing in one or more games, they want to know what their spot on the leaderboard is. What’s the fastest way to compute this across a range of competitions? Given an n x m matrix like you can see below, where the rows are the disciplines and the columns are the top 10 spots, figure out where player p ranks in all disciplines based on their times....

2023-10-28 · 8 min · Maurice Borgmeier
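As a rough illustration of the leaderboard problem in this teaser, the sketch below uses a binary search per discipline. It assumes each row is sorted ascending (fastest time first) and that a tie ranks behind the existing entry. The function name `leaderboard_spots` and the numbers are hypothetical, and the post itself may arrive at a different solution.

```python
from bisect import bisect_right


def leaderboard_spots(leaderboard, player_times):
    """Return the 1-based spot per discipline, or None if the time misses the board."""
    spots = []
    for row, time in zip(leaderboard, player_times):
        # Assumption: each row is sorted ascending, so binary search applies.
        # bisect_right places the player after any existing entry with the same time.
        position = bisect_right(row, time)
        spots.append(position + 1 if position < len(row) else None)
    return spots


# Made-up data: two disciplines with four leaderboard spots each.
leaderboard = [
    [10.1, 10.4, 10.9, 11.2],
    [20.0, 21.5, 22.3, 23.0],
]
player_times = [10.5, 23.5]

print(leaderboard_spots(leaderboard, player_times))  # → [3, None]
```

Binary search keeps the cost per discipline at O(log m) instead of scanning all m spots.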

Even more efficient hashing of columns in a pandas dataframe

One of the joys of software development is that small changes can sometimes make solving the same problem orders of magnitude faster. Revisiting previous solutions with more experience can lead to even better results. I show you how I improved the previous implementation by a factor of 2.7.

2022-12-20 · 5 min · Maurice Borgmeier

Efficiently hashing columns in a pandas dataframe

One of the joys of software development is that small changes can sometimes make solving the same problem orders of magnitude faster. I experienced this recently when implementing a function to generate a hash over multiple columns in a dataframe. Today I’m going to show you how I came up with that solution.

2022-09-18 · 9 min · Maurice Borgmeier