Removing Outliers from Adjacent Points Using Rolling Median in Pandas

Removing Points Which Deviate Too Much from Adjacent Points in Pandas

Introduction

Pandas is a Python library for data manipulation and analysis. It provides data structures and functions for handling structured data efficiently, including tabular data such as spreadsheets and SQL tables. A common task in data analysis is removing outliers, that is, noisy points that deviate significantly from the points around them. In this article, we will explore how to remove points that deviate too much from their adjacent points in Pandas, using the rolling function and a rolling median.

The Problem

The question concerns identifying and removing outliers from a time series, that is, points that deviate significantly from their neighbors. The data is stored in a pandas DataFrame whose first column holds dates and whose second column holds daily measurements from 1984 to the present. The goal is to detect these points by comparing each measurement with a rolling median of nearby measurements, and then to drop them.

Approach

To tackle this problem, we can use the rolling function provided by Pandas, which applies a window-based calculation along the rows of a DataFrame. We will use a rolling median: for each data point, we compute the median of the D observations in its window and compare the data point with that median.
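
To make the rolling median concrete, here is a minimal sketch on a hypothetical seven-point series with one spike. The values, the window size of 3, and the variable names are illustrative assumptions, not taken from the gauge data. It contrasts the default trailing window with a centered one (center=True).

```python
import pandas as pd

# Hypothetical toy series with one obvious spike at position 3
s = pd.Series([1.0, 1.1, 0.9, 9.0, 1.0, 1.2, 1.1])

# Centered window of 3 points: the point itself plus one neighbor on each side
centered = s.rolling(window=3, center=True).median()

# Trailing window (the pandas default): the point itself plus the two before it
trailing = s.rolling(window=3).median()

print(pd.DataFrame({"value": s, "centered": centered, "trailing": trailing}))
```

In both variants the median next to the spike stays close to 1, which is why a large gap between a value and its rolling median is a useful outlier signal. Positions without a full window come out as NaN unless min_periods is lowered.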

Here’s a step-by-step breakdown of the approach:

  1. Define the Distance (D)

    • The distance D is the window size: the number of consecutive observations, counted along the index, that are used to compute each rolling median.
  2. Calculate Rolling Median

    • Use the rolling function to calculate the median of adjacent data points within the specified distance.
  3. Identify Outliers

    • Compare each data point with its corresponding rolling median. If the absolute difference exceeds a chosen threshold (e.g., 2 standard deviations), flag the point as an outlier.
  4. Remove Outliers

    • Remove rows containing outliers from the DataFrame.
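
The four steps above can be sketched on a small hypothetical series as follows. The dates and values are made up for illustration. The centered window and min_periods=1 are assumptions chosen so that every point is compared with neighbors on both sides. The scale used for the threshold is the standard deviation of the rolling median, which is one possible choice.

```python
import pandas as pd

# Hypothetical ten-day series with a single spike on 2020-01-04
dates = pd.date_range("2020-01-01", periods=10, freq="D")
gauge = pd.Series(
    [10.0, 10.2, 10.1, 25.0, 10.3, 10.2, 10.4, 10.3, 10.5, 10.4],
    index=dates,
    name="Gauge",
)

# Step 1: define the window size D (assumed value)
D = 3

# Step 2: rolling median over D points, centered on each observation
rolling_median = gauge.rolling(window=D, center=True, min_periods=1).median()

# Step 3: flag points whose distance from the median exceeds 2 standard deviations
threshold = 2 * rolling_median.std()
is_outlier = (gauge - rolling_median).abs() > threshold

# Step 4: keep only the points that were not flagged
cleaned = gauge[~is_outlier]

print(gauge[is_outlier])
```

On this toy series only the value 25.0 is flagged, so cleaned holds the remaining nine observations.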

Implementation

Here’s a Python code snippet that implements this approach using Pandas:

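import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load data: the first column is the date, the second is the daily gauge reading
auge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
auge.columns = ['Date', 'Gauge']
auge = auge.set_index('Date').sort_index()

# Window size D: number of consecutive daily observations per rolling median
D = 2

# Threshold for outlier detection, in standard deviations of the rolling median
threshold_std_dev = 2

def remove_outliers(auge, D, threshold_std_dev):
    # Rolling median of the Gauge column (a Series, so it aligns with auge['Gauge'])
    auge_rolling_median = auge['Gauge'].rolling(window=D).median()

    # Flag points whose distance from the rolling median exceeds the threshold
    is_outlier = np.abs(auge['Gauge'] - auge_rolling_median) > threshold_std_dev * auge_rolling_median.std()

    # Keep only the rows that are not flagged; the input DataFrame is left unchanged
    return auge[~is_outlier]

# Apply function to remove outliers
auge_cleaned = remove_outliers(auge, D, threshold_std_dev)

# Plot the raw data and its rolling median for 1990-1995 on the same axes
ax = auge.loc['1990':'1995', 'Gauge'].plot(style='*', label='Gauge')
auge['Gauge'].rolling(window=D).median().loc['1990':'1995'].plot(ax=ax, style='*', label='Rolling median')
ax.legend()
plt.show()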

In the above code:

  • We define a remove_outliers function that takes the DataFrame, the distance (D), and the threshold for outlier detection as input parameters.
  • Within this function, we calculate the rolling median of adjacent data points using the rolling method with a window size equal to the specified distance.
  • We then identify outliers by comparing each data point’s absolute difference from its corresponding rolling median with the threshold value multiplied by the standard deviation of the rolling median. If the difference exceeds this product, the point is considered an outlier.
  • Finally, we remove rows containing outliers using boolean indexing and return the cleaned DataFrame.
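
The choice of window affects which points get flagged. The following sketch checks this on synthetic data with spikes injected at known positions. The series, the helper flag_outliers, and the min_periods=1 setting are hypothetical and only mirror the rule described above; they are not part of the original code.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (hypothetical data): a slow seasonal cycle plus small noise
rng = np.random.default_rng(0)
dates = pd.date_range("1984-01-01", periods=365, freq="D")
t = np.arange(365)
values = 10 + np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.05, size=365)

# Inject three spikes at known positions so the result can be checked
spike_positions = [50, 120, 300]
values[spike_positions] += 5.0
series = pd.Series(values, index=dates, name="Gauge")

def flag_outliers(series, window, n_std, center):
    """Hypothetical helper: flag points far from their rolling median."""
    median = series.rolling(window=window, center=center, min_periods=1).median()
    return (series - median).abs() > n_std * median.std()

# Trailing two-point window versus centered three-point window
trailing = flag_outliers(series, window=2, n_std=2, center=False)
centered = flag_outliers(series, window=3, n_std=2, center=True)

trailing_positions = np.flatnonzero(trailing.to_numpy()).tolist()
centered_positions = np.flatnonzero(centered.to_numpy()).tolist()
print("trailing, D=2:", trailing_positions)
print("centered, D=3:", centered_positions)
```

On this synthetic series the trailing two-point window flags each spike and also the day after it, because the median of two points is their mean and the spike pulls it halfway. The centered three-point window flags only the spikes. An odd, centered window may therefore be worth trying if valid points next to a spike are being removed.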

Conclusion

In this article, we explored how to remove points that deviate too much from their adjacent points in Pandas using a rolling median. Comparing each value with the median of its window lets you identify and drop data points that differ sharply from the surrounding values.


Last modified on 2024-12-31