Understanding the Issue with Shuffled ROC Scores
=====================================================

In this blog post, we’ll delve into an issue that arises when trying to find the average ROC score of a feature after randomly shuffling the training target data. We’ll explore the possible causes and solutions for obtaining truly random results.

Background: What is the ROC Score?

The ROC AUC score, the area under the Receiver Operating Characteristic (ROC) curve, is a measure used in machine learning to evaluate binary classification models. The ROC curve plots the true positive rate against the false positive rate at different threshold values, and the area under it quantifies how well a model can distinguish between the two classes: 0.5 corresponds to random ranking and 1.0 to perfect separation.
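As a minimal sketch of what the score measures: the area under the ROC curve equals the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The helper name `pairwise_auc` is hypothetical and meant for illustration only, not as a replacement for `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of correctly ranked (positive, negative) pairs."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))                 # 0.75
print(pairwise_auc(y_true, [-s for s in scores]))   # 0.25
```

Only the ordering of the scores matters here, which is why reversing the order turns a score of A into 1 - A.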

The Problem: Shuffled ROC Scores

The provided code attempts to create 10 logistic regression classifiers that are each trained on a different random shuffling of the training target data for one feature at a time. However, instead of obtaining 10 unique values for the ROC scores, it results in the same value repeated multiple times.

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

def shuffled_roc(df, feature):
    # Shuffle the rows; random_state=0 makes this order identical on every call
    df = df.sample(frac=1, random_state=0)
    
    # Separate features and target variables
    x = df[feature][np.isfinite(df[feature])].copy()
    y = df['target'][np.isfinite(df[feature])].copy()

    # Split data into training and testing sets
    x_train = x.iloc[:int(0.8*len(x))]
    y_train = y.iloc[:int(0.8*len(x))]

    x_test = x.iloc[int(0.8*len(x)):]
    y_test = y.iloc[int(0.8*len(x)):]

    # Shuffle the target variables for each classifier
    y_train_shuffled = y_train.sample(frac=1).reset_index(drop=True)

    rocs = []
    for i in range(10):
        # Repeat the shuffling step
        y_train_shuffled = y_train_shuffled.sample(frac=1).reset_index(drop=True)
        
        # Train a logistic regression classifier
        lr = LogisticRegression(solver='lbfgs').fit(x_train.values.reshape(-1, 1), y_train_shuffled)

        # Calculate the ROC score
        roc = metrics.roc_auc_score(y_test, lr.predict_proba(x_test.values.reshape(-1, 1))[:, 1])
        
        # Store the result
        rocs.append(roc)
    
    # Return the mean ROC score
    return np.mean(rocs)
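Before suspecting the shuffle, it can help to check how many distinct scores a single-feature model can produce at all. The following sketch uses synthetic data and a hypothetical helper, `distinct_shuffled_aucs` (both are assumptions, not part of the original code), to refit the classifier on several target shuffles and collect the distinct test AUCs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def distinct_shuffled_aucs(n_shuffles=20, seed=0):
    """Return the raw-feature test AUC and the distinct AUCs seen across shuffles."""
    rng = np.random.default_rng(seed)

    # Synthetic single-feature data (an assumption, not the post's dataset)
    x = rng.normal(size=500)
    y = (x + rng.normal(size=500) > 0).astype(int)
    x_train, x_test = x[:400].reshape(-1, 1), x[400:].reshape(-1, 1)
    y_train, y_test = y[:400], y[400:]

    # AUC obtained by ranking the test samples on the raw feature itself
    base_auc = roc_auc_score(y_test, x_test.ravel())

    seen = set()
    for _ in range(n_shuffles):
        y_shuffled = rng.permutation(y_train)  # fresh shuffle on every iteration
        model = LogisticRegression(solver="lbfgs").fit(x_train, y_shuffled)
        proba = model.predict_proba(x_test)[:, 1]
        seen.add(round(roc_auc_score(y_test, proba), 6))
    return base_auc, sorted(seen)


base_auc, seen = distinct_shuffled_aucs()
print(base_auc, seen)
```

On data like this, the set typically holds at most two values, the raw-feature AUC and its complement, no matter how many shuffles are drawn. Repeated scores are therefore not, on their own, evidence that the shuffle is broken.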

Possible Causes of Non-Random Results

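There are several possible reasons why the shuffled ROC scores may not be as random as expected. One applies regardless of how the shuffle is done: with a single feature, the probability predicted by logistic regression is a monotonic function of that feature, so the test ROC AUC depends only on the sign of the fitted coefficient. Apart from degenerate fits, each run can therefore yield only one of two values, A or 1 − A, where A is the AUC of the raw feature on the test set. The remaining candidates concern the shuffling and splitting steps: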

1. Lack of Randomization

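The code shuffles the training target with Series.sample(frac=1), not np.random.shuffle. Without a random_state, sample draws a new permutation on every call, so the target order does change between iterations; the earlier df.sample(frac=1, random_state=0) only fixes the row order and therefore the train/test split. Permuting the underlying array directly makes the intent explicit and avoids any reliance on the pandas index:

# Shuffle the target variables using np.random.permutation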
import numpy as np

# ...

y_train_shuffled = y_train.values[np.random.permutation(len(y_train))]

2. Inadequate Data Splitting

The data is split into training and testing sets by slicing the first 80% of the rows, which does not guarantee that both sets contain the two classes in similar proportions. This matters because the ROC score is undefined when the test set contains only one class. A stratified split preserves the class balance:

# Split the data into training and testing sets using stratified sampling
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
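To illustrate what stratification buys, here is a small sketch on toy data (the arrays are assumed purely for illustration) that checks the class proportions after a split with `stratify=y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 20% positives
x = np.arange(100, dtype=float)
y = np.array([1] * 20 + [0] * 80)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the 20% positive rate of the full data
print(y_train.sum(), len(y_train))  # 16 positives out of 80
print(y_test.sum(), len(y_test))    # 4 positives out of 20
```

Without `stratify`, the positive count in the test set would vary with the seed, and on small or imbalanced data it could drop to zero.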

3. Incorrect Random Seed

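The random seed is set to 0 in the original code (random_state=0 in df.sample). The value of a seed has no bearing on how random the output is: it only selects which reproducible sequence is generated. Here it fixes the row order, so every call produces the same train/test split, while the target shuffles inside the loop remain unseeded.

# Seed NumPy's global generator once for reproducible shuffles; the value itself is arbitrary
np.random.seed(42)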
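A short sketch of what a seed does and does not do, using NumPy's `default_rng`; the helper name `shuffled_indices` is hypothetical.

```python
import numpy as np


def shuffled_indices(seed, n=20):
    """Return a permutation of range(n) from a generator seeded with `seed`."""
    return np.random.default_rng(seed).permutation(n)


same_a = shuffled_indices(0)
same_b = shuffled_indices(0)
other = shuffled_indices(42)

# Same seed, same permutation: this is what makes a run reproducible
print(np.array_equal(same_a, same_b))  # True

# A different seed selects a different sequence, not a "more random" one
print(np.array_equal(same_a, other))
```

To get different shuffles within one run, create the generator once and call it repeatedly rather than re-creating it with the same seed.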

Solution: Improving the Shuffling Process

To make the shuffling explicit, we can draw a fresh permutation of the training target inside the loop with np.random.permutation, rather than relying on repeated calls to sample(frac=1). df.sample(frac=1) can still be used beforehand to shuffle the rows.

# Shuffle the training target using np.random.permutation
import numpy as np

# ...

y_train_shuffled = y_train.values[np.random.permutation(len(y_train))]

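Changing the seed, for example from 0 to 42 or 43, changes which permutations are drawn but does not make them any more random. Fix a seed only when reproducible results are needed.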

Conclusion

In this blog post, we explored an issue that arises when trying to find the average ROC score of a feature after randomly shuffling the training target data. We identified several possible causes for non-random results and provided solutions for improving the shuffling process. Permuting the training target explicitly inside the loop, stratifying the train/test split, and fixing a seed only for reproducibility (its value does not affect how random the results are) make the shuffled ROC scores vary across iterations while staying repeatable across runs.

Additional Resources

Code Examples

1. Original code with modifications

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

def shuffled_roc(df, feature):
    # Shuffle the rows; the fixed random_state keeps the row order reproducible
    df = df.sample(frac=1, random_state=42)
    
    # Separate features and target variables, keeping only finite feature values
    x = df[feature][np.isfinite(df[feature])].copy()
    y = df['target'][np.isfinite(df[feature])].copy()

    # Split data into training and testing sets using stratified sampling
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
    
    rocs = []
    for i in range(10):
        # Draw a fresh permutation of the training target for each classifier
        y_train_shuffled = y_train.values[np.random.permutation(len(y_train))]

        # Train a logistic regression classifier
        lr = LogisticRegression(solver='lbfgs').fit(x_train.values.reshape(-1, 1), y_train_shuffled)
        
        # Calculate the ROC score
        roc = metrics.roc_auc_score(y_test, lr.predict_proba(x_test.values.reshape(-1, 1))[:, 1])
        
        # Store the result
        rocs.append(roc)
    
    # Return the mean ROC score
    return np.mean(rocs)

2. Optimized code with additional shuffling step

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

def shuffled_roc(df, feature):
    # Shuffle the rows; the fixed random_state keeps the row order reproducible
    df = df.sample(frac=1, random_state=42)
    
    # Separate features and target variables, keeping only finite feature values
    x = df[feature][np.isfinite(df[feature])].copy()
    y = df['target'][np.isfinite(df[feature])].copy()

    # Split data into training and testing sets using stratified sampling
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
    
    # One seeded generator for all shuffles: reproducible across runs,
    # yet a different permutation on every iteration
    rng = np.random.default_rng(42)

    rocs = []
    for i in range(10):
        # Shuffle the target variables again for each classifier
        y_train_shuffled = rng.permutation(y_train.values)
        
        # Train a logistic regression classifier
        lr = LogisticRegression(solver='lbfgs').fit(x_train.values.reshape(-1, 1), y_train_shuffled)
        
        # Calculate the ROC score
        roc = metrics.roc_auc_score(y_test, lr.predict_proba(x_test.values.reshape(-1, 1))[:, 1])
        
        # Store the result
        rocs.append(roc)
    
    # Return the mean ROC score
    return np.mean(rocs)

Last modified on 2023-07-29