Merging Multiple CSV Files into One: A Step-by-Step Guide

Introduction

Working with multiple CSV files is a common task in data analysis and processing, and sooner or later you will need to combine several of them into a single file. In this article, we’ll explore how to achieve this using Python and the pandas library.

One common requirement is to have only one header row in the merged output, rather than having separate headers for each individual CSV file. We’ll discuss how to accomplish this and provide examples to illustrate the process.

Prerequisites

To follow along with this guide, you’ll need:

  • Python 3.x installed on your system
  • The pandas library installed (you can install it using pip: pip install pandas)
  • A basic understanding of Python scripting and data manipulation

Step 1: Understanding the Pandas Library

The pandas library is a powerful tool for data manipulation and analysis in Python. It provides data structures and functions to efficiently handle and process large datasets.

In this guide, we’ll focus on using pandas’ concat function to merge multiple CSV files into one.
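
As a quick, self-contained illustration of what concat does (the two DataFrames here are made up for the example):

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
df2 = pd.DataFrame({"id": [3], "value": [30]})

# Stack the two frames vertically; ignore_index renumbers the rows 0..2
merged = pd.concat([df1, df2], axis=0, ignore_index=True)
print(merged)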

Step 2: Setting Up Your Environment

To set up your environment for merging CSV files, follow these steps:

  • Open a terminal or command prompt
  • Navigate to the directory where you want to store your merged output file
  • Create an empty file called concatenator.py (or any other name of your choice)
  • Copy the following code into the new file:
import os
import glob
import pandas

def concatenate(inDir=r'myPath', outFile=r'outPath'):
    os.chdir(inDir)  # Set the current working directory to inDir
    fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob
    dfList = []

    for filename in fileList:
        print(filename)
        # header=None treats every row, including each file's header row,
        # as data -- the repeated headers are addressed in Step 3
        df = pandas.read_csv(filename, header=None)
        dfList.append(df)

    concatDf = pandas.concat(dfList, axis=0)  # Concatenate all DataFrames into one
    concatDf.to_csv(outFile, index=False)  # Export the concatenated DataFrame to a CSV file

# Call the function with your desired input and output paths
concatenate("input_directory", "output_file.csv")

Step 3: Keeping a Single Header Row in the Merged Output

Because the script above reads every file with header=None, each file’s header row is treated as ordinary data and ends up repeated inside the merged output. To keep only one header row at the top, let pandas parse each file’s header (the default, header=0). concat then aligns the columns by name, and to_csv writes the header exactly once:

import os
import glob
import pandas

def concatenate(inDir=r'myPath', outFile=r'outPath'):
    os.chdir(inDir)  # Set the current working directory to inDir
    fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob
    dfList = []

    for filename in fileList:
        print(filename)
        # The default header=0 uses each file's first row as column names,
        # so the header is not duplicated in the data
        df = pandas.read_csv(filename)
        dfList.append(df)

    # ignore_index renumbers the rows of the combined DataFrame
    concatDf = pandas.concat(dfList, axis=0, ignore_index=True)
    concatDf.to_csv(outFile, index=False)  # The header row is written only once

# Call the function with your desired input and output paths
concatenate("input_directory", "output_file.csv")

Alternatively, you can keep header=None but pass skiprows=1 when reading each individual file, skipping each file’s own header row and reusing the column names taken from the first file:

import os
import glob
import pandas

def concatenate(inDir=r'myPath', outFile=r'outPath'):
    os.chdir(inDir)  # Set the current working directory to inDir
    fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob

    # Take the column names from the first file's header row
    columns = pandas.read_csv(fileList[0], nrows=0).columns
    dfList = []

    for filename in fileList:
        print(filename)
        # Skip each file's own header row and read the rest as raw data
        df = pandas.read_csv(filename, header=None, skiprows=1)
        df.columns = columns
        dfList.append(df)

    concatDf = pandas.concat(dfList, axis=0, ignore_index=True)
    concatDf.to_csv(outFile, index=False)  # Export the concatenated DataFrame to a CSV file

# Call the function with your desired input and output paths
concatenate("input_directory", "output_file.csv")

Step 4: Writing Custom Functions for Specific Requirements

While pandas provides an extensive set of functions for data manipulation, sometimes you need custom solutions tailored to specific requirements.

Here’s how you might extend this script using a new function called merge_csv_files_with_single_header:

import os
import glob
import pandas

def merge_csv_files_with_single_header(input_dir, output_file):
    os.chdir(input_dir)  # Set the current working directory to input_dir
    fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob
    dfList = []

    for filename in fileList:
        print(filename)
        # Let pandas parse each file's header row as column names
        df = pandas.read_csv(filename)
        dfList.append(df)

    if not dfList:
        print("No CSV files were found to merge.")
        return None

    concatDf = pandas.concat(dfList, axis=0, ignore_index=True)
    concatDf.to_csv(output_file, index=False)  # Export the concatenated DataFrame to a CSV file
    return True

# Call the function with your desired input and output paths
if merge_csv_files_with_single_header("input_directory", "output_file.csv"):
    print('success')
else:
    print('Error Occurred')

Step 5: Handling Exceptions for Robustness

To ensure that the merging process fails gracefully instead of crashing when it hits bad input, it’s good practice to use try-except blocks. Here is how you could modify the script:

import os
import glob
import pandas as pd

def merge_csv_files_with_single_header(input_dir, output_file):
    try:
        os.chdir(input_dir)  # Set the current working directory to input_dir
        fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob
        dfList = []

        for filename in fileList:
            print(filename)
            df = pd.read_csv(filename)  # Each file's header row becomes the column names
            dfList.append(df)

        if not dfList:
            raise ValueError("No csv file was found.")

        concatDf = pd.concat(dfList, axis=0, ignore_index=True)
        concatDf.to_csv(output_file, index=False)  # Export the concatenated DataFrame to a CSV file
        return True
    except pd.errors.EmptyDataError as e:
        print(f'An empty csv was encountered: {e}')
    except pd.errors.ParserError as e:
        print(f'Error occurred while parsing: {e}')
    except ValueError as e:
        print(str(e))
    except Exception as e:
        print('An error occurred:', str(e))
    return False

# Call the function with your desired input and output paths
if merge_csv_files_with_single_header("input_directory", "output_file.csv"):
    print('success')
else:
    print('Error Occurred')

Step 6: Reviewing Code for Readability

Readability is just as important as correctness when it comes to code. Here are a few tips to improve readability:

  1. Use clear and concise variable names: Variable names should be descriptive and indicate the kind of data they hold (see the renaming sketch after this list).

  2. Break long lines into shorter ones: Long lines are hard to scan and tend to hide structure; PEP 8 suggests keeping lines within 79 characters.

  3. Organize your code using functions: Functions help make your code more modular and reusable.

  4. Use comments: Comments explain what the code does, why it’s written in a certain way, and how it fits into the overall design of your program.
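
As a small illustration of the first tip, here is the core of the merge rewritten with more descriptive names; this is a purely stylistic sketch, and the behaviour is unchanged:

import glob
import pandas as pd

# Descriptive names make the intent obvious at a glance
csv_paths = glob.glob("*.csv")                      # instead of fileList
frames = [pd.read_csv(path) for path in csv_paths]  # instead of dfList
merged = pd.concat(frames, ignore_index=True)       # instead of concatDf
merged.to_csv("output_file.csv", index=False)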

Step 7: Handling Non-Existent Files

The script should also cope with a non-existent input directory and with files that cannot be parsed. Here is an example:

import os
import glob
import pandas as pd

def merge_csv_files_with_single_header(input_dir, output_file):
    if not os.path.isdir(input_dir):
        raise FileNotFoundError(f"Input directory does not exist: {input_dir}")

    os.chdir(input_dir)  # Set the current working directory to input_dir
    fileList = glob.glob("*.csv")  # Generate a list of CSV files using glob

    if len(fileList) < 1:
        raise ValueError("No csv file was found.")

    dfList = []

    for filename in fileList:
        print(filename)
        try:
            df = pd.read_csv(filename)  # Each file's header row becomes the column names
            dfList.append(df)
        except pd.errors.EmptyDataError as e:
            print(f'An empty csv was encountered: {e}')
        except pd.errors.ParserError as e:
            print(f'Error occurred while parsing: {e}')

    if not dfList:
        raise ValueError("None of the csv files could be read.")

    concatDf = pd.concat(dfList, axis=0, ignore_index=True)  # Concatenate all DataFrames into one
    concatDf.to_csv(output_file, index=False)  # Export the concatenated DataFrame to a CSV file
    return True

# Call the function with your desired input and output paths
try:
    if merge_csv_files_with_single_header("input_directory", "output_file.csv"):
        print('success')
except (FileNotFoundError, ValueError) as e:
    print('Error Occurred:', e)

Step 8: Testing Code for Logical Errors

Testing code is an essential part of ensuring that it works as expected. Here are a few steps you can take to test your code:

  1. Test individual components: Before putting all the pieces together, make sure each component works on its own.

  2. Test edge cases: Test what happens when things don’t go according to plan. For example, if one of the input files is empty, if the directory contains no CSV files at all, or if the files have mismatched columns.

  3. Use automated testing tools: Automated testing tools can save you a lot of time and help ensure that your code works correctly (a minimal pytest sketch follows this list).

  4. Test for logical errors: Test your code to make sure it’s working as intended.
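
To make the third tip concrete, here is a minimal sketch of an automated test using pytest’s tmp_path fixture. It assumes the final version of the function is saved in a module named merge_csv, which is a hypothetical name; adjust the import to match your file:

import pandas as pd

from merge_csv import merge_csv_files_with_single_header  # hypothetical module name

def test_merge_two_files(tmp_path):
    # Create two small CSV files that share the same header
    (tmp_path / "a.csv").write_text("id,value\n1,10\n2,20\n")
    (tmp_path / "b.csv").write_text("id,value\n3,30\n")
    output_file = tmp_path / "merged.csv"

    assert merge_csv_files_with_single_header(str(tmp_path), str(output_file))

    merged = pd.read_csv(output_file)
    assert list(merged.columns) == ["id", "value"]  # a single header row
    assert len(merged) == 3                         # all data rows are present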

Step 9: Testing Code for Performance

Testing code for performance is an essential part of ensuring that it will scale well in real-world applications. Here are a few steps you can take to test your code:

  1. Use profiling tools: Profiling tools can help you identify which parts of your code are using the most resources (see the cProfile sketch after this list).

  2. Test with large inputs: Test your code with large inputs to ensure it will scale well.

  3. Optimize performance-critical sections of code: If certain sections of your code are particularly slow, try optimizing them for better performance.

  4. Use caching mechanisms: Caching can help reduce the amount of repeated computation and improve overall performance.
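
For instance, here is a minimal sketch that profiles the merge with the standard library’s cProfile module (the module name merge_csv and the paths are placeholders):

import cProfile
import pstats

from merge_csv import merge_csv_files_with_single_header  # hypothetical module name

profiler = cProfile.Profile()
profiler.enable()
merge_csv_files_with_single_header("input_directory", "output_file.csv")
profiler.disable()

# Print the ten calls with the highest cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)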

Step 10: Documenting Code

Finally, it’s a good practice to document your code. This will make it easier for others (and yourself!) to understand how the code works and why certain design decisions were made. Here are a few steps you can take to document your code:

  1. Write docstrings: Docstrings provide a high-level description of what each function does (an example follows this list).

  2. Use comments: Comments explain what’s happening in specific sections of code.

  3. Use documentation tools: Documentation tools like Sphinx or Read the Docs can help you generate documentation automatically.

  4. Keep documentation up to date: Make sure your documentation is accurate and reflects any changes to the code.
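
For example, a docstring for the merge function might look like the following (a sketch in Google style; NumPy or reStructuredText conventions work just as well):

def merge_csv_files_with_single_header(input_dir, output_file):
    """Merge every CSV file in input_dir into a single CSV file.

    All input files are expected to share the same header row; the
    merged output contains that header exactly once.

    Args:
        input_dir: Directory containing the CSV files to merge.
        output_file: Path of the merged CSV file to write.

    Returns:
        True if the merge succeeded.

    Raises:
        FileNotFoundError: If input_dir does not exist.
        ValueError: If no readable CSV files are found.
    """
    ...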

By following these steps, you’ll be able to create a well-structured, efficient, and maintainable program that meets its functional requirements and scales well for large inputs.


Last modified on 2024-05-19