3 Ways to Import Multiple CSV Files into Pandas and Merge Them Effectively

Importing Multiple CSV Files into Pandas and Merging Them Based on Column Values

As a data analyst or scientist, you will often work with data that is spread across several files. One common task is to import multiple CSV files into pandas DataFrames and merge them based on column values. In this article, we will explore how to achieve this using pandas, covering several approaches, from separate read_csv() calls to chained merges.

Introduction to Pandas

Pandas is a powerful library in Python for data manipulation and analysis. It provides an efficient way to handle structured data, including tabular data such as spreadsheets and SQL tables. One of its key features is the ability to easily import and manipulate CSV files.

The pandas library includes several tools for importing data from various file formats, including CSV (Comma Separated Values). The most commonly used method for importing CSV files is through the read_csv() function, which reads a CSV file into a pandas DataFrame.
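
As a minimal sketch of read_csv() in isolation, the snippet below parses a small CSV from an in-memory string rather than a file on disk. The sample rows are hypothetical and only stand in for a file such as colors.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file such as colors.csv
csv_text = "id,name\n1,Red\n2,Blue\n"

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header row
```

Passing a path such as 'colors.csv' instead of the StringIO object works the same way.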

Importing Multiple CSV Files

In this section, we will explore how to import multiple CSV files into a single pandas DataFrame. This can be achieved using various methods, including:

Using separate calls to the read_csv() function for each CSV file.
Using the concat() function in combination with the read_csv() function.

Method 1: Using Separate Calls to `read_csv()`

One way to import multiple CSV files is by making separate calls to the read_csv() function for each file. This approach keeps each loading step explicit, which helps when individual files need different options, but the code becomes repetitive as the number of files grows.

import pandas as pd

# Importing individual CSV files
df_inventory_parts = pd.read_csv('inventory_parts.csv')
df_colors = pd.read_csv('colors.csv')
df_part_categories = pd.read_csv('part_categories.csv')
df_parts = pd.read_csv('parts.csv')

# Merging the DataFrames based on common columns
merged = pd.merge(
    left=df_inventory_parts, 
    right=df_colors, 
    how='left', 
    left_on='color_id', 
    right_on='id')

merged = pd.merge(
    left=merged, 
    right=df_parts, 
    how='left', 
    left_on='part_num', 
    right_on='part_num')

merged = pd.merge(
    left=merged, 
    right=df_part_categories, 
    how='left', 
    left_on='part_cat_id', 
    right_on='id')
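
To see what a left merge with left_on and right_on produces, here is a small sketch on in-memory tables. The rows and columns are hypothetical and much smaller than the real CSV files, whose columns may differ.

```python
import pandas as pd

# Hypothetical miniature tables; the real CSV columns may differ
inventory_parts = pd.DataFrame({
    'part_num': ['3001', '3002', '3003'],
    'color_id': [1, 2, 99],   # 99 has no match in colors
})
colors = pd.DataFrame({
    'id': [1, 2],
    'name': ['Red', 'Blue'],
})

# A left merge keeps all three inventory rows, matched or not
demo = pd.merge(
    left=inventory_parts,
    right=colors,
    how='left',
    left_on='color_id',
    right_on='id')

print(demo)
```

Because the key columns have different names, both color_id and id appear in the result, and the row with color_id 99 gets NaN in the columns that come from colors.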

Method 2: Using `concat()` with `read_csv()`

Another approach is to read the files in a loop instead of writing one read_csv() call per file. When the files share the same columns (for example, one table split across several files), concat() can stack the results into a single DataFrame. However, concat() appends rows and does not match on key columns, so the four tables used here, which have different columns, are kept in a dictionary and still combined with merge().

import pandas as pd

# Read all CSV files in one pass, keyed by table name
names = ['inventory_parts', 'colors', 'part_categories', 'parts']
dfs = {name: pd.read_csv(f'{name}.csv') for name in names}

# Calling pd.concat(dfs.values()) here would stack the four tables row-wise
# and fill the non-shared columns with NaN, so they are merged on keys instead
merged = pd.merge(
    left=dfs['inventory_parts'],
    right=dfs['colors'],
    how='left',
    left_on='color_id',
    right_on='id')

merged = pd.merge(
    left=merged,
    right=dfs['parts'],
    how='left',
    left_on='part_num',
    right_on='part_num')

merged = pd.merge(
    left=merged,
    right=dfs['part_categories'],
    how='left',
    left_on='part_cat_id',
    right_on='id')
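
concat() is a natural fit when one table is split across several files with identical columns. The sketch below illustrates that case under stated assumptions: the file names (colors_part1.csv, colors_part2.csv) and their contents are hypothetical and are written to a temporary folder so the example is self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two hypothetical shards of the same table to a temporary folder
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'id': [1, 2], 'name': ['Red', 'Blue']}).to_csv(
    os.path.join(tmp_dir, 'colors_part1.csv'), index=False)
pd.DataFrame({'id': [3], 'name': ['Green']}).to_csv(
    os.path.join(tmp_dir, 'colors_part2.csv'), index=False)

# Collect the matching files in a stable order and stack them row-wise
paths = sorted(glob.glob(os.path.join(tmp_dir, 'colors_part*.csv')))
colors = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(colors)
```

ignore_index=True renumbers the rows from 0 so that the index values from the individual files do not repeat in the combined DataFrame.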

Method 3: Chaining the Merges

The three separate pd.merge() calls in Method 1 each reassign an intermediate result. A more compact alternative is to chain DataFrame.merge() calls: the DataFrame the method is called on serves as the left side, so only the right argument needs to be passed. The chain performs the same three joins, so the gain is in readability rather than speed.

import pandas as pd

# Importing each CSV file into its own DataFrame
df_inventory_parts = pd.read_csv('inventory_parts.csv')
df_colors = pd.read_csv('colors.csv')
df_part_categories = pd.read_csv('part_categories.csv')
df_parts = pd.read_csv('parts.csv')

# Chaining the merges; each call uses the previous result as the left side
merged = (
    df_inventory_parts
    .merge(right=df_colors, how='left', left_on='color_id', right_on='id')
    .merge(right=df_parts, how='left', left_on='part_num', right_on='part_num')
    .merge(right=df_part_categories, how='left', left_on='part_cat_id', right_on='id')
)
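
When the number of lookup tables grows, the same sequence of merges can also be driven by a list of steps. This is one possible variation, not a pandas feature: the steps list and the miniature tables below are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical miniature tables; the real CSV columns may differ
inventory_parts = pd.DataFrame({'part_num': ['3001', '3002'], 'color_id': [1, 2]})
colors = pd.DataFrame({'id': [1, 2], 'rgb': ['C91A09', '0055BF']})
parts = pd.DataFrame({'part_num': ['3001', '3002'], 'part_cat_id': [11, 11]})

# Each step: (right-hand table, key in the running result, key in the right table)
steps = [
    (colors, 'color_id', 'id'),
    (parts, 'part_num', 'part_num'),
]

# Apply the merges in order, feeding each result into the next merge
merged_demo = inventory_parts
for right, left_key, right_key in steps:
    merged_demo = merged_demo.merge(
        right, how='left', left_on=left_key, right_on=right_key)

print(merged_demo)
```

Adding another lookup table then means adding one tuple to steps rather than another merge call.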

Conclusion

In this article, we explored several methods for importing multiple CSV files into pandas DataFrames and merging them based on common columns. We covered separate calls to read_csv(), using concat() with read_csv(), and chaining DataFrame.merge() calls so that each result becomes the left side of the next merge.

By understanding these different approaches, you can choose the most suitable method for your specific needs, whether it’s performance, readability, or ease of maintenance. Regardless of the approach, pandas provides powerful tools for efficiently handling structured data in Python.

Tips and Variations

Handling Missing Values: A left merge keeps every row of the left DataFrame and fills the right-hand columns with NaN where no match exists. pd.merge() has no parameter for filling or dropping these values; apply fillna() or dropna() to the result, and pass indicator=True to add a _merge column that shows which rows went unmatched.
Data Type Conversion: Key columns must have compatible data types on both sides of a merge. If a key is read as integers from one file and as strings from another, convert one side with astype() or set the dtype parameter of read_csv() before merging.
Customizing Merge Operations: Depending on the structure of your CSV files, you may want to filter rows or select columns before merging, set the suffixes parameter to control how overlapping column names are renamed, or pass validate (for example 'many_to_one') so that pandas raises an error when a key is unexpectedly duplicated.
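
The tips above can be combined in a single merge. The following is a minimal sketch on hypothetical in-memory tables; the column names and the 'Unknown' placeholder are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; the real CSV columns may differ
inventory_parts = pd.DataFrame({
    'part_num': ['3001', '3002', '3003'],
    'color_id': ['1', '2', '99'],   # keys that were read as strings
})
colors = pd.DataFrame({'id': [1, 2], 'name': ['Red', 'Blue']})

# Align the key dtypes before merging
inventory_parts['color_id'] = inventory_parts['color_id'].astype('int64')

checked = inventory_parts.merge(
    colors,
    how='left',
    left_on='color_id',
    right_on='id',
    validate='many_to_one',   # raise if colors.id contains duplicates
    indicator=True)           # add a _merge column describing each row

# Rows that found no partner in colors
unmatched = checked[checked['_merge'] == 'left_only']

# Replace the missing color names with an explicit placeholder
checked['name'] = checked['name'].fillna('Unknown')

print(unmatched)
print(checked)
```

Inspecting unmatched before filling the gaps makes it easier to tell a genuine data problem from an expected missing lookup.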

By incorporating these tips and variations into your merging workflow, you can make your code more readable and more robust.

Last modified on 2023-09-18