Creating DataFrames from Scratch Using Different Methods in Python

Creating a New DataFrame and Adding Variables in Python

In this article, we’ll explore how to create a new dataframe from scratch using Python and add variables to it.

Introduction

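Creating a dataframe from scratch can be achieved in various ways, depending on the type of data you’re working with. In this article, we’ll cover two common approaches. The first stacks 2D column arrays side by side with np.hstack and passes the combined array to the pd.DataFrame constructor. The second flattens each array to 1D with the array’s flatten method (there is no np.flatten function) and passes the arrays to pd.DataFrame as a dictionary of columns.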

Understanding DataFrames

Before we dive into creating new dataframes, let’s briefly review what a dataframe is. A dataframe is a two-dimensional labeled data structure whose columns can each hold a different type, such as integers, floats, or strings. Because each column is stored as a typed array, column-wise operations are generally fast and memory-efficient, even on large datasets.

In Python, the pandas library is commonly used to create and manipulate dataframes. The pd.DataFrame class is the core class in pandas that represents a table of data.
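As a small illustration of the points above, here is a minimal sketch that builds a dataframe with columns of different types and then adds a new variable to it. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: each column holds a different type.
df = pd.DataFrame({
    "name": ["a", "b", "c"],     # strings
    "count": [1, 2, 3],          # integers
    "score": [0.5, 1.5, 2.5],    # floats
})

# Add a new variable (column) computed from existing columns.
df["total"] = df["count"] * df["score"]

print(df.dtypes)  # one dtype per column
print(df.shape)   # (3, 4)
```

Assigning to df["total"] creates the column if it does not exist yet, which is the simplest way to add a variable to an existing dataframe.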

Creating Variables

Let’s first look at creating two random variables, x and y, with certain properties.

Bernoulli Distribution

We’ll use the scipy.stats.bernoulli distribution object to create a variable x with a Bernoulli distribution.

from scipy.stats import bernoulli
x = bernoulli.rvs(size=100, p=0.6)

The rvs method generates random variates from the specified distribution. In this case, we’re generating 100 draws, each equal to 1 with probability of success p = 0.6 and equal to 0 otherwise.
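To make the draw repeatable and to check what it contains, a sketch like the following can help. The seed value 42 is an arbitrary choice for this example and is not part of the original code.

```python
from scipy.stats import bernoulli

# random_state fixes the seed so the same values come back on every run.
x = bernoulli.rvs(p=0.6, size=100, random_state=42)

print(x.shape)          # (100,)
print(set(x.tolist()))  # only 0 and 1 can occur
print(x.mean())         # sample share of ones: near 0.6, rarely exactly 0.6
```

The sample mean is an estimate of p, so with only 100 draws it will usually differ somewhat from 0.6.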

Normal Distribution

We’ll use the numpy.random.normal function to create a variable y with a normal distribution.

import numpy as np
y = np.random.normal(loc=0, scale=1, size=100)

The function draws 100 random numbers from a normal distribution with a mean (loc) equal to 0 and a standard deviation (scale) of 1.
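For repeatable draws, one option is NumPy’s Generator API, sketched below. This is an alternative to the call shown above rather than the article’s own method, and the seed 0 is an arbitrary assumption.

```python
import numpy as np

# A seeded Generator returns the same sequence on every run.
rng = np.random.default_rng(0)
y = rng.normal(loc=0, scale=1, size=100)

print(y.shape)   # (100,)
print(y.mean())  # sample mean: near 0, not exactly 0
print(y.std())   # sample standard deviation: near 1
```

As with any finite sample, the sample mean and standard deviation only approximate the loc and scale parameters.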

Creating a DataFrame

Now that we have our variables x and y, let’s create a dataframe from scratch using pd.DataFrame.

import pandas as pd
import numpy as np

# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli
x = bernoulli.rvs(size=100, p=0.6)

# form a column vector (n, 1)
x = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# creating variable y with normal distribution
y = np.random.normal(loc=0, scale=1, size=100)

# form a column vector (n, 1)
y = y.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# each dictionary value must be 1D, so flatten the column vectors back
df = pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})

print(df)

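This code creates two variables x and y using scipy.stats.bernoulli and numpy.random.normal, respectively. It reshapes each array into a 2D column vector of shape (n, 1) and then flattens it back to 1D when building the dataframe, because each value in the dictionary passed to pd.DataFrame must be one-dimensional. The reshape step is therefore not required for this method; the 2D column vectors only become useful when stacking with np.hstack.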
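The difference between the two shapes can be checked directly. The sketch below assumes a recent pandas release, where passing a 2D array as a single dictionary value raises a ValueError; the exact exception may differ in older versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)           # arbitrary seed for repeatability
col = rng.normal(size=5).reshape(-1, 1)  # column vector, shape (5, 1)

try:
    pd.DataFrame({"y": col})             # 2D array as one column's value
    accepted_2d = True
except ValueError:
    accepted_2d = False                  # recent pandas requires 1D per column

df = pd.DataFrame({"y": col.flatten()})  # the flattened 1D array is accepted

print(col.shape, col.flatten().shape)    # (5, 1) (5,)
print(accepted_2d)
```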

Using np.hstack

Alternatively, we can use np.hstack to stack the two arrays horizontally before passing them to the pd.DataFrame constructor.

import pandas as pd
import numpy as np

# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli
x = bernoulli.rvs(size=100, p=0.6)

# form a column vector (n, 1)
x = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# creating variable y with normal distribution
y = np.random.normal(loc=0, scale=1, size=100)

# form a column vector (n, 1)
y = y.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# stacking the arrays horizontally into shape (n, 2) and naming the columns
df = pd.DataFrame(np.hstack((x, y)), columns=['x', 'y'])

print(df)

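This code creates two variables x and y using scipy.stats.bernoulli and numpy.random.normal, respectively. It then stacks the two column vectors side by side using np.hstack, producing a single array of shape (n, 2), and passes that array to the pd.DataFrame constructor. Two details are worth noting. Unless column names are supplied through the columns argument, the columns are labeled 0 and 1. And because a NumPy array has a single dtype, the integer values of x are converted to floats when stacked with y.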
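The type conversion caused by stacking is easy to see on a tiny example. The following sketch uses made-up values and shows one way to restore the integer type afterwards.

```python
import numpy as np
import pandas as pd

x = np.array([1, 0, 1]).reshape(-1, 1)          # integer column, shape (3, 1)
y = np.array([0.2, -1.3, 0.7]).reshape(-1, 1)   # float column, shape (3, 1)

stacked = np.hstack((x, y))                     # shape (3, 2), one shared dtype
df = pd.DataFrame(stacked, columns=["x", "y"])
df["x"] = df["x"].astype(int)                   # convert x back to integers

print(stacked.dtype)  # float64
print(df.dtypes)
```

If keeping each column’s original type matters, building the dataframe from a dictionary of 1D arrays avoids the conversion in the first place.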

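Using flatten

Another way to create a dataframe from scratch is to call the flatten method on each array to turn it into a 1D array, and then pass these arrays to the pd.DataFrame constructor as a dictionary. Note that flatten is a method of NumPy arrays, not a function in the np namespace. This is the same dictionary-based construction as in the first example, with the flattening done as a separate step.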

import pandas as pd
import numpy as np

# creating variable x with Bernoulli distribution
from scipy.stats import bernoulli
x = bernoulli.rvs(size=100, p=0.6)

# form a column vector (n, 1)
x = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# creating variable y with normal distribution
y = np.random.normal(loc=0, scale=1, size=100)

# form a column vector (n, 1)
y = y.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

# flattening the column vectors into 1D arrays of shape (n,)
x = x.flatten()
y = y.flatten()

# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame({'x': x, 'y': y})

print(df)

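This code creates two variables x and y using scipy.stats.bernoulli and numpy.random.normal, respectively. It then flattens the column vectors into 1D arrays using each array’s flatten method, before passing them to the pd.DataFrame constructor as a dictionary. Unlike the np.hstack approach, each column keeps its own type: x stays an integer column and y a float column.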
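One property of flatten worth knowing is that it always returns a copy. The sketch below contrasts it with the related ravel method, which the article does not use and which is shown here only for comparison.

```python
import numpy as np

col = np.arange(4).reshape(-1, 1)  # column vector, shape (4, 1)

flat_copy = col.flatten()  # always a new, independent array
flat_view = col.ravel()    # a view here, because col is contiguous in memory

print(np.shares_memory(col, flat_copy))  # False
print(np.shares_memory(col, flat_view))  # True

flat_view[0] = 99          # writing to the view also changes col
print(col[0, 0])           # 99
print(flat_copy[0])        # 0, the copy is unaffected
```

For building a dataframe either method works; flatten is the safer default when the original array should not be affected by later changes.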

Conclusion

In this article, we explored how to create a new dataframe from scratch in Python using different methods: stacking 2D column vectors with np.hstack and passing the combined array to the pd.DataFrame constructor, or flattening the arrays to 1D with the flatten method and passing them as a dictionary of columns. We also reviewed what a dataframe is and how it can be used for data analysis.


Last modified on 2023-05-11