Data Loading in Scikit-learn

Built-in Datasets

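Scikit-learn ships with several small sample datasets in sklearn.datasets. The load_* functions read them from the installed package, so no download is needed, which makes them convenient for practicing machine learning.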

# Load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# Explore dataset structure
print(f"Data shape: {iris.data.shape}")
print(f"Target shape: {iris.target.shape}")
print(f"Feature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")

Dataset Components

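The load_* functions return a Bunch, a dictionary-like object. Most scikit-learn datasets expose these attributes:

data: Feature matrix (X) with shape (n_samples, n_features)
target: Labels or outcomes (y)
feature_names: Names of each feature
target_names: Names of each class (classification datasets)
DESCR: Text description of the dataset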
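As a minimal sketch of how these attributes are reached in practice: fields of the returned object can be read either as attributes or as dictionary keys. The return_X_y and as_frame parameters are existing options of load_iris; the variable names here are only illustrative, and the as_frame call assumes pandas is installed.

```python
from sklearn.datasets import load_iris

# A Bunch behaves like a dict whose keys are also attributes
iris = load_iris()
print(sorted(iris.keys()))
print(iris["data"] is iris.data)  # both spellings reach the same array

# Skip the Bunch and get the (X, y) tuple directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# Ask for pandas objects instead of NumPy arrays (requires pandas)
iris_df = load_iris(as_frame=True)
print(iris_df.frame.head())
```

The return_X_y form is handy when only the arrays are needed, while as_frame keeps the column names attached to the data.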

Common Built-in Datasets

Dataset	Type	Features	Samples	Task
iris	Tabular	4	150	Classification
digits	Images	64	1797	Classification
wine	Tabular	13	178	Classification
breast_cancer	Tabular	30	569	Classification
diabetes	Tabular	10	442	Regression
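The sample and feature counts for the classification datasets in the table can be checked directly. The sketch below loops over the corresponding load_* functions; the loaders and shapes dictionaries are illustrative names, not part of scikit-learn.

```python
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer

# Map each table entry to its loader function
loaders = {
    "iris": load_iris,
    "digits": load_digits,
    "wine": load_wine,
    "breast_cancer": load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features")
```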

Dataset Generators

Generate synthetic data for testing algorithms:

# Create synthetic classification data
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
  n_samples=100,
  n_features=4,
  n_informative=2,
  n_redundant=0,
  random_state=42
)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Class distribution: {np.bincount(y)}")
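Classification is not the only task with a generator. As a rough sketch, the snippet below uses three other functions from sklearn.datasets (make_regression, make_blobs, make_moons); the sample counts and noise levels are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression, make_blobs, make_moons

# Regression: continuous target with Gaussian noise added
X_reg, y_reg = make_regression(
    n_samples=100, n_features=4, noise=10.0, random_state=42
)

# Clustering: three isotropic Gaussian blobs in 2D
X_blob, y_blob = make_blobs(
    n_samples=150, centers=3, n_features=2, random_state=42
)

# Non-linear classification: two interleaving half circles
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=42)

print(f"Regression: X {X_reg.shape}, y dtype {y_reg.dtype}")
print(f"Blobs per cluster: {np.bincount(y_blob)}")
print(f"Moons per class: {np.bincount(y_moon)}")
```

Fixing random_state makes the generated data reproducible across runs, which helps when comparing algorithms on the same synthetic problem.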

Loading External Datasets

# From CSV file using pandas
import pandas as pd

# Load data
df = pd.read_csv('your_dataset.csv')

# Split into features and target
X = df.drop('target_column', axis=1)
y = df['target_column']

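# From a delimited text file using NumPy
import numpy as np

# Assumes a purely numeric file with the target in the last column
data = np.loadtxt('data.txt', delimiter=',')
X = data[:, :-1]  # All columns except last
y = data[:, -1]   # Last column (loaded as float; cast with .astype(int) for class labels)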
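The pandas pattern above can be tried without a file on disk. The sketch below reads a small in-memory CSV through io.StringIO; the column names and values are hypothetical stand-ins for a real dataset, and a file path passed to read_csv would behave the same way.

```python
import io
import pandas as pd

# Hypothetical CSV contents used only for illustration
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split into features and target
X = df.drop(columns="species")
y = df["species"]

# Check column types and missing values before passing data to an estimator
print(X.dtypes)
print(df.isna().sum())
```

Inspecting dtypes and missing values early is worthwhile because most scikit-learn estimators expect numeric, complete feature matrices.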

Working with the Iris Dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Basic statistics
print(f"Dataset shape: {X.shape}")
print(f"Feature means: {X.mean(axis=0)}")
print(f"Feature std: {X.std(axis=0)}")

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
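To see what stratify=y does in the split above, the class counts of each subset can be compared. This is a small self-contained check using np.bincount, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratify=y, each class keeps the same proportion in both subsets
print(f"Train class counts: {np.bincount(y_train)}")
print(f"Test class counts:  {np.bincount(y_test)}")
```

Iris has 50 samples per class, so a 30% stratified test split leaves 35 samples of each class for training and 15 for testing. Without stratify, the per-class counts would vary with the random seed.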

Fetching Online Datasets

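# Fetch the California housing dataset
# (downloaded on the first call, then cached locally)
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)  # Bunch holding pandas objects
df = housing.frame  # DataFrame with the features and the target column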

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns}")
print(df.head())
