Scikit-learn
Data Loading in Scikit-learn
Built-in Datasets
Scikit-learn ships with several small sample datasets in the sklearn.datasets module, so you can practice machine learning without downloading anything.
# Load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
# Explore dataset structure
print(f"Data shape: {iris.data.shape}")
print(f"Target shape: {iris.target.shape}")
print(f"Feature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")
Dataset Components
The load_* functions return a Bunch, a dictionary-like object. Most built-in datasets expose these attributes:
- data: Features/predictors (X)
- target: Labels/outcomes (y)
- feature_names: Names of each feature
- target_names: Names of each class
- DESCR: Text description of the dataset
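As a minimal sketch of how these attributes are read in practice, the snippet below loads the wine dataset (chosen here only for illustration) and touches each attribute in turn. It also shows the return_X_y=True loader option, which skips the Bunch and returns the feature matrix and target as a tuple.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Bunch objects support both attribute and dictionary-style access
X = wine.data
y = wine["target"]

print(X.shape)                  # (n_samples, n_features)
print(len(wine.feature_names))  # one name per feature column
print(list(wine.target_names))  # class labels
print(wine.DESCR[:200])         # start of the text description

# Get the arrays directly, without the surrounding Bunch
X2, y2 = load_wine(return_X_y=True)
```

The tuple form is convenient when only X and y are needed; the Bunch form is preferable when feature names or the description matter.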
Common Built-in Datasets
| Dataset | Type | Features | Samples | Task |
|---|---|---|---|---|
| iris | Tabular | 4 | 150 | Classification |
| digits | Images | 64 | 1797 | Classification |
| wine | Tabular | 13 | 178 | Classification |
| breast_cancer | Tabular | 30 | 569 | Classification |
| boston (load_boston was removed in scikit-learn 1.2) | Tabular | 13 | 506 | Regression |
| diabetes | Tabular | 10 | 442 | Regression |
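The sample and feature counts for the four classification datasets in the table can be checked directly. This is a small sketch that loops over the corresponding load_* functions; the loaders dictionary and the shapes variable are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer

# Map each table entry to its loader function
loaders = {
    "iris": load_iris,
    "digits": load_digits,
    "wine": load_wine,
    "breast_cancer": load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    n_classes = np.unique(y).size
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features, {n_classes} classes")
```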
Dataset Generators
Generate synthetic data for testing algorithms:
# Create synthetic classification data
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=100,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    random_state=42
)
print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
# np.bincount counts how many samples fall in each class
print(f"Class distribution: {np.bincount(y)}")
Loading External Datasets
# From CSV file using pandas
import pandas as pd
# Load data
df = pd.read_csv('your_dataset.csv')
# Split into features and target
X = df.drop('target_column', axis=1)
y = df['target_column']
# From NumPy arrays
import numpy as np
data = np.loadtxt('data.txt', delimiter=',')
X = data[:, :-1]  # All columns except the last
y = data[:, -1]   # Last column
Working with Iris Dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X = iris.data
y = iris.target
# Basic statistics
print(f"Dataset shape: {X.shape}")
print(f"Feature means: {X.mean(axis=0)}")
print(f"Feature std: {X.std(axis=0)}")
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
Fetching Online Datasets
# Fetch the California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)  # Returns a Bunch; its .frame attribute is a pandas DataFrame
df = housing.frame
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns}")
print(df.head())
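Fetchers such as fetch_california_housing download the data on first use and cache it locally, by default under ~/scikit_learn_data unless the SCIKIT_LEARN_DATA environment variable is set. As a rough sketch of controlling that location, the snippet below uses get_data_home with a temporary directory; the directory is a hypothetical stand-in, and nothing is downloaded here.

```python
import os
import tempfile
from sklearn.datasets import get_data_home

# Hypothetical cache directory, used only for illustration
custom_cache = get_data_home(data_home=tempfile.mkdtemp())

# get_data_home creates the directory if needed and returns its path
print(custom_cache)
print(os.path.isdir(custom_cache))

# Fetchers accept the same argument, e.g.:
# housing = fetch_california_housing(data_home=custom_cache, as_frame=True)
```

Pointing data_home at a shared or project-local folder avoids repeated downloads across environments.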