Documentation for project/data/module.py

Source Code Documentation

The source code documentation is generated from Python docstrings using MkDocs and mkdocstrings.

Classes:

DataModule: Data module for loading and processing datasets.

DataModule

DataModule(train_dataset_url: str, test_dataset_url: str, target_variable: str, include_features: list[str] | None = None, exclude_features: list[str] | None = None, drop_na: bool = True, test_size: float = 0.2)

Data module for loading and processing datasets.

Parameters:

train_dataset_url (str, required): URL to the training dataset.
test_dataset_url (str, required): URL to the test dataset.
target_variable (str, required): Name of the target variable column.
include_features (list[str] | None, default None): Features to include. When specified, it takes precedence over exclude_features.
exclude_features (list[str] | None, default None): Features to exclude. When specified, these columns are removed from the selected features.
drop_na (bool, default True): Whether to drop rows with missing values from the training data.
test_size (float, default 0.2): Proportion of the training data that get_split holds out as its test portion.
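
The interaction between include_features, exclude_features, and the automatic "id" column can be sketched in plain Python. This is a minimal illustration of the rules as described in the parameter list above, not the module's code: select_features is a hypothetical helper, and the column names are made up.

```python
from __future__ import annotations


def select_features(
    columns: list[str],
    target: str,
    include: list[str] | None = None,
    exclude: list[str] | None = None,
) -> list[str]:
    """Hypothetical helper mirroring the documented selection rules."""
    # Every column except the target is a candidate feature.
    candidates = [c for c in columns if c != target]

    if include is not None:
        # include takes precedence over exclude, as the parameter text states.
        selected = [c for c in candidates if c in include]
    elif exclude is not None:
        selected = [c for c in candidates if c not in exclude]
    else:
        # Neither filter given: keep all candidates.
        selected = candidates

    # The "id" column is always kept, prepended if a filter removed it.
    if "id" not in selected:
        selected = ["id", *selected]
    return selected


# Illustrative column names (assumptions, not from the project).
columns = ["id", "age", "income", "city", "label"]
print(select_features(columns, "label", include=["age", "income"]))
print(select_features(columns, "label", exclude=["income"]))
```

Keeping "id" in the selection regardless of the filters is what lets the accessor methods return the identifier as a separate Series.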

Methods:

get_split: Get one side of a random train/test split of the training dataset.
get_test_data: Get the test data.
get_train_data: Get the training data.

Source code in src/project/data/module.py
def __init__(
    self,
    train_dataset_url: str,
    test_dataset_url: str,
    target_variable: str,
    include_features: list[str] | None = None,
    exclude_features: list[str] | None = None,
    drop_na: bool = True,
    test_size: float = 0.2,
):
    """
    Initialize the data module.

    Args:
        train_dataset_url (str): URL to the training dataset.
        test_dataset_url (str): URL to the test dataset.
        target_variable (str): Name of the target variable column.
        include_features (list[str] | None, optional): List of features to include
            (if specified, overrides exclude_features). Defaults to None.
        exclude_features (list[str] | None, optional): List of features to exclude
            (if specified, will be removed from the dataset). Defaults to None.
        drop_na (bool, optional): Whether to drop NA values. Defaults to True.
        test_size (float, optional): Size of the test set. Defaults to 0.2.
    """
    self.train_dataset_url = train_dataset_url
    self.test_dataset_url = test_dataset_url
    self.target_variable = target_variable
    self.drop_na = drop_na
    self.test_size = test_size
    self.include_features = include_features
    self.exclude_features = exclude_features

    logger.info("Loading training data from %s", self.train_dataset_url)
    self.df_train = pd.read_csv(self.train_dataset_url)

    logger.info("Loading test data from %s", self.test_dataset_url)
    self.df_test = pd.read_csv(self.test_dataset_url)

    if self.drop_na:
        logger.info("Dropping rows with missing values")
        self.df_train.dropna(inplace=True)

    # Define feature sets
    all_features = [c for c in self.df_train.columns if c != self.target_variable]
    logger.info("Found %d features in the dataset", len(all_features))

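    if self.include_features is not None:
        # include_features takes precedence over exclude_features
        self.features_selected = [
            f for f in all_features if f in self.include_features
        ]
    elif self.exclude_features is not None:
        self.features_selected = [
            f for f in all_features if f not in self.exclude_features
        ]
    else:
        # No filter given: keep every non-target column
        self.features_selected = list(all_features)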

    if "id" not in self.features_selected:
        logger.warning(
            "The 'id' column is not included in the selected features. "
            "It will be added automatically."
        )
        self.features_selected = ["id", *self.features_selected]

    logger.info(
        "Selected %d features from the dataset: %s",
        len(self.features_selected),
        ", ".join(self.features_selected),
    )

get_split

get_split(train: bool = True) -> tuple[Series, DataFrame, Series]

Get the training or testing portion of a random split of the training dataset. The split size is controlled by test_size.

Parameters:

train (bool, default True): If True, return training data; otherwise, return testing data.

Returns:

tuple[Series, DataFrame, Series]: id (pd.Series), features (pd.DataFrame), and target variable (pd.Series).
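
The source shown for get_split calls train_test_split on every invocation without a random_state, so unless a global NumPy seed is set elsewhere, a call with train=True and a later call with train=False are not guaranteed to be two halves of the same split. The sketch below shows one way to make them complementary by fixing the seed. The split function, the SEED value, and the toy columns are assumptions for illustration and are not part of the module.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df_train; the column names are illustrative.
df = pd.DataFrame(
    {
        "id": range(10),
        "x": [v * 0.5 for v in range(10)],
        "y": [v % 2 for v in range(10)],
    }
)

SEED = 0  # assumed value; any fixed integer gives a repeatable shuffle


def split(frame: pd.DataFrame, train: bool, test_size: float = 0.2):
    """Return (id, features, target) for one side of a seeded split."""
    X, y = frame[["id", "x"]], frame["y"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=SEED
    )
    X_part, y_part = (X_train, y_train) if train else (X_test, y_test)
    return X_part["id"], X_part.drop(columns=["id"]), y_part


train_ids, _, _ = split(df, train=True)
test_ids, _, _ = split(df, train=False)

# With a fixed seed the two calls partition the rows: no overlap, nothing lost.
print(sorted(train_ids), sorted(test_ids))
```

An alternative design is to compute the split once and cache both sides, which avoids repeating the shuffle on every call.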

Source code in src/project/data/module.py
def get_split(
    self, train: bool = True
) -> tuple[pd.Series, pd.DataFrame, pd.Series]:
    """
    Get the training or testing data.

    Args:
        train (bool): If True, return training data; otherwise, return testing data.
            Defaults to True.

    Returns:
        tuple[pd.Series, pd.DataFrame, pd.Series]: id (pd.Series), features (pd.DataFrame), and target variable (pd.Series).
    """
    logger.info("Splitting data into train and test sets")

    X = self.df_train[self.features_selected]
    y = self.df_train[self.target_variable]

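    # No random_state is passed, so every call draws a fresh shuffle; the
    # train and test portions returned by separate calls may overlap.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=self.test_size
    )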

    if train:
        return (
            X_train["id"],
            X_train.drop(columns=["id"]),
            y_train,
        )
    else:
        return (
            X_test["id"],
            X_test.drop(columns=["id"]),
            y_test,
        )

get_test_data

get_test_data() -> tuple[Series, DataFrame]

Get the test data.

Returns:

tuple[Series, DataFrame]:
    id (pd.Series): Unique identifier for each sample.
    features (pd.DataFrame): Features for each sample.

Source code in src/project/data/module.py
def get_test_data(self) -> tuple[pd.Series, pd.DataFrame]:
    """
    Get the test data.

    Returns:
        tuple[pd.Series, pd.DataFrame]:
            id (pd.Series): Unique identifier for each sample.
            features (pd.DataFrame): Features for each sample.
    """
    logger.info("Loading test data")

    X = self.df_test[self.features_selected]

    return (
        X["id"],
        X.drop(columns=["id"]),
    )

get_train_data

get_train_data() -> tuple[Series, DataFrame, Series]

Get the training data.

Returns:

tuple[Series, DataFrame, Series]:
    id (pd.Series): Unique identifier for each sample.
    features (pd.DataFrame): Features for each sample.
    target (pd.Series): Target variable for each sample.
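
How the three return values relate to the underlying frame can be sketched with pandas alone. The frame, feature list, and target name below are made-up stand-ins for df_train, features_selected, and target_variable; the slicing follows the same pattern as the method's source.

```python
import pandas as pd

# Toy training frame; column names and values are made up for illustration.
df_train = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "age": [34, 51, 27],
        "income": [48_000, 72_500, 39_000],
        "label": [0, 1, 0],
    }
)
features_selected = ["id", "age", "income"]
target_variable = "label"

# Select the feature columns, then peel the identifier off into its own Series.
X = df_train[features_selected]
ids = X["id"]
features = X.drop(columns=["id"])
target = df_train[target_variable]

# The three pieces share the same row index, so they stay aligned.
print(list(features.columns))
print(ids.index.equals(features.index) and features.index.equals(target.index))
```

Because the identifier is returned separately, it never reaches the model as a feature but can still be joined back to predictions by row index.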

Source code in src/project/data/module.py
def get_train_data(self) -> tuple[pd.Series, pd.DataFrame, pd.Series]:
    """
    Get the training data.

    Returns:
        tuple[pd.Series, pd.DataFrame, pd.Series]:
            id (pd.Series): Unique identifier for each sample.
            features (pd.DataFrame): Features for each sample.
            target (pd.Series): Target variable for each sample.
    """
    logger.info("Loading training data")

    X: pd.DataFrame = self.df_train[self.features_selected]
    y: pd.Series = self.df_train[self.target_variable]

    return (
        X["id"],
        X.drop(columns=["id"]),
        y,
    )