As a data scientist, I’ve often found myself pushing the boundaries of popular gradient boosting frameworks. Recently I’ve been exploring the implementation of custom loss functions in LightGBM and CatBoost, two powerful tools for learning from tabular data. These frameworks offer a wide range of built-in loss functions, but sometimes you need to optimize for a specific metric or tackle a unique problem that requires a custom loss.
In this blog post, I’ll walk through the process of creating custom loss functions, using Mean Squared Error (MSE) and Mean Squared Logarithmic Error (MSLE) as practical examples. We’ll start by deriving the gradients and Hessians for these functions, to provide the mathematical foundation for our implementations. Then we’ll move on to the code, showing how to integrate these custom losses into LightGBM and CatBoost. Each of these frameworks has its own API for custom losses, so we’ll cover the specifics for each one separately.
2 LightGBM
2.1 Interface
The loss function and the evaluation metric must have the following structure:
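With LightGBM’s scikit-learn API (which the usage example later in this section relies on), the objective is a function of the true and predicted values that returns the per-sample gradient and Hessian of the loss, and the evaluation metric is a function that returns a name, a value, and a flag indicating whether higher values are better. A sketch of the expected signatures, with the bodies left as placeholders:

```python
import numpy as np


def custom_objective(y_true: np.ndarray, y_pred: np.ndarray):
    """Return the per-sample first and second derivatives of the loss w.r.t. y_pred."""
    grad = ...  # dL/dp for each sample
    hess = ...  # d^2L/dp^2 for each sample
    return grad, hess


def custom_metric(y_true: np.ndarray, y_pred: np.ndarray):
    """Return (metric_name, metric_value, is_higher_better)."""
    return "metric_name", 0.0, False
```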
2.2 Mean Squared Error
Mean Squared Error (MSE) is a commonly used loss function for regression and forecasting problems. It is defined as the average of the squared differences between the predicted and actual values: \[
\text{MSE}(y, p) \;=\; {\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} (y_i - p_i)^2}
\tag{1}\]
where \(y_i\) and \(p_i\) are the target and predicted values for the \(i\)-th sample respectively, and \(|\mathcal{D}|\) is the number of samples in the dataset.
To implement MSE as a loss function, we need to derive the gradient and Hessian of the loss with respect to the predicted values.
The gradient for sample \(i\), ignoring the constant factor of \(\frac{1}{|\mathcal{D}|}\), is: \[
\frac{\partial}{\partial p_i} \text{MSE}(y, p) \;=\; \frac{\partial}{\partial p_i} (y_i - p_i)^2 \;=\; -2(y_i - p_i) \;=\; 2 (p_i - y_i) \;\propto\; p_i - y_i
\tag{2}\]
The reason we can ignore the constant factors is that they will not affect the optimal solution, and their effect will be absorbed by the learning rate.
The Hessian for sample \(i\) is: \[
\frac{\partial^2}{\partial p_i^2} \text{MSE}(y, p) \;\propto\; \frac{\partial}{\partial p_i} (p_i - y_i) \;=\; 1
\tag{3}\]
Proportionality
The proportionality sign (\(\propto\)) is used to indicate that the expression on the right-hand side is proportional to the expression on the left-hand side, up to a constant factor. Read more about it here.
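Translating Equation 2 and Equation 3 into code, an MSE objective for LightGBM could look like the following sketch (the function name lgbm_mse_objective is only for illustration and isn’t used elsewhere in this post):

```python
import numpy as np


def lgbm_mse_objective(y_true: np.ndarray, y_pred: np.ndarray):
    grad = y_pred - y_true       # Equation 2: p_i - y_i (constant factor dropped)
    hess = np.ones_like(y_pred)  # Equation 3: constant 1
    return grad, hess
```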
2.3 Mean Squared Logarithmic Error
Mean Squared Logarithmic Error (MSLE) is defined as the average of the squared differences between the logarithms of the predicted and actual values: \[
\text{MSLE}(y, p) \;=\; {\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \left(\log{(1 + y_i)} - \log{(1 + p_i)}\right)^2}
\tag{4}\]
where \(y_i\) and \(p_i\) are the target and predicted values for the \(i\)-th sample respectively, and \(|\mathcal{D}|\) is the number of samples in the dataset. It’s a scale-invariant metric that is commonly used in regression problems where the target values span a wide range; I think it’s a good metric for modeling ratios or percentages. The +1 inside the logarithm means the target values must be greater than -1, otherwise the logarithm will be undefined. If your target values are in a different range, you can either add a constant to the target values or use a different constant in the logarithm.
Since there is \(\log_e{(1 + p_i)}\) in the formula, we need to ensure that \(1 + p_i > 0\) for all \(i\). This can be achieved by clipping the predicted values to a minimum value greater than \(-1\).
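Differentiating \((\log{(1 + p_i)} - \log{(1 + y_i)})^2\) with respect to \(p_i\) gives a gradient proportional to \(\frac{\log{(1 + p_i)} - \log{(1 + y_i)}}{1 + p_i}\), and differentiating once more gives a Hessian proportional to \(\frac{1 - \log{(1 + p_i)} + \log{(1 + y_i)}}{(1 + p_i)^2}\). The lgbm_msle_objective and lgbm_msle_metric functions used below might then look roughly like this sketch (the clipping threshold of \(-1 + 10^{-6}\) is an arbitrary choice):

```python
import numpy as np


def lgbm_msle_objective(y_true: np.ndarray, y_pred: np.ndarray):
    y_pred = np.clip(y_pred, -1 + 1e-6, None)  # keep 1 + p > 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    grad = log_diff / (1 + y_pred)
    # Exact second derivative (up to a constant factor); it can turn negative for
    # large errors, so some implementations clip it to a small positive value.
    hess = (1 - log_diff) / (1 + y_pred) ** 2
    return grad, hess


def lgbm_msle_metric(y_true: np.ndarray, y_pred: np.ndarray):
    y_pred = np.clip(y_pred, -1 + 1e-6, None)
    msle = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)
    return "msle", msle, False  # (name, value, is_higher_better)
```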
2.4 Usage
Now we can use the custom loss function in LightGBM by setting the objective parameter on the model, and the evaluation metric by passing it to the eval_metric parameter of the fit method.
```python
import lightgbm as lgb

regressor = lgb.LGBMRegressor(objective=lgbm_msle_objective)
regressor.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    eval_metric=lgbm_msle_metric,
)
```
3 CatBoost
3.1 Interface
The CatBoost interface has a few differences from LightGBM’s:
- The objective function and the evaluation metric are implemented as classes rather than functions, and must implement a few specific methods.
- CatBoost expects the negative gradient and Hessian to be returned by the loss function.
- It’s not necessary to use NumPy arrays for vectorized operations; plain for loops will suffice, since under the hood CatBoost will convert our function into machine code using Numba.
Caution
CatBoost requires the negative gradient and Hessian to be returned by the loss function, so we need to apply a negation at the end of the calculations. Intuitively, this is because to minimize the loss function we need to move in the opposite direction of the gradient, but I personally would’ve preferred that CatBoost handled this internally, similar to LightGBM and XGBoost.
For example, the evaluation metric class must implement an is_max_optimal(self) -> bool method that returns True if higher metric values are better and False otherwise.
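Putting these pieces together, the overall structure CatBoost expects looks roughly like the following sketch, based on its documented custom objective and metric interface (the class names are placeholders and the method bodies are left empty):

```python
class CustomObjective:
    def calc_ders_range(self, approxes, targets, weights):
        """Return a list of (der1, der2) pairs, one per sample: the first and
        second derivatives of the objective, i.e. the *negated* gradient and
        Hessian of the loss. `weights` may be None."""
        ...


class CustomMetric:
    def is_max_optimal(self) -> bool:
        """bool : True if higher metric values are better, False otherwise."""
        ...

    def evaluate(self, approxes, target, weight):
        """Return the pair (error_sum, weight_sum) accumulated over the samples.
        `approxes` is a list of prediction containers (approxes[0] for the
        single-output case); `weight` may be None."""
        ...

    def get_final_error(self, error, weight):
        """Combine the accumulated error and weight into the final metric value."""
        ...
```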
If you’re interested in checking out the implementation of CatBoost’s official loss functions, you can find the C++ code here. Took some digging to find! 😄
3.2 Mean Squared Error
We previously derived the gradient (Equation 2) and Hessian (Equation 3) of the MSE loss function for LightGBM. We can use the same equations for CatBoost, but we need to implement them as a class and remember to negate them.
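A minimal sketch of how this could look (the class name MSEObjective is only for illustration; note the negation discussed in the Interface section):

```python
class MSEObjective:
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i in range(len(targets)):
            # Negated gradient and Hessian of (y_i - p_i)^2, constants dropped
            der1 = targets[i] - approxes[i]  # -(p_i - y_i)
            der2 = -1.0                      # -1
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result
```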
3.3 Mean Squared Logarithmic Error
The MSLE objective and metric for CatBoost follow the same pattern. In the metric class, is_max_optimal indicates whether a higher metric value is better; MSLE is an error metric, so lower is better and the method returns False.
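A sketch of what the MSLEObjective and MSLEMetric classes used in the next section might look like, reusing the MSLE gradient and Hessian derived for LightGBM (negated for CatBoost) and the same clipping trick to keep \(1 + p_i > 0\):

```python
import math


class MSLEObjective:
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i in range(len(targets)):
            p = max(approxes[i], -1 + 1e-6)  # keep 1 + p > 0
            log_diff = math.log1p(p) - math.log1p(targets[i])
            der1 = -log_diff / (1 + p)            # negated gradient
            der2 = (log_diff - 1) / (1 + p) ** 2  # negated Hessian
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result


class MSLEMetric:
    def is_max_optimal(self) -> bool:
        return False  # lower MSLE is better

    def evaluate(self, approxes, target, weight):
        approx = approxes[0]  # single-output predictions
        error_sum, weight_sum = 0.0, 0.0
        for i in range(len(target)):
            w = 1.0 if weight is None else weight[i]
            p = max(approx[i], -1 + 1e-6)
            error_sum += w * (math.log1p(p) - math.log1p(target[i])) ** 2
            weight_sum += w
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)
```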
3.4 Usage
To use the custom loss functions in CatBoost, we need to pass them as parameters during model initialization.
```python
import catboost as cb

regressor = cb.CatBoostRegressor(
    loss_function=MSLEObjective(),
    eval_metric=MSLEMetric(),
)
```
Note that since the functions are implemented as classes, we need to instantiate them before passing them to the model. I fell victim to this mistake a few times, so be careful!