As a data scientist, I’ve often found myself pushing the boundaries of popular gradient boosting frameworks. Recently I’ve been exploring the implementation of custom loss functions in LightGBM and CatBoost, two powerful tools for learning from tabular data. These frameworks offer a wide range of built-in loss functions, but sometimes you need to optimize for a specific metric or tackle a unique problem that requires a custom loss.
In this blog post, I’ll walk through the process of creating custom loss functions, using Mean Squared Error (MSE) and Mean Squared Logarithmic Error (MSLE) as practical examples. We’ll start by deriving the gradients and Hessians for these functions, to provide the mathematical foundation for our implementations. Then we’ll move on to the code, showing how to integrate these custom losses into LightGBM and CatBoost. Each of these frameworks has its own API for custom losses, so we’ll cover the specifics for each one separately.
2 LightGBM
2.1 Interface
The loss function and the evaluation metric must have the following structure:
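With LightGBM’s scikit-learn API (which the usage example later in this section relies on), the objective is a function of the true and predicted values that returns the per-sample gradient and Hessian of the loss, and the evaluation metric is a function that returns a name, a value, and a flag indicating whether higher values are better. A sketch of the expected signatures, with the bodies left as placeholders:

```python
import numpy as np


def custom_objective(y_true: np.ndarray, y_pred: np.ndarray):
    """Return the per-sample first and second derivatives of the loss w.r.t. y_pred."""
    grad = ...  # dL/dp for each sample
    hess = ...  # d^2L/dp^2 for each sample
    return grad, hess


def custom_metric(y_true: np.ndarray, y_pred: np.ndarray):
    """Return (metric_name, metric_value, is_higher_better)."""
    return "metric_name", 0.0, False
```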
2.2 Mean Squared Error
Mean Squared Error (MSE) is a commonly used loss function for regression and forecasting problems. It is defined as the average of the squared differences between the predicted and actual values: \[
\text{MSE}(y, p) \;=\; {\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} (y_i - p_i)^2}
\tag{1}\]
where \(y_i\) and \(p_i\) are the target and predicted values for the \(i\)-th sample respectively, and \(|\mathcal{D}|\) is the number of samples in the dataset.
To implement MSE as a loss function, we need to derive the gradient and Hessian of the loss with respect to the predicted values.
The gradient for sample \(i\), ignoring the constant factor of \(\frac{1}{|\mathcal{D}|}\), is: \[
\frac{\partial}{\partial p_i} \text{MSE}(y, p) \;=\; \frac{\partial}{\partial p_i} (y_i - p_i)^2 \;=\; -2(y_i - p_i) \;=\; 2 (p_i - y_i) \;\propto\; p_i - y_i
\tag{2}\]
The reason we can ignore the constant factors is that they will not affect the optimal solution, and their effect will be absorbed by the learning rate.
The Hessian for sample \(i\) is: \[
\frac{\partial^2}{\partial p_i^2} \text{MSE}(y, p) \;\propto\; \frac{\partial}{\partial p_i} (p_i - y_i) \;=\; 1
\tag{3}\]
Proportionality
The proportionality sign (\(\propto\)) is used to indicate that the expression on the right-hand side is proportional to the expression on the left-hand side, up to a constant factor. Read more about it here.
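Translating Equation 2 and Equation 3 into code, an MSE objective for LightGBM could look like the following sketch (the function name lgbm_mse_objective is only for illustration and isn’t used elsewhere in this post):

```python
import numpy as np


def lgbm_mse_objective(y_true: np.ndarray, y_pred: np.ndarray):
    grad = y_pred - y_true       # Equation 2: p_i - y_i (constant factor dropped)
    hess = np.ones_like(y_pred)  # Equation 3: constant 1
    return grad, hess
```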
2.3 Mean Squared Logarithmic Error
Mean Squared Logarithmic Error (MSLE) is defined as the average of the squared differences between the logarithms of the predicted and actual values: \[
\text{MSLE}(y, p) \;=\; {\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \left(\log{(1 + y_i)} - \log{(1 + p_i)}\right)^2}
\tag{4}\]
where \(y_i\) and \(p_i\) are the target and predicted values for the \(i\)-th sample respectively, and \(|\mathcal{D}|\) is the number of samples in the dataset. It’s a scale-invariant metric that is commonly used in regression problems where the target values span a wide range; I think it’s a good metric for modeling ratios or percentages. The +1 inside the logarithm means the target values must be greater than -1, otherwise the logarithm will be undefined. If your target values are in a different range, you can either add a constant to the target values or use a different constant in the logarithm.
Since there is \(\log_e{(1 + p_i)}\) in the formula, we need to ensure that \(1 + p_i > 0\) for all \(i\). This can be achieved by clipping the predicted values to a minimum value greater than \(-1\).
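Differentiating \((\log{(1 + p_i)} - \log{(1 + y_i)})^2\) with respect to \(p_i\) gives a gradient proportional to \(\frac{\log{(1 + p_i)} - \log{(1 + y_i)}}{1 + p_i}\), and differentiating once more gives a Hessian proportional to \(\frac{1 - \log{(1 + p_i)} + \log{(1 + y_i)}}{(1 + p_i)^2}\). The lgbm_msle_objective and lgbm_msle_metric functions used below might then look roughly like this sketch (the clipping threshold of \(-1 + 10^{-6}\) is an arbitrary choice):

```python
import numpy as np


def lgbm_msle_objective(y_true: np.ndarray, y_pred: np.ndarray):
    y_pred = np.clip(y_pred, -1 + 1e-6, None)  # keep 1 + p > 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    grad = log_diff / (1 + y_pred)
    # Exact second derivative (up to a constant factor); it can turn negative for
    # large errors, so some implementations clip it to a small positive value.
    hess = (1 - log_diff) / (1 + y_pred) ** 2
    return grad, hess


def lgbm_msle_metric(y_true: np.ndarray, y_pred: np.ndarray):
    y_pred = np.clip(y_pred, -1 + 1e-6, None)
    msle = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)
    return "msle", msle, False  # (name, value, is_higher_better)
```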
2.4 Usage
Now we can use the custom loss function in LightGBM by setting the objective parameter on the model, and the evaluation metric by passing it to the eval_metric parameter of the fit method.
```python
import lightgbm as lgb

regressor = lgb.LGBMRegressor(objective=lgbm_msle_objective)
regressor.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],
    eval_metric=lgbm_msle_metric,
)
```
3 CatBoost
3.1 Interface
The CatBoost interface has a few differences from LightGBM’s:
- The objective function and the evaluation metric are implemented as classes rather than functions, and must implement a few specific methods.
- CatBoost expects the negative gradient and Hessian to be returned by the loss function.
- It’s not necessary to use NumPy arrays for vectorized operations; plain for loops will suffice, since under the hood CatBoost will convert our function into machine code using Numba.
Caution
CatBoost requires the negative gradient and Hessian to be returned by the loss function, so we need to apply a negation at the end of the calculations. Intuitively, this is because to minimize the loss function we need to move in the opposite direction of the gradient, but I personally would’ve preferred that CatBoost handled this internally, similar to LightGBM and XGBoost.
For example, the evaluation metric class must implement an is_max_optimal(self) -> bool method that returns True if higher metric values are better and False otherwise.
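Putting these pieces together, the overall structure CatBoost expects looks roughly like the following sketch, based on its documented custom objective and metric interface (the class names are placeholders and the method bodies are left empty):

```python
class CustomObjective:
    def calc_ders_range(self, approxes, targets, weights):
        """Return a list of (der1, der2) pairs, one per sample: the first and
        second derivatives of the objective, i.e. the *negated* gradient and
        Hessian of the loss. `weights` may be None."""
        ...


class CustomMetric:
    def is_max_optimal(self) -> bool:
        """bool : True if higher metric values are better, False otherwise."""
        ...

    def evaluate(self, approxes, target, weight):
        """Return the pair (error_sum, weight_sum) accumulated over the samples.
        `approxes` is a list of prediction containers (approxes[0] for the
        single-output case); `weight` may be None."""
        ...

    def get_final_error(self, error, weight):
        """Combine the accumulated error and weight into the final metric value."""
        ...
```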
If you’re interested in checking out the implementation of CatBoost’s official loss functions, you can find the C++ code here. Took some digging to find! 😄
3.2 Mean Squared Error
We previously derived the gradient (Equation 2) and Hessian (Equation 3) of the MSE loss function for LightGBM. We can use the same equations for CatBoost, but we need to implement them as a class and remember to negate them.
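A minimal sketch of how this could look (the class name MSEObjective is only for illustration; note the negation discussed in the Interface section):

```python
class MSEObjective:
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i in range(len(targets)):
            # Negated gradient and Hessian of (y_i - p_i)^2, constants dropped
            der1 = targets[i] - approxes[i]  # -(p_i - y_i)
            der2 = -1.0                      # -1
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result
```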
3.3 Mean Squared Logarithmic Error
The MSLE objective and metric for CatBoost follow the same pattern. In the metric class, is_max_optimal indicates whether a higher metric value is better; MSLE is an error metric, so lower is better and the method returns False.
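A sketch of what the MSLEObjective and MSLEMetric classes used in the next section might look like, reusing the MSLE gradient and Hessian derived for LightGBM (negated for CatBoost) and the same clipping trick to keep \(1 + p_i > 0\):

```python
import math


class MSLEObjective:
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i in range(len(targets)):
            p = max(approxes[i], -1 + 1e-6)  # keep 1 + p > 0
            log_diff = math.log1p(p) - math.log1p(targets[i])
            der1 = -log_diff / (1 + p)            # negated gradient
            der2 = (log_diff - 1) / (1 + p) ** 2  # negated Hessian
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result


class MSLEMetric:
    def is_max_optimal(self) -> bool:
        return False  # lower MSLE is better

    def evaluate(self, approxes, target, weight):
        approx = approxes[0]  # single-output predictions
        error_sum, weight_sum = 0.0, 0.0
        for i in range(len(target)):
            w = 1.0 if weight is None else weight[i]
            p = max(approx[i], -1 + 1e-6)
            error_sum += w * (math.log1p(p) - math.log1p(target[i])) ** 2
            weight_sum += w
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)
```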
3.4 Usage
To use the custom loss functions in CatBoost, we need to pass them as parameters during model initialization.
```python
import catboost as cb

regressor = cb.CatBoostRegressor(
    loss_function=MSLEObjective(),
    eval_metric=MSLEMetric(),
)
```
Note that since the functions are implemented as classes, we need to instantiate them before passing them to the model. I fell victim to this mistake a few times, so be careful!