Applied unsupervised machine learning in life insurance data1
Performing projections of cash flows for a portfolio of life insurance policies can be highly demanding in terms of both computational time and memory usage, especially when done in the context of a full asset liability management (ALM) model. A well-established method to lower both demands is cluster analysis, a technique of unsupervised2 machine learning. It allows an insurance company to reduce the size of the projected portfolio while simultaneously controlling the accuracy of the valuation.
In this research study, we explored several clustering algorithms, which group policies into homogeneous clusters with respect to their cash flow patterns. The purpose of grouping similar policies is to take a single representative policy from each cluster into the projection model as a proxy for all the members of that group. Each representative policy needs to be appropriately scaled to reflect the combined weight of the policies within its cluster, as sketched below.
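As an illustration of this step, the following is a minimal sketch of building a reduced portfolio, assuming policies are described by a small set of numeric clustering variables, K-Means forms the clusters, the policy closest to each centroid is taken as the representative, and a simple count-based scaling factor stands in for the weighting applied in practice. All names and data are illustrative, not taken from the study.

```python
# Minimal sketch: cluster policies, pick one representative per cluster and
# attach a scaling factor. Data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
policies = rng.normal(size=(10_000, 4))      # hypothetical clustering variables
n_clusters = 100                             # ~1% compression rate

model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(policies)

representatives, weights = [], []
for label in range(n_clusters):
    members = policies[model.labels_ == label]
    # take the member closest to the cluster centroid as the representative policy
    closest = np.argmin(np.linalg.norm(members - model.cluster_centers_[label], axis=1))
    representatives.append(members[closest])
    # the scaling factor reflects how many policies the representative stands for
    weights.append(len(members))

reduced_portfolio = np.asarray(representatives)   # fed to the projection model
scaling_factors = np.asarray(weights)             # used to scale its cash flows
```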
Using the reduced portfolio, which consists of the representative policies only, significantly decreases calculation time. The table in Figure 1 shows the gain in calculation time after clustering four typical products. We used real data on several savings and traditional products from a Belgian insurer for this research. Five different compression rates3 of the original portfolio were examined, ranging from 1% to 5%. The larger the compression rate, the smaller the reduction in size, i.e., the larger the portfolio that remains after compression. The sizes mentioned in the figure refer to the number of uncompressed policies per product. We can observe that the computation time does not scale perfectly linearly with the number of policies and also depends on the overhead complexity of the product.
Figure 1: Fraction of the original projection time for each product after clustering, by compression rate
PRODUCT | SIZE | 5% | 4% | 3% | 2% | 1% |
---|---|---|---|---|---|---|
A | ~20000 | 7.9% | 5.5% | 4.8% | 2.3% | 1.0% |
B | ~80000 | 14.5% | 12.0% | 9.9% | 6.7% | 3.8% |
C | ~100000 | 13.5% | 11.2% | 9.5% | 5.8% | 3.6% |
D | ~400000 | 13.8% | 11.1% | 9.4% | 5.6% | 3.0% |
Clustering algorithms
Different clustering algorithms yield different clustering results. Four categories of unsupervised clustering algorithms were investigated and compared in terms of accuracy and execution time. The investigated algorithms are listed below, followed by a minimal fitting sketch:
- Partitional-based: K-Means and K-Medoids.
- Density-based: DBSCAN and HDBSCAN.
- Fuzzy clustering: Fuzzy C-Means.
- Hierarchical clustering: Agglomerative Hierarchical Clustering.
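For orientation, the sketch below fits one scikit-learn estimator from three of the four categories on the same illustrative feature matrix; the K-Medoids and Fuzzy C-Means implementations are typically taken from separate packages (e.g., scikit-learn-extra and scikit-fuzzy) and are not shown here. Parameters and data are assumptions for the example only.

```python
# Minimal sketch: one estimator per category on the same (illustrative) data.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(
    np.random.default_rng(0).normal(size=(5_000, 4))   # hypothetical policy features
)

algorithms = {
    "K-Means (partitional)": KMeans(n_clusters=50, n_init=10, random_state=0),
    "DBSCAN (density-based)": DBSCAN(eps=0.5, min_samples=10),
    "AHC (hierarchical)": AgglomerativeClustering(n_clusters=50, linkage="ward"),
}

for name, algorithm in algorithms.items():
    labels = algorithm.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)   # DBSCAN marks noise as -1
    print(f"{name}: {n_found} clusters")
```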
A common limitation of the clustering algorithms concerns their functionality when dealing with large data sets. This problematic behaviour can be caused by performing calculations on all combinations of the elements of the entire data set in each iteration, which raises memory issues.4 To resolve this issue, some variations of the classical K-Means that process the data in batches can be considered, such as:
- Mini Batch K-Means
- Scalable K-Means++
- Variational Bisecting K-Means
Other algorithms that could be employed for similar use cases are CLARANS and K-prototypes.
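As an illustration of the batch-wise idea behind these variants, the sketch below streams an illustrative portfolio through scikit-learn's MiniBatchKMeans, so that the algorithm never has to compute distances over the entire data set at once; the cluster count, batch size and data are assumptions for the example only.

```python
# Minimal sketch: fit K-Means incrementally on batches to limit memory usage.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=1_000, batch_size=10_000, random_state=0)

# stream the portfolio chunk by chunk, e.g., as it is read from disk
for _ in range(10):                               # ~100,000 policies in total
    chunk = rng.normal(size=(10_000, 4))          # hypothetical policy features
    mbk.partial_fit(chunk)

labels = mbk.predict(rng.normal(size=(10_000, 4)))  # cluster assignment per chunk
```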
Evaluation of performance
A calculation run of the complete original portfolio was used as a reference point for testing the performance of each clustering algorithm. The focal point is achieving sufficient precision in replicating the reference projection results. Increasing the size of the representative portfolio, i.e., increasing the compression rate, improves the replication precision. Figure 2 presents the projections of the original and reduced portfolios for four clustering variables.
The "no compression" line depicts the aggregated cash flow pattern over the whole projection horizon based on a run with the complete portfolio. The compression rate of 5% yields a close alignment with the original portfolio. We can observe that the alignment depends on the specific variable. There are no visible discrepancies for gross profit. Death benefit payments and the number of deaths reveal regions of slight misalignments, whereas the number of maturities reports the highest discrepancies overall.
The compression rate of 1% increases the discrepancies further, but the relative comparison between the variables remains similar to the compression rate of 5%.
Figure 2: Projection over a 40-year period of four clustering variables for the AHC algorithm, product A
The partitional-based and hierarchical clustering methods are successful
The performance of each clustering algorithm was evaluated based on the replication accuracy criterion and the execution time. For replication accuracy, we decided to work with a precision metric. This is, in fact, a measure of inaccuracy, indicating the percentage deviation between the two projections. The absolute execution time is obviously specific to the portfolio, product specifications, model, and hardware available, but the relative times are sufficient to evaluate the performance of the different algorithms.
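The snippet below shows one possible way to compute such a percentage-deviation metric between the aggregated projections of the original and the reduced portfolio; the exact definition used in the study is not reproduced here, and the figures are purely illustrative.

```python
# Minimal sketch: percentage deviation of the clustered projection from the
# uncompressed reference, averaged over the projection horizon (illustrative).
import numpy as np

def precision_error(reference: np.ndarray, clustered: np.ndarray) -> float:
    """Mean absolute percentage deviation between the two projections."""
    return float(np.mean(np.abs(clustered - reference) / np.abs(reference))) * 100

# e.g., yearly gross profit over a 40-year horizon (made-up numbers)
reference = np.linspace(100.0, 10.0, 40)
clustered = reference * (1 + np.random.default_rng(0).normal(0.0, 0.02, size=40))
print(f"precision: {precision_error(reference, clustered):.2f}%")
```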
Given the considered performance metric, the partitional-based and hierarchical clustering methods yield the best results. The table in Figure 3 contains the estimated precision values for products A, B and C at a 5% compression rate.
Figure 3: Precision for each product (lower is better)
CLUSTERING ALGORITHM | PRODUCT A | PRODUCT B | PRODUCT C |
---|---|---|---|
K-Means | 2.19% | 2.94% | 2.90% |
K-Medoids | 1.79% | 2.74% | 2.84% |
Agglomerative Hierarch. Clustering | 1.34% | 2.99% | 2.68% |
The largest portfolio (Product D) enabled us to investigate the advanced partitional-based clustering algorithms specially designed to tackle the memory issue for large data sets. Mini Batch K-Means and Scalable K-Means++ scored the lowest error percentages. The scalability of Mini Batch K-Means is clear when considering the execution time of the clustering algorithms, as presented in Figure 4. Furthermore, from the table in Figure 5 we can see that the cost of the good scalability of Mini Batch K-Means, i.e., the loss in precision, is rather limited when compared to Scalable K-Means++.
Figure 4: Clustering time of product D (lower is better)
Figure 5: Precision for product D by compression rate (lower is better)
CLUSTERING ALGORITHM | 5% | 3% | 1% |
---|---|---|---|
Scalable K-Means++ | 3.23% | 6.94% | 9.33% |
Mini Batch K-Means | 4.03% | 7.10% | 9.53% |
Variational Bisecting K-Means | 7.24% | 9.02% | 11.13% |
Improving performance
The complexity and accuracy of the clustering problem depend not only on the size of the portfolio but also on the number of clustering variables. Therefore, we also analysed two methods of reducing the dimensionality of the clustering problem:
- Principal components analysis (PCA)
- Laplacian scores (LS)
The reduction in the number of clustering variables may lead to improvements in data storage, computational cost and learning performance. The aim is to reduce redundant or noisy inputs in the analysis.
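The sketch below illustrates both routes on made-up data: PCA as implemented in scikit-learn, and a basic Laplacian score computed directly from a k-nearest-neighbour graph following He, Cai and Niyogi (2005). The graph construction and the number of retained features are assumptions for the example, not the settings used in the study.

```python
# Minimal sketch: PCA (feature extraction) vs. a basic Laplacian score
# (feature selection) for reducing the number of clustering variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 8))                 # hypothetical clustering variables

# --- PCA: project the data onto the leading principal components -------------
X_pca = PCA(n_components=4).fit_transform(X)

# --- Laplacian score: rank original features, keep the lowest-scoring ones ---
S = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
S = np.maximum(S, S.T)                           # symmetrise the kNN graph
D = np.diag(S.sum(axis=1))                       # degree matrix
L = D - S                                        # graph Laplacian
ones = np.ones(len(X))

scores = []
for r in range(X.shape[1]):
    f = X[:, r]
    f = f - (f @ D @ ones) / (ones @ D @ ones)   # remove the degree-weighted mean
    scores.append((f @ L @ f) / (f @ D @ f))     # lower score = better locality preservation

selected = np.argsort(scores)[:4]                # keep the four best-ranked features
X_ls = X[:, selected]
```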
The LS is more output-driven, while the PCA is more input-driven. The table in Figure 6 presents precision values of K-Means after the application of dimensionality reduction. In our analysis, the LS method outperformed the PCA, but this conclusion may differ case by case.
Figure 6: Precision for each product (K-Means, lower is better)
DIMENSIONALITY REDUCTION METHOD | PRODUCT A | PRODUCT B | PRODUCT C |
---|---|---|---|
PCA | 2.54% | 3.10% | 3.12% |
Laplacian Score (LS) | 1.61% | 2.10% | 2.55% |
Without Dimensionality Reduction | 2.19% | 2.94% | 2.90% |
Underperforming behaviour
It is worth noting the underperforming algorithms, i.e., the density-based and fuzzy clustering methods. Fuzzy clustering builds on the idea that one policy may not necessarily belong to only one cluster. The density-based algorithms consider the relative placement of each policy in the data space.
Both methods are highly influenced by the densities of the clusters. In our study, clusters of highly varying density led to poor performance. Figure 7 depicts the policy distribution per cluster for the DBSCAN method. It is clear that most policies are grouped in the first 20 clusters (0-20), while the remaining 5% of them (1% for clusters 21-100, 4% for clusters 101-300) belong to the other clusters. Having only a few groups with most of the policies did not provide sufficient differentiation in the cash flow patterns to replicate the original valuation.
Figure 7: Proportion of the number of policies per cluster
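For reference, a minimal sketch of producing such a cluster-size breakdown with DBSCAN is given below; the data and the eps/min_samples values are illustrative and will not reproduce the distribution in Figure 7.

```python
# Minimal sketch: fit DBSCAN and summarise how policies spread across clusters.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 4))                 # hypothetical policy features

labels = DBSCAN(eps=0.4, min_samples=10).fit_predict(X)
clusters, counts = np.unique(labels[labels != -1], return_counts=True)  # -1 = noise

shares = np.sort(counts)[::-1] / len(X)          # cluster sizes, largest first
print(f"share of policies in the 20 largest clusters: {shares[:20].sum():.1%}")
print(f"share of policies flagged as noise: {(labels == -1).mean():.1%}")
```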
Conclusions
In the case of the considered portfolio data and products, K-Means, K-Medoids and Agglomerative Hierarchical Clustering (AHC) yielded satisfying performance, with a slight edge for the AHC when considering the combined effect of the precision metric and the execution time.
As for large data sets, the Mini Batch K-Means significantly outperformed the other two proposed methods in terms of clustering time.
Furthermore, dimensionality reduction improved the overall performance of K-Means. More specifically, the feature selection approach based on the Laplacian scores performed better than the feature extraction approach based on principal component analysis.
As for alternative clustering methods, the considered data led to clusters of varying density, which caused poor performance of the density-based and fuzzy clustering algorithms.
While the presented results can likely be extended to similar product portfolios, clustering of policy data needs to be tailored to the particular products, portfolio specifications, scenario-run definitions5 and technology infrastructure. For instance, larger computing resources may overcome the memory and calculation speed limitations encountered in this research.
In this study, we relied on open-source Python implementations of the considered clustering algorithms. For a production-ready solution, Milliman developed its own technique called cluster modelling, which is implemented in the MG-ALFA® modelling engine of the Integrate® ecosystem.6
Milliman consultants have extensive experience in supporting clients in selecting and implementing appropriate clustering algorithms and all the relevant parameters, such as clustering variables, compression rate, time granularity etc. Further expertise can be provided on how to integrate clustering in the reporting process or apply the clustered policies for usage outside the clustering exercise, e.g., shock scenarios. The authors of this paper can be contacted for any inquiries regarding these topics.
1 Full research has been described in a thesis with the same title by Michail Athanasiadis in the Master of Statistics and Data Science programme at KU Leuven.
2 Unsupervised means that the algorithm determines the classification without knowing the correct classes a priori.
3 Compression rate (%) = (size after compression) / (size before compression) * 100
4 This study was performed in a development environment with limited computing resources (16GB RAM and 4-thread CPU).
5 This research focused on a single scenario run. Additional considerations must be given in case of clustering for multi-scenario runs.
6 For methodology details and examples see https://www.milliman.com/en/insight/research/insurance/cluster-analysis-a-spatial-approach-to-actuarial-modeling/.