
Nonparametric intercept regularization for insurance claim frequency regression models

Published online by Cambridge University Press:  05 January 2024

Gee Y. Lee*
Affiliation:
Department of Statistics and Probability, Department of Mathematics, Michigan State University, East Lansing, MI, USA
Himchan Jeong
Affiliation:
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
*
Corresponding author: Gee Y. Lee; Email: leegee@msu.edu

Abstract

In a subgroup analysis for an actuarial problem, the goal is for the investigator to classify the policyholders into unique groups, where the claims experience within each group is made as homogeneous as possible. In this paper, we illustrate how the alternating direction method of multipliers (ADMM) approach for subgroup analysis can be modified so that it can be more easily incorporated into an insurance claims analysis. We present an approach to penalize adjacent coefficients only and show how the algorithm can be implemented for fast estimation of the parameters. We present three different cases of the model, depending on the level of dependence among the different coverage groups within the data. In addition, we provide an interpretation of the credibility problem using both random effects and fixed effects, where the fixed effects approach corresponds to the ADMM approach to subgroup analysis, while the random effects approach represents the classic Bayesian approach. In an empirical study, we demonstrate how these approaches can be applied to real data using the Wisconsin Local Government Property Insurance Fund data. Our results show that the presented approach to subgroup analysis can provide a classification of the policyholders that improves the prediction accuracy of the claim frequencies when other classifying variables are unavailable in the data.

Type
Original Research Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries

1. Introduction

Subgroup analysis in actuarial science is related to the risk classification problem in insurance ratemaking. In property insurance ratemaking applications, risk classification allows for insurance rates to be charged differently according to the policyholder’s geographical region, property type, or policy feature. Risk classification is an important technique for the solvency of the insurance company because it prevents adverse selection against the company. In a competitive insurance market, companies with better risk classification are likely to attract more so-called low-risk policyholders, leaving those companies using inferior risk classification methods with high-risk policyholders. The mechanism by which this can influence the insurance company’s solvency is explained in detail by authors such as Akerlof (Reference Akerlof1970), Rothschild and Stiglitz (Reference Rothschild and Stiglitz1976), or Wilson (Reference Wilson1977).

In the actuarial literature, authors such as Guo (Reference Guo2003) and Yeo et al. (Reference Yeo, Smith, Willis and Brooks2001) have suggested using machine learning approaches to segment the subjects in a population into homogeneous groups with similar characteristics, while heterogeneity between the groups is maximized. In the statistics literature, subgroup analysis is closely related to the technique known as cluster analysis. For a recent example, Ma and Huang (Reference Ma and Huang2017) proposed a subgroup linear model, where the subgroup structure is defined by group-specific intercepts. To our understanding, they were the first in the statistics literature to apply the alternating direction method of multipliers (ADMM) algorithm to subgroup analysis. The approach in Ma and Huang (Reference Ma and Huang2017) did not have insurance applications in particular in mind, and they used a pairwise penalty on the subject-specific intercepts to regularize the distance among the subjects.

The adoption of the technique described in Ma and Huang (Reference Ma and Huang2017) into the actuarial literature was first done by Chen et al. (Reference Chen, Huang, Chan and Yau2019), who implemented the approach for a zero-inflated Poisson model and applied it to an insurance risk classification context. Their approach also used a pairwise penalty in the same way Ma and Huang (Reference Ma and Huang2017) did. The authors of Chen et al. (Reference Chen, Huang, Chan and Yau2019) have also suggested using deviance residuals as a preliminary tool to assess the applicability of subgroup analysis methods to a given dataset. The paper was an interesting contribution to the literature, and in our opinion they presented an alternative approach to viewing the risk classification problem. Although the approach was interesting, we believe there are some practical difficulties applying the method to larger datasets because of the computational load caused by the pairwise penalties. Reducing this computational load is part of the contribution we are making in our paper.

Meanwhile, the fused lasso method has been around for a longer period of time. Authors such as Tibshirani et al. (Reference Tibshirani, Saunders, Rosset, Zhu and Knight2005) and Tibshirani and Taylor (Reference Tibshirani and Taylor2011) have contributed to the development of fused lasso approaches to estimation, where the coefficients are assumed to be ordered from the smallest to the largest. A recent example of a study applying the ADMM algorithm to the fused lasso problem in an actuarial context can be found in Deviendt et al. (Reference Deviendt, Antonio, Reynkens and Verbelen2021). In their work, the authors propose an algorithm for solving the regularized regression problem with a multitype lasso penalty. The multitype penalty includes the fused lasso and the generalized fused lasso among the penalties applicable to the objective function.

The approach in Deviendt et al. (Reference Deviendt, Antonio, Reynkens and Verbelen2021) is interesting and is closely related to our approach, although the intercepts were not a target of regularization in their work, unlike in ours. An additional contribution of our work is that we relate the risk classification problem to the credibility problem and make a comparison with existing Bayesian approaches to the problem. In the actuarial literature, authors such as Jeong and Valdez (Reference Jeong and Valdez2020) or Jeong (Reference Jeong2020) have explored the credibility problem using Bayesian approaches. As we understand it, credibility theory is a way to account for unobserved heterogeneity within the observations from the same object (which could be an individual or a group), so it is strongly related to the ratemaking problem, where a large part of the interest is in attaining better risk classification.

Hierarchical random effects models are also related to our work, as we compare the ADMM approach to the random effects approach to classification. In actuarial science, hierarchical random effects models have been explored by authors such as Norberg (Reference Norberg1986), Boucher and Denuit (Reference Boucher and Denuit2006), and Antonio and Beirlant (Reference Antonio and Beirlant2007). Random effects models in actuarial science have been studied extensively in relation to the credibility problem. Another interesting related paper would be Frees and Shi (Reference Frees and Shi2017), where the authors use a Tweedie model with multiplicative random effects to incorporate collateral information into a credibility ratemaking model. We are unaware of existing work comparing the ADMM approach to classification with the random effects approach.

The approach presented in this paper illustrates how a modified ADMM approach can be used to estimate the parameters for a subgroup analysis problem quickly. We present three different cases of the subgroup analysis problem, where the first case assumes there is no interdependence among the different coverage groups of a multiline insurance company, the second case assumes perfect interdependence, and the third case assumes flexible interdependence. We also propose an approach to speed up the ADMM algorithm by keeping track of the group members for the regularized intercepts at each step of the iteration. To summarize our contributions, the first is a computational one, where we improve the speed and estimation of the ADMM algorithm by presenting a way to determine the initial values, penalize adjacent coefficients only, and keep track of the groupings of the coefficients. The second is an analytical one, where we combine the ADMM approach with dependence models in which the dependencies are induced by common effects. The third is an empirical one, where we demonstrate the application of our approach using the Wisconsin Local Government Property Insurance Fund (LGPIF) dataset.

The rest of the paper proceeds in the following order: Section 2 explains the proposed methodology in detail. Section 3 presents a simulation study to determine when the proposed method may perform well, and provides some discussion. Section 4 starts by illustrating the actuarial problem of interest, and the dataset related to the problem. Section 5 presents the empirical results from the application of our approach to the actuarial problem stated in Section 4. Section 6 concludes the paper with closing remarks.

2. Methodology

Let us consider a usual data structure for multi-peril frequency. For an insurance policy of the $i\text{th}$ policyholder, we may observe the multi-peril claim frequencies over time $t$ for the $j\text{th}$ coverage type as follows:

(1) \begin{equation} \mathcal{D}_{j,T} = \left \{\left ( n_{jit}, \textbf{x}_{it} \right ) \bigg | \ i=1,\ldots, I, \, t=1,\ldots, T \right \}, \end{equation}

where $\textbf{x}_{it}$ is a $p$ -dimensional vector that captures the observable characteristics of the policy and $n_{jit}$ is defined as the observed number of accident(s) from claim type $j\in \{1,2,\ldots, J\}$ , for the $i\text{th}$ policyholder in year $t$ , respectively.
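For concreteness, the panel structure in (1) can be stored in long format, with one row per (coverage, policyholder, year) combination. The sketch below is purely illustrative: the column names and toy values are ours and are not taken from the LGPIF data.

```python
import pandas as pd

# Hypothetical long-format layout of D_{j,T}: one row per (coverage j, policyholder i, year t).
claims = pd.DataFrame({
    "coverage_j":   [1, 1, 1, 1, 2, 2],
    "policy_i":     [1, 1, 2, 2, 1, 1],
    "year_t":       [1, 2, 1, 2, 1, 2],
    "n_jit":        [0, 2, 1, 0, 0, 1],              # observed claim counts
    "log_coverage": [3.2, 3.3, 1.1, 1.1, 2.0, 2.1],  # one element of the covariate vector x_it
})
print(claims)
```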

We assume that the number of claims is affected by the observable covariates, via associated regression coefficients, as well as by the (common) unobserved heterogeneity factor $\theta _{ji}$ for each coverage. For now, let $\mathcal{P}(\nu)$ denote a Poisson distribution with mean $\nu$. One way to incorporate heterogeneity into this model would be to multiply the mean by policyholder-specific and claim-type-specific effects:

(2) \begin{equation} \begin{aligned} N_{jit}|\textbf{x}_{it}, \theta _{ji},{ \theta _{i}} \sim \mathcal{P}\left ( \, \nu _{jit} \theta _{ji} \theta _i \, \right ), \end{aligned} \end{equation}

where $\nu _{jit}=\exp\!(\textbf{x}_{it}^\prime \boldsymbol{\alpha} _j)$ , and the common effect $\theta _i$ allows the lines to be dependent among one another. If $\theta _i = 1$ with probability 1, then we have independent lines. In comparison to the hierarchical random effects literature, the $\theta$ ’s in our model would correspond to the multiplicative random effects factor used in papers such as the one published by Frees and Shi (Reference Frees and Shi2017). Note that one needs to impose additional constraints on $\theta _i$ and $\theta _{ji}$ in (2) to assure model identifiability, which are specified in each of the scenarios discussed in this section later.

Our goal is to develop a method that enables the analyst to easily consider the individual unobserved heterogeneity $\theta _{ji}$ and the subject-specific effect $\theta _i$. We are also interested in exploring the possibility of capturing the dependence among the claims of the coverages from the same policyholder. Here, we treat $\theta _{ji}$ as either random or nonrandom. In the former case, we assume a parametric distribution for $\theta _{ji}$ so that Bayesian credibility is used to compute the posterior expectation of $\theta _{ji}$. In the latter case, we perform a nonparametric regression of $\theta _{ji}$ with a fused LASSO-type penalty applied to the log-likelihood function, using a version of the ADMM algorithm that extends the one in Chen et al. (Reference Chen, Huang, Chan and Yau2019). We refer to this approach as the nonparametric approach, because we do not assume a parametric distribution on $\theta _{ji}$. The original ADMM algorithm by Chen et al. (Reference Chen, Huang, Chan and Yau2019) is summarized in Appendix A. The different cases of our model, depending on how $\theta _{ji}$ and $\theta _i$ are treated, are explained in more detail in Appendix D.

We tried implementing our routines using all three penalties explained in Chen et al. (Reference Chen, Huang, Chan and Yau2019): the L1 penalty, the minimax concave penalty (MCP), and the smoothly clipped absolute deviation (SCAD) penalty. The definitions of all three penalties are replicated in Appendix A for the interested reader. Intuitively, the MCP and SCAD are concave penalty functions capable of shrinking pairwise differences to zero, and $\kappa$ controls the concavity of the penalty function. As $\kappa \to \infty$, both penalties approach the L1 penalty, as shown in Figure 1.

Figure 1. The SCAD and MCP penalty functions with different $\kappa$ values.

Because the SCAD and MCP penalties are concave, they avoid over-shrinking large differences. The pairs with large differences are the pairs that should not be combined into groups, and for these pairs it is optimal to apply smaller penalties. For this reason, the SCAD and MCP are known to reduce biases in the estimated coefficients. As for the choice between SCAD and MCP, we preferred SCAD because it has a higher curvature near zero and hence is more likely to shrink small coefficients to zero. Furthermore, SCAD gave the cleanest solution paths in our empirical analyses, so in the rest of this paper and in our implementation, we assume the penalty term is specifically the SCAD penalty.

To elaborate on this, a solution path is the curve one obtains by adjusting the tuning parameter from a small value to a large value. We prefer that this curve is smooth (clean), without kinks, because only then will the objective function for cross-validation have a smooth shape. Thus, smooth solution paths imply the solutions change continuously as the tuning parameter is altered, which is a desirable property for cross-validation purposes. Yet, even with the SCAD penalty, we found it difficult to obtain nice solution paths in general, where a nice solution path is one whose solution curves are smooth. The problem with the approach taken by Chen et al. (Reference Chen, Huang, Chan and Yau2019) seemed to be that once the policies start forming groups, the group with the largest number of policies starts attracting all the other policies into its group. This is because all the pairwise penalty terms result in a large pull from all the policies forming the group. The pairwise penalty approach has other difficulties in the estimation, namely that the estimation algorithm runs very slowly when there are a large number of policies. If there are $n$ policies, then there would be $\binom{n}{2}$ pairwise penalty terms. Instead, we propose penalizing the distance between the intercepts of those policies that have similar intercepts. Specifically, we propose using the following penalized likelihood to estimate the fixed effects $\boldsymbol{\alpha} _j$ and coverage-specific unobserved heterogeneity factors $\boldsymbol{\theta }_j = (\theta _{j1}, \ldots, \theta _{jI})$ :Footnote 1

(3) \begin{equation} \begin{aligned} & \tilde{\ell }_p(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j|\mathcal{D}_{j,T}) = \tilde{\ell }(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j|\mathcal{D}_{j,T})- \sum _{i=2}^{I} p\left (|\delta _{ji}^*|;\, \tau _j\right ), \\ \end{aligned} \end{equation}

where $\tilde{\ell }(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j|\mathcal{D}_{j,T})$ is the Poisson likelihood given by (2) with $\theta _i=1$ ,

\begin{equation*} p(|\delta |;\, \tau ) = \frac {\tau }{2.7} \int _0^{|\delta |} \min \{ 2.7, (3.7-x/\tau )_{+} \} dx, \ \delta _{ji}^{*} = \log \theta _{ji}^{*} - \log \theta _{j,i-1}^{*},\end{equation*}

and $\theta _{ji}^{*}$ are the coefficients $\theta _{ji}$ ordered by their initial values $\theta _{ji}^{(0)}$ :

(4) \begin{equation} \min \left \{\theta _{ji}^{(0)}\right \} = \theta _{j1}^{*(0)} \le \theta _{j2}^{*(0)} \le \cdots \le \theta _{jI}^{*(0)} = \max \left \{\theta _{ji}^{(0)}\right \} \end{equation}

for each $j$ . Note that we may define $\eta _{ji}^* = \log \theta _{ji}^*$ for notational convenience in the algorithm. For a low-dimensional problem with a small number of policyholders, a generalized linear model (GLM) may be used to obtain the initial values, which in turn determine the ordering of the coefficients. For a high-dimensional problem, this approach becomes problematic, and our proposal is to use a ridge regression to obtain the ordering, where the tuning parameter is initially set to a high value and eventually adjusted to a value close to zero. The reason for using ridge regression instead of the lasso here is that we would like to preserve all the coefficients while converging to their initial values as the tuning parameter becomes small. Specifically, let

(5) \begin{equation} \begin{aligned} (\boldsymbol{\alpha}_j^{(0)}, \boldsymbol{\theta}_j^{(0)}) = \mathop{\textrm{arg max}}\limits_{\boldsymbol{\alpha} _j, \boldsymbol{\theta} _j} \left [ \sum _{i=1}^I \sum _{t=1}^T \left (n_{jit}( \boldsymbol{x}_{it}^\prime \boldsymbol{\alpha} _j+\log \theta _{ji} )-\nu _{jit}\theta _{ji} \right ) - \lambda \sum _{i=1}^I \big \lVert \theta _{ji} \big \rVert ^2 \right ] \\ \end{aligned} \end{equation}

for a very small (close to zero) tuning parameter $\lambda$. The expression in Equation (5) is optimized initially with a large $\lambda$ value (so that the coefficients form one big group), and the estimates of the $\theta _{ji}$ ’s from the previous step are used as initial values for the subsequent estimates with a smaller value of $\lambda$. The process is repeated until $\lambda$ is a very small number close to zero (small enough so that it is numerically identical to zero). Because the ridge penalty does not introduce kinks into the objective function, this optimization can be performed using standard maximum likelihood estimation. Once the ordering is obtained, Equation (3) can be optimized for increasing values of $\tau _j$, where the optimal tuning parameter is selected as the one that maximizes the correlation of the predicted claim frequencies with the true frequencies (because that particular tuning parameter would be the one that performs best in terms of prediction). In practice, this scheme would be implemented so that for each $j$, we optimize

(6) \begin{equation} \tilde{\ell }_p^*(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j^{*} |\mathcal{D}_{j,T}) = \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j^{*}|\mathcal{D}_{j,T}) - \sum _{i=2}^{I} p\left (|\delta _{ji}^* |;\, \tau _j\right ). \end{equation}

In this case, the augmented Lagrangian is given as follows:

(7) \begin{equation} L(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j^{*}, \boldsymbol{\delta} _{j}, \boldsymbol{\lambda} _{j} |\mathcal{D}_{j,T}) = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\theta }_j^{*}|\mathcal{D}_{j,T}) + \sum _{i=2}^{I} p\left (|\delta _{ji}^* |;\, \tau _j\right ) + W_{j}, \end{equation}

where $W_{j} = \sum _{i=2}^I \lambda _{j,i-1} ( \log \theta _{j,i}^{*} - \log \theta _{j,i-1}^{*}-\delta _{j,i}^*) + \frac{\rho }{2} \sum _{i=2}^I ( \log \theta _{j,i}^{*} - \log \theta _{j,i-1}^{*} - \delta _{j,i}^* )^2$ .
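Before the ADMM iterations are run, the ordering in (4) must be fixed. The following is a minimal sketch of the ridge warm start in (5) for a single coverage $j$: it maximizes the ridge-penalized Poisson log-likelihood with scipy over a decreasing sequence of $\lambda$ values, re-using each solution as the starting point for the next fit. The function and variable names are ours, and numerical gradients are used only to keep the sketch short; it is not our production implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ridge_warm_start(X, N, lambdas=(10.0, 1.0, 0.1, 1e-6)):
    """Sketch of the ridge warm start in (5) for one coverage j.

    X : (I, T, p) array of covariates x_it;  N : (I, T) array of counts n_jit.
    Returns theta_ji^(0), whose sorted order defines theta*_{j1} <= ... <= theta*_{jI} in (4).
    """
    I, T, p = X.shape

    def neg_penalized_loglik(par, lam):
        alpha, eta = par[:p], par[p:]                # eta_i = log theta_ji
        theta = np.exp(eta)
        lin = X @ alpha                              # (I, T) linear predictors x_it' alpha_j
        nu = np.exp(lin)
        loglik = np.sum(N * (lin + eta[:, None]) - nu * theta[:, None])
        return -(loglik - lam * np.sum(theta ** 2))  # negative of the objective in (5)

    par = np.zeros(p + I)                            # start from alpha_j = 0, theta_ji = 1
    for lam in lambdas:                              # shrink lambda toward zero, warm-starting
        par = minimize(neg_penalized_loglik, par, args=(lam,), method="L-BFGS-B").x
    return np.exp(par[p:])
```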

The problem with the naive algorithm shown in Appendix A is that the solution path is still not ideal, because the coefficients within the same group do not merge completely. This problem can be fixed by keeping an encoding of which policies belong to which group at each step $s$. This approach is illustrated in Algorithm 1.

Algorithm 1: A modified ADMM for fast and stable parameter estimation

where

(8) \begin{align} L(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*, \boldsymbol{\delta} _j^{(s)}, \boldsymbol{\lambda} _j^{(s)}) & = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\eta }_j^{*}) + \sum _{i=2}^I p(|\delta _{ji}^{(s)}| ;\, \tau _j) + W_{j}^{(s)} \end{align}
(9) \begin{align} & = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\eta }_j^{*}) + \frac{\rho }{2} \sum _{i=2}^{I} \left [ \eta _{j,i}^{*} - \eta _{j,i-1}^{*}-\delta _{j,i-1}^{(s)} + \frac{\lambda _{j,i-1}^{(s)}}{\rho } \right ]^2 + C \end{align}

with a constant $C$ ,

(10) \begin{equation} \delta _{j,i}^{(s+1)} = \begin{cases} \textit{ST}(u_{j,i}, \tau _j/\rho ) & \text{ if } |u_{j,i}| \le \tau _j ( 1 + \rho ^{-1} ) \\ \frac{ \textit{ST}(u_{j,i}, \kappa \tau _j/ ((\kappa -1)\rho ))}{1 - ((\kappa -1)\rho )^{-1}} & \text{ if } \tau _j(1+\rho ^{-1} ) \le | u_{j,i} | \le \kappa \tau _j \\ u_{j,i} & \text{ if } |u_{j,i}| \gt \kappa \tau _j, \end{cases} \end{equation}
(11) \begin{equation} \begin{aligned} u_{j,i} &= \eta _{j,i}^{(s+1)} - \eta _{j,i-1}^{(s+1)} + \rho ^{-1} \lambda _{j,i}^{(s)}, \\ \textit{ST}(u, \tau ) &= \text{sgn}(u)(|u|-\tau )_+,\\ \end{aligned} \end{equation}
(12) \begin{equation} \lambda _{j,i}^{(s+1)} = \lambda _{j,i}^{(s)} + \rho \left ( \eta _{j,i}^{*(s+1)} - \eta _{j,i-1}^{*(s+1)} - \delta _{j,i}^{(s+1)} \right ), \end{equation}

and

(13) \begin{equation} \boldsymbol{G}_j^{(s+1)} = ( \boldsymbol{g}_{j1}^{(s+1)}, \boldsymbol{g}_{j2}^{(s+1)}, \cdots, \boldsymbol{g}_{j,g_{s+1}}^{(s+1)} )^\prime, \end{equation}
(14) \begin{equation} \boldsymbol{g}_{jr}^{(s+1)} = \sum _{w \in A(r)} \boldsymbol{g}_{jw}^{(s)}, \quad A(r) = \bigg \{ w\, :\, w \text{ is in the new } r \text{th group } \bigg \}. \end{equation}

Here $g_s$ is the number of groups (number of rows in matrix $\boldsymbol{G}_j^{(s)}$ at step $s$ ) and

(15) \begin{equation} \boldsymbol{\xi} _j^{(s+1)} = \boldsymbol{G}_j^{(s+1)} \boldsymbol{\eta} _j^{*(s+1)}. \end{equation}

With Algorithm 1, the gradient and Hessian need to be modified, but it is a simple modification, as explained in Appendix C.
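To make the modified steps concrete, the sketch below implements the soft-thresholding operator in (11), the SCAD-based update of $\delta_{j,i}$ in (10), and the construction of the merge matrix $\boldsymbol{G}_j$ in (13)–(14) from a vector of group labels. The function names are ours, and the surrounding loop of Algorithm 1 (the $\boldsymbol{\alpha}_j$ and $\boldsymbol{\eta}_j^*$ updates and the convergence check) is omitted.

```python
import numpy as np

def soft_threshold(u, tau):
    """ST(u, tau) = sgn(u) * (|u| - tau)_+ as in (11)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def scad_delta_update(u, tau, kappa=3.7, rho=5.0):
    """Elementwise SCAD update of delta_{j,i} in (10), given u_{j,i} from (11)."""
    u = np.asarray(u, dtype=float)
    middle = soft_threshold(u, kappa * tau / ((kappa - 1) * rho)) / (1 - 1 / ((kappa - 1) * rho))
    return np.where(np.abs(u) <= tau * (1 + 1 / rho),
                    soft_threshold(u, tau / rho),
                    np.where(np.abs(u) <= kappa * tau, middle, u))

def merge_matrix(labels):
    """Build G_j as in (13)-(14): row r is the 0/1 indicator of the policies in the r-th group."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    G = np.zeros((groups.size, labels.size))
    for r, g in enumerate(groups):
        G[r, labels == g] = 1.0
    return G

# Example: intercepts 1 and 2 have merged, and so have intercepts 4 and 5.
G = merge_matrix([0, 0, 1, 2, 2])      # shape (3, 5)
# xi_j = G @ eta_j_star collapses the five ordered intercepts onto the three groups, as in (15).
```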

The proposed method can be used to capture the unobserved heterogeneity of the policyholders under diverse scenarios for the dependence structure of the insurance claims from multiple lines of business. In that regard, we consider the following cases to elaborate on how the proposed method can be applied in ratemaking practice:

  • No interdependence among the lines,

  • Perfect interdependence among the lines,

  • Flexible interdependence.

The details of these three cases are explained in full in Appendix D. In the following section, we perform a simulation study to compare the different dependence cases in conjunction with the ADMM estimation approach.

3. Simulation study

With the three cases illustrated in Section 2 and Appendix D, we are interested in which approach performs better in different situations of data availability. We perform a simulation study to seek some answers. Here we assume that $J=3$ (trivariate coverage case), $I=1000$, and $T=5$.

Recall that $\nu _{jit}=\exp\!(\boldsymbol{x}_{it}^\prime \boldsymbol{\alpha} ^{(j)})$ while we set $\boldsymbol{x}_{it}=(x_{1it}, x_{2i}, x_{3i})^\prime$. Here $x_{1it}$ follows a Student’s $t$-distribution with 30 degrees of freedom and corresponds to a continuous covariate such as age or vehicle value. $x_{2i}$ is a Bernoulli random variable with probability 0.5 that might have varying impacts depending on the coverage type for the same insured. For example, an old pick-up truck may have higher risk in the bodily injury and property damage liabilities but lower exposure on the collision coverage due to its relatively low vehicle value, while the opposite may hold for a compact luxury electric vehicle. Lastly, $x_{3i}$ follows a normal distribution with mean $-0.5$ and variance $1$, which corresponds to a territorial or individual risk score that affects all coverages simultaneously in the same way and may or may not be available as a covariate. We also fixed $\boldsymbol{\alpha} _{1}=({-}1, +0.3, 0.1)$ , $\boldsymbol{\alpha} _{2}=({-}1, {-}0.3, 0.1)$ , and $\boldsymbol{\alpha} _{3}=({-}1, +0.1, 0.1)$ .
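A minimal sketch of this data-generating design is given below. It draws the covariates and coefficients exactly as stated above and then simulates the counts from the independent Poisson specification $N_{jit} \sim \mathcal{P}(\nu_{jit})$, i.e., with $\theta_{ji} = \theta_i = 1$; variable names and the random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(2024)
I, T, J = 1000, 5, 3

# Covariates: x1 varies over time; x2 and x3 are policyholder-level.
x1 = rng.standard_t(df=30, size=(I, T))      # continuous covariate (e.g., age or vehicle value)
x2 = rng.binomial(1, 0.5, size=I)            # binary covariate with coverage-specific effect
x3 = rng.normal(-0.5, 1.0, size=I)           # territorial / individual risk score

alpha = np.array([[-1.0,  0.3, 0.1],         # alpha_1
                  [-1.0, -0.3, 0.1],         # alpha_2
                  [-1.0,  0.1, 0.1]])        # alpha_3

# Stack x_it = (x_{1it}, x_{2i}, x_{3i})' and compute nu_jit = exp(x_it' alpha_j).
X = np.stack([x1,
              np.repeat(x2[:, None], T, axis=1),
              np.repeat(x3[:, None], T, axis=1)], axis=-1)      # shape (I, T, 3)
nu = np.exp(np.einsum("itp,jp->jit", X, alpha))                 # shape (J, I, T)
N = rng.poisson(nu)                                             # simulated claim counts N_jit

# Scenarios 2-4 are obtained by dropping x2 and/or x3 from the design before fitting.
```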

Note that depending on the data collection scheme or company policies, it is possible that some of the covariates that affect the claim frequency may not be available. In this regard, we assume the following scenarios to fit ratemaking models:

  • Scenario 1: All covariates are available,

  • Scenario 2: $x_{1it}$ and $x_{2i}$ are available,

  • Scenario 3: $x_{1it}$ and $x_{3i}$ are available,

  • Scenario 4: Only $x_{1it}$ is available.

Under each scenario, we used the data points $\{ \boldsymbol{x}_{it}, N_{jit} \, | \, t=1,2,3,4\}$ as a training set, which were fitted with the following models:

  • GLM without individual effects: Simple Poisson GLM with no consideration of the unobserved factors, which is equivalent to $\theta _{ji}=1$ and $\theta _{i}=1$ for all $i$ and $j$ .

  • Individual effects model: A model where $\theta _{ji}$ are nonrandom and estimated by the proposed ADMM approach while $\theta _i=1$ for all $i$ and $j$ . The details are explained in Case 1b of Appendix D.

  • Common effects model: A model where $\theta _{i}$ are nonrandom and estimated by the proposed ADMM approach while $\theta _{ji}=1$ for all $i$ and $j$ . The details are explained in Case 2b of Appendix D.

  • Boosted credibility model: A model where $\theta _{ji}$ are nonrandom and estimated by the proposed ADMM approach first, and boosted by the credibility estimate of random $\theta _i$ for all $i$ and $j$ . The details are explained in Case 3 of Appendix D.

  • True model: A model that used actual $\nu _{jit}$ , which is not attainable in practice but only provided as an ideal benchmark.

After the models are fitted for each dataset, $\{ \boldsymbol{x}_{it}, N_{jit} \, | \, t=5\}$ was used to assess the model performance via out-of-sample validation with root-mean squared error (RMSE), which is defined as follows:

\begin{equation*} \begin {aligned} \text {RMSE}&: \sqrt {\frac {1}{I}\sum _{i=1}^I (N_{ji5}-\hat {N}_{ji5})^2}. \end {aligned} \end{equation*}
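In code, this validation measure amounts to the following small helper (the array names are hypothetical):

```python
import numpy as np

def rmse(n_obs, n_hat):
    """Root-mean squared error between held-out counts N_{ji5} and their predictions."""
    n_obs, n_hat = np.asarray(n_obs, float), np.asarray(n_hat, float)
    return np.sqrt(np.mean((n_obs - n_hat) ** 2))
```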

Note that we prefer a model with lower RMSE. Tables 1, 2, and 3 summarize the validation measures for all data availability scenarios and models. It is observed that the individual effects model, common effects model, and boosted credibility models tend to perform better than the naive model (GLM without individual effects) as the covariates become unavailable in Scenarios 2, 3, and 4. Note that such improved performance is obtained at the expense of increased computation costs as shown in the Tables E.1, E.2, and E.3 of Appendix E.

Table 1. Out-of-sample validation results with simulated Coverage 1

Table 2. Out-of-sample validation results with simulated Coverage 2

Table 3. Out-of-sample validation results with simulated Coverage 3

4. Data

It is often the case that actuaries are in a situation where limited explanatory variables are available besides the exposure variable for modeling the insurance claim frequencies. In this case, the question arises: how would the insurer avoid overcharging those policyholders with little or no claims, while charging a higher, yet fair, premium for those policies with a recurrently large number of claims? In doing so, the insurer may be interested in avoiding adverse selection and preventing low-risk policyholders from lapsing. Standard credibility approaches are available for this situation, allowing the actuary to vary the insurance rate depending on past experience, in addition to the exposure variable in hand. Subgroup analysis methods present an alternative approach to the same problem, where one may consider the subject-specific intercepts of the frequency regression model as a target of regularization. Which approach performs better in terms of out-of-sample prediction accuracy is the question of interest.

We first summarize the dataset used for the empirical analysis. We assume that the insurance company carries multiple coverage groups of business. For the Wisconsin LGPIF dataset that we use, there are six different coverage groups: building and content (BC), contractor’s equipment (IM), comprehensive new (PN), comprehensive old (PO), collision new (CN), and collision old (CO) coverage. Table 4 summarizes the coverage amount per policyholder per year for the six different groups over time. The reader may verify that the building and contents coverage group has the highest coverage amounts. Table 5 shows the average claim frequencies for each coverage group per policy per year. Due to the comprehensive nature of the coverage, the number of BC claims per year is substantially higher than the number of claims from the other coverages. For a detailed description and marginal analysis results of the LGPIF dataset, see Frees et al. (Reference Frees, Lee and Yang2016).

Table 4. Summary of coverage amounts by year in millions of dollars

Table 5. Summary of claim frequencies by year

One may expect that the claim frequencies and coverage amounts are highly correlated, and this is verified in Figure 2. The reader may verify that higher claim frequencies are related to higher coverage amounts in general. This justifies using the log coverage amount as the exposure variable. Note that in Figure 2, it is possible for a policyholder to have positive coverage with zero claim. This is a unique feature of insurance claims datasets. Besides the coverage amount, we assume that no other explanatory variables are available for the actuary. This imaginary setup allows us to focus on the risk classification problem, where limited explanatory variables are available. This kind of situation may arise in practice quite naturally in, for example, a guaranteed issue insurance product, where the insurance company is not allowed to underwrite or collect any details regarding the policyholder. In the real dataset we have in our hands, there are in fact other variables available, such as the geographical location, the county code, and the entity type of the policy. Yet, we avoid using them in our analysis to mimic a situation where rating variables are limited. In Section 5, we do provide a comparison of our method with and without the use of additional entity type variables in the regression. For completeness, we include a summary table of the sample sizes by the entity type variable in Table 6.

Table 6. Number of observations by the entity type variable

Figure 2. Relationship between coverage amounts and claim frequencies.

One of the main goals of our analysis is to figure out a way to prevent policyholders from lapsing. Naturally, it is of interest to see if lapse behaviors are observed in the dataset. Table 7 shows that over time, the number of policyholders with each coverage is in general decreasing. At the same time, the LGPIF has been experiencing a drop in the surplus ratio. The surplus ratio is the ratio of the collected premium to the surplus, and it is a measure of the insurer’s financial health. A surplus ratio below the company’s target ratio would indicate bad financial health. The fact that this ratio has dropped in recent years makes the property fund manager wonder if there is an alternative approach to the ratemaking problem. Hence, the presence of lapse may provide motivation for the type of analysis we are about to present in this paper.

Table 7. Number of policies with positive coverage each year

Figure 3. Histogram plots of the Cox–Snell residuals for a basic GLM.

Figure 3 shows the Cox–Snell residuals for the claim frequencies observed for each policyholder. The Cox–Snell residuals are obtained by performing a simple Poisson frequency regression of the claim frequencies on the exposure and obtaining the fitted distributions first. For example, if $x_{it}$ are the exposures (say the log coverage amount) and $n_{it}$ are the observed frequencies for policyholders $i=1, 2, \cdots, I$ and times $t=1, 2, \cdots, T$ , then one would perform a Poisson regression with the specification

\begin{equation*} N_{jit}|x_{it} \sim \mathcal {P}(\nu _{jit}) \end{equation*}

where $\nu _{jit} = \exp\!({x}_{it} \alpha _j)$ , to obtain the parameter estimates $\hat{\alpha }_j$ for each coverage type $j=1, 2, \cdots, J$ and Poisson distribution functions $\mathcal{P}$ with means $\nu _{jit}$ . The observed frequencies are then plugged into the fitted cumulative distribution functions and plotted as histograms. In other words, the Cox–Snell residuals are

\begin{equation*} \hat {F}_{N_{jit}}(n_{jit}) \end{equation*}

where $\hat{F}_{N_{jit}}$ are the estimated cumulative distribution functions of the observed frequencies $n_{jit}$ of the random variables $N_{jit}$ , having coefficients $\hat{\alpha }_{j}$ . The values are plotted as a histogram in each of the panels of Figure 3. This is done as a preliminary analysis to see if there is evidence of the need for subgroup analysis. It is an approach comparable to analyzing the deviance residuals as done in Chen et al. (Reference Chen, Huang, Chan and Yau2019). Figure 3 shows that the residuals are not uniform and instead show clusters between 0 and 1. This provides motivation to perform subgroup analysis to determine if the policyholders can be classified into homogeneous groups for ratemaking purposes.
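A minimal sketch of this preliminary check is given below. It fits the single-covariate Poisson regression with statsmodels, mirroring the specification $\nu_{jit} = \exp(x_{it}\alpha_j)$ above (no intercept), and plugs the observed counts into the fitted Poisson distribution functions; the variable names are ours. For count data, one may also prefer randomized versions of such residuals, which we do not pursue here.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

def cox_snell_residuals(n, log_coverage):
    """Fit N ~ Poisson(exp(x * alpha)) and return F_hat(n), the values shown in Figure 3."""
    x = np.asarray(log_coverage, float).reshape(-1, 1)   # single exposure covariate, no intercept
    fit = sm.GLM(np.asarray(n, float), x, family=sm.families.Poisson()).fit()
    nu_hat = fit.fittedvalues                            # fitted Poisson means nu_jit
    return poisson.cdf(n, nu_hat)

# For one coverage type j: residuals = cox_snell_residuals(n_jit, log_coverage_it),
# then a histogram of the residuals reproduces the corresponding panel of Figure 3.
```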

5. Empirical results

To empirically illustrate which approach has the most advantage for the LGPIF dataset, we compare five different approaches to performing a Poisson regression. The five different cases are summarized in the list below. In each case, we may either include or not include the entity type categories as explanatory variables, resulting in ten different special cases. The entity type categories include: city, county, school, town, village, and miscellaneous (six different entity type categories).

  • The basic GLM approach: This approach is a plain generalized linear modeling approach, without individual or common effects. It will be used as a benchmark model, which we hope to out-perform using the approaches described in this paper. This would be the case when $\theta _i = \theta _{ji} = 1$ in Equation (2), so that we simply have

    (16) \begin{equation} \begin{aligned} N_{jit}|\textbf{x}_{it} \sim \mathcal{P}\left ( \nu _{jit} \right ), \end{aligned} \end{equation}
    Note that $\textbf{x}_{it}$ would include the log coverage amount, and the entity type categories, which are the same across different coverage types. Yet, a different GLM is used for each coverage type, and hence the subscript $j$ . If the entity type categories are not used, then $\textbf{x}_{it}$ would include a single covariate, which is the log coverage amount.
  • The individual effects model: This is one of the ADMM approaches, where each of the six lines is modeled separately with fixed policy-number effects. The details are explained in Case 1b of Appendix D.

  • The common effects model: This is another ADMM approach, where we have perfect interdependence among the different coverage groups. The details are explained in Case 2b of Appendix D.

  • The offsets model: This is a two-step approach, where the predictions from the individual effects model are used as offsets in the regression model for the common effects model, hoping that the second step common effects model will capture any residual common effects among the different coverage groups.

  • The boosted credibility model: This is a hybrid approach, where we have flexible interdependence. The individual effects model is boosted with credibility factors. The details are explained in Case 3 of Appendix D.

We now compare the basic GLM approach, the individual effects model, the common effects model, the offsets model, and the boosted credibility model. Tables 8 and 9 compare the five different approaches. The hold out sample comparisons were performed after omitting new policies for which experience does not exist. The optimal tuning parameters for the models using the ADMM approach are determined using a fourfold cross-validation, where in each fold a particular year of data is used as the validation sample and the rest are used to fit the model. Computation time was a critical issue in completing this project, and we found that a fourfold cross-validation (as opposed to a ten-fold) was enough to deliver the results we wanted in a reasonable amount of time. The model with the lowest validation sample Akaike Information Criterion (AIC) is selected as the optimal model.
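A schematic of this fourfold, year-based selection of the tuning parameter is sketched below; fit_model and validation_aic are placeholders for the ADMM fit and the validation criterion, not functions from our implementation.

```python
def select_tau(data_by_year, taus, fit_model, validation_aic):
    """Hold out one training year per fold, fit on the rest, and pick the tau with lowest AIC."""
    best_tau, best_aic = None, float("inf")
    for tau in taus:
        fold_aics = []
        for held_out_year in data_by_year:
            train = {year: d for year, d in data_by_year.items() if year != held_out_year}
            model = fit_model(train, tau)
            fold_aics.append(validation_aic(model, data_by_year[held_out_year]))
        avg_aic = sum(fold_aics) / len(fold_aics)
        if avg_aic < best_aic:                           # lowest validation AIC wins
            best_tau, best_aic = tau, avg_aic
    return best_tau
```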

Table 8. Spearman correlations with the hold out sample frequencies

Table 9. Number of unique fixed-effect coefficients

In Tables 8 and 9, the GLM without individual effects is a simple Poisson regression model. The individual effects model predictions are obtained by concatenating the predictions from the six different coverage groups ( $j=1, \cdots, 6$ ) and comparing them with the hold out sample frequencies from each coverage group. The same is true for the common effects model, the offsets model, and the boosted credibility model.

From Tables 8 and 9, we can see that the GLM and the individual effects model give the same results. This is because, with the entity type fixed effects present, the individual effects model converges to the trivial model without any additional fixed effects. The resulting overall intercept is exactly the same as the intercept of the GLM, and so are the other coefficients. We can see that even the boosted credibility model does not show a particular advantage in this case.

Table 10 illustrates that with the entity type covariates present, the ADMM approach proposes the trivial model without any subject-specific intercepts as the optimal model. This can be seen in the first row, where the number of intercepts in the resulting model is 1 for all six coverage groups. In the second row, we can see that if the entity type covariates are unavailable, then the resulting optimal model has several groups in the intercepts. This is consistent with our expectations from the preliminary analysis using the Cox–Snell residuals in Section 4.

Table 10. Number of groups in the intercepts of the individual effects model

The last column in the tables illustrates when and how our approach can be useful. The individual effects model without entity type would be used in an imaginary situation, where the analyst only has a single exposure variable and no other predictors. In this case, the GLM model is out-performed by the individual effects model, which uses 16 additional coefficients. The common effects model does not seem to add much benefit. The boosted credibility model appears to be the best one, with a 48.69% Spearman correlation with the hold out sample claim frequencies.

The solution paths for the individual effects models for the six lines are shown in Figures 4 and 5. Figure 4 shows the case when there are no entity type fixed effects, while Figure 5 shows the solution path of the $\eta _{j,i}^*$ values for the model with entity type fixed effects.

Figure 4. Solution paths for the six coverage groups without entity type predictors.

Figure 5. Solution paths for the six coverage groups with entity type predictors.

To understand the figures intuitively, imagine changing the tuning parameter $\tau$ from a small value to a larger value. As the tuning parameter increases, more and more penalty is applied to the likelihood function, forcing more coefficients to merge with one another. As the coefficients merge, the model contains fewer and fewer unique coefficients. At some point, there will be an optimal model suggested by cross-validation, and that optimal model is indicated by a vertical line in each panel of the figures. Note that each line would represent one coefficient in the beginning, but that beginning point is not shown in the figures, so as to magnify the most interesting part of the entire figure.

In Figure 5, we can see that the optimal model chosen by cross-validation contains only one unique coefficient for the intercepts, indicating that the optimal model is the GLM (the optimal tuning parameter is indicated by the solid vertical line). In Figure 4, we can see that the selected optimal model contains more than one unique parameter for the $\eta _{j,i}^*$ values for each coverage group.

6. Conclusion

In this paper, we presented an approach for enhancing the prediction from a GLM by allowing the model to have subject-specific intercepts. We presented a modified ADMM algorithm that uses penalties on adjacent coefficients only, as opposed to the pairwise penalty approach in the original ADMM for subgroup analysis. Our approach runs faster with better-looking solution paths, and we have been successful at applying the approach to a real insurance dataset with more than a thousand unique policyholders. The optimal number of unique intercepts is determined by cross-validation, and in the empirical studies, we have demonstrated that the approach can be further enhanced using a flexible interdependence approach among the coverage groups.

The benefits and costs of the ADMM approach are as follows: The first benefit of the ADMM approach for the unobserved heterogeneity is that it can be flexibly applied to an actuarial application regardless of the underlying distribution of $N|\theta$ . For example, the response $N|\theta$ does not need to follow a Poisson distribution but can follow any frequency distribution, such as the zero-one inflated negative binomial (Shi & Lee, Reference Shi and Lee2022). Another benefit is that one can also use the approach to capture the unobserved heterogeneity of random variables whose dependence structure is described by a copula. For example, one way this can happen is if one assumes a multivariate Gaussian copula with many dependence parameters (due to the fact that there are many response variables). Each dependence parameter may be a target for subgroup analysis. The disadvantage of the approach seems to be the computation time. If $N|\theta$ indeed follows a Poisson distribution, then it could be more sensible to assume that $\theta$ follows a gamma distribution, which leads to much simpler estimation and prediction with closed-form formulas. Nevertheless, the ADMM approach is conceptually simple, and it provides a simple alternative to the Bayesian approach to credibility in case there is a reasonable number of unique policies.
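For instance, under the standard Poisson-gamma setup with $N_t\,|\,\theta \sim \mathcal{P}(\nu_t \theta)$ and a gamma prior with shape and rate both equal to $a$ (so that $\mathrm{E}[\theta]=1$), the posterior mean of $\theta$ takes the familiar credibility-weighted form:

\begin{equation*} \mathrm{E}\left[\theta \mid n_1, \ldots, n_T\right] = \frac{a + \sum_{t=1}^T n_t}{a + \sum_{t=1}^T \nu_t} = Z \, \frac{\sum_{t=1}^T n_t}{\sum_{t=1}^T \nu_t} + (1-Z), \qquad Z = \frac{\sum_{t=1}^T \nu_t}{a + \sum_{t=1}^T \nu_t}. \end{equation*}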

Our contributions in this paper are the following. First, we have presented a modified version of the ADMM algorithm for subgroup analysis, which applies the penalties to adjacent coefficients only and lets the number of unique coefficients in the model decrease as the iterations proceed, allowing the algorithm to run even faster. Second, we have demonstrated how the approach can be extended to a hybrid approach, so that it allows for the flexible incorporation of dependencies among different coverage groups for a multiline insurance company. Third, we presented a case study using the LGPIF dataset, where we compared the performance of the ADMM approach with competing approaches on a full insurance claims dataset.

Possible future work may include the application of subgroup analysis to dependence models involving copulas, where the dependence parameters for an elliptical copula are regularized using pairwise penalties similar to the one explored in this paper. Another avenue of future work may be the exploration of subgroup analysis for other distributions, including but not limited to the negative binomial, or zero-inflated versions of the frequency models often used in the actuarial literature. For example, in Equation (2) of Section 2, we assumed that $\mathcal{P}(\nu)$ is a Poisson distribution with mean $\nu$ ; because our approach is general, it is possible to work with other discrete distributions, including the binomial or negative binomial distributions. It may be a little trickier with the Tweedie distribution, as the density function requires a large summation, and exploring this may be interesting future work.

Competing interests

The authors have no competing interests to declare.

Data availability statement

The data and code that support the findings of this study are available from the corresponding author, Gee Y. Lee, upon request.

Appendix A: The original ADMM algorithm

For completeness, we briefly introduce the algorithm used in Chen et al. (Reference Chen, Huang, Chan and Yau2019), which considered a simpler data structure compared to the one studied in this paper. They considered neither possible dependence among the lines of business nor serial dependence among the claims from the same policyholder, so that $J=T=1$ in Equation (1) and $\theta _i=1$ in Equation (2) of the main paper. The dataset that we studied is more complicated, because there are multiple coverage types $j=1, \cdots, J$ and times $t=1, \cdots, T$ . Chen et al. (Reference Chen, Huang, Chan and Yau2019) aim to maximize the Poisson likelihood with a fused LASSO-type penalty, which is given as follows:

(17) \begin{equation} \begin{aligned} &{\ell }_p(\boldsymbol{\alpha} _{j}, \boldsymbol{\theta }_{j} | \mathcal{D}_{j,T}) = \tilde{\ell }(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j|\mathcal{D}_{j,T}) - \sum _{1 \leq i \lt l \leq I} p\left (|\delta _{jil}|; \,\tau _j\right ) \\ \end{aligned} \end{equation}

subject to $\log \theta _{ji}-\log \theta _{jl} = \delta _{jil}$ for all $1 \leq i \lt l \leq I$ , where

(18) \begin{equation} \tilde{\ell }(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j|\mathcal{D}_{j,T}) = \boldsymbol{N}_j^{\prime } \boldsymbol{X} \boldsymbol{\alpha} _j + \boldsymbol{N}_j^{\prime } \boldsymbol{U}_j \log (\boldsymbol{\theta} _j) - \boldsymbol{\nu} _j^{\prime } \boldsymbol{U}_j \boldsymbol{\theta} _j -C_j \end{equation}

and

(19) \begin{gather} \boldsymbol{N}_j = (\boldsymbol{n}_{j1}^{\prime }, \boldsymbol{n}_{j2}^{\prime }, \cdots, \boldsymbol{n}_{jI}^{\prime })^{\prime} \quad \text{ and } \quad \boldsymbol{n}_{ji} = (n_{ji1}, n_{ji2}, \cdots, n_{jiT})^{\prime} \end{gather}
(20) \begin{gather} \boldsymbol{X} = (\boldsymbol{X}_1^{\prime}, \boldsymbol{X}_2^{\prime}, \cdots \boldsymbol{X}_I^{\prime})^{\prime} \quad \text{ and } \quad \boldsymbol{X}_i = ( \boldsymbol{x}_{i1}^{\prime}, \boldsymbol{x}_{i2}^{\prime}, \cdots \boldsymbol{x}_{iT}^{\prime})^{\prime} \end{gather}
(21) \begin{gather} \boldsymbol{U}_j = (\boldsymbol{U}_{j1}^{\prime}, \boldsymbol{U}_{j2}^{\prime}, \cdots \boldsymbol{U}_{jI}^{\prime})^{\prime} \quad \text{ and } \quad \boldsymbol{U}_{ji} = (\boldsymbol{e}_{ji}^{\prime}, \boldsymbol{e}_{ji}^{\prime}, \cdots, \boldsymbol{e}_{ji}^{\prime})^{\prime}_{T\times I} \end{gather}
(22) \begin{gather} \boldsymbol{\theta} _j = (\theta _{j1}, \theta _{j2}, \cdots, \theta _{jI})^{\prime} \end{gather}
(23) \begin{gather} \boldsymbol{\nu} _{j} = \exp\!\left ( \boldsymbol{X} \boldsymbol{\alpha} _{j} \right ) \end{gather}
(24) \begin{gather} \text{ and } C_{j} = \sum _{i=1}^I \sum _{t=1}^T \log \left ( n_{jit}! \right ). \end{gather}

Here, $\boldsymbol{e}_{ji} = (e_{ji1}, e_{ji2}, \cdots, e_{jiI})$ is a vector whose elements are all zeros except for one entry:

\begin{equation*} e_{jil} = \begin {cases} 1 & \text { if } i = l \\ 0 & \text { if } i \ne l \end {cases} \end{equation*}

and the $\exp\!( \cdot )$ and $\log ( \cdot )$ are element-wise functions in case the input is a vector. $p(\cdot ;\,\tau )$ can be any concave penalty function and $\tau$ is the corresponding tuning parameter. Examples of penalty functions used are

  (a) The LASSO penalty:

    (25) \begin{equation} p(t ;\, \tau ) = \tau | t |, \end{equation}
  (b) The MC (minimax concave) penalty:

    (26) \begin{equation} p(t;\, \tau ) = \tau \int _0^t (1 - x/(\kappa \tau ))_{+} dx, \quad \kappa \gt 1, \end{equation}
  (c) The SCAD penalty:

    (27) \begin{equation} p(t;\, \tau ) = \tau \int _0^t \min \{ 1, (\kappa -x/\tau )_{+}/(\kappa -1) \} dx, \quad \kappa \gt 2. \end{equation}
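The integrals in (26) and (27) evaluate in closed form, and the following sketch implements all three penalties for $t \ge 0$; it is provided only to make the shapes in Figure 1 of the main paper easy to reproduce, and the default $\kappa$ values are illustrative.

```python
import numpy as np

def lasso_penalty(t, tau):
    """The LASSO penalty (25)."""
    return tau * np.abs(t)

def mc_penalty(t, tau, kappa=3.0):
    """The MC penalty (26), integrated piecewise; requires kappa > 1."""
    t = np.abs(t)
    return np.where(t <= kappa * tau, tau * t - t ** 2 / (2 * kappa), kappa * tau ** 2 / 2)

def scad_penalty(t, tau, kappa=3.7):
    """The SCAD penalty (27), integrated piecewise; requires kappa > 2."""
    t = np.abs(t)
    return np.where(t <= tau,
                    tau * t,
                    np.where(t <= kappa * tau,
                             (2 * kappa * tau * t - t ** 2 - tau ** 2) / (2 * (kappa - 1)),
                             tau ** 2 * (kappa + 1) / 2))
```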

One can observe that (17) is not smooth for certain values of the tuning parameter, which hinders direct maximization of (17) via the use of popular optimization routines for convex optimization. For example, with the LASSO penalty, one may observe that the negative log-likelihood function has a kink as in the right panel of Figure A.1, for certain values of the tuning parameter:

Figure A.1. An illustration of the potential difficulties of direct optimization.

In that regard, Chen et al. (Reference Chen, Huang, Chan and Yau2019) used a version of ADMM algorithm that uses the following augmented Lagrangian as the objective function:

\begin{equation*} L(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j, \boldsymbol{\delta} _{j}, \boldsymbol{\lambda} _{j} |\mathcal {D}_{j,T}) = -\tilde{\ell }(\boldsymbol{\alpha} _{j}, \boldsymbol{\theta }_j | \mathcal {D}_{j,T}) + \sum _{1 \leq i \lt l \leq I} p\left (|\delta _{jil} |; \,\tau _j\right ) + \overline {W}_{j}, \end{equation*}

where $\overline{W}_{j} = \sum _{1 \leq i \lt l \leq I} \lambda _{jil}( \log \theta _{ji} - \log \theta _{jl}-\delta _{jil}) + \frac{\rho }{2} \sum _{1 \leq i \lt l \leq I} ( \log \theta _{ji} - \log \theta _{jl} - \delta _{jil} )^2$ . Here $\lambda _{jil}$ is the Lagrange multiplier corresponding to the equality constraints, and $\rho$ is a hyperparameter that controls the smoothness of the augmented Lagrangian through the additional quadratic term. According to Boyd et al. (Reference Boyd, Parikh, Chu, Peleato and Eckstein2011), the addition of the quadratic term makes the augmented Lagrangian differentiable under mild conditions and facilitates better convergence of the algorithm.

This results in the naive version of the ADMM algorithm shown below.

Algorithm 0: A naive ADMM algorithm for parameter estimation

where

(28) \begin{align} L(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*, \boldsymbol{\delta} _j^{(s)}, \boldsymbol{\lambda} _j^{(s)}) & = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\eta }_j^{*}) + \sum _{i=2}^I p(|\delta _{ji}^{(s)}| ;\, \tau _j) + W_{j}^{(s)} \end{align}
(29) \begin{align} & = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\eta }_j^{*}) + \frac{\rho }{2} \sum _{i=2}^{I} \left [ \eta _{j,i}^{*} - \eta _{j,i-1}^{*}-\delta _{j,i-1}^{(s)} + \frac{\lambda _{j,i-1}^{(s)}}{\rho } \right ]^2 + C, \end{align}

where $C$ is a constant, and

(30) \begin{equation} \delta _{j,i}^{(s+1)} = \begin{cases} ST(u_{j,i}, \tau _j/\rho ) & \text{ if } |u_{j,i}| \le \tau _j ( 1 + \rho ^{-1} ) \\ \frac{ ST(u_{j,i}, \kappa \tau _j/ ((\kappa -1)\rho ))}{1 - ((\kappa -1)\rho )^{-1}} & \text{ if } \tau _j(1+\rho ^{-1} ) \le | u_{j,i} | \le \kappa \tau _j \\ u_{j,i} & \text{ if } |u_{j,i}| \gt \kappa \tau _j, \end{cases} \end{equation}
(31) \begin{equation} \begin{aligned} u_{j,i} &= \eta _{j,i}^{(s+1)} - \eta _{j,i-1}^{(s+1)} + \rho ^{-1} \lambda _{j,i}^{(s)}, \\ ST(u, \tau ) &= \text{sgn}(u)(|u|-\tau )_+,\\ \end{aligned} \end{equation}

and

(32) \begin{equation} \lambda _{j,i}^{(s+1)} = \lambda _{j,i}^{(s)} + \rho \left ( \eta _{j,i}^{*(s+1)} - \eta _{j,i-1}^{*(s+1)} - \delta _{j,i}^{(s+1)} \right ). \end{equation}

To see the equality between Equations (28) and (29), note that expanding the right-hand side of (29), we see that

\begin{align*} L(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*, \boldsymbol{\delta} _j^{(s)}, \boldsymbol{\lambda} _j^{(s)}) & = - \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*) + \frac{\rho }{2} \sum _{i=2}^I \left [ \eta _{j,i}^* - \eta _{j,i-1}^* - \delta _{j,i}^{(s)} + \frac{ \lambda _{j,i-1}^{(s)} }{ \rho } \right ]^2 + C \\ & = - \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*) + \frac{\rho }{2} \sum _{i=2}^I ( \eta _{j,i}^* - \eta _{j,i-1}^* - \delta _{j,i}^{(s)})^2 + \frac{\rho }{2} \sum _{i=2}^I \left ( \frac{ \lambda _{j,i-1}^{(s)} }{ \rho } \right )^2 \\ & \quad \quad + 2 \times \frac{\rho }{2} \sum _{i=2}^I \frac{\lambda _{j,i-1}^{(s)}}{\rho } ( \eta _{j,i}^* - \eta _{j, i-1}^* - \delta _{j,i}^{(s)}) + C \\ & = - \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*) + \frac{\rho }{2} \sum _{i=2}^I ( \eta _{j,i}^* - \eta _{j,i-1}^* - \delta _{j,i}^{(s)})^2 + \frac{\rho }{2} \sum _{i=2}^I \left ( \frac{ \lambda _{j,i-1}^{(s)} }{ \rho } \right )^2 \\ & \quad \quad + \sum _{i=2}^I \lambda _{j,i-1}^{(s)} ( \eta _{j,i}^* - \eta _{j, i-1}^* - \delta _{j,i}^{(s)}) + C\\ & = - \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\eta} _j^*) + \sum _{i=2}^I p( | \delta _{ji}^{(s)} | ; \,\tau _j) + W_j^{(s)} \end{align*}

where, the last equality holds if

\begin{equation*} C = \sum _{i=2}^I p( | \delta _{ji}^{(s)} | ;\, \tau _j) - \frac {\rho }{2} \sum _{i=2}^I \left ( \frac { \lambda _{j,i-1}^{(s)} }{ \rho } \right )^2. \end{equation*}

Note that terms with superscript $(s)$ can all be considered constant. Here $\kappa$ controls the shape of the SCAD penalty; it was set to 3.7 in the original work of Fan and Li (Reference Fan and Li2001), but it could also be determined by cross-validation. $\tau _j$ corresponds to the magnitude of the penalty term, so that as $\tau _j$ increases, $\delta _{j,i}^{(s)}$ is more likely to shrink to zero, which implies that the discrepancy between $\eta _{j,i}^{*(s+1)}$ and $\eta _{j,i-1}^{*(s+1)}$ also shrinks to zero. The choice of $\rho$ determines the smoothness of the objective function, and it is left to the modeler. In our case, it has been set to 5 after trial and error. We acknowledge that we do not have a statistical approach to selecting the optimal $\rho$ (neither did the original authors of the ADMM paper) and leave this as future work. The real problem with Algorithm 0 is that the solution path is still not ideal, because the coefficients within the same group do not merge completely, which gives rise to the need for the more efficient algorithm proposed in the main paper.

Appendix B: Derivation of the gradient and Hessian

For fast computation of the ADMM algorithm, an analytic expression for the gradient and Hessian of the likelihood function is required. Here, we replicate Equation (18) and the ADMM objective function.

\begin{equation*} L(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j^{*}, \boldsymbol{\delta} _{j}, \boldsymbol{\lambda} _{j} |\mathcal {D}_{j,T}) = -\tilde{\ell }^*(\boldsymbol{\alpha} _{j}, \boldsymbol{\theta }_j^{*}|\mathcal {D}_{j,T}) + \sum _{i=2}^{I} p\left (|\delta _{ji} |;\, \tau _j\right ) + W_{j}, \end{equation*}

where

\begin{equation*} \tilde{\ell }^*(\boldsymbol{\alpha} _j, \boldsymbol{\theta }_j^{*}|\mathcal {D}_{j,T}) = \boldsymbol{N}_j^{\prime } \boldsymbol{X} \boldsymbol{\alpha} _j + \boldsymbol{N}_j^{\prime } \boldsymbol{U}_j \log (\boldsymbol{\theta} _j^{*}) - \boldsymbol{\nu} _j^{\prime } \boldsymbol{U}_j \boldsymbol{\theta} _j^{*}. \end{equation*}

To write the gradient down, one can first observe that $W_j$ can be written as a function of $\eta _{jk}^{*}:= \log \theta _{jk}^{*}$ as follows:

(33) \begin{align} W_j(\eta _{jk}^{*}) &= \begin{cases} \lambda _{j,k-1}^{*} ( \eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*}) + \frac{\rho }{2} ( \eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*})^2 + C_{jk}^{*} & \text{ if } k = I\\ \lambda _{j,k-1}^{*} ( \eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*}) + \lambda _{jk}^{*} ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*}) \\ \quad \quad + \frac{\rho }{2} ( \eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*})^2 + \frac{\rho }{2} ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*})^2 + C_{jk}^{*} & \text{ if } 1 \lt k \lt I\\ \lambda _{jk}^{*} ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*}) + \frac{\rho }{2} ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*})^2 + C_{jk}^{*} & \text{ if } k = 1 \end{cases} \end{align}

where $\partial C_{jk}^{*}/ \partial \eta _{jk}^{*}=0$ and it leads to

(34) \begin{align} \frac{\partial W_j}{\partial \eta _{jk}^{*}} &= \begin{cases} \rho \left (\eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*} + \frac{\lambda _{j,k-1}^{*}}{\rho } \right ) & \text{ if } k = I \\[9pt] \rho \left [ \left (\eta _{jk}^{*} - \eta _{j,k-1}^{*}-\delta _{j,k-1}^{*} + \frac{\lambda _{j,k-1}^{*}}{\rho } \right ) - \left ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*} + \frac{\lambda _{jk}^{*}}{\rho } \right ) \right ] & \text{ if } 1 \lt k \lt I \\[9pt] - \rho \left ( \eta _{j,k+1}^{*} - \eta _{jk}^{*} - \delta _{jk}^{*} + \frac{\lambda _{jk}^{*}}{\rho } \right ) & \text{ if } k = 1 \end{cases} \end{align}

and

(35) \begin{align} \frac{{\partial}^2 W_j}{\partial \eta _{jk}^{*}\partial \eta _{jk^{\prime}}^{*}} = \begin{cases} \rho \left [ \unicode{x1D7D9}_{\{k^{\prime}=k\}} - \unicode{x1D7D9}_{\{k^{\prime}=k-1\}} \right ] & \text{ if } k = I \\[6pt] \rho \left [ 2 \cdot \unicode{x1D7D9}_{\{k^{\prime}=k\}} - \unicode{x1D7D9}_{\{k^{\prime}=k-1\}} - \unicode{x1D7D9}_{\{k^{\prime}=k+1\}} \right ] & \text{ if } 1 \lt k \lt I \\[6pt] \rho \left [ \unicode{x1D7D9}_{\{k^{\prime}=k\}} - \unicode{x1D7D9}_{\{k^{\prime}=k+1\}} \right ] & \text{ if } k = 1 \end{cases} \end{align}
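The piecewise expressions (34) and (35) can be evaluated compactly by noting that, up to a constant, $W_j$ is a quadratic form in the first differences of $\boldsymbol{\eta}_j^*$. The sketch below, with variable names of our choosing, builds the first-difference matrix and returns the gradient and Hessian of $W_j$; it is an illustration under these assumptions rather than the authors' code.

```python
import numpy as np

def augmented_term_grad_hess(eta, delta, lam, rho):
    """Gradient (34) and Hessian (35) of W_j, written as a quadratic in the
    first differences of eta: W_j = (rho/2) * ||D eta - delta + lam/rho||^2 + const."""
    I = len(eta)
    D = np.diff(np.eye(I), axis=0)          # (I-1) x I first-difference matrix
    resid = D @ eta - delta + lam / rho     # eta_{k+1} - eta_k - delta_k + lam_k / rho
    grad = rho * (D.T @ resid)              # Equation (34)
    hess = rho * (D.T @ D)                  # Equation (35)
    return grad, hess

# Small example with I = 4 intercepts, delta and lambda set to zero
eta = np.array([0.10, 0.30, 0.35, 0.90])
grad, hess = augmented_term_grad_hess(eta, np.zeros(3), np.zeros(3), rho=5.0)
```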

It is also easy to observe that

(36) \begin{align} \frac{\partial \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j} &=\sum _{i=1}^I \sum _{t=1}^T (n_{jit}-\theta ^{*}_{ji} \nu _{jit})\boldsymbol{x}_{it}=\boldsymbol{N}_j^{\prime } \boldsymbol{X} - \boldsymbol{X}^{\prime} \text{diag}\left (\boldsymbol{\nu} _j^{\prime }\right ) \boldsymbol{U}_j \boldsymbol{\theta} _j^{*}, \end{align}
(37) \begin{align} \frac{\partial \tilde{\ell }^*}{\partial \eta _{jk}^{*}}&=\sum _{t=1}^T (n_{jkt} - \nu _{jkt} \theta _{jk}^{*} ), \ \frac{\partial \tilde{\ell }^*}{\partial \boldsymbol{\eta }_j^{*}} =\boldsymbol{N}_j^{\prime } \boldsymbol{U}_j -\boldsymbol{\nu }_j^{\prime } \boldsymbol{U}_j\text{diag}(\boldsymbol{\theta }_j^{*}), \end{align}
(38) \begin{align} \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j \partial \boldsymbol{\alpha} _j^{\prime}} &= -\sum _{i=1}^I \sum _{t=1}^T \theta _{ji}^{*} \nu _{jit} \boldsymbol{x}_{it}\boldsymbol{x}_{it}^{\prime} =-\boldsymbol{X}^{\prime} \text{diag}(\boldsymbol{\nu }_j^{\prime }) \cdot \text{diag}(\boldsymbol{U}_j\boldsymbol{\theta }_j^{*})\boldsymbol{X}, \end{align}
(39) \begin{align} \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j \partial \eta _{jk}^{*}} &=-\sum _{t=1}^T \nu _{jkt} \boldsymbol{x}_{kt} \theta _{jk}^{*}, \ \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j \partial \boldsymbol{\eta }_j^{*^{\prime}}} =-\boldsymbol{X}^{\prime}\text{diag}(\boldsymbol{\nu }_j^{\prime })\boldsymbol{U}_j \text{diag}(\boldsymbol{\theta }_j^{*}), \end{align}
(40) \begin{align} \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\eta }_j^{*} \partial \boldsymbol{\eta }_j^{*^{\prime}}} &=\text{diag}\left (-\boldsymbol{\nu }_j^{\prime } \boldsymbol{U}_j \text{diag}(\boldsymbol{\theta }_j^{*})\right ). \end{align}

Using these expressions for the optimization routine helps improve the convergence of the algorithm and reduces computation time significantly.
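For illustration, the following sketch evaluates (36)–(40) for a single coverage $j$, assuming the data are stored as a design matrix with one row per policyholder-year and an integer index recording which policyholder each row belongs to (this index plays the role of $\boldsymbol{U}_j$). All names and the data layout are ours, not the authors' implementation.

```python
import numpy as np

def loglik_grad_hess(alpha, eta, X, N, ids):
    """Gradient and Hessian blocks (36)-(40) for one coverage j.

    X   : (n_obs, p) design matrix with rows x_{it}
    N   : (n_obs,) claim counts n_{jit}
    ids : (n_obs,) integer policyholder index (plays the role of U_j)
    """
    theta = np.exp(eta)                    # theta_j* = exp(eta_j*)
    nu = np.exp(X @ alpha)                 # nu_{jit} = exp(x_{it}' alpha_j)
    mu = nu * theta[ids]                   # Poisson mean
    I = len(eta)

    grad_alpha = X.T @ (N - mu)                                   # Equation (36)
    grad_eta = np.bincount(ids, weights=N - mu, minlength=I)      # Equation (37)

    H_aa = -(X * mu[:, None]).T @ X                               # Equation (38)
    H_ea = np.zeros((I, X.shape[1]))
    np.add.at(H_ea, ids, -X * mu[:, None])                        # Equation (39), transposed
    H_ee = -np.diag(np.bincount(ids, weights=mu, minlength=I))    # Equation (40)
    return grad_alpha, grad_eta, H_aa, H_ea.T, H_ee
```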

Appendix C: Modified gradient and Hessian

With the modified algorithm (Algorithm 1 in the main paper), we simply use the fact that we have

(41) \begin{equation} \boldsymbol{\xi} _j^* = \boldsymbol{G}_j \boldsymbol{\eta} _j^*, \end{equation}

so that, using the chain rule, the gradient satisfies $\partial \tilde{\ell }^*/\partial \boldsymbol{\xi }_j^{*} = \boldsymbol{G}_j \left (\partial \tilde{\ell }^*/\partial \boldsymbol{\eta }_j^{*}\right )$, and the Hessian blocks in terms of $\boldsymbol{\xi} _j^*$ become:

(42) \begin{align} \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j \partial \boldsymbol{\xi }_j^{*\prime }} &=\left (\frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\alpha} _j \partial \boldsymbol{\eta }_j^{*\prime }}\right ) \boldsymbol{G}_j^\prime, \end{align}
(43) \begin{align} \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\xi }_j^{*} \partial \boldsymbol{\xi }_j^{*^{\prime}}} &= \boldsymbol{G}_j \left ( \frac{\partial ^2 \tilde{\ell }^*}{\partial \boldsymbol{\eta }_j^{*} \partial \boldsymbol{\eta }_j^{*^{\prime}}} \right ) \boldsymbol{G}_j^\prime \end{align}

Note that the rows of $\boldsymbol{G}_j$ are indicators of the elements within a specific group of coefficients, and the length of $\boldsymbol{\xi} _j^*$ is equal to the number of unique coefficients left inside the model.
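A minimal sketch of this reduction step is given below: it builds an indicator matrix $\boldsymbol{G}_j$ from integer group labels and applies the chain rule in (42)–(43) to the gradient and Hessian computed on the $\boldsymbol{\eta}_j^*$ scale. Function and variable names are ours.

```python
import numpy as np

def group_matrix(labels):
    """Indicator matrix G_j: row g has ones at the coefficients fused into group g."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    return (groups[:, None] == labels[None, :]).astype(float)

def reduce_grad_hess(grad_eta, H_ae, H_ee, G):
    """Chain rule of Appendix C, mapping derivatives in eta_j* to derivatives in xi_j*."""
    grad_xi = G @ grad_eta          # gradient with respect to xi_j*
    H_axi = H_ae @ G.T              # Equation (42)
    H_xixi = G @ H_ee @ G.T         # Equation (43)
    return grad_xi, H_axi, H_xixi

# Example: six intercepts currently fused into three groups
G = group_matrix([0, 0, 1, 1, 1, 2])    # shape (3, 6)
```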

Appendix D: Dependence cases

In the main text of the paper, three different dependence cases are considered. These three cases are explained in full detail here.

Case 1. No interdependence among the lines

In the first case, we do not assume any type of interdependence among $(N_{1it},\ldots, N_{Jit})$ for each of $i=1,\ldots, I$ and $t=1,\ldots, T$, and we estimate $\boldsymbol{\alpha} _j$ and $\theta _{ji}$ using only the observations of claims and policy characteristics from the $j$th line of business. In other words, we use the following longitudinal dataset for the estimation of $\boldsymbol{\alpha} _j$ and $\theta _{ji}$, and no inter-line dependence is considered:

(44) \begin{equation} \mathcal{D}_{j,T} = \left \{\left ( n_{jit}, \textbf{x}_{it} \right ) \bigg | \ i=1,\ldots, I, \, t=1,\ldots, T \right \}. \end{equation}

In this case, we do not consider the effect of $\theta _i$, so it may be thought of as being fixed at $\theta _i = 1$. There are two subcases to the no-interdependence case:

Case 1a. When $\theta _{ji}$ is random

We assume $\theta _{ji}$ follows a gamma distribution with unit mean. More specifically, $\theta _{ji} \sim \mathcal{G}(r_j, 1/r_j)$ so that $\mathbb{E}[\theta _{ji}]=1$ and $\text{Var}[\theta _{ji}]=1/r_j$. Due to the conjugacy of Poisson and gamma distributions (see Jeong & Valdez, 2020), it turns out that

(45) \begin{equation} \mathbb{E}[\theta _{ji}|\mathcal{D}_{j,T}] = \frac{r_j + \sum _{t=1}^T n_{jit}}{r_j+ \sum _{t=1}^T \nu _{jit}}, \end{equation}

which captures the posterior expectation of the unobserved heterogeneity of policyholder $i$ on coverage $j$. Note that $r_j$ can be determined in two ways. First, $r_j$ can be interpreted as the degree of dispersion in $\theta _{ji}$, the unobserved heterogeneity in the claim frequency. In this regard, one can assign a value to $r_j$ so that the range (for example, the 95% highest posterior density interval) of $\theta _{ji}$ lies within a certain level, as discussed on page 6 of Jeong (2020).

Second, one can use a method of moments estimator to retrieve the value of $r_j$. If we assume that $\theta _i=1$ and $\theta _{ji} \sim \mathcal{G}(r_j, 1/r_j)$, where the $\theta _{ji}$ are independent of each other for all $i=1,\ldots, I$ and fixed $j$, then $\mathbb{E}\left [\sum _{t=1}^T N_{jit}\right ]=\sum _{t=1}^T\nu _{jit}$ and $\text{Var}\left [\sum _{t=1}^T N_{jit}\right ]=\sum _{t=1}^T\nu _{jit} + \frac{(\sum _{t=1}^T\nu _{jit})^2}{r_j}$, so that $\mathbb{E}[ z_{ji}]=1/r_j$ and $z_{ji} \ (i=1,\ldots, I)$ are independent observations, where

\begin{equation*} z_{ji} = \frac {(\sum _{t=1}^T N_{jit}-\sum _{t=1}^T\nu _{jit})^2-\sum _{t=1}^T\nu _{jit}}{(\sum _{t=1}^T\nu _{jit})^2}, \end{equation*}

which leads to the following estimate of $r_j$:

\begin{equation*} \hat {r}_j =\frac {I}{\sum _{i=1}^I z_{ji}}. \end{equation*}

Once $r_j$ is specified in either way, the predicted claim frequency can be written as follows:

(46) \begin{equation} \begin{aligned} \hat{N}_{ji,T+1}&=\mathbb{E}[N_{ji,T+1}|\mathcal{D}_{j,T}] = \hat{\nu }_{ji,T+1} \mathbb{E}[\theta _{ji}|\mathcal{D}_{j,T}] \cdot \mathbb{E}[\theta _i|\mathcal{D}_{j,T}] \\ &= \exp\!(\boldsymbol{x}_{i,T+1 }^\prime \hat{\boldsymbol{\alpha} }_j ) \cdot \frac{r_j + \sum _{t=1}^T n_{jit}}{r_j+ \sum _{t=1}^T \hat{\nu }_{jit}}, \end{aligned} \end{equation}

where we have assumed $\theta _i = 1$ with probability 1, which is essentially an assumption that the claims experience from each line is independent of the others.
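The following is a small sketch of Case 1a under the assumptions above: it computes the method-of-moments estimate $\hat{r}_j$ from the $z_{ji}$ statistics, the credibility factor in (45), and then forms the prediction in (46). The array names and the illustrative numbers are ours.

```python
import numpy as np

def mom_rj(N_sum, nu_sum):
    """Method-of-moments estimate hat{r}_j = I / sum_i z_{ji} from Case 1a."""
    z = ((N_sum - nu_sum) ** 2 - nu_sum) / nu_sum ** 2
    return len(z) / np.sum(z)

def credibility_factor(N_sum, nu_sum, r_j):
    """Posterior mean of theta_{ji} in Equation (45)."""
    return (r_j + N_sum) / (r_j + nu_sum)

# Hypothetical aggregated counts and fitted exposures for three policyholders
N_sum = np.array([6.0, 0.0, 10.0])      # sum_t n_{jit}
nu_sum = np.array([2.5, 1.0, 4.0])      # sum_t hat{nu}_{jit}
r_hat = mom_rj(N_sum, nu_sum)
nu_next = np.array([0.9, 0.4, 1.3])     # hypothetical hat{nu}_{ji,T+1} = exp(x' hat{alpha}_j)
N_pred = nu_next * credibility_factor(N_sum, nu_sum, r_hat)   # Equation (46)
```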

Note that one can use distributions other than the gamma distribution to describe the behavior of the random $\theta _{ji}$, as long as $\mathbb{E}[\theta _{ji}]=1$ to ensure identifiability. For more details on the use of distributions other than the gamma for random $\theta _{ji}$ when $N_{jit}|\theta _{ji}$ follows a Poisson distribution, please see Tzougas (2020) and Tzougas and Makariou (2022).

Case 1b. When $\theta _{ji}$ is nonrandom

In this case, there is no assumed structure in $\theta _{ji}$, so we may consider finding the point estimates of $\boldsymbol{\theta }_j=(\theta _{j1},\ldots,\theta _{jI})$ via the proposed ADMM algorithm. For identifiability, $\log \theta _{j1}$ was set to 0 (that is, $\theta _{j1}=1$) for all $j$.

After the estimates of $\boldsymbol{\alpha} _j$ and ${\theta }_{ji}$ are obtained, the predicted claim frequency can be written as follows:

(47) \begin{equation} \begin{aligned} \hat{N}_{ji,T+1}&= \exp\!(\boldsymbol{x}^\prime _{i,T+1 }\hat{\boldsymbol{\alpha} }_j ) \cdot \hat{\theta }_{ji}. \end{aligned} \end{equation}

Case 2. Perfect interdependence among the lines

In the second case, we assume there is only a common unobserved heterogeneity factor $\theta _i$ for each policyholder, so that $\theta _{1i}=\cdots = \theta _{Ji}=1$. We then have

(48) \begin{equation} N_{jit}|\boldsymbol{x}_{it}, \theta _i \sim \mathcal{P}\left ( \,{ \nu _{jit}}\theta _i\, \right ), \end{equation}

for all $i, t$ , and $j$ and we estimate $(\boldsymbol{\alpha} _1, \ldots,\boldsymbol{\alpha} _J, \boldsymbol{\theta }_i)$ simultaneously using the multiline dataset.

Case 2a. When $\theta _i$ is random

We assume that $\theta _i \sim \mathcal{G}(r, 1/r)$ so that $\mathbb{E}[\theta _i]=1$ and $\text{Var}[\theta _i]=1/r$. Due to the conjugacy of Poisson and gamma distributions (see Jeong & Dey, 2023), it turns out that

(49) \begin{equation} \mathbb{E}[\theta _i|\mathcal{D}_{T}] = \frac{r + \sum _{j=1}^J \sum _{t=1}^T n_{jit}}{r+\sum _{j=1}^J\sum _{t=1}^T \nu _{jit}}, \end{equation}

where $\mathcal{D}_T = \left \{ \left (n_{1it}, \, \ldots, n_{jit}, \ldots, n_{Jit}, \textbf{x}_{it} \right ) \bigg | \, i=1, \ldots, I, \, t=1, \ldots, T \right \}$. This captures the pooled posterior expectation of the common unobserved heterogeneity of policyholder $i$. Therefore, the predicted claim frequency can be written as follows:

(50) \begin{equation} \begin{aligned} \hat{N}_{ji,T+1} &=\mathbb{E}[N_{ji,T+1}|\mathcal{D}_{T}] = \hat{\nu }_{ji,T+1}\mathbb{E}[\theta _i|\mathcal{D}_{T}] \\ &= \exp\!(\boldsymbol{x}^\prime _{i,T+1 }\hat{\boldsymbol{\alpha} }_j ) \cdot \frac{r + \sum _{j=1}^J\sum _{t=1}^T n_{jit}}{r+\sum _{j=1}^J \sum _{t=1}^T \nu _{jit}}. \end{aligned} \end{equation}
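A minimal sketch of the pooled credibility factor in (49) is given below, assuming the summed counts and fitted exposures are stored as $J \times I$ arrays; the names, shapes, and illustrative numbers are ours.

```python
import numpy as np

def pooled_credibility(N_sums, nu_sums, r):
    """Pooled posterior mean of theta_i in Equation (49), pooling over the J lines."""
    return (r + N_sums.sum(axis=0)) / (r + nu_sums.sum(axis=0))

# Hypothetical J = 2 lines and I = 3 policyholders
N_sums = np.array([[6.0, 0.0, 10.0],
                   [2.0, 1.0, 5.0]])    # sum_t n_{jit}, one row per line j
nu_sums = np.array([[2.5, 1.0, 4.0],
                    [1.5, 0.8, 3.0]])   # sum_t nu_{jit}
theta_common = pooled_credibility(N_sums, nu_sums, r=1.0)   # one factor per policyholder
```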

Case 2b. When $\theta _i$ is nonrandom

In this case, there is no assumed distributional structure in $\theta _i$, so we find the point estimates of $\boldsymbol{\theta }=(\theta _1,\ldots,\theta _I)$ by maximizing the following penalized likelihood:

\begin{equation*} \begin {aligned} \tilde{\ell }_p(\boldsymbol{\alpha }, \boldsymbol{\theta }|\mathcal {D}_{T}) &= \sum _{j=1}^J\sum _{i=1}^I \sum _{t=1}^T \left ( n_{jit}( \boldsymbol{x}^\prime _{it}\boldsymbol{\alpha} _j+\log (\theta _i^*))-\nu _{jit}\theta _{i}^*- \log (n_{jit}!)\right ) \\ &\quad- \sum _{i =2}^I p\left (|\log \theta _i^*-\log \theta _{i-1}^*|;\, \tau \right ). \end {aligned} \end{equation*}

Since the only difference between the above penalized likelihood and (3) in the main paper is that $\theta _{ji}$ is replaced with $\theta _i$, one can use the proposed ADMM algorithm with a modest modification to estimate $\boldsymbol{\alpha }$ and $\boldsymbol{\theta }$. Note that (48) is already of full rank as long as $J \geq 2$ for each $i$, which is easily satisfied in our simulation study and empirical analysis, so there is no identification issue either. In that case, the predicted claim frequency can be written as follows:

\begin{equation*} \begin {aligned} \hat {N}_{ji,T+1}&= \exp\!(\boldsymbol{x}^\prime _{i,T+1 }\hat {\boldsymbol{\alpha} }_j ) \cdot \hat {\theta }_i. \end {aligned} \end{equation*}
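For illustration, the sketch below evaluates a version of this objective, with an L1 penalty on adjacent differences of $\log\theta_i$ standing in for the SCAD penalty used in the paper, and with a data layout (and all names) of our own choosing.

```python
import numpy as np
from scipy.special import gammaln

def penalized_loglik_case2b(alphas, log_theta, X, counts, ids, tau):
    """Case 2b objective: multiline Poisson log-likelihood with a common factor
    theta_i per policyholder, minus a penalty on adjacent differences of log theta_i.
    An L1 penalty stands in for the SCAD penalty used in the paper.

    alphas : (J, p) regression coefficients, one row per line
    X      : (n_obs, p) design matrix with rows x_{it}
    counts : (J, n_obs) claim counts n_{jit}, aligned with the rows of X
    ids    : (n_obs,) integer policyholder index i of each row
    """
    theta_i = np.exp(log_theta)
    ll = 0.0
    for j in range(alphas.shape[0]):
        mu = np.exp(X @ alphas[j]) * theta_i[ids]          # nu_{jit} * theta_i
        ll += np.sum(counts[j] * np.log(mu) - mu - gammaln(counts[j] + 1))
    pen = tau * np.sum(np.abs(np.diff(log_theta)))         # sum_i p(|log theta_i - log theta_{i-1}|; tau)
    return ll - pen
```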

Case 3. Flexible interdependence

In the third case, we have

(51) \begin{equation} N_{jit}|\boldsymbol{x}_{it}, \boldsymbol{\theta },\theta _i \sim \mathcal{P}\left ( \, \nu _{jit}\theta _{ji} \theta _i \, \right ), \end{equation}

where $\theta _i$ accounts for the common unobserved heterogeneity of policyholder $i$ shared by all coverages. This is an experimental two-step approach, in which we boost the risk classification model using the common effects $\theta _i$. The main idea of boosting the risk classification model stepwise is to add common unobserved heterogeneity on top of the observed covariates and the coverage-specific unobserved heterogeneity. We assume $\theta _{ji}$ is nonrandom but $\theta _i$ is random. The stepwise approach is the following:

  1. We find the nonparametric estimates of $\theta _{ji}$ for $j=1,\ldots,J$ and $i=1,\ldots, I$ , as in Case 1 (when $\theta _{ji}$ are nonrandom), while $\theta _i$ is not considered in this step. In other words, $\theta _i = 1$ with probability 1.

  2. After that, we assume that $\theta _i$ follows $\mathcal{G}(r, 1/r)$ , which leads to the following posterior distribution and Bayesian premium:

    (52) \begin{equation} \pi (\theta _i|\mathcal{D}_{j,T}, \boldsymbol{\theta }) \sim \mathcal{G}\left (\sum _{t=1}^T\sum _{j=1}^J N_{jit} \hat{\theta }_{ji}+r, \ \left [\sum _{t=1}^T\sum _{j=1}^J \nu _{jit} \hat{\theta }_{ji}+r\right ]^{-1} \right ), \end{equation}
    (53) \begin{equation} \mathbb{E}[N_{ji,T+1}|\mathcal{D}_{j,T}] =\nu _{ji,T+1} \hat{\theta }_{ji}\mathbb{E}[\theta _i | \mathcal{D}_{j,T}] = \nu _{ji,T+1} \hat{\theta }_{ji} \frac{\sum _{t=1}^T\sum _{j=1}^J N_{jit} \hat{\theta }_{ji}+r}{\sum _{t=1}^T\sum _{j=1}^J \nu _{jit} \hat{\theta }_{ji}+r}, \end{equation}

where $\hat{\theta }_{ji}$ are estimated in the first step. In this case, we assume that the coverage-specific unobserved heterogeneity factors for each policyholder are intertwined through the common factor $\theta _i$ .
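A small sketch of the second step is given below, assuming the first-step estimates $\hat{\theta}_{ji}$, the aggregated counts, and the fitted exposures are stored as $J \times I$ arrays; it computes the common-factor credibility weight in (52) and the boosted prediction in (53). The names and shapes are ours.

```python
import numpy as np

def boosted_premium(theta_ji_hat, nu_next, N_sums, nu_sums, r):
    """Second step of Case 3: Equations (52)-(53).

    theta_ji_hat : (J, I) first-step estimates hat{theta}_{ji}
    nu_next      : (J, I) fitted exposures nu_{ji, T+1}
    N_sums       : (J, I) aggregated counts sum_t n_{jit}
    nu_sums      : (J, I) aggregated fitted exposures sum_t nu_{jit}
    """
    shape = np.sum(N_sums * theta_ji_hat, axis=0) + r      # posterior shape in (52)
    rate = np.sum(nu_sums * theta_ji_hat, axis=0) + r      # posterior rate in (52)
    common_factor = shape / rate                           # E[theta_i | data, hat{theta}]
    return nu_next * theta_ji_hat * common_factor          # Equation (53)
```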

Appendix E: Tables

Table E.1. Computation time for simulated Coverage 1 (in sec)

Table E.2. Computation time for simulated Coverage 2 (in sec)

Table E.3. Computation time for simulated Coverage 3 (in sec)

Footnotes

* Both authors have contributed equally to this work.

1 Note that we do not estimate nonrandom $\theta _{ji}$ and nonrandom $\theta _{i}$ simultaneously in our applications. Therefore, we illustrate the proposed ADMM approach for the estimation of only nonrandom $\theta _{ji}$ in this section. One can apply essentially the same algorithm with a modest modification to estimate nonrandom $\theta _{i}$, as elaborated in Case 2b of Appendix D.

References

Akerlof, G. (1970). The market for lemons: Quality uncertainty and the market mechanism. Quarterly Journal of Economics, 84(3), 488–500.
Antonio, K. & Beirlant, J. (2007). Actuarial statistics with generalized linear mixed models. Insurance: Mathematics and Economics, 40(1), 58–76.
Boucher, J.-P. & Denuit, M. (2006). Fixed versus random effects in Poisson regression models for claim counts: A case study with motor insurance. ASTIN Bulletin: The Journal of the IAA, 36(1), 285–301.
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Chen, K., Huang, R., Chan, N. H. & Yau, C. Y. (2019). Subgroup analysis of zero-inflated Poisson regression model with applications to insurance data. Insurance: Mathematics and Economics, 86, 8–18.
Deviendt, S., Antonio, K., Reynkens, T. & Verbelen, R. (2021). Sparse regression with multi-type regularized feature modeling. Insurance: Mathematics and Economics, 96, 248–261.
Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Frees, E. W., Lee, G. & Yang, L. (2016). Multivariate frequency-severity regression models in insurance. Risks, 4(1), 4.
Frees, E. W. & Shi, P. (2017). Credibility prediction using collateral information. Variance, 11(1), 45–59.
Guo, L. (2003). Applying data mining techniques in property/casualty insurance. Forum of the Casualty Actuarial Society.
Jeong, H. (2020). Testing for random effects in compound risk models via Bregman divergence. ASTIN Bulletin: The Journal of the IAA, 50(3), 777–798.
Jeong, H. & Dey, D. (2023). Multi-peril frequency credibility premium via shared random effects. Variance, Forthcoming. http://dx.doi.org/10.2139/ssrn.3825435
Jeong, H. & Valdez, E. A. (2020). Predictive compound risk models with dependence. Insurance: Mathematics and Economics, 94, 182–195.
Ma, S. & Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517), 410–423.
Norberg, R. (1986). Hierarchical credibility: Analysis of a random effect linear model with nested classification. Scandinavian Actuarial Journal, 1986(3-4), 204–222.
Rothschild, M. & Stiglitz, J. (1976). Equilibrium in competitive insurance markets: An essay on the economics of imperfect information. Quarterly Journal of Economics, 90(4), 629–649.
Shi, P. & Lee, G. Y. (2022). Copula regression for compound distributions with endogenous covariates with applications in insurance deductible pricing. Journal of the American Statistical Association, 117(539), 1094–1109.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1), 91–108.
Tibshirani, R. & Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.
Tzougas, G. (2020). EM estimation for the Poisson-inverse gamma regression model with varying dispersion: An application to insurance ratemaking. Risks, 8(3), 97.
Tzougas, G. & Makariou, D. (2022). The multivariate Poisson-generalized inverse Gaussian claim count regression model with varying dispersion and shape parameters. Risk Management and Insurance Review, 25(4), 401–417.
Wilson, C. (1977). A model of insurance markets with incomplete information. Journal of Economic Theory, 97(2), 167–207.
Yeo, A., Smith, K., Willis, R. & Brooks, M. (2001). Clustering techniques for risk classification and prediction of claim costs in the automobile insurance industry. Intelligent Systems in Accounting, Finance, and Management, 10(1), 39–50.
Figures and tables

Figure 1. The SCAD and MCP penalty functions with different $\kappa$ values.
Algorithm 1: A modified ADMM for fast and stable parameter estimation.
Table 1. Out-of-sample validation results with simulated Coverage 1.
Table 2. Out-of-sample validation results with simulated Coverage 2.
Table 3. Out-of-sample validation results with simulated Coverage 3.
Table 4. Summary of coverage amounts by year in millions of dollars.
Table 5. Summary of claim frequencies by year.
Table 6. Number of observations by the entity type variable.
Figure 2. Relationship between coverage amounts and claim frequencies.
Table 7. Number of policies with positive coverage each year.
Figure 3. Histogram plots of the Cox–Snell residuals for a basic GLM.
Table 8. Spearman correlations with the hold out sample frequencies.
Table 9. Number of unique fixed-effect coefficients.
Table 10. Number of groups in the intercepts of the individual effects model.
Figure 4. Solution paths for the six coverage groups without entity type predictors.
Figure 5. Solution paths for the six coverage groups with entity type predictors.
Figure A.1. An illustration of the potential difficulties of direct optimization.
Algorithm 0: A naive ADMM algorithm for parameter estimation.
Table E.1. Computation time for simulated Coverage 1 (in sec).
Table E.2. Computation time for simulated Coverage 2 (in sec).
Table E.3. Computation time for simulated Coverage 3 (in sec).