Modelling Lorenz curves (LC) for stochastic dominance comparisons is central to the analysis of income distribution. It is conventional to use non-parametric statistics based on empirical income cumulants, which are used in the construction of LC and other related second-order dominance criteria. However, although attractive because of its simplicity and its apparent flexibility, this approach suffers from important drawbacks. While no assumptions need to be made regarding the data-generating process (income distribution model), the empirical LC can be very sensitive to data particularities, especially in the upper tail of the distribution. This robustness problem can lead in practice to 'wrong' interpretations of dominance orders. A possible remedy for this problem is the use of parametric or semi-parametric models for the data-generating process, together with robust estimators to obtain parameter estimates. In this paper, we focus on the robust estimation of semi-parametric LC and investigate issues such as the sensitivity of LC estimators to data contamination (Cowell and Victoria-Feser 2002), trimmed LC (Cowell and Victoria-Feser 2006) and inference for trimmed LC (Cowell and Victoria-Feser 2003), robust semi-parametric estimation of LC (Cowell and Victoria-Feser 2007) and the selection of optimal thresholds for (robust) semi-parametric modelling (Dupuis and Victoria-Feser 2006), and we use both simulations and real data to illustrate these points.
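As a rough illustration of the objects involved, the sketch below computes an empirical Lorenz curve and a naively trimmed variant that simply discards a fixed fraction of the upper tail. The function names, the trimming fraction and the toy data are illustrative assumptions; they do not correspond to the specific trimmed-LC or semi-parametric estimators studied in the papers cited above.

```python
import numpy as np

def empirical_lorenz(incomes, p_grid):
    """Empirical Lorenz curve: share of total income held by the
    poorest 100*p percent, evaluated on a grid of proportions p."""
    x = np.sort(np.asarray(incomes, dtype=float))
    cum_share = np.cumsum(x) / x.sum()               # cumulative income shares
    pop_share = np.arange(1, x.size + 1) / x.size    # cumulative population shares
    return np.interp(p_grid, pop_share, cum_share, left=0.0)

def trimmed_lorenz(incomes, p_grid, trim_upper=0.02):
    """Illustrative trimmed variant: drop the largest 100*trim_upper percent
    of observations before computing the curve, to limit the influence of
    extreme upper-tail values (not the authors' trimmed-LC estimator)."""
    x = np.sort(np.asarray(incomes, dtype=float))
    keep = int(np.floor((1.0 - trim_upper) * x.size))
    return empirical_lorenz(x[:keep], p_grid)

# Toy comparison: a single extreme income distorts the upper part of the curve.
rng = np.random.default_rng(0)
sample = np.append(rng.lognormal(mean=10, sigma=0.8, size=999), 5e7)
p = np.linspace(0.01, 0.99, 99)
print(empirical_lorenz(sample, p)[-1], trimmed_lorenz(sample, p)[-1])
```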
Estimation of the Pareto tail index from extreme order statistics is an important problem in many settings, such as income distributions (for inequality measurement), finance (for the evaluation of the value at risk) and insurance (determination of loss probabilities), among others. The upper tail of the distribution, in which the data are sparse, is typically fitted with a model such as the Pareto model, from which quantities such as probabilities associated with extreme events are deduced. The success of this procedure relies heavily not only on the choice of the estimator for the Pareto tail index but also on the procedure used to determine the number k of extreme order statistics that are used for the estimation. For the choice of k, most known procedures are based on the minimization of (an estimate of) the asymptotic mean squared error of the maximum likelihood (or Hill) estimator (MLE), which is the traditional choice of estimator for the Pareto tail index. In this paper we question the choice of the estimator and the resulting procedure for the determination of k, because we believe that the model chosen to describe the behaviour of the tail of the distribution can only be considered as approximate. If the data in the tail are not exactly but only approximately Pareto, then the MLE can be biased, i.e. it is not robust, and consequently the choice of k is also biased. We propose instead a weighted MLE for the Pareto tail index that downweights data “far” from the model, where “far” is measured by the size of standardized residuals obtained by viewing the Pareto model as a regression model. The data that are downweighted in this way do not systematically correspond to the largest quantiles. Based on this estimator and proceeding as in Ronchetti and Staudte (1994), we develop a robust prediction error criterion, called the RC-criterion, to choose k. In simulation studies, we compare our estimator and criterion to classical ones with exact and/or approximate Pareto data. Moreover, the analysis of real datasets shows that a robust procedure is needed for the selection of k, and not just for the estimation.
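For context, the classical baseline referred to above is the Hill (maximum likelihood) estimator computed from the k largest order statistics. The sketch below, with illustrative data and parameter values, shows how strongly the estimate can depend on k, which is precisely why the selection of k matters; it does not implement the weighted MLE or the RC-criterion proposed in the paper.

```python
import numpy as np

def hill_estimator(data, k):
    """Classical Hill estimator of the Pareto tail index alpha, based on the
    k largest order statistics (baseline estimator only)."""
    x = np.sort(np.asarray(data, dtype=float))[::-1]   # descending order
    logs = np.log(x[:k]) - np.log(x[k])                # log-spacings above the (k+1)-th largest
    return 1.0 / logs.mean()                           # alpha_hat = 1 / gamma_hat

# The estimate typically varies substantially with k, which is why the choice
# of k matters as much as the choice of estimator.
rng = np.random.default_rng(1)
sample = rng.pareto(3.0, size=2000) + 1.0              # exact Pareto data with alpha = 3
for k in (50, 100, 200, 400):
    print(k, round(hill_estimator(sample, k), 2))
```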
Mixed linear models are used to analyse data in many settings. These models have in most cases a multivariate normal formulation. The maximum likelihood estimator (MLE) or the residual MLE (REML) are usually chosen to estimate the parameters. However, these estimators are based on the strong assumption of exact multivariate normality. Welsh and Richardson (1997) have shown that they are not robust to small deviations from multivariate normality. In practice, this means for example that a small proportion of the data (even a single observation) can drive the values of the estimates on its own. Since the model is multivariate, we propose in this paper a high breakdown robust estimator for very general mixed linear models that include, for example, covariates. This robust estimator belongs to the class of S-estimators (Rousseeuw and Yohai 1984), from which we can derive the asymptotic properties needed for inference. We also use it as a diagnostic tool to detect outlying subjects. We discuss the advantages of this estimator compared to other robust estimators proposed previously and illustrate its performance with simulation studies and the analysis of four datasets.
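For reference, the generic S-estimation problem of Rousseeuw and Yohai (1984), written here for an unstructured multivariate normal model, is the following; the paper adapts this idea to the structured covariance matrices of mixed linear models.

```latex
\[
(\hat{\mu},\hat{\Sigma}) \;=\; \arg\min_{\mu,\,\Sigma}\ \det(\Sigma)
\quad\text{subject to}\quad
\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\sqrt{(y_i-\mu)^{\top}\Sigma^{-1}(y_i-\mu)}\right) \;=\; b_0 ,
\]
```

where rho is a bounded loss function (for example Tukey's biweight) and b_0 is taken as the expected value of rho under the normal model, so that the estimator is consistent there.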
Robust estimation of covariance matrices when some of the data at hand are missing is an important problem. It has been studied by Little and Smith (1987) and more recently by Cheng and Victoria-Feser (2002). The latter propose the use of high breakdown estimators and so-called hybrid algorithms (see e.g. Woodruff and Rocke 1994). In particular, the minimum volume ellipsoid of Rousseeuw (1984) is adapted to the case of missing data. To compute it, they use (a modified version of) the forward search algorithm (see e.g. Atkinson 1994). In this paper, we propose to use instead a modification of the C-step algorithm proposed by Rousseeuw and Van Driessen (1999), which is considerably faster. We also adapt the orthogonalized Gnanadesikan-Kettenring (OGK) estimator proposed by Maronna and Zamar (2002) to the case of missing data and use it as a starting point for an adapted S-estimator. Moreover, we conduct a simulation study to compare different robust estimators in terms of their efficiency and breakdown properties and use them to analyse real datasets.
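The C-step itself is simple, and a minimal complete-data sketch is given below; the subset sizes, the number of random starts and the toy search loop are illustrative assumptions, and the paper's actual contribution is a modification of this step for data with missing values.

```python
import numpy as np

def c_step(X, subset_idx):
    """One concentration step (Rousseeuw and Van Driessen 1999) for complete data:
    refit location/scatter on the current subset, then keep the h observations
    with the smallest Mahalanobis distances."""
    H = X[subset_idx]
    mu = H.mean(axis=0)
    S = np.cov(H, rowvar=False)
    d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)
    h = len(subset_idx)
    return np.argsort(d2)[:h]                          # new subset: h smallest distances

def crude_mcd(X, h, n_starts=50, n_steps=10, seed=0):
    """Very crude MCD-type search: random starting subsets plus iterated C-steps,
    keeping the subset whose covariance matrix has the smallest determinant."""
    rng = np.random.default_rng(seed)
    best_idx, best_det = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(X.shape[0], size=h, replace=False)
        for _ in range(n_steps):
            idx = c_step(X, idx)
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    return X[best_idx].mean(axis=0), np.cov(X[best_idx], rowvar=False)

# Toy demo: a clump of outliers barely moves the MCD-type location estimate.
rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)
X[:10] += 8.0
print(crude_mcd(X, h=120)[0])
```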
Generalized linear latent variable models (GLLVMs), as defined by Bartholomew and Knott, enable modelling of relationships between manifest and latent variables. They extend structural equation modelling techniques, which are powerful tools in the social sciences. However, because of the complexity of the log-likelihood function of a GLLVM, an approximation such as numerical integration must be used for inference. This can drastically limit the number of variables in the model and can lead to biased estimators. We propose a new estimator for the parameters of a GLLVM, based on a Laplace approximation to the likelihood function, which can be computed even for models with a large number of variables. The new estimator can be viewed as an M-estimator, leading to readily available asymptotic properties and correct inference. A simulation study shows its excellent finite sample properties, in particular when compared with a well-established approach such as LISREL. A real data example on the measurement of wealth for the computation of multidimensional inequality is analysed to highlight the importance of the methodology.
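The key device is the Laplace approximation, i.e. replacing the integral over the latent variables by a second-order expansion of the log-integrand around its mode. The sketch below shows the idea for a single latent variable and a single binary item; the item parameters and helper function names are illustrative assumptions, not the estimator as implemented for full GLLVMs.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

def laplace_approx_1d(log_integrand, bounds=(-10.0, 10.0)):
    """Laplace approximation of I = integral of exp(h(z)) dz for a scalar z:
    expand h around its mode z_hat to second order, giving
    I ~ exp(h(z_hat)) * sqrt(2*pi / (-h''(z_hat)))."""
    res = minimize_scalar(lambda z: -log_integrand(z), bounds=bounds, method='bounded')
    z_hat = res.x
    eps = 1e-4                                        # numerical second derivative at the mode
    h2 = (log_integrand(z_hat + eps) - 2 * log_integrand(z_hat)
          + log_integrand(z_hat - eps)) / eps**2
    return np.exp(log_integrand(z_hat)) * np.sqrt(2 * np.pi / (-h2))

# Toy check: integrand for a single Bernoulli item with a standard normal latent trait.
def h(z, y=1, loading=1.2, intercept=-0.3):
    eta = intercept + loading * z
    return y * eta - np.log1p(np.exp(eta)) - 0.5 * z**2 - 0.5 * np.log(2 * np.pi)

print(laplace_approx_1d(h))                                   # Laplace approximation
print(quad(lambda z: np.exp(h(z)), -np.inf, np.inf)[0])       # numerical integration
```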
Extreme value data with a high clump-at-zero occur in many domains. Moreover, it might happen that the observed data are truncated below a given threshold and/or are not reliable below that threshold because of the recording devices. This situation occurs in particular with radio audience data measured by personal meters that record environmental noise every minute, which is then matched to one of several radio programs. There are therefore genuine zeroes for respondents not listening to the radio, but also zeroes corresponding to real listeners for whom the match between the recorded noise and the radio program could not be achieved. Since radio audiences are important for radio broadcasters, for example to determine advertisement price policies, possibly according to the type of audience at different time points, it is essential to be able to explain not only the probability of listening to the radio but also the average time spent listening to the radio by means of the characteristics of the listeners. In this paper, we propose a generalized linear model for a zero-inflated truncated Pareto distribution (ZITPo) that we use to fit radio audience data. Because it is based on the generalized Pareto distribution, the ZITPo model has nice properties, such as invariance to the choice of the threshold, and a natural residual measure can be derived from it to assess the model fit to the data. From a general formulation of the most popular models for zero-inflated data, we derive our model by considering successively the truncated case, the generalized Pareto distribution and the inclusion of covariates to explain the non-zero proportion of listeners and their mean listening time. By means of simulations, we study the performance of the maximum likelihood estimator (and the derived inference) and use the model to fully analyse the audience data of a radio station in an area of Switzerland.
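To fix ideas, the sketch below writes down a generic zero-inflated likelihood in which positive observations exceed a threshold and the exceedances follow a generalized Pareto distribution; the parametrization, the omission of covariates and the toy data are assumptions for illustration and do not reproduce the exact ZITPo specification.

```python
import numpy as np
from scipy.stats import genpareto
from scipy.optimize import minimize

def neg_loglik(params, y, threshold=1.0):
    """Illustrative zero-inflated generalized Pareto log-likelihood: with probability
    1-p the observation is zero, with probability p the exceedance over the threshold
    follows a GPD(shape=xi, scale=sigma).  Covariates (omitted here) would enter
    through link functions for p and the mean exceedance."""
    logit_p, xi, log_sigma = params
    p = 1.0 / (1.0 + np.exp(-logit_p))
    sigma = np.exp(log_sigma)
    zero = (y == 0)
    ll = zero.sum() * np.log(1.0 - p) + (~zero).sum() * np.log(p)
    ll += genpareto.logpdf(y[~zero] - threshold, c=xi, scale=sigma).sum()
    return -ll

# Toy fit: 70% structural zeroes, exceedances drawn from a GPD above the threshold.
rng = np.random.default_rng(2)
n = 2000
is_listener = rng.random(n) < 0.3
y = np.where(is_listener, 1.0 + genpareto.rvs(c=0.2, scale=5.0, size=n, random_state=3), 0.0)
fit = minimize(neg_loglik, x0=np.array([0.0, 0.1, 1.0]), args=(y,), method='Nelder-Mead')
print(fit.x)
```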
Confirmatory factor analysis (CFA) is a data analysis procedure that is widely used in the social and behavioral sciences and, more generally, in applied sciences that deal with large quantities of data (variables). The underlying model links a set of latent factors, which are supposed to correspond to latent concepts, to a larger set of observed (manifest) variables through linear regression equations. With CFA, it is not necessary that all manifest variables are linked to all latent factors, which makes it particularly useful for the construction of so-called measurement scales, like depression scales in psychology. The classical estimation (and inference) procedures are based either on the maximum likelihood (ML) or generalized least squares (GLS) approaches. Unfortunately, these methods are known not to be robust to model misspecification, which in the case of factor analysis in general, and of CFA in particular, is misspecification with respect to the multivariate normal model. A natural robust estimator is obtained by first estimating the (mean and) covariance matrix of the manifest variables and then plugging this statistic into the ML or GLS estimating equations. This two-stage method, however, does not fully take into account the covariance structure implied by the CFA model. In this paper, we propose an S-estimator for the parameters of the CFA model that is computed directly from the data. We derive the estimating equations and an iterative computation procedure. The two estimators have different asymptotic properties, in that their asymptotic covariance matrices are not the same, and both depend on the model and the parameter values. We perform a simulation study to compare the finite sample properties of both estimators and find that the direct estimator we propose is more stable (smaller MSE) than the two-stage estimator.
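For reference, the covariance structure of the CFA model and the ML discrepancy function alluded to above are the following; the two-stage robust approach simply replaces the sample covariance matrix S by a robust scatter estimate in this discrepancy, whereas the proposed S-estimator is computed directly from the data.

```latex
\[
\Sigma(\theta) \;=\; \Lambda \Phi \Lambda^{\top} + \Psi ,
\qquad
F_{\mathrm{ML}}(\theta; S) \;=\; \log\lvert\Sigma(\theta)\rvert
 + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right)
 - \log\lvert S\rvert - p ,
\]
```

where Lambda contains the factor loadings, Phi the factor covariances, Psi the (diagonal) error variances and p is the number of manifest variables.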
To assess the quality of the fit in a multiple linear regression, the coefficient of determination, or R2, is a very simple tool, yet it is the one most used by practitioners of statistics. It is well known that the classical (least-squares) fit and coefficient of determination can be arbitrarily misleading in the presence of a single outlier. In many applied settings, the assumptions of normality of the errors and of the absence of outliers are difficult to verify. In these cases, robust procedures for estimation and inference in the linear regression model are available and provide an excellent alternative. In this paper we present a companion robust coefficient of determination that has several desirable properties not shared by others: it is robust to deviations from the specified regression model (such as the presence of outliers), it is efficient if the errors are perfectly normal, and we show that it is a consistent estimator of the population coefficient of determination. A simulation study and two real datasets support the appropriateness of this estimator, compared with the classical (least-squares) and existing robust R2 measures.
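The sketch below illustrates the general idea of a weighted coefficient of determination built from the weights of a robust regression fit; the Huber-type IRLS fit and the particular weighted R2 formula are simplifying assumptions for illustration and are not the exact estimator or the consistency correction studied in the paper.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weight function applied to standardized residuals."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / a)

def robust_fit_irls(x, y, n_iter=50):
    """Simple M-estimator of regression via iteratively reweighted least squares
    (Huber weights, MAD scale); illustration only."""
    Xd = np.column_stack([np.ones(len(y)), x])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    for _ in range(n_iter):
        res = y - Xd @ beta
        scale = 1.4826 * np.median(np.abs(res - np.median(res)))
        w = huber_weights(res / scale)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)[0]
    return beta, w

def weighted_r2(x, y, beta, w):
    """Weighted analogue of R2: ratio of explained to explained-plus-residual
    weighted sums of squares, so that observations downweighted by the robust
    fit barely contribute."""
    Xd = np.column_stack([np.ones(len(y)), x])
    yhat = Xd @ beta
    ybar = np.average(yhat, weights=w)
    ssr = np.sum(w * (yhat - ybar) ** 2)
    sse = np.sum(w * (y - yhat) ** 2)
    return ssr / (ssr + sse)

# One gross outlier wrecks the least-squares R2 but barely moves the weighted one.
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)
y[0] = 100.0
beta, w = robust_fit_irls(x, y)
print(round(weighted_r2(x, y, beta, w), 3))
```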