Large datasets are increasingly common in many research fields. In the linear regression context in particular, a huge number of potential covariates is often available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done with a forward selection procedure that comprises selecting the variable to enter, deciding whether to retain it or to stop the selection, and estimating the augmented model. Least squares combined with t-tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this paper, we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Since simply replacing the classical statistical criteria with robust ones is not computationally feasible, we develop simplified robust estimators, selection criteria and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t-statistic that we compare to a false discovery rate adjusted level. We carry out a simulation study to demonstrate the good performance of our approach. We also analyze two datasets and show that our method outperforms robust LARS and random forests.
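To make the procedure concrete, here is a minimal Python sketch, not the paper's implementation: a few-step weighted M-estimator combined with a forward search in which the entering variable's robust t-statistic is compared to a level that is adjusted as the search proceeds. The Huber weights, the weighted least-squares covariance approximation and the `alpha * step / p` adjustment are illustrative stand-ins for the paper's simplified criteria.

```python
import numpy as np
from scipy import stats

def huber_weights(r, scale, c=1.345):
    # Huber weights: 1 for small standardized residuals,
    # proportionally downweighted for large ones.
    u = np.abs(r) / max(scale, 1e-12)
    return np.minimum(1.0, c / np.maximum(u, 1e-12))

def weighted_m_step(X, y, beta, steps=2):
    # One-step (or few-step) weighted M-estimator: reweighted least
    # squares started from an initial fit; extra steps reduce the bias
    # that arises when covariates are not orthogonal.
    for _ in range(steps):
        r = y - X @ beta
        s = stats.median_abs_deviation(r, scale="normal")
        w = huber_weights(r, s)
        Xw = X * np.sqrt(w)[:, None]
        beta = np.linalg.solve(Xw.T @ Xw, Xw.T @ (np.sqrt(w) * y))
    return beta, w, s

def robust_t(X, beta, w, s, j):
    # Simplified robust t-statistic from the weighted least-squares
    # covariance approximation.
    Xw = X * np.sqrt(w)[:, None]
    cov = s**2 * np.linalg.inv(Xw.T @ Xw)
    return beta[j] / np.sqrt(cov[j, j])

def robust_forward_selection(X, y, alpha=0.05):
    n, p = X.shape
    active, candidates, step = [], list(range(p)), 0
    ones = np.ones((n, 1))
    while candidates:
        step += 1
        best = None
        for j in candidates:
            Xa = np.hstack([ones, X[:, active + [j]]])
            b0 = np.linalg.lstsq(Xa, y, rcond=None)[0]  # initial fit
            beta, w, s = weighted_m_step(Xa, y, b0)
            t = robust_t(Xa, beta, w, s, -1)            # entering coefficient
            if best is None or abs(t) > abs(best[1]):
                best = (j, t)
        j, t = best
        level = alpha * step / p  # hypothetical FDR-style adjusted level
        if 2 * stats.t.sf(abs(t), n - len(active) - 2) > level:
            break                 # stop: entering variable not significant
        active.append(j)
        candidates.remove(j)
    return active
```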
Robust automatic selection techniques for the smoothing parameter of a smoothing spline are introduced. They are based on a robust predictive error criterion and can be viewed as robust versions of Cp and cross-validation. They lead to smoothing splines which are stable and reliable in terms of mean squared error over a large spectrum of model distributions.
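As a rough illustration of the cross-validation variant (not the proposed criterion itself), the sketch below selects the smoothing factor of a scipy `UnivariateSpline` by leave-one-out prediction error measured with Huber's rho in place of the squared loss; the grid, the tuning constant and the MAD-based scale are assumptions made for the example.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def huber_rho(r, c=1.345):
    # Huber's rho: quadratic for small residuals, linear for large ones,
    # so a few outliers cannot dominate the criterion.
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def robust_cv_score(x, y, s, c=1.345):
    # Leave-one-out prediction error with a robust loss: a robust
    # analogue of classical cross-validation for the smoothing factor s.
    n = len(x)
    resid = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        fit = UnivariateSpline(x[mask], y[mask], s=s)
        resid[i] = y[i] - fit(x[i])
    scale = np.median(np.abs(resid)) / 0.6745  # robust scale estimate
    return np.mean(huber_rho(resid / scale, c))

# Hypothetical usage: pick the smoothing factor minimizing the robust score.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(0, 0.2, 80)
y[::15] += 4.0                                 # inject a few outliers
grid = np.linspace(1, 20, 20)
best_s = grid[np.argmin([robust_cv_score(x, y, s) for s in grid])]
```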
By starting from a natural class of robust estimators for generalized linear models based on the notion of quasi-likelihood, we define robust deviances that can be used for stepwise model selection as in the classical framework. We derive the asymptotic distribution of tests based on robust deviances, and we investigate the stability of their asymptotic level under contamination. The binomial and Poisson models are treated in detail. Two applications to real data and a sensitivity analysis show that the inference obtained by means of the new techniques is more reliable than that obtained by classical estimation and testing procedures.
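The following sketch conveys the structure of such a test in Python; it is a simplification, not the paper's procedure: classical Poisson GLM fits from statsmodels stand in for the robust quasi-likelihood estimators, the "robust deviance" is Huber's rho applied to Pearson residuals, and the drop in deviance is referred to a chi-square distribution, whereas the paper derives the exact asymptotic distribution.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

def robust_quasi_deviance(y, mu, c=1.345):
    # Simplified stand-in for a robust quasi-deviance: Huber's rho
    # applied to Pearson residuals of a Poisson model (variance = mean).
    r = (y - mu) / np.sqrt(mu)
    a = np.abs(r)
    rho = np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)
    return 2.0 * rho.sum()

def robust_deviance_test(y, X_small, X_big):
    # Compare nested Poisson fits through the drop in (simplified)
    # robust deviance, as one would compare classical deviances in a
    # stepwise search. The chi-square reference is an approximation.
    fit_s = sm.GLM(y, X_small, family=sm.families.Poisson()).fit()
    fit_b = sm.GLM(y, X_big, family=sm.families.Poisson()).fit()
    stat = (robust_quasi_deviance(y, fit_s.fittedvalues)
            - robust_quasi_deviance(y, fit_b.fittedvalues))
    df = X_big.shape[1] - X_small.shape[1]
    return stat, stats.chi2.sf(stat, df)
```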
Background: Attitudes toward body shape and food play a role in the development and maintenance of dysfunctional eating behaviors. Nevertheless, they are rarely investigated together. Therefore, this study aimed to explore the interrelationships between implicitly assessed attitudes toward body shape and food, and to investigate whether interindividual differences in problematic and nonproblematic eating behaviors (i.e., the flexible versus rigid cognitive-control dimensions of restraint, and disinhibition) moderate these associations. Methods: One hundred and twenty-one young women from the community completed two adapted versions of the Affect Misattribution Procedure to implicitly assess attitudes toward body shape (i.e., thin and overweight bodies) and food (i.e., “permitted” and “forbidden” foods), as well as the Three-Factor Eating Questionnaire to evaluate restraint and disinhibition. Results: The results revealed that an implicit preference for thinness was positively associated with a positive attitude toward permitted (i.e., low-calorie) foods. This congruence between implicitly assessed attitudes toward body shape and food was significant at average and high levels of flexible control (i.e., the functional component of eating). Moreover, an implicit preference for thinness was also positively associated with a positive attitude toward forbidden (i.e., high-calorie) foods. This discordance between implicitly assessed attitudes was significant at average and high levels of rigid control and disinhibition (i.e., the dysfunctional components of eating). Conclusions: These findings shed new light on the influence of congruent or discordant implicitly assessed attitudes toward body shape and food on normal and problematic eating behaviors; clinical implications are discussed.
This thesis is divided into two parts. The first presents a new criterion for model selection which is shown to be particularly well suited to "sparse" settings, which we believe are common in many research fields. Our selection procedure is developed for linear regression models, smoothing splines, autoregressive models and mixed linear models. These developments are then applied in biostatistics. The second part presents a new estimation method for the parameters of a time series model. The proposed method offers an alternative to maximum likelihood estimation that is straightforward to implement and is often the only feasible estimation method for complex models. We derive the asymptotic properties of the proposed estimator for inference and perform an extensive simulation study to compare it with existing methods. Finally, we apply our method in engineering to calibrate inertial sensors and demonstrate that it represents a considerable improvement over benchmark methods.
The problem of non-random sample selectivity often occurs in practice in many different fields. In the presence of sample selection, the data appear in the sample according to some selection rule. In these cases, the standard tools designed for complete samples, e.g. ordinary least squares, produce biased results, and hence methods correcting this bias are needed. In his seminal work, Heckman proposed two estimators to solve this problem. These estimators became the backbone of the standard statistical analysis of sample selection models. However, they are based on the assumption of normality and are very sensitive to small deviations from the distributional assumptions, which are often not satisfied in practice. In this thesis we develop a general framework to study the robustness properties of estimators and tests in sample selection models. We use an infinitesimal approach, which allows us to explore the robustness issues and to construct robust estimators and tests.
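For reference, here is a minimal sketch of Heckman's classical two-step estimator, the non-robust benchmark whose sensitivity motivates the thesis. The inputs are hypothetical: an outcome vector `y`, an outcome design matrix `X`, a selection design matrix `Z` (which should include an intercept, and ideally an exclusion restriction), and a boolean `observed` indicator.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

def heckman_two_step(y, X, Z, observed):
    # Step 1: probit for the selection equation (is the outcome observed?).
    probit = sm.Probit(observed.astype(float), Z).fit(disp=0)
    zg = Z @ probit.params
    # Inverse Mills ratio: the bias-correction term implied by normality.
    imr = norm.pdf(zg) / norm.cdf(zg)
    # Step 2: OLS of the outcome on X plus the IMR, using only the
    # observed (selected) subsample.
    X_aug = np.column_stack([X[observed], imr[observed]])
    ols = sm.OLS(y[observed], X_aug).fit()
    return probit, ols
```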
The sophisticated and automated means of data collection used by an increasing number of institutions and companies lead to extremely large datasets. Subset selection in regression is essential when a huge number of covariates can potentially explain a response variable of interest. The recent statistical literature has seen an emergence of new selection methods that provide some type of compromise between implementation (computational speed) and statistical optimality (e.g. prediction error minimization). Global methods such as Mallows’ Cp have been supplanted by sequential methods such as stepwise regression. More recently, streamwise regression, which is faster still, has emerged. A recently proposed streamwise regression approach based on the variance inflation factor (VIF) is promising, but its least-squares-based implementation makes it susceptible to the outliers inevitable in such large datasets. This lack of robustness can lead to poor and suboptimal feature selection. In our case, we seek to predict an individual’s educational attainment using economic and demographic variables. We show that the classical VIF approach performs this task poorly and that a robust procedure is necessary for policy makers. This article proposes a robust VIF regression, based on fast robust estimators, that inherits all the good properties of classical VIF in the absence of outliers but continues to perform well in their presence, where the classical approach fails.
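A schematic, non-robust version of streamwise VIF regression is sketched below to fix ideas; the subsample size, the alpha-investing rules and the normal reference are illustrative choices, and a comment marks the least-squares step that the robust procedure would replace with a fast robust fit.

```python
import numpy as np
from scipy import stats

def vif_streamwise(X, y, alpha0=0.05, w0=0.5, m=200, seed=0):
    # Schematic streamwise (VIF) regression: each candidate is tested
    # once, in order of arrival, with a cheap marginal t-statistic
    # corrected by a variance inflation factor estimated on a small
    # random subsample; an alpha-investing wealth governs acceptance.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    active, wealth = [], w0
    r = y - y.mean()                         # residuals of the null model
    for j in range(p):
        if wealth <= 0:
            break
        xj = X[:, j] - X[:, j].mean()
        gamma = (xj @ r) / (xj @ xj)         # marginal slope on residuals
        sigma = np.std(r - gamma * xj, ddof=1)
        # VIF from a subsample: how well do active variables explain xj?
        idx = rng.choice(n, size=min(m, n), replace=False)
        if active:
            Xa = np.column_stack([np.ones(len(idx)), X[np.ix_(idx, active)]])
            coef, *_ = np.linalg.lstsq(Xa, xj[idx], rcond=None)
            sse = np.sum((xj[idx] - Xa @ coef) ** 2)
            sst = np.sum((xj[idx] - xj[idx].mean()) ** 2)
            vif = sst / max(sse, 1e-12)      # 1 / (1 - R^2)
        else:
            vif = 1.0
        t = gamma * np.sqrt(xj @ xj) / (sigma * np.sqrt(vif))
        alpha = wealth / (2 * (j + 1))       # alpha-investing spending rule
        if 2 * stats.norm.sf(abs(t)) < alpha:
            active.append(j)
            wealth += alpha0                 # reward a discovery
            # Refit on the enlarged model; the robust variant would use
            # a fast robust estimator here instead of least squares.
            Xa = np.column_stack([np.ones(n), X[:, active]])
            beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
            r = y - Xa @ beta
        else:
            wealth -= alpha / (1 - alpha)    # pay for a failed test
    return active
```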
In this thesis, we develop models for zero-inflated truncated heavy-tailed dependent data in order to fit radio audience data in Switzerland. These models are able to explain, by means of covariates, both the probability of observing a positive outcome and the mean of the positive outcomes, which correspond respectively to the audience indicators of reach and time spent listening. Estimation methods, model checking, model assumptions and model properties are discussed. The fits of the audience data for different Swiss radio stations in their broadcasting areas are finally compared in order to select the process that best describes dependent daily listening times.
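To illustrate the two-part structure of such models, here is a minimal hurdle-type sketch in Python; the Gamma GLM with log link is only a convenient stand-in for the zero-inflated truncated heavy-tailed specification developed in the thesis, and it ignores the dependence the thesis models.

```python
import numpy as np
import statsmodels.api as sm

def fit_two_part(y, X):
    # Part 1: logistic model for the probability of a positive outcome
    # (the "reach" indicator).
    positive = y > 0
    reach = sm.GLM(positive.astype(float), X,
                   family=sm.families.Binomial()).fit()
    # Part 2: model for the mean of the positive outcomes ("time spent
    # listening"); a Gamma GLM with log link is used here as a simple
    # stand-in for the truncated heavy-tailed component.
    tsl = sm.GLM(y[positive], X[positive],
                 family=sm.families.Gamma(sm.families.links.Log())).fit()
    return reach, tsl

def expected_listening(reach, tsl, X_new):
    # E[Y] = P(Y > 0) * E[Y | Y > 0] under the two-part decomposition.
    return reach.predict(X_new) * tsl.predict(X_new)
```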
This thesis contributes to the development of measures of model selection and model adequacy for mixed-effects models. In the context of linear mixed-effects models, we review and compare in a simulation study a large set of measures proposed to evaluate model adequacy and/or to perform model selection. In the more general context of generalized linear mixed-effects models, we develop a measure, which we name PRDpen, that serves both purposes. As a measure of model adequacy, our proposal gives information about the model at hand, as it measures the proportional reduction in deviance achieved by the model of interest in comparison with a prespecified null model. Furthermore, as a measure of model selection, PRDpen is able to choose the model that best fits the data among a set of alternatives, similarly to information criteria.
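The following sketch shows the general shape of such a measure; the fixed per-parameter penalty and the plain GLM fits are hypothetical stand-ins, since the thesis defines PRDpen for generalized linear mixed-effects models with its own penalization.

```python
import numpy as np
import statsmodels.api as sm

def prd_pen(fit, fit_null, penalty=0.01):
    # Schematic PRDpen: proportional reduction in deviance of the model
    # of interest relative to a prespecified null model, minus a
    # complexity penalty. The fixed per-parameter cost used here is a
    # hypothetical stand-in for the thesis's penalization.
    prd = 1.0 - fit.deviance / fit_null.deviance
    extra_params = fit.df_model - fit_null.df_model
    return prd - penalty * extra_params

# Hypothetical usage with plain GLMs (the thesis targets mixed models):
# fit_null = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Poisson()).fit()
# fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
# score = prd_pen(fit, fit_null)   # larger is better among candidates
```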
This thesis delivers a new framework for the robust parametric estimation of random fields and latent models through the use of the wavelet variance. By proposing a new M-estimation approach for the latter quantity and providing results on the identifiability of a wide class of latent models, the thesis arrives at a computationally efficient and statistically sound method for estimating complex models even when the data are contaminated. The results of this work are implemented in a new statistical software package, also presented in this thesis, with a focus on its usefulness for inertial sensor calibration.
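A minimal sketch of the classical, non-robust pipeline is given below: a Haar wavelet variance estimator, and a parameter fit for a white-noise-plus-random-walk latent model obtained by matching empirical and model-implied wavelet variances. The log-scale least-squares objective is a simplification of such matching estimators, and the robust version proposed in the thesis would replace the sample second moment of the wavelet coefficients with an M-estimator of scale.

```python
import numpy as np
from scipy.optimize import minimize

def haar_wv(x, J=None):
    # Classical Haar wavelet variance at dyadic scales tau_j = 2^j;
    # the robust version replaces np.mean(w**2) with an M-estimate.
    n = len(x)
    J = J or int(np.log2(n)) - 2
    taus = 2 ** np.arange(1, J + 1)
    wv = []
    for tau in taus:
        h = tau // 2
        m = np.convolve(x, np.ones(h) / h, mode="valid")  # running means
        w = 0.5 * (m[h:] - m[:-h])       # Haar coefficients at scale tau
        wv.append(np.mean(w ** 2))
    return taus, np.array(wv)

def model_wv(theta, taus):
    # Closed-form Haar wavelet variance of white noise plus random walk,
    # a simple latent model common in inertial sensor calibration.
    sigma2, gamma2 = np.exp(theta)       # log-parametrization keeps > 0
    return sigma2 / taus + gamma2 * (taus ** 2 + 2) / (12 * taus)

def wv_matching_fit(x):
    # Match empirical and model-implied wavelet variances (here by
    # least squares on the log scale, a simplified objective).
    taus, nu = haar_wv(x)
    obj = lambda th: np.sum((np.log(model_wv(th, taus)) - np.log(nu)) ** 2)
    res = minimize(obj, x0=np.zeros(2), method="Nelder-Mead")
    return dict(zip(["sigma2_wn", "gamma2_rw"], np.exp(res.x)))

# Hypothetical usage: recover the two variances from a simulated signal.
rng = np.random.default_rng(1)
x = rng.normal(0, 1.0, 2**14) + np.cumsum(rng.normal(0, 0.01, 2**14))
print(wv_matching_fit(x))
```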