How many subjects for regression
You can check instability directly, for example by refitting the model on bootstrap resamples of the data. Sample size calculation for logistic regression is a complex problem; some guidance comes from the work of Peduzzi et al. (1996) in the Journal of Clinical Epidemiology. There are no strict rules, but you can include all independent variables as long as the nominal variables don't have too many categories.
You need one "beta" for all but one of the categories of each nominal variable. So if a nominal variable were, say, "area of work" and you had 30 areas, then you'd need 29 betas.
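As a quick illustration (a minimal sketch using pandas, with a made-up "area" column), dummy coding a nominal variable produces one indicator column per category minus the reference level:

```python
import pandas as pd

# Hypothetical data: a nominal predictor with several categories.
df = pd.DataFrame({"area": ["ICU", "ER", "Ward", "ER", "ICU", "Lab"]})

# One indicator (beta) per category except the reference level,
# which is absorbed into the intercept.
dummies = pd.get_dummies(df["area"], prefix="area", drop_first=True)
print(dummies.shape[1])  # number of categories minus one -> 3 here
```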
One way to overcome this problem is to regularise the betas, that is, to penalise large coefficients. This helps ensure that your model doesn't overfit the data.
L2 and L1 regularisation are popular choices. Another issue to consider is how representative your sample is: what population do you want to make inferences about?
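A minimal sketch of both penalties with scikit-learn, on synthetic data (the variable names and the penalty strength C are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # many predictors, modest sample
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

# L2 (ridge-like) penalty shrinks all coefficients toward zero;
# smaller C means stronger penalisation.
l2_model = LogisticRegression(penalty="l2", C=0.5).fit(X, y)

# L1 (lasso-like) penalty can set some coefficients exactly to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print((l1_model.coef_ != 0).sum(), "predictors kept by the L1 fit")
```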
Vittinghoff, E. and McCulloch, C.E. (2007). Relaxing the rule of ten events per variable in logistic and Cox regression. American Journal of Epidemiology, 165(6): 710-718.

Sample size for logistic regression?
Am I right? If not, please let me know how to decide the number of independent variables.
What I mean is: if I have 100 subjects of which 10 are cases (the 1's) and 90 are non-cases (the 0's), then the rule says "include only 1 predictor". But what if I model the 0's instead of the 1's and then take the reciprocal of the estimated odds ratios? Would I be allowed to include 9 predictors?

Interaction Model 1: A table showing no interaction between the two treatments; their effects are additive.
In this example, there is no interaction between the two treatments; their effects are additive. Interaction Model 2: A table showing an interaction between the treatments; their effects are not additive. In contrast, if the average responses shown in the second table are observed, then there is an interaction between the treatments: their effects are not additive.
The goal of polynomial regression is to model a non-linear relationship between the independent and dependent variables. Explain how the linear and nonlinear aspects of polynomial regression make it a special case of multiple linear regression. Although the fitted curve is nonlinear in the predictor x, the model y = b0 + b1x + b2x^2 + ... + bmx^m is linear in the unknown coefficients b0, ..., bm; for this reason, polynomial regression is considered to be a special case of multiple linear regression.
Polynomial regression models are usually fit using the method of least-squares. The least-squares method minimizes the variance of the unbiased estimators of the coefficients, under the conditions of the Gauss—Markov theorem.
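To make the point concrete that the model is linear in its coefficients, here is a minimal sketch that fits a cubic by ordinary least squares with NumPy on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=2.0, size=x.size)

# Build the design matrix with columns 1, x, x^2, x^3: the model is a
# polynomial in x but an ordinary linear model in the coefficients.
X = np.vander(x, N=4, increasing=True)

# Ordinary least squares, exactly as in multiple linear regression.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # estimates of the intercept and the x, x^2, x^3 terms
```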
The least-squares method was published in 1805 by Legendre and in 1809 by Gauss. The first design of an experiment for polynomial regression appeared in an 1815 paper of Gergonne. In the 20th century, polynomial regression played an important role in the development of regression analysis, with a greater emphasis on issues of design and inference. More recently, the use of polynomial models has been complemented by other methods, with non-polynomial models having advantages for some classes of problems.
Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective.
It is often difficult to interpret the individual coefficients in a polynomial regression fit, since the underlying monomials can be highly correlated. Although the correlation can be reduced by using orthogonal polynomials, it is generally more informative to consider the fitted regression function as a whole. Point-wise or simultaneous confidence bands can then be used to provide a sense of the uncertainty in the estimate of the regression function.
Polynomial regression is one example of regression analysis using basis functions to model a functional relationship between two quantities. In modern statistics, polynomial basis-functions are used along with new basis functions, such as splines, radial basis functions, and wavelets. These families of basis functions offer a more parsimonious fit for many types of data.
The goal of polynomial regression is to model a non-linear relationship between the independent and dependent variables (technically, between the independent variable and the conditional mean of the dependent variable). This is similar to the goal of non-parametric regression, which aims to capture non-linear regression relationships.
Therefore, non-parametric regression approaches such as smoothing can be useful alternatives to polynomial regression. Some of these methods make use of a localized form of classical polynomial regression. An advantage of traditional polynomial regression is that the inferential framework of multiple regression can be used.
Polynomial Regression: A cubic polynomial regression fit to a simulated data set. Dummy, or qualitative, variables often act as independent variables in regression and affect the results of the dependent variables. Break down the method of inserting a dummy variable into a regression analysis in order to compensate for the effects of a qualitative variable. In statistics, particularly in regression analysis, a dummy variable (also known as a categorical or qualitative variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
In regression analysis, the dependent variable may be influenced not only by quantitative variables (income, output, prices, etc.) but also by qualitative variables (gender, region, religion, etc.). For example, if gender is one of the qualitative variables relevant to a regression, then the categories included under the gender variable would be female and male. If female is arbitrarily assigned the value of 1, then male would get the value 0. The intercept (the value of the dependent variable if all other explanatory variables hypothetically took on the value zero) would be the constant term for males but would be the constant term plus the coefficient of the gender dummy in the case of females.
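A minimal sketch with statsmodels, using a small made-up wage data set, showing the dummy coefficient as a shift in the intercept (all column names and numbers are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data set with a qualitative regressor.
df = pd.DataFrame({
    "wage":       [21, 24, 19, 30, 27, 23, 26, 33],
    "gender":     ["female", "male", "female", "male",
                   "female", "male", "female", "male"],
    "experience": [2, 3, 1, 8, 6, 4, 7, 10],
})

# C(gender) creates the 0/1 dummy; males form the baseline (intercept)
# and the dummy coefficient is the shift in intercept for females.
model = smf.ols("wage ~ C(gender, Treatment(reference='male')) + experience",
                data=df).fit()
print(model.params)
```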
One type of ANOVA model, applicable when dealing with qualitative variables, is a regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature).
An example with one qualitative variable might be if we wanted to run a regression to find out if the average annual salary of public school teachers differs among three geographical regions in a country.
Qualitative regressors, or dummies, can have interaction effects between each other, and these interactions can be depicted in the regression model. For example, in a regression involving determination of wages, if two qualitative variables are considered, namely, gender and marital status, there could be an interaction between marital status and gender.
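A minimal sketch of such an interaction with statsmodels formulas, again on made-up wage data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wage data with two qualitative regressors.
df = pd.DataFrame({
    "wage":    [20, 22, 25, 31, 19, 24, 27, 35, 21, 23, 26, 33],
    "gender":  ["f", "m", "f", "m", "f", "m", "f", "m", "f", "m", "f", "m"],
    "married": ["no", "no", "yes", "yes", "no", "no", "yes", "yes",
                "no", "no", "yes", "yes"],
})

# The * operator expands to both main effects plus their interaction,
# i.e. C(gender) + C(married) + C(gender):C(married).
model = smf.ols("wage ~ C(gender) * C(married)", data=df).fit()
print(model.params)
```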
Demonstrate how to conduct an Analysis of Covariance, state its assumptions, and describe its use in regression models containing a mixture of quantitative and qualitative variables. These models include a statistical control for the effects of quantitative explanatory variables (also called covariates or control variables).
Covariance is a measure of how much two variables change together and how strong the relationship is between them. ANCOVA evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables that are not of primary interest, known as covariates (CVs).
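As a sketch, ANCOVA can be fitted as an ordinary linear model with a categorical IV and a continuous CV; the data and column names below are made up for illustration:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: post-test score (DV), teaching method (IV),
# and pre-test score as the covariate (CV).
df = pd.DataFrame({
    "post":   [14, 16, 13, 18, 20, 17, 22, 24, 21],
    "method": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "pre":    [10, 12, 9, 11, 14, 12, 13, 16, 14],
})

# ANCOVA as a linear model: group means compared after adjusting for the CV.
model = smf.ols("post ~ C(method) + pre", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```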
ANCOVA can be used to increase statistical power (the ability to find a significant difference between groups when one exists) by reducing the within-group error variance. A second, more controversial application aims at correcting for initial differences on the DV that exist among several intact groups prior to group assignment. In this situation, participants cannot be made equal through random assignment, so CVs are used to adjust scores and make participants more similar than without the CV.
However, even with the use of covariates, there are no statistical techniques that can equate unequal groups. Furthermore, the CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, rendering the results meaningless. Multilevel (nested) models are appropriate for research designs where data for participants are organized at more than one level.
Multilevel models, or nested models, are statistical models of parameters that vary at more than one level. These models can be seen as generalizations of linear models (in particular, linear regression), although they can also be extended to non-linear models.
Though not a new idea, they have become much more popular following the growth of computing power and the availability of software. Multilevel models are particularly appropriate for research designs where data for participants are organized at more than one level (i.e., nested data). While the lowest level of data in multilevel models is usually an individual, repeated measurements of individuals may also be examined.
As such, multilevel models provide an alternative type of analysis for univariate or multivariate analysis of repeated measures. Individual differences in growth curves may be examined. Furthermore, multilevel models can be used as an alternative to analysis of covariance (ANCOVA), where scores on the dependent variable are adjusted for covariates (i.e., individual differences) before testing treatment differences. Multilevel models are able to analyze these experiments without the assumption of homogeneity of regression slopes that is required by ANCOVA. Before conducting a multilevel model analysis, a researcher must decide on several aspects, including which predictors are to be included in the analysis, if any.
Second, the researcher must decide whether parameter values (i.e., the elements that will be estimated) will be fixed or random. Fixed parameters are composed of a constant over all the groups, whereas a random parameter has a different value for each of the groups. Additionally, the researcher must decide whether to employ maximum likelihood estimation or restricted maximum likelihood estimation.
Multilevel models have the same assumptions as other major general linear models, but some of the assumptions are modified for the hierarchical nature of the design (i.e., nested data). Multilevel models have been used in education research or geographical research to estimate separately the variance between pupils within the same school and the variance between schools.
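A minimal random-intercept sketch with statsmodels MixedLM, on simulated pupils nested within schools (all names and numbers are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated pupils nested within schools: each school gets its own
# random shift around the overall intercept.
rng = np.random.default_rng(2)
n_schools, n_pupils = 20, 25
school = np.repeat(np.arange(n_schools), n_pupils)
school_effect = rng.normal(scale=2.0, size=n_schools)[school]
ses = rng.normal(size=school.size)
score = 50 + 3 * ses + school_effect + rng.normal(scale=5.0, size=school.size)
df = pd.DataFrame({"score": score, "ses": ses, "school": school})

# Random-intercept model: fixed effect for ses, random intercept per school.
model = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
print(model.summary())
```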
In psychological applications, the multiple levels are items in an instrument, individuals, and families. In sociological applications, multilevel models are used to examine individuals embedded within regions or countries. In organizational psychology research, data from individuals must often be nested within teams or other functional units. Stepwise regression is a method of regression modeling in which the choice of predictive variables is carried out by an automatic procedure.
Evaluate and criticize stepwise regression approaches that automatically choose predictive variables. The frequent practice of fitting the final selected model, followed by reporting estimates and confidence intervals without adjusting them to take the model building process into account, has led to calls to stop using stepwise model building altogether — or to at least make sure model uncertainty is correctly reflected.
Another approach is to use an algorithm that provides an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. This is a variation on forward selection, in which a new variable is added at each stage in the process, and a test is made to check if some variables can be deleted without appreciably increasing the residual sum of squares RSS.
One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in-sample than it does on new out-of-sample data.
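The sketch below hand-rolls a simple p-value-based forward selection with statsmodels on pure noise predictors; depending on the random seed it will often admit one or more noise variables, which is the in-sample optimism described above (the 0.05 entry threshold and all names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, p = 100, 15
X = rng.normal(size=(n, p))          # pure noise predictors
y = rng.normal(size=n)               # response unrelated to any of them

selected, remaining = [], list(range(p))
while remaining:
    # At each step, try adding each remaining variable and keep the one
    # with the smallest p-value, provided it clears the entry threshold.
    pvals = {}
    for j in remaining:
        design = sm.add_constant(X[:, selected + [j]])
        pvals[j] = sm.OLS(y, design).fit().pvalues[-1]
    best = min(pvals, key=pvals.get)
    if pvals[best] > 0.05:
        break
    selected.append(best)
    remaining.remove(best)

print("variables selected from pure noise:", selected)
```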
Due to limited space, only the most practical points are presented. The variable to be explained is called the dependent or response variable. When the dependent variable is binary, the medical literature refers to it as an outcome or endpoint.
The factors that explain the dependent variable are called independent variables, which encompass the variable of interest or explanatory variable and the remaining variables, generically called covariates. Not infrequently, the unique function of these covariates is to adjust for imbalances that may be present in the levels of the explanatory variable.
Sometimes, however, the identification of the predictors for the response variable is the main study goal, and in this case, every independent variable becomes of interest.
Prediction models (Table 2) are created when the main goal is to predict the probability of the outcome in each subject, often beyond the data from which the model originated. As an example, the clinical prediction rules derived from a model fitted to the Framingham data have been shown, after multiple external validations, to provide a quantitative estimation of the absolute risk of coronary heart disease in a general population.
Complex models, such as those with multiple interactions, an excessive number of predictors, or continuous predictors modeled through complex nonlinear relationships, tend to fit poorly in other populations. Several recommendations have been proposed for building these types of models (2,4,5), the following being the most important: (a) incorporate as much accurate data as possible, with wide distribution for predictor values; (b) impute data if necessary, as sample size is important; (c) specify in advance the complexity or degree of nonlinearity that should be allowed for each predictor; (d) limit the number of interactions, and include only those prespecified and based on biological plausibility; (e) for binary endpoints, follow the events per variable (EPV) rule to prevent overfitting; if that is not possible, then proceed to data reduction; (f) be aware of the problems with stepwise selection strategies, and use prior knowledge whenever possible; (g) check the degree of collinearity between important predictors and use subject matter expertise to decide which of the collinear predictors should be included in the final model; (h) validate the final model for calibration and discrimination, preferably using bootstrapping; and (i) use shrinkage methods if validation shows over-optimistic predictions.
Explanatory models, in contrast, are created either as tools for effect estimation or as a basis for hypothesis testing. Most articles published in the biomedical literature are based on this type of model.
Because there is little concern for parsimony, the balance would be in favor of developing a more accurate and complex model that reflects the data at hand. However, always use principles that prevent overfitted estimates, and if necessary proceed to data reduction methods.
It is always a good principle to validate the final model based on calibration and discrimination measures. Another consideration when building a regression model is to choose the appropriate statistical model that matches the type of dependent variable.
There is much variation in how data are collected, and not infrequently the same data can be analyzed with more than one regression method. Table 1 shows the types of regression methods that match the most frequent types of data collected. A detailed explanation of each of these methods is beyond the scope of this paper, and therefore only the most important aspects will be provided.
For a continuous dependent variable, linear regression is the usual choice; it is based on a normal error assumption, and its main assumption is the linearity of the relationship between the continuous dependent variable and the predictors. When this is untenable, linearize the relationship using a variable transformation or apply nonparametric methods. The logistic regression model is appropriate for modeling a binary outcome, disregarding the time dimension. All we need to know about the outcome is whether it is present or absent for each subject at the end of the study.
The resulting estimate of effect for treatment is the odds ratio (OR) adjusted for the other factors included as covariates. Sometimes, logistic regression has been used inappropriately to analyze time-to-event data.
Annesi et al. compared the logistic and Cox regression models for cohort data and found that they give similar results only under restricted conditions. Therefore, logistic regression should be considered as an alternative to Cox regression only when the duration of the cohort follow-up can be disregarded for being too short, or when the proportion of censoring is minimal and similar between the two levels of the explanatory variable.
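As an illustration of the basic logistic fit and the adjusted OR, here is a minimal sketch with statsmodels on simulated data (variable names and effect sizes are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated binary outcome with a treatment indicator and one covariate.
rng = np.random.default_rng(4)
n = 500
treatment = rng.integers(0, 2, size=n)
age = rng.normal(60, 10, size=n)
logit = -4 + 0.7 * treatment + 0.05 * age
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "age": age})

# Fit the logistic model; exponentiated coefficients are odds ratios
# adjusted for the other covariates in the model.
fit = smf.logit("outcome ~ treatment + age", data=df).fit(disp=0)
print(np.exp(fit.params))         # adjusted ORs
print(np.exp(fit.conf_int()))     # 95% confidence intervals on the OR scale
```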
Survival regression methods have been designed to account for the presence of censored observations, and therefore are the right choice to analyze time-to-event data.
Cox proportional hazards regression is the most commonly used method. The effect size is expressed in relative metrics as a hazard ratio (HR). The main assumption is proportionality of hazards, that is, a constant ratio of instantaneous hazards during follow-up. Several nonparametric alternatives to Cox regression have been proposed for use in the presence of nonproportionality.
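A minimal sketch with the lifelines package (assuming it is installed), using its bundled Rossi recidivism data set; the assumption check at the end is the informal proportional-hazards diagnostic that lifelines provides:

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# Example data shipped with lifelines; in practice the data frame needs a
# follow-up time column, an event indicator, and the covariates.
df = load_rossi()

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()          # coefficients reported as log hazard ratios

# Informal check of the proportional-hazards assumption.
cph.check_assumptions(df, p_value_threshold=0.05)
```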
Parametric survival methods are recommended as follows: (a) when the baseline hazard or survival function is of primary interest; (b) to get more accurate estimates in situations where the shape of the baseline hazard function is known by the researcher; (c) as a way to estimate the adjusted absolute risk difference (ARD) and number needed to treat (NNT) at prespecified time points; (d) when the proportionality assumption for the explanatory variable is not tenable (8); and (e) when there is a need to extrapolate the results beyond the observed data.
In survival analysis, each subject can experience one of several different types of events during follow-up. If the occurrence of one type of event either influences or prevents the event of interest, a competing risks situation arises.
For example, in a study of patients with acute heart failure, hospital readmission, the event of interest, is prevented if the patient dies during follow-up. Death here is a competing risk, preventing the patient from being readmitted.
In a competing risks scenario, several studies have demonstrated that the traditional survival methods, such as Cox regression and the Kaplan-Meier (KM) method, are inappropriate. In summary, consider these methods: (a) when the event of interest is an intermediate endpoint and the patient's death prevents its occurrence; (b) when one specific form of death, such as cardiovascular death, needs to be adjusted by other causes of death; and (c) to adjust for events that are in the causal pathway between the exposure and the outcome of interest (usually a terminal event).
For instance, revascularization procedures may be considered in this category because they modify the patient's natural history of the disease and therefore influence the occurrence of mortality. Many studies in biomedical research have designs that involve repeated measurements over time of a continuous variable across a group of subjects.
In cardiovascular registries, for instance, information is obtained from patients at each hospitalization and collected throughout their follow-up; this information may include continuous markers, such as BNP, left ventricular ejection fraction, etc. In this setting, the researcher may be interested in modeling the marker across time, by determining its trajectory and the factors responsible for it; or perhaps the effect of the marker on mortality becomes the main interest; or the marker may simply be used as a time-varying adjuster for a treatment indicated at baseline.
A frequent and serious problem in such studies is the occurrence of missing data, which in many cases is due to a patient's death, leading to a premature termination of the series of repeated measurements.
This mechanism of missingness has been called informative censoring or informative drop-out, and it requires special statistical methodology for analysis. In a different setting, when the aim is to describe a process in which a subject moves through a series of states in continuous time, multistate Markov modeling becomes the right analytical tool. Within this model, patients may advance into or recover from adjacent disease stages, or die, allowing the researcher to determine transition probabilities between stages, the factors influencing such transitions, and the predictive role of each intermediate stage on death.
Not infrequently, the data require clean-up before fitting the model. Three important areas need to be considered here. Missing data: this is a ubiquitous problem in health science research. Multiple imputation was developed for dealing with missing data under the missing at random (MAR) and missing completely at random (MCAR) assumptions, by replacing missing values with a set of plausible values based on auxiliary information available in the data.
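One way to sketch multiple imputation in Python is scikit-learn's IterativeImputer with posterior sampling; drawing several completed data sets (and later pooling the fitted estimates, not shown here) is the key idea. The data below are simulated:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with roughly 10% of values set to missing.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Draw several completed data sets; fitting the analysis model on each
# and pooling the estimates is the multiple-imputation step.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X_missing)
    for s in range(5)
]
print(len(imputations), "completed copies of the data")
```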
Variable coding: variables must be modeled under appropriate coding. Try to collapse categories for an ordered variable if data reduction is needed. Keep variables continuous as much as possible, since their categorization (or, even worse, their dichotomization) would lead to an important loss of predictive information, to say nothing of the arbitrariness of the chosen cutpoint.
Therefore, in the case of variable dichotomization, provide arguments on how the threshold was chosen, or whether it was based on an accepted cutpoint in the medical field. Overly influential observations: when evaluating the adequacy of the fitted model, it is important to determine whether any observation has a disproportionate influence on the estimated parameters, through influence or leverage analysis.
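A minimal sketch of influence and leverage diagnostics with statsmodels, with one grossly influential point planted in simulated data (the 4/n cut-off is only a common rule of thumb):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)
x[0], y[0] = 8.0, -20.0            # plant one grossly influential point

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

cooks_d, _ = influence.cooks_distance      # influence on the fitted coefficients
leverage = influence.hat_matrix_diag       # leverage (distance in predictor space)
flag = np.where(cooks_d > 4 / len(y))[0]   # a common rough cut-off
print("observations flagged as influential:", flag)
```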
Unfortunately, there is no firm guidance regarding how to treat influential observations. A careful examination of the corresponding data sources may be needed to identify the origin of the influence. Variable selection is a crucial step in the process of model creation (Table 1). Including the right variables in the model is a process heavily influenced by the prespecified balance between complexity and parsimony (Table 3).
Predictive models should include those variables that reflect the pattern in the population from which our sample was drawn. Here, what matters is the information that the model as a whole represents. For effect estimation, however, a fitted model that reflects the idiosyncrasy of the data is acceptable as long as the estimated parameters are corrected for overfitting.
Overfitting is a term used to describe a model fitted with too many degrees of freedom with respect to the number of observations (or events, for binary models). As a consequence, predictions from the overfitted model will not likely replicate in a new sample, some selected predictors may be spuriously associated with the response variable, and regression coefficients will be biased away from the null (over-optimism). In other words, if you put too many predictors in a model, you are very likely to get something that looks important, regardless of whether there is anything important going on in the population.
There are a variety of rules of thumb to approximate the sample size according to the number of predictors. In linear multiple regression, a minimum of 10 to 15 observations per predictor has been recommended. In simulation studies, 10 to 15 events per variable were the optimal ratio. Additional measures have been proposed to correct for overfitting: (a) use subject matter expertise to eliminate unimportant variables; (b) eliminate variables whose distributions are too narrow; (c) eliminate predictors with a high number of missing values; (d) apply shrinkage and penalization techniques to the regression coefficients; and (e) try to group, by measures of similarity, several variables into one, either by using multivariate statistical techniques, an already validated score, or an estimated propensity score.
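To make the events-per-variable arithmetic concrete, a tiny sketch with made-up numbers:

```python
# Rough events-per-variable (EPV) check for a binary-outcome model.
n_subjects = 500
event_rate = 0.08
epv_target = 10                          # 10 to 15 EPV is the usual rule of thumb

n_events = int(n_subjects * event_rate)  # 40 events
max_predictors = n_events // epv_target  # about 4 candidate degrees of freedom
print(f"{n_events} events allow roughly {max_predictors} predictors")
```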
In forward selection, the initial model comprises only a constant, and at each subsequent step the variable that leads to the largest significant improvement in fit is added to the model. In backward deletion, the initial model is the full model including all variables, and at each step a variable is excluded when its exclusion leads to the smallest nonsignificant decrease in model fit.
The final model of each of these stepwise procedures should include the set of predictor variables that best explains the response. The use of stepwise procedures has been criticized on multiple grounds. A forward selection with 10 predictor variables performs 10 significance tests in the first step, 9 significance tests in the second step, and so on, and each time includes a variable in the model when it reaches the specified criterion. This multiplicity of tests inflates the overall type I error, so the p-values and confidence intervals reported for the final model are over-optimistic.
In addition, stepwise procedures tend to be unstable, meaning that only slight changes in the data can lead to different results as to which variables are included in the final model and the sequence in which they are entered.