This project analyzes fast-growing firms in a European country, using data collected, maintained, and
cleaned by Bisnode that covers 19,036 firms. The goal is to predict the probability that a firm grows fast
and to classify firms as fast growth or no fast growth. Fast growth can be defined in many ways. In this
project, revenue is the main determinant of growth, and a firm is considered fast growing if its annual
revenues increase by more than 50%. I use company data from 2014 to predict the probability of fast growth
in 2015 and classify firms accordingly.
Data
The bisnode-firms dataset is a panel that contains information about firms in a European country. In this
project, I use the cleaned dataset that was maintained by Bisnode. As an initial step, the data was filtered to
include observations from 2010 to 2015. The dataset originally contained data on 19,036 firms: 287,829
observations and 48 variables. Each observation corresponds to a firm in a specific year. The dataset includes
variables that could be useful predictors of firm growth, for instance financial data, data on the
management, and region.
Data Preparation

The dataset was limited to the 2010-2015 panel.
Variables with many missing values, such as COGS, finished_prod, net_dom_sales, net_exp_sales,
and wages, were dropped.
Label Engineering
Growth rate, as a variable, was not provided in the dataset. However, related financial data were, which
made it simple to compute. For this paper, a firm is considered fast growing if its revenues increased by
more than 50% in the following year. Fast growth, as a binary, is the y variable: it equals 1 if the company
is fast growing within one year, and 0 otherwise. All other variables are considered potential predictors
and are screened by different methods to pick the likely predictive ones.
Sales values below 0 were imputed with 1.
Variables were created for sales in million euros and for log-transformed sales in millions.
The variable growth_rate, the annual sales growth rate, was created.
Observations with a negative or infinite growth rate were dropped since they do not fit our scope.
A binary variable, fast growth, was created to capture whether a firm had over 50% annual growth in
revenues.
The variables d1_sales_mil_log (first difference in natural log sales in millions) and age were created.
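The labeling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function name `label_fast_growth` and the dictionary layout (year mapped to sales) are hypothetical, and the growth rate is assumed to be the simple relative change in sales between two consecutive years.

```python
def label_fast_growth(sales_by_year, base_year=2014, threshold=0.5):
    """Return (growth_rate, fast_growth) for one firm, or None if the firm is dropped.

    sales_by_year: hypothetical dict mapping year -> sales for a single firm.
    """
    base = sales_by_year.get(base_year)
    following = sales_by_year.get(base_year + 1)
    if base is None or following is None:
        return None  # growth cannot be computed without both years

    # Sales below 0 are imputed with 1, as in the preparation steps.
    base = 1.0 if base < 0 else float(base)
    following = 1.0 if following < 0 else float(following)

    if base == 0:
        return None  # growth rate would be infinite, so the observation is dropped

    growth_rate = (following - base) / base
    if growth_rate < 0:
        return None  # negative growth is outside the scope of the analysis

    fast_growth = 1 if growth_rate > threshold else 0
    return growth_rate, fast_growth


if __name__ == "__main__":
    print(label_fast_growth({2014: 100.0, 2015: 160.0}))
    print(label_fast_growth({2014: 100.0, 2015: 120.0}))
```

Firms for which the function returns None correspond to the observations dropped for negative or infinite growth.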
Sample Design
The sample was limited to the cross-section of firms in 2014. Observations with sales below the 5th
percentile or above the 95th percentile were excluded. These were firms with either very low or very
high sales. As a result, the cleaned dataset consists of 47 variables and 5,737 observations.
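As a rough illustration of the percentile filter, the sketch below trims a list of firm records on a `sales` field. The record layout and the function name `trim_by_sales` are assumptions made for the example, and the inclusive quantile method is one of several possible percentile definitions, so the cut points may differ slightly from those used in the project.

```python
from statistics import quantiles


def trim_by_sales(firms, lower_pct=5, upper_pct=95):
    """Keep firms whose sales lie between the lower and upper percentile (inclusive).

    firms: hypothetical list of dicts, each with a numeric "sales" entry.
    """
    sales = [f["sales"] for f in firms]
    # n=100 returns 99 cut points; index p - 1 holds the p-th percentile.
    cuts = quantiles(sales, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [f for f in firms if low <= f["sales"] <= high]


if __name__ == "__main__":
    sample = [{"firm_id": i, "sales": float(i)} for i in range(1, 102)]
    kept = trim_by_sales(sample)
    print(len(kept), kept[0]["sales"], kept[-1]["sales"])
```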
Feature Engineering
To gain some insight into the data before building the models, we inspect the functional forms of the
variables. Obvious errors, such as negative current assets or current liabilities, were replaced with 0,
and a binary variable was created to flag the error. A new variable was created for total assets. Age
values outside a reasonable range were set to the minimum of 25 or the maximum of 75 years.
Moreover, financial variables, for example annual profit and loss and income before tax, were standardized
by sales and winsorized. Variables flagging errors or extreme values were also created. Quadratic terms
were added for some financial variables to capture non-linearity.
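A minimal sketch of winsorizing a sales-standardized ratio while keeping flags for the extreme values is shown below. The bounds of -1 and 1 are an assumption for the example (the report does not state the cut-offs), and the function name `winsorize_with_flags` is hypothetical.

```python
def winsorize_with_flags(values, lower=-1.0, upper=1.0):
    """Clip each value to [lower, upper] and flag the values that were clipped.

    Returns three lists of equal length: clipped values, low flags, high flags.
    The default bounds are illustrative assumptions, not the report's cut-offs.
    """
    clipped, flag_low, flag_high = [], [], []
    for v in values:
        flag_low.append(1 if v < lower else 0)
        flag_high.append(1 if v > upper else 0)
        clipped.append(min(max(v, lower), upper))
    return clipped, flag_low, flag_high


if __name__ == "__main__":
    ratios = [-2.5, -0.3, 0.4, 3.0]  # e.g. profit or loss divided by sales
    clipped, low, high = winsorize_with_flags(ratios)
    squared = [c ** 2 for c in clipped]  # quadratic term for non-linearity
    print(clipped, low, high, squared)
```

Keeping the flags lets a model treat clipped observations differently from ordinary ones instead of losing that information.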
The final dataset, after cleaning and screening for potential predictors, is composed of 5,453 observations
and 110 variables. As a robustness check, the dataset is split into an 80% work set and a 20% holdout set.
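The 80/20 split can be sketched as a seeded random shuffle followed by a cut. The seed value and the function name `split_work_holdout` are assumptions for illustration. The report does not say how the split was drawn.

```python
import random


def split_work_holdout(rows, holdout_share=0.2, seed=42):
    """Randomly split rows into (work, holdout) with a reproducible shuffle.

    The seed is an arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_share)
    return shuffled[n_holdout:], shuffled[:n_holdout]


if __name__ == "__main__":
    work, holdout = split_work_holdout(range(100))
    print(len(work), len(holdout))
```

The holdout set is put aside until the end, so that the selected model is evaluated on observations that played no part in estimation or model selection.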
Exploratory Data Analysis
The target variable is fast growth, expressed as a binary. As a first step, I check the potential predictor
variables.
[Figure 1, three panels. Panel 1: "Fast growth probability distribution across standardized profit/loss",
x-axis Standardized Annual Profit/Loss, y-axis Fast Growth. Panel 2: "Growth Probability Distribution Across
Standardized Profit/Loss", x-axis Standardized Income before Tax, y-axis Fast Growth. Panel 3: x-axis
sales_mil_log, y-axis growth.]
Figure 1: Probability Distribution of Predictor Variables
We can observe that the probability of fast growth tends to decrease as sales decrease. Steep drops could be
due to a low number of observations in a specific interval. The same pattern applies to the distribution of
probability across income before tax.
Table 1 shows us the descriptive statistics of price for property type.
Modeling
To begin building the models, the variables must first be defined. Predictors were grouped into four main
categories (Firm, Quality, Financial, and HR), plus a separate group for interactions. I consider four
logit models of increasing complexity for probability prediction.
Model 1 includes log sales_mil, squared log sales_mil, d1_sales_mil_log_mod, profit_loss_year_pl,
fixed_assets_bs, curr_liab_bs, curr_liab_bs_flag_high, curr_liab_bs_flag_error, age, and foreign_management.
Model 2 includes log sales_mil, squared log sales_mil, firm, engvar, and d1.
Model 3 includes log sales_mil, squared log sales_mil, and all variables, but no interactions.
The logit LASSO model includes most of the predictors as well as the set of potential interactions.
The random forest includes sales in millions, the log first difference of sales in millions, and the Firm,
Quality, Financial, and HR variables, with no interactions and no modified features.
Cross Validation Prediction
I prepared four logit models to compare. The best-performing model is selected using cross-validation, and
its predictions are evaluated on the holdout set. The work sample consists of 4,362 observations, of which
3,856 are fast growth, and the holdout sample has 1,089 observations, of which 963 are fast growth.
Table 2 shows the number of variables, R-squared, BIC and cross-validated training set and test set RMSE
for the eight regressions. The table shows us two statistics for the entire work set: R-squared and BIC.
First we estimated all regressions using all observations in the work set. Then, we estimated models by
using 5-fold cross-validation. For each fold we estimated the regression using the training set, and used it for
prediction not only on the training set but also in the corresponding test set. For this the Training RMSE
and Test RMSE were calculated as the square root of the average MSE on the five training sets and the five
test sets.
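The cross-validated RMSE described above, the square root of the average MSE over the folds, can be sketched as follows. The fold assignment by index, the helper names, and the mean-only placeholder model are assumptions made to keep the example self-contained. Any model with a fit and a predict step could be plugged in instead.

```python
import math


def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cv_rmse(xs, ys, fit, predict, k=5):
    """Return (training RMSE, test RMSE), each the square root of the mean fold MSE."""
    n = len(ys)
    train_mses, test_mses = [], []
    for fold in range(k):
        # Illustrative fold assignment: observation i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        train_pred = [predict(model, xs[i]) for i in train_idx]
        test_pred = [predict(model, xs[i]) for i in test_idx]
        train_mses.append(mse([ys[i] for i in train_idx], train_pred))
        test_mses.append(mse([ys[i] for i in test_idx], test_pred))
    return math.sqrt(sum(train_mses) / k), math.sqrt(sum(test_mses) / k)


def fit_mean(xs, ys):
    """Placeholder model: remember the training mean of y and ignore x."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    """Predict the stored training mean for every observation."""
    return model


if __name__ == "__main__":
    ys = [0, 1] * 10
    xs = list(range(len(ys)))
    print(cv_rmse(xs, ys, fit_mean, predict_mean))
```

Note that the MSEs are averaged before the square root is taken, which is not the same as averaging the five fold-level RMSEs.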
R-squared improves as more variables are added. In our case the most complex model explains 37% of the
variation in prices. BIC should decrease as models become more complex, but after a certain point it
increases, although in our case the differences are relatively small. According to BIC, the best model in
our case is model number 6, as the more complex models risk overfitting the data.
The RMSE in the training set improves as the model gets more complex. In the test set it improves
until model 7, after which it is significantly worse. Model 7 has the lowest test RMSE, at 45.95. This model
includes all the variables except for the interactions in X3. Model 7 is significantly more complex than Model
5, which was deemed best by BIC. Model 6 contained the interactions of property type and the additional
interactions, while model 7 included amenities as well.
RMSE suggests that the typical size of the prediction error in the test set is 45.95 euros for model 7,
compared with 46.07 euros for model 6. From a statistical point of view the difference might be interesting,
but from a business point of view it could be deemed insignificant.
If BIC and cross-validation conflict, the cross-validation result should be chosen, as it is not
based on auxiliary assumptions.

user system elapsed

61.973 0.408 62.424

Table 1: Logit Summary
Model    Number of predictors    CV RMSE    CV AUC
X1       4                       0.314      0.701
X2       39                      0.316      0.698
X3       79                      0.318      0.689
LASSO    1                       0.322      0.677
[Figure 2: x-axis False positive rate (1 − Specificity), y-axis True positive rate (Sensitivity), both from
0.0 to 1.0. Legend: threshold (0.2, 0.4, 0.6).]
Figure 2: Training and test RMSE for the models
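The axes of the figure (false positive rate against true positive rate, by threshold) and the CV AUC column of the summary table can be illustrated with a small sketch. The function names and the toy data are assumptions for the example. The AUC is computed here as the share of positive-negative pairs in which the positive observation receives the higher predicted probability, with ties counted as one half.

```python
def roc_points(y_true, y_prob, thresholds):
    """Return (threshold, false positive rate, true positive rate) per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # An observation is classified as fast growth if its probability is >= t.
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    score = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                score += 1.0
            elif pp == pn:
                score += 0.5
    return score / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]              # toy labels, not project data
    y_prob = [0.1, 0.4, 0.35, 0.8]     # toy predicted probabilities
    print(roc_points(y_true, y_prob, [0.2, 0.5]))
    print(auc(y_true, y_prob))
```

The pairwise AUC is quadratic in the sample size, which is fine for a sketch but slow on large samples, where a rank-based computation is the usual choice.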

Sample Solution

The post Firms Growth Prediction appeared first on ACED ESSAYS.
