knitr::opts_chunk$set(echo=TRUE)
Ranking universities is a fundamental issue not only for students and academics but also for university authorities, industry and even government. Global university rankings depend on several key factors such as teaching, research, citations, international outlook and industry income. Prospective students rely on these rankings when applying to specific universities based on their upcoming ranks. There are two main types of ranking system, global and national, each containing several ranking systems based on location. The global category consists of nine ranking systems, of which the most widely used is the THE-QS World University Ranking, published in England in 2024. (University Ranking and Evaluation: Trend and Existing Approaches, 2012)
University authorities should assess their institution's upcoming rank in order to improve specific fields and benchmark their standards against others. Industry and even government should try to forecast university rankings when providing grants to the most eligible institutions for the upcoming year (Taylor and Braddock, 2007; Vitenko et al., 2018). Understanding the criteria and mechanics that assess the quality of education is a key activity in improving current educational models and proposing new educational strategies.
This project aims to demonstrate how data analysis techniques and machine learning algorithms can predict the overall score of a university and identify the predictive features. The raw data collected has 2673 rows and 17 variables, of which overall_score is our target variable. Since the target variable is numerical, regression techniques will be used for predictive modeling.
file_path <- "C:/Users/vishw/OneDrive/Desktop/NU/KL7010 - Principle of DS/World University Rankings.csv"
raw_data <- read.csv(file_path, header = TRUE)
#str(raw_data)

Have you heard of the French term “mise en place”? It is crucial for every chef to understand. The term means “everything in its place”. “The goal of this culinary practice is to have everything ready to go when it comes time to cook, bake, or assemble your final dish.” Similarly, data preprocessing is the component of data preparation that ensures the raw data is ready for further data processing.
It is a fundamental step in the data mining process. The collected raw data is never in an ideal format where data analysis can be done right away. Data type conversion (type casting) and variable transformation (unit conversion) are a few preprocessing steps that ensure the collected data is in a suitable format for analysis and model building. This improves consistency, data visualization and the interpretation of the data, making it more meaningful and thereby leading to better insights and decision making.
Certain numerical variables were read as characters. Data type conversion was made easy using the functions as.numeric(column name) and as.factor(column name) from the base R package. The latter function was used to convert the variables member_level and unaccredited, which are categorical data that follow a certain order. Variables like stats_number_students, stats_pc_intl_students and stats_female_male_ratio were not recorded in the right format. The number of students, expected to be in the thousands, was recorded with a comma separating the thousands digit; the function gsub(pattern, replacement, column name) followed by as.numeric() was used to convert it. Similarly, overall_score and stats_female_male_ratio were recorded with special characters, so a combination of sapply() and strsplit(x, split) was used to convert them to the correct format and type.
Note: Certain values of overall_score were recorded as a range (e.g. 34.5–38). In such cases the average of the two endpoints was taken.
The variable subjects_offered contained a list of different subjects separated by commas and was read as a character. To assess whether subjects_offered has any impact on overall_score, the line sapply(strsplit(as.character(df$subjects_offered),','), function(x) length(x)) was used to create a new variable, no_subjects_offered, holding the total number of subjects offered by each university.
There are 127 unique locations; gaining insights graphically across all of them is possible but would be difficult to interpret. Hence, to gain insights about universities based on their location and to plot clearer graphs, a variable named Region was created by categorizing the locations into their respective regions. A sample line to do so is given below:
North_America = c("Canada", "United States", "Mexico")
df$Region = case_when(df$location %in% North_America ~ "North_America")
df <- raw_data # work on a copy of the raw data
# Remove thousands separators so student counts can be parsed as numbers
df$stats_number_students <- gsub(",", "", df$stats_number_students)
# Strip the "%" sign and convert the international-student share to a proportion
df$stats_pc_intl_students <- gsub("%", "", df$stats_pc_intl_students)
df$stats_pc_intl_students <- as.numeric(df$stats_pc_intl_students)/100
# Drop the spurious trailing ":00" and express the female:male ratio as one number
df$stats_female_male_ratio <- gsub(":00", "", df$stats_female_male_ratio)
df$female_male_ratio <- sapply(strsplit(df$stats_female_male_ratio, ":"),
                               function(x) as.numeric(x[1])/as.numeric(x[2]))
# Count the subjects offered by each university
df$no_subjects_offered <- sapply(strsplit(as.character(df$subjects_offered),','),
                                 function(x) length(x))
# Overall scores given as a range (e.g. "34.5–38") are replaced by their average
df$target <- sapply(strsplit(df$overall_score, '–'),
                    function(x) mean(as.numeric(x), na.rm=TRUE))
# Type casting
df$no_subjects_offered <- as.integer(df$no_subjects_offered)
df$member_level <- as.factor(df$member_level)
df$unaccredited <- as.factor(df$unaccredited)
df$stats_number_students <- as.integer(df$stats_number_students)
# Turn the international-student proportion into an absolute count
df$stats_pc_intl_students <- df$stats_pc_intl_students * df$stats_number_students
df$scores_teaching <- as.numeric(df$scores_teaching)
df$scores_research <- as.numeric(df$scores_research)
df$scores_citations <- as.numeric(df$scores_citations)
df$scores_industry_income <- as.numeric(df$scores_industry_income)
df$scores_international_outlook <- as.numeric(df$scores_international_outlook)
#length(unique(df$location)) #127 countries
countries <- unique(df$location)
North_America = c("Canada", "United States", "Turks and Caicos Islands", "Mexico", "Puerto Rico", "Jamaica", "Costa Rica", "Cuba", "Dominican Republic")
South_America = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela")
Europe = c("Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Georgia", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy", "Kosovo", "Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta", "Moldova", "Monaco", "Montenegro", "Netherlands", "North Macedonia", "Norway", "Poland", "Portugal", "Romania", "Russian Federation", "San Marino", "Serbia", "Slovakia", "Slovenia", "Spain", "Sweden", "Switzerland", "Ukraine", "United Kingdom")
Africa = c("Algeria", "Angola", "Botswana", "Egypt", "Ethiopia", "Ghana", "Kenya", "Libya", "Malawi", "Mauritius", "Morocco", "Mozambique", "Namibia", "Nigeria", "Rwanda", "Senegal", "Somalia", "South Africa", "Sudan", "Tanzania", "Tunisia", "Uganda", "Zambia", "Zimbabwe")
Middle_East = c("Bahrain", "Iran", "Iraq", "Israel", "Jordan", "Kuwait", "Lebanon", "Northern Cyprus", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria", "Turkey", "United Arab Emirates", "Yemen")
Asia = c("Armenia", "Azerbaijan", "Bangladesh", "Brunei Darussalam", "Cambodia", "China", "Hong Kong", "India", "Indonesia", "Japan", "Kazakhstan", "Kyrgyzstan", "Laos", "Macao", "Malaysia", "Mongolia", "Myanmar", "Nepal", "Pakistan", "Philippines", "Singapore","South Korea", "Sri Lanka", "Taiwan", "Tajikistan", "Thailand", "Turkmenistan", "Uzbekistan", "Vietnam")
Oceania = c("Australia", "Fiji", "New Zealand")
df$Region = case_when(df$location %in% Asia ~ "Asia",
df$location %in% Africa ~ "Africa",
df$location %in% Europe ~ "Europe",
df$location %in% Middle_East ~ "Middle_East",
df$location %in% North_America ~ "North_America",
df$location %in% South_America ~ "South_America",
df$location %in% Oceania ~ "Oceania")
df1 <- df # keep a full copy before dropping columns
df <- select(df, -c(13,14,17)) # drop three columns by position
#str(df)
#summary(df)
A few basic insights are presented before performing rigorous data analysis. These insights can be valuable to prospective students and institutions. Table 1 shows the top 10 universities in the world.
Top_uni <- df %>% arrange(desc(target)) %>%
select(c('name', 'location', 'target'))
head(Top_uni, 10) %>%
gt() %>%
cols_label(
name = md("***Universities***"),
location = md("***Location***"),
target = md("***Overall Score***")) %>%
tab_header(title = md("Table 1 :- *Top 10 Universities*"))
Table 1 :- Top 10 Universities

| Universities | Location | Overall Score |
|---|---|---|
| University of Oxford | United Kingdom | 98.5 |
| Stanford University | United States | 98.0 |
| Massachusetts Institute of Technology | United States | 97.9 |
| Harvard University | United States | 97.8 |
| University of Cambridge | United Kingdom | 97.5 |
| Princeton University | United States | 96.9 |
| California Institute of Technology | United States | 96.5 |
| Imperial College London | United Kingdom | 95.1 |
| University of California, Berkeley | United States | 94.6 |
| Yale University | United States | 94.2 |
From Table 2 we infer that Computer Science and Business & Management seem to be the most commonly offered courses among universities worldwide.
subject_list <- strsplit(as.character(df1$subjects_offered),',')
l1 <- table(unlist(subject_list))
subjects <- data.frame(Subject = names(l1), Offered_by = as.numeric(l1))
arrange(subjects, desc(Offered_by)) %>% head(6) %>%
gt() %>%
cols_label(
Subject = md("***Subject***"),
Offered_by = md("***No. of Unis***")) %>%
tab_header(title = md("Table 2 :- *Most offered courses*"))
Table 2 :- Most offered courses

| Subject | No. of Unis |
|---|---|
| Computer Science | 2277 |
| Business & Management | 2267 |
| Accounting & Finance | 1965 |
| Biological Sciences | 1946 |
| Mathematics & Statistics | 1914 |
| Literature & Linguistics | 1913 |
df_filtered <- df1[!is.na(df1$target), ] # considering only ranked universities
ggplot(df_filtered, aes(Region, fill = Region)) +
geom_bar(color = "black",show.legend = F) +
theme_minimal() + scale_fill_brewer(palette="Blues") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
The graph above and Tables 3, 4, 5 and 6 illustrate that Europe and Asia have the highest number of ranked universities, followed by the North American region. Universities in the UK admitted the highest number of international students, followed by the USA and Australia. Europe, and the UK in particular, appears to be a preferred region of study among international students, even though there is a greater number of ranked universities in the US than in the UK.
intl.students <- df %>% arrange(desc(stats_pc_intl_students)) %>%
select(Region, stats_pc_intl_students ) %>%
group_by(Region) %>%
summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T))
arrange(intl.students, desc(sum_intl)) %>%
gt() %>%
cols_label(
Region = md("***Region***"),
sum_intl = md("***International Students***")) %>%
tab_header(title = md("Table 3 :- *No. of International Students (Region)*"))
Table 3 :- No. of International Students (Region)

| Region | International Students |
|---|---|
| Europe | 2003036.49 |
| North_America | 710247.36 |
| Asia | 545708.58 |
| Middle_East | 382577.76 |
| Oceania | 295841.18 |
| Africa | 142760.70 |
| South_America | 48438.52 |
uni_loc <- group_by(df_filtered, location) %>% summarise(count = n())
uni_loc %>% arrange(desc(count), .by_group = T) %>%
head(10) %>%
gt() %>%
cols_label(
location = md("***Country***"),
count = md("***Ranked Unis***")) %>%
tab_header(title = md("Table 4 :- *Sum of ranked universities*"))
Table 4 :- Sum of ranked universities

| Country | Ranked Unis |
|---|---|
| United States | 169 |
| Japan | 119 |
| United Kingdom | 104 |
| India | 91 |
| China | 86 |
| Russian Federation | 78 |
| Turkey | 75 |
| Iran | 73 |
| Brazil | 56 |
| Italy | 56 |
students <- df %>% arrange(desc(stats_number_students)) %>%
select(location, stats_number_students ) %>%
group_by(location) %>%
summarise(students = sum(stats_number_students, na.rm = T))
arrange(students, desc(students)) %>% head(10) %>%
gt() %>%
cols_label(
location = md("***Country***"),
students = md("***Sum of Students***")) %>%
tab_header(title = md("Table 5 :- *Total Sum of Students (location)*"))
Table 5 :- Total Sum of Students (location)

| Country | Sum of Students |
|---|---|
| United States | 3985888 |
| China | 2818642 |
| Egypt | 2423510 |
| Turkey | 2105034 |
| United Kingdom | 2093040 |
| Brazil | 1707217 |
| Japan | 1460908 |
| Algeria | 1354183 |
| Mexico | 1303044 |
| Spain | 1291193 |
intl_students <- df %>% arrange(desc(stats_pc_intl_students)) %>%
select(location, stats_pc_intl_students ) %>%
group_by(location) %>%
summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T))
arrange(intl_students, desc(sum_intl)) %>% head(10) %>%
gt() %>%
cols_label(
location = md("***Country***"),
sum_intl = md("***International Students***")) %>%
tab_header(title = md("Table 6 :- *No. of International Students (location)*"))
Table 6 :- No. of International Students (location)

| Country | International Students |
|---|---|
| United Kingdom | 654516.2 |
| United States | 481297.0 |
| Australia | 258599.7 |
| Canada | 205261.8 |
| France | 192720.9 |
| Russian Federation | 164572.9 |
| Germany | 163791.9 |
| Turkey | 162938.2 |
| China | 134049.0 |
| Spain | 130483.3 |
Analyzing summary(data), it was found that closed was a constant variable and that, out of 2673 unaccredited values, only one was TRUE, which makes them zero-variance or near-zero-variance variables. Using the function nearZeroVar(data) from the caret package we can weed out such variables. Apart from the variables mentioned above, record_type and member_level also had near-zero variance. Hence these variables need not be considered as predictors: they are uninformative features and can reduce the performance of our model.
#Removing these variables as they are not needed for further analysis
df <- select(df, -c('name', 'Region'))
nzv <- nearZeroVar(df, saveMetrics = T)
nzv %>% rownames_to_column(var = "variables") %>%
gt() %>%
cols_label(
variables = md("***variables***"),
freqRatio = md("***freqRatio***"),
percentUnique = md("***percentUnique***"),
zeroVar = md("***zeroVar***"),
nzv = md("***nzv***")) %>%
tab_header(title = md("Table 4 :- *NZV table*")) %>%
tab_style_body(
style = cell_fill(color = "lightblue"),
fn = function(x) isTRUE(x))
Table 4 :- NZV table

| variables | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| scores_teaching | 1.230769 | 17.69547325 | FALSE | FALSE |
| scores_research | 1.050000 | 18.93004115 | FALSE | FALSE |
| scores_citations | 1.285714 | 30.26561915 | FALSE | FALSE |
| scores_industry_income | 1.866667 | 25.73887018 | FALSE | FALSE |
| scores_international_outlook | 1.222222 | 27.12308268 | FALSE | FALSE |
| record_type | 23.669811 | 0.11223345 | FALSE | TRUE |
| member_level | 25.428571 | 0.14964459 | FALSE | TRUE |
| location | 1.035503 | 4.75121586 | FALSE | FALSE |
| stats_number_students | 1.000000 | 96.29629630 | FALSE | FALSE |
| stats_student_staff_ratio | 1.083333 | 16.76019454 | FALSE | FALSE |
| stats_pc_intl_students | 242.500000 | 81.25701459 | FALSE | FALSE |
| closed | 0.000000 | 0.03741115 | TRUE | TRUE |
| unaccredited | 2672.000000 | 0.07482230 | FALSE | TRUE |
| female_male_ratio | 1.081081 | 2.58136925 | FALSE | FALSE |
| no_subjects_offered | 1.112000 | 1.38421250 | FALSE | FALSE |
| target | 1.286645 | 5.94837262 | FALSE | FALSE |
df <- df[, -nearZeroVar(df)] # removing the nzv and zv variables
The train-test split is performed prior to EDA to avoid data leakage. The basic idea of prediction is to use the model on unknown or unseen data, which can only be achieved by splitting the data before further rigorous analysis. Using createDataPartition(column, p = 0.75) from the caret package, 75% of the data was assigned to the training set and 25% to the test set.
set.seed(777)
index <- createDataPartition(df$stats_number_students, p=0.75, list=F)
trainset <- df[index,]
testset <- df[-index,]
This is a fundamental step in analyzing the data to gain insights, detect trends and understand the distribution and structure of the data. It can help identify outliers and null values, thereby improving the quality of the data for model building. Understanding these factors helps in choosing the right prediction model. Furthermore, it helps us understand the relationships between the variables and informs feature selection.
To understand the distribution of the predictor variables, histograms were plotted. Clearly the variables do not follow a normal distribution, and the scatter plots revealed that the scores are indeed linearly related to the target variable. A correlation plot, produced with ggcorrmat(data, type = "nonparametric"), helped examine the strength of the linear relationship between the continuous predictor variables and the target. Had the data followed a normal distribution, the parametric type would have been used, i.e. the Pearson coefficient; otherwise Spearman or Kendall is used to calculate the correlation.
#teaching
par(mfrow = c(1,2))
hist(trainset$scores_teaching, main = '',
breaks = 30, col="steelblue1",
xlab = 'teaching score')
scatter.smooth(trainset$scores_teaching, trainset$target,
pch = 20, cex = 1.25 , col="steelblue1",
xlab = 'teaching score', ylab = 'overall score')
#research
par(mfrow = c(1,2))
hist(trainset$scores_research, main = '',
breaks = 30, col="steelblue1",
xlab = 'research score')
scatter.smooth(trainset$scores_research, trainset$target,
pch = 20, cex = 1.25 , col="steelblue1",
xlab = 'research score', ylab = 'overall score')
#citation
par(mfrow = c(1,2))
hist(trainset$scores_citations, main = '',
breaks = 30, col="steelblue1",
xlab = 'citation score')
scatter.smooth(trainset$scores_citations, trainset$target,
pch = 20, cex = 1.25 , col="steelblue1",
xlab = 'citation score', ylab = 'overall score')
#industry income
par(mfrow = c(1,2))
hist(trainset$scores_industry_income, main = '',
breaks = 30, col="steelblue1",
xlab = 'industry income score')
scatter.smooth(trainset$scores_industry_income, trainset$target,
pch = 20, cex = 1.25 , col="steelblue1",
xlab = 'industry income score', ylab = 'overall score')
#international outlook
par(mfrow = c(1,2))
hist(trainset$scores_international_outlook, main = '',
breaks = 30, col="steelblue1",
xlab = 'international outlook score')
scatter.smooth(trainset$scores_international_outlook, trainset$target,
pch = 20, cex = 1.25 , col="steelblue1",
xlab = 'international outlook score', ylab = 'overall score')
The correlation plot reveals that most performance indicators display a linear relationship with the target (overall score), the exception being stats_student_staff_ratio. Most of the correlation coefficients are statistically significant (at alpha = 0.05) and large in magnitude, implying strong correlation. Among the predictor variables, scores_research and scores_industry_income are highly correlated with each other, and only female_male_ratio and stats_student_staff_ratio are weakly correlated with the target.
A weak correlation does not necessarily mean that a variable is unimportant. Also, multicollinearity can lead to an unstable model, but since these are the scores used to calculate the target (overall score), retaining them is the wiser choice. Nevertheless, further analysis can explain the impact of these variables.
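One standard diagnostic for the multicollinearity mentioned above is the variance inflation factor, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The sketch below computes it in base R on toy data whose names mimic two collinear scores (an assumption; on the real data this would be run on the training-set predictors, e.g. via car::vif on a fitted lm).

```r
# Manual variance inflation factors on synthetic collinear predictors.
set.seed(42)
n <- 200
research <- rnorm(n, 50, 15)
industry <- 0.9 * research + rnorm(n, sd = 4)  # strongly collinear with research
teaching <- rnorm(n, 50, 10)                   # independent of the others
preds <- data.frame(research, industry, teaching)

vif_one <- function(j, d) {
  # R-squared of predictor j regressed on all the other predictors
  r2 <- summary(lm(d[[j]] ~ ., data = d[setdiff(names(d), j)]))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(names(preds), vif_one, d = preds)
round(vifs, 1)  # research and industry exceed the common cutoff of 5
```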
# Box-plot of scores
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[1], width = 2, xlab = 'Teaching')
boxplot(trainset[2], width = 2, xlab = 'Research')
boxplot(trainset[3], width = 2, xlab = 'Citation')
boxplot(trainset[4], width = 2, xlab = 'Ind.income')
boxplot(trainset[5], width = 2, xlab = 'Intl.outlook')
mtext("Box-plots of the scores", side = 3,line = -2, outer = TRUE)
#Box-plots of other indicators
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[7], width = 2, xlab = 'no.stu')
boxplot(trainset[8], width = 2, xlab = 'stu:staff')
boxplot(trainset[9], width = 2, xlab = '%intl students')
boxplot(trainset[10], width = 2, xlab = 'F:M')
boxplot(trainset[11], width = 2, xlab = 'no.subjects')
mtext("Box-plots of other indicators", side = 3,line = -2, outer = TRUE)
The presence of outliers can impact model performance, as these are extreme or incorrectly recorded values. The box plots above reveal the distribution of the predictor variables. A common practice is to remove such values, as they can affect model performance and regression coefficients. In this case, however, the values seem plausible: a university scoring 100 or 10 (the extremes here) is credible, the likely reason being that the university is exceptionally good or bad.
Table 7 :- Effect of outliers on mean & sd

| Variable | Mean | Mean (outliers removed) | SD | SD (outliers removed) | mean.diff | sd.diff |
|---|---|---|---|---|---|---|
| teaching_x | 29.079 | 26.968 | 13.852 | 9.858 | 2.111 | 3.993 |
| research_x | 23.367 | 20.472 | 16.612 | 11.300 | 2.895 | 5.312 |
| students_x | 18333.514 | 15111.161 | 22596.263 | 11552.016 | 3222.353 | 11044.247 |
| stu.staf_x | 19.110 | 17.163 | 13.811 | 7.287 | 1.947 | 6.524 |
| intl_stud | 1522.920 | 830.047 | 2569.564 | 1017.288 | 692.873 | 1552.276 |
| f.m_x | 1.361 | 1.122 | 3.374 | 0.447 | 0.239 | 2.927 |
Table 7 shows the effect of outliers on the mean and standard deviation of the train set. The difference before and after removing outliers is substantial, suggesting that the outliers do influence the mean and standard deviation. However, the magnitude of the difference alone is not sufficient to decide: there is a risk of losing valuable information if we exclude these values. A way around this is to choose a model robust to outliers.
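The report does not show how Table 7 was built; a plausible sketch, assuming the conventional 1.5×IQR box-plot rule defines outliers and using synthetic right-skewed data in place of the real columns, is:

```r
# Mean/sd before vs after trimming outliers by the 1.5*IQR rule (assumed rule).
outlier_effect <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * diff(q)                           # 1.5 * interquartile range
  kept <- x[x >= q[1] - fence & x <= q[2] + fence]
  c(mean = mean(x), mean_trim = mean(kept),
    sd = sd(x), sd_trim = sd(kept))
}
set.seed(7)
students <- rlnorm(500, meanlog = 9, sdlog = 0.8)  # skewed, like enrolment counts
round(outlier_effect(students))
```

As in Table 7, the trimmed mean and standard deviation come out noticeably smaller, because trimming removes the long upper tail.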
From the vis_miss() plot we clearly see that there are missing scores. A possible reason is that a university did not meet the ranking threshold, or that the score was not calculated because the institution is not very reputable.
#miss_var_summary(trainset)
vis_miss(trainset)
mcar_test(trainset %>% select(stats_pc_intl_students, female_male_ratio,
scores_teaching,scores_research,
scores_citations,scores_industry_income,
scores_international_outlook,target))
| statistic | df | p.value | missing.patterns |
|---|---|---|---|
| 252.7807 | 11 | 0 | 6 |
The calculated p-value from the MCAR test is less than 0.05, so the null hypothesis is rejected: the missing data are not MCAR (they are either MAR or MNAR). Had the data been missing completely at random, complete case analysis (dropping all rows with null values) would have been the right approach. Given the size of the train set, dropping these rows risks losing the representativeness of the data and overfitting the model (Zhang, 2016; Gomer, 2019).
The missing values can be predicted from the observed values through multiple imputation. I have used bagImpute from the caret package and the cart method from the mice package, and compared their distributions with the original data. The bagImpute method fits bagged regression trees to impute the data, while cart uses classification and regression trees.
imp <- preProcess(trainset %>% select(-location), method = c("bagImpute"))
imputed <- predict(imp, newdata = trainset)
#Mice impute
quick_pred <- quickpred(trainset, exclude = c('location', 'Region'))
mice_imp <- mice(trainset, method = "cart", pred = quick_pred, m = 10)
##
## iter imp variable
## 1 1 scores_teaching scores_research scores_citations scores_industry_income scores_international_outlook stats_pc_intl_students female_male_ratio target
##  (the same variable list is printed for each of the 5 iterations and 10 imputations)
mice_imputed <- complete(mice_imp)
# Check if the distribution has been altered by imputing/dropping null values
ggplot() +
geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
geom_density(aes(scores_teaching, fill = "imputed"), alpha = .2, data = imputed) +
scale_fill_manual(name = "datasets", values = c(Original = "red", imputed = "blue"))
ggplot() +
geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
geom_density(aes(scores_teaching, fill = "mice_imputed"), alpha = .2, data = mice_imputed) +
scale_fill_manual(name = "datasets", values = c(Original = "red", mice_imputed = "green"))
The density plots above show that the CART-imputed data retains the original train-set distribution comparatively better, so the CART-imputed data will be used for the subsequent steps.
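Beyond eyeballing the density curves, the closeness of the two distributions can be quantified. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov statistic on simulated data (in the report this comparison would run on trainset$scores_teaching against mice_imputed$scores_teaching; the vectors here are purely illustrative):

```r
# Quantify distributional drift introduced by imputation (toy data).
# A smaller D statistic means the imputed distribution stays closer
# to the original one.
set.seed(777)
original <- rnorm(500, mean = 40, sd = 12)          # stand-in for observed scores
with_na  <- replace(original, sample(500, 50), NA)  # inject 10% missingness
imputed  <- ifelse(is.na(with_na), mean(with_na, na.rm = TRUE), with_na)

d_stat <- unname(ks.test(original, imputed)$statistic)
round(d_stat, 3)
```

The same statistic computed for two candidate imputations gives a numeric counterpart to the visual comparison above.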
Most statistical models accept only numerical input. Depending on the type of categorical data, different encoding techniques are preferred: label encoding (like ordinal encoding) is used when the variable is ordinal, while one-hot encoding is preferred when it is nominal. Since our location variable has many classes (127 locations), one-hot encoding would produce a sparse matrix and lead to high-dimensionality problems.
To mitigate this issue, I have chosen to target-encode the categorical variable. This method replaces each category with the mean of the target for that group. Its major disadvantages are target snooping (deriving the values of the predictor variable from the target) and overfitting; both can be mitigated by cross-validation.
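The cross-validated variant mentioned above can be sketched in base R: each row's encoding is computed only from the other folds, so a row's own target never leaks into its predictor. (Toy data and column names; the report itself uses the build_target_encoding/target_encode helpers.)

```r
# Out-of-fold target encoding: rows in fold i are encoded with category
# means computed from the remaining folds only (guards against snooping).
set.seed(42)
df <- data.frame(
  location = sample(c("UK", "US", "DE"), 60, replace = TRUE),
  target   = runif(60)
)
k <- 5
fold <- sample(rep(1:k, length.out = nrow(df)))
df$loc_enc <- NA_real_
for (i in 1:k) {
  means <- tapply(df$target[fold != i], df$location[fold != i], mean)
  df$loc_enc[fold == i] <- means[as.character(df$location[fold == i])]
}
head(df)
```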
target_encoding <- build_target_encoding(mice_imputed,
cols_to_encode = "location",
target_col = "target",
functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_train <- target_encode(mice_imputed, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_train <- imp_train %>% select(-location) %>%
rename(Location = target_mean_by_location)
Standardization is normally preferred when the data is normally distributed; otherwise, normalization is the safer bet. In our case the data does not follow a bell-shaped curve, and the magnitudes of the variables are not uniform, so normalization is the right approach. Without it, the model would give more weight to higher-magnitude variables than to lower ones, which may not reflect their true importance. Normalizing the data yields a more comparable feature set and improves computation speed; ultimately, all values are scaled into the range [0,1]. Models such as simple linear regression and the lasso benefit from this transformation because they are sensitive to the scale of the features.
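The scaling applied here can be written out explicitly. A minimal hand-rolled min-max scaler, assuming the usual (x - min) / (max - min) definition used by normalize():

```r
# Min-max scaling: map each value into [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(10, 25, 40, 100)
round(min_max(x), 4)   # 0.0000 0.1667 0.3333 1.0000
```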
#Normalizing the train set
imp_train <- normalize(imp_train)
I have considered running simple linear regression, lm(), as a baseline model for preliminary analysis.
summary(lm(target~., data = imp_train))
##
## Call:
## lm(formula = target ~ ., data = imp_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.123370 -0.014306 0.000174 0.017982 0.205299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.119763 0.002727 -43.912 <2e-16 ***
## scores_teaching 0.303217 0.009452 32.079 <2e-16 ***
## scores_research 0.379514 0.010663 35.590 <2e-16 ***
## scores_citations 0.363245 0.003847 94.435 <2e-16 ***
## scores_industry_income 0.037436 0.003730 10.036 <2e-16 ***
## scores_international_outlook 0.065518 0.004076 16.074 <2e-16 ***
## stats_number_students 0.010116 0.017964 0.563 0.573
## stats_student_staff_ratio 0.010855 0.018596 0.584 0.559
## stats_pc_intl_students -0.013347 0.009468 -1.410 0.159
## female_male_ratio -0.008672 0.021402 -0.405 0.685
## no_subjects_offered -0.001712 0.002828 -0.605 0.545
## Location 0.008796 0.007059 1.246 0.213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0314 on 1993 degrees of freedom
## Multiple R-squared: 0.9755, Adjusted R-squared: 0.9754
## F-statistic: 7215 on 11 and 1993 DF, p-value: < 2.2e-16
vif(lm(target~., data = imp_train))
## scores_teaching scores_research
## 4.122613 6.553424
## scores_citations scores_industry_income
## 2.066576 2.578234
## scores_international_outlook stats_number_students
## 2.180351 1.438249
## stats_student_staff_ratio stats_pc_intl_students
## 1.208819 1.910466
## female_male_ratio no_subjects_offered
## 1.005252 1.353404
## Location
## 1.868093
From the baseline model we can interpret that all the score variables are statistically significant, while location and the remaining predictors are not. Although the model appears strong, with an \(R^2\) value of 0.97, ordinary least squares is sensitive to outliers, which can heavily influence the coefficients and predictions. The vif(model) output shows that scores_research has a VIF of about 6.6; this exceeds 5 but is not a grave concern, since some sources accept any VIF below 10.
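For reference, the VIF values reported above can be reproduced from the definition: regress each predictor on the remaining predictors and compute 1 / (1 - R²). A toy sketch with two artificially correlated predictors:

```r
# VIF from first principles: for predictor x1, fit x1 ~ (other predictors)
# and take 1 / (1 - R^2). Values near 1 indicate little collinearity.
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
round(vif_x1, 2)   # noticeably above 1, reflecting the built-in correlation
```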
Since the train set contains plausible outliers, model selection was done accordingly. The caret package offers many regression models through a common interface:
train(target~., data = imp_train, method = "model")
The chosen models and the reasoning behind them are given below.
Robust Linear Regression : Robust linear regression resists outliers because it uses M-estimation to down-weight their influence. Unlike traditional linear regression, this model adapts to the skewness of the distribution and thereby provides a more reliable analysis (Yang, Gallagher and McMahan, 2018).
Support Vector Machine : An SVM with a linear kernel is used to fit the data, since extreme outliers do not significantly affect the decision boundary.
Random Forest : A random forest is essentially an average of multiple decision trees, and tree-based algorithms are generally robust to outliers. Since overfitting is one of our concerns, random forests are also known for mitigating it (Jaiswal and Samikannu, 2017).
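The down-weighting behaviour motivating the RLM choice can be seen directly: MASS's rlm() (which caret's "rlm" method wraps) records per-observation weights, and a gross outlier receives a weight far below 1. A small sketch on simulated data:

```r
library(MASS)

# Huber M-estimation: observations with large residuals receive weights
# below 1, shrinking their influence on the fitted coefficients.
set.seed(7)
x <- 1:20
y <- 2 * x + rnorm(20)
y[20] <- 200                 # inject a gross outlier
fit <- rlm(y ~ x)            # Huber psi by default
round(fit$w[20], 3)          # the outlier's weight, far below 1
```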
set.seed(777)
tic("Robust Linear Regression : ")
rlm <- train(target~., data = imp_train, method = "rlm")
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
toc()
## Robust Linear Regression : : 3.23 sec elapsed
tic("Support Vector Machine : ")
svm <- train(target~., data = imp_train, method = "svmLinear")
toc()
## Support Vector Machine : : 10.95 sec elapsed
tic("Random Forest : ")
rf <- train(target~., data = imp_train, method = "rf")
toc()
## Random Forest : : 352.03 sec elapsed
models <- resamples(list(RLM = rlm,
SVM = svm,
RF = rf))
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models, scales=scales, metric = c("RMSE", "Rsquared"),
pch = '|', do.out=FALSE)
Analysis of these untuned models and their diagnostics yields the following observations: from summary(svm) it was found that the tuning parameter C was held constant at 1, meaning the model had enough information in the first iteration to detect the underlying pattern. Nevertheless, with cross-validation and fine-tuning, these models can reduce computational cost, increase interpretability, and reduce overfitting (Awad and Khanna, 2015).
set.seed(777)
# Since the data set is small, use 2 repeats of 5-fold cross-validation
repeat_cv <- trainControl(method='repeatedcv', number=5, repeats=2)
#tuneGrid for rlm
tuneGrid_rlm <- data.frame(intercept = TRUE,
psi = c("psi.huber", "psi.hampel", "psi.bisquare"))
tic("Tuned Robust Linear Regression : ")
rlm_tuned <- train(target~., data = imp_train, method = "rlm",
trControl = repeat_cv, tuneGrid = tuneGrid_rlm)
toc()
## Tuned Robust Linear Regression : : 0.59 sec elapsed
#tuneGrid for svm
tuneGrid_svm <- expand.grid(C = c(0.01,0.1,1))
tic("Tuned Support Vector Machine : ")
svm_tuned <- train(target~., data = imp_train, method = "svmLinear",
trControl = repeat_cv, tuneGrid = tuneGrid_svm)
toc()
## Tuned Support Vector Machine : : 4.35 sec elapsed
# setting number of trees as 10 X number of predictors
tic("Tuned Random Forest : ")
rf_tuned <- train(target~., data = imp_train, method = "rf",
trControl=repeat_cv, ntree = 170)
toc()
## Tuned Random Forest : : 37.2 sec elapsed
tuned_models <- resamples(list(RLM_tuned = rlm_tuned,
SVM_tuned = svm_tuned,
RF_tuned = rf_tuned))
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(tuned_models, scales=scales,
metric = c("RMSE", "Rsquared"),
pch = '|', do.out=FALSE)
Tuning these models has evidently reduced the time taken to fit the data and produced better metrics as well. The tuned SVM is observed to perform better than the tuned RLM, reversing the pre-tuning ordering; in both cases, however, random forest achieved better metrics than the other two.
final_model <- rf_tuned$finalModel
print(final_model)
##
## Call:
## randomForest(x = x, y = y, ntree = 170, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 170
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.0008266865
## % Var explained: 97.93
print(final_model$results)
## NULL
#plot(rf_tuned$finalModel)
From the variable importance plot we infer that the most influential factors are the research and citation scores, implying that these variables contribute most to the predictive accuracy of the model. The least influential variables, in increasing order of importance, are: number of subjects offered, female-to-male ratio, and number of students.
final_model$importance
## IncNodePurity
## scores_teaching 7.4539695
## scores_research 39.0937028
## scores_citations 24.9071965
## scores_industry_income 6.0043165
## scores_international_outlook 1.0920681
## stats_number_students 0.1762496
## stats_student_staff_ratio 0.2361783
## stats_pc_intl_students 0.2630562
## female_male_ratio 0.1508004
## no_subjects_offered 0.1271302
## Location 0.5641116
var_imp <- varImp(rf_tuned, scale=FALSE)$importance
var_imp <- data.frame(variables=row.names(var_imp), importance=var_imp$Overall)
var_imp %>%
arrange(importance) %>%
ggplot(aes(x=reorder(variables, importance), y=importance)) +
geom_bar(stat='identity') + coord_flip() +
ggtitle("Variable Importance Plot") +
ylab("Importance") + xlab("Variables")
The basic pipeline for this project is, in order: imputation → encoding → normalization.
Performing the same functions on the test set and fitting our final model, we can predict and measure the goodness of fit of the model.
#Mice impute
quick_pred <- quickpred(testset, exclude = 'location')
imp_test <- mice(testset, method = "cart", pred = quick_pred, m = 10)
##
## iter imp variable
## (mice iteration log — 5 iterations × 10 imputations, each over scores_teaching, scores_research, scores_citations, scores_industry_income, scores_international_outlook, female_male_ratio and target — repetitive lines omitted)
imp_test <- complete(imp_test)
#Encoding
target_encoding <- build_target_encoding(imp_test,
cols_to_encode = "location",
target_col = "target",
functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_test <- target_encode(imp_test, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_test <- imp_test %>% select(-location) %>%
rename(Location = target_mean_by_location)
#normalizing
imp_test <- normalize(imp_test)
#predicting
obs <- imp_test$target
pred <- predict(object=rf_tuned, newdata=imp_test[, -11])
output <- data.frame(obs, pred)
ggplot(output, aes(obs,pred)) + geom_point() + geom_smooth() +
labs(x = "Observed") + labs(y = "Predicted") + theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
rmse <- round(RMSE(pred = pred, obs = obs),3)
mae <- round(MAE(pred = pred, obs = obs),3)
r_squared <- round(R2(pred = pred, obs = obs),3)
sprintf("R2 value is %s", r_squared)
## [1] "R2 value is 0.971"
sprintf("Root mean squared error is %s", rmse)
## [1] "Root mean squared error is 0.035"
sprintf("Mean absolute error is %s", mae)
## [1] "Mean absolute error is 0.025"
University rankings are seen as a meaningful representation of academic excellence and institutional reputation, and world university rankings have gradually drawn greater attention in international and comparative higher education. Using the random forest regression algorithm, we were able to predict the overall score of a university with good accuracy; the model explained a large portion of the variability in the observed data.
Ranking through the selection of a particular set of indicators, each allotted a weight, effectively defines the quality of education offered among the competing institutions, while other valid indicators may have been ignored in the scoring process. Although most of the indicators were strongly correlated, only a few were given substantial importance; as observed, the research score carried the most weight. This information can help universities grow and work on their weak aspects together with their industry partners: they can improve infrastructure and curricula to attract more students, introduce internship programs, or increase the number of research collaborations. The ranking also helps prospective freshers filter universities by location or any other indicator.
Global rankings only make sense if the indicators are appropriate; otherwise, institutions and governments risk jeopardizing the transformation of their higher education systems. These metrics should be collected, recorded, and analyzed by a central body at a global level. Finally, prospective students should not use this data as the sole guide for choosing a university, but should seek additional information before selecting the best fit for them.
University Ranking and Evaluation: Trend and Existing Approaches. (2012). International Journal of Advancements in Computing Technology, 4(5), pp.10–16. doi:https://doi.org/10.4156/ijact.vol4.issue5.2.
Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of translational medicine, [online] 4(1), p.9. doi:https://doi.org/10.3978/j.issn.2305-5839.2015.12.38.
Gomer, B. (2019). MCAR, MAR, and MNAR Values in the Same Dataset: A Realistic Evaluation of Methods for Handling Missing Data. Multivariate Behavioral Research, 54(1), pp.153–153. doi:https://doi.org/10.1080/00273171.2018.1557033.
De, A., Antonio, J., Javier, F. and Gibran, H. (2020). The World University Rankings Model Validation and a Top 50 Universities Predictive Model. doi:https://doi.org/10.1109/iccais48893.2020.9096841.
Yang, T., Gallagher, C. and McMahan, C.S. (2018). A robust regression methodology via M-estimation. Communications in Statistics - Theory and Methods, 48(5), pp.1092–1107. doi:https://doi.org/10.1080/03610926.2018.1423698.
Jaiswal, J.K. and Samikannu, R. (2017). Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression. [online] IEEE Xplore. doi:https://doi.org/10.1109/WCCCT.2016.25.
Vitenko, T., Shanaida, V., Dróździel, P. and Madleňák, R. (2018). Assessment of Higher Education in Global Measurement. INTED Proceedings. doi:https://doi.org/10.21125/inted.2018.0787.
Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1.
Estrada-Real, A.C., Cantu-Ortiz, F.J.(2022). A data analytics approach for university competitiveness: the QS world university rankings. Int J Interact Des Manuf 16, 871–891. https://doi.org/10.1007/s12008-022-00966-2.
Taylor, P. and Braddock, R. (2007). International University Ranking Systems and the Idea of University Excellence. Journal of Higher Education Policy and Management, 29(3), pp.245–260. doi:https://doi.org/10.1080/13600800701457855.