knitr::opts_chunk$set(echo=TRUE)

# Packages used throughout this report
library(dplyr)           # select, case_when, pipes, group_by/summarise
library(tibble)          # rownames_to_column
library(ggplot2)         # plots
library(gt)              # formatted tables
library(caret)           # nearZeroVar, createDataPartition, preProcess
library(mice)            # multiple imputation
library(naniar)          # vis_miss, mcar_test
library(ggstatsplot)     # ggcorrmat
library(dataPreparation) # build_target_encoding, target_encode
library(BBmisc)          # assumed source of normalize() used below

INTRODUCTION

Ranking universities is a fundamental issue not only for students and academics but also for university authorities, industry and even government. Global university rankings depend on several key factors such as teaching, research, citations, international outlook and industry income. Prospective students rely on them when applying to specific universities based on upcoming ranks. There are mainly two types of ranking system, global and national, each containing several ranking schemes based on location. The global ranking system consists of nine ranking schemes, of which the most widely used is the THE-QS World University Ranking, first published in England in 2004 (University Ranking and Evaluation: Trend and Existing Approaches, 2012).

University authorities should assess their institution's upcoming rank so they can identify fields to improve and upgrade their standards relative to others. Industry and even government should try to forecast university rankings in order to provide grants to the most eligible institutions for the upcoming year (Taylor and Braddock, 2007; Vitenko et al., 2018). Understanding the criteria and mechanics that assess the quality of education is a key activity for improving current educational models and proposing new educational strategies.

This project aims to demonstrate how data analysis techniques and machine learning algorithms can predict the overall score of a university, and to analyze the predictive features. The raw data collected has 2673 rows and 17 variables, of which overall_score is our target variable. Since the target variable is numerical, regression techniques will be used for predictive modeling.

file_path <- "C:/Users/vishw/OneDrive/Desktop/NU/KL7010 - Principle of DS/World University Rankings.csv"
raw_data <- read.csv(file_path, header = TRUE)
#str(raw_data)

Have you heard of the French term “mise en place”? It is crucial for every chef to understand: it means “everything in its place”, and the goal of this culinary practice is to have everything ready to go when it comes time to cook, bake, or assemble the final dish. Similarly, data preprocessing is the part of data preparation that ensures the raw data is ready for further processing.

Data preprocessing

It is a fundamental step in the data mining process. The collected raw data is never in an ideal format where analysis can be done right away. Data type conversion (type casting) and variable transformation (unit conversion) are a few preprocessing steps that ensure the collected data is in a suitable format for analysis and model building. This improves consistency, data visualization and interpretation, making the data more meaningful and thereby leading to better insights and decision making.

df <- raw_data #getting a copy of the data

# Remove thousands separators, e.g. "21,000" -> "21000"
df$stats_number_students <- gsub(",", "", df$stats_number_students)

# Convert "25%" -> 0.25
df$stats_pc_intl_students <- gsub("%", "", df$stats_pc_intl_students)
df$stats_pc_intl_students <- as.numeric(df$stats_pc_intl_students)/100 

# Drop the ":00" suffix, then turn "48:52" into the numeric ratio 48/52
df$stats_female_male_ratio <- gsub(":00", "", df$stats_female_male_ratio)

df$female_male_ratio <- sapply(strsplit(df$stats_female_male_ratio, ":"),
                               function(x) as.numeric(x[1])/as.numeric(x[2]))

# Count subjects from the comma-separated subjects_offered string
df$no_subjects_offered <- sapply(strsplit(as.character(df$subjects_offered),','), 
                                 function(x) length(x))

# overall_score may be a range such as "10.5–21.1"; take its midpoint
df$target <- sapply(strsplit(df$overall_score, '–'),
                    function(x) mean(as.numeric(x), na.rm=TRUE))


df$no_subjects_offered <- as.integer(df$no_subjects_offered)

df$member_level <- as.factor(df$member_level)
df$unaccredited <- as.factor(df$unaccredited)

df$stats_number_students <- as.integer(df$stats_number_students)
# Convert the share of international students into a head count
df$stats_pc_intl_students <- df$stats_pc_intl_students * df$stats_number_students

df$scores_teaching <- as.numeric(df$scores_teaching)
df$scores_research <- as.numeric(df$scores_research)
df$scores_citations <- as.numeric(df$scores_citations)
df$scores_industry_income <- as.numeric(df$scores_industry_income)
df$scores_international_outlook <- as.numeric(df$scores_international_outlook)


#length(unique(df$location))  #127 countries
countries <- unique(df$location) 

North_America = c("Canada", "United States", "Turks and Caicos Islands", "Mexico", "Puerto Rico", "Jamaica", "Costa Rica", "Cuba", "Dominican Republic")

South_America = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela")

Europe = c("Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Georgia", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy", "Kosovo", "Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta", "Moldova", "Monaco", "Montenegro", "Netherlands", "North Macedonia", "Norway", "Poland", "Portugal", "Romania", "Russian Federation", "San Marino", "Serbia", "Slovakia", "Slovenia", "Spain", "Sweden", "Switzerland", "Ukraine", "United Kingdom")

Africa = c("Algeria", "Angola", "Botswana", "Egypt", "Ethiopia", "Ghana", "Kenya", "Libya", "Malawi", "Mauritius", "Morocco", "Mozambique", "Namibia", "Nigeria", "Rwanda", "Senegal", "Somalia", "South Africa", "Sudan", "Tanzania", "Tunisia", "Uganda", "Zambia", "Zimbabwe")

Middle_East = c("Bahrain", "Iran", "Iraq", "Israel", "Jordan", "Kuwait", "Lebanon", "Northern Cyprus", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria", "Turkey", "United Arab Emirates", "Yemen")

Asia = c("Armenia", "Azerbaijan", "Bangladesh", "Brunei Darussalam", "Cambodia", "China", "Hong Kong", "India", "Indonesia", "Japan", "Kazakhstan", "Kyrgyzstan", "Laos", "Macao", "Malaysia", "Mongolia", "Myanmar", "Nepal", "Pakistan", "Philippines", "Singapore","South Korea", "Sri Lanka", "Taiwan", "Tajikistan", "Thailand", "Turkmenistan", "Uzbekistan", "Vietnam")

Oceania = c("Australia", "Fiji", "New Zealand")

df$Region = case_when(df$location %in% Asia ~ "Asia",
                      df$location %in% Africa ~ "Africa",
                      df$location %in% Europe ~ "Europe",
                      df$location %in% Middle_East ~ "Middle_East",
                      df$location %in% North_America ~ "North_America",
                      df$location %in% South_America ~ "South_America",
                      df$location %in% Oceania ~ "Oceania")

df1 <- df
df <- select(df, -c(13,14,17))

#str(df)

#summary(df)

A few basic insights before performing rigorous data analysis. These insights can be valuable to prospective students and institutions. Table 1 shows the top 10 universities in the world.

Top_uni <- df %>% arrange(desc(target)) %>% 
  select(c('name', 'location', 'target'))


head(Top_uni, 10) %>% 
  gt() %>% 
  cols_label(
  name = md("***Universities***"),
  location = md("***Location***"),
  target = md("***Overall Score***")) %>%
  tab_header(title = md("Table 1 :- *Top 10 Universities*"))
Table 1 :- Top 10 Universities
Universities Location Overall Score
University of Oxford United Kingdom 98.5
Stanford University United States 98.0
Massachusetts Institute of Technology United States 97.9
Harvard University United States 97.8
University of Cambridge United Kingdom 97.5
Princeton University United States 96.9
California Institute of Technology United States 96.5
Imperial College London United Kingdom 95.1
University of California, Berkeley United States 94.6
Yale University United States 94.2

From Table 2 we infer that Computer Science and Business & Management seem to be the most widely offered courses among universities worldwide.

subject_list <- strsplit(as.character(df1$subjects_offered),',')
l1 <- table(unlist(subject_list))
subjects <- data.frame(Subject = names(l1), Offered_by = as.numeric(l1))

arrange(subjects, desc(Offered_by)) %>% head(6) %>% 
  gt() %>%
  cols_label(
  Subject = md("***Subject***"),
  Offered_by = md("***No. of Unis***")) %>%
  tab_header(title = md("Table 2 :- *Most offered courses*"))
Table 2 :- Most offered courses
Subject No. of Unis
Computer Science 2277
Business & Management 2267
Accounting & Finance 1965
Biological Sciences 1946
Mathematics & Statistics 1914
Literature & Linguistics 1913
df_filtered <- df1[!is.na(df$target), ] #Considering only ranked universities

ggplot(df_filtered, aes(Region, fill = Region)) + 
  geom_bar(color = "black",show.legend = F) + 
  theme_minimal() + scale_fill_brewer(palette="Blues") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The above graph and Tables 3, 4, 5 and 6 illustrate that Europe and Asia have the highest numbers of ranked universities, followed by the North American region. Universities in the UK admitted the highest number of international students, followed by the USA and Australia. Europe, and the UK in particular, seems to be a preferred region of study among international students, even though there is a greater number of ranked universities in the US than in the UK.

intl.students <- df %>% arrange(desc(stats_pc_intl_students)) %>% 
  select(Region, stats_pc_intl_students ) %>% 
  group_by(Region) %>% 
  summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T)) 

arrange(intl.students, desc(sum_intl)) %>% 
  gt() %>%
  cols_label(
  Region = md("***Region***"),
  sum_intl = md("***International Students***")) %>%
  tab_header(title = md("Table 3 :- *No. of International Students (Region)*"))
Table 3 :- No. of International Students (Region)
Region International Students
Europe 2003036.49
North_America 710247.36
Asia 545708.58
Middle_East 382577.76
Oceania 295841.18
Africa 142760.70
South_America 48438.52
uni_loc <- group_by(df_filtered, location) %>% summarise(count = n()) 
uni_loc %>% arrange(desc(count), .by_group = T) %>% 
  head(10) %>%
  gt() %>%
  cols_label(
  location = md("***Country***"),
  count = md("***Ranked Unis***")) %>%
  tab_header(title = md("Table 4 :- *Sum of ranked universities*"))
Table 4 :- Sum of ranked universities
Country Ranked Unis
United States 169
Japan 119
United Kingdom 104
India 91
China 86
Russian Federation 78
Turkey 75
Iran 73
Brazil 56
Italy 56
students <- df %>% arrange(desc(stats_number_students)) %>% 
  select(location, stats_number_students ) %>% 
  group_by(location) %>% 
  summarise(students = sum(stats_number_students, na.rm = T))

arrange(students, desc(students)) %>% head(10) %>% 
  gt() %>%
  cols_label(
  location = md("***Country***"),
  students = md("***Sum of Students***")) %>%
  tab_header(title = md("Table 5 :- *Total Sum of Students (location)*"))
Table 5 :- Total Sum of Students (location)
Country Sum of Students
United States 3985888
China 2818642
Egypt 2423510
Turkey 2105034
United Kingdom 2093040
Brazil 1707217
Japan 1460908
Algeria 1354183
Mexico 1303044
Spain 1291193
intl_students <- df %>% arrange(desc(stats_pc_intl_students)) %>% 
  select(location, stats_pc_intl_students ) %>% 
  group_by(location) %>% 
  summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T))

arrange(intl_students, desc(sum_intl)) %>% head(10) %>% 
  gt() %>%
  cols_label(
  location = md("***Country***"),
  sum_intl = md("***International Students***")) %>%
  tab_header(title = md("Table 6 :- *No. of International Students (location)*"))
Table 6 :- No. of International Students (location)
Country International Students
United Kingdom 654516.2
United States 481297.0
Australia 258599.7
Canada 205261.8
France 192720.9
Russian Federation 164572.9
Germany 163791.9
Turkey 162938.2
China 134049.0
Spain 130483.3

Analyzing summary(data), it was found that closed was a constant variable, and only one of the 2673 unaccredited values was TRUE, which makes them zero-variance or near-zero-variance variables. Using the nearZeroVar(data) function from the caret package we can weed out such variables. Apart from those mentioned above, record_type and member_level also had near-zero variance. Hence these variables need not be considered as predictors: they are uninformative features and can reduce the performance of our model.

#Removing these variables as they are not needed for further analysis
df <- select(df, -c('name', 'Region'))

nzv <- nearZeroVar(df, saveMetrics = T)
nzv %>% rownames_to_column(var = "variables") %>% 
  gt() %>% 
  cols_label(
  variables = md("***variables***"),
  freqRatio = md("***freqRatio***"),
  percentUnique = md("***percentUnique***"),
  zeroVar = md("***zeroVar***"),
  nzv = md("***nzv***")) %>%
  tab_header(title = md("Table 7 :- *NZV table*")) %>%
  tab_style_body(
    style = cell_fill(color = "lightblue"),
    fn = function(x) isTRUE(x))
Table 7 :- NZV table
variables freqRatio percentUnique zeroVar nzv
scores_teaching 1.230769 17.69547325 FALSE FALSE
scores_research 1.050000 18.93004115 FALSE FALSE
scores_citations 1.285714 30.26561915 FALSE FALSE
scores_industry_income 1.866667 25.73887018 FALSE FALSE
scores_international_outlook 1.222222 27.12308268 FALSE FALSE
record_type 23.669811 0.11223345 FALSE TRUE
member_level 25.428571 0.14964459 FALSE TRUE
location 1.035503 4.75121586 FALSE FALSE
stats_number_students 1.000000 96.29629630 FALSE FALSE
stats_student_staff_ratio 1.083333 16.76019454 FALSE FALSE
stats_pc_intl_students 242.500000 81.25701459 FALSE FALSE
closed 0.000000 0.03741115 TRUE TRUE
unaccredited 2672.000000 0.07482230 FALSE TRUE
female_male_ratio 1.081081 2.58136925 FALSE FALSE
no_subjects_offered 1.112000 1.38421250 FALSE FALSE
target 1.286645 5.94837262 FALSE FALSE
df <- df[, -nearZeroVar(df)] # removing the nzv and zv variables

Why is it essential to split the data before performing EDA?

The train-test split is done prior to EDA to avoid data leakage. The basic idea of prediction is to use the model on unknown or unseen data, which can only be achieved by splitting the data before further rigorous analysis. Using createDataPartition(column, p = 0.75) from the caret package, 75% of the data was assigned to the train set and 25% to the test set.

set.seed(777)
index <- createDataPartition(df$stats_number_students, p=0.75, list=F)
trainset <- df[index,]
testset <- df[-index,]

EXPLORATORY DATA ANALYSIS

This is a fundamental step in analyzing the data to gain insights, detect trends, and understand the distribution and structure of the data. It can help identify outliers and null values, thereby increasing the quality of the data for model building. Understanding these factors helps in choosing the right prediction model. Furthermore, it helps us understand the relationships between the variables and informs feature selection.

To understand the distribution of the predictor variables, histograms were plotted. Clearly the variables do not follow a normal distribution, and the scatter plots reveal that the scores are indeed linearly related to the target variable. A correlation plot, produced with ggcorrmat(data, type = "nonparametric"), helped examine the strength of the relationship between the continuous predictors and the target. Had the data followed a normal distribution, the parametric type (Pearson coefficient) would have been used; otherwise Spearman or Kendall correlation is appropriate.
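As a self-contained illustration of the nonparametric choice, the matrix behind a plot like ggcorrmat(type = "nonparametric") is a Spearman rank correlation matrix, which base R computes directly. The toy variables below are illustrative, not taken from the ranking data:

```r
# Spearman correlation: the nonparametric coefficient ggcorrmat displays
set.seed(7)
teaching <- runif(50, 10, 90)                   # a score-like predictor
overall  <- 0.6 * teaching + rnorm(50, sd = 5)  # roughly linear in teaching
ratio    <- runif(50, 5, 40)                    # unrelated noise

round(cor(cbind(teaching, overall, ratio), method = "spearman"), 2)
```

Because Spearman works on ranks, it captures the monotonic association without assuming normality, which matches the skewed distributions seen in the histograms.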

#teaching
par(mfrow = c(1,2))
hist(trainset$scores_teaching, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'teaching score')

scatter.smooth(trainset$scores_teaching, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'teaching score', ylab = 'overall score')

#research
par(mfrow = c(1,2))
hist(trainset$scores_research, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'research score')
scatter.smooth(trainset$scores_research, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'research score', ylab = 'overall score')

#citation
par(mfrow = c(1,2))
hist(trainset$scores_citations, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'citation score')
scatter.smooth(trainset$scores_citations, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'citation score', ylab = 'overall score')

#industry income
par(mfrow = c(1,2))
hist(trainset$scores_industry_income, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'industry income score')
scatter.smooth(trainset$scores_industry_income, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'industry income score', ylab = 'overall score')

#international outlook
par(mfrow = c(1,2))
hist(trainset$scores_international_outlook, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'international outlook score')
scatter.smooth(trainset$scores_international_outlook, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'international outlook score', ylab = 'overall score')

The correlation plot reveals that most performance indicators display a linear relationship with the target (overall score), except stats_student_staff_ratio. Most of the correlation coefficients are significant at the alpha = 0.05 level, implying the scores are strongly correlated with the target. Among the predictors, scores_research and scores_industry_income are highly correlated with each other, and only female_male_ratio and stats_student_staff_ratio are weakly correlated with the target.

A weak correlation does not necessarily mean that these variables are unimportant. Also, multicollinearity can lead to an unstable model, but since these are the scores used to calculate the target (overall score), retaining them is the wiser choice. Nevertheless, further analysis can explain the impact of these variables.

OUTLIER DETECTION

# Box-plot of scores
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[1], width = 2, xlab = 'Teaching')
boxplot(trainset[2], width = 2, xlab = 'Research')
boxplot(trainset[3], width = 2, xlab = 'Citation')
boxplot(trainset[4], width = 2, xlab = 'Ind.income')
boxplot(trainset[5], width = 2, xlab = 'Intl.outlook')
mtext("Box-plots of the scores", side = 3,line = -2, outer = TRUE)

#Box-plots of other indicators
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[7], width = 2, xlab = 'no.stu')
boxplot(trainset[8], width = 2, xlab = 'stu:staff')
boxplot(trainset[9], width = 2, xlab = '%intl students')
boxplot(trainset[10], width = 2, xlab = 'F:M')
boxplot(trainset[11], width = 2, xlab = 'no.subjects')
mtext("Box-plots of other indicators", side = 3,line = -2, outer = TRUE)

The presence of outliers can impact model performance, as these are extreme or possibly incorrectly recorded values. The box plots above reveal the distribution of the predictor variables. A common practice is to remove such values because they can distort model performance and regression coefficients. In this case, however, the values seem plausible: a university score of 100 or 10 (the extremes here) is credible, the university simply being extremely good or bad.

Table 8 :- Effect of outliers on mean & sd
Row Mean Mean_x SD SD_x mean.diff sd.diff
teaching_x 29.079 26.968 13.852 9.858 2.111 3.993
research_x 23.367 20.472 16.612 11.300 2.895 5.312
students_x 18333.514 15111.161 22596.263 11552.016 3222.353 11044.247
stu.staf_x 19.110 17.163 13.811 7.287 1.947 6.524
intl_stud 1522.920 830.047 2569.564 1017.288 692.873 1552.276
f.m_x 1.361 1.122 3.374 0.447 0.239 2.927

Table 8 shows the effect of outliers on the mean and standard deviation of the train set. The difference before and after removing outliers is substantial, suggesting that the outliers do influence these statistics. However, the magnitude of the difference alone is not sufficient to decide: there is a risk of losing valuable information if we exclude these values. A way around this is to choose a model robust to outliers.
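The effect-of-outliers table above can be reproduced by comparing each variable's mean and standard deviation before and after dropping values beyond the usual 1.5 × IQR box-plot whiskers. A minimal base-R sketch on toy data (the helper name outlier_effect is illustrative, not part of the original analysis):

```r
# Compare mean/sd of a numeric vector with and without box-plot outliers
# (values beyond 1.5 * IQR from the quartiles); "_x" = outliers excluded.
outlier_effect <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * diff(q)
  keep  <- x >= q[1] - fence & x <= q[2] + fence
  c(mean   = mean(x, na.rm = TRUE),
    mean_x = mean(x[keep], na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    sd_x   = sd(x[keep], na.rm = TRUE))
}

set.seed(1)
toy <- c(rnorm(100, mean = 20, sd = 5), 120)  # one extreme value
round(outlier_effect(toy), 3)
```

Applying such a helper column by column to the train set yields the paired "with/without" statistics shown in the table.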

HANDLING NULL VALUES

From the vis_miss() plot we clearly see that scores are missing. A possible reason is that the university did not meet the ranking threshold, or the score was simply not calculated for less reputed institutions.

#miss_var_summary(trainset)
vis_miss(trainset)

mcar_test(trainset %>% select(stats_pc_intl_students, female_male_ratio, 
                                scores_teaching,scores_research, 
                                scores_citations,scores_industry_income,
                                scores_international_outlook,target))
statistic df p.value missing.patterns
252.7807 11 0 6

The calculated p-value from the MCAR test is less than 0.05, so the null hypothesis is rejected: the missing data is not MCAR (i.e., it is either MAR or MNAR). Had the data been missing completely at random (MCAR), complete case analysis (dropping all rows with null values) would have been the right approach. Given the size of the train set, dropping these values would risk losing the representativeness of the data and overfitting the model (Zhang, 2016; Gomer, 2019).

The missing values can be predicted from the observed values through imputation. I have used bagImpute from the caret package and cart imputation from the mice package, and compared their distributions with the original data. The bagImpute method fits bagged regression trees to impute the data, while cart uses classification and regression trees.

imp <- preProcess(trainset %>% select(-location), method = c("bagImpute"))
imputed <- predict(imp, newdata = trainset)

#Mice impute
quick_pred <- quickpred(trainset, exclude = c('location', 'Region'))
mice_imp <- mice(trainset, method = "cart", pred = quick_pred, m = 10,
                 printFlag = FALSE) # suppress the long per-iteration log
mice_imputed <- complete(mice_imp)
# Check whether the distribution has been altered by imputing/dropping null values

ggplot() +
  geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
  geom_density(aes(scores_teaching, fill = "imputed"), alpha = .2, data = imputed) +
  scale_fill_manual(name = "datasets", values = c(Original = "red", imputed = "blue"))

ggplot() +
  geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
  geom_density(aes(scores_teaching, fill = "mice_imputed"), alpha = .2, data = mice_imputed) +
  scale_fill_manual(name = "datasets", values = c(Original = "red", mice_imputed = "green"))

It is apparent from the density plots above that the cart-imputed data better retains the original distribution of the train set. The cart-imputed data will therefore be used for the next step.

ENCODING

Most statistical models only take numerical inputs. Depending on the type of categorical data, different encoding techniques are preferred: label (ordinal) encoding is used for ordinal data, while one-hot encoding is preferred for nominal data. Since our location variable has many classes (127 locations), one-hot encoding would produce a sparse matrix and lead to high-dimensionality problems.

To mitigate this issue, I chose to target encode the categorical variable. This method replaces each category with the mean value of the target within that group. A major disadvantage is target leakage (deriving the values of the predictor from the target itself) and overfitting; these problems can be reduced by performing the encoding within cross-validation folds.
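One way to realize the cross-validated variant is out-of-fold target encoding, where each row's category mean is computed only from the other folds. A base-R sketch (the function name oof_target_encode and the toy data are illustrative; this is not the dataPreparation API used below):

```r
# Out-of-fold target encoding: each row is encoded with the mean target
# of its category computed on the *other* folds, limiting target leakage.
oof_target_encode <- function(category, target, k = 5) {
  folds   <- sample(rep(seq_len(k), length.out = length(target)))
  encoded <- numeric(length(target))
  for (f in seq_len(k)) {
    in_fold <- folds == f
    means   <- tapply(target[!in_fold], category[!in_fold], mean)
    encoded[in_fold] <- means[as.character(category[in_fold])]
  }
  # Categories unseen outside a fold fall back to the global mean
  encoded[is.na(encoded)] <- mean(target)
  encoded
}

set.seed(42)
loc <- sample(c("UK", "US", "JP"), 60, replace = TRUE)
y   <- rnorm(60, mean = c(UK = 70, US = 65, JP = 60)[loc], sd = 3)
enc <- oof_target_encode(loc, y)
```

Each encoded value then approximates its group's mean target without ever using the row's own target, which is what plain target encoding cannot guarantee.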

target_encoding <- build_target_encoding(mice_imputed,
                                         cols_to_encode = "location",
                                         target_col = "target",
                                         functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_train <- target_encode(mice_imputed, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_train <- imp_train %>% select(-location) %>% 
  rename(Location = target_mean_by_location)

NORMALIZATION or STANDARDIZATION

Standardization is normally preferred when the data is normally distributed; otherwise, normalization is the safer bet. In our case the data does not follow a bell-shaped curve, and the variables differ considerably in magnitude, so normalization is the right approach. Without scaling, a model may give more weight to higher-magnitude variables, which is not always justified. Normalizing the data yields a more comprehensive model, improves computation speed, and scales all values into the range [0,1]. Models with coefficient penalties, such as lasso, particularly benefit because the penalty then treats all features on a comparable scale.
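The arithmetic behind this kind of scaling can be sketched as follows (the report itself relies on normalize(); min_max here is a hypothetical helper):

```r
# Min-max scaling: map each value of x into [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(2, 4, 6, 10))
# [1] 0.00 0.25 0.50 1.00
```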

#Normalizing the train set
imp_train <- normalize(imp_train)

REGRESSION MODEL

I considered running a simple linear regression, lm(), as the baseline model for preliminary analysis.

summary(lm(target~., data = imp_train))
## 
## Call:
## lm(formula = target ~ ., data = imp_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.123370 -0.014306  0.000174  0.017982  0.205299 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -0.119763   0.002727 -43.912   <2e-16 ***
## scores_teaching               0.303217   0.009452  32.079   <2e-16 ***
## scores_research               0.379514   0.010663  35.590   <2e-16 ***
## scores_citations              0.363245   0.003847  94.435   <2e-16 ***
## scores_industry_income        0.037436   0.003730  10.036   <2e-16 ***
## scores_international_outlook  0.065518   0.004076  16.074   <2e-16 ***
## stats_number_students         0.010116   0.017964   0.563    0.573    
## stats_student_staff_ratio     0.010855   0.018596   0.584    0.559    
## stats_pc_intl_students       -0.013347   0.009468  -1.410    0.159    
## female_male_ratio            -0.008672   0.021402  -0.405    0.685    
## no_subjects_offered          -0.001712   0.002828  -0.605    0.545    
## Location                      0.008796   0.007059   1.246    0.213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0314 on 1993 degrees of freedom
## Multiple R-squared:  0.9755, Adjusted R-squared:  0.9754 
## F-statistic:  7215 on 11 and 1993 DF,  p-value: < 2.2e-16
vif(lm(target~., data = imp_train))
##              scores_teaching              scores_research 
##                     4.122613                     6.553424 
##             scores_citations       scores_industry_income 
##                     2.066576                     2.578234 
## scores_international_outlook        stats_number_students 
##                     2.180351                     1.438249 
##    stats_student_staff_ratio       stats_pc_intl_students 
##                     1.208819                     1.910466 
##            female_male_ratio          no_subjects_offered 
##                     1.005252                     1.353404 
##                     Location 
##                     1.868093

From the baseline model we can interpret that the five score variables are statistically significant, while the remaining predictors, including Location, are not. Although the model seems strong with an \(R^2\) value of 0.97, ordinary least squares is sensitive to outliers, which can heavily influence the coefficients and predictions. The vif(model) output shows that scores_research is the only variable above 5 (6.55), which is not a grave concern {some sources accept VIF values below 10}.

Since the training set contains likely outliers, model selection was done accordingly. The caret package provides multiple regression models via:

train(target~., data = imp_train, method = "model")

The chosen models and the reasons for selecting them are given below.

Robust Linear Regression : Robust linear regression is resistant to outliers because it uses M-estimation to down-weight their influence. Unlike traditional linear regression, it accommodates skewness in the distribution and thereby provides a more reliable analysis (Yang, Gallagher and McMahan, 2018).

Support Vector Machine : An SVM with a linear kernel is used to fit the data, as extreme outliers do not significantly affect its decision boundary.

Random Forest : A random forest is essentially an average of multiple decision trees. Tree-based algorithms are generally robust to outliers, and since overfitting is one of our concerns, random forests are also known for resisting it (Jaiswal and Samikannu, 2017).
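The down-weighting idea behind M-estimation can be sketched with the Huber weight function (huber_weight is a hypothetical helper; k = 1.345 is the conventional tuning constant):

```r
# Huber weights: residuals within k keep full weight,
# larger residuals are down-weighted in proportion to 1/|r|
huber_weight <- function(r, k = 1.345) ifelse(abs(r) <= k, 1, k / abs(r))
huber_weight(c(0.5, 1.345, 2.69))
# [1] 1.0 1.0 0.5
```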

set.seed(777)

tic("Robust Linear Regression : ")
rlm <- train(target~., data = imp_train, method = "rlm")
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## (the same convergence warning was emitted for each of the remaining resamples)
toc()
## Robust Linear Regression : : 3.23 sec elapsed
tic("Support Vector Machine : ")
svm <- train(target~., data = imp_train, method = "svmLinear")
toc()
## Support Vector Machine : : 10.95 sec elapsed
tic("Random Forest : ")
rf <- train(target~., data = imp_train, method = "rf")
toc()
## Random Forest : : 352.03 sec elapsed
models <- resamples(list(RLM = rlm,
                         SVM = svm,
                         RF = rf))

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(models, scales=scales, metric = c("RMSE", "Rsquared"), 
       pch = '|', do.out=FALSE)

Analysis of these untuned models and their diagnostics yields the following observations:

  1. RLM failed to converge within the default number of iterations (20), meaning it could not arrive at a stable solution under those settings.
  2. Although random forest displays better RMSE and R2 values, fitting it took considerably longer, likely because of the default number of trees (500).
  3. From summary(svm) it was found that the tuning parameter C was held constant at 1; no alternative cost values were tried by default.

Nevertheless, with cross-validation and fine-tuning, these models can achieve lower computational cost, greater interpretability, and less overfitting (Awad and Khanna, 2015).

set.seed(777)
# Since the data set is small, use 5-fold cross-validation repeated 2 times
repeat_cv <- trainControl(method='repeatedcv', number=5, repeats=2)

#tuneGrid for rlm
tuneGrid_rlm <- data.frame(intercept = TRUE, 
                   psi = c("psi.huber", "psi.hampel", "psi.bisquare"))

tic("Tuned Robust Linear Regression : ")
rlm_tuned <- train(target~., data = imp_train, method = "rlm", 
                   trControl = repeat_cv, tuneGrid = tuneGrid_rlm)
toc()
## Tuned Robust Linear Regression : : 0.59 sec elapsed
#tuneGrid for svm
tuneGrid_svm <- expand.grid(C = c(0.01,0.1,1))

tic("Tuned Support Vector Machine : ")
svm_tuned <- train(target~., data = imp_train, method = "svmLinear", 
                   trControl = repeat_cv, tuneGrid = tuneGrid_svm)
toc()
## Tuned Support Vector Machine : : 4.35 sec elapsed
# setting the number of trees to 170 (10 x the number of variables in the raw data)
tic("Tuned Random Forest : ")
rf_tuned <- train(target~., data = imp_train, method = "rf", 
                  trControl=repeat_cv, ntree = 170)
toc()
## Tuned Random Forest : : 37.2 sec elapsed
tuned_models <- resamples(list(RLM_tuned = rlm_tuned,
                         SVM_tuned = svm_tuned,
                         RF_tuned = rf_tuned))

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(tuned_models, scales=scales, 
       metric = c("RMSE", "Rsquared"), 
       pch = '|', do.out=FALSE)

Tuning these models has evidently reduced the time taken to fit the data and improved the metrics as well. The tuned SVM now outperforms the tuned RLM (the reverse of the untuned comparison), while in both cases random forest achieved the best metrics of the three.

final_model <- rf_tuned$finalModel
print(final_model)
## 
## Call:
##  randomForest(x = x, y = y, ntree = 170, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 170
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.0008266865
##                     % Var explained: 97.93
# Note: final_model$results is NULL; the resampling metrics live in rf_tuned$results
#plot(rf_tuned$finalModel)

From the variable importance results we infer that the most influential factors are the research and citation scores, implying that these variables contribute most to the predictive accuracy of the model. The least influential variables, in ascending order of importance, are: number of subjects offered, female-to-male ratio, and number of students.

final_model$importance
##                              IncNodePurity
## scores_teaching                  7.4539695
## scores_research                 39.0937028
## scores_citations                24.9071965
## scores_industry_income           6.0043165
## scores_international_outlook     1.0920681
## stats_number_students            0.1762496
## stats_student_staff_ratio        0.2361783
## stats_pc_intl_students           0.2630562
## female_male_ratio                0.1508004
## no_subjects_offered              0.1271302
## Location                         0.5641116
var_imp <- varImp(rf_tuned, scale=FALSE)$importance
var_imp <- data.frame(variables=row.names(var_imp), importance=var_imp$Overall)

var_imp %>%
        arrange(importance) %>%
        ggplot(aes(x=reorder(variables, importance), y=importance)) +  
        geom_bar(stat='identity') + coord_flip() +
        ggtitle("Variable Importance Plot") +
        ylab("Importance") + xlab("Variables")

PREDICTION

The basic pipeline for this project is in the following order:

  1. IMPUTATION

  2. ENCODING

  3. NORMALIZATION

Applying the same transformations to the test set, we can generate predictions from the final model and measure its goodness of fit.

#Mice impute
quick_pred <- quickpred(testset, exclude = 'location')
imp_test <- mice(testset, method = "cart", pred = quick_pred, m = 10)
## 
##  iter imp variable
##   1   1  scores_teaching  scores_research  scores_citations  scores_industry_income  scores_international_outlook  female_male_ratio  target
## (identical lines for the remaining iterations and imputations omitted)
imp_test <- complete(imp_test)

#Encoding
target_encoding <- build_target_encoding(imp_test,
                                         cols_to_encode = "location",
                                         target_col = "target",
                                         functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_test <- target_encode(imp_test, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_test <- imp_test %>% select(-location) %>% 
  rename(Location = target_mean_by_location)

#normalizing
imp_test <- normalize(imp_test)
#predicting
obs <- imp_test$target
pred <- predict(object=rf_tuned, newdata=imp_test[, -11])
output <- data.frame(obs, pred)


ggplot(output, aes(obs, pred)) + geom_point() + geom_smooth() + 
  labs(x = "Observed", y = "Predicted") + theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

rmse <- round(RMSE(pred = pred, obs = obs),3)
mae <- round(MAE(pred = pred, obs = obs),3)
r_squared <- round(R2(pred = pred, obs = obs),3)

sprintf("R2 value is %s", r_squared)
## [1] "R2 value is 0.971"
sprintf("Root mean squared error is %s", rmse)
## [1] "Root mean squared error is 0.035"
sprintf("Mean absolute error is %s", mae)
## [1] "Mean absolute error is 0.025"

CONCLUSION

University rankings are widely seen as a meaningful measure of academic excellence and institutional reputation, and world university rankings have gradually drawn greater attention in international and comparative higher education. Using a random forest regression algorithm, we were able to predict the overall score of a university with good accuracy; the model explained a large portion of the variability in the observed data.

Ranking, through the selection of a particular set of indicators and the weight allotted to each, effectively defines the quality of education among competing institutions, although other valid indicators may have been ignored in the scoring process. Most of the indicators were strongly correlated, yet only a few were given high importance, the research score most of all. This information can help universities identify and work on their weak aspects together with their industry partners: improving infrastructure and curricula to attract more students, introducing internship programmes, or increasing research collaborations. The rankings also help prospective students filter universities by location or other indicators.

Global rankings only make sense if the indicators are appropriate; otherwise institutions and governments risk jeopardizing the transformation of their higher education systems. These metrics should be collected, recorded, and analyzed by a central body at the global level. Finally, prospective students should not use this data as the sole guide for choosing a university; they should seek additional information before selecting the best fit for them.

REFERENCES

  1. University Ranking and Evaluation: Trend and Existing Approaches. (2012). International Journal of Advancements in Computing Technology, 4(5), pp.10–16. doi:https://doi.org/10.4156/ijact.vol4.issue5.2.

  2. Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of translational medicine, [online] 4(1), p.9. doi:https://doi.org/10.3978/j.issn.2305-5839.2015.12.38.

  3. Gomer, B. (2019). MCAR, MAR, and MNAR Values in the Same Dataset: A Realistic Evaluation of Methods for Handling Missing Data. Multivariate Behavioral Research, 54(1), pp.153–153. doi:https://doi.org/10.1080/00273171.2018.1557033.

  4. De, A., Antonio, J., Javier, F. and Gibran, H. (2020). The World University Rankings Model Validation and a Top 50 Universities Predictive Model. doi:https://doi.org/10.1109/iccais48893.2020.9096841.

  5. Yang, T., Gallagher, C. and McMahan, C.S. (2018). A robust regression methodology via M-estimation. Communications in Statistics - Theory and Methods, 48(5), pp.1092–1107. doi:https://doi.org/10.1080/03610926.2018.1423698.

  6. Jaiswal, J.K. and Samikannu, R. (2017). Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression. [online] IEEE Xplore. doi:https://doi.org/10.1109/WCCCT.2016.25.

  7. Vitenko, T., Shanaida, V., Dróździel, P. and Madleňák, R. (2018). Assessment of Higher Education in Global Measurement. INTED Proceedings. doi:https://doi.org/10.21125/inted.2018.0787.

  8. Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1.

  9. Estrada-Real, A.C., Cantu-Ortiz, F.J.(2022). A data analytics approach for university competitiveness: the QS world university rankings. Int J Interact Des Manuf 16, 871–891. https://doi.org/10.1007/s12008-022-00966-2.

  10. Taylor, P. and Braddock, R. (2007). International University Ranking Systems and the Idea of University Excellence. Journal of Higher Education Policy and Management, 29(3), pp.245–260. doi:https://doi.org/10.1080/13600800701457855.