knitr::opts_chunk$set(echo=TRUE)

# Packages used throughout this report
library(dplyr)           # select, case_when, pipes, group_by/summarise
library(tibble)          # rownames_to_column
library(ggplot2)         # plots
library(gt)              # formatted tables
library(caret)           # nearZeroVar, createDataPartition, preProcess
library(mice)            # multiple imputation
library(naniar)          # vis_miss, mcar_test
library(ggstatsplot)     # ggcorrmat
library(dataPreparation) # build_target_encoding, target_encode
library(BBmisc)          # assumed source of normalize() used below

INTRODUCTION

Ranking universities is a fundamental issue not only for students and academics but also for university authorities, industry and even government. Global university rankings depend on several key factors such as teaching, research, citations, international outlook and industry income. Prospective students rely on them when applying to specific universities based on upcoming ranks. There are mainly two types of ranking system, global and national, each containing several ranking schemes based on location. The global ranking system consists of nine ranking schemes, of which the most widely used is the THE-QS World University Ranking, first published in England in 2004 (University Ranking and Evaluation: Trend and Existing Approaches, 2012).

University authorities should assess their institution's upcoming rank so they can identify fields to improve and upgrade their standards relative to others. Industry and even government should try to forecast university rankings in order to provide grants to the most eligible institutions for the upcoming year (Taylor and Braddock, 2007; Vitenko et al., 2018). Understanding the criteria and mechanics that assess the quality of education is a key activity for improving current educational models and proposing new educational strategies.

This project aims to demonstrate how data analysis techniques and machine learning algorithms can predict the overall score of a university, and to analyze the predictive features. The raw data collected has 2673 rows and 17 variables, of which overall_score is our target variable. Since the target variable is numerical, regression techniques will be used for predictive modeling.

file_path <- "C:/Users/vishw/OneDrive/Desktop/NU/KL7010 - Principle of DS/World University Rankings.csv"
raw_data <- read.csv(file_path, header = TRUE)
#str(raw_data)

Have you heard of the French term “mise en place”? It is crucial for every chef to understand: it means “everything in its place”, and the goal of this culinary practice is to have everything ready to go when it comes time to cook, bake, or assemble the final dish. Similarly, data preprocessing is the part of data preparation that ensures the raw data is ready for further processing.

Data preprocessing

It is a fundamental step in the data mining process. The collected raw data is never in an ideal format where analysis can be done right away. Data type conversion (type casting) and variable transformation (unit conversion) are a few preprocessing steps that ensure the collected data is in a suitable format for analysis and model building. This improves consistency, data visualization and interpretation, making the data more meaningful and thereby leading to better insights and decision making.

df <- raw_data #getting a copy of the data

# Remove thousands separators, e.g. "21,000" -> "21000"
df$stats_number_students <- gsub(",", "", df$stats_number_students)

# Convert "25%" -> 0.25
df$stats_pc_intl_students <- gsub("%", "", df$stats_pc_intl_students)
df$stats_pc_intl_students <- as.numeric(df$stats_pc_intl_students)/100 

# Drop the ":00" suffix, then turn "48:52" into the numeric ratio 48/52
df$stats_female_male_ratio <- gsub(":00", "", df$stats_female_male_ratio)

df$female_male_ratio <- sapply(strsplit(df$stats_female_male_ratio, ":"),
                               function(x) as.numeric(x[1])/as.numeric(x[2]))

# Count subjects from the comma-separated subjects_offered string
df$no_subjects_offered <- sapply(strsplit(as.character(df$subjects_offered),','), 
                                 function(x) length(x))

# overall_score may be a range such as "10.5–21.1"; take its midpoint
df$target <- sapply(strsplit(df$overall_score, '–'),
                    function(x) mean(as.numeric(x), na.rm=TRUE))


df$no_subjects_offered <- as.integer(df$no_subjects_offered)

df$member_level <- as.factor(df$member_level)
df$unaccredited <- as.factor(df$unaccredited)

df$stats_number_students <- as.integer(df$stats_number_students)
# Convert the share of international students into a head count
df$stats_pc_intl_students <- df$stats_pc_intl_students * df$stats_number_students

df$scores_teaching <- as.numeric(df$scores_teaching)
df$scores_research <- as.numeric(df$scores_research)
df$scores_citations <- as.numeric(df$scores_citations)
df$scores_industry_income <- as.numeric(df$scores_industry_income)
df$scores_international_outlook <- as.numeric(df$scores_international_outlook)


#length(unique(df$location))  #127 countries
countries <- unique(df$location) 

North_America = c("Canada", "United States", "Turks and Caicos Islands", "Mexico", "Puerto Rico", "Jamaica", "Costa Rica", "Cuba", "Dominican Republic")

South_America = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela")

Europe = c("Albania", "Andorra", "Austria", "Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Georgia", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy", "Kosovo", "Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta", "Moldova", "Monaco", "Montenegro", "Netherlands", "North Macedonia", "Norway", "Poland", "Portugal", "Romania", "Russian Federation", "San Marino", "Serbia", "Slovakia", "Slovenia", "Spain", "Sweden", "Switzerland", "Ukraine", "United Kingdom")

Africa = c("Algeria", "Angola", "Botswana", "Egypt", "Ethiopia", "Ghana", "Kenya", "Libya", "Malawi", "Mauritius", "Morocco", "Mozambique", "Namibia", "Nigeria", "Rwanda", "Senegal", "Somalia", "South Africa", "Sudan", "Tanzania", "Tunisia", "Uganda", "Zambia", "Zimbabwe")

Middle_East = c("Bahrain", "Iran", "Iraq", "Israel", "Jordan", "Kuwait", "Lebanon", "Northern Cyprus", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria", "Turkey", "United Arab Emirates", "Yemen")

Asia = c("Armenia", "Azerbaijan", "Bangladesh", "Brunei Darussalam", "Cambodia", "China", "Hong Kong", "India", "Indonesia", "Japan", "Kazakhstan", "Kyrgyzstan", "Laos", "Macao", "Malaysia", "Mongolia", "Myanmar", "Nepal", "Pakistan", "Philippines", "Singapore","South Korea", "Sri Lanka", "Taiwan", "Tajikistan", "Thailand", "Turkmenistan", "Uzbekistan", "Vietnam")

Oceania = c("Australia", "Fiji", "New Zealand")

df$Region = case_when(df$location %in% Asia ~ "Asia",
                      df$location %in% Africa ~ "Africa",
                      df$location %in% Europe ~ "Europe",
                      df$location %in% Middle_East ~ "Middle_East",
                      df$location %in% North_America ~ "North_America",
                      df$location %in% South_America ~ "South_America",
                      df$location %in% Oceania ~ "Oceania")

df1 <- df
df <- select(df, -c(13,14,17))

#str(df)

#summary(df)

A few basic insights before performing rigorous data analysis. These insights can be valuable to prospective students and institutions. Table 1 shows the top 10 universities in the world.

Top_uni <- df %>% arrange(desc(target)) %>% 
  select(c('name', 'location', 'target'))


head(Top_uni, 10) %>% 
  gt() %>% 
  cols_label(
  name = md("***Universities***"),
  location = md("***Location***"),
  target = md("***Overall Score***")) %>%
  tab_header(title = md("Table 1 :- *Top 10 Universities*"))
Table 1 :- Top 10 Universities
Universities Location Overall Score
University of Oxford United Kingdom 98.5
Stanford University United States 98.0
Massachusetts Institute of Technology United States 97.9
Harvard University United States 97.8
University of Cambridge United Kingdom 97.5
Princeton University United States 96.9
California Institute of Technology United States 96.5
Imperial College London United Kingdom 95.1
University of California, Berkeley United States 94.6
Yale University United States 94.2

From Table 2 we infer that Computer Science and Business & Management seem to be the most widely offered courses among universities worldwide.

subject_list <- strsplit(as.character(df1$subjects_offered),',')
l1 <- table(unlist(subject_list))
subjects <- data.frame(Subject = names(l1), Offered_by = as.numeric(l1))

arrange(subjects, desc(Offered_by)) %>% head(6) %>% 
  gt() %>%
  cols_label(
  Subject = md("***Subject***"),
  Offered_by = md("***No. of Unis***")) %>%
  tab_header(title = md("Table 2 :- *Most offered courses*"))
Table 2 :- Most offered courses
Subject No. of Unis
Computer Science 2277
Business & Management 2267
Accounting & Finance 1965
Biological Sciences 1946
Mathematics & Statistics 1914
Literature & Linguistics 1913
df_filtered <- df1[!is.na(df$target), ] #Considering only ranked universities

ggplot(df_filtered, aes(Region, fill = Region)) + 
  geom_bar(color = "black",show.legend = F) + 
  theme_minimal() + scale_fill_brewer(palette="Blues") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The above graph and Tables 3, 4, 5 and 6 illustrate that Europe and Asia have the highest numbers of ranked universities, followed by the North American region. Universities in the UK admitted the highest number of international students, followed by the USA and Australia. Europe, and the UK in particular, seems to be a preferred region of study among international students, even though there is a greater number of ranked universities in the US than in the UK.

intl.students <- df %>% arrange(desc(stats_pc_intl_students)) %>% 
  select(Region, stats_pc_intl_students ) %>% 
  group_by(Region) %>% 
  summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T)) 

arrange(intl.students, desc(sum_intl)) %>% 
  gt() %>%
  cols_label(
  Region = md("***Region***"),
  sum_intl = md("***International Students***")) %>%
  tab_header(title = md("Table 3 :- *No. of International Students (Region)*"))
Table 3 :- No. of International Students (Region)
Region International Students
Europe 2003036.49
North_America 710247.36
Asia 545708.58
Middle_East 382577.76
Oceania 295841.18
Africa 142760.70
South_America 48438.52
uni_loc <- group_by(df_filtered, location) %>% summarise(count = n()) 
uni_loc %>% arrange(desc(count), .by_group = T) %>% 
  head(10) %>%
  gt() %>%
  cols_label(
  location = md("***Country***"),
  count = md("***Ranked Unis***")) %>%
  tab_header(title = md("Table 4 :- *Sum of ranked universities*"))
Table 4 :- Sum of ranked universities
Country Ranked Unis
United States 169
Japan 119
United Kingdom 104
India 91
China 86
Russian Federation 78
Turkey 75
Iran 73
Brazil 56
Italy 56
students <- df %>% arrange(desc(stats_number_students)) %>% 
  select(location, stats_number_students ) %>% 
  group_by(location) %>% 
  summarise(students = sum(stats_number_students, na.rm = T))

arrange(students, desc(students)) %>% head(10) %>% 
  gt() %>%
  cols_label(
  location = md("***Country***"),
  students = md("***Sum of Students***")) %>%
  tab_header(title = md("Table 5 :- *Total Sum of Students (location)*"))
Table 5 :- Total Sum of Students (location)
Country Sum of Students
United States 3985888
China 2818642
Egypt 2423510
Turkey 2105034
United Kingdom 2093040
Brazil 1707217
Japan 1460908
Algeria 1354183
Mexico 1303044
Spain 1291193
intl_students <- df %>% arrange(desc(stats_pc_intl_students)) %>% 
  select(location, stats_pc_intl_students ) %>% 
  group_by(location) %>% 
  summarise(sum_intl = sum(stats_pc_intl_students, na.rm = T))

arrange(intl_students, desc(sum_intl)) %>% head(10) %>% 
  gt() %>%
  cols_label(
  location = md("***Country***"),
  sum_intl = md("***International Students***")) %>%
  tab_header(title = md("Table 6 :- *No. of International Students (location)*"))
Table 6 :- No. of International Students (location)
Country International Students
United Kingdom 654516.2
United States 481297.0
Australia 258599.7
Canada 205261.8
France 192720.9
Russian Federation 164572.9
Germany 163791.9
Turkey 162938.2
China 134049.0
Spain 130483.3

Analyzing summary(data), it was found that closed was a constant variable, and only one of the 2673 unaccredited values was TRUE, which makes them zero-variance or near-zero-variance variables. Using the nearZeroVar(data) function from the caret package we can weed out such variables. Apart from those mentioned above, record_type and member_level also had near-zero variance. Hence these variables need not be considered as predictors: they are uninformative features and can reduce the performance of our model.

#Removing these variables as they are not needed for further analysis
df <- select(df, -c('name', 'Region'))

nzv <- nearZeroVar(df, saveMetrics = T)
nzv %>% rownames_to_column(var = "variables") %>% 
  gt() %>% 
  cols_label(
  variables = md("***variables***"),
  freqRatio = md("***freqRatio***"),
  percentUnique = md("***percentUnique***"),
  zeroVar = md("***zeroVar***"),
  nzv = md("***nzv***")) %>%
  tab_header(title = md("Table 7 :- *NZV table*")) %>%
  tab_style_body(
    style = cell_fill(color = "lightblue"),
    fn = function(x) isTRUE(x))
Table 7 :- NZV table
variables freqRatio percentUnique zeroVar nzv
scores_teaching 1.230769 17.69547325 FALSE FALSE
scores_research 1.050000 18.93004115 FALSE FALSE
scores_citations 1.285714 30.26561915 FALSE FALSE
scores_industry_income 1.866667 25.73887018 FALSE FALSE
scores_international_outlook 1.222222 27.12308268 FALSE FALSE
record_type 23.669811 0.11223345 FALSE TRUE
member_level 25.428571 0.14964459 FALSE TRUE
location 1.035503 4.75121586 FALSE FALSE
stats_number_students 1.000000 96.29629630 FALSE FALSE
stats_student_staff_ratio 1.083333 16.76019454 FALSE FALSE
stats_pc_intl_students 242.500000 81.25701459 FALSE FALSE
closed 0.000000 0.03741115 TRUE TRUE
unaccredited 2672.000000 0.07482230 FALSE TRUE
female_male_ratio 1.081081 2.58136925 FALSE FALSE
no_subjects_offered 1.112000 1.38421250 FALSE FALSE
target 1.286645 5.94837262 FALSE FALSE
df <- df[, -nearZeroVar(df)] # removing the nzv and zv variables

Why is it essential to split the data before performing EDA?

The train-test split is done prior to EDA to avoid data leakage. The basic idea of prediction is to use the model on unknown or unseen data, which can only be achieved by splitting the data before further rigorous analysis. Using createDataPartition(column, p = 0.75) from the caret package, 75% of the data was assigned to the train set and 25% to the test set.

set.seed(777)
index <- createDataPartition(df$stats_number_students, p=0.75, list=F)
trainset <- df[index,]
testset <- df[-index,]

EXPLORATORY DATA ANALYSIS

This is a fundamental step in analyzing the data to gain insights, detect trends, and understand the distribution and structure of the data. It can help identify outliers and null values, thereby increasing the quality of the data for model building. Understanding these factors helps in choosing the right prediction model. Furthermore, it helps us understand the relationships between the variables and informs feature selection.

To understand the distribution of the predictor variables, histograms were plotted. Clearly the variables do not follow a normal distribution, and the scatter plots reveal that the scores are indeed linearly related to the target variable. A correlation plot, produced with ggcorrmat(data, type = "nonparametric"), helped examine the strength of the relationship between the continuous predictors and the target. Had the data followed a normal distribution, the parametric type (Pearson coefficient) would have been used; otherwise Spearman or Kendall correlation is appropriate.
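As a self-contained illustration of the nonparametric choice, the matrix behind a plot like ggcorrmat(type = "nonparametric") is a Spearman rank correlation matrix, which base R computes directly. The toy variables below are illustrative, not taken from the ranking data:

```r
# Spearman correlation: the nonparametric coefficient ggcorrmat displays
set.seed(7)
teaching <- runif(50, 10, 90)                   # a score-like predictor
overall  <- 0.6 * teaching + rnorm(50, sd = 5)  # roughly linear in teaching
ratio    <- runif(50, 5, 40)                    # unrelated noise

round(cor(cbind(teaching, overall, ratio), method = "spearman"), 2)
```

Because Spearman works on ranks, it captures the monotonic association without assuming normality, which matches the skewed distributions seen in the histograms.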

#teaching
par(mfrow = c(1,2))
hist(trainset$scores_teaching, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'teaching score')

scatter.smooth(trainset$scores_teaching, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'teaching score', ylab = 'overall score')

#research
par(mfrow = c(1,2))
hist(trainset$scores_research, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'research score')
scatter.smooth(trainset$scores_research, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'research score', ylab = 'overall score')

#citation
par(mfrow = c(1,2))
hist(trainset$scores_citations, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'citation score')
scatter.smooth(trainset$scores_citations, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'citation score', ylab = 'overall score')

#industry income
par(mfrow = c(1,2))
hist(trainset$scores_industry_income, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'industry income score')
scatter.smooth(trainset$scores_industry_income, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'industry income score', ylab = 'overall score')

#international outlook
par(mfrow = c(1,2))
hist(trainset$scores_international_outlook, main = '', 
     breaks = 30, col="steelblue1", 
     xlab = 'international outlook score')
scatter.smooth(trainset$scores_international_outlook, trainset$target, 
     pch = 20, cex = 1.25 , col="steelblue1", 
     xlab = 'international outlook score', ylab = 'overall score')

The correlation plot reveals that most performance indicators display a linear relationship with the target (overall score), except stats_student_staff_ratio. Most of the correlation coefficients are significant at the alpha = 0.05 level, implying the scores are strongly correlated with the target. Among the predictors, scores_research and scores_industry_income are highly correlated with each other, and only female_male_ratio and stats_student_staff_ratio are weakly correlated with the target.

A weak correlation does not necessarily mean that these variables are unimportant. Also, multicollinearity can lead to an unstable model, but since these are the scores used to calculate the target (overall score), retaining them is the wiser choice. Nevertheless, further analysis can explain the impact of these variables.

OUTLIER DETECTION

# Box-plot of scores
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[1], width = 2, xlab = 'Teaching')
boxplot(trainset[2], width = 2, xlab = 'Research')
boxplot(trainset[3], width = 2, xlab = 'Citation')
boxplot(trainset[4], width = 2, xlab = 'Ind.income')
boxplot(trainset[5], width = 2, xlab = 'Intl.outlook')
mtext("Box-plots of the scores", side = 3,line = -2, outer = TRUE)

#Box-plots of other indicators
par(mfrow = c(1,5))
par(cex.lab=1.5)
boxplot(trainset[7], width = 2, xlab = 'no.stu')
boxplot(trainset[8], width = 2, xlab = 'stu:staff')
boxplot(trainset[9], width = 2, xlab = '%intl students')
boxplot(trainset[10], width = 2, xlab = 'F:M')
boxplot(trainset[11], width = 2, xlab = 'no.subjects')
mtext("Box-plots of other indicators", side = 3,line = -2, outer = TRUE)

The presence of outliers can impact model performance, as these are extreme or possibly incorrectly recorded values. The box plots above reveal the distribution of the predictor variables. A common practice is to remove such values because they can distort model performance and regression coefficients. In this case, however, the values seem plausible: a university score of 100 or 10 (the extremes here) is credible, the university simply being extremely good or bad.

Table 8 :- Effect of outliers on mean & sd
Row Mean Mean_x SD SD_x mean.diff sd.diff
teaching_x 29.079 26.968 13.852 9.858 2.111 3.993
research_x 23.367 20.472 16.612 11.300 2.895 5.312
students_x 18333.514 15111.161 22596.263 11552.016 3222.353 11044.247
stu.staf_x 19.110 17.163 13.811 7.287 1.947 6.524
intl_stud 1522.920 830.047 2569.564 1017.288 692.873 1552.276
f.m_x 1.361 1.122 3.374 0.447 0.239 2.927

Table 8 shows the effect of outliers on the mean and standard deviation of the train set. The difference before and after removing outliers is substantial, suggesting that the outliers do influence these statistics. However, the magnitude of the difference alone is not sufficient to decide: there is a risk of losing valuable information if we exclude these values. A way around this is to choose a model robust to outliers.
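The effect-of-outliers table above can be reproduced by comparing each variable's mean and standard deviation before and after dropping values beyond the usual 1.5 × IQR box-plot whiskers. A minimal base-R sketch on toy data (the helper name outlier_effect is illustrative, not part of the original analysis):

```r
# Compare mean/sd of a numeric vector with and without box-plot outliers
# (values beyond 1.5 * IQR from the quartiles); "_x" = outliers excluded.
outlier_effect <- function(x) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * diff(q)
  keep  <- x >= q[1] - fence & x <= q[2] + fence
  c(mean   = mean(x, na.rm = TRUE),
    mean_x = mean(x[keep], na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    sd_x   = sd(x[keep], na.rm = TRUE))
}

set.seed(1)
toy <- c(rnorm(100, mean = 20, sd = 5), 120)  # one extreme value
round(outlier_effect(toy), 3)
```

Applying such a helper column by column to the train set yields the paired "with/without" statistics shown in the table.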

HANDLING NULL VALUES

From the vis_miss() plot we clearly see that scores are missing. A possible reason is that the university did not meet the ranking threshold, or the score was simply not calculated for less reputed institutions.

#miss_var_summary(trainset)
vis_miss(trainset)

mcar_test(trainset %>% select(stats_pc_intl_students, female_male_ratio, 
                                scores_teaching,scores_research, 
                                scores_citations,scores_industry_income,
                                scores_international_outlook,target))
statistic df p.value missing.patterns
252.7807 11 0 6

The calculated p-value from the MCAR test is less than 0.05, so the null hypothesis is rejected: the missing data is not MCAR (i.e., it is either MAR or MNAR). Had the data been missing completely at random (MCAR), complete case analysis (dropping all rows with null values) would have been the right approach. Given the size of the train set, dropping these values would risk losing the representativeness of the data and overfitting the model (Zhang, 2016; Gomer, 2019).

The missing values can be predicted from the observed values through imputation. I have used bagImpute from the caret package and cart imputation from the mice package, and compared their distributions with the original data. The bagImpute method fits bagged regression trees to impute the data, while cart uses classification and regression trees.

imp <- preProcess(trainset %>% select(-location), method = c("bagImpute"))
imputed <- predict(imp, newdata = trainset)

#Mice impute
quick_pred <- quickpred(trainset, exclude = c('location', 'Region'))
mice_imp <- mice(trainset, method = "cart", pred = quick_pred, m = 10,
                 printFlag = FALSE) # suppress the long per-iteration log
mice_imputed <- complete(mice_imp)
# Check whether the distribution has been altered by imputing/dropping null values

ggplot() +
  geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
  geom_density(aes(scores_teaching, fill = "imputed"), alpha = .2, data = imputed) +
  scale_fill_manual(name = "datasets", values = c(Original = "red", imputed = "blue"))

ggplot() +
  geom_density(aes(scores_teaching, fill = "Original"), alpha = .1, data = trainset, na.rm=T) +
  geom_density(aes(scores_teaching, fill = "mice_imputed"), alpha = .2, data = mice_imputed) +
  scale_fill_manual(name = "datasets", values = c(Original = "red", mice_imputed = "green"))

It is apparent from the density plots above that the cart-imputed data better retains the original distribution of the train set. The cart-imputed data will therefore be used for the next step.

ENCODING

Most statistical models only take numerical inputs. Depending on the type of categorical data, different encoding techniques are preferred: label (ordinal) encoding is used for ordinal data, while one-hot encoding is preferred for nominal data. Since our location variable has many classes (127 locations), one-hot encoding would produce a sparse matrix and lead to high-dimensionality problems.

To mitigate this issue, I chose to target encode the categorical variable. This method replaces each category with the mean value of the target within that group. A major disadvantage is target leakage (deriving the values of the predictor from the target itself) and overfitting; these problems can be reduced by performing the encoding within cross-validation folds.
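One way to realize the cross-validated variant is out-of-fold target encoding, where each row's category mean is computed only from the other folds. A base-R sketch (the function name oof_target_encode and the toy data are illustrative; this is not the dataPreparation API used below):

```r
# Out-of-fold target encoding: each row is encoded with the mean target
# of its category computed on the *other* folds, limiting target leakage.
oof_target_encode <- function(category, target, k = 5) {
  folds   <- sample(rep(seq_len(k), length.out = length(target)))
  encoded <- numeric(length(target))
  for (f in seq_len(k)) {
    in_fold <- folds == f
    means   <- tapply(target[!in_fold], category[!in_fold], mean)
    encoded[in_fold] <- means[as.character(category[in_fold])]
  }
  # Categories unseen outside a fold fall back to the global mean
  encoded[is.na(encoded)] <- mean(target)
  encoded
}

set.seed(42)
loc <- sample(c("UK", "US", "JP"), 60, replace = TRUE)
y   <- rnorm(60, mean = c(UK = 70, US = 65, JP = 60)[loc], sd = 3)
enc <- oof_target_encode(loc, y)
```

Each encoded value then approximates its group's mean target without ever using the row's own target, which is what plain target encoding cannot guarantee.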

target_encoding <- build_target_encoding(mice_imputed,
                                         cols_to_encode = "location",
                                         target_col = "target",
                                         functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_train <- target_encode(mice_imputed, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_train <- imp_train %>% select(-location) %>% 
  rename(Location = target_mean_by_location)

NORMALIZATION or STANDARDIZATION

Standardization is normally preferred when the data is normally distributed; otherwise, normalization is the safer bet. In our case the data does not follow a bell-shaped curve, and the variables differ considerably in magnitude, so normalization is the right approach. Without scaling, a model may give more weight to higher-magnitude variables, which is not always justified. Normalizing the data yields a more comprehensive model, improves computation speed, and scales all values into the range [0,1]. Models with coefficient penalties, such as lasso, particularly benefit because the penalty then treats all features on a comparable scale.
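The arithmetic behind this kind of scaling can be sketched as follows (the report itself relies on normalize(); min_max here is a hypothetical helper):

```r
# Min-max scaling: map each value of x into [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(2, 4, 6, 10))
# [1] 0.00 0.25 0.50 1.00
```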

#Normalizing the train set
imp_train <- normalize(imp_train)

REGRESSION MODEL

I considered running a simple linear regression, lm(), as the baseline model for preliminary analysis.

summary(lm(target~., data = imp_train))
## 
## Call:
## lm(formula = target ~ ., data = imp_train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.123370 -0.014306  0.000174  0.017982  0.205299 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -0.119763   0.002727 -43.912   <2e-16 ***
## scores_teaching               0.303217   0.009452  32.079   <2e-16 ***
## scores_research               0.379514   0.010663  35.590   <2e-16 ***
## scores_citations              0.363245   0.003847  94.435   <2e-16 ***
## scores_industry_income        0.037436   0.003730  10.036   <2e-16 ***
## scores_international_outlook  0.065518   0.004076  16.074   <2e-16 ***
## stats_number_students         0.010116   0.017964   0.563    0.573    
## stats_student_staff_ratio     0.010855   0.018596   0.584    0.559    
## stats_pc_intl_students       -0.013347   0.009468  -1.410    0.159    
## female_male_ratio            -0.008672   0.021402  -0.405    0.685    
## no_subjects_offered          -0.001712   0.002828  -0.605    0.545    
## Location                      0.008796   0.007059   1.246    0.213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0314 on 1993 degrees of freedom
## Multiple R-squared:  0.9755, Adjusted R-squared:  0.9754 
## F-statistic:  7215 on 11 and 1993 DF,  p-value: < 2.2e-16
vif(lm(target~., data = imp_train))
##              scores_teaching              scores_research 
##                     4.122613                     6.553424 
##             scores_citations       scores_industry_income 
##                     2.066576                     2.578234 
## scores_international_outlook        stats_number_students 
##                     2.180351                     1.438249 
##    stats_student_staff_ratio       stats_pc_intl_students 
##                     1.208819                     1.910466 
##            female_male_ratio          no_subjects_offered 
##                     1.005252                     1.353404 
##                     Location 
##                     1.868093

From the baseline model we can interpret that the five score variables are statistically significant, while the remaining predictors, including Location, are not. Although the model seems strong with an \(R^2\) value of 0.97, ordinary least squares is sensitive to outliers, which can heavily influence the coefficients and predictions. The vif(model) output shows that scores_research is the only variable above 5 (6.55), which is not a grave concern {some sources accept VIF values below 10}.

Since the training set contains likely outliers, model selection was done accordingly. The caret package provides multiple regression models via:

train(target~., data = imp_train, method = "model")

The chosen models and the reasons for selecting them are given below.

Robust Linear Regression : Robust linear regression is resistant to outliers because it uses M-estimation to down-weight their influence. Unlike traditional linear regression, it accommodates skewness in the distribution and thereby provides a more reliable analysis (Yang, Gallagher and McMahan, 2018).

Support Vector Machine : An SVM with a linear kernel is used to fit the data, as extreme outliers do not significantly affect its decision boundary.

Random Forest : A random forest is essentially an average of multiple decision trees. Tree-based algorithms are generally robust to outliers, and since overfitting is one of our concerns, random forests are also known for resisting it (Jaiswal and Samikannu, 2017).
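The down-weighting idea behind M-estimation can be sketched with the Huber weight function (huber_weight is a hypothetical helper; k = 1.345 is the conventional tuning constant):

```r
# Huber weights: residuals within k keep full weight,
# larger residuals are down-weighted in proportion to 1/|r|
huber_weight <- function(r, k = 1.345) ifelse(abs(r) <= k, 1, k / abs(r))
huber_weight(c(0.5, 1.345, 2.69))
# [1] 1.0 1.0 0.5
```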

set.seed(777)

tic("Robust Linear Regression : ")
rlm <- train(target~., data = imp_train, method = "rlm")
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## (the same convergence warning was emitted for each of the remaining resamples)
toc()
## Robust Linear Regression : : 3.23 sec elapsed
tic("Support Vector Machine : ")
svm <- train(target~., data = imp_train, method = "svmLinear")
toc()
## Support Vector Machine : : 10.95 sec elapsed
tic("Random Forest : ")
rf <- train(target~., data = imp_train, method = "rf")
toc()
## Random Forest : : 352.03 sec elapsed
models <- resamples(list(RLM = rlm,
                         SVM = svm,
                         RF = rf))

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(models, scales=scales, metric = c("RMSE", "Rsquared"), 
       pch = '|', do.out=FALSE)

Analysis of these untuned models and their diagnostics yields the following observations:

  1. RLM failed to converge within the default number of iterations (20), meaning it could not arrive at a stable solution under those settings.
  2. Although random forest displays better RMSE and R2 values, fitting it took considerably longer, likely because of the default number of trees (500).
  3. From summary(svm) it was found that the tuning parameter C was held constant at 1; no alternative cost values were tried by default.

Nevertheless, with cross-validation and fine-tuning, these models can achieve lower computational cost, greater interpretability, and less overfitting (Awad and Khanna, 2015).

set.seed(777)
# Since the data set is small, use 5-fold cross-validation repeated 2 times
repeat_cv <- trainControl(method='repeatedcv', number=5, repeats=2)

#tuneGrid for rlm
tuneGrid_rlm <- data.frame(intercept = TRUE, 
                   psi = c("psi.huber", "psi.hampel", "psi.bisquare"))

tic("Tuned Robust Linear Regression : ")
rlm_tuned <- train(target~., data = imp_train, method = "rlm", 
                   trControl = repeat_cv, tuneGrid = tuneGrid_rlm)
toc()
## Tuned Robust Linear Regression : : 0.59 sec elapsed
#tuneGrid for svm
tuneGrid_svm <- expand.grid(C = c(0.01,0.1,1))

tic("Tuned Support Vector Machine : ")
svm_tuned <- train(target~., data = imp_train, method = "svmLinear", 
                   trControl = repeat_cv, tuneGrid = tuneGrid_svm)
toc()
## Tuned Support Vector Machine : : 4.35 sec elapsed
# setting the number of trees to 170 (10 x the number of variables in the raw data)
tic("Tuned Random Forest : ")
rf_tuned <- train(target~., data = imp_train, method = "rf", 
                  trControl=repeat_cv, ntree = 170)
toc()
## Tuned Random Forest : : 37.2 sec elapsed
tuned_models <- resamples(list(RLM_tuned = rlm_tuned,
                         SVM_tuned = svm_tuned,
                         RF_tuned = rf_tuned))

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(tuned_models, scales=scales, 
       metric = c("RMSE", "Rsquared"), 
       pch = '|', do.out=FALSE)

Tuning these models has evidently reduced the time taken to fit the data and improved the metrics as well. The tuned SVM now outperforms the tuned RLM (the reverse of the untuned comparison), while in both cases random forest achieved the best metrics of the three.

final_model <- rf_tuned$finalModel
print(final_model)
## 
## Call:
##  randomForest(x = x, y = y, ntree = 170, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 170
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.0008266865
##                     % Var explained: 97.93
# Note: final_model$results is NULL; the resampling metrics live in rf_tuned$results
#plot(rf_tuned$finalModel)

From the variable importance results we infer that the most influential factors are the research and citation scores, implying that these variables contribute most to the predictive accuracy of the model. The least influential variables, in ascending order of importance, are: number of subjects offered, female-to-male ratio, and number of students.

final_model$importance
##                              IncNodePurity
## scores_teaching                  7.4539695
## scores_research                 39.0937028
## scores_citations                24.9071965
## scores_industry_income           6.0043165
## scores_international_outlook     1.0920681
## stats_number_students            0.1762496
## stats_student_staff_ratio        0.2361783
## stats_pc_intl_students           0.2630562
## female_male_ratio                0.1508004
## no_subjects_offered              0.1271302
## Location                         0.5641116
var_imp <- varImp(rf_tuned, scale=FALSE)$importance
var_imp <- data.frame(variables=row.names(var_imp), importance=var_imp$Overall)

var_imp %>%
        arrange(importance) %>%
        ggplot(aes(x=reorder(variables, importance), y=importance)) +  
        geom_bar(stat='identity') + coord_flip() +
        ggtitle("Variable Importance Plot") +
        ylab("Importance") + xlab("Variables")

PREDICTION

The basic pipeline for this project is in the following order:

  1. IMPUTATION

  2. ENCODING

  3. NORMALIZATION

Applying the same transformations to the test set, we can generate predictions from the final model and measure its goodness of fit.

#Mice impute
quick_pred <- quickpred(testset, exclude = 'location')
imp_test <- mice(testset, method = "cart", pred = quick_pred, m = 10)
## 
##  iter imp variable
##   1   1  scores_teaching  scores_research  scores_citations  scores_industry_income  scores_international_outlook  female_male_ratio  target
## (identical lines for the remaining iterations and imputations omitted)
imp_test <- complete(imp_test)

#Encoding
target_encoding <- build_target_encoding(imp_test,
                                         cols_to_encode = "location",
                                         target_col = "target",
                                         functions = "mean")
## [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target."
imp_test <- target_encode(imp_test, target_encoding = target_encoding)
## [1] "target_encode: Start to encode columns according to target."
imp_test <- imp_test %>% select(-location) %>% 
  rename(Location = target_mean_by_location)

#normalizing
imp_test <- normalize(imp_test)
#predicting
obs <- imp_test$target
pred <- predict(object=rf_tuned, newdata=imp_test[, -11])
output <- data.frame(obs, pred)


ggplot(output, aes(obs, pred)) + geom_point() + geom_smooth() + 
  labs(x = "Observed", y = "Predicted") + theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

rmse <- round(RMSE(pred = pred, obs = obs),3)
mae <- round(MAE(pred = pred, obs = obs),3)
r_squared <- round(R2(pred = pred, obs = obs),3)

sprintf("R2 value is %s", r_squared)
## [1] "R2 value is 0.971"
sprintf("Root mean squared error is %s", rmse)
## [1] "Root mean squared error is 0.035"
sprintf("Mean absolute error is %s", mae)
## [1] "Mean absolute error is 0.025"

CONCLUSION

University rankings are widely seen as a meaningful measure of academic excellence and institutional reputation, and world university rankings have gradually drawn greater attention in international and comparative higher education. Using a random forest regression algorithm, we were able to predict the overall score of a university with good accuracy; the model explained a large portion of the variability in the observed data.

Ranking, through the selection of a particular set of indicators and the weight allotted to each, effectively defines the quality of education among competing institutions, although other valid indicators may have been ignored in the scoring process. Most of the indicators were strongly correlated, yet only a few were given high importance, the research score most of all. This information can help universities identify and work on their weak aspects together with their industry partners: improving infrastructure and curricula to attract more students, introducing internship programmes, or increasing research collaborations. The rankings also help prospective students filter universities by location or other indicators.

Global rankings only make sense if the indicators are appropriate; otherwise institutions and governments risk jeopardizing the transformation of their higher education systems. These metrics should be collected, recorded, and analyzed by a central body at the global level. Finally, prospective students should not use this data as the sole guide for choosing a university; they should seek additional information before selecting the best fit for them.

REFERENCES

  1. University Ranking and Evaluation: Trend and Existing Approaches. (2012). International Journal of Advancements in Computing Technology, 4(5), pp.10–16. doi:https://doi.org/10.4156/ijact.vol4.issue5.2.

  2. Zhang, Z. (2016). Missing data imputation: focusing on single imputation. Annals of translational medicine, [online] 4(1), p.9. doi:https://doi.org/10.3978/j.issn.2305-5839.2015.12.38.

  3. Gomer, B. (2019). MCAR, MAR, and MNAR Values in the Same Dataset: A Realistic Evaluation of Methods for Handling Missing Data. Multivariate Behavioral Research, 54(1), pp.153–153. doi:https://doi.org/10.1080/00273171.2018.1557033.

  4. De, A., Antonio, J., Javier, F. and Gibran, H. (2020). The World University Rankings Model Validation and a Top 50 Universities Predictive Model. doi:https://doi.org/10.1109/iccais48893.2020.9096841.

  5. Yang, T., Gallagher, C. and McMahan, C.S. (2018). A robust regression methodology via M-estimation. Communications in Statistics - Theory and Methods, 48(5), pp.1092–1107. doi:https://doi.org/10.1080/03610926.2018.1423698.

  6. Jaiswal, J.K. and Samikannu, R. (2017). Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression. [online] IEEE Xplore. doi:https://doi.org/10.1109/WCCCT.2016.25.

  7. Vitenko, T., Shanaida, V., Dróździel, P. and Madleňák, R. (2018). Assessment of Higher Education in Global Measurement. INTED Proceedings. doi:https://doi.org/10.21125/inted.2018.0787.

  8. Awad, M., Khanna, R. (2015). Machine Learning. In: Efficient Learning Machines. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-5990-9_1.

  9. Estrada-Real, A.C., Cantu-Ortiz, F.J.(2022). A data analytics approach for university competitiveness: the QS world university rankings. Int J Interact Des Manuf 16, 871–891. https://doi.org/10.1007/s12008-022-00966-2.

  10. Taylor, P. and Braddock, R. (2007). International University Ranking Systems and the Idea of University Excellence. Journal of Higher Education Policy and Management, 29(3), pp.245–260. doi:https://doi.org/10.1080/13600800701457855.