Categories: machine-learning, phishing

Catching Phishes with Machine Learning

By Su Yiin Ang, Anne Nyugen Nhi Thai An, Gordy Adiprasetyo, Kendra Luisa Balong Gadong.

This was written as part of the requirements for the Applied Machine Learning module for MITB.


For presentation deck, please view : Presentation deck

1. Introduction

Phishing websites are websites that are constructed to look identical to the real website with intention to trick recipients into divulging confidential data such as usernames and passwords. In 2020, the number of phishing sites doubled across the year as shown in Figure 1. Additionally, the FBI reported that phishing attacks in 2020 had resulted in a financial loss of over USD54mil, with such attacks rising exponentially over the past 5 years.


Figure 1 Phishing activity trend in 2020. (Source: APWG Phishing Activity Trends Report)

2. Project Objective

As internet usage and phishing websites grow, the need for automatic phishing detection is increasingly important to protect users. This project’s objective is to build a machine learning model that can detect phishing websites accurately based on the raw URL. This model can be applied to the backend of internet related applications such as search engines, browser add-ons, and platforms that allows embedding of external sites, which alerts users when a phishing website is being loaded. This would help warn users from accessing the phishing website and thus falling prey to such schemes.

The primary performance measure for the models is identified to be recall score. From the perspective of a user browsing websites, false negatives are extremely costly as users might enter confidential and crucial financial information without knowing. On the other hand, false positives are not particularly costly as users can override browser’s warning to proceed if they are certain that the website is legitimate.


Figure 2 Project Pipeline of Phishing Detection Machine Learning Model

3. Compiling Legitimate & Phishing Websites

URLs for legitimate websites were taken from Singh (2020). The original dataset contains about 1.2M URLs labelled as legitimate or phishing. Only a portion of legitimate websites were taken from this list (82,866 legitimate URLs). Phishing URLs from this list are not used as phishing URLs are usually short-lived and this data source was already a year old when this project was done. Phishing URLs are taken from Phishtank instead, a website listing live and confirmed phishing URL. The two lists are combined and basic pre-processing such as removing duplicates was performed.

4. Feature Extraction

The features used in this project were adapted from Hannousse & Yahiouche (link 1 and link 2 here), with some changes and additions. This paper was chosen as modelling using the features in this paper show promising results.

4.1 URL Features

URL features are features that are obtained by analyzing the text of URLs. Basic operations such as counting the number of ‘www’, checking whether ‘https’ instead of ‘http’ was used, and checking for unusual extensions such as ‘.exe’ was performed. A total of 55 URL features were extracted as shown in the table below.

Table 1: URL Features and Descriptions

NoFeatureDescription
1length_urlfull URL length
2length_hostnamehostname length
3ipIP address
4nb_dotsno. of “.”
5nb_hyphensno. of “-”
6nb_atno. of “@”
7nb_qmno. of “?”
8nb_andno. of “&”
9nb_orno. of "
10nb_eqno. of “=”
11nb_underscoreno. of “_”
12nb_tildeno. of “~”
13nb_percentno. of “%”
14nb_slashno. of “/”
15nb_starno. of “*”
16nb_colonno. of “:”
17nb_commano. of “,”
18nb_semicolumnno. of “:”
19nb_dollarno. of “$”
20nb_spaceno. of %20" or " "
21nb_wwwno. of “www”
22nb_comno. of “.com”
23nb_dslashno. of “//”
24http_in_pathno. of “http”
25https_tokenuse of https
26ratio_digits_urlratio of digits in full URL
27ratio_digits_hostratio of digits in hostname
28punycodepresence of a punnycode
29portport indicator
30tld_in_pathtop-level domain in path
31tld_in_subdomaintop-level domain in subdomain
32abnormal_subdomainURLs matching a pattern of w[w]?[0-9]*
33nb_subdomainsno. of subdomains
34prefix_suffixpresence of “-” in domain names
35shortening_serviceuse of a shortening service
36path_extensionpresence of a path extension
37nb_redirectionnumber of redirections
38nb_external_redirectionnumber of external redirections
39length_words_rawnumber of words
40char_repeatrepeated characters
41shortest_words_rawshortest words in the url
42shortest_word_hostshortest words in the hostname
43shortest_word_pathshortest words in the path
44longest_words_rawlongest words in the url
45longest_word_hostlongest words in the hostname
46longest_word_pathlongest words in the path
47avg_words_rawaverage words in the url
48avg_word_hostaverage words in the hostname
49avg_word_pathaverage words in the path
50phish_hints“total occurrence of the ff phish hints: ‘wp’, ‘login’, ‘includes’, ‘admin’, ‘content’, ‘site’, ‘images’, ‘js’, ‘alibaba’, ‘css’, ‘myaccount’, ‘dropbox’, ‘themes’, ‘plugins’, ‘signin’, ‘view’”
51domain_in_brandpresence of brand in domain
52brand_in_subdomainpresence of brand in subdomain
53brand_in_pathpresence of brand in path
54suspicious_tldsuspicious top-level domain

4.2 Content-Based & Geolocation Features

Content-based features are features that are obtained by loading up the websites and analyzing their HTML contents. This includes checking whether web pages contain login forms, checking the number of pop-up windows, or checking the number of hyperlinks. Using the websites’ IP addresses, Geolocation of hosting server was also extracted. A total of 22 content-based features and 1 geolocation feature were extracted as shown in the table below.

Table 2: Content-Based & Geolocation Features and Descriptions

NoFeatureDescription
56nb_hyperlinksno. of hyperlinks
57ratio_intHyperlinksratio of internal hyperlinks
58ratio_extHyperlinksratio of external hyperlinks
59ratio_nullHyperlinksratio of null hyperlinks
60nb_extCSSno. of external CSS files
61ratio_intRedirectionratio of internal redirections
62ratio_extRedirectionratio of external redirections
63ratio_intErrorsratio of internal hyperlinks connection errors
64ratio_extErrorsratio of external hyperlinks connection errors
65login_formpresence of ‘‘’’, ‘‘#’’, ‘‘#nothing’’, ‘‘#doesnotexist’’, ‘‘#null’’, ‘‘#void’’, ‘‘#whatever’’, ‘‘#content’’, ‘‘javascript::void(0)’’, ‘‘javascript::void(0);’’, ‘‘javascript::;’’, ‘‘javascript’’
66external_faviconpresence of external favicons
67links_in_tagsratio of internal links in tags
68submit_emailpresence of form actions containing ‘mailto:’ or ‘mail()’
69ratio_intMediaratio of internal media file links
70ratio_extMediaratio of external media file links
71sfhpresence of forms with empty string or ‘about:blank’
72iframepresence of < iframe > tag
73popup_windowpresence of pop up window with text fields
74safe_anchorpresence of ‘#’, ‘javascript’, or ‘mailto’ tags
75onmouseoverpresence of ‘onmouseover’ attribute to detect disabling right-click
76right_clicpresence of ‘event.button==2’ to ‘onmouseover’ attribute to detect disabling right-click
77empty_titleabsence of web page title
78domain_in_titlepresence of domain of URL as part of web page title
79domain_with_copyrightpresence of domain of URLs within copyright logo
80geolocationcountry geolocation, based on IP address

4.3 External Service Features

External Service features are obtained by querying reference third party services and search engines. For instance, we check whether the URLs are indexed by Google. URLs without Google Index might be too short-lived to be indexed by Google and therefore might indicate that they are phishing URLs. A total of 7 such external service features are shown in the table below.

Table 3: External Service Features and Descriptions

NoFeatureDescription
81whois_registered_domainwhether domain matches WHOIS database
82domain_registration_lengthlength of domain renewals
83domain_ageage of URL domains
84web_trafficnumber of visitors, retrieved from Alexa
85dns_recordwhether URL domain is registered within the DNS
86google_indexwhether website was indexed by Google
87page_rankGoogle Page Rank value, from Openpagerank

5. Baseline Model

The fully extracted dataset contains 90,576 observations and 89 columns (comprising 3 ID columns, 1 target, and 85 features). Using only the 55 URL features, we built a logistic regression model to be used as baseline comparison before performing further feature engineering and selection. URL features were chosen as they are not dependent on accessibility of the websites and only requires raw URL texts. The recall score of this model is 81.7%.

6. Exploratory Data Analysis

6.1 Feature & Observation Removal

EDA was conducted to check for erroneous data and remove features to reduce compute as well as data collection cost especially for content and external features that require querying. This revealed 5,438 observations with all missing content feature data. Since the business case focuses on detecting live phishing websites (websites that are not live would also not be able to phish any data and thus do not pose any harm), these websites that were inaccessible are non-representative. These observations were thus removed from the dataset. Visual inspection of histogram, bar charts & correlation matrix were used for univariate and bivariate analysis. Discrete numeric or categorical features were dropped when the feature has a high proportion of missing values (for example, querying the site returned a response code that is not 200, or empty content), the feature’s positive label has very low frequency, or the feature has only one single label. These include ‘ratio_nullHyperlinks', ‘ratio_intRedirection’, ‘ratio_intErrors’, ‘submit_email’, ‘sfh’, ‘iframe’, ‘popup_window’, ‘onmouseover’, ‘right_clic’, ‘brand_in_subdomain’, ‘brand_in_path’, ‘suspecious_tld’, ‘statistical_report’, ‘whois_registered_domain’, ‘domain_registration_length’, ‘google_index’, ‘page_rank’, and ‘domain_age’.


Figure 3 Example features of ‘ratio_nullHyperlinks' showing only single value for label, and ‘brand_in_subdomain’ showing very low frequency positive label

Numeric features were checked for multi-collinearity. Where pair-wise correlation exceeds 0.7 (Table 4), selected features were dropped to maximize the remaining number of features: ‘nb_eq’, ‘length_url’, ‘avg_words_raw’, ‘longest_word_host’, ‘longest_word_path’, ‘nb_semicolumn’, ‘shortest_word_host’.


Table 4: Pair-wise correlation of top correlated features

6.2 Processing

The EDA below also showed that most numeric features are extremely skewed, thus requiring standardization. StandardScaler was thus fitted on the train dataset to preserve the numeric features’ mean and standard deviation values for subsequent standardization on the test dataset. Among categorical features, ‘geolocation’ was also heavily skewed and thus re-binned to 9 distinct labels. Lastly, categorical features were one-hot encoded as required for model building. As a result, the processed data obtained has 85,138 observations and 68 features in total (including one-hot encoded columns).


Figure 4: Feature ‘nb_slash’ before (left) and after (right) standardization


Figure 5: Feature ‘geolocation’ before (left) and after (right) re-binning

7. Train-Test Split & Sampling Strategy

The processed dataset was first split into a train:test split of 80:20 to ensure test comparison across all models would be fair since the same test data from sampling would be used.

The dataset was imbalanced at a ratio of roughly 10:1 legitimate:phishing observations. As well-elucidated in literature, a more balanced down-sampled dataset would increase recall at the expense of accuracy as well as possible loss of information when more data is discarded. On the other hand, up-sampling may introduce bias and overfitting, preventing the trained model from generalizing well upon deployment. Therefore, to address the imbalance which will implicate the performance of the models, a total of 4 different sampling strategies were conducted on the train dataset:

  • Original train set at 10:1 ratio
  • Down-sampled train set at 5:1 ratio
  • Down-sampled train set at 1:1 ratio
  • Up-sampled train set at 1:1 ratio

It is noted that different sampling strategies were done on the train dataset only, and model comparison based on performance measures was conducted on the original imbalanced test dataset.

8. Model Building

Using the 4 train datasets explained in the previous section, 5 different models were built: 1 logistic regression model, 3 ensemble models (Random Forest, XGBoost and LightGBM), and 2 Artificial Neural Networks (consisting of 1 or 2 hidden layers). These models were chosen in the order of decreasing explainability to trade off with improving performance.

While specific details differ for each model building process, all models were trained and evaluated with cross-validation on the 4 different train datasets for hyperparameter tuning. Hyperparameter tuning for all models except for neural network models was conducted with a combination of RandomizedSearchCV and/or GridSearchCV for a few rounds, progressively restricting the hyper-parameter values for improved performance. For neural network, Hyperband algorithm was used for hyperparameter tuning.

9. Model Evaluation and Selection

The main performance measure is recall score, with accuracy used as a secondary measure for an overall sanity check, for the test dataset.


Figure 6: Test performance across all models and sampling strategies

In general, the models’ performance agrees with existing literature where down-sampling to make training data more balance improves recall and lowers accuracy. With 1:1 up-sampling, recall score drops significantly compared to the 1:1 down-sampling models, with some slight accuracy gained. However, across all models at all different sampling strategies, accuracy does not vary much and stays well above 95%.

Looking at individual models, logistic regression is observed to significantly underperform against all other more complex machine learning models. Logistic regression is therefore not recommended, even with the benefit of more explainability since the financial consequence of wrongful prediction can be quite severe. On the other hand, the Artificial Neural Network, whether with 1 or 2 hidden layers, does not perform consistently better than the ensemble models either, possibly due to the small dataset size at less than 100,000 observations. The ensemble models stack up very well, with the Random Forest model trained on the 1:1 down-sampled dataset scoring the highest recall at 97.2%. This model also has an accuracy of 96.9%.

We also note that feature collection, engineering as well as model tuning greatly improved the results of our model as we saw an improvement of nearly 19% against the baseline model based on recall score from 81.7% to 97.2%.

10. Chosen Machine Learning Model

Selecting the best model from the model evaluation, this project then chooses the Random Forest with a 1:1 Legitimate:Phishing Ratio Down-sampled by maximizing Recall at 97.2%. Further analysis on the features is done by outputting the feature importance scores shown below. From a total of 68 features, the top 7 features are selected based on the original feature importance graph in Appendix 1.


Figure 7: Test performance across all models and sampling strategies

Classifying them based on the type of features, 6 of the top 7 are under extracted features from the URL: which are (1) the number of ‘www’s (nb_www), where usually legitimate URLs are observed to have only one, (2) the use of an HTTPS (https_token), providing a secure facility, (3) the number of repeating characters (char_repeat) which is an example of a natural language processing feature, (4) the ratio of digits in full URLs (ratio_digits_url) where a high number would be considered as a phishing indicator, and word-raw features such as (5) the average length of words in a path (avg_word_path), and (6) the shortest words in the URL (shortest_words_raw).

One feature based on the content of the website which was included is the number of hyperlinks (nb_hyperlinks) where legitimate websites are supposed to consist of greater number of pages compared with phishing ones. These top features from the chosen machine learning model are what we should be wary of when clicking on possible phishing websites.

11. Summary and Future Work

In summary, machine learning models are effective in detecting phishing websites due to the characteristics of these types of websites. By exploring different machine learning algorithms, from a baseline logistic regression model to a tree-based random forest model, an XGBoost model, a LightGBM model and to a more complex ANN model, it is important to evaluate models based on business use case, which was for this project maximizing recall to detect as many phishing websites as possible.

This project recommends two possible directions of future work. One is to improve the model by collecting more data on external features or updating certain features that the model has been trained with, e.g. phish_hints feature. Phish hints are collected from confirmed phishing websites from the past that include the strings, ‘wp’, ‘login’, ‘admin’, ‘content’, ‘site’, ‘images’, ‘js’, ‘alibaba’, ‘css’, ‘myaccount’, ‘dropbox’, ‘themes’, ‘plugins’, ‘signin’, ‘view’, and that these phishing URLs use the listed sensitive words to gain trust on visited web pages. An improvement of the model is appending the phish hints with an updated list from more phishing websites that will be trained and tested in the future.

After updating and collecting more features, feature selection can be improved by balancing the cost of feature collection versus the recall penalty. The chosen machine learning model in the previous section has revealed that the important features are mostly extracted from the URL, and only one from content, which is a hyperlink-based feature. Hyperlink-based features are found to be much slower than extracting URL-based and external-based features since these require checking every link in the web page. It can be an improvement in model selection to just use URL features without spending a considerable amount of time in extracting all 89 features from the content and external components. Because of lesser time it takes to extract just URL-based features, it is suitable enough for runtime detection systems.

Lastly, due to time constraints, this project only explored traditional machine learning algorithms. It is recommended for future work to explore model stacking by combining various machine learning algorithms to further improve the phishing detection model. Once the model is improved and trained with more recent and relevant data on legitimate and phishing websites, the obtained machine learning model can be deployed through a browser plug-in to catch unwanted phishes.

12. Appendix


Appendix 1: Test performance across all models and sampling strategies


Technology vector created by freepik