Categories: machine-learning, phishing

Catching Phishes with Machine Learning

By Su Yiin Ang, Anne Nyugen Nhi Thai An, Gordy Adiprasetyo, Kendra Luisa Balong Gadong.

This was written as part of the requirements for the Applied Machine Learning module for MITB.

For presentation deck, please view : Presentation deck

1. Introduction

Phishing websites are websites that are constructed to look identical to the real website with intention to trick recipients into divulging confidential data such as usernames and passwords. In 2020, the number of phishing sites doubled across the year as shown in Figure 1. Additionally, the FBI reported that phishing attacks in 2020 had resulted in a financial loss of over USD54mil, with such attacks rising exponentially over the past 5 years.

Figure 1 Phishing activity trend in 2020. (Source: APWG Phishing Activity Trends Report)

2. Project Objective

As internet usage and phishing websites grow, the need for automatic phishing detection is increasingly important to protect users. This project’s objective is to build a machine learning model that can detect phishing websites accurately based on the raw URL. This model can be applied to the backend of internet related applications such as search engines, browser add-ons, and platforms that allows embedding of external sites, which alerts users when a phishing website is being loaded. This would help warn users from accessing the phishing website and thus falling prey to such schemes.

The primary performance measure for the models is identified to be recall score. From the perspective of a user browsing websites, false negatives are extremely costly as users might enter confidential and crucial financial information without knowing. On the other hand, false positives are not particularly costly as users can override browser’s warning to proceed if they are certain that the website is legitimate.

Figure 2 Project Pipeline of Phishing Detection Machine Learning Model

3. Compiling Legitimate & Phishing Websites

URLs for legitimate websites were taken from Singh (2020). The original dataset contains about 1.2M URLs labelled as legitimate or phishing. Only a portion of legitimate websites were taken from this list (82,866 legitimate URLs). Phishing URLs from this list are not used as phishing URLs are usually short-lived and this data source was already a year old when this project was done. Phishing URLs are taken from Phishtank instead, a website listing live and confirmed phishing URL. The two lists are combined and basic pre-processing such as removing duplicates was performed.

4. Feature Extraction

The features used in this project were adapted from Hannousse & Yahiouche (link 1 and link 2 here), with some changes and additions. This paper was chosen as modelling using the features in this paper show promising results.

4.1 URL Features

URL features are features that are obtained by analyzing the text of URLs. Basic operations such as counting the number of ‘www’, checking whether ‘https’ instead of ‘http’ was used, and checking for unusual extensions such as ‘.exe’ was performed. A total of 55 URL features were extracted as shown in the table below.

Table 1: URL Features and Descriptions

No	Feature	Description
1	length_url	full URL length
2	length_hostname	hostname length
3	ip	IP address
4	nb_dots	no. of “.”
5	nb_hyphens	no. of “-”
6	nb_at	no. of “@”
7	nb_qm	no. of “?”
8	nb_and	no. of “&”
9	nb_or	no. of "
10	nb_eq	no. of “=”
11	nb_underscore	no. of “_”
12	nb_tilde	no. of “~”
13	nb_percent	no. of “%”
14	nb_slash	no. of “/”
15	nb_star	no. of “*”
16	nb_colon	no. of “:”
17	nb_comma	no. of “,”
18	nb_semicolumn	no. of “:”
19	nb_dollar	no. of “$”
20	nb_space	no. of %20" or " "
21	nb_www	no. of “www”
22	nb_com	no. of “.com”
23	nb_dslash	no. of “//”
24	http_in_path	no. of “http”
25	https_token	use of https
26	ratio_digits_url	ratio of digits in full URL
27	ratio_digits_host	ratio of digits in hostname
28	punycode	presence of a punnycode
29	port	port indicator
30	tld_in_path	top-level domain in path
31	tld_in_subdomain	top-level domain in subdomain
32	abnormal_subdomain	URLs matching a pattern of w[w]?[0-9]*
33	nb_subdomains	no. of subdomains
34	prefix_suffix	presence of “-” in domain names
35	shortening_service	use of a shortening service
36	path_extension	presence of a path extension
37	nb_redirection	number of redirections
38	nb_external_redirection	number of external redirections
39	length_words_raw	number of words
40	char_repeat	repeated characters
41	shortest_words_raw	shortest words in the url
42	shortest_word_host	shortest words in the hostname
43	shortest_word_path	shortest words in the path
44	longest_words_raw	longest words in the url
45	longest_word_host	longest words in the hostname
46	longest_word_path	longest words in the path
47	avg_words_raw	average words in the url
48	avg_word_host	average words in the hostname
49	avg_word_path	average words in the path
50	phish_hints	“total occurrence of the ff phish hints: ‘wp’, ‘login’, ‘includes’, ‘admin’, ‘content’, ‘site’, ‘images’, ‘js’, ‘alibaba’, ‘css’, ‘myaccount’, ‘dropbox’, ‘themes’, ‘plugins’, ‘signin’, ‘view’”
51	domain_in_brand	presence of brand in domain
52	brand_in_subdomain	presence of brand in subdomain
53	brand_in_path	presence of brand in path
54	suspicious_tld	suspicious top-level domain

4.2 Content-Based & Geolocation Features

Content-based features are features that are obtained by loading up the websites and analyzing their HTML contents. This includes checking whether web pages contain login forms, checking the number of pop-up windows, or checking the number of hyperlinks. Using the websites’ IP addresses, Geolocation of hosting server was also extracted. A total of 22 content-based features and 1 geolocation feature were extracted as shown in the table below.

Table 2: Content-Based & Geolocation Features and Descriptions

No	Feature	Description
56	nb_hyperlinks	no. of hyperlinks
57	ratio_intHyperlinks	ratio of internal hyperlinks
58	ratio_extHyperlinks	ratio of external hyperlinks
59	ratio_nullHyperlinks	ratio of null hyperlinks
60	nb_extCSS	no. of external CSS files
61	ratio_intRedirection	ratio of internal redirections
62	ratio_extRedirection	ratio of external redirections
63	ratio_intErrors	ratio of internal hyperlinks connection errors
64	ratio_extErrors	ratio of external hyperlinks connection errors
65	login_form	presence of ‘‘’’, ‘‘#’’, ‘‘#nothing’’, ‘‘#doesnotexist’’, ‘‘#null’’, ‘‘#void’’, ‘‘#whatever’’, ‘‘#content’’, ‘‘javascript::void(0)’’, ‘‘javascript::void(0);’’, ‘‘javascript::;’’, ‘‘javascript’’
66	external_favicon	presence of external favicons
67	links_in_tags	ratio of internal links in tags
68	submit_email	presence of form actions containing ‘mailto:’ or ‘mail()’
69	ratio_intMedia	ratio of internal media file links
70	ratio_extMedia	ratio of external media file links
71	sfh	presence of forms with empty string or ‘about:blank’
72	iframe	presence of < iframe > tag
73	popup_window	presence of pop up window with text fields
74	safe_anchor	presence of ‘#’, ‘javascript’, or ‘mailto’ tags
75	onmouseover	presence of ‘onmouseover’ attribute to detect disabling right-click
76	right_clic	presence of ‘event.button==2’ to ‘onmouseover’ attribute to detect disabling right-click
77	empty_title	absence of web page title
78	domain_in_title	presence of domain of URL as part of web page title
79	domain_with_copyright	presence of domain of URLs within copyright logo
80	geolocation	country geolocation, based on IP address

4.3 External Service Features

External Service features are obtained by querying reference third party services and search engines. For instance, we check whether the URLs are indexed by Google. URLs without Google Index might be too short-lived to be indexed by Google and therefore might indicate that they are phishing URLs. A total of 7 such external service features are shown in the table below.

Table 3: External Service Features and Descriptions

No	Feature	Description
81	whois_registered_domain	whether domain matches WHOIS database
82	domain_registration_length	length of domain renewals
83	domain_age	age of URL domains
84	web_traffic	number of visitors, retrieved from Alexa
85	dns_record	whether URL domain is registered within the DNS
86	google_index	whether website was indexed by Google
87	page_rank	Google Page Rank value, from Openpagerank

5. Baseline Model

The fully extracted dataset contains 90,576 observations and 89 columns (comprising 3 ID columns, 1 target, and 85 features). Using only the 55 URL features, we built a logistic regression model to be used as baseline comparison before performing further feature engineering and selection. URL features were chosen as they are not dependent on accessibility of the websites and only requires raw URL texts. The recall score of this model is 81.7%.

6. Exploratory Data Analysis

6.1 Feature & Observation Removal

EDA was conducted to check for erroneous data and remove features to reduce compute as well as data collection cost especially for content and external features that require querying. This revealed 5,438 observations with all missing content feature data. Since the business case focuses on detecting live phishing websites (websites that are not live would also not be able to phish any data and thus do not pose any harm), these websites that were inaccessible are non-representative. These observations were thus removed from the dataset. Visual inspection of histogram, bar charts & correlation matrix were used for univariate and bivariate analysis. Discrete numeric or categorical features were dropped when the feature has a high proportion of missing values (for example, querying the site returned a response code that is not 200, or empty content), the feature’s positive label has very low frequency, or the feature has only one single label. These include ‘ratio_nullHyperlinks', ‘ratio_intRedirection’, ‘ratio_intErrors’, ‘submit_email’, ‘sfh’, ‘iframe’, ‘popup_window’, ‘onmouseover’, ‘right_clic’, ‘brand_in_subdomain’, ‘brand_in_path’, ‘suspecious_tld’, ‘statistical_report’, ‘whois_registered_domain’, ‘domain_registration_length’, ‘google_index’, ‘page_rank’, and ‘domain_age’.

Figure 3 Example features of ‘ratio_nullHyperlinks' showing only single value for label, and ‘brand_in_subdomain’ showing very low frequency positive label

Numeric features were checked for multi-collinearity. Where pair-wise correlation exceeds 0.7 (Table 4), selected features were dropped to maximize the remaining number of features: ‘nb_eq’, ‘length_url’, ‘avg_words_raw’, ‘longest_word_host’, ‘longest_word_path’, ‘nb_semicolumn’, ‘shortest_word_host’.

Table 4: Pair-wise correlation of top correlated features

6.2 Processing

The EDA below also showed that most numeric features are extremely skewed, thus requiring standardization. StandardScaler was thus fitted on the train dataset to preserve the numeric features’ mean and standard deviation values for subsequent standardization on the test dataset. Among categorical features, ‘geolocation’ was also heavily skewed and thus re-binned to 9 distinct labels. Lastly, categorical features were one-hot encoded as required for model building. As a result, the processed data obtained has 85,138 observations and 68 features in total (including one-hot encoded columns).

Figure 4: Feature ‘nb_slash’ before (left) and after (right) standardization

Figure 5: Feature ‘geolocation’ before (left) and after (right) re-binning

7. Train-Test Split & Sampling Strategy

The processed dataset was first split into a train:test split of 80:20 to ensure test comparison across all models would be fair since the same test data from sampling would be used.

The dataset was imbalanced at a ratio of roughly 10:1 legitimate:phishing observations. As well-elucidated in literature, a more balanced down-sampled dataset would increase recall at the expense of accuracy as well as possible loss of information when more data is discarded. On the other hand, up-sampling may introduce bias and overfitting, preventing the trained model from generalizing well upon deployment. Therefore, to address the imbalance which will implicate the performance of the models, a total of 4 different sampling strategies were conducted on the train dataset:

Original train set at 10:1 ratio
Down-sampled train set at 5:1 ratio
Down-sampled train set at 1:1 ratio
Up-sampled train set at 1:1 ratio

It is noted that different sampling strategies were done on the train dataset only, and model comparison based on performance measures was conducted on the original imbalanced test dataset.

8. Model Building

Using the 4 train datasets explained in the previous section, 5 different models were built: 1 logistic regression model, 3 ensemble models (Random Forest, XGBoost and LightGBM), and 2 Artificial Neural Networks (consisting of 1 or 2 hidden layers). These models were chosen in the order of decreasing explainability to trade off with improving performance.

While specific details differ for each model building process, all models were trained and evaluated with cross-validation on the 4 different train datasets for hyperparameter tuning. Hyperparameter tuning for all models except for neural network models was conducted with a combination of RandomizedSearchCV and/or GridSearchCV for a few rounds, progressively restricting the hyper-parameter values for improved performance. For neural network, Hyperband algorithm was used for hyperparameter tuning.

9. Model Evaluation and Selection

The main performance measure is recall score, with accuracy used as a secondary measure for an overall sanity check, for the test dataset.

Figure 6: Test performance across all models and sampling strategies

In general, the models’ performance agrees with existing literature where down-sampling to make training data more balance improves recall and lowers accuracy. With 1:1 up-sampling, recall score drops significantly compared to the 1:1 down-sampling models, with some slight accuracy gained. However, across all models at all different sampling strategies, accuracy does not vary much and stays well above 95%.

Looking at individual models, logistic regression is observed to significantly underperform against all other more complex machine learning models. Logistic regression is therefore not recommended, even with the benefit of more explainability since the financial consequence of wrongful prediction can be quite severe. On the other hand, the Artificial Neural Network, whether with 1 or 2 hidden layers, does not perform consistently better than the ensemble models either, possibly due to the small dataset size at less than 100,000 observations. The ensemble models stack up very well, with the Random Forest model trained on the 1:1 down-sampled dataset scoring the highest recall at 97.2%. This model also has an accuracy of 96.9%.

We also note that feature collection, engineering as well as model tuning greatly improved the results of our model as we saw an improvement of nearly 19% against the baseline model based on recall score from 81.7% to 97.2%.

10. Chosen Machine Learning Model

Selecting the best model from the model evaluation, this project then chooses the Random Forest with a 1:1 Legitimate:Phishing Ratio Down-sampled by maximizing Recall at 97.2%. Further analysis on the features is done by outputting the feature importance scores shown below. From a total of 68 features, the top 7 features are selected based on the original feature importance graph in Appendix 1.

Figure 7: Test performance across all models and sampling strategies

Classifying them based on the type of features, 6 of the top 7 are under extracted features from the URL: which are (1) the number of ‘www’s (nb_www), where usually legitimate URLs are observed to have only one, (2) the use of an HTTPS (https_token), providing a secure facility, (3) the number of repeating characters (char_repeat) which is an example of a natural language processing feature, (4) the ratio of digits in full URLs (ratio_digits_url) where a high number would be considered as a phishing indicator, and word-raw features such as (5) the average length of words in a path (avg_word_path), and (6) the shortest words in the URL (shortest_words_raw).

One feature based on the content of the website which was included is the number of hyperlinks (nb_hyperlinks) where legitimate websites are supposed to consist of greater number of pages compared with phishing ones. These top features from the chosen machine learning model are what we should be wary of when clicking on possible phishing websites.

11. Summary and Future Work

In summary, machine learning models are effective in detecting phishing websites due to the characteristics of these types of websites. By exploring different machine learning algorithms, from a baseline logistic regression model to a tree-based random forest model, an XGBoost model, a LightGBM model and to a more complex ANN model, it is important to evaluate models based on business use case, which was for this project maximizing recall to detect as many phishing websites as possible.

This project recommends two possible directions of future work. One is to improve the model by collecting more data on external features or updating certain features that the model has been trained with, e.g. phish_hints feature. Phish hints are collected from confirmed phishing websites from the past that include the strings, ‘wp’, ‘login’, ‘admin’, ‘content’, ‘site’, ‘images’, ‘js’, ‘alibaba’, ‘css’, ‘myaccount’, ‘dropbox’, ‘themes’, ‘plugins’, ‘signin’, ‘view’, and that these phishing URLs use the listed sensitive words to gain trust on visited web pages. An improvement of the model is appending the phish hints with an updated list from more phishing websites that will be trained and tested in the future.

After updating and collecting more features, feature selection can be improved by balancing the cost of feature collection versus the recall penalty. The chosen machine learning model in the previous section has revealed that the important features are mostly extracted from the URL, and only one from content, which is a hyperlink-based feature. Hyperlink-based features are found to be much slower than extracting URL-based and external-based features since these require checking every link in the web page. It can be an improvement in model selection to just use URL features without spending a considerable amount of time in extracting all 89 features from the content and external components. Because of lesser time it takes to extract just URL-based features, it is suitable enough for runtime detection systems.

Lastly, due to time constraints, this project only explored traditional machine learning algorithms. It is recommended for future work to explore model stacking by combining various machine learning algorithms to further improve the phishing detection model. Once the model is improved and trained with more recent and relevant data on legitimate and phishing websites, the obtained machine learning model can be deployed through a browser plug-in to catch unwanted phishes.

12. Appendix

Appendix 1: Test performance across all models and sampling strategies

Technology vector created by freepik

Ang Su Yiin

Catching Phishes with Machine Learning

Catching Phishes with Machine Learning

1. Introduction

2. Project Objective

3. Compiling Legitimate & Phishing Websites

4. Feature Extraction

4.1 URL Features

4.2 Content-Based & Geolocation Features

4.3 External Service Features

5. Baseline Model

6. Exploratory Data Analysis

6.1 Feature & Observation Removal

6.2 Processing

7. Train-Test Split & Sampling Strategy

8. Model Building

9. Model Evaluation and Selection

10. Chosen Machine Learning Model

11. Summary and Future Work

12. Appendix

Table of Contents