We're going to build a machine learning model that predicts an individual's personality type using the 16 distinct personality types from the Myers-Briggs Type Indicator (MBTI). Isabel Myers and her mother, Katharine Briggs, developed a questionnaire indicating differing psychological preferences in how people perceive the world and make decisions. The Myers-Briggs framework aims to help people identify and gain some understanding of how they take in information and make decisions (perception and judgment) in relation to their environment.
Chart by Jake Beech - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30859659.
The framework consists of eight preferences organized into four pairs of opposites:
- Extraversion (E) or Introversion (I)
- Sensing (S) or Intuition (N)
- Thinking (T) or Feeling (F)
- Judging (J) or Perceiving (P)
Combining one preference from each pair (E or I, S or N, T or F, J or P) gives a four-letter MBTI type code, as seen in the image below. These codes represent a person's psychological or personality type. Because each of the four positions has two options, there are 16 possible combinations of letters, leading to 16 distinct MBTI personality types.
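As a quick check of that count, the 16 codes can be enumerated by taking one letter from each pair. This is a minimal illustrative sketch; the variable names are my own and are not taken from the lesson's notebook.

```python
from itertools import product

# One (first, second) letter pair per preference dimension.
PREFERENCE_PAIRS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

# Pick one letter from each pair and join them into a four-letter code.
mbti_types = ["".join(letters) for letters in product(*PREFERENCE_PAIRS)]

print(len(mbti_types))  # 2 * 2 * 2 * 2 = 16
print(mbti_types)
```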
"Whatever the circumstances of your life, the understanding of type can make your perceptions clearer, your judgments sounder, and your life closer to your heart's desire." —Isabel Briggs Myers
We'll use two datasets sourced from Kaggle, as described in the sections below.
The first dataset (26 MB) contains 8,675 rows of uncleaned data (individual posts collected from a forum and labelled with the personality type). This will allow us to learn the fundamentals of pre-processing data and data visualization. You can download the file (`mbpt_dataset.csv`) in this repository or directly from the linked Kaggle page above.
The data is in a CSV file with the following columns:
Column | Description |
---|---|
`type` | Four-letter MBTI code. |
`posts` | The last 50 things an individual posted, with each entry separated by three pipe characters (`\|\|\|`). |
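
To make the `posts` format concrete, here is a minimal sketch of reading one row and splitting it into individual posts. The sample row is made up for illustration, and the code assumes the separator is the literal string `|||`; with the real file you would open `mbpt_dataset.csv` instead of the in-memory string.

```python
import csv
import io

# Hypothetical sample in the same two-column shape as the dataset (not real data).
sample_csv = io.StringIO(
    "type,posts\n"
    "INFP,I love quiet mornings|||Reading all day|||https://example.com/a-link\n"
)

POST_SEPARATOR = "|||"  # assumed separator between a user's posts

rows = list(csv.DictReader(sample_csv))
for row in rows:
    # Each row holds one user's label and all of their posts in a single string.
    posts = row["posts"].split(POST_SEPARATOR)
    print(row["type"], len(posts), posts)
```
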
The model built from the first dataset will not be optimal because the data is imbalanced; hence, we will use the second dataset (346 MB), which contains 106,067 rows of cleaned and more balanced data. This dataset combines the first dataset with another one collected from Reddit and will give us much better accuracy. You can download the file (`mbpt_dataset_2.csv`) directly from the linked Kaggle page above. Since the file is large, I couldn't upload it to this repository (it exceeds GitHub's file size limit of 100 MB), so you can instead add it as a shortcut to your Google Drive. This way, you don't have to download the file and can easily mount it to your Colab session.
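
Class imbalance is easy to spot by counting how often each label occurs. The sketch below uses a small, made-up list of labels rather than the real `type` column, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's `type` column.
labels = ["INFP"] * 6 + ["INFJ"] * 3 + ["ESTJ"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, most frequent first.
for mbti_type, count in counts.most_common():
    print(f"{mbti_type}: {count} ({count / total:.0%})")

# A simple imbalance indicator: size of the largest class relative to the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced; a large ratio means the model sees far more examples of some types than others.
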
The data is in a CSV file with the following columns:
Column | Description |
---|---|
`type` | Four-letter MBTI code. |
`posts` | An equal number of words per sample, taken from the last 50 things an individual posted. |
We'll build our model using the Scikit-Learn library. Scikit-Learn is a free software machine learning library for the Python programming language, including various classification, regression, and clustering algorithms.
Here's a list and summary of each Python package we will use:
Package | Description |
---|---|
`re` | Python built-in module for regular expression operations. |
`tqdm` | A fast and extensible progress bar. |
`pandas` | A fast, powerful, flexible, and easy-to-use data analysis and manipulation tool. |
`matplotlib.pyplot` | The state-based interface to Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations (it offers a MATLAB-like way of plotting). |
`WordCloud` | A word cloud generator. |
`train_test_split` from `sklearn.model_selection` | Scikit-Learn function for splitting arrays or matrices into random train and test subsets. |
`LabelEncoder` from `sklearn.preprocessing` | Scikit-Learn class for encoding target labels with values between 0 and n_classes-1. |
`TfidfVectorizer` from `sklearn.feature_extraction.text` | Scikit-Learn class for converting a collection of raw documents to a matrix of TF-IDF features. |
`accuracy_score` from `sklearn.metrics` | Scikit-Learn function for computing the accuracy score of a classifier's predictions. |
`classification_report` from `sklearn.metrics` | Scikit-Learn function for building a text report showing the main classification metrics. |
`confusion_matrix` from `sklearn.metrics` | Scikit-Learn function for computing a confusion matrix to evaluate the accuracy of a classification. |
`LogisticRegression` from `sklearn.linear_model` | Scikit-Learn Logistic Regression classifier for classification models. |
`SVC` from `sklearn.svm` | Scikit-Learn C-Support Vector classifier for classification models. |
`LinearSVC` from `sklearn.svm` | Scikit-Learn Linear Support Vector classifier for classification models. |
`MultinomialNB` from `sklearn.naive_bayes` | Scikit-Learn Naive Bayes classifier for multinomial classification models. |
`RandomForestClassifier` from `sklearn.ensemble` | Scikit-Learn Random Forest classifier for classification models. |
`pickle` | Python built-in module for object structure serialization and de-serialization. |
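
To show how several of these packages fit together, here is a rough sketch on a tiny made-up corpus. It is not the notebook's code: the sample texts, the two labels, the 75/25 split, and the choice of `LogisticRegression` are assumptions for illustration, and the real datasets need the pre-processing covered in the workshop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy corpus: four short "posts" per type (not real data).
texts = [
    "i enjoy reading alone and dreaming about ideas",
    "quiet evenings with a book and my thoughts",
    "writing poems about feelings and imagination",
    "daydreaming about meaning and inner values",
    "i organise the team schedule and deadlines",
    "meetings plans and checklists keep work on track",
    "leading the project with clear rules and goals",
    "i like managing people and getting results",
]
types = ["INFP"] * 4 + ["ESTJ"] * 4

# Encode the string labels as integers (0 .. n_classes - 1).
encoder = LabelEncoder()
y = encoder.fit_transform(types)

# Hold out 25% of the samples, keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, stratify=y, random_state=42
)

# Learn the TF-IDF vocabulary on the training texts only, then reuse it.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

predictions = model.predict(X_test_tfidf)
print(encoder.inverse_transform(predictions))  # predicted type codes
print(accuracy_score(y_test, predictions))     # fraction of correct predictions
```

With so few samples the accuracy value itself is meaningless; the point is the order of the steps: encode labels, split, vectorize, fit, predict, score.
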
Here are the source code files for the Jupyter notebooks and Python scripts used in this lesson. We will discuss the sequence of steps in the notebook further during the workshop. You can also find all the files in the `./model` directory.
- `model.ipynb`: Jupyter notebook for dataset one.
- `model.py`: Python script for dataset one.
- `model.ipynb`: Jupyter notebook for dataset two.
- `model.py`: Python script for dataset two.
Special thanks to Nada Gamal for the initial notebook on the first dataset.
Here's a list of terms you'll come across in this lesson (we will discuss them further during the workshop).
Term | Description |
---|---|
MBTI | Myers-Briggs Type Indicator. |
Kaggle | This is an online community of data scientists and machine learning engineers. |
Mounting Google Drive | If you upload files (CSV datasets, image files, or anything else) during a Colab session, you will lose the files (both input and output) after the runtime session. Mounting your Google Drive allows you to access files directly from your drive and use them in your Colab session. |
Splitting Data into Train and Test | This is an essential process used to estimate the performance of an algorithm (checking how it performs when used on a fresh set of data) by dividing a large dataset into two subsets. |
Pie Chart | This is a circular statistical graphic which is divided into slices to illustrate numerical proportions. |
Regular Expression | This is a sequence of characters that define a search pattern (it is also shortened as regex or regexp). |
Word Cloud | This is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. |
Corpus | This is a collection of written or spoken material in machine-readable form, assembled to study linguistic structures, frequencies, etc. |
Tokenization | This is the process of breaking down a text into words, phrases, symbols, or other meaningful elements called tokens. |
Stop Words | These are words (like "a", "an", "are", "the", "in", etc.) that are filtered out before processing natural language data (text). Because ML algorithms often rely on how frequently words occur, stop words that appear often in a document can receive high weights, resulting in an inefficient model. |
TF-IDF (Term Frequency-Inverse Document Frequency) | This is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. |
Stemming/Lemmatization | NLP text normalization techniques that reduce inflected or derived words to their root form (base or meaningful word), for example "Reactive" → "React". |
Label Encoding | This refers to converting the labels into numeric form (between 0 and n_classes-1) so as to convert them into machine-readable form. |
Classification Report | This is a report of each class's main classification metrics (precision, recall, f1-score, and support). |
Precision Score | This is the ratio of correctly predicted positive observations to the total predicted positive observations. |
Recall Score | This is the ratio of correctly predicted positive observations to all observations in the actual class. |
F1 Score | This is the harmonic mean of the Precision and Recall scores. |
Support Score | This is the number of actual class occurrences in the specified dataset. |
Model Accuracy | This is the fraction of predictions our model got right. |
Macro Average | This is the unweighted mean of a metric across all classes, so every class counts equally regardless of its size. |
Weighted Average | This is the mean of a metric across all classes, weighted by the number of samples (support) available for each class. |
Pickling | This is the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network. |
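
Several of the terms above (regular expressions, tokenization, stop words) come together when cleaning raw posts. The following is a minimal standard-library sketch; the sample text, the tiny stop-word list, and the `clean_and_tokenize` helper are hypothetical, and real pipelines usually rely on a fuller stop-word list from an NLP library.

```python
import re

# A deliberately tiny, hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "an", "are", "the", "in", "is", "and", "to"}

URL_PATTERN = re.compile(r"https?://\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_and_tokenize(text) -> list:
    """Lowercase, strip URLs and non-letters, tokenize, and drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # remove links
    text = NON_LETTER_PATTERN.sub(" ", text)  # keep only letters and whitespace
    tokens = text.split()                     # simple whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]


sample = "The cats are sleeping in the garden! See https://example.com/cats"
tokens = clean_and_tokenize(sample)
print(tokens)
```
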
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
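$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$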
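
As a worked example, the four metrics can be computed directly from the TP, FP, FN, and TN counts. The counts below are invented for illustration and the `binary_metrics` helper is hypothetical; it also assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1, and accuracy from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)                          # correct among predicted positives
    recall = tp / (tp + fn)                             # correct among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct among all predictions
    return precision, recall, f1, accuracy


# Hypothetical counts: 8 TP, 2 FP, 4 FN, 6 TN.
precision, recall, f1, accuracy = binary_metrics(tp=8, fp=2, fn=4, tn=6)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```
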
For two classes, you get a 2x2 matrix. The model in this lesson has sixteen classes, so we get a 16x16 matrix.
Something like this:
[ 4 2 1 0 0 0 0 0 10 19 0 1 0 0 0 1]
[ 0 74 0 3 0 0 0 0 10 28 14 2 1 0 1 2]
[ 0 2 7 3 0 0 0 0 4 12 14 3 0 0 0 1]
[ 0 7 0 70 0 0 0 0 16 15 11 17 0 1 0 0]
[ 0 0 0 1 0 0 0 0 2 5 0 1 0 0 0 0]
[ 0 2 1 0 0 0 0 0 0 4 1 2 0 0 0 0]
[ 0 0 0 1 0 0 0 0 1 4 1 1 0 0 0 0]
[ 0 0 0 1 0 0 0 0 5 5 2 4 0 0 0 1]
[ 0 2 0 2 0 0 0 0 210 52 8 18 0 0 0 2]
[ 0 4 0 4 0 0 0 0 14 322 9 11 0 1 0 1]
[ 0 0 0 5 0 0 0 0 18 20 146 27 0 0 1 1]
[ 0 0 0 5 0 0 0 0 10 20 13 212 0 0 0 1]
[ 0 2 0 0 0 0 0 0 8 11 3 5 4 0 0 0]
[ 0 0 0 5 0 0 0 0 6 30 2 2 0 9 0 0]
[ 0 2 0 2 0 0 0 0 5 14 10 4 0 0 3 1]
[ 0 0 0 2 0 0 0 0 2 16 1 17 0 0 0 29]
Sample matrix for the sixteen MBTI classes
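
To connect the sample matrix to the metrics, the sketch below computes overall accuracy and per-class recall from it. It assumes rows are actual classes and columns are predicted classes (the convention of Scikit-Learn's `confusion_matrix`); since the class order of the rows is not stated for this matrix, classes are referred to by index.

```python
# The sample 16x16 matrix from above (assumed: rows = actual, columns = predicted).
matrix = [
    [4, 2, 1, 0, 0, 0, 0, 0, 10, 19, 0, 1, 0, 0, 0, 1],
    [0, 74, 0, 3, 0, 0, 0, 0, 10, 28, 14, 2, 1, 0, 1, 2],
    [0, 2, 7, 3, 0, 0, 0, 0, 4, 12, 14, 3, 0, 0, 0, 1],
    [0, 7, 0, 70, 0, 0, 0, 0, 16, 15, 11, 17, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 2, 5, 0, 1, 0, 0, 0, 0],
    [0, 2, 1, 0, 0, 0, 0, 0, 0, 4, 1, 2, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 5, 5, 2, 4, 0, 0, 0, 1],
    [0, 2, 0, 2, 0, 0, 0, 0, 210, 52, 8, 18, 0, 0, 0, 2],
    [0, 4, 0, 4, 0, 0, 0, 0, 14, 322, 9, 11, 0, 1, 0, 1],
    [0, 0, 0, 5, 0, 0, 0, 0, 18, 20, 146, 27, 0, 0, 1, 1],
    [0, 0, 0, 5, 0, 0, 0, 0, 10, 20, 13, 212, 0, 0, 0, 1],
    [0, 2, 0, 0, 0, 0, 0, 0, 8, 11, 3, 5, 4, 0, 0, 0],
    [0, 0, 0, 5, 0, 0, 0, 0, 6, 30, 2, 2, 0, 9, 0, 0],
    [0, 2, 0, 2, 0, 0, 0, 0, 5, 14, 10, 4, 0, 0, 3, 1],
    [0, 0, 0, 2, 0, 0, 0, 0, 2, 16, 1, 17, 0, 0, 0, 29],
]

# Correct predictions sit on the diagonal (actual class == predicted class).
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Per-class recall: diagonal entry divided by the row total (the class's support).
recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]

print(f"accuracy = {correct}/{total} = {accuracy:.3f}")
for index, recall in enumerate(recalls):
    print(f"class {index:2d}: support={sum(matrix[index]):3d} recall={recall:.2f}")
```

Several rows have a zero on the diagonal, meaning no sample of that class was predicted correctly. Those rows also have very small totals, which is what you would expect from under-represented classes.
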
| | INFP | INFJ | INTP | INTJ | ENTP | ENFP | ISTP | ISFP | ENTJ | ISTJ | ENFJ | ISFJ | ESTP | ESFP | ESFJ | ESTJ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INFP | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
INFJ | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
INTP | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False |
INTJ | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False |
ENTP | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False |
ENFP | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False |
ISTP | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False |
ISFP | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
ENTJ | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False |
ISTJ | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |
ENFJ | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False |
ISFJ | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False |
ESTP | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False |
ESFP | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False |
ESFJ | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
ESTJ | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True |
Thank you for coming this far; you've done well 👏🏾. Please open a new GitHub discussion using the links below and let me know your thoughts about this lesson or any issues you're experiencing.