Settings — PhishGuard AI

Model Type: Logistic Regression

Vectorizer: TF-IDF (unigrams + bigrams)

Max Features: 40,000

Test Accuracy: 98.26%

ROC-AUC: 0.9969

Training Set: 14,476 samples

Test Set: 3,620 samples

Dataset: Phishing Email Dataset (18K+)
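For readers unfamiliar with the two metrics listed above, the following is a minimal sketch of how accuracy and ROC-AUC are commonly computed with scikit-learn. The four labels and scores are toy values chosen for illustration, and the 0.5 decision cut-off is an assumption; none of this is PhishGuard AI's actual evaluation code or data.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy ground truth (1 = phishing, 0 = safe) and toy predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Accuracy needs hard labels, so apply an assumed 0.5 cut-off to the scores.
y_pred = [int(score >= 0.5) for score in y_score]
accuracy = accuracy_score(y_true, y_pred)

# ROC-AUC is computed from the scores directly and does not depend on a cut-off.
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.2f}  roc_auc={auc:.2f}")
```

Accuracy depends on the chosen threshold, while ROC-AUC summarizes how well the scores rank phishing emails above safe ones across all thresholds.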

Thresholds are applied to the trained model's predicted probabilities. Adjust them in phishing_app.py for production use.
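As an illustration of thresholding a probability output, here is a minimal sketch. The cut-off values (0.5 and 0.8), the label names, and the function name `risk_label` are all hypothetical; the actual thresholds live in phishing_app.py and are not shown on this page.

```python
# Hypothetical cut-offs for illustration; the real values are defined in phishing_app.py.
SUSPICIOUS_THRESHOLD = 0.5
PHISHING_THRESHOLD = 0.8


def risk_label(phishing_probability: float) -> str:
    """Map a phishing probability in [0, 1] to a coarse risk label."""
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if phishing_probability >= PHISHING_THRESHOLD:
        return "phishing"
    if phishing_probability >= SUSPICIOUS_THRESHOLD:
        return "suspicious"
    return "safe"


for p in (0.93, 0.61, 0.07):
    print(p, risk_label(p))
```

Raising the upper threshold would reduce false alarms at the cost of missing more phishing emails, which is the trade-off to weigh before production use.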

Lowercase: Enabled

URL Normalization: Enabled (url token)

Email Normalization: Enabled (email token)

HTML Tag Removal: Enabled

Non-Alpha Removal: Enabled

Min Term Frequency: 3 documents

Max Term Frequency: 90% of docs

Sublinear TF Scaling: Enabled
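The text-cleaning steps listed above could be sketched roughly as follows. The regular expressions, the order of the steps, and the function name `clean_email_text` are assumptions made for illustration; the project's actual preprocessing code may differ.

```python
import re

# Assumed patterns; the real project may use different ones.
TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # web addresses
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")           # email addresses
NON_ALPHA_RE = re.compile(r"[^a-z\s]")           # anything except a-z and whitespace
SPACE_RE = re.compile(r"\s+")


def clean_email_text(text: str) -> str:
    """Lowercase, strip HTML tags, replace URLs/emails with tokens, drop non-letters."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)        # remove tags first so URLs inside attributes go too
    text = URL_RE.sub(" url ", text)    # URL normalization -> "url" token
    text = EMAIL_RE.sub(" email ", text)  # email normalization -> "email" token
    text = NON_ALPHA_RE.sub(" ", text)  # remove digits and punctuation
    return SPACE_RE.sub(" ", text).strip()


sample = ("Dear user, <b>VERIFY</b> your account at "
          "http://evil.example/login or mail help@evil.example now!!!")
cleaned = clean_email_text(sample)
print(cleaned)  # dear user verify your account at url or mail email now
```

Replacing every address with a single token lets the model learn that "an email contains a link" matters without memorizing individual URLs.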

Email Text → TF-IDF → Logistic Regression → Result

Core AI Brain

Logistic Regression Engine

The central classifier. It takes the TF-IDF feature vector of an email and outputs an estimated probability that the email is a phishing attempt.

sklearn.linear_model.LogisticRegression
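To make the probability output concrete, here is a minimal standard-library sketch of how a logistic regression model turns a feature vector into a probability. The weights, bias, and feature values are hypothetical numbers for illustration, not coefficients from the trained PhishGuard AI model.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def phishing_probability(weights, features, bias):
    """P(phishing) = sigmoid(bias + sum_j weights[j] * features[j])."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for three TF-IDF features, e.g. "verify", "url", "meeting".
weights = [2.0, 1.5, -1.0]
bias = -0.5

p_suspicious = phishing_probability(weights, [0.6, 0.8, 0.0], bias)
p_benign = phishing_probability(weights, [0.0, 0.0, 1.0], bias)
print(round(p_suspicious, 3), round(p_benign, 3))
```

Features with positive weights push the probability toward phishing, and features with negative weights push it toward safe.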

Semantic Extraction

TF-IDF Word Frequency Model

Converts email text into sparse numeric vectors. Each single word (unigram) and two-word sequence (bigram) is weighted by how often it appears in the email, discounted by how common it is across the whole dataset.

TfidfVectorizer(ngram_range=(1,2))
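A small sketch of unigram + bigram TF-IDF extraction with scikit-learn follows. The three-document corpus is made up for illustration. Mapping the page's "Min Term Frequency" and "Max Term Frequency" settings to `min_df=3` and `max_df=0.9` is an assumption, and those two arguments are left out here because they would discard every term in such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only.
docs = [
    "verify your account now",
    "your account statement is ready",
    "meeting notes for your team",
]

# On the full dataset one might also pass min_df=3 and max_df=0.9 (assumed mapping
# of the settings listed on this page); they are omitted for this tiny corpus.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    sublinear_tf=True,    # use 1 + log(tf) instead of raw term counts
    max_features=40000,   # cap on vocabulary size
)
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each learned term to its column index.
bigrams = sorted(term for term in vectorizer.vocabulary_ if " " in term)
print(X.shape[0], "documents,", X.shape[1], "features")
print(bigrams[:5])
```

Bigrams such as "your account" let the model pick up short phrases that individual words would miss.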

Complexity Tuning

Regularization Scale (C = 1.0)

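Controls regularization strength. In scikit-learn, C is the inverse of regularization strength, so smaller values regularize more strongly. A value of 1.0 applies moderate regularization, which helps limit overfitting so the model generalizes better to unseen emails.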

C = 1.0 (scikit-learn default)
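The effect of C can be seen by fitting the same model with different values and comparing the size of the learned coefficients. This is a rough sketch on synthetic data generated with `make_classification`, not on the phishing dataset; the sample size, feature count, and the C values other than 1.0 are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=1000)
    model.fit(X, y)
    # Smaller C means a stronger penalty, which shrinks the coefficients.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient norm={coef_norms[C]:.3f}")
```

Very small C values can underfit, while very large ones let the model fit noise in the training set, so C is usually chosen by validation rather than fixed in advance.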

Optimizer Solver

L-BFGS Numerical Optimizer

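L-BFGS is a limited-memory quasi-Newton optimization algorithm. It iteratively adjusts the model coefficients to minimize the training loss, which determines the decision boundary between safe and phishing emails. Up to 1,000 iterations are allowed for convergence.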

solver='lbfgs' (max_iter=1000)
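Putting the settings listed on this page together, a training pipeline might look roughly like the sketch below. The eight example emails and their labels are invented for illustration, and the way the pieces are wired together (a scikit-learn `Pipeline` built with `make_pipeline`) is an assumption rather than the project's confirmed code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = phishing, 0 = safe.
emails = [
    "verify your account now click url",
    "urgent password reset required url",
    "your account is suspended confirm at url",
    "win a free prize claim url",
    "meeting notes attached for tomorrow",
    "lunch on friday with the team",
    "quarterly report draft for review",
    "thanks for the update see you monday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000, class_weight="balanced"),
)
model.fit(emails, labels)

# predict_proba returns one column per class; column 1 corresponds to label 1.
p_phish = model.predict_proba(["confirm your password at url"])[0, 1]
p_safe = model.predict_proba(["notes for the team meeting"])[0, 1]
print(f"phishing-like: {p_phish:.3f}  safe-like: {p_safe:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline ensures that new emails are transformed with the vocabulary learned during training.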

Bias Prevention

Balanced Weight Distribution

Weights each class inversely to its frequency in the training data, so that errors on the less common class count as much as errors on the more common one and the model is not biased toward the majority class.

class_weight='balanced'
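The weights implied by `class_weight='balanced'` follow the formula given in the scikit-learn documentation, `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula with the standard library; the class counts (6 safe, 2 phishing) are made up and do not reflect the real dataset's balance.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels: 6 safe (0) and 2 phishing (1).
labels = [0] * 6 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to the training loss regardless of how many samples it has.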

Threat Intel Split

80% Learn / 20% Rigorous Test

Splits the email dataset: 80% is used to train the model and 20% is held out to evaluate it on emails it has not seen. The split is stratified, so both sets keep the same proportion of phishing and safe emails.

stratified_split(ratio=0.2)
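The label `stratified_split(ratio=0.2)` above does not correspond to a scikit-learn function by that name. One common way to get a stratified 80/20 split in scikit-learn is `train_test_split` with the `stratify` argument, sketched below on made-up data. The `random_state` value and the 60/40 class mix are assumptions for the example. The counts reported on this page (14,476 training and 3,620 test samples) are consistent with a 0.2 test fraction of 18,096 emails.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 100 placeholder emails, 60 safe (0) and 40 phishing (1).
texts = [f"email {i}" for i in range(100)]
labels = [0] * 60 + [1] * 40

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # hold out 20% for testing
    stratify=labels,    # keep the class ratio the same in both parts
    random_state=42,    # assumed seed, only for reproducibility of this example
)

print(len(X_train), len(X_test))
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```

Stratifying matters most when one class is rare, because a plain random split could leave the test set with too few examples of it.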

PhishGuard AI is an academic AI cybersecurity project that demonstrates email phishing detection using machine learning. The system uses a TF-IDF + Logistic Regression pipeline built on a dataset of 18,000+ real-world emails (14,476 for training, 3,620 for testing) and reaches 98.26% accuracy on the held-out test set.

Version 1.0 • Built with Flask, scikit-learn • Dataset: Phishing Email Dataset