AI & Machine Learning Glossary
Clear, human-friendly definitions for the most important AI terms.
A
Activation Function
Mathematical function applied to neuron outputs; introduces non-linearity (e.g., ReLU, sigmoid).
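The two activation functions named above can be sketched in a few lines of plain Python (real frameworks apply these element-wise over tensors):

```python
import math

def relu(x):
    # ReLU: zero out negative inputs, pass positive ones through unchanged
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # negatives clipped to 0, positives kept
print(sigmoid(0.0))            # exactly halfway between 0 and 1
```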
Agent
An AI system that can take a series of actions toward a goal, often using tools/APIs and memory.
Alignment
Techniques that make model behavior match human goals and values (e.g., preference tuning, RLHF).
API
Application Programming Interface—structured way software talks to models or services programmatically.
Attention
Mechanism that lets each token “look at” others to build context; the core of transformer models.
Autoencoder
Neural network trained to compress input into a latent space and reconstruct it, useful for representation learning.
B
Backpropagation
Algorithm for training neural networks by propagating error gradients backward through layers.
Batch Normalization
Technique that normalizes activations within a mini-batch to stabilize and speed up training.
Beam Search
Decoding that keeps multiple candidate sequences to improve coherence vs. greedy decoding.
Benchmark
Standardized test set to compare model performance (e.g., MMLU, GSM8K).
Bias
Systematic error from data or design that skews outputs; reduced with careful data and evaluation.
Bi-Encoder
Model that encodes query and document separately for fast vector search (used before reranking).
C
Chain of Thought
Prompting that elicits step-by-step reasoning before final answers; use selectively for reliability.
Clustering
Unsupervised grouping of data points based on similarity (e.g., k-means, DBSCAN).
Convolutional Neural Network (CNN)
Architecture designed for grid-like data such as images; uses convolution filters to extract features.
Context Window
How much text (tokens) a model can consider at once; larger windows handle longer docs/chats.
Cross-Encoder
Scores a pair together (query+document) for precise relevance, often after vector recall.
D
Data Augmentation
Expanding training sets by adding modified copies (rotations, paraphrases, noise) to improve generalization.
Dataset
Curated data used for training or evaluation; quality and coverage strongly impact model behavior.
Diffusion Model
Generative model that learns to remove noise step-by-step to create images, audio, or video.
Distillation
Training a smaller “student” model to mimic a larger “teacher” model for speed and cost.
Dropout
Regularization technique where random neurons are “dropped” during training to prevent overfitting.
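A minimal sketch of "inverted" dropout, the variant most frameworks use, in plain Python:

```python
import random

def dropout(values, p=0.5, training=True, seed=None):
    # During training, zero each value with probability p and scale
    # survivors by 1/(1-p) so the expected sum stays the same.
    # At inference time (training=False), dropout is a no-op.
    if not training or p == 0.0:
        return list(values)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Because survivors are rescaled, no extra correction is needed at inference time.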
E
Embedding
Numeric vector capturing semantic meaning; enables search, clustering, and retrieval-augmented generation.
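Semantic similarity between embeddings is typically measured with cosine similarity; a toy sketch (real embeddings have hundreds or thousands of dimensions, and these 3-dimensional vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = orthogonal (unrelated), -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for illustration only
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]
print(cosine_similarity(cat, dog))  # high: related concepts
print(cosine_similarity(cat, car))  # low: unrelated concepts
```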
Encoder
Model part that turns input into representations; used in bi-encoders, cross-encoders, and enc-dec models.
Epoch
One full pass of the training dataset through the model.
Evaluation
Measuring model quality via metrics and human review; essential for safety and reliability.
Exploding Gradients
Problem in training deep nets where gradients grow uncontrollably, destabilizing updates.
F
Feature Extraction
Process of converting raw data into informative inputs (manual or learned).
Few-Shot Prompting
Providing a handful of examples in the prompt to guide the model toward the desired pattern.
Few-Shot Learning
Model’s ability to generalize to new tasks with very few labeled examples.
Fine-Tuning
Further training on a specific dataset to specialize a base model for a target task.
Foundation Model
Large pretrained model adaptable to many tasks (e.g., GPT, Llama, T5, CLIP).
G
GAN (Generative Adversarial Network)
Model with two networks (generator and discriminator) competing to create realistic outputs.
Generalization
How well a model performs on unseen data outside its training set.
Generative AI
Models that create text, images, audio, or video by learning patterns from data.
Gradient Descent
Optimization method that iteratively adjusts parameters to minimize loss.
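The update rule is just "step against the gradient"; a minimal sketch minimizing a one-dimensional function:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly move the parameter a small step opposite the gradient,
    # which locally decreases the loss.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges to ~3.0
```

The learning rate `lr` controls the step size: too small and convergence is slow, too large and the updates overshoot or diverge.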
Guardrails
Policies, prompts, or filters that constrain model behavior for safety and compliance.
H
Hallucination
When a model outputs confident but false information; mitigated by retrieval, prompts, and review.
Heuristics
Simple rules-of-thumb or strategies often used for initial problem-solving or evaluation baselines.
Hidden Layer
Intermediate layer of a neural network between input and output, where features are transformed.
Hyperparameters
Training settings chosen by humans (e.g., learning rate, batch size) rather than learned weights.
I
Inference
Running a trained model to produce outputs (e.g., generating a response or image).
Instruction Tuning
Fine-tuning on task instructions so the model follows natural language commands better.
J
JIT Compiler
“Just-In-Time” compilation to speed up model execution by compiling on the fly for the current hardware.
K
Knowledge Distillation
Transferring knowledge from a large “teacher” model to a smaller “student” for efficiency.
k-means
A simple clustering algorithm that partitions data into k groups by minimizing distance to centroids.
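A compact sketch of Lloyd's algorithm (the standard k-means procedure), using a naive "first k points" initialization for determinism; production implementations use smarter initialization such as k-means++:

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    centroids = points[:k]  # naive init; k-means++ is better in practice
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if nothing was assigned to it
                centroids[i] = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    return centroids

# Two well-separated blobs around (0, 0) and (5, 5)
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(sorted(kmeans(points, k=2)))  # one centroid near each blob
```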
L
Large Language Model (LLM)
Neural network trained on vast text to understand and generate language.
Latency
Time from request to response; affected by model size, context length, and hardware.
Learning Rate
Hyperparameter controlling the size of weight updates during training.
Logits
Raw outputs of a model before applying softmax or another activation function.
Loss
The objective a model minimizes during training (e.g., cross-entropy for next-token prediction).
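Logits, softmax, and cross-entropy loss fit together in a few lines; a minimal sketch for a three-class example:

```python
import math

def softmax(logits):
    # Convert raw logits into probabilities that sum to 1
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Loss = negative log-probability of the correct class:
    # near 0 when confidently correct, large when confidently wrong.
    probs = softmax(logits)
    return -math.log(probs[target_index])

logits = [2.0, 0.5, -1.0]        # raw scores for three classes/tokens
print(softmax(logits))           # probabilities, highest for class 0
print(cross_entropy(logits, 0))  # small loss: class 0 has the top logit
```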
M
Machine Learning
Algorithms that improve automatically with data, powering prediction and generation tasks.
Meta-Learning
“Learning to learn”; models that can quickly adapt to new tasks with minimal data.
Mixture of Experts (MoE)
Architecture that routes tokens to specialized sub-networks (“experts”) for efficiency and scale.
Model Parameters
Learned weights that define model behavior; modern LLMs often have billions.
Multimodal AI
AI systems that process and integrate multiple input types (e.g., text + images + audio).
N
NLP (Natural Language Processing)
Field focused on enabling computers to understand and generate human language.
Normalization
Layer operations (e.g., LayerNorm) that stabilize training and improve gradients.
Negative Sampling
Training trick that samples negative examples to efficiently learn embeddings or classifiers.
O
One-Shot Learning
Generalizing from a single example per class or task.
Optimizer
Algorithm that updates model weights from gradients (e.g., Adam, Adafactor).
Out-of-Distribution (OOD)
Data that differs from the training set distribution, where models often fail.
Overfitting
Model performs well on training data but poorly on new data; mitigated by regularization and better data.
P
Parameters
Another name for learned weights inside a model.
Pretraining
Large-scale initial training to learn general language patterns before task adaptation.
Prompt
The instruction/context you give a model to steer its output; templates improve consistency.
Prompt Engineering
Systematically designing prompts and constraints to get reliable outputs from models.
Q
Quantization
Reducing numeric precision of weights/activations (e.g., 8-bit) to speed up inference and cut memory use.
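A minimal sketch of symmetric linear quantization, one common scheme (real toolkits also handle per-channel scales, zero points, and calibration):

```python
def quantize(weights, bits=8):
    # Map floats to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]
    # using a single scale factor derived from the largest magnitude.
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; small rounding error is the price
    # paid for the memory and speed savings.
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.03, 0.98]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)         # small integers instead of 32-bit floats
print(restored)  # close to, but not exactly, the originals
```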
Q-Learning
Reinforcement learning method that learns the value of actions (Q-values) in order to select good policies over time.
R
Retrieval-Augmented Generation (RAG)
Combines search with generation so outputs can cite and ground facts from relevant sources.
Regularization
Techniques that reduce overfitting (dropout, weight decay, data augmentation).
Reinforcement Learning
Training agents by rewarding desired behaviors in simulated or real environments.
RLHF
Training with human preference signals via a reward model and reinforcement learning.
Reranking
Reordering retrieved documents by a stronger model (often a cross-encoder) for better precision.
S
Safety
Approaches to reduce harmful, biased, or private outputs—policies, filters, red-teaming.
Sampling
Stochastic decoding that trades determinism for diversity using temperature, top-p, or top-k.
Self-Attention
Operation where each token attends to others, building a contextual representation.
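A toy sketch of scaled dot-product attention, the core computation: each query scores every key, the scores become weights via softmax, and the output is a weighted sum of the value vectors. (Real transformers first project learned embeddings into Q, K, V with weight matrices and run many heads in parallel; the vectors here are made up for illustration.)

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    # For each query: dot it with every key, scale by sqrt(d),
    # softmax the scores, then blend the value vectors.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 3 tokens with 2-dimensional Q/K/V
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(Q, K, V))  # each row is a context-blended token
```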
Stochastic Gradient Descent (SGD)
Training method using mini-batches of data to approximate gradients efficiently.
Synthetic Data
Artificially generated data used to augment or replace real-world data for training/testing.
T
Temperature
Controls randomness in sampling; lower = safer/drier, higher = more creative.
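Mechanically, temperature divides the logits before softmax; a sketch of temperature-scaled sampling in plain Python:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=None):
    # T < 1 sharpens the distribution (more deterministic),
    # T > 1 flattens it (more diverse outputs).
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.Random(seed).choices(range(len(logits)), weights=probs)[0]

logits = [3.0, 1.0, 0.2]
cold = [sample_with_temperature(logits, 0.1, seed=i) for i in range(20)]
hot = [sample_with_temperature(logits, 5.0, seed=i) for i in range(20)]
print(cold)  # low T: nearly always picks the top-scoring token
print(hot)   # high T: picks spread across all tokens
```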
Token
A minimal unit of text (byte, char, or subword) the model processes.
Transformer
Neural architecture based on attention; foundation of modern language and vision models.
Transformer Block
A repeated unit in transformers, combining multi-head attention and feed-forward networks.
Transfer Learning
Applying knowledge from a pretrained model to new but related tasks with minimal retraining.
U
Unsupervised Learning
Learning patterns from unlabeled data (e.g., clustering, dimensionality reduction).
UX for AI
Designing interfaces that set expectations and guide users when interacting with probabilistic models.
V
Variational Autoencoder (VAE)
Generative model that learns a distribution over latent variables, useful for image and data synthesis.
Vector Database
Database optimized for storing/searching embeddings; used in RAG pipelines.
Vision Transformer (ViT)
Model applying transformer architecture to image patches for classification and vision tasks.
Vision-Language Model (VLM)
Model that jointly understands images and text (e.g., captioning, visual QA).
W
Weight Decay
L2 regularization applied to weights to prevent overfitting by discouraging large values.
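For plain SGD, weight decay amounts to adding `wd * w` to each gradient; a one-step sketch:

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.01, wd=0.01):
    # Adding wd * w to the gradient is equivalent to L2 regularization
    # under plain SGD: large weights are steadily pulled toward zero.
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

w = [2.0, -1.5]
w = sgd_step_with_weight_decay(w, grads=[0.0, 0.0])
print(w)  # weights shrink toward zero even with zero gradient
```

Note that with adaptive optimizers like Adam this equivalence breaks down, which is why AdamW applies the decay separately ("decoupled" weight decay).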
WordPiece
Subword tokenization algorithm used by models like BERT to balance vocabulary size and coverage.
X
XGBoost
Efficient gradient-boosted decision tree library widely used for tabular ML tasks.
Y
YOLO
“You Only Look Once”; a family of real-time object detection architectures.
Z
Zero-Shot Learning
Handling tasks or classes never seen during training by leveraging generalizable representations.