AI & Machine Learning Glossary
Clear, human-friendly definitions for the most important AI terms.
A
Activation Function
Mathematical function applied to neuron outputs; introduces non-linearity (e.g., ReLU, sigmoid).
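The two activation functions named above can be sketched in a few lines of plain Python (real frameworks apply these element-wise over tensors):

```python
import math

def relu(x):
    # ReLU: zero out negative inputs, pass positive ones through unchanged
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.0), relu(3.0))   # negatives clipped to 0, positives kept
print(sigmoid(0.0))            # exactly halfway between 0 and 1
```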
Agent
An AI system that can take a series of actions toward a goal, often using tools/APIs and memory.
Alignment
Techniques that make model behavior match human goals and values (e.g., preference tuning, RLHF).
API
Application Programming Interface—structured way software talks to models or services programmatically.
Attention
Mechanism that lets each token “look at” others to build context; the core of transformer models.
Autoencoder
Neural network trained to compress input into a latent space and reconstruct it, useful for representation learning.
B
Backpropagation
Algorithm for training neural networks by propagating error gradients backward through layers.
Batch Normalization
Technique that normalizes activations within a mini-batch to stabilize and speed up training.
Beam Search
Decoding that keeps multiple candidate sequences to improve coherence vs. greedy decoding.
Benchmark
Standardized test set to compare model performance (e.g., MMLU, GSM8K).
Bias
Systematic error from data or design that skews outputs; reduced with careful data and evaluation.
Bi-Encoder
Model that encodes query and document separately for fast vector search (used before reranking).
C
Chain of Thought
Prompting that elicits step-by-step reasoning before final answers; use selectively for reliability.
Clustering
Unsupervised grouping of data points based on similarity (e.g., k-means, DBSCAN).
Convolutional Neural Network (CNN)
Architecture designed for grid-like data such as images; uses convolution filters to extract features.
Context Window
How much text (tokens) a model can consider at once; larger windows handle longer docs/chats.
Cross-Encoder
Scores a pair together (query+document) for precise relevance, often after vector recall.
D
Data Augmentation
Expanding training sets by adding modified copies (rotations, paraphrases, noise) to improve generalization.
Dataset
Curated data used for training or evaluation; quality and coverage strongly impact model behavior.
Diffusion Model
Generative model that learns to remove noise step-by-step to create images, audio, or video.
Distillation
Training a smaller “student” model to mimic a larger “teacher” model for speed and cost.
Dropout
Regularization technique where random neurons are “dropped” during training to prevent overfitting.
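A minimal sketch of "inverted" dropout, the variant most frameworks use, in plain Python:

```python
import random

def dropout(values, p=0.5, training=True, seed=None):
    # During training, zero each value with probability p and scale
    # survivors by 1/(1-p) so the expected sum stays the same.
    # At inference time (training=False), dropout is a no-op.
    if not training or p == 0.0:
        return list(values)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Because survivors are rescaled, no extra correction is needed at inference time.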
E
Embedding
Numeric vector capturing semantic meaning; enables search, clustering, and retrieval-augmented generation.
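Semantic similarity between embeddings is typically measured with cosine similarity; a toy sketch (real embeddings have hundreds or thousands of dimensions, and these 3-dimensional vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = orthogonal (unrelated), -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for illustration only
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]
print(cosine_similarity(cat, dog))  # high: related concepts
print(cosine_similarity(cat, car))  # low: unrelated concepts
```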
Encoder
Model part that turns input into representations; used in bi-encoders, cross-encoders, and enc-dec models.
Epoch
One full pass of the training dataset through the model.
Evaluation
Measuring model quality via metrics and human review; essential for safety and reliability.
Exploding Gradients
Problem in training deep nets where gradients grow uncontrollably, destabilizing updates.
F
Feature Extraction
Process of converting raw data into informative inputs (manual or learned).
Few-Shot Prompting
Providing a handful of examples in the prompt to guide the model toward the desired pattern.
Few-Shot Learning
Model’s ability to generalize to new tasks with very few labeled examples.
Fine-Tuning
Further training on a specific dataset to specialize a base model for a target task.
Foundation Model
Large pretrained model adaptable to many tasks (e.g., GPT, Llama, T5, CLIP).
G
GAN (Generative Adversarial Network)
Model with two networks (generator and discriminator) competing to create realistic outputs.
Generalization
How well a model performs on unseen data outside its training set.
Generative AI
Models that create text, images, audio, or video by learning patterns from data.
Gradient Descent
Optimization method that iteratively adjusts parameters to minimize loss.
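The update rule is just "step against the gradient"; a minimal sketch minimizing a one-dimensional function:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly move the parameter a small step opposite the gradient,
    # which locally decreases the loss.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges to ~3.0
```

The learning rate `lr` controls the step size: too small and convergence is slow, too large and the updates overshoot or diverge.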
Guardrails
Policies, prompts, or filters that constrain model behavior for safety and compliance.
H
Hallucination
When a model outputs confident but false information; mitigated by retrieval, prompts, and review.
Heuristics
Simple rules-of-thumb or strategies often used for initial problem-solving or evaluation baselines.
Hidden Layer
Intermediate layer of a neural network between input and output, where features are transformed.
Hyperparameters
Training settings chosen by humans (e.g., learning rate, batch size) rather than learned weights.
I
Inference
Running a trained model to produce outputs (e.g., generating a response or image).
Instruction Tuning
Fine-tuning on task instructions so the model follows natural language commands better.
J
JIT Compiler
“Just-In-Time” compilation to speed up model execution by compiling on the fly for the current hardware.
K
Knowledge Distillation
Transferring knowledge from a large “teacher” model to a smaller “student” for efficiency.
k-means
A simple clustering algorithm that partitions data into k groups by minimizing distance to centroids.
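A compact sketch of Lloyd's algorithm (the standard k-means procedure), using a naive "first k points" initialization for determinism; production implementations use smarter initialization such as k-means++:

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    centroids = points[:k]  # naive init; k-means++ is better in practice
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if nothing was assigned to it
                centroids[i] = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    return centroids

# Two well-separated blobs around (0, 0) and (5, 5)
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(sorted(kmeans(points, k=2)))  # one centroid near each blob
```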
L
Large Language Model (LLM)
Neural network trained on vast text to understand and generate language.
Latency
Time from request to response; affected by model size, context length, and hardware.
Learning Rate
Hyperparameter controlling the size of weight updates during training.
Logits
Raw outputs of a model before applying softmax or another activation function.
Loss
The objective a model minimizes during training (e.g., cross-entropy for next-token prediction).
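Logits, softmax, and cross-entropy loss fit together in a few lines; a minimal sketch for a three-class example:

```python
import math

def softmax(logits):
    # Convert raw logits into probabilities that sum to 1
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Loss = negative log-probability of the correct class:
    # near 0 when confidently correct, large when confidently wrong.
    probs = softmax(logits)
    return -math.log(probs[target_index])

logits = [2.0, 0.5, -1.0]        # raw scores for three classes/tokens
print(softmax(logits))           # probabilities, highest for class 0
print(cross_entropy(logits, 0))  # small loss: class 0 has the top logit
```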
M
Machine Learning
Algorithms that improve automatically with data, powering prediction and generation tasks.
Meta-Learning
“Learning to learn”; models that can quickly adapt to new tasks with minimal data.
Mixture of Experts (MoE)
Architecture that routes tokens to specialized sub-networks (“experts”) for efficiency and scale.
Model Parameters
Learned weights that define model behavior; modern LLMs often have billions.
Multimodal AI
AI systems that process and integrate multiple input types (e.g., text + images + audio).
N
NLP (Natural Language Processing)
Field focused on enabling computers to understand and generate human language.
Normalization
Layer operations (e.g., LayerNorm) that stabilize training and improve gradients.
Negative Sampling
Training trick that samples negative examples to efficiently learn embeddings or classifiers.
O
One-Shot Learning
Generalizing from a single example per class or task.
Optimizer
Algorithm that updates model weights from gradients (e.g., Adam, Adafactor).
Out-of-Distribution (OOD)
Data that differs from the training set distribution, where models often fail.
Overfitting
Model performs well on training data but poorly on new data; mitigated by regularization and better data.
P
Parameters
Another name for learned weights inside a model.
Pretraining
Large-scale initial training to learn general language patterns before task adaptation.
Prompt
The instruction/context you give a model to steer its output; templates improve consistency.
Prompt Engineering
Systematically designing prompts and constraints to get reliable outputs from models.
Q
Quantization
Reducing numeric precision of weights/activations (e.g., 8-bit) to speed up inference and cut memory use.
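A minimal sketch of symmetric linear quantization, one common scheme (real toolkits also handle per-channel scales, zero points, and calibration):

```python
def quantize(weights, bits=8):
    # Map floats to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]
    # using a single scale factor derived from the largest magnitude.
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; small rounding error is the price
    # paid for the memory and speed savings.
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.03, 0.98]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)         # small integers instead of 32-bit floats
print(restored)  # close to, but not exactly, the originals
```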
Q-Learning
Reinforcement learning method that learns the value of actions (Q-values) in order to select good policies over time.
R
Retrieval-Augmented Generation (RAG)
Combines search with generation so outputs can cite and ground facts from relevant sources.
Regularization
Techniques that reduce overfitting (dropout, weight decay, data augmentation).
Reinforcement Learning
Training agents by rewarding desired behaviors in simulated or real environments.
RLHF
Training with human preference signals via a reward model and reinforcement learning.
Reranking
Reordering retrieved documents by a stronger model (often a cross-encoder) for better precision.
S
Safety
Approaches to reduce harmful, biased, or private outputs—policies, filters, red-teaming.
Sampling
Stochastic decoding that trades determinism for diversity using temperature, top-p, or top-k.
Self-Attention
Operation where each token attends to others, building a contextual representation.
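A toy sketch of scaled dot-product attention, the core computation: each query scores every key, the scores become weights via softmax, and the output is a weighted sum of the value vectors. (Real transformers first project learned embeddings into Q, K, V with weight matrices and run many heads in parallel; the vectors here are made up for illustration.)

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    # For each query: dot it with every key, scale by sqrt(d),
    # softmax the scores, then blend the value vectors.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 3 tokens with 2-dimensional Q/K/V
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(Q, K, V))  # each row is a context-blended token
```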
Stochastic Gradient Descent (SGD)
Training method using mini-batches of data to approximate gradients efficiently.
Synthetic Data
Artificially generated data used to augment or replace real-world data for training/testing.
T
Temperature
Controls randomness in sampling; lower = safer/drier, higher = more creative.
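Mechanically, temperature divides the logits before softmax; a sketch of temperature-scaled sampling in plain Python:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=None):
    # T < 1 sharpens the distribution (more deterministic),
    # T > 1 flattens it (more diverse outputs).
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.Random(seed).choices(range(len(logits)), weights=probs)[0]

logits = [3.0, 1.0, 0.2]
cold = [sample_with_temperature(logits, 0.1, seed=i) for i in range(20)]
hot = [sample_with_temperature(logits, 5.0, seed=i) for i in range(20)]
print(cold)  # low T: nearly always picks the top-scoring token
print(hot)   # high T: picks spread across all tokens
```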
Token
A minimal unit of text (byte, char, or subword) the model processes.
Transformer
Neural architecture based on attention; foundation of modern language and vision models.
Transformer Block
A repeated unit in transformers, combining multi-head attention and feed-forward networks.
Transfer Learning
Applying knowledge from a pretrained model to new but related tasks with minimal retraining.
U
Unsupervised Learning
Learning patterns from unlabeled data (e.g., clustering, dimensionality reduction).
UX for AI
Designing interfaces that set expectations and guide users when interacting with probabilistic models.
V
Variational Autoencoder (VAE)
Generative model that learns a distribution over latent variables, useful for image and data synthesis.
Vector Database
Database optimized for storing/searching embeddings; used in RAG pipelines.
Vision Transformer (ViT)
Model applying transformer architecture to image patches for classification and vision tasks.
Vision-Language Model (VLM)
Model that jointly understands images and text (e.g., captioning, visual QA).
W
Weight Decay
L2 regularization applied to weights to prevent overfitting by discouraging large values.
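For plain SGD, weight decay amounts to adding `wd * w` to each gradient; a one-step sketch:

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.01, wd=0.01):
    # Adding wd * w to the gradient is equivalent to L2 regularization
    # under plain SGD: large weights are steadily pulled toward zero.
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

w = [2.0, -1.5]
w = sgd_step_with_weight_decay(w, grads=[0.0, 0.0])
print(w)  # weights shrink toward zero even with zero gradient
```

Note that with adaptive optimizers like Adam this equivalence breaks down, which is why AdamW applies the decay separately ("decoupled" weight decay).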
WordPiece
Subword tokenization algorithm used by models like BERT to balance vocabulary size and coverage.
X
XGBoost
Efficient gradient-boosted decision tree library widely used for tabular ML tasks.
Y
YOLO
“You Only Look Once”; a family of real-time object detection architectures.
Z
Zero-Shot Learning
Handling tasks or classes never seen during training by leveraging generalizable representations.