Introduction – Why YOLO Changed Everything
Before YOLO, computers did not “see” the world the way humans do.
Object detection systems were careful, slow, and fragmented. They first proposed regions that might contain objects, then classified each region separately. Detection worked—but it felt like solving a puzzle one piece at a time.
In 2015, YOLO—You Only Look Once—introduced a radical idea:
What if we detect everything in one single forward pass?
Instead of multiple stages, YOLO treated detection as a single regression problem from pixels to bounding boxes and class probabilities.
This guide walks through how to implement YOLO completely from scratch in PyTorch, covering:
Mathematical formulation
Network architecture
Target encoding
Loss implementation
Training on COCO-style data
mAP evaluation
Visualization & debugging
Inference with NMS
Anchor-box extension
1) What YOLO means (and what we’ll build)
YOLO (You Only Look Once) is a family of object detection models that predict bounding boxes and class probabilities in one forward pass. Unlike older multi-stage pipelines (proposal → refine → classify), YOLO-style detectors are dense predictors: they predict candidate boxes at many locations and scales, then filter them.
There are two “eras” of YOLO-like detectors:
YOLOv1-style (grid cells, no anchors): each grid cell predicts a few boxes directly.
Anchor-based YOLO (YOLOv2/3 and many derivatives): each grid cell predicts offsets relative to pre-defined anchor shapes; multiple scales predict small/medium/large objects.
What we’ll implement
A modern, anchor-based YOLO-style detector with:
Multi-scale heads (e.g., 3 scales)
Anchor matching (target assignment)
Loss with box regression + objectness + classification
Decoding + NMS
mAP evaluation
COCO/custom dataset training support
We’ll keep the architecture understandable rather than exotic. You can later swap in a bigger backbone easily.
2) Bounding box formats and coordinate systems
You must be consistent. Most training bugs come from box format confusion.
Common box formats:
XYXY: (x1, y1, x2, y2), top-left and bottom-right corners
XYWH: (cx, cy, w, h), center and size
Normalized: coordinates in [0, 1] relative to image size
Absolute: pixel coordinates
Recommended internal convention
Store dataset annotations as absolute XYXY in pixels.
Convert to normalized only if needed, but keep one standard.
Why XYXY is nice:
Intersection/union is straightforward.
Clamping to image bounds is simple.
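As a quick illustration, clamping absolute XYXY boxes to the image is a two-liner (a minimal sketch; clamp_boxes_xyxy is an illustrative helper, not one of the repo modules below):

import torch

def clamp_boxes_xyxy(boxes: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    # boxes: [N, 4] absolute XYXY pixels; clamp corners into the image
    boxes = boxes.clone()
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clamp(0, img_w)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clamp(0, img_h)
    return boxes

boxes = torch.tensor([[-5.0, 10.0, 700.0, 300.0]])
print(clamp_boxes_xyxy(boxes, img_w=640, img_h=640))  # tensor([[  0.,  10., 640., 300.]])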
3) IoU, GIoU, DIoU, CIoU
IoU (Intersection over Union) is the standard overlap metric:
IoU = |A ∩ B| / |A ∪ B|
But IoU has a limitation: if two boxes don't overlap at all, IoU is 0 and provides no useful gradient. Modern detectors often use improved regression losses:
GIoU: adds penalty for non-overlapping boxes based on smallest enclosing box
DIoU: penalizes center distance
CIoU: DIoU + aspect ratio consistency
Practical rule:
If you want a strong default: CIoU for box regression.
If you want simpler: GIoU works well too.
We’ll implement IoU + CIoU (with safe numerics).
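For completeness, here is a minimal GIoU sketch for matched box pairs (giou_xyxy is an illustrative helper, not one of the repo files below; the CIoU we actually train with lives in yolo/box_ops.py):

import torch

def giou_xyxy(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # pred, target: [N, 4] matched pairs in XYXY pixels; returns GIoU in [-1, 1]
    ix1 = torch.maximum(pred[:, 0], target[:, 0])
    iy1 = torch.maximum(pred[:, 1], target[:, 1])
    ix2 = torch.minimum(pred[:, 2], target[:, 2])
    iy2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]).clamp(min=0) * (target[:, 3] - target[:, 1]).clamp(min=0)
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest enclosing box penalizes non-overlapping pairs
    cx1 = torch.minimum(pred[:, 0], target[:, 0])
    cy1 = torch.minimum(pred[:, 1], target[:, 1])
    cx2 = torch.maximum(pred[:, 2], target[:, 2])
    cy2 = torch.maximum(pred[:, 3], target[:, 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / (c_area + eps)  # loss would be 1 - giou_xyxy(...)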
4) Anchor-based YOLO: grids, anchors, predictions
A YOLO head predicts at each grid location. Suppose a feature map is S x S (e.g., 80×80). Each cell can predict A anchors (e.g., 3). For each anchor, prediction is:
Box offsets: tx, ty, tw, th
Objectness logit: to
Class logits: tc1..tcC
So the tensor shape per scale is (B, A*(5+C), S, S), or (B, A, S, S, 5+C) after reshaping.
How offsets become real boxes
A common YOLO-style decode (one of several valid variants):
bx = (sigmoid(tx) + cx) / S
by = (sigmoid(ty) + cy) / S
bw = (anchor_w * exp(tw)) / img_w   (or normalized by S)
bh = (anchor_h * exp(th)) / img_h
Where (cx, cy) is the integer grid coordinate.
Important: Your encode/decode must match your target assignment encoding.
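As a toy sanity check of that consistency, the snippet below decodes one prediction using the pixel variant (multiply by stride) that the loss code later in this guide uses; all numbers are made up:

import torch

# toy example: one prediction at grid cell (cx, cy) = (3, 5) on a 20x20 grid (stride 32, 640 input)
S, stride = 20, 32
anchor_w, anchor_h = 116.0, 90.0            # anchor size in input pixels
tx, ty, tw, th = torch.tensor([0.2, -0.3, 0.1, 0.4])
cx, cy = 3, 5

bx = (torch.sigmoid(tx) + cx) * stride      # center x in pixels
by = (torch.sigmoid(ty) + cy) * stride      # center y in pixels
bw = anchor_w * torch.exp(tw)               # width in pixels
bh = anchor_h * torch.exp(th)               # height in pixels
print(bx.item(), by.item(), bw.item(), bh.item())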
5) Dataset preparation
Annotation formats
Your custom dataset can be:
COCO JSON
Pascal VOC XML
YOLO txt (class cx cy w h normalized)
We’ll support a generic internal representation:
Each sample returns:
image: Tensor [3, H, W]
targets: Tensor [N, 6] with columns [class, x1, y1, x2, y2, image_index (optional)]
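As an example adapter, a hedged sketch that converts one YOLO-txt label file (class cx cy w h, normalized) into this internal format might look like this (the helper name load_yolo_txt_sample is ours):

import torch
from PIL import Image

def load_yolo_txt_sample(image_path: str, label_path: str) -> dict:
    # Convert one YOLO-txt label file to the internal [cls, x1, y1, x2, y2] pixel format
    w, h = Image.open(image_path).size
    annotations = []
    with open(label_path) as f:
        for line in f:
            cls, cx, cy, bw, bh = map(float, line.split())
            x1 = (cx - bw / 2) * w
            y1 = (cy - bh / 2) * h
            x2 = (cx + bw / 2) * w
            y2 = (cy + bh / 2) * h
            annotations.append([int(cls), x1, y1, x2, y2])
    return {"image_path": image_path, "annotations": annotations}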
Augmentations
For object detection, augmentations must transform boxes too:
Resize / letterbox
Random horizontal flip
Color jitter
Random affine (optional)
Mosaic/mixup (advanced; optional)
To keep this guide implementable without fragile geometry, we’ll do:
resize/letterbox
random flip
HSV jitter (optional)
6) Building blocks: Conv-BN-Act, residuals, necks
A clean baseline module:
Conv2d -> BatchNorm2d -> SiLU
SiLU (a.k.a. Swish) is common in YOLOv5-like families; LeakyReLU is common in YOLOv3.
We can optionally add residual blocks for a stronger backbone, but even a small backbone can work to validate the pipeline.
7) Model design
A typical structure:
Backbone: extracts feature maps at multiple strides (8, 16, 32)
Neck: combines features (FPN / PAN)
Head: predicts detection outputs per scale
We’ll implement a lightweight backbone that produces 3 feature maps and a simple FPN-like neck.
8) Decoding predictions
At inference:
Reshape outputs per scale to (B, A, S, S, 5+C)
Apply sigmoid to center offsets + objectness (and often class probs)
Convert to XYXY in pixel coordinates
Flatten all scales into one list of candidate boxes
Filter by confidence threshold
Apply NMS per class (or class-agnostic NMS)
9) Target assignment (matching GT to anchors)
This is the heart of anchor-based YOLO.
For each ground-truth box:
Determine which scale(s) should handle it (based on size / anchor match).
For the chosen scale, compute IoU between GT box size and each anchor size (in that scale’s coordinate system).
Select best anchor (or top-k anchors).
Compute the grid cell index from the GT center.
Fill the target tensors at [anchor, gy, gx] with:
box regression targets
objectness = 1
class target
Encoding regression targets
If using decode:
bx = (sigmoid(tx) + cx)/S
then the target for tx is sigmoid^-1(bx*S - cx), but that's messy.
Instead, YOLO-style training often directly supervises:
tx_target = bx*S - cx (a value in [0, 1]), trained with BCE on the sigmoid output or MSE on the raw value
tw_target = log(bw / anchor_w) (in pixels or normalized units)
We’ll implement a stable variant:
predict pxy = sigmoid(tx, ty) and supervise pxy with BCE/MSE against the fractional offsets
predict pwh = exp(tw, th) * anchor and supervise with CIoU on the decoded boxes (recommended)
That’s simpler: do regression loss on decoded boxes, not on tw/th directly.
10) Loss functions
YOLO-style loss usually has:
Box loss: CIoU/GIoU between predicted box and GT box at responsible locations
Objectness loss: BCEWithLogits on objectness logit
Class loss: BCEWithLogits (multi-label) or CE (single-label)
For single-label classification (one class per object), either works:
BCEWithLogits with one-hot targets (common in YOLO)
CrossEntropyLoss on class logits at positive locations (also fine)
We’ll use BCEWithLogits for both objectness and classes for consistency.
Handling negatives
You’ll have far more negative (no object) positions. You can:
Use a lower weight for negative objectness
Or apply focal loss (optional)
We’ll implement:
objectness loss with positive and negative weights.
11) Training loop
Key features for stability/performance:
AMP (torch.cuda.amp)
Gradient clipping (optional)
EMA weights (optional but helpful)
LR scheduler (cosine or step)
Warmup for first few epochs/steps
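As one possible scheduler setup, here is a hedged warmup + cosine sketch built on torch.optim.lr_scheduler.LambdaLR (warmup_steps and total_steps are placeholders; the train.py below keeps a constant LR for simplicity):

import math
import torch

def make_warmup_cosine(optimizer, warmup_steps=500, total_steps=50000, min_lr_ratio=0.05):
    # Linear warmup to the base LR, then cosine decay down to min_lr_ratio * base LR
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * t))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage: call scheduler.step() once per optimizer step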
12) NMS
Non-Max Suppression removes overlapping duplicates. Typical procedure:
Sort boxes by confidence
Iterate highest conf, suppress boxes with IoU > threshold
Use class-wise NMS for multi-class detection.
13) mAP evaluation
Mean Average Precision requires:
For each class, compute precision-recall curve at IoU thresholds
Integrate area under curve (AP)
Average across classes (mAP)
COCO uses mAP across IoU thresholds 0.50 to 0.95 step 0.05
We’ll implement:
mAP@0.5
and optionally COCO-style mAP@[.5:.95]
14) Visualization
Before training seriously, visualize:
target assignments per scale
decoded predictions after a few iterations
NMS outputs
This catches 80% of “my model doesn’t learn” issues.
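A minimal drawing helper is enough for these spot checks (a sketch assuming torchvision is installed; show_boxes is an illustrative name, not a repo module):

import torch
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image

def show_boxes(image, boxes_xyxy, labels=None, path="debug_boxes.png"):
    # image: float Tensor [3, H, W] in [0, 1]; boxes_xyxy: [N, 4] in pixels
    img_u8 = (image.clamp(0, 1) * 255).to(torch.uint8)
    names = [str(int(l)) for l in labels] if labels is not None else None
    drawn = draw_bounding_boxes(img_u8, boxes_xyxy, labels=names, width=2)
    to_pil_image(drawn).save(path)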
15) Full Core Implementation (Reference Code)
Below is a compact but complete set of core files you can place into a repo. It’s not “tiny,” but it’s readable and engineered for correctness.
15.1 Repo structure
yolo_scratch/
README.md
train.py
eval.py
predict.py
yolo/
__init__.py
model.py
modules.py
loss.py
assigner.py
box_ops.py
nms.py
metrics.py
data.py
transforms.py
utils.py
configs/
coco.yaml
custom.yaml
15.2 yolo/box_ops.py
import torch
def xyxy_to_xywh(boxes: torch.Tensor) -> torch.Tensor:
# boxes: [..., 4]
x1, y1, x2, y2 = boxes.unbind(-1)
cx = (x1 + x2) * 0.5
cy = (y1 + y2) * 0.5
w = (x2 - x1).clamp(min=0)
h = (y2 - y1).clamp(min=0)
return torch.stack([cx, cy, w, h], dim=-1)
def xywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
cx, cy, w, h = boxes.unbind(-1)
half_w = w * 0.5
half_h = h * 0.5
x1 = cx - half_w
y1 = cy - half_h
x2 = cx + half_w
y2 = cy + half_h
return torch.stack([x1, y1, x2, y2], dim=-1)
def box_iou_xyxy(boxes1: torch.Tensor, boxes2: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
# boxes1: [N,4], boxes2: [M,4]
x11, y11, x12, y12 = boxes1[:, 0], boxes1[:, 1], boxes1[:, 2], boxes1[:, 3]
x21, y21, x22, y22 = boxes2[:, 0], boxes2[:, 1], boxes2[:, 2], boxes2[:, 3]
inter_x1 = torch.maximum(x11[:, None], x21[None, :])
inter_y1 = torch.maximum(y11[:, None], y21[None, :])
inter_x2 = torch.minimum(x12[:, None], x22[None, :])
inter_y2 = torch.minimum(y12[:, None], y22[None, :])
inter_w = (inter_x2 - inter_x1).clamp(min=0)
inter_h = (inter_y2 - inter_y1).clamp(min=0)
inter = inter_w * inter_h
area1 = (x12 - x11).clamp(min=0) * (y12 - y11).clamp(min=0)
area2 = (x22 - x21).clamp(min=0) * (y22 - y21).clamp(min=0)
union = area1[:, None] + area2[None, :] - inter
return inter / (union + eps)
def ciou_loss_xyxy(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
"""
pred, target: [N,4] in xyxy
Returns: [N] CIoU loss = 1 - CIoU
"""
# IoU
iou = box_iou_xyxy(pred, target).diag() # [N]
# centers and sizes
p = xyxy_to_xywh(pred)
t = xyxy_to_xywh(target)
pcx, pcy, pw, ph = p.unbind(-1)
tcx, tcy, tw, th = t.unbind(-1)
# center distance
center_dist2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
# smallest enclosing box diagonal squared
x1 = torch.minimum(pred[:, 0], target[:, 0])
y1 = torch.minimum(pred[:, 1], target[:, 1])
x2 = torch.maximum(pred[:, 2], target[:, 2])
y2 = torch.maximum(pred[:, 3], target[:, 3])
c2 = (x2 - x1) ** 2 + (y2 - y1) ** 2 + eps
diou = iou - center_dist2 / c2
# aspect ratio penalty
v = (4 / (torch.pi ** 2)) * (torch.atan(tw / (th + eps)) - torch.atan(pw / (ph + eps))) ** 2
with torch.no_grad():
alpha = v / (1 - iou + v + eps)
ciou = diou - alpha * v
return 1 - ciou.clamp(min=-1.0, max=1.0)
15.3 yolo/nms.py
import torch
from .box_ops import box_iou_xyxy
def nms_xyxy(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5) -> torch.Tensor:
"""
boxes: [N,4], scores: [N]
returns indices kept
"""
if boxes.numel() == 0:
return torch.empty((0,), dtype=torch.long, device=boxes.device)
idxs = scores.argsort(descending=True)
keep = []
while idxs.numel() > 0:
i = idxs[0]
keep.append(i)
if idxs.numel() == 1:
break
rest = idxs[1:]
ious = box_iou_xyxy(boxes[i].unsqueeze(0), boxes[rest]).squeeze(0)
idxs = rest[ious <= iou_thresh]
return torch.stack(keep)
def batched_nms_xyxy(boxes, scores, labels, iou_thresh=0.5):
"""
Class-wise NMS by offsetting boxes or by filtering per class.
Here: filter per class (clear and correct).
"""
keep_all = []
for c in labels.unique():
mask = labels == c
keep = nms_xyxy(boxes[mask], scores[mask], iou_thresh)
keep_all.append(mask.nonzero(as_tuple=False).squeeze(1)[keep])
if not keep_all:
return torch.empty((0,), dtype=torch.long, device=boxes.device)
return torch.cat(keep_all)
15.4 yolo/modules.py
import torch
import torch.nn as nn
class ConvBNAct(nn.Module):
def __init__(self, in_ch, out_ch, k=3, s=1, p=None, act=True):
super().__init__()
if p is None:
p = k // 2
self.conv = nn.Conv2d(in_ch, out_ch, k, s, p, bias=False)
self.bn = nn.BatchNorm2d(out_ch)
self.act = nn.SiLU(inplace=True) if act else nn.Identity()
def forward(self, x):
return self.act(self.bn(self.conv(x)))
class Residual(nn.Module):
def __init__(self, ch):
super().__init__()
self.block = nn.Sequential(
ConvBNAct(ch, ch, 1, 1),
ConvBNAct(ch, ch, 3, 1),
)
def forward(self, x):
return x + self.block(x)
class CSPBlock(nn.Module):
"""
Light CSP-like block: split channels, apply residuals on one branch, then concat.
"""
def __init__(self, ch, n=1):
super().__init__()
c_ = ch // 2
self.conv1 = ConvBNAct(ch, c_, 1, 1)
self.conv2 = ConvBNAct(ch, c_, 1, 1)
self.m = nn.Sequential(*[Residual(c_) for _ in range(n)])
self.conv3 = ConvBNAct(2 * c_, ch, 1, 1)
def forward(self, x):
y1 = self.m(self.conv1(x))
y2 = self.conv2(x)
return self.conv3(torch.cat([y1, y2], dim=1))
15.5 yolo/model.py
import torch
import torch.nn as nn
from .modules import ConvBNAct, CSPBlock
class TinyBackbone(nn.Module):
"""
Produces 3 feature maps at strides 8, 16, 32.
"""
def __init__(self, in_ch=3, base=32):
super().__init__()
self.stem = nn.Sequential(
ConvBNAct(in_ch, base, 3, 2), # stride 2
ConvBNAct(base, base*2, 3, 2), # stride 4
CSPBlock(base*2, n=1),
)
self.stage3 = nn.Sequential(
ConvBNAct(base*2, base*4, 3, 2), # stride 8
CSPBlock(base*4, n=2),
)
self.stage4 = nn.Sequential(
ConvBNAct(base*4, base*8, 3, 2), # stride 16
CSPBlock(base*8, n=2),
)
self.stage5 = nn.Sequential(
ConvBNAct(base*8, base*16, 3, 2), # stride 32
CSPBlock(base*16, n=1),
)
def forward(self, x):
x = self.stem(x)
p3 = self.stage3(x)
p4 = self.stage4(p3)
p5 = self.stage5(p4)
return p3, p4, p5
class SimpleFPN(nn.Module):
def __init__(self, ch3, ch4, ch5, out_ch=128):
super().__init__()
self.lat5 = ConvBNAct(ch5, out_ch, 1, 1)
self.lat4 = ConvBNAct(ch4, out_ch, 1, 1)
self.lat3 = ConvBNAct(ch3, out_ch, 1, 1)
self.out4 = ConvBNAct(out_ch, out_ch, 3, 1)
self.out3 = ConvBNAct(out_ch, out_ch, 3, 1)
def forward(self, p3, p4, p5):
p5 = self.lat5(p5)
p4 = self.lat4(p4) + torch.nn.functional.interpolate(p5, scale_factor=2, mode="nearest")
p3 = self.lat3(p3) + torch.nn.functional.interpolate(p4, scale_factor=2, mode="nearest")
p4 = self.out4(p4)
p3 = self.out3(p3)
return p3, p4, p5
class DetectHead(nn.Module):
def __init__(self, in_ch, num_anchors, num_classes):
super().__init__()
self.num_anchors = num_anchors
self.num_classes = num_classes
self.pred = nn.Conv2d(in_ch, num_anchors * (5 + num_classes), 1, 1, 0)
def forward(self, x):
return self.pred(x)
class YOLO(nn.Module):
def __init__(self, num_classes, anchors, base=32):
"""
anchors: list of 3 scales, each is list of (w,h) in pixels for the model input size (e.g., 640)
e.g. [
[(10,13),(16,30),(33,23)], # stride 8
[(30,61),(62,45),(59,119)], # stride 16
[(116,90),(156,198),(373,326)] # stride 32
]
"""
super().__init__()
self.num_classes = num_classes
self.anchors = anchors
self.backbone = TinyBackbone(in_ch=3, base=base)
# backbone channels: p3=base*4, p4=base*8, p5=base*16
self.fpn = SimpleFPN(base*4, base*8, base*16, out_ch=base*4)
na = len(anchors[0])
self.head3 = DetectHead(base*4, na, num_classes)
self.head4 = DetectHead(base*4, na, num_classes)
self.head5 = DetectHead(base*4, na, num_classes)
def forward(self, x):
p3, p4, p5 = self.backbone(x)
f3, f4, f5 = self.fpn(p3, p4, p5)
o3 = self.head3(f3)
o4 = self.head4(f4)
o5 = self.head5(f5)
return [o3, o4, o5]
15.6 yolo/assigner.py (target assignment)
import torch
def build_targets(
targets, # list of length B, each: Tensor [Ni, 5] -> (cls, x1, y1, x2, y2) in pixels
anchors, # per scale: list of (w,h) in pixels at input size
strides, # [8,16,32]
img_size, # int, e.g. 640
num_classes,
device
):
"""
Returns per-scale target tensors:
tbox: list of [B, A, S, S, 4] in xyxy pixels
tobj: list of [B, A, S, S] (0/1)
tcls: list of [B, A, S, S, C] one-hot
indices: list of tuples for positives (b, a, gy, gx)
"""
B = len(targets)
out = []
indices_all = []
for scale_idx, (anc, stride) in enumerate(zip(anchors, strides)):
S = img_size // stride
A = len(anc)
tbox = torch.zeros((B, A, S, S, 4), device=device)
tobj = torch.zeros((B, A, S, S), device=device)
tcls = torch.zeros((B, A, S, S, num_classes), device=device)
indices = []
anc_wh = torch.tensor(anc, device=device, dtype=torch.float32) # [A,2]
for b in range(B):
if targets[b].numel() == 0:
continue
gt = targets[b].to(device)
cls = gt[:, 0].long()
x1y1 = gt[:, 1:3]
x2y2 = gt[:, 3:5]
gxy = (x1y1 + x2y2) * 0.5
gwh = (x2y2 - x1y1).clamp(min=1.0)
# pick best anchor by IoU of width/height (approx)
# IoU(wh) = min(w)/max(w) * min(h)/max(h)
wh = gwh[:, None, :] # [N,1,2]
min_wh = torch.minimum(wh, anc_wh[None, :, :])
max_wh = torch.maximum(wh, anc_wh[None, :, :])
iou_wh = (min_wh[..., 0] / max_wh[..., 0]) * (min_wh[..., 1] / max_wh[..., 1]) # [N,A]
best_a = torch.argmax(iou_wh, dim=1) # [N]
# grid cell
gx = (gxy[:, 0] / stride).clamp(min=0, max=S-1e-3)
gy = (gxy[:, 1] / stride).clamp(min=0, max=S-1e-3)
gi = gx.long()
gj = gy.long()
for i in range(gt.shape[0]):
a = best_a[i].item()
x1, y1, x2, y2 = gt[i, 1:].tolist()
# assign
tobj[b, a, gj[i], gi[i]] = 1.0
tbox[b, a, gj[i], gi[i]] = torch.tensor([x1, y1, x2, y2], device=device)
tcls[b, a, gj[i], gi[i], cls[i]] = 1.0
indices.append((b, a, gj[i].item(), gi[i].item()))
out.append((tbox, tobj, tcls))
indices_all.append(indices)
return out, indices_all
15.7 yolo/loss.py
import torch
import torch.nn as nn
from .box_ops import xywh_to_xyxy, ciou_loss_xyxy
class YOLOLoss(nn.Module):
def __init__(self, anchors, strides, num_classes, img_size,
lambda_box=7.5, lambda_obj=1.0, lambda_cls=1.0,
obj_pos_weight=1.0, obj_neg_weight=0.5):
super().__init__()
self.anchors = anchors
self.strides = strides
self.num_classes = num_classes
self.img_size = img_size
self.lambda_box = lambda_box
self.lambda_obj = lambda_obj
self.lambda_cls = lambda_cls
self.bce = nn.BCEWithLogitsLoss(reduction="none")
self.obj_pos_weight = obj_pos_weight
self.obj_neg_weight = obj_neg_weight
def decode_scale(self, pred, scale_idx):
"""
pred: [B, A*(5+C), S, S]
returns:
boxes_xyxy: [B, A, S, S, 4] in pixels
obj_logit: [B, A, S, S]
cls_logit: [B, A, S, S, C]
"""
B, _, S, _ = pred.shape
A = len(self.anchors[scale_idx])
C = self.num_classes
stride = self.strides[scale_idx]
pred = pred.view(B, A, 5 + C, S, S).permute(0, 1, 3, 4, 2).contiguous()
# [B, A, S, S, 5+C]
tx_ty = pred[..., 0:2]
tw_th = pred[..., 2:4]
obj = pred[..., 4]
cls = pred[..., 5:]
# grid
gy, gx = torch.meshgrid(torch.arange(S, device=pred.device),
torch.arange(S, device=pred.device), indexing="ij")
grid = torch.stack([gx, gy], dim=-1).float() # [S,S,2]
# anchors
anc = torch.tensor(self.anchors[scale_idx], device=pred.device).float() # [A,2]
anc = anc.view(1, A, 1, 1, 2)
# decode center
pxy = (tx_ty.sigmoid() + grid.view(1, 1, S, S, 2)) * stride # pixels
# decode wh
pwh = (tw_th.exp() * anc) # pixels
boxes_xywh = torch.cat([pxy, pwh], dim=-1)
boxes_xyxy = xywh_to_xyxy(boxes_xywh)
return boxes_xyxy, obj, cls
def forward(self, preds, targets_per_scale):
"""
preds: list of 3 scale outputs
targets_per_scale: list of (tbox, tobj, tcls) per scale
"""
total_box = torch.tensor(0.0, device=preds[0].device)
total_obj = torch.tensor(0.0, device=preds[0].device)
total_cls = torch.tensor(0.0, device=preds[0].device)
for s, pred in enumerate(preds):
tbox, tobj, tcls = targets_per_scale[s]
pbox, pobj_logit, pcls_logit = self.decode_scale(pred, s)
# objectness loss with weighting
obj_loss = self.bce(pobj_logit, tobj)
w = torch.where(tobj > 0.5,
torch.full_like(obj_loss, self.obj_pos_weight),
torch.full_like(obj_loss, self.obj_neg_weight))
total_obj = total_obj + (obj_loss * w).mean()
# positives mask
pos = tobj > 0.5
if pos.any():
# box loss CIoU
pbox_pos = pbox[pos]
tbox_pos = tbox[pos]
box_loss = ciou_loss_xyxy(pbox_pos, tbox_pos).mean()
total_box = total_box + box_loss
# class loss
cls_loss = self.bce(pcls_logit[pos], tcls[pos]).mean()
total_cls = total_cls + cls_loss
loss = self.lambda_box * total_box + self.lambda_obj * total_obj + self.lambda_cls * total_cls
return loss, {"box": total_box.detach(), "obj": total_obj.detach(), "cls": total_cls.detach()}
15.8 yolo/utils.py
import torch
def set_seed(seed=42):
import random, numpy as np
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
class AverageMeter:
def __init__(self):
self.sum = 0.0
self.count = 0
def update(self, v, n=1):
self.sum += float(v) * n
self.count += n
@property
def avg(self):
return self.sum / max(1, self.count)
15.9 yolo/transforms.py (simple letterbox + flip)
import torch
import torchvision.transforms.functional as TF
def letterbox(image, boxes_xyxy, new_size=640):
"""
image: PIL or Tensor [C,H,W]
boxes_xyxy: Tensor [N,4] in pixels
returns resized/padded image and transformed boxes
"""
if not torch.is_tensor(image):
image = TF.to_tensor(image)
c, h, w = image.shape
scale = min(new_size / h, new_size / w)
nh, nw = int(round(h * scale)), int(round(w * scale))
image_resized = TF.resize(image, [nh, nw])
pad_h = new_size - nh
pad_w = new_size - nw
top = pad_h // 2
left = pad_w // 2
image_padded = torch.zeros((c, new_size, new_size), dtype=image.dtype)
image_padded[:, top:top+nh, left:left+nw] = image_resized
if boxes_xyxy.numel() > 0:
boxes = boxes_xyxy.clone()
boxes *= scale
boxes[:, [0, 2]] += left
boxes[:, [1, 3]] += top
else:
boxes = boxes_xyxy
return image_padded, boxes
def random_hflip(image, boxes_xyxy, p=0.5):
if torch.rand(()) > p:
return image, boxes_xyxy
c, h, w = image.shape
image = torch.flip(image, dims=[2]) # flip width
boxes = boxes_xyxy.clone()
if boxes.numel() > 0:
x1 = boxes[:, 0].clone()
x2 = boxes[:, 2].clone()
boxes[:, 0] = (w - x2)
boxes[:, 2] = (w - x1)
return image, boxes
15.10 yolo/data.py (custom dataset skeleton)
import os
import torch
from torch.utils.data import Dataset
from PIL import Image
from .transforms import letterbox, random_hflip
class DetectionDataset(Dataset):
"""
Expects a list of samples where each sample has:
- image_path
- annotations: list of [cls, x1, y1, x2, y2] in pixels
You can write adapters to load COCO or YOLO txt into this format.
"""
def __init__(self, samples, img_size=640, augment=True):
self.samples = samples
self.img_size = img_size
self.augment = augment
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
s = self.samples[idx]
img = Image.open(s["image_path"]).convert("RGB")
ann = s.get("annotations", [])
if len(ann) > 0:
target = torch.tensor(ann, dtype=torch.float32) # [N,5]
else:
target = torch.zeros((0,5), dtype=torch.float32)
cls = target[:, 0:1]
boxes = target[:, 1:5]
img, boxes = letterbox(img, boxes, self.img_size)
if self.augment:
img, boxes = random_hflip(img, boxes, p=0.5)
# normalize image
img = img.clamp(0, 1)
# pack back
if boxes.numel() > 0:
target = torch.cat([cls, boxes], dim=1)
else:
target = torch.zeros((0,5), dtype=torch.float32)
return img, target
def collate_fn(batch):
images, targets = zip(*batch)
images = torch.stack(images, dim=0)
# targets remains list[Tensor]
return images, list(targets)
15.11 train.py (end-to-end training loop)
import torch
from torch.utils.data import DataLoader
from yolo.model import YOLO
from yolo.loss import YOLOLoss
from yolo.assigner import build_targets
from yolo.utils import set_seed, AverageMeter
from yolo.data import DetectionDataset, collate_fn
def train_one_epoch(model, loss_fn, loader, optimizer, device, anchors, strides, img_size, num_classes, scaler=None):
model.train()
meter = AverageMeter()
for images, targets_list in loader:
images = images.to(device)
# build targets per scale
targets_per_scale, _ = build_targets(
targets_list, anchors=anchors, strides=strides,
img_size=img_size, num_classes=num_classes, device=device
)
targets_per_scale = [(tbox, tobj, tcls) for (tbox, tobj, tcls) in targets_per_scale]
optimizer.zero_grad(set_to_none=True)
if scaler is not None:
with torch.cuda.amp.autocast():
preds = model(images)
loss, logs = loss_fn(preds, targets_per_scale)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
preds = model(images)
loss, logs = loss_fn(preds, targets_per_scale)
loss.backward()
optimizer.step()
meter.update(loss.item(), n=images.size(0))
return meter.avg
def main():
set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
img_size = 640
num_classes = 80 # COCO example
strides = [8, 16, 32]
anchors = [
[(10,13),(16,30),(33,23)],
[(30,61),(62,45),(59,119)],
[(116,90),(156,198),(373,326)]
]
# TODO: load your samples list here
samples = [] # [{"image_path": "...", "annotations": [[cls,x1,y1,x2,y2], ...]}, ...]
ds = DetectionDataset(samples, img_size=img_size, augment=True)
loader = DataLoader(ds, batch_size=8, shuffle=True, num_workers=4, pin_memory=True, collate_fn=collate_fn)
model = YOLO(num_classes=num_classes, anchors=anchors, base=32).to(device)
loss_fn = YOLOLoss(anchors, strides, num_classes, img_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler() if device == "cuda" else None
for epoch in range(1, 101):
avg_loss = train_one_epoch(model, loss_fn, loader, optimizer, device, anchors, strides, img_size, num_classes, scaler)
print(f"Epoch {epoch:03d} | loss={avg_loss:.4f}")
torch.save(model.state_dict(), "yolo_scratch.pt")
if __name__ == "__main__":
main()
16) Inference: decoding + NMS (predict pipeline)
16.1 Decoding helper (add to yolo/metrics.py or yolo/utils.py)
import torch
from .box_ops import xywh_to_xyxy
from .nms import batched_nms_xyxy
@torch.no_grad()
def decode_predictions(preds, anchors, strides, num_classes, conf_thresh=0.25, iou_thresh=0.5):
"""
preds: list of 3 tensors [B, A*(5+C), S, S]
returns per image: boxes [M,4], scores [M], labels [M]
"""
outputs = []
for b in range(preds[0].shape[0]):
boxes_all = []
scores_all = []
labels_all = []
for s, p in enumerate(preds):
B, _, S, _ = p.shape
A = len(anchors[s])
C = num_classes
stride = strides[s]
x = p[b:b+1].view(1, A, 5+C, S, S).permute(0,1,3,4,2).contiguous()[0] # [A,S,S,5+C]
tx_ty = x[..., 0:2]
tw_th = x[..., 2:4]
obj_logit = x[..., 4]
cls_logit = x[..., 5:]
gy, gx = torch.meshgrid(torch.arange(S, device=p.device),
torch.arange(S, device=p.device), indexing="ij")
grid = torch.stack([gx, gy], dim=-1).float() # [S,S,2]
anc = torch.tensor(anchors[s], device=p.device).float().view(A,1,1,2)
pxy = (tx_ty.sigmoid() + grid) * stride
pwh = tw_th.exp() * anc
boxes_xywh = torch.cat([pxy, pwh], dim=-1) # [A,S,S,4]
boxes_xyxy = xywh_to_xyxy(boxes_xywh)
obj = obj_logit.sigmoid() # [A,S,S]
cls_prob = cls_logit.sigmoid() # [A,S,S,C]
# combine: per-class confidence = obj * cls_prob
conf = obj.unsqueeze(-1) * cls_prob # [A,S,S,C]
conf = conf.view(-1, C)
boxes = boxes_xyxy.view(-1, 4)
scores, labels = conf.max(dim=1)
keep = scores > conf_thresh
boxes_all.append(boxes[keep])
scores_all.append(scores[keep])
labels_all.append(labels[keep])
if boxes_all:
boxes = torch.cat(boxes_all, dim=0)
scores = torch.cat(scores_all, dim=0)
labels = torch.cat(labels_all, dim=0)
keep = batched_nms_xyxy(boxes, scores, labels, iou_thresh=iou_thresh)
outputs.append((boxes[keep], scores[keep], labels[keep]))
else:
outputs.append((torch.zeros((0,4)), torch.zeros((0,)), torch.zeros((0,), dtype=torch.long)))
return outputs
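With decode_predictions in place, predict.py from the repo layout can be as small as the sketch below (paths and thresholds are placeholders, and it assumes decode_predictions was added to yolo/metrics.py as suggested above):

import torch
from PIL import Image
from yolo.model import YOLO
from yolo.transforms import letterbox
from yolo.metrics import decode_predictions

device = "cuda" if torch.cuda.is_available() else "cpu"
anchors = [[(10,13),(16,30),(33,23)], [(30,61),(62,45),(59,119)], [(116,90),(156,198),(373,326)]]
strides = [8, 16, 32]
num_classes = 80

model = YOLO(num_classes=num_classes, anchors=anchors, base=32).to(device)
model.load_state_dict(torch.load("yolo_scratch.pt", map_location=device))
model.eval()

img = Image.open("example.jpg").convert("RGB")
img_t, _ = letterbox(img, torch.zeros((0, 4)), new_size=640)
with torch.no_grad():
    preds = model(img_t.unsqueeze(0).to(device))
boxes, scores, labels = decode_predictions(preds, anchors, strides, num_classes,
                                            conf_thresh=0.25, iou_thresh=0.5)[0]
print(boxes.shape, scores.shape, labels.shape)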
17) mAP Evaluator (core logic)
A correct mAP implementation is long; here is a clean, minimal evaluator approach:
Collect predictions per image: boxes, scores, labels
Collect GT per image: boxes, labels
For each class:
Sort predictions by score
Mark TP/FP using best IoU match above threshold (and only match a GT once)
Compute precision-recall curve
Compute AP by numeric integration
Average across classes → mAP
If you want COCO-style mAP@[.5:.95], repeat the above at multiple IoU thresholds and average.
A full, ready-to-run metrics.py with mAP@0.5 and COCO-style mAP@[.5:.95] runs to a few hundred lines. To keep this guide focused on the YOLO pipeline, only the core per-class AP logic is sketched below.
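Here is a minimal single-class AP sketch along those lines (greedy best-IoU matching plus standard monotonic-precision integration; to get mAP@0.5, run it per class and average; this is a simplified stand-in for a full COCO evaluator, not one):

import torch
from yolo.box_ops import box_iou_xyxy

def average_precision(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.5):
    # pred_boxes/pred_scores: lists per image of [Ni,4]/[Ni]; gt_boxes: list per image of [Mi,4]
    recs = []
    for img_id, (b, s) in enumerate(zip(pred_boxes, pred_scores)):
        for i in range(b.shape[0]):
            recs.append((s[i].item(), img_id, b[i]))
    recs.sort(key=lambda r: r[0], reverse=True)  # highest confidence first

    total_gt = sum(g.shape[0] for g in gt_boxes)
    if total_gt == 0:
        return 0.0
    matched = [torch.zeros(g.shape[0], dtype=torch.bool) for g in gt_boxes]
    tp = torch.zeros(len(recs))
    fp = torch.zeros(len(recs))
    for k, (_, img_id, box) in enumerate(recs):
        gts = gt_boxes[img_id]
        if gts.numel() == 0:
            fp[k] = 1
            continue
        ious = box_iou_xyxy(box.unsqueeze(0), gts).squeeze(0)
        best_iou, best_j = ious.max(dim=0)
        if best_iou >= iou_thresh and not matched[img_id][best_j]:
            tp[k] = 1
            matched[img_id][best_j] = True  # each GT can be matched only once
        else:
            fp[k] = 1

    tp_cum = tp.cumsum(0)
    fp_cum = fp.cumsum(0)
    recall = tp_cum / total_gt
    precision = tp_cum / (tp_cum + fp_cum).clamp(min=1e-9)
    # make precision monotonically decreasing, then integrate over recall
    for i in range(precision.numel() - 2, -1, -1):
        precision[i] = torch.maximum(precision[i], precision[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall.tolist(), precision.tolist()):
        ap += (r - prev_r) * p
        prev_r = r
    return ap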
18) Training on COCO (practical notes)
To train on COCO effectively:
Use a stronger backbone and bigger batch if possible
Use multi-scale training (randomly change input size per batch)
Use warmup for LR in first ~1–3 epochs
Use EMA weights for evaluation
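EMA can be a small wrapper like the sketch below (ModelEMA is an illustrative class, not one of the repo files; 0.9999 is a common decay default, not tuned here):

import copy
import torch

class ModelEMA:
    # Keep an exponential moving average of model weights for evaluation
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()
        for p in self.ema.parameters():
            p.requires_grad_(False)
        self.decay = decay

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1 - self.decay)
            else:
                v.copy_(msd[k])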
Watch for:
exploding objectness loss (often target assignment bug)
near-zero positives (anchors/strides mismatch)
boxes drifting outside image (decode bug)
19) Training on a custom dataset (the fastest correct workflow)
Pick model input size: 640 is common for a baseline
Convert annotations to absolute XYXY pixels
Visualize boxes on images before training
Start with:
no fancy aug
small model
overfit on 20 images
If it overfits, scale up:
more data
augmentations
better backbone/neck
20) Common bugs checklist (save hours)
Boxes are in wrong format (XYWH vs XYXY)
Boxes normalized but treated as pixels (or vice versa)
Targets assigned to wrong scale (stride mismatch)
Anchor sizes don’t match input size (anchors for 416 but training at 640)
Swapped x/y indexing (gx vs gy in tensor indexing)
Forgot to clamp grid indices
NMS applied before converting to XYXY
mAP evaluation matching multiple preds to the same GT
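A few cheap asserts run on your targets before training catch most of these (a hedged sketch; the variable names are assumptions):

import torch

def sanity_check_targets(boxes_xyxy, classes, img_w, img_h, num_classes):
    # boxes_xyxy: [N, 4] absolute XYXY pixels; classes: [N] integer class ids
    if boxes_xyxy.numel() == 0:
        return
    assert (boxes_xyxy[:, 2] > boxes_xyxy[:, 0]).all(), "x2 <= x1: XYWH mistaken for XYXY?"
    assert (boxes_xyxy[:, 3] > boxes_xyxy[:, 1]).all(), "y2 <= y1: swapped coordinates?"
    assert boxes_xyxy.max() > 1.5, "all coords <= 1.5: normalized boxes treated as pixels?"  # heuristic
    assert boxes_xyxy[:, [0, 2]].max() <= img_w and boxes_xyxy[:, [1, 3]].max() <= img_h, "box outside image"
    assert (classes >= 0).all() and (classes < num_classes).all(), "class index out of range"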
Implementing YOLO from scratch in PyTorch is one of the best ways to truly understand modern object detection—because you’re forced to connect every moving part: how labels become training targets, how predictions become real boxes, why anchors exist, and what objectness is actually learning.
By the end of this build, you should have a complete, working detector with:
A multi-scale YOLO-style model (backbone + neck + detection heads)
A correct target assignment pipeline (ground-truth → grid cell + anchor)
A stable loss setup (CIoU/GIoU for boxes + BCE for objectness and classes)
A proper inference path (decode → confidence filtering → NMS)
A clear route to production-grade evaluation (mAP@0.5 and COCO mAP@[.5:.95])
A repo structure you can extend into a real project
The most important takeaway is that YOLO isn’t “magic”—it’s a carefully engineered system of consistent coordinate transforms, responsible anchor matching, and balanced losses. If any of those pieces disagree (wrong box format, wrong stride, mismatched anchors, flipped x/y indices), learning collapses. But when everything lines up, the model trains smoothly and scales well.