
Implementing YOLO from Scratch in PyTorch


    Introduction – Why YOLO Changed Everything

    Before YOLO, computers did not “see” the world the way humans do.
    Object detection systems were careful, slow, and fragmented. They first proposed regions that might contain objects, then classified each region separately. Detection worked—but it felt like solving a puzzle one piece at a time.

    In 2015, YOLO—You Only Look Once—introduced a radical idea:

    What if we detect everything in one single forward pass?

    Instead of multiple stages, YOLO treated detection as a single regression problem from pixels to bounding boxes and class probabilities.

    This guide walks through how to implement YOLO completely from scratch in PyTorch, covering:

    • Mathematical formulation

    • Network architecture

    • Target encoding

    • Loss implementation

    • Training on COCO-style data

    • mAP evaluation

    • Visualization & debugging

    • Inference with NMS

    • Anchor-box extension

    1) What YOLO means (and what we’ll build)

    YOLO (You Only Look Once) is a family of object detection models that predict bounding boxes and class probabilities in one forward pass. Unlike older multi-stage pipelines (proposal → refine → classify), YOLO-style detectors are dense predictors: they predict candidate boxes at many locations and scales, then filter them.

    There are two “eras” of YOLO-like detectors:

    • YOLOv1-style (grid cells, no anchors): each grid cell predicts a few boxes directly.

    • Anchor-based YOLO (YOLOv2/3 and many derivatives): each grid cell predicts offsets relative to pre-defined anchor shapes; multiple scales predict small/medium/large objects.

    What we’ll implement

    A modern, anchor-based YOLO-style detector with:

    • Multi-scale heads (e.g., 3 scales)

    • Anchor matching (target assignment)

    • Loss with box regression + objectness + classification

    • Decoding + NMS

    • mAP evaluation

    • COCO/custom dataset training support

    We’ll keep the architecture understandable rather than exotic. You can later swap in a bigger backbone easily.

    2) Bounding box formats and coordinate systems

    You must be consistent. Most training bugs come from box format confusion.

    Common box formats:

    • XYXY: (x1, y1, x2, y2) top-left & bottom-right

    • XYWH: (cx, cy, w, h) center and size

    • Normalized: coordinates in [0, 1] relative to image size

    • Absolute: pixel coordinates

    Recommended internal convention

    • Store dataset annotations as absolute XYXY in pixels.

    • Convert to normalized only if needed, but keep one standard.

    Why XYXY is nice:

    • Intersection/union is straightforward.

    • Clamping to image bounds is simple.
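
    To make the convention concrete, a small (hypothetical) helper like the one below converts YOLO-txt style normalized rows into absolute XYXY pixels; it is a sketch of the conversion, separate from the box_ops.py utilities that appear later:

    import torch

    def yolo_norm_to_abs_xyxy(rows: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
        # rows: [N, 5] = (class, cx, cy, w, h) with coordinates normalized to [0, 1]
        # returns: [N, 5] = (class, x1, y1, x2, y2) in absolute pixels
        cls = rows[:, 0:1]
        cx = rows[:, 1] * img_w
        cy = rows[:, 2] * img_h
        w = rows[:, 3] * img_w
        h = rows[:, 4] * img_h
        x1 = (cx - w * 0.5).clamp(min=0, max=img_w)
        y1 = (cy - h * 0.5).clamp(min=0, max=img_h)
        x2 = (cx + w * 0.5).clamp(min=0, max=img_w)
        y2 = (cy + h * 0.5).clamp(min=0, max=img_h)
        return torch.cat([cls, torch.stack([x1, y1, x2, y2], dim=-1)], dim=1)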

    3) IoU, GIoU, DIoU, CIoU

    IoU (Intersection over Union) is the standard overlap metric:

    IoU = |A ∩ B| / |A ∪ B|

    But IoU has a problem: if the boxes don't overlap at all, IoU = 0 no matter how far apart they are, so it provides little to no gradient signal. Modern detectors often use improved regression losses:

    • GIoU: adds penalty for non-overlapping boxes based on smallest enclosing box

    • DIoU: penalizes center distance

    • CIoU: DIoU + aspect ratio consistency

    Practical rule:

    • If you want a strong default: CIoU for box regression.

    • If you want simpler: GIoU works well too.

    We’ll implement IoU + CIoU (with safe numerics).
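
    For reference, a GIoU loss sketch looks like this (elementwise over matched pred/target pairs; it recomputes IoU inline so it is self-contained, though you could reuse the box_ops.py helpers shown later):

    import torch

    def giou_loss_xyxy(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
        # pred, target: [N, 4] xyxy, matched elementwise; returns [N] GIoU loss = 1 - GIoU
        ix1 = torch.maximum(pred[:, 0], target[:, 0])
        iy1 = torch.maximum(pred[:, 1], target[:, 1])
        ix2 = torch.minimum(pred[:, 2], target[:, 2])
        iy2 = torch.minimum(pred[:, 3], target[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

        area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
        area_t = (target[:, 2] - target[:, 0]).clamp(min=0) * (target[:, 3] - target[:, 1]).clamp(min=0)
        union = area_p + area_t - inter + eps
        iou = inter / union

        # smallest enclosing box
        cx1 = torch.minimum(pred[:, 0], target[:, 0])
        cy1 = torch.minimum(pred[:, 1], target[:, 1])
        cx2 = torch.maximum(pred[:, 2], target[:, 2])
        cy2 = torch.maximum(pred[:, 3], target[:, 3])
        enclose = (cx2 - cx1) * (cy2 - cy1) + eps

        giou = iou - (enclose - union) / enclose
        return 1.0 - giou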

    4) Anchor-based YOLO: grids, anchors, predictions

    A YOLO head makes a prediction at each grid location. Suppose a feature map is S × S (e.g., 80×80). Each cell predicts A anchors (e.g., 3). For each anchor, the prediction is:

    • Box offsets: tx, ty, tw, th

    • Objectness logit: to

    • Class logits: tc1..tcC

    So tensor shape per scale is:
    (B, A*(5+C), S, S) or (B, A, S, S, 5+C) after reshaping.

    How offsets become real boxes

    A common YOLO-style decode (one of several valid variants):

    • bx = (sigmoid(tx) + cx) / S

    • by = (sigmoid(ty) + cy) / S

    • bw = (anchor_w * exp(tw)) / img_w (or normalized by S)

    • bh = (anchor_h * exp(th)) / img_h

    Where (cx, cy) is the integer grid coordinate.

    Important: Your encode/decode must match your target assignment encoding.
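
    As a quick sanity check of the decode above, here is the arithmetic for one made-up prediction (numbers are illustrative only; multiplying by stride = img_size / S gives the same pixel coordinates as dividing by S and then scaling by the image size):

    import math

    S, stride = 80, 8                          # 640 input, stride-8 scale
    cx, cy = 37, 52                            # integer grid cell
    anchor_w, anchor_h = 33.0, 23.0            # anchor size in pixels at the input size
    tx, ty, tw, th = 0.2, -0.5, 0.1, -0.3      # raw network outputs (made up)

    def sigmoid(v: float) -> float:
        return 1.0 / (1.0 + math.exp(-v))

    bx = (sigmoid(tx) + cx) * stride           # ~300.4 px
    by = (sigmoid(ty) + cy) * stride           # ~419.0 px
    bw = anchor_w * math.exp(tw)               # ~36.5 px
    bh = anchor_h * math.exp(th)               # ~17.0 px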

    5) Dataset preparation

    Annotation formats

    Your custom dataset can be:

    • COCO JSON

    • Pascal VOC XML

    • YOLO txt (class cx cy w h normalized)

    We’ll support a generic internal representation:

    • Each sample returns:

      • image: Tensor [3, H, W]

      • targets: Tensor [N, 6] with columns:

        • [class, x1, y1, x2, y2, image_index(optional)]

    Augmentations

    For object detection, augmentations must transform boxes too:

    • Resize / letterbox

    • Random horizontal flip

    • Color jitter

    • Random affine (optional)

    • Mosaic/mixup (advanced; optional)

    To keep this guide implementable without fragile geometry, we’ll do:

    • resize/letterbox

    • random flip

    • HSV jitter (optional)
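
    Color-only augmentation never touches the boxes, so it is the safest one to add. A minimal HSV-style jitter sketch using torchvision's ColorJitter (an approximation of the HSV augmentation in common YOLO recipes; the jitter strengths below are assumptions, not tuned values):

    import torch
    from torchvision.transforms import ColorJitter

    def color_jitter(image: torch.Tensor, p: float = 0.5) -> torch.Tensor:
        # image: float Tensor [3, H, W] in [0, 1]; boxes are unaffected by color-only augmentation
        if torch.rand(()) > p:
            return image
        jitter = ColorJitter(brightness=0.2, contrast=0.2, saturation=0.4, hue=0.02)
        return jitter(image)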

    6) Building blocks: Conv-BN-Act, residuals, necks

    A clean baseline module:

    • Conv2d -> BatchNorm2d -> SiLU
      SiLU (a.k.a. Swish) is common in YOLOv5-like families; LeakyReLU is common in YOLOv3.

    We can optionally add residual blocks for a stronger backbone, but even a small backbone can work to validate the pipeline.

    7) Model design

    A typical structure:

    • Backbone: extracts feature maps at multiple strides (8, 16, 32)

    • Neck: combines features (FPN / PAN)

    • Head: predicts detection outputs per scale

    We’ll implement a lightweight backbone that produces 3 feature maps and a simple FPN-like neck.

    8) Decoding predictions

    At inference:

    1. Reshape outputs per scale to (B, A, S, S, 5+C)

    2. Apply sigmoid to center offsets + objectness (and often class probs)

    3. Convert to XYXY in pixel coordinates

    4. Flatten all scales into one list of candidate boxes

    5. Filter by confidence threshold

    6. Apply NMS per class (or class-agnostic NMS)

    9) Target assignment (matching GT to anchors)

    This is the heart of anchor-based YOLO.

    For each ground-truth box:

    1. Determine which scale(s) should handle it (based on size / anchor match).

    2. For the chosen scale, compute IoU between GT box size and each anchor size (in that scale’s coordinate system).

    3. Select best anchor (or top-k anchors).

    4. Compute the grid cell index from the GT center.

    5. Fill the target tensors at [anchor, gy, gx] with:

      • box regression targets

      • objectness = 1

      • class target

    Encoding regression targets

    If using decode:

    • bx = (sigmoid(tx) + cx)/S
      then the exact target for tx would be sigmoid^-1(bx*S - cx), which is messy and numerically awkward.

    Instead, YOLO-style training often directly supervises:

    • tx_target = bx*S - cx (a value in [0, 1)), trained with BCE against the sigmoid output, or with MSE on the raw value.

    • tw_target = log(bw / anchor_w) (in pixels or normalized units)

    We’ll implement a stable variant:

    • predict pxy = sigmoid(tx,ty) and supervise pxy with BCE/MSE to match fractional offsets

    • predict pwh = exp(tw,th)*anchor and supervise with CIoU on decoded boxes (recommended)

    That’s simpler: do regression loss on decoded boxes, not on tw/th directly.
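
    To make the encoding concrete, here is the arithmetic for one made-up ground-truth box, assuming it matched the (62, 45) anchor on the stride-16 scale (all numbers are illustrative):

    import math

    x1, y1, x2, y2 = 100.0, 200.0, 180.0, 260.0   # GT box in pixels at a 640x640 input
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2         # center = (140, 230)
    w, h = x2 - x1, y2 - y1                       # size = (80, 60)

    stride = 16
    gi, gj = int(cx / stride), int(cy / stride)   # responsible cell = (8, 14)
    tx_target = cx / stride - gi                  # fractional x offset = 0.75
    ty_target = cy / stride - gj                  # fractional y offset = 0.375

    anchor_w, anchor_h = 62.0, 45.0               # matched anchor, in pixels
    tw_target = math.log(w / anchor_w)            # ~0.255
    th_target = math.log(h / anchor_h)            # ~0.288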

    10) Loss functions

    YOLO-style loss usually has:

    1. Box loss: CIoU/GIoU between predicted box and GT box at responsible locations

    2. Objectness loss: BCEWithLogits on objectness logit

    3. Class loss: BCEWithLogits (multi-label) or CE (single-label)

    For single-label classification (one class per object), either works:

    • BCEWithLogits with one-hot targets (common in YOLO)

    • CrossEntropyLoss on class logits at positive locations (also fine)

    We’ll use BCEWithLogits for both objectness and classes for consistency.

    Handling negatives

    You’ll have far more negative (no object) positions. You can:

    • Use a lower weight for negative objectness

    • Or apply focal loss (optional)

    We’ll implement:

    • objectness loss with positive and negative weights.
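
    If you later want the optional focal-loss route instead of plain weighting, a minimal sketch on top of BCEWithLogits could look like this (gamma = 2.0 and alpha = 0.25 are the usual defaults, not values tuned for this model):

    import torch
    import torch.nn.functional as F

    def focal_bce_with_logits(logits: torch.Tensor, targets: torch.Tensor,
                              gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
        # elementwise focal loss on logits; returns a tensor of the same shape as the inputs
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return alpha_t * (1 - p_t) ** gamma * bce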

    11) Training loop

    Key features for stability/performance:

    • AMP (torch.cuda.amp)

    • Gradient clipping (optional)

    • EMA weights (optional but helpful)

    • LR scheduler (cosine or step)

    • Warmup for first few epochs/steps
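
    For the EMA item above, a minimal weight-averaging wrapper might look like the sketch below; the 0.999 decay is a common choice, not a prescription. Call update(model) after each optimizer step and run evaluation on ema.ema.

    import copy
    import torch
    import torch.nn as nn

    class ModelEMA:
        # keeps an exponential moving average of the model weights on a frozen copy for evaluation
        def __init__(self, model: nn.Module, decay: float = 0.999):
            self.ema = copy.deepcopy(model).eval()
            self.decay = decay
            for p in self.ema.parameters():
                p.requires_grad_(False)

        @torch.no_grad()
        def update(self, model: nn.Module):
            msd = model.state_dict()
            for k, v in self.ema.state_dict().items():
                if v.dtype.is_floating_point:
                    v.mul_(self.decay).add_(msd[k].detach(), alpha=1.0 - self.decay)
                else:
                    v.copy_(msd[k])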


    12) NMS

    Non-Max Suppression removes overlapping duplicates. Typical procedure:

    • Sort boxes by confidence

    • Iterate highest conf, suppress boxes with IoU > threshold

    Use class-wise NMS for multi-class detection.

    13) mAP evaluation

    Mean Average Precision requires:

    • For each class, compute precision-recall curve at IoU thresholds

    • Integrate area under curve (AP)

    • Average across classes (mAP)

    • COCO uses mAP across IoU thresholds 0.50 to 0.95 step 0.05

    We’ll implement:

    • mAP@0.5

    • and optionally COCO-style mAP@[.5:.95]

     

    14) Visualization

    Before training seriously, visualize:

    • target assignments per scale

    • decoded predictions after a few iterations

    • NMS outputs

    This catches 80% of “my model doesn’t learn” issues.
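
    A small debug helper along these lines (a sketch using torchvision's draw_bounding_boxes; the helper name and output path are placeholders) turns each of those checks into one call:

    import torch
    from torchvision.utils import draw_bounding_boxes
    from torchvision.transforms.functional import to_pil_image

    def save_debug_image(img: torch.Tensor, boxes_xyxy: torch.Tensor, labels=None, path="debug.png"):
        # img: float Tensor [3, H, W] in [0, 1]; boxes_xyxy: [N, 4] in pixels
        img_u8 = (img.clamp(0, 1) * 255).to(torch.uint8)
        drawn = draw_bounding_boxes(img_u8, boxes_xyxy.round().to(torch.int64), labels=labels, width=2)
        to_pil_image(drawn).save(path)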

     

    15) Full Core Implementation (Reference Code)

    Below is a compact but complete set of core files you can place into a repo. It’s not “tiny,” but it’s readable and engineered for correctness.

    15.1 Repo structure

     
    				
    					yolo_scratch/
      README.md
      train.py
      eval.py
      predict.py
      yolo/
        __init__.py
        model.py
        modules.py
        loss.py
        assigner.py
        box_ops.py
        nms.py
        metrics.py
        data.py
        transforms.py
        utils.py
      configs/
        coco.yaml
        custom.yaml
    
    				
    			
    				

    15.2 yolo/box_ops.py

     
    				
    					import torch
    
    def xyxy_to_xywh(boxes: torch.Tensor) -> torch.Tensor:
        # boxes: [..., 4]
        x1, y1, x2, y2 = boxes.unbind(-1)
        cx = (x1 + x2) * 0.5
        cy = (y1 + y2) * 0.5
        w = (x2 - x1).clamp(min=0)
        h = (y2 - y1).clamp(min=0)
        return torch.stack([cx, cy, w, h], dim=-1)
    
    def xywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
        cx, cy, w, h = boxes.unbind(-1)
        half_w = w * 0.5
        half_h = h * 0.5
        x1 = cx - half_w
        y1 = cy - half_h
        x2 = cx + half_w
        y2 = cy + half_h
        return torch.stack([x1, y1, x2, y2], dim=-1)
    
    def box_iou_xyxy(boxes1: torch.Tensor, boxes2: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
        # boxes1: [N,4], boxes2: [M,4]
        x11, y11, x12, y12 = boxes1[:, 0], boxes1[:, 1], boxes1[:, 2], boxes1[:, 3]
        x21, y21, x22, y22 = boxes2[:, 0], boxes2[:, 1], boxes2[:, 2], boxes2[:, 3]
    
        inter_x1 = torch.maximum(x11[:, None], x21[None, :])
        inter_y1 = torch.maximum(y11[:, None], y21[None, :])
        inter_x2 = torch.minimum(x12[:, None], x22[None, :])
        inter_y2 = torch.minimum(y12[:, None], y22[None, :])
    
        inter_w = (inter_x2 - inter_x1).clamp(min=0)
        inter_h = (inter_y2 - inter_y1).clamp(min=0)
        inter = inter_w * inter_h
    
        area1 = (x12 - x11).clamp(min=0) * (y12 - y11).clamp(min=0)
        area2 = (x22 - x21).clamp(min=0) * (y22 - y21).clamp(min=0)
        union = area1[:, None] + area2[None, :] - inter
        return inter / (union + eps)
    
    def ciou_loss_xyxy(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
        """
        pred, target: [N,4] in xyxy
        Returns: [N] CIoU loss = 1 - CIoU
        """
        # IoU
        iou = box_iou_xyxy(pred, target).diag()  # [N]
    
        # centers and sizes
        p = xyxy_to_xywh(pred)
        t = xyxy_to_xywh(target)
        pcx, pcy, pw, ph = p.unbind(-1)
        tcx, tcy, tw, th = t.unbind(-1)
    
        # center distance
        center_dist2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    
        # smallest enclosing box diagonal squared
        x1 = torch.minimum(pred[:, 0], target[:, 0])
        y1 = torch.minimum(pred[:, 1], target[:, 1])
        x2 = torch.maximum(pred[:, 2], target[:, 2])
        y2 = torch.maximum(pred[:, 3], target[:, 3])
        c2 = (x2 - x1) ** 2 + (y2 - y1) ** 2 + eps
    
        diou = iou - center_dist2 / c2
    
        # aspect ratio penalty
        v = (4 / (torch.pi ** 2)) * (torch.atan(tw / (th + eps)) - torch.atan(pw / (ph + eps))) ** 2
        with torch.no_grad():
            alpha = v / (1 - iou + v + eps)
    
        ciou = diou - alpha * v
        return 1 - ciou.clamp(min=-1.0, max=1.0)
    
    				
    			

    15.3 yolo/nms.py

    				
    					import torch
    from .box_ops import box_iou_xyxy
    
    def nms_xyxy(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5) -> torch.Tensor:
        """
        boxes: [N,4], scores: [N]
        returns indices kept
        """
        if boxes.numel() == 0:
            return torch.empty((0,), dtype=torch.long, device=boxes.device)
    
        idxs = scores.argsort(descending=True)
        keep = []
    
        while idxs.numel() > 0:
            i = idxs[0]
            keep.append(i)
    
            if idxs.numel() == 1:
                break
    
            rest = idxs[1:]
            ious = box_iou_xyxy(boxes[i].unsqueeze(0), boxes[rest]).squeeze(0)
            idxs = rest[ious <= iou_thresh]
    
        return torch.stack(keep)
    
    def batched_nms_xyxy(boxes, scores, labels, iou_thresh=0.5):
        """
        Class-wise NMS by offsetting boxes or by filtering per class.
        Here: filter per class (clear and correct).
        """
        keep_all = []
        for c in labels.unique():
            mask = labels == c
            keep = nms_xyxy(boxes[mask], scores[mask], iou_thresh)
            keep_all.append(mask.nonzero(as_tuple=False).squeeze(1)[keep])
        if not keep_all:
            return torch.empty((0,), dtype=torch.long, device=boxes.device)
        return torch.cat(keep_all)
    
    				
    			

    15.4 yolo/modules.py

    				
    					import torch
    import torch.nn as nn
    
    class ConvBNAct(nn.Module):
        def __init__(self, in_ch, out_ch, k=3, s=1, p=None, act=True):
            super().__init__()
            if p is None:
                p = k // 2
            self.conv = nn.Conv2d(in_ch, out_ch, k, s, p, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = nn.SiLU(inplace=True) if act else nn.Identity()
    
        def forward(self, x):
            return self.act(self.bn(self.conv(x)))
    
    class Residual(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.block = nn.Sequential(
                ConvBNAct(ch, ch, 1, 1),
                ConvBNAct(ch, ch, 3, 1),
            )
    
        def forward(self, x):
            return x + self.block(x)
    
    class CSPBlock(nn.Module):
        """
        Light CSP-like block: split channels, apply residuals on one branch, then concat.
        """
        def __init__(self, ch, n=1):
            super().__init__()
            c_ = ch // 2
            self.conv1 = ConvBNAct(ch, c_, 1, 1)
            self.conv2 = ConvBNAct(ch, c_, 1, 1)
            self.m = nn.Sequential(*[Residual(c_) for _ in range(n)])
            self.conv3 = ConvBNAct(2 * c_, ch, 1, 1)
    
        def forward(self, x):
            y1 = self.m(self.conv1(x))
            y2 = self.conv2(x)
            return self.conv3(torch.cat([y1, y2], dim=1))
    
    				
    			

    15.5 yolo/model.py

    				
    					import torch
    import torch.nn as nn
    from .modules import ConvBNAct, CSPBlock
    
    class TinyBackbone(nn.Module):
        """
        Produces 3 feature maps at strides 8, 16, 32.
        """
        def __init__(self, in_ch=3, base=32):
            super().__init__()
            self.stem = nn.Sequential(
                ConvBNAct(in_ch, base, 3, 2),       # stride 2
                ConvBNAct(base, base*2, 3, 2),     # stride 4
                CSPBlock(base*2, n=1),
            )
            self.stage3 = nn.Sequential(
                ConvBNAct(base*2, base*4, 3, 2),   # stride 8
                CSPBlock(base*4, n=2),
            )
            self.stage4 = nn.Sequential(
                ConvBNAct(base*4, base*8, 3, 2),   # stride 16
                CSPBlock(base*8, n=2),
            )
            self.stage5 = nn.Sequential(
                ConvBNAct(base*8, base*16, 3, 2),  # stride 32
                CSPBlock(base*16, n=1),
            )
    
        def forward(self, x):
            x = self.stem(x)
            p3 = self.stage3(x)
            p4 = self.stage4(p3)
            p5 = self.stage5(p4)
            return p3, p4, p5
    
    class SimpleFPN(nn.Module):
        def __init__(self, ch3, ch4, ch5, out_ch=128):
            super().__init__()
            self.lat5 = ConvBNAct(ch5, out_ch, 1, 1)
            self.lat4 = ConvBNAct(ch4, out_ch, 1, 1)
            self.lat3 = ConvBNAct(ch3, out_ch, 1, 1)
    
            self.out4 = ConvBNAct(out_ch, out_ch, 3, 1)
            self.out3 = ConvBNAct(out_ch, out_ch, 3, 1)
    
        def forward(self, p3, p4, p5):
            p5 = self.lat5(p5)
            p4 = self.lat4(p4) + torch.nn.functional.interpolate(p5, scale_factor=2, mode="nearest")
            p3 = self.lat3(p3) + torch.nn.functional.interpolate(p4, scale_factor=2, mode="nearest")
            p4 = self.out4(p4)
            p3 = self.out3(p3)
            return p3, p4, p5
    
    class DetectHead(nn.Module):
        def __init__(self, in_ch, num_anchors, num_classes):
            super().__init__()
            self.num_anchors = num_anchors
            self.num_classes = num_classes
            self.pred = nn.Conv2d(in_ch, num_anchors * (5 + num_classes), 1, 1, 0)
    
        def forward(self, x):
            return self.pred(x)
    
    class YOLO(nn.Module):
        def __init__(self, num_classes, anchors, base=32):
            """
            anchors: list of 3 scales, each is list of (w,h) in pixels for the model input size (e.g., 640)
                     e.g. [
                       [(10,13),(16,30),(33,23)],   # stride 8
                       [(30,61),(62,45),(59,119)],  # stride 16
                       [(116,90),(156,198),(373,326)] # stride 32
                     ]
            """
            super().__init__()
            self.num_classes = num_classes
            self.anchors = anchors
    
            self.backbone = TinyBackbone(in_ch=3, base=base)
            # backbone channels: p3=base*4, p4=base*8, p5=base*16
            self.fpn = SimpleFPN(base*4, base*8, base*16, out_ch=base*4)
    
            na = len(anchors[0])
            self.head3 = DetectHead(base*4, na, num_classes)
            self.head4 = DetectHead(base*4, na, num_classes)
            self.head5 = DetectHead(base*4, na, num_classes)
    
        def forward(self, x):
            p3, p4, p5 = self.backbone(x)
            f3, f4, f5 = self.fpn(p3, p4, p5)
            o3 = self.head3(f3)
            o4 = self.head4(f4)
            o5 = self.head5(f5)
            return [o3, o4, o5]
    
    				
    			

    15.6 yolo/assigner.py (target assignment)

    				
    					import torch
    
    def build_targets(
        targets,           # list of length B, each: Tensor [Ni, 5] -> (cls, x1, y1, x2, y2) in pixels
        anchors,           # per scale: list of (w,h) in pixels at input size
        strides,           # [8,16,32]
        img_size,          # int, e.g. 640
        num_classes,
        device
    ):
        """
        Returns per-scale target tensors:
          tbox: list of [B, A, S, S, 4] in xyxy pixels
          tobj: list of [B, A, S, S] (0/1)
          tcls: list of [B, A, S, S, C] one-hot
          indices: list of tuples for positives (b, a, gy, gx)
        """
        B = len(targets)
        out = []
        indices_all = []
    
        for scale_idx, (anc, stride) in enumerate(zip(anchors, strides)):
            S = img_size // stride
            A = len(anc)
    
            tbox = torch.zeros((B, A, S, S, 4), device=device)
            tobj = torch.zeros((B, A, S, S), device=device)
            tcls = torch.zeros((B, A, S, S, num_classes), device=device)
    
            indices = []
    
            anc_wh = torch.tensor(anc, device=device, dtype=torch.float32)  # [A,2]
    
            for b in range(B):
                if targets[b].numel() == 0:
                    continue
                gt = targets[b].to(device)
                cls = gt[:, 0].long()
                x1y1 = gt[:, 1:3]
                x2y2 = gt[:, 3:5]
                gxy = (x1y1 + x2y2) * 0.5
                gwh = (x2y2 - x1y1).clamp(min=1.0)
    
                # pick best anchor by IoU of width/height (approx)
                # IoU(wh) = min(w)/max(w) * min(h)/max(h)
                wh = gwh[:, None, :]  # [N,1,2]
                min_wh = torch.minimum(wh, anc_wh[None, :, :])
                max_wh = torch.maximum(wh, anc_wh[None, :, :])
                iou_wh = (min_wh[..., 0] / max_wh[..., 0]) * (min_wh[..., 1] / max_wh[..., 1])  # [N,A]
                best_a = torch.argmax(iou_wh, dim=1)  # [N]
    
                # grid cell
                gx = (gxy[:, 0] / stride).clamp(min=0, max=S-1e-3)
                gy = (gxy[:, 1] / stride).clamp(min=0, max=S-1e-3)
                gi = gx.long()
                gj = gy.long()
    
                for i in range(gt.shape[0]):
                    a = best_a[i].item()
                    x1, y1, x2, y2 = gt[i, 1:].tolist()
    
                    # assign
                    tobj[b, a, gj[i], gi[i]] = 1.0
                    tbox[b, a, gj[i], gi[i]] = torch.tensor([x1, y1, x2, y2], device=device)
                    tcls[b, a, gj[i], gi[i], cls[i]] = 1.0
                    indices.append((b, a, gj[i].item(), gi[i].item()))
    
            out.append((tbox, tobj, tcls))
            indices_all.append(indices)
    
        return out, indices_all
    
    				
    			

    15.7 yolo/loss.py

    				
    					import torch
    import torch.nn as nn
    from .box_ops import xywh_to_xyxy, ciou_loss_xyxy
    
    class YOLOLoss(nn.Module):
        def __init__(self, anchors, strides, num_classes, img_size,
                     lambda_box=7.5, lambda_obj=1.0, lambda_cls=1.0,
                     obj_pos_weight=1.0, obj_neg_weight=0.5):
            super().__init__()
            self.anchors = anchors
            self.strides = strides
            self.num_classes = num_classes
            self.img_size = img_size
    
            self.lambda_box = lambda_box
            self.lambda_obj = lambda_obj
            self.lambda_cls = lambda_cls
    
            self.bce = nn.BCEWithLogitsLoss(reduction="none")
            self.obj_pos_weight = obj_pos_weight
            self.obj_neg_weight = obj_neg_weight
    
        def decode_scale(self, pred, scale_idx):
            """
            pred: [B, A*(5+C), S, S]
            returns:
              boxes_xyxy: [B, A, S, S, 4] in pixels
              obj_logit:  [B, A, S, S]
              cls_logit:  [B, A, S, S, C]
            """
            B, _, S, _ = pred.shape
            A = len(self.anchors[scale_idx])
            C = self.num_classes
            stride = self.strides[scale_idx]
    
            pred = pred.view(B, A, 5 + C, S, S).permute(0, 1, 3, 4, 2).contiguous()
            # [B, A, S, S, 5+C]
            tx_ty = pred[..., 0:2]
            tw_th = pred[..., 2:4]
            obj = pred[..., 4]
            cls = pred[..., 5:]
    
            # grid
            gy, gx = torch.meshgrid(torch.arange(S, device=pred.device),
                                    torch.arange(S, device=pred.device), indexing="ij")
            grid = torch.stack([gx, gy], dim=-1).float()  # [S,S,2]
    
            # anchors
            anc = torch.tensor(self.anchors[scale_idx], device=pred.device).float()  # [A,2]
            anc = anc.view(1, A, 1, 1, 2)
    
            # decode center
            pxy = (tx_ty.sigmoid() + grid.view(1, 1, S, S, 2)) * stride  # pixels
            # decode wh
            pwh = (tw_th.exp() * anc)  # pixels
    
            boxes_xywh = torch.cat([pxy, pwh], dim=-1)
            boxes_xyxy = xywh_to_xyxy(boxes_xywh)
            return boxes_xyxy, obj, cls
    
        def forward(self, preds, targets_per_scale):
            """
            preds: list of 3 scale outputs
            targets_per_scale: list of (tbox, tobj, tcls) per scale
            """
            total_box = torch.tensor(0.0, device=preds[0].device)
            total_obj = torch.tensor(0.0, device=preds[0].device)
            total_cls = torch.tensor(0.0, device=preds[0].device)
    
            for s, pred in enumerate(preds):
                tbox, tobj, tcls = targets_per_scale[s]
                pbox, pobj_logit, pcls_logit = self.decode_scale(pred, s)
    
                # objectness loss with weighting
                obj_loss = self.bce(pobj_logit, tobj)
                w = torch.where(tobj > 0.5,
                                torch.full_like(obj_loss, self.obj_pos_weight),
                                torch.full_like(obj_loss, self.obj_neg_weight))
                total_obj = total_obj + (obj_loss * w).mean()
    
                # positives mask
                pos = tobj > 0.5
                if pos.any():
                    # box loss CIoU
                    pbox_pos = pbox[pos]
                    tbox_pos = tbox[pos]
                    box_loss = ciou_loss_xyxy(pbox_pos, tbox_pos).mean()
                    total_box = total_box + box_loss
    
                    # class loss
                    cls_loss = self.bce(pcls_logit[pos], tcls[pos]).mean()
                    total_cls = total_cls + cls_loss
    
            loss = self.lambda_box * total_box + self.lambda_obj * total_obj + self.lambda_cls * total_cls
            return loss, {"box": total_box.detach(), "obj": total_obj.detach(), "cls": total_cls.detach()}
    
    				
    			

    15.8 yolo/utils.py

    				
    					import torch
    
    def set_seed(seed=42):
        import random, numpy as np
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    
    class AverageMeter:
        def __init__(self):
            self.sum = 0.0
            self.count = 0
    
        def update(self, v, n=1):
            self.sum += float(v) * n
            self.count += n
    
        @property
        def avg(self):
            return self.sum / max(1, self.count)
    
    				
    			

    15.9 yolo/transforms.py (simple letterbox + flip)

    				
    					import torch
    import torchvision.transforms.functional as TF
    
    def letterbox(image, boxes_xyxy, new_size=640):
        """
        image: PIL or Tensor [C,H,W]
        boxes_xyxy: Tensor [N,4] in pixels
        returns resized/padded image and transformed boxes
        """
        if not torch.is_tensor(image):
            image = TF.to_tensor(image)
        c, h, w = image.shape
    
        scale = min(new_size / h, new_size / w)
        nh, nw = int(round(h * scale)), int(round(w * scale))
        image_resized = TF.resize(image, [nh, nw])
    
        pad_h = new_size - nh
        pad_w = new_size - nw
        top = pad_h // 2
        left = pad_w // 2
    
        image_padded = torch.zeros((c, new_size, new_size), dtype=image.dtype)
        image_padded[:, top:top+nh, left:left+nw] = image_resized
    
        if boxes_xyxy.numel() > 0:
            boxes = boxes_xyxy.clone()
            boxes *= scale
            boxes[:, [0, 2]] += left
            boxes[:, [1, 3]] += top
        else:
            boxes = boxes_xyxy
    
        return image_padded, boxes
    
    def random_hflip(image, boxes_xyxy, p=0.5):
        if torch.rand(()) > p:
            return image, boxes_xyxy
        c, h, w = image.shape
        image = torch.flip(image, dims=[2])  # flip width
        boxes = boxes_xyxy.clone()
        if boxes.numel() > 0:
            x1 = boxes[:, 0].clone()
            x2 = boxes[:, 2].clone()
            boxes[:, 0] = (w - x2)
            boxes[:, 2] = (w - x1)
        return image, boxes
    
    				
    			

    15.10 yolo/data.py (custom dataset skeleton)

    				
    					import os
    import torch
    from torch.utils.data import Dataset
    from PIL import Image
    from .transforms import letterbox, random_hflip
    
    class DetectionDataset(Dataset):
        """
        Expects a list of samples where each sample has:
          - image_path
          - annotations: list of [cls, x1, y1, x2, y2] in pixels
        You can write adapters to load COCO or YOLO txt into this format.
        """
        def __init__(self, samples, img_size=640, augment=True):
            self.samples = samples
            self.img_size = img_size
            self.augment = augment
    
        def __len__(self):
            return len(self.samples)
    
        def __getitem__(self, idx):
            s = self.samples[idx]
            img = Image.open(s["image_path"]).convert("RGB")
            ann = s.get("annotations", [])
            if len(ann) > 0:
                target = torch.tensor(ann, dtype=torch.float32)  # [N,5]
            else:
                target = torch.zeros((0,5), dtype=torch.float32)
    
            cls = target[:, 0:1]
            boxes = target[:, 1:5]
    
            img, boxes = letterbox(img, boxes, self.img_size)
            if self.augment:
                img, boxes = random_hflip(img, boxes, p=0.5)
    
            # normalize image
            img = img.clamp(0, 1)
    
            # pack back
            if boxes.numel() > 0:
                target = torch.cat([cls, boxes], dim=1)
            else:
                target = torch.zeros((0,5), dtype=torch.float32)
    
            return img, target
    
    def collate_fn(batch):
        images, targets = zip(*batch)
        images = torch.stack(images, dim=0)
        # targets remains list[Tensor]
        return images, list(targets)
    
    				
    			

    15.11 train.py (end-to-end training loop)

    				
    					import torch
    from torch.utils.data import DataLoader
    
    from yolo.model import YOLO
    from yolo.loss import YOLOLoss
    from yolo.assigner import build_targets
    from yolo.utils import set_seed, AverageMeter
    from yolo.data import DetectionDataset, collate_fn
    
    def train_one_epoch(model, loss_fn, loader, optimizer, device, anchors, strides, img_size, num_classes, scaler=None):
        model.train()
        meter = AverageMeter()
    
        for images, targets_list in loader:
            images = images.to(device)
    
            # build targets per scale
            targets_per_scale, _ = build_targets(
                targets_list, anchors=anchors, strides=strides,
                img_size=img_size, num_classes=num_classes, device=device
            )
    
            optimizer.zero_grad(set_to_none=True)
    
            if scaler is not None:
                with torch.cuda.amp.autocast():
                    preds = model(images)
                    loss, logs = loss_fn(preds, targets_per_scale)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                preds = model(images)
                loss, logs = loss_fn(preds, targets_per_scale)
                loss.backward()
                optimizer.step()
    
            meter.update(loss.item(), n=images.size(0))
    
        return meter.avg
    
    def main():
        set_seed(42)
        device = "cuda" if torch.cuda.is_available() else "cpu"
    
        img_size = 640
        num_classes = 80  # COCO example
        strides = [8, 16, 32]
        anchors = [
            [(10,13),(16,30),(33,23)],
            [(30,61),(62,45),(59,119)],
            [(116,90),(156,198),(373,326)]
        ]
    
        # TODO: load your samples list here
        samples = []  # [{"image_path": "...", "annotations": [[cls,x1,y1,x2,y2], ...]}, ...]
    
        ds = DetectionDataset(samples, img_size=img_size, augment=True)
        loader = DataLoader(ds, batch_size=8, shuffle=True, num_workers=4, pin_memory=True, collate_fn=collate_fn)
    
        model = YOLO(num_classes=num_classes, anchors=anchors, base=32).to(device)
        loss_fn = YOLOLoss(anchors, strides, num_classes, img_size).to(device)
    
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)
        scaler = torch.cuda.amp.GradScaler() if device == "cuda" else None
    
        for epoch in range(1, 101):
            avg_loss = train_one_epoch(model, loss_fn, loader, optimizer, device, anchors, strides, img_size, num_classes, scaler)
            print(f"Epoch {epoch:03d} | loss={avg_loss:.4f}")
    
        torch.save(model.state_dict(), "yolo_scratch.pt")
    
    if __name__ == "__main__":
        main()
    
    				
    			

    16) Inference: decoding + NMS (predict pipeline)

    16.1 Decoding helper (add to yolo/metrics.py or yolo/utils.py)

    				
    					import torch
    from .box_ops import xywh_to_xyxy
    from .nms import batched_nms_xyxy
    
    @torch.no_grad()
    def decode_predictions(preds, anchors, strides, num_classes, conf_thresh=0.25, iou_thresh=0.5):
        """
        preds: list of 3 tensors [B, A*(5+C), S, S]
        returns per image: boxes [M,4], scores [M], labels [M]
        """
        outputs = []
        for b in range(preds[0].shape[0]):
            boxes_all = []
            scores_all = []
            labels_all = []
    
            for s, p in enumerate(preds):
                B, _, S, _ = p.shape
                A = len(anchors[s])
                C = num_classes
                stride = strides[s]
    
                x = p[b:b+1].view(1, A, 5+C, S, S).permute(0,1,3,4,2).contiguous()[0]  # [A,S,S,5+C]
                tx_ty = x[..., 0:2]
                tw_th = x[..., 2:4]
                obj_logit = x[..., 4]
                cls_logit = x[..., 5:]
    
                gy, gx = torch.meshgrid(torch.arange(S, device=p.device),
                                        torch.arange(S, device=p.device), indexing="ij")
                grid = torch.stack([gx, gy], dim=-1).float()  # [S,S,2]
    
                anc = torch.tensor(anchors[s], device=p.device).float().view(A,1,1,2)
    
                pxy = (tx_ty.sigmoid() + grid) * stride
                pwh = tw_th.exp() * anc
    
                boxes_xywh = torch.cat([pxy, pwh], dim=-1)  # [A,S,S,4]
                boxes_xyxy = xywh_to_xyxy(boxes_xywh)
    
                obj = obj_logit.sigmoid()  # [A,S,S]
                cls_prob = cls_logit.sigmoid()  # [A,S,S,C]
                # combine: per-class confidence = obj * cls_prob
                conf = obj.unsqueeze(-1) * cls_prob  # [A,S,S,C]
    
                conf = conf.view(-1, C)
                boxes = boxes_xyxy.view(-1, 4)
    
                scores, labels = conf.max(dim=1)
                keep = scores > conf_thresh
                boxes_all.append(boxes[keep])
                scores_all.append(scores[keep])
                labels_all.append(labels[keep])
    
            if boxes_all:
                boxes = torch.cat(boxes_all, dim=0)
                scores = torch.cat(scores_all, dim=0)
                labels = torch.cat(labels_all, dim=0)
    
                keep = batched_nms_xyxy(boxes, scores, labels, iou_thresh=iou_thresh)
                outputs.append((boxes[keep], scores[keep], labels[keep]))
            else:
                outputs.append((torch.zeros((0,4)), torch.zeros((0,)), torch.zeros((0,), dtype=torch.long)))
    
        return outputs
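
    A predict.py-style driver can then be as small as this sketch (it reuses the anchors and strides from train.py and assumes a trained checkpoint plus an already letterboxed, normalized image batch):

    import torch
    from yolo.model import YOLO
    from yolo.metrics import decode_predictions  # or yolo.utils, wherever you placed the helper

    anchors = [
        [(10, 13), (16, 30), (33, 23)],
        [(30, 61), (62, 45), (59, 119)],
        [(116, 90), (156, 198), (373, 326)],
    ]
    strides = [8, 16, 32]

    model = YOLO(num_classes=80, anchors=anchors, base=32).eval()
    model.load_state_dict(torch.load("yolo_scratch.pt", map_location="cpu"))

    images = torch.rand(1, 3, 640, 640)  # stand-in for a real letterboxed image batch
    with torch.no_grad():
        preds = model(images)
    boxes, scores, labels = decode_predictions(preds, anchors, strides, num_classes=80,
                                               conf_thresh=0.25, iou_thresh=0.5)[0]
    print(boxes.shape, scores.shape, labels.shape)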
    
    				
    			

    17) mAP Evaluator (core logic)

    A correct mAP implementation is long; here is a clean, minimal evaluator approach:

    • Collect predictions per image: boxes, scores, labels

    • Collect GT per image: boxes, labels

    • For each class:

      • Sort predictions by score

      • Mark TP/FP using best IoU match above threshold (and only match a GT once)

      • Compute precision-recall curve

      • Compute AP by numeric integration

    • Average across classes → mAP

    If you want COCO-style mAP@[.5:.95], repeat the above at multiple IoU thresholds and average.

    A full, ready-to-run metrics.py with mAP@0.5 and COCO-style mAP in one file runs to a few hundred lines; for readability, this guide keeps the focus on the YOLO pipeline and only sketches the core AP computation below.
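
    The sketch handles a single class, flattens its predictions together, and uses plain trapezoidal integration rather than COCO's 101-point interpolation; a real evaluator must additionally track image ids so a prediction can only match ground truth from its own image:

    import torch
    from yolo.box_ops import box_iou_xyxy

    def average_precision_single_class(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.5):
        # pred_boxes: [P, 4] xyxy, pred_scores: [P], gt_boxes: [G, 4] xyxy
        if gt_boxes.numel() == 0 or pred_boxes.numel() == 0:
            return 0.0
        order = pred_scores.argsort(descending=True)
        pred_boxes = pred_boxes[order]
        matched = torch.zeros(gt_boxes.shape[0], dtype=torch.bool)
        tp = torch.zeros(pred_boxes.shape[0])
        fp = torch.zeros(pred_boxes.shape[0])
        for i in range(pred_boxes.shape[0]):
            ious = box_iou_xyxy(pred_boxes[i:i + 1], gt_boxes).squeeze(0)  # [G]
            best_iou, best_j = ious.max(dim=0)
            if best_iou >= iou_thresh and not matched[best_j]:
                tp[i] = 1.0            # true positive: first match of an unclaimed GT
                matched[best_j] = True
            else:
                fp[i] = 1.0            # false positive: low IoU or GT already matched
        tp_cum, fp_cum = tp.cumsum(0), fp.cumsum(0)
        recall = tp_cum / gt_boxes.shape[0]
        precision = tp_cum / (tp_cum + fp_cum).clamp(min=1e-9)
        return torch.trapz(precision, recall).item()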


    18) Training on COCO (practical notes)

    To train on COCO effectively:

    • Use a stronger backbone and bigger batch if possible

    • Use multi-scale training (randomly change input size per batch; a sketch follows this list)

    • Use warmup for LR in first ~1–3 epochs

    • Use EMA weights for evaluation

    • Watch for:

      • exploding objectness loss (often target assignment bug)

      • near-zero positives (anchors/strides mismatch)

      • boxes drifting outside image (decode bug)
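
    The multi-scale item above has one practical catch: when you rescale a batch, the targets must be rescaled too, and build_targets must be called with the new size. A minimal sketch (the size list is an assumption; any sizes divisible by 32 work):

    import random
    import torch
    import torch.nn.functional as F

    def random_rescale_batch(images, targets_list, sizes=(480, 512, 544, 576, 608, 640)):
        # images: [B, 3, H, W] square letterboxed batch; targets_list: list of [Ni, 5] (cls, x1, y1, x2, y2)
        new_size = random.choice(sizes)
        old_size = images.shape[-1]
        if new_size == old_size:
            return images, targets_list, new_size
        scale = new_size / old_size
        images = F.interpolate(images, size=(new_size, new_size), mode="bilinear", align_corners=False)
        rescaled = []
        for t in targets_list:
            t = t.clone()
            if t.numel() > 0:
                t[:, 1:5] *= scale   # boxes scale linearly with the image
            rescaled.append(t)
        return images, rescaled, new_size

    Call it right after fetching a batch and pass the returned size to build_targets in place of the fixed img_size.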


    19) Training on a custom dataset (the fastest correct workflow)

    1. Pick model input size: 640 is common for a baseline

    2. Convert annotations to absolute XYXY pixels (a COCO-JSON adapter sketch follows this list)

    3. Visualize boxes on images before training

    4. Start with:

      • no fancy aug

      • small model

      • overfit on 20 images

    5. If it overfits, scale up:

      • more data

      • augmentations

      • better backbone/neck
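
    For step 2, a minimal COCO-JSON adapter that produces the samples list DetectionDataset expects might look like this (the path handling, the iscrowd filter, and the contiguous class remapping are assumptions to adapt to your data):

    import json
    import os

    def coco_to_samples(ann_json_path: str, images_dir: str):
        # COCO bboxes are [x, y, w, h] in pixels; category ids are remapped to a contiguous 0..C-1 range
        with open(ann_json_path) as f:
            coco = json.load(f)

        cat_ids = sorted(c["id"] for c in coco["categories"])
        cat_to_idx = {cid: i for i, cid in enumerate(cat_ids)}

        images = {im["id"]: im for im in coco["images"]}
        anns_per_image = {im_id: [] for im_id in images}
        for a in coco["annotations"]:
            if a.get("iscrowd", 0):
                continue
            x, y, w, h = a["bbox"]
            anns_per_image[a["image_id"]].append([cat_to_idx[a["category_id"]], x, y, x + w, y + h])

        return [
            {"image_path": os.path.join(images_dir, im["file_name"]), "annotations": anns_per_image[im_id]}
            for im_id, im in images.items()
        ]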


    20) Common bugs checklist (save hours)

    • Boxes are in wrong format (XYWH vs XYXY)

    • Boxes normalized but treated as pixels (or vice versa)

    • Targets assigned to wrong scale (stride mismatch)

    • Anchor sizes don’t match input size (anchors for 416 but training at 640)

    • Swapped x/y indexing (gx vs gy in tensor indexing)

    • Forgot to clamp grid indices

    • NMS applied before converting to XYXY

    • mAP evaluation matching multiple preds to the same GT

    Implementing YOLO from scratch in PyTorch is one of the best ways to truly understand modern object detection—because you’re forced to connect every moving part: how labels become training targets, how predictions become real boxes, why anchors exist, and what objectness is actually learning.

    By the end of this build, you should have a complete, working detector with:

    • A multi-scale YOLO-style model (backbone + neck + detection heads)

    • A correct target assignment pipeline (ground-truth → grid cell + anchor)

    • A stable loss setup (CIoU/GIoU for boxes + BCE for objectness and classes)

    • A proper inference path (decode → confidence filtering → NMS)

    • A clear route to production-grade evaluation (mAP@0.5 and COCO mAP@[.5:.95])

    • A repo structure you can extend into a real project

    The most important takeaway is that YOLO isn’t “magic”—it’s a carefully engineered system of consistent coordinate transforms, responsible anchor matching, and balanced losses. If any of those pieces disagree (wrong box format, wrong stride, mismatched anchors, flipped x/y indices), learning collapses. But when everything lines up, the model trains smoothly and scales well.
