zcrop vlog

yolov8n (supplement)

Published 2026-01-14, updated 2026-02-40

YOLOv8 network walkthrough: supplementary notes

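1. Detect module (decoupled detection head)

File path: ultralytics/nn/modules/head.py

This is YOLOv8's final output layer. Unlike YOLOv5, v8 no longer predicts classes and box coordinates in a single convolution; it splits the work across two branches.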

Python

class Detect(nn.Module):
    """YOLOv8 Detect head for detection models."""
    dynamic = False  # force grid reconstruction
    export = False   # export mode
    shape = None
    anchors = torch.empty(0)  # init
    strides = torch.empty(0)  # init

    def __init__(self, nc=80, ch=()):  # detection layer
        super().__init__()
        self.nc = nc  # number of classes (e.g. 3)
        self.nl = len(ch)  # number of detection layers (usually 3: P3, P4, P5)
        self.reg_max = 16  # DFL channels (16 distribution bins per box side)
        self.no = nc + self.reg_max * 4  # number of outputs per anchor (e.g. 3 + 16*4 = 67)
        self.stride = torch.zeros(self.nl)  # strides computed during build

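        # ==================== Core structure: decoupled head ====================
        # 1. Box branch (regression): predicts the distance distributions
        # Structure: Conv(3x3) -> Conv(3x3) -> Conv2d(1x1)
        # Note: with reg_max = 16, c2 works out to max(16, ch[0] // 4, 64)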
        c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))
        self.cv2 = nn.ModuleList(
            nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
        )

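        # 2. Cls branch (classification): predicts per-class scores
        # Structure: (DWConv 3x3 -> Conv 1x1) -> (DWConv 3x3 -> Conv 1x1) -> Conv2d(1x1)
        # Depthwise-separable convs keep this branch cheap. As far as I can tell, recent
        # ultralytics releases only build this variant when Detect.legacy is False; models
        # parsed from YOLOv8 YAMLs use Conv(3x3) -> Conv(3x3) -> Conv2d(1x1) here instead.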
        self.cv3 = nn.ModuleList(
            nn.Sequential(
                nn.Sequential(DWConv(x, x, 3), Conv(x, c3, 1)),
                nn.Sequential(DWConv(c3, c3, 3), Conv(c3, c3, 1)),
                nn.Conv2d(c3, self.nc, 1),
            )
            for x in ch
        )
        
        # 3. DFL module (integral step): only enabled when reg_max > 1
        self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    def forward(self, x):
        """Concatenates and returns predicted bounding boxes and class probabilities."""
        shape = x[0].shape  # BCHW
        
        # Loop over the three scales (P3, P4, P5)
        for i in range(self.nl):
            # x[i] is the feature map handed over by the backbone + neck
            # self.cv2[i](x[i]): box branch, gives (B, 64, H, W)
            # self.cv3[i](x[i]): cls branch, gives (B, 3, H, W) when nc = 3
            # torch.cat: concatenate along the channel dim, gives (B, 67, H, W)
            x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
            
        if self.training:
            return x
            
        # Inference path ...
        # (usually covers anchor generation and DFL decoding; see the DFL section below)
        return x
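
To make the shapes concrete, here is a small sketch that traces what the training-mode forward returns at each scale. It assumes a 640x640 input and the usual strides 8, 16, 32 for P3, P4, P5; the helper name `detect_output_shapes` is mine and is not part of ultralytics.

```python
def detect_output_shapes(batch, img_size, nc, reg_max=16, strides=(8, 16, 32)):
    """Return the (B, no, H, W) shape produced at each detection scale."""
    no = nc + 4 * reg_max  # box-distribution channels + class channels
    shapes = []
    for s in strides:
        h = w = img_size // s  # feature-map size at this stride
        shapes.append((batch, no, h, w))
    return shapes


shapes = detect_output_shapes(batch=1, img_size=640, nc=3)
num_anchors = sum(h * w for _, _, h, w in shapes)
print(shapes)       # [(1, 67, 80, 80), (1, 67, 40, 40), (1, 67, 20, 20)]
print(num_anchors)  # 8400
```

Under these assumptions the three maps together hold 8400 anchor points, each carrying 67 values.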

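Detailed breakdown:

Decoupling:
- Box branch (cv2): concerned only with "where are the object's edges". It needs precise edge information, so it uses standard convolutions.
- Cls branch (cv3): concerned only with "what is this object". It needs semantic information. In the code above this branch uses DWConv (a depthwise 3x3 followed by a pointwise 1x1) to cut compute, which helps on devices such as the RK3566. Check your installed version, though: as far as I can tell, heads built from YOLOv8 YAMLs may still use two standard 3x3 Convs here.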
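Channel widths (c2, c3):
- This is the key place to slim the model down. By default c3 = max(ch[0], min(nc, 100)), so it follows the channel count of the first input level and reaches 256 when ch[0] = 256. For a robot vacuum (3 classes), I suggest forcing c2 = c3 = 64 in that case, which can cut the head's compute by more than 50%. If ch[0] is already 64, the defaults give c2 = c3 = 64 and there is nothing to trim.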
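Output dimension (self.no):
- Output channels = nc (3) + reg_max * 4 (64) = 67.
- This is why, in NPU post-processing, what you get back is not x, y, w, h directly but a large block of raw values: 64 distribution logits followed by 3 class logits per anchor.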
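
The channel-width formula is easy to check by hand. The sketch below reproduces the `c2, c3` line from `Detect.__init__` as a plain function (the name `head_widths` is hypothetical) and evaluates it for a few values of `ch[0]`, so you can see when c3 actually reaches 256.

```python
def head_widths(ch0, nc, reg_max=16):
    """Mirror the c2/c3 computation in Detect.__init__ shown above."""
    c2 = max(16, ch0 // 4, reg_max * 4)  # box-branch width
    c3 = max(ch0, min(nc, 100))          # cls-branch width
    return c2, c3


for ch0 in (64, 128, 256):
    print(ch0, head_widths(ch0, nc=3))
# 64 (64, 64)
# 128 (64, 128)
# 256 (64, 256)
```

With 3 classes the cls width simply follows `ch[0]`, so trimming c3 to 64 only pays off when the first input level has more than 64 channels.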

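2. DFL module (Distribution Focal Loss)

File path: ultralytics/nn/modules/block.py

This is the "mathematical magic" that turns 64 channels into 4 coordinate values. The core idea: instead of predicting a distance directly, predict a probability distribution over distances and take its expectation (a discrete integral).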

Python

class DFL(nn.Module):
    """
    Integral module of Distribution Focal Loss (DFL).
    Proposed in Generalized Focal Loss https://ieeexplore.ieee.org/document/9792391
    """

    def __init__(self, c1=16):
        super().__init__()
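        # 1. A conv layer with 16 input channels and 1 output channel,
        # kernel size 1x1, no bias
        self.conv = nn.Conv2d(c1, 1, 1, bias=False).requires_grad_(False)

        # 2. Initialise the weights to [0, 1, 2, ..., 15]
        # These weights are fixed and never trained: the conv is just an integral operator.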
        x = torch.arange(c1, dtype=torch.float)
        self.conv.weight.data[:] = nn.Parameter(x.view(1, c1, 1, 1))
        self.c1 = c1

    def forward(self, x):
        """Decode per-side distributions into expected distances of shape (B, 4, Anchors)."""
        # x shape: (Batch, 64, Anchors)
        b, c, a = x.shape

        # 1. view & transpose:
        # Split 64 into 4 groups x 16 values; the 4 stands for (left, top, right, bottom).
        # Shape: view -> (B, 4, 16, Anchors), transpose(2, 1) -> (B, 16, 4, Anchors)

        # 2. softmax(1):
        # Softmax over the 16 values turns them into a probability distribution (sums to 1).
        # E.g. [0.01, ..., 0.8, 0.19, ...] means the distance most likely lies between bins 5 and 6.

        # 3. self.conv:
        # The 1x1 conv performs a weighted sum, i.e. the expectation.
        # Output = 0*P0 + 1*P1 + ... + 15*P15
        
        return self.conv(x.view(b, 4, self.c1, a).transpose(2, 1).softmax(1)).view(b, 4, a)
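
For NPU post-processing it can help to see the same computation without tensors. The following is a pure-Python sketch for a single anchor, assuming the 64 box channels are laid out as four consecutive groups of 16 (left, top, right, bottom), which is what the `view(b, 4, self.c1, a)` above implies. `dfl_decode` is an illustrative name, not an ultralytics function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_decode(box_logits, reg_max=16):
    """Turn 4 * reg_max logits into 4 expected distances (l, t, r, b)."""
    assert len(box_logits) == 4 * reg_max
    distances = []
    for side in range(4):
        group = box_logits[side * reg_max:(side + 1) * reg_max]
        probs = softmax(group)
        # Expectation over bin indices 0..reg_max-1, same as the fixed 1x1 conv.
        distances.append(sum(i * p for i, p in enumerate(probs)))
    return distances


# Uniform logits give a uniform distribution, whose mean is (0 + 15) / 2.
print(dfl_decode([0.0] * 64))  # [7.5, 7.5, 7.5, 7.5]
```

The result is still in grid units for the level the anchor belongs to; scaling to pixels is a separate step.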

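How DFL computes a distance, step by step:

Suppose we want to predict the distance from the anchor point (the cell centre) to the left edge of the box (dist_left):

Input: the model outputs 16 values (logits).
Softmax: the 16 values are normalised into probabilities P0, P1, …, P15.
Integral (expectation):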

$$\hat{y} = \sum_{i=0}^{15} P_i \times i$$

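For example, with P5 = 0.8, P6 = 0.2 and all others 0, the result is 5 × 0.8 + 6 × 0.2 = 5.2.
Intuition: this is much more stable than having the network regress "5.2" directly, because the network can express uncertainty such as "I'm not sure, somewhere between 5 and 6".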
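
The worked example can be checked in a few lines, and the same sketch shows the step that usually follows: turning the four distances into a pixel-space box. I am assuming here that distances are in grid units, that the anchor point sits at the cell centre (x + 0.5, y + 0.5), and that everything is multiplied by the level's stride at the end. `expected_distance` and `ltrb_to_xyxy` are hypothetical helpers.

```python
def expected_distance(probs):
    """Expectation of the bin index under a discrete distribution."""
    return sum(i * p for i, p in enumerate(probs))


def ltrb_to_xyxy(cell_x, cell_y, ltrb, stride):
    """Convert (left, top, right, bottom) distances to an x1, y1, x2, y2 box in pixels."""
    cx, cy = cell_x + 0.5, cell_y + 0.5  # anchor point at the cell centre (assumed)
    left, top, right, bottom = ltrb
    return (
        (cx - left) * stride,
        (cy - top) * stride,
        (cx + right) * stride,
        (cy + bottom) * stride,
    )


probs = [0.0] * 16
probs[5], probs[6] = 0.8, 0.2
d = expected_distance(probs)
print(round(d, 6))  # 5.2

# Same distance on all four sides, anchor in cell (10, 10) of the stride-8 map.
print(ltrb_to_xyxy(10, 10, (2.0, 2.0, 2.0, 2.0), stride=8))  # (68.0, 68.0, 100.0, 100.0)
```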