First post in the series: Training Experience with Self-Supervised Visual Transformers (ViT) (MoCo v3) – Paper Walkthrough
Previous post in the series: Facebook's Training Experience with Self-Supervised Visual Transformers (ViT) (MoCo v3) – Training Code Walkthrough
MoCo v3 code:
https://github.com/facebookresearch/moco-v3
Let's look at vits.py:
import math
import torch
import torch.nn as nn
from functools import partial, reduce
from operator import mul

from timm.models.vision_transformer import VisionTransformer, _cfg
from timm.models.layers.helpers import to_2tuple
from timm.models.layers import PatchEmbed

# The released code defines four different ViT models for MoCo v3
__all__ = [
    'vit_small',
    'vit_base',
    'vit_conv_small',
    'vit_conv_base',
]
vits.py defines four ViT variants: vit_small, vit_base, vit_conv_small and vit_conv_base. All of them inherit from timm's VisionTransformer.
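For orientation, these exported names are what the training code (covered in the previous post of this series) selects from; the sketch below is only illustrative, and arch is a made-up variable, not something from the repo:
import vits

# Hypothetical usage sketch: pick one of the four exported constructors by name.
arch = 'vit_base'                 # any entry of vits.__all__
model = vits.__dict__[arch]()     # equivalent to calling vits.vit_base()
print(type(model).__name__)       # VisionTransformerMoCo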
To understand the ViT used in MoCo v3, we first need to understand the basics of ViT itself; see ViT (Vision Transformer)原理及代码解析.
As noted in that post, timm initializes the position embedding with random values; in MoCo v3 it is replaced by a fixed 2D sin-cos embedding.
def build_2d_sincos_position_embedding(self, temperature=10000.):
    # grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
    # num_patches = grid_size[0] * grid_size[1]
    # grid_size is the number of patches along each axis
    h, w = self.patch_embed.grid_size
    # grid_w = tensor([0., 1., 2., ..., w-1])
    grid_w = torch.arange(w, dtype=torch.float32)
    # grid_h = tensor([0., 1., 2., ..., h-1])
    grid_h = torch.arange(h, dtype=torch.float32)
    # after meshgrid ('ij' indexing) both are (w, h) grids with h*w entries:
    # grid_w = tensor([[0., 0., ..., 0.], [1., 1., ..., 1.], ..., [w-1, w-1, ..., w-1]])
    # grid_h = tensor([[0., ..., h-1], [0., ..., h-1], ..., [0., ..., h-1]])
    grid_w, grid_h = torch.meshgrid(grid_w, grid_h)
    assert self.embed_dim % 4 == 0, 'Embed dimension must be divisible by 4 for 2D sin-cos position embedding'
    pos_dim = self.embed_dim // 4
    # pos_dim frequencies, as in the 1D Transformer sin-cos embedding
    omega = torch.arange(pos_dim, dtype=torch.float32) / pos_dim
    omega = 1. / (temperature**omega)
    # outer product: (h*w,) x (pos_dim,) -> (h*w, pos_dim)
    out_w = torch.einsum('m,d->md', [grid_w.flatten(), omega])
    out_h = torch.einsum('m,d->md', [grid_h.flatten(), omega])
    # (1, h*w, embed_dim)
    pos_emb = torch.cat([torch.sin(out_w), torch.cos(out_w), torch.sin(out_h), torch.cos(out_h)], dim=1)[None, :, :]
    assert self.num_tokens == 1, 'Assuming one and only one token, [cls]'
    pe_token = torch.zeros([1, 1, self.embed_dim], dtype=torch.float32)
    self.pos_embed = nn.Parameter(torch.cat([pe_token, pos_emb], dim=1))
    self.pos_embed.requires_grad = False
The figure above illustrates the MoCo v3 position embedding. The first row is the [cls] token's position embedding, which is all zeros; the first patch's position embedding consists of four blocks alternating between 0 and 1, unlike the two-block and odd/even 0-1 alternation seen in ViT (Vision Transformer)原理及代码解析. Why design the position embedding this way? My understanding: in language processing there is only a single distance between tokens, and a token's position is a one-dimensional quantity, because a sentence is a sequence of words. In image processing, however, the image is a two-dimensional block of information; once it is cut into patches, each patch has a two-dimensional coordinate, and a position embedding designed for language cannot express this 2D information well. The MoCo v3 position embedding code does express this 2D location: as the figure shows, the first half of each embedding carries the row information and the second half carries the column information. Let's look at the position embedding of an (8, 8) patch grid:
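Here is a minimal standalone sketch (not MoCo v3's own code path) that builds the same 2D sin-cos embedding for an 8x8 grid, so the block structure can be inspected directly; embed_dim=32 is just an illustrative choice:
import torch

# Standalone sketch of the 2D sin-cos construction for an 8x8 patch grid.
h, w, embed_dim, temperature = 8, 8, 32, 10000.
grid_w, grid_h = torch.meshgrid(torch.arange(w, dtype=torch.float32),
                                torch.arange(h, dtype=torch.float32))
pos_dim = embed_dim // 4
omega = 1. / temperature ** (torch.arange(pos_dim, dtype=torch.float32) / pos_dim)
out_w = grid_w.flatten()[:, None] * omega[None, :]   # (h*w, pos_dim)
out_h = grid_h.flatten()[:, None] * omega[None, :]   # (h*w, pos_dim)
pos_emb = torch.cat([torch.sin(out_w), torch.cos(out_w),
                     torch.sin(out_h), torch.cos(out_h)], dim=1)  # (h*w, embed_dim)
print(pos_emb.shape)   # torch.Size([64, 32])
print(pos_emb[0])      # first patch: blocks of 0s and 1s, since sin(0)=0 and cos(0)=1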
class VisionTransformerMoCo(VisionTransformer):
    def __init__(self, stop_grad_conv1=False, **kwargs):
        super().__init__(**kwargs)
        # Use fixed 2D sin-cos position embedding
        # initialize the position embedding defined above
        self.build_2d_sincos_position_embedding()

        # weight initialization
        for name, m in self.named_modules():
            # pick out all the fully-connected (Linear) layers
            if isinstance(m, nn.Linear):
                # pick out the Linear layers whose name contains 'qkv',
                # i.e. blocks.i.attn.qkv, where i is the block index (0 .. depth-1)
                if 'qkv' in name:
                    # treat the weights of Q, K, V separately
                    val = math.sqrt(6. / float(m.weight.shape[0] // 3 + m.weight.shape[1]))
                    # draw the weights from the uniform distribution U(-val, val)
                    nn.init.uniform_(m.weight, -val, val)
                # the remaining Linear layers (blocks.i.attn.proj, blocks.i.mlp.fc1,
                # blocks.i.mlp.fc2, head) use plain Xavier initialization
                else:
                    # xavier_uniform draws from the uniform distribution U(-a, a)
                    nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)
        nn.init.normal_(self.cls_token, std=1e-6)

        if isinstance(self.patch_embed, PatchEmbed):
            # xavier_uniform initialization
            val = math.sqrt(6. / float(3 * reduce(mul, self.patch_embed.patch_size, 1) + self.embed_dim))
            nn.init.uniform_(self.patch_embed.proj.weight, -val, val)
            nn.init.zeros_(self.patch_embed.proj.bias)

            # the core of MoCo v3: the patch embedding layer does not take part in training
            if stop_grad_conv1:
                self.patch_embed.proj.weight.requires_grad = False
                self.patch_embed.proj.bias.requires_grad = False
This part covers how MoCo v3 adjusts the model initialization, and it also allows the proj layer that produces the patch embedding, whose parameters would normally be updated through backpropagation, to be frozen so that it does not take part in backpropagation at all and its parameters stay fixed. This is the core trick of MoCo v3, which the paper found stabilizes ViT training.
Some questions here are worth digging into, for example why the initialization is adjusted in exactly these ways, including the position embedding scheme and the initialization of each fully-connected layer.
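As a quick sanity check (just a sketch, using the vit_base constructor defined further down in this file and whatever timm version the repo expects), we can confirm that stop_grad_conv1=True really freezes the patch-embedding projection while the rest stays trainable:
import vits

# Sketch: verify which parameters are frozen by the MoCo v3 ViT wrapper.
model = vits.vit_base(stop_grad_conv1=True)
print(model.patch_embed.proj.weight.requires_grad)   # False: patch projection is frozen
print(model.pos_embed.requires_grad)                 # False: fixed 2D sin-cos embedding
print(model.cls_token.requires_grad)                 # True: still trained as usual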
The code also introduces ConvStem, an improved way of producing the patch embedding. Since only the proj-related part differs from timm's PatchEmbed, only that part is shown here:
class ConvStem(nn.Module):
    """
    ConvStem, from Early Convolutions Help Transformers See Better, Tete et al. https://arxiv.org/abs/2106.14881
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
        super().__init__()

        assert patch_size == 16, 'ConvStem only supports patch size of 16'
        assert embed_dim % 8 == 0, 'Embed dimension must be divisible by 8 for ConvStem'

        # build stem, similar to the design in https://arxiv.org/abs/2106.14881
        stem = []
        input_dim, output_dim = 3, embed_dim // 8
        for l in range(4):
            # Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1): halves H and W each time
            # (B, 3,              H,    W)    -> (B, embed_dim // 8, H/2,  W/2)
            # (B, embed_dim // 8, H/2,  W/2)  -> (B, embed_dim // 4, H/4,  W/4)
            # (B, embed_dim // 4, H/4,  W/4)  -> (B, embed_dim // 2, H/8,  W/8)
            # (B, embed_dim // 2, H/8,  W/8)  -> (B, embed_dim,      H/16, W/16)
            stem.append(nn.Conv2d(input_dim, output_dim, kernel_size=3, stride=2, padding=1, bias=False))
            stem.append(nn.BatchNorm2d(output_dim))
            stem.append(nn.ReLU(inplace=True))
            input_dim = output_dim
            output_dim *= 2
        # 1x1 conv keeps the spatial size: (B, embed_dim, H/16, W/16) -> (B, embed_dim, H/16, W/16)
        stem.append(nn.Conv2d(input_dim, embed_dim, kernel_size=1))
        # proj has five conv layers in total
        self.proj = nn.Sequential(*stem)
So when producing the patch embedding, ConvStem replaces the original single patchify projection (which acts like a fully-connected layer on each patch) with four stride-2 convolution layers (the final 1x1 convolution arguably doesn't count). We may cover ConvStem in detail some other time; incidentally, ConvStem is also a Facebook result.
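A quick shape check (a sketch based on the ConvStem definition above): with a 224x224 input and embed_dim=768, the four stride-2 convs end at a 14x14 feature map, exactly matching a patch size of 16:
import torch

# Sketch: pass a dummy image through ConvStem.proj and inspect the output shape.
stem = ConvStem(img_size=224, patch_size=16, in_chans=3, embed_dim=768)
x = torch.randn(1, 3, 224, 224)
y = stem.proj(x)
print(y.shape)   # torch.Size([1, 768, 14, 14]) -> 14*14 = 196 patch tokens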
Finally, let's look at what the four ViTs defined by MoCo v3 look like, and how they differ:
def vit_small(**kwargs):
    model = VisionTransformerMoCo(
        patch_size=16, embed_dim=384, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    model.default_cfg = _cfg()
    return model

def vit_base(**kwargs):
    model = VisionTransformerMoCo(
        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    model.default_cfg = _cfg()
    return model
These two models differ only in embed_dim: one is small, one is large.
def vit_conv_small(**kwargs):
    # minus one ViT block
    model = VisionTransformerMoCo(
        patch_size=16, embed_dim=384, depth=11, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), embed_layer=ConvStem, **kwargs)
    model.default_cfg = _cfg()
    return model

def vit_conv_base(**kwargs):
    # minus one ViT block
    model = VisionTransformerMoCo(
        patch_size=16, embed_dim=768, depth=11, num_heads=12, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), embed_layer=ConvStem, **kwargs)
    model.default_cfg = _cfg()
    return model
The main difference here is the embedding layer: ConvStem is used instead of the default PatchEmbed, and the depth is one block shallower (depth=11, the "minus one ViT block" in the comment, presumably to keep compute comparable after adding the convolutional stem).
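To make the differences concrete, here is a small sketch that instantiates each exported variant and prints its size; the exact numbers and attribute names depend on the timm version the repo is run with, so treat them as indicative only:
import vits

# Sketch: instantiate each exported variant and compare sizes.
for name in vits.__all__:
    model = vits.__dict__[name]()
    n_params = sum(p.numel() for p in model.parameters())
    print(f'{name:>16}: embed_dim={model.embed_dim}, {n_params / 1e6:.1f}M params')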
The ViTs in the MoCo v3 paper are a bit more varied, and the paper also reports training results for ViTs with different parameter settings; have a look at it if you want the details.
References:
[1] 我是啤酒, ViT (Vision Transformer)原理及代码解析, Chaos万有引力, 2021
[2] Jay Alammar, The Illustrated Transformer, jalammar.github.io, 2018