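One-sentence summary: computer vision enables AI to understand image content; from classification to segmentation to generation, it spans the full range of visual tasks.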

| Model | Parameters | Top-1 accuracy | Speed (FPS) |
|---|---|---|---|
| ResNet-50 | 25M | 76.1% | 1000+ |
| ViT-B/16 | 86M | 81.8% | 500+ |
| ConvNeXt-L | 200M | 84.8% | 300+ |
| Swin-L | 197M | 86.6% | 200+ |

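
The Top-1 accuracy column counts a prediction as correct only when the highest-scoring class matches the label. The following is a minimal, framework-free sketch of that metric; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from any benchmark.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy data: 4 samples, 3 classes (values are made up for illustration).
scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.6, 0.3, 0.1],  # predicts class 0
    [0.2, 0.3, 0.5],  # predicts class 2
    [0.4, 0.5, 0.1],  # predicts class 1
]
labels = [1, 0, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Setting `k=5` gives the Top-5 variant that is often reported alongside Top-1.
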

| Model | Task | mAP | Speed (FPS) |
|---|---|---|---|
| DeepLabV3+ | Semantic segmentation | 45.4 | 30 |
| Mask R-CNN | Instance segmentation | 40.5 | 20 |
| SAM | Zero-shot segmentation | - | 15 |
| YOLOv8-Seg | Real-time segmentation | 38.2 | 80 |

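
FPS figures like those in the tables above depend heavily on hardware, batch size, and input resolution, so it can help to measure them locally. The sketch below shows one common way to do that: discard a few warm-up calls, then time a fixed number of runs. `measure_fps` and `fake_inference` are hypothetical names, and the workload is a stand-in for a real model call.

```python
import time


def measure_fps(fn, warmup=5, runs=50):
    """Call fn() repeatedly and return the average number of calls per second."""
    for _ in range(warmup):
        fn()  # warm-up iterations are excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return runs / elapsed


def fake_inference():
    # Hypothetical stand-in for model(input): a small CPU-bound loop.
    return sum(i * i for i in range(10_000))


fps = measure_fps(fake_inference)
print(f"{fps:.1f} FPS")
```

When timing a model on a GPU, execution is asynchronous, so the timer should only be read after synchronizing (in PyTorch, `torch.cuda.synchronize()`).
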
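
```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load ResNet-50 with ImageNet weights. The `weights` argument replaces the
# deprecated `pretrained=True` (torchvision >= 0.13).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()  # inference mode: use stored batch-norm statistics

# Standard ImageNet preprocessing: resize, center-crop, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open('input.jpg').convert('RGB')  # force 3 channels
input_tensor = preprocess(image)

# Add a batch dimension and run without tracking gradients.
with torch.no_grad():
    output = model(input_tensor.unsqueeze(0))  # shape: (1, 1000)
```
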
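
The classifier's `output` holds raw logits, one per ImageNet class, so a softmax followed by a top-k selection is usually the next step. The sketch below shows that post-processing in plain Python on made-up logits; with PyTorch tensors the same result would typically come from `torch.softmax(output[0], dim=0)` and `torch.topk`. The helper names `softmax` and `top_k` are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (class_index, probability) pairs for the k most likely classes."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Made-up logits for a 5-class problem; the ResNet-50 output has 1000 entries.
logits = [1.0, 3.0, 0.5, 2.0, -1.0]
probs = softmax(logits)
best = top_k(probs, k=3)

for index, prob in best:
    print(f"class {index}: {prob:.3f}")
```

Mapping each class index to a readable label requires the ImageNet class list, which is not part of this sketch.
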
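
```python
import torch
# Import kept as written in the source; check it against the installed
# torchseg version before relying on it.
from torchseg import create_segmentation_model

# DeepLabV3+ with an EfficientNet-B4 encoder and 21 output classes.
model = create_segmentation_model(
    arch='deeplabv3+',
    encoder_name='efficientnet-b4',
    classes=21,
)
model.eval()

# Placeholder batch (assumption): one 3-channel 512x512 image.
input_image = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    segmentation = model(input_image)
```
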
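
Assuming the segmentation model returns one score per class for every pixel, the usual next steps are a per-pixel argmax to get a label mask, then a comparison against ground truth with intersection-over-union (IoU). The following plain-Python sketch illustrates both on a tiny made-up example; `argmax_mask` and `mean_iou` are hypothetical helpers, and real pipelines would do this on tensors.

```python
def argmax_mask(logits):
    """logits[c][y][x] -> mask[y][x] holding the highest-scoring class index."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += (p == c) and (t == c)
                union += (p == c) or (t == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Made-up 2-class logits on a 2x2 image.
logits = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0 scores
    [[1.0, 0.9], [0.8, 0.1]],  # class 1 scores
]
target = [[0, 1], [0, 0]]

mask = argmax_mask(logits)
miou = mean_iou(mask, target, num_classes=2)
print(mask, round(miou, 3))
```

Here class 0 overlaps on 2 of 3 pixels and class 1 on 1 of 2, so the mean IoU is the average of 2/3 and 1/2.
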
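
| Model | Type | Resolution | Highlights |
|---|---|---|---|
| DALL-E 3 | Text-to-Image | 1024×1024 | Strong text understanding |
| Stable Diffusion | Text-to-Image | 512×512 | Open source and customizable |
| Midjourney | Text-to-Image | 1024×1024 | Artistic styles |
| Sora | Text-to-Video | 1080p | Video generation |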