ML infra: using TensorRT


I previously tried using Hugging Face's transformers library to download a BERT model and gave it a quick run. Today the plan is to benchmark the model's inference performance on CPU and GPU, then optimize it with TensorRT and see how much speedup is attainable.
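For context, a minimal sketch of the setup from that earlier post, assuming a standard bert-base-uncased checkpoint (the actual model used there may differ):

from transformers import AutoModel, AutoTokenizer

# Assumption: a vanilla BERT encoder; swap in whatever checkpoint the
# earlier post actually downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode: disables dropout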

CPU inference test

# Load test on CPU
import time
import numpy as np

# Test data
texts = ["This is a test sentence. It is long, boring and verbose."] * 100  # 100 identical requests

# Warmup
_ = model(**tokenizer(texts[0], return_tensors="pt"))  # .to("cuda") disabled for the CPU run

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt")  # .to("cuda") disabled for the CPU run
    _ = model(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")

The average latency on CPU comes out to 0.0113s.

GPU inference test

# Test with CUDA
model_cuda = model.to("cuda")

# Warmup
_ = model_cuda(**tokenizer(texts[0], return_tensors="pt").to("cuda"))

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    _ = model_cuda(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")

The average latency is 0.0024s. Simply moving to the GPU gives roughly a 4.7x speedup (0.0113s / 0.0024s), which is already quite good.
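One caveat worth flagging before trusting these numbers too much (my addition, not part of the original test): CUDA kernels launch asynchronously, so wrapping a forward call in time.time() can misattribute when the GPU work actually finishes. A sketch of a synchronized timing helper:

import time
import torch

def timed_cuda_call(fn, *args, **kwargs):
    """Time a callable that launches CUDA work, including kernel completion."""
    torch.cuda.synchronize()  # drain previously queued GPU work
    start = time.time()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()  # wait for this call's kernels to finish
    return out, time.time() - start

Dropping this into the loop above would look like: _, dt = timed_cuda_call(model_cuda, **inputs).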

TensorRT installation

For CUDA 12.8, the PyTorch-related components can be installed with this command:

pip3 install --pre torch torchvision torchaudio torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu128

Installing torch-tensorrt pulls in tensorrt automatically as a dependency, which conveniently saves a step. If you are not using PyTorch, you will need to follow the official TensorRT documentation to install it instead.
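Before converting anything, a quick sanity check (my addition) that the packages landed and the GPU is visible:

import torch
import torch_tensorrt
import tensorrt

print(torch.__version__)          # should be a cu128 nightly build
print(torch_tensorrt.__version__)
print(tensorrt.__version__)       # installed automatically as a dependency
print(torch.cuda.is_available())  # expect True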

Model conversion

The conversion here follows the official tutorial.

# Prepare a dummy input for the conversion trace
text = ["Hello, world"]
inputs = tokenizer(text, return_tensors="pt")
print(inputs)

# Convert to TensorRT with torch-tensorrt
import torch
import torch_tensorrt

optimized_model = torch.compile(model, backend="tensorrt")
optimized_model(**inputs.to("cuda"))  # the first call triggers compilation
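Worth noting: TensorRT's headline speedup claims usually assume reduced precision, while the compile above stays in FP32. The Torch-TensorRT backend accepts compiler options through torch.compile; a sketch of enabling FP16, with the caveat that the exact option keys are an assumption to verify against the installed torch-tensorrt version:

import torch
import torch_tensorrt  # importing registers the "tensorrt" backend

# Assumption: "enabled_precisions" is the option key honored by this
# torch-tensorrt version; confirm against its torch.compile backend docs.
optimized_fp16 = torch.compile(
    model,
    backend="tensorrt",
    options={"enabled_precisions": {torch.half}},  # allow FP16 TensorRT kernels
)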

TensorRT inference test

Benchmark the TensorRT-compiled model with the same approach as before.

import time
import numpy as np

# Load test
texts = ["This is a test sentence. It is long, boring and verbose."] * 100  # 100 identical requests

# Warmup
_ = optimized_model(**tokenizer(texts[0], return_tensors="pt").to("cuda"))

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    _ = optimized_model(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")

The result raises some eyebrows: the average latency is 0.0047s, which is actually higher than running directly on CUDA. TensorRT is generally claimed to deliver a 2-3x speedup, so some step in this pipeline has probably gone wrong. Worth digging into later.
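Two plausible culprits, speculation on my part rather than anything verified here: the per-request tokenizer call dominates the measured time, and torch.compile guard checks add per-call Python overhead that a small BERT forward pass cannot amortize. A sketch of a fairer comparison that tokenizes once and synchronizes, so only the forward pass is timed:

import time
import numpy as np
import torch

# Tokenize once; the loop then measures only the model forward pass.
inputs = tokenizer(texts[0], return_tensors="pt").to("cuda")

def bench(fn, n=100):
    for _ in range(10):           # warmup; also triggers any lazy compilation
        _ = fn(**inputs)
    torch.cuda.synchronize()
    latencies = []
    for _ in range(n):
        start = time.time()
        _ = fn(**inputs)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        latencies.append(time.time() - start)
    return np.mean(latencies)

print(f"cuda:     {bench(model_cuda):.4f}s")
print(f"tensorrt: {bench(optimized_model):.4f}s")

If the TensorRT number still loses after this, the next thing to check would be whether the graph actually lowered to TensorRT or silently fell back to eager subgraphs.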

