In a previous post I used Hugging Face's transformers library to download a BERT model and gave it a quick spin. Today the plan is to benchmark the model's inference performance on CPU and GPU, then optimize it with TensorRT and see how much of a speedup that buys.
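The benchmarks below assume `model` and `tokenizer` are already loaded. A minimal setup sketch (the exact checkpoint is an assumption; the earlier post is taken to have used a stock BERT such as `bert-base-uncased`):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a stock BERT checkpoint; swap in whatever the earlier post used.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout
```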
## CPU Inference Test

```python
import time
import numpy as np

# Load test on CPU: 100 identical requests
texts = ["This is a test sentence. It is long, boring and verbose."] * 100

# Warmup
_ = model(**tokenizer(texts[0], return_tensors="pt"))

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt")
    _ = model(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")
```

This reports an average latency of 0.0113s.
## GPU Inference Test

```python
# Test with CUDA
model_cuda = model.to("cuda")

# Warmup
_ = model_cuda(**tokenizer(texts[0], return_tensors="pt").to("cuda"))

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    _ = model_cuda(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")
```

This reports an average latency of 0.0024s. The GPU alone gives roughly a 4.7x speedup (0.0113s / 0.0024s), which is a very solid result.
## Installing TensorRT

For CUDA 12.8, install the torch-related components with this command:
```bash
# The index URL value was truncated in the original post; for CUDA 12.8
# nightly wheels it is presumably https://download.pytorch.org/whl/nightly/cu128
pip3 install --pre torch torchvision torchaudio torch-tensorrt --index-url https://download.pytorch.org/whl/nightly/cu128
```
Installing torch-tensorrt automatically pulls in tensorrt as a dependency, which saves a step. If you are not using PyTorch, you will need to follow the official documentation to install TensorRT yourself.
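A quick post-install sanity check (a minimal sketch; version numbers will vary with your setup):

```python
import tensorrt
import torch
import torch_tensorrt

# Confirm torch-tensorrt pulled in a matching tensorrt build
print("torch:", torch.__version__)
print("torch_tensorrt:", torch_tensorrt.__version__)
print("tensorrt:", tensorrt.__version__)
print("CUDA available:", torch.cuda.is_available())
```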
## Model Conversion

This follows the official tutorial.
```python
# Prepare a dummy input for the conversion trace
text = ["Hello, world"]
inputs = tokenizer(text, return_tensors="pt")
print(inputs)

# Convert to TensorRT with torch-tensorrt
import torch
import torch_tensorrt

optimized_model = torch.compile(model, backend="tensorrt")
optimized_model(**inputs.to("cuda"))
```
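One caveat with this path (general `torch.compile` behavior, not something verified in the original post): compilation is lazy, so the first call builds the TensorRT engine, and a new input shape can trigger a rebuild. A hypothetical way to keep shapes stable across requests is to pad every input to a fixed length:

```python
# Hypothetical shape-stabilizing tokenization: padding to a fixed max_length
# means the compiled engine always sees the same input shape and is not
# rebuilt per request. max_length=32 is an arbitrary illustrative choice.
inputs = tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    max_length=32,
    truncation=True,
).to("cuda")
```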
## TensorRT Inference Test

Benchmark the TensorRT-compiled model the same way as before.

```python
# Load test: 100 identical requests
texts = ["This is a test sentence. It is long, boring and verbose."] * 100

# Warmup (the first call also triggers the TensorRT engine build)
_ = optimized_model(**tokenizer(texts[0], return_tensors="pt").to("cuda"))

# Timed inference
latencies = []
for text in texts:
    start = time.time()
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    _ = optimized_model(**inputs)
    latencies.append(time.time() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")
```

The result is a bit of an eyebrow-raiser: the average latency is 0.0047s, which is actually higher than with plain CUDA. TensorRT is usually advertised as a 2-3x speedup, so some step in this pipeline has probably gone wrong. That deserves further investigation.
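Two hypotheses worth ruling out before blaming TensorRT itself (neither is confirmed by the original run). First, CUDA kernels launch asynchronously, so `time.time()` can return before the GPU has actually finished; without `torch.cuda.synchronize()`, all of the GPU numbers above partly measure kernel-launch overhead rather than compute. Second, the `torch.compile` call above runs in FP32 by default, while TensorRT's headline speedups usually assume reduced precision (the backend accepts an `options` dict, e.g. `options={"enabled_precisions": {torch.half}}`; check the current torch-tensorrt docs). A sketch of a properly synchronized benchmark:

```python
import time
import numpy as np
import torch

def benchmark(fn, inputs_list, warmup=10):
    """Average per-request latency of fn, with proper GPU synchronization."""
    with torch.no_grad():
        for _ in range(warmup):
            fn(**inputs_list[0])
        torch.cuda.synchronize()  # drain pending kernels before timing

        latencies = []
        for inputs in inputs_list:
            start = time.time()
            fn(**inputs)
            torch.cuda.synchronize()  # wait until the GPU actually finishes
            latencies.append(time.time() - start)
    return np.mean(latencies)

# Hypothetical usage: compare the plain CUDA model and the compiled one
inputs_list = [tokenizer(t, return_tensors="pt").to("cuda") for t in texts]
print(f"cuda:     {benchmark(model_cuda, inputs_list):.4f}s")
print(f"tensorrt: {benchmark(optimized_model, inputs_list):.4f}s")
```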
Source: 莱娜探长