vllm.model_executor.layers.quantization.utils.mxfp8_utils
Mxfp8LinearBackend
Mxfp8LinearOp
Source code in vllm/model_executor/layers/quantization/utils/mxfp8_utils.py
__init__
__init__(backend: Mxfp8LinearBackend)
apply
apply(
    input: Tensor,
    weight: Tensor,
    weight_scale: Tensor,
    out_dtype: dtype,
    bias: Tensor | None = None,
) -> Tensor
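The signature suggests `apply` performs a linear layer with an MXFP8-quantized weight. A minimal NumPy sketch of the assumed functional contract (not the actual vLLM kernel; the name `mxfp8_linear_ref`, the block size of 32, and the per-output-row scale layout are assumptions here):

```python
import numpy as np

def mxfp8_linear_ref(inp, weight_q, weight_scale, bias=None, block=32):
    """Hypothetical reference: dequantize the weight block-wise along its
    last axis, then compute a standard linear layer y = x @ W.T + b."""
    w_blocks = weight_q.reshape(*weight_q.shape[:-1], -1, block)
    # each block of 32 weight values shares one scale (assumed layout)
    w = (w_blocks * weight_scale[..., None]).reshape(weight_q.shape)
    out = inp @ w.T
    if bias is not None:
        out = out + bias
    return out

# usage: 2 tokens, 32 input features, 4 output features
inp = np.ones((2, 32))
wq = np.ones((4, 32))
ws = np.ones((4, 1))          # one scale per 32-element block
out = mxfp8_linear_ref(inp, wq, ws)
```

The real op would additionally cast the result to `out_dtype` and dispatch to the configured `Mxfp8LinearBackend`.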
_mxfp8_e4m3_quantize_impl
_mxfp8_e4m3_quantize_impl(
    x: Tensor, is_sf_swizzled_layout: bool = False
) -> tuple[Tensor, Tensor]
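For orientation, MXFP8 per the OCP microscaling format quantizes each 32-element block with a shared power-of-two (E8M0) scale, saturating values to the float8_e4m3fn range. A NumPy sketch of that scheme (not the vLLM implementation: rounding to the actual e4m3 grid is omitted, and the swizzled scale layout selected by `is_sf_swizzled_layout` is not modeled):

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite float8_e4m3fn value
BLOCK = 32         # MX block size from the OCP microscaling spec

def mxfp8_e4m3_quantize_ref(x):
    """Sketch: one power-of-two scale per 32-element block along the
    last axis; values are saturated to the e4m3 range."""
    blocks = x.reshape(*x.shape[:-1], -1, BLOCK).astype(np.float64)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # E8M0-style shared exponent: floor(log2(amax)) - floor(log2(448))
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) \
        - np.floor(np.log2(E4M3_MAX))
    scale = 2.0 ** exp
    q = np.clip(blocks / scale, -E4M3_MAX, E4M3_MAX)  # e4m3 grid rounding omitted
    return q.reshape(x.shape), scale.squeeze(-1)
```

Because the scale is an exact power of two, dequantization (`q * scale`) recovers the input exactly whenever no saturation or grid rounding occurs.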
dequant_mxfp8_to_bf16
Dequantize MXFP8 tensor to BF16.
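Dequantization is the inverse step: each 32-element block is multiplied by its shared scale. A hedged NumPy sketch (float32 stands in for BF16, which NumPy lacks; the block size and scale layout are assumptions):

```python
import numpy as np

BLOCK = 32  # assumed MX block size

def dequant_mxfp8_ref(q, scale):
    """Sketch: multiply each 32-element block along the last axis by its
    shared scale. float32 here stands in for the BF16 output dtype."""
    blocks = q.reshape(*q.shape[:-1], -1, BLOCK)
    return (blocks * scale[..., None]).reshape(q.shape).astype(np.float32)
```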
mxfp8_e4m3_quantize
mxfp8_e4m3_quantize_fake
Fake implementation for torch.compile tracing.
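A "fake" implementation lets torch.compile trace through a custom op by producing outputs with the right shapes and dtypes without running the real kernel. A shape-only NumPy sketch of the idea (the uint8 dtypes and the one-scale-per-32-elements output shape are assumptions, not the op's actual contract):

```python
import numpy as np

BLOCK = 32  # assumed MX block size

def mxfp8_e4m3_quantize_fake(x):
    """Shape/dtype-only stand-in: allocates outputs without computing
    anything, which is all a tracer needs. uint8 stands in for the packed
    fp8 payload and the E8M0 scale bytes (assumed)."""
    q = np.empty(x.shape, dtype=np.uint8)
    sf = np.empty(x.shape[:-1] + (x.shape[-1] // BLOCK,), dtype=np.uint8)
    return q, sf
```

In PyTorch proper this role is played by a fake impl registered for the custom op (e.g. via `torch.library.register_fake`), operating on meta tensors rather than real buffers.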