ALMOST all of the CUDA package libraries can be pip-installed and used instead of relying on a system-wide installation of CUDA. This allows you to:
- Not worry about whether a user has installed the correct CUDA version; and/or
- Allow users to keep a system-wide installation (e.g. for other programs that might require it) while still using a compatible version for your specific program’s needs…all without having to uninstall/reinstall it system-wide.
All of these libraries originate from CUDA release 12.4.1 DESPITE the different version numbers. This is just the way Nvidia does it. For example, they might not make any changes to “cublas,” but if other libraries receive significant changes they issue a new CUDA release - hence the version number for cublas might stay the same even though the CUDA release number goes up.
You can see all of the specific pip-installable library versions for every CUDA release in the .json files here.
2) pip install CuDNN
pip install nvidia-cudnn-cu12==9.1.0.70
CuDNN 9+ is massively forwards and backwards compatible. There are edge cases for very old versions I won’t go into. However, if you want to pip install a specific version you can go here, and also check out here to get the pip-installable version numbers.
HERE IS A LIST I COMPILED, BUT IT WON'T BE UPDATED IN THIS POST
This will prepend the relevant PATH/CUDA_PATH variables (but only while the program runs) to specify where you pip-installed all of the necessary CUDA/CuDNN/Triton libraries. This forces your program to FIRST look for the libraries where you pip-installed them but WILL NOT otherwise remove the other paths that other programs may rely on.
Remember…if your program creates a new process/subprocess you must either pass these environment variables or, alternatively, re-invoke the “set_cuda_paths” function to set them for the new process/subprocess.
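For illustration, a set_cuda_paths-style helper might look roughly like this. It is only a sketch: it assumes the nvidia-* wheels were pip-installed into a Windows virtual environment, and the list of sub-folders should be adjusted to whichever libraries you actually installed.

```python
import os
import sys
from pathlib import Path

def set_cuda_paths():
    # Root of the pip-installed NVIDIA wheels inside the active virtual environment
    # (sys.executable lives in <venv>\Scripts\ on Windows).
    venv_root = Path(sys.executable).parent.parent
    nvidia_base = venv_root / "Lib" / "site-packages" / "nvidia"

    # Folders containing the DLLs/binaries that Triton and CUDA-aware libraries load.
    # Adjust this list to match the nvidia-* packages you actually pip-installed.
    paths_to_add = [
        str(nvidia_base / "cuda_runtime" / "bin"),
        str(nvidia_base / "cublas" / "bin"),
        str(nvidia_base / "cudnn" / "bin"),
        str(nvidia_base / "cuda_nvrtc" / "bin"),
        str(nvidia_base / "cuda_nvcc" / "bin"),
    ]

    # Prepend rather than replace, so the pip-installed copies are found first
    # while any system-wide CUDA paths other programs rely on remain intact.
    for env_var in ("PATH", "CUDA_PATH"):
        current = os.environ.get(env_var, "")
        os.environ[env_var] = os.pathsep.join(paths_to_add + ([current] if current else []))

set_cuda_paths()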
5) Additional steps
Triton requires ptxas.exe and, assuming you set the paths within something like the set_cuda_paths function above, Triton will look for this important file in \Lib\site-packages\nvidia\cuda_runtime\bin. If it’s not found, it will give an error.
However, pip-installing nvidia-cuda-nvcc-cu12 puts ptxas.exe here instead: \Lib\site-packages\nvidia\cuda_nvcc\bin\ptxas.exe.
Therefore, I recommend manually copying this file to the cuda_runtime\bin directory as seen here:
Alternatively, you can do this automatically so a user doesn’t have to worry about it, as seen in this example:
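(The following is only a sketch assuming the default pip-install layout described above; the helper name ensure_ptxas is just for illustration.)

```python
import shutil
import sys
from pathlib import Path

def ensure_ptxas():
    nvidia_base = Path(sys.executable).parent.parent / "Lib" / "site-packages" / "nvidia"
    source = nvidia_base / "cuda_nvcc" / "bin" / "ptxas.exe"   # where nvidia-cuda-nvcc-cu12 places it
    target_dir = nvidia_base / "cuda_runtime" / "bin"          # where Triton expects to find it
    target = target_dir / "ptxas.exe"

    # Copy the file into place once, at startup, if it isn't already there.
    if source.exists() and not target.exists():
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)

ensure_ptxas()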
For some reason, when pip-installing these libraries a “lib” folder that Triton requires is missing. Triton looks for this folder at \Lib\site-packages\nvidia\cuda_runtime\
From release v3.1.0-windows.post8 onwards, “cuda_12.4_lib.zip” is provided. The contents of this .zip file merely need to be placed within the `\Lib\site-packages\nvidia\cuda_runtime` directory where you pip-installed the CUDA-related libraries.
Alternatively, if you encounter compatibility issues or simply want to download a different CUDA release version for the lib folder, I have provided a link near the bottom of this github issue to my repository that allows you to do this. Again, once downloaded, place the folder in the \Lib\site-packages\nvidia\cuda_runtime\ directory.
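If you’d rather handle this step in code as well, a sketch like the following works. It assumes the downloaded archive contains the lib folder at its top level; adjust if your copy is laid out differently.

```python
import sys
import zipfile
from pathlib import Path

def ensure_cuda_lib(zip_path="cuda_12.4_lib.zip"):
    cuda_runtime = Path(sys.executable).parent.parent / "Lib" / "site-packages" / "nvidia" / "cuda_runtime"
    if (cuda_runtime / "lib").exists():
        return  # the lib folder Triton needs is already in place
    # Extract the downloaded archive into the cuda_runtime directory.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(cuda_runtime)

ensure_cuda_lib()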
Chances are if you’re using Triton, CUDA, etc. you’re using other libraries as well. Below is additional information I’ve created to help people save time.
Intro
Too many users simply install the latest version of CUDA thinking “latest must mean greatest.” In reality, very few libraries are compatible with the latest CUDA; typically, libraries (even torch) are NOT. Novice users who install the latest CUDA think they’ve done things correctly when, in reality, they need an older version. The following tables try to clarify this for CUDA as well as other common libraries.
Torch
The torch library’s pre-built wheels (i.e. not compiling from source) are only tested with certain versions of CUDA. (Compiling from source allows for more permutations of compatibility…)
First, torch only provides pre-built wheels that have been specifically tested against particular CUDA releases (e.g. 12.6.x, 12.4.1, or 12.1.1) as follows:
And here are the most recent versions and their pip-installable counterparts:
| Torch Version | cuda-nvrtc | cuda-runtime | cublas | cufft | cudnn | triton |
|---|---|---|---|---|---|---|
| 2.6.0 (CUDA 12.6) | 12.6.77 | 12.6.77 | 12.6.4.1 | 11.3.0.4 | 9.5.1.17 | 3.2.0 |
| 2.6.0 (CUDA 12.4) | 12.4.127 | 12.4.127 | 12.4.5.8 | 11.2.1.3 | 9.1.0.70 | 3.2.0 |
| 2.5.1 (CUDA 12.4) | 12.4.127 | 12.4.127 | 12.4.5.8 | 11.2.1.3 | 9.1.0.70 | 3.1.0 |
| 2.5.1 (CUDA 12.1) | 12.1.105 | 12.1.105 | 12.1.3.1 | 11.0.2.54 | 9.1.0.70 | 3.1.0 |
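For example, to mirror the torch 2.5.1 (CUDA 12.4) row above, the pinned installs would look like this (the Triton wheel is installed separately from a Windows-wheel repository, as discussed below):

pip install nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cublas-cu12==12.4.5.8 nvidia-cufft-cu12==11.2.1.3 nvidia-cudnn-cu12==9.1.0.70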
NOT installing a version of CUDA that torch tests with may lead to errors, though not always; just be aware of this if you want to use a different version of CUDA in your program.
You can get the most up-to-date information by examining the generate_binary_build_matrix.py script in torch’s GitHub repository.
Xformers (for Windows only)
Xformers pre-built wheels are STRICTLY tied to a specific version of torch. You WILL encounter errors if you don’t install the correct version.
However, I noticed this repository also has 3.0.0 wheels.
Regardless of which ones you use, Triton 3.0.0 only supports up to Python 3.11.
Triton 3.1.0+
Use this repository’s.
Requires torch 2.4.0+
“The wheels are built against CUDA 12.5, and they should work with other CUDA 12.x.”
Supports Python 3.12
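A quick sanity check that your environment resolved to compatible versions (these are just the standard version attributes exposed by torch and Triton):

```python
import torch
import triton

print(torch.__version__)          # needs to be 2.4.0+ for Triton 3.1.0+
print(torch.version.cuda)         # CUDA release the torch wheel was built against
print(triton.__version__)
print(torch.cuda.is_available())  # confirms the GPU/driver is actually usable
```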
Flash Attention 2
First, you WILL NOT see Windows releases for FA2 for every Linux release…that’s just a decision by the repo’s owner. Here is the best repo I’ve found for Windows wheels:
Flash Attention 2 is very particular with both torch and CUDA (almost as bad as xformers). The full compatibility for the pre-built Windows wheels is as follows:
As you can see, if using FA2 there is no need to install any CUDA version above 12.4.1. Again, if you compile from source it’s different…but when using pre-built wheels you NEVER need anything above CUDA 12.4.1.
IMPORTANT: Newer versions of FA2 actually regressed…they no longer support CUDA 12.4.1. See here for more details:
Python 3.12.4 is incompatible with pydantic.v1 as of pydantic==2.7.3 (https://github.com/langchain-ai/langchain/issues/22692). Everything should now be fine as long as Langchain 0.3+ is used, which requires pydantic version 2+.
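If you want to enforce that explicitly, a simple version constraint is enough:

pip install "langchain>=0.3"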