Self-hosting Moondream
Use a locally running or self-deployed Moondream server
If you want to run Moondream yourself, you can either:
- Run locally on the same machine where you’re running Magnitude tests (great for development on high-end machines)
- Self-host models with GPUs on a cloud platform like Modal
Running Moondream locally
To run Moondream on your local machine, first go to https://moondream.ai/c/moondream-server and download the appropriate server (currently Moondream only provides macOS/Linux executables).
Follow the instructions on that page to run the executable.
Then you can configure `magnitude.config.ts` with the local base URL:
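Here’s a minimal sketch of what that configuration might look like. The exact option names (`executor`, `provider`, `baseUrl`), the import, and the default local port (2020) are assumptions based on a typical Magnitude setup, so verify them against your installed version:

```typescript
// Sketch of magnitude.config.ts pointing Magnitude at a locally running Moondream server.
// Option names and the default port (2020) are assumptions; check your Magnitude version's
// docs and the port your Moondream server prints on startup.
import { type MagnitudeConfig } from 'magnitude-test';

export default {
    // Other required options (e.g. your planner LLM) omitted for brevity.
    url: 'http://localhost:5173', // the app under test
    executor: {
        provider: 'moondream',
        options: {
            baseUrl: 'http://localhost:2020/v1' // local Moondream server
        }
    }
} satisfies MagnitudeConfig;
```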
That’s it! Now you can run tests with local Moondream.
Self-hosting Moondream on Modal
Probably the easiest way to self-host Moondream is on Modal, so we provide a deployment script to do this.
Set up Modal
- Create an account at modal.com
- Run `pip install modal` to install the `modal` Python package
- Run `modal setup` to authenticate (if this doesn’t work, try `python -m modal setup`)
More info: https://modal.com/docs/guide
Deploy Moondream
Clone the Magnitude repo and run the `moondream.py` deploy script (typically via `modal deploy`) to automatically download the Moondream server and model weights to a Modal volume and deploy the endpoint.
Your Moondream endpoint will then be available at `https://<your-modal-username>--moondream.modal.run/v1`.
Then you can configure `magnitude.config.ts` with this base URL and start running tests powered by your newly deployed Moondream server!
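As a sketch, assuming the same (unconfirmed) option names as the local example above, only the base URL changes:

```typescript
// Sketch: same assumed config shape as the local example, but pointing at the
// Modal deployment. Replace <your-modal-username> with your Modal username.
import { type MagnitudeConfig } from 'magnitude-test';

export default {
    url: 'http://localhost:5173', // the app under test
    executor: {
        provider: 'moondream',
        options: {
            baseUrl: 'https://<your-modal-username>--moondream.modal.run/v1'
        }
    }
} satisfies MagnitudeConfig;
```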
Keep in mind there may be some cold-start delay, since Modal automatically scales containers down when not in use.
Customizing Deployment
You may want to customize the Modal deployment depending on your needs; to do so, modify the `moondream.py` deployment script.
Common options you may want to change:
- `gpu`: GPU configuration. See Modal’s pricing page for details on the available GPUs and their cost, and see the comparison below.
- `scaledown_window`: Time in seconds a container waits after its last request before shutting down. A higher value lets you come back and run tests after a longer gap without the container needing to cold-start again.
- `min_containers`: Forces Modal to keep a set number of containers warm to handle requests. Modal will bill you for those containers the whole time, but this eliminates cold starts.
GPU Comparison
Here’s a breakdown of how quickly different GPUs available on Modal are able to handle typical requests that Magnitude makes to Moondream:
| GPU | Approximate inference time per action | Modal cost per hour |
|---|---|---|
| H100 | ~200ms | $3.95 |
| A100 (40GB) | ~300ms | $2.10 |
| A10G | ~500ms | $1.10 |
| T4 | ~800ms | $0.59 |
Since Magnitude needs to wait a bit for the page to stabilize anyway, something like the A10G is probably a good price/performance balance, but any of these work well!
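As a rough worked example from the table above (assuming a single container and ignoring cold starts and idle scaledown time): keeping one A10G up for an hour of test runs costs about $1.10, while an H100 costs about $3.95 for the same hour and only shaves roughly 300ms off each action.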