145 points by tosh on 2024-05-14 | 15 comments

Automated Summary

PaliGemma is an open-source vision-language model (VLM) inspired by PaLI-3, utilizing the SigLIP vision model and Gemma language model. It processes both images and text, providing detailed analysis such as image captioning, object detection, and text extraction. Two model sets are available: PaliGemma for general tasks and PaliGemma-FT for research. Most models require fine-tuning, except for paligemma-3b-mix. Key benefits include understanding both images and text, adaptability for various vision-language tasks, and a fine-tuned checkpoint for immediate research use.


mmastrac on 2024-05-14

This is an impressive amount of public AI work coming out of Google. The competition we're seeing here is really pushing things forward.

curl-up on 2024-05-14

Anyone here have experience with extracting image embeddings out of these models? All the image emb. models I tried so far were quite bad for my use cases, and I feel that hidden representations of models like these might be much better.
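One route worth trying (a sketch, not a tested recipe: the `SiglipVisionModel` class and the `google/siglip-so400m-patch14-384` checkpoint name are assumptions on my part) is to pull embeddings straight from the SigLIP vision tower that PaliGemma is built on, rather than from the full VLM:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Compare two embedding vectors by the cosine of their angle."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def siglip_embedding(image):
    # Heavy imports kept inside the function so cosine_similarity above
    # stays dependency-free.
    import torch
    from transformers import SiglipImageProcessor, SiglipVisionModel

    ckpt = "google/siglip-so400m-patch14-384"  # assumed checkpoint name
    processor = SiglipImageProcessor.from_pretrained(ckpt)
    model = SiglipVisionModel.from_pretrained(ckpt).eval()
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt")
        out = model(**inputs)
    # pooler_output is one vector per image; out.last_hidden_state holds
    # per-patch vectors if you want finer-grained features.
    return out.pooler_output[0].numpy()
```

No idea if these embeddings beat CLIP for your use case, but they're cheap to try since the tower is only ~400M parameters.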

jerpint on 2024-05-15

Have you tried CLIP image embeddings?
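For reference, the usual CLIP baseline (a minimal sketch assuming the `openai/clip-vit-base-patch32` checkpoint and the `transformers` CLIP classes) looks like:

```python
import numpy as np

def l2_normalize(v):
    """CLIP embeddings are typically L2-normalized before dot/cosine comparisons."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def clip_image_embedding(image):
    # Heavy imports inside the function so l2_normalize stays dependency-free.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    ckpt = "openai/clip-vit-base-patch32"  # assumed checkpoint name
    model = CLIPModel.from_pretrained(ckpt).eval()
    processor = CLIPProcessor.from_pretrained(ckpt)
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt")
        features = model.get_image_features(**inputs)
    return l2_normalize(features[0].numpy())
```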

curl-up on 2024-05-15

Yes, that's what I am mainly trying to replace, as the performance is just not there for my needs.

histories on 2024-05-15

Just from the name my mind raced to LLMs trained on the Pali canon

echelon_musk on 2024-05-15

I had the same assumption!

airbreather on 2024-05-15

It refers to images, but would that extend to diagrams, like engineering drawings?

tosh on 2024-05-14

How does this model compare to the 3B Gemma if I were to use it only with text?

coder543 on 2024-05-14

Well, to start with, there is no regular 3B Gemma. There are 2B and 7B Gemma models. I would guess this model is adding an extra 1B parameters to the 2B model to handle visual understanding.

The 2B model is not very smart to begin with, so… I would expect this one to not be very smart either if you only use it for text, but I wouldn’t expect it to be much worse. It could potentially be useful/interesting for simple visual understanding prompts.

simonw on 2024-05-15

Anyone found a good recipe to run this on a Mac yet?

adefa on 2024-05-15

You can run it by installing transformers from source:
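Presumably along the lines of the standard source install of `transformers` (the exact command is an assumption; the original snippet did not survive):

```shell
# Install transformers from the main branch, which carried PaliGemma
# support before it reached a PyPI release (assumed sufficient at the time):
pip install git+https://github.com/huggingface/transformers.git
```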

simonw on 2024-05-15

Have you seen that work on a Mac? I've had very bad luck getting anything complex to work with transformers on that platform.

adefa on 2024-05-15

Yes, I was able to run inference on the unquantized model in CPU land on Apple Silicon.
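For anyone following along, a minimal CPU inference sketch (the `google/paligemma-3b-mix-224` checkpoint name, the `caption en` task prefix, and the `PaliGemmaForConditionalGeneration` class in a source build of `transformers` are all assumptions, not a verified recipe):

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """The decoded output echoes the prompt; drop it to keep only the answer."""
    return decoded[len(prompt):].lstrip() if decoded.startswith(prompt) else decoded

def caption(image_path: str) -> str:
    # Heavy imports inside the function so strip_prompt stays dependency-free.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float32  # unquantized, CPU-only, as above
    ).eval()

    prompt = "caption en"  # the mix checkpoint expects short task prefixes
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30)
    return strip_prompt(processor.decode(out[0], skip_special_tokens=True), prompt)
```

Expect it to be slow on CPU, but it avoids the MPS-backend edge cases that tend to bite on Macs.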

Alifatisk on 2024-05-15

Is this related to Project Astra?

m3kw9 on 2024-05-14

Google markets their new tech like arXiv articles. They have lots to learn from OpenAI.