Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

Open
PawelPeczek-Roboflow opened this issue Sep 30, 2024 · 23 comments
Open

Hacktoberfest 2024 | Llama 3.2 Vision 🤝 Workflows #694

PawelPeczek-Roboflow opened this issue Sep 30, 2024 · 23 comments

Comments

@PawelPeczek-Roboflow
Copy link
Collaborator

Llama 3.2 Vision in Workflows

Are you ready to make a difference this Hacktoberfest? We’re excited to invite you to contribute by integrating LLama 3.2 Vision into our Workflows ecosystem! This new block for image generation will be a fantastic addition, broadening the horizons of what our platform can achieve.

Join us in enhancing our capabilities and empowering users to harness the power of vision technology. Whether you're a seasoned developer or just starting your journey in open source, your contributions will play a vital role in shaping the future of our ecosystem. Let’s collaborate and bring this innovative functionality to life!

Task description

  • The task is to integrate the new Llama 3.2 Vision into workflows
  • We haven't discover the model capabilities yet - that is also the part of the task 🥳
  • We prefer light integration to REST API through requests library - we've found that OpenRouter provides REST API access (see this) - but if you find a better option - feel free to discuss
  • We imagine the model can be implemented in similar way as other VLMs:
  • please raise any issues with the task in the discussion below

Cheatsheet

@AHB102
Copy link

AHB102 commented Nov 14, 2024

@PawelPeczek-Roboflow Can I have a go at it ? And can you tell me what is expected, are we talking about complete integration end to end or breaking down this issue into sub issues which can be tackled.

@PawelPeczek-Roboflow
Copy link
Collaborator Author

Hi @AHB102, thanks for engaging into the issue.

Sure, you can pick up the task - so the point is we would like to:
a) find a suitable hosted version of Llama vision model such that we can integrate via making HTTP requests
b) once this is agreed - we need to create Workflow blocks similar to https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py wrapping up the model prompting for various tasks - that would require a little bit of exploration of model capabilities

@PawelPeczek-Roboflow
Copy link
Collaborator Author

First step would definitely be agreeing on API that host llama
Options I see:

but was not investigating all of the options, which would be good to do.

I would try to find cheap and reliable third party

@AHB102
Copy link

AHB102 commented Nov 14, 2024

@PawelPeczek-Roboflow I looked into hosted Llama 3.2 Vision APIs and found a few options: Together.ai (https://api.together.xyz/models) , Google Vertex AI (https://cloud.google.com/blog/products/ai-machine-learning/llama-3-2-metas-new-generation-models-vertex-ai) , Azure (https://techcommunity.microsoft.com/blog/machinelearningblog/meta%E2%80%99s-new-llama-3-2-slms-and-image-reasoning-models-now-available-on-azure-ai-m/4255167) and AWS Bedrock(https://aws.amazon.com/blogs/machine-learning/vision-use-cases-with-llama-3-2-11b-and-90b-models-from-meta/).Except Together.ai all of the other options have massive scale , it would be reliable and cheap. Hugging face also has a offering for inferencing. I checked out OpenRouter's API limits. The Llama 3.2 11B model is currently free, and the usage rates are pretty good. I think 20 requests per minute should be plenty for most things Any thoughts ?

@PawelPeczek-Roboflow
Copy link
Collaborator Author

PawelPeczek-Roboflow commented Nov 14, 2024

I do not have particular bias towards any of the vendor - I even see that the decision which is most handy for people to use is strictly related to individual preferences of the consumer.
I believe AWS / Google / MS would be the "stable" choice, whereas I expect other third parties to be more attractive cost-wise.
One thing to keep in mind is also how easy it is to integrate - I remember that at least part of google API clients are bulky and require setting API key at process-level, not for individual invocation (which is 🔴 flag for multi-tennant deployments which we do with workflows).

I see the construction of the block in the following way:

  • we support parameters required to deal with model
  • and on the "orthogonal" axis - we do support 2 parameters - api_key and provider - which will choose backend to use. This approach do also have cons, but at least we do not need multiple blocks to handle the same model from different providers
  • we could start easy with one provider, ensuring extensibility for the future
    wdyt?

@AHB102
Copy link

AHB102 commented Nov 14, 2024

That sounds great! This approach provides a solid foundation for future scalability and flexibility. By not committing to a single vendor upfront, we can adapt to evolving needs and avoid potential vendor lock-in.

To start, I suggest we explore OpenRouter. It offers free API usage for Llama 2.3 11B, making it ideal for initial testing and development. Additionally, its compatibility with familiar libraries like requests and openai can streamline the integration process and minimize security risks.

Once we have a robust core structure in place, we can easily pivot to other providers. wdyt ?

@PawelPeczek-Roboflow
Copy link
Collaborator Author

yeah, that sounds right

@AHB102
Copy link

AHB102 commented Nov 15, 2024

I've been diving into the Workflow Block (https://github.com/roboflow/inference/blob/main/inference/core/workflows/core_steps/models/foundation/openai/v2.py) and feel comfortable with the workings of the OpenAIBlockV2 class. I'm about to start writing code. Any advice for getting the most out of it? When modifying the inference core, I understand that I need to include test cases, right?

@PawelPeczek-Roboflow
Copy link
Collaborator Author

yeah, tests are recommended.

Here is our block creation guide: https://inference.roboflow.com/workflows/create_workflow_block/

@PawelPeczek-Roboflow
Copy link
Collaborator Author

you can find information how to run development smoothly

to test remote apis - we usually create unit-tests agains mocks - and place some integration test skipped if API key not provided

@AHB102
Copy link

AHB102 commented Nov 15, 2024

@PawelPeczek-Roboflow Let's get v1.py for Llama working, I'll be focusing on getting it functional before tackling test cases. I'll definitely ask for your input and help along the way, and I'll keep you updated on how it's going. Thanks for the docs 😁

@PawelPeczek-Roboflow
Copy link
Collaborator Author

hi there :) anything I can help you with?

@AHB102
Copy link

AHB102 commented Nov 22, 2024

@PawelPeczek-Roboflow Hi, sorry for the late reply.
So, I’ve been diving into the workflow block documentation. It was a lot to take in, but I’ve managed to work through it. I’ve also been experimenting with the OpenRouter Llama 3.2 vision model locally.
I’ve started building the first version of the block. I defined a BlockManifest class for Llama 3.2 vision, referencing the workflow block docs to understand the mechanics and looking at the OpenAI and Claude VLM implementations to see how it’s done in practice.
I had a couple of questions based on my observations:

  1. Claude seems to have a specific resolution it scales input images to, but I couldn’t find anything similar in the Llama documentation.
  2. OpenAI has image resolution settings (low, auto, high), but again, Llama doesn’t seem to support this.
  3. Both implementations have a response limit of around 450 tokens. However, Llama’s token limits vary significantly based on the task:
    Object detection: 10-100 objects with 10-50 attributes each (100-5000 tokens)
    Image classification: 1-10 class labels with 10-50 attributes each (10-500 tokens)

@PawelPeczek-Roboflow
Copy link
Collaborator Author

Do not worry to much if the API for all VLM blocks cannot be identical, we strive for similar experience regarding blocks integration, not 100% the same config parameters.

I do not see the list of all params that open-router APIs support, they name it recommended, and use openai client
in Python - so I expect that majority of params work as in openai client, it's just not reported - worth verifyng the ones that can limit the clients spendings on API calls (if paid version in use) - mainly the token limits

image

@AHB102
Copy link

AHB102 commented Nov 22, 2024

For now I'll dive into the details of max_tokens and top_p to keep our responses concise and cost-effective. We can also explore other tricks like choosing the right model and batching requests. I will update you once I have something, and concurrently keep working on the manifest block. 😁

@PawelPeczek-Roboflow
Copy link
Collaborator Author

👍

@AHB102
Copy link

AHB102 commented Nov 28, 2024

@PawelPeczek-Roboflow I experimented with the tiktoken library to determine the maximum token count and top-p values used by OpenRouter Llama 3.2 Vision. For tasks like object detection, image captioning, and OCR, I found that the maximum number of tokens rarely exceeds 200. I also tested different top-p values, ranging from 0.7 to the default of 1.0, and observed a decrease in the number of tokens required as the top-p value increased.
Llama 3.2 Vision is able to perform all the tasks that OpenAI models can, I don’t think there will be change in terms of prompt functions like prepare_caption_prompt() , right ?

@PawelPeczek-Roboflow
Copy link
Collaborator Author

PawelPeczek-Roboflow commented Nov 29, 2024

I guess so - in this case (contrary to popular approach) I suggest just copy-pasting the functions into your block module. We do not follow DRY rule for blocks, in practice it's easier to manage changes for each blocks separately

@AHB102
Copy link

AHB102 commented Dec 6, 2024

@PawelPeczek-Roboflow Hi 🖐️, I’ve nearly completed the workflow block, which now consists of approximately 600 lines of code. To ensure thorough understanding, I’ve been analyzing the function of each individual component, which has slowed progress. I’ve constructed the block by referencing Anthropic and OpenAI workflow implementations. However, I haven’t tested anything it yet. What should be the next steps toward testing and integration?

PS: I haven’t worked on a codebase of this size before, so I’ve learned a lot in the process. 😅

@PawelPeczek-Roboflow
Copy link
Collaborator Author

cool - submit the pr even if not 100% ready, we will figure out the way forward
I do really appreciate the effort

@AHB102
Copy link

AHB102 commented Dec 7, 2024

@PawelPeczek-Roboflow Nice !, I will make a PR 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants