Gemini API: Why do you need it?
TLDR
Anthropic has focused Claude's training on coding capabilities, so they haven't invested as much in vision abilities. This makes Claude models' image analysis quality subpar, which affects software development workflows that rely on screenshots.
Note: Gemini API will incur costs. If you feel you don't need it, you can completely skip this step!
Installation
- Get your Gemini API key at Google AI Studio
- Find the `.env.example` file:
  - If you installed AgencyOS at project scope: copy `.claude/.env.example` to `.claude/.env`
  - If you installed AgencyOS at global scope: copy `~/.claude/.env.example` to `~/.claude/.env` (for Windows users: `%USERPROFILE%\.claude\.env`)
- Open the `.env` file and fill in the value for `GEMINI_API_KEY` (see the example below)
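For reference, the finished file only needs one line; the value shown here is a placeholder, so substitute your own key from Google AI Studio:

```
# .claude/.env (or ~/.claude/.env for global installs)
GEMINI_API_KEY=your-gemini-api-key
```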
That's it!
The following section is adapted from my recent research on:
The Problem with Claude's "Eyes"
To make debugging easier, it's essential to provide screenshots so Claude Code can visualize the problem. I use this method very often.
But recently I discovered something: Claude's vision capability is quite poor, not as good as competitor models (Gemini, ChatGPT, …).
Look at this example, where Claude Desktop failed completely compared to Gemini and ChatGPT:

Claude couldn't correctly identify the actions and devices in the image.
Now let's compare AgencyOS CLI and Gemini CLI directly!
I'll ask both to read a blueprint image and describe in detail what they see:

Gemini CLI provides detailed descriptions of the blueprint, while AgencyOS CLI is quite superficial…
Do you see the difference?
Wait, there's one more thing Claude's "eyes" currently CANNOT do: VIDEO ANALYSIS.
But Gemini (the web version, not the CLI) can, which makes debugging in Vibe Coding much easier.

You can't always fully understand the situation, describe how to reproduce the bug, and figure out the direction for a fix on your own. Recording the screen and giving the video to Gemini (web version) so it can guess at root causes or suggest how to handle them is not a bad solution at all.
The only problem is that the Gemini web version doesn't have codebase context, so I have to include that information in the prompt, which is quite tedious…
So I decided to create this MCP: Human MCP
The purpose is to use the Gemini API to analyze images, documents (PDF, docx, xlsx, …), and videos.
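To make the idea concrete, here is a minimal sketch of that kind of analysis done against the Gemini API directly, using the google-generativeai Python package. The model name, file names, and prompts are illustrative assumptions, not Human MCP's actual implementation:

```python
import os
import time

import google.generativeai as genai
from PIL import Image

# Read the key from the environment (AgencyOS loads it from .claude/.env).
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

# Image analysis: pass a PIL image alongside the question.
screenshot = Image.open("bug-screenshot.png")
result = model.generate_content(
    ["Describe the UI problem visible in this screenshot.", screenshot]
)
print(result.text)

# Video analysis: upload the file first, then wait until it is processed.
video = genai.upload_file("screen-recording.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

result = model.generate_content(
    [video, "Guess the root cause of the bug shown in this recording."]
)
print(result.text)
```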
In the early days of AgencyOS, I had "Human MCP" pre-installed by default.
And you needed GEMINI_API_KEY in the "Human MCP" env for it to work.
Then, Anthropic launched Agent Skills!
Everyone knows MCP has its own problem: it consumes too much context.
Here's an example from Chrome DevTools MCP and Playwright MCP:

And Agent Skills was created to solve that problem.
So I've converted all Human MCP tools into Agent Skills, freeing up more space in the context window for agents to work.
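For context, a skill in Claude Code is a folder containing a SKILL.md file whose frontmatter tells the agent what the skill does and when to load it. The sketch below is a hypothetical, heavily simplified illustration of what an ai-multimodal skill definition could look like, not the actual file shipped with AgencyOS:

```markdown
---
name: ai-multimodal
description: Analyze images, documents (PDF, docx, xlsx), and videos with the
  Gemini API. Use when the user shares a screenshot, document, or recording.
---

# AI Multimodal

1. Read GEMINI_API_KEY from the .env file.
2. Send the file and the user's question to the Gemini API.
3. Return the model's analysis to the conversation.
```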
Therefore, GEMINI_API_KEY has been moved to `.claude/.env`; you just need to enter the value there.
Typically, skills are activated automatically depending on the context the agent is handling.
But if you need to activate this skill manually, just prompt like this: `use ai-multimodal to analyze this screenshot: ...`
It's that simple.