Peach Pilot — Sr. QA Engineer (AI Systems & Platform) | Remote — Latin America


Technical Tools

anthropic · azure · docker · excel · fastapi · gcp · github-actions · javascript · langchain · nextjs · openai · postgresql · python · react · redis · typescript · ci-cd

Remote — Latin America  |  Contract  |  US Eastern Timezone Overlap Required (5+ hours daily)

Most AI companies sell tools. We transform how businesses run.

Peach Pilot builds a platform that ingests everything about how a company operates — every system, every process, every signal — and constructs a Company Brain: a living knowledge graph that connects people, decisions, and outcomes across the entire organization. We deploy 92 pre-built AI agents that work together across every business function, governed by humans at every critical step. The system gets smarter with every interaction.

We don’t sell software licenses. We embed into a client’s operation, learn their business in weeks, show them what’s broken backed by their own data, and redesign their highest-impact business functions with AI. Our first vertical is insurance. Our first client engagement is already scoped and funded.

Peach Pilot is a funded early-stage AI startup headquartered in Atlanta, Georgia, with a working platform on live infrastructure and a proven data-to-insights methodology.

95% of enterprise AI pilots fail — not because the technology is broken, but because users don’t trust it. At Peach Pilot, trust is the product. Every feature we ship must work exactly as the user expects, every time. One broken interaction at the wrong moment can undo months of adoption.

This is a hands-on contractor role on a small, collaborative team. You will partner closely with our US-based QA Lead and the engineering team to help build out our quality practice as the platform moves from early-stage development into enterprise deployment: writing test code, running evals, reproducing bugs, and strengthening the test suites that keep the product trustworthy.

You are a valued part of the team, and your testing, iteration, and feedback will directly shape what ships. Every test you write, every eval you tune, and every edge case you catch makes the platform better for the clients who depend on it.

This is a fully remote contract role based in Latin America. You must be available during US Eastern business hours with a minimum of 5 hours of daily overlap.

We are building agentic-first software. AI agents are not a feature we are adding; they are the foundation we are building on from day one. Traditional QA assumes deterministic outputs. Agentic systems don't give you that. You will be testing in an environment where:

  • Multi-model routing through LiteLLM (Anthropic Claude, OpenAI GPT, and additional providers) means the same input can produce different outputs depending on which model handled it.
  • 92 pre-built AI agents operate across every business function, governed by humans at every critical step; any drift between agent execution and governance is a critical failure.
  • The Company Brain (a living knowledge graph powered by Memgraph, Neo4j, and Qdrant) must return accurate, traceable answers against messy, real-world enterprise data.
  • The Nango-based integration layer ingests from 700+ connectors — CRM, email, calls, calendars, documents, chat, financial systems — and the file pipeline (Word, Excel, PowerPoint, PDF) must survive edge cases enterprise clients will find within the first week.
  • Our users are CEOs and operations leaders who have never touched a terminal. A confusing error state isn’t a minor bug — it kills adoption.
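Because the same input can produce different outputs depending on which model served it, exact-match assertions break down. One common pattern, sketched below with a stand-in for the real model call (all names here are hypothetical, not Peach Pilot's actual API), is to assert invariants every answer must satisfy: well-formed structure, non-empty content, and a traceable source, rather than exact strings.

```python
# Sketch: invariant-based assertions for non-deterministic LLM output.
# `answer_question` is a hypothetical stand-in for a routed model call;
# its wording varies per run to simulate model variance.
import json
import random

def answer_question(question: str) -> str:
    """Stand-in for a routed LLM call: returns a JSON answer whose
    phrasing differs from run to run."""
    phrasing = random.choice(["approved", "approved per policy"])
    return json.dumps({"answer": phrasing, "sources": ["policy_doc_14"]})

def check_invariants(raw: str) -> dict:
    """Invariants that must hold no matter which model answered:
    valid JSON, a non-empty answer, and at least one cited source."""
    data = json.loads(raw)                       # 1. well-formed JSON
    assert data.get("answer"), "answer must be non-empty"
    assert data.get("sources"), "answer must cite at least one source"
    return data

# Run the same question several times; the wording may drift,
# but the invariants may not.
for _ in range(5):
    result = check_invariants(answer_question("Is claim 88 covered?"))
    assert "approved" in result["answer"]
```

The point of the pattern: tests stay green across model swaps and prompt tweaks as long as the contract holds, and fail loudly the moment structure or traceability breaks.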

Responsibilities

  • Pair with full-stack and backend engineers on the features they are shipping — understand what they built, write tests that prove it works, and flag gaps early.
  • Reproduce and triage bugs with enough detail that an engineer can fix them without a round-trip.
  • Contribute to and help evolve our automated test suites (unit, integration, end-to-end) alongside the QA Lead.
  • Help build and run evaluation pipelines for non-deterministic LLM outputs, prompt regression, model drift detection, and output quality scoring across the LiteLLM routing layer.
  • Build and run automated tests for the agent orchestration layer, covering governance audit trail integrity, human-in-the-loop override behavior, and cross-agent handoffs.
  • Test retrieval quality and failure modes against the Company Brain (Memgraph, Neo4j, Qdrant, PostgreSQL) using real enterprise data scenarios.
  • Test the Nango-based integration layer across connectors and the file ingestion pipeline (Word, Excel, PowerPoint, PDF) including encryption, formatting edge cases, and audit trail continuity.
  • Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow.
  • Verify multi-model routing logic so cost-optimized task allocation behaves correctly across LLM providers, and outputs remain faithful regardless of which model served the request.
  • Test the trust-layer UX (onboarding flows, progressive disclosure, uncertainty states, agent activity surfacing, and human-in-the-loop governance interfaces) and help shape the standards as we go.
  • Flag anything that would confuse a non-technical enterprise user. If a CEO would be confused by it, it doesn’t ship.
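The eval-pipeline work described above can be illustrated in miniature. This sketch uses hypothetical eval cases and a deliberately simple keyword-coverage score; a real pipeline (e.g. one built on an eval framework) would score semantics rather than substrings, but the regression-gating idea is the same.

```python
# Sketch of a minimal prompt-regression check (hypothetical data and
# scoring). Each case pairs a prompt with terms a correct answer must
# mention; a score drop between model versions flags a regression.

EVAL_CASES = [
    {"prompt": "Summarize claim 1432", "required": ["denied", "water damage"]},
    {"prompt": "Who owns renewal for Acme?", "required": ["ops lead"]},
]

def score(output: str, required: list[str]) -> float:
    """Fraction of required terms the output actually mentions."""
    hits = sum(term.lower() in output.lower() for term in required)
    return hits / len(required)

def run_eval(generate) -> float:
    """Average score of a model (a `generate` callable) over all cases."""
    total = sum(score(generate(c["prompt"]), c["required"]) for c in EVAL_CASES)
    return total / len(EVAL_CASES)

# Fake "models" for illustration: the old one covers every required
# term, the new one misses "water damage" on the first case.
old_model = lambda p: "Claim denied: water damage exclusion. Ops lead owns renewal."
new_model = lambda p: "Claim denied. Ops lead owns renewal."

baseline, candidate = run_eval(old_model), run_eval(new_model)
# Gate: a candidate model may not drop more than 0.25 below baseline.
assert candidate >= baseline - 0.25, f"regression: {baseline:.2f} -> {candidate:.2f}"
```

In CI, a gate like the final assertion turns model or prompt updates into reviewable, blockable events instead of silent quality drift.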
Requirements

  • 5+ years of QA engineering experience, with meaningful time spent writing test code (not just managing test cases).
  • Hands-on experience testing LLM-powered applications: you understand prompt sensitivity, output variance, and how eval pipelines catch regressions across model updates.
  • You write test code. Python is your primary tool.
  • Experience contributing to CI/CD-integrated test suites.
  • Comfortable testing complex API chains, async/streaming responses, and multi-service workflows.
  • Collaborative and self-directed: you work well as part of a team, pair well with engineers, and move work forward without hand-holding.
  • Strong English communication skills, written and verbal.
  • Available during US Eastern business hours with a minimum of 5 hours of daily overlap.
Nice to Have

  • Experience with LLM evaluation frameworks such as LangSmith, PromptFlow, or custom eval pipelines.
  • Experience testing agent frameworks (LangChain, CrewAI, or similar) and agent orchestration systems.
  • Experience testing graph databases (Memgraph, Neo4j) or vector stores (Qdrant).
  • Background in enterprise software or regulated industries where audit trail integrity is non-negotiable.
  • Insurance industry background is a plus — it is our first vertical.
Our Stack

  • AI/LLM: Anthropic Claude, OpenAI GPT, LiteLLM (multi-model routing)
  • Frontend: React/Next.js, TypeScript, Tailwind CSS
  • Backend: Python (FastAPI), Node.js/TypeScript
  • Data & Graph: Memgraph, Neo4j, Qdrant, PostgreSQL, Redis
  • Integrations: Nango (700+ connectors)
  • Infrastructure: Google Cloud Platform (Cloud Run, GCE, Firebase) · Azure (Cosmos DB, AI Search) · GitHub Actions CI/CD · Docker
  • Visualization: Plotly, D3, Recharts, Mermaid
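The graceful-degradation behavior called out in the responsibilities (a model being unavailable or slow must never surface a raw error) can be exercised with a test along these lines. The `route` function and provider stubs below are hypothetical stand-ins for the LiteLLM routing layer, not its real API.

```python
# Sketch of a graceful-degradation test for multi-model routing
# (all names hypothetical). The router should fall back to the next
# provider when one fails, and end with a friendly message rather
# than a stack trace if every provider is down.

class ProviderDown(Exception):
    pass

def route(prompt: str, providers: list) -> str:
    """Try providers in priority order; return the first success."""
    for call in providers:
        try:
            return call(prompt)
        except ProviderDown:
            continue                  # degrade to the next provider
    return "Service busy - your request was saved and will be retried."

def flaky_primary(prompt):
    """Simulates an unavailable or timed-out model."""
    raise ProviderDown("primary timed out")

def healthy_fallback(prompt):
    return f"fallback answer: {prompt}"

# Fallback path: the user still gets an answer.
assert route("hello", [flaky_primary, healthy_fallback]).startswith("fallback")
# Total-outage path: the user gets a plain-language degradation message.
assert "retried" in route("hello", [flaky_primary])
```

For users who have never touched a terminal, the second assertion is the one that matters: total outage must read as a calm status, never as an error dump.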

You are joining a funded early-stage AI startup with a working platform on live infrastructure and a first client engagement already in motion. You will have access to production data, live workflows, and real compliance requirements from day one — the kind of environment where your testing work has visible, immediate impact on what clients see.

What We Offer


Competitive contractor rate commensurate with experience. Paid monthly via Deel in USD.

To Apply

Tell us about a quality failure — one you caught before it shipped, or one that got through. What did you do about it, and what did you change in how you worked afterward?


Listing Details

First seen: April 13, 2026 · Last seen: May 5, 2026

