About HubSync
HubSync is building the AI platform for the tax and accounting profession. Our product, Halo, runs the full engagement lifecycle for CPA firms (engage, gather, prep, deliver) and layers an AI assistant, retrieval, and agentic automation across all of it. We work with some of the largest accounting firms in North America, in a domain where the output has to be right: a misclassified form or a missed deadline is not a bug; it is a client's tax return.
We are building non-deterministic, agentic systems for professionals who cannot accept errors, at the volume and time pressure of tax season. Getting that right is a genuinely hard systems problem, and it is the problem this role owns.
The Role
As a Staff AI Engineer, you are a technical leader for the Halo platform. You set architectural direction for how we build agentic systems, define the patterns and guardrails the rest of the AI engineering team builds on, and are accountable for the reliability and trustworthiness of what we ship, not just the code you personally write.
This is a hands-on staff role. You will still design and build the hardest parts of the system yourself. But your leverage comes from the decisions that shape everyone's work: the agent-orchestration and state model, the evaluation and observability harness, the multi-tenant isolation and data-integrity guarantees, and the build-versus-buy calls on core platform components. You will move fluidly across backend, data, infrastructure, and product, and you will raise the engineering bar of the people and vendor teams around you.
You will partner directly with product, AI, and firm-facing teams, and your technical decisions will connect straight to customer outcomes and hard commercial deadlines.
What You Will Own
These are the areas where you will set the technical direction, define the standards others build against, and be accountable for the result.
Agentic workflow orchestration. The architecture for multi-agent coordination across tax document workflows with human-in-the-loop oversight: agent state machines, tool routing, context windowing, and retry semantics for processes that run for minutes or hours. You define the orchestration patterns the whole team reuses, rather than solving one workflow at a time.
Workflow state management. State hydration for long-running agentic workflows, failure handling, checkpoint and resume, and recovery across distributed services. You own the durable run-record and state layer that makes long-running agents auditable and resumable, and the contract every agent builds on top of.
Document intelligence at scale. Production-grade pipelines that extract, classify, and validate tax forms and financial documents across dozens of formats and quality levels. You set the architecture for accuracy, coverage, and the validation layer that turns raw extraction into output a firm can trust.
Evaluation and observability. The measurement backbone: task completion rates, accuracy attribution, cost tracking per action, and regression detection, with outcomes attributable to specific agent reasoning steps when something goes wrong. You stand up the eval and observability discipline as a platform capability, so that "is this good enough to ship" is a number, not an opinion.
Trust and reliability. Making non-deterministic agent output trustworthy for professionals who cannot accept errors: supervision layers, validation rules, and human review gates, plus the provenance and audit trail that lets a firm defend its own work. This is a first-class architectural mandate, and you own the standard for it.
Cost, accuracy, and latency. Optimizing the trade-offs across document types, complexity levels, and client tiers during peak tax-season volume, and setting the framework the team uses to reason about them rather than tuning case by case.
What We Are Looking For
You have built and shipped production systems, and then kept them running. You have taken features from requirements through implementation, testing, deployment, and monitoring, and you have handled the incidents that follow. You can point to enterprise-grade products that users rely on today, and you can architect solutions end to end when correctness and uptime matter.
At Staff level, you do more than deliver your own work. You have set technical direction on complex distributed systems and been right often enough that teams follow your lead. You define patterns, standards, and interfaces that make other engineers faster and safer, and you have the judgment to make build-versus-buy calls on core infrastructure and defend them. You reduce key-person risk instead of becoming it: you mentor engineers, raise the bar in design and code review, and leave systems more legible than you found them.
You understand concurrency, failure modes, data integrity, and why things break at scale. You reason clearly about non-determinism: when perfect answers are not available, the job is to bound and manage the uncertainty. The boundaries between backend, data, infrastructure, and product are not rigid here, and you move between them based on what the problem demands. You communicate well across product, design, AI, and engineering, and you connect technical decisions to customer impact and long-term business value.
Must Have
8+ years building and shipping backend systems in production environments where uptime and correctness matter, including several years operating them.
A track record of leading the design of complex, distributed, or high-scale systems from architecture through deployment and ongoing operations, with enterprise-grade features that users depend on today.
Hands-on production experience with agent orchestration frameworks (LangGraph or equivalent) and long-running, stateful, multi-step agentic workflows. You will set the orchestration and state patterns the team builds on, so you need to have shipped this class of system, not just read about it.
Demonstrated technical leadership beyond your own commits: setting patterns and standards, driving cross-team or cross-functional initiatives, mentoring engineers, and influencing decisions across an organization.
Deep experience with relational databases (PostgreSQL or equivalent): schema design, query optimization, data modeling, and migrations.
Hands-on work with event-driven architectures: message queues, async processing, and distributed job execution.
Production experience with AWS (Lambda, SQS, S3, ECS) or an equivalent cloud platform.
Comfort reading and writing both TypeScript and Python, or clear evidence you pick up a second language fast.
Experience across the full software delivery lifecycle: design, implementation, testing, deployment, monitoring, and incident response.
Sound judgment on build versus buy, and the ability to make and defend architectural trade-offs under real time and cost constraints.
Good to Have
Depth in multi-agent coordination specifically: designing supervision, routing, and hand-off between multiple cooperating agents.
Familiarity with RAG architectures, vector databases, or document processing pipelines.
Experience with multi-tenant SaaS architecture: schema isolation, tenant-scoped data, and access control.
Background in document intelligence: OCR, structured extraction from PDFs, and form understanding.
Experience standing up evaluation, observability, or quality systems for ML or LLM products (offline and online eval, regression detection, cost and accuracy attribution).
Work in a regulated or high-trust domain (tax, finance, legal, healthcare) where output correctness is non-negotiable.
Open-source contributions, technical writing, or other public evidence of engineering depth.
Technologies and Frameworks
You do not need every item here on day one, but this is the environment you will work in and help shape.
Languages and runtimes
TypeScript and Python 3.12
Node.js 20 (Fastify and Express services)
React 18 with module federation for microfrontend architecture
AI and agent infrastructure
Amazon Bedrock and AgentCore (Claude, Titan embeddings, cross-encoder reranking)
LangGraph for agent orchestration and state-machine management
LangChain for tool chaining and model integration
MCP (Model Context Protocol) for dynamic tool generation from OpenAPI specs
Data and infrastructure
PostgreSQL (Aurora), pgvector
AWS (Lambda, SQS, S3, ECS)
Event-driven, distributed job execution
We are an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.