Staff AI Engineer

Remote, Full Time, Experienced

About HubSync

HubSync is building the AI platform for the tax and accounting profession. Our product, Halo, runs the full engagement lifecycle for CPA firms (engage, gather, prep, deliver) and layers an AI assistant, retrieval, and agentic automation across all of it. We work with some of the largest accounting firms in North America, in a domain where the output has to be right: a misclassified form or a missed deadline is not a bug, it is a client's tax return.

We are building non-deterministic, agentic systems for professionals who cannot accept errors, at the volume and time pressure of tax season. Getting that right is a genuinely hard systems problem, and it is the problem this role owns.

The Role

As a Staff AI Engineer, you are a technical leader for the Halo platform. You set architectural direction for how we build agentic systems, define the patterns and guardrails the rest of the AI engineering team builds on, and are accountable for the reliability and trustworthiness of what we ship, not just the code you personally write.

This is a hands-on staff role. You will still design and build the hardest parts of the system yourself. But your leverage comes from the decisions that shape everyone's work: the agent-orchestration and state model, the evaluation and observability harness, the multi-tenant isolation and data-integrity guarantees, and the build-versus-buy calls on core platform components. You will move fluidly across backend, data, infrastructure, and product, and you will raise the engineering bar of the people and vendor teams around you.

You will partner directly with product, AI, and firm-facing teams, and your technical decisions will connect straight to customer outcomes and hard commercial deadlines.

What You Will Own

These are the areas where you will set the technical direction, define the standards others build against, and be accountable for the result.

Agentic workflow orchestration. The architecture for multi-agent coordination across tax document workflows with human-in-the-loop oversight: agent state machines, tool routing, context windowing, and retry semantics for processes that run for minutes or hours. You define orchestration patterns the whole team reuses, rather than solving one workflow at a time.

Workflow state management. State hydration for long-running agentic workflows, failure handling, checkpoint and resume, and recovery across distributed services. You own the durable run-record and state layer that makes long-running agents auditable and resumable, and the contract every agent builds on top of.

Document intelligence at scale. Production-grade pipelines that extract, classify, and validate tax forms and financial documents across dozens of formats and quality levels. You set the architecture for accuracy, coverage, and the validation layer that turns raw extraction into output a firm can trust.

Evaluation and observability. The measurement backbone: task completion rates, accuracy attribution, cost tracking per action, and regression detection, with outcomes attributable to specific agent reasoning steps when something goes wrong. You stand up the eval and observability discipline as a platform capability, so that "is this good enough to ship" is a number, not an opinion.

Trust and reliability. Making non-deterministic agent output trustworthy for professionals who cannot accept errors: supervision layers, validation rules, and human review gates, plus the provenance and audit trail that lets a firm defend its own work. This is a first-class architectural mandate, and you own the standard for it.

Cost, accuracy, and latency. Optimizing the trade-offs across document types, complexity levels, and client tiers during peak tax-season volume, and setting the framework the team uses to reason about them rather than tuning case by case.

What We Are Looking For

You have built and shipped production systems, and then kept them running. You have taken features from requirements through implementation, testing, deployment, and monitoring, and you have handled the incidents that follow. You can point to enterprise-grade products that users rely on today, and you can architect solutions end-to-end when correctness and uptime matter.

At Staff level, you do more than deliver your own work. You have set technical direction on complex distributed systems and been right often enough that teams follow your lead. You define patterns, standards, and interfaces that make other engineers faster and safer, and you have the judgment to make build-versus-buy calls on core infrastructure and defend them. You reduce key-person risk instead of becoming it: you mentor engineers, raise the bar in design and code review, and leave systems more legible than you found them.

You understand concurrency, failure modes, data integrity, and why things break at scale. You reason clearly about non-determinism: where perfect answers are not available, the job is to bound and manage the uncertainty. The boundaries between backend, data, infrastructure, and product are not rigid here, and you move between them based on what the problem demands. You communicate well across product, design, AI, and engineering, and you connect technical decisions to customer impact and long-term business value.

Must Have

8+ years building and shipping backend systems in production environments where uptime and correctness matter, including several years operating them.

A track record of leading the design of complex, distributed, or high-scale systems from architecture through deployment and ongoing operations, with enterprise-grade features that users depend on today.

Hands-on production experience with agent orchestration frameworks (LangGraph or equivalent) and long-running, stateful, multi-step agentic workflows. You will set the orchestration and state patterns the team builds on, so you need to have shipped this class of system, not just read about it.

Demonstrated technical leadership beyond your own commits: setting patterns and standards, driving cross-team or cross-functional initiatives, mentoring engineers, and influencing decisions across an organization.

Deep experience with relational databases (PostgreSQL or equivalent): schema design, query optimization, data modeling, and migrations.

Hands-on work with event-driven architectures: message queues, async processing, and distributed job execution.

Production experience with AWS (Lambda, SQS, S3, ECS) or an equivalent cloud platform.

Comfort reading and writing both TypeScript and Python, or clear evidence you pick up a second language fast.

Experience across the full software delivery lifecycle: design, implementation, testing, deployment, monitoring, and incident response.

Sound judgment on build versus buy, and the ability to make and defend architectural trade-offs under real time and cost constraints.

Good to Have

Depth in multi-agent coordination specifically: designing supervision, routing, and hand-off between multiple cooperating agents.

Familiarity with RAG architectures, vector databases, or document processing pipelines.

Experience with multi-tenant SaaS architecture: schema isolation, tenant-scoped data, and access control.

Background in document intelligence: OCR, structured extraction from PDFs, and form understanding.

Experience standing up evaluation, observability, or quality systems for ML or LLM products (offline and online eval, regression detection, cost and accuracy attribution).

Work in a regulated or high-trust domain (tax, finance, legal, healthcare) where output correctness is non-negotiable.

Open-source contributions, technical writing, or other public evidence of engineering depth.

Technologies and Frameworks

You do not need every item here on day one, but this is the environment you will work in and help shape.

Languages and runtimes

TypeScript and Python 3.12

Node.js 20 (Fastify and Express services)

React 18 with module federation for microfrontend architecture

AI and agent infrastructure

AWS Bedrock and AgentCore (Claude, Titan embeddings, cross-encoder reranking)

LangGraph for agent orchestration and state-machine management

LangChain for tool chaining and model integration

MCP (Model Context Protocol) for dynamic tool generation from OpenAPI specs

Data and infrastructure

PostgreSQL (Aurora), pgvector

AWS (Lambda, SQS, S3, ECS)

Event-driven, distributed job execution

We are an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.
