How I Deployed an Enterprise AI Copilot to 9,000 Employees
TL;DR
Deploying an enterprise AI copilot to 9,000 employees in a regulated bank requires three things, with governance in place first: governance scaffolding (DLP, prompt logging, model risk approval), a tiered rollout (legal, risk, and compliance review of each use case), and active change management to drive adoption past the novelty curve.
Key takeaways
- Governance comes before access — DLP, prompt logging, audit, and model risk sign-off first.
- Use-case tiering separates low-risk productivity from regulated workflows.
- Adoption is a change management problem, not a technology problem.
The starting position
The institution had 9,000 employees, a regulated U.S. banking license, and a board that wanted to move on GenAI without becoming a cautionary tale. The mandate was a secure, internal AI copilot that any employee could use for day-to-day work — drafting, summarizing, querying internal knowledge — without exposing customer data, leaking IP to a foundation model vendor, or creating unmanaged model risk.
The architecture, briefly
The chosen pattern: a managed foundation-model service inside the bank’s cloud tenancy, fronted by an internal gateway that enforced policy, logged every prompt, and routed retrieval to a vector store populated only with documents employees were already authorized to read.
The copilot UI was a custom web app and a Teams integration; no employee could reach the underlying model except through the gateway.
- Foundation model: hosted inside the bank’s cloud account, no data egress to vendor
- Gateway: prompt logging, PII redaction, prompt-injection filters, per-team rate limits
- Retrieval: a vector store scoped by entitlement — the agent only sees what the user can see
- Audit: every prompt, response, and tool call written to immutable storage for model risk review
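The per-team rate limits in the gateway can be pictured as one token bucket per team. This is a minimal sketch under assumptions: the article does not name the algorithm the gateway used, and `TokenBucket`, `TeamRateLimiter`, the capacity, and the refill rate are all hypothetical.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Holds up to `capacity` tokens and refills `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated: Optional[float] = None  # time of the previous request

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the previous request.
        if self.updated is not None:
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Each request costs one token; with no token left, the request is rejected.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TeamRateLimiter:
    """Hypothetical gateway component: one independent bucket per team."""

    def __init__(self, capacity: float, rate: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, team: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if team not in self.buckets:
            self.buckets[team] = TokenBucket(self.capacity, self.rate)
        return self.buckets[team].allow(now)
```

Because each team has its own bucket, one team's batch job cannot exhaust capacity for everyone else.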
The governance decisions that mattered most
Authorization-aware retrieval
The single most important architectural decision was enforcing existing entitlements at retrieval time. If a user was not allowed to read a document in SharePoint, the copilot could not retrieve it. This sounds obvious; it is hard to get right, and it is the difference between a copilot you can ship and one you cannot.
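In code, authorization-aware retrieval means the entitlement check is part of the query rather than a step after generation. A minimal sketch, assuming each indexed chunk carries the ACL groups copied from its source document and the user's group memberships are already resolved; `Chunk`, `retrieve`, and the group names are hypothetical, and a production vector store would push this filter into the index query as a metadata filter instead of scanning a list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]  # ACL copied from the source system at ingestion
    score: float                    # similarity to the query, precomputed for this sketch


def retrieve(chunks: List[Chunk], user_groups: Set[str], k: int = 3) -> List[Chunk]:
    """Return the top-k chunks the user is entitled to read.

    The entitlement filter runs before ranking, so a document the user cannot
    open in the source system never reaches the prompt, and forbidden
    high-scoring documents cannot crowd permitted ones out of the top k.
    """
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]
```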
Prompt logging from day one
Every prompt is logged. Every response is logged. The logs are reviewed periodically by the model risk team. Employees know this — it is in the acceptable use policy. The transparency builds trust and provides the audit trail regulators expect.
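Immutability itself comes from the storage layer, but the log can also be made tamper-evident so a reviewer can check that no record was edited. The sketch below swaps in one common technique, a SHA-256 hash chain, as an assumption: the article does not say the bank used it, and the record fields are hypothetical (a real record would also carry a timestamp, model version, and tool calls).

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: Dict[str, str]) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log; each record stores the hash of the record before it."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def append(self, user: str, prompt: str, response: str) -> str:
        prev = self.records[-1]["hash"] if self.records else GENESIS
        body = {"user": user, "prompt": prompt, "response": response, "prev": prev}
        record = dict(body, hash=_digest(body))
        self.records.append(record)
        return record["hash"]

    def verify(self) -> bool:
        """Recompute the chain; editing, reordering, or deleting a non-final record breaks it."""
        prev = GENESIS
        for record in self.records:
            body = {k: record[k] for k in ("user", "prompt", "response", "prev")}
            if record["prev"] != prev or record["hash"] != _digest(body):
                return False
            prev = record["hash"]
        return True
```

A chain alone cannot detect the newest records being truncated; anchoring the latest hash somewhere outside the log closes that gap.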
A real acceptable-use policy
Not a document buried on the intranet. A short, clear policy presented at first use, repeated quarterly, with examples of what is and is not appropriate. Adoption rises when employees know the rules and falls when they don’t.
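The "first use, then quarterly" rule is small enough to enforce in the UI before a session opens. A sketch assuming the copilot stores the date of each user's last acknowledgement and treats a quarter as 90 days; `must_show_policy` and `REACK_INTERVAL` are hypothetical names.

```python
from datetime import date, timedelta
from typing import Optional

# "Repeated quarterly", approximated here as 90 days (an assumption).
REACK_INTERVAL = timedelta(days=90)


def must_show_policy(last_ack: Optional[date], today: date) -> bool:
    """True on first use (never acknowledged) or once the last acknowledgement is a quarter old."""
    if last_ack is None:
        return True
    return today - last_ack >= REACK_INTERVAL
```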
What change management actually looked like
The technical rollout was the easy part. The harder work was organizational: a champions program across business units, role-specific prompt libraries (a credit analyst’s prompts look nothing like a marketer’s), office hours twice a week for the first three months, and a feedback channel that fed directly into the engineering backlog.
An enterprise copilot does not succeed because the model is good. It succeeds because employees are trained, trust the system, and use it for tasks where it actually saves them time.
What I would do differently
- Start the model risk review earlier — it is the long pole in the tent, not the engineering
- Build the evaluation harness before the copilot, not after — you cannot improve what you cannot measure
- Resist the urge to launch with too many tools wired in; for the first 12 months, a focused copilot beats a sprawling one
- Plan for the second wave of use cases from day one — the copilot is the platform, not the product
Results worth measuring
Adoption alone is not success. The metrics that matter: time saved per task category, accuracy of model outputs on internal benchmarks, escalations to human review, and reduction in shadow AI usage (employees using consumer chatbots for work). All four moved in the right direction, and all four had to be measured deliberately. Vanity metrics like ‘weekly active users’ tell you very little.
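Two of the four metrics fall straight out of interaction logs once each record carries a task category, a time-saved estimate, and an escalation flag. A minimal sketch under that assumption: the field names are hypothetical, the time-saved figures are only as good as the surveys and observations behind them, and benchmark accuracy and shadow AI telemetry need separate pipelines.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Interaction:
    category: str         # task category, e.g. "summarization" or "drafting"
    minutes_saved: float  # surveyed or observed estimate for this task
    escalated: bool       # True if the interaction was sent to human review


def summarize(interactions: List[Interaction]) -> Tuple[Dict[str, float], float]:
    """Mean minutes saved per task category, plus the overall escalation rate."""
    if not interactions:
        return {}, 0.0
    by_category: Dict[str, List[float]] = defaultdict(list)
    for i in interactions:
        by_category[i.category].append(i.minutes_saved)
    time_saved = {cat: mean(vals) for cat, vals in by_category.items()}
    escalation_rate = sum(i.escalated for i in interactions) / len(interactions)
    return time_saved, escalation_rate
```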
The three-workstream model
A 9,000-employee copilot rollout in a regulated bank does not succeed because the model is good. It succeeds because three workstreams ran in parallel for nine months: governance, rollout, and adoption. Drop any one and the program either gets blocked by risk, lands without users, or lands so loudly that one incident shuts it down.
- Governance: DLP, prompt logging, model risk approval, acceptable-use policy, incident playbooks. Months 1–3 are 80% governance.
- Rollout: Tiered use-case approval, identity-aware retrieval, environment isolation, dark-launch then progressive enablement.
- Adoption: Champions network, role-based playbooks, measurable usage and time-saved benchmarks, executive review cadence.
Governance: what shipped before any user logged in
- Data loss prevention — outbound regex and entity scanning on every prompt; blocked terms log to security review.
- Prompt and response logging to immutable storage with 7-year retention, aligned to SR 11-7 expectations.
- Model risk documentation — model card, intended use, validation evidence, monitoring plan, rollback procedure.
- Acceptable-use policy linked from the copilot UI, with mandatory click-through on first use.
- Incident response playbook — paging on PII egress, hallucination escalation, and model degradation.
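The outbound regex half of that DLP control might look like the following. A minimal sketch: the two patterns, the Luhn filter for card numbers, and `scan_prompt` are illustrative assumptions; the entity-scanning half (named-entity recognition for names, addresses, and the like) is not shown, and routing findings to security review is left to the caller.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; a production DLP ruleset is broader and tuned.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13-19 digits, optionally separated by spaces or dashes, starting and ending on a digit.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_prompt(prompt: str) -> List[Tuple[str, str]]:
    """Return (kind, match) findings; an empty list means the prompt may pass."""
    findings = [("ssn", m.group()) for m in SSN.finditer(prompt)]
    for m in CARD.finditer(prompt):
        if luhn_ok(re.sub(r"\D", "", m.group())):
            findings.append(("card", m.group()))
    return findings
```

Pairing a loose regex with a checksum keeps false positives down: a 16-digit reference number that fails the Luhn check is let through instead of blocking the prompt.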
The rollout sequence that actually worked
The rollout used a four-tier model that matches the risk posture of the work, not the seniority of the user. Tier 1 (productivity) reached everyone in week 6. Tier 4 (regulated workflows) reached fewer than 200 users by month 12 — and that was the right answer.
- T1 Productivity: Drafting, summarization, internal Q&A. Low risk. Light review. Available to all employees.
- T2 Knowledge: Grounded RAG over internal policies and procedures. Citations required. Available to all employees.
- T3 Operational: Workflow-specific copilots (collections, ops, IT). Purpose-built prompts. Restricted access groups.
- T4 Regulated: Credit, compliance, customer-facing. Model risk sign-off per use case. Human-in-the-loop mandatory.
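The four tiers can be encoded as an access policy the gateway evaluates per request. A sketch under assumptions: the class and field names and the example group names are hypothetical, and T2's citation requirement is a retrieval-side check that is not modelled here.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Set


class Tier(IntEnum):
    PRODUCTIVITY = 1
    KNOWLEDGE = 2
    OPERATIONAL = 3
    REGULATED = 4


@dataclass(frozen=True)
class UseCase:
    name: str
    tier: Tier
    access_group: Optional[str] = None  # required group for T3/T4; unused for T1/T2
    model_risk_signoff: bool = False    # per-use-case approval, required for T4


def can_enable(use_case: UseCase, user_groups: Set[str]) -> bool:
    """Gate by the risk tier of the work, never by the user's seniority."""
    if use_case.tier <= Tier.KNOWLEDGE:
        return True  # T1 and T2 are available to all employees
    if use_case.access_group is None or use_case.access_group not in user_groups:
        return False  # T3 and T4 are limited to restricted access groups
    if use_case.tier == Tier.REGULATED:
        return use_case.model_risk_signoff  # no sign-off, no enablement
    return True


def requires_human_review(use_case: UseCase) -> bool:
    # T4 output always goes through a human before it is used.
    return use_case.tier == Tier.REGULATED
```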
Identity-aware retrieval — the single most important architectural choice
The copilot inherits the user’s existing entitlements at retrieval time. If a user cannot read a document in SharePoint, the copilot cannot retrieve it. This sounds obvious until you try to implement it across a federated content estate — and it is the line between a copilot you can ship and one that becomes a privacy incident. The pattern matters enough that it gets its own treatment in the LLMOps reference.
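Much of the difficulty in a federated estate is that source ACLs reference nested groups, so a user's entitlements are the transitive closure of their direct memberships. A minimal sketch, assuming memberships from every source have already been synced into a single group-to-parent-groups mapping; `effective_groups` and the group names are hypothetical. The result is the set an entitlement filter at retrieval time would check a document's ACL against.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def effective_groups(direct: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Expand direct group memberships through nested groups (breadth-first).

    `parents` maps a group to the groups it is itself a member of. The `seen`
    set guards against cyclic nesting, so expansion always terminates.
    """
    seen: Set[str] = set(direct)
    queue = deque(seen)
    while queue:
        group = queue.popleft()
        for parent in parents.get(group, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```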
Adoption: the hard part
Three months in, usage was 18% of licensed seats. Three months later, after the champions program ramped and role-specific prompt libraries shipped, it was 64%. The model did not change. The change management did.
- Champions network — one trained champion per 200 employees, named publicly, with monthly office hours.
- Role-based playbooks — a credit analyst’s prompts look nothing like a marketer’s. Generic training fails both.
- Weekly office hours for the first quarter, monthly thereafter.
- Feedback loop into engineering — top 10 friction points triaged every sprint.
- Executive review on adoption, time-saved, and incident metrics — quarterly to the operating committee.
Metrics that actually mean something
- 📊 Time saved per task: Measured by survey and direct observation, not vanity metrics. Maps to capacity reallocation in the board AI ROI report.
- 🎯 Accuracy on benchmarks: Internal eval set per use case. Refreshed monthly. Regression triggers rollback.
- 🆘 Escalation rate: Fraction of interactions sent to human review. Leading indicator of trust.
- 🚫 Shadow AI reduction: Telemetry on consumer chatbot usage from corporate devices. Falls as the sanctioned copilot earns trust.
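"Regression triggers rollback" implies an automated comparison between the current model's eval scores and a candidate's. A minimal sketch, assuming each use case's eval set yields a single accuracy figure; `EvalResult`, `regressions`, and the 2-point tolerance are hypothetical choices, not the bank's thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalResult:
    use_case: str
    accuracy: float  # fraction of the use case's eval-set items judged correct


def regressions(baseline: List[EvalResult], candidate: List[EvalResult],
                tolerance: float = 0.02) -> List[str]:
    """Use cases whose candidate accuracy fell more than `tolerance` below baseline."""
    base = {r.use_case: r.accuracy for r in baseline}
    return sorted(
        r.use_case
        for r in candidate
        if r.use_case in base and base[r.use_case] - r.accuracy > tolerance
    )


def should_roll_back(baseline: List[EvalResult], candidate: List[EvalResult],
                     tolerance: float = 0.02) -> bool:
    # A single regressed use case is enough to keep (or restore) the baseline model.
    return bool(regressions(baseline, candidate, tolerance))
```

Comparing per use case rather than on a pooled score matters: a gain on summarization should not mask a drop on policy Q&A.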