The “Copilot Readiness” Checklist: Is Your SharePoint Data Clean Enough?

Buying the license is the easy part. You pay Microsoft £24.70 per user, assign the seat, and wait for the magic to happen. But for 80% of organizations, the magic doesn’t happen. Instead, they face a sudden crisis of data governance.
Turning on Microsoft 365 Copilot without cleaning your SharePoint environment is malpractice. It is like installing a Ferrari engine into a rusty go-kart. The moment you hit the accelerator, the structural weaknesses of your data architecture will tear the system apart.
The problem is not the AI. The problem is your history. For the last decade, your organization has likely treated SharePoint as a digital dumping ground. Files are duplicated, permissions are broken, and sensitive data is buried in “temporary” folders that became permanent.
When you deploy Copilot, you are not just deploying a chatbot. You are deploying a semantic search engine that can read every file you have ever created. Understanding the full scope of Copilot security risks is essential before implementation. If your data is messy, the answers your AI produces will be hallucinated, inaccurate, and dangerous.
This guide is your technical remediation roadmap. We will walk through the exact steps to sanitize your tenant, flatten your permissions, and prepare your infrastructure for the age of agentic AI.
The “Garbage In, Speed Out” Problem
The Gist: AI does not fix bad data; it amplifies it. If you feed Copilot obsolete or conflicting files, it will confidently generate wrong answers at lightning speed.
Most IT leaders operate under the false assumption that Copilot is smart enough to know which file is the “final” version. It is not.
Copilot uses Semantic Indexing to retrieve information. It looks for context, not just keywords. If you have five versions of your “2024 HR Policy” saved in different folders—and four of them are drafts—Copilot treats them all as valid sources. When an employee asks, “What is our remote work policy?”, the AI might synthesize an answer from the draft version you rejected three years ago.
This is the “Garbage In, Speed Out” phenomenon. You are not getting better intelligence; you are getting bad intelligence faster.
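To make the version problem concrete, here is a minimal Python sketch that groups files whose titles differ only by version noise such as "draft", "copy", or "v2". The inventory rows, the `NOISE` word list, and the function names are illustrative assumptions, not a SharePoint export format or a Copilot feature.

```python
import re
from collections import defaultdict

# Hypothetical inventory rows: (path, last-modified date as ISO text).
FILES = [
    ("/sites/hr/Policies/2024 HR Policy.docx", "2024-03-01"),
    ("/sites/hr/Drafts/2024 HR Policy - DRAFT v2.docx", "2023-11-12"),
    ("/sites/ops/Shared/2024 HR Policy (copy).docx", "2023-12-02"),
    ("/sites/finance/Budget 2024.xlsx", "2024-01-15"),
]

# Words that usually mark a non-final copy. An assumed list; tune it for your tenant.
NOISE = re.compile(r"\b(draft|copy|final|old|v\d+)\b", re.IGNORECASE)


def normalize_title(path: str) -> str:
    """Drop the folder, the extension, and version noise so copies compare equal."""
    name = path.rsplit("/", 1)[-1]
    stem = name.rsplit(".", 1)[0]
    stem = NOISE.sub(" ", stem)
    stem = re.sub(r"[^a-z0-9]+", " ", stem.lower())
    return " ".join(stem.split())


def find_version_conflicts(files):
    """Return {normalized title: [(modified, path), ...]} for titles with 2+ copies."""
    groups = defaultdict(list)
    for path, modified in files:
        groups[normalize_title(path)].append((modified, path))
    # Newest first within each group; ISO dates sort correctly as plain text.
    return {
        title: sorted(items, reverse=True)
        for title, items in groups.items()
        if len(items) > 1
    }


conflicts = find_version_conflicts(FILES)
for title, copies in conflicts.items():
    print(title, "->", [path for _, path in copies])
```

The newest copy is not necessarily the approved one, so treat the output as a review queue for a human owner rather than a delete list.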
This same principle applies to how we automate internal reporting with AI – clean, structured data inputs are essential for reliable outputs.
Entities Tracked:
- Semantic Indexing: The retrieval method that connects distinct data points.
- Version Control: The discipline of maintaining a single source of truth.
- Data Hygiene: The practice of keeping data clean and error-free.
Step 1: The ROT Analysis (Redundant, Obsolete, Trivial)
- Redundant: Duplicate files stored in multiple sites.
- Obsolete: Data that is no longer accurate or business-relevant.
- Trivial: Personal files, memes, and non-business content.
Before you can secure your data, you must delete what you do not need. Industry estimates suggest that 30% to 50% of enterprise data is ROT (Redundant, Obsolete, Trivial). This data is not just costing you storage fees; it is poisoning your Semantic Index.
You cannot manually review 50 terabytes of data. You must use automated tools. For organizations unsure where to begin, a Copilot deployment readiness check can systematically identify these data quality issues.
Similar principles apply when automating document processing in the Microsoft ecosystem – clean data inputs are essential for reliable AI outputs.
The Cleanup Protocol
- Identify Duplicates: Use SharePoint Advanced Management (SAM) to scan for duplicate file hashes across your tenant. You will likely find that your “Marketing Assets” folder exists in six different sites. Delete five of them.
- Archive by Date: Implement a retention policy that automatically archives data untouched for 3 years. Move it to “Cold Storage” (Azure Blob) where Copilot cannot see it. This immediately shrinks the pool of stale content Copilot can surface.
- Purge Triviality: Scan for non-business file types (.mp3, .exe, personal .jpgs). There is no reason for Copilot to be indexing your employee’s holiday photos.
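The three cleanup rules above can be sketched as a single classification pass. This is a rough illustration in Python over a hypothetical file inventory (dicts with `path`, `content`, and `last_modified`), not the output of SharePoint Advanced Management or any Microsoft tool. The extension list adds `.jpeg` as an assumption, and "keep the newest copy of a duplicate" is one possible policy among several.

```python
import hashlib
from datetime import date, timedelta
from pathlib import PurePosixPath

# Non-business file types named in the article, plus .jpeg (an assumption).
TRIVIAL_EXTENSIONS = {".mp3", ".exe", ".jpg", ".jpeg"}
# "Untouched for 3 years", approximated as 3 * 365 days.
STALE_AFTER = timedelta(days=3 * 365)


def content_hash(data: bytes) -> str:
    """SHA-256 of the file bytes; identical content gives identical digests."""
    return hashlib.sha256(data).hexdigest()


def classify_rot(files, today):
    """Label each file 'trivial', 'redundant', 'obsolete', or 'keep'."""
    labels = {}
    seen_hashes = {}
    # Newest first, so the most recently modified copy of a duplicate is kept.
    for f in sorted(files, key=lambda f: f["last_modified"], reverse=True):
        path = f["path"]
        digest = content_hash(f["content"])
        if PurePosixPath(path).suffix.lower() in TRIVIAL_EXTENSIONS:
            labels[path] = "trivial"
        elif digest in seen_hashes:
            labels[path] = "redundant"
        elif today - f["last_modified"] > STALE_AFTER:
            labels[path] = "obsolete"
        else:
            labels[path] = "keep"
        seen_hashes.setdefault(digest, path)
    return labels


# Hypothetical sample inventory.
SAMPLE = [
    {"path": "/sites/marketing/Assets/logo-guide.pdf",
     "content": b"brand guide", "last_modified": date(2024, 5, 1)},
    {"path": "/sites/sales/Assets/logo-guide.pdf",
     "content": b"brand guide", "last_modified": date(2023, 2, 1)},
    {"path": "/sites/hr/Archive/2019-benefits.docx",
     "content": b"old benefits", "last_modified": date(2019, 6, 1)},
    {"path": "/sites/ops/Shared/holiday.jpg",
     "content": b"\xff\xd8", "last_modified": date(2024, 1, 1)},
]

for path, label in classify_rot(SAMPLE, today=date(2025, 1, 1)).items():
    print(f"{label:10} {path}")
```

In a real tenant you would feed this from whatever inventory or hash report your tooling produces, and route "redundant" and "obsolete" items to review before anything is deleted.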
Entities Tracked:
- ROT Data: The primary target for pre-AI cleanup.
- SharePoint Advanced Management (SAM): The toolset for governing sprawl.
- Azure Blob Storage: A cost-effective destination for archived data.
Step 2: Flattening Permissions (The Security Layer)
The Gist: Nested permissions are the enemy of Zero Trust. You must strip away complex inheritance and move to a flat, role-based access model.
In the old world of “Security by Obscurity,” it was fine to have a folder shared with “Everyone” because nobody could find it. In the AI world, obscurity is dead. Copilot has a flashlight, and it shines it everywhere.
The most dangerous setting in your tenant is the “Everyone except external users” group.
If a sensitive document—like a salary spreadsheet—lives in a site where this group has read access, Copilot will use that data to answer questions. It doesn’t matter if the user intended to find the salary data. If they ask, “How much do we spend on payroll?”, Copilot will do the math.
The Flattening Strategy
You must stop relying on permission inheritance to decide who sees what.
- Audit the “Everyone” Group: Use Microsoft Purview to find every site where the “Everyone” claim exists. Remove it. Replace it with specific Security Groups (e.g., “HR Team,” “Finance Team”).
- Enforce Just Enough Access (JEA): Adopt a Zero Trust Architecture. Users should only have access to the data they need to do their jobs today. Not what they might need next year.
- Review Shared Links: Expire all “Anyone with the link” sharing links. These are open doors for data exfiltration.
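As an illustration of the audit described above, the following Python sketch scans an exported list of permission grants and flags the two patterns the article calls out: broad "Everyone" groups and "Anyone with the link" sharing. The `Grant` record, its field names, and the `"anonymous"` scope value are assumptions made for this example; a real Purview or SharePoint report will have its own schema that you would map onto it.

```python
from dataclasses import dataclass

# Group names treated as tenant-wide access (compared case-insensitively).
BROAD_PRINCIPALS = {"everyone", "everyone except external users"}


@dataclass(frozen=True)
class Grant:
    """One row of a hypothetical permissions export."""
    site: str
    principal: str          # group or user display name
    link_scope: str = ""    # "anonymous" stands for "Anyone with the link"


def find_oversharing(grants):
    """Return (site, reason) pairs for grants that expose content too widely."""
    findings = []
    for g in grants:
        if g.principal.strip().lower() in BROAD_PRINCIPALS:
            findings.append((g.site, f"broad group: {g.principal}"))
        elif g.link_scope == "anonymous":
            findings.append((g.site, "anyone-with-the-link sharing"))
    return findings


# Hypothetical sample data.
SAMPLE_GRANTS = [
    Grant("/sites/finance", "Everyone except external users"),
    Grant("/sites/hr", "HR Team"),
    Grant("/sites/marketing", "Marketing Team", link_scope="anonymous"),
]

for site, reason in find_oversharing(SAMPLE_GRANTS):
    print(f"{site}: {reason}")
```

Each finding maps to one of the remediation steps: replace the broad group with a specific security group, or expire the link.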
For a deeper dive into the specific risks of permission drift, read our analysis on Microsoft 365 Copilot Security Risks.
Entities Tracked:
- Zero Trust Architecture: The mandatory security standard for AI.
- Permission Inheritance: The legacy feature that causes oversharing.
- Microsoft Purview: The compliance dashboard for auditing access.
Step 3: The Technical Readiness Checklist
Use this checklist to audit your environment. If you cannot check every box, you are not ready for deployment.
| Category | Action Item | Success Metric |
| --- | --- | --- |
| Licensing | Assign Microsoft 365 Copilot Licenses | Licenses assigned to pilot group only |
| Apps | Update Microsoft 365 Apps to Monthly Enterprise Channel | All users on verified version |
| Identity | Enforce Entra ID (Azure AD) Conditional Access | MFA enabled for 100% of users |
| Data Hygiene | Run ROT Analysis (Delete/Archive old data) | Storage reduced by >30% |
| Permissions | Remove “Everyone” from High-Risk Sites | Zero oversharing alerts in Purview |
| Governance | Define Sensitivity Labels (Public, Internal, Confidential) | Labels applied to 80% of files |
| Network | Optimize WebSocket Connections for fluid UI | Latency < 50ms to Microsoft Graph |
| Search | Verify Semantic Indexing status in Admin Center | Indexing status = Complete |
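The measurable rows of this checklist can be expressed as a simple go/no-go gate. The sketch below is in Python; the thresholds are copied from the Success Metric column, while the measurement keys (`mfa_coverage_pct`, `graph_latency_ms`, and so on) are hypothetical names you would populate from your own reporting. The Licensing and Apps rows are left out because they are not numeric.

```python
# Each check takes a dict of measurements and returns True when the metric is met.
CHECKS = {
    "Identity: MFA coverage": lambda m: m["mfa_coverage_pct"] >= 100,
    "Data Hygiene: storage reduction": lambda m: m["storage_reduction_pct"] > 30,
    "Permissions: oversharing alerts": lambda m: m["oversharing_alerts"] == 0,
    "Governance: labeled files": lambda m: m["labeled_files_pct"] >= 80,
    "Network: latency to Microsoft Graph": lambda m: m["graph_latency_ms"] < 50,
    "Search: semantic index": lambda m: m["index_status"] == "Complete",
}


def readiness_gaps(measurements):
    """Return the names of checks that fail; an empty list means ready to deploy."""
    return [name for name, passed in CHECKS.items() if not passed(measurements)]


# Hypothetical measurements for a tenant partway through cleanup.
measurements = {
    "mfa_coverage_pct": 100,
    "storage_reduction_pct": 34,
    "oversharing_alerts": 2,
    "labeled_files_pct": 61,
    "graph_latency_ms": 38,
    "index_status": "Complete",
}

gaps = readiness_gaps(measurements)
print("Ready" if not gaps else f"Not ready: {gaps}")
```

Because the rule is "every box or no deployment", the gate reports every failing item rather than stopping at the first one.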
Entities Tracked:
- Entra ID: The identity management system formerly known as Azure AD.
- Conditional Access: The gatekeeper for user logins.
- WebSocket: The protocol used for real-time AI communication.
Step 4: Sensitivity Labels & Automated Governance
The Gist: You cannot rely on users to classify documents. You must automate the labeling process to ensure sensitive data stays locked down.
A Sensitivity Label is a metadata tag that travels with the document. If a file is labeled “Confidential – Finance,” that tag enforces encryption and access control. Even if that file is emailed to the wrong person, they cannot open it.
Copilot respects these labels. If a user lacks the usage rights granted by the “Confidential – Finance” label, Copilot will refuse to summarize that document for them.
The Automation Rule
Do not ask users to label files. They will forget. Configure Auto-Labeling Policies in Purview.
- Pattern Matching: If a document contains a Credit Card Number or National Insurance Number, automatically label it “Confidential.”
- Keyword Matching: If a document contains “Merger,” “Acquisition,” or “Layoff,” automatically label it “Highly Restricted.”
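The two rule types above can be approximated in a few lines of Python to show the logic. This is not how Purview's classifiers are implemented: the regular expressions are simplified assumptions, the Luhn checksum is added here to cut false positives on card numbers, and letting "Highly Restricted" outrank "Confidential" is an assumed precedence.

```python
import re

# 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
# Simplified UK National Insurance number: two letters, six digits, suffix A-D.
NI_NUMBER = re.compile(
    r"\b[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z] ?\d{2} ?\d{2} ?\d{2} ?[A-D]\b"
)
KEYWORDS = re.compile(r"\b(merger|acquisition|layoff)s?\b", re.IGNORECASE)


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def suggest_label(text: str):
    """Return the label to apply, or None when no rule matches."""
    if KEYWORDS.search(text):
        return "Highly Restricted"   # assumed to outrank "Confidential"
    if NI_NUMBER.search(text):
        return "Confidential"
    if any(luhn_valid(m.group()) for m in CARD_CANDIDATE.finditer(text)):
        return "Confidential"
    return None


print(suggest_label("Card on file: 4111 1111 1111 1111"))
print(suggest_label("Draft merger timeline"))
print(suggest_label("Team lunch on Friday"))
```

Plain keyword rules are noisy (a "merger" of two mailing lists would match), which is one reason to pilot a policy in simulation before enforcing it.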
This creates a safety net. Even if your permissions are imperfect, the encryption on the file itself prevents the AI from leaking data.
Entities Tracked:
- Sensitivity Labels: The metadata tags that control encryption.
- Auto-Labeling Policies: The mechanism for enforcing rules without user input.
- Data Loss Prevention (DLP): The broader strategy of preventing leaks.
Step 5: The Implementation Timeline
Do not rush. A successful rollout takes 3 to 6 months.
Phase 1: The Audit (Month 1)
- Run the AI Strategy & Data Readiness Audit.
- Map your data estate.
- Identify the “Toxic Data” (ROT).
If your readiness audit reveals gaps Copilot cannot fill, it is worth evaluating how custom AI compares as an alternative solution.
Phase 2: The Cleanup (Months 2-3)
- Execute the delete/archive scripts.
- Flatten permissions on SharePoint.
- Deploy Sensitivity Labels.
Phase 3: The Pilot (Month 4)
- Deploy Copilot to a small group (10-20 users).
- Ideally, pick “Champions” from different departments.
- Monitor for hallucinations and access errors.
Phase 4: The Rollout (Months 5-6)
- Gradual expansion to the rest of the organization.
- Host training sessions on “Prompt Engineering.”
Conclusion: Sanitation Before Intelligence
The allure of AI is powerful. The promise of instant answers and automated reports is seductive. But you must resist the urge to skip the homework.
Sanitation must come before intelligence. If you deploy Copilot on top of a messy, overshared, duplicate-filled SharePoint tenant, you are not innovating. You are accelerating confusion.
Get your house in order. Clean the data. Fix the permissions. Then, and only then, let the AI in.