LLM-Guided Multi-Granular Data Provenance for Trustworthy AI Pipelines
1. Executive Summary
This study introduces an LLM-guided platform that captures, manages, and queries provenance for data preparation pipelines, addressing a critical gap in explainable AI that lies upstream of model training. The system documents how datasets are cleaned, transformed, fused, augmented, and reduced before learning, offering end-to-end transparency, reproducibility, and auditability. A survey of real pipelines from ML Bazaar and top Kaggle notebooks identifies prevalent preprocessing operations (imputation, encoding, scaling, feature engineering, joins), which shape the platform's design priorities.

A large language model rewrites user scripts into structured, annotated pipelines, enabling automated, minimally intrusive capture of lineage. Provenance is modeled as a PROV-inspired graph of entities, activities, and columns linked by used, wasGeneratedBy, wasInvalidatedBy, and wasDerivedFrom relations. Template-based operators generalize capture across libraries by recognizing input–output change patterns rather than relying on bespoke per-library handlers. Multi-granular capture (Sketch, Derivation, Full, Only Columns) balances interpretability, storage, and query performance, with the resulting provenance materialized in a graph store for flexible analysis.

The implementation, PROLIT, demonstrates robustness under runtime errors, recording lineage up to the failure point to support diagnosis. Limitations such as LLM prompt sensitivity, unfamiliar libraries, and workflow branching are documented alongside mitigation strategies. Overall, the platform operationalizes data-centric explainability, strengthening trust in AI by making preprocessing both visible and understandable.
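For concreteness, the sketch below illustrates one way such a PROV-inspired graph could be represented. The relation names follow the W3C PROV vocabulary cited above, but the classes, identifiers, and methods are illustrative assumptions rather than PROLIT's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    """A dataset version or an individual column (illustrative only)."""
    id: str
    kind: str  # e.g. "dataframe" or "column"

@dataclass(frozen=True)
class Activity:
    """One preprocessing operation, e.g. imputation, encoding, or a join."""
    id: str
    operator: str

@dataclass
class ProvenanceGraph:
    """Stores PROV-style edges as (relation, source_id, target_id) triples."""
    edges: list = field(default_factory=list)

    def used(self, activity: Activity, entity: Entity) -> None:
        self.edges.append(("used", activity.id, entity.id))

    def was_generated_by(self, entity: Entity, activity: Activity) -> None:
        self.edges.append(("wasGeneratedBy", entity.id, activity.id))

    def was_invalidated_by(self, entity: Entity, activity: Activity) -> None:
        self.edges.append(("wasInvalidatedBy", entity.id, activity.id))

    def was_derived_from(self, derived: Entity, source: Entity) -> None:
        self.edges.append(("wasDerivedFrom", derived.id, source.id))

# Example: an imputation step reads the raw "age" column and emits a filled version.
graph = ProvenanceGraph()
impute = Activity(id="act:impute_age", operator="SimpleImputer")
age_v1 = Entity(id="col:age@v1", kind="column")
age_v2 = Entity(id="col:age@v2", kind="column")
graph.used(impute, age_v1)
graph.was_generated_by(age_v2, impute)
graph.was_derived_from(age_v2, age_v1)
```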
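The template-based capture idea can likewise be pictured as a generic diff over an operator's input and output frames: the observed column-level change pattern, rather than a bespoke handler per library call, determines which PROV-style relations are emitted. The function names and pattern categories below are assumptions introduced for illustration, not PROLIT's operator templates.

```python
import pandas as pd

def column_change_pattern(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Classify how one pipeline step changed the frame at column granularity."""
    added = set(after.columns) - set(before.columns)
    removed = set(before.columns) - set(after.columns)
    shared = set(before.columns) & set(after.columns)
    # A shared column whose values differ is treated as a new version derived
    # from its predecessor; row-count changes (e.g. filtering) also count.
    modified = {
        c for c in shared
        if len(before) != len(after) or not before[c].equals(after[c])
    }
    return {"added": added, "removed": removed, "modified": modified}

def edges_from_pattern(op_id: str, pattern: dict) -> list:
    """Translate a change pattern into PROV-style relations (cf. the model sketch above)."""
    edges = [("wasGeneratedBy", f"col:{c}", op_id) for c in pattern["added"]]
    edges += [("wasInvalidatedBy", f"col:{c}", op_id) for c in pattern["removed"]]
    edges += [("wasDerivedFrom", f"col:{c}@new", f"col:{c}@old") for c in pattern["modified"]]
    return edges
```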