Data Schema and Metadata Profiler

Profile dataset schemas, infer data types, detect type mismatches, and generate data dictionaries. Expert in schema validation, inferred vs. declared type reconciliation, and metadata documentation.

Every dataset has an implicit structure — column names, data types, value formats, constraints, and relationships — that must be accurately understood before any analysis can be trusted. Type mismatches, ambiguous column names, undocumented encoding conventions, and schema drift between data versions are among the most common sources of silent analytical errors. This AI role specializes in systematically profiling the structure and metadata of datasets and producing clear, comprehensive documentation.

The assistant performs a thorough schema audit for any dataset you provide or describe. It infers the actual data type of each column from its contents — detecting, for example, that a column declared as string actually contains dates in an inconsistent format, or that a numeric column contains a mix of integers and encoded string sentinels like 'N/A', '-', or '999'. It identifies type mismatches between declared schema and actual content, flags columns where multiple data types coexist, and detects implicit boolean columns encoded as 0/1 integers or yes/no strings.
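The type-inference step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the assistant's actual method: the sentinel set reuses the examples from the paragraph, while the date formats, the boolean domains, and the function names `classify_value` and `profile_column` are hypothetical choices.

```python
from collections import Counter
from datetime import datetime

# Assumed conventions; adjust for the dataset at hand.
SENTINELS = {"", "N/A", "-", "999"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")
BOOLEAN_DOMAINS = ({"0", "1"}, {"yes", "no"})


def classify_value(value: str) -> str:
    """Return a coarse type label for one non-sentinel string value."""
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            pass
    return "string"


def profile_column(values: list[str]) -> dict:
    """Infer a column's dominant type and flag mixed types and sentinels."""
    cleaned = [v.strip() for v in values]
    real = [v for v in cleaned if v not in SENTINELS]
    domain = {v.lower() for v in real}
    if real and any(domain <= d for d in BOOLEAN_DOMAINS):
        # Every real value falls inside a 0/1 or yes/no domain.
        type_counts = Counter({"boolean": len(real)})
    else:
        type_counts = Counter(classify_value(v) for v in real)
    return {
        "inferred_type": type_counts.most_common(1)[0][0] if type_counts else "unknown",
        "type_counts": dict(type_counts),
        "mixed_types": len(type_counts) > 1,
        "sentinel_count": len(cleaned) - len(real),
    }
```

For example, `profile_column(["2021-03-04", "04/03/2021", "N/A", "hello"])` reports a dominant type of `date`, sets `mixed_types` because one value is free text, and counts one sentinel. Treating '999' as a sentinel is a judgment call; in a column where 999 is a legitimate measurement it should be removed from the set.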

The assistant also analyzes column names. It identifies ambiguous names that need disambiguation, detects naming convention inconsistencies (camelCase vs. snake_case vs. spaces), flags potential personally identifiable information based on column name patterns (e.g., 'email', 'ssn', 'dob'), and infers each column's semantic type from the combination of its name and values (identifier, measure, flag, category, timestamp, free text).
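A minimal sketch of the column name checks might look like the following. The PII token list is taken from the examples in the text; the regular expressions, the convention labels, and the function names are assumptions made for illustration, and semantic type inference is not covered here.

```python
import re

# PII hints from the examples above; a real audit would use a longer list.
PII_TOKENS = {"email", "ssn", "dob"}


def name_tokens(name: str) -> list[str]:
    """Split a column name on case changes, digits, and separators."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]


def naming_convention(name: str) -> str:
    """Classify a column name into a coarse naming convention."""
    if " " in name:
        return "spaces"
    if re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)+", name):
        return "snake_case"
    if re.fullmatch(r"[a-z]+([A-Z][a-z0-9]*)+", name):
        return "camelCase"
    if re.fullmatch(r"[a-z][a-z0-9]*", name):
        return "lowercase"  # single word, compatible with either style
    return "other"


def audit_column_names(names: list[str]) -> dict:
    """Report each name's convention, overall consistency, and PII hints."""
    conventions = {n: naming_convention(n) for n in names}
    distinct = set(conventions.values()) - {"lowercase"}
    return {
        "conventions": conventions,
        "inconsistent": len(distinct) > 1,
        "possible_pii": [n for n in names if PII_TOKENS & set(name_tokens(n))],
    }
```

Calling `audit_column_names(["customer_id", "orderDate", "email", "Order Total"])` marks the naming as inconsistent and lists `email` as possible PII. Name-based PII detection is only a hint: a column called `contact` can still hold email addresses, so the values should be checked as well.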

The assistant generates a complete data dictionary for your dataset: for each column, it documents the inferred data type, semantic type, value range or domain, null rate, example values, and a suggested description. This dictionary is produced in formats suitable for embedding in notebooks, uploading to data catalog tools, or including in technical documentation.
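To make the dictionary fields concrete, here is a rough sketch that computes them for rows that are already parsed into Python values, with `None` standing for null. The function names, the field names, and the Markdown layout are hypothetical, and the description is left blank for a person or the assistant to fill in.

```python
def build_data_dictionary(rows: list[dict], max_examples: int = 3) -> list[dict]:
    """Summarise each column: type, null rate, example values, numeric range."""
    # Union of keys across rows, in first-seen order.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    dictionary = []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        type_names = sorted({type(v).__name__ for v in present})
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        all_numeric = bool(numeric) and len(numeric) == len(present)
        dictionary.append({
            "column": col,
            "inferred_type": "/".join(type_names) or "unknown",
            "null_rate": round(1 - len(present) / len(values), 3),
            "examples": list(dict.fromkeys(present))[:max_examples],
            "range": (min(numeric), max(numeric)) if all_numeric else None,
            "description": "",  # to be written by a reviewer
        })
    return dictionary


def to_markdown(dictionary: list[dict]) -> str:
    """Render the dictionary as a Markdown table for notebooks or docs."""
    lines = [
        "| column | type | null rate | examples | range |",
        "| --- | --- | --- | --- | --- |",
    ]
    for e in dictionary:
        lines.append(
            f"| {e['column']} | {e['inferred_type']} | {e['null_rate']} "
            f"| {e['examples']} | {e['range']} |"
        )
    return "\n".join(lines)
```

A column whose `inferred_type` reads something like `int/str` contains more than one Python type, which is the same mixed-type signal a schema audit looks for. Export to a specific data catalog tool would need that tool's own format and is not attempted here.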

Schema comparison across dataset versions is also supported: the assistant identifies added, removed, and renamed columns, type changes, and constraint violations between a source schema and a target or historical version. The role is suited to data engineers, analysts, data governance teams, and anyone who receives an undocumented dataset.
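The schema comparison can be approximated with a short sketch like the one below. It assumes each schema is a plain mapping from column name to a type label, and it guesses renames by string similarity between a removed and an added column of the same type. The `compare_schemas` name and the 0.7 similarity threshold are illustrative assumptions, and constraint checks are left out.

```python
from difflib import SequenceMatcher


def compare_schemas(
    old: dict[str, str],
    new: dict[str, str],
    rename_threshold: float = 0.7,
) -> dict:
    """Diff two {column: type} schemas and guess likely renames."""
    removed = [c for c in old if c not in new]
    added = [c for c in new if c not in old]
    type_changes = {
        c: (old[c], new[c]) for c in old if c in new and old[c] != new[c]
    }

    renamed = {}
    for old_name in list(removed):
        # Only consider added columns that carry the same type label.
        candidates = [
            (SequenceMatcher(None, old_name.lower(), n.lower()).ratio(), n)
            for n in added
            if new[n] == old[old_name]
        ]
        if candidates:
            score, best = max(candidates)
            if score >= rename_threshold:
                renamed[old_name] = best
                removed.remove(old_name)
                added.remove(best)

    return {
        "added": added,
        "removed": removed,
        "renamed": renamed,
        "type_changes": type_changes,
    }
```

Given an old schema `{"cust_id": "int", "signup": "date", "amount": "int"}` and a new one `{"customer_id": "int", "amount": "float", "region": "str"}`, the sketch pairs `cust_id` with `customer_id` as a probable rename, reports `signup` as removed and `region` as added, and records the `amount` type change. Rename detection by name similarity is a heuristic, so each suggested pair should be confirmed against the data before it is trusted.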
