EDA Report Generator
End-to-end path I ship: Streamlit ingestion (CSV/.xlsx, sheet pick, UTF-8 + latin-1 CSV fallback), config-driven thresholds, profiling engine (missing, duplicates, distributions, correlations, column intelligence, datetime probe, structured warnings), Jinja HTML report plus optional PDF, and pytest coverage on load/profile/render.
Static project page: https://eda-report.vahdetkaratas.com — Streamlit demo: https://eda.vahdetkaratas.com (not hosted on this static domain).
Limitations & scope
- Demo-scope app with configured max rows/sample size—not designed as enterprise BI.
- Profiling reflects uploaded or sampled rows; tune limits for representative slices.
- PDF is optional and host-dependent (WeasyPrint + system libraries).
- No REST API surface: upload and export flow only through Streamlit.
Hosting & demo URLs
This static write-up: eda-report.vahdetkaratas.com.
Live Streamlit demo: eda.vahdetkaratas.com (upload, profile, export — not served from the static explanation page).
What ships in this artifact
In scope. A Streamlit-powered upload path (CSV plus .xlsx via openpyxl), optional target column for ML-ish summaries, environment- and sidebar-driven limits with first-N or stratified sampling for large frames, then an in-memory profile payload consumed by both the UI blocks and the Jinja report template.
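The first-N versus stratified sampling choice mentioned above could look roughly like the sketch below. It is a dependency-free illustration over a list of dicts, not the project's code: the function name sample_rows, its parameters, and the proportional-allocation rule are all assumptions.

```python
import random
from collections import defaultdict


def sample_rows(rows, max_rows, strata_key=None, seed=0):
    """Return at most max_rows rows (hypothetical helper, not the project's API).

    Without strata_key this is plain first-N. With strata_key, each group
    gets a share proportional to its size, with at least one row per group.
    """
    if len(rows) <= max_rows:
        return list(rows)
    if strata_key is None:
        return rows[:max_rows]  # first-N sampling

    rng = random.Random(seed)  # fixed seed keeps reports repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)

    sampled = []
    for members in groups.values():
        share = max(1, round(len(members) * max_rows / len(rows)))
        sampled.extend(rng.sample(members, min(share, len(members))))
    # Rounding and the one-row minimum can overshoot, so trim to the cap.
    return sampled[:max_rows]
```

Stratifying on the optional target column would keep rare classes visible in the profiled slice, which first-N cannot guarantee when the file is sorted.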
Profiling surface. Schema and uniques; duplicate row counts; missing rates; numeric histogram buckets (SVG); categorical top-k with coverage; correlation matrix highlighting tied to configured |r| threshold; structured warnings (build_warnings); derived quick_notes; executive bullets; data-quality headline plus counts; column intelligence (constants, near-constants, cardinality, ID-like hints); conservative datetime probing where names or dtypes warrant it.
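To make three of these checks concrete, here is a minimal stdlib sketch of missing rates, duplicate row counts, and correlation pairs flagged against an |r| threshold. The function names, the row-as-dict layout, and the use of >= for the threshold comparison are assumptions for illustration; the real profiler may differ on all of them.

```python
from collections import Counter
from math import sqrt


def missing_rates(rows, columns):
    """Fraction of rows where each column is None or an empty string."""
    total = len(rows) or 1
    return {c: sum(r.get(c) in (None, "") for r in rows) / total for c in columns}


def duplicate_row_count(rows, columns):
    """Number of rows that exactly repeat an earlier row."""
    counts = Counter(tuple(r.get(c) for c in columns) for r in rows)
    return sum(n - 1 for n in counts.values())


def pearson(xs, ys):
    """Pearson r for two equal-length numeric lists; None when undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def highlighted_pairs(numeric, threshold):
    """Column pairs whose |r| meets the threshold (assumed to be inclusive)."""
    names = sorted(numeric)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(numeric[a], numeric[b])
            if r is not None and abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged
```

Returning None for constant columns keeps them out of the highlighted pairs, which fits with reporting constants separately under column intelligence.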
Export. HTML string from get_report_html; PDF via optional WeasyPrint when native dependencies exist (the import now fails soft: an OSError raised by missing GTK/Pango is caught, so the app keeps running without PDF export).
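A soft-failing optional import of this kind is commonly written as below. This is a sketch of the pattern rather than the project's actual module: render_pdf is a hypothetical name, and only the WeasyPrint calls HTML(string=...) and write_pdf() are taken from the library itself.

```python
from typing import Optional

try:
    # WeasyPrint loads native libraries at import time; when they are
    # missing the import can raise OSError rather than ImportError.
    from weasyprint import HTML
except (ImportError, OSError):
    HTML = None  # PDF export is unavailable on this host


def render_pdf(html: str) -> Optional[bytes]:
    """Return PDF bytes, or None when WeasyPrint cannot be used here."""
    if HTML is None:
        return None
    # With no target argument, write_pdf() returns the document as bytes.
    return HTML(string=html).write_pdf()
```

A caller can then show the PDF download button only when render_pdf returns bytes, and fall back to the HTML download otherwise.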
Architecture (high level)
- src/app.py: Thin orchestration only: load → optional sample → run_full_profile → downloads.
- src/load.py: CSV UTF-8 / latin-1 fallback; Excel sheet enumeration with openpyxl.
- src/profile.py: Single entry run_full_profile composing schema, duplicates, stats, distributions, correlations, intelligence, warnings, executive summary, artifacts list.
- src/report.py: Renders templates with correlation-highlight threshold mirrored from profiling config constants.
- src/config.py: File limits plus env-backed PROFILE_* knobs (missing thresholds, outliers, correlations, cardinality, …).
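As an illustration of the env-backed knobs in src/config.py, the sketch below reads thresholds from the environment and falls back to defaults. Only the PROFILE_ prefix comes from the write-up; the specific variable names, field names, and default values here are hypothetical.

```python
import os
from dataclasses import dataclass


def _env_float(name, default):
    """Read a float from the environment, keeping the default on bad input."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        return default


@dataclass(frozen=True)
class ProfileConfig:
    # Field names and defaults are illustrative, not the project's values.
    missing_warn_rate: float
    corr_highlight_abs_r: float
    max_rows: int

    @classmethod
    def from_env(cls):
        return cls(
            missing_warn_rate=_env_float("PROFILE_MISSING_WARN_RATE", 0.2),
            corr_highlight_abs_r=_env_float("PROFILE_CORR_HIGHLIGHT", 0.8),
            max_rows=int(_env_float("PROFILE_MAX_ROWS", 50000)),
        )
```

Passing one frozen config object to both the profiler and the report renderer is one way to keep the correlation-highlight threshold identical in both places.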
Honest boundaries
This codebase is deliberately not a warehouse scanner, lineage catalog, or governed semantic layer. It does not ingest streaming tables, lacks multi-tenant auth, does not advertise an external REST/API contract, and is not positioned as unattended production monitoring. It showcases disciplined Python modules, repeatable HTML output, and PDF fallbacks you can tune per host.
Regression safety
The tests/ suite covers CSV load (including latin-1 fallback), xlsx sheets, profiler invariants (including categorical metadata after recent payload slimming), and HTML render smoke tests, including sampling/target metadata assertions. Use them when extending prompts or thresholds.
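A regression test for the latin-1 fallback might be shaped like the pytest-style sketch below. The loader shown is a small stand-in written for this example (load_csv_bytes is a hypothetical name); the project's real tests would import the function from src/load.py instead.

```python
import csv
import io


def load_csv_bytes(raw: bytes):
    """Stand-in loader: try UTF-8 first, then fall back to latin-1."""
    try:
        text, encoding = raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        text, encoding = raw.decode("latin-1"), "latin-1"
    return list(csv.DictReader(io.StringIO(text))), encoding


def test_utf8_is_preferred():
    rows, encoding = load_csv_bytes("name\nJosé\n".encode("utf-8"))
    assert encoding == "utf-8"
    assert rows == [{"name": "José"}]


def test_latin1_fallback():
    # 0xE9 followed by a newline is invalid UTF-8, forcing the fallback.
    rows, encoding = load_csv_bytes("name\nJosé\n".encode("latin-1"))
    assert encoding == "latin-1"
    assert rows == [{"name": "José"}]
```

Asserting on the detected encoding as well as the parsed rows catches the case where a loader silently decodes with the wrong codec and produces mojibake.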