EDA Report Generator

End-to-end path I ship: Streamlit ingestion (CSV/.xlsx, sheet pick, UTF-8 + latin-1 CSV fallback), config-driven thresholds, profiling engine (missing, duplicates, distributions, correlations, column intelligence, datetime probe, structured warnings), Jinja HTML report plus optional PDF, and pytest coverage on load/profile/render.

Static project page: https://eda-report.vahdetkaratas.com — Streamlit demo: https://eda.vahdetkaratas.com (not hosted on this static domain).

Streamlit

Pandas

Jinja2

pytest

Limitations & scope

Demo-scope app with configured caps on max rows and sample size; not designed as enterprise BI.
Profiling reflects uploaded or sampled rows; tune limits for representative slices.
PDF is optional and host-dependent (WeasyPrint + system libraries).
No REST API surface: upload and export flow only through Streamlit.

Hosting & demo URLs

This static write-up: eda-report.vahdetkaratas.com.

Live Streamlit demo: eda.vahdetkaratas.com (upload, profile, export — not served from the static explanation page).

What ships in this artifact

In scope. A Streamlit-powered upload path (CSV plus .xlsx via openpyxl), optional target column for ML-ish summaries, environment- and sidebar-driven limits with first-N or stratified sampling for large frames, then an in-memory profile payload consumed by both the UI blocks and the Jinja report template.
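The first-N versus stratified choice described above can be sketched roughly as follows. This is an illustration, not the project's code: the function name sample_frame, its parameters, and the rule "stratify only when a target column is given" are all assumptions.

```python
import pandas as pd


def sample_frame(df, max_rows, target=None, seed=0):
    """Cap a frame at roughly max_rows rows (hypothetical helper).

    Stratifies by `target` when one is given, otherwise keeps the first N rows.
    """
    if len(df) <= max_rows:
        return df  # small enough: profile everything
    if target is None or target not in df.columns:
        return df.head(max_rows)  # first-N sampling
    frac = max_rows / len(df)
    # Draw the same fraction from every target class so class proportions survive.
    return df.groupby(target, group_keys=False, dropna=False).sample(
        frac=frac, random_state=seed
    )
```

With a target column the class balance of the sample tracks the full frame, which matters for any target-aware summary. First-N is cheaper but can be biased when the file is sorted.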

Profiling surface. Schema and uniques; duplicate row counts; missing rates; numeric histogram buckets (SVG); categorical top-k with coverage; correlation matrix highlighting tied to configured |r| threshold; structured warnings (build_warnings); derived quick_notes; executive bullets; data-quality headline plus counts; column intelligence (constants, near-constants, cardinality, ID-like hints); conservative datetime probing where names or dtypes warrant it.
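As a rough illustration of three of the checks listed above (missing rates, duplicate row counts, and correlation pairs flagged against an |r| threshold), a minimal pandas sketch might look like this. The function name quick_profile, the returned keys, and the 0.8 default are assumptions made for the example; the real run_full_profile payload is richer and may be shaped differently.

```python
import pandas as pd


def quick_profile(df, corr_threshold=0.8):
    """Tiny stand-in for a profiling pass (hypothetical, not run_full_profile)."""
    # Share of missing values per column.
    missing_rate = df.isna().mean().round(4).to_dict()
    # Rows that repeat an earlier row exactly.
    duplicate_rows = int(df.duplicated().sum())

    # Pearson correlations over numeric columns only.
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    high_pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle: each pair once
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= corr_threshold:
                high_pairs.append((a, b, round(float(r), 3)))

    return {
        "missing_rate": missing_rate,
        "duplicate_rows": duplicate_rows,
        "high_correlations": high_pairs,
    }
```

Keeping the threshold as a parameter is what would let the report template highlight the same pairs the profiler flags.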

Export. HTML string from get_report_html; PDF via optional WeasyPrint when native deps exist (the import is guarded so that an OSError from missing GTK/Pango is caught and the app keeps running without PDF export).
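A guarded import of this kind could be sketched as below. The helper name html_to_pdf and the None return convention are assumptions for illustration. HTML(string=...).write_pdf() is WeasyPrint's documented entry point and returns bytes when no target is passed.

```python
def html_to_pdf(html):
    """Return PDF bytes, or None when PDF export is unavailable (hypothetical helper)."""
    try:
        # ImportError: WeasyPrint is not installed.
        # OSError: WeasyPrint is installed but a native library (Pango/GTK) is missing.
        from weasyprint import HTML
    except (ImportError, OSError):
        return None
    return HTML(string=html).write_pdf()
```

A caller can then offer the PDF download only when the result is not None and fall back to HTML-only export otherwise.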

Architecture (high level)

src/app.py — Thin orchestration only: load → optional sample → run_full_profile → downloads.
src/load.py — CSV UTF-8 / latin-1 fallback; Excel sheet enumeration with openpyxl.
src/profile.py — Single entry run_full_profile composing schema, duplicates, stats, distributions, correlations, intelligence, warnings, executive summary, artifacts list.
src/report.py — Renders templates with correlation-highlight threshold mirrored from profiling config constants.
src/config.py — File limits plus env-backed PROFILE_* knobs (missing thresholds, outliers, correlations, cardinality, …).
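Two of the behaviours named in the module list, the UTF-8 / latin-1 CSV fallback and env-backed knobs, can be sketched in a few lines. The names read_csv_with_fallback and env_float, and the idea of silently reverting to the default on a malformed value, are assumptions for this example rather than the project's actual API.

```python
import io
import os

import pandas as pd


def env_float(name, default):
    """Read a float knob from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default  # malformed value: keep the safe default


def read_csv_with_fallback(data):
    """Parse CSV bytes as UTF-8, retrying as latin-1 if decoding fails."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this decode cannot fail.
        text = data.decode("latin-1")
    return pd.read_csv(io.StringIO(text))
```

Because latin-1 never raises, it works as a last resort, though it can mis-render text that was really in some other encoding.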

Honest boundaries

This codebase is deliberately not a warehouse scanner, lineage catalog, or governed semantic layer. It does not ingest streaming tables, lacks multi-tenant auth, does not advertise an external REST/API contract, and is not positioned as unattended production monitoring. It showcases disciplined Python modules, repeatable HTML output, and PDF fallbacks you can tune per host.

Regression safety

The tests/ suite covers CSV load (including latin-1 fallback), xlsx sheets, profiler invariants (including categorical metadata after recent payload slimming), and an HTML render smoke test that asserts on sampling/target metadata. Run it when extending prompts or thresholds.

Vahdettin Karatas
