Vahdettin Karatas
Applied ML & data tooling — reproducible delivery
  • Location: Prague, Czech Republic
Technical focus
  • Streamlit data apps
  • Profiling & quality warnings
  • HTML/PDF reporting
  • Configurable pipelines
  • pytest-backed modules
Portfolio artifact — reproducible EDA pipeline

EDA Report Generator

End-to-end path I ship: Streamlit ingestion (CSV/.xlsx, sheet pick, UTF-8 + latin-1 CSV fallback), config-driven thresholds, profiling engine (missing, duplicates, distributions, correlations, column intelligence, datetime probe, structured warnings), Jinja HTML report plus optional PDF, and pytest coverage on load/profile/render.

Static project page: https://eda-report.vahdetkaratas.com — Streamlit demo: https://eda.vahdetkaratas.com (not hosted on this static domain).

Stack: Streamlit · pandas · Jinja2 · pytest

Limitations & scope

  • Demo-scope app with configured max rows/sample size—not designed as enterprise BI.
  • Profiling reflects uploaded or sampled rows; tune limits for representative slices.
  • PDF is optional and host-dependent (WeasyPrint + system libraries).
  • No REST API surface: upload and export flow only through Streamlit.

Hosting & demo URLs

This static write-up: eda-report.vahdetkaratas.com.

Live Streamlit demo: eda.vahdetkaratas.com (upload, profile, export — not served from the static explanation page).

What ships in this artifact

In scope. A Streamlit-powered upload path (CSV plus .xlsx via openpyxl), optional target column for ML-ish summaries, environment- and sidebar-driven limits with first-N or stratified sampling for large frames, then an in-memory profile payload consumed by both the UI blocks and the Jinja report template.
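
The first-N versus stratified sampling choice can be sketched as follows. This is a minimal illustration, not the project's code: the helper name sample_frame, its parameters, and the rule "stratify only when a target column is given" are all assumptions.

```python
import pandas as pd


def sample_frame(df, max_rows, target=None, seed=0):
    """Cap a frame at roughly max_rows rows (hypothetical helper)."""
    if len(df) <= max_rows:
        # Small enough already: profile the full frame.
        return df
    if target is None or target not in df.columns:
        # First-N: cheap and deterministic, but biased if the file is sorted.
        return df.head(max_rows)
    # Stratified: draw the same fraction from every target class so class
    # proportions are roughly preserved. Rows whose target is missing are
    # excluded by groupby's default dropna=True.
    frac = max_rows / len(df)
    return df.groupby(target, group_keys=False).sample(frac=frac, random_state=seed)
```

Because each class is rounded separately, the stratified result can differ from max_rows by a few rows. That is usually acceptable for profiling, where representativeness matters more than an exact count.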

Profiling surface. Schema and uniques; duplicate row counts; missing rates; numeric histogram buckets (SVG); categorical top-k with coverage; correlation matrix highlighting tied to configured |r| threshold; structured warnings (build_warnings); derived quick_notes; executive bullets; data-quality headline plus counts; column intelligence (constants, near-constants, cardinality, ID-like hints); conservative datetime probing where names or dtypes warrant it.
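
A few of these checks fit in a short pandas sketch. The function name quick_profile, the payload keys, and the default thresholds below are illustrative assumptions; the real run_full_profile payload is richer and may be shaped differently.

```python
import pandas as pd


def quick_profile(df, corr_threshold=0.8, near_constant_share=0.95):
    """Build a small profile payload (illustrative keys and thresholds)."""
    # Share of missing cells per column, and count of fully duplicated rows.
    missing_rate = df.isna().mean().round(4).to_dict()
    duplicate_rows = int(df.duplicated().sum())

    # Column intelligence: constant and near-constant columns.
    constants, near_constants = [], []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True, dropna=False)
        if len(shares) <= 1:
            constants.append(col)
        elif shares.iloc[0] >= near_constant_share:
            near_constants.append(col)

    # Numeric pairs whose |r| meets the configured threshold.
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    strong_pairs = [
        (a, b, round(float(corr.loc[a, b]), 3))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if abs(corr.loc[a, b]) >= corr_threshold  # NaN never passes this test
    ]

    return {
        "missing_rate": missing_rate,
        "duplicate_rows": duplicate_rows,
        "constants": constants,
        "near_constants": near_constants,
        "strong_pairs": strong_pairs,
    }
```

Returning a plain dict keeps the payload easy to hand to both UI blocks and a template, which matches the single-payload design described above.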

Export. get_report_html returns the report as an HTML string; PDF is produced through optional WeasyPrint when its native dependencies exist. The WeasyPrint import fails soft, including on OSError, so the app keeps running when GTK/Pango are missing.
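
The fail-soft import can be sketched like this. The helper names load_pdf_backend and html_to_pdf are assumptions for illustration; the only WeasyPrint call used is HTML(string=...).write_pdf(), which returns bytes when no target is given.

```python
def load_pdf_backend():
    """Return the weasyprint module, or None when it cannot be imported."""
    try:
        # ImportError: package not installed.
        # OSError: package installed but native libraries (GTK/Pango) missing.
        import weasyprint
    except (ImportError, OSError):
        return None
    return weasyprint


def html_to_pdf(html):
    """Return PDF bytes, or None when no PDF backend is available."""
    backend = load_pdf_backend()
    if backend is None:
        return None
    return backend.HTML(string=html).write_pdf()
```

A caller can then hide or disable the PDF download when html_to_pdf returns None, while the HTML download stays available on every host.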

Architecture (high level)

  • src/app.py — Thin orchestration only: load → optional sample → run_full_profile → downloads.
  • src/load.py — CSV UTF-8 / latin-1 fallback; Excel sheet enumeration with openpyxl.
  • src/profile.py — Single entry run_full_profile composing schema, duplicates, stats, distributions, correlations, intelligence, warnings, executive summary, artifacts list.
  • src/report.py — Renders templates with correlation-highlight threshold mirrored from profiling config constants.
  • src/config.py — File limits plus env-backed PROFILE_* knobs (missing thresholds, outliers, correlations, cardinality, …).
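
As a rough sketch of env-backed knobs in the spirit of src/config.py: only the PROFILE_ prefix comes from the description above. The class name, field names, full variable names, and defaults are hypothetical.

```python
import os
from dataclasses import dataclass


def _env_float(name, default, environ):
    """Read a float from the environment, falling back on missing or bad values."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        return default


@dataclass(frozen=True)
class ProfileConfig:
    # Hypothetical knobs; the real module exposes more thresholds.
    missing_warn_rate: float = 0.2
    corr_highlight_abs: float = 0.8

    @classmethod
    def from_env(cls, environ=None):
        env = os.environ if environ is None else environ
        return cls(
            missing_warn_rate=_env_float(
                "PROFILE_MISSING_WARN_RATE", cls.missing_warn_rate, env
            ),
            corr_highlight_abs=_env_float(
                "PROFILE_CORR_HIGHLIGHT_ABS", cls.corr_highlight_abs, env
            ),
        )
```

Passing the same config object to both the profiler and the report renderer is one way to keep the correlation-highlight threshold mirrored between them.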

Honest boundaries

This codebase is deliberately not a warehouse scanner, lineage catalog, or governed semantic layer. It does not ingest streaming tables, lacks multi-tenant auth, does not advertise an external REST/API contract, and is not positioned as unattended production monitoring. It showcases disciplined Python modules, repeatable HTML output, and PDF fallbacks you can tune per host.

Regression safety

The tests/ directory covers CSV load (including latin-1 fallback), xlsx sheets, profiler invariants (including categorical metadata after recent payload slimming), and an HTML render smoke test with sampling/target metadata assertions. Run them when extending prompts or thresholds.
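
A regression test for the latin-1 fallback might look like the sketch below. The loader read_csv_bytes is a stand-in written for this example, not the function in src/load.py, and the sample data is made up.

```python
import io

import pandas as pd


def read_csv_bytes(raw):
    """Parse CSV bytes, trying UTF-8 first and latin-1 second (hypothetical helper)."""
    try:
        text, encoding = raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this decode cannot fail.
        text, encoding = raw.decode("latin-1"), "latin-1"
    return pd.read_csv(io.StringIO(text)), encoding


def test_latin1_fallback():
    # "á" encoded as latin-1 is not valid UTF-8, which forces the fallback path.
    raw = "city,n\nMálaga,3\n".encode("latin-1")
    df, encoding = read_csv_bytes(raw)
    assert encoding == "latin-1"
    assert df.loc[0, "city"] == "Málaga"


def test_utf8_is_preferred():
    raw = "city,n\nMálaga,3\n".encode("utf-8")
    df, encoding = read_csv_bytes(raw)
    assert encoding == "utf-8"
    assert df.loc[0, "city"] == "Málaga"
```

Because latin-1 never raises on decode, the fallback can silently mis-read files in other encodings. Returning the chosen encoding makes it possible to surface that in the UI.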


© Vahdettin Karatas. All rights reserved.