How It Works

An end-to-end Flask pipeline that automates your first hour of EDA, with real code and real stats, built for analysts who care about trust.

Step 1: Upload & Load

We support CSV, Excel, JSON, and Feather formats. Your file is handled securely and converted into a Pandas DataFrame using our custom function load_dataframe(), which dynamically detects format and prevents memory issues via smart fallback logic.
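For illustration only (this is not the actual load_dataframe() source), format detection can be sketched as a lookup from file extension to a pandas reader. The name load_dataframe_sketch, the READERS table, and the choice to detect by extension are assumptions, and the memory fallback logic mentioned above is not reproduced here.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader function.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".feather": pd.read_feather,
}


def load_dataframe_sketch(path):
    """Load a supported file into a DataFrame, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    reader = READERS.get(suffix)
    if reader is None:
        # Fail loudly on anything outside the four supported formats.
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return reader(path)
```

Detecting by extension keeps the sketch short; a production loader would likely also inspect the file contents, since extensions can be wrong.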

Step 2: Data Profiling

The core profiling is handled by our custom function data_quality_check(). This performs a deep classification and statistical summary for every column:

  • 🔢 Numerics: mean, std, skew, kurtosis, variance, outliers (3σ–5σ)
  • 🧮 Booleans: distribution, imbalance (≥70%)
  • 🕒 Datetimes: flexible parsing, monotonic checks, time spans
  • 🔠 Categoricals: mode, value counts, cardinality flags
  • 🚫 Constants/Null Columns: flagged early for pipeline hygiene

We also handle edge cases like 1-unique-value features, implicit booleans, and messy time columns via check_timeseries().

Step 3: Statistical Overview

Next, we build a bird's-eye view using overview(). This aggregates key metrics across the dataset:

  • 📐 Data shape, column names, column type counts
  • 📉 Nulls, duplicates, low-variance, high-cardinality flags
  • 📈 Skewness and kurtosis distributions
  • 🚩 Outlier count by σ-bracket, including graphical output via Matplotlib
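A dataset-level summary along these lines can be sketched with a handful of pandas calls. The function name overview_sketch and the dictionary keys are hypothetical, and the flags and Matplotlib output listed above are left out to keep the example short.

```python
import pandas as pd


def overview_sketch(df):
    """Collect a few dataset-wide metrics into a plain dictionary."""
    numeric = df.select_dtypes(include="number")  # excludes bool columns
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "skewness": numeric.skew().to_dict(),
        "kurtosis": numeric.kurt().to_dict(),
    }
```

Returning a plain dictionary keeps the result easy to pass to a template or serialize later.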

Step 4: Recommendations Engine

Here's where things get opinionated. Our custom engine data_quality_recommendations() reviews all stats and triggers warnings & suggestions in structured categories:

  • ⚠️ Missingness: 30%+ null warnings, full-null column drop alerts
  • 🪞 Duplicates: exact row matches flagged
  • 🚫 Constant Columns: fully repeated values
  • 📈 Outliers: flagged at 3σ–5σ with encoded plots via generate_outlier_plot()
  • 🎯 Skewness: classified by tail and severity via generate_skewness_plot()
  • 🎭 Cardinality: high (>50) or medium (12–49) categorical uniqueness
  • 📉 Variance: low numeric variance, boolean imbalance, dominant categories
  • 🔗 Correlation: Pearson heatmap + Cramér's V for categorical pairs
  • 📎 Multicollinearity: VIF analysis with pre-cleaning logic for nulls & constants
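Two of these categories are easy to illustrate. The sketch below applies the thresholds quoted in the list (30%+ nulls, cardinality above 50 or between 12 and 49) and computes an uncorrected Cramér's V from a contingency table. Both function names, the tuple output format, and the rule that any non-numeric, non-datetime column counts as categorical are assumptions, not the behavior of data_quality_recommendations().

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def recommend_sketch(df):
    """Return (column, category, message) tuples for a few simple rules."""
    notes = []
    for col in df.columns:
        s = df[col]
        null_share = s.isna().mean() if len(s) else 0.0
        if null_share == 1.0:
            notes.append((col, "missingness", "all values null; consider dropping"))
        elif null_share >= 0.30:
            notes.append((col, "missingness", f"{null_share:.0%} null"))
        n_unique = s.nunique(dropna=True)
        if n_unique == 1:
            notes.append((col, "constant", "single repeated value"))
        # Assumption: non-numeric, non-datetime columns are categorical.
        if not is_numeric_dtype(s) and not is_datetime64_any_dtype(s):
            if n_unique > 50:
                notes.append((col, "cardinality", "high"))
            elif 12 <= n_unique <= 49:
                notes.append((col, "cardinality", "medium"))
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        notes.append((None, "duplicates", f"{n_dupes} exact duplicate rows"))
    return notes


def cramers_v_sketch(a, b):
    """Uncorrected Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    table = pd.crosstab(pd.Series(a, name="a"), pd.Series(b, name="b"))
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    k = min(observed.shape) - 1
    if n == 0 or k <= 0:
        return 0.0  # no data, or one variable has a single level
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * k)))
```

Cramér's V runs from 0 (no association) to 1 (one variable fully determines the other). This version applies no bias correction, so it tends to read high on small samples.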

Step 5: Output & Cleaning

Everything is encoded for front-end use via convert_numpy(), so even complex stats and plots are serialized to JSON-safe formats. Session outputs and files are cleaned hourly via periodic_cleanup() using wipe_all_files_in_folder() for both user uploads and intermediate results.
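The serialization step matters because NumPy scalars and arrays are not accepted by Python's json module. A recursive converter can be sketched as follows. The name convert_numpy_sketch and the choice to map NaN and infinity to null are assumptions, and the real convert_numpy() may cover more types (dates and encoded plots, for example).

```python
import json
import math

import numpy as np


def convert_numpy_sketch(obj):
    """Recursively turn NumPy containers and scalars into JSON-safe values."""
    if isinstance(obj, dict):
        return {str(k): convert_numpy_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [convert_numpy_sketch(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return convert_numpy_sketch(obj.tolist())
    if isinstance(obj, np.generic):
        obj = obj.item()  # e.g. np.int64 -> int, np.bool_ -> bool
    if isinstance(obj, float) and not math.isfinite(obj):
        return None  # NaN and infinity are not valid JSON
    return obj


stats = {"n": np.int64(3), "xs": np.array([1.5, np.nan]), "flag": np.bool_(True)}
print(json.dumps(convert_numpy_sketch(stats)))
# prints {"n": 3, "xs": [1.5, null], "flag": true}
```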

Framework & Stack

Everything runs inside a secure Flask app using:

  • 📦 Pandas / NumPy: Data wrangling and math
  • 📊 Seaborn / Matplotlib: Diagnostic plots
  • 📐 SciPy / statsmodels: Z-scores, VIF, statistical tests
  • 🖼️ Bootstrap 5 + Jinja2: Fully styled templates and modals

Designed with mobile users in mind. All processing runs server-side inside the Flask app and the browser only displays the results, so when the app runs on your own machine, no data leaves it.