How It Works
An end-to-end Flask pipeline that automates your first hour of EDA — with real code and real stats, built for analysts who care about trust.
Step 1: Upload & Load
We support CSV, Excel, JSON, and Feather formats. Your file is handled securely and converted into a Pandas DataFrame using our custom function load_dataframe(), which dynamically detects the format and prevents memory issues via smart fallback logic.
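As a minimal sketch of the format-detection idea (the dispatch table below is an assumption for illustration; the real load_dataframe() also includes memory-fallback logic not shown here):

```python
import pandas as pd

# Hypothetical sketch: map file extensions to pandas readers.
# The actual load_dataframe() adds fallback logic for large files.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".feather": pd.read_feather,
}

def load_dataframe(path: str) -> pd.DataFrame:
    """Dispatch to the matching pandas reader based on file extension."""
    ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ""
    reader = READERS.get(ext)
    if reader is None:
        raise ValueError(f"Unsupported file format: {ext or path}")
    return reader(path)
```

A dispatch table keeps the loader open for extension: supporting a new format is one new dictionary entry, not another `elif` branch.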
Step 2: Data Profiling
The core profiling is handled by our custom function data_quality_check(). This performs a deep classification and statistical summary for every column:
- Numerics: mean, std, skew, kurtosis, variance, outliers (3σ–5σ)
- Booleans: distribution, imbalance (≥70%)
- Datetimes: flexible parsing, monotonic checks, time spans
- Categoricals: mode, value counts, cardinality flags
- Constants/Null Columns: flagged early for pipeline hygiene
We also handle edge cases like columns with a single unique value, implicit booleans, and messy time columns via check_timeseries().
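A simplified sketch of how a per-column classifier of this kind might look (the field names and branching below are illustrative assumptions, not the actual data_quality_check() implementation, which also covers booleans and datetimes):

```python
import pandas as pd

# Hypothetical sketch of the per-column profiling pass.
def profile_column(s: pd.Series) -> dict:
    """Classify one column and compute a type-appropriate summary."""
    info = {"nulls": int(s.isna().sum()), "unique": int(s.nunique(dropna=True))}
    if info["unique"] <= 1:
        # Constant / all-null columns are flagged early for pipeline hygiene.
        info["type"] = "constant"
    elif pd.api.types.is_numeric_dtype(s):
        info["type"] = "numeric"
        z = (s - s.mean()) / s.std()  # z-scores for sigma-bracket outliers
        info.update(
            mean=float(s.mean()),
            std=float(s.std()),
            skew=float(s.skew()),
            kurtosis=float(s.kurt()),
            outliers_3sigma=int((z.abs() >= 3).sum()),
        )
    else:
        info["type"] = "categorical"
        info["mode"] = s.mode(dropna=True).iloc[0]
    return info
```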
Step 3: Statistical Overview
Next, we build a bird's-eye view using overview(). This aggregates key metrics across the dataset:
- Data shape, column names, column type counts
- Nulls, duplicates, low-variance, high-cardinality flags
- Skewness and kurtosis distributions
- Outlier count by σ-bracket, including graphical output via Matplotlib
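A hedged sketch of this dataset-level rollup (the dictionary keys and the 50-unique cardinality cutoff are assumptions; the real overview() also emits the Matplotlib plots mentioned above):

```python
import pandas as pd

# Hypothetical sketch of the dataset-wide aggregation step.
def overview(df: pd.DataFrame) -> dict:
    """Aggregate shape, null, duplicate, and cardinality metrics."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "total_nulls": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "high_cardinality": [
            c for c in df.select_dtypes(exclude="number")
            if df[c].nunique() > 50  # assumed threshold
        ],
    }
```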
Step 4: Recommendations Engine
Here's where things get opinionated. Our custom engine data_quality_recommendations() reviews all stats and triggers warnings & suggestions in structured categories:
- Missingness: 30%+ null warnings, full-null column drop alerts
- Duplicates: exact row matches flagged
- Constant Columns: fully repeated values
- Outliers: flagged at 3σ–5σ with encoded plots via generate_outlier_plot()
- Skewness: classified by tail and severity via generate_skewness_plot()
- Cardinality: high (>50) or medium (12–49) categorical uniqueness
- Variance: low numeric variance, boolean imbalance, dominant categories
- Correlation: Pearson heatmap + Cramér's V for categorical pairs
- Multicollinearity: VIF analysis with pre-cleaning logic for nulls & constants
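To illustrate the rule-based style, here is a hypothetical sketch of just two of these rules, missingness and skewness (the thresholds match the ones listed above, but everything else is an assumption about how data_quality_recommendations() is structured):

```python
import pandas as pd

# Hypothetical sketch of two recommendation rules; the real engine
# covers all the categories listed above, plus plots.
def recommend(df: pd.DataFrame) -> list[str]:
    notes = []
    # Rule 1: missingness (30%+ warning, full-null drop alert).
    for col, frac in df.isna().mean().items():
        if frac == 1.0:
            notes.append(f"DROP: '{col}' is entirely null")
        elif frac >= 0.30:
            notes.append(f"WARN: '{col}' is {frac:.0%} null")
    # Rule 2: skewness, classified by tail direction.
    for col in df.select_dtypes(include="number"):
        skew = df[col].skew()
        if pd.notna(skew) and abs(skew) > 1:  # assumed severity cutoff
            tail = "right" if skew > 0 else "left"
            notes.append(f"SKEW: '{col}' is heavily {tail}-tailed ({skew:.2f})")
    return notes
```

Each rule appends a structured message rather than mutating the data, which keeps the engine advisory: it is the analyst's call whether to act.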
Step 5: Output & Cleaning
Everything is encoded for front-end use via convert_numpy() — even complex stats and plots are serialized to JSON-safe formats. Session outputs and files are cleaned hourly via periodic_cleanup() using wipe_all_files_in_folder() for both user uploads and intermediate results.
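A plausible sketch of the JSON-safety conversion (the recursive structure below is an assumption about convert_numpy(); the real function also handles encoded plot images and session payloads):

```python
import json
import numpy as np

# Hypothetical sketch: NumPy scalars/arrays are not JSON-serializable,
# so walk the result and convert them to native Python types.
def convert_numpy(obj):
    if isinstance(obj, np.generic):   # np.int64, np.float64, np.bool_, ...
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: convert_numpy(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [convert_numpy(v) for v in obj]
    return obj
```

After this pass, `json.dumps(convert_numpy(stats))` succeeds where `json.dumps(stats)` would raise a TypeError on NumPy scalars.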
Framework & Stack
Everything runs inside a secure Flask app using:
- Pandas / NumPy: Data wrangling and math
- Seaborn / Matplotlib: Diagnostic plots
- SciPy / statsmodels: Z-scores, VIF, statistical tests
- Bootstrap 5 + Jinja2: Fully styled templates and modals
Designed with mobile users in mind and structured to run securely: all processing happens server-side within the app itself, and no data leaves your machine.