Welcome to tabulight’s documentation!

API Reference

class tabulight.EDA(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)

Bases: Plot

Performs a comprehensive exploratory data analysis (EDA) on tabular/structured data. It is meant to be a one-stop shop for EDA.

Methods

  • data_availability

  • box_plot

  • plot_missing

  • plot_histograms

  • plot_index

  • plot_data

  • plot_pcs

  • grouped_scatter

  • correlation

  • stats

  • autocorrelation

  • partial_autocorrelation

  • probability_plots

  • lag_plot

  • plot_ecdf

  • normality_test

  • parallel_coordinates

  • show_unique_vals

Example:
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(data=wq_data())
>>> eda()  # draw all available plots in a single call
__init__(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)

Arguments

data : DataFrame, array, dict, list

either a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array

in_cols : str, list, optional

columns to consider as input features

out_cols : str, optional

columns to consider as output features

path : str, optional

the path where the figures are saved. If not given, plots are saved in a ‘data’ folder in the current working directory.

save : bool, optional

whether to save the plots or not

show : bool, optional

whether to show the plots or not

dpi : int, optional

the resolution with which to save the image

autocorrelation(n_lags: int = 10, cols: list | str = None, figsize: tuple = None)

Autocorrelation of individual features of data

Arguments

n_lags : int, optional

number of lag steps to consider

cols : str, list, optional

columns to use. If not defined then all the columns are used

figsize : tuple, optional

figure size
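
As a rough illustration of the quantity plotted here, the sketch below computes the sample autocorrelation of a single series for lags 0 to n_lags in plain Python. The helper name sample_autocorrelation and the example series are hypothetical and not part of tabulight; the estimator used by the library may differ, for example in how it normalises.

```python
def sample_autocorrelation(values, n_lags=10):
    """Return the sample autocorrelation of `values` for lags 0..n_lags."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # Sum of squared deviations; used to normalise every lag.
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    acf = []
    for lag in range(n_lags + 1):
        # Covariance between the series and itself shifted by `lag` steps.
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(cov / variance)
    return acf


# Hypothetical example series.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
acf = sample_autocorrelation(series, n_lags=3)
```

The value at lag 0 is always 1, and the remaining values lie between -1 and 1.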

box_plot(st=None, en=None, cols: list | str = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)

Plots a box-and-whisker or violin plot of data.

Arguments

st : optional

starting row/index in data to be used for plotting

en : optional

end row/index in data to be used for plotting

cols : list

the names of columns from data to be plotted.

normalize :

If True, then each feature/column is rescaled between 0 and 1.

figsize :

figure size

freq : str

one of ‘weekly’, ‘monthly’, ‘yearly’. If given, a box plot is drawn for each of these intervals.

max_features : int

maximum number of features to appear in one plot.

violen : bool

if True, a violin plot is drawn, otherwise a box-and-whisker plot

show_datapoints : bool

if True, sns.swarmplot() will be plotted. This can be time-consuming for larger data.

**kwargs :

any keyword arguments for seaborn.boxplot/seaborn.violinplot or seaborn.swarmplot.
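
The rescaling described for normalize can be pictured as min-max scaling applied to each column separately. The following is only a sketch under that assumption; min_max_scale and the column names are hypothetical, and the library may implement the rescaling differently.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns on very different scales.
columns = {
    "temperature": [12.0, 18.0, 24.0, 30.0],
    "salinity": [33.0, 35.0, 34.0, 37.0],
}
scaled = {name: min_max_scale(vals) for name, vals in columns.items()}
```

After scaling, all features share the same 0 to 1 range, so their boxes can be compared on one axis.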

correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: str = None, **kwargs)

Plots correlation between features.

Arguments

remove_targets : bool, optional

whether to remove the output/target column or not

st :

starting row/index in data to be used for plotting

en :

end row/index in data to be used for plotting

cols :

columns to use

method : str, optional

one of {pearson, spearman, kendall, covariance}, by default pearson

split : str

To plot only positive correlations, set it to “pos”; to plot only negative correlations, set it to “neg”.

**kwargs : keyword args

Any additional keyword arguments for easy_mpl.imshow

Example

>>> from tabulight.eda import EDA
>>> from tabulight import wq_data
>>> vis = EDA(wq_data())
>>> vis.correlation()
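
To make the method and split options more concrete, here is a minimal sketch of a Pearson correlation matrix in plain Python, with one plausible reading of split in which coefficients of the unwanted sign are blanked out. The functions pearson and correlation_matrix and the column names are hypothetical; how tabulight actually filters the matrix is an assumption here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns, split=None):
    """Pairwise correlations; entries of the unwanted sign become None."""
    matrix = {}
    for a in columns:
        for b in columns:
            r = pearson(columns[a], columns[b])
            if split == "pos" and r < 0:
                r = None  # assumed behaviour: hide negative correlations
            elif split == "neg" and r > 0:
                r = None  # assumed behaviour: hide positive correlations
            matrix[(a, b)] = r
    return matrix


# Hypothetical columns: b rises with a, c falls as a rises.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
full = correlation_matrix(columns)
positive_only = correlation_matrix(columns, split="pos")
```
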
data_availability(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) → Figure

Plots data as heatmap which depicts missing values.

Arguments

st : int, str, optional

starting row/index in data to be used for plotting

en : int, str, optional

end row/index in data to be used for plotting

cols : str, list

columns to use to draw heatmap

figsize : tuple, optional

figure size

**kwargs :

Keyword arguments for easy_mpl.imshow

Return

None

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.data_availability()
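
The information such a heatmap conveys can be sketched as a 0/1 matrix that marks each cell as available or missing. The helpers is_missing, availability_matrix and missing_fraction, and the column names, are hypothetical and only illustrate the idea; they are not tabulight functions.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def availability_matrix(columns):
    """Map every cell to 1 (value present) or 0 (value missing)."""
    return {
        name: [0 if is_missing(v) else 1 for v in vals]
        for name, vals in columns.items()
    }


def missing_fraction(columns):
    """Fraction of missing cells per column."""
    return {
        name: sum(is_missing(v) for v in vals) / len(vals)
        for name, vals in columns.items()
    }


# Hypothetical data with gaps in one column.
columns = {
    "ph": [7.1, None, 7.3, float("nan")],
    "chl": [0.4, 0.5, 0.6, 0.7],
}
availability = availability_matrix(columns)
fractions = missing_fraction(columns)
```
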
grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)

Makes a scatter plot for each feature in data.

Arguments

st :

starting row/index in data to be used for plotting

en :

end row/index in data to be used for plotting

cols :

columns to use

max_subplots : int, optional

it can be set to a large number to show all the scatter plots on one axis.

**kwargs :

keyword arguments for sns.pairplot

heatmap(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) → Figure

Plots data as heatmap which depicts missing values.

Arguments

st : int, str, optional

starting row/index in data to be used for plotting

en : int, str, optional

end row/index in data to be used for plotting

cols : str, list

columns to use to draw heatmap

figsize : tuple, optional

figure size

**kwargs :

Keyword arguments for sns.heatmap

Return

None

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.heatmap()
property in_cols
lag_plot(n_lags: int | list = 1, cols=None, figsize=None, **kwargs)

Lag plot between an array and its lags

Arguments

n_lags :

lag step against which to plot the data. It can be an integer or a list of integers

cols :

columns to use

figsize :

figure size

**kwargs :

any keyword arguments for axis.scatter
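
A lag plot scatters each value against the value a fixed number of steps later. The sketch below builds those point pairs for one or several lags; lag_pairs and the example series are hypothetical, and which of the two values the library puts on which axis is an assumption.

```python
def lag_pairs(values, lag=1):
    """Return (x[t], x[t + lag]) pairs, the points drawn in a lag plot."""
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    return list(zip(values[:-lag], values[lag:]))


# Hypothetical series; n_lags given as a list yields one set of points per lag.
series = [3, 5, 8, 13, 21]
pairs_by_lag = {lag: lag_pairs(series, lag) for lag in [1, 2]}
```

Points that fall close to a line suggest that the series depends on its own past at that lag.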

normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: tuple = None)

Plots the statistics of a normality test as bar charts. The statistics for each feature are calculated with the Shapiro-Wilk test, the Anderson-Darling test or the Kolmogorov-Smirnov test, using scipy.stats functions such as scipy.stats.shapiro and scipy.stats.anderson.

Arguments

method :

one of “shapiro”, “anderson” or “kolmogorov”. Default is “shapiro”

cols :

columns to use

st : optional

start of data

en : optional

end of data to use

orientation : optional

orientation of bars

color :

color to use

figsize : tuple, optional

figure size (width, height)

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(data=wq_data())
>>> eda.normality_test()
property out_cols
parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)

Plots data as parallel coordinates.

Arguments

st :

start of data to be considered

en :

end of data to be considered

cols :

columns from data to be considered.

color :

color or colormap to be used.

**kwargs :

any additional keyword arguments to be passed to easy_mpl.parallel_coordinates

partial_autocorrelation(n_lags: int = 10, cols: list | str = None)

Partial autocorrelation of individual features of data

Arguments

n_lags : int, optional

number of lag steps to consider

cols : str, list, optional

columns to use. If not defined then all the columns are used

plot_data(st=None, en=None, freq: str = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)

Plots the data.

Arguments

st : int, str, optional

starting row/index in data to be used for plotting

en : int, str, optional

end row/index in data to be used for plotting

cols : str, list, optional

columns in data to consider for plotting

max_cols_in_plot : int, optional

Maximum number of columns in one plot. The maximum number of plots depends on this value and on the number of columns in data.

freq : str, optional

one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’. Determines the interval at which the data is plotted. It is valid only for time-series data.

ignore_datetime_index : bool, optional

only valid if the dataframe’s index is a pd.DatetimeIndex. In such a case, if you want to ignore the time index on the x-axis, set this to True.

**kwargs :

any keyword arguments for the pandas plot method

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(wq_data())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)
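
One way to picture what freq='monthly' implies is that the time series is split into one chunk per calendar month before plotting. The sketch below does this grouping with the standard library only; group_by_month and the records are hypothetical, and the library may split the data differently.

```python
from collections import defaultdict
from datetime import date


def group_by_month(records):
    """Group (date, value) pairs into one list of values per calendar month."""
    groups = defaultdict(list)
    for day, value in records:
        # Key on (year, month) so January 2020 and January 2021 stay apart.
        groups[(day.year, day.month)].append(value)
    return dict(groups)


# Hypothetical daily observations spanning three calendar months.
records = [
    (date(2020, 1, 30), 1.0),
    (date(2020, 1, 31), 2.0),
    (date(2020, 2, 1), 3.0),
    (date(2021, 1, 1), 4.0),
]
monthly = group_by_month(records)
```
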
plot_ecdf(cols=None, figsize=None, **kwargs)

Plots the empirical cumulative distribution function.

Arguments

cols :

columns to use

figsize :

figure size

**kwargs :

any keyword argument for axis.plot
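
For reference, the empirical cumulative distribution function of a sample can be computed as below. This is a generic sketch, not the library's code; the function name ecdf and the sample values are hypothetical.

```python
def ecdf(values):
    """Return sorted values and the fraction of observations <= each one."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest observation (1-based) sits at cumulative level i / n.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


# Hypothetical sample with one repeated value.
xs, ys = ecdf([3.0, 1.0, 2.0, 2.0])
```

Plotting ys against xs as a step line gives the usual ECDF curve, which rises from 1/n to 1.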

plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)

Plots distribution of data as histogram.

Arguments

st :

starting index of data to use

en :

end index of data to use

cols :

columns to use

max_subplotsint, optional

maximum number of subplots in one figure

figsize :

figure size

**kwargs :

any keyword argument for pandas.DataFrame.hist function

plot_index(st=None, en=None, **kwargs)

Plots the datetime index of the dataframe.

plot_missing(st=None, en=None, cols=None, **kwargs)

Plots data to indicate missingness in data

Arguments

cols : list, str, optional

columns to be used.

st : int, str, optional

starting row/index in data to be used for plotting

en : int, str, optional

end row/index in data to be used for plotting

**kwargs :

Keyword arguments such as figsize

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.plot_missing()
plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)

Plots principal components.

Arguments

num_pcs :

st :

starting row/index in data to be used for plotting

en :

end row/index in data to be used for plotting

save_as_csv :

figsize :

**kwargs :

will go to sns.pairplot.

probability_plots(cols: str | list = None)

Draws probability plots using scipy.stats.probplot. See scipy distributions.

show_unique_vals(threshold: int = 10, st=None, en=None, cols=None, max_subplots: int = 9, figsize: tuple = None, **kwargs)

Shows percentage of unique/categorical values in data. Only those columns are used in which unique values are below threshold.

Arguments

threshold : int, optional

st : int, str, optional

en : int, str, optional

cols : str, list, optional

max_subplots : int, optional

figsize : tuple, optional

**kwargs :

Any keyword arguments for easy_mpl.pie
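
The selection rule described above (only columns whose number of unique values is below threshold) and the percentages shown in each pie can be sketched as follows. unique_value_shares and the column names are hypothetical, and reading "below" as strictly less than the threshold is an assumption.

```python
from collections import Counter


def unique_value_shares(columns, threshold=10):
    """Percentage of each unique value, for columns with few unique values."""
    shares = {}
    for name, vals in columns.items():
        counts = Counter(vals)
        # Assumed rule: keep the column only if it has fewer unique
        # values than `threshold`.
        if len(counts) < threshold:
            total = len(vals)
            shares[name] = {v: 100.0 * c / total for v, c in counts.items()}
    return shares


# Hypothetical data: one categorical column and one continuous column.
columns = {
    "site": ["A", "B", "A", "A"],
    "reading": [0.1, 0.2, 0.3, 0.4],
}
shares = unique_value_shares(columns, threshold=3)
```

Here the continuous column is skipped because it has four unique values, which is not below the threshold of three.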

stats(precision=3, inputs=True, outputs=True, st=None, en=None, out_fmt='csv')

Finds the stats of inputs and outputs and saves them in a csv or json file.

Arguments

inputs : bool

fpath : str, path like

out_fmt : str

in which format to save, either csv or json
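
As a small illustration of what per-column statistics rounded to a given precision might look like when serialised as json, consider the sketch below. The describe helper, the chosen statistics and the column names are hypothetical; the set of statistics that tabulight actually reports is not specified here.

```python
import json
import statistics


def describe(values, precision=3):
    """A few descriptive statistics, rounded to `precision` decimals."""
    return {
        "count": len(values),
        "mean": round(statistics.fmean(values), precision),
        "std": round(statistics.stdev(values), precision),  # sample std
        "min": min(values),
        "max": max(values),
    }


# Hypothetical input columns.
columns = {"ph": [7.0, 7.2, 7.4], "do": [8.0, 9.0, 10.0]}
summary = {name: describe(vals) for name, vals in columns.items()}

# Serialise to a json string; writing it to a file is left out on purpose.
as_json = json.dumps(summary, indent=2)
```
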
