Welcome to tabulight’s documentation!

API Reference

class tabulight.EDA(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)

Bases: Plot

Performs a comprehensive exploratory data analysis (EDA) on tabular/structured data. It is meant to be a one-stop shop for EDA.

Methods

data_availability
box_plot
plot_missing
plot_histograms
plot_index
plot_data
plot_pcs
grouped_scatter
correlation
stats
autocorrelation
partial_autocorrelation
probability_plots
lag_plot
plot_ecdf
normality_test
parallel_coordinates
show_unique_vals

Example:

>>> from tabulight import EDA, wq_data
>>> eda = EDA(data=wq_data())
>>> eda()  # draw all available plots with a single call

__init__(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)

Arguments

data : DataFrame, array, dict, list
a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array

in_cols : str, list, optional
columns to consider as input features

out_cols : str, optional
columns to consider as output features

path : str, optional
the path where the figures are saved. If not given, plots are saved in the ‘data’ folder in the current working directory.

save : bool, optional
whether to save the plots or not

show : bool, optional
whether to show the plots or not

dpi : int, optional
the resolution with which to save the image

autocorrelation(n_lags: int = 10, cols: list | str = None, figsize: tuple = None)

Autocorrelation of individual features of the data.

Arguments

n_lags : int, optional
number of lag steps to consider

cols : str, list, optional
columns to use. If not defined, all the columns are used

figsize : tuple, optional
figure size
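
As an illustration of what n_lags controls, the following is a minimal sketch of the sample autocorrelation of one series for lags 0 to n_lags. It is not taken from tabulight; the helper name `autocorrelation` and the normalization by the lag-0 sum of squares are assumptions made for this example.

```python
def autocorrelation(values, n_lags=10):
    """Return sample autocorrelation coefficients for lags 0..n_lags.

    Hypothetical helper for illustration only, not tabulight's code.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # Lag-0 sum of squares, used to normalize every coefficient.
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    coeffs = []
    for lag in range(n_lags + 1):
        # Sum of products between the series and itself shifted by `lag`.
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        coeffs.append(cov / variance)
    return coeffs


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
acf = autocorrelation(series, n_lags=3)
```

The coefficient at lag 0 is always 1; a slowly decaying sequence of coefficients, as for this linear trend, indicates strong serial dependence.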

box_plot(st=None, en=None, cols: list | str = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)

Plots a box-and-whisker or violin plot of the data.

Arguments

st : optional
starting row/index in data to be used for plotting

en : optional
end row/index in data to be used for plotting

cols : list
the names of columns from data to be plotted.

normalize :
If True, then each feature/column is rescaled between 0 and 1.

figsize :
figure size

freq : str
one of ‘weekly’, ‘monthly’, ‘yearly’. If given, a box plot is drawn for each of these intervals.

max_features : int
maximum number of features to appear in one plot.

violen : bool
if True, a violin plot is drawn, otherwise a box-and-whisker plot.

show_datapoints : bool
if True, sns.swarmplot() is also drawn. This can be time-consuming for large data.

**kwargs :
any keyword arguments for seaborn.boxplot, seaborn.violinplot or seaborn.swarmplot.
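
The rescaling described for normalize can be pictured as min-max scaling applied column by column. The sketch below is an assumption about the general technique, not tabulight's implementation; the helper `min_max_scale` and the sample columns are hypothetical.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Two features on very different scales.
data = {"temperature": [12.0, 18.0, 24.0], "flow": [100.0, 400.0, 250.0]}
scaled = {name: min_max_scale(col) for name, col in data.items()}
```

After scaling, both features span the same range, so their boxes can share one axis without the larger-valued feature dominating the plot.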

correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: str = None, **kwargs)

Plots correlation between features.

Arguments

remove_targets : bool, optional
whether to remove the output/target column or not

st :
starting row/index in data to be used for plotting

en :
end row/index in data to be used for plotting

cols :
columns to use

method : str, optional
one of {pearson, spearman, kendall, covariance}, by default pearson

split : str
set to “pos” to plot only positive correlations, or to “neg” to plot only negative correlations.

**kwargs : keyword args
any additional keyword arguments for easy_mpl.imshow

Example

>>> from tabulight.eda import EDA
>>> from tabulight import wq_data
>>> vis = EDA(wq_data())
>>> vis.correlation()
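
To make the split argument concrete, here is a minimal sketch that computes pairwise Pearson coefficients and hides the entries of the unwanted sign. How tabulight itself filters the matrix is not stated in this reference, so the masking with None, and the helpers `pearson` and `correlation_matrix`, are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns, split=None):
    """Pairwise correlations; entries of the unwanted sign become None.

    Hypothetical helper: `split` mimics the documented "pos"/"neg" values.
    """
    matrix = {}
    for a in columns:
        for b in columns:
            r = pearson(columns[a], columns[b])
            if split == "pos" and r < 0:
                r = None
            elif split == "neg" and r > 0:
                r = None
            matrix[(a, b)] = r
    return matrix


columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.5],
    "z": [4.0, 3.0, 2.0, 1.0],
}
positive_only = correlation_matrix(columns, split="pos")
negative_only = correlation_matrix(columns, split="neg")
```

Here x and y rise together while z falls as x rises, so the (x, z) entry is masked in `positive_only` and the (x, y) entry is masked in `negative_only`.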

data_availability(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) → Figure

Plots the data as a heatmap that depicts missing values.

Arguments

st : int, str, optional
starting row/index in data to be used for plotting

en : int, str, optional
end row/index in data to be used for plotting

cols : str, list
columns to use to draw the heatmap

figsize : tuple, optional
figure size

**kwargs :
keyword arguments for easy_mpl.imshow

Return

Figure, as annotated in the signature.

Example

>>> from tabulight import EDA, wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.data_availability()
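
A heatmap of missing values is typically drawn from a 0/1 availability matrix. The sketch below shows one way to build such a matrix from plain Python columns; it is an assumption about the general idea, not tabulight's code, and `is_missing` and `availability_matrix` are hypothetical helpers.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (assumed convention)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def availability_matrix(columns):
    """Map each column to a list of 1 (value present) or 0 (value missing)."""
    return {
        name: [0 if is_missing(v) else 1 for v in values]
        for name, values in columns.items()
    }


columns = {"ph": [7.1, None, 6.9], "tide": [float("nan"), 1.2, 1.4]}
avail = availability_matrix(columns)
```

Each row of zeros and ones can then be rendered as one strip of the heatmap, so gaps in a feature appear as contrasting cells.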

grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)

Makes a scatter plot for each feature in the data.

Arguments

st :
starting row/index in data to be used for plotting

en :
end row/index in data to be used for plotting

cols :
columns to use

max_subplots : int, optional
it can be set to a large number to show all the scatter plots on one axis.

kwargs :
keyword arguments for sns.pairplot

heatmap(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) → Figure

Plots the data as a heatmap that depicts missing values.

Arguments

st : int, str, optional
starting row/index in data to be used for plotting

en : int, str, optional
end row/index in data to be used for plotting

cols : str, list
columns to use to draw the heatmap

figsize : tuple, optional
figure size

**kwargs :
keyword arguments for sns.heatmap

Return

Figure, as annotated in the signature.

Example

>>> from tabulight import EDA, wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.heatmap()

property in_cols

lag_plot(n_lags: int | list = 1, cols=None, figsize=None, **kwargs)

Lag plot between an array and its lags.

Arguments

n_lags :
lag step against which to plot the data; it can be an integer or a list of integers

cols :
columns to use

figsize :
figure size

kwargs :
any keyword arguments for axis.scatter
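
A lag plot scatters each value against the value that follows it a fixed number of steps later. The sketch below builds those point pairs for a single lag; `lag_pairs` is a hypothetical helper written for this illustration and is not part of tabulight.

```python
def lag_pairs(values, lag=1):
    """Return (x[t], x[t + lag]) pairs, the points of a lag plot."""
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must be between 1 and len(values) - 1")
    # Pair the series without its last `lag` items with the series
    # shifted forward by `lag` steps.
    return list(zip(values[:-lag], values[lag:]))


series = [3, 5, 4, 6, 8]
pairs = lag_pairs(series, lag=2)
```

When n_lags is a list, the same pairing would simply be repeated once per lag value, one scatter per lag.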

normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: tuple = None)

Plots the statistics of a normality test as bar charts. The statistic for each feature is calculated with the Shapiro-Wilk test (scipy.stats.shapiro), the Anderson-Darling test (scipy.stats.anderson), or the Kolmogorov-Smirnov test.

Arguments

method :
one of “shapiro”, “anderson” or “kolmogorov”; default is “shapiro”

cols :
columns to use

st : optional
start of data to use

en : optional
end of data to use

orientation : optional
orientation of bars

color :
color to use

figsize : tuple, optional
figure size (width, height)

Example

>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(data=wq_data())
>>> eda.normality_test()

property out_cols

parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)

Plots data as parallel coordinates.

Arguments

st :
start of data to be considered

en :
end of data to be considered

cols :
columns from data to be considered.

color :
color or colormap to be used.

**kwargs :
any additional keyword arguments to be passed to easy_mpl.parallel_coordinates

partial_autocorrelation(n_lags: int = 10, cols: list | str = None)

Partial autocorrelation of individual features of the data.

Arguments

n_lags : int, optional
number of lag steps to consider

cols : str, list, optional
columns to use. If not defined, all the columns are used

plot_data(st=None, en=None, freq: str = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)

Plots the data.

Arguments

st : int, str, optional
starting row/index in data to be used for plotting

en : int, str, optional
end row/index in data to be used for plotting

cols : str, list, optional
columns in data to consider for plotting

max_cols_in_plot : int, optional
maximum number of columns in one plot. The number of plots depends on this value and on the number of columns in data.

freq : str, optional
one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’; determines the interval at which the data is plotted. It is valid only for time-series data.

ignore_datetime_index : bool, optional
only valid if the dataframe’s index is a pd.DatetimeIndex. In that case, set this to True to ignore the time index on the x-axis.

**kwargs :
any arguments for the pandas plot method

Example

>>> from tabulight import EDA, wq_data
>>> eda = EDA(wq_data())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)

plot_ecdf(cols=None, figsize=None, **kwargs)

Plots the empirical cumulative distribution function (ECDF).

Arguments

cols :
columns to use

figsize :
figure size

kwargs :
any keyword arguments for axis.plot
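
For readers unfamiliar with the ECDF, the following minimal sketch computes its points for one column: the sorted values on the x-axis and the fraction of observations at or below each value on the y-axis. The helper `ecdf` is hypothetical and is not tabulight's implementation.

```python
def ecdf(values):
    """Return (xs, ys): sorted values and their cumulative proportions."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest value (1-based) has i/n of the data at or below it.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


xs, ys = ecdf([3.0, 1.0, 2.0, 4.0])
```

Plotting `ys` against `xs` as a step or line plot gives the curve; it always ends at 1.0 at the largest observation.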

plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)

Plots distribution of data as histogram.

Arguments

st :
starting index of data to use

en :
end index of data to use

cols :
columns to use

max_subplots : int, optional
maximum number of subplots in one figure

figsize :
figure size

**kwargs :
any keyword arguments for the pandas.DataFrame.hist function

plot_index(st=None, en=None, **kwargs)

Plots the datetime index of the dataframe.

plot_missing(st=None, en=None, cols=None, **kwargs)

Plots the data to indicate missing values.

Arguments

cols : list, str, optional
columns to be used.

st : int, str, optional
starting row/index in data to be used for plotting

en : int, str, optional
end row/index in data to be used for plotting

**kwargs :
keyword arguments such as figsize

Example

>>> from tabulight import EDA, wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.plot_missing()

plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)

Plots principal components.

Arguments

num_pcs :

st :
starting row/index in data to be used for plotting

en :
end row/index in data to be used for plotting

save_as_csv :

figsize :
figure size

kwargs :
passed to sns.pairplot.

probability_plots(cols: str | list = None)

Draws probability plots using scipy.stats.probplot. See scipy distributions.

show_unique_vals(threshold: int = 10, st=None, en=None, cols=None, max_subplots: int = 9, figsize: tuple = None, **kwargs)

Shows the percentage of unique/categorical values in data. Only those columns are used in which the number of unique values is below threshold.

Arguments

threshold : int, optional

st : int, str, optional

en : int, str, optional

cols : str, list, optional

max_subplots : int, optional

figsize : tuple, optional

**kwargs :
any keyword arguments for easy_mpl.pie
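
The column selection described above can be sketched as follows: keep only the columns whose number of distinct values is below threshold, then express each value's count as a percentage. The strict "less than" comparison, the helper name `unique_value_shares` and the sample data are assumptions for this illustration, not tabulight's code.

```python
from collections import Counter


def unique_value_shares(columns, threshold=10):
    """Percentage share of each distinct value, for low-cardinality columns.

    Hypothetical helper; columns with `threshold` or more distinct
    values are skipped.
    """
    shares = {}
    for name, values in columns.items():
        counts = Counter(values)
        if len(counts) < threshold:
            total = len(values)
            shares[name] = {v: 100.0 * c / total for v, c in counts.items()}
    return shares


columns = {
    "site": ["a", "b", "a", "a"],        # 2 distinct values
    "reading": [0.1, 0.2, 0.3, 0.4],     # 4 distinct values
}
shares = unique_value_shares(columns, threshold=3)
```

With threshold=3 only the "site" column qualifies; its percentages are the kind of values that a pie chart per column would display.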

stats(precision=3, inputs=True, outputs=True, st=None, en=None, out_fmt='csv')

Finds the statistics of inputs and outputs and saves them to a file (csv or json, as set by out_fmt).

Arguments

inputs : bool

fpath : str, path like

out_fmt : str
the format in which to save, either csv or json
