Welcome to tabulight’s documentation!
API Reference
- class tabulight.EDA(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)
Bases: Plot
Performs a comprehensive exploratory data analysis on tabular/structured data. It is meant to be a one-stop shop for EDA.
Methods
data_availability
box_plot
plot_missing
plot_histograms
plot_index
plot_data
plot_pcs
grouped_scatter
correlation
stats
autocorrelation
partial_autocorrelation
probability_plots
lag_plot
plot_ecdf
normality_test
parallel_coordinates
show_unique_vals
- Example:
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(data=wq_data())
>>> eda()  # to plot all available plots with a single line
- __init__(data: DataFrame | List[DataFrame] | Dict | ndarray, in_cols=None, out_cols=None, path=None, dpi=300, save=False, show=True)
Arguments
- data : DataFrame, array, dict, list
a dataframe, a list of dataframes, a dictionary whose values are dataframes, or a numpy array
- in_cols : str, list, optional
columns to consider as input features
- out_cols : str, optional
columns to consider as output features
- path : str, optional
the path where the figures are saved. If not given, plots will be saved in the ‘data’ folder in the current working directory.
- save : bool, optional
whether to save the plots or not
- show : bool, optional
whether to show the plots or not
- dpi : int, optional
the resolution with which to save the image
- autocorrelation(n_lags: int = 10, cols: list | str = None, figsize: tuple = None)
Autocorrelation of individual features of data.
Arguments
- n_lags : int, optional
number of lag steps to consider
- cols : str, list, optional
columns to use. If not defined, all the columns are used.
- figsize : tuple, optional
figure size
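To illustrate what an autocorrelation over `n_lags` lag steps measures, the sketch below computes the sample autocorrelation function with NumPy. The helper `sample_acf` is hypothetical and not part of tabulight; how the library computes these values internally is not stated in this reference.

```python
import numpy as np

def sample_acf(x, n_lags=10):
    """Sample autocorrelation of a 1-D series for lags 0..n_lags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()              # remove the mean before correlating
    denom = np.dot(x, x)          # lag-0 sum of squares, used as normalizer
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(n_lags + 1)])

# An alternating series is strongly anti-correlated with its one-step lag.
series = [1, -1, 1, -1, 1, -1, 1, -1]
acf = sample_acf(series, n_lags=2)
```

The value at lag 0 is always 1; values near zero at higher lags suggest little linear dependence on the past.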
- box_plot(st=None, en=None, cols: list | str = None, violen=False, normalize=True, figsize=(12, 8), max_features=8, show_datapoints=False, freq=None, **kwargs)
Plots a box-and-whisker or violin plot of data.
Arguments
- st : optional
starting row/index in data to be used for plotting
- en : optional
end row/index in data to be used for plotting
- cols : list
the names of columns from data to be plotted.
- normalize :
If True, then each feature/column is rescaled between 0 and 1.
- figsize :
figure size
- freq : str
one of ‘weekly’, ‘monthly’, ‘yearly’. If given, box plot will be plotted for these intervals.
- max_features : int
maximum number of features to appear in one plot.
- violen : bool
if True, a violin plot is drawn, otherwise a box-and-whisker plot
- show_datapoints : bool
if True, sns.swarmplot() will be plotted. This can be time consuming for bigger data.
- **kwargs :
any args for seaborn.boxplot, seaborn.violinplot or seaborn.swarmplot.
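The `normalize` option rescales each column between 0 and 1 so that features with different units fit on one axis. A minimal sketch of such a min-max rescaling with pandas follows; `minmax_scale` is a hypothetical helper for illustration, and the exact formula tabulight applies is an assumption.

```python
import pandas as pd

def minmax_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range independently."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has zero range and becomes NaN here.
    return (df - col_min) / col_range

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
scaled = minmax_scale(frame)
```

After rescaling, both columns span the same range even though their raw values differ by an order of magnitude.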
- correlation(remove_targets=False, st=None, en=None, cols=None, method: str = 'pearson', split: str = None, **kwargs)
Plots correlation between features.
Arguments
- remove_targets : bool, optional
whether to remove the output/target column or not
- st :
starting row/index in data to be used for plotting
- en :
end row/index in data to be used for plotting
- cols :
columns to use
- method : str, optional
one of {pearson, spearman, kendall, covariance}, by default pearson
- split : str
To plot only positive correlations, set it to “pos”; to plot only negative correlations, set it to “neg”.
- **kwargs : keyword args
Any additional keyword arguments for easy_mpl.imshow
Example
>>> from tabulight.eda import EDA
>>> from tabulight import wq_data
>>> vis = EDA(wq_data())
>>> vis.correlation()
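The effect of `method` and `split` can be pictured with plain pandas. The sketch below is an assumption about the behaviour described above, not tabulight's implementation; `split_correlation` is a hypothetical helper that masks the entries the plot would leave out.

```python
import pandas as pd

def split_correlation(df, method="pearson", split=None):
    """Correlation matrix, optionally keeping only positive or negative entries."""
    corr = df.cov() if method == "covariance" else df.corr(method=method)
    if split == "pos":
        return corr.where(corr > 0)   # non-positive entries become NaN
    if split == "neg":
        return corr.where(corr < 0)   # non-negative entries become NaN
    return corr

frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
pos = split_correlation(frame, split="pos")
neg = split_correlation(frame, split="neg")
```

Here `a` and `b` rise together while `c` falls, so the pair (`a`, `b`) survives only in `pos` and the pair (`a`, `c`) only in `neg`.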
- data_availability(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) -> Figure
Plots data as heatmap which depicts missing values.
Arguments
- st : int, str, optional
starting row/index in data to be used for plotting
- en : int, str, optional
end row/index in data to be used for plotting
- cols : str, list
columns to use to draw heatmap
- figsize : tuple, optional
figure size
- **kwargs :
Keyword arguments for easy_mpl.imshow
Return
None
Example
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.data_availability()
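A heatmap of missing values is essentially a picture of a presence/absence matrix. The following sketch builds such a matrix with pandas on made-up data; the column names are illustrative and the way tabulight prepares the matrix internally is an assumption.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "temp": [20.1, np.nan, 21.3, 22.0],
    "ph": [7.1, 7.0, np.nan, np.nan],
})

# 1 where a value is present, 0 where it is missing
availability = frame.notna().astype(int)

# Fraction of available values per column
coverage = availability.mean()
```

Passing a matrix like `availability` to an image plot gives one row per record and one column per feature, with gaps visible as a contrasting colour.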
- grouped_scatter(cols=None, st=None, en=None, max_subplots: int = 8, **kwargs)
Makes a scatter plot for each feature in data.
Arguments
- st :
starting row/index in data to be used for plotting
- en :
end row/index in data to be used for plotting
- cols :
- max_subplots : int, optional
it can be set to a large number to show all the scatter plots on one axis.
- **kwargs :
keyword arguments for sns.pairplot
- heatmap(st=None, en=None, cols=None, figsize: tuple = None, **kwargs) -> Figure
Plots data as heatmap which depicts missing values.
Arguments
- st : int, str, optional
starting row/index in data to be used for plotting
- en : int, str, optional
end row/index in data to be used for plotting
- cols : str, list
columns to use to draw heatmap
- figsize : tuple, optional
figure size
- **kwargs :
Keyword arguments for sns.heatmap
Return
None
Example
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.heatmap()
- property in_cols
- lag_plot(n_lags: int | list = 1, cols=None, figsize=None, **kwargs)
lag plot between an array and its lags
Arguments
- n_lags :
lag step against which to plot the data, it can be integer or a list of integers
- cols :
columns to use
- figsize :
figure size
- **kwargs :
any keyword arguments for axis.scatter
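A lag plot scatters each value against the value `n_lags` steps later. The sketch below shows how the two aligned arrays can be built with NumPy; `lag_pairs` is a hypothetical helper and not a tabulight function.

```python
import numpy as np

def lag_pairs(x, lag=1):
    """Return (x[t], x[t + lag]) pairs as two aligned arrays."""
    x = np.asarray(x)
    if lag < 1 or lag >= len(x):
        raise ValueError("lag must be between 1 and len(x) - 1")
    return x[:-lag], x[lag:]

current, lagged = lag_pairs([1, 2, 3, 4, 5], lag=2)
```

Scattering `current` against `lagged` gives the lag plot for that lag; a list of lags simply repeats this once per entry.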
- normality_test(method='shapiro', cols=None, st=None, en=None, orientation='h', color=None, figsize: tuple = None)
Plots the statistics of a normality test as bar charts. The statistics for each feature are calculated with either the Shapiro-Wilk test, the Anderson-Darling test, or the Kolmogorov-Smirnov test; the first two use the scipy.stats.shapiro and scipy.stats.anderson functions respectively.
Arguments
- method :
one of “shapiro”, “anderson”, or “kolmogorov”; default is “shapiro”
- cols :
columns to use
- st : optional
start of data
- en : optional
end of data to use
- orientation : optional
orientation of bars
- color :
color to use
- figsize : tuple, optional
figure size (width, height)
Example
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(data=wq_data())
>>> eda.normality_test()
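For reference, the two SciPy functions named above can be called directly on a single feature. This is a sketch on synthetic data; the sample and variable names are illustrative, and it does not show how tabulight aggregates the statistics into bars.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

# Shapiro-Wilk: a statistic close to 1 is consistent with normality
shapiro_stat, shapiro_p = stats.shapiro(sample)

# Anderson-Darling against the normal distribution: smaller is closer to normal
anderson_stat = stats.anderson(sample, dist="norm").statistic
```

The two statistics are on different scales, so they are best compared across features within one method rather than across methods.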
- property out_cols
- parallel_corrdinates(cols=None, st=None, en=100, color=None, **kwargs)
Plots data as parallel coordinates.
Arguments
- st :
start of data to be considered
- en :
end of data to be considered
- cols :
columns from data to be considered.
- color :
color or colormap to be used.
- **kwargs :
any additional keyword arguments to be passed to easy_mpl.parallel_coordinates
- partial_autocorrelation(n_lags: int = 10, cols: list | str = None)
Partial autocorrelation of individual features of data
Arguments
- n_lags : int, optional
number of lag steps to consider
- cols : str, list, optional
columns to use. If not defined, all the columns are used.
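The partial autocorrelation at lag k is the correlation that remains between a value and its k-th lag after the shorter lags are accounted for. One common way to estimate it, sketched here with NumPy, is to regress the series on its first k lags and keep the last coefficient. `pacf_ols` is a hypothetical helper; the estimator tabulight uses is not stated in this reference.

```python
import numpy as np

def pacf_ols(x, n_lags=10):
    """Partial autocorrelation for lags 0..n_lags via least-squares regression."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    out = [1.0]                                   # lag 0 is 1 by convention
    for k in range(1, n_lags + 1):
        # Columns are the series lagged by 1..k, aligned with the target x[k:]
        design = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(design, x[k:], rcond=None)
        out.append(coef[-1])                      # coefficient on the k-th lag
    return np.array(out)

# Simulated AR(1) process: only the first lag should matter
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = 0.6 * ar1[t - 1] + noise[t]

pacf = pacf_ols(ar1, n_lags=3)
```

For this simulated process the lag-1 value lands near the coefficient 0.6 and the higher lags stay near zero, which is the signature that distinguishes partial from plain autocorrelation.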
- plot_data(st=None, en=None, freq: str = None, cols=None, max_cols_in_plot: int = 10, ignore_datetime_index=False, **kwargs)
Plots the data.
Arguments
- st : int, str, optional
starting row/index in data to be used for plotting
- en : int, str, optional
end row/index in data to be used for plotting
- cols : str, list, optional
columns in data to consider for plotting
- max_cols_in_plot : int, optional
Maximum number of columns in one plot. The maximum number of plots depends upon this value and the number of columns in data.
- freq : str, optional
one of ‘daily’, ‘weekly’, ‘monthly’, ‘yearly’; determines the interval of the plot of data. It is valid only for time-series data.
- ignore_datetime_index : bool, optional
only valid if the dataframe’s index is pd.DatetimeIndex. In such a case, if you want to ignore the time index on the x-axis, set this to True.
- **kwargs :
any arguments for the pandas plot method
Example
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> eda = EDA(wq_data())
>>> eda.plot_data(subplots=True, figsize=(12, 14), sharex=True)
>>> eda.plot_data(freq='monthly', subplots=True, figsize=(12, 14), sharex=True)
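With `freq` set, the data is plotted interval by interval. One way to picture this is to split a time-indexed frame into calendar chunks, as in the pandas sketch below. The grouping strategy and the column name `flow` are assumptions made for illustration, not tabulight's internals.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=60, freq="D")
frame = pd.DataFrame({"flow": np.arange(60.0)}, index=index)

# Group rows by (year, month) so each chunk could be drawn as its own plot
monthly_chunks = {
    key: chunk
    for key, chunk in frame.groupby([frame.index.year, frame.index.month])
}
```

Sixty days starting on 1 January 2020 yield two chunks, one for January and one for February.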
- plot_ecdf(cols=None, figsize=None, **kwargs)
Plots the empirical cumulative distribution function.
Arguments
- cols :
columns to use
- figsize :
- **kwargs :
any keyword argument for axis.plot
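The empirical cumulative distribution function gives, for each observed value, the fraction of observations that are less than or equal to it. A minimal NumPy sketch follows; `ecdf` is a hypothetical helper and not the library's own code.

```python
import numpy as np

def ecdf(x):
    """Return sorted values and their cumulative probabilities."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

values, probs = ecdf([3, 1, 2, 4])
```

Plotting `probs` against `values` as a step or line gives the ECDF curve.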
- plot_histograms(st=None, en=None, cols=None, max_subplots: int = 40, figsize: tuple = (20, 14), **kwargs)
Plots distribution of data as histogram.
Arguments
- st :
starting index of data to use
- en :
end index of data to use
- cols :
columns to use
- max_subplots : int, optional
maximum number of subplots in one figure
- figsize :
figure size
- **kwargs :
any keyword argument for the pandas.DataFrame.hist function
- plot_index(st=None, en=None, **kwargs)
plots the datetime index of dataframe
- plot_missing(st=None, en=None, cols=None, **kwargs)
plot data to indicate missingness in data
Arguments
- cols : list, str, optional
columns to be used.
- st : int, str, optional
starting row/index in data to be used for plotting
- en : int, str, optional
end row/index in data to be used for plotting
- **kwargs :
Keyword Args such as figsize
Example
>>> from tabulight import EDA
>>> from tabulight import wq_data
>>> data = wq_data()
>>> vis = EDA(data)
>>> vis.plot_missing()
- plot_pcs(num_pcs=None, st=None, en=None, save_as_csv=False, figsize=(12, 8), **kwargs)
Plots principal components.
Arguments
- num_pcs :
- st :
starting row/index in data to be used for plotting
- en :
end row/index in data to be used for plotting
- save_as_csv :
- figsize :
- **kwargs :
will go to sns.pairplot.
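Principal components are the directions of largest variance in the centred data. The sketch below derives them with a singular value decomposition in NumPy. `principal_components` is a hypothetical helper, and which decomposition tabulight relies on is not stated here.

```python
import numpy as np

def principal_components(X, num_pcs=None):
    """Project data on its leading principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each feature
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)               # variance share per component
    k = len(s) if num_pcs is None else num_pcs
    scores = Xc @ vt[:k].T                        # coordinates in PC space
    return scores, explained[:k]

# Points on a straight line: one component carries all the variance
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
scores, explained = principal_components(points, num_pcs=1)
```

The columns of `scores` are what a pairwise scatter of components would display.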
- probability_plots(cols: str | list = None)
Draws a probability plot using scipy.stats.probplot. See scipy distributions.
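The underlying SciPy call can be tried on its own, as in the sketch below on synthetic data. The choice of the normal distribution here is an assumption; this reference does not say which distributions tabulight passes to the function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)

# Theoretical quantiles, ordered sample values, and the least-squares fit
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)
```

Points that fall close to the fitted line, with `r` near 1, indicate that the sample is consistent with the chosen distribution.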
- show_unique_vals(threshold: int = 10, st=None, en=None, cols=None, max_subplots: int = 9, figsize: tuple = None, **kwargs)
Shows percentage of unique/categorical values in data. Only those columns are used in which unique values are below threshold.
Arguments
- threshold : int, optional
- st : int, str, optional
- en : int, str, optional
- cols : str, list, optional
- max_subplots : int, optional
- figsize : tuple, optional
- **kwargs :
Any keyword arguments for easy_mpl.pie
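The column selection and percentages described above can be sketched with pandas as follows. `unique_value_shares` is a hypothetical helper, and treating "below threshold" as a strict comparison is an assumption.

```python
import pandas as pd

def unique_value_shares(df, threshold=10):
    """Percentage of each unique value, for columns with few unique values."""
    shares = {}
    for col in df.columns:
        if df[col].nunique() < threshold:         # assumed strict comparison
            shares[col] = df[col].value_counts(normalize=True) * 100
    return shares

frame = pd.DataFrame({
    "site": ["a", "a", "b", "b"],
    "depth": [0.1, 0.2, 0.3, 0.4],
})
shares = unique_value_shares(frame, threshold=3)
```

Only `site` qualifies here, since `depth` has four distinct values; each entry of `shares` holds the slice sizes a pie chart would need.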
- stats(precision=3, inputs=True, outputs=True, st=None, en=None, out_fmt='csv')
Finds the stats of inputs and outputs and saves them to a file in csv or json format.
Arguments
- inputs : bool
- fpath : str, path like
- out_fmt : str
in which format to save: csv or json
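As a rough picture of summary statistics rounded to a given `precision` and serialized as csv or json, the pandas sketch below may help. Which statistics tabulight actually reports is not listed here, so `describe()` stands in as an assumption.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
precision = 3

# Count, mean, std, min, quartiles and max per column, rounded
summary = frame.describe().round(precision)

# With no path given, both methods return the serialized text
as_csv = summary.to_csv()
as_json = summary.to_json()
```

Passing a file path to `to_csv` or `to_json` writes the same content to disk instead of returning a string.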
Examples
Scripts