Exploratory data analysis
The eda module provides a collection of functions for exploratory data analysis.
card_summary | Generates a summary card and plots for specified categories from a DataFrame.
heatmap_analysis | Generates a heatmap to analyze the relationship between two categorical variables in a DataFrame.
histogram_analysis | Generates a histogram plot for data distribution across a specified axis, optionally segmented by categories.
duration_analysis | Generates a box plot visualizing the distribution of durations across different categories.
daily_distribution_analysis | Analyzes and visualizes the daily distribution of samples by categories.
duration_distribution | Generates a distribution plot for the 'duration' column in the provided DataFrame.
export_file_names_summary_pdf_leec | Export a summary PDF report with analysis of file names for LEEC project.
- maui.eda.card_summary(df, categories, show_plot=True)
Generates a summary card and plots for specified categories from a DataFrame. The function computes the number of samples, the number of distinct days, and the total and mean duration (in minutes) of the samples. It also adds one statistic for each specified category to the summary and the visualization. If show_plot is True, a Plotly figure is generated that displays these statistics alongside the specified categories.
- Parameters:
- df : pandas.DataFrame
The input DataFrame containing at least the following columns: ‘file_path’, ‘dt’, and ‘duration’. Additional columns should match the specified categories if any.
- categories : list of str
A list of category names (column names in df) to include in the summary and plot. At most two categories can be specified.
- show_plot : bool, optional
If True (default), the function will generate and show a Plotly plot representing the calculated statistics and specified categories. If False, no plot will be displayed.
- Returns:
- tuple
Returns a tuple containing:
- card_dict (dict): A dictionary with keys for ‘n_samples’, ‘distinct_days’, ‘total_time_duration’, ‘mean_time_duration’, and one key per category specified. The values are the respective computed statistics.
- fig (plotly.graph_objs._figure.Figure): A Plotly figure object with indicators for each of the statistics and categories specified. Only returned if show_plot is True.
- Raises:
- Exception
If more than two categories are specified, an exception is raised due to plotting limitations.
Notes
The function is designed to work with data pertaining to durations and occurrences across different categories. It’s particularly useful for analyzing time series or event data. The ‘duration’ column is expected to be in seconds.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> categories = ['landscape', 'environment']
>>> card_dict, fig = eda.card_summary(df, categories)
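The statistics described above can be approximated with plain pandas. The block below is a minimal sketch, not maui's implementation: the toy data, the `summarize` helper, and the choice to store the number of distinct values under each category key are all assumptions made for illustration. It raises `ValueError` (a subclass of `Exception`) when more than two categories are passed.

```python
import pandas as pd

# Hypothetical toy data with the required columns; all values are made up.
df = pd.DataFrame({
    "file_path": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "dt": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"]),
    "duration": [60.0, 120.0, 30.0, 90.0],  # seconds
    "landscape": ["forest", "forest", "pasture", "pasture"],
})


def summarize(df, categories):
    """Compute card-style statistics; durations go from seconds to minutes."""
    if len(categories) > 2:
        raise ValueError("At most two categories can be specified.")
    card = {
        "n_samples": len(df),
        "distinct_days": df["dt"].dt.normalize().nunique(),
        "total_time_duration": df["duration"].sum() / 60,
        "mean_time_duration": df["duration"].mean() / 60,
    }
    # Assumption: each category key holds the count of distinct values.
    for category in categories:
        card[category] = df[category].nunique()
    return card


card = summarize(df, ["landscape"])
print(card)
```

With this toy data, the four samples span three distinct days and total 300 seconds, so the total duration is 5 minutes and the mean is 1.25 minutes.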
- maui.eda.daily_distribution_analysis(df, date_column, category_column, show_plot=True)
Analyzes and visualizes the daily distribution of samples by categories.
This function generates a histogram that shows the distribution of samples over days, separated by a specified category. It provides insights into how the frequency of samples varies daily and according to the categories within the specified category column.
- Parameters:
- df : pandas.DataFrame
The DataFrame containing the data to be analyzed. It must include the specified date_column and category_column.
- date_column : str
The name of the column in df that contains date information. The values in this column should be in a date or datetime format.
- category_column : str
The name of the column in df that contains categorical data, which will be used to color the bars in the histogram.
- show_plot : bool, optional
If True (default), the function will display the generated plot. If False, the plot will not be displayed but will still be returned.
- Returns:
- plotly.graph_objs._figure.Figure
A Plotly figure object representing the histogram of daily sample distribution by the specified category. The histogram bars are colored based on the categories in the category_column.
Notes
The function leverages Plotly for plotting, thus ensuring interactive plots that can be further explored in a web browser. It’s particularly useful for time series data where understanding the distribution of events or samples over time and across different categories is crucial.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> fig = eda.daily_distribution_analysis(df, 'dt', 'landscape')
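The counts behind such a histogram can be inspected as a table. The following is a minimal pandas sketch under assumed toy data; it illustrates the per-day, per-category counting and is not taken from maui's source.

```python
import pandas as pd

# Hypothetical toy data; timestamps and category labels are made up.
df = pd.DataFrame({
    "dt": pd.to_datetime([
        "2023-01-01 06:00", "2023-01-01 18:00",
        "2023-01-01 19:00", "2023-01-02 06:00",
    ]),
    "landscape": ["forest", "forest", "pasture", "pasture"],
})

# Reduce each timestamp to its calendar day, then count samples
# for every (day, category) pair; missing pairs become 0.
daily_counts = (
    df.assign(day=df["dt"].dt.strftime("%Y-%m-%d"))
      .groupby(["day", "landscape"])
      .size()
      .unstack(fill_value=0)
)
print(daily_counts)
```

Each row of `daily_counts` is a day and each column a category, which corresponds to the colored bar segments for that day.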
- maui.eda.duration_analysis(df, category_column, duration_column, show_plot=True)
Generates a box plot visualizing the distribution of durations across different categories.
This function takes a DataFrame and creates a box plot to analyze the distribution of durations (or any numerical data) across specified categories. The box plot provides a visual representation of the central tendency, dispersion, and skewness of the data and identifies outliers.
- Parameters:
- df : pandas.DataFrame
The DataFrame containing the data to be analyzed. It should include at least two columns: one for the category and one for the duration (or any numerical data to be analyzed).
- category_column : str
The name of the column in df that contains the categorical data. This column will be used to group the numerical data into different categories for the box plot.
- duration_column : str
The name of the column in df that contains the numerical data to be analyzed. This data will be distributed into boxes according to the categories specified by category_column.
- show_plot : bool, optional
If True (default), the function will display the generated box plot. If False, the plot will not be displayed, but the figure object will still be returned.
- Returns:
- plotly.graph_objs._figure.Figure
The generated Plotly figure object containing the box plot. This object can be used for further customization or to display the plot at a later time if show_plot is False.
Notes
The box plot generated by this function can help identify the range, interquartile range, median, and potential outliers within each category. This visual analysis is crucial for understanding the distribution characteristics of numerical data across different groups.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> fig = eda.duration_analysis(df, 'landscape', 'duration')
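The quantities a box plot summarizes can also be computed numerically. The sketch below uses pandas on made-up data and flags outliers with the common 1.5 × IQR convention; that convention and the quantile interpolation are assumptions here and may differ from what the plotted figure uses.

```python
import pandas as pd

# Hypothetical toy data; durations are in seconds and made up.
df = pd.DataFrame({
    "landscape": ["forest"] * 5 + ["pasture"] * 5,
    "duration": [10, 20, 30, 40, 50, 58, 60, 60, 62, 300],
})

stats = {}
for name, group in df.groupby("landscape")["duration"]:
    # Quartiles with pandas' default linear interpolation.
    q1, median, q3 = group.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    stats[name] = {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Values outside the fences are treated as potential outliers.
        "outliers": group[(group < lower) | (group > upper)].tolist(),
    }

for name, values in stats.items():
    print(name, values)
```

In this toy data the 300-second recording in the "pasture" group falls far above the upper fence, so it is the only value reported as a potential outlier.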
- maui.eda.duration_distribution(df, time_unit='s', show_plot=True)
Generates a distribution plot for the ‘duration’ column in the provided DataFrame.
This function creates a distribution plot, including a histogram and a kernel density estimate (KDE), for the ‘duration’ column in the input DataFrame. It is designed to give a visual understanding of the distribution of duration values across the dataset.
- Parameters:
- df : pandas.DataFrame
The DataFrame containing the data to be analyzed. It must include a column named ‘duration’, which contains numeric data.
- time_unit : str, optional
The time unit of the audio duration column. It is shown in the visualization to make the unit explicit. Defaults to ‘s’.
- show_plot : bool, optional
If True (default), the function will display the generated plot. If False, the plot will not be displayed but will still be returned.
- Returns:
- plotly.graph_objs._figure.Figure
A Plotly figure object representing the distribution plot of the ‘duration’ column. The plot includes both a histogram of the data and a kernel density estimate (KDE) curve.
Notes
The function uses Plotly’s create_distplot function from the plotly.figure_factory module, offering a detailed visual representation of data distribution. It’s particularly useful for analyzing the spread and skewness of numeric data. The KDE curve provides insight into the probability density of the durations, complementing the histogram’s discrete bins.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> fig = eda.duration_distribution(df)
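To make the idea of a kernel density estimate concrete, here is a minimal Gaussian KDE written with the standard library only. It is an illustrative sketch, not the routine Plotly uses internally: the `gaussian_kde` helper, the fixed bandwidth of 5 seconds, and the sample durations are all assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical durations in seconds; values are made up.
durations = [55.0, 58.0, 60.0, 60.0, 61.0, 63.0, 120.0]
density = gaussian_kde(durations, bandwidth=5.0)

# A density integrates to 1; check this numerically on a grid from 0 to 200.
step = 0.5
grid = [i * step for i in range(401)]
area = sum(density(x) for x in grid) * step

print(f"density at 60 s: {density(60.0):.4f}")
print(f"density at 90 s: {density(90.0):.6f}")
print(f"area under curve: {area:.4f}")
```

The estimate is high near 60 seconds, where most samples cluster, and close to zero at 90 seconds, which is the kind of structure the KDE curve adds on top of the histogram bins.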
- maui.eda.export_file_names_summary_pdf_leec(df, file_name, analysis_title=None, width=210)
Export a summary PDF report with analysis of file names for LEEC project.
- Parameters:
- df : pandas.DataFrame
DataFrame containing the data to be analyzed.
- file_name : str
Name of the output PDF file.
- analysis_title : str, optional
Title of the analysis section in the PDF.
- width : int, optional
Width of the PDF document in millimeters.
- Returns:
- None
Notes
This function exports a summary PDF report with various analyses of file names for the LEEC project. It includes landscape analysis, environment analysis, and duration analysis. The PDF is created using the provided DataFrame df and saved with the specified file_name.
Examples
>>> from maui import eda
>>> eda.export_file_names_summary_pdf_leec(df, 'summary_report.pdf', analysis_title='Audio Files Analysis')
- maui.eda.heatmap_analysis(df, x_axis, y_axis, color_continuous_scale='Viridis', show_plot=True, **kwargs)
Generates a heatmap to analyze the relationship between two categorical variables in a DataFrame.
This function groups the data by the specified x_axis and y_axis categories, counts the occurrences of each group, and then creates a heatmap visualization of these counts using Plotly Express. The heatmap intensity is determined by the count of occurrences, with an option to customize the color scale.
- Parameters:
- df : pandas.DataFrame
The input DataFrame containing the data to be analyzed. Must include the columns specified by x_axis and y_axis, as well as a ‘file_path’ column used for counting occurrences.
- x_axis : str
The name of the column in df to be used as the x-axis in the heatmap.
- y_axis : str
The name of the column in df to be used as the y-axis in the heatmap.
- color_continuous_scale : str, optional
The name of the color scale to use for the heatmap. Defaults to ‘Viridis’. For more options, refer to Plotly’s documentation on color scales.
- show_plot : bool, optional
If True (default), displays the heatmap plot. If False, the plot is not displayed but is still returned.
- **kwargs : dict
Additional arguments for plot customization, such as height and width.
- Returns:
- tuple
A tuple containing:
- df_group (pandas.DataFrame): A DataFrame with the grouped counts for each combination of x_axis and y_axis values.
- fig (plotly.graph_objs._figure.Figure): A Plotly figure object containing the heatmap.
Notes
The ‘file_path’ column in the input DataFrame is used to count occurrences of each group formed by the specified x_axis and y_axis values. This function is useful for visualizing the distribution and relationship between two categorical variables.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> df_group, fig = eda.heatmap_analysis(df, 'landscape', 'environment')
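The grouping step described in the notes can be sketched in pandas as follows. This is an illustration under assumed toy data, not maui's code; in particular, the output column name `count` and the pivot into a matrix are choices made for this example.

```python
import pandas as pd

# Hypothetical toy data; file names and category labels are made up.
df = pd.DataFrame({
    "file_path": ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"],
    "landscape": ["forest", "forest", "forest", "pasture", "pasture"],
    "environment": ["dry", "dry", "swamp", "dry", "swamp"],
})

# Count 'file_path' entries for each (x_axis, y_axis) combination.
df_group = (
    df.groupby(["landscape", "environment"], as_index=False)["file_path"]
      .count()
      .rename(columns={"file_path": "count"})  # assumed column name
)

# Arrange the counts as a y-by-x matrix, the shape a heatmap displays.
matrix = df_group.pivot(index="environment", columns="landscape", values="count")
print(matrix)
```

Each cell of `matrix` is the number of files for one pair of categories, which is the value that drives the color intensity in the heatmap.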
- maui.eda.histogram_analysis(df, x_axis, category_column, show_plot=True)
Generates a histogram plot for data distribution across a specified axis, optionally segmented by categories.
This function creates a histogram to visualize the distribution of data in df along the x_axis, with data optionally segmented by category_column. The histogram’s appearance, such as opacity and bar gap, is customizable. The plot is generated using Plotly Express and can be displayed in the notebook or IDE if show_plot is set to True.
- Parameters:
- df : pandas.DataFrame
The DataFrame containing the data to plot. Must include the columns specified by x_axis and category_column.
- x_axis : str
The name of the column in df to be used for the x-axis of the histogram.
- category_column : str
The name of the column in df that contains categorical data for segmenting the histogram. Each category will be represented with a different color.
- show_plot : bool, optional
If True (default), the generated plot will be immediately displayed. If False, the plot will not be displayed but will still be returned by the function.
- Returns:
- plotly.graph_objs._figure.Figure
The Plotly figure object for the generated histogram. This object can be further customized or saved after the function returns.
Notes
This function is designed to offer a quick and convenient way to visualize the distribution of data in a DataFrame along a specified axis. It is particularly useful for exploratory data analysis and for identifying patterns or outliers in dataset segments.
Examples
>>> from maui import samples, eda
>>> df = samples.get_audio_sample(dataset="leec")
>>> fig = eda.histogram_analysis(df, 'landscape', 'environment')