nxcals.api.extraction.data.builders.DataFrame.summary

DataFrame.summary(*statistics: str) DataFrame

Computes specified statistics for numeric and string columns. Available statistics are: - count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (e.g., 75%)

If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

Added in version 2.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters:

statistics (str, optional) – Column names to calculate statistics by (default All columns).

Returns:

A new DataFrame that provides statistics for the given DataFrame.

Return type:

DataFrame

Notes

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

Examples

>>> df = spark.createDataFrame(
...     [("Bob", 13, 40.3, 150.5), ("Alice", 12, 37.8, 142.3), ("Tom", 11, 44.1, 142.2)],
...     ["name", "age", "weight", "height"],
... )
>>> df.select("age", "weight", "height").summary().show()
+-------+----+------------------+-----------------+
|summary| age|            weight|           height|
+-------+----+------------------+-----------------+
|  count|   3|                 3|                3|
|   mean|12.0| 40.73333333333333|            145.0|
| stddev| 1.0|3.1722757341273704|4.763402145525822|
|    min|  11|              37.8|            142.2|
|    25%|  11|              37.8|            142.2|
|    50%|  12|              40.3|            142.3|
|    75%|  13|              44.1|            150.5|
|    max|  13|              44.1|            150.5|
+-------+----+------------------+-----------------+
>>> df.select("age", "weight", "height").summary("count", "min", "25%", "75%", "max").show()
+-------+---+------+------+
|summary|age|weight|height|
+-------+---+------+------+
|  count|  3|     3|     3|
|    min| 11|  37.8| 142.2|
|    25%| 11|  37.8| 142.2|
|    75%| 13|  44.1| 150.5|
|    max| 13|  44.1| 150.5|
+-------+---+------+------+

See also

DataFrame.display