nxcals.api.extraction.data.builders.DataFrame.groupBy
- DataFrame.groupBy(*cols: ColumnOrName) → GroupedData
- DataFrame.groupBy(__cols: Union[List[Column], List[str]]) → GroupedData
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
groupby() is an alias for groupBy().
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters:
cols (list, str or Column) – columns to group by. Each element should be a column name (string), an expression (Column), or a list of them.
- Returns:
Grouped data by given columns.
- Return type:
GroupedData
Examples
>>> df = spark.createDataFrame([
...     (2, "Alice"), (2, "Bob"), (2, "Bob"), (5, "Bob")], schema=["age", "name"])
Empty grouping columns trigger a global aggregation.
>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+
Group-by ‘name’, and specify a dictionary to calculate the summation of ‘age’.
>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show()
+-----+--------+
| name|sum(age)|
+-----+--------+
|Alice|       2|
|  Bob|       9|
+-----+--------+
Group-by ‘name’, and calculate maximum values.
>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
Group-by ‘name’ and ‘age’, and calculate the number of rows in each group.
>>> df.groupBy(["name", df.age]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+
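For intuition, the grouping semantics used in the examples above can be sketched in plain Python, without Spark. This is only a conceptual model and not how Spark executes a group-by. The helper names `group_rows`, `aggregate`, and `mean` are hypothetical and are not part of any Spark API. The rows are the same four (age, name) pairs used above.

```python
from collections import defaultdict

# Same rows as in the examples above, written as plain dictionaries.
rows = [
    {"age": 2, "name": "Alice"},
    {"age": 2, "name": "Bob"},
    {"age": 2, "name": "Bob"},
    {"age": 5, "name": "Bob"},
]


def group_rows(rows, keys):
    """Bucket rows by the tuple of values found in the `keys` columns.

    An empty `keys` list puts every row into one bucket keyed by (),
    which mirrors the global aggregation of df.groupBy().
    (Hypothetical helper, for illustration only.)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


def aggregate(groups, column, func):
    """Apply `func` to the list of `column` values in each bucket."""
    return {
        key: func([row[column] for row in bucket])
        for key, bucket in groups.items()
    }


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Counterparts of the four examples above.
global_avg = aggregate(group_rows(rows, []), "age", mean)
sum_by_name = aggregate(group_rows(rows, ["name"]), "age", sum)
max_by_name = aggregate(group_rows(rows, ["name"]), "age", max)
count_by_name_age = {
    key: len(bucket)
    for key, bucket in group_rows(rows, ["name", "age"]).items()
}
```

Each result is a dictionary keyed by the tuple of grouping values, so the empty tuple `()` stands for the single global group and `("Bob", 2)` for the group with name Bob and age 2.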