nxcals.api.extraction.data.builders.DataFrame.dropDuplicates

DataFrame.dropDuplicates(subset: List[str] | None = None) DataFrame

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.

For a static batch DataFrame, it just drops duplicate rows. For a streaming DataFrame, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark() to limit how late the duplicate data can be and the system will accordingly limit the state. In addition, data older than watermark will be dropped to avoid any possibility of duplicates.

drop_duplicates() is an alias for dropDuplicates().

Added in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters:

subset (List of column names, optional) – List of columns to use for duplicate comparison (default All columns).

Returns:

DataFrame without duplicates.

Return type:

DataFrame

Examples

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=10, height=80)
... ])

Deduplicate the same rows.

>>> df.dropDuplicates().show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
|Alice| 10|    80|
+-----+---+------+

Deduplicate values on ‘name’ and ‘height’ columns.

>>> df.dropDuplicates(['name', 'height']).show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
+-----+---+------+