nxcals.api.extraction.data.builders.DataFrame.join

DataFrame.join(other: DataFrame, on: str | List[str] | Column | List[Column] | None = None, how: str | None = None) DataFrame

Joins with another DataFrame, using the given join expression.

Added in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters:
  • other (DataFrame) – Right side of the join

  • on (str, list or Column, optional) – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.

  • how (str, optional) – default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

Returns:

Joined DataFrame.

Return type:

DataFrame

Examples

The following performs a full outer join between df1 and df2.

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import desc
>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name")
>>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
>>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")])
>>> df4 = spark.createDataFrame([
...     Row(age=10, height=80, name="Alice"),
...     Row(age=5, height=None, name="Bob"),
...     Row(age=None, height=None, name="Tom"),
...     Row(age=None, height=None, name=None),
... ])

Inner join on columns (default)

>>> df.join(df2, 'name').select(df.name, df2.height).show()
+----+------+
|name|height|
+----+------+
| Bob|    85|
+----+------+
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).show()
+----+---+
|name|age|
+----+---+
| Bob|  5|
+----+---+

Outer join for both DataFrames on the ‘name’ column.

>>> df.join(df2, df.name == df2.name, 'outer').select(
...     df.name, df2.height).sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+
>>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Tom|    80|
|  Bob|    85|
|Alice|  NULL|
+-----+------+

Outer join for both DataFrams with multiple columns.

>>> df.join(
...     df3,
...     [df.name == df3.name, df.age == df3.age],
...     'outer'
... ).select(df.name, df3.age).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+