nxcals.api.extraction.data.builders.DataFrame.union

DataFrame.union(other: DataFrame) → DataFrame

Return a new DataFrame containing the union of rows in this and another DataFrame.

New in version 2.0.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters:

other (DataFrame) – The other DataFrame to union with this one.

Returns:

A new DataFrame containing the combined rows with corresponding columns.

Return type:

DataFrame

Notes

This method performs a SQL-style union of the rows from both DataFrame objects, with no automatic deduplication of rows; it is equivalent to UNION ALL in SQL (see the sketch at the end of these notes).

Use the distinct() method to perform deduplication of rows.

The method resolves columns by position (not by name), following the standard behavior in SQL.
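
For instance, a minimal sketch of the UNION ALL correspondence, assuming the same interactive spark session used in the examples below (the temporary view names t1 and t2 are chosen here purely for illustration):

>>> df1 = spark.createDataFrame([(1, 'A')], ['id', 'value'])
>>> df2 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
>>> df1.union(df2).count()  # the duplicate row (1, 'A') is kept
3
>>> df1.createOrReplaceTempView("t1")
>>> df2.createOrReplaceTempView("t2")
>>> spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").count()
3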

Examples

Example 1: Combining two DataFrames with the same schema

>>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'value'])
>>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
>>> df3 = df1.union(df2)
>>> df3.show()
+---+-----+
| id|value|
+---+-----+
|  1|    A|
|  2|    B|
|  3|    C|
|  4|    D|
+---+-----+

Example 2: Combining two DataFrames with different schemas

>>> from pyspark.sql.functions import lit
>>> df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
>>> df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", "name"])
>>> df1 = df1.withColumn("age", lit(30))
>>> df2 = df2.withColumn("age", lit(40))
>>> df3 = df1.union(df2)
>>> df3.show()
+-----+-------+---+
| name|     id|age|
+-----+-------+---+
|Alice|      1| 30|
|  Bob|      2| 30|
|    3|Charlie| 40|
|    4|   Dave| 40|
+-----+-------+---+
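
The interleaved values above are a consequence of the by-position resolution. When columns should be matched by name instead, unionByName() is the usual alternative; a short sketch reusing the df1 and df2 from this example:

>>> df4 = df1.unionByName(df2)
>>> df4.show()
+-------+---+---+
|   name| id|age|
+-------+---+---+
|  Alice|  1| 30|
|    Bob|  2| 30|
|Charlie|  3| 40|
|   Dave|  4| 40|
+-------+---+---+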

Example 3: Combining two DataFrames with mismatched columns

>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
>>> df2 = spark.createDataFrame([(3, 4)], ["C", "D"])
>>> df3 = df1.union(df2)
>>> df3.show()
+---+---+
|  A|  B|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
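
The silent by-position match above works only because both DataFrames have the same number of columns; Spark rejects the union when the column counts differ. A minimal sketch of that failure mode (the concrete exception class and message vary across Spark versions, hence the broad except here):

>>> df1 = spark.createDataFrame([(1, 2)], ["A", "B"])
>>> df2 = spark.createDataFrame([(3,)], ["C"])
>>> try:
...     df1.union(df2).show()
... except Exception:
...     print("union rejected: column counts differ")
union rejected: column counts differ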

Example 4: Combining duplicate rows from two different DataFrames

>>> df1 = spark.createDataFrame([(1, 'A'), (2, 'B'), (3, 'C')], ['id', 'value'])
>>> df2 = spark.createDataFrame([(3, 'C'), (4, 'D')], ['id', 'value'])
>>> df3 = df1.union(df2).distinct().sort("id")
>>> df3.show()
+---+-----+
| id|value|
+---+-----+
|  1|    A|
|  2|    B|
|  3|    C|
|  4|    D|
+---+-----+
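
For contrast, a sketch of the same union without distinct(), reusing the df1 and df2 from Example 4; the duplicate row (3, C) is kept:

>>> df1.union(df2).sort("id").show()
+---+-----+
| id|value|
+---+-----+
|  1|    A|
|  2|    B|
|  3|    C|
|  3|    C|
|  4|    D|
+---+-----+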