nxcals.api.extraction.data.builders.DataFrame

class nxcals.api.extraction.data.builders.DataFrame(jdf: JavaObject, sql_ctx: Union[SQLContext, SparkSession])

A distributed collection of data grouped into named columns.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Examples

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

>>> people = spark.createDataFrame([
...     {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
...     {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
...     {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
...     {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
... ])

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame, Column.

To select a column from the DataFrame, use the apply method:

>>> age_col = people.age

A more concrete example:

>>> # To create DataFrame using SparkSession
... department = spark.createDataFrame([
...     {"id": 1, "name": "PySpark"},
...     {"id": 2, "name": "ML"},
...     {"id": 3, "name": "Spark SQL"}
... ])

>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg({"salary": "avg", "age": "max"}).show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|     ML|     F|      150.0|      60|
|PySpark|     M|       75.0|      50|
+-------+------+-----------+--------+

Notes

A DataFrame should only be created as described above. It should not be directly created via using the constructor.

Methods

`DataFrame.__init__`(jdf, sql_ctx)
`DataFrame.agg`(*exprs)	Aggregate on the entire `DataFrame` without groups (shorthand for `df.groupBy().agg()`).
`DataFrame.alias`(alias)	Returns a new `DataFrame` with an alias set.
`DataFrame.approxQuantile`()	Calculates the approximate quantiles of numerical columns of a `DataFrame`.
`DataFrame.cache`()	Persists the `DataFrame` with the default storage level (MEMORY_AND_DISK).
`DataFrame.checkpoint`([eager])	Returns a checkpointed version of this `DataFrame`.
`DataFrame.coalesce`(numPartitions)	Returns a new `DataFrame` that has exactly numPartitions partitions.
`DataFrame.colRegex`(colName)	Selects column based on the column name specified as a regex and returns it as `Column`.
`DataFrame.collect`()	Returns all the records as a list of `Row`.
`DataFrame.corr`(col1, col2[, method])	Calculates the correlation of two columns of a `DataFrame` as a double value.
`DataFrame.count`()	Returns the number of rows in this `DataFrame`.
`DataFrame.cov`(col1, col2)	Calculate the sample covariance for the given columns, specified by their names, as a double value.
`DataFrame.createGlobalTempView`(name)	Creates a global temporary view with this `DataFrame`.
`DataFrame.createOrReplaceGlobalTempView`(name)	Creates or replaces a global temporary view using the given name.
`DataFrame.createOrReplaceTempView`(name)	Creates or replaces a local temporary view with this `DataFrame`.
`DataFrame.createTempView`(name)	Creates a local temporary view with this `DataFrame`.
`DataFrame.crossJoin`(other)	Returns the cartesian product with another `DataFrame`.
`DataFrame.crosstab`(col1, col2)	Computes a pair-wise frequency table of the given columns.
`DataFrame.cube`()	Create a multi-dimensional cube for the current `DataFrame` using the specified columns, so we can run aggregations on them.
`DataFrame.describe`(*cols)	Computes basic statistics for numeric and string columns.
`DataFrame.distinct`()	Returns a new `DataFrame` containing the distinct rows in this `DataFrame`.
`DataFrame.drop`()	Returns a new `DataFrame` without specified columns.
`DataFrame.dropDuplicates`([subset])	Return a new `DataFrame` with duplicate rows removed, optionally only considering certain columns.
`DataFrame.dropDuplicatesWithinWatermark`([subset])	Return a new `DataFrame` with duplicate rows removed,
`DataFrame.drop_duplicates`([subset])	`drop_duplicates()` is an alias for `dropDuplicates()`.
`DataFrame.dropna`([how, thresh, subset])	Returns a new `DataFrame` omitting rows with null values.
`DataFrame.exceptAll`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame` while preserving duplicates.
`DataFrame.explain`([extended, mode])	Prints the (logical and physical) plans to the console for debugging purposes.
`DataFrame.fillna`()	Replace null values, alias for `na.fill()`.
`DataFrame.filter`(condition)	Filters rows using the given condition.
`DataFrame.first`()	Returns the first row as a `Row`.
`DataFrame.foreach`(f)	Applies the `f` function to all `Row` of this `DataFrame`.
`DataFrame.foreachPartition`(f)	Applies the `f` function to each partition of this `DataFrame`.
`DataFrame.freqItems`(cols[, support])	Finding frequent items for columns, possibly with false positives.
`DataFrame.groupBy`()	Groups the `DataFrame` using the specified columns, so we can run aggregation on them.
`DataFrame.groupby`(*cols)	`groupby()` is an alias for `groupBy()`.
`DataFrame.head`()	Returns the first `n` rows.
`DataFrame.hint`(name, *parameters)	Specifies some hint on the current `DataFrame`.
`DataFrame.inputFiles`()	Returns a best-effort snapshot of the files that compose this `DataFrame`.
`DataFrame.intersect`(other)	Return a new `DataFrame` containing rows only in both this `DataFrame` and another `DataFrame`.
`DataFrame.intersectAll`(other)	Return a new `DataFrame` containing rows in both this `DataFrame` and another `DataFrame` while preserving duplicates.
`DataFrame.isEmpty`()	Checks if the `DataFrame` is empty and returns a boolean value.
`DataFrame.isLocal`()	Returns `True` if the `collect()` and `take()` methods can be run locally (without any Spark executors).
`DataFrame.join`(other[, on, how])	Joins with another `DataFrame`, using the given join expression.
`DataFrame.limit`(num)	Limits the result count to the number specified.
`DataFrame.localCheckpoint`([eager])	Returns a locally checkpointed version of this `DataFrame`.
`DataFrame.mapInArrow`(func, schema[, barrier])	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a PyArrow's RecordBatch, and returns the result as a `DataFrame`.
`DataFrame.mapInPandas`(func, schema[, barrier])	Maps an iterator of batches in the current `DataFrame` using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a `DataFrame`.
`DataFrame.melt`(ids, values, ...)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.observe`(observation, *exprs)	Define (named) metrics to observe on the DataFrame.
`DataFrame.offset`(num)	Returns a new :class: DataFrame by skipping the first n rows.
`DataFrame.orderBy`(cols, *kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.pandas_api`([index_col])	Converts the existing DataFrame into a pandas-on-Spark DataFrame.
`DataFrame.persist`([storageLevel])	Sets the storage level to persist the contents of the `DataFrame` across operations after the first time it is computed.
`DataFrame.printSchema`([level])	Prints out the schema in the tree format.
`DataFrame.randomSplit`(weights[, seed])	Randomly splits this `DataFrame` with the provided weights.
`DataFrame.registerTempTable`(name)	Registers this `DataFrame` as a temporary table using the given name.
`DataFrame.repartition`()	Returns a new `DataFrame` partitioned by the given partitioning expressions.
`DataFrame.repartitionByRange`()	Returns a new `DataFrame` partitioned by the given partitioning expressions.
`DataFrame.replace`()	Returns a new `DataFrame` replacing a value with another value.
`DataFrame.rollup`()	Create a multi-dimensional rollup for the current `DataFrame` using the specified columns, so we can run aggregation on them.
`DataFrame.sameSemantics`(other)	Returns True when the logical query plans inside both `DataFrame`s are equal and therefore return the same results.
`DataFrame.sample`()	Returns a sampled subset of this `DataFrame`.
`DataFrame.sampleBy`(col, fractions[, seed])	Returns a stratified sample without replacement based on the fraction given on each stratum.
`DataFrame.select`()	Projects a set of expressions and returns a new `DataFrame`.
`DataFrame.selectExpr`()	Projects a set of SQL expressions and returns a new `DataFrame`.
`DataFrame.semanticHash`()	Returns a hash code of the logical query plan against this `DataFrame`.
`DataFrame.show`([n, truncate, vertical])	Prints the first `n` rows to the console.
`DataFrame.sort`(cols, *kwargs)	Returns a new `DataFrame` sorted by the specified column(s).
`DataFrame.sortWithinPartitions`(cols, *kwargs)	Returns a new `DataFrame` with each partition sorted by the specified column(s).
`DataFrame.subtract`(other)	Return a new `DataFrame` containing rows in this `DataFrame` but not in another `DataFrame`.
`DataFrame.summary`(*statistics)	Computes specified statistics for numeric and string columns.
`DataFrame.tail`(num)	Returns the last `num` rows as a `list` of `Row`.
`DataFrame.take`(num)	Returns the first `num` rows as a `list` of `Row`.
`DataFrame.to`(schema)	Returns a new `DataFrame` where each row is reconciled to match the specified schema.
`DataFrame.toDF`(*cols)	Returns a new `DataFrame` that with new specified column names
`DataFrame.toJSON`([use_unicode])	Converts a `DataFrame` into a `RDD` of string.
`DataFrame.toLocalIterator`([prefetchPartitions])	Returns an iterator that contains all of the rows in this `DataFrame`.
`DataFrame.toPandas`()	Returns the contents of this `DataFrame` as Pandas `pandas.DataFrame`.
`DataFrame.to_koalas`([index_col])
`DataFrame.to_pandas_on_spark`([index_col])
`DataFrame.transform`(func, args, *kwargs)	Returns a new `DataFrame`.
`DataFrame.union`(other)	Return a new `DataFrame` containing the union of rows in this and another `DataFrame`.
`DataFrame.unionAll`(other)	Return a new `DataFrame` containing the union of rows in this and another `DataFrame`.
`DataFrame.unionByName`(other[, ...])	Returns a new `DataFrame` containing union of rows in this and another `DataFrame`.
`DataFrame.unpersist`([blocking])	Marks the `DataFrame` as non-persistent, and remove all blocks for it from memory and disk.
`DataFrame.unpivot`(ids, values, ...)	Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
`DataFrame.where`(condition)	`where()` is an alias for `filter()`.
`DataFrame.withColumn`(colName, col)	Returns a new `DataFrame` by adding a column or replacing the existing column that has the same name.
`DataFrame.withColumnRenamed`(existing, new)	Returns a new `DataFrame` by renaming an existing column.
`DataFrame.withColumns`(*colsMap)	Returns a new `DataFrame` by adding multiple columns or replacing the existing columns that have the same names.
`DataFrame.withColumnsRenamed`(colsMap)	Returns a new `DataFrame` by renaming multiple columns.
`DataFrame.withMetadata`(columnName, metadata)	Returns a new `DataFrame` by updating an existing column with metadata.
`DataFrame.withWatermark`(eventTime, ...)	Defines an event time watermark for this `DataFrame`.
`DataFrame.writeTo`(table)	Create a write configuration builder for v2 sources.

Attributes

`DataFrame.columns`	Retrieves the names of all columns in the `DataFrame` as a list.
`DataFrame.dtypes`	Returns all column names and their data types as a list.
`DataFrame.isStreaming`	Returns `True` if this `DataFrame` contains one or more sources that continuously return data as it arrives.
`DataFrame.na`	Returns a `DataFrameNaFunctions` for handling missing values.
`DataFrame.rdd`	Returns the content as an `pyspark.RDD` of `Row`.
`DataFrame.schema`	Returns the schema of this `DataFrame` as a `pyspark.sql.types.StructType`.
`DataFrame.sparkSession`	Returns Spark session that created this `DataFrame`.
`DataFrame.sql_ctx`
`DataFrame.stat`	Returns a `DataFrameStatFunctions` for statistic functions.
`DataFrame.storageLevel`	Get the `DataFrame`'s current storage level.
`DataFrame.write`	Interface for saving the content of the non-streaming `DataFrame` out into external storage.
`DataFrame.writeStream`	Interface for saving the content of the streaming `DataFrame` out into external storage.