nxcals.api.extraction.data.builders.DataFrame
- class nxcals.api.extraction.data.builders.DataFrame(jdf: JavaObject, sql_ctx: SQLContext | SparkSession)
A distributed collection of data grouped into named columns.
Added in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Examples
A
DataFrame
is equivalent to a relational table in Spark SQL, and can be created using various functions inSparkSession
:>>> people = spark.createDataFrame([ ... {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, ... {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, ... {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, ... {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} ... ])
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in:
DataFrame
,Column
.To select a column from the
DataFrame
, use the apply method:>>> age_col = people.age
A more concrete example:
>>> # To create DataFrame using SparkSession ... department = spark.createDataFrame([ ... {"id": 1, "name": "PySpark"}, ... {"id": 2, "name": "ML"}, ... {"id": 3, "name": "Spark SQL"} ... ])
>>> people.filter(people.age > 30).join( ... department, people.deptId == department.id).groupBy( ... department.name, "gender").agg({"salary": "avg", "age": "max"}).show() +-------+------+-----------+--------+ | name|gender|avg(salary)|max(age)| +-------+------+-----------+--------+ | ML| F| 150.0| 60| |PySpark| M| 75.0| 50| +-------+------+-----------+--------+
Notes
A DataFrame should only be created as described above. It should not be directly created via using the constructor.
Methods
DataFrame.__init__
(jdf, sql_ctx)DataFrame.agg
(*exprs)Aggregate on the entire
DataFrame
without groups (shorthand fordf.groupBy().agg()
).DataFrame.alias
(alias)Returns a new
DataFrame
with an alias set.Calculates the approximate quantiles of numerical columns of a
DataFrame
.Persists the
DataFrame
with the default storage level (MEMORY_AND_DISK_DESER).DataFrame.checkpoint
([eager])Returns a checkpointed version of this
DataFrame
.DataFrame.coalesce
(numPartitions)Returns a new
DataFrame
that has exactly numPartitions partitions.DataFrame.colRegex
(colName)Selects column based on the column name specified as a regex and returns it as
Column
.Returns all the records as a list of
Row
.DataFrame.corr
(col1, col2[, method])Calculates the correlation of two columns of a
DataFrame
as a double value.Returns the number of rows in this
DataFrame
.DataFrame.cov
(col1, col2)Calculate the sample covariance for the given columns, specified by their names, as a double value.
Creates a global temporary view with this
DataFrame
.Creates or replaces a global temporary view using the given name.
Creates or replaces a local temporary view with this
DataFrame
.DataFrame.createTempView
(name)Creates a local temporary view with this
DataFrame
.DataFrame.crossJoin
(other)Returns the cartesian product with another
DataFrame
.DataFrame.crosstab
(col1, col2)Computes a pair-wise frequency table of the given columns.
Create a multi-dimensional cube for the current
DataFrame
using the specified columns, so we can run aggregations on them.DataFrame.describe
(*cols)Computes basic statistics for numeric and string columns.
Returns a new
DataFrame
containing the distinct rows in thisDataFrame
.Returns a new
DataFrame
without specified columns.DataFrame.dropDuplicates
([subset])Return a new
DataFrame
with duplicate rows removed, optionally only considering certain columns.DataFrame.dropDuplicatesWithinWatermark
([subset])Return a new
DataFrame
with duplicate rows removed,DataFrame.drop_duplicates
([subset])drop_duplicates()
is an alias fordropDuplicates()
.DataFrame.dropna
([how, thresh, subset])Returns a new
DataFrame
omitting rows with null values.DataFrame.exceptAll
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
while preserving duplicates.DataFrame.explain
([extended, mode])Prints the (logical and physical) plans to the console for debugging purposes.
Replace null values, alias for
na.fill()
.DataFrame.filter
(condition)Filters rows using the given condition.
Returns the first row as a
Row
.Applies the
f
function to allRow
of thisDataFrame
.Applies the
f
function to each partition of thisDataFrame
.DataFrame.freqItems
(cols[, support])Finding frequent items for columns, possibly with false positives.
Groups the
DataFrame
using the specified columns, so we can run aggregation on them.DataFrame.groupby
(*cols)Returns the first
n
rows.DataFrame.hint
(name, *parameters)Specifies some hint on the current
DataFrame
.Returns a best-effort snapshot of the files that compose this
DataFrame
.DataFrame.intersect
(other)Return a new
DataFrame
containing rows only in both thisDataFrame
and anotherDataFrame
.DataFrame.intersectAll
(other)Return a new
DataFrame
containing rows in both thisDataFrame
and anotherDataFrame
while preserving duplicates.Checks if the
DataFrame
is empty and returns a boolean value.Returns
True
if thecollect()
andtake()
methods can be run locally (without any Spark executors).DataFrame.join
(other[, on, how])Joins with another
DataFrame
, using the given join expression.DataFrame.limit
(num)Limits the result count to the number specified.
DataFrame.localCheckpoint
([eager])Returns a locally checkpointed version of this
DataFrame
.DataFrame.mapInArrow
(func, schema[, barrier])Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a PyArrow's RecordBatch, and returns the result as aDataFrame
.DataFrame.mapInPandas
(func, schema[, barrier])Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a pandas DataFrame, and returns the result as aDataFrame
.DataFrame.melt
(ids, values, ...)Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
DataFrame.observe
(observation, *exprs)Define (named) metrics to observe on the DataFrame.
DataFrame.offset
(num)Returns a new :class: DataFrame by skipping the first n rows.
DataFrame.orderBy
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).DataFrame.pandas_api
([index_col])Converts the existing DataFrame into a pandas-on-Spark DataFrame.
DataFrame.persist
([storageLevel])Sets the storage level to persist the contents of the
DataFrame
across operations after the first time it is computed.DataFrame.printSchema
([level])Prints out the schema in the tree format.
DataFrame.randomSplit
(weights[, seed])Randomly splits this
DataFrame
with the provided weights.Registers this
DataFrame
as a temporary table using the given name.Returns a new
DataFrame
partitioned by the given partitioning expressions.Returns a new
DataFrame
partitioned by the given partitioning expressions.Returns a new
DataFrame
replacing a value with another value.Create a multi-dimensional rollup for the current
DataFrame
using the specified columns, so we can run aggregation on them.DataFrame.sameSemantics
(other)Returns True when the logical query plans inside both
DataFrame
s are equal and therefore return the same results.Returns a sampled subset of this
DataFrame
.DataFrame.sampleBy
(col, fractions[, seed])Returns a stratified sample without replacement based on the fraction given on each stratum.
Projects a set of expressions and returns a new
DataFrame
.Projects a set of SQL expressions and returns a new
DataFrame
.Returns a hash code of the logical query plan against this
DataFrame
.DataFrame.show
([n, truncate, vertical])Prints the first
n
rows to the console.DataFrame.sort
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).DataFrame.sortWithinPartitions
(*cols, **kwargs)Returns a new
DataFrame
with each partition sorted by the specified column(s).DataFrame.subtract
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
.DataFrame.summary
(*statistics)Computes specified statistics for numeric and string columns.
DataFrame.tail
(num)Returns the last
num
rows as alist
ofRow
.DataFrame.take
(num)Returns the first
num
rows as alist
ofRow
.DataFrame.to
(schema)Returns a new
DataFrame
where each row is reconciled to match the specified schema.DataFrame.toDF
(*cols)Returns a new
DataFrame
that with new specified column namesDataFrame.toJSON
([use_unicode])Converts a
DataFrame
into aRDD
of string.DataFrame.toLocalIterator
([prefetchPartitions])Returns an iterator that contains all of the rows in this
DataFrame
.Returns the contents of this
DataFrame
as Pandaspandas.DataFrame
.DataFrame.to_koalas
([index_col])DataFrame.to_pandas_on_spark
([index_col])DataFrame.transform
(func, *args, **kwargs)Returns a new
DataFrame
.DataFrame.union
(other)Return a new
DataFrame
containing the union of rows in this and anotherDataFrame
.DataFrame.unionAll
(other)Return a new
DataFrame
containing the union of rows in this and anotherDataFrame
.DataFrame.unionByName
(other[, ...])Returns a new
DataFrame
containing union of rows in this and anotherDataFrame
.DataFrame.unpersist
([blocking])Marks the
DataFrame
as non-persistent, and remove all blocks for it from memory and disk.DataFrame.unpivot
(ids, values, ...)Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
DataFrame.where
(condition)DataFrame.withColumn
(colName, col)Returns a new
DataFrame
by adding a column or replacing the existing column that has the same name.DataFrame.withColumnRenamed
(existing, new)Returns a new
DataFrame
by renaming an existing column.DataFrame.withColumns
(*colsMap)Returns a new
DataFrame
by adding multiple columns or replacing the existing columns that have the same names.DataFrame.withColumnsRenamed
(colsMap)Returns a new
DataFrame
by renaming multiple columns.DataFrame.withMetadata
(columnName, metadata)Returns a new
DataFrame
by updating an existing column with metadata.DataFrame.withWatermark
(eventTime, ...)Defines an event time watermark for this
DataFrame
.DataFrame.writeTo
(table)Create a write configuration builder for v2 sources.
Attributes
Retrieves the names of all columns in the
DataFrame
as a list.Returns all column names and their data types as a list.
Returns
True
if thisDataFrame
contains one or more sources that continuously return data as it arrives.Returns a
DataFrameNaFunctions
for handling missing values.Returns the content as an
pyspark.RDD
ofRow
.Returns the schema of this
DataFrame
as apyspark.sql.types.StructType
.Returns Spark session that created this
DataFrame
.Returns a
DataFrameStatFunctions
for statistic functions.Get the
DataFrame
's current storage level.Interface for saving the content of the non-streaming
DataFrame
out into external storage.Interface for saving the content of the streaming
DataFrame
out into external storage.