nxcals.api.extraction.data.builders.SparkSession

class nxcals.api.extraction.data.builders.SparkSession(sparkContext: SparkContext, jsparkSession: Optional[JavaObject] = None, options: Dict[str, Any] = {})

The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:

Changed in version 3.4.0: Supports Spark Connect.

Examples

Create a Spark session.

>>> spark = (
...     SparkSession.builder
...         .master("local")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )

Create a Spark session with Spark Connect.

>>> spark = (
...     SparkSession.builder
...         .remote("sc://localhost")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )  

Methods

SparkSession.__init__(sparkContext[, ...])

SparkSession.active()

Returns the active or default SparkSession for the current thread, returned by the builder.
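
A minimal sketch (available from Spark 3.5; assumes a session was already created through the builder):

>>> spark = SparkSession.builder.master("local").getOrCreate()
>>> SparkSession.active() is spark
True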

SparkSession.addArtifact(*path[, pyfile, ...])

Add artifact(s) to the client session.

SparkSession.addArtifacts(*path[, pyfile, ...])

Add artifact(s) to the client session.

SparkSession.addTag(tag)

Add a tag to be assigned to all the operations started by this thread in this session.

SparkSession.clearTags()

Clear the current thread's operation tags.

SparkSession.copyFromLocalToFs(local_path, ...)

Copy file from local to cloud storage file system.

SparkSession.createDataFrame(data[, schema, ...])

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
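
For example, building a DataFrame from a list of tuples with an explicit list of column names (the names here are illustrative):

>>> df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
>>> df.show()
+---+-----+
| id| name|
+---+-----+
|  1|alice|
|  2|  bob|
+---+-----+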

SparkSession.getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder.

SparkSession.getTags()

Get the tags that are currently set to be assigned to all the operations started by this thread.
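
A sketch of the tag lifecycle together with addTag, removeTag and clearTags (depending on the Spark version, these calls may only be supported with Spark Connect sessions; the tag name is illustrative):

>>> spark.addTag("nightly-job")
>>> spark.getTags()
{'nightly-job'}
>>> spark.removeTag("nightly-job")
>>> spark.clearTags()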

SparkSession.interruptAll()

Interrupt all operations of this session currently running on the connected server.

SparkSession.interruptOperation(op_id)

Interrupt an operation of this session with the given operationId.

SparkSession.interruptTag(tag)

Interrupt all operations of this session with the given operation tag.

SparkSession.newSession()

Returns a new SparkSession with a separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache.
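
For instance, a temporary view registered in one session is not visible from a sibling session created this way:

>>> spark.range(3).createOrReplaceTempView("nums")
>>> other = spark.newSession()
>>> other.catalog.tableExists("nums")
False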

SparkSession.range(start[, end, step, ...])

Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
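
For example (end is exclusive, so 7 itself is not included):

>>> spark.range(3).collect()
[Row(id=0), Row(id=1), Row(id=2)]
>>> spark.range(1, 7, 2).collect()
[Row(id=1), Row(id=3), Row(id=5)]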

SparkSession.removeTag(tag)

Remove a tag previously added to be assigned to all the operations started by this thread in this session.

SparkSession.sql(sqlQuery[, args])

Returns a DataFrame representing the result of the given query.
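
For example, binding a named parameter through args (parameterized SQL is available from Spark 3.4):

>>> spark.sql(
...     "SELECT id FROM range(10) WHERE id > :bound", args={"bound": 7}
... ).collect()
[Row(id=8), Row(id=9)]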

SparkSession.stop()

Stop the underlying SparkContext.

SparkSession.table(tableName)

Returns the specified table as a DataFrame.
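
A minimal sketch using a temporary view (the view name is illustrative):

>>> spark.createDataFrame([(1,)], ["id"]).createOrReplaceTempView("t")
>>> spark.table("t").collect()
[Row(id=1)]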

Attributes

SparkSession.builder

SparkSession.catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
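
For example, listing what is visible to the session (the exact result depends on what has been registered; a fresh session is assumed here):

>>> spark.range(1).createOrReplaceTempView("tmp_view")
>>> [t.name for t in spark.catalog.listTables()]
['tmp_view']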

SparkSession.client

Gives access to the Spark Connect client.

SparkSession.conf

Runtime configuration interface for Spark.
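
For example, overriding and reading back a runtime setting (the value is illustrative):

>>> spark.conf.set("spark.sql.shuffle.partitions", "8")
>>> spark.conf.get("spark.sql.shuffle.partitions")
'8'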

SparkSession.read

Returns a DataFrameReader that can be used to read data in as a DataFrame.
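
A sketch of typical reads (the paths are hypothetical):

>>> parquet_df = spark.read.parquet("/tmp/events.parquet")
>>> csv_df = spark.read.option("header", True).csv("/tmp/events.csv")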

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
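
For example, using the built-in rate source, which continuously generates rows for testing:

>>> stream_df = spark.readStream.format("rate").load()
>>> stream_df.isStreaming
True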

SparkSession.sparkContext

Returns the underlying SparkContext.

SparkSession.streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.
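
A sketch that starts a throwaway query against the in-memory sink and inspects it through the manager (the query name is illustrative; assumes no other active queries):

>>> query = (
...     spark.readStream.format("rate").load()
...         .writeStream.format("memory")
...         .queryName("rate_sink")
...         .start()
... )
>>> [q.name for q in spark.streams.active]
['rate_sink']
>>> query.stop()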

SparkSession.udf

Returns a UDFRegistration for UDF registration.
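
For example, registering a Python lambda for use from SQL (the function name is illustrative):

>>> plus_one = spark.udf.register("plus_one", lambda x: x + 1, "integer")
>>> spark.sql("SELECT plus_one(41)").collect()
[Row(plus_one(41)=42)]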

SparkSession.udtf

Returns a UDTFRegistration for UDTF registration.
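
A sketch of a Python UDTF registered for SQL use (available from Spark 3.5; the names are illustrative):

>>> from pyspark.sql.functions import udtf
>>> @udtf(returnType="num: int, squared: int")
... class Squares:
...     def eval(self, start: int, end: int):
...         for n in range(start, end + 1):
...             yield (n, n * n)
...
>>> _ = spark.udtf.register("squares", Squares)
>>> spark.sql("SELECT * FROM squares(1, 3)").collect()
[Row(num=1, squared=1), Row(num=2, squared=4), Row(num=3, squared=9)]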

SparkSession.version

The version of Spark on which this application is running.