nxcals.api.extraction.data.builders.SparkSession

class nxcals.api.extraction.data.builders.SparkSession(sparkContext: SparkContext, jsparkSession: Optional[JavaObject] = None, options: Dict[str, Any] = {})

The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:

builder

A class attribute having a Builder to construct SparkSession instances.

Examples

>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> from datetime import datetime
>>> from pyspark.sql import Row
>>> spark = SparkSession(sc)
>>> allTypes = sc.parallelize([Row(i=1, s="string", d=1.0, l=1,
...     b=True, list=[1, 2, 3], dict={"s": 0}, row=Row(a=1),
...     time=datetime(2014, 8, 1, 14, 1, 5))])
>>> df = allTypes.toDF()
>>> df.createOrReplaceTempView("allTypes")
>>> spark.sql('select i+1, d+1, not b, list[1], dict["s"], time, row.a '
...            'from allTypes where b and i > 0').collect()
[Row((i + 1)=2, (d + 1)=2.0, (NOT b)=False, list[1]=2,         dict[s]=0, time=datetime.datetime(2014, 8, 1, 14, 1, 5), a=1)]
>>> df.rdd.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time, x.row.a, x.list)).collect()
[(1, 'string', 1.0, 1, True, datetime.datetime(2014, 8, 1, 14, 1, 5), 1, [1, 2, 3])]

Methods

SparkSession.__init__(sparkContext[, ...])

SparkSession.createDataFrame()

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.

SparkSession.getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder

SparkSession.newSession()

Returns a new SparkSession as new session, that has separate SQLConf, registered temporary views and UDFs, but shared SparkContext and table cache.

SparkSession.range(start[, end, step, ...])

Create a DataFrame with single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.

SparkSession.sql(sqlQuery, **kwargs)

Returns a DataFrame representing the result of the given query.

SparkSession.stop()

Stop the underlying SparkContext.

SparkSession.table(tableName)

Returns the specified table as a DataFrame.

Attributes

SparkSession.builder

A class attribute having a Builder to construct SparkSession instances.

SparkSession.catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.

SparkSession.conf

Runtime configuration interface for Spark.

SparkSession.read

Returns a DataFrameReader that can be used to read data in as a DataFrame.

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.

SparkSession.sparkContext

Returns the underlying SparkContext.

SparkSession.streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

SparkSession.udf

Returns a UDFRegistration for UDF registration.

SparkSession.version

The version of Spark on which this application is running.