nxcals.api.extraction.data.builders.SparkSession.createDataFrame
- SparkSession.createDataFrame(data: Iterable['RowLike'], schema: Union[List[str], Tuple[str, ...]] = None, samplingRatio: Optional[float] = None) -> DataFrame
- SparkSession.createDataFrame(data: RDD[RowLike], schema: Union[List[str], Tuple[str, ...]] = None, samplingRatio: Optional[float] = None) -> DataFrame
- SparkSession.createDataFrame(data: Iterable['RowLike'], schema: Union[StructType, str], *, verifySchema: bool = True) -> DataFrame
- SparkSession.createDataFrame(data: RDD[RowLike], schema: Union[StructType, str], *, verifySchema: bool = True) -> DataFrame
- SparkSession.createDataFrame(data: RDD[AtomicValue], schema: Union[AtomicType, str], verifySchema: bool = True) -> DataFrame
- SparkSession.createDataFrame(data: Iterable['AtomicValue'], schema: Union[AtomicType, str], verifySchema: bool = True) -> DataFrame
- SparkSession.createDataFrame(data: PandasDataFrameLike, samplingRatio: Optional[float] = None) -> DataFrame
- SparkSession.createDataFrame(data: PandasDataFrameLike, schema: Union[StructType, str], verifySchema: bool = True) -> DataFrame
Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
New in version 2.0.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters:
  - data (RDD or iterable) – an RDD of any kind of SQL data representation (Row, tuple, int, boolean, etc.), or a list, pandas.DataFrame or numpy.ndarray.
  - schema (pyspark.sql.types.DataType, str or list, optional) – a pyspark.sql.types.DataType, a datatype string, or a list of column names; the default is None. The datatype string format is the same as pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.
    When schema is a list of column names, the type of each column will be inferred from data.
    When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.
    When schema is a pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not a pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later.
  - samplingRatio (float, optional) – the sample ratio of rows used for inferring the schema. The first few rows will be used if samplingRatio is None.
  - verifySchema (bool, optional) – verify data types of every row against the schema. Enabled by default.
    New in version 2.1.0.
- Return type:
  DataFrame
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
Examples
Create a DataFrame from a list of tuples.
>>> spark.createDataFrame([('Alice', 1)]).show()
+-----+---+
|   _1| _2|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame from a list of dictionaries.
>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).show()
+---+-----+
|age| name|
+---+-----+
|  1|Alice|
+---+-----+
Create a DataFrame with column names specified.
>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the explicit schema specified.
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([('Alice', 1)], schema).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the schema in DDL formatted string.
>>> spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create an empty DataFrame. When initializing an empty DataFrame in PySpark, it’s mandatory to specify its schema, as the DataFrame lacks data from which the schema can be inferred.
>>> spark.createDataFrame([], "name: string, age: int").show()
+----+---+
|name|age|
+----+---+
+----+---+
Create a DataFrame from Row objects.
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> df = spark.createDataFrame([Person("Alice", 1)])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame from a pandas DataFrame.
>>> import pandas
>>> spark.createDataFrame(df.toPandas()).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).show()
+---+---+
|  0|  1|
+---+---+
|  1|  2|
+---+---+