nxcals.api.extraction.data.builders.SparkSession.createDataFrame
- SparkSession.createDataFrame(data: Iterable[RowLike], schema: Union[List[str], Tuple[str, ...]] = None, samplingRatio: Optional[float] = None) → DataFrame
- SparkSession.createDataFrame(data: RDD[RowLike], schema: Union[List[str], Tuple[str, ...]] = None, samplingRatio: Optional[float] = None) → DataFrame
- SparkSession.createDataFrame(data: Iterable[RowLike], schema: Union[StructType, str], *, verifySchema: bool = True) → DataFrame
- SparkSession.createDataFrame(data: RDD[RowLike], schema: Union[StructType, str], *, verifySchema: bool = True) → DataFrame
- SparkSession.createDataFrame(data: RDD[AtomicValue], schema: Union[AtomicType, str], verifySchema: bool = True) → DataFrame
- SparkSession.createDataFrame(data: Iterable[AtomicValue], schema: Union[AtomicType, str], verifySchema: bool = True) → DataFrame
- SparkSession.createDataFrame(data: PandasDataFrameLike, samplingRatio: Optional[float] = None) → DataFrame
- SparkSession.createDataFrame(data: PandasDataFrameLike, schema: Union[StructType, str], verifySchema: bool = True) → DataFrame
Creates a DataFrame from an RDD, a list or a pandas.DataFrame.
When schema is a list of column names, the type of each column will be inferred from data.
When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later.
If schema inference is needed, samplingRatio is used to determine the ratio of rows used for schema inference. The first row will be used if samplingRatio is None.
New in version 2.0.0.
Changed in version 2.1.0: Added verifySchema.
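The samplingRatio behaviour described above is not exercised in the Examples section below; the following is a minimal, illustrative sketch, assuming the same spark session and sc SparkContext handles used in those examples (the data and the 0.5 ratio are made up):
>>> rdd = sc.parallelize([{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}])
>>> # Sample half of the rows for schema inference instead of relying on the
>>> # first row only (the behaviour when samplingRatio is None).
>>> df = spark.createDataFrame(rdd, samplingRatio=0.5)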
- Parameters:
  - data (RDD or iterable) – an RDD of any kind of SQL data representation (Row, tuple, int, boolean, etc.), or list, or pandas.DataFrame.
  - schema (pyspark.sql.types.DataType, str or list, optional) – a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.
  - samplingRatio (float, optional) – the sample ratio of rows used for inferring the schema.
  - verifySchema (bool, optional) – verify data types of every row against the schema. Enabled by default.
- Return type:
  - DataFrame
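The interplay between a mismatching row and verifySchema can be sketched as follows; this is an illustrative example assuming a running spark session, and the exact exception type and message may vary between Spark versions (recent PySpark releases raise TypeError):
>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> bad = [('Alice', '1')]   # 'age' arrives as a string, not an int
>>> # With the default verifySchema=True every row is checked against the
>>> # schema, so the mismatch is reported while the DataFrame is built.
>>> spark.createDataFrame(bad, schema)
Traceback (most recent call last):
    ...
TypeError: ...
>>> # verifySchema=False skips the per-row type check.
>>> df = spark.createDataFrame(bad, schema, verifySchema=False)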
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
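A minimal sketch of the Arrow-accelerated path mentioned in this note, assuming a running spark session and an environment with pandas and pyarrow installed (the configuration key is the one quoted above; the sample data is made up):
>>> import pandas as pd
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> pdf = pd.DataFrame({"name": ["Alice"], "age": [1]})
>>> # The resulting Spark DataFrame keeps the pandas column names.
>>> df = spark.createDataFrame(pdf)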
Examples
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name='Alice', age=1)]

>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name='Alice')]

>>> rdd = sc.parallelize(l)
>>> spark.createDataFrame(rdd).collect()
[Row(_1='Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]

>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name='Alice', age=1)]

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name='Alice', age=1)]

>>> spark.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]

>>> spark.createDataFrame(rdd, "a: string, b: int").collect()
[Row(a='Alice', b=1)]
>>> rdd = rdd.map(lambda row: row[1])
>>> spark.createDataFrame(rdd, "int").collect()
[Row(value=1)]
>>> spark.createDataFrame(rdd, "boolean").collect()
Traceback (most recent call last):
    ...
Py4JJavaError: ...