Extracting field types
Depending on the information available in a DataFrame, a field's data type can be determined in several ways. Selected DataFrames will be used below to illustrate data types for scalar, vector and matrix fields.
A complete list of Spark data types is available in the Apache Spark documentation.
Example of a scalar data type
Required libraries for the sample code:
```python
from nxcals.api.extraction.data.builders import DataQuery
from pyspark.sql.functions import col
```
```java
import cern.nxcals.api.extraction.data.builders.DataQuery;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructField;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

import static org.apache.spark.sql.functions.col;
```
Creating a DataFrame based on an entity having scalar fields:
```python
df1 = DataQuery.builder(spark).byEntities().system('CMW') \
    .startTime('2018-04-29 00:00:00.000').endTime('2018-04-30 00:00:00.000') \
    .entity().keyValues({'device': 'LHC.LUMISCAN.DATA', 'property': 'CrossingAngleIP1'}) \
    .build()
```
```java
Map<String, Object> keyValues = new HashMap<>();
keyValues.put("device", "LHC.LUMISCAN.DATA");
keyValues.put("property", "CrossingAngleIP1");

Dataset<Row> df1 = DataQuery.builder(spark).byEntities().system("CMW")
        .startTime("2018-04-29 00:00:00.000").endTime("2018-04-30 00:00:00.000")
        .entity().keyValues(keyValues)
        .build();
```
To visualise the schema (including the DataFrame column types), the following method can be used:
```python
df1.printSchema()  # prints the schema in a tree format
```
```java
df1.printSchema();  // prints the schema in a tree format
```
Expected application output:
```text
root
 |-- DeltaCrossingAngle: double (nullable = true)
 |-- Moving: long (nullable = true)
 |-- __record_timestamp__: long (nullable = true)
 |-- __record_version__: long (nullable = true)
 |-- acqStamp: long (nullable = true)
 |-- class: string (nullable = true)
 |-- cyclestamp: long (nullable = true)
 |-- device: string (nullable = true)
 |-- property: string (nullable = true)
 |-- selector: string (nullable = true)
 |-- nxcals_entity_id: long (nullable = true)
```
To retrieve that information as an array of all column names with their data types, dtypes can be used:
```python
df1.dtypes  # DataFrame property returning names and data types of all the columns
```
```java
df1.dtypes();  // DataFrame method returning names and data types of all the columns
```
Expected application output:
```text
[('DeltaCrossingAngle', 'double'),
 ('Moving', 'bigint'),
 ('__record_timestamp__', 'bigint'),
 ('__record_version__', 'bigint'),
 ('acqStamp', 'bigint'),
 ('class', 'string'),
 ('cyclestamp', 'bigint'),
 ('device', 'string'),
 ('property', 'string'),
 ('selector', 'string'),
 ('nxcals_entity_id', 'bigint')]
```
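The returned (name, type) tuples can be used, for example, to select only the columns of a given type. A minimal sketch in Python, using the df1 DataFrame created above (the type-name filter is just an illustration):
```python
# Keep only the columns whose simple type name denotes a numeric scalar
numeric_cols = [name for name, dtype in df1.dtypes if dtype in ('double', 'bigint')]
df1.select(numeric_cols).printSchema()
```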
To obtain the Spark data type, we can refer to the DataFrame schema:
```python
df1.schema  # DataFrame property returning its schema as StructType(List(StructField(name, Spark dataType, nullable), ...))
```
```java
df1.schema();  // DataFrame method returning its schema as StructType(List(StructField(name, Spark dataType, nullable), ...))
```
Expected application output:
```text
StructType(List(
    StructField(DeltaCrossingAngle,DoubleType,true),
    StructField(Moving,LongType,true),
    StructField(__record_timestamp__,LongType,true),
    StructField(__record_version__,LongType,true),
    StructField(acqStamp,LongType,true),
    StructField(class,StringType,true),
    StructField(cyclestamp,LongType,true),
    StructField(device,StringType,true),
    StructField(property,StringType,true),
    StructField(selector,StringType,true),
    StructField(nxcals_entity_id,LongType,true)
))
```
or directly to the DataFrame schema fields:
```python
df1.schema.fields  # schema property returning a list of StructField(name, Spark dataType, nullable)
```
```java
df1.schema().fields();  // schema method returning an array of StructField(name, Spark dataType, nullable)
```
Expected application output:
```text
[StructField(DeltaCrossingAngle,DoubleType,true),
 StructField(Moving,LongType,true),
 StructField(__record_timestamp__,LongType,true),
 StructField(__record_version__,LongType,true),
 StructField(acqStamp,LongType,true),
 StructField(class,StringType,true),
 StructField(cyclestamp,LongType,true),
 StructField(device,StringType,true),
 StructField(property,StringType,true),
 StructField(selector,StringType,true),
 StructField(nxcals_entity_id,LongType,true)]
```
For convenience, we can create a field name <-> Spark data type mapping:
```python
# Getting data types from the schema as a dictionary
d = {f.name: f.dataType for f in df1.schema.fields}
d['Moving']
```
```java
// Getting data types from the schema as a map
Map<String, DataType> d = Arrays.stream(df1.schema().fields())
        .collect(Collectors.toMap(StructField::name, StructField::dataType));
d.get("Moving");
```
Expected application output:
```text
LongType
```
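Such a mapping can then be used to branch on a field's type programmatically, e.g. via isinstance checks against the Spark type classes. A minimal sketch:
```python
from pyspark.sql.types import LongType

# Branch on the Spark data type of a field taken from the mapping above
if isinstance(d['Moving'], LongType):
    print("'Moving' holds 64-bit integer values")
```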
Example of a vector data type
Vector and matrix data are expressed in NXCALS as a complex type composed of two ArrayType fields: one holding the vector/matrix data (called elements) and one describing the "shape" of the data through a list of dimensions. The concept is illustrated by the sample code below.
Creating a DataFrame containing vector data:
```python
df2 = DataQuery.builder(spark).byVariables().system('CMW') \
    .startTime('2018-05-21 00:00:00.000').endTime('2018-05-21 00:05:00.000') \
    .variable('SPS.BCTDC.51895:TOTAL_INTENSITY') \
    .build()
```
```java
Dataset<Row> df2 = DataQuery.builder(spark).byVariables().system("CMW")
        .startTime("2018-05-21 00:00:00.000").endTime("2018-05-21 00:05:00.000")
        .variable("SPS.BCTDC.51895:TOTAL_INTENSITY")
        .build();
```
having the following schema:
```python
df2.schema
```
```java
df2.schema();
```
Expected application output:
```text
StructType(List(
    StructField(nxcals_value,StructType(List(
        StructField(elements,ArrayType(DoubleType,true),true),
        StructField(dimensions,ArrayType(IntegerType,true),true)
    )),true),
    StructField(nxcals_entity_id,LongType,true),
    StructField(nxcals_timestamp,LongType,true),
    StructField(nxcals_variable_name,StringType,true)
))
```
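The nested structure can also be inspected programmatically: in Python, a StructType can be indexed by field name, which gives access to the element and dimension types. A minimal sketch:
```python
# Access the nested value struct and the types of its two array fields
value_type = df2.schema['nxcals_value'].dataType
print(value_type['elements'].dataType)    # ArrayType(DoubleType,true)
print(value_type['dimensions'].dataType)  # ArrayType(IntegerType,true)
```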
Selecting the first 3 vectors (elements):
```python
elements = df2.withColumn("nx_elements", col("nxcals_value.elements")) \
    .withColumn("nx_dimensions", col("nxcals_value.dimensions")) \
    .select("nx_elements")
elements.take(3)
```
```java
Dataset<Row> elements = df2.withColumn("nx_elements", col("nxcals_value.elements"))
        .withColumn("nx_dimensions", col("nxcals_value.dimensions"))
        .select("nx_elements");
elements.take(3);
```
Expected application output:
```text
[Row(nx_elements=[0.2579849, 0.28976566, 0.30659077, 0.29101196, 0.27730262, 0.27481002, 0.25362283, ... , 0.27543315]),
 Row(nx_elements=[0.22745048, 0.24552187, 0.24302925, 0.23118937, 0.2374209, 0.24552187, 0.22620416, ... , 0.24302925]),
 Row(nx_elements=[61.52985, 1040.4697, 1529.6321, 1572.7029, 1562.2429, 1557.9358, 1555.4746, 1554.244, ... , 2.461194])]
```
and their corresponding sizes (note the alternative notation for referencing field names):
```python
dimensions = df2.withColumn("nx_elements", col("nxcals_value")["elements"]) \
    .withColumn("nx_dimensions", col("nxcals_value")["dimensions"]) \
    .select("nx_dimensions")
dimensions.take(3)
```
```java
Dataset<Row> dimensions = df2.withColumn("nx_elements", col("nxcals_value").getField("elements"))
        .withColumn("nx_dimensions", col("nxcals_value").getField("dimensions"))
        .select("nx_dimensions");
dimensions.take(3);
```
Expected application output:
```text
[Row(nx_dimensions=[1228]), Row(nx_dimensions=[1228]), Row(nx_dimensions=[1836])]
```
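For one-dimensional data, these values should match the actual lengths of the elements arrays, which can be cross-checked server-side with Spark's size function. A minimal sketch:
```python
from pyspark.sql.functions import col, size

# Length of each elements array, computed server-side; for vectors it
# should equal the single value stored in nxcals_value.dimensions
df2.select(size(col("nxcals_value.elements")).alias("n_elements")).show(3)
```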
Example of a matrix data type
Creating a DataFrame containing matrix data:
```python
df3 = DataQuery.builder(spark).byVariables().system('CMW') \
    .startTime('2018-08-15 00:00:00.000').endTime('2018-08-30 00:00:00.000') \
    .variable('HIE-BCAM-T2M03:RAWMEAS#NPIXELS') \
    .build()
```
```java
Dataset<Row> df3 = DataQuery.builder(spark).byVariables().system("CMW")
        .startTime("2018-08-15 00:00:00.000").endTime("2018-08-30 00:00:00.000")
        .variable("HIE-BCAM-T2M03:RAWMEAS#NPIXELS")
        .build();
```
Retrieving matrix data present in the DataFrame:
```python
matrices = df3.withColumn("matrix", col("nxcals_value.elements")) \
    .withColumn("dim1", col("nxcals_value.dimensions")[0]) \
    .withColumn("dim2", col("nxcals_value.dimensions")[1]) \
    .select("matrix", "dim1", "dim2")
matrices.take(2)
```
```java
Dataset<Row> matrices = df3
        .withColumn("matrix", col("nxcals_value.elements"))
        .withColumn("dim1", col("nxcals_value.dimensions").getItem(0))
        .withColumn("dim2", col("nxcals_value.dimensions").getItem(1))
        .select("matrix", "dim1", "dim2");
matrices.take(2);
```
Expected application output:
```text
[Row(matrix=[14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 10, 17, 15, 0, 0, ... , 0], dim1=100, dim2=10),
 Row(matrix=[14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 10, 16, 13, 0, 0, ... , 0], dim1=100, dim2=10)]
```
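The dim1 and dim2 values allow the flattened matrix to be restored to its original two-dimensional shape on the client side. A minimal sketch, assuming NumPy is available locally:
```python
import numpy as np

# Reshape the flat element list of the first collected row back into
# its original two-dimensional (dim1 x dim2) form
row = matrices.first()
matrix_2d = np.array(row.matrix).reshape(row.dim1, row.dim2)
print(matrix_2d.shape)  # (100, 10)
```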