java.lang.Object
- cern.nxcals.api.extraction.data.ExtractionUtils

public final class ExtractionUtils
extends java.lang.Object

Field Summary

Fields
Modifier and Type Field Description

static org.apache.avro.Schema STRING_SCHEMA

Method Summary

All Methods Static Methods Concrete Methods
Modifier and Type	Method	Description
`static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>`	`createEmptyDataFrame(org.apache.spark.sql.SparkSession session, java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)`
`static java.util.List<org.apache.avro.Schema.Field>`	`extractSchemaFields(java.lang.String schemaContent)`
`static org.apache.spark.sql.types.DataType`	`getDataTypeFor(org.apache.avro.Schema schema)`
`static java.lang.String`	`getHbaseTypeNameFor(org.apache.avro.Schema schema)`
`static org.apache.spark.sql.types.StructType`	`getStructSchemaFor(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)`
`static java.lang.String`	`getTimestampFieldName(@NonNull SystemSpec systemData)`
`static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Set<Variable>>`	`groupVariablesByType(org.apache.spark.sql.SparkSession spark, java.util.Set<Variable> variables, TimeWindow timeWindow)`	Groups variables by the single Spark `DataType` that covers all their schemas over the time window.
`static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Map<TimeWindow,java.util.Set<Variable>>>`	`groupVariablesByTypeAndTimeWindow(java.util.Set<Variable> variables, TimeWindow timeWindow)`	Groups variables by their Spark DataType and the TimeWindow during which that type is active.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

STRING_SCHEMA

public static final org.apache.avro.Schema STRING_SCHEMA

Method Detail

getTimestampFieldName

public static java.lang.String getTimestampFieldName(@NonNull
                                                     @NonNull SystemSpec systemData)

extractSchemaFields

public static java.util.List<org.apache.avro.Schema.Field> extractSchemaFields(java.lang.String schemaContent)

getDataTypeFor

public static org.apache.spark.sql.types.DataType getDataTypeFor(org.apache.avro.Schema schema)

getHbaseTypeNameFor

public static java.lang.String getHbaseTypeNameFor(org.apache.avro.Schema schema)

getStructSchemaFor

public static org.apache.spark.sql.types.StructType getStructSchemaFor(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)

createEmptyDataFrame

public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> createEmptyDataFrame(org.apache.spark.sql.SparkSession session,
                                                                                          java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)

groupVariablesByTypeAndTimeWindow
```
public static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Map<TimeWindow,java.util.Set<Variable>>> groupVariablesByTypeAndTimeWindow(java.util.Set<Variable> variables,
                                                                                                                                                                 TimeWindow timeWindow)
```
Groups variables by their Spark DataType and the TimeWindow during which that type is active. Variables with schema evolution (e.g. BOOLEAN in the first half of a year, DOUBLE in the second half) are correctly split across multiple (DataType, TimeWindow) groups rather than being dropped or crashing the call. The caller should iterate over the outer map and submit one DataQuery per (DataType, TimeWindow) entry, using the time window from the inner map as the query range.

Parameters:

variables - variables to group

timeWindow - time window for which schemas are resolved

Returns:

map of DataType to a map of TimeWindow to the set of variables that have that type during that window.

groupVariablesByType

public static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Set<Variable>> groupVariablesByType(org.apache.spark.sql.SparkSession spark,
                                                                                                                    java.util.Set<Variable> variables,
                                                                                                                    TimeWindow timeWindow)

Groups variables by the single Spark DataType that covers all their schemas over the time window. Variables with incompatible schema evolution (e.g. BOOLEAN to DOUBLE) are skipped with a warning. This method is used in pivot. It will throw an exception if the types within one Variable and multi VariableConfigs are not compatible.

Class ExtractionUtils

Field Summary