Package cern.nxcals.api.extraction.data
Class ExtractionUtils
- java.lang.Object
-
- cern.nxcals.api.extraction.data.ExtractionUtils
-
public final class ExtractionUtils extends java.lang.Object
-
-
Field Summary
Fields Modifier and Type Field Description static org.apache.avro.SchemaSTRING_SCHEMA
-
Method Summary
All Methods Static Methods Concrete Methods Modifier and Type Method Description static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>createEmptyDataFrame(org.apache.spark.sql.SparkSession session, java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)static java.util.List<org.apache.avro.Schema.Field>extractSchemaFields(java.lang.String schemaContent)static org.apache.spark.sql.types.DataTypegetDataTypeFor(org.apache.avro.Schema schema)static java.lang.StringgetHbaseTypeNameFor(org.apache.avro.Schema schema)static org.apache.spark.sql.types.StructTypegetStructSchemaFor(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)static java.lang.StringgetTimestampFieldName(@NonNull SystemSpec systemData)static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Set<Variable>>groupVariablesByType(org.apache.spark.sql.SparkSession spark, java.util.Set<Variable> variables, TimeWindow timeWindow)Groups variables by the single SparkDataTypethat covers all their schemas over the time window.static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Map<TimeWindow,java.util.Set<Variable>>>groupVariablesByTypeAndTimeWindow(java.util.Set<Variable> variables, TimeWindow timeWindow)Groups variables by their Spark DataType and the TimeWindow during which that type is active.
-
-
-
Method Detail
-
getTimestampFieldName
public static java.lang.String getTimestampFieldName(@NonNull @NonNull SystemSpec systemData)
-
extractSchemaFields
public static java.util.List<org.apache.avro.Schema.Field> extractSchemaFields(java.lang.String schemaContent)
-
getDataTypeFor
public static org.apache.spark.sql.types.DataType getDataTypeFor(org.apache.avro.Schema schema)
-
getHbaseTypeNameFor
public static java.lang.String getHbaseTypeNameFor(org.apache.avro.Schema schema)
-
getStructSchemaFor
public static org.apache.spark.sql.types.StructType getStructSchemaFor(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)
-
createEmptyDataFrame
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> createEmptyDataFrame(org.apache.spark.sql.SparkSession session, java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)
-
groupVariablesByTypeAndTimeWindow
public static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Map<TimeWindow,java.util.Set<Variable>>> groupVariablesByTypeAndTimeWindow(java.util.Set<Variable> variables, TimeWindow timeWindow)
Groups variables by their Spark DataType and the TimeWindow during which that type is active. Variables with schema evolution (e.g. BOOLEAN in the first half of a year, DOUBLE in the second half) are correctly split across multiple (DataType, TimeWindow) groups rather than being dropped or crashing the call. The caller should iterate over the outer map and submit one DataQuery per (DataType, TimeWindow) entry, using the time window from the inner map as the query range.- Parameters:
variables- variables to grouptimeWindow- time window for which schemas are resolved- Returns:
- map of DataType to a map of TimeWindow to the set of variables that have that type during that window.
-
groupVariablesByType
public static java.util.Map<org.apache.spark.sql.types.DataType,java.util.Set<Variable>> groupVariablesByType(org.apache.spark.sql.SparkSession spark, java.util.Set<Variable> variables, TimeWindow timeWindow)
Groups variables by the single SparkDataTypethat covers all their schemas over the time window. Variables with incompatible schema evolution (e.g. BOOLEAN to DOUBLE) are skipped with a warning. This method is used in pivot. It will throw an exception if the types within one Variable and multi VariableConfigs are not compatible.
-
-