Class ExtractionUtils


  • public final class ExtractionUtils
    extends java.lang.Object
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static org.apache.avro.Schema STRING_SCHEMA  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> createEmptyDataFrame​(org.apache.spark.sql.SparkSession session, java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)  
      static java.util.List<org.apache.avro.Schema.Field> extractSchemaFields​(java.lang.String schemaContent)  
      static org.apache.spark.sql.types.DataType getDataTypeFor​(org.apache.avro.Schema schema)  
      static java.lang.String getHbaseTypeNameFor​(org.apache.avro.Schema schema)  
      static org.apache.spark.sql.types.StructType getStructSchemaFor​(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)  
      static java.lang.String getTimestampFieldName​(@NonNull SystemSpec systemData)  
      static java.util.Map<org.apache.spark.sql.types.DataType,​java.util.Set<Variable>> groupVariablesByType​(org.apache.spark.sql.SparkSession spark, java.util.Set<Variable> variables, TimeWindow timeWindow)
      Groups variables by the single Spark DataType that covers all their schemas over the time window.
      static java.util.Map<org.apache.spark.sql.types.DataType,​java.util.Map<TimeWindow,​java.util.Set<Variable>>> groupVariablesByTypeAndTimeWindow​(java.util.Set<Variable> variables, TimeWindow timeWindow)
      Groups variables by their Spark DataType and the TimeWindow during which that type is active.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • STRING_SCHEMA

        public static final org.apache.avro.Schema STRING_SCHEMA
    • Method Detail

      • getTimestampFieldName

        public static java.lang.String getTimestampFieldName(@NonNull SystemSpec systemData)
      • extractSchemaFields

        public static java.util.List<org.apache.avro.Schema.Field> extractSchemaFields​(java.lang.String schemaContent)
      • getDataTypeFor

        public static org.apache.spark.sql.types.DataType getDataTypeFor​(org.apache.avro.Schema schema)
      • getHbaseTypeNameFor

        public static java.lang.String getHbaseTypeNameFor​(org.apache.avro.Schema schema)
      • getStructSchemaFor

        public static org.apache.spark.sql.types.StructType getStructSchemaFor​(java.util.Collection<cern.nxcals.common.domain.ColumnMapping> fields)
      • createEmptyDataFrame

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> createEmptyDataFrame​(org.apache.spark.sql.SparkSession session,
                                                                                                  java.util.Collection<cern.nxcals.common.domain.ColumnMapping> mappings)
      • groupVariablesByTypeAndTimeWindow

        public static java.util.Map<org.apache.spark.sql.types.DataType,​java.util.Map<TimeWindow,​java.util.Set<Variable>>> groupVariablesByTypeAndTimeWindow​(java.util.Set<Variable> variables,
                                                                                                                                                                         TimeWindow timeWindow)
        Groups variables by their Spark DataType and the TimeWindow during which that type is active. Variables with schema evolution (e.g. BOOLEAN in the first half of a year, DOUBLE in the second half) are correctly split across multiple (DataType, TimeWindow) groups rather than being dropped or crashing the call. The caller should iterate over the outer map and submit one DataQuery per (DataType, TimeWindow) entry, using the time window from the inner map as the query range.
        Parameters:
        variables - variables to group
        timeWindow - time window for which schemas are resolved
        Returns:
        map of DataType to a map of TimeWindow to the set of variables that have that type during that window.
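        The iteration pattern described above (walk the outer map, then submit one query per inner entry) can be sketched as follows. This is a minimal illustration, not the NXCALS implementation: the generic parameters T, W and V stand in for DataType, TimeWindow and Variable, and the PlannedQuery class and planQueries method are hypothetical names introduced only for this example. The actual DataQuery call is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupedQueryPlanSketch {

    // Hypothetical stand-in for "one DataQuery": it only records what would be queried.
    static final class PlannedQuery<T, W, V> {
        final T dataType;
        final W window;
        final Set<V> variables;

        PlannedQuery(T dataType, W window, Set<V> variables) {
            this.dataType = dataType;
            this.window = window;
            this.variables = variables;
        }
    }

    // Walks the outer map (type) and the inner map (window), emitting one plan per pair.
    static <T, W, V> List<PlannedQuery<T, W, V>> planQueries(Map<T, Map<W, Set<V>>> grouped) {
        List<PlannedQuery<T, W, V>> plans = new ArrayList<>();
        for (Map.Entry<T, Map<W, Set<V>>> byType : grouped.entrySet()) {
            for (Map.Entry<W, Set<V>> byWindow : byType.getValue().entrySet()) {
                // The inner key, not the originally requested window, is the query range.
                plans.add(new PlannedQuery<>(byType.getKey(), byWindow.getKey(), byWindow.getValue()));
            }
        }
        return plans;
    }

    public static void main(String[] args) {
        // Strings replace the real DataType, TimeWindow and Variable objects (assumption).
        Map<String, Set<String>> booleanWindows = new LinkedHashMap<>();
        booleanWindows.put("first-half", new LinkedHashSet<>(Arrays.asList("varA")));

        Map<String, Set<String>> doubleWindows = new LinkedHashMap<>();
        doubleWindows.put("second-half", new LinkedHashSet<>(Arrays.asList("varA")));
        doubleWindows.put("full-year", new LinkedHashSet<>(Arrays.asList("varB")));

        Map<String, Map<String, Set<String>>> grouped = new LinkedHashMap<>();
        grouped.put("BOOLEAN", booleanWindows);
        grouped.put("DOUBLE", doubleWindows);

        for (PlannedQuery<String, String, String> plan : planQueries(grouped)) {
            System.out.println(plan.dataType + " / " + plan.window + " -> " + plan.variables);
        }
    }
}
```

        In real code the map returned by groupVariablesByTypeAndTimeWindow would be passed in directly, and each planned entry would be turned into a DataQuery over the inner TimeWindow.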
      • groupVariablesByType

        public static java.util.Map<org.apache.spark.sql.types.DataType,​java.util.Set<Variable>> groupVariablesByType​(org.apache.spark.sql.SparkSession spark,
                                                                                                                            java.util.Set<Variable> variables,
                                                                                                                            TimeWindow timeWindow)
        Groups variables by the single Spark DataType that covers all their schemas over the time window. Variables with incompatible schema evolution (e.g. BOOLEAN to DOUBLE) are skipped with a warning. This method is used in pivot. It throws an exception if the types within one Variable that has multiple VariableConfigs are not compatible.
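        The skip-with-warning grouping can be illustrated with a simplified sketch. It rests on stated assumptions and is not the library's code: type names and variable names are plain strings, the groupBySingleType method is hypothetical, and "compatible" is reduced to "exactly one observed type", whereas the real method may resolve a covering type for compatible schemas. The exception case for a Variable with multiple VariableConfigs is not modelled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SingleTypeGroupingSketch {

    // Input: variable name -> set of type names observed over the time window.
    // Output: type name -> variables that had exactly that one type throughout.
    static Map<String, Set<String>> groupBySingleType(Map<String, Set<String>> typesPerVariable) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : typesPerVariable.entrySet()) {
            Set<String> types = entry.getValue();
            if (types.size() != 1) {
                // Simplification: any variable without exactly one type is skipped with a warning.
                System.err.println("WARN: skipping " + entry.getKey() + ", observed types " + types);
                continue;
            }
            String type = types.iterator().next();
            grouped.computeIfAbsent(type, key -> new TreeSet<>()).add(entry.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> observed = new LinkedHashMap<>();
        observed.put("varA", new TreeSet<>(Arrays.asList("BOOLEAN", "DOUBLE"))); // evolved schema
        observed.put("varB", new TreeSet<>(Arrays.asList("DOUBLE")));
        observed.put("varC", new TreeSet<>(Arrays.asList("DOUBLE")));

        System.out.println(groupBySingleType(observed));
    }
}
```

        Callers that must not lose variables with evolving schemas would use groupVariablesByTypeAndTimeWindow instead, which splits such variables across windows rather than skipping them.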