Multiple schemas data extraction
In NXCALS an entity can evolve over time, changing its schema and partition. This evolution has consequences for data extraction: the extraction API can raise an 'IncompatibleSchemaPromotionException'. The exception is raised when a field with the same name has incompatible data types in two or more of the schemas associated with the data being extracted.
Performing a simple query:
from nxcals.api.extraction.data.builders import DevicePropertyDataQuery
# `spark` is an already created SparkSession
df = DevicePropertyDataQuery \
.builder(spark) \
.system("CMW") \
.startTime("2019-08-22 11:00:00.000") \
.endTime("2019-08-22 12:00:00.000") \
.entity() \
.parameter("IGBT_EE_1/PM") \
.build()
results in the exception described above:
cern.nxcals.data.access.api.exception.IncompatibleSchemaPromotionException: Unsupported type promotion for field __NX_0FMZDIVRNKJUWO2DU. Please extract separate datasets for time windows : 2019-08-22T11:00:00Z - 2019-08-22T12:00:00Z; 2019-08-22T11:00:00Z - 2019-08-22T12:00:00Z
at cern.nxcals.data.access.api.DataAccessServiceImpl.getRequiredSchemaFieldsOrThrow(DataAccessServiceImpl.java:227)
...
Caused by: cern.nxcals.data.access.api.exception.IncompatibleSchemaPromotionException: Unsupported type promotion of schemas [[[{"type":"record","name":"float_multi_array","namespace":"cern.nxcals","fields":[{"name":"elements","type":[{"type":"array","items":"float"},"null"]},{"name":"dimensions","type":[{"type":"array","items":"int"},"null"]}]},"null"], [{"type":"record","name":"double_multi_array","namespace":"cern.nxcals","fields":[{"name":"elements","type":[{"type":"array","items":"double"},"null"]},{"name":"dimensions","type":[{"type":"array","items":"int"},"null"]}]},"null"]]]. Found non-primitive schema type!
at cern.nxcals.data.access.api.FieldTypeResolver.getPromotedSchema(FieldTypeResolver.java:90)
at cern.nxcals.data.access.api.FieldTypeResolver.enrichFieldWithSchema(FieldTypeResolver.java:62)
... 24 more
At the same time, the query works without any issues for time ranges preceding the problematic time window, for example:
.startTime("2019-08-22 10:00:00.000") \
.endTime("2019-08-22 11:00:00.000") \
and following it:
.startTime("2019-08-22 12:00:00.000") \
.endTime("2019-08-22 13:00:00.000") \
Background process explained
When a query for an entity covers a time range spanning two or more schemas, the NXCALS service must produce a single, unified data frame containing all the requested points.
This means that it must run a UNION over the different datasets found for the query. To perform the UNION, it must first merge the schema descriptions of those datasets.
Multiple schemas arise from data evolution over time: every time the message format changes (new fields, changed value data types, etc.), a new schema is created.
A union of structures/records is possible only if the records under the same field name are exactly identical in both schemas; any difference makes it fail. A union of primitive types is possible, but only if one type can be promoted to the other.
Example:
- long + int = long (int can be promoted to long)
- int + string = error (string cannot be promoted to int, because they are incompatible types)
- record + record = record (if they are completely identical in structure)
- recordA + recordB (when recordA != recordB) = error (incompatible records)
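The rules above can be sketched as a small toy model. This is only an illustration, not the NXCALS implementation: the function `promote`, the error class `IncompatibleTypes` and the simplified record layout are hypothetical, and the numeric widening order (int, long, float, double) is an assumption modelled on Avro schema resolution rather than something stated here.

```python
# Toy model of the promotion rules listed above; NOT the NXCALS implementation.
# Primitive types are plain strings, records are dicts of field name -> type.

# Assumed widening order for numeric primitives (modelled on Avro).
NUMERIC_RANK = {"int": 0, "long": 1, "float": 2, "double": 3}


class IncompatibleTypes(Exception):
    """Hypothetical stand-in for IncompatibleSchemaPromotionException."""


def promote(left, right):
    """Return the common type of two field types or raise IncompatibleTypes."""
    # Identical types (primitive or record) need no promotion.
    if left == right:
        return left
    # Records that differ in any way cannot be merged.
    if isinstance(left, dict) or isinstance(right, dict):
        raise IncompatibleTypes("non-identical records: %r vs %r" % (left, right))
    # Two different numeric primitives: the wider one wins.
    if left in NUMERIC_RANK and right in NUMERIC_RANK:
        return max(left, right, key=NUMERIC_RANK.get)
    raise IncompatibleTypes("cannot promote %r and %r" % (left, right))


# Simplified versions of the two record types named in the exception.
float_array = {"elements": "array<float>", "dimensions": "array<int>"}
double_array = {"elements": "array<double>", "dimensions": "array<int>"}

print(promote("long", "int"))             # long
print(promote(float_array, float_array))  # identical records are accepted
try:
    promote(float_array, double_array)    # differing records are rejected
except IncompatibleTypes as exc:
    print("error:", exc)
```

In this model the two multi-array records fail for the same reason as in the stack trace: they are records, and they are not identical.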
In the example above, the schema definition changed on 2019-08-22 at 11:02:55. In particular, the field __NX_0FMZDIVRNKJUWO2DU, which was reported in the exception, was changed from
{"name":"__NX_0FMZDIVRNKJUWO2DU","type":["float_multi_array","null"]}
to
{"name":"__NX_0FMZDIVRNKJUWO2DU","type":["double_multi_array","null"]}
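which cannot be promoted because it is a non-primitive data type. Both types are records that differ in the item type of their "elements" array (float versus double), so they fall under the incompatible-records case.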
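The exception message itself suggests extracting separate datasets per time window. A minimal sketch of that idea is to split the requested range at the schema change time and run one query per window. The helper `split_time_window` below is hypothetical and not part of the NXCALS API. The change time is given above only to the second, and how the extraction API treats window boundaries is not covered here, so the cut point may need adjusting in practice.

```python
from datetime import datetime

# Same timestamp layout as used in the startTime/endTime calls above.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def split_time_window(start, end, change_points):
    """Split the range start..end into consecutive windows at each change point."""
    begin = datetime.strptime(start, TIME_FORMAT)
    finish = datetime.strptime(end, TIME_FORMAT)
    # Keep only change points that fall strictly inside the requested range.
    cuts = sorted(
        point
        for point in (datetime.strptime(p, TIME_FORMAT) for p in change_points)
        if begin < point < finish
    )
    edges = [begin] + cuts + [finish]
    # Trim microseconds to milliseconds to match the original string format.
    return [
        (a.strftime(TIME_FORMAT)[:-3], b.strftime(TIME_FORMAT)[:-3])
        for a, b in zip(edges, edges[1:])
    ]


windows = split_time_window(
    "2019-08-22 11:00:00.000",
    "2019-08-22 12:00:00.000",
    ["2019-08-22 11:02:55.000"],  # schema change time reported above
)
for window_start, window_end in windows:
    print(window_start, "->", window_end)
# 2019-08-22 11:00:00.000 -> 2019-08-22 11:02:55.000
# 2019-08-22 11:02:55.000 -> 2019-08-22 12:00:00.000
```

Each pair can then be passed to `.startTime(...)` and `.endTime(...)` of the query shown earlier, giving one dataset per schema. The affected field will presumably still have a different type in each dataset, so combining the results is left to the caller.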