Python for Java APIs
Introduction
To be able to use NXCALS public APIs with another language such as Python, one approach (the so-called "native" approach) is to write a wrapper around the NXCALS Public API in the given language.
For Python, NXCALS provides such a package, namely for data extraction (see the NXCALS Python package). This provides a very useful and straightforward solution for data extraction with Python. However, it exposes only the Extraction API.
For the other APIs, such as the CERN Extraction API, Metadata API, Ingestion API and Backport API, it would be possible to write native API clients or to port the existing Java implementation to Python. With that approach, however, Java code would be replicated in Python, leaving more code to maintain and the risk that the two implementations diverge.
Luckily, there are two open-source packages, Py4J and JPype, that make a Java API accessible from Python.
Most existing controls projects, such as pyJAPC, use JPype. In NXCALS, preference was given to Py4J because PySpark, the official package for using Spark from Python, is based on Py4J. Only with Py4J is it possible to integrate seamlessly with PySpark functionality, e.g. to extract a Spark DataFrame easily and efficiently and convert it to a Pandas DataFrame.
Below are some examples of using NXCALS through Py4J and through JPype.
Accessing Java APIs using Py4J
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects running in a Java Virtual Machine (JVM). For instance, when a developer runs a PySpark program, the PySpark library automatically starts a JVM that runs the "real" Java Spark code, which does most of the work, e.g. interacting with a Spark cluster.
The same applies when using the NXCALS package, in which the py4j module is preinstalled. The JVM can be accessed directly from the Spark session, which is already available in the PySpark tool or can be created with the provided session builders in a standalone Python application.
There are two approaches to using Py4J:
- The first approach is low-level and exposes some internals of Py4J. It uses the so-called 'jvm' property to access Java classes from Python.
- The second approach is more high-level. The resulting code looks more natural, because classes are imported and then used as in normal Python.
This approach also makes it possible to invoke Java methods like and(), or(), in(), whose names are valid in Java but are reserved keywords in Python.
A third advantage is that this approach provides code completion in IDEs such as PyCharm.
Code examples for both approaches are provided in the following sections.
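The keyword problem mentioned above can be reproduced without any Java involved. The following is a minimal sketch using only the Python standard library; the helper name compiles and the variable name condition are illustrative assumptions, not part of NXCALS or Py4J.

```python
import keyword

# Method names used by the NXCALS query builders in this document.
java_method_names = ["and", "or", "in", "eq", "like"]

for name in java_method_names:
    # keyword.iskeyword() reports whether the parser reserves the name.
    print(f"{name}: reserved keyword = {keyword.iskeyword(name)}")


def compiles(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# The parser rejects a keyword used as an attribute name, before any
# Java object is ever looked up.
direct_call_ok = compiles("condition.and()")
# The trailing-underscore spelling is an ordinary identifier.
underscore_call_ok = compiles("condition.and_()")

print("condition.and()  compiles:", direct_call_ok)
print("condition.and_() compiles:", underscore_call_ok)
```

Only the spelling matters here: the expressions are compiled but never evaluated, so no object named condition needs to exist.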
Examples of using the low-level approach of Py4J with an existing PySpark session
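Getting a variable name:
# Using an existing SparkSession provided by PySpark
metadata = spark._jvm.cern.nxcals.api.extraction.metadata
variableService = metadata.ServiceClientFactory.createVariableService()
# The query builder class has to be reached through the JVM view as well
Variables = metadata.queries.Variables
var = variableService.findOne(Variables.suchThat().variableName().eq('HX:BMODE')).get()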
print(var.getVariableName())
Getting information about a specific fill:
# Using an existing SparkSession provided by PySpark
fillservice = spark._jvm.cern.nxcals.api.custom.service.Services.newInstance().fillService()
fill = fillservice.findFill(3000)
print(fill)
Extracting variable data within a time range and using a specific lookup strategy:
cern_nxcals_api = spark._jvm.cern.nxcals.api
variableService = cern_nxcals_api.extraction.metadata.ServiceClientFactory.createVariableService()
extractionService = cern_nxcals_api.custom.service.Services.newInstance().extractionService()
myVariable = variableService.findOne(cern_nxcals_api.extraction.metadata.queries.Variables.suchThat().variableName().eq("CPS.TGM:CYCLE"))
if myVariable.isEmpty():
    raise ValueError("Could not obtain variable from service")
startTime = cern_nxcals_api.utils.TimeUtils.getInstantFromString("2020-04-25 00:00:00.000000000")
endTime = cern_nxcals_api.utils.TimeUtils.getInstantFromString("2020-04-26 00:00:00.000000000")
properties = cern_nxcals_api.custom.service.extraction.ExtractionProperties.builder().timeWindow(startTime, endTime) \
    .lookupStrategy(cern_nxcals_api.custom.service.extraction.LookupStrategy.LAST_BEFORE_START_IF_EMPTY).build()
dataset = extractionService.getData(myVariable.get(), properties)
print(dataset.count())
dataset.show()
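The isEmpty()/get() check used in the example above could be factored into a small helper. This is only a sketch: it assumes nothing more than that the object returned by findOne offers isEmpty() and get(), as the example suggests. The names get_or_raise and FakeOptional are hypothetical and exist only so that the snippet runs without a JVM.

```python
def get_or_raise(optional, message):
    """Return optional.get(), or raise ValueError(message) if it is empty."""
    if optional.isEmpty():
        raise ValueError(message)
    return optional.get()


class FakeOptional:
    """Hypothetical stand-in mimicking the isEmpty()/get() calls used above."""

    def __init__(self, value=None):
        self._value = value

    def isEmpty(self):
        return self._value is None

    def get(self):
        return self._value


# A present value is returned unchanged.
found = get_or_raise(FakeOptional("CPS.TGM:CYCLE"), "Could not obtain variable from service")
print(found)

# An empty result raises a ValueError with the given message.
try:
    get_or_raise(FakeOptional(), "Could not obtain variable from service")
except ValueError as error:
    failure_message = str(error)
    print("ValueError:", failure_message)
```

With a real session, the call would presumably read get_or_raise(variableService.findOne(...), "...") in place of the explicit if block.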
Examples of using the high-level approach of Py4J with the NXCALS session builder
The following general principles apply (see also the first code example):
- NXCALS must be installed using pip install nxcals, which makes all necessary functionality available.
- The programs use normal Python imports, such as from module import Class.
- Imports of the Py4J classes start with py4jgw (which stands for Py4J Gateway), followed by the Java API packages, e.g. from py4jgw.cern.nxcals.api.extraction.metadata import ServiceClientFactory.
- The line with the NXCALS session builder, spark = spark_session_builder.get_or_create(...), must be executed before methods of the NXCALS API are invoked. It initializes both the Spark session and the high-level Py4J API. Failing to do so raises a RuntimeError, as shown in the last code example.
- The NXCALS classes can then be used with normal syntax, e.g. ServiceClientFactory.createVariableService().
Code illustrating the structure explained above:
# spark_session_builder is available after doing `pip install nxcals`
from nxcals import spark_session_builder
# Import the Python counterparts of the NXCALS APIs (automatically generated with code completion):
from py4jgw.cern.nxcals.api.extraction.metadata import ServiceClientFactory
from py4jgw.cern.nxcals.api.extraction.metadata.queries import Variables
# initialize a session with NXCALS
spark = spark_session_builder.get_or_create(app_name='nxcals-py4j-demo')
# now you can use the NXCALS APIs imported above
variableService = ServiceClientFactory.createVariableService()
var = variableService.findOne(Variables.suchThat().variableName().eq('HX:BMODE')).get()
print(var.getVariableName())
An example of using the VariableService:
from nxcals import spark_session_builder
from py4jgw.cern.nxcals.api.extraction.metadata import ServiceClientFactory
from py4jgw.cern.nxcals.api.extraction.metadata.queries import Variables
spark = spark_session_builder.get_or_create(app_name='nxcals-stubs-demo')
variableService = ServiceClientFactory.createVariableService()
var = variableService.findOne(Variables.suchThat().variableName().eq('HX:BMODE')).get()
print(var.getVariableName())
An example of using the FillService:
from nxcals import spark_session_builder
from py4jgw.cern.nxcals.api.custom.service import Services
spark = spark_session_builder.get_or_create(app_name='nxcals-stubs-demo')
fillservice = Services.newInstance().fillService()
fill = fillservice.findFill(3000)
print(fill)
An example of using the method and_() with a trailing underscore:
from nxcals import spark_session_builder
from py4jgw.cern.nxcals.api.extraction.metadata.queries import Variables
spark = spark_session_builder.get_or_create(app_name='nxcals-stubs-demo')
# please note the use of and_() with an underscore, like in JPype
var_cond = Variables.suchThat().variableName().like("%MTG%").and_().description().exists()
A more complete example:
from nxcals import spark_session_builder
from py4jgw.cern.nxcals.api.custom.service import Services
from py4jgw.cern.nxcals.api.custom.service.extraction import ExtractionProperties, LookupStrategy
from py4jgw.cern.nxcals.api.extraction.metadata import ServiceClientFactory
from py4jgw.cern.nxcals.api.extraction.metadata.queries import Variables
from py4jgw.cern.nxcals.api.utils import TimeUtils
spark = spark_session_builder.get_or_create(app_name='nxcals-stubs-demo')
variableService = ServiceClientFactory.createVariableService()
extractionService = Services.newInstance().extractionService()
myVariable = variableService.findOne(Variables.suchThat().variableName().eq("CPS.TGM:CYCLE"))
if myVariable.isEmpty():
    raise ValueError("Could not obtain variable from service")
startTime = TimeUtils.getInstantFromString("2020-04-25 00:00:00.000000000")
endTime = TimeUtils.getInstantFromString("2020-04-26 00:00:00.000000000")
properties = ExtractionProperties.builder().timeWindow(startTime, endTime) \
    .lookupStrategy(LookupStrategy.LAST_BEFORE_START_IF_EMPTY).build()
dataset = extractionService.getData(myVariable.get(), properties)
print(dataset.count())
dataset.show()
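The time strings passed to getInstantFromString in the examples carry nine fractional digits. If the start and end times are held as Python datetime objects, a string of the same shape could be produced as sketched below. The helper name to_nxcals_time_string is hypothetical, and the sketch assumes only the format visible in the examples; it makes no statement about how NXCALS interprets time zones.

```python
from datetime import datetime


def to_nxcals_time_string(moment):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS.fffffffff' (9 fractional digits)."""
    # strftime's %f yields microseconds (6 digits); pad with zeros to 9 digits.
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f") + "000"


start_text = to_nxcals_time_string(datetime(2020, 4, 25))
end_text = to_nxcals_time_string(datetime(2020, 4, 26))

print(start_text)  # 2020-04-25 00:00:00.000000000
print(end_text)    # 2020-04-26 00:00:00.000000000
```

The resulting strings match the literals used in the example above and could be passed to TimeUtils.getInstantFromString in their place.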
Forgetting to import and call spark_session_builder.get_or_create(...) leads to the following RuntimeError:
Traceback (most recent call last):
File "py4j_stubs_demo/demo/throw_module_not_found_exception.py", line 3, in <module>
varCond = Variables.suchThat().variableName().like("%TGM%").and_().description().exists()
File "py4j_stubs_demo/venv/lib/python3.9/site-packages/py4j_utils/py4j_importer.py", line 237, in __getattr__
self._lazy_init()
File "py4j_stubs_demo/venv/lib/python3.9/site-packages/py4j_utils/py4j_importer.py", line 205, in _lazy_init
raise RuntimeError(
RuntimeError: PySpark and/or py4j gateway is not yet initialized, please create an NXCALS spark session
with `spark_session_builder.get_or_create(...)` before using classes imported from p4jgw.*
# Code that triggers this error: no NXCALS session is created before Variables is used
from py4jgw.cern.nxcals.api.extraction.metadata.queries import Variables
var_cond = Variables.suchThat().variableName().like("%MTG%").and_().description().exists()
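The traceback suggests that the imported classes are resolved lazily: the import itself succeeds, and the error appears only on first attribute access. The following sketch illustrates that general pattern with a hypothetical LazyJavaClass. It is not the py4j_utils implementation, and the gateway attribute is an assumption made purely for illustration.

```python
class LazyJavaClass:
    """Hypothetical placeholder for a Java class that is resolved on first use."""

    gateway = None  # would be set once a session has been created

    def __init__(self, java_name):
        self._java_name = java_name

    def __getattr__(self, attribute):
        # __getattr__ runs only for attributes not found normally,
        # i.e. for anything that would have to come from the Java side.
        if LazyJavaClass.gateway is None:
            raise RuntimeError(
                "gateway is not yet initialized, create a session before "
                "using imported classes"
            )
        return f"{self._java_name}.{attribute}"


# Creating the placeholder (the "import") works without a session.
Variables = LazyJavaClass("cern.nxcals.api.extraction.metadata.queries.Variables")

# Using it before initialization fails at the first attribute access.
try:
    Variables.suchThat
except RuntimeError as error:
    early_use_error = str(error)
    print("RuntimeError:", early_use_error)

# After initialization (a stand-in for get_or_create) the lookup succeeds.
LazyJavaClass.gateway = object()
resolved = Variables.suchThat
print(resolved)
```

This pattern would explain why the order of imports does not matter, while the order of the session creation and the first API call does.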
Example of using Py4J from a standalone Python application (requires Spark session creation)
from nxcals import spark_session_builder
spark = spark_session_builder.get_or_create(app_name='spark-basic')
_cern_nxcals_api = spark._jvm.cern.nxcals.api
fillService = _cern_nxcals_api.custom.service.Services.newInstance().fillService()
fills = fillService.findFills(_cern_nxcals_api.utils.TimeUtils.getInstantFromString("2018-04-25 00:00:00.000000000"),
_cern_nxcals_api.utils.TimeUtils.getInstantFromString("2018-04-28 00:00:00.000000000"))
Accessing Java APIs using JPype
The JPype Python module interfaces with the JVM at the native level. It allows Python to make use of Java-specific libraries.
Installation steps
First, the JPype module has to be installed. This can be done with pip:
pip install JPype1
JPype requires the jars so that it can create a JVM and include them in the classpath. The imported libraries are then automatically exposed to the process in a Pythonic way, and the library structure looks and feels as if it had been developed natively in Python.
Hint
One way of obtaining the NXCALS API jars and their dependencies is to install the NXCALS package. All the necessary jars can be found in the newly created venv directories:
venv/nxcals-bundle/jars
venv/nxcals-bundle/nxcals_jars
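The two directories above can be turned into wildcard classpath entries programmatically. The sketch below is an illustration only: the helper name nxcals_classpath is hypothetical, the venv location is supplied by the caller, and the demonstration uses a temporary directory so that it runs without an NXCALS installation.

```python
import tempfile
from pathlib import Path


def nxcals_classpath(venv_root):
    """Return wildcard classpath entries for the two jar directories named above."""
    bundle = Path(venv_root) / "nxcals-bundle"
    entries = []
    for directory in (bundle / "jars", bundle / "nxcals_jars"):
        # Fail early with a clear message instead of starting a JVM
        # with an incomplete classpath.
        if not directory.is_dir():
            raise FileNotFoundError(f"Expected jar directory not found: {directory}")
        entries.append(str(directory / "*"))
    return entries


# Demonstration with a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as demo_venv:
    for name in ("jars", "nxcals_jars"):
        (Path(demo_venv) / "nxcals-bundle" / name).mkdir(parents=True)
    demo_classpath = nxcals_classpath(demo_venv)

print(demo_classpath)
```

The returned list could then be passed as the classpath argument of jpype.startJVM, in place of hard-coded paths.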
Start JVM and access services
Assuming that the required jars can be accessed from the previously installed NXCALS package (see the hint above), we can execute the following code in order to access data from metadata storage:
import jpype
# Enable Java imports
import jpype.imports
# Import all standard Java types into the global scope
from jpype.types import *
# Launch the JVM, convertStrings=True is used for the proper conversion of java.lang.String to Python string literals
jpype.startJVM(classpath=['/your_nxcals_package_location/venv/nxcals-bundle/jars/*',
'/your_nxcals_package_location/venv/nxcals-bundle/nxcals_jars/*'], convertStrings=True)
# point metadata client to NXCALS PRO services
from java.lang import System
System.setProperty("service.url", "https://cs-ccr-nxcals5.cern.ch:19093,https://cs-ccr-nxcals5.cern.ch:19094,https://cs-ccr-nxcals6.cern.ch:19093,https://cs-ccr-nxcals6.cern.ch:19094,https://cs-ccr-nxcals7.cern.ch:19093,https://cs-ccr-nxcals7.cern.ch:19094,https://cs-ccr-nxcals8.cern.ch:19093,https://cs-ccr-nxcals8.cern.ch:19094")
from cern.nxcals.api.extraction.metadata import ServiceClientFactory
vs = ServiceClientFactory.createVariableService()
from cern.nxcals.api.extraction.metadata.queries import Variables
var = vs.findOne(Variables.suchThat().variableName().eq("HX:BMODE")).get()
var.getVariableName()
var.getId()
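The long service.url value used above follows a regular pattern: four hosts, each with two ports. As a readability aid, the same string could be assembled as sketched here. The host names and ports are taken from the example; the variable names are illustrative.

```python
# Hosts cs-ccr-nxcals5 to cs-ccr-nxcals8, as listed in the example above.
hosts = [f"cs-ccr-nxcals{number}.cern.ch" for number in range(5, 9)]
ports = (19093, 19094)

# Same ordering as in the example: both ports of one host, then the next host.
service_url = ",".join(f"https://{host}:{port}" for host in hosts for port in ports)

print(service_url)
```

The resulting string can be passed to System.setProperty("service.url", service_url) exactly as the literal is in the example.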
Load java packages under a package alias
Please take note
In some rare cases the cern root package name may already be in use in the Python namespace (or clash with other packages), so the classpath packages might not be loaded directly into the JPype process. Trying to import a Java class (e.g. ServiceClientFactory from metadata-api) may then fail with a module-not-found error:
>>> from cern.nxcals.api.extraction.metadata import ServiceClientFactory
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cern.nxcals.api.extraction.metadata'
The solution is to use the advanced JPype import options and register the Java package(s) under an alias that is certain not to clash with modules already available on the Python path. The factory above can then be loaded and used as follows:
# actual packages: cern.nxcals
# expose as alias: nxcals_meta
jpype.imports.registerDomain('nxcals_meta', alias='cern.nxcals')
# now we can normally import and use exposed units under package alias
from nxcals_meta.api.extraction.metadata import ServiceClientFactory
vs = ServiceClientFactory.createVariableService()
from nxcals_meta.api.extraction.metadata.queries import Variables
var = vs.findOne(Variables.suchThat().variableName().eq("HX:BMODE")).get()
var.getVariableName()
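Whether a clashing top-level module exists can be checked with the standard library before deciding on an alias. The sketch below is a hedged diagnostic, not a JPype feature: the helper name describe_top_level_module is hypothetical, and the check is best run in a plain interpreter before the JVM is started, since the interaction with JPype's import hooks is not covered here.

```python
import importlib.util


def describe_top_level_module(name):
    """Return where a top-level Python module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages may have no single origin; report their search paths.
    return spec.origin or str(list(spec.submodule_search_locations or []))


# None means that no Python module of that name is importable.
print("cern:", describe_top_level_module("cern"))
# A standard-library module, shown for comparison.
print("json:", describe_top_level_module("json"))
```

If the first line prints a location rather than None, a Python package named cern is already present, and registering the Java packages under an alias as shown above is likely the safer choice.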
RSQL queries
Important
Please note that when working with certain RSQL queries, conflicts may occur with reserved Python keywords such as and, or, in. JPype and Py4J address this differently. JPype introduces corresponding methods with an underscore suffix, i.e. and_(), or_(), in_(). With the low-level Py4J approach, the method has to be looked up by name using the built-in getattr() function.
The examples below illustrate the difference between Py4J and JPype in handling RSQL queries with reserved keywords. Keep in mind that, for simplicity, JPype is used for all other code snippets provided in the documentation.
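The two spellings can be compared side by side without a JVM. In the sketch below, FakeCondition is a hypothetical stand-in for a Java query-builder object. It only records the calls made on it and is not NXCALS code.

```python
class FakeCondition:
    """Hypothetical stand-in for a query builder; records the chained calls."""

    def __init__(self, parts=()):
        self.parts = tuple(parts)

    def systemName(self):
        return FakeCondition(self.parts + ("systemName",))

    def variableName(self):
        return FakeCondition(self.parts + ("variableName",))

    def eq(self, value):
        return FakeCondition(self.parts + (f"== {value!r}",))

    def _and(self):
        return FakeCondition(self.parts + ("and",))


# Expose the same method under the Java name 'and' (reachable only through
# getattr) and under the underscore spelling 'and_'.
setattr(FakeCondition, "and", FakeCondition._and)
FakeCondition.and_ = FakeCondition._and

start = FakeCondition().systemName().eq("CMW")

# Lookup by name, as in the low-level Py4J examples.
via_getattr = getattr(start, "and")().variableName().eq("SPS:NXCALS_FUNDAMENTAL")
# Underscore suffix, as in the JPype examples.
via_underscore = start.and_().variableName().eq("SPS:NXCALS_FUNDAMENTAL")

print(" ".join(via_getattr.parts))
print(via_getattr.parts == via_underscore.parts)
```

Both chains record the same sequence of calls. The difference lies purely in how the Python source spells the method lookup.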
Selecting a variable in the CMW system:
# Py4J (low-level approach): the method named 'and' is looked up with getattr()
_metadata = spark._jvm.cern.nxcals.api.extraction.metadata
variableService = _metadata.ServiceClientFactory.createVariableService()
Variables = _metadata.queries.Variables
variable = variableService \
.findOne(getattr(Variables.suchThat().systemName().eq("CMW"), 'and')().variableName().eq("SPS:NXCALS_FUNDAMENTAL")) \
.get()
variable.getDescription()
# JPype: the same query using and_()
variable = variableService \
.findOne(Variables.suchThat().systemName().eq("CMW").and_().variableName().eq("SPS:NXCALS_FUNDAMENTAL")) \
.get()
variable.getDescription()
Retrieving two variables (note that Py4J unfortunately requires nesting of conditions, whereas the JPype syntax is more straightforward):
# Py4J (low-level approach): nested getattr() calls for 'or' and 'and'
_metadata = spark._jvm.cern.nxcals.api.extraction.metadata
variableService = _metadata.ServiceClientFactory.createVariableService()
Variables = _metadata.queries.Variables
variables = variableService \
.findAll(getattr( \
getattr(Variables.suchThat().variableName().eq("SPS:NXCALS_FUNDAMENTAL"), 'or')().variableName().eq("CPS:NXCALS_FUNDAMENTAL"), \
'and')().systemName().eq("CMW"))
len(variables)
# JPype: the same query using and_() and or_()
variables = variableService.findAll(Variables.suchThat() \
    .systemName().eq("CMW").and_(Variables.suchThat().variableName().eq("SPS:NXCALS_FUNDAMENTAL").or_().variableName().eq("CPS:NXCALS_FUNDAMENTAL")))
Conclusion
Both options, JPype and Py4J, are quite powerful and allow bridging the Java and Python worlds with minimal effort.
JPype exposes JVM objects in a more user-friendly way (less boilerplate code) and offers a straightforward way to access static properties and classes.
Py4J may become interesting when there is a need to run the Java code on a remote host and access objects from a Python client running elsewhere.
It can also be run easily from SWAN notebooks.