Using SWAN

Using SWAN is the easiest and recommended way for simple scripting. Everything is configured and up to date. For more details, see the Using SWAN page.

Installation from acc-py repository (Python 3.7+)

To install the package, you need Python 3, version 3.7 or later (the currently supported version is 3.7.9). Acc-Py is available on machines in the CERN accelerator sector (ACC).

Acc-Py

source /acc/local/share/python/acc-py/base/pro/setup.sh
acc-py venv ./venv
source ./venv/bin/activate
python -m pip install nxcals

Pure Python

python3 -m venv ./venv
source ./venv/bin/activate
python -m pip install -U --index-url https://acc-py-repo.cern.ch/repository/vr-py-releases/simple --trusted-host acc-py-repo.cern.ch nxcals

Important

If you get the following error message while installing the NXCALS package:

zipfile.BadZipFile: File is not a zip file

please use the "no cache" option as follows:

python -m pip install nxcals --no-cache

Running

You need a valid Kerberos ticket. To obtain one, run: kinit
Activate the virtual environment: source ./venv/bin/activate
Start a PySpark session: pyspark or python
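The first step above can be checked from Python before a session is started. The following is a minimal sketch, assuming the standard Kerberos client tools are installed and that `klist -s` reports the ticket status through its exit code; the function name is hypothetical and not part of NXCALS.

```python
import subprocess


def has_valid_kerberos_ticket() -> bool:
    """Return True if `klist -s` reports a valid (non-expired) ticket cache."""
    try:
        # `klist -s` prints nothing and signals the result via its exit status.
        result = subprocess.run(["klist", "-s"], check=False)
    except OSError:
        # The Kerberos client tools are missing or cannot be executed.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_kerberos_ticket():
        print("Kerberos ticket found.")
    else:
        print("No valid Kerberos ticket, run `kinit` first.")
```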

During the first execution of PySpark/Python, a compressed virtual environment (packed venv) is created in the user's default temporary space. The created packed venv contains:

- the default Spark configuration file
- a minimal set of packages required by the PySpark session to be used in YARN mode
- packages provided by the user which are present in the venv

Subsequent executions start faster because the previously created file is reused. The unique name of the file is determined by the SPARK_HOME environment variable (the path to Spark files), USER (username), and the Python version.

The file's name is displayed during the startup, along with instructions for rebuilding the packed venv (which essentially involves removing the file).

Rebuild the packed venv only when you have added or modified packages in the venv. After the rebuild, they become available to user scripts executed in a new (restarted) PySpark session.
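To illustrate why one packed venv is reused per combination of Spark path, user and Python version, here is a purely hypothetical sketch of how such a unique name could be derived. The actual naming scheme used by NXCALS is not documented here and may differ; `packed_venv_key` and the hashing approach are assumptions made for illustration only.

```python
import hashlib
import os
import sys


def packed_venv_key(spark_home: str, user: str, python_version: str) -> str:
    """Hypothetical cache key; the real NXCALS file name may be built differently."""
    raw = "|".join([spark_home, user, python_version])
    # A short, stable digest: identical inputs always map to the same key.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    key = packed_venv_key(
        os.environ.get("SPARK_HOME", ""),
        os.environ.get("USER", ""),
        "{}.{}".format(*sys.version_info[:2]),
    )
    print(key)
```

With a scheme like this, changing any of the three inputs (for example, switching the Python version) yields a different name, so a new packed venv would be built instead of reusing the old one.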

Important

The target directory for the packed venv can be set with the environment variable NXCALS_WORKSPACE_TEMP_DIR (if it is not set, a temporary directory is used).

Setting the environment variable NXCALS_PACK_ALL_PACKAGES=true enables adding NXCALS-related files to the packed venv.
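When the session is started from a Python script, both variables can be set in the process environment. This is a minimal sketch, assuming the variables are read when the Spark session is created (so they must be set beforehand) and that creating the target directory in advance is harmless; the helper name is hypothetical.

```python
import os
from pathlib import Path


def configure_packed_venv(target_dir: str, pack_all_packages: bool = False) -> None:
    """Set the NXCALS packing variables for the current process."""
    # Assumption: the directory may not be created automatically, so create it here.
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    os.environ["NXCALS_WORKSPACE_TEMP_DIR"] = target_dir
    if pack_all_packages:
        os.environ["NXCALS_PACK_ALL_PACKAGES"] = "true"
```

Call `configure_packed_venv(...)` before creating the Spark session; exporting the same variables in the shell before running pyspark or python is equivalent.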

When running python rather than pyspark, a Spark session object has to be created explicitly:

from nxcals.spark_session_builder import get_or_create
from nxcals.api.extraction.data.builders import DataQuery
spark = get_or_create("My_APP")

df = DataQuery.builder(spark).entities().system('CMW') \
    .keyValuesEq({'device': 'LHC.LUMISERVER', 'property': 'CrossingAngleIP1'}) \
    .timeWindow('2022-04-22 00:00:00.000', '2022-04-23 00:00:00.000') \
    .build()
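The time window strings in the query above follow the pattern 'YYYY-MM-DD HH:MM:SS.mmm'. As a small convenience, they can be generated from a date instead of being typed by hand. This is a sketch using only the standard library; `day_window` is a hypothetical helper, not part of the NXCALS API.

```python
from datetime import date, datetime, timedelta
from typing import Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def day_window(day: date) -> Tuple[str, str]:
    """Return (start, end) strings covering one full day, with millisecond precision."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1)
    # %f yields microseconds (6 digits); drop the last 3 digits to get milliseconds.
    return start.strftime(TIME_FORMAT)[:-3], end.strftime(TIME_FORMAT)[:-3]
```

The result can be unpacked directly into the builder call, for example `.timeWindow(*day_window(date(2022, 4, 22)))`.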

You can find more examples in the Extraction API chapter.

Using bundle

If you want everything packed and do not want to configure a venv with acc-py, you can use our bundle. It contains a preconfigured Spark and the required Python packages: everything you need to start using NXCALS with Scala or Python. The installation guide can be found here.

Installation on LXPLUS

Installing the Python package on LXPLUS is covered on the dedicated page Using LXPLUS.

Jupyter

First, install Jupyter into the venv in which NXCALS is installed.

python -m pip install jupyter

Important

Make sure to have jupyter in your PATH. Verify using "which jupyter".
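The same check can be done from Python, which also shows whether the jupyter found on PATH belongs to the active venv or to another installation. This is a minimal sketch using only the standard library; the helper names are hypothetical.

```python
import os
import shutil
import sys
from typing import Optional


def is_inside(path: str, directory: str) -> bool:
    """Return True if `path` is located under `directory`."""
    path = os.path.abspath(path)
    directory = os.path.abspath(directory)
    return os.path.commonpath([path, directory]) == directory


def locate_jupyter() -> Optional[str]:
    """Return the jupyter executable found on PATH, or None if there is none."""
    return shutil.which("jupyter")


if __name__ == "__main__":
    jupyter = locate_jupyter()
    if jupyter is None:
        print("jupyter is not on PATH")
    elif is_inside(jupyter, sys.prefix):
        print(f"jupyter found in the active environment: {jupyter}")
    else:
        print(f"jupyter found outside the active environment: {jupyter}")
```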

Once done, set Jupyter as the PySpark driver Python and run the pyspark utility from {venv}/nxcals-bundle/bin/pyspark:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

./venv/nxcals-bundle/bin/pyspark --master yarn --num-executors 10 # please update the path to venv if needed

This will open a browser with the Jupyter notebook. Create/open a notebook file and wait for the kernel and SparkSession to start. After that, the already created SparkSession and SparkContext will be available under the spark and sc variables respectively.

Your notebook is now ready to interact with the NXCALS API.

Known issues

Executing a script in YARN mode from a non-"ACC" machine

When a machine does not have access to the ACC-PY distribution and a script is submitted to the cluster in YARN mode, an error similar to the following may occur:

Error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (ithdp1058.cern.ch executor 3):
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory

The source of the problem is a bug in venv-pack, which can incorrectly rewrite the symlink to the Python executable. The solution is to fix the symlink manually, upgrade NXCALS, or use ACC-PY's Python distribution (link to installers).
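Before fixing the symlink manually, it can help to confirm that it is actually broken. The sketch below assumes you have unpacked the packed venv into a local directory and that it has the layout implied by the error message (a `bin/python` entry inside the environment directory); the function names are hypothetical.

```python
import os


def python_link_is_valid(env_dir: str) -> bool:
    """Return True if <env_dir>/bin/python exists after following symlinks."""
    python_path = os.path.join(env_dir, "bin", "python")
    # os.path.exists follows symlinks, so a dangling link yields False.
    return os.path.exists(python_path)


def describe_python_link(env_dir: str) -> str:
    """Show where <env_dir>/bin/python points, to help with a manual fix."""
    python_path = os.path.join(env_dir, "bin", "python")
    if os.path.islink(python_path):
        return f"{python_path} -> {os.readlink(python_path)}"
    return python_path
```

If `python_link_is_valid` returns False, the target printed by `describe_python_link` shows which path does not exist on the machine where the environment is used.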

Using the NXCALS package with other BE libraries (JPype naming clash)

After the above pip install nxcals commands complete successfully, the following packages are installed in your Python environment:

(venv) [user@host]$ pip list | grep nxcals
nxcals                        x.y.z
nxcals-spark-session-builder  x.y.z
nxcals-extraction-api-python3 x.y.z
nxcals-extraction-api-python3-legacy x.y.z

Note the last package, nxcals-extraction-api-python3-legacy. This package contains the obsolete DataQuery builders under the cern namespace and is loaded only for compatibility reasons. The legacy package is being phased out; if you happen to have it, please upgrade to the latest NXCALS version!

The legacy package locks the cern Python namespace!

Unfortunately, the legacy package locks the cern namespace and needs to be removed if you intend to use NXCALS together with other cern libraries (especially those that expose Java classes via JPype).
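To find out whether the legacy package is present in the current environment without parsing the pip output, the installed distribution metadata can be queried. This is a minimal sketch; `importlib.metadata` requires Python 3.8 or later (on Python 3.7 the `importlib_metadata` backport provides the same interface), and the helper name is hypothetical.

```python
from importlib import metadata
from typing import Optional

LEGACY_PACKAGE = "nxcals-extraction-api-python3-legacy"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version(LEGACY_PACKAGE)
    if version is None:
        print("Legacy package not installed: the cern namespace is not locked by NXCALS.")
    else:
        print(f"Legacy package {version} is installed.")
```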

Overcome issues with NXCALS and JPype

When using the new nxcals package with another cern library that exposes Java classes directly as Python modules (e.g. PyLSA), you will quickly run into exceptions similar to the following example.

Loading Java classes exposed as Python modules via JPype in a Python context:

from cern.lsa.domain.settings import *

yields the following exception:

Traceback (most recent call last):
    File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'cern.lsa'

To avoid this exception, the cern namespace must be unlocked in the Python environment. Only then will JPype be able to expose the requested Java classes as Python proxy modules.

To unlock the cern namespace in the Python environment, the nxcals legacy extraction package needs to be removed! This can be done in the following steps:

1) remove the legacy package from the Python environment

pip uninstall nxcals-extraction-api-python3-legacy

2) remove the cern prefix from the native nxcals DataQuery imports

Before:

from cern.nxcals.api.extraction.data.builders import *

Now:

from nxcals.api.extraction.data.builders import *
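For projects with many files, the import change in step 2 can be applied mechanically. The following is a rough sketch, assuming the legacy imports always start with `cern.nxcals` at the beginning of a `from` or `import` statement; `migrate_imports` is a hypothetical helper, and its output should be reviewed before committing.

```python
import re

# Matches "from cern.nxcals..." or "import cern.nxcals..." at the start of a line.
_LEGACY_IMPORT = re.compile(r"^(\s*(?:from|import)\s+)cern\.nxcals\b", re.MULTILINE)


def migrate_imports(source: str) -> str:
    """Rewrite legacy `cern.nxcals` imports to the native `nxcals` namespace."""
    return _LEGACY_IMPORT.sub(r"\1nxcals", source)
```

Imports from other cern libraries (for example `cern.lsa`) are left untouched, because only the `cern.nxcals` prefix is matched.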

Hint

Proceed with the above actions only if you experience issues with module loading (ModuleNotFoundError). In most use cases this issue does not appear and can be safely ignored.

Info

When the application is deployed, it is not easy to perform the "pip uninstall" step described above directly at the deployment location.

One possibility to overcome this issue is to manually edit the 'deployment/app/requirements.txt' file created in your project folder when running:

acc-py app lock ./

and remove the nxcals-extraction-api-python3-legacy==.... line. The application can then be deployed without that package, since it is no longer part of the requirements specification, using:

acc-py app deploy --deploy-base /tmp/your_deployment_location ./
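The manual edit of the requirements file (between the lock and deploy steps above) could also be scripted. This is a minimal sketch, assuming the lock file contains one `name==version` pin per line; the function name is hypothetical, and the file path to pass in is the one generated in your project.

```python
LEGACY_PACKAGE = "nxcals-extraction-api-python3-legacy"


def drop_legacy_requirement(requirements_text: str) -> str:
    """Return the requirements text without the pinned legacy package line."""
    kept = [
        line
        for line in requirements_text.splitlines(keepends=True)
        # Match only the exact legacy package name followed by a version pin.
        if not line.strip().lower().startswith(LEGACY_PACKAGE + "==")
    ]
    return "".join(kept)
```

Read the generated requirements file, pass its content through `drop_legacy_requirement`, write the result back, and then run the deploy command.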