Using SWAN
Using SWAN is the easiest and recommended way for simple scripting. Everything is configured and up to date. For more details, see the Using SWAN page.
Installation from acc-py repository (Python 3.7+)
To install the package, you need Python 3, version 3.7 or later (the currently supported version is 3.7.9). Acc-Py is available on machines from the CERN accelerator sector (ACC).
source /acc/local/share/python/acc-py/base/pro/setup.sh
acc-py venv ./venv
source ./venv/bin/activate
python -m pip install nxcals
On a machine without the Acc-Py distribution, a plain Python venv can be used together with the Acc-Py package index instead:
python3 -m venv ./venv
source ./venv/bin/activate
python -m pip install -U --index-url https://acc-py-repo.cern.ch/repository/vr-py-releases/simple --trusted-host acc-py-repo.cern.ch nxcals
Important
If the following error message appears during installation of the NXCALS package:
zipfile.BadZipFile: File is not a zip file
retry the installation with the pip cache disabled:
python -m pip install nxcals --no-cache
Running
- You need a valid Kerberos ticket. To initialize Kerberos:
kinit
- Activate virtual environment:
source ./venv/bin/activate
- Start pyspark session:
pyspark
or
python
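The first step above (a valid Kerberos ticket) can be checked from Python before a session is started. This is a minimal sketch, assuming the standard `klist` client tool is installed and that its `-s` option reports the ticket state through the exit status; the helper name `has_valid_kerberos_ticket` is hypothetical and not part of NXCALS.

```python
import shutil
import subprocess


def has_valid_kerberos_ticket() -> bool:
    """Return True if `klist -s` reports a readable, non-expired ticket cache."""
    klist = shutil.which("klist")
    if klist is None:
        # Kerberos client tools are not installed on this machine.
        return False
    # `klist -s` prints nothing and signals the result via its exit status.
    result = subprocess.run([klist, "-s"], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_kerberos_ticket():
        print("Kerberos ticket found.")
    else:
        print("No valid Kerberos ticket; run `kinit` first.")
```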
During the first execution of PySpark/Python, a compressed virtual environment (packed venv) is created in the user's default temporary space. The created packed venv contains:
- the default Spark configuration file
- a minimal set of packages required by the PySpark session to be used in YARN mode
- packages provided by the user that are present in the venv
Consecutive executions require less time for startup because the previously created file is reused.
The unique name of the file is determined by the SPARK_HOME environment variable (the path to Spark files), USER (username), and the Python version.
The file's name is displayed during the startup, along with instructions for rebuilding the packed venv (which essentially involves removing the file).
Rebuilding of the packed venv should only be done in cases where new or modified packages are provided by the user. After the rebuild, they will be available in the user script being executed in the new (restarted) PySpark session.
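Since rebuilding essentially means removing the packed venv file, the step can be scripted. The sketch below assumes you pass in the file name printed at startup; the function name `remove_packed_venv` is hypothetical and is not provided by NXCALS.

```python
from pathlib import Path


def remove_packed_venv(packed_venv_path: str) -> bool:
    """Delete the packed venv file so that it is rebuilt on the next start.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    path = Path(packed_venv_path).expanduser()
    if path.is_file():
        path.unlink()
        return True
    return False


# Example (use the file name displayed during startup):
# remove_packed_venv("<path printed at startup>")
```

Restart the PySpark/Python session afterwards so that the packed venv is created again with the new or modified packages.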
Important
The target directory for the packed venv can be set with the environment variable NXCALS_WORKSPACE_TEMP_DIR (if it is not set, a temporary directory is used).
Setting the environment variable NXCALS_PACK_ALL_PACKAGES=true enables adding NXCALS-related files to the packed venv.
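When the session is started from a plain python process, the two variables can also be set from the script itself. This is a sketch under an assumption that is not confirmed by this page: that the variables are read when the Spark session (and thus the packed venv) is created, so they must be set before that point. If in doubt, export them in the shell before starting pyspark or python. The helper name `configure_packed_venv` is hypothetical.

```python
import os
from pathlib import Path


def configure_packed_venv(workspace_dir: str, pack_all_packages: bool = False) -> None:
    """Set the packed-venv environment variables for the current process."""
    target = Path(workspace_dir).expanduser()
    # Creating the directory up front is a precaution, not a documented requirement.
    target.mkdir(parents=True, exist_ok=True)
    os.environ["NXCALS_WORKSPACE_TEMP_DIR"] = str(target)
    if pack_all_packages:
        os.environ["NXCALS_PACK_ALL_PACKAGES"] = "true"


# Example with a hypothetical directory; call this before creating the Spark session:
# configure_packed_venv("~/nxcals-packed-venv", pack_all_packages=True)
```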
When running python rather than pyspark, a Spark session object has to be created:
from nxcals.spark_session_builder import get_or_create
from nxcals.api.extraction.data.builders import DataQuery
spark = get_or_create("My_APP")
df = DataQuery.builder(spark).entities().system('CMW') \
    .keyValuesEq({'device': 'LHC.LUMISERVER', 'property': 'CrossingAngleIP1'}) \
    .timeWindow('2022-04-22 00:00:00.000', '2022-04-23 00:00:00.000') \
    .build()
You can find more examples in the Extraction API chapter.
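The time window in the query above is given as two timestamp strings with millisecond precision. As a small convenience sketch, the strings for a whole-day window can be generated with the standard library; the helper names `format_timestamp` and `day_window` are hypothetical and not part of the NXCALS API.

```python
from datetime import date, datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def format_timestamp(moment: datetime) -> str:
    """Format a datetime like '2022-04-22 00:00:00.000'."""
    # %f yields six digits (microseconds); drop the last three to get milliseconds.
    return moment.strftime(TIMESTAMP_FORMAT)[:-3]


def day_window(day: date, days: int = 1):
    """Return (start, end) strings covering `days` whole days from midnight of `day`."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=days)
    return format_timestamp(start), format_timestamp(end)


start, end = day_window(date(2022, 4, 22))
print(start, end)  # 2022-04-22 00:00:00.000 2022-04-23 00:00:00.000
```

The two strings can then be passed to `.timeWindow(start, end)` in the query.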
Using bundle
If you want to have everything packed and do not want to configure a venv with acc-py, you can use our bundle. It contains a preconfigured Spark and the required Python packages, which is everything you need to start using NXCALS with Scala or Python. The installation guide can be found here.
Installation on LXPLUS
Installing the Python package on LXPLUS is covered on the dedicated page Using LXPLUS.
Jupyter
First, install Jupyter into the venv in which NXCALS is installed.
python -m pip install jupyter
Important
Make sure to have jupyter in your PATH. Verify using "which jupyter".
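The same PATH check can be done from Python with the standard library, which may be handy inside a setup script. A minimal sketch; the helper name `find_executable` is hypothetical.

```python
import shutil


def find_executable(name: str = "jupyter"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable("jupyter")
if path is None:
    print("jupyter is not on PATH; activate the venv or install it first.")
else:
    print(f"jupyter found at {path}")
```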
Once done, set jupyter as the PySpark driver Python and run the pyspark utility from {venv}/nxcals-bundle/bin/pyspark:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
./venv/nxcals-bundle/bin/pyspark --master yarn --num-executors 10 # please update the path to venv if needed
This will open a browser with the Jupyter notebook. Create or open a notebook file and wait for the kernel and SparkSession to start. After that, the already created SparkSession and SparkContext will be available under the spark and sc variables respectively.
Now your notebook is ready for interaction with the NXCALS API.
Known issues
Executing script in the YARN mode from a non "ACC" machine
When the machine does not have access to the ACC-PY distribution and a script is submitted to the cluster in YARN mode, an error similar to the following may occur:
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (ithdp1058.cern.ch executor 3):
java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
The error is caused by venv-pack, which can incorrectly rewrite the symlink to the python executable.
The solution is either fixing it manually, upgrading NXCALS or using ACC-PY's python distribution (link to installers).
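Before fixing the link manually, it can help to confirm which links are actually broken. The following diagnostic sketch lists dangling symlinks in a directory; it assumes you have an unpacked copy of the packed venv available locally and point the function at its bin directory. The function name `find_broken_symlinks` is hypothetical.

```python
from pathlib import Path


def find_broken_symlinks(bin_dir: str):
    """Return the symlinks in `bin_dir` whose targets do not exist."""
    broken = []
    for entry in sorted(Path(bin_dir).iterdir()):
        # is_symlink() inspects the link itself; exists() follows it to the target.
        if entry.is_symlink() and not entry.exists():
            broken.append(entry)
    return broken


# Example with a hypothetical location of the unpacked environment:
# for link in find_broken_symlinks("./environment/bin"):
#     print(f"broken link: {link}")
```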
Use NXCALS package with other BE libraries (JPype naming clash)
After the pip install nxcals commands above complete successfully, the following packages are installed in your Python environment:
(venv) [user@host]$ pip list | grep nxcals
nxcals x.y.z
nxcals-spark-session-builder x.y.z
nxcals-extraction-api-python3 x.y.z
nxcals-extraction-api-python3-legacy x.y.z
Note the last package, nxcals-extraction-api-python3-legacy. This package contains the obsolete DataQuery builders under the cern namespace and is loaded only for compatibility reasons. The legacy package is being phased out; if you still have it, please upgrade to the latest NXCALS version!
Legacy package is locking cern Python namespace!
Unfortunately, it is the legacy package that locks the cern namespace, and it needs to be removed if you intend to use NXCALS together with other cern libraries (especially the ones that expose Java classes via JPype).
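To find out whether the legacy package is present in the active environment without scanning the pip list output, the installed distributions can be queried from Python. A minimal sketch, assuming Python 3.8 or later where `importlib.metadata` is part of the standard library (Python 3.7 would need the `importlib_metadata` backport); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8

LEGACY_PACKAGE = "nxcals-extraction-api-python3-legacy"


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version(LEGACY_PACKAGE)
if version is None:
    print("Legacy package is not installed.")
else:
    print(f"Legacy package {version} is installed and may lock the cern namespace.")
```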
Overcome issues with NXCALS and JPype
When using the new nxcals package with another cern library that exposes Java classes directly as Python modules (e.g. PyLSA), one will quickly run into exceptions similar to the following example.
Trying to load Java classes exposed as Python modules via JPype in a Python context:
from cern.lsa.domain.settings import *
yields the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'cern.lsa'
To avoid this exception, we need to unlock the cern namespace in the Python environment. Only then can JPype expose the requested Java classes as Python proxy modules.
To unlock cern namespace in the python env, the nxcals legacy extraction package needs to be removed! One can achieve that by the following steps:
1) remove the legacy package from the Python environment
pip uninstall nxcals-extraction-api-python3-legacy
2) remove the cern package prefix from the nxcals native Python DataQuery imports
Before:
from cern.nxcals.api.extraction.data.builders import *
After:
from nxcals.api.extraction.data.builders import *
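For projects with many files, the import change can be applied mechanically. The sketch below rewrites only import statements that start with the `cern.nxcals.` prefix and leaves other cern imports (such as `cern.lsa`) untouched; the function name `migrate_imports` is hypothetical, and the result should be reviewed before committing.

```python
import re

# Matches the module prefix at the start of a `from ...` or `import ...` statement.
LEGACY_IMPORT = re.compile(r"^([ \t]*(?:from|import)[ \t]+)cern\.nxcals\.", re.MULTILINE)


def migrate_imports(source: str) -> str:
    """Rewrite `cern.nxcals.*` imports to the native `nxcals.*` namespace."""
    return LEGACY_IMPORT.sub(r"\g<1>nxcals.", source)


before = "from cern.nxcals.api.extraction.data.builders import *"
print(migrate_imports(before))  # from nxcals.api.extraction.data.builders import *
```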
Hint
Proceed with the above actions only if you experience issues with module loading (ModuleNotFoundError). In most use cases this issue is not visible and can be safely ignored.
Info
In the scenarios when the application is deployed, it is not easy to perform "pip uninstall" step described above directly at the deployment location.
One possibility to overcome this issue is to manually edit the 'deployment/app/requirments.txt' file, which is created in your project folder by running:
acc-py app lock ./
and to remove the following line from it:
nxcals-extraction-api-python3-legacy==....
Then the application can be deployed without that package, which is no longer part of the requirements specification, using:
acc-py app deploy --deploy-base /tmp/your_deployment_location ./
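The manual edit of the lock file can also be scripted between the lock and deploy steps. This is a sketch under the assumption that the file contains plain `name==version` lines as shown above; the function name `drop_requirement` is hypothetical. Only an exact name match is dropped, so nxcals-extraction-api-python3 itself is kept.

```python
LEGACY_PACKAGE = "nxcals-extraction-api-python3-legacy"


def drop_requirement(lines, package=LEGACY_PACKAGE):
    """Return `lines` without the pinned entry for `package`."""
    kept = []
    for line in lines:
        # Compare only the name before '==' so similarly named packages are kept.
        name = line.split("==", 1)[0].strip().lower()
        if name != package:
            kept.append(line)
    return kept


# Example: read the generated lock file, filter it, and write it back.
# with open(lock_file_path) as handle:      # lock_file_path: the file named above
#     filtered = drop_requirement(handle.readlines())
# with open(lock_file_path, "w") as handle:
#     handle.writelines(filtered)
```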