Running PySpark as a standalone application
You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application.
Submitting Spark Applications
To submit an application consisting of a Python file you can use the spark-submit script.
Simplified spark-submit Syntax
spark-submit --option value python file [application arguments]
|python file||Path to a Python file containing a Spark application. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster.|
|application arguments||Arguments to pass to the main method of your application.|
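The ordering in the syntax above (options first, then the Python file, then the application arguments) can be sketched as a small helper that assembles the command line. This is only an illustration: the function name build_spark_submit_command and the two date arguments are hypothetical, and the helper builds the argument list without running anything.

```python
import shlex


def build_spark_submit_command(python_file, options=None, app_args=None,
                               spark_submit="spark-submit"):
    """Return a spark-submit invocation as a list of argument strings."""
    command = [spark_submit]
    # Each option becomes "--name value" and goes before the Python file.
    for name, value in (options or {}).items():
        command += ["--" + name, str(value)]
    command.append(python_file)
    # Everything after the Python file is handed to the application itself.
    command += [str(arg) for arg in (app_args or [])]
    return command


command = build_spark_submit_command(
    "test.py",
    options={"master": "local[2]", "driver-memory": "2g"},
    app_args=["2018-06-17", "2018-06-20"],  # hypothetical application arguments
)
# shlex.join (Python 3.8+) quotes values such as local[2] for the shell.
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run if the job should be launched from Python rather than from a shell.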
Selected spark-submit Options
|deploy-mode||Deployment mode: cluster or client. In cluster mode, the driver runs on worker hosts. In client mode, the driver runs locally as an external client. Use cluster mode for production jobs; client mode is more appropriate for interactive use and debugging, where you want to see your application output immediately.|
|driver-cores||Number of cores used by the driver in cluster mode.|
|driver-memory||Maximum heap size (represented as a JVM string; for example 1024m, 2g, and so on) to allocate to the driver. Alternatively, you can use the spark.driver.memory property.|
|master||The location to run the application. The rows below list commonly used values.|
|local||Run Spark locally with one worker thread (that is, no parallelism).|
|local[K]||Run Spark locally with K worker threads. (Ideally, set this to the number of cores on your host.)|
|local[*]||Run Spark locally with as many worker threads as logical cores on your host.|
|yarn||Run using a YARN cluster manager.|
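To make the local master values concrete, the following sketch maps each value from the table to the number of worker threads it implies. It mirrors the table only and is not how Spark itself parses the value; the function name local_worker_threads is hypothetical, and other master forms accepted by Spark are not handled.

```python
import os
import re


def local_worker_threads(master):
    """Return the worker-thread count implied by a local master value.

    Returns None for values that are not local, such as "yarn".
    """
    if master == "local":
        return 1  # a single worker thread, no parallelism
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    if match.group(1) == "*":
        # One thread per logical core; fall back to 1 if the count is unknown.
        return os.cpu_count() or 1
    return int(match.group(1))


for value in ("local", "local[4]", "local[*]", "yarn"):
    print(value, "->", local_worker_threads(value))
```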
For more information, refer to the Apache Spark Submitting Applications documentation.
Let's create a test.py file that counts the intensity data points within a time window:
from nxcals.api.extraction.data.builders import DevicePropertyDataQuery
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Configure the application: name, cluster manager and driver memory.
conf = (SparkConf()
        .setAppName("intensity_example")
        .setMaster('yarn')
        .set('spark.driver.memory', '16G')
        )
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# Query the intensity data for the selected time window.
intensity = DevicePropertyDataQuery.builder(spark).system("CMW") \
    .startTime("2018-06-17 00:00:00.000").endTime("2018-06-20 00:00:00.000") \
    .entity().parameter("PR.BCT/HotspotIntensity").build()

# Count once and reuse the result: every count() call runs a Spark job.
count = intensity.count()
print("Found %d intensity data points." % count)
Alternatively, the Spark session can be initialized using a session builder:
spark = SparkSession.builder.master("local").appName('intensity_example').getOrCreate() ...
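The time window in test.py is hard-coded. Because spark-submit forwards the application arguments to the script, the window could instead be read from the command line. A minimal sketch, assuming hypothetical --start and --end options and a hypothetical helper name parse_time_window; the defaults are the timestamps used in test.py:

```python
import argparse


def parse_time_window(argv):
    """Parse the (hypothetical) --start and --end application arguments."""
    parser = argparse.ArgumentParser(description="Count intensity data points")
    parser.add_argument("--start", default="2018-06-17 00:00:00.000",
                        help="start of the time window")
    parser.add_argument("--end", default="2018-06-20 00:00:00.000",
                        help="end of the time window")
    args = parser.parse_args(argv)
    return args.start, args.end


# Inside test.py this would be called as parse_time_window(sys.argv[1:]).
start, end = parse_time_window(["--start", "2018-06-18 00:00:00.000"])
print(start, end)
```

The two values would then replace the literal strings passed to startTime and endTime. On the command line, timestamps containing a space need to be quoted.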
To submit a Spark job, spark-submit must be invoked from the directory of our Spark bundle for NXCALS.
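Certain Spark parameters can be specified as arguments. Let's run our test application on a YARN cluster with 8 cores in total (2 executors with 4 cores each) and 4 GB of memory per executor. On YARN the executor cores are requested with --num-executors and --executor-cores; --total-executor-cores applies only to the standalone and Mesos cluster managers:
./bin/spark-submit --master yarn --executor-memory 4G --num-executors 2 --executor-cores 4 test.py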
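Settings can also be passed to spark-submit as configuration properties through its --conf option, as the driver-memory entry above hints with spark.driver.memory. The sketch below expands a dictionary of properties into repeated --conf arguments; the helper name conf_arguments is hypothetical and the property values are only examples.

```python
def conf_arguments(properties):
    """Expand Spark properties into repeated '--conf key=value' arguments."""
    arguments = []
    # Sort the keys so the generated command line is reproducible.
    for key, value in sorted(properties.items()):
        arguments += ["--conf", f"{key}={value}"]
    return arguments


properties = {
    "spark.executor.memory": "4g",   # same effect as --executor-memory 4G
    "spark.driver.memory": "16g",    # same effect as --driver-memory 16G
}
command = ["./bin/spark-submit", "--master", "yarn",
           *conf_arguments(properties), "test.py"]
print(" ".join(command))
```

The Spark configuration documentation notes that in client mode spark.driver.memory should be set on the command line or in the default properties file rather than through SparkConf in the application, because the driver JVM has already started by the time the application code runs.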