Spark session creation

This document presents the different ways of creating a Spark session in Java:

Based on default Spark properties

Obtaining a ServiceBuilder instance without any additional arguments creates a local (default) Spark session:

ServiceBuilder defaultServiceBuilder = ServiceBuilder.getInstance();
return defaultServiceBuilder;
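
Under the hood this corresponds, roughly, to a session running in local mode; a minimal sketch of the equivalent using the plain Spark API (the exact defaults applied by NXCALS, including the application name, are an assumption here):

SparkSession localSession = SparkSession.builder()
        .master("local[*]")  // assumed default: local mode using all available cores
        .appName("NXCALS")   // hypothetical default application name, for illustration only
        .getOrCreate();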

The same can be achieved more explicitly by providing default Spark properties (with the required application name):

SparkProperties sparkProperties = SparkProperties.defaults("MY_APP");
ServiceBuilder defaultServiceBuilder = ServiceBuilder.getInstance(sparkProperties);

With supplied Spark properties

Spark properties can be freely customised, as in the example below, where the application will be executed on YARN, running on the cluster:

SparkProperties sparkProperties = new SparkProperties();
sparkProperties.setAppName("MY_APP");
sparkProperties.setMasterType("yarn");

Map<String, String> properties = new HashMap<>();
properties.put("spark.executor.memory", "8G");
properties.put("spark.executor.cores", "10");

// yarn
properties.put("spark.yarn.appMasterEnv.JAVA_HOME", "/var/nxcals/jdk1.11");
properties.put("spark.executorEnv.JAVA_HOME", "/var/nxcals/jdk1.11");
properties.put("spark.yarn.jars", "hdfs:////project/nxcals/lib/spark-3.2.1/*.jar,hdfs:////project/nxcals/nxcals_lib/nxcals_pro/*.jar");
properties.put("spark.yarn.am.extraLibraryPath", "/usr/lib/hadoop/lib/native");
properties.put("spark.executor.extraLibraryPath", "/usr/lib/hadoop/lib/native");
properties.put("spark.yarn.historyServer.address", "ithdp1001.cern.ch:18080");
properties.put("spark.yarn.access.hadoopFileSystems", "nxcals");
properties.put("spark.sql.caseSensitive", "true");

sparkProperties.setProperties(properties);

ServiceBuilder yarnServiceBuilder = ServiceBuilder.getInstance(sparkProperties);
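
To verify that the supplied properties have been picked up, the session can also be created directly with SparkUtils (used later in this document) and its runtime configuration inspected; a minimal sketch:

SparkSession session = SparkUtils.createSparkSession(sparkProperties);
// Should print "8G", as configured above
System.out.println(session.conf().get("spark.executor.memory"));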

Note

More information about the YARN-related properties used above can be found on the Apache Spark documentation pages (https://spark.apache.org/docs/latest/running-on-yarn.html).

Using predefined resources

For user convenience, NXCALS provides predefined resource sizings which can be used to create a Spark session on YARN:

SparkSessionFlavor initialProperties = SparkSessionFlavor.MEDIUM;

// Please replace "pro" with "testbed" when accessing NXCALS TESTBED
ServiceBuilder yarnServiceBuilderFromPredefinedProperties
        = ServiceBuilder.getInstance(SparkProperties.remoteSessionProperties("MY_APP", initialProperties, "pro"));

Note

There are three different configurations available:

Configuration type   Spark executor cores   Spark executor instances   Spark executor memory
SMALL                2                      4                          2g
MEDIUM               4                      8                          4g
LARGE                4                      16                         4g
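
If the sizing is to be selected at runtime, SparkSessionFlavor can be handled like any Java enum (assuming it is a plain enum, as the usage above suggests), for example:

// Pick a flavor from a system property, falling back to MEDIUM
// ("nxcals.flavor" is a hypothetical property name, for illustration only)
String sizing = System.getProperty("nxcals.flavor", "MEDIUM");
SparkSessionFlavor flavor = SparkSessionFlavor.valueOf(sizing.toUpperCase());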

Using predefined resources as a base for customization

The properties retrieved for a predefined flavor can also serve as a starting point and be customized further:

SparkSessionFlavor initialProperties = SparkSessionFlavor.LARGE;
SparkProperties sparkProperties = SparkProperties.remoteSessionProperties("MY_APP", initialProperties, "pro");

Map<String, String> props = new HashMap<>();

// Customize retrieved properties, for example:
props.put("spark.executor.memory", "10G");
props.put("spark.executor.cores", "10");

sparkProperties.setProperties(props);

SparkSession sparkSession = SparkUtils.createSparkSession(sparkProperties);
ServiceBuilder serviceBuilder = ServiceBuilder.getInstance(sparkSession);
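
Once created, the session can be exercised directly, before any NXCALS services are built; a quick smoke test using the plain Spark API:

// Trivial job confirming that executors can be allocated with the customized resources
long count = sparkSession.range(1000).count();
System.out.println("Row count: " + count); // expected: 1000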

By reusing Spark session

Yet another way of initializing a ServiceBuilder is to reuse an existing Spark session:

SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("MY_APP");
org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(sparkConf);
SparkSession session = new SparkSession(sc);

ServiceBuilder serviceBuilderFromSession = ServiceBuilder.getInstance(session);
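
The same session can be obtained more idiomatically with the SparkSession builder from the plain Spark API, avoiding the manual SparkContext construction:

SparkSession session = SparkSession.builder()
        .master("local[4]")
        .appName("MY_APP")
        .getOrCreate(); // reuses an existing active session if there is one

ServiceBuilder serviceBuilderFromSession = ServiceBuilder.getInstance(session);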

Via NXCALS SparkUtils

For convenience, one can use the NXCALS SparkUtils methods, as in the example below:

SparkConf sparkConf = SparkUtils.createSparkConf(SparkProperties.defaults("MY_APP"));
ServiceBuilder serviceBuilderFromSparkUtility =
        ServiceBuilder.getInstance(SparkUtils.createSparkSession(sparkConf));
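
Regardless of how it was created, the session should be stopped once it is no longer needed, using the standard Spark API:

SparkSession session = SparkUtils.createSparkSession(sparkConf);
ServiceBuilder serviceBuilder = ServiceBuilder.getInstance(session);
// ... use the NXCALS services ...
session.stop(); // releases the executors (YARN) or local threads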