Spark session creation

This document presents the different ways of creating a Spark session in Java:

Based on default Spark properties

Obtaining a ServiceBuilder instance without any additional arguments creates a local (default) Spark session:

ServiceBuilder defaultServiceBuilder = ServiceBuilder.getInstance();
return defaultServiceBuilder;
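
Under the hood this corresponds, roughly, to a session running in local mode; a minimal sketch of the equivalent using the plain Spark API (the exact defaults applied by NXCALS, including the application name, are an assumption here):

SparkSession localSession = SparkSession.builder()
        .master("local[*]")  // assumed default: local mode using all available cores
        .appName("NXCALS")   // hypothetical default application name, for illustration only
        .getOrCreate();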

The same can be achieved more explicitly by providing default Spark properties (with the required application name):

SparkProperties sparkProperties = SparkProperties.defaults("MY_APP");
ServiceBuilder defaultServiceBuilder = ServiceBuilder.getInstance(sparkProperties);

With supplied Spark properties

Spark properties can be freely customised, as in the example below, where the application will be executed on YARN, running on the cluster:

SparkProperties sparkProperties = new SparkProperties();
sparkProperties.setAppName("MY_APP");
sparkProperties.setMasterType("yarn");

Map<String, String> properties = new HashMap<>();
properties.put("spark.executor.memory", "8G");
properties.put("spark.executor.cores", "10");

// yarn
properties.put("spark.yarn.appMasterEnv.JAVA_HOME", "/var/nxcals/jdk1.11");
properties.put("spark.executorEnv.JAVA_HOME", "/var/nxcals/jdk1.11");
properties.put("spark.yarn.jars", "hdfs:////project/nxcals/lib/spark-3.2.1/*.jar,hdfs:////project/nxcals/nxcals_lib/nxcals_pro/*.jar");
properties.put("spark.yarn.am.extraLibraryPath", "/usr/lib/hadoop/lib/native");
properties.put("spark.executor.extraLibraryPath", "/usr/lib/hadoop/lib/native");
properties.put("spark.yarn.historyServer.address", "ithdp1001.cern.ch:18080");
properties.put("spark.yarn.access.hadoopFileSystems", "nxcals");
properties.put("spark.sql.caseSensitive", "true");

sparkProperties.setProperties(properties);

ServiceBuilder yarnServiceBuilder = ServiceBuilder.getInstance(sparkProperties);
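
To verify that the supplied properties have been picked up, the session can also be created directly with SparkUtils (used later in this document) and its runtime configuration inspected; a minimal sketch:

SparkSession session = SparkUtils.createSparkSession(sparkProperties);
// Should print "8G", as configured above
System.out.println(session.conf().get("spark.executor.memory"));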

Note

More information about the YARN-related properties used above can be found on the Apache Spark documentation pages (https://spark.apache.org/docs/latest/running-on-yarn.html).

Using predefined resources

For user convenience, NXCALS provides predefined resource sizings which can be used to create a Spark session on YARN:

SparkSessionFlavor initialProperties = SparkSessionFlavor.MEDIUM;

// Please replace "pro" with "testbed" when accessing NXCALS TESTBED
ServiceBuilder yarnServiceBuilderFromPredefinedProperties
        = ServiceBuilder.getInstance(SparkProperties.remoteSessionProperties("MY_APP", initialProperties, "pro"));

Note

There are three different configurations available:

Configuration type   Spark executor cores   Spark executor instances   Spark executor memory
SMALL                2                      4                          2g
MEDIUM               4                      8                          4g
LARGE                4                      16                         4g
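
If the sizing is to be selected at runtime, SparkSessionFlavor can be handled like any Java enum (assuming it is a plain enum, as the usage above suggests), for example:

// Pick a flavor from a system property, falling back to MEDIUM
// ("nxcals.flavor" is a hypothetical property name, for illustration only)
String sizing = System.getProperty("nxcals.flavor", "MEDIUM");
SparkSessionFlavor flavor = SparkSessionFlavor.valueOf(sizing.toUpperCase());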

Using predefined resources as a base for customization

The properties retrieved for a predefined flavor can also serve as a starting point and be customized further:

SparkSessionFlavor initialProperties = SparkSessionFlavor.LARGE;
SparkProperties sparkProperties = SparkProperties.remoteSessionProperties("MY_APP", initialProperties, "pro");

Map<String, String> props = new HashMap<>();

// Customize retrieved properties, for example:
props.put("spark.executor.memory", "10G");
props.put("spark.executor.cores", "10");

sparkProperties.setProperties(props);

SparkSession sparkSession = SparkUtils.createSparkSession(sparkProperties);
ServiceBuilder serviceBuilder = ServiceBuilder.getInstance(sparkSession);
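
Once created, the session can be exercised directly, before any NXCALS services are built; a quick smoke test using the plain Spark API:

// Trivial job confirming that executors can be allocated with the customized resources
long count = sparkSession.range(1000).count();
System.out.println("Row count: " + count); // expected: 1000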

By reusing Spark session

Yet another way of initializing a ServiceBuilder is to reuse an existing Spark session:

SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("MY_APP");
org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(sparkConf);
SparkSession session = new SparkSession(sc);

ServiceBuilder serviceBuilderFromSession = ServiceBuilder.getInstance(session);
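
The same session can be obtained more idiomatically with the SparkSession builder from the plain Spark API, avoiding the manual SparkContext construction:

SparkSession session = SparkSession.builder()
        .master("local[4]")
        .appName("MY_APP")
        .getOrCreate(); // reuses an existing active session if there is one

ServiceBuilder serviceBuilderFromSession = ServiceBuilder.getInstance(session);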

Via NXCALS SparkUtils

For convenience, one can use the NXCALS SparkUtils methods, as in the example below:

SparkConf sparkConf = SparkUtils.createSparkConf(SparkProperties.defaults("MY_APP"));
ServiceBuilder serviceBuilderFromSparkUtility =
        ServiceBuilder.getInstance(SparkUtils.createSparkSession(sparkConf));
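
Regardless of how it was created, the session should be stopped once it is no longer needed, using the standard Spark API:

SparkSession session = SparkUtils.createSparkSession(sparkConf);
ServiceBuilder serviceBuilder = ServiceBuilder.getInstance(session);
// ... use the NXCALS services ...
session.stop(); // releases the executors (YARN) or local threads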