How to access a Google Storage bucket while running a Jupyter notebook with PySpark on GKE (Kubernetes)?


My goal is to run PySpark code in a Jupyter notebook on Kubernetes while reading logs from a Google Storage bucket. Sounds simple, maybe.

After much clicking, sweat, and tears, I've managed to run Jupyter with PySpark on Kubernetes, but I fail to read from a Google Storage bucket. Or, to put it in code terms, I fail to run:
df = spark.read.parquet("gs://bucket_name/puppy.snappy.parquet")

I've built the setup as follows. First, a YAML file setting up a Jupyter notebook running as a StatefulSet:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: jupyter
  namespace: spark
  labels:
    release: jupyter
secrets:
- name: bucket-key

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jupyter
  labels:
    release: jupyter
  namespace: spark
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - get
  - delete
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - get
  - create
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - create
  - list
  - watch
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jupyter
  labels:
    release: jupyter
  namespace: spark
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: jupyter
subjects:
- kind: ServiceAccount
  name: jupyter
  namespace: spark

---
apiVersion: v1
kind: Service
metadata:
  name: jupyter
  labels:
    release: jupyter
spec:
  type: ClusterIP
  selector:
    release: jupyter
  ports:
  - name: http
    port: 8888
    protocol: TCP
  - name: blockmanager
    port: 7777
    protocol: TCP
  - name: driver
    port: 2222
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-headless
  labels:
    release: jupyter

spec:
  type: ClusterIP
  clusterIP: None
  publishNotReadyAddresses: false
  selector:
    release: jupyter
  ports:
  - name: http
    port: 8888
    protocol: TCP
  - name: blockmanager
    port: 7777
    protocol: TCP
  - name: driver
    port: 2222
    protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jupyter
  namespace: spark
  labels:
    release: jupyter

spec:
  replicas: 1
  updateStrategy:
    type: RollingUpdate
  serviceName: jupyter-headless
  podManagementPolicy: Parallel

  volumeClaimTemplates:
  - metadata:
      name: notebook-data
      labels:
        release: jupyter
    spec:
      accessModes:
      - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 100Mi
  selector:
    matchLabels:
      release: jupyter
  template:
    metadata:
      labels:
        release: jupyter
      annotations:
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      serviceAccountName: jupyter
      dnsConfig:
        options:
        - name: ndots
          value: "1"
      volumes:
        - name: bucket-service-account-vol
          secret:
            secretName: bucket-key
      containers:
      - name: jupyter
        image: "jjgershon/spark:3.1.1-hadoop-3.2.0-gcp-jupyter"
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 8888
          protocol: TCP
        - name: blockmanager
          containerPort: 7777
          protocol: TCP
        - name: driver
          containerPort: 2222
          protocol: TCP
        volumeMounts:
        - name: notebook-data
          mountPath: /home/notebook
        - name: bucket-service-account-vol
          mountPath: /var/secrets/google
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        resources:
          limits:
            cpu: 500m
            memory: 2048Mi
          requests:
            cpu: 500m
            memory: 2048Mi

The YAML is mostly taken from this post.

Now, this YAML succeeds in granting the IAM service account to the Jupyter notebook itself, but not to the executor pods. When I run my code, the executor pods are created:

kubectl -n spark get pods
NAME                                    READY   STATUS    RESTARTS   AGE
gcplocalstack-736c087b06a73790-exec-1   1/1     Running   0          23s
gcplocalstack-736c087b06a73790-exec-2   1/1     Running   0          23s
jupyter-0                               1/1     Running   0          151m

but I get this error:

2*****[email protected] does not have storage.objects.get access to the Google Cloud Storage object

The reason for the error is that the executor pods are using the default IAM service account, which doesn't have access to the bucket I'm trying to read.
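For what it's worth, the GCS connector falls back to the default IAM service account whenever no usable key file is configured. Here is a small sanity check (a sketch of my own, not part of the setup above; the helper name is mine) that can be run in the notebook to see how credentials would resolve on the driver:

```python
import os

def credentials_status(env=None):
    """Report how Application Default Credentials would resolve.

    Returns a short description: either the key-file path that will be
    used, or the reason the GCS connector would fall back to the
    default IAM service account.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path is None:
        return "unset: falls back to the default IAM service account"
    if not os.path.isfile(key_path):
        return f"set but file missing: {key_path}"
    return f"using key file: {key_path}"

print(credentials_status())
```

On the driver pod this should print the mounted key path, since the StatefulSet sets `GOOGLE_APPLICATION_CREDENTIALS`; the executors have no such env entry, which matches the error above.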

To solve this, I've added the following Spark configuration in the Jupyter notebook:

    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "jupyter",
    "spark.kubernetes.driver.secrets.bucket-key": "/var/secrets/google/key.json",
    "spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    "spark.kubernetes.driver.secrets.key": "/var/secrets/google",
    "spark.kubernetes.executor.secrets.key": "/var/secrets/google",
    "spark.kubernetes.executor.secrets.bucket-key": "/var/secrets/google/key.json",
    "spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/var/secrets/google/key.json",

But now the executor pods get stuck in ContainerCreating:

NAME                                    READY   STATUS              RESTARTS   AGE
gcplocalstack-1c6d917b06ac59fa-exec-1   0/1     ContainerCreating   0          53s
gcplocalstack-1c6d917b06ac59fa-exec-2   0/1     ContainerCreating   0          53s
jupyter-0                               1/1     Running             0          157m

and I get this error message:

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This might be because the new IAM service account doesn't have permissions on GKE, but I'm not sure. It also seems like this is the line causing the problem: "spark.kubernetes.executor.secrets.key": "/var/secrets/google" (running kubectl describe on a stuck executor pod should show a FailedMount event if it references a Kubernetes secret that doesn't exist).
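If that suspicion is right, the fix may lie in how the secret properties are spelled. In Spark on Kubernetes, `spark.kubernetes.driver.secrets.[name]` and `spark.kubernetes.executor.secrets.[name]` mount the Kubernetes secret called `[name]` at the directory given as the value, so `...secrets.key` asks Kubernetes to mount a secret literally named `key`; if no such secret exists, the executor pods wait on the volume forever. A hedged sketch of just the secret-related configuration, assuming the only secret in the namespace is `bucket-key` with a `key.json` entry:

```python
# Sketch only: assumes the 'spark' namespace contains a single Kubernetes
# secret named 'bucket-key' whose data key is 'key.json'.
secret_conf = {
    # The segment after '.secrets.' is the Kubernetes secret name; the value
    # is the directory where its contents get mounted inside the pod.
    "spark.kubernetes.driver.secrets.bucket-key": "/var/secrets/google",
    "spark.kubernetes.executor.secrets.bucket-key": "/var/secrets/google",
    # Point both driver and executors at the mounted key file.
    "spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    "spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    # Tell the GCS connector to use the key file instead of the default SA.
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/var/secrets/google/key.json",
}
```

Note there is no `...secrets.key` entry at all, and the mount value is the directory rather than the `key.json` path.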

The full PySpark code in the Jupyter notebook:

from pyspark import SparkConf
from pyspark.sql import SparkSession

config = {
    "spark.kubernetes.driver.pod.name": "jupyter-0",
    "spark.kubernetes.namespace": "spark",
    "spark.kubernetes.container.image": "jjgershon/spark:3.1.1-hadoop-3.2.0-gcp",
    "spark.executor.instances": "2",
    "spark.executor.memory": "1g",
    "spark.executor.cores": "1",
    "spark.driver.blockManager.port": "7777",
    "spark.driver.port": "2222",
    "spark.driver.host": "jupyter.spark.svc.cluster.local",
    "spark.driver.bindAddress": "0.0.0.0",
   
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.kubernetes.authenticate.driver.serviceAccountName": "jupyter",
    "spark.kubernetes.driver.secrets.bucket-key": "/var/secrets/google/key.json",
    "spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    "spark.kubernetes.driver.secrets.key": "/var/secrets/google",
    "spark.kubernetes.executor.secrets.key": "/var/secrets/google",
    "spark.kubernetes.executor.secrets.bucket-key": "/var/secrets/google/key.json",
    "spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/key.json",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/var/secrets/google/key.json",
}

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local")
    for key, value in config.items():
        conf.set(key, value)    
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
spark = get_spark_session("gcplocalstack", SparkConf())
df = spark.read.parquet("gs://bucket_name/puppy.snappy.parquet")

I will be forever grateful for any help.
Thanks!!! 🙂