I am able to load a Spark dataset in a kedro ipython session.
- First, I configured the Spark session as described here.
- Then I launched a kedro ipython session with either `ipython --ext kedro.extras.extensions.ipython` or `kedro ipython`.
- Then I am able to load Spark datasets as defined in the catalog:
```python
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pathlib import Path
import pyspark.sql
from kedro.extras.datasets.spark import SparkDataSet
import os

os.chdir('/myproject')
project_root = Path.cwd()
bootstrap_project(project_root)
session = KedroSession.create()
context = session.load_context()
catalog = context.catalog

test = catalog.load("mydata@spark")
test.show(2)
isinstance(test, pyspark.sql.DataFrame)  # True
```
So there is a Spark session correctly defined. The question is: how do I access this session object?

If I run `spark = SparkSession.builder.getOrCreate()`, I cannot confirm that this is indeed the session managed by Kedro. For example, `spark.conf.get('spark.driver.maxResultSize')` throws a `java.util.NoSuchElementException`, although `maxResultSize` is indeed defined in my project's `spark.yml`.

How do I access the Kedro-managed Spark session?
So if you do `kedro ipython` (or use the extension), you should have `catalog` available as a global variable already, and you don't need to create it yourself.

I have a feeling this will work:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sparkSession.html