I have created a Kubeflow pipeline with the following steps, using the Kubeflow DSL constructs:

1. Read data from a CSV file.
2. Tokenize the text using a model (encoding).
3. Summarize the text data (from the CSV) in a dataframe using the model (transformers).
4. Save the final results to a CSV file.
5. Define the pipeline and execute all four steps.
A pipeline YAML file was generated after running and compiling the Kubeflow steps. When I try to use that YAML file on a GKE cluster, it throws a YAML validation error. I am not sure whether this YAML is what I should apply to GKE, or whether I need to create a separate deployment YAML. Any pointers or references on this process would greatly help my experiment of running the Kubeflow steps on a GKE cluster.
I ran the Kubeflow steps in a notebook, tried to upload the compiled YAML to a GKE cluster, and got the following error:

Error: error: error validating "pipeline.yaml": error validating data: [apiVersion not set, kind not set]; if you choose to ignore these errors, turn validation off with --validate=false

I tried to work out which values should be set, but couldn't find any pattern or template in the official docs. Here is the generated pipeline.yaml:
# PIPELINE DEFINITION
# Name: pipeline
components:
comp-preprocess-text:
executorLabel: exec-preprocess-text
inputDefinitions:
parameters:
dic:
parameterType: STRUCT
comp-publish-results:
executorLabel: exec-publish-results
inputDefinitions:
parameters:
dic:
parameterType: STRUCT
comp-read-csv-file:
executorLabel: exec-read-csv-file
inputDefinitions:
parameters:
file_path:
parameterType: STRING
comp-summarize-text:
executorLabel: exec-summarize-text
inputDefinitions:
parameters:
dic:
parameterType: STRUCT
deploymentSpec:
executors:
exec-preprocess-text:
container:
args:
- --executor_input
- '{{$}}'
- --function_to_execute
- preprocess_text
command:
- sh
- -c
- "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\
\ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
\ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.7.0'\
\ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' &&\
\ python3 -m pip install --quiet --no-warn-script-location 'pandas==1.2.4'\
\ && \"$0\" \"$@\"\n"
- sh
- -ec
- 'program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
_KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
'
- "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
\ *\n\ndef preprocess_text(dic:dict):\n df=pd.DataFrame(dic)\n # Tokenize\
\ the text data using AutoTokenizer\n tokenizer = AutoTokenizer.from_pretrained(model)\n\
\ df['encoded_text'] = df['text'].apply(lambda text: tokenizer.encode(text,\
\ max_length=512, truncation=True))\n dic=df.to_dict()\n return dic\n\
\n"
image: us-central1-docker.pkg.dev/steel-climber-408809/dockerimage
exec-publish-results:
container:
args:
- --executor_input
- '{{$}}'
- --function_to_execute
- publish_results
command:
- sh
- -c
- "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\
\ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
\ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.7.0'\
\ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
$0\" \"$@\"\n"
- sh
- -ec
- 'program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
_KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
'
- "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
\ *\n\ndef publish_results(dic:dict) -> None:\n df=pd.DataFrame(dic)\n\
# Example: Save the dataframe as a CSV file\n df.to_csv('results.csv',\
\ index=False)\n\n"
image: python:3.7
exec-read-csv-file:
container:
args:
- --executor_input
- '{{$}}'
- --function_to_execute
- read_csv_file
command:
- sh
- -c
- "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\
\ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
\ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.7.0'\
\ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
$0\" \"$@\"\n"
- sh
- -ec
- 'program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
_KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
'
- "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
\ *\n\ndef read_csv_file(file_path: str):\n df= pd.read_csv(file_path)\n\
\ dic=df.to_dict()\n return dic\n\n"
image: python:3.7
exec-summarize-text:
container:
args:
- --executor_input
- '{{$}}'
- --function_to_execute
- summarize_text
command:
- sh
- -c
- "\nif ! [ -x \"$(command -v pip)\" ]; then\n python3 -m ensurepip ||\
\ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
\ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.7.0'\
\ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
$0\" \"$@\"\n"
- sh
- -ec
- 'program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
_KFP_RUNTIME=true python3 -m kfp.dsl.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
'
- "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
\ *\n\ndef summarize_text(dic: dict):\n df=pd.DataFrame(dic)\n # Load\
\ the LLAMA-2 7B model\n model = Llama2Model.from_pretrained(model)\n\
\ df['summary'] = df['encoded_text'].apply(lambda encoded_text: model.generate(encoded_text,\
\ max_length=150))\n dic=df.to_dict()\n return dic\n\n"
image: python:3.7
pipelineInfo:
name: pipeline
root:
dag:
tasks:
preprocess-text:
cachingOptions:
enableCache: true
componentRef:
name: comp-preprocess-text
inputs:
parameters:
dic:
runtimeValue:
constant: {}
taskInfo:
name: preprocess-text
publish-results:
cachingOptions:
enableCache: true
componentRef:
name: comp-publish-results
inputs:
parameters:
dic:
runtimeValue:
constant: {}
taskInfo:
name: publish-results
read-csv-file:
cachingOptions:
enableCache: true
componentRef:
name: comp-read-csv-file
inputs:
parameters:
file_path:
runtimeValue:
constant: /content/sample_data/kfb_experiment.csv
taskInfo:
name: read-csv-file
summarize-text:
cachingOptions:
enableCache: true
componentRef:
name: comp-summarize-text
inputs:
parameters:
dic:
runtimeValue:
constant: {}
taskInfo:
name: summarize-text
schemaVersion: 2.1.0
sdkVersion: kfp-2.7.0
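Sanity-checking on my side: from what I can tell, kubectl's client-side validation requires top-level apiVersion and kind fields on anything it applies, and the compiled spec above has neither — its top-level keys are components, deploymentSpec, pipelineInfo, root, schemaVersion, and sdkVersion. A minimal reproduction of that check (top-level keys copied from the spec above, values trimmed):

```python
# Top-level keys of the compiled pipeline.yaml (KFP v2 IR spec), values trimmed
pipeline_spec = {
    "components": {},       # comp-read-csv-file, comp-preprocess-text, ...
    "deploymentSpec": {},   # per-component container executors
    "pipelineInfo": {"name": "pipeline"},
    "root": {"dag": {"tasks": {}}},
    "schemaVersion": "2.1.0",
    "sdkVersion": "kfp-2.7.0",
}

# kubectl validates that every object it applies declares these two fields:
missing = [key for key in ("apiVersion", "kind") if key not in pipeline_spec]
print(missing)  # matches the fields named in the kubectl error
```

So the file kubectl rejects doesn't look like a Kubernetes manifest at all (a Deployment, for instance, would start with apiVersion: apps/v1 and kind: Deployment) — which makes me suspect I am feeding this file to the wrong tool rather than just missing a couple of fields.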