I have a fairly standard container-based CI pipeline on Cloud Build for my machine learning training model:
- check Python errors with flake8
- check syntax and style issues with pylint, pydocstyle, etc.
- build a base container (CPU/GPU)
- build a specialized ML container for my model
- scan the installed packages for vulnerabilities
- run unit tests
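For illustration, the steps above could be sketched as a `cloudbuild.yaml` like the following — the image names, file names, and step order are hypothetical placeholders, not my actual configuration:

```yaml
steps:
# Lint: Python errors, syntax, and style
- name: 'python:3.10'
  entrypoint: 'bash'
  args: ['-c', 'pip install flake8 pylint pydocstyle && flake8 . && pylint src && pydocstyle src']
# Build the base container, then the specialized ML container
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-base:cpu', '-f', 'Dockerfile.base', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/ml-model:latest', '-f', 'Dockerfile.model', '.']
# Unit tests
- name: 'python:3.10'
  entrypoint: 'bash'
  args: ['-c', 'pip install -r requirements.txt && pytest tests/']
```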
Now, in machine learning it is impossible to validate a model without testing it on real data, so we normally add two extra checks:
- Fix all random seeds and run on test data to verify that we get exactly the same results
- Train the model on a single batch and verify that we can overfit it, driving the loss to zero
This lets us catch issues inside the model's code. In my setup, my Cloud Build runs in a "build" GCP project while the data lives in another GCP project.
Q1: Has anybody managed to use the AI Platform training service from Cloud Build to train on data sitting in another GCP project?
Q2: How can I tell Cloud Build to wait until the AI Platform training job has finished, and then check its status (succeeded/failed)? Looking at the documentation, the only option seems to be `--stream-logs`, but it seems suboptimal (using that option, I saw huge delays).
When you submit an AI Platform training job, you can specify a service account email to use.
Make sure that service account has sufficient permissions in the other project to read the data from there.
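As a sketch, a Cloud Build step could submit the job with the `--service-account` flag like this — the region, image, service account, and bucket names are placeholders to adapt to your setup (note the `$$` escaping so Cloud Build passes the command substitution through to bash):

```yaml
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    gcloud ai-platform jobs submit training "train_$$(date +%Y%m%d_%H%M%S)" \
      --region=us-central1 \
      --master-image-uri=gcr.io/$PROJECT_ID/ml-model:latest \
      --service-account=trainer@$PROJECT_ID.iam.gserviceaccount.com \
      -- \
      --data-path=gs://my-data-project-bucket/dataset/
```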
For your second question, you have two options:
- `--stream-logs`, as you mentioned. If you don't want the logs in your Cloud Build output, you can redirect stdout and/or stderr to `/dev/null`.
- Or you can create a loop that polls the job status until it reaches a terminal state.
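A minimal sketch of such a polling loop (the job name, the 30-second interval, and the set of terminal states are assumptions; the `gcloud` call is wrapped in a function so you can swap in another status source):

```shell
#!/bin/sh
# Look up the current state of an AI Platform training job.
# Wrapped in a function so it can be stubbed out when testing the loop.
get_state() {
  gcloud ai-platform jobs describe "$1" --format='value(state)'
}

# Poll until the job reaches a terminal state.
# Returns 0 if the job succeeded, 1 if it failed or was cancelled.
wait_for_job() {
  while :; do
    state=$(get_state "$1")
    case "$state" in
      SUCCEEDED)
        echo "Job $1 succeeded"
        return 0 ;;
      FAILED|CANCELLED)
        echo "Job $1 ended in state: $state" >&2
        return 1 ;;
      *)
        sleep 30 ;;  # still QUEUED/PREPARING/RUNNING: keep waiting
    esac
  done
}
```

You would call `wait_for_job "$JOB_NAME"` as the last command of the Cloud Build step that submitted the training job, so the step's exit code reflects the training outcome.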
Here my status check is simple, but you can customize it to match your requirements.
Don't forget to set the Cloud Build timeout accordingly.
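For example, in your `cloudbuild.yaml` (the value here is illustrative — size it to your longest expected training run; the default build timeout is only 10 minutes):

```yaml
timeout: 86400s  # 24 hours
```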