From https://cloud.google.com/document-ai/docs/process-forms, I can see examples of processing single files. But in most cases, companies have buckets full of documents. In that case, how do you scale Document AI processing? Do you use Document AI in conjunction with Spark, or is there another way?
How do you scale Google Cloud Document AI processing?
615 views · Asked by Kevin Eid · 2 votes

There are 2 answers.

Answer by Holt Skinner (0 votes):
You will need to use Batch Processing to handle multiple documents at once with Document AI.
This page in the Cloud documentation shows how to make batch processing requests with REST and the client libraries: https://cloud.google.com/document-ai/docs/send-request#batch-process

This codelab also illustrates how to do this in Python with the OCR processor: https://codelabs.developers.google.com/codelabs/docai-ocr-python
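As a minimal sketch of what such a batch request looks like in Python (assuming the google-cloud-documentai client library is installed; the project, location, processor ID, and GCS URIs are placeholders you would substitute):

```python
def processor_name(project_id: str, location: str, processor_id: str) -> str:
    # Resource name format expected by the Document AI API.
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def batch_process(project_id: str, location: str, processor_id: str,
                  gcs_input_prefix: str, gcs_output_uri: str) -> None:
    # SDK imports kept local so the sketch reads without the library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"))

    request = documentai.BatchProcessRequest(
        name=processor_name(project_id, location, processor_id),
        # Every object under this GCS prefix is processed in one request.
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)),
        # Results are written back to Cloud Storage as Document JSON.
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=gcs_output_uri)))

    # batch_process_documents returns a long-running operation; block until done.
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=600)
```

The key difference from single-file processing is that the request takes a GCS prefix rather than inline document bytes, and the response is an async operation whose output lands in the bucket you specify.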
The second answer:

I could only find the following: batch_process_documents processes many documents and returns an async response whose results get saved in Cloud Storage. From there, I think we can parametrise our job by passing an input path with a bucket prefix and distribute the job over several machines. All of that could be orchestrated via Airflow, for example.
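That fan-out idea could be sketched as follows (the batch size and object names are illustrative assumptions, not values from the docs — check the current per-request document limit in the Document AI quotas): list the bucket once, split the object names into fixed-size shards, and submit one batch request per shard, e.g. one per Airflow task.

```python
def shard_object_names(object_names, batch_size=50):
    """Split a bucket listing into shards, one per batch_process request.

    batch_size is an assumption here; the real limit depends on the
    Document AI quota for documents per batch request.
    """
    return [object_names[i:i + batch_size]
            for i in range(0, len(object_names), batch_size)]


# Each shard could become one Airflow task (e.g. via dynamic task mapping),
# so failed shards can be retried independently of the rest.
shards = shard_object_names([f"invoices/doc-{n}.pdf" for n in range(120)])
```

Keeping one request per shard also bounds the blast radius of a failure: a retry only reprocesses that shard's documents rather than the whole bucket.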