Can I connect to Vertica and HDFS in the same project using dbt core?


I want to connect to both Vertica and HDFS in the same project. I created a dbt project with the command dbt init and connected it to Vertica. That works, but I don't know how to connect to HDFS to read data and load it into Vertica.

I need to read data from HDFS and load it into a Vertica database. Is that possible?

Answer by marcothesane (accepted):

You could start by reading the documentation here: https://docs.vertica.com/23.4.x/en/data-load/working-with-external-data/creating-external-tables/

An example for Parquet on HDFS:

Step 1: Let Vertica infer the Parquet file's definition:

\a
Output format is unaligned.
\t
Showing only tuples.
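-- INFER_TABLE_DDL reads the Parquet file's metadata and returns a matching CREATE EXTERNAL TABLE statement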
SELECT infer_table_ddl(
  'hdfs:///data/tpch/region/'
||'3d5f4653-v_sbx_node0001-140498341713664-0.parquet' 
  USING PARAMETERS 
    format     = 'Parquet'
  , table_name = 'region'
  , table_type = 'external'
);
vsql:infer.sql:7: WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
create external table "region"(
  "r_regionkey" int, 
  "r_name" varchar, 
  "r_comment" varchar
) as copy from 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' parquet;

I would then suggest that you change the generated DDL statement to reflect the maximum lengths of r_name and r_comment (varchar(25) and varchar(152), for example) before running it.
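
The adjusted statement could then look like this (the lengths below are just the TPC-H maxima suggested above; verify them against your own data before running it):

create external table "region"(
  "r_regionkey" int,
  "r_name"      varchar(25),
  "r_comment"   varchar(152)
) as copy from 'hdfs:///data/tpch/region/3d5f4653-v_sbx_node0001-140498341713664-0.parquet' parquet;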

Once the command is executed, you can select from region as if it were a normal table.
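
A quick sanity check could look like this (assuming the DDL above was run against the same Vertica database):

SELECT r_regionkey, r_name, r_comment
FROM region
ORDER BY r_regionkey;

In your dbt project, you could then declare region as a source (or reference the schema-qualified table directly) and build models on top of it, since to dbt it behaves like any other Vertica table.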