Hive difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS and insert overwrite with PARTITIONED and CLUSTER BY?

Question

2.4k views Asked by chuckfinley At 30 January 2014 at 01:18

I've seen some good explanations of creating a table with partitions which are CLUSTERED BY and SORTED BY. How does this compare with creating a table with partitions, then populating the table (with INSERT OVERWRITE for instance) using CLUSTER BY? Is the CLUSTER BY a persistent sort within the table?

There is 1 answer

**leftjoin** · Accepted Answer · 2014-02-06T15:24:24+00:00

Even if INSERT OVERWRITE with CLUSTER BY did produce a table with persistently sorted data, there is no way to tell Hive that the data is already sorted other than creating the table as CLUSTERED BY (with SORTED BY). You can benefit from sorted data (a sort-merge join, for example) only when Hive knows about it and can therefore optimize the query. The data is not necessarily written to disk in the same order in which it was produced or passed to the writer unless you declared the table as clustered (sorted). Ordinary (heap) tables are, in principle, unsorted: the writer does not preserve the input order, because writes are buffered (deferred) and parallel.
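To make the contrast concrete, here is a minimal HiveQL sketch of the two approaches. The table and column names (`events_raw`, `events_plain`, `events_bucketed`, `user_id`, `event`, `dt`) and the bucket count of 32 are hypothetical, chosen only for illustration, and `events_raw` is assumed to already exist with those columns.

```sql
-- Approach 1: plain partitioned table, ordering applied only at load time.
-- The metastore records no bucket or sort columns for this table.
CREATE TABLE events_plain (
  user_id BIGINT,
  event   STRING
)
PARTITIONED BY (dt STRING);

-- CLUSTER BY is shorthand for DISTRIBUTE BY + SORT BY on the same column:
-- rows with the same user_id go to the same reducer and are sorted there.
-- Nothing in the table definition tells Hive this ordering exists.
INSERT OVERWRITE TABLE events_plain PARTITION (dt = '2014-01-30')
SELECT user_id, event
FROM events_raw
WHERE dt = '2014-01-30'
CLUSTER BY user_id;

-- Approach 2: bucketing and sort order declared in the DDL.
-- Here the layout is table metadata that the optimizer can rely on.
CREATE TABLE events_bucketed (
  user_id BIGINT,
  event   STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS;

-- Assumption: on older Hive releases (before 2.0) these settings were needed
-- so that inserts honor the declared layout; later releases enforce it always.
-- SET hive.enforce.bucketing = true;
-- SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE events_bucketed PARTITION (dt = '2014-01-30')
SELECT user_id, event
FROM events_raw
WHERE dt = '2014-01-30';

-- Compare what Hive knows about each table: look at the
-- "Num Buckets", "Bucket Columns" and "Sort Columns" fields.
DESCRIBE FORMATTED events_plain;
DESCRIBE FORMATTED events_bucketed;
```

In this sketch only `events_bucketed` should report bucket and sort columns in its metadata, which is the information Hive needs before it can consider optimizations such as a sort-merge bucket join.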
