[hive-3.1.3] ORC tables and text tables return different query results when the partition's column count differs from the table's

Published: 2023/12/8

When a partition's column count differs from the table's column count, an ORC table and a text table return different results for the same select.

1. Test setup

1.1 ORC table

    CREATE EXTERNAL TABLE `test_part`(
      `id` string,
      `t1` string,
      `t2` string
    )
    PARTITIONED BY (`dt` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS ORC;

    insert into test_part partition (dt='124')
    values (1,1,1),(2,2,2),(3,3,NULL),(4,4,NULL);

    -- Drop the table definition but keep the data files (the table is external).
    drop table test_part;
    -- Data preparation is complete.

    -- Recreate the table with only two data columns.
    CREATE EXTERNAL TABLE `test_part`(
      `id` string,
      `t1` string
    )
    PARTITIONED BY (`dt` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS ORC;

    -- Add the partition first, then add the column.
    alter table test_part add partition(dt='124');
    alter table test_part add columns (t2 string);
    -- The partition's column list now differs from the table's.

1.1.1 select

The result is shown below. The newly added column t2 is read correctly.

    hive> select * from test_part;
    OK
    1    1    1       124
    2    2    2       124
    3    3    NULL    124
    4    4    NULL    124

1.2 Text table

    drop table if exists `test_part_text`;

    CREATE EXTERNAL TABLE `test_part_text`(
      `id` string,
      `t1` string,
      `t2` string
    )
    PARTITIONED BY (`dt` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS textfile;

    insert into test_part_text partition (dt='124')
    values (1,1,1),(2,2,2),(3,3,NULL),(4,4,NULL);

    -- Drop the table definition but keep the data files (the table is external).
    drop table test_part_text;
    -- Data preparation is complete.

Recreate the table so that it points at the original data location:

    CREATE EXTERNAL TABLE `test_part_text`(
      `id` string,
      `t1` string
    )
    PARTITIONED BY (`dt` string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS textfile;

    -- Add the partition first, then add the column.
    alter table test_part_text add partition(dt='124');
    alter table test_part_text add columns (t2 string);
    -- The partition's column list now differs from the table's.

1.2.1 select

Here t2 is NULL in every row, which does not match the ORC result.

    hive> select * from test_part_text;
    OK
    1    1    NULL    124
    2    2    NULL    124
    3    3    NULL    124
    4    4    NULL    124

3. Inspecting the table metadata

The table's StorageDescriptor lists three data columns (id, t1, t2), followed by the partition column dt.

    hive> desc extended test_part_text;
    OK
    id    string
    t1    string
    t2    string
    dt    string

    # Partition Information
    # col_name    data_type    comment
    dt    string

    Detailed Table Information    Table(tableName:test_part_text, dbName:default, owner:hive, createTime:1667954622, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:t1, type:string, comment:null), FieldSchema(name:t2, type:string, comment:null), FieldSchema(name:dt, type:string, comment:null)], location:bos://bmr-rd-wh/houzhizhen/warehouse/test_part_text, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=\t, field.delim=\t}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dt, type:string, comment:null)], parameters:{last_modified_time=1667954645, totalSize=0, EXTERNAL=TRUE, numRows=0, rawDataSize=0, COLUMN_STATS_ACCURATE={\"BASIC_STATS\":\"true\"}, numFiles=0, numPartitions=1, transient_lastDdlTime=1667954645, bucketing_version=2, last_modified_by=hive}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER)

3.2 Inspecting the partition metadata

The partition's StorageDescriptor lists only two data columns (id, t1), followed by the partition column dt. The output below is cut off in the original post.

    hive> desc extended test_part_text partition(dt='124');
    OK
    col_name    data_type    comment
    id    string
    t1    string
    dt    string

    # Partition Information
    # col_name    data_type    comment
    dt    string

    Detailed Partition Information    Partition(values:[124], dbName:default, tableName:test_part_text, createTime:1667957398, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:t1, type:string, comment:null), FieldSchema(name:dt, type:string, comment:null)], location:hdfs://localhost:9000/home/disk1/hive/hive-313/test_part_text/dt=124, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim=

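2. Root cause analysis

2.1 Summary of the root cause

For both the ORC table and the text table, the table definition has three data columns, the partition's serde has two, and the data files store three columns. The text table reads its files with the partition's serde, so the third column, t2, is always NULL. The ORC table reads its files with the table's serde, so all three columns in the file can be read.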
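The text-table behavior can be illustrated with a small stand-alone sketch. It is not Hive code: the class and method names are hypothetical, and it assumes that a delimited row is cut down to the reader's column list and then padded to the table's column list by position, which is consistent with the results observed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of reading a tab-delimited row with a given column list.
final class PartitionSchemaReadSketch {

    // Split the line and keep only as many fields as the reader schema declares.
    static List<String> readWithSchema(String line, List<String> readerColumns) {
        String[] fields = line.split("\t", -1);
        List<String> row = new ArrayList<>();
        for (int i = 0; i < readerColumns.size(); i++) {
            row.add(i < fields.length ? fields[i] : null);
        }
        return row;
    }

    // Pad the row to the table schema by position; missing columns become null.
    static List<String> convertToTableSchema(List<String> row, List<String> tableColumns) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            out.add(i < row.size() ? row.get(i) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tableColumns = Arrays.asList("id", "t1", "t2");
        List<String> partitionColumns = Arrays.asList("id", "t1");
        String line = "1\t1\t1"; // the file row really has three fields

        // Reading with the partition's two columns drops the third field.
        System.out.println(convertToTableSchema(readWithSchema(line, partitionColumns), tableColumns));
        // Reading with the table's three columns keeps it.
        System.out.println(convertToTableSchema(readWithSchema(line, tableColumns), tableColumns));
    }
}
```

The first call models the text table (t2 comes back as null even though the file holds a value), and the second models a read driven by the table's column list.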

2.2 Code analysis

2.2.1 FetchOperator.getRecordReader

Because the table is partitioned (and convertedOI is not null), getRecordReader takes the else branch: currSerDe = needConversion(currDesc) ? currDesc.getDeserializer(job) : tableSerDe;
currDesc is the partition's desc, and tableSerDe is the table's serde.
In other words, if needConversion(currDesc) returns true, the partition's serde (2 columns) is used; otherwise the table's serde (3 columns) is used.
For the ORC table, needConversion(currDesc) returns false. For the text table, it returns true.

    private RecordReader<WritableComparable, Writable> getRecordReader() {
      // ...
      if (!isPartitioned || convertedOI == null) {
        currSerDe = tableSerDe;
        ObjectConverter = null;
      } else {
        // Partition serde if conversion is needed, table serde otherwise.
        currSerDe = needConversion(currDesc) ? currDesc.getDeserializer(job) : tableSerDe;
        ObjectInspector inputOI = currSerDe.getObjectInspector();
        ObjectConverter = ObjectInspectorConverters.getConverter(inputOI, convertedOI);
      }
      // ...
    }

2.2.2 needConversion

External tables are not transactional, so isAcid is false.

    private boolean needConversion(PartitionDesc partitionDesc) {
      boolean isAcid = AcidUtils.isTablePropertyTransactional(
          partitionDesc.getTableDesc().getProperties());
      // Self-describing formats skip conversion when schema evolution is enabled.
      if (Utilities.isSchemaEvolutionEnabled(job, isAcid)
          && Utilities.isInputFileFormatSelfDescribing(partitionDesc)) {
        return false;
      }
      return needConversion(partitionDesc.getTableDesc(), Arrays.asList(partitionDesc));
    }

2.2.3 isSchemaEvolutionEnabled

The configuration item ConfVars.HIVE_SCHEMA_EVOLUTION (hive.exec.schema.evolution) defaults to true.

    public static boolean isSchemaEvolutionEnabled(Configuration conf, boolean isAcid) {
      return isAcid || HiveConf.getBoolVar(conf, ConfVars.HIVE_SCHEMA_EVOLUTION);
    }

2.2.4 Utilities.isInputFileFormatSelfDescribing(partitionDesc)

SelfDescribingInputFormatInterface has three implementations: VectorizedOrcInputFormat, OrcInputFormat, and LlapInputFormat. The check therefore returns true for the ORC format and false for the text format.
As a result, needConversion(PartitionDesc partitionDesc) in section 2.2.2 returns false for the ORC table.

    public static boolean isInputFileFormatSelfDescribing(PartitionDesc pd) {
      Class<?> inputFormatClass = pd.getInputFileFormatClass();
      return SelfDescribingInputFormatInterface.class.isAssignableFrom(inputFormatClass);
    }
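The check above relies on Class.isAssignableFrom, which is true when the argument class implements or extends the receiver type. A minimal sketch of the same pattern follows; the marker interface and the two input format classes are hypothetical stand-ins, not Hive's real types.

```java
// Hypothetical stand-ins that mimic the marker-interface check.
final class SelfDescribingCheckSketch {

    // Stand-in for SelfDescribingInputFormatInterface (assumption: a plain marker interface).
    interface SelfDescribingMarker {}

    // Stand-in for an ORC-like input format that carries its own schema.
    static class FakeOrcInputFormat implements SelfDescribingMarker {}

    // Stand-in for a text input format, which does not implement the marker.
    static class FakeTextInputFormat {}

    // Same shape as the Hive check: does the class implement the marker?
    static boolean isSelfDescribing(Class<?> inputFormatClass) {
        return SelfDescribingMarker.class.isAssignableFrom(inputFormatClass);
    }

    public static void main(String[] args) {
        System.out.println(isSelfDescribing(FakeOrcInputFormat.class));  // prints "true"
        System.out.println(isSelfDescribing(FakeTextInputFormat.class)); // prints "false"
    }
}
```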

2.2.5 needConversion(TableDesc tableDesc, List partDescs)

The text table continues into this overload. Because the table's column list differs from the partition's, it returns true. Back in section 2.2.1, execution then uses the partition desc (2 columns): although each row in the file has 3 columns, only 2 are read.

    private boolean needConversion(TableDesc tableDesc, List<PartitionDesc> partDescs) {
      Class<?> tableSerDe = tableDesc.getDeserializerClass();
      SerDeSpec spec = AnnotationUtils.getAnnotation(tableSerDe, SerDeSpec.class);
      if (null == spec) {
        // Serde may not have this optional annotation defined in which case be conservative
        // and say conversion is needed.
        return true;
      }
      String[] schemaProps = spec.schemaProps();
      Properties tableProps = tableDesc.getProperties();
      for (PartitionDesc partitionDesc : partDescs) {
        // A different serde class always requires conversion.
        if (!tableSerDe.getName().equals(partitionDesc.getDeserializerClassName())) {
          return true;
        }
        Properties partProps = partitionDesc.getProperties();
        // Any differing schema property between table and partition requires conversion.
        for (String schemaProp : schemaProps) {
          if (!org.apache.commons.lang3.StringUtils.equals(
              tableProps.getProperty(schemaProp), partProps.getProperty(schemaProp))) {
            return true;
          }
        }
      }
      return false;
    }
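The property comparison at the end of this method can be reproduced in isolation. The sketch below is a simplified model: the class and helper names are hypothetical, java.util.Objects.equals stands in for StringUtils.equals, and it assumes the serde's schema properties include the keys columns and columns.types with the values shown.

```java
import java.util.Objects;
import java.util.Properties;

// Hypothetical model of the schema-property comparison between table and partition.
final class SchemaPropsCompareSketch {

    // Returns true as soon as one listed property differs (null-safe comparison).
    static boolean schemaPropsDiffer(Properties tableProps, Properties partProps, String... schemaProps) {
        for (String key : schemaProps) {
            if (!Objects.equals(tableProps.getProperty(key), partProps.getProperty(key))) {
                return true;
            }
        }
        return false;
    }

    // Helper to build a property set; the key names are an assumption for illustration.
    static Properties props(String columns, String columnTypes) {
        Properties p = new Properties();
        p.setProperty("columns", columns);
        p.setProperty("columns.types", columnTypes);
        return p;
    }

    public static void main(String[] args) {
        Properties table = props("id,t1,t2", "string:string:string");
        Properties partition = props("id,t1", "string:string");
        // The column lists differ, so conversion would be reported as needed.
        System.out.println(schemaPropsDiffer(table, partition, "columns", "columns.types"));
    }
}
```

With the table at three columns and the partition at two, the comparison reports a difference, which corresponds to needConversion returning true for the text table.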

4. Effect of the hive.exec.schema.evolution parameter

The analysis above assumes hive.exec.schema.evolution=true.
When hive.exec.schema.evolution=false, both tables use the partition's description, so t2 is NULL in both.
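Putting the pieces from section 2.2 together, the serde choice can be modeled as a small decision function. This is a simplified sketch with hypothetical names: it assumes isAcid is false (external tables) and ignores the branches for a missing SerDeSpec annotation or a different serde class.

```java
// Hypothetical model of which serde ends up reading a partition's files.
final class SerDeChoiceSketch {

    enum Choice { TABLE_SERDE, PARTITION_SERDE }

    static Choice choose(boolean schemaEvolutionEnabled,
                         boolean selfDescribingFormat,
                         boolean schemaPropsDiffer) {
        // Short-circuit: self-describing formats skip conversion when evolution is on.
        if (schemaEvolutionEnabled && selfDescribingFormat) {
            return Choice.TABLE_SERDE;
        }
        // Otherwise the partition serde is used whenever the schemas differ.
        return schemaPropsDiffer ? Choice.PARTITION_SERDE : Choice.TABLE_SERDE;
    }

    public static void main(String[] args) {
        // In this article the table and partition column lists always differ.
        System.out.println("evolution=true,  ORC:  " + choose(true, true, true));
        System.out.println("evolution=true,  text: " + choose(true, false, true));
        System.out.println("evolution=false, ORC:  " + choose(false, true, true));
        System.out.println("evolution=false, text: " + choose(false, false, true));
    }
}
```

Only the first combination selects the table serde, which matches the observation that t2 is readable only for the ORC table with schema evolution enabled.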

4.1 Setting hive.exec.schema.evolution=false

    set hive.exec.schema.evolution=false;

4.1.1 ORC table

    hive> select * from test_part;
    OK
    1    1    NULL    124
    2    2    NULL    124
    3    3    NULL    124
    4    4    NULL    124

4.1.2 Text table

    hive> select * from test_part_text;
    OK
    1    1    NULL    124
    2    2    NULL    124
    3    3    NULL    124
    4    4    NULL    124
