BG.Hive - part1

Published: 2024/4/17

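1. Hive Architecture

　　What is Hive? It was originally developed at Facebook; see https://en.wikipedia.org/wiki/Apache_Hive

　　a> A tool that provides easy access to data through SQL and supports data warehouse tasks such as ETL, reporting, and data analysis

　　b> A mechanism for imposing structure on a variety of data formats

　　c> Data access: files stored in HDFS or in other data storage systems such as HBase

　　d> Query language: HiveQL, which is SQL-like

　　　　The default execution engine is MapReduce; a simple Select * From .. is not converted into an MR job

　　e> Query execution engines: MapReduce, Spark, Tez

　　f> Stored procedures are supported through HPL/SQL

　　　　HPL/SQL started as a separate open-source project; it ships with Hive 2.0 and later

　　g> LLAP (Live Long And Process) gives Hive in-memory computation

　　　　Data is cached in the memory of multiple servers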

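2. Hive Features and Supported Formats

　　Hive provides standard SQL functions, and HiveQL can be extended with user-defined functions

　　Hive provides built-in support for these formats

　　　　a> Text files with comma- or tab-separated fields

　　　　b> Apache Parquet files, https://parquet.apache.org/

　　　　c> Apache ORC files (ORC: Optimized Row Columnar; RC: Record Columnar File)

　　　　d> Other formats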

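3. Metastore modes: single-user mode (embedded Derby database), multi-user mode (MySQL or another RDBMS), and remote mode (a MetaStore Server runs on the server side and clients reach it over the Thrift protocol)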
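For remote mode, the client only needs to know where the MetaStore Server is listening. The fragment below is a minimal sketch of that client-side setting, assuming a server reachable as metastore.example.com (a placeholder host) on port 9083, the port the metastore service uses by default. The scratch directory hive-demo is also an assumption, used so that the real hive-site.xml is not touched.

```shell
# Scratch directory for the sketch (hypothetical name).
mkdir -p hive-demo

# Client-side fragment for hive-site.xml: point the client at a remote
# MetaStore Server. The host name is a placeholder.
cat > hive-demo/metastore-client-fragment.xml <<'EOF'
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore.example.com:9083</value>
  <description>Thrift URI of the remote metastore</description>
</property>
EOF

# On the server side, the metastore service would be started with:
#   hive --service metastore &
cat hive-demo/metastore-client-fragment.xml
```

The fragment would be merged into the client's hive-site.xml inside the existing configuration element.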

4. Why Hive Exists

　　Writing MR programs is tedious; with HQL the same tasks can be implemented very simply

5. Environment Setup

　　Prerequisites: CentOS, Hadoop, MySQL

　　Download Hive and place it under /opt on the virtual machine, https://mirrors.tuna.tsinghua.edu.cn/apache/hive/

　　tar zxf apache-hive-2.1.1-bin.tar.gz　　#extract

　　mv apache-hive-2.1.1-bin hive-2.1.1　　#rename

　　cd /opt/hive-2.1.1/conf/　　#enter the conf directory

　　cp hive-env.sh.template hive-env.sh　　#copy the configuration file

　　cp hive-default.xml.template hive-site.xml　　#copy the configuration file

　　vim /etc/profile　　#configure environment variables

　　source /etc/profile　　#apply the environment variables
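The article does not show which lines were added to /etc/profile. A plausible sketch, assuming the conventional HIVE_HOME variable and the install path used above, is shown here. The lines are written to a scratch file (hive-demo/profile-snippet.sh, a hypothetical name) rather than to the real profile.

```shell
mkdir -p hive-demo

# Lines one might append to /etc/profile, kept in a scratch file here
# so the sketch does not modify the real profile.
cat > hive-demo/profile-snippet.sh <<'EOF'
export HIVE_HOME=/opt/hive-2.1.1
export PATH="$PATH:$HIVE_HOME/bin"
EOF

# Load the snippet into the current shell, as `source /etc/profile` would.
. hive-demo/profile-snippet.sh

# Confirm that the Hive bin directory is now on PATH.
case ":$PATH:" in
  *":$HIVE_HOME/bin:"*) echo "on PATH: $HIVE_HOME/bin" ;;
  *) echo "missing from PATH: $HIVE_HOME/bin" ;;
esac
```

With the bin directory on PATH, commands such as hive and schematool can be called without the /opt/hive-2.1.1/bin/ prefix.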

　　vim hive-env.sh　　#configure hive-env.sh

　　　　HADOOP_HOME=/opt/hadoop-2.7.3　　#set HADOOP_HOME

　　/opt/hive-2.1.1/bin/schematool -dbType derby -initSchema　　#use Derby as the metastore and initialize it (fixes the error "message:Version information not found in metastore.")

　　vim hive-site.xml

　　　　Fix for the error that mentions ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D: replace the ${system:java.io.tmpdir}/${system:user.name} placeholders with concrete local paths

<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/opt/hive-2.1.1/hivetmp/scratchdir/</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/opt/hive-2.1.1/hivetmp/resources</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
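As an alternative to editing each property by hand, the placeholders can be replaced in bulk with sed. This is only a sketch on a two-line stand-in file, not on the real hive-site.xml. The target directory follows the article, while the user name root and the scratch file names are assumptions.

```shell
mkdir -p hive-demo

# A two-line stand-in for the relevant values in hive-site.xml.
cat > hive-demo/hive-site-sample.xml <<'EOF'
<value>${system:java.io.tmpdir}/${system:user.name}</value>
<value>${system:java.io.tmpdir}/${hive.session.id}_resources</value>
EOF

# Replace both system placeholders with fixed values.
# The user name "root" is an assumption.
sed -e 's|\${system:java.io.tmpdir}|/opt/hive-2.1.1/hivetmp|g' \
    -e 's|\${system:user.name}|root|g' \
    hive-demo/hive-site-sample.xml > hive-demo/hive-site-fixed.xml

cat hive-demo/hive-site-fixed.xml
```

Other placeholders, such as ${hive.session.id}, are left alone because only the system: ones are involved in the error.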

　　Single-user mode (Derby) check: hive

　　Only one user can have a Hive session open at a time

　　Warning printed by Hive 2: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

　　Hive metadata (table information, column attributes, partitions, columns, table owner, and so on) is stored in metastore_db

　　The actual Hive data is stored on HDFS

　　vim /opt/hive-2.1.1/conf/hive-site.xml　　#switch the metastore to MySQL by setting the four properties below

　　　　javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionUserName, javax.jdo.option.ConnectionPassword

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://bigdata.mysql:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL. For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>bigdata</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>pas$w0rd</value>
  <description>password to use against metastore database</description>
</property>

　　cp mysql-connector-java-5.1.41-bin.jar /opt/hive-2.1.1/lib/　　#copy the JDBC driver into lib; fixes the error ("com.mysql.jdbc.Driver") was not found.

　　/opt/hive-2.1.1/bin/schematool -dbType mysql -initSchema　　#initialize the metastore database; fixes "Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"

　　Multi-user mode (MySQL) test: hive

6. Hive CLI

　　Hive Command Line Interface

usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

　　Three ways to set a configuration property, from highest to lowest precedence: 1. set property=value in the Hive CLI > 2. --hiveconf property=value > 3. hive-site.xml

　　hive> create database HelloHive;　　#create a database; the database files are stored in Hadoop (HDFS)

　　hive> show databases;　　#list all databases

　　hive> use HelloHive;　　#switch to the HelloHive database

　　hive> create table T1(id int, name varchar(30));　　#create a table

　　hive> show tables;　　#list all tables

　　hive> insert into t1(id,name) values(1,'Niko'),(2,'Jim');　　#insert rows into T1

　　hive> select * from t1;　　#query T1

　　hive -d col=id --database HelloHive　　#start Hive with the variable col defined as id and connect to the HelloHive database

　　hive> select ${col},name from T1;　　#col stands in for id, so the output is the content of the id column

　　hive> select '${col}',name from T1;　　#${col} is replaced by id inside the quotes, so the output is the string "id"
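The same substitution works for script files. One detail to watch when writing such a file from the shell is that the shell itself also expands ${col}, so the heredoc delimiter has to be quoted. The sketch below assumes the HelloHive database and T1 table from above; the scratch directory and file name are placeholders, and the hive call is skipped when the launcher is not installed.

```shell
mkdir -p hive-demo

# The quoted 'EOF' keeps the shell from expanding ${col} itself, so the
# placeholder reaches Hive untouched.
cat > hive-demo/select_col.hql <<'EOF'
select ${col}, name from T1;
select '${col}', name from T1;
EOF

# Run only where the hive launcher is installed.
if command -v hive >/dev/null 2>&1; then
  hive --hivevar col=id --database HelloHive -f hive-demo/select_col.hql
fi
```

Here --hivevar is used in place of -d; both define a variable that ${col} can refer to.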

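　　hive> set mapred.reduce.tasks;　　#mapred.reduce.tasks is the number of reduce tasks per job; without a value, the current setting is printed
　　　　[output] mapred.reduce.tasks=-1　　#the Hive default of -1 means Hive decides the number of reduce tasks based on the job

　　hive --hiveconf mapred.reduce.tasks=3　　#set the number of reduce tasks to 3 when starting Hive

　　hive> set mapred.reduce.tasks=5;　　#change the number of reduce tasks to 5 inside the Hive CLI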
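The precedence between --hiveconf and set can be checked in a single run: start Hive with one value, print it, override it with set, and print it again. This is a sketch under the assumption that hive is on PATH; the scratch directory and file name are placeholders.

```shell
mkdir -p hive-demo

# Print the value, override it inside the session, then print it again.
cat > hive-demo/precedence.hql <<'EOF'
set mapred.reduce.tasks;
set mapred.reduce.tasks=5;
set mapred.reduce.tasks;
EOF

# --hiveconf supplies the starting value for this session only.
if command -v hive >/dev/null 2>&1; then
  hive --hiveconf mapred.reduce.tasks=3 -f hive-demo/precedence.hql
fi
```

If the ordering described above holds, the first set line reports the --hiveconf value and the last one reports the value assigned inside the session.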

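　　hive -e "select * from T1;" --database HelloHive;　　#use -e to pass a query to Hive and get the result back

　　vim t1.hql　　#create the file t1.hql

　　　　use HelloHive;　　#every SQL statement in the file must end with ;
　　　　Select * From T1 Where id < 4;

　　hive -f t1.hql　　#use hive to execute the SQL statements in the file

　　hive -S -e "select count(1) from T1;" --database HelloHive;　　#-S suppresses unnecessary output, such as the MR job information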

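7. Hive Shell

　　hive> quit;　　hive> exit;　　#leave the interactive Hive shell

　　hive> reset;　　#reset all Hive configuration properties to the values in hive-site.xml

　　hive> set XXX;　　hive> set XXX=Y;　　#display or set a configuration property

　　hive> set -v;　　#display all Hadoop and Hive configuration properties

　　hive> !ls;　　#run a shell command from within hive

　　hive> dfs -ls;　　#run a dfs command from within hive

　　hive> add file t1.hql;　　#add t1.hql to the distributed cache

　　hive> list file;　　#list all files currently in the distributed cache

　　hive> delete file t1.hql;　　#remove the given file from the distributed cache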

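8. Beeline

　　The CLI for HiveServer2; a JDBC client

　　Embedded mode: runs an embedded Hive, similar to the Hive CLI (beeline)

　　Remote mode: connects to a separate HiveServer2 process over the Thrift protocol (the way code connects to HiveServer2)

　　HiveServer2 settings in hive-site.xml

　　　　hive.server2.thrift.min.worker.threads　　#minimum number of worker threads, default 5; the maximum is set by hive.server2.thrift.max.worker.threads, default 500

　　　　hive.server2.thrift.port　　#TCP listening port, default 10000

　　　　hive.server2.thrift.bind.host　　#TCP interface to bind to, default localhost

　　　　hive.server2.transport.mode　　#default binary (TCP); can be set to http

　　　　hive.server2.thrift.http.port　　#HTTP listening port, default 10001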
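As an illustration of how these settings look in hive-site.xml, the sketch below writes a fragment that switches HiveServer2 to HTTP transport on the default HTTP port. The scratch directory and file name are assumptions; the fragment is kept separate from the real configuration file.

```shell
mkdir -p hive-demo

# Fragment for hive-site.xml: HTTP transport on port 10001.
cat > hive-demo/hiveserver2-fragment.xml <<'EOF'
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10001</value>
</property>
EOF

# Basic sanity check: count the properties in the fragment (prints 2).
grep -c '<name>' hive-demo/hiveserver2-fragment.xml
```

HiveServer2 reads hive-site.xml at startup, so it would need a restart after such a change.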

　　Start HiveServer2

　　　　hive --service hiveserver2

　　　　hiveserver2

　　Start Beeline

　　　　hive --service beeline

　　　　beeline

　　Check whether the service is running: ps -ef | grep hive

　　cp /opt/hive-2.1.1/jdbc/hive-jdbc-2.1.1-standalone.jar /opt/hive-2.1.1/lib/　　#fixes the error "hive-jdbc-*-standalone.jar: No such file or directory"

　　beeline　　#start beeline

　　beeline> !connect jdbc:hive2://localhost:10000/HelloHive　　#connect to the Hive database with beeline

　　　　Enter username for jdbc:hive2://localhost:10000/HelloHive: root

　　　　Enter password for jdbc:hive2://localhost:10000/HelloHive: ********

　　Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/HelloHive: Failed to open new session: java.lang.RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)

　　Solution:

　　　　kill -9 15544　　#stop the hiveserver2 process (use the PID shown by ps -ef | grep hive)

　　　　/opt/hadoop-2.7.3/sbin/stop-all.sh　　#stop the Hadoop cluster

　　　　vim /opt/hadoop-2.7.3/etc/hadoop/core-site.xml　　#edit Hadoop's core-site.xml and add the 2 properties below; they allow the root user to impersonate any user from any host

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave1:/opt/hadoop-2.7.3/etc/hadoop/　　#distribute core-site.xml to every slave in the Hadoop cluster

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave2:/opt/hadoop-2.7.3/etc/hadoop/

　　　　scp /opt/hadoop-2.7.3/etc/hadoop/core-site.xml root@bigdata.hadoop.slave3:/opt/hadoop-2.7.3/etc/hadoop/

　　　　/opt/hadoop-2.7.3/sbin/start-all.sh　　#start the Hadoop cluster

　　　　hiveserver2　　#start HiveServer2

　　　　beeline　　#start beeline

　　　　beeline> !connect jdbc:hive2://localhost:10000/HelloHive　　#connect to the Hive database, then enter the user name and password

　　　　17/03/07 14:01:55 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ　　#warning

　　　　0: jdbc:hive2://localhost:10000/HelloHive> set autoCommit=false;　　#beeline connected successfully; set autoCommit to false
　　　　　　[output]No rows affected (0.286 seconds)　　#the setting was applied

　　　　0: jdbc:hive2://localhost:10000/HelloHive> show tables;　　#list tables

　　　　0: jdbc:hive2://localhost:10000/HelloHive> select * from t1;　　#query the table

　　　　0: jdbc:hive2://localhost:10000/HelloHive> !quit　　#leave beeline
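The interactive session above can also be run in one shot, which is convenient in scripts. A minimal sketch, assuming HiveServer2 is listening on localhost:10000 and the root user from the session above; the scratch directory and file name are placeholders, and the call is skipped when beeline is not installed.

```shell
mkdir -p hive-demo

# Statements to run, one per line, each ending with a semicolon.
cat > hive-demo/check.sql <<'EOF'
show tables;
select * from t1;
EOF

# -u gives the JDBC URL, -n the user name, -f a script file to run.
if command -v beeline >/dev/null 2>&1; then
  beeline -u jdbc:hive2://localhost:10000/HelloHive -n root -f hive-demo/check.sql
fi
```

For a single statement, -e with a quoted query can be used in place of -f.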

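9. Hive Data Types

　　Numeric:

　　　　TINYINT, 1 byte, -128 ~ 127, e.g. 1　　Postfix: Y　　100Y

　　　　SMALLINT, 2 bytes, -32768 ~ 32767, e.g. 1　　Postfix: S　　100S

　　　　INT/INTEGER, 4 bytes, -2,147,483,648 ~ 2,147,483,647, e.g. 1

　　　　BIGINT, 8 bytes, e.g. 1　　Postfix: L　　100L

　　　　FLOAT, 4-byte single precision, e.g. 1.0　　Floating-point literals are treated as DOUBLE by default; an explicit CAST(1.0 AS FLOAT) produces a FLOAT.

　　　　DOUBLE, 8-byte double precision (DOUBLE PRECISION is accepted from Hive 2.2.0), e.g. 1.0　　Literals for FLOAT and DOUBLE do not support scientific notation

　　　　DECIMAL, up to 38 digits of precision (introduced in Hive 0.11.0), supports scientific and non-scientific notation; defaults to decimal(10,0) when no precision and scale are given, or set them explicitly, e.g. decimal(10,2)

　　Date/time:

　　　　TIMESTAMP, introduced in 0.8.0, e.g. 2017-03-07 14:00:00; supports traditional Unix timestamps with precision down to nanoseconds.

　　　　DATE, introduced in 0.12.0, 0001-01-01 ~ 9999-12-31, e.g. 2017-03-07

　　String:

　　　　STRING, a string enclosed in single or double quotes

　　　　VARCHAR, introduced in 0.12.0, length 1 ~ 65535 characters

　　　　CHAR, introduced in 0.13.0, fixed length, maximum length 255

　　Misc

　　　　BOOLEAN, TRUE or FALSE

　　　　BINARY, introduced in 0.8.0, binary data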

　　Array

　　　　ARRAY<TYPE>, e.g. ARRAY<INT>; element indexes start at 0

　　Map

　　　　MAP<PRIMITIVE_TYPE,DATA_TYPE>, e.g. MAP<STRING,INT>

　　Struct

　　　　STRUCT<COL_NAME:DATA_TYPE,...>, e.g. STRUCT<a:STRING,b:INT,c:DOUBLE>

　　Union

　　　　UNIONTYPE<DATA_TYPE,DATA_TYPE,...>, e.g. UNIONTYPE<STRING,INT,DOUBLE...>

CREATE TABLE complex(
  col1 ARRAY<STRING>,
  col2 MAP<INT,STRING>,
  col3 STRUCT<a:INT,b:STRING>,
  col4 UNIONTYPE<STRING,INT,STRUCT,MAP,ARRAY,...>
)

col1 = Array('Hadoop','spark','hive','hbase','sqoop')　　col1[1] = 'spark'
col2 = MAP(1:hadoop,2:sqoop,3:hive)　　col2[1] = hadoop
col3 = STRUCT(a:5,b:'five')　　col3.b = 'five'
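The notation above is schematic. A runnable version might look like the sketch below, which uses Hive's constructor functions array(), map(), and named_struct() and leaves out the union column. The table name complex_demo, the sample values, and the scratch file are assumptions, and the script only runs where hive is installed.

```shell
mkdir -p hive-demo

cat > hive-demo/complex_types.hql <<'EOF'
-- complex_demo is a hypothetical table name.
create table if not exists complex_demo(
  col1 array<string>,
  col2 map<int,string>,
  col3 struct<a:int,b:string>
);

-- Complex values cannot be written with INSERT ... VALUES, so they are
-- built with constructor functions in a SELECT.
insert into table complex_demo
select array('Hadoop','spark','hive'),
       map(1,'hadoop',2,'sqoop'),
       named_struct('a',5,'b','five');

-- Index an array from 0, look up a map by key, reach a struct field by name.
select col1[1], col2[1], col3.b from complex_demo;
EOF

if command -v hive >/dev/null 2>&1; then
  hive --database HelloHive -f hive-demo/complex_types.hql
fi
```

The final select mirrors the three access forms listed above: col1[1], col2[1], and col3.b.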

[Table: matrix of allowed implicit type conversions (true/false) between void, boolean, tinyint, smallint, int, bigint, float, double, decimal, string, varchar, timestamp, date, and binary; the cell values are not recoverable from the source]

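10. Basic Hive Table Operations and Concepts

　　Managed table (internal table): its data files, metadata, and statistics are all managed by Hive itself. The data of a managed table is stored under the path specified by hive.metastore.warehouse.dir.

　　External table: metadata or a schema describes the structure of external files; an external table can be accessed and managed by processes outside Hive, for example through HDFS.

　　hive> desc formatted t1;　　#show detailed table information; Table Type shows MANAGED_TABLE or EXTERNAL_TABLE

　　hive> create external table t2(id int,name string);　　#create an external table

　　hive> desc t1;　　#show the table's columns and their types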
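The practical difference between the two table kinds shows up on drop: for an external table only the metadata goes away. The sketch below walks through that with a throwaway external table. The table name t2_ext and the HDFS path /tmp/hive-demo/t2_ext are hypothetical, and the script only runs where hive is installed.

```shell
mkdir -p hive-demo

cat > hive-demo/external_demo.hql <<'EOF'
-- t2_ext and the location path are hypothetical.
create external table if not exists t2_ext(id int, name string)
row format delimited fields terminated by ','
location '/tmp/hive-demo/t2_ext';

-- Table Type in the output distinguishes managed from external tables.
desc formatted t2_ext;

-- Dropping an external table removes only the metadata ...
drop table t2_ext;

-- ... so the directory should still be listed afterwards.
dfs -ls /tmp/hive-demo;
EOF

if command -v hive >/dev/null 2>&1; then
  hive --database HelloHive -f hive-demo/external_demo.hql
fi
```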

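11. Hive Data File Storage Formats

　　STORED AS TEXTFILE: the default file format (unless hive.default.fileformat is set to something else in hive-site.xml)

　　STORED AS SEQUENCEFILE: compressed sequence file

　　STORED AS ORC: ORC file format; supports ACID transactions and the CBO (cost-based optimizer)

　　STORED AS PARQUET: Parquet file format

　　STORED AS AVRO: Avro file format

　　STORED AS RCFILE: RC (Record Columnar) file format

　　STORED BY: a non-native table format, for example data stored in HBase/Druid/Accumulo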

　　Create a table

　　hive> create external table users(
    > id int comment 'id of user',
    > name string comment 'name of user',
    > city varchar(30) comment 'city of user',
    > industry varchar(20) comment 'industry of user')
    > comment 'external table, users'
    > row format delimited　　#use the delimited form; three SerDe-based forms are described below
    > fields terminated by ','
    > stored as textfile
    > location '/user/hive/warehouse/hellohive.db/users/';

　　Built-in row format SerDes: Regex (regular expressions), JSON, CSV/TSV

　　row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as textfile

　　row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties ("input.regex"="<regex>") stored as textfile

　　row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerDe' stored as textfile

　　Alternatively, create the table with hive -f

vim t_users_ext.hql

create external table users(
  id int,
  name string,
  city varchar(30),
  industry varchar(20))
row format delimited
fields terminated by ','
stored as textfile
location '/user/hive/warehouse/hellohive.db/users';

insert into users(id,name,city,industry) values
  (1,'Niko','Shanghai','Bigdata'),
  (2,'Eric','Beijing','NAV'),
  (3,'Jim','Guangzhou','IT');

hive -f t_users_ext.hql --database hellohive
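Because users is an external, comma-delimited text table, rows can also be added by dropping a file into its location instead of running insert. The sketch below prepares such a file and uploads it. The three sample rows are invented for illustration, the scratch directory is a placeholder, and the upload is skipped when the hdfs command is not available.

```shell
mkdir -p hive-demo

# Sample rows in the column order the table expects: id,name,city,industry.
cat > hive-demo/users.csv <<'EOF'
4,Anna,Shenzhen,Retail
5,Bob,Hangzhou,Finance
6,Cara,Chengdu,Gaming
EOF

# Every line should have exactly four comma-separated fields.
awk -F, 'NF != 4 { bad++ } END { print (bad ? "malformed rows found" : "all rows have 4 fields") }' hive-demo/users.csv

# Copy the file into the table's location; -f overwrites an existing copy.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f hive-demo/users.csv /user/hive/warehouse/hellohive.db/users/
fi
```

Hive reads whatever files are in the location at query time, so the new rows should appear in the next select on users.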

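12. Hive Tables

　　Partitioned table (Partition Table)

　　A Hive Select query generally scans the entire table (all files under a directory in HDFS), which takes a lot of time; partitioning lets a query read only the matching partitions

　　When a partitioned table is created, its partition columns are specified; partitions are coarser-grained than buckets

　　Syntax: partitioned by (par_col par_type)

　　Static partitioning: for example, partitioning by year-month　　#set hive.exec.dynamic.partition;

　　Dynamic partitioning: for example, partitioning by product category, where new categories keep appearing　　#dynamic partitioning is enabled by default; if hive.exec.dynamic.partition is set to false, dynamic partitions cannot be created

　　　　Dynamic partition mode: set hive.exec.dynamic.partition.mode = strict/nonstrict　　#the default mode is strict; in strict mode, an insert into a dynamically partitioned table must specify at least one static partition column

　　With partitioning, each partition value gets its own partition directory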
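The static and dynamic styles differ only in where the partition values come from. The sketch below shows both against a small table. The names sales, sales_staging, sale_year, and sale_month, along with the sample rows, are made up for illustration, and the script only runs where hive is installed.

```shell
mkdir -p hive-demo

cat > hive-demo/partition_demo.hql <<'EOF'
-- sales and sales_staging are hypothetical tables for this sketch.
create table if not exists sales(id int, amount double)
partitioned by (sale_year int, sale_month int);

create table if not exists sales_staging(
  id int, amount double, sale_year int, sale_month int);
insert into table sales_staging values (1, 9.5, 2017, 3), (2, 12.0, 2017, 4);

-- Static partition: the partition values are fixed in the statement.
insert into table sales partition (sale_year=2017, sale_month=3)
select id, amount from sales_staging
where sale_year = 2017 and sale_month = 3;

-- Dynamic partition: the values come from the trailing select columns.
-- nonstrict mode allows every partition column to be dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table sales partition (sale_year, sale_month)
select id, amount, sale_year, sale_month from sales_staging;

-- Each value pair maps to a directory such as sale_year=2017/sale_month=3.
show partitions sales;
EOF

if command -v hive >/dev/null 2>&1; then
  hive --database HelloHive -f hive-demo/partition_demo.hql
fi
```

In strict mode the second insert would need at least one fixed value, for example partition (sale_year=2017, sale_month).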

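　　Bucketed table (Bucketed Sorted Table)

　　Skewed table (Skewed Table)

　　　　Column values with especially heavy skew are stored separately in their own files; each skewed value gets its own directory or file, so a query can use its filter condition to avoid a costly full table scan

　　　　Skewed by (field) on (value)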
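The article gives no DDL for bucketed tables and only the bare clause for skewed ones. A possible form of both is sketched below; the table names, the bucket count of 4, and the skewed value 'Shanghai' are assumptions chosen for illustration, and the script only runs where hive is installed.

```shell
mkdir -p hive-demo

cat > hive-demo/bucket_skew_demo.hql <<'EOF'
-- users_bucketed and users_skewed are hypothetical table names.

-- Bucketed, sorted table: rows are assigned to one of 4 buckets by id
-- and kept sorted by id inside each bucket.
create table if not exists users_bucketed(id int, name string)
clustered by (id) sorted by (id asc) into 4 buckets;

-- Skewed table: rows with city = 'Shanghai' are stored apart from the
-- rest; stored as directories gives the skewed value its own directory.
create table if not exists users_skewed(id int, name string, city string)
skewed by (city) on ('Shanghai') stored as directories;
EOF

if command -v hive >/dev/null 2>&1; then
  hive --database HelloHive -f hive-demo/bucket_skew_demo.hql
fi
```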

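　　Temporary table (Temporary Table)

　　　　A temporary table is visible only in the current session; its data is kept under a tmp directory in HDFS

　　DROP TABLE [IF EXISTS] TABLE_NAME [PURGE];　　#for a managed table, PURGE deletes the metadata and the table data together without moving the data to the trash. For an external table, only the metadata is deleted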
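A short sketch of both statements, assuming the users table created earlier exists in HelloHive. The names tmp_users and users_copy are hypothetical, and the script only runs where hive is installed.

```shell
mkdir -p hive-demo

cat > hive-demo/temp_and_drop.hql <<'EOF'
-- tmp_users and users_copy are hypothetical names; users is the table
-- created earlier in the article.

-- Visible only to this session; it disappears when the session ends.
create temporary table tmp_users as
select id, name from users where id < 3;
select * from tmp_users;

-- A managed copy, dropped with purge so its data skips the trash.
create table if not exists users_copy as select * from users;
drop table if exists users_copy purge;
EOF

if command -v hive >/dev/null 2>&1; then
  hive --database HelloHive -f hive-demo/temp_and_drop.hql
fi
```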

Reposted from: https://www.cnblogs.com/Niko12230/p/6511399.html
