
Installing and Using Hive, and Accessing Hive from Java

Published: 2025/3/15 java 24 豆豆


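Introduction to Hive

Overview

Hive was open-sourced by Facebook and donated to Apache, where it is now a top-level project (hive.apache.org). Hive is a data warehouse (DataWareHouse) technology built on big-data infrastructure. Its main job is to translate the SQL statements a user writes into MapReduce code and submit the resulting jobs to the MR framework for execution. It can map structured data files onto database tables and offers SQL-like querying over them.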

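Summary

Hive is a data warehouse.
Hive is built on HDFS, so it can store massive amounts of data.
Hive lets programmers run distributed computations with SQL commands; the computation runs on yarn. (Hive converts the SQL into MR jobs.)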

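Advantages:

It lowers the development effort: programmers write SQL instead of MapReduce code, which reduces the learning cost.

Disadvantages:

Latency is high (MapReduce itself is slow, and Hive adds the time needed to convert, optimize, and submit the SQL as MapReduce jobs). Hive suits offline processing of big data (TB to PB scale, with statistics delivered a day later).

Scenarios Hive is not suited for:

1: Small data volumes

2: Real-time computation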

Database (DataBase)
- Small data volume, high value density
Data warehouse (DataWareHouse)
- Large data volume, low value density

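Hive Architecture

1. Overview

HDFS: stores the data files of the Hive warehouse.
yarn: runs the MR programs that Hive generates from HQL.
MetaStore: stores and manages the metadata that Hive maintains.
Hive: executes HQL by converting it into MapReduce programs, which compute statistics over the data files in the HDFS cluster.

2. Diagram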

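Installing Hive

Required components:
1. HDFS (Hadoop 2.9.2)
2. Yarn (Hadoop 2.9.2)
3. MySQL (5.6)
4. Hive (1.2.1)

Give the virtual machine at least 1 GB of memory.

1. Install the MySQL database

See the MySQL installation guide.

2. Install Hadoop

Configure HDFS and yarn, then check that all daemons are running:

    [root@hive40 ~]# jps
    1651 NameNode
    2356 NodeManager
    2533 Jps
    1815 DataNode
    2027 SecondaryNameNode
    2237 ResourceManager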

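3. Install Hive

1 Upload the Hive package to the Linux machine

2 Extract Hive

    [root@hadoop ~]# tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/installs
    [root@hadoop ~]# mv /opt/installs/apache-hive-1.2.1-bin /opt/installs/hive1.2.1

3 Configure the environment variables (in /etc/profile)

    export HIVE_HOME=/opt/installs/hive1.2.1
    export PATH=$PATH:$HIVE_HOME/bin

4 Reload the system configuration so the changes take effect

    [root@hadoop ~]# source /etc/profile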

5 Configure Hive

hive-env.sh

Create hive-env.sh from the template:

    [root@hadoop10 conf]# cp hive-env.sh.template hive-env.sh

    # Hadoop installation directory
    HADOOP_HOME=/opt/installs/hadoop2.9.2/
    # Hive configuration directory
    export HIVE_CONF_DIR=/opt/installs/hive1.2.1/conf/

hive-site.xml

Create hive-site.xml from the template, then edit it so that it contains the following:

    [root@hadoop10 conf]# cp hive-default.xml.template hive-site.xml

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://hadoop10:3306/hive</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>admins</value>
        </property>
    </configuration>

Log in to MySQL and create the hive database (from the command line):

    create database hive;

Copy the MySQL driver jar into Hive's lib directory.

4 Start

1. Start Hadoop

    # start HDFS
    start-dfs.sh
    # start yarn
    start-yarn.sh

2. Start Hive locally

Initialize the metadata first:

    schematool -dbType mysql -initSchema

This populates the hive database in MySQL with the metastore schema.

3. Two ways to start Hive

Local mode (administrator mode): starts the Hive server and enters the Hive client at the same time. It can only be used locally.

    [root@hadoop10 ~]# hive
    Logging initialized using configuration in jar:file:/opt/installs/hive1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
    hive>

Client operations in HQL (Hive Query Language):

    # 1. List the databases
    hive> show databases;
    # 2. Create a database
    hive> create database baizhi;
    # 3. List the databases again
    hive> show databases;
    # 4. Switch to the database
    hive> use baizhi;
    # 5. List all tables
    hive> show tables;
    # 6. Create a table
    hive> create table t_user(id string,name string,age int);
    # 7. Insert one row (runs as a MapReduce job; for testing only, avoid it in real work)
    hive> insert into t_user values('1001','zhangsan',20);
    # 8. Show the table structure
    hive> desc t_user;
    # 9. Show the table's schema description (table metadata); it shows that the data is stored in HDFS
    hive> show create table t_user;
    # 10. Show the database structure
    hive> desc database baizhi;
    # 11. Show the current database
    hive> select current_database();
    # 12. Other SQL
    hive> select * from t_user;
    hive> select count(*) from t_user;        # Hive starts a MapReduce job
    hive> select * from t_user order by id;

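3. Hive client and server

Start the Hive server, which allows remote connections:

    # foreground
    [root@hadoop10 ~]# hiveserver2
    # background
    [root@hadoop10 ~]# hiveserver2 &

beeline client

    # start the client
    [root@hadoop10 ~]# beeline
    beeline> !connect jdbc:hive2://hadoop10:10000

Press Enter, type the user name, press Enter again, then type the password (the author uses the MySQL user name and password).

DBeaver client (graphical interface)

1: Extract the archive.
2: Prepare the jars DBeaver needs to connect to Hive: hadoop-common-2.9.2 and hive-jdbc-1.2.1-standalone.
3: Start DBeaver.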

JDBC

Add the dependencies:

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>

Access Hive through JDBC:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.log4j.BasicConfigurator;

    public class HiveJdbcDemo {
        public static void main(String[] args) throws Exception {
            BasicConfigurator.configure(); // turn on log4j console logging
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // load the Hive driver
            String sql = "select * from t_user";
            // connect to the baizhi database; try-with-resources closes everything in reverse order
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hadoop10:10000/baizhi", "root", "admins");
                 PreparedStatement pstm = conn.prepareStatement(sql);
                 ResultSet rs = pstm.executeQuery()) {
                while (rs.next()) {
                    String id = rs.getString("id");
                    String name = rs.getString("name");
                    int age = rs.getInt("age");
                    System.out.println(id + ":" + name + ":" + age);
                }
            }
        }
    }

4. Data types

Data types (primitive, array, map, struct)

primitive:

Hive type	Bytes	Notes

TINYINT	1	integer, Java byte
SMALLINT	2	integer, Java short
INT	4	integer, Java int
BIGINT	8	integer, Java long
BOOLEAN		boolean
FLOAT	4	floating point
DOUBLE	8	floating point
STRING		string, no length limit
VARCHAR		string; varchar(20) holds at most 20 characters
CHAR		string; char(20) has a fixed length of 20
BINARY		binary
TIMESTAMP		timestamp
DATE		date

array:
    # general form: column_name array<element_type>
    create table t_tab(score array<float>);
map (key-value): MAP <primitive_type, data_type>
    create table t_tab(score map<string,float>);
struct: STRUCT <col_name:data_type, …>
    # general form: column_name struct<field_name:type,field_name:type>
    create table t_tab(info struct<name:string,age:int,sex:char(1)>);

Importing Data into Hive

1. Custom delimiters

Delimiter	Meaning	Keyword

,	Separates the column values in a row.	fields
-	Separates array elements, struct values, and the key-value pairs of a map.	collection items
|	Separates the key from the value inside a map entry.	map keys
\n	Separates rows.	lines

Create the table:

    create table t_person(
        id string,
        name string,
        salary double,
        birthday date,
        sex char(1),
        hobbies array<string>,
        cards map<string,string>,
        addr struct<city:string,zipCode:string>
    )
    row format delimited
    fields terminated by ','             -- between columns
    collection items terminated by '-'   -- array elements, struct fields, map entries
    map keys terminated by '|'           -- between a map key and its value
    lines terminated by '\n';            -- between rows

Test data:

    1,張三,8000.0,2019-9-9,1,抽煙-喝酒-燙頭,123456|中國銀行-22334455|建設銀行,北京-10010
    2,李四,9000.0,2019-8-9,0,抽煙-喝酒-燙頭,123456|中國銀行-22334455|建設銀行,鄭州-45000
    3,王五,7000.0,2019-7-9,1,喝酒-燙頭,123456|中國銀行-22334455|建設銀行,北京-10010
    4,趙6,100.0,2019-10-9,0,抽煙-燙頭,123456|中國銀行-22334455|建設銀行,鄭州-45000
    5,于謙,1000.0,2019-10-9,0,抽煙-喝酒,123456|中國銀行-22334455|建設銀行,北京-10010
    6,郭德綱,1000.0,2019-10-9,1,抽煙-燙頭,123456|中國銀行-22334455|建設銀行,天津-20010

Load the data (run in the hive command line):

    -- local: the path is on the local file system; without it the file is read from HDFS
    -- overwrite: replaces the existing data; optional
    load data [local] inpath '/opt/datas/person1.txt' [overwrite] into table t_person;

In essence this uploads the file to HDFS, where the data is managed by Hive.
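To make the three delimiter levels concrete, here is a minimal Java sketch that splits one row the same way the table definition describes. It is not Hive code: `DelimitedRowSketch` and its methods are hypothetical helpers, and the sample row is a made-up ASCII analogue of the test data above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hive): mimics the delimiters declared for t_person.
final class DelimitedRowSketch {
    // array<string>: elements are separated by the collection-item delimiter '-'.
    static List<String> parseArray(String column) {
        return new ArrayList<>(Arrays.asList(column.split("-")));
    }

    // map<string,string>: entries are separated by '-', key and value by '|'.
    static Map<String, String> parseMap(String column) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : column.split("-")) {
            String[] kv = entry.split("\\|", 2);
            result.put(kv[0], kv[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up row in the same layout as the test data.
        String row = "1,zhangsan,8000.0,2019-9-9,1,smoking-drinking,123456|BankA-22334455|BankB,beijing-10010";
        String[] fields = row.split(",");            // fields terminated by ','
        List<String> hobbies = parseArray(fields[5]);
        Map<String, String> cards = parseMap(fields[6]);
        String[] addr = fields[7].split("-");        // struct<city,zipCode> also uses '-'
        System.out.println(hobbies + " " + cards + " city=" + addr[0] + " zipCode=" + addr[1]);
    }
}
```

The sketch also shows why the delimiters must not occur inside the values themselves: a hobby containing '-' would be split into two array elements.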

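2. JSON format

Adding the jar, creating the table, and loading the data are all done in beeline.

Data

Create a JSON file locally:

    {"id":1,"name":"zhangsan","sex":0,"birth":"1991-02-08"}
    {"id":2,"name":"lisi","sex":1,"birth":"1991-02-08"}

Add the jar that contains the format parser (local client command)

Run this in the Hive client. It adds the jar to Hive's classpath temporarily, for the current connection only:

    add jar /opt/installs/hive1.2.1/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar

To add a jar permanently, at the Hive server level:
1. Copy the jar into the auxlib directory under the Hive installation.
2. Restart hiveserver.

Create the table

    create table t_person2(
        id string,
        name string,
        sex char(1),
        birth date
    )
    row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';

Load the file (local client command)

Note: the imported JSON data cannot be viewed in DBeaver (after the import, the table is essentially just that JSON file).

    load data local inpath '/opt/person.json' into table t_person2;

Query the data

    select * from t_person2;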

3. Regular-expression format

Data: access.log

    INFO 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#login
    INFO 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#login
    ERROR 192.168.1.3 2019-10-19 QQ com.baizhi.service.IUserService#save
    WARN 192.168.1.2 2019-10-19 QQ com.baizhi.service.IUserService#login
    DEBUG 192.168.1.3 2019-10-19 QQ com.baizhi.service.IUserService#login
    ERROR 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#register

Create the table

    create table t_access(
        level string,
        ip string,
        log_time date,
        app string,
        service string,
        method string
    )
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'  -- SerDe class that parses rows with a regular expression
    with serdeproperties("input.regex"="(.*)\\s(.*)\\s(.*)\\s(.*)\\s(.*)#(.*)");
    -- (.*) matches any run of characters; \\s matches a whitespace character

Load the data

    load data local inpath '/opt/access.log' into table t_access;

Query the data

    select * from t_access;
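Before loading a log file it can help to check the pattern against a sample line outside Hive. The sketch below applies the same regular expression with `java.util.regex`; `AccessLogRegexSketch` is a hypothetical class for illustration, and it only shows how the six capture groups line up with the six columns, not how RegexSerDe is implemented.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies the same pattern as "input.regex" to one log line.
final class AccessLogRegexSketch {
    static final Pattern LINE = Pattern.compile("(.*)\\s(.*)\\s(.*)\\s(.*)\\s(.*)#(.*)");

    // Returns the captured groups in column order, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1); // group 0 is the whole line, so start at 1
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "ERROR 192.168.1.3 2019-10-19 QQ com.baizhi.service.IUserService#save";
        // Columns: level, ip, log_time, app, service, method
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

Because each sample line has exactly four spaces and one '#', the greedy (.*) groups can only match one way; lines with extra spaces would need a stricter pattern.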

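Advanced HQL

SQL keyword execution order:
from > where > group by > having > select > order by > limit

Note: once a query contains group by, the clauses that follow can only use the grouping columns and the results of aggregate functions.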
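The execution order can be pictured as a pipeline in which each clause consumes the output of the previous one. The following Java sketch models that with streams; it is an analogy only, the class and data are invented for illustration, and it says nothing about how Hive plans a query.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical model of:
//   select city, avg(salary) from t where salary > 500
//   group by city having avg(salary) > 5000 order by avg(salary) desc limit 2
final class ClauseOrderSketch {
    static final class Person {
        final String city;
        final double salary;
        Person(String city, double salary) { this.city = city; this.salary = salary; }
    }

    static final class CityAvg {
        final String city;
        final double avg;
        CityAvg(String city, double avg) { this.city = city; this.avg = avg; }
        @Override public String toString() { return city + ":" + avg; }
    }

    static List<String> query(List<Person> table) {
        Map<String, Double> groups = table.stream()                      // from
                .filter(p -> p.salary > 500)                             // where (rows, before grouping)
                .collect(Collectors.groupingBy(p -> p.city,              // group by
                        Collectors.averagingDouble(p -> p.salary)));
        return groups.entrySet().stream()
                .filter(e -> e.getValue() > 5000)                        // having (groups, after grouping)
                .map(e -> new CityAvg(e.getKey(), e.getValue()))         // select
                .sorted(Comparator.comparingDouble((CityAvg c) -> c.avg).reversed()) // order by
                .limit(2)                                                // limit
                .map(CityAvg::toString)
                .collect(Collectors.toList());
    }
}
```

Since where runs before group by, a row removed by where never contributes to an average, whereas having can only discard whole groups.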


    # 0. Accessing fields of the complex types (array, map, struct)
    select name,salary,hobbies[1],cards['123456'],addr.city from t_person;
    # 1. Conditions: = != >= <=
    select * from t_person where addr.city='鄭州';
    # 2. and, or, between ... and
    select * from t_person where salary>5000 and array_contains(hobbies,'抽煙');
    # 3. order by (starts a MapReduce job to do the sorting)
    select * from t_person order by salary desc;
    # 4. limit (Hive 1.2.1 does not support a starting offset)
    select * from t_person sort by salary desc limit 5;
    # 5. Removing duplicates
    select distinct addr.city from t_person;
    select distinct(addr.city) from t_person;

    # Table joins
    select ... from table1 t1 left join table2 t2 on condition where condition group by having

    1. Find people of different sex who have the same salary.
    select t1.name,t1.sex,t1.salary,t2.name,t2.sex,t2.salary
    from t_person t1 join t_person t2 on t1.salary = t2.salary
    where t1.sex != t2.sex;

    2. Find people who share the same first hobby but come from different cities.
    select t1.name,t1.salary,t1.hobbies,t1.addr.city,t2.name,t2.salary,t2.hobbies,t2.addr.city
    from t_person t1 join t_person t2 on t1.hobbies[0]=t2.hobbies[0]
    where t1.addr.city != t2.addr.city;

    # Single-row functions
    -- list every function in the Hive system
    show functions;
    1. array_contains(column,value)
    select name,hobbies from t_person where array_contains(hobbies,'喝酒');
    2. length(column)
    select length('123123');
    3. concat(column,column)
    select concat('123123','aaaa');
    4. to_date('1999-9-9')
    select to_date('1999-9-9');
    5. year(date), month(date)
    6. date_add(date,number)
    select name,date_add(birthday,-9) from t_person;

    # Aggregate functions: max, min, sum, avg, count, and so on
    select max(salary) from t_person where addr.city='北京';
    select count(id) from t_person;

    # Explode function (table-generating function)
    -- list all hobbies
    select explode(hobbies) as hobby from t_person;

    # lateral view
    -- joins an extra column (the exploded result) onto the side of a table, similar to a table join
    -- syntax: from table lateral view explode(array_column) alias as column_name
    -- show id, name, and hobby, one row per hobby
    select id,name,hobby from t_person lateral view explode(hobbies) t_hobby as hobby;

    # Grouping
    1. group by (average salary per city)
    select addr.city,avg(salary) from t_person group by addr.city;
    2. having (cities whose average salary exceeds 5000, with the average)
    select addr.city,avg(salary) from t_person group by addr.city having avg(salary)>5000;
    3. Number of people per hobby (explode + lateral view)
    select hobby,count(*) from t_person lateral view explode(hobbies) t_hobby as hobby group by hobby;
    4. The single most popular hobby (TOP 1)
    select hb,count(*) num from t_person lateral view explode(hobbies) h as hb group by hb order by num desc limit 1;

    # Subqueries
    -- list the distinct hobbies
    select distinct t.hobby from (select explode(hobbies) as hobby from t_person) t;
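For readers who think more easily in loops than in HQL, the following Java sketch imitates what `lateral view explode(hobbies)` followed by `group by hobby` produces. The class, method names, and sample data are hypothetical; the code only mirrors the shape of the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "lateral view explode" plus "group by ... count(*)".
final class ExplodeSketch {
    // One output row {name, hobby} per element of each person's hobbies array.
    static List<String[]> lateralViewExplode(Map<String, List<String>> hobbiesByName) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> person : hobbiesByName.entrySet()) {
            for (String hobby : person.getValue()) {
                rows.add(new String[] {person.getKey(), hobby});
            }
        }
        return rows;
    }

    // group by hobby, count(*): counts the exploded rows per hobby.
    static Map<String, Long> countByHobby(List<String[]> explodedRows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] row : explodedRows) {
            counts.merge(row[1], 1L, Long::sum);
        }
        return counts;
    }
}
```

A person with three hobbies contributes three rows after the explode step, which is why the count per hobby equals the number of people who list it.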

Converting Between Rows and Columns

Sample table and data

    -- table: movie viewing log
    create table t_visit_video (
        username string,
        video_name string,
        video_date date
    )
    row format delimited fields terminated by ',';

    -- data: Douban viewing log (one log file per day)
    張三,大唐雙龍傳,2020-03-21
    李四,天下無賊,2020-03-21
    張三,神探狄仁杰,2020-03-21
    李四,霸王別姬,2020-03-21
    李四,霸王別姬,2020-03-21
    王五,機器人總動員,2020-03-21
    王五,放牛班的春天,2020-03-21
    王五,盜夢空間,2020-03-21

collect_list (aggregate function)
Purpose: after grouping, collects the values of one column within each group.
Syntax: select collect_list(column) from table group by grouping_column;

    select username,collect_list(video_name) from t_visit_video group by username;

collect_set (aggregate function)
Purpose: after grouping, collects the values of one column within each group and removes duplicates.
Syntax: select collect_set(column) from table group by grouping_column;

    select username,collect_set(video_name) from t_visit_video group by username;

concat_ws (single-row function)
Purpose: if a column is an array, joins its elements with the given separator.

    select id,name,concat_ws(',',hobbies) from t_person;

    -- turn t_visit_video into one row per user:
    -- the movies each person watched on 2020-3-21
    select username,concat_ws(',',collect_set(video_name)) from t_visit_video group by username;
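The combination `concat_ws(',', collect_set(video_name))` can be read as "group, deduplicate, join". Here is a small Java sketch of those three steps; `CollectSetSketch` is a hypothetical name, the rows are invented, and the first-seen ordering inside each set is a choice of this sketch rather than something the query above promises.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of: select username, concat_ws(',', collect_set(video_name)) ... group by username
final class CollectSetSketch {
    // Each input row is {username, video_name}.
    static Map<String, String> moviesPerUser(String[][] rows) {
        // group by username + collect_set: one deduplicated set per user
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new LinkedHashSet<>()).add(row[1]);
        }
        // concat_ws(','): join the elements of each set with a comma
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), String.join(",", e.getValue()));
        }
        return result;
    }
}
```

Swapping the `LinkedHashSet` for an `ArrayList` would give the `collect_list` behaviour, where repeated titles are kept.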

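Global Sorting and Local Sorting

Global sorting
Syntax: select * from table order by column asc|desc;

    -- sort by salary, descending
    select * from t_person order by salary desc;

Local sorting (sorting within partitions)
Concept: several reduce tasks are started and each sorts its own share of the data (a pre-sort), so the output is ordered only within each reducer.
The keyword for local sorting is sort by. By default there is only one reduce task and therefore only one partition, so the default result looks the same as a global sort.
Syntax: select * from table distribute by partition_column sort by column asc|desc;

    -- 1. Set the number of reducers
    set mapreduce.job.reduces = 3;
    -- check the number of reducers
    set mapreduce.job.reduces;
    -- 2. Sort with sort by, and use distribute by to choose the partition column
    --    (per the author, the select list can only be * once distribute by is used)
    select * from t_person distribute by addr.city sort by salary desc;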
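The difference between the two sorts is easier to see with a toy model. The sketch below routes rows to a fixed number of buckets by hashing the distribute-by column and then sorts each bucket on its own. It is an approximation written for this article: the class is hypothetical, and using `String.hashCode()` modulo the reducer count is an assumption that stands in for whatever partitioning Hive actually applies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of: select * from t distribute by city sort by salary desc
final class DistributeSortSketch {
    static final class Row {
        final String city;
        final double salary;
        Row(String city, double salary) { this.city = city; this.salary = salary; }
    }

    static List<List<Row>> distributeAndSort(List<Row> rows, int reducers) {
        List<List<Row>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        // distribute by city: rows with the same city always land in the same bucket
        for (Row row : rows) {
            buckets.get(Math.floorMod(row.city.hashCode(), reducers)).add(row);
        }
        // sort by salary desc: each bucket is sorted independently of the others
        for (List<Row> bucket : buckets) {
            bucket.sort(Comparator.comparingDouble((Row r) -> r.salary).reversed());
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("beijing", 8000));
        rows.add(new Row("zhengzhou", 9000));
        rows.add(new Row("beijing", 7000));
        rows.add(new Row("tianjin", 1000));
        List<List<Row>> out = distributeAndSort(rows, 3);
        for (int i = 0; i < out.size(); i++) {
            for (Row r : out.get(i)) {
                System.out.println("reducer " + i + ": " + r.city + " " + r.salary);
            }
        }
    }
}
```

Reading the buckets one after another does not give a globally sorted list, which is exactly the gap between sort by and order by.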

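Table Types in Hive

4.1 Managed tables

Tables that are managed entirely by Hive.

A managed table is one whose data Hive has full authority over. When a user drops a managed table, Hive also deletes the data behind it. In production, to guard against data loss caused by mistakes, tables are usually changed to non-managed tables, that is, external tables.

Summary: Hive fully manages both the table structure and the table's data files in HDFS. When Hive drops a managed table, the corresponding HDFS files are deleted too.

Drawback: the data is not safe.

4.2 External tables

An external table maps existing HDFS data as a table; dropping the table cannot delete the data.

The biggest difference from a managed table is that dropping an external table only removes the table's metadata from MySQL and leaves the data on HDFS untouched, so an external table can share its data with third-party applications. To create one, add the keyword "external": create external table xxx()…

Creating an external table:
1. Prepare the data file personout.txt.
2. Upload it to HDFS. The file must be placed in a directory of its own; every data file in that directory is treated as table data.
3. Create the table with create external ... location, using location at the end to point at the HDFS directory that holds the data files:

    create external table t_personout(
        id int,
        name string,
        salary double,
        birthday date,
        sex char(1),
        hobbies array<string>,
        cards map<string,string>,
        addr struct<city:string,zipCode:string>
    )
    row format delimited
    fields terminated by ','             -- between columns
    collection items terminated by '-'   -- array elements, struct fields, map entries
    map keys terminated by '|'
    lines terminated by '\n'
    location '/file';

4. Query the table data.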

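4.3 Partitioned tables

A partitioned table stores its rows in separate partitions according to a rule on one or more columns. With massive data this narrows the range that has to be searched and improves query efficiency.

Examples: a movie table, a user table

Partitioning schemes: by user region, by movie genre

Application: based on the actual business needs, take the columns used in query conditions as partition columns. This shrinks the range MapReduce has to scan and makes the MapReduce job more efficient.

Summary:

The data of a table's partitions is managed partition by partition.

1: Data is deleted by partition. Dropping a partition also deletes that partition's data (for an external table the data is removed from the table, but the data files remain).

2: For queries and statistics, all partitions are managed as one table.

select * from table where a condition on the partition column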

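4.3.1 Creating a partitioned table

Source data files

    # file "bj.txt" (china bj data)
    1001,張三,1999-1-9,1000.0
    1002,李四,1999-2-9,2000.0
    1008,孫帥,1999-9-8,50000.0
    1010,王宇希,1999-10-9,10000.0
    1009,劉春陽,1999-9-9,10.0
    # file "tj.txt" (china tj data)
    1006,郭德綱,1999-6-9,6000.0
    1007,胡鑫喆,1999-7-9,7000.0

Create the table

    create external table t_user_part(
        id string,
        name string,
        birth date,
        salary double
    )
    partitioned by(country string,city string)  -- partition columns: country and city
    row format delimited
    fields terminated by ','
    lines terminated by '\n';

Load data into the partitions

    # load the china / bj data
    load data local inpath "/opt/bj.txt" into table t_user_part partition(country='china',city='bj');
    # load the china / tj data
    load data local inpath "/opt/tj.txt" into table t_user_part partition(country='china',city='tj');

Show the partitions

    show partitions t_user_part;

Query by partition: in essence, the query condition only needs to include a partition column

    select * from t_user_part where city = 'bj';

Drop a partition

Dropping a partition deletes its data along with it.

For an external partitioned table, Hive no longer manages the data after the drop, but the data files still exist.

    alter table t_user_part drop partition(country='china',city='bj');

Add a partition (for reference)

    alter table t_user_part add partition(country='china',city='heb') location '/file/t_user_part/heb';

Table types, in short:
1. Managed table: both the table in Hive and the data files in HDFS are managed by Hive.
2. External table (commonly used; the HDFS files are safe): if the table is dropped in Hive, the external HDFS data files are kept.
3. Partitioned table (important): the table is managed as separate partitions. Benefit: when the where clause contains a partition column, Hive searches only the matching partitions and does not scan the others, which makes queries faster.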
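Partition pruning works because each partition lives in its own directory, named with key=value path segments. The Java sketch below builds such paths and filters them by a partition value. The class and the table directory `/warehouse/t_user_part` are hypothetical, chosen only for illustration; the real location depends on the warehouse settings or on an explicit location clause.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of partition directories and partition pruning.
final class PartitionPathSketch {
    // Builds "<tableDir>/country=china/city=bj" from an ordered partition spec.
    static String partitionDir(String tableDir, LinkedHashMap<String, String> spec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> column : spec.entrySet()) {
            path.append('/').append(column.getKey()).append('=').append(column.getValue());
        }
        return path.toString();
    }

    // Keeps only the directories that contain the requested key=value segment,
    // the way "where city = 'bj'" limits the scan to matching partitions.
    static List<String> prune(List<String> partitionDirs, String key, String value) {
        String segment = "/" + key + "=" + value + "/";
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            if ((dir + "/").contains(segment)) {
                kept.add(dir);
            }
        }
        return kept;
    }
}
```

With the bj and tj partitions from this section, pruning on city=bj leaves a single directory to read, so the tj files are never opened.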

Custom Functions in Hive

Built-in functions

    # list Hive's built-in functions
    show functions;
    # show the description of a function
    desc function max;

User-defined functions (UDF)

UDF: user-defined function

A UDF operates on a single row and produces a single row as output. Most functions fall into this category (for example, math functions and string functions).

Put simply:

UDF: returns one value per input, one to one.

0. Add the Hive dependency

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>

1. Define a class that extends UDF
   1. The class must extend UDF.
   2. The method must be named evaluate.

    package function; // matches "function.HelloUDF" used in create function below

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;

    @Description(
        name = "hello",
        // Chinese text here shows up garbled when viewed later, so write the description in English
        value = "hello(str1,str2) - returns a greeting built from str1 and str2"
    )
    public class HelloUDF extends UDF {
        // the method must be named evaluate
        public String evaluate(String s1, String s2) {
            return "你好，" + s1 + "," + s2 + "有美女嗎?";
        }
    }

2. Configure the maven build and package the jar

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <build>
        <finalName>funcHello</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <includes>
                        <include>**/function/**</include>
                    </includes>
                </configuration>
            </plugin>
        </plugins>
    </build>

    # package
    mvn package

3. Upload the jar to Linux and register the function (run in the hive command line)

    add jar /opt/doc/funcHello.jar;      # added at the Hive session level
    delete jar /opt/doc/funcHello.jar;   # if you rebuild the jar, remember to delete the old one first
    create [temporary] function hello as "function.HelloUDF";   # temporary means session level
    # remove the registered function
    drop [temporary] function hello;

4. Inspect and use the function

    -- 1. inspect the function
    desc function hello;
    desc function extended hello;
    -- 2. use the function in a query
    select hello(userid,cityname) from logs;
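The rule that the method must be called evaluate makes more sense once you picture the method being looked up by name at run time. The sketch below demonstrates that idea with plain Java reflection and no Hive classes at all: `EvaluateLookupSketch` and its nested `Hello` class are hypothetical stand-ins, and the lookup shown is a simplification for illustration, not Hive's actual resolver.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical illustration: find a method named "evaluate" by reflection and call it.
final class EvaluateLookupSketch {
    // Stand-in for a UDF class; a real one would extend org.apache.hadoop.hive.ql.exec.UDF.
    public static final class Hello {
        public String evaluate(String s1, String s2) {
            return "hello, " + s1 + "," + s2;
        }
    }

    // Looks up evaluate(String, ..., String) with as many parameters as arguments given.
    static Object call(Object udf, String... args) {
        Class<?>[] parameterTypes = new Class<?>[args.length];
        Arrays.fill(parameterTypes, String.class);
        try {
            Method evaluate = udf.getClass().getMethod("evaluate", parameterTypes);
            return evaluate.invoke(udf, (Object[]) args);
        } catch (ReflectiveOperationException e) {
            // A misspelled method name or a wrong parameter list ends up here.
            throw new IllegalStateException("no usable evaluate method", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(new Hello(), "1001", "beijing"));
    }
}
```

Calling the method with a different number of arguments fails in the lookup step, which is a reasonable mental model for the kind of error a mismatched call to a UDF produces.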

Installing the odd pentaho dependency by hand

Download:

    https://public.nexus.pentaho.org/repository/proxied-pentaho-public-repos-group/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar

Put the file in a local directory whose path contains only English characters:

    D:\work\pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar

Run the mvn command that installs it into the local repository:

    D:\work> mvn install:install-file -DgroupId=org.pentaho -DartifactId=pentaho-aggdesigner-algorithm -Dversion=5.1.5-jhyde -Dpackaging=jar -Dfile=pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
