Importing MySQL data into Hadoop with Sqoop
Setting up and configuring Hadoop itself is not covered here.
Installing Sqoop is also straightforward.
Once Sqoop is installed, you can test whether it can connect to MySQL like this (note: the MySQL JDBC driver jar must be placed under SQOOP_HOME/lib):
sqoop list-databases --connect jdbc:mysql://192.168.1.109:3306/ --username root --password 19891231
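If the command complains that it cannot find the JDBC driver, the jar mentioned above is missing from SQOOP_HOME/lib; copying it in looks roughly like this (the file name and version are only an example, use whichever connector jar you downloaded):
cp mysql-connector-java-5.1.32-bin.jar $SQOOP_HOME/lib/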
The output looks like the following:
This shows that Sqoop is working properly.
Next, let's import the MySQL data into Hadoop.
The data I prepared is an ID-card information table with about 3 million rows:
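The exact schema is not shown here, but judging from the columns referenced in the queries later in this post (id, borth and address), the table looks roughly like the sketch below; the types, and any further columns the ID-card data may contain, are my assumptions:
CREATE TABLE test_sfz (
  id      INT PRIMARY KEY,   -- numeric id, used in the "where id < 10" check below
  borth   VARCHAR(10),       -- birth date as text; the year is taken with SUBSTRING(borth,1,4)
  address VARCHAR(100)       -- address; filtered with LIKE and used in GROUP BY
);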
First start Hive (simply run the command: hive).
Then use Sqoop to import the data into Hive:
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz --hive-import
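The command above relies on Sqoop's defaults for parallelism and the target table name. If you need more control, Sqoop also accepts options such as --num-mappers (-m), --split-by and --hive-table; a variant might look like this (the mapper count and split column here are purely illustrative):
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz --hive-import --hive-table test_sfz --split-by id -m 4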
Sqoop launches a MapReduce job to do the import.
The import finished in 2 minutes 20 seconds, which is not bad.
In Hive you can now see the newly imported table:
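For example, a quick way to confirm the table exists and holds data from the Hive CLI (output omitted here):
show tables;
describe test_sfz;
select count(*) from test_sfz;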
Let's run a quick SQL statement to check the data:
select * from test_sfz where id < 10;
As you can see, Hive took nearly 25 seconds for this query, which is indeed slow (in MySQL the same query takes almost no time), but keep in mind that Hive runs the query as a job on Hadoop, so the extra time is expected.
Next, let's run some more complex queries against this data.
My machine's configuration is as follows:
Hadoop runs in pseudo-distributed mode on a virtual machine; the VM's OS is Ubuntu 12.04 64-bit, configured as follows:
TEST 1: Computing the average age
Test data: about 3.008 million rows (300.8W)
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz where address like '廣東%';
Time: 0.877 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz where address like '廣東%';
(note that the Hive query uses the fixed date '2014-10-01' in place of MySQL's NOW())
Time: 25.012 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz GROUP BY address order by ageAvge desc;
Time: 2.949 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz GROUP BY address order by ageAvge desc;
Time: 51.29 s
As the queries get more complex, Hive's time grows more slowly than MySQL's: MySQL went from 0.877 s to 2.949 s (roughly 3.4x), while Hive went from about 25 s to 51 s (roughly 2x).
TEST 2
Test data: 12 million rows (1200W)
MySQL engine: MyISAM (to speed up the queries)
Import into Hive:
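The import command is the same as before, just pointed at the larger table; the queries below reference a table named test_sfz2, so I assume that is the source table:
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz2 --hive-import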
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 5.642 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 168.259 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 11.964 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 311.714 s
TEST 3
Test data: 20 million rows (2000W)
MySQL engine: MyISAM (to speed up the queries)
Import into Hive:
(This import took very little time, possibly because during the TEST 2 import my host machine was busy with other resource-intensive work.)
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 6.605 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 188.206 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 19.926 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 411.816 s
Summary
At these data sizes MySQL answers both queries in seconds while Hive needs tens of seconds to several minutes, because every Hive query is executed as a job on the (pseudo-distributed, single-VM) Hadoop setup. On the other hand, as TEST 1 showed, Hive's query time grows more slowly than MySQL's as the queries get more complex.