Importing MySQL data into Hadoop with Sqoop
Setting up and configuring Hadoop itself is not covered here.
Installing Sqoop is also straightforward.
Once Sqoop is installed, you can test whether it can connect to MySQL like this (note: the MySQL JDBC driver jar must be placed under SQOOP_HOME/lib):
sqoop list-databases --connect jdbc:mysql://192.168.1.109:3306/ --username root --password 19891231
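If the command complains that it cannot find the JDBC driver, the jar mentioned above is missing from SQOOP_HOME/lib; copying it in looks roughly like this (the file name and version are only an example, use whichever connector jar you downloaded):
cp mysql-connector-java-5.1.32-bin.jar $SQOOP_HOME/lib/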
The output looks like the following:
This shows that Sqoop is working properly.
Next, let's import the MySQL data into Hadoop.
The data I prepared is an ID-card information table with about 3 million rows:
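The exact schema is not shown here, but judging from the columns referenced in the queries later in this post (id, borth and address), the table looks roughly like the sketch below; the types, and any further columns the ID-card data may contain, are my assumptions:
CREATE TABLE test_sfz (
  id      INT PRIMARY KEY,   -- numeric id, used in the "where id < 10" check below
  borth   VARCHAR(10),       -- birth date as text; the year is taken with SUBSTRING(borth,1,4)
  address VARCHAR(100)       -- address; filtered with LIKE and used in GROUP BY
);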
First start Hive (simply run the command: hive).
Then use Sqoop to import the data into Hive:
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz --hive-import
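The command above relies on Sqoop's defaults for parallelism and the target table name. If you need more control, Sqoop also accepts options such as --num-mappers (-m), --split-by and --hive-table; a variant might look like this (the mapper count and split column here are purely illustrative):
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz --hive-import --hive-table test_sfz --split-by id -m 4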
Sqoop launches a MapReduce job to do the import.
The import finished in 2 minutes 20 seconds, which is not bad.
In Hive you can now see the newly imported table:
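For example, a quick way to confirm the table exists and holds data from the Hive CLI (output omitted here):
show tables;
describe test_sfz;
select count(*) from test_sfz;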
Let's run a quick SQL statement to check the data:
select * from test_sfz where id < 10;
As you can see, Hive took nearly 25 seconds for this query, which is indeed slow (in MySQL the same query takes almost no time), but keep in mind that Hive runs the query as a job on Hadoop, so the extra time is expected.
Next, let's run some more complex queries against this data.
My machine's configuration is as follows:
Hadoop runs in pseudo-distributed mode on a virtual machine; the VM's OS is Ubuntu 12.04 64-bit, configured as follows:
TEST 1: Computing the average age
Test data: about 3.008 million rows (300.8W)
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz where address like '廣東%';
Time: 0.877 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz where address like '廣東%';
(note that the Hive query uses the fixed date '2014-10-01' in place of MySQL's NOW())
Time: 25.012 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz GROUP BY address order by ageAvge desc;
Time: 2.949 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz GROUP BY address order by ageAvge desc;
Time: 51.29 s
As the queries get more complex, Hive's time grows more slowly than MySQL's: MySQL went from 0.877 s to 2.949 s (roughly 3.4x), while Hive went from about 25 s to 51 s (roughly 2x).
TEST 2
Test data: 12 million rows (1200W)
MySQL engine: MyISAM (to speed up the queries)
Import into Hive:
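The import command is the same as before, just pointed at the larger table; the queries below reference a table named test_sfz2, so I assume that is the source table:
sqoop import --connect jdbc:mysql://192.168.1.109:3306/hadoop --username root --password 19891231 --table test_sfz2 --hive-import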
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 5.642 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 168.259 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 11.964 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 311.714 s
TEST 3
Test data: 20 million rows (2000W)
MySQL engine: MyISAM (to speed up the queries)
Import into Hive:
(This import took very little time, possibly because during the TEST 2 import my host machine was busy with other resource-intensive work.)
1. Average age in Guangdong
mysql:select (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 6.605 s
hive:select (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 where address like '廣東%';
Time: 188.206 s
2. Sort the cities by average age, from highest to lowest
mysql:select address, (sum(year(NOW()) - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 19.926 s
hive:select address, (sum(year('2014-10-01') - SUBSTRING(borth,1,4))/count(*)) as ageAvge from test_sfz2 GROUP BY address order by ageAvge desc;
Time: 411.816 s
Summary
At these data sizes MySQL answers both queries in seconds while Hive needs tens of seconds to several minutes, because every Hive query is executed as a job on the (pseudo-distributed, single-VM) Hadoop setup. On the other hand, as TEST 1 showed, Hive's query time grows more slowly than MySQL's as the queries get more complex.