Start a spark-shell with the following command:
./spark-shell --master local[2] --jars /home/iie4bu/software/mysql-connector-java-5.1.35.jar
Because we are going to work with Hive, whose metastore in this setup is backed by MySQL, the MySQL JDBC driver has to be added with --jars.
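Once the shell is up, spark is a SparkSession with Hive support enabled. A minimal sketch of a quick sanity check that the metastore connection works (the databases listed depend on your own metastore):
scala> spark.sql("show databases").show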
Viewing tables
List the tables with spark.sql("show tables").show.
Viewing table data
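The original screenshot of the table contents is not reproduced here. A minimal sketch of how to inspect the data from spark-shell, assuming the two tables used below (people and sal) exist in the default database:
scala> spark.sql("select * from people").show
scala> spark.sql("select * from sal").show(10)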
Comparing the performance of spark-shell and Hive
Run a join in Hive:
hive> select * from sal s join people p on s.transactionid=p.id;
In Hive this query takes about 20 s.
Now run the same query in spark-shell:
scala> spark.sql("select * from sal s join people p on s.transactionid=p.id").show
With Spark the result comes back almost instantly; the difference is striking.
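Eyeballing wall-clock time like this is only a rough comparison (caches and a warm metastore can skew it). If you want Spark to print the elapsed time for you, SparkSession provides a time helper; a minimal sketch:
scala> spark.time(spark.sql("select * from sal s join people p on s.transactionid=p.id").show)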
Upgrading from spark-shell
Above we ran SQL through spark-shell. There is another command that makes working with SQL even more convenient: spark-sql. Start it with:
./spark-sql --master local[2] --jars /home/iie4bu/software/mysql-connector-java-5.1.35.jar --driver-class-path /home/iie4bu/software/mysql-connector-java-5.1.35.jar
Note that the MySQL connector is passed with --driver-class-path as well, so the driver JVM can reach the MySQL-backed metastore when the CLI starts up.
Now run a query: spark-sql> select * from sal;
The corresponding execution plan can also be inspected in the Spark web UI.
Creating a table with spark-sql
Create a table with spark-sql> create table t (key string, value string);
Then list the tables again:
The newly created table t shows up.
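The table is empty at this point. If you want the queries below to return rows rather than just plans, you could load a couple of test rows first; a minimal sketch from spark-shell with hypothetical sample values (the equivalent INSERT also works at the spark-sql prompt):
scala> spark.sql("insert into t values ('1', 'one'), ('5', 'five')")
scala> spark.sql("select * from t").show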
spark-sql execution plans
Using the table t created above, ask for the execution plan:
spark-sql> explain select a.key*(2+3), b.value from t a join t b on a.key = b.key and a.key > 3;
The physical plan is printed:
spark-sql> explain select a.key*(2+3), b.value from t a join t b on a.key = b.key and a.key > 3;
20/03/27 00:04:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t
20/03/27 00:04:55 INFO audit: ugi=iie4bu ip=unknown-ip-addr cmd=get_table : db=default tbl=t
20/03/27 00:04:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t
20/03/27 00:04:55 INFO audit: ugi=iie4bu ip=unknown-ip-addr cmd=get_table : db=default tbl=t
20/03/27 00:04:55 INFO CodeGenerator: Code generated in 16.512742 ms
== Physical Plan ==
*(5) Project [(cast(key#28 as double) * 5.0) AS (CAST(key AS DOUBLE) * CAST((2 + 3) AS DOUBLE))#32, value#31]
+- *(5) SortMergeJoin [key#28], [key#30], Inner
   :- *(2) Sort [key#28 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#28, 200)
   :     +- *(1) Filter (isnotnull(key#28) && (cast(key#28 as int) > 3))
   :        +- Scan hive default.t [key#28], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#28, value#29]
   +- *(4) Sort [key#30 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(key#30, 200)
         +- *(3) Filter ((cast(key#30 as int) > 3) && isnotnull(key#30))
            +- Scan hive default.t [key#30, value#31], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#30, value#31]
Time taken: 0.362 seconds, Fetched 1 row(s)
20/03/27 00:04:55 INFO SparkSQLCLIDriver: Time taken: 0.362 seconds, Fetched 1 row(s)
To see the logical execution plans as well, add extended:
spark-sql> explain extended select a.key*(2+3), b.value from t a join t b on a.key = b.key and a.key > 3;
The output is as follows:
spark-sql> explain extended select a.key*(2+3), b.value from t a join t b on a.key = b.key and a.key > 3;
20/03/27 00:07:10 INFO HiveMetaStore: 0: get_table : db=default tbl=t
20/03/27 00:07:10 INFO audit: ugi=iie4bu ip=unknown-ip-addr cmd=get_table : db=default tbl=t
20/03/27 00:07:10 INFO HiveMetaStore: 0: get_table : db=default tbl=t
20/03/27 00:07:10 INFO audit: ugi=iie4bu ip=unknown-ip-addr cmd=get_table : db=default tbl=t
== Parsed Logical Plan ==
'Project [unresolvedalias(('a.key * (2 + 3)), None), 'b.value]
+- 'Join Inner, (('a.key = 'b.key) && ('a.key > 3))
   :- 'SubqueryAlias `a`
   :  +- 'UnresolvedRelation `t`
   +- 'SubqueryAlias `b`
      +- 'UnresolvedRelation `t`

== Analyzed Logical Plan ==
(CAST(key AS DOUBLE) * CAST((2 + 3) AS DOUBLE)): double, value: string
Project [(cast(key#35 as double) * cast((2 + 3) as double)) AS (CAST(key AS DOUBLE) * CAST((2 + 3) AS DOUBLE))#39, value#38]
+- Join Inner, ((key#35 = key#37) && (cast(key#35 as int) > 3))
   :- SubqueryAlias `a`
   :  +- SubqueryAlias `default`.`t`
   :     +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#35, value#36]
   +- SubqueryAlias `b`
      +- SubqueryAlias `default`.`t`
         +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#37, value#38]

== Optimized Logical Plan ==
Project [(cast(key#35 as double) * 5.0) AS (CAST(key AS DOUBLE) * CAST((2 + 3) AS DOUBLE))#39, value#38]
+- Join Inner, (key#35 = key#37)
   :- Project [key#35]
   :  +- Filter (isnotnull(key#35) && (cast(key#35 as int) > 3))
   :     +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#35, value#36]
   +- Filter ((cast(key#37 as int) > 3) && isnotnull(key#37))
      +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#37, value#38]

== Physical Plan ==
*(5) Project [(cast(key#35 as double) * 5.0) AS (CAST(key AS DOUBLE) * CAST((2 + 3) AS DOUBLE))#39, value#38]
+- *(5) SortMergeJoin [key#35], [key#37], Inner
   :- *(2) Sort [key#35 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#35, 200)
   :     +- *(1) Filter (isnotnull(key#35) && (cast(key#35 as int) > 3))
   :        +- Scan hive default.t [key#35], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#35, value#36]
   +- *(4) Sort [key#37 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(key#37, 200)
         +- *(3) Filter ((cast(key#37 as int) > 3) && isnotnull(key#37))
            +- Scan hive default.t [key#37, value#38], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#37, value#38]
Time taken: 0.19 seconds, Fetched 1 row(s)
20/03/27 00:07:10 INFO SparkSQLCLIDriver: Time taken: 0.19 seconds, Fetched 1 row(s)
As the output shows, the SQL text is first parsed into an abstract syntax tree, producing the unresolved (parsed) logical plan. The analyzer then resolves it against the metastore catalog to produce the analyzed logical plan, the optimizer rewrites it into the optimized logical plan (note how 2 + 3 has been folded into 5.0 and the a.key > 3 filter pushed down to both table scans), and finally the physical plan is generated.
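The same four stages can also be printed from spark-shell without the spark-sql CLI, using the Dataset API; a minimal sketch:
scala> val q = spark.sql("select a.key*(2+3), b.value from t a join t b on a.key = b.key and a.key > 3")
scala> q.explain()     // physical plan only
scala> q.explain(true) // parsed, analyzed, optimized and physical plans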
Summary
Once the MySQL metastore driver is on the classpath, both spark-shell (through spark.sql) and the spark-sql CLI can query Hive tables directly, and spark-sql is the more convenient choice for pure SQL work. For the join tested here Spark returned results far faster than Hive, and explain / explain extended show exactly which logical and physical plans Spark builds.