Using MySQL 5.7 Generated Columns to Increase Query Performance

Translated by the Xingyao team @ 知数堂 (team members: 芬達, 順子, M哥)
Original post: https://www.percona.com/blog/2018/01/29/using-generated-columns-in-mysql-5-7-to-increase-query-performance/
Original author: Alexander Rubin

In this blog post, we'll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.
Background

About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it has been one of my favorite features of the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I'm going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.
The problem

Recently I was working with a customer who was struggling with this query:
SELECT CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS 'API Call',
       COUNT(*) AS 'No. of API Calls',
       AVG(ExecutionTime) AS 'Avg. Execution Time',
       COUNT(DISTINCT AccountId) AS 'No. Of Accounts',
       COUNT(DISTINCT ParentAccountId) AS 'No. Of Parents'
FROM ApiLog
WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))
HAVING COUNT(*) >= 1;
The query was running for more than an hour and used all the space in the tmp directory (with sort files).
The table looked like this:
CREATE TABLE `ApiLog` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp DEFAULT CURRENT_TIMESTAMP,
  `ServerName` varchar(50) NOT NULL DEFAULT '',
  `ServerIP` varchar(50) NOT NULL DEFAULT '',
  `ClientIP` varchar(50) NOT NULL DEFAULT '',
  `ExecutionTime` int(11) NOT NULL DEFAULT 0,
  `URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Verb` varchar(16) NOT NULL,
  `AccountId` int(11) NOT NULL,
  `ParentAccountId` int(11) NOT NULL,
  `QueryString` varchar(3000) NOT NULL,
  `Request` text NOT NULL,
  `RequestHeaders` varchar(2000) NOT NULL,
  `Response` text NOT NULL,
  `ResponseHeaders` varchar(2000) NOT NULL,
  `ResponseCode` varchar(4000) NOT NULL,
  ... // other fields removed for simplicity
  PRIMARY KEY (`Id`),
  KEY `index_timestamp` (`ts`),
  ... // other indexes removed for simplicity
) ENGINE=InnoDB;
We found out the query was not using an index on the timestamp field ("ts"):
mysql> EXPLAIN SELECT CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS 'API Call',
    ->        COUNT(*) AS 'No. of API Calls',
    ->        AVG(ExecutionTime) AS 'Avg. Execution Time',
    ->        COUNT(DISTINCT AccountId) AS 'No. Of Accounts',
    ->        COUNT(DISTINCT ParentAccountId) AS 'No. Of Parents'
    -> FROM ApiLog
    -> WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
    -> GROUP BY CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))
    -> HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: ALL
possible_keys: ts
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 22255292
     filtered: 50.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks that):
mysql> SELECT COUNT(*) FROM ApiLog
    -> WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59';
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)
Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514). (Translator's note: once the optimizer estimates that a range covers roughly 20% to 30% of the table, it usually prefers a full table scan even when an index is available.)
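The 36% figure follows directly from the two counts above; a quick sanity check using the row counts MySQL reported:

```python
# Row counts taken from the query results above.
matching_rows = 7_948_800     # rows in the three-month ts range
total_rows = 21_998_514       # total rows in ApiLog

ratio = matching_rows / total_rows
print(f"{ratio:.0%} of the table matches the filter")  # -> 36%
```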
In this case, we have a number of approaches:

Create a combined index on the timestamp column + the GROUP BY fields
Create a covering index (including all the selected fields)
Create an index on just the GROUP BY fields
Create an index for a loose index scan
However, if we look closer at the "GROUP BY" part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
There are two problems here:

It uses a calculated field, so MySQL can't simply scan an index on verb + url. It first needs to concatenate the two fields and then group on the concatenated string, which means the index won't be used.
The url column is declared as "varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL" and can't be indexed in full (even with the innodb_large_prefix=1 option, which is the default here since utf8mb4 is enabled). We can only create a partial index, which won't help the GROUP BY optimization.
Here, I'm trying to add a full index on the URL with innodb_large_prefix=1:
mysql> ALTER TABLE ApiLog ADD KEY verb_url (verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
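The 1071 error is predictable with simple arithmetic: InnoDB caps index keys at 3072 bytes (with innodb_large_prefix=1), while utf8mb4 reserves up to 4 bytes per character. A back-of-the-envelope check:

```python
MAX_KEY_BYTES = 3072           # InnoDB index key limit with innodb_large_prefix=1
BYTES_PER_CHAR = 4             # worst case for a utf8mb4 character

url_key_bytes = 3000 * BYTES_PER_CHAR   # the varchar(3000) url column
print(url_key_bytes)                    # 12000 -> far above the 3072-byte cap

# Longest utf8mb4 varchar that still fits in one index key:
print(MAX_KEY_BYTES // BYTES_PER_CHAR)  # 768 characters
```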
Well, changing "GROUP BY CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))" to "GROUP BY verb, url" could help (assuming that we somehow trim the field definition down from varchar(3000) to something smaller, which may or may not be possible). However, it would change the results, as it would no longer strip the .xml extension from the url field.
The solution

The good news is that in MySQL 5.7 we have virtual columns, so we can create a virtual column on top of "CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))". The best part: we do not have to GROUP BY the full string (potentially more than 3000 bytes); we can use an MD5 hash (or a longer hash such as SHA-1/SHA-2) for the purposes of the GROUP BY.
Here is the solution:
ALTER TABLE ApiLog ADD verb_url_hash varbinary(16)
  GENERATED ALWAYS AS (UNHEX(MD5(CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))))) VIRTUAL;
ALTER TABLE ApiLog ADD KEY (verb_url_hash);
So what we did here is:

Declared a virtual column with type varbinary(16)
Defined it as CONCAT(verb, ' - ', REPLACE(url, '.xml', '')), with an MD5 hash on top plus UNHEX to convert the 32 hex characters into 16 binary bytes
Created an index on the virtual column
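The value MySQL stores in the virtual column can be reproduced client-side. A small Python sketch (the function name is mine, not part of the schema); `hashlib.md5(...).digest()` returns the same 16 raw bytes as MySQL's UNHEX(MD5(...)):

```python
import hashlib

def verb_url_hash(verb: str, url: str) -> bytes:
    """Mirror of the generated column:
    UNHEX(MD5(CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))))."""
    key = verb + " - " + url.replace(".xml", "")
    return hashlib.md5(key.encode("utf8")).digest()

h = verb_url_hash("GET", "/accounts/list.xml")
print(len(h))   # 16 -> fits the varbinary(16) column exactly
```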
Now we can change the query to GROUP BY the verb_url_hash column:
mysql> EXPLAIN SELECT CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS 'API Call',
    ->        COUNT(*) AS 'No. of API Calls',
    ->        AVG(ExecutionTime) AS 'Avg. Execution Time',
    ->        COUNT(DISTINCT AccountId) AS 'No. Of Accounts',
    ->        COUNT(DISTINCT ParentAccountId) AS 'No. Of Parents'
    -> FROM ApiLog
    -> WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
    -> GROUP BY verb_url_hash
    -> HAVING COUNT(*) >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause
and contains nonaggregated column 'ApiLog.ApiLog.Verb' which is not
functionally dependent on columns in GROUP BY clause; this is incompatible
with sql_mode=only_full_group_by
MySQL 5.7 has strict mode (ONLY_FULL_GROUP_BY) enabled by default, which we can change for this query only.

Now the explain plan looks much better:
mysql> SELECT @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SET sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS 'API Call',
    ->        COUNT(*) AS 'No. of API Calls',
    ->        AVG(ExecutionTime) AS 'Avg. Execution Time',
    ->        COUNT(DISTINCT AccountId) AS 'No. Of Accounts',
    ->        COUNT(DISTINCT ParentAccountId) AS 'No. Of Parents'
    -> FROM ApiLog
    -> WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
    -> GROUP BY verb_url_hash
    -> HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)
MySQL avoids any sorting, which is much faster, although it still has to eventually scan the whole table in index order. The response time is significantly better: roughly 38 seconds as opposed to more than an hour.
Covered index

Now we can attempt to create a covering index, which will be quite large:
mysql> ALTER TABLE ApiLog ADD KEY covered_index
    -> (`verb_url_hash`, `ts`, `ExecutionTime`, `AccountId`, `ParentAccountId`, verb, url);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0
We had to add verb and url to the index, so beforehand I had to remove COLLATE utf8mb4_unicode_ci from the table definition. Now explain shows that we're using the index:
mysql> EXPLAIN SELECT CONCAT(verb, ' - ', REPLACE(url, '.xml', '')) AS 'API Call',
    ->        COUNT(*) AS 'No. of API Calls',
    ->        AVG(ExecutionTime) AS 'Avg. Execution Time',
    ->        COUNT(DISTINCT AccountId) AS 'No. Of Accounts',
    ->        COUNT(DISTINCT ParentAccountId) AS 'No. Of Parents'
    -> FROM ApiLog
    -> WHERE ts BETWEEN '2017-10-01 00:00:00' AND '2017-12-31 23:59:59'
    -> GROUP BY verb_url_hash
    -> HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
The response time dropped to ~12 seconds! However, the index size is significantly larger compared to verb_url_hash alone (16 bytes per record).
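A rough worst-case estimate of the key-data size, using the key_len values EXPLAIN reported above (this ignores InnoDB page overhead and the fact that varchar keys store only their actual length, so real on-disk sizes will differ):

```python
ROWS = 22_000_000        # approximate table size from the post
HASH_KEY_LEN = 19        # key_len of the verb_url_hash index (EXPLAIN above)
COVERED_KEY_LEN = 3057   # key_len of covered_index (EXPLAIN above)

for name, key_len in [("verb_url_hash", HASH_KEY_LEN),
                      ("covered_index", COVERED_KEY_LEN)]:
    gib = key_len * ROWS / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of key data (worst case)")
```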
Conclusion

Generated columns in MySQL 5.7 provide a valuable way to improve query performance. If you have an interesting case, please share it in the comments.