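This article is reposted from my JavaEye blog: http://kylinsoong.javaeye.com/blog/731208
Cassandra has recently become a hot, cutting-edge topic. As the Internet keeps growing, it is inevitable that distributed databases receive more and more attention, and Cassandra's strengths in access efficiency, decentralized management, fault tolerance, and stability are hard for other distributed databases to match. So studying Cassandra is well worth the effort. I will approach it from the following angles: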
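1. Cassandra directory structure
Download the latest Cassandra from http://cassandra.apache.org/download/. After extraction, the directory structure is as follows:
As shown in the figure, from top to bottom:
bin holds the scripts for operating Cassandra, such as cassandra.bat; double-click it to start Cassandra
conf holds Cassandra's configuration files
interface holds Cassandra's Thrift interface definition file, which can be used to generate client code for various languages
javadoc holds Cassandra's API documentation
lib holds the libraries Cassandra depends on at run time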
2. Cassandra HelloWorld
Ever since learning C, HelloWorld has been a tradition, so let's write one first. The storage-conf.xml file in the apache-cassandra-0.6.4/conf directory holds all of Cassandra's configuration. Here are a few of the simpler settings:
XML code
<!-- A -->
<ClusterName>Test Cluster</ClusterName>

<!-- B -->
<Keyspaces>
    <Keyspace Name="Keyspace1">
    </Keyspace>
</Keyspaces>

<!-- C -->
<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
<DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
</DataFileDirectories>

<!-- D -->
<ThriftPort>9160</ThriftPort>
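As shown above:
A defines the name of the cluster (on a single-node setup this effectively identifies the node); a cluster can contain multiple keyspaces;
B declares that this cluster contains one keyspace, named Keyspace1; a keyspace is the counterpart of a database in a relational database;
C defines where Cassandra stores its data and its commit log. If these are left unchanged, starting Cassandra on Windows creates the folders shown above under the root of drive D;
D sets the Thrift RPC port to 9160.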
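To make the four settings concrete, here is a minimal Java sketch that reads them from an XML string with the JDK's DOM parser. The class name StorageConfReader, the SAMPLE constant, and the root element wrapped around the fragment are assumptions made for illustration; this is not how Cassandra itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StorageConfReader {
    // Assumption: a trimmed-down copy of the settings A to D, wrapped in one root element
    static final String SAMPLE =
          "<Storage>"
        + "<ClusterName>Test Cluster</ClusterName>"
        + "<Keyspaces><Keyspace Name=\"Keyspace1\"></Keyspace></Keyspaces>"
        + "<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>"
        + "<DataFileDirectories>"
        + "<DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>"
        + "</DataFileDirectories>"
        + "<ThriftPort>9160</ThriftPort>"
        + "</Storage>";

    // Parses an XML string into a DOM document; checked exceptions are wrapped
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag name
    static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    // Returns the Name attribute of the first <Keyspace> element
    static String firstKeyspaceName(Document doc) {
        Element keyspace = (Element) doc.getElementsByTagName("Keyspace").item(0);
        return keyspace.getAttribute("Name");
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("A cluster name : " + text(doc, "ClusterName"));
        System.out.println("B keyspace     : " + firstKeyspaceName(doc));
        System.out.println("C commit log   : " + text(doc, "CommitLogDirectory"));
        System.out.println("C data dir     : " + text(doc, "DataFileDirectory"));
        System.out.println("D thrift port  : " + Integer.parseInt(text(doc, "ThriftPort")));
    }
}
```

The port read here is the same 9160 that a client must pass when it opens its socket to the server.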
Start Cassandra with the default configuration file, then write the following code:
Java code
package com.tibco.cassandra;

import java.io.UnsupportedEncodingException;
import java.util.Date;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.InvalidRequestException;
import org.apache.cassandra.thrift.NotFoundException;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class CassandraHelloWorld {

    public static void main(String[] args) throws UnsupportedEncodingException, InvalidRequestException, UnavailableException, TimedOutException, TException, NotFoundException {
        //Part One: open a Thrift connection to the local node
        TTransport trans = new TSocket("127.0.0.1", 9160);
        TProtocol proto = new TBinaryProtocol(trans);
        Cassandra.Client client = new Cassandra.Client(proto);
        trans.open();

        //Part Two: insert two columns ("id" and "action") under one row key
        String keyspace = "Keyspace1";
        String cf = "Standard2";
        String key = "kylinsoong";
        long timestamp = new Date().getTime();
        ColumnPath path = new ColumnPath(cf);
        path.setColumn("id".getBytes("UTF-8"));
        client.insert(keyspace, key, path, "520".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);
        path.setColumn("action".getBytes("UTF-8"));
        client.insert(keyspace, key, path, "Hello, World, Cassandra!".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);

        //Part Three: read both columns back and print their values
        path.setColumn("id".getBytes("UTF-8"));
        ColumnOrSuperColumn cc = client.get(keyspace, key, path, ConsistencyLevel.ONE);
        Column c = cc.getColumn();
        String value = new String(c.value, "UTF-8");
        System.out.println(value);
        path.setColumn("action".getBytes("UTF-8"));
        ColumnOrSuperColumn cc2 = client.get(keyspace, key, path, ConsistencyLevel.ONE);
        Column c2 = cc2.getColumn();
        String value2 = new String(c2.value, "UTF-8");
        System.out.println(value2);

        //Part four: close the connection
        trans.close();
    }

}
Run the code and the great Hello, World appears before our eyes. Output:
520
Hello, World, Cassandra!
First, a brief analysis of the code, which I divide into four parts:
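Part One: connect to the database, much as with JDBC; here the connection to Cassandra goes through the Thrift RPC protocol;
Part Two: insert data into the database;
Part Three: read back the data just inserted;
Part Four: close the database connection.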
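One detail of the four parts is worth a sketch: in HelloWorld the close in Part Four only runs if Parts Two and Three succeed. The following Java sketch shows the usual try/finally shape that closes the connection either way. FakeTransport and withTransport are hypothetical stand-ins invented for this illustration so that it runs without a Cassandra server; they are not Thrift classes.

```java
class ConnectionLifecycleSketch {
    // Hypothetical stand-in for a Thrift transport; NOT the real TTransport class
    static class FakeTransport {
        boolean isOpen = false;

        void open() {
            isOpen = true;
        }

        void close() {
            isOpen = false;
        }
    }

    // Runs the given work between open() and close(), closing even when the work throws
    static void withTransport(FakeTransport trans, Runnable work) {
        trans.open();          // Part One: connect
        try {
            work.run();        // Parts Two and Three: insert, then read back
        } finally {
            trans.close();     // Part Four: always release the connection
        }
    }

    public static void main(String[] args) {
        FakeTransport trans = new FakeTransport();
        try {
            withTransport(trans, () -> {
                throw new IllegalStateException("simulated insert failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("work failed: " + e.getMessage());
        }
        System.out.println("transport still open? " + trans.isOpen);
    }
}
```

In the real program the same shape would mean calling trans.open(), doing the inserts and reads inside try, and moving trans.close() into finally.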
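3. Comparing storage efficiency with a relational database
Setting aside Cassandra's data model and clustering for now, let's first compare its storage efficiency with MySQL's by experiment. One explanation before the comparison:
The smallest storage unit in a relational database is the row, whereas in Cassandra it is the cell (I call it a "grid" here purely as a visual metaphor), as in the figure below;
The figure shows one row containing 4 cells;
A relational database can insert or read a whole row at once, whereas Cassandra, through the single-column insert and get calls used above, inserts or reads one cell at a time. In other words, inserting this row takes one INSERT statement in a relational database but four inserts in Cassandra. The relational database looks more efficient, but the actual results may shock you;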
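The row-versus-cell difference can be sketched with a toy in-memory store that counts write calls. This is only an illustration under stated assumptions: CellWriteSketch, insertCell, insertRow, and the column names c1 to c4 are hypothetical and have nothing to do with Cassandra's actual storage engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class CellWriteSketch {
    // row key -> (column name -> value); a toy stand-in, not Cassandra itself
    final Map<String, Map<String, String>> rows = new TreeMap<String, Map<String, String>>();
    int writeCalls = 0;

    // One call stores exactly one cell, like the single-column insert in HelloWorld
    void insertCell(String rowKey, String column, String value) {
        Map<String, String> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, String>();
            rows.put(rowKey, row);
        }
        row.put(column, value);
        writeCalls++;
    }

    // Writing a whole logical row therefore costs one call per cell
    void insertRow(String rowKey, Map<String, String> cells) {
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            insertCell(rowKey, cell.getKey(), cell.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<String, String>();
        cells.put("c1", "v1"); // hypothetical column names and values
        cells.put("c2", "v2");
        cells.put("c3", "v3");
        cells.put("c4", "v4");

        CellWriteSketch store = new CellWriteSketch();
        store.insertRow("row1", cells);
        System.out.println("cells stored: " + store.rows.get("row1").size());
        System.out.println("write calls : " + store.writeCalls);
    }
}
```

For the 4-cell row in the figure, the counter ends at 4, against a single INSERT statement on the relational side.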
Let's begin the experiment. Start a Cassandra server locally.
Create a database in MySQL and create the following table in it:
SQL code
create table test
(
    parseTime varchar(40) primary key,
    id varchar(40),
    creationTime varchar(40),
    globalInstanceId varchar(255),
    msg varchar(255),
    severity varchar(20),
    modelName varchar(255),
    rComponent varchar(255),
    rExecutionEnvironment varchar(255),
    sExecutionEnvironment varchar(255),
    sLocation varchar(255),
    msgId varchar(255)
);
This table has 12 columns; to keep things simple, every column is a string type.
Insert 68768 * 2 rows into this table and record the elapsed time. The test program's log output is as follows:
Mysql
--- 0 ----
Error
Error
Error
Error
Error
Error
Total add: 68768, spent time: 1654569
--- 1 ----
Error
Total add: 68768, spent time: 1687645
Total add: 137536, Error: 7, Spent Time: 3342214
Average: 24.3004
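Analyzing the log: 137536 rows were inserted into MySQL in total, 7 of which failed. The total time was 3342214 ms, a little under 56 minutes, and the log reports an average of 24.3004 ms per row;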
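The figures in the MySQL log can be checked with a few lines of arithmetic. The sketch below only uses numbers printed in the log; the class and method names are made up for illustration.

```java
class MysqlLogArithmetic {
    // Converts milliseconds to minutes
    static double minutes(long millis) {
        return millis / 60000.0;
    }

    // Average time per row in milliseconds
    static double averageMillis(long totalMillis, long rows) {
        return (double) totalMillis / rows;
    }

    public static void main(String[] args) {
        long round0 = 1654569L;          // "spent time" of round 0 in the log
        long round1 = 1687645L;          // "spent time" of round 1 in the log
        long total = round0 + round1;    // matches "Spent Time: 3342214"
        long rows = 68768L * 2;          // matches "Total add: 137536"

        System.out.println("total ms   = " + total);
        System.out.println("minutes    = " + minutes(total));
        System.out.println("ms per row = " + averageMillis(total, rows));
    }
}
```

The total comes to about 55.7 minutes and the per-row average to about 24.3 ms.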
Insert the same data into Cassandra, this time 68768 * 10 rows, and record the elapsed time. The program's log output is as follows:
Cassandra
--- 0 ----
Total add: 68768, spent time: 212047
--- 1 ----
Total add: 68768, spent time: 210518
--- 2 ----
Total add: 68768, spent time: 211602
--- 3 ----
Total add: 68768, spent time: 213543
--- 4 ----
Total add: 68768, spent time: 209558
--- 5 ----
Total add: 68768, spent time: 211302
--- 6 ----
Total add: 68768, spent time: 214699
--- 7 ----
Total add: 68768, spent time: 212685
--- 8 ----
Total add: 68768, spent time: 215412
--- 9 ----
Total add: 68768, spent time: 218858
Total: 687680
Time: 2130224
Average:
    Insert one key: 3.0977
    Insert one column: 0.2581
Analyzing the log: 687680 rows were inserted into Cassandra (687680 * 12 individual column inserts in practice), taking 2130224 ms, a little over 35 minutes. No insert errors occurred, which suggests that Cassandra was more stable than MySQL in this test. Inserting one row took only 3.0977 ms, and a single column insert took 0.2581 ms;
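Comparing the two logs leads to the following conclusions:
Five times as many rows were inserted into Cassandra as into MySQL, yet Cassandra's total time was lower;
Per row inserted, Cassandra was about 8 times as efficient as MySQL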
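The two conclusions follow directly from the logged numbers. The sketch below recomputes them; all inputs are taken from the two logs, and the class and method names are illustrative only.

```java
class CassandraLogArithmetic {
    // "spent time" of the ten Cassandra rounds, in milliseconds, as printed in the log
    static final long[] ROUNDS = {
        212047L, 210518L, 211602L, 213543L, 209558L,
        211302L, 214699L, 212685L, 215412L, 218858L
    };

    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Average time per row key in milliseconds
    static double perKeyMillis(long totalMillis, long keys) {
        return (double) totalMillis / keys;
    }

    public static void main(String[] args) {
        long cassandraKeys = 68768L * 10;                   // 687680 rows
        long cassandraTotal = sum(ROUNDS);                  // matches "Time: 2130224"
        double perKey = perKeyMillis(cassandraTotal, cassandraKeys);
        double perColumn = perKey / 12;                     // 12 columns per row

        long mysqlRows = 137536L;                           // from the MySQL log
        double mysqlPerRow = 24.3004;                       // "Average" in the MySQL log

        System.out.println("Cassandra total ms  = " + cassandraTotal);
        System.out.println("ms per key          = " + perKey);
        System.out.println("ms per column       = " + perColumn);
        System.out.println("row count ratio     = " + (double) cassandraKeys / mysqlRows);
        System.out.println("per-row speed ratio = " + mysqlPerRow / perKey);
    }
}
```

The row count ratio is exactly 5, and the per-row speed ratio comes to roughly 7.8, which the text rounds to 8.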
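Now for another experiment: create one more test table in the database, as follows:
SQL code
create table time
(
    id varchar(20)
);
Yes, this table has only one column; the aim is to make Cassandra and MySQL more directly comparable.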
As before, first the MySQL log output:
Mysql
Add 100 000 keys, Spent time: 2477828
Average: 24.7783
Analyzing the log: inserting 100 000 rows into MySQL took 2477828 ms, about 41 minutes, or 24.7783 ms per insert;
Now the Cassandra log output:
Cassandra
Add 100 000 keys, Spent time: 25281
Average: 0.2528
Analyzing the log: inserting the same 100 000 rows into Cassandra took 25281 ms, or 0.2528 ms per insert;
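Comparing the two logs:
With the same number of rows inserted (100 000), MySQL took 98 times as long as Cassandra;
Per operation, MySQL took 98 times as long as Cassandra;
Conclusion: in these tests, Cassandra's write efficiency was roughly 100 times that of MySQL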
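As a quick check of the factor of 98, the sketch below divides the two logged totals and also expresses them as throughput. Only the two "Spent time" values and the row count come from the logs; the class and method names are illustrative.

```java
class SingleColumnComparison {
    // How many times longer the slower run took
    static double ratio(long slowerMillis, long fasterMillis) {
        return (double) slowerMillis / fasterMillis;
    }

    // Throughput in rows per second
    static double rowsPerSecond(long rows, long millis) {
        return rows * 1000.0 / millis;
    }

    public static void main(String[] args) {
        long rows = 100000L;
        long mysqlMillis = 2477828L;     // from the MySQL log
        long cassandraMillis = 25281L;   // from the Cassandra log

        System.out.println("time ratio          = " + ratio(mysqlMillis, cassandraMillis));
        System.out.println("MySQL rows/sec      = " + rowsPerSecond(rows, mysqlMillis));
        System.out.println("Cassandra rows/sec  = " + rowsPerSecond(rows, cassandraMillis));
    }
}
```

The ratio is just over 98, which corresponds to roughly 40 rows per second for MySQL against close to 4000 for Cassandra in this single-machine test.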
4. Cassandra data model
Twitter uses Cassandra for data storage, so I will use Twitter's storage model as the example for explaining Cassandra's data model. First modify the storage-conf.xml configuration file from above as follows:
XML code
<Storage>
    <ClusterName>Kylin-PC</ClusterName>
    <Keyspaces>
        <Keyspace Name="Twitter">
            <ColumnFamily CompareWith="UTF8Type" Name="Statuses" />
            <ColumnFamily CompareWith="UTF8Type" Name="StatusAudits" />
            <ColumnFamily CompareWith="UTF8Type" Name="StatusRelationships" CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
            <ColumnFamily CompareWith="UTF8Type" Name="Users" />
            <ColumnFamily CompareWith="UTF8Type" Name="UserRelationships" CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
        </Keyspace>
    </Keyspaces>
</Storage>
The keyspace above is the definition of Twitter's real data storage model. It contains 5 ColumnFamilies. Compared with MySQL, a keyspace corresponds to a database and a ColumnFamily corresponds to a table in that database;
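ClusterName in the configuration above names the Cassandra cluster, here Kylin-PC. In this setup the cluster is a single node instance (logically one Cassandra server, typically one PC), and it can hold multiple keyspaces;
Below I use examples to explain Cassandra's data model from the following aspects: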
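(1) ColumnFamily
As shown in the figure, a ColumnFamily contains multiple rows. As mentioned above, a ColumnFamily is the counterpart of a table in a relational database. Each row consists of a key supplied by the client and a set of columns associated with that key, and each column has a name, a value, and a timestamp. Note that different rows need not contain the same columns;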
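The structure just described can be pictured as a map of maps. The following Java sketch is a toy in-memory model, not Cassandra's implementation: ColumnFamilySketch and its methods are hypothetical names, values are plain strings rather than bytes, and the timestamps are made-up counters.

```java
import java.util.Map;
import java.util.TreeMap;

class ColumnFamilySketch {
    // The three parts of a column named in the text: name, value, timestamp
    static class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> (column name -> column); each row keeps its own set of columns
    final Map<String, Map<String, Column>> rows = new TreeMap<String, Map<String, Column>>();

    void insert(String rowKey, String name, String value, long timestamp) {
        Map<String, Column> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, Column>();
            rows.put(rowKey, row);
        }
        row.put(name, new Column(name, value, timestamp));
    }

    // Returns the column, or null when the row or the column does not exist
    Column get(String rowKey, String name) {
        Map<String, Column> row = rows.get(rowKey);
        return row == null ? null : row.get(name);
    }

    public static void main(String[] args) {
        ColumnFamilySketch users = new ColumnFamilySketch();
        users.insert("kylin", "id", "101", 1L);          // timestamps are made-up counters
        users.insert("kylin", "name", "Kylin Soong", 2L);
        users.insert("kobe", "id", "101", 3L);
        users.insert("kobe", "name", "Kobe Bryant", 4L);
        users.insert("kobe", "age", "32", 5L);

        for (Map.Entry<String, Map<String, Column>> row : users.rows.entrySet()) {
            System.out.println(row.getKey() + " -> " + row.getValue().keySet());
        }
    }
}
```

The two rows end up with different column sets, which a fixed relational schema would not allow without NULL-filled columns.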
Modify the HelloWorld program above; the modified code is as follows:
Java code
public static void main(String[] args) throws Exception {
    TTransport trans = new TSocket("127.0.0.1", 9160);
    TProtocol proto = new TBinaryProtocol(trans);
    Cassandra.Client client = new Cassandra.Client(proto);
    trans.open();

    String keyspace = "Twitter";
    String columnFamily = "Users";
    ColumnPath path = new ColumnPath(columnFamily);

    // Row "kylin": two columns (id, name)
    String row1 = "kylin";
    path.setColumn("id".getBytes());
    client.insert(keyspace, row1, path, "101".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);
    path.setColumn("name".getBytes());
    client.insert(keyspace, row1, path, "Kylin Soong".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);

    // Row "kobe": four columns (id, name, age, desc)
    String row2 = "kobe";
    path.setColumn("id".getBytes());
    client.insert(keyspace, row2, path, "101".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);
    path.setColumn("name".getBytes());
    client.insert(keyspace, row2, path, "Kobe Bryant".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);
    path.setColumn("age".getBytes());
    client.insert(keyspace, row2, path, "32".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);
    path.setColumn("desc".getBytes());
    client.insert(keyspace, row2, path, "Five NBA title, One regular season MVP, Two Final Games MVP".getBytes(), new Date().getTime(), ConsistencyLevel.ONE);

    // Read back row "kylin"
    path.setColumn("id".getBytes());
    ColumnOrSuperColumn cos11 = client.get(keyspace, row1, path, ConsistencyLevel.ONE);
    path.setColumn("name".getBytes());
    ColumnOrSuperColumn cos12 = client.get(keyspace, row1, path, ConsistencyLevel.ONE);
    Column c11 = cos11.getColumn();
    Column c12 = cos12.getColumn();
    System.out.println(new String(c11.getValue()) + " | " + new String(c12.getValue()));

    // Read back row "kobe"
    path.setColumn("id".getBytes());
    ColumnOrSuperColumn cos21 = client.get(keyspace, row2, path, ConsistencyLevel.ONE);
    path.setColumn("name".getBytes());
    ColumnOrSuperColumn cos22 = client.get(keyspace, row2, path, ConsistencyLevel.ONE);
    path.setColumn("age".getBytes());
    ColumnOrSuperColumn cos23 = client.get(keyspace, row2, path, ConsistencyLevel.ONE);
    path.setColumn("desc".getBytes());
    ColumnOrSuperColumn cos24 = client.get(keyspace, row2, path, ConsistencyLevel.ONE);
    Column c21 = cos21.getColumn();
    Column c22 = cos22.getColumn();
    Column c23 = cos23.getColumn();
    Column c24 = cos24.getColumn();
    System.out.println(new String(c21.getValue()) + " | " + new String(c22.getValue()) + " | " + new String(c23.getValue()) + " | " + new String(c24.getValue()));

    trans.close();
}
The code above adds 2 rows to the ColumnFamily named "Users". The first row contains 2 columns, named id and name; the second row contains 4 columns, named id, name, age, and desc. Running the code gives:
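101 | Kylin Soong
101 | Kobe Bryant | 32 | Five NBA title, One regular season MVP, Two Final Games MVP
(2) SuperColumn
A SuperColumn contains multiple columns. Below we use a program to add data to SuperColumns and read it back. First look at the figure below:
As shown in the figure, the ColumnFamily has 2 rows, each row has 2 SuperColumns, and each SuperColumn contains multiple columns. We will reproduce the scenario in the figure in code; for simplicity, the two rows are reduced to one;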
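Compared with a plain ColumnFamily, a super ColumnFamily adds one more level of nesting: row key, then SuperColumn name, then column name. The sketch below pictures that with nested sorted maps. It is a toy model under stated assumptions: SuperColumnSketch and its methods are hypothetical, and the name-sorted TreeMap merely stands in for whatever comparator the server is configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SuperColumnSketch {
    // row key -> super column name -> column name -> value
    final Map<String, Map<String, Map<String, String>>> rows =
            new TreeMap<String, Map<String, Map<String, String>>>();

    void insert(String rowKey, String superColumn, String column, String value) {
        Map<String, Map<String, String>> row = rows.get(rowKey);
        if (row == null) {
            row = new TreeMap<String, Map<String, String>>();
            rows.put(rowKey, row);
        }
        Map<String, String> columns = row.get(superColumn);
        if (columns == null) {
            columns = new TreeMap<String, String>();   // keeps column names sorted
            row.put(superColumn, columns);
        }
        columns.put(column, value);
    }

    // Column names of one super column, in the order the sorted map stores them
    List<String> columnNames(String rowKey, String superColumn) {
        return new ArrayList<String>(rows.get(rowKey).get(superColumn).keySet());
    }

    public static void main(String[] args) {
        SuperColumnSketch cf = new SuperColumnSketch();
        cf.insert("row", "super1", "id", "101");
        cf.insert("row", "super1", "name", "Kylin Soong");
        cf.insert("row", "super2", "id", "101");
        cf.insert("row", "super2", "name", "Kobe Bryant");
        cf.insert("row", "super2", "age", "32");
        cf.insert("row", "super2", "desc", "Five NBA title, One regular season MVP, Two Final Games MVP");

        System.out.println("super1: " + cf.columnNames("row", "super1"));
        System.out.println("super2: " + cf.columnNames("row", "super2"));
    }
}
```

Because the inner map is sorted by name, the columns of super2 come back as age, desc, id, name rather than in insertion order, so positional access has to account for that ordering.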
Modify the HelloWorld code as follows:
Java code
// Besides the HelloWorld imports, this version needs java.util.ArrayList, java.util.HashMap,
// java.util.List, java.util.Map and org.apache.cassandra.thrift.SuperColumn
public static void main(String[] args) throws Exception {
    TTransport trans = new TSocket("127.0.0.1", 9160);
    TProtocol proto = new TBinaryProtocol(trans);
    Cassandra.Client client = new Cassandra.Client(proto);
    trans.open();

    String keyspace = "Twitter";
    String columnFamily = "UserRelationships";
    String row = "row";

    // SuperColumn "super1" with two columns (id, name)
    Map<String, List<ColumnOrSuperColumn>> cfmap = new HashMap<String, List<ColumnOrSuperColumn>>();
    List<ColumnOrSuperColumn> cslist = new ArrayList<ColumnOrSuperColumn>();
    ColumnOrSuperColumn cos = new ColumnOrSuperColumn();
    List<Column> columnList = new ArrayList<Column>();
    Column id = new Column();
    id.setName("id".getBytes());
    id.setValue("101".getBytes());
    id.setTimestamp(new Date().getTime());
    Column name = new Column();
    name.setName("name".getBytes());
    name.setValue("Kylin Soong".getBytes());
    name.setTimestamp(new Date().getTime());
    columnList.add(id);
    columnList.add(name);
    SuperColumn sc = new SuperColumn();
    sc.setColumns(columnList);
    sc.setName("super1".getBytes());
    cos.super_column = sc;
    cslist.add(cos);
    cfmap.put(columnFamily, cslist);

    // SuperColumn "super2" with four columns (id, name, age, desc)
    Map<String, List<ColumnOrSuperColumn>> cfmap2 = new HashMap<String, List<ColumnOrSuperColumn>>();
    List<ColumnOrSuperColumn> cslist2 = new ArrayList<ColumnOrSuperColumn>();
    ColumnOrSuperColumn cos2 = new ColumnOrSuperColumn();
    List<Column> columnList2 = new ArrayList<Column>();
    Column id2 = new Column();
    id2.setName("id".getBytes());
    id2.setValue("101".getBytes());
    id2.setTimestamp(new Date().getTime());
    Column name2 = new Column();
    name2.setName("name".getBytes());
    name2.setValue("Kobe Bryant".getBytes());
    name2.setTimestamp(new Date().getTime());
    Column age = new Column();
    age.setName("age".getBytes());
    age.setValue("32".getBytes());
    age.setTimestamp(new Date().getTime());
    Column desc = new Column();
    desc.setName("desc".getBytes());
    desc.setValue("Five NBA title, One regular season MVP, Two Final Games MVP".getBytes());
    desc.setTimestamp(new Date().getTime());
    columnList2.add(id2);
    columnList2.add(name2);
    columnList2.add(age);
    columnList2.add(desc);
    SuperColumn sc2 = new SuperColumn();
    sc2.setColumns(columnList2);
    sc2.setName("super2".getBytes());
    cos2.super_column = sc2;
    cslist2.add(cos2);
    cfmap2.put(columnFamily, cslist2);

    // Write both SuperColumns into the same row
    client.batch_insert(keyspace, row, cfmap, ConsistencyLevel.ONE);
    client.batch_insert(keyspace, row, cfmap2, ConsistencyLevel.ONE);

    // Read back "super1"
    ColumnPath path = new ColumnPath(columnFamily);
    path.setSuper_column("super1".getBytes());
    ColumnOrSuperColumn s = client.get(keyspace, row, path, ConsistencyLevel.ONE);
    System.out.println(new String(s.super_column.columns.get(0).value) + " | " + new String(s.super_column.columns.get(1).value));

    // Read back "super2"; the indexes 2, 3, 0, 1 give id, name, age, desc
    // when the subcolumns are returned sorted by name (age, desc, id, name)
    path.setSuper_column("super2".getBytes());
    ColumnOrSuperColumn s2 = client.get(keyspace, row, path, ConsistencyLevel.ONE);
    System.out.println(new String(s2.super_column.columns.get(2).value) + " | " + new String(s2.super_column.columns.get(3).value) + " | " + new String(s2.super_column.columns.get(0).value) + " | " + new String(s2.super_column.columns.get(1).value));

    trans.close();
}
The code above adds one row to the ColumnFamily named "UserRelationships". The row contains two SuperColumns, named super1 and super2; super1 contains 2 columns, named id and name; super2 contains 4 columns, named id, name, age, and desc. Output:
101 | Kylin Soong
101 | Kobe Bryant | 32 | Five NBA title, One regular season MVP, Two Final Games MVP
(3) Column
As sections (1) and (2) show, the column is Cassandra's smallest unit of storage. Its structure is as follows:
Thrift code
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
(4) Keyspace
In sections (1) and (2), String keyspace = "Twitter"; selects the keyspace named "Twitter"; a keyspace is the counterpart of a schema or database in a relational database.
Reposted from: https://my.oschina.net/iwuyang/blog/197182