當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

从Scanpy的Anndata对象提取信息并转成Seurat对象（适用于空间组且涉及h5文件读写）

發布時間：2024/1/18 编程问答 37 豆豆

生活随笔收集整理的這篇文章主要介紹了从Scanpy的Anndata对象提取信息并转成Seurat对象（适用于空间组且涉及h5文件读写）小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

關鍵字

Anndata對象轉成Seurat對象
h5文件讀寫
空間組格式轉換

已補充快速使用的函數整理版本，如果不想看細節可以直接看已整理好的版本。

適用背景

眾所周知，單細胞數據分析有兩大軟件：基于R語言的Seurat和基于Python的Scanpy，在平時的分析中常常需要把Seurat對象轉成Scanpy的Anndata對象，這已經有比較成熟的流程了。但是，如果反過來把Anndata對象轉成Seurat對象，網上搜到的方案寥寥無幾，而且在本人親測之下均報錯無法成功實現。再加上我需要轉的是空間組對象，結構比單細胞的更為復雜，只好自己想法從Anndata對象提取信息重新構建出一個Seurat對象了。
這個步驟主要分為2步：

步驟一從Scanpy的Anndata對象中提取信息

1提取矩陣

import os import sys import scanpy as sc import anndata as ad import numpy as np import pandas as pd import h5pyob1=sc.read('tmp.h5ad') mat=pd.DataFrame(data=ob1.X.todense(),index=ob1.obs_names,columns=ob1.var_names) mat.to_hdf("mat.h5","mat")

如果要在Python里重新讀取h5格式的矩陣，可以運行下面代碼：
mat=pd.read_hdf("mat.h5",key="mat")
上面的腳本是我測試過比較好的保存矩陣的方案，下面代碼塊則是最初的版本，但是在R里面讀入之后會缺失行名和列名，所以還要額外保存行名與列名，之后加上，特別麻煩，所以采用上面的代碼塊更為簡潔方便。

mat=ob1.X.todense().T with h5py.File('mat.h5','w') as f1:f1.create_dataset("mat",data=mat)##在Python里讀取h5格式矩陣的方法 with h5py.File('mat.h5','r') as f1:mat=f1['mat'][:]

為什么用h5保存矩陣？
理論上是可以轉成數據框之后保存為csv或tsv文件，但這樣保存很慢，讀取數據也很慢，因此存為h5文件更加方便。但對于小文件，h5文件反而占的空間更大，但h5文件應該是壓縮過的，就很奇怪，不太懂其中原理，但如果是大文件則能有效進行壓縮使得所占空間更小。

pd.DataFrame(data=ob1.X.todense().T, index=ob1.var_names,columns=ob1.obs_names).to_csv('raw_mat.csv',sep="\t",float_format='%.0f')

2 提取metadata信息

meta=pd.DataFrame(data=ob1.obs) meta.to_csv('metadata.tsv',sep="\t")

3 提取坐標信息（UMAP坐標或空間組坐標）

#保存UMAP坐標 cord=pd.DataFrame(data=ob1.obsm['X_umap'],index=ob1.obs_names,columns=['x','y']) cord.to_csv('position_X_umap.tsv',sep="\t") #保存空間坐標 cord_name='spatial' cord=pd.DataFrame(data=ob1.obsm[cord_name],index=ob1.obs_names,columns=['x','y']) cord.to_csv('position_'+cord_name+'.tsv',sep="\t")

提取完以上信息之后，就可以在R里面重建Seurat對象了。

步驟二重建Seurat對象

1 讀取上面提取的信息

library(rhdf5) library(Matrix) mydata <- h5read("mat.h5","mat") #如果在Python保存h5文件是用【with h5py.File】方式則直接運行上面一行代碼即可得到矩陣的值，但之后要手動添加行名和列名，下面的代碼不適用，但如果用的【to_hdf】則運行下面代碼即可。 #獲取矩陣值 mat <- mydata$block0_values #添加行名 rownames(mat) <- mydata$axis0 #添加列名 colnames(mat) <- mydata$axis1 #轉成稀疏矩陣 mat <- Matrix(mat, sparse = TRUE) #讀取metadata信息 cord_name='X_umap' meta <- read.table('metadata.tsv',sep="\t",header=T,row.names=1) pos <- read.table(paste0('position_',cord_name,'.tsv'),sep="\t",header=T,row.names=1)

2 重建單細胞Seurat對象

obj <- CreateSeuratObject(mat,assay='Spatial',meta.data=meta)

需要對seurat對象進行簡單聚類后才能添加UMAP坐標信息，可略過直接進行常規分析

obj <- seob_cluster(obj) obj@reductions$umap@cell.embeddings[,1]<-pos[,1] obj@reductions$umap@cell.embeddings[,2]<-pos[,2]

如果只是單細胞數據，以上步驟已經足夠了，但如果是重建一個Seurat空間組對象，其實也就是填補image的slice1槽，就需要運行以下腳本。

3 重建空間組Seurat對象

library(dplyr) library(data.table) library(Matrix) library(rjson) #以下腳本參考了部分華大生命科學研究院的余浩師兄和鄒軒軒師姐寫的腳本，在此表示感謝 tissue_lowres_image <- matrix(1, max(pos$y), max(pos$x)) tissue_positions_list <- data.frame(row.names = colnames(obj),tissue = 1,row = pos$y, col = pos$x,imagerow = pos$y, imagecol = pos$x) scalefactors_json <- toJSON(list(fiducial_diameter_fullres = 1,tissue_hires_scalef = 1,tissue_lowres_scalef = 1)) mat <- obj@assays$Spatial@countsseurat_spatialObj <- CreateSeuratObject(mat, project = 'Spatial', assay = 'Spatial',min.cells=5, min.features=5) generate_spatialObj <- function(image, scale.factors, tissue.positions, filter.matrix = TRUE) {if (filter.matrix) {tissue.positions <- tissue.positions[which(tissue.positions$tissue == 1), , drop = FALSE]}unnormalized.radius <- scale.factors$fiducial_diameter_fullres * scale.factors$tissue_lowres_scalefspot.radius <- unnormalized.radius / max(dim(image))return(new(Class = 'VisiumV1',image = image,scale.factors = scalefactors(spot = scale.factors$tissue_hires_scalef,fiducial = scale.factors$fiducial_diameter_fullres,hires = scale.factors$tissue_hires_scalef,lowres = scale.factors$tissue_lowres_scalef),coordinates = tissue.positions,spot.radius = spot.radius)) }spatialObj <- generate_spatialObj(image = tissue_lowres_image,scale.factors = fromJSON(scalefactors_json),tissue.positions = tissue_positions_list)spatialObj <- spatialObj[Cells(seurat_spatialObj)] DefaultAssay(spatialObj) <- 'Spatial' seurat_spatialObj[['slice1']] <- spatialObj

看一下新建的對象結構，說明已經是一個標準的Seurat空間組對象了：

> str(seurat_spatialObj@images$slice1) Formal class 'VisiumV1' [package "Seurat"] with 6 slots..@ image : num [1:6, 1:9] 1 1 1 1 1 1 1 1 1 1 .....@ scale.factors:List of 4.. ..$ spot : num 1.. ..$ fiducial: num 1.. ..$ hires : num 1.. ..$ lowres : num 1.. ..- attr(*, "class")= chr "scalefactors"..@ coordinates :'data.frame': 71668 obs. of 5 variables:.. ..$ tissue : num [1:71668] 1 1 1 1 1 1 1 1 1 1 ..... ..$ row : num [1:71668] -2.904 -2.379 -0.798 -2.162 -1.436 ..... ..$ col : num [1:71668] 5.84 5.58 8.63 7.08 7.99 ..... ..$ imagerow: num [1:71668] -2.904 -2.379 -0.798 -2.162 -1.436 ..... ..$ imagecol: num [1:71668] 5.84 5.58 8.63

7.08 7.99 ... ..@ spot.radius : num 0.111 ..@ assay : chr "Spatial" ..@ key : chr "slice1_"

可以往后面進行分析了。

小結與補充

希望Scanpy或Seurat官方能出一下相關函數吧，涉及到空間組數據時，常規的轉換流程也容易報錯，有很多bugs。

以下為艱辛的報錯心路歷程，可借鑒可忽略……

error1
網上一查，基本都是以下教程，很簡單嘛，然而……

> library(Seurat) library(SeuratDisk)Attaching SeuratObject > library(SeuratDisk) Registered S3 method overwritten by 'cli':method fromprint.boxx spatstat.geom Registered S3 method overwritten by 'SeuratDisk':method fromas.sparse.H5Group Seurat > Convert("cell_adata.h5ad",'h5SeuratWarning: Unknown file type: h5ad Error in H5File.open(filename, mode, file_create_pl, file_access_pl) :HDF5-API Errors:error #000: H5F.c in H5Fopen(): line 793: unable to open fileclass: HDF5major: File accessibilityminor: Unable to open fileerror #001: H5VLcallback.c in H5VL_file_open(): line 3500: open failedclass: HDF5major: Virtual Object Layerminor: Can't open objecterror #002: H5VLcallback.c in H5VL__file_open(): line 3465: open failedclass: HDF5major: Virtual Object Layerminor: Can't open objecterror #003: H5VLnative_file.c in H5VL__native_file_open(): line 100: unable to open fileclass: HDF5major: File accessibilityminor: Unable to open fileerror #004: H5Fint.c in H5F_open(): line 1622: unable to lock the fileclass: HDF5major: File accessibilityminor: Unable to open fileerror #005: H5FD.c in H5FD_lock(): line 1675: driver lock request failedclass: HDF5major: Virtual File Layer

error2
修復上面的bug后（export HDF5_USE_FILE_LOCKING=FALSE），還是報錯

> Convert("cell_adata.h5ad",'h5SeuraWarning: Unknown file type: h5ad Creating h5Seurat file for version 3.1.5.9900 Adding X as data Adding X as counts Adding meta.features from var Adding bbox as cell embeddings for bbox Adding contour as cell embeddings for contour Error: Not a matrix dataset

error3
根據某個教程，可以直接在Rstudio里使用Python，加載半天后直接報錯，可能是因為我用的是集群……

> library(reticulate) > scanpy <- import("scanpy") *** Error in `/share/app/R/4.0.2/lib64/R/bin/exec/R': free(): invalid pointer: 0x000000000aa5f3b8 *** ======= Backtrace: ========= /usr/lib64/libc.so.6(+0x81329)[0x7fb79efc4329]

error4
嘗試在Python導出矩陣后再用R讀入，R直接崩潰退出……

> fmat <- npyLoad("fmat.npy") > fmat <- npyLoad("fmat.npy")*** caught segfault *** address 0x7f3bf5770000, cause 'invalid permissions'Traceback:1: .External(list(name = "InternalFunction_invoke", address = <pointer: 0x421bb10>, dll = list(name = "Rcpp", path = "/jdfssz1/software/R4.0/lib/Rcpp/libs/Rcpp.so", dynamicLookup = TRUE, handle = <pointer: 0x957cc10>, info = <pointer: 0x1d5ba90>), numParameters = -1L), <pointer: 0x140526f0>, filename, type, dotranspose)2: npyLoad("/jdfssz3/fmat.npy")Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: 3 rm: cannot remove ‘/hwfssz5/tmp/RtmpRSe1IF’: Directory not empty

作者：黃甫一
鏈接：https://www.jianshu.com/p/8a81071571d9
來源：簡書
著作權歸作者所有。商業轉載請聯系作者獲得授權，非商業轉載請注明出處。

總結

以上是生活随笔為你收集整理的从Scanpy的Anndata对象提取信息并转成Seurat对象（适用于空间组且涉及h5文件读写）的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：整懵了，蚂蚁金服6面成功拿下offer涨
下一篇：被创业公司领导穿小鞋，一怒之下我拿了6个