Ensembl Annotation for Gene Expression Data

gene_x 0 like s 453 view s

Tags: R, human, genomics

The code provided is designed to annotate a list of genes by fetching associated gene information from the Ensembl database. Starting with the biomaRt library, it connects to the Ensembl dataset specifically tailored for human genes. Then, gene expression data is imported from a CSV file named "DEGs_heatmap_data.csv". The primary loop in the script iterates over each gene, retrieves detailed attributes such as gene name, gene biotype, chromosome position, and more, and then appends this to the initial data. To manage instances where the database returns multiple entries for a single gene, or none at all, there are conditions to handle such cases by either selecting the first entry or populating with 'NA' values respectively. As the loop progresses, for every 100 records, a progress message is printed to give users an indication of how much work has been completed. Once all genes are annotated, the enriched dataset is saved to a new CSV file named "annotated_output_file.csv".

# Load the necessary library
library(biomaRt)

# Connect to the Ensembl dataset
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")

# Read the data from your CSV file
data <- read.csv("DEGs_heatmap_data.csv", header=TRUE, row.names=1)

# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()

# Determine total number of genes
total_genes <- length(rownames(data))

# Loop through each gene to annotate
for (i in 1:total_genes) {
    gene <- rownames(data)[i]
    result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                    filters = 'ensembl_gene_id',
                    values = gene,
                    mart = ensembl)

    # If multiple rows are returned, take the first one
    if (nrow(result) > 1) {
        result <- result[1, ]
    }

    # Check if the result is empty
    if (nrow(result) == 0) {
        result <- data.frame(ensembl_gene_id = gene,
                             external_gene_name = NA,
                             gene_biotype = NA,
                             entrezgene_id = NA,
                             chromosome_name = NA,
                             start_position = NA,
                             end_position = NA,
                             strand = NA,
                             description = NA)
    }

    # Transpose expression values
    expression_values <- t(data.frame(t(data[gene, ])))
    colnames(expression_values) <- colnames(data)

    # Combine gene information and expression data
    combined_result <- cbind(result, expression_values)

    # Append to the final dataframe
    annotated_data <- rbind(annotated_data, combined_result)

    # Print progress every 100 genes
    if (i %% 100 == 0) {
        cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    }
}

# Save the annotated data to a new CSV file
write.csv(annotated_data, "annotated_output_file.csv", row.names=FALSE)

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum