Updating Human Gene Identifiers using Ensembl BioMart: A Step-by-Step Guide


Tags: genes, processing, database

GRCh38.p13 is the 13th patch release of the GRCh38 human reference genome assembly, published by the Genome Reference Consortium in February 2019. Patch releases keep the chromosome coordinates of GRCh38 unchanged; they add corrections to the sequence (fix patches) and additional alternate sequences (novel patches). Annotations of protein-coding genes, non-coding RNAs, and other features are not part of the assembly itself but are built on top of it by annotation projects such as Ensembl. GRCh38.p13 is the assembly behind Ensembl release 109, which is used in this guide; a newer patch release, GRCh38.p14, has since been published, so check which assembly your Ensembl release is based on.

To update the external_gene_name of human genes from the latest Ensembl database, using the ensembl_gene_id as the lookup key, you can use the BioMart tool provided by Ensembl. Here are the steps to follow:

  • Go to the Ensembl website (www.ensembl.org) and click on the "BioMart" link under the "Tools" section.

  • Choose "Ensembl Genes" as the database (it is listed with its release number, e.g., "Ensembl Genes 109") and then "Human genes" as the dataset (it is listed with its assembly version, e.g., GRCh38.p13).

  • Open "Attributes" and select the columns you want to retrieve. In this case, select "Gene stable ID" (internal name ensembl_gene_id) and "Gene name" (internal name external_gene_name).

  • Open "Filters" and restrict the query to your genes: under "Gene", choose the ID list input, select "Gene stable ID(s)" as the ID type, and enter the Ensembl gene IDs whose names you want to update.

  • Click on the "Results" button to generate the updated information.

  • Download the updated information in the desired format (e.g., CSV, TSV, or Excel).

  • Use the downloaded information to update the external_gene_name in your database or analysis pipeline.

Note that gene annotations change between Ensembl releases: a gene may have been renamed, merged with another gene, or retired. Check the returned names, and any IDs for which no row is returned, before overwriting your own data.
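The last step, applying the download to your own data, can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes the export is tab-separated with the header columns "Gene stable ID" and "Gene name", and that your own records are dictionaries with the keys ensembl_gene_id and external_gene_name. The function names and the example IDs and names are placeholders invented for this sketch.

```python
import csv
import io


def load_mapping(tsv_text):
    """Parse a BioMart TSV export into {gene stable ID: gene name}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping = {}
    for row in reader:
        name = row["Gene name"].strip()
        if name:  # skip genes for which the export has no name
            mapping[row["Gene stable ID"]] = name
    return mapping


def update_gene_names(records, mapping):
    """Return updated copies of the records, plus the changed and unmatched IDs."""
    updated, changed, missing = [], [], []
    for rec in records:
        gene_id = rec["ensembl_gene_id"]
        new_rec = dict(rec)  # leave the caller's record untouched
        if gene_id not in mapping:
            missing.append(gene_id)  # e.g. a retired ID: needs manual review
        elif mapping[gene_id] != rec["external_gene_name"]:
            new_rec["external_gene_name"] = mapping[gene_id]
            changed.append(gene_id)
        updated.append(new_rec)
    return updated, changed, missing


# Placeholder data (not real genes), standing in for the downloaded file
# and for your own table.
export_text = "Gene stable ID\tGene name\nGENE_ID_1\tNEWNAME1\nGENE_ID_2\tNAME2\n"
my_records = [
    {"ensembl_gene_id": "GENE_ID_1", "external_gene_name": "OLDNAME1"},
    {"ensembl_gene_id": "GENE_ID_2", "external_gene_name": "NAME2"},
    {"ensembl_gene_id": "GENE_ID_3", "external_gene_name": "NAME3"},
]

updated, changed, missing = update_gene_names(my_records, load_mapping(export_text))
print("changed:", changed)
print("not found in export:", missing)
```

Keeping the lists of changed and unmatched IDs separate makes the verification mentioned in the note above easier: renamed genes can be spot-checked, and IDs missing from the export are left as they were rather than silently overwritten.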

Here is a concrete example:

"DNAAF9" is an HGNC symbol. You can use the website below to translate all Ensembl gene IDs (for instance, the first column of your Excel table) to HGNC symbols in one batch.

To translate identifiers from different databases, follow these steps:

  • Open the website: http://www.ensembl.org/biomart/martview

  • Choose the database "Ensembl genes 109"

  • Select the dataset for your desired organism: Human genes (GRCh38.p13)

  • Go to "Filters" > "Gene:" > "Input external reference ID list"

  • Select the chosen source database: Gene stable ID(s)

  • Provide a list of IDs, delimited by newline: copy the first column of your results.

      #For example:
      ENSG00000088854
      ENSG00000226328
      ENSG00000086666
      ENSG00000215717
      ENSG00000168502
      ENSG00000223518
    
  • Go to "Attributes" > "Gene:"

    • Untick "Transcript stable ID"
    • Leave "Gene stable ID" ticked
    • Go to "External:" and tick "Gene name", "Gene description", "HGNC ID", and "HGNC symbol"

  • Click "Results" at the top left. This gives a preview that can be exported in various formats.
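The same query can also be sent without the web form. The following Python sketch builds a BioMart XML query and submits it to the martservice endpoint of the Ensembl site. It rests on some assumptions that are not part of the steps above: the internal names hsapiens_gene_ensembl (dataset), ensembl_gene_id (filter and attribute), external_gene_name, description, hgnc_id, and hgnc_symbol (attributes), and the helper names build_query and fetch_table, which are made up for this example. Note that www.ensembl.org serves the current release, which may be newer than release 109.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: the martservice path next to the martview page used above.
MART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(gene_ids, attributes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query that filters on a list of Ensembl gene IDs."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",       # include a header row in the result
        "uniqueRows": "1",   # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    ET.SubElement(ds, "Filter", {"name": "ensembl_gene_id",
                                 "value": ",".join(gene_ids)})
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def fetch_table(gene_ids, attributes):
    """Send the query with a GET request and return the TSV text (needs network)."""
    params = urllib.parse.urlencode({"query": build_query(gene_ids, attributes)})
    with urllib.request.urlopen(MART_URL + "?" + params, timeout=60) as resp:
        return resp.read().decode("utf-8")


gene_ids = ["ENSG00000088854", "ENSG00000226328", "ENSG00000086666"]
attributes = ["ensembl_gene_id", "external_gene_name",
              "description", "hgnc_id", "hgnc_symbol"]

# Print the query only; call fetch_table(gene_ids, attributes) to run it.
print(build_query(gene_ids, attributes))
```

A GET request is adequate for a short ID list like this one; for long lists the URL becomes too large, and splitting the IDs into batches is one way around that.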

The HGNC symbol and the gene name are related but distinct labels for a gene. The HGNC symbol is the short abbreviation approved for each human gene by the HUGO Gene Nomenclature Committee (HGNC), the body responsible for standardizing human gene nomenclature. Symbols are composed of uppercase letters and numbers, occasionally with a hyphen. For example, the HGNC symbol for the gene whose mutations cause cystic fibrosis is "CFTR".

The full gene name, on the other hand, is a longer, more descriptive name based on the gene's function, location, or other characteristics. For example, the full name of the cystic fibrosis gene is "cystic fibrosis transmembrane conductance regulator". In the BioMart export, this descriptive text appears in the "Gene description" column, whereas the "Gene name" column (external_gene_name) holds a short display symbol, which for most human genes is identical to the HGNC symbol.

While the symbol and the full name differ, both refer to the same gene. In general, the HGNC symbol is the form used in scientific publications and databases, while the full name is more common in writing for a general audience, where a self-explanatory label is preferred.
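Because the export from the example contains both a "Gene name" and an "HGNC symbol" column, a quick way to see how the two relate in your own gene list is to compare them row by row. The sketch below assumes a tab-separated export with the header columns "Gene stable ID", "Gene name", and "HGNC symbol"; the function name and the rows in the example are placeholders, not real genes.

```python
import csv
import io


def compare_name_and_symbol(tsv_text):
    """Sort gene IDs by whether 'Gene name' and 'HGNC symbol' agree."""
    report = {"same": [], "different": [], "no_hgnc_symbol": []}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        gene_id = row["Gene stable ID"]
        name = row["Gene name"].strip()
        symbol = row["HGNC symbol"].strip()
        if not symbol:
            report["no_hgnc_symbol"].append(gene_id)  # no HGNC symbol in export
        elif symbol == name:
            report["same"].append(gene_id)
        else:
            report["different"].append(gene_id)
    return report


# Placeholder rows standing in for a real export.
export_text = (
    "Gene stable ID\tGene name\tHGNC symbol\n"
    "GENE_ID_1\tSYMBOL1\tSYMBOL1\n"
    "GENE_ID_2\tNAME2\t\n"
    "GENE_ID_3\tNAME3\tSYMBOL3\n"
)
print(compare_name_and_symbol(export_text))
```

Genes that land in the "different" or "no_hgnc_symbol" groups are the ones worth checking by hand before you settle on a single column as the label in your analysis.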

© 2023 XGenes.com Impressum