Retrieving KEGG Genes Using Bioservices in Python

gene_x

Tags: python, Biopython, scripts

Biopython's KEGG support is limited to a thin wrapper around the KEGG REST API (the Bio.KEGG module), so the bioservices library is often a more convenient way to retrieve and interact with KEGG data. To fetch every gene in the KEGG database, you would have to iterate over each organism and collect its genes. That process takes a long time and is rarely efficient, because KEGG holds thousands of organisms and millions of genes.

Use the bioservices library to fetch the list of available organisms and retrieve genes for the first few organisms:

from bioservices import KEGG

# Initialize KEGG API
kegg_api = KEGG()

# Get the list of organisms.
# Each line of the response is tab-separated; the second field is the organism code (e.g. 'hsa').
organisms_raw = kegg_api.list("organism")
organisms = [entry.split("\t")[1] for entry in organisms_raw.split("\n") if entry]
# ['hsa', 'ptr', 'pps', 'ggo', 'pon', 'nle', 'hmh', 'mcc', 'mcf', 'mthb', 'mni', 'csab', 'caty', 'panu', 'tge', 'mleu', 'rro', 'rbb', 'tfn', 'pteh', 'cang', 'cjc', 'sbq', 'cimi', 'csyr', 'mmur', 'lcat', 'pcoq', 'oga', 'mmu', 'mcal', ... , 'loki', 'psyt', 'agw', 'arg']

# Limit the number of organisms and genes for demonstration purposes
organism_limit = 3
gene_limit = 10

# Iterate through the organisms
for organism in organisms[:organism_limit]:
    print(f"Organism: {organism}")

    # Get the list of genes for the current organism.
    # strip() drops the trailing newline so no empty entry is produced.
    genes = kegg_api.list(organism).strip().split("\n")[:gene_limit]

    # Iterate through the genes and print gene identifiers
    for gene_entry in genes:
        gene_id = gene_entry.split("\t")[0]
        print(f"Gene ID: {gene_id}")


# Example output (only the first organism is shown here):
#Organism: hsa
#Gene ID: hsa:102466751
#Gene ID: hsa:100302278
#Gene ID: hsa:79501
#Gene ID: hsa:112268260
#Gene ID: hsa:729759
#Gene ID: hsa:124904706
#Gene ID: hsa:105378947
#Gene ID: hsa:113219467
#Gene ID: hsa:81399
#Gene ID: hsa:148398

This code will print the gene identifiers of the first 10 genes for the first 3 organisms in the KEGG database. You can modify the organism_limit and gene_limit variables to change the number of organisms and genes processed.
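If you want more than the identifier from each line, the splitting logic can be pulled into a small helper. The sketch below is only an illustration: `parse_kegg_list` is a hypothetical name, and it assumes nothing about the response beyond what the code above already relies on, namely one entry per line with tab-separated fields and the identifier in the first field. The remaining fields are returned untouched, because their meaning differs between KEGG databases.

```python
def parse_kegg_list(raw_text, limit=None):
    """Turn tab-separated KEGG list output into (identifier, other_fields) pairs.

    Hypothetical helper for illustration. It assumes one entry per line and
    that the first tab-separated field is the identifier.
    """
    records = []
    for line in raw_text.strip().split("\n"):
        if not line:
            continue  # skip blank lines (e.g. an empty response)
        fields = line.split("\t")
        records.append((fields[0], fields[1:]))
    # A limit of None keeps everything; otherwise keep the first `limit` entries
    return records if limit is None else records[:limit]


# Made-up sample text in the same shape as a list response (not real KEGG data)
sample = "org:0001\tfirst description\norg:0002\tsecond description\n"
for identifier, other_fields in parse_kegg_list(sample):
    print(identifier, other_fields)
```

With real data you would pass the string returned by kegg_api.list(organism) instead of the sample text.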

Remember that fetching all genes from the KEGG database might take a significant amount of time and may not be efficient. It's usually more practical to focus on specific organisms or pathways of interest.
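As a rough sketch of that narrower approach, the helper below lists identifiers from a single database for one organism. `list_ids` is a hypothetical name, not part of bioservices. The sketch assumes that the client's list method accepts an optional organism keyword (as I understand the bioservices KEGG class to do) and that a failed request may come back as something other than a string, so check both against the version you have installed.

```python
def list_ids(kegg_client, database, limit=10, **list_kwargs):
    """Return up to `limit` identifiers from one KEGG database.

    Hypothetical helper. `kegg_client` is any object with a
    list(database, **kwargs) method that returns tab-separated text,
    such as a bioservices KEGG instance.
    """
    raw = kegg_client.list(database, **list_kwargs)
    # Assumption: a failed request may return a non-string value (for example
    # a status code), so treat anything that is not text as "no results".
    if not isinstance(raw, str):
        return []
    ids = [line.split("\t")[0] for line in raw.strip().split("\n") if line]
    return ids[:limit]
```

For example, list_ids(KEGG(), "hsa", limit=5) would request only the human gene list, and list_ids(KEGG(), "pathway", limit=5, organism="hsa") would, under the assumption above, list human pathways. Both calls need network access.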


The text provides a helpful workaround for users who want to access KEGG database information through Biopython. By suggesting the use of the bioservices library, the author offers a viable alternative for retrieving and interacting with KEGG data. The text also thoughtfully warns users that collecting all genes from the KEGG database may not be efficient due to the large number of organisms and genes involved. The prompt to use the bioservices library for fetching the list of available organisms and retrieving genes for the first few organisms is a practical and valuable starting point for users interested in working with KEGG data.

