Unveiling Difference in Promoter regions of MeDIP data

gene_x 0 like s 354 view s

Tags: R, pipeline

  1. Select Genes of Interest: Once Emilia provides the GO term, use a database like Gene Ontology or Bioconductor's "org.Hs.eg.db" package (assuming human genes) to fetch the list of genes associated with that GO term.

    # Using Bioconductor in R
    library(org.Mm.eg.db)
    genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    > genes
              GO EVIDENCE ONTOLOGY    SYMBOL
    1  GO:0048813      IDA       BP      Abi1
    2  GO:0048813      IGI       BP      Abl2
    3  GO:0048813      IMP       BP      Abl2
    4  GO:0048813      IMP       BP     Atp7a
    5  GO:0048813      IMP       BP   Cacna1a
    6  GO:0048813      IMP       BP    Camk2a
    7  GO:0048813      IMP       BP    Ctnna2
    8  GO:0048813      IMP       BP      Cdk5
    9  GO:0048813      IGI       BP     Dclk1
    10 GO:0048813      IGI       BP       Dcx
    11 GO:0048813      IMP       BP     Dscam
    12 GO:0048813      IMP       BP      Dvl1
    13 GO:0048813      IGI       BP      Fmn1
    14 GO:0048813      IMP       BP      Fmn1
    15 GO:0048813      IMP       BP       Fyn
    16 GO:0048813      IMP       BP      Hprt
    17 GO:0048813      IBA       BP      Sdc2
    18 GO:0048813      IMP       BP    Elavl4
    19 GO:0048813      IGI       BP     Itgb1
    20 GO:0048813      IMP       BP     Itgb1
    21 GO:0048813      IMP       BP      Lrp8
    22 GO:0048813      ISO       BP     Mef2a
    23 GO:0048813      IBA       BP      Map6
    24 GO:0048813      ISO       BP      Map6
    25 GO:0048813      IMP       BP   Slc11a2
    26 GO:0048813      IGI       BP      Rac1
    27 GO:0048813      IGI       BP     Rock2
    28 GO:0048813      IMP       BP    Sema3a
    29 GO:0048813      IMP       BP     Vldlr
    30 GO:0048813      IMP       BP     Mapk8
    31 GO:0048813      ISO       BP     Mink1
    32 GO:0048813      IMP       BP      Tet1
    33 GO:0048813      ISO       BP    Celsr2
    34 GO:0048813      IMP       BP   Cacna1f
    35 GO:0048813      IGI       BP     Dact1
    36 GO:0048813      IMP       BP     Dact1
    37 GO:0048813      IMP       BP  Mapk8ip2
    38 GO:0048813      ISO       BP     Trak1
    39 GO:0048813      IMP       BP      Rere
    40 GO:0048813      ISO       BP     Trak2
    41 GO:0048813      IBA       BP  Tmem106b
    42 GO:0048813      ISO       BP  Tmem106b
    43 GO:0048813      IMP       BP   Slitrk5
    44 GO:0048813      IMP       BP Kidins220
    45 GO:0048813      IMP       BP    Rbfox2
    46 GO:0048813      IMP       BP      Klf7
    47 GO:0048813      IMP       BP    Dtnbp1
    48 GO:0048813      IMP       BP     Prex2
    49 GO:0048813      IBA       BP    Dcdc2a
    50 GO:0048813      IGI       BP    Dcdc2a
    51 GO:0048813      ISO       BP     Farp1
    52 GO:0048813      ISO       BP      Lrp4
    53 GO:0048813      IBA       BP     Btbd3
    54 GO:0048813      IDA       BP     Btbd3
    55 GO:0048813      IBA       BP   Abitram
    56 GO:0048813      IMP       BP   Abitram
    57 GO:0048813      IMP       BP    Picalm
    58 GO:0048813      ISO       BP    Picalm
    
  2. Extract Promoter Regions: Typically, the promoter region can be defined as a certain number of base pairs (e.g., 1000-2000 bp) upstream of the transcription start site (TSS) of a gene. Tools like BEDTools can help you extract these regions from a genome reference.

    2.1 extract annotation.gtf for mouse. - UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: "Genes and Gene Predictions" track: "RefSeq Genes" (or another gene prediction track if you have a preference) output format: "GTF - gene transfer format" Click "get output", and you can then save the result as a .gtf file.

    • Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading. GENCODE:

    • Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      awk 'OFS="\t" {if ($3 == "gene") print $1, $4, $4+1, $10, $7, $12, $14}' gencode.vM25.annotation.gtf > tss.bed

      grep "protein_coding" tss.bed | wc -l #21859 records

    In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the "promoter region" can vary between studies and depends on the specific context.

    For the mouse genome (mm10), a commonly used range for the promoter region is:

    -2000 to +500 bp relative to the TSS: This defines a 2.5 kb region centered around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene. Some studies might define a narrower region, such as:

    -1000 to +200 bp relative to the TSS Or even a broader one:

    -5000 to +1000 bp relative to the TSS The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

    For many standard analyses, the -2000 to +500 bp window is a reasonable compromise that captures a significant portion of the regulatory elements without being overly broad. However, always consider the biological question and the specifics of your dataset when choosing the appropriate range for promoter regions.

    #I used the two methods to extract promoter regions, why the result is different as shown below.
    #promoters.bed:chrM      -498    2002    "ENSMUSG00000064336.1";
    #promoters_.bed:chrM     -999    1       "ENSMUSG00000064336.1";
    
    # Convert TSS positions to promoter regions. Here, we're considering a promoter to be 1kb upstream of the TSS ($7 is SYMBOL, $4 ENSEMBL-ID).
    # For genes on the '+' strand:
    awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    # For genes on the '-' strand:
    awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    # Combine the two promoter files.
    cat promoters_plus.bed promoters_minus.bed > promoters.bed
    
    # ERROR --> NEED to be DEBUGGED!
    # # Initialize an empty promoters.bed file
    # > promoters.bed
    # # Loop through each line of tss.bed
    # while IFS=$'\t' read -r chr start end name strand; do
    #     if [[ $strand == "+" ]]; then
    #         # Adjust for '+' strand genes, set minimum start to 0
    #         adjusted_start=$(( $start - 2000 > 0 ? $start - 2000 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$(($start+500))\t$name" >> promoters.bed
    #     else
    #         # Adjust for '-' strand genes using the `end` column as the TSS, set minimum start to 0
    #         adjusted_start=$(( $end - 2500 > 0 ? $end - 2500 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$(($end-1000))\t$name" >> promoters.bed
    #     fi
    # done < tss.bed
    
    # # Extract sequences for these promoter regions.
    # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
    
    # # Using bedtools. Assuming a BED file with TSS positions (tss.bed)
    # bedtools flank -i tss.bed -g mouse.genome -l 1000 -r 0 > promoters.bed
    
  3. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    #Control females:
    bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    #Control males:
    bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    #Stress females:
    bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    #Stress males:
    bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
    
  4. Parse and consolidate coverage files:

    # delete the 'chr positions '
    for file in ./promoters_coverage_*.txt; do
        awk '{$1=$2=$3=""; print substr($0, 4)}' $file > tmp.txt && mv tmp.txt $file
    done
    for file in ./promoters_coverage_*.txt; do
        sed 's/^ *//' $file > tmp.txt && mv tmp.txt $file
    done
    
    # text if they promoters_coverage_*.txt contain exactly the same set of genes (and in the same order),
    cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    for file in ./promoters_coverage_*.txt; do
        if ! cmp -s <(cut -d " " -f1 $file) reference_genes.txt; then
            echo "$file does not match reference!"
        fi
    done
    
    #merge the counts
    cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    for file in ./promoters_coverage_*.txt; do
        paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    done
    
    #add the sample names
    for file in ./promoters_coverage_*.txt; do
        basename $file | sed 's/promoters_coverage_//;s/.txt//'
    done > sample_names.txt
    echo -n "Gene " > header.txt
    paste -s -d ' ' sample_names.txt >> header.txt
    {
        cat header.txt
        cat read_counts_table.txt
    } > read_counts_table_with_header.txt
    #Manually replace "\n" in the first line, DELETE " and "; 
    #Delete the version of EnsembleID
    cp read_counts_table_with_header.txt temp
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt
    
    #grep "ENSMUSG00000104217.1" *.txt
    #Gene    1.0.1   1.0.2   1.0.3   1.1.2b  1.1.3   1.1.4   2.0.1b  2.0.2   2.0.3   2.1.1   2.1.2   2.1.3
    #promoters_coverage_1.0.1.txt:"ENSMUSG00000104217.1"; 9 347 2500 0.1388000
    #promoters_coverage_1.0.2.txt:"ENSMUSG00000104217.1"; 22 491 2500 0.1964000
    #promoters_coverage_1.0.3.txt:"ENSMUSG00000104217.1"; 18 525 2500 0.2100000
    #promoters_coverage_1.1.2b.txt:"ENSMUSG00000104217.1"; 5 255 2500 0.1020000
    #promoters_coverage_1.1.3.txt:"ENSMUSG00000104217.1"; 11 315 2500 0.1260000
    #promoters_coverage_1.1.4.txt:"ENSMUSG00000104217.1"; 8 300 2500 0.1200000
    #promoters_coverage_2.0.1b.txt:"ENSMUSG00000104217.1"; 18 375 2500 0.1500000
    #promoters_coverage_2.0.2.txt:"ENSMUSG00000104217.1"; 9 458 2500 0.1832000
    #promoters_coverage_2.0.3.txt:"ENSMUSG00000104217.1"; 8 430 2500 0.1720000
    #promoters_coverage_2.1.1.txt:"ENSMUSG00000104217.1"; 17 389 2500 0.1556000
    #promoters_coverage_2.1.2.txt:"ENSMUSG00000104217.1"; 17 428 2500 0.1712000
    #promoters_coverage_2.1.3.txt:"ENSMUSG00000104217.1"; 15 456 2500 0.1824000
    #read_counts_table.txt:"ENSMUSG00000104217.1";   9       22      18      5       11      8       18      9       8       17      17      15
    #read_counts_table_with_header.txt:"ENSMUSG00000104217.1"        9       22      18      5       11      8       18      9       8       17      17   15
    #reference_genes.txt:"ENSMUSG00000104217.1";
    
  5. Load into R and format for DESeq2:

    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_2")
    
    # Load the data
    count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    
    ## Define conditions (you can replace this with your experimental design)
    #conditions <- c(rep("control_female", 3), rep("control_male", 3), rep("stress_female", 3), rep("stress_male", 3))
    
    ## Create DESeq2 object
    #dds <- DESeqDataSetFromMatrix(countData = count_data, colData = data.frame(condition = conditions), design = ~condition)
    
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #55401-->41415
    #dds <- DESeq(dds)
    #rld <- rlogTransformation(dds)
    
    #dim(counts(dds))
    #head(counts(dds), 10)
    
    ## This is an example using DESeq2, and it assumes count data rather than coverage.
    ## Proper normalization might need more manual adjustments based on the nature of MeDIP-seq data.
    #library(DESeq2)
    #dds <- DESeqDataSetFromMatrix(countData = countMatrix, colData = sampleInfo, design = ~ condition)
    #dds <- DESeq(dds)
    #normalized_counts <- counts(dds, normalized=TRUE)
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
    
    #-------
    
    sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    row.names(sampleInfo) <- sampleInfo$sampleID
    
    library(DESeq2)
    dds <- DESeqDataSetFromMatrix(
      countData = count_data,
      colData = sampleInfo,
      design = ~ condition
    )
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
    
  6. differential analysis

    dds$condition <- relevel(dds$condition, "control_female")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_female_vs_control_female")
    
    dds$condition <- relevel(dds$condition, "control_male")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_male_vs_control_male")
    
    library(biomaRt)
    listEnsembl()
    listMarts()
    listDatasets(ensembl)
    
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    datasets <- listDatasets(ensembl)
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    for (i in clist) {
      #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = gene_ids_no_version, 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
      #merge by column by common colunmn name, in the case "GENEID"
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
  7. clustering the genes and draw heatmap (limiting the genes with ENSEMBL by giving GO-term)

    install.packages("gplots")
    library("gplots")
    
    # Combine the STEP 1
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090399", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090399.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090398", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090398.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:2000035", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:2000035.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0017145", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0017145.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002761", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002761.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002244", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002244.csv", genes)
    
    #write.csv(file="ids_", unique(genes$ENSEMBL))
    ##cut -d',' -f2 ids_ > ids
    ##add Gene_Id in the first line, delete the ""
    #GOI <- read.csv("ids")$Gene_Id  
    GOI <- unique(genes$ENSEMBL)
    RNASeq.NoCellLine <- assay(rld)
    # Defining the custom order
    #column_order <- c(
    #  "uninfected.2h_r1"
    #)
    #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    #head(RNASeq.NoCellLine_reordered)
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    GOI_clean <- GOI[GOI %in% rownames(RNASeq.NoCellLine) & !is.na(GOI)]
    #GOI_clean <- GOI_clean[GOI_clean != "ENSMUSG00000094658"]
    datamat <- RNASeq.NoCellLine[GOI_clean, ]
    #Remove Problematic Rows whose var is 0
    datamat <- datamat[!apply(datamat, 1, var) == 0, ]
    #datamat = RNASeq.NoCellLine[GOI, ]
    #rownames(datamat) <- genes$external_gene_name #DEBUG since genes has 58 records, datamat has only 48 records.
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.00)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    
    png("DEGs_heatmap.png", width=1000, height=1010)
    #labRow="",
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), margins=c(4,22),
                RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4)
    dev.off()
    
    # Extract rows from datamat where the row names match the identifiers in subset_1
    #### cluster members #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    #subset_2<-names(subset(mycl, mycl == '2'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    #subset_3<-names(subset(mycl, mycl == '3'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        #colnames(expression_values) <- colnames(data)
        colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
    
        # # Combine gene information and expression data
        # combined_result <- cbind(result, expression_values)
    
        # Fetch raw counts for the gene from count_data
        raw_counts <- count_data[gene, , drop = FALSE]
        colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
        # Combine gene information, expression data, and raw counts
        combined_result <- cbind(result, expression_values, raw_counts)
    
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o gene_clusters.xls
    
  8. Add pvalues of p-adjusted to the end of output using stress_female_vs_control_female-all.txt and stress_male_vs_control_male-all.txt

    #python3 run.py
    import pandas as pd
    
    # Load the male and female stress data
    female_df = pd.read_csv('stress_female_vs_control_female-all.txt')
    male_df = pd.read_csv('stress_male_vs_control_male-all.txt')
    
    # List of files to process
    files_to_process = [
        './raw_and_normalized_counts_GO:0090399.csv',
        './raw_and_normalized_counts_GO:0017145.csv',
        './raw_and_normalized_counts_GO:0090398.csv',
        './raw_and_normalized_counts_GO:0002761.csv',
        './raw_and_normalized_counts_GO:0002244.csv',
        './raw_and_normalized_counts_GO:2000035.csv'
    ]
    
    for filepath in files_to_process:
        # Load the raw data
        raw_df = pd.read_csv(filepath)
    
        # Extracting the pvalue and padj for female dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(female_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_female_stress_vs_control', 
                                        'padj': 'padj_female_stress_vs_control'})
    
        # Extracting the pvalue and padj for male dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(male_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_male_stress_vs_control', 
                                        'padj': 'padj_male_stress_vs_control'})
    
        # Drop the merged columns 'Unnamed: 0' from the final dataframe
        raw_df = raw_df.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y'])
    
        # Save the raw_df with the added pvalues and padj to a new CSV file
        output_filepath = filepath.replace('.csv', '_with_pvalues_and_padj.csv')
        raw_df.to_csv(output_filepath, index=False)
    
    #~/Tools/csv2xls-0.4/csv_to_xls.py raw_and_normalized*with_pvalues_and_padj.csv -d',' -o raw_and_normalized_with_pvalues_and_padj.xls
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum