Unveiling Difference in Promoter regions of MeDIP data 2

gene_x 0 like s 892 view s

Tags: R, NGS, pipeline

  1. extract all genes by giving GO-terms

    1. # # Using Bioconductor in R
    2. # library(org.Mm.eg.db)
    3. # genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    4. # > genes
  2. extract annotation.gtf for mouse.

    • (Optional) UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: "Genes and Gene Predictions" track: "RefSeq Genes" (or another gene prediction track if you have a preference) output format: "GTF - gene transfer format" Click "get output", and you can then save the result as a .gtf file.

    • (Optional) Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading. GENCODE:

    • (Used) Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      awk 'OFS="\t" {if ($3 == "gene") print $1, $4, $4+1, $10, $7, $12, $14}' gencode.vM25.annotation.gtf > tss_.bed #55401 (Note not under kate)

      grep "protein_coding" tss_.bed > tss.bed #21859 records

  3. In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the "promoter region" can vary between studies and depends on the specific context. For the mouse genome (mm10), a commonly used range for the promoter region is:

    • (Used) -2000 to +500 bp relative to the TSS: This defines a 2.5 kb region centered around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene. Some studies might define a narrower region, such as:

    • (Optional) -1000 to +200 bp relative to the TSS Or even a broader one:

    • (Optional) -5000 to +1000 bp relative to the TSS The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

  4. Convert TSS positions to promoter regions (Note that $7 is SYMBOL, $4 ENSEMBL-ID).

    1. # For genes on the '+' strand:
    2. awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    3. # For genes on the '-' strand:
    4. awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    5. # Combine the two promoter files.
    6. cat promoters_plus.bed promoters_minus.bed > promoters.bed
    7. # #(Optional) Extract sequences for these promoter regions.
    8. # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
  5. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    1. #Control females:
    2. bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    3. bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    4. bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    5. #Control males:
    6. bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    7. bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    8. bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    9. #Stress females:
    10. bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    11. bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    12. bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    13. #Stress males:
    14. bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    15. bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    16. bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
  6. Parse and consolidate coverage files:

    1. # delete the 'chr positions '
    2. for file in ./promoters_coverage_*.txt; do
    3. awk '{$1=$2=$3=""; print substr($0, 4)}' $file > tmp.txt && mv tmp.txt $file
    4. done
    5. for file in ./promoters_coverage_*.txt; do
    6. sed 's/^ *//' $file > tmp.txt && mv tmp.txt $file
    7. done
    8. # text if they promoters_coverage_*.txt contain exactly the same set of genes (and in the same order),
    9. cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    10. for file in ./promoters_coverage_*.txt; do
    11. if ! cmp -s <(cut -d " " -f1 $file) reference_genes.txt; then
    12. echo "$file does not match reference!"
    13. fi
    14. done
    15. #merge the counts
    16. cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    17. for file in ./promoters_coverage_*.txt; do
    18. paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    19. done
    20. #add the sample names
    21. for file in ./promoters_coverage_*.txt; do
    22. basename $file | sed 's/promoters_coverage_//;s/.txt//'
    23. done > sample_names.txt
    24. echo -n "Gene " > header.txt
    25. paste -s -d ' ' sample_names.txt >> header.txt
    26. {
    27. cat header.txt
    28. cat read_counts_table.txt
    29. } > read_counts_table_with_header.txt
    30. #MANUALLY replace "\n" in the first line, DELETE " and ";
    31. #DELETE the version of EnsembleID with the following command: 21860-1 records remaining.
    32. cp read_counts_table_with_header.txt temp
    33. awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt
    34. library(DESeq2)
    35. setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_3")
    36. # Load the data
    37. count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    38. sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    39. row.names(sampleInfo) <- sampleInfo$sampleID
    40. dds <- DESeqDataSetFromMatrix(
    41. countData = count_data,
    42. colData = sampleInfo,
    43. design = ~ condition
    44. )
    45. # Plot PCA
    46. dds <- DESeq(dds)
    47. rld <- rlogTransformation(dds)
    48. png("pca.png", 1200, 800)
    49. plotPCA(rld, intgroup=c("condition"))
    50. dev.off()
    51. # Differential analysis
    52. dds$condition <- relevel(dds$condition, "control_female")
    53. dds = DESeq(dds, betaPrior=FALSE)
    54. resultsNames(dds)
    55. clist <- c("stress_female_vs_control_female")
    56. dds$condition <- relevel(dds$condition, "control_male")
    57. dds = DESeq(dds, betaPrior=FALSE)
    58. resultsNames(dds)
    59. clist <- c("stress_male_vs_control_male")
    60. library(biomaRt)
    61. listEnsembl()
    62. listMarts()
    63. listDatasets(ensembl)
    64. ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    65. datasets <- listDatasets(ensembl)
    66. listEnsemblArchives()
    67. attributes = listAttributes(ensembl)
    68. attributes[1:25,]
    69. for (i in clist) {
    70. #i<-clist[1]
    71. contrast = paste("condition", i, sep="_")
    72. res = results(dds, name=contrast)
    73. res <- res[!is.na(res$log2FoldChange),]
    74. gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
    75. geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
    76. filters = 'ensembl_gene_id',
    77. values = gene_ids_no_version,
    78. mart = ensembl)
    79. geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
    80. #merge by column by common colunmn name, in the case "GENEID"
    81. res$ENSEMBL = rownames(res)
    82. identical(rownames(res), geness_uniq$ensembl_gene_id)
    83. res_df <- as.data.frame(res)
    84. geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
    85. dim(geness_res)
    86. rownames(geness_res) <- geness_res$ensembl_gene_id
    87. geness_res$ensembl_gene_id <- NULL
    88. write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
    89. up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
    90. down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
    91. write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    92. write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    93. }
    94. #~/Tools/csv2xls-0.4/csv_to_xls.py stress_female_vs_control_female-all.txt stress_female_vs_control_female-up.txt stress_female_vs_control_female-down.txt -d',' -o stress_female_vs_control_female.xls
    95. #~/Tools/csv2xls-0.4/csv_to_xls.py stress_male_vs_control_male-all.txt stress_male_vs_control_male-up.txt stress_male_vs_control_male-down.txt -d',' -o stress_male_vs_control_male.xls
    96. # Clustering the genes and draw heatmap for DEGs (limiting the genes with ENSEMBL by giving GO-term or give all genes via the file 'ids_')
    97. #install.packages("gplots")
    98. library("gplots")
    99. #(Optional) cat *.id | sort -u > ids #add Gene_Id in the first line, delete the ""
    100. #(Used) cut -d$'\t' -f1 read_counts_table_with_header.txt > ids #Rename the first line to Gene_Id
    101. GOI <- read.csv("ids")$Gene_Id #21859 genes
    102. RNASeq.NoCellLine <- assay(rld)
    103. # Defining the custom order
    104. #column_order <- c(
    105. # "uninfected.2h_r1", "uninfected.2h_r2", "uninfected.3h_r3",
    106. # "uninfected.3d_r1", "uninfected.3d_r2",
    107. # "1457-M10.2h_r1", "1457-M10.2h_r2", "1457-M10.2h_r3","1457.2h_r1", "1457.2h_r2", "1457.2h_r3",
    108. # "1457-M10.3d_r1", "1457-M10.3d_r2", "1457-M10.3d_r3","1457.3d_r1", "1457.3d_r2"
    109. #)
    110. #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    111. #head(RNASeq.NoCellLine_reordered)
    112. #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman
    113. datamat = RNASeq.NoCellLine[GOI, ]
    114. write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    115. hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    116. hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    117. mycl = cutree(hr, h=max(hr$height)/1.01)
    118. mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED", "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    119. mycol = mycol[as.vector(mycl)]
    120. png("DEGs_heatmap.png", width=1000, height=1010)
    121. heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
    122. scale='row',trace='none',col=bluered(75), margins=c(4,22),
    123. RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4, labRow="")
    124. dev.off()
    125. # Generate the table for cluster members
    126. subset_1<-names(subset(mycl, mycl == '1'))
    127. data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #212
    128. subset_2<-names(subset(mycl, mycl == '2'))
    129. data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #185
    130. subset_3<-names(subset(mycl, mycl == '3'))
    131. data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #173
    132. # Take all
    133. data <- as.data.frame(datamat) #21859
    134. # Initialize an empty data frame for the annotated data
    135. annotated_data <- data.frame()
    136. # Determine total number of genes
    137. total_genes <- length(rownames(data))
    138. # Loop through each gene to annotate
    139. for (i in 1:total_genes) {
    140. gene <- rownames(data)[i]
    141. result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
    142. filters = 'ensembl_gene_id',
    143. values = gene,
    144. mart = ensembl)
    145. # If multiple rows are returned, take the first one
    146. if (nrow(result) > 1) {
    147. result <- result[1, ]
    148. }
    149. # Check if the result is empty
    150. if (nrow(result) == 0) {
    151. result <- data.frame(ensembl_gene_id = gene,
    152. external_gene_name = NA,
    153. gene_biotype = NA,
    154. entrezgene_id = NA,
    155. chromosome_name = NA,
    156. start_position = NA,
    157. end_position = NA,
    158. strand = NA,
    159. description = NA)
    160. }
    161. # Transpose expression values
    162. expression_values <- t(data.frame(t(data[gene, ])))
    163. #colnames(expression_values) <- colnames(data)
    164. colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
    165. # # Combine gene information and expression data
    166. # combined_result <- cbind(result, expression_values)
    167. # Fetch raw counts for the gene from count_data
    168. raw_counts <- count_data[gene, , drop = FALSE]
    169. colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
    170. # Combine gene information, expression data, and raw counts
    171. combined_result <- cbind(result, expression_values, raw_counts)
    172. # Append to the final dataframe
    173. annotated_data <- rbind(annotated_data, combined_result)
    174. # Print progress every 100 genes
    175. if (i %% 100 == 0) {
    176. cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    177. }
    178. }
    179. # Save the annotated data to a new CSV file
    180. write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    181. write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    182. write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    183. write.csv(annotated_data, "annotated.csv", row.names=FALSE)
    184. #~/Tools/csv2xls-0.4/csv_to_xls.py annotated.csv -d',' -o raw_and_normalized_counts.xls

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum