Processing Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606

gene_x 0 like s 154 view s

Tags: pipeline, RNA-seq

  1. Targets

    1. Could you please assist me with processing RNA-seq data? The reference genome is CP059040. I aim to analyze the data using PCA, a Venn diagram, and KEGG and GO annotation enrichment analysis.
    2. The samples are labeled as follows (where 'x' indicates the replicate number):
    3. LB-AB-x
    4. LB-IJ-x
    5. LB-W1-x
    6. LB-WT19606-x
    7. LB-Y1-x
    8. Mac-AB-x
    9. Mac-IJ-x
    10. Mac-W1-x
    11. Mac-WT19606-x
    12. Mac-Y1-x
  2. Download the raw data

    1. ./lnd login -u X101SC25015922-Z02-J002 -p m*********5
    2. ./lnd list
    3. ./lnd cp -d oss:// ./
    4. ./lnd cp oss://CP2024102300053 . #Error
    5. ./lnd list oss://CP2024102300053
    6. ./lnd cp -d oss://CP2024102300053/H101SC25015922/RSMR00204 .
    7. #CP2024102300053/H101SC25015922/RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002
  3. Prepare raw data

    1. mkdir raw_data; cd raw_data
    2. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_1.fq.gz LB-AB-r1_R1.fq.gz
    3. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_2.fq.gz LB-AB-r1_R2.fq.gz
    4. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_1.fq.gz LB-AB-r2_R1.fq.gz
    5. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_2.fq.gz LB-AB-r2_R2.fq.gz
    6. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_1.fq.gz LB-AB-r3_R1.fq.gz
    7. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_2.fq.gz LB-AB-r3_R2.fq.gz
    8. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_1.fq.gz LB-IJ-r1_R1.fq.gz
    9. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_2.fq.gz LB-IJ-r1_R2.fq.gz
    10. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_1.fq.gz LB-IJ-r2_R1.fq.gz
    11. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_2.fq.gz LB-IJ-r2_R2.fq.gz
    12. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_1.fq.gz LB-IJ-r4_R1.fq.gz
    13. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_2.fq.gz LB-IJ-r4_R2.fq.gz
    14. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_1.fq.gz LB-W1-r1_R1.fq.gz
    15. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_2.fq.gz LB-W1-r1_R2.fq.gz
    16. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_1.fq.gz LB-W1-r2_R1.fq.gz
    17. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_2.fq.gz LB-W1-r2_R2.fq.gz
    18. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_1.fq.gz LB-W1-r3_R1.fq.gz
    19. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_2.fq.gz LB-W1-r3_R2.fq.gz
    20. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_1.fq.gz LB-WT19606-r2_R1.fq.gz
    21. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_2.fq.gz LB-WT19606-r2_R2.fq.gz
    22. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_1.fq.gz LB-WT19606-r3_R1.fq.gz
    23. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_2.fq.gz LB-WT19606-r3_R2.fq.gz
    24. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_1.fq.gz LB-WT19606-r4_R1.fq.gz
    25. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_2.fq.gz LB-WT19606-r4_R2.fq.gz
    26. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_1.fq.gz LB-Y1-r2_R1.fq.gz
    27. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_2.fq.gz LB-Y1-r2_R2.fq.gz
    28. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_1.fq.gz LB-Y1-r3_R1.fq.gz
    29. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_2.fq.gz LB-Y1-r3_R2.fq.gz
    30. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_1.fq.gz LB-Y1-r4_R1.fq.gz
    31. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_2.fq.gz LB-Y1-r4_R2.fq.gz
    32. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_1.fq.gz Mac-AB-r1_R1.fq.gz
    33. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_2.fq.gz Mac-AB-r1_R2.fq.gz
    34. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_1.fq.gz Mac-AB-r2_R1.fq.gz
    35. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_2.fq.gz Mac-AB-r2_R2.fq.gz
    36. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_1.fq.gz Mac-AB-r3_R1.fq.gz
    37. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_2.fq.gz Mac-AB-r3_R2.fq.gz
    38. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_1.fq.gz Mac-IJ-r1_R1.fq.gz
    39. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_2.fq.gz Mac-IJ-r1_R2.fq.gz
    40. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_1.fq.gz Mac-IJ-r2_R1.fq.gz
    41. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_2.fq.gz Mac-IJ-r2_R2.fq.gz
    42. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_1.fq.gz Mac-IJ-r4_R1.fq.gz
    43. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_2.fq.gz Mac-IJ-r4_R2.fq.gz
    44. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_1.fq.gz Mac-W1-r1_R1.fq.gz
    45. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_2.fq.gz Mac-W1-r1_R2.fq.gz
    46. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_1.fq.gz Mac-W1-r2_R1.fq.gz
    47. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_2.fq.gz Mac-W1-r2_R2.fq.gz
    48. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_1.fq.gz Mac-W1-r3_R1.fq.gz
    49. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_2.fq.gz Mac-W1-r3_R2.fq.gz
    50. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_1.fq.gz Mac-WT19606-r2_R1.fq.gz
    51. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_2.fq.gz Mac-WT19606-r2_R2.fq.gz
    52. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_1.fq.gz Mac-WT19606-r3_R1.fq.gz
    53. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_2.fq.gz Mac-WT19606-r3_R2.fq.gz
    54. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_1.fq.gz Mac-WT19606-r4_R1.fq.gz
    55. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_2.fq.gz Mac-WT19606-r4_R2.fq.gz
    56. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_1.fq.gz Mac-Y1-r2_R1.fq.gz
    57. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_2.fq.gz Mac-Y1-r2_R2.fq.gz
    58. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_1.fq.gz Mac-Y1-r3_R1.fq.gz
    59. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_2.fq.gz Mac-Y1-r3_R2.fq.gz
    60. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_1.fq.gz Mac-Y1-r4_R1.fq.gz
    61. ln -s ../RSMR00204/X101SC25015922-Z02/X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_2.fq.gz Mac-Y1-r4_R2.fq.gz
  4. Preparing the directory trimmed

    1. mkdir trimmed trimmed_unpaired;
    2. for sample_id in LB-AB-r1 LB-AB-r2 LB-AB-r3 LB-IJ-r1 LB-IJ-r2 LB-IJ-r4 LB-W1-r1 LB-W1-r2 LB-W1-r3 LB-WT19606-r2 LB-WT19606-r3 LB-WT19606-r4 LB-Y1-r2 LB-Y1-r3 LB-Y1-r4 Mac-AB-r1 Mac-AB-r2 Mac-AB-r3 Mac-IJ-r1 Mac-IJ-r2 Mac-IJ-r4 Mac-W1-r1 Mac-W1-r2 Mac-W1-r3 Mac-WT19606-r2 Mac-WT19606-r3 Mac-WT19606-r4 Mac-Y1-r2 Mac-Y1-r3 Mac-Y1-r4; do
    3. java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    4. done
  5. Preparing samplesheet.csv

    1. sample,fastq_1,fastq_2,strandedness
    2. LB-AB-r1,LB-AB-r1_R1.fq.gz,LB-AB-r1_R2.fq.gz,auto
    3. LB-AB-r2,LB-AB-r2_R1.fq.gz,LB-AB-r2_R2.fq.gz,auto
    4. LB-AB-r3,LB-AB-r3_R1.fq.gz,LB-AB-r3_R2.fq.gz,auto
    5. LB-IJ-r1,LB-IJ-r1_R1.fq.gz,LB-IJ-r1_R2.fq.gz,auto
    6. LB-IJ-r2,LB-IJ-r2_R1.fq.gz,LB-IJ-r2_R2.fq.gz,auto
    7. LB-IJ-r4,LB-IJ-r4_R1.fq.gz,LB-IJ-r4_R2.fq.gz,auto
    8. LB-W1-r1,LB-W1-r1_R1.fq.gz,LB-W1-r1_R2.fq.gz,auto
    9. LB-W1-r2,LB-W1-r2_R1.fq.gz,LB-W1-r2_R2.fq.gz,auto
    10. LB-W1-r3,LB-W1-r3_R1.fq.gz,LB-W1-r3_R2.fq.gz,auto
    11. LB-WT19606-r2,LB-WT19606-r2_R1.fq.gz,LB-WT19606-r2_R2.fq.gz,auto
    12. LB-WT19606-r3,LB-WT19606-r3_R1.fq.gz,LB-WT19606-r3_R2.fq.gz,auto
    13. LB-WT19606-r4,LB-WT19606-r4_R1.fq.gz,LB-WT19606-r4_R2.fq.gz,auto
    14. LB-Y1-r2,LB-Y1-r2_R1.fq.gz,LB-Y1-r2_R2.fq.gz,auto
    15. LB-Y1-r3,LB-Y1-r3_R1.fq.gz,LB-Y1-r3_R2.fq.gz,auto
    16. LB-Y1-r4,LB-Y1-r4_R1.fq.gz,LB-Y1-r4_R2.fq.gz,auto
    17. Mac-AB-r1,Mac-AB-r1_R1.fq.gz,Mac-AB-r1_R2.fq.gz,auto
    18. Mac-AB-r2,Mac-AB-r2_R1.fq.gz,Mac-AB-r2_R2.fq.gz,auto
    19. Mac-AB-r3,Mac-AB-r3_R1.fq.gz,Mac-AB-r3_R2.fq.gz,auto
    20. Mac-IJ-r1,Mac-IJ-r1_R1.fq.gz,Mac-IJ-r1_R2.fq.gz,auto
    21. Mac-IJ-r2,Mac-IJ-r2_R1.fq.gz,Mac-IJ-r2_R2.fq.gz,auto
    22. Mac-IJ-r4,Mac-IJ-r4_R1.fq.gz,Mac-IJ-r4_R2.fq.gz,auto
    23. Mac-W1-r1,Mac-W1-r1_R1.fq.gz,Mac-W1-r1_R2.fq.gz,auto
    24. Mac-W1-r2,Mac-W1-r2_R1.fq.gz,Mac-W1-r2_R2.fq.gz,auto
    25. Mac-W1-r3,Mac-W1-r3_R1.fq.gz,Mac-W1-r3_R2.fq.gz,auto
    26. Mac-WT19606-r2,Mac-WT19606-r2_R1.fq.gz,Mac-WT19606-r2_R2.fq.gz,auto
    27. Mac-WT19606-r3,Mac-WT19606-r3_R1.fq.gz,Mac-WT19606-r3_R2.fq.gz,auto
    28. Mac-WT19606-r4,Mac-WT19606-r4_R1.fq.gz,Mac-WT19606-r4_R2.fq.gz,auto
    29. Mac-Y1-r2,Mac-Y1-r2_R1.fq.gz,Mac-Y1-r2_R2.fq.gz,auto
    30. Mac-Y1-r3,Mac-Y1-r3_R1.fq.gz,Mac-Y1-r3_R2.fq.gz,auto
    31. Mac-Y1-r4,Mac-Y1-r4_R1.fq.gz,Mac-Y1-r4_R2.fq.gz,auto
    32. #mv trimmed/* .
  6. nextflow run

    1. #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    2. #docker pull nfcore/rnaseq
    3. ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    4. # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    5. (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_m.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
  7. Import data and pca-plot

    1. #mamba activate r_env
    2. #install.packages("ggfun")
    3. # Import the required libraries
    4. library("AnnotationDbi")
    5. library("clusterProfiler")
    6. library("ReactomePA")
    7. library(gplots)
    8. library(tximport)
    9. library(DESeq2)
    10. #library("org.Hs.eg.db")
    11. library(dplyr)
    12. library(tidyverse)
    13. #install.packages("devtools")
    14. #devtools::install_version("gtable", version = "0.3.0")
    15. library(gplots)
    16. library("RColorBrewer")
    17. #install.packages("ggrepel")
    18. library("ggrepel")
    19. # install.packages("openxlsx")
    20. library(openxlsx)
    21. library(EnhancedVolcano)
    22. library(DESeq2)
    23. library(edgeR)
    24. setwd("~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon")
    25. # Define paths to your Salmon output quantification files
    26. files <- c("LB-AB_r1" = "./LB-AB-r1/quant.sf",
    27. "LB-AB_r2" = "./LB-AB-r2/quant.sf",
    28. "LB-AB_r3" = "./LB-AB-r3/quant.sf",
    29. "LB-IJ_r1" = "./LB-IJ-r1/quant.sf",
    30. "LB-IJ_r2" = "./LB-IJ-r2/quant.sf",
    31. "LB-IJ_r4" = "./LB-IJ-r4/quant.sf",
    32. "LB-W1_r1" = "./LB-W1-r1/quant.sf",
    33. "LB-W1_r2" = "./LB-W1-r2/quant.sf",
    34. "LB-W1_r3" = "./LB-W1-r3/quant.sf",
    35. "LB-WT19606_r2" = "./LB-WT19606-r2/quant.sf",
    36. "LB-WT19606_r3" = "./LB-WT19606-r3/quant.sf",
    37. "LB-WT19606_r4" = "./LB-WT19606-r4/quant.sf",
    38. "LB-Y1_r2" = "./LB-Y1-r2/quant.sf",
    39. "LB-Y1_r3" = "./LB-Y1-r3/quant.sf",
    40. "LB-Y1_r4" = "./LB-Y1-r4/quant.sf",
    41. "Mac-AB_r1" = "./Mac-AB-r1/quant.sf",
    42. "Mac-AB_r2" = "./Mac-AB-r2/quant.sf",
    43. "Mac-AB_r3" = "./Mac-AB-r3/quant.sf",
    44. "Mac-IJ_r1" = "./Mac-IJ-r1/quant.sf",
    45. "Mac-IJ_r2" = "./Mac-IJ-r2/quant.sf",
    46. "Mac-IJ_r4" = "./Mac-IJ-r4/quant.sf",
    47. "Mac-W1_r1" = "./Mac-W1-r1/quant.sf",
    48. "Mac-W1_r2" = "./Mac-W1-r2/quant.sf",
    49. "Mac-W1_r3" = "./Mac-W1-r3/quant.sf",
    50. "Mac-WT19606_r2" = "./Mac-WT19606-r2/quant.sf",
    51. "Mac-WT19606_r3" = "./Mac-WT19606-r3/quant.sf",
    52. "Mac-WT19606_r4" = "./Mac-WT19606-r4/quant.sf",
    53. "Mac-Y1_r2" = "./Mac-Y1-r2/quant.sf",
    54. "Mac-Y1_r3" = "./Mac-Y1-r3/quant.sf",
    55. "Mac-Y1_r4" = "./Mac-Y1-r4/quant.sf")
    56. # Import the transcript abundance data with tximport
    57. txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    58. # Define the replicates and condition of the samples
    59. #replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
    60. #adeA and adeB encode a membrane fusion protein that is part of the AdeABC efflux pump, which contributes to multidrug resistance.
    61. #System: Part of the AdeIJK efflux pump, which includes: adeI — membrane fusion protein, adeJ — RND transporter, adeK — outer membrane factor
    62. condition <- factor(c("LB-AB","LB-AB","LB-AB", "LB-IJ","LB-IJ","LB-IJ", "LB-W1","LB-W1","LB-W1","LB-WT19606","LB-WT19606","LB-WT19606","LB-Y1","LB-Y1","LB-Y1","Mac-AB","Mac-AB","Mac-AB","Mac-IJ","Mac-IJ","Mac-IJ","Mac-W1","Mac-W1","Mac-W1","Mac-WT19606","Mac-WT19606","Mac-WT19606","Mac-Y1","Mac-Y1","Mac-Y1"))
    63. # Define the colData for DESeq2
    64. colData <- data.frame(condition=condition, row.names=names(files))
    65. # -- transcript-level count data (x2) --
    66. # Create DESeqDataSet object
    67. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    68. write.csv(counts(dds), file="transcript_raw_counts.csv")
    69. # -- gene-level count data (x2) --
    70. # Read in the tx2gene map from salmon_tx2gene.tsv
    71. tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    72. # Set the column names
    73. colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    74. # Remove the gene_name column if not needed
    75. tx2gene <- tx2gene[,1:2]
    76. # Import and summarize the Salmon data with tximport
    77. txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    78. # Continue with the DESeq2 workflow as before...
    79. colData <- data.frame(condition=condition, row.names=names(files))
    80. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    81. #dds <- dds[rowSums(counts(dds) > 3) > 2, ] #3796->????
    82. write.csv(counts(dds, normalized=FALSE), file="gene_raw_counts.csv")
    83. write.xlsx(as.data.frame(counts(dds, normalized=FALSE)), "gene_raw_counts.xlsx")
    84. # ---- (Optional) split the factos media and strain from condition (for comparison Mac vs LB) ----
    85. # AdeIJK vs. AdeABC Efflux Pumps
    86. # * AdeIJK is the "housekeeping" pump — always active, broadly expressed, contributing to background resistance.
    87. # * AdeABC is the "emergency" pump — induced under stress or mutations, more potent in contributing to clinical multidrug resistance.
    88. #LB = Luria-Bertani broth (a standard rich growth medium)
    89. #Mac = MacConkey agar or broth (selective for Gram-negative bacteria)
    90. # - Growth medium Media or Condition, GrowthMedium
    91. # - Bacterial strain/genotype Strain or Isolate, Genotype, SampleType
    92. media <- factor(c("LB","LB","LB", "LB","LB","LB", "LB","LB","LB","LB","LB","LB","LB","LB","LB","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac","Mac"))
    93. strain <- factor(c("AB","AB","AB", "IJ","IJ","IJ", "W1","W1","W1","WT19606","WT19606","WT19606","Y1","Y1","Y1","AB","AB","AB","IJ","IJ","IJ","W1","W1","W1","WT19606","WT19606","WT19606","Y1","Y1","Y1"))
    94. # Define the colData for DESeq2
    95. colData <- data.frame(media=media, strain=strain, row.names=names(files))
    96. # -- transcript-level count data (x2) --
    97. # Create DESeqDataSet object
    98. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~media+strain)
    99. #write.csv(counts(dds), file="transcript_counts_media_strain.csv") #check correctness, it should be identical to transcript_counts.csv
    100. # -- gene-level count data (x2) --
    101. # Read in the tx2gene map from salmon_tx2gene.tsv
    102. tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    103. # Set the column names
    104. colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    105. # Remove the gene_name column if not needed
    106. tx2gene <- tx2gene[,1:2]
    107. # Import and summarize the Salmon data with tximport
    108. txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    109. # Continue with the DESeq2 workflow as before...
    110. colData <- data.frame(media=media, strain=strain, row.names=names(files))
    111. dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~media+strain)
    112. #dds <- dds[rowSums(counts(dds) > 3) > 2, ] #3796->????
    113. #write.csv(counts(dds, normalized=FALSE), file="gene_counts_media_strain.csv") #check correctness, it should be identical to gene_counts.csv
    114. dim(counts(dds))
    115. head(counts(dds), 10)
    116. rld <- rlogTransformation(dds)
    117. # Save rlog-transformed counts
    118. rlog_counts <- assay(rld)
    119. write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")
    120. #CPM=每个基因的原始计数/该样本的总读数×10^6
    121. # 假设 counts 是一个矩阵或 data.frame
    122. # 行:基因,列:样本
    123. # 计算每个样本的总读数
    124. #total_counts <- colSums(counts)
    125. # 计算 CPM
    126. #cpm <- t( t(counts) / total_counts ) * 1e6
    127. counts_data <- as.data.frame(counts(dds)) # extract raw counts matrix
    128. cpm_counts <- cpm(counts_data)
    129. write.xlsx(cpm_counts, "gene_cpm_counts.xlsx")
    130. # -- pca --
    131. png("pca2.png", 1200, 800)
    132. plotPCA(rld, intgroup=c("condition"))
    133. dev.off()
    134. # -- heatmap --
    135. png("heatmap2.png", 1200, 800)
    136. distsRL <- dist(t(assay(rld)))
    137. mat <- as.matrix(distsRL)
    138. hc <- hclust(distsRL)
    139. hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    140. heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    141. dev.off()
    142. # -- pca_media_strain --
    143. png("pca_media.png", 1200, 800)
    144. plotPCA(rld, intgroup=c("media"))
    145. dev.off()
    146. png("pca_strain.png", 1200, 800)
    147. plotPCA(rld, intgroup=c("strain"))
    148. dev.off()
  8. (Optional; ERROR-->need to be debugged!) ) estimate size factors and dispersion values.

    1. #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
    2. #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
    3. #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    4. sizeFactors(dds)
    5. #NULL
    6. # Estimate size factors
    7. dds <- estimateSizeFactors(dds)
    8. # Estimate dispersions
    9. dds <- estimateDispersions(dds)
    10. #> sizeFactors(dds)
    11. #control_r1 control_r2 HSV.d2_r1 HSV.d2_r2 HSV.d4_r1 HSV.d4_r2 HSV.d6_r1
    12. #2.3282468 2.0251928 1.8036883 1.3767551 0.9341929 1.0911693 0.5454526
    13. #HSV.d6_r2 HSV.d8_r1 HSV.d8_r2
    14. #0.4604461 0.5799834 0.6803681
    15. # (DEBUG) If avgTxLength is Necessary
    16. #To simplify the computation and ensure sizeFactors are calculated:
    17. assays(dds)$avgTxLength <- NULL
    18. dds <- estimateSizeFactors(dds)
    19. sizeFactors(dds)
    20. #If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
    21. dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
    22. sizeFactors(dds)
    23. # If alone with virus data, the following BUG occured:
    24. #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
    25. HeLa_TO_r1 HeLa_TO_r2
    26. 0.9978755 1.1092227
    27. data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    28. #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    29. 1/0.9978755=1.002129023
    30. 1/1.1092227=
    31. #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor --effectiveGenomeSize 2864785220
    32. bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023 --effectiveGenomeSize 2864785220
    33. bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor 0.901532217 --effectiveGenomeSize 2864785220
    34. raw_counts <- counts(dds)
    35. normalized_counts <- counts(dds, normalized=TRUE)
    36. #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
    37. #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    38. #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    39. estimSf <- function (cds){
    40. # Get the count matrix
    41. cts <- counts(cds)
    42. # Compute the geometric mean
    43. geomMean <- function(x) prod(x)^(1/length(x))
    44. # Compute the geometric mean over the line
    45. gm.mean <- apply(cts, 1, geomMean)
    46. # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    47. gm.mean[gm.mean == 0] <- NA
    48. # Divide each line by its corresponding geometric mean
    49. # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    50. # MARGIN: 1 or 2 (line or columns)
    51. # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    52. # FUN: the function to be applied
    53. cts <- sweep(cts, 1, gm.mean, FUN="/")
    54. # Compute the median over the columns
    55. med <- apply(cts, 2, median, na.rm=TRUE)
    56. # Return the scaling factor
    57. return(med)
    58. }
    59. #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
    60. #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
    61. #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
    62. #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
    63. #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
    64. #DESeq2’s median of ratios [1]
    65. #EdgeR’s trimmed mean of M values (TMM) [2]
    66. #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html #very good website!
    67. test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
    68. sum(test_normcount != normalized_counts)
  9. Select the differentially expressed genes

    1. #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
    2. #https://www.biostars.org/p/282295/
    3. #https://www.biostars.org/p/335751/
    4. #> dds$condition
    5. #LB-AB LB-IJ LB-W1 LB-WT19606 LB-Y1 Mac-AB Mac-IJ Mac-W1 Mac-WT19606 Mac-Y1
    6. #CONSOLE: mkdir star_salmon/degenes
    7. setwd("degenes")
    8. #---- relevel to control ----
    9. dds$condition <- relevel(dds$condition, "LB-WT19606")
    10. dds = DESeq(dds, betaPrior=FALSE)
    11. resultsNames(dds)
    12. clist <- c("LB.AB_vs_LB.WT19606","LB.IJ_vs_LB.WT19606","LB.W1_vs_LB.WT19606","LB.Y1_vs_LB.WT19606")
    13. dds$condition <- relevel(dds$condition, "Mac-WT19606")
    14. dds = DESeq(dds, betaPrior=FALSE)
    15. resultsNames(dds)
    16. clist <- c("Mac.AB_vs_Mac.WT19606","Mac.IJ_vs_Mac.WT19606","Mac.W1_vs_Mac.WT19606","Mac.Y1_vs_Mac.WT19606")
    17. # - 如果你的实验是关注细菌在没有选择性压力下的生长、基因表达或一般行为,LB 是更好的对照。
    18. # - 如果你希望研究细菌在选择性压力下的行为(例如,针对革兰氏阴性细菌、测试抗生素耐药性或区分乳糖发酵菌),那么 MacConkey 更适合作为对照。
    19. dds$media <- relevel(dds$media, "LB")
    20. dds = DESeq(dds, betaPrior=FALSE)
    21. resultsNames(dds)
    22. clist <- c("Mac_vs_LB")
    23. dds$media <- relevel(dds$media, "Mac")
    24. dds = DESeq(dds, betaPrior=FALSE)
    25. resultsNames(dds)
    26. clist <- c("LB_vs_Mac")
    27. for (i in clist) {
    28. #contrast = paste("condition", i, sep="_")
    29. contrast = paste("media", i, sep="_")
    30. res = results(dds, name=contrast)
    31. res <- res[!is.na(res$log2FoldChange),]
    32. res_df <- as.data.frame(res)
    33. write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
    34. up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
    35. down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
    36. write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    37. write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    38. }
    39. # -- Under host-env --
    40. grep -P "\tgene\t" CP059040.gff > CP059040_gene.gff
    41. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-all.txt LB.AB_vs_LB.WT19606-all.csv
    42. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-up.txt LB.AB_vs_LB.WT19606-up.csv
    43. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.AB_vs_LB.WT19606-down.txt LB.AB_vs_LB.WT19606-down.csv
    44. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-all.txt LB.IJ_vs_LB.WT19606-all.csv
    45. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-up.txt LB.IJ_vs_LB.WT19606-up.csv
    46. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.IJ_vs_LB.WT19606-down.txt LB.IJ_vs_LB.WT19606-down.csv
    47. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-all.txt LB.W1_vs_LB.WT19606-all.csv
    48. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-up.txt LB.W1_vs_LB.WT19606-up.csv
    49. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.W1_vs_LB.WT19606-down.txt LB.W1_vs_LB.WT19606-down.csv
    50. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-all.txt LB.Y1_vs_LB.WT19606-all.csv
    51. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-up.txt LB.Y1_vs_LB.WT19606-up.csv
    52. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB.Y1_vs_LB.WT19606-down.txt LB.Y1_vs_LB.WT19606-down.csv
    53. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-all.txt Mac.AB_vs_Mac.WT19606-all.csv
    54. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-up.txt Mac.AB_vs_Mac.WT19606-up.csv
    55. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.AB_vs_Mac.WT19606-down.txt Mac.AB_vs_Mac.WT19606-down.csv
    56. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-all.txt Mac.IJ_vs_Mac.WT19606-all.csv
    57. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-up.txt Mac.IJ_vs_Mac.WT19606-up.csv
    58. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.IJ_vs_Mac.WT19606-down.txt Mac.IJ_vs_Mac.WT19606-down.csv
    59. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-all.txt Mac.W1_vs_Mac.WT19606-all.csv
    60. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-up.txt Mac.W1_vs_Mac.WT19606-up.csv
    61. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.W1_vs_Mac.WT19606-down.txt Mac.W1_vs_Mac.WT19606-down.csv
    62. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-all.txt Mac.Y1_vs_Mac.WT19606-all.csv
    63. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-up.txt Mac.Y1_vs_Mac.WT19606-up.csv
    64. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac.Y1_vs_Mac.WT19606-down.txt Mac.Y1_vs_Mac.WT19606-down.csv
    65. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-all.txt Mac_vs_LB-all.csv
    66. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-up.txt Mac_vs_LB-up.csv
    67. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff Mac_vs_LB-down.txt Mac_vs_LB-down.csv
    68. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-all.txt LB_vs_Mac-all.csv
    69. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-up.txt LB_vs_Mac-up.csv
    70. python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine/CP059040_gene.gff LB_vs_Mac-down.txt LB_vs_Mac-down.csv
    71. # ---- Mac_vs_LB ----
    72. res <- read.csv("Mac_vs_LB-all.csv")
    73. # Replace empty GeneName with modified GeneID
    74. res$GeneName <- ifelse(
    75. res$GeneName == "" | is.na(res$GeneName),
    76. gsub("gene-", "", res$GeneID),
    77. res$GeneName
    78. )
    79. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    80. #print(duplicated_genes)
    81. # [1] "bfr" "lipA" "ahpF" "pcaF" "alr" "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
    82. #[11] "pcaF" "tuf" "galE" "murI" "yccS" "rrf" "rrf" "arsB" "ptsP" "umuD"
    83. #[21] "map" "pgaB" "rrf" "rrf" "rrf" "pgaD" "uraH" "benE"
    84. #res[res$GeneName == "bfr", ]
    85. #1st_strategy First occurrence is kept and Subsequent duplicates are removed
    86. #res <- res[!duplicated(res$GeneName), ]
    87. #2nd_strategy keep the row with the smallest padj value for each GeneName
    88. res <- res %>%
    89. group_by(GeneName) %>%
    90. slice_min(padj, with_ties = FALSE) %>%
    91. ungroup()
    92. res <- as.data.frame(res)
    93. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    94. res <- res[order(res$padj, -res$log2FoldChange), ]
    95. # Assuming res is your dataframe and already processed
    96. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    97. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    98. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    99. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    100. # Create a new workbook
    101. wb <- createWorkbook()
    102. # Add the complete dataset as the first sheet
    103. addWorksheet(wb, "Complete_Data")
    104. writeData(wb, "Complete_Data", res)
    105. # Add the up-regulated genes as the second sheet
    106. addWorksheet(wb, "Up_Regulated")
    107. writeData(wb, "Up_Regulated", up_regulated)
    108. # Add the down-regulated genes as the third sheet
    109. addWorksheet(wb, "Down_Regulated")
    110. writeData(wb, "Down_Regulated", down_regulated)
    111. # Save the workbook to a file
    112. saveWorkbook(wb, "Gene_Expression_Mac_vs_LB.xlsx", overwrite = TRUE)
    113. # Set the 'GeneName' column as row.names
    114. rownames(res) <- res$GeneName
    115. # Drop the 'GeneName' column since it's now the row names
    116. res$GeneName <- NULL
    117. head(res)
    118. ## Ensure the data frame matches the expected format
    119. ## For example, it should have columns: log2FoldChange, padj, etc.
    120. #res <- as.data.frame(res)
    121. ## Remove rows with NA in log2FoldChange (if needed)
    122. #res <- res[!is.na(res$log2FoldChange),]
    123. # Replace padj = 0 with a small value
    124. res$padj[res$padj == 0] <- 1e-150
    125. #library(EnhancedVolcano)
    126. # Assuming res is already sorted and processed
    127. png("Mac_vs_LB.png", width=1200, height=2000)
    128. #max.overlaps = 10
    129. EnhancedVolcano(res,
    130. lab = rownames(res),
    131. x = 'log2FoldChange',
    132. y = 'padj',
    133. pCutoff = 1e-2,
    134. FCcutoff = 2,
    135. title = '',
    136. subtitleLabSize = 18,
    137. pointSize = 3.0,
    138. labSize = 5.0,
    139. colAlpha = 1,
    140. legendIconSize = 4.0,
    141. drawConnectors = TRUE,
    142. widthConnectors = 0.5,
    143. colConnectors = 'black',
    144. subtitle = expression("Mac versus LB"))
    145. dev.off()
    146. # ---- LB.AB_vs_LB.WT19606 ----
    147. res <- read.csv("LB.AB_vs_LB.WT19606-all.csv")
    148. # Replace empty GeneName with modified GeneID
    149. res$GeneName <- ifelse(
    150. res$GeneName == "" | is.na(res$GeneName),
    151. gsub("gene-", "", res$GeneID),
    152. res$GeneName
    153. )
    154. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    155. res <- res %>%
    156. group_by(GeneName) %>%
    157. slice_min(padj, with_ties = FALSE) %>%
    158. ungroup()
    159. res <- as.data.frame(res)
    160. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    161. res <- res[order(res$padj, -res$log2FoldChange), ]
    162. # Assuming res is your dataframe and already processed
    163. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    164. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    165. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    166. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    167. # Create a new workbook
    168. wb <- createWorkbook()
    169. # Add the complete dataset as the first sheet
    170. addWorksheet(wb, "Complete_Data")
    171. writeData(wb, "Complete_Data", res)
    172. # Add the up-regulated genes as the second sheet
    173. addWorksheet(wb, "Up_Regulated")
    174. writeData(wb, "Up_Regulated", up_regulated)
    175. # Add the down-regulated genes as the third sheet
    176. addWorksheet(wb, "Down_Regulated")
    177. writeData(wb, "Down_Regulated", down_regulated)
    178. # Save the workbook to a file
    179. saveWorkbook(wb, "Gene_Expression_LB.AB_vs_LB.WT19606.xlsx", overwrite = TRUE)
    180. # Set the 'GeneName' column as row.names
    181. rownames(res) <- res$GeneName
    182. # Drop the 'GeneName' column since it's now the row names
    183. res$GeneName <- NULL
    184. head(res)
    185. ## Ensure the data frame matches the expected format
    186. ## For example, it should have columns: log2FoldChange, padj, etc.
    187. #res <- as.data.frame(res)
    188. ## Remove rows with NA in log2FoldChange (if needed)
    189. #res <- res[!is.na(res$log2FoldChange),]
    190. # Replace padj = 0 with a small value
    191. res$padj[res$padj == 0] <- 1e-12
    192. #library(EnhancedVolcano)
    193. # Assuming res is already sorted and processed
    194. png("LB.AB_vs_LB.WT19606.png", width=1200, height=1200)
    195. #max.overlaps = 10
    196. EnhancedVolcano(res,
    197. lab = rownames(res),
    198. x = 'log2FoldChange',
    199. y = 'padj',
    200. pCutoff = 1e-2,
    201. FCcutoff = 2,
    202. title = '',
    203. subtitleLabSize = 18,
    204. pointSize = 3.0,
    205. labSize = 5.0,
    206. colAlpha = 1,
    207. legendIconSize = 4.0,
    208. drawConnectors = TRUE,
    209. widthConnectors = 0.5,
    210. colConnectors = 'black',
    211. subtitle = expression("LB.AB versus LB.WT19606"))
    212. dev.off()
    213. # ---- LB.IJ_vs_LB.WT19606 ----
    214. res <- read.csv("LB.IJ_vs_LB.WT19606-all.csv")
    215. # Replace empty GeneName with modified GeneID
    216. res$GeneName <- ifelse(
    217. res$GeneName == "" | is.na(res$GeneName),
    218. gsub("gene-", "", res$GeneID),
    219. res$GeneName
    220. )
    221. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    222. res <- res %>%
    223. group_by(GeneName) %>%
    224. slice_min(padj, with_ties = FALSE) %>%
    225. ungroup()
    226. res <- as.data.frame(res)
    227. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    228. res <- res[order(res$padj, -res$log2FoldChange), ]
    229. # Assuming res is your dataframe and already processed
    230. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    231. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    232. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    233. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    234. # Create a new workbook
    235. wb <- createWorkbook()
    236. # Add the complete dataset as the first sheet
    237. addWorksheet(wb, "Complete_Data")
    238. writeData(wb, "Complete_Data", res)
    239. # Add the up-regulated genes as the second sheet
    240. addWorksheet(wb, "Up_Regulated")
    241. writeData(wb, "Up_Regulated", up_regulated)
    242. # Add the down-regulated genes as the third sheet
    243. addWorksheet(wb, "Down_Regulated")
    244. writeData(wb, "Down_Regulated", down_regulated)
    245. # Save the workbook to a file
    246. saveWorkbook(wb, "Gene_Expression_LB.IJ_vs_LB.WT19606.xlsx", overwrite = TRUE)
    247. # Set the 'GeneName' column as row.names
    248. rownames(res) <- res$GeneName
    249. # Drop the 'GeneName' column since it's now the row names
    250. res$GeneName <- NULL
    251. head(res)
    252. ## Ensure the data frame matches the expected format
    253. ## For example, it should have columns: log2FoldChange, padj, etc.
    254. #res <- as.data.frame(res)
    255. ## Remove rows with NA in log2FoldChange (if needed)
    256. #res <- res[!is.na(res$log2FoldChange),]
    257. # Replace padj = 0 with a small value
    258. res$padj[res$padj == 0] <- 1e-12
    259. #library(EnhancedVolcano)
    260. # Assuming res is already sorted and processed
    261. png("LB.IJ_vs_LB.WT19606.png", width=1200, height=1200)
    262. #max.overlaps = 10
    263. EnhancedVolcano(res,
    264. lab = rownames(res),
    265. x = 'log2FoldChange',
    266. y = 'padj',
    267. pCutoff = 1e-2,
    268. FCcutoff = 2,
    269. title = '',
    270. subtitleLabSize = 18,
    271. pointSize = 3.0,
    272. labSize = 5.0,
    273. colAlpha = 1,
    274. legendIconSize = 4.0,
    275. drawConnectors = TRUE,
    276. widthConnectors = 0.5,
    277. colConnectors = 'black',
    278. subtitle = expression("LB.IJ versus LB.WT19606"))
    279. dev.off()
    280. # ---- LB.W1_vs_LB.WT19606 ----
    281. res <- read.csv("LB.W1_vs_LB.WT19606-all.csv")
    282. # Replace empty GeneName with modified GeneID
    283. res$GeneName <- ifelse(
    284. res$GeneName == "" | is.na(res$GeneName),
    285. gsub("gene-", "", res$GeneID),
    286. res$GeneName
    287. )
    288. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    289. res <- res %>%
    290. group_by(GeneName) %>%
    291. slice_min(padj, with_ties = FALSE) %>%
    292. ungroup()
    293. res <- as.data.frame(res)
    294. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    295. res <- res[order(res$padj, -res$log2FoldChange), ]
    296. # Assuming res is your dataframe and already processed
    297. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    298. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    299. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    300. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    301. # Create a new workbook
    302. wb <- createWorkbook()
    303. # Add the complete dataset as the first sheet
    304. addWorksheet(wb, "Complete_Data")
    305. writeData(wb, "Complete_Data", res)
    306. # Add the up-regulated genes as the second sheet
    307. addWorksheet(wb, "Up_Regulated")
    308. writeData(wb, "Up_Regulated", up_regulated)
    309. # Add the down-regulated genes as the third sheet
    310. addWorksheet(wb, "Down_Regulated")
    311. writeData(wb, "Down_Regulated", down_regulated)
    312. # Save the workbook to a file
    313. saveWorkbook(wb, "Gene_Expression_LB.W1_vs_LB.WT19606.xlsx", overwrite = TRUE)
    314. # Set the 'GeneName' column as row.names
    315. rownames(res) <- res$GeneName
    316. # Drop the 'GeneName' column since it's now the row names
    317. res$GeneName <- NULL
    318. head(res)
    319. ## Ensure the data frame matches the expected format
    320. ## For example, it should have columns: log2FoldChange, padj, etc.
    321. #res <- as.data.frame(res)
    322. ## Remove rows with NA in log2FoldChange (if needed)
    323. #res <- res[!is.na(res$log2FoldChange),]
    324. # Replace padj = 0 with a small value
    325. res$padj[res$padj == 0] <- 1e-12
    326. #library(EnhancedVolcano)
    327. # Assuming res is already sorted and processed
    328. png("LB.W1_vs_LB.WT19606.png", width=1200, height=1200)
    329. #max.overlaps = 10
    330. EnhancedVolcano(res,
    331. lab = rownames(res),
    332. x = 'log2FoldChange',
    333. y = 'padj',
    334. pCutoff = 1e-2,
    335. FCcutoff = 2,
    336. title = '',
    337. subtitleLabSize = 18,
    338. pointSize = 3.0,
    339. labSize = 5.0,
    340. colAlpha = 1,
    341. legendIconSize = 4.0,
    342. drawConnectors = TRUE,
    343. widthConnectors = 0.5,
    344. colConnectors = 'black',
    345. subtitle = expression("LB.W1 versus LB.WT19606"))
    346. dev.off()
    347. # ---- LB.Y1_vs_LB.WT19606 ----
    348. res <- read.csv("LB.Y1_vs_LB.WT19606-all.csv")
    349. # Replace empty GeneName with modified GeneID
    350. res$GeneName <- ifelse(
    351. res$GeneName == "" | is.na(res$GeneName),
    352. gsub("gene-", "", res$GeneID),
    353. res$GeneName
    354. )
    355. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    356. res <- res %>%
    357. group_by(GeneName) %>%
    358. slice_min(padj, with_ties = FALSE) %>%
    359. ungroup()
    360. res <- as.data.frame(res)
    361. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    362. res <- res[order(res$padj, -res$log2FoldChange), ]
    363. # Assuming res is your dataframe and already processed
    364. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    365. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    366. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    367. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    368. # Create a new workbook
    369. wb <- createWorkbook()
    370. # Add the complete dataset as the first sheet
    371. addWorksheet(wb, "Complete_Data")
    372. writeData(wb, "Complete_Data", res)
    373. # Add the up-regulated genes as the second sheet
    374. addWorksheet(wb, "Up_Regulated")
    375. writeData(wb, "Up_Regulated", up_regulated)
    376. # Add the down-regulated genes as the third sheet
    377. addWorksheet(wb, "Down_Regulated")
    378. writeData(wb, "Down_Regulated", down_regulated)
    379. # Save the workbook to a file
    380. saveWorkbook(wb, "Gene_Expression_LB.Y1_vs_LB.WT19606.xlsx", overwrite = TRUE)
    381. # Set the 'GeneName' column as row.names
    382. rownames(res) <- res$GeneName
    383. # Drop the 'GeneName' column since it's now the row names
    384. res$GeneName <- NULL
    385. head(res)
    386. ## Ensure the data frame matches the expected format
    387. ## For example, it should have columns: log2FoldChange, padj, etc.
    388. #res <- as.data.frame(res)
    389. ## Remove rows with NA in log2FoldChange (if needed)
    390. #res <- res[!is.na(res$log2FoldChange),]
    391. # Replace padj = 0 with a small value
    392. res$padj[res$padj == 0] <- 1e-12
    393. #library(EnhancedVolcano)
    394. # Assuming res is already sorted and processed
    395. png("LB.Y1_vs_LB.WT19606.png", width=1200, height=1200)
    396. #max.overlaps = 10
    397. EnhancedVolcano(res,
    398. lab = rownames(res),
    399. x = 'log2FoldChange',
    400. y = 'padj',
    401. pCutoff = 1e-2,
    402. FCcutoff = 2,
    403. title = '',
    404. subtitleLabSize = 18,
    405. pointSize = 3.0,
    406. labSize = 5.0,
    407. colAlpha = 1,
    408. legendIconSize = 4.0,
    409. drawConnectors = TRUE,
    410. widthConnectors = 0.5,
    411. colConnectors = 'black',
    412. subtitle = expression("LB.Y1 versus LB.WT19606"))
    413. dev.off()
    414. # ---- Mac.AB_vs_Mac.WT19606 ----
    415. res <- read.csv("Mac.AB_vs_Mac.WT19606-all.csv")
    416. # Replace empty GeneName with modified GeneID
    417. res$GeneName <- ifelse(
    418. res$GeneName == "" | is.na(res$GeneName),
    419. gsub("gene-", "", res$GeneID),
    420. res$GeneName
    421. )
    422. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    423. res <- res %>%
    424. group_by(GeneName) %>%
    425. slice_min(padj, with_ties = FALSE) %>%
    426. ungroup()
    427. res <- as.data.frame(res)
    428. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    429. res <- res[order(res$padj, -res$log2FoldChange), ]
    430. # Assuming res is your dataframe and already processed
    431. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    432. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    433. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    434. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    435. # Create a new workbook
    436. wb <- createWorkbook()
    437. # Add the complete dataset as the first sheet
    438. addWorksheet(wb, "Complete_Data")
    439. writeData(wb, "Complete_Data", res)
    440. # Add the up-regulated genes as the second sheet
    441. addWorksheet(wb, "Up_Regulated")
    442. writeData(wb, "Up_Regulated", up_regulated)
    443. # Add the down-regulated genes as the third sheet
    444. addWorksheet(wb, "Down_Regulated")
    445. writeData(wb, "Down_Regulated", down_regulated)
    446. # Save the workbook to a file
    447. saveWorkbook(wb, "Gene_Expression_Mac.AB_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    448. # Set the 'GeneName' column as row.names
    449. rownames(res) <- res$GeneName
    450. # Drop the 'GeneName' column since it's now the row names
    451. res$GeneName <- NULL
    452. head(res)
    453. ## Ensure the data frame matches the expected format
    454. ## For example, it should have columns: log2FoldChange, padj, etc.
    455. #res <- as.data.frame(res)
    456. ## Remove rows with NA in log2FoldChange (if needed)
    457. #res <- res[!is.na(res$log2FoldChange),]
    458. # Replace padj = 0 with a small value
    459. res$padj[res$padj == 0] <- 1e-12
    460. #library(EnhancedVolcano)
    461. # Assuming res is already sorted and processed
    462. png("Mac.AB_vs_Mac.WT19606.png", width=1200, height=1200)
    463. #max.overlaps = 10
    464. EnhancedVolcano(res,
    465. lab = rownames(res),
    466. x = 'log2FoldChange',
    467. y = 'padj',
    468. pCutoff = 1e-2,
    469. FCcutoff = 2,
    470. title = '',
    471. subtitleLabSize = 18,
    472. pointSize = 3.0,
    473. labSize = 5.0,
    474. colAlpha = 1,
    475. legendIconSize = 4.0,
    476. drawConnectors = TRUE,
    477. widthConnectors = 0.5,
    478. colConnectors = 'black',
    479. subtitle = expression("Mac.AB versus Mac.WT19606"))
    480. dev.off()
    481. # ---- Mac.IJ_vs_Mac.WT19606 ----
    482. res <- read.csv("Mac.IJ_vs_Mac.WT19606-all.csv")
    483. # Replace empty GeneName with modified GeneID
    484. res$GeneName <- ifelse(
    485. res$GeneName == "" | is.na(res$GeneName),
    486. gsub("gene-", "", res$GeneID),
    487. res$GeneName
    488. )
    489. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    490. res <- res %>%
    491. group_by(GeneName) %>%
    492. slice_min(padj, with_ties = FALSE) %>%
    493. ungroup()
    494. res <- as.data.frame(res)
    495. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    496. res <- res[order(res$padj, -res$log2FoldChange), ]
    497. # Assuming res is your dataframe and already processed
    498. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    499. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    500. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    501. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    502. # Create a new workbook
    503. wb <- createWorkbook()
    504. # Add the complete dataset as the first sheet
    505. addWorksheet(wb, "Complete_Data")
    506. writeData(wb, "Complete_Data", res)
    507. # Add the up-regulated genes as the second sheet
    508. addWorksheet(wb, "Up_Regulated")
    509. writeData(wb, "Up_Regulated", up_regulated)
    510. # Add the down-regulated genes as the third sheet
    511. addWorksheet(wb, "Down_Regulated")
    512. writeData(wb, "Down_Regulated", down_regulated)
    513. # Save the workbook to a file
    514. saveWorkbook(wb, "Gene_Expression_Mac.IJ_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    515. # Set the 'GeneName' column as row.names
    516. rownames(res) <- res$GeneName
    517. # Drop the 'GeneName' column since it's now the row names
    518. res$GeneName <- NULL
    519. head(res)
    520. ## Ensure the data frame matches the expected format
    521. ## For example, it should have columns: log2FoldChange, padj, etc.
    522. #res <- as.data.frame(res)
    523. ## Remove rows with NA in log2FoldChange (if needed)
    524. #res <- res[!is.na(res$log2FoldChange),]
    525. # Replace padj = 0 with a small value
    526. res$padj[res$padj == 0] <- 1e-12
    527. #library(EnhancedVolcano)
    528. # Assuming res is already sorted and processed
    529. png("Mac.IJ_vs_Mac.WT19606.png", width=1200, height=1200)
    530. #max.overlaps = 10
    531. EnhancedVolcano(res,
    532. lab = rownames(res),
    533. x = 'log2FoldChange',
    534. y = 'padj',
    535. pCutoff = 1e-2,
    536. FCcutoff = 2,
    537. title = '',
    538. subtitleLabSize = 18,
    539. pointSize = 3.0,
    540. labSize = 5.0,
    541. colAlpha = 1,
    542. legendIconSize = 4.0,
    543. drawConnectors = TRUE,
    544. widthConnectors = 0.5,
    545. colConnectors = 'black',
    546. subtitle = expression("Mac.IJ versus Mac.WT19606"))
    547. dev.off()
    548. # ---- Mac.W1_vs_Mac.WT19606 ----
    549. res <- read.csv("Mac.W1_vs_Mac.WT19606-all.csv")
    550. # Replace empty GeneName with modified GeneID
    551. res$GeneName <- ifelse(
    552. res$GeneName == "" | is.na(res$GeneName),
    553. gsub("gene-", "", res$GeneID),
    554. res$GeneName
    555. )
    556. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    557. res <- res %>%
    558. group_by(GeneName) %>%
    559. slice_min(padj, with_ties = FALSE) %>%
    560. ungroup()
    561. res <- as.data.frame(res)
    562. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    563. res <- res[order(res$padj, -res$log2FoldChange), ]
    564. # Assuming res is your dataframe and already processed
    565. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    566. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    567. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    568. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    569. # Create a new workbook
    570. wb <- createWorkbook()
    571. # Add the complete dataset as the first sheet
    572. addWorksheet(wb, "Complete_Data")
    573. writeData(wb, "Complete_Data", res)
    574. # Add the up-regulated genes as the second sheet
    575. addWorksheet(wb, "Up_Regulated")
    576. writeData(wb, "Up_Regulated", up_regulated)
    577. # Add the down-regulated genes as the third sheet
    578. addWorksheet(wb, "Down_Regulated")
    579. writeData(wb, "Down_Regulated", down_regulated)
    580. # Save the workbook to a file
    581. saveWorkbook(wb, "Gene_Expression_Mac.W1_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    582. # Set the 'GeneName' column as row.names
    583. rownames(res) <- res$GeneName
    584. # Drop the 'GeneName' column since it's now the row names
    585. res$GeneName <- NULL
    586. head(res)
    587. ## Ensure the data frame matches the expected format
    588. ## For example, it should have columns: log2FoldChange, padj, etc.
    589. #res <- as.data.frame(res)
    590. ## Remove rows with NA in log2FoldChange (if needed)
    591. #res <- res[!is.na(res$log2FoldChange),]
    592. # Replace padj = 0 with a small value
    593. res$padj[res$padj == 0] <- 1e-12
    594. #library(EnhancedVolcano)
    595. # Assuming res is already sorted and processed
    596. png("Mac.W1_vs_Mac.WT19606.png", width=1200, height=1200)
    597. #max.overlaps = 10
    598. EnhancedVolcano(res,
    599. lab = rownames(res),
    600. x = 'log2FoldChange',
    601. y = 'padj',
    602. pCutoff = 1e-2,
    603. FCcutoff = 2,
    604. title = '',
    605. subtitleLabSize = 18,
    606. pointSize = 3.0,
    607. labSize = 5.0,
    608. colAlpha = 1,
    609. legendIconSize = 4.0,
    610. drawConnectors = TRUE,
    611. widthConnectors = 0.5,
    612. colConnectors = 'black',
    613. subtitle = expression("Mac.W1 versus Mac.WT19606"))
    614. dev.off()
    615. # ---- Mac.Y1_vs_Mac.WT19606 ----
    616. res <- read.csv("Mac.Y1_vs_Mac.WT19606-all.csv")
    617. # Replace empty GeneName with modified GeneID
    618. res$GeneName <- ifelse(
    619. res$GeneName == "" | is.na(res$GeneName),
    620. gsub("gene-", "", res$GeneID),
    621. res$GeneName
    622. )
    623. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    624. res <- res %>%
    625. group_by(GeneName) %>%
    626. slice_min(padj, with_ties = FALSE) %>%
    627. ungroup()
    628. res <- as.data.frame(res)
    629. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    630. res <- res[order(res$padj, -res$log2FoldChange), ]
    631. # Assuming res is your dataframe and already processed
    632. # Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
    633. up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
    634. # Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
    635. down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
    636. # Create a new workbook
    637. wb <- createWorkbook()
    638. # Add the complete dataset as the first sheet
    639. addWorksheet(wb, "Complete_Data")
    640. writeData(wb, "Complete_Data", res)
    641. # Add the up-regulated genes as the second sheet
    642. addWorksheet(wb, "Up_Regulated")
    643. writeData(wb, "Up_Regulated", up_regulated)
    644. # Add the down-regulated genes as the third sheet
    645. addWorksheet(wb, "Down_Regulated")
    646. writeData(wb, "Down_Regulated", down_regulated)
    647. # Save the workbook to a file
    648. saveWorkbook(wb, "Gene_Expression_Mac.Y1_vs_Mac.WT19606.xlsx", overwrite = TRUE)
    649. # Set the 'GeneName' column as row.names
    650. rownames(res) <- res$GeneName
    651. # Drop the 'GeneName' column since it's now the row names
    652. res$GeneName <- NULL
    653. head(res)
    654. ## Ensure the data frame matches the expected format
    655. ## For example, it should have columns: log2FoldChange, padj, etc.
    656. #res <- as.data.frame(res)
    657. ## Remove rows with NA in log2FoldChange (if needed)
    658. #res <- res[!is.na(res$log2FoldChange),]
    659. # Replace padj = 0 with a small value
    660. res$padj[res$padj == 0] <- 1e-12
    661. #library(EnhancedVolcano)
    662. # Assuming res is already sorted and processed
    663. png("Mac.Y1_vs_Mac.WT19606.png", width=1200, height=1200)
    664. #max.overlaps = 10
    665. EnhancedVolcano(res,
    666. lab = rownames(res),
    667. x = 'log2FoldChange',
    668. y = 'padj',
    669. pCutoff = 1e-2,
    670. FCcutoff = 2,
    671. title = '',
    672. subtitleLabSize = 18,
    673. pointSize = 3.0,
    674. labSize = 5.0,
    675. colAlpha = 1,
    676. legendIconSize = 4.0,
    677. drawConnectors = TRUE,
    678. widthConnectors = 0.5,
    679. colConnectors = 'black',
    680. subtitle = expression("Mac.Y1 versus Mac.WT19606"))
    681. dev.off()
    682. #TODO: annotate the Gene_Expression_xxx_vs_yyy.xlsx
  10. Clustering the genes and draw heatmap

    1. #http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
    2. #http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
    3. # save the Up-regulated and Down-regulated genes into -up.id and -down.id
    4. for i in Mac_vs_LB LB.AB_vs_LB.WT19606 LB.IJ_vs_LB.WT19606 LB.W1_vs_LB.WT19606 LB.Y1_vs_LB.WT19606 Mac.AB_vs_Mac.WT19606 Mac.IJ_vs_Mac.WT19606 Mac.W1_vs_Mac.WT19606 Mac.Y1_vs_Mac.WT19606; do
    5. echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
    6. echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
    7. done
    8. #5 LB.AB_vs_LB.WT19606-down.id
    9. #20 LB.AB_vs_LB.WT19606-up.id
    10. #64 LB.IJ_vs_LB.WT19606-down.id
    11. #69 LB.IJ_vs_LB.WT19606-up.id
    12. #23 LB.W1_vs_LB.WT19606-down.id
    13. #97 LB.W1_vs_LB.WT19606-up.id
    14. #9 LB.Y1_vs_LB.WT19606-down.id
    15. #20 LB.Y1_vs_LB.WT19606-up.id
    16. #20 Mac.AB_vs_Mac.WT19606-down.id
    17. #29 Mac.AB_vs_Mac.WT19606-up.id
    18. #65 Mac.IJ_vs_Mac.WT19606-down.id
    19. #197 Mac.IJ_vs_Mac.WT19606-up.id
    20. #359 Mac_vs_LB-down.id
    21. #308 Mac_vs_LB-up.id
    22. #290 Mac.W1_vs_Mac.WT19606-down.id
    23. #343 Mac.W1_vs_Mac.WT19606-up.id
    24. #75 Mac.Y1_vs_Mac.WT19606-down.id
    25. #0 Mac.Y1_vs_Mac.WT19606.png-down.id
    26. #0 Mac.Y1_vs_Mac.WT19606.png-up.id
    27. #68 Mac.Y1_vs_Mac.WT19606-up.id
    28. #2061 total
    29. cat *.id | sort -u > ids
    30. #Delete "GeneName"
    31. #add Gene_Id in the first line, delete the "" #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID.
    32. GOI <- read.csv("ids")$Gene_Id #1329
    33. RNASeq.NoCellLine <- assay(rld)
    34. #install.packages("gplots")
    35. library("gplots")
    36. #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman
    37. datamat = RNASeq.NoCellLine[GOI, ]
    38. #datamat = RNASeq.NoCellLine
    39. write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")
    40. constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
    41. if(any(constant_rows)) {
    42. cat("Removing", sum(constant_rows), "constant rows.\n")
    43. datamat <- datamat[!constant_rows, ]
    44. }
    45. hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    46. hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    47. mycl = cutree(hr, h=max(hr$height)/1.15)
    48. mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
    49. mycol = mycol[as.vector(mycl)]
    50. #png("DEGs_heatmap.png", width=900, height=800)
    51. #cex.lab=10, labRow="",
    52. png("DEGs_heatmap.png", width=1200, height=1000)
    53. heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',labRow="",
    54. scale='row',trace='none',col=bluered(75), cexCol=1.8,
    55. RowSideColors = mycol, margins=c(10,2), cexRow=1.5, srtCol=30, lhei = c(1, 8), lwid=c(2, 8)) #rownames(datamat)
    56. #heatmap.2(datamat, Rowv=as.dendrogram(hr), col=bluered(75), scale="row", RowSideColors=mycol, trace="none", margin=c(5,5), sepwidth=c(0,0), dendrogram = 'row', Colv = 'false', density.info='none', labRow="", srtCol=30, lhei=c(0.1,2))
    57. dev.off()
    58. #### cluster members #####
    59. write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    60. write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
    61. write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
    62. write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
    63. write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
    64. #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    65. #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;
    66. #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) #####
    67. subset_1<-names(subset(mycl, mycl == '1'))
    68. data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575
    69. subset_2<-names(subset(mycl, mycl == '2'))
    70. data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855
    71. subset_3<-names(subset(mycl, mycl == '3'))
    72. data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217
    73. subset_4<-names(subset(mycl, mycl == '4'))
    74. data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) #
    75. subset_5<-names(subset(mycl, mycl == '5'))
    76. data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) #
    77. # Initialize an empty data frame for the annotated data
    78. annotated_data <- data.frame()
    79. # Determine total number of genes
    80. total_genes <- length(rownames(data))
    81. # Loop through each gene to annotate
    82. for (i in 1:total_genes) {
    83. gene <- rownames(data)[i]
    84. result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
    85. filters = 'ensembl_gene_id',
    86. values = gene,
    87. mart = ensembl)
    88. # If multiple rows are returned, take the first one
    89. if (nrow(result) > 1) {
    90. result <- result[1, ]
    91. }
    92. # Check if the result is empty
    93. if (nrow(result) == 0) {
    94. result <- data.frame(ensembl_gene_id = gene,
    95. external_gene_name = NA,
    96. gene_biotype = NA,
    97. entrezgene_id = NA,
    98. chromosome_name = NA,
    99. start_position = NA,
    100. end_position = NA,
    101. strand = NA,
    102. description = NA)
    103. }
    104. # Transpose expression values
    105. expression_values <- t(data.frame(t(data[gene, ])))
    106. colnames(expression_values) <- colnames(data)
    107. # Combine gene information and expression data
    108. combined_result <- cbind(result, expression_values)
    109. # Append to the final dataframe
    110. annotated_data <- rbind(annotated_data, combined_result)
    111. # Print progress every 100 genes
    112. if (i %% 100 == 0) {
    113. cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    114. }
    115. }
    116. # Save the annotated data to a new CSV file
    117. write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    118. write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    119. write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    120. write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
    121. write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
    122. #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Blast2GO_workflow

  1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

    1. EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
    2. Install EggNOG-mapper:
    3. mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda #eggnog-mapper_2.1.12
    4. mamba activate eggnog_env
    5. Run annotation:
    6. #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
    7. mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
    8. download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
    9. #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
    10. python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
    11. emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
    12. #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
    13. #----> 470.IX87_14445:
    14. * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
    15. * IX87_14445 would refer to a specific gene or protein within that genome.
    16. Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    • 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    • Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    • Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    • Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished." * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value. * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

    or blast2go_cli_v1.5.1 (NOT_USED)

    1. #https://help.biobam.com/space/BCD/2250407989/Installation
    2. #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)." #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

    1. #-- before merging (blast2go_annot.annot) --
    2. #H0N29_18790 GO:0004842 ankyrin repeat domain-containing protein
    3. #H0N29_18790 GO:0085020
    4. #-- after merging (blast2go_annot.annot2) -->
    5. #H0N29_18790 GO:0031436 ankyrin repeat domain-containing protein
    6. #H0N29_18790 GO:0070531
    7. #H0N29_18790 GO:0004842
    8. #H0N29_18790 GO:0005515
    9. #H0N29_18790 GO:0085020

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    • Go to KAAS
    • Upload your FASTA file.
    • Select an appropriate gene set.
    • Download the KO assignments.
  2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    • Check available KEGG organisms:
      1. library(clusterProfiler)
      2. library(KEGGREST)
      3. kegg_organisms <- keggList("organism")
      4. Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
      5. # Search for Acinetobacter in the list
      6. grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
      7. # Gammaproteobacteria
      8. #Extract KO IDs from the eggnog results for "Acinetobacter baumannii strain ATCC 19606"
  3. Find the Closest KEGG Organism for a Non-Model Species

    If your organism is not in KEGG, search for the closest relative:

    1. grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE) # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

    1. kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko") # "ko" = KEGG Orthology
  4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/ata_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

    1. #BiocManager::install("GO.db")
    2. #BiocManager::install("AnnotationDbi")
    3. # Load required libraries
    4. library(openxlsx) # For Excel file handling
    5. library(dplyr) # For data manipulation
    6. library(tidyr)
    7. library(stringr)
    8. library(clusterProfiler) # For KEGG and GO enrichment analysis
    9. #library(org.Hs.eg.db) # Replace with appropriate organism database
    10. library(GO.db)
    11. library(AnnotationDbi)
    12. setwd("~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes")
    13. # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
    14. # Step 1: Load the blast2go annotation file with a check for missing columns
    15. annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_",
    16. header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
    17. # If the structure is inconsistent, we can make sure there are exactly 3 columns:
    18. colnames(annot_df) <- c("GeneID", "Term")
    19. # Step 2: Filter and aggregate GO and EC terms as before
    20. go_terms <- annot_df %>%
    21. filter(grepl("^GO:", Term)) %>%
    22. group_by(GeneID) %>%
    23. summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    24. ec_terms <- annot_df %>%
    25. filter(grepl("^EC:", Term)) %>%
    26. group_by(GeneID) %>%
    27. summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    28. # Load the results
    29. #res <- read.csv("Mac_vs_LB-all.csv") #up307, down358
    30. #res <- read.csv("LB.AB_vs_LB.WT19606-all.csv") #up307, down358
    31. #res <- read.csv("LB.IJ_vs_LB.WT19606-all.csv") #up307, down358
    32. #res <- read.csv("LB.W1_vs_LB.WT19606-all.csv") #up307, down358
    33. #res <- read.csv("LB.Y1_vs_LB.WT19606-all.csv") #up307, down358
    34. #res <- read.csv("Mac.AB_vs_Mac.WT19606-all.csv") #up307, down358
    35. #res <- read.csv("Mac.IJ_vs_Mac.WT19606-all.csv") #up307, down358
    36. #res <- read.csv("Mac.W1_vs_Mac.WT19606-all.csv") #up307, down358
    37. res <- read.csv("Mac.Y1_vs_Mac.WT19606-all.csv") #up307, down358
    38. # Replace empty GeneName with modified GeneID
    39. res$GeneName <- ifelse(
    40. res$GeneName == "" | is.na(res$GeneName),
    41. gsub("gene-", "", res$GeneID),
    42. res$GeneName
    43. )
    44. # Remove duplicated genes by selecting the gene with the smallest padj
    45. duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    46. res <- res %>%
    47. group_by(GeneName) %>%
    48. slice_min(padj, with_ties = FALSE) %>%
    49. ungroup()
    50. res <- as.data.frame(res)
    51. # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    52. res <- res[order(res$padj, -res$log2FoldChange), ]
    53. # Read eggnog annotations
    54. eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
    55. # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
    56. res$GeneID <- gsub("gene-", "", res$GeneID)
    57. # Merge eggnog data with res based on GeneID
    58. res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))
    59. # Merge with the res dataframe
    60. # Perform the left joins and rename columns
    61. res_updated <- res %>%
    62. left_join(go_terms, by = "GeneID") %>%
    63. left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    64. # Filter up-regulated genes
    65. up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
    66. # Filter down-regulated genes
    67. down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    68. # Create a new workbook
    69. wb <- createWorkbook()
    70. # Add the complete dataset as the first sheet (with annotations)
    71. addWorksheet(wb, "Complete_Data")
    72. writeData(wb, "Complete_Data", res_updated)
    73. # Add the up-regulated genes as the second sheet (with annotations)
    74. addWorksheet(wb, "Up_Regulated")
    75. writeData(wb, "Up_Regulated", up_regulated)
    76. # Add the down-regulated genes as the third sheet (with annotations)
    77. addWorksheet(wb, "Down_Regulated")
    78. writeData(wb, "Down_Regulated", down_regulated)
    79. # Save the workbook to a file
    80. saveWorkbook(wb, "Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx", overwrite = TRUE)
    81. # Set GeneName as row names after the join
    82. rownames(res_updated) <- res_updated$GeneName
    83. res_updated <- res_updated %>% dplyr::select(-GeneName)
    84. ## Set the 'GeneName' column as row.names
    85. #rownames(res_updated) <- res_updated$GeneName
    86. ## Drop the 'GeneName' column since it's now the row names
    87. #res_updated$GeneName <- NULL
    88. # -- BREAK_1 --
    89. # ---- Perform KEGG enrichment analysis (up_regulated) ----
    90. gene_list_kegg_up <- up_regulated$KEGG_ko
    91. gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
    92. kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    93. # -- convert the GeneID (Kxxxxxx) to the true GeneID --
    94. # Step 0: Create KEGG to GeneID mapping
    95. kegg_to_geneid_up <- up_regulated %>%
    96. dplyr::select(KEGG_ko, GeneID) %>%
    97. filter(!is.na(KEGG_ko)) %>% # Remove missing KEGG KO entries
    98. mutate(KEGG_ko = str_remove(KEGG_ko, "ko:")) # Remove 'ko:' prefix if present
    99. # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
    100. kegg_to_geneid_clean <- kegg_to_geneid_up %>%
    101. mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>% # Remove 'ko:' prefixes
    102. separate_rows(KEGG_ko, sep = ",") %>% # Ensure each KEGG ID is on its own row
    103. filter(KEGG_ko != "-") %>% # Remove invalid KEGG IDs ("-")
    104. distinct() # Remove any duplicate mappings
    105. # Step 2.1: Expand geneID column in kegg_enrichment_up
    106. expanded_kegg <- kegg_enrichment_up %>%
    107. as.data.frame() %>%
    108. separate_rows(geneID, sep = "/") %>% # Split multiple KEGG IDs (Kxxxxx)
    109. left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>% # Explicitly handle many-to-many
    110. distinct() %>% # Remove duplicate matches
    111. group_by(ID) %>%
    112. summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop") # Re-collapse results
    113. #dplyr::glimpse(expanded_kegg)
    114. # Step 3.1: Replace geneID column in the original dataframe
    115. kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
    116. # Remove old geneID column and merge new one
    117. kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
    118. dplyr::select(-geneID) %>% # Remove old geneID column
    119. left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>% # Merge new GeneID column
    120. dplyr::rename(geneID = GeneID) # Rename column back to geneID
    121. # ---- Perform KEGG enrichment analysis (down_regulated) ----
    122. # Step 1: Extract KEGG KO terms from down-regulated genes
    123. gene_list_kegg_down <- down_regulated$KEGG_ko
    124. gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
    125. # Step 2: Perform KEGG enrichment analysis
    126. kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
    127. # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
    128. # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
    129. kegg_to_geneid_down <- down_regulated %>%
    130. dplyr::select(KEGG_ko, GeneID) %>%
    131. filter(!is.na(KEGG_ko)) %>% # Remove missing KEGG KO entries
    132. mutate(KEGG_ko = str_remove(KEGG_ko, "ko:")) # Remove 'ko:' prefix if present
    133. # -- BREAK_2 --
    134. # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
    135. kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
    136. mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>% # Remove 'ko:' prefixes
    137. separate_rows(KEGG_ko, sep = ",") %>% # Ensure each KEGG ID is on its own row
    138. filter(KEGG_ko != "-") %>% # Remove invalid KEGG IDs ("-")
    139. distinct() # Remove duplicate mappings
    140. # Step 5: Expand geneID column in kegg_enrichment_down
    141. expanded_kegg_down <- kegg_enrichment_down %>%
    142. as.data.frame() %>%
    143. separate_rows(geneID, sep = "/") %>% # Split multiple KEGG IDs (Kxxxxx)
    144. left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>% # Handle many-to-many mappings
    145. distinct() %>% # Remove duplicate matches
    146. group_by(ID) %>%
    147. summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop") # Re-collapse results
    148. # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
    149. kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
    150. dplyr::select(-geneID) %>% # Remove old geneID column
    151. left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>% # Merge new GeneID column
    152. dplyr::rename(geneID = GeneID) # Rename column back to geneID
    153. # View the updated dataframe
    154. head(kegg_enrichment_down_df)
    155. # Create a new workbook
    156. wb <- createWorkbook()
    157. # Save enrichment results to the workbook
    158. addWorksheet(wb, "KEGG_Enrichment_Up")
    159. writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
    160. # Save enrichment results to the workbook
    161. addWorksheet(wb, "KEGG_Enrichment_Down")
    162. writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
    163. # Define gene list (up-regulated genes)
    164. gene_list_go_up <- up_regulated$GeneID # Extract the 149 up-regulated genes
    165. gene_list_go_down <- down_regulated$GeneID # Extract the 65 down-regulated genes
    166. # Define background gene set (all genes in res)
    167. background_genes <- res_updated$GeneID # Extract the 3646 background genes
    168. # Prepare GO annotation data from res
    169. go_annotation <- res_updated[, c("GOs","GeneID")] # Extract relevant columns
    170. go_annotation <- go_annotation %>%
    171. tidyr::separate_rows(GOs, sep = ",") # Split multiple GO terms into separate rows
    172. # -- BREAK_3 --
    173. go_enrichment_up <- enricher(
    174. gene = gene_list_go_up, # Up-regulated genes
    175. TERM2GENE = go_annotation, # Custom GO annotation
    176. pvalueCutoff = 0.05, # Significance threshold
    177. pAdjustMethod = "BH",
    178. universe = background_genes # Define the background gene set
    179. )
    180. go_enrichment_up <- as.data.frame(go_enrichment_up)
    181. go_enrichment_down <- enricher(
    182. gene = gene_list_go_down, # Up-regulated genes
    183. TERM2GENE = go_annotation, # Custom GO annotation
    184. pvalueCutoff = 0.05, # Significance threshold
    185. pAdjustMethod = "BH",
    186. universe = background_genes # Define the background gene set
    187. )
    188. go_enrichment_down <- as.data.frame(go_enrichment_down)
    189. ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    190. #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
    191. # Update the Description column with the term descriptions
    192. go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
    193. # Using select to get the term description
    194. term <- tryCatch({
    195. AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    196. }, error = function(e) {
    197. message(paste("Error for GO term:", go_id)) # Print which GO ID caused the error
    198. return(data.frame(TERM = NA)) # In case of error, return NA
    199. })
    200. if (nrow(term) > 0) {
    201. return(term$TERM)
    202. } else {
    203. return(NA) # If no description found, return NA
    204. }
    205. })
    206. ## Print the updated data frame
    207. #print(go_enrichment_up)
    208. ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    209. #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    210. # Update the Description column with the term descriptions
    211. go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
    212. # Using select to get the term description
    213. term <- tryCatch({
    214. AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    215. }, error = function(e) {
    216. message(paste("Error for GO term:", go_id)) # Print which GO ID caused the error
    217. return(data.frame(TERM = NA)) # In case of error, return NA
    218. })
    219. if (nrow(term) > 0) {
    220. return(term$TERM)
    221. } else {
    222. return(NA) # If no description found, return NA
    223. }
    224. })
    225. addWorksheet(wb, "GO_Enrichment_Up")
    226. writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    227. addWorksheet(wb, "GO_Enrichment_Down")
    228. writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    229. # Save the workbook with enrichment results
    230. saveWorkbook(wb, "KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx", overwrite = TRUE)
    231. #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
    232. #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
    233. #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!
    234. #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
    235. #Mac_vs_LB
    236. #LB.AB_vs_LB.WT19606
    237. #LB.IJ_vs_LB.WT19606
    238. #LB.W1_vs_LB.WT19606
    239. #LB.Y1_vs_LB.WT19606
    240. #Mac.AB_vs_Mac.WT19606
    241. #Mac.IJ_vs_Mac.WT19606
    242. #Mac.W1_vs_Mac.WT19606
    243. #Mac.Y1_vs_Mac.WT19606
  5. (DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

    1. library(openxlsx)
    2. # Function to read and clean gene ID files
    3. read_gene_ids <- function(file_path) {
    4. # Read the gene IDs from the file
    5. gene_ids <- readLines(file_path)
    6. # Remove any quotes and trim whitespaces
    7. gene_ids <- gsub('"', '', gene_ids) # Remove quotes
    8. gene_ids <- trimws(gene_ids) # Trim whitespaces
    9. # Remove empty entries or NAs
    10. gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]
    11. return(gene_ids)
    12. }
    13. # Example list of LB files with both -up.id and -down.id for each condition
    14. lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
    15. "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
    16. lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
    17. "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")
    18. # Combine both up and down files for each condition
    19. lb_files <- c(lb_files_up, lb_files_down)
    20. # Read gene IDs for each file in LB group
    21. #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
    22. lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))
    23. lb_degs_ <- list()
    24. combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
    25. #unique_combined_set <- unique(combined_set)
    26. lb_degs_$AB <- combined_set
    27. combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
    28. lb_degs_$IJ <- combined_set
    29. combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
    30. lb_degs_$W1 <- combined_set
    31. combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
    32. lb_degs_$Y1 <- combined_set
    33. # Example list of Mac files with both -up.id and -down.id for each condition
    34. mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
    35. "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
    36. mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
    37. "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")
    38. # Combine both up and down files for each condition in Mac group
    39. mac_files <- c(mac_files_up, mac_files_down)
    40. # Read gene IDs for each file in Mac group
    41. mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))
    42. mac_degs_ <- list()
    43. combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
    44. mac_degs_$AB <- combined_set
    45. combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
    46. mac_degs_$IJ <- combined_set
    47. combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
    48. mac_degs_$W1 <- combined_set
    49. combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
    50. mac_degs_$Y1 <- combined_set
    51. # Function to clean sheet names to ensure no sheet name exceeds 31 characters
    52. truncate_sheet_name <- function(names_list) {
    53. sapply(names_list, function(name) {
    54. if (nchar(name) > 31) {
    55. return(substr(name, 1, 31)) # Truncate sheet name to 31 characters
    56. }
    57. return(name)
    58. })
    59. }
    60. # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)
    61. # Define intersections between different conditions for LB
    62. inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
    63. inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
    64. inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
    65. inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
    66. inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
    67. inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)
    68. # Define intersections between three conditions for LB
    69. inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
    70. inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
    71. inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
    72. inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))
    73. # Define intersection between all four conditions for LB
    74. inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))
    75. # Now remove the intersected genes from each original set for LB
    76. venn_list_lb <- list()
    77. # For LB.AB, remove genes that are also in other conditions
    78. venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))
    79. # For LB.IJ, remove genes that are also in other conditions
    80. venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))
    81. # For LB.W1, remove genes that are also in other conditions
    82. venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))
    83. # For LB.Y1, remove genes that are also in other conditions
    84. venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))
    85. # Add the intersections for LB (same as before)
    86. venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
    87. venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
    88. venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
    89. venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
    90. venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
    91. venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1
    92. # Define intersections between three conditions for LB
    93. venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
    94. venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
    95. venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
    96. venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1
    97. # Define intersection between all four conditions for LB
    98. venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1
    99. # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)
    100. # Define intersections between different conditions
    101. inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
    102. inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
    103. inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
    104. inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
    105. inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
    106. inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)
    107. # Define intersections between three conditions
    108. inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
    109. inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
    110. inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
    111. inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))
    112. # Define intersection between all four conditions
    113. inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))
    114. # Now remove the intersected genes from each original set
    115. venn_list_mac <- list()
    116. # For Mac.AB, remove genes that are also in other conditions
    117. venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))
    118. # For Mac.IJ, remove genes that are also in other conditions
    119. venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))
    120. # For Mac.W1, remove genes that are also in other conditions
    121. venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))
    122. # For Mac.Y1, remove genes that are also in other conditions
    123. venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))
    124. # Add the intersections (same as before)
    125. venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
    126. venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
    127. venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
    128. venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
    129. venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
    130. venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1
    131. # Define intersections between three conditions
    132. venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
    133. venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
    134. venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
    135. venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1
    136. # Define intersection between all four conditions
    137. venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1
    138. # Save the gene IDs to Excel for further inspection (optional)
    139. write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
    140. write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")
    141. # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
    142. write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
    143. write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)
    144. # Venn Diagram for LB group
    145. venn1 <- ggvenn(lb_degs_,
    146. fill_color = c("skyblue", "tomato", "gold", "orchid"),
    147. stroke_size = 0.4,
    148. set_name_size = 5)
    149. ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)
    150. # Venn Diagram for Mac group
    151. venn2 <- ggvenn(mac_degs_,
    152. fill_color = c("lightgreen", "slateblue", "plum", "orange"),
    153. stroke_size = 0.4,
    154. set_name_size = 5)
    155. ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)
    156. cat("✅ All Venn intersection sets exported to Excel successfully.\n")

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum