Caleb’s Presentation - Rentrez

1) Create a search of a topic, gene, or organism

2) Use the summary function to extract relevant information

3) Create a plot to see how this topic has been reported over a particular time span

Install and load the following libraries:

require(rentrez)
## Loading required package: rentrez
## Warning: package 'rentrez' was built under R version 4.2.2
require(glue)
## Loading required package: glue
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
require(ggplot2)

Question #1

First look at the list of NCBI databases

entrez_dbs()
##  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
##  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
##  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
## [13] "books"           "cdd"             "clinvar"         "gap"            
## [17] "gapplus"         "grasp"           "dbvar"           "gene"           
## [21] "gds"             "geoprofiles"     "homologene"      "medgen"         
## [25] "mesh"            "nlmcatalog"      "omim"            "orgtrack"       
## [29] "pmc"             "popset"          "proteinclusters" "pcassay"        
## [33] "protfam"         "pccompound"      "pcsubstance"     "seqannot"       
## [37] "snp"             "sra"             "taxonomy"        "biocollections" 
## [41] "gtr"

Next, request a summary of one particular database; I chose the nucleotide database, “nuccore”

entrez_db_summary("nuccore")
##  DbName: nuccore
##  MenuName: Nucleotide
##  Description: Core Nucleotide db
##  DbBuild: Build221113-0445m.1
##  Count: 509183547
##  LastUpdate: 2022/11/15 09:50

Examine how many nucleotide hits there are for opsin… too many!

opsin_search<- entrez_search(db = "nuccore", term = "opsin")
opsin_search
## Entrez search result with 35130 hits (object contains 20 IDs and no web_history object)
##  Search term (as translated):  opsin[All Fields]

Let’s narrow it down
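One way to narrow a search like this is to add field tags to the term. The sketch below is illustrative and not part of the original analysis: the [Title] and [Organism] tags and the Amphibia filter are assumptions chosen for the example, and entrez_db_searchable() lists the fields a database actually accepts. The network calls sit inside if (interactive()) so the block can be sourced offline; drop the guard when knitting.

```r
# Build a narrower term: keyword restricted to the title, plus an organism filter.
# Both tags are assumptions for illustration; confirm them with entrez_db_searchable().
narrow_term <- paste("opsin[Title]", "Amphibia[Organism]", sep = " AND ")

if (interactive()) {
  library(rentrez)
  # List the search fields that the nuccore database supports
  print(entrez_db_searchable("nuccore"))
  # Run the narrower search and compare its hit count with the broad search
  narrowed <- entrez_search(db = "nuccore", term = narrow_term)
  print(narrowed$count)
}
```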

Question #2

Create a search that returns a list of IDs for PubMed articles with the word opsin in the title

Return the number of IDs

Run the PubMed search first. Passing the IDs from the earlier nuccore search to entrez_summary() with db = "pubmed" produces the warning 'cannot get document summary' for every ID, because those are nucleotide record IDs, not PubMed IDs.

opsin_search <- entrez_search(db = "pubmed", term = "opsin [TITLE] AND amphibians")
opsin_search
## Entrez search result with 58 hits (object contains 20 IDs and no web_history object)
##  Search term (as translated):  opsin[TITLE] AND ("amphibians"[MeSH Terms] OR "amp ...

Then summarise the PubMed IDs that the search returned

entrez_summary(db = "pubmed", id = opsin_search$ids)

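The search matched 58 PubMed articles, but the result object holds only 20 IDs (the number returned by default), so entrez_summary() gives 20 summary records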
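To collect all 58 IDs and pull a single field out of each summary, something along these lines should work. It is a minimal sketch: summary_field() is a hypothetical helper name, and the choices of retmax = 100 and the "title" field are assumptions for illustration. entrez_search(), entrez_summary() and extract_from_esummary() are the rentrez functions doing the work.

```r
# Hypothetical helper: search PubMed, then extract one field from each summary.
# retmax raises the number of IDs returned above the default of 20.
summary_field <- function(term, field = "title", retmax = 100) {
  hits <- rentrez::entrez_search(db = "pubmed", term = term, retmax = retmax)
  summaries <- rentrez::entrez_summary(db = "pubmed", id = hits$ids)
  rentrez::extract_from_esummary(summaries, field)
}

# Network calls are guarded so the block can be sourced offline
if (interactive()) {
  titles <- summary_field("opsin[TITLE] AND amphibians")
  print(length(titles))  # one entry per ID returned by the search
  print(head(titles))
}
```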

Now we can implement the fetch command

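Fetch retrieves the complete record as text, and rettype selects the format. Note that "fasta" is a format for sequence databases such as nuccore; PubMed holds citations, not sequences, so its documented formats are ones like "abstract" or "xml"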

all_opsins<-entrez_fetch(db= "pubmed", id= opsin_search$ids, rettype = "fasta")
class(all_opsins)
## [1] "character"

nchar() gives the number of characters in the fetched text

nchar(all_opsins)
## [1] 37938

Export the data collected, if desired

write(all_opsins, file = "amphibian_opsin.fasta")
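For actual sequence data, the same fetch pattern can be pointed at nuccore. The sketch below assumes a hypothetical helper fetch_fasta(), an illustrative search term, and a small retmax to keep the download short; count_fasta_records() is also hypothetical and simply counts header lines, since each FASTA record starts with ">".

```r
# Hypothetical helper: search nuccore and fetch the matching records as FASTA text
fetch_fasta <- function(term, n = 5) {
  hits <- rentrez::entrez_search(db = "nuccore", term = term, retmax = n)
  rentrez::entrez_fetch(db = "nuccore", id = hits$ids, rettype = "fasta")
}

# Count the records in a FASTA string by counting lines that begin with ">"
count_fasta_records <- function(fasta_text) {
  lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
  sum(startsWith(lines, ">"))
}

# Network calls are guarded so the block can be sourced offline
if (interactive()) {
  # The term is an assumption chosen for illustration
  opsin_fasta <- fetch_fasta("opsin[Title] AND Amphibia[Organism]")
  print(count_fasta_records(opsin_fasta))
  write(opsin_fasta, file = "amphibian_opsin_nuccore.fasta")
}
```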

Question #3

I studied the SHH (Sonic Hedgehog) gene for my master’s degree, so I wanted to compare the number of PubMed articles per year with sonic hedgehog in the title against the number with opsin in the title, since opsin genes are my current area of study

year <- 1960:2022
opsin_search <- glue("opsin[TITLE] AND {year}[PDAT]")
SHH_search <- glue("sonic hedgehog [TITLE] AND {year}[PDAT]")

search_counts <- tibble(year = year,
                        opsin_search = opsin_search, 
                        SHH_search = SHH_search) %>% 
  mutate(opsin = map_dbl(opsin_search, ~entrez_search(db="pubmed",
                                                      term = .x)$count),
         SHH = map_dbl(SHH_search, ~entrez_search(db="pubmed",
                                                  term = .x)$count))
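The per-year query construction can also be wrapped in small functions, which makes the term template easy to inspect before any requests are sent. This is a sketch under assumptions: year_terms() and count_hits() are hypothetical helper names, and sprintf() stands in for glue() so the block needs no extra packages. Setting retmax = 0 asks only for the hit count, without returning any IDs.

```r
# Hypothetical helper: one PubMed term per year, restricted to article titles
year_terms <- function(keyword, years) {
  sprintf("%s[TITLE] AND %d[PDAT]", keyword, years)
}

# Hypothetical helper: number of PubMed hits for one term (no IDs requested)
count_hits <- function(term) {
  rentrez::entrez_search(db = "pubmed", term = term, retmax = 0)$count
}

# Inspect the generated terms before querying NCBI
print(year_terms("opsin", 1960:1962))

# Network calls are guarded so the block can be sourced offline
if (interactive()) {
  terms <- year_terms("sonic hedgehog", 2020:2022)
  counts <- vapply(terms, count_hits, numeric(1))
  print(counts)
}
```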


search_counts %>% 
  select(year, opsin, SHH) %>% 
  pivot_longer(-year) %>% 
  ggplot(aes(x = year,
             y = value,
             group = name,
             color = name))+
  ylab("Search Count") +
  xlab("Year") +
  geom_line()+
  geom_smooth()+
  geom_point(color = "black")+
  theme_bw()+
  ggtitle("Comparison of Sonic Hedgehog vs Opsin Gene Searches")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'