1 Introduction

This document includes explanations and instructions for how to run the scripts (available here) to obtain the results that were described in the article. The authors have made effort to document and comment the code. However, if you find something you don’t fully understand, please, do not hesitate to contact Julia Romanowska.

The code uses the newest versions of several libraries. These can be installed from the net.

require( ff )
require( tidyr )
require( plyr )
require( dplyr )
require( ggplot2 )
require( ggrepel )
require( biomaRt )
require( Haplin )

Additionally, the code uses special functions to manage the DNA methylation data. Those functions were implemented in the development version of Haplin. Since this has not been yet well tested, we include it as an additional package to download from our webpage (https://folk.uib.no/gjessing/genetics/software/haplin/) or the bitbucket repo (https://bitbucket.org/jrom/haplinmethyl/) and install after installing main Haplin.

if( !requireNamespace( "Haplin", quietly = TRUE ) ){
  install.packages( "./HaplinMethyl.tgz", repos = NULL )
}
library( Haplin_methyl )

2 Reading the data

2.1 Chosen SNPs

In the analyses, we chose several SNPs that had been previously shown to be associated with clefting in Moreno Uribe, L. M., et al. (2017). A Population-Based Study of Effects of Genetic Loci on Orofacial Clefts. Journal of Dental Research, 96(11), 1322–1329. To get the correct positions, we ran this code and created a following table with the SNP names, their positions on the genome, and the nearest gene, as annotated in the mentioned article.

Chr	snpname	pos	gene_name
1	rs742071	18979874	PAX7
1	rs560426	94553438	ABCA4
1	rs642961	209989270	IRF6
2	rs7590268	43540125	THADA
8	rs12543318	88868340	8q21.3
8	rs987525	129946154	8q24
9	rs3758249	100614140	FOXE1
10	rs7078160	118827560	KIAA1598
13	rs8001641	80692811	SPRY2
15	rs1873147	63312632	TPM1
17	rs227731	54773238	NOG1
20	rs13041247	39269074	MAFB

2.2 Getting the nearest CpGs

The analyses require only a small subset of DNA methylation data at a time, therefore, we prepared it beforehand: for each of the chosen SNPs, we gathered only the CpGs from the chip that were within 50 kbp, and classified those into promoter, enhancer, or gene body regions, by querying ensembl regulation and ensembl gene databases, respectively. Check this code for details.

2.3 Genotype data

The genetic data was read from a .csv file and checked. All the genetic information was recoded to a factor with the subsequent levels: A, T, C, G, and NA. Next, to prepare the data to be read in by Haplin, we rearranged the columns so that it contained the alleles in the subsequent order: a1_mother, a2_mother, a1_father, a2_father, a1_child, a2_child. Note that since we do not have data from the fathers, both columns were filled with NA.

The data was then divided into subsets according to which orofacial cleft (OFC) subtype the child had: CLO (cleft lip only) dataset, CLP (cleft lip and palate), the joined CL/P (cleft lip with or without cleft palate), CPO (cleft palate only), or the control (CTRL) dataset. These subsets were saved as .RData and .ffData files, to be read in easily by Haplin.

Finally, we performed matching of IDs from the genetic and the DNA methylation datasets, so that we could link the correct data to the individuals. The IDs were not subsetted in the data files, but instead, an ID-matching data.frame was created and saved in separate files per each OFC subtype.

For details of those preparations contact the authors.

2.4 DNA methylation data

We created data.frames that consisted of only the necessary data for the subsequent analyses, among others: individual’s ID, CpG name, DNA methylation beta value. Each OFC subtype and controls had a separate data.frame. We also used the information on which CpG is in a promoter, enhancer or gene body, to create matrices with only those CpGs as columns and individuals as rows (separate matrix for each SNP). These matrices where then directly used in the analyses scripts.

Check here and here for details.

2.4.1 Gap hunting

As detailed in the “Discussion” in the main text and in the Supplementary Material to our article, we checked also for any multi-modal DNA methylation signals in our data. For this purpose, we needed to modify a bit the gaphunter function from the minfi package. The modified version is available here. For the output of this sub-analysis, see below, Fig.S4.

3 Main analyses

The main analyses were conducted separately on the CLO dataset, CLP, the joined CL/P, CPO, and the control (CTRL) dataset. On each of these datasets, we conducted GxMe and PoOxMe analyses separately for CpGs within the nearest promoter, enhancer or gene body regions.

We used one source Rmd file where each time we defined which dataset would be used. Below, you can see the output:

Click here for the raw output from CLO analysis
Click here for the raw output from CLP analysis
Click here for the raw output from CL/P analysis
Click here for the raw output from CPO analysis
Click here for the raw output from CTRL analysis

Additionally, we have run the same analysis on 20 randomly chosen SNPs.

3.1 PoO scan

Moreover, we ran a PoO scan and took the top 20 hits (i.e., the SNPs with the lowest p-values) as input into a subsequent PoOxMe analysis. Here, we needed to re-use and re-run the scripts used above in the Reading the data section. Check the output of the PoOxMe analysis for the CLP dataset and the control dataset.

4 Post-analysis and plotting of results

Below, is a list of the files that were used to produce each of the figures from the article:

All average beta distributions (i.e., Figures 1, and S5 - S8)
All the Q-Q plots (i.e., Figures 3, 6, 7, and 9)
Interesting GxMe results (i.e., Figures 4, 5, and S9)
Interesting PoOxMe results (i.e., Figures 8, S10, S11, and S12)
Figure S1
Figure S3
Figure S4
Figure S13
Figure S14
Figure S15

Code README

Julia Romanowska

16 October 2019