1 Introduction

This document includes explanations and instructions for how to run the scripts (available here) to obtain the results that were described in the article. The authors have made effort to document and comment the code. However, if you find something you don’t fully understand, please, do not hesitate to contact Julia Romanowska.

The code uses the newest versions of several libraries. These can be installed from the net.

Additionally, the code uses special functions to manage the DNA methylation data. Those functions were implemented in the development version of Haplin. Since this has not been yet well tested, we include it as an additional package to download from our webpage (https://folk.uib.no/gjessing/genetics/software/haplin/) or the bitbucket repo (https://bitbucket.org/jrom/haplinmethyl/) and install after installing main Haplin.

2 Reading the data

2.1 Chosen SNPs

In the analyses, we chose several SNPs that had been previously shown to be associated with clefting in Moreno Uribe, L. M., et al. (2017). A Population-Based Study of Effects of Genetic Loci on Orofacial Clefts. Journal of Dental Research, 96(11), 1322–1329. To get the correct positions, we ran this code and created a following table with the SNP names, their positions on the genome, and the nearest gene, as annotated in the mentioned article.

Chr snpname pos gene_name
1 rs742071 18979874 PAX7
1 rs560426 94553438 ABCA4
1 rs642961 209989270 IRF6
2 rs7590268 43540125 THADA
8 rs12543318 88868340 8q21.3
8 rs987525 129946154 8q24
9 rs3758249 100614140 FOXE1
10 rs7078160 118827560 KIAA1598
13 rs8001641 80692811 SPRY2
15 rs1873147 63312632 TPM1
17 rs227731 54773238 NOG1
20 rs13041247 39269074 MAFB

2.2 Getting the nearest CpGs

The analyses require only a small subset of DNA methylation data at a time, therefore, we prepared it beforehand: for each of the chosen SNPs, we gathered only the CpGs from the chip that were within 50 kbp, and classified those into promoter, enhancer, or gene body regions, by querying ensembl regulation and ensembl gene databases, respectively. Check this code for details.

2.3 Genotype data

The genetic data was read from a .csv file and checked. All the genetic information was recoded to a factor with the subsequent levels: A, T, C, G, and NA. Next, to prepare the data to be read in by Haplin, we rearranged the columns so that it contained the alleles in the subsequent order: a1_mother, a2_mother, a1_father, a2_father, a1_child, a2_child. Note that since we do not have data from the fathers, both columns were filled with NA.

The data was then divided into subsets according to which orofacial cleft (OFC) subtype the child had: CLO (cleft lip only) dataset, CLP (cleft lip and palate), the joined CL/P (cleft lip with or without cleft palate), CPO (cleft palate only), or the control (CTRL) dataset. These subsets were saved as .RData and .ffData files, to be read in easily by Haplin.

Finally, we performed matching of IDs from the genetic and the DNA methylation datasets, so that we could link the correct data to the individuals. The IDs were not subsetted in the data files, but instead, an ID-matching data.frame was created and saved in separate files per each OFC subtype.

For details of those preparations contact the authors.

2.4 DNA methylation data

We created data.frames that consisted of only the necessary data for the subsequent analyses, among others: individual’s ID, CpG name, DNA methylation beta value. Each OFC subtype and controls had a separate data.frame. We also used the information on which CpG is in a promoter, enhancer or gene body, to create matrices with only those CpGs as columns and individuals as rows (separate matrix for each SNP). These matrices where then directly used in the analyses scripts.

Check here and here for details.

2.4.1 Gap hunting

As detailed in the “Discussion” in the main text and in the Supplementary Material to our article, we checked also for any multi-modal DNA methylation signals in our data. For this purpose, we needed to modify a bit the gaphunter function from the minfi package. The modified version is available here. For the output of this sub-analysis, see below, Fig.S4.

3 Main analyses

The main analyses were conducted separately on the CLO dataset, CLP, the joined CL/P, CPO, and the control (CTRL) dataset. On each of these datasets, we conducted GxMe and PoOxMe analyses separately for CpGs within the nearest promoter, enhancer or gene body regions.

We used one source Rmd file where each time we defined which dataset would be used. Below, you can see the output:

Additionally, we have run the same analysis on 20 randomly chosen SNPs.

3.1 PoO scan

Moreover, we ran a PoO scan and took the top 20 hits (i.e., the SNPs with the lowest p-values) as input into a subsequent PoOxMe analysis. Here, we needed to re-use and re-run the scripts used above in the Reading the data section. Check the output of the PoOxMe analysis for the CLP dataset and the control dataset.

4 Post-analysis and plotting of results

Below, is a list of the files that were used to produce each of the figures from the article: