Academia.eduAcademia.edu
ARTICLE Received 6 Aug 2013 | Accepted 19 Nov 2013 | Published 14 Jan 2014 DOI: 10.1038/ncomms3957 OPEN The locust genome provides insight into swarm formation and long-distance flight Xianhui Wang1, Xiaodong Fang2, Pengcheng Yang1,3, Xuanting Jiang2, Feng Jiang1,3, Dejian Zhao1, Bolei Li1, Feng Cui1, Jianing Wei1, Chuan Ma1,3, Yundan Wang1,3, Jing He1, Yuan Luo1, Zhifeng Wang1, Xiaojiao Guo1, Wei Guo1, Xuesong Wang1,3, Yi Zhang1, Meiling Yang1, Shuguang Hao1, Bing Chen1, Zongyuan Ma1,3, Dan Yu1, Zhiqiang Xiong2, Yabing Zhu2, Dingding Fan2, Lijuan Han2, Bo Wang2, Yuanxin Chen2, Junwen Wang2, Lan Yang2, Wei Zhao2, Yue Feng2, Guanxing Chen2, Jinmin Lian2, Qiye Li2, Zhiyong Huang2, Xiaoming Yao2, Na Lv4, Guojie Zhang2, Yingrui Li2, Jian Wang2, Jun Wang2, Baoli Zhu4 & Le Kang1,3 Locusts are one of the world’s most destructive agricultural pests and represent a useful model system in entomology. Here we present a draft 6.5 Gb genome sequence of Locusta migratoria, which is the largest animal genome sequenced so far. Our findings indicate that the large genome size of L. migratoria is likely to be because of transposable element proliferation combined with slow rates of loss for these elements. Methylome and transcriptome analyses reveal complex regulatory mechanisms involved in microtubule dynamic-mediated synapse plasticity during phase change. We find significant expansion of gene families associated with energy consumption and detoxification, consistent with long-distance flight capacity and phytophagy. We report hundreds of potential insecticide target genes, including cys-loop ligand-gated ion channels, G-protein-coupled receptors and lethal genes. The L. migratoria genome sequence offers new insights into the biology and sustainable management of this pest species, and will promote its wide use as a model system. 1 State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China. 2 BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China. 3 Beijing Institutes of Life Science, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China. 4 CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China. Correspondence and requests for materials should be addressed to L.K. (email: lkang@ioz.ac.cn). NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. 1 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 S ince the dawn of agrarian civilization, locust plagues have been viewed as one of the most devastating natural disasters, linked with famine, strife and dissolution of societal order as documented in the Bible, the Qur’an and Chinese historical records1,2. Unfortunately, locust plagues continue to cause destruction even today. For example, in 1988, locust swarms covered an enormous area of some 29 million square kilometres, extending over or into parts of up to 60 countries, resulting in billions of dollars in economic losses3. The primary current method for combating locust outbreaks is through intensive spraying of chemical pesticides; however, their overall usefulness has been widely debated because of their highly negative impact on human and environmental health, and on biological diversity2. Locusts are grasshopper species with swarming and longdistance migratory behaviours1. Locust swarms form suddenly and unpredictably through the congregation of billions of insects, which can fly hundreds of kilometres each day, and even cross oceans4. They are polyphagous and a single individual consumes its own body weight in food in 1 day; this is, proportionately, 60–100 times a human’s daily consumption1. Locusts are characterized by a density-dependent phase polyphenism, which involves a variety of biological and phenotypic traits, including changes in body colour, morphology, behaviour, physiology, immune responsiveness and others5–9. An increase in population density triggers the transformation from a well-hidden solitarious phase to an overtly noticeable gregarious phase, which results in an extensive aggregation of individuals1. This phase change occurs quickly and reversibly, and its speed varies among species10. The entire transformation process of phase change pivots around a behavioural change, which has been proposed to be regulated by the peripheral and central nervous system (CNS)11–13. The migratory locust, Locusta migratoria, is the most widespread locust species and has long served as a model organism for insect morphology, behaviour and physiological research14–16. Here we provide a draft genome sequence of L. migratoria, the largest animal genome sequenced so far. We assess changes of gene families related to long-distance migration, feeding and other biological processes unique to the locust and identify genes that might serve as potential pesticide targets. Combining a set of transcriptome and methylome data from gregarious and solitarious locusts, we reveal potential neuronal regulatory mechanisms underlying phase change in the locust. Results Genome assembly. We sequenced the genome of a single, eightgeneration inbred female individual of L. migratoria, worldwide distributed agricultural pests (Supplementary Fig. S1), using the Illumina HiSeq 2000 sequencing platform. After quality control and filtering, 721 Gb of data were generated, covering 114  of the 6.3 Gb L. migratoria genome size as estimated by k-mer analysis and flow cytometry (Supplementary Tables S1 and S2, and Supplementary Figs S2 and S3). We used SOAPdenovo17 to perform de novo assembly, achieving a final assembly of 6.5 Gb with a length-weighted median (N50) contig size of 9.3 kb and scaffold N50 of 320.3 kb (Supplementary Table S3). A genetic map containing 11 major linkage groups was further developed based on 8,708 markers using restriction-site-associated DNA sequencing data (Fig. 1a). The total heterozygosity rate of the L. migratoria genome, which is the portion of heterozygous single-nucleotide polymorphisms between the two haploid components in the diploid genome, was estimated to be 1.15  10 3. We also observed a heterogeneous distribution of heterozygosity rates (Fig. 1a, track 2 b), which possibly resulted from inbreeding18. The possible reasons are the existence of alleles, which are homozygous lethal and the variation of recombination rates between different genomic regions19,20. We assessed the integrity and quality of the genome assembly using multiple evaluation methods (Supplementary Figs S4 and S5, and Supplementary Tables S4–S7). On the basis of these analyses, our assembly provides a good representation of the L. migratoria genome and is of suitable quality for subsequent analysis. Genome annotation and evolutionary analysis. We annotated the genome and predicted 17,307 gene models by combining de novo prediction and evidence-based searches using four sequenced insect reference gene sets (Drosophila melanogaster, Apis mellifera, Acyrthosiphon pisum and Pediculus humanus), L. migratoria expressed sequence tags (ESTs) and RNA-seq data from transcriptomes that we created previously and for this work from multiple organs and developmental stages (Supplementary Methods and Supplementary Table S8). We found that 93.8% of gene models showed expression (reads per kb per million mapped reads 41 in at least one sample). Of the inferred proteins, 74.9% matched entries in the NR, SWISS-PROT, InterPro or TrEMBL databases (Supplementary Table S9). We identified over 2,639 repeat families using RepeatModeler; however, the top ten repeat families only represented B10% of the total genome sequences, indicating there were no dominant families in the L. migratoria genome (Supplementary Tables S10 and S11). The LINE RTE/BovB family, which has been documented in many species as having been acquired through an ancient lateral gene transfer event21, is the most prevalent repeat family (244 Mb, 4.05%) in the L. migratoria genome, and it may be still active as indicated by the presence of many fulllength copies and intact protein domains22. The proliferation of a diverse range of repetitive elements is the main reason for the large size of the L. migratoria genome. Repetitive elements constituted B60% of the assembled genome, of which DNA transposons (B24%) and LINE retrotransposons (17%) were the most abundant elements (Fig. 1a, track c, Supplementary Note 1 and Supplementary Table S10). Among all transposons, DNA transposon exhibits the lowest divergent copies (Fig. 1a, track d), which reflects their most recent invasions in the locust genome. On the basis of homologous regions of a conserved and syntenic Osiris gene family among insect orders23, we compared genomic components of L. migratoria with that of D. melanogaster (Supplementary Fig. S6). The length of coding region has no difference, but the length of intronic, intergenic regions and repetitive elements in the L. migratoria genome is much larger than that of D. melanogaster (Fig. 1c). We compared the L. migratoria DNA deletion rates with that of five other insect genomes (D. melanogaster, Anopheles gambiae, Bombyx mori, A. pisum and Aedes aegypti) focusing on long terminal repeat retrotransposons, because these have neutral decay rates and easily identified complete structures24. We observed a positive correlation between DNA deletion and divergence of long terminal repeat copies (Po0.01, Pearson’s correlation tests). The L. migratoria genome exhibits the lowest rate of DNA deletions relative to the other insects (Po0.01, Wilcoxon tests; Fig. 1d). These results are consistent with the previous reports and support the hypothesis that slow DNA loss contributes to genomic gigantism in animals25. The average intron length in the L. migratoria genome was 11,159 bp, which is 410 times longer than the average intron size of other insects and twice that of the average size in Homo sapiens (Supplementary Table S8 and Supplementary Fig. S7). Pairwise comparison of intron sizes in nine insect species using one-to-one NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. ARTICLE 200 Mb a 1 b 2 d e 0.2 3 7 6 20 Deletion rates (%) 8 6 LMI DME Length in log2 unit 9 BMO AGA 180 Mb 4 5 API 110 Mb PHU 295 Mb NVI 236 Mb AME 160 Mb TCA 429 Mb 278Mb c 11 10 DPU 6,500 Mb LMI 467 Mb 10 4 DME LMI DME AGA BMO API AME 2 0 od i In ng tro n te ic rg e n R ep ic et iti ve 0 5 10 15 20 Divergence (%) 25 In C Holometabolous Hemimetabolous NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 Figure 1 | Locust genomic characterization and comparative analysis of insect genomes. (a) A schematic representation of genomic characteristics of the locust pseudo-chromosomes (in an Mb scale). Track a: 11 linkage groups of the locust genome, grey denotes unmapped scaffolds. Track b: coverage of sequencing depth (blue) and distribution of allelic heterozygous single-nucleotide polymorphisms (SNPs) between the two haploids (orange). Track c: density distribution of three dominant subclasses of repetitive element (DNA transposons, red; LINEs, green; long terminal repeats (LTRs), blue). Track d: divergence rates of three dominant subclasses of repetitive element (DNA transposons, red; LINEs, green; LTRs, blue). Track e: ratio of observed to expected frequency of CpG dinucleotides in the coding region of gene sets (orange) and in the whole genomic region (violet). (b) Phylogenetic relationships inferred based on the concatenated data set from universal single-copy genes. The number in the branches indicates the genome sizes. (c) Size expansion of genomic components of a homologous and syntenic locus between the locust and fly genome inferred from the Osiris genes, a conserved gene family across insects. (d) Deletion rates across five insect genomes that contain a substantial fraction of LTR retrotransposon copies. Because of their relaxed selection pressures and reputed neutral decay rates, LTR retrotransposon copies were used to estimate deletion rates. RepeatMasker searches were performed to determine the deletion rates of genomic retrotransposon copies using intact LTR retrotransposons as a query. The genomic visualization was created using the programme Circos. AGA, A. gambiae; AME, A. mellifera; API, A. pisum; BMO, B. mori; DME, D. melanogaster; DPU, D. pulex; LMI, L. migratoria; NVI, Nasonia vitripennis; PHU, P. humanus; TCA, Tribolium castaneum; . orthologous genes provided further support for this average size difference (Supplementary Fig. S8 and Supplementary Table S12). We also carried out a genome-wide comparison of intron and genome size of 73 sequenced animal species and found that there was a positive correlation between average intron size and genome size (Po0.01, Pearson’s correlation tests; Supplementary Fig. S9). The increased intron size in the L. migratoria genome compared with the other insects may partly be attributed to transposable element invasion (Supplementary Fig. S10). Relative to this, we compared several other intron characteristics of the L. migratoria genome with other animal species (Supplementary Note 1) and found that the U12-type intron (minor-class intron) number and the ratio of ratcheting point sites of L. migratoria are similar to those in vertebrates rather than insects (Supplementary Figs S11 and S12). Previous reports have indicated that most insects have an enrichment of ratcheting point sites to allow for efficient splicing of long introns, whereas vertebrates use repetitive elements to aid in splicing long introns26. Our data here indicate that there may be convergent evolution of the splicing mechanisms associated with genome size expansion in animals. We used a maximum likelihood method for genome-scale phylogenetic analysis using 122 single-copy genes from 10 sequenced arthropod genomes (Fig. 1b and Supplementary Fig. S13). The phylogenetic analysis revealed that the locust is the basal taxon for the other insects sequenced so far, and supports the paraphyletic status of the hemimetabolous insect species. Given the distinct developmental differences between hemimetabolous and holometabolous insects, we identified metamorphosis-specific gene sets, of which a large number of genes were related to the regulation of developmental processes (Supplementary Table S13). A gene gain/loss analysis of these insect genomes showed a gain of about 55 new gene families in the lineage leading to L. migratoria (Supplementary Fig. S13). Twenty-five significantly expanded gene families in the L. migratoria genome were mainly involved in detoxification, chemoreception, chromosome activity and nutritional metabolism, indicating unique adaptation features of the L. migratoria genome (Supplementary Table S14). Phase change analysis. Phase change is the defining characteristic of locust biology because of its critical roles in swarm formation (Fig. 2a)5. As a fascinating type of phenotypic plasticity, phase change is required to achieve biological complexity at all levels of biological organization, from molecules to large aggregation. Recent studies have highlighted the importance of DNA methylation for understanding insect phenotypic plasticity and biological complexity27–29. To investigate the potential involvement of epigenetic regulation in locust phase change, we performed a comparative methylome analysis for brain tissues between solitarious and gregarious locusts by a reduced representation bisulphite sequencing (RRBS) technology and then conducted a comparative transcriptome analysis for brain tissues over two time courses (within 64 h): starting at the onset of crowding of solitarious locusts and the isolation of gregarious locusts. NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. 3 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 a d Gregarious Up Solitarious 4h IG 8 h 16 h 32 h 64 h 4 h CS S/G 8 h 16 h 32 h 64 h Carbohydrate metabolic P. Lipid metabolic P. Mit. intermembrane space Oxidoreductase A. Monooxygenase A. Carboxypeptidase A. Serine-type endopeptidase A. Phototransduction Solute:sodium symporter A. Sulphotransferase A. Microtubule Motor A. Nucleosome assembly b 30 20 10 0 10 0 10 0 10 0 10 0 10 0 10 0 10 0 10 0 10 0 10 0 10 –Log10 (adjustedPv) 0 G en In om e te r G gen en i e c bo dy Ex on In tro R n ep ea t Relative methylation level (%) mCpG/CpG Down e c RNA-DNA binding Receptor Mitochondrion –O2– m C H2O C Receptors Glu Others m Signal transduction proteins in os tin Ac My Kinases Vesicle transfer Molecular motors Microtubule cytoskeleton organization ATP m C m C ATPase Protein kinase m C Kinesin Antioxidant MAP –O2– Dynein m C GTPase m C DEG DMG DSG Figure 2 | Transcriptome and methylome analysis of locust phase change. (a) Locust phase polyphenism. The solitarious phase individual is relatively inactive and cryptically coloured, but gregarious phase individual actively swarms and is conspicuously coloured. (b) Relative CG methylation level of different genomic regions through a RRBS method. The ratio was defined as the number of methylated CG to the number of total CG, and was calculated using the CG on the reads that map to the defined regions. The results were derived from the data of brain samples of fourth-instar locust nymphs. (c) Functional classification of differentially methylated genes (DMGs) of brain tissues between solitarious and gregarious fourth-instar female nymphs. Genes with at least four differentially methylated CG sites were thought of as DMGs. (d) Gene ontology enrichment of differentially expressed genes (DEGs) in brain tissues of fourth-instar locust nymphs over time for the gregarization (crowding of solitarious locusts (CS)) and solitarization (isolation of gregarious locusts (IG)) processes (4, 8, 16, 32 and 64 h) compared with controls. The rightmost column denotes the comparison between the solitarious and gregarious control. The up- and downregulated genes are coloured as red and blue, respectively. The length of the bar denotes the log10 of the adjusted P-value of the enrichment significance (Fisher’s exact test or w2-test). A, activity; Mit, mitochondrion; P, process. (e) Model for molecular regulation of synaptic plasticity involved in locust phase change. In this model, DEGs, DMGs or differentially spliced genes (DSGs) were found for different components. MAP, microtubule-associated protein. We first assessed the DNA methylation patterns of the L. migratoria genome. We examined the distribution of observed/expected CpG content (CpGO/E), which is a good measure of inferring the pattern of DNA methylation30, in the whole genome and observed that the CpGO/E of the coding region is lower and more fluctuated compared with those observed from the whole genomic region (Fig. 1a, track e). A bimodal peak is clearly found in the CpGO/E obtained from the L. migratoria coding regions (Supplementary Fig. S14). Furthermore, the L. migratoria genome seems to have a functional CpG methylation system, including two copies of Dnmt1 and one copy of Dnmt2 and Dnmt3 (Supplementary Table S15). This is consistent with the patterns observed in researches on methylation in A. mellifera and A. pisum31, suggesting that L. migratoria likewise has a large proportion of genes that may be methylated. Through a low-pass bisulphite sequencing and RRBS (Supplementary Fig. S15), we roughly estimated that 1.6% of genomic cytosines are methylcytosines. Nearly all the methylcytosines were located in CpG dinucleotides, which were 4 substantially enriched in gene bodies (Supplementary Figs S16 and S17, and Supplementary Tables S16 and S17). Interestingly, we find that repetitive elements are highly methylated and introns have higher methylation levels than exons (Fig. 2b), which is different in other insect species where exons have higher methylation levels32. Hence, the L. migratoria genome probably represents a unique DNA methylation pattern existing in ancestral branches of insecta. Through RRBS, we found a total of 9,311,972 CpG sites that cover 11,743 genes (Supplementary Table S18). Over 90 genes showed significant methylation differences between solitarious and gregarious locusts, and these were mainly classified into seven categories, including microtubule cytoskeleton organization, molecular motors, kinases, receptors, vesicle transfer, signal transduction proteins and RNA–DNA binding (Fig. 2c, Supplementary Fig. S18 and Supplementary Table S19). Many of these categories had been documented in various animal species to have a crucial role in neuronal functions connected to synaptic plasticity33,34. NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 To further elucidate neuronal mechanisms underlying locust phase change, we conducted a series of transcriptome analyses of brain tissues from locust nymphs experiencing short-term solitarization and gregarization. We identified a total of 4,893 differentially expressed genes in at least one of the time points during both processes, and this accounted for 28.3% of the gene sets (Supplementary Table S20 and Supplementary Fig. S19). Gene ontology (GO) analyses showed these fell into a variety of categories (Supplementary Fig. S20), indicating that phase change induced a broad range of changes in CNS gene expression. GO enrichment analysis revealed that there were waves of gene expression changes occurring over the time courses of crowding of solitarious locusts and isolation of gregarious locusts (Supplementary Tables S21 and S22) and that there were distinct temporal patterns for a number of genes (Fig. 2d). As a whole, during locust crowding, the genes associated with synaptic transmission, carbohydrate metabolism and nucleosome assembly displayed higher expression levels, whereas the expression levels of genes associated with oxidoreductase and antioxidase, microtubule and motor activity showed decreased expression. The expression patterns of these genes were reversed during gregarious locust isolation. These results suggest that crowding of solitarious locusts may trigger an increase in neuronal activity with a simultaneous suppression of antioxidative responses in the CNS during the onset of phase change in locusts. Alternative pre-messenger RNA splicing (AS) has an important role in development and phenotypic plasticity35. On the basis of the RNA-seq data from different developmental stages and tissues of the L. migratoria, we identified 168,646 splice junctions and showed that alternative 30 -splicing is the most prevalent form of AS (Supplementary Fig. S21). Comparative AS transcript analysis revealed 45 genes that have differentially expressed isoforms between solitarious and gregarious locusts (Supplementary Table S23). These include several genes implicated in cytoskeleton dynamics, for example, microtubule-associated protein futsch, ankyrin repeat domain-containing protein, neurofilament, myosin, calcium-transporting ATPase and Arrestin. It is noteworthy that the genes involved in the regulation of the cytoskeletal microtubular system were highlighted from the analysis of methylome, transcriptome and AS, respectively (Fig. 2e). Microtubules are essential structures for stable neuronal morphology and have important roles in neuronal polarization, development and plasticity36,37. Neuronal plasticity has been demonstrated to occur between solitarious and gregarious locusts accompanying behavioural change. Our results suggested that phase change is associated with multiple molecular processes involved in the regulation of microtubule dynamics in the locust CNS. Long-distance flight trait analysis. The capacity of long-distance flight is the most distinguishing feature of L. migratoria38. Insect flight capacity depends on several factors, including wing and muscle morphology, neuroendocrine regulation and energy metabolism39. We annotated genes that are potentially relevant to these morphological features and physiological processes (Supplementary Note 2, Supplementary Tables S24–S31 and Supplementary Fig. S22) and compared the RNA-seq data derived from fat body tissues of locust adults after 2 h of flight and no flight (Fig. 3). Most of 472 differentially expressed genes related to flight were enriched into several GO categories involved in energy metabolism (Fig. 3a) and have more duplicated copies compared with non-differentially expressed genes (w2-test, Po1e 16; Fig. 3b). Through a comparison with nine other sequenced insects, we observed significant copy number expansion in genes associated with lipid mobilization and antioxidant protection in the L. migratoria genome, including 7 perilipins, 11 fatty-acid-binding proteins, 9 I cysteine peroxiredoxins (Prdx6s) and 12 sigma glutathione S-transferase (GST) genes (Fig. 3c, Supplementary Figs S23–S24, and Supplementary Tables S28 and S30–32). Perilipins are predominantly located on the surface layer of intracellular lipid droplets in adipocytes40 and have a critical role in lipid accumulation41. fatty-acid-binding proteins are known to be directly correlated with the metabolic rate in locust flight muscle tissues, and thus may contribute to the extremely high metabolic rate of fatty acid oxidation for energy generation42. The two antioxidant gene families, Prdx6 and sigma GST, are involved in protection against reactive oxygen species damage caused by flight activity43. Furthermore, several members of these families were differentially expressed in the fat body tissues before and after flying, indicating their important roles in the flying process (Supplementary Fig. S25). Given that flight is considered one of the most energy-demanding exercises in insects44, the expansion of genes involved in lipid metabolism indicate that L. migratoria has developed a highly efficient energy supply system to fulfill the intensive energy consumption during their long-distance flight45. Different insects use different fuel sources to provide energy during flight46. For example, A. mellifera and D. melanogaster primarily use carbohydrates, whereas Danaus plexippus and L. migratoria mainly use lipids during long-distance flight47. To investigate whether the energy metabolic pathways undergo changes owing to the need for different fuel source preferences, we compared the copy number variation of genes involved in carbohydrate and lipid metabolism. We found that there were more copies of glycolytic gene glyceraldehyde 3-phosphate dehydrogenase in A. mellifera and D. melanogaster, whereas there were more copies of enoyl-CoA hydratase and acetyl-CoA acyltransferase 2 in D. plexippus and L. migratoria (Fig. 3d). It is interesting to note that enoyl-CoA hydratase and acetyl-CoA acyltransferase 2 belong to a multi-enzyme complex located in the inner mitochondrial membrane that catalyse two distinct steps of b-oxidation cycle of fatty acid degradation48. Therefore, these two enzymes need be functionally tightly associated to allow efficient substrate channelling from one active site to the next. The increase of gene copies involved in L. migratoria and D. plexippus in fatty acid degradation process is likely to serve as an adaptation for the preference of lipids as energy source during long-distance flight. Feeding trait analysis. L. migratoria usually prefer gramineous plants (grasses) as a food source, which includes the majority of crop species49. Food selection processes are primarily mediated by the detection of chemical cues and the ability of an insect to handle plant toxins50. We identified four multigene families putatively related to the detection of plant odours in the L. migratoria genome by carrying out similarity searches using information from known insect protein sequences (Supplementary Note 2). These families included 22 odourantbinding proteins, 95 olfactory receptors, 75 gustatory receptors and 10 ionotropic receptors (Fig. 4a). We found that gustatory receptor and olfactory receptor gene families have undergone locust-specific expansion (Supplementary Figs S26 and S27), which might reflect their strong adaptation for ecological specialization related to host plant recognition. We also found that five of the known gene families putatively involved in detoxification enzymes and xenobiotic transporters in the locust had a unique composition and were expanded in L. migratoria. We identified 68 UDP glycosyltransferases (UGTs), 80 carboxyl/choline esterases, 94 cytochrome P450s, 28 GSTs and 65 ATP-binding cassette transporters (Supplementary Tables S32–S34 and Supplementary Figs S28–S32). Substantial expansion occurred in the UGT and carboxyl/choline esterase families, representing the largest family expansion in any insect genome sequenced so far (Fig. 4b). The expansion of multi-detoxification gene families suggests that L. migratoria has an all-purpose NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. 5 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 Transition metal ion biding Glutathione peroxidase activity Transporter activity Epoxide hydrolase activity Response to chemical stimulus Inositol catabolic process Electron carrier activity Oxidation–reduction process Flight muscle cell LDLp DEGs All genes HDLp ER FA FABP Microtubule-based movement Oxidation-reduction process 40 Up (%) 20 0 20 40 Down (%) Proportion of genes CoA PAT TAG O2 DAG Glucose Glycogen 62.16% 80.22% H2O2 FA metabolism MultiCopy DEG SingleCopy GST(sigma) H2O Acetyl-CoA 19.78% OXPHOS AME DME DPL LMI GAPDH 3 3 1 2 NonDEG Mitochondrion Prdx6 –O2– Trehalose Glycolysis 37.84% CoA Carnitine Fat body TCA NDUFB9 AME DME DPL LMI NDUFB4 AME DME DPL LMI ECHS1 1 1 2 2 IDH 1 1 2 2 NDUFS5 ACAA2 1 1 3 2 CS 2 1 1 QCR7 2 AME DME DPL LMI 1 1 2 2 1 1 2 2 1 1 2 3 2 2 1 1 Figure 3 | Expansion of genes putatively involved in energy consumption during L. migratoria flight. (a) Overrepresented GO annotations for 472 differentially expressed genes (DEGs) in the fat body after 2 h of flight. The bar represents the proportion of the genes of one GO term in the DEGs (red) and all annotated genes (black). (b) Proportion of multi-copy (blue) and single-copy (green) genes for DEGs (right) and non-DEGs (left), respectively. (c) Expanded genes (red) involved in energy mobilization processes in the L. migratoria genome. PAT, perilipin; TAG, triacylglycerol; DAG, diacylglycerol; FA, fatty acid; ER, endoplasmic reticulum; FABP, fatty-acid-binding protein; Prdx6, 9 I cysteine peroxiredoxins; GST, glutathione S-transferase. (d) Gene copy number variation in energy-related metabolic pathways between two insect species, A. mellifera (AME) and D. melanogaster (DME), which primarily use glucose as an energy source during flight, and two insect species, D. plexippus (DPL) and L. migratoria (LMI), which primarily use fatty acids. ACAA2, acetyl-CoA acyltransferase 2; TCA, tricarboxylic acid cycle; CS, citrate synthase; ECHS1, enoyl-CoA hydratase; GAPDH, glyceraldehyde 3-phosphate dehydrogenase; IDH, isocitrate dehydrogenase; NDUFB9, NADH dehydrogenase (ubiquinone) 1b subcomplex 9; NDUFB4, NADH dehydrogenase (ubiquinone) 1b subcomplex 4; OXPHOS, Oxidative phosphorylation; NDUFS5, NADH dehydrogenase (ubiquinone) Fe-S protein 5; and QCR7, ubiquinol–cytochrome c reductase subunit 7. detoxification system, allowing it to handle a broad range of secondary metabolites present in their host plants51. Interestingly, we also found that the repertoires of the UGT gene family from four herbivorous insects far exceeded the numbers found in the non-herbivorous insects (Fig. 4a). Insect UGTs are involved in glucosidation, which has a major role in the inactivation and excretion of a variety of toxic plant secondary metabolites, including alkaloids, phenols and flavonoids, which are present in substantial amounts in gramineous plants52,53. Identification of gene targets for pest control. Locust plague control urgently requires safe and effective new insecticides or other solutions2. We therefore searched the L. migratoria genome for insecticide gene targets. We identified 34 cys-loop ligand-gated ion channels and 90 G-protein-coupled receptors, which are considered to be major traditional insecticide targets54,55 (Supplementary Note 3 and Supplementary Table S35). Using in-silico screening56 for L. migratoria orthologues of Caenorhabditis elegans and D. melanogaster lethal genes (linked to lethal phenotypes on gene perturbation), we identified 166 genes that may be worth considering for targets in alternative pest control strategies given that they are single copy, have lower allelic variability and lack similarity to human genes. Among these genes, several have been reported as successful targets in previous studies, and include kinases, ATPases, synthases, carboxylesterases and receptors57. We also identified the gene repertoire of several biological processes that may serve as mechanistic targets and lead 6 to the development of specific and sustainable pest control methods58, including the immune system, RNA interference pathway and cuticle metabolism (Supplementary Tables S36–S40). Discussion Our analysis indicates that a massive number of repetitive elements (at least 60%) exist in the L. migratoria genome and their rates of loss are lower than those in other insect species. This partially explains the large genome size in L. migratoria. Comparative genomic analysis revealed locust-specific expansions of many gene families, particularly those that are involved in energy consumption and detoxification, suggesting a genomic basis for long-distance flight capacity and phytophagy of locusts. The analyses of the methylome and transcriptome revealed the complex molecular regulation of phase change involving extensive participation of DNA methylation, transcription and alternative splicing in the CNS. The information from the L. migratoria genome, transcriptome and methylome obtained in this study serves as a major step towards better understanding of the locust plague outbreaks. It not only enables us to combat against this distractive event but also increase the use of this species as a model system in biological and biomedical research. Methods Sample collection and genome sequencing. The strain for genome sequencing originated from inbred laboratory strains of solitarious locusts, which were produced from eight generations of sib mating at the Institute of Zoology, CAS, China. NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 Detoxification Chemosensory CCE GST ABC GR OR IR OBP 94 83 28 65 75 95 11 22 P.humanus 4 37 17 13 38 8 10 12 4 A. pisum* 58 83 29 20 71 53 48 11 14 A.mellifera 12 46 24 10 46 10 163 10 21 T. castaneum* 43 134 48 35 65 220 265 23 49 B.mori* 45 84 73 23 50 56 48 18 43 A. gambiae 26 106 51 31 51 60 79 46 81 D. melanogaster 34 85 35 38 56 73 62 66 52 UGT373A1 UGT373A3 A2 UGT373 A1 82 UGT3 9A1 37 UGT 71A1 3 UGT 364A21 T UG 364A 1 T A UG T387 A1 5 UG T37 4A1 1 UG T37 68A 1 UG T3 84A A1 UG T3 85 A1 UG GT3 370 U GT U 1 1A 2 T4 1A G T4 1A3 1 U G 4 C U GT 48 1 U GT 40S U GT 0P1 U GT4 N1 U T40 B2P UG T40 4 UG T40B UG T40B3 UG 40B1 UGT 0A1 UGT4 1 UGT40H UGT40K1 UGT40G2 UGT40G1 UGT34A2 UGT33 UGT3 R1 UGT 3R2 UGT 33Q1 UG 340C UG T340 1 UG T33 C2 U T3 K1 UGGT3 3N1 U T 3D U G 33 1 UG GT T33 D5 T3 33D D4 3D 2 8 3 3D 6 T3 D G 33 7 U GT 33D A1 U GT 50 1 U T 0E UGGT5 6C2 U T4 A1 UG T46 2 UG T46A UG T44A1 UG 43B1 UGT 42B1 UGT A1 UGT42 UGT42A2 UGT372A1 UGT366A1 U U GT UG GT3 360 UG T3 60H A1 UG T36 60D 1 UG T36 0E1 1 0 UG T361 B1 UG T361AA3 T UGT 361A21 UGT 376A1 3 UGT3 76A3 7 UGT38 6A2 1A1 UGT47A1 UGT47B1 U UG GT3 UG T3 86 UG T3 78A A1 6 UG T37 9A1 1 UG T377 7A1 UG T377 A2 A T UG 365H 3 T 1 UGT 365C1 3 UGT3 65B1 6 UGT36 5E1 5A UGT365A 1 2 UGT365J1 UGT365D1 UGT365G1 UGT365F1 5L1 UGT36 K1 65 UGT3 65II 3 UGT 80A1 3 UGT 362A11 T UG 367A 1 T A UG T383 0I1 UG T36 0G1 1 UG T36 0F 1 6 UG GT3 60C A2 U T3 360 G U GT U U UG GT T3 39C 9B 1 1 P450 68 UGT363A4 UGT363A3 UGT363 UGT3 A1 UGT 63A2 UG 363B2 UG T363B3 UG T363 UG T363 B5 T3 B1 63B 4 UGT L. migratoria* Figures 4 | Gene families putatively related to L. migratoria host plant adaptation. (a) Numbers of detoxification-related and chemosensory-related genes in the genomes of eight selected insects. *Herbivorous insects. UGT, UDP glycosyltransferases; CCE, carboxyl/choline esterases; GST, glutathione S-transferase; ABC, ATP-binding cassette; GR, gustatory receptor; OR, odorant receptor; IR, ionotropic receptor; OBP, odorant-binding protein. (b) The phylogenetic relationship of the different UGT families. Family members in red are from L. migratoria and those in black are from other insects. Nodes with 470% bootstrap support (100 replicates) are indicated in orange circles. UGTs are named after the standard UGT nomenclature guide. For clarity, we show only a selected number of representatives from the UGT families because of the large number of UGT families identified. DNA for genome sequencing was extracted from the whole body of one female adult, with the exception of its guts. Genomic DNA was isolated using standard molecular biology techniques. Gradient increased insert-size libraries, 170, 200, 500 and 800 bp and 2, 5, 10, 20 and 40 kb were constructed. All libraries were sequenced on Illumina HiSeq 2000. In total, 42 paired-end sequencing libraries were constructed and 82 lanes were sequenced, producing 1,135 Gb of raw data. Further low-quality and duplicated reads filtering resulted in 720 Gb (114  coverage) data for genome assembly (Supplementary Table S1). Genome assembly. For quality control, raw data was filtered for PCR duplication, adaptor contamination and low quality. After filtering, 721 Gb (or 114.4  ) of data were retained for assembly. SOAPdenovo17 was selected to assemble the L. migratoria genome. The paired-end reads of short insert-size libraries were used to construct contigs, and those of long insert-size libraries were used to connect contigs and generate scaffolds. All the reads were used to fill gaps in the scaffolds. On the basis of our tests of multiple combinations of assembly parameters, ‘–K 43 – M 3’ was chosen in the final assembly. Approximately 350.95 Gb (or 50.1  ) of data was used to build contigs, and all high-quality data were used to build scaffolds. Contigs and scaffolds o200 bp long were filtered out. The final total contig size and N50 were 5.95 Gb and 9.26 kb, respectively. The total scaffold size and N50 were 6.74 Gb and 281.75 kb, respectively. The total length of assembled sequences is B400 Mb larger than the genome size estimated by the flow cytometry and 17-mer analysis. To determine whether heterozygosity contributes to redundancy in assembled sequences, B22  pairedend reads with 200 bp inserts were mapped to the assembled sequences using NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. 7 ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 MAQ-0.7.17 (ref. 59) with default parameters, except –C 1. Whole-genome coverage depth distribution revealed a peak at half of the expected average coverage depth, B11  (half depth), which may have resulted from the heterozygosity. To reduce redundancy, we first identified all regions with half depth using the following criteria: length4200 bp, average deptho15  and 495% of the positions have o15  depth. Next, all the scaffolds with half depth were aligned self-to-self using BLAT with identity 490%. The overlapped terminal region of the shorter scaffold was removed if the overlapping regions had half depth and at least one of the four terminal lengths was o2 kb long. Contamination of bacterial and viral DNA sequences was also removed. All assembled sequences were aligned to the genome sequences of viruses and bacteria available through NCBI. Aligned sequences with 490% identity and longer than 200 bp were filtered out from the final assembly. To minimize any negative influence of filtering, assembled sequences covered by an EST sequence were retained. Our de novo assembly approaches combined with the filtering steps resulted in a final genome assembly of 6.52 Gb with contig N50 9.3 kb (Supplementary Table S3). The assembly was further improved to a final scaffold N50 323 kb using the gene-scaffold method by combining evidence from RNA-seq data and homologous genes (Supplementary Methods). Linkage mapping of scaffolds. We used 106 F1 individuals produced from a cross between two L. migratoria, one from the Hainan province and another from the laboratory-raised population, to perform RADseq. BWA (v. 0.6.2) was used to align the reads to the reference. SAMtools (v. 0.1.18)60 were used to call single-nucleotide polymorphism and filtering. A custom PERL script was developed to identify segregating polymorphic patterns and classify linkage groups. Marker ordering and spacing were carried out using JoinMap (v. 3.0)61. Protein coding gene annotation. For homology-based prediction, Genewise was used to improve gene models that acquired from the protein sets of four insects. For de novo prediction, Augustus62, SNAP63 and Glimmer-HMM64 were employed on the repeat masked-assembly sequences. RefSeq proteins from A. pisum and P. humanus were used as training data to obtain suitable parameters in the L. migratoria gene prediction. For transcriptome-based prediction, PASA65 was used to define gene structures from 45,436 ESTs, and TopHat and Cufflinks were used to obtain transcript structures from RNA-seq data that were collected from various developmental stages. Finally, GLEAN66 was used to merge the evidences from homology-based, de novo-derived and transcript gene sets to form a comprehensive and non-redundant reference gene set. After filtering and manual curation, 17,307 genes were obtained (Supplementary Table S8 and Supplementary Methods). Gene family analysis. Orthology analysis was carried out by the TreeFam67 pipeline using nine insects and Daphnia pulex. The rate and direction of gene family size in L. migratoria and its related species were inferred using CAFÉ68 (Supplementary Methods). Transcriptome sequencing. Samples for RNA-seq were prepared from a mixed samples of various developmental stages and ten tissues of gregarious adults and brain tissues of fourth-instar nymphs undergoing time courses of gregarization and solitarization (Supplementary Methods). Total RNA was extracted using the TRIzol reagent (Invitrogen) and treated with RNase-free DNase I. Poly(A) mRNA was isolated using oligo dT beads. First-strand complementary DNA was generated using random hexamer-primed reverse transcription, followed by synthesis of the second-strand cDNA using RNaseH and DNA polymerase I. Paired-end RNA-seq libraries were prepared following Illumina’s protocols and sequenced on the Illumina HiSeq 2000 platform. To assist gene annotation, a cDNA library with mixed samples from various organs and developmental stages was constructed, and was then normalized by the duplex-specific nuclease method followed by cluster generation on the Illumina HiSeq 2000 platform. This method could reduce expression redundancy and facilitate novel gene discovery. Gene expression analysis. The RNA-seq reads were mapped using Tophat. Gene expression levels were measured using the reads per kb per million mapped reads criteria. To minimize the influence of differences in RNA output size between samples, the numbers of total reads were normalized by multiplying with normalization factors as suggested by Robinson and Oshlack69. Differentially expressed genes were detected using the method described by Chen et al.70, which was constructed based on Poisson distribution and eliminated the influences of RNA output size, sequencing depth and gene length. Reduced representation bisulphite sequencing. Solitarious and gregarious locusts were reared as described in a previous publication11. Brain tissues of 3-dayold fourth-instar gregarious and solitarious females were dissected and placed in liquid nitrogen. Each sample contained 100 brain tissues. DNA was extracted using the Gentra puregene Kit (Qiagen, USA). Next, 5 mg of genomic DNA was digested with 300 U of the MspI enzyme (New England Biolabs, USA) in 100 ml reactions at 37 °C for 16–19 h. After purification, the digested products were subjected to blunt 8 ending, dA addition and methylated-adapter ligation. To obtain DNA fractions of 40–120 bp and 120–220 bp ranges of MspI-digested products, two ranges of 160–240 bp and 240–340 bp adapter-ligated fractions were excised and purified from a 2% agarose gel. Bisulphite conversion was conducted using the ZYMO EZ DNA Methylation-Gold Kit (ZYMO) following the manufacturer’s instructions. The same bisulphite conversion was also conducted for one sample that was not digested by MspI. The final libraries were generated by PCR amplification of 11–13 cycles using JumpStart Taq DNA Polymerase (Sigma). After testing using an Agilent 2100 Bioanalyzer (Agilent Technologies) and real-time PCR, the libraries were analysed on the HiSeq 2000 system. In total, 13.5 Gb and 12.8 Gb of data were produced for gregarious and solitarious samples, respectively. To validate the methylation level, 63 TA clones (26 in 120–220 bp and 37 in 40–120 bp) were sequenced by bisulphite PCR using the Sanger method. A methylation ratio of 3.42% was detected for the CG sites. References 1. Uvarov, B. P. Grasshoppers and Locusts (Cambridge UP, 1977). 2. Enserink, M. Can the war on locusts be won? Science 306, 1880 (2004). 3. Skaf, R., Popov, G., Roffey, J., Scorer, R. & Hewitt, J. The desert locust: an international challenge [and discussion]. Philos. Trans. R. Soc. Lond. B 328, 525–538 (1990). 4. Lovejoy, N. R., Mullen, S. P., Sword, G. A., Chapman, R. F. & Harrison, R. G. Ancient trans-Atlantic flight explains locust biogeography: molecular phylogenetics of Schistocerca. Philos. Trans. R. Soc. Lond. B Biol. Sci. 273, 767–774 (2006). 5. Pener, M. P. & Simpson, S. J. Locust phase polyphenism: an update. Adv. Insect Physiol. 36, 1–272 (2009). 6. Sword, G. A., Lecoq, M. & Simpson, S. J. Phase polyphenism and preventative locust management. J. Insect Physiol. 56, 949–957 (2010). 7. Wang, Y., Yang, P., Cui, F. & Kang, L. Altered immunity in crowded locust reduced fungal (Metarhizium anisopliae) pathogenesis. PLoS Pathog. 9, e1003102 (2013). 8. Wang, H. S. et al. Parental phase status affects the cold hardiness of progeny eggs in locusts. Funct. Ecol. 26, 379–389 (2012). 9. Wang, X. H. & Kang, L. Molecular mechanisms of phase change in locusts. Annu. Rev. Entomol. 59, 225–243 (2014). 10. Guo, W. et al. CSP and takeout genes modulate the switch between attraction and repulsion during behavioral phase change in the migratory locust. PLoS Genet. 7, e1001291 (2011). 11. Ma, Z., Guo, W., Guo, X., Wang, X. & Kang, L. Modulation of behavioral phase changes of the migratory locust by the catecholamine metabolic pathway. Proc. Natl Acad. Sci. USA 108, 3882–3887 (2011). 12. Guo, X., Ma, Z. & Kang, L. Serotonin enhances solitariness in phase transition of the migratory locust. Front Behav. Neurosci. 7, 129 (2013). 13. Wu, R. et al. Metabolomic analysis reveals that carnitines are key regulatory metabolites in phase transition of the locusts. Proc. Natl Acad. Sci. USA 109, 3259–3263 (2012). 14. Ayali, A. & Yerushalmi, Y. Locust research in the age of model organisms: introduction to the special issue in honor of MP Pener’s 80th birthday. J. Insect Physiol. 56, 831–833 (2010). 15. Wei, Y., Chen, S., Yang, P., Ma, Z. & Kang, L. Characterization and comparative profiling of the small RNA transcriptomes in two phases of locust. Genome Biol. 10, R6 (2009). 16. Wang, H. S. et al. cDNA cloning of heat shock proteins and their expression in the two phases of the migratory locust. Insect Mol. Biol. 16, 207–219 (2007). 17. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010). 18. Fang, X. et al. The sequence and analysis of a Chinese pig genome. Gigascience 1, 16 (2012). 19. Charlesworth, D. & Willis, J. H. The genetics of inbreeding depression. Nat. Rev. Genet. 10, 783–796 (2009). 20. Howrigan, D. P., Simonson, M. A. & Keller, M. C. Detecting autozygosity through runs of homozygosity: a comparison of three autozygosity detection algorithms. BMC Genomics 12, 460 (2011). 21. Zupunski, V., Gubensek, F. & Kordis, D. Evolutionary dynamics and evolutionary history in the RTE clade of non-LTR retrotransposons. Mol. Biol. Evol. 18, 1849–1863 (2001). 22. Jiang, F., Yang, M., Guo, W., Wang, X. & Kang, L. Large-scale transcriptome analysis of retroelements in the migratory locust, Locusta migratoria. PLoS One 7, e40532 (2012). 23. Shah, N., Dorer, D. R., Moriyama, E. N. & Christensen, A. C. Evolution of a large, conserved, and syntenic gene family in insects. G3 (Bethesda) 2, 313–319 (2012). 24. Sabot, F. & Schulman, A. H. Parasitism and the retrotransposon life cycle in plants: a hitchhiker’s guide to the genome. Heredity (Edinb) 97, 381–388 (2006). 25. Gregory, T. R. The Evolution of the Genome (Academic Press, 2005). NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms3957 26. Shepard, S., McCreary, M. & Fedorov, A. The peculiarities of large intron splicing in animals. PLoS One 4, e7853 (2009). 27. Lyko, F. et al. The honey bee epigenomes: differential methylation of brain DNA in queens and workers. PLoS Biol. 8, e1000506 (2010). 28. Bonasio, R. et al. Genome-wide and caste-specific DNA methylomes of the ants Camponotus floridanus and Harpegnathos saltator. Curr. Biol. 22, 1755–1764 (2012). 29. Simpson, S. J., Sword, G. A. & Lo, N. Polyphenism in insects. Curr. Biol. 21, R738–R749 (2011). 30. Elango, N., Hunt, B. G., Goodisman, M. A. & Yi, S. V. DNA methylation is widespread and associated with differential gene expression in castes of the honeybee, Apis mellifera. Proc. Natl Acad. Sci. USA 106, 11206–11211 (2009). 31. Hunt, B. G., Brisson, J. A., Yi, S. V. & Goodisman, M. A. Functional conservation of DNA methylation in the pea aphid and the honeybee. Genome Biol. Evol. 2, 719–728 (2010). 32. Zemach, A., McDaniel, I. E., Silva, P. & Zilberman, D. Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328, 916–919 (2010). 33. Cortés-Mendoza, J., León-Guerrero, S. D. D., Pedraza-Alva, G. & PérezMartı́nez, L. Shaping synaptic plasticity: the role of activity-mediated epigenetic regulation on gene transcription. Int. J. Dev. Neurosci. 31, 359–369 (2013). 34. Reinhard, J. & Claudianos, C. in Honeybee Neurobiol. Behav. 359–372 (Springer, 2012). 35. Kelly, S. A., Panhuis, T. M. & Stoehr, A. M. Phenotypic plasticity: molecular mechanisms and adaptive significance. Compr. Physiol. 2, 1417–1439 (2012). 36. Jaworski, J. et al. Dynamic microtubules regulate dendritic spine morphology and synaptic plasticity. Neuron 61, 85–100 (2009). 37. Hoogenraad, C. C. & Bradke, F. Control of neuronal polarity and plasticity–a renaissance for microtubules? Trends Cell Biol. 19, 669–676 (2009). 38. Williams, C. B. Insect Migration (Collins London, 1958). 39. Chino, H., Lum, P. Y., Nagao, E. & Hiraoka, T. The molecular and metabolic essentials for long-distance flight in insects. J. Comp. Physiol. B 162, 101–106 (1992). 40. Blanchette-Mackie, E. et al. Perilipin is located on the surface layer of intracellular lipid droplets in adipocytes. J. Lipid Res. 36, 1211–1226 (1995). 41. Bi, J. et al. Opposite and redundant roles of the two Drosophila perilipins in lipid mobilization. J. Cell Sci. 125, 3568–3577 (2012). 42. Haunerland, N. H. Fatty acid binding protein in locust and mammalian muscle. Comparison of structure, function and regulation. Comp. Biochem. Physiol. B Biochem. Mol. Biol. 109, 199–208 (1994). 43. Magwere, T. et al. Flight activity, mortality rates, and lipoxidative damage in Drosophila. J. Gerontol. A Biol. Sci. Med. Sci. 61, 136–145 (2006). 44. Wegener, G. Flying insects: model systems in exercise physiology. Cell Mol. Life Sci. 52, 404–412 (1996). 45. Sacktor, B. Biochemical adaptations for flight in the insect. Biochem. Soc. Symp. 41, 111–131 (1976). 46. Sacktor, B. Cell structure and the metabolism of insect flight muscle. J. Biophys. Biochem. Cytol. 1, 29–46 (1955). 47. Beenakkers, A. M. Carbohydrate and Fat as a fuel for insect flight. A comparative study. J. Insect. Physiol. 15, 353–361 (1969). 48. Eaton, S. et al. The mitochondrial trifunctional protein: centre of a betaoxidation metabolon? Biochem. Soc. Trans. 28, 177–182 (2000). 49. Chapman, R. F. & Joern, A. Biology of Grasshoppers (John Wiley and Sons Inc., 1990). 50. Mulkern, G. B. Food selection by grasshoppers. Annu. Rev. Entomol. 12, 59–78 (1967). 51. Bernays, E. A. & Chapman, R. F. Plant secondary compounds and grasshoppers: beyond plant defenses. J. Chem. Ecol. 26, 1773–1794 (2000). 52. Luque, T., Okano, K. & O’Reilly, D. R. Characterization of a novel silkworm (Bombyx mori) phenol UDP-glucosyltransferase. Eur. J. Biochem. 269, 819–825 (2002). 53. Despres, L., David, J. P. & Gallet, C. The evolutionary ecology of insect resistance to plant chemicals. Trends Ecol. Evol. 22, 298–307 (2007). 54. Raymond-Delpech, V., Matsuda, K., Sattelle, B. M., Rauh, J. J. & Sattelle, D. B. Ion channels: molecular targets of neuroactive insecticides. Invert. Neurosci. 5, 119–133 (2005). 55. Bai, H. & Palli, S. R. in Advanced Technologies for Managing Insect Pests. (eds Ishaaya, I., Palli, S. R. & Horowitz, A. R.) 57–82 (Springer, 2013). 56. Caffrey, C. R. et al. A comparative chemogenomics strategy to predict potential drug targets in the metazoan pathogen, Schistosoma mansoni. PLoS One 4, e4413 (2009). 57. Huvenne, H. & Smagghe, G. Mechanisms of dsRNA uptake in insects and potential of RNAi for pest control: a review. J. Insect Physiol. 56, 227–235 (2010). 58. Bulmer, M. S., Bachelet, I., Raman, R., Rosengaus, R. B. & Sasisekharan, R. Targeting an antimicrobial effector function in insect immunity as a pest control strategy. Proc. Natl Acad. Sci. USA 106, 12652–12657 (2009). 59. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008). 60. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). 61. Van Ooijen, J. W. Multipoint maximum likelihood mapping in a full-sib family of an outbreeding species. Genet. Res. (Camb) 93, 343–349 (2011). 62. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003). 63. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004). 64. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004). 65. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003). 66. Elsik, C. G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007). 67. Ruan, J. et al. TreeFam: 2008 Update. Nucleic Acids Res. 36, D735–D740 (2008). 68. De Bie, T., Cristianini, N., Demuth, J. P. & Hahn, M. W. CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006). 69. Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010). 70. Chen, S. et al. De novo analysis of transcriptome dynamics in the migratory locust during the development of phase traits. PLoS One 5, e15633 (2010). Acknowledgements We thank Y. Feng for his assistance with the construction of inbred locust lines; R. Wu and Z. Lin for their assistance with the dissection of brain tissues; J. Han for his assistance with the isolation of total DNA; F. Zhao and Z. Sun for useful discussions; and L. Goodman for her comments with the revision of the manuscript. We also thank other faculty and staff at the Institute of Zoology, Chinese Academy of Sciences, and BGIShenzhen who contributed to the locust genome project. The locust genome project was supported by grants from the National Basic Research Program of China (number 2012CB114102) and National Natural Science Foundation of China (grant numbers 31210103915, 30830022 and 31301915). Author contributions X.H.W., X.F, P.Y., X.J. and F.J. contributed equally to this work as first authors. Group leader: L.K. Group managers: X.H.W. and X.F. Project coordination: L.K., X.H.W., Jian W., Jun W., X.F., Y.L. and G.Z. Manuscript writing: X.H.W., P.Y., F.J., D.Z., X.F. and L.K. Inbred line management: X.H.W. Flow cytometry: B.L. and X.H.W. Assembly and evaluation: X.F., P.Y., X.J., B.W., D.F., Y.F. X.H.W., B.L., B.Z. and N.L. Genome annotation: P.Y., X.H.W., F.J., D.Z., L.Y., Y.C., L.H., Z.H. and X.Y. Repeat analyses: F.J., Y.C., L.H. and P.Y. Transcriptome analyses: P.Y., X.H.W., Y.B.Z. and G.C. Methylation analyses: P.Y., X.H.W., W.Z., Q.L., J.W.W., J.L. and X.S.W. Evolutionary analyses: F.J., Z.X. and P.Y. Phase change: X.H.W., P.Y., F.J., X.G. and Z.M. Long-distance flight: D.Z., X.H.W., C.M. and W.G. Feeding and detoxification: F.J., F.C., M.Y., Y.Z. and Z.W. Pest control: F.J., X.H.W., G.C., P.Y. and W.Z. Immunity: P.Y. and Y.W. RNA interference: F.J., Y.L. and J.H. Manual annotation: S.H., B.C., J.W. and D.Y. Additional information Accession codes: Assembled genome sequences for Locusta migratoria have been deposited in DDBJ/EMBL/GenBank nucleotide core database under accession code AVCP000000000. Genome sequence reads and RRBS methylation sequence reads have been deposited in the GenBank sequence read archive under accession codes SRA064067 and SRA067627, respectively. Supplementary Information, accompanies this paper at http://www.nature.com/ naturecommunications Competing financial interests: The authors declare no competing financial interests. Reprints and permission information is available online at http://npg.nature.com/ reprintsandpermissions/ How to cite this article: Wang, X. et al. The locust genome provides insight into swarm formation and long-distance flight. Nat. Commun. 5:2957 doi: 10.1038/ncomms3957 (2014). This work is licensed under a Creative Commons AttributionNonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ NATURE COMMUNICATIONS | 5:2957 | DOI: 10.1038/ncomms3957 | www.nature.com/naturecommunications & 2014 Macmillan Publishers Limited. All rights reserved. 9 Supplementary Information The locust genome provides insight into swarm formation and long-distance flight Xianhui Wang1, Xiaodong Fang2, Pengcheng Yang1,3, Xuanting Jiang2, Feng Jiang1,3, Dejian Zhao1, Bolei Li1, Feng Cui1, Jianing Wei1, Chuan Ma1,3, Yundan Wang1,3, Jing He1, Yuan Luo1, Zhifeng Wang1, Xiaojiao Guo1, Wei Guo1, Xuesong Wang1,3, Yi Zhang1, Meiling Yang1, Shuguang Hao1, Bing Chen1, Zongyuan Ma1,3, Dan Yu1, Zhiqiang Xiong2, Yabing Zhu2, Dingding Fan2, Lijuan Han2, Bo Wang2, Yuanxin Chen2, Junwen Wang2, Lan Yang2, Wei Zhao2, Yue Feng2, Guanxing Chen2, Jinmin Lian2, Qiye Li2, Zhiyong Huang2, Xiaoming Yao2, Na Lv4, Guojie Zhang2, Yingrui Li2, Jian Wang2, Jun Wang2, Baoli Zhu4, Le Kang1,3# 1 State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China 2 BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China 3 Beijing Institutes of Life Science, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China 4 CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing 100101, China #Send correspondence to: Dr. Le Kang, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China Tel: 86-10-64807219; Fax: 86-10-84379559; E-mail: lkang@ioz.ac.cn Supplementary Figures Supplementary Figure S1 Worldwide distribution of locust plagues. Locust plague occurs in most continents of the world except Antarctic Continent. This figure was reproduced according to the reviews by Hemming71. Major locust species include Locusta migratoria, Schistocerca gregaria, Calliptamus italicu, Chortoicetes terminifera, Schistocerca americana and so on. Supplementary Figure S2 The distribution of 17-mer frequency in L. migratoria genome sequencing reads. Fifteen lanes of short (< 1 kb) paired-end genome sequencing reads that passed quality control were used to generate the 17-mer sequences. The 17-mer frequencies proportion to total 17-mer number was plotted against the 17-mer depth. The peak depth was 23X. The genome size of L. migratoria was estimated to be 6.3 Gb based on the formula: #Total 17-mer number / 17-mer depth (Supplementary Table S2). Supplementary Figure S3 Estimate of genome size based on flow cytometry. The genome size of L. migratoria (peak 414.24) was estimated to be ~6.3 Gb (414.24/209.46 * 3.25 * 0.978). Data on the M. musculus (Mus musculus) genome are provided for comparison (peak 209.46, genome size 3.25 pg). Supplementary Figure S4 Sequencing depth distribution of the L. migratoria genome. The used reads was 135 Gb (~22X) high quality reads with insert sizes of 200 bp. The depth distributions were plotted for the whole genome (red), gene body region including intron (blue), repeat (green) and gene body region of paralog genes (purple). The theoretical poisson distribution was plotted with =ββ.γ4. Supplementary Figure S5 Comparison of the assembled genome with BAC sequences. Sequencing depth on the BAC (Bacterial Artificial Chromosome) was calculated by mapping the Illumina HiSeq 2000 short reads onto the BAC sequence using BWA (version 0.6.2) with default parameters. Here we performed single-end mapping and reported all repetitive hits. The red line is the average depth across the genome. The predicted genes and annotated TEs on the BAC sequence are shown in green and black, respectively. The TEs were annotated by RepeatMasker using the locust repeat consensus sequence generated by RepeatModeler. The genes were annotated using proteins from 4 species, Drosophila melanogaster, Apis mellifera, Acyrthosiphon pisum and Pediculus humanus, with GeneWise. The remaining unclosed gaps on the scaffolds are marked as purple blocks. Supplementary Figure S6 Homologous regions between the locust and fly genome based on an Osiris gene family which is conserved and syntenic in insects. The homologous relationships were determined by reciprocal best hits of BLASTP searches. Supplementary Figure S7 Intron length distribution of 6 insect species. The density estimation of intron length was determined for each species. Supplementary Figure S8 Comparison of intron size expansion across insects. Pairwise comparisons between L. migratoria and other insects were visualized using the log expansion/contraction ratios of introns in the 1,046 conserved homologous genes. Supplementary Figure S9 Correlation of genome size and average intron length. The vertical and horizontal axes show the average intron size and the genome size, respectively. Pearson's correlation test is used to test the relationships between variables. Supplementary Figure S10 Statistics of TE-harboring and TE-free introns across 10 arthropod species. The fraction of TEs in introns was calculated from the aligned genome sequences of RepeatMasker screenings. Supplementary Figure S11 Counts of introns that are orthologous to 621 human U12-type introns in different species. The counts for other species were determined based on information in the U12db database. Locust, Locusta migratoria; Mosquito, Anopheles gambiae; Bee, Apis mellifera; Fruitfly, Drosophila melanogaster; Worm, Caenorhabditis elegans; Sea squirt, Ciona intestinalis; Human, Homo sapiens; Mouse, Mus musculus; Chicken, Gallus gallus; Fugu, Fugu rubripes; Zebrafish, Danio rerio. Supplementary Figure S12 Number of RP-sites per 100 kb inside large introns and their complementary sequences. Data for other species were retrieved from a previous study26. Locust, Locusta migratoria; Bee, Apis mellifera; Beetle, Tribolium castaneum; Mosquito, Anopheles gambiae; Fruitfly, Drosophila melanogaster; Sea squirt, Ciona intestinalis; Zebrafish, Danio rerio; Chicken, Gallus gallus; Opossum, Monodelphis domestica; Dog, Canis familiaris; Cow, Bos Taurus; Rat, Rattus norvegicus; Mouse, Mus musculus; Human, Homo sapiens. Supplementary Figure S13 Phylogenetic relationships of L. migratoria, D. pulex and other sequenced insects (left), and insect gene orthology (right). A) Phylogenetic relationships were inferred based on the concatenated data set from universal single-copy genes. The number of gains (1) and losses (2) of gene families is indicated as inferred by the CAFE program. The bootstrap values are shown in red. B) Comparison of the gene repertoire of nine insect and crustacean genomes. Bars are subdivided to represent different types of orthology relationships. 1:1 indicates universal single-copy genes (allowing absence and/or duplication for a single genome); X:X indicates orthologues present in multiple copies in all species (allowing one loss); homologue indicates a patchy orthologous relationship in the involved genomes. LOCMI: Locusta migratoria; DAPPU: Daphnia pulex; PEDHU: Pediculus humanus; ACYPI: Acyrthosiphon pisum; APIME: Apis mellifera; NASVI: Nasonia vitripennis; TRICA: Tribolium castaneum; BOMMO: Bombyx mori; ANOGA: Anopheles gambiae; DROME: Drosophila melanogaster. Supplementary Figure S14 Gene CpGo/e distribution of the 6 insect species. The observed/expected (Obs/Exp) ratio of CpG dinucleotides, based on G+C content, was calculated for each annotated gene in the six insect genomes. Similar to A. mellifera, L. migratoria also displays a bimodal distribution, with a clearly distinct class of low-CpG genes. L. migratoria: Locusta migratoria, P. humanus: Pediculus humanus, A. pisum: Acyrthosiphon pisum, A. mellifera: Apis mellifera, B. mori: Bombyx mori, and D. melanogaster: Drosophila melanogaster. Supplementary Figure S15 Insert size distribution of RRBS libraries. X axis denotes insert size of RRBS sequencing libraries, Y axis denotes the counts of read pairs with that insert size. (a) and (b) were libraries for gregarious, (c) and (d) were solitarious, (a) and (c) were libraries with short insert size (40-120bp), (b) and (d) were libraries with large insert size (120-220 bp). From these pictures, we can conclude that the insert size distributed as expected. Supplementary Figure S16 Methylation levels across all CpGs. X axis denotes the methylation levels of CpGs, binned at every 5 percent step. Y axis denotes the frequencies of CpGs at each bin. (a) Gregrarious and (b) solitarious samples were plotted. Supplementary Figure S17 Relative methylation level of CpG across different genomic regions. The promoter was defined as the upstream 2kb, 5kb and 10kb of start codon. Short and long denote the libraries of insert sizes 40-120 bp and 120-220 bp respectively. Supplementary Figure S18 RRBS and bisulfite PCR validation of the differences in CG site methylation ratio of LOCMI13806 between the brains of gregarious and solitarious locusts. The barplot at the top was produced from RRBS data and the bottom displayed the Bisulfite PCR validation of the yellow region of the top. Supplementary Figure S19 The number of differentially expressed genes during isolation of gregarious locusts (IG) and crowding of solitarious locusts (CS). The number of differentially expressed genes in the two conditions in total (A) and at every time point compared to control (B). Supplementary Figure S20 GO classification of the differentially expressed genes in the IG and CS processes. Schematic diagram showing GO classification of the differentially expressed genes in the IG and CS processes produced using the web-based tool WEGO72. Supplementary Figure S21 Statistics of splicing junctions and classification of alternative splicing across IG and CS processes. The junction numbers (A) were calculated based on the TopHat mapping result. The method of alternative splicing classification (B) was described at Supplementary methods. IG: Isolation of Gregarious locust; CS: Crowding of Solitarious locust. Supplementary Figure S22 Amino acid sequence alignment of L. migratoria and other insect JHEs. Numbers on the right side of the alignment indicate the position of residues in the sequence of each protein. The five catalytic motifs which are conserved in insect JHEs are underlined and marked with asterisks. Similar amino acid residues are shaded. AmJHE (Apis mellifera, NP_001011563), ArJHE (Athalia rosae, BAD91554), PhJHE (Psacothea hilaris, BAE94685), DmJHE (Drosophila melanogaster, NP_523758) and NlJHE (Nilaparvata lugens, ACB14344) were used in the alignment. Supplementary Figure S23 Protein alignments and PCR validation of L. migratoria FABPs. (A) Amino acid residues that are identical in all sequences are shaded red, while orange shadow indicates at least 50% identical amino acids in all sequences. The secondary structure is indicated as α (α-helix) and ( -strands). (B) PCR validation of all identified FABP genes. Genes are arranged in the same order as in the alignment (A). Primers are listed in Supplementary Table S30. Supplementary Figure S24 Expansion of antioxidant genes in the L. migratoria genome. (A) Phylogenetic tree of peroxiredoxin genes. The neighbor-joining algorithm was used with 500 bootstraps. The percentages of replicate trees (over 50%) in which the associated taxa clustered together in the bootstrap test (500 replicates) are shown next to the branches. Evolutionary distances were computed using the Poisson correction method and are expressed in units of number of amino acid substitutions per site. Rate variation among sites was modeled with gamma distribution (shape parameter = 1). (B) Organization of the expanded I cysteine peroxiredoxin (Prdx6) genes in the L. migratoria genome. Supplementary Figure S25 Expression levels of expanded gene families in L. migratoria in the energy mobilization processes. The red denotes the fat body samples before flying and the black denotes samples after flying. The asterisks at the bottom of the bar denote the significantly differentially expressed genes between these two samples. Supplementary Figure S26 The conserved TYhhhhhQF motif in the TM7 domain of GRs. Consensus sequences of the UGT motifs were created and displayed using the WebLogo server at http://weblogo.berkeley.edu/. Error bars indicate an approximate, Bayesian 95% confidence interval. The TM7 domain for the 75 gustatory receptors were used to construct the TYhhhhhQF motif. Supplementary Figure S27 A phylogenetic tree of L. migratoria GRs with representative GRs from other insects. Nodes with >50% bootstrap support (100 replicates) are indicated. APIME, Apis mellifera, TRICA, Tribolium castaneum, BOMMO, Bombyx mori, DROME, Drosophila melanogaster, and ACYPI, Acyrthosiphon pisum. Supplementary Figure S28 WebLogos representing the signature motif of L. migratoria UGT genes. UGT genes include a conserved C-terminal region that contain the UGT signature motif [FVA]-[LIVMF]-[TS]-[HQ]-[SGAC]-G-X[2]-[STG]-X[2]-[DE]-X[6]-P-[LIVMFA]-[ LIVMFA]-X[2]- P-[LMVFIQ]-X[2]-[DE]-Q, where X is any amino acid 73. The UGT signature motifs are underlined in red. Consensus sequences of the UGT motifs were created and displayed using the WebLogo server (http://weblogo.berkeley.edu/). Supplementary Figure S29 The phylogenetic tree of glutathione S-transferases in L. migratoria. The six subclasses are indicated on the right panel. Bootstrap values greater than 50 are labelled at the nodes. Scale bar shows the inferred amino acid distance. Supplementary Figure S30 The phylogenetic tree of L. migratoria CCE genes with representatives of CCE genes from different subclasses. Numbers at nodes indicate bootstrap values, and only values greater than 50% are shown. Nomenclature of the clades was done based on previous studies74. Supplementary Figure S31 The insect P450 gene family. The four distinct branches of the phylogenetic tree corresponding to the known four clades of insect P450 genes. Bootstrap values are indicated for the four insect clades. LOCMI: Locusta migratoria; ACYPI: Acyrthosiphon pisum; APIME: Apis mellifera; AEDAE:, Aedes aegypti; TRICA: Tribolium castaneum and ANOGA: Anopheles gambiae. Supplementary Figure S32 Comparison of CYP gene numbers across several insect species. The number of identified insect CYP genes in each P450 clade was retrieved from a previous study75. LOCMI: Locusta migratoria; ACYPI: Acyrthosiphon pisum; PEDHU: Pediculus humanus; APIME: Apis mellifera; NASVI: Nasonia vitripennis; TRICA: Tribolium castaneum; AEDAE, Aedes aegypti; BOMMO: Bombyx mori; DROME: Drosophila melanogaster and ANOGA: Anopheles gambiae. Supplementary Tables Supplementary Table S1 Illumina sequencing data for L. migratoria genome assembly. Raw Filter Insert Reads Total Sequence Physical Total Sequence Physical Size Length Data (G) coverage coverage Data coverage coverage (X) (X) (G) (X) (X) 170 100 38.57 6.12 10.41 30.23 4.8 8.59 200 150 54.63 8.67 11.56 44.51 7.07 9.81 200 100 277.47 44.04 88.09 218.36 34.66 74.4 500 100 252.93 40.15 200.74 168.41 26.73 144.9 800 100 165.23 26.23 209.82 103.02 16.35 139.61 2000 49 76.87 12.2 498.06 54.76 8.69 354.78 5000 49 84.75 13.45 1,372.72 42.16 6.69 682.84 10000 90 14.32 2.27 252.52 9.57 1.52 168.84 10000 49 39.96 6.34 1294.6 18.34 2.91 593.96 20000 49 80.46 12.77 5,212.82 21.77 3.46 1,410.63 20000 90 19.03 3.02 671.15 4.5 0.71 158.75 40000 49 30.81 4.89 3,992.78 5.15 0.82 667.3 1,135.06 180.17 13,815.27 720.79 114.41 4,414.41 Total Note: For raw data, PCR duplication, low quality and adaptor contamination have been filtered. Physical coverage was calculated as insert size x paired-end reads number / genome size. The genome size is assumed to be 6.3 Gb. Supplementary Table S2 Genome size estimation of L. migratoria based on 17-mer analysis. k-mer size #Total k-mer k-mer depth Genome size (bp) 17 146,642,996,731 23 6,375,782,466 Note: See Supplementary Supplementary Figure S2 for detailed description of the methods used. Supplementary Table S3 Statistics of the assembly of L. migratoria genome. Contig Scaffold Size(bp) Number Size(bp) Number N90 1,933 672,695 17,393 39,025 N80 3,763 463,944 56,676 19,124 N70 5,527 338,619 107,724 10,783 N60 7,349 248,612 187,295 6,135 N50 9,338 179,233 322,681 3,440 Total size (bp) 5,748,086,465 6,524,990,357 Longest (bp) 106,221 7,902,988 Total number (>100 bp) 1,438,086 551,270 Total number (>2 kb) 661,979 97,233 N50 is the size above which the additive length occupies 50% of the total length of the assembly sequences. The total sizes of the contigs and scaffold were 5.7 Gb and 6.5 Gb respectively, with 0.78 Gb Ns (12% of the total assembly). Supplementary Table S4 Statistics of Sanger sequencing reads aligned to the BAC sequences assembled using Solexa reads. Library Ratio mismatch #Reads Coverage>90% Coverage>80% 107-88 0.09% 261 98.85% 100.00% 107-50 0.03% 286 94.76% 99.65% 107-16 0.04% 251 97.21% 98.01% 107-74 0.02% 284 91.90% 94.01% 107-86 0.03% 287 97.91% 98.26% 107-25 0.03% 198 93.43% 94.44% Average 0.04% 261 95.68% 97.40% The alignment was performed using BLAT. Supplementary Table S5 Assembly quality validation by BAC coverage estimation. BAC ID Length (bp) Coverage (%) #Inconsistent #Scaffold Scaffold length (bp) 107-14 94,424 99.83 11 1 108,004 107-2 57,640 99.79 27 3 159,089 107-25 83,390 99.37 21 4 160,334 107-38 91,133 99.93 12 1 98,802 107-52 109,069 98.67 37 1 115,718 107-73 87,904 91.76 35 11 417,471 107-74 51,544 99.16 5 1 51,857 107-86 58,464 95.03 22 3 174,224 107-88 84,676 95.00 21 5 157,623 Average 79,678 94.00 22 4 206,414 Supplementary Table S6 Assembly quality validation by EST coverage estimation. Total Number Total Match >50% Percent >90% Number Percent Number >200 41,880 41,547 99.2% 41,360 98.76% Number 39,625 Percent 94.62% >500 23,408 23,242 99.29% 23,162 98.95% 22,647 96.75% Supplementary Table S7 Comparison of assembled scaffolds with 71 L. migratoria complete CDS sequences in GenBank. ID Length (bp) % bases ID covered by Length % bases (bp) covered by single best single best piece piece AB583233.1 1,446 100.00% FJ609648.1 2,075 89.69% AB698670.1 2,233 100.00% FJ609649.1 1,848 98.48% AB698671.1 2,625 100.00% FJ609738.1 2,089 99.52% AB698672.1 2,142 100.00% FJ609739.1 2,156 100.00% AF049136.1 2,401 87.51% FJ609741.1 2,144 97.81% AF083951.1 2,360 98.18% FJ771024.1 2,471 99.07% AF083952.1 2,031 97.29% FJ771025.1 2,388 98.91% AF083953.1 1,944 68.47% FJ795020.1 1,953 98.46% AF107732.1 579 99.83% GU067730.1 5,116 95.72% AF107733.1 700 99.86% GU067731.1 5,116 95.72% AF115777.1 1,850 90.05% GU593056.1 939 95.95% AF136372.1 4,677 99.64% GU722575.1 1,055 80.76% AY040537.1 3,926 96.03% GU722576.1 469 96.59% AY077627.1 1,835 96.84% GU722577.1 517 97.29% AY077628.1 1,469 98.16% GU722578.1 459 99.78% AY299637.3 1,968 99.54% GU722579.1 483 100.00% AY348873.1 4,752 99.54% HM131834.1 657 100.00% AY445913.2 2,465 96.80% HM131835.1 645 99.84% DQ340870.1 3,774 95.95% HM131836.1 615 99.67% DQ355963.1 883 95.58% HM131837.1 615 100.00% DQ355964.1 1,802 98.28% HM131838.1 615 100.00% DQ355965.1 773 96.25% HM131839.1 615 100.00% DQ355966.1 1,660 98.19% HM131840.1 609 100.00% DQ513322.1 2,052 98.73% HM131841.1 615 99.02% DQ848565.1 1,313 99.77% HM131842.1 615 100.00% EF090723.1 1,604 96.38% HM131843.1 696 99.71% EU131894.1 2,806 98.68% HM153425.1 1,917 78.46% EU231603.1 2,255 69.09% HM153426.1 1,895 98.47% FJ215322.1 607 95.22% HQ213937.1 2,271 98.90% FJ472841.1 2,863 90.95% J03888.1 672 100.00% FJ472842.1 2,395 97.58% JN129988.1 599 59.93% FJ472843.1 2,583 98.61% JN247410.1 465 82.80% FJ609646.1 2,488 99.12% JN661173.1 1,600 98.69% FJ609647.1 1,766 98.36% M36206.1 938 95.42% U90609.1 2,193 96.58% U74469.1 4,016 91.38% Z22805.1 603 99.34% Sum Length: 127,771 Covered%: 95.72% Supplementary Table S8 Comparison of gene parameters among the sequenced insects and H. sapiens. species GN CO % SE % ATL ACL AEG AEL AIL L. migratoria 17,307 14,805 85.54 3,079 17.79 54,341 1,160 5.77 201 11,159 A. pisum 33,486 32,316 96.51 8,967 26.78 4,333 4.06 221 1,121 P. humanus 10,775 10,646 98.80 353 3.28 3,139 1,545 6.43 240 294 A. mellifera 11,062 10,411 94.11 847 7.66 8,976 1,627 6.46 252 1,347 D. melanogaster 13,689 13,608 99.41 2,761 20.17 4,261 1,621 3.97 408 888 A. gambiae 14,324 14,324 100 1,667 11.64 5,955 1,583 4.21 376 1,363 B. mori 14,623 14,623 100 2,260 15.46 6,030 1,224 5.44 225 1,082 D. pulex 30,907 30,907 100 5,341 17.28 2,009 4.62 211 285 N. vitripennis 17,369 17,369 100 2,598 14.96 6,934 1,412 5.30 267 1,286 T. castaneum 16,531 15,733 95.17 2,937 17.77 5,289 1,351 4.34 311 1,179 H. sapiens 22,389 20,098 89.77 3,318 14.82 44,855 1,560 8.96 174 5,436 899 976 Abbreviations: GN: Gene set number; CO: Number of genes with complete open reading frames; SE: Single Exon Gene Number; ATL: Average transcript length (bp); ACL: Average CDS length (bp); AEG: Average exon number per gene; AEL: Average exon length (bp); AIL: Average intron length (bp). Supplementary Table S9 Gene functional annotation. Number %genes 17,307 100 Swissprot 11,513 66.52% TrEMBL 12,101 69.92% KEGG 10,687 61.75% InterPro 10,268 59.33% GO 8,459 48.88% NR 12,343 71.32% All annotated 12,963 74.90% 4,344 25.10% Total Annotated Unknown Supplementary Table S10 Number of base pairs occupied by transposable element derived sequences in the genomes of four species. L. migratoria D. melanogaster H. sapiens Types Length (bp) P% Length (bp) P(%) Length (bp) P(%) DNA 1,480,538,225 22.69 4,849,763 2.87 99,797,428 3.18 LINE 1,332,720,207 20.42 12,119,904 7.18 637,919,432 20.33 LTR 508,675,263 7.80 21,849,378 12.95 267,738,295 8.53 nonLTR 63,892,419 0.98 - - - - Retro 153,548,453 2.35 - - - - Unknown 406,097,360 6.22 11,211,970 6.64 1,298,163 0.04 SINE 141,176,698 2.16 52,841 0.03 397,225,496 12.66 Simple_repeat 13,026,240 0.20 2,733 0.00 26,240,511 0.84 Other 32,017 0.00 698,554 0.41 4,153,812 0.13 Total 3,840,808,141 58.86 50,785,143 30.00 1,434,373,137 46.00 Supplementary Table S11 Top 10 dominant repeat families. # Copy number Length (bp) Percentage (%) 1 611,942 260,834,265 4.00 NonLTR/LINE/RTE-BovB 2 1,434,794 235,189,848 3.60 NonLTR/SINE/LM1 3 440,451 172,763,602 2.65 NonLTR/LINE/CR1 4 216,915 77,610,704 1.19 NonLTR/LINE/CR1 5 117,885 56,089,356 0.86 DNA/TcMar-Tc1 6 144,881 51,843,228 0.79 DNA 7 87,352 37,702,822 0.58 Satellite 8 209,668 34,698,742 0.53 NonLTR/SINE/ID 9 143,144 24,029,149 0.37 NonLTR/SINE/MIR 10 48,432 18,584,822 0.28 NonLTR/LINE/CR1 Supplementary Table S12 Statistics of intronic expansion/contraction in 1,046 conserved single copy orthologous genes. Species Contraction% Expansion% Bombyx mori 2.4% 97.6% Pediculus humanus 1.5% 98.5% Anopheles gambiae 2.3% 97.7% Tribolium castaneum 2.3% 97.7% Drosophila melanogaster 2.0% 98.0% Apis mellifera 3.8% 96.2% Acyrthosiphon pisum 4.2% 95.8% Nasonia vitripennis 3.9% 96.1% Daphnia pulex 0.9% 99.1% Average 2.6% 97.3% Supplementary Table S13 insect-specific gene families. IPR ID IPR enrichment of hemi/holo-metabolous IPR Title FDR Holometabolous Specific gene families IPR013087 Zinc finger, C2H2-type/integrase, DNA-binding 2.49E-32 IPR007087 Zinc finger, C2H2-type 2.85E-30 IPR015880 Zinc finger, C2H2-like 2.30E-28 IPR004117 Olfactory receptor, Drosophila 1.63E-21 IPR009057 Homeodomain-like 1.09E-10 IPR012287 Homeodomain-related 7.61E-08 IPR012934 Zinc finger, AD-type 9.29E-08 IPR020479 Homeobox, eukaryotic 1.49E-07 IPR013653 FR47-like 3.23E-06 IPR007614 Retinin-like protein 5.68E-06 IPR001356 Homeobox 7.99E-06 IPR006625 Insect pheromone/odorant binding protein PhBP 2.21E-05 IPR006631 Protein of unknown function DM4/12 5.20E-05 IPR006170 Pheromone/general odorant binding protein, PBP/GOBP 7.79E-05 IPR023316 Pheromone/general odorant binding protein, PBP/GOBP, domain 1.14E-04 IPR005520 Attacin, N-terminal 1.14E-04 IPR000010 Proteinase inhibitor I25, cystatin 1.14E-04 IPR000219 Dbl homology (DH) domain 2.80E-04 IPR013069 BTB/POZ 4.93E-04 IPR002557 Chitin binding domain 5.73E-04 IPR005521 Attacin, C-terminal 1.28E-03 IPR011526 Helix-turn-helix, Psq-like 1.55E-03 IPR022773 Siva 1.78E-03 IPR011333 BTB/POZ fold 2.38E-03 IPR007889 Helix-turn-helix, Psq 4.48E-03 IPR022727 Pupal cuticle protein C1 5.86E-03 IPR000237 GRIP 2.63E-02 IPR020381 Proteinase inhibitor I25, cystatin, conserved region 2.79E-02 IPR008837 Serendipity locus alpha 2.79E-02 IPR008422 Homeobox KN domain 2.79E-02 IPR010562 Haemolymph juvenile hormone binding 2.99E-02 IPR012464 Protein of unknown function DUF1676 2.99E-02 IPR016179 Insulin-like 3.59E-02 IPR001159 Double-stranded RNA-binding 3.59E-02 IPR006593 Cytochrome b561/ferric reductase transmembrane 4.57E-02 IPR004877 Cytochrome b561, eukaryote 4.57E-02 IPR011993 Pleckstrin homology-type 4.57E-02 Hemimetabolous Specific gene families IPR006578 MADF domain 1.47E-33 IPR006612 Zinc finger, C2CH-type 3.11E-12 IPR000618 Insect cuticle protein 7.92E-03 IPR003961 Fibronectin, type III 7.92E-03 IPR002298 DNA polymerase A 1.08E-02 IPR008957 Fibronectin type III domain 1.08E-02 IPR004210 BESS motif 1.59E-02 IPR002156 Ribonuclease H domain 3.75E-02 IPR002350 Proteinase inhibitor I1, Kazal 4.53E-02 IPR012337 Ribonuclease H-like 4.53E-02 IPR004911 Gamma interferon inducible lysosomal thiol reductase GILT 4.91E-02 IPR015689 Tachykinin-like receptor 4.91E-02 FDR: false discovery rate. Supplementary Table S14 IPR annotation of the expanded families in L. migratoria. A L P A A B D D N T P M H G M M P M V C I I U A E O U E I A IPR006578: MADF domain; 41 64 0 3 1 4 0 1 1 2 0 no ipr annotation 27 41 0 0 0 0 0 0 0 0 0 0 26 0 0 0 0 0 0 0 0 0 0 96 0 0 0 0 0 0 0 0 0 4 29 0 0 0 0 0 0 0 3 0 19 78 5 29 16 52 9 19 30 35 0 9 51 6 6 5 16 2 4 9 18 0 0 33 0 0 0 0 0 0 0 0 0 133 46 2 2 0 11 1 0 2 24 1.0E-06 7 34 1 6 1 21 2 1 5 8 1.0E-06 45 70 4 23 11 33 24 33 19 27 4.0E-06 0 16 0 0 0 0 0 0 11 0 6.2E-05 23 69 10 42 31 27 12 34 45 68 1.6E-04 2 27 3 13 5 9 1 7 10 12 2.00E-04 IPR annotation FDR IPR009072: Histone-fold;IPR000558: Histone H2B;IPR007125: Histone core; IPR004117: Olfactory receptor, Drosophila; IPR001005: SANT domain, DNA binding;IPR012287: Homeodomain-related;IPR006 578: MADF domain; IPR002018: Carboxylesterase, type B; IPR005055: Insect pheromone-binding protein A10/OS-D; IPR013604: 7TM chemoreceptor; IPR013087: Zinc finger, C2H2-type/integrase, DNA-binding;IPR007087: Zinc finger, C2H2-type; IPR001360: Glycoside hydrolase, family 1;IPR013781: Glycoside hydrolase, subgroup, catalytic core; IPR002213: UDP-glucuronosyl/UDP-gluco syltransferase; IPR018272: PRANC domain; IPR020683: Ankyrin repeat-containing domain; IPR001128: Cytochrome P450; IPR013788: Arthropod hemocyanin/insect LSP;IPR005204: Hemocyanin, N-terminal;IPR000896: Hemocyanin, copper-containing;IPR005203: Hemocyanin, C-terminal; IPR003663: Sugar/inositol transporter;IPR020846: Major 17 43 9 17 16 20 1 13 23 41 3.84E-04 12 20 1 8 5 3 1 10 19 17 6.07E-03 19 29 16 12 7 21 1 22 8 30 7.72E-03 0 8 0 0 0 4 0 0 12 0 1.25E-02 10 24 4 11 4 20 6 14 10 26 1.87E-02 10 20 2 7 5 8 1 11 18 32 2.59E-02 facilitator superfamily; IPR002347: Glucose/ribitol dehydrogenase;IPR016040: NAD(P)-binding domain; IPR010562: Haemolymph juvenile hormone binding; IPR016187: C-type lectin fold;IPR001304: C-type lectin; IPR001071: Cellular retinaldehyde binding/alpha-tocopherol transport;IPR001251: Cellular retinaldehyde-binding/triple function, C-terminal;IPR011074: Phosphatidylinositol transfer protein-like, N-terminal; IPR003439: ABC transporter-like;IPR011527: ABC transporter, transmembrane domain, type 1;IPR001140: ABC transporter, transmembrane domain; Abbreviations: LMI: Locusta migratoria, DPU: Daphnia pulex, PHU: Pediculus humanus, API: Acyrthosiphon pisum, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AGA: Anopheles gambiae and DME: Drosophila melanogaster. Supplementary Table S15 Members of DNA methyltransferase (DNMT) genes in various arthropod species. Species DNMT1 DNMT2 DNMT3 Daphnia pulex 1 1 1 Locusta migratoria 2 1 1 Acyrthosiphon pisum 2 1 2 Pediculus humanus 2 1 0 Apis mellifera 2 1 1 Nasonia vitripennis 3 1 1 Camponotus floridanus 1 1 2 Harpegnathos saltator 1 1 2 Solenopsis invicta 1 1 1 Pogonomyrmex barbatus 1 1 1 Atta cephalotes 1 1 1 Tribolium castaneum 1 1 0 Bombyx mori 1 1 0 Danaus plexippus 1 1 0 Drosophila melanogaster 0 1 0 Supplementary Table S16 Mapping statistics of RRBS and whole genome test reads. Samples WholeGenome Library Gregarious Solitarious short_lib Long_lib Short_lib Long_lib Insert Size 400 62 125 62 113 Read length 100 49 49 49 49 #Reads (M) 22 127 149 108 155 Total bp (Mb) 2,222 6,246 7,278 5,307 7,600 #AfterFilter_Reads (M) 22 106 132 99 151 #AfterFilter_bp (Mb) 2,134 5,148 6,449 4,817 7,378 Unique Mapped (%) 52.18% 39.73% 46.21% 42.34% 50.00% The whole genome test Bisulfite sequencing reads was sequenced using the gregarious brain samples. The Short_lib represents the libraries with 40-120 bp insert sizes and the Long_lib represents the libraries with 120-240 bp insert sizes. The insert size here was estimated by mapping the reads to the reference genome. Supplementary Table S17 Summary of methylation types in the L. migratoria genome by RRBS and whole genome test. SampleName TotalC CpGC CHGC CHHC CpGCT CHGCT CHHCT PctCpG PctCHG PctCHH WholeGenome 184,180,721 2,670,788 93,696 304,983 28,722,148 37,376,081 115,013,025 8.50% 0.30% 0.30% Gregarious brain 583,516,064 21,132,528 1,088,326 1,767,250 166,599,320 137,895,310 255,033,330 11.26% 0.78% 0.69% Solitarious brain 664,905,454 23,593,345 1,310,445 2,019,361 197,011,225 160,519,740 280,451,338 10.69% 0.81% 0.71% TotalC include three kinds cytosine, CpG, CHG and CHH. PctCpG was calculated as the ratio of methylated CpG (CpGC) to the sum of CpGC and CpGCT, which was the number of C that was converted to T in the CpG context. The same for PctCHG and PctCHH. Supplementary Table S18 statistics of covered CpG sites and related genes in the genome. #CoveredCpG #CoveredCpG_genebody %CoveredCpG_genebody #genes_covered_≥1_CpG #genes_covered_≥4_CpG Gregarious Solitarious Merge Intersect 7,568,981 935,461 12.36% 11,970 11,447 7,996,483 940,919 11.77% 11,802 11,337 9,311,972 1,092,751 11.73% 12,202 11,743 4,345,168 783,629 18.03% 11,570 11,041 The minimum depth of covered CpG sites was 10X. Merge: at any samples. Supplementary Table S19 The 90 differentially methylated genes between the gregarious and solitarious locust brains. GeneID 1 2 3 4 G CG G-B S-B #DMC #Cov RPKM RPKM 5 Adjpv Annotation (S-B/G-B) LOCMI01552 4 22 6.55 6.28 0.81 Neuroglobin LOCMI02733 6 24 4.23 4.87 0.42 NA LOCMI02892 6 87 147.55 94.15 0.00 Mitogen-activated protein-binding protein-interacting protein homolog LOCMI03529 4 64 6.98 6.54 0.66 NA LOCMI03618 7 43 0.40 2.34 0.00 NA LOCMI03849 4 330 33.39 56.35 0.00 SAP domain-containing ribonucleoprotein LOCMI04268 5 534 10.59 11.42 0.54 Kinesin-associated protein 3 LOCMI04715 4 110 1.90 2.28 0.47 NA LOCMI05065 5 93 28.11 34.45 0.00 Sterile alpha and TIR motif-containing protein 1 LOCMI05203 4 5 20.73 20.02 0.70 NA LOCMI05982 5 49 13.11 18.46 0.00 Early endosome antigen 1 LOCMI06392 4 75 14.91 20.56 0.00 MAGUK p55 subfamily member 7 LOCMI06494 12 97 60.17 51.91 0.00 Unc-112-related protein LOCMI06973 5 274 16.43 18.02 0.32 UPF0614 protein C14orf102 LOCMI07217 5 126 6.41 3.97 0.00 Transmembrane protein 194A LOCMI07377 9 422 176.42 157.21 0.00 Basement membrane-specific heparan sulfate proteoglycan core protein LOCMI07716 4 100 8.53 7.75 0.50 Protein FAN LOCMI07961 4 99 1.04 3.23 0.00 NA LOCMI08088 4 189 56.34 41.44 0.00 Probable protein-cysteine N-palmitoyltransferase porcupine LOCMI08168 4 15 0.00 2.40 0.00 Ankyrin-2 LOCMI08308 5 233 64.88 56.36 0.00 E3 ubiquitin-protein ligase hyd LOCMI08333 5 424 16.37 17.61 0.45 Polyhomeotic-like protein 1 LOCMI08659 11 160 4.36 6.85 0.00 Zinc finger protein 91 LOCMI08753 4 8 0.63 5.55 0.00 LOCMI08782 18 228 0.89 1.54 0.14 LOCMI08848 4 344 4.32 3.46 0.25 LOCMI08885 4 218 4.15 3.26 0.20 Switch-associated protein 70 LOCMI08924 4 125 26.63 22.97 0.04 NA LOCMI08951 4 51 10.70 14.04 0.01 PR domain zinc finger protein 4 LOCMI09066 10 150 13.18 9.94 0.01 Transmembrane protein 183 LOCMI09161 6 273 2.00 3.48 0.01 RNA exonuclease 1 homolog LOCMI09176 7 177 11.07 16.64 0.00 Rho GTPase-activating protein 190 LOCMI09218 5 255 29.82 25.77 0.03 HEAT repeat-containing protein 5B LOCMI09288 4 159 0.00 0.04 0.59 NA LOCMI09354 5 322 22.62 22.90 0.90 NA LOCMI09607 4 110 5.02 9.02 0.00 Zinc finger protein 106 homolog LOCMI09748 9 184 24.75 22.43 0.20 NA LOCMI09749 7 364 11.38 14.10 0.03 RING finger protein unkempt LOCMI09785 7 234 9.15 11.97 0.01 Probable uridine-cytidine kinase LOCMI09797 22 216 9.70 15.50 0.00 Telomerase-binding protein EST1A LOCMI09866 4 75 97.52 111.95 0.00 Eukaryotic translation initiation Inversin factor 4B LOCMI09918 8 244 51.84 57.26 0.04 Serine/threonine-protein kinase SRPK3 LOCMI09987 10 439 32.77 28.37 0.03 NA LOCMI10093 7 201 14.02 14.79 0.61 GRIP and coiled-coil domain-containing protein 1 LOCMI10164 4 145 15.86 17.44 0.30 Protein Daple LOCMI10228 5 256 1.95 3.27 0.02 Uncharacterized protein KIAA0467 LOCMI10293 6 176 46.96 29.62 0.00 Collagen alpha-1(XXVII) chain LOCMI10351 6 140 25.40 30.67 0.00 Brefeldin A-inhibited guanine nucleotide-exchange protein 3 LOCMI10509 5 60 17.58 12.03 0.00 ATP-binding cassette sub-family D member 3 LOCMI10788 4 58 1.76 1.15 0.18 Tubulin glycylase 3B LOCMI11004 9 115 43.14 31.20 0.00 Phosphatidylinositol glycan anchor biosynthesis class U protein LOCMI11005 8 51 9.64 12.00 0.05 Ribosome biogenesis protein BOP1 homolog LOCMI11060 5 68 4.41 4.15 0.77 Ankyrin-2 LOCMI11711 5 586 8.87 7.10 0.09 Ubiquitin-protein ligase E3B LOCMI12126 4 73 3.63 5.64 0.01 Uncharacterized protein C10orf118 homolog LOCMI12132 5 55 5.47 8.75 0.00 Protein FAM91A1 LOCMI12604 4 25 0.42 0.04 0.04 Putative ankyrin repeat protein RBE_0921 LOCMI12740 4 210 0.58 1.25 0.04 LOCMI12853 5 142 42.90 45.55 0.30 Phosphatidylinositol-4-phosphate 5-kinase type-1 alpha LOCMI13295 4 80 0.00 0.00 1.00 Sialin OS=Homo sapiens LOCMI13328 4 119 29.86 31.06 0.58 Calcium-binding and coiled-coil domain-containing protein 2 LOCMI13357 4 95 29.17 39.62 0.00 Endophilin-B1 LOCMI13712 4 300 43.42 40.68 0.26 Glycyl-tRNA synthetase LOCMI13714 6 358 2.82 3.99 0.08 NA LOCMI13789 5 76 10.96 13.58 0.04 Probable phospholipid-transporting ATPase IA LOCMI13989 4 101 0.59 0.05 0.00 Netrin-G1 ligand LOCMI14004 4 19 9.60 17.88 0.00 Serine--pyruvate aminotransferase, mitochondrial LOCMI14187 4 145 22.85 21.61 0.51 Polyphosphoinositide phosphatase LOCMI14270 4 41 104.11 97.50 0.07 Dynein heavy chain, cytoplasmic LOCMI14289 4 85 4.18 6.25 0.01 Dedicator of cytokinesis protein 7 LOCMI14667 11 102 4.75 4.87 0.92 Aladin LOCMI14843 6 44 0.04 0.07 1.00 NA LOCMI14963 4 78 4.80 11.27 0.00 Hyaluronan mediated motility receptor LOCMI15522 4 59 35.13 46.27 0.00 NA LOCMI15740 5 123 0.11 0.36 0.17 NA LOCMI15751 7 499 9.29 10.32 0.40 Protein son of sevenless LOCMI15754 4 226 6.10 11.38 0.00 NA LOCMI15756 5 72 32.96 53.96 0.00 NA LOCMI15961 4 288 11.51 12.66 0.41 NA LOCMI16020 7 95 25.43 17.96 0.00 NA LOCMI16380 6 38 49.61 38.72 0.00 Serine/threonine-protein kinase mTOR LOCMI16558 4 140 2.59 1.58 0.05 NA LOCMI16680 4 275 9.07 12.70 0.00 NA LOCMI17108 4 77 143.47 116.71 0.00 NA LOCMI17214 4 165 2.56 2.34 0.75 Serine/threonine-protein kinase DCLK1 LOCMI17232 5 172 66.60 51.17 0.00 Niemann-Pick C1 protein LOCMI17328 7 120 13.76 15.05 0.36 NUAK family SNF1-like kinase 1 LOCMI17336 4 106 17.77 13.89 0.01 Vacuolar protein sorting-associated protein 16 homolog LOCMI17416 6 23 3.26 1.80 0.01 NA LOCMI13806 8 107 6.15 7.60 0.13 Phosphatidylinositol-4,5-bisphospha te 3-kinase catalytic subunit delta isoform Notes: 1. Number of differentially methylated CG sites of one gene. 2. Number of Covered CG sites of one gene; 3. The expression level (RPKM value) in the brain of gregarious locust. 4. The expression level (RPKM value) in the brain of solitarious locust. 5. False discovery rate of the significant level of the differential expression in the brains between gregarious and solitarious locusts. Supplementary Table S20 Transcriptome data and mapping statistics. Sample Total reads Mapped reads Map% RPKM >0 >=1 >=5 Gegg 3,763,045 1,513,052 40.21% 12,231 11,735 8,512 G1_2 4,484,110 2,026,623 45.20% 12,339 11,668 8,010 G3 16,895,730 11,229,663 66.46% 14,392 12,046 8,056 G4 20,617,698 15,806,218 76.66% 14,925 13,265 9,713 G5 5,589,216 2,983,675 53.38% 13,375 12,628 8,656 Gadult 3,908,336 1,931,045 49.41% 11,977 11,639 9,148 Segg 5,612,745 2,464,268 43.90% 12,763 12,010 8,728 S1_2 4,117,835 1,701,616 41.32% 12,226 11,716 8,152 S3 16,669,884 11,209,294 67.24% 14,209 11,910 7,679 S4 19,959,546 16,142,892 80.88% 14,235 12,067 8,307 S5 5,111,368 2,601,197 50.89% 12,654 11,809 7,661 Sadult 4,974,685 2,650,487 53.28% 12,985 12,043 7,684 G-B 70,619,250 56,173,778 79.54% 14,637 12,584 9,783 S-B 76,275,942 58,821,271 77.12% 14,470 12,535 9,879 IG-C 85,275,914 68,065,130 79.82% 14,649 11,680 7,183 IG-4h 91,007,586 72,991,150 80.20% 14,385 11,766 7,529 IG-8h 75,905,038 59,360,007 78.20% 14,238 11,447 6,881 IG-16h 67,089,958 54,110,697 80.65% 14,324 11,714 7,599 IG-32h 77,311,238 57,580,234 74.48% 14,529 11,369 6,859 CS-C 88,614,074 69,474,148 78.40% 14,374 11,535 7,153 CS-4h 89,207,696 69,680,045 78.11% 14,623 11,921 7,981 CS-8h 80,837,168 63,208,282 78.19% 14,541 12,042 8,210 CS-16h 88,387,552 67,905,408 76.83% 14,239 11,199 6,568 CS-32h 79,676,856 65,032,909 81.62% 14,114 11,071 6,327 Antenna 126,501,261 101,230,019 80.02% 15,759 13,053 10,557 Brain 92,782,188 76,978,395 82.97% 14,378 12,103 9,719 Gangalia 114,664,633 90,319,596 78.77% 14,671 11,805 8,709 Hind leg 112,788,727 90,072,088 79.86% 14,104 10,399 6,510 Hemolymph 127,040,788 103,708,042 81.63% 15,083 12,119 9,237 Middle gut 95,655,682 74,337,067 77.71% 14,088 11,021 8,422 Muscle 99,234,003 84,745,107 85.40% 13,166 8,159 4,238 Ovary 109,948,380 89,723,778 81.61% 15,081 11,794 9,693 Testis 101,957,854 82,362,842 80.78% 14,933 11,747 8,847 Wing 134,083,093 107,204,353 79.95% 15,311 12,281 9,353 Mixed 120,544,806 100,795,136 83.62% 15,659 13,415 9,730 Total 1,202,457,276 935,458,225 78.31% 16,913 16,244 14,940 Data in the first 12 lines were retrieved from a report by Chen et al70, who has studied the developmental time course (egg, 1_2, 3, 4, 5 and adult developmental stages) of gregarious (G) and solitarious (S) L. migratoria using RNA-seq. G-B and S-B were RNA-seq data for samples used for the RRBS study. The RNA-seq data of time course (control, 4, 8, 16 and 32 hour) of isolation of gregarious (IG) and crowding of solitarious (CS) locusts was produced in this study. Mixed samples from various organs and developmental stages were used to assist gene annotation. Supplementary Table S21 GO enrichment of differentially expressed genes (DEGs) in the process of isolation of gregarious locust. GO_ID GO_Term GO_Class AdjustedPv GOlevl IG-4h.down GO:0008236 serine-type peptidase activity MF 7.06E-68 5 GO:0005975 carbohydrate metabolic process BP 4.55E-62 4 GO:0070011 peptidase activity, acting on L-amino acid MF 6.90E-51 5 peptides GO:0006508 proteolysis BP 3.66E-37 5 GO:0004175 endopeptidase activity MF 6.27E-37 6 GO:0004252 serine-type endopeptidase activity MF 3.27E-29 6 GO:0016787 hydrolase activity MF 2.62E-26 3 GO:0004553 hydrolase activity, hydrolyzing O-glycosyl MF 1.46E-25 5 compounds GO:0044238 primary metabolic process BP 5.52E-20 3 GO:0008152 metabolic process BP 2.65E-17 2 GO:0044430 cytoskeletal part CC 9.21E-05 4 GO:0005856 cytoskeleton CC 9.21E-05 5 GO:0016491 oxidoreductase activity MF 7.79E-04 3 GO:0015630 microtubule cytoskeleton CC 9.79E-04 6 GO:0005874 microtubule CC 2.25E-03 4 GO:0007017 microtubule-based process BP 3.61E-03 3 GO:0007018 microtubule-based movement BP 3.61E-03 4 GO:0016758 transferase activity, transferring hexosyl MF 4.14E-03 5 IG-4h.up groups GO:0051258 protein polymerization BP 1.91E-02 7 GO:0034622 cellular macromolecular complex assembly BP 3.88E-02 6 IG-8h.down GO:0005975 carbohydrate metabolic process BP 9.25E-40 4 GO:0008236 serine-type peptidase activity MF 3.67E-34 5 GO:0070011 peptidase activity, acting on L-amino acid MF 4.30E-30 5 peptides GO:0006508 proteolysis BP 5.34E-23 5 GO:0004175 endopeptidase activity MF 4.03E-20 6 GO:0004553 hydrolase activity, hydrolyzing O-glycosyl MF 1.13E-19 5 compounds GO:0004252 serine-type endopeptidase activity MF 6.24E-17 6 GO:0016787 hydrolase activity MF 8.06E-15 3 GO:0044238 primary metabolic process BP 4.08E-14 3 chitin binding MF 3.47E-10 5 GO:0009308 amine metabolic process BP 1.31E-11 4 GO:0005976 polysaccharide metabolic process BP 1.32E-10 4 GO:0006022 aminoglycan metabolic process BP 1.22E-09 5 GO:0008061 chitin binding MF 3.45E-08 5 GO:0006030 chitin metabolic process BP 3.15E-07 6 GO:0005975 carbohydrate metabolic process BP 3.15E-07 4 GO:0005576 extracellular region CC 1.60E-05 2 GO:0016491 oxidoreductase activity MF 6.52E-05 3 GO:0006816 calcium ion transport BP 6.52E-05 8 GO:0005262 calcium channel activity MF 2.22E-04 8 hydrolase activity, hydrolyzing O-glycosyl MF 2.55E-30 5 GO:0008061 IG-8h.up IG-16h.down GO:0004553 compounds GO:0005975 carbohydrate metabolic process BP 2.62E-29 4 GO:0008236 serine-type peptidase activity MF 3.89E-24 5 GO:0004252 serine-type endopeptidase activity MF 2.83E-22 6 GO:0008152 metabolic process BP 1.21E-20 2 GO:0044238 primary metabolic process BP 8.70E-20 3 GO:0008233 peptidase activity MF 1.88E-17 4 GO:0070011 peptidase activity, acting on L-amino acid MF 2.88E-17 5 peptides GO:0006508 proteolysis BP 6.49E-11 5 GO:0004175 endopeptidase activity MF 9.92E-11 6 GO:0016491 oxidoreductase activity MF 1.55E-06 3 GO:0005856 cytoskeleton CC 1.74E-05 5 GO:0008061 chitin binding MF 1.74E-05 5 GO:0020037 heme binding MF 1.74E-05 4 GO:0042302 structural constituent of cuticle MF 1.74E-05 3 GO:0004601 peroxidase activity MF 1.93E-05 3 GO:0006979 response to oxidative stress BP 1.93E-05 4 GO:0044430 cytoskeletal part CC 2.79E-05 4 GO:0006030 chitin metabolic process BP 5.88E-05 6 GO:0005976 polysaccharide metabolic process BP 1.08E-04 4 GO:0005975 carbohydrate metabolic process BP 5.54E-07 4 GO:0009628 response to abiotic stimulus BP 1.51E-06 3 GO:0055114 oxidation-reduction process BP 5.31E-05 3 GO:0016491 oxidoreductase activity MF 5.31E-05 3 IG-16h.up IG-32h.down GO:0009408 response to heat BP 1.05E-04 4 GO:0070011 peptidase activity, acting on L-amino acid MF 1.26E-04 5 peptides GO:0004175 endopeptidase activity MF 1.76E-04 6 GO:0004252 serine-type endopeptidase activity MF 4.30E-04 6 GO:0006508 proteolysis BP 7.19E-04 5 GO:0016620 oxidoreductase activity, acting on the MF 1.49E-03 5 aldehyde or oxo group of donors, NAD or NADP as acceptor IG-32h.up GO:0005975 carbohydrate metabolic process BP 9.06E-18 4 GO:0003824 catalytic activity MF 9.06E-18 2 GO:0009055 electron carrier activity MF 4.92E-16 2 GO:0016491 oxidoreductase activity MF 2.00E-12 3 GO:0004497 monooxygenase activity MF 3.40E-11 4 GO:0020037 heme binding MF 8.81E-10 4 GO:0008152 metabolic process BP 3.57E-09 2 GO:0004553 hydrolase activity, hydrolyzing O-glycosyl MF 1.46E-08 5 MF 1.48E-08 5 MF 1.61E-08 4 compounds GO:0016758 transferase activity, transferring hexosyl groups GO:0016757 transferase activity, transferring glycosyl groups IG-b-64h.down GO:0008152 metabolic process BP 5.03E-33 2 GO:0008233 peptidase activity MF 2.69E-32 4 GO:0070011 peptidase activity, acting on L-amino acid MF 8.53E-32 5 peptides GO:0008236 serine-type peptidase activity MF 8.48E-31 5 GO:0004252 serine-type endopeptidase activity MF 1.79E-28 6 GO:0003735 structural constituent of ribosome MF 6.41E-28 3 GO:0005840 ribosome CC 1.19E-27 4 GO:0030529 ribonucleoprotein complex CC 7.96E-24 3 GO:0005975 carbohydrate metabolic process BP 2.76E-23 4 GO:0006508 proteolysis BP 4.13E-23 5 GO:0005515 protein binding MF 4.89E-19 3 GO:0008270 zinc ion binding MF 6.40E-10 7 GO:0007165 signal transduction BP 4.51E-09 3 GO:0007154 cell communication BP 1.44E-08 3 GO:0051716 cellular response to stimulus BP 1.88E-08 3 IG-b-64h.up GO:0030695 GTPase regulator activity MF 2.00E-08 4 GO:0050794 regulation of cellular process BP 2.87E-08 3 GO:0065007 biological regulation BP 8.68E-08 2 GO:0005488 binding MF 1.92E-07 2 GO:0051056 regulation of small GTPase mediated signal BP 1.93E-07 5 transduction Both down and up represent the down-regulated and up-regulated genes at time 4, 8, 16, 32 and 64 hours compared with typical gregarious locust. Only top 10 most significant GO terms were listed according to AdjustedPv (false discover ratio, which was calculated according to Benjamini and Hochberg237). Supplementary Table S22 GO enrichment of the differentially expressed genes (DEGs) in the process of crowding of solitarious locusts. GO_ID GO_Term GO_Class AdjustedPv GOlevl GO:0016491 oxidoreductase activity MF 5.94E-14 3 GO:0042302 structural constituent of cuticle MF 4.12E-13 3 GO:0009055 electron carrier activity MF 1.16E-11 2 GO:0004497 monooxygenase activity MF 1.62E-09 4 GO:0020037 heme binding MF 2.92E-08 4 GO:0005975 carbohydrate metabolic process BP 8.82E-07 4 GO:0005198 structural molecule activity MF 1.52E-06 2 GO:0016758 transferase activity, transferring hexosyl MF 1.44E-05 5 CS-4h.down groups GO:0055114 oxidation-reduction process BP 1.82E-05 3 GO:0008061 chitin binding MF 6.60E-05 5 peptidase activity, acting on L-amino acid MF 1.15E-30 5 CS-4h.up GO:0070011 peptides GO:0004175 endopeptidase activity MF 8.65E-28 6 GO:0006508 proteolysis BP 3.41E-25 5 GO:0008236 serine-type peptidase activity MF 6.86E-23 5 GO:0004252 serine-type endopeptidase activity MF 8.63E-23 6 GO:0016787 hydrolase activity MF 5.69E-09 3 GO:0005975 carbohydrate metabolic process BP 2.65E-06 4 GO:0003824 catalytic activity MF 6.31E-05 2 GO:0008061 chitin binding MF 6.51E-05 5 GO:0006030 chitin metabolic process BP 2.18E-04 6 GO:0016491 oxidoreductase activity MF 9.58E-47 3 GO:0020037 heme binding MF 2.71E-32 4 GO:0009055 electron carrier activity MF 1.37E-30 2 GO:0055114 oxidation-reduction process BP 5.34E-29 3 GO:0016757 transferase activity, transferring glycosyl MF 6.57E-29 4 CS-8h.down groups GO:0005506 iron ion binding MF 4.29E-26 7 GO:0003824 catalytic activity MF 2.88E-22 2 GO:0004497 monooxygenase activity MF 7.04E-18 4 GO:0016758 transferase activity, transferring hexosyl MF 3.63E-15 5 BP 2.11E-13 2 groups GO:0008152 metabolic process CS-8h.up GO:0008236 serine-type peptidase activity MF 2.23E-25 5 GO:0004175 endopeptidase activity MF 2.15E-18 6 GO:0070011 peptidase activity, acting on L-amino acid MF 2.86E-15 5 peptides GO:0004252 serine-type endopeptidase activity MF 2.93E-14 6 GO:0006508 proteolysis BP 9.92E-13 5 GO:0005975 carbohydrate metabolic process BP 2.62E-12 4 GO:0006022 aminoglycan metabolic process BP 1.14E-11 5 GO:0008061 chitin binding MF 6.66E-11 5 GO:0006030 chitin metabolic process BP 9.42E-11 6 GO:0030246 carbohydrate binding MF 9.36E-10 3 GO:0042302 structural constituent of cuticle MF 1.19E-08 3 GO:0016758 transferase activity, transferring hexosyl MF 2.09E-07 5 MF 2.37E-07 4 CS-16h.down groups GO:0016757 transferase activity, transferring glycosyl groups GO:0051920 peroxiredoxin activity MF 8.29E-07 5 GO:0016209 antioxidant activity MF 1.42E-06 2 GO:0005198 structural molecule activity MF 9.98E-06 2 GO:0016491 oxidoreductase activity MF 2.39E-03 3 GO:0004497 monooxygenase activity MF 1.28E-02 4 GO:0020037 heme binding MF 3.20E-02 4 GO:0009055 electron carrier activity MF 3.84E-02 2 peptidase activity, acting on L-amino acid MF 3.48E-27 5 CS-16h.up GO:0070011 peptides GO:0004175 endopeptidase activity MF 7.02E-24 6 GO:0006508 proteolysis BP 1.41E-19 5 GO:0008236 serine-type peptidase activity MF 1.84E-17 5 GO:0004252 serine-type endopeptidase activity MF 2.29E-17 6 GO:0016787 hydrolase activity MF 3.75E-06 3 GO:0005576 extracellular region CC 1.18E-04 2 GO:0000786 nucleosome CC 1.23E-04 4 GO:0005975 carbohydrate metabolic process BP 1.29E-04 4 GO:0006334 nucleosome assembly BP 1.29E-04 7 GO:0016491 oxidoreductase activity MF 1.10E-09 3 GO:0042302 structural constituent of cuticle MF 6.88E-08 3 GO:0016209 antioxidant activity MF 1.47E-07 2 CS-32h.down GO:0016758 transferase activity, transferring hexosyl MF 1.47E-07 5 groups GO:0051920 peroxiredoxin activity MF 8.61E-07 5 GO:0016705 oxidoreductase activity, acting on paired MF 6.36E-05 4 donors, with incorporation or reduction of molecular oxygen GO:0004497 monooxygenase activity MF 1.81E-04 4 GO:0055114 oxidation-reduction process BP 3.22E-04 3 GO:0020037 heme binding MF 6.87E-04 4 GO:0005198 structural molecule activity MF 1.81E-03 2 peptidase activity, acting on L-amino acid MF 2.12E-06 5 CS-32h.up GO:0070011 peptides GO:0008236 serine-type peptidase activity MF 2.12E-06 5 GO:0006508 proteolysis BP 1.57E-05 5 GO:0004252 serine-type endopeptidase activity MF 5.72E-05 6 GO:0004175 endopeptidase activity MF 1.08E-03 6 GO:0006026 aminoglycan catabolic process BP 1.08E-03 6 GO:0006022 aminoglycan metabolic process BP 2.40E-03 5 GO:0016614 oxidoreductase activity, acting on CH-OH MF 3.16E-03 4 group of donors GO:0016491 oxidoreductase activity MF 3.16E-03 3 GO:0008745 N-acetylmuramoyl-L-alanine amidase MF 3.16E-03 6 activity CS-b-64h.down GO:0003735 structural constituent of ribosome MF 7.51E-42 3 GO:0005840 ribosome CC 9.96E-42 4 GO:0016491 oxidoreductase activity MF 8.29E-40 3 GO:0005198 structural molecule activity MF 1.64E-34 2 GO:0005506 iron ion binding MF 2.23E-34 7 GO:0020037 heme binding MF 4.21E-32 4 GO:0004497 monooxygenase activity MF 7.42E-30 4 GO:0044444 cytoplasmic part CC 1.32E-25 4 GO:0009055 electron carrier activity MF 2.01E-24 2 GO:0055114 oxidation-reduction process BP 1.15E-23 3 protein binding MF 8.84E-03 3 CS-b-64h.up GO:0005515 Both down and up represent the down-regulated and up-regulated genes at time 4, 8, 16, 32 and 64 hours compared with typical solitarious locust. Only top 10 most significant GO terms were listed according to AdjustedPv (false discover ratio, which was calculated according to Benjamini and Hochberg236). Supplementary Table S23 Differentially spliced genes between gregarious and solitarious brain samples. GeneID JuncStart JuncEnd LOCMI01089 9,503 9,731 LOCMI01289 208,041 208,185 FDR RatioDiff Annotation 2.9E-88 79% Ubiquitin 8.5E-03 41% Microtubule-associated protein futsch LOCMI01289 212,936 213,003 2.6E-03 24% Microtubule-associated protein futsch LOCMI01929 352 949 3.9E-02 24% Heterogeneous nuclear ribonucleoprotein F LOCMI02112 25,158 29,945 4.7E-02 44% S-adenosylmethionine LOCMI02840 56,175 56,247 1.9E-02 30% NA LOCMI03690 46,327 54,871 5.7E-14 38% Ankyrin repeat domain-containing decarboxylase proenzyme protein 50 LOCMI03696 63,788 84,710 8.9E-07 50% 39S ribosomal protein L22, mitochondrial LOCMI03875 114,742 115,138 2.6E-03 69% Neurofilament heavy polypeptide LOCMI06132 17,002 19,880 1.7E-05 26% Tankyrase-2 LOCMI06300 237,941 248,085 1.3E-02 34% Orphan sodium- and chloride-dependent neurotransmitter transporter NTT4 LOCMI06515 38,336 44,550 2.4E-06 38% Myosin light chain alkali LOCMI06569 120,498 138,528 1.3E-03 42% Calcium-transporting ATPase sarcoplasmic/endoplasmic reticulum type LOCMI07192 360,229 367,172 8.8E-03 55% Arrestin homolog LOCMI07498 358,962 370,733 1.5E-03 28% GatC-like protein LOCMI08053 257,589 349,754 9.5E-02 24% LOCMI08665 412,452 412,638 5.6E-22 25% Selenide, water dikinase 1 LOCMI08665 412,452 412,792 2.1E-08 22% Selenide, water dikinase 1 LOCMI08748 592,833 599,356 6.8E-02 29% NA LOCMI09211 250,251 250,481 2.2E-09 22% tRNA (uracil-5-)-methyltransferase homolog A LOCMI09348 520,566 527,423 5.9E-03 32% Troponin T LOCMI09800 351,323 400,625 4.1E-04 21% TM2 domain-containing protein almondex LOCMI10443 210,105 214,365 4.0E-20 33% Cytochrome c LOCMI10825 901,123 901,507 5.9E-23 30% S-adenosylmethionine synthetase LOCMI11033 279,947 288,905 4.0E-02 23% Mucosa-associated lymphoid tissue lymphoma translocation protein 1 homolog LOCMI11116 145,059 145,606 4.6E-02 91% Sorting nexin-9 LOCMI11537 390,510 500,197 1.8E-13 20% NA LOCMI11822 189,207 189,732 3.1E-09 24% Growth hormone-inducible transmembrane protein LOCMI11956 123,414 123,616 1.1E-35 93% NA LOCMI12061 127,789 128,009 8.4E-02 80% NA LOCMI12462 953,700 956,713 1.8E-09 24% Chromodomain-helicase-DNA-binding protein Mi-2 homolog LOCMI12462 954,372 973,680 5.8E-08 21% Chromodomain-helicase-DNA-binding protein Mi-2 homolog LOCMI13014 1,639,012 1,661,291 3.9E-03 31% Protein turtle LOCMI13647 715,041 715,533 1.2E-04 43% NA LOCMI13701 986,125 986,395 1.1E-38 21% Y-box factor homolog LOCMI14075 324,079 334,455 3.7E-14 21% Histone H3.3 LOCMI14096 1,537,431 1,537,510 1.0E-37 23% Eukaryotic translation initiation factor 4 gamma 2 LOCMI14124 664,510 695,269 4.0E-03 21% Elongation factor Ts, mitochondrial LOCMI14302 444,762 454,682 4.2E-02 21% Catenin delta-2 LOCMI14384 2,763,226 2,767,020 2.1E-02 67% Sodium-dependent phosphate transporter 1-A LOCMI14503 1,179,102 1,180,309 9.3E-04 25% Synaptobrevin LOCMI15369 3,450,741 3,461,535 1.0E-04 20% Tyrosine-protein phosphatase 99A LOCMI15498 255,468 255,569 0.0E+00 25% Elongation factor 1-alpha LOCMI16160 133,512 138,870 3.5E-02 29% NA LOCMI16195 1,504,719 1,505,088 1.7E-02 28% Cytochrome P450 4C1 Supplementary Table S24 Genes related to the contraction/relaxation activity of flight muscle. Gene ID LMI API AGA AME DME DPL NVI PHU BMO TCA CG12408-PA 0 0 0 0 1 0 0 0 0 0 CG15920-PA 1 0 0 1 1 1 1 0 1 0 CG1915-PC 1 1 1 1 1 1 2 1 2 1 CG32019-PF 1 1 1 1 1 1 1 1 1 2 CG3725-PD 2 1 1 1 1 1 1 1 1 1 CG7107-PA 1 1 1 1 1 2 1 1 1 1 Note: Abbreviation: LMI: Locusta migratoria, PHU: Pediculus humanus, API: Acyrthosiphon pisum, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AGA: Anopheles gambiae, DME: Drosophila melanogaster and DPL: Danaus plexippus. Supplementary Table S25 Genes related to wing vein morphogenesis and wing vein specification. Gene ID LMI API AGA AME DME DPL NVI PHU BMO TCA CG1004-PA 0 0 0 1 1 0 1 0 0 0 CG10079-PB 1 1 1 1 1 1 1 1 1 1 CG1007-PA 1 1 2 1 1 1 1 1 1 1 CG10197-PC 1 2 1 1 1 1 1 1 1 1 CG10491-PA 0 1 1 1 1 1 1 0 1 0 CG10571-PA 1 0 0 0 1 0 0 1 0 0 CG10595-PB 1 2 1 2 1 1 2 1 1 1 CG10605-PA 0 0 0 1 1 0 1 0 0 0 CG1064-PA 1 1 1 1 1 2 1 1 1 1 CG10917-PA 0 0 1 0 1 2 0 1 1 1 CG11450-PB 0 1 1 1 1 1 0 1 1 1 CG11990-PB 1 2 1 1 1 2 1 1 1 1 CG12399-PA 1 2 1 2 1 2 3 3 2 4 CG12559-PF 2 1 1 1 3 1 3 0 1 1 CG14080-PB 1 1 1 1 1 1 1 1 1 1 CG15154-PB 1 1 0 1 1 1 1 1 1 1 CG15671-PA 2 1 1 1 1 1 4 2 1 1 CG1696-PA 1 1 1 1 1 2 2 1 1 1 CG17090-PA 1 2 1 1 1 1 1 0 1 1 CG17149-PA 1 2 1 1 1 3 2 2 2 1 CG17596-PA 1 1 1 1 1 1 1 2 1 1 CG17998-PA 1 1 1 1 1 2 1 1 1 1 CG18250-PC 1 1 1 1 1 1 1 1 1 1 CG18497-PA 1 1 1 1 1 1 1 1 1 1 CG18740-PA 2 1 1 1 1 2 1 1 1 1 CG30115-PE 1 0 1 1 1 1 1 0 1 1 CG31000-PO 1 1 1 1 1 1 1 1 1 1 CG32062-PD 1 0 0 0 1 1 0 0 1 1 CG32372-PA 1 1 2 1 1 1 1 1 1 1 CG3274-PA 1 1 1 1 1 1 1 1 1 1 CG3411-PA 1 1 1 1 1 1 1 1 0 1 CG34157-PH 2 2 1 2 1 1 1 1 0 1 CG34389-PC 0 1 1 1 1 1 1 1 1 2 CG3497-PA 1 2 1 1 1 1 2 1 1 1 CG3619-PA 1 1 1 1 1 1 1 1 1 2 CG3936-PA 3 1 1 1 1 1 1 1 1 2 CG40129-PB 1 1 1 1 1 2 1 1 1 1 CG4244-PA 1 1 1 1 1 1 1 1 1 1 CG4426-PA 1 1 1 1 1 2 2 1 1 1 CG4444-PA 1 1 1 1 1 1 0 1 1 1 CG4531-PA 1 4 0 1 1 1 1 1 1 2 CG4547-PB 1 1 1 0 1 1 0 1 1 1 CG4637-PA 1 2 1 1 1 1 1 1 1 1 CG4713-PA 1 0 1 1 1 3 1 2 1 1 CG4881-PB 0 0 0 0 1 0 0 0 0 0 CG4974-PA 0 1 1 1 1 1 1 1 1 1 CG5067-PB 0 1 1 1 1 0 0 0 1 1 CG5441-PA 0 1 1 0 1 0 0 0 0 2 CG5460-PD 0 0 1 0 1 0 0 0 1 0 CG5562-PA 1 2 2 1 1 2 2 1 1 2 CG5591-PA 1 0 1 1 1 1 1 0 1 0 CG5942-PA 5 1 1 1 1 1 2 8 1 0 CG6148-PA 4 2 1 1 1 1 2 2 1 2 CG6464-PA 1 0 1 0 1 0 0 1 0 0 CG6677-PD 2 1 1 1 1 1 3 1 1 1 CG6863-PB 1 1 1 0 1 1 1 1 1 1 CG6868-PA 0 0 0 0 1 0 0 0 0 0 CG6964-PC 1 1 1 1 1 1 0 1 0 0 CG7467-PB 1 1 1 1 1 1 1 1 1 2 CG7595-PA 2 2 3 2 1 3 2 2 3 3 CG7734-PD 1 1 1 1 1 1 0 1 1 1 CG7890-PA 1 1 1 1 1 1 1 2 2 1 CG7892-PG 1 1 1 1 1 1 1 1 1 1 CG7935-PA 2 1 1 1 1 1 1 1 1 1 CG8339-PA 1 2 1 1 1 1 1 1 1 1 CG8709-PA 1 1 1 1 1 1 1 1 1 1 CG9139-PA 1 1 1 1 1 1 1 1 1 1 CG9224-PA 1 1 1 1 1 1 0 1 1 1 CG9397-PI 0 1 0 1 1 0 1 0 0 1 CG9885-PE 1 4 1 1 1 1 1 1 1 2 Note: Abbreviation: LMI: Locusta migratoria, PHU: Pediculus humanus, API: Acyrthosiphon pisum, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AGA: Anopheles gambiae, DME: Drosophila melanogaster and DPL: Danaus plexippus. Supplementary Table S26 Juvenile hormone genes in the L. migratoria genome. Genes L. migratoria ID B. mori ID LOCMI16298 NP_001093296.1 Biosynthesis Acetoacetyl-CoA thiolase LOCMI16359 LOCMI16371 Diphosphomevalonate decarboxylase LOCMI03890 NP_001093300.1 LOCMI05712 Epoxidase LOCMI16177 CYP15C1 Farnesol dehydrogenase LOCMI05376 BGIBMGA005248-PA LOCMI05943 LOCMI16310 Farnesyl diphosphate synthase LOCMI15132 NP_001036889.1 LOCMI16337 NP_001093301.1 LOCMI15132 NP_001093302.1 LOCMI15132 HMG CoA reductase LOCMI16309 NP_001093298.1 HMG CoA synthase LOCMI15805 NP_001093297.1 LOCMI17476 Isopentenyl diphosphate isomerase LOCMI16436 NP_001040323.2 JHA O-Methyltransferase LOCMI01015 BGIBMGA010392-PA LOCMI16697 BGIBMGA010393-PA LOCMI16699 BGIBMGA010563-PA LOCMI16705 BGIBMGA014014-PA LOCMI17143 NP_001036901.1 LOCMI17167 Mevalonate kinase LOCMI07473 NP_001093299.1 Phosphomevalonate kinase LOCMI10595 NP_001040145.1 LOCMI16350 BGIBMGA006444-PA LOCMI16608 BGIBMGA008813-PA LOCMI17114 BGIBMGA008815-PA Degradation juvenile hormone diol kinase NP_001037080.1 juvenile hormone epoxide hydrolase juvenile hormone esterase 1 LOCMI04730 BGIBMGA009211-PA LOCMI16691 BGIBMGA011468-PA LOCMI17120 BGIBMGA013793-PA LOCMI17136 BGIBMGA013929-PA LOCMI16251 BGIBMGA013994-PA LOCMI16258 NP_001037201.1 LOCMI15548 NP_001037027.1 LOCMI15559 LOCMI15560 LOCMI15621 Supplementary Table S27 Insulin metabolic pathway. GenesGene L. migratoria D. melanogaster convoluted LOCMI12686 CG8561-PA short neuropeptide F receptor LOCMI16723 CG7395-PA Ligands and secreted factors Ecdysone-inducible gene L2 Insulin-like peptide ND 1 LOCMI16379 CG15009-PC CG14173-PA CG13317-PA CG14049-PA CG14167-PA CG33273-PA CG6736-PA CG8167-PA Insulin-like receptor and its substrates Insulin-like receptor LOCMI16333 CG18402-PA LOCMI07370 LOCMI07679 Pi3K21B LOCMI07824 CG2699-PA Pi3K92E LOCMI13806 CG4141-PA chico LOCMI16419 CG5686-PA target of rapamycin LOCMI16380 CG5092-PA nucleostemin 3 LOCMI06145 CG14788-PA twins LOCMI07527 CG6235-PF widerborst LOCMI12365 CG5643-PA pten LOCMI16332 CG5671-PB CHARYBDE LOCMI16625 CG7533-PC melted LOCMI16412 CG8624-PC Akt1 LOCMI16427 CG4006-PC focal adhesion kinase LOCMI16428 CG10023-PD Tsc1 LOCMI16439 CG6147-PA RPS6-p70-protein kinase LOCMI16450 CG10539-PA SCYLLA LOCMI16625 CG7590-PA LOCMI00728 CG10944-PB Signal transduction pathway Targets ribosomal protein S6 LOCMI09740 gigas (TSC2) LOCMI13246 CG6975-PA spargel LOCMI16312 CG9809-PB thor LOCMI16388 CG8846-PA ND CG10798-PA LOCMI16409 CG3143-PA diminutive forkhead box, sub-group O 1 ND: not detected. Supplementary Table S28 PAT genes in insect genomes. Species PLIN1 PLIN2 PLIN3 PLIN4 PLIN5 H. sapiens NP_002657 NP_001113.2 NP_00580 NP_001073869.1 NP_00101372 8 L. migratoria LOCMI16244 LOCMI16243 LOCMI16245 LOCMI16246 LOCMI16247 LOCMI16248 LOCMI16250 P. humanus PHUM299200 A. pisum ACYPI007905 T. castaneum XP_966587 XP_976120 D. melanogaster CG10374 CG9057 A. gambiae AGAP002890 AGAP000167 B. mori BGIBMGA013593 BGIBMGA013612 D.plexippus DPGLEAN05949 DPGLEAN15300 A. mellifera GB15498 GB14434 8 Supplementary Table S29 Conserved domains of L. migratoria PAT proteins. Locust ID From To E-Value Bitscore LOCMI16244 6 338 9.40E-54 184.591 LOCMI16243 47 200 8.44E-06 44.7633 LOCMI16245 10 239 2.78E-26 105.24 LOCMI16246 66 177 0.0003606 38.6001 LOCMI16247 66 154 3.17E-05 41.6817 LOCMI16248 65 176 0.0002675 38.9853 LOCMI16250 10 237 1.82E-20 88.2909 The conserved domains of the PAT proteins were identified through the NCBI Batch CD-Search (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). The conserved PAT domain was identified in all the predicted candidates (PSSM-ID: 190508; accession: cl03851, Perilipin superfamily). In the two sequence lengths m and n, the statistics of high-scoring segment pair scores are characterized by two parameters, K and lambda. The E-value of high-scoring segment pairs with score at least S is given by the formula E=Kmne-Λs. Supplementary Table S30 Primers for L. migratoria FABP genes. Gene ID Forward LOCMI17560 GCCGCAAGGTCAAGTCTATC LOCMI04395 Reverse Length Position TATTCTCGTCGCCACCAAGT 152 239-390 GGCTACCTGGTCCGTAAGAT CCTCTGCCGCTGGATCAGT 189 85-273 LOCMI04101 TCTCGGTCGCAGGTACAAA TCCGCCACAGAAAGCCAAT 83 18-100 LOCMI17562 AAGACCTCTGTGGCGTTCC TGTGGTCGTCCTTGAGTGA 94 190-283 LOCMI17563 AAGCTCGTGCTCACCTACC TGACGGTGCTCTTCCTCTT 76 184-259 LOCMI03671 CACCCTCAAGTCCTCGTCG GCCCTTCTGGATGTGGTGC 145 150-294 LOCMI03672 GTTTGGAGCGGGCATAGTG CGATGGTGGTGGTCCTGTC 245 75-319 LOCMI03788 AAGACCTCTGTGGCGTTCC CGACCTTGAGCAGCGTGT 109 190-298 LOCMI05692 CAAGCAGACGACCAAGACG TGAAAGGGATCTCCGCCAC 92 111-202 LOCMI17564 AGCCGACGACGACCCTGAC TCGCTCTTCTGGGTGATGG 157 116-272 LOCMI17561 GTCACCAGGCACTTCACGC GTCACATCGGTCCTTACAGC 111 181-291 Supplementary Table S31 Comparison of antioxidant genes in insects. Type LMI PHU TCA DME AGA BMO DPL AME NVI SOD [Mn] 1 1 1 1 1 1 2 1 1 SOD [Cu-Zn] 4 4 4 4 4 4 4 2 4 Copper chaperone for SOD 1 0 1 1 1 1 2 1 1 General animal peroxidase 17 10 14 10 16 20 12 10 10 Catalase superfamily 2 1 3 2 1 9 6 3 1 Glutathione Peroxidase Superfamily 4 2 3 2 3 2 1 1 3 General Peroxiredoxin Superfamily 14 7 8 9 7 5 6 8 6 Prdx6 9 2 2 4 2 1 1 2 2 NADPH oxidase V 1 0 1 1 1 1 1 1 1 Thioredoxinreductase 2 3 2 2 1 1 1 1 1 Thioredoxin 14 12 10 15 10 10 9 12 9 Glutaredoxin 7 6 5 6 4 3 4 4 4 Methionine-R-sulphoxidereductase 3 3 2 2 3 2 2 2 2 Abbreviations: LMI: Locusta migratoria, DPL: Danaus plexippus, PHU: Pediculus humanus, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AGA: Anopheles gambiae and DME: Drosophila melanogaster. Supplementary Table S32 List of glutathione S-transferase (GST) genes in the insect genomes. AAE AGA DME AME NVI TCA BMO API PHU LMI Delta 8 12 11 1 5 3 4 10 4 10 Epsilon 8 8 14 0 0 19 8 0 0 0 Omega 1 1 5 1 2 4 4 0 1 3 Sigma 1 1 1 4 8 7 2 6 4 12 Theta 4 2 4 1 3 1 1 2 1 2 Zeta 1 1 2 1 1 1 2 0 1 1 Others 3 3 0 0 0 1 2 0 0 0 Total 26 28 37 8 19 36 23 18 11 28 Abbreviations: LMI: Locusta migratoria, PHU: Pediculus humanus, API: Acyrthosiphon pisum, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AAE:, Aedes aegypti, AGA: Anopheles gambiae, DME: Drosophila melanogaster. Supplementary Table S33 Summary of carboxyl/cholinesterase genes in insect genomes. Species A A D A N T B A P L A G M M V C M P H M E A E E I A O I U I A class (hymenopteran metabolising) 0 0 0 8 8 0 0 0 0 0 B class (α-esterases) 22 16 13 0 5 14 55 5 3 9 C class(microsomalα-esterases) 0 0 0 0 0 12 0 0 0 0 Locust-Specific 0 0 0 0 0 0 0 0 0 30 D class (integument esterases) 0 0 3 1 4 2 2 0 0 9 E class ( -esterases) 2 4 3 2 11 7 2 18 1 21 F class (dipteran JH esterases) 6 6 2 2 2 2 0 0 0 2 G class (lepidopteran JH esterases) 6 4 0 0 0 0 4 0 0 0 H class (glutactins) 7 10 4 1 1 1 0 0 1 3 I class (unknown) 1 1 1 1 1 1 2 1 1 1 J class (acetylcholinesterases) 2 2 1 2 2 2 2 2 2 1 K class (gliotactins) 1 1 1 1 1 1 1 1 1 2 L class (neuroligins) 5 5 5 5 5 5 6 3 5 1 M class (neurotactins) 2 2 2 1 1 2 2 0 3 1 Total 54 51 35 24 41 48 76 30 17 80 Dietary/detoxification Hormone/semiochemical processing Neuro/developmental Note: Numbers of CCE genes in other insects were retrieved from a previous report75. Species involved in this analysis include LMI: Locusta migratoria, PHU: Pediculus humanus, API: Acyrthosiphon pisum, AME: Apis mellifera, NVI: Nasonia vitripennis, TCA: Tribolium castaneum, BMO: Bombyx mori, AAE:, Aedes aegypti, AGA: Anopheles gambiae, DME: Drosophila melanogaster. Supplementary Table S34 Summary of ABC transporter subfamilies in the genomes of selected insects. Species ABCA ABCB ABCC ABCD ABCE ABCF ABCG ABCH Total LOCMI 5 19 19 2 1 3 15 1 65 ACYPI 11 9 16 2 1 4 19 9 71 ANOGA 11 6 12 2 1 3 15 1 51 APIME 8 7 10 2 1 3 14 1 46 BOMMO 7 11 11 2 1 3 14 1 50 DAPPU 17 12 6 3 1 4 19 3 65 DROME 12 8 14 2 1 3 15 1 56 NASVI 10 8 20 1 1 3 11 1 55 PEDHU 6 7 5 2 1 3 13 1 38 TRICA 3 0 7 0 0 0 2 1 13 LOCMI: Locusta migratoria; ACYPI: Acyrthosiphon pisum; PEDHU: Pediculus humanus humanus; APIME: Apis mellifera; NASVI: Nasonia vitripennis; TRICA: Tribolium castaneum; BOMMO: Bombyx mori; DROME: Drosophila melanogaster and ANOGA: Anopheles gambiae. Supplementary Table S35 ‘Lethal’ insecticide target candidates in the L. migratoria genome. Protein Cys-loop ligand-gated ion channels GPCRs ‘Lethal’ insecticide candidates Function Nicotinic acetylcholine receptor (17); GABA/Glycine chloride channel (10); Glutamate-gated chloride channel (4); Histamine-gated chloride channel (2); pH sensitive chloride channel (1) Class A Rhodopsin-like (61); Class B Secretin receptor-like (21); Class C Metabotropic glutamate receptor-like (3); Class D Atypical GPCRs (5) Kinase (9); Synthetase (7); receptor (6); Methyltransferase (5); Helicase (4); Polymerase (3); Transporter (3); GTPase (2); Phosphatase (2); Primase (2); Translocase (1); Peptidase (1); Ligase (1); Isomerase (1); Hydrolase (1); Dehydrogenase (1); Carboxylase (1); ATPase (1) Total number 34 90 51 Supplementary Table S36 Immune related genes across five species. Gene Family of Pathway DPU LMI API PHU DME Anti-microbial Peptides (AMPs) 11 9 2 4 21 Autophagy Pathway Members (APHAGs) 16 21 12 14 16 Gram-Negative Binding Proteins (GNBPs) 9 3 1 0 3 Caspases (CASPs) 22 9 6 5 7 Caspase Activators (CASPAs) 1 2 1 0 5 Catalases (CATs) 1 2 1 1 2 CLIP-domain Serine Proteases (CLIPs) 94 54 38 23 46 C-Type Lectins (CTLs) 52 27 11 13 35 Fibrinogen-Related proteins (FREPs) 26 4 1 3 13 Galectins (GALEs) 6 3 2 3 6 Glutathione Peroxidases (GPXs) 3 4 0 1 2 Heme Peroxidases (HPXs) 12 22 27 1 10 Inhibitors of Apoptosis (IAPs) 6 4 16 2 4 IMD Pathway Members 6 6 2 5 8 JAK/STAT Pathway Members 3 6 8 3 3 Lysozymes (LYSs) 0 3 0 0 12 MD2-like Proteins (MLs) 6 3 11 2 8 Peptidoglycan Recognition Proteins (PGRPs) 0 15 0 1 13 Prophenoloxidases (PPOs) 1 8 2 3 3 Rel-like NFkappa-B Proteins (RELs) 5 3 3 4 21 Scavenger Receptors (SCRs) 19 22 14 25 4 Superoxide Dismutases (SODs) 18 7 10 8 6 Spaetzle-like Proteins (SPZs) 6 6 9 5 30 Serine Protease Inhibitors (SRPNs) 5 23 10 12 14 Thio-Ester Containing Proteins (TEPs) 11 7 4 3 6 Toll Receptors (TOLLs) 6 8 7 6 9 Toll Pathway Members 7 6 3 4 5 Thioredoxin Peroxidases (TPXs) 0 10 0 0 8 414 297 244 167 320 Total Abbreviation: LMI: Locusta migratoria, DPU: Daphnia pulex, PHU: Pediculus humanus, API: Acyrthosiphon pisum and DME: Drosophila melanogaster. Supplementary Table S37 List of annotated genes of the miRNA and siRNA machinery in the L. migratoria genome. miRNA pathway Ago1 LOCMI17227 Loq LOCMI16664 siRNA pathway Ago2 LOCMI17302 LOCMI16696 LOCMI16661 R2D2 LOCMI16661 LOCMI16590 Dicer-2 LOCMI10988 Dicer-1 LOCMI16844 Exportin-5 LOCMI17101 Drosha LOCMI17264 Pasha LOCMI17070 LOCMI13152 Supplementary Table S38 Summary of cuticle protein genes in the L. migratoria genome. Name Type Score E-value1 LOCMI16739 RR1 141.5 5.00E-43 LOCMI16727 RR1 93.8 1.10E-28 LOCMI06228 RR1 81.1 8.00E-25 LOCMI17181 RR1 92.8 2.40E-28 LOCMI17140 RR1 23.5 1.60E-08 LOCMI01986 RR1 15 1.30E-07 LOCMI16703 RR1 33.8 1.30E-10 LOCMI16728 RR1 38.3 5.80E-12 LOCMI16731 RR1 58 7.10E-18 LOCMI16610 RR1 25.6 9.60E-09 LOCMI17322 RR1 30.4 1.50E-09 LOCMI17211 RR1 93.8 1.20E-28 LOCMI17165 RR1 22.9 1.90E-08 LOCMI16714 RR1 12.4 2.40E-07 LOCMI16609 RR1 77.2 1.10E-23 LOCMI16601 RR1 23.7 1.50E-08 LOCMI16604 RR1 68.8 3.80E-21 LOCMI03817 RR1 39.8 2.10E-12 LOCMI17205 RR1 154.5 6.00E-47 LOCMI17206 RR1 64.9 5.70E-20 LOCMI17125 RR1 20.7 3.10E-08 LOCMI01532 RR1 13.5 1.80E-07 LOCMI17128 RR1 20.4 3.40E-08 LOCMI17141 RR1 74.6 7.10E-23 LOCMI03209 RR1 27.4 6.10E-09 LOCMI16847 RR1 35.9 3.10E-11 LOCMI01460 RR1 32.8 2.70E-10 LOCMI16615 RR1 39.6 2.40E-12 LOCMI17159 RR1 36.6 1.90E-11 LOCMI17132 RR1 44.3 9.50E-14 LOCMI16800 RR1 58.4 5.30E-18 LOCMI16819 RR1 85.8 3.00E-26 LOCMI16851 RR2 100.6 1.00E-30 LOCMI16632 RR2 97.6 8.30E-30 LOCMI01688 RR2 66.1 2.50E-20 LOCMI05479 RR2 113.9 1.00E-34 LOCMI16766 RR2 21.9 4.10E-07 LOCMI04505 RR2 82.3 3.40E-25 LOCMI04506 RR2 82 4.10E-25 LOCMI05131 RR2 21.6 4.50E-07 LOCMI16617 RR2 58.5 4.80E-18 LOCMI17196 RR2 82.1 3.70E-25 LOCMI16600 RR2 112.2 3.30E-34 LOCMI16693 RR2 24.2 1.10E-07 LOCMI16599 RR2 105 5.00E-32 LOCMI12142 RR2 136.6 1.50E-41 LOCMI16606 RR2 30.2 1.70E-09 LOCMI13224 RR2 99.6 2.10E-30 LOCMI16637 RR2 99.6 2.10E-30 LOCMI16638 RR2 99.6 2.10E-30 LOCMI16718 RR2 73.2 1.80E-22 LOCMI16781 RR2 65.5 3.90E-20 LOCMI16838 RR2 103.7 1.20E-31 LOCMI07059 RR2 63.3 1.70E-19 LOCMI07060 RR2 110.2 1.40E-33 LOCMI16618 RR2 219.8 1.40E-66 LOCMI16597 RR2 91.3 6.40E-28 LOCMI17139 RR2 81.8 4.60E-25 LOCMI16612 RR2 84.7 6.40E-26 LOCMI16613 RR2 80.9 8.70E-25 LOCMI05383 RR2 67.5 9.60E-21 LOCMI16729 RR2 79.3 2.80E-24 LOCMI16621 RR2 134.2 8.30E-41 LOCMI16834 RR2 54 1.10E-16 LOCMI17147 RR2 51 8.90E-16 LOCMI16704 RR2 102.6 2.50E-31 LOCMI02816 RR2 77.3 1.00E-23 LOCMI17158 RR2 25.6 3.80E-08 LOCMI15346 RR2 24.3 9.90E-08 LOCMI16602 RR2 21.3 4.90E-07 LOCMI16626 RR2 40.2 1.60E-12 LOCMI16627 RR2 21.9 4.10E-07 LOCMI16628 RR2 29.7 2.30E-09 LOCMI16629 RR2 49.1 3.40E-15 LOCMI16630 RR2 26.4 2.30E-08 LOCMI16639 RR2 53.9 1.20E-16 LOCMI16640 RR2 28.2 6.70E-09 LOCMI16641 RR2 45.2 5.00E-14 LOCMI16648 RR2 44.2 1.00E-13 LOCMI16673 RR2 43.9 1.20E-13 LOCMI16833 RR2 30.2 1.60E-09 LOCMI16596 Others - - LOCMI17145 Others - - LOCMI16736 Others - - LOCMI17249 Others - - LOCMI16769 Others - - LOCMI07701 Others - - LOCMI16717 Others - - LOCMI13223 Others - - LOCMI16607 Others - - LOCMI03719 Others - - LOCMI16595 Others - - LOCMI16775 Others - - LOCMI16622 Others - - LOCMI16611 Others - - LOCMI16677 Others - - LOCMI00374 Others - - LOCMI16614 Others - - LOCMI17157 Others - - LOCMI16721 Others - - 1 The E-value were determined based on the HMMER searches in the cuticleDB. An E-value (expectation value) is the number of hits that would be expected to have a score equal to or better than this by chance alone. A good E-value is much less than 1, for example, an E-value of 0.01 would mean that on average about 1 false positive would be expected in every 100 searches with different query sequences. Supplementary Table S39 Chitinase and chitinase-like genes in the L. migratoria genome. Gene ID LmCht2 LOCMI17199 LmCht4 LOCMI16689 LmCht5 LOCMI17048 LmCht6 LOCMI17331 LmCht7 LOCMI17356 LmCht8 LOCMI17568 LmCht9 LOCMI17569 LmCht10 LOCMI13600 LmCht11 LOCMI17267 LmIDGF1 LOCMI17185 LmIDGF2 LOCMI17170 LmIDGF3 LOCMI17127 LmIDGF4 LOCMI02015 Abbreviations: Cht, Chitinase; IDGF, chitinase-like imaginal disc growth factor. Supplementary Table S40 Putative yellow genes of the L. migratoria genome. L. migratoria D. melanogaster Homology LOCMI17290 CG3757 yellow LOCMI16530 CG17914 yellow-b LOCMI16488 CG4182 yellow-c LOCMI16507 CG4182 yellow-c LOCMI17195 CG4182 yellow-c LOCMI17293 CG4182 yellow-c LOCMI17338 CG4182 yellow-c LOCMI17339 CG9889 yellow-d LOCMI17291 CG9891 yellow-d2 LOCMI16964 CG9792 yellow-e LOCMI16526 CG18550 yellow-f LOCMI17265 CG5717 yellow-g LOCMI17060 CG13804 yellow-g2 Supplementary Notes Supplementary Note 1. Genome features and evolutionary analysis Heterogeneous distribution of heterozygosity rates The heterozygosity rate, which was determined according to the previous studies, is the portion of heterozygous single-nucleotide polymorphisms (SNPs) between the two haploid components in the diploid genome18,76. It was calculated as the number of heterozygous SNPs divided by the length of corresponding genomic regions76. The sequencing reads (~130 GB, 20-fold genome coverage) were realigned with the assembled locust genome to identify the heterozygous SNPs. A heterogeneous distribution of heterozygosity rates was observed, possibly reflecting that the loss of heterozygous genotypes, which resulted from inbreeding line preparation of genome sequencing, varies among different genomic regions. The distribution of sequencing depth coverage and repetitive element divergence were determined to further exclude the possibility that the heterogeneous distribution of heterozygous SNPs are artifacts caused by un-equal sequencing depth coverage or high similar copies of repetitive elements. As shown in the track b of Figure 1a, the equal sequencing depth coverage indicates that the reduction of heterozygosity rates is not due to read mapping artifacts caused by low read coverage. To determine the genomic content and divergence of repetitive elements, the consensus repetitive sequences inferred from the de novo and homology identification were used as references to identify the genome-wide scale distribution of repetitive elements. In general, DNA transposons are the most abundant elements, and the amount of LINEs transposons is comparable to that of DNA transposons (Figure 1a track c). A homogeneous divergence distribution of these repetitive elements were observed (Figure 1a track d), and no obvious region that is of scarcity of divergent copies were detected (Figure 1a track d). Therefore, these results indicated that the results of heterozygosity rate distribution is not affected by neither sequencing overage nor repetitive element content. Intron evolution Comparative intron size analysis. To further elucidate the role of intron size expansion in locust evolution, pairwise comparisons between L. migratoria and other insects were performed using the log expansion/contraction ratios of introns of 1,046 conserved homologous genes. As shown in Supplementary Figure S8 and Supplementary Table S12, most introns (97%) experienced size expansion during locust evolution. It has been generally known that intron sizes vary widely in animals. A previous study indicated that introns in humans are eight times as large as those in pufferfish, which is in line with the ratio of their genome sizes77. However, these studies were generally based on limited gene sets or assessment of relatively few species. Therefore, a broad comparison at the genome scale across divergent species is required. Here, we studied the relationship between average intron size and genome size for 73 whole genome sequenced species, including 50 vertebrates and 23 invertebrates (retrieved from Ensemble). As shown in Supplementary Figure S9, a strong positive correlation was observed between average sizes of introns and genome sizes (P value: < 2.2e-16, adjusted R-squared: 0.8206, Pearson's product-moment correlation). The trend of correlated increase in genome size with increase in average intron size was previously considered to follow a slope of 4:1 of log-transformed sizes78. We found that the log-transformed value of the L. migratoria genome size was only 2.4 times larger than that of its average intron size, implying that there were stronger selective constraints on the length of large introns. Therefore, increase in intron size in parallel with genome size expansion was likely largely responsible for the loosening of gene structures in L. migratoria. TE invasions lead to intron size expansion in insects. In L. migratoria, the average gene size and average intron size are 42,426 bp and 10,473 bp, respectively. This average intron size is drastically larger than that of other sequenced insects (Supplementary Table S8). Most of the L. migratoria introns are longer than 1,000 bp, whereas most introns of other insects are about 100 bp long (Supplementary Figure S7). It is obvious that the intron size in L. migratoria is significantly expanded. To further elucidate intron size dynamics in insects from an evolutionary perspective, we carried out a comparison of intron length between L. migratoria and other insects for 1:1:1 insect orthloguous gene families that are subject to the same selection forces. We found that 97% (1,018 of 1,046) of the gene families experienced intron length expansion in the L. migratoria genome (Supplementary Table S12). The ratio density distribution of intron length in log2 unit showed a positive bias of ratio values and different peak location in all species, suggesting that intron size variation is associated with genome size in insects (Supplementary Figure S8). To investigate the contribution of TE proliferation in intron size expansion, the TE content of introns was determined. We found that TE-harboring introns are more dominant in the locust genome. RepeatMasker screening indicated that a large fraction (65%, 68, 242/104,488) of locust introns contains TEs, and that this proportion is higher than that of all other insects (Supplementary Figure S10). We found a positive correlation between mean intron length and the fraction of TE-harboring introns (P value: 0.02, adjusted R-squared: 0.4364, pearson's product-moment correlation), suggesting that TE invasions likely played a key role in intron size expansion. U12 Intron. The splicing process, which removes introns from primary transcripts, is essential for RNA maturation in eukaryotic cells. Spliceosome, a multicomponent complex, is involved in intron excision during the splicing process79. There are two types of introns—the U2-type and the U12-type—and they are spliced by distinct spliceosomes. The vast majority of introns are U2-dependent introns, whereas a small fraction of them are U12-dependent introns80. U12-type introns have an RT dinucleotide (the R indicates G or A) at the 5’ splicing site and variable terminal nucleotides at the γ’ splicing site. This breaks the rule observed in the Uβ-type introns, which have a GT dinucleotide at the 5’ splicing site and an AG dinucleotide at the γ’ splicing site. Further examination of U12 intron sequences revealed distinct properties compared to U2 introns, including conserved sequences (RTATCCTT nucleotide segments) at the 5’ splicing site and adjacent branch site (TCTTAAC nucleotide segments) from the γ’ splicing site. Both properties are required for prespliceosome complex formation81. U12-type introns have been identified in a wide range of eukaryotic species, indicating their early evolutionary origin82. Two recent studies based on comprehensive searches of genomic sequences revealed that U12 introns were lost in multiple invertebrate lineages during evolution83,84. Among insects, only 16 and 34 U12 introns have been identified in the Drosophila melanogaster and the Apis mellifera genomes, respectively. Of the 198 human-fruitfly orthologous genes with U12-type introns, 169 introns were lost and only sixteen cases have been maintained in the D. melanogaster genome. The number of U12 introns in the locust genome was determined and was compared with those of other species reported in previous studies. Surprisingly, we identified 197 U12 introns in the L. migratoria genome, which is significantly higher than that of most other invertebrates (P < 0.01, student’s t-test, Supplementary Figure S11), and twice as many as that of sea squirt (113), a vertebrate evolutionary ancestor, indicating the unique intron feature of L. migratoria genome compared to other insects. Recursive splicing sites. The pattern of intron size can be extremely variable among different species. Although intron size expansion can be of benefit to gene expression and evolution, they also increase the potential for transcription processing errors85. In insects, an effective way to avoid the generation of extremely long transcripts is recursive splicing (RP), which excises long intronic transcripts into sub-fragments co-transcriptionally86. However, this regulatory mechanism is not observed in vertebrates85. Using the splice junction consensus matrix, we calculated the ratios of RP-sites within large introns (>50 kb as previously described26) and within their complementary strands in the L. migratoria genome (Supplementary Figure S12). In general, the larger introns of insects exhibited marked enrichment of RP-sites in the sense strand85. Contrary to the case in insects, the RP-sites of larger introns in the sense strand are less abundant in vertebrates (a ratio of 1.5 or less85). In L. migratoria, the RP-sites (with ratio of 1.14) are dramatically smaller than those of other insects, but similar to those of vertebrates. In addition, the ratio of RP-sites between the two strands was also determined in shorter introns with sequence length less than 50 kb. In Drosophila, the RP-sites of larger introns are significantly more abundant than those of shorter introns, whereas the ratio of RP-sites is approximately equal between the two strands in humans 85. In L. migratoria, the ratio of shorter introns, 1.28, is similar to that of larger introns, indicating that the shorter introns in L. migratoria exhibit similar RP-site patterns as those of humans. Taken together, these results indicate that the regulatory mechanism of introns in L. migratoria is similar to those in vertebrates. Gene family analysis Among the 17,307 genes in the L. migratoria genome, 7,995 are conserved across insect and daphnia, while 1,562 and 335 of them are conserved across insects and hemimetabolous animals, respectively. A total of 4,571 genes are locust specific (Supplementary Figure S13). Supplementary Note 2. Genes related to pest and major Locust biology Phase change Fluctuations of CpGO/E in the coding region of genes. CpGO/E is an efficient measure of inferring the pattern of DNA methylation due to mutational mechanisms of hypermutable methylated cytosines that mutational decay of CpG di-nucleotides lead to a lower-than-expected CpGO/E from methylated region30. Briefly, methylated cytosines are chemically unstable and easily change to thymine via spontaneous deamination. This causes increased frequency of CpG to TpG mutations. Consequently, hypermethylated genomic regions lose CpG dinucleoties over time and have lower-than-expected CpGO/E. Hypomethylated regions have high CpGO/E. CpGO/E is expected to be negatively correlated with the levels of DNA methylation. Recent studies have demonstrated that the CpGO/E in several insect species that have the low level of methylation in their genomes, such as D. melanogaster, T. castaneun, A. gambiae has approximately normal distribution with a mean around 1. In contrast, the CpGO/E of genes in A. mellirera that has higher methylation level exhibits a striking bimodal pattern. Accordingly, CpGO/E can act as a indicator of the methylation level in insect genomes. We examined the distribution of CpGO/E and observed the intensity fluctuations of CpGO/E in the coding region but not in the whole genome. The depletion of CpG di-nucleotides in the coding region suggests widespread gene methylation in the locust genome. Locust methylation system. DNA methylation is a major mechanism of epigenetic regulation and refers to the addition of a methyl group to position 5 of cytosine bases. Several studies have proposed that DNA methylation may be involved in the regulation of phenotypic plasticity in insects87,88. The three DNA methyltransferase genes, Dnmt1–3, display different functions. Dnmt3 establishes DNA methylation patterns, while Dnmt1 maintains these patterns, and Dnmt2 is involved in tRNA methylation. There have been various duplications and deletions of the three Dnmt genes within arthropods. Only Dnmt2 is present in all sequenced arthropods. Dnmt1 and Dnmt3 have been lost in several lineages. We found two copies of Dnmt1 in L. migratoria, similar to A. mellifera and A. pisum. Both Dnmt2 and Dnmt3 exist as single copies in L. migratoria (Supplementary Table S15). All three Dnmt genes in L. migratoria exhibit conservation in the expected conserved domains. Long-distance flight Genes related to contraction/relaxation of flight muscles. Locust flight muscles are synchronous muscles89, which are characterized by a 1:1 correspondence between neural activation and muscle contraction. Each contraction/relaxation cycle is accompanied by membrane depolarization and subsequent repolarization, release of activator calcium, attachment of cross-bridges and muscle shortening, then removal of activator calcium and cross-bridge detachment90. We identified the genes reported to contribute to the high frequency of flight muscle contraction and relaxation cycles in the L. migratoria genome. For example, calcium pumps contribute partially to the high rate of calcium removal in synchronous muscles90,91, the troponin T subunit is possibly involved in muscle activation90; and resilin, one of the elastic proteins found in the wing hinge, contributes to the elastic storage of inertial energy in locust muscles92. The identification of these genes in the L. migratoria genome will help understand the function of flight muscles during long-distance flight. Comparison of these genes across insect genomes shows that there is no significant variation in their copy number across insect species, indicating that the capacity of L. migratoria for long-distance flight is likely supported by other genetic variations (Supplementary Table S24). Genes related to wing vein morphogenesis and wing vein specification. In long-distance flying insects, integrity of the wings is critical for insect flight93. During the lifetime of a flying insect, its wings are subjected to mechanical forces and deformations for millions of cycles. Because insects control the wings remotely by transmitting forces distally via veins and areas of thickened membrane, defects in the thin membranes or veins may affect the insect’s flight performance94. Seventy genes involved in wing vein morphogenesis (GO:0008586) and wing vein specification (GO:0007474) have been reported in D. melanogaster (retrieved from Amigo database: http://amigo.geneontology.org). We identified their homologues through homology-based searches (TBLASTN) in the genomes of nine other insects, including A. pisum, A. gambiae, A. mellifera, B. mori, D. plexioppus, L. migratoria, N. vitripennis, P. humanus, T. castaneum. Most of the genes are conserved across the insect species with only one copy in the insect genome (Supplementary Table S25). Thirty-five genes are shared by all the examined insect species, indicating that insects share a similar mechanism of vein morphogenesis and vein specification. Although wing shape and structure vary considerably between L. migratoria and D. melanogaster, 57 of the 70 genes involved in wing vein morphogenesis and wing vein specification are shared between them. This further supports a previously proposed theory that transition between different venation types may require relatively few changes in the regulatory gene networks involved95,96. Genes involved in biosynthesis of juvenile hormone. Juvenile hormone (JH) plays important roles in regulating flight activity in insects97-99. JH is synthesized and secreted from the corpora allata, a pair of endocrine glands with neural connections to the brain. The biosynthetic pathway of JH can be divided into two steps: in the first step, farnesyl pyrophosphate (FPP) is formed from acetate through a series of chemical reactions, and in the second step, FPP is hydrolyzed by a pyrophosphatase to farnesol, and then oxidized successively to farnesal and farnesoic acid (FA) by two dehydrogenases100. Silkworm genes involved in JH biosynthesis were used as bait genes to identify their homologues in the L. migratoria genome through homology-based searches. Juvenile hormone esterase (JHE) has five essential catalytic motifs101 which were confirmed by multiple sequence alignment with other insect JHE proteins (Supplementary Figure S22). We found that half of the JH pathway genes (7/14) in L. migratoria exist in more copies than their silkworm homologues do (Supplementary Table S26), which supports the idea that JH pathways function as integrators of metabolic physiology during locust flight102. Genes related to Insulin signalling pathway. The insulin/IGF signalling (IIS) pathway is highly conserved from yeast to worms, fruit flies, and rodents, and plays key roles in growth, metabolism, stress resistance, reproduction, and longevity in diverse organisms103-105. Several studies have demonstrated that the insulin signalling pathway is also involved in regulation of insect flight activity106. We identified most of the IIS genes in the L. migratoria genome using the IIS genes of Drosophila as bait genes (Supplementary Table S27). We only found one insulin-like peptide in the L. migratoria genome, which is significantly lower compared to 7 in Drosophila107 and 37 in C. elegans108. In addition, a total of 3 insulin-like peptide receptors were identified in the L. migratoria genome. PAT and FABP protein. Lipid deposition, activation and transportation are critical for long-distance flight of locusts109. Lipid droplets are intracellular organelles enriched in adipose tissues that regulate the body fat stores of animals. PAT proteins, comprising of Perilipin, Adipose differentiation-related protein (ADRP), Tail-interacting protein of 47 kDa (TIP47), S3-12, and OXPAT110,111, also known as PLIN1-5111, are positioned at the surface of the lipid droplet40. PAT proteins manage the access of lipases to lipid esters within the lipid droplet core and interact with cellular machinery important for lipid homeostasis112. Members of the PAT family are present in evolutionarily distant organisms, including mammals, insects, slime molds and fungi113. All PAT proteins share sequence similarity and the ability to bind intracellular lipid droplets. Using the human and D. melanogaster PAT homologous proteins as baits, we identified the PAT proteins of L. migratoria and 10 other arthropod species. The conserved domains of identified PAT proteins were further examined using Batch CD-Search114, and only those with the perilipin domain were retained (Supplementary Table S29). Finally, 1 PLIN1 and 6 PLIN2 were identified in the L. migratoria genome (Supplementary Table S28). Interestingly, the PLIN2 gene was found to be duplicated in the genome of D. plexippus. In D. melanogaster, the PLIN2 gene (Lsd2) has been reported to mainly control fat storage in fat bodies and the germ line of females115. Thus, duplication of PLIN2 genes might enhance the capacity of fat storage in L. migratoria and D. plexippus, which is possibly an adaptation to intensive energy demand during long-distance flight. The fatty-acid-binding proteins (FABPs) are a family of carrier proteins for fatty acids and other lipophilic substances such as eicosanoids and retinoids. These proteins are thought to facilitate the transfer of fatty acids between extra- and intracellular membranes116. FABPs have been demonstrated to play key roles in the long-distance flight capacity of locusts. So we also detected these proteins in L. migratoria and 10 other arthropod species, using the D. melanogaster FABP homologous proteins as baits, we identified the number of FABP in 10 species are: 3 copies in Pediculus humanus, 2 copies in Tribolium castaneum, 1 copy in Drosophila melanogaster, 2 copies in Anopheles gambiae, 7 copies in Bombyx mori, 6 copies in Danaus plexioppus, 3 copies in Apis mellifera, 2 copies in Nasonia vitripennis and 11 copies in Locusta migratoria. Antioxidants Generation of reactive oxygen species (ROS) is a ubiquitous phenomenon in eukaryotic cells, resulting in oxidative stress, pathogenesis and aging117. For insects with intensive flight activity, ROS damage mitochondrial proteins and the cell membrane, and thus shorten their life span43,118. As defense against oxidative stress, insects have evolved a variety of antioxidant systems, such as antioxidant enzymes including superoxide dismutase, catalase and glutathione transferase, and soluble antioxidants such as ascorbate, glutathione, tocopherols, and carotenoids119. We searched the L. migratoria and 8 other insect genomes for antioxidant genes using D. melanogaster and A. mellifera antioxidant genes120 and peroxidases from PeroxiBase121 as bait. After homology-based searching, peroxidases were further filtered and classified using PeroxiScan121. The copy numbers of most antioxidant genes across the examined species are conservative except Prdx6, which belongs to the general peroxiredoxin superfamily (PS52075) (Supplementary Table S31). In contrast to the small copy number of Prdx6 in other species, nine copies were identified in the L. migratoria genome (Supplementary Table S31). Seven Prdx6 genes are tandemly located on two scaffolds (Supplementary Figure S24), as observed in the D. melanogaster genome (CG11765, CG12405, and CG12896). Prdx6 can scavenge peroxides, reduce peroxidized membrane phospholipids by using glutathione as a reductant, and protect cells against oxidant-induced plasma membrane damage122-124. Expansion of Prdx6 in the L. migtratoria genome may help alleviate the oxidative stress caused by reactive oxidative species produced during extensive flight activity43,118. Feeding Odor related genes Chemoreceptor genes. Utilizing their senses of taste (gustation) and smell (olfaction) to locate and recognize plants is a key aspect of host plant recognition in insects125. Environmental chemicals are detected, recognized and discriminated by chemoreceptor genes expressed in olfactory and taste sensory organs. Chemoreceptor genes include gustatory receptors (GRs), odorant receptors (ORs), and ionotropic receptors (IRs), which were identified recently126. The locust GRs were first identified using known insect GR proteins as templates by combined TBLASTN and GENEWISE searches127. Then, identified locust GRs and known insect GRs were used to perform the iterated profile searches with PSI-BLAST, because of high divergence of the GR genes128. These strategies were also applied in the identification of locust IRs. Known insect ORs deposited in GenBank were retrieved to identify the locust ORs by the AUGUSTUS gene prediction program129. Multiple alignment of the OR protein sequences were performed to construct a block profile using accompanying scripts from AUGUSTUS. The members and their exon-intron structure of the locust ORs were determined based on the protein profile extension from the block profile. Probable gene fragments of which presumed protein sequences are shorter than 250 amino acids were discarded following the strategy of a previous study130. For phylogenetic analysis, protein sequences were aligned with representative sequences from other insects. The phylogenetic trees were inferred by maximum likelihood methods, and were drawn using the PhyML 3.0 program with the JTT substitutional matrix131. By combined GLEAN gene set and predicted gene models from PSI-BLAST searches, we identified 75 L. migratoria GRs (LmGRs) in the draft genome sequences. To verify our gene prediction, the relative frequency of the corresponding amino acid at each position was generated using WebLogo. The WebLogo results clearly illustrated that conserved TYhhhhhQF motifs, which are representative features of GR genes, are present in the TM7 domain of most locust GRs (Supplementary Figure S26). For phylogenetic analysis, representative GRs including known carbon dioxide receptors, bitter receptors and sugar receptors were included to provide context for LmGRs. Phylogenetic relationships of the 75 LmGrs with each other are shown in Supplementary Figure S27. Most LmGRs formed three large lineages that were unique in L. migratoria, indicating their origin from LmGR expansion. Orthologues of the carbon dioxide and sugar receptors were not detected in the L. migratoria genome. In addition, no orthologue of the well-conserved DmGr43a gene lineage could be identified. Taken together with a previous report that these genes were also lost in the A. pisum genome130, our results suggest that these genes are conserved only in holometabolous insects. However, the locust GR family does contain one member of the bitter receptor superfamily that is present in the silkworm moth Bombyx mori132. We also found 95 L. migratoria ORs and 10 L. migratoria IRs in the draft genome sequences. As expected, we identified a single orthologue of the Or83b gene, a highly conserved single-copy gene present in all insects studied to date. Odorant-binding proteins. In insects, recognition of volatile compounds allows for odor and pheromone perception that is essential for feeding, survival and reproduction. Odorant-binding proteins (OBPs) are important components of the insect olfactory system133. The OBP genes were identified through homology searches using information from already known insect protein sequences134. Using the known insect OBPs as templates, the GENEWISE program was used to improve and extend the GLEAN predictions into full-length proteins. The HMMER searches using the PBP/GOBP pfam01395 profiles were conducted to determine the conserved OBP domain. Furthermore, the secondary structures including α-helices were predicted using the PSIPRED program to assist with the OBP gene annotation process135. We found a total of 22 putative OBPs in the L. migratoria genome. This indicates that compared to other known insect genomes, the L. migratoria genome contains a relatively low number of genes encoding OBPs. Detoxification related genes Families of detoxification related genes paly essential roles in interactions and defense against natural and synthetic xenobiotics, which facilitate adaptation to specific ecological niches in insects. To fully characterize these gene families, we manually annotated and compared the detoxification related genes in the L. migratoria genome to those in other insect genomes, including the families of genes encoding UDP-glycosyltransferases, glutathione-S-transferase, carboxyl/cholinesterase, ATP-binding cassette transporters and cytochrome P450 monooxygenases. UDP glycosyltransferase. Potential defensive compounds of plants can act as toxins or feeding deterrents toward herbivorous insects. The detoxication of ingested plant compounds is considered to be one of the principle functions of UDP-glycosyltransferases (UGT) enzymes in insects. UGTs belong to a superfamily of metabolic enzymes that play important roles in the detoxification and elimination of a variety of endogenous and exogenous compounds52. Members of this superfamily are variable in sequence composition and abundance in bacteria, viruses, plants and animals, suggesting their very ancient origin and an essential role in living organisms136. These enzymes catalyze the conjugation of a glycosyl group from an activated sugar donor with various small lipophilic molecules, resulting in more hydrophilic products that are efficiently excreted73,137. We identified 68 putative UGT genes in the L. migratoria genome with subsequent manual curation. This number is larger than that of other insects of which genomes are determined, representing an approximately 2-fold expansion compared to D. melanogaster and 5.7-fold expansion relative to the number of UGTs identified in A. mellifera138. The deduced amino acid sequences showed similarity to a range of insect UGTs and included the UGT signature motif in a conserved C-terminal region: [FVA]-[LIVMF]-[TS]-[HQ]-[SGAC]-G-X[2]-[STG]-X[2]-[DE]-X[6]-P-[LIVMFA]-[ LIVMFA]-X[2]- P-[LMVFIQ]-X[2]-[DE]-Q. Generally, the binding site of UDP moiety of the nucleotide sugar, also known as the UGT signature motif, is present in the conserved region of the C-terminal halves139. The availability of the full repertoire of putative UGT genes from L. migratoria makes it possible to determine the consensus sequences of the conserved region, and provides the opportunity to assess the accuracy of the UGT gene identification program. Therefore, we generated consensus sequence logos to display a graphical representation of the amino acid frequency. Supplementary Figure S28 shows the consensus amino acid sequences of locust UGT genes. The presence of consensus sequences consisting of the UGT signature motif strongly supported that the identified genes are members of the UGT superfamily. Besides the signature motif, three conserved positions ([VI], L and H in the position 1, 2 and 4, respectively) upstream of the signature motif were also detected. As the UGT superfamily has been well characterized in several insects, we included the representatives of each UGT family in our analysis to determine the members of established families in L. migratoria138. The nomenclature system was used to designate the families in the UGT phylogeny138. All representatives of each UGT family in other insects were retrieved from a recent study138. Following the UGT nomenclature guidelines, delineation of families was done on the basis of 40% or greater overall amino acid sequence identity. Within a given family, sub-families were defined by approximately 60% amino acid sequence identity. At least 31 families of UTG genes could be identified in the L. migratoria genome, representing the largest number of UGT families in insects. The larger number of UGT families in L. migratoria supports a diversified pattern of UGT gene evolution in L. migratoria and may reflect its specific feeding style. The putative UGTs identified in the L. migratoria genome were used to construct a phylogenetic tree by the neighbor-joining method (Figure 2b) that allowed clarification of the molecular evolution and structure of the superfamily of UGT genes. In total, three phylogenetic families, UGT360, UGT363 and UGT365, appear to have expanded more than the others during the evolution of the UGT superfamily in L. migratoria. The largest UGT family, UGT 365, contains 13 members, accounting for 19% of all UGT members. Because of the deep divergence between the L. migratoria and other insects, all but two members of distinct UGT families in L. migratoria share less than 40% amino acid identity to those of previously identified families. The phylogenetic tree also showed that the UGT genes in L. migratoria generally have a tendency to cluster together. Two exceptions to this are LMIUGT50E1 and LMIUGT47B1, which have similarities with UGTs of other insects. UGT50 has not been found in the A. pisum genome, so it had previously been considered to be common only to holometabolous insects138. However, the presence of the UGT50 family in L. migratoria suggests that this family is fairly well conserved across insects. Glutathione-S-transferase. We identified putative glutathione-S-transferase (GST) genes by homology searches using the sequences of already known GST proteins as queries140. First, we identified the locust GST genes in the GLEAN gene set using BLASTP searches with an E-value threshold of 1E-5. Next, the GENEWISE program was used to improve and extend the GLEAN gene predictions into full-length proteins using GST protein sequence from other insects as a template127. The corresponding gene models were refined in light of supporting evidence from the expression data. The resulting gene models were verified by manual inspection using the Integrative Genomics Viewer program141. Furthermore, the GST N- terminal domain (PFAM PF00043) and the C-terminal domain (PFAM PF02798) were determined with the HMMER program. This combined strategy resulted in a total of 28 putative cytosolic GST genes in the L. migratoria genome. Multiple sequence alignment was performed using the MAFFT program142. Ambiguities and gap-containing columns in alignments were excluded from phylogeny analyses. Prior to phylogeny construction, the ProtTest program was used to estimate the best model of protein evolution143. The resulting optimal model, the LG+I+G model, was then used for the phylogenetic analysis. Finally, maximum likelihood-based inference of the phylogenetic tree was implemented in the RAxML software144. Nodal support was assessed using 1000 bootstrap pseudo-replicates. Supplementary Table S32 provides an overview of the six GST subclasses— Sigma, Theta, Omega, Zeta, Delta and Epsilon. The criteria for subclass division were obtained from published literatures and are supported by phylogenetic and BLAST-based evidence140. Our homology searches identified 28 GST genes in L. migratoria, which represents a considerable expansion in the size of this family compared with those in the other two hemimetabolous insects, A. pisum and P. humanus. No best hit was obtained for the kappa-class GSTs based on NCBI BLAST searches, which indicates that all of these locust GST genes are localized in the cytosol. The gene models of the GST genes were revised manually to improve the gene structures. Most GSTs are supported by transcriptome data in the libraries from different developmental stages. The Epsilon and Delta classes represent the insect-specific classes, which include the majority of the GSTs with a key role in metabolic detoxification of insecticides. As in the A. psium genome, no Epsilon class GSTs but abundant Delta class GSTs were found in the L. migratoria genome. In contrast to other insects involved in this study, the Sigma class is the largest GST subclass in the L. migratoria genome75. Twelve of the 28 GST genes were classified into this class. A lower divergence in several members of the Sigma GST genes in the phylogenetic tree suggested recent duplication events in the Sigma class. The reason for this duplication in the L. migratoria genome is unclear. Members of this class show high abundance in tissues that are either highly aerobic or particularly sensitive to oxidative damage, indicating a crucial role in protection from oxidative stress145. Therefore, in addition to their roles in metabolic detoxification of insecticides, their protective role against deleterious effects of oxidative stress might also have favored duplication of the Sigma class of GSTs in the L. migratoria genome. Carboxyl/cholinesterase(CCE). Carboxyl/cholinesterase family members are responsible for controlling pheromone or hormone degradation, xenobiotic detoxification and neurodevelopment, and are major insecticide targets and chemical warfare agents74. Based on previously reported CCE genes, TBLASTN was used to identify the potential genomic loci of the locust CCE genes with an E-value threshold of 1E-5. The ClustalW program was used to perform multiple sequence alignment prior to phylogenetic analysis. The alignments were visually inspected for accuracy based on amino acid sequences. Model selection was performed with ProtTest based on Akaike Information Criterion, and the WAG model was applied in the phylogenetic reconstruction. Finally, a maximum likelihood analysis was performed in the PhyML program with 100 bootstrap iterations131. We have found 80 CCE genes in the L. migratoria genome, and this represents the largest gene family expansion of the insect genomes sequenced so far (Supplementary Table S33). As in the other insect genomes, the locust CCE genes belong to three main functional classes—dietary/detoxification, hormone/semiochemical processing, and neuro/developmental functions. The numbers of locust CCE genes in these three functional classes are 39, 32 and 9, respectively. In general, the class distribution of the locust CCE genes conformed to that of aphid CCE genes with few differences. The most notable difference between L. migratoria and most other insects with regard to CCE genes is the relatively high number of CCE genes of the dietary/detoxification class in the former: 39 compared to 8-26 in other insect genomes, except for B. mori. Locust-specific expansion largely accounted for the high number of CCE genes in the dietary/detoxification class. In addition, a large difference in the number of CCE genes could also be observed in the -esterases subclass of the hormone/semiochemical processing class. The -esterases subclass serves diverse functions including sex pheromone degradation, reproductive behaviour control, juvenile hormone metabolism146,147. This subclass has 21 representatives in L. migratoria but 2-18 in other insect species. In the neuro/developmental class, the L. migratoria genome has the same distribution of CCE genes as those found in other insects, suggesting a conserved role for this class75. However, variations among locust members could be identified in the J and K subclasses, which are otherwise highly consistent across insects (Supplementary Figure S30). In most cases, the insect genome contains two acetylcholinesterases and one gliotactin gene. One exception is that the D. melanogaster genome contains only one acetylcholinesterase and two gliotactin genes. Acetylcholinesterase is a key target of organophosphorous and carbamate insecticides148. ATP-binding cassette transporters. For identification of ABC transporters, we generally followed the same procedure as those used previously for the same purpose in D. pulex149. Firstly, the GLEAN gene set was searched using the BLASTP program to identify putative ABC proteins in the locust genome. Our initial query sequences were those of already identified ABC proteins from invertebrates. Protein sequences with significant matches to known ABC proteins were retained for further analysis. We also performed TBLASTN searches to identify putative genomic loci of ABC transporters using the conserved NBDs (highly conserved nucleotide-binding domains) as queries. The ABC transporters in the putative genomic loci were predicted by the GENEWISE program, and their gene structures were further refined by manual adjustment on the basis of transcriptome and homology support. In an attempt to verify our identified ABC transporters, the NBDs for each gene were determined using Interpro domain IPR003439. The domain structures of NBDs were refined using the InterProScan program150. To further determine ABC subfamilies, specific domains were searched against the Conserved Domain Database (CDD) using the reverse position-specific (RPS) BLAST program151. The CDD domains included cd03263, cd03249, cd03250, cd03223, cd03236, cd03221, cd03213 and cd03230. The subfamily assignment was further confirmed by phylogenetic analysis together with known ABC transporters. For comparison, ABC transporters of several insect genomes were also determined by the same procedure. In total, our searches identified a total of 65 putative ABC genes in the L. migratoria genome (Supplementary Table S34), which includes members from all known ABC subfamilies. Of these, 19 genes were classified under the ABCB subfamily while the two ABCC subfamilies together comprised another 19 genes. These two subfamilies are involved in xenobiotic resistance and represent the largest ABC subfamilies in L. migratoria as well as other insects152. Cytochrome P450 genes. Cytochrome P450s (CYPs) form a large and diverse gene family in metabolic systems and are involved in the metabolism of xenobiotics such as pesticides, plant toxins and toxic environmental chemicals. They are also important in the regulation of endogenous lipophilic compounds, including hormones, fatty acids and steroids. To detect the CYP genes, the GLEAN gene models were searched by the TBLASTN and GENEWISE programs with known insect CYP sequences representing the four conserved CYP clades127. The gene models were corrected to avoid fusion with adjacent CYP genes or fragmentation when necessary. Through these manual annotation and curation of CYPs, we found 94 P450 genes in the L. migratoria genome. As shown in Supplementary Figure S31, this number represents an intermediate number of P450s in the insect genome but higher than that of the other two sequenced hemimetabolous insects, A. pisum and P. humanus. A phylogenetic analysis was performed to classify the locust P450s with other identified insect P450s. The amino acid sequences of the resulting P450 genes were aligned with those of other insect P450 genes by the MAFFT program with some manual adjustments142. A phylogenetic tree was then constructed based on the maximum likelihood algorithm using the RAxML program. Supplementary Figure S32 shows the four distinct branches of the phylogenetic tree corresponding to the known CYP clades. The locust P450s could be divided into four distinct clades—CYP2, CYP3, CYP4 and mitochondrial clades—that are also found in other insects. The number of genes in the CYP3 clade in the L. migratoria genome is higher than that of the typical insect with the exceptions of A. aegypti and T. castaneum. The CYP3 clade contains the largest number of insect P450 genes and considerable amount of evidence points to its role in xenobiotic metabolism and insecticide resistance, implying its critical role in locust biology153. Supplementary Note 3. Pest control Established insecticide targets Deficiency of ion channels and receptors promotes the death of insects by interfering with nervous system functions154. A variety of ion channels and receptors have been utilized as the major target sites of various insecticides54. Although a few genes of the ion channel superfamily have been reported in locusts, the full complement of locust ion channels has not yet been determined155. Sequencing of the L. migratoria genome offers a unique opportunity to characterize the complete set of potential insecticide target genes, representing a fundamental step in improving our understanding of key components of locust insecticide targets. Genes of the cys-loop ligand-gated ion channel (cys-loop LGICs) superfamily are major molecular targets for currently approved insecticides. Their members include glutamate-gated chloride channels (GluCls), excitatory cation-permeable nicotinic acetylcholine receptors (nAChRs), -amino butyric acid (GABA)/glycine chloride channels, and histamine-gated chloride channels (HisCls)156. They have important functions in the nervous system of insects and are essential in a variety of biological processes, such as synaptic inhibition, cellular excitability and organic solute transport157. Several classes of insecticides specifically target cys-loop LGIC genes. For example, owing to their strong binding affinity to nAChRs, neonicotinoids, which includes imidacloprid and thiamethoxam, function as agonists that selectively act on insect nAChRs. In addition, cyclodienes (such as dieldrin) and phenylpyrazole (such as fipronil) insecticides have been shown to inhibit GABA receptors, GluCls and HisCls. Thus, interference of these genes by insecticidal chemical classes is a widely adopted management strategy for controlling agricultural pests. Using a combined strategy of GLEAN gene predictions and homology searches, 34 candidate cys-loop LGIC genes were identified in the L. migratoria genome and manually annotated. We determined the orthologous relationships between the cys-loop LGIC genes of L. migratoria and other known insect cys-loop LGIC genes. Based on their orthologues, these locust cys-loop LGIC genes were classified into four categories: GluCl, nAChR, HisCl and GABA-gated chloride channels. In general, insect genomes have only one GluCl gene157. Surprisingly, we found four GluCl genes in the L. migratoria genome. GluCl gene expansion has also been reported for the basal ancestor of arthropods, Tetranychus urticae, suggesting a partial loss of GluCl genes during insect evolution157. We identified 17 candidate nAChR subunit-encoding genes in the L. migratoria genome. This represents a larger nAChR gene family compared to that in D. melanogaster, A. gambiae, A. mellifera, T. castaneum and N. vitripennis, which contain 10, 10, 11, 12 and 16 subunits, respectively. Although there is no evidence to date that insects use glycine as a neurotransmitter, we identified 11 genes that encode proteins similar to GABA/Glycine chloride channels. These locust GABA/Glycine chloride channels include orthologues of the three known GABA-gated chloride channels (Rdl, Grd and Lcch3) in D. melanogaster158. Interestingly, we found that the Grd gene has been duplicated in the L. migratoria genome, while most insect genomes sequenced to date have been reported to contain no more than one Grd orthologue157,158. Furthermore, as with D. melanogaster and T. castaneum, L. migratoria has two HisCl genes159. G-protein-coupled receptors (GPCRs) mediate physiological responses to hormones, neurotransmitters and environmental stimulants, and are thus insecticide target candidates for pest control160. We searched the L. migratoria draft genome with protein sequences corresponding to fly GPCRs161. The searches were performed using the AUGUSTUS gene prediction program based on protein family specific conservation162. Our annotation efforts resulted in the discovery of 90 GPCR genes composed of 61 rhodopsin-like receptors, 21 secretin-like receptors, 3 metabotropic glutamate receptors and 5 atypical 7 TM proteins. In the rhodopsin-like receptor family, we identified receptors for most biogenic amines, several of which are considered to be key regulators in phase transitions29. We found 20 GPCRs responsible for binding of biogenic amines such as dopamine, tyramine, octopamine, and serotonin (Supplementary Table S35). Lethal insecticide candidates Gene overexpression or silencing offers an important tool for development of alternative pest management strategies. These gene manipulation approaches enable us to protect crops by up- or down-regulating essential gene functions in pests to kill them163. Several essential genes in insects have been approved recently as insecticide targets; their functions are altered through chemical approaches or via uptake of dsRNA molecules57,164,165. However, the number of essential genes successfully utilized as insecticide targets is rather limited. Integration of genomic information with essential gene identification based on phenotypic data allows the discovery of new insecticide targets for suppressing agricultural pests. Complete genome sequences have been determined for an increasing number of insects from various orders, and these genomic resources provide a platform to further explore essential insect genes by comparative genomic analyses. Such analyses can facilitate discovery of novel insecticide targets that are unique to pests in a fair and effective manner at unprecedented selectivity. High specificity of gene manipulation approaches is necessary to guarantee safety to human health. Fine-scale, or even complete, specificity in essential-gene-based pest management depends largely on the nucleotide sequence identity between insecticide target genes in pests and similar-sequence genes of other species, including humans and transgenic plants57. Additionally, the specificity may be influenced by nucleotide variations of insecticide target genes in pests. Therefore, sequence similarity with other species and allelic variability should be taken into account in the identification of novel insecticide target genes. In this study, a genome-wide screening of essential genes based on comparative genomic analyses was carried out to identify candidate insecticide targets for developing selective, specific and human-safe control approaches. Firstly, all protein sequences and phenotype data for D. melanogaster and C. elegans were downloaded from Flybase (http://flybase.org/) and Wormbase (http://wormbase.org/), respectively. Following the approach used in a previous study56, genes with mutant phenotypes of “lethal” and/or “neurophysiology defective” were considered to be essential in D. melanogaster while those with mutant phenotypes of “lethal”, “paralyzed”, “movement abnormal” and/or “muscle system physiology abnormal” were considered essential in C. elegans. Only essential genes were considered for further analysis. Secondly, protein sequences of the L. migratoria genes were compared with those of essential genes from D. melanogaster and C. elegans using reciprocal BLAST searches to determine orthologous relationships. The orthologues of locust essential genes were defined as those that had best–best hits in BLAST searches. Thirdly, locust essential genes with sequences similar to those of human genes were filtered to avoid harmful side effects to human health. Finally, essential gene targets from single-copy genes or lower diversity genes were given a higher priority, given that higher allelic variability is likely to contribute to increased susceptibility towards silencing resistance. Altogether, 166 L. migratoria essential function genes with orthologous essential functions in C. elegans and D. melanogaster were identified by this in silico screening approach (Supplementary Table S35). A large number (51/166, 30%) of these genes have annotated molecular function of ‘enzyme or transporter or receptor activities’, and comprise diverse classes of proteins that mediate critical biological process. Among the 166 essential genes, several belong to molecular classes whose representatives have been successfully utilized as chemical and RNAi targets57,166. For instance, kinases, ATPases, synthases, carboxylasterases and receptors have been confirmed as effective targets for pest control in several insects57. In conclusion, our screen revealed many essential genes that are potential insecticide targets against L. migratoria. Importantly, this approach can also be applied to other sequenced pests. RNA interference RNA interference (RNAi) mediated by short interfering double-stranded RNA molecules (dsRNA) is a conserved post-transcriptional gene silencing mechanism involving degradation of a target mRNA in a variety of eukaryotic organisms 57. This dsRNA-mediated gene silencing can be categorized into two partially overlapping pathways: the microRNAs (miRNA) pathway and the small interfering RNA (siRNA) pathway167. In the miRNA pathway, endogenous genes encoding stem loop hairpin primary miRNA transcripts are cleaved into a ~70-nucleotide precursor miRNA molecules, which are processed in the nucleus by a microprocessor complex composed of Drosha and its cofactor, Pasha168. The resulting precursor miRNAs are then exported from the nucleus into the cytoplasm by Exportin-5, a Ran-GTP dependent nucleo–cytoplasmic transporter169. Dicer-1 interacts with a dsRNA binding protein partner, Loquacious (Loqs), to generate a miRNA/miRNA* duplex from precursor miRNAs170. In the siRNA pathway, long exogenous dsRNAs are processed into siRNAs in the cytoplasm by Dicer-2 and a dsRNA-binding domain partner protein, R2D2171. Both miRNAs and siRNAs bind the Argonaute (Ago) protein family in the RNA-induced silencing complex (RISC), resulting in either mRNA degradation or repression of mRNA translation168. As observed in the most sequenced insect genomes, the L. migratoria genome possesses one copy of each gene in the miRNA pathway: one Ago1, one Dicer-1, one Exportin-5, one Drosha and one Pasha, with exception of the 3 copies of Loqs (Supplementary Table S37). For siRNA biosynthesis, one copy of R2D2 was found in the L. migratoria genome, which is also typical of siRNA machinery in insects. However, we found duplications of Dicer-2 (2 copies) and Ago2 (2 copies) in the L. migratoria genome. The duplication in important proteins of siRNA machinery may reflect a functional expansion of the siRNA pathway, which requires further understanding their biological significance in locust biology. Immune system Locust is a notorious agricultural pest and migrate long distance after swarming. During them travel, they might encounter numerous kinds of pathogens attacking. Insect can trigger humoral and cellular immune defenses to resist infections by organizing the responses of melanization, anti-microbial peptides production and phagocytosis of hemocytes172. Immune genes in L. migratoria were identified by combining the results from both HMM profile and BLASTP based searches as previously described173. For the HMM method, two sets of profiles were used: one was built from manually curated protein sequences of four dipteran species, D. melanogaster, A. gambiae, A. aegypt,and C. quinquefasciatus, while the other one was built from automatically defined orthologous groups across 21 insect species. All protein sequences for each family were downloaded from ImmunoDB174 (http://cegg.unige.ch/Insecta/immunodb). 175 MUSCLE was used to align multiple protein sequences and HMMER 3.0176 was used to build profiles. The profiles were then searched against the L. migratoria proteins. For the BLASTP method, all known immune related genes in D. melanogastor were downloaded from ImmunoDB and were searched against the L. migratoria proteins. Then, all the results were integrated to produce the final list of immune related genes. For the purpose of comparison, the same method was also applied to the other three species, A. pisum, P. humanus and D. pulex (Supplementary Table S36). Using the above procedures, we found that the typical immune pathways including IMD, Toll and JAK/STAT all exists in L. migratoria. Interestingly, we found that the humoral immune gene families of prophenoloxidases (PPOs, 8) and serine protease inhibitors (SRPNs, 23) were expanded in L. migratoria, as the most abundant among the 5 species. It is known that the SRPNs involved in regulation of serine proteases cascades that ultimately activate PPO during melanisation response, and these two gene classes systematically control melanisation cascades spatially and temporally177. Moreover, the PGRP class expands as 15 members in locust in comparison of other insect species. Thus the expansion of these humoral gene classes suggests humoral immune defences might play a crucial role in L. migratoria life cycle. Environment friendly biopesticides, such as pathogenic fungus Metarhizium178, are widely used to control locust outbreak. However, as in other insects, the efficiency of bio-pesticide was reduced by the immune system of locusts172. Based on the genomic-scale observation, the dissection of the immune system of locust in responses to the challenge with the pathogenic fungus Metarhizium178 and laminarin179 could provide the important cues in development of novel pesticides. Cuticle metabolism Insect cuticle is an important ecological innovation of a highly successful invertebrate phylum. It is a strong, light exoskeleton and environmental interface predominantly composed of an ordered matrix of chitin fibers and protein180. The biomechanical properties of cuticle and the control of its deposition during development have long been of interest to researchers. More recently, evidence has emerged that cuticular proteins are relevant to problems in applied entomology, such as insecticide resistance, immune responses, heavy metal tolerance and drought tolerance181. The insect-specific metabolism that occurs in the integument is vital for growth and, therefore, is a good target for development of highly selective insecticides182. Here, we identified several major gene families, including cuticular proteins, chitinase and yellow proteins, which are involved in insect cuticle metabolism by conducting genome-wide searches using homologue BLAST. The cuticle consists mainly of chitin fibers embedded in a matrix of a large number of cuticular proteins (CPs)183. Hundreds of CPs have now been identified from various insect species184. The vast majority of these proteins belong to the CPR family, which is defined by a well-conserved domain termed the Rebers and Riddiford Consensus (R&R Consensus) and also recognized as pfam00379. This family has two distinct groups— RR-1 in soft cuticles, and RR-2 present in proteins from hard cuticles185. Putative cuticular protein genes containing the pfam00379 domain were searched using InterproScan with IPR000618. The InterproScan searches annotated 110 putative cuticular protein genes in the L. migratoria genome (Supplementary Table S38). R&R consensus in each putative cuticular protein gene was further identified by using a web server based on profile hidden Markov models in the cuticleDB website, which is capable of RR-1 and RR-2 consensus classification186. Thirty-two genes were identified as RR-1 cuticular protein genes and forty-nine were found to be RR-2 cuticular protein genes. Despite showing sequence similarity with known cuticular protein genes, nineteen other genes failed to be classified using the cuticleDB searches. Chitin is an important component of the exoskeleton of arthropods, but it is absent in vertebrates. Therefore, it represents a useful target for drugs against insects. During insect growth and development, the cuticle must be degraded periodically and replaced to allow for growth, maturation and repair187. Chitinolytic enzymes (chitinase) play important roles in shedding of the old cuticle and turnover of both the PM and tracheal lining. Chitinase (EC 3.2.1.14, endochitinase) is an enzyme that catalyzes the random hydrolysis of N-acetyl-b-D-glucosamine b-1, 4 glycosidic linkages in chitin and chitodextrins in a variety of organisms. All insect chitinases belong to family 18 of glycosylhydrolases and many of them may be involved in cuticle turnover, digestion and PM degradation during molting. We identified a total of 13 chitinase and chitinase-like genes in the L. migratoria genome based on the presence of signature sequences of insect chitinases by using homologue searching (Supplementary Table S39). The yellow gene family is perhaps one of the most enigmatic families discovered so far. There is some evidence that one of the primary functions of insect yellow genes is related to cuticle and eggshell (chorion) formation and hardening during insect development, and that the yellow gene family is vital for insect survival188. The unusual phylogenetic distribution of yellow-like sequences suggests either multiple horizontal transfer from bacteria into eukaryotes, or extensive gene loss in eukaryote lineages. The yellow gene family contains 10 to 15 members in most insect genomes, but the A. mellifera yellow gene family has 26 individual yellow-like genes189. Using the D. melanogaster YELLOW protein sequence as query sequences for iterated PSI-BLAST searches, we found a gene family composed of 13 members, each of which contained a conserved MRJP domain. We named it the YELLOW protein family of L. migratoria (Supplementary Table S40). Supplementary Methods Genome Sequencing, Assembly and Evaluation Samples for genome sequencing Locust swarms devastate crops and cause major agricultural damage. They occur in many areas, especially in African, Asian, and Australian countries (Supplementary Figure S1). The migratory locust Locusta migratoria is one of the most famous locust species and is considered a model for the study of insect physiology. Whole genome sequencing of L. migratoria will enhance our ability to understand locust biology and may provide effective choices for our battle against locust plagues. The strain used in this study for genome sequencing originated from the inbred laboratory strains of solitarious locusts at the Institute of Zoology, CAS, China16. Both colonies were reared under a 14:10 light/dark photo regime at 30℃ and on a diet of fresh greenhouse-grown wheat seedlings and wheat bran. To produce an even more inbred line, a sibling female adult and male adult were mated and eight generations of sib mating were then allowed to occur. DNA for genome sequencing was extracted from the whole body of one female adult except its guts. Genome size estimation Estimation of genome size using k-mer. K-mer analysis has previously been used to estimate genome size190. In this study, based on a k=17 estimation, the L. migratoria genome size was estimated to be 6.38 Gb (Supplementary Table S2, Supplementary Figure S2). However, based on k-mer distribution, we assumed that there was certain degree of heterozygosity in the L. migratoria diploid genome. Estimation of genome size using flow cytometric analysis. The genome size of L. migratoria was determined with flow cytometry, using 4'-6-Diamidino-2-phenylindole (DAPI) as the stain and Mus musculus testis cells as internal standards. The samples from L. migratoria testis cells of one male and the hemolymph of one female were tested in the flow cytometric analysis. We estimated the L. migratoria genome size to be ~6.3 Gb per haploid genome (Supplementary Figure S3). Our estimation was close to the previous Feulgen densitometry measurement that indicated that the locust genome size ranged from 5.28 to 6.35 pg/1C or approximately 5.21 to 6.27 Gb per haploid genome191,192. RAD sequencing and Genetic linkage group construction The mapping cross. A male L. migratoria from Hainan province was crossed with a female from the laboratory raised population, the same lineage used for the reference sequence. DNA from 106 F1 (F:55, M:51) individuals and two parents was isolated using QIAGEN DNeasy Blood&Tissue Kit (Qiagen) after getting rid of the whole gut. RAD library preparation and sequencing. Approximate 1 g genomic DNA of each individual was digested with 20U EcoRI for 1 hour at 37. Then the digestion products were ligated to a modified Illumina P1 adapter which has a unique index. The ligation products from 24 individuals were pooled together and sheared to an average size of 500 bp. DNA samples were electrophresised on a 2% TAE gel, and DNA with length 350 bp – 500 bp was isolated with gel extraction kit (Qiagen). After end polishing and adenine adding, samples were ligated with an Illumina P2 adapter. Finally the libraries were enriched by PCR amplification and qualified by Agilent 2100 test and Q-PCR. Illumina protocols were followed for cluster generating and sequencing. One hundred and six F1 individuals and two parents were sequenced with 50 bp single end on a HiSeq 2000 platform. Parents and F1 individuals were sequenced at ~20X and ~10X, respectively. SNP calling. The reads were aligned to the genome using BWA (version 0.6.2) and the SNPs for every individuals were called using “samtools pileup”60. Only the unique mapped reads were used. The SNPs were further filtered using "samtools.pl varFilter". All the SNPs sites across all F1 progenies were merged and the sites that exist or not be covered in the two F1 parents were filtered. Genetic linkage group construction. Finally, 8,708 markers were classified into 11 linkage groups using minimum LOD 25. Each linkage map was constructed using JoinMap 3.061. Assembled sequences were mapped to the linkage group based on the marker position. Totally, 7,165 scaffolds were anchored with accumulated length 2.9 Gb (44.7% of the genome). Genome Assembly Evaluation Strategies for resequencing-based genome assembly quality evaluation have been described previously12. To assess the representative of the assembly, 135 Gb (~22X) high quality reads with insert sizes of 200 bp were aligned onto the assembly using MAQ-0.7.159 with default parameters except -C 1. Unmapped reads were further aligned using BLAT with the following parameters: -minScore=80, -maxIntron=50, -repMatch=100. SNP calling was carried out using samtools pileup pipeline. A total of 94.37% of the reads (90.21% by MAQ and 4.16% by BLAT) could be mapped to the assembly, suggesting that nearly the entire L. migratoria genome was represented in our assembly. Completeness of the assembly was assessed by the sequencing depth of each base. The proportion of a given depth was calculated and plotted, then compared to the theoretical Poisson distribution with a mean corresponding to the peak (here is 22x) in our data (Supplementary Figure S4). However, there was a minor peak at half of the peak, which suggested the presence of some redundancy in the assembly. The redundancy could have been caused by heterozygosity, which makes the regions with higher heterozygous between sister chromatids were assembled into two sequences193, which was supported by Fig 1a track b. We found the depth distribution of the repeat regions was similar to those of the whole genome. However, the depth distributions of the gene body region of all genes and all the paralogs were similar to the theoretical distribution, which gave us confidence for further analysis. The positions with depth three times higher than 22x, comprised 1.14% of assembled sequences, indicated that repeat collapse was not a serious issue in our assembly. To evaluate the assembly quality, 9 BACs (bacterial artificial chromosomes) were sequenced using Illumina HiSeq 2000 with 500 bp insert size and >100x coverage. The reads for each BAC were assembled with SOAPdenovo v1.05 separately to avoid the influence of repeat sequences and heterozygosity. For each BAC, different k-mer sizes were optimized. To validate accuracy of the assembly and to connect the BAC fragments, we also sequenced ~1.5x Sanger paired-end reads with 3-5 Kb insert sizes for 6 BACs and then mapped the Sanger reads to the assembly. Two fragments were connected if the Sanger paired-end reads were mapped with correct direction and proper distance. The statistics for accuracy of the BAC assembly by Solexa sequencing were assessed through aligning Sanger reads to the assembly (Supplementary Table S4). The sequencing error rates were estimated at 4 x 10-4. The longest scaffold in each BAC assembly, ranging from 57-109 Kb, was chosen to evaluate the genome assembly (Supplementary Table S5). The alignment was carried out using LASTZ194 with the following parameters: --match=1,5; --gap=6,1; --seed=match15; --ambiguous=n; --identity=80; --format=axt. The aligned regions were then linked using chainNet195. On average, 94% of regions of these BACs were covered by the assembly. The assessment results are shown in Supplementary Figure S5 and the legends there showed the methods. In BAC 107-38, the gap region in the genome assembly was annotated as repeat regions in the corresponding region of the BAC, which was a known shortcoming of the next generation sequencing assembler196,197. The gap length in the genome assembly was longer than the actual gap length which might also lead to the larger assembly. The depth of BAC 107-38 was also normally distributed around 20X. However, all of the gene exons were assembled well. In BAC 107-52, several regions, illustrated by light red rectangle (Supplementary Figure S5), showed extremely low depth. However, the depth of the genome assembly was normal (data not shown). Manual alignments of the two regions demonstrated that they were aligned well between the BAC and the genome assembly. However, their identity was lower than 90%, indicating heterozygosity. Since the maximum mismatch tolerance for a 32-bp seed is 2 in BWA (Burrows-Wheeler Alignment tool), the reads in these heterozygous regions could not be aligned with default parameters198. The gene region coverage of the assembly was also evaluated. We downloaded 41,880 EST sequences (Supplementary Table S6) > 200 bp long from LocustDB199. The ESTs were aligned to the assembly genome using BLAT200 with default parameters. The cut-off for identity was set to 80%. More than 96.75% of ESTs with length > 500 bp were covered by the genome at > 90% coverage. We also aligned 71 complete L. migratoria CDS (Coding DNA Sequence) from GenBank to the genome assembly. On average, 95.72% region of these CDSs was covered by a single best piece (Supplementary Table S7). Additionally, among the 248 highly conserved Core Eukaryotic Genes201, 246 (99.19%) could be covered by more than 90% using TBLASTN searches. Finally, the RNA-seq mapping ratios of the L. migratoria genome assembly were comparable to that of previously published genomes (Supplementary Table S20). More than 75% of the reads in all the samples were aligned to the genome. In conclusion, the genome assembly covered most of the coding genes accurately, which provided a solid base for further analysis. Genome annotation Gene annotation pipeline Gene set prediction. For homology-based prediction, protein sequences of four insects were retrieved from public genome databases (of Drosophila melanogaster from Flybase, Apis mellifera from BeeBase, Acyrthosiphon pisum from AphidBase, and Pediculus humanus from VectorBase). These protein sequences were aligned to the L. migratoria genome using TBLASTN (E-value ≤ 1E-5) and the matches with length coverage > 20% of the homologous proteins were considered as gene model candidates. Then, to improve gene models, the corresponding homologous genome sequences were aligned against the matching proteins using Genewise127. For de novo prediction, Augustus62, SNAP63 and Glimmer-HMM64 were used to determine de novo predicted gene structures on the repeat masked assembly sequences. RefSeq proteins from A. pisum and P. humanus were used as training data to obtain suitable parameters in the L. migratoria gene prediction. For transcriptome-based prediction, ESTs and RNA-seq data were incorporated into gene prediction. The L. migratoria ESTs were aligned against genome sequences using BLAT (identity ≥ λ5%, coverage ≥ λ0%) to generate spliced alignments. Spliced EST alignments were identified using PASA to define model gene structures202. TopHat203 was applied to align the RNA-seq reads of multiple libraries to the genome assembly. To obtain transcript structures, Cufflinks204 was used to combine the junction sites from different libraries. Finally, homology-based, de novo derived and transcript gene sets were merged to form a comprehensive and non-redundant reference gene set using GLEAN66. GLEAN genes were filtered if they satisfied any of the following criteria: 1) coverage < 50% of any evidence; 2) genes that were only supported by de novo evidence, with the highest expression level RPKM (reads per kilobase per million) < 5 and with RNA-seq read mapping region < 50% at all the libraries; 3) length of deduced protein sequences < 50 aa (amino acid). Connecting split gene. Gene sets from gene annotation pipelines frequently contains split genes, which are annotated fragments on the same scaffold or across scaffolds that actually belong to the same gene205,206. To remove unreasonable gene models from our reference gene set, we performed large-scale manual curation. We first searched for split genes using ESPRIT206, which was developed based on proteomes of relatively distantly related species as references. Homologous proteins from seven insect species, including A. pisum, P. humanus, D. melanogaster, T. castaneum, B. mori, A. mellifera, and A. gambiae, were used as evidence. In order to maximize identification of split genes, we tested several combinations of parameters and finally selected the following: MinSeqLenContig 20, MinProbContig 0.4, MaxContigOverlap 5, MinBestScore 250, DistConfLevel 5, AllowMultipleHits, StablepairTol 1.81. A total of 50 genes were refined using these parameters. We also developed a pipeline to integrate RNA-seq data as evidence to connect split genes. Our method was inspired by Mortazavi et al, who have conducted transcriptome scaffolding using RNAPATH integrated within the ENRAGE package207. It combined the RNA-seq and homologous protein data and provided confirmation of the authenticity of the connections. The method is described below. To obtain evidence of homologous genes, protein sequences from the seven insect species same to ESPRIT were utilized. All insect proteins were searched against the GLEAN gene set using BLASTP with parameters "-e 1e-2 -F F". The fragmental hits were then conjoined linearly by the SOLAR program208. Two GLEAN genes were supported by one homologous protein if: 1) two GLEAN genes with > 40% of the region were covered by the same homologous gene, and 2) no obvious conflicts occurred in hit overlap, strands and directions on the scaffold. RNA-seq reads with unique mapping in the exon region were collected. Split gene pairs were determined. Based on evidence from the above two steps, two GLEAN genes were considered split from the same gene if they were hit by at least three RNA-seq read pairs or were located on the same scaffold, and were supported by at least one homologous protein. Gene pairs that were only supported by less than three RNA-seq pairs were omitted to avoid random mapping. A total of 3,953 gene pairs satisfied these criteria. The two GLEAN genes of any given split gene pair were connected in the following steps. Split gene pairs were clustered. In one cluster, any two genes could be connected through at least one path. Any two genes from two different clusters could not be connected. A new split gene was added to the cluster if it could be connected to any member in the cluster. Split genes that belonged to the same cluster were connected following these guidelines: All members of a cluster must be covered by at least one homologous protein. Clusters that did not satisfy this criterion were manually split. The GLEAN gene pairs with direction or strand conflict for any supported homologous proteins in one cluster have been manually checked. Based on the direction of homolog alignment, the genome sequences of the genes in one cluster were connected. Gene prediction was performed using GeneWise based on the homolog protein with the highest blast score among the seven insect species. If the cluster contained more than one assembly sequence, the direction and strand were also considered. A total of 50 “N” were placed between two sequences and the coordinates were adjusted. Based on this procedure, 2,396 GLEAN genes were refined, which covered all the genes that were discovered by the ESPRIT program. Gene filtering based on the coverage depth of resequencing reads. To lessen the influence of assembly redundancy on gene prediction, we further filtered the GLEAN gene set based on the coverage depth of resequencing reads. The average coverage depths of genic regions were calculated based on the WGS (Whole-Genome Shotgun) reads re-mapping. Based on the depth distribution (Supplementary Figure S4) and average depth of 20x, we set the cut off to 15x as the half average depth. All the CDS sequences were self-to-self aligned using BLAT. If two CDSs had average depth < 15x, the percentage of half depth positions > 50%, the alignment identity > 90% and the overlap ratio > 70% to any sequences, one of them was considered redundant, and the longer one was retained. Finally, 210 genes were filtered. Manual curation. Finally, we further manually improved 2,192 functionally important genes involved in detoxification, chemoreception, energy metabolism, phase changes and so on. Various information were integrated into the integrative genomics viewer141, including homologues from the 10 insects and human genomes, genomic BLAT searches of ESTs and RNA-seq data, location of junction sites, GLEAN gene structures, isoforms predicted by cufflinks and prediction of three de novo methods. All of these genes were visually inspected to avoid erroneous gene models. After filtering and manual correction, 17,307 genes were obtained (Supplementary Table S8), among which 6,837 (39.50%) were supported by three evidence, at least 9,170 (52.98%) were supported by homologous genes, 11,562 (66.81%) were supported by RNA-seq/EST, and 16,244 (93.86%) genes had RPKM > 1 in at least one sample. To evaluate gene quality, we compared the gene parameters to other sequenced insect gene sets (Supplementary Table S8). As a consequence of large intron size, mRNA lengths of L. migratoria are longer than that of other insects and slightly longer than that of humans. The average CDS length is a little shorter than those of other insects, while the average exon number per gene is comparable to those of other insects. However, the average exon length is shorter than those of all insects but similar to those of humans. Protein coding gene function assignment Protein coding gene functions were assigned based on the best match derived from alignments to proteins annotated in SwissProt, TrEMBL209 databases using BLASTP with E-value < 1E-5. The KEGG210 pathway was annotated using KAAS211 web service version 1.64a, BBH method, by searching against a representative set plus all available insect species with default parameters. We annotated motifs and domains using InterProScan212 (version 4.7) by searching against the InterPro213 database. Descriptions of gene products including Gene Ontology214 were retrieved from InterPro. The statistics of annotated results are listed in Supplementary Table S9. Repeat annotation Both homology-based and de novo methods were used to perform interspersed repeat identification. We first scanned the draft genome sequence assembly of L. migratoria to identify putative repeat regions using the RepeatModeler package, a tool for de novo repeat family identification and modeling based on two de-novo repeat finding programs, RepeatScout and RECON. Unclassified repetitive elements from the RepeatModeler predictions were identified using the REPCLASS package, a program that integrates three different independent classification modules: homology, structure and target site duplication215. Tandem repeats were searched using the program Tandem Repeat Finder216 with default settings. We found that most of these repeat elements showing lower similarity to known repeat elements of sequenced insects could not be classified by homology-based approaches in the previous steps. Therefore, additional approaches were conducted based on the de novo methods. A machine learning approach was adopted to classify unknown repeat elements in the previous steps into main functional categories using the TEclass program217. The total proportion of the genomic region containing interspersed repeat elements was estimated by RepeatMasker searching based on the integrated consensus library. Since predictions made by different methods lead to redundant repeat annotations, an additional filter step was adopted to remove the redundant regions of repeats. Three priority levels were assigned to all the consensus repeats. Known TEs from homology-based annotation had the highest priority, while tandem repeats and unknown TEs received the lowest priority. If two known TEs overlapped, the latter one was split at the end of the former one, and a fragment of the latter one remained. However, unknown TEs and tandem repeats were merged into one longer TE or repeat when they overlapped with each other. Considering that this filter step resulted in many short fragments, we removed fragments shorter than 50 bp. Estimation of divergence time for interspersed repeats Substitution rates in interspersed repeats can be used to estimate the time of divergence on the assumption that they are subject to decay with the same substitution rates under weak selective constraints. The time of divergence of interspersed repeats was estimated according to the molecular clock equation R = K/2T, where R is the average substitution rate, K is the average number of substitutions per site, and T is the divergence time between species. The number of substitutions per site was calculated under the one-parameter Jukes-Cantor model [-3/4 ㏑(1-4/3P)], where P represents the sequence divergence rate between fragmented repeats and consensus sequences. The average nucleotide substitution rate for interspersed repeats is 1.66 × 10-8 per site per year according to substitution rates for a nuclear pseudogene in insects218, and this rate is close to the numbers reported for substitutions in the intron region in insects219. Deletion rates measurements across insect genomes Long terminal repeat (LTR) retrotransposons constitute a substantial fraction of eukaryotic genomes, and they are characterized by the presence of flanking LTR sequences. The LTR sequences are the direct sequence repeats that flank the internal protein coding domains which include genes encoding both structural and enzymatic proteins. The structural and enzymatic proteins typically are composed of open reading frames for the GAG and POL proteins. The GAG protein is a retroviral structural protein that is capable of assembling into virus-like particles, and the POL protein encodes several enzymatic functions, including a protease, a reverse transcriptase and an integrase. In addition, two structural sites critical to replication, the primer-binding site (PBS) and polypurine tract (PPT), were also includes in the LTR retrotransposon identification. The two LTR sequences of LTR retrotransposons are identical at the time of retrotransposition into the host genome, and thus the degree of divergence between the two LTRs provides an estimate of the time that has elapsed since retrotransposition220. Therefore, it is feasible to identify intact LTR retrotransposons based on the sequence similarity (identical or nearly identical) of LTRs sequences and internal structural signatures. Because the insertion and deletion in most of new retrotransposition copies can lead to loss-of-function, LTR retrotransposons are presumably neutral proxies and under relaxed selection pressures. The RepeatMasker program was used in the genome sequence scanning to identify the LTR retrotransposon copies using the intact LTR retrotransposons as query221. The repeat copies search was performed using WU-blast as sequence search engine222. Scanning was carried out using a cutoff value of 300 bp in length. Gaps and ambiguous sites were removed from the WU-blast alignment in order to ensure accuracy of the deletion rates measurements. Comparative genomics analysis Orthologous gene family assignment Treefam67 defines a gene family as a group of genes descending from a single gene in the last common ancestor. We constructed a pipeline to cluster individual genes into gene families and performed phylogenetic analyses: Data preparation. We collected protein-coding sequences more than 30 aa of sequenced related species as well as those of L. migratoria. In total, 10 species were selected for orthologous gene assignment: D. pulex, P. humanus, A. pisum, A. mellifera, N. vitripennis, T. castaneum, B. mori, A. gambiae, A. aegypti, and D. melanogaster. The longest protein isoform was retained for each gene. Pair-wise BLAST alignment (graph building). ALL-to-ALL alignment of all proteins were performed using BLASTP with E-value < 1E-5, and the fragmental alignments were conjoined using SOLAR208. A connection (edge) between two nodes (genes) was assigned if more than 1/3 of the region was aligned in both genes. H-score ranging from 0 to 100 was used to weigh the similarity of two genes (edge). For two genes G1 and G2, the H-score was defined as score (G1G2) / max (score(G1G1), score(G2G2))(score = BLAST raw score). Gene family construction. The average distance was used in the hierarchical clustering algorithm, the minimum edge weight (H-score) was set to 10 and the minimum edge density (total number of edges/theoretical number of edges) was set to > 1/3. Having obtained gene families, we could extract special genes of one species or some relative species sharing common phylogenic node. Phylogeny and orthology analysis. We performed multiple alignments of protein sequences for each gene family using MUSCLE175 and converted the protein alignments to CDS alignments using an in-house Perl script. Phylogeny, divergence estimation and gene family evolution It has been suggested that combining data from multiple genes may potentially alleviate misleading phylogenetic signals in individual genes223. We conducted phylogenomic analysis based on the concatenated data set from universal single-copy genes. Gene orthologous relationships were identified with Treefam using all the protein sequences predicted by merging the GLEAN and manually curated gene sets. The predicted amino acid sequences of the 122 universal single-copy orthologues were aligned in MAFFT with its l-INS-i algorithm142. The resulting alignments were then manually inspected, and ambiguously aligned regions containing gaps were excluded from further analyses. Sequences from individual gene regions were concatenated into the final set of 67,387 amino acids. The phylogenomic trees were inferred using maximum likelihood (ML) algorithms. The ML analyses were performed on the PhyML 3.0 program with the JTT substitutional matrix and discrete gamma distribution in four categories + invariant sites131,224. Gamma shape parameter and the fraction of invariant sites estimated from the data set were set to 1.065 and 0.148, respectively. Tree topology search by nearest-neighbour interchanges (NNIs) was used, and BioNJ tree was used to designate the starting tree. Bootstrapping was done with 100 re-sampling replicates to assess the node supports. Estimates of divergence times for insect lineages were done using Bayesian relaxed molecular clock approaches implemented in MCMCTREE from the PAML package225. This method allows the use of soft age bounds and flexible probability distributions to accommodate uncertainties in fossil calibrations226. We considered Daphnia (Crustacea: Branchiopoda) to be outgroup of Insecta. The MCMC was run twice each for 100,000 generations, sampling every 2nd tree with a burn-in phase of 10,000. The simulations were repeated with different starting seed values to check for convergence of the MCMC chain. Fossil calibrations of nodes were set as soft bounds in the MCMCTREE analysis. Since a prior age for the root node (in our case the Daphnia–Insecta split) must be specified to effectively constrain the standard deviation of the estimated time, the minimum age of the split between Branchiopoda and Hexapoda took place ~550 MYA as suggested in a previous study227. This constraint is closer to the molecular divergence evidence in a recent study228, and is also consistent with the fossil record that the oldest known fossils of hexapods are Collembolans of Rhyniella praecursor from the Lower Devonian period 400 MYA229. We selected the following fossil calibrations: 1) the split between Aculeata (include Apis) and Chalcidoidea (include Nasonia) dates to the middle Mesozoic period 140~180 MYA230; 2) the minimum data for the spilt between Lepidoptera and Diptera is around 290 MYA230. The rate and direction of gene family size in L. migratoria and its related species was inferred using CAFE68, which is based on a stochastic birth-death model. In order to analyze gain and loss of genes with consideration for the phylogenomic tree, we built a pipeline as follows: Based on the phylogenomic tree and the gene family information gathered from the above steps, we estimated the average rate of gene turnover , which is the probability of both gene gain and loss per gene per unit time in the phylogeny. The significance of changes in gene family size was inferred in each branch. Intron evolution Intron analysis in orthologous genes. Genome sequences, gene models and coordinates were downloaded from public databases as described above. The intron length and intron number were extracted from the GFF files using customized Perl scripts to perform multiple gene comparison. The frequency distributions were built for each species and plotted using customized R scripts. To alleviate bias from multiple gene comparison, intron information of the 1:1 orthologous genes was also further retrieved based on the gene family. Identification of U12 introns. We used the program Tophat to perform spliced alignments of the reads against the L. migratoria genome203. Tophat identifies splice junctions by mapping paired-end RNA-seq data to the genomic sequences in two steps. In the first step, all reads are aligned to the genomic sequences, and reads that do not match any location are set aside as ‘initially unmapped reads’. In the following step, the program identifies putative exons and assesses whether previously unmatched reads span any of these islands. Tophat calls a splice junction if at least one read is found that spans two expressed islands (hereafter called ‘spliced reads’). By default, Tophat finds any reads that span splice junctions by at least five bases on each side. Tophat also considers paired-end read information when available to support detected splice junctions. We considered a splice junction expressed in a given subspecies if it was covered by ≥5 spliced reads. The U1β introns have consensus sequences at the 5’ splice site (RTATCCTTT) as well as branch site (TCCTTAACT), which are more conserved than their counterparts in U2 introns80. However, the γ’ splice site is more variable. We considered the intron containing these two properties as the U12-type intron. The splice junctions that meet the criterion of U12 intron identification were retained for further comparison across species. Transcriptome Analysis Samples for RNA sequencing Brain transcriptome during phase transition processes To reveal the molecular mechanisms underlying locust phase change, we previously used two colors cDNA array10,11 determine the differential gene expression profiles from locust head tissues between both phases and used RNA-seq70 determine the differential gene expression profiles from locust whole body between both phases. These studies had highlighted that the critical roles of the peripheral and central nervous systems in locust phase change. So, to further elucidate signaling pathways involved in phase change in central nervous systems, we here carried out the transcriptome sequence for brain tissues in the early process of phase change. Gregarious and solitarious locusts were reared as previously described10. For isolation of gregarious L. migratoria (IG), the 3-day-old fourth-instar gregarious nymphs were separately reared in solitarious-rearing cages. After 0, 4, 8, 16, 32 hours, their brain tissues were dissected and immediately put into liquid nitrogen. Each treatment condition contained 20 individuals. For crowding of solitarious L. migratoria (CS), two solitarious fourth-instar nymphs were introduced into the gregarious-rearing cages, in which there were 20 3-day-old gregarious nymphs. After placing together with the stimulus group for 0, 4, 8, 16, 32 h, solitarous L. migratoria were removed from the cages for dissection of brain tissue. Each treatment condition contained 20 individuals. All samples were immediately put into liquid nitrogen and stored at -80oC until further analysis. All insects were dissected at 3:00 PM. Transcriptome of ten tissues Tissues of brain, antenna, fat body, midgut, testis, ovary, thorax flight muscle, hind leg, hemolymph, gangalia, labipalp and wing were collected from gregarious adults L. migratoria. Transcriptome of mixed samples These samples were from various developmental stages, including eggs. Transcriptome of flight Adult male locusts aging at least 8 days after the final molt were used and a computer-aided flight mill system similar to Schumacher et al231 was constructed. The treated locust was fixed to the arm of the flight mill by a thin copper wire surrounding the metathorax and the wings were free to fly. For the control locust, the wings were simultaneously fastened to prevent flying. Ambient temperature was maintained at 30±2 °C. To initiate flight, a 2-second wind stimulus from an automated wind supplier above the mill was exerted when the locusts stopped flying for a period of 20 seconds. When the treated locust flied for a total of 2 hours, fat bodies were dissected from the treated and control locusts, and these tissues were each used for subsequent RNA extraction. Alternative splicing analysis Short paired-end reads from transcriptome sequencing datasets were aligned to genome sequence using TopHat package to identify splicing junction sites. TopHat first aligns the raw reads onto the reference and reports the “initially unmapped reads” (IUM reads) not mapped to the genome.Then, through a seed-and-extend strategy,TopHat finds reads spanning junctions from IUM reads. Based on the junction result identified by TopHat203 and predicted gene structures, we identified alternative splicing events by comparing the position of the observed junction sites to the annotated gene models and categories them into seven categories: exonskipping (ES), intron-retention (IR), mutually exclusive exon (MXE), alternative 5’ splice site (A5SS),alternative γ’ splice site (A3SS), alternative first exon (AFE), and alternative last exon (ALE). The current methods to detect differential splicing always require experimental replicates and well annotated reference232,233, which is not satisfied for non model species. To simplify the question and got credible differential alternative splicing events, we used two criteria: one was the p value calculated based on fisher’s exact test or chi-square test, another was the ratio difference of the “percentage spliced in” (PSI)234 between two samples. The junction data of every sample was produced from tophat (Supplementary Figure S21). For each junction, the number of reads that support this junction n and the number of reads that across this junction N. N was calculated as the maximum read number across the junction region. To eliminate the bias between samples, the supporting numbers of original reads were normalized using DESeq’s method235. The PSI was defined as n/N. For two samples, the fisher’s exact test or chi-square test was performed based on the four numbers. The fisher’s exact test was selected when there is any theory number smaller than 5. P values of multiple test were adjusted using Benjamini’s FDR65,236. The differential junctions were detected as adjusted p value < 0.1 and the PSI difference > 20%. BEDTools237 was used to extract the junctions that overlapped with annotated genes. Finally, 45 differentially spliced genes (DSG) were found between gregarious and solitarious brain samples (Supplementary Table S23). Enrichment analysis Enrichment analysis for the supplied gene list was carried out based on an algorithm presented by GOstat238, with the whole annotated gene set as the background. The p-value was approximated using the Chi-square test. Fisher’s exact test was used when any expected value of count was below 5. This program was implemented as a pipeline70. For the GO and IPR enrichment analyses, in order to obtain succinct results, if one item was ancestor of another and the enriched gene list of these two items were the same, the ancestor item was deleted from the results. To adjust for multiple testing, we calculated the False Discovery Rate using the Benjamini-Hochberg method236 for each class. Reduced Representation Bisulfite Sequencing data analysis Bioinformatic processing of the data Short reads generated by Illumina sequencing were first processed by Trimmomatic0.22239 with parameters “SLIDINGWINDOW:4:15 MINLEN:40” for filtering low quality, adaptor contaminated reads. After filtering, the remain reads were aligned to L. migratoria genome sequence using Bismark v0.7.12240 with default parameters. Bismark aligned the converted reads to the converted reference genome using bowtie241 (version 2 was used in our analysis), and the unique best alignment was used to determine the methylation states of cytosine position by comparing the reads with their corresponding genomic sequences. The unique mapped reads were 39~50% for the four RRBS and one whole genome libraries (Supplementary Table S16). After mapping the reads to the reference genome, BAM files were generated. Based on these BAM files, insert size distribution were plotted using picard (http://picard.sourceforge.net/). The insert size was ideally centered at 60 bp for short libraries and 130 bp for large libraries, which were consistent with expect (Supplementary Figure S15). The methylation state of cytosine of every reads was extracted using bismark_methylation_extractor. The extractor outputted the overall methylation level of three kinds cytosine for all reads, which was defined as the ratio of the number of methylated C sites to the total number of C sites on the reads (Supplementary Table S17). Based on the methylation state of cytosine of every reads, the CG methylation level for the regions of gene body, exon, intron, intergenic sequence, promoter were calculated (Supplementary Figure S17). In total, 7,568,981 and 7,996,483 CpG sites were covered by more than 10X for G and S samples respectively. The methylation levels across all CpGs was also plotted (Supplementary Figure S16) and followed bimodal distribution. Among these CpG sites, 954,031 (12.6%) and 959,172 (12.0%) were found in the gene body. In total, 11,447 and 11,337 genes were covered at least 4 CpG sites (Supplementary Table S18). Differentially methylated genes To identify differentially methylated (DM) gene (DMG), we first identified DM CG sites using methylKit version 0.5.7242. 4,345,168 CpG sites were covered at minimum 10X in both samples. The P-value of Fisher’s exact test of CG site was calculated using the number of sequenced methylated and non-methylated cytosines in gregarious and solitarious brain samples. The P-value was adjusted by FDR236. The DM CG site was defined as FDR < 0.01 and the difference of methylation level between two samples > 0.25. Because the reads of RRBS were singled-end in 50 bp length, the covered regions were not continuous as those whole genome bisulfite sequencing. So we defined the DMG as those with at least 4 DM CG sites (Supplementary Table S19). 89 DMGs were identified between two phases in the genome. Supplementary References 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. Hemming, C.F. The Locust Menace, (Centre for Overseas Pest Research, 1974). Ye, J. et al. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res 34, W293-7 (2006). Mackenzie, P.I. et al. The UDP glycosyltransferase gene superfamily: recommended nomenclature update based on evolutionary divergence. Pharmacogenetics 7, 255-69 (1997). Johnson, G. & Moore, S.W. The carboxylesterase/cholinesterase gene family in invertebrate deuterostomes. Comp Biochem Physiol Part D Genomics Proteomics 7, 83-93 (2012). Oakeshott, J.G. et al. Metabolic enzymes associated with xenobiotic and chemosensory responses in Nasonia vitripennis. Insect Mol Biol 19 Suppl 1, 147-63 (2010). Bactrian Camels Genome, S. et al. Genome sequences of wild and domestic bactrian camels. Nat Commun 3, 1202 (2012). McLysaght, A., Enright, A.J., Skrabanek, L. & Wolfe, K.H. Estimation of synteny conservation and genome compaction between pufferfish (Fugu) and human. Yeast 17, 22-36 (2000). Vinogradov, A.E. Intron-genome size relationship on a large evolutionary scale. J Mol Evol 49, 376-84 (1999). Tarn, W.Y. & Steitz, J.A. A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro. Cell 84, 801-11 (1996). Burge, C.B., Padgett, R.A. & Sharp, P.A. Evolutionary fates and origins of U12-type introns. Mol Cell 2, 773-85 (1998). Frilander, M.J. & Steitz, J.A. Initial recognition of U12-dependent introns requires both U11/5′ splice-site and U12/branchpoint interactions. Genes Dev 13, 851-863 (1999). Russell, A.G., Charette, J.M., Spencer, D.F. & Gray, M.W. An early evolutionary origin for the minor spliceosome. Nature 443, 863-6 (2006). Lin, C.F., Mount, S.M., Jarmolowski, A. & Makalowski, W. Evolutionary dynamics of U12-type spliceosomal introns. BMC Evol Biol 10, 47 (2010). Bartschat, S. & Samuelsson, T. U12 type introns were lost at multiple occasions during evolution. BMC Genomics 11, 106 (2010). Burnette, J.M., Miyamoto-Sato, E., Schaub, M.A., Conklin, J. & Lopez, A.J. Subdivision of large introns in Drosophila by recursive splicing at nonexonic elements. Genetics 170, 661 (2005). Hatton, A.R., Subramaniam, V. & Lopez, A.J. Generation of alternative Ultrabithorax isoforms and stepwise removal of a large intron by resplicing at exon-exon junctions. Mol Cell 2, 787-796 (1998). Law, J.A. & Jacobsen, S.E. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat Rev Genet 11, 204-20 (2010). 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. Lyko, F. & Maleszka, R. Insects as innovative models for functional studies of DNA methylation. Trends Genet 27, 127-31 (2011). Peckham, M., Cripps, R., White, D. & Bullard, B. Mechanics and Protein-Content of Insect Flight Muscles. J Exp Biol 168, 57-76 (1992). Syme, D.A. & Josephson, R.K. How to build fast muscles: synchronous and asynchronous designs. Integr Comp Biol 42, 762-70 (2002). Rome, L.C. & Lindstedt, S.L. The Quest for Speed: Muscles Built for High-Frequency Contractions. Physiology 13, 261-268 (1998). Mizisin, A.P. & Josephson, R.K. Mechanical power output of locust flight muscle. J Comp Physiol A Neuroethol Sens Neural Behav Physiol 160, 413-419 (1987). Dirks, J.-H. & Taylor, D. Veins Improve Fracture Toughness of Insect Wings. PLoS ONE 7, e43411 (2012). Wootton, R.J. Functional Morphology of Insect Wings. Annu Rev Entomol 37, 113-140 (1992). De Celis, J.F. & Diaz-Benjumea, F.J. Developmental basis for vein pattern variations in insect wings. Int J Dev Biol 47, 653-63 (2003). Marcus, J.M. The development and evolution of crossveins in insect wings. J Anat 199, 211-6 (2001). Tobe, S.S. & Pratt, G.E. The influence of substrate concentrations on the rate of insect juvenile hormone biosynthesis by corpora allata of the desert locust in vitro. Biochem J 144, 107-13 (1974). Min, K.J., Jones, N., Borst, D.W. & Rankin, M.A. Increased juvenile hormone levels after long-duration flight in the grasshopper, Melanoplus sanguinipes. J Insect Physiol 50, 531-7 (2004). de Oliveira Tozetto, S., Rachinsky, A. & Engels, W. Juvenile hormone promotes flight activity in drones (Apis mellifera carnica). Apidologie 28, 77-84 (1997). Mayoral, J.G., Nouzova, M., Navare, A. & Noriega, F.G. NADP+-dependent farnesol dehydrogenase, a corpora allata enzyme involved in juvenile hormone synthesis. Proc Natl Acad Sci U S A 106, 21091-6 (2009). Liu, S. et al. Molecular cloning and characterization of a juvenile hormone esterase gene from brown planthopper, Nilaparvata lugens. J Insect Physiol 54, 1495-502 (2008). Nilsen, K.A. et al. Insulin-like peptide genes in honey bee fat body respond differently to manipulation of social behavioral physiology. J Exp Biol 214, 1488-97 (2011). Barbieri, M., Bonafe, M., Franceschi, C. & Paolisso, G. Insulin/IGF-I-signaling pathway: an evolutionarily conserved mechanism of longevity from yeast to humans. Am J Physiol Endocrinol Metab 285, E1064-71 (2003). Sajid, W. et al. Structural and Biological Properties of the Drosophila Insulin-like Peptide 5 Show Evolutionary Conservation. J Biol Chem 286, 661-673 (2011). Gronke, S., Clarke, D.F., Broughton, S., Andrews, T.D. & Partridge, L. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. Molecular evolution and functional characterization of Drosophila insulin-like peptides. PLoS Genet 6, e1000857 (2010). Zhan, S., Merlin, C., Boore, Jeffrey L. & Reppert, Steven M. The Monarch Butterfly Genome Yields Insights into Long-Distance Migration. Cell 147, 1171-1185 (2011). Brogiolo, W. et al. An evolutionarily conserved function of the Drosophila insulin receptor and insulin-like peptides in growth control. Curr Biol 11, 213-21 (2001). Pierce, S.B. et al. Regulation of DAF-2 receptor signaling by human insulin and ins-1, a member of the unusually large and diverse C. elegans insulin gene family. Genes Dev 15, 672-86 (2001). Van der Horst, D.J. & Rodenburg, K.W. Locust flight activity as a model for hormonal regulation of lipid mobilization and transport. J Insect Physiol 56, 844-53 (2010). Bickel, P.E., Tansey, J.T. & Welte, M.A. PAT proteins, an ancient family of lipid droplet proteins that regulate cellular lipid stores. Biochim Biophys Acta 1791, 419-40 (2009). Kimmel, A.R., Brasaemle, D.L., McAndrews-Hill, M., Sztalryd, C. & Londos, C. Adoption of PERILIPIN as a unifying nomenclature for the mammalian PAT-family of intracellular lipid storage droplet proteins. J Lipid Res 51, 468-71 (2010). Beller, M. et al. PERILIPIN-dependent control of lipid droplet structure and fat storage in Drosophila. Cell Metab 12, 521-32 (2010). Miura, S. et al. Functional conservation for lipid storage droplet association among Perilipin, ADRP, and TIP47 (PAT)-related proteins in mammals, Drosophila, and Dictyostelium. J Biol Chem 277, 32253-7 (2002). Marchler-Bauer, A. et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 39, D225-9 (2011). Teixeira, L., Rabouille, C., Rorth, P., Ephrussi, A. & Vanzo, N.F. Drosophila Perilipin/ADRP homologue Lsd2 regulates lipid metabolism. Mech Dev 120, 1071-81 (2003). Chmurzynska, A. The multigene family of fatty acid-binding proteins (FABPs): function, structure and polymorphism. J Appl Genet 47, 39-48 (2006). Barbieri, E. & Sestili, P. Reactive oxygen species in skeletal muscle signaling. J Signal Transduct 2012, 982794 (2012). Yan, L.J. & Sohal, R.S. Prevention of flight activity prolongs the life span of the housefly, Musca domestica, and attenuates the age-associated oxidative damamge to specific mitochondrial proteins. Free Radic Biol Med 29, 1143-50 (2000). Felton, G.W. & Summers, C.B. Antioxidant systems in insects. Arch Insect Biochem Physiol 29, 187-97 (1995). Corona, M. & Robinson, G.E. Genes of the antioxidant system of the honey bee: annotation and phylogeny. Insect Mol Biol 15, 687-701 (2006). Koua, D. et al. PeroxiBase: a database with new tools for peroxidase family classification. Nucleic Acids Res 37, D261-6 (2009). 122. Manevich, Y. et al. 1-Cys peroxiredoxin overexpression protects cells against phospholipid peroxidation-mediated membrane damage. Proc Natl Acad Sci U S A 99, 11599-604 (2002). 123. Manevich, Y. & Fisher, A.B. Peroxiredoxin 6, a 1-Cys peroxiredoxin, functions in antioxidant defense and lung phospholipid metabolism. Free Radic Biol Med 38, 1422-32 (2005). 124. Chen, J.W., Dodia, C., Feinstein, S.I., Jain, M.K. & Fisher, A.B. 1-Cys peroxiredoxin, a bifunctional enzyme with glutathione peroxidase and phospholipase A2 activities. J Biol Chem 275, 28421-7 (2000). 125. Gardiner, A., Barker, D., Butlin, R.K., Jordan, W.C. & Ritchie, M.G. Drosophila chemoreceptor gene evolution: selection, specialization and genome size. Mol Ecol 17, 1648-57 (2008). 126. Croset, V. et al. Ancient protostome origin of chemosensory ionotropic glutamate receptors and the evolution of insect taste and olfaction. PLoS Genet 6, e1001064 (2010). 127. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988-95 (2004). 128. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402 (1997). 129. Keller, O., Kollmar, M., Stanke, M. & Waack, S. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics 27, 757-63 (2011). 130. Smadja, C., Shi, P., Butlin, R.K. & Robertson, H.M. Large gene family expansions and adaptive evolution for odorant and gustatory receptors in the pea aphid, Acyrthosiphon pisum. Mol Biol Evol 26, 2073-86 (2009). 131. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59, 307-21 (2010). 132. Wanner, K.W. & Robertson, H.M. The gustatory receptor family in the silkworm moth Bombyx mori is characterized by a large expansion of a single lineage of putative bitter receptors. Insect Mol Biol 17, 621-629 (2008). 133. Jin, X. et al. Expression and immunolocalisation of odorant-binding and chemosensory proteins in locusts. Cell Mol Life Sci 62, 1156-66 (2005). 134. Vieira, F.G. & Rozas, J. Comparative genomics of the odorant-binding and chemosensory protein gene families across the Arthropoda: origin and evolutionary history of the chemosensory system. Genome Biol Evol 3, 476-90 (2011). 135. McGuffin, L.J., Bryson, K. & Jones, D.T. The PSIPRED protein structure prediction server. Bioinformatics 16, 404-5 (2000). 136. Mackenzie, P.I. et al. Nomenclature update for the mammalian UDP glycosyltransferase (UGT) gene superfamily. Pharmacogenet Genomics 15, 677-85 (2005). 137. Ahn, S.J. et al. Metabolic detoxification of capsaicin by 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. UDP-glycosyltransferase in three Helicoverpa species. Arch Insect Biochem Physiol 78, 104-18 (2011). Ahn, S.J., Vogel, H. & Heckel, D.G. Comparative analysis of the UDP-glycosyltransferase multigene family in insects. Insect Biochem Mol Biol 42, 133-47 (2012). Hughes, J. & Hughes, M.A. Multiple secondary plant product UDP-glucose glucosyltransferase genes expressed in cassava (Manihot esculenta Crantz) cotyledons. Mitochondrial DNA 5, 41-49 (1994). Friedman, R. Genomic organization of the glutathione S-transferase family in insects. Mol Phylogenet Evol 61, 924-32 (2011). Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-6 (2011). Katoh, K., Asimenos, G. & Toh, H. Multiple alignment of DNA sequences with MAFFT. Methods Mol Biol 537, 39-64 (2009). Darriba, D., Taboada, G.L., Doallo, R. & Posada, D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27, 1164-5 (2011). Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90 (2006). Singh, S.P., Coronella, J.A., Beneš, H., Cochrane, B.J. & Zimniak, P. Catalytic function of Drosophila melanogaster glutathione S‐transferase DmGSTS1‐1 (GST‐2) in conjugation of lipid peroxidation end products. Eur J Biochem 268, 2912-2923 (2001). Ishida, Y. & Leal, W.S. Rapid inactivation of a moth pheromone. Proc Natl Acad Sci U S A 102, 14075-9 (2005). Robin, C., Bardsley, L.M., Coppin, C. & Oakeshott, J.G. Birth and death of genes and functions in the beta-esterase cluster of Drosophila. J Mol Evol 69, 10-21 (2009). Charpentier, A., Menozzi, P., Marcel, V., Villatte, F. & Fournier, D. A method to estimate acetylcholinesterase-active sites and turnover in insects. Anal Biochem 285, 76-81 (2000). Sturm, A., Cunningham, P. & Dean, M. The ABC transporter gene family of Daphnia pulex. BMC Genomics 10, 170 (2009). Zdobnov, E.M. & Apweiler, R. InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001). Marchler-Bauer, A. & Bryant, S.H. CD-Search: protein domain annotations on the fly. Nucleic Acids Res 32, W327-31 (2004). Labbe, R., Caveney, S. & Donly, C. Genetic analysis of the xenobiotic resistance-associated ABC gene subfamilies of the Lepidoptera. Insect Mol Biol 20, 243-56 (2011). Feyereisen, R. Evolution of insect P450. Biochem Soc Trans 34, 1252-5 (2006). Millar, N.S. & Denholm, I. Nicotinic acetylcholine receptors: targets for commercially important insecticides. Invert Neurosci 7, 53-66 (2007). Janssen, D., Derst, C., Rigo, J.M. & Van Kerkhove, E. Cys-loop ligand-gated 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. chloride channels in dorsal unpaired median neurons of Locusta migratoria. J Neurophysiol 103, 2587-98 (2010). Jones, A.K., Bera, A.N., Lees, K. & Sattelle, D.B. The cys-loop ligand-gated ion channel gene superfamily of the parasitoid wasp, Nasonia vitripennis. Heredity (Edinb) 104, 247-59 (2010). Dermauw, W. et al. The cys-loop ligand-gated ion channel gene family of Tetranychus urticae: implications for acaricide toxicology and a novel mutation associated with abamectin resistance. Insect Biochem Mol Biol 42, 455-65 (2012). Dale, R.P. et al. Identification of ion channel genes in the Acyrthosiphon pisum genome. Insect Mol Biol 19 Suppl 2, 141-53 (2010). Jones, A.K. & Sattelle, D.B. The cys-loop ligand-gated ion channel gene superfamily of the red flour beetle, Tribolium castaneum. BMC Genomics 8, 327 (2007). Bai, H. & Palli, S.R. G Protein-Coupled Receptors as Target Sites for Insecticide Discovery. in Advanced Technologies for Managing Insect Pests (eds. Ishaaya, I., Palli, S.R. & Horowitz, A.R.) 57-82 (2012). Brody, T. & Cravchik, A. Drosophila melanogasterG Protein–Coupled Receptors. J Cell Biol 150, F83-F88 (2000). Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res 33, W465-W467 (2005). Price, D.R. & Gatehouse, J.A. RNAi-mediated crop protection against insects. Trends Biotechnol 26, 393-400 (2008). Casida, J.E. & Quistad, G.B. Golden age of insecticide research: past, present, or future? Annu Rev Entomol 43, 1-16 (1998). Zlotkin, E. The insect voltage-gated sodium channel as target of insecticides. Annu Rev Entomol 44, 429-455 (1999). Xue, X.-Y., Mao, Y.-B., Tao, X.-Y., Huang, Y.-P. & Chen, X.-Y. Chapter 3 New Approaches to Agricultural Insect Pest Control Based on RNA Interference. in Advances in Insect Physiology, Vol. Volume 42 (ed. Elizabeth, L.J.) 73-117 (Academic Press, 2012). Okamura, K., Robine, N., Liu, Y., Liu, Q. & Lai, E.C. R2D2 organizes small regulatory RNA pathways in Drosophila. Mol Cell Biol 31, 884-96 (2011). Jaubert-Possamai, S. et al. Expansion of the miRNA pathway in the hemipteran insect Acyrthosiphon pisum. Mol Biol Evol 27, 979-87 (2010). Lund, E., Guttinger, S., Calado, A., Dahlberg, J.E. & Kutay, U. Nuclear export of microRNA precursors. Science 303, 95-8 (2004). Tomoyasu, Y. et al. Exploring systemic RNA interference in insects: a genome-wide survey for RNAi genes in Tribolium. Genome Biol 9, R10 (2008). Okamura, K. & Lai, E.C. Endogenous small interfering RNAs in animals. Nat Rev Mol Cell Biol 9, 673-8 (2008). Hoffmann, J.A. The immune response of Drosophila. Nature 426, 33-8 (2003). Bartholomay, L.C. et al. Pathogenomics of Culex quinquefasciatus and 174. 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. meta-analysis of infection responses to diverse pathogens. Science 330, 88-90 (2010). Waterhouse, R.M. et al. Evolutionary dynamics of immune-related genes and pathways in disease-vector mosquitoes. Science 316, 1738-43 (2007). Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-7 (2004). Eddy, S.R. Profile hidden Markov models. Bioinformatics 14, 755-63 (1998). Cerenius, L., Lee, B.L. & Soderhall, K. The proPO-system: pros and cons for its role in invertebrate immunity. Trends Immunol 29, 263-71 (2008). Mullen, L.M. & Goldsworthy, G.J. Immune responses of locusts to challenge with the pathogenic fungus Metarhizium or high doses of laminarin. J Insect Physiol 52, 389-98 (2006). Wells, K.L. The effects of immune challenge on phenoloxidase activity in locust salivary glands in vitro. Bioscience Horizons 1, 122-127 (2008). Neville, A.C. Biology of the arthropod cuticle, (Springer-verlag New York:, 1975). Willis, J.H. & Muthukrishnan, S. Insect Cuticle. Foreword. Insect Biochem Mol Biol 40, 165 (2010). Kramer Karl, J., Hopkins Theodore, L. & Schaefer, J. Insect Cuticle Structure and Metabolism. in Biotechnology for Crop Protection, Vol. 379 160-185 (American Chemical Society, 1988). Andersen, S.O., Hojrup, P. & Roepstorff, P. Insect cuticular proteins. Insect Biochem Mol Biol 25, 153-76 (1995). Cornman, R.S. et al. Annotation and analysis of a large cuticular protein family with the R&R Consensus in Anopheles gambiae. BMC Genomics 9, 22 (2008). Andersen, S.O. Studies on proteins in post-ecdysial nymphal cuticle of locust, Locusta migratoria, and cockroach, Blaberus craniifer. Insect Biochem Mol Biol 30, 569-77 (2000). Karouzou, M.V. et al. Drosophila cuticular proteins with the R&R Consensus: annotation and classification with a new tool for discriminating RR-1 and RR-2 sequences. Insect Biochem Mol Biol 37, 754-60 (2007). Merzendorfer, H. & Zimoch, L. Chitin metabolism in insects: structure, function and regulation of chitin synthases and chitinases. J Exp Biol 206, 4393-412 (2003). Li, J. & Christensen, B. Biological Function of Insect Yellow Gene Family. in Recent Advances in Entomological Research (eds. Liu, T. & Kang, L.) 121-131 (Springer Berlin Heidelberg, 2012). Ferguson, L.C., Green, J., Surridge, A. & Jiggins, C.D. Evolution of the insect yellow gene family. Mol Biol Evol 28, 257-72 (2011). Li, Z. et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics 11, 25-37 (2012). Wilmore, P.J. & Brown, A.K. Molecular properties of orthopteran DNA. Chromosoma 51, 337-45 (1975). 192. Rees, H., Shaw, D.D. & Wilkinson, P. Nuclear DNA Variation among Acridid Grasshoppers. Proc R Soc Lond B Biol Sci 202, 517-525 (1978). 193. Baker, M. De novo genome assembly: what every biologist should know. Nat Methods 9, 333-7 (2012). 194. Harris, R.S. Ph.D, The Pennsylvania State University (2007). 195. Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100, 11484-9 (2003). 196. Treangen, T.J. & Salzberg, S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13, 36-46 (2012). 197. Mak, H.C. Genome interpretation and assembly-recent progress and next steps. Nat Biotechnol 30, 1081-3 (2012). 198. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60 (2009). 199. Ma, Z., Yu, J. & Kang, L. LocustDB: a relational database for the transcriptome and biology of the migratory locust (Locusta migratoria). BMC Genomics 7, 11 (2006). 200. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-64 (2002). 201. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-7 (2007). 202. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596-604 (2006). 203. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-11 (2009). 204. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28, 511-5 (2010). 205. Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res 37, D690-7 (2009). 206. Dessimoz, C. et al. Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes). Brief Bioinform 12, 474-84 (2011). 207. Mortazavi, A. et al. Scaffolding a Caenorhabditis nematode genome with RNA-seq. Genome Res 20, 1740-7 (2010). 208. Yu, X.J., Zheng, H.K., Wang, J., Wang, W. & Su, B. Detecting lineage-specific adaptive evolution of brain-expressed genes in human using rhesus macaque as outgroup. Genomics 88, 745-51 (2006). 209. UniProt, C. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010). 210. Kanehisa, M. et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res 36, D480-4 (2008). 211. Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A.C. & Kanehisa, M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 35, W182-5 (2007). 212. Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116-20 (2005). 213. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-5 (2009). 214. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000). 215. Feschotte, C., Keswani, U., Ranganathan, N., Guibotsy, M.L. & Levine, D. Exploring repetitive DNA landscapes using REPCLASS, a tool that automates the classification of transposable elements in eukaryotic genomes. Genome Biol Evol 1, 205-20 (2009). 216. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573-80 (1999). 217. Abrusan, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass--a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329-30 (2009). 218. Pons, J. & Vogler, A.P. Complex pattern of coalescence and fast evolution of a mitochondrial rRNA pseudogene in a recent radiation of tiger beetles. Mol Biol Evol 22, 991-1000 (2005). 219. Papadopoulou, A., Anastasiou, I. & Vogler, A.P. Revisiting the insect mitochondrial molecular clock: the mid-Aegean trench calibration. Mol Biol Evol 27, 1659-72 (2010). 220. Baucom, R.S., Estill, J.C., Leebens-Mack, J. & Bennetzen, J.L. Natural selection on gene function drives the evolution of LTR retrotransposon families in the rice genome. Genome Res 19, 243-54 (2009). 221. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-3.0. (1996). 222. Lopez, R., Silventoinen, V., Robinson, S., Kibria, A. & Gish, W. WU-Blast2 server at the European bioinformatics institute. Nucleic Acids Res 31, 3795-3798 (2003). 223. Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 6, 361-75 (2005). 224. Jones, D.T., Taylor, W.R. & Thornton, J.M. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8, 275-82 (1992). 225. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24, 1586-91 (2007). 226. Yang, Z. & Rannala, B. Molecular phylogenetics: principles and practice. Nat Rev Genet 13, 303-14 (2012). 227. Regier, J.C., Shultz, J.W. & Kambic, R.E. Phylogeny of basal hexapod lineages and estimates of divergence times. Ann Entomol Soc Am 97, 411-419 (2004). 228. Rehm, P. et al. Dating the arthropod tree based on large-scale transcriptome data. Mol Phylogenet Evol 61, 880-7 (2011). 229. Whalley, P. Unfair to ancient fossil springtails. Antenna 19, 2-3 (1995). 230. Grimaldi, D.A. & Engel, M.S. Evolution of the Insects, (Cambridge Univ Press, 2005). 231. Schumacher, P., WeyEneth, A., Weber, D.C. & Dorn, S. Long flights in Cydia pomonella L. (Lepidoptera: Tortricidae) measured by a flight mill: influence of sex, mated status and age. Physiol Entomol 22, 149-160 (1997). 232. Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res 22, 2008-17 (2012). 233. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31, 46-53 (2012). 234. Katz, Y., Wang, E.T., Airoldi, E.M. & Burge, C.B. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods (2010). 235. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol 11, R106 (2010). 236. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57, 289-300 (1995). 237. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010). 238. Beissbarth, T. & Speed, T.P. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20, 1464-5 (2004). 239. Lohse, M. et al. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res 40, W622-7 (2012). 240. Krueger, F. & Andrews, S.R. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27, 1571-2 (2011). 241. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009). 242. Akalin, A. et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 13, R87 (2012).