Please log in to watch this conference skillscast.
Lynn Langit shares lessons learned and cloud data pipeline patterns via examples from work she’s doing with CSIRO Bioinformatics Australia. The team there, led by Dr. Denis Bauer, is analyzing a number of large genomic datasets.
First, Lynn examines real-time analysis with cloud-based solutions. Keeping runtime constant can be challenging for problems that vary in complexity, such as genome engineering. The CSIRO GT-Scan2 tool works by instantaneously recruiting additional Lambda functions as the complexity increases. It was built using a microservices pattern (serverless) using AWS services.
Next, Lynn will demo a Jupyter notebook which shows how genomic research can leverage Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently.She’ll discuss the pipeline’s use of an OSS library written by the team at CSIRO (VariantSpark).
VariantSpark can analyze 3,000 samples with 80 million features in under 30 minutes. This pipeline enables real-time diagnosis by finding similar patients. This platform is contributing to motor neuron disease research (publicized by the Ice Bucket Challenge) in Australia.
YOU MAY ALSO LIKE:
Building Genomics Pipelines with AWS Lambda and Apache Spark
Lynn Langit
Lynn Langit is an independent Cloud Architect and Developer. She works on genomic-scale cloud pipelines. Also Lynn is an author for LinkedIn Learning, having created 25 courses on cloud topics. For her technical education work, she has been awarded as an AWS Community Hero, Google Cloud Developer Expert and Microsoft Regional Director.