<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Technology on Morgan Bye</title><link>https://morganbye.com/tags/technology/</link><description>Recent content in Technology on Morgan Bye</description><generator>Hugo</generator><language>en-ca</language><copyright>CC BY-SA 4.0</copyright><lastBuildDate>Tue, 11 Mar 2025 13:30:00 -0500</lastBuildDate><atom:link href="https://morganbye.com/tags/technology/index.xml" rel="self" type="application/rss+xml"/><item><title>DataProc - a (near) complete guide</title><link>https://morganbye.com/posts/dataproc/</link><pubDate>Tue, 11 Mar 2025 13:30:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc/</guid><description>&lt;p&gt;This post is the summation of months of debugging and optimizing ML and AI workflows on Google Cloud DataProc clusters using Cloud Composer / Apache Airflow for orchestration.&lt;/p&gt;
&lt;p&gt;If you would prefer to jump directly to a chapter:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Otherwise, hold on tight, because here we go&amp;hellip;&lt;/p&gt;</description></item><item><title>DataProc - Passing runtime variables</title><link>https://morganbye.com/posts/dataproc_runtime_variables/</link><pubDate>Sun, 23 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_runtime_variables/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#methods-for-passing-runtime-variables"&gt;Methods for passing runtime variables&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#os-level-initialization-script-cluster-startup"&gt;OS-level initialization script (cluster startup)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#example-startup-script"&gt;Example startup script&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#compute-engine-metadata-instance-level-variables"&gt;Compute engine metadata (instance-level variables)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command-1"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#spark-properties-job-submission-level-variables"&gt;Spark properties (job submission-level variables)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command-2"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#command-line-arguments-explicit-variable-passing"&gt;Command-line arguments (explicit variable passing)&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#use-a-secrets-manager"&gt;Use a secrets manager&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#why-use-a-secrets-manager"&gt;Why Use a Secrets Manager?&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#google-cloud-secret-manager"&gt;Google Cloud Secret Manager&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#1-store-a-secret"&gt;1) Store a Secret&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2-set-a-value"&gt;2) Set a value&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3-retrieve-a-secret-from-airflow"&gt;3) Retrieve a secret from Airflow&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#4-retrieve-a-secret-from-dataproc"&gt;4) Retrieve a Secret from DataProc&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="passing-runtime-variables-in-dataproc-and-airflow"&gt;Passing runtime variables in DataProc and Airflow&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Using environment variables to pass secrets and configurations at runtime is a &lt;strong&gt;DevOps best practice&lt;/strong&gt; that enhances security, flexibility, and maintainability. Hard-coding sensitive values in source code increases the risk of leaks, while environment variables keep secrets out of version control systems. This also enables &lt;strong&gt;dynamic configuration&lt;/strong&gt;, allowing jobs to adapt across different environments (e.g., DEV, UAT, PRD) without modifying the application code.&lt;/p&gt;</description></item><item><title>DataProc - Cluster startup optimization</title><link>https://morganbye.com/posts/dataproc_startup_optimization/</link><pubDate>Fri, 21 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_startup_optimization/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#optimizing-dependency-installation"&gt;Optimizing dependency installation&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#option-1-installing-from-a-requirements-file"&gt;Option 1: Installing from a requirements file&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-2-installing-from-a-whl-package"&gt;Option 2: Installing from a &lt;code&gt;.whl&lt;/code&gt; Package&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#final-optimization-strategy"&gt;Final optimization strategy&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#key-optimizations"&gt;Key optimizations&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#python-package-management-for-dataproc"&gt;Python package management for DataProc&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#five-methods-for-building-a-python-package"&gt;Five methods for building a Python package&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#option-1-poetry-build"&gt;Option 1) Poetry build&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-2-use-pythons-built-in-tools"&gt;Option 2) Use Python’s built-in tools&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2a-old-method-setuptools"&gt;2a) Old method (setuptools)&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2b-modern-python-build"&gt;2b) Modern python (build)&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-3-zipping-the-source"&gt;Option 3) Zipping the source&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#3a-zipping-with-the-pyprojecttoml"&gt;3a) Zipping with the pyproject.toml&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3b-zipping-with-the-setuppy"&gt;3b) Zipping with the setup.py&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#bundling-the-dependencies-in-the-package"&gt;Bundling the dependencies in the package&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#pull-the-docker-image"&gt;Pull the Docker image&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#install-dependencies-to-a-target-directory"&gt;Install dependencies to a target directory&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#benchmarking-cluster-initialization"&gt;Benchmarking cluster initialization&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#benchmark-results"&gt;Benchmark results&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#key-takeaways"&gt;Key takeaways&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#1-whl-is-the-most-performant-file-format"&gt;1) &lt;code&gt;.whl&lt;/code&gt; is the most performant file format&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2-versioned-dependencies-vastly-increase-install-speed"&gt;2) Versioned dependencies vastly increase install speed&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3-bundling-the-dependencies-yields-no-real-benefit"&gt;3) Bundling the dependencies yields no real benefit&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#4-pyprojecttoml-performs-slightly-better-than-setuppy"&gt;4) &lt;code&gt;pyproject.toml&lt;/code&gt; performs slightly better than &lt;code&gt;setup.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#5-modern-build-systems-produce-more-performant-builds"&gt;5) Modern build systems produce more performant builds&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#6-larger-jobs-are-more-performant-than-increasing-parallelization"&gt;6) Larger jobs are more performant than increasing parallelization&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="dataproc-cluster-startup-optimization"&gt;DataProc cluster startup optimization&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Cluster startup time is a crucial factor in DataProc job efficiency. When a DataProc cluster is provisioned, worker nodes are allocated from a Compute Instance pool. The provisioning time depends on machine type availability in the compute region; larger-memory machines typically take longer to provision due to lower availability in the data center.&lt;/p&gt;</description></item><item><title>DataProc - Cluster configuration</title><link>https://morganbye.com/posts/dataproc_cluster_configuration/</link><pubDate>Wed, 19 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_cluster_configuration/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#example-shared-configuration-file"&gt;Example shared configuration file&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#number-of-workers"&gt;Number of workers&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#cluster-timeout-auto_delete_ttl"&gt;Cluster timeout (&lt;code&gt;auto_delete_ttl&lt;/code&gt;)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#typical-startup-durations"&gt;Typical startup durations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#recommended-configuration"&gt;Recommended configuration&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#dag-structure"&gt;DAG structure&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-dag-structure"&gt;Example DAG structure&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#important-considerations"&gt;Important considerations&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#job-submission-and-argument-handling"&gt;Job submission and argument handling&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#reworking-existing-code"&gt;Reworking existing code&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="dataproc-cluster-configuration"&gt;DataProc cluster configuration&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In many DataProc projects, jobs are executed under nearly identical conditions: they typically have low memory and CPU requirements and access the same data sources.&lt;/p&gt;</description></item><item><title>DataProc - Variables</title><link>https://morganbye.com/posts/dataproc_variables/</link><pubDate>Mon, 17 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_variables/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#executive-summary"&gt;Executive summary&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#standard-workflow"&gt;Standard workflow&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#managing-composer-variables"&gt;Managing composer variables&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#deploying-variables"&gt;Deploying variables&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#method-1-deploy-to-the-gcs-bucket"&gt;Method 1: Deploy to the GCS bucket&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#method-2-import-via-command-line"&gt;Method 2: Import via command line&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#method-3-manual-import-via-web-interface"&gt;Method 3: Manual import via web interface&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#preferred-approach"&gt;Preferred approach&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="passing-environment-variables-in-dataproc-and-composer"&gt;Passing environment variables in DataProc and Composer&lt;/h1&gt;
&lt;h2 id="executive-summary"&gt;Executive summary&lt;/h2&gt;
&lt;p&gt;A common design pattern for orchestrating AI and ML pipelines leverages &lt;strong&gt;Google Cloud DataProc&lt;/strong&gt; to execute Spark-based workflows, offering scalable compute resources for distributed data processing.&lt;/p&gt;</description></item><item><title>DataProc - Understanding environments: shared vs local</title><link>https://morganbye.com/posts/dataproc_environments/</link><pubDate>Sat, 15 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_environments/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#conceptual-overview-of-airflow-and-spark-environments"&gt;Conceptual overview of Airflow and Spark environments&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#how-airflow-and-dataproc-work-together"&gt;How Airflow and DataProc work together&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#configuring-the-dataproc-environment"&gt;Configuring the DataProc environment&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#provisioning-worker-nodes"&gt;Provisioning worker nodes&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#executing-pyspark-jobs"&gt;Executing PySpark Jobs&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#configuration-scope-and-security-considerations"&gt;Configuration scope and security considerations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="understanding-environments-shared-vs-local"&gt;Understanding environments: shared vs local&lt;/h1&gt;
&lt;h2 id="conceptual-overview-of-airflow-and-spark-environments"&gt;Conceptual overview of Airflow and Spark environments&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/composer?hl=en"&gt;&lt;strong&gt;Google Cloud Composer&lt;/strong&gt;&lt;/a&gt;, a managed &lt;a href="https://airflow.apache.org/"&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt;&lt;/a&gt; service, orchestrates workflows using &lt;a href="https://www.geeksforgeeks.org/introduction-to-directed-acyclic-graph/"&gt;&lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;&lt;/a&gt;. DAGs define a sequence of tasks and their dependencies, ensuring reliable workflow execution.&lt;/p&gt;</description></item><item><title>DataProc - Everything You Need to Know</title><link>https://morganbye.com/posts/dataproc_overview/</link><pubDate>Thu, 13 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_overview/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#understanding-airflow-and-spark-in-depth"&gt;Understanding Airflow and Spark in Depth&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#structure-of-an-airflow-dag"&gt;Structure of an Airflow DAG&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#what-is-apache-spark"&gt;What is Apache Spark?&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#key-advantages-of-spark"&gt;Key Advantages of Spark&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#integrating-airflow-and-dataproc-for-scalable-workflows"&gt;Integrating Airflow and DataProc for scalable workflows&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-airflow-dag-for-dataproc"&gt;Example Airflow DAG for DataProc&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#optimization-strategies-for-cost-and-performance"&gt;Optimization strategies for cost and performance&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="everything-you-need-to-know-about-dataproc"&gt;Everything you need to know about DataProc&lt;/h1&gt;
&lt;h2 id="understanding-airflow-and-spark-in-depth"&gt;Understanding Airflow and Spark in Depth&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/composer?hl=en"&gt;&lt;strong&gt;Google Cloud Composer&lt;/strong&gt;&lt;/a&gt; is a managed version of &lt;a href="https://airflow.apache.org/"&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt;&lt;/a&gt;, an orchestration tool for defining, scheduling, and monitoring workflows. Workflows in Airflow are structured as &lt;a href="https://www.geeksforgeeks.org/introduction-to-directed-acyclic-graph/"&gt;&lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;&lt;/a&gt;, which define a series of tasks and their dependencies.&lt;/p&gt;</description></item><item><title>Dumpster diving &amp; retrieval of Wordpress content</title><link>https://morganbye.com/posts/20231220/</link><pubDate>Wed, 20 Dec 2023 13:28:00 -0500</pubDate><guid>https://morganbye.com/posts/20231220/</guid><description>&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;For reasons that are stupid, I&amp;rsquo;ve lost my hosted Wordpress website. The TLDR version is that the webhost keeps doing &lt;em&gt;something&lt;/em&gt;. The PHP version keeps changing, the MySQL database corrupts, the Wordpress backend auto-updates either via a webhost script or within Wordpress itself and then corrupts.&lt;/p&gt;
&lt;p&gt;Either way, enough is enough. I don&amp;rsquo;t actually use Wordpress for all of its bells and whistles: I don&amp;rsquo;t use it as a dynamic site, and I don&amp;rsquo;t need a relational database powering the backend. What I mostly use it as is a static file store. In work projects, I&amp;rsquo;ve long used the Python Sphinx library to have project documentation automatically built by the CI pipeline after a commit is merged into the develop branch.&lt;/p&gt;</description></item><item><title>Xepr license work around</title><link>https://morganbye.com/posts/20120817_1/</link><pubDate>Fri, 17 Aug 2012 16:54:00 +0000</pubDate><guid>https://morganbye.com/posts/20120817_1/</guid><description>&lt;div class="admonition danger"&gt;
 &lt;div class="admonition-header"&gt;
 &lt;svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"&gt;&lt;path d="M256 32c14.2 0 27.3 7.5 34.5 19.8l216 368c7.3 12.4 7.3 27.7 .2 40.1S486.3 480 472 480L40 480c-14.3 0-27.6-7.7-34.7-20.1s-7-27.8 .2-40.1l216-368C228.7 39.5 241.8 32 256 32zm0 128c-13.3 0-24 10.7-24 24l0 112c0 13.3 10.7 24 24 24s24-10.7 24-24l0-112c0-13.3-10.7-24-24-24zm32 224a32 32 0 1 0 -64 0 32 32 0 1 0 64 0z"/&gt;&lt;/svg&gt;
 &lt;span&gt;Danger&lt;/span&gt;
 &lt;/div&gt;
 &lt;div class="admonition-content"&gt;
 &lt;p&gt;Disclaimer: the author does not condone piracy, but this may be used as a temporary measure.&lt;/p&gt;
 &lt;/div&gt;
 &lt;/div&gt;
&lt;p&gt;As readers of my work-side blog posts will be aware, I&amp;rsquo;ve had some recent problems with spectrometers and their control PCs. As a result I&amp;rsquo;ve been swapping computers and network cards right, left and centre to try and get things working. As a temporary measure, and to avoid contacting Bruker every 20 minutes for a new Xepr license file for the latest network card I&amp;rsquo;m trying, I&amp;rsquo;ve written a little script to get around the Xepr license check.&lt;/p&gt;</description></item><item><title>HTC Desire headphone jack fix</title><link>https://morganbye.com/posts/20120817_2/</link><pubDate>Fri, 17 Aug 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120817_2/</guid><description>&lt;p&gt;This morning on the way to work I got down to the road, plugged my earphones into my phone and hit play on Spotify, only for some horrible, tinny, quiet music to come out. The phone simply would not recognise that anything had been plugged in, something I confirmed when I got to work by checking with a pair of headphones and a PC that it was the phone and not the earphones.&lt;/p&gt;</description></item><item><title>Troubleshooting a non-connecting Bruker ELEXSYS spectrometer</title><link>https://morganbye.com/posts/20120812/</link><pubDate>Sun, 12 Aug 2012 17:50:00 +0000</pubDate><guid>https://morganbye.com/posts/20120812/</guid><description>&lt;p&gt;Recently in the lab we&amp;rsquo;ve had some problems thanks to a helpful software update rendering it impossible to connect to a spectrometer. This troubleshooting provides a record of my steps to diagnose the problem, hopefully shortening the process in the future. The spectrometer in question is a Bruker ELEXSYS E580/E680 hybrid connected to an HP/Bruker desktop running openSUSE 12.1 x64, with Xepr v2.6b53. 
You will need super user privileges for almost all of these commands.&lt;/p&gt;</description></item><item><title>Logarithmic backups on a linux server</title><link>https://morganbye.com/posts/20120802_2/</link><pubDate>Thu, 02 Aug 2012 17:50:00 +0000</pubDate><guid>https://morganbye.com/posts/20120802_2/</guid><description>&lt;p&gt;This is the second part of the &lt;a href="https://morganbye.com/tech/20120801/"&gt;lab server setup guide&lt;/a&gt;. Mainly for my own memory purposes, but if you find it useful then so much the better.&lt;/p&gt;
&lt;h1 id="synchronization"&gt;Synchronization&lt;/h1&gt;
&lt;p&gt;As previously discussed, all of the machines in the lab are mounted onto the server using NFS under &lt;code&gt;/media/lab-machine-XX&lt;/code&gt;, but this does not back them up. So the first step of the backup is to copy the files from the lab machines to the server. For this I use &lt;code&gt;unison&lt;/code&gt;, an rsync-like tool which compares the two versions in question, determines the newest, and only copies across changes, saving bandwidth and time.&lt;/p&gt;</description></item><item><title>Lab server setup - Ubuntu 12.4</title><link>https://morganbye.com/posts/20120802_1/</link><pubDate>Thu, 02 Aug 2012 15:31:00 +0000</pubDate><guid>https://morganbye.com/posts/20120802_1/</guid><description>&lt;p&gt;With a recent change of machines in the lab, I thought that I should update the backup server to reflect the new machines. However, upon inspection the backup server OS hard drive had died long ago; but then, what do you expect when you use 8-year-old battered PCs as a server?&lt;/p&gt;
&lt;p&gt;This wasn&amp;rsquo;t too much of a problem as the machine&amp;rsquo;s RAID array was otherwise fine; it just hadn&amp;rsquo;t been mounted in some time. And the lab machines all still have their files safely on them. So I thought I&amp;rsquo;d take the opportunity to bring new life to the server and update it, something that had been on the list of things-to-do for a long time that I just hadn&amp;rsquo;t got to.&lt;/p&gt;</description></item><item><title>Moving apps from the Android phone memory to SD card: HTC Desire</title><link>https://morganbye.com/posts/20120403/</link><pubDate>Tue, 03 Apr 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120403/</guid><description>&lt;p&gt;So I&amp;rsquo;ve had my HTC Desire for approximately 2 years now and, whilst it has served me well, I&amp;rsquo;ve had a bugbear with it for a long time. Essentially the phone has only 148 MB of internal storage, and after the basic Android OS has been installed this is reduced to under 134 MB. Now I knew that this phone didn&amp;rsquo;t have much memory when I bought it, and as a result I bought it with a 16 GB microSD card.&lt;/p&gt;</description></item><item><title>Clone an (openSUSE) machine and make it functional</title><link>https://morganbye.com/posts/20120129_2/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120129_2/</guid><description>&lt;p&gt;In the last week I&amp;rsquo;ve had some need to set up some new PCs to control some of our machines in the lab. The problem is that the control interface between machine and PC is quite complicated and uses some pretty niche software, on the openSUSE OS (which is one of the most difficult Linux OSes, IMHO).&lt;/p&gt;
&lt;p&gt;So rather than having to install openSUSE from scratch using repositories that are no longer supported (because the software isn&amp;rsquo;t supported by the new SUSE distros) and then trying to install and configure the software, I thought it perhaps best to take a PC that is working and clone it onto a new PC.&lt;/p&gt;</description></item><item><title>Getting an openSUSE machine online</title><link>https://morganbye.com/posts/20120129_3/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120129_3/</guid><description>&lt;p&gt;This week I&amp;rsquo;ve been cloning machines, but each new machine has slightly different hardware, even if it&amp;rsquo;s just a MAC address or serial number.&lt;/p&gt;
&lt;p&gt;So, a quick disclaimer: I&amp;rsquo;m using openSUSE 11.3, so things will be slightly different for other versions of SUSE or other distros.&lt;/p&gt;
&lt;h1 id="get-online"&gt;Get online&lt;/h1&gt;
&lt;p&gt;SUSE, unlike some of the more user-friendly OSes, doesn&amp;rsquo;t go out of its way to get you online, so we need to tell it manually what to do. First open a terminal and log in as a system admin:&lt;/p&gt;</description></item><item><title>PyMOL with Ubuntu 11.10 rendering problems</title><link>https://morganbye.com/posts/20120224_1/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120224_1/</guid><description>&lt;p&gt;Recently my (Ubuntu 11.10 64-bit) machine stopped playing nice with &lt;a href="http://www.pymol.org/"&gt;PyMOL&lt;/a&gt;: whenever I tried to render or move an object I faced a &amp;gt;40 second render time, while both of my CPU cores jumped to &amp;gt;90%.&lt;/p&gt;
&lt;p&gt;When I paid close attention to the startup messages in the PyMOL command window I could see:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Detected&lt;/span&gt; &lt;span class="n"&gt;OpenGL&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;greater&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Shaders&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;open&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol/data/shaders/default.vs&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;PYMOL_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;shader&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;open&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol/data/shaders/volume.vs&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;PYMOL_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="n"&gt;shader&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In short, the current version of PyMOL supported by Ubuntu (1.4.1-1) was having trouble playing nicely with the new version of Python (2.7).&lt;/p&gt;</description></item><item><title>Rosetta 3.2 with Ubuntu 11.04</title><link>https://morganbye.com/posts/20110511/</link><pubDate>Wed, 11 May 2011 10:45:00 +0000</pubDate><guid>https://morganbye.com/posts/20110511/</guid><description>&lt;p&gt;&lt;a href="https://www.rosettacommons.org/home"&gt;Rosetta&lt;/a&gt; is a very useful program for the prediction of protein structure, folding and interactions, be that docking with proteins or ligands.&lt;/p&gt;
&lt;p&gt;However, the current version of Rosetta (3.2.1) uses &lt;code&gt;SCons&lt;/code&gt; to compile itself on your system. This is usually fine, except that it requires &lt;code&gt;gcc&lt;/code&gt; versions 4.1 and earlier.&lt;/p&gt;
&lt;p&gt;Ubuntu 11.04 considers anything before gcc v4.4 obsolete, and you can&amp;rsquo;t install those versions (even with &lt;code&gt;apt-get&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;As a result, a small amount of hacking is required.&lt;/p&gt;</description></item><item><title>Ubuntu 11.04 and MATLAB x64 C, C++ and Fortran libraries</title><link>https://morganbye.com/posts/20110501/</link><pubDate>Sun, 01 May 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110501/</guid><description>&lt;p&gt;MATLAB requires the use of libraries (think translation dictionaries) to run code that is outside of its native realm (a hybrid C).&lt;/p&gt;
&lt;p&gt;The most common of these languages are C and Fortran (an old scientific computing language).&lt;/p&gt;
&lt;p&gt;The problem is that old code is typically compiled as 32-bit (a way of addressing memory), whereas most modern operating systems now run 64-bit.&lt;/p&gt;
&lt;p&gt;If you run a 64-bit Linux distro then you cannot run 32-bit MATLAB, and you will be forced to install and run x64 MATLAB (there are some workarounds for Windows).&lt;/p&gt;</description></item><item><title>Recovering a RAID1 disk from a corrupt array</title><link>https://morganbye.com/posts/20110412/</link><pubDate>Tue, 12 Apr 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110412/</guid><description>&lt;p&gt;I recently had our backup server at work die. Upon investigation, the Kubuntu 8 server had had its OS drive and one of the RAID1 data disks corrupted to such an extent that it couldn’t boot. The bad data disk was so far gone that it couldn’t even be assigned a hardware address under Linux or Windows.&lt;/p&gt;
&lt;p&gt;The only hope lay with the remaining RAID1 disk.&lt;/p&gt;
&lt;p&gt;As the server had been set up by a previous administrator as a software RAID, I thought it would be a nice simple job of sticking the drive into a good computer and copying the files off.&lt;/p&gt;</description></item><item><title>Western Digital Green Drives (WD20EARS) and spin downs</title><link>https://morganbye.com/posts/20110401/</link><pubDate>Fri, 01 Apr 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110401/</guid><description>&lt;p&gt;Recently I bought myself two new Western Digital (WD) drives for my home NAS box. Doing the good thing for the planet and not wanting massive electricity bills, I thought I&amp;rsquo;d try the recommended Western Digital Green Drives, which are touted as being energy efficient compared to their WD Blue and Black office- and server-based brethren.&lt;/p&gt;
&lt;p&gt;I personally have always opted for WD drives, as every Hitachi drive I&amp;rsquo;ve had seemed to make horrible noises and die quickly, Samsung drives have always been expensive, and Seagate drives always a bit slow.&lt;/p&gt;</description></item><item><title>[Geek-post] Why I choose open source</title><link>https://morganbye.com/posts/20100701/</link><pubDate>Thu, 01 Jul 2010 12:29:00 +0000</pubDate><guid>https://morganbye.com/posts/20100701/</guid><description>&lt;p&gt;I&amp;rsquo;m currently using an awful lot of software, and getting to the stage where I&amp;rsquo;m having to develop my own. For this reason I thought I&amp;rsquo;d keep a little track of what I&amp;rsquo;m using and why.
I&amp;rsquo;m in the process of moving all of my computers across to Ubuntu, for a number of reasons. Really, the tipping point came one day when Windows 7 (the uncrashable) blue-screened on me one too many times. Now, I&amp;rsquo;m perfectly happy for my gaming machine to run Windows and crash occasionally: by and large (despite Wine trying hard) you can really only game in Windows, largely because it&amp;rsquo;s the largest market. That, combined with the fact that the machine barely ever has its sides on, means a crash or two is unavoidable. However, this is my work PC, the PC that runs its core programs, talks to the internet, and that&amp;rsquo;s it.&lt;/p&gt;</description></item></channel></rss>