<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Technology on Morgan Bye</title><link>https://morganbye.com/tags/technology/</link><description>Recent content in Technology on Morgan Bye</description><generator>Hugo</generator><language>en-ca</language><copyright>CC BY-SA 4.0</copyright><lastBuildDate>Tue, 11 Mar 2025 13:30:00 -0500</lastBuildDate><atom:link href="https://morganbye.com/tags/technology/index.xml" rel="self" type="application/rss+xml"/><item><title>DataProc - a (near) complete guide</title><link>https://morganbye.com/posts/dataproc/</link><pubDate>Tue, 11 Mar 2025 13:30:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc/</guid><description>&lt;p&gt;This post is the summation of months of debugging and optimizing ML and AI workflows on Google Cloud DataProc clusters using Cloud Composer / Apache Airflow for orchestration.&lt;/p&gt;
&lt;p&gt;If you would prefer to jump directly to a chapter:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Otherwise, hold on tight, because here we go&amp;hellip;&lt;/p&gt;</description></item><item><title>DataProc - Passing runtime variables</title><link>https://morganbye.com/posts/dataproc_runtime_variables/</link><pubDate>Sun, 23 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_runtime_variables/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#methods-for-passing-runtime-variables"&gt;Methods for passing runtime variables&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#os-level-initialization-script-cluster-startup"&gt;OS-level initialization script (cluster startup)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#example-startup-script"&gt;Example startup script&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#compute-engine-metadata-instance-level-variables"&gt;Compute engine metadata (instance-level variables)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command-1"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#spark-properties-job-submission-level-variables"&gt;Spark properties (job submission-level variables)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-command-2"&gt;Example command&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#command-line-arguments-explicit-variable-passing"&gt;Command-line arguments (explicit variable passing)&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#use-a-secrets-manager"&gt;Use a secrets manager&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#why-use-a-secrets-manager"&gt;Why Use a Secrets Manager?&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#google-cloud-secret-manager"&gt;Google Cloud Secret Manager&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#1-store-a-secret"&gt;1) Store a Secret&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2-set-a-value"&gt;2) Set a value&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3-retrieve-a-secret-from-airflow"&gt;3) Retrieve a secret from Airflow&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#4-retrieve-a-secret-from-dataproc"&gt;4) Retrieve a Secret from DataProc&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="passing-runtime-variables-in-dataproc-and-airflow"&gt;Passing runtime variables in DataProc and Airflow&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Using environment variables to pass secrets and configurations at runtime is a &lt;strong&gt;DevOps best practice&lt;/strong&gt; that enhances security, flexibility, and maintainability. Hard-coding sensitive values in source code increases the risk of leaks, while environment variables keep secrets out of version control systems. This also enables &lt;strong&gt;dynamic configuration&lt;/strong&gt;, allowing jobs to adapt across different environments (e.g., DEV, UAT, PRD) without modifying the application code.&lt;/p&gt;</description></item><item><title>DataProc - Cluster startup optimization</title><link>https://morganbye.com/posts/dataproc_startup_optimization/</link><pubDate>Fri, 21 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_startup_optimization/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#optimizing-dependency-installation"&gt;Optimizing dependency installation&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#option-1-installing-from-a-requirements-file"&gt;Option 1: Installing from a requirements file&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-2-installing-from-a-whl-package"&gt;Option 2: Installing from a &lt;code&gt;.whl&lt;/code&gt; Package&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#final-optimization-strategy"&gt;Final optimization strategy&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#key-optimizations"&gt;Key optimizations&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#python-package-management-for-dataproc"&gt;Python package management for DataProc&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#five-methods-for-building-a-python-package"&gt;Five methods for building a Python package&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#option-1-poetry-build"&gt;Option 1) Poetry build&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-2-use-pythons-built-in-tools"&gt;Option 2) Use Python’s built-in tools&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2a-old-method-setuptools"&gt;2a) Old method (setuptools)&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2b-modern-python-build"&gt;2b) Modern python (build)&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#option-3-zipping-the-source"&gt;Option 3) Zipping the source&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#3a-zipping-with-the-pyprojecttoml"&gt;3a) Zipping with the pyproject.toml&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3b-zipping-with-the-setuppy"&gt;3b) Zipping with the setup.py&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#bundling-the-dependencies-in-the-package"&gt;Bundling the dependencies in the package&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#pull-the-docker-image"&gt;Pull the Docker image&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#install-dependencies-to-a-target-directory"&gt;Install dependencies to a target directory&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#benchmarking-cluster-initialization"&gt;Benchmarking cluster initialization&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#benchmark-results"&gt;Benchmark results&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#key-takeaways"&gt;Key takeaways&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#1-whl-is-the-most-performant-file-format"&gt;1) &lt;code&gt;.whl&lt;/code&gt; is the most performant file format&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#2-versioned-dependencies-vastly-increase-install-speed"&gt;2) Versioned dependencies vastly increase install speed&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#3-bundling-the-dependencies-yields-no-real-benefit"&gt;3) Bundling the dependencies yields no real benefit&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#4-pyprojecttoml-performs-slightly-better-than-setuppy"&gt;4) &lt;code&gt;pyproject.toml&lt;/code&gt; performs slightly better than &lt;code&gt;setup.py&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#5-modern-build-systems-produce-more-performant-builds"&gt;5) Modern build systems produce more performant builds&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#6-larger-jobs-are-more-performant-than-increasing-parallelization"&gt;6) Larger jobs are more performant than increasing parallelization&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="dataproc-cluster-startup-optimization"&gt;DataProc cluster startup optimization&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Cluster startup time is a crucial factor in DataProc job efficiency. When a DataProc cluster is provisioned, worker nodes are allocated from a Compute Instance pool. The provisioning time depends on machine type availability in the compute region; larger-memory machines typically take longer to provision due to lower availability in the data center.&lt;/p&gt;</description></item><item><title>DataProc - Cluster configuration</title><link>https://morganbye.com/posts/dataproc_cluster_configuration/</link><pubDate>Wed, 19 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_cluster_configuration/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#introduction"&gt;Introduction&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#example-shared-configuration-file"&gt;Example shared configuration file&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#number-of-workers"&gt;Number of workers&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#cluster-timeout-auto_delete_ttl"&gt;Cluster timeout (&lt;code&gt;auto_delete_ttl&lt;/code&gt;)&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#typical-startup-durations"&gt;Typical startup durations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#recommended-configuration"&gt;Recommended configuration&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#dag-structure"&gt;DAG structure&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-dag-structure"&gt;Example DAG structure&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#important-considerations"&gt;Important considerations&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#job-submission-and-argument-handling"&gt;Job submission and argument handling&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#reworking-existing-code"&gt;Reworking existing code&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="dataproc-cluster-configuration"&gt;DataProc cluster configuration&lt;/h1&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In many DataProc projects, jobs are executed under nearly identical conditions: they typically have low memory and CPU requirements and access the same data sources.&lt;/p&gt;</description></item><item><title>DataProc - Variables</title><link>https://morganbye.com/posts/dataproc_variables/</link><pubDate>Mon, 17 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_variables/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#executive-summary"&gt;Executive summary&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#standard-workflow"&gt;Standard workflow&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#managing-composer-variables"&gt;Managing composer variables&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#deploying-variables"&gt;Deploying variables&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#method-1-deploy-to-the-gcs-bucket"&gt;Method 1: Deploy to the GCS bucket&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#method-2-import-via-command-line"&gt;Method 2: Import via command line&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#method-3-manual-import-via-web-interface"&gt;Method 3: Manual import via web interface&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#preferred-approach"&gt;Preferred approach&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="passing-environment-variables-in-dataproc-and-composer"&gt;Passing environment variables in DataProc and Composer&lt;/h1&gt;
&lt;h2 id="executive-summary"&gt;Executive summary&lt;/h2&gt;
&lt;p&gt;A common design pattern for orchestrating AI and ML pipelines leverages &lt;strong&gt;Google Cloud DataProc&lt;/strong&gt; to execute Spark-based workflows, offering scalable compute resources for distributed data processing.&lt;/p&gt;</description></item><item><title>DataProc - Understanding environments: shared vs local</title><link>https://morganbye.com/posts/dataproc_environments/</link><pubDate>Sat, 15 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_environments/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#conceptual-overview-of-airflow-and-spark-environments"&gt;Conceptual overview of Airflow and Spark environments&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#how-airflow-and-dataproc-work-together"&gt;How Airflow and DataProc work together&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#configuring-the-dataproc-environment"&gt;Configuring the DataProc environment&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#provisioning-worker-nodes"&gt;Provisioning worker nodes&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#executing-pyspark-jobs"&gt;Executing PySpark Jobs&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#configuration-scope-and-security-considerations"&gt;Configuration scope and security considerations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="understanding-environments-shared-vs-local"&gt;Understanding environments: shared vs local&lt;/h1&gt;
&lt;h2 id="conceptual-overview-of-airflow-and-spark-environments"&gt;Conceptual overview of Airflow and Spark environments&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/composer?hl=en"&gt;&lt;strong&gt;Google Cloud Composer&lt;/strong&gt;&lt;/a&gt;, a managed &lt;a href="https://airflow.apache.org/"&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt;&lt;/a&gt; service, orchestrates workflows using &lt;a href="https://www.geeksforgeeks.org/introduction-to-directed-acyclic-graph/"&gt;&lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;&lt;/a&gt;. DAGs define a sequence of tasks and their dependencies, ensuring reliable workflow execution.&lt;/p&gt;</description></item><item><title>DataProc - Everything You Need to Know</title><link>https://morganbye.com/posts/dataproc_overview/</link><pubDate>Thu, 13 Feb 2025 16:00:00 -0500</pubDate><guid>https://morganbye.com/posts/dataproc_overview/</guid><description>&lt;p&gt;This post is part of a comprehensive series on DataProc and Airflow. You can read everything in this &lt;a href="https://morganbye.com/posts/dataproc/"&gt;one monster piece&lt;/a&gt;, or you can jump to a particular section.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_overview/"&gt;What is DataProc? And why do we need it?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_environments/"&gt;Environments, what&amp;rsquo;s shared and what&amp;rsquo;s local&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_variables/"&gt;Environment variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_cluster_configuration/"&gt;Cluster configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_startup_optimization/"&gt;Cluster startup optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://morganbye.com/posts/dataproc_runtime_variables/"&gt;Runtime variables&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h1 id="on-this-page"&gt;On this page&lt;/h1&gt;
&lt;nav id="TableOfContents"&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#understanding-airflow-and-spark-in-depth"&gt;Understanding Airflow and Spark in Depth&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#structure-of-an-airflow-dag"&gt;Structure of an Airflow DAG&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#what-is-apache-spark"&gt;What is Apache Spark?&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#key-advantages-of-spark"&gt;Key Advantages of Spark&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#integrating-airflow-and-dataproc-for-scalable-workflows"&gt;Integrating Airflow and DataProc for scalable workflows&lt;/a&gt;
 &lt;ul&gt;
 &lt;li&gt;&lt;a href="#example-airflow-dag-for-dataproc"&gt;Example Airflow DAG for DataProc&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
 &lt;/li&gt;
 &lt;li&gt;&lt;a href="#optimization-strategies-for-cost-and-performance"&gt;Optimization strategies for cost and performance&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
 &lt;/ul&gt;
&lt;/nav&gt;

&lt;hr&gt;
&lt;h1 id="everything-you-need-to-know-about-dataproc"&gt;Everything you need to know about DataProc&lt;/h1&gt;
&lt;h2 id="understanding-airflow-and-spark-in-depth"&gt;Understanding Airflow and Spark in Depth&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/composer?hl=en"&gt;&lt;strong&gt;Google Cloud Composer&lt;/strong&gt;&lt;/a&gt; is a managed version of &lt;a href="https://airflow.apache.org/"&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt;&lt;/a&gt;, an orchestration tool for defining, scheduling, and monitoring workflows. Workflows in Airflow are structured as &lt;a href="https://www.geeksforgeeks.org/introduction-to-directed-acyclic-graph/"&gt;&lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;&lt;/a&gt;, which define a series of tasks and their dependencies.&lt;/p&gt;</description></item><item><title>Dumpster diving &amp; retrieval of Wordpress content</title><link>https://morganbye.com/posts/20231220/</link><pubDate>Wed, 20 Dec 2023 13:28:00 -0500</pubDate><guid>https://morganbye.com/posts/20231220/</guid><description>&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;For reasons that are stupid, I&amp;rsquo;ve lost my hosted Wordpress website. The TLDR version is that the webhost keeps doing &lt;em&gt;something&lt;/em&gt;. The PHP version keeps changing, the MySQL database corrupts, the Wordpress backend auto-updates either via a webhost script or within Wordpress itself and then corrupts.&lt;/p&gt;
&lt;p&gt;Either way, enough is enough. I don&amp;rsquo;t actually use Wordpress for all of its bells and whistles: I don&amp;rsquo;t use it as a dynamic site, and I don&amp;rsquo;t need a relational database powering the backend. What I mostly use it as is a static file store. In work projects, I&amp;rsquo;ve long used the Python Sphinx library to have project documentation automatically built by the CI pipeline after a commit is merged into the develop branch.&lt;/p&gt;</description></item><item><title>Xepr license work around</title><link>https://morganbye.com/posts/20120817_1/</link><pubDate>Fri, 17 Aug 2012 16:54:00 +0000</pubDate><guid>https://morganbye.com/posts/20120817_1/</guid><description>&lt;div class="admonition danger"&gt;
 &lt;div class="admonition-header"&gt;
 &lt;svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"&gt;&lt;path d="M256 32c14.2 0 27.3 7.5 34.5 19.8l216 368c7.3 12.4 7.3 27.7 .2 40.1S486.3 480 472 480L40 480c-14.3 0-27.6-7.7-34.7-20.1s-7-27.8 .2-40.1l216-368C228.7 39.5 241.8 32 256 32zm0 128c-13.3 0-24 10.7-24 24l0 112c0 13.3 10.7 24 24 24s24-10.7 24-24l0-112c0-13.3-10.7-24-24-24zm32 224a32 32 0 1 0 -64 0 32 32 0 1 0 64 0z"/&gt;&lt;/svg&gt;
 &lt;span&gt;Danger&lt;/span&gt;
 &lt;/div&gt;
 &lt;div class="admonition-content"&gt;
 &lt;p&gt;Disclaimer: the author does not condone piracy, but this may be used as a temporary measure.&lt;/p&gt;
 &lt;/div&gt;
 &lt;/div&gt;
&lt;p&gt;As readers of my work-side blog posts will be aware, I&amp;rsquo;ve had some recent problems with spectrometers and their control PCs. As a result I&amp;rsquo;ve been swapping computers and network cards right, left and centre to try and get things working. As a temporary measure, and to avoid contacting Bruker every 20 minutes for a new Xepr license file for the latest network card I&amp;rsquo;m trying, I&amp;rsquo;ve written a little script to get around the Xepr license check.&lt;/p&gt;</description></item><item><title>HTC Desire headphone jack fix</title><link>https://morganbye.com/posts/20120817_2/</link><pubDate>Fri, 17 Aug 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120817_2/</guid><description>&lt;p&gt;This morning on the way to work I got down to the road, plugged my earphones into my phone and hit play on Spotify, only for some horrible, tinny, quiet music to come out. The phone simply would not recognise that anything had been plugged in, something I confirmed when I got to work by checking with a pair of headphones and a PC that it was the phone and not the earphones.&lt;/p&gt;</description></item><item><title>Troubleshooting a non-connecting Bruker ELEXSYS spectrometer</title><link>https://morganbye.com/posts/20120812/</link><pubDate>Sun, 12 Aug 2012 17:50:00 +0000</pubDate><guid>https://morganbye.com/posts/20120812/</guid><description>&lt;p&gt;Recently in the lab we&amp;rsquo;ve had some problems thanks to a helpful software update rendering it impossible to connect to a spectrometer. This troubleshooting provides a record of my steps to diagnose the problem, hopefully shortening the process in the future. The spectrometer in question is a Bruker ELEXSYS E580/E680 hybrid connected to an HP/Bruker desktop running openSUSE 12.1 x64, with Xepr v2.6b53. 
You will need super user privileges for almost all of these commands.&lt;/p&gt;</description></item><item><title>Logarithmic backups on a linux server</title><link>https://morganbye.com/posts/20120802_2/</link><pubDate>Thu, 02 Aug 2012 17:50:00 +0000</pubDate><guid>https://morganbye.com/posts/20120802_2/</guid><description>&lt;p&gt;This is the second part of the &lt;a href="https://morganbye.com/tech/20120801/"&gt;lab server setup guide&lt;/a&gt;. Mainly for my own memory purposes, but if you find it useful then so much the better.&lt;/p&gt;
&lt;h1 id="synchronization"&gt;Synchronization&lt;/h1&gt;
&lt;p&gt;As previously discussed, all of the machines in the lab are mounted onto the server using NFS under &lt;code&gt;/media/lab-machine-XX&lt;/code&gt;, but this does not back them up. So the first step of the backup is to copy the files from the lab machines to the server. For this I use &lt;code&gt;unison&lt;/code&gt;, an rsync-like tool which compares the two versions in question, determines the newest, and only copies across changes, saving bandwidth and time.&lt;/p&gt;</description></item><item><title>Lab server setup - Ubuntu 12.4</title><link>https://morganbye.com/posts/20120802_1/</link><pubDate>Thu, 02 Aug 2012 15:31:00 +0000</pubDate><guid>https://morganbye.com/posts/20120802_1/</guid><description>&lt;p&gt;With a recent change of machines in the lab, I thought that I should update the backup server to reflect the new machines. However, upon inspection the backup server OS hard drive had died long ago; but then, what do you expect when you use 8-year-old battered PCs as a server?&lt;/p&gt;
&lt;p&gt;This wasn&amp;rsquo;t too much of a problem as the machine&amp;rsquo;s RAID array was otherwise fine; it just hadn&amp;rsquo;t been mounted in some time. And the lab machines all still have their files safely on them. So I thought I&amp;rsquo;d take the opportunity to bring new life to the server and update it, something that had been on the list of things-to-do for a long time that I just hadn&amp;rsquo;t got to.&lt;/p&gt;</description></item><item><title>Moving apps from the Android phone memory to SD card: HTC Desire</title><link>https://morganbye.com/posts/20120403/</link><pubDate>Tue, 03 Apr 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120403/</guid><description>&lt;p&gt;So I&amp;rsquo;ve had my HTC Desire for approximately 2 years now and, whilst it has served me well, I&amp;rsquo;ve had a bugbear with it for a long time. Essentially the phone has only 148 MB of internal storage, and after the basic Android OS has been installed this is reduced to under 134 MB. Now I knew that this phone didn&amp;rsquo;t have much memory when I bought it, and as a result I bought it with a 16 GB microSD card.&lt;/p&gt;</description></item><item><title>Clone an (openSUSE) machine and make it functional</title><link>https://morganbye.com/posts/20120129_2/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120129_2/</guid><description>&lt;p&gt;In the last week I&amp;rsquo;ve had some need to set up some new PCs to control some of our machines in the lab. The problem is that the control interface between machine and PC is quite complicated and uses some pretty niche software, on the openSUSE OS (which is one of the most difficult Linux OSes, IMHO).&lt;/p&gt;
&lt;p&gt;So rather than having to install openSUSE from scratch using repositories that are no longer supported (because the software isn&amp;rsquo;t supported by the new SUSE distros) and then trying to install and configure the software, I thought it perhaps best to take a PC that is working and clone it onto a new PC.&lt;/p&gt;</description></item><item><title>Getting an openSUSE machine online</title><link>https://morganbye.com/posts/20120129_3/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120129_3/</guid><description>&lt;p&gt;This week I&amp;rsquo;ve been cloning machines, but each new machine has slightly different hardware, even if it&amp;rsquo;s just a MAC address or serial number.&lt;/p&gt;
&lt;p&gt;So, a quick disclaimer: I&amp;rsquo;m using openSUSE 11.3, so things will be slightly different for other versions of SUSE or other distros.&lt;/p&gt;
&lt;h1 id="get-online"&gt;Get online&lt;/h1&gt;
&lt;p&gt;SUSE, unlike some of the more user-friendly OSes, doesn&amp;rsquo;t go out of its way to get you online, so we need to tell it manually what to do. First open a terminal and log in as a system admin:&lt;/p&gt;</description></item><item><title>PyMOL with Ubuntu 11.10 rendering problems</title><link>https://morganbye.com/posts/20120224_1/</link><pubDate>Sun, 29 Jan 2012 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20120224_1/</guid><description>&lt;p&gt;Recently my (Ubuntu 11.10 64-bit) machine stopped playing nice with &lt;a href="http://www.pymol.org/"&gt;PyMOL&lt;/a&gt;: whenever I tried to render or move an object I faced a &amp;gt;40 second render time, while both of my CPU cores jumped to &amp;gt;90%.&lt;/p&gt;
&lt;p&gt;When I paid close attention to the startup messages in the PyMOL command window I could see:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;span class="lnt"&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;Detected&lt;/span&gt; &lt;span class="n"&gt;OpenGL&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;greater&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Shaders&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;open&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol/data/shaders/default.vs&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;PYMOL_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;shader&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;open&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol/data/shaders/volume.vs&amp;#39;&lt;/span&gt; &lt;span class="n"&gt;PYMOL_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib/pymodules/python2.7/pymol&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;PyMOLShader_NewFromFile&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="n"&gt;shader&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In short, the current version of PyMOL supported by Ubuntu (1.4.1-1) was having trouble playing nicely with the new version of Python (2.7).&lt;/p&gt;</description></item><item><title>Rosetta 3.2 with Ubuntu 11.04</title><link>https://morganbye.com/posts/20110511/</link><pubDate>Wed, 11 May 2011 10:45:00 +0000</pubDate><guid>https://morganbye.com/posts/20110511/</guid><description>&lt;p&gt;&lt;a href="https://www.rosettacommons.org/home"&gt;Rosetta&lt;/a&gt; is a very useful program for the prediction of protein structure, folding and interactions, be that docking with proteins or ligands.&lt;/p&gt;
&lt;p&gt;However, the current version of Rosetta (3.2.1) uses &lt;code&gt;SCons&lt;/code&gt; to compile itself on your system. This is usually fine, except that it requires &lt;code&gt;gcc&lt;/code&gt; versions 4.1 and earlier.&lt;/p&gt;
&lt;p&gt;Ubuntu 11.04 considers anything before gcc v4.4 obsolete, and you can&amp;rsquo;t install those versions (even with &lt;code&gt;apt-get&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;As a result, a small amount of hacking is required.&lt;/p&gt;</description></item><item><title>Ubuntu 11.04 and MATLAB x64 C, C++ and Fortran libraries</title><link>https://morganbye.com/posts/20110501/</link><pubDate>Sun, 01 May 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110501/</guid><description>&lt;p&gt;MATLAB requires the use of libraries (think translation dictionaries) to run code that is outside of its native realm (a hybrid C).&lt;/p&gt;
&lt;p&gt;The most common of these languages are C and Fortran (an old scientific computing language).&lt;/p&gt;
&lt;p&gt;The problem is that old code is typically compiled as 32-bit (a way of addressing memory), whereas most modern operating systems now run 64-bit.&lt;/p&gt;
&lt;p&gt;If you run a 64-bit Linux distro then you cannot run 32-bit MATLAB, and you will be forced to install and run x64 MATLAB (there are some workarounds for Windows).&lt;/p&gt;</description></item><item><title>Recovering a RAID1 disk from a corrupt array</title><link>https://morganbye.com/posts/20110412/</link><pubDate>Tue, 12 Apr 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110412/</guid><description>&lt;p&gt;I recently had our backup server at work die. Upon investigation, the Kubuntu 8 server had had its OS drive and one of the RAID1 data disks corrupted to such an extent that it couldn’t boot. The bad data disk was so far gone that it couldn’t even be assigned a hardware address under Linux or Windows.&lt;/p&gt;
&lt;p&gt;The only hope lay with the remaining RAID1 disk.&lt;/p&gt;
&lt;p&gt;As the server had been set up by a previous administrator as a software RAID, I thought it would be a nice simple job of sticking the drive into a good computer and copying the files off.&lt;/p&gt;</description></item><item><title>Western Digital Green Drives (WD20EARS) and spin downs</title><link>https://morganbye.com/posts/20110401/</link><pubDate>Fri, 01 Apr 2011 12:00:00 +0000</pubDate><guid>https://morganbye.com/posts/20110401/</guid><description>&lt;p&gt;Recently I bought myself two new Western Digital (WD) drives for my home NAS box. Doing the good thing for the planet and not wanting massive electricity bills, I thought I&amp;rsquo;d try the recommended Western Digital Green Drives, which are touted as being energy efficient compared to their WD Blue and Black office- and server-based brethren.&lt;/p&gt;
&lt;p&gt;I personally have always opted for WD drives, as every Hitachi drive I&amp;rsquo;ve had seemed to make horrible noises and die quickly, Samsung drives have always been expensive, and Seagate drives always a bit slow.&lt;/p&gt;</description></item><item><title>[Geek-post] Why I choose open source</title><link>https://morganbye.com/posts/20100701/</link><pubDate>Thu, 01 Jul 2010 12:29:00 +0000</pubDate><guid>https://morganbye.com/posts/20100701/</guid><description>&lt;p&gt;I&amp;rsquo;m currently using an awful lot of software, and getting to the stage where I&amp;rsquo;m having to develop my own. For this reason I thought I&amp;rsquo;d keep a little track of what I&amp;rsquo;m using and why.
I&amp;rsquo;m in the process of moving all of my computers across to Ubuntu, for a number of reasons. Really, the tipping point came one day when Windows 7 (the uncrashable) blue-screened on me one too many times. Now, I&amp;rsquo;m perfectly happy for my gaming machine to run Windows and crash occasionally: by and large (despite Wine trying hard) you can really only game in Windows, largely because it&amp;rsquo;s the largest market. That, combined with the fact that the machine barely ever has its sides on, means a crash or two is unavoidable. However, this is my work PC, the PC that runs its core programs, talks to the internet, and that&amp;rsquo;s it.&lt;/p&gt;</description></item></channel></rss>