Every true podcast has a free and publicly available RSS feed that contains information about the show and each episode. In turn, each episode item includes metadata about that episode and a link to a hosted audio file. In this tutorial, we will download transcripts for the latest episodes of our favorite shows and store them in text files on our computer.

Before You Start

You will need a Deepgram API Key - get one here. You will also need to install jq and yq to traverse and manipulate XML (the data format used for RSS feeds) in your terminal. The xq command used throughout this tutorial comes with the Python yq package, which wraps jq.
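Before going further, it can help to confirm the tools are actually on your PATH. The snippet below is a minimal sketch: check_tools is a hypothetical helper name made up for this example, and it assumes the xq command was installed alongside yq.

```shell
# check_tools is a hypothetical helper: it reports, for each command name
# passed in, whether that command can be found on the PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# The three commands this tutorial relies on.
check_tools curl jq xq
```

If anything is reported as missing, install it before continuing.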

This tutorial will be a set of building blocks, slowly growing in complexity towards our end goal. We'll take it slow and explain each step so you can apply this knowledge in other contexts, too.

We'll use the NPR Morning Edition Podcast Feed: https://feeds.npr.org/510318/podcast.xml, but this can be swapped out for your favorite podcast.

Getting Started

Open up your terminal and run the following:

curl https://feeds.npr.org/510318/podcast.xml

This should display the full RSS feed - a bunch of XML (similar to HTML) containing information about the show and its episodes.

Get Just The Episode Items

The structure of the XML includes an rss tag containing a channel tag. Inside channel are a whole bunch of metadata tags for the show and one item tag for each episode. The item tags are not wrapped in a containing list as we might expect with HTML - they are all direct children of channel. Try running the following command:
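To make that nesting concrete, here is a heavily trimmed, made-up feed written to a local file. The show name, episode titles, and example.com URLs are all placeholders, not taken from the NPR feed, and a real feed carries many more tags.

```shell
# Write a tiny, invented feed to sample-feed.xml to show the nesting:
# rss > channel > (show metadata tags + one item tag per episode).
cat > sample-feed.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <link>https://example.com</link>
    <item>
      <title>Second episode</title>
      <enclosure url="https://example.com/audio/2.mp3" type="audio/mpeg" length="0"/>
    </item>
    <item>
      <title>First episode</title>
      <enclosure url="https://example.com/audio/1.mp3" type="audio/mpeg" length="0"/>
    </item>
  </channel>
</rss>
EOF

# Count the lines that open an item tag: one per episode, so this prints 2.
grep -c '<item>' sample-feed.xml
```

Notice that both item tags sit directly inside channel, next to title and link, with no wrapper around them.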

curl https://feeds.npr.org/510318/podcast.xml | xq '.rss.channel.item[]'

This pipes the curl output into the xq command, which converts the XML to JSON and extracts every item tag. It also pretty-prints the result in the terminal, which I find quite helpful when exploring the data. The quoted text after the xq command is known as the 'expression', and it is written in jq syntax.
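Small changes to the expression change what comes back. The sketch below runs two variations against a made-up, two-episode feed held in a shell variable, so it works without a network connection. The feed contents are invented for illustration, and it assumes xq is the one that comes with yq.

```shell
# An invented two-episode feed, standing in for the curl output.
feed='<rss><channel><title>Example Show</title><item><title>Newest episode</title></item><item><title>Older episode</title></item></channel></rss>'

# [0] instead of [] selects only the first item in the feed.
printf '%s' "$feed" | xq '.rss.channel.item[0]'

# Appending .title keeps just that field from every item;
# -r prints raw strings rather than quoted JSON strings.
printf '%s' "$feed" | xq -r '.rss.channel.item[].title'
```

To try these against the real feed, replace the printf portion with the curl command from above.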