How HLS Video Streaming Works

Harshit Sharma

9 min read

We watch video on the web every day but rarely stop to think about what happens between hitting play and seeing pixels move. The video starts blurry, sharpens within a second, and plays smoothly even as your connection fluctuates.

None of that is accidental. Underneath, it's a pipeline that was figured out decades ago in broadcast television and adapted for HTTP. The core idea: don't send one big file. Send many small pieces, and let the player pick the right ones.

It all starts with understanding what a video actually is.

What is a video, really?

A video is a sequence of images played fast enough to create the illusion of motion. Each image is called a frame. Play 30 frames per second and your brain fills in the gaps¹.

Each frame is just a grid of pixels, the same as any image. A single 1080p frame has 1920 × 1080 pixels, each storing three bytes (red, green, blue):

  • 1920 × 1080 = 2,073,600 pixels
  • × 3 bytes per pixel = 6.2 MB per frame
  • × 30 frames per second = 186 MB per second
  • × 600 seconds (10 minutes) = ~112 GB

Over a hundred gigabytes for a 10-minute clip. That's clearly not what gets sent over the wire.
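
If you want to sanity-check that arithmetic, it fits in a few lines of TypeScript:

// Raw (uncompressed) size of a 10-minute 1080p clip at 30 fps.
const width = 1920;
const height = 1080;
const bytesPerPixel = 3; // one byte each for R, G, B
const fps = 30;
const seconds = 600; // 10 minutes

const bytesPerFrame = width * height * bytesPerPixel; // 6,220,800 ≈ 6.2 MB
const bytesPerSecond = bytesPerFrame * fps;           // ≈ 186 MB/s
const totalBytes = bytesPerSecond * seconds;

console.log(`${(totalBytes / 1e9).toFixed(0)} GB uncompressed`); // "112 GB uncompressed"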

[Figure: Video = Frames = Pixels. Each frame is a grid of RGB pixels; a single pixel stores 3 bytes (e.g. R 40, G 75, B 120), and 3 bytes × 2,073,600 pixels = 6.2 MB per frame.]

Video codecs like H.264 compress this by a huge margin. Instead of storing every pixel of every frame, they store the first frame fully (a keyframe), then only store what changed in subsequent frames². A talking head where the background doesn't move? Most of each frame is identical to the last. Only the mouth and subtle movements need updating.

After encoding, that 112 GB becomes roughly 200–500 MB, a reduction of more than 200x. But even 300 MB is a lot to send as a single file.

The simple approach: serve one file

The most obvious approach is to encode the video once and put it on a server:

<video src="https://cdn.example.com/talk.mp4" controls />

The browser downloads the file and plays it. Simple.

But this has three problems that compound:

No adaptation. A user on fiber gets the same 1080p file as a user on 3G. One watches smoothly. The other waits minutes for enough data to buffer. There's no way to serve different quality levels because there's only one file.

Slow start. The browser needs to download enough of the file before playback can begin. For a 300 MB video over a 5 Mbps connection, that's a noticeable wait before anything appears on screen.

Fragile seeking. Jump to the middle of the video and the browser might need to download everything before that point, depending on how the file is structured³.

The single-file approach treats video like a document: download it, then use it. But video is a timeline. People jump around and connections change. You need something that adapts.

The clever approach: chop and multiply

The solution combines two ideas.

Idea 1: segment the video. Instead of one file, chop the video into small pieces, typically 6 seconds each. A 10-minute video becomes ~100 segments. The player downloads them one at a time and plays them back-to-back. It only needs the next segment to keep playing.

Idea 2: encode at multiple qualities. Render the same video at several resolution and bitrate pairs: 1080p, 720p, 480p, 360p. Now you have options for different connection speeds.

Combine both: segment the video and encode each segment at every quality level. Now the player can switch quality on every segment boundary. Fast connection? Grab the next 6 seconds at 1080p. Bandwidth drops? Next segment at 480p. The switch is invisible because each segment is only a few seconds long.

This is HLS (HTTP Live Streaming). Apple designed it, and it's the dominant streaming protocol on the web today.

[Figure: HLS structure. One video (input.mp4) is transcoded and segmented into multiple quality tracks: 1080p, 720p, 480p, 360p. Each row is a separate quality track, each block a 6-second .ts segment, and the player picks one block per time slot.]

The whole system boils down to three things: the segments, the playlists, and the player. Let's look at each.

The segments

To create segments, you take the original video and transcode it at multiple quality levels using FFmpeg:

# 1080p at 5 Mbps
ffmpeg -i input.mp4 -vf scale=1920:1080 -c:v h264 -b:v 5000k \
  -c:a aac -b:a 128k -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "1080p/segment_%03d.ts" 1080p/playlist.m3u8
 
# 720p at 2.5 Mbps
ffmpeg -i input.mp4 -vf scale=1280:720 -c:v h264 -b:v 2500k \
  -c:a aac -b:a 128k -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "720p/segment_%03d.ts" 720p/playlist.m3u8
 
# 480p at 1 Mbps
ffmpeg -i input.mp4 -vf scale=854:480 -c:v h264 -b:v 1000k \
  -c:a aac -b:a 96k -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "480p/segment_%03d.ts" 480p/playlist.m3u8
 
# 360p at 500 Kbps
ffmpeg -i input.mp4 -vf scale=640:360 -c:v h264 -b:v 500k \
  -c:a aac -b:a 64k -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "360p/segment_%03d.ts" 360p/playlist.m3u8

Each command does the same thing: scale the video, encode it with H.264 video and AAC audio at a target bitrate, and split the output into 6-second .ts segment files. It also generates a playlist for each resolution.
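
Running four near-identical commands by hand doesn't scale. Here's one way you might script the ladder from Node; this is a sketch, assuming ffmpeg is on your PATH and input.mp4 sits in the working directory:

// Sketch: generate all four renditions from one script.
import { execFileSync } from "node:child_process";
import { mkdirSync } from "node:fs";

const ladder = [
  { name: "1080p", scale: "1920:1080", video: "5000k", audio: "128k" },
  { name: "720p",  scale: "1280:720",  video: "2500k", audio: "128k" },
  { name: "480p",  scale: "854:480",   video: "1000k", audio: "96k"  },
  { name: "360p",  scale: "640:360",   video: "500k",  audio: "64k"  },
];

for (const rung of ladder) {
  mkdirSync(rung.name, { recursive: true });
  execFileSync("ffmpeg", [
    "-i", "input.mp4",
    "-vf", `scale=${rung.scale}`,
    "-c:v", "h264", "-b:v", rung.video,
    "-c:a", "aac", "-b:a", rung.audio,
    "-hls_time", "6",
    "-hls_playlist_type", "vod",
    "-hls_segment_filename", `${rung.name}/segment_%03d.ts`,
    `${rung.name}/playlist.m3u8`,
  ], { stdio: "inherit" });
}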

The segments use the .ts container (MPEG Transport Stream), not .mp4. Transport Stream was designed for broadcast television (cable, satellite) where a receiver might tune in at any point. Each .ts segment is self-contained with enough metadata to decode independently. You don't need to read a file header first⁴.

One critical detail: every segment must start with a keyframe. Without it, the player can't begin decoding that segment because the first frames would reference previous frames that aren't there. FFmpeg's -force_key_frames flag handles this:

ffmpeg -i input.mp4 -force_key_frames "expr:gte(t,n_forced*6)" ...

This forces a keyframe every 6 seconds, aligned with segment boundaries. Every quality level must have keyframes at exactly the same timestamps. If they don't align, switching between qualities mid-stream causes a visible glitch.

[Figure: Keyframes at segment boundaries. Each 6-second segment starts with an I-frame (keyframe); the remaining frames are P-frames (deltas) that store only what changed. The player can start decoding at any keyframe, which is why every segment boundary needs one.]

The playlists

After transcoding, you have hundreds of .ts files organized by quality. But the player needs to know what's available and in what order. That's what the .m3u8 playlist files are for.

Each quality level gets a variant playlist, a simple text file listing its segments in order:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PLAYLIST-TYPE:VOD
 
#EXTINF:6.000,
segment_000.ts
#EXTINF:6.000,
segment_001.ts
#EXTINF:6.000,
segment_002.ts
#EXTINF:5.240,
segment_003.ts
 
#EXT-X-ENDLIST

#EXTINF:6.000 means "this segment is 6 seconds long." The filename follows on the next line. #EXT-X-ENDLIST marks the end of the video (this is a VOD, not a live stream).
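
The format is plain text, so parsing it doesn't need an HLS library. A minimal sketch that pulls out the segments and their durations (a real player also has to handle tags this ignores, like encryption keys and discontinuities):

function parseVariantPlaylist(m3u8: string): { uri: string; duration: number }[] {
  const lines = m3u8.split("\n").map((line) => line.trim());
  const segments: { uri: string; duration: number }[] = [];
  for (let i = 0; i < lines.length; i++) {
    const match = lines[i].match(/^#EXTINF:([\d.]+),/);
    // The segment URI is the line right after its #EXTINF tag.
    if (match) segments.push({ duration: parseFloat(match[1]), uri: lines[i + 1] });
  }
  return segments;
}

// For the playlist above: 4 segments, 23.24 seconds total.
// parseVariantPlaylist(text).reduce((sum, s) => sum + s.duration, 0);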

Then there's the master playlist. This is the entry point that ties everything together:

#EXTM3U
 
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
 
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
 
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/playlist.m3u8
 
#EXT-X-STREAM-INF:BANDWIDTH=500000,RESOLUTION=640x360
360p/playlist.m3u8

Each #EXT-X-STREAM-INF line describes a variant: the bandwidth it requires and the resolution it provides. The line after it points to that variant's playlist.

Think of it as a table of contents. The master playlist says "here are your options." Each variant playlist says "here are the pieces, in order."
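
Parsing the master playlist is just as mechanical. A sketch that extracts the variants (again skipping attributes a real player reads, like CODECS):

interface Variant {
  bandwidth: number; // bits per second the variant requires
  resolution: string;
  uri: string;
}

function parseMasterPlaylist(m3u8: string): Variant[] {
  const lines = m3u8.split("\n").map((line) => line.trim()).filter(Boolean);
  const variants: Variant[] = [];
  for (let i = 0; i < lines.length; i++) {
    if (!lines[i].startsWith("#EXT-X-STREAM-INF:")) continue;
    variants.push({
      bandwidth: Number(lines[i].match(/BANDWIDTH=(\d+)/)?.[1] ?? 0),
      resolution: lines[i].match(/RESOLUTION=([\dx]+)/)?.[1] ?? "",
      uri: lines[i + 1], // the variant playlist URL follows its tag
    });
  }
  // Sorted low to high, so "step up one level" is just an index increment.
  return variants.sort((a, b) => a.bandwidth - b.bandwidth);
}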

[Figure: Playlist hierarchy. master.m3u8 points to each variant playlist, which in turn lists that variant's segments.]

The player

The player's job is to read the map and make smart decisions. Here's what happens when you hit play:

  1. Fetch the master playlist. Parse the available variants and their bandwidths.
  2. Pick a starting quality. The player doesn't know your connection speed yet. It usually starts with the lowest or a middle variant to be safe.
  3. Fetch the variant playlist. Get the segment list for the chosen quality.
  4. Download the first segment. Time how long the download takes. Now it has a real bandwidth estimate.
  5. Play and adapt. For every subsequent segment, the player re-estimates bandwidth based on recent download speeds. If the connection is faster than the current variant needs, step up to higher quality. If slower, step down.

This is adaptive bitrate streaming (ABR). The player is constantly measuring and switching. A 10-minute video might be served as 40 segments at 720p, 30 at 1080p, and 10 at 480p, all stitched together seamlessly.
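
Stripped to its core, the loop looks something like the sketch below. Real players (hls.js, Shaka, video.js) also weigh buffer level, dropped frames, and screen size, but measure, smooth, pick is the heart of it:

// Variant here is the parsed master-playlist entry from the earlier sketch,
// sorted by bandwidth, low to high.
type Variant = { bandwidth: number; uri: string };

const SAFETY = 0.8; // only trust ~80% of the measured bandwidth
const ALPHA = 0.3;  // weight of the newest sample in the moving average
let estimateBps = 0;

async function fetchSegment(url: string): Promise<ArrayBuffer> {
  const start = performance.now();
  const buf = await (await fetch(url)).arrayBuffer();
  const seconds = (performance.now() - start) / 1000;
  const sampleBps = (buf.byteLength * 8) / seconds;
  // Exponential moving average: one lucky segment shouldn't whipsaw quality.
  estimateBps = estimateBps === 0 ? sampleBps : ALPHA * sampleBps + (1 - ALPHA) * estimateBps;
  return buf;
}

function pickVariant(variants: Variant[]): Variant {
  // Highest variant that fits under the safety margin; lowest as a fallback.
  const affordable = variants.filter((v) => v.bandwidth <= estimateBps * SAFETY);
  return affordable.at(-1) ?? variants[0];
}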

[Interactive figure: adaptive bitrate. As the connection speed moves between 3G, 4G, Wi-Fi, and fiber, the player downloads each segment at a different quality level, from 360p up to 1080p.]

The quality switches are invisible because every segment starts with a keyframe. The player can jump from 480p to 1080p between any two segments and start decoding immediately. No artifacts, no glitches.

The final file structure

After the full pipeline runs, a single 10-minute video becomes something like this:

master.m3u8                    # the entry point
 
1080p/
  playlist.m3u8                # variant playlist
  segment_000.ts  (3.6 MB)    # 6 seconds of 1080p video
  segment_001.ts  (3.8 MB)
  ... ~100 segments
 
720p/
  playlist.m3u8
  segment_000.ts  (1.8 MB)
  segment_001.ts  (1.9 MB)
  ... ~100 segments
 
480p/
  playlist.m3u8
  segment_000.ts  (0.7 MB)
  ... ~100 segments
 
360p/
  playlist.m3u8
  segment_000.ts  (0.4 MB)
  ... ~100 segments

One file became roughly 400 files. That sounds excessive until you realize no single viewer downloads all of them. Each viewer downloads one quality level's worth of segments, maybe switching between two levels, and the CDN caches everything at the edge.

[Figure: Output file tree, ~405 files from a single video. 1080p total: ~360 MB; 360p total: ~40 MB. No viewer downloads all of them.]

Why plain HTTP makes this work

HLS doesn't need a special streaming server. No WebSocket connections. No custom protocols. Every segment is just a static file served over HTTP.

The player makes a series of regular GET requests. First the master playlist. Then a variant playlist. Then segment after segment. The CDN serves static files. Load balancers, caches, and edge nodes all work exactly as they would for images or JavaScript bundles.

This is why HLS won over earlier streaming protocols that required dedicated servers. It rides the entire existing HTTP and CDN infrastructure for free. Any server that can serve files can serve HLS video.
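
In browser code, all of this typically hides behind a library. Safari plays HLS natively; everywhere else, the open-source hls.js library does the playlist parsing and ABR and feeds segments to a <video> element through Media Source Extensions. A minimal sketch (the URL is hypothetical):

import Hls from "hls.js";

const video = document.querySelector("video")!;
const src = "https://cdn.example.com/master.m3u8"; // hypothetical URL

if (video.canPlayType("application/vnd.apple.mpegurl")) {
  video.src = src; // Safari: native HLS support
} else if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(src);    // fetch and parse the playlists
  hls.attachMedia(video); // feed segments via Media Source Extensions
}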

Why this matters

Every time you watch a video online, all of this runs behind the scenes. A pipeline chopped the original file into hundreds of segments at multiple quality levels. A set of text-based playlists maps the structure. A player in your browser reads those playlists, estimates your bandwidth in real time, and downloads the right segment at the right quality, every 6 seconds, without you noticing.

Streaming video isn't really "streaming" in the way most people imagine. There's no continuous river of data flowing from server to player. It's a player downloading a sequence of small files, one after another, pretending they're continuous. The quality switches on YouTube and Netflix? That's the player switching which folder it downloads from. One segment at a time.

The ideas behind this (segmentation, adaptive quality, playlist-based indexing) were solved in broadcast television decades ago. HLS just adapted them for the web, using HTTP as the transport layer.


¹ Most web video uses 24, 30, or 60 fps. Higher frame rates mean smoother motion but proportionally larger files before compression.

² This is called inter-frame compression. Keyframes (I-frames) store a complete image. Predicted frames (P-frames) store only the differences from the previous frame. Bidirectional frames (B-frames) reference both past and future frames for even better compression.

³ MP4 files store a metadata structure called the moov atom. If it's at the end of the file (common in non-optimized exports), the browser has to download the end of the file before it can seek to the middle. Remuxing with ffmpeg's -movflags +faststart option (for example, ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4) moves it to the beginning.

⁴ Modern HLS also supports fragmented MP4 (fMP4) segments instead of TS: a more efficient container where, instead of every chunk carrying full decoding metadata, the chunks share one initialization segment. That shared initialization segment is what the #EXT-X-MAP tag in newer playlists points to.


Written by Harshit Sharma. If you want to know when new posts are out, follow me on Twitter.