This blog post contains the notes I made while trying to understand how digital video actually works and what goes on behind the scenes. It is mostly a broad overview of the topic; I plan to dive deeper into the areas that interest me in upcoming posts. This one will help you understand the basic concepts of digital media.
Audio → what you listen to
Video → what you see
Digital media (or a “video” in common terms) is a combination of “video” and “audio”, where the video is an ordered sequence of digital pictures (or “frames”) captured at a given frequency (say, 30 frames per second), paired with a digital audio track. When the content is played back, the device displays the frames in sequence, each one for a 33 ms interval (1000 ms / 30 fps), and we perceive it as a moving picture.
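As a quick illustration of that arithmetic, here is a tiny Python sketch that works out when each frame would be shown (the 30 fps figure is the same example value as above):

```python
# Example: at 30 fps, each frame is shown for 1000 / 30 ≈ 33.3 ms.
FPS = 30
frame_interval_ms = 1000 / FPS

# Display times of the first few frames in the sequence
for frame_index in range(5):
    print(f"frame {frame_index} shown at {frame_index * frame_interval_ms:.1f} ms")
```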
The problem: raw video & audio are too big!
Consider a piece of media that is 5 seconds long.
Audio: 6 channels * 48,000 Hz * 2 bytes per sample * 5 seconds ≈ 2.8 MB
Video: 1920 * 1080 pixels * 3 bytes per pixel * 60 fps * 5 seconds ≈ 1.7 GB
So 5 seconds of raw 1080p 60 fps media takes up around 1.7 GB of space. Distributing media this way would be impractical, since it would take huge amounts of bandwidth to transfer. Using uncompressed media is unfeasible, so we compress it using a codec.
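To make the arithmetic above easy to replay, here is a small Python sketch that reproduces both estimates (the channel count, sample rate, resolution, and frame rate are the same example values used above):

```python
# Reproducing the back-of-the-envelope estimates above for 5 seconds of raw media.
# Same example values: 6-channel 48 kHz 16-bit audio, 1080p 60 fps RGB video.

seconds = 5

# Audio: channels * sample rate * bytes per sample * duration
audio_bytes = 6 * 48_000 * 2 * seconds

# Video: width * height * bytes per pixel * frames per second * duration
video_bytes = 1920 * 1080 * 3 * 60 * seconds

print(f"raw audio: {audio_bytes / 2**20:.1f} MiB")   # ~2.7 MiB (the ~2.8 MB above)
print(f"raw video: {video_bytes / 2**30:.1f} GiB")   # ~1.7 GiB (the ~1.7 GB above)
```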
A codec is basically a piece of software that can significantly reduce the size of a media file without much loss in its quality. The question is, how does it do that?
Encoding is the process of taking a raw, uncompressed media file and converting it into a compressed one. Our 1.7 GB video would be compressed to around 2 MB using the H264 codec. Encoding exploits the way our vision works and removes redundant data in the media that our eyes cannot differentiate anyway. If you are curious and want to know more about this, go here.
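To give a feel for what encoding looks like in practice, here is a minimal Python sketch that shells out to the ffmpeg CLI to do the compression; the filenames and the CRF quality setting are placeholder choices, and this assumes ffmpeg is installed on the machine:

```python
import subprocess

# Compress a source file into H264 video + AAC audio inside an mp4 container.
# "input.mov" and "output.mp4" are placeholder filenames.
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mov",   # source media
        "-c:v", "libx264",   # video codec: H264
        "-crf", "23",        # quality target (lower = better quality, larger file)
        "-c:a", "aac",       # audio codec: AAC
        "output.mp4",
    ],
    check=True,
)
```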
Decoding is the reverse process, and it happens when the video is played. So if you compress a video with a certain codec, the media player used to play that video needs to support the codec in order to decompress it.
A container is how audio & video co-exist in a single file. Once we have compressed the individual audio and video streams, we wrap them into a file which we call a “container”. It holds the audio and video streams along with general metadata, such as the title, resolution, multiple audio tracks, subtitle tracks, etc. mp4, webm, and mkv are all different container formats. Each has its own rules for which codecs it can use to store video & audio data. For example, the mp4 container generally stores the audio using the aac codec and the video using the h264 codec.
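To see this for yourself, here is a small Python sketch that asks ffprobe (which ships alongside ffmpeg) to list the streams inside a container; the filename is a placeholder and ffprobe is assumed to be installed:

```python
import json
import subprocess

# List the streams packed inside a container file ("movie.mp4" is a placeholder).
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", "movie.mp4"],
    capture_output=True,
    text=True,
    check=True,
)

for stream in json.loads(result.stdout)["streams"]:
    # Typically prints lines like: "video h264" and "audio aac"
    print(stream["codec_type"], stream["codec_name"])
```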
Here, we will take a look at the operations that are commonly performed during media processing. We will encounter this terminology often, so it helps to know what each term means and why it is needed.
We can refer to FFmpeg's internal architecture to see what phases are involved in making any change to a media file.
When ffmpeg makes any change to a media file, it:
- demuxes the input container into its separate (still compressed) audio and video streams,
- decodes those streams into raw frames and audio samples,
- optionally runs the raw data through filters (scaling, cropping, overlays, etc.),
- encodes the processed data back into compressed streams, and
- muxes those streams into the output container.
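To make those phases concrete, here is a sketch of one ffmpeg invocation (run from Python) that exercises all of them: the input is demuxed and decoded, the video passes through a scale filter, gets re-encoded, and everything is muxed into a new container. The filenames and the 720p target are placeholder choices:

```python
import subprocess

# demux input.mp4 -> decode video -> filter (scale to 720p) -> encode -> mux to output.mkv
subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",
        "-vf", "scale=-2:720",   # filter phase: resize to 720p, keeping the aspect ratio
        "-c:v", "libx264",       # re-encode the filtered frames with H264
        "-c:a", "copy",          # audio skips decode/encode and goes straight to the muxer
        "output.mkv",
    ],
    check=True,
)
```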
Let’s see the terminology used for doing some common operations on media →
- Transcoding: converting a stream from one codec to another (e.g. h264 → vp9).
- Transmuxing: changing the container format without touching the compressed streams (e.g. mkv → mp4).
- Transrating: re-encoding a stream at a different bitrate.
- Transsizing: re-encoding a video at a different resolution.
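Here is a rough sketch of what each operation looks like as an ffmpeg command run from Python; all filenames, the bitrate, and the target resolution are placeholder values:

```python
import subprocess

# Transmuxing: change the container only, no re-encoding (fast and lossless)
subprocess.run(["ffmpeg", "-i", "in.mkv", "-c", "copy", "out.mp4"], check=True)

# Transcoding: re-encode the video with a different codec (here, H264)
subprocess.run(["ffmpeg", "-i", "in.webm", "-c:v", "libx264", "out.mp4"], check=True)

# Transrating: re-encode at a different bitrate (here, ~1 Mbit/s video)
subprocess.run(["ffmpeg", "-i", "in.mp4", "-b:v", "1M", "out.mp4"], check=True)

# Transsizing: re-encode at a different resolution (here, 480p)
subprocess.run(["ffmpeg", "-i", "in.mp4", "-vf", "scale=-2:480", "out.mp4"], check=True)
```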
That’s it for this post. Knowing these basics helped me understand and participate in conversations that involve media processing. I hope it has been helpful to you as well; see you next time!