Automatic Episode Ordering Script


Post by Hittsy »

Are you tired of ripped TV shows having episodes that are out of order? Are you tired of having to view 5 minutes of each episode, compare it to a wiki or streaming service, and then number it manually? Do you have a lot of TV shows to get through that you're putting off lest you stay awake until 3am doing work that a data-entry specialist should do?

I got ya covered!

#!/bin/bash

#this function returns the number of seconds of black frames that exist at the start of a video, to ensure more accurate comparisons.
find_black() {
  too_long=$(ffmpeg -i "$1" -t 00:00:05 -vf blackdetect=d=0.05:pix_th=0.2 -an -f null - 2>&1 | grep "blackdetect") #gives us a ffmpeg output detailing black frames
  regex="black_start:([0-9.]+) black_end:([0-9.]+) black_duration:([0-9.]+)" #regex to ID the timestamps for the black-padding

  # then do the if-spaghetti
  if [[ $too_long =~ $regex ]]; then #compare - this fails if we get nothing from ffmpeg for black-detection

    if (( $(echo "${BASH_REMATCH[1]} < 1" | bc -l))); then #test to see if our first occurrence of black frames happens within the first second.
      echo "${BASH_REMATCH[2]}"
    else
      echo "0" #if not, then it's likely just an early scene transition, and should be discarded
    fi
  else
    echo "0" #we got nothing from ffmpeg, sadge
  fi
}

#finds the SSIM in case that wasn't obvious. returns it as a single value (ex:5.56547)
#arguments are supposed to be: pirate-media, pirate-black-padding, reference-media, reference-black-padding, time (to analyze). in that order
#hwaccel - none is the default, as this is a purely decode operation; cpu decoding is much faster than cuda.
find_SSIM() {
  lengthy=$(ffmpeg -y -hwaccel none -ss "$2" -i "$1"  -hwaccel none -ss "$4" -i "$3"  -t 00:$5:00 -filter_complex "[0:v]pad=aspect=16/9:x=-1:y=-1[a];[1:v]pad=aspect=16/9:x=-1:y=-1[b];[b]split[c][d];[a][c]scale=rw:rh[e];[e][d]ssim" -f null - 2>&1 | grep "SSIM")
  temp="${lengthy##*(}" #drops everything up to the last "(", leaving the overall (All) SSIM in dB plus a trailing ")"
  SSIM="${temp%)}" #strips the trailing ")" so only the number remains
  echo "$SSIM"
}

#WE'RE EXECUTING 'SUCCESSFULLY' TODAY!
echo "heck"

#testing to ensure the arguments are valid directories
if [ -z "$2" ]; then
  echo "|ERROR| Valid usage is './ripped_dir ./pirate_dir' and optionally, time (as an integer)"
  exit 1
fi
if [ ! -d "$1" ]; then
  echo "|ERROR| $1 is not a directory"
  exit 1
fi
if [ ! -d "$2" ]; then
  echo "|ERROR| $2 is not a directory"
  exit 1
fi

#set some defaults
if [ -n "$3" ]; then
  time=$3
else
  time=02
fi

#create the lists
ripped_files=("$1"/*)
pirated_files=("$2"/*)

#present useful info to user
max=$((${#ripped_files[@]}*${#pirated_files[@]}))
echo "checking $time minutes into the file"
echo "total comparisons to be made: $max"
count=0

#main loop
for item1 in "${ripped_files[@]}"; do
  index=0
  max_index=0
  max_value=0 #set values for each unknown video file
  black_len1=$(find_black "$item1") #finds the black frame padding at the start of the video

  for item2 in "${pirated_files[@]}"; do
    black_len2=$(find_black "$item2") #finds the black frame padding at the start of *this* video
    result=$(find_SSIM "$item2" "$black_len2" "$item1" "$black_len1" "$time") #call the function to do the comparison

    ((count++))
    ((index++)) #increment some useful values

    echo "$count: $result"
    if [[ -n "$result" ]] && (( $(echo "$result > $max_value" | bc -l) )); then #skip empty results (ffmpeg gave us no SSIM line), otherwise check if we need to update our running best
      max_value=$result
      max_index=$index #update our totals
    fi
  done
  mv "$item1" "$1/$max_index-$count.mkv" #rename the file according to our best guess

  #generates black-padding offset values for the upcoming proof file.
  black_proof1=$(find_black "$1/$max_index-$count.mkv")
  black_proof2=$(find_black "${pirated_files[(($max_index-1))]}")

  #this generates a proof file, which can be used to check the script's work, this uses the same filtergraph chain that the SSIM comparison uses, so you can see how the SSIM number was generated
  ffmpeg -hide_banner -loglevel error -stats -ss "$black_proof1" -i "$1/$max_index-$count.mkv" -ss "$black_proof2" -i "${pirated_files[(($max_index-1))]}" -filter_complex "[0:v]pad=aspect=16/9:x=-1:y=-1[a];[1:v]pad=aspect=16/9:x=-1:y=-1[b];[a]split[c][d];[b][c]scale=rw:rh[f];[d][f]hstack" -c:v hevc_nvenc -t 00:01:30 "$1/proof_$max_index-$count.mkv"

done
The above bash script generates an SSIM number comparing each ripped TV show episode against video files that you might acquire through other means, and then renames each ripped file with the index of its closest match among those video files (as [index]-[comparison count].mkv).
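
For anyone curious how the SSIM number gets pulled out of ffmpeg's log, here is a minimal sketch of the same trimming that find_SSIM does, run on a made-up log line so it works without ffmpeg. The function name extract_ssim_db and the sample numbers are my own for illustration; the line format follows what the ssim filter prints at the end of a run, where the last parenthesised value is the overall (All) score in dB and higher means a closer match.

```shell
# Hypothetical helper: same parameter expansions as find_SSIM, applied to one log line.
extract_ssim_db() {
  local line="$1"
  local temp="${line##*(}"      # remove everything up to and including the last "("
  printf '%s\n' "${temp%)}"     # remove the trailing ")"
}

# Illustrative sample line (values invented, format as printed by ffmpeg's ssim filter).
sample='[Parsed_ssim_0 @ 0x5581] SSIM Y:0.970257 (15.266348) U:0.983721 (17.883741) V:0.984268 (18.032146) All:0.975005 (16.021658)'

extract_ssim_db "$sample"       # prints 16.021658
```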

From there, you can use a bog-standard renaming tool like krename to make your TV show rips match something like Plex's or Jellyfin's naming conventions. (For example, in krename you can use Star Trek Deep Space Nine (1993) - S01E##.)

F.A.Q.

Q: How do I use it?
A: Simply save the code as something.sh, make it executable (chmod +x something.sh), and then run ./something.sh [ripped directory] [arbitrary video directory] [time]. "Time" is how many minutes of each episode, counted from the start, the script analyzes to generate its SSIM number. I generally get good enough results with 2 and have that set as the default - you can increase it as much as you like for more accurate results, at the expense of runtime.
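
The script drops the time argument straight into 00:$5:00, so anything that isn't a small whole number ends up as an odd-looking duration (90 would become 00:90:00, which ffmpeg may not accept). A rough sketch of one way to validate it and build an HH:MM:SS string first is below; minutes_to_duration is a hypothetical helper, not something in the script above.

```shell
# Hypothetical helper: turn a whole number of minutes into HH:MM:00 for ffmpeg's -t option.
minutes_to_duration() {
  local m="$1"
  case "$m" in
    ''|*[!0-9]*)
      echo "|ERROR| time must be a whole number of minutes" >&2
      return 1
      ;;
  esac
  m=$((10#$m))   # force base 10 so inputs like "08" are not read as octal
  printf '%02d:%02d:00\n' $((m / 60)) $((m % 60))
}

minutes_to_duration 2    # prints 00:02:00
minutes_to_duration 90   # prints 01:30:00
```

With something like this in place, the -t 00:$5:00 part of find_SSIM could take the converted string instead.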

Q: What do you mean "video files that you might acquire through other means"?
A: I do not condone piracy, because I wish to ensure that studios get paid for the work they put into creating a piece of media, and because they also created a disk for me to rip. I also do not recommend piracy for the simple fact that most pirates will re-encode and compress the snot out of their TV shows to ensure they can be torrented easily, making the video look like garbage. They do, however, always put in the work to ensure their episodes are numbered/ordered correctly (unrelated: make sure to seed your Linux ISOs).

Q: If you do not condone piracy, why is it referenced so much (variables, error messages, etc.) in your script?
A: I was listening to the PotC OST whilst writing this. There is absolutely no other connection, in any way.

Q: What do I need to run this script?
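A: You need a Linux machine (or any machine that can run a bash script), ffmpeg, and bc (basic calculator - a very common Linux util). As written, the proof-file step encodes with hevc_nvenc, which also needs an NVIDIA GPU with NVENC support; swap in a software encoder such as libx264 if you don't have one.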
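
If you want the script to complain early instead of failing halfway through a run, a small check along these lines could go near the top. This is only a sketch; missing_tools is a hypothetical helper name, and it just relies on the shell's command -v lookup.

```shell
# Hypothetical helper: print the name of every listed command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
  return 0
}

# Example: warn if ffmpeg or bc cannot be found.
missing=$(missing_tools ffmpeg bc)
if [ -n "$missing" ]; then
  echo "|ERROR| missing required tools:" $missing >&2
fi
```

In the real script you would probably want an exit 1 inside that if block, same as the other argument checks.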

Q: Why is it so slow?
A: There is a lot of input sanitization that needs to happen for this script to generate reliable SSIM numbers. "Arbitrary videos" might crop out letterboxing/columnboxing, but blu-ray as a standard requires padding to ensure the video is always in a 16:9 aspect ratio (I also pad the ripped media, in case we're dealing with DVD rips rather than add additional logic to scale the "arbitrary video" accordingly). This means I have to add a padding step to my ffmpeg filtergraphs, which results in your system having to copy a lot of video around in RAM. On my system, my CPU hits about 60% utilization whilst running this, simply because of RAM latency. There is also additional work needed in case the video needs to be scaled - for example if your rip is in 4K and the "arbitrary video" is in 1080p.

Q: Why is it so damn slow?
A: It also has to make n^2 comparisons, where n is the number of episodes you're comparing. I could add some advanced logic to take an "arbitrary video" off the list once it has been matched with a rip, but if the script matches a ripped episode incorrectly, that would take a falsely matched "arbitrary video" off of the list, resulting in another ripped episode getting falsely matched, and another, and another...
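
A cheaper alternative to taking matched videos off the list is to leave the matching alone and just flag collisions afterwards, since two rips landing on the same index means at least one of them is wrong. Below is a rough sketch, assuming you log one "index ripped-file" pair per line while the script runs (the script above does not do this; report_collisions and the file names are made up for illustration). It needs bash 4 or newer for associative arrays.

```shell
# Hypothetical helper: read "index name" pairs on stdin and report any index
# that more than one ripped file was matched to.
report_collisions() {
  local -A count=() names=()
  local idx name
  while read -r idx name; do
    [[ -n $idx ]] || continue
    count[$idx]=$(( ${count[$idx]:-0} + 1 ))
    names[$idx]+="${names[$idx]:+ | }$name"
  done
  for idx in "${!count[@]}"; do
    if (( ${count[$idx]} > 1 )); then
      echo "reference $idx claimed by ${count[$idx]} rips: ${names[$idx]}"
    fi
  done
}

# Illustrative input: two rips were both matched to reference index 2.
printf '%s\n' '1 title_t00.mkv' '2 title_t01.mkv' '2 title_t02.mkv' | report_collisions
# prints: reference 2 claimed by 2 rips: title_t01.mkv | title_t02.mkv
```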

Q: How can I speed it up?
A: Outside of buying faster hardware, there are 2 ways. First, play with the time argument listed above. If your TV show has cold opens, but no recaps or lengthy title sequences, you might be able to drop it down to 1-2 minutes. But if your media does have recaps, or something scuffed about your "arbitrary videos" is getting past the input sanitization, you might be forced to bump up the analysis time. Second, whilst TV shows aren't always titled in order, we can guarantee that they're printed in order on the disk, so you could analyze one disk at a time, at the expense of taking more time to rename them afterwards (for krename, name (year) - S0XE##{Y;1} is helpful, where X is the season and Y is where you want it to start counting up from).
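
One more small saving: the main loop calls find_black on every "arbitrary video" again for each ripped file, so each one gets probed n times. The sketch below shows one way to probe each file once and look the value up afterwards. precompute_black, black_cache, and fake_probe are hypothetical names; fake_probe only stands in for find_black so the example runs without ffmpeg. It needs bash 4 or newer for associative arrays.

```shell
declare -A black_cache

# precompute_black PROBE FILE...
# Runs PROBE once per FILE and stores what it prints in black_cache.
precompute_black() {
  local probe="$1" f
  shift
  for f in "$@"; do
    black_cache["$f"]=$("$probe" "$f")
  done
}

# Stand-in for find_black with invented return values.
fake_probe() {
  case "$1" in
    *e01.mkv) echo "0.5" ;;
    *)        echo "0" ;;
  esac
}

precompute_black fake_probe "ref/e01.mkv" "ref/e02.mkv"
key="ref/e01.mkv"
echo "${black_cache[$key]}"   # prints 0.5
```

In the actual script that would be precompute_black find_black "${pirated_files[@]}" before the main loop, with black_len2 read from black_cache for the current item2 inside it.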

Q: Why did you make this?
A: I have literally stayed up til 5am some nights matching episodes of Babylon 5 or Breaking Bad. When a very dear friend of mine asked to watch Fringe (2008) on Plex with me, the community wiki wasn't detailed enough for me to parse the names, and the episodes didn't have their name/title anywhere in the actual video.

Q: What is the proof file for?
A: To check its work! The proof file shows the first 90 seconds of the reference and "arbitrary video" side-by-side, with the same padding, scaling, and synchronization applied, so you can easily check if the script has made a correct match, or perhaps see what went wrong.
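
The proof step in the script hard-codes -c:v hevc_nvenc. If you'd rather have it fall back to a software encoder, something like this sketch could pick one from the encoder list. pick_proof_encoder is a hypothetical helper, the sample list is only illustrative, and an encoder being listed means your ffmpeg build includes it, not that a usable GPU is actually present.

```shell
# Hypothetical helper: $1 is text in the format printed by "ffmpeg -hide_banner -encoders".
# Prints hevc_nvenc if it is listed, otherwise falls back to libx264.
pick_proof_encoder() {
  if grep -qw 'hevc_nvenc' <<<"$1"; then
    echo "hevc_nvenc"
  else
    echo "libx264"
  fi
}

# Illustrative encoder list (flag columns vary between ffmpeg versions).
sample_list=' V....D libx264              libx264 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (codec h264)
 V....D hevc_nvenc           NVIDIA NVENC hevc encoder (codec hevc)'

pick_proof_encoder "$sample_list"   # prints hevc_nvenc

# Against a real install it would be called like this:
# encoder=$(pick_proof_encoder "$(ffmpeg -hide_banner -encoders 2>/dev/null)")
```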

Q: What should I do if I get a bad match?
A: Talk about it here? I don't guarantee it'll be flawless, and I make no promises to fix all or any edge cases. I hope, however, that the failures are few, and a bad match should be easy to fix by renaming manually.

If you have any suggestions for improvement, please let me know. I can't guarantee I'll look at everything, but since this is a script I use personally, I am interested in any way it might be improved.