I got ya covered!
Code:
#!/bin/bash
#this function returns the number of seconds of black frames that exist at the start of a video, to ensure more accurate comparisons.
find_black() {
    local too_long regex
    too_long=$(ffmpeg -i "$1" -t 00:00:05 -vf blackdetect=d=0.05:pix_th=0.2 -an -f null - 2>&1 | grep "blackdetect") #gives us a ffmpeg output detailing black frames
    regex="black_start:([0-9.]+) black_end:([0-9.]+) black_duration:([0-9.]+)" #regex to ID the timestamps for the black-padding
    # then do the if-spaghetti
    if [[ $too_long =~ $regex ]]; then #compare - this fails if we get nothing from ffmpeg for black-detection
        if (( $(echo "${BASH_REMATCH[1]} < 1" | bc -l) )); then #test to see if our first occurrence of black-frames happens within the first second.
            echo "${BASH_REMATCH[2]}"
        else
            echo "0" #if not, then it's likely just an early scene transition, and should be discarded
        fi
    else
        echo "0" #we got nothing from ffmpeg, sadge
    fi
}
#finds the SSIM in case that wasn't obvious. returns it as a single value (ex:5.56547)
#arguments are supposed to be: pirate-media, pirate-black-padding, reference-media, reference-black-padding, time (to analyze). in that order
#hwaccel - none is the default, as this is a purely decode operation; cpu decoding is much faster than cuda.
find_SSIM() {
    local lengthy temp ssim
    lengthy=$(ffmpeg -y -hwaccel none -ss "$2" -i "$1" -hwaccel none -ss "$4" -i "$3" -t 00:$5:00 -filter_complex "[0:v]pad=aspect=16/9:x=-1:y=-1[a];[1:v]pad=aspect=16/9:x=-1:y=-1[b];[b]split[c][d];[a][c]scale=rw:rh[e];[e][d]ssim" -f null - 2>&1 | grep "SSIM")
    temp="${lengthy##*(}" #trims everything up to and including the last opening bracket
    ssim="${temp%)}" #drops the closing bracket, leaving the last bracketed number on the SSIM line
    echo "$ssim"
}
#WE'RE EXECUTING 'SUCCESSFULLY' TODAY!
echo "heck"
#testing to ensure the arguments are valid directories
if [ -z "$2" ]; then
    echo "|ERROR| Valid usage is './ripped_dir ./pirate_dir' and optionally, time (as an integer)"
    exit 1
fi
if [ ! -d "$1" ]; then
    echo "|ERROR| $1 is not a directory"
    exit 1
fi
if [ ! -d "$2" ]; then
    echo "|ERROR| $2 is not a directory"
    exit 1
fi
#set some defaults
if [ -n "$3" ]; then
    time=$3
else
    time=02
fi
#create the lists
ripped_files=("$1"/*)
pirated_files=("$2"/*)
#present useful info to user
max=$((${#ripped_files[@]}*${#pirated_files[@]}))
echo "checking $time minutes into the file"
echo "total comparisons to be made: $max"
count=0
#main loop
for item1 in "${ripped_files[@]}"; do
    index=0
    max_index=0
    max_value=0 #set values for each unknown video file
    black_len1=$(find_black "$item1") #finds the black frame padding at the start of the video
    for item2 in "${pirated_files[@]}"; do
        black_len2=$(find_black "$item2") #finds the black frame padding at the start of *this* video
        result=$(find_SSIM "$item2" "$black_len2" "$item1" "$black_len1" "$time") #call the function to do the comparison
        ((count++))
        ((index++)) #increment some useful values
        echo "$count: $result"
        if [[ -n $result ]] && (( $(echo "$result > $max_value" | bc -l) )); then #skip empty results (ffmpeg gave us nothing), otherwise check if we need to update our running best
            max_value=$result
            max_index=$index #update our best match so far
        fi
    done
    mv "$item1" "$1/$max_index-$count.mkv" #rename the file according to our best guess
    #generates black-padding offset values for the upcoming proof file.
    black_proof1=$(find_black "$1/$max_index-$count.mkv")
    black_proof2=$(find_black "${pirated_files[max_index-1]}")
    #this generates a proof file, which can be used to check the script's work, this uses the same filtergraph chain that the SSIM comparison uses, so you can see how the SSIM number was generated
    #hevc_nvenc needs an NVIDIA GPU; swap in another encoder (e.g. libx265) if you don't have one
    ffmpeg -hide_banner -loglevel error -stats -ss "$black_proof1" -i "$1/$max_index-$count.mkv" -ss "$black_proof2" -i "${pirated_files[max_index-1]}" -filter_complex "[0:v]pad=aspect=16/9:x=-1:y=-1[a];[1:v]pad=aspect=16/9:x=-1:y=-1[b];[a]split[c][d];[b][c]scale=rw:rh[f];[d][f]hstack" -c:v hevc_nvenc -t 00:01:30 "$1/proof_$max_index-$count.mkv"
done
From there, you can use a bog-standard renaming tool like krename to make your TV show rips match something like Plex's or Jellyfin's naming conventions. (For example, in krename you can use the template Star Trek Deep Space Nine (1993) - S01E##.)
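If you would rather stay in the shell than open krename, the same rename can be sketched in bash. This is only an illustration: plex_name is a made-up helper, and it assumes the number before the dash in each renamed file (the script's max_index) really is the episode number, which only holds if the second directory lists its episodes in order.

```bash
#!/bin/bash
# plex_name is a hypothetical helper, not part of the script above.
# Usage: plex_name <file named INDEX-COUNT.mkv> <show name> [season number]
plex_name() {
    local base episode season=${3:-1}
    base=$(basename "$1")
    episode=${base%%-*}              # text before the first dash, e.g. "3" from "3-27.mkv"
    case $episode in
        ''|*[!0-9]*) return 1 ;;     # not a plain number, so not one of the renamed rips
    esac
    # 10# forces base 10, so an index like "08" is not read as octal
    printf '%s - S%02dE%02d.mkv\n' "$2" "$((10#$season))" "$((10#$episode))"
}

plex_name "3-27.mkv" "Star Trek Deep Space Nine (1993)" 1
```

The proof files start with "proof_", so plex_name returns a failure status for them and a rename loop built on it can simply skip anything it rejects.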
F.A.Q.
Q: How do I use it?
A: Simply save the code as something.sh, and then run bash something.sh [ripped directory] [arbitrary video directory] [time]. "Time" is how many minutes from the start of each episode the script should analyze to generate its SSIM number. I generally get good enough results with 2, which is the default. You can increase it as much as you like for more accurate results, at the expense of runtime.
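For reference, the time argument is dropped straight into ffmpeg's -t option as 00:$time:00. A slightly more defensive way to build that duration is sketched below. minutes_to_duration is a hypothetical helper, not something in the script; it rejects non-numeric input and rolls anything past 59 minutes into the hours field, since how ffmpeg treats an oversized minutes field is not something verified here.

```bash
#!/bin/bash
# minutes_to_duration is a hypothetical helper, not part of the script above.
# Turns a whole number of minutes into HH:MM:SS for ffmpeg's -t option.
minutes_to_duration() {
    local minutes=$1
    case $minutes in
        ''|*[!0-9]*) return 1 ;;   # empty or not a whole number
    esac
    minutes=$((10#$minutes))       # force base 10 so "08" is not read as octal
    printf '%02d:%02d:00\n' "$((minutes / 60))" "$((minutes % 60))"
}

minutes_to_duration 02    # the script's default, prints 00:02:00
minutes_to_duration 90    # prints 01:30:00
```

In find_SSIM this would replace the literal 00:$5:00 with "$(minutes_to_duration "$5")".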
Q: What do you mean "video files that you might acquire through other means"?
A: I do not condone piracy, because I wish to ensure that studios get paid for the work they put into creating a piece of media, and because they also created a disk for me to rip. I also do not recommend piracy for the simple fact that most pirates will re-encode and compress the snot out of their TV shows to ensure they can be torrented easily, making the video look like garbage. They do, however, always put in the work to ensure their episodes are numbered/ordered correctly (unrelated: make sure to seed your Linux ISOs).
Q: If you do not condone piracy, why is it referenced so much (variables, error messages, etc.) in your script?
A: I was listening to the PotC OST whilst writing this. There is absolutely no other connection, in any way.
Q: What do I need to run this script?
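A: You need a Linux machine (or any machine that can run a bash script), ffmpeg, and bc (basic calculator, a very common Linux utility). The proof files are encoded with hevc_nvenc, so as written you also need an NVIDIA GPU, or you can swap in a different encoder.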
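A quick way to confirm the command-line tools are installed before kicking off a long run is sketched below. missing_tools is a hypothetical helper, not part of the script; it relies only on the shell's standard command -v lookup.

```bash
#!/bin/bash
# missing_tools is a hypothetical helper, not part of the script above.
# Prints the name of every command passed to it that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools ffmpeg bc)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "|ERROR| missing required tools:" $missing >&2
fi
```

Dropped in near the top of the script, the if block would also want an exit 1 so the run stops before any files get renamed.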
Q: Why is it so slow?
A: There is a lot of input sanitization that needs to happen for this script to generate reliable SSIM numbers. "Arbitrary videos" might crop out letterboxing/pillarboxing, but Blu-ray as a standard requires padding to ensure the video is always in a 16:9 aspect ratio. (I also pad the ripped media in case we're dealing with DVD rips, rather than adding extra logic to scale the "arbitrary video" accordingly.) This means I have to add a padding step to my ffmpeg filtergraphs, which results in your system having to copy a lot of video around in RAM. On my system, my CPU hits about 60% utilization whilst running this, simply because of RAM latency. There is also additional work needed in case the video needs to be scaled, for example if your rip is in 4K and the "arbitrary video" is in 1080p.
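Read stage by stage, the filtergraph in find_SSIM appears to do the following. The sketch rebuilds the exact same string from labelled pieces so each step can carry a comment; ssim_filtergraph is a hypothetical helper, and the per-stage descriptions are a reading of the ffmpeg filter options rather than anything stated in the script.

```bash
#!/bin/bash
# ssim_filtergraph is a hypothetical helper that rebuilds the script's SSIM filtergraph
# one stage at a time. Input 0 is the "arbitrary video", input 1 is the rip.
ssim_filtergraph() {
    local pad="pad=aspect=16/9:x=-1:y=-1"   # pad out to 16:9; negative x/y centre the picture
    local stages=(
        "[0:v]${pad}[a]"        # pad the arbitrary video
        "[1:v]${pad}[b]"        # pad the rip
        "[b]split[c][d]"        # the rip is needed twice: as size reference and for ssim
        "[a][c]scale=rw:rh[e]"  # resize the arbitrary video to the rip's width and height
        "[e][d]ssim"            # compare the two, which are now the same size
    )
    local IFS=';'
    echo "${stages[*]}"         # join the stages with semicolons
}

ssim_filtergraph
```

Two pad passes and a scale on every frame of both inputs is where the extra memory traffic described above comes from.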
Q: Why is it so damn slow?
A: It also has to make n^2 comparisons, where n is the number of episodes you're comparing. I could add some advanced logic to take an "arbitrary video" off the list once it has been matched with a rip, but if the script matches a ripped episode incorrectly, that would take a falsely matched "arbitrary video" off of the list, resulting in another ripped episode getting falsely matched, and another, and another...
Q: How can I speed it up?
A: Outside of buying faster hardware, there are 2 ways. First, play with the time argument described above. If your TV show has cold opens, but no recaps or lengthy title sequences, you might be able to drop it down to 1-2 minutes. But if your media does have recaps, or something scuffed about your "arbitrary videos" is getting past the input sanitization, you might be forced to bump up the analysis time. Second, whilst TV shows aren't always titled in order, we can guarantee that they're printed in order on the disk, so you could analyze one disk at a time, at the expense of taking more time to rename them afterwards. (In krename, a template like name (year) - S0XE##{Y;1} is helpful, where X is the season and Y is the number you want it to start counting up from.)
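One further, smaller saving that the script does not currently make: find_black is re-run on every "arbitrary video" for every rip, even though its answer never changes. A sketch of measuring each one once up front follows. The find_black here is a stand-in that just logs its calls so the example runs without ffmpeg, and the file names are made up.

```bash
#!/bin/bash
# Stand-in for the script's find_black so the sketch runs without ffmpeg.
# It records every call in a log file so the number of calls can be counted.
call_log=$(mktemp)
find_black() {
    echo "$1" >> "$call_log"
    echo "0.5"
}

# Made-up file lists, in place of ("$1"/*) and ("$2"/*)
ripped_files=(rip_a.mkv rip_b.mkv rip_c.mkv)
pirated_files=(ep1.mkv ep2.mkv ep3.mkv)

# Measure each arbitrary video once, before the main loop (declare -A needs bash 4+)
declare -A black_cache
for item2 in "${pirated_files[@]}"; do
    black_cache[$item2]=$(find_black "$item2")
done

for item1 in "${ripped_files[@]}"; do
    black_len1=$(find_black "$item1")
    for item2 in "${pirated_files[@]}"; do
        black_len2=${black_cache[$item2]}   # lookup instead of another ffmpeg run
        # ...the find_SSIM call would go here...
    done
done

total_calls=$(( $(wc -l < "$call_log") ))
rm -f "$call_log"
echo "find_black calls: $total_calls"
```

In this toy run find_black is called 6 times rather than the 12 the nested version would make (3 rips, plus 3 x 3 lookups). It only trims the blackdetect passes, each of which looks at no more than five seconds of video, so the SSIM comparisons will still dominate the runtime.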
Q: Why did you make this?
A: I have literally stayed up until 5am some nights matching episodes of Babylon 5 or Breaking Bad. When a very dear friend of mine asked to watch Fringe (2008) on Plex with me, the community wiki wasn't detailed enough for me to parse the names, and the episodes didn't have their name/title anywhere in the actual video.
Q: What is the proof file for?
A: To check its work! The proof file shows the first 90 seconds of the reference and "arbitrary video" side-by-side, with the same padding, scaling, and synchronization applied, so you can easily check if the script has made a correct match, or see what might have gone wrong.
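The synchronization comes from the black-frame offset that find_black reports for each file. The sketch below mirrors that decision logic on a log line passed in as text, so it can be tried without ffmpeg. parse_black_offset is a hypothetical helper, the sample lines are invented in the shape blackdetect prints, and awk stands in for bc to do the comparison.

```bash
#!/bin/bash
# parse_black_offset is a hypothetical helper that mirrors the decision logic in find_black,
# but takes a log line as its argument instead of running ffmpeg.
parse_black_offset() {
    local regex="black_start:([0-9.]+) black_end:([0-9.]+) black_duration:([0-9.]+)"
    if [[ $1 =~ $regex ]]; then
        # awk does the floating point comparison here (the script uses bc for the same job)
        if awk -v start="${BASH_REMATCH[1]}" 'BEGIN { exit !(start < 1) }'; then
            echo "${BASH_REMATCH[2]}"   # black starts within the first second: skip to where it ends
            return
        fi
    fi
    echo "0"                            # no black found, or it starts too late to be padding
}

# Made-up log lines; the numbers are invented.
parse_black_offset "[blackdetect @ 0x5581] black_start:0 black_end:1.04271 black_duration:1.04271"
parse_black_offset "[blackdetect @ 0x5581] black_start:3.2 black_end:3.8 black_duration:0.6"
parse_black_offset "no black frames here"
```

The first line yields 1.04271, which is what gets passed to -ss so both videos start on their first real frame; the other two yield 0. If a proof file looks out of step from the very first frame, these offsets are the first thing worth checking.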
Q: What should I do if I get a bad match?
A: Talk about it here? I don't guarantee it'll be flawless, and I make no promises to fix all or any edge cases. I hope failures are few, though, and a bad match should be easy to rename manually.
If you have any suggestions for improvement, please let me know. I can't guarantee I'll look at everything, but since this is a script I use personally, I am interested in any way it might be improved.