Nvidia unveils Super SloMo deep learning-powered motion interpolation

I'm sure that, like me, you probably smirk or roll your eyes when you see a TV investigator confidently command a subordinate to “enhance” a low-resolution image. Well, we live in the future, folks. Nvidia's just demonstrated a similar sort of technology for slowing down standard-speed video. Check out this video of what the company calls Super SloMo.

Nvidia says that the system runs on Tesla V100 GPUs and uses the PyTorch deep-learning framework. Apparently, the team that created this technology trained its system on over 11,000 videos shot at 240 FPS. Once it was trained, the neural network was able to take a regular video and create completely realistic-looking intermediate frames to produce a higher-frame-rate version.

That higher-frame-rate video can then be played back at the original speed to produce a slow-motion effect, even when the video was originally recorded using a low frame rate. Alternatively, you could watch the high-speed video in its new frame rate—assuming you have the display to reproduce it, of course.
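
To make the idea concrete, here's a rough sketch of the playback arithmetic involved. The `interpolate_frame` function below is purely a hypothetical stand-in for Nvidia's neural network, which does the actual heavy lifting:

```python
# A minimal sketch of the slow-motion bookkeeping only. `interpolate_frame`
# is a hypothetical placeholder for the trained network described above.
def slow_motion(frames, factor, interpolate_frame):
    """Insert (factor - 1) synthesized frames between each original pair.

    Played back at the source frame rate, the result runs `factor` times
    slower; played back at (source rate * factor), it runs at normal speed
    but far more smoothly.
    """
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for i in range(1, factor):
            t = i / factor  # fractional position between a (t=0) and b (t=1)
            out.append(interpolate_frame(a, b, t))
    out.append(frames[-1])
    return out
```

A 30 FPS clip run through this with a factor of eight, for example, yields 240 FPS worth of frames that play back at one-eighth speed on an ordinary display.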

The effect, at least as demonstrated in the video above, is incredibly realistic. It's easy to envision folks making use of this technology on a future GeForce product. If we step a bit into the realm of fantasy, it's also easy to imagine this technology—perhaps along with a fixed-function accelerator—being used to improve the smoothness of movies, TV shows, or games.

If you're a developer looking to learn exactly how Super SloMo works, you'll have to wait until Thursday. Nvidia's researchers will be talking about the technique in a presentation at the 2018 Computer Vision and Pattern Recognition conference in Salt Lake City, UT.

Comments closed
    • psuedonymous
    • 1 year ago

    I’m seeing a lot of “it’s just optical flow!” and “just use SVP!” alongside this. If you [url=https://arxiv.org/abs/1712.00080]go read the paper[/url] you'll notice that the point of this is to be able to [i]arbitrarily[/i] generate intermediate frames at any point in time between two given frames, rather than the current requirement to generate a single new frame precisely between two existing frames.

      • Chrispy_
      • 1 year ago

      SVP does more than you give it credit for, and already [i]arbitrarily[/i] generates intermediate frames at any point in time between two given frames. How else do you think it interpolates 24 fps to 60 fps, which is a common use for it? Trust me, video interpolation (at least to the same "eww, I can see interpolation errors all over the place" standard) has been at this level for a decade or more. You probably weren't aware of it because the first thing most people do the minute they notice it enabled on a modern TV is immediately disable it.

      • dragontamer5788
      • 1 year ago

      [quote]I'm seeing a lot of "it's just optical flow!" and "just use SVP!" alongside this.[/quote]

      I just read the abstract, and it confirms my suspicions.

      [quote]We start by computing bi-directional optical flow between the input images using a U-Net architecture[/quote]

      Uh... yeah. This video demonstrates what optical-flow retiming looks like, complete with the errors of optical flow, like difficulty with background objects or "crossing" (Chrispy_'s criticism of the ice skate: any optical-flow methodology will have difficulty when the ice skate crosses the hockey player's leg).

      [quote]rather than the current requirement to generate a single new frame precisely between two existing frames.[/quote]

      Did you even see what optical-flow retiming looks like? I included examples in my post: [url]https://www.youtube.com/watch?v=M_LE96nGqik[/url]. You can use optical-flow retiming to smoothly generate an arbitrary number of frames between arbitrary numbers of existing frames. In fact, that's probably why the paper uses optical flow as a basis and builds on top of it.
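
      For anyone curious what "scale the flows and warp" actually looks like, here's a stripped-down sketch of arbitrary-time interpolation from bidirectional flow (NumPy + OpenCV). It assumes roughly constant motion between the two inputs and skips the flow refinement and visibility maps the paper adds on top:

```python
import numpy as np
import cv2

def backward_warp(img, flow):
    """Sample `img` at each pixel's location displaced by `flow` (bilinear)."""
    h, w = flow.shape[:2]
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)

def interpolate_at(frame0, frame1, flow_01, flow_10, t):
    """Synthesize a frame at arbitrary t in (0, 1) from bidirectional flow."""
    # Linear approximations of the flow from the unknown frame at time t
    # back to each input, assuming constant velocity between the two frames.
    flow_t0 = -t * flow_01
    flow_t1 = -(1.0 - t) * flow_10
    w0 = backward_warp(frame0, flow_t0).astype(np.float32)
    w1 = backward_warp(frame1, flow_t1).astype(np.float32)
    # Blend, weighting the temporally closer frame more heavily.
    return ((1.0 - t) * w0 + t * w1).astype(frame0.dtype)
```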

    • elites2012
    • 1 year ago

    Waste of resources. Focus on video cards and CUDA workstations. I'm sure Disney could use a few, instead of those power-hungry Macs.

    • Chrispy_
    • 1 year ago

    Nope, sorry – but it’s still wrong.

    In the ice hockey scene, look at GATT’s hand as he grabs the goal and his left ice skate.

    At the original framerate your mind fills in the gaps and it all looks right. With the interpolation, there’s something SERIOUSLY messed up with most of these examples. Either your brain detects it as utterly wrong and you go WHOAH NELLY, THAT’S FUBAR, or the interpolation blurs the edges that you’re using for motion definition and it’s a hindrance rather than a help.

    For the NELLY-FUBAR, look at the GATT ice hockey example I listed.
    For the OMG BLUR, look at the twirling dancer’s dress (and right knee on the first spin).

    If you can’t see what I mean, watch it at 1080p60. The interpolation errors are really quite [i]disturbing[/i]!

    Televisions have been doing this (badly) for a decade at least, and Nvidia's attempts here make all the same mistakes. Show me something that uses intelligence to make sure that humans only have one hand on each arm, one blade on each ice skate, and kneecaps that aren't made of swarming rats under the skin.

      • Spunjji
      • 1 year ago

      The slow-mo-guys videos are the most revealing because all the filter really does for you there is make the image noticeably more blurry whilst slowing the footage to the point of tedium. Watch the surface of the water on the balloon one.

      Funnily enough I was most impressed by the ice hockey clip at first, then I read your comment and went back… those skates are fuuuucked. It’s like they suddenly turned into tiny black holes that are gravitationally lensing everything behind them.

        • Chrispy_
        • 1 year ago

        I’m also wholly unimpressed by the netting on the goal moving with the limbs of the players passing in front of it.

        How dumb is this “deep learning AI” if it can’t distinguish between a moving human and a static net? The two objects are very, very different in every way, yet it’s clearly confused about which is which.

          • dragontamer5788
          • 1 year ago

          It appears to be just a refinement of optical flow techniques (which also can’t tell the difference between background and foreground objects very well).

          The very point of applying deep learning to this problem is to figure out foreground vs. background (a truly difficult problem in the space of image analysis). If you apply all this computational power but STILL can’t figure it out, then it’s basically a failure IMO.

          There’s a reason why 300 was shot in front of a green screen: so that the green screen could be used to automatically tell the difference between background and foreground. The optical-flow slowdown effect would then be applied individually to the actors (with a separate algorithm handling the reconstruction of the digital background).

    • dragontamer5788
    • 1 year ago

    Okay, so… deep learning is cool and all. But… this isn’t a good application of it.

    Optical-flow-based time remapping has been around for over a decade (although it’s only recently made its way into cheaper, sub-$1,000 video editing packages), and can actually be done relatively simply on modern processors.

    [url]https://www.youtube.com/watch?v=3YE5tff8pqg[/url]

    Of course, a dedicated package like Twixtor (an optical-flow retiming plugin for ~$300ish) achieves better results: [url]https://www.youtube.com/watch?v=M_LE96nGqik[/url]

    Like, seriously. There's a ton of competing software out there: [url]https://borisfx.com/effects/continuum-optical-flow/[/url]

    This is basically a standard technique for a video editor. I'm just a hobbyist, and the tools are finally in my price range. But you can definitely see this sort of effect in movies from years ago.
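
    If you want a sense of how little code the basic, non-neural version takes, here's a bare-bones sketch using OpenCV's Farneback dense flow (the parameters are just the common tutorial values). Real plugins layer occlusion handling and edge cleanup on top of this, which is where most of the quality comes from:

```python
import cv2
import numpy as np

def retime_pair(frame0, frame1, n_new):
    """Generate n_new in-between frames with classic dense optical flow
    plus backward warping -- no occlusion handling, no edge cleanup."""
    gray0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray0, gray1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray0.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    frames = []
    for i in range(1, n_new + 1):
        t = i / (n_new + 1)
        # Pull pixels from frame0 along a fraction of the flow field.
        map_x = (gx - t * flow[..., 0]).astype(np.float32)
        map_y = (gy - t * flow[..., 1]).astype(np.float32)
        frames.append(cv2.remap(frame0, map_x, map_y, cv2.INTER_LINEAR))
    return frames
```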

    • GrimDanfango
    • 1 year ago

    Sooner or later, machine learning will be able to create video with *no* input frames…
    Next it’ll be creating video with no human input…
    Then it’ll be steering the fate of humanity with its own news channel…

    Let’s hope its goals are benevolent 😛

      • meerkt
      • 1 year ago

      [url]https://arstechnica.com/gaming/2018/06/this-wild-ai-generated-film-is-the-next-step-in-whole-movie-puppetry/[/url]

      Though the older film, with its more limited ML scope, is better.

    • blastdoor
    • 1 year ago

    I’ll keep smirking at those “enhance” scenes, because that’s still impossible. This isn’t going to make a previously illegible license plate legible, for example. Instead, it would be analogous to replacing an illegible image of a license plate with a combination of letters and numbers that the algorithm deems most common, given all the license plates it’s seen in the past. That’s fine if, for entertainment purposes, you just want something that looks more like a license plate. But if you actually want to know a specific license number, you’re out of luck.

      • meerkt
      • 1 year ago

      At least with video input it is possible to combine data from multiple frames.

      I don’t know what happened in this area in recent years, but here’s something from 11 years ago. See the photo on page 7:
      [url]https://pdfs.semanticscholar.org/cdee/210eb723e06c0ec791dab0929d9bbca4c505.pdf[/url]
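
      As a toy illustration of the multi-frame idea (not the method in that paper), here's a naive shift-and-add super-resolution sketch. It assumes grayscale frames and already-known sub-pixel shifts; a real pipeline would estimate those by registering the frames first:

```python
import numpy as np
from scipy import ndimage

def shift_and_add(frames, shifts, scale=2):
    """Naive multi-frame super-resolution: upsample each low-res frame,
    undo its sub-pixel shift, and average.

    `frames` are 2D grayscale arrays; `shifts` are (dy, dx) offsets of each
    frame relative to the first, in low-res pixels.
    """
    acc = None
    for frame, (dy, dx) in zip(frames, shifts):
        up = ndimage.zoom(frame.astype(np.float64), scale, order=1)
        aligned = ndimage.shift(up, (-dy * scale, -dx * scale), order=1)
        acc = aligned if acc is None else acc + aligned
    return acc / len(frames)
```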

      • sluggo
      • 1 year ago

      I’m not sure I agree. Yes, like most folks, I got a good laugh out of that scene in Blade Runner where Deckard is manipulating that photograph. But manipulating the bits that make up the blurred image of a license plate [i]when the manipulating entity knows it's looking at a license plate[/i] is a different scenario altogether from what happened in Blade Runner.

      If the intelligence knows it's looking at a car because it's seen millions of cars and can generalize the shape and configuration, and also knows that a license plate is what's frequently mounted in the location of interest on the object it assumes is a car, it can then go about matching up the blurred image to a coherent image of a license plate. It can throw out all the other image possibilities so you don't end up with an image that doesn't make sense, given the metadata.

      It's a bit like the PRML code on HDD controllers. The data coming off the heads in no way resembles 1s and 0s; when you look at it on a scope, it's pretty close to white noise. The controller, though, knows that there's a finite number of permitted patterns of "noise," and it quickly matches up the read pattern with what it knows to be possible, given the characteristics of that particular drive's heads and platters. It then assigns a sequence of "cleaned-up" bits to the pattern read and transfers it to the buffer.

      Cleaning up a blurred image of a license plate would require training over a much larger universe of possible outcomes than in the HDD scenario, but that's a problem of scale, not possibility. Cleaning up the image of a license plate [i]when you know it's a license plate[/i] is something I would expect is already possible and being done somewhere.
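
      The PRML analogy boils down to "pick the closest member of a finite set of permitted patterns." Here's a toy nearest-pattern detector in that spirit; a real controller runs a Viterbi detector over a partial-response channel model, and the codewords below are made up purely for illustration:

```python
import numpy as np

# Hypothetical 4-bit codewords the channel permits, as idealized sample levels.
ALLOWED = {
    "1010": np.array([1, -1, 1, -1], float),
    "1100": np.array([1, 1, -1, -1], float),
    "0110": np.array([-1, 1, 1, -1], float),
}

def detect(samples):
    """Return the allowed pattern with the smallest squared distance to the
    noisy samples (maximum likelihood under Gaussian noise)."""
    return min(ALLOWED, key=lambda k: np.sum((samples - ALLOWED[k]) ** 2))

noisy = np.array([0.8, 0.9, -1.2, -0.7])  # looks like noise on a scope...
print(detect(noisy))                       # ...but decodes cleanly to "1100"
```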

        • blastdoor
        • 1 year ago

        I think we are talking about two different things.

        I’m talking about an image of a license plate that a human is looking at and wants to “enhance” so that the license number is legible. If the human cannot decipher the number from looking at the image (perhaps using standard image cleanup tools, not AI), the AI isn’t going to either.

        My point is just this: if the info isn’t there, you can’t impute it out of thin air. That’s never going to change.

          • chuckula
          • 1 year ago

          [quote]If the human cannot decipher the number from looking at the image (perhaps using standard image cleanup tools, not AI), the AI isn't going to either.[/quote]

          I wouldn't be too sure about that. You might as well say that if a human can't recognize a face then an AI can't either. But that's not necessarily true at all: [url]https://www.wired.com/2016/09/machine-learning-can-identify-pixelated-faces-researchers-show/[/url]

          And no, the AI isn't magically unpixelating the image to return it to its original state. You need to understand that AI is not a one-to-one replacement for a human brain. There are images that an AI can properly classify that a human [b]cannot[/b] classify, and vice versa.
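
          To make the classification-not-reconstruction point concrete, here's a toy sketch: pixelate a gallery of known images and match a pixelated query against them by nearest neighbor. Nothing is ever "un-pixelated"; the system only decides which known identity the mush most resembles. (This is a stand-in for the actual machine-learning classifier in the Wired piece, not a reproduction of it.)

```python
import numpy as np

def pixelate(img, block=8):
    """Crudely pixelate a 2D grayscale image by block-averaging."""
    h, w = img.shape
    h, w = h - h % block, w - w % block
    return img[:h, :w].reshape(h // block, block,
                               w // block, block).mean(axis=(1, 3))

def identify(query_pixelated, gallery):
    """Return the name of the gallery image (dict of name -> same-size
    grayscale array) whose pixelated version is closest to the query."""
    return min(gallery,
               key=lambda name: np.sum((pixelate(gallery[name])
                                        - query_pixelated) ** 2))
```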

            • blastdoor
            • 1 year ago

            [quote]And no, the AI isn't magically unpixelating the image to return it to its original state.[/quote]

            Yes

          • Redocbew
          • 1 year ago

          The point is that the whole thing is essentially a guessing game. If the information was there, then the image wouldn’t be blurry in the first place. The process of “training” is a way to place boundaries and show the system what’s a good guess and what’s a bad one.

          If the machine knows it’s looking at a license plate, then it can make reasonable guesses as to what information should be there. If it doesn’t, then all bets are off, and you’re just as likely to end up with a potato in the image as you are the missing characters.

            • blastdoor
            • 1 year ago

            Knowing that it’s a license plate is far from half the battle.

            • Redocbew
            • 1 year ago

            For this specific example it might be. For other things there very well could be a lot of guesswork still involved once you figure out the object in your picture is a dog, or a tree, or whatever.

      • Zizy
      • 1 year ago

      Reconstruction will never recover all the data, since convolution is a lossy process, but you might be surprised by the amount of deblurring you can achieve in practice 🙂

      Grab a picture and apply some Gaussian blur and some motion blur, perhaps some noise too. Corrupt the image just to the point where you can’t make anything out, no matter how hard you look.
      Now throw that at any deblurring tool (say, MATLAB’s deconvblind), and you will see lots of ringing artifacts, but the original image will be reasonably visible if you didn’t go overboard with the blur and noise.

      Sure, it’s obvious that tons of kernel + reconstructed-image combinations fit the given blurred image (even if the exact blur kernel is known), as deconvolution is an ill-conditioned problem, but based on some reasonable heuristics you can guess at the most likely original image.

      And this is just what you can do with a simple picture without even trying to reconstruct the scene (no assumptions like “this line is part of the fence and can reasonably be assumed to be such and such”). With movies, you can combine multiple frames and get far beyond that.
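
      A minimal NumPy version of that experiment, assuming the blur kernel is known (Wiener deconvolution rather than blind deconvolution, so it side-steps the kernel estimation that deconvblind does). Push `sigma` or `noise_sigma` too far and the ringing artifacts described above show up immediately:

```python
import numpy as np

def gaussian_psf(size, sigma):
    """Simple normalized Gaussian blur kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return psf / psf.sum()

def blur_and_corrupt(img, psf, noise_sigma=0.01):
    """Blur a 2D grayscale image via FFT (circular boundary) and add noise."""
    psf_pad = np.zeros_like(img, dtype=float)
    psf_pad[:psf.shape[0], :psf.shape[1]] = psf
    psf_pad = np.roll(psf_pad, (-(psf.shape[0] // 2), -(psf.shape[1] // 2)),
                      axis=(0, 1))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(psf_pad)))
    return blurred + np.random.normal(0.0, noise_sigma, img.shape), psf_pad

def wiener_deconvolve(blurred, psf_pad, k=0.01):
    """Deconvolve with a known kernel; `k` trades noise amplification
    against ringing artifacts."""
    H = np.fft.fft2(psf_pad)
    G = np.fft.fft2(blurred)
    return np.real(np.fft.ifft2(np.conj(H) / (np.abs(H) ** 2 + k) * G))
```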

    • Voldenuit
    • 1 year ago

    This tech will be amazing for my po- I mean, [i]fluid dynamics[/i] videos.

      • Chrispy_
      • 1 year ago

      Just use SVP for your fluid dynamics videos.

      • Redocbew
      • 1 year ago

      Have your upvotes for the things I now cannot unsee. Thanks a lot.

      • unclesharkey
      • 1 year ago

      Think of all of the applications it will have in the porn industry.

      • kvndoom
      • 1 year ago

      Dynamic fluid of the Ron Jeremy variety…?

    • Mat3
    • 1 year ago

    “Alternatively, you could watch the high-speed video in its new frame rate” – Personally I can’t stand the soap opera effect on new TVs and immediately look in the settings to turn it off.

    • DPete27
    • 1 year ago

    Could this be used as a form of video compression? If they could pare down the hardware requirements, it could be useful for consumers, especially as we start moving toward 4K and 8K video content. You could stream/upload 6 fps of a movie and the hardware could quadruple it back to 24 fps on the consumer end. Or stream 12 fps in combination with conventional compression (i.e., loss of detail).

    Also, how similar is this to what Smooth Video Project is doing?

      • chuckula
      • 1 year ago

      Broadly speaking, interpolating between frames (or “key frames”) is definitely nothing new.

      Nvidia’s selling point here is using neural networks to gin up information that was never actually present in the original source material in the first place. Traditional compression is more about intelligently throwing out information that was in the original source material to get a high-quality reproduction that is encoded using fewer bits of data. So these techniques could be related to each other.

        • DPete27
        • 1 year ago

        Right, “traditional” compression involves throwing away details the viewer [hopefully] won’t notice or care about in order to reduce the data footprint. This could be used as a substitute for, or in conjunction with, “traditional” compression.

          • chuckula
          • 1 year ago

          I could almost see something like this being used in [i]de[/i]compression. Imagine a standard 24 FPS video source that's compressed normally. The decompressor would use this interpolation technique to generate a more realistic 48 FPS output without requiring the original source material to use 48 FPS inputs to the compression algorithm.

          Having said that, it's clearly not practical to do right now in a "live" setting.

          • jihadjoe
          • 1 year ago

          Yup, it’s not just about throwing away data the viewer might not notice; you can also throw away stuff that can be reconstructed later on.

          A neural network algorithm that can ‘imagine’ details that weren’t present in the original video could be used to restore the same kind of detail removed by a culling algorithm that’s aware of how the neural network side works.

          • meerkt
          • 1 year ago

          Video codecs already use motion compensation to minimize the residual they have to store for each frame. But maybe this can improve upon that.

      • dragontamer5788
      • 1 year ago

      [quote]Could this be used as a form of video compression?[/quote]

      The opposite. Video-compression gurus discovered the algorithm first, and then video editors used the algorithm afterwards. Optical-flow-based analysis of images existed as early as MPEG2, probably earlier (I'm only familiar with MPEG2 aka DivX).

      Oh right, and then Nvidia slaps a neural network on top of optical flow and pretends they've invented something. Erm... no. They just reinvented optical-flow analysis. And I'm going to bet that it's slower and worse-performing than the methods in, say, the H.264 standard (which is over a decade old), or next-gen H.265.

      -----------------

      MPEG2's motion vectors work like this: theorize that all the pixels in your block came from somewhere. Calculate motion vectors for your block (usually 32x32). Compare against the original image. Is the PSNR (peak signal-to-noise ratio) high, i.e., does the "motion-vector" image look similar? If so, use the motion-vector methodology. Otherwise, recompute the image from scratch (aka an I-frame).

      More or less, that's the process. I'm not 100% sure of the details, but that's how it was explained to me. [url]http://www.bretl.com/mpeghtml/mpeg2vc1.HTM[/url]

      I can only imagine that H.264 or H.265 has even better motion-estimation algorithms available these days. MPEG2 was designed and deployed in the 90s.

      EDIT: Apparently, motion vectors were in use as early as MPEG1 from 1988. So this is a very, [b]very[/b] old compression technique.
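
      For anyone who wants to see the core of that process in code, here's a toy block-matching sketch: an exhaustive SAD search for a motion vector, then a crude inter/intra decision. Real encoders use rate-distortion optimization rather than a fixed SAD threshold, and the block size and search range here are arbitrary:

```python
import numpy as np

def best_motion_vector(prev, block, top, left, search=8):
    """Exhaustive block matching: find the displacement into the previous
    frame that minimizes sum-of-absolute-differences for one block."""
    bh, bw = block.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > prev.shape[0] or x + bw > prev.shape[1]:
                continue  # candidate falls outside the previous frame
            sad = np.abs(prev[y:y+bh, x:x+bw].astype(int) - block.astype(int)).sum()
            if sad < best_sad:
                best, best_sad = (dy, dx), sad
    return best, best_sad

def encode_block(prev, cur, top, left, size=16, sad_threshold=1000):
    """Toy mode decision: motion-compensate the block if the best match is
    good enough, otherwise code it from scratch (intra)."""
    block = cur[top:top+size, left:left+size]
    (dy, dx), sad = best_motion_vector(prev, block, top, left)
    if sad < sad_threshold:
        return ("inter", (dy, dx))   # store only the motion vector + residual
    return ("intra", block)          # store the block itself
```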

    • WhatMeWorry
    • 1 year ago

    Wake me up when they cure cancer.

      • RickyTick
      • 1 year ago

      If they did, a large number of posters here would bemoan it anyway, or blame ngreedia for creating cancer in the first place.

        • chuckula
        • 1 year ago

        I could definitely see DoomGuy64 and Xeridea refusing treatment if they found out it was made in part using Nvidia hardware.

        And DoomGuy64 would accuse Nvidia of artificially gimping the tumors to boot.

      • davidbowser
      • 1 year ago

      (disclaimer – I work for Google in Healthcare and Life Sciences. My opinions are my own)

      I assume your comment was tongue-in-cheek, but we are in fact working on it.

      The bulk of the work that’s public so far is around early detection, but there is also a TON of work being done both to increase the effectiveness of treatments and to look at genomics to find precursors and (in theory) prevent cancers in the first place.

      [url]https://ai.googleblog.com/2017/03/assisting-pathologists-in-detecting.html[/url]
      [url]https://cloud.google.com/customers/stanford-universitys-center-of-genomics-and-personalized-medicine/[/url]

        • blastdoor
        • 1 year ago

        The cure for cancer is the cure for mortality.

        My hunch is that prematurely achieving immortality is the answer to Fermi’s “where are they?” question, because science advances one funeral at a time.

        So thanks for seeking to end humanity, google! 😉

          • Beahmont
          • 1 year ago

          This is nonsense.

          At best the cure for cancer is the cure for aging. Even if we humans could completely stop dying from aging, there would still be plenty of dead humans. We humans still kill each other at a fairly decent clip. We also do a lot of really stupid things and die from those stupid things at an even greater rate.

          Now, when Google cracks the ability to program human neurons, a way to read the information stored in a human neural system, and a way to make more neurons more or less at will? Yeah, that’s when Google creates immortality. And that is an ‘achievement’ that’s far closer than most people realize, even if it still is a long way off as things stand right now.

      • ptsant
      • 1 year ago

      They actually give out free Titan Xp GPUs if you have a decent research project. Our lab just got one a couple of weeks ago and we are, indeed, using it to do cancer research.

      • jihadjoe
      • 1 year ago

      “But we could use it for games!”

        • Neutronbeam
        • 1 year ago

        Think of it as being used for a health Crysis.

    • derFunkenstein
    • 1 year ago

    I wanted to see a larger version rather than the embed so I opened it in YouTube and promptly got an ad before the Nvidia marketing demo played. I hate the internet.

      • chuckula
      • 1 year ago

      Just wait until the ad-before-the-ad plays in super slowmo too!

      • Dashak
      • 1 year ago

      …how do you not have adblockers on all your devices by now?

    • Goty
    • 1 year ago

    This is pretty impressive, but the artifacts in the interpolated frames are pretty distracting to me at times. Some of the clips are better than others (the drifting scene, for instance), but ones like the jelly tennis scene aren’t great because of the “pulsing” effect.

      • chuckula
      • 1 year ago

      Machine learning is great but it can’t completely replace missing information.

      The “natural” paths of movement of solid objects like the drifting car or even a more complex object like a human who has articulated joints are going to be easier to interpolate than complex [and inherently chaotic] particle systems like the liquid in the tennis racket demo.

      • derFunkenstein
      • 1 year ago

      In the hockey scene, you can see the guy’s foot doing some weird stuff as he’s falling. It definitely sucks you into the uncanny valley, but all the same it’s pretty cool.

    • chuckula
    • 1 year ago

    [quote]Alternatively, you could watch the high-speed video in its new frame rate—assuming you have the display to reproduce it, of course.[/quote]

    Nice try Nvidia, but no amount of $9000 GPU hardware will convince us to pay the G-sync tax. #Freesync4Life
