How AI Happens

10 Years of FAIR at Meta with Sama Director of ML Jerome Pasquero

Episode Summary

After attending Meta's event celebrating 10 years of AI progress at FAIR, Rob shares what he learned with Sama's Director of Machine Learning, Jerome Pasquero, for some much-needed technical insight.

Episode Notes

Jerome discusses Meta's Segment Anything Model, Ego-Exo4D, the nature of self-supervised learning, and what it would mean to have a non-language-based approach to machine teaching.

For more, including quotes from Meta Researchers, check out the Sama Blog

Episode Transcription

Rob Stevenson  00:02

Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn how AI happens. Okay, hello again, all of my machine-teaching data scientist darlings out there in podcast land. It's me, Rob, here with another installment of How AI Happens, and it's an extra special installment, because I just went to this event and I'm just bursting at the seams. I'm shaking, I'm so excited to talk about it. I'm gonna give some background on what this is all about. Basically, we are working on getting an episode recorded with the VP of AI Research over at Meta, Joelle Pineau. She's amazing, a total rockstar, published a ton of papers, she was working on generative before it was cool, all that. And as part of that, I was working with Meta comms, and they were like, hey, you know what, we're going to be doing a press event for the 10-year anniversary of FAIR. FAIR is, of course, Meta's AI research arm. And I don't know if you all are familiar with FAIR, but basically they have, like, unlimited resources and a mandate to work on cool stuff. And they are working on really, really cool stuff. And so part of this event was the folks over at Meta demonstrating what they've been working on and giving us updates on their models. And maybe it makes me look naive to admit how much my mind was blown by this stuff. I didn't know some of this stuff was going on. Maybe y'all are a little more hip to that jive, but I'm just so excited to talk about it. And what I wanted to do is, before we have that episode with Joelle, I wanted to bring in a subject matter expert from Sama to react to some of this stuff, so I could report what I saw.
And then we can translate it into what it means in a more technical sense. So in order to do that, I have brought on a friend of the show, returning champion, and the Director of Machine Learning over at Sama, Jerome Pasquero. Jerome, welcome to the podcast.


Jerome Pasquero  02:14

Hey, Rob, it's nice to see you again.


Rob Stevenson  02:16

You as well. Welcome back to the podcast, I should say, and with a shiny new title. You've been promoted since the last time I had you on, so congrats. Thank you. Let's get right into it here, Jerome. One of the first things that stood out to me was this Segment Anything Model. It is basically annotation, but more specific than a bounding box, and it's automatic. Since so much computer vision is trained on segmented images, this was very exciting, because it looks like, oh, that's just an automated way to do it. So I was curious, you know, I'm sure the folks out there listening are familiar with Segment Anything. It wasn't launched at this event, obviously, it came out a few months back, but they were showing some of the progress that's been made. I was hoping you could kind of react to Segment Anything and whether you think this is something that is more for the consumer, or is this something that can, like, replace the way we train some computer vision models?


Jerome Pasquero  03:10

Yeah, great question. So it's a little bit of both. We refer to it as SAM, and we've been following SAM for quite a while, right? It's a really impressive model, or set of models. It does not only, like, give you the bounding box, obviously, but also kind of, like, the polygons around the objects, right? Of course, like everyone else, we've experimented with it a lot, and it's very impressive in the results, in how granular you can be. Now, it doesn't do everything, though, right? In the sense that you might get that polygon around the object that you're trying to detect, but then you have no idea about the attributes of that object, for instance. It's a little bit more difficult to do that. And also, it's really good on objects that it's seen before, in the massive dataset of objects it was trained on. But for stuff that it hasn't seen that much, stuff that's a lot more specialized, we found that sometimes it fails or requires human supervision, which is something that we actually do provide, by making sure that we can correct the small and sometimes large errors that such a model will make. So a good example of that, since you were asking whether this is more useful for the consumer or more in industry: in industry, if you think of detecting all the different parts of a printed circuit board, it might not do as well, especially if it's got multiple layers and multiple types of chips on it, because it hasn't seen as many in the real world. It might do a really good job on 90% of the components, but that other 10% is just as important as the first 90%, and that's where it might fail. But if we're talking more in the consumer world, like, you know, presumably the dataset that they used was mostly pictures of selfies or nice scenery, that's probably where we see the best results in automatic segmentation.


Rob Stevenson  05:02

Right, so the more specific the segmentation needs to be, and the more attribute-rich, the more you would need some human annotation going on. Is that the idea?


Jerome Pasquero  05:11

The more, I would say, specialized, right? Like, it always comes back to how much of that particular type of feature or specificity it's seen in its training dataset, right? The training dataset that they used, even if they collected some of it, presumably doesn't have, you know, perfectly equally distributed classes across the board, right? So some objects that it hasn't seen very often, it might not be able to segment them as well. And there are lots and lots of use cases in industry where, you know, it's very important to be able to do very well on these objects that are not commonly seen.


Rob Stevenson  05:49

The Segment Anything Model follows a lot of widely available AI right now, which is that, oh, it gets you, you know, 60% of the way there. It can take away some of the boring stuff. So say you wanted to use it for a very detailed, complex industrial application, right? You run this Segment Anything Model, and it gets the first obvious stuff out of the way, right? It's like, oh, that's, you know, that's a microprocessor, great, something obvious, so that the annotator can be more efficient. It's like, now I don't spend half of my time doing the really, really obvious stuff that, like, a toddler could point out. Now I can focus only on the stuff that only I can do, which is the goal of a lot of current tech, I think, where we are. Is that fair to say?


Jerome Pasquero  06:30

Yeah, that's correct. So it's a great tool to get the easy stuff out of the way. But as I mentioned earlier, you're still going to have errors and things that need to be modified. There are requirements that are different for a particular company, for a particular use case, right? So that's where the annotator still has a huge role to play in it, and that's probably the most important role, right? Because using SAM is almost off the shelf today, so it won't give you any competitive edge as a company to use it in your particular application. It's when you start dealing with the specificities of your use case and your business that you might get the competitive edge and the efficiency gains that you're looking for. So just to come back to the annotation: at this level, we just get all the easy stuff out of the way, and we need ways to identify what went well, what didn't go well, and what needs to be modified. So we'll see, you know, it didn't quite capture a chip properly, it didn't get the little legs of that chip, whereas for that particular application, that's needed. And then, as I said, it's often not only about segmenting the object. We also have a number of attributes that are very important for that object, and SAM doesn't give you those attributes. And those can be very, very detailed. Think of whether there's a small flaw or not on that chip, whether it's a particular model of chip. You can think of a number of different attributes, sometimes up to 100, for just one object that we annotate. And for that, we haven't seen any alternative to using human annotators today. Eventually, the models might be able to capture all these intricacies about the properties of the object itself.
But they're never going to be able to capture all of them, because again, they're very specific to a particular application and business. And even if they did, there are still going to be errors, and these errors need to be corrected by either a domain expert or someone who's at least been trained in identifying the potential errors that machine learning algorithms can make.
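The quality-control loop Jerome describes, where a human corrects a model's segmentation and the divergence itself is measured, can be made concrete. Here is a minimal, hypothetical sketch (the function names and the representation of masks as sets of pixel coordinates are illustrative, not Sama's actual tooling) of flagging model masks whose overlap with a human-corrected mask falls below a threshold:

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks, each a set of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 1.0

def flag_for_review(model_masks, corrected_masks, threshold=0.9):
    """Return ids of objects where the model's mask diverges from the
    human-corrected mask enough to warrant expert review."""
    return [obj_id for obj_id, mask in model_masks.items()
            if iou(mask, corrected_masks.get(obj_id, set())) < threshold]
```

A model that clips the little legs off a chip, say, would produce a mask with low IoU against the corrected one and get routed back to an annotator.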


Rob Stevenson  08:39

Yeah, of course, there is that gap between automation and expertise slash specificity. I've noticed that even, like, with using ChatGPT. I've noticed it getting a little bit worse, and the output is just not ready, it's just, like, 50% of the way there, right? It's a good start, it takes away the obvious stuff, but there's still a lot more that needs to be added to it.


Jerome Pasquero  08:57

You're right about that, about your analogy with ChatGPT. I mean, there's a lot of talk about, like, human augmentation. Some people believe in it, other people don't. I'm on the team who believes that this is what we're trying to do over here. Human augmentation just allows humans to do things better, faster, and to get rid, at least for now, of the things that can be automated. Not everything will be automated, and it allows the human to kind of climb that ladder of cognition, in terms of the skills that they can do, and let the more tedious work be done by the machine learning algorithms as they become available.


Rob Stevenson  09:39

Yes, and that human augmentation seems to be what Meta is focusing on, certainly with the Ego-Exo4D stuff. This stuff is amazing, I was really, really excited by this. Basically, you would use your Quest headset, right, your VR headset, and then you watch someone perform a task. It could be fixing a toilet, it could be playing the xylophone, it could be swinging a tennis racket. And then the motions you're meant to take are projected in a hologram from your point of view over your body. So you would move your body in the way that you are seeing, and this is how a lot of learning takes place, right? It's monkey see, monkey do. I watch my tennis instructor volley a tennis ball, and I try and move my body exactly like they did, and I listen to what they're saying, and I copy their motions. This feels like a more immersive, augmented version, basically, of MasterClass, you know? It's like, if I could just have that person in the room with me, and if I could move my body in that way. So that is what's going on with Ego-Exo4D, and it's really exciting. It also seems like it's almost coming for the how-to section of YouTube. One of the speakers was like, when's the last time you used a how-to video on YouTube? And I was like, last week, all the time. Last week, I was trying to sharpen some knives in my kitchen, and I watched this video of a chef showing me how to sharpen knives. There was one section of it I had to go back and watch, like, 10 times, because it was too fast and the angle wasn't great. What I needed was Ego-Exo4D. What I needed was to watch that, to be able to slow it down, and then for it to be projected in front of me so I could mimic it. And this, in a way, feels like a more accurate measure of how learning happens, at least non-language-based learning. Another thing that came up at this event: Yann LeCun was there. You know, he's an all-star, right?
He's the chief AI scientist over there, he won the Turing Award, he plays chess with Yoshua Bengio. That's the, you know, the arena he's playing in. And he was talking about how so much of machine teaching right now is language-based. It's, like, annotated information fed into a model, but language is limited, and that's not how learning takes place everywhere. He gave the example of a human infant, who doesn't have any language capability but can learn really, really fast. And so his point was, like, we're missing something. There's some version of processing going on that is not language-based, and it's fundamental to how not just humans, but all living things learn, and that is not a part of our technology right now. He didn't explicitly connect it to Ego-Exo4D, but it was close enough that I figured that must be what they're trying to do with Ego-Exo4D, because that's what an infant is doing. They're copying what they see in the real world. They're also experimenting and using their touch and taste and smell and all that. But I was curious if you would react to that idea, Jerome, that there is more to learning than language. And is that what we're missing in this machine teaching approach?


Jerome Pasquero  12:36

Yeah, I think you're touching on something very important. Right now, there's a stark contrast between how you can use Ego-Exo4D, for instance, to kind of train humans in a really close feedback loop, because that's how we learn, right? Like, you do something, you do it wrong, you adjust your actions, and that's how you learn, and that process can happen pretty quickly, especially if that feedback loop is very tight: action, correction, reaction, and then you go round and round again. That usually just requires a couple of iterations, unless you're trying to become a professional tennis player, for instance, at a very high level. But other than that, like sharpening your knife, that happens fairly quickly, I would assume. So contrast that with how our models are learning today, which is, again, the same kind of idea, through repetition, but we're not talking about going through that loop two or three times, we're talking about going through it hundreds of thousands of times, right? Which is kind of crazy, as you alluded to. And I've heard LeCun say that multiple times: a child doesn't need 100,000 images of a firetruck to know what a fire truck is, right? And they can even see it in a different form. The fire truck might be a physical toy, and they know what it is. It could be on the street, and they know what it is. It could even be a drawing, in two dimensions. So as much as we're seeing huge progress in AI, thanks mostly to deep learning, we haven't solved it. This is probably not the right approach, right? It is one component that is probably important, it has a very important role to play in our whole system, and that's super exciting, but we're still missing something here.
I would also add that it's not only on the sensory processing front, but also on the reasoning part, the layer on top of that, where we haven't solved reasoning at all. Even in autonomous driving, which, you know, we could say is pretty advanced in terms of the processing of all the different modalities, the model has no idea what it's doing and why it's making any one decision rather than another. So anything at the decision-making level is still really open for grabs here, in terms of research and solutions.


Rob Stevenson  15:06

You know, one thing Yann pointed out was that it's not a data issue. It's not like, oh, if we had enough annotated data, had enough language, we could do it, because to train on, you know, these billions of points of data, you very quickly run out of the entire corpus of text that's ever been recorded by humans, right?


Jerome Pasquero  15:25

Not only that, but it's like you're approaching the problem in the reverse order from what one would expect, because you're starting from the data, which is the manifestation of a reasoning process, and reverse engineering to try to understand the key mechanisms that dictate the system. Right now it's working, in some way it is working, but it still requires so many resources and so much computational power that there's got to be something better, because this is not how we work as humans, right? The ultimate goal is to be able to do it the other way around: to have these fundamental concepts from which we can derive all the different skills that we have downstream, right? And you're right, in the sense as well that people are starting to worry: if we grab all the text on the internet, not only is it huge, but it's not infinite. And it is also a closed ecosystem, right? It's a closed system. So if we're not actually adding new information to what is used for training our models, are we at risk of being stuck in, I don't know if I'd call it an infinite loop, but a vicious circle, where we're just rehashing the same information over and over again and never learning anything new? And that includes all the biases from that data that we have, right? Which, through this process, only get enhanced, right?


Rob Stevenson  17:03

Yeah, yeah. It would just have all the same problems humans do.


Jerome Pasquero  17:06

Yeah, and even the worst problems would be augmented. Like, it would be sad, tragic.


Rob Stevenson  17:11

Yeah, exactly. This is also the importance, it sounds like, with Ego-Exo4D, of thinking beyond language, of just multimodality and lots of different types of data. Because going back to the child, the human infant, they learn by watching, they learn by touching, they put everything in their mouth. If you've ever been around a baby, everything goes right in the mouth, right up the nose, all of these things. So that surely is under-indexed upon. I mean, we see it a lot in autonomous vehicles, this idea of sensor fusion, right? That there's some sort of balance to all of the intake, and that we can achieve sensor nirvana and have the most optimal path based on all this information. But that's still all vision and text, right? So what is multimodality's place in improving machine teaching?


Jerome Pasquero  17:58

Yeah, I think that's going to be one of the main areas of research that everyone's going to be interested in over the next few years. Because, as you said, humans are really good at processing different modalities and using them not only as a kind of redundant signal, like, if you see something and you touch it, and it feels like what it looks like, then it's probably what you think it is, right? But also just as a way to help with the reasoning, the higher level on top of it. So today, as you said, it's kind of in its infancy. We are starting to see models, such as GPT-style chat solutions, that can take multimodal input and create multimodal outputs, of text and vision most of the time. But the reality is that there are a lot more senses. One that I'm specifically very interested in is haptics, because my background is actually in haptics, right? So, as you said, kids put stuff in their mouth, and they learn from it. But if we come back to the example of the kid learning what a fire truck is by touching that fire truck, there's also information in there that they're using to actually understand it. So I'm not sure what form this is going to take. I leave it to the researchers to figure out what other mechanisms can help for fusing all those different signals from different modalities. But again, I don't think that today we've really nailed it. We're starting to see ways to fuse these things, and it's been around for a few years if you think of stuff like CLIP embeddings, but it's not really the kind of pure fusion that humans leverage when they're trying to learn.
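For a sense of how crude today's fusion can be, here is a toy sketch of the concatenate-and-normalize style of late fusion Jerome alludes to (the modality names and dimensions are made up for illustration; contrastive approaches like CLIP instead learn projections into a shared embedding space):

```python
def l2_normalize(vec):
    """Scale a vector to unit length so no modality dominates by magnitude alone."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else list(vec)

def late_fusion(image_emb, text_emb, touch_emb):
    """Naive multimodal fusion: normalize each modality's embedding, then concatenate."""
    return l2_normalize(image_emb) + l2_normalize(text_emb) + l2_normalize(touch_emb)
```

The fused vector just glues the pieces together; nothing here lets one modality inform the interpretation of another, which is the gap Jerome is pointing at.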


Rob Stevenson  19:41

Something that came up right away: Joelle sort of kicked off the event, and she said that self-supervised learning is the foundation of every model at FAIR. That was surprising to me. Maybe it's not surprising to you. Is it? I guess I'll just start there before I, you know, ask a hackneyed question.


Jerome Pasquero  19:59

No, it's not very surprising. You know, with the emergence of foundation models, stuff like SAM, these large models, they have been trained using self-supervised learning techniques, just because it allows them to process so much more data that doesn't need to be manually annotated by humans, right? So I think that in that sense, she's right; it's difficult to say that's not right. For these generalist models that can do a lot of things pretty well, we're going to continue to see a trend of using self-supervised techniques on raw data. The problem, really, is when you're trying to get those big models to do stuff that's very specific, like we were talking about before, right? Most of the time, you need to fine-tune them, and in that fine-tuning process, you still need data that often doesn't exist out there, right? And that's the kind of data that is specialized and needs to be labeled with a high level of expertise and accuracy. So I still think there's a very important role for humans in there.
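The fine-tuning Jerome mentions often amounts to training a small task-specific head on top of a frozen, self-supervised backbone, using a modest amount of expert-labeled data. A dependency-free sketch of that idea, with tiny toy vectors standing in for backbone embeddings (everything here is illustrative, not any particular library's API):

```python
import math

def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on frozen backbone features via plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid probability of class 1
            err = p - y                       # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Classify a feature vector with the trained head."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

The backbone never changes; only the small head is trained, which is why a few hundred expert-labeled examples can be enough where pretraining needed billions of raw ones.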


Rob Stevenson  21:12

That's what I wanted to ask, because self-supervised learning feels like a way to say unsupervised learning that's just a little easier to swallow. Like, surely someone has to come in and babysit this thing at some point. There's got to be a role for humans in the loop; it's not just let this thing run. So maybe that is a question for Joelle: what does it mean to go from supervised learning to self-supervised learning? What is the difference in the parlance? And how much intervention and monitoring is really still going on? I'd be curious to hear about that.


Jerome Pasquero  21:39

Yeah, for sure, me too. That's definitely something I'd love to hear her comment on: what is the role for humans? I believe, not to put words in her mouth, that it will go in the direction of saying we still need humans not just for fine-tuning, but also for making sure that everything is still working as expected, because machine learning models today are very bad at that kind of, not regulating, but validating that what they're outputting is right. They only know what they know. So you need an external entity to actually make sure that what they're outputting makes sense in the real world, right? And today there's no alternative to that other than us humans, because we live in this world, we know it better than anyone else, and actually the applications that we're going after are for us. So we're the ultimate judge of whether, you know, the output of a model, whether it's at a classification level, a super simple task today, or something much more complicated, is the right output. And I don't see a way out of this. Now, today, you can use more powerful, bigger models to kind of keep an eye on those smaller models. But ultimately, who's actually supervising those big, powerful models?
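In practice, the question of who watches the model often gets an operational answer: route each output by confidence, auto-accepting the most confident ones, sending mid-confidence ones to a cheaper checker (possibly a bigger model, as Jerome notes), and escalating the rest to a human. A hypothetical sketch, with made-up thresholds and bucket names:

```python
def route(confidence, accept_at=0.95, check_at=0.7):
    """Decide how much oversight a single model output gets, based on its confidence."""
    if confidence >= accept_at:
        return "auto_accept"
    if confidence >= check_at:
        return "model_check"     # e.g., a larger model re-scores it
    return "human_review"        # the ultimate judge, as Jerome puts it

def triage(items, **thresholds):
    """Bucket a batch of (prediction, confidence) pairs by routing decision."""
    buckets = {"auto_accept": [], "model_check": [], "human_review": []}
    for pred, conf in items:
        buckets[route(conf, **thresholds)].append(pred)
    return buckets
```

The human bucket never goes to zero, which is the point: even the model-check path bottoms out in human supervision of the checker.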


Rob Stevenson  22:47

Yeah, it's


Jerome Pasquero  22:49

Still us. It's still us.


Rob Stevenson  22:51

That's what I wanted to get at. Because it's like, maybe I can get Joelle to concede that self-supervised learning just means marginally less supervised learning.


Jerome Pasquero  22:59

I think it's just that, yeah, ultimately. And it's scary to think that at some point, you know, humans won't be supervising the models, because then that's where we get into what today is still kind of a sci-fi world, like the models taking over, right? And let's not forget, this is still just a technology, like any other. It's more powerful, but it is still a technology. And the reason why we have technology is just to make our lives easier, right? So the goal in all of this is still just to help humans, so they have to be the ultimate judges as to whether it's helping them or not, right?


Rob Stevenson  23:37

Yeah. Before I let you go, Jerome, I want to outsource some of my interviewing to you and ask you what you would like to hear Joelle comment more on, because we're going to have a more technical conversation, of course. What are some follow-ups? What would you like to know more about from Joelle?


Jerome Pasquero  23:54

Yeah, I think we've touched on those. I'll make sure to listen to that episode, for sure; I can't wait for it. I'd say that I would ask her about what she sees as the role for humans in the loop in the future, in future applications. And, you know, it's always hard to predict, but I'm sure she has an opinion on this, so that is definitely something interesting. We also talked about multimodality: what are the first modalities that we're going to go after, for instance? And are there new ideas around how we can fuse them, other than what we're doing today, which is pretty much concatenating them together, right, just putting pieces together? Are there other, better ways to do that? And then a larger question about what she is most excited about, because that's always a great question for people like us, who get inspired about where to put our chips on the table in the future. Because if a researcher of that caliber is excited about something, there's a better chance that it's going to happen than if they're not, right? So I'd go around those things, for sure.


Rob Stevenson  25:02

Okay, yeah. Thanks, Jerome, that's great. And I'll definitely ask her that. I'm excited to hear what she's excited about too, because they have kind of unlimited resources, they can work on anything they want. So, like, how do they choose? When you're a kid in the candy store and you can do research on whatever application you want, you probably pick something pretty awesome, and that's what I saw at the event. So, Jerome, this has been really fun, shooting the breeze with you about all this stuff and unpacking what I saw over there. Thank you for putting a little bit more of a technical bend on it. I appreciate you being here with me. This was really fun.


Jerome Pasquero  25:32

It's always a pleasure to talk to you, Rob. So, anytime.


Rob Stevenson  25:36

And all of you out there listening, if you want to know some more, I'm doing a lengthier, more in-depth write-up on the Sama blog, where I will have some of the content I captured. I have some video of Yann LeCun, I have a bunch of pull quotes from people who were chatting, and there's some media that we received as part of the event. So we'll mention some more of the models that they debuted and what's exciting about them. That's all over on the blog. And yeah, thanks for tuning in. Until next time with Joelle Pineau, I've been Rob Stevenson, Jerome Pasquero has been Jerome Pasquero, and you've all been amazing, wonderful machine learning engineers, AI specialists, data scientists, whomever, however you showed up. Thank you for being here. Have a great one. How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, medtech, robotics, and agriculture. For more information, head to