BurgerBot: GPT-4, Segmentation, and Manipulation

CMU 11-851 Course: Talking to Robots

Goal —

Develop a robot chef, called BurgerBot, capable of assembling a simple dish from a predefined set of plastic toy burger parts (top bun, lettuce, cheese, patty, tomato, bottom bun) using a framework that combines GPT-4 with a segmentation model.
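
As a rough illustration of the language side of this framework, the sketch below shows how GPT-4 could turn a spoken order into an ordered stack of the available parts. It uses the openai Python SDK (v1.x); the prompt wording, model settings, and the order_to_stack name are assumptions for illustration, not the project's actual code.

```python
# Sketch only: GPT-4 maps a natural-language order to an ordered ingredient stack.
import json
from openai import OpenAI

PARTS = ["bottom bun", "patty", "cheese", "tomato", "lettuce", "top bun"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def order_to_stack(utterance: str) -> list[str]:
    system = (
        "You are BurgerBot. Reply with only a JSON list of ingredients, "
        f"ordered bottom to top, using only these parts: {PARTS}."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": utterance}],
    )
    return json.loads(response.choices[0].message.content)

# Example: order_to_stack("A cheeseburger, hold the tomato, please.")
# might return ["bottom bun", "patty", "cheese", "lettuce", "top bun"]
```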

Key Question —

How can we design and implement a robot chef that collaborates with humans through natural language instructions and visual information, adapting to diverse instructions and unexpected situations to achieve successful task completion?

Motivation —

Our project aims to bridge the gap between theoretical advances in language models and practical implementations in culinary automation, ultimately contributing to the broader discourse on the synergy between natural language processing and robotics. This project has implications for several audiences, including individuals with busy schedules, home appliance companies, food-service restaurants, and scientists engaged in sensory exploration.

Important Interaction Considerations —

At the start of the project, I drafted a sample dialogue reflecting five core interaction considerations I hoped to see when interacting with BurgerBot.

Collaborative Task Completion:

  • Building a burger involves multiple steps and coordination between different actions (grilling the patty, assembling the ingredients). Seamless collaboration ensures efficient and accurate execution, leading to a better experience.

Positive and Polite Interaction:

  • Positive communication fosters a sense of trust and comfort for the user. Politeness shows the robot's "human touch" and enhances the overall dining experience.

Feedback and Confirmation:

  • Asking for user input and confirming details before proceeding reduces errors and ensures BurgerBot meets the user's preferences.

Menu Options and Customization:

  • Offering various options and allowing customization caters to individual tastes and dietary needs.

Error Handling:

  • The ability to handle unexpected situations, like missing ingredients or incorrect instructions, is crucial for smooth operation. This demonstrates the robot's adaptability and minimizes frustration for the user.

System Assumptions —

Rigid Process:

  • At the moment, the robot is restricted to a fixed process: it first creates the recipe and then performs all of the necessary actions in one pass, as sketched below. While the system can skip an ingredient that can't be found, no other last-minute changes are possible.
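
A minimal sketch of this two-phase flow, with stubbed perception and motion helpers (locate_parts and pick_and_place are hypothetical names, not the project's actual code), could look like this:

```python
# Phase 1 fixes the recipe and pick points; phase 2 executes every action in order.
def locate_parts(image) -> dict[str, tuple[float, float]]:
    # Stub: the real system runs the segmentation model on the camera image.
    return {"bottom bun": (0.12, 0.30), "patty": (0.25, 0.18), "top bun": (0.40, 0.35)}

def pick_and_place(xy: tuple[float, float]) -> None:
    print(f"pick at {xy}, place on stack")  # stub for the suction-arm motion

def run_order(stack: list[str], image=None) -> None:
    locations = locate_parts(image)          # plan is now fixed
    for part in stack:                       # execute all actions at once
        if part not in locations:
            print(f"skipping {part}: not found")   # the only allowed deviation
            continue
        pick_and_place(locations[part])

run_order(["bottom bun", "patty", "cheese", "top bun"])  # cheese will be skipped
```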

Ignoring Manipulation Specifics:

  • Since the equipment available to us is a stationary robot arm with a suction-cup attachment, the manipulation capabilities are very limited. The system therefore does not handle real food, only a plastic toy burger, and it is limited to simple stacking tasks, ignoring challenges such as washing and cutting ingredients, handling hot ingredients, and flipping patties.

Pre-Defined List of Ingredients:

  • The pre-defined list of ingredients currently limits the user to ordering either a sandwich or a burger in different configurations. This decision reflects the manipulation challenges mentioned above, but also prevents the user from being disappointed later when an order can't be prepared due to missing items (see the validation sketch below).
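
One way to enforce the predefined list, sketched below under the assumption that GPT-4 returns a plain ingredient list, is to reject any unavailable part before the arm moves, so the user hears about an impossible order up front rather than mid-build. The validate_stack name is hypothetical.

```python
# Guard the language model's output against the fixed set of available parts.
PARTS = {"bottom bun", "patty", "cheese", "tomato", "lettuce", "top bun"}

def validate_stack(stack: list[str]) -> tuple[bool, str]:
    unavailable = [part for part in stack if part not in PARTS]
    if unavailable:
        return False, f"Sorry, I don't have: {', '.join(unavailable)}."
    return True, "Order accepted."

print(validate_stack(["bottom bun", "bacon", "top bun"]))
# (False, "Sorry, I don't have: bacon.")
```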

Hardcoded Building Area:

  • While the location for picking up ingredients is determined by the image segmentation process, the drop-off area is currently hardcoded into the system.
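
The sketch below illustrates this split: a pick point derived from a segmentation mask (here, simply the mask's pixel centroid) alongside a fixed drop-off location. The centroid approach and the coordinate values are assumptions for illustration, not the project's actual calibration.

```python
import numpy as np

BUILD_AREA_XY = (0.45, -0.20)  # hardcoded drop-off in robot coordinates (metres)

def pick_point_from_mask(mask: np.ndarray) -> tuple[float, float]:
    """Pixel centroid of a binary mask; a camera-to-robot calibration
    would then map this into robot coordinates."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:380] = True           # fake mask for one ingredient
print(pick_point_from_mask(mask))       # -> (339.5, 229.5) in pixels
print("drop-off:", BUILD_AREA_XY)
```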

Hardcoded Z Dimensions:

  • Currently, the z-coordinates for picking up and dropping off ingredients are roughly estimated in advance and hardcoded into the system.
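
A minimal sketch of such hardcoded z handling, with illustrative constants rather than measured values, might keep a fixed pick height and grow the drop height with the number of parts already stacked:

```python
TABLE_PICK_Z = 0.020        # suction height above the table for a flat part (m)
BUILD_BASE_Z = 0.015        # height of the build plate (m)
PART_THICKNESS = 0.012      # assumed average thickness of a toy part (m)

def drop_z(parts_already_placed: int) -> float:
    return BUILD_BASE_Z + parts_already_placed * PART_THICKNESS

for i in range(4):
    print(f"layer {i}: pick z = {TABLE_PICK_Z:.3f} m, drop z = {drop_z(i):.3f} m")
```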

Reflection —

During this robotics project, my focus was on simplifying the proof-of-concept interactions. By opting for a suction end-effector and plastic toy parts, we avoided the complexity of a more dexterous manipulator. When testing the system, the surface texture of the burger parts became a critical consideration during manipulation: glossy parts occasionally slipped from the suction cup, while matte parts consistently adhered. However, the project encountered a bottleneck in prolonged dialogue response times, which lengthened each round of testing. Recognizing the importance of swift responses for a good user experience, future work would optimize the existing algorithms and explore more efficient ASR/TTS models to streamline the overall process.

Acknowledgement —

I would like to acknowledge Professor Yonatan Bisk for his invaluable guidance, expertise, and insights throughout this course. I would also like to express my sincere gratitude to my research teammates who drove the success of this project: Marlies Goes, Jeel Shah, and Chen Wu.
