Autonomous Robotic Arm

For this project, I worked with three graduate students and one other undergraduate student to create an autonomous robotic arm. The following are our final findings:

Introduction

The combined fields of artificial intelligence and robotics have seen great advancements in the past few years. From Boston Dynamics successfully training a humanoid robot to perform parkour to Starship robots delivering products, the research space has blossomed. More practical applications of combining these two areas include assisting individuals with disabilities in performing everyday tasks, creating a safer work environment for blue-collar workers, and supporting first responders in search-and-rescue missions through the use of autonomous drones. With this in mind, our team’s goal is to develop neural network systems that enable an autonomous robotic arm to locate, pick up, and deliver objects to a goal. This requires two main neural networks: a computer vision classifier for detecting objects and a supervised or reinforcement learner to drive the movement of the robotic arm.

Data

Classifier: We started by gathering a dataset containing pictures of the objects we wanted to detect. We decided to use colored blocks, as they would be an easy object for a real robotic arm to pick up. We ordered colored blocks from Amazon and took over 250 pictures of them. While gathering this data we considered many different variables, such as lighting conditions, foreground/background noise, and relative block placement, and attempted to collect a dataset covering a wide variety of values for each. We then preprocessed the pictures, reducing their size and quality, to expedite the training process. After the pictures were preprocessed, we hand-labeled the occurrences of each colored block in each image. The last step was to convert the XML label files for the images into a single CSV file; this made it easier to load the data into our model. Once this was complete, the dataset was ready to be split into training and testing data and fed to our model.
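Below is a minimal sketch of how such a conversion step can be written, assuming Pascal VOC-style XML annotations (as produced by common labeling tools); the field names and file paths are illustrative rather than taken from our actual script.

```python
# Sketch: collapse per-image XML annotation files into one CSV of bounding boxes.
# Assumes Pascal VOC-style XML (filename, object/name, object/bndbox fields).
import glob
import xml.etree.ElementTree as ET

import pandas as pd

def xml_to_csv(annotation_dir: str) -> pd.DataFrame:
    rows = []
    for xml_file in glob.glob(f"{annotation_dir}/*.xml"):
        root = ET.parse(xml_file).getroot()
        filename = root.find("filename").text
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append({
                "filename": filename,
                "class": obj.find("name").text,      # e.g. "red_block"
                "xmin": int(box.find("xmin").text),
                "ymin": int(box.find("ymin").text),
                "xmax": int(box.find("xmax").text),
                "ymax": int(box.find("ymax").text),
            })
    return pd.DataFrame(rows)

# xml_to_csv("annotations").to_csv("block_labels.csv", index=False)
```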


Supervised: The Mujoco FetchPickAndPlace-v1 environment provides an observation dictionary that describes the entirety of the environment’s state space. At each time step an action is determined and used as the input to the environment’s step function. The output of this function redefines the environment’s state, a new observation is returned, and the process repeats. As a first attempt at solving the pick-and-place task, a neural network using supervised learning (i.e., no reward function) was implemented. Supervised learning requires training data in the form of input-output pairs, and given the observation-action design of the Mujoco environment, it was logical to use this information as the input-output pairs for the neural network.

The observation dictionary has three components: observation, desired_goal, and achieved_goal. The observation component fully describes the space, including the desired goal location; it is extracted and reformatted into a vector of length 20 and used as the input data. The action taken at each time step is represented as a vector of length four. The first three values correspond to the direction of the gripper’s motion during that time step, and the fourth component defines the new state of the gripper.

A solution was hard-coded using only the information in the observation at each time step to define the appropriate action. The task broke down into three components, each of which required specific direction: first, the robotic arm needed to move into a position from which it could grab the block without knocking it off the table; second, it needed to grab the block; and third, it needed to carry the block to the goal location. The Mujoco environment has no way of inherently recognizing whether the robot is holding the block, so each step of the task was coded using only location information from the simulated space. The observation and the resulting action were recorded as pairs at every time step of the full task. One simulation has 50 time steps, and 200 simulations with randomized initial states were run, yielding 10,000 observation-action pairs in total; 5,000 were used for training and 5,000 for validation.
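The sketch below shows what this data-collection loop looks like, assuming the standard Gym interface for FetchPickAndPlace-v1. The helpers `scripted_action` (the hand-coded pick-and-place policy) and `flatten_observation` (the reformatting into the length-20 input vector) are hypothetical names passed in as arguments, standing in for code that is not reproduced here.

```python
# Sketch: record (observation, action) pairs from the hard-coded policy.
import gym
import numpy as np

def collect_pairs(scripted_action, flatten_observation,
                  n_episodes=200, steps_per_episode=50):
    env = gym.make("FetchPickAndPlace-v1")
    inputs, targets = [], []
    for _ in range(n_episodes):
        obs = env.reset()                    # dict: observation / achieved_goal / desired_goal
        for _ in range(steps_per_episode):
            action = scripted_action(obs)    # hand-coded policy -> length-4 action (dx, dy, dz, gripper)
            inputs.append(flatten_observation(obs))   # length-20 input vector
            targets.append(action)
            obs, reward, done, info = env.step(action)
    return np.array(inputs), np.array(targets)
```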

Reinforcement Learning: We used the OpenAI Gym Mujoco environments to collect our data. The environment gave us an observation consisting of 10 continuous numbers describing the state of the arm and 3 continuous numbers representing the x, y, z position of the goal. This information was returned as a dictionary, so it was necessary to write a wrapper for the OpenAI Gym environment that converts it into a simple flat list.
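A minimal sketch of such a wrapper is shown below, assuming Gym’s ObservationWrapper interface; the details of our actual wrapper may differ.

```python
# Sketch: flatten the Fetch environments' dict observations into one vector.
import gym
import numpy as np

class FlattenDictObs(gym.ObservationWrapper):
    """Concatenate the arm-state observation with the goal position."""

    def __init__(self, env):
        super().__init__(env)
        spaces = env.observation_space.spaces
        flat_dim = spaces["observation"].shape[0] + spaces["desired_goal"].shape[0]
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(flat_dim,), dtype=np.float32
        )

    def observation(self, obs):
        return np.concatenate([obs["observation"], obs["desired_goal"]]).astype(np.float32)

# env = FlattenDictObs(gym.make("FetchReach-v1"))   # 10 + 3 = 13-dimensional observations
```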


Architecture Design and Training

Classifier: To determine which network architecture would work best for our specific problem, we analyzed, studied, and modified many different architectures from the TensorFlow object-detection model zoo. We split our dataset into 80% training and 20% testing, then trained and evaluated many different models. Depending on a classifier's initial performance, we would either tweak the network's parameters or move on and test another model. After much testing we found two networks that worked particularly well but differed enough in performance that it was worth evaluating and comparing their results. The first network is an R-CNN with an Inception architecture. This is a deep network with auxiliary classifiers to assist with convergence and a good balance between the number of layers and the size of each layer. It also includes a few inception modules before the network outputs, which allow for dimensionality reduction and parallel structures, mitigating the impact of structural changes in nearby components (e.g., dropout). This model was much faster to train than our second model. The second network is an R-CNN with a ResNet architecture. This, again, is a deep network, but with residual (skip) connections and a strong emphasis on batch normalization, and it is composed only of convolutional layers. Training this model took about three times as long as the previous one.
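A reasonable way to perform the 80/20 split is at the image level, so that all boxes from a given picture land on the same side of the split. The snippet below sketches that step, assuming the combined annotation CSV produced earlier (the file names are illustrative).

```python
# Sketch: split the labeled data 80/20 by image so no picture appears in both sets.
import numpy as np
import pandas as pd

labels = pd.read_csv("block_labels.csv")
filenames = labels["filename"].unique()

rng = np.random.default_rng(seed=0)
rng.shuffle(filenames)

split = int(0.8 * len(filenames))
train_files, test_files = set(filenames[:split]), set(filenames[split:])

labels[labels["filename"].isin(train_files)].to_csv("train_labels.csv", index=False)
labels[labels["filename"].isin(test_files)].to_csv("test_labels.csv", index=False)
```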


Supervised: The model architecture used for the supervised learning network was a fully connected feed-forward neural network with five hidden layers, each using relu activation followed by dropout; the output layer used softmax activation. From first to fifth, the hidden layers had widths 128, 256, 512, 256, and 128. Some publications that work with the FetchPickAndPlace environment use networks with three hidden layers of width 256, but that architecture did not perform as well for us as the deeper network whose layers widen and then narrow again. The Adam optimizer, a variation of stochastic gradient descent, was used for learning, categorical cross-entropy was used to calculate the loss, and the learning rate was 0.001.
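A Keras sketch of this architecture is shown below. The input length (20), layer widths, activations, optimizer, and learning rate come from the description above; the dropout rate and the size of the softmax output (`num_action_classes`, the discretized action encoding) are placeholders, since those details are not reproduced here.

```python
# Sketch of the supervised pick-and-place network (TensorFlow/Keras).
from tensorflow.keras import layers, models, optimizers

def build_supervised_model(num_action_classes, dropout_rate=0.2):
    inputs = layers.Input(shape=(20,))            # length-20 observation vector
    x = inputs
    for width in (128, 256, 512, 256, 128):       # widen, then narrow
        x = layers.Dense(width, activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(num_action_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```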


Reinforcement Learning: This project required the team to study many different algorithms in the areas of reinforcement learning and genetic algorithms. We focused on how these methods can be applied to real-world problems and which algorithms are well suited to simulation; due to time constraints, we could not test every algorithm in our environment. Before moving forward with reinforcement learning on the robotic arm, we created a Python-based Snake game and tested our algorithms there. The agent uses a neural network with binary inputs indicating whether or not it is safe to travel in a given direction, along with other information such as the relative location of the food. The network is composed of two hidden layers of 24 nodes each, using a relu activation function, and an output node with a tanh activation function. During development we explored Q-learning and Markov decision processes. These tools allowed us to create a reward-based function that scores the snake at each time step, which then influenced the prediction of the next state in the Markov decision process. With this setup, the Snake agent earned a score of 47 out of 50 points, demonstrating the reliability of our reinforcement learner. In earlier iterations, the model referenced a Q-table to find a reward at each time step, but this did not prove to be a reliable reward system. From this testing, we determined that Q-learning with neural networks would be an excellent method to try in our Mujoco robotic arm environment. We applied deep Q-learning to the Mujoco environment, but it was unable to converge at a reasonable pace, necessitating another approach.
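For reference, a Keras sketch of the Snake network described above is given below; the number of inputs (`num_inputs`), the optimizer, and the loss are placeholders, since only the hidden-layer sizes and activations are stated above.

```python
# Sketch of the Snake agent's value network (TensorFlow/Keras).
from tensorflow.keras import layers, models, optimizers

def build_snake_network(num_inputs):
    inputs = layers.Input(shape=(num_inputs,))       # danger flags + relative food location
    x = layers.Dense(24, activation="relu")(inputs)
    x = layers.Dense(24, activation="relu")(x)
    output = layers.Dense(1, activation="tanh")(x)   # single tanh output node, as described

    model = models.Model(inputs, output)
    model.compile(optimizer=optimizers.Adam(), loss="mse")   # assumed training setup
    return model
```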

In our final iteration, we used A2C and PPO on the ‘FetchReach’ environment. In the end, we were not able to get the ‘FetchPickAndPlace’ environment to converge, but we attained good results on ‘FetchReach’ and on other OpenAI Gym environments such as the bipedal walker. Our network had an actor and a critic: the actor received 13 observations about the environment and produced 3 continuous numbers representing our action, while the critic took the same 13 observations and produced an expected future value of the state, which we used in our training loop to check whether actions were better or worse than expected.
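The sketch below shows the actor and critic heads with the input/output sizes described above; the hidden-layer widths and activations are assumptions, and the PPO/A2C update itself is not shown.

```python
# Sketch of the actor-critic networks for FetchReach (TensorFlow/Keras).
from tensorflow.keras import layers, models

def build_actor_critic(obs_dim=13, action_dim=3, hidden=64):
    obs = layers.Input(shape=(obs_dim,))

    # Actor: observation -> mean of a 3-dimensional continuous action.
    a = layers.Dense(hidden, activation="tanh")(obs)
    a = layers.Dense(hidden, activation="tanh")(a)
    action_mean = layers.Dense(action_dim, activation="tanh")(a)
    actor = models.Model(obs, action_mean, name="actor")

    # Critic: same observation -> scalar estimate of expected future value.
    c = layers.Dense(hidden, activation="tanh")(obs)
    c = layers.Dense(hidden, activation="tanh")(c)
    value = layers.Dense(1)(c)
    critic = models.Model(obs, value, name="critic")

    return actor, critic
```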


Results

Classifier: The Inception R-CNN converged at around 6,000 epochs, with the loss averaging 0.15. Examining the training loss charts for this network, we observed that the plotted loss was very noisy and jumped sporadically throughout training. Upon examination, we determined that the addition of the inception layers may have made it more difficult for the model to consistently learn complex features. We saw “noisier” predictions from this classifier, which resulted in blocks being misclassified or not detected at all. The ResNet R-CNN also converged at around 6,000 epochs. Its plotted training loss was much smoother and averaged 0.07. We believe that the purely convolutional approach helped this model learn more complicated features; this classifier performed much better at identifying blocks that are partially occluded by other objects or blocks, as well as multiple blocks in the same scene.

Supervised: After 200 epochs the neural network appeared to have converged, at which point the accuracy was around 0.75. Testing the trained model on randomized initial conditions showed that this figure was misleading. At the beginning of each simulation the robot would move towards the block, and during the second part of the simulation it would move towards the goal position with a closed gripper. Unfortunately, it was not learning to grab the block, so the overall purpose of moving the block from one location to another was missed. The action of grabbing the block is represented by only a relatively small number of training pairs, so this element of the task carried little weight in the accuracy metric, which was determined by whether or not an action was appropriate given an observation. We decided to move forward with reinforcement learning algorithms instead of pursuing this approach further.

Reinforcement Learning: Using the methods described above, we were able to solve the ‘FetchReach’ environment in the OpenAI Gym, in which a robotic arm must move to a predetermined random point in space. We were unfortunately unable to solve the ‘FetchPickAndPlace’ environment, in which the arm must take a block from a table and move it to a predetermined random point in space. However, while attempting to convince ourselves that PPO was right for our problem, we solved another environment: ‘BipedalWalker’, in which a robot is incentivized to walk over uneven terrain. There we were able to produce very consistent forward locomotion. In the end, PPO provided much better results than our earlier approaches and was able to capture the continuous nature of our different environments.

Workload Distribution

Aditya: I was tasked with creating the game Snake in Python, as well as implementing several different agent models that would attempt to beat the game. I started by creating a simple neural network and, over many iterations, tweaked the architecture to increase agent performance. I used techniques associated with Q-learning and Markov decision processes.

Ananth: I worked on researching and narrowing the focus down to the Mujoco environments ‘FetchReach’ and ‘FetchPickAndPlace’, which were used for our simulations, as well as custom reward functions in case the default functions needed improvement. Furthermore, I researched the steps necessary to transfer the work from our simulation to the physical world, including the equipment and sensors needed to track object locations and movements so that physical coordinates can be mapped to virtual coordinates as in the simulation. While we were not able to physically construct the robot arm, I learned a lot about the nuances and physical challenges of going from an idealized environment to a real one. Finally, I prepared the corresponding slides about the Mujoco environments and the physical environment in our presentation.

Andrew: I worked on researching and implementing reinforcement learning in our project. While I was not thrilled with our inability to solve the ‘FetchPickAndPlace’ environment, I did learn a ton. I started out by implementing deep Q-networks on our robotic arm problem. Unfortunately that did not work in our Mujoco environment, so I decided instead to try our network on the virtual Pong environment, where trials proved much more successful. I then switched focus to Proximal Policy Optimization (PPO) after talking with PhD student Stephen McAleer, who was incredibly helpful. I learned a lot, read many research papers, and dug through many blog posts, and I really enjoyed learning about reinforcement learning.

Eisah: I worked on the computer vision object-detection classifier. I took all of the photos used in our dataset, preprocessed them, and assisted Shivani in labeling them. I downloaded, analyzed, and modified many of the models from the TensorFlow model zoo and trained several of them over the course of a few weeks. I analyzed and compared the training results and decided on the two networks to focus on for the presentation. In this process I learned how to use the TensorFlow object detection library as well as cuDNN for GPU acceleration. Additionally, I assisted in setting up the Mujoco environment for the robotic arm and set up some of our initial models for testing and exploration. I prepared the introduction and computer vision slides in our presentation, and proofread, reformatted, and wrote my corresponding sections of this final document. I also handled miscellaneous administrative tasks (Canvas team sign-up, document submission, etc.).

Shivani: I worked on the supervised learning component of this project. This involved developing a mechanism for creating a training dataset, implementing and training a neural network model, and then testing it. I also wrote the corresponding paper sections and spent some time labeling blocks for the computer vision model. I am disappointed that the supervised learning approach was unsuccessful, but I think reinforcement learning is ultimately a more appropriate approach for this type of problem, and I learned a lot about how neural networks operate by working on it.