Welcome to ShortScience.org!

  • ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
  • The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
  • Reading summaries of papers is useful for getting another reader's perspective and insight: why they liked or disliked a paper, and their attempt to demystify its complicated sections.
  • Writing summaries is also a good exercise for understanding a paper's content, because explaining it forces you to challenge your assumptions.
  • Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.


Summary by CodyWild 3 years ago

The goal of this paper is to learn a model that embeds 2D keypoints (the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding, such that nearby points in embedding space correspond to poses that are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparently-distinct 2D representations actually map to the same 3D pose.

To do this, the basic approach used by the authors (which has a few variants) is:

  • Take a dataset of 3D poses, and corresponding 2D projections
  • Define a notion of "matching" 3D poses, based on a parameter kappa, which designates the maximum average per-joint distance at which two 3D poses can be considered the same
  • Construct triplets composed of an anchor pose, a "positive" pose (a different 2D pose with a matching 3D pose), and a "negative" pose (some other 2D pose sampled from the dataset using a strategy that explicitly seeks out hard negative examples)
  • Calculate a triplet loss that pushes positive examples closer together and pulls negative examples farther apart. This is specifically done by defining a probabilistic representation of p(match | z1, z2), the probability of a match in 3D space given the embeddings of the two 2D poses. This is parametrized using a sigmoid with trainable parameters (see the sketch after this list)

  • They then calculate a distance kernel as the negative log of that probability, and compute the basic triplet loss, which tries to maximize the difference between the distance between negative examples and the distance between positive examples.
  • They also add an additional loss further incentivizing the match probability to be higher on the positive pair (in addition to just pushing the positive and negative pair further apart)
  • The final loss term is a Gaussian prior loss, incentivizing the learned embeddings z to follow a Gaussian distribution
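
A minimal sketch of this triplet setup, assuming the match probability takes the form sigmoid(b - a * d(z1, z2)) with trainable scalars a and b (the exact parametrization is given in the paper's figure, which is not reproduced here); names and the margin value are illustrative:

```python
import torch
import torch.nn.functional as F

# Trainable scalars of the sigmoid match-probability head (illustrative initialization).
a = torch.nn.Parameter(torch.tensor(1.0))  # scale on the embedding distance
b = torch.nn.Parameter(torch.tensor(0.0))  # bias

def match_prob(z1, z2):
    """p(match | z1, z2): sigmoid of a negated, scaled embedding distance."""
    d = (z1 - z2).pow(2).sum(dim=-1)       # squared Euclidean distance per pair
    return torch.sigmoid(b - a * d)

def triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Distance kernel = -log p(match); push positives together, negatives apart."""
    d_pos = -torch.log(match_prob(z_anchor, z_pos) + 1e-8)
    d_neg = -torch.log(match_prob(z_anchor, z_neg) + 1e-8)
    # Standard triplet hinge: d_neg should exceed d_pos by at least `margin`.
    l_triplet = F.relu(d_pos - d_neg + margin).mean()
    # Extra term pushing the match probability toward 1 on the positive pair.
    l_positive = F.binary_cross_entropy(match_prob(z_anchor, z_pos),
                                        torch.ones_like(d_pos))
    return l_triplet + l_positive
```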

This represents the central shape of the method. Some additional variations they explore include:

  • Camera Augmentation: Creating additional triplets by taking existing 3D poses and generating artificial pairs of 2D poses at different camera views
  • Temporal Pose Embedding - Embedding multiple temporally connected poses, rather than just a single one
  • Keypoint Dropout - To simulate situations where some keypoints are occluded, the authors tried training with some keypoints dropped out, either selected at random or selected jointly and non-independently based on a model of which keypoints are likely to be occluded together (the random variant is sketched after this list)
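
A minimal sketch of the random variant of keypoint dropout; the drop probability and masking convention are assumptions for illustration, not values from the paper:

```python
import numpy as np

def random_keypoint_dropout(keypoints_2d, drop_prob=0.1):
    """keypoints_2d: array of shape (num_joints, 2). Returns the keypoints with a
    random subset zeroed out, plus a visibility mask, to simulate occlusion."""
    num_joints = keypoints_2d.shape[0]
    visible = np.random.rand(num_joints) > drop_prob  # True = keypoint kept
    dropped = keypoints_2d * visible[:, None]          # occluded joints zeroed out
    return dropped, visible
```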

The authors found that their method was generally quite a bit stronger than prior approaches on the task of querying for similar 3D poses given a 2D pose input, including some alternate methods that do direct 3D pose estimation.

Summary by CodyWild 3 years ago

Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server.

Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:

  • Sending a copy of the current model to each device/client
  • Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device
  • Sending that gradient back to the central server
  • Averaging those gradients and applying them all at once to a central model

The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally and sends the resulting model back to the central server, which performs model averaging over the parameter vectors. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation is spread across so many machines), they ask whether the communication required to reach a given accuracy could be reduced by performing multiple steps of gradient calculation and update on a given device before sending the resulting model back to the central server to be averaged with other clients' models.

Specifically, their algorithm, FedAvg, works by:

  • Dividing the data on a given device into batches of size B
  • Calculating an update on each batch and applying these updates sequentially to the starting model sent over the wire from the server
  • Repeating this for E epochs before sending the resulting model back to the server for averaging (see the sketch after this list)
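
A minimal sketch of this local-update-and-average loop, with illustrative model handling and learning rate (the paper's full algorithm also weights each client's contribution by its number of examples):

```python
import copy
import torch
import torch.nn.functional as F

def client_update(global_model, data_loader, epochs, lr=0.01):
    """Run E local epochs of SGD on one client's data, starting from the
    global model, and return the resulting local parameters."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data_loader:                      # batches of size B
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return model.state_dict()

def server_average(client_states):
    """Average the clients' parameter vectors into the new global model."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg
```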

Conceptually, this should work perfectly well in the world where the data in each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in federated learning, where a given user and device might cover a very specialized part of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even when the IID condition is met.

The authors empirically investigate whether these sorts of pathological cases arise when simulating a federated learning procedure on MNIST and on a language model trained on Shakespeare, trying a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their setup: different "devices" hold non-overlapping sets of digits).

They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communication than FedSGD requires (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device). The authors argue that this shows the practical usefulness of a federated learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases.

Learning to Ground Multi-Agent Communication with Autoencoders
Toru Lin and Minyoung Huh and Chris Stauffer and Ser-Nam Lim and Phillip Isola
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.LG, cs.AI, cs.CL, cs.MA

Summary by CodyWild 3 years ago

In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for a communication channel between the players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symbols so as to use them effectively.

This paper suggests a straightforward and clever approach: instead of just having agents communicate using arbitrary vectors produced as part of a policy, the communication vectors are directly tied to the content of an agent's observations. Specifically, this is done by taking the encoding of the image that is used for making policy decisions, and passing that encoding through an autoencoder, using the bottleneck at the middle of the autoencoder as the communication vector sent to other agents. This structure incentivizes the agent to generate communication vectors that are intrinsically grounded in the observation, enforcing a certain level of consistency that the authors hope makes it easier for the other agent to follow and interpret the communication.
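
A minimal sketch of this structure as I read it, with an observation encoder whose autoencoder bottleneck doubles as the outgoing message; module names, sizes, and the reconstruction target are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SpeakingAgent(nn.Module):
    def __init__(self, obs_dim=128, hidden_dim=64, msg_dim=16, n_actions=5):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # Autoencoder over the observation pipeline; the bottleneck is the message.
        self.to_message = nn.Linear(hidden_dim, msg_dim)
        self.decoder = nn.Linear(msg_dim, obs_dim)
        # The policy consumes the agent's own encoding plus the message it receives.
        self.policy = nn.Linear(hidden_dim + msg_dim, n_actions)

    def forward(self, obs, received_msg):
        h = self.obs_encoder(obs)
        msg = self.to_message(h)                               # grounded communication vector
        recon_loss = (self.decoder(msg) - obs).pow(2).mean()   # autoencoder reconstruction loss
        action_logits = self.policy(torch.cat([h, received_msg], dim=-1))
        return action_logits, msg, recon_loss
```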

Empirically, there seems to be fairly compelling evidence that this autoencoder-based form of grounding is more stable, and thus more mutually learnable, than learning a communication protocol from RL alone. The authors even found that adding RL training to the autoencoder-based training deteriorated performance.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari and Liangzhe Yuan and Rui Qian and Wei-Hong Chuang and Shih-Fu Chang and Yin Cui and Boqing Gong
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV

Summary by CodyWild 3 years ago

This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model.

The basic premise is:

  • Tokenize all three modalities into a sequence of embedding tokens. For video, split the input into patches and linearly project the voxels of each patch to get a per-token representation. For audio, use a similar strategy with waveform patches. For text, use the normal per-token embedding. Combine this tokenization with a modality-specific positional encoding.
  • Run all of these embeddings through a Transformer with shared weights for all three modalities
  • Take the final projected CLS representation for the video patches, and perform contrastive learning against both an aligned audio patch and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared video-audio space, video and text each project into a shared video-text space, with pair-specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text (the video-audio half is sketched after this list)
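
A rough sketch of the video-audio half of this setup, using a standard InfoNCE-style batch contrastive loss over the shared projection space; the dimensions, temperature, and omission of the soft nearest-neighbor text alignment are simplifications of mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_shared = 768, 256                   # illustrative dimensions
proj_video = nn.Linear(d_model, d_shared)      # video -> shared video-audio space
proj_audio = nn.Linear(d_model, d_shared)      # audio -> shared video-audio space

def video_audio_contrastive(video_cls, audio_cls, temperature=0.07):
    """InfoNCE over a batch: aligned (video, audio) clips are positives,
    every other pairing in the batch is a negative."""
    v = F.normalize(proj_video(video_cls), dim=-1)
    a = F.normalize(proj_audio(audio_cls), dim=-1)
    logits = v @ a.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))          # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```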

One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just says: "hey, we have a high-resolution input, what if we just randomly drop tokens within our sequence to reduce the O(S^2) cost in sequence length S?" This obviously comes with some performance cost, but they found it not very dramatic.
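
DropToken itself is simple enough to sketch in a few lines; the drop rate here is an arbitrary illustrative value:

```python
import torch

def drop_token(tokens, drop_rate=0.5):
    """tokens: (batch, seq_len, dim). Keep a random subset of sequence positions,
    reducing the quadratic attention cost in sequence length."""
    batch, seq_len, dim = tokens.shape
    n_keep = max(1, int(seq_len * (1.0 - drop_rate)))
    keep_idx = torch.rand(batch, seq_len).argsort(dim=1)[:, :n_keep]  # random positions
    keep_idx = keep_idx.sort(dim=1).values                            # preserve order
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```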

Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.

Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
Mark Sandler and Andrew Howard and Menglong Zhu and Andrey Zhmoginov and Liang-Chieh Chen
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV

Summary by CodyWild 3 years ago

This work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices.

The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, they continue to use this basic Depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically and naively, one could just use lower-dimensional internal representations that match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities destroy information (because of the region where all inputs are mapped to zero), so having layers contain only the number of dimensions needed for the manifold would leave you with too few dimensions after the ReLU's information loss. At the same time, you need non-linearities somewhere in the network in order to learn complex, non-linear functions.

So, the authors suggest a method that mostly uses smaller-dimensional representations internally, but still maintains ReLUs and the network's needed complexity:

  • A lower-dimensional input is "projected up" into a higher-dimensional representation
  • A ReLU is applied to this higher-dimensional representation
  • That representation is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
  • A residual connection links the lower-dimensional representations at the beginning and end of this expansion

This way, we still maintain the network's non-linearity, but also replace some of the network's higher-dimensional layers with lower-dimensional linear ones (see the sketch below).
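
A minimal sketch of such an inverted-residual linear bottleneck block; the channel count and expansion factor are illustrative, and the depthwise convolution from the MobileNet backbone is placed in the expanded space:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of an inverted-residual linear bottleneck (MobileNetV2-style)."""
    def __init__(self, channels=24, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1) Project the low-dimensional input up to a wider representation.
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),            # 2) non-linearity in the wide space
            # Depthwise conv gathers spatial information in the wide space.
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) Project back down with a *linear* (no activation) bottleneck.
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # 4) Residual connection between the low-dimensional input and output.
        return x + self.block(x)
```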
