Abstract:
It is hard for previous cross-view image translation methods, which directly adopt a simple encoder-decoder or U-Net structure, to generate high-quality images at the target view, especially when the views differ drastically or the deformation is severe. To ease this problem, we propose a novel two-stage framework with a new Cascaded Cross MLP-Mixer (CrossMLP) sub-network in the first stage and a refined pixel-level loss in the second stage. In the first stage, the CrossMLP sub-network learns the latent transformation cues between the image code and the semantic map code via our novel cross MLP-Mixer blocks. The coarse results are then generated progressively under the guidance of those cues. Moreover, in the second stage, we design a refined pixel-level loss that eases the noisy semantic label problem in the cross-view translation task in a much simpler fashion for better network optimization. Extensive experimental results on the Dayton~\cite{vo2016localizing} and CVUSA~\cite{workman2015wide} datasets show that our method generates significantly better results than state-of-the-art methods. The source code, data, and trained models will be made available.