As I understand the function of the kernels, each kernel is responsible for detecting a certain type of feature.
Why are not all features (= N kernels) already taken into account in the first phase (= filtering stage) of “depthwise separable convolutions”? Do we not lose (or at least diminish) some features from the original image if we apply the N kernels (= filters) only in the second, cascaded phase?
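To make the two phases concrete, here is a minimal NumPy sketch of how I picture them. It assumes stride 1, no padding, and no bias or nonlinearity. The function names (`depthwise_stage`, `pointwise_stage`) and the sizes (M = 3 input channels, N = 16 output kernels, 3 x 3 spatial kernels) are my own illustrative choices, not taken from any library or from the paper.

```python
import numpy as np

def depthwise_stage(x, dw_kernels):
    """Filtering stage: one k x k kernel per input channel, no channel mixing.

    x:          input of shape (H, W, M)
    dw_kernels: kernels of shape (k, k, M), one spatial kernel per channel
    returns:    array of shape (H - k + 1, W - k + 1, M)
    """
    h, w, m = x.shape
    k = dw_kernels.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, m))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = x[i:i + k, j:j + k, :]  # shape (k, k, M)
            # Sum over the spatial axes only; channels stay separate.
            out[i, j, :] = np.sum(patch * dw_kernels, axis=(0, 1))
    return out

def pointwise_stage(y, pw_kernels):
    """Combining stage: N kernels of size 1 x 1 x M mix the M channels.

    y:          depthwise output of shape (H', W', M)
    pw_kernels: weights of shape (M, N)
    returns:    array of shape (H', W', N)
    """
    return y @ pw_kernels

# Illustrative sizes (assumptions): 8 x 8 image, M = 3, N = 16, k = 3.
rng = np.random.default_rng(0)
M, N, k = 3, 16, 3
x = rng.standard_normal((8, 8, M))
dw = rng.standard_normal((k, k, M))
pw = rng.standard_normal((M, N))

out = pointwise_stage(depthwise_stage(x, dw), pw)
print(out.shape)  # (6, 6, 16)

# Weight counts for the two designs (biases ignored).
separable_params = k * k * M + M * N  # 27 + 48 = 75
standard_params = k * k * M * N       # 432
print(separable_params, standard_params)  # 75 432
```

In this sketch the first phase produces only M filtered channels, and each of the N kernels in the second phase sees all M of them at every pixel, which is exactly the step my question is about.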
Because of the length and the formulas involved, I have worked through my thoughts separately and believe I have arrived at a result in “this linked paper”. But is the answer correct?