To ensure proper playback, the audio and video data must always be multiplexed together along with the timing information. This is what container formats such as MP4, 3GPP, or MOV do.
In these file formats the audio and video data are partitioned into chunks, and each chunk's playout time is recorded. This lets players know when to put a frame on the screen and when to send audio samples to the speaker - irrespective of how the chunks arrive. Usually there is enough buffering that, even after all network delays, and even if audio and video experience different delays in transit, the data still reaches the renderer in time.
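Here is a minimal sketch of that idea (the chunk list, buffer size, and track names are all made up for illustration, not taken from any real demuxer): each demuxed chunk carries its own playout timestamp, so the player schedules rendering from the timestamps alone, regardless of the order in which the chunks arrived.

```python
import time

# Hypothetical demuxed chunks: (playout_time_seconds, track, payload).
# Note they are deliberately out of order, as they might arrive off the network.
chunks = [
    (0.040, "video", b"<frame 1>"),
    (0.000, "audio", b"<samples 0-959>"),
    (0.020, "audio", b"<samples 960-1919>"),
    (0.000, "video", b"<frame 0>"),
]

BUFFER_SECONDS = 0.200  # pre-roll buffer to absorb differing network delays


def play(chunks):
    # Order strictly by the playout timestamps recorded in the container,
    # never by arrival order.
    start = time.monotonic() + BUFFER_SECONDS
    for playout_time, track, payload in sorted(chunks):
        delay = (start + playout_time) - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        if track == "video":
            print(f"{playout_time:.3f}s  display frame   {payload!r}")
        else:
            print(f"{playout_time:.3f}s  push to speaker {payload!r}")


play(chunks)
```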
If you use container formats such as the ones mentioned above, RTP doesn't need to know whether a particular packet carries audio or video.
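As a sketch of what that means on the sending side (the payload type, SSRC value, and timestamp increments below are illustrative, not from any RTP profile): the packetizer just slices the muxed container output into payloads and never inspects whether a given slice holds audio or video bytes.

```python
import struct


def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96):
    # Minimal 12-byte RTP header: version 2, no padding/extension/CSRC, no marker.
    header = struct.pack(
        "!BBHII",
        0x80,                    # version 2
        payload_type & 0x7F,     # payload type (dynamic, illustrative)
        seq & 0xFFFF,            # sequence number
        timestamp & 0xFFFFFFFF,  # RTP timestamp
        ssrc & 0xFFFFFFFF,       # SSRC label for this stream
    )
    return header + payload


def packetize(muxed_bytes, ssrc, mtu_payload=1400):
    packets = []
    for seq, offset in enumerate(range(0, len(muxed_bytes), mtu_payload)):
        chunk = muxed_bytes[offset:offset + mtu_payload]  # opaque slice of the mux
        packets.append(rtp_packet(chunk, seq, timestamp=seq * 3600, ssrc=ssrc))
    return packets


# The packetizer has no idea which slices contain audio and which contain video.
pkts = packetize(b"\x00" * 5000, ssrc=0x1234ABCD)
print(len(pkts), "RTP packets,", len(pkts[0]), "bytes in the first one")
```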
One more thing - the SSRC doesn't really provide any crucial timing information by itself; it is only a label. For example, if a DVR is receiving data from 16 cameras (and 16 microphones for audio), it needs a reference for each such source. The SSRC is just an address or identifier, not a source of timing information. So if, logically, audio and video come from the same source, they can carry the same tag.
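To make the "only a label" point concrete, here is a small sketch of how such a hypothetical DVR might route packets (the SSRC values and camera names are invented for the example): it keeps a table mapping each SSRC to a logical source and dispatches by that label alone, while timing still comes from the RTP timestamps and the container, not from the SSRC.

```python
import struct

# Hypothetical assignment: one SSRC per camera. Audio and video that logically
# belong to the same camera could just as well share the same tag.
sources = {0x1000 + n: f"camera-{n:02d}" for n in range(16)}


def ssrc_of(packet):
    # The SSRC occupies bytes 8..11 of the fixed RTP header.
    return struct.unpack_from("!I", packet, 8)[0]


def route(packet):
    label = ssrc_of(packet)
    name = sources.get(label, "unknown source")
    print(f"packet from SSRC 0x{label:08X} -> {name}")


# Fake 12-byte header plus a dummy payload, tagged with camera-03's SSRC.
pkt = struct.pack("!BBHII", 0x80, 96, 1, 90000, 0x1003) + b"..."
route(pkt)
```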