We have pretty good fake voice technology, but afaik it requires significant amounts of audio.
For example, Google Translate has pretty good voices for many languages, but it does not need to convey different emotions, and they're not concerned with imitating unusual intonation patterns (as they would need to be to imitate Trump or Obama, who both have fairly distinctive intonations). For the voices that are the highest quality, like say French or Hindi (the Hindi voice is also very good at English, actually), they are most likely using concatenative synthesis, which is essentially stitching the required sounds together from a database of recordings. For the sound /k/ in "recording", for example, you wouldn't want to use just any /k/; you'd want one that follows a vowel like the one in "re" and is followed by a vowel like the one in "or". Your database works best if it covers as many such combinations as possible, which then allows you to generate any novel word or sequence of words (since the first/last sound of the next/preceding word will also have an effect).
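To make the "not just any /k/" point concrete, here is a toy sketch of picking snippets by their neighbouring sounds. Everything in it is made up for illustration: the Unit class, the clip names, the rough phone symbols and the simple 0-to-2 context score are my assumptions, not how Google or any real system does it. Real systems also weigh how smoothly adjacent snippets join, which this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str  # the sound this snippet contains
    left: str   # the sound that preceded it in the original recording
    right: str  # the sound that followed it in the original recording
    clip: str   # hypothetical identifier for the audio snippet


def context_score(unit, left, right):
    """0, 1 or 2: how many of the snippet's neighbours match the target's."""
    return (unit.left == left) + (unit.right == right)


def select_units(phones, database):
    """For each target sound, pick the recorded snippet with the best-matching context."""
    padded = ["#"] + list(phones) + ["#"]  # "#" marks silence at either edge
    chosen = []
    for i in range(1, len(padded) - 1):
        left, phone, right = padded[i - 1], padded[i], padded[i + 1]
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise LookupError(f"no recording of /{phone}/ in the database")
        chosen.append(max(candidates, key=lambda u: context_score(u, left, right)))
    return chosen


# A tiny made-up database; a real one would hold hours of labelled speech.
database = [
    Unit("i", "r", "k", "i_from_recording"),
    Unit("k", "s", "u", "k_from_school"),
    Unit("k", "i", "o", "k_from_recording"),
    Unit("o", "k", "r", "o_from_recording"),
    Unit("o", "g", "t", "o_from_got"),
]

picked = select_units(["i", "k", "o"], database)
print([u.clip for u in picked])
```

With this database, the /k/ recorded between the same two vowels wins over the /k/ from "school". The more contexts the database covers, the less often the selector has to settle for a poor match.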
To get the high quality output of the Google Translate Hindi voice, they most likely have hours of a paid voice actress reading text designed to include many different sound combinations and the types of intonation they need to cover (Google Translate might not need to be able to generate, say, a sarcastic intonation, but they do want "question intonation" and things of that nature).
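I don't know how they actually write those scripts, but one plausible way to get "text designed to include many different sound combinations" is to greedily pick sentences that add the most not-yet-covered pairs of adjacent sounds. The sketch below assumes each candidate sentence already comes with a rough list of its sounds; the function names and the toy data are hypothetical.

```python
def sound_pairs(phones):
    """All pairs of adjacent sounds in a sentence, with '#' for the silent edges."""
    padded = ["#"] + list(phones) + ["#"]
    return set(zip(padded, padded[1:]))


def pick_script(candidates, max_sentences):
    """Greedily choose sentences that each add the most new sound pairs.

    candidates maps a sentence label to its (assumed) list of sounds.
    """
    covered = set()
    script = []
    remaining = dict(candidates)
    while remaining and len(script) < max_sentences:
        best = max(remaining, key=lambda s: len(sound_pairs(remaining[s]) - covered))
        gain = sound_pairs(remaining[best]) - covered
        if not gain:
            break  # nothing left adds any new combination
        script.append(best)
        covered |= gain
        del remaining[best]
    return script, covered


# Toy candidates with made-up transcriptions.
candidates = {
    "cat": ["k", "a", "t"],
    "ka": ["k", "a"],
    "dog": ["d", "o", "g"],
}
script, covered = pick_script(candidates, max_sentences=2)
print(script, len(covered))
```

Here "ka" is skipped because almost everything in it is already covered by "cat", which is the whole idea: you want the voice actress spending her hours on material that adds new combinations.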
It's been a little while since I learned much about voice synthesis, but I imagine this still holds, based on the fact that they're still using robotic-sounding voices for many languages on Google Translate (like Welsh, Serbian and Swahili) and completely lack voice synthesis for others (like Lao, Persian and Hebrew). If they could get high-quality synthesis with only a few minutes of recordings, they would probably be willing to put the little money required into it for these lower-resource languages (Swahili, Persian and Lao do, after all, have tens of millions of speakers each).
I assume that if you want to account for different voice qualities like shouting, singing and so forth, you might need similar amounts of recordings of the appropriate type to be really accurate.
The conclusion I would draw from this is that it would be easiest to generate high-quality imitation voices for high-profile politicians, actors and TV/radio/podcast hosts: anyone who has a lot of recorded data available. I wouldn't be too worried about someone being able to fake your voice based on, say, you telling a telemarketer you're not interested. That's not enough high-quality data. But someone like Donald Trump or Barack Obama? People can definitely generate some pretty good audio of them, even if they couldn't, say, do a good standup comedy routine with all the variation in intonation that type of performance requires.