(GSOC 2016) Regarding the Virtual Singer project idea...

syrma
Hello!

I have been researching the possibility of using a Virtual Singer for MuseScore.

I downloaded and compiled some of the following software from source (and tested others directly from installed packages). I will go over all the software I have looked at or tested before turning to those I consider promising. As I lack the experience and insight to give a definite judgement, I would be grateful for any input.
 
- E-Cantorix (https://github.com/divVerent/ecantorix): 

A Perl singing synthesis program built on eSpeak. Unfortunately, judging by its headache-inducing robotic voice, it doesn't look like something that can be used directly. There may be good ideas to borrow from it, although I have nothing specific in mind yet.

- Festival Speech Synthesis System's singing mode (http://www.festvox.org/festival/ ):

Festival's singing mode turned out to be much more usable than e-cantorix (in my own experience, which may not be representative), although the output still lacks quality. The input for this mode is a special XML file that specifies the notes and their durations for each word (Festival being foremost a speech synthesis system).

As for the singing mode's output, aside from the robotic voice (still far more decent than e-cantorix's), I have to say it sounded pretty random. The British voice would pronounce some words faster than the American one, completely messing up the rhythm, and sometimes the pitch drifts. It is mainly the dissimilarity between how we speak and how we sing that makes the big differences.
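
To make that input format concrete, here is a small generator for this kind of file. A sketch only: I am writing the element names and nesting from memory of the doremi.xml example that ships with the singing mode, so they should be checked against the Singing DTD before being relied upon.

    def singing_xml(notes, bpm=60):
        # notes: list of (word, pitch, beats) tuples, e.g. ("doe", "C4", 1)
        lines = ['<?xml version="1.0"?>', '<SINGING BPM="%d">' % bpm]
        for word, pitch, beats in notes:
            lines.append('<DURATION BEATS="%s"><PITCH NOTE="%s">%s'
                         '</PITCH></DURATION>' % (beats, pitch, word))
        lines.append('</SINGING>')
        return "\n".join(lines)

    print(singing_xml([("doe", "C4", 1), ("ray", "D4", 1), ("me", "E4", 1)]))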

- Sinsy (http://sinsy.sourceforge.net/): 

Aside from the fairly impressive (non-open-source) version presented on their website (Japanese (3 voices), dubious English (2 voices), and Chinese (1 voice) singing synthesis from a MusicXML file), the open source version only supports Japanese, and only one voice is available (clearly of lesser quality than the ones on the website). It uses the hts_engine API (http://hts-engine.sourceforge.net/).

Pros:
- Quite easy to use; it compiled and ran with only minor trouble.
- Supports Japanese well.
- It is straightforward to get results, as it converts directly from MusicXML files (as generated by MuseScore) to audio.
- The free voice can sound pretty decent.

Cons:
- Depending on what kind of project is chosen, integration into MuseScore could be a problem. The software takes a descriptive file and a voice and converts them into audio. That is fine for an external tool, but I am not sure how the audio could be used for real-time playback inside the software.
- Only supports Japanese (there might be a possibility to add other languages through eSpeak).
- Only one voice is available (aside from it being Japanese-only, the lack of choice could be a hindrance).
- The free voice sounds horrible with long notes. (Really.)

- World (https://github.com/mmorise/World):

World is an open source speech analysis/synthesis system, quite unlike anything I've looked at before: it decomposes a voice recording into parameters and resynthesizes it from them. I must admit the result is impressive, very natural-sounding, or at least far from robotic (even when we play with unrealistic parameters). However, it has no notion of language, so something needs to be built on top of it. (vConnect-STAND is a possible option. It is built upon World and sounds nice according to YouTube demos, but I haven't tried it yet; the documentation I've come across is in Japanese, so I am slowly working through it.) A short sketch of what driving World looks like follows the pros and cons below.

Pros:
- Very good results.
- Can be used in real time; it might be possible to integrate it into Mscore.

Cons:
- Very low level.
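
To show how low-level it is (and why something must sit on top), here is roughly what an analysis/transform/resynthesis round trip looks like, assuming the pyworld Python binding; the C++ API is organized the same way:

    import numpy as np
    import soundfile as sf   # assumed here for WAV I/O
    import pyworld           # Python binding of World; using it is my assumption

    x, fs = sf.read("voice.wav")                  # mono recording
    x = np.ascontiguousarray(x, dtype=np.float64)

    # Analysis: per-frame f0 contour, spectral envelope, aperiodicity.
    f0, sp, ap = pyworld.wav2world(x, fs)

    # A singing layer would operate here: e.g. transpose the pitch up
    # 3 semitones while keeping the timbre intact.
    f0_up = f0 * 2.0 ** (3.0 / 12.0)

    y = pyworld.synthesize(f0_up, sp, ap, fs)     # resynthesis
    sf.write("voice_up3.wav", y, fs)

Everything about notes, lyrics, and phoneme timing has to be produced by whatever sits above this layer.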

- QTau (https://notabug.org/isengaara/qtau) and Cadencii (https://github.com/cadencii/cadencii-nt): 

Two free software editors written in C++ and Qt. Neither is a voice synthesis technology in itself, but both make use of vConnect-STAND (in addition to e-cantorix for QTau, and UTAU + Vocaloid for Cadencii). I think the way they do things may be interesting, but I have yet to study them in depth; I would like to do so after figuring out vConnect-STAND.

The ideas page stated that an external tool would be a good way to practice, but I am not sure what kind of project would be best. Depending on this, some tools may or may not be suitable, so I would really like to discuss this project idea.
I would greatly appreciate any input or guidance. Please let me know what I am missing, whether I have disregarded an interesting possibility, or whether I should keep going down this path.

Thank you!

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

David Cuny
Non-developer jumping in again.

Sinsy supports English, and can be accessed via a web service. Send a MusicXML file in, and get a .wav file back. For implementation details, see:

https://pypi.python.org/pypi/sinsy-cli/

Since the Sinsy page itself says it works well with MuseScore's MusicXML, this is probably the simplest approach. Of course, it requires an internet connection.
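
To give a feel for how small such an external tool could be, here is a rough sketch of the round trip. I am guessing at the endpoint and form-field names (sinsy-cli above encodes the real protocol), so treat every name in it as a placeholder:

    import re
    import requests

    SINSY_URL = "http://sinsy.sp.nitech.ac.jp/index.php"   # assumed endpoint

    def render(musicxml_path, out_path="sinsy.wav", speaker="1"):
        with open(musicxml_path, "rb") as f:
            page = requests.post(SINSY_URL,
                                 data={"SPKR": speaker},   # placeholder field
                                 files={"SYNSRC": f})      # placeholder field
        page.raise_for_status()
        # The service replies with an HTML page linking the rendered file.
        m = re.search(r'href="([^"]+\.wav)"', page.text)
        if not m:
            raise RuntimeError("no .wav link found; the protocol guess is wrong")
        url = m.group(1)
        if not url.startswith("http"):
            url = "http://sinsy.sp.nitech.ac.jp/" + url.lstrip("/")
        with open(out_path, "wb") as out:
            out.write(requests.get(url).content)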

Here's a demo I recently did using the Sinsy English male voice:

https://soundcloud.com/dcuny/twinkle-twinkle-little-star-sinsy

Clearly, it would have been nice had they used a native English speaker - all the Sinsy English singers have heavy accents.


Many of the other "traditional" voice synthesis projects that support singing seem to be suffering from bitrot.

A lot of work has gone into creating a Vocaloid clone, and UTAU is pretty much the leader there. One of the biggest problems is that good English vocal synthesis requires a large database of recorded transitions, and a lot of manual effort. I don't think you'd want to expose that sort of editing complexity in MuseScore.


I've been working on a singing program that uses formant synthesis, but it hasn't been released because it's not nearly as natural as the HMM approach, and doesn't handle rapid articulations well:

https://soundcloud.com/dcuny/twinkle-twinkle-little-star-12

-- David



Re: (GSOC 2016) Regarding the Virtual Singer project idea...

Tobias Platen
In reply to this post by syrma
I'm currently working on an eSpeak fork with better singing support (no external Perl script needed) and also on an MBROLA replacement based on WORLD. For a good singing synthesizer, you have to combine several of these programs.

Tobias Platen


Re: (GSOC 2016) Regarding the Virtual Singer project idea...

syrma
In reply to this post by David Cuny
David Cuny wrote:
> Sinsy supports English, and can be accessed via a web service. Send a
> MusicXML file in, and get a .wav file back. For implementation details, see:
>
> https://pypi.python.org/pypi/sinsy-cli/
>
> Since Sinsy says it works well with MuseScore's MusicXML (it says so on the
> Sinsy page), this is probably the simplest approach. Of course, it requires
> an internet connection.
Thank you for the link!
I briefly mentioned the web service, but I am not very confident about using it. I tried the service myself with some MusicXML files, and I must say the results are impressive. However, aside from requiring an internet connection (which might hinder some users), I am not sure about the legal side of it (will it remain free forever?). The open source version is definitely the easiest to build on, but as I lack knowledge about feasibility, input on the matter is very welcome.
Moreover, in the ideal project where the audio would be played during editing in MuseScore, how much would the delay affect the user's experience? Especially since we tend to need a preview frequently, even after very small edits. I will keep the idea in mind, though.

Tobias Platen wrote:
> I'm currently working on an eSpeak fork with better singing support (no
> external perl script needed) and also on an MBROLA replacement based on
> WORLD. For a good singing synthesizer you have to combine multiple of
> those programs.
Indeed, I think one should make the best use of existing projects to get better results. By the way, I have been through your code in QTau (mostly the vconnect_synth part), and I wondered how far exactly you have gotten with v.Connect-STAND? I have been quite interested in it lately, mainly because it seems we can get some good results out of it, but it seems overly buggy with anything that isn't Japanese, and there is little documentation available. It took me a little while to get it to convert an UTAU database on Linux (all thanks to your Debian package), but I'm still struggling with it on Windows using Cadencii. A better build setup would be needed to make it work without all this trouble (looking back, the ease with which Sinsy compiled and ran is probably a big plus).

On another (closely related) subject, I have been discussing on IRC with Lasconic what kind of Virtual Singer project would be best for MuseScore, and I meant to ask you both how you would see it working. How many settings would be necessary for it to work in MuseScore, and if possible, which ones? (Assuming that, ideally, it would let the user play/preview the song the way it is now possible to play notes.) The ones I can think of, looking at something like Cadencii, are the word dictionary (language?), the renderer/synthesiser (if we use more than one), and so on. There are also settings for the singing style (decay, accent, settings for rising and falling movement) and some parameters to pass to World, but I am not sure I am making enough sense of all of them.

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

David Cuny
Mentioning again that I'm not a developer, so take everything with a grain of salt here:

> However, aside from the fact that it requires an internet connection (and that might hinder
> some users), I am not sure about the juridical aspect of it (will it remain free forever?).

I think it's fair to assume that at some point the web service will stop being available.

> The open source version is definitely the easiest to exploit, but as I lack feasibility
> knowledge, inputs on the matter are very much welcome.

The last time I looked into it, the Sinsy source didn't include the tools needed to train the HMM. While those are available elsewhere, there is obviously effort required to:

   * Learn the HMM training tools
   * Record a corpus in the target language
   * Train the HMM
   * Add the HMM to Sinsy

Taking the long-term view, having more than a single voice database is a good thing (no matter the language), so the maintenance cost of building voices is obviously a factor.


> Moreover, in the ideal project where the audio would be played during
> editing on MuseScore, how much can the delay affect the user's experience?

While sung playback during editing would be ideal, that may be a bit difficult in practice.

Vocaloid can play single syllables during editing with no real delay, but requires time to "compile" anything longer.

UTAU has what appears to be a fairly long delay when it constructs the output since it has to call a number of programs to glue things together.

Band in a Box uses Sinsy, and you can hear MIDI playback during editing, but calling Sinsy is a separate step and generates the entire song. Response time from the web service is fairly good; I've never been tempted to go get coffee.


> How many settings would be necessary for it to work on Mscore, and if
> possible, which ones?

Before addressing this, I'll mention a few text-to-phoneme issues you might have to deal with (a sketch of the last point follows the list):

* Vocaloid mixed British and American pronunciations in its dictionary, which leads to bad results.
* Users need to be able to override dictionary values, obviously. Sinsy allows this using square brackets: http://sinsy.sp.nitech.ac.jp/reference.pdf
* Which phoneme system will you use? How can users see the list of phonemes?
* Can users access the dictionary, to choose between options?
* For dictionary lookup, will you reassemble a word from its syllables ("catalog"), or look it up using the supplied hyphenation ("cat-a-log")?
* What happens if the hyphenation is wrong, which often happens?
* What happens if the word isn't in the dictionary? Is there a fallback algorithm?
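
On that last point, the usual shape of a fallback is: dictionary first, letter-to-sound rules second, with the guess flagged for the user to correct. A toy sketch (the dictionary entry and rules are stand-ins, not from any real lexicon):

    PHONE_DICT = {"catalog": ["k", "ae", "t", "ah", "l", "ao", "g"]}  # stand-in

    NAIVE_LTS = {"a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "g": "g",
                 "k": "k", "l": "l", "o": "ao", "r": "r", "t": "t"}   # toy rules

    def to_phonemes(word):
        word = word.lower()
        if word in PHONE_DICT:        # 1. dictionary hit: trusted
            return PHONE_DICT[word], True
        guess = [NAIVE_LTS[c] for c in word if c in NAIVE_LTS]
        return guess, False           # 2. guessed: flag for user review

    print(to_phonemes("catalog"))     # from the dictionary
    print(to_phonemes("kodak"))       # guessed; the editor should mark it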


The most important settings to control (IMNSHO) are the following (a sketch follows the list):

* Minimum duration of note to get automatic vibrato;
* Percent of note to apply vibrato to; and
* How much legato to apply between notes, including under/overshoot
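
A rough numeric sketch of the first two, applied to a per-frame f0 contour; every parameter name and default here is mine, purely for illustration:

    import numpy as np

    FRAME = 0.005    # seconds per f0 frame

    def apply_vibrato(f0, min_len=0.5, portion=0.6,
                      rate_hz=5.5, depth_semitones=0.3):
        # f0: numpy array of pitch values (Hz), one per frame, for one note
        note_len = len(f0) * FRAME
        if note_len < min_len:                   # setting 1: minimum duration
            return f0
        start = int(len(f0) * (1.0 - portion))   # setting 2: percent of note
        t = np.arange(len(f0) - start) * FRAME
        lfo = depth_semitones * np.sin(2.0 * np.pi * rate_hz * t)
        out = f0.copy()
        out[start:] *= 2.0 ** (lfo / 12.0)       # semitone-scaled modulation
        return out

Legato (the third setting) would work on the same contour, smoothing the f0 step between adjacent notes with deliberate under/overshoot instead of modulating the tail.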

-- David



Re: (GSOC 2016) Regarding the Virtual Singer project idea...

syrma
Thank you for your reply.

As for playback, I also think that singing each note the moment it is entered is impossible; we need the lyrics to be set, and even then, the synthesis takes time. But getting it to play the way Cadencii does would probably be good: press play once everything is set. Cadencii takes a while to do that, though, and at some point the time spent waiting for synthesis is probably several times the time spent actually editing (that said, I think a lot of optimization is possible in Cadencii, so it's probably not the best example).

Leaving the questions about dictionaries for later, a side note about my struggles with v.Connect-STAND, Cadencii's synthesis engine. I have finally been able to get some results out of it (by switching between my Linux and Windows machines every time one hits a problem). The rendering is more than decent in my opinion (although it depends a lot on the settings and the voicebank used, and it can sound worse than e-cantorix if not used properly (okay, not that bad, but still)), and I think it is an interesting tool overall (some UTAU users import their voicebanks into v.Connect-STAND to get a better rendering, but it is sometimes a little tricky). However, a few points hinder direct use:

- The Windows binaries won't work unless the system is as Japanese as possible, and while I don't know what is causing this yet (because I am not used to compiling on Windows), it needs a fix.
- Encoding auto-detection is probably needed; even my Linux-built version expects input encoded as Shift-JIS by default (the typical encoding of files created by Japanese users on Windows). It supports other encodings, but the user has to specify them. (A sketch of the fallback idea follows this list.)
- The software takes a meta-text sequence file (its own format) and outputs audio. While I think implementing a conversion from a score to a meta-text sequence would be sufficient for the first part of the project (generating the audio), I believe an optimization might optionally be possible. As v.Connect is based on World (which implements real-time singing synthesis according to its introduction page), I wonder whether it would be possible to change the code to intercept the parameters before the audio is generated and play it in real time. I have not dived far enough into v.Connect's code, so if someone who has thinks I am heading down a wrong and completely impossible path, please let me know.
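
For the encoding point, the auto-detection could be as simple as trying a few likely encodings in order. This is just the idea, not v.Connect's actual code:

    def read_sequence_text(path):
        data = open(path, "rb").read()
        for enc in ("utf-8", "cp932", "euc_jp"):  # cp932 = Windows Shift-JIS
            try:
                return data.decode(enc), enc
            except UnicodeDecodeError:
                continue
        # Last resort: decode lossily so the user can at least see what broke.
        return data.decode("cp932", errors="replace"), "cp932 (lossy)"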

A very interesting point, however, is its ability to convert and use UTAU voicebanks, given the great number of downloadable UTAU voices on the net (let's forget for now about the mass of problems that alone causes). While looking into using English with UTAU voices, I came, among others, across this page: http://utau.wiki/cv-vc (see also: utau.wiki/tutorials:cvvc-english-tutorial-by-mystsaphyr ). This seems popular enough that a lot of UTAU voices use this method to simulate non-Japanese pronunciation. Namine Ritsu, a free voice for v.Connect-STAND (and the most popular one), also has recordings of this kind, although the English rendering is far from perfect, and accents are entirely left for the user to simulate. There are also (non-open-source) plugins that can convert lyrics (or rather sequence files) from CVVC to VCV (another style used in UTAU voicebanks). Even though this lets the user fetch and add voice sets from the internet, I can easily think of a few issues one could run into:

- Making the user input phonetic symbols instead of actual lyrics is not a solution. I think it may be possible to convert lyrics to eSpeak phonemes and implement the remaining conversion step (which would depend on the voice). That brings another set of problems: the user would need to supply both the word and its hyphenation. And even then, other problems are bound to happen, either because a word isn't in the dictionary or because a sound isn't available. In the first case, the user may need to provide the pronunciation (for a proper noun, for example). Besides this, should we let the user modify the pronunciation (after it is automatically generated) to simulate an accent or to make something sound more natural?

- Encoding problems, always. Japanese on Windows is unpredictably tricky to deal with.

- Voicebanks are usually recorded for one specific language. I could be wrong, but for now I don't see how we could detect the language unless the user specifies it. Also, some of the Japanese voicebanks are only compatible with either romaji or kana; we could use kakasi to convert either the lyrics or the voicebank, as sketched below.
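
For the romaji/kana point, the kakasi conversion itself is tiny; I am assuming its Python wrapper, pykakasi, here (calling the kakasi binary would work just as well):

    import pykakasi

    kks = pykakasi.kakasi()
    for item in kks.convert("うたごえ"):
        # each item carries the original, kana, and romaji forms
        print(item["orig"], item["kana"], item["hepburn"])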

Anyway, I don't think any amount of work in one summer would be enough to even consider all the issues (everything is so much more complicated than it first seems). The question is: how much would make an acceptable project?

The project I have in mind for now would be something like the following:

- As a first step, taking care of the usability issues of v.Connect-STAND, or ideally turning it into a usable library.
- Implementing the generation of meta-text sequences (it would be interesting to see how Cadencii, the open source C++/Qt editor, does it). This should include processing whatever settings we have (including phonemes), as this kind of file should provide all the information needed for synthesis.
- Making a MuseScore plugin out of the two aforementioned items. This would include in addition:
      - the front-end (collecting settings)
      - the playback function

Though I don't know if this is relevant to the current discussion (or at all): while looking for good free voice data, I found Namine Ritsu's license very unclear (the site the wiki pages link to for the terms of use no longer exists). There is a separation between the character (visual art, profile, ...) and the voice resources. I suspect from the contradictory official information that the license has changed over time. The character itself seems to be the property of Canon, but there don't seem to be any restrictions on the use of the voices. In addition, this voicebank (http://hal-the-cat.music.coocan.jp/ritsu_e.html) says it is released under the terms of the GPLv3. I assume at least this voicebank is safe enough.
[Unclear official material:
- http://www.canon-voice.com/english/kiyaku.html (the English says something very unclear about the character, but the voice is free)
- http://canon-voice.com/ritsu.html ]

So my immediate questions are:
- Is this a realistic and/or acceptable project?
- I am not aware of the MuseScore plugin rules, so is such an approach alright? If not, what would be better?
- I am not sure where the second part should be integrated, but I think the part that goes into MuseScore should be as general as possible, so that support for other tools can be added gradually.

Sorry for the long post. Please let me know your opinion, and whether I am analyzing things wrong!

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

David Cuny
Non-developer David again:

The issues of running UTAU (and UTAU-derived tools) under a Japanese locale have been enough to keep me from trying it out.

> Making the user input phonetic symbols instead of actual lyrics is
> not a solution.

Sorry, I didn't mean to propose that. I just wanted to note that a fallback that allowed phonetic symbols would be necessary.

As to the rest, my (unofficial) thought is that it currently takes quite a bit of manual intervention to get English working well with the UTAU toolchain, whether it uses VCV or CVVC. And each approach requires a different set of tools to connect the samples together. It seems to me that there's quite a bit of risk of not coming out with something usable at the end.

-- David





Re: (GSOC 2016) Regarding the Virtual Singer project idea...

Tobias Platen


On 03/22/2016 08:18 PM, David Cuny wrote:
> Non-developer David again:
>
> The issues of running UTAU (and UTAU-derived tools) under a Japanese locale
> has been enough to keep me from trying it out.
For European languages eSpeak is the best option, but that will require much more work. eSpeak can convert text to phonetic symbols for many languages.


Re: (GSOC 2016) Regarding the Virtual Singer project idea...

syrma
David Cuny wrote:
> Sorry, I didn't mean to propose that. I just wanted to note that a fallback
> that allowed phonetic symbols would be necessary.
>
> As to the rest, my (unofficial) thought is that it currently takes quite a
> bit of manual intervention to get English working well with the UTAU
> toolchain, whether it uses VCV or CVVC. And each approach requires a
> different set of tools to connect the samples together. It seems to me that
> there's quite a bit of risk of not coming out with something usable at the
> end.
I understand and share your concerns. English is a real problem, from the phonetics to the scarcity of samples. I also think allowing phonetic input is necessary, though we will need opinions on whether that is even an acceptable possibility in MuseScore.

Tobias Platen wrote:
> For European languages eSpeak is the best one, but that will require
> much more work. eSpeak can convert text to phonetic symbols for many
> languages.
Approximately how many eSpeak phonemes are there in English? Although I have thought about writing a program to try to convert every phoneme to its CVVC equivalent, I get the feeling that this is a bad idea. There is simply no exact equivalent, and users who get almost natural results rely on good tricks: playing with either the vowel or the consonant sounds depending on the song's speed, the rhythm, and the consonants pronounced near each other, to get as close as possible to how a human would sing it. The most natural sound isn't the most obvious one, and my level of English is insufficient for such a thing. Even if one went for the most obvious algorithm (provided it is possible), how much time would it need?
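
For reference, the eSpeak half of that pipeline is small: -x prints eSpeak's phoneme mnemonics and -q suppresses the audio. The per-voicebank mapping table is the hard part, and the single entry below is a placeholder, not a real mapping:

    import subprocess

    def espeak_phonemes(text, voice="en"):
        # Returns eSpeak's phoneme mnemonics for the given text.
        out = subprocess.run(["espeak", "-q", "-x", "-v", voice, text],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()

    # The real work: a per-voicebank table from eSpeak phonemes to CVVC aliases.
    CVVC_MAP = {"'oU": ["o", "o U"]}    # placeholder entry

    print(espeak_phonemes("hello"))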

I know that English support is a priority, as it is one of the most used languages, but it seems too risky. Would a project where English support is an optional deliverable be worthwhile? I am not dismissing its priority; it just seems much more feasible to build it on top of something that already exists.

Your input on a better approach that would make English more straightforward is very much welcome!

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

David Cuny

syrma wrote:

> I think the same about the
> necessity to allow phonetics, though we will need opinions on whether
> this is even an acceptable possibility in MuseScore.

Sinsy accepts phonetic input to override its defaults: the phonemes are enclosed in square brackets and separated by commas, as in:

   there will be an an[ae*,n]-swer[s,w,er] let[l,eh*] it[r,iy*] be

See: http://sinsy.sp.nitech.ac.jp/reference.pdf

MuseScore would just need to allow lyrics typed in this format.
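
Parsing that format would be trivial on the MuseScore side. A minimal sketch; the struct and function names are mine, purely illustrative, not existing MuseScore code:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Syllable {
        std::string text;                  // spelled text, e.g. "swer"
        std::vector<std::string> phonemes; // override, e.g. {"s","w","er"};
                                           // empty means "use the dictionary"
    };

    // Parse one lyric token such as "swer[s,w,er]" into text plus an
    // optional phoneme override.
    Syllable parseLyric(const std::string& token)
    {
        Syllable s;
        std::string::size_type open = token.find('[');
        if (token.empty() || open == std::string::npos || token.back() != ']') {
            s.text = token;                // no override present
            return s;
        }
        s.text = token.substr(0, open);
        std::string inner = token.substr(open + 1, token.size() - open - 2);
        std::stringstream ss(inner);
        std::string ph;
        while (std::getline(ss, ph, ','))  // phonemes are comma-separated
            s.phonemes.push_back(ph);
        return s;
    }

    int main()
    {
        Syllable s = parseLyric("swer[s,w,er]");
        std::cout << s.text << ":";        // prints "swer: s w er"
        for (const std::string& ph : s.phonemes)
            std::cout << " " << ph;
        std::cout << "\n";
    }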


> I don't think it lacks priority, but it seems much more feasible
> to implement it over something already existent.

This would be for the devs to answer, not me.

But if the choice is between something that works now but might break in the future (Sinsy) or something that doesn't work now, but might eventually (virtually every other vocal synthesis program), I know which I'd vote for. ;-)

-- David



Re: (GSOC 2016) Regarding the Virtual Singer project idea...

lasconic
Administrator
My impression is that MuseScore users are musicians, singers, and choir directors, and often not technologists or linguists. When we talk about a virtual singer, what they envision is: "I create a SATB score, enter lyrics, perhaps set the language of the lyrics somewhere in the settings, and then I press play and the computer sings".

If the result is good enough, it's good enough. If it can be made very good without more tweaking, even better. But entering phonemes, setting the vibrato, etc.: 80% of users will not do it. I'm not saying we shouldn't offer some of these possibilities for people who want to tweak things, just that in most cases they will not be used.
To put it differently, the goal is not to make Hatsune Miku videos and spend a lot of time tweaking the pronunciation and intonation of the voice. The goal is to get a better audio rendering for voice than the current Oohh and Aahh MIDI sound provided by our soundfont engine.

To illustrate: I feel the MuseScore UI for this feature should be more like the virtual singer in PDF2Music or Harmony Assistant than the one in Vocaloid.

Hope it helps,
lasconic



Re: (GSOC 2016) Regarding the Virtual Singer project idea...

benjisan
In reply to this post by syrma
I'm a choir master and a voice teacher, and I agree with that idea: we are looking for the easiest solution!
I currently do as many other voice teachers and choir masters do: I use MuseScore for editing sheet music, and its rival Harmony Assistant for audio examples. But we would all like to use MuseScore without Harmony Assistant!

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

David Cuny
In reply to this post by lasconic
Lasconic wrote:

> But entering phonemes, setting the vibrato, etc.: 80% of users will not do it.

Yes, sorry if I gave the impression that people want to use phonetics. I just wanted to make the point that Sinsy already supports a format that MuseScore could use, so there's no need to reinvent the wheel here.

Chances are that if the program pronounces a word wrong, the user will try a more creative spelling rather than enter phonetic symbols.

That means that you'll need some sort of routine to guess the pronunciation if the word can't be found in the dictionary, and you'll need to handle incorrectly hyphenated words as well.

It also has to handle "interesting" spelling, even if it seems to break basic rules (like having more than one vowel per syllable), or slurs where a single vowel is spread out over several notes... which gets even more interesting when that vowel is a diphthong.

Being able to do all that robustly and gracefully presents an interesting challenge all by itself, without even getting into the complexity of actual vocal synthesis.
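
For instance, even the simple slur case needs a rule like "onset consonants and vowel on the first note, hold the vowel through the middle notes, coda on the last". A rough sketch, assuming a non-empty phoneme list, at least one note, and a deliberately crude vowel test; all names are mine:

    #include <iostream>
    #include <string>
    #include <vector>

    // Spread one syllable's phonemes over the notes of a slur: onset
    // consonants and vowel on the first note, vowel held through the rest,
    // trailing consonants closing the last note.
    bool isVowel(const std::string& ph)
    {
        return ph.find_first_of("aeiou") != std::string::npos; // crude test
    }

    std::vector<std::vector<std::string>>
    spreadSyllable(const std::vector<std::string>& phonemes, int noteCount)
    {
        std::vector<std::vector<std::string>> perNote(noteCount);
        int vowel = 0;
        while (vowel < (int)phonemes.size() && !isVowel(phonemes[vowel]))
            ++vowel;                               // locate the vowel nucleus
        if (vowel == (int)phonemes.size())
            vowel = (int)phonemes.size() - 1;      // no vowel: hold last phoneme
        for (int i = 0; i <= vowel; ++i)
            perNote[0].push_back(phonemes[i]);     // onset + vowel on first note
        for (int n = 1; n < noteCount; ++n)
            perNote[n].push_back(phonemes[vowel]); // sustain the vowel
        for (int i = vowel + 1; i < (int)phonemes.size(); ++i)
            perNote[noteCount - 1].push_back(phonemes[i]); // coda on last note
        return perNote;
    }

    int main()
    {
        // "let" = l eh t, slurred over three notes
        for (const auto& note : spreadSyllable({"l", "eh", "t"}, 3)) {
            for (const auto& ph : note) std::cout << ph << " ";
            std::cout << "\n";  // prints: "l eh", "eh", "eh t"
        }
    }

A diphthong would additionally need the nucleus split into its two vowel targets across the notes, which is where it gets really interesting.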

-- David



Re: (GSOC 2016) Regarding the Virtual Singer project idea...

syrma
In reply to this post by lasconic
I understand the idea, thank you.

I picture the interaction with the UI as follows:

- The user enters notes and lyrics as they normally would, only setting the instrument to "Virtual Singer". Additional settings would depend on the synthesis tool.
- Each virtual-singer staff that has lyrics is extracted as an independent input for the singing synthesizer.
- Synthesis step (discussed below).
- The output audio needs to be split back into notes, probably by time calculation (tempo, note durations, ...); see the sketch after this list. I am currently looking more deeply into the fluid component to see how to achieve this. (I would be very grateful for developers' input/advice here.)
- The score is played starting from the cursor (like MuseScore does in other situations).
- The steps are repeated after each modification.
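
For the time calculation itself, the idea is just to accumulate note durations scaled by the tempo to find where each note falls in the rendered audio. A minimal sketch assuming a fixed tempo; a real implementation would query MuseScore's tick/tempo map, and every name here is illustrative:

    #include <iostream>
    #include <vector>

    // Where each note falls in the synthesized audio, at one fixed tempo.
    struct NoteSpan {
        double startSec;     // offset into the synthesized audio
        double durationSec;
    };

    std::vector<NoteSpan> noteSpans(const std::vector<double>& quarterLengths,
                                    double tempoBpm)
    {
        const double secPerQuarter = 60.0 / tempoBpm;
        std::vector<NoteSpan> spans;
        double t = 0.0;
        for (double q : quarterLengths) {          // durations in quarter notes
            spans.push_back({t, q * secPerQuarter});
            t += q * secPerQuarter;
        }
        return spans;
    }

    int main()
    {
        // quarter, eighth, half at 120 BPM -> starts 0.0 s, 0.5 s, 0.75 s
        for (const NoteSpan& s : noteSpans({1.0, 0.5, 2.0}, 120.0))
            std::cout << s.startSec << " s, lasting " << s.durationSec << " s\n";
    }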

As for the synthesis step, I am left with two (maybe three) options. To summarize the previous discussion, they are:

- Using Sinsy (open source):

Pros:
- Direct input.
Cons:
- Works only for Japanese. Adding support for other languages might be possible, but the lack of compatible data could be a hindrance;
- Lacks quality compared to the two following options.

- Using Sinsy (web service):

Pros:
- Direct input;
- Supports Japanese, Chinese, and more importantly English (even though the accent isn't all that good, I don't think this is a big problem);
- The output's quality is more than decent.
Cons:
- No guarantee of unlimited availability;
- Requires an internet connection;
- The web service itself isn't all that slow, but depending on the user's internet connection and the file's size, the delay could get really big (waiting for a response, then downloading the audio (one compressed file per staff, which must then be uncompressed), then processing it...);
- No possibility for us to add other languages (unless the Sinsy team does).

- Using v.Connect-STAND:

Pros:
- A lot of possibilities (voices, languages, accents, ...), limited only by the large UTAU voicebank collection available online.
- Could give very good results if used correctly (and with the proper voice set).
- Japanese works directly, and it is possible to add support for several other languages (English, French, ...).
(- There may also be room for performance optimization.)
Cons:
- Using it requires hacking into it to solve several usability issues;
- Indirect input (we would need something to generate its meta-text sequences from scores, with the right settings, and that could be a lot of work);
- There is hardly any documentation, and the code comments are all in Japanese (I can read them with some effort, but it is clearly not ideal);
- Adding support for any language, especially English, will raise problems with phonetics and with converting lyrics to the chosen voice set's own format;
- Many voicebanks are virtual-singing enthusiasts' products, with unclear licences and variable quality. Finding something adequate (for MuseScore and a specific language) could be very tiring (and so would producing our own voice set).

Other interesting tools I can think of right now:
- eSpeak could come in handy for adding support for European languages (it converts pretty smoothly from text to phonemes; see the sketch after this list), whichever synthesis tool we choose.
- kakasi can do the Romaji/Kana conversion, so we could allow both inputs for Japanese.
- iconv, convmv, and everything else that makes dealing with file encoding easier (for some tools).
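
To show what I mean about eSpeak: its library API can return phoneme mnemonics directly. A minimal sketch using calls from eSpeak's public header speak_lib.h; error handling is omitted, and the exact signatures should be double-checked against the eSpeak version we would target:

    #include <iostream>
    #include <speak_lib.h>  // from the espeak development package

    int main()
    {
        // Initialize without producing audio; we only want text analysis.
        espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, nullptr, 0);
        espeak_SetVoiceByName("en");

        const char *text = "answer";
        const void *ptr = text;
        // Returns eSpeak's phoneme mnemonics for the next clause of the text.
        const char *phonemes = espeak_TextToPhonemes(&ptr, espeakCHARS_AUTO, 0);
        std::cout << (phonemes ? phonemes : "") << std::endl;

        espeak_Terminate();
        return 0;
    }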

The ideal solution would of course be to combine all the aforementioned tools and then add some more, and I honestly do want to. But the time constraint makes this completely unrealistic, so a good first step would be the smallest functional thing. Since anything functional requires English, Sinsy's web service may be the most obvious starting point. How acceptable is the delay, though? We should probably impose a waiting-time limit (and maybe, consequently, a file-size limit). Developers' opinions on this would be great!
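
For the waiting-time and size limits, something along these lines could work on our side. This is only a sketch: the URL and form-field name are placeholders rather than Sinsy's actual interface, and the 1 MB / 30 s limits are arbitrary:

    #include <QByteArray>
    #include <QCoreApplication>
    #include <QEventLoop>
    #include <QFile>
    #include <QHttpMultiPart>
    #include <QNetworkAccessManager>
    #include <QNetworkReply>
    #include <QNetworkRequest>
    #include <QTimer>
    #include <QUrl>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        QFile *xml = new QFile("score.xml");
        if (!xml->open(QIODevice::ReadOnly) || xml->size() > 1024 * 1024)
            return 1;                        // arbitrary 1 MB size limit

        // Placeholder URL and field name: NOT the real Sinsy interface.
        QNetworkRequest request(QUrl("http://sinsy.example.org/synthesize"));
        QHttpMultiPart *multiPart = new QHttpMultiPart(QHttpMultiPart::FormDataType);
        QHttpPart filePart;
        filePart.setHeader(QNetworkRequest::ContentDispositionHeader,
            QVariant("form-data; name=\"score\"; filename=\"score.xml\""));
        filePart.setBodyDevice(xml);
        xml->setParent(multiPart);           // delete the file with the multipart
        multiPart->append(filePart);

        QNetworkAccessManager manager;
        QNetworkReply *reply = manager.post(request, multiPart);
        multiPart->setParent(reply);         // delete the multipart with the reply

        // Abort if no answer arrives within 30 seconds.
        QTimer timer;
        timer.setSingleShot(true);
        QEventLoop loop;
        QObject::connect(&timer, &QTimer::timeout, &loop, &QEventLoop::quit);
        QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
        timer.start(30000);
        loop.exec();

        if (!timer.isActive()) {             // timer expired first: give up
            reply->abort();
            return 2;
        }
        QByteArray audio = reply->readAll(); // synthesized audio (or error page)
        return audio.isEmpty() ? 3 : 0;
    }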

benjisan wrote
I'm a choir master and a voice teacher, and I agree with that idea: we are looking for the easiest solution!
I currently do as many other voice teachers and choir masters do: I use MuseScore for editing sheet music, and its rival Harmony Assistant for audio examples. But we would all like to use MuseScore without Harmony Assistant!
Could I please also have your opinion on the suggested interaction scenario? Is there anything missing from it that you would use?

As the deadline is dangerously close, I will try to produce a draft today; I will take any additional suggestions into account.

Thank you very much.

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

benjisan
OK, great! And could you please also look into adapting the future Virtual Singer add-on for karaoke?
That would be great: none of your rivals does it! ^^

Re: (GSOC 2016) Regarding the Virtual Singer project idea...

benjisan
I tried Sinsy, and I think the result is very good.
But it didn't take all four of my voices (soprano, alto, tenor, bass); it only selected the first one.
Nevertheless, I think that even a single voice of the same quality as Sinsy's vocals would be a good beginning!