The inner workings of advanced karaoke effects
----------------------------------------------

So you've seen all those fancy karaoke effects used in most anime fansubs today, and you're wondering how exactly they're done. Maybe you even want to create some yourself. Maybe you've already done some basic effects, timed a karaoke with \k, \kf etc. in ASS, maybe made some simple effects with \t.

To start off, not all karaoke effects are actually made with ASS. Some people use Adobe After Effects (AFX), a professional program for overlaying text and graphics on video. I don't have any experience using AFX at all, so I won't try to comment on how that works.

Some people use even more specialized software. I know of at least one karaoke produced in 3D Studio Max. Perhaps there are more. Some people have even developed their own tools for rendering karaokes!

Note that creating the most advanced effects REQUIRES PROGRAMMING SKILLS. If you don't know what a function call, data structure, reference data type or conditional statement is, start by learning some kind of real programming language. Personally I recommend trying out Python if you're new to programming.

I'll assume you can work some things out yourself as well. I don't want to state the obvious too often :)

But this document will focus on how effects are created with ASS. I'll start by going over the most basic stuff.

First off, there are the basic \k tags for timing karaoke. Here's the list:

\k is the simplest. When the syllable is up, it changes color in an instant.
\kf is somewhat like \k, but instead of changing color in an instant, the color changes gradually from left to right. An important thing to note is that no matter what rotation (\fr) is applied, \kf *always* highlights from left to right, relative to the video canvas.
\K is the same as \kf (unless I'm mistaken :)
\ko is also like \kf, but instead of the inside of the text changing color, the border around the text changes color.
All of the \k tags take a single number as argument, the duration of the syllable in CENTISECONDS (1/100 of a second). This is the same scale used for timing lines in ASS. Every syllable can also be said to have a "start" and an "end" time. The start time of a syllable is the sum of the durations of all syllables before it. The end time is the start time plus the duration of the syllable itself.

Next thing is the \t tag. \t means "transformation": it gradually transforms the following text, changing its size, color, rotation, alpha or even clipping region. Note that it can't change the position of the text! There are four forms of the \t tag:

\t(tags)
\t(start,end,tags)
\t(speed,tags)
\t(start,end,speed,tags)

As an important note, the \t tag uses MILLISECONDS (1/1000 of a second) for timing. The start and end times are relative to the beginning of the line the \t tag is on. If you leave out start and end, the transformation takes place over the entire duration of the line.

The speed argument changes the transformation from happening in a linear fashion to an exponential fashion. Speed is a real number larger than zero. Leaving it out is the same as setting it to 1 (one). If speed is between 0 and 1, the transformation starts fast and ends slow. If it's larger than 1, it starts slow and ends fast. No matter what value speed has, the transformation always takes the same amount of time, as defined by start and end.

A \t tag affects all text from where it appears until the end of the line, or the next \r tag. (\r resets all overrides to the line default.)

Here's a sample fictional karaoke line, using a \t effect:

{\r\1c&H000000&\t(0,100,\1c&HFFFFFF&)}so{\r\1c&H000000&\t(100,200,\1c&HFFFFFF&)}ra

Note that the first \r is totally superfluous, but you should get used to seeing and doing such superfluous things, for the simple reason that it's easier.
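To give an idea of what "easier" means here, this is a minimal sketch in Python of generating a line like the sample above from a \k timed line. The function names (parse_syllables, apply_colour_effect) are made up for this illustration, and the parsing only handles the plain {\k10}text form, so treat it as a starting point rather than a real tool. It also shows the start/end time bookkeeping: durations are summed in centiseconds, then multiplied by 10 to get the milliseconds \t wants.

```python
import re

# One karaoke tag (\k, \K, \kf or \ko) with its duration, followed by the syllable text.
SYLLABLE_RE = re.compile(r"\{\\(?:kf|ko|k|K)(\d+)\}([^{]*)")


def parse_syllables(line):
    """Return a (start_ms, end_ms, text) tuple for every syllable in a k-timed line."""
    syllables = []
    start_cs = 0  # running sum of the durations of all earlier syllables
    for duration, text in SYLLABLE_RE.findall(line):
        end_cs = start_cs + int(duration)
        # \k durations are centiseconds, \t times are milliseconds.
        syllables.append((start_cs * 10, end_cs * 10, text))
        start_cs = end_cs
    return syllables


def apply_colour_effect(line, from_colour="&H000000&", to_colour="&HFFFFFF&"):
    """Rebuild the line with one \\r + \\t colour fade per syllable."""
    parts = []
    for start_ms, end_ms, text in parse_syllables(line):
        # The same template is used for every syllable, including the leading \r.
        parts.append(
            "{\\r\\1c%s\\t(%d,%d,\\1c%s)}%s"
            % (from_colour, start_ms, end_ms, to_colour, text)
        )
    return "".join(parts)


print(apply_colour_effect(r"{\k10}so{\k10}ra"))
# prints {\r\1c&H000000&\t(0,100,\1c&HFFFFFF&)}so{\r\1c&H000000&\t(100,200,\1c&HFFFFFF&)}ra
```

Every syllable goes through exactly the same template, \r included, so the program never has to care whether it's handling the first syllable or a later one.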
Since it's tedious and error-prone to make any kind of effect other than basic \k by hand, you'll almost always be using some kind of program to generate the effects for you, based on a simple \k timed script. Putting an \r at the start of every syllable is easier to program, and it doesn't do any harm, so there's no reason not to.

Next up are \pos and \move. These tags are basically the same, except that \move animates the line, while \pos does not. You can only have one of these tags in each line. (If you have more, the first one is used and the rest are ignored.) For this reason, they're not very useful in simple karaoke where one line in the original timed script corresponds to one line in the final styled script. The primary reason you'd want to use one of these is to avoid the automatic collision detection and resolution in ASS, i.e. to make two lines overlap on screen. (You'll want to do that if you have lines fading into each other and similar.)

\pos(x,y)
\move(x1,y1,x2,y2)
\move(x1,y1,x2,y2,t1,t2)

The x and y coordinates are in the virtual script resolution specified by the PlayResX and PlayResY lines in [Script Info] in the ASS file.

\pos simply positions the line on the screen, nothing more, nothing less. It doesn't matter where on the line you put it.

\move makes the line move from (x1,y1) to (x2,y2), either over the entire duration of the line, or only during the interval from t1 to t2. Again, the times are specified in milliseconds, relative to the start of the line. Before t1, the line is positioned at (x1,y1), and after t2, it's positioned at (x2,y2). \move can only move the line at linear speed (no acceleration like with \t) and only in straight lines.

Note that the position specified works differently depending on the alignment (\an) of the line. Just play around with it, it's pretty simple.

And now, for the first important trick to learn: Making parts of a line move a bit, without affecting the rest of the line!
Suppose you want to make syllables "jump" when highlighted. What can you do? The first option is to make two lines per syllable, positioning each with \move, making the first line move the syllable up and the second move it down. That's a lot of work. But there is an easier solution: Enter rotations!

What happens if you have a circle with an extremely large radius, and take the arc length along its circumference spanned by a small angle like 0.1 degrees? You get a reasonable distance, but practically only along one axis. (At a radius of roughly 10000 pixels, 0.1 degrees corresponds to an arc of about 17 pixels.) Try this:

{\org(10000,240)\frz0\t(0,100,\frz0.1)\t(100,200,\frz0)}orz

You should see orz jumping a bit up and down. It doesn't really look like a rotation, does it?

Note that I talked about having multiple lines per syllable in the part above. That's perhaps the most important "rule" for advanced effects! Sometimes you can get away with a trick like the above, but more often than not, there's simply no way to make an effect without having multiple lines per syllable.

There are two primary hurdles you need to pass in order to use multiple lines (or even just one!) per syllable. The first is the simple one of splitting every timed line in the \k timed script into multiple lines, perhaps also copying every syllable line multiple times. The second is how to position the syllables. Since there's no longer any surrounding text, but just a single syllable (like "so") on the line, it won't get a sensible position just by itself.

Enter karaoke generation frameworks! That's a name I just made up for software and scripts you can use to write your own scripts for generating multiple lines per syllable, and applying effects to them. I created one myself, the Automation system in Aegisub (http://aegisub.net), which is based on the Lua language. Of course, several others exist, but the basic ideas are still the same. I will try not to be specific about any system, but just describe the techniques here. End of pimping.

A karaoke framework should (must!) be able to use ASS style information to calculate how much space some text takes up when rendered on screen. Using that information, you can calculate where the text would be placed if it was rendered as part of its original line, without any effects applied.

When you have the positioning information for each syllable, you can start to do effects. The hardest part of making an effect is working out exactly how it should be done technically; the actual programming is often easier. Try to split the effect you want into atomic parts.

I'll take an example from a karaoke I recently worked on. The text initially appears on screen without any borders of any kind. When a syllable is highlighted, a thin border appears, zooming in from the center of the syllable and ending up at the correct position around the syllable. Another, thicker, border also appears below the syllable, extending from left to right and ending up outlining the entire syllable. The thin border creates an "echo" when it reaches its final size, i.e. two more copies of the thin outline appear, extending slightly further out around the syllable and fading out.
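Before looking at real output, here's a rough sketch in Python of the positioning step mentioned above. It assumes the framework has already measured the width of every syllable (the widths below are invented numbers), that the line is horizontally centered, and it ignores spaces, margins and kerning. syllable_positions is a hypothetical helper, not part of any particular framework.

```python
def syllable_positions(widths, line_centre_x, y):
    """Return one (left_x, centre_x, y) tuple per syllable of a centred line.

    widths are the rendered widths of the syllables in script pixels, in
    the order they appear on the line.
    """
    total_width = sum(widths)
    # Left edge of the whole line, if it were rendered centred as one piece.
    left = line_centre_x - total_width / 2
    positions = []
    for width in widths:
        positions.append((left, left + width / 2, y))
        left += width  # the next syllable starts where this one ends
    return positions


# Invented widths for three syllables on a line centred at x=320, y=40.
texts = ["so", "ra", "ni"]
for text, (left, centre, y) in zip(texts, syllable_positions([30, 20, 30], 320, 40)):
    # One single-syllable line per syllable, anchored at its centre.
    print("{\\an5\\pos(%d,%d)}%s" % (round(centre), y, text))
```

The left edge is what you'd feed to \pos for a line with alignment 4 (middle-left), and the center is what you'd use with alignment 5, so keeping both around is handy when different parts of an effect need different alignments.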
Here's a cut from the generated effect:

Dialogue: 1,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord5\an4\1a&HFF&\3c&H003FFDBB&\pos(242,40)\fscx0\t(0,200,\fscx100)}i
Dialogue: 2,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord0\an5\pos(245,40)}i
Dialogue: 3,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\pos(245,40)\fscx0\fscy0\t(0,200,\fscx100\fscy100)}i
Dialogue: 4,0:00:29.31,0:00:30.31,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\3a&H80&\pos(245,40)\fscx100\fscy100\t(\3a&HFF&\fscx185\fscy160)}i
Dialogue: 4,0:00:29.31,0:00:30.31,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\3a&H80&\pos(245,40)\fscx100\fscy100\t(\3a&HFF&\fscx202\fscy177)}i
Dialogue: 1,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord5\an4\1a&HFF&\3c&H003FFDBB&\pos(249,40)\fscx0\t(200,430,\fscx100)}ke
Dialogue: 2,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord0\an5\pos(268,40)}ke
Dialogue: 3,0:00:29.11,0:00:30.40,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\pos(268,40)\fscx0\fscy0\t(200,430,\fscx100\fscy100)}ke
Dialogue: 4,0:00:29.54,0:00:30.54,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\3a&H80&\pos(268,40)\fscx100\fscy100\t(\3a&HFF&\fscx175\fscy153)}ke
Dialogue: 4,0:00:29.54,0:00:30.54,romaji,,0000,0000,0000,,{\bord1.5\an5\1a&HFF&\3a&H80&\pos(268,40)\fscx100\fscy100\t(\3a&HFF&\fscx191\fscy169)}ke

Note the five lines per syllable. Each part of the effect gets its own layer. The fill of the text is in layer 2, such that the thick border can use layer 1, in order to be below it. The thinner border gets layer 3 in order to be above those two parts, and the echoes end up in layer 4.

Because the thick border should extend from the left, it needs to have alignment 4, middle-left. It starts out with X-scale 0, making it completely invisible. Then a \t makes it extend out to 100% width. Pretty simple.
The line for this is timed like the full line, since the result should stay on screen after the effect itself ends, and doing it that way is easier than having the line start only when it's needed. This is really a detail, and may be personal preference.

I don't think I need to comment on the fill line.

The thin border is quite similar to the thick one, except it has another alignment (and thus also a different position), and it changes both X and Y scaling.

Last are the thin border echoes. These always last one second. They shouldn't start until right when the highlighting of the syllable is finished, and then disappear again, so their timing is different from that of the regular line. This also easily allows them to stay a bit, even after the rest of the line is cleared from the screen.

Notice that the echoes of each syllable are scaled to different final sizes. First of all, the two echoes per syllable shouldn't overlap 100%, so their sizes need to vary. Second, because the echo is of the border extending from the center, how much further the echo should extend must really depend on the speed at which the original border extended, which in turn translates to the duration of the syllable. I used a simple calculation to get the final scales of the echoes:

110+1500/syl.duration

The constants are then changed a bit for the X and Y sizes, and for the two echoes per syllable.

I hope the above description has given you some inspiration for how to make effects using just the text itself. But sometimes, just the text isn't enough. Sometimes you need non-text elements. Luckily, ASS provides a way of making vector graphics, which can be styled and animated just like text: The \p tag.
I won't go into detail about how it works here (you do have an ASS override tag reference, don't you?), but I'll give some examples:

A five-pointed star:
{\p1}m 0 -6 l 2 -2 6 -2 2 1 5 5 0 3 -5 5 -2 1 -6 -2 -2 -2

A kind of four-pointed star:
{\p1}m 20 0 b 20 10 30 20 40 20 b 30 20 20 30 20 40 b 20 30 10 20 0 20 b 10 20 20 10 20 0

A one-pixel dot:
{\p1}m 0 0 l 0 1 1 1 1 0

Now drawings by themselves aren't that interesting, but here are some things you can consider: (I've used most of them myself.)

- Make an object fly over the text, either placing it or removing it
- Make the text explode into a storm of little dots ("debris")
- Make objects fall onto the text, highlighting it
- Provide a semi-translucent, colored background for some text

But novel uses of drawings are what we're all waiting for. Go get some crazy ideas!

Another invaluable tool is \clip. It takes two forms, rectangular and shaped. Unfortunately, the shaped kind doesn't have as much use as the rectangular one, for one thing because it can't be animated. There are two primary uses of \clip in karaokes:

- Splitting the text into different parts that get different styling, in a way not possible using regular override tags. You can for example create color gradients this way: Create a lot of lines, place them at the same position, \clip them into small strips one pixel wide (or high) and give them slightly different colors.
- As a replacement for \kf. Because \kf can only highlight from left to right, it's useless in e.g. vertical kanji lines. This requires animating the \clip with \t.

As a final note: Don't be afraid if your ASS files with generated karaoke effects suddenly pass the 10 megabyte mark. It's not that bad. It's only when TextSub starts crashing that it's bad :)

Written on 2006-01-20 by jfs (AnimeReactorDK, Infidels!, freelancer and Aegisub developer)