The Honor of 10

Posted by - May 15, 09

usagijen managed a great mixture of closing thoughts and defense of her giving Darker than Black a 10 [out of 10]. I added my own response, but ghostlightning pointed to one of coburn’s December posts which I recall, but never managed to respond to a second time. There were good points made:

Jen:

My 10 is greater than your 10

This is a duality in truth.

moritheil:

A rating is always subjective.

I suppose I concur, or rather, no matter how objective one may deem their rating, we must still take it as a subjective stance :( Nothing wrong here; the Narutos of the world will still get their 1’s and 10’s… it keeps spinning /whoosh. So why do ratings matter at all on a system-wide scope? I have arguments, but they are not the point of this entry.

ghostlightning:

People are subjective to be sure, and the ratings we give on MAL I think serve as icebreakers in opening discussions more than anything.

This goes beyond the way we rate (whether the rating system is 10/10 or relative matters not). Discussions are about thoughts, possibly streams of them. It follows that a rating list only goes so far when trying to inspect the individual. An organized history of thoughts is a nice reference point in doing so.

Senna:

I know people who won’t even give their favorite series a 10/10 because “it’s not perfect;” I think that’s silly

I concur, though just as interesting is someone’s highest-ranked title. It’s just not very interesting when it shares the same position with 20 other works.

Finally, coburn’s followups:

I agree that relative-tiered levelling is more accurate in dealing with the broad range of shows out there – especially at the lower ends of the scale. Especially in that some people will be stricter than others depending on their personal aims in rating shows.

Ah yes, notice that this statement on strictness applies to any rating system. People who rate differently should be allowed (and possibly encouraged) to use different scales. coburn does, in fact, understand the system.

Still the lack of a magic symbolic 10/10 doesn’t quite click with my actual experience. My favourites really do mean that much more to me. Maybe I’d have to include a couple of empty theoretical tiers to segregate them from the crowd.

Ah yes, the magic of 100%. I’ve come to the realization that this is only possible in relative ratings if the user keeps a 10-level system, but again, possibly more interesting is what appears at the top, regardless of level. Y/N? A top 10 list is rather wicked, but something tells me people wouldn’t mind seeing top 20 or top 100 as well. I feel a correlation with ratings is in order.

Secondly, empty null/void tiers: this has been pondered. It is not an issue; the current system has a switch which removes empty tiers, and disabling it is quite simple. The issue is exploiting such a thing, but extreme ratings should usually be thrown out anyway. As compensation, the current calculation algorithm is non-linear, meaning higher tiers have a greater weight/position than lower tiers (the mathematical significance of a rating becomes less important on the way down).
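
As a rough illustration of those two ideas (dropping empty tiers, and weighting tiers non-linearly), here is a minimal Python sketch. The function name `tier_weights`, the `drop_empty` switch, and the squared curve are assumptions made for illustration only; the actual formula is not spelled out here.

```python
def tier_weights(tiers, drop_empty=True, exponent=2.0):
    """Return one weight per tier, best tier first.

    tiers: list of lists of titles, ordered from the top tier down.
    drop_empty: hypothetical switch that ignores tiers holding no titles.
    exponent: assumed curve; any value above 1 makes the top tiers
              count for more than the bottom ones.
    """
    kept = [tier for tier in tiers if tier or not drop_empty]
    count = len(kept)
    # Linear position runs from 1.0 (top) down to 1/count (bottom);
    # raising it to a power stretches the gaps near the top.
    return [((count - index) / count) ** exponent for index in range(count)]


weights = tier_weights([["A"], [], ["B", "C"], ["D"]])
print(weights)
```

With the empty tier removed, three weights come back, and the gap between the first and second tier is larger than the gap between the second and third, which is the "less important on the way down" behaviour in miniature.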

TheBigN:

I do get annoyed when people strongly force comparisons between two anime shows, as if the shows were specifically made to fight against each other, when that’s not the case.

Back to coburn:

TheBigN: I reckon the advantage of doing things out of 10, as opposed to something like Ryan suggests, is that the extent to which shows are in combat with one another is reduced.

It is relative because the items are ranked relative to each other… this is Sparta, jk.

There are two ways to go about this. 1) Whether we like it or not, a 10-point system is relative, but abstracted from the direct comparison between the experiences/works. If someone gives two titles a 5 and a 7, there is inference of comparison. 2) Ratings are quite 1-dimensional, which is good reason to include annotation when rating. This preferably should not involve comparison between works, because the rating already expresses the outcome.

The fundamental nature of rating is comparison. How an individual rates, whether through abstracted grading or direct comparison, will likely differ. The dimension of the ratings will likely differ as well: the overall experience vs storytelling+production, etc. What is important to notice in all of this is that both fixed and relative rating systems can be used for any dimension or grading style. If we substitute the word “unbounded” or “dynamic” for “relative,” the hard difference becomes the ability of a user to create their own rating system, or not.

With that said, I realize this rating system requires a different mindset about how we numerically categorize our media experiences, but once the consensus releases the ingrained concept of 10/10 ratings and realizes the backwards compatibility of relative ratings, both RRS and fixed-point systems will reach a more meaningful realization.

Notes

I use terms like work or item in order to hopefully step out of the specific case of rating anime, and into the case of rating media (or possibly anything), period.

12 Comments on The Honor of 10

  1. I don’t know what’s more useful/difficult: ratings or forced rankings. Maybe there’s an interplay… forced ranking per rating.

    I have these as my 10s (MAL):

    Aria the Animation
    Aria the Origination
    Cowboy Bebop
    Giant Robo
    Legend of the Galactic Heroes
    Macross
    Mobile Suit Gundam: War in the Pocket
    My Neighbor Totoro
    Ocean Waves
    Tengen Toppa Gurren Lagann

    What I find interesting is that among the Aria shows, I like Aria the Natural best, because it has more of, well, what I like about the show, simply by being the longest (being the middle part).

    Also, only Macross, TTGL, Cowboy Bebop appear on my 5 favorites list. This I think speaks to what Jen said about “My 10 is greater than your 10.”

    The 7 I gave Macross 7 may be far more intense than Jen’s 10 for Darker than Black. Not only does this underscore the subjectivity involved, but also re-introduces the dynamic of favorites.

    Okay, forced ranked, the list looks like this:

    1. Legend of the Galactic Heroes
    2. Giant Robo
    3. Mobile Suit Gundam: War in the Pocket
    4. Tengen Toppa Gurren Lagann
    5. Cowboy Bebop
    6. Aria the Animation
    7. Macross
    8. My Neighbor Totoro
    9. Aria the Origination
    10. Ocean Waves

    There is a difference in this ranking, in that I attempted to be the least subjective I can possibly be. I took into account excellence over time (number of episodes), being groundbreaking or different for its milieu, being timeless (not that big a factor because it is a prerequisite for a 10), being near seamless in the use of the elements it factored in, the execution of presenting such elements relative to budget and technological constraints, etc.

    However, the standards aren’t airtight and if I pay too much attention to them it will deliver diminishing returns in terms of the pursuit of enjoyment from this anime hobby.

  2. Ryan A says:

    Yes. <- this was deterministically generated.

    I believe you get the gist of the way RRS does it, because you have immediately spotted what tends to be difficult, “where/how do I place these series if I’m trying to make it proper?”

    The answer is that at any given moment in the continuous experience of said hobby, the items need not be in some permanent place. In fact, the further down the list it goes, the less effect it has on the overall platform. So really, it is only the top 10-33% of the list which truly matters (this is for ratings, not necessarily favorites).

    Organizing that list above from the 10s displays my point about ratings, and how the system is not quite the pinnacle of accuracy. Part of the growing theory is that not everything needs to be rated, especially if giving it a pity rating (ie. giving it a 7 when hardly anything is given less than a 7, basically a 7 equates to a 1). What can/should be “forced rated” are shows that deserve such distinction between each other. In most cases, I would say this is anywhere between 10 and 30 series for a user.

    Another consideration is that this sorting is tedious to try and do all at one time, as it forces the mind to think in quite a different manner from when assigning a number. This is why I believe more accurate overall ratings are achieved with such a system (that and the algorithm has a per-user account of what a ranking means based on the rating history and not a numerical value alone).

    For favorites, I think it’s fine to have a similar dynamic list, which will likely be different than ratings, but as you say, there is a difference in the attempt at objectivity (or lack of subjectivity) between ratings and favorites. I agree. Both lists serve a good purpose, and comparing these for a single user shows the finer details.

    So that is the relative portion of it, but I’m growing more fond of this notion that it doesn’t need to be relative at all. It just needs to be ratings, and accommodate the tiers a user wishes to use, whether it be 10, 15, 50, 5… etc.

    Rather than force the comparisons into order, we should start with a small number of tiers, say 10, and if it just so happens that we need another tier between two existing levels, we have the ability to create it. If one can think in this hybrid manner of using fixed-ratings (whatever size list they choose) in most cases, but utilizing the ability to create new levels when truly needed, then ratings or favorites should become something of greater value.
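
    The hybrid idea (start with a fixed number of tiers, add one only when it is truly needed) can be sketched in a few lines of Python. This is only an illustration: the names `new_scale`, `place`, and `split_between`, and the placeholder titles, are hypothetical and not taken from any real system.

```python
def new_scale(levels=10):
    """Create an empty scale; index 0 is the top tier."""
    return [[] for _ in range(levels)]


def place(scale, title, level):
    """Put a title on an existing level, where level 1 is the top tier."""
    scale[level - 1].append(title)


def split_between(scale, upper_level, title):
    """Create a new tier directly below upper_level and put title there.

    Every tier that was below upper_level moves down by one position.
    """
    scale.insert(upper_level, [title])


scale = new_scale(10)
place(scale, "Show A", 1)
place(scale, "Show B", 2)
# "Show C" belongs between the two, so a new tier is created for it.
split_between(scale, 1, "Show C")
print(len(scale), scale[:3])
```

    After the split the scale holds 11 tiers, yet nothing already rated had to be touched by hand; only the ordering changed.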

    As it stands, all 10s, 9s, or 8s are likely not equivalent, especially between users. Relating what one user’s 10 means to another isn’t going to happen on any of these well-known sites, but it should.

    So much to discuss. Basically, a user rated a title a 9, and the overall rating among all users is a 7.6, but the problem is that 7.6 is not a 7.6 by the standards of the user’s rating system. So what numerical value is it? These sites cannot answer.

    If one were to use the RRS with fixed 10-point system, the overall ratings would still have drastically different meaning. Some 7’s might become 3’s, there would be an abundance of 5’s, but a user would be assured that they could know exactly what that 7.6 means on their own 10-point scale (and it would likely not be a 7.6).

  3. animekritik says:

    on MAL i score strictly on my own love for the show. if i absolutely love it, it gets a 10. i don’t see why someone would argue that the show they love is “really” a 6 or a 7 (people do it all the time, I know, but I don’t see how it’s appropriate in any way). A house divided against itself cannot stand & so forth…

  4. moritheil says:

    “I suppose concur, or rather, no matter how objective one may deem their rating, we must still take it as a subjective stance.”

    Which effectively makes it subjective to the reader. There could be a distinction, I suppose, but categorically I think recognizing the subjectivity of ratings is essential. The nature of subjectivity is that people try hard to be objective and still fail.

    I don’t think that someone who insists that Naruto is a 10 is necessarily lying or deluded; it might be that Naruto has whatever they’re looking for at the moment. This will definitely not be what everyone is looking for, and it might not be what they themselves appreciate most in a few years. In no way do those qualifiers detract from the sincerity of the rating.

  5. Ryan A says:

    @animekritik AniDB/MAL should get rid of its aggregate ratings and series placement then, but how one rates is tangential to how ratings are evaluated for everyone else (apart from the individual user). The reasoning behind giving a 10 or the top tier for a loved series is the liberty of the user; any rating perspective can be taken (overall enjoyment, technical qualities, story hardliners, etc). I think it’s productive for users to note such a thing.

    If I look at a rating history of a user, let’s say on a 5 point scale, and I see nearly 100 4’s and 60 5’s, it really doesn’t make any sense. That’s 160 series which would have scored 8/10 or better on MAL. Is this good information about the titles [or the user]?

    At some point these ratings lose significance. Dynamic ratings are an attempt at maintaining significance (mathematically, system-wide) even with such ratings. For the individual scope, it solely allows pliability (yes, one can have an 11 out of 10 if they want to) while staying relevant to other users (those who don’t have the 11th tier).

    @moritheil always subjectivity, but what are your thoughts on varying degrees? Or is it, like digital, either zero or one? As for the Naruto is a 10 ordeal which later must be re-evaluated, I wrote on the pliability of ratings history earlier this year. [entry]

    Finally, this notion of relative rating doesn’t require a different way of rating. It can be thought of as directly comparing series, but users may still rate with a 10/10 scale, just as it would be done anywhere else. The difference comes in flexibility, customization, and calculated significance throughout the entire system.

    I’d be interested to hear what changes you guys think might be beneficial to fixed-point systems. I begin these endeavors because I don’t feel the available tools are adequate (multiple sites for media, fixed-point rating systems), which is the same for the microblogging stream on melative; it allows status updates about externally defined topics (pure content message with none of that, #SeriesNameTag 05: … garbage in the message).

    Or another thought, I could translate your ratings and show exactly what they would look like in the system… this would be interesting with two users, which would bring in divergence/conformity though it’s better with more users.

  6. What could be done, for these sites that insist on aggregate ratings, is to insist on its own standards.

    10: Masterpiece, can be recommended to almost anyone. Lack of appreciation will be rare and almost always happen due to extenuating circumstances.
    1: A total discredit to anime as a medium. Will only please collectors of terrible experiences.

    …etc.

    Everyone does not have to comply! One just has to opt-out of the system, which means one’s ratings won’t be part of the aggregate ratings.

    All this does is to derive real conclusions from ratings. The conceptual heavy lifting will be in determining what the standards/ratings tiers/meanings will be.

  7. Ryan A says:

    @ghostlightning

    Addition: thinking on the system-wide level is important for standards, but giving user-level abilities to change the way the default works is something I like. A good [tangential] example would be the visibility coded into the stream, which allows users to have a completely private or friends-only microblog, but still maintain the ability to publish items with public visibility.

    In this dynamic algo I’ve been developing, the experiences of the user, mixed with their ratings of each experience, play into a larger calculation which results in 2 aggregate statistics.

    The thing is, each user does not have equal weighting power. A user rating everything a 9 or 10 is not going to contribute much for a series as opposed to a user who has few things rated 10, and the series was rated 10 (as well as a full spectrum of ratings).
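
    One way to picture unequal weighting power is the Python sketch below. It is not the actual algorithm, which is not described here; the function `rating_weight` and its two factors (how rarely the user hands out that rating, and how much of the scale the user actually uses) are assumptions chosen purely for illustration.

```python
from collections import Counter


def rating_weight(history, rating, levels=10):
    """Hypothetical weight of one rating, given the user's whole history.

    history: every rating the user has given (non-empty list of ints).
    rating: the rating given to the series in question.
    levels: number of levels on the user's scale.
    """
    counts = Counter(history)
    # The rarer this rating is for the user, the more it says.
    scarcity = 1 - counts[rating] / len(history)
    # Users who use the full spectrum of levels count for more.
    spread = len(counts) / levels
    return scarcity * spread


generous = [9] * 10 + [10] * 10          # everything is a 9 or a 10
picky = list(range(1, 11)) + [5] * 10    # full spectrum, a single 10

print(rating_weight(generous, 10))
print(rating_weight(picky, 10))
```

    Under these assumed factors, the 10 from the user with a single 10 and a full spectrum of ratings carries far more weight than a 10 from the user who rates everything 9 or 10.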

    What a user grades a given series is important, but all other grades for other series are taken into consideration as well.

    A large part of these dynamic ratings is generating accurate ratings for what ratings are given, relative to all the items within the system.

    I believe AniDB does use those standards, but then again when considering the subset of Masterpieces, they can still be organized and rated. IMO, it’s incomplete to leave it in such fashion. The other issue is that it can never be an exact rating even if going for a 9.786, since the more precision added the less we actually get what the hundredths and thousandths places mean. Using a sorted list nullifies the need for this decimal precision, but it’s not to say we can’t place things on the same level.

    Everyone does not have to comply. This is true, which is even more push for dynamic ratings which allow users their own compliances, completely separate from other users, but at the same time, having an algorithm which makes aggregate sense of these various rating systems.

    The tiers meaning: A>B, A<B, A=B … that is it.

    If a user imports their ratings from 10-point system, it’s automatically in the standard. Over time they may branch out the ratings (especially in the upper tiers), but don’t need to. If there happens to be more than 10 tiers, the list might be more expressive, such as a hybrid between a top X list and standard point-ratings. The list can change over time, becoming larger and more complex or smaller and less complex, or it doesn’t need to do anything since normal 10-point ratings will work just the same individually; the difference is in calculating the aggregate.

  8. I won’t pretend that I got all of what you said, but I do sense that you’ve really thought this through. I like the idea (I think) that the system sorts things out automatically no matter the behavior of the user. What I’m less excited about is that I’d have to spend time figuring out what the user means when she rates something a 7.

    Is it ‘good’ 7? Or, as you say, a ‘pity’ 7?

    Not everyone can be articulate, as I don’t think even I can easily work through a meaning for each number (never mind the decimals, if any). So, having one prepared as a default that can be opted-out from is still a good idea I think.

  9. Michael says:

    I don’t MAL so I avoid stuff like this.

    10/10 for the post. ;)

  10. Ryan A says:

    @ghostlightning, the beauty is that the system can translate whatever their 7 means, into an approximate “tier” in your own ratings. You don’t have to figure out what another person’s scale means, we just follow the ordering operations (>, <, =).
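
    A minimal Python sketch of such a translation follows, under the assumption that a tier's meaning is taken from where it sits in the user's own ordering (the share of that user's items below it). The function names `tier_positions` and `translate` and the sample lists are hypothetical, and the real system may well do this differently.

```python
def tier_positions(tiers):
    """Position of each tier in a user's list, best tier first.

    A position is the fraction of the user's items that sit below the
    tier, counting half of the tier itself (a mid-rank percentile).
    Assumes the user has rated at least one item.
    """
    total = sum(len(tier) for tier in tiers)
    positions = []
    below = total
    for tier in tiers:
        below -= len(tier)
        positions.append((below + len(tier) / 2) / total)
    return positions


def translate(source_tiers, source_index, target_tiers):
    """Index of the target tier whose position is closest to the source tier."""
    wanted = tier_positions(source_tiers)[source_index]
    target = tier_positions(target_tiers)
    return min(range(len(target)), key=lambda i: abs(target[i] - wanted))


# A user who only ever gives 10, 9, 8 or 7 (best first)...
generous = [["x"] * 6, ["x"] * 8, ["x"] * 4, ["x"] * 2]
# ...and a user who spreads 20 items evenly over ten tiers (10 down to 1).
strict = [["y"] * 2 for _ in range(10)]

# The generous user's lowest tier (their 7) lands on the strict user's
# lowest tier (index 9, i.e. a 1 on that scale).
print(translate(generous, 3, strict))
```

    Only the ordering and the item counts are used; the labels "7" or "10" never enter the calculation.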

    Of course, there are additional things which would make it more accurate, like finding an intersection series, and then showing the relative distance of other series compared between two lists, while accounting for the difference in tiers (ie. I have 54 tiers, Michael has 10 I believe).

    Naturally, just because something ends up on the lowest tier of a smaller list, doesn't mean it should end up on the bottom of a larger list as well; it might end up somewhere in the middle, depending on a possible offset. This sort of thing isn't implemented at this time, but is possible.

    @Michael, ^_^ MAL does what it does. I would have used MAL if it wasn't AniDB (though MAL is less robust with better UI), which I was already fed up with since logging episodic watches and nothing more is quite useless. melative takes a different, less anal approach to experience; there are states and history.

    Also, Michael, if you use Firefox and have time, come play with the Stream. http://melative.com/stream (might have to reset pass)

    Edit/Addition: @ghostlightning Not sure if you’ve come across this explanation, but I feel it might make more sense as to why this ratings system requires a different perspective about ratings (in terms of energy and potential). [link]

    Converting to a 10-point system is indeed possible, but it is closest related to potential (or aggregate position on all lists). The issue is that if the aggregate position is the bottom 10%, that is technically a 1/10 on a fixed point system. Thinking in a fixed-point mindset here will cause trouble, since the true implication is that if there are 10 levels, said item appears on the bottom level. It’s not exactly the same as rating it an absolute 1/10… it just means it’s at the bottom of all items in the system…. no real number needs to be attached… it’s relative.

  11. aurabolt says:

    Holy shit, walls of text.

    Right, everyone uses a different scale. A ‘7’ does not mean the same for everyone. I do not care much for a title’s overall rating, but I would like to see this implemented for generating user-compatibility-ratings (in MAL).

    Using offsets does sound like a possible solution to “translating” a user’s scale. On the other hand, I refuse to support a dynamic-point system. Ten is enough. Any fucker who needs more than ten needs to quit inflating his/her goddamn scores.

    I personally use scores from ‘5’ to ‘9’ (similar to letter grades). ‘6’ is the neutral line. ‘7’ is where I start saving the series. ‘9’ is reserved to only two series, so ‘8’ is generally the most I give.

  12. Ryan A says:

    @aurabolt

    lol BELL CURVE. Well, the thing about using more levels doesn’t really change the fact that a user only has 100% to deal with. 100% is the best a user can give any item. It’s broken down more so the system doesn’t have to support the tenth or hundredths places. There is no 8.5, but something might get a potential of 85% depending on the levels the user is using.

    What does occur, is that if a new level is placed at the top, say 11, everything else is automagically shifted to appropriate percentages.
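
    In the simplest reading, potential is just a level's position divided by the number of levels. The Python sketch below assumes evenly spaced levels, which is a simplification for illustration (the function name `potentials` is hypothetical).

```python
def potentials(level_count):
    """Percent potential for each level, lowest level first.

    Assumes evenly spaced levels; the top level is always 100%.
    """
    return [round(100 * k / level_count, 1) for k in range(1, level_count + 1)]


ten = potentials(10)
eleven = potentials(11)
twenty = potentials(20)

print(ten[9])      # level 10 of 10
print(eleven[9])   # the old level 10 once an 11th level sits above it
print(twenty[16])  # level 17 of 20, the "8.5" that a 10-level list lacks
```

    The top of any list stays at 100%; adding an 11th level pushes the old level 10 down to roughly 90.9%, and a 20-level list can express 85% without any decimal places in the rating itself.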

    That’s what I’ve mentioned as potential, which is more useful in user-to-user evaluations than calculating an overall system rating. For that, we would use the energy/kinetic scoring. The difference is, kinetic scoring considers the depth and diversity of a user’s ratings and adds the outcome to an absolute value for a given item.

    The main difference in thinking about this sort of scoring, is transition.

    Item X deserves grade Q.

    This is the current way.

    Item X should be grouped in grade/level Q because of the set Y (other items in Q).

    This is better for dynamic ratings. It’s about grouping items by grade, but not specifically caring what that grade value is… the algorithm will yield something relevant, which can then be looked at between items in the system. (all items are evaluated the same through connecting the dynamic user ratings).

    The most important part is how the grade-levels are related: greater, lesser, equal (equal implies the two levels are the same level, level 1 = level 1).

    Basic example using your 5 to 9 ratings:

    5 = 5
    6 = 4
    7 = 3
    8 = 2
    9 = 1

    5 levels. Now here are my levels, as possibly related to the above:

    10 = 5
    9 = 4.25
    8 = 3.75
    7 = 3.3
    6 = 2.8
    5 = 2.4
    4 = 2.0
    3 = 1.6
    2 = 1.25
    1 = 1

    The actual numbers would be different, but don’t evenly divide out because it matters how many items are on various levels. (similar to matching the centers of the bell curves)

    At some point the numbers really don’t matter… which is the objective. What matters is the position of a level and how many levels+items are above and below it. This could be perfectly used with a 10-point system, but hopefully it is clear why the results would be different. Also, bottom offsets would be good, or rather, having a list size, but allowing blank bottom slots where wanted…. no need for top offset since down-boosting is meh, and we care about the list in only one direction.
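
    The bottom-offset idea can be sketched in Python as follows. The function `potential` and the `blank_bottom` parameter are hypothetical names, and treating the unused scores 1 to 4 of a 5-to-9 scale as four blank bottom slots is an assumption made for the example.

```python
def potential(level, levels_used, blank_bottom=0):
    """Fraction of the full list height reached by a level.

    level: 1 is the lowest level the user actually uses.
    levels_used: how many levels the user actually uses.
    blank_bottom: empty slots kept below the lowest used level.
    """
    return (blank_bottom + level) / (blank_bottom + levels_used)


# A 5-to-9 scale has five used levels. Without an offset the lowest
# score sinks to 1/5 of the list; with four blank bottom slots it
# stays at 5/9, while the top level is 100% either way.
print(potential(1, 5))
print(potential(1, 5, blank_bottom=4))
print(potential(5, 5, blank_bottom=4))
```

    No top offset appears because the top of the list is always treated as the best a user can give.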
