Would you care to elaborate a bit? I mostly now the theorical parts, but I really lack experience on the subject...
Well I mean the syntax isn't that complicated. And sure, the way you write a query can drastically affect the performance, but that's only true if your query is poorly written to begin with. If you don't have the right indexes created, it will affect every query.
An index is basically a hash table, so that if you say SELECT * FROM FOO WHERE BAR=7, your options are either iterate over every single row in the table, looking for the value [O
] or use a hash table [O(1)]. If there's an index on bar, it's O(1), otherwise it's O
. For small databases this isn't a big deal, but for tables with many rows this could be millions of searches every time you run a query.
Certain columns are obvious. Any primary and foreign key constraints should have indexes, because by definition of being a "key", you are going to be filtering on those columns. As a result most database software will warn or prevent you from creating a key with no index for that reason.
But you don't need to create indices on just 1 column. For example, if you've got columns A and B you could write the following queries:
Code:
SELECT * FROM FOO WHERE A=7
SELECT * FROM FOO WHERE B=7
SELECT * FROM FOO WHERE A=7 AND B=7
And now having an index on just A and just B isn't enough to get the fastest performance from the 3rd query.
You can summarize the performance characteristics with different index configurations as follows:
Code:
Performance
Index On Query 1 Query 2 Query 3
A Fastest Slowest Slower
B Slowest Fastest Slower
A+B Slowest Slowest Fastest
The third one is probably surprising. If you have an index on A+B, why is a query against A slow? Because it's a *combined* index. Creating an index on A+B means creating a *single* hash table where the values being hashed are tuples of the form (<value from A>, <value from B>).
The reason Query 3 is marked "Slower" instead of "Slowest" in the first and second rows is because even though there is no index on A+B, there is still an index on the individual columns. So it can narrow down the search space by using the index from A on the A conditional to limit the set of possible rows, before doing a linear scan on that result set.
So if you want all 3 of these to be fast, you need 3 indices. Of course, the amount of storage required by an index depends on how many rows are in the table, and with millions of rows in your table you can get very large indices. So there's a performance / storage tradeoff to be had.
This problem can arise in subtle ways. For example, you might write something like:
SELECT A.foo, B.bar from A INNER JOIN B on A.id = B.id
The performance of this can drop to... I don't remember, but I think M*N in the worst case scenario. The relationship between indices and joins is surprisingly complicated, and you could probably write an entire magazine article on it.
That kind of knowledge may only be useful to DBAs and not software engineers, but in any case, it doesn't change the basic premise, which is that joins are important
Indices aren't just hash tables of full values though. Most (all?) databases will also index every prefix of a string. This allows the index to be used with the sql LIKE operator. But you have to be careful since only prefixes are indexed, which means your index only helps you up to the first wildcard. So for example, even with an index, running the query
SELECT * FROM FOO WHERE NAME LIKE '%Bar%'
will be a full table scan, since the index is useless in this scenario. But if you write this instead:
SELECT * FROM FOO WHERE NAME LIKE 'Bar%Baz'
Then the database engine can use the index against Bar, and (hopefully) drastically reduce the search space before doing a full table scan against the rest.
I'm sure there's more to know, that's the basics that I can think of off the top of my head.