Today, we’ll examine the uses of ORM versus raw SQL to determine why this choice matters and when to use each one. Throughout the process, I’ll use the code from a toy repo I wrote specifically for this blog post.
Don’t Forget About SQL
I remember being interviewed by “Jeff” (not his real name) at an early-stage startup. Jeff is a junior engineer, two years out of college, and his experience in SQL is what he learned in a database course as part of his CS degree—and maybe a YouTube video on how to start up a docker container running Postgres.
Jeff has forgotten how to use JOIN and WHERE, and he doesn’t even remember where HAVING would go in a query or if it’s even a SQL condition at all. We’ve all been Jeff at some point, and it’s easy to relate to being less experienced.
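In fairness to Jeff, HAVING is indeed real SQL: it filters groups after aggregation, and it goes after GROUP BY. A minimal sketch against a hypothetical orders table (the table and column names are illustrative, not from the toy repo):

```sql
-- Hypothetical table: orders(customer_id, amount)
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE amount > 0            -- WHERE filters rows before grouping
GROUP BY customer_id
HAVING SUM(amount) > 100    -- HAVING filters groups after aggregation
ORDER BY total DESC;
```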
But Jeff asked me what my experience with ORMs and SQL was, so I explained that while I can work with ORMs just fine, I can write better SQL than an ORM can, and at some point the performance loss from working through an ORM is going to be a bottleneck.
Jeff, having been raised during a time when computing power is cheap and easy to rent, and who probably sported the latest MacBook Pro during his four years studying CS, can’t really fathom how performance has anything to do with SQL, because he barely looks at the metrics of the 5-person startup he works at. After all, with the 100 customers they do have, CPU usage barely goes above 10%.
Jeff thinks as long as you paginate data, everything’s fine. He scoffed at the idea that you’d have to write raw SQL at some point because he hasn’t had the experience of working on databases that sit on the largest machine you can rent in AWS. Nor has he put out fires on a system that processes millions of documents daily in which the database is the bottleneck.
We can all understand how Jeff formed this viewpoint. However, there are times when ORM is a good option and times when it isn’t. When an ORM can’t handle the job, a strong knowledge of raw SQL is crucial.
When Raw SQL Matters
Using an ORM is 100% what you need to be doing in the early stages of a company. When I say that, I mean you should be using it until you have enough customers using your product that paying for a larger machine to host your database is no longer financially feasible. You have to do this because growing the number of features and gaining customers at a small startup is more important than breathing.
If you optimize your code before this point, you’ll have the best-written SQL that nobody uses, because your customers will have gone with a faster-growing competitor. Once you’ve reached the “oh shit” point and can’t just use your credit card to solve the problem, you’ll need to roll up your sleeves and start writing raw SQL.
At a very early-stage startup, it would look something like this:

Code Example 1
const userExists = await transaction.user.findUnique({
  where: {
    id: userId,
    company_users: {
      some: {
        company_id: companyId
      }
    }
  },
})

This code is totally fine and will run smoothly long enough to get the growth you need to survive. It generates SQL that looks like:

Code Example 1 generated SQL
-- Generates the following SQL:
SELECT
  "public"."User"."id",
  "public"."User"."name",
  "public"."User"."email",
  "public"."User"."created_at",
  "public"."User"."updated_at",
  "public"."User"."deleted_at"
FROM "public"."User"
WHERE ("public"."User"."id" = $1
  AND ("public"."User"."id") IN
    (SELECT "t1"."user_id"
     FROM "public"."CompanyUser" AS "t1"
     WHERE ("t1"."company_id" = $2
       AND "t1"."user_id" IS NOT NULL)))
LIMIT $3 OFFSET $4

What you see is valid SQL, but it might miss the indexes. If you’re trying to optimize the database (DB) and see queries like this, it may not be easy to understand what’s going on or how to optimize it.
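One way to find out whether a query like this actually hits an index is to run it through EXPLAIN ANALYZE with concrete values substituted for the parameters (the values 42 and 7 below are placeholders for illustration):

```sql
EXPLAIN ANALYZE
SELECT "public"."User"."id"
FROM "public"."User"
WHERE "public"."User"."id" = 42
  AND ("public"."User"."id") IN
    (SELECT "t1"."user_id"
     FROM "public"."CompanyUser" AS "t1"
     WHERE "t1"."company_id" = 7
       AND "t1"."user_id" IS NOT NULL)
LIMIT 1;
-- Look for "Index Scan" vs "Seq Scan" nodes in the plan output
```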
However, later on, when you have already scaled the DB as high as possible and need to start optimizing your SQL, you can write something like:

Code Example 2
const query = `
  SELECT cu.user_id, cu.company_id
  FROM "CompanyUser" AS cu
  WHERE cu.user_id = $1 AND cu.company_id = $2;
`
const result = await client.query(query, [userId, companyId])

This command hits just one table, and with an index it will be really fast. But if you take this approach to writing SQL each time you need to touch the database while writing new features, you’ll spend a lot of time writing new code, testing to see if it hits the indexes, and so on.
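The speed here assumes a composite index exists on the two columns in the WHERE clause; something like this (the index name is illustrative):

```sql
-- A composite index lets Postgres answer the lookup
-- with a single index scan instead of a table scan
CREATE INDEX company_user_user_company_idx
  ON "CompanyUser" (user_id, company_id);
```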
Doing this on a database with very little usage, like at an early-stage startup, really doesn’t make sense because PostgreSQL might not even use the indexes if the table is very small and already in its cache. So, you reap no benefits on a DB that doesn’t need to push its performance limits.
What an ORM Can’t Do
Let’s say we have a database with users, companies, and posts, like in this example code, and you want to find each month’s top-posting author along with their post count. To do this in Prisma, you’d probably have to query in a loop and make several poorly performing calls to the DB.
That code is not something I want to write out here. You should really never query in a loop because it will have negative performance implications on the database.
const query = `
  DO $$
  DECLARE
    i int;
  BEGIN
    CREATE TEMP TABLE Months (month_id int, month_start timestamp);
    FOR i IN 1..12 LOOP
      INSERT INTO Months (month_id, month_start)
      VALUES (i, make_date(EXTRACT(year FROM current_date)::int, i, 1));
    END LOOP;
  END $$;

  WITH MonthlyPostCounts AS (
    SELECT
      m.month_id,
      m.month_start,
      u.id AS user_id,
      COUNT(p.id) AS post_count,
      DENSE_RANK() OVER (PARTITION BY m.month_id ORDER BY COUNT(p.id) DESC) AS rank
    FROM Months m
    LEFT JOIN "Posts" p
      ON DATE_TRUNC('month', p.created_at) = DATE_TRUNC('month', m.month_start)
    LEFT JOIN "User" u
      ON p.author_id = u.id
    GROUP BY m.month_id, m.month_start, u.id
  )
  SELECT
    m.month_start,
    mp.user_id,
    u.name AS user_name,
    mp.post_count AS post_count
  FROM Months m
  LEFT JOIN MonthlyPostCounts mp
    ON m.month_id = mp.month_id AND mp.rank = 1
  LEFT JOIN "User" u
    ON mp.user_id = u.id
  ORDER BY m.month_id;
`
const results = await client.query(query)

This code will perform better in production under a heavy load because it’s just one query that counts each user’s posts per month, for several months, all at once. You’d most likely want to run this on a read replica or a data lake or, if you’re using Google Cloud Platform, something like BigQuery.
In this example, the first statement between the double dollar signs ($$) is just to create a temporary table of months that we can then use as a Common Table Expression (CTE) to count and rank each user’s post count per month.
We could probably improve this if we had a real production instance with a lot of data, a high amount of usage, and indexes for these tables, but this is just to explain that in one raw SQL statement, you can do what would take Prisma several statements to do. In fact, when doing it raw, you can tune the query/index and get much higher performance out of it.
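As one possible tuning, the DO block and temp table could be replaced with generate_series, which keeps everything a single statement with no temp-table bookkeeping. A sketch of just the months part (untested against the toy repo’s schema; the rest of the query would chain onto this WITH clause unchanged):

```sql
-- Replaces the DO block and temp table: a CTE built from
-- generate_series, so the whole report stays one statement
WITH Months AS (
  SELECT i AS month_id,
         make_date(EXTRACT(year FROM current_date)::int, i, 1) AS month_start
  FROM generate_series(1, 12) AS i
)
SELECT month_id, month_start FROM Months;
```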
Handy ORM Shortcuts
When working at an early-stage startup, it’s important to take shortcuts and use tools/libraries that get new features in front of customers quickly so you can get their feedback. It’s not worth working on a feature for 12 months (basically a century in the startup universe) before getting feedback to see if people even want to use it.
It’s beyond the scope of this blog post to go into how you can generate typed query parameters that you can inject into Prisma queries so that you can re-use your role/permissions-based code repeatedly without re-writing it into the Prisma query each time.
Tools like CASL Prisma can help with this very early on, and it’s super helpful for getting the company to a high growth phase. When performance starts to matter, it’s time to remove tools like this and start writing it by hand or have a service that caches access/permissions on a per-user basis outside the DB.
Conclusion
While it’s clear that ORMs can be a great tool for early-stage startups, they can drag down performance once a business grows large enough that performance becomes a real concern. As a business evolves and grows its client base, raw SQL becomes the far better option for getting efficiency out of the database.
Fun fact:
Did you know PostgreSQL has a sleep command? I can’t think of any situation where I’d want to use this outside of testing a framework’s timeout code.
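For the curious, it looks like this:

```sql
-- Pauses the current session for 5 seconds; handy for
-- exercising a client library's statement-timeout handling
SELECT pg_sleep(5);
```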