GitHub Copilot and the Methods of Rationality
5 points • 2 comments
Recent @malisper Activity
> the number of cold emails and calls we get from car part distributors is amusing
Heck, my company is called Freshpaint. The number of people that visit our site, freshpaint.io, see all the copy about data infrastructure, and still contact us looking for a quote to paint their house is hilarious.
> It's not specified in the article?
From the article:
> The function should have expected O(1) performance.
> Do that for rate limiting and it'll be super easy to DoS you
How so? The rate limiter has the same performance characteristics as, for example, a hash table. Operations are usually O(1), but are periodically O(n). It's not like every service that uses a hash table is DoS-able.
> hence the proposed solution has a worst case complexity of O(N)?
Worst case yes, but the question is concerned with the average case.
> "Produce a dictionary of the number of times each letter occurs in each word." is heading the wrong direction. You want an encoding that does not care about the number of times each letter occurs in each word.
How so? "aab" is an anagram of "baa", but they are not anagrams of "ab".
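A letter-count encoding along those lines can be sketched in Python (an illustrative implementation, not code from the article):

```python
from collections import Counter

def anagram_key(word):
    # Canonical encoding: sorted (letter, count) pairs.
    # Two words are anagrams iff their encodings are equal,
    # so multiplicity matters: "aab" != "ab".
    return tuple(sorted(Counter(word).items()))

# Group words by their encoding.
groups = {}
for w in ["aab", "baa", "ab"]:
    groups.setdefault(anagram_key(w), []).append(w)
# "aab" and "baa" land in one group; "ab" in another.
```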
> The solution described in the article is likely to be extremely wasteful in both time and memory, by allocating a queue entry for each call, and then O(n) scanning and dropping stale entries on each successive call.
Even if n elements are scanned in the worst case, the expected time to perform such a scan is O(1). We perform a scan on each insertion, and a scan of length n deletes O(n) elements. Since each element can be deleted at most once after being inserted, the total number of elements scanned is proportional to the number of elements inserted.
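A minimal sketch of this kind of queue-based limiter, assuming a limit of 10 calls per 60-second window (the names and structure are my own, not the article's):

```python
from collections import deque
import time

class RateLimiter:
    """Sliding-window rate limiter: allow at most `limit` calls
    in any trailing `window`-second interval."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.calls = deque()  # timestamps of accepted calls

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop stale entries. Each timestamp is appended once and
        # popped at most once, so this loop is amortized O(1) per call.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```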
> Tabulating call count by division(s) of time would be less obviously problematic.
Tabulating call count by division(s) of time doesn't quite work. If 9 calls are made in the last second of one division and 9 more calls are made in the first second of the next division, you hit 18 calls within a two-second interval, which is over the limit of 10 calls for any one-minute interval.
> Rate limiter: I would probably use a hash table of this structure -- called[yyyy-mm-dd][hh-mm] and then increment the hashtable for that minute, for example, called[2022-01-02][22-01]++ and drop any entries for the last day at the end of the day.
As Bradley said, this implementation can be called more than 10 times within a 1 minute window. If it's called 9 times in the last second of one minute and 9 times at the start of the next minute, it will be called 18 times within a two-second interval.
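The failure mode is easy to reproduce with a toy fixed-window counter (a hypothetical sketch of the proposal above, not the commenter's actual code):

```python
from collections import defaultdict

LIMIT = 10  # intended cap: 10 calls per minute

def make_limiter():
    # Bucket calls by the minute they fall in and cap each bucket.
    counts = defaultdict(int)
    def allow(t):
        minute = int(t) // 60
        if counts[minute] < LIMIT:
            counts[minute] += 1
            return True
        return False
    return allow

allow = make_limiter()
# 9 calls at t=59s (end of minute 0) and 9 at t=60s (start of minute 1):
burst = [allow(59.0) for _ in range(9)] + [allow(60.0) for _ in range(9)]
# All 18 calls land in a two-second span, yet every one is accepted,
# because each minute-bucket individually stays under its cap.
```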
The "algorithm" I wrote in this post is a formalized version of some problem solving techniques I picked up from the classic book "How to Solve It". In particular "working backwards". You start from the goal you want and see what you know that is applicable to getting you to that goal.
There's a specific reason I didn't mention priority queues in the post. In most cases, anything you can do with a heap you can do with a balanced binary search tree instead! A balanced binary search tree has O(log(n)) insertion and deletion, which is the same as a traditional heap. The only advantage a traditional heap has is that you can construct a heap in O(n) time, whereas building a binary search tree takes O(n log(n)) time.
Of course there are even more niche data structures like a Fibonacci heap which have O(1) insertion, but you will have to get extremely unlucky to get asked about a Fibonacci heap in an interview.
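The construction-cost tradeoff can be illustrated with Python's heapq, an array-based binary heap (the sorted list here merely stands in for the O(n log(n)) cost of building an ordered structure):

```python
import heapq

data = [7, 1, 5, 3, 9, 2]

# Array-based binary heap: heapify runs in O(n).
heap = list(data)
heapq.heapify(heap)
smallest = heapq.heappop(heap)  # O(log n) per pop

# The ordered-structure alternative costs O(n log n) to build,
# though afterwards it supports the same O(log n) operations.
ordered = sorted(data)
```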
The post touches upon it, but I didn't really understand the point. Why doesn't synchronous replication in Postgres work for this use case? With synchronous replication you have a primary and secondary. Your queries go to the primary and the secondary is guaranteed to be at least as up to date as the primary. That way if the primary goes down, you can query the secondary instead and not lose any data.
(I used to work at Heap.)
I think the post is spot on, and it shows the Satchel team did a ton of research. There are two comments I would make on top of the post. First, know whether the advice applies to you. If you are pre-product-market fit, it's probably too early to think about event-based analytics. If you have a small number of users and are able to talk with all of them, you will get much more meaningful data from getting to know them than from setting up product analytics. You probably don't have enough users to get meaningful data from product analytics anyways.
Second, while the autotrack functionality at Heap is fantastic, what I saw was that a significant portion of Heap's customers were not able to use it. This primarily happened because in addition to using Heap's autotrack to collect data, a lot of Heap's customers were also using Segment to collect and route the data between different tools. This created two different sources of truth for the data and Segment usually wound up winning. For that reason, I left Heap ~18 months ago to start Freshpaint (freshpaint.io). Freshpaint is an autotrack based alternative to Segment, allowing you to autotrack data and feed that same dataset into all your different tools. That way you get all the advantages of autotrack without needing to maintain two sources of truth.
That's a different argument than what most people in the thread are making and, in my opinion, a reasonable one.
ITT: people claiming DoorDash has no path to profitability - the numbers in the S-1 tell a different story. On page 112, there's a chart of how much profit they make per order based on how long the user has been on the platform. In the first year, the value is negative because DoorDash spends money on sales and marketing to acquire the customer. By the third year, DoorDash is making a consistent profit of 8% on each order placed.
When you look at the numbers in aggregate, it appears they are massively unprofitable. When you look at the numbers by cohort, it's clear they are investing money in sales and marketing and based on their metrics, they will generate a significant return on their investment over the next few years. Over time, as a larger and larger percentage of their users become recurring users, their profit on each order will approach 8%.
IIRC, EXPLAIN VERBOSE will show you the columns being selected by each step of the plan. The inlining showed up there.
I've mentioned this story here before, but one of the most surprising performance gains I saw came from eliminating TOAST lookups. If I recall correctly, each time you use the `->>` operator on a TOASTed JSONB column, the column is deTOASTed. That means if you write a query like:

    SELECT x ->> 'field1', x ->> 'field2', x ->> 'field3' FROM table

and x is TOASTed, Postgres will deTOAST x three separate times. This multiplies the amount of data that needs to be processed and dramatically slows things down.
My first attempt to fix this was to read the field in one query and use a subselect to pull out the individual fields. This attempt was thwarted by the Postgres optimizer which inlined the subquery and still resulted in deTOASTing the field multiple times.
After a discussion on the Postgres IRC channel, RhodiumToad pointed out that adding OFFSET 0 to the end of the subquery prevents Postgres from inlining it. After retrying with that change, I saw an order of magnitude improvement from eliminating the redundant work.
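Based on that description, the rewrite looks roughly like the following. This is a sketch reconstructed from the prose above, not the actual query; `table`, `x`, and the field names stand in for the real schema, and whether the planner materializes the deTOASTed value this way can depend on the Postgres version:

```sql
SELECT t.x ->> 'field1', t.x ->> 'field2', t.x ->> 'field3'
FROM (
    -- OFFSET 0 is a no-op for the result set, but it stops the
    -- planner from inlining the subquery into the outer query.
    SELECT x FROM table OFFSET 0
) t;
```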
For a post detailing the modern data infrastructure I'm surprised they intentionally leave out SaaS analytics tools. I find this especially surprising given a16z has invested >$65M into Mixpanel.
Based on my experience working at an analytics company and running one myself, what this post misses is that an increasing number of people working with data today are not engineers. These people range from product managers trying to figure out what features the company should focus on building, to marketers trying to figure out how to drive more traffic to their website, to the CEO trying to understand how the business as a whole is doing.
For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.
> Absolutely false. This is simply not how copyright works!
How so? Here's what section 5 of the AGPL says:
> You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
> c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
As for patents which is covered in section 11:
> If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients.
> Disgusting amount of control over people's lives. Y'all need a union.
FWIW this is more Google being risk-averse than Google being evil. If a Google employee contributes to an AGPL project in their free time, that employee needs to open source all IP related to their contribution to the project. Depending on the specifics and how the AGPL is interpreted in court, Google could be forced to open source internal IP.
FWIW, I believe all large companies take a similar stance on the AGPL.
> But this only can happen if replace X with Y and add Z are in the same transaction. If you rollback the transaction, then you start by removing Z, then replacing Y with X. What I'm missing here?
Z can be in a different transaction than X and Y. If two transactions run concurrently, one that replaces X with Y and a second one that inserts Z, the above scenario can happen.
> Does it do this in the reverse order?
I would guess so, but I haven't looked at the implementation, so I'm not sure. There are a ton of race conditions that can come up depending on the exact order you write things, so I'm sure the actual implementation is pretty messy.