postgresql - SQL: Calculating the day over day repeat user percentage
I have an events table that has user_id and created_at columns.
What I'd like to figure out is the percentage of users having day-over-day repeat entries in the events table.
So, if day d1 has user_ids [a,b,c], and day d2 has user_ids [b,d,e], then b is the only repeat user_id of 3 (from d1), and there is a 33% overlap between d1 and d2.
I'd like to be able to extend this to an arbitrary number of days.
The schema in question:
    create table events (
       events_id  serial primary key
     , user_id    varchar(255) not null
     , created_at datetime not null
    );
This is a large table, having ~25MM rows per day, with ~4.5MM distinct user_ids per day.
Example data set:
+---------+---------------------+
| user_id | created_at          |
+---------+---------------------+
| bob     | 2014-12-02 11:11:11 |
| sally   | 2014-12-02 12:12:11 |
| zed     | 2014-12-02 12:22:11 |
|         | ...                 |
| chris   | 2014-12-03 11:13:11 |
| mark    | 2014-12-03 11:11:13 |
| zed     | 2014-12-03 11:11:33 |
|         | ...                 |
| sydney  | 2014-12-04 11:14:11 |
| zed     | 2014-12-04 11:44:11 |
| chris   | 2014-12-04 11:44:11 |
|         | ...                 |
| sydney  | 2014-12-05 11:15:11 |
| zed     | 2014-12-05 11:55:11 |
| chris   | 2014-12-05 11:55:15 |
| sandy   | 2014-12-05 11:55:51 |
| sydney  | 2014-12-05 11:55:55 |
+---------+---------------------+
Expected output:
+------------+---------------------------+
| day        | returning_user_percentage |
+------------+---------------------------+
| 2014-12-02 | null                      |
| 2014-12-03 | 33                        |
| 2014-12-04 | 66                        |
| 2014-12-05 | 75                        |
+------------+---------------------------+
Additionally, and quite a bit simpler, part 2: I'd like to know how many new users there are each day, where "new" means the user_id has not been seen before.
Answer to the updated question:
"Calculate for every day the percentage of distinct users that have entries for the previous day as well."
    with e as (
       select created_at::date as day, user_id
       from   events
       group  by 1, 2
       )
    select e.day
         , round(100.0 * count(e1.user_id) / count(*), 2) as pct_repeat_user
    from   e
    left   join e e1 on e1.user_id = e.user_id
                    and e1.day = e.day - 1
    group  by 1
    order  by 1;
This returns the desired result, except with 0 instead of null for the first day, which seems more correct to me.
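To check the query against the rows listed in the question, a minimal sketch follows. It assumes a scratch (temporary) table holding only the rows shown, with the "..." rows left out, and it uses timestamp in place of the question's datetime, which PostgreSQL does not have.

```sql
-- Assumption: a throwaway temp table with only the rows listed in the question.
CREATE TEMP TABLE events (
   events_id  serial PRIMARY KEY
 , user_id    varchar(255) NOT NULL
 , created_at timestamp NOT NULL
);

INSERT INTO events (user_id, created_at) VALUES
   ('bob',    '2014-12-02 11:11:11')
 , ('sally',  '2014-12-02 12:12:11')
 , ('zed',    '2014-12-02 12:22:11')
 , ('chris',  '2014-12-03 11:13:11')
 , ('mark',   '2014-12-03 11:11:13')
 , ('zed',    '2014-12-03 11:11:33')
 , ('sydney', '2014-12-04 11:14:11')
 , ('zed',    '2014-12-04 11:44:11')
 , ('chris',  '2014-12-04 11:44:11')
 , ('sydney', '2014-12-05 11:15:11')
 , ('zed',    '2014-12-05 11:55:11')
 , ('chris',  '2014-12-05 11:55:15')
 , ('sandy',  '2014-12-05 11:55:51')
 , ('sydney', '2014-12-05 11:55:55');  -- duplicate user on the same day

-- One row per (day, user) first, then self-join each day to the day before.
WITH e AS (
   SELECT created_at::date AS day, user_id
   FROM   events
   GROUP  BY 1, 2
   )
SELECT e.day
     , round(100.0 * count(e1.user_id) / count(*), 2) AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;

--     day     | pct_repeat_user
-- ------------+-----------------
--  2014-12-02 |            0.00
--  2014-12-03 |           33.33
--  2014-12-04 |           66.67
--  2014-12-05 |           75.00
```

The duplicate sydney row on 2014-12-05 is collapsed by the GROUP BY in the CTE, so that day counts 4 distinct users, 3 of whom were also present on 2014-12-04.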
Explain
Your "date" is a timestamp (not "datetime"). To get unique users per day, cast to date and group by users in the CTE e. This is crucial, or you get nonsensical results from cross-joining duplicate users on one day to duplicate users on the previous day.
Building on that, left join to the previous day. Count the users that can be found "yesterday" and divide by the number of users "today". Multiply by 100.0 to coerce the number to numeric before dividing, and use round() to get the desired number of fractional digits.
Since day is data type date, you can subtract integer 1 to get "yesterday".
Be aware that "day" is defined by your time zone, in case you should have data from multiple time zones. The query does not act up as long as you store timestamp, but there may be an inherent error in your data.
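If the day boundary matters, one possible way to make it explicit is sketched below. It rests on two assumptions that are not stated in the question: that created_at holds UTC values in a plain timestamp column, and that days should be counted in the zone 'America/New_York' (picked purely for illustration).

```sql
-- Assumptions: created_at is a timestamp (without time zone) holding UTC values;
-- 'America/New_York' is only an illustrative target zone.
SELECT DISTINCT
       -- timestamp -> timestamptz (interpreted as UTC) -> local timestamp -> date
       (created_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York')::date AS day
     , user_id
FROM   events;
```

This SELECT could stand in for the body of the CTE in the query above. If the column were timestamptz instead, a bare cast to date would follow the session's TimeZone setting, so the same table could produce different "days" for different sessions.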
DB design
For millions of rows and multiple entries per user, I would urgently advise creating a separate users table and referencing it, to reduce disk space and improve performance:
    create table users (
       user_id  serial primary key
     , username text not null
    );

    create table events (
       events_id  serial primary key
     , user_id    int not null references users
     , created_at timestamp not null
    );
This would be instrumental in speeding up a number of other queries.
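Part 2 of the question (new users per day) is not covered by the query above. A minimal sketch of one way to approach it, assuming "new" means that the user's first-ever row in events falls on that day:

```sql
-- Assumption: "new on day X" = the user's earliest created_at falls on day X.
SELECT first_day AS day
     , count(*)  AS new_users
FROM  (
   SELECT user_id
        , min(created_at)::date AS first_day  -- first day this user was ever seen
   FROM   events
   GROUP  BY user_id
   ) f
GROUP  BY 1
ORDER  BY 1;
```

It works the same whether user_id is the original varchar or the int key from the suggested users table. Days without any new user do not appear in the result; they would have to be added from a calendar of days (for example via generate_series()) if zero rows are wanted.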