postgresql - SQL: Calculating the day-over-day repeat user percentage


I have an events table that has user_id and created_at columns.

What I'd like to figure out is the percentage of users having day-over-day repeat entries in the events table.

So, if day d1 has user_ids [a,b,c], and day d2 has user_ids [b,d,e], then b is the only repeat user_id out of 3 (from d1), and there is a 33% overlap between d1 and d2.

I'd like to be able to extend this to an arbitrary number of days.

The schema in question:

CREATE TABLE events (
  events_id  serial PRIMARY KEY
, user_id    varchar(255) NOT NULL
, created_at datetime NOT NULL
);

This is a large table, with ~25MM rows per day and ~4.5MM distinct user_ids per day.

Example data set:

+---------+---------------------+
| user_id |     created_at      |
+---------+---------------------+
| bob     | 2014-12-02 11:11:11 |
| sally   | 2014-12-02 12:12:11 |
| zed     | 2014-12-02 12:22:11 |
|         | ...                 |
| chris   | 2014-12-03 11:13:11 |
| mark    | 2014-12-03 11:11:13 |
| zed     | 2014-12-03 11:11:33 |
|         | ...                 |
| sydney  | 2014-12-04 11:14:11 |
| zed     | 2014-12-04 11:44:11 |
| chris   | 2014-12-04 11:44:11 |
|         | ...                 |
| sydney  | 2014-12-05 11:15:11 |
| zed     | 2014-12-05 11:55:11 |
| chris   | 2014-12-05 11:55:15 |
| sandy   | 2014-12-05 11:55:51 |
| sydney  | 2014-12-05 11:55:55 |
+---------+---------------------+

Expected output:

+------------+---------------------------+
|    day     | returning_user_percentage |
+------------+---------------------------+
| 2014-12-02 | null                      |
| 2014-12-03 | 33                        |
| 2014-12-04 | 66                        |
| 2014-12-05 | 75                        |
+------------+---------------------------+

Additionally, and quite likely far simpler, part 2: I'd like to know how many new users there are each day, where "new" means the user_id has not been seen before.

Answer to the updated question:

"Calculate for every day the percentage of distinct users that have entries for the previous day as well."

WITH e AS (
   SELECT created_at::date AS day, user_id
   FROM   events
   GROUP  BY 1, 2
   )
SELECT e.day
     , round(100.0 * count(e1.user_id) / count(*), 2) AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;

This returns the desired result, except with 0 instead of NULL for the first day, which seems more correct to me.
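If NULL for the first day is preferred, as in the expected output of the question, one possible variation is sketched below. It assumes the same events table and PostgreSQL; the CASE expression is an addition for illustration and is not part of the query above.

```sql
WITH e AS (
   -- one row per (day, user): de-duplicate before joining
   SELECT created_at::date AS day, user_id
   FROM   events
   GROUP  BY 1, 2
   )
SELECT e.day
       -- the earliest day has no "yesterday" to compare against;
       -- CASE without ELSE yields NULL for it
     , CASE WHEN e.day > (SELECT min(day) FROM e)
            THEN round(100.0 * count(e1.user_id) / count(*), 2)
       END AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;
```

Only the very first day in the table becomes NULL. A later day that follows a gap with no data still reports 0.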

Explain

  • Your "date" is a timestamp (not "datetime"). To get unique users per day, cast to date and group by day and user in the CTE e. This is crucial, or you get nonsensical results from cross-joining duplicate users on one day with duplicate users on the previous day.

  • Building on that, LEFT JOIN to the previous day. Count the users that can be found "yesterday" and divide by the number of users "today". Multiply by 100.0 to coerce the number to numeric before dividing, and use round() to get the desired number of fractional digits.

  • Since day is of data type date, you can simply subtract the integer 1 to get "yesterday".

  • Be aware that "day" is defined by the time zone, which matters if you have data from multiple time zones. The query does not act up as long as you store timestamp, but there may be an inherent error in your data.
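To make the time zone caveat concrete, here is a minimal sketch that fixes the meaning of "day" explicitly. It rests on two assumptions that the source does not confirm: that created_at holds UTC values in a plain timestamp column, and that days should be counted in the zone 'America/New_York' (an arbitrary example name).

```sql
WITH e AS (
   -- timestamp AT TIME ZONE 'UTC' interprets the stored value as UTC (-> timestamptz);
   -- the second AT TIME ZONE converts it to local wall-clock time in the target zone
   SELECT (created_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York')::date AS day
        , user_id
   FROM   events
   GROUP  BY 1, 2
   )
SELECT e.day
     , round(100.0 * count(e1.user_id) / count(*), 2) AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;
```

Events close to midnight can land on a different day depending on the chosen zone, which changes both the numerator and the denominator.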

SQL Fiddle.
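Part 2 of the question (new users per day) is not covered by the query above. A minimal sketch, assuming "new" means the first day a user_id ever appears in events, could look like this; the names first_seen, first_day and new_users are made up for illustration.

```sql
SELECT first_day AS day
     , count(*)  AS new_users
FROM  (
   -- earliest day each user_id was seen
   SELECT user_id, min(created_at)::date AS first_day
   FROM   events
   GROUP  BY user_id
   ) first_seen
GROUP  BY 1
ORDER  BY 1;
```

Days without any new user do not appear in the result at all. Joining against a generated series of days would be one way to show them with a count of 0.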

DB design

For millions of rows and multiple entries per user, I urgently advise creating a separate users table and referencing it, to reduce disk space and improve performance:

CREATE TABLE users (
  user_id  serial PRIMARY KEY
, username text NOT NULL
);

CREATE TABLE events (
  events_id  serial PRIMARY KEY
, user_id    int NOT NULL REFERENCES users
, created_at timestamp NOT NULL
);

This is instrumental to speeding up this query and a number of other queries.
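How existing data could be moved into the normalized layout is sketched below. It assumes the original table was first renamed to events_old (a hypothetical name) and that the new users and events tables from the design above already exist and are empty.

```sql
-- one users row per distinct name found in the old table
INSERT INTO users (username)
SELECT DISTINCT user_id
FROM   events_old;

-- copy the events, replacing each name with its new integer key
INSERT INTO events (user_id, created_at)
SELECT u.user_id, o.created_at
FROM   events_old o
JOIN   users u ON u.username = o.user_id;
```

The day-over-day query itself needs no changes after this, since it only relies on user_id and created_at in events.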

