tmp
- two approaches:
a. 5% chance to be an ad per post.
b. every 20 post must have an ad in it.
(1. compute each expected value and variance for number of ads in 100 posts.
Approach a: let X denotes the number of ads, the probability distribution of X is binomial distribution. X~n(n,p)
E(X) = 100*0.05 = 5
σ2 = Var(X) = np*(1-p) = 100*0.05*(1-0.05) = 4.75
Approach b:
E(X) = 100/20 = 5
(2. probability of getting more than 10 ads in 100 posts with approach a.
approach a应该符合二项分布,p=0.05, q = 1-p = 0.95.
如果用二项分布解的话:
p(more than 10 ads) = 1 - p(less than or equal 10 ads)
p(less than or equal 10 ads) = p(ads = 0) + p(ads=1) + ... + p(ads=10)
等式右边的每一项可以用二项分布的概率密度函数解。。
但是这么解会比较耗时间。可以考虑用正态分布去估计(因为我们看到样本数目比较大np>=5且nq>=5):
此处 mu = 5, var = 100*0.05*0.95 = 4.75, sigma = sqrt(4.75) = 2.18
Z = (x - mu)/sigma = (10 - 5)/2.18 = 2.29
如果没有Z表,可以这么估计:(我不太确定这么估计是否会让面试官满意,但是应该比没有估计好吧。。)
我们很熟悉的单尾Z<1.96的概率是0.975, 所以p(Z>1.96) = 0.025
所以p(Z>2.29) < p(Z>1.96) = 0.025
解答:
100个posts中有超过10个广告的概率不超过2.5%,具体数字根据查表得到为1.1%。
根据二项分布的解法,用Excel算了一下,结果是:1.1472%
(3. expected number of seeing back-to-back ads in 100 posts with two approaches. 意思就是平均会有几次连着看两个posts that are both ads
(4. 每25个post出现一个ad 或4%的概率出现 哪种好?
- 问了hypothesis testing,如何和non-technical stakeholder解释test design(设定confidence/power level的意义),为什么要看significance
- DS- 统计问了几个distribution,解释了一下poisson distribution
- In a class you have 30 students, what is the probability of that we have exactly two people who have the same birthday?
- 一道stats题: 1)一个硬币翻几次翻到head 2) 翻几次翻到连续三次 head https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf
- Table A (date, Order_id, Product ID, Sales)
Table B( Product ID, Product_category, Brand name)
1) YoY Calculation in Oracle for the month level using case statement
with sale as
(
select '2015-01-10' sale_date, 100 sales from dual UNION
select '2015-01-11' sale_date, 100 sales from dual UNION
select '2015-01-12' sale_date, 100 sales from dual UNION
select '2015-02-10' sale_date, 100 sales from dual UNION
select '2015-02-11' sale_date, 100 sales from dual UNION
select '2015-03-10' sale_date, 100 sales from dual UNION
select '2015-04-10' sale_date, 100 sales from dual UNION
select '2016-01-10' sale_date, 110 sales from dual UNION
select '2016-01-11' sale_date, 110 sales from dual UNION
select '2016-01-12' sale_date, 110 sales from dual UNION
select '2016-02-10' sale_date, 110 sales from dual UNION
select '2016-02-11' sale_date, 110 sales from dual UNION
select '2016-03-10' sale_date, 110 sales from dual UNION
select '2016-05-10' sale_date, 110 sales from dual
)
select mon, sum(case when yr = 2015 then monthly_sales end) "2015", sum(case when yr = 2016 then monthly_sales end) "2016" from
(
select extract(year from to_date(sale_date,'YY-MM-DD')) yr,extract(month from to_date(sale_date,'YY-MM-DD')) mon,sum(sales) monthly_sales from sale
group by extract(year from to_date(sale_date,'YY-MM-DD')),extract(month from to_date(sale_date,'YY-MM-DD'))
) t
group by mon;
|
2015 |
2016 |
YoY |
Jan |
207 |
304 |
47% |
Feb |
245 |
290 |
18% |
- Pivoting using case statement :Find the monthly Sales in columns for all years
|
Jan |
Feb |
Mar |
Apr |
May |
Jun |
Jul |
Aug |
Sep |
Oct |
Nov |
Dec |
2010 |
140 |
142 |
143 |
115 |
173 |
130 |
136 |
106 |
102 |
174 |
154 |
104 |
2011 |
127 |
123 |
191 |
148 |
120 |
164 |
151 |
151 |
200 |
163 |
144 |
171 |
2012 |
227 |
168 |
190 |
169 |
234 |
174 |
217 |
207 |
181 |
185 |
168 |
193 |
select yr, sum(case when mon = 1 then monthly_sales end) Jan, sum(case when mon = 2 then monthly_sales end) Feb,
sum(case when mon = 3 then monthly_sales end) Mar, sum(case when mon = 4 then monthly_sales end) Apr,
sum(case when mon = 5 then monthly_sales end) May, sum(case when mon = 6 then monthly_sales end) Jun,
sum(case when mon = 7 then monthly_sales end) Jul, sum(case when mon = 8 then monthly_sales end) Aug,
sum(case when mon = 9 then monthly_sales end) Sep, sum(case when mon = 10 then monthly_sales end) Oct,
sum(case when mon = 11 then monthly_sales end) Nov, sum(case when mon = 12 then monthly_sales end) "Dec"
from
(
select extract(year from to_date(sale_date,'YY-MM-DD')) yr,extract(month from to_date(sale_date,'YY-MM-DD')) mon,sum(sales) monthly_sales from sale
group by extract(year from to_date(sale_date,'YY-MM-DD')),extract(month from to_date(sale_date,'YY-MM-DD'))
) t
group by yr;
- Find Duplicate row in Oracle :SELECT * FROM emp a WHERE rowid = (SELECT MAX(rowid) FROM EMP b WHERE a.empno=b.empno)
http://www.java67.com/2013/04/10-frequently-asked-sql-query-interview-questions-answers-database.html?m=1
Q9 - Skewed Join
Here we have 2 tables:
- Table of customer accounts: acc (id, name)
- Click-stream table: click (id, acc_id, ts, url)
Symptoms: When we join those 2 tables on account id all reducers except one finish in no time. The last reducers runs forever and never finishes.
What is the root cause of the issue? How it could be fixed.
Answer
Most likely we have a 'skewed join' situation. For instance we have 90% from clicks from unregistered customers. In acc table we have a special record for them -1:guest.
If acc table is small enough to fit into memory, then we can just use map-side join.
If acc is too big to fit into memory, but we know that our data distribution in this join is constant, then we can perform a manual optimization by splitting this join in 2: join all account except 'guest' and then add clicks from guests.
INSERT overwrite TABLE click_acc SELECT c.*, a.name FROM click c JOIN acc a ON a.id = c.acc_id WHERE c.acc_id != -1;
INSERT
INTO
TABLE
click_acc SELECT
*, 'guest'
name FROM
click c WHERE
c.acc_id =
-1;
Q10 - In memory partitions filtering
The table has many partition keys (number of actual partitions 1M+): event (a,b,c) partitioned by (day string, hour int, code int) location 's://my_bucket/tables/event'.
Query filters data by partitions to improve performance:
SELECT * FROM event WHERE DAY='2015-01-01' AND HOUR=1 AND pmod(code,10) IN (5,7);
Actual performance is very poor. Here are the symptoms:
- Hive spends several hours preparing the job
- When MR job is submitted it finishes in several minutes.
Answer
Hive performs in-memory filtering of partitions because it cannot translate UDF 'pmod' to native SQL code. This UDF should be removed from the where clause to allow Hive generate native SQL and perform partitions filtering on RDBMS (MySql) side.