# New Site

Hi everyone! I am happy to announce a new version of this website. Most of the old articles will remain here, and new articles will be published on the new website. You can already find several new articles there, some of which were actually published months ago. One of them covers a major update of WriteTeX; for further WriteTeX updates, please also refer to the new website. The new website is not yet fully functional, and more updates are on the way. You are welcome to visit the new website at longqi.ga.

# Release of WriteTeX v 1.00

Everything That Has A Beginning Has An End. From “The Matrix Revolutions”

# Debunking the Claim: India Is the Country of Rape

## Mean Analysis

    globalmean <- rape %>%
      group_by(Year) %>%
      summarise(mean = round(mean(Events, na.rm = TRUE), 2),
                median = median(Events, na.rm = TRUE),
                sd = round(sd(Events, na.rm = TRUE), 2))



# How to parallelize 'do' computation in dplyr

## Introduction

Recently, I took part in the IJCAI-17 Customer Flow Forecasts competition. It is an interesting competition to some extent. The datasets provided include:

1. shop_info: shop information data
2. user_pay: user payment behavior
3. user_view: user view behavior
4. prediction: test set and submission format

Because the nature of this problem is to predict time series, methods specifically designed for this task should be tested. The well-known ones include:

1. ARIMA series models
2. ETS series models
3. Regression models

It is also not hard to see that customer flow is a seasonal time series. Therefore, time series decomposition methods such as X12 and STL may be useful tools in the analysis.
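As a rough illustration of what such a decomposition looks like, here is a minimal sketch using base R's `stl()` on a synthetic daily series. The data, the weekly pattern, and all variable names are made up for this example and are not taken from the competition datasets.

```r
# Synthetic daily series with a weekly pattern (illustrative data only)
set.seed(1)
n_weeks <- 20
weekly_pattern <- c(80, 82, 85, 88, 100, 140, 130)  # Mon..Sun, hypothetical values
trend <- seq(0, 30, length.out = 7 * n_weeks)
flow <- rep(weekly_pattern, n_weeks) + trend + rnorm(7 * n_weeks, sd = 5)

# frequency = 7 declares one week as the seasonal period
flow_ts <- ts(flow, frequency = 7)

# STL splits the series into seasonal, trend and remainder components
fit <- stl(flow_ts, s.window = "periodic")
components <- fit$time.series
head(components)
```

Calling `plot(fit)` draws the original series and the three components in stacked panels. With `s.window = "periodic"` the seasonal component is forced to repeat identically every week; a numeric window would let it change slowly over time.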

### Preprocessing

The datasets include plenty of information, such as which user_id made a payment to which shop_id at what time. Because the goal is to predict the flow of each shop, and it is hard to build a user_id profile-based model with only this amount of data, a shop_id profile-based solution appears to be a better choice, i.e., we build a model for each shop and make the predictions. Therefore, in preprocessing, the payments are aggregated over user_id. This is an entry-level task for a dplyr (R) or pandas (Python) user, so I do not share the code for this part. The results are organized as the following dataset:

    library(psych)
    summary(tc2017)
    ##     shop_id       time_stamp             date_week         nb_pay
    ##  Min.   :   1   Min.   :2015-06-26   Monday   :86544   Min.   :   1.0
    ##  1st Qu.: 504   1st Qu.:2016-02-03   Tuesday  :84851   1st Qu.:  51.0
    ##  Median :1019   Median :2016-05-22   Wednesday:85283   Median :  82.0
    ##  Mean   :1008   Mean   :2016-05-04   Thursday :85643   Mean   : 116.3
    ##  3rd Qu.:1512   3rd Qu.:2016-08-15   Friday   :86041   3rd Qu.: 135.0
    ##  Max.   :2000   Max.   :2016-10-31   Saturday :84902   Max.   :4704.0
    ##                                      Sunday   :86011
    describe(tc2017)
    ##             vars      n    mean     sd median trimmed    mad min  max
    ## shop_id        1 599275 1007.58 577.88   1019 1008.69 747.23   1 2000
    ## time_stamp*    2 599275     NaN     NA     NA     NaN     NA Inf -Inf
    ## date_week*     3 599275    4.00   2.00      4    4.00   2.97   1    7
    ## nb_pay         4 599275  116.26 132.04     82   93.64  54.86   1 4704
    ##             range  skew kurtosis   se
    ## shop_id      1999 -0.02    -1.22 0.75
    ## time_stamp*  -Inf    NA       NA   NA
    ## date_week*      6  0.00    -1.25 0.00
    ## nb_pay       4703  7.06   105.67 0.17
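For readers who want a starting point for the aggregation step, the following is a minimal sketch of how a per-payment log could be collapsed into daily counts per shop with dplyr. The toy `user_pay` data frame and its column layout are assumptions for illustration only; the real competition file may be structured differently, and this is not the code used to produce `tc2017`.

```r
library(dplyr)

# Hypothetical raw payment log: one row per payment (column names are assumed)
user_pay <- data.frame(
  user_id = c(1, 2, 3, 1, 2, 4),
  shop_id = c(1, 1, 1, 2, 2, 2),
  time_stamp = as.POSIXct(c(
    "2016-10-01 10:00:00", "2016-10-01 12:30:00", "2016-10-02 09:15:00",
    "2016-10-01 18:00:00", "2016-10-03 11:00:00", "2016-10-03 20:45:00"
  ), tz = "UTC")
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

daily_pay <- user_pay %>%
  # Drop the time of day so payments are grouped per calendar day
  mutate(time_stamp = as.Date(time_stamp)) %>%
  # Count payments per shop and day; user_id is aggregated away
  count(shop_id, time_stamp, name = "nb_pay") %>%
  # "%u" gives the ISO weekday number (1 = Monday), independent of locale
  mutate(date_week = factor(day_names[as.integer(format(time_stamp, "%u"))],
                            levels = day_names))

print(daily_pay)
```

The result has one row per shop and day with the same four columns as the summary above: `shop_id`, `time_stamp`, `nb_pay`, and `date_week`.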

### Exploratory

First, let us make some figures, using ggplot2 of course. Plots for the first five shops:

The two figures above are quite messy. We can see that the series cover different ranges, which means that we may have to worry about NAs. Moreover, most of the series are not stable over the given range; for these five curves, the data are steadier after April 2016. The series are then plotted in separate panels as follows:

    library(dplyr)
    library(ggplot2)
    # One panel per shop, each with its own axis scales
    p <- ggplot(tc2017 %>% filter(shop_id < 6), aes(time_stamp, nb_pay)) +
      geom_line() +
      facet_wrap(~shop_id, ncol = 1, scales = "free")
    print(p)

Some series have a strong seasonal feature, such as the curve for shop_id==4, so we may need to consider the seasonal effect. A quick ACF plot is shown below:

    acf((tc2017 %>% filter(shop_id == 4))$nb_pay)

It can be observed that the periodic pattern is quite clear: the period is 7, which is the length of one week. Therefore, we plot the data against the weekday:

    p <- ggplot(tc2017 %>% filter(shop_id == 4), aes(time_stamp, nb_pay)) +
      geom_line(size = 1) +
      facet_grid(date_week ~ ., scales = "fixed") +
      theme_bw()
    print(p)

As shown in the figure above, the number of customers is much steadier when we look at the flow on the same weekday. This pattern also appears in the data of other shops.

    p <- ggplot(tc2017 %>% filter(shop_id < 6),
                aes(time_stamp, nb_pay, color = date_week)) +
      geom_line(size = 1) +
      facet_grid(date_week ~ shop_id, scales = "free") +
      scale_color_brewer(palette = "Set1") +
      theme_bw() +
      theme(legend.position = "none")
    print(p)

Generally, the flows have quite different patterns on weekdays and weekends. However, the long-term trend also plays an important role in the flow. Let's make some predictions here:

    library(forecast)
    library(xts)
    to.ts <- function(x) {
      ts(x$nb_pay, frequency = 7)
    }