The Bootstrap Technique

Chan Lee 2024. 10. 23. 07:22

In data science, the goal is often to estimate an unknown parameter of a population.

For example, suppose we want to estimate the income of an entire nation.

Say we decide to compute the median income and use it as our measure.

1. If you have a census:

Just calculate the parameter from the census, and you're done.

If the population data is already available, all that is left is the calculation.

But of course, this is rarely the case.

2. If you don't have a census:

Take a random sample from the population.

Use a statistic as an estimate of the parameter.
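As a minimal sketch of the two cases, the Python snippet below builds a made-up population of incomes (the log-normal shape, the sizes and the seed are all assumptions for illustration, not real data), computes the median from the full "census", and then estimates it from one random sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic incomes (not real data).
population = [random.lognormvariate(10, 0.5) for _ in range(100_000)]

# Case 1: with a census, the parameter is computed directly.
population_median = statistics.median(population)

# Case 2: without a census, draw one random sample (no repeats)
# and use its median as the estimate of the parameter.
original_sample = random.sample(population, 1_000)
sample_median = statistics.median(original_sample)

print(f"parameter (census median): {population_median:,.0f}")
print(f"estimate  (sample median): {sample_median:,.0f}")
```

The estimate will usually land close to the parameter, but not exactly on it, and a different random sample would give a slightly different estimate.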

In practice, however, it is often hard to collect data repeatedly from many samples.

Sampling is expensive, so we want to do as little of it as possible.

This is where the bootstrap technique, a re-sampling method, comes in.

The bootstrap is a technique for simulating repeated random sampling.

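The bootstrap repeatedly draws new samples, with replacement, from the sample we already have, so we get many additional samples without collecting any new data.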

All that we have is the original sample, which is large and random.

We have one original sample, collected once, and it is large and drawn at random.

In that case, the original sample will very likely resemble the distribution of the population.

So, we can sample at random from the original sample. (More samples from the sample)

That is, drawing further samples from the original sample can stand in for sampling from the population.

Re-sampling from the original random sample ≈ Sampling from the population (with high probability)

The additional samples drawn by the bootstrap technique resemble samples drawn from the population "with high probability".

In other words, with a small but real probability, the additional samples may fail to represent the population properly, and that can lead to a wrong conclusion.
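The idea above can be sketched in plain Python. The original sample here is synthetic (log-normal incomes with a fixed seed, purely an assumption for illustration), and bootstrap_medians is a hypothetical helper name, not part of any library. Each resample is drawn with replacement from the original sample, has the same size, and has its median recorded.

```python
import random
import statistics

def bootstrap_medians(sample, repetitions, rng):
    """Return the median of each bootstrap resample of `sample`."""
    medians = []
    for _ in range(repetitions):
        # Draw len(sample) values from the sample, with replacement.
        resample = rng.choices(sample, k=len(sample))
        medians.append(statistics.median(resample))
    return medians

rng = random.Random(42)
# Hypothetical original sample: 500 synthetic incomes (not real data).
original_sample = [rng.lognormvariate(10, 0.5) for _ in range(500)]

medians = bootstrap_medians(original_sample, repetitions=1_000, rng=rng)
print("median of the original sample:", round(statistics.median(original_sample)))
print("smallest resampled median:    ", round(min(medians)))
print("largest resampled median:     ", round(max(medians)))
```

The resampled medians scatter around the median of the original sample. That spread is what stands in for the variability we would see across repeated samples from the population.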

It is important to do re-sampling with replacement.

Also, the size of the new sample has to be the same as the original one, so that the two estimates are comparable.
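A small sketch of why replacement matters, using a made-up five-value sample (the numbers and the seed are assumptions for illustration). Drawing all n values without replacement only reshuffles the original sample, so every "resample" gives the same statistic. Drawing with replacement lets values repeat or drop out, so the statistic can vary.

```python
import random
import statistics

rng = random.Random(1)
original_sample = [12, 25, 31, 47, 90]  # made-up values
n = len(original_sample)

# Without replacement, a same-size draw is just a permutation,
# so the median never changes.
without = {statistics.median(rng.sample(original_sample, n)) for _ in range(200)}

# With replacement, values can repeat or be left out,
# so the median can differ from resample to resample.
with_repl = {statistics.median(rng.choices(original_sample, k=n)) for _ in range(200)}

print("distinct medians without replacement:", sorted(without))
print("distinct medians with replacement:   ", sorted(with_repl))
```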

With the datascience module, the default behavior of tbl.sample() is:

draw rows at random with replacement, as many times as tbl has rows.

So we can simply use original_sample.sample() to get the bootstrap samples.
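For readers without the datascience module at hand, the default behavior described above can be mimicked in plain Python. sample_rows below is a hypothetical stand-in written for this post, not the library's implementation; its parameter names and defaults are chosen only to mirror the description (with replacement, as many draws as there are rows).

```python
import random

def sample_rows(rows, k=None, with_replacement=True, rng=random):
    """Hypothetical stand-in that mirrors the defaults described above."""
    if k is None:
        k = len(rows)  # default: as many draws as there are rows
    if with_replacement:
        return rng.choices(rows, k=k)  # rows may appear more than once
    return rng.sample(rows, k)  # each row appears at most once

# Made-up rows of (person, income) for illustration.
original_sample = [("a", 31), ("b", 47), ("c", 12), ("d", 90), ("e", 25)]

# Called with no extra arguments, this yields one bootstrap sample.
bootstrap_sample = sample_rows(original_sample, rng=random.Random(7))
print(len(bootstrap_sample) == len(original_sample))
print(bootstrap_sample)
```

Because the defaults already match what the bootstrap needs, a call with no size or replacement arguments is enough to get one bootstrap sample.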