
The Bootstrap Technique

Chan Lee 2024. 10. 23. 07:22

In data science, the goal is often to estimate an unknown parameter of a population.

For example, suppose we want to estimate the income of an entire nation.

Say we decide to compute the median income and use it as our indicator.

 

1. If you have a census: 

Just calculate the parameter from the census, and you're done. 

Having data on the entire population is, of course, rare.

 

2. If you don't have a census: 

Take a random sample from the population. 

Use a statistic as an estimate of the parameter. 

In practice, however, it is often hard to collect information from many samples over and over again.

Sampling is expensive, so we want to do as little of it as possible.
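The idea of using a statistic as an estimate can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the population is simulated (100,000 log-normal "incomes"), and the sample size of 1,000 is arbitrary. In real life the population median would not be available for comparison.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated, right-skewed incomes (not real data).
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]

# In practice we never see the whole population; we get one random sample.
sample = random.sample(population, 1_000)

parameter = statistics.median(population)  # unknown in real life
estimate = statistics.median(sample)       # the statistic we can compute

print(f"population median (parameter): {parameter:,.0f}")
print(f"sample median (estimate):      {estimate:,.0f}")
```

The two numbers should be close but not identical, and a different random sample would give a slightly different estimate.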

 

This is where the bootstrap technique, a re-sampling method, comes in.

The bootstrap is a technique for simulating repeated random sampling.

It repeatedly draws with replacement from the one sample we already have, producing as many additional samples as we need.

 

All that we have is the original sample, which is large and random.

Because it was collected at random and is large, this one original sample is very likely to resemble the distribution of the population.

 

So, we can sample at random from the original sample. (More samples from the sample)

In other words, drawing additional samples from the original sample can stand in for sampling from the population again.
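A single resample of this kind can be sketched with the standard library alone. The original sample below is simulated and purely hypothetical; `random.choices` draws with replacement, and the resample is given the same size as the original.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical original sample: 500 simulated incomes standing in for survey data.
original_sample = [random.lognormvariate(10, 0.8) for _ in range(500)]

# One bootstrap resample: draw at random from the original sample itself.
# random.choices samples with replacement, so values can repeat.
resample = random.choices(original_sample, k=len(original_sample))

print("original median:", statistics.median(original_sample))
print("resample median:", statistics.median(resample))
```

Every value in the resample comes from the original sample, so no new data is collected; only the mix of values changes from one resample to the next.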

 

 

Re-sampling from the original random sample ≈ Sampling from the population (with high probability) 

Samples drawn with the bootstrap technique resemble samples drawn from the population, but only "with high probability".

There remains a small but real chance that the resamples fail to represent the population properly, which can lead to a wrong conclusion.

 

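It is important to do the re-sampling with replacement; without replacement, a resample of the same size would just be the original sample in a different order.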

Also, the size of the new sample has to be the same as the original one, so that the two estimates are comparable. 
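To see why replacement matters at this sample size, here is a small sketch using only the standard library. The data is simulated and hypothetical. Drawing without replacement (`random.sample`) at full size only reshuffles the original values, so the median cannot change; drawing with replacement (`random.choices`) repeats some values and leaves others out.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical original sample (simulated, not real data).
original_sample = [random.lognormvariate(10, 0.8) for _ in range(500)]
n = len(original_sample)

# Without replacement: a same-size draw is a reshuffle of the original.
without = random.sample(original_sample, n)

# With replacement: some values appear several times, others not at all.
with_repl = random.choices(original_sample, k=n)

print(sorted(without) == sorted(original_sample))                        # True
print(statistics.median(without) == statistics.median(original_sample))  # True
print("distinct values in the with-replacement resample:", len(set(with_repl)))
```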

 

With the datascience module, the default behavior of tbl.sample() is to draw rows at random with replacement, as many times as tbl has rows. 

 

So we can simply call original_sample.sample() to get a bootstrap sample, and call it again for each additional one.
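For readers without the datascience package installed, the same default behavior (at random, with replacement, as many draws as there are rows) can be imitated with the standard library. This is only a sketch under assumptions: `bootstrap_medians` is a hypothetical helper name, the data is simulated, and 1,000 repetitions is an arbitrary choice.

```python
import random
import statistics

def bootstrap_medians(original_sample, repetitions=1_000, seed=None):
    """Return the median of each bootstrap resample (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(original_sample)
    medians = []
    for _ in range(repetitions):
        # Mirror the default described above: random, with replacement, n draws.
        resample = rng.choices(original_sample, k=n)
        medians.append(statistics.median(resample))
    return medians

# Hypothetical original sample (simulated, not real data).
data_rng = random.Random(3)
original_sample = [data_rng.lognormvariate(10, 0.8) for _ in range(500)]

medians = bootstrap_medians(original_sample, repetitions=1_000, seed=4)

print("median of the original sample:", statistics.median(original_sample))
print("smallest resampled median:    ", min(medians))
print("largest resampled median:     ", max(medians))
```

The spread of the resampled medians gives a rough picture of how much the estimate would vary from one random sample to another, without collecting any new data.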