PyPackage01 - Pandas02: Summarization and Frequency Counts
Methods for single-variable frequency counts and multi-variable grouped statistics.
1. count, unique, and nunique
import pandas as pd
test_data = pd.DataFrame({
'x1': ["a", "b", "c", "b"],
"x2": [1, 2, 3, 4],
"x3": [4, 3, 2, 1]
})
test_data
|   | x1 | x2 | x3 |
|---|----|----|----|
| 0 | a | 1 | 4 |
| 1 | b | 2 | 3 |
| 2 | c | 3 | 2 |
| 3 | b | 4 | 1 |
1.1 Count non-null values: count
test_data.x1.count()
4
1.2 Count distinct values: nunique
test_data.x1.nunique()
3
1.3 List distinct values: unique
test_data.x1.unique()
array(['a', 'b', 'c'], dtype=object)
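The three methods differ mainly in how they treat missing values, which the sample data does not contain. The sketch below uses a hypothetical Series `s` (not part of the original data) with one missing entry to show the difference.

```python
import pandas as pd

# Hypothetical column with one missing value (not from the original data).
s = pd.Series(["a", None, "b", "b"])

n_non_null = s.count()                        # counts non-missing entries only
n_distinct = s.nunique()                      # distinct values, missing excluded by default
n_distinct_with_na = s.nunique(dropna=False)  # counts the missing value as its own category

print(n_non_null, n_distinct, n_distinct_with_na)
```

Note that `unique()` has no `dropna` option, so a missing value shows up in its result.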
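1.4 Count occurrences of a single value
A Python list can count occurrences of a value directly with count(value). A DataFrame column cannot, because Series.count() takes no value argument, so the column has to be converted first.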
list(test_data.x1).count('b')
2
sum(test_data.x1.apply(lambda x: 1 if x=='b' else 0))
2
test_data.x1.apply(lambda x: 1 if x=='b' else 0).sum()
2
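The three variants above all work, but pandas can do the same thing without a Python-level lambda. The following is a minimal sketch of two common alternatives, a boolean comparison and `value_counts()`, using the same `test_data` as above.

```python
import pandas as pd

test_data = pd.DataFrame({
    "x1": ["a", "b", "c", "b"],
    "x2": [1, 2, 3, 4],
    "x3": [4, 3, 2, 1],
})

# Comparing a column to a value gives a boolean Series; True counts as 1 when summed.
b_count = int((test_data["x1"] == "b").sum())

# value_counts() builds the frequency table for every value in one call.
freq = test_data["x1"].value_counts()

print(b_count)
print(freq["b"])
```

`value_counts()` is usually the better choice when the frequency of more than one value is needed, since it scans the column only once.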
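2. Grouped statistics: groupby
groupby has one quirk: after grouping, the group labels become the index (row labels) of the result. Pass as_index=False to keep them as regular columns instead.
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html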
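To make the index behaviour concrete, here is a small sketch comparing the default result, `as_index=False`, and the `reset_index()` route. The variable names `indexed`, `flat`, and `same` are only illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "c"],
    "x2": [1, 1, 1, 2, 2],
    "x3": [1, 2, 3, 4, 5],
})

indexed = x.groupby("x1").sum()               # x1 becomes the index
flat = x.groupby("x1", as_index=False).sum()  # x1 stays a regular column
same = x.groupby("x1").sum().reset_index()    # same shape, obtained after the fact

print(list(indexed.columns))
print(list(flat.columns))
```

Keeping the labels as columns tends to be more convenient when the result is merged with other tables or written to a file.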
import pandas as pd
x = pd.DataFrame({
"x1": ["a", "a", "b", "b", 'c'],
"x2": [1, 1, 1, 2, 2],
"x3": [1, 2, 3, 4, 5]
})
x
|   | x1 | x2 | x3 |
|---|----|----|----|
| 0 | a | 1 | 1 |
| 1 | a | 1 | 2 |
| 2 | b | 1 | 3 |
| 3 | b | 2 | 4 |
| 4 | c | 2 | 5 |
2.1 Grouped row counts, like SQL count(*)
x.groupby(by='x1').count()
| x1 | x2 | x3 |
|----|----|----|
| a | 2 | 2 |
| b | 2 | 2 |
| c | 1 | 1 |
x.groupby(by=['x1', 'x2'], as_index=False).count()
|   | x1 | x2 | x3 |
|---|----|----|----|
| 0 | a | 1 | 2 |
| 1 | b | 1 | 1 |
| 2 | b | 2 | 1 |
| 3 | c | 2 | 1 |
x.groupby(by='x1').size()
x1
a 2
b 2
c 1
dtype: int64
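On the sample data, count() and size() return the same numbers, but they are not interchangeable: count() skips missing values column by column, while size() counts rows, which is closer to SQL count(*). The sketch below uses a hypothetical frame `df` with one NaN (not part of the original data) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing x3 value in group "a".
df = pd.DataFrame({
    "x1": ["a", "a", "b"],
    "x3": [1.0, np.nan, 3.0],
})

counts = df.groupby("x1")["x3"].count()  # non-missing x3 values per group
sizes = df.groupby("x1").size()          # number of rows per group

print(counts["a"], sizes["a"])
```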
2.2 Grouped distinct counts, like SQL count(distinct col1)
x.groupby(by='x1').nunique()
| x1 (index) | x1 | x2 | x3 |
|----|----|----|----|
| a | 1 | 1 | 2 |
| b | 1 | 2 | 2 |
| c | 1 | 1 | 1 |
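The result above also reports a distinct count for the grouping column x1 itself, which is always 1. Whether that column appears may depend on the pandas version in use. A sketch that sidesteps the question is to select the columns of interest before calling nunique(), using the same `x` as above.

```python
import pandas as pd

x = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "c"],
    "x2": [1, 1, 1, 2, 2],
    "x3": [1, 2, 3, 4, 5],
})

# Distinct x2 values per x1 group, returned as a Series.
x2_distinct = x.groupby("x1")["x2"].nunique()

# Distinct counts for several chosen columns, returned as a DataFrame.
both_distinct = x.groupby("x1")[["x2", "x3"]].nunique()

print(x2_distinct.tolist())
print(both_distinct["x3"].tolist())
```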
2.4 Other aggregate functions
x.groupby(by=["x1",'x2']).mean()
| x1 | x2 | x3 |
|----|----|----|
| a | 1 | 1.5 |
| b | 1 | 3.0 |
|   | 2 | 4.0 |
| c | 2 | 5.0 |
x.groupby(by=["x1",'x2']).sum()
| x1 | x2 | x3 |
|----|----|----|
| a | 1 | 3 |
| b | 1 | 3 |
|   | 2 | 4 |
| c | 2 | 5 |
x.groupby(by=["x1",'x2'], as_index=False).aggregate("sum")
|   | x1 | x2 | x3 |
|---|----|----|----|
| 0 | a | 1 | 3 |
| 1 | b | 1 | 3 |
| 2 | b | 2 | 4 |
| 3 | c | 2 | 5 |
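When several statistics are needed at once, calling mean(), sum(), and so on separately gets repetitive. One option is named aggregation through agg(), sketched below. The output column names `x3_sum`, `x3_mean`, and `x3_count` are illustrative choices, not names from the original post.

```python
import pandas as pd

x = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "c"],
    "x2": [1, 1, 1, 2, 2],
    "x3": [1, 2, 3, 4, 5],
})

# Each keyword is an output column name; each tuple is (input column, function).
summary = x.groupby(["x1", "x2"], as_index=False).agg(
    x3_sum=("x3", "sum"),
    x3_mean=("x3", "mean"),
    x3_count=("x3", "count"),
)

print(summary)
```

Passing the function as a string such as "sum" lets pandas use its own implementation for that aggregation.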
2.5 Overall descriptive statistics: describe
x.groupby(by=["x1",'x2'], as_index=True).describe()
| x1 | x2 | x3 count | x3 mean | x3 std | x3 min | x3 25% | x3 50% | x3 75% | x3 max |
|----|----|----|----|----|----|----|----|----|----|
| a | 1 | 2.0 | 1.5 | 0.707107 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 |
| b | 1 | 1.0 | 3.0 | NaN | 3.0 | 3.00 | 3.0 | 3.00 | 3.0 |
|   | 2 | 1.0 | 4.0 | NaN | 4.0 | 4.00 | 4.0 | 4.00 | 4.0 |
| c | 2 | 1.0 | 5.0 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 |
x.groupby(by=["x1",'x2'], as_index=False).describe()
|   | x2 count | x2 mean | x2 std | x2 min | x2 25% | x2 50% | x2 75% | x2 max | x3 count | x3 mean | x3 std | x3 min | x3 25% | x3 50% | x3 75% | x3 max |
|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.5 | 0.707107 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 |
| 1 | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | NaN | 3.0 | 3.00 | 3.0 | 3.00 | 3.0 |
| 2 | 1.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 4.0 | NaN | 4.0 | 4.00 | 4.0 | 4.00 | 4.0 |
| 3 | 1.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 5.0 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 |
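With as_index=False, the output shown above describes the grouping key x2 as if it were data and no longer shows the x1 labels. A sketch of one way to get a flat table that keeps both keys is to select the value column first and then call reset_index(). The variable name `stats` is illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    "x1": ["a", "a", "b", "b", "c"],
    "x2": [1, 1, 1, 2, 2],
    "x3": [1, 2, 3, 4, 5],
})

# Describe only x3 within each (x1, x2) group, then turn the group keys back into columns.
stats = x.groupby(["x1", "x2"])["x3"].describe().reset_index()

print(list(stats.columns))
```

The std column is NaN for groups with a single row, because the sample standard deviation is undefined for one observation.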
2018-10-13, Zidong Entrepreneurship Park (紫东创业园), Qixia District, Nanjing